Prepare Leaflet number per leaf data from TRY for use

The TRY data on Leaflet number per leaf describe the number of leaflets in compound leaves.

If you intend to clean more than one or two traits, we recommend the use of the batch pre-processing script. Refer to the TRY main page for details.

If you have questions, suggestions, spot errors, or want to contribute, get in touch with us through planthub@idiv.de.

Author: David Schellenberger Costa

Requirements

To run the script, the following is needed:

  • TRY data, available here

  • the data.table library may need to be installed
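
Whether data.table is already available can be checked before running the script. The following is a minimal sketch; the variable name `hasDataTable` is only illustrative and not part of the original script.

```r
# check whether data.table can be loaded, without attaching it
hasDataTable <- requireNamespace("data.table", quietly = TRUE)

# if it is missing, point to the usual installation command
if (!hasDataTable) {
  message("data.table is not installed; run install.packages(\"data.table\") first")
}
```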

Code

# load in libraries
library(data.table) # handle large datasets

# clear workspace
rm(list = ls())

Let’s get the TRY data.

# set working directory (adapt this!)
setwd(paste0(.brd, "PlantHub"))

# read in data (adapt this!)
TRY <- fread("TRY_PlantHub.gz")

# select data of interest
TRYSubset <- TRY[TraitName == "Leaflet number per leaf"]

To get an overview of the data, we tabulate the original values and display the counts in sorted order.

# extract original data string
oriVals <- TRYSubset$OrigValueStr # oriVals == original values

# get an overview of the data by counting each distinct value and showing the counts from rarest to most frequent
valueOverview <- table(oriVals)
valueOverview[order(valueOverview)]
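
To illustrate what this overview looks like, here is a small sketch on made-up values (the `toy*` names and strings are hypothetical, not taken from TRY). `table()` counts each distinct string and lists them alphabetically, while indexing with `order()` of the counts arranges them by frequency.

```r
# hypothetical toy values standing in for TRYSubset$OrigValueStr
toyVals <- c("3", "5-7", "3", "2 pairs", "3", "5-7")

# table() counts each distinct string; names come out in alphabetical order
toyOverview <- table(toyVals)

# order() on the counts sorts by frequency instead, rarest first
toyByFreq <- toyOverview[order(toyOverview)]
print(toyByFreq)
```

Rare values at the top of such a listing are often the ones that need special handling during cleaning.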

We need to clean the data so that only single numbers or intervals remain; intervals will be averaged later. English month abbreviations are likely artefacts of automatic conversion in Excel, so we convert them back to month numbers. An invisible control character is replaced by a hyphen. We remove numbers in parentheses and trailing years, the latter also a likely result of automatic conversion in Excel. Additionally, URLs, the word “foli(ol)ate” and some remaining special characters are removed. We record the occurrence of the word “pairs”, as for these entries the number of leaflets needs to be doubled. Then, we remove the remaining lowercase letters from the data.

# repair data with English months abbreviations, probably due to automatic conversion in Excel
for (i in seq_along(month.abb)) oriVals <- gsub(month.abb[i], i, oriVals)

# replace an invisible control character (byte 0x96, the en dash in Windows-1252) by "-"
oriVals <- sub("\x96", "\x2d", oriVals)

# remove numbers in parentheses
oriVals <- sub("\\(\\d+\\)", "", oriVals)

# remove trailing years starting with 20, apparently introduced by automatic conversion in Excel
oriVals <- sub("/20\\d{2}$", "", oriVals)

# set entries that are URLs to NA
oriVals[grepl("^http", oriVals)] <- NA

# remove scientific term used to describe numbers of leaflets
oriVals <- sub("foli(ol)?ate", "", oriVals)

# replace remaining confounding characters by whitespace
oriVals <- gsub("\\-|\\(|\\)|/|x|\\?", " ", oriVals)

# identify numbers that refer to leaflet pairs, not to the leaflet number
# this will be corrected later
pairs <- grepl("pairs", oriVals)

# remove alphabetical characters
oriVals <- gsub("[a-z]", "", oriVals)
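
The effect of these steps is easiest to see on a few examples. The sketch below applies the same substitutions, in the same order, to made-up strings (the `toy*` names and values are hypothetical). The byte replacement for `\x96` is left out here because its behaviour depends on the encoding of the input.

```r
# hypothetical raw entries covering the cases described above
toyVals <- c("5-7", "3-May", "4(6)", "5/7/2010", "3 pairs",
             "3-foliolate", "http://example.org/leaf")

# month abbreviations back to month numbers ("3-May" becomes "3-5")
for (i in seq_along(month.abb)) toyVals <- gsub(month.abb[i], i, toyVals)

# drop a number in parentheses and a trailing year
toyVals <- sub("\\(\\d+\\)", "", toyVals)
toyVals <- sub("/20\\d{2}$", "", toyVals)

# URLs carry no usable number
toyVals[grepl("^http", toyVals)] <- NA

# drop "foliate" / "foliolate", then turn separators into whitespace
toyVals <- sub("foli(ol)?ate", "", toyVals)
toyVals <- gsub("\\-|\\(|\\)|/|x|\\?", " ", toyVals)

# remember which entries counted pairs before the letters disappear
toyPairs <- grepl("pairs", toyVals)

# remove lowercase letters
toyVals <- gsub("[a-z]", "", toyVals)
print(toyVals)
```

Note that the pairs flag has to be recorded before the letters are stripped; afterwards "3 pairs" and "3-foliolate" are indistinguishable.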

We are now ready to split the remaining strings into single numbers and calculate their means. We also account for the occurrence of the word “pairs” by doubling the values of those entries. Finally, we round the data to one decimal place and store them in a new column called “CleanedValueStr”.

# split remaining strings into numbers and calculate means
oriVals <- strsplit(oriVals, " ")
oriVals <- sapply(oriVals, function(x) mean(as.numeric(x), na.rm = TRUE))
oriVals[is.na(oriVals) | oriVals == 0] <- NA # homogenize 0 (as this is erroneous data) and NaN to NA

# double numbers for pairs
oriVals[pairs] <- 2 * oriVals[pairs]

# round numbers as means introduce decimals
oriVals <- round(oriVals, 1)

# integrate into TRY
TRY[TraitName == "Leaflet number per leaf", CleanedValueStr := oriVals]
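
As a worked example of the splitting and averaging, the sketch below runs the same operations on made-up cleaned strings (the `toy*` names and values are hypothetical). It assumes the third entry was originally flagged as a pairs count.

```r
# hypothetical cleaned strings, as they might look after letter removal
toyClean <- c("5 7", "2 3", "3 ", "", NA)

# flags recorded before letters were stripped; the third entry was a pairs count
toyPairs <- c(FALSE, FALSE, TRUE, FALSE, FALSE)

# split on whitespace and average the numbers in each entry
toyParts <- strsplit(toyClean, " ")
toyMeans <- sapply(toyParts, function(x) mean(as.numeric(x), na.rm = TRUE))

# entries without any number yield NaN; treat them (and zeros) as NA
toyMeans[is.na(toyMeans) | toyMeans == 0] <- NA

# double the pairs counts, then round to one decimal place
toyMeans[toyPairs] <- 2 * toyMeans[toyPairs]
toyMeans <- round(toyMeans, 1)
print(toyMeans)
```

The interval "5 7" becomes 6, "2 3" becomes 2.5, the pairs entry "3 " becomes 6, and the empty and missing entries end up as NA.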

Although not necessary, we may change the trait name.

# add classification into whole plant trait or plant part trait to trait name
TRY[TraitName == "Leaflet number per leaf", TraitName := "Leaf leaflet number per leaf"]

Let’s write the data to a file.

fwrite(TRY, file = paste0("TRY_processed_", Sys.Date(), ".gz"))