Prepare leaf shape data from TRY for use

The leaf shape data from TRY describe the leaf morphology of a plant. Leaf morphology is often an important characteristic for identifying a plant family, genus, or species. In TRY, leaf shape is a container trait encompassing numeric measures (length and area ratios) as well as categorical traits describing different parts of the leaf. This script therefore splits the original trait into several separate traits.

If you intend to clean more than one or two traits, we recommend the use of the batch pre-processing script. Refer to the TRY main page for details.

If you have questions, suggestions, spot errors, or want to contribute, get in touch with us through planthub@idiv.de.

Author: David Schellenberger Costa

Requirements

To run the script, the following is needed:

  • TRY data, available here

  • the data.table library may need to be installed

Code

# load in libraries
library(data.table) # handle large datasets

# clear workspace
rm(list = ls())

Let’s get the TRY data

# set working directory (adapt this!)
setwd(paste0(.brd, "PlantHub"))

# read in data (adapt this!)
TRY <- fread("TRY_PlantHub.gz")

# select data of interest
TRYSubset <- TRY[TraitName == "Leaf shape"]

The leaf shape data from TRY is a container for a large number of different traits, both numeric and categorical ones. We will start processing the categorical traits and work on the numeric traits thereafter.

To get an overview of the data, we convert values to lowercase, sort them, and show them as a table.

# extract original data strings
oriVals <- TRYSubset$OrigValueStr # oriVals == original values

# change all to lowercase to ease later classification
oriVals <- tolower(oriVals)

# get an overview of the data by tabulating the values and showing them ordered by frequency
valueOverview <- table(oriVals)
valueOverview[order(valueOverview)]

Some data are coded as “yes”, with the actual value being written in the “OriglName” column. We will convert this data to the actual values.

# show leaf shape characteristics where presence of those was indicated with "yes" (e.g. Driptip --> yes)
table(TRYSubset$OriglName[TRYSubset$OrigValueStr == "yes"])
oriVals[!is.na(oriVals) & oriVals == "yes"] <- TRYSubset$OriglName[!is.na(oriVals) & oriVals == "yes"]
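
As a minimal illustration of this recoding step, the sketch below applies the same logic to a made-up three-row example. The vectors toyValues and toyNames are hypothetical and not taken from TRY.

```r
# hypothetical toy data, not taken from TRY
toyValues <- c("yes", "ovate", "yes")
toyNames <- c("Driptip", "Leaf shape", "Terminal notch")

# replace each "yes" with the trait name stored alongside it
isYes <- !is.na(toyValues) & toyValues == "yes"
toyValues[isYes] <- toyNames[isYes]
print(toyValues)
```

Entries that already carry a real value, such as "ovate", are left unchanged.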

Next, we remove purely numeric values, including values in scientific notation that contain “e-”.

oriVals[grepl("^\\d?(\\d|\\.|e-)*\\d$", oriVals)] <- NA
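
To see which strings this pattern treats as purely numeric, here is a small check on hand-picked example strings (hypothetical values, chosen only for illustration).

```r
# hypothetical example strings
toyValues <- c("1.5", "3e-04", "5", "ovate", "1-3", ">3", "2 cm")

# same pattern as in the script: digits, dots, and "e-" only, ending in a digit
numericPattern <- "^\\d?(\\d|\\.|e-)*\\d$"
isNumeric <- grepl(numericPattern, toyValues)
print(data.frame(value = toyValues, purelyNumeric = isNumeric))
```

Ranges such as "1-3" and values with units or comparison signs are not matched, so they stay available for the later processing steps.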

Even the categorical leaf shape data needs to be divided into six different traits.

The most important part of the cleaning process is the definition of the search strings to look for. We use regular expressions in some cases to be more inclusive (or exclusive).

The column OriglName in TRY gives the original names of the traits as they were used by the authors of the individual datasets. We will use this column to decide which values belong to which trait, because this is not always clear from the values alone. As an example, “round” or “rounded” may refer to the leaf tip as well as to the outline of the entire leaf. The origlNames variable we will now create contains, for each trait, a regular expression that matches the admissible OriglName values.
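
The following sketch shows the idea on a toy example: the same value "rounded" is kept for the leaf base only where the original trait name points to the base. The vectors toyValues and toyOriglNames are hypothetical; the pattern is the one used for the leaf base in this script.

```r
# hypothetical toy data
toyValues <- c("rounded", "rounded", "acute")
toyOriglNames <- c("Leaf base", "Leaf shape", "Leaf tip")

# pattern of admissible original names for the leaf base trait
basePattern <- "base|leaf shape: 5\\."

# keep only values whose original name matches the pattern
baseValues <- toyValues
baseValues[!grepl(basePattern, toyOriglNames, ignore.case = TRUE)] <- NA
print(baseValues)
```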

# create a list of vectors containing the search strings to look for to classify the different
# categories and one for the new names
# additionally, create a vector for the OriglName column, and one to store the new trait values temporarily together
# with the six new trait names
searchNames <- list()
searchResultsCols <- list()
origlNames <- list()
newValsAll <- rep(NA, length(oriVals))
newTraitNamesAll <- rep("", length(oriVals))

# leaf tip
searchNames[[1]] <- c(
	"pointed|acute|acuminate|^(leaf )?drip-?tip|aristate|attenuate",
	"obtuse",
	"small tip|mucron(ul)?ate",
	"emarginate|retuse|terminal notch",
	"truncate"
)
searchResultsCols[[1]] <- c("acute", "obtuse", "mucronate", "retuse", "truncate")
origlNames[1] <- "tip|leaf shape: 3\\.|terminal notch"

# leaf margin
searchNames[[2]] <- c(
	"entire",
	"toothed|dentate",
	"serr(ul)?ate|runcinate",
	"crenate",
	"sinuate"
)
searchResultsCols[[2]] <- c("entire", "dentate", "serrate", "crenate", "sinuate")
origlNames[2] <- "leaf shape|leaf shape: 2\\."

# leaf base
searchNames[[3]] <- c(
	"rounded",
	"truncate",
	"sagg?itt?ate|hastate",
	"cuneate|decurrent|attenuate"
)
searchResultsCols[[3]] <- c("rounded", "truncate", "hastate", "cuneate")
origlNames[3] <- "base|leaf shape: 5\\."

# leaf joint
searchNames[[4]] <- c(
	"petiolate",
	"^sessile",
	"subsessile",
	"clasping",
	"sheathing",
	"connate"
)
searchResultsCols[[4]] <- c("petiolate", "sessile", "subsessile", "clasping", "sheathing", "connate")
origlNames[4] <- "leaf shape: 6\\."

# leaf compoundness
searchNames[[5]] <- c(
	"simple|full",
	"lobed|lobate",
	"(bi)?pinnate|pinnatifid|digitate|ti?riff?id|bifid",
	"palmate",
	"flabellate|flavate"
)
searchResultsCols[[5]] <- c("simple", "lobate", "compound,pinnate", "compound,palmate", "flabellate")
origlNames[5] <- "leaf - form|leaf shape|leaf shape: 2\\.|leafshape|shape"

# leaf shape (of entire leaf)
searchNames[[6]] <- c(
	"(^|\\W)ovate|oval|ovoid",
	"elliptic|curved|broad|1-3 ?(times|x) as long as (broad|wide)",
	"oblong|straight|parallel-sided|long-leaf|narrowed|lingua?late|>?3 ?(times|x) as long as (wide|broad)",
	"(^|\\W)cor?date",
	"orbicular|length = width|round",
	"grass-like|l?inear",
	"(ob)?lanceolate",
	"obovate",
	"rhomboid",
	"scale-like",
	"needle",
	"spath?ulate",
	"deltoid",
	"obcordate|cuneate",
	"reniform"
)
searchResultsCols[[6]] <- c(
	"ovate", "elliptic", "oblong", "cordate", "orbicular", "linear",
	"lanceolate", "obovate", "rhomboid", "scale", "needle", "spathulate",
	"deltoid", "obcordate", "reniform"
)
origlNames[6] <- "elliptic|leaf - form|leaf shape|leaf shape 4\\.|leaf_shape|leafshape|linear|oblong|obovate|ovate|shape"

To make the workflow concise, we will process them with a loop.

  • we search for the strings defined before and give names to the new categories

  • we have a look at the results

  • as these categories should be exclusive, we exclude all ambiguous data by setting our search results to FALSE whenever we found more than one match in our search

  • we create new strings with the cleaned values and add them to the observations. To not remove the original entries, we will create a new column called “CleanedValueStr”.

  • we rename the traits according to our names defined before. The last trait will keep the original name “Leaf shape”. To avoid double processing, we will temporarily rename the trait to “Leaf shapeTemp”.

for (i in seq_along(searchNames)) {
	# remove entries not belonging to the correct OriglName
	tempVals <- oriVals
	tempVals[!(grepl(origlNames[[i]], TRYSubset$OriglName, ignore.case = TRUE))] <- NA

	# search for the strings defined before
	searchResults <- sapply(searchNames[[i]], grepl, tempVals, ignore.case = TRUE)

	# name columns of searchResults matrix like categories
	colnames(searchResults) <- searchResultsCols[[i]]

	# show the number of matches to each category
	print(colSums(searchResults))

	# remove ambiguous entries
	searchResults[rowSums(searchResults) > 1, ] <- FALSE

	# use the searchResults matrix to create new value strings by concatenating all data found
	newVals <- sapply(seq_len(nrow(searchResults)), function(x) {
		paste(colnames(searchResults)[searchResults[x, ]], collapse = ",")
	})
	newVals[newVals == ""] <- NA

	newValsAll[!is.na(newVals)] <- newVals[!is.na(newVals)]
	newTraitNamesAll[!is.na(newVals)] <- paste("Leaf", c("tip", "margin", "base", "joint", "compoundness", "shapeTemp")[i])
}
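
To make the handling of ambiguous entries concrete, here is a reduced sketch of one loop iteration on toy data. The input vector is hypothetical; the three patterns are a subset of the leaf shape search strings defined above.

```r
# hypothetical toy values; the second one matches two categories
toyValues <- c("ovate", "ovate to lanceolate", "linear", NA)
toyPatterns <- c("(^|\\W)ovate|oval|ovoid", "(ob)?lanceolate", "grass-like|l?inear")

# one logical column per category
hits <- sapply(toyPatterns, grepl, toyValues, ignore.case = TRUE)
colnames(hits) <- c("ovate", "lanceolate", "linear")

# entries matching more than one category are ambiguous and are dropped
hits[rowSums(hits) > 1, ] <- FALSE

# build the cleaned value for each row
cleaned <- apply(hits, 1, function(r) paste(colnames(hits)[r], collapse = ","))
cleaned[cleaned == ""] <- NA
print(cleaned)
```

The entry "ovate to lanceolate" ends up as NA because it matches two categories, while missing input stays NA because grepl() returns FALSE for NA.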

# integrate data into TRY
TRY[TraitName == "Leaf shape", CleanedValueStr := newValsAll]
TRY[TraitName == "Leaf shape", OrigUnitStr := NA]
TRY[TraitName == "Leaf shape", TraitName := newTraitNamesAll]

We will now process the numerical data stored as leaf shape.

To get an overview of the data, we convert values to lowercase, sort them, and show them as a table.

# extract original data strings
oriVals <- TRYSubset$OrigValueStr # oriVals == original values

# change all to lowercase to ease later classification
oriVals <- tolower(oriVals)

# get an overview of the data by tabulating the values and showing them ordered by frequency
valueOverview <- table(oriVals)
valueOverview[order(valueOverview)]

We remove purely categorical values by setting every entry without a digit to NA. Additionally, we remove the number -999 used to code NA, reduce ranges such as “1-3” to their upper bound, and strip greater-than signs (“>”). We then convert the values to numeric so that they can later be converted into standardized units. Finally, we remove values of zero or below, as those are not expected here.

# remove purely categorical values
oriVals[!grepl("\\d", oriVals)] <- NA

# remove -999 (used to code NA)
oriVals <- sub("^-999$", NA, oriVals)
# reduce ranges to their upper bound; "e-04" is kept because "e", not a digit, precedes the hyphen
oriVals <- gsub(".*\\d-", "", oriVals)
# strip greater-than signs
oriVals <- gsub(">", "", oriVals)

# convert strings to numeric values
oriVals <- as.numeric(oriVals)

# remove entries of zero or below
oriVals[!is.na(oriVals) & oriVals <= 0] <- NA
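
The effect of these cleaning steps can be traced on a handful of made-up strings. The input vector below is hypothetical; the operations mirror the ones used in the script.

```r
# hypothetical raw strings covering the cases handled above
toyRaw <- c("ovate", "-999", "1-3", ">2.5", "3e-04", "0", "1.8")
toyVals <- tolower(toyRaw)

toyVals[!grepl("\\d", toyVals)] <- NA    # drop purely categorical entries
toyVals <- sub("^-999$", NA, toyVals)    # drop the NA code
toyVals <- gsub(".*\\d-", "", toyVals)   # keep the upper bound of ranges
toyVals <- gsub(">", "", toyVals)        # strip greater-than signs
toyVals <- as.numeric(toyVals)
toyVals[!is.na(toyVals) & toyVals <= 0] <- NA # drop zero and negative values
print(toyVals)
```

Note that the range "1-3" is reduced to 3, whereas "3e-04" is read as a regular number in scientific notation.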

Let’s get an overview of different measures available by checking data names belonging to measurements with units given.

# thirteen measurement ways to consider
table(TRYSubset$OrigUnitStr[TRYSubset$OrigUnitStr != ""], TRYSubset$OriglName[TRYSubset$OrigUnitStr != ""])

We see that data with the entry “formC” (=form coefficient) likely has the wrong unit. Several definitions for the leaf form coefficient exist, of which the last best matches the value range.

  • 4 * w * area / perimeter^2 (with w == constant)

  • area / perimeter

  • perimeter / area

Therefore, we correct the “OrigUnitStr” column to cm/cm2.

TRYSubset[OriglName == "formC", OrigUnitStr := "cm/cm2"]

The numeric leaf shape data needs to be divided into different traits, which we identify by their units. Again, we store the data temporarily in the newValsAll variable, the trait names in the newTraitNamesAll variable, and the units in newUnitsAll. We re-initialize these variables rather than reuse their contents from the categorical step, because some records yield both a numeric and a categorical value (via the leaf length/width ratio).

newValsAll <- newUnitsAll <- rep(NA, length(oriVals))
newTraitNamesAll <- rep("", length(oriVals))

cm/cm: leaf length/width (or inverse!)

length/width

newVals <- oriVals
newVals[TRYSubset$OrigUnitStr != "cm/cm"] <- NA
table(TRYSubset$OriglName[!is.na(newVals)]) # aspectR: width/length, LS: length/width
invVals <- rep(NA, length(unique(TRYSubset$OriglName[!is.na(newVals)])))

invert some values from width/length to length/width

invVals[unique(TRYSubset$OriglName[!is.na(newVals)]) %in% c("Leaf length: width", "LS", "leaf length to width ratio", "LengthWidth")] <- 0
invVals[unique(TRYSubset$OriglName[!is.na(newVals)]) %in% c("AspectRatio(Width/Length)", "aspectR")] <- 1
for (i in seq_along(unique(TRYSubset$OriglName[!is.na(newVals)]))) {
	# isTRUE() skips names listed in neither vector above (invVals is NA for those)
	if (isTRUE(invVals[i] > 0)) {
		newVals[TRYSubset$OriglName == unique(TRYSubset$OriglName[!is.na(newVals)])[i]] <-
			1 / newVals[TRYSubset$OriglName == unique(TRYSubset$OriglName[!is.na(newVals)])[i]]
	}
}

# write data into template
newValsAll[!is.na(newVals)] <- newVals[!is.na(newVals)]
newTraitNamesAll[!is.na(newVals)] <- "Leaf length/width"
newUnitsAll[!is.na(newVals)] <- "cm/cm"
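
The inversion can also be written without a loop. The following is a minimal vectorised sketch on hypothetical toy data, assuming the same two original names mark width/length ratios; it is an alternative formulation for illustration, not the code used above.

```r
# hypothetical toy ratios and their original trait names
toyRatios <- c(2.0, 0.5, 4.0, 0.25)
toyOriglNames <- c("LS", "aspectR", "LengthWidth", "AspectRatio(Width/Length)")

# names that denote width/length and therefore need to be inverted
widthLengthNames <- c("AspectRatio(Width/Length)", "aspectR")
invert <- toyOriglNames %in% widthLengthNames
toyRatios[invert] <- 1 / toyRatios[invert]
print(toyRatios)
```

After the inversion, all four values express length/width.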

Using the length/width ratio, we can add some data to the categorical leaf shape data. We store it under the variable name “Leaf shapeTemp” to avoid processing the data twice. Later, this data will be renamed to “Leaf shape”.

# classify the length/width ratio:
# (0.9, 1.1] orbicular, (1.1, 3] elliptic, (3, 6] oblong, > 6 linear
ratioShape <- as.character(cut(
	newVals,
	breaks = c(0.9, 1.1, 3, 6, Inf),
	labels = c("orbicular", "elliptic", "oblong", "linear")
))

# newVals is aligned with the rows of TRYSubset, not with the "Leaf shapeTemp" rows of TRY,
# so the derived categories are appended to TRY as additional "Leaf shapeTemp" records
shapeRows <- TRYSubset[!is.na(ratioShape)]
shapeRows[, CleanedValueStr := ratioShape[!is.na(ratioShape)]]
shapeRows[, OrigUnitStr := NA_character_]
shapeRows[, TraitName := "Leaf shapeTemp"]
TRY <- rbind(TRY, shapeRows, fill = TRUE)
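
The ratio classes can be checked on a few hand-picked values, including the class boundaries. This is a sketch that assumes the intended classes are (0.9, 1.1] for orbicular, (1.1, 3] for elliptic, (3, 6] for oblong, and above 6 for linear; toyRatios is hypothetical.

```r
# hypothetical length/width ratios, including boundary values
toyRatios <- c(0.5, 1.0, 1.1, 3, 6, 8)

# cut() uses right-closed intervals by default, so 1.1, 3, and 6 fall into the lower class
toyShapes <- as.character(cut(
	toyRatios,
	breaks = c(0.9, 1.1, 3, 6, Inf),
	labels = c("orbicular", "elliptic", "oblong", "linear")
))
print(data.frame(ratio = toyRatios, shape = toyShapes))
```

Ratios of 0.9 or below receive no category under these assumptions.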

cm2/cm2: leaf form index (area/length^2) OR leaf perimeter^2/leaf area

area/length^2

newVals <- oriVals
newVals[TRYSubset$OrigUnitStr != "cm2/cm2"] <- NA
table(TRYSubset$OriglName[!is.na(newVals)])
newVals[TRYSubset$OriglName != "Leaf Form index"] <- NA
# invert area/length^2 to length^2/area
newVals <- 1 / newVals

# write data into template
newValsAll[!is.na(newVals)] <- newVals[!is.na(newVals)]
newTraitNamesAll[!is.na(newVals)] <- "Leaf (length^2)/area"
newUnitsAll[!is.na(newVals)] <- "cm^2/cm^2"

perimeter^2/leaf area

newVals <- oriVals
newVals[TRYSubset$OrigUnitStr != "cm2/cm2"] <- NA
# these values are already perimeter^2/area, so no inversion is needed
newVals[TRYSubset$OriglName != "Leaf perimeter^2: area"] <- NA

# write data into template
newValsAll[!is.na(newVals)] <- newVals[!is.na(newVals)]
newTraitNamesAll[!is.na(newVals)] <- "Leaf (perimeter^2)/area"
newUnitsAll[!is.na(newVals)] <- "cm^2/cm^2"

cm/cm2: perimeter/area and form coefficient OR leaf fractal dimension

perimeter/area

newVals <- oriVals
newVals[TRYSubset$OrigUnitStr != "cm/cm2"] <- NA
table(TRYSubset$OriglName[!is.na(newVals)])
newVals[!(TRYSubset$OriglName %in% c("Leaf perimeter: area", "formC", "FormCoefficient"))] <- NA

# write data into template
newValsAll[!is.na(newVals)] <- newVals[!is.na(newVals)]
newTraitNamesAll[!is.na(newVals)] <- "Leaf perimeter/area"
newUnitsAll[!is.na(newVals)] <- "cm/cm^2"

fractal dimension

newVals <- oriVals
newVals[TRYSubset$OrigUnitStr != "cm/cm2"] <- NA
newVals[TRYSubset$OriglName != "Leaf_fractal_dimension_(cm/cm2)"] <- NA

# write data into template
newValsAll[!is.na(newVals)] <- newVals[!is.na(newVals)]
newTraitNamesAll[!is.na(newVals)] <- "Leaf fractal dimension"
newUnitsAll[!is.na(newVals)] <- ""

cm2/cm: area/perimeter

area/perimeter

newVals <- oriVals
newVals[TRYSubset$OrigUnitStr != "cm2/cm"] <- NA
table(TRYSubset$OriglName[!is.na(newVals)])
newVals <- 1 / newVals

# write data into template
newValsAll[!is.na(newVals)] <- newVals[!is.na(newVals)]
newTraitNamesAll[!is.na(newVals)] <- "Leaf perimeter/area"
newUnitsAll[!is.na(newVals)] <- "cm/cm^2"
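
To get a feeling for the different indices and their units, the sketch below computes them for an idealized rectangular leaf of 5 cm x 2 cm. The rectangle and all variable names are assumptions made for illustration only; real leaves will of course deviate from this shape.

```r
# hypothetical rectangular "leaf": 5 cm long, 2 cm wide
leafLength <- 5
leafWidth <- 2
leafArea <- leafLength * leafWidth            # cm^2
leafPerimeter <- 2 * (leafLength + leafWidth) # cm

shapeIndices <- c(
	lengthWidth = leafLength / leafWidth,      # cm/cm
	length2Area = leafLength^2 / leafArea,     # cm^2/cm^2
	perimeter2Area = leafPerimeter^2 / leafArea, # cm^2/cm^2
	perimeterArea = leafPerimeter / leafArea   # cm/cm^2
)
print(shapeIndices)

# a value reported as area/perimeter (cm^2/cm) is inverted to perimeter/area
areaPerimeter <- leafArea / leafPerimeter
print(1 / areaPerimeter)
```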

After processing all data, we append the numeric records to TRY and rename “Leaf shapeTemp” back to “Leaf shape”.

# integrate data into TRY
TRY <- rbind(TRY, TRYSubset, fill = TRUE)
TRY[TraitName == "Leaf shape", CleanedValueStr := as.character(newValsAll)] # column holds strings
TRY[TraitName == "Leaf shape", OrigUnitStr := newUnitsAll]
TRY[TraitName == "Leaf shape", TraitName := newTraitNamesAll]

# rename Leaf shapeTemp
TRY[TraitName == "Leaf shapeTemp", TraitName := "Leaf shape"]

Let’s write the data to a file.

fwrite(TRY, file = paste0("TRY_processed_", Sys.Date(), ".gz"))