Prepare Leaf distribution along the shoot axis (arrangement type) data from TRY for use

The Leaf distribution along the shoot axis (arrangement type) data from TRY describes how a plant's leaves are arranged along the stem. Leaves may be rosulate, semi-rosulate, opposite, whorled, alternate, or erosulate, or they may form a leaf crown.

If you intend to clean more than one or two traits, we recommend the use of the batch pre-processing script. Refer to the TRY main page for details.

If you have questions, suggestions, spot errors, or want to contribute, get in touch with us through planthub@idiv.de.

Author: David Schellenberger Costa

Requirements

To run the script, the following is needed:

  • TRY data, available here

  • the data.table library may need to be installed (reading the gzipped file directly with fread additionally requires the R.utils package)

Code

# load in libraries
library(data.table) # handle large datasets

# clear workspace
rm(list = ls())

Let’s load the TRY data.

# set working directory (adapt this!)
setwd(paste0(.brd, "PlantHub"))

# read in data (adapt this!)
TRY <- fread("TRY_PlantHub.gz")

# select data of interest
TRYSubset <- TRY[TraitName == "Leaf distribution along the shoot axis (arrangement type)"]

To get an overview of the data, we convert values to lowercase, sort them, and show them as a table.

# extract original data strings
oriVals <- TRYSubset$OrigValueStr # oriVals == original values

# change all to lowercase to ease later classification
oriVals <- tolower(oriVals)

# get an overview of the data by tabulating the values and showing them ordered by increasing frequency
valueOverview <- table(oriVals)
valueOverview[order(valueOverview)]
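
As a small illustration of what the overview shows, the sketch below applies the same steps to a handful of made-up strings (toyVals is hypothetical and not taken from TRY). Note that table() already returns its entries in alphabetical order, while indexing with order() on the counts re-sorts them by frequency.

```r
# hypothetical example values, not taken from TRY
toyVals <- tolower(c("Opposite", "alternate", "opposite", "Whorled", "opposite", "alternate"))

# table() counts each distinct value and sorts the entries alphabetically
toyOverview <- table(toyVals)
alphabetical <- names(toyOverview)

# ordering by the counts puts the rarest values first
byFrequency <- names(toyOverview[order(toyOverview)])

alphabetical
byFrequency
```

Rare values at the top of the frequency-ordered table are often typos or unusual wordings, which is where the search strings tend to need the most attention.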

The most important part of the cleaning process is the definition of the search strings to look for. We use regular expressions in some cases to be more inclusive (or exclusive).

searchNames <- c(
	"^(always)? ?(rosette|rosulate)",
	"(h|s)emi-?(rosette|rosulate)",
	"opposite|bundled in 2s",
	"whorled|spirally|bundled in 5s",
	"alternate",
	"erosulate|no rosette|leaves distributed regularly along the stem",
	"tuft|crown"
)
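
To see why the first pattern is anchored with ^, the following sketch runs the two rosette-related patterns against a few made-up strings (toyVals is hypothetical). Without the anchor, "semi-rosulate", "erosulate" and "no rosette" would all count as rosulate, because each contains "rosulate" or "rosette".

```r
# hypothetical example strings, not taken from TRY
toyVals <- c("rosette", "always rosette", "semi-rosulate", "hemirosette", "erosulate", "no rosette")

# the two patterns copied from searchNames
rosulatePattern <- "^(always)? ?(rosette|rosulate)"
semiPattern <- "(h|s)emi-?(rosette|rosulate)"

# the ^ anchor restricts matches to strings that start with (always) rosette/rosulate
rosulateHits <- grepl(rosulatePattern, toyVals)
semiHits <- grepl(semiPattern, toyVals)

data.frame(toyVals, rosulateHits, semiHits)
```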

We can now search for the strings defined before and give names to the new categories.

# search for the strings defined before
searchResults <- sapply(searchNames, grepl, oriVals)

# name the columns of the searchResults matrix after the cleaned categories
colnames(searchResults) <- c("rosulate", "semi-rosulate", "opposite", "whorled", "alternate", "erosulate", "leaf crown")
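
The structure of the search result may be easier to picture with a toy example. In the sketch below (toyVals and toyPatterns are made up for illustration), sapply() calls grepl() once per pattern and binds the logical vectors into a matrix with one row per value and one column per pattern.

```r
# hypothetical example values and a shortened pattern list
toyVals <- c("opposite", "alternate", "leaves in a tuft", "unknown")
toyPatterns <- c("opposite|bundled in 2s", "alternate", "tuft|crown")

# one grepl() call per pattern; the results become the columns of a logical matrix
toyResults <- sapply(toyPatterns, grepl, toyVals)

# replace the pattern strings used as column names with readable category names
colnames(toyResults) <- c("opposite", "alternate", "leaf crown")

toyResults
dim(toyResults) # rows = values, columns = categories
```

Keep in mind that sapply() only returns a matrix when there is more than one value to search; with a single value it simplifies to a vector.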

Let’s have a look at the results.

# show the number of matches to each category
colSums(searchResults)

# show the original entries for which no match was retrieved
oriVals[rowSums(searchResults) < 1]

# show the number of entries that weren't matched to any category
sum(rowSums(searchResults) < 1)

# show the number of entries that were matched to more than one category
sum(rowSums(searchResults) > 1)
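
When many entries remain unmatched, listing every one of them can be overwhelming. One possible way to prioritise, sketched here on made-up data (toyVals and toyResults are hypothetical), is to tabulate the unmatched strings by frequency and to list the distinct strings that hit several categories.

```r
# hypothetical example values and a reduced set of categories
toyVals <- c("opposite", "alternate or opposite", "crown", "n.a.")
toyResults <- cbind(
  opposite = grepl("opposite", toyVals),
  alternate = grepl("alternate", toyVals),
  "leaf crown" = grepl("tuft|crown", toyVals)
)

# number of categories matched by each entry
nMatches <- rowSums(toyResults)

# unmatched strings, most frequent first: candidates for new search strings
unmatched <- sort(table(toyVals[nMatches < 1]), decreasing = TRUE)

# distinct strings matched to more than one category: check that this is intended
ambiguous <- unique(toyVals[nMatches > 1])

unmatched
ambiguous
```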

Some categories are sub-categories of others, and this needs to be taken into account.

searchResults[searchResults[, 3] == TRUE, 6] <- TRUE # opposite is erosulate
searchResults[searchResults[, 4] == TRUE, 6] <- TRUE # whorled is erosulate
searchResults[searchResults[, 5] == TRUE, 6] <- TRUE # alternate is erosulate
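
The same propagation can also be written with column names instead of numeric indices, which may be more robust if the order of the categories changes. This is a variation on the code above rather than part of the original script, shown on a made-up matrix (toyResults and subCategories are hypothetical names).

```r
# hypothetical result matrix with three entries and three categories
toyResults <- matrix(
  c(TRUE,  FALSE, FALSE,
    FALSE, FALSE, FALSE,
    FALSE, TRUE,  FALSE),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("opposite", "alternate", "erosulate"))
)

# categories that imply "erosulate"
subCategories <- c("opposite", "alternate")

# an entry is erosulate if it already was, or if it matched any sub-category
impliesErosulate <- rowSums(toyResults[, subCategories, drop = FALSE]) > 0
toyResults[, "erosulate"] <- toyResults[, "erosulate"] | impliesErosulate

toyResults
```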

Now, we can create new strings with the cleaned values and add them to the observations. To keep the original entries, we will create a new column called “CleanedValueStr”.

newVals <- sapply(seq_len(nrow(searchResults)), function(x) {
	paste(colnames(searchResults)[searchResults[x, ]], collapse = ",")
})
newVals[newVals == ""] <- NA
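
To illustrate what the collapsed strings look like, here is a minimal sketch on a made-up two-row matrix (toyResults is hypothetical). It uses apply() over rows, which is equivalent to the sapply() loop above for a logical matrix: matched category names are joined with commas, and entries without any match become NA.

```r
# hypothetical result matrix: the first entry matched two categories, the second none
toyResults <- matrix(
  c(TRUE,  FALSE, TRUE,
    FALSE, FALSE, FALSE),
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, c("opposite", "alternate", "erosulate"))
)

# join the names of all matched categories per row
toyNew <- apply(toyResults, 1, function(row) {
  paste(colnames(toyResults)[row], collapse = ",")
})

# rows without a match yield an empty string, which is turned into NA
toyNew[toyNew == ""] <- NA

toyNew
```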

# integrate into TRY
TRY[TraitName == "Leaf distribution along the shoot axis (arrangement type)", CleanedValueStr := newVals]
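
The assignment relies on newVals being in the same row order as the selected subset of TRY. The sketch below demonstrates this on a made-up table (toyTRY, its trait names, and toyNew are hypothetical) and adds a length check before assigning; rows belonging to other traits receive NA in the new column.

```r
library(data.table)

# hypothetical miniature version of the TRY table
toyTRY <- data.table(
  TraitName = c("trait A", "trait B", "trait A"),
  OrigValueStr = c("Opposite", "12.5", "alternate")
)

# cleaned values for the two "trait A" rows, in their original row order
toyNew <- c("opposite,erosulate", "alternate,erosulate")

# guard against a length mismatch before assigning by reference
stopifnot(toyTRY[TraitName == "trait A", .N] == length(toyNew))

# := adds the column in place; rows not selected are filled with NA
toyTRY[TraitName == "trait A", CleanedValueStr := toyNew]

toyTRY
```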

Let’s write the data to a file.

fwrite(TRY, file = paste0("TRY_processed_", Sys.Date(), ".gz"))