Prepare Plant life form data from TRY for use

The plant life form data from TRY describe the Raunkiaer life form of a plant. Raunkiaer life forms classify plants by the position of their resting (overwintering) buds during seasons with adverse conditions (mostly drought and cold).

If you intend to clean more than one or two traits, we recommend the use of the batch pre-processing script. Refer to the TRY main page for details.

If you have questions or suggestions, spot errors, or want to contribute, get in touch with us at planthub@idiv.de.

Author: David Schellenberger Costa

Requirements

To run the script, the following is needed:

  • TRY data, available here

  • the data.table library may need to be installed
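
One way to make sure the package is present before loading it is sketched below. The repository URL is an assumption and can be replaced by any CRAN mirror; the variable name pkgAvailable is only used for illustration.

```r
# install data.table only if it cannot be loaded yet
# (the repos URL is an assumption; use the mirror you prefer)
if (!requireNamespace("data.table", quietly = TRUE)) {
	install.packages("data.table", repos = "https://cloud.r-project.org")
}

# TRUE if the package can now be loaded
pkgAvailable <- requireNamespace("data.table", quietly = TRUE)
pkgAvailable
```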

Code

# load in libraries
library(data.table) # handle large datasets

# clear workspace
rm(list = ls())

Let’s get the TRY data.

# set working directory (adapt this!)
setwd(paste0(.brd, "PlantHub"))

# read in data (adapt this!)
TRY <- fread("TRY_PlantHub.gz")

# select data of interest
TRYSubset <- TRY[TraitName == "Plant life form (Raunkiaer life form)"]

To get an overview of the data, we convert values to lowercase, sort them, and show them as a table.

# extract original data strings
oriVals <- TRYSubset$OrigValueStr # oriVals == original values

# change all to lowercase to ease later classification
oriVals <- tolower(oriVals)

# get an overview of the data by counting each value and showing the counts in order of increasing frequency
valueOverview <- table(oriVals)
valueOverview[order(valueOverview)]
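
The effect of the two steps can be tried on a small example. The sketch below uses made-up entries (toyVals is a hypothetical stand-in for the original value strings) and shows that table() lists values alphabetically, while order() on the counts puts the rarest entries first.

```r
# hypothetical raw entries, standing in for TRYSubset$OrigValueStr
toyVals <- c("Geophyte", "geophyte", "Therophyte", "T", "geophyte", "t")

# lowercase first so that spelling variants collapse into one entry
toyVals <- tolower(toyVals)

# table() returns counts with names in alphabetical order
toyOverview <- table(toyVals)
toyOverview

# reorder by increasing count to see the rarest entries first
toyOverview[order(toyOverview)]
```

Rare entries are often typos or special codes, so listing them first makes it easier to spot values that need extra handling.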

There are some coded entries we need to take care of. The values “yes” and “true” (originally “TRUE”, lowercased above) do not name a life form themselves; their meaning depends on the “DataName” column. We inspect these entries for each value of “DataName” and replace them with the category given there. We then remove purely numeric values and remove the hyphen in “hemi-epiphytic” to ease the later matching process.

# repair coded entries (values are lowercase at this point, so "TRUE" has become "true")
datNames <- names(table(TRYSubset[oriVals == "true"]$DataName))
for (i in seq_along(datNames)) {
	print(datNames[i])
	print(TRYSubset[DataName == datNames[i] & oriVals == "true"][1:2])
	print("--------------------------------------")
}
for (i in seq_along(datNames)) {
	# the category is the part of the data name after the last ": ", lowercased for later matching
	newVal <- tolower(sub(".*: ", "", datNames[i]))
	oriVals[TRYSubset$DataName == datNames[i] & oriVals == "true"] <- newVal
	oriVals[TRYSubset$DataName == datNames[i] & oriVals == "yes"] <- newVal
}

# remove purely numeric values (entries without any lowercase letter)
oriVals[!grepl("[[:lower:]]", oriVals)] <- NA

# repair hemi-epiphyte matching problem
oriVals <- gsub("emi-epiph", "emiepiph", oriVals)
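
The repair steps can be illustrated on a few made-up entries. In this sketch, toyVals and toyDataNames are hypothetical, and the "Life form: Therophyte" format of the data names is an assumption chosen for illustration, not actual TRY content.

```r
# hypothetical values and their data names (not actual TRY content)
toyVals <- c("true", "yes", "geophyte", "12", "hemi-epiphyte")
toyDataNames <- c(
	"Life form: Therophyte", "Life form: Therophyte",
	"Life form", "Life form", "Life form"
)

# entries that only flag a category named in the data name
isCoded <- toyVals %in% c("true", "yes")

# keep what follows the last ": " and lowercase it for later matching
toyVals[isCoded] <- tolower(sub(".*: ", "", toyDataNames[isCoded]))

# drop entries without any lowercase letter, then remove the hyphen
toyVals[!grepl("[[:lower:]]", toyVals)] <- NA
toyVals <- gsub("emi-epiph", "emiepiph", toyVals)
toyVals
```

The coded entries now carry the category from their data name, the numeric entry is missing, and the hyphenated spelling has been normalized.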

The entries apparently mix several trait categories. We prepare a matrix with six columns to store entries belonging to different traits separately: Raunkiaer life form, epiphytism, parasitism, lianas, succulence, and ferns.

newVals <- matrix(NA, length(oriVals), 6)

The most important part of the cleaning process is the definition of the search strings to look for. We use regular expressions in some cases to be more inclusive (or exclusive).

# get strings with "phyte" ending
unique(tolower(unlist(regmatches(oriVals, gregexpr("\\w+phyte", oriVals)))))

# create a vector containing the search strings to look for
searchNames <- c(
	"chama?ephyte|(^|\\W)cha?(\\W|$)",
	"geophyte|(^|\\W)g(\\W|$)",
	"hemicryptophyte",
	"(?<!pseudo)phanerophyte?|shrub|tree|(^|\\W)ph(\\W|$)",
	"th?erophyte|annual|(^|\\W)th?(\\W|$)",
	"hydrophyte",
	"pseudophanerophyte",
	"(?<!hemi)cryptophyte", # lookbehind keeps "hemicryptophyte" out of this category
	"(?<!emi)epiph(y|i)tes?|(^|\\W)ep(\\W|$)",
	"(h|s)emi-?epiph(y|i)tes?",
	"helophyte",
	"parasit",
	"lian(a|e)",
	"succulent",
	"fern"
)
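
Two regular expression techniques recur in these search strings: a negative lookbehind such as (?<!emi), which rejects a match preceded by a given prefix, and the construct (^|\\W)…(\\W|$), which matches an abbreviation only when it stands alone. The sketch below tests both on made-up entries (toyVals is hypothetical). Lookbehinds require perl = TRUE.

```r
# hypothetical entries to test the patterns against
toyVals <- c("epiphyte", "hemiepiphyte", "ep", "creeper", "g", "geophyte")

# negative lookbehind: "epiphyte" only when not preceded by "emi";
# the second alternative matches the stand-alone abbreviation "ep"
epiPattern <- "(?<!emi)epiph(y|i)tes?|(^|\\W)ep(\\W|$)"
isEpiphyte <- grepl(epiPattern, toyVals, perl = TRUE)

# a stand-alone "g" must be delimited by non-word characters or the string ends
geoPattern <- "geophyte|(^|\\W)g(\\W|$)"
isGeophyte <- grepl(geoPattern, toyVals, perl = TRUE)

data.frame(toyVals, isEpiphyte, isGeophyte)
```

Here "hemiepiphyte" and "creeper" are not flagged as epiphytes, although both contain the letters "ep".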

We can now search for the strings defined before and give names to the new categories.

# search for the strings defined before
searchResults <- sapply(searchNames, grepl, oriVals, perl = TRUE)

# name the columns of the searchResults matrix after the cleaned category names
colnames(searchResults) <- c(
	"chamaephyte", "geophyte", "hemicryptophyte", "phanerophyte",
	"therophyte", "hydrophyte", "pseudophanerophyte", "cryptophyte", "epiphyte", "hemiepiphyte",
	"helophyte", "parasitic", "liana", "succulent", "pteridophyte"
)
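
To see the shape of the result, the same call can be run on a toy example. The entries and patterns below are hypothetical. sapply() applies grepl() once per pattern and binds the results into a logical matrix with one row per entry and one column per pattern.

```r
# hypothetical entries and a reduced set of search strings
toyVals <- c("geophyte", "tree, liana", NA)
toyPatterns <- c("geophyte", "tree", "lian(a|e)")

# one column per pattern, one row per entry; missing entries never match
toyResults <- sapply(toyPatterns, grepl, toyVals, perl = TRUE)

# replace the pattern strings used as column names by readable category names
colnames(toyResults) <- c("geophyte", "phanerophyte", "liana")
toyResults
```

The second entry matches two categories at once, which is why the results are kept as a matrix rather than as a single label per entry.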

Let’s have a look at the results.

# show the number of matches to each category
colSums(searchResults)

# show the number of entries that weren't matched to any category
sum(rowSums(searchResults) < 1)

# show the number of entries that were matched to more than one category
sum(rowSums(searchResults) > 1)
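
Beyond counting, it can help to look at the unmatched strings themselves, as they may point to missing search strings. A minimal sketch on made-up data (all names prefixed with toy are hypothetical):

```r
# hypothetical entries and a reduced, named set of search strings
toyVals <- c("geophyte", "tree, liana", "unknown form", NA)
toyPatterns <- c(geophyte = "geophyte", phanerophyte = "tree", liana = "lian(a|e)")
toyResults <- sapply(toyPatterns, grepl, toyVals, perl = TRUE)

# tabulate the unmatched strings; table() skips missing values by default
unmatched <- rowSums(toyResults) < 1
table(toyVals[unmatched])

# list the strings matched to more than one category
multiMatched <- toyVals[rowSums(toyResults) > 1]
multiMatched
```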

Geophytes, hydrophytes, and helophytes are sub-categories of “cryptophyte”, so we flag them as cryptophytes as well. If an entry matches both epiphyte and hemiepiphyte, we only report epiphyte.

searchResults[searchResults[, 2] == TRUE, 8] <- TRUE # geo is crypto
searchResults[searchResults[, 6] == TRUE, 8] <- TRUE # hydro is crypto
searchResults[searchResults[, 11] == TRUE, 8] <- TRUE # helo is crypto
searchResults[searchResults[, 9] == TRUE, 10] <- FALSE # if epiphyte and hemiepiphyte just state epiphyte
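
The same adjustments can be written with column names instead of column numbers, which is a variant that may be easier to read and does not depend on the order of the categories. A sketch on a hypothetical two-row matrix:

```r
# hypothetical search results for two entries
toyResults <- matrix(
	c(TRUE, FALSE, FALSE, FALSE,
	  FALSE, FALSE, TRUE, TRUE),
	nrow = 2, byrow = TRUE,
	dimnames = list(NULL, c("geophyte", "cryptophyte", "epiphyte", "hemiepiphyte"))
)

# a geophyte is also a cryptophyte
toyResults[toyResults[, "geophyte"], "cryptophyte"] <- TRUE

# if both epiphyte and hemiepiphyte matched, keep epiphyte only
toyResults[toyResults[, "epiphyte"], "hemiepiphyte"] <- FALSE
toyResults
```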

Now, we can create new strings with the cleaned values and add them to the observations. To keep the original entries, we create a new column called “CleanedValueStr”. Entries relating to different traits are stored separately.

# use the searchResults matrix to create new value strings by concatenating all data found
newVals[, 1] <- sapply(seq_len(nrow(searchResults)), function(x) {
	paste(colnames(searchResults)[c(1:8, 11)][searchResults[x, c(1:8, 11)]], collapse = ",")
})
newVals[, 2] <- sapply(seq_len(nrow(searchResults)), function(x) {
	paste(colnames(searchResults)[c(9, 10)][searchResults[x, c(9, 10)]], collapse = ",")
})
newVals[, 3] <- sapply(seq_len(nrow(searchResults)), function(x) {
	paste(colnames(searchResults)[12][searchResults[x, 12]], collapse = ",")
})
newVals[, 4] <- sapply(seq_len(nrow(searchResults)), function(x) {
	paste(colnames(searchResults)[13][searchResults[x, 13]], collapse = ",")
})
newVals[, 5] <- sapply(seq_len(nrow(searchResults)), function(x) {
	paste(colnames(searchResults)[14][searchResults[x, 14]], collapse = ",")
})
newVals[, 6] <- sapply(seq_len(nrow(searchResults)), function(x) {
	paste(colnames(searchResults)[15][searchResults[x, 15]], collapse = ",")
})
newVals[newVals == ""] <- NA
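
How a row of logical matches turns into a cleaned string can be followed on a small example. The sketch below uses apply() over rows as an alternative to the index-based sapply() calls; the matrix and the names toyResults, lifeFormCols, and toyLifeForm are hypothetical.

```r
# hypothetical search results for three entries
toyResults <- matrix(
	c(TRUE, TRUE, FALSE,
	  FALSE, FALSE, TRUE,
	  FALSE, FALSE, FALSE),
	nrow = 3, byrow = TRUE,
	dimnames = list(NULL, c("geophyte", "cryptophyte", "liana"))
)

# columns that belong to the life form trait
lifeFormCols <- c("geophyte", "cryptophyte")

# per row, concatenate the names of all matched life form categories
toyLifeForm <- apply(toyResults[, lifeFormCols, drop = FALSE], 1, function(x) {
	paste(lifeFormCols[x], collapse = ",")
})

# rows without any match yield an empty string, which we set to missing
toyLifeForm[toyLifeForm == ""] <- NA
toyLifeForm
```

The second entry is a liana only, so it receives no life form value and would be stored under the growth form trait instead.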

# move values to other traits
# each pass stores one column of newVals in the life form rows (same row order
# as TRYSubset), retags these rows with a "goto" trait name, and appends a
# fresh copy of the original rows for the next pass

# epiphytism
TRY[TraitName == "Plant life form (Raunkiaer life form)", CleanedValueStr := newVals[, 2]]
TRY[TraitName == "Plant life form (Raunkiaer life form)", TraitName := "gotoPlant epiphytism"]
TRY <- rbind(TRY, TRYSubset, fill = TRUE)

# parasitism
TRY[TraitName == "Plant life form (Raunkiaer life form)", CleanedValueStr := newVals[, 3]]
TRY[TraitName == "Plant life form (Raunkiaer life form)", TraitName := "gotoPlant parasitism"]
TRY[TraitName == "gotoPlant parasitism" & CleanedValueStr == "parasitic", CleanedValueStr := "TRUE"]
TRY <- rbind(TRY, TRYSubset, fill = TRUE)

# lianas
TRY[TraitName == "Plant life form (Raunkiaer life form)", CleanedValueStr := newVals[, 4]]
TRY[TraitName == "Plant life form (Raunkiaer life form)", TraitName := "gotoPlant growth form 1"]
TRY <- rbind(TRY, TRYSubset, fill = TRUE)

# succulence
TRY[TraitName == "Plant life form (Raunkiaer life form)", CleanedValueStr := newVals[, 5]]
TRY[TraitName == "Plant life form (Raunkiaer life form)", TraitName := "gotoPlant succulence"]
TRY[TraitName == "gotoPlant succulence" & CleanedValueStr == "succulent", CleanedValueStr := "TRUE"]
TRY <- rbind(TRY, TRYSubset, fill = TRUE)

# ferns
TRY[TraitName == "Plant life form (Raunkiaer life form)", CleanedValueStr := newVals[, 6]]
TRY[TraitName == "Plant life form (Raunkiaer life form)", TraitName := "gotoPlant growth form 2"]

# integrate into TRY
TRY <- rbind(TRY, TRYSubset, fill = TRUE)
TRY[TraitName == "Plant life form (Raunkiaer life form)", CleanedValueStr := newVals[, 1]]

Although not necessary, we may change the trait name.

TRY[TraitName == "Plant life form (Raunkiaer life form)", TraitName := "Plant Raunkiaer life form"]

Because we duplicated the data to accommodate the entries belonging to other traits, we now remove the duplicated rows that have no value in the “CleanedValueStr” column. This avoids an unnecessary increase in file size.

TRY <- TRY[!grepl("^goto", TraitName) | !is.na(CleanedValueStr)]
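
The retag, append, and filter sequence can be traced on a miniature table. This is a sketch with hypothetical data: toy stands in for TRY, toySubset for TRYSubset, and toyNew for newVals, reduced to two columns (life form and epiphytism) and a shortened trait name.

```r
library(data.table)

# hypothetical miniature of the TRY table
toy <- data.table(
	TraitName = c("life form", "life form", "other trait"),
	OrigValueStr = c("geophyte", "epiphyte", "5")
)

# untouched copy of the rows of interest
toySubset <- toy[TraitName == "life form"]

# cleaned values, columns: life form, epiphytism
toyNew <- cbind(c("geophyte,cryptophyte", NA), c(NA, "epiphyte"))

# first pass: store the epiphytism values and retag the rows
toy[TraitName == "life form", CleanedValueStr := toyNew[, 2]]
toy[TraitName == "life form", TraitName := "gotoPlant epiphytism"]

# second pass: append the untouched copy and store the life form values
toy <- rbind(toy, toySubset, fill = TRUE)
toy[TraitName == "life form", CleanedValueStr := toyNew[, 1]]

# drop retagged rows that carry no value
toy <- toy[!grepl("^goto", TraitName) | !is.na(CleanedValueStr)]
toy
```

The approach relies on the rows of the cleaned values being in the same order as the selected rows of the table, which holds here because both stem from the same subset.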

We have prefixed existing trait names with “goto” to classify some data. The prefix lets us eventually move the data to the respective trait while keeping them out of another round of pre-processing. Therefore, only run the following line if this is the last of the pre-processing scripts you want to use.

TRY[, TraitName := sub("^goto", "", TraitName)]
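
The effect of removing the prefix can be checked on a few trait names. In this sketch, toyTraits is a hypothetical vector; the anchor ^ ensures that only a leading “goto” is removed.

```r
# hypothetical trait names, two of them carrying the "goto" prefix
toyTraits <- c("gotoPlant epiphytism", "Plant Raunkiaer life form", "gotoPlant succulence")

# strip the prefix only where it starts the name
cleanedTraits <- sub("^goto", "", toyTraits)
cleanedTraits
```

After this step, the moved entries share their trait name with the data already stored under that trait.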

Let’s write the data to a file.

fwrite(TRY, file = paste0("TRY_processed_", Sys.Date(), ".gz"))