Prepare Plant life span (longevity) data from TRY for use

The Plant life span (longevity) data from TRY describes how long a plant lives. It contains both numeric and ordinal/categorical values: numeric values give the life span in years, while categorical values classify plants as annual, biennial, or perennial.

If you intend to clean more than one or two traits, we recommend the use of the batch pre-processing script. Refer to the TRY main page for details.

If you have questions, suggestions, spot errors, or want to contribute, get in touch with us through planthub@idiv.de.

Author: David Schellenberger Costa

Requirements

To run the script, the following is needed:

TRY data, available here
the data.table package, which may need to be installed first

Code

# load in libraries
library(data.table) # handle large datasets

# clear workspace
rm(list = ls())

Let’s get the TRY data

# set working directory (adapt this!)
setwd(paste0(.brd, "PlantHub"))

# read in data (adapt this!)
TRY <- fread("TRY_PlantHub.gz")

# select data of interest
TRYSubset <- TRY[TraitName == "Plant lifespan (longevity)"]

The plant lifespan data from TRY is a container for numeric (years) and categorical data. We will process the categorical data first and the numeric data afterwards. We will prepare a matrix with two columns to store the entries belonging to these two types separately.

To get an overview of the data, we convert values to lowercase, sort them, and show them as a table.

# extract original data strings
oriVals <- TRYSubset$OrigValueStr # oriVals == original values

# change all to lowercase to ease later classification
oriVals <- tolower(oriVals)

# get an overview of the data by tabulating values and showing them ordered by frequency
# ignore entries with units, which are likely numeric
valueOverview <- table(oriVals[TRYSubset$OrigUnitStr == ""])
valueOverview[order(valueOverview)]

# prepare matrix to save new values in
newVals <- matrix(NA, length(oriVals), 2)

There are some coded entries we need to take care of. The value “yes” means different things depending on the entry. We check the “OriglName” column for each of these cases and replace “yes” with the category named there. We will then remove some special characters and convert the data to numeric. Finally, we will use the numeric data to augment the categorical data, as the numbers of years fall into the categories “annual”, “biennial”, and “perennial”.

# repair coded entries
datNames <- names(table(TRYSubset[oriVals == "yes"]$OriglName))
for (i in seq_along(datNames)) {
	print(datNames[i])
	print(TRYSubset[OriglName == datNames[i]][1:2])
	print("--------------------------------------")
}
searchNames <- c("perennial", "annual", "biennial")
for (i in seq_along(searchNames)) {
	oriVals[grepl(searchNames[i], TRYSubset$OriglName, ignore.case = TRUE) & oriVals == "yes"] <- searchNames[i]
}
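
The recoding loop can be tried on a few made-up entries. This is a minimal sketch in base R. The vectors toyVals and toyNames are hypothetical stand-ins for oriVals and the “OriglName” column and do not come from TRY.

```r
# made-up example entries, not taken from TRY
toyVals <- c("yes", "yes", "annual", "yes", "3")
toyNames <- c("Perennial", "life form: annual", "lifespan", "Biennial plant", "age")

# replace "yes" with the category mentioned in the accompanying name
for (category in c("perennial", "annual", "biennial")) {
	hit <- grepl(category, toyNames, ignore.case = TRUE) & toyVals == "yes"
	toyVals[hit] <- category
}

toyVals
```

Only entries that are exactly “yes” and whose name mentions a category are changed. All other values stay as they were.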

We create a vector containing numerical data. This will allow us to augment our categorical data later on.

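# strip units, keep the upper bound of ranges, and drop greater-than signs
oriValsN <- gsub("years?", "", oriVals)
oriValsN <- gsub(".*-", "", oriValsN) # "2-5" becomes "5"
oriValsN <- gsub(">", "", oriValsN)
oriValsN <- gsub(".*,", "", oriValsN) # keep only what follows the last comma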

# convert strings into numeric values
oriValsN <- as.numeric(oriValsN)

# add information to categorical data
oriVals[oriValsN <= 1] <- "annual" # up to one year
oriVals[oriValsN > 1 & oriValsN <= 2] <- "biennial" # more than one and up to two years
oriVals[oriValsN > 2 & oriValsN <= 5] <- "pluriennial" # more than two and up to five years
oriVals[oriValsN > 5] <- "perennial" # more than five years
rm(oriValsN)

# remove purely numeric values and others that have no lowercase character included
oriVals[!grepl("[[:lower:]]", oriVals)] <- NA
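
To illustrate the numeric step, the sketch below runs the same substitutions on a few made-up strings and then bins the resulting years. It uses base R's cut() as a compact alternative to the four assignments in the script. That choice is an assumption of this example and not part of the original code, and toyVals is hypothetical.

```r
# made-up example strings, not taken from TRY
toyVals <- c("2-5 years", ">10", "1 year", "perennial", "0.5")

# same substitutions as in the script
toyNum <- gsub("years?", "", toyVals)
toyNum <- gsub(".*-", "", toyNum)
toyNum <- gsub(">", "", toyNum)
toyNum <- gsub(".*,", "", toyNum)

# non-numeric strings become NA; suppress the coercion warning
toyNum <- suppressWarnings(as.numeric(toyNum))

# bin years into right-closed intervals: (-Inf, 1], (1, 2], (2, 5], (5, Inf)
toyCat <- as.character(cut(toyNum,
	breaks = c(-Inf, 1, 2, 5, Inf),
	labels = c("annual", "biennial", "pluriennial", "perennial")
))

data.frame(toyVals, toyNum, toyCat)
```

The entry "perennial" has no numeric value, so toyCat is NA for it. In the script, such entries keep their original string because only positions with a numeric value are overwritten.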

The most important part of the cleaning process is the definition of the search strings to look for. We use regular expressions in some cases to be more inclusive (or exclusive).

# create a vector containing the search strings to look for
searchNames <- c(
	"(^| |\\W)per(e)?(nnial)?|long|woody|tree|shrub|hemicryptophyte|poly-?annual|pluri-?ennial",
	"(^| )an(n)?(ual)?|short|ephemeral",
	"b(i)?-?(asa|e)nnial|bia?s?a?nnual"
)

We can now search for the strings defined before and give names to the new categories.

# search for the strings defined before
searchResults <- sapply(searchNames, grepl, oriVals)

# name the columns of the searchResults matrix after the cleaned categories
colnames(searchResults) <- c("perennial", "annual", "biennial")

Let’s have a look at the results.

# show the number of matches to each category
colSums(searchResults)

# show the original entries for which no match was retrieved
oriVals[rowSums(searchResults) < 1]

# show the number of entries that were not matched to any category
sum(rowSums(searchResults) < 1)

# show the number of entries that were matched to more than one category
sum(rowSums(searchResults) > 1)

As these categories should be exclusive, we exclude all ambiguous data by setting our search results to FALSE whenever we found more than one match in our search.

# remove contradictory entries
searchResults[rowSums(searchResults) > 1, ] <- FALSE
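
The matching and the exclusion of ambiguous entries can be checked on a handful of made-up strings. The sketch repeats the search strings so that it runs on its own. The names toyVals, toyPatterns, toyResults, and toyCounts are hypothetical.

```r
# made-up example strings, not taken from TRY
toyVals <- c("perennial", "annual", "biennial", "annual/perennial", "unknown")

# search strings as defined in the script, named by target category
toyPatterns <- c(
	perennial = "(^| |\\W)per(e)?(nnial)?|long|woody|tree|shrub|hemicryptophyte|poly-?annual|pluri-?ennial",
	annual = "(^| )an(n)?(ual)?|short|ephemeral",
	biennial = "b(i)?-?(asa|e)nnial|bia?s?a?nnual"
)

# one logical column per category, one row per entry
toyResults <- sapply(toyPatterns, grepl, toyVals)

# number of categories matched per entry, before cleaning
toyCounts <- rowSums(toyResults)
toyCounts

# entries matching more than one category are treated as unclassified
toyResults[toyCounts > 1, ] <- FALSE
toyResults
```

Here "annual/perennial" matches two categories and is reset to FALSE in every column, while "unknown" matches none. Both would end up as NA in the cleaned values.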

Now, we can create new strings with the cleaned values. To keep the original entries, the cleaned values will later go into a new column called “CleanedValueStr”. For now, we store them in the first column of newVals.

# use the searchResults matrix to create new value strings by concatenating all data found
newVals[, 1] <- sapply(seq_len(nrow(searchResults)), function(x) {
	paste(colnames(searchResults)[searchResults[x, ]], collapse = ",")
})
newVals[, 1][newVals[, 1] == ""] <- NA

We will now process the numerical values.

To get an overview of the data, we convert values to lowercase, sort them, and show them as a table.

# extract original data strings
oriVals <- TRYSubset$OrigValueStr # oriVals == original values

# change all to lowercase to ease later classification
oriVals <- tolower(oriVals)

# get an overview of the data by tabulating values and showing them ordered by frequency
valueOverview <- table(oriVals[is.na(newVals[, 1])]) # ignore data classified as categorical
valueOverview[order(valueOverview)]

We remove ranges and greater-than signs and convert the data to numeric. For ranges, only the upper bound is kept. We could also calculate the mean of each range, but that does not seem worth the effort here.

oriVals <- gsub("years?", "", oriVals)
oriVals <- gsub(".*-", "", oriVals)
oriVals <- gsub(">", "", oriVals)
oriVals <- gsub(".*,", "", oriVals)

# convert strings into numeric values
oriVals <- as.numeric(oriVals)

We round the data to one decimal place, and to whole years when values are greater than 5.

oriVals <- round(oriVals, 1) # round years to one decimal place
# round to full years when more than five
oriVals[!is.na(oriVals) & oriVals > 5] <- round(oriVals[!is.na(oriVals) & oriVals > 5])
newVals[, 2] <- oriVals
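
The rounding rule is easy to check on a few made-up numbers. The function roundYears is a hypothetical helper written for this sketch and is not part of the original script.

```r
# hypothetical helper: one decimal place, whole years above five
roundYears <- function(x) {
	x <- round(x, 1)
	big <- !is.na(x) & x > 5 # NA-safe selection of values above five
	x[big] <- round(x[big])
	x
}

toyYears <- roundYears(c(0.26, 1.54, 4.96, 5.4, 12.7, NA))
toyYears
```

Note that 4.96 and 5.4 both end up as 5: the first through rounding to one decimal place, the second through rounding to whole years.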

Now let’s move the values into the new column “CleanedValueStr”. We will also change the trait names to distinguish between numeric and categorical data.

# integrate categorical data into TRY
TRY[TraitName == "Plant lifespan (longevity)", CleanedValueStr := newVals[, 1]]
TRY[TraitName == "Plant lifespan (longevity)", TraitName := "Plant lifespan (longevity) categories"]

# integrate numerical data into TRY (append TRYSubset to TRY to have one copy of categorical
# and one of numeric data, respectively)
TRY <- rbind(TRY, TRYSubset, fill = TRUE)
TRY[TraitName == "Plant lifespan (longevity)", CleanedValueStr := newVals[, 2]]
TRY[TraitName == "Plant lifespan (longevity)", OrigUnitStr := "years"]
TRY[TraitName == "Plant lifespan (longevity)", TraitName := "Plant lifespan (longevity) numeric"]
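
The effect of these steps on the table can be followed on a small made-up example. This is only a sketch under stated assumptions: toy, toySubset, catVals, and numVals are hypothetical, and “Leaf area” is just a placeholder for any other trait.

```r
library(data.table)

# made-up table with two lifespan records and one unrelated record
toy <- data.table(
	TraitName = c("Plant lifespan (longevity)", "Leaf area", "Plant lifespan (longevity)"),
	OrigValueStr = c("perennial", "12", "3")
)
toySubset <- toy[TraitName == "Plant lifespan (longevity)"]

# hypothetical cleaned values, in the row order of toySubset
catVals <- c("perennial", "perennial")
numVals <- c(NA, "3")

# categorical copy: fill the new column, then rename the trait
toy[TraitName == "Plant lifespan (longevity)", CleanedValueStr := catVals]
toy[TraitName == "Plant lifespan (longevity)", TraitName := "Plant lifespan (longevity) categories"]

# numeric copy: append the untouched subset, fill it, then rename the trait
toy <- rbind(toy, toySubset, fill = TRUE)
toy[TraitName == "Plant lifespan (longevity)", CleanedValueStr := numVals]
toy[TraitName == "Plant lifespan (longevity)", TraitName := "Plant lifespan (longevity) numeric"]

toy
```

Each lifespan record now appears twice, once under the categorical trait name and once under the numeric one, while the unrelated record is left unchanged.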

Let’s write the data to a file.

fwrite(TRY, file = paste0("TRY_processed_", Sys.Date(), ".gz"))
