Prepare Plant woodiness data from TRY for use
The Plant woodiness data from TRY informs on whether plants are woody or non-woody or if they have a woody base.
The TRY entries for this trait also erroneously include data on whether a plant is an epiphyte, a liana, a herb, or a graminoid. These data are moved to epiphytism (epiphytes and hemi-epiphytes), growth form 1 (lianas and herbs), and growth form 2 (graminoids).
If you intend to clean more than one or two traits, we recommend the use of the batch pre-processing script. Refer to the TRY main page for details.
If you have questions, suggestions, spot errors, or want to contribute, get in touch with us through planthub@idiv.de.
Author: David Schellenberger Costa
Requirements
To run the script, the following is needed:
TRY data, available here
the data.table library may need to be installed
Code
# load in libraries
library(data.table) # handle large datasets
# clear workspace
rm(list = ls())
Let’s get the TRY data.
# set working directory (adapt this!)
setwd(paste0(.brd, "PlantHub"))
# read in data (adapt this!)
TRY <- fread("TRY_PlantHub.gz")
# select data of interest
TRYSubset <- TRY[TraitName == "Plant woodiness"]
To get an overview of the data, we convert values to lowercase, sort them, and show them as a table.
# extract original data strings
oriVals <- TRYSubset$OrigValueStr # oriVals == original values
# change all to lowercase to ease later classification
oriVals <- tolower(oriVals)
# get an overview of the data by tabulating values and showing them in order of increasing frequency
valueOverview <- table(oriVals)
valueOverview[order(valueOverview)]
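To see what this overview looks like on a small scale, here is a minimal sketch with invented values (`toyVals` and `toyOverview` are hypothetical names, not part of the TRY data). It illustrates that `table()` sorts the categories alphabetically, while `order()` applied to the table re-sorts them by count.

```r
# invented values standing in for OrigValueStr (assumption, not TRY data)
toyVals <- tolower(c("Woody", "woody", "Herb", "woody", "H", "herb"))

# table() counts each distinct value; names come out in alphabetical order
toyOverview <- table(toyVals)

# order() on the table sorts by count, rarest category first
toyOverview[order(toyOverview)]

# put the most frequent category first instead
toyOverview[order(toyOverview, decreasing = TRUE)]
```

Sorting by frequency puts rare spellings and typos together at one end of the output, which makes them easier to spot.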
Apparently, several trait categories are mixed here. We will prepare a matrix with four columns to separately save the entries belonging to different plant functional traits: woodiness, epiphytism, growth form 1, and growth form 2.
newVals <- matrix(NA, length(oriVals), 4)
There are a couple of coded entries that need to be repaired. The coding is dataset-specific, so we use the dataset ID to do the decoding. We also remove the hyphen found in “hemi-epiphyte” to ease later matching. Finally, we remove purely numeric values.
# repair coded entries
datNames <- names(table(TRYSubset[oriVals == "1"]$DatasetID))
for (i in seq_along(datNames)) {
  print(datNames[i])
  print(TRYSubset[DatasetID == datNames[i]][1:2])
  print("--------------------------------------")
}
oriVals[TRYSubset$DatasetID %in% c("9", "86") & oriVals == "0"] <- "non-woody"
oriVals[TRYSubset$DatasetID %in% c("9", "86") & oriVals == "1"] <- "semi-woody"
oriVals[TRYSubset$DatasetID %in% c("9", "86") & oriVals == "2"] <- "woody"
oriVals[TRYSubset$DatasetID %in% c("9", "86") & oriVals == "3"] <- "woody"
oriVals[TRYSubset$DatasetID %in% c("73", "118", "153", "699") & oriVals == "0"] <- "non-woody"
oriVals[TRYSubset$DatasetID %in% c("73", "118", "153", "699") & oriVals == "1"] <- "woody"
# repair hemi-epiphyte matching problem
oriVals <- gsub("emi-epiph", "emiepiph", oriVals)
# remove purely numeric values and others that have no lowercase character included
oriVals[!grepl("[[:lower:]]", oriVals)] <- NA
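The repair steps can be tried out on a handful of invented entries. The following is only a sketch: `toyVals` and `toyDataset` are hypothetical vectors, and only some of the codes used above are reproduced.

```r
# invented entries and dataset IDs (assumption, not TRY data)
toyVals <- c("1", "0", "1", "woody", "12.5", "hemi-epiphyte")
toyDataset <- c("9", "9", "73", "73", "73", "9")

# the same code means different things depending on the dataset
toyVals[toyDataset == "9" & toyVals == "0"] <- "non-woody"
toyVals[toyDataset == "9" & toyVals == "1"] <- "semi-woody"
toyVals[toyDataset == "73" & toyVals == "1"] <- "woody"

# drop the hyphen so that "hemiepiphyte" can be told apart from "epiphyte" later
toyVals <- gsub("emi-epiph", "emiepiph", toyVals)

# anything without a lowercase letter (e.g. "12.5") carries no category
toyVals[!grepl("[[:lower:]]", toyVals)] <- NA

toyVals
# "semi-woody" "non-woody" "woody" "woody" NA "hemiepiphyte"
```

The order matters: the codes have to be decoded before the numeric filter runs, otherwise the coded entries would be set to NA as well.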
The most important part of the cleaning process is the definition of the search strings to look for. We use regular expressions in some cases to be more inclusive (or exclusive).
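Note the regular expression with the (?<!…) part. This is called a negative lookbehind. It makes sure that the pattern written in place of … does not occur directly before the sequence that follows. Here, it keeps “epiphyte” from matching inside “hemiepiphyte”. To use it, we need to set the option perl=TRUE in the grepl function later on.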
searchNames <- c(
  "(nono?-?woody|fibrous|^0$|false|^none?$|no(t|n) woody)",
  "woody? at base|woody rootstock|woody base|semi-woody|basal",
  "(?<!emi)epiph(y|i)tes?",
  "(h|s)emi-?epiph(y|i)tes?",
  "liana",
  "graminoid|grass",
  "^w$|true|^y$|suffrutex|^1$|(^|/)woody(/|$)|probably woody",
  "^h$|herb"
)
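The effect of the negative lookbehind is easiest to see next to the same pattern without it. This is a small sketch on invented strings (`toyVals` is a hypothetical vector).

```r
# invented strings (assumption, not TRY data)
toyVals <- c("epiphyte", "hemiepiphyte", "epiphytes", "semiepiphyte")

# with the lookbehind: "epiph..." only counts if it is not preceded by "emi"
withLookbehind <- grepl("(?<!emi)epiph(y|i)tes?", toyVals, perl = TRUE)
withLookbehind
# TRUE FALSE TRUE FALSE

# without the lookbehind: hemi- and semi-epiphytes match as well
withoutLookbehind <- grepl("epiph(y|i)tes?", toyVals)
withoutLookbehind
# TRUE TRUE TRUE TRUE
```

Lookbehind is a feature of Perl-compatible regular expressions; without `perl = TRUE`, `grepl` does not accept the `(?<!…)` construct.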
We can now search for the strings defined before and give names to the new categories.
# search for the strings defined before
searchResults <- sapply(searchNames, grepl, oriVals, perl = TRUE)
# name columns of searchResults matrix like woodiness categories
colnames(searchResults) <- c(
  "non-woody", "woody base", "epiphyte", "hemi-epiphyte",
  "liana", "graminoid", "woody", "herb"
)
Let’s have a look at the results.
# show the number of matches to each category
colSums(searchResults)
# show the original entries for which no match was retrieved
oriVals[rowSums(searchResults) < 1]
# show the number of entries that weren't matched to any category
sum(rowSums(searchResults) < 1)
# show the number of entries that were matched to more than one category
sum(rowSums(searchResults) > 1)
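As a compact illustration of how the match matrix and its row sums behave, here is a sketch with four invented entries and a reduced set of patterns (`toyVals`, `toyPatterns`, and `toyResults` are hypothetical and much simpler than the search strings above).

```r
# invented entries and simplified patterns (assumption, not the full search strings)
toyVals <- c("woody", "non-woody", "herb/liana", NA)
toyPatterns <- c("non-?woody", "(^|/)woody(/|$)", "liana", "herb")

# one column per pattern, one row per entry; NA entries never match
toyResults <- sapply(toyPatterns, grepl, toyVals, perl = TRUE)
colnames(toyResults) <- c("non-woody", "woody", "liana", "herb")
toyResults

# number of categories matched per entry
toyMatches <- rowSums(toyResults)
toyMatches
# 1 1 2 0
```

A row sum of 0 marks an entry without any match, and a row sum above 1 marks an entry that matched several categories and may need to be resolved.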
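As some of these categories should be mutually exclusive, we resolve entries that matched more than one of them. For contradictory pairs (non-woody with woody base or woody, liana with herb), both matches are set to FALSE. Where an entry matched both “woody base” and “woody”, or both “epiphyte” and “hemi-epiphyte”, only “woody” or “epiphyte”, respectively, is kept.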
searchResults[searchResults[, 1] == TRUE & searchResults[, 2] == TRUE, c(1, 2)] <- FALSE # non-woody and woody base
searchResults[searchResults[, 1] == TRUE & searchResults[, 7] == TRUE, c(1, 7)] <- FALSE # non-woody and woody
searchResults[searchResults[, 2] == TRUE & searchResults[, 7] == TRUE, 2] <- FALSE # woody base and woody
searchResults[searchResults[, 3] == TRUE & searchResults[, 4] == TRUE, 4] <- FALSE # epiphyte and hemiepiphyte
searchResults[searchResults[, 5] == TRUE & searchResults[, 8] == TRUE, c(5, 8)] <- FALSE # liana and herb
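The two kinds of resolution used here (dropping both matches versus keeping one) can be followed on a tiny invented matrix. This is a sketch only; `toyResults` is hypothetical and has just three of the eight columns.

```r
# invented match matrix with three categories (assumption, not TRY data)
toyResults <- matrix(
  c(TRUE,  TRUE,  FALSE,  # matched non-woody and woody base
    FALSE, TRUE,  TRUE,   # matched woody base and woody
    FALSE, FALSE, TRUE),  # matched woody only
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("non-woody", "woody base", "woody"))
)

# contradictory pair: discard both matches
toyResults[toyResults[, 1] & toyResults[, 2], c(1, 2)] <- FALSE

# woody base together with woody: keep only woody
toyResults[toyResults[, 2] & toyResults[, 3], 2] <- FALSE

toyResults
```

After both steps, the first row has no match left, while the second and third rows are both classified as woody only.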
Now, we can create new strings with the cleaned values and add them to the observations. To keep the original entries, we will create a new column called “CleanedValueStr”. We separate entries relating to different traits.
newVals[, 1] <- sapply(seq_len(nrow(searchResults)), function(x) {
  paste(colnames(searchResults)[c(1, 2, 7)][searchResults[x, c(1, 2, 7)]], collapse = ",")
})
newVals[, 2] <- sapply(seq_len(nrow(searchResults)), function(x) {
  paste(colnames(searchResults)[c(3, 4)][searchResults[x, c(3, 4)]], collapse = ",")
})
newVals[, 3] <- sapply(seq_len(nrow(searchResults)), function(x) {
  paste(colnames(searchResults)[c(5, 8)][searchResults[x, c(5, 8)]], collapse = ",")
})
newVals[, 4] <- sapply(seq_len(nrow(searchResults)), function(x) {
  paste(colnames(searchResults)[6][searchResults[x, 6]], collapse = ",")
})
newVals[newVals == ""] <- NA
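How a row of TRUE/FALSE matches turns into a cleaned string may be clearer on a small example. The sketch below uses an invented two-column matrix (`toyResults` and `toyClean` are hypothetical names).

```r
# invented match matrix for the epiphytism categories (assumption, not TRY data)
toyResults <- matrix(
  c(TRUE,  FALSE,
    FALSE, TRUE,
    FALSE, FALSE),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("epiphyte", "hemi-epiphyte"))
)

# per row: keep the column names where the match is TRUE and paste them together
toyClean <- sapply(seq_len(nrow(toyResults)), function(x) {
  paste(colnames(toyResults)[toyResults[x, ]], collapse = ",")
})

# rows without any match yield an empty string, which we turn into NA
toyClean[toyClean == ""] <- NA

toyClean
# "epiphyte" "hemi-epiphyte" NA
```

If a row still had two matches at this point, both names would end up in one string separated by a comma.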
# move values to other traits: assign the cleaned values, mark the rows with a
# "goto" trait name, then append a fresh copy of the original rows for the next trait
# (the rows selected in TRY are in the same order as TRYSubset and newVals)
# epiphytism
TRY[TraitName == "Plant woodiness", CleanedValueStr := newVals[, 2]]
TRY[TraitName == "Plant woodiness", TraitName := "gotoPlant epiphytism"]
TRY <- rbind(TRY, TRYSubset, fill = TRUE)
# growth form 1
TRY[TraitName == "Plant woodiness", CleanedValueStr := newVals[, 3]]
TRY[TraitName == "Plant woodiness", TraitName := "gotoPlant growth form 1"]
TRY <- rbind(TRY, TRYSubset, fill = TRUE)
# growth form 2
TRY[TraitName == "Plant woodiness", CleanedValueStr := newVals[, 4]]
TRY[TraitName == "Plant woodiness", TraitName := "gotoPlant growth form 2"]
# integrate the cleaned woodiness values into TRY
TRY <- rbind(TRY, TRYSubset, fill = TRUE)
TRY[TraitName == "Plant woodiness", CleanedValueStr := newVals[, 1]]
Because we duplicated the rows to hold the values belonging to other traits, the table has grown considerably. To avoid an unnecessary increase in file size, we remove the duplicated rows that have no value in the “CleanedValueStr” column.
TRY <- TRY[!grepl("^goto", TraitName) | !is.na(CleanedValueStr)]
We have used existing trait names with the prefix “goto” to classify some data. This way, the data can eventually be moved to the respective trait without going through another round of pre-processing. Therefore, only run the following line if this is the last of the pre-processing scripts you want to use.
TRY[, TraitName := sub("^goto", "", TraitName)]
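The filtering and renaming of the “goto” rows can be illustrated with plain vectors. This is a sketch in base R with invented values (`toyTraits`, `toyClean`, and `keep` are hypothetical), not the data.table operations used above.

```r
# invented trait names and cleaned values (assumption, not TRY data)
toyTraits <- c("Plant woodiness", "gotoPlant epiphytism", "gotoPlant growth form 1")
toyClean <- c("woody", NA, "liana")

# keep every regular row, but keep "goto" rows only if they carry a cleaned value
keep <- !grepl("^goto", toyTraits) | !is.na(toyClean)
toyTraits <- toyTraits[keep]
toyClean <- toyClean[keep]

# strip the prefix so that the rows join their target trait
finalTraits <- sub("^goto", "", toyTraits)
finalTraits
# "Plant woodiness" "Plant growth form 1"
```

The anchor `^` makes sure only a leading “goto” is removed, so a trait name that merely contains these letters elsewhere stays untouched.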
Let’s write the data to a file.
fwrite(TRY, file = paste0("TRY_processed_", Sys.Date(), ".gz"))