Prepare Plant growth form data from TRY for use#
The Plant growth form data from TRY provides information on plants’ main growth forms, specific sub-growth forms, plant life forms, plant parasitism, plant woodiness, and plant epiphytism.
If you intend to clean more than one or two traits, we recommend the use of the batch pre-processing script. Refer to the TRY main page for details.
If you have questions, suggestions, spot errors, or want to contribute, get in touch with us through planthub@idiv.de.
Author: David Schellenberger Costa
Requirements#
To run the script, the following is needed:
TRY data, available here
the data.table library may need to be installed
Code#
# load in libraries
library(data.table) # handle large datasets
# clear workspace
rm(list = ls())
Let’s get the TRY data.
# set working directory (adapt this!)
setwd(paste0(.brd, "PlantHub"))
# read in data (adapt this!)
TRY <- fread("TRY_PlantHub.gz")
# select data of interest
TRYSubset <- TRY[TraitName == "Plant growth form"]
To get an overview of the data, we convert values to lowercase, sort them, and show them as a table.
# extract original data strings
oriVals <- TRYSubset$OrigValueStr # oriVals == original values
# change all to lowercase to ease later classification
oriVals <- tolower(oriVals)
# get an overview of the data by tabulating the values and showing them ordered by increasing frequency
valueOverview <- table(oriVals)
valueOverview[order(valueOverview)]
Entries that contain no letters, such as purely numeric values, carry no growth form information, so we set them to NA.
oriVals[!grepl("[[:lower:]]", oriVals)] <- NA
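As a minimal sketch of what this filter does, the following applies the same pattern to a few toy values. The vector `toyVals` is hypothetical and not taken from TRY.

```r
# toy values (hypothetical, not taken from TRY)
toyVals <- tolower(c("Tree", "12", "3.5", "", "herb/shrub", NA))

# entries without any lowercase letter are set to NA;
# grepl() returns FALSE for NA and empty strings, so these are covered as well
toyVals[!grepl("[[:lower:]]", toyVals)] <- NA

print(toyVals)
```

Only the entries containing letters survive, so empty strings are removed along with the purely numeric values.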
Apparently, there are some trait categories mixed here. We will prepare a matrix with ten columns to separately save entries from different plant functional traits and process them consecutively.
newVals <- matrix(NA, length(oriVals), 10)
Process plant growth form 1 - main growth forms#
The main growth forms as defined here are tree, liana, shrub, herb, and moss. We will later create the trait name “Plant growth form 1” to store this data.
The most important part of the cleaning process is the definition of the search strings to look for. We use regular expressions in some cases to be more inclusive (or exclusive).
searchNames <- c(
"arbol|trees?",
"(liann?as?|climbers?|vines?)",
"(s(c|h)r?ubs?|bushy?|arbust)",
"herbs?",
"(moss(es)?|bryophytes?)"
)
We can now search for the strings defined before and give names to the new categories.
# search for the strings defined before
searchResults <- sapply(searchNames, grepl, oriVals)
# name columns of searchResults matrix like corrected searchNames
colnames(searchResults) <- c("tree", "liana", "shrub", "herb", "moss")
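To illustrate how the search works, here is a small sketch that runs the same patterns over a handful of toy values. The names `toyVals`, `toySearch`, and `toyResults` are assumptions made for this example only.

```r
# toy values (hypothetical, not taken from TRY)
toyVals <- c("tree", "small trees", "shrub/tree", "climber", "herbaceous", "moss", "fern")

# same search strings as above
toySearch <- c(
  "arbol|trees?",
  "(liann?as?|climbers?|vines?)",
  "(s(c|h)r?ubs?|bushy?|arbust)",
  "herbs?",
  "(moss(es)?|bryophytes?)"
)

# one logical column per search string, one row per value
toyResults <- sapply(toySearch, grepl, toyVals)
colnames(toyResults) <- c("tree", "liana", "shrub", "herb", "moss")

print(toyResults)
# number of matches per category
print(colSums(toyResults))
# values matched to no category ("fern") and to more than one ("shrub/tree")
print(toyVals[rowSums(toyResults) < 1])
print(toyVals[rowSums(toyResults) > 1])
```

Each call to `grepl` returns one logical vector, and `sapply` binds these vectors into a matrix, which is why `colSums` and `rowSums` can be used to summarize the matches.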
Let’s have a look at the results.
# show the number of matches to each category
colSums(searchResults)
# show the number of entries that weren't matched to any category
sum(rowSums(searchResults) < 1)
# show the number of entries that were matched to more than one category
sum(rowSums(searchResults) > 1)
These categories should be mutually exclusive. Where several categories are found in one entry, we keep only one, in the order tree > liana > shrub > herb. This means that if we find the strings “tree” and “liana” in one entry, we keep only “tree”. For mosses, if an entry contains an additional growth form classification, we discard the entry completely.
# remove contradictory entries
searchResults[searchResults[, 5] == TRUE & rowSums(searchResults) > 1, ] <- FALSE # mosses are no other growth form
# consider logical relationships
# if growing as tree state this only
searchResults[searchResults[, 1] == TRUE & rowSums(searchResults) > 1, (2:5)] <- FALSE
# if growing as liana state this only
searchResults[searchResults[, 2] == TRUE & rowSums(searchResults) > 1, (3:5)] <- FALSE
# if growing as shrub state this only
searchResults[searchResults[, 3] == TRUE & rowSums(searchResults) > 1, (4:5)] <- FALSE
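The precedence rules can be checked on a small hand-made match matrix. This is only a sketch: the matrix `toyResults` and the entries named in its comments are hypothetical.

```r
# hand-made match matrix (hypothetical), one row per entry
toyResults <- matrix(
  c(
    TRUE,  FALSE, TRUE,  FALSE, FALSE, # e.g. "shrub/tree"
    FALSE, TRUE,  FALSE, TRUE,  FALSE, # e.g. "herbaceous vine"
    FALSE, FALSE, FALSE, TRUE,  TRUE,  # e.g. "herb/moss"
    FALSE, FALSE, TRUE,  FALSE, FALSE  # e.g. "shrub"
  ),
  ncol = 5, byrow = TRUE,
  dimnames = list(NULL, c("tree", "liana", "shrub", "herb", "moss"))
)

# mosses combined with another growth form are discarded
toyResults[toyResults[, 5] & rowSums(toyResults) > 1, ] <- FALSE
# tree > liana > shrub > herb
toyResults[toyResults[, 1] & rowSums(toyResults) > 1, 2:5] <- FALSE
toyResults[toyResults[, 2] & rowSums(toyResults) > 1, 3:5] <- FALSE
toyResults[toyResults[, 3] & rowSums(toyResults) > 1, 4:5] <- FALSE

# build one cleaned string per row; rows without a match become NA
cleaned <- apply(toyResults, 1, function(r) paste(colnames(toyResults)[r], collapse = ","))
cleaned[cleaned == ""] <- NA
print(cleaned)
```

The ambiguous rows are reduced to “tree” and “liana”, the moss row with a second classification ends up as NA, and the unambiguous “shrub” row is left untouched.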
Now, we can create new strings with the cleaned values and add them to the observations. To not remove the original entries, we will create a new column called “CleanedValueStr”.
newVals[, 1] <- sapply(seq_len(nrow(searchResults)), function(x) {
paste(colnames(searchResults)[searchResults[x, ]], collapse = ",")
})
newVals[, 1][newVals[, 1] == ""] <- NA
Process plant growth form 2 - specific sub-growth forms#
Sub-growth forms are mostly sub-categories of the main growth forms. They are “grasses”, “forbs”, “ferns”, “lycophytes”, “cacti”, “palms”, and “legumes”. We will later move them to a trait called “Plant growth form 2”.
The most important part of the cleaning process is the definition of the search strings to look for. We use regular expressions in some cases to be more inclusive (or exclusive).
searchNames <- c(
"(gras?s(es)?|sed?ges?|graminoids?|bamboos?|bambusoids?)",
"forbs?",
"ferns?|pteridophytes?",
"lycophytes?",
"cact(us|i)",
"palms?",
"legume"
)
We can now search for the strings defined before and give names to the new categories.
# search for the strings defined before
searchResults <- sapply(searchNames, grepl, oriVals)
# name columns of searchResults matrix like corrected searchNames
colnames(searchResults) <- c("graminoid", "forb", "fern", "lycophyte", "cactus", "palm", "legume")
Let’s have a look at the results.
# show the number of matches to each category
colSums(searchResults)
# show the number of entries that weren't matched to any category
sum(rowSums(searchResults) < 1)
# show the number of entries that were matched to more than one category
sum(rowSums(searchResults) > 1)
As these categories should be exclusive, we exclude all ambiguous data by setting our search results to FALSE whenever we found more than one match in our search.
searchResults[searchResults[, 4] == TRUE, 5] <- FALSE # lycophytes are no cacti
searchResults[rowSums(searchResults) > 1, ] <- FALSE # more than one category
Now, we can create new strings with the cleaned values and add them to the observations. To not remove the original entries, we will create a new column called “CleanedValueStr”.
newVals[, 2] <- sapply(seq_len(nrow(searchResults)), function(x) {
paste(colnames(searchResults)[searchResults[x, ]], collapse = ",")
})
newVals[, 2][newVals[, 2] == ""] <- NA
Process growth form 3 - plant life form#
Plant life forms as defined here describe a plant’s ability to cope with special conditions regarding its substrate, water availability, amount of salt in the soil, or nutrition strategy. They are not to be confused with the Raunkiaer life form, which is another trait. Examples of plant life forms are xerophyte, halophyte, or mesophyte. Nevertheless, many values belonging to Raunkiaer life forms will be identified here and moved to that trait later on.
# replace "chaemae" with "chamae" in original data string
oriVals <- gsub("chaemae", "chamae", oriVals)
# search for the pattern "[[:lower:]]+phyte" to extract substring from oriVals and remove duplicate values
searchNames <- unique(regmatches(oriVals, regexpr("[[:lower:]]+phyte", oriVals)))
# drop the patterns containing "epi", "lyco", "bryo" and "pterido", as these plant groups are dealt with
# in growth form 2 and epiphytism; also drop or simplify combined categories such as "xeromeso" or "xerohalo"
searchNames <- searchNames[!grepl("epi", searchNames)]
searchNames <- searchNames[!grepl("lyco", searchNames)]
searchNames <- searchNames[!grepl("bryo", searchNames)]
searchNames <- searchNames[!grepl("pterido", searchNames)]
searchNames <- searchNames[!grepl("xeromeso", searchNames)]
searchNames <- searchNames[!grepl("mesoxero", searchNames)]
searchNames <- searchNames[!grepl("hydrohalo", searchNames)]
searchNames <- sub("xerohalo", "halo", searchNames)
searchNames <- c(searchNames, "^cryptophyte")
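The way the search strings are derived from the data itself can be sketched on a few toy values. The vector `toyVals` is hypothetical, and only one of the exclusion steps is shown.

```r
# toy values (hypothetical, not taken from TRY)
toyVals <- c("geophyte", "hemicryptophyte (rosette)", "epiphyte", "tree", "geophyte")

# regexpr() locates the first "...phyte" word in each value,
# regmatches() extracts it and skips values without a match
found <- unique(regmatches(toyVals, regexpr("[[:lower:]]+phyte", toyVals)))
print(found)

# drop groups that are handled elsewhere, here only "epi" as an example
found <- found[!grepl("epi", found)]
print(found)
```

Because the search strings come from the data, the resulting categories depend on the TRY version used, so it is worth inspecting `searchNames` before continuing.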
We can now search for the strings defined before and give names to the new categories.
# search for the strings defined before
searchResults <- sapply(searchNames, grepl, oriVals)
# name columns of searchResults matrix like corrected searchNames
colnames(searchResults) <- sub("\\^", "", searchNames)
Let’s have a look at the results.
# show the number of matches to each category
colSums(searchResults)
# show the number of entries that weren't matched to any category
sum(rowSums(searchResults) < 1)
# show the number of entries that were matched to more than one category
sum(rowSums(searchResults) > 1)
As these categories should be exclusive, we exclude all ambiguous data by setting our search results to FALSE whenever we found more than one match in our search. We also account for the fact that some categories are sub-categories of “cryptophyte”.
# remove contradictory entries
searchResults[rowSums(searchResults) > 1, ] <- FALSE
# consider logical relationships
searchResults[searchResults[, colnames(searchResults) == "geophyte"] == TRUE, which(colnames(searchResults) == "cryptophyte")] <- TRUE # geo is crypto
searchResults[searchResults[, colnames(searchResults) == "hydrophyte"] == TRUE, which(colnames(searchResults) == "cryptophyte")] <- TRUE # hydro is crypto
searchResults[searchResults[, colnames(searchResults) == "helophyte"] == TRUE, which(colnames(searchResults) == "cryptophyte")] <- TRUE # helo is crypto
Now, we can create new strings with the cleaned values and add them to the observations. To not remove the original entries, we will create a new column called “CleanedValueStr”.
newVals[, 3] <- sapply(seq_len(nrow(searchResults)), function(x) {
paste(colnames(searchResults)[searchResults[x, ]], collapse = ",")
})
newVals[, 3][newVals[, 3] == ""] <- NA
Process growth form 4 - parasitic or not#
The growth form data has information on whether plants are parasitic. As this is independent of their other growth form traits, it needs to be recorded separately.
# search for entries mentioning parasitism
searchResults <- grepl("parasit", oriVals)
# use the logical searchResults vector to flag the matching entries
newVals[searchResults == TRUE, 4] <- TRUE
Process growth form 5 - woody or not#
There is also information on plant woodiness. Woodiness occurs in trees, lianas, and shrubs, and will therefore be recorded separately.
# search for entries mentioning woodiness
searchResults <- grepl("woody", oriVals)
# use the logical searchResults vector to create new value strings
# a string (not TRUE) is stored because woody, woody base, and non-woody are differentiated later
newVals[searchResults == TRUE, 5] <- "woody"
Process growth form 6 - epiphyte or not#
Epiphytism, the growth on other plants, occurs across all main growth form categories, and needs to be recorded separately.
# repair hemi-epiphyte matching problem
oriVals <- gsub("emi\\-epiph", "emiepiph", oriVals)
# create a vector containing the search strings to look for
searchNames <- c("(?<!emi)epiph(y|i)tes?", "(h|s)emi-?epiph(y|i)tes?", "terrestrial|free-?standing")
# search for the strings defined before
searchResults <- sapply(searchNames, grepl, oriVals, perl = TRUE)
# name columns of searchResults matrix like specific sub-growth form categories
colnames(searchResults) <- c("epiphyte", "hemi-epiphyte", "terrestrial")
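The negative lookbehind `(?<!emi)` is what keeps hemi-epiphytes out of the epiphyte category, and it only works with `perl = TRUE`. The following sketch shows the effect on toy values; `toyVals`, `toySearch`, and `toyResults` are hypothetical names.

```r
# toy values (hypothetical, not taken from TRY)
toyVals <- c("epiphyte", "hemiepiphyte", "semi-epiphyte", "terrestrial", "free-standing")

# same repair step and search strings as above
toyVals <- gsub("emi\\-epiph", "emiepiph", toyVals)
toySearch <- c("(?<!emi)epiph(y|i)tes?", "(h|s)emi-?epiph(y|i)tes?", "terrestrial|free-?standing")

# perl = TRUE is required for the lookbehind in the first pattern
toyResults <- sapply(toySearch, grepl, toyVals, perl = TRUE)
colnames(toyResults) <- c("epiphyte", "hemi-epiphyte", "terrestrial")
print(toyResults)
```

Without the lookbehind, “hemiepiphyte” would match both the first and the second pattern.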
Let’s have a look at the results.
# show the number of matches to each category
colSums(searchResults)
# show the number of entries that weren't matched to any category
sum(rowSums(searchResults) < 1)
# show the number of entries that were matched to more than one category
sum(rowSums(searchResults) > 1)
As epiphytism is the most derived of the three categories, we report only epiphytism whenever it is found, and only hemi-epiphytism when both hemi-epiphytism and “terrestrial” are found.
# consider logical relationships
# if growing as epiphyte state this only
searchResults[searchResults[, 1] == TRUE & rowSums(searchResults) > 1, c(2, 3)] <- FALSE
# if growing as hemiepiphyte state this only
searchResults[searchResults[, 2] == TRUE & rowSums(searchResults) > 1, 3] <- FALSE
Now, we can create new strings with the cleaned values and add them to the observations. To not remove the original entries, we will create a new column called “CleanedValueStr”.
# use the searchResults matrix to create new value strings
newVals[, 6] <- sapply(seq_len(nrow(searchResults)), function(x) {
paste(colnames(searchResults)[searchResults[x, ]], collapse = ",")
})
newVals[, 6][newVals[, 6] == ""] <- NA
Process growth form 7 - add aquatic and semi-aquatic to Raunkiaer life form#
The data also gives information on whether a plant is (semi)aquatic. However, as this is also a Raunkiaer life form category, we move this data there.
searchNames <- c("(?<!emi)aqua", "(h|s)emi-?aqua")
# search for the strings defined before
searchResults <- sapply(searchNames, grepl, oriVals, perl = TRUE)
# name columns of searchResults matrix like specific sub-growth form categories
colnames(searchResults) <- c("hydrophyte,cryptophyte", "hemihydrophyte,hemicryptophyte")
Let’s have a look at the results.
# show the number of matches to each category
colSums(searchResults)
As there may be very few hemi-hydrophytes, we will add them to the hydrophytes and remove the hemi-hydrophyte category altogether if we find fewer than 10 entries.
if (sum(searchResults[, 2]) < 10) {
searchResults[searchResults[, 2] == TRUE, 1] <- TRUE
searchResults[, 2] <- FALSE
}
# add data to Raunkiaer life form
for (i in seq_len(ncol(searchResults))) {
newVals[!is.na(newVals[, 3]) & !grepl(colnames(searchResults)[i], newVals[, 3]) & searchResults[, i] == TRUE, 3] <-
paste(newVals[!is.na(newVals[, 3]) & !grepl(colnames(searchResults)[i], newVals[, 3]) & searchResults[, i] == TRUE, 3],
colnames(searchResults)[i],
sep = ","
)
newVals[is.na(newVals[, 3]) & !grepl(colnames(searchResults)[i], newVals[, 3]) & searchResults[, i] == TRUE, 3] <-
colnames(searchResults)[i]
}
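The loop above either appends the new label to an existing value or sets it where no value exists yet. A minimal sketch of the same logic for a single label, using the hypothetical vectors `lifeForm` and `isAquatic`, might look like this:

```r
# existing life form strings and aquatic matches (hypothetical toy data)
lifeForm <- c("geophyte,cryptophyte", NA, "hydrophyte,cryptophyte", "halophyte")
isAquatic <- c(FALSE, TRUE, TRUE, TRUE)
label <- "hydrophyte,cryptophyte"

# entries that already have a value but not yet this label: append it
toAppend <- !is.na(lifeForm) & !grepl(label, lifeForm) & isAquatic
lifeForm[toAppend] <- paste(lifeForm[toAppend], label, sep = ",")

# entries without any value: set the label
toSet <- is.na(lifeForm) & isAquatic
lifeForm[toSet] <- label

print(lifeForm)
```

Storing the row selection in a named logical vector avoids repeating the long condition on both sides of the assignment.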
Process growth form 8 - main shoot growth direction/pattern#
We will now extract data on the growth direction of the main shoot of a plant, ranging from erect, perpendicular growth to rambling growth.
searchNames <- c("prostrate", "scandent", "rambling", "erect", "pendant", "ascending")
# search for the strings defined before
searchResults <- sapply(searchNames, grepl, oriVals)
# name columns of searchResults matrix like specific sub-growth form categories
colnames(searchResults) <- c("prostrate", "scandent", "rambling", "erect", "pendant", "ascending")
Let’s have a look at the results.
# show the number of matches to each category
colSums(searchResults)
Now, we can create new strings with the cleaned values and add them to the observations. To not remove the original entries, we will create a new column called “CleanedValueStr”.
# use the searchResults matrix to create new value strings by concatenating all data found
newVals[, 8] <- sapply(seq_len(nrow(searchResults)), function(x) {
paste(colnames(searchResults)[searchResults[x, ]], collapse = ",")
})
newVals[, 8][newVals[, 8] == ""] <- NA
Process growth form 9 - life span#
Some growth form data actually relates to plant life span, i.e. whether a plant is annual, biennial, or perennial.
# create a vector containing the search strings to look for
searchNames <- c(
"(^| |\\W)per(e)?(nnial)?|long|poly-?annual|pluri-?ennial",
"(^| )an(n)?(ual)?|short|ephemeral",
"b(i)?-?(asa|e)nnial|bia?s?a?nnual"
)
# search for the strings defined before
searchResults <- sapply(searchNames, grepl, oriVals)
# name columns of searchResults matrix like specific sub-growth form categories
colnames(searchResults) <- c("perennial", "annual", "biennial")
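These patterns are deliberately inclusive, so it can help to try them on a few values before trusting the counts. The sketch below uses the hypothetical vector `toyVals` and shows both intended matches and one unintended match.

```r
# toy values (hypothetical, not taken from TRY)
toyVals <- c("perennial", "annual", "biennial", "annual/biennial", "tree", "tree and shrub")

# same search strings as above
toySearch <- c(
  "(^| |\\W)per(e)?(nnial)?|long|poly-?annual|pluri-?ennial",
  "(^| )an(n)?(ual)?|short|ephemeral",
  "b(i)?-?(asa|e)nnial|bia?s?a?nnual"
)

lifeResults <- sapply(toySearch, grepl, toyVals)
colnames(lifeResults) <- c("perennial", "annual", "biennial")
rownames(lifeResults) <- toyVals
print(lifeResults)
```

In this toy example, “tree and shrub” is classified as annual because the second pattern also matches the word “and”. Checking the matched original values in this way can reveal such cases in real data.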
Let’s have a look at the results.
# show the number of matches to each category
colSums(searchResults)
Now, we can create new strings with the cleaned values and add them to the observations. To not remove the original entries, we will create a new column called “CleanedValueStr”.
# use the searchResults matrix to create new value strings by concatenating all data found
newVals[, 9] <- sapply(seq_len(nrow(searchResults)), function(x) {
paste(colnames(searchResults)[searchResults[x, ]], collapse = ",")
})
newVals[, 9][newVals[, 9] == ""] <- NA
Process growth form 10 - succulence#
We can also find information on plant succulence in the growth form data, which is independent of the other traits.
# search for entries mentioning succulence
searchResults <- grepl("succulent", oriVals)
# use the logical searchResults vector to flag the matching entries
newVals[searchResults == TRUE, 10] <- TRUE
We will now integrate the new data into our file. Note that we will append several copies of the growth form subset of the data, with the new values in the “CleanedValueStr” column.
# integrate growth form 1 into TRY
TRY[TraitName == "Plant growth form", CleanedValueStr := newVals[, 1]]
TRY[TraitName == "Plant growth form", OrigUnitStr := NA]
TRY[TraitName == "Plant growth form", TraitName := "Plant growth form 1"]
# integrate growth form 2 into TRY
TRY <- rbind(TRY, TRYSubset, fill = TRUE)
TRY[TraitName == "Plant growth form", CleanedValueStr := newVals[, 2]]
TRY[TraitName == "Plant growth form", OrigUnitStr := NA]
TRY[TraitName == "Plant growth form", TraitName := "Plant growth form 2"]
# integrate growth form 3 into TRY
TRY <- rbind(TRY, TRYSubset, fill = TRUE)
TRY[TraitName == "Plant growth form", CleanedValueStr := newVals[, 3]]
TRY[TraitName == "Plant growth form", OrigUnitStr := NA]
TRY[TraitName == "Plant growth form", TraitName := "Plant life form"]
# move values to other traits
TRY[TraitName == "Plant life form" & CleanedValueStr == "geophyte", CleanedValueStr := "geophyte,cryptophyte"]
TRY[TraitName == "Plant life form" & CleanedValueStr == "hydrophyte", CleanedValueStr := "hydrophyte,cryptophyte"]
TRY[TraitName == "Plant life form" & CleanedValueStr == "helophyte", CleanedValueStr := "helophyte,cryptophyte"]
TRY[TraitName == "Plant life form" & CleanedValueStr == "macrophyte", CleanedValueStr := "phanerophyte"]
TRY[TraitName == "Plant life form" & CleanedValueStr == "geophyte,cryptophyte", TraitName := "gotoPlant Raunkiaer life form"]
TRY[TraitName == "Plant life form" & CleanedValueStr == "hydrophyte,cryptophyte", TraitName := "gotoPlant Raunkiaer life form"]
TRY[TraitName == "Plant life form" & CleanedValueStr == "helophyte,cryptophyte", TraitName := "gotoPlant Raunkiaer life form"]
TRY[TraitName == "Plant life form" & CleanedValueStr == "chamaephyte", TraitName := "gotoPlant Raunkiaer life form"]
TRY[TraitName == "Plant life form" & CleanedValueStr == "phanerophyte", TraitName := "gotoPlant Raunkiaer life form"]
TRY[TraitName == "Plant life form" & CleanedValueStr == "therophyte", TraitName := "gotoPlant Raunkiaer life form"]
TRY[TraitName == "Plant life form" & CleanedValueStr == "hemicryptophyte", TraitName := "gotoPlant Raunkiaer life form"]
# integrate parasitism into TRY
TRY <- rbind(TRY, TRYSubset, fill = TRUE)
TRY[TraitName == "Plant growth form", CleanedValueStr := newVals[, 4]]
TRY[TraitName == "Plant growth form", OrigUnitStr := NA]
TRY[TraitName == "Plant growth form", TraitName := "Plant parasitism"]
# integrate woodiness into TRY
TRY <- rbind(TRY, TRYSubset, fill = TRUE)
TRY[TraitName == "Plant growth form", CleanedValueStr := newVals[, 5]]
TRY[TraitName == "Plant growth form", OrigUnitStr := NA]
TRY[TraitName == "Plant growth form", TraitName := "gotoPlant woodiness"]
# integrate epiphytism into TRY
TRY <- rbind(TRY, TRYSubset, fill = TRUE)
TRY[TraitName == "Plant growth form", CleanedValueStr := newVals[, 6]]
TRY[TraitName == "Plant growth form", OrigUnitStr := NA]
TRY[TraitName == "Plant growth form", TraitName := "Plant epiphytism"]
# integrate shoot growth into TRY
TRY <- rbind(TRY, TRYSubset, fill = TRUE)
TRY[TraitName == "Plant growth form", CleanedValueStr := newVals[, 8]]
TRY[TraitName == "Plant growth form", OrigUnitStr := NA]
TRY[TraitName == "Plant growth form", TraitName := "Plant shoot growth"]
# integrate lifespan into TRY
TRY <- rbind(TRY, TRYSubset, fill = TRUE)
TRY[TraitName == "Plant growth form", CleanedValueStr := newVals[, 9]]
TRY[TraitName == "Plant growth form", OrigUnitStr := NA]
TRY[TraitName == "Plant growth form", TraitName := "gotoPlant lifespan (longevity) categories"]
# integrate succulence into TRY
TRY <- rbind(TRY, TRYSubset, fill = TRUE)
TRY[TraitName == "Plant growth form", CleanedValueStr := newVals[, 10]]
TRY[TraitName == "Plant growth form", OrigUnitStr := NA]
TRY[TraitName == "Plant growth form", TraitName := "Plant succulence"]
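The repeated pattern above (append a copy of the subset, fill in cleaned values, rename the trait) can be sketched on a tiny table. The tables `toyTRY` and `toySubset` and the matrix `toyNew` are hypothetical, and the example assumes that the data.table package is installed.

```r
library(data.table)

# tiny stand-in for the TRY table (hypothetical data)
toyTRY <- data.table(
  ObservationID = 1:3,
  TraitName = c("Plant growth form", "Plant growth form", "Leaf area"),
  OrigValueStr = c("tree", "parasitic herb", "12")
)
# untouched copy of the growth form rows, used for every further trait
toySubset <- toyTRY[TraitName == "Plant growth form"]
# cleaned values, one column per new trait (hypothetical)
toyNew <- cbind(c("tree", "herb"), c(NA, "TRUE"))

# first trait: fill in cleaned values and rename the existing rows
toyTRY[TraitName == "Plant growth form", CleanedValueStr := toyNew[, 1]]
toyTRY[TraitName == "Plant growth form", TraitName := "Plant growth form 1"]

# second trait: append a fresh copy, which is the only block still named "Plant growth form"
toyTRY <- rbind(toyTRY, toySubset, fill = TRUE)
toyTRY[TraitName == "Plant growth form", CleanedValueStr := toyNew[, 2]]
toyTRY[TraitName == "Plant growth form", TraitName := "Plant parasitism"]

print(toyTRY)
```

Renaming the trait at the end of each step is what makes the next selection on “Plant growth form” hit only the freshly appended copy.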
We duplicated the data to accommodate the values belonging to other traits. To avoid an unnecessary increase in file size, we remove those duplicated rows that are destined for other traits (trait names with the prefix “goto”) and have no value in the “CleanedValueStr” column.
TRY <- TRY[!grepl("^goto", TraitName) | !is.na(CleanedValueStr)]
We have used existing trait names with the prefix “goto” to classify some data. This was done so that the data can eventually be moved to the respective trait without going through another round of pre-processing. Therefore, only run the following line if this is the last of the pre-processing scripts you want to use.
TRY[, TraitName := sub("^goto", "", TraitName)]
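The effect of the last two steps can be seen on plain vectors. This is a sketch only: `traitNames` and `cleanedVals` are hypothetical stand-ins for the corresponding TRY columns.

```r
# stand-ins for the TraitName and CleanedValueStr columns (hypothetical)
traitNames <- c("gotoPlant woodiness", "Plant epiphytism", "gotoPlant Raunkiaer life form")
cleanedVals <- c(NA, NA, "geophyte,cryptophyte")

# keep rows that are not "goto" rows, or that carry a cleaned value
keep <- !grepl("^goto", traitNames) | !is.na(cleanedVals)
print(keep)

# strip the prefix so that the data joins the existing trait
traitNames <- sub("^goto", "", traitNames[keep])
print(traitNames)
```

Note that rows of traits without the prefix are kept even if they have no cleaned value.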
Let’s write the data to a file.
fwrite(TRY, file = paste0("TRY_processed_", Sys.Date(), ".gz"))