Prepare Plant functional type (PFT) data from TRY for use

The Plant functional type (PFT) data from TRY summarizes several properties of each species: the climate zone it occurs in, its leaf type, the habitat it grows in, its carbon fixation mechanism, its deciduousness, and its major clade. Here, all data that can be classified into other traits is moved to those traits. In practice, this leaves no data under the PFT trait itself.

If you intend to clean more than one or two traits, we recommend the use of the batch pre-processing script. Refer to the TRY main page for details.

If you have questions, suggestions, spot errors, or want to contribute, get in touch with us through planthub@idiv.de.

Author: David Schellenberger Costa

Requirements

To run the script, the following is needed:

  • TRY data, available here

  • the data.table library may need to be installed

Code

# load in libraries
library(data.table) # handle large datasets

# clear workspace
rm(list = ls())

Let’s get the TRY data.

# set working directory (adapt this!)
# note: .brd is not defined in this script; it is assumed to hold the path to a base directory
setwd(paste0(.brd, "PlantHub"))

# read in data (adapt this!)
TRY <- fread("TRY_PlantHub.gz")

# select data of interest
TRYSubset <- TRY[TraitName == "Plant functional type (PFT)"]

To get an overview of the data, we tabulate all values and show the counts in sorted order.

# extract original data strings
oriVals <- TRYSubset$OrigValueStr # oriVals == original values

# get an overview of the data by counting each value and showing the counts from rarest to most frequent
valueOverview <- table(oriVals)
valueOverview[order(valueOverview)]
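As a small illustration of what this overview returns, the sketch below uses made-up values (not taken from TRY). `table()` lists values alphabetically by name, while indexing with `order()` on the counts sorts them by frequency.

```r
# toy values standing in for TRYSubset$OrigValueStr (made up for illustration)
toyVals <- c("shrub", "tree", "shrub", "C3 grass", "tree", "shrub")

# count how often each value occurs; names come out in alphabetical order
toyOverview <- table(toyVals)
names(toyOverview)

# re-sort by count, rarest value first
toyByCount <- toyOverview[order(toyOverview)]
toyByCount
```

Sorting by count puts rare spellings and one-off codes at the top, which is where most cleaning problems tend to show up.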

It looks like a good idea to remove purely numeric values.

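# set values that contain no letter at all to NA; matching on [[:alpha:]] rather than
# [[:lower:]] keeps all-uppercase codes such as "BET_TE" or "C3" that are searched for below
oriVals[!grepl("[[:alpha:]]", oriVals)] <- NA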
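The character class used in this step decides which entries survive. The following sketch, on made-up values, compares a filter on "no letter at all" with a filter on "no lowercase letter"; the variable names are illustrative only.

```r
# toy values (made up for illustration, not from TRY)
toyVals <- c("12", "3.5", "BET_TE", "C3", "evergreen tree", "EvSh")

# TRUE where the value contains no letter at all
noLetters <- !grepl("[[:alpha:]]", toyVals)

# TRUE where the value contains no lowercase letter
noLower <- !grepl("[[:lower:]]", toyVals)

toyVals[noLetters] # only the purely numeric entries
toyVals[noLower]   # additionally drops all-uppercase codes
```

Filtering on lowercase letters would also discard codes written entirely in capitals, so it removes more than the purely numeric values.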

The most important part of the cleaning process is the definition of the search strings to look for. We use regular expressions in some cases to be more inclusive (or exclusive).

# create a vector containing the search strings to look for
searchNames <- c(
	# leaf shape
	"(^(BET_TE|EA|BDT_TE|BET_Tr|TDB|TEB|SEB)$|broadlea|bl)",
	"(^(NET_B|NET_TE|TEN|TDN)$|needle|nl)",
	# evergreen or deciduous
	"(^(EA|BET_TE|BET_Tr|NET_B|EG|TDB|TEB|SEB|TEN|TDN)$|evergreen|ev)",
	"(^(BDT_TE|DA|DG)$|deciduous|dc)",
	# tropical, temperate, boreal, arctic
	"(^(BET_Tr)$|tropical|trp)",
	"(^(BET_TE|BDT_TE|NET_TE)$|temperate|tmp)",
	"(^(NET_B)$|boreal|bor)",
	"arctic",
	# angiosperm, gymnosperm, or fern
	"(^(EA|DA)$|(?<!(sl|av))an)", # avoid false positives through savanna/grassland
	"(^(EG|GC3|GC4|DG)$|conifer|gy)",
	"PtC3H",
	# C3 or C4
	"C3",
	"C4",
	# tree, shrub, liana, or herb
	"(^(BET_TE|BDT_TE|BET_Tr|NET_B|NET_TE|TEB|TDB|BlT|EvBlT|AnEvBlT|DcBlT|AnDcBlT|TEN|NlT|GyEvNlT|EvNlT|TDN|GyEvBlT|DcNlT|GyDcNlT|GyDcBlT)$|tree)", # nolint: line_length_linter.
	"(^(EvSh|SEB|S|SEv|AnEvBlS|DcS|AnDcBlS|AnEvBlSC4|GyEvBlS)$|shrub)",
	"vine",
	"(^(C3H|TmpH|AnC3H|C4H|AnC4H|TrpH|PtC3H)$|crop|grass(?!land)|forb|herb)", # negative lookahead: "grass", but not "grassland"
	# habitat
	"rainforest",
	"savanna",
	"tundra",
	"grassland",
	"desert"
)
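Two of the search strings rely on Perl-style lookaround assertions, which are easy to get wrong. The sketch below tests a negative lookahead and a negative lookbehind on made-up values; it assumes `perl = TRUE`, as in the search step of this script.

```r
# toy values (made up for illustration, not from TRY)
toyVals <- c("grass", "grassland", "C3 grass", "savanna", "Angiosperm", "AnEvBlT")

# negative lookahead (?!...): "grass" that is NOT followed by "land"
grassHit <- grepl("grass(?!land)", toyVals, ignore.case = TRUE, perl = TRUE)

# negative lookbehind (?<!...): "an" that is NOT preceded by "sl" or "av"
anHit <- grepl("(?<!(sl|av))an", toyVals, ignore.case = TRUE, perl = TRUE)

# for contrast, (?=...) is the positive form: here it requires a literal "!land" after "grass"
positiveHit <- grepl("grass(?=!land)", c("grass", "grass!land"), perl = TRUE)

data.frame(toyVals, grassHit, anHit)
```

Only "grass" and "C3 grass" match the first pattern, and only the two entries starting with "An" match the second, so "grassland" and "savanna" are not picked up as herb or angiosperm.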

We can now search for the strings defined before and give names to the new categories. The category names are grouped by the trait they belong to, in the same order as the search strings. We also prepare a matrix to save new values in, with one column per trait.

# search for the strings defined before
searchResults <- sapply(searchNames, grepl, oriVals, ignore.case = TRUE, perl = TRUE)

# name the columns of the searchResults matrix after the categories the search strings stand for
searchResultsCols <- list()
searchResultsCols[[1]] <- c("broadleaved", "needle")
searchResultsCols[[2]] <- c("evergreen", "deciduous")
searchResultsCols[[3]] <- c("tropical", "temperate", "boreal", "arctic")
searchResultsCols[[4]] <- c("angiosperm", "gymnosperm", "pteridophyte")
searchResultsCols[[5]] <- c("C3", "C4")
searchResultsCols[[6]] <- c("tree", "shrub", "liana", "herb")
searchResultsCols[[7]] <- c("forest", "savanna", "tundra", "grassland", "desert")
colnames(searchResults) <- unlist(searchResultsCols)

# prepare matrix to save new values in
newVals <- matrix(NA, length(oriVals), length(searchResultsCols))
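To make the shape of the search result concrete, here is a minimal sketch on made-up values and a shortened, hypothetical pattern list. `sapply()` applies `grepl()` once per pattern and returns a logical matrix with one row per entry and one column per pattern.

```r
# toy values and patterns (made up for illustration)
toyVals <- c("evergreen tree", "deciduous shrub", "C4 grass", NA)
toyPatterns <- c("evergreen", "deciduous", "tree", "shrub")

# one grepl() call per pattern; each call returns one logical value per entry
toyResults <- sapply(toyPatterns, grepl, toyVals, ignore.case = TRUE, perl = TRUE)

# rows correspond to entries, columns to patterns
colnames(toyResults) <- c("evergreen", "deciduous", "tree", "shrub")
toyResults

# number of categories matched per entry
rowSums(toyResults)
```

Note that `grepl()` returns `FALSE` for `NA` input, so entries set to `NA` earlier simply match nothing.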

Let’s have a look at the results.

# show the number of matches to each category
colSums(searchResults)

# show the original entries for which no match was retrieved
oriVals[rowSums(searchResults) < 1]

# show the number of entries that weren't matched to any category
sum(rowSums(searchResults) < 1)

# show the number of entries that were matched to more than one category
sum(rowSums(searchResults) > 1)

# check which entries were classified into > 1 groups
table(oriVals[rowSums(searchResults) > 1])

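Several of the traits have mutually exclusive categories: an entry cannot sensibly be classified as both evergreen and deciduous, for example. For these traits, we discard all matches of an entry that matched more than one category of the same trait.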

# remove contradictory entries
# only one category possible
for (i in c(1, 2, 4, 5, 6)) {
	searchResults[
		rowSums(searchResults[, colnames(searchResults) %in% searchResultsCols[[i]]]) > 1,
		colnames(searchResults) %in% searchResultsCols[[i]]
	] <- FALSE
}
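The effect of this loop on a single trait can be sketched as follows, using a made-up logical matrix. The column "tree" stands for a category of a different trait and should remain untouched.

```r
# toy search results (made up for illustration): rows are entries, columns are categories
toyResults <- matrix(
	c(
		TRUE, FALSE, TRUE,
		TRUE, TRUE, TRUE,
		FALSE, FALSE, FALSE
	),
	ncol = 3, byrow = TRUE,
	dimnames = list(NULL, c("evergreen", "deciduous", "tree"))
)

# categories of one trait that exclude each other
toyGroup <- c("evergreen", "deciduous")
inGroup <- colnames(toyResults) %in% toyGroup

# entries that matched more than one category within the group
contradictory <- rowSums(toyResults[, inGroup, drop = FALSE]) > 1

# discard all matches of those entries within the group only
toyResults[contradictory, inGroup] <- FALSE
toyResults
```

The second entry loses both its "evergreen" and "deciduous" match but keeps "tree", because contradictions are only resolved within a trait.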

We can now write the data into the prepared new results matrix.

# use the searchResults matrix to create new value strings by concatenating all data found
for (i in seq_along(searchResultsCols)) {
	searchResultsTemp <- searchResults[, colnames(searchResults) %in% searchResultsCols[[i]], drop = FALSE]
	newVals[, i] <- sapply(seq_len(nrow(searchResultsTemp)), function(x) {
		paste(searchResultsCols[[i]][searchResultsTemp[x, ]], collapse = ",")
	})
}
newVals[newVals == ""] <- NA
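As a sketch of the concatenation step for one trait with non-exclusive categories, consider the made-up matrix below. It uses `apply()` over rows instead of the `sapply()` call above, which should give the same result for a single trait.

```r
# toy search results for one trait (made up for illustration)
toyResults <- matrix(
	c(
		TRUE, FALSE, FALSE,
		TRUE, TRUE, FALSE,
		FALSE, FALSE, FALSE
	),
	ncol = 3, byrow = TRUE,
	dimnames = list(NULL, c("tropical", "temperate", "boreal"))
)
toyCats <- colnames(toyResults)

# paste the names of all matched categories of each entry into one string
toyNew <- apply(toyResults, 1, function(row) paste(toyCats[row], collapse = ","))

# entries without any match yield an empty string, which is turned into NA
toyNew[toyNew == ""] <- NA
toyNew
```

An entry matching several categories ends up with a comma-separated value, here "tropical,temperate".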

Because these values belong to other traits, we move them there. As a result, the original PFT trait disappears entirely: all data used to define plant functional types is covered by other traits.

# move values to other traits
traitNames <- c(
	"gotoLeaf shape", "gotoLeaf phenology type", "gotoSpecies occurrence range: climate type",
	"Plant major plant group", "gotoPlant photosynthesis pathway", "gotoPlant main growth form",
	"gotoSpecies habitat characterization: vegetation type"
)
for (i in seq_along(traitNames)) {
	if (i > 1) TRY <- rbind(TRY, TRYSubset, fill = TRUE)
	TRY[TraitName == "Plant functional type (PFT)", CleanedValueStr := newVals[, i]]
	TRY[TraitName == "Plant functional type (PFT)", TraitName := traitNames[i]]
}
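The duplication logic of this loop can be followed on a small example. The sketch below uses a made-up table and shortened, hypothetical trait names ("PFT", "gotoPhenology", "gotoGrowth form"); only the pattern of `rbind(..., fill = TRUE)` followed by two `:=` assignments mirrors the code above.

```r
library(data.table)

# toy stand-in for the TRY table (made up for illustration)
toyTRY <- data.table(
	TraitName = c("PFT", "PFT", "Leaf area"),
	OrigValueStr = c("evergreen tree", "shrub", "12")
)
toySubset <- toyTRY[TraitName == "PFT"] # untouched copy of the PFT rows

# one column of cleaned values per target trait (hypothetical names)
toyNew <- cbind(c("evergreen", NA), c("tree", "shrub"))
toyTraits <- c("gotoPhenology", "gotoGrowth form")

for (i in seq_along(toyTraits)) {
	# from the second trait on, append a fresh copy of the PFT rows
	if (i > 1) toyTRY <- rbind(toyTRY, toySubset, fill = TRUE)
	# fill in the cleaned values, then rename the rows to the target trait
	toyTRY[TraitName == "PFT", CleanedValueStr := toyNew[, i]]
	toyTRY[TraitName == "PFT", TraitName := toyTraits[i]]
}
toyTRY
```

After the loop, no row carries the name "PFT" any more, and each original PFT row exists once per target trait, including rows whose cleaned value is `NA`.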

Because we duplicated the PFT rows once per target trait, many of the duplicated rows have no value in the “CleanedValueStr” column. To avoid an unnecessary increase in file size, we remove the rows whose trait name starts with “goto” and that have no cleaned value.

TRY <- TRY[!grepl("^goto", TraitName) | !is.na(CleanedValueStr)]

For most of the new data, we have used an existing trait name with the prefix “goto”. The prefix marks data that will eventually be moved to the respective trait, but keeps it apart from that trait for now, so that it does not go through another round of pre-processing. Therefore, only run the following line if this is the last of the pre-processing scripts you want to use.

TRY[, TraitName := sub("^goto", "", TraitName)]
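The last two steps, dropping empty “goto” rows and stripping the prefix, can be sketched on a made-up base R data frame. The data.table syntax above should behave the same way; the values here are illustrative only.

```r
# toy table (made up for illustration)
toyTraits <- data.frame(
	TraitName = c("gotoLeaf shape", "gotoLeaf shape", "Leaf area", "gotoPlant main growth form"),
	CleanedValueStr = c("needle", NA, NA, "tree"),
	stringsAsFactors = FALSE
)

# keep rows that are not "goto" rows, or that carry a cleaned value
keep <- !grepl("^goto", toyTraits$TraitName) | !is.na(toyTraits$CleanedValueStr)
toyTraits <- toyTraits[keep, ]

# strip the prefix so the data joins the existing trait
toyTraits$TraitName <- sub("^goto", "", toyTraits$TraitName)
toyTraits
```

The "Leaf area" row is kept although it has no cleaned value, because the filter only targets rows created by this script.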

Let’s write the data to a file.

fwrite(TRY, file = paste0("TRY_processed_", Sys.Date(), ".gz"))