Prepare Plant nitrogen(N) fixation capacity data from TRY for use
The Plant nitrogen(N) fixation capacity data from TRY indicate whether plants are able to fix nitrogen from the air. Usually, plants can only take up nitrogen through their roots and therefore depend on the soil nitrogen reservoir. The ability to fix nitrogen from the air is thus a strong advantage in nitrogen-poor habitats. Plants do not fix nitrogen on their own, but with the help of symbiotic bacteria. Bacteria from the family Rhizobiaceae are often found in Fabaceae. The genus Frankia is associated with Alnus and other plants. The family Nostocaceae grows with Cycas and Gunnera, among others. Here, we only distinguish between nitrogen fixers and non-fixers.
If you intend to clean more than one or two traits, we recommend the use of the batch pre-processing script. Refer to the TRY main page for details.
If you have questions, suggestions, spot errors, or want to contribute, get in touch with us through planthub@idiv.de.
Author: David Schellenberger Costa
Requirements
To run the script, the following is needed:
TRY data, available here
the data.table library may need to be installed
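The requirements above can be checked before running the script. The following is a minimal sketch, not part of the original script: `missingPackages` is a hypothetical helper name, and R.utils is listed on the assumption that the input is a .gz file, which `fread()` can only read when R.utils is installed.

```r
# hypothetical helper: return those packages from `pkgs` that are not installed
missingPackages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# packages assumed to be needed: data.table for the script itself,
# R.utils so that fread() can read .gz files
needed <- c("data.table", "R.utils")
toInstall <- missingPackages(needed)
if (length(toInstall) > 0) {
  message("Missing packages: ", paste(toInstall, collapse = ", "))
  # install.packages(toInstall) # uncomment to install them
}
```

The installation call is left commented out so that nothing is downloaded without your consent.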
Code
# load in libraries
library(data.table) # handle large datasets
# clear workspace
rm(list = ls())
Let’s get the TRY data.
# set working directory (adapt this!)
# .brd is not defined in this script; replace it with the path to your own base directory
setwd(paste0(.brd, "PlantHub"))
# read in data (adapt this!)
TRY <- fread("TRY_PlantHub.gz")
# select data of interest
TRYSubset <- TRY[TraitName == "Plant nitrogen(N) fixation capacity"]
To get an overview of the data, we convert values to lowercase, sort them, and show them as a table.
# extract original data strings
oriVals <- TRYSubset$OrigValueStr
# change all to lowercase to ease later classification
oriVals <- tolower(oriVals)
# get an overview of the data by counting each value and showing the counts sorted by increasing frequency
valueOverview <- table(oriVals)
valueOverview[order(valueOverview)]
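To see what the overview step does, here is a small sketch on made-up values (they are not taken from TRY). It shows that `table()` lists the distinct values alphabetically, whereas indexing with `order()` on the counts sorts them by frequency, rarest first.

```r
# made-up example values, not taken from TRY
exampleVals <- tolower(c("Yes", "no", "No", "NO", "Frankia", "frankia"))
overview <- table(exampleVals)

# table() lists the distinct values in alphabetical order
names(overview)

# ordering by the counts puts the rarest value first
byFrequency <- overview[order(overview)]
names(byFrequency)
```

Sorting by frequency is handy for cleaning, because rare spellings and the dominant codes end up at opposite ends of the printout.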
There are a couple of coded entries that need to be repaired. The categories corresponding to the coded values are given in the “DataName” field of the data. Let’s first check where the values “true” and “si” (Spanish for “yes”) occur. We then remove purely numeric values.
# repair coded entries
datNames <- names(table(TRYSubset[oriVals == "true"]$DataName))
for (i in seq_along(datNames)) {
  print(datNames[i])
  print(TRYSubset[DataName == datNames[i] & oriVals == "true"][1:2])
  print("--------------------------------------")
}
for (i in seq_along(datNames)) {
  oriVals[TRYSubset$DataName == datNames[i] & oriVals == "true"] <- "n2 fixer"
}
datNames <- names(table(TRYSubset[oriVals == "si"]$DataName))
for (i in seq_along(datNames)) {
  print(datNames[i])
  print(TRYSubset[DataName == datNames[i] & oriVals == "si"][1:2])
  print("--------------------------------------")
}
for (i in seq_along(datNames)) {
  oriVals[TRYSubset$DataName == datNames[i] & oriVals == "si"] <- "n2 fixer"
}
# remove purely numeric values and others that have no lowercase character included
oriVals[!grepl("[[:lower:]]", oriVals)] <- NA
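The effect of this filter can be tried on a few made-up values (not taken from TRY). The sketch below also shows why the values are converted to lowercase beforehand: an uppercase-only entry such as "NO" contains no lowercase character and would be discarded as well.

```r
# made-up example values, not taken from TRY
exampleVals <- c("yes", "1", "2.5", "n2 fixer", "?", "NO")

# TRUE for every value that contains at least one lowercase letter
hasLower <- grepl("[[:lower:]]", exampleVals)

# everything else is set to NA
exampleVals[!hasLower] <- NA
exampleVals
```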
The most important part of the cleaning process is the definition of the search strings to look for. We use regular expressions in some cases to be more inclusive (or exclusive).
searchNames <- c(
  "^no?$|^0$|not an n fixer|non fixer|no-n-fixer|low|none|not n2 fixing",
  "^y(es)?$|^1$|rhizobia|yes, an n fixer|present|frankia|high|^n2-? ?fixing|nostocaceae|^n2?(-| )fixer"
)
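The anchors `^` and `$` are what make a pattern exclusive. As an illustrative sketch on made-up values, compare an anchored pattern with its unanchored counterpart: without anchors, a pattern meant for "no" also matches the fixer entry "nostocaceae".

```r
# made-up example values, not taken from TRY
exampleVals <- c("no", "n", "nostocaceae", "not n2 fixing", "none")

# anchored: only the complete strings "n" and "no" match
anchored <- grepl("^no?$", exampleVals)

# unanchored: every string containing an "n" matches
unanchored <- grepl("no?", exampleVals)

data.frame(exampleVals, anchored, unanchored)
```

Longer, unambiguous phrases such as "not n2 fixing" can stay unanchored, because they are unlikely to occur inside an entry of the other category.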
We can now search for the strings defined before and give names to the new categories.
# search for the strings defined before
searchResults <- sapply(searchNames, grepl, oriVals)
# name the columns of the searchResults matrix after the cleaned categories
colnames(searchResults) <- c(FALSE, TRUE)
Let’s have a look at the results.
# show the number of matches to each category
colSums(searchResults)
# show the original entries for which no match was retrieved
oriVals[rowSums(searchResults) < 1]
# show the number of entries that weren't matched to any category
sum(rowSums(searchResults) < 1)
# show the number of entries that were matched to more than one category
sum(rowSums(searchResults) > 1)
As these categories should be mutually exclusive, we exclude all ambiguous data by setting our search results to FALSE whenever we found more than one match in our search.
searchResults[rowSums(searchResults) > 1, ] <- FALSE
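The following sketch replays the search and the exclusion of ambiguous entries on made-up values. The two patterns in `examplePatterns` are shortened, hypothetical stand-ins for the full search strings, chosen so that one entry matches both categories.

```r
# made-up example values and shortened, hypothetical patterns
exampleVals <- c("no", "yes", "rhizobia: none", "unknown")
examplePatterns <- c("^no?$|none", "^y(es)?$|rhizobia")

# one column per pattern, one row per value
matches <- sapply(examplePatterns, grepl, exampleVals)
colnames(matches) <- c(FALSE, TRUE)

# the third value matches both patterns and is therefore ambiguous
ambiguous <- rowSums(matches) > 1
matches[ambiguous, ] <- FALSE
matches
```

After this step, every row contains at most one TRUE, so each entry belongs to one category or to none.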
Now, we can create new strings with the cleaned values and add them to the observations. To keep the original entries, we create a new column called “CleanedValueStr”.
newVals <- sapply(seq_len(nrow(searchResults)), function(x) {
  paste(colnames(searchResults)[searchResults[x, ]], collapse = ",")
})
newVals[newVals == ""] <- NA
# integrate into TRY
TRY[TraitName == "Plant nitrogen(N) fixation capacity", CleanedValueStr := newVals]
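How the logical match matrix turns into cleaned strings can be followed on a small hand-made matrix. This is only a sketch; `exampleMatches` and `exampleLabels` are illustrative names that do not occur in the script.

```r
# hand-made match matrix: row 1 is a non-fixer, row 2 a fixer, row 3 unmatched
exampleMatches <- matrix(
  c(TRUE, FALSE, FALSE,
    FALSE, TRUE, FALSE),
  ncol = 2,
  dimnames = list(NULL, c("FALSE", "TRUE"))
)

# for each row, paste the names of the columns that are TRUE
exampleLabels <- sapply(seq_len(nrow(exampleMatches)), function(x) {
  paste(colnames(exampleMatches)[exampleMatches[x, ]], collapse = ",")
})

# rows without any match yield an empty string, which becomes NA
exampleLabels[exampleLabels == ""] <- NA
exampleLabels
```

Note that the cleaned values are character strings ("FALSE", "TRUE"), not logical values. If logical values are needed later, `as.logical()` converts them.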
Let’s write the data to a file.
fwrite(TRY, file = paste0("TRY_processed_", Sys.Date(), ".gz"))