Batch pre-processing of categorical data from TRY
This script runs several or all of the TRY pre-processing scripts in one go. The advantage is that the dataset is loaded and saved only once, which considerably speeds up pre-processing. Because this script sources the other pre-processing scripts in their original R versions, you need to download the whole bundle from the TRY main page.
If you have questions, suggestions, spot errors, or want to contribute, get in touch with us through planthub@idiv.de.
Author: David Schellenberger Costa
Requirements
To run the script, the following is needed:
TRY data, available here
R script files for the individual traits, which will be sourced by this script; they are available at the TRY main page
a TRY trait list available here. Click on Go next to “Trait table” and then download the trait table using the Download button.
the data.table package, which may need to be installed
Code
# load in libraries
library(data.table) # handle large datasets
# clear workspace
rm(list = ls())
Let’s get the TRY data.
# set working directory (adapt this!)
setwd(paste0(.brd, "PlantHub"))
# read in data (adapt this!)
TRY <- fread("TRY_PlantHub.gz")
For convenience, the script can harmonize TraitID and TraitName values. If a dataset comes without a TraitName column but with a TraitID column, the script can still select the right pre-processing algorithms. To this end, it uses a table that maps TraitID to TraitName. Refer to the requirements for details. We remove colons (:) from trait names, as they cannot be used in file names and the pre-processing scripts are named after the traits they process.
# get TRY trait overview (adapt this!)
traitIDs <- fread(paste0(.brd, "categorical traits/TRYv6 trait overview.txt"))
traitIDs$Trait <- gsub(":", "", traitIDs$Trait, fixed = TRUE) # there are no colons in filenames
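To illustrate the colon removal, here is a minimal sketch on a made-up trait table. The object name toyTraits and the IDs and trait names in it are invented for the example and are not taken from the TRY trait list. A base R data frame is used so the snippet runs without further packages.

```r
# hypothetical trait overview; IDs and names are invented for illustration
toyTraits <- data.frame(
  TraitID = c(1L, 2L),
  Trait = c("Leaf type: simple or compound", "Plant growth form"),
  stringsAsFactors = FALSE
)

# fixed = TRUE treats ":" as a literal character, not as a regular expression
toyTraits$Trait <- gsub(":", "", toyTraits$Trait, fixed = TRUE)
print(toyTraits$Trait)
```

Names without a colon pass through unchanged, so the step is safe to apply to the whole column.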
Let’s get the pre-processing scripts available.
# set working directory (adapt this!)
setwd(paste0(.brd, "PlantHub/code repository/dataset pre-processing/TRY"))
We will now define the traits we want to be cleaned based on the pre-processing files found.
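# list the pre-processing scripts and strip prefix and extension to get trait names
traitFiles <- list.files(pattern = "^PlantHub TRY preprocessing.*\\.R$")
traitFiles <- sub("PlantHub TRY preprocessing ", "", traitFiles)
traitFiles <- sub("\\.R$", "", traitFiles)
# drop the helper scripts and print the remaining traits
(traitFiles <- traitFiles[!(traitFiles %in% c("main", "template"))])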
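The derivation of trait names from file names can be tried out without any files on disk. The sketch below assumes three hypothetical file names that follow the naming scheme above; toyFiles and toyNames are names introduced only for this example.

```r
# hypothetical file names following the naming scheme of the script bundle
toyFiles <- c(
  "PlantHub TRY preprocessing main.R",
  "PlantHub TRY preprocessing template.R",
  "PlantHub TRY preprocessing Plant growth form.R"
)

# strip the common prefix and the file extension
toyNames <- sub("PlantHub TRY preprocessing ", "", toyFiles, fixed = TRUE)
toyNames <- sub("\\.R$", "", toyNames)

# drop helper scripts that do not correspond to a trait
toyNames <- toyNames[!(toyNames %in% c("main", "template"))]
print(toyNames)
```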
We now need to choose traits to clean by selecting the respective numbers. The default is to clean all traits. Individual scripts will have no effect if they do not find the data they need, i.e. if a trait is not available in the dataset.
traitNums <- c(seq_along(traitFiles)) # change this if you do not want all traits
traitFiles <- traitFiles[traitNums]
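As a small sketch of a partial selection, assume three placeholder trait names; toyTraitFiles and toyTraitNums are hypothetical stand-ins for traitFiles and traitNums.

```r
# placeholder trait names standing in for the contents of traitFiles
toyTraitFiles <- c("trait A", "trait B", "trait C")

# select the first and the third trait instead of all of them
toyTraitNums <- c(1, 3)
toyTraitFiles <- toyTraitFiles[toyTraitNums]
print(toyTraitFiles)
```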
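If the dataset has no TraitName column, the following algorithm looks up the TraitID that belongs to each selected trait. If several entries of the trait list match a trait, they are printed and you are asked to choose one. An invalid choice stops the search.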
if (!("TraitName" %in% colnames(TRY))) {
  traitFileIDs <- rep(NA, length(traitFiles))
  for (i in seq_along(traitFiles)) {
    # trait list entries whose name occurs in the trait name derived from the file
    foundIDs <- traitIDs[sapply(traitIDs$Trait, function(x) grepl(x, traitFiles[i], fixed = TRUE))]
    if (nrow(foundIDs) == 1) {
      traitFileIDs[i] <- foundIDs$TraitID
    } else if (nrow(foundIDs) > 1) {
      print(foundIDs)
      # readline() returns a string, so convert it before using it as an index
      foundID <- suppressWarnings(as.integer(readline(prompt = "Enter number of which to choose: ")))
      if (!(foundID %in% seq_len(nrow(foundIDs)))) {
        break
      } else {
        traitFileIDs[i] <- foundIDs$TraitID[foundID]
      }
    }
  }
}
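The lookup relies on substring matching, which is why more than one entry can match. The following non-interactive sketch shows this on an invented trait list; toyIDs, toyFile, and the IDs are hypothetical. Preferring an exact name match, as done at the end, is an assumption added for illustration and is not part of the script above, which asks the user instead.

```r
# hypothetical trait overview; IDs and names are invented for illustration
toyIDs <- data.frame(
  TraitID = c(101L, 102L, 103L),
  Trait = c("Leaf type", "Leaf", "Plant growth form"),
  stringsAsFactors = FALSE
)
toyFile <- "Leaf type"

# an entry matches if its name occurs literally in the file-derived trait name
isMatch <- vapply(toyIDs$Trait, function(x) grepl(x, toyFile, fixed = TRUE), logical(1))
found <- toyIDs[isMatch, ]
print(found) # "Leaf type" and "Leaf" both match, so a choice is needed

# assumption: resolve the ambiguity by preferring an exact name match
exact <- found[found$Trait == toyFile, ]
chosenID <- if (nrow(exact) == 1) exact$TraitID else NA_integer_
print(chosenID)
```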
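We will now run all selected files, sourcing one after the other. Lines that load libraries, clear the workspace, set the working directory, or read and write the data are commented out first, so that these steps are not repeated for every trait.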
for (ii in seq_along(traitFiles)) {
  print(paste0("Processing ", traitFiles[ii]))
  # read the script as a character vector, one element per line
  fullFile <- scan(paste0("PlantHub TRY preprocessing ", traitFiles[ii], ".R"),
    what = character(), sep = "\n", blank.lines.skip = FALSE, quiet = TRUE
  )
  # comment out write operations
  fullFile[grepl("^(write.table|fwrite)", fullFile)] <- paste0("#", fullFile[grepl("^(write.table|fwrite)", fullFile)])
  # comment out library loading, workspace clearing, directory changes, data loading and "goto" clean-up
  fullFile[grepl("^(library|rm\\(|try\\(setwd|TRY <- |TRY\\[,TraitName := sub\\(\"\\^goto\")", fullFile)] <-
    paste0("#", fullFile[grepl("^(library|rm\\(|try\\(setwd|TRY <- |TRY\\[,TraitName := sub\\(\"\\^goto\")", fullFile)])
  # without a TraitName column, subset by TraitID instead of using the script's own subsetting
  if (!("TraitName" %in% colnames(TRY))) {
    fullFile[grepl("^TRYSubset <- ", fullFile)] <- paste0("#", fullFile[grepl("^TRYSubset <- ", fullFile)])
    TRYSubset <- TRY[TraitID == traitFileIDs[ii]]
  }
  # run the modified script
  source(textConnection(fullFile))
}
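The core technique of the loop, commenting out selected lines of a script and then sourcing the modified text, can be sketched on a tiny in-memory script. The two script lines and the names toyScript, skip, con, and toyResult are made up for this example.

```r
# hypothetical two-line script held in memory instead of being read from disk
toyScript <- c(
  "rm(list = ls())",
  "toyResult <- 1 + 1"
)

# comment out lines starting with rm( so that the workspace is not cleared
skip <- grepl("^rm\\(", toyScript)
toyScript[skip] <- paste0("#", toyScript[skip])

# source the modified lines from a text connection, then close it
con <- textConnection(toyScript)
source(con)
close(con)
print(toyResult)
```

Closing the connection explicitly avoids accumulating open connections when many scripts are processed in a row.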
Some rows were duplicated to accommodate data that belong to other traits. To avoid an unnecessary increase in file size, we remove those duplicated rows that have no value in the “CleanedValueStr” column.
TRY <- TRY[!grepl("^goto", TraitName) | !is.na(CleanedValueStr)]
We have used existing trait names with the prefix “goto” to classify some data. This was done so that the data can eventually be moved to the respective trait without another round of pre-processing. We can now remove the “goto” prefix so that all data belonging to one trait share the same trait name.
TRY[, TraitName := sub("^goto", "", TraitName)]
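The two clean-up steps can be followed on a toy table. This is a minimal sketch assuming three invented rows; the object toy, the trait names, and the values are hypothetical and only mimic the two columns used above.

```r
library(data.table)

# hypothetical rows; trait names and values are invented for illustration
toy <- data.table(
  TraitName = c("Trait A", "gotoTrait B", "gotoTrait B"),
  CleanedValueStr = c("tree", "evergreen", NA)
)

# keep regular rows, and keep "goto" rows only if they carry a cleaned value
toy <- toy[!grepl("^goto", TraitName) | !is.na(CleanedValueStr)]

# strip the prefix so the reassigned rows share the name of their target trait
toy[, TraitName := sub("^goto", "", TraitName)]
print(toy)
```

The order matters: the filter relies on the prefix to identify duplicated rows, so the prefix can only be removed afterwards.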
Let’s write the data to a file.
fwrite(TRY, file = paste0("TRY_processed_", Sys.Date(), ".gz"))