Prepare Fruit type data from TRY for use
The Fruit type data from TRY indicates whether the fruit of a species is dry or fleshy, or which subcategory of these main types it belongs to. It may also describe the opening mechanism of some fruits, or state whether fruits are dehiscent or indehiscent.
If you intend to clean more than one or two traits, we recommend the use of the batch pre-processing script. Refer to the TRY main page for details.
If you have questions, suggestions, spot errors, or want to contribute, get in touch with us through planthub@idiv.de.
Author: David Schellenberger Costa
Requirements
To run the script, the following is needed:
TRY data, available here
the data.table library may need to be installed
Code
# load in libraries
library(data.table) # handle large datasets
# clear workspace
rm(list = ls())
Let’s get the TRY data
# set working directory (adapt this! `.brd` is a base path that is not defined in this script)
setwd(paste0(.brd, "PlantHub"))
# read in data (adapt this!)
TRY <- fread("TRY_PlantHub.gz")
# select data of interest
TRYSubset <- TRY[TraitName == "Fruit type"]
To get an overview of the data, we convert values to lowercase, sort them, and show them as a table.
# extract original data strings
oriVals <- TRYSubset$OrigValueStr # oriVals == original values
# change all to lowercase to ease later classification
oriVals <- tolower(oriVals)
# get an overview of the data by tabulating the values and showing them sorted by frequency
# (table() itself already orders the values alphabetically)
valueOverview <- table(oriVals)
valueOverview[order(valueOverview)]
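As a small illustration of the overview step, the following sketch uses a handful of invented values (not TRY data; `toyVals` and `toyOverview` are hypothetical names). It shows that `table()` returns the values in alphabetical order, while indexing with `order()` on the counts sorts by frequency.

```r
# toy values, invented for illustration only
toyVals <- tolower(c("Berry", "capsule", "berry", "Nut", "berry", "capsule"))
toyOverview <- table(toyVals)

# table() orders the value names alphabetically
names(toyOverview)

# order() applied to the counts sorts by frequency, rarest value first
toyOverview[order(toyOverview)]
```

Sorting by frequency puts the most common spellings at the end of the output, which can help when deciding which search strings matter most.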
Some entries are coded: their value is a numeric 1, and the actual value is stored in the OriglName column.
# decode coded entries (lowercase the decoded strings so that the lowercase search strings can match them)
isCoded <- oriVals %in% "1"
oriVals[isCoded] <- tolower(TRYSubset$OriglName[isCoded])
Values without any letters, such as purely numeric entries, cannot be classified, so we set them to NA.
oriVals[!grepl("[[:lower:]]", oriVals)] <- NA
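The two steps above can be tried on a tiny invented data frame. This is only a sketch: the rows and the names `toySubset`, `toyVals`, and `toyCoded` are made up, and only the column names follow the ones used in this script. The decoded strings are lowercased here so that lowercase search strings can match them.

```r
# toy data, invented for illustration; only the column names mirror the script
toySubset <- data.frame(
  OrigValueStr = c("Capsule", "1", "12.5", "1", "berry"),
  OriglName = c("fruit type", "Fruit: Achene", "fruit type", "Fruit: Drupe", "fruit type"),
  stringsAsFactors = FALSE
)
toyVals <- tolower(toySubset$OrigValueStr)

# coded entries: take the actual value from OriglName
toyCoded <- toyVals == "1"
toyVals[toyCoded] <- tolower(toySubset$OriglName[toyCoded])

# entries without any lowercase letter carry no usable information
toyVals[!grepl("[[:lower:]]", toyVals)] <- NA
toyVals
```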
The most important part of the cleaning process is the definition of the search strings to look for. We use regular expressions in some cases to be more inclusive (or exclusive).
# create a vector containing the search strings to look for
searchNames <- c(
"dry",
"achene",
"capsule",
"^nut",
"aggregate nutlets",
"^follicle",
"aggregate follicles",
"siliqu",
"samara|wing",
"utricle",
"loment",
"pod|legume",
"fleshy",
"^drupe",
"aggregate drupelets",
"berry",
"aggregate berries",
"pome",
"syconium",
"aril",
"sarcotesta",
# opening type
"apocarp",
"^syncarp",
"pseudosyncarp",
"schizocarp",
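# dehiscence
# "dehiscent" at the start of the value or after any character other than "n",
# so that "indehiscent" is not matched here
"(^|[^n])dehiscent",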
"indehiscent"
)
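To see how anchors and negated character classes change what is matched, the sketch below applies a few patterns to invented strings (`toyVals` is a hypothetical name). Note that a pattern of the form `[^n]dehiscent` requires some character before "dehiscent", so it does not match a value that starts with "dehiscent". The variant with `(^|[^n])` is one possible way to cover that case.

```r
# toy strings, invented for illustration only
toyVals <- c("nut", "aggregate nutlets", "dehiscent", "fruit dehiscent", "indehiscent")

# "^nut" only matches at the start of the string
startsWithNut <- grepl("^nut", toyVals)

# "[^n]dehiscent" needs a character other than "n" directly before "dehiscent"
needsPrecedingChar <- grepl("[^n]dehiscent", toyVals)

# "(^|[^n])dehiscent" additionally accepts "dehiscent" at the very start
alsoAtStart <- grepl("(^|[^n])dehiscent", toyVals)

data.frame(toyVals, startsWithNut, needsPrecedingChar, alsoAtStart)
```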
We can now search for the strings defined above and name the resulting columns after the categories they represent. We also prepare a matrix to store the new values in, as they will be separated into several new traits.
# search for the strings defined before
searchResults <- sapply(searchNames, grepl, oriVals)
# name columns of searchResults matrix like corrected searchNames
searchResultsCols <- list()
searchResultsCols[[1]] <- c(
"dry", "achene", "capsule", "nut",
"aggregate nutlets", "follicle", "aggregate follicles", "siliqua",
"samara", "utricle", "loment", "pod",
"fleshy", "drupe", "aggregate drupelets", "berry", "aggregate berries",
"pome", "syconium", "aril", "sarcotesta"
)
searchResultsCols[[2]] <- c("apocarp", "syncarp", "pseudosyncarp", "schizocarp")
searchResultsCols[[3]] <- c("dehiscent", "indehiscent")
colnames(searchResults) <- unlist(searchResultsCols)
# prepare matrix to save new values in
newVals <- matrix(NA, length(oriVals), length(searchResultsCols))
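The search step can be pictured with a minimal sketch on invented values (`toyVals`, `toyPatterns`, and `toyResults` are hypothetical names). `sapply()` calls `grepl()` once per pattern and binds the results into a logical matrix with one row per value and one column per pattern. `grepl()` returns FALSE for NA values.

```r
# toy values and patterns, invented for illustration only
toyVals <- c("nut, dry", "berry", NA, "unknown")
toyPatterns <- c("dry", "^nut", "berry")

# one column per pattern, one row per value
toyResults <- sapply(toyPatterns, grepl, toyVals)

# replace the regular expressions by clean category names
colnames(toyResults) <- c("dry", "nut", "berry")
toyResults

# rows that matched no category at all
unmatchedRows <- which(rowSums(toyResults) < 1)
unmatchedRows
```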
Let’s have a look at the results.
# show the number of matches to each category
colSums(searchResults)
# show the original entries for which no match was retrieved
oriVals[rowSums(searchResults) < 1]
# show the number of entries that weren't matched to any category
sum(rowSums(searchResults) < 1)
# show the number of entries that were matched to more than one category
sum(rowSums(searchResults) > 1)
Some of the categories are mutually exclusive, so we remove ambiguous entries. This applies to the opening type and the dehiscence: if an entry matches more than one value within one of these groups, all matches in that group are discarded. In addition, for entries matched as samara, we discard all matches outside the columns whose names contain “samara”, “carp”, or “dehisc”.
# remove contradictory entries
# only one category possible
for (i in c(2, 3)) {
searchResults[
rowSums(searchResults[, colnames(searchResults) %in% searchResultsCols[[i]]]) > 1,
colnames(searchResults) %in% searchResultsCols[[i]]
] <- FALSE
}
searchResults[
searchResults[, grep("samara", colnames(searchResults))] == TRUE,
!grepl("samara|carp|dehisc", colnames(searchResults))
] <- FALSE
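The handling of contradictory entries can be sketched on a small invented match matrix (`toyResults`, `exclusiveCols`, and `isAmbiguous` are hypothetical names). A row that matches both values of an exclusive group loses all matches in that group, while unambiguous rows are left untouched.

```r
# toy match matrix, invented for illustration (filled column by column)
toyResults <- matrix(
  c(TRUE, FALSE, TRUE,   # column "dehiscent"
    TRUE, TRUE, FALSE),  # column "indehiscent"
  nrow = 3,
  dimnames = list(NULL, c("dehiscent", "indehiscent"))
)

# columns that exclude each other
exclusiveCols <- c("dehiscent", "indehiscent")

# rows with more than one match inside the exclusive group are ambiguous
isAmbiguous <- rowSums(toyResults[, exclusiveCols]) > 1
toyResults[isAmbiguous, exclusiveCols] <- FALSE
toyResults
```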
Before splitting the results by trait, we can build one combined string per observation to inspect the cleaned values. The cleaned values will later be stored in a new column called “CleanedValueStr”, so the original entries are not overwritten.
# preview: concatenate all matched categories per observation
# (stored in a separate vector so that the newVals matrix prepared above is not overwritten)
combinedVals <- sapply(seq_len(nrow(searchResults)), function(x) {
  paste(colnames(searchResults)[searchResults[x, ]], collapse = ",")
})
combinedVals[combinedVals == ""] <- NA
There are also some logical relationships to consider: each specific fruit type implies that the fruit is either dry or fleshy, even if the entry does not say so.
# consider logical relationships
# if not explicitly stated, add fleshy/dry to subcategories
searchResults[
rowSums(searchResults[, grep(
"achene|capsule|nut|follicle|siliqua|samara|utricle|loment|pod",
colnames(searchResults)
)]) > 0 &
rowSums(searchResults[, grep("dry|fleshy", colnames(searchResults))]) < 1,
grep("dry", colnames(searchResults))
] <- TRUE
searchResults[
rowSums(searchResults[, grep(
"drupe|berry|berries|pome|syconium",
colnames(searchResults)
)]) > 0 &
rowSums(searchResults[, grep("dry|fleshy", colnames(searchResults))]) < 1,
grep("fleshy", colnames(searchResults))
] <- TRUE
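The rule above can be sketched on an invented match matrix with just two subtypes (`toyResults` and `noMainType` are hypothetical names). The main type is only added when neither "dry" nor "fleshy" was matched, so an explicitly stated main type is never overridden.

```r
# toy match matrix, invented for illustration only
toyResults <- matrix(
  FALSE, nrow = 3, ncol = 4,
  dimnames = list(NULL, c("dry", "capsule", "fleshy", "berry"))
)
toyResults[1, "capsule"] <- TRUE               # subtype only
toyResults[2, "berry"] <- TRUE                 # subtype only
toyResults[3, c("fleshy", "capsule")] <- TRUE  # main type stated explicitly

# rows in which neither main type was found
noMainType <- rowSums(toyResults[, c("dry", "fleshy")]) < 1

# add the implied main type only where none was stated
toyResults[toyResults[, "capsule"] & noMainType, "dry"] <- TRUE
toyResults[toyResults[, "berry"] & noMainType, "fleshy"] <- TRUE
toyResults
```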
Now we can create the cleaned value strings, one column of the newVals matrix per trait (fruit type, opening type, dehiscence).
# use the searchResults matrix to create new value strings by concatenating the matches of each group
for (i in seq_along(searchResultsCols)) {
searchResultsTemp <- searchResults[, colnames(searchResults) %in% searchResultsCols[[i]], drop = FALSE]
newVals[, i] <- sapply(seq_len(nrow(searchResultsTemp)), function(x) {
paste(searchResultsCols[[i]][searchResultsTemp[x, ]], collapse = ",")
})
}
newVals[newVals == ""] <- NA
We now transfer the data into TRY, keeping in mind that some of it goes to new traits.
# move values to other traits
traitNames <- c("gotoFruit carpel development type", "gotoFruit dehiscence type")
for (i in seq_along(traitNames)) {
if (i > 1) TRY <- rbind(TRY, TRYSubset, fill = TRUE)
TRY[TraitName == "Fruit type", CleanedValueStr := newVals[, i + 1]]
TRY[TraitName == "Fruit type", TraitName := traitNames[i]]
}
Finally, we append a fresh copy of the Fruit type rows and add the cleaned fruit type values to them.
# integrate into TRY
TRY <- rbind(TRY, TRYSubset, fill = TRUE)
TRY[TraitName == "Fruit type", CleanedValueStr := newVals[, 1]]
Because we duplicated the rows to hold the values belonging to the other traits, some of the duplicated rows carry no value. To avoid an unnecessary increase in file size, we remove the “goto” rows that have no value in the “CleanedValueStr” column.
# remove duplicated rows without new data
TRY <- TRY[!grepl("^goto", TraitName) | !is.na(CleanedValueStr)]
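The duplicate, relabel, and filter steps can be followed on a tiny invented table. This is a sketch under assumptions: the rows, the `ObservationID` column, and the names `toyTRY`, `toySubset`, and `toyNewVals` are made up, and only one “goto” trait is used to keep it short.

```r
library(data.table)

# toy table, invented for illustration only
toyTRY <- data.table(
  ObservationID = 1:3,
  TraitName = c("Fruit type", "Fruit type", "Plant height"),
  OrigValueStr = c("dehiscent capsule", "berry", "1.2")
)
toySubset <- toyTRY[TraitName == "Fruit type"]

# hypothetical cleaned values: column 1 = fruit type, column 2 = dehiscence
toyNewVals <- cbind(c("dry,capsule", "fleshy,berry"), c("dehiscent", NA))

# relabel the existing rows as the "goto" trait and attach its values
toyTRY[TraitName == "Fruit type", CleanedValueStr := toyNewVals[, 2]]
toyTRY[TraitName == "Fruit type", TraitName := "gotoFruit dehiscence type"]

# append a fresh copy of the subset for the main trait
toyTRY <- rbind(toyTRY, toySubset, fill = TRUE)
toyTRY[TraitName == "Fruit type", CleanedValueStr := toyNewVals[, 1]]

# drop "goto" rows that received no value
toyTRY <- toyTRY[!grepl("^goto", TraitName) | !is.na(CleanedValueStr)]
toyTRY[, .(ObservationID, TraitName, CleanedValueStr)]
```

In this toy case the berry observation has no dehiscence information, so its “goto” copy is dropped, while rows of unrelated traits are kept unchanged.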
To classify some of the data, we used existing trait names with the prefix “goto”. This was done so that the data can eventually be moved to the respective trait while avoiding another round of pre-processing. Only run the following line if this is the last of the pre-processing scripts you want to use.
# remove "goto" before trait name (when executing several scripts, do this only after the last one
# to avoid duplications and other errors)
TRY[, TraitName := sub("^goto", "", TraitName)]
Let’s write the data to a file.
fwrite(TRY, file = paste0("TRY_processed_", Sys.Date(), ".gz"))