Prepare Species genotype chromosome ploidy data from TRY for use
The Species genotype chromosome ploidy data from TRY describes the number of genome copies usually found within the cell nuclei.
If you intend to clean more than one or two traits, we recommend the use of the batch pre-processing script. Refer to the TRY main page for details.
If you have questions, suggestions, spot errors, or want to contribute, get in touch with us through planthub@idiv.de.
Author: David Schellenberger Costa
Requirements
To run the script, the following is needed:
TRY data, available here
the data.table library may need to be installed
Code
# load in libraries
library(data.table) # handle large datasets
# clear workspace
rm(list = ls())
Let’s get the TRY data
# set working directory (adapt this!)
setwd(paste0(.brd, "PlantHub"))
# read in data (adapt this!)
TRY <- fread("TRY_PlantHub.gz")
# select data of interest
TRYSubset <- TRY[TraitName == "Species genotype: chromosome ploidy"]
To get an overview of the data, we convert values to lowercase, sort them, and show them as a table.
# extract original data strings
oriVals <- TRYSubset$OrigValueStr # oriVals == original values
# change all to lowercase to ease later classification
oriVals <- tolower(oriVals)
# get an overview of the data by counting each value and showing the counts in order of increasing frequency
valueOverview <- table(oriVals)
valueOverview[order(valueOverview)]
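To illustrate how the overview is ordered, here is a minimal sketch on a made-up vector (`exampleVals` and the derived objects are hypothetical and not part of TRY). `table()` returns its counts with the values sorted alphabetically, whereas indexing with `order()` on the counts sorts them by frequency, so rare and possibly problematic entries come first.

```r
# hypothetical example values, not taken from TRY
exampleVals <- tolower(c("Diploid", "2", "diploid", "tetraploid", "2", "2"))
exampleOverview <- table(exampleVals)
# the names of a table are sorted alphabetically by default
alphabetical <- names(exampleOverview)
# reorder by increasing count, so rare values come first
byFrequency <- exampleOverview[order(exampleOverview)]
print(byFrequency)
```

If an alphabetical listing is preferred, printing the table without the `order()` step is enough.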
As we see, several ploidy levels are given as words with a numeral prefix (for example "diploid" or "tetraploid") rather than as numbers. We need to convert these to numbers.
# decode coded entries
ploidylevels <- c(
"ha", "di", "tri", "tetra", "penta", "hexa",
"hepta", "octo", "novem", "deca", "undeca", "duodeca"
)
for (i in seq_along(ploidylevels)) oriVals[oriVals == paste0(ploidylevels[i], "ploid")] <- i
# remove all remaining letters and hyphens, keeping digits and decimal points
oriVals <- gsub("[a-z\\-]", "", oriVals)
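As a minimal sketch of what the decoding does, the same steps can be run on a handful of made-up entries (`exampleVals` is hypothetical, and the actual strings in TRY may differ). The position of each prefix in `ploidylevels` is the ploidy level it stands for.

```r
# hypothetical entries, already converted to lowercase
exampleVals <- c("diploid", "tetraploid", "2", "3.5", "4x", "unknown")
ploidylevels <- c(
  "ha", "di", "tri", "tetra", "penta", "hexa",
  "hepta", "octo", "novem", "deca", "undeca", "duodeca"
)
# replace e.g. "diploid" by "2": the index of the prefix is the ploidy level
for (i in seq_along(ploidylevels)) {
  exampleVals[exampleVals == paste0(ploidylevels[i], "ploid")] <- i
}
# strip remaining letters and hyphens, leaving digits and decimal points
exampleVals <- gsub("[a-z\\-]", "", exampleVals)
print(exampleVals)
```

Entries without any numeric information, such as "unknown" here, end up as empty strings.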
Nothing else needs to be done apart from rounding, which is required where a ploidy level is the average over individuals of a species with variable ploidy.
# convert the cleaned strings to numbers and round to two decimal places
newVals <- round(as.numeric(oriVals), 2)
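A short sketch of the conversion, using made-up strings (`exampleVals` and `exampleNum` are hypothetical): strings that contain no number, such as an empty string, become `NA`, and averaged ploidy levels are rounded to two decimal places.

```r
# hypothetical cleaned strings, not taken from TRY
exampleVals <- c("2", "4", "3.3333", "")
# empty strings cannot be interpreted as numbers and become NA
exampleNum <- round(as.numeric(exampleVals), 2)
print(exampleNum)
```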
Let’s transfer the data into the original data frame.
# integrate into TRY
TRY[TraitName == "Species genotype: chromosome ploidy", CleanedValueStr := newVals]
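The assignment relies on the filtered rows of `TRY` being in the same order as the values in `newVals`, which holds because both were selected with the same condition. A minimal sketch with a made-up table (`exampleTRY`, `exampleNew`, and their contents are hypothetical) shows that `:=` fills only the matching rows and leaves the others as `NA`, assuming the `CleanedValueStr` column does not exist beforehand.

```r
library(data.table)
# hypothetical miniature version of the TRY table
exampleTRY <- data.table(
  TraitName = c(
    "Species genotype: chromosome ploidy", "Plant height",
    "Species genotype: chromosome ploidy"
  ),
  OrigValueStr = c("diploid", "1.2", "4")
)
# cleaned values, in the order of the matching rows
exampleNew <- c(2, 4)
# assign by reference to the matching rows only
exampleTRY[TraitName == "Species genotype: chromosome ploidy", CleanedValueStr := exampleNew]
print(exampleTRY)
```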
Let’s write the data to a file.
# write data
fwrite(TRY, file = paste0("TRY_processed_", Sys.Date(), ".gz"))