Prepare Mycorrhiza type data from TRY for use

The Mycorrhiza type data from TRY describes the type of association a plant has with mycorrhizal fungi.

If you intend to clean more than one or two traits, we recommend using the batch pre-processing script. Refer to the TRY main page for details.

If you have questions, suggestions, spot errors, or want to contribute, get in touch with us through planthub@idiv.de.

Author: David Schellenberger Costa

Requirements

To run the script, the following is needed:

  • TRY data, available here

  • the data.table library may need to be installed

Code

# load in libraries
library(data.table) # handle large datasets

# clear workspace
rm(list = ls())

Let’s get the TRY data.

# set working directory (adapt this!)
setwd(paste0(.brd, "PlantHub"))

# read in data (adapt this!)
TRY <- fread("TRY_PlantHub.gz")

# select data of interest
TRYSubset <- TRY[TraitName == "Mycorrhiza type"]

To get an overview of the data, we convert values to lowercase, sort them, and show them as a table.

# extract original data strings
oriVals <- TRYSubset$OrigValueStr # oriVals == original values

# change all to lowercase to ease later classification
oriVals <- tolower(oriVals)

# get an overview of the data by counting each value and showing the counts in ascending order of frequency
valueOverview <- table(oriVals)
valueOverview[order(valueOverview)]
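As a small illustration of what the overview step does, the sketch below uses a made-up vector (`toyVals` is hypothetical, not TRY data). `table()` on its own lists the values alphabetically, while indexing with `order()` on the counts puts the rarest values first.

```r
# toy values (hypothetical, not taken from TRY)
toyVals <- c("am", "ecto", "am", "no", "am", "ecto")

# count how often each value occurs; names are sorted alphabetically
toyOverview <- table(toyVals)
toyOverview

# reorder by count, rarest value first
toyByCount <- toyOverview[order(toyOverview)]
toyByCount
```

Sorting by count makes rare spellings and one-off codes easy to spot at the top of the output.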

There are a lot of numbers shown here. Some are coded categories, which we translate into labels first. Others are decimal numbers giving the likelihood of specific mycorrhiza states, which we will remove. To do so, we convert all values to NA that have the word “likelihood” in OriglName. We do the same with the remaining purely numeric values (those containing no lowercase character).

# a specific OriglName (also a specific dataset)
oriVals[TRYSubset$OriglName == "nutrient uptake strategy (ectomycorrhizae)" & oriVals == "1"] <- "ecto"

# a specific dataset
oriVals[TRYSubset$DatasetID == 73 & oriVals == "0"] <- "no"
oriVals[TRYSubset$DatasetID == 73 & oriVals == "1"] <- "yes" # sometimes means they have mycorrhiza potentially
oriVals[TRYSubset$DatasetID == 73 & oriVals == "2"] <- "yes"

# remove likelihoods (OriglName contains "likelihood")
oriVals[grepl("likelihood", TRYSubset$OriglName, ignore.case = TRUE)] <- NA

# remove purely numeric values and others that have no lowercase character included
oriVals[!grepl("[[:lower:]]", oriVals)] <- NA
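To see how the last filter behaves, here is a minimal sketch on a made-up vector (`toyVals` is hypothetical). Any string without a lowercase letter, including decimal numbers and values that are already NA, ends up as NA.

```r
# toy values (hypothetical, not taken from TRY)
toyVals <- c("am", "0.85", "2", "ecto", NA, "EM")
toyVals <- tolower(toyVals)

# TRUE where the string contains no lowercase letter
# (grepl() returns FALSE for NA, so NA entries are flagged as well)
noLower <- !grepl("[[:lower:]]", toyVals)
noLower

# set the flagged entries to NA, as in the script above
toyVals[noLower] <- NA
toyVals
```

Because the values were lowercased beforehand, a label such as "EM" survives the filter as "em".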

The most important part of the cleaning process is the definition of the search strings to look for. We use regular expressions in some cases to be more inclusive (or exclusive).

# create a vector containing the search strings to look for
searchNames <- c(
	"non-ectomycorrhizal|yes|unidentified|atypical|other", # yes
	"absent|^n(o|m)?n?$|noinf", # none
	"va|am|arbuscular|ph\\.th\\.end\\.|vam", # arbuscular
	"abtm|arbutoid|e\\.t\\.ect\\.arb\\.|ecto arbut\\.", # arbutoid
	"(^|\\W)em|(^|\\W)endo(,|$)", # endo
	"e\\.ch\\.ect\\.|^e(c|e)m?(\\W|$)|ectendo|ecto(,|$)|^ectomycorrhizal?", # ecto
	"e\\.t\\.ect\\.er\\.|ecto er\\.|endo er\\.|^erm?$|ericoid", # ericoid
	"e\\.t\\.end\\.|orchid|orm|orcidoid", # orchid
	"mono?tropoid", # monotropoid
	"ps\\.end\\.|(^|\\W)ds(\\W|$)|septate", # dark septate endophyte mycorrhiza
	"pyroloid" # pyroloid
)
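The effect of anchors and escaped dots in these patterns can be checked on a few made-up strings before running them on the full data. The sketch below is only an illustration; `toyVals` and the shortened patterns are assumptions chosen for the example, not entries from TRY.

```r
# toy values (hypothetical, not taken from TRY)
toyVals <- c("er", "erm", "fern", "ericoid")

# anchored pattern: only the exact codes "er"/"erm" or the word "ericoid"
anchored <- grepl("^erm?$|ericoid", toyVals)

# unanchored pattern: also matches "fern", which is not a mycorrhiza code
unanchored <- grepl("er", toyVals)

# an unescaped dot matches any character, an escaped dot only a literal "."
dotAny <- grepl("ps.end.", "psxendx")
dotLiteral <- grepl("ps\\.end\\.", "psxendx")

data.frame(toyVals, anchored, unanchored)
c(dotAny = dotAny, dotLiteral = dotLiteral)
```

Anchoring with `^` and `$` makes a pattern more exclusive, while dropping the anchors makes it more inclusive, which is the trade-off behind the search strings above.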

We can now search for the strings defined before and give names to the new categories.

# search for the strings defined before
searchResults <- sapply(searchNames, grepl, oriVals)

# name the columns of the searchResults matrix after the cleaned categories (same order as searchNames)
colnames(searchResults) <- c(
	"mycorrhiza", "no mycorrhiza", "arbuscular mycorrhiza",
	"arbutoid mycorrhiza", "endomycorrhiza", "ectomycorrhiza",
	"ericoid mycorrhiza", "orchid mycorrhiza", "monotropoid mycorrhiza",
	"septate mycorrhiza", "pyroloid mycorrhiza"
)
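The search step yields a logical matrix with one row per value and one column per search string. A minimal sketch with made-up values and three simplified patterns (`toyVals` and `toyPatterns` are hypothetical) shows the shape of the result:

```r
# toy values and simplified patterns (hypothetical, not taken from TRY)
toyVals <- c("am", "ecto, am", "no", NA)
toyPatterns <- c("^no$", "am", "ecto")

# one grepl() call per pattern; results are bound column-wise into a matrix
toyResults <- sapply(toyPatterns, grepl, toyVals)

# replace the pattern strings used as column names with readable categories
colnames(toyResults) <- c("no mycorrhiza", "arbuscular mycorrhiza", "ectomycorrhiza")
toyResults

# number of categories matched per value (0 for NA, 2 for "ecto, am")
toyMatches <- rowSums(toyResults)
toyMatches
```

The category names are assigned by position, so they have to be listed in the same order as the search strings.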

Let’s have a look at the results.

# show the number of matches to each category
colSums(searchResults)

# show the original entries for which no match was retrieved
# oriVals[rowSums(searchResults) < 1] # shows ~18000 NA values
table(oriVals[which(rowSums(searchResults) < 1)]) # all NA in this case

# show the number of entries that weren't matched to any category
sum(rowSums(searchResults) < 1)

# show the number of entries that were matched to more than one category
sum(rowSums(searchResults) > 1)

# check which entries were classified into > 1 groups
table(oriVals[which(rowSums(searchResults) > 1)])

As any type of mycorrhiza falls into the general mycorrhiza category, we would like to see this reflected in our search results. So let’s add a TRUE value in the mycorrhiza column for all search results that found any sub-type of mycorrhiza.

# consider logical relationships
searchResults[
	rowSums(searchResults) > 0 & searchResults[, which(colnames(searchResults) == "no mycorrhiza")] == FALSE,
	which(colnames(searchResults) == "mycorrhiza")
] <- TRUE # general mycorrhiza
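The same rule can be tried on a small hand-made matrix. This is a sketch only: `toyResults` is hypothetical, and the helper vectors `hasMatch` and `isNone` are introduced here for readability and do not appear in the script above.

```r
# toy search results (hypothetical): rows are values, columns are categories
toyResults <- matrix(
  c(FALSE, FALSE, TRUE,   # row 1: only a sub-type matched
    FALSE, TRUE,  FALSE,  # row 2: matched "no mycorrhiza"
    FALSE, FALSE, FALSE), # row 3: nothing matched
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("mycorrhiza", "no mycorrhiza", "ectomycorrhiza"))
)

# rows with at least one match that are not flagged as "no mycorrhiza"
hasMatch <- rowSums(toyResults) > 0
isNone <- toyResults[, "no mycorrhiza"]

# mark those rows as having mycorrhiza in general
toyResults[hasMatch & !isNone, "mycorrhiza"] <- TRUE
toyResults
```

Only the first row receives the general flag. Rows matched as "no mycorrhiza" and rows without any match are left unchanged.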

Now, we can create new strings with the cleaned values and add them to the observations. To keep the original entries, we will create a new column called “CleanedValueStr”.

# use the searchResults matrix to create new value strings by concatenating all data found
newVals <- sapply(seq_len(nrow(searchResults)), function(x) {
	paste(colnames(searchResults)[searchResults[x, ]], collapse = ",")
})
newVals[newVals == ""] <- NA

# integrate into TRY
TRY[TraitName == "Mycorrhiza type", CleanedValueStr := newVals]
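How the concatenation turns a row of logical values into a cleaned string can be seen on a small hand-made matrix (`toyResults` is hypothetical). Rows without any match produce an empty string, which is then replaced by NA.

```r
# toy search results (hypothetical): rows are values, columns are categories
toyResults <- matrix(
  c(TRUE,  FALSE, TRUE,
    FALSE, TRUE,  FALSE,
    FALSE, FALSE, FALSE),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("mycorrhiza", "no mycorrhiza", "ectomycorrhiza"))
)

# for each row, paste the names of all columns that are TRUE
toyNew <- sapply(seq_len(nrow(toyResults)), function(x) {
  paste(colnames(toyResults)[toyResults[x, ]], collapse = ",")
})

# rows without any match give "", which becomes NA
toyNew[toyNew == ""] <- NA
toyNew
```

The new strings are in the same order as the rows they were built from, which is what allows the script above to assign them back to the matching rows of TRY.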

Although not necessary, we may change the trait name.

# add classification into whole plant trait or plant part trait to trait name
TRY[TraitName == "Mycorrhiza type", TraitName := "Plant mycorrhiza type"]

Let’s write the data to a file.

fwrite(TRY, file = paste0("TRY_processed_", Sys.Date(), ".gz"))