Prepare Species habitat characterization data from TRY for use

The Species habitat characterization data from TRY describes the vegetation type in which species were observed. Most of the data comes from Brazil, so vegetation types found there predominate.

If you intend to clean more than one or two traits, we recommend the use of the batch pre-processing script. Refer to the TRY main page for details.

If you have questions or suggestions, spot errors, or want to contribute, get in touch with us at planthub@idiv.de.

Author: David Schellenberger Costa

Requirements

To run the script, the following is needed:

TRY data, available here
the data.table package, which may need to be installed first

Code

# load in libraries
library(data.table) # handle large datasets

# clear workspace
rm(list = ls())

Let’s get the TRY data

# set working directory (adapt this!)
setwd(paste0(.brd, "PlantHub"))

# read in data (adapt this!)
TRY <- fread("TRY_PlantHub.gz")

# select data of interest
TRYSubset <- TRY[TraitName == "Species habitat characterization: vegetation type"]

# repair data from a study where data is in OriglName instead of OrigValueStr
TRYSubset[
	Dataset == "Species able to reproduce after fire in a Brazilian Savanna" & grepl("^\\d+$", OrigValueStr),
	OrigValueStr := OriglName
]
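
The repair step relies on data.table's update by reference: the first argument selects rows, and `:=` overwrites a column only in those rows. A minimal sketch on made-up data (the dataset names and values below are hypothetical, not taken from TRY) may help to see the effect:

```r
library(data.table)

# toy stand-in for TRYSubset; all names and values are invented
toy <- data.table(
	Dataset = c("Study A", "Study A", "Study B"),
	OriglName = c("cerrado", "campo", "habitat"),
	OrigValueStr = c("1", "campo sujo", "2")
)

# overwrite OrigValueStr with OriglName only where the dataset matches
# and OrigValueStr consists of digits only
toy[
	Dataset == "Study A" & grepl("^\\d+$", OrigValueStr),
	OrigValueStr := OriglName
]

print(toy)
```

Only the first row is changed here: the second row belongs to the selected dataset but already holds text, and the third row holds digits but belongs to another dataset.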

To get an overview of the data, we count how often each value occurs and show the counts as a sorted table.

# extract original data string
oriVals <- TRYSubset$OrigValueStr # oriVals == original values

# get an overview of the data by counting each value and showing the counts in increasing order of frequency
valueOverview <- table(oriVals)
valueOverview[order(valueOverview)]
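
As a small illustration of the two possible sort orders, the sketch below uses an invented vector of values. `table()` returns its entries in alphabetical order of the values, whereas indexing with `order()` applied to the counts puts the rarest values first.

```r
# invented example values
vals <- c("mata", "campo", "mata", "cerrado", "mata", "campo")
counts <- table(vals)

# sorted by increasing frequency
byCount <- counts[order(counts)]
print(byCount)

# sorted alphabetically by value
byName <- counts[order(names(counts))]
print(byName)
```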

We need to create a conversion table (a two-column character matrix) that maps the heterogeneous original entries to cleaner ones. The first column of the table contains the original values, the second the cleaned ones.

# split comma-separated entries and collect the unique values
conv <- unique(unlist(strsplit(oriVals, split = ",")))
conv <- sort(conv)
conv <- cbind(conv, conv) # make a copy of the values found to create a conversion table
# remove stray bytes 0x93 and 0x94 (curly quotation marks in Windows-1252, invisible control characters in Latin-1)
conv[, 2] <- gsub("\x93", "", conv[, 2])
conv[, 2] <- gsub("\x94", "", conv[, 2])
# wrong FAO forest classifications
conv[, 2] <- gsub("^(Ant|CL|CR|CV|FCi|FED|FEP|FES|FIg|FOM|FOP|FTF|FV|Man|SAm|VAq|VAR)$", "", conv[, 2])
# trim leading whitespace as well as trailing dots and whitespace
conv[, 2] <- gsub("^\\s+", "", conv[, 2])
conv[, 2] <- gsub("\\s*\\.*\\s*$", "", conv[, 2])
# replace text in parentheses by a space, then drop leftover brackets and question marks
conv[, 2] <- gsub("\\([^\\)]+\\)", " ", conv[, 2])
conv[, 2] <- gsub("\\(|\\[|\\)|\\]|\\?", "", conv[, 2])
# trim trailing whitespace and collapse repeated whitespace
conv[, 2] <- gsub("\\s+$", "", conv[, 2])
conv[, 2] <- gsub("\\s+", " ", conv[, 2])
# convert to lower case and sort the table by cleaned value
conv[, 2] <- tolower(conv[, 2])
conv <- conv[order(conv[, 2]), ]
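
To see what these substitutions do, the following sketch applies them to a few invented strings. The input values are hypothetical, and the list of classification codes is shortened to two of the codes used above.

```r
# invented raw entries
raw <- c("  Cerrado (sensu lato)", "Mata seca?", "FOM", "campo  sujo ...")

clean <- raw
clean <- gsub("^(FOM|FES)$", "", clean)         # drop entries that are only a classification code
clean <- gsub("^\\s+", "", clean)               # trim leading whitespace
clean <- gsub("\\s*\\.*\\s*$", "", clean)       # trim trailing dots and whitespace
clean <- gsub("\\([^\\)]+\\)", " ", clean)      # replace text in parentheses by a space
clean <- gsub("\\(|\\[|\\)|\\]|\\?", "", clean) # drop leftover brackets and question marks
clean <- gsub("\\s+$", "", clean)               # trim trailing whitespace again
clean <- gsub("\\s+", " ", clean)               # collapse repeated whitespace
clean <- tolower(clean)

print(cbind(original = raw, cleaned = clean))
```

The cleaned column should read "cerrado", "mata seca", an empty string, and "campo sujo". Empty strings mark entries that carry no usable information.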

A second problem is that a number of entries contain words that are not fully spelled out. We repair these whenever the extension of a word is unambiguous.

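# positions of the spaces in each cleaned entry (-1 if there are none)
wsNum <- gregexpr(" ", conv[, 2])
for (i in seq_len(nrow(conv) - 1)) {
	if (nchar(conv[i, 2]) > 0) {
		# compare entry i with the entries that follow it in the sorted table
		for (j in (i + 1):nrow(conv)) {
			# stop as soon as the spaces are no longer at the same positions
			if (length(wsNum[[i]]) != length(wsNum[[j]])) {
				break
			} else if (any(wsNum[[i]] != wsNum[[j]])) {
				break
			}
			# entry i is contained in entry j: take over the longer form
			if (grepl(conv[i, 2], conv[j, 2])) {
				conv[i, 2] <- conv[j, 2]
			}
		}
	}
}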
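
The idea behind the loop can be tried out on a short, sorted vector of invented values. This is a simplified variant, not the code above: it compares space positions with `identical()` and matches literally with `fixed = TRUE`, both of which are choices made for this sketch.

```r
# invented cleaned values, sorted alphabetically like the conversion table
vals <- c("camp", "campo", "mata sec", "mata seca", "savana")

# positions of the spaces in each value (-1 if there are none)
spacePos <- gregexpr(" ", vals)

for (i in seq_len(length(vals) - 1)) {
	for (j in (i + 1):length(vals)) {
		# stop at the first later value whose spaces sit at different positions
		if (!identical(as.integer(spacePos[[i]]), as.integer(spacePos[[j]]))) {
			break
		}
		# the shorter value is contained in the longer one: take over the longer form
		if (grepl(vals[i], vals[j], fixed = TRUE)) {
			vals[i] <- vals[j]
		}
	}
}

print(vals)
```

In this example "camp" becomes "campo" and "mata sec" becomes "mata seca", while "savana" is left alone because no longer entry with the same word structure follows it.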

The remaining unclear entries need to be removed or changed. Because most observations are in Portuguese, we also translate the few English terms into Portuguese.

# drop trailing qualifiers such as "lato sensu" or "stricto sensu"
conv[, 2] <- gsub(" (lato|stri(cto)?)( sensu?)?$", "", conv[, 2])
# drop a trailing single character and a trailing "com"
conv[, 2] <- gsub(" \\w$", "", conv[, 2])
conv[, 2] <- gsub(" com$", "", conv[, 2])
# discard entries that are too short or uninformative
conv[nchar(conv[, 2]) < 4, 2] <- ""
conv[grepl("^ambient$", conv[, 2]), 2] <- ""
conv[grepl("^.rea$", conv[, 2]), 2] <- ""
conv[grepl("^t.pico$", conv[, 2]), 2] <- ""
conv[grepl("invasora", conv[, 2]), 2] <- ""
# harmonize terms and translate the English ones into Portuguese
conv[grepl("^savanna$", conv[, 2]), 2] <- "savana"
conv[, 2] <- sub("but prefers open land", "campo", conv[, 2])
conv[, 2] <- sub("grassland", "campo", conv[, 2])
conv[, 2] <- sub("dry forest", "mata seca", conv[, 2])
conv[, 2] <- sub("florest", "mata", conv[, 2])
conv[, 2] <- sub("largely restricted to closed forest", "mata", conv[, 2])
conv[, 2] <- sub("may occur in forests", "mata", conv[, 2])
conv[, 2] <- sub("occurs in forests as well as in open land", "campo e mata", conv[, 2])
conv[, 2] <- sub("prefers forest edges and in clearings", "borda de mata e clareiras", conv[, 2])
conv[, 2] <- sub("salt marsh", "sapal", conv[, 2])
# drop a leading "típica de/da/do "
conv[, 2] <- sub("^t.{1,2}pica d. ", "", conv[, 2])
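
A few of these rules are easiest to understand on examples. The sketch below applies a subset of them to invented values, so the inputs are assumptions and not entries from TRY.

```r
# invented cleaned values
vals <- c(
	"cerrado lato sensu", "cerrado stricto sensu", "campo com",
	"mata d", "sp", "grassland", "dry forest"
)

vals <- gsub(" (lato|stri(cto)?)( sensu?)?$", "", vals) # drop trailing qualifiers
vals <- gsub(" \\w$", "", vals)                         # drop a trailing single character
vals <- gsub(" com$", "", vals)                         # drop a trailing "com"
vals[nchar(vals) < 4] <- ""                             # discard very short entries
vals <- sub("grassland", "campo", vals)                 # translate English terms
vals <- sub("dry forest", "mata seca", vals)

print(vals)
```

Both cerrado variants collapse to "cerrado", "campo com" and "grassland" both end up as "campo", and the two-letter entry is blanked.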

Now we can replace the original values with the cleaned ones using the conversion table. We then add the result to the “CleanedValueStr” column of the TRY data.

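# split each original entry into its comma-separated parts
splStr <- strsplit(oriVals, split = ",")
# look up the cleaned value of each part and paste the results together again
for (i in seq_along(oriVals)) {
	oriVals[i] <- paste(conv[conv[, 1] %in% unlist(splStr[i]), 2], collapse = ",")
}
# remove repeated, leading, and trailing commas left behind by empty cleaned values
oriVals <- gsub(",+", ",", oriVals)
oriVals <- gsub("^,", "", oriVals)
oriVals <- gsub(",$", "", oriVals)
# set entries without any cleaned value to NA
oriVals <- sub("^$", NA, oriVals)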

# integrate into TRY
TRY[TraitName == "Species habitat characterization: vegetation type", CleanedValueStr := oriVals]
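
The lookup can be sketched with a tiny, invented conversion table. As an assumption of this example, `vapply()` replaces the explicit loop and empty results are set to NA by direct indexing. The logic is otherwise the same: split each entry, look up every part, paste the cleaned parts together, and tidy up the commas.

```r
# invented conversion table: column 1 = original values, column 2 = cleaned values
conv <- cbind(
	c("Campo", "Cerrado (s.l.)", "FOM", "Mata"),
	c("campo", "cerrado", "", "mata")
)

# invented original entries, some with several comma-separated values
oriVals <- c("Cerrado (s.l.),Mata", "FOM", "Campo,FOM")

splStr <- strsplit(oriVals, split = ",")
newVals <- vapply(
	splStr,
	function(parts) paste(conv[conv[, 1] %in% parts, 2], collapse = ","),
	character(1)
)

# remove repeated, leading, and trailing commas, then mark empty results as missing
newVals <- gsub(",+", ",", newVals)
newVals <- gsub("^,|,$", "", newVals)
newVals[newVals == ""] <- NA

print(newVals)
```

The first entry becomes "cerrado,mata", the second becomes NA because its only part has no cleaned value, and the third becomes "campo".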

Although not necessary, we may change the trait name.

TRY[
	TraitName == "Species habitat characterization: vegetation type",
	TraitName := "Plant habitat characterization: vegetation type"
]

Let’s write the data to a file.

fwrite(TRY, file = paste0("TRY_processed_", Sys.Date(), ".gz"))
