Prepare Species occurrence range: native vs invasive data from TRY for use

The species occurrence range: native vs invasive data from TRY indicates whether a species is native or exotic within its occurrence range.

If you intend to clean more than one or two traits, we recommend the use of the batch pre-processing script. Refer to the TRY main page for details.

If you have questions, suggestions, spot errors, or want to contribute, get in touch with us through planthub@idiv.de.

Author: David Schellenberger Costa

Requirements

To run the script, the following is needed:

  • TRY data, available here

  • the data.table library may need to be installed

Code

# load in libraries
library(data.table) # handle large datasets

# clear workspace
rm(list = ls())

Let’s get the TRY data

# set working directory (adapt this!)
setwd(paste0(.brd, "PlantHub"))

# read in data (adapt this!)
TRY <- fread("TRY_PlantHub.gz")

# select data of interest
TRYSubset <- TRY[TraitName == "Species occurrence range: native vs invasive"]

To get an overview of the data, we sort the values and show them as a table.

# extract original data string
oriVals <- TRYSubset$OrigValueStr # oriVals == original values

# get an overview of the data by tabulating the values and showing them ordered by increasing frequency
valueOverview <- table(oriVals)
valueOverview[order(valueOverview)]

Some strings are codes that apply to high-level geographic regions in the PLANTS Floristic Area. In these cases, only the flag (N), (I), or (NI) has to be extracted, coding native, invasive, and native/invasive, respectively. Here, any entry carrying an (I) or (NI) flag is treated as invasive.

In other cases, numerical values have to be decoded. This is, however, dataset-specific.

# entries flagged (I), (NI), or (IN) for at least one region are treated as invasive
oriVals[grepl("\\((I|NI|IN)\\)", oriVals)] <- "invasive"
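To illustrate what this pattern catches, here is a minimal sketch on a few made-up strings in the style of PLANTS region codes. The example strings are assumptions for illustration and are not taken from TRY.

```r
# hypothetical PLANTS-style strings (made up for illustration)
exampleVals <- c("L48 (N)", "L48 (I), CAN (N)", "HI (NI)", "native")

# TRUE wherever an (I), (NI), or (IN) flag occurs anywhere in the string
flagged <- grepl("\\((I|NI|IN)\\)", exampleVals)

# overwrite the flagged entries, leave everything else untouched
exampleVals[flagged] <- "invasive"
exampleVals
```

Entries that carry only an (N) flag are left as they are, so they still have to be picked up by a later step.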

# check which entries include 0 or 1 and decode
zeroOnes <- table(TRYSubset$Dataset[TRYSubset$OrigValueStr %in% c("0", "1")])
table(TRYSubset[TRYSubset$Dataset %in% names(zeroOnes)[1]]$DataName)
table(TRYSubset[TRYSubset$Dataset %in% names(zeroOnes)[2]]$DataName)
table(TRYSubset[TRYSubset$Dataset %in% names(zeroOnes)[3]]$DataName)
table(oriVals[TRYSubset$Dataset %in% names(zeroOnes)[1]]) # 0,1 -> native = 1, invasive = 0
table(oriVals[TRYSubset$Dataset %in% names(zeroOnes)[2]]) # 0,1 -> native = 0, invasive = 1
table(oriVals[TRYSubset$Dataset %in% names(zeroOnes)[3]]) # 1,2 -> native = 1, invasive = 2
oriVals[TRYSubset$Dataset %in% names(zeroOnes)[1]] <-
	sub("0", "invasive", oriVals[TRYSubset$Dataset %in% names(zeroOnes)[1]])
oriVals[TRYSubset$Dataset %in% names(zeroOnes)[1]] <-
	sub("1", "native", oriVals[TRYSubset$Dataset %in% names(zeroOnes)[1]])
oriVals[TRYSubset$Dataset %in% names(zeroOnes)[2]] <-
	sub("0", "native", oriVals[TRYSubset$Dataset %in% names(zeroOnes)[2]])
oriVals[TRYSubset$Dataset %in% names(zeroOnes)[2]] <-
	sub("1", "invasive", oriVals[TRYSubset$Dataset %in% names(zeroOnes)[2]])
oriVals[TRYSubset$Dataset %in% names(zeroOnes)[3]] <-
	sub("1", "native", oriVals[TRYSubset$Dataset %in% names(zeroOnes)[3]])
oriVals[TRYSubset$Dataset %in% names(zeroOnes)[3]] <-
	sub("2", "invasive", oriVals[TRYSubset$Dataset %in% names(zeroOnes)[3]])
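The decoding above is written out dataset by dataset. As a possible alternative, the same idea can be sketched with one lookup vector per dataset. The dataset names ("A", "B", "C"), the toy values, and the `codebook` object below are hypothetical and only mirror the three codings described in the comments above.

```r
# hypothetical example data (dataset names and codes are made up)
dataset <- c("A", "A", "B", "B", "C", "C")
vals <- c("0", "1", "0", "1", "1", "2")

# one named vector per dataset: names are the raw codes, values the decoded labels
codebook <- list(
  A = c("0" = "invasive", "1" = "native"),
  B = c("0" = "native", "1" = "invasive"),
  C = c("1" = "native", "2" = "invasive")
)

decoded <- vals
for (ds in names(codebook)) {
  # only touch entries of this dataset whose value is an exact, known code
  idx <- dataset == ds & vals %in% names(codebook[[ds]])
  decoded[idx] <- unname(codebook[[ds]][vals[idx]])
}
decoded
```

Because the lookup matches whole values, a string that merely contains a "0" or "1" would be left unchanged in this sketch.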

The most important part of the cleaning process is the definition of the search strings to look for. We use regular expressions in some cases to be more inclusive (or exclusive).

searchNames <- c(
	"(^| )native|naturally|indigenous|^0$",
	"exotic|invasive|non-native|naturalized|alien|introduced|cultivated|neophyte|archaephyte|anecophytic|^1$"
)
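The prefix `(^| )` in the first search string is what keeps "non-native" from being counted as native: "native" only matches at the start of the string or after a space. A minimal sketch on made-up test strings (not taken from TRY) shows the effect:

```r
# patterns copied from searchNames above
nativePattern <- "(^| )native|naturally|indigenous|^0$"
exoticPattern <- "exotic|invasive|non-native|naturalized|alien|introduced|cultivated|neophyte|archaephyte|anecophytic|^1$"

# made-up test strings for illustration
testStrings <- c("native", "Non-native", "probably native", "10", "0")

isNative <- grepl(nativePattern, testStrings, ignore.case = TRUE)
isExotic <- grepl(exoticPattern, testStrings, ignore.case = TRUE)
data.frame(testStrings, isNative, isExotic)
```

The anchors in `^0$` and `^1$` work the same way: they match only entries that consist of the single digit, so "10" matches neither pattern.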

We can now search for the strings defined before and give names to the new categories. Note that we set the option ignore.case=TRUE in grepl, so that matching is case-insensitive.

# search for the strings defined before
searchResults <- sapply(searchNames, grepl, oriVals, ignore.case = TRUE)

# name columns of searchResults matrix like occurrence types
colnames(searchResults) <- c("native", "exotic")
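To make the shape of the result more concrete, here is a small sketch with toy values and shortened patterns (both made up for illustration). `sapply` calls `grepl` once per pattern and binds the logical vectors into a matrix with one row per entry and one column per pattern.

```r
# toy values and shortened patterns (assumptions for illustration)
toyVals <- c("native", "introduced", "Native and naturalized", "unknown")
toyPatterns <- c("(^| )native|indigenous", "exotic|introduced|naturalized")

# one column per pattern, one row per value
toyResults <- sapply(toyPatterns, grepl, toyVals, ignore.case = TRUE)
colnames(toyResults) <- c("native", "exotic")
toyResults

# number of categories matched per entry: 0 = unmatched, 2 = ambiguous
rowSums(toyResults)
```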

Let’s have a look at the results.

# show the number of matches to each category
colSums(searchResults)

# show the original entries for which no match was retrieved
oriVals[rowSums(searchResults) < 1]

# show the number of entries that weren't matched to any category
sum(rowSums(searchResults) < 1)

# show the number of entries that were matched to more than one category
sum(rowSums(searchResults) > 1)

These categories should be mutually exclusive, so entries that matched both of them have to be resolved.

As all taxa are native somewhere, it only makes sense to record whether a species is exotic or exclusively native. Therefore, whenever an entry matches “exotic”, we set “native” to FALSE.

# wherever "exotic" matched, drop the "native" match
searchResults[searchResults[, "exotic"], "native"] <- FALSE
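The effect of this rule can be checked on a small toy matrix. The values below are made up; the third row stands for an entry that matched both categories.

```r
# toy match matrix (made up): row 3 matched both categories, row 4 none
toyResults <- cbind(
  native = c(TRUE, FALSE, TRUE, FALSE),
  exotic = c(FALSE, TRUE, TRUE, FALSE)
)

# "exotic" takes precedence: remove the "native" match wherever "exotic" is TRUE
toyResults[toyResults[, "exotic"], "native"] <- FALSE

# every entry now has at most one category
rowSums(toyResults)
```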

Now, we can create new strings with the cleaned values and add them to the observations. To keep the original entries, we create a new column called “CleanedValueStr”.

# use the searchResults matrix to create new value strings by concatenating all data found
newVals <- sapply(seq_len(nrow(searchResults)), function(x) {
	paste(colnames(searchResults)[searchResults[x, ]], collapse = ",")
})
newVals[newVals == ""] <- NA

# integrate into TRY
TRY[TraitName == "Species occurrence range: native vs invasive", CleanedValueStr := newVals]
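The assignment relies on the cleaned values being in the same row order as the subset they were derived from. A minimal sketch with a toy table (the trait names and values are made up) shows how `:=` fills only the selected rows and leaves NA everywhere else:

```r
library(data.table)

# toy table (made up): only the "Trait A" rows were cleaned
toy <- data.table(
  TraitName = c("Trait A", "Trait B", "Trait A", "Trait B"),
  OrigValueStr = c("native", "12.5", "alien", "3.1")
)

# one cleaned value per "Trait A" row, in row order
toyNew <- c("native", "exotic")

# add the new column by reference; rows of other traits get NA
toy[TraitName == "Trait A", CleanedValueStr := toyNew]
toy
```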

Although not necessary, we may change the trait name.

TRY[TraitName == "Species occurrence range: native vs invasive", TraitName := "Plant native or exotic"]

Let’s write the data to a file.

fwrite(TRY, file = paste0("TRY_processed_", Sys.Date(), ".gz"))