Prepare Plant reproductive phenology timing (flowering time) data from TRY for use
Plant reproductive phenology timing (flowering time) describes the onset, duration, and end of flowering and fruiting. The trait therefore needs to be split into several traits during pre-processing.
If you intend to clean more than one or two traits, we recommend the use of the batch pre-processing script. Refer to the TRY main page for details.
If you have questions, suggestions, spot errors, or want to contribute, get in touch with us through planthub@idiv.de.
Author: David Schellenberger Costa
Requirements
To run the script, the following is needed:
TRY data, available here
the data.table library may need to be installed
Code
# load in libraries
library(data.table) # handle large datasets
# clear workspace
rm(list = ls())
Let’s get the TRY data
# set working directory (adapt this! .brd is a user-defined base path that is not set in this script)
setwd(paste0(.brd, "PlantHub"))
# read in data (adapt this!)
TRY <- fread("TRY_PlantHub.gz")
# select data of interest
TRYSubset <- TRY[TraitName == "Plant reproductive phenology timing (flowering time)"]
To get an overview of the data, we convert values to lowercase, sort them, and show them as a table.
# extract original data strings
oriVals <- TRYSubset$OrigValueStr # oriVals == original values
# change all to lowercase to ease later classification
oriVals <- tolower(oriVals)
# get an overview of the data by tabulating values and showing them sorted by frequency
valueOverview <- table(oriVals)
valueOverview[order(valueOverview)]
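As a minimal illustration on made-up values (not TRY data), `table()` returns counts whose names are in alphabetical order, and indexing with `order()` re-sorts those counts from rarest to most frequent. The object names `toyVals` and `toyOverview` are invented for this sketch.

```r
# toy values standing in for OrigValueStr (invented for illustration)
toyVals <- tolower(c("May", "june", "May", "Jun-Aug", "may", "June"))
# count how often each distinct value occurs
toyOverview <- table(toyVals)
# re-sort the counts by frequency, rarest first
toySorted <- toyOverview[order(toyOverview)]
toySorted
```

Rare values at the top of such a table are often typos or unusual encodings, which is why a frequency sort can be a convenient first look.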
As the data are very heterogeneous, several decoding steps are needed. Some information has to be extracted from the OriglName column, and values have to be converted from character to numeric. Finally, ranges need to be split into start and end where possible. As these changes are performed on the TRYSubset data, oriVals is redefined at the end.
# transfer data with OriglName Flowering period: <English month> into
# start of flowering (month) and end of flowering (month)
# enter this data into first two fields per species, set other fields to NA
# convert months into numeric
for (i in seq_along(month.name)) TRYSubset[DatasetID == 516, OriglName := sub(paste0("Flowering period: ", month.name[i]), sprintf("%02d", i), OriglName)]
# sort data to make sure all observations will come in correct order
TRYSubset[, oriOrder := seq_len(nrow(TRYSubset))]
setorder(TRYSubset, DatasetID, SpeciesName, OriglName)
# get first and last consecutive observation month per species
vals <- TRYSubset[DatasetID == 516]$OrigValueStr
vals <- matrix(vals, 12, length(vals) / 12)
frange <- apply(vals, 2, function(x) which(x == TRUE))
fbegin <- vapply(frange, function(x) x[1], 1)
frange <- sapply(frange, diff)
frange <- sapply(frange, paste, collapse = "")
frange <- vapply(frange, nchar, 0)
fend <- sapply(seq_along(fbegin), function(x) fbegin[x] + frange[x])
# replace original data and OriglName
TRYSubset[DatasetID == 516, OrigValueStr := NA]
TRYSubset[DatasetID == 516 & OriglName == "01", OrigValueStr := fbegin]
TRYSubset[DatasetID == 516 & OriglName == "02", OrigValueStr := fend]
# duplicate values with ranges to return start and end separately
TRYTemp <- TRYSubset[OriglName %in% c("Flowering Period", "Flowering time")]
TRYTemp[, OriglName := "temp"]
TRYTemp[, oriOrder := max(TRYSubset$oriOrder) + seq_len(nrow(TRYTemp))]
TRYSubset <- rbind(TRYSubset, TRYTemp)
# create updated oriVals
oriVals <- TRYSubset$OrigValueStr # oriVals == original values
oriVals <- tolower(oriVals)
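The idea behind deriving a start and an end month from twelve monthly flags can be sketched on a single, invented species. This is a simplified illustration rather than the script's exact procedure: it assumes one uninterrupted flowering period that does not wrap around the end of the year, and all object names are hypothetical.

```r
# twelve monthly flags for one hypothetical species (invented data),
# stored as character strings like OrigValueStr
flags <- c("FALSE", "FALSE", "FALSE", "TRUE", "TRUE", "TRUE",
           "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE")
# positions (= month numbers) with flowering
floweringMonths <- which(flags == "TRUE")
# first and last flowering month
toyBegin <- min(floweringMonths)
toyEnd <- max(floweringMonths)
# for consecutive months, the number of steps between flowering months
# added to the first month gives the same end month
toyEndFromSteps <- toyBegin + length(diff(floweringMonths))
c(begin = toyBegin, end = toyEnd, endFromSteps = toyEndFromSteps)
```

For a period that spans the turn of the year (for example November to February), `min()` and `max()` would return 1 and 12, so such cases would need separate handling.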
To accommodate the different types of data, we need to define the names of the new traits.
# new trait names
traitNames <- c(
"Flowering duration (months)", "Flowering duration (months)",
"Flowering start (month)", "Flowering start (month)",
"Flowering end (month)", "Flowering end (month)",
"Flowering period (season)", "Flowering peak (month)", "Flowering peak (month)",
"Fruiting start (month)", "Fruiting start (season)", "Fruiting end (season)",
"Fruiting peak (season)", "Germination (season)"
)
We also have to define the old category names. They are found in the OriglName column.
# trait categories based on OriglName
# some are joined later on in case there are different units (days -> months)
origlNames <- c(
# Flowering duration (months)
"^(# of Flowering Months|# of Months|FlowDur|flowering duration \\(months\\)|num\\.flring\\.mnths|Reproductive phenology \\(b\\))$",
# Flowering duration (days)
"^(Duration of flowering period \\(visible anthers\\)|Flowering Lasting \\(days\\))$",
# Flowering start (day)
"^(Beginning of flowering period \\(visible anthers\\)|Fdate|onset of flowering|First Flowering Date|Flowering-beginning \\(DOY\\)|FloweringPhenologyStart|FlowStart)$",
# Flowering start (month)
"^(01|bloom\\.start|Flower|Flowering|flowering \\(start\\)|Flowering Period|Flowering Period begin|Flowering time|Flowering time: 1\\. earliest month|Onset of Flowering|onset of flowering period|Reproductive phenology \\(a\\))$",
# Flowering end (day)
"^(Flowering-End \\(DOY\\))$",
# Flowering end (month)
"^(02|end of flowering|FloweringPhenologyEnd|flowering \\(end\\)|Flowering Period end|Flowering time: 2\\. latest month|temp)$",
# Flowering period (season)
"^(Bloom Period|Flowering season|FT, flowering time|JAHRESZEIT_E|start of flowering)$",
# Flowering peak (day)
"^(FlowerDate|Flowering Period \\(Mediterranean Europe\\)|FlrDate)$",
# Flowering peak (month)
"^(Flowering time: 3\\. peak month|Phenology : reproductive)$",
# Fruiting start (day)
"^(Smdate)$",
# Fruiting start (season)
"^(Fruit/Seed Period Begin)$",
# Fruiting end (season)
"^(Fruit/Seed Period End)$",
# Fruiting peak (season)
"SE: main seed release period|Time of seed dispersal \\(season\\)",
# Germination (season)
"GE, main germination period|Time of germination \\(season\\)"
)
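Most of the patterns above are anchored with `^(` and `)$`, whereas the last two are not. The following sketch, using invented OriglName entries and hypothetical pattern variables, shows the difference this makes with `grepl()`: an anchored pattern only matches the complete string, while an unanchored one matches anywhere inside it.

```r
# hypothetical OriglName entries (invented for illustration)
toyNames <- c("FlowStart", "Flowering Period", "Flowering Period begin", "flowering (end)")
# anchored pattern: the whole string must equal one of the alternatives
anchored <- "^(Flowering Period|flowering \\(end\\))$"
# unanchored pattern: a match anywhere in the string is enough
unanchored <- "Flowering Period"
anchoredHits <- grepl(anchored, toyNames)
unanchoredHits <- grepl(unanchored, toyNames)
data.frame(toyNames, anchoredHits, unanchoredHits)
```

Anchoring keeps "Flowering Period" and "Flowering Period begin" apart, which matters here because the two belong to different processing steps.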
Because the categories differ, each one needs its own pre-processing rules. We select the values of each category with the patterns defined above.
# pre-process data
newVals <- rep(NA, length(oriVals))
for (i in seq_along(origlNames)) {
# only work on data with the respective OriglName
tempVals <- oriVals
tempVals[!grepl(origlNames[i], TRYSubset$OriglName)] <- NA
# print(sort(table(tempVals)))
# print(table(TRYSubset[!is.na(tempVals)]$OrigUnitStr))
if (i == 1) {
tempVals[tempVals %in% c("/", "na")] <- ""
} else if (i == 2) {
tempVals <- as.numeric(format(as.Date(as.numeric(tempVals)), "%m"))
} else if (i == 3) {
# convert day to month
tempVals[!grepl("month", TRYSubset$OrigUnitStr)] <- as.numeric(format(as.Date(as.numeric(
tempVals[!grepl("month", TRYSubset$OrigUnitStr)]
)), "%m"))
} else if (i == 4) {
# convert names to numbers, ranges to starts
tempVals[tempVals %in% c("0", "49")] <- ""
tempVals <- sub("mid-", "", tempVals)
tempVals <- sub("(\\-|to ).*", "", tempVals)
tempVals <- sub("^na$", "", tempVals)
tempVals <- sub("`", "", tempVals)
temp <- match(tempVals, tolower(month.abb))
tempVals[!is.na(temp)] <- temp[!is.na(temp)]
temp <- match(tempVals, tolower(month.name))
tempVals[!is.na(temp)] <- temp[!is.na(temp)]
tempVals <- gsub("\\D", "", tempVals)
} else if (i == 5) {
# convert day to month
tempVals[!grepl("month", TRYSubset$OrigUnitStr)] <- as.numeric(format(as.Date(as.numeric(
tempVals[!grepl("month", TRYSubset$OrigUnitStr)]
)), "%m"))
} else if (i == 6) {
# convert names to numbers, ranges to ends
tempVals <- sub(".*(\\-|to)", "", tempVals)
tempVals <- sub("`", "", tempVals)
# trim leading whitespace before matching, so that " june" is recognized as "june"
tempVals <- sub("^\\s+", "", tempVals)
temp <- match(tempVals, tolower(month.abb))
tempVals[!is.na(temp)] <- temp[!is.na(temp)]
temp <- match(tempVals, tolower(month.name))
tempVals[!is.na(temp)] <- temp[!is.na(temp)]
} else if (i == 7) {
tempVals[tempVals == "not available"] <- ""
tempVals[tempVals == "jm"] <- "winter"
tempVals[tempVals == "am"] <- "spring"
tempVals[tempVals == "ja"] <- "summer"
tempVals <- sub("any time with rain", "after rain", tempVals)
tempVals <- sub("all year", "spring, summer, autumn, winter", tempVals)
tempVals <- sub("autumn-spring", "spring, autumn", tempVals)
tempVals <- gsub("((start|end) of )?((early|late) )?|mid\\s*|pre-", "", tempVals)
tempVals <- gsub("fall", "autumn", tempVals)
tempVals <- gsub("indeterminate|irregular", "variable", tempVals)
tempVals <- gsub("midsummer", "summer", tempVals)
tempVals <- gsub("spring-autumn", "spring, summer, autumn", tempVals)
tempVals <- gsub("-", ", ", tempVals)
} else if (i == 8) {
tempVals[tempVals %in% c("", "/")] <- ""
tempVals[grepl("/", tempVals)] <-
as.numeric(format(as.Date(tempVals[grepl("/", tempVals)], format = "%d/%m/%Y"), "%j"))
# convert day to month
tempVals <- as.numeric(format(as.Date(as.numeric(tempVals)), "%m"))
} else if (i == 10) {
# convert day to month
tempVals <- as.numeric(format(as.Date(as.numeric(tempVals)), "%m"))
} else if (i %in% c(11:14)) {
tempVals[tempVals == "not observed"] <- ""
tempVals[tempVals == "wind"] <- ""
tempVals <- sub("any time with rain", "after rain", tempVals)
tempVals <- gsub("fall", "autumn", tempVals)
tempVals <- sub("year round", "spring, summer, autumn, winter", tempVals)
tempVals <- sub("no main period", "variable", tempVals)
}
newVals[!is.na(tempVals)] <- tempVals[!is.na(tempVals)]
}
newVals[newVals == ""] <- NA
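Two of the conversions used in the loop can be tried out in isolation. The sketch below works on invented values and is not the script's exact code: `monthToNumber()` is a hypothetical helper, whitespace is trimmed before matching month names, and the day-of-year conversion passes an explicit `origin` to `as.Date()`. Two facts are worth keeping in mind when comparing it with the loop: calling `as.Date()` on a number without `origin` only works from R 4.3.0 onwards, and it then counts days from 1970-01-01, so that day 1 maps to 2 January. The sketch instead shifts the day of year by one so that day 1 is 1 January.

```r
# hypothetical raw values, already lowercased (invented for illustration)
toyVals <- sub("mid-", "", c("jun-aug", "may to june", "mid-april", "7"))
# start of a range: drop the separator and everything after it
startVals <- trimws(sub("(\\-|to ).*", "", toyVals))
# end of a range: drop everything up to and including the separator
endVals <- trimws(sub(".*(\\-|to )", "", toyVals))

# helper (hypothetical name): month name, abbreviation, or digit string -> integer
monthToNumber <- function(x) {
  out <- suppressWarnings(as.integer(x))
  abbIdx <- match(x, tolower(month.abb))
  nameIdx <- match(x, tolower(month.name))
  out[!is.na(abbIdx)] <- abbIdx[!is.na(abbIdx)]
  out[!is.na(nameIdx)] <- nameIdx[!is.na(nameIdx)]
  out
}
startMonths <- monthToNumber(startVals) # 6 5 4 7
endMonths <- monthToNumber(endVals) # 8 6 4 7

# day of year -> month with an explicit origin, so that day 1 is 1 January
# (2001 is an arbitrary non-leap year chosen for this sketch)
doy <- c(1, 31, 32, 365)
doyMonths <- as.integer(format(as.Date(doy - 1, origin = "2001-01-01"), "%m")) # 1 1 2 12
```

Using a fixed non-leap year ignores the one-day shift that leap years introduce after February, which seems acceptable when only the month is kept.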
We can now integrate the data into TRYSubset and update the trait names at the same time.
# integrate new values and trait names into TRYSubset
TRYSubset[, CleanedValueStr := newVals]
TRYSubset[, TraitName := ""]
for (i in seq_along(traitNames)) TRYSubset[grepl(origlNames[i], OriglName), TraitName := traitNames[i]]
TRYSubset[grepl("\\(months?\\)", TraitName), OrigUnitStr := "month"]
The integration into TRY is a bit tricky here. We have duplicated some rows in TRYSubset, so we need to account for this. We first add as many rows to TRY as we added to TRYSubset (by dropping the first nVals rows of TRYSubset, which are its original rows, and appending the rest). The data in these new rows is then already part of the TRY dataset.
# move values to other traits
setorder(TRYSubset, oriOrder)
TRYSubset[, oriOrder := NULL]
nVals <- sum(TRY$TraitName == "Plant reproductive phenology timing (flowering time)")
TRY <- rbind(TRY, TRYSubset[-seq_len(nVals)], fill = TRUE)
Now, we move the new values, units, and trait names to the original rows (by selecting only the first nVals rows). The new rows and their values have already been added above.
# integrate into TRY
TRY[
TraitName == "Plant reproductive phenology timing (flowering time)",
CleanedValueStr := TRYSubset$CleanedValueStr[seq_len(nVals)]
]
TRY[
TraitName == "Plant reproductive phenology timing (flowering time)",
OrigUnitStr := TRYSubset$OrigUnitStr[seq_len(nVals)]
]
# replace the old trait name with the new trait names (do this last, as the filter relies on the old name)
TRY[
TraitName == "Plant reproductive phenology timing (flowering time)",
TraitName := TRYSubset$TraitName[seq_len(nVals)]
]
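The two-step integration can be followed on a toy table. This is a sketch with invented data and hypothetical object names (`toyTRY`, `toySubset`, `nOld`); it mirrors the order of operations above, in which appended rows already carry their new trait names and the original rows are overwritten in place.

```r
library(data.table)

# toy stand-ins for TRY and TRYSubset (invented data)
toyTRY <- data.table(
  TraitName = c("Leaf area", "Old trait", "Old trait"),
  OrigValueStr = c("12", "jun-aug", "5")
)
toySubset <- toyTRY[TraitName == "Old trait"]
nOld <- nrow(toySubset)
# duplicate the range row so that start and end can be stored separately
toySubset <- rbind(toySubset, toySubset[1])
toySubset[, CleanedValueStr := c("6", "5", "8")]
toySubset[, TraitName := c("Start", "Start", "End")]

# step 1: append only the rows that were added to the subset
toyTRY <- rbind(toyTRY, toySubset[-seq_len(nOld)], fill = TRUE)
# step 2: overwrite the original rows with the first nOld cleaned rows;
# the trait name is updated last because the filter relies on the old name
toyTRY[TraitName == "Old trait", CleanedValueStr := toySubset$CleanedValueStr[seq_len(nOld)]]
toyTRY[TraitName == "Old trait", TraitName := toySubset$TraitName[seq_len(nOld)]]
toyTRY
```

The approach depends on the subset rows being in the same order as the matching rows of the full table, which is why the original order is restored before this step.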
Let’s write the data to a file.
fwrite(TRY, file = paste0("TRY_processed_", Sys.Date(), ".gz"))