sPlot - A global database on vegetation surveys

sPlot is a large collection of vegetation surveys with global scope. It exists because scientists across the world collaborate and share their data with each other. Many of the individual datasets in sPlot are open access and have been released as sPlotOpen. This pre-processing script works on sPlotOpen v2.0. It does the following:

  • add WGS84 sampling locations and TDWG4 zones to each occurrence record

  • remove typos

If you have questions, suggestions, spot errors, or want to contribute, get in touch with us through planthub@idiv.de.

Author: David Schellenberger Costa

Requirements

To run the script, the following is needed:

  • a world map shapefile of the TDWG4 regions, available here

  • sPlot data, available here

  • some R libraries that may need to be installed
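As a minimal sketch, the following lists which of the packages loaded in the Code section are not installed yet. The vector `pkgs` simply mirrors the `library()` calls of the script, and the variable names are chosen here for illustration. Note that rgdal and rgeos have been archived on CRAN, so `install.packages()` may not find them in the default repository.

```r
# packages used by the script below
pkgs <- c("data.table", "rgdal", "rgeos")

# names of packages that are not installed yet
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
print(missing_pkgs)

# uncomment to try installing what is missing
# install.packages(missing_pkgs)
```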

Code

# load in libraries
library(data.table) # handle large datasets
library(rgdal) # handle shapefiles
library(rgeos) # distance calculation for geodata

# clear workspace
rm(list = ls())

Let’s get the world map with TDWG regions (botanical regions). We will also repair a small error in one of the level-4 region codes.

# set working directory (adapt this! `.brd` is a base directory path that is not defined in this script)
setwd(paste0(.brd, "taxonomy/TDWG"))

# read in TDWG regions
wm4 <- readOGR("level4.shp")

# wrong name in Level4_cod
wm4@data$Level4_cod <- sub("AGE-CO", "AGE-CD", wm4@data$Level4_cod)

# set projection
proj4string(wm4) <- CRS("+init=epsg:4326")

Now we need the sPlot data. There are two files, one with the plant occurrences in the plots and one with the plot locations.

# set working directory (adapt this!)
setwd(paste0(.brd, "PlantHub"))

# read in sPlot data
splot <- fread("sPlotOpen_3474_52/3474_52_sPlotOpen_DT(1).txt") # vegetation surveys
splotLoc <- fread("sPlotOpen_3474_52/3474_54_sPlotOpen_header(2).txt") # plot locations

We will use the plot locations to add the sampling location coordinates and TDWG4 zone to each occurrence record.

# add sampling location coordinates
setkey(splotLoc, PlotObservationID)
splotCoords <- splotLoc[J(splot$PlotObservationID)]

# write the plot coordinates into the occurrence table
splot[, Latitude_WGS84 := splotCoords$Latitude]
splot[, Longitude_WGS84 := splotCoords$Longitude]
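The keyed lookup above can be tried on a small example. The tables below are made up; only the column names follow the script. The lookup returns one header row per occurrence row, in the order of the occurrence table, which is what allows the coordinates to be assigned column-wise afterwards. This relies on each `PlotObservationID` appearing only once in the header table; IDs without a match would yield `NA` coordinates.

```r
library(data.table)

# toy occurrence table: one row per species record (made-up values)
occ <- data.table(
  PlotObservationID = c(3L, 1L, 3L, 2L),
  Species = c("Poa annua", "Bellis perennis", "Lolium perenne", "Poa annua")
)

# toy header table: one row per plot (made-up coordinates)
hdr <- data.table(
  PlotObservationID = 1:3,
  Latitude = c(51.3, 48.1, 52.5),
  Longitude = c(12.4, 11.6, 13.4)
)

# keyed lookup: one header row per occurrence row, in the order of occ
setkey(hdr, PlotObservationID)
coords <- hdr[J(occ$PlotObservationID)]

# assign the looked-up coordinates to the occurrence table
occ[, Latitude_WGS84 := coords$Latitude]
occ[, Longitude_WGS84 := coords$Longitude]
print(occ)
```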

Getting the TDWG zones is a little harder. We will first get the TDWG4 zones for coordinates that fall directly into a TDWG4 zone, and then search for the nearest TDWG4 zone for coordinates that do not fall into any zone. Such unmatched coordinates occur because the polygons defining the TDWG4 zones have a limited resolution and may fail to include plots at coastlines.

coordinates(splotCoords) <- ~ Longitude + Latitude
proj4string(splotCoords) <- CRS("+init=epsg:4326")
pm <- over(splotCoords, wm4) # perfect matches
splot[, TDWG4 := pm$Level4_cod]
nm <- splotCoords[which(is.na(pm$Level4_cod)), ] # no matches
nm <- remove.duplicates(nm)
dists <- gDistance(nm, wm4, byid = TRUE) # distance to get closest matches (coastline and data issues)
cm <- apply(dists, 2, function(x) order(x)[1]) # closest matches: row position of the nearest zone for each point (row names are feature IDs starting at 0)
cm <- wm4@data$Level4_cod[cm]
for (i in seq_len(nrow(nm@coords))) {
	splot[Longitude_WGS84 == nm@coords[i, 1] & Latitude_WGS84 == nm@coords[i, 2], TDWG4 := cm[i]]
}
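The two-step matching above (exact overlay first, nearest zone as a fallback) can also be sketched with the sf package. This is not the method used in this script, which relies on sp, rgdal and rgeos; it is an illustrative alternative under the assumption that sf is installed. The zone codes, the helper `sq()` and all coordinates are made up.

```r
library(sf)

# hypothetical helper: a unit square with its lower-left corner at (x0, 0)
sq <- function(x0) {
  st_polygon(list(rbind(
    c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0)
  )))
}

# two toy zones with made-up codes, separated by a gap
zones <- st_sf(
  code = c("AAA-OO", "BBB-OO"),
  geometry = st_sfc(sq(0), sq(2), crs = 4326)
)

# three toy points: two inside a zone, one in the gap between the zones
pts <- st_as_sf(
  data.frame(lon = c(0.5, 2.5, 1.2), lat = c(0.5, 0.5, 0.5)),
  coords = c("lon", "lat"), crs = 4326
)

# step 1: zone code for points that fall directly into a zone (NA otherwise)
hit <- st_join(pts, zones, join = st_intersects)$code

# step 2: for unmatched points, take the code of the nearest zone
miss <- is.na(hit)
hit[miss] <- zones$code[st_nearest_feature(pts[miss, ], zones)]
print(hit)
```

The third point lies closer to the first square than to the second, so it receives the code of the first zone.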

Changing column names is not necessary, but may be convenient.

# change column names
setnames(splot, "Species", "AccSpeciesName")

There are some remaining typos in taxon names and some inconsistencies in hybrid markers that will be removed in the last steps. We also add a genus column and sort the columns. Finally, we remove some data on algae and scan the taxon names for remaining (erroneous) special characters.

# correct typos in family names
splot[, AccSpeciesName := gsub("Convululaceae", "Convolvulaceae", AccSpeciesName)]
splot[, AccSpeciesName := gsub("Laminaceae", "Lamiaceae", AccSpeciesName)]

# fill in missing AccSpeciesName from Original_species, then remove entries that still lack a name
splot[AccSpeciesName == "", AccSpeciesName := Original_species] # fill where AccSpeciesName == ""
# fill where AccSpeciesName is NA, unless Original_species contains digits
splot[is.na(AccSpeciesName) & !grepl("\\d", Original_species), AccSpeciesName := Original_species]
splot[is.na(AccSpeciesName) | AccSpeciesName == ""] # inspect entries that still lack a name
splot <- splot[!is.na(AccSpeciesName) & AccSpeciesName != ""]

# temporarily remove genus hybrid sign
hybrids <- grepl("^x ", splot$AccSpeciesName)
splot[, AccSpeciesName := sub("^x ", " ", AccSpeciesName)]

# standardize species hybrid signs
splot[, AccSpeciesName := sub("\xc3\x97", " x ", AccSpeciesName)]

# repair an error in the data (only entries that end after the hybrid sign, so complete names stay untouched)
splot[, AccSpeciesName := sub("Platanus\\s+x\\s*$", "Platanus x hispanica", AccSpeciesName)]

# remove unnecessary whitespaces
splot[, AccSpeciesName := gsub("(^\\s+|\\s+$)", "", AccSpeciesName)]
splot[, AccSpeciesName := gsub("\\s+", " ", AccSpeciesName)]

# make lower- and uppercase plant names normal case
splot[, AccSpeciesName := tolower(AccSpeciesName)]
splot[, AccSpeciesName := paste0(toupper(substr(AccSpeciesName, 1, 1)), sub("^.", "", AccSpeciesName))]

# add plant genus column
splot[, AccGenus := gsub(" .*", "", AccSpeciesName)]

# add back hybrid sign
splot[hybrids, AccGenus := paste0("x ", AccGenus)]
splot[hybrids, AccSpeciesName := paste0("x ", AccSpeciesName)]

# check for problems in entries, i.e. wrong structure or special characters
splot[
	!grepl("^(x )?[A-Z]+[a-z]+(\\s+[a-z'\\-]+(\\s+[a-z]+\\.?\\s+[a-z'\\-]+)?)?", AccSpeciesName),
	c("AccSpeciesName", "Original_species")
]

# remove entries with "algae in name
splot <- splot[!grepl("\"algae", AccSpeciesName)]

# sort columns
mainCols <- c("PlotObservationID", "AccGenus", "AccSpeciesName", "Latitude_WGS84", "Longitude_WGS84", "TDWG4")
setcolorder(splot, c(mainCols, setdiff(colnames(splot), mainCols)))
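To see what the name clean-up steps above do, here is a minimal sketch that applies the same sequence to a plain character vector. The three raw names are made up to cover a genus hybrid, a species hybrid written with the multiplication sign, and a name with stray whitespace and mixed case. The variable names are chosen for this example only.

```r
# made-up raw names covering the cases handled above
raw <- c("x Festulolium  loliaceum", "SALIX \u00d7 rubens", "  poa   ANNUA ")

# remember and temporarily strip the leading genus hybrid sign
is_genus_hybrid <- grepl("^x ", raw)
clean <- sub("^x ", "", raw)

# replace the multiplication sign with a plain " x "
clean <- sub("\u00d7", " x ", clean, fixed = TRUE)

# trim leading/trailing whitespace and collapse repeated whitespace
clean <- gsub("\\s+", " ", trimws(clean))

# normal case: first letter uppercase, everything else lowercase
clean <- paste0(toupper(substr(clean, 1, 1)), tolower(substring(clean, 2)))

# the genus is everything before the first space
genus <- sub(" .*", "", clean)

# add back the genus hybrid sign
clean[is_genus_hybrid] <- paste0("x ", clean[is_genus_hybrid])
genus[is_genus_hybrid] <- paste0("x ", genus[is_genus_hybrid])

print(data.frame(raw, clean, genus))
```

Stripping the genus hybrid sign before changing the case keeps the leading "x" from being capitalized and from being mistaken for the genus.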

Now we can write the pre-processed sPlot data.

fwrite(splot, file = paste0("sPlot_processed_", Sys.Date(), ".gz"))
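Because the file name ends in ".gz", `fwrite()` compresses the output with gzip (its default `compress = "auto"` behaviour); the content is comma-separated text. A minimal round-trip sketch with a made-up table and a temporary file, reading the result back through a base R gzip connection:

```r
library(data.table)

# made-up table standing in for the processed data
toy <- data.table(
  PlotObservationID = 1:2,
  AccSpeciesName = c("Poa annua", "Salix x rubens")
)

# a file name ending in ".gz" makes fwrite() gzip the output
f <- tempfile(fileext = ".csv.gz")
fwrite(toy, file = f)

# read the compressed file back through a gzip connection
back <- read.csv(gzfile(f), stringsAsFactors = FALSE)
print(back)
```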