Taxonomic name parsing - Workflow library

This notebook is intended to show the functionality of a taxonomic name parser. Name parsing describes the identification of string components as parts of scientific names. In general, these are

generic name/genus,
specific name/epithet,
infraspecies markers (subsp./var./f.),
infraspecific name/epithet,
authors.

Additional elements, such as hybrid signs and a number of other markers such as cv. or agg., spelling errors, and custom information in the names, e.g., Bellis perennis_plot123, can complicate the process of name parsing. In this notebook, the taxonomic name parser from GBIF will be used. However, it has some limitations, and in the hands-on part, we will attempt to overcome those by pre-processing the data before sending it to the name parser. We will also try to speed up the name parsing process by using the parallel processing functionality of R.
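To make the listed components concrete, the following toy function splits a clean, whitespace-separated name into these parts. It is only a sketch under strong assumptions (canonical token order, no hybrids, no spelling errors) and is not the GBIF parser; the function name parse_simple and the infraspecific example string are made up for illustration.

```r
# Toy decomposition of a well-formed name; NOT the GBIF parser.
# parse_simple is a hypothetical helper introduced only for this sketch.
parse_simple <- function(name) {
  tokens <- strsplit(trimws(name), "\\s+")[[1]]
  markers <- c("subsp.", "var.", "f.")
  # an infraspecific part needs a marker in third position plus an epithet
  has_infra <- length(tokens) >= 4 && tokens[3] %in% markers
  n_used <- if (has_infra) 4 else 2
  list(
    genus = tokens[1],
    epithet = tokens[2],
    marker = if (has_infra) tokens[3] else NA_character_,
    infra_epithet = if (has_infra) tokens[4] else NA_character_,
    # everything after the name parts is treated as the author string
    authors = paste(tokens[-seq_len(n_used)], collapse = " ")
  )
}

parse_simple("Bellis perennis L.")$authors                  # "L."
# made-up infraspecific name and author, for illustration only
parse_simple("Bellis perennis var. exampleana Auth.")$marker # "var."
# custom information ends up inside the epithet
parse_simple("Bellis perennis_plot123")$epithet              # "perennis_plot123"
```

The last call shows why custom suffixes are a problem: a purely positional approach cannot tell them apart from the epithet, which is the kind of case the pre-processing is meant to handle.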

Prerequisites

To run the code presented here, you will need

the sample names list provided in the workshop,
a functioning R environment and
the R packages data.table, rgbif, and doParallel installed.

Code

The first block of code loads libraries and prepares the workspace. You will need to adapt the working directory.

# load packages
library(data.table) # handle large datasets
library(rgbif) # access GBIF data
library(doParallel) # parallel computing

# clear workspace
rm(list = ls())

# set working directory (adapt the path; .brd is assumed to be a base directory defined outside this notebook)
setwd(paste0(.brd, "gfoe NFDI taxonomic harmonization workshop"))

# load data
plants <- fread("plant names_2024-04-08.txt", sep = "\t")
animals <- fread("animal names_2024-04-09.txt", sep = "\t")

Loading required package: foreach

Loading required package: iterators

Loading required package: snow

Both the plants and animals variables are tables with one column. The names in these tables differ in format: most notably, some include authors while others do not. To get the best results when doing name harmonization later on, we will need to separate the authors and also remove problematic characters from the data.

Encoding

Unfortunately, when getting data from differing sources, we will often find that these data have been encoded in different ways. This means that while the typical English language letters will be stored the same way on any machine, when it comes to accents and some other special characters, it may matter whether data was stored by a computer in the US or Japan, and whether the computer has a Windows, Mac, or Linux operating system.

We will deal with the most common case: data stored in the Windows-specific CP-1252 encoding (sometimes mislabeled as ANSI or latin1) rather than in UTF-8.

How your machine treats data from different encodings depends on what encoding is preset in your console. You can check this using the following:

Sys.getlocale()
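As a complement to reading the locale string by eye, base R's l10n_info() reports directly whether the current session runs in a UTF-8 locale. This is a small sketch; the variable name utf8_session is chosen here for illustration.

```r
# l10n_info() is part of base R and describes the current locale
info <- l10n_info()
utf8_session <- isTRUE(info[["UTF-8"]])

if (utf8_session) {
  message("This session uses a UTF-8 locale.")
} else {
  message("This session does not use UTF-8; consider changing the locale.")
}
```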

If your console does not use UTF-8 (no matter the language), you may change it like this (the locale name shown is Windows-style; on Linux or macOS the equivalent would be "de_DE.UTF-8"):

Sys.setlocale(category = "LC_ALL", locale = "German_Germany.utf8")

You can use another encoding, too, but it may throw errors later on. So let’s check whether the data comes in UTF-8, and if not, let’s repair it, assuming it is CP-1252 (our best guess, likely correct in 99% of the cases).

# check whether the names are valid UTF-8
table(validUTF8(plants$oldName))
table(validUTF8(animals$modName))


FALSE  TRUE 
   73  4927


TRUE 
5000

# create new columns for variables
plants[, newName := oldName]
animals[, newName := modName]
# correct encoding, assuming current encoding is CP-1252
plants[!validUTF8(newName), newName := iconv(newName, from = "CP1252", to = "UTF-8")]
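To see why validUTF8() flags CP-1252 data and what the iconv() repair does, here is a minimal, self-contained sketch. The byte sequence is constructed by hand to mimic a CP-1252 encoded author name and is not taken from the workshop files.

```r
# "Köhler" as CP-1252 bytes: the ö is the single byte 0xF6,
# which is not a valid UTF-8 sequence on its own
x <- rawToChar(as.raw(c(0x4b, 0xf6, 0x68, 0x6c, 0x65, 0x72)))
validUTF8(x)   # FALSE

# re-encode, assuming the source really is CP-1252
y <- iconv(x, from = "CP1252", to = "UTF-8")
validUTF8(y)   # TRUE

# after conversion, ö is the code point U+00F6 (decimal 246)
utf8ToInt(y)
```

Strings that are already valid UTF-8 must not be converted a second time, which is why the code above restricts the repair to rows where validUTF8() is FALSE.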

Name parsing

Let’s try to parse the names using the GBIF name parser.

resP <- data.table(name_parse(plants$newName))
resA <- data.table(name_parse(animals$newName))

table(resP$parsed)
table(resA$parsed)


FALSE  TRUE 
   11  4989


FALSE  TRUE 
   17  4983

That looks like a pretty good result. For plants and animals, we got all but 11 and 17 names parsed, respectively. Let’s look at what did not work for animals.

resA[parsed == FALSE]

Pre-processing

The problem with most of these names is the combination of letters and numbers before the actual name. These prefixes need to be removed before using the name parser. As they seem to consist of one to three uppercase characters or digits followed by an underscore, repeated twice, we may find them as shown below. Note that it is essential to use regular expressions, which can be used to create target patterns to search for. Regular expressions are more or less the same across programming languages. Some information specifically on R can be found in the help page ?regex.

animals[grepl("^([[:upper:]]|\\d){1,3}_([[:upper:]]|\\d){1,3}", newName), "newName"]

Removing such a sequence could be done more or less like this.

# create a new variable to not overwrite the original data
animals[, testName := newName]

# remove the name sequences
animals[, testName := sub("^([[:upper:]]|\\d){1,3}_([[:upper:]]|\\d){1,3}", "", testName)]

# check whether it worked
animals[testName != newName, c("newName", "testName")]
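When a substitution does not behave as intended, it helps to look at exactly which part of the string the pattern consumes. The sketch below does this with regmatches() and regexpr() from base R. The test strings are made up and only assume the prefix format described above; they are not taken from the workshop data.

```r
# made-up test strings: one with a two-part prefix, one clean name
x <- c("A1_B22_Canis lupus", "Canis lupus")
pattern <- "^([[:upper:]]|\\d){1,3}_([[:upper:]]|\\d){1,3}"

# the portion of each string that the pattern actually matches
# (strings without a match are dropped from the result)
matched <- regmatches(x, regexpr(pattern, x))
matched

# what remains after removing that portion
stripped <- sub(pattern, "", x)
stripped
```

Comparing matched and stripped side by side shows which characters of the prefix are left over, which is a useful starting point for adjusting the pattern.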

TASKS:
Try to fix the code so that it gives the desired result.
To increase the accuracy of later matching, look for these combinations of uppercase letters and numbers also in the species epithet.
Then, try to fix the problems with the other unparsed names in the animal and plant names.
There may also be some generic terms you may want to remove (e.g. spec., spp., agg.).
Some useful functions can be found below.

# check for a number after the genus name, but before the year
animals[grepl("^\\S+\\s.*\\d.*\\s\\d{4}$", newName), "newName"][1:3]
resA[grepl("^\\S+\\s.*\\d.*\\s\\d{4}$", animals$newName)][1:3]

# check for spec., species, morpho, spp.
animals[grepl("spec\\.|species|morpho|spp\\.", newName), "newName"][1:3]
resA[grepl("spec\\.|species|morpho|spp\\.", animals$newName)][1:3]

# find name parts after an equal sign
plants[grepl("=", newName), "newName"][1:3]
resP[grepl("=", plants$newName)][1:3]
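The first pattern above is fairly dense, so here is how it reads piece by piece, tried on two toy strings. The first string is a deliberately corrupted, made-up variant of a name from the animal list; the second is the clean form.

```r
# ^\\S+     first token (the genus)
# \\s       one whitespace character
# .*\\d.*   anything containing at least one digit
# \\s\\d{4}$ whitespace plus a four-digit year at the very end
pattern <- "^\\S+\\s.*\\d.*\\s\\d{4}$"

x <- c(
  "Abronia 12 anzuetoi Campbell & Frost, 1993",  # made-up: stray number after the genus
  "Abronia anzuetoi Campbell & Frost, 1993"      # clean name
)
hits <- grepl(pattern, x)
hits   # TRUE FALSE
```

The clean name is not flagged because its only digits belong to the year itself, and the pattern requires an additional digit before the final whitespace.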

You can compare your results to the ones provided below. For plants, they were created using a name parser developed specifically for the names of the TRY database. For animals, the file is an extract of the CITES names list; it contains the unmodified correct names from which the erroneous names used here were derived.

# for testing
plantsFull <- fread("plant names full_2024-04-08.txt", sep = ",")
animalsFull <- fread("animal names full_2024-04-09.txt", sep = ",")

str(plantsFull)
str(animalsFull)

Classes 'data.table' and 'data.frame':	5000 obs. of  18 variables:
 $ oldName         : chr  "" "(lauraceae) pubescente" "?Betulaceae sp." "Abarema curvicarpa" ...
 $ newName         : chr  "" "" "" "" ...
 $ familyNameFound : logi  FALSE TRUE TRUE FALSE FALSE FALSE ...
 $ oldFamilyName   : chr  "" "lauraceae" "Betulaceae" "" ...
 $ newFamilyName   : chr  "" "Lauraceae" "Betulaceae" "" ...
 $ genus           : chr  "" "Pubescente" "" "Abarema" ...
 $ hybrid1         : chr  "" "" "" "" ...
 $ species1        : chr  "" "" "" "curvicarpa" ...
 $ subSpeciesFlag  : chr  "" "" "" "" ...
 $ subSpecies      : chr  "" "" "" "" ...
 $ varSpeciesFlag  : chr  "" "" "" "" ...
 $ varSpecies      : chr  "" "" "" "" ...
 $ formaSpeciesFlag: logi  NA NA NA NA NA NA ...
 $ formaSpecies    : logi  NA NA NA NA NA NA ...
 $ hybrid2         : chr  "" "" "" "" ...
 $ species2        : chr  "" "" "" "" ...
 $ author          : chr  "" "" "" "" ...
 $ kingdom         : chr  "" "" "" "P" ...
 - attr(*, ".internal.selfref")=<externalptr> 
Classes 'data.table' and 'data.frame':	5000 obs. of  55 variables:
 $ TaxonId                    : int  2581 3734 1703 68243 68179 68198 68076 68212 68150 68149 ...
 $ Kingdom                    : chr  "Animalia" "Animalia" "Animalia" "Animalia" ...
 $ Phylum                     : chr  "Chordata" "Chordata" "Chordata" "Chordata" ...
 $ Class                      : chr  "Aves" "Aves" "Reptilia" "Reptilia" ...
 $ Order                      : chr  "Apodiformes" "Apodiformes" "Sauria" "Sauria" ...
 $ Family                     : chr  "Trochilidae" "Trochilidae" "Anguidae" "Anguidae" ...
 $ Genus                      : chr  "Abeillia" "Abeillia" "Abronia" "Abronia" ...
 $ Species                    : chr  "" "abeillei" "" "anzuetoi" ...
 $ Subspecies                 : chr  "" "" "" "" ...
 $ FullName                   : chr  "Abeillia" "Abeillia abeillei" "Abronia" "Abronia anzuetoi" ...
 $ AuthorYear                 : chr  "Bonaparte, 1850" "(Lesson & DeLattre, 1839)" "Gray, 1838" "Campbell & Frost, 1993" ...
 $ RankName                   : chr  "GENUS" "SPECIES" "GENUS" "SPECIES" ...
 $ CurrentListing             : chr  "II" "II" "I/II" "I" ...
 $ FullAnnotationEnglish      : chr  "Appendix II:" "Appendix II:" "Appendix II:Except the species included in Appendix I. Zero export quota for wild specimens for <i>Abronia auri"| __truncated__ "Appendix I:" ...
 $ AnnotationEnglish          : chr  "Appendix II:" "Appendix II:" "Appendix II:Except the species included in Appendix I. Zero export quota for wild specimens for <i>Abronia auri"| __truncated__ "Appendix I:" ...
 $ AnnotationSpanish          : chr  "Appendix II:" "Appendix II:" "Appendix II:Excepto las especies incluidas en el Apéndice I. Cupo de exportación nulo para los especímenes silv"| __truncated__ "Appendix I:" ...
 $ AnnotationFrench           : chr  "Appendix II:" "Appendix II:" "Appendix II:Sauf les espèces inscrites à l’Annexe I. Quota d’exportation zéro pour les spécimens sauvages pour "| __truncated__ "Appendix I:" ...
 $ #AnnotationSymbol          : chr  "" "" "" "" ...
 $ #Annotation                : chr  "Appendix II:" "Appendix II:" "Appendix II:" "Appendix I:" ...
 $ SynonymsWithAuthors        : chr  "" "Ornismya abeillei Lesson & DeLattre, 1839" "" "Abronia anzuetoi Köhler, 2000" ...
 $ EnglishNames               : chr  "" "Emerald-chinned Hummingbird" "" "Anzuetoi arboreal alligator lizard" ...
 $ SpanishNames               : chr  "" "Colibrí barbiesmeralda" "" "" ...
 $ FrenchNames                : chr  "" "Colibri d'Abeillé" "" "" ...
 $ CitesAccepted              : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ All_DistributionISOCodes   : chr  "" "SV, GT, HN, MX, NI" "" "GT" ...
 $ All_DistributionFullNames  : chr  "" "El Salvador, Guatemala, Honduras, Mexico, Nicaragua" "" "Guatemala" ...
 $ NativeDistributionFullNames: chr  "" "El Salvador, Guatemala, Honduras, Mexico, Nicaragua" "" "Guatemala" ...
 $ Introduced_Distribution    : chr  "" "" "" "" ...
 $ Introduced(?)_Distribution : chr  "" "" "" "" ...
 $ Reintroduced_Distribution  : chr  "" "" "" "" ...
 $ Extinct_Distribution       : chr  "" "" "" "" ...
 $ Extinct(?)_Distribution    : chr  "" "" "" "" ...
 $ Distribution_Uncertain     : chr  "" "" "" "" ...
 $ modOrder                   : chr  "Apodiformes" "Apodiformes" "Sauria" "Sauria" ...
 $ modFamily                  : chr  "Trochilidae" "Trochilidae" "Anguidae" "Anguidae" ...
 $ modGenus                   : chr  "Abeillia" "Abeillia" "Abronia" "Abronia" ...
 $ modSpecies                 : chr  "" "abeillei" "" "anzuetoi" ...
 $ modSubspecies              : chr  "" "" "" "" ...
 $ modAuthorYear              : chr  "Bonaparte, 1850" "(Lesson & DeLattre, 1839)" "Gray, 1838" "Campbell & Frost, 1993" ...
 $ uppercase                  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ lowercase                  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ changedOne                 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ omittedOne                 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ shuffle                    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ noAuthors                  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ abbrAuthors                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ noYear                     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ abbrGenus                  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ addPlot                    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ addFamily                  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ morphoSpec                 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ cutGenusEpi                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ modName                    : chr  "Abeillia Bonaparte, 1850" "Abeillia abeillei (Lesson & DeLattre, 1839)" "Abronia Gray, 1838" "Abronia anzuetoi Campbell & Frost, 1993" ...
 $ cutName                    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ Name                       : chr  "Abeillia Bonaparte, 1850" "Abeillia abeillei (Lesson & DeLattre, 1839)" "Abronia Gray, 1838" "Abronia anzuetoi Campbell & Frost, 1993" ...
 - attr(*, ".internal.selfref")=<externalptr>

Parallel processing

With our name lists having 5000 names each, the parsing takes just a few seconds. For longer lists, it may be a good idea to speed up the process by parallelizing it. As the name_parse function already accepts several names at once, we may split the names list into as many chunks as there are cores available for parallel processing.

Let’s check how many cores are available on the system.

detectCores()

From earlier trials with the GBIF API, I can tell you that it is wise to use at most 24 cores, although it is unlikely that you have that many available. So let’s first re-run the name parsing sequentially to record the times needed for comparison.

timeStart <- Sys.time()
resP <- data.table(name_parse(plants$newName))
Sys.time() - timeStart
timeStart <- Sys.time()
resA <- data.table(name_parse(animals$newName))
Sys.time() - timeStart

Time difference of 1.98634 secs

Time difference of 3.512737 secs

Now let’s split the lists into chunks and let each worker run independently.

nLists <- min(24, detectCores() - 1)
(nNames <- nrow(plants) %/% nLists)
(nNamesLast <- nNames + nrow(plants) %% nLists)

We chose to use one worker fewer than available, to allow the computer to carry out other tasks while the script is running, and a maximum of 24. On my computer, this means that each chunk has 333 names to process, and the last chunk will have 338. Let’s create the parallel environment now and compare the times. We just do the plant case for simplicity.
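The chunk arithmetic can be checked without any parallel setup. The sketch below assumes 5000 names and 15 workers, the worker count that yields the chunk sizes of 333 and 338 mentioned in the text; on another machine the numbers will differ. The names nTotal and chunks are introduced here for illustration.

```r
nTotal <- 5000L   # number of names in the list
nLists <- 15L     # assumed worker count (gives 333 names per chunk)

nNames <- nTotal %/% nLists                # regular chunk size
nNamesLast <- nNames + nTotal %% nLists    # last chunk takes the remainder

# row indices handled by each worker
chunks <- lapply(seq_len(nLists), function(i) {
  n <- if (i < nLists) nNames else nNamesLast
  seq_len(n) + (i - 1L) * nNames
})

lengths(chunks)                            # fourteen chunks of 333, one of 338
identical(unlist(chunks), seq_len(nTotal)) # every row exactly once, in order
```

Because the last chunk absorbs the remainder, no name is skipped or processed twice, whatever the number of workers.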

# create the cluster for parallel processing
cl <- makeCluster(nLists)
registerDoParallel(cl)

# run the name parsing in parallel
# the option "fill = TRUE" makes sure rbind does not fail when chunks return different numbers of columns
timeStart <- Sys.time()
resP_parallel <- foreach(
    i = seq_len(nLists), .combine = function(...) rbind(..., fill = TRUE),
    .packages = c("data.table", "rgbif")
) %dopar% {
    if (i < nLists) {
        res <- data.table(name_parse(plants$newName[seq_len(nNames) + (i - 1) * nNames]))
    } else {
        res <- data.table(name_parse(plants$newName[seq_len(nNamesLast) + (i - 1) * nNames]))
    }
    res
}
Sys.time() - timeStart

# stop the cluster
stopCluster(cl)

Time difference of 5.201381 secs

For this small dataset, the parallel run (about 5.2 seconds) was slower than the sequential one (about 2 seconds): the overhead of the parallel environment was larger than the speed gain through parallel processing. Let’s check whether the results are the same.

all(resP == resP_parallel, na.rm = TRUE)

However, the results are as expected, and with larger lists, this approach could save us some time.
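An element-wise comparison with == assumes that both tables have the same columns in the same order, which is not guaranteed when chunks are combined with fill = TRUE. A slightly more defensive check, sketched here with toy data frames standing in for the two result tables (the data and column names are hypothetical), first aligns the columns by name:

```r
# toy stand-ins for the sequential and the parallel result (hypothetical data)
seq_res <- data.frame(
  name = c("Bellis perennis", "Abronia"),
  parsed = c(TRUE, TRUE),
  stringsAsFactors = FALSE
)
# same content, but with the columns in a different order
par_res <- seq_res[, c("parsed", "name")]

# compare only if both tables have the same set of columns,
# then align the column order before checking the content
same_columns <- setequal(names(seq_res), names(par_res))
same_content <- same_columns &&
  isTRUE(all.equal(seq_res, par_res[, names(seq_res)], check.attributes = FALSE))
same_content
```

Unlike ==, all.equal() also treats NA values in matching positions as equal, so no na.rm argument is needed.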