Vernacular name matching

This workbook shows how to find scientific names for a given list of vernacular names using the GBIF database. While getting lists of scientific names with attached vernacular names used to be tedious, the GBIF API offers a fast and convenient way to get the corresponding scientific names for vernacular names of any kind of organism. In the hands-on part of this notebook, you will try to figure out a way to select the best result from the set of results the API returns. You will also seek to speed up the scientific name matching process by running it in parallel.

Prerequisites

To run the code presented here, you will need

  • the sample names list provided in the workshop,

  • a functioning R environment and

  • the R packages data.table, rgbif, RJSONIO, and doSNOW installed.

Code

The first block of code loads libraries and prepares the workspace. You will need to adapt the working directory.

# load packages
library(data.table) # handle large datasets
library(rgbif) # access GBIF data
library(doSNOW) # parallel computing
library(RJSONIO) # parse JSON

# clear workspace
rm(list = ls())

# set working directory
setwd(paste0(.brd, "gfoe NFDI taxonomic harmonization workshop"))

# load data
vernNames <- fread("vernacular names_2024-04-09.txt", sep = "\t")
Loading required package: foreach

Loading required package: iterators

Loading required package: snow

Let’s look at the data.

str(vernNames)
Classes 'data.table' and 'data.frame':	1000 obs. of  1 variable:
 $ vernacularName: chr  "Bermuda Cress" "Bibernelle, Große" "Mandarine" "Lavendel, Französischer" ...
 - attr(*, ".internal.selfref")=<externalptr> 

As can be seen, the names found in the file are a mixture of English and German vernacular names. For simplicity, they are all plant names, but it would make no difference if animal or fungus names were included. The names used here were gathered from an English Wikipedia page and a German website of vernacular plant names. Both pages also include the scientific names of the plants, which can serve as a check on the results obtained here.

Encoding

Unfortunately, when getting data from differing sources, we will often find that these data have been encoded in different ways. This means that while the typical English language letters will be stored the same way on any machine, when it comes to accents and some other special characters, it may matter whether data was stored by a computer in the US or Japan, and whether the computer has a Windows, Mac, or Linux operating system.

We will deal with the most common case: data stored in the Windows-specific CP-1252 encoding (sometimes mislabeled as ANSI or latin1) instead of UTF-8.

How your machine treats data from different encodings depends on what encoding is preset in your console. You can check this using the following:

Sys.getlocale()
'LC_COLLATE=German_Germany.utf8;LC_CTYPE=German_Germany.utf8;LC_MONETARY=German_Germany.utf8;LC_NUMERIC=C;LC_TIME=German_Germany.utf8'

If your console has no UTF-8 setting (no matter the language) you may change it like this:

Sys.setlocale(category = "LC_ALL", locale = "German_Germany.utf8")

You can use another encoding, too, but it may throw errors later on. So let’s check whether the data comes in UTF-8, and if not, let’s repair it, assuming it is CP-1252 (our best guess, likely correct in 99% of the cases).

# check whether correct encoding is UTF-8
table(validUTF8(vernNames$vernacularName))
TRUE 
1000 

That looks all good, so there is nothing to do here. Otherwise we would apply the following:

vernNames[!validUTF8(vernacularName), newName := iconv(vernacularName, from = "CP1252", to = "UTF-8")]

This converts every name that is not valid UTF-8 from CP-1252 to UTF-8 and stores the result in the new column newName.
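To see the repair step in action without the workshop file, the following minimal sketch simulates a CP-1252 string and converts it back. The vector `mixedNames` and the simulated input are assumptions made for illustration only; they are not part of the workshop data.

```r
# a UTF-8 string, written with an escape so the script itself stays ASCII
utf8Name <- "Bibernelle, Gro\u00dfe"

# simulate a name that was stored in CP-1252 (hypothetical test input)
cp1252Name <- iconv(utf8Name, from = "UTF-8", to = "CP1252")

mixedNames <- c("Mandarine", cp1252Name)

# flag entries that are not valid UTF-8
isValid <- validUTF8(mixedNames)

# convert only the invalid entries, assuming they are CP-1252
mixedNames[!isValid] <- iconv(mixedNames[!isValid], from = "CP1252", to = "UTF-8")

# all entries should now be valid UTF-8
all(validUTF8(mixedNames))
```

Only the flagged entries are touched, so names that are already valid UTF-8 are never converted a second time.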

Try vernacular name matching with the rgbif package

As the rgbif package is available to query GBIF, one would assume that one of its functions can be used to retrieve scientific names for the vernacular names.

name_lookup("Bermuda Cress", limit = 5)
name_lookup("Gänseblümchen", limit = 5)
name_lookup("Asiatischer Elefant", limit = 5)
name_lookup("cotton", limit = 5)
Records found [1] 
Records returned [1] 
No. unique hierarchies [1] 
No. facets [0] 
No. names [1] 
Args [q=Bermuda Cress, limit=5, offset=0] 
# A tibble: 1 × 21
        key scientificName    datasetKey parentKey parent genus species genusKey
      <int> <chr>             <chr>          <int> <chr>  <chr> <chr>      <int>
1 165626894 Barbarea verna (… cbb6498e-… 165626881 Barba… Barb… Barbar…   1.66e8
# ℹ 13 more variables: speciesKey <int>, canonicalName <chr>, authorship <chr>,
#   nameType <chr>, taxonomicStatus <chr>, rank <chr>, origin <chr>,
#   numDescendants <int>, numOccurrences <int>, habitats <lgl>,
#   nomenclaturalStatus <lgl>, threatStatuses <lgl>, synonym <lgl>
Records found [50] 
Records returned [5] 
No. unique hierarchies [5] 
No. facets [0] 
No. names [5] 
Args [q=Gänseblümchen, limit=5, offset=0] 
# A tibble: 5 × 34
        key scientificName datasetKey nubKey parentKey parent order family genus
      <int> <chr>          <chr>       <int>     <int> <chr>  <chr> <chr>  <chr>
1 100336410 Bellis L.      16c3f9cb-… 3.12e6 223980151 Aster… Aste… Aster… Bell…
2 116781451 Bellis L.      d027759f-… 3.12e6 116780684 Aster… Aste… Aster… Bell…
3   3117399 Bellis L.      d7dddbf4-… 3.12e6      3065 Aster… Aste… Aster… Bell…
4 100341500 Veronica L.    16c3f9cb-… 3.17e6 223978156 Plant… Lami… Plant… Vero…
5 100462938 Consolida Gray 16c3f9cb-… 3.03e6 223987175 Ranun… Ranu… Ranun… Cons…
# ℹ 25 more variables: classKey <int>, orderKey <int>, familyKey <int>,
#   genusKey <int>, canonicalName <chr>, authorship <chr>, nameType <chr>,
#   taxonomicStatus <chr>, rank <chr>, origin <chr>, numDescendants <int>,
#   numOccurrences <int>, habitats <lgl>, nomenclaturalStatus <lgl>,
#   threatStatuses <chr>, synonym <lgl>, class <chr>, phylum <chr>,
#   phylumKey <int>, nameKey <int>, constituentKey <chr>, kingdom <chr>,
#   kingdomKey <int>, publishedIn <chr>, extinct <lgl>
Records found [14] 
Records returned [5] 
No. unique hierarchies [5] 
No. facets [0] 
No. names [4] 
Args [q=Asiatischer Elefant, limit=5, offset=0] 
# A tibble: 5 × 36
       key scientificName datasetKey  nubKey parentKey parent order family genus
     <int> <chr>          <chr>        <int>     <int> <chr>  <chr> <chr>  <chr>
1   1.00e8 Elephas maxim… 16c3f9cb-… 5219461 223978954 Eleph… Prob… Eleph… Elep…
2   1.96e8 Elephas maxim… 23a3fa4c-… 5219461 225209602 Eleph… Prob… Eleph… Elep…
3   5.22e6 Elephas maxim… d7dddbf4-… 5219461   2435351 Eleph… Prob… Eleph… Elep…
4   1.65e8 Amebelodon Ba… 16c3f9cb-… 4825859 223978959 Gomph… Prob… Gomph… Ameb…
5   1.65e8 Elephas maxim… 16c3f9cb-…      NA 223987046 Eleph… Prob… Eleph… Elep…
# ℹ 27 more variables: species <chr>, classKey <int>, orderKey <int>,
#   familyKey <int>, genusKey <int>, speciesKey <int>, canonicalName <chr>,
#   authorship <chr>, nameType <chr>, taxonomicStatus <chr>, rank <chr>,
#   origin <chr>, numDescendants <int>, numOccurrences <int>, habitats <lgl>,
#   nomenclaturalStatus <lgl>, threatStatuses <chr>, synonym <lgl>,
#   class <chr>, kingdom <chr>, phylum <chr>, kingdomKey <int>,
#   phylumKey <int>, nameKey <int>, constituentKey <chr>, publishedIn <chr>, …
Records found [6458] 
Records returned [5] 
No. unique hierarchies [5] 
No. facets [0] 
No. names [0] 
Args [q=cotton, limit=5, offset=0] 
# A tibble: 5 × 34
      key scientificName nameKey datasetKey nubKey parentKey parent phylum order
    <int> <chr>            <int> <chr>       <int>     <int> <chr>  <chr>  <chr>
1  1.21e8 Neodiastoma C… 7436314 c33ce2f2-… 4.61e6 121285847 Pareo… Mollu… Sorb…
2  4.61e6 Callitriphora… 1838041 d7dddbf4-… 4.61e6      2660 Triph… Mollu… NA   
3  1.22e8 Belgradeophyl… 1411263 c33ce2f2-… 4.88e6 121517396 Rugosa Cnida… NA   
4  4.61e6 Nototerebra C… 7626629 d7dddbf4-… 4.61e6      2685 Tereb… Mollu… Neog…
5  2.19e8 Nototerebra C…      NA 7ddf754f-… 4.61e6 218637945 Tereb… Mollu… Neog…
# ℹ 25 more variables: family <chr>, genus <chr>, phylumKey <int>,
#   classKey <int>, orderKey <int>, familyKey <int>, genusKey <int>,
#   canonicalName <chr>, authorship <chr>, nameType <chr>,
#   taxonomicStatus <chr>, rank <chr>, origin <chr>, numDescendants <int>,
#   numOccurrences <int>, extinct <lgl>, habitats <chr>,
#   nomenclaturalStatus <lgl>, threatStatuses <lgl>, synonym <lgl>,
#   class <chr>, constituentKey <chr>, kingdom <chr>, kingdomKey <int>, …

Creating a custom vernacular name matching function using the GBIF API

From the examples shown here, we see that data is returned. For some names, such as “Bermuda Cress” or the Asian elephant (“Asiatischer Elefant”), only a limited number of records is available, and we can easily select the (first) correct one. In the last example, “cotton”, however, we get over 6000 matches, and it is impossible to select the correct one easily. The search runs on all fields, and the first results returned have “Cotton” as the author of the scientific name, not as a vernacular name. The bigger problem is that the results do not include the actual vernacular names, even if they were used in matching, which impedes subsequent filtering. As this is not what we want, we will access the GBIF API directly instead of relying on the rgbif package.

Fortunately, that is not difficult once we know the syntax. In our call, we will make sure that the name query is only run on vernacular names.

# define the search term
searchName <- "cotton"
# define the maximum number of results per query
nRes <- 100

# directly call the GBIF API
res <- fromJSON(paste0(
	"https://api.gbif.org/v1/species/search?q=",
	searchName, "&offset=400&qField=VERNACULAR&limit=", nRes
))

# check result
names(res)
res[c(1:4, 6)]
length(res$results)
  1. 'offset'
  2. 'limit'
  3. 'endOfRecords'
  4. 'count'
  5. 'results'
  6. 'facets'
$offset
[1] 400

$limit
[1] 100

$endOfRecords
[1] TRUE

$count
[1] 477

$facets
list()
77

As we can see, the API call returns a list with six elements.

  • The first element, offset, is the index of the first retrieved match minus one. With an offset of 10, we get the results from the 11th match onwards. Setting the offset in the API call lets us retrieve all results step by step, even if there are more than 1000.

  • The second element, limit, is the maximum number of results returned. Note that we have defined that number ourselves in nRes. GBIF treats any number > 1000 as 1000.

  • The third element, endOfRecords, tells us whether the returned results include the last match. If our limit is smaller than the number of matches, this is only the case when we use an offset that reaches the last match.

  • The fourth element, count, tells us how many matches to our query were found. It corresponds to “Records found” in the name_lookup function.

  • The fifth element contains the actual results. Its length corresponds to the “Records returned” in the name_lookup function.
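The interplay of offset, limit, and count described above can be sketched with two small helpers. Both `buildSearchUrl` and `pageOffsets` are hypothetical names introduced for illustration; they are not part of rgbif or RJSONIO. The sketch also escapes the search term with `utils::URLencode`, on the assumption that names with spaces or special characters should not be pasted into the URL unencoded.

```r
# build a search URL restricted to vernacular names
# (buildSearchUrl is a hypothetical helper, not part of rgbif)
buildSearchUrl <- function(name, limit = 100, offset = 0) {
	paste0(
		"https://api.gbif.org/v1/species/search?q=",
		utils::URLencode(name, reserved = TRUE), # escape spaces and special characters
		"&qField=VERNACULAR&limit=", limit,
		"&offset=", offset
	)
}

# offsets needed to page through `count` matches, `limit` at a time
# (pageOffsets is a hypothetical helper)
pageOffsets <- function(count, limit) {
	if (count <= 0) {
		return(integer(0))
	}
	seq(0, count - 1, by = limit)
}

# five pages are needed for the 477 matches reported above
pageOffsets(477, 100)

# URL for the last page of a name containing a space
buildSearchUrl("Bermuda Cress", limit = 100, offset = 400)
```

Each offset is a multiple of the limit, so consecutive pages neither overlap nor leave gaps.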

We will ignore the sixth element, facets, which is not relevant here. Let us now look at the results.

length(res$results)
res$results[[1]]
77
$key
[1] 176665880

$datasetKey
[1] "19491596-35ae-4a91-9a98-85cf505f1bd3"

$nubKey
[1] 2436469

$parentKey
[1] 224008563

$parent
[1] "Saguinus"

$kingdom
[1] "ANIMALIA"

$phylum
[1] "CHORDATA"

$order
[1] "PRIMATES"

$family
[1] "CALLITRICHIDAE"

$genus
[1] "Saguinus"

$species
[1] "Saguinus oedipus"

$kingdomKey
[1] 223993981

$phylumKey
[1] 223994122

$classKey
[1] 224006554

$orderKey
[1] 224008432

$familyKey
[1] 224008550

$genusKey
[1] 224008563

$speciesKey
[1] 176665880

$scientificName
[1] "Saguinus oedipus (Linnaeus, 1758)"

$canonicalName
[1] "Saguinus oedipus"

$authorship
[1] "(Linnaeus, 1758)"

$nameType
[1] "SCIENTIFIC"

$taxonomicStatus
[1] "ACCEPTED"

$rank
[1] "SPECIES"

$origin
[1] "SOURCE"

$numDescendants
[1] 0

$numOccurrences
[1] 0

$habitats
list()

$nomenclaturalStatus
list()

$threatStatuses
[1] "CRITICALLY_ENDANGERED"

$descriptions
list()

$vernacularNames
$vernacularNames[[1]]
         vernacularName                language 
"Cotton-headed Tamarin"                   "eng" 

$vernacularNames[[2]]
      vernacularName             language 
"Cotton-top Tamarin"                "eng" 

$vernacularNames[[3]]
    vernacularName           language 
"Tamarin d'Oedipe"              "fra" 

$vernacularNames[[4]]
  vernacularName         language 
"Tamarin pinché"            "fra" 

$vernacularNames[[5]]
      vernacularName             language 
"Tamarin à perruque"                "fra" 

$vernacularNames[[6]]
vernacularName       language 
    "Bichichi"          "spa" 

$vernacularNames[[7]]
 vernacularName        language 
"Tití Leoncito"           "spa" 

$vernacularNames[[8]]
 vernacularName        language 
"Tití Pielroja"           "spa" 

$vernacularNames[[9]]
     vernacularName            language 
"Tití cabeciblanco"               "spa" 

$vernacularNames[[10]]
         vernacularName                language 
"Tití de Cabeza Blanca"                   "spa" 


$higherClassificationMap
       223993981        223994122        224006554        224008432 
      "ANIMALIA"       "CHORDATA"       "MAMMALIA"       "PRIMATES" 
       224008550        224008563 
"CALLITRICHIDAE"       "Saguinus" 

$synonym
[1] FALSE

$class
[1] "MAMMALIA"

From the first result, we notice that each individual result is a list of single elements, except for the elements “vernacularNames” and “higherClassificationMap”. We will focus on the vernacular names. To extract the information needed, we have to consider that each result can store a variable number of vernacular names. To transfer everything into a data frame structure, we have to create as many rows as there are vernacular names for each result, repeating the information from the other fields. For simplicity, we will not process the information in the “higherClassificationMap” variable. We also have to account for the maximum number of results per query so that we get all information for a given name. The code below does this job. It conveniently stores the data in a data.table object.
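Before running the full loop, the row-expansion idea can be tried on a single mock result. This is only a sketch in base R: `mockResult` is a shortened, hand-written stand-in for one API result, and `flattenResult` is a hypothetical helper that assumes every requested field holds a single value.

```r
# a mock result mimicking the structure shown above (shortened, illustration only)
mockResult <- list(
	key = 176665880,
	canonicalName = "Saguinus oedipus",
	kingdom = "ANIMALIA",
	vernacularNames = list(
		c(vernacularName = "Cotton-headed Tamarin", language = "eng"),
		c(vernacularName = "Cotton-top Tamarin", language = "eng"),
		c(vernacularName = "Tamarin pinch\u00e9", language = "fra")
	)
)

# one row per vernacular name; the other fields are repeated on every row
# (flattenResult is a hypothetical helper)
flattenResult <- function(result, vars) {
	vn <- vapply(result$vernacularNames, function(x) x[["vernacularName"]], character(1))
	out <- data.frame(vernacularName = vn, stringsAsFactors = FALSE)
	for (v in vars) {
		value <- result[[v]]
		# fields that are missing or do not hold exactly one value become NA
		cell <- if (length(value) == 1) as.character(value) else NA_character_
		out[[v]] <- rep(cell, nrow(out))
	}
	out
}

flat <- flattenResult(mockResult, c("canonicalName", "kingdom", "rank"))
flat
```

The mock result has no “rank” field, so that column is filled with NA, which mirrors how the loop below handles fields that a result does not provide.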

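# define variables to extract (you may modify this depending on your needs)
# field names must match the API response exactly (no stray whitespace),
# otherwise the field is never found and the column is filled with NA
resVars <- c(
	"canonicalName", "authorship", "scientificName", "genus", "family", "order", "class", "phylum", "kingdom",
	"key", "nubKey", "nameType", "taxonomicStatus", "rank", "origin"
)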

# call the GBIF API
res <- fromJSON(paste0("https://api.gbif.org/v1/species/search?q=", searchName, "&qField=VERNACULAR&limit=", nRes))

resTable <- data.table(vernacularName = character())
for (i in seq_along(resVars)) {
	resTable[, new := character()]
	colnames(resTable)[ncol(resTable)] <- resVars[i]
}

# calculate number of queries given nRes
nRuns <- res$count %/% nRes + ifelse(res$count %% nRes > 0, 1, 0)
for (i in seq_len(nRuns)) {
	# query data
	if (i > 1) {
		res <- fromJSON(paste0(
			"https://api.gbif.org/v1/species/search?q=", searchName,
			"&qField=VERNACULAR&limit=", nRes, "&offset=", (i - 1) * nRes
		))
	}
	# structure data
	res <- res$results
	for (j in seq_along(res)) {
		# extract the vernacular name (first element of each vernacularNames element) from the data
		# you could also extract the second element, which is the language abbreviation
		temp <- data.table(vernacularName = sapply(res[[j]]$vernacularNames, function(x) x[1]))
		# fill the remaining fields
		for (k in seq_along(resVars)) {
			if (resVars[k] %in% names(res[[j]])) {
				temp[, new := res[[j]][[which(resVars[k] == names(res[[j]]))]]]
			} else {
				temp[, new := character()]
			}
			colnames(temp)[ncol(temp)] <- resVars[k]
		}
		resTable <- rbind(resTable, temp)
	}
}

resTable
A data.table: 2252 × 16
vernacularName | canonicalName | authorship | scientificName | genus | family | order | class | phylum | kingdom | key | nubKey | nameType | taxonomicStatus | rank | origin
<chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr>
cottonthistle | Onopordum | L. | Onopordum L. | Onopordum | Asteraceae | Asterales | Magnoliopsida | Tracheophyta | Plantae | 102236399 | 3094883 | SCIENTIFIC | NA | GENUS | SOURCE
Cotton Thistles | Onopordum | L. | Onopordum L. | Onopordum | Asteraceae | Asterales | Magnoliopsida | Tracheophyta | Plantae | 160783497 | 3094883 | SCIENTIFIC | NA | GENUS | SOURCE
Cottonthistle | Onopordum | L. | Onopordum L. | Onopordum | Asteraceae | Asterales | Magnoliopsida | Tracheophyta | Plantae | 160783497 | 3094883 | SCIENTIFIC | NA | GENUS | SOURCE
Bog Cotton | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 160786775 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE
Bog-Cotton | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 160786775 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE
Cottongrass | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 160786775 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE
Bog Cotton | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 206228250 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE
Bog-Cotton | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 206228250 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE
Cottongrass | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 206228250 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE
bog cotton | Eriophorum | Linnaeus | Eriophorum Linnaeus | Eriophorum | Cyperaceae | Poales | Equisetopsida | NA | NA | 100014655 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE
cottongrass | Eriophorum | Linnaeus | Eriophorum Linnaeus | Eriophorum | Cyperaceae | Poales | Equisetopsida | NA | NA | 100014655 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE
Cotton Thistles | Onopordum | L. | Onopordum L. | Onopordum | Asteraceae | Asterales | Magnoliopsida | Tracheophyta | Plantae | 3094883 | 3094883 | SCIENTIFIC | NA | GENUS | SOURCE
Cottonthistle | Onopordum | L. | Onopordum L. | Onopordum | Asteraceae | Asterales | Magnoliopsida | Tracheophyta | Plantae | 3094883 | 3094883 | SCIENTIFIC | NA | GENUS | SOURCE
Æselfoderslægten | Onopordum | L. | Onopordum L. | Onopordum | Asteraceae | Asterales | Magnoliopsida | Tracheophyta | Plantae | 3094883 | 3094883 | SCIENTIFIC | NA | GENUS | SOURCE
Eselsdistel | Onopordum | L. | Onopordum L. | Onopordum | Asteraceae | Asterales | Magnoliopsida | Tracheophyta | Plantae | 3094883 | 3094883 | SCIENTIFIC | NA | GENUS | SOURCE
ulltistlar | Onopordum | L. | Onopordum L. | Onopordum | Asteraceae | Asterales | Magnoliopsida | Tracheophyta | Plantae | 3094883 | 3094883 | SCIENTIFIC | NA | GENUS | SOURCE
Bog Cotton | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 2730118 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE
Bog-Cotton | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 2730118 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE
Cottongrass | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 2730118 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE
Bog Cotton | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 2730118 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE
Bog-Cotton | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 2730118 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE
Cottongrass | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 2730118 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE
bog cotton | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 2730118 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE
cottongrass | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 2730118 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE
Kæruld (Eriophorum-slægten) | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 2730118 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE
jeaggeullu | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 2730118 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE
Wollgras | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 2730118 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE
Bog-cotton | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 2730118 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE
canach | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 2730118 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE
canaichean | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 2730118 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE
Arizona cotton rat | Sigmodon arizonae | Mearns, 1890 | Sigmodon arizonae Mearns, 1890 | Sigmodon | Cricetidae | Rodentia | Mammalia | Chordata | Animalia | 2438148 | 2438148 | SCIENTIFIC | NA | SPECIES | SOURCE
indian cotton jassid | Amrasca biguttula | (Ishida, 1913) | Amrasca biguttula (Ishida, 1913) | Amrasca | Cicadellidae | Hemiptera | Insecta | Arthropoda | Animalia | 2042790 | 2042790 | SCIENTIFIC | NA | SPECIES | SOURCE
indian cotton jassid | Amrasca biguttula | (Ishida, 1913) | Amrasca biguttula (Ishida, 1913) | Amrasca | Cicadellidae | Hemiptera | Insecta | Arthropoda | Animalia | 2042790 | 2042790 | SCIENTIFIC | NA | SPECIES | SOURCE
indian cotton jassid | Amrasca biguttula | (Ishida, 1913) | Amrasca biguttula (Ishida, 1913) | Amrasca | Cicadellidae | Hemiptera | Insecta | Arthropoda | Animalia | 2042790 | 2042790 | SCIENTIFIC | NA | SPECIES | SOURCE
okra leafhopper | Amrasca biguttula | (Ishida, 1913) | Amrasca biguttula (Ishida, 1913) | Amrasca | Cicadellidae | Hemiptera | Insecta | Arthropoda | Animalia | 2042790 | 2042790 | SCIENTIFIC | NA | SPECIES | SOURCE
okra leafhopper | Amrasca biguttula | (Ishida, 1913) | Amrasca biguttula (Ishida, 1913) | Amrasca | Cicadellidae | Hemiptera | Insecta | Arthropoda | Animalia | 2042790 | 2042790 | SCIENTIFIC | NA | SPECIES | SOURCE
okra leafhopper | Amrasca biguttula | (Ishida, 1913) | Amrasca biguttula (Ishida, 1913) | Amrasca | Cicadellidae | Hemiptera | Insecta | Arthropoda | Animalia | 2042790 | 2042790 | SCIENTIFIC | NA | SPECIES | SOURCE
the cotton leaf hopper | Amrasca biguttula | (Ishida, 1913) | Amrasca biguttula (Ishida, 1913) | Amrasca | Cicadellidae | Hemiptera | Insecta | Arthropoda | Animalia | 2042790 | 2042790 | SCIENTIFIC | NA | SPECIES | SOURCE
the cotton leaf hopper | Amrasca biguttula | (Ishida, 1913) | Amrasca biguttula (Ishida, 1913) | Amrasca | Cicadellidae | Hemiptera | Insecta | Arthropoda | Animalia | 2042790 | 2042790 | SCIENTIFIC | NA | SPECIES | SOURCE
the cotton leaf hopper | Amrasca biguttula | (Ishida, 1913) | Amrasca biguttula (Ishida, 1913) | Amrasca | Cicadellidae | Hemiptera | Insecta | Arthropoda | Animalia | 2042790 | 2042790 | SCIENTIFIC | NA | SPECIES | SOURCE
algodoeiro | Gossypium hirsutum | L. | Gossypium hirsutum L. | Gossypium | Malvaceae | Malvales | Magnoliopsida | Tracheophyta | Plantae | 114737712 | 3152661 | SCIENTIFIC | NA | SPECIES | SOURCE
algodoeiro-americano | Gossypium hirsutum | L. | Gossypium hirsutum L. | Gossypium | Malvaceae | Malvales | Magnoliopsida | Tracheophyta | Plantae | 114737712 | 3152661 | SCIENTIFIC | NA | SPECIES | SOURCE
algodão | Gossypium hirsutum | L. | Gossypium hirsutum L. | Gossypium | Malvaceae | Malvales | Magnoliopsida | Tracheophyta | Plantae | 114737712 | 3152661 | SCIENTIFIC | NA | SPECIES | SOURCE
algodón | Gossypium hirsutum | L. | Gossypium hirsutum L. | Gossypium | Malvaceae | Malvales | Magnoliopsida | Tracheophyta | Plantae | 114737712 | 3152661 | SCIENTIFIC | NA | SPECIES | SOURCE
american-cotton | Gossypium hirsutum | L. | Gossypium hirsutum L. | Gossypium | Malvaceae | Malvales | Magnoliopsida | Tracheophyta | Plantae | 114737712 | 3152661 | SCIENTIFIC | NA | SPECIES | SOURCE
cotton | Gossypium hirsutum | L. | Gossypium hirsutum L. | Gossypium | Malvaceae | Malvales | Magnoliopsida | Tracheophyta | Plantae | 114737712 | 3152661 | SCIENTIFIC | NA | SPECIES | SOURCE
cotton-batting cudweed | Pseudognaphalium stramineum | (Kunth) Anderberg | Pseudognaphalium stramineum (Kunth) Anderberg | Pseudognaphalium | Asteraceae | Asterales | Equisetopsida | NA | NA | 100003270 | 3100970 | SCIENTIFIC | NA | SPECIES | SOURCE
cotton-batting plant | Pseudognaphalium stramineum | (Kunth) Anderberg | Pseudognaphalium stramineum (Kunth) Anderberg | Pseudognaphalium | Asteraceae | Asterales | Equisetopsida | NA | NA | 100003270 | 3100970 | SCIENTIFIC | NA | SPECIES | SOURCE
gnaphale paille | Pseudognaphalium stramineum | (Kunth) Anderberg | Pseudognaphalium stramineum (Kunth) Anderberg | Pseudognaphalium | Asteraceae | Asterales | Equisetopsida | NA | NA | 100003270 | 3100970 | SCIENTIFIC | NA | SPECIES | SOURCE
Arizona snake-cotton | Froelichia arizonica | NA | Froelichia arizonica | Froelichia | Amaranthaceae | Caryophyllales | Magnoliopsida | Streptophyta | Viridiplantae | 103394084 | 5384323 | SCIENTIFIC | NA | SPECIES | SOURCE
Allen's Cotton Rat | Sigmodon alleni | Bailey, 1902 | Sigmodon alleni Bailey, 1902 | Sigmodon | CRICETIDAE | RODENTIA | MAMMALIA | CHORDATA | ANIMALIA | 176674514 | 2438158 | SCIENTIFIC | NA | SPECIES | SOURCE
Cotton-grass Dwarf | Elachista albidella | Nylander, [1848] | Elachista albidella Nylander, 1848 | Elachista | Elachistidae | Lepidoptera | Insecta | NA | Animalia | 180259059 | 8033146 | SCIENTIFIC | NA | SPECIES | SOURCE
silvery cotton plant | Celmisia semicordata | Petrie | Celmisia semicordata Petrie, 1914 | Celmisia | Asteraceae | Asterales | Magnoliopsida | Tracheophyta | Plantae | 154884322 | 5391005 | SCIENTIFIC | NA | SPECIES | SOURCE
Cotton Springtail | Entomobrya unostrigata | J.Stach, 1930 | Entomobrya unostrigata J.Stach, 1930 | Entomobrya | Entomobryidae | Entomobryomorpha | Collembola | Arthropoda | Animalia | 160782876 | 2120761 | SCIENTIFIC | NA | SPECIES | SOURCE
Springtail | Entomobrya unostrigata | J.Stach, 1930 | Entomobrya unostrigata J.Stach, 1930 | Entomobrya | Entomobryidae | Entomobryomorpha | Collembola | Arthropoda | Animalia | 160782876 | 2120761 | SCIENTIFIC | NA | SPECIES | SOURCE
tievine | Ipomoea cordatotriloba | Dennst. | Ipomoea cordatotriloba Dennst. | Ipomoea | Convolvulaceae | Solanales | Magnoliopsida | Tracheophyta | Plantae | 102260060 | 2928541 | SCIENTIFIC | NA | SPECIES | SOURCE
Cotton | Gossypium hirsutum | L. | Gossypium hirsutum L. | Gossypium | MALVACEAE | MALVALES | MAGNOLIOPSIDA | TRACHEOPHYTA | PLANTAE | 176808370 | 3152661 | SCIENTIFIC | NA | SPECIES | SOURCE
Algodón Mexicano | Gossypium hirsutum | L. | Gossypium hirsutum L. | Gossypium | MALVACEAE | MALVALES | MAGNOLIOPSIDA | TRACHEOPHYTA | PLANTAE | 176808370 | 3152661 | SCIENTIFIC | NA | SPECIES | SOURCE
Peruvian Cotton Rat | Sigmodon peruanus | J.A. Allen, 1897 | Sigmodon peruanus J.A. Allen, 1897 | Sigmodon | CRICETIDAE | RODENTIA | MAMMALIA | CHORDATA | ANIMALIA | 176674494 | 2438154 | SCIENTIFIC | NA | SPECIES | SOURCE
Alston's Cotton Rat | Sigmodon alstoni | (Thomas, 1881) | Sigmodon alstoni (Thomas, 1881) | Sigmodon | Cricetidae | Rodentia | Mammalia | Chordata | Animalia | 102122248 | 2438155 | SCIENTIFIC | NA | SPECIES | SOURCE

We can see that our results table is quite large. Fortunately, it contains information we can use to extract the data we want. Most importantly, the “vernacularName” column stores the actual vernacular names. As we see in the example, the word “cotton” appears in many names, including names of animals or fungi. As a first step, we can reduce the results to plants only.

nrow(resTable)
table(resTable$kingdom)
resTable <- resTable[kingdom %in% c("Plantae", "PLANTAE")]
nrow(resTable)
2252
     Animalia      ANIMALIA       Metazoa       Plantae       PLANTAE 
          392            39            15          1697            32 
Viridiplantae 
           31 
1729

This has reduced the number of results from 2252 to 1729.
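The filter above has to list both “Plantae” and “PLANTAE” because the source datasets capitalize the kingdom differently. A minimal sketch of a case-insensitive alternative follows; the `kingdom` vector is rebuilt from the counts printed above purely for illustration. Like the original filter, it leaves out the “Viridiplantae” rows, so including them would be a separate, deliberate choice.

```r
# kingdom labels and counts as reported by table(resTable$kingdom) above
kingdom <- rep(
	c("Animalia", "ANIMALIA", "Metazoa", "Plantae", "PLANTAE", "Viridiplantae"),
	times = c(392, 39, 15, 1697, 32, 31)
)

# compare in upper case so that both spellings match
isPlant <- toupper(kingdom) == "PLANTAE"

# number of rows that would be kept
sum(isPlant)
```

Applied to the real table, the same condition could be written as `resTable[toupper(kingdom) == "PLANTAE"]`.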

TASKS:

  1. Try to find other ways to reduce the number of results. Ideally, you should keep one result only per name.

  2. It would also be a good idea to pack the code in a function, so that it can easily be applied in a loop.

  3. To further increase the quality of the matching, consider checking the vernacular names and applying some kind of pre-processing to them.
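As one possible starting point for the third task, the sketch below tidies whitespace and moves a trailing qualifier to the front, as in “Bibernelle, Große” from the sample list. The helper `normalizeName` is hypothetical, and the “noun, qualifier” rule is only an assumption about how such entries are formatted; other patterns in the list may need different handling.

```r
# move a trailing qualifier to the front: "noun, qualifier" -> "qualifier noun"
# (normalizeName is a hypothetical helper; the rule is an assumption about the list format)
normalizeName <- function(x) {
	x <- trimws(x) # drop leading and trailing whitespace
	x <- gsub("\\s+", " ", x) # collapse repeated whitespace
	sub("^([^,]+),\\s*(.+)$", "\\2 \\1", x) # reorder names that contain a comma
}

# non-ASCII letters are written as escapes so the script itself stays ASCII
normalizeName(c("Bermuda Cress", "Bibernelle, Gro\u00dfe", "  Mandarine "))
```

Names without a comma pass through unchanged apart from the whitespace clean-up.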

Parallel processing

As the matching process takes quite some time for each name, it makes sense to parallelize it. An example of how to parallelize the execution of a function can be found below. Let’s first check how many cores are available on our system.

parallel::detectCores()
16

It is unlikely that you have that many cores available, but from earlier trials with the GBIF API I can tell you that it is wise to limit the number of cores to 24 at most. In this exercise, I will use one core fewer than available to avoid blocking my computer for other tasks while the loop is running.

# a test function
testFunction <- function(x) {
	if (x < 0) {
		return(1000 %% (-x))
	} else if (x > 0) {
		return(1000 %% x)
	} else {
		return(0)
	}
}

# create test data
testData <- seq(-1000, 1000)

# create results vectors
resSeq <- rep(NA, length(testData))
resPar <- rep(NA, length(testData))

# sequential loop
startTime <- Sys.time()
for (i in seq_along(testData)) {
	resSeq[i] <- testFunction(testData[i])
}
Sys.time() - startTime

# create the cluster for parallel processing
cl <- makeCluster(parallel::detectCores() - 1)
registerDoSNOW(cl)

# parallel loop
startTime <- Sys.time()
resPar <- foreach(i = seq_along(testData), .combine = c) %dopar% {
	# note that the value of the last expression of each iteration is returned and collected in resPar
	# anything else computed or assigned inside the loop body is lost
	testFunction(testData[i])
}
Sys.time() - startTime

# stop the cluster
stopCluster(cl)

# test whether results are identical
all(resSeq == resPar)
Time difference of 0.01140094 secs
Time difference of 1.056732 secs
TRUE

As we can see, in our little example, parallel processing was not worthwhile. Setting up the parallel workers takes so much time that there is no gain from it. Once the task becomes more complicated and takes more time, however, this will pay off.
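To illustrate when the overhead is recovered, the following sketch replaces the cheap test function with a deliberately slow one. It uses the base `parallel` package instead of doSNOW and foreach so that it runs without extra packages; `slowTask` is a hypothetical stand-in for a single API request, and the timings will differ between machines.

```r
# a deliberately slow stand-in for one API request (hypothetical)
slowTask <- function(x) {
	Sys.sleep(0.2) # pretend to wait for a response
	x^2
}

inputs <- 1:8

# sequential run
seqTime <- system.time(resSeq <- lapply(inputs, slowTask))[["elapsed"]]

# parallel run on two workers (base parallel package, not doSNOW)
cl <- parallel::makeCluster(2)
parTime <- system.time(resPar <- parallel::parLapply(cl, inputs, slowTask))[["elapsed"]]
parallel::stopCluster(cl)

# results should be identical, only the elapsed time differs
identical(resSeq, resPar)
c(sequential = seqTime, parallel = parTime)
```

The cluster is created outside the timed call here, so the comparison only covers the work itself; in a real run the start-up cost still has to be paid once.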

TASKS:

  1. Implement the parallel processing in the vernacular name matching algorithm.