Vernacular name matching#
This workbook shows how to find scientific names for a given list of vernacular names using the GBIF database. While it used to be tedious to get lists of scientific names with attached vernacular names, the GBIF API offers a fast and convenient way to get the corresponding scientific names for vernacular names of any kind of organism. In the hands-on part of this notebook, you will try to figure out a way to select the best result from the bunch of results you get back from the API. You will also seek to speed up the scientific name matching process by running it in parallel.
Prerequisites#
To run the code presented here, you will need
the sample names list provided in the workshop,
a functioning R environment and
the R packages data.table, rgbif, RJSONIO, and doSNOW installed.
Code#
The first block of code loads libraries and prepares the workspace. You will need to adapt the working directory.
# load packages
library(data.table) # handle large datasets
library(rgbif) # access GBIF data
library(doSNOW) # parallel computing
library(RJSONIO) # parse JSON
# clear workspace
rm(list = ls())
# set working directory (.brd is not defined in this notebook; replace it with the path to your own workshop folder)
setwd(paste0(.brd, "gfoe NFDI taxonomic harmonization workshop"))
# load data
vernNames <- fread("vernacular names_2024-04-09.txt", sep = "\t")
Loading required package: foreach
Loading required package: iterators
Loading required package: snow
Let’s look at the data.
str(vernNames)
Classes 'data.table' and 'data.frame': 1000 obs. of 1 variable:
$ vernacularName: chr "Bermuda Cress" "Bibernelle, Große" "Mandarine" "Lavendel, Französischer" ...
- attr(*, ".internal.selfref")=<externalptr>
As can be seen, the names found in the file are a mixture of English and German vernacular names. For simplicity, they are all plant names, but it would not make a difference if animal or fungal names were included. The names used here were gathered from an English Wikipedia page and a German website of vernacular plant names. Both pages also include the scientific names of the plants, which can serve as a check on the results obtained here.
Encoding#
Unfortunately, when getting data from differing sources, we will often find that these data have been encoded in different ways. This means that while the typical English language letters will be stored the same way on any machine, when it comes to accents and some other special characters, it may matter whether data was stored by a computer in the US or Japan, and whether the computer has a Windows, Mac, or Linux operating system.
We will deal with the most common case: data being stored in the Windows-specific CP-1252 encoding (sometimes mislabeled as ANSI or latin1) and not in UTF-8.
How your machine treats data from different encodings depends on what encoding is preset in your console. You can check this using the following:
Sys.getlocale()
If your console is not set to UTF-8 (no matter the language), you may change it like this (the locale name shown is in the Windows format; on Linux or macOS it would look like de_DE.UTF-8):
Sys.setlocale(category = "LC_ALL", locale = "German_Germany.utf8")
You can use another encoding, too, but it may throw errors later on. So let’s check whether the data comes in UTF-8, and if not, let’s repair it, assuming it is CP-1252 (our best guess, likely correct in 99% of the cases).
# check whether correct encoding is UTF-8
table(validUTF8(vernNames$vernacularName))
TRUE
1000
That looks all good, so there is nothing to do here. Otherwise, we would apply the following:
vernNames[!validUTF8(vernacularName), newName := iconv(vernacularName, from = "CP1252", to = "UTF-8")]
This converts all entries that are not valid UTF-8 from CP-1252 to UTF-8 and stores them in the new column newName.
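The effect of such a repair can be tried on a small made-up example. The two strings below are illustrative assumptions and are not taken from the workshop file; the second one is written with raw CP-1252 bytes so that it is not valid UTF-8.

```r
# two example names; the second contains raw CP-1252 bytes (E4 = a-umlaut, FC = u-umlaut)
# these strings are made up for illustration and are not part of the workshop data
x <- c("Lavendel", "G\xe4nsebl\xfcmchen")

# flag the entries that are not valid UTF-8
bad <- !validUTF8(x)
bad

# convert only the flagged entries, assuming they are CP-1252
fixed <- x
fixed[bad] <- iconv(x[bad], from = "CP1252", to = "UTF-8")

# check again: all entries should now be valid UTF-8
validUTF8(fixed)
```

Only the flagged entries are passed to iconv, so strings that are already valid UTF-8 are left alone.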
TRY vernacular name matching with the rgbif package#
Since the rgbif package is available to query GBIF, one would assume that one of its functions can be used to retrieve scientific names for the vernacular names.
name_lookup("Bermuda Cress", limit = 5)
name_lookup("Gänseblümchen", limit = 5)
name_lookup("Asiatischer Elefant", limit = 5)
name_lookup("cotton", limit = 5)
Records found [1]
Records returned [1]
No. unique hierarchies [1]
No. facets [0]
No. names [1]
Args [q=Bermuda Cress, limit=5, offset=0]
# A tibble: 1 × 21
key scientificName datasetKey parentKey parent genus species genusKey
<int> <chr> <chr> <int> <chr> <chr> <chr> <int>
1 165626894 Barbarea verna (… cbb6498e-… 165626881 Barba… Barb… Barbar… 1.66e8
# ℹ 13 more variables: speciesKey <int>, canonicalName <chr>, authorship <chr>,
# nameType <chr>, taxonomicStatus <chr>, rank <chr>, origin <chr>,
# numDescendants <int>, numOccurrences <int>, habitats <lgl>,
# nomenclaturalStatus <lgl>, threatStatuses <lgl>, synonym <lgl>
Records found [50]
Records returned [5]
No. unique hierarchies [5]
No. facets [0]
No. names [5]
Args [q=Gänseblümchen, limit=5, offset=0]
# A tibble: 5 × 34
key scientificName datasetKey nubKey parentKey parent order family genus
<int> <chr> <chr> <int> <int> <chr> <chr> <chr> <chr>
1 100336410 Bellis L. 16c3f9cb-… 3.12e6 223980151 Aster… Aste… Aster… Bell…
2 116781451 Bellis L. d027759f-… 3.12e6 116780684 Aster… Aste… Aster… Bell…
3 3117399 Bellis L. d7dddbf4-… 3.12e6 3065 Aster… Aste… Aster… Bell…
4 100341500 Veronica L. 16c3f9cb-… 3.17e6 223978156 Plant… Lami… Plant… Vero…
5 100462938 Consolida Gray 16c3f9cb-… 3.03e6 223987175 Ranun… Ranu… Ranun… Cons…
# ℹ 25 more variables: classKey <int>, orderKey <int>, familyKey <int>,
# genusKey <int>, canonicalName <chr>, authorship <chr>, nameType <chr>,
# taxonomicStatus <chr>, rank <chr>, origin <chr>, numDescendants <int>,
# numOccurrences <int>, habitats <lgl>, nomenclaturalStatus <lgl>,
# threatStatuses <chr>, synonym <lgl>, class <chr>, phylum <chr>,
# phylumKey <int>, nameKey <int>, constituentKey <chr>, kingdom <chr>,
# kingdomKey <int>, publishedIn <chr>, extinct <lgl>
Records found [14]
Records returned [5]
No. unique hierarchies [5]
No. facets [0]
No. names [4]
Args [q=Asiatischer Elefant, limit=5, offset=0]
# A tibble: 5 × 36
key scientificName datasetKey nubKey parentKey parent order family genus
<int> <chr> <chr> <int> <int> <chr> <chr> <chr> <chr>
1 1.00e8 Elephas maxim… 16c3f9cb-… 5219461 223978954 Eleph… Prob… Eleph… Elep…
2 1.96e8 Elephas maxim… 23a3fa4c-… 5219461 225209602 Eleph… Prob… Eleph… Elep…
3 5.22e6 Elephas maxim… d7dddbf4-… 5219461 2435351 Eleph… Prob… Eleph… Elep…
4 1.65e8 Amebelodon Ba… 16c3f9cb-… 4825859 223978959 Gomph… Prob… Gomph… Ameb…
5 1.65e8 Elephas maxim… 16c3f9cb-… NA 223987046 Eleph… Prob… Eleph… Elep…
# ℹ 27 more variables: species <chr>, classKey <int>, orderKey <int>,
# familyKey <int>, genusKey <int>, speciesKey <int>, canonicalName <chr>,
# authorship <chr>, nameType <chr>, taxonomicStatus <chr>, rank <chr>,
# origin <chr>, numDescendants <int>, numOccurrences <int>, habitats <lgl>,
# nomenclaturalStatus <lgl>, threatStatuses <chr>, synonym <lgl>,
# class <chr>, kingdom <chr>, phylum <chr>, kingdomKey <int>,
# phylumKey <int>, nameKey <int>, constituentKey <chr>, publishedIn <chr>, …
Records found [6458]
Records returned [5]
No. unique hierarchies [5]
No. facets [0]
No. names [0]
Args [q=cotton, limit=5, offset=0]
# A tibble: 5 × 34
key scientificName nameKey datasetKey nubKey parentKey parent phylum order
<int> <chr> <int> <chr> <int> <int> <chr> <chr> <chr>
1 1.21e8 Neodiastoma C… 7436314 c33ce2f2-… 4.61e6 121285847 Pareo… Mollu… Sorb…
2 4.61e6 Callitriphora… 1838041 d7dddbf4-… 4.61e6 2660 Triph… Mollu… NA
3 1.22e8 Belgradeophyl… 1411263 c33ce2f2-… 4.88e6 121517396 Rugosa Cnida… NA
4 4.61e6 Nototerebra C… 7626629 d7dddbf4-… 4.61e6 2685 Tereb… Mollu… Neog…
5 2.19e8 Nototerebra C… NA 7ddf754f-… 4.61e6 218637945 Tereb… Mollu… Neog…
# ℹ 25 more variables: family <chr>, genus <chr>, phylumKey <int>,
# classKey <int>, orderKey <int>, familyKey <int>, genusKey <int>,
# canonicalName <chr>, authorship <chr>, nameType <chr>,
# taxonomicStatus <chr>, rank <chr>, origin <chr>, numDescendants <int>,
# numOccurrences <int>, extinct <lgl>, habitats <chr>,
# nomenclaturalStatus <lgl>, threatStatuses <lgl>, synonym <lgl>,
# class <chr>, constituentKey <chr>, kingdom <chr>, kingdomKey <int>, …
Creating a custom vernacular name matching function using the GBIF API#
From the examples shown here, we see that data is returned. For some names, such as “Bermuda Cress” or the Asian elephant (“Asiatischer Elefant”), only a limited number of records is available, and we can easily select the (first) correct one. In the last example, “cotton”, however, we get over 6000 matches, and it is impossible to select the correct one easily. The reason is that name_lookup searches in all fields, and the first results returned have “Cotton” as the author of the scientific name, not as a vernacular name. The biggest problem is that the results do not include the actual vernacular names, even if they were used in matching, which impedes subsequent filtering. As this is not what we want, we will need to access the GBIF API directly instead of relying on the rgbif package.
Fortunately, that is not difficult, as soon as we know the syntax. In our call, we will make sure that the name query is only done on vernacular names.
# define the search term
searchName <- "cotton"
# define the maximum number of results per query
nRes <- 100
# directly call the GBIF API
res <- fromJSON(paste0(
"https://api.gbif.org/v1/species/search?q=",
searchName, "&offset=400&qField=VERNACULAR&limit=", nRes
))
# check result
names(res)
res[c(1:4, 6)]
length(res$results)
- 'offset'
- 'limit'
- 'endOfRecords'
- 'count'
- 'results'
- 'facets'
$offset
[1] 400
$limit
[1] 100
$endOfRecords
[1] TRUE
$count
[1] 477
$facets
list()
As we can see, the API call returns a list with six elements.
The first element tells us the index of the first retrieved element minus one. So if the offset is 10, we will get the results from the 11th match onwards. We can set the offset in the API call, which will later allow us to retrieve all results even if there are more than 1000.
The second element tells us the maximum number of returned results. Note that we have defined that number ourselves in nRes. GBIF will ignore any number > 1000 and set it to 1000.
The third element tells us whether the last match is included in our returned results. If our limit is smaller than the number of results, this will only be the case if we use an offset to include the last match.
The fourth element tells us how many matches to our query were found. It corresponds to “Records found” in the name_lookup function.
The fifth element contains the actual results. Its length corresponds to “Records returned” in the name_lookup function.
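How offset, limit, and count work together can be sketched without calling the API. The helper pageOffsets below is a hypothetical name introduced for illustration only. It lists the offsets needed to page through all matches, stepping by the limit so that consecutive pages do not overlap.

```r
# hypothetical helper: offsets needed to retrieve `count` matches in pages of size `limit`
pageOffsets <- function(count, limit) {
  if (count <= 0) {
    return(numeric(0))
  }
  seq(0, count - 1, by = limit)
}

# values taken from the "cotton" query above: count = 477, limit = 100
offsets <- pageOffsets(477, 100)
offsets

# number of results on the last page
lastPage <- 477 - max(offsets)
lastPage
```

The last offset computed here, 400, is the one used in the call above, which is why endOfRecords was TRUE there.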
We will ignore the last element that is not relevant for us here. Let us now look at the results.
length(res$results)
res$results[[1]]
$key
[1] 176665880
$datasetKey
[1] "19491596-35ae-4a91-9a98-85cf505f1bd3"
$nubKey
[1] 2436469
$parentKey
[1] 224008563
$parent
[1] "Saguinus"
$kingdom
[1] "ANIMALIA"
$phylum
[1] "CHORDATA"
$order
[1] "PRIMATES"
$family
[1] "CALLITRICHIDAE"
$genus
[1] "Saguinus"
$species
[1] "Saguinus oedipus"
$kingdomKey
[1] 223993981
$phylumKey
[1] 223994122
$classKey
[1] 224006554
$orderKey
[1] 224008432
$familyKey
[1] 224008550
$genusKey
[1] 224008563
$speciesKey
[1] 176665880
$scientificName
[1] "Saguinus oedipus (Linnaeus, 1758)"
$canonicalName
[1] "Saguinus oedipus"
$authorship
[1] "(Linnaeus, 1758)"
$nameType
[1] "SCIENTIFIC"
$taxonomicStatus
[1] "ACCEPTED"
$rank
[1] "SPECIES"
$origin
[1] "SOURCE"
$numDescendants
[1] 0
$numOccurrences
[1] 0
$habitats
list()
$nomenclaturalStatus
list()
$threatStatuses
[1] "CRITICALLY_ENDANGERED"
$descriptions
list()
$vernacularNames
$vernacularNames[[1]]
vernacularName language
"Cotton-headed Tamarin" "eng"
$vernacularNames[[2]]
vernacularName language
"Cotton-top Tamarin" "eng"
$vernacularNames[[3]]
vernacularName language
"Tamarin d'Oedipe" "fra"
$vernacularNames[[4]]
vernacularName language
"Tamarin pinché" "fra"
$vernacularNames[[5]]
vernacularName language
"Tamarin à perruque" "fra"
$vernacularNames[[6]]
vernacularName language
"Bichichi" "spa"
$vernacularNames[[7]]
vernacularName language
"Tití Leoncito" "spa"
$vernacularNames[[8]]
vernacularName language
"Tití Pielroja" "spa"
$vernacularNames[[9]]
vernacularName language
"Tití cabeciblanco" "spa"
$vernacularNames[[10]]
vernacularName language
"Tití de Cabeza Blanca" "spa"
$higherClassificationMap
223993981 223994122 224006554 224008432
"ANIMALIA" "CHORDATA" "MAMMALIA" "PRIMATES"
224008550 224008563
"CALLITRICHIDAE" "Saguinus"
$synonym
[1] FALSE
$class
[1] "MAMMALIA"
From the first result, we notice that each individual result is a list of single elements, except for the elements “vernacularNames” and “higherClassificationMap”. We will focus on the vernacular names. To extract the information needed, we have to consider that each result can store a variable number of vernacular names. To transfer everything into a data frame structure, we will have to create as many rows as there are vernacular names for each result, repeating the information from the other fields. For simplicity, we will not process the information in the “higherClassificationMap” variable. We also have to account for the maximum number of results per query, so that we retrieve all information for a given name. The code below does this job. It conveniently stores the data in a data.table object.
# define variables to extract (you may modify this depending on your needs)
resVars <- c(
"canonicalName", "authorship", "scientificName", "genus", "family", "order", "class", "phylum", "kingdom",
"key", "nubKey", "nameType", "taxonomicStatus", "rank", "origin"
)
# call the GBIF API
res <- fromJSON(paste0("https://api.gbif.org/v1/species/search?q=", searchName, "&qField=VERNACULAR&limit=", nRes))
resTable <- data.table(vernacularName = character())
for (i in seq_along(resVars)) {
resTable[, new := character()]
colnames(resTable)[ncol(resTable)] <- resVars[i]
}
# calculate number of queries given nRes
nRuns <- res$count %/% nRes + ifelse(res$count %% nRes > 0, 1, 0)
for (i in seq_len(nRuns)) {
# query data
if (i > 1) {
res <- fromJSON(paste0(
"https://api.gbif.org/v1/species/search?q=", searchName,
# step the offset by the page size so that consecutive pages do not overlap
"&qField=VERNACULAR&limit=", nRes, "&offset=", (i - 1) * nRes
))
}
# structure data
res <- res$results
for (j in seq_along(res)) {
# extract the vernacular name (first element of each vernacularNames element) from the data
# you could also extract the second element, which is the language abbreviation
temp <- data.table(vernacularName = sapply(res[[j]]$vernacularNames, function(x) x[1]))
# fill the remaining fields
for (k in seq_along(resVars)) {
if (resVars[k] %in% names(res[[j]])) {
temp[, new := res[[j]][[which(resVars[k] == names(res[[j]]))]]]
} else {
temp[, new := NA_character_]
}
colnames(temp)[ncol(temp)] <- resVars[k]
}
resTable <- rbind(resTable, temp)
}
}
resTable
vernacularName | canonicalName | authorship | scientificName | genus | family | order | class | phylum | kingdom | key | nubKey | nameType | taxonomicStatus | rank | origin |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
cottonthistle | Onopordum | L. | Onopordum L. | Onopordum | Asteraceae | Asterales | Magnoliopsida | Tracheophyta | Plantae | 102236399 | 3094883 | SCIENTIFIC | NA | GENUS | SOURCE |
Cotton Thistles | Onopordum | L. | Onopordum L. | Onopordum | Asteraceae | Asterales | Magnoliopsida | Tracheophyta | Plantae | 160783497 | 3094883 | SCIENTIFIC | NA | GENUS | SOURCE |
Cottonthistle | Onopordum | L. | Onopordum L. | Onopordum | Asteraceae | Asterales | Magnoliopsida | Tracheophyta | Plantae | 160783497 | 3094883 | SCIENTIFIC | NA | GENUS | SOURCE |
Bog Cotton | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 160786775 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE |
Bog-Cotton | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 160786775 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE |
Cottongrass | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 160786775 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE |
Bog Cotton | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 206228250 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE |
Bog-Cotton | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 206228250 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE |
Cottongrass | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 206228250 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE |
bog cotton | Eriophorum | Linnaeus | Eriophorum Linnaeus | Eriophorum | Cyperaceae | Poales | Equisetopsida | NA | NA | 100014655 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE |
cottongrass | Eriophorum | Linnaeus | Eriophorum Linnaeus | Eriophorum | Cyperaceae | Poales | Equisetopsida | NA | NA | 100014655 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE |
Cotton Thistles | Onopordum | L. | Onopordum L. | Onopordum | Asteraceae | Asterales | Magnoliopsida | Tracheophyta | Plantae | 3094883 | 3094883 | SCIENTIFIC | NA | GENUS | SOURCE |
Cottonthistle | Onopordum | L. | Onopordum L. | Onopordum | Asteraceae | Asterales | Magnoliopsida | Tracheophyta | Plantae | 3094883 | 3094883 | SCIENTIFIC | NA | GENUS | SOURCE |
Æselfoderslægten | Onopordum | L. | Onopordum L. | Onopordum | Asteraceae | Asterales | Magnoliopsida | Tracheophyta | Plantae | 3094883 | 3094883 | SCIENTIFIC | NA | GENUS | SOURCE |
Eselsdistel | Onopordum | L. | Onopordum L. | Onopordum | Asteraceae | Asterales | Magnoliopsida | Tracheophyta | Plantae | 3094883 | 3094883 | SCIENTIFIC | NA | GENUS | SOURCE |
ulltistlar | Onopordum | L. | Onopordum L. | Onopordum | Asteraceae | Asterales | Magnoliopsida | Tracheophyta | Plantae | 3094883 | 3094883 | SCIENTIFIC | NA | GENUS | SOURCE |
Bog Cotton | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 2730118 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE |
Bog-Cotton | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 2730118 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE |
Cottongrass | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 2730118 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE |
Bog Cotton | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 2730118 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE |
Bog-Cotton | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 2730118 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE |
Cottongrass | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 2730118 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE |
bog cotton | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 2730118 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE |
cottongrass | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 2730118 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE |
Kæruld (Eriophorum-slægten) | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 2730118 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE |
jeaggeullu | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 2730118 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE |
Wollgras | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 2730118 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE |
Bog-cotton | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 2730118 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE |
canach | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 2730118 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE |
canaichean | Eriophorum | L. | Eriophorum L. | Eriophorum | Cyperaceae | Poales | Liliopsida | Tracheophyta | Plantae | 2730118 | 2730118 | SCIENTIFIC | NA | GENUS | SOURCE |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Arizona cotton rat | Sigmodon arizonae | Mearns, 1890 | Sigmodon arizonae Mearns, 1890 | Sigmodon | Cricetidae | Rodentia | Mammalia | Chordata | Animalia | 2438148 | 2438148 | SCIENTIFIC | NA | SPECIES | SOURCE |
indian cotton jassid | Amrasca biguttula | (Ishida, 1913) | Amrasca biguttula (Ishida, 1913) | Amrasca | Cicadellidae | Hemiptera | Insecta | Arthropoda | Animalia | 2042790 | 2042790 | SCIENTIFIC | NA | SPECIES | SOURCE |
indian cotton jassid | Amrasca biguttula | (Ishida, 1913) | Amrasca biguttula (Ishida, 1913) | Amrasca | Cicadellidae | Hemiptera | Insecta | Arthropoda | Animalia | 2042790 | 2042790 | SCIENTIFIC | NA | SPECIES | SOURCE |
indian cotton jassid | Amrasca biguttula | (Ishida, 1913) | Amrasca biguttula (Ishida, 1913) | Amrasca | Cicadellidae | Hemiptera | Insecta | Arthropoda | Animalia | 2042790 | 2042790 | SCIENTIFIC | NA | SPECIES | SOURCE |
okra leafhopper | Amrasca biguttula | (Ishida, 1913) | Amrasca biguttula (Ishida, 1913) | Amrasca | Cicadellidae | Hemiptera | Insecta | Arthropoda | Animalia | 2042790 | 2042790 | SCIENTIFIC | NA | SPECIES | SOURCE |
okra leafhopper | Amrasca biguttula | (Ishida, 1913) | Amrasca biguttula (Ishida, 1913) | Amrasca | Cicadellidae | Hemiptera | Insecta | Arthropoda | Animalia | 2042790 | 2042790 | SCIENTIFIC | NA | SPECIES | SOURCE |
okra leafhopper | Amrasca biguttula | (Ishida, 1913) | Amrasca biguttula (Ishida, 1913) | Amrasca | Cicadellidae | Hemiptera | Insecta | Arthropoda | Animalia | 2042790 | 2042790 | SCIENTIFIC | NA | SPECIES | SOURCE |
the cotton leaf hopper | Amrasca biguttula | (Ishida, 1913) | Amrasca biguttula (Ishida, 1913) | Amrasca | Cicadellidae | Hemiptera | Insecta | Arthropoda | Animalia | 2042790 | 2042790 | SCIENTIFIC | NA | SPECIES | SOURCE |
the cotton leaf hopper | Amrasca biguttula | (Ishida, 1913) | Amrasca biguttula (Ishida, 1913) | Amrasca | Cicadellidae | Hemiptera | Insecta | Arthropoda | Animalia | 2042790 | 2042790 | SCIENTIFIC | NA | SPECIES | SOURCE |
the cotton leaf hopper | Amrasca biguttula | (Ishida, 1913) | Amrasca biguttula (Ishida, 1913) | Amrasca | Cicadellidae | Hemiptera | Insecta | Arthropoda | Animalia | 2042790 | 2042790 | SCIENTIFIC | NA | SPECIES | SOURCE |
algodoeiro | Gossypium hirsutum | L. | Gossypium hirsutum L. | Gossypium | Malvaceae | Malvales | Magnoliopsida | Tracheophyta | Plantae | 114737712 | 3152661 | SCIENTIFIC | NA | SPECIES | SOURCE |
algodoeiro-americano | Gossypium hirsutum | L. | Gossypium hirsutum L. | Gossypium | Malvaceae | Malvales | Magnoliopsida | Tracheophyta | Plantae | 114737712 | 3152661 | SCIENTIFIC | NA | SPECIES | SOURCE |
algodão | Gossypium hirsutum | L. | Gossypium hirsutum L. | Gossypium | Malvaceae | Malvales | Magnoliopsida | Tracheophyta | Plantae | 114737712 | 3152661 | SCIENTIFIC | NA | SPECIES | SOURCE |
algodón | Gossypium hirsutum | L. | Gossypium hirsutum L. | Gossypium | Malvaceae | Malvales | Magnoliopsida | Tracheophyta | Plantae | 114737712 | 3152661 | SCIENTIFIC | NA | SPECIES | SOURCE |
american-cotton | Gossypium hirsutum | L. | Gossypium hirsutum L. | Gossypium | Malvaceae | Malvales | Magnoliopsida | Tracheophyta | Plantae | 114737712 | 3152661 | SCIENTIFIC | NA | SPECIES | SOURCE |
cotton | Gossypium hirsutum | L. | Gossypium hirsutum L. | Gossypium | Malvaceae | Malvales | Magnoliopsida | Tracheophyta | Plantae | 114737712 | 3152661 | SCIENTIFIC | NA | SPECIES | SOURCE |
cotton-batting cudweed | Pseudognaphalium stramineum | (Kunth) Anderberg | Pseudognaphalium stramineum (Kunth) Anderberg | Pseudognaphalium | Asteraceae | Asterales | Equisetopsida | NA | NA | 100003270 | 3100970 | SCIENTIFIC | NA | SPECIES | SOURCE |
cotton-batting plant | Pseudognaphalium stramineum | (Kunth) Anderberg | Pseudognaphalium stramineum (Kunth) Anderberg | Pseudognaphalium | Asteraceae | Asterales | Equisetopsida | NA | NA | 100003270 | 3100970 | SCIENTIFIC | NA | SPECIES | SOURCE |
gnaphale paille | Pseudognaphalium stramineum | (Kunth) Anderberg | Pseudognaphalium stramineum (Kunth) Anderberg | Pseudognaphalium | Asteraceae | Asterales | Equisetopsida | NA | NA | 100003270 | 3100970 | SCIENTIFIC | NA | SPECIES | SOURCE |
Arizona snake-cotton | Froelichia arizonica | NA | Froelichia arizonica | Froelichia | Amaranthaceae | Caryophyllales | Magnoliopsida | Streptophyta | Viridiplantae | 103394084 | 5384323 | SCIENTIFIC | NA | SPECIES | SOURCE |
Allen's Cotton Rat | Sigmodon alleni | Bailey, 1902 | Sigmodon alleni Bailey, 1902 | Sigmodon | CRICETIDAE | RODENTIA | MAMMALIA | CHORDATA | ANIMALIA | 176674514 | 2438158 | SCIENTIFIC | NA | SPECIES | SOURCE |
Cotton-grass Dwarf | Elachista albidella | Nylander, [1848] | Elachista albidella Nylander, 1848 | Elachista | Elachistidae | Lepidoptera | Insecta | NA | Animalia | 180259059 | 8033146 | SCIENTIFIC | NA | SPECIES | SOURCE |
silvery cotton plant | Celmisia semicordata | Petrie | Celmisia semicordata Petrie, 1914 | Celmisia | Asteraceae | Asterales | Magnoliopsida | Tracheophyta | Plantae | 154884322 | 5391005 | SCIENTIFIC | NA | SPECIES | SOURCE |
Cotton Springtail | Entomobrya unostrigata | J.Stach, 1930 | Entomobrya unostrigata J.Stach, 1930 | Entomobrya | Entomobryidae | Entomobryomorpha | Collembola | Arthropoda | Animalia | 160782876 | 2120761 | SCIENTIFIC | NA | SPECIES | SOURCE |
Springtail | Entomobrya unostrigata | J.Stach, 1930 | Entomobrya unostrigata J.Stach, 1930 | Entomobrya | Entomobryidae | Entomobryomorpha | Collembola | Arthropoda | Animalia | 160782876 | 2120761 | SCIENTIFIC | NA | SPECIES | SOURCE |
tievine | Ipomoea cordatotriloba | Dennst. | Ipomoea cordatotriloba Dennst. | Ipomoea | Convolvulaceae | Solanales | Magnoliopsida | Tracheophyta | Plantae | 102260060 | 2928541 | SCIENTIFIC | NA | SPECIES | SOURCE |
Cotton | Gossypium hirsutum | L. | Gossypium hirsutum L. | Gossypium | MALVACEAE | MALVALES | MAGNOLIOPSIDA | TRACHEOPHYTA | PLANTAE | 176808370 | 3152661 | SCIENTIFIC | NA | SPECIES | SOURCE |
Algodón Mexicano | Gossypium hirsutum | L. | Gossypium hirsutum L. | Gossypium | MALVACEAE | MALVALES | MAGNOLIOPSIDA | TRACHEOPHYTA | PLANTAE | 176808370 | 3152661 | SCIENTIFIC | NA | SPECIES | SOURCE |
Peruvian Cotton Rat | Sigmodon peruanus | J.A. Allen, 1897 | Sigmodon peruanus J.A. Allen, 1897 | Sigmodon | CRICETIDAE | RODENTIA | MAMMALIA | CHORDATA | ANIMALIA | 176674494 | 2438154 | SCIENTIFIC | NA | SPECIES | SOURCE |
Alston's Cotton Rat | Sigmodon alstoni | (Thomas, 1881) | Sigmodon alstoni (Thomas, 1881) | Sigmodon | Cricetidae | Rodentia | Mammalia | Chordata | Animalia | 102122248 | 2438155 | SCIENTIFIC | NA | SPECIES | SOURCE |
We can see that our results table is quite large. Fortunately, we have some information we can use to extract the data that we want. Most importantly, the “vernacularName” column stores the actual vernacular names. As we see in the example, the word “cotton” appears in a lot of names, including those of animals or fungi. As a first step, we can reduce the results to plants only.
nrow(resTable)
table(resTable$kingdom)
resTable <- resTable[kingdom %in% c("Plantae", "PLANTAE")]
nrow(resTable)
Animalia ANIMALIA Metazoa Plantae PLANTAE
392 39 15 1697 32
Viridiplantae
31
This has reduced the number of results from about 2200 to 1700.
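The table above also shows that kingdom names are spelled inconsistently across source datasets, and that some records use “Viridiplantae”. A possible refinement, sketched here on a small made-up table (the rows are illustrative, not API output), is to compare kingdoms in lower case against a set of accepted labels. Whether “Viridiplantae” should be counted as plants is an assumption you may want to check for your own data.

```r
library(data.table)

# toy table standing in for resTable (values are illustrative only)
toy <- data.table(
  vernacularName = c("cotton", "Cotton", "Arizona snake-cotton", "Cotton Rat"),
  kingdom = c("Plantae", "PLANTAE", "Viridiplantae", "Animalia")
)

# accepted plant kingdom labels, compared in lower case (an assumption, adapt as needed)
plantKingdoms <- c("plantae", "viridiplantae")

# keep only rows whose kingdom matches one of the accepted labels
toyPlants <- toy[tolower(kingdom) %in% plantKingdoms]
toyPlants
```

Comparing in lower case avoids having to list every spelling variant by hand.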
TASKS:
Try to find other ways to reduce the number of results. Ideally, you should keep one result only per name.
It would also be a good idea to pack the code in a function, so that it can easily be applied in a loop.
To further increase the quality of the matching, you might want to consider checking the vernacular names and applying some kind of pre-processing to them.
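As one possible starting point for the pre-processing task, the sketch below handles two issues visible in the sample list: names written in inverted order, such as “Bibernelle, Große”, and characters that are not safe in a URL, such as spaces. The function name cleanVernacular is hypothetical, and the inversion rule (exactly one comma followed by a space) is an assumption about the format of the list.

```r
# hypothetical helper: reorder "Noun, Adjective" to "Adjective Noun"
cleanVernacular <- function(x) {
  x <- trimws(x)
  # swap the two parts around a single ", " separator; other names pass through unchanged
  sub("^([^,]+), ([^,]+)$", "\\2 \\1", x)
}

cleaned <- cleanVernacular(c("Bermuda Cress", "Mandarine", "Bibernelle, Gro\u00dfe"))
cleaned

# percent-encode a name before pasting it into the API URL
encoded <- utils::URLencode("Bermuda Cress", reserved = TRUE)
encoded
```

Encoding the search term matters as soon as names contain spaces or non-ASCII letters, which is the case for many entries in the sample list.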
Parallel processing#
As the matching process will take quite some time for each name, it makes sense to parallelize it. An example of how to parallelize the execution of a function can be found below. Let’s first check how many cores are available on our system.
parallel::detectCores()
It is unlikely that you have so many cores available, but from former trials with the GBIF API I can tell you that it is wise to limit the number of cores to 24 at most. In this exercise, I will use one core fewer than available to avoid blocking my computer for other tasks while the loop is running.
# a test function
testFunction <- function(x) {
if (x < 0) {
return(1000 %% (-x))
} else if (x > 0) {
return(1000 %% x)
} else {
return(0)
}
}
# create test data
testData <- seq(-1000, 1000)
# create results vectors
resSeq <- rep(NA, length(testData))
resPar <- rep(NA, length(testData))
# sequential loop
startTime <- Sys.time()
for (i in seq_along(testData)) {
resSeq[i] <- testFunction(testData[i])
}
Sys.time() - startTime
# create the cluster for parallel processing
cl <- makeCluster(parallel::detectCores() - 1)
registerDoSNOW(cl)
# parallel loop
startTime <- Sys.time()
resPar <- foreach(i = seq_along(testData), .combine = c) %dopar% {
# the value of the last expression in each iteration is returned and collected in resPar
# anything else computed or assigned inside the loop body is lost
testFunction(testData[i])
}
Sys.time() - startTime
# stop the cluster
stopCluster(cl)
# test whether results are identical
all(resSeq == resPar)
Time difference of 0.01140094 secs
Time difference of 1.056732 secs
As we can see, in our little example, the use of parallel processing was not necessary. Setting up the parallel processing takes so much time that there is no gain from it. Whenever the task becomes more complicated and takes more time, however, this will pay off.
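To get a feeling for when parallel processing does pay off, the sketch below replaces the trivial test function with one that waits for a moment, imitating the latency of a web request. The function slowLookup and the vector demoNames are hypothetical placeholders, not GBIF calls, and the waiting time of 0.2 seconds is an arbitrary assumption.

```r
library(doSNOW) # also attaches foreach and snow

# stand-in for a slow API call: waits 0.2 s, then returns its input in upper case
# (slowLookup is a hypothetical placeholder, not a GBIF function)
slowLookup <- function(x) {
  Sys.sleep(0.2)
  toupper(x)
}

demoNames <- c("cotton", "mandarine", "lavendel", "bermuda cress")

# sequential run
startTime <- Sys.time()
demoSeq <- sapply(demoNames, slowLookup, USE.NAMES = FALSE)
Sys.time() - startTime

# parallel run on two workers
cl <- makeCluster(2)
registerDoSNOW(cl)
startTime <- Sys.time()
demoPar <- foreach(i = seq_along(demoNames), .combine = c) %dopar% {
  slowLookup(demoNames[i])
}
Sys.time() - startTime
stopCluster(cl)

# both runs should give the same result
identical(demoSeq, demoPar)
```

Timings will vary between machines, but with a waiting task like this the parallel loop should need noticeably less time than the sequential one. If the loop body relies on packages such as data.table or RJSONIO, they also have to be available on the workers, for example through the .packages argument of foreach.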
TASKS:
Implement the parallel processing in the vernacular name matching algorithm.