Taxonomic name resolution
This notebook demonstrates the workflow of taxonomic name resolution. Taxonomic name resolution is the process of checking and de-synonymizing taxonomic names using a reference database. This is necessary because many taxonomic entities, such as species, have been described under several scientific names. However, only one of those scientific names is the accepted name, while the others are referred to as synonyms (although there are special cases, e.g., unresolved names). To correctly link data from several sources, it is necessary to check and de-synonymize names using a common reference database.
This process is complicated by spelling errors, names not found in the database, and other challenges. In this notebook, we will use the GBIF API to resolve the names from a sample list. However, users may not always want to apply the GBIF taxonomy, and not all datasets incorporated in GBIF are the newest versions. Therefore, in the hands-on part of this notebook, we will attempt to create a custom name resolution function using the Leipzig Catalogue of Vascular Plants (LCVP). We will also try to speed up the name resolution process by using the parallel processing functionality of R.
Prerequisites
To run the code presented here, you will need
the sample names list provided in the workshop,
a version of the Leipzig Catalogue of Vascular Plants (LCVP), also provided in the workshop,
a functioning R environment and
the R packages data.table, rgbif, and doSNOW installed.
Code
The first block of code loads libraries and prepares the workspace. You will need to adapt the working directory.
# load packages
library(data.table) # handle large datasets
library(rgbif) # access GBIF data
library(doSNOW) # parallel computing
# clear workspace
rm(list = ls())
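# set working directory
# (.brd is not defined in this notebook; presumably it holds the author's
# base path, so replace it with your own)
setwd(paste0(.brd, "gfoe NFDI taxonomic harmonization workshop"))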
# load data
plants <- fread("plant names_2024-04-08.txt", sep = "\t")
animals <- fread("animal names_2024-04-09.txt", sep = "\t")
Loading required package: foreach
Loading required package: iterators
Loading required package: snow
For the sake of simplicity, we will proceed to do the name matching using the names as they are. Note that name parsing can help to increase the accuracy and efficiency of the name harmonization process.
Encoding
Unfortunately, when getting data from differing sources, we will often find that these data have been encoded in different ways. This means that while the typical English language letters will be stored the same way on any machine, when it comes to accents and some other special characters, it may matter whether data was stored by a computer in the US or Japan, and whether the computer has a Windows, Mac, or Linux operating system.
We will deal with the most common case: data stored in the Windows-specific CP-1252 encoding (sometimes mislabeled as ANSI or latin1) rather than in UTF-8.
How your machine treats data from different encodings depends on what encoding is preset in your console. You can check this using the following:
Sys.getlocale()
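As a small sketch, the same check can be done programmatically. Both functions used here are part of base R; the variable name is_utf8 is only illustrative.

```r
# Query only the character-set part of the current locale
Sys.getlocale("LC_CTYPE")

# l10n_info() reports, among other things, whether the session uses UTF-8
is_utf8 <- isTRUE(l10n_info()[["UTF-8"]])
is_utf8
```

If is_utf8 is FALSE, the session is not running in a UTF-8 locale.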
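If your console has no UTF-8 setting (no matter the language), you may change it like this (the locale name below is the Windows form; on Linux or macOS it would be something like "de_DE.UTF-8"):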
Sys.setlocale(category = "LC_ALL", locale = "German_Germany.utf8")
You can use another encoding, too, but it may throw errors later on. So let’s check whether the data comes in UTF-8, and if not, let’s repair it, assuming it is CP-1252 (our best guess, likely correct in 99% of the cases).
# check whether the names are valid UTF-8
table(validUTF8(plants$oldName))
table(validUTF8(animals$modName))
FALSE TRUE
73 4927
TRUE
5000
# create new columns for variables
plants[, newName := oldName]
animals[, newName := modName]
# correct encoding, assuming current encoding is CP-1252
plants[!validUTF8(newName), newName := iconv(newName, from = "CP1252", to = "UTF-8")]
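To see what this repair does at the byte level, here is a minimal, self-contained sketch. The example string is made up for illustration and is not taken from the workshop data; only base R functions are used.

```r
# Build "Flörke" as it would be stored in CP-1252: "ö" is the single byte 0xF6
x <- rawToChar(as.raw(c(0x46, 0x6c, 0xf6, 0x72, 0x6b, 0x65)))

# A lone 0xF6 byte is not valid UTF-8, so this string is flagged
validUTF8(x)

# Re-encode the bytes from CP-1252 to UTF-8
fixed <- iconv(x, from = "CP1252", to = "UTF-8")

# The repaired string is valid, and "ö" is now stored as the two bytes c3 b6
validUTF8(fixed)
charToRaw(fixed)
```

Note that validUTF8() only tells you whether the bytes form valid UTF-8. Pure ASCII strings always pass, and the function cannot tell which legacy encoding an invalid string was written in, which is why CP-1252 remains an assumption here.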
Name resolution
To access the GBIF backbone, we can use the name_backbone_checklist function found in the rgbif package.
resP <- data.table(name_backbone_checklist(plants$newName))
resA <- data.table(name_backbone_checklist(animals$newName))
It took some time, but we got some results. Let’s look at the result structure.
str(resP)
Classes 'data.table' and 'data.frame': 5000 obs. of 25 variables:
$ confidence : int 100 100 100 98 94 98 94 99 99 99 ...
$ matchType : chr "NONE" "NONE" "NONE" "EXACT" ...
$ synonym : logi FALSE FALSE FALSE TRUE FALSE TRUE ...
$ usageKey : int NA NA NA 2977925 3152705 2685510 2684876 8132859 3138531 3830289 ...
$ acceptedUsageKey: int NA NA NA 11698476 NA 2685508 NA NA NA NA ...
$ scientificName : chr NA NA NA "Abarema curvicarpa (H.S.Irwin) Barneby & J.W.Grimes" ...
$ canonicalName : chr NA NA NA "Abarema curvicarpa" ...
$ rank : chr NA NA NA "SPECIES" ...
$ status : chr NA NA NA "SYNONYM" ...
$ kingdom : chr NA NA NA "Plantae" ...
$ phylum : chr NA NA NA "Tracheophyta" ...
$ order : chr NA NA NA "Fabales" ...
$ family : chr NA NA NA "Fabaceae" ...
$ genus : chr NA NA NA "Jupunba" ...
$ species : chr NA NA NA "Jupunba curvicarpa" ...
$ kingdomKey : int NA NA NA 6 6 6 6 6 6 6 ...
$ phylumKey : int NA NA NA 7707728 7707728 7707728 7707728 7707728 7707728 7707728 ...
$ classKey : int NA NA NA 220 220 194 194 220 220 220 ...
$ orderKey : int NA NA NA 1370 941 640 640 933 414 399 ...
$ familyKey : int NA NA NA 5386 6685 3925 3925 2398 3065 2411 ...
$ genusKey : int NA NA NA 7266089 3152705 2684876 2684876 4890936 3138519 7268654 ...
$ speciesKey : int NA NA NA 11698476 NA 2685507 NA 8132859 3138531 3830289 ...
$ class : chr NA NA NA "Magnoliopsida" ...
$ verbatim_name : chr "" "(lauraceae) pubescente" "?Betulaceae sp." "Abarema curvicarpa" ...
$ verbatim_index : num 1 2 3 4 5 6 7 8 9 10 ...
- attr(*, ".internal.selfref")=<externalptr>
As there is a column named “kingdom”, we might check whether we actually got all plants matched, and how many non-matches we got. Another important piece of information can be found in the “matchType” column. Here, we can see whether names were retrieved exactly as they were spelled, whether some fuzzy matching was done, or whether they could only be matched to a higher rank. The latter means that names may only have been matched to genus, family, order, or phylum level. It is worth checking these results.
table(resP$kingdom)
sum(is.na(resP$kingdom))
table(resP$matchType)
Animalia Fungi Plantae
1 40 4814
EXACT FUZZY HIGHERRANK NONE
4550 138 167 145
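To make the “FUZZY” match type more concrete, the following sketch shows the general idea of fuzzy matching with an edit distance. This is only an illustration using base R's adist(); it is not a description of how GBIF matches names internally. The candidate list and the threshold max_dist are assumptions chosen for the example.

```r
# A misspelled query and a few candidate names (illustrative reference list)
query <- "Porpidia contrapoenda"
candidates <- c("Porpidia contraponenda", "Pannaria conoplea", "Psora crenata")

# Generalized Levenshtein distance between the query and every candidate
d <- drop(adist(query, candidates))
names(d) <- candidates
d

# Accept the closest candidate only if it lies within a small distance
max_dist <- 2
best <- if (min(d) <= max_dist) candidates[which.min(d)] else NA_character_
best
```

The misspelled query differs from the first candidate by a single missing letter, so that candidate is selected. Choosing the threshold is a trade-off: a larger value recovers more typos but also risks matching a different, similarly spelled taxon.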
resP[kingdom != "Plantae"]
confidence | matchType | synonym | usageKey | acceptedUsageKey | scientificName | canonicalName | rank | status | kingdom | ⋯ | kingdomKey | phylumKey | classKey | orderKey | familyKey | genusKey | speciesKey | class | verbatim_name | verbatim_index |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<int> | <chr> | <lgl> | <int> | <int> | <chr> | <chr> | <chr> | <chr> | <chr> | ⋯ | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <chr> | <chr> | <dbl> |
99 | EXACT | FALSE | 3414350 | NA | Acarospora radicata H.Magn. | Acarospora radicata | SPECIES | ACCEPTED | Fungi | ⋯ | 5 | 95 | 180 | 1271 | 8347 | 2600495 | 3414350 | Lecanoromycetes | Acarospora radicata | 63 |
99 | EXACT | FALSE | 2592767 | NA | Arthonia polymorpha Ach., 1814 | Arthonia polymorpha | SPECIES | ACCEPTED | Fungi | ⋯ | 5 | 95 | 313 | 1273 | 8363 | 2581942 | 2592767 | Arthoniomycetes | Arthonia polymorpha | 420 |
99 | EXACT | FALSE | 7250670 | NA | Aspicilia candida (Anzi) Hue | Aspicilia candida | SPECIES | ACCEPTED | Fungi | ⋯ | 5 | 95 | 180 | 1051 | 4116 | 2599747 | 7250670 | Lecanoromycetes | Aspicilia candida | 445 |
99 | EXACT | FALSE | 3419869 | NA | Aulaxina microphana (Vain.) R.Sant. | Aulaxina microphana | SPECIES | ACCEPTED | Fungi | ⋯ | 5 | 95 | 180 | 1279 | 2161 | 6610216 | 3419869 | Lecanoromycetes | Aulaxina microphana | 507 |
98 | EXACT | TRUE | 2608288 | 3438532 | Bacidia kingmanii Hasse | Bacidia kingmanii | SPECIES | SYNONYM | Fungi | ⋯ | 5 | 95 | 180 | 10848190 | 8296 | 2569865 | 3438532 | Lecanoromycetes | Bacidia kingmanii | 527 |
99 | EXACT | FALSE | 2609464 | NA | Buellia griseovirens (Turner & Borrer ex Sm.) Almb. | Buellia griseovirens | SPECIES | ACCEPTED | Fungi | ⋯ | 5 | 95 | 180 | 10861608 | 4115 | 2587707 | 2609464 | Lecanoromycetes | Buellia griseovirens | 711 |
99 | EXACT | FALSE | 2609287 | NA | Calicium quercinum Pers. | Calicium quercinum | SPECIES | ACCEPTED | Fungi | ⋯ | 5 | 95 | 180 | 10861608 | 4115 | 2592559 | 2609287 | Lecanoromycetes | Calicium quercinum | 789 |
98 | EXACT | TRUE | 2610162 | 7462577 | Caloplaca inconspecta Arup | Caloplaca inconspecta | SPECIES | SYNONYM | Fungi | ⋯ | 5 | 95 | 180 | 1050 | 8368 | 7251287 | 7462577 | Lecanoromycetes | Caloplaca inconspecta | 805 |
98 | EXACT | TRUE | 3469366 | 2596842 | Catapyrenium caeruleopulvinum J.W.Thomson | Catapyrenium caeruleopulvinum | SPECIES | SYNONYM | Fungi | ⋯ | 5 | 95 | 178 | 1043 | 4841 | 2596841 | 2596842 | Eurotiomycetes | Catapyrenium caeruleopulvinum | 929 |
99 | EXACT | FALSE | 7186902 | NA | Cetraria islandica subsp. islandica | Cetraria islandica islandica | SUBSPECIES | ACCEPTED | Fungi | ⋯ | 5 | 95 | 180 | 1048 | 8305 | 2601309 | 2605272 | Lecanoromycetes | Cetraria islandica ssp. islandica | 989 |
99 | EXACT | FALSE | 3391187 | NA | Cladonia alinii Trass | Cladonia alinii | SPECIES | ACCEPTED | Fungi | ⋯ | 5 | 95 | 180 | 1048 | 8328 | 2607519 | 3391187 | Lecanoromycetes | Cladonia alinii | 1089 |
99 | EXACT | FALSE | 2607718 | NA | Cladonia pleurota (Flörke) Schaer. | Cladonia pleurota | SPECIES | ACCEPTED | Fungi | ⋯ | 5 | 95 | 180 | 1048 | 8328 | 2607519 | 2607718 | Lecanoromycetes | Cladonia pleurota | 1090 |
97 | EXACT | FALSE | 10883702 | NA | Collema tenax Degel. | Collema tenax | SPECIES | ACCEPTED | Fungi | ⋯ | 5 | 95 | 180 | 1055 | 8377 | 7237642 | 10883702 | Lecanoromycetes | Collema tenax | 1149 |
99 | EXACT | FALSE | 3433571 | NA | Dirinaria leopoldii (Stein) D.D.Awasthi | Dirinaria leopoldii | SPECIES | ACCEPTED | Fungi | ⋯ | 5 | 95 | 180 | 1050 | NA | 10937028 | 3433571 | Lecanoromycetes | Dirinaria leopoldii | 1571 |
97 | EXACT | FALSE | 5475464 | NA | Graphis anfractuosa (Eschw.) Eschw. | Graphis anfractuosa | SPECIES | ACCEPTED | Fungi | ⋯ | 5 | 95 | 180 | 1279 | 2176 | 2602041 | 5475464 | Lecanoromycetes | Graphis anfractuosa | 2189 |
98 | EXACT | TRUE | 2606991 | 2606897 | Lecanora melaena (Hedl.) Fink | Lecanora melaena | SPECIES | SYNONYM | Fungi | ⋯ | 5 | 95 | 180 | 1048 | 4828 | 2569863 | 2606897 | Lecanoromycetes | Lecanora melaena | 2649 |
99 | EXACT | FALSE | 3439134 | NA | Lecidea polaris Lynge | Lecidea polaris | SPECIES | ACCEPTED | Fungi | ⋯ | 5 | 95 | 180 | 10848190 | 8296 | 2569865 | 3439134 | Lecanoromycetes | Lecidea polaris | 2652 |
99 | EXACT | FALSE | 2600912 | NA | Leptogium austroamericanum (Malme) C.W.Dodge | Leptogium austroamericanum | SPECIES | ACCEPTED | Fungi | ⋯ | 5 | 95 | 180 | 1055 | 8377 | 2600911 | 2600912 | Lecanoromycetes | Leptogium austroamericanum | 2691 |
99 | EXACT | FALSE | 2601899 | NA | Nadvornikia hawaiensis (Tuck.) Tibell | Nadvornikia hawaiensis | SPECIES | ACCEPTED | Fungi | ⋯ | 5 | 95 | 180 | 1279 | 2176 | 2601898 | 2601899 | Lecanoromycetes | Nadvornikia hawaiensis | 3137 |
98 | EXACT | TRUE | 5517083 | 10809472 | Opegrapha cypressi R.C.Harris | Opegrapha cypressi | SPECIES | SYNONYM | Fungi | ⋯ | 5 | 95 | 313 | 1273 | 8359 | 10747378 | 10809472 | Arthoniomycetes | Opegrapha cypressi | 3262 |
99 | EXACT | FALSE | 3424580 | NA | Pannaria conoplea (Ach.) Bory | Pannaria conoplea | SPECIES | ACCEPTED | Fungi | ⋯ | 5 | 95 | 180 | 1055 | 4119 | 2587001 | 3424580 | Lecanoromycetes | Pannaria conoplea | 3379 |
98 | EXACT | TRUE | 2605920 | 2605922 | Parmelia omphalodes subsp. pinnatifida (Kurok) Skult | Parmelia omphalodes pinnatifida | SUBSPECIES | SYNONYM | Fungi | ⋯ | 5 | 95 | 180 | 1048 | 8305 | 2576643 | 2605922 | Lecanoromycetes | Parmelia omphalodes ssp. pinnatifida | 3399 |
97 | EXACT | FALSE | 2601157 | NA | Peltigera venosa (L.) Hoffm. | Peltigera venosa | SPECIES | ACCEPTED | Fungi | ⋯ | 5 | 95 | 180 | 1055 | 8373 | 2601137 | 2601157 | Lecanoromycetes | Peltigera venosa | 3455 |
99 | EXACT | FALSE | 2600005 | NA | Pertusaria wulfenioides B.de Lesd. | Pertusaria wulfenioides | SPECIES | ACCEPTED | Fungi | ⋯ | 5 | 95 | 180 | 1051 | 4113 | 2599923 | 2600005 | Lecanoromycetes | Pertusaria wulfenioides | 3501 |
98 | EXACT | TRUE | 8460032 | 3291408 | Phaeographina explicans Fink | Phaeographina explicans | SPECIES | SYNONYM | Fungi | ⋯ | 5 | 95 | 180 | 1279 | 2176 | 8269656 | 3291408 | Lecanoromycetes | Phaeographina explicans | 3516 |
99 | EXACT | FALSE | 3477254 | NA | Phylliscum demangeonii (Moug. & Mont.) Nyl. | Phylliscum demangeonii | SPECIES | ACCEPTED | Fungi | ⋯ | 5 | 95 | 314 | 1049 | 4104 | 3477247 | 3477254 | Lichinomycetes | Phylliscum demangeonii | 3568 |
99 | EXACT | FALSE | 7251966 | NA | Placopsis cribellans (Nyl.) Räsänen | Placopsis cribellans | SPECIES | ACCEPTED | Fungi | ⋯ | 5 | 95 | 180 | 1042 | 8344 | 7954402 | 7251966 | Lecanoromycetes | Placopsis cribellans | 3656 |
98 | EXACT | TRUE | 3528094 | 10787294 | Polymeridium proponens (Nyl.) R.C.Harris | Polymeridium proponens | SPECIES | SYNONYM | Fungi | ⋯ | 5 | 95 | 183 | 7186734 | 4845 | 10702635 | 10787294 | Dothideomycetes | Polymeridium proponens | 3745 |
95 | FUZZY | FALSE | 2603432 | NA | Porpidia contraponenda (Arnold) Knoph & Hertel | Porpidia contraponenda | SPECIES | ACCEPTED | Fungi | ⋯ | 5 | 95 | 180 | 10848190 | 8296 | 2603408 | 2603432 | Lecanoromycetes | Porpidia contrapoenda | 3770 |
99 | EXACT | FALSE | 5261389 | NA | Psora crenata (Taylor) Reinke | Psora crenata | SPECIES | ACCEPTED | Fungi | ⋯ | 5 | 95 | 180 | 1048 | 4813 | 5953400 | 5261389 | Lecanoromycetes | Psora crenata | 3857 |
99 | EXACT | FALSE | 5489996 | NA | Pyrenula laetior Müll.Arg. | Pyrenula laetior | SPECIES | ACCEPTED | Fungi | ⋯ | 5 | 95 | 178 | 1044 | 8352 | 3269671 | 5489996 | Eurotiomycetes | Pyrenula laetior | 3914 |
99 | EXACT | FALSE | 3425847 | NA | Rhizocarpon rittokense (Hellb.) Th.Fr. | Rhizocarpon rittokense | SPECIES | ACCEPTED | Fungi | ⋯ | 5 | 95 | 180 | 10789844 | 8371 | 2600386 | 3425847 | Lecanoromycetes | Rhizocarpon rittokense | 3989 |
99 | EXACT | FALSE | 2609027 | NA | Rinodina efflorescens Malme | Rinodina efflorescens | SPECIES | ACCEPTED | Fungi | ⋯ | 5 | 95 | 180 | 10861608 | 8369 | 2599871 | 2609027 | Lecanoromycetes | Rinodina efflorescens | 4031 |
99 | EXACT | FALSE | 8664645 | NA | Schismatomma vernans (Tuck.) Zahlbr. | Schismatomma vernans | SPECIES | ACCEPTED | Fungi | ⋯ | 5 | 95 | 313 | 1273 | 8359 | 2592839 | 8664645 | Arthoniomycetes | Schismatomma vernans | 4206 |
99 | EXACT | FALSE | 2610322 | NA | Sclerophora peronella (Ach.) Tibell | Sclerophora peronella | SPECIES | ACCEPTED | Fungi | ⋯ | 5 | 95 | 10874653 | 10699167 | 6243 | 2610321 | 2610322 | Coniocybomycetes | Sclerophora peronella | 4236 |
98 | EXACT | TRUE | 2608024 | 10688503 | Toninia opuntioides (Vill.) Timdal | Toninia opuntioides | SPECIES | SYNONYM | Fungi | ⋯ | 5 | 95 | 180 | 1048 | 4812 | 7251896 | 10688503 | Lecanoromycetes | Toninia opuntioides | 4721 |
99 | EXACT | FALSE | 2599776 | NA | Trapeliopsis granulosa (Hoffm.) Lumbsch | Trapeliopsis granulosa | SPECIES | ACCEPTED | Fungi | ⋯ | 5 | 95 | 180 | 1042 | 8344 | 2599761 | 2599776 | Lecanoromycetes | Trapeliopsis granulosa | 4737 |
98 | EXACT | TRUE | 11240236 | 6496469 | Triphora minima Pease, 1871 | Triphora minima | SPECIES | SYNONYM | Animalia | ⋯ | 1 | 52 | 225 | NA | 2660 | 2299148 | 6496469 | Gastropoda | Triphora minima | 4775 |
99 | EXACT | FALSE | 5260494 | NA | Umbilicaria cinereorufescens (Schaer.) Frey | Umbilicaria cinereorufescens | SPECIES | ACCEPTED | Fungi | ⋯ | 5 | 95 | 180 | 1275 | 4108 | 2600611 | 5260494 | Lecanoromycetes | Umbilicaria cinereorufescens | 4811 |
99 | EXACT | FALSE | 2606076 | NA | Usnea diplotypus Vain. | Usnea diplotypus | SPECIES | ACCEPTED | Fungi | ⋯ | 5 | 95 | 180 | 1048 | 8305 | 2605982 | 2606076 | Lecanoromycetes | Usnea diplotypus | 4824 |
99 | EXACT | FALSE | 2604141 | NA | Xanthoparmelia neochlorochroa Hale | Xanthoparmelia neochlorochroa | SPECIES | ACCEPTED | Fungi | ⋯ | 5 | 95 | 180 | 1048 | 8305 | 2603781 | 2604141 | Lecanoromycetes | Xanthoparmelia neochlorochroa | 4953 |
The only “animal” in this dataset is Triphora minima. GBIF matched it to a gastropod, but the same name also belongs to an orchid species. To get the correct match, we could re-run the query for this particular name, adding the parameter kingdom = "Plantae". You should not apply this restriction from the start unless you are 100% sure that all your data belong to that group. The species labelled as fungi are probably lichens, as in the case of Usnea, and their classification as fungi is plausible, as the TRY database is known to include lichens.
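A restricted re-query of this kind could be sketched as follows. The helper name match_in_kingdom is hypothetical; it only wraps name_backbone() from rgbif, which accepts a kingdom argument. Running it needs the rgbif package and internet access.

```r
# Hypothetical helper: query a single name against the GBIF backbone,
# restricted to one kingdom (calls the GBIF API when executed)
match_in_kingdom <- function(name, kingdom = "Plantae") {
  rgbif::name_backbone(name = name, kingdom = kingdom)
}

# Usage (not run here because it contacts the GBIF API):
# match_in_kingdom("Triphora minima")
```

Restricting single, known-ambiguous names in this way keeps the initial query unbiased while still letting you resolve homonyms across kingdoms.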
Let’s check the “HIGHERRANK” matches. (“scientificName” includes authors, “canonicalName” is without authors, and “verbatim_name” is the name queried.)
resP[matchType == "HIGHERRANK", c("scientificName", "canonicalName", "verbatim_name")]
scientificName | canonicalName | verbatim_name |
---|---|---|
<chr> | <chr> | <chr> |
Abies Mill. | Abies | Abies sp |
Acacia Mill. | Acacia | Acacia contriva |
Acacia Mill. | Acacia | Acacia sp1887 |
Aconitum L. | Aconitum | Aconitum napellus L. Orthodox |
Adenanthera L. | Adenanthera | Adenanthera sp. |
Aesculus L. | Aesculus | Aesculus xworlitzensis |
Aloinopsis Schwantes | Aloinopsis | Aloinopsis gydouwensis |
Alyssum L. | Alyssum | Alyssum thunbergii Moq. Orthodox p |
Ampelocera Klotzsch | Ampelocera | Ampelocera indet |
Asphodeline Rchb. | Asphodeline | Asphodeline fistulosus |
Aucoumea Pierre | Aucoumea | Aucoumea sp. |
Poaceae | Poaceae | Avena nervosa |
Neobartsia Uribe-Convers & Tank | Neobartsia | Bartsia lamiflora |
Beccarianthus Cogn. | Beccarianthus | Beccarianthus sp |
Beta maritima L. | Beta maritima | Beta maritima subsp maritima |
Betula L. | Betula | Betula borealis |
Urticaceae | Urticaceae | Boehmeria spicata |
Bromus L. | Bromus | Bromus erectus Huds. Orthodox |
Calluna vulgaris (L.) Hull | Calluna vulgaris | Calluna vulgaris alp (L.) Hull |
Campanula stevenii M.Bieb. | Campanula stevenii | Campanula stevenii M.Bieb. subsp. beauverdiana (Fomin) Rech.f. & Schiman-Czeika |
Tracheophyta | Tracheophyta | Campylospermum sp. |
Carex L. | Carex | Carex foliosa |
Carex L. | Carex | Carex nigra agg. |
Psephellus Cass. | Psephellus | Centaurea bella |
Asteraceae | Asteraceae | Centipeda orbicularis |
Chamelaucium Desf. | Chamelaucium | Chamelaucium griffinii |
Dysphania R.Br. | Dysphania | Chenopodium nepalense |
Chicorium Dumort., 1822 | Chicorium | Chicorium intybus |
Chrysophyllum L. | Chrysophyllum | Chrysophyllum sp.1 |
Citrus L. | Citrus | Citrus sp |
⋮ | ⋮ | ⋮ |
Scapania paradoxa R.M.Schust. | Scapania paradoxa | Scapania paradoxa var. paradoxa |
Schuurmansia Blume | Schuurmansia | Schuurmansia sp |
Scirpus Tourn. ex L. | Scirpus | Scirpus sp. |
Cyperaceae | Cyperaceae | Scirpus fistulosus |
Scrophularia L. | Scrophularia | Scrophularia auriculata L. Orthodox |
Senna Mill. | Senna | Senna form taxon petiolaris |
Sisymbrium L. | Sisymbrium | Sisymbrium sp |
Solanum L. | Solanum | Solanum fendleri |
Solanum L. | Solanum | Solanum quercifolium |
Fabaceae | Fabaceae | Sophora viciifolia |
Sorghum Moench | Sorghum | Sorghum spp. |
Spermacoce L. | Spermacoce | Spermacoce assurgens |
Sphalmanthus N.E.Br. | Sphalmanthus | Sphalmanthus neilii |
Tabebuia Gomes | Tabebuia | Tabebuia billbergiana |
Asteraceae | Asteraceae | Taraxacum pumilum |
Taxillus limprichtii (Grüning) H.S.Kiu | Taxillus limprichtii | Taxillus limprichtii (Grning) H.S. Kiu |
Thecanthes Wikstr. | Thecanthes | Thecanthes sp |
Tripleurospermum Sch.Bip. | Tripleurospermum | Tripleurospermum hampeanum |
Ursinia Gaertn. | Ursinia | Ursinia sp. |
Urtica L. | Urtica | Urtica major |
Vallisneria P.Micheli ex L. | Vallisneria | Vallisneria tortissima |
Vangueria Juss. | Vangueria | Vangueria sp. |
Vepris Comm. ex A.Juss. | Vepris | Vepris fischeri |
Asteraceae | Asteraceae | Vernonia praecox |
Asteraceae | Asteraceae | Vernonia subsessilis |
Veronica L. | Veronica | Veronica sp |
Vignea P.Beauv. | Vignea | Vignea lapponica |
Viola L. | Viola | Viola repens |
Zollernia Wied-Neuw. & Nees | Zollernia | Zollernia sp |
Zygogynum Baill. | Zygogynum | Zygogynum pancheri_Dogny (Baill.) Vink |
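Many of these higher-rank matches involve names whose species epithet is given as “sp”/“sp.”, or names where author strings or other trailing text may have been interpreted as part of the epithet. Preprocessing the names (see name parsing) may help mitigate these issues. For example, “Aconitum napellus L. Orthodox” was only matched to the genus, whereas the cleaned name is matched exactly: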
name_backbone("Aconitum napellus")
usageKey | scientificName | canonicalName | rank | status | confidence | matchType | kingdom | phylum | order | ⋯ | kingdomKey | phylumKey | classKey | orderKey | familyKey | genusKey | speciesKey | synonym | class | verbatim_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<int> | <chr> | <chr> | <chr> | <chr> | <int> | <chr> | <chr> | <chr> | <chr> | ⋯ | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <lgl> | <chr> | <chr> |
3033665 | Aconitum napellus L. | Aconitum napellus | SPECIES | ACCEPTED | 97 | EXACT | Plantae | Tracheophyta | Ranunculales | ⋯ | 6 | 7707728 | 220 | 399 | 2410 | 3033663 | 3033665 | FALSE | Magnoliopsida | Aconitum napellus |
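The kind of preprocessing mentioned above could be sketched like this. The helper clean_plant_name is hypothetical and covers only two of the patterns seen in the higher-rank matches (a trailing “Orthodox” flag and placeholders for undetermined species); a real workflow would need more rules or a dedicated name parser.

```r
# Hypothetical helper: strip some non-name residue from plant names
clean_plant_name <- function(x) {
  # drop a trailing "Orthodox" flag and anything that follows it
  x <- sub("\\s+Orthodox\\b.*$", "", x, perl = TRUE)
  # drop placeholders for undetermined species, e.g. "sp", "sp.", "spp.",
  # "sp.1", "sp1887" or "indet", leaving only the genus
  x <- sub("\\s+(spp|sp|indet)\\.?\\d*\\b.*$", "", x, perl = TRUE)
  trimws(x)
}

cleaned <- clean_plant_name(c(
  "Abies sp",
  "Chrysophyllum sp.1",
  "Aconitum napellus L. Orthodox",
  "Boehmeria spicata"
))
cleaned
```

The first two names are reduced to their genus, the third keeps name and author but loses the “Orthodox” flag, and the last one is left untouched because “spicata” merely starts with “sp”. Reducing “sp.” names to the genus makes it explicit that only a genus-level match can be expected for them.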
Let’s check the animals now.
table(resA$kingdom)
sum(is.na(resA$kingdom))
table(resA$matchType)
Animalia
4821
EXACT FUZZY HIGHERRANK NONE
4408 282 131 179
Here, classification into animals was unambiguous. Let’s check the higher rank matches.
resA[matchType == "HIGHERRANK", c("scientificName", "canonicalName", "verbatim_name")]
scientificName | canonicalName | verbatim_name |
---|---|---|
<chr> | <chr> | <chr> |
Accipiter Brisson, 1760 | Accipiter | Accipiter spp.-5 Rothschild & Hartert, 1926 |
Achillides chikae | Achillides chikae | Achillides chikae chikae |
Achillides chikae | Achillides chikae | Achillides chikae hermeli Nuyda, 1992 |
Animalia | Animalia | Acropora morphospec1 Veron & Wallace, 1984 |
Acroporidae | Acroporidae | Acroporidae_Acropora cuneata (Dana, 1846) |
Animalia | Animalia | Acropora josephi George & Sukumaran, 2007 |
Animalia | Animalia | Acropora mannaqensis George & Sukumaran, 2007 |
Acroporidae | Acroporidae | Acroporidae Acropora subglabra (Brook, 1891) |
Acroporidae | Acroporidae | Acroporidae_Acropora tanegashimensis Veron, 1990 |
Animalia | Animalia | Acropora thomasi George & Sukumaran, 2007 |
Animalia | Animalia | Acropora valimunensis George & Sukumaran, 2007 |
Strigidae | Strigidae | Strigidae Aegolius funereus (Linnaeus, 1758) |
Anatidae | Anatidae | Anatidae Anas nesiotis (J. H. Fleming, 1935) |
Trochilidae | Trochilidae | Trochilidae_Androdon aequatorialis Gould, 1863 |
Antigone antigone (Linnaeus, 1758) | Antigone antigone | Antigone antigone canadensis (Linnaeus, 1758) |
Antigone antigone (Linnaeus, 1758) | Antigone antigone | Antigone antigone canadensis nesiotes (Bangs & Zappey, 1905) |
Antigone antigone (Linnaeus, 1758) | Antigone antigone | Antigone antigone rubicunda (Perry, 1810) |
Astrapia Vieillot, 1816 | Astrapia | Astrapia sephaniae (Finsch, 1885) |
Australomussa Veron, 1985 | Australomussa | Australomussa roleyensis Veron, 1985 |
Brookesia Gray, 1865 | Brookesia | Brookesia spp. Q Brygoo & Domergue, 1975 |
Brookesia Gray, 1865 | Brookesia | Brookesia spec._F |
Buteo Lacepede, 1799 | Buteo | BUTEO sp.N (Linnaeus, 1758) |
Calumma Gray, 1865 | Calumma | Calumma spec.-H (Hillenius, 1959) |
Candoia Gray, 1842 | Candoia | Candoia sp_N (Duméril & Bibron, 1844) |
Centrolene Jiménez de la Espada, 1872 | Centrolene | Centrolene bukleyi (Boulenger, 1882) |
Cerdocyon C.E.H.Smith, 1839 | Cerdocyon | Cerdocyon thuos (Linnaeus, 1766) |
Chersobius Fitzinger, 1835 | Chersobius | Chersobius chersobius signatus (Gmelin, 1789) |
Chilabothrus A.M.C.Duméril & Bibron, 1844 | Chilabothrus | Chilabothrus chilabothrus chrysogaster (Cope, 1871) |
Chilabothrus A.M.C.Duméril & Bibron, 1844 | Chilabothrus | Chilabothrus chilabothrus fordii (Günther, 1861) |
Antipathidae | Antipathidae | Antipathidae Cirrhipathes sieboldii Blainville, 1834 |
⋮ | ⋮ | ⋮ |
Poritidae | Poritidae | Poritidae_Porites lichen Dana, 1846 |
Animalia | Animalia | Pristis |
Psephotellus Mathews, 1913 | Psephotellus | Psephotellus psephotellus pulcherrimus (Gould, 1845) |
Pseudastur Mayr, 1998 | Pseudastur | Pseudastur pseudastur albicollis (Latham, 1790) |
Pseudocordylus Smith, 1838 | Pseudocordylus | Pseudocordylus pseudocordylus melanotus A. Smith, 1838 |
Pseudocordylus Smith, 1838 | Pseudocordylus | Pseudocordylus pseudocordylus spinosus FitzSimons, 1947 |
Pseudocordylus Smith, 1838 | Pseudocordylus | Pseudocordylus sp._L A. Smith, 1838 |
Pyrrhura Bonaparte, 1856 | Pyrrhura | Pyrrhura alipectus Chapman, 1914 |
Chordata | Chordata | Salvator salvator merianae Duméril & Bibron, 1839 |
Sarcoramphus Dumeril, 1805 | Sarcoramphus | Sarcoramphus spec.Z |
Scaphirhynchus Heckel, 1836 | Scaphirhynchus | Scaphirhynchus sp K |
Schizoculina Wells, 1937 | Schizoculina | Schizoculina sp-Y Wells 1937 |
Smaug Stanley, Bauer, Jackman, Branch & Mouton, 2011 | Smaug | Smaug morphospec_U |
Smaug Stanley, Bauer, Jackman, Branch & Mouton, 2011 | Smaug | Smaug smaug breyeri Van Dam, 1921 |
Smaug Stanley, Bauer, Jackman, Branch & Mouton, 2011 | Smaug | Smaug smaug giganteus A. Smith, 1844 |
Smaug Stanley, Bauer, Jackman, Branch & Mouton, 2011 | Smaug | Smaug smaug mossambicus FitzSimons, 1958 |
Smaug Stanley, Bauer, Jackman, Branch & Mouton, 2011 | Smaug | Smaug smaug regius Broadley, 1962 |
Smaug Stanley, Bauer, Jackman, Branch & Mouton, 2011 | Smaug | Smaug smaug vandami FitzSimons, 1930 |
Animalia | Animalia | STENELLA J. E. Gray, 1866 |
Stylaster Gray, 1831 | Stylaster | Stylaster sp.-1 Quelch, 1884 |
Animalia | Animalia | Tanygnathus |
Animalia | Animalia | Tanygnathus ssp_Z (Boddaert, 1783) |
Tapirus Brisson, 1762 | Tapirus | Tapirus teqrestris (Linnaeus, 1758) |
Toropuku Nielsen, Bauer, Jackman, Hitchmough & Daugherty, 2011 | Toropuku | Toropuku toropuku stephensi Robb, 1980 |
Touit G.R.Gray, 1855 | Touit | Touit hueii (Temminck, 1830) |
Trachypithecus Reichenbach, 1862 | Trachypithecus | Trachypithecus villosus (Griffith, 1821) |
Trichoglossus Stephens, 1826 | Trichoglossus | Trichoglossus jonstoniae Hartert, 1903 |
Tylototriton Anderson, 1871 | Tylototriton | Tylototriton daloushanensis Zhou, Xiao & Luo, 2022 |
Tympanocryptis W.C.H.Peters, 1863 | Tympanocryptis | Tympanocryptis sp. W W. C. H. Peters, 1863 |
Gekkonidae | Gekkonidae | Gekkonidae Woodworthia ""cromwell gorge"" Nielsen, Bauer, Jackman, Hitchmough & Daugherty, 2011 |
There are some problems with the classification of specific genera, like Acropora, and there are problems with genera that are erroneously repeated within names, like Smaug smaug breyeri. Additionally, family names placed before the actual scientific names should be removed. These are issues that can also be alleviated during pre-processing.
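Two of these pre-processing steps could be sketched as follows. The helper clean_animal_name is hypothetical, and its rules are assumptions derived from the examples above: a leading token ending in “idae” is treated as a family name, and a second token that repeats the genus is dropped when an epithet follows.

```r
# Hypothetical helper: remove leading family names and repeated genus tokens
clean_animal_name <- function(x) {
  # remove a family name (ending in "idae") placed before the genus,
  # separated by a space or an underscore
  x <- sub("^[A-Z][a-z]+idae[ _](?=[A-Z])", "", x, perl = TRUE)
  # drop a second token that merely repeats the genus, but only if at least
  # one more token follows, so that binomial tautonyms stay intact
  vapply(strsplit(x, "\\s+"), function(tok) {
    if (length(tok) >= 3 && tolower(tok[1]) == tolower(tok[2])) {
      tok <- tok[-2]
    }
    paste(tok, collapse = " ")
  }, character(1))
}

cleaned_animals <- clean_animal_name(c(
  "Acroporidae_Acropora cuneata (Dana, 1846)",
  "Strigidae Aegolius funereus (Linnaeus, 1758)",
  "Smaug smaug breyeri Van Dam, 1921",
  "Antigone antigone"
))
cleaned_animals
```

Be careful with the second rule: zoological nomenclature allows tautonyms, so a trinomial whose first two parts are legitimately identical would be damaged by it. It is safer to apply such a rule only to names that failed to match in their original form.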
Writing your own name resolution function
Sometimes, you may want to check datasets against a very specific reference database, or your name resolution service of choice may not use the newest version of your reference dataset. In this case, you can write your own name resolution algorithm. Be aware that there are many caveats in this process, and it will not be easy to write a function that matches the correctness and efficiency of services like GBIF. Speed is a particular concern: for big datasets, a function that cannot run in parallel, either on your own machine or on a high-performance cluster, is unlikely to deliver results within an acceptable time frame.
Let’s imagine we want to check our plants dataset against the newest version of the Leipzig Catalogue of Vascular Plants (LCVP).
# read in dataset
LCVP <- fread(paste0(.brd, "PlantHub/LCVP_PlantHub_2024-01-25.gz"))
str(LCVP)
Classes 'data.table' and 'data.frame': 1337778 obs. of 21 variables:
$ global Id : int 1 2 3 4 5 6 7 8 9 10 ...
$ Input Genus : chr "Aa" "Aa" "Aa" "Aa" ...
$ Input Epitheton : chr "argyrolepis" "aurantiaca" "brevis" "calceata" ...
$ Rank : chr "species" "species" "species" "species" ...
$ Input Subspecies Epitheton: chr "" "" "" "" ...
$ Input Authors : chr "(Rchb.f.) Rchb.f." "D.Trujillo" "Schltr." "(Rchb.f.) Schltr." ...
$ Status : chr "accepted" "accepted" "synonym" "accepted" ...
$ globalId of Output Taxon : int 1 2 819078 4 819080 6 7 8 9 10 ...
$ Output Taxon : chr "Aa argyrolepis (Rchb.f.) Rchb.f." "Aa aurantiaca D.Trujillo" "Myrosmodes breve (Schltr.) Garay" "Aa calceata (Rchb.f.) Schltr." ...
$ family : chr "Orchidaceae" "Orchidaceae" "Orchidaceae" "Orchidaceae" ...
$ Order : chr "Asparagales" "Asparagales" "Asparagales" "Asparagales" ...
$ Literature : chr "" "Lankesteriana 2011.11.1 1-8;" "" "" ...
$ Comments : chr "" "" "" "" ...
$ status : chr "accepted" "accepted" "synonym" "accepted" ...
$ nameIn : chr "Aa argyrolepis" "Aa aurantiaca" "Aa brevis" "Aa calceata" ...
$ authorsIn : chr "(Rchb.f.) Rchb.f." "D.Trujillo" "Schltr." "(Rchb.f.) Schltr." ...
$ nameOut : chr "Aa argyrolepis" "Aa aurantiaca" "Myrosmodes breve" "Aa calceata" ...
$ authorsOut : chr "(Rchb.f.) Rchb.f." "D.Trujillo" "(Schltr.) Garay" "(Rchb.f.) Schltr." ...
$ IPNIID : chr "614525-1" "77112075-1" "301821-2" "1008443-2" ...
$ WFOLink : chr "wfo-0000760991" "wfo-0000922666" "wfo-0000854509" "wfo-0000928062" ...
$ WPName : chr "Aa argyrolepis (Rchb.f.) Rchb.f." "Aa aurantiaca D.Trujillo" "Aa brevis Schltr." "Aa calceata (Rchb.f.) Schltr." ...
- attr(*, ".internal.selfref")=<externalptr>
This is an enhanced version of LCVP 2.0, with some errors corrected, and some data added. It includes ASCII-only columns nameIn, authorsIn, nameOut, and authorsOut, as well as links to IPNI, POWO, WFO, and WorldPlants.
Let’s set up a simple matching algorithm. You may work on it to include some of the main problematic cases.
First, let’s tune our reference list. We would like to be able to identify families and genera before the actual matching, and to do this efficiently, we can extract those from LCVP. We should also be able to directly match complete names with authors, so let’s create a column with those, too.
# genera
genera <- sort(unique(sub("\\s.*", "", LCVP$nameIn)))
genera <- genera[genera != ""]
genera[1:10]
# families
families <- sort(unique(LCVP$family))
families <- families[families != ""]
families[1:10]
# name + author column
LCVP[, fullNameIn := trimws(paste(nameIn, authorsIn))]
LCVP$fullNameIn[1:10]
- 'Aa'
- 'Aakia'
- 'Aalius'
- 'Aaronsohnia'
- 'Abacopteris'
- 'Abacosa'
- 'Abalon'
- 'Abama'
- 'Abapus'
- 'Abarema'
- 'Acanthaceae'
- 'Achariaceae'
- 'Achatocarpaceae'
- 'Acoraceae'
- 'Actinidiaceae'
- 'Adoxaceae'
- 'Aextoxicaceae'
- 'Afrothismiaceae'
- 'Agavaceae'
- 'Agdestidaceae'
- 'Aa argyrolepis (Rchb.f.) Rchb.f.'
- 'Aa aurantiaca D.Trujillo'
- 'Aa brevis Schltr.'
- 'Aa calceata (Rchb.f.) Schltr.'
- 'Aa chiogena Schltr.'
- 'Aa colombiana Schltr.'
- 'Aa denticulata Schltr.'
- 'Aa erosa (Rchb.f.) Schltr.'
- 'Aa fiebrigii (Schltr.) Schltr.'
- 'Aa figueroi Szlach. & S.Nowak'
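To see how these three pieces are meant to be used, here is a minimal sketch on a toy reference table. It uses a base R data.frame instead of the full LCVP so that it runs on its own; the three reference rows mirror the structure shown above, the column name globalId stands in for LCVP's “global Id” column, and the query names are made up for illustration.

```r
# Toy stand-in for LCVP with three rows
ref <- data.frame(
  globalId  = 1:3,
  nameIn    = c("Aa argyrolepis", "Aa aurantiaca", "Aa brevis"),
  authorsIn = c("(Rchb.f.) Rchb.f.", "D.Trujillo", "Schltr."),
  stringsAsFactors = FALSE
)
ref$fullNameIn <- trimws(paste(ref$nameIn, ref$authorsIn))
ref_genera <- unique(sub("\\s.*", "", ref$nameIn))

# Illustrative queries: with authors, without authors, genus only, unknown
query <- c("Aa brevis Schltr.", "Aa aurantiaca", "Aa", "Abies sp")

# match() returns the row of the first hit in the reference, or NA
idx <- match(query, ref$fullNameIn)              # names with authors first
no_hit <- is.na(idx)
idx[no_hit] <- match(query[no_hit], ref$nameIn)  # then names without authors

res <- data.frame(
  name     = query,
  LCVP_ID  = ref$globalId[idx],
  is_genus = query %in% ref_genera,
  stringsAsFactors = FALSE
)
res
```

Trying the full name with authors first and falling back to the name without authors is a design choice of this sketch. Note also that match() returns only the first hit, so a name that occurs several times in the reference with different authors would need separate handling.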
Let’s prepare a results table. For simplicity, we will store the ID of matches in LCVP when a name was found, and indicate whether it is a genus or family found in LCVP (without ID) otherwise.
resTable <- data.table(name = plants$newName, LCVP_ID = numeric(), LCVP_genus = logical(), LCVP_family = logical())
Warning message in as.data.table.list(x, keep.rownames = keep.rownames, check.names = check.names, :
"Item 2 has 0 rows but longest item has 5000; filled with NA"
Warning message in as.data.table.list(x, keep.rownames = keep.rownames, check.names = check.names, :
"Item 3 has 0 rows but longest item has 5000; filled with NA"
Warning message in as.data.table.list(x, keep.rownames = keep.rownames, check.names = check.names, :
"Item 4 has 0 rows but longest item has 5000; filled with NA"
We can ignore the warnings, which just tell us that the LCVP columns in the results table are empty for now. Let’s fill the table.
# test whether names found in genera
which(plants$newName %in% genera)
# test whether names found in families
which(plants$newName %in% families)
# write data into results
resTable[plants$newName %in% genera, LCVP_genus := TRUE]
resTable[plants$newName %in% families, LCVP_family := TRUE]
# test whether names in nameIn, i.e. names without authors
which(plants$newName %in% LCVP$nameIn)
# test whether names in fullNameIn, i.e. names with authors
which(plants$newName %in% LCVP$fullNameIn)
- 5
- 62
- 108
- 517
- 664
- 1157
- 1206
- 1410
- 1509
- 1592
- 1956
- 1957
- 1959
- 2085
- 2185
- 2290
- 2452
- 2660
- 2692
- 2864
- 2892
- 2904
- 3125
- 3639
- 3679
- 4029
- 4452
- 4642
- 4703
- 4930
- 4
- 6
- 8
- 9
- 12
- 13
- 15
- 16
- 20
- 21
- 25
- 28
- 29
- 30
- 32
- 33
- 35
- 36
- 39
- 41
- 42
- 52
- 53
- 54
- 56
- 57
- 58
- 59
- 60
- 61
- 65
- 68
- 71
- 72
- 73
- 75
- 79
- 80
- 81
- 82
- 84
- 89
- 90
- 91
- 92
- 93
- 94
- 95
- 96
- 98
- 99
- 100
- 102
- 105
- 106
- 107
- 109
- 110
- 111
- 112
- 113
- 114
- 115
- 116
- 119
- 120
- 123
- 124
- 126
- 127
- 130
- 131
- 135
- 136
- 137
- 139
- 142
- 143
- 144
- 148
- 149
- 150
- 151
- 152
- 154
- 157
- 160
- 161
- 162
- 163
- 164
- 165
- 166
- 167
- 168
- 170
- 171
- 173
- 179
- 180
- 181
- 182
- 184
- 188
- 190
- 192
- 193
- 194
- 195
- 196
- 197
- 198
- 199
- 202
- 205
- 207
- 211
- 212
- 213
- 214
- 217
- 218
- 220
- 221
- 224
- 227
- 228
- 229
- 230
- 231
- 232
- 233
- 234
- 235
- 237
- 238
- 239
- 243
- 246
- 249
- 250
- 252
- 253
- 254
- 255
- 258
- 260
- 262
- 263
- 264
- 265
- 267
- 268
- 270
- 271
- 274
- 276
- 277
- 282
- 283
- 284
- 285
- 286
- 287
- 289
- 290
- 293
- 295
- 297
- 299
- 300
- 301
- 302
- 304
- 305
- 306
- 307
- 308
- 309
- 310
- 312
- 313
- 314
- 316
- 317
- 319
- 320
- 322
- 323
- 324
- 325
- 326
- 328
- 329
- 330
- 331
- 332
- 334
- 337
- 338
- ⋯
- 4686
- 4687
- 4688
- 4689
- 4693
- 4694
- 4695
- 4697
- 4699
- 4702
- 4704
- 4705
- 4706
- 4707
- 4708
- 4710
- 4711
- 4712
- 4713
- 4714
- 4715
- 4716
- 4718
- 4719
- 4722
- 4726
- 4727
- 4734
- 4735
- 4736
- 4739
- 4740
- 4744
- 4746
- 4747
- 4749
- 4750
- 4751
- 4752
- 4753
- 4756
- 4757
- 4759
- 4762
- 4763
- 4764
- 4767
- 4768
- 4770
- 4771
- 4772
- 4773
- 4774
- 4775
- 4777
- 4778
- 4779
- 4780
- 4781
- 4782
- 4783
- 4784
- 4786
- 4789
- 4790
- 4791
- 4792
- 4793
- 4796
- 4797
- 4798
- 4799
- 4800
- 4801
- 4803
- 4804
- 4805
- 4807
- 4813
- 4814
- 4816
- 4817
- 4818
- 4820
- 4821
- 4823
- 4825
- 4826
- 4829
- 4831
- 4832
- 4835
- 4837
- 4838
- 4844
- 4846
- 4847
- 4848
- 4849
- 4851
- 4854
- 4856
- 4858
- 4862
- 4863
- 4865
- 4866
- 4867
- 4869
- 4870
- 4871
- 4872
- 4873
- 4874
- 4875
- 4876
- 4879
- 4880
- 4883
- 4884
- 4885
- 4889
- 4892
- 4893
- 4894
- 4896
- 4897
- 4899
- 4900
- 4902
- 4903
- 4904
- 4906
- 4907
- 4908
- 4909
- 4913
- 4916
- 4917
- 4918
- 4919
- 4920
- 4923
- 4924
- 4925
- 4926
- 4927
- 4928
- 4929
- 4932
- 4933
- 4934
- 4935
- 4937
- 4938
- 4939
- 4940
- 4941
- 4943
- 4944
- 4946
- 4948
- 4949
- 4950
- 4952
- 4955
- 4956
- 4957
- 4958
- 4959
- 4961
- 4963
- 4964
- 4965
- 4967
- 4969
- 4970
- 4971
- 4972
- 4973
- 4974
- 4975
- 4976
- 4977
- 4979
- 4980
- 4981
- 4982
- 4983
- 4984
- 4985
- 4986
- 4988
- 4989
- 4991
- 4992
- 4994
- 4995
- 4996
- 5000
- 14
- 17
- 18
- 24
- 27
- 40
- 55
- 78
- 83
- 86
- 97
- 101
- 104
- 117
- 118
- 121
- 122
- 129
- 132
- 147
- 156
- 176
- 178
- 183
- 185
- 186
- 200
- 201
- 203
- 206
- 208
- 215
- 216
- 219
- 222
- 241
- 248
- 251
- 257
- 272
- 278
- 291
- 294
- 296
- 298
- 303
- 336
- 342
- 345
- 347
- 349
- 358
- 360
- 372
- 384
- 434
- 436
- 438
- 439
- 440
- 451
- 456
- 458
- 459
- 469
- 475
- 483
- 487
- 488
- 501
- 502
- 535
- 537
- 561
- 571
- 572
- 575
- 577
- 578
- 580
- 582
- 584
- 585
- 586
- 587
- 592
- 593
- 601
- 610
- 621
- 632
- 653
- 657
- 672
- 673
- 682
- 689
- 694
- 706
- 713
- 716
- 717
- 725
- 740
- 741
- 744
- 745
- 746
- 757
- 760
- 775
- 796
- 800
- 809
- 841
- 845
- 847
- 855
- 868
- 881
- 884
- 893
- 899
- 903
- 908
- 915
- 918
- 922
- 927
- 930
- 948
- 952
- 953
- 955
- 974
- 980
- 990
- 992
- 998
- 1000
- 1009
- 1010
- 1011
- 1016
- 1018
- 1019
- 1028
- 1032
- 1033
- 1042
- 1045
- 1062
- 1066
- 1068
- 1071
- 1073
- 1079
- 1099
- 1114
- 1119
- 1134
- 1167
- 1175
- 1181
- 1187
- 1188
- 1193
- 1195
- 1196
- 1197
- 1198
- 1222
- 1223
- 1229
- 1239
- 1243
- 1255
- 1265
- 1273
- 1278
- 1285
- 1286
- 1287
- 1289
- 1290
- 1292
- 1295
- 1297
- 1298
- 1299
- 1300
- 1301
- 1306
- 1308
- 1328
- 1332
- 1338
- 1339
- 1346
- 1360
- ⋯
- 3495
- 3497
- 3509
- 3511
- 3512
- 3518
- 3536
- 3545
- 3546
- 3550
- 3559
- 3572
- 3577
- 3583
- 3586
- 3589
- 3593
- 3599
- 3604
- 3606
- 3607
- 3608
- 3611
- 3612
- 3614
- 3615
- 3616
- 3626
- 3631
- 3634
- 3652
- 3664
- 3665
- 3676
- 3704
- 3706
- 3714
- 3720
- 3726
- 3732
- 3734
- 3736
- 3737
- 3739
- 3740
- 3749
- 3752
- 3778
- 3786
- 3788
- 3792
- 3794
- 3817
- 3834
- 3848
- 3856
- 3879
- 3885
- 3891
- 3903
- 3921
- 3922
- 3938
- 3948
- 3949
- 3956
- 3966
- 3973
- 3979
- 3985
- 3991
- 4002
- 4020
- 4051
- 4069
- 4070
- 4089
- 4102
- 4106
- 4137
- 4141
- 4146
- 4148
- 4154
- 4164
- 4166
- 4203
- 4207
- 4220
- 4242
- 4252
- 4257
- 4269
- 4274
- 4278
- 4283
- 4304
- 4307
- 4311
- 4317
- 4319
- 4321
- 4322
- 4329
- 4332
- 4339
- 4340
- 4342
- 4346
- 4347
- 4353
- 4355
- 4367
- 4369
- 4383
- 4388
- 4389
- 4393
- 4394
- 4396
- 4398
- 4402
- 4404
- 4408
- 4409
- 4416
- 4449
- 4456
- 4458
- 4461
- 4492
- 4503
- 4509
- 4512
- 4515
- 4521
- 4527
- 4528
- 4532
- 4535
- 4544
- 4548
- 4549
- 4555
- 4560
- 4583
- 4601
- 4605
- 4606
- 4623
- 4639
- 4658
- 4659
- 4676
- 4691
- 4692
- 4700
- 4701
- 4709
- 4730
- 4741
- 4743
- 4748
- 4754
- 4755
- 4758
- 4765
- 4769
- 4794
- 4819
- 4836
- 4839
- 4840
- 4853
- 4857
- 4859
- 4860
- 4861
- 4864
- 4868
- 4878
- 4882
- 4886
- 4890
- 4895
- 4901
- 4905
- 4911
- 4914
- 4936
- 4945
- 4947
- 4951
- 4960
- 4962
- 4968
- 4987
- 4990
- 4998
- 4999
As we can see, there are many matches both when searching with and without authors. However, a name without authors may occur more than once in the reference list (such names are called homonyms). Only one of those will be an accepted name, while the others are synonyms. As a name matched without authors cannot be disambiguated, we will assign the ID of the accepted name from the reference list whenever there are several candidates. To do this, we create a copy of the reference list, order it by taxonomic status so that accepted names come first, and only keep the first of several rows with identical names without authors. We then use this reduced list to extract the respective IDs.
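# create a copy of the reference list
# (copy() is needed: plain assignment does not copy a data.table,
# so setorder() below would otherwise also reorder LCVP itself)
LCVPUnique <- copy(LCVP)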
# order by taxonomic status
setorder(LCVPUnique, status)
# keep only the first of several rows with identical names without authors
LCVPUnique <- unique(LCVPUnique, by = "nameIn")
# check whether it worked
nrow(LCVPUnique)
nrow(LCVP)
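The order-then-deduplicate step can be tried on a toy table first. The following is a minimal sketch in base R rather than the data.table code used above; the three rows, the `id` column and the status labels are made up for illustration. Ranking the statuses explicitly is an assumption of this sketch: it makes the preference for accepted names independent of how the labels happen to sort alphabetically.

```r
# toy reference list with a homonym: "Abies alba" occurs twice (made-up rows)
ref <- data.frame(
  id     = c(1, 2, 3),
  nameIn = c("Abies alba", "Abies alba", "Acer campestre"),
  status = c("synonym", "accepted", "accepted"),
  stringsAsFactors = FALSE
)
# rank the statuses explicitly so that accepted names sort first (assumed labels)
statusRank <- match(ref$status, c("accepted", "synonym", "unresolved"))
ref <- ref[order(statusRank), ]
# keep only the first row per name without authors
refUnique <- ref[!duplicated(ref$nameIn), ]
print(refUnique)
```

Of the two "Abies alba" rows, only the accepted one (id 2) survives.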
We removed about 80000 names from LCVP in this process. Let’s now get the IDs.
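# write data into results, extract LCVP ID
# mult = "first" guarantees exactly one result row per queried name,
# so that res stays aligned with resTable even if a full name occurs twice
setkey(LCVP, fullNameIn)
res <- LCVP[plants$newName, mult = "first"]
resTable[is.na(LCVP_ID), LCVP_ID := res$`global Id`[is.na(resTable$LCVP_ID)]]
setkey(LCVPUnique, nameIn)
res <- LCVPUnique[plants$newName]
resTable[is.na(LCVP_ID), LCVP_ID := res$`global Id`[is.na(resTable$LCVP_ID)]]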
We can now check what remains of the names in our list. The remainder will be the difficult part, where the algorithm used actually matters.
plants[, matched := FALSE]
plants[!is.na(resTable$LCVP_ID) | !is.na(resTable$LCVP_genus) | !is.na(resTable$LCVP_family), matched := TRUE]
table(plants$matched)
FALSE TRUE
1180 3820
Of the 5000 names we had to check, 1180 remain unmatched. This is a relatively large number because this dataset is especially messy, which makes it good material to practice on. Let’s have a look at the unmatched names.
plants[matched == FALSE]$newName[1:20]
- ''
- '(lauraceae) pubescente'
- '?Betulaceae sp.'
- 'Abies sp'
- 'Abuta_panamensis'
- 'Abutilon grandiflorum G.Don Orthodox'
- 'Acacia contriva'
- 'Acacia eremophila W.Fitzg. var. variabilis Maiden & Blakeley'
- 'Acacia flavescens A.Cunn. ex Benth. Orthodox?'
- 'Acacia incanicarpa A.R.Chapman & Maslin'
- 'Acacia mucronata Willd. ex H.L.Wendl. subsp. mucronata'
- 'Acacia plectocarpa A.Cunn. ex Benth. Orthodox'
- 'Acacia sclerosperma F.Muell. subsp. sclerosperma'
- 'Acacia sp1887'
- 'Acacia auriculiformis A.Cunn. ex Benth.'
- 'Acacia colei Maslin & L.A.J.Thomson'
- 'Acacia drummondii Lindl.'
- 'Acacia hadrophylla R.S.Cowan & Maslin'
- 'Acacia lazaridis Pedley'
- 'Acacia neriifolia A.Cunn. ex Benth.'
It seems that we should first get rid of the author names, as their spelling may differ from the one in LCVP and therefore prevent a match. A very simple way of doing so is to cut names after the second whitespace.
# function to extract first two words
nameShorter <- function(x) {
  # get positions of whitespaces
  ws <- gregexpr(" ", x)
  # get position of second whitespace if available, otherwise return 0
  ws <- sapply(ws, function(x) if (length(x) > 1) x[2] else 0)
  x[ws > 0] <- substr(x[ws > 0], 1, ws[ws > 0] - 1)
  return(x)
}
print(nameShorter(plants$newName[40:50]))
[1] "Acacia tortilis" "Acacia valida"
[3] "Acacia yorkrakinensis" "Acacia auriculiformis A.Cunn. ex"
[5] "Acacia colei Maslin &" "Acacia drummondii Lindl."
[7] "Acacia hadrophylla R.S.Cowan &" "Acacia lazaridis Pedley"
[9] "Acacia neriifolia A.Cunn. ex" "Acacia ptychoclada Maiden &"
[11] "Acacia speckii R.S.Cowan &"
We see that some of the short names are not as expected: there are still author names attached to them. The reason is that these names contain protected (non-breaking) whitespaces, which are not recognized as regular whitespaces. We need to replace them first.
# function to extract first two words
nameShorter <- function(x) {
  # replace protected (non-breaking) whitespaces with regular ones
  x <- gsub("\xc2\xa0", " ", x)
  # get positions of whitespaces
  ws <- gregexpr(" ", x)
  # get position of second whitespace if available, otherwise return 0
  ws <- sapply(ws, function(x) if (length(x) > 1) x[2] else 0)
  x[ws > 0] <- substr(x[ws > 0], 1, ws[ws > 0] - 1)
  return(x)
}
print(nameShorter(plants$newName[40:50]))
[1] "Acacia tortilis" "Acacia valida" "Acacia yorkrakinensis"
[4] "Acacia auriculiformis" "Acacia colei" "Acacia drummondii"
[7] "Acacia hadrophylla" "Acacia lazaridis" "Acacia neriifolia"
[10] "Acacia ptychoclada" "Acacia speckii"
This looks much nicer. We can now do the name matching without authors again. Note that, ideally, we would not just match without authors, but also measure the difference between author names so that we actually select the closest match.
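A minimal sketch of that idea, assuming the author string has been kept separately from the short name: among several reference entries with the same name, pick the one whose authors are closest in terms of Levenshtein distance. The candidate rows, their IDs and the second author ("Smith") are made up for illustration; only `adist()` from base R is used.

```r
# candidate reference entries sharing the same name without authors (made up)
candidates <- data.frame(
  id        = c(101, 102),
  nameIn    = c("Acacia colei", "Acacia colei"),
  authorsIn = c("Maslin & L.A.J.Thomson", "Smith"),
  stringsAsFactors = FALSE
)
# author string as given in the list to be resolved (note the extra space)
givenAuthors <- "Maslin & L.A.J. Thomson"
# Levenshtein distance between the given and each candidate author string
authorDist <- as.vector(adist(givenAuthors, candidates$authorsIn))
# pick the candidate whose authors are closest
best <- candidates[which.min(authorDist), ]
print(best$id)
```

The matching that follows does not do this and ignores authors altogether.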
plants[, shortName := nameShorter(newName)]
setkey(LCVPUnique, nameIn)
res <- LCVPUnique[plants$shortName]
resTable[is.na(LCVP_ID), LCVP_ID := res$`global Id`[is.na(resTable$LCVP_ID)]]
# update the "matched" column
plants[!is.na(resTable$LCVP_ID) | !is.na(resTable$LCVP_genus) | !is.na(resTable$LCVP_family), matched := TRUE]
table(plants$matched)
FALSE TRUE
581 4419
As we can see, the number of unmatched names was reduced from 1180 to 581. We could now introduce some fuzzy matching, i.e. try to assign names from the reference list to names with spelling errors. Of course, we could also consider other pre-processing options: correcting the uppercase/lowercase of the names, removing special characters like question marks or underscores, removing “sp.”, etc.
plants[matched == FALSE]$shortName[1:50]
- ''
- '(lauraceae) pubescente'
- '?Betulaceae sp.'
- 'Abies sp'
- 'Abuta_panamensis'
- 'Acacia contriva'
- 'Acacia sp1887'
- 'Acarospora radicata'
- 'Acer sino-oblongum'
- 'Aconitum delphiniifolium'
- 'Adenanthera sp.'
- 'A-Elyhordeum schaackianum'
- 'Aeranthus muscicola'
- 'Aesculus xworlitzensis'
- 'Agathis mooeri'
- 'Agave vera-cruz'
- 'Agrosthophyllum bicuspidatum'
- 'Albizia NA'
- 'Alectryon macrococcus'
- 'Alexa wachenheimii'
- 'Alissum tortuosum'
- 'Allium scabrifolium'
- 'Aloinopsis gydouwensis'
- 'ALOPECURUS GENICULATUS,'
- 'Alstroemeria riedeliana'
- 'Alyssum caliacre'
- 'Alyssum thunbergii'
- 'Amaroria soulameiodes'
- 'Ampelocera indet'
- 'Amphibromus NA'
- 'Anartia meyeri'
- 'Ancistrocladus stelliger'
- 'Andopogon glomeratus'
- 'ANEMONE NEMOROSA'
- 'Anona squamosa'
- 'Anthaenantia villosa'
- 'Anthocephalus sp.'
- 'Anthurium lezamai'
- 'Aquilegia coerulea'
- 'Arctoa hyperborea'
- 'Arctostaphylos_obispoensis'
- 'Ardisia brevipetala'
- 'Aristida sanctae-luciae'
- 'Arrabidaea trailii'
- 'Artemisia '
- 'Arthonia polymorpha'
- 'Artocarpus lessigiana'
- 'Arundinella khaseana'
- 'Asperula rechingeri'
- 'Asphodeline fistulosus'
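Several of the names listed above fail only because of formatting. The pre-processing options mentioned earlier could be sketched as follows. `nameCleaner` is a hypothetical helper that is not part of the workshop code, and the set of characters and placeholder words it removes is an assumption based on the examples visible in the list.

```r
# hypothetical helper: light clean-up of raw names before matching
nameCleaner <- function(x) {
  # underscores to spaces
  x <- gsub("_", " ", x, fixed = TRUE)
  # drop question marks, commas and parentheses
  x <- gsub("[?,()]", "", x)
  # drop trailing placeholders such as "sp", "sp.", "indet" or "NA"
  x <- sub("\\s+(sp\\.?|indet|NA)$", "", x)
  # collapse repeated whitespace and trim
  x <- trimws(gsub("\\s+", " ", x))
  # first letter uppercase, everything else lowercase
  paste0(toupper(substr(x, 1, 1)), tolower(substr(x, 2, nchar(x))))
}
cleaned <- nameCleaner(c(
  "Abuta_panamensis", "ALOPECURUS GENICULATUS,", "Abies sp", "?Betulaceae sp."
))
print(cleaned)
```

After cleaning, the first two names are plain binomials, and the last two are reduced to a genus and a family name that could be looked up in `genera` and `families`. Lowercasing everything after the first letter is a blunt rule and would need refinement for names where capitalization carries meaning.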
The fuzzy matching will be done in a loop using a matching function. Later on, this will allow us to easily switch to parallel processing. The function below first checks whether the first word, assumed to be the genus, is present in the reference list. If it is, the fuzzy matching is only done on the species belonging to this genus, massively reducing the computation time. Otherwise, the whole reference list is searched. Among the fuzzy matches, the one(s) with the smallest Levenshtein distance are selected and the first of them is returned; if there is no match at all, an empty template row is returned.
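As a small aside on the distance measure: the Levenshtein distance counts the single-character insertions, deletions and substitutions needed to turn one string into another. The sketch below uses one misspelled name from the list above and two reference names chosen for illustration, and relies only on the base R functions `adist()` and `agrepl()`.

```r
# a misspelled name and two reference names (chosen for illustration)
given <- "Alissum tortuosum"
reference <- c("Alyssum tortuosum", "Allium cepa")
# Levenshtein distance between the given name and each reference name
dists <- as.vector(adist(given, reference))
print(dists)
# approximate matching allowing at most 2 edits
hits <- agrepl(given, reference, max.distance = 2)
print(hits)
```

The first reference name differs by a single substitution and is accepted, the second is far off and rejected. Note that `agrepl()` looks for the pattern anywhere inside the target string, which is why the matching function anchors the pattern with `^` and `$`.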
# create a template to return in case there is no match
resTemplate <- LCVPUnique[1]
resTemplate[1] <- NA
# function for fuzzy matching
# maxDist controls the Levenshtein distance, i.e. the difference between the given and matched name
nameMatcher <- function(x, maxDist = 2) {
  genus <- sub("\\s.*", "", x$shortName)
  if (genus %in% genera) {
    checkRows <- sub("\\s.*", "", LCVPUnique$nameIn) == genus
  } else {
    checkRows <- rep(TRUE, nrow(LCVPUnique))
  }
  # do fuzzy matching
  res <- LCVPUnique[checkRows][agrepl(paste0("^", x$shortName, "$"), LCVPUnique$nameIn[checkRows],
    max.distance = maxDist, fixed = FALSE
  )]
  if (nrow(res) > 0) {
    # calculate Levenshtein distance
    dists <- adist(x$shortName, res$nameIn)
    # keep best result(s)
    res <- res[as.vector(dists == min(dists))]
    # return first result or return template
    return(res[1])
  } else {
    return(resTemplate)
  }
}
Let’s run this function on some of the remaining unmatched names. As this may take a while, we will only loop over the first 200 names. You may run it on the whole dataset, but expect it to take about half an hour. Running on the first 200 names will just take a minute.
timeStart <- Sys.time()
# for (i in seq_len(nrow(plants))) {
for (i in seq_len(200)) {
  # only check unmatched cases
  if (plants$matched[i] == FALSE) {
    # counter to show progress
    print(paste(i, Sys.time()))
    res <- nameMatcher(plants[i])
    if (!is.na(res$`global Id`)) {
      resTable[i, LCVP_ID := res$`global Id`]
      plants[i, matched := TRUE]
    }
  }
}
Sys.time() - timeStart
[1] "1 2024-04-12 11:58:24.400659"
[1] "2 2024-04-12 11:58:28.823094"
[1] "3 2024-04-12 11:58:34.227135"
[1] "7 2024-04-12 11:58:37.908572"
[1] "10 2024-04-12 11:58:38.654882"
[1] "19 2024-04-12 11:58:42.660404"
[1] "38 2024-04-12 11:58:43.418891"
[1] "63 2024-04-12 11:58:44.179911"
[1] "70 2024-04-12 11:58:48.752534"
[1] "87 2024-04-12 11:58:49.515793"
[1] "103 2024-04-12 11:58:50.269285"
[1] "125 2024-04-12 11:58:51.013671"
[1] "128 2024-04-12 11:58:56.107607"
[1] "133 2024-04-12 11:59:00.12651"
[1] "141 2024-04-12 11:59:00.888705"
[1] "146 2024-04-12 11:59:01.645952"
[1] "158 2024-04-12 11:59:02.720926"
[1] "174 2024-04-12 11:59:07.297597"
[1] "187 2024-04-12 11:59:08.048403"
[1] "189 2024-04-12 11:59:08.804925"
[1] "191 2024-04-12 11:59:09.561838"
Time difference of 49.24616 secs
plants[c(1, 10, 19, 128, 133)]
oldName | newName | matched | shortName |
---|---|---|---|
<chr> | <chr> | <lgl> | <chr> |
| | FALSE | |
Abuta_panamensis | Abuta_panamensis | TRUE | Abuta_panamensis |
Acacia contriva | Acacia contriva | FALSE | Acacia contriva |
Aeranthus muscicola | Aeranthus muscicola | TRUE | Aeranthus muscicola |
Aesculus xworlitzensis | Aesculus xworlitzensis | TRUE | Aesculus xworlitzensis |
The timestamps show that whenever the genus is found, matching is relatively fast, taking about a second, but when it is not, matching is quite slow and takes several seconds. This is because agrepl() then works on the whole LCVPUnique dataset and has to compare more than one million pairs of words. You could think about a heuristic to reduce the number of rows checked.
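One possible heuristic, sketched here under stated assumptions: before the fuzzy matching, keep only reference names that start with the same letter as the given name and whose length differs by at most `maxDist` characters. `preFilter` is a hypothetical helper, and the reference names are chosen for illustration. The length condition cannot lose valid matches, because the Levenshtein distance between two strings is never smaller than their difference in length; the first-letter condition assumes that the first letter is spelled correctly.

```r
# hypothetical pre-filter: same first letter and similar length
preFilter <- function(name, refNames, maxDist = 2) {
  sameLetter <- substr(refNames, 1, 1) == substr(name, 1, 1)
  similarLength <- abs(nchar(refNames) - nchar(name)) <= maxDist
  sameLetter & similarLength
}
# illustrative reference names
refNames <- c("Andropogon glomeratus", "Annona squamosa", "Abies alba", "Zea mays")
keep <- preFilter("Andopogon glomeratus", refNames)
print(refNames[keep])
```

The resulting logical vector could be used in place of the all-`TRUE` vector that is currently assigned to `checkRows` when the genus is not found.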
Parallel processing#
We will now focus on speeding up the process of name checking by running it in parallel. Let’s check how many cores are available on the system.
parallel::detectCores()
On my machine, I can use at most 16 cores. That means that I can expect a roughly 16-fold speed-up at best. Assuming that matching all unmatched names of the 5000-row dataset takes 30 minutes when running sequentially, I can expect the task to be completed in about 2 minutes when running in parallel. However, this comes at a cost: when running processes in parallel, R copies all objects needed from the workspace to each parallel process, and in our case, that means copying LCVPUnique 16 times. This takes quite some time, and for few iterations of the loop, initializing the parallel processes takes more time than is saved by running in parallel. Nevertheless, we will first try the first 200 names we already processed before (but note that names matched in the sequential run will not be processed again).
We also need to make some adjustments to the code. As the parallel processes will use their own copies of the data, it would not make sense to let them write to these individual copies. Therefore, if a match is found, the information needs to be returned to the main process. Also, as objects are copied for the individual workers, the information needed by the individual processes should be kept minimal. I will also not use all available cores, to make sure I can do other things on my computer without delay while the process is running.
# create the cluster for parallel processing
cl <- makeCluster(parallel::detectCores() - 1)
registerDoSNOW(cl)
# run the name resolution in parallel
timeStart <- Sys.time()
resTemp <- foreach(i = seq_len(200), .combine = c, .packages = c("data.table")) %dopar% {
  # only check unmatched cases
  # return the global Id if it is found and NA if not checked or nothing could be found
  if (plants$matched[i] == FALSE) {
    res <- nameMatcher(plants[i])
    res <- res$`global Id`
  } else {
    res <- NA
  }
  # indicate what to return
  res
}
Sys.time() - timeStart
# stop the cluster
stopCluster(cl)
Time difference of 3.274923 mins
The process took about 3.3 mins for me with 15 cores, compared to about 49 seconds for the sequential run, so no gain in terms of time for now. Let’s see what we got.
resTemp
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
As we already ran the process sequentially, new matches were not to be expected on the first 200 entries. Let’s risk running the process on the whole dataset.
# create the cluster for parallel processing
cl <- makeCluster(parallel::detectCores() - 1)
registerDoSNOW(cl)
# run the name resolution in parallel
timeStart <- Sys.time()
resTemp <- foreach(i = seq_len(nrow(plants)), .combine = c, .packages = c("data.table")) %dopar% {
  # only check unmatched cases
  # return the global Id if it is found and NA if not checked or nothing could be found
  if (plants$matched[i] == FALSE) {
    res <- nameMatcher(plants[i])
    res <- res$`global Id`
  } else {
    res <- NA
  }
  # indicate what to return
  res
}
Sys.time() - timeStart
# stop the cluster
stopCluster(cl)
Time difference of 7.922978 mins
This took about 8 mins. That’s a big improvement compared to the roughly half an hour expected for the sequential process. Let’s see how many matches we got.
table(!is.na(resTemp))
FALSE TRUE
4749 251
So out of the 581 names, another 251 could be matched. Let’s transfer the data into resTable. It might well be worth adding information on the type of matching, as the fuzzy matches are not perfect and might require further checking. However, our current implementation does not give us any information on the type of match.
resTable[!is.na(resTemp), LCVP_ID := resTemp[!is.na(resTemp)]]
plants[!is.na(resTable$LCVP_ID), matched := TRUE]
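One way to keep track of the type of match would be an additional column that is filled whenever an ID is written. The following base R sketch uses a toy results table; `recordMatch`, the `matchType` column, its labels and all IDs are hypothetical and only illustrate the bookkeeping.

```r
# toy results table with a column recording how each ID was obtained
res <- data.frame(
  name      = c("Abies alba Mill.", "Acacia colei", "Alissum tortuosum", "Foo bar"),
  LCVP_ID   = NA_real_,
  matchType = NA_character_,
  stringsAsFactors = FALSE
)
# hypothetical helper: fill only rows that are still empty and tag them
recordMatch <- function(res, ids, type) {
  fill <- is.na(res$LCVP_ID) & !is.na(ids)
  res$LCVP_ID[fill] <- ids[fill]
  res$matchType[fill] <- type
  res
}
# IDs as they might come back from the successive matching steps (made up)
res <- recordMatch(res, c(11, NA, NA, NA), "exact with authors")
res <- recordMatch(res, c(11, 22, NA, NA), "exact without authors")
res <- recordMatch(res, c(NA, NA, 33, NA), "fuzzy")
print(res)
```

Because each step only fills rows that are still empty, earlier and more reliable match types are never overwritten by later ones, and fuzzy matches can be filtered out afterwards for manual checking.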
Let’s just look at the results and the remaining names. Maybe you can figure out some possible improvements to the code.
TASKS:
For example, you could think about allowing for partial matches, if the genus is found, but not the species. This could easily be implemented by extracting the first word from the shortName column.
You could also play with the maxDist parameter to increase or decrease the Levenshtein distance.
To improve the speed of the nameMatcher function, you would certainly have to filter potential matches, for example by only including names starting with a certain letter (assuming the first letter is correct), or by only including names with a certain number of characters.
Finally, the code would be more efficient if you would only loop over the rows that have not been matched yet.
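As a starting point for the first and the last task, here is a minimal base R sketch on toy data. The table `toy`, the vector `knownGenera` and the column `genusOnly` are hypothetical stand-ins for `plants`, `genera` and a new results column.

```r
# toy data standing in for plants and the genus list (made-up values)
toy <- data.frame(
  shortName = c("Abies alba", "Acacia sp1887", "Albizia NA", "Xyzus foo"),
  matched   = c(TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)
knownGenera <- c("Abies", "Acacia", "Albizia")
# indices of the rows that still need work
todo <- which(!toy$matched)
# partial match: is the first word of the name a known genus?
firstWord <- sub("\\s.*", "", toy$shortName[todo])
toy$genusOnly <- FALSE
toy$genusOnly[todo] <- firstWord %in% knownGenera
print(toy)
```

An index vector like `todo` could also replace `seq_len(nrow(plants))` in the sequential loop or in the `foreach()` call, so that already matched rows are never visited. The results then have to be written back to the rows given by `todo` rather than to all rows.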
table(plants$matched)
plants[matched == FALSE][1:20]
FALSE TRUE
319 4681
oldName | newName | matched | shortName |
---|---|---|---|
<chr> | <chr> | <lgl> | <chr> |
| | FALSE | |
(lauraceae) pubescente | (lauraceae) pubescente | FALSE | (lauraceae) pubescente |
?Betulaceae sp. | ?Betulaceae sp. | FALSE | ?Betulaceae sp. |
Abies sp | Abies sp | FALSE | Abies sp |
Acacia contriva | Acacia contriva | FALSE | Acacia contriva |
Acacia sp1887 | Acacia sp1887 | FALSE | Acacia sp1887 |
Acarospora radicata | Acarospora radicata | FALSE | Acarospora radicata |
Adenanthera sp. | Adenanthera sp. | FALSE | Adenanthera sp. |
A-Elyhordeum schaackianum | A-Elyhordeum schaackianum | FALSE | A-Elyhordeum schaackianum |
Albizia NA | Albizia NA | FALSE | Albizia NA |
Allium scabrifolium | Allium scabrifolium | FALSE | Allium scabrifolium |
Aloinopsis gydouwensis | Aloinopsis gydouwensis | FALSE | Aloinopsis gydouwensis |
ALOPECURUS GENICULATUS, | ALOPECURUS GENICULATUS, | FALSE | ALOPECURUS GENICULATUS, |
Alyssum thunbergii Moq. Orthodox p | Alyssum thunbergii Moq. Orthodox p | FALSE | Alyssum thunbergii |
Ampelocera indet | Ampelocera indet | FALSE | Ampelocera indet |
Amphibromus NA | Amphibromus NA | FALSE | Amphibromus NA |
Ancistrocladus stelliger Wall. ex DC. | Ancistrocladus stelliger Wall. ex DC. | FALSE | Ancistrocladus stelliger |
ANEMONE NEMOROSA L. | ANEMONE NEMOROSA L. | FALSE | ANEMONE NEMOROSA |
Anthocephalus sp. | Anthocephalus sp. | FALSE | Anthocephalus sp. |
Arctoa hyperborea | Arctoa hyperborea | FALSE | Arctoa hyperborea |