Taxonomic name resolution

This notebook is intended to demonstrate the workflow of taxonomic name resolution. Taxonomic name resolution is the process of checking and de-synonymizing taxonomic names using a reference database. This is necessary because many taxonomic entities, such as species, are known and have been described using several scientific names. However, only one of those scientific names is the accepted name, while the others are referred to as synonyms (although there are special cases, e.g., unresolved names). To correctly link data from several sources, it is necessary to check and de-synonymize names using a common reference database.

This process is complicated by spelling errors, names not found in the database, and other challenges. In this notebook, we will use the GBIF API to resolve the names from a sample list. However, users may not always want to apply the GBIF taxonomy, and not all datasets incorporated in GBIF are the newest versions. Therefore, in the hands-on part of this notebook, we will attempt to create a custom name resolution function using the Leipzig Catalogue of Vascular Plants (LCVP). We will also strive to speed up the name resolution process by using the parallel processing functionality of R.
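The core idea of de-synonymization can be illustrated with a toy lookup in base R. This is only a minimal sketch: the three-row reference table is made up for illustration (its names mirror LCVP entries shown further below), and the column names `nameIn`, `status`, and `nameOut` are borrowed from the LCVP table used later.

```r
# Toy reference list: each input name points to its accepted name
reference <- data.frame(
  nameIn  = c("Aa argyrolepis", "Aa brevis", "Myrosmodes breve"),
  status  = c("accepted", "synonym", "accepted"),
  nameOut = c("Aa argyrolepis", "Myrosmodes breve", "Myrosmodes breve"),
  stringsAsFactors = FALSE
)

# Names to resolve; the last one contains a spelling error
queries <- c("Aa brevis", "Aa argyrolepis", "Aa brevs")

# Exact matching: position of each query in the reference (NA if not found)
hit <- match(queries, reference$nameIn)

resolved <- data.frame(
  query    = queries,
  status   = reference$status[hit],
  accepted = reference$nameOut[hit],
  stringsAsFactors = FALSE
)
print(resolved)
```

The synonym is replaced by its accepted name, the accepted name is returned unchanged, and the misspelled name stays unresolved (`NA`), which is exactly the kind of case that needs fuzzy matching or manual checking.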

Prerequisites

To run the code presented here, you will need

  • the sample names list provided in the workshop,

  • a version of the Leipzig Catalogue of Vascular Plants (LCVP), also provided in the workshop,

  • a functioning R environment and

  • the R packages data.table, rgbif, and doSNOW installed.

Code

The first block of code loads libraries and prepares the workspace. You will need to adapt the working directory.

# load packages
library(data.table) # handle large datasets
library(rgbif) # access GBIF data
library(doSNOW) # parallel computing

# clear workspace
rm(list = ls())

# set working directory
setwd(paste0(.brd, "gfoe NFDI taxonomic harmonization workshop"))

# load data
plants <- fread("plant names_2024-04-08.txt", sep = "\t")
animals <- fread("animal names_2024-04-09.txt", sep = "\t")
Loading required package: foreach

Loading required package: iterators

Loading required package: snow

For the sake of simplicity, we will proceed to do the name matching using the names as they are. Note that name parsing can help to increase the accuracy and efficiency of the name harmonization process.

Encoding

Unfortunately, when getting data from differing sources, we will often find that these data have been encoded in different ways. This means that while the typical English language letters will be stored the same way on any machine, when it comes to accents and some other special characters, it may matter whether data was stored by a computer in the US or Japan, and whether the computer has a Windows, Mac, or Linux operating system.

We will deal with the most common case: data stored in the Windows-specific CP-1252 encoding (sometimes mislabeled as ANSI or latin1) rather than in UTF-8.

How your machine treats data from different encodings depends on what encoding is preset in your console. You can check this using the following:

Sys.getlocale()
'LC_COLLATE=German_Germany.utf8;LC_CTYPE=German_Germany.utf8;LC_MONETARY=German_Germany.utf8;LC_NUMERIC=C;LC_TIME=German_Germany.utf8'

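If your console is not set to UTF-8 (regardless of the language), you can change it like this. The locale name below is the Windows form; on Linux or macOS, the equivalent is usually written like de_DE.UTF-8: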

Sys.setlocale(category = "LC_ALL", locale = "German_Germany.utf8")
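
Before changing anything, it can help to check programmatically whether the session already runs in UTF-8. A small sketch using base R's `l10n_info()`; the messages are illustrative only.

```r
# l10n_info() is part of base R and reports properties of the current locale
loc <- l10n_info()

if (isTRUE(loc[["UTF-8"]])) {
  message("This session already uses a UTF-8 locale.")
} else {
  message("No UTF-8 locale detected; current LC_CTYPE is: ",
          Sys.getlocale("LC_CTYPE"))
}
```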

You can use another encoding, too, but it may throw errors later on. So let’s check whether the data comes in UTF-8, and if not, let’s repair it, assuming it is CP-1252 (our best guess, and correct in the large majority of cases).

# check whether the names are valid UTF-8
table(validUTF8(plants$oldName))
table(validUTF8(animals$modName))
FALSE  TRUE 
   73  4927 
TRUE 
5000 
# create new columns for variables
plants[, newName := oldName]
animals[, newName := modName]
# correct encoding, assuming current encoding is CP-1252
plants[!validUTF8(newName), newName := iconv(newName, from = "CP1252", to = "UTF-8")]
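
To see what the repair step does on a single value, here is a small self-contained sketch. The byte string is constructed by hand for illustration (the author name "Flörke" with "ö" stored as the single CP-1252 byte 0xF6) and is not taken from the workshop data.

```r
# "Flörke" as it would be stored in CP-1252: "ö" is the single byte 0xF6
raw_name <- rawToChar(as.raw(c(0x46, 0x6c, 0xf6, 0x72, 0x6b, 0x65)))

# The lone byte 0xF6 is not a valid UTF-8 sequence
validUTF8(raw_name)

# Re-encode, assuming the source encoding is CP-1252
fixed <- iconv(raw_name, from = "CP1252", to = "UTF-8")

# After conversion, "ö" is stored as the two bytes 0xC3 0xB6
validUTF8(fixed)
charToRaw(fixed)
```

Note that `iconv()` returns `NA` for strings it cannot convert, so it is worth checking for new `NA` values after the repair.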

Name resolution

To access the GBIF backbone, we can use the name_backbone_checklist function found in the rgbif package.

resP <- data.table(name_backbone_checklist(plants$newName))
resA <- data.table(name_backbone_checklist(animals$newName))

It took some time, but we got some results. Let’s look at the result structure.

str(resP)
Classes 'data.table' and 'data.frame':	5000 obs. of  25 variables:
 $ confidence      : int  100 100 100 98 94 98 94 99 99 99 ...
 $ matchType       : chr  "NONE" "NONE" "NONE" "EXACT" ...
 $ synonym         : logi  FALSE FALSE FALSE TRUE FALSE TRUE ...
 $ usageKey        : int  NA NA NA 2977925 3152705 2685510 2684876 8132859 3138531 3830289 ...
 $ acceptedUsageKey: int  NA NA NA 11698476 NA 2685508 NA NA NA NA ...
 $ scientificName  : chr  NA NA NA "Abarema curvicarpa (H.S.Irwin) Barneby & J.W.Grimes" ...
 $ canonicalName   : chr  NA NA NA "Abarema curvicarpa" ...
 $ rank            : chr  NA NA NA "SPECIES" ...
 $ status          : chr  NA NA NA "SYNONYM" ...
 $ kingdom         : chr  NA NA NA "Plantae" ...
 $ phylum          : chr  NA NA NA "Tracheophyta" ...
 $ order           : chr  NA NA NA "Fabales" ...
 $ family          : chr  NA NA NA "Fabaceae" ...
 $ genus           : chr  NA NA NA "Jupunba" ...
 $ species         : chr  NA NA NA "Jupunba curvicarpa" ...
 $ kingdomKey      : int  NA NA NA 6 6 6 6 6 6 6 ...
 $ phylumKey       : int  NA NA NA 7707728 7707728 7707728 7707728 7707728 7707728 7707728 ...
 $ classKey        : int  NA NA NA 220 220 194 194 220 220 220 ...
 $ orderKey        : int  NA NA NA 1370 941 640 640 933 414 399 ...
 $ familyKey       : int  NA NA NA 5386 6685 3925 3925 2398 3065 2411 ...
 $ genusKey        : int  NA NA NA 7266089 3152705 2684876 2684876 4890936 3138519 7268654 ...
 $ speciesKey      : int  NA NA NA 11698476 NA 2685507 NA 8132859 3138531 3830289 ...
 $ class           : chr  NA NA NA "Magnoliopsida" ...
 $ verbatim_name   : chr  "" "(lauraceae) pubescente" "?Betulaceae sp." "Abarema curvicarpa" ...
 $ verbatim_index  : num  1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, ".internal.selfref")=<externalptr> 

As there is a column named “kingdom”, we can check whether all names were actually matched to plants, and how many non-matches we got. Another important piece of information can be found in the “matchType” column. Here, we can see whether names were retrieved exactly as they were spelled, whether some fuzzy matching was done, or whether they could only be matched to a higher rank. The latter means that names may only have been matched at genus, family, order, or phylum level. It is worth checking these results.

table(resP$kingdom)
sum(is.na(resP$kingdom))
table(resP$matchType)
Animalia    Fungi  Plantae 
       1       40     4814 
145
     EXACT      FUZZY HIGHERRANK       NONE 
      4550        138        167        145 
resP[kingdom != "Plantae"]
A data.table: 41 × 25
confidence | matchType | synonym | usageKey | acceptedUsageKey | scientificName | canonicalName | rank | status | kingdom | kingdomKey | phylumKey | classKey | orderKey | familyKey | genusKey | speciesKey | class | verbatim_name | verbatim_index
<int> | <chr> | <lgl> | <int> | <int> | <chr> | <chr> | <chr> | <chr> | <chr> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <chr> | <chr> | <dbl>
99 | EXACT | FALSE | 3414350 | NA | Acarospora radicata H.Magn. | Acarospora radicata | SPECIES | ACCEPTED | Fungi | 5 | 95 | 180 | 1271 | 8347 | 2600495 | 3414350 | Lecanoromycetes | Acarospora radicata | 63
99 | EXACT | FALSE | 2592767 | NA | Arthonia polymorpha Ach., 1814 | Arthonia polymorpha | SPECIES | ACCEPTED | Fungi | 5 | 95 | 313 | 1273 | 8363 | 2581942 | 2592767 | Arthoniomycetes | Arthonia polymorpha | 420
99 | EXACT | FALSE | 7250670 | NA | Aspicilia candida (Anzi) Hue | Aspicilia candida | SPECIES | ACCEPTED | Fungi | 5 | 95 | 180 | 1051 | 4116 | 2599747 | 7250670 | Lecanoromycetes | Aspicilia candida | 445
99 | EXACT | FALSE | 3419869 | NA | Aulaxina microphana (Vain.) R.Sant. | Aulaxina microphana | SPECIES | ACCEPTED | Fungi | 5 | 95 | 180 | 1279 | 2161 | 6610216 | 3419869 | Lecanoromycetes | Aulaxina microphana | 507
98 | EXACT | TRUE | 2608288 | 3438532 | Bacidia kingmanii Hasse | Bacidia kingmanii | SPECIES | SYNONYM | Fungi | 5 | 95 | 180 | 10848190 | 8296 | 2569865 | 3438532 | Lecanoromycetes | Bacidia kingmanii | 527
99 | EXACT | FALSE | 2609464 | NA | Buellia griseovirens (Turner & Borrer ex Sm.) Almb. | Buellia griseovirens | SPECIES | ACCEPTED | Fungi | 5 | 95 | 180 | 10861608 | 4115 | 2587707 | 2609464 | Lecanoromycetes | Buellia griseovirens | 711
99 | EXACT | FALSE | 2609287 | NA | Calicium quercinum Pers. | Calicium quercinum | SPECIES | ACCEPTED | Fungi | 5 | 95 | 180 | 10861608 | 4115 | 2592559 | 2609287 | Lecanoromycetes | Calicium quercinum | 789
98 | EXACT | TRUE | 2610162 | 7462577 | Caloplaca inconspecta Arup | Caloplaca inconspecta | SPECIES | SYNONYM | Fungi | 5 | 95 | 180 | 1050 | 8368 | 7251287 | 7462577 | Lecanoromycetes | Caloplaca inconspecta | 805
98 | EXACT | TRUE | 3469366 | 2596842 | Catapyrenium caeruleopulvinum J.W.Thomson | Catapyrenium caeruleopulvinum | SPECIES | SYNONYM | Fungi | 5 | 95 | 178 | 1043 | 4841 | 2596841 | 2596842 | Eurotiomycetes | Catapyrenium caeruleopulvinum | 929
99 | EXACT | FALSE | 7186902 | NA | Cetraria islandica subsp. islandica | Cetraria islandica islandica | SUBSPECIES | ACCEPTED | Fungi | 5 | 95 | 180 | 1048 | 8305 | 2601309 | 2605272 | Lecanoromycetes | Cetraria islandica ssp. islandica | 989
99 | EXACT | FALSE | 3391187 | NA | Cladonia alinii Trass | Cladonia alinii | SPECIES | ACCEPTED | Fungi | 5 | 95 | 180 | 1048 | 8328 | 2607519 | 3391187 | Lecanoromycetes | Cladonia alinii | 1089
99 | EXACT | FALSE | 2607718 | NA | Cladonia pleurota (Flörke) Schaer. | Cladonia pleurota | SPECIES | ACCEPTED | Fungi | 5 | 95 | 180 | 1048 | 8328 | 2607519 | 2607718 | Lecanoromycetes | Cladonia pleurota | 1090
97 | EXACT | FALSE | 10883702 | NA | Collema tenax Degel. | Collema tenax | SPECIES | ACCEPTED | Fungi | 5 | 95 | 180 | 1055 | 8377 | 7237642 | 10883702 | Lecanoromycetes | Collema tenax | 1149
99 | EXACT | FALSE | 3433571 | NA | Dirinaria leopoldii (Stein) D.D.Awasthi | Dirinaria leopoldii | SPECIES | ACCEPTED | Fungi | 5 | 95 | 180 | 1050 | NA | 10937028 | 3433571 | Lecanoromycetes | Dirinaria leopoldii | 1571
97 | EXACT | FALSE | 5475464 | NA | Graphis anfractuosa (Eschw.) Eschw. | Graphis anfractuosa | SPECIES | ACCEPTED | Fungi | 5 | 95 | 180 | 1279 | 2176 | 2602041 | 5475464 | Lecanoromycetes | Graphis anfractuosa | 2189
98 | EXACT | TRUE | 2606991 | 2606897 | Lecanora melaena (Hedl.) Fink | Lecanora melaena | SPECIES | SYNONYM | Fungi | 5 | 95 | 180 | 1048 | 4828 | 2569863 | 2606897 | Lecanoromycetes | Lecanora melaena | 2649
99 | EXACT | FALSE | 3439134 | NA | Lecidea polaris Lynge | Lecidea polaris | SPECIES | ACCEPTED | Fungi | 5 | 95 | 180 | 10848190 | 8296 | 2569865 | 3439134 | Lecanoromycetes | Lecidea polaris | 2652
99 | EXACT | FALSE | 2600912 | NA | Leptogium austroamericanum (Malme) C.W.Dodge | Leptogium austroamericanum | SPECIES | ACCEPTED | Fungi | 5 | 95 | 180 | 1055 | 8377 | 2600911 | 2600912 | Lecanoromycetes | Leptogium austroamericanum | 2691
99 | EXACT | FALSE | 2601899 | NA | Nadvornikia hawaiensis (Tuck.) Tibell | Nadvornikia hawaiensis | SPECIES | ACCEPTED | Fungi | 5 | 95 | 180 | 1279 | 2176 | 2601898 | 2601899 | Lecanoromycetes | Nadvornikia hawaiensis | 3137
98 | EXACT | TRUE | 5517083 | 10809472 | Opegrapha cypressi R.C.Harris | Opegrapha cypressi | SPECIES | SYNONYM | Fungi | 5 | 95 | 313 | 1273 | 8359 | 10747378 | 10809472 | Arthoniomycetes | Opegrapha cypressi | 3262
99 | EXACT | FALSE | 3424580 | NA | Pannaria conoplea (Ach.) Bory | Pannaria conoplea | SPECIES | ACCEPTED | Fungi | 5 | 95 | 180 | 1055 | 4119 | 2587001 | 3424580 | Lecanoromycetes | Pannaria conoplea | 3379
98 | EXACT | TRUE | 2605920 | 2605922 | Parmelia omphalodes subsp. pinnatifida (Kurok) Skult | Parmelia omphalodes pinnatifida | SUBSPECIES | SYNONYM | Fungi | 5 | 95 | 180 | 1048 | 8305 | 2576643 | 2605922 | Lecanoromycetes | Parmelia omphalodes ssp. pinnatifida | 3399
97 | EXACT | FALSE | 2601157 | NA | Peltigera venosa (L.) Hoffm. | Peltigera venosa | SPECIES | ACCEPTED | Fungi | 5 | 95 | 180 | 1055 | 8373 | 2601137 | 2601157 | Lecanoromycetes | Peltigera venosa | 3455
99 | EXACT | FALSE | 2600005 | NA | Pertusaria wulfenioides B.de Lesd. | Pertusaria wulfenioides | SPECIES | ACCEPTED | Fungi | 5 | 95 | 180 | 1051 | 4113 | 2599923 | 2600005 | Lecanoromycetes | Pertusaria wulfenioides | 3501
98 | EXACT | TRUE | 8460032 | 3291408 | Phaeographina explicans Fink | Phaeographina explicans | SPECIES | SYNONYM | Fungi | 5 | 95 | 180 | 1279 | 2176 | 8269656 | 3291408 | Lecanoromycetes | Phaeographina explicans | 3516
99 | EXACT | FALSE | 3477254 | NA | Phylliscum demangeonii (Moug. & Mont.) Nyl. | Phylliscum demangeonii | SPECIES | ACCEPTED | Fungi | 5 | 95 | 314 | 1049 | 4104 | 3477247 | 3477254 | Lichinomycetes | Phylliscum demangeonii | 3568
99 | EXACT | FALSE | 7251966 | NA | Placopsis cribellans (Nyl.) Räsänen | Placopsis cribellans | SPECIES | ACCEPTED | Fungi | 5 | 95 | 180 | 1042 | 8344 | 7954402 | 7251966 | Lecanoromycetes | Placopsis cribellans | 3656
98 | EXACT | TRUE | 3528094 | 10787294 | Polymeridium proponens (Nyl.) R.C.Harris | Polymeridium proponens | SPECIES | SYNONYM | Fungi | 5 | 95 | 183 | 7186734 | 4845 | 10702635 | 10787294 | Dothideomycetes | Polymeridium proponens | 3745
95 | FUZZY | FALSE | 2603432 | NA | Porpidia contraponenda (Arnold) Knoph & Hertel | Porpidia contraponenda | SPECIES | ACCEPTED | Fungi | 5 | 95 | 180 | 10848190 | 8296 | 2603408 | 2603432 | Lecanoromycetes | Porpidia contrapoenda | 3770
99 | EXACT | FALSE | 5261389 | NA | Psora crenata (Taylor) Reinke | Psora crenata | SPECIES | ACCEPTED | Fungi | 5 | 95 | 180 | 1048 | 4813 | 5953400 | 5261389 | Lecanoromycetes | Psora crenata | 3857
99 | EXACT | FALSE | 5489996 | NA | Pyrenula laetior Müll.Arg. | Pyrenula laetior | SPECIES | ACCEPTED | Fungi | 5 | 95 | 178 | 1044 | 8352 | 3269671 | 5489996 | Eurotiomycetes | Pyrenula laetior | 3914
99 | EXACT | FALSE | 3425847 | NA | Rhizocarpon rittokense (Hellb.) Th.Fr. | Rhizocarpon rittokense | SPECIES | ACCEPTED | Fungi | 5 | 95 | 180 | 10789844 | 8371 | 2600386 | 3425847 | Lecanoromycetes | Rhizocarpon rittokense | 3989
99 | EXACT | FALSE | 2609027 | NA | Rinodina efflorescens Malme | Rinodina efflorescens | SPECIES | ACCEPTED | Fungi | 5 | 95 | 180 | 10861608 | 8369 | 2599871 | 2609027 | Lecanoromycetes | Rinodina efflorescens | 4031
99 | EXACT | FALSE | 8664645 | NA | Schismatomma vernans (Tuck.) Zahlbr. | Schismatomma vernans | SPECIES | ACCEPTED | Fungi | 5 | 95 | 313 | 1273 | 8359 | 2592839 | 8664645 | Arthoniomycetes | Schismatomma vernans | 4206
99 | EXACT | FALSE | 2610322 | NA | Sclerophora peronella (Ach.) Tibell | Sclerophora peronella | SPECIES | ACCEPTED | Fungi | 5 | 95 | 10874653 | 10699167 | 6243 | 2610321 | 2610322 | Coniocybomycetes | Sclerophora peronella | 4236
98 | EXACT | TRUE | 2608024 | 10688503 | Toninia opuntioides (Vill.) Timdal | Toninia opuntioides | SPECIES | SYNONYM | Fungi | 5 | 95 | 180 | 1048 | 4812 | 7251896 | 10688503 | Lecanoromycetes | Toninia opuntioides | 4721
99 | EXACT | FALSE | 2599776 | NA | Trapeliopsis granulosa (Hoffm.) Lumbsch | Trapeliopsis granulosa | SPECIES | ACCEPTED | Fungi | 5 | 95 | 180 | 1042 | 8344 | 2599761 | 2599776 | Lecanoromycetes | Trapeliopsis granulosa | 4737
98 | EXACT | TRUE | 11240236 | 6496469 | Triphora minima Pease, 1871 | Triphora minima | SPECIES | SYNONYM | Animalia | 1 | 52 | 225 | NA | 2660 | 2299148 | 6496469 | Gastropoda | Triphora minima | 4775
99 | EXACT | FALSE | 5260494 | NA | Umbilicaria cinereorufescens (Schaer.) Frey | Umbilicaria cinereorufescens | SPECIES | ACCEPTED | Fungi | 5 | 95 | 180 | 1275 | 4108 | 2600611 | 5260494 | Lecanoromycetes | Umbilicaria cinereorufescens | 4811
99 | EXACT | FALSE | 2606076 | NA | Usnea diplotypus Vain. | Usnea diplotypus | SPECIES | ACCEPTED | Fungi | 5 | 95 | 180 | 1048 | 8305 | 2605982 | 2606076 | Lecanoromycetes | Usnea diplotypus | 4824
99 | EXACT | FALSE | 2604141 | NA | Xanthoparmelia neochlorochroa Hale | Xanthoparmelia neochlorochroa | SPECIES | ACCEPTED | Fungi | 5 | 95 | 180 | 1048 | 8305 | 2603781 | 2604141 | Lecanoromycetes | Xanthoparmelia neochlorochroa | 4953

The only “animal” in this dataset is Triphora minima, a name that is also used for an orchid species; without further context, GBIF matched it to the gastropod. To get the correct match, we could re-run the query for this particular name with the parameter kingdom = "Plantae". You should not set this parameter in the initial query unless you are 100% sure that all your data belong to that group. The species labelled as Fungi appear to be lichens, as in the case of Usnea, but their classification is plausible, as the TRY database is known to include lichens.
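
Such a targeted re-query could look like the sketch below. The wrapper `requery_in_kingdom` is a hypothetical helper written for this illustration, not part of rgbif; it only relies on the `name` and `kingdom` arguments of `rgbif::name_backbone()`. Calling it requires the rgbif package and an internet connection, so the example call is left commented out.

```r
# Hypothetical helper (not part of rgbif): re-query one name within a fixed kingdom
requery_in_kingdom <- function(name, kingdom = "Plantae") {
  res <- rgbif::name_backbone(name = name, kingdom = kingdom)
  # keep a few informative columns, if present in the result
  keep <- intersect(
    c("verbatim_name", "scientificName", "status", "matchType", "kingdom"),
    names(res)
  )
  as.data.frame(res)[, keep, drop = FALSE]
}

# Example call (needs network access, therefore not run here):
# requery_in_kingdom("Triphora minima")
```

Restricting the kingdom only for names that were flagged as suspicious keeps the initial query unbiased while still allowing ambiguous names to be corrected.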

Let’s check the “HIGHERRANK” matches. (“scientificName” includes authors, “canonicalName” is without authors, and “verbatim_name” is the name queried.)

resP[matchType == "HIGHERRANK", c("scientificName", "canonicalName", "verbatim_name")]
A data.table: 167 × 3
scientificName | canonicalName | verbatim_name
<chr> | <chr> | <chr>
Abies Mill. | Abies | Abies sp
Acacia Mill. | Acacia | Acacia contriva
Acacia Mill. | Acacia | Acacia sp1887
Aconitum L. | Aconitum | Aconitum napellus L. Orthodox
Adenanthera L. | Adenanthera | Adenanthera sp.
Aesculus L. | Aesculus | Aesculus xworlitzensis
Aloinopsis Schwantes | Aloinopsis | Aloinopsis gydouwensis
Alyssum L. | Alyssum | Alyssum thunbergii Moq. Orthodox p
Ampelocera Klotzsch | Ampelocera | Ampelocera indet
Asphodeline Rchb. | Asphodeline | Asphodeline fistulosus
Aucoumea Pierre | Aucoumea | Aucoumea sp.
Poaceae | Poaceae | Avena nervosa
Neobartsia Uribe-Convers & Tank | Neobartsia | Bartsia lamiflora
Beccarianthus Cogn. | Beccarianthus | Beccarianthus sp
Beta maritima L. | Beta maritima | Beta maritima subsp maritima
Betula L. | Betula | Betula borealis
Urticaceae | Urticaceae | Boehmeria spicata
Bromus L. | Bromus | Bromus erectus Huds. Orthodox
Calluna vulgaris (L.) Hull | Calluna vulgaris | Calluna vulgaris alp (L.) Hull
Campanula stevenii M.Bieb. | Campanula stevenii | Campanula stevenii M.Bieb. subsp. beauverdiana (Fomin) Rech.f. & Schiman-Czeika
Tracheophyta | Tracheophyta | Campylospermum sp.
Carex L. | Carex | Carex foliosa
Carex L. | Carex | Carex nigra agg.
Psephellus Cass. | Psephellus | Centaurea bella
Asteraceae | Asteraceae | Centipeda orbicularis
Chamelaucium Desf. | Chamelaucium | Chamelaucium griffinii
Dysphania R.Br. | Dysphania | Chenopodium nepalense
Chicorium Dumort., 1822 | Chicorium | Chicorium intybus
Chrysophyllum L. | Chrysophyllum | Chrysophyllum sp.1
Citrus L. | Citrus | Citrus sp
... | ... | ...
Scapania paradoxa R.M.Schust. | Scapania paradoxa | Scapania paradoxa var. paradoxa
Schuurmansia Blume | Schuurmansia | Schuurmansia sp
Scirpus Tourn. ex L. | Scirpus | Scirpus sp.
Cyperaceae | Cyperaceae | Scirpus fistulosus
Scrophularia L. | Scrophularia | Scrophularia auriculata L. Orthodox
Senna Mill. | Senna | Senna form taxon petiolaris
Sisymbrium L. | Sisymbrium | Sisymbrium sp
Solanum L. | Solanum | Solanum fendleri
Solanum L. | Solanum | Solanum quercifolium
Fabaceae | Fabaceae | Sophora viciifolia
Sorghum Moench | Sorghum | Sorghum spp.
Spermacoce L. | Spermacoce | Spermacoce assurgens
Sphalmanthus N.E.Br. | Sphalmanthus | Sphalmanthus neilii
Tabebuia Gomes | Tabebuia | Tabebuia billbergiana
Asteraceae | Asteraceae | Taraxacum pumilum
Taxillus limprichtii (Grüning) H.S.Kiu | Taxillus limprichtii | Taxillus limprichtii (Grning) H.S. Kiu
Thecanthes Wikstr. | Thecanthes | Thecanthes sp
Tripleurospermum Sch.Bip. | Tripleurospermum | Tripleurospermum hampeanum
Ursinia Gaertn. | Ursinia | Ursinia sp.
Urtica L. | Urtica | Urtica major
Vallisneria P.Micheli ex L. | Vallisneria | Vallisneria tortissima
Vangueria Juss. | Vangueria | Vangueria sp.
Vepris Comm. ex A.Juss. | Vepris | Vepris fischeri
Asteraceae | Asteraceae | Vernonia praecox
Asteraceae | Asteraceae | Vernonia subsessilis
Veronica L. | Veronica | Veronica sp
Vignea P.Beauv. | Vignea | Vignea lapponica
Viola L. | Viola | Viola repens
Zollernia Wied-Neuw. & Nees | Zollernia | Zollernia sp
Zygogynum Baill. | Zygogynum | Zygogynum pancheri_Dogny (Baill.) Vink

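Many of these matches are cases where the species epithet is given as “sp”/“sp.”, or where author names and other trailing text may have been interpreted as part of the epithet. Preprocessing the names (see name parsing) may help mitigate these issues. For example, once the trailing “L. Orthodox” is removed, Aconitum napellus is matched exactly at species level: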

name_backbone("Aconitum napellus")
A tibble: 1 × 23
usageKey | scientificName | canonicalName | rank | status | confidence | matchType | kingdom | phylum | order | kingdomKey | phylumKey | classKey | orderKey | familyKey | genusKey | speciesKey | synonym | class | verbatim_name
<int> | <chr> | <chr> | <chr> | <chr> | <int> | <chr> | <chr> | <chr> | <chr> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <lgl> | <chr> | <chr>
3033665 | Aconitum napellus L. | Aconitum napellus | SPECIES | ACCEPTED | 97 | EXACT | Plantae | Tracheophyta | Ranunculales | 6 | 7707728 | 220 | 399 | 2410 | 3033663 | 3033665 | FALSE | Magnoliopsida | Aconitum napellus
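
Some of this preprocessing can be sketched with a few regular expressions in base R. The function name `clean_name` and its patterns are assumptions derived from the sample rows shown above (“sp”, “spp.”, “indet”, trailing “Orthodox” labels); this is not a complete name parser, and author strings are deliberately left untouched.

```r
# Hypothetical cleaning function for the plant names (illustration only)
clean_name <- function(x) {
  # normalize whitespace
  x <- gsub("\\s+", " ", trimws(x))
  # drop trailing labels such as "Orthodox" or "Orthodox p" seen in the sample
  x <- sub("\\s+Orthodox(\\s+p)?$", "", x)
  # reduce open nomenclature ("sp", "sp.", "spp.", "sp1887", "indet") to the genus
  x <- sub("^([A-Z][a-z]+)\\s+(sp|spp|spec|indet)\\.?\\s*[0-9]*$", "\\1", x)
  x
}

cleaned <- clean_name(c("Abies sp", "Acacia sp1887",
                        "Aconitum napellus L. Orthodox",
                        "Sorghum spp.", "Ampelocera indet"))
print(cleaned)
```

Names reduced to a genus will still be matched at genus level, but the result is then an explicit genus match rather than a failed species match.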

Let’s check the animals now.

table(resA$kingdom)
sum(is.na(resA$kingdom))
table(resA$matchType)
Animalia 
    4821 
179
     EXACT      FUZZY HIGHERRANK       NONE 
      4408        282        131        179 

Here, classification into animals was unambiguous. Let’s check the higher rank matches.

resA[matchType == "HIGHERRANK", c("scientificName", "canonicalName", "verbatim_name")]
A data.table: 131 × 3
scientificName | canonicalName | verbatim_name
<chr> | <chr> | <chr>
Accipiter Brisson, 1760 | Accipiter | Accipiter spp.-5 Rothschild & Hartert, 1926
Achillides chikae | Achillides chikae | Achillides chikae chikae
Achillides chikae | Achillides chikae | Achillides chikae hermeli Nuyda, 1992
Animalia | Animalia | Acropora morphospec1 Veron & Wallace, 1984
Acroporidae | Acroporidae | Acroporidae_Acropora cuneata (Dana, 1846)
Animalia | Animalia | Acropora josephi George & Sukumaran, 2007
Animalia | Animalia | Acropora mannaqensis George & Sukumaran, 2007
Acroporidae | Acroporidae | Acroporidae Acropora subglabra (Brook, 1891)
Acroporidae | Acroporidae | Acroporidae_Acropora tanegashimensis Veron, 1990
Animalia | Animalia | Acropora thomasi George & Sukumaran, 2007
Animalia | Animalia | Acropora valimunensis George & Sukumaran, 2007
Strigidae | Strigidae | Strigidae Aegolius funereus (Linnaeus, 1758)
Anatidae | Anatidae | Anatidae Anas nesiotis (J. H. Fleming, 1935)
Trochilidae | Trochilidae | Trochilidae_Androdon aequatorialis Gould, 1863
Antigone antigone (Linnaeus, 1758) | Antigone antigone | Antigone antigone canadensis (Linnaeus, 1758)
Antigone antigone (Linnaeus, 1758) | Antigone antigone | Antigone antigone canadensis nesiotes (Bangs & Zappey, 1905)
Antigone antigone (Linnaeus, 1758) | Antigone antigone | Antigone antigone rubicunda (Perry, 1810)
Astrapia Vieillot, 1816 | Astrapia | Astrapia sephaniae (Finsch, 1885)
Australomussa Veron, 1985 | Australomussa | Australomussa roleyensis Veron, 1985
Brookesia Gray, 1865 | Brookesia | Brookesia spp. Q Brygoo & Domergue, 1975
Brookesia Gray, 1865 | Brookesia | Brookesia spec._F
Buteo Lacepede, 1799 | Buteo | BUTEO sp.N (Linnaeus, 1758)
Calumma Gray, 1865 | Calumma | Calumma spec.-H (Hillenius, 1959)
Candoia Gray, 1842 | Candoia | Candoia sp_N (Duméril & Bibron, 1844)
Centrolene Jiménez de la Espada, 1872 | Centrolene | Centrolene bukleyi (Boulenger, 1882)
Cerdocyon C.E.H.Smith, 1839 | Cerdocyon | Cerdocyon thuos (Linnaeus, 1766)
Chersobius Fitzinger, 1835 | Chersobius | Chersobius chersobius signatus (Gmelin, 1789)
Chilabothrus A.M.C.Duméril & Bibron, 1844 | Chilabothrus | Chilabothrus chilabothrus chrysogaster (Cope, 1871)
Chilabothrus A.M.C.Duméril & Bibron, 1844 | Chilabothrus | Chilabothrus chilabothrus fordii (Günther, 1861)
Antipathidae | Antipathidae | Antipathidae Cirrhipathes sieboldii Blainville, 1834
... | ... | ...
Poritidae | Poritidae | Poritidae_Porites lichen Dana, 1846
Animalia | Animalia | Pristis
Psephotellus Mathews, 1913 | Psephotellus | Psephotellus psephotellus pulcherrimus (Gould, 1845)
Pseudastur Mayr, 1998 | Pseudastur | Pseudastur pseudastur albicollis (Latham, 1790)
Pseudocordylus Smith, 1838 | Pseudocordylus | Pseudocordylus pseudocordylus melanotus A. Smith, 1838
Pseudocordylus Smith, 1838 | Pseudocordylus | Pseudocordylus pseudocordylus spinosus FitzSimons, 1947
Pseudocordylus Smith, 1838 | Pseudocordylus | Pseudocordylus sp._L A. Smith, 1838
Pyrrhura Bonaparte, 1856 | Pyrrhura | Pyrrhura alipectus Chapman, 1914
Chordata | Chordata | Salvator salvator merianae Duméril & Bibron, 1839
Sarcoramphus Dumeril, 1805 | Sarcoramphus | Sarcoramphus spec.Z
Scaphirhynchus Heckel, 1836 | Scaphirhynchus | Scaphirhynchus sp K
Schizoculina Wells, 1937 | Schizoculina | Schizoculina sp-Y Wells 1937
Smaug Stanley, Bauer, Jackman, Branch & Mouton, 2011 | Smaug | Smaug morphospec_U
Smaug Stanley, Bauer, Jackman, Branch & Mouton, 2011 | Smaug | Smaug smaug breyeri Van Dam, 1921
Smaug Stanley, Bauer, Jackman, Branch & Mouton, 2011 | Smaug | Smaug smaug giganteus A. Smith, 1844
Smaug Stanley, Bauer, Jackman, Branch & Mouton, 2011 | Smaug | Smaug smaug mossambicus FitzSimons, 1958
Smaug Stanley, Bauer, Jackman, Branch & Mouton, 2011 | Smaug | Smaug smaug regius Broadley, 1962
Smaug Stanley, Bauer, Jackman, Branch & Mouton, 2011 | Smaug | Smaug smaug vandami FitzSimons, 1930
Animalia | Animalia | STENELLA J. E. Gray, 1866
Stylaster Gray, 1831 | Stylaster | Stylaster sp.-1 Quelch, 1884
Animalia | Animalia | Tanygnathus
Animalia | Animalia | Tanygnathus ssp_Z (Boddaert, 1783)
Tapirus Brisson, 1762 | Tapirus | Tapirus teqrestris (Linnaeus, 1758)
Toropuku Nielsen, Bauer, Jackman, Hitchmough & Daugherty, 2011 | Toropuku | Toropuku toropuku stephensi Robb, 1980
Touit G.R.Gray, 1855 | Touit | Touit hueii (Temminck, 1830)
Trachypithecus Reichenbach, 1862 | Trachypithecus | Trachypithecus villosus (Griffith, 1821)
Trichoglossus Stephens, 1826 | Trichoglossus | Trichoglossus jonstoniae Hartert, 1903
Tylototriton Anderson, 1871 | Tylototriton | Tylototriton daloushanensis Zhou, Xiao & Luo, 2022
Tympanocryptis W.C.H.Peters, 1863 | Tympanocryptis | Tympanocryptis sp. W W. C. H. Peters, 1863
Gekkonidae | Gekkonidae | Gekkonidae Woodworthia ""cromwell gorge"" Nielsen, Bauer, Jackman, Hitchmough & Daugherty, 2011

There are some problems with the classification of specific genera, such as Acropora, and with genus names that are erroneously repeated within names, as in Smaug smaug breyeri. Additionally, family names placed before the actual scientific names should be removed. These issues can also be alleviated during preprocessing.
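
The two animal-specific problems could be handled roughly as follows. All function names here are hypothetical, and the rules are assumptions derived from the rows shown above. One caveat: repeated names (tautonyms) are valid in zoology, so the genus repeat is only collapsed when a further lowercase epithet follows, and even then a legitimate trinomial built on a tautonym would be shortened. It is therefore safer to apply this only to names that failed to match at species level.

```r
# Remove a leading family name (zoological "-idae" or botanical "-aceae"),
# separated from the actual name by a space or an underscore
drop_family_prefix <- function(x) {
  sub("^[A-Z][a-z]+(idae|aceae)[ _]+", "", x)
}

# Collapse "Genus genus epithet ..." to "Genus epithet ..."
collapse_repeated_genus <- function(x) {
  vapply(strsplit(x, " ", fixed = TRUE), function(tok) {
    repeated <- length(tok) >= 3 &&
      tolower(tok[1]) == tolower(tok[2]) &&
      grepl("^[a-z]", tok[3])
    if (repeated) tok <- tok[-2]
    paste(tok, collapse = " ")
  }, character(1))
}

tidy_animal_name <- function(x) collapse_repeated_genus(drop_family_prefix(x))

tidied <- tidy_animal_name(c("Smaug smaug breyeri Van Dam, 1921",
                             "Acroporidae_Acropora cuneata (Dana, 1846)",
                             "Bison bison (Linnaeus, 1758)"))
print(tidied)
```

The third name is not from the sample list; it is included to show that a plain tautonym followed by an author string is left unchanged.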

Writing your own name resolution function

Sometimes, you may want to check datasets against a very specific reference database, or your name resolution service of choice may not use the newest version of your reference dataset. In this case, you can write your own name resolution algorithm. Be aware that there are many caveats in this process, and it will not be easy to write a function that matches the correctness and efficiency of services like GBIF. Speed is a particular concern: for big datasets, a function that cannot be run in parallel, on your own machine or on a high-performance cluster, is unlikely to deliver results within an acceptable time frame.
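
One way to keep a matching function parallel-ready is to let it work on independent chunks of the query list. The sketch below is an assumption-laden illustration: the function `match_in_chunks` is hypothetical, and it uses the base `parallel` package rather than doSNOW, which is loaded at the top of this notebook. Without a cluster object it simply runs the chunks one after another.

```r
# Hypothetical helper: exact matching of names against a reference vector,
# processed in chunks that can optionally be sent to a cluster
match_in_chunks <- function(names, reference, n_chunks = 4, cl = NULL) {
  idx <- seq_along(names)
  # assign each query to one of n_chunks contiguous groups
  chunks <- split(idx, ceiling(idx * n_chunks / length(idx)))
  worker <- function(i, names, reference) match(names[i], reference)
  res <- if (is.null(cl)) {
    lapply(chunks, worker, names = names, reference = reference)
  } else {
    parallel::parLapply(cl, chunks, worker, names = names, reference = reference)
  }
  # chunks are contiguous and ordered, so this restores the input order
  unlist(res, use.names = FALSE)
}

ref <- c("Aa argyrolepis", "Aa aurantiaca", "Aa brevis")
qry <- c("Aa brevis", "Abies sp", "Aa argyrolepis", "Aa brevis", "Aa aurantiaca")
pos <- match_in_chunks(qry, ref, n_chunks = 2)
print(pos)

# Parallel use (not run here):
# cl <- parallel::makeCluster(2)
# pos <- match_in_chunks(qry, ref, n_chunks = 2, cl = cl)
# parallel::stopCluster(cl)
```

For plain exact matching, the vectorized `match()` is already fast and parallelization brings little; chunking starts to pay off once each name requires more expensive work such as fuzzy matching.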

Let’s imagine we want to check our plants dataset against the newest version of the Leipzig Catalogue of Vascular Plants (LCVP).

# read in dataset
LCVP <- fread(paste0(.brd, "PlantHub/LCVP_PlantHub_2024-01-25.gz"))
str(LCVP)
Classes 'data.table' and 'data.frame':	1337778 obs. of  21 variables:
 $ global Id                 : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Input Genus               : chr  "Aa" "Aa" "Aa" "Aa" ...
 $ Input Epitheton           : chr  "argyrolepis" "aurantiaca" "brevis" "calceata" ...
 $ Rank                      : chr  "species" "species" "species" "species" ...
 $ Input Subspecies Epitheton: chr  "" "" "" "" ...
 $ Input Authors             : chr  "(Rchb.f.) Rchb.f." "D.Trujillo" "Schltr." "(Rchb.f.) Schltr." ...
 $ Status                    : chr  "accepted" "accepted" "synonym" "accepted" ...
 $ globalId of Output Taxon  : int  1 2 819078 4 819080 6 7 8 9 10 ...
 $ Output Taxon              : chr  "Aa argyrolepis (Rchb.f.) Rchb.f." "Aa aurantiaca D.Trujillo" "Myrosmodes breve (Schltr.) Garay" "Aa calceata (Rchb.f.) Schltr." ...
 $ family                    : chr  "Orchidaceae" "Orchidaceae" "Orchidaceae" "Orchidaceae" ...
 $ Order                     : chr  "Asparagales" "Asparagales" "Asparagales" "Asparagales" ...
 $ Literature                : chr  "" "Lankesteriana 2011.11.1 1-8;" "" "" ...
 $ Comments                  : chr  "" "" "" "" ...
 $ status                    : chr  "accepted" "accepted" "synonym" "accepted" ...
 $ nameIn                    : chr  "Aa argyrolepis" "Aa aurantiaca" "Aa brevis" "Aa calceata" ...
 $ authorsIn                 : chr  "(Rchb.f.) Rchb.f." "D.Trujillo" "Schltr." "(Rchb.f.) Schltr." ...
 $ nameOut                   : chr  "Aa argyrolepis" "Aa aurantiaca" "Myrosmodes breve" "Aa calceata" ...
 $ authorsOut                : chr  "(Rchb.f.) Rchb.f." "D.Trujillo" "(Schltr.) Garay" "(Rchb.f.) Schltr." ...
 $ IPNIID                    : chr  "614525-1" "77112075-1" "301821-2" "1008443-2" ...
 $ WFOLink                   : chr  "wfo-0000760991" "wfo-0000922666" "wfo-0000854509" "wfo-0000928062" ...
 $ WPName                    : chr  "Aa argyrolepis (Rchb.f.) Rchb.f." "Aa aurantiaca D.Trujillo" "Aa brevis Schltr." "Aa calceata (Rchb.f.) Schltr." ...
 - attr(*, ".internal.selfref")=<externalptr> 

This is an enhanced version of LCVP 2.0, with some errors corrected, and some data added. It includes ASCII-only columns nameIn, authorsIn, nameOut, and authorsOut, as well as links to IPNI, POWO, WFO, and WorldPlants.

Let’s set up a simple matching algorithm. You may work on it to include some of the main problematic cases.

First, let’s tune our reference list. We would like to be able to identify families and genera before the actual matching, and to do this efficiently, we can extract those from LCVP. We should also be able to directly match complete names with authors, so let’s create a column with those, too.

# genera
genera <- sort(unique(sub("\\s.*", "", LCVP$nameIn)))
genera <- genera[genera != ""]
genera[1:10]
# families
families <- sort(unique(LCVP$family))
families <- families[families != ""]
families[1:10]
# name + author column
LCVP[, fullNameIn := trimws(paste(nameIn, authorsIn))]
LCVP$fullNameIn[1:10]
  1. 'Aa'
  2. 'Aakia'
  3. 'Aalius'
  4. 'Aaronsohnia'
  5. 'Abacopteris'
  6. 'Abacosa'
  7. 'Abalon'
  8. 'Abama'
  9. 'Abapus'
  10. 'Abarema'
  1. 'Acanthaceae'
  2. 'Achariaceae'
  3. 'Achatocarpaceae'
  4. 'Acoraceae'
  5. 'Actinidiaceae'
  6. 'Adoxaceae'
  7. 'Aextoxicaceae'
  8. 'Afrothismiaceae'
  9. 'Agavaceae'
  10. 'Agdestidaceae'
  1. 'Aa argyrolepis (Rchb.f.) Rchb.f.'
  2. 'Aa aurantiaca D.Trujillo'
  3. 'Aa brevis Schltr.'
  4. 'Aa calceata (Rchb.f.) Schltr.'
  5. 'Aa chiogena Schltr.'
  6. 'Aa colombiana Schltr.'
  7. 'Aa denticulata Schltr.'
  8. 'Aa erosa (Rchb.f.) Schltr.'
  9. 'Aa fiebrigii (Schltr.) Schltr.'
  10. 'Aa figueroi Szlach. & S.Nowak'
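
With such a genus list, the first word of each query can be checked before any full-name matching. A minimal sketch with a three-element stand-in for `genera`; the query names, including the misspelled one, are illustrative.

```r
# Stand-in for the `genera` vector extracted from LCVP above
genera_demo <- c("Aa", "Abarema", "Abies")

queries <- c("Abies sp", "Abarema curvicarpa", "Abiez alba", "Abies")

# first word of each query, using the same pattern as for the reference list
first_word <- sub("\\s.*", "", queries)

genus_check <- data.frame(
  query       = queries,
  genus       = first_word,
  genus_known = first_word %in% genera_demo,
  stringsAsFactors = FALSE
)
print(genus_check)
```

Queries whose first word is not a known genus are candidates for spelling correction, or for the removal of a leading family name.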

Let’s prepare a results table. For simplicity, we will store the ID of matches in LCVP when a name was found, and indicate whether it is a genus or family found in LCVP (without ID) otherwise.

resTable <- data.table(name = plants$newName, LCVP_ID = numeric(), LCVP_genus = logical(), LCVP_family = logical())
Warning message in as.data.table.list(x, keep.rownames = keep.rownames, check.names = check.names, :
"Item 2 has 0 rows but longest item has 5000; filled with NA"
Warning message in as.data.table.list(x, keep.rownames = keep.rownames, check.names = check.names, :
"Item 3 has 0 rows but longest item has 5000; filled with NA"
Warning message in as.data.table.list(x, keep.rownames = keep.rownames, check.names = check.names, :
"Item 4 has 0 rows but longest item has 5000; filled with NA"

We can ignore the warnings, which only tell us that the LCVP columns in the results table are empty for now. Let’s fill the table.
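
As an aside, the warnings can be avoided by initializing the empty columns with typed `NA` values, which data.table recycles to the length of the name column. A small sketch with made-up names; `resDemo` is a stand-in for the results table and assumes the data.table package is installed.

```r
library(data.table)

demo_names <- c("Abies sp", "Aa brevis", "Fabaceae")

# length-1 NA values are recycled silently, so no warnings are raised
resDemo <- data.table(
  name        = demo_names,
  LCVP_ID     = NA_real_,  # numeric column
  LCVP_genus  = NA,        # logical column
  LCVP_family = NA         # logical column
)
str(resDemo)
```

The resulting columns have the same types as in the table created above, so the filling steps work unchanged.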

# test whether names found in genera
which(plants$newName %in% genera)
# test whether names found in families
which(plants$newName %in% families)

# write data into results
resTable[plants$newName %in% genera, LCVP_genus := TRUE]

resTable[plants$newName %in% families, LCVP_family := TRUE]

# test whether names in nameIn, i.e. names without authors
which(plants$newName %in% LCVP$nameIn)
# test whether names in fullNameIn, i.e. names with authors
which(plants$newName %in% LCVP$fullNameIn)
  1. 5
  2. 62
  3. 108
  4. 517
  5. 664
  6. 1157
  7. 1206
  8. 1410
  9. 1509
  10. 1592
  11. 1956
  12. 1957
  13. 1959
  14. 2085
  15. 2185
  16. 2290
  17. 2452
  18. 2660
  19. 2692
  20. 2864
  21. 2892
  22. 2904
  23. 3125
  24. 3639
  25. 3679
  26. 4029
  27. 4452
  28. 4642
  29. 4703
  30. 4930
1375
  1. 4
  2. 6
  3. 8
  4. 9
  5. 12
  6. 13
  7. 15
  8. 16
  9. 20
  10. 21
  11. 25
  12. 28
  13. 29
  14. 30
  15. 32
  16. 33
  17. 35
  18. 36
  19. 39
  20. 41
  21. 42
  22. 52
  23. 53
  24. 54
  25. 56
  26. 57
  27. 58
  28. 59
  29. 60
  30. 61
  31. 65
  32. 68
  33. 71
  34. 72
  35. 73
  36. 75
  37. 79
  38. 80
  39. 81
  40. 82
  41. 84
  42. 89
  43. 90
  44. 91
  45. 92
  46. 93
  47. 94
  48. 95
  49. 96
  50. 98
  51. 99
  52. 100
  53. 102
  54. 105
  55. 106
  56. 107
  57. 109
  58. 110
  59. 111
  60. 112
  61. 113
  62. 114
  63. 115
  64. 116
  65. 119
  66. 120
  67. 123
  68. 124
  69. 126
  70. 127
  71. 130
  72. 131
  73. 135
  74. 136
  75. 137
  76. 139
  77. 142
  78. 143
  79. 144
  80. 148
  81. 149
  82. 150
  83. 151
  84. 152
  85. 154
  86. 157
  87. 160
  88. 161
  89. 162
  90. 163
  91. 164
  92. 165
  93. 166
  94. 167
  95. 168
  96. 170
  97. 171
  98. 173
  99. 179
  100. 180
  101. 181
  102. 182
  103. 184
  104. 188
  105. 190
  106. 192
  107. 193
  108. 194
  109. 195
  110. 196
  111. 197
  112. 198
  113. 199
  114. 202
  115. 205
  116. 207
  117. 211
  118. 212
  119. 213
  120. 214
  121. 217
  122. 218
  123. 220
  124. 221
  125. 224
  126. 227
  127. 228
  128. 229
  129. 230
  130. 231
  131. 232
  132. 233
  133. 234
  134. 235
  135. 237
  136. 238
  137. 239
  138. 243
  139. 246
  140. 249
  141. 250
  142. 252
  143. 253
  144. 254
  145. 255
  146. 258
  147. 260
  148. 262
  149. 263
  150. 264
  151. 265
  152. 267
  153. 268
  154. 270
  155. 271
  156. 274
  157. 276
  158. 277
  159. 282
  160. 283
  161. 284
  162. 285
  163. 286
  164. 287
  165. 289
  166. 290
  167. 293
  168. 295
  169. 297
  170. 299
  171. 300
  172. 301
  173. 302
  174. 304
  175. 305
  176. 306
  177. 307
  178. 308
  179. 309
  180. 310
  181. 312
  182. 313
  183. 314
  184. 316
  185. 317
  186. 319
  187. 320
  188. 322
  189. 323
  190. 324
  191. 325
  192. 326
  193. 328
  194. 329
  195. 330
  196. 331
  197. 332
  198. 334
  199. 337
  200. 338
  201. 4686
  202. 4687
  203. 4688
  204. 4689
  205. 4693
  206. 4694
  207. 4695
  208. 4697
  209. 4699
  210. 4702
  211. 4704
  212. 4705
  213. 4706
  214. 4707
  215. 4708
  216. 4710
  217. 4711
  218. 4712
  219. 4713
  220. 4714
  221. 4715
  222. 4716
  223. 4718
  224. 4719
  225. 4722
  226. 4726
  227. 4727
  228. 4734
  229. 4735
  230. 4736
  231. 4739
  232. 4740
  233. 4744
  234. 4746
  235. 4747
  236. 4749
  237. 4750
  238. 4751
  239. 4752
  240. 4753
  241. 4756
  242. 4757
  243. 4759
  244. 4762
  245. 4763
  246. 4764
  247. 4767
  248. 4768
  249. 4770
  250. 4771
  251. 4772
  252. 4773
  253. 4774
  254. 4775
  255. 4777
  256. 4778
  257. 4779
  258. 4780
  259. 4781
  260. 4782
  261. 4783
  262. 4784
  263. 4786
  264. 4789
  265. 4790
  266. 4791
  267. 4792
  268. 4793
  269. 4796
  270. 4797
  271. 4798
  272. 4799
  273. 4800
  274. 4801
  275. 4803
  276. 4804
  277. 4805
  278. 4807
  279. 4813
  280. 4814
  281. 4816
  282. 4817
  283. 4818
  284. 4820
  285. 4821
  286. 4823
  287. 4825
  288. 4826
  289. 4829
  290. 4831
  291. 4832
  292. 4835
  293. 4837
  294. 4838
  295. 4844
  296. 4846
  297. 4847
  298. 4848
  299. 4849
  300. 4851
  301. 4854
  302. 4856
  303. 4858
  304. 4862
  305. 4863
  306. 4865
  307. 4866
  308. 4867
  309. 4869
  310. 4870
  311. 4871
  312. 4872
  313. 4873
  314. 4874
  315. 4875
  316. 4876
  317. 4879
  318. 4880
  319. 4883
  320. 4884
  321. 4885
  322. 4889
  323. 4892
  324. 4893
  325. 4894
  326. 4896
  327. 4897
  328. 4899
  329. 4900
  330. 4902
  331. 4903
  332. 4904
  333. 4906
  334. 4907
  335. 4908
  336. 4909
  337. 4913
  338. 4916
  339. 4917
  340. 4918
  341. 4919
  342. 4920
  343. 4923
  344. 4924
  345. 4925
  346. 4926
  347. 4927
  348. 4928
  349. 4929
  350. 4932
  351. 4933
  352. 4934
  353. 4935
  354. 4937
  355. 4938
  356. 4939
  357. 4940
  358. 4941
  359. 4943
  360. 4944
  361. 4946
  362. 4948
  363. 4949
  364. 4950
  365. 4952
  366. 4955
  367. 4956
  368. 4957
  369. 4958
  370. 4959
  371. 4961
  372. 4963
  373. 4964
  374. 4965
  375. 4967
  376. 4969
  377. 4970
  378. 4971
  379. 4972
  380. 4973
  381. 4974
  382. 4975
  383. 4976
  384. 4977
  385. 4979
  386. 4980
  387. 4981
  388. 4982
  389. 4983
  390. 4984
  391. 4985
  392. 4986
  393. 4988
  394. 4989
  395. 4991
  396. 4992
  397. 4994
  398. 4995
  399. 4996
  400. 5000
  1. 14
  2. 17
  3. 18
  4. 24
  5. 27
  6. 40
  7. 55
  8. 78
  9. 83
  10. 86
  11. 97
  12. 101
  13. 104
  14. 117
  15. 118
  16. 121
  17. 122
  18. 129
  19. 132
  20. 147
  21. 156
  22. 176
  23. 178
  24. 183
  25. 185
  26. 186
  27. 200
  28. 201
  29. 203
  30. 206
  31. 208
  32. 215
  33. 216
  34. 219
  35. 222
  36. 241
  37. 248
  38. 251
  39. 257
  40. 272
  41. 278
  42. 291
  43. 294
  44. 296
  45. 298
  46. 303
  47. 336
  48. 342
  49. 345
  50. 347
  51. 349
  52. 358
  53. 360
  54. 372
  55. 384
  56. 434
  57. 436
  58. 438
  59. 439
  60. 440
  61. 451
  62. 456
  63. 458
  64. 459
  65. 469
  66. 475
  67. 483
  68. 487
  69. 488
  70. 501
  71. 502
  72. 535
  73. 537
  74. 561
  75. 571
  76. 572
  77. 575
  78. 577
  79. 578
  80. 580
  81. 582
  82. 584
  83. 585
  84. 586
  85. 587
  86. 592
  87. 593
  88. 601
  89. 610
  90. 621
  91. 632
  92. 653
  93. 657
  94. 672
  95. 673
  96. 682
  97. 689
  98. 694
  99. 706
  100. 713
  101. 716
  102. 717
  103. 725
  104. 740
  105. 741
  106. 744
  107. 745
  108. 746
  109. 757
  110. 760
  111. 775
  112. 796
  113. 800
  114. 809
  115. 841
  116. 845
  117. 847
  118. 855
  119. 868
  120. 881
  121. 884
  122. 893
  123. 899
  124. 903
  125. 908
  126. 915
  127. 918
  128. 922
  129. 927
  130. 930
  131. 948
  132. 952
  133. 953
  134. 955
  135. 974
  136. 980
  137. 990
  138. 992
  139. 998
  140. 1000
  141. 1009
  142. 1010
  143. 1011
  144. 1016
  145. 1018
  146. 1019
  147. 1028
  148. 1032
  149. 1033
  150. 1042
  151. 1045
  152. 1062
  153. 1066
  154. 1068
  155. 1071
  156. 1073
  157. 1079
  158. 1099
  159. 1114
  160. 1119
  161. 1134
  162. 1167
  163. 1175
  164. 1181
  165. 1187
  166. 1188
  167. 1193
  168. 1195
  169. 1196
  170. 1197
  171. 1198
  172. 1222
  173. 1223
  174. 1229
  175. 1239
  176. 1243
  177. 1255
  178. 1265
  179. 1273
  180. 1278
  181. 1285
  182. 1286
  183. 1287
  184. 1289
  185. 1290
  186. 1292
  187. 1295
  188. 1297
  189. 1298
  190. 1299
  191. 1300
  192. 1301
  193. 1306
  194. 1308
  195. 1328
  196. 1332
  197. 1338
  198. 1339
  199. 1346
  200. 1360
  201. 3495
  202. 3497
  203. 3509
  204. 3511
  205. 3512
  206. 3518
  207. 3536
  208. 3545
  209. 3546
  210. 3550
  211. 3559
  212. 3572
  213. 3577
  214. 3583
  215. 3586
  216. 3589
  217. 3593
  218. 3599
  219. 3604
  220. 3606
  221. 3607
  222. 3608
  223. 3611
  224. 3612
  225. 3614
  226. 3615
  227. 3616
  228. 3626
  229. 3631
  230. 3634
  231. 3652
  232. 3664
  233. 3665
  234. 3676
  235. 3704
  236. 3706
  237. 3714
  238. 3720
  239. 3726
  240. 3732
  241. 3734
  242. 3736
  243. 3737
  244. 3739
  245. 3740
  246. 3749
  247. 3752
  248. 3778
  249. 3786
  250. 3788
  251. 3792
  252. 3794
  253. 3817
  254. 3834
  255. 3848
  256. 3856
  257. 3879
  258. 3885
  259. 3891
  260. 3903
  261. 3921
  262. 3922
  263. 3938
  264. 3948
  265. 3949
  266. 3956
  267. 3966
  268. 3973
  269. 3979
  270. 3985
  271. 3991
  272. 4002
  273. 4020
  274. 4051
  275. 4069
  276. 4070
  277. 4089
  278. 4102
  279. 4106
  280. 4137
  281. 4141
  282. 4146
  283. 4148
  284. 4154
  285. 4164
  286. 4166
  287. 4203
  288. 4207
  289. 4220
  290. 4242
  291. 4252
  292. 4257
  293. 4269
  294. 4274
  295. 4278
  296. 4283
  297. 4304
  298. 4307
  299. 4311
  300. 4317
  301. 4319
  302. 4321
  303. 4322
  304. 4329
  305. 4332
  306. 4339
  307. 4340
  308. 4342
  309. 4346
  310. 4347
  311. 4353
  312. 4355
  313. 4367
  314. 4369
  315. 4383
  316. 4388
  317. 4389
  318. 4393
  319. 4394
  320. 4396
  321. 4398
  322. 4402
  323. 4404
  324. 4408
  325. 4409
  326. 4416
  327. 4449
  328. 4456
  329. 4458
  330. 4461
  331. 4492
  332. 4503
  333. 4509
  334. 4512
  335. 4515
  336. 4521
  337. 4527
  338. 4528
  339. 4532
  340. 4535
  341. 4544
  342. 4548
  343. 4549
  344. 4555
  345. 4560
  346. 4583
  347. 4601
  348. 4605
  349. 4606
  350. 4623
  351. 4639
  352. 4658
  353. 4659
  354. 4676
  355. 4691
  356. 4692
  357. 4700
  358. 4701
  359. 4709
  360. 4730
  361. 4741
  362. 4743
  363. 4748
  364. 4754
  365. 4755
  366. 4758
  367. 4765
  368. 4769
  369. 4794
  370. 4819
  371. 4836
  372. 4839
  373. 4840
  374. 4853
  375. 4857
  376. 4859
  377. 4860
  378. 4861
  379. 4864
  380. 4868
  381. 4878
  382. 4882
  383. 4886
  384. 4890
  385. 4895
  386. 4901
  387. 4905
  388. 4911
  389. 4914
  390. 4936
  391. 4945
  392. 4947
  393. 4951
  394. 4960
  395. 4962
  396. 4968
  397. 4987
  398. 4990
  399. 4998
  400. 4999

As we can see, there are many matches both when searching with and without authors. However, a name without authors may occur more than once in the reference list (such identical names are called homonyms). Only one of those will be an accepted name, while the others are synonyms. As a matched name without authors does not allow for disambiguation, we will assign the ID of the accepted name from the reference list if there are several entries. To do this, we create a copy of the reference list, order it by taxonomic status so that accepted names come first, and keep only the first of several rows with identical names without authors. We then use this reduced list to extract the respective IDs.

# create a copy of the reference list
# (copy() is needed; a plain assignment would let setorder() also reorder LCVP by reference)
LCVPUnique <- copy(LCVP)
# order by taxonomic status
setorder(LCVPUnique, status)
# keep only the first of several rows with identical names without authors
LCVPUnique <- unique(LCVPUnique, by = "nameIn")

# check whether it worked
nrow(LCVPUnique)
nrow(LCVP)
1256502
1337778

We removed about 81000 rows (1337778 - 1256502 = 81276) from LCVP in this process. Let’s now get the IDs.

# write data into results, extract LCVP ID
setkey(LCVP, fullNameIn)
res <- LCVP[plants$newName]
resTable[is.na(LCVP_ID), LCVP_ID := res$`global Id`[is.na(resTable$LCVP_ID)]]

setkey(LCVPUnique, nameIn)
res <- LCVPUnique[plants$newName]
resTable[is.na(LCVP_ID), LCVP_ID := res$`global Id`[is.na(resTable$LCVP_ID)]]
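The deduplication and the keyed lookups used above can be tried out on a toy table. This is only a minimal sketch: the table ref, its column id and all names and IDs in it are made up for illustration and do not come from LCVP.

```r
library(data.table)

# toy reference list; names, authors and IDs are made up for illustration
ref <- data.table(
	id = c("a1", "a2", "a3"),
	nameIn = c("Genus alpha", "Genus alpha", "Genus beta"),
	fullNameIn = c("Genus alpha Smith", "Genus alpha L.", "Genus beta L."),
	status = c("synonym", "accepted", "accepted")
)

# order by status ("accepted" sorts before "synonym") and keep the first row per name
refUnique <- copy(ref)
setorder(refUnique, status)
refUnique <- unique(refUnique, by = "nameIn")

# keyed lookup: one result row per queried name, NA where nothing is found
setkey(refUnique, nameIn)
query <- c("Genus beta", "Genus gamma", "Genus alpha")
hits <- refUnique[query]
print(hits[, .(nameIn, id, status)])
```

Because the lookup returns exactly one row per queried name, in the order of the query, the resulting ID column can be written into a results table row by row, which is what the assignments to resTable rely on.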

We can now check what remains of the names in our list. The remainder will be the difficult part, where the algorithm used actually matters.

plants[, matched := FALSE]
plants[!is.na(resTable$LCVP_ID) | !is.na(resTable$LCVP_genus) | !is.na(resTable$LCVP_family), matched := TRUE]
table(plants$matched)
FALSE  TRUE 
 1180  3820 

Of the 5000 names we had to check, 1180 remain unmatched. This is a relatively large number because this dataset is especially messy, which makes it good practice material. Let’s have a look at the unmatched names.

plants[matched == FALSE]$newName[1:20]
  1. ''
  2. '(lauraceae) pubescente'
  3. '?Betulaceae sp.'
  4. 'Abies sp'
  5. 'Abuta_panamensis'
  6. 'Abutilon grandiflorum G.Don Orthodox'
  7. 'Acacia contriva'
  8. 'Acacia eremophila W.Fitzg. var. variabilis Maiden & Blakeley'
  9. 'Acacia flavescens A.Cunn. ex Benth. Orthodox?'
  10. 'Acacia incanicarpa A.R.Chapman & Maslin'
  11. 'Acacia mucronata Willd. ex H.L.Wendl. subsp. mucronata'
  12. 'Acacia plectocarpa A.Cunn. ex Benth. Orthodox'
  13. 'Acacia sclerosperma F.Muell. subsp. sclerosperma'
  14. 'Acacia sp1887'
  15. 'Acacia auriculiformis A.Cunn. ex Benth.'
  16. 'Acacia colei Maslin & L.A.J.Thomson'
  17. 'Acacia drummondii Lindl.'
  18. 'Acacia hadrophylla R.S.Cowan & Maslin'
  19. 'Acacia lazaridis Pedley'
  20. 'Acacia neriifolia A.Cunn. ex Benth.'

It seems that, first of all, we should get rid of author names, as their spelling may differ from the one in LCVP and therefore not produce a match. A very simple way of doing so is to cut names after the second whitespace.

# function to extract first two words
nameShorter <- function(x) {
	# get positions of all whitespaces
	ws <- gregexpr(" ", x)
	# get position of second whitespace if available, otherwise return 0
	ws <- sapply(ws, function(x) if (length(x) > 1) x[2] else 0)
	x[ws > 0] <- substr(x[ws > 0], 1, ws[ws > 0] - 1)
	return(x)
}
print(nameShorter(plants$newName[40:50]))
 [1] "Acacia tortilis"                  "Acacia valida"                   
 [3] "Acacia yorkrakinensis"            "Acacia auriculiformis A.Cunn. ex"
 [5] "Acacia colei Maslin &"            "Acacia drummondii Lindl."        
 [7] "Acacia hadrophylla R.S.Cowan &"   "Acacia lazaridis Pedley"         
 [9] "Acacia neriifolia A.Cunn. ex"     "Acacia ptychoclada Maiden &"     
[11] "Acacia speckii R.S.Cowan &"      

We see that some of the shortened names are not as expected: there are still author names attached to them. The reason is that they contain protected (non-breaking) whitespaces, which are not matched by a regular space. We need to replace them first.

# function to extract first two words
nameShorter <- function(x) {
	# replace protected (non-breaking) whitespaces by regular spaces
	x <- gsub("\xc2\xa0", " ", x)
	# get positions of all whitespaces
	ws <- gregexpr(" ", x)
	# get position of second whitespace if available, otherwise return 0
	ws <- sapply(ws, function(x) if (length(x) > 1) x[2] else 0)
	x[ws > 0] <- substr(x[ws > 0], 1, ws[ws > 0] - 1)
	return(x)
}
print(nameShorter(plants$newName[40:50]))
 [1] "Acacia tortilis"       "Acacia valida"         "Acacia yorkrakinensis"
 [4] "Acacia auriculiformis" "Acacia colei"          "Acacia drummondii"    
 [7] "Acacia hadrophylla"    "Acacia lazaridis"      "Acacia neriifolia"    
[10] "Acacia ptychoclada"    "Acacia speckii"       

This looks much nicer. We can now do the name matching without authors again. Note that, ideally, we would not just match without authors, but also measure the difference between author names so that we actually select the closest match.

plants[, shortName := nameShorter(newName)]

setkey(LCVPUnique, nameIn)
res <- LCVPUnique[plants$shortName]
resTable[is.na(LCVP_ID), LCVP_ID := res$`global Id`[is.na(resTable$LCVP_ID)]]

# update the "matched" column
plants[!is.na(resTable$LCVP_ID) | !is.na(resTable$LCVP_genus) | !is.na(resTable$LCVP_family), matched := TRUE]
table(plants$matched)
FALSE  TRUE 
  581  4419 

As we can see, the number of unmatched names was reduced from 1180 to 581. We could now introduce some fuzzy matching, i.e. try to assign names from the reference list to names with spelling errors. Of course, we could also consider other pre-processing options: correcting the uppercase/lowercase of the names, removing special characters like question marks or underscores, removing “sp.”, etc.

plants[matched == FALSE]$shortName[1:50]
  1. ''
  2. '(lauraceae) pubescente'
  3. '?Betulaceae sp.'
  4. 'Abies sp'
  5. 'Abuta_panamensis'
  6. 'Acacia contriva'
  7. 'Acacia sp1887'
  8. 'Acarospora radicata'
  9. 'Acer sino-oblongum'
  10. 'Aconitum delphiniifolium'
  11. 'Adenanthera sp.'
  12. 'A-Elyhordeum schaackianum'
  13. 'Aeranthus muscicola'
  14. 'Aesculus xworlitzensis'
  15. 'Agathis mooeri'
  16. 'Agave vera-cruz'
  17. 'Agrosthophyllum bicuspidatum'
  18. 'Albizia NA'
  19. 'Alectryon macrococcus'
  20. 'Alexa wachenheimii'
  21. 'Alissum tortuosum'
  22. 'Allium scabrifolium'
  23. 'Aloinopsis gydouwensis'
  24. 'ALOPECURUS GENICULATUS,'
  25. 'Alstroemeria riedeliana'
  26. 'Alyssum caliacre'
  27. 'Alyssum thunbergii'
  28. 'Amaroria soulameiodes'
  29. 'Ampelocera indet'
  30. 'Amphibromus NA'
  31. 'Anartia meyeri'
  32. 'Ancistrocladus stelliger'
  33. 'Andopogon glomeratus'
  34. 'ANEMONE NEMOROSA'
  35. 'Anona squamosa'
  36. 'Anthaenantia villosa'
  37. 'Anthocephalus sp.'
  38. 'Anthurium lezamai'
  39. 'Aquilegia coerulea'
  40. 'Arctoa hyperborea'
  41. 'Arctostaphylos_obispoensis'
  42. 'Ardisia brevipetala'
  43. 'Aristida sanctae-luciae'
  44. 'Arrabidaea trailii'
  45. 'Artemisia '
  46. 'Arthonia polymorpha'
  47. 'Artocarpus lessigiana'
  48. 'Arundinella khaseana'
  49. 'Asperula rechingeri'
  50. 'Asphodeline fistulosus'
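The pre-processing options mentioned above could be sketched as follows. The helper nameCleaner is hypothetical and not part of the original workflow; it assumes that it is applied to names without authors (such as shortName), because lowercasing everything after the first letter would damage author names. Entries like 'Acacia sp1887' are deliberately left untouched here.

```r
# hypothetical helper, not part of the original workflow
nameCleaner <- function(x) {
	# replace underscores by spaces
	x <- gsub("_", " ", x, fixed = TRUE)
	# drop question marks, commas and parentheses
	x <- gsub("[?,()]", "", x)
	# collapse repeated whitespace and trim both ends
	x <- trimws(gsub("\\s+", " ", x))
	# remove trailing placeholders such as "sp", "sp.", "indet" or "NA"
	x <- sub("\\s+(sp\\.?|indet\\.?|NA)$", "", x)
	# capitalise the first letter and lowercase the rest
	x <- paste0(toupper(substr(x, 1, 1)), tolower(substr(x, 2, nchar(x))))
	return(x)
}

cleaned <- nameCleaner(c(
	"Abuta_panamensis", "ALOPECURUS GENICULATUS,", "?Betulaceae sp.", "Abies sp", "Albizia NA"
))
print(cleaned)
```

Names that are reduced to a single word by this cleaning are genus or family names rather than species names, so they would need to be checked against the corresponding columns of the reference list.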

The fuzzy matching will be done in a loop using a matching function. Later on, this will allow us to easily switch to parallel processing. The function below first checks whether the first word, assumed to be the genus, is present in the reference list. If it is found, the fuzzy matching is only done on the species belonging to this genus, massively reducing the computation time. After the fuzzy matching, the best result(s) are selected, and the first of them is returned, or an empty template if there is no match.

# create a template to return in case there is no match
resTemplate <- LCVPUnique[1]
resTemplate[1] <- NA

# function for fuzzy matching
# maxDist controls the Levenshtein distance, i.e. the difference between the given and matched name
nameMatcher <- function(x, maxDist = 2) {
	genus <- sub("\\s.*", "", x$shortName)
	if (genus %in% genera) {
		checkRows <- sub("\\s.*", "", LCVPUnique$nameIn) == genus
	} else {
		checkRows <- rep(TRUE, nrow(LCVPUnique))
	}
	# do fuzzy matching
	res <- LCVPUnique[checkRows][agrepl(paste0("^", x$shortName, "$"), LCVPUnique$nameIn[checkRows],
		max.distance = maxDist, fixed = FALSE
	)]
	if (nrow(res) > 0) {
		# calculate Levenshtein distance
		dists <- adist(x$shortName, res$nameIn)
		# keep best result(s)
		res <- res[as.vector(dists == min(dists))]
		# return first result or return template
		return(res[1])
	} else {
		return(resTemplate)
	}
}
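To get a feeling for the two base R functions used in nameMatcher, the following minimal sketch applies agrepl() and adist() to a small toy vector. The candidate vector is made up for illustration and is not taken from LCVP; the query is one of the misspelled names from our list.

```r
# toy candidate names, made up for illustration
candidates <- c("Alyssum tortuosum", "Alyssum turkestanicum", "Allium tortuosum")
query <- "Alissum tortuosum"

# approximate matching against the whole string (anchored with ^ and $)
hit <- agrepl(paste0("^", query, "$"), candidates, max.distance = 2, fixed = FALSE)

# Levenshtein distance between the query and each candidate
dists <- as.vector(adist(query, candidates))

# best candidate among the approximate hits
best <- candidates[hit][which.min(dists[hit])]
print(best)
```

Here agrepl() only pre-selects candidates within the allowed distance, while adist() ranks them. The query differs from the first candidate by a single substitution, so that candidate is selected as the best match.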

Let’s run this function on some of the remaining unmatched names. As this may take a while, we will only loop over the first 200 names. You may run it on the whole dataset, but expect it to take about half an hour. Running on the first 200 names will just take a minute.

timeStart <- Sys.time()
# for (i in seq_len(nrow(plants))) {
for (i in seq_len(200)) {
	# only check unmatched cases
	if (plants$matched[i] == FALSE) {
		# counter to show progress
		print(paste(i, Sys.time()))
		res <- nameMatcher(plants[i])
		if (!is.na(res$`global Id`)) {
			resTable[i, LCVP_ID := res$`global Id`]
			plants[i, matched := TRUE]
		}
	}
}
Sys.time() - timeStart
[1] "1 2024-04-12 11:58:24.400659"
[1] "2 2024-04-12 11:58:28.823094"
[1] "3 2024-04-12 11:58:34.227135"
[1] "7 2024-04-12 11:58:37.908572"
[1] "10 2024-04-12 11:58:38.654882"
[1] "19 2024-04-12 11:58:42.660404"
[1] "38 2024-04-12 11:58:43.418891"
[1] "63 2024-04-12 11:58:44.179911"
[1] "70 2024-04-12 11:58:48.752534"
[1] "87 2024-04-12 11:58:49.515793"
[1] "103 2024-04-12 11:58:50.269285"
[1] "125 2024-04-12 11:58:51.013671"
[1] "128 2024-04-12 11:58:56.107607"
[1] "133 2024-04-12 11:59:00.12651"
[1] "141 2024-04-12 11:59:00.888705"
[1] "146 2024-04-12 11:59:01.645952"
[1] "158 2024-04-12 11:59:02.720926"
[1] "174 2024-04-12 11:59:07.297597"
[1] "187 2024-04-12 11:59:08.048403"
[1] "189 2024-04-12 11:59:08.804925"
[1] "191 2024-04-12 11:59:09.561838"
Time difference of 49.24616 secs
plants[c(1, 10, 19, 128, 133)]
A data.table: 5 × 4
oldName                 newName                 matched  shortName
<chr>                   <chr>                   <lgl>    <chr>
                                                FALSE
Abuta_panamensis        Abuta_panamensis        TRUE     Abuta_panamensis
Acacia contriva         Acacia contriva         FALSE    Acacia contriva
Aeranthus muscicola     Aeranthus muscicola     TRUE     Aeranthus muscicola
Aesculus xworlitzensis  Aesculus xworlitzensis  TRUE     Aesculus xworlitzensis

What we see from the times needed per individual run is that whenever the genus is found, matching is relatively fast, taking about a second, but when this is not the case, it is quite slow. This is because agrepl() then works on the whole LCVPUnique dataset and has to compare more than one million pairs of words. You could think about a heuristic to reduce the number of rows checked.
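One possible heuristic is sketched below. The function candidateFilter is hypothetical; it assumes that the first letter of the name is spelled correctly and uses the fact that two strings whose lengths differ by more than maxDist characters cannot be within a Levenshtein distance of maxDist. The reference names are a toy vector made up for illustration.

```r
# hypothetical pre-filter; assumes the first letter of the query is spelled correctly
candidateFilter <- function(query, refNames, maxDist = 2) {
	sameInitial <- substr(refNames, 1, 1) == substr(query, 1, 1)
	# a name more than maxDist characters longer or shorter cannot be within maxDist edits
	similarLength <- abs(nchar(refNames) - nchar(query)) <= maxDist
	return(sameInitial & similarLength)
}

# toy reference names, made up for illustration
refNames <- c("Agathis moorei", "Agave vera-cruz", "Bromus agathis", "Agathis")
keep <- candidateFilter("Agathis mooeri", refNames)
print(refNames[keep])
```

In nameMatcher, such a logical vector could replace rep(TRUE, nrow(LCVPUnique)) in the branch where the genus is not found, so that agrepl() only has to work on a fraction of the reference list. The price is that names with an error in the first letter can no longer be matched.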

Parallel processing#

We will now focus on speeding up the process of name checking by running it in parallel. Let’s check how many cores are available on the system.

parallel::detectCores()
16

On my machine, I can use at most 16 cores. That means that I can expect, at best, a roughly 16-fold reduction in processing time. Assuming that matching all unmatched names of the 5000-row dataset takes 30 minutes when run sequentially, the task could be completed in about 2 minutes when run in parallel. However, this comes at a cost: when running processes in parallel, R copies all the objects needed to each parallel process, and in our case, that means copying LCVPUnique 16 times. This takes quite some time, and for only a few iterations of the loop, initializing the parallel processes takes more time than is saved by running in parallel. Nevertheless, we will first try the first 200 names we already processed before (but note that names matched in the sequential run will not be processed again).

We also need to make some adjustments to the code. As the parallel processes use their own copies of the data, it would not make sense to let them write to these individual copies. Therefore, if a match is found, the information needs to be returned to the main process. Also, as objects are copied for each worker, the information needed by the individual processes should be kept minimal. I will also not use all available cores, so that I can do other things on my computer without delay while the process is running.

# create the cluster for parallel processing
cl <- makeCluster(parallel::detectCores() - 1)
registerDoSNOW(cl)

# run the name resolution in parallel
timeStart <- Sys.time()
resTemp <- foreach(i = seq_len(200), .combine = c, .packages = c("data.table")) %dopar% {
	# only check unmatched cases
	# return the global Id if it is found and NA if not checked or nothing could be found
	if (plants$matched[i] == FALSE) {
		res <- nameMatcher(plants[i])
		res <- res$`global Id`
	} else {
		res <- NA
	}
	# indicate what to return
	res
}
Sys.time() - timeStart
# stop the cluster
stopCluster(cl)
Time difference of 3.274923 mins

The process took about 3.3 minutes for me with 15 cores, compared with about 49 seconds for the sequential run, so there is no gain in terms of time for now. Let’s see what we got.

resTemp
  1. <NA>
  2. <NA>
  3. <NA>
  4. <NA>
  5. <NA>
  6. <NA>
  7. <NA>
  8. <NA>
  9. <NA>
  10. <NA>
  11. <NA>
  12. <NA>
  13. <NA>
  14. <NA>
  15. <NA>
  16. <NA>
  17. <NA>
  18. <NA>
  19. <NA>
  20. <NA>
  21. <NA>
  22. <NA>
  23. <NA>
  24. <NA>
  25. <NA>
  26. <NA>
  27. <NA>
  28. <NA>
  29. <NA>
  30. <NA>
  31. <NA>
  32. <NA>
  33. <NA>
  34. <NA>
  35. <NA>
  36. <NA>
  37. <NA>
  38. <NA>
  39. <NA>
  40. <NA>
  41. <NA>
  42. <NA>
  43. <NA>
  44. <NA>
  45. <NA>
  46. <NA>
  47. <NA>
  48. <NA>
  49. <NA>
  50. <NA>
  51. <NA>
  52. <NA>
  53. <NA>
  54. <NA>
  55. <NA>
  56. <NA>
  57. <NA>
  58. <NA>
  59. <NA>
  60. <NA>
  61. <NA>
  62. <NA>
  63. <NA>
  64. <NA>
  65. <NA>
  66. <NA>
  67. <NA>
  68. <NA>
  69. <NA>
  70. <NA>
  71. <NA>
  72. <NA>
  73. <NA>
  74. <NA>
  75. <NA>
  76. <NA>
  77. <NA>
  78. <NA>
  79. <NA>
  80. <NA>
  81. <NA>
  82. <NA>
  83. <NA>
  84. <NA>
  85. <NA>
  86. <NA>
  87. <NA>
  88. <NA>
  89. <NA>
  90. <NA>
  91. <NA>
  92. <NA>
  93. <NA>
  94. <NA>
  95. <NA>
  96. <NA>
  97. <NA>
  98. <NA>
  99. <NA>
  100. <NA>
  101. <NA>
  102. <NA>
  103. <NA>
  104. <NA>
  105. <NA>
  106. <NA>
  107. <NA>
  108. <NA>
  109. <NA>
  110. <NA>
  111. <NA>
  112. <NA>
  113. <NA>
  114. <NA>
  115. <NA>
  116. <NA>
  117. <NA>
  118. <NA>
  119. <NA>
  120. <NA>
  121. <NA>
  122. <NA>
  123. <NA>
  124. <NA>
  125. <NA>
  126. <NA>
  127. <NA>
  128. <NA>
  129. <NA>
  130. <NA>
  131. <NA>
  132. <NA>
  133. <NA>
  134. <NA>
  135. <NA>
  136. <NA>
  137. <NA>
  138. <NA>
  139. <NA>
  140. <NA>
  141. <NA>
  142. <NA>
  143. <NA>
  144. <NA>
  145. <NA>
  146. <NA>
  147. <NA>
  148. <NA>
  149. <NA>
  150. <NA>
  151. <NA>
  152. <NA>
  153. <NA>
  154. <NA>
  155. <NA>
  156. <NA>
  157. <NA>
  158. <NA>
  159. <NA>
  160. <NA>
  161. <NA>
  162. <NA>
  163. <NA>
  164. <NA>
  165. <NA>
  166. <NA>
  167. <NA>
  168. <NA>
  169. <NA>
  170. <NA>
  171. <NA>
  172. <NA>
  173. <NA>
  174. <NA>
  175. <NA>
  176. <NA>
  177. <NA>
  178. <NA>
  179. <NA>
  180. <NA>
  181. <NA>
  182. <NA>
  183. <NA>
  184. <NA>
  185. <NA>
  186. <NA>
  187. <NA>
  188. <NA>
  189. <NA>
  190. <NA>
  191. <NA>
  192. <NA>
  193. <NA>
  194. <NA>
  195. <NA>
  196. <NA>
  197. <NA>
  198. <NA>
  199. <NA>
  200. <NA>

As we already ran the process sequentially, new matches were not to be expected on the first 200 entries. Let’s risk running the process on the whole dataset.

# create the cluster for parallel processing
cl <- makeCluster(parallel::detectCores() - 1)
registerDoSNOW(cl)

# run the name resolution in parallel
timeStart <- Sys.time()
resTemp <- foreach(i = seq_len(nrow(plants)), .combine = c, .packages = c("data.table")) %dopar% {
	# only check unmatched cases
	# return the global Id if it is found and NA if not checked or nothing could be found
	if (plants$matched[i] == FALSE) {
		res <- nameMatcher(plants[i])
		res <- res$`global Id`
	} else {
		res <- NA
	}
	# indicate what to return
	res
}
Sys.time() - timeStart

# stop the cluster
stopCluster(cl)
Time difference of 7.922978 mins

This took about 8 minutes. That’s a big improvement compared to the estimated 30 minutes for the sequential process. Let’s see how many matches we got.

table(!is.na(resTemp))
FALSE  TRUE 
 4749   251 

So out of the 581 names, another 251 could be matched. Let’s transfer the data into resTable. It might well be worth thinking about adding information on the type of matching, as the fuzzy matches are not perfect and might require further checking. However, our current implementation does not give us any information on the type of match.

resTable[!is.na(resTemp), LCVP_ID := resTemp[!is.na(resTemp)]]
plants[!is.na(resTable$LCVP_ID), matched := TRUE]
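One way to record the type of match is sketched below on a toy table. The column matchType and the labels "exact" and "fuzzy" are assumptions for illustration; resTable in this notebook does not contain such a column, and the IDs are made up.

```r
library(data.table)

# toy results table; IDs are made up for illustration
resToy <- data.table(LCVP_ID = c("id1", NA, NA, NA), matchType = NA_character_)
# flag what was already matched before the fuzzy step
resToy[!is.na(LCVP_ID), matchType := "exact"]

# toy output of the fuzzy step, in the same row order as resToy
fuzzyIds <- c(NA, "id7", NA, "id9")

# only fill rows that are still empty, and record how they were filled
newRows <- is.na(resToy$LCVP_ID) & !is.na(fuzzyIds)
resToy[newRows, `:=`(LCVP_ID = fuzzyIds[newRows], matchType = "fuzzy")]
print(resToy)
```

With such a column, all fuzzy matches can later be extracted and checked manually. To make this work for the whole workflow, the label would have to be set at every matching step, not only at the last one.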

Let’s just look at the results and the remaining names. Maybe you can figure out some possible improvements to the code.

TASKS:

  1. For example, you could think about allowing for partial matches, if the genus is found, but not the species. This could easily be implemented by extracting the first word from the shortName column.

  2. You could also play with the maxDist parameter to increase or decrease the Levenshtein distance.

  3. To improve the speed of the nameMatcher function, you would certainly have to filter potential matches, for example by only including names starting with a certain letter (assuming the first letter is correct), or by only including names with a certain number of characters.

  4. Finally, the code would be more efficient if you would only loop over the rows that have not been matched yet.
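As a starting point for task 4, the following minimal sketch loops only over the unmatched rows and writes the results back to their original positions. All objects are toy stand-ins made up for illustration (in the notebook they would correspond to plants$shortName, plants$matched and LCVPUnique$nameIn), the matching is simplified to a single agrepl() call, and the cluster size of 2 is an arbitrary assumption.

```r
library(doSNOW) # also attaches foreach and snow

# toy data, made up for illustration
shortNames <- c("Abies alba", "Alissum tortuosum", "Anona squamosa", "Zzz zzz")
matched <- c(TRUE, FALSE, FALSE, FALSE)
refNames <- c("Abies alba", "Alyssum tortuosum", "Annona squamosa")

# indices of rows that still need fuzzy matching
todo <- which(!matched)

cl <- makeCluster(2)
registerDoSNOW(cl)
resTodo <- foreach(i = todo, .combine = c) %dopar% {
	hit <- agrepl(paste0("^", shortNames[i], "$"), refNames, max.distance = 2, fixed = FALSE)
	# first approximate hit, or NA if there is none
	if (any(hit)) refNames[hit][1] else NA_character_
}
stopCluster(cl)

# write the results back to the original positions
resAll <- rep(NA_character_, length(shortNames))
resAll[todo] <- resTodo
print(resAll)
```

Rows that were already matched are never sent to a worker and simply stay NA in resAll, so the result can be transferred to the results table in the same way as resTemp.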

table(plants$matched)
plants[matched == FALSE][1:20]
FALSE  TRUE 
  319  4681 
A data.table: 20 × 4
oldName                                newName                                matched  shortName
<chr>                                  <chr>                                  <lgl>    <chr>
                                                                              FALSE
(lauraceae) pubescente                 (lauraceae) pubescente                 FALSE    (lauraceae) pubescente
?Betulaceae sp.                        ?Betulaceae sp.                        FALSE    ?Betulaceae sp.
Abies sp                               Abies sp                               FALSE    Abies sp
Acacia contriva                        Acacia contriva                        FALSE    Acacia contriva
Acacia sp1887                          Acacia sp1887                          FALSE    Acacia sp1887
Acarospora radicata                    Acarospora radicata                    FALSE    Acarospora radicata
Adenanthera sp.                        Adenanthera sp.                        FALSE    Adenanthera sp.
A-Elyhordeum schaackianum              A-Elyhordeum schaackianum              FALSE    A-Elyhordeum schaackianum
Albizia NA                             Albizia NA                             FALSE    Albizia NA
Allium scabrifolium                    Allium scabrifolium                    FALSE    Allium scabrifolium
Aloinopsis gydouwensis                 Aloinopsis gydouwensis                 FALSE    Aloinopsis gydouwensis
ALOPECURUS GENICULATUS,                ALOPECURUS GENICULATUS,                FALSE    ALOPECURUS GENICULATUS,
Alyssum thunbergii Moq. Orthodox p     Alyssum thunbergii Moq. Orthodox p     FALSE    Alyssum thunbergii
Ampelocera indet                       Ampelocera indet                       FALSE    Ampelocera indet
Amphibromus NA                         Amphibromus NA                         FALSE    Amphibromus NA
Ancistrocladus stelliger Wall. ex DC.  Ancistrocladus stelliger Wall. ex DC.  FALSE    Ancistrocladus stelliger
ANEMONE NEMOROSA L.                    ANEMONE NEMOROSA L.                    FALSE    ANEMONE NEMOROSA
Anthocephalus sp.                      Anthocephalus sp.                      FALSE    Anthocephalus sp.
Arctoa hyperborea                      Arctoa hyperborea                      FALSE    Arctoa hyperborea
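Many of the remaining names, such as 'Abies sp' or 'Acacia sp1887', only carry information on the genus. A genus-level fallback, as suggested in task 1, could be sketched as follows. The objects plantsToy and generaToy and the columns genusGuess and genusMatch are hypothetical stand-ins made up for illustration; in the notebook, the result would rather be written into a genus column of the results table, such as resTable$LCVP_genus.

```r
library(data.table)

# toy stand-ins for plants and the list of known genera (made up for illustration)
plantsToy <- data.table(
	shortName = c("Abies sp", "Acacia sp1887", "Albizia NA", "Xyzus albus"),
	matched = FALSE
)
generaToy <- c("Abies", "Acacia", "Albizia")

# first word of the name, assumed to be the genus
plantsToy[, genusGuess := sub("\\s.*", "", shortName)]
# accept the genus if it is known, otherwise leave NA
plantsToy[, genusMatch := ifelse(genusGuess %in% generaToy, genusGuess, NA_character_)]
plantsToy[!is.na(genusMatch), matched := TRUE]
print(plantsToy)
```

A match at genus level is much weaker than a match at species level, so it is worth keeping such partial matches clearly separated from the species-level IDs.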