Taxonomic name parsing#
This notebook is intended to show the functionality of a taxonomic name parser. Name parsing describes the identification of string components as parts of scientific names. In general, these are
generic name/genus,
specific name/epithet,
infraspecies markers (subsp./var./f.),
infraspecific name/epithet,
authors.
Additional elements, such as hybrid signs and a number of other markers (e.g., cv. or agg.), spelling errors, and custom information in the names, e.g., Bellis perennis_plot123, can complicate the process of name parsing. In this notebook, the taxonomic name parser from GBIF will be used. However, it has some limitations, and in the hands-on part, we will attempt to overcome those by pre-processing the data before sending it to the name parser. We will also try to speed up the name parsing process by using the parallel processing functionality of R.
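To make the components listed above concrete, here is a deliberately naive sketch in base R that splits a clean name into its parts with a single regular expression. The helper name `split_name` and the pattern are illustrative assumptions, not part of the GBIF parser, and the pattern only handles well-formed names without hybrid signs or custom additions.

```r
# Naive illustration only: real parsers handle many more cases.
# split_name() is a hypothetical helper, not part of rgbif.
split_name <- function(x) {
  pattern <- paste0(
    "^([A-Z][a-z]+)",                         # genus
    " ([a-z]+)",                              # specific epithet
    "(?: (subsp\\.|var\\.|f\\.) ([a-z]+))?",  # optional marker + infraspecific epithet
    "(?: (.+))?$"                             # optional authorship
  )
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]
  if (length(m) == 0) return(NULL)  # name does not fit the simple pattern
  setNames(as.list(m[-1]),
           c("genus", "epithet", "marker", "infraEpithet", "authors"))
}

simple <- split_name("Bellis perennis L.")
infra  <- split_name("Brassica oleracea var. capitata L.")
custom <- split_name("Bellis perennis_plot123")  # custom suffix breaks the pattern

str(simple)
str(infra)
is.null(custom)
```

The last call shows why custom additions are a problem: a single unexpected suffix is enough to make a strict pattern fail, which is the motivation for the pre-processing done later in this notebook.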
Prerequisites#
To run the code presented here, you will need
the sample names list provided in the workshop,
a functioning R environment, and
the R packages data.table, rgbif, and doSNOW installed.
Code#
The first block of code loads libraries and prepares the workspace. You will need to adapt the working directory.
# load packages
library(data.table) # handle large datasets
library(rgbif) # access GBIF data
library(doSNOW) # parallel computing
# clear workspace
rm(list = ls())
# set working directory (.brd is assumed to hold a user-defined base path; adapt to your system)
setwd(paste0(.brd, "gfoe NFDI taxonomic harmonization workshop"))
# load data
plants <- fread("plant names_2024-04-08.txt", sep = "\t")
animals <- fread("animal names_2024-04-09.txt", sep = "\t")
Loading required package: foreach
Loading required package: iterators
Loading required package: snow
Both the plants and animals variables are tables with one column. The names in these tables differ from each other; most notably, some include authors while others do not. To get the best results when doing name harmonization later on, we will need to separate authors, and also remove problematic characters from the data.
Encoding#
Unfortunately, when getting data from differing sources, we will often find that these data have been encoded in different ways. This means that while the typical English language letters will be stored the same way on any machine, when it comes to accents and some other special characters, it may matter whether data was stored by a computer in the US or Japan, and whether the computer has a Windows, Mac, or Linux operating system.
We will deal with the most common case: data stored in the Windows-specific CP-1252 encoding (sometimes mislabeled as ANSI or latin1) instead of UTF-8.
How your machine treats data from different encodings depends on what encoding is preset in your console. You can check this using the following:
Sys.getlocale()
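As a complement to reading the full locale string, base R can also report directly whether the session uses UTF-8. This is a small sketch using `l10n_info()`, which is part of base R.

```r
# TRUE if the current session runs in a UTF-8 locale
info <- l10n_info()
info[["UTF-8"]]

# the character-type part of the locale, which governs the encoding
Sys.getlocale("LC_CTYPE")
```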
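If your console has no UTF-8 setting (no matter the language), you can change it like this (the locale name shown is Windows-style; on Linux or macOS the equivalent would be, e.g., de_DE.UTF-8):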
Sys.setlocale(category = "LC_ALL", locale = "German_Germany.utf8")
You can use another encoding, too, but it may throw errors later on. So let’s check whether the data comes in UTF-8, and if not, let’s repair it, assuming it is CP-1252 (our best guess, likely correct in 99% of the cases).
# check whether correct encoding is UTF-8
table(validUTF8(plants$oldName))
table(validUTF8(animals$modName))
FALSE TRUE
73 4927
TRUE
5000
# create new columns for variables
plants[, newName := oldName]
animals[, newName := modName]
# correct encoding, assuming current encoding is CP-1252
plants[!validUTF8(newName), newName := iconv(newName, from = "CP1252", to = "UTF-8")]
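To see what this repair does, the following sketch builds a single string from CP-1252 bytes, confirms that it is not valid UTF-8, and converts it. The example is an author name that occurs in the data; constructing it from raw bytes is only done here to keep the example independent of the input file.

```r
# "Simões" as CP-1252 bytes: õ is stored as the single byte 0xF5
x <- rawToChar(as.raw(c(0x53, 0x69, 0x6d, 0xf5, 0x65, 0x73)))
validUTF8(x)   # FALSE: a lone 0xF5 byte is never valid UTF-8

# convert, assuming the bytes are CP-1252
y <- iconv(x, from = "CP1252", to = "UTF-8")
validUTF8(y)   # TRUE
utf8ToInt(y)   # code points; 245 is U+00F5 (õ)
```

Note that `validUTF8` only checks whether the bytes form valid UTF-8. Restricting the conversion to the invalid rows, as done above, avoids re-encoding names that are already correct.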
Name parsing#
Let’s try to parse the names using the GBIF name parser.
resP <- data.table(name_parse(plants$newName))
resA <- data.table(name_parse(animals$newName))
table(resP$parsed)
table(resA$parsed)
FALSE TRUE
11 4989
FALSE TRUE
17 4983
That looks like a pretty good result. For plants and animals, all but 11 and 17 names were parsed, respectively. Let’s look at what did not work for animals.
resA[parsed == FALSE]
scientificname | type | genusorabove | authorship | year | parsed | parsedpartially | canonicalname | canonicalnamecomplete | canonicalnamewithmarker | specificepithet | bracketauthorship | bracketyear | rankmarker | infraspecificepithet | infrageneric | cultivarepithet | sensu |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <chr> | <lgl> | <lgl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
W0_758_Allobates hodli Simões, Lima & Farias | NO_NAME | NA | NA | NA | FALSE | FALSE | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Andinobates spec X Batista, Jaramillo, Ponce, & Crawford, 2014 | HYBRID | NA | NA | NA | FALSE | FALSE | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
XM3_777_Anthracothorax viridis (Audebert & Vieillot, 1801) | NO_NAME | NA | NA | NA | FALSE | FALSE | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
U_291_Balearica pavonina (Linnaeus, 1758) | NO_NAME | NA | NA | NA | FALSE | FALSE | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Q4_650_Colpophyllia Milne Edwards & Haime | NO_NAME | NA | NA | NA | FALSE | FALSE | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
O_824_Crypthelia glebulenta Cairns, 1986 | NO_NAME | NA | NA | NA | FALSE | FALSE | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
DISsOSURA LONGICAUDUS (Gmelin, 1788) | NO_NAME | NA | NA | NA | FALSE | FALSE | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
C9_428_Euphlyctis hexadactylus (Lesson, 1834) | NO_NAME | NA | NA | NA | FALSE | FALSE | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
0S_65_Flabellum siboae Gardiner, 1904 | NO_NAME | NA | NA | NA | FALSE | FALSE | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
165_251_Glaucidium nubicola Robbins & Stiles, 1999 | NO_NAME | NA | NA | NA | FALSE | FALSE | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
HOMOPUq AREOLATUS (Thunberg, 1787) | NO_NAME | NA | NA | NA | FALSE | FALSE | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
LEPIDOPhRA SYMMETRICA Cairns, 1991 | NO_NAME | NA | NA | NA | FALSE | FALSE | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
P_944_Podocnemis erythrocephala (Spix, 1824) | NO_NAME | NA | NA | NA | FALSE | FALSE | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
X_242_RHODOPIS | NO_NAME | NA | NA | NA | FALSE | FALSE | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
0P_663_Saiga tatarica (Linnaeus, 1766) | NO_NAME | NA | NA | NA | FALSE | FALSE | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
98Y_336_Stichopathes semiglabra (van Pesch, 1914) | NO_NAME | NA | NA | NA | FALSE | FALSE | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
6X_832_Uroplatus henkeli | NO_NAME | NA | NA | NA | FALSE | FALSE | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Pre-processing#
The problem with most of these names is the number-character combination before the actual name. These prefixes need to be removed before using the name parser. As they seem to consist of one to three uppercase characters or digits followed by an underscore, repeated twice, we can find them as shown below. Note that it is essential to use regular expressions, which let us define target patterns to search for. Regular expressions are more or less the same across programming languages. Some information specifically on R can be found here.
animals[grepl("^([[:upper:]]|\\d){1,3}_([[:upper:]]|\\d){1,3}", newName), "newName"]
newName |
---|
<chr> |
W0_758_Allobates hodli Simões, Lima & Farias |
XM3_777_Anthracothorax viridis (Audebert & Vieillot, 1801) |
U_291_Balearica pavonina (Linnaeus, 1758) |
WCO_501_Calliphlox mitchellii (Bourcier, 1847) |
Q4_650_Colpophyllia Milne Edwards & Haime |
O_824_Crypthelia glebulenta Cairns, 1986 |
C9_428_Euphlyctis hexadactylus (Lesson, 1834) |
0S_65_Flabellum siboae Gardiner, 1904 |
165_251_Glaucidium nubicola Robbins & Stiles, 1999 |
P_944_Podocnemis erythrocephala (Spix, 1824) |
X_242_RHODOPIS |
0P_663_Saiga tatarica (Linnaeus, 1766) |
98Y_336_Stichopathes semiglabra (van Pesch, 1914) |
6X_832_Uroplatus henkeli |
Removing such a sequence could be done more or less like this.
# create a new variable to not overwrite the original data
animals[, testName := newName]
# remove the name sequences
animals[, testName := sub("^([[:upper:]]|\\d){1,3}_([[:upper:]]|\\d){1,3}", "", testName)]
# check whether it worked
animals[testName != newName, c("newName", "testName")]
newName | testName |
---|---|
<chr> | <chr> |
W0_758_Allobates hodli Simões, Lima & Farias | _Allobates hodli Simões, Lima & Farias |
XM3_777_Anthracothorax viridis (Audebert & Vieillot, 1801) | _Anthracothorax viridis (Audebert & Vieillot, 1801) |
U_291_Balearica pavonina (Linnaeus, 1758) | _Balearica pavonina (Linnaeus, 1758) |
WCO_501_Calliphlox mitchellii (Bourcier, 1847) | _Calliphlox mitchellii (Bourcier, 1847) |
Q4_650_Colpophyllia Milne Edwards & Haime | _Colpophyllia Milne Edwards & Haime |
O_824_Crypthelia glebulenta Cairns, 1986 | _Crypthelia glebulenta Cairns, 1986 |
C9_428_Euphlyctis hexadactylus (Lesson, 1834) | _Euphlyctis hexadactylus (Lesson, 1834) |
0S_65_Flabellum siboae Gardiner, 1904 | _Flabellum siboae Gardiner, 1904 |
165_251_Glaucidium nubicola Robbins & Stiles, 1999 | _Glaucidium nubicola Robbins & Stiles, 1999 |
P_944_Podocnemis erythrocephala (Spix, 1824) | _Podocnemis erythrocephala (Spix, 1824) |
X_242_RHODOPIS | _RHODOPIS |
0P_663_Saiga tatarica (Linnaeus, 1766) | _Saiga tatarica (Linnaeus, 1766) |
98Y_336_Stichopathes semiglabra (van Pesch, 1914) | _Stichopathes semiglabra (van Pesch, 1914) |
6X_832_Uroplatus henkeli | _Uroplatus henkeli |
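To see why an underscore is left behind, it helps to extract exactly what the pattern matches. The sketch below applies base R's `regmatches` to a few of the names shown above, using the same pattern as in the substitution.

```r
x <- c("U_291_Balearica pavonina (Linnaeus, 1758)",
       "X_242_RHODOPIS",
       "6X_832_Uroplatus henkeli")
pattern <- "^([[:upper:]]|\\d){1,3}_([[:upper:]]|\\d){1,3}"

# show the part of each name that the pattern actually covers
matched <- regmatches(x, regexpr(pattern, x))
matched
# "U_291"  "X_242"  "6X_832"
```

The match ends after the second code block, so the second underscore is not part of it and survives the substitution.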
TASKS:
Try to fix the code so that it gives the desired result.
To increase the accuracy of later matching, look for these combinations of uppercase letters and numbers also in the species epithet.
Then, try to fix the problems with the other unparsed names in the animal and plant names.
There may also be some generic terms you may want to remove (e.g. spec., spp., agg., etc.).
Some useful functions can be found below.
# check for a number after the genus name, but before the year
animals[grepl("^\\S+\\s.*\\d.*\\s\\d{4}$", newName), "newName"][1:3]
resA[grepl("^\\S+\\s.*\\d.*\\s\\d{4}$", animals$newName)][1:3]
# check for spec., species, morpho, spp.
animals[grepl("spec\\.|species|morpho|spp\\.", newName), "newName"][1:3]
resA[grepl("spec\\.|species|morpho|spp\\.", animals$newName)][1:3]
# find name parts after an equal sign
plants[grepl("=", newName), "newName"][1:3]
resP[grepl("=", plants$newName)][1:3]
newName |
---|
<chr> |
ACANTHASTREA LORDHOWENSIS_4_889 Veron & Pichon, 1982 |
Accipiter spp.-5 Rothschild & Hartert, 1926 |
Acropora morphospec1 Veron & Wallace, 1984 |
scientificname | type | genusorabove | authorship | year | parsed | parsedpartially | canonicalname | canonicalnamecomplete | canonicalnamewithmarker | specificepithet | bracketauthorship | bracketyear | rankmarker | infraspecificepithet | infrageneric | cultivarepithet | sensu |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <chr> | <lgl> | <lgl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
ACANTHASTREA LORDHOWENSIS_4_889 Veron & Pichon, 1982 | DOUBTFUL | Acanthastrea | Lordhowensis | NA | TRUE | TRUE | Acanthastrea | Acanthastrea Lordhowensis | Acanthastrea | NA | NA | NA | NA | NA | NA | NA | NA |
Accipiter spp.-5 Rothschild & Hartert, 1926 | INFORMAL | Accipiter | NA | NA | TRUE | TRUE | Accipiter spec. | Accipiter spec. | Accipiter spec. | NA | NA | NA | sp. | NA | NA | NA | NA |
Acropora morphospec1 Veron & Wallace, 1984 | SCIENTIFIC | Acropora | NA | NA | TRUE | TRUE | Acropora | Acropora | Acropora | NA | NA | NA | NA | NA | NA | NA | NA |
newName |
---|
<chr> |
Accipiter spp.-5 Rothschild & Hartert, 1926 |
Acropora morphospec1 Veron & Wallace, 1984 |
Brookesia spp. Q Brygoo & Domergue, 1975 |
scientificname | type | genusorabove | authorship | year | parsed | parsedpartially | canonicalname | canonicalnamecomplete | canonicalnamewithmarker | specificepithet | bracketauthorship | bracketyear | rankmarker | infraspecificepithet | infrageneric | cultivarepithet | sensu |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <chr> | <lgl> | <lgl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
Accipiter spp.-5 Rothschild & Hartert, 1926 | INFORMAL | Accipiter | NA | NA | TRUE | TRUE | Accipiter spec. | Accipiter spec. | Accipiter spec. | NA | NA | NA | sp. | NA | NA | NA | NA |
Acropora morphospec1 Veron & Wallace, 1984 | SCIENTIFIC | Acropora | NA | NA | TRUE | TRUE | Acropora | Acropora | Acropora | NA | NA | NA | NA | NA | NA | NA | NA |
Brookesia spp. Q Brygoo & Domergue, 1975 | INFORMAL | Brookesia | NA | NA | TRUE | FALSE | Brookesia spp.Q | Brookesia spp.Q | Brookesia spp.Q | spp.Q | NA | NA | sp. | NA | NA | NA | NA |
newName |
---|
<chr> |
Artemisia vulgaris x verlotiorum = A. x wurzellii C.M. James & Stace |
Lolium perenne x multiflorum = L. x boucheanum Kunth |
Mentha arvensis x aquatica x spicata = M. x smithiana R.A. Graham |
scientificname | type | parsed | parsedpartially | genusorabove | canonicalname | canonicalnamecomplete | canonicalnamewithmarker | rankmarker | specificepithet | authorship | infraspecificepithet | bracketauthorship | notho | sensu | nomstatus | strain |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <lgl> | <lgl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
Artemisia vulgaris x verlotiorum = A. x wurzellii C.M. James & Stace | HYBRID | FALSE | FALSE | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Lolium perenne x multiflorum = L. x boucheanum Kunth | HYBRID | FALSE | FALSE | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Mentha arvensis x aquatica x spicata = M. x smithiana R.A. Graham | HYBRID | FALSE | FALSE | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
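As one possible starting point for the tasks, the sketch below collects three clean-up steps in base R: removing the leading code blocks, removing the same kind of code attached to a name part, and reducing a hybrid formula with an equals sign to the named hybrid. The function name `clean_name` and the decision to keep the right-hand side of the equals sign are assumptions for illustration. The patterns are derived only from the examples shown above and may not cover every case in the data.

```r
# clean_name() is a hypothetical helper; patterns are based on the examples above only
clean_name <- function(x) {
  code <- "[[:upper:][:digit:]]{1,3}"
  # 1) leading codes such as "U_291_": two code blocks, each followed by an underscore
  x <- sub(paste0("^(", code, "_){2}"), "", x)
  # 2) codes attached to a name part, such as "LORDHOWENSIS_4_889"
  x <- sub(paste0("(_", code, "){2}(?=\\s|$)"), "", x, perl = TRUE)
  # 3) hybrid formula followed by "= <named hybrid>": keep the named hybrid and
  #    replace its abbreviated genus with the full genus from the formula
  x <- sub("^(\\S+)\\s.*=\\s*[A-Z]\\.\\s*", "\\1 ", x)
  x
}

cleaned <- clean_name(c(
  "U_291_Balearica pavonina (Linnaeus, 1758)",
  "ACANTHASTREA LORDHOWENSIS_4_889 Veron & Pichon, 1982",
  "Lolium perenne x multiflorum = L. x boucheanum Kunth",
  "Saiga tatarica (Linnaeus, 1766)"
))
cleaned
```

The cleaned names could then be passed to `name_parse` again and the number of parsed names compared with the first run. Whether the parser handles the remaining hybrid marker as intended would still need to be checked.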
You can compare your results to the ones provided below. For plants, they were created using a name parser developed specifically for the names of the TRY database. For animals, the file is an extract of the CITES names list; it contains the unmodified correct names from which the erroneous names used here were derived.
# for testing
plantsFull <- fread("plant names full_2024-04-08.txt", sep = ",")
animalsFull <- fread("animal names full_2024-04-09.txt", sep = ",")
str(plantsFull)
str(animalsFull)
Classes 'data.table' and 'data.frame': 5000 obs. of 18 variables:
$ oldName : chr "" "(lauraceae) pubescente" "?Betulaceae sp." "Abarema curvicarpa" ...
$ newName : chr "" "" "" "" ...
$ familyNameFound : logi FALSE TRUE TRUE FALSE FALSE FALSE ...
$ oldFamilyName : chr "" "lauraceae" "Betulaceae" "" ...
$ newFamilyName : chr "" "Lauraceae" "Betulaceae" "" ...
$ genus : chr "" "Pubescente" "" "Abarema" ...
$ hybrid1 : chr "" "" "" "" ...
$ species1 : chr "" "" "" "curvicarpa" ...
$ subSpeciesFlag : chr "" "" "" "" ...
$ subSpecies : chr "" "" "" "" ...
$ varSpeciesFlag : chr "" "" "" "" ...
$ varSpecies : chr "" "" "" "" ...
$ formaSpeciesFlag: logi NA NA NA NA NA NA ...
$ formaSpecies : logi NA NA NA NA NA NA ...
$ hybrid2 : chr "" "" "" "" ...
$ species2 : chr "" "" "" "" ...
$ author : chr "" "" "" "" ...
$ kingdom : chr "" "" "" "P" ...
- attr(*, ".internal.selfref")=<externalptr>
Classes 'data.table' and 'data.frame': 5000 obs. of 55 variables:
$ TaxonId : int 2581 3734 1703 68243 68179 68198 68076 68212 68150 68149 ...
$ Kingdom : chr "Animalia" "Animalia" "Animalia" "Animalia" ...
$ Phylum : chr "Chordata" "Chordata" "Chordata" "Chordata" ...
$ Class : chr "Aves" "Aves" "Reptilia" "Reptilia" ...
$ Order : chr "Apodiformes" "Apodiformes" "Sauria" "Sauria" ...
$ Family : chr "Trochilidae" "Trochilidae" "Anguidae" "Anguidae" ...
$ Genus : chr "Abeillia" "Abeillia" "Abronia" "Abronia" ...
$ Species : chr "" "abeillei" "" "anzuetoi" ...
$ Subspecies : chr "" "" "" "" ...
$ FullName : chr "Abeillia" "Abeillia abeillei" "Abronia" "Abronia anzuetoi" ...
$ AuthorYear : chr "Bonaparte, 1850" "(Lesson & DeLattre, 1839)" "Gray, 1838" "Campbell & Frost, 1993" ...
$ RankName : chr "GENUS" "SPECIES" "GENUS" "SPECIES" ...
$ CurrentListing : chr "II" "II" "I/II" "I" ...
$ FullAnnotationEnglish : chr "Appendix II:" "Appendix II:" "Appendix II:Except the species included in Appendix I. Zero export quota for wild specimens for <i>Abronia auri"| __truncated__ "Appendix I:" ...
$ AnnotationEnglish : chr "Appendix II:" "Appendix II:" "Appendix II:Except the species included in Appendix I. Zero export quota for wild specimens for <i>Abronia auri"| __truncated__ "Appendix I:" ...
$ AnnotationSpanish : chr "Appendix II:" "Appendix II:" "Appendix II:Excepto las especies incluidas en el Apéndice I. Cupo de exportación nulo para los especímenes silv"| __truncated__ "Appendix I:" ...
$ AnnotationFrench : chr "Appendix II:" "Appendix II:" "Appendix II:Sauf les espèces inscrites à l’Annexe I. Quota d’exportation zéro pour les spécimens sauvages pour "| __truncated__ "Appendix I:" ...
$ #AnnotationSymbol : chr "" "" "" "" ...
$ #Annotation : chr "Appendix II:" "Appendix II:" "Appendix II:" "Appendix I:" ...
$ SynonymsWithAuthors : chr "" "Ornismya abeillei Lesson & DeLattre, 1839" "" "Abronia anzuetoi Köhler, 2000" ...
$ EnglishNames : chr "" "Emerald-chinned Hummingbird" "" "Anzuetoi arboreal alligator lizard" ...
$ SpanishNames : chr "" "Colibrí barbiesmeralda" "" "" ...
$ FrenchNames : chr "" "Colibri d'Abeillé" "" "" ...
$ CitesAccepted : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
$ All_DistributionISOCodes : chr "" "SV, GT, HN, MX, NI" "" "GT" ...
$ All_DistributionFullNames : chr "" "El Salvador, Guatemala, Honduras, Mexico, Nicaragua" "" "Guatemala" ...
$ NativeDistributionFullNames: chr "" "El Salvador, Guatemala, Honduras, Mexico, Nicaragua" "" "Guatemala" ...
$ Introduced_Distribution : chr "" "" "" "" ...
$ Introduced(?)_Distribution : chr "" "" "" "" ...
$ Reintroduced_Distribution : chr "" "" "" "" ...
$ Extinct_Distribution : chr "" "" "" "" ...
$ Extinct(?)_Distribution : chr "" "" "" "" ...
$ Distribution_Uncertain : chr "" "" "" "" ...
$ modOrder : chr "Apodiformes" "Apodiformes" "Sauria" "Sauria" ...
$ modFamily : chr "Trochilidae" "Trochilidae" "Anguidae" "Anguidae" ...
$ modGenus : chr "Abeillia" "Abeillia" "Abronia" "Abronia" ...
$ modSpecies : chr "" "abeillei" "" "anzuetoi" ...
$ modSubspecies : chr "" "" "" "" ...
$ modAuthorYear : chr "Bonaparte, 1850" "(Lesson & DeLattre, 1839)" "Gray, 1838" "Campbell & Frost, 1993" ...
$ uppercase : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ lowercase : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ changedOne : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ omittedOne : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ shuffle : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ noAuthors : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ abbrAuthors : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ noYear : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ abbrGenus : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ addPlot : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ addFamily : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ morphoSpec : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ cutGenusEpi : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ modName : chr "Abeillia Bonaparte, 1850" "Abeillia abeillei (Lesson & DeLattre, 1839)" "Abronia Gray, 1838" "Abronia anzuetoi Campbell & Frost, 1993" ...
$ cutName : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Name : chr "Abeillia Bonaparte, 1850" "Abeillia abeillei (Lesson & DeLattre, 1839)" "Abronia Gray, 1838" "Abronia anzuetoi Campbell & Frost, 1993" ...
- attr(*, ".internal.selfref")=<externalptr>
Parallel processing#
With our name lists having 5000 names each, the parsing takes just a few seconds. For larger lists, it may be a good idea to speed up the process by parallelizing it. As the name_parse function already accepts several names at once, we can split the names list into as many chunks as there are cores available for parallel processing.
Let’s check how many cores are available on the system.
parallel::detectCores()
It is unlikely that you have so many cores available, but from former trials with the GBIF API I can tell you that it is wise to limit the number of cores to 24 at most. So let’s re-run the name parsing to compare the times needed.
timeStart <- Sys.time()
resP <- data.table(name_parse(plants$newName))
Sys.time() - timeStart
timeStart <- Sys.time()
resA <- data.table(name_parse(animals$newName))
Sys.time() - timeStart
Time difference of 1.98634 secs
Time difference of 3.512737 secs
Now let’s split the lists into chunks and let each worker run independently.
nLists <- min(24, parallel::detectCores() - 1)
(nNames <- nrow(plants) %/% nLists)
(nNamesLast <- nNames + nrow(plants) %% nLists)
We chose to use one worker fewer than available, to allow the computer to handle other tasks while the script is running, and a maximum of 24. On my computer, this means that each chunk has 333 names to process, and the last chunk has 338. Let’s create the parallel environment now and compare the times. We only do the plant case for simplicity.
# create the cluster for parallel processing
cl <- makeCluster(nLists)
registerDoSNOW(cl)
# run the name parsing in parallel
# the option "fill = TRUE" lets rbind combine chunks whose results have different columns without an error
timeStart <- Sys.time()
resP_parallel <- foreach(
i = seq_len(nLists), .combine = function(...) rbind(..., fill = TRUE),
.packages = c("data.table", "rgbif")
) %dopar% {
if (i < nLists) {
res <- data.table(name_parse(plants$newName[seq_len(nNames) + (i - 1) * nNames]))
} else {
res <- data.table(name_parse(plants$newName[seq_len(nNamesLast) + (i - 1) * nNames]))
}
res
}
Sys.time() - timeStart
# stop the cluster
stopCluster(cl)
Time difference of 5.201381 secs
Timewise, for this little dataset, the overhead created by setting up the parallel environment was larger than the speed gain through parallel processing. Let’s check whether the results are the same.
all(resP == resP_parallel, na.rm = TRUE)
However, the results are as expected, and with larger lists, this approach could save us some time.
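The chunk bookkeeping used in the parallel loop can be checked in isolation, without contacting the GBIF API. The sketch below reproduces the index arithmetic for 5000 names and 15 workers (the worker count is an assumption inferred from the chunk sizes of 333 and 338 mentioned above) and confirms that every row is covered exactly once. It also shows `parallel::splitIndices`, a helper from the base R `parallel` package that produces near-equal chunks.

```r
n <- 5000                             # number of names
nLists <- 15                          # assumed number of workers
nNames <- n %/% nLists                # 333
nNamesLast <- nNames + n %% nLists    # 338

# row indices per chunk, computed as in the foreach loop
chunks <- lapply(seq_len(nLists), function(i) {
  size <- if (i < nLists) nNames else nNamesLast
  seq_len(size) + (i - 1) * nNames
})

# every row index should appear exactly once and in order
idx <- unlist(chunks)
covered <- length(idx) == n && all(idx == seq_len(n))
covered

# alternative: let the parallel package compute near-equal chunks
chunks2 <- parallel::splitIndices(n, nLists)
range(lengths(chunks2))
```

With `splitIndices`, the if/else for the last chunk is no longer needed, since each worker can simply take `chunks2[[i]]` as its row indices.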