RSelenium - Automating the control of web browsers#

This tutorial demonstrates the syntax of the RSelenium library in R. It is not intended to be complete or comprehensive, but rather to supplement the documentation and cheatsheets that are available elsewhere.

Using RSelenium is helpful in cases where no API is provided to access data on web pages, or where manually entering many parameters would be very tedious.

This tutorial focuses on using RSelenium together with Mozilla Firefox. Some of the code and methods described here will have to be adapted for other browsers.

If you have questions or suggestions, spot errors, or want to contribute, get in touch with us through planthub@idiv.de.

Author: David Schellenberger Costa

Requirements#

To run the script, the following is needed:

  • some R libraries that may need to be installed

  • the Mozilla Firefox web browser

Code#

# load in libraries
library(data.table) # handle large datasets
library(RSelenium) # automated control of the internet browser

# clear workspace
rm(list = ls())

# set working directory (adapt this! ".brd" is a user-defined base path and must exist beforehand)
setwd(paste0(.brd, "snippets"))

RSelenium works with most modern browsers, including at least Microsoft Edge, Mozilla Firefox, and Google Chrome. Throughout this tutorial, we will use Mozilla Firefox, as it is backed by a non-profit organization.

Create a Firefox instance#

We will now create a new Firefox instance. It will immediately open a new visible browser window. The option “chromever = NULL” ensures we do not depend on the presence of the Chrome browser, so be sure to use it if you do not want to use Chrome. Note that this command uses a specific port on your computer; the default is 4567. To open another instance, another port needs to be specified. Possible ports are the ones following 4567, i.e. 4568, 4569, etc. If you accidentally close the browser window and want to free the port again, run the first command below before opening a new instance:

# free the port by terminating all running Java processes (Windows only)
try(system("taskkill /im java.exe /f", intern = FALSE, ignore.stdout = FALSE), silent = TRUE)
# open a browser window using the standard port 4567
rD <- rsDriver(browser = "firefox", verbose = FALSE, chromever = NULL)
remDr <- rD[["client"]]
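Terminating every Java process is a rather blunt tool. As a gentler alternative, the driver object itself can be shut down. The following is a minimal sketch, assuming the browser was started with rsDriver as above; the helper name close_driver is hypothetical and not part of RSelenium, while close and stop are methods of the client and server objects returned by rsDriver.

```r
# Hypothetical helper (not part of RSelenium): shut down a driver object
# returned by rsDriver() without touching unrelated Java processes.
close_driver <- function(rD) {
	# close the browser window; ignore the error if it is already gone
	try(rD[["client"]]$close(), silent = TRUE)
	# stop the Selenium server so that its port can be reused
	try(rD[["server"]]$stop(), silent = TRUE)
	invisible(NULL)
}

# Usage (requires a running session):
# close_driver(rD)
# a second instance on the next port would then look like this:
# rD2 <- rsDriver(browser = "firefox", port = 4568L, verbose = FALSE, chromever = NULL)
```

Stopping the server this way should normally release the port; if it does not, the taskkill command shown above remains the fallback.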

Interact with website elements#

Having navigated to a website, we can now look for elements we can interact with. Common cases for the use of RSelenium are inserting text into input fields, and clicking on buttons. Let’s give it a first try. You should be on the LifeGate website, which features a large search input field on its upper left. Let’s try to find it.
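If the browser window does not show LifeGate yet, navigation might be sketched as follows. The helper name open_page is hypothetical, and the address in the usage comment is an assumption that is not taken from this tutorial; navigate and getCurrentUrl are methods of the remDr object.

```r
# Hypothetical helper: open a page and report the address the browser ended up on.
open_page <- function(remDr, url) {
	remDr$navigate(url)          # load the page in the controlled browser
	remDr$getCurrentUrl()[[1]]   # the address is returned as a list, hence [[1]]
}

# Usage (the address is an assumption, adapt it if needed):
# open_page(remDr, "https://lifegate.idiv.de")
```

Comparing the returned address with the requested one is a quick way to notice redirects.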

In RSelenium, there are different ways of finding and accessing elements. The two functions that can be used are findElement, which returns the first matching element, and findElements, which returns a list of all matching elements. They can search for a number of properties, most notably the “tag” property, i.e. whether the element is a div, span, or an input field, the “class” property, the “name” property, the “id” property, and so on.

So how do we know which properties our element has? In Firefox, we can right-click on the element and select “Inspect Element” from the menu. In this particular case, we will not directly get the element but need to navigate a tree structure that highlights the respective elements on mouse hover. We end up with the input element that looks like this:

<input class="prompt" type="text" autocomplete="off" placeholder="Nach Taxon suchen…">

As the input element is of class “prompt”, but has no name or id, we have two options now: searching for it using the “tag” property or using the “class” property. We will first use the “tag” property. Given there is likely more than one input element, we should use the findElements function.

elem1 <- remDr$findElements(using = "tag", value = "input")
length(elem1)

We can see there are twelve input elements on the page. How can we select the right one? We can look at its outer HTML to see whether this matches what we saw in the browser.

for (i in seq_along(elem1)) {
	print(elem1[[i]]$getElementAttribute("outerHTML")[[1]])
}

Clearly, the element we are interested in is the first one, so to select it, we do:

elem1 <- elem1[[1]]
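Reading through the printed HTML works for a dozen elements, but the position can also be determined programmatically. Below is a minimal sketch; the helper name which_by_attribute is hypothetical, and the handling of absent attributes as an empty result is an assumption about what getElementAttribute returns in that case.

```r
# Hypothetical helper: return the positions of all elements whose attribute
# `attr` matches the regular expression `pattern`.
which_by_attribute <- function(elems, attr, pattern) {
	values <- vapply(elems, function(e) {
		res <- e$getElementAttribute(attr)
		# assumption: an absent attribute comes back empty; map it to NA
		if (length(res) == 0 || is.null(res[[1]])) NA_character_ else as.character(res[[1]])
	}, character(1))
	which(!is.na(values) & grepl(pattern, values))
}

# Usage (requires a running session):
# inputs <- remDr$findElements(using = "tag", value = "input")
# which_by_attribute(inputs, "class", "prompt")
```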

As a second option, we can use the “class” property. We will again use the findElements function.

elem2 <- remDr$findElements(using = "class", value = "prompt")
length(elem2)
print(elem2[[1]]$getElementAttribute("outerHTML")[[1]])
elem2 <- elem2[[1]]

There is only one element of the class “prompt” on the page, so we can select it directly. In general, the selection of elements is most efficient when using id, name, class, and tag, in this order. We can now enter some text into the input field. This is done with the sendKeysToElement function.

elem2$sendKeysToElement(list("Bellis perennis"))

We can also press special keys like the arrow keys or enter.

elem2$sendKeysToElement(list(key = "enter"))

If you accidentally run the input function twice, you can clear the input like this:

elem2$clearElement()

You can also combine the two calls of the sendKeysToElement function.

elem2$sendKeysToElement(list("Canis lupus", key = "enter"))

It seems we need to hit “enter” twice after this first search.

elem2$clearElement()
elem2$sendKeysToElement(list("Bellis perennis", key = "enter", key = "enter"))
elem2$clearElement()

I am not sure why this is happening. Sometimes, executing several commands in quick succession will not work. In some cases, it is necessary to wait in a loop until a page has loaded. This may also be the reason here.
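If timing is indeed the cause, sending the keys one by one with a short pause in between might help. This is only a sketch under that assumption: the helper name send_keys_slowly is hypothetical, and the pause length is a guess that has to be tuned for the website at hand.

```r
# Hypothetical helper: send each entry of `keys` in its own call and pause
# in between, so the page has time to react to every input.
send_keys_slowly <- function(elem, keys, pause = 0.5) {
	for (i in seq_along(keys)) {
		# keys[i] is a one-element list that keeps its name, e.g. key = "enter"
		elem$sendKeysToElement(keys[i])
		Sys.sleep(pause)
	}
	invisible(elem)
}

# Usage with the search field from above (requires a running session):
# elem2$clearElement()
# send_keys_slowly(elem2, list("Bellis perennis", key = "enter"), pause = 1)
```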

Let’s try some other keys. The most important ones are the arrow keys, which are used to move through drop-down menus.

elem2$clearElement()
elem2$sendKeysToElement(list("Elephant"))

We got several results. We can select the second one with the “down_arrow” key.

elem2$sendKeysToElement(list(key = "down_arrow"))
elem2$sendKeysToElement(list(key = "enter"))

It is also possible to access the data from within the drop-down. Clicking through it using the Firefox Inspector may be a bit difficult, but it turns out that the names seen in the drop-down are within div elements of the class “title”.

elem3 <- remDr$findElements(using = "class", value = "title")
length(elem3)
for (i in seq_along(elem3)) {
	print(elem3[[i]]$getElementAttribute("innerHTML")[[1]])
}

As we can see, there are quite a few options. Note that the “innerHTML” property returns only the HTML code within the elements, while “outerHTML” returns the elements themselves together with the enclosed HTML code.
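When only the displayed names are needed, without any enclosed markup, the visible text of each element can be collected into a character vector. The sketch below uses the getElementText method of the web elements; the helper name element_texts is hypothetical.

```r
# Hypothetical helper: return the visible text of each element as a
# character vector.
element_texts <- function(elems) {
	vapply(elems, function(e) as.character(e$getElementText()[[1]]), character(1))
}

# Usage (requires a running session with the drop-down open):
# titles <- element_texts(remDr$findElements(using = "class", value = "title"))
# titles[nzchar(titles)]   # drop entries without visible text
```

A character vector is easier to filter or to store in a table than output printed in a loop.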

Let’s now change the background images to distribution data by clicking the “Karten” symbol on the lower left of the LifeGate window.

elem2$clearElement()
elem4 <- remDr$findElement(using = "id", value = "btn_prsp_range")
elem4$clickElement()

# And back..

elem4 <- remDr$findElement(using = "id", value = "btn_prsp_scheme")
elem4$clickElement()
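Note that findElement stops with an error if no matching element exists, which interrupts a longer script. A more forgiving variant can be sketched with findElements, which returns an empty list in that case. The helper name click_if_present is hypothetical.

```r
# Hypothetical helper: click the first element matching the search and
# report whether anything was clicked.
click_if_present <- function(remDr, using, value) {
	hits <- remDr$findElements(using = using, value = value)
	if (length(hits) == 0) {
		return(FALSE)   # nothing found, nothing clicked
	}
	hits[[1]]$clickElement()
	TRUE
}

# Usage (requires a running session):
# click_if_present(remDr, "id", "btn_prsp_range")
```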

We have now learned how to find elements, enter input and extract data from drop-down menus. The rest is playing with the data itself and adapting it to your needs. Still, I would like to point out some things that may be helpful, although the following code snippets are out of context here.

(1) Imagine you need to move the mouse to a specific location on a web element. This can be done with the mouseMoveToLocation function.

remDr$mouseMoveToLocation(x = <XVALUE>, y = <YVALUE>, webElement = <WEBELEMENT>)

Note that the location defined by <XVALUE> and <YVALUE> is relative to <WEBELEMENT>, which, of course, can also be the whole page, but will often be more helpful if more specific.
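Moving the mouse is usually followed by a click at the new position. A minimal sketch of that combination is shown below, assuming the click method of the remDr object, which clicks at the coordinates set by the last mouse move. The helper name click_at is hypothetical, and whether mouse movement is supported may depend on the browser driver in use.

```r
# Hypothetical helper: move the mouse to an offset relative to a web element
# and press the left mouse button there.
click_at <- function(remDr, elem, x, y) {
	remDr$mouseMoveToLocation(x = x, y = y, webElement = elem)
	remDr$click(buttonId = 0)   # 0 stands for the left mouse button
	invisible(NULL)
}

# Usage (requires a running session; element and offsets are placeholders):
# map <- remDr$findElement(using = "tag", value = "body")
# click_at(remDr, map, x = 100, y = 50)
```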

(2) Another situation that might occur is that you need to wait for a page to load. If a symbol such as spinning arrows appears to show that loading is in progress, it makes sense to look for it and wait until it disappears.

repeat {
	waitSymbol <- remDr$findElements(using = "class", value = "animate-spin")[[1]]
	if (!grepl("aria-hidden=.true.", waitSymbol$getElementAttribute("outerHTML")[[1]])) {
		Sys.sleep(1)
	} else {
		break
	}
}

In the above case, an element of the class “animate-spin” is used to show that the page is loading. This element is always present, but has the “aria-hidden” attribute set to “true” while it is hidden, i.e. when no loading takes place. So we check for this attribute and, as long as it is not set to “true”, wait for a second and check again.
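A loop like the one above never ends if the page gets stuck. A bounded variant can be sketched with a generic polling function; the helper name wait_until and its default values are hypothetical and should be adapted to the website.

```r
# Hypothetical helper: call `is_ready()` repeatedly until it returns TRUE or
# `timeout` seconds have passed. Returns TRUE on success, FALSE on timeout.
wait_until <- function(is_ready, timeout = 30, interval = 1) {
	deadline <- Sys.time() + timeout
	repeat {
		if (isTRUE(is_ready())) return(TRUE)
		if (Sys.time() >= deadline) return(FALSE)
		Sys.sleep(interval)
	}
}

# Usage with the loading symbol from above (requires a running session):
# spinner_hidden <- function() {
# 	waitSymbol <- remDr$findElements(using = "class", value = "animate-spin")[[1]]
# 	grepl("aria-hidden=.true.", waitSymbol$getElementAttribute("outerHTML")[[1]])
# }
# if (!wait_until(spinner_hidden, timeout = 60)) stop("page did not finish loading")
```

Returning FALSE instead of looping forever lets the calling script decide whether to retry, skip the page, or stop.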