data.table - An extremely fast R library for basic dataset handling#
This tutorial demonstrates the syntax of the data.table library in R. It is not intended to be complete or comprehensive, but to complement the documentation and cheat sheets available elsewhere.
This package is used in PlantHub for two reasons: (1) fast loading of large (compressed) data files, and (2) fast matching of species names and other data. The speed gain of these processes compared to standard R functions is so large that some tasks that cannot reasonably be done with standard R take just a couple of seconds. However, to get the most out of data.table, some special syntax is needed, and some cases not covered in depth elsewhere are covered here.
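As a minimal sketch of both use cases (the file name and species names are hypothetical, and fread() needs the R.utils package to read compressed files directly):
# sketch: fast loading and fast keyed matching (hypothetical file and names)
# dt <- fread("occurrences.csv.gz") # reads compressed files directly
# setkey(dt, species) # sort and index by species name
# dt[c("Acer campestre", "Quercus robur")] # fast keyed lookup of species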
If you have questions, suggestions, spot errors, or want to contribute, get in touch with us through planthub@idiv.de.
Author: David Schellenberger Costa
Requirements#
To run the script, the following is needed:
some R libraries that may need to be installed
Code#
# load in libraries
library(data.table)
library(doSNOW)
# clear workspace
rm(list = ls())
# set working directory (adapt this!)
setwd(paste0(.brd, "snippets"))
Checking and changing the encoding of data, and applying functions to some or all columns of a data.table#
These functions are multi-purpose. The encoding is sometimes relevant when data comes from different sources and consists of character strings. Common examples are species and author names.
Let’s first create a sample data.table. We just define the columns and their contents. We include non-ASCII characters, such as “é”, and some characters specified by their Unicode escape sequences.
dt1 <- data.table(
x = c("é", "j", "k"),
y = c("jjhj", "kjhj", "kjlk"),
z = c(1, 2, 3),
q = c("\ud7", "\u75", "\u7d"),
r = c(4, 5, 6)
)
str(dt1)
Let’s check the encoding of our data. We make a distinction between numeric and non-numeric data, as trying to get or set the encoding of numeric data throws an error.
# check encodings
sapply(seq_len(ncol(dt1)), function(x) {
if (!is.numeric(dt1[[x]])) {
c(colnames(dt1)[x], table(Encoding(dt1[[x]])))
} else {
paste0(colnames(dt1)[x], " is numeric.")
}
})
We may need to change the encoding of the data. This is especially relevant when searching for special characters with functions that use regular expressions, such as gsub() or grepl(), as they behave differently depending on the encoding of the data. To change the encoding, we again distinguish between numeric and non-numeric data, as numeric data would throw an error when used with the Encoding() function. We use the lapply() function together with the data.table assignment operator := and the .SD variable. .SD stands for “Subset of Data”; as we do not define a subset, data.table inserts each column into .SD. The return values are to be stored in the respective columns; however, instead of referring to .SD on the left-hand side of the expression, we need to write colnames(dt1). Note that R will only change the encoding to “UTF-8” or “latin1” if this is actually necessary, i.e. if non-ASCII characters are found in a string.
# change the encoding of all columns (except numeric columns)
dt1[, colnames(dt1) := lapply(.SD, function(x) {
if (!is.numeric(x)) {
# Encoding(x) <- "latin1"
# Encoding(x) <- "UTF-8"
Encoding(x) <- "unknown"
x
} else {
x
}
})]
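Note that assigning with Encoding() only changes the declared encoding (the label), not the underlying bytes. To actually re-encode the bytes, base R’s iconv() can be used; a minimal sketch:
# re-encode the bytes themselves (not just the label) from latin1 to UTF-8
s <- "\xe9" # "é" as a single latin1 byte
Encoding(s) <- "latin1"
s <- iconv(s, from = "latin1", to = "UTF-8")
charToRaw(s) # now two bytes: c3 a9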
# check encodings
sapply(seq_len(ncol(dt1)), function(x) {
if (!is.numeric(dt1[[x]])) {
c(colnames(dt1)[x], table(Encoding(dt1[[x]])))
} else {
paste0(colnames(dt1)[x], " is numeric.")
}
})
It is helpful to understand how the data is actually stored in memory. We can have a look at the raw bytes of each string. For example, “é” is stored as the two bytes c3 a9 in UTF-8, but as the single byte e9 in latin1.
sapply(seq_len(ncol(dt1)), function(x) c(colnames(dt1)[x], sapply(as.character(dt1[[x]]), charToRaw)))
Now let’s apply a function to all columns of the data.table. Imagine we want them all to be of type character. Having done this, let’s re-convert a specific column.
# convert all columns to a specific data type, omit .SDcols
dt1[, colnames(dt1) := lapply(.SD, function(x) as.character(x))]
str(dt1)
# convert a specific column to a specific data type
dt1[, z := as.numeric(z)]
str(dt1)
What if we want to apply our function to only a subset of the columns? We have three options: we can supply a vector of column names, the column numbers, or a logical (boolean) vector. We define the columns to be used with .SD by setting the .SDcols variable.
# use a variable containing column names to convert some columns to a specific data type
numCols <- c("z", "r")
dt1[, (numCols) := lapply(.SD, function(x) as.numeric(x)), .SDcols = (numCols)]
str(dt1)
# use a variable containing column numbers to convert some columns to a specific data type
numCols <- c(3, 5)
dt1[, (numCols) := lapply(.SD, function(x) as.character(x)), .SDcols = (numCols)]
str(dt1)
# use a boolean vector to convert some columns to a specific data type
numCols <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
dt1[, colnames(dt1)[numCols] := lapply(.SD, function(x) as.numeric(x)), .SDcols = colnames(dt1)[numCols]]
str(dt1)
What if we want to create a new variable from an old one in the data.table, but the old one is determined by some kind of logic, so we do not want to hard-code the variable name? We can pass it through a variable like this:
nameCol <- "y"
dt1[, s := .SD, .SDcols = (nameCol)]
str(dt1)
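An alternative is get(), which looks up the column whose name is stored in the variable (s2 is just an illustrative name and is removed again right away):
dt1[, s2 := get(nameCol)]
str(dt1)
dt1[, s2 := NULL] # remove the helper column again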
We may want to do the opposite: create a new variable, but name it on the fly. This can be done by first using a “dummy name” for the variable and renaming it afterwards.
newName <- "t"
dt1[, newVar := s]
colnames(dt1)[colnames(dt1) == "newVar"] <- newName
str(dt1)
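Alternatively, data.table’s setnames() renames columns by reference and avoids the base R detour; an equivalent of the two steps above (shown commented out, as the column was already renamed):
# dt1[, newVar := s]
# setnames(dt1, "newVar", newName)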
Copies by reference (or not)#
data.table is fast and efficient partly because it creates copies by reference. This may lead to some unexpected behavior, so let’s try it out. We will create a new data.table by reference, change it, and observe the changes in the original table. This may at first seem like a bug, but it allows several objects to be processed in a loop.
dt1
dt1[, s := c(9, 4, 1)]
dt2 <- dt1
setkey(dt2, s)
dt2
dt1
As we have seen, the changes in dt2 also affected dt1. We can avoid this behavior by using copy().
setkey(dt1, r)
dt1
dt2 <- copy(dt1)
setkey(dt2, s)
dt2
dt1
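To check whether two names point to the same object, data.table’s address() can be used; a minimal sketch:
dt2 <- dt1
address(dt1) == address(dt2) # TRUE: same object, changes propagate
dt2 <- copy(dt1)
address(dt1) == address(dt2) # FALSE: independent copy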
However, there is a caveat: the modify-by-reference functionality of data.table objects is lost when they are loaded into the workspace with load().
dt1 <- data.table(matrix(1:30, 3, 10))
dt2 <- dt1
dt2[, testCol1 := TRUE]
str(dt1)
dt1[, testCol1 := NULL]
save(dt1, file = "test.RData")
load("test.RData")
dt2 <- dt1
dt2[, testCol1 := TRUE]
str(dt1)
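The reason is that the spare, over-allocated column pointer slots are not saved to disk; truelength() reveals this (a sketch, re-loading the file to get a fresh object):
load("test.RData")
truelength(dt1) # 0: no spare column slots survive save()/load()
length(dt1) # number of columns actually in use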
This behavior may not be what we want. It can be repaired in some instances by using alloc.col().
dt1 <- data.table(matrix(1:30, 3, 10))
dt2 <- dt1
dt2[, testCol1 := TRUE]
str(dt1)
dt1[, testCol1 := NULL]
save(dt1, file = "test.RData")
load("test.RData")
alloc.col(dt1)
dt2 <- dt1
dt2[, testCol1 := TRUE]
str(dt1)
However, this will not always work, for example when we use a loop and want to process variables via a vector of variable names, as shown below.
dt1 <- data.table(matrix(1:30, 3, 10))
dt2 <- dt1
dt2[, testCol1 := TRUE]
str(dt1)
dt1[, testCol1 := NULL]
save(dt1, file = "test.RData")
load("test.RData")
alloc.col(eval(as.symbol("dt1")))
dt2 <- dt1
dt2[, testCol1 := TRUE]
str(dt1)
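A possible workaround in such cases (an assumption based on alloc.col() returning the reallocated table, as documented) is to reassign its return value explicitly:
# hypothetical workaround: reassign the return value of alloc.col()
load("test.RData") # start again from a fresh on-disk copy
for (nm in c("dt1")) assign(nm, alloc.col(get(nm)))
dt2 <- dt1
dt2[, testCol1 := TRUE]
str(dt1) # testCol1 now appears in dt1 as well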
Using data.table in parallel computing#
data.table is very fast, and we can increase speed even further with parallel computing. However, sometimes we want two code paths: to run a function in parallel when a lot of data is processed, and to run it sequentially otherwise. The reason is that running in parallel always takes some additional time at the start, as several copies of the data are created (as many as there are cores used). In general, if the return values are rows of a data.table, the result is a data.table, too.
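Before walking through both variants separately, here is a minimal sketch of such a switch (runSampling and workThreshold are hypothetical names; the cut-off needs tuning for the task and hardware at hand):
# sketch: choose parallel or sequential execution depending on the workload
runSampling <- function(dt, its, workThreshold = 5000) {
  if (its > workThreshold) {
    cl <- parallel::makeCluster(parallel::detectCores())
    registerDoSNOW(cl)
    on.exit(parallel::stopCluster(cl))
    foreach(i = seq_len(its), .combine = rbind, .packages = "data.table") %dopar% {
      dt[sample(seq_len(nrow(dt)), 1)]
    }
  } else {
    foreach(i = seq_len(its), .combine = rbind, .packages = "data.table") %do% {
      dt[sample(seq_len(nrow(dt)), 1)]
    }
  }
}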
# create a data.table
dt3 <- data.table(numVar = 1:1000, charVar = sapply(1:1000, function(x) paste(sample(letters, 10), collapse = "")))
dt3
# set the number of iterations for our test function
its <- 15000
# prepare a parallel computing cluster
t1 <- Sys.time() # measure the time used
clust.num.all <- parallel::detectCores() # number of available cores
cl <- parallel::makeCluster(clust.num.all)
registerDoSNOW(cl)
# run a parallel loop
dt4 <- foreach(i = seq_len(its), .combine = rbind, .packages = c("data.table")) %dopar% {
dt3[sample(seq_len(nrow(dt3)), 1)]
}
stopCluster(cl)
Sys.time() - t1
dt4 # output is a data.table
If we want to use a data.table in sequential computing with sapply(), we end up with either a matrix, in which all data has the same type, or a list. This is because sapply() can only simplify results if the individual return values are vectors of the same type and length. A workaround to get a data.table is to flatten the return values with unlist() and re-convert them to a data.table later.
# use sapply without changing variable types before return
t1 <- Sys.time()
res <- sapply(seq_len(its), function(x) dt3[sample(seq_len(nrow(dt3)), 1)])
Sys.time() - t1
str(res[, 1:5])
dt4 <- data.table(t(res))
# output is a list, conversion to data.table not straightforward,
# as individual entries are lists themselves
str(dt4[1:5])
# use sapply changing variable types before return
t1 <- Sys.time()
res <- sapply(seq_len(its), function(x) unlist(dt3[sample(seq_len(nrow(dt3)), 1)]))
Sys.time() - t1
str(res[, 1:5])
dt4 <- data.table(t(res))
str(dt4)
dt4[, numVar := as.numeric(numVar)]
# output is a list, conversion to data.table straightforward, but a data type
# change is necessary after creation of the data.table
str(dt4)
A cleaner solution is to use the foreach loop sequentially, with the %do% operator. However, compared to sapply(), this may take more time.
t1 <- Sys.time()
dt4 <- foreach(i = seq_len(its), .combine = rbind, .packages = c("data.table")) %do% {
dt3[sample(seq_len(nrow(dt3)), 1)]
}
Sys.time() - t1
dt4 # output is a data.table
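As a side note, collecting the results in a list and binding them in one step with data.table’s rbindlist() is often faster than combining them pairwise with .combine = rbind:
# alternative: collect one-row results in a list, then bind them at once
t1 <- Sys.time()
dt4 <- rbindlist(lapply(seq_len(its), function(x) dt3[sample(seq_len(nrow(dt3)), 1)]))
Sys.time() - t1
dt4 # output is a data.table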