Hands on data.table

The data.table package introduces “enhanced” data frames, concise syntax, and fast data processing, so let’s learn about it!

I won’t give any more introductions except that this tutorial is mostly based on the package’s introductory vignette, but I will include some of my own observations and opinions as well. You could find the vignette here or you could load it (after installing the package) with this command:

vignette("datatable-intro")

Let’s get started!

Creating data.tables

Before working on data.table objects, let’s explore the methods we can use to create one.

Reading from a file/URL

The function fread() allows you to read from local files (e.g. CSV) or external URLs.

flights <- fread("flights14.csv")

Building it manually

Creating a data.table here is almost identical to creating base R’s data frames, except that the function is named data.table() instead of data.frame().

my_dt <- data.table(
  id = c("b", "b", "b", "a", "a", "c"),
  a  = 1:6,
  b  = 7:12,
  c  = 13:18
)

my_dt
#    id a  b  c
# 1:  b 1  7 13
# 2:  b 2  8 14
# 3:  b 3  9 15
# 4:  a 4 10 16
# 5:  a 5 11 17
# 6:  c 6 12 18

Converting an existing data frame

The function data.table() mentioned earlier can be given a data frame to be converted to a data.table object. It’s that simple!

However, There is another function that is a little bit tricky because R usually doesn’t work that way. That function is setDT(). The tricky part is that it modifies (or mutates) the existing object. It doesn’t return a new object for you to assign somewhere. It doesn’t return anything! Because of that, you cannot use setDT() on an built-in data frame (e.g. iris, mtcars) unless you assign it into a variable in your environment.

iris_df <- iris
setDT(iris_df) # same as `iris_df <- data.table(iris_df)`

Subsetting data.tables

First of all, since a data.table are “enhanced” version of the conventional data frame, basic functions that work on data frames still work on data.tables. Also, string columns don’t get automatically converted to factors (which was the default behavior in base R pre-version 4.0).

class(my_dt)
# [1] "data.table" "data.frame"

summary(my_dt)
#       id                  a              b               c
#  Length:6           Min.   :1.00   Min.   : 7.00   Min.   :13.00
#  Class :character   1st Qu.:2.25   1st Qu.: 8.25   1st Qu.:14.25
#  Mode  :character   Median :3.50   Median : 9.50   Median :15.50
#                     Mean   :3.50   Mean   : 9.50   Mean   :15.50
#                     3rd Qu.:4.75   3rd Qu.:10.75   3rd Qu.:16.75
#                     Max.   :6.00   Max.   :12.00   Max.   :18.00

The square-bracket subsetting syntax [] is what encloses the special data.table syntax. So, its basic premise is the same as base R. But data.table adds a lot of features on that syntax.

Subsetting rows

Subsetting rows with data.table is exactly the same as base R’s subsetting, except that the comma is optional if no columns are chosen.

my_dt[ c(1, 4, 6) ] # same as `my_dt[ c(1, 4, 6), ]`
#    id a  b  c
# 1:  b 1  7 13
# 2:  a 4 10 16
# 3:  c 6 12 18

my_dt[ id == "a" ]
#    id a  b  c
# 1:  a 4 10 16
# 2:  a 5 11 17

Although this indexing method is usually meant for subsetting, you could actually use it to order your table by some column. This is not special to data.table, but it is kind of a neat trick.

my_dt[ order(id) ]
#    id a  b  c
# 1:  a 4 10 16
# 2:  a 5 11 17
# 3:  b 1  7 13
# 4:  b 2  8 14
# 5:  b 3  9 15
# 6:  c 6 12 18

Subsetting columns

Again, the syntax here is similar to base R. But there are additional points to be considered.

my_dt[ , c ]
# [1] 13 14 15 16 17 18
my_dt[ , .(a) ] # same as `my_dt[ , list(a) ]`
#    a
# 1: 1
# 2: 2
# 3: 3
# 4: 4
# 5: 5
# 6: 6
my_dt[ , .(a, c) ]
#    a  c
# 1: 1 13
# 2: 2 14
# 3: 3 15
# 4: 4 16
# 5: 5 17
# 6: 6 18
my_dt[ , .(col_a = a, col_c = c) ]
#    col_a col_c
# 1:     1    13
# 2:     2    14
# 3:     3    15
# 4:     4    16
# 5:     5    17
# 6:     6    18

Here is some additional column-subsetting syntax:

Aggregation

This is where things get more interesting. Within the same pair of brackets, you could actually perform aggregation (summary statistics). Of course, this operation can be combined with row-subsetting.

my_dt[ , sum(a) ]
# [1] 21

my_dt[ , .(mean_a = mean(a), mean_b = mean(b)) ]
#    mean_a mean_b
# 1:    3.5    9.5

my_dt[ id == "b", .(mean_a = mean(a), mean_b = mean(b)) ]
#    mean_a mean_b
# 1:      2      8

There is a special symbol .N which gets the number of rows/sample size.

my_dt[ a >= 3, .N ]
# [1] 4

Aggregation is most useful when it is grouped, and this is where we take data.table’s syntax to an even higher level: there is a by= argument that we can use also inside the square brackets. This argument takes a variable name or multiple names with the same .()/list() syntax.

my_dt[ , .N, by = .(id) ] # same as `my_dt[ , .N, id ]`
#    id N
# 1:  b 3
# 2:  a 2
# 3:  c 1

Instead of by= There is another argument named keyby= which actually sorts the groups.

my_dt[ , .N, keyby= id ]
#    id N
# 1:  a 2
# 2:  b 3
# 3:  c 1

Chaining data.table operations

So far, we have been discussing operations that have matching “verbs” in dplyr (e.g. filter(), select(), summarize(), group_by()). But there is another important feature: chaining operations (for which dplyr uses the pipe operator %>%). So can we chain operations in data.table code?

The answer is yes. Actually, it requires no additional syntax if you think about the following logic:

  1. The operation starts with a data.table object followed by square-bracket syntax.

  2. The output of that operation is also a data.table object, which (you might have noticed) is also the starting point of step 1!

After performing one operation on a data.table, a new operation can be added directly after the previous one with no need for intermediate variables or even special operators like the pipe.

In the next example, I made the same operation from the previous example, which had a column N. Then, I used that N column in a new subsetting operation.

my_dt[ , .N, keyby = id ][ N > 1 ]
#    id N
# 1:  a 2
# 2:  b 3

Other perks of data.table

So far, the whole tutorial has been about the syntax of making and working with data.tables, now let me describe some othe features and advantages for using the package:

My personal comments

I think data.table is quite an interesting package, and the next points sum up my own opinions after getting started with the package along with some comparisons to its popular counterpart dplyr.

# data.table
data.table(mtcars)[ am == 1, .(mmpg = mean(mpg)), by = cyl ][ order(mmpg) ]

# dplyr
tibble(mtcars) %>% filter(am == 1) %>% group_by(cyl) %>% summarize(mmpg = mean(mpg)) %>% arrange(mmpg)

We’re done playing with the basics of data.table! I hope you enjoyed it as much as I did.

Tags: