Which is Faster?: R Data Manipulation with data.table vs dplyr

I started my voyage into learning R by taking Datacamp’s online courses. After finishing courses on data manipulation in both base R and dplyr, I stumbled upon a course on using the data.table library. I was taken aback after learning that data.table uses a different syntax than base R. This was unnerving, as I didn’t know what I would gain from learning data manipulation in yet another syntax. My skepticism, however, changed to optimism once I began working on a rather large dataset a few weeks later. This dataset (a 43M row set of email opens and clickthroughs) took something like 30-40 minutes to read into R using the base read.csv function. So I tried the fread function in data.table instead. Lo and behold, what took 30-40 minutes using base R took about 5 minutes using fread. Here’s timing data for a 3M row text file:

[Figure: fread vs. read.csv timings. Caption: 85% improvement in median performance using fread]
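For readers who want to try a comparison like this themselves, here is a minimal sketch. It does not use my email dataset: the data frame, its column names, and the row count are made up for illustration, and the timings you get will depend on your machine and on the shape of the file.

```r
library(data.table)

# Synthetic stand-in for a real dataset; the column names are hypothetical.
set.seed(42)
n <- 200000
emails <- data.frame(
  user_id = sample.int(50000, n, replace = TRUE),
  opened  = sample(c(0L, 1L), n, replace = TRUE),
  clicked = sample(c(0L, 1L), n, replace = TRUE)
)

# Write the data to a temporary CSV so both readers parse the same file.
tmp <- tempfile(fileext = ".csv")
fwrite(emails, tmp)

# Time each reader once; results vary with machine, disk, and file shape.
base_time  <- system.time(df <- read.csv(tmp))
fread_time <- system.time(dt <- fread(tmp))

# Show user, system, and elapsed seconds side by side.
print(rbind(read.csv = base_time[1:3], fread = fread_time[1:3]))

unlink(tmp)
```

A single system.time call is a rough measurement. For something closer to the median figures quoted here, wrapping both reads in microbenchmark with several repetitions is the more careful option.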

The speed improvements are not limited to reading data. Manipulations are also faster using data.table. Here are two functions, one written with dplyr and one with data.table, that group and count rows in the world cities population dataset.


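library(data.table)
library(dplyr)
library(microbenchmark)

wcp <- fread(file = "worldcitiespop.csv")

# dplyr version: count rows per country, largest count first
df_func <- function(x) {
  x %>%
    group_by(Country) %>%
    summarise(
      Count = n()
    ) %>%
    arrange(desc(Count))
}

# data.table version: same grouping and ordering in one chained expression
dt_func <- function(x) {
  x[, .(Count = .N), Country][order(-Count)]
}

microbenchmark(df_func(wcp), dt_func(wcp), times = 10)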
[Figure: dplyr vs. data.table timings. Caption: 83% increase in median performance using data.table]
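The benchmark above needs worldcitiespop.csv on disk. If you don't have the file, the sketch below runs the same kind of comparison on synthetic data. The table, its fake country codes, and the row count are assumptions made for illustration, so the timings will not match the figure. It also checks that both functions return the same counts before timing them, since a speed comparison only means something if the results agree.

```r
library(data.table)
library(dplyr)
library(microbenchmark)

# Made-up data: 1M rows spread over 200 fake country codes.
set.seed(1)
cities <- data.table(
  Country    = sample(sprintf("c%03d", 1:200), 1e6, replace = TRUE),
  Population = sample.int(1e6, 1e6, replace = TRUE)
)

# dplyr version: count rows per country, largest count first
df_func <- function(x) {
  x %>%
    group_by(Country) %>%
    summarise(Count = n()) %>%
    arrange(desc(Count))
}

# data.table version of the same query
dt_func <- function(x) {
  x[, .(Count = .N), Country][order(-Count)]
}

# Check that both approaches agree before comparing their speed.
# Sort by Country so ties in Count cannot change the row order.
a <- as.data.frame(df_func(cities))
b <- as.data.frame(dt_func(cities))
a <- a[order(a$Country), ]
b <- b[order(b$Country), ]
stopifnot(all(a$Country == b$Country), all(a$Count == b$Count))

microbenchmark(df_func(cities), dt_func(cities), times = 10)
```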

Lastly, as you can see in the functions written above, the data.table function (dt_func) is less verbose than the dplyr function (df_func). One reason is that dplyr is designed to read easily when code passes from one programmer to another. Not every programmer needs to share code, however. Either way, once I learned the data.table syntax, I preferred it over the dplyr syntax. This seems to be the case for many programmers with a previous foundation in SQL.

[Figure: Datacamp’s data.table tutorial explains the data.table – SQL similarity]
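To make the SQL comparison concrete, here is a small sketch of how I read the general DT[i, j, by] form: i plays the role of WHERE, j of SELECT, and by of GROUP BY. The toy table, its values, and the equivalent SQL in the comments are my own illustration and are not taken from the Datacamp tutorial.

```r
library(data.table)

# Toy table; names and numbers are illustrative only.
cities <- data.table(
  Country    = c("us", "us", "fr", "fr", "jp"),
  City       = c("boston", "austin", "paris", "lyon", "tokyo"),
  Population = c(650000, 950000, 2100000, 500000, 13900000)
)

# Roughly equivalent SQL:
#   SELECT Country, COUNT(*) AS Count, SUM(Population) AS Total
#   FROM cities
#   WHERE Population > 600000
#   GROUP BY Country
#   ORDER BY Total DESC
result <- cities[Population > 600000,                     # i  ~ WHERE
                 .(Count = .N, Total = sum(Population)),  # j  ~ SELECT
                 by = Country][order(-Total)]             # by ~ GROUP BY, then ORDER BY

print(result)
```

The chained [order(-Total)] at the end is the same trick used in dt_func: the result of the first bracket is itself a data.table, so a second bracket can sort it.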

While learning syntax can be a tough task, I have to admit that the extra work of learning data.table syntax is worth it.

If you haven’t had a chance to read my last post on using R with Google Analytics data, please take a look. Also, if you have any comments or questions, or if you simply want to call me crazy, drop a comment below.

Like and share!
