itsakettle

#-2 | Rolling a Join

The data.table package in R facilitates rolling joins. I’ve stumbled across the term a few times so I thought I should learn about it. My first Google search for rolling join yielded a mix of data topics…and other content that wasn’t work safe. I carefully navigated to a few different pages and did some reading. I didn’t understand what was going on though. So I created some toy examples of my own and with some tinkering I got to grips with the technique.

Nomenclature

According to the data.table intro vignette a rolling join is also called a fast ordered join or last observation carried forward join. The basic idea is that if there is no matching value for a row being looked up during the join, then the value from the closest record is matched in lieu. I guess each of the names kind of makes sense to me but I’m sticking with rolling join because otherwise the title of this post wouldn’t work.

Making up data

To start with I made up a couple of pretend data.tables to join together.

  dt <- data.table(id=c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2),
                 t=c(100, 200, 300, 400, 500, 600, 700, 800, 100, 200, 300, 400, 500, 600, 700, 800),
                 key=c('id', 't'))
  dt
##     id   t
##  1:  1 100
##  2:  1 200
##  3:  1 300
##  4:  1 400
##  5:  1 500
##  6:  1 600
##  7:  1 700
##  8:  1 800
##  9:  2 100
## 10:  2 200
## 11:  2 300
## 12:  2 400
## 13:  2 500
## 14:  2 600
## 15:  2 700
## 16:  2 800

The id column can be any arbitrary entity but here I’ve just used 1 and 2. The column t represents time, or something similar to that, and again I’ve just picked arbitrary values. id is set as the first key column and t is set as the second key column - the order will matter later.

Say we want to add another column, the colour associated with an id at time t.

  dt.colour <- data.table(id=c(1, 2, 2),
                          t=c(300, 200, 600),
                          colour=c('blue', 'yellow', 'red'), key=c('id', 't'))
  dt.colour
##    id   t colour
## 1:  1 300   blue
## 2:  2 200 yellow
## 3:  2 600    red

So id 1 was associated with ‘blue’ at time 300 and id 2 was associated with ‘yellow’ at time 200 and ‘red’ at time 600. The keys are set the same as dt.

Let’s join the two data.tables together to add the colours. So for each entry in dt we want to try and find any matching values in dt.colour. In data.table land this is done like this

  dt.colour[dt,]
##     id   t colour
##  1:  1 100     NA
##  2:  1 200     NA
##  3:  1 300   blue
##  4:  1 400     NA
##  5:  1 500     NA
##  6:  1 600     NA
##  7:  1 700     NA
##  8:  1 800     NA
##  9:  2 100     NA
## 10:  2 200 yellow
## 11:  2 300     NA
## 12:  2 400     NA
## 13:  2 500     NA
## 14:  2 600    red
## 15:  2 700     NA
## 16:  2 800     NA

As expected the colour column has been added with values if possible and with NAs inserted where there is no match.

Rolling

Let’s set roll=Inf and see what happens

  dt.colour[dt, , roll=Inf]
##     id   t colour
##  1:  1 100     NA
##  2:  1 200     NA
##  3:  1 300   blue
##  4:  1 400   blue
##  5:  1 500   blue
##  6:  1 600   blue
##  7:  1 700   blue
##  8:  1 800   blue
##  9:  2 100     NA
## 10:  2 200 yellow
## 11:  2 300 yellow
## 12:  2 400 yellow
## 13:  2 500 yellow
## 14:  2 600    red
## 15:  2 700    red
## 16:  2 800    red

For rows that have no exact key match, the last value matching the first key is rolled forward. The last value is defined by the order of the final key column - in this case t. The roll argument is set to Inf to specify that there is no limit to how far a value can be rolled forward.

From here on I’m going to add the unrolled version of the colour column to dt so as the rolled result can be easily compared to non-rolled result.

  dt[, non.rolled.colour := dt.colour[dt, colour, roll=FALSE]]
  dt
##     id   t non.rolled.colour
##  1:  1 100                NA
##  2:  1 200                NA
##  3:  1 300              blue
##  4:  1 400                NA
##  5:  1 500                NA
##  6:  1 600                NA
##  7:  1 700                NA
##  8:  1 800                NA
##  9:  2 100                NA
## 10:  2 200            yellow
## 11:  2 300                NA
## 12:  2 400                NA
## 13:  2 500                NA
## 14:  2 600               red
## 15:  2 700                NA
## 16:  2 800                NA

Equivalent arguments

Note that roll=TRUE and roll=T are equivalent to roll=Inf.

Rolling backwards

Rolling can be reversed by setting roll=-Inf. Now values are rolled back instead of forward.

  dt.colour[dt, , roll=-Inf]
##     id   t colour non.rolled.colour
##  1:  1 100   blue                NA
##  2:  1 200   blue                NA
##  3:  1 300   blue              blue
##  4:  1 400     NA                NA
##  5:  1 500     NA                NA
##  6:  1 600     NA                NA
##  7:  1 700     NA                NA
##  8:  1 800     NA                NA
##  9:  2 100 yellow                NA
## 10:  2 200 yellow            yellow
## 11:  2 300    red                NA
## 12:  2 400    red                NA
## 13:  2 500    red                NA
## 14:  2 600    red               red
## 15:  2 700     NA                NA
## 16:  2 800     NA                NA

Limiting the roll

By setting roll to a number you can limit how far values are rolled forward or back. The value specified is relative to the second key column, in this case t.

  dt.colour[dt, , roll=100]
##     id   t colour non.rolled.colour
##  1:  1 100     NA                NA
##  2:  1 200     NA                NA
##  3:  1 300   blue              blue
##  4:  1 400   blue                NA
##  5:  1 500     NA                NA
##  6:  1 600     NA                NA
##  7:  1 700     NA                NA
##  8:  1 800     NA                NA
##  9:  2 100     NA                NA
## 10:  2 200 yellow            yellow
## 11:  2 300 yellow                NA
## 12:  2 400     NA                NA
## 13:  2 500     NA                NA
## 14:  2 600    red               red
## 15:  2 700    red                NA
## 16:  2 800     NA                NA

Since t increases in increments of 100 for each id, the colour is rolled forward one row. So with these datatables if roll=99 then nothing will be rolled forward and the result is the same as roll=FALSE.

  dt.colour[dt, , roll=99]
##     id   t colour non.rolled.colour
##  1:  1 100     NA                NA
##  2:  1 200     NA                NA
##  3:  1 300   blue              blue
##  4:  1 400     NA                NA
##  5:  1 500     NA                NA
##  6:  1 600     NA                NA
##  7:  1 700     NA                NA
##  8:  1 800     NA                NA
##  9:  2 100     NA                NA
## 10:  2 200 yellow            yellow
## 11:  2 300     NA                NA
## 12:  2 400     NA                NA
## 13:  2 500     NA                NA
## 14:  2 600    red               red
## 15:  2 700     NA                NA
## 16:  2 800     NA                NA

What happens if roll isn’t an integer? It seems to work as you would expect.

  dt.colour[dt, , roll=100.1]
##     id   t colour non.rolled.colour
##  1:  1 100     NA                NA
##  2:  1 200     NA                NA
##  3:  1 300   blue              blue
##  4:  1 400   blue                NA
##  5:  1 500     NA                NA
##  6:  1 600     NA                NA
##  7:  1 700     NA                NA
##  8:  1 800     NA                NA
##  9:  2 100     NA                NA
## 10:  2 200 yellow            yellow
## 11:  2 300 yellow                NA
## 12:  2 400     NA                NA
## 13:  2 500     NA                NA
## 14:  2 600    red               red
## 15:  2 700    red                NA
## 16:  2 800     NA                NA

Here’s an example with a higher value of roll.

  dt.colour[dt, , roll=300]
##     id   t colour non.rolled.colour
##  1:  1 100     NA                NA
##  2:  1 200     NA                NA
##  3:  1 300   blue              blue
##  4:  1 400   blue                NA
##  5:  1 500   blue                NA
##  6:  1 600   blue                NA
##  7:  1 700     NA                NA
##  8:  1 800     NA                NA
##  9:  2 100     NA                NA
## 10:  2 200 yellow            yellow
## 11:  2 300 yellow                NA
## 12:  2 400 yellow                NA
## 13:  2 500 yellow                NA
## 14:  2 600    red               red
## 15:  2 700    red                NA
## 16:  2 800    red                NA

Note again that when you specify a number for the roll argument it is how far a particular value can be rolled forward or back. It is worth pointing out that it can also be thought of as how far back a row missing a value is willing to look to find an alternative. This means you can use it to, say, find the most recent value within a certain window.

Selective Roll

I couldn’t figure out how to roll on some columns but not others. I don’t think this is currently possible.

Nearest Value

You can also set roll=nearest. Let’s see what this does.

  dt.colour[dt, , roll='nearest']
##     id   t colour non.rolled.colour
##  1:  1 100   blue                NA
##  2:  1 200   blue                NA
##  3:  1 300   blue              blue
##  4:  1 400   blue                NA
##  5:  1 500   blue                NA
##  6:  1 600   blue                NA
##  7:  1 700   blue                NA
##  8:  1 800   blue                NA
##  9:  2 100 yellow                NA
## 10:  2 200 yellow            yellow
## 11:  2 300 yellow                NA
## 12:  2 400 yellow                NA
## 13:  2 500    red                NA
## 14:  2 600    red               red
## 15:  2 700    red                NA
## 16:  2 800    red                NA

It looks like the nearest value, either forwards or backwards, is taken. You can see that there is a tie when id=2 and t=400 and in this case the value found by looking backwards seems to be taken over the one found by looking forwards. But what if you want to look only so far backwards and forwards?

Rollends

There is also the rollends argument which by default is set to c(TRUE, FALSE) if roll is negative and c(FALSE, TRUE) if it is positive or zero (I suppose roll=TRUE is considered positive). So the first element of rollends is whether to roll backwards and the second element is whether to roll forwards. So by setting rollends=c(TRUE, TRUE) you can roll backwards and forwards whilst also specifying how far to look.

  dt.colour[dt, , roll=200, rollends=c(TRUE,TRUE)]
##     id   t colour non.rolled.colour
##  1:  1 100   blue                NA
##  2:  1 200   blue                NA
##  3:  1 300   blue              blue
##  4:  1 400   blue                NA
##  5:  1 500   blue                NA
##  6:  1 600     NA                NA
##  7:  1 700     NA                NA
##  8:  1 800     NA                NA
##  9:  2 100 yellow                NA
## 10:  2 200 yellow            yellow
## 11:  2 300 yellow                NA
## 12:  2 400 yellow                NA
## 13:  2 500     NA                NA
## 14:  2 600    red               red
## 15:  2 700    red                NA
## 16:  2 800    red                NA

What happens when you set rollends=c(FALSE, FALSE). I expected this would mean no rolling, but that doesn’t seem to be the case. The documentation says

If rollends=FALSE the value of i must fall in a gap in x but not after the end or before the beginning of the data, for that group defined by all but the last join column.

First note that rollends=FALSE is the same as rollends=c(FALSE, FALSE) because the value is recycled. Next, in our case the x referred to is dt.colours and i is dt. So in order for a row in dt to find a match, its value of t must fall in between the values of t for two rows in dt.colours that have the same value of id, the first key column. Let’s try it with roll=Inf. It might be a good idea to print out dt.colours as well for reference.

  dt.colour
##    id   t colour
## 1:  1 300   blue
## 2:  2 200 yellow
## 3:  2 600    red
  # Note roll=Inf
  dt.colour[dt, , roll=Inf, rollends=c(FALSE, FALSE)]
##     id   t colour non.rolled.colour
##  1:  1 100     NA                NA
##  2:  1 200     NA                NA
##  3:  1 300   blue              blue
##  4:  1 400     NA                NA
##  5:  1 500     NA                NA
##  6:  1 600     NA                NA
##  7:  1 700     NA                NA
##  8:  1 800     NA                NA
##  9:  2 100     NA                NA
## 10:  2 200 yellow            yellow
## 11:  2 300 yellow                NA
## 12:  2 400 yellow                NA
## 13:  2 500 yellow                NA
## 14:  2 600    red               red
## 15:  2 700     NA                NA
## 16:  2 800     NA                NA

Only rows of dt that have a value of t in between values of t in dt.colour can roll previous values forward. So when id is 1 there is no rolling since there is only one applicable row in dt.colours meaning nothing can be in between anything. However for dt equal 2 the rows with values of t from 200-500 are all in between the two entries with an id of 2 in dt.colour. So it seems that rollends=FALSE is a way of turning off rolling on the edges of the data.

Multiple Grouping Columns

Everything used in this post has used two key columns id and t when performing rolling joins. However you can use as many as you like with the understanding that the last key column will always be used for rolling and the rest for defining groups. As a finish here’s an example of this.

  dt <- data.table(id=c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2),
                   id2=c(rep('a', 4), rep('b', 4), rep('a', 4), rep('b', 4)),
                 t=c(100, 200, 300, 400, 500, 600, 700, 800, 100, 200, 300, 400, 500, 600, 700, 800),
                 key=c('id', 'id2', 't'))
  dt
##     id id2   t
##  1:  1   a 100
##  2:  1   a 200
##  3:  1   a 300
##  4:  1   a 400
##  5:  1   b 500
##  6:  1   b 600
##  7:  1   b 700
##  8:  1   b 800
##  9:  2   a 100
## 10:  2   a 200
## 11:  2   a 300
## 12:  2   a 400
## 13:  2   b 500
## 14:  2   b 600
## 15:  2   b 700
## 16:  2   b 800
  dt.colour <- data.table(id=c(1, 2, 2),
                          id2=c('b', 'b', 'b'),
                          t=c(300, 200, 600),
                          colour=c('blue', 'yellow', 'red'), key=c('id', 'id2', 't'))
  dt.colour
##    id id2   t colour
## 1:  1   b 300   blue
## 2:  2   b 200 yellow
## 3:  2   b 600    red
# Add non rolled data for reference
dt[, non.rolled.colour := dt.colour[dt, colour, roll=FALSE]]
dt.colour[dt, , roll=TRUE]
##     id id2   t colour non.rolled.colour
##  1:  1   a 100     NA                NA
##  2:  1   a 200     NA                NA
##  3:  1   a 300     NA                NA
##  4:  1   a 400     NA                NA
##  5:  1   b 500   blue                NA
##  6:  1   b 600   blue                NA
##  7:  1   b 700   blue                NA
##  8:  1   b 800   blue                NA
##  9:  2   a 100     NA                NA
## 10:  2   a 200     NA                NA
## 11:  2   a 300     NA                NA
## 12:  2   a 400     NA                NA
## 13:  2   b 500 yellow                NA
## 14:  2   b 600    red               red
## 15:  2   b 700    red                NA
## 16:  2   b 800    red                NA

Corrections

2017-07-02: Updated the Rollends section to state that the default value of rollends when roll is positive or zero is c(FALSE, TRUE) rather than c(TRUE, FALSE).