Almost all joins between 2 data.table
s use a notation
where one of them is used as i
in a frame applied to the
other, and the joining columns are specified with the on
parameter. However, in addition to the “basic” joins,
data.table
allows for special cases like rolling joins,
summarizing while joining, non-equi joins, etc. This vignette will
describe the notation to apply these joins with verbs defined in
table.express
, which, like the single-table
verbs, build data.table
expressions.
We’ll consider most of the dplyr
joining verbs in this
section:
inner_join
left_join
right_join
anti_join
semi_join
full_join
A <- data.table::data.table(x = rep(c("b", "a", "c"), each = 3),
y = c(1, 3, 6),
v = 1:9)
B <- data.table::data.table(x = c("c", "b"),
v2 = 8:7,
foo = c(4, 2))
A
#> x y v
#> 1: b 1 1
#> 2: b 3 2
#> 3: b 6 3
#> 4: a 1 4
#> 5: a 3 5
#> 6: a 6 6
#> 7: c 1 7
#> 8: c 3 8
#> 9: c 6 9
#> x v2 foo
#> 1: c 8 4
#> 2: b 7 2
The methods defined in table.express
accept the
on
part of the expression in their ellipsis:
#> x y v v2 foo
#> 1: c 1 7 8 4
#> 2: c 3 8 8 4
#> 3: c 6 9 8 4
#> 4: b 1 1 7 2
#> 5: b 3 2 7 2
#> 6: b 6 3 7 2
#> x y v foo
#> 1: c 3 8 4
An important thing to note in the second example above is the order
in which the columns are given, i.e. that v
is written
before v2
, since the order is relevant for
data.table
. We can remember the correct order simply by
looking at which data.table
appears first in the
expression, and knowing that said data.table
’s columns must
appear first in the on
expressions. In this case,
A
appears before B
, so writing
v2 = v
would not work.
In order to maintain consistency in the ordering just described,
left_join
automatically swaps expression elements
internally in order to build the expression:
#> .DT_[.DT_0_, on = list(x, v2 = v), nomatch = , mult = , roll = ,
#> rollends = ]
#> x v2 foo y
#> 1: b 1 NA 1
#> 2: b 2 NA 3
#> 3: b 3 NA 6
#> 4: a 4 NA 1
#> 5: a 5 NA 3
#> 6: a 6 NA 6
#> 7: c 7 NA 1
#> 8: c 8 4 3
#> 9: c 9 NA 6
We can also see an extra .DT_0_
pronoun in the
expression. These special .DT_*_
pronouns hold the
different data.table
s that have entered the expression so
far, and are automatically assigned to the evaluation’s environment. In
this case, .DT_
holds B
and
.DT_0_
holds A
.
No additional considerations are required to use
right_join
or anti_join
:
#> x y v foo
#> 1: c 3 8 4
#> 2: b NA 7 2
#> x y v
#> 1: b 1 1
#> 2: b 3 2
#> 3: b 6 3
#> 4: a 1 4
#> 5: a 3 5
#> 6: a 6 6
#> 7: c 1 7
#> 8: c 6 9
A semi_join
is essentially a right_join
with nomatch = NULL
, and where j
is set to
unique(.SD)
:
#> x y v
#> 1: c 1 7
#> 2: c 3 8
#> 3: c 6 9
#> 4: b 1 1
#> 5: b 3 2
#> 6: b 6 3
Finally, full_join
is basically a wrapper for
merge
specifying all = TRUE
:
#> x y v.x v.y foo
#> 1: a 1 4 NA NA
#> 2: a 3 5 NA NA
#> 3: a 6 6 NA NA
#> 4: b 1 1 7 2
#> 5: b 3 2 7 2
#> 6: b 6 3 7 2
#> 7: c 1 7 8 4
#> 8: c 3 8 8 4
#> 9: c 6 9 8 4
Something to keep in mind is that most joins specify i
and on
inside the frame, so any subsequent verbs that
specify j
, by
, or keyby
would
still be possible. In order to enable this, many joining verbs have an
.expr
parameter that indicates that the expression should
be kept, delaying evaluation until a verb that sets j
is
used. This can be useful if only a subset of the joined columns is
desired:
#> x y foo
#> 1: b 1 2
#> 2: b 3 2
#> 3: b 6 2
#> 4: a 1 NA
#> 5: a 3 NA
#> 6: a 6 NA
#> 7: c 1 4
#> 8: c 3 4
#> 9: c 6 4
But, when working lazily, this would require explicit chaining for expressions that should be applied to the joined table:
A %>%
start_expr %>%
left_join(B, x) %>%
chain %>%
group_by(x) %>%
mutate(y = cumsum(y)) %>%
end_expr %>% {
invisible(print(.))
}
#> x v foo y i.v
#> 1: b 7 2 1 1
#> 2: b 7 2 4 2
#> 3: b 7 2 10 3
#> 4: a NA NA 1 4
#> 5: a NA NA 4 5
#> 6: a NA NA 10 6
#> 7: c 8 4 1 7
#> 8: c 8 4 4 8
#> 9: c 8 4 10 9
This is particularly important if the selection expressions call any
function, e.g. tidyselect
helpers or even :
with non-numerics, because data.table
does not support that
kind of syntax for j
in the same frame as a join.
A special data.table
idiom is described here as an update
join. In order to highlight the fact that it modifies the left-hand
side table by reference, the mutate_join
verb is defined in
table.express
. Said verb accepts the columns to be added in
its .SDcols
parameter, possibly with new names:
#> x y v foo v_from_B
#> 1: b 1 1 2 7
#> 2: b 3 2 2 7
#> 3: b 6 3 2 7
#> 4: a 1 4 NA NA
#> 5: a 3 5 NA NA
#> 6: a 6 6 NA NA
#> 7: c 1 7 4 8
#> 8: c 3 8 4 8
#> 9: c 6 9 4 8
A particularity of this idiom is that the number of rows from the resulting join must match the left-hand side exactly or not at all, so this won’t work:
In these cases, we must either use mult
if appropriate,
or specify a summarizing expression in .SDcols
:
#> x v foo y
#> 1: c 8 4 1
#> 2: b 7 2 1
#> x v foo y
#> 1: c 8 4 3.333333
#> 2: b 7 2 3.333333
The last example specifies by = .EACHI
in the joining
expression.
A nice blog post describing rolling joins can be found at R-bloggers,
so almost the same website
and paypal
tables
will be used for the examples below. Another short description with
animated depictions can also be found here.
#> name session_start_time session_id
#> 1: Erica 2016-01-04 19:12:00 1
#> 2: Erica 2016-01-04 21:05:00 2
#> 3: Francis 2016-01-02 13:09:00 3
#> 4: Francis 2016-01-03 19:22:00 4
#> 5: Francis 2016-01-08 08:44:00 5
#> 6: Francis 2016-01-08 20:22:00 6
#> 7: Francis 2016-01-10 17:36:00 7
#> 8: Francis 2016-01-15 16:56:00 8
#> 9: Isabel 2016-01-01 11:01:00 9
#> 10: Isabel 2016-01-02 08:59:00 10
#> 11: Isabel 2016-01-05 18:18:00 11
#> 12: Isabel 2016-01-07 19:03:00 12
#> 13: Isabel 2016-01-08 19:01:00 13
#> 14: Sally 2016-01-03 10:00:00 14
#> 15: Vivian 2016-01-01 09:10:00 15
#> 16: Vivian 2016-01-09 02:15:00 16
#> name purchase_time payment_id
#> 1: Erica 2016-01-03 08:02:00 1
#> 2: Francis 2016-01-03 19:28:00 2
#> 3: Francis 2016-01-08 20:33:00 3
#> 4: Francis 2016-01-10 17:46:00 4
#> 5: Isabel 2016-01-08 19:10:00 5
#> 6: Mom 2015-12-02 17:58:00 6
#> 7: Sally 2016-01-03 10:06:00 7
#> 8: Sally 2016-01-03 10:15:00 8
In contrast to the blog post, no join_time
is added to
the tables. This is done on purpose in order to show what happens with
the columns that are rolled.
Let’s use a left rolling join to obtain the session_id
that immediately preceded a purchase, if any:
#> name session_start_time session_id payment_id
#> 1: Erica 2016-01-03 08:02:00 NA 1
#> 2: Francis 2016-01-03 19:28:00 4 2
#> 3: Francis 2016-01-08 20:33:00 6 3
#> 4: Francis 2016-01-10 17:46:00 7 4
#> 5: Isabel 2016-01-08 19:10:00 13 5
#> 6: Mom 2015-12-02 17:58:00 NA 6
#> 7: Sally 2016-01-03 10:06:00 14 7
#> 8: Sally 2016-01-03 10:15:00 14 8
We can see that the rows returned are from the left-hand side
(paypal
), and since neither Mom nor Erica visited the
website before their purchases, their session_id
ended as
NA
.
The order of the columns in the on
expressions is the
same as above. The tricky part is that the
rolled column ended up with the name from the right-hand side, but
keeping the values from the left-hand side. If we “invert” the join, the
result is the same, but the rolled column’s name is now from the
expression’s left-hand side.
#> name session_start_time session_id payment_id
#> 1: Erica 2016-01-03 08:02:00 NA 1
#> 2: Francis 2016-01-03 19:28:00 4 2
#> 3: Francis 2016-01-08 20:33:00 6 3
#> 4: Francis 2016-01-10 17:46:00 7 4
#> 5: Isabel 2016-01-08 19:10:00 13 5
#> 6: Mom 2015-12-02 17:58:00 NA 6
#> 7: Sally 2016-01-03 10:06:00 14 7
#> 8: Sally 2016-01-03 10:15:00 14 8
Note, however, that roll
stayed equal to
Inf
. This is because even though the column order in the
expressions changed, we could understand the rolling expressions as
follows:
left_join
, the rolling column on the left is
purchase_time
, so with roll = Inf
, the values
from session_start_time
are rolled forward onto
purchase_time
to find a match while joining.right_join
, the rolling column on the right is
purchase_time
, so roll
must stay as
Inf
to keep the same semantics.Now let’s say we want to keep all the rows from website
and find the closest payment_id
that occurred
after the visit. This could be expressed as:
#> name purchase_time payment_id session_id
#> 1: Erica 2016-01-04 19:12:00 NA 1
#> 2: Erica 2016-01-04 21:05:00 NA 2
#> 3: Francis 2016-01-02 13:09:00 2 3
#> 4: Francis 2016-01-03 19:22:00 2 4
#> 5: Francis 2016-01-08 08:44:00 3 5
#> 6: Francis 2016-01-08 20:22:00 3 6
#> 7: Francis 2016-01-10 17:36:00 4 7
#> 8: Francis 2016-01-15 16:56:00 NA 8
#> 9: Isabel 2016-01-01 11:01:00 5 9
#> 10: Isabel 2016-01-02 08:59:00 5 10
#> 11: Isabel 2016-01-05 18:18:00 5 11
#> 12: Isabel 2016-01-07 19:03:00 5 12
#> 13: Isabel 2016-01-08 19:01:00 5 13
#> 14: Sally 2016-01-03 10:00:00 7 14
#> 15: Vivian 2016-01-01 09:10:00 NA 15
#> 16: Vivian 2016-01-09 02:15:00 NA 16
In order to simplify the meaning of rollends
a bit, we
could think of it as missing or being a single
TRUE
/FALSE
. If it’s missing, rolling works
according to the value of roll
, otherwise:
rollends = TRUE
, the value of roll
is
inverted only for those rows that would have no match
otherwise.rollends = FALSE
, a matching roll will only occur
if the column’s value falls in a gap with values both before
and after.website %>%
left_join(paypal, name, session_start_time = purchase_time, roll = -Inf, rollends = TRUE)
#> name purchase_time payment_id session_id
#> 1: Erica 2016-01-04 19:12:00 1 1
#> 2: Erica 2016-01-04 21:05:00 1 2
#> 3: Francis 2016-01-02 13:09:00 2 3
#> 4: Francis 2016-01-03 19:22:00 2 4
#> 5: Francis 2016-01-08 08:44:00 3 5
#> 6: Francis 2016-01-08 20:22:00 3 6
#> 7: Francis 2016-01-10 17:36:00 4 7
#> 8: Francis 2016-01-15 16:56:00 4 8
#> 9: Isabel 2016-01-01 11:01:00 5 9
#> 10: Isabel 2016-01-02 08:59:00 5 10
#> 11: Isabel 2016-01-05 18:18:00 5 11
#> 12: Isabel 2016-01-07 19:03:00 5 12
#> 13: Isabel 2016-01-08 19:01:00 5 13
#> 14: Sally 2016-01-03 10:00:00 7 14
#> 15: Vivian 2016-01-01 09:10:00 NA 15
#> 16: Vivian 2016-01-09 02:15:00 NA 16
website %>%
left_join(paypal, name, session_start_time = purchase_time, roll = -Inf, rollends = FALSE)
#> name purchase_time payment_id session_id
#> 1: Erica 2016-01-04 19:12:00 NA 1
#> 2: Erica 2016-01-04 21:05:00 NA 2
#> 3: Francis 2016-01-02 13:09:00 NA 3
#> 4: Francis 2016-01-03 19:22:00 NA 4
#> 5: Francis 2016-01-08 08:44:00 3 5
#> 6: Francis 2016-01-08 20:22:00 3 6
#> 7: Francis 2016-01-10 17:36:00 4 7
#> 8: Francis 2016-01-15 16:56:00 NA 8
#> 9: Isabel 2016-01-01 11:01:00 NA 9
#> 10: Isabel 2016-01-02 08:59:00 NA 10
#> 11: Isabel 2016-01-05 18:18:00 NA 11
#> 12: Isabel 2016-01-07 19:03:00 NA 12
#> 13: Isabel 2016-01-08 19:01:00 NA 13
#> 14: Sally 2016-01-03 10:00:00 NA 14
#> 15: Vivian 2016-01-01 09:10:00 NA 15
#> 16: Vivian 2016-01-09 02:15:00 NA 16
Vivian’s payment_id
s are always NA
because
she has never purchased anything. On the other hand, no one except
Francis has visited the website both before and after a purchase.
Non-equi joins are similar to rolling joins, but instead of rolling a single row’s value, they can return several values per row.
Using the same data as before, we could find all the
session_id
s that preceded a payment_id
, giving
“priority” to paypal
’s rows:
#> name session_start_time session_id payment_id
#> 1: Erica 2016-01-03 08:02:00 NA 1
#> 2: Francis 2016-01-03 19:28:00 3 2
#> 3: Francis 2016-01-03 19:28:00 4 2
#> 4: Francis 2016-01-08 20:33:00 3 3
#> 5: Francis 2016-01-08 20:33:00 4 3
#> 6: Francis 2016-01-08 20:33:00 5 3
#> 7: Francis 2016-01-08 20:33:00 6 3
#> 8: Francis 2016-01-10 17:46:00 3 4
#> 9: Francis 2016-01-10 17:46:00 4 4
#> 10: Francis 2016-01-10 17:46:00 5 4
#> 11: Francis 2016-01-10 17:46:00 6 4
#> 12: Francis 2016-01-10 17:46:00 7 4
#> 13: Isabel 2016-01-08 19:10:00 9 5
#> 14: Isabel 2016-01-08 19:10:00 10 5
#> 15: Isabel 2016-01-08 19:10:00 11 5
#> 16: Isabel 2016-01-08 19:10:00 12 5
#> 17: Isabel 2016-01-08 19:10:00 13 5
#> 18: Mom 2015-12-02 17:58:00 NA 6
#> 19: Sally 2016-01-03 10:06:00 14 7
#> 20: Sally 2016-01-03 10:15:00 14 8
Priority above simply means that all rows from paypal
are returned, even if they don’t have a match in website
.
Even though a column session_start_time
appears in the
result, the values contained therein are from paypal
’s
purchase_time
.
A corresponding right non-equi join would yield the same result, expecting only a different order in the columns that are part of the comparisons:
#> name session_start_time session_id payment_id
#> 1: Erica 2016-01-03 08:02:00 NA 1
#> 2: Francis 2016-01-03 19:28:00 3 2
#> 3: Francis 2016-01-03 19:28:00 4 2
#> 4: Francis 2016-01-08 20:33:00 3 3
#> 5: Francis 2016-01-08 20:33:00 4 3
#> 6: Francis 2016-01-08 20:33:00 5 3
#> 7: Francis 2016-01-08 20:33:00 6 3
#> 8: Francis 2016-01-10 17:46:00 3 4
#> 9: Francis 2016-01-10 17:46:00 4 4
#> 10: Francis 2016-01-10 17:46:00 5 4
#> 11: Francis 2016-01-10 17:46:00 6 4
#> 12: Francis 2016-01-10 17:46:00 7 4
#> 13: Isabel 2016-01-08 19:10:00 9 5
#> 14: Isabel 2016-01-08 19:10:00 10 5
#> 15: Isabel 2016-01-08 19:10:00 11 5
#> 16: Isabel 2016-01-08 19:10:00 12 5
#> 17: Isabel 2016-01-08 19:10:00 13 5
#> 18: Mom 2015-12-02 17:58:00 NA 6
#> 19: Sally 2016-01-03 10:06:00 14 7
#> 20: Sally 2016-01-03 10:15:00 14 8
In case a self join were necessary, perhaps while using a rolling or
non-equi join, the way magrittr
’s pipe handles the
.
outside of nested calls wouldn’t allow calling a joining
verb with .
both in x
and y
. To
work around this, the following verbs default to an eager self join when
y
is missing:
full_join
left_join
mutate_join
As a somewhat contrived example, we could add a rolling count of
weekly visits per user to the website
data introduced above:
website %>%
mutate(window_start = session_start_time - as.difftime(7, units = "days")) %>%
mutate_join(, name, window_start <= session_start_time, session_start_time >= session_start_time,
.SDcols = .(weekly_visits = .N),
.by_each = TRUE) %>%
mutate(window_start = NULL)
print(website)
#> name session_start_time session_id weekly_visits
#> 1: Erica 2016-01-04 19:12:00 1 1
#> 2: Erica 2016-01-04 21:05:00 2 2
#> 3: Francis 2016-01-02 13:09:00 3 1
#> 4: Francis 2016-01-03 19:22:00 4 2
#> 5: Francis 2016-01-08 08:44:00 5 3
#> 6: Francis 2016-01-08 20:22:00 6 4
#> 7: Francis 2016-01-10 17:36:00 7 4
#> 8: Francis 2016-01-15 16:56:00 8 3
#> 9: Isabel 2016-01-01 11:01:00 9 1
#> 10: Isabel 2016-01-02 08:59:00 10 2
#> 11: Isabel 2016-01-05 18:18:00 11 3
#> 12: Isabel 2016-01-07 19:03:00 12 4
#> 13: Isabel 2016-01-08 19:01:00 13 4
#> 14: Sally 2016-01-03 10:00:00 14 1
#> 15: Vivian 2016-01-01 09:10:00 15 1
#> 16: Vivian 2016-01-09 02:15:00 16 1