slider is implemented with a new convention that began in vctrs,
treating a data frame as a vector of rows. This makes slide() a
row-wise iterator over a data frame, which can be useful for solving
some previously tricky problems in the tidyverse.
The point of this vignette is to go through a few examples of a
row-oriented workflow. The examples are adapted from Jenny Bryan’s
talk on row-oriented workflows with purrr, to show how this workflow
is improved with slide().
Let’s first explore using slide() as a row-wise iterator in general.
We’ll start with this simple data frame.
library(slider)
library(dplyr, warn.conflicts = FALSE)

example <- tibble(
  x = 1:4,
  y = letters[1:4]
)
example
#> # A tibble: 4 × 2
#> x y
#> <int> <chr>
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
If we were to pass the x column to slide(), it would iterate over that
column using the window specified by .before, .after, and .complete.
The defaults are similar to purrr::map().
slide(example$x, ~.x)
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 2
#>
#> [[3]]
#> [1] 3
#>
#> [[4]]
#> [1] 4
slide(example$x, ~.x, .before = 2)
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 1 2
#>
#> [[3]]
#> [1] 1 2 3
#>
#> [[4]]
#> [1] 2 3 4
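The window arguments can also be combined. The following is a small sketch, not taken from the original examples, that uses slide_dbl() (the double-returning variant used later in this vignette) to compute a centered rolling mean over the same values, with and without partial windows.

```r
library(slider)

x <- 1:4

# Centered window: 1 element before and 1 after the current one.
# Partial windows at the edges are kept by default.
partial <- slide_dbl(x, mean, .before = 1, .after = 1)

# With .complete = TRUE, incomplete windows at the edges produce NA
complete <- slide_dbl(x, mean, .before = 1, .after = 1, .complete = TRUE)

partial
complete
```

The first and last elements differ between the two results because their windows would extend past the ends of x.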
When applied to the entire example data frame, map() treats it as a
list and iterates over the columns. slide(), on the other hand,
iterates over rows. This is consistent with the vctrs idea of size,
which is the length of an atomic vector, but the number of rows of a
data frame or matrix. slide() always returns an object with the same
size as its input. Because the number of rows in example is 4, the
output size is 4 and you get one row per element in the output.
slide(example, ~.x)
#> [[1]]
#> # A tibble: 1 × 2
#> x y
#> <int> <chr>
#> 1 1 a
#>
#> [[2]]
#> # A tibble: 1 × 2
#> x y
#> <int> <chr>
#> 1 2 b
#>
#> [[3]]
#> # A tibble: 1 × 2
#> x y
#> <int> <chr>
#> 1 3 c
#>
#> [[4]]
#> # A tibble: 1 × 2
#> x y
#> <int> <chr>
#> 1 4 d
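To make the contrast with map() concrete, here is a minimal sketch that compares the two iteration styles on the same data frame and checks the vctrs size directly. It assumes purrr and vctrs are installed alongside slider; the names by_column and by_row are illustrative only.

```r
library(slider)
library(purrr)
library(tibble)
library(vctrs)

example <- tibble(
  x = 1:4,
  y = letters[1:4]
)

# map() treats the data frame as a list of columns
by_column <- map(example, ~.x)

# slide() treats the data frame as a vector of rows
by_row <- slide(example, ~.x)

length(by_column) # one element per column
length(by_row)    # one element per row

# vctrs size: number of rows for a data frame, length for an atomic vector
vec_size(example)
vec_size(example$x)
```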
You can still use the other arguments to slide() to control the window
size.
# Current row + 2 before
slide(example, ~.x, .before = 2)
#> [[1]]
#> # A tibble: 1 × 2
#> x y
#> <int> <chr>
#> 1 1 a
#>
#> [[2]]
#> # A tibble: 2 × 2
#> x y
#> <int> <chr>
#> 1 1 a
#> 2 2 b
#>
#> [[3]]
#> # A tibble: 3 × 2
#> x y
#> <int> <chr>
#> 1 1 a
#> 2 2 b
#> 3 3 c
#>
#> [[4]]
#> # A tibble: 3 × 2
#> x y
#> <int> <chr>
#> 1 2 b
#> 2 3 c
#> 3 4 d
# Center aligned, with no partial results
slide(example, ~.x, .before = 1, .after = 1, .complete = TRUE)
#> [[1]]
#> NULL
#>
#> [[2]]
#> # A tibble: 3 × 2
#> x y
#> <int> <chr>
#> 1 1 a
#> 2 2 b
#> 3 3 c
#>
#> [[3]]
#> # A tibble: 3 × 2
#> x y
#> <int> <chr>
#> 1 2 b
#> 2 3 c
#> 3 4 d
#>
#> [[4]]
#> NULL
Often, using slide() with its defaults will be enough, as it is common
to iterate over just one row at a time.
A nice use of a tibble is as a structured way to store parameter
combinations. For example, we could store multiple rows of parameter
combinations where each row could be supplied to runif() to generate
different types of uniform random variables.
parameters <- tibble(
  n = 1:3,
  min = c(0, 10, 100),
  max = c(1, 100, 1000)
)
parameters
#> # A tibble: 3 × 3
#> n min max
#> <int> <dbl> <dbl>
#> 1 1 0 1
#> 2 2 10 100
#> 3 3 100 1000
With slide() you can pass these parameters on to runif() by iterating
over parameters row-wise. This gives you access to the data frame of
the current row through .x. Because it is a data frame, you have access
to each column by name. Notice how there is no restriction that the
columns of the data frame be the same as the argument names of runif().
set.seed(123)
slide(parameters, ~runif(.x$n, .x$min, .x$max))
#> [[1]]
#> [1] 0.2875775
#>
#> [[2]]
#> [1] 80.94746 46.80792
#>
#> [[3]]
#> [1] 894.7157 946.4206 141.0008
This can also be done with purrr::pmap(), but you either have to name
the parameters tibble with the same column names as the function you
are calling, or you have to access each column positionally as ..1,
..3, etc.
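For comparison, here is a sketch of the two pmap() styles described above. It works by name only because the columns of parameters happen to be called n, min, and max, which match the arguments of runif(). The names by_name and by_position are illustrative.

```r
library(purrr)
library(tibble)

parameters <- tibble(
  n = 1:3,
  min = c(0, 10, 100),
  max = c(1, 100, 1000)
)

# Style 1: rely on the column names matching runif()'s argument names
set.seed(123)
by_name <- pmap(parameters, runif)

# Style 2: refer to each column by position inside a lambda
set.seed(123)
by_position <- pmap(parameters, ~runif(..1, ..2, ..3))

lengths(by_name)
lengths(by_position)
```

If the columns were named differently, only the positional style would work without renaming, which is the restriction that slide() avoids.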
A third alternative that works nicely here is to use rowwise() before
calling mutate(). Just remember to wrap the result of runif() in a
list()!
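A minimal sketch of the rowwise() approach might look like the following. The column name random and the trailing ungroup() are choices made for this illustration.

```r
library(dplyr, warn.conflicts = FALSE)

parameters <- tibble(
  n = 1:3,
  min = c(0, 10, 100),
  max = c(1, 100, 1000)
)

set.seed(123)
result <- parameters %>%
  rowwise() %>%
  # list() is needed because runif() returns a vector of varying length
  mutate(random = list(runif(n, min, max))) %>%
  ungroup()

# Each row holds n random draws
lengths(result$random)
```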
For these examples, we will consider a company data set containing the
day a sale was made, the number of calls, n_calls, that were placed on
that day, and the number of sales that resulted from those calls.
company <- tibble(
  day = rep(c(1, 2), each = 5),
  sales = sample(100, 10),
  n_calls = sales + sample(1000, 10)
)
company
#> # A tibble: 10 × 3
#> day sales n_calls
#> <dbl> <int> <int>
#> 1 1 25 544
#> 2 1 90 516
#> 3 1 91 740
#> 4 1 69 835
#> 5 1 98 309
#> 6 2 57 989
#> 7 2 92 682
#> 8 2 9 602
#> 9 2 72 627
#> 10 2 26 897
When slide()-ing inside of a mutate() call, there are a few scenarios
that can arise. First, you might want to slide over a single column.
This is easy enough in both the un-grouped and grouped case.
company %>%
  mutate(sales_roll = slide_dbl(sales, mean, .before = 2, .complete = TRUE))
#> # A tibble: 10 × 4
#> day sales n_calls sales_roll
#> <dbl> <int> <int> <dbl>
#> 1 1 25 544 NA
#> 2 1 90 516 NA
#> 3 1 91 740 68.7
#> 4 1 69 835 83.3
#> 5 1 98 309 86
#> 6 2 57 989 74.7
#> 7 2 92 682 82.3
#> 8 2 9 602 52.7
#> 9 2 72 627 57.7
#> 10 2 26 897 35.7
company %>%
  group_by(day) %>%
  mutate(sales_roll = slide_dbl(sales, mean, .before = 2, .complete = TRUE))
#> # A tibble: 10 × 4
#> # Groups: day [2]
#> day sales n_calls sales_roll
#> <dbl> <int> <int> <dbl>
#> 1 1 25 544 NA
#> 2 1 90 516 NA
#> 3 1 91 740 68.7
#> 4 1 69 835 83.3
#> 5 1 98 309 86
#> 6 2 57 989 NA
#> 7 2 92 682 NA
#> 8 2 9 602 52.7
#> 9 2 72 627 57.7
#> 10 2 26 897 35.7
If you need to apply a sliding function that takes a data frame as
input to slide over, then you’ll need some way to access the “current”
data frame that mutate() is acting on. As of dplyr 1.0.0, you can
access this with cur_data(). Note that cur_data() was deprecated in
dplyr 1.1.0 in favor of pick(), as the warning in the output further
down reports. When there is only 1 group, the current data frame is
the input itself, but when there are multiple groups cur_data()
returns the data frame corresponding to the current group that is
being worked on.
As an example, imagine you want to fit a rolling linear model
predicting sales from the number of calls. The most robust way to do
this in a mutate() is to use cur_data() to access the data frame to
slide over. Since slide() iterates row-wise, .x corresponds to the
current slice of the current data frame.
company %>%
  mutate(
    regressions = slide(
      .x = cur_data(),
      .f = ~lm(sales ~ n_calls, .x),
      .before = 2,
      .complete = TRUE
    )
  )
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `regressions = slide(...)`.
#> Caused by warning:
#> ! `cur_data()` was deprecated in dplyr 1.1.0.
#> ℹ Please use `pick()` instead.
#> # A tibble: 10 × 4
#> day sales n_calls regressions
#> <dbl> <int> <int> <list>
#> 1 1 25 544 <NULL>
#> 2 1 90 516 <NULL>
#> 3 1 91 740 <lm>
#> 4 1 69 835 <lm>
#> 5 1 98 309 <lm>
#> 6 2 57 989 <lm>
#> 7 2 92 682 <lm>
#> 8 2 9 602 <lm>
#> 9 2 72 627 <lm>
#> 10 2 26 897 <lm>
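The deprecation warning in the output above points to pick(). As a sketch, assuming dplyr 1.1.0 or later, the same rolling regression can be written with pick(everything()) in place of cur_data(). The company data is rebuilt here with an explicit seed so that the block is self-contained, so its values may differ from those printed above; the name rolled is illustrative.

```r
library(slider)
library(dplyr, warn.conflicts = FALSE)

set.seed(123)
company <- tibble(
  day = rep(c(1, 2), each = 5),
  sales = sample(100, 10),
  n_calls = sales + sample(1000, 10)
)

rolled <- company %>%
  mutate(
    regressions = slide(
      # pick(everything()) returns the current data frame as a tibble
      .x = pick(everything()),
      .f = ~lm(sales ~ n_calls, .x),
      .before = 2,
      .complete = TRUE
    )
  )

# The first two windows are incomplete, so they are NULL
vapply(rolled$regressions, is.null, logical(1))
```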
When you group by day, cur_data() will first correspond to all rows
where day == 1, and then where day == 2. Notice how the output has two
clumps of NULLs, proving that the rolling regressions “restarted”
between groups.
company %>%
  group_by(day) %>%
  mutate(
    regressions = slide(
      .x = cur_data(),
      .f = ~lm(sales ~ n_calls, .x),
      .before = 2,
      .complete = TRUE
    )
  )
#> # A tibble: 10 × 4
#> # Groups: day [2]
#> day sales n_calls regressions
#> <dbl> <int> <int> <list>
#> 1 1 25 544 <NULL>
#> 2 1 90 516 <NULL>
#> 3 1 91 740 <lm>
#> 4 1 69 835 <lm>
#> 5 1 98 309 <lm>
#> 6 2 57 989 <NULL>
#> 7 2 92 682 <NULL>
#> 8 2 9 602 <lm>
#> 9 2 72 627 <lm>
#> 10 2 26 897 <lm>
In the past, you might have used . in place of cur_data(). This . is
actually from the magrittr %>%, not from dplyr, and has a few issues.
The biggest one is that it won’t work with grouped data frames: it will
always return the entire data set rather than the current group’s data
frame. The other issue is that, even with un-grouped data frames, you
can’t take advantage of the sequential nature of how mutate() evaluates
expressions. For example, the following doesn’t work because .
corresponds to company without the updated log_sales column.
company %>%
  mutate(
    log_sales = log10(sales),
    regressions = slide(
      .x = .,
      .f = ~lm(log_sales ~ n_calls, .x),
      .before = 2,
      .complete = TRUE
    )
  )
#> Error in `mutate()`:
#> ℹ In argument: `regressions = slide(...)`.
#> Caused by error in `model.frame.default()`:
#> ! variable lengths differ (found for 'n_calls')
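By contrast, a data frame accessed from inside mutate() does see columns created earlier in the same call. The sketch below assumes dplyr 1.1.0 or later and uses pick(everything()), the replacement that the deprecation warning suggests for cur_data(). The data is regenerated with an explicit seed, and the name fixed is illustrative.

```r
library(slider)
library(dplyr, warn.conflicts = FALSE)

set.seed(123)
company <- tibble(
  day = rep(c(1, 2), each = 5),
  sales = sample(100, 10),
  n_calls = sales + sample(1000, 10)
)

fixed <- company %>%
  mutate(
    log_sales = log10(sales),
    regressions = slide(
      # Evaluated after log_sales is created, so the column is available
      .x = pick(everything()),
      .f = ~lm(log_sales ~ n_calls, .x),
      .before = 2,
      .complete = TRUE
    )
  )

# Columns used by the first complete window's model
names(fixed$regressions[[3]]$model)
```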