Unlike traditional data frames, duckplyr defers computation until absolutely necessary, allowing DuckDB to optimize execution. This article explains how to control the materialization of data to maintain a seamless dplyr-like experience while remaining cautious of memory usage.
library(conflicted)
library(dplyr)
conflict_prefer("filter", "dplyr")
#> [conflicted] Removing existing preference.
#> [conflicted] Will prefer dplyr::filter over any other package.
From a user’s perspective, data frames backed by duckplyr, with class "duckplyr_df", behave as regular data frames in almost all respects. In particular, direct column access like df$x, or retrieving the number of rows with nrow(), works identically. Conceptually, duckplyr frames are “eager”:
df <-
  duckplyr::duckdb_tibble(x = 1:3) |>
  mutate(y = x + 1)
df
#> # A duckplyr data frame: 2 variables
#> x y
#> <int> <dbl>
#> 1 1 2
#> 2 2 3
#> 3 3 4
class(df)
#> [1] "duckplyr_df" "tbl_df" "tbl" "data.frame"
df$y
#> [1] 2 3 4
nrow(df)
#> [1] 3
Under the hood, two key differences provide improved performance and usability: lazy evaluation, and control over automatic materialization (“prudence”, discussed below).

For a duckplyr frame that is the result of a dplyr operation, accessing column data or retrieving the number of rows triggers a computation that is carried out by DuckDB, not dplyr. In this sense, duckplyr frames are also “lazy”: the computation is deferred until the last possible moment, allowing DuckDB to optimize the whole pipeline.
This is illustrated by the following example, which computes arrival delay statistics for flights departing from Newark airport (EWR) by month:
flights <- duckplyr::flights_df()
flights_duckdb <-
  flights |>
  duckplyr::as_duckdb_tibble()

system.time(
  mean_arr_delay_ewr <-
    flights_duckdb |>
    filter(origin == "EWR", !is.na(arr_delay)) |>
    summarize(
      .by = month,
      mean_arr_delay = mean(arr_delay),
      min_arr_delay = min(arr_delay),
      max_arr_delay = max(arr_delay),
      median_arr_delay = median(arr_delay)
    )
)
#> user system elapsed
#> 0.006 0.000 0.006
Setting up the pipeline is fast: the size of the data does not affect the setup cost. Because the computation is deferred, DuckDB can optimize the whole pipeline, as can be seen in the output below:
mean_arr_delay_ewr |>
  explain()
#> ┌───────────────────────────┐
#> │         ORDER_BY          │
#> │   ────────────────────    │
#> │      dataframe_42_42      │
#> │   42.___row_number ASC    │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │        PROJECTION         │
#> │   ────────────────────    │
#> │           month           │
#> │       ___row_number       │
#> │      mean_arr_delay       │
#> │       min_arr_delay       │
#> │       max_arr_delay       │
#> │     median_arr_delay      │
#> │                           │
#> │        ~33677 Rows        │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │       HASH_GROUP_BY       │
#> │   ────────────────────    │
#> │        Groups: #0         │
#> │                           │
#> │        Aggregates:        │
#> │    sum_no_overflow(#1)    │
#> │          min(#2)          │
#> │    sum_no_overflow(#3)    │
#> │          avg(#4)          │
#> │          min(#5)          │
#> │          max(#6)          │
#> │     quantile_cont(#7)     │
#> │                           │
#> │        ~33677 Rows        │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │        PROJECTION         │
#> │   ────────────────────    │
#> │           month           │
#> │ CASE WHEN ((___row_number │
#> │ IS NULL)) THEN (1) ELSE 0 │
#> │            END            │
#> │       ___row_number       │
#> │ CASE WHEN ((arr_delay IS  │
#> │NULL)) THEN (1) ELSE 0 END │
#> │         arr_delay         │
#> │         arr_delay         │
#> │         arr_delay         │
#> │         arr_delay         │
#> │                           │
#> │        ~67355 Rows        │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │        PROJECTION         │
#> │   ────────────────────    │
#> │            #0             │
#> │            #1             │
#> │            #2             │
#> │                           │
#> │        ~67355 Rows        │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │     STREAMING_WINDOW      │
#> │   ────────────────────    │
#> │       Projections:        │
#> │   ROW_NUMBER() OVER ()    │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │         ORDER_BY          │
#> │   ────────────────────    │
#> │      dataframe_42_42      │
#> │   42.___row_number ASC    │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │        PROJECTION         │
#> │   ────────────────────    │
#> │            #0             │
#> │            #1             │
#> │            #3             │
#> │                           │
#> │        ~67355 Rows        │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │          FILTER           │
#> │   ────────────────────    │
#> │((NOT (arr_delay IS NULL)) │
#> │   AND (origin = 'EWR'))   │
#> │                           │
#> │        ~67355 Rows        │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │        PROJECTION         │
#> │   ────────────────────    │
#> │            #0             │
#> │            #1             │
#> │            #2             │
#> │            #3             │
#> │                           │
#> │       ~336776 Rows        │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │     STREAMING_WINDOW      │
#> │   ────────────────────    │
#> │       Projections:        │
#> │   ROW_NUMBER() OVER ()    │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │     R_DATAFRAME_SCAN      │
#> │   ────────────────────    │
#> │     Text: data.frame      │
#> │                           │
#> │       Projections:        │
#> │           month           │
#> │         arr_delay         │
#> │          origin           │
#> │                           │
#> │       ~336776 Rows        │
#> └───────────────────────────┘
The first step in the pipeline, the R_DATAFRAME_SCAN at the bottom of the plan, prunes the unneeded columns: only origin, month, and arr_delay are kept. The result becomes available when accessed, as sketched below.
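For example, accessing a column of the result triggers the deferred computation. A minimal sketch (the summary has one row per month, so the column has twelve values; the values themselves are omitted here):

# First access carries out the DuckDB computation
length(mean_arr_delay_ewr$mean_arr_delay)
#> [1] 12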
The functionality is similar to lazy tables in dbplyr and lazy frames in dtplyr. The behavior differs, however: at the time of writing, lazy tables and lazy frames are not data frames internally, and their columns cannot be accessed directly.
|          | Eager 😃 | Lazy 😴 |
|----------|----------|---------|
| dplyr    | ✅       |         |
| dbplyr   |          | ✅      |
| dtplyr   |          | ✅      |
| duckplyr | ✅       | ✅      |
In contrast, with dplyr, each intermediate step, as well as the final result, is a proper data frame that is computed right away, forfeiting the opportunity for optimization:
system.time(
  flights |>
    filter(origin == "EWR", !is.na(arr_delay)) |>
    summarize(
      .by = month,
      mean_arr_delay = mean(arr_delay),
      min_arr_delay = min(arr_delay),
      max_arr_delay = max(arr_delay),
      median_arr_delay = median(arr_delay)
    )
)
#> user system elapsed
#> 0.006 0.000 0.006
See also the “duckplyr: dplyr Powered by DuckDB” blog post for more information.
Being both “eager” and “lazy” at the same time introduces a challenge: it is too easy to accidentally trigger computation, which is prohibitive if an intermediate result is too large to fit into memory. Prudence is a setting for duckplyr frames that limits the size of the data that is materialized automatically. The term “prudence” is used to set a clear distinction from the concept of “laziness”, and because “control of automatic materialization” is a mouthful.

Three levels of prudence are available:
- Lavish: for lavish duckplyr frames, as in the two previous examples, the underlying DuckDB computation is carried out upon the first request. Once the results are computed, they are cached and subsequent requests are fast. This is a good choice for small to medium-sized data, where DuckDB can provide a nice speedup but materializing the data is affordable at any stage. This is the default for duckdb_tibble() and as_duckdb_tibble().
- Stingy: for stingy duckplyr frames, accessing a column or requesting the number of rows triggers an error. This is a good choice for large data sets where the cost of materializing the data may be prohibitive due to size or computation time, and the user wants to control when the computation is carried out and where the results are stored. Results can be materialized explicitly with collect() and other functions.
- Thrifty: thrifty duckplyr frames are a compromise between lavish and stingy, discussed further below.
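The level can also be set when a frame is created. A small sketch, assuming duckdb_tibble() accepts the same prudence argument as as_duckdb_tibble():

df_stingy <- duckplyr::duckdb_tibble(x = 1:3, prudence = "stingy")
df_thrifty <- duckplyr::duckdb_tibble(x = 1:3, prudence = "thrifty")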
Passing prudence = "stingy" to as_duckdb_tibble() creates a stingy duckplyr frame.
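For example, the flights_stingy frame used below can be created from the flights data loaded earlier:

flights_stingy <-
  flights |>
  duckplyr::as_duckdb_tibble(prudence = "stingy")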
The data can be displayed, and column names and types can be accessed.
flights_stingy
#> # A duckplyr data frame: 19 variables
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> 7 2013 1 1 555 600 -5 913 854
#> 8 2013 1 1 557 600 -3 709 723
#> 9 2013 1 1 557 600 -3 838 846
#> 10 2013 1 1 558 600 -2 753 745
#> # ℹ more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> # hour <dbl>, minute <dbl>, time_hour <dttm>
names(flights_stingy)[1:10]
#> [1] "year" "month" "day" "dep_time"
#> [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
#> [9] "arr_delay" "carrier"
class(flights_stingy)
#> [1] "prudent_duckplyr_df" "duckplyr_df" "tbl_df"
#> [4] "tbl" "data.frame"
class(flights_stingy[[1]])
#> [1] "integer"
On the other hand, accessing a column or requesting the number of rows triggers an error:
nrow(flights_stingy)
#> Error: Materialization is disabled, use collect() or as_tibble() to materialize.
flights_stingy[[1]]
#> Error: Materialization is disabled, use collect() or as_tibble() to materialize.
This means that stingy duckplyr frames can also be used to ensure that a pipeline is executed entirely by DuckDB.
For operations not supported by duckplyr, the original dplyr implementation is used as a fallback. As the original dplyr implementation accesses columns directly, the data must be materialized before a fallback can be executed. Therefore, stingy frames allow you to check that all operations are supported by DuckDB: for a stingy frame, fallbacks to dplyr are not possible.
flights_stingy |>
  group_by(origin) |>
  summarize(n = n()) |>
  ungroup()
#> Error in `group_by()`:
#> ! This operation cannot be carried out by DuckDB, and the input is a
#> stingy duckplyr frame.
#> • Try `summarise(.by = ...)` or `mutate(.by = ...)` instead of `group_by()` and
#> `ungroup()`.
#> ℹ Use `compute(prudence = "lavish")` to materialize to temporary storage and
#> continue with duckplyr.
#> ℹ See `vignette("prudence")` for other options.
The same pipeline with a lavish frame works, but the computation is carried out by dplyr:
flights_stingy |>
  duckplyr::as_duckdb_tibble(prudence = "lavish") |>
  group_by(origin) |>
  summarize(n = n()) |>
  ungroup()
#> # A tibble: 3 × 2
#> origin n
#> <chr> <int>
#> 1 EWR 120835
#> 2 JFK 111279
#> 3 LGA 104662
By using operations supported by duckplyr and avoiding fallbacks as much as possible, your pipelines will be executed by DuckDB in an optimized way.
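For example, the pipeline above stays in DuckDB, even for a stingy frame, when rewritten with .by as the error message suggests (a sketch; the result is assigned, so nothing is materialized):

origin_counts <-
  flights_stingy |>
  summarize(.by = origin, n = n())

Unlike the group_by() version, no fallback to dplyr is needed, so no error is raised.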
A stingy duckplyr frame can be converted to a lavish one with as_duckdb_tibble(prudence = "lavish"). The collect.duckplyr_df() method triggers computation and converts to a plain tibble. The difference between the two is the class of the returned object:
flights_stingy |>
  duckplyr::as_duckdb_tibble(prudence = "lavish") |>
  class()
#> [1] "duckplyr_df" "tbl_df" "tbl" "data.frame"
flights_stingy |>
  collect() |>
  class()
#> [1] "tbl_df" "tbl" "data.frame"
The same behavior is achieved with as_tibble() and as.data.frame(), as sketched below.
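Only the classes of the return values differ; a minimal sketch (as_tibble() is re-exported by dplyr):

flights_stingy |>
  as_tibble() |>
  class()
#> [1] "tbl_df" "tbl" "data.frame"

flights_stingy |>
  as.data.frame() |>
  class()
#> [1] "data.frame"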
Stingy duckplyr frames behave like lazy tables in dbplyr and lazy frames in dtplyr: the computation only starts when you explicitly request it with collect.duckplyr_df() or through other means. However, stingy duckplyr frames can be converted to lavish ones at any time, and vice versa. In dtplyr and dbplyr, there are no lavish frames: collection always needs to be explicit.
Thrifty is a compromise between stingy and lavish. Materialization is allowed for data up to a certain size, measured in cells (values) and rows in the resulting data frame.
nrow(flights)
#> [1] 336776
flights_partial <-
  flights |>
  duckplyr::as_duckdb_tibble(prudence = "thrifty")
With this setting, the data is materialized only if the result has fewer than 1,000,000 cells (rows multiplied by columns).
flights_partial |>
  select(origin, dest, dep_delay, arr_delay) |>
  nrow()
#> Error: Materialization would result in more than 250000 rows. Use collect() or as_tibble() to materialize.
The original input is too large to be materialized, so the operation fails: with four columns selected, the limit of 1,000,000 cells corresponds to 250,000 rows, fewer than the 336,776 rows in the data. On the other hand, a result after aggregation is small enough to be materialized, as sketched below.
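This sketch assumes, as in nycflights13, that all twelve months are present in the data, so the aggregated result has only 12 rows:

flights_partial |>
  filter(!is.na(arr_delay)) |>
  summarize(.by = month, mean_arr_delay = mean(arr_delay)) |>
  nrow()
#> [1] 12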
Thrifty is a good choice for data sets where the cost of materializing the data is prohibitive only for large results. This is the default for ingestion functions like read_parquet_duckdb().
In summary, the duckplyr package provides data frames that are eager from the user’s perspective but computed lazily by DuckDB under the hood. Because automatic materialization can be dangerous for memory with large data, duckplyr provides a setting called prudence that controls automatic materialization: the data is materialized automatically always (“lavish” frames), never (“stingy” frames), or only up to a certain size (“thrifty” frames).
See vignette("large") for more details on working with large data sets, vignette("fallback") for fallbacks to dplyr, and vignette("limits") for the operations supported by duckplyr.