This article details the fallback mechanism in duckplyr, which allows support for all dplyr verbs and R functions.
library(conflicted)
library(dplyr)
conflict_prefer("filter", "dplyr")
#> [conflicted] Removing existing preference.
#> [conflicted] Will prefer dplyr::filter over any other package.
The duckplyr package aims at providing a fully compatible drop-in replacement for dplyr. All operations, R functions, and data types that are supported by dplyr should work in an identical way with duckplyr. This is achieved in two ways:
The following operation is supported by duckplyr:
The explain()
function shows what happens under the
hood:
duckdb |>
explain()
#> ┌---------------------------┐
#> │ PROJECTION │
#> │ -------------------- │
#> │ b │
#> │ │
#> │ ~3 Rows │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │ ORDER_BY │
#> │ -------------------- │
#> │ dataframe_42_42 │
#> │ 577.a DESC │
#> │ dataframe_42_42 │
#> │ 42.___row_number ASC │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │ PROJECTION │
#> │ -------------------- │
#> │ #0 │
#> │ #1 │
#> │ │
#> │ ~3 Rows │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │ STREAMING_WINDOW │
#> │ -------------------- │
#> │ Projections: │
#> │ ROW_NUMBER() OVER () │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │ R_DATAFRAME_SCAN │
#> │ -------------------- │
#> │ Text: data.frame │
#> │ Projections: a │
#> │ │
#> │ ~3 Rows │
#> └---------------------------┘
The plan shows three operations:
b
column and removing the
a
column).Each operation is supported by DuckDB. The resulting object contains a plan for the entire pipeline that is executed lazily, only when the data is needed.
DuckDB accepts a tree of interconnected relation objects as input. Each relation object represents a logical step of the execution plan. The duckplyr package translates dplyr verbs into relation objects.
The last_rel()
function shows the last relation that has
been materialized:
duckplyr::last_rel()
#> DuckDB Relation:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection ["year" as year, "month" as month, mean_inflight_delay as mean_inflight_delay, median_inflight_delay as median_inflight_delay]
#> Order [___row_number ASC]
#> Filter [("month" <= 6.0)]
#> Projection ["year" as year, "month" as month, mean_inflight_delay as mean_inflight_delay, median_inflight_delay as median_inflight_delay, row_number() OVER () as ___row_number]
#> Projection ["year" as year, "month" as month, mean_inflight_delay as mean_inflight_delay, median_inflight_delay as median_inflight_delay]
#> Order [___row_number ASC]
#> Aggregate ["year", "month", ___min_na(___row_number), mean(inflight_delay), median(inflight_delay)]
#> Projection ["year" as year, "month" as month, "day" as day, dep_time as dep_time, sched_dep_time as sched_dep_time, dep_delay as dep_delay, arr_time as arr_time, sched_arr_time as sched_arr_time, arr_delay as arr_delay, carrier as carrier, flight as flight, tailnum as tailnum, origin as origin, dest as dest, air_time as air_time, distance as distance, "hour" as hour, "minute" as minute, time_hour as time_hour, inflight_delay as inflight_delay, row_number() OVER () as ___row_number]
#> Projection ["year" as year, "month" as month, "day" as day, dep_time as dep_time, sched_dep_time as sched_dep_time, dep_delay as dep_delay, arr_time as arr_time, sched_arr_time as sched_arr_time, arr_delay as arr_delay, carrier as carrier, flight as flight, tailnum as tailnum, origin as origin, dest as dest, air_time as air_time, distance as distance, "hour" as hour, "minute" as minute, time_hour as time_hour, "-"(arr_delay, dep_delay) as inflight_delay]
#> r_dataframe_scan(0xdeadbeef)
#>
#> ---------------------
#> -- Result Columns --
#> ---------------------
#> - year (INTEGER)
#> - month (INTEGER)
#> - mean_inflight_delay (DOUBLE)
#> - median_inflight_delay (DOUBLE)
It is NULL
because nothing has been computed yet.
Converting the object to a data frame triggers the computation:
duckdb |> collect()
#> # A tibble: 3 × 1
#> b
#> <dbl>
#> 1 4
#> 2 3
#> 3 2
duckplyr::last_rel()
#> DuckDB Relation:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection [b as b]
#> Projection [a as a, "+"(a, 1.0) as b]
#> Projection [a as a]
#> Order [a DESC, ___row_number ASC]
#> Projection [a as a, row_number() OVER () as ___row_number]
#> r_dataframe_scan(0xdeadbeef)
#>
#> ---------------------
#> -- Result Columns --
#> ---------------------
#> - b (DOUBLE)
The last_rel()
function now shows a relation that
describes logical plan for executing the whole pipeline.
Using a custom function with a side effect is not supported by DuckDB and triggers a dplyr fallback:
verbose_plus_one <- function(x) {
message("Adding one to ", paste(x, collapse = ", "))
x + 1
}
fallback <-
duckplyr::duckdb_tibble(a = 1:3) |>
arrange(desc(a)) |>
mutate(b = verbose_plus_one(a)) |>
select(-a)
#> Adding one to 3, 2, 1
The verbose_plus_one()
function is not supported by
DuckDB, so the mutate()
step is forwarded to dplyr and
already executed (eagerly) when the pipeline is defined. This is
confirmed by the last_rel()
function:
duckplyr::last_rel()
#> DuckDB Relation:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection [a as a]
#> Order [a DESC, ___row_number ASC]
#> Projection [a as a, row_number() OVER () as ___row_number]
#> r_dataframe_scan(0xdeadbeef)
#>
#> ---------------------
#> -- Result Columns --
#> ---------------------
#> - a (INTEGER)
Only the arrange()
step is executed by DuckDB. Because
the dplyr implementation of mutate()
needs the data before
it can proceed, the data is first converted to a data frame, and this
triggers the materialization of the first step.
The explain()
function also confirms indirectly that at
least a part of the operation is handled by dplyr:
fallback |>
explain()
#> ┌---------------------------┐
#> │ R_DATAFRAME_SCAN │
#> │ -------------------- │
#> │ Text: data.frame │
#> │ Projections: b │
#> │ │
#> │ ~3 Rows │
#> └---------------------------┘
The final plan now only consists of a data frame scan. This is the
result of the mutate()
step, which at this stage already
has been executed by dplyr.
Converting the final object to a data frame triggers the rest of the computation:
fallback |> collect()
#> # A tibble: 3 × 1
#> b
#> <dbl>
#> 1 4
#> 2 3
#> 3 2
duckplyr::last_rel()
#> DuckDB Relation:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection [b as b]
#> r_dataframe_scan(0xdeadbeef)
#>
#> ---------------------
#> -- Result Columns --
#> ---------------------
#> - b (DOUBLE)
The last_rel()
function confirms that only the final
select()
is handled by DuckDB again.
For any duck frame, one can control the automatic materialization. For fallbacks to dplyr, automatic materialization must be allowed for the duck frame at hand, as dplyr necessitates eager evaluation.
Therefore, by making a data frame stingy, one can ensure a pipeline
will error when a fallback to dplyr would have normally happened. See
vignette("prudence")
for details.
By using operations supported by duckplyr and avoiding fallbacks as much as possible, your pipelines will be executed by DuckDB in an optimized way.
Using the fallback_sitrep()
and
fallback_config()
functions you can examine and change
settings related to fallbacks.
You can choose to make fallbacks verbose with
fallback_config(info = TRUE)
.
You can change settings related to logging and reporting fallback to duckplyr development team to inform their work.
See vignette("telemetry")
for details.
The fallback mechanism in duckplyr allows for a seamless integration of dplyr verbs and R functions that are not supported by DuckDB. It is transparent to the user and only triggers when necessary. With small or medium-sized data sets, it will not even be noticeable in most settings.
See vignette("large")
for techniques for working with
large data, vignette("limits")
for the currently
implementated translations, vignette("prudence")
for
details on controlling fallback behavior, and
vignette("telemetry")
for the automatic reporting of
fallback situations.