Fallback to dplyr

This article details the fallback mechanism in duckplyr, which allows support for all dplyr verbs and R functions.

library(conflicted)
library(dplyr)
conflict_prefer("filter", "dplyr")
#> [conflicted] Removing existing preference.
#> [conflicted] Will prefer dplyr::filter over any other package.

Introduction

The duckplyr package aims at providing a fully compatible drop-in replacement for dplyr. All operations, R functions, and data types that are supported by dplyr should work in an identical way with duckplyr. This is achieved in two ways:

DuckDB mode

The following operation is supported by duckplyr:

duckdb <-
  duckplyr::duckdb_tibble(a = 1:3) |>
  arrange(desc(a)) |>
  mutate(b = a + 1) |>
  select(-a)

The explain() function shows what happens under the hood:

duckdb |>
  explain()
#> ┌---------------------------┐
#> │         PROJECTION        │
#> │    --------------------   │
#> │             b             │
#> │                           │
#> │          ~3 Rows          │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │          ORDER_BY         │
#> │    --------------------   │
#> │      dataframe_42_42      │
#> │         577.a DESC        │
#> │      dataframe_42_42      │
#> │   42.___row_number ASC   │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │         PROJECTION        │
#> │    --------------------   │
#> │             #0            │
#> │             #1            │
#> │                           │
#> │          ~3 Rows          │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │      STREAMING_WINDOW     │
#> │    --------------------   │
#> │        Projections:       │
#> │    ROW_NUMBER() OVER ()   │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │     R_DATAFRAME_SCAN      │
#> │    --------------------   │
#> │      Text: data.frame     │
#> │       Projections: a      │
#> │                           │
#> │          ~3 Rows          │
#> └---------------------------┘

The plan shows three operations:

Each operation is supported by DuckDB. The resulting object contains a plan for the entire pipeline that is executed lazily, only when the data is needed.

Relation objects

DuckDB accepts a tree of interconnected relation objects as input. Each relation object represents a logical step of the execution plan. The duckplyr package translates dplyr verbs into relation objects.

The last_rel() function shows the last relation that has been materialized:

duckplyr::last_rel()
#> DuckDB Relation: 
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection ["year" as year, "month" as month, mean_inflight_delay as mean_inflight_delay, median_inflight_delay as median_inflight_delay]
#>   Order [___row_number ASC]
#>     Filter [("month" <= 6.0)]
#>       Projection ["year" as year, "month" as month, mean_inflight_delay as mean_inflight_delay, median_inflight_delay as median_inflight_delay, row_number() OVER () as ___row_number]
#>         Projection ["year" as year, "month" as month, mean_inflight_delay as mean_inflight_delay, median_inflight_delay as median_inflight_delay]
#>           Order [___row_number ASC]
#>             Aggregate ["year", "month", ___min_na(___row_number), mean(inflight_delay), median(inflight_delay)]
#>               Projection ["year" as year, "month" as month, "day" as day, dep_time as dep_time, sched_dep_time as sched_dep_time, dep_delay as dep_delay, arr_time as arr_time, sched_arr_time as sched_arr_time, arr_delay as arr_delay, carrier as carrier, flight as flight, tailnum as tailnum, origin as origin, dest as dest, air_time as air_time, distance as distance, "hour" as hour, "minute" as minute, time_hour as time_hour, inflight_delay as inflight_delay, row_number() OVER () as ___row_number]
#>                 Projection ["year" as year, "month" as month, "day" as day, dep_time as dep_time, sched_dep_time as sched_dep_time, dep_delay as dep_delay, arr_time as arr_time, sched_arr_time as sched_arr_time, arr_delay as arr_delay, carrier as carrier, flight as flight, tailnum as tailnum, origin as origin, dest as dest, air_time as air_time, distance as distance, "hour" as hour, "minute" as minute, time_hour as time_hour, "-"(arr_delay, dep_delay) as inflight_delay]
#>                   r_dataframe_scan(0xdeadbeef)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - year (INTEGER)
#> - month (INTEGER)
#> - mean_inflight_delay (DOUBLE)
#> - median_inflight_delay (DOUBLE)

It is NULL because nothing has been computed yet. Converting the object to a data frame triggers the computation:

duckdb |> collect()
#> # A tibble: 3 × 1
#>       b
#>   <dbl>
#> 1     4
#> 2     3
#> 3     2
duckplyr::last_rel()
#> DuckDB Relation: 
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection [b as b]
#>   Projection [a as a, "+"(a, 1.0) as b]
#>     Projection [a as a]
#>       Order [a DESC, ___row_number ASC]
#>         Projection [a as a, row_number() OVER () as ___row_number]
#>           r_dataframe_scan(0xdeadbeef)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - b (DOUBLE)

The last_rel() function now shows a relation that describes logical plan for executing the whole pipeline.

Help from dplyr

Using a custom function with a side effect is not supported by DuckDB and triggers a dplyr fallback:

verbose_plus_one <- function(x) {
  message("Adding one to ", paste(x, collapse = ", "))
  x + 1
}

fallback <-
  duckplyr::duckdb_tibble(a = 1:3) |>
  arrange(desc(a)) |>
  mutate(b = verbose_plus_one(a)) |>
  select(-a)
#> Adding one to 3, 2, 1

The verbose_plus_one() function is not supported by DuckDB, so the mutate() step is forwarded to dplyr and already executed (eagerly) when the pipeline is defined. This is confirmed by the last_rel() function:

duckplyr::last_rel()
#> DuckDB Relation: 
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection [a as a]
#>   Order [a DESC, ___row_number ASC]
#>     Projection [a as a, row_number() OVER () as ___row_number]
#>       r_dataframe_scan(0xdeadbeef)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - a (INTEGER)

Only the arrange() step is executed by DuckDB. Because the dplyr implementation of mutate() needs the data before it can proceed, the data is first converted to a data frame, and this triggers the materialization of the first step.

The explain() function also confirms indirectly that at least a part of the operation is handled by dplyr:

fallback |>
  explain()
#> ┌---------------------------┐
#> │     R_DATAFRAME_SCAN      │
#> │    --------------------   │
#> │      Text: data.frame     │
#> │       Projections: b      │
#> │                           │
#> │          ~3 Rows          │
#> └---------------------------┘

The final plan now only consists of a data frame scan. This is the result of the mutate() step, which at this stage already has been executed by dplyr.

Converting the final object to a data frame triggers the rest of the computation:

fallback |> collect()
#> # A tibble: 3 × 1
#>       b
#>   <dbl>
#> 1     4
#> 2     3
#> 3     2

duckplyr::last_rel()
#> DuckDB Relation: 
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection [b as b]
#>   r_dataframe_scan(0xdeadbeef)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - b (DOUBLE)

The last_rel() function confirms that only the final select() is handled by DuckDB again.

Enforce DuckDB operation

For any duck frame, one can control the automatic materialization. For fallbacks to dplyr, automatic materialization must be allowed for the duck frame at hand, as dplyr necessitates eager evaluation.

Therefore, by making a data frame stingy, one can ensure a pipeline will error when a fallback to dplyr would have normally happened. See vignette("prudence") for details.

By using operations supported by duckplyr and avoiding fallbacks as much as possible, your pipelines will be executed by DuckDB in an optimized way.

Configure fallbacks

Using the fallback_sitrep() and fallback_config() functions you can examine and change settings related to fallbacks.

See vignette("telemetry") for details.

Conclusion

The fallback mechanism in duckplyr allows for a seamless integration of dplyr verbs and R functions that are not supported by DuckDB. It is transparent to the user and only triggers when necessary. With small or medium-sized data sets, it will not even be noticeable in most settings.

See vignette("large") for techniques for working with large data, vignette("limits") for the currently implementated translations, vignette("prudence") for details on controlling fallback behavior, and vignette("telemetry") for the automatic reporting of fallback situations.