Selective use of duckplyr

This vignette demonstrates how to use duckplyr selectively, for individual data frames or for other packages.

library(conflicted)
library(dplyr)
conflict_prefer("filter", "dplyr")
#> [conflicted] Will prefer dplyr::filter over any other package.

Introduction

The default behavior of duckplyr is to enable itself for all data frames in the session. This happens when the package is attached with library(duckplyr), or by calling methods_overwrite(). To enable duckplyr for individual data frames instead of session-wide, it is sufficient to prefix all calls to duckplyr functions with duckplyr:: and not attach the package. Alternatively, methods_restore() can be called to undo the session-wide overwrite after library(duckplyr).

External data with explicit qualification

The following example uses duckplyr::as_duckdb_tibble() to convert a data frame to a duckplyr frame and to enable duckplyr operation.

lazy <-
  duckplyr::flights_df() |>
  duckplyr::as_duckdb_tibble() |>
  mutate(inflight_delay = arr_delay - dep_delay) |>
  summarize(
    .by = c(year, month),
    mean_inflight_delay = mean(inflight_delay, na.rm = TRUE),
    median_inflight_delay = median(inflight_delay, na.rm = TRUE),
  ) |>
  filter(month <= 6)

The result is a tibble, with its own class.

class(lazy)
#> [1] "duckplyr_df" "tbl_df"      "tbl"         "data.frame"

names(lazy)
#> [1] "year"                  "month"                 "mean_inflight_delay"  
#> [4] "median_inflight_delay"

DuckDB is responsible for eventually carrying out the operations. Despite the filter coming very late in the pipeline, it is applied to the raw data.

lazy |>
  explain()
#> ┌---------------------------┐
#> │          ORDER_BY         │
#> │    --------------------   │
#> │      dataframe_42_42      │
#> │     42.___row_number ASC    │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │           FILTER          │
#> │    --------------------   │
#> │ (CAST(month AS DOUBLE) <= │
#> │            6.0)           │
#> │                           │
#> │        ~33677 Rows        │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │         PROJECTION        │
#> │    --------------------   │
#> │             #0            │
#> │             #1            │
#> │             #2            │
#> │             #3            │
#> │             #4            │
#> │                           │
#> │        ~168388 Rows       │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │      STREAMING_WINDOW     │
#> │    --------------------   │
#> │        Projections:       │
#> │    ROW_NUMBER() OVER ()   │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │          ORDER_BY         │
#> │    --------------------   │
#> │      dataframe_42_42      │
#> │     42.___row_number ASC    │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │         PROJECTION        │
#> │    --------------------   │
#> │            year           │
#> │           month           │
#> │       ___row_number       │
#> │    mean_inflight_delay    │
#> │   median_inflight_delay   │
#> │                           │
#> │        ~168388 Rows       │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │       HASH_GROUP_BY       │
#> │    --------------------   │
#> │          Groups:          │
#> │             #0            │
#> │             #1            │
#> │                           │
#> │        Aggregates:        │
#> │    sum_no_overflow(#2)    │
#> │          min(#3)          │
#> │          mean(#4)         │
#> │         median(#5)        │
#> │                           │
#> │        ~168388 Rows       │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │         PROJECTION        │
#> │    --------------------   │
#> │            year           │
#> │           month           │
#> │ CASE  WHEN ((___row_number│
#> │  IS NULL)) THEN (1) ELSE 0│
#> │             END
#> │       ___row_number       │
#> │       inflight_delay      │
#> │       inflight_delay      │
#> │                           │
#> │        ~336776 Rows       │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │         PROJECTION        │
#> │    --------------------   │
#> │             #0            │
#> │             #1            │
#> │             #2            │
#> │             #3            │
#> │                           │
#> │        ~336776 Rows       │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │      STREAMING_WINDOW     │
#> │    --------------------   │
#> │        Projections:       │
#> │    ROW_NUMBER() OVER ()   │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │         PROJECTION        │
#> │    --------------------   │
#> │            year           │
#> │           month           │
#> │       inflight_delay      │
#> │                           │
#> │        ~336776 Rows       │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │     R_DATAFRAME_SCAN      │
#> │    --------------------   │
#> │      Text: data.frame     │
#> │                           │
#> │        Projections:       │
#> │            year           │
#> │           month           │
#> │         dep_delay         │
#> │         arr_delay         │
#> │                           │
#> │        ~336776 Rows       │
#> └---------------------------┘

All data frame operations are supported. Computation happens upon the first request.

lazy$mean_inflight_delay
#> [1] -3.855519 -5.147220 -7.356713 -2.673124 -9.370201 -4.244284

After the computation has been carried out, the results are preserved and available immediately:

lazy
#> # A duckplyr data frame: 4 variables
#>    year month mean_inflight_delay median_inflight_delay
#>   <int> <int>               <dbl>                 <dbl>
#> 1  2013     1               -3.86                    -5
#> 2  2013     2               -5.15                    -6
#> 3  2013     3               -7.36                    -9
#> 4  2013     4               -2.67                    -5
#> 5  2013     5               -9.37                   -10
#> 6  2013     6               -4.24                    -7

Restoring dplyr methods

The same can be achieved by calling methods_restore() after library(duckplyr).

library(duckplyr)
#> ✔ Overwriting dplyr methods with duckplyr methods.
#> ℹ Turn off with `duckplyr::methods_restore()`.

methods_restore()
#> ℹ Restoring dplyr methods.

If the input is a plain data frame, duckplyr is not involved.

flights_df() |>
  mutate(inflight_delay = arr_delay - dep_delay) |>
  explain()
#> Error in UseMethod("explain"): no applicable method for 'explain' applied to an object of class "c('tbl_df', 'tbl', 'data.frame')"

Own data

Construct duckplyr frames directly with duckdb_tibble():

data <- duckdb_tibble(
  x = 1:3,
  y = 5,
  z = letters[1:3]
)
data
#> # A duckplyr data frame: 3 variables
#>       x     y z    
#>   <int> <dbl> <chr>
#> 1     1     5 a    
#> 2     2     5 b    
#> 3     3     5 c

In other packages

Like other dependencies, duckplyr must be declared in the DESCRIPTION file and optionally imported in the NAMESPACE file. Because duckplyr does not import dplyr, it is necessary to import both packages. The recipe below shows how to achieve this with the usethis package.

Learn more about prudence in vignette("prudence"), about fallbacks to dplyr in vignette("fallback"), and about the translation employed by duckplyr in vignette("limits"), and about the usethis package at https://usethis.r-lib.org/.