Improved support for handling large data from files and S3: ingestion with read_parquet_duckdb() and others, and materialization with as_duckdb_tibble(), compute.duckplyr_df() and compute_file(). See vignette("large") for details.
Control automatic materialization of duckplyr frames with the new prudence argument to as_duckdb_tibble(), duckdb_tibble(), compute.duckplyr_df() and compute_file(). See vignette("prudence") for details.
read_csv_duckdb() and others, deprecating duckplyr_df_from_csv() and df_from_csv() (#210, #396, #459).
read_sql_duckdb() (experimental) to run SQL queries against the default DuckDB connection and return the result as a duckplyr frame (duckdb/duckdb-r#32, #397).
db_exec() to execute configuration queries against the default DuckDB connection (#39, #165, #227, #404, #459).
duckdb_tibble() (#382, #457).
as_duckdb_tibble() replaces as_duckplyr_tibble() and as_duckplyr_df() (#383, #457) and supports dbplyr connections to a DuckDB database (#86, #211, #226).
compute_parquet() and compute_csv(); implement compute.duckplyr_df() (#409, #430).
fallback_config() to create a configuration file for the settings that do not affect behavior (#216, #426).
is_duckdb_tibble() deprecates is_duckplyr_df() (#391, #392).
last_rel() to retrieve the last relation object used in materialization (#209, #375).
Add "prudent_duckplyr_df" class that stops automatic materialization and requires collect() (#381, #390).
Partial support for across() in mutate() and summarise() (#296, #306, #318, @lionel-, @DavisVaughan).
Implement na.rm handling for sum(), min(), max(), any() and all(), with fallback for window functions (#205, #566).
Add support for sub() and gsub() (@toppyy, #420).
Handle dplyr::desc() (#550).
Avoid forwarding is.na() to is.nan() to support non-numeric data, and avoid checking roundtrip for timestamp data (#482).
Correctly handle missing values in if_else().
Limit number of items that can be handled with %in% (#319).
duckdb_tibble() checks if columns can be represented in DuckDB (#537).
Fall back to dplyr when passing the multiple argument to joins (#323).
Improve fallback error message by explicitly materializing (#432, #456).
Point to the native CSV reader if encountering data frames read with readr (#127, #469).
Improve as_duckdb_tibble() error message for invalid x (@maelle, #339).
Depend on dplyr instead of reexporting all generics (#405). Nothing changes for users in scripts. When using duckplyr in a package, you now also need to import dplyr.
Fallback logging is now on by default and can be disabled with configuration (#422).
The default DuckDB connection is now based on a file; the location defaults to a subdirectory of tempdir() and can be controlled with the DUCKPLYR_TEMP_DIR environment variable (#439, #448, #561).
collect() returns a tibble (#438, #447).
explain() returns the input, invisibly (#331).
Compute ptype only for join columns in a safe way without materialization, not for the entire data frame (#289).
Internal expr_scrub() (used for telemetry) can handle function definitions (@toppyy, #268, #271).
Harden telemetry code against invalid arguments (#321).
New articles: vignette("large"), vignette("prudence"), vignette("fallback"), vignette("limits"), vignette("developers"), vignette("telemetry") (#207, #504).
New flights_df() used instead of palmerpenguins::penguins (#408).
Move to the tidyverse GitHub organization, new repository URL https://github.com/tidyverse/duckplyr/ (#225).
Avoid base pipe in examples for compatibility with R 4.0.0 (#463, #466).
Comparison expressions are translated in a way that allows them to be pushed down to Parquet (@toppyy, #270).
Printing a duckplyr frame no longer materializes (#255, #378).
Prefer vctrs::new_data_frame() over tibble() (#500).
df_from_file() and related functions support multiple files (#194, #195), show a clear error message for non-string path arguments (#182), and create a tibble by default (#177).
as_duckplyr_tibble() to convert a data frame to a duckplyr tibble (#177).
?df_from_file shows how to read multiple files (#181, #186) and how to specify CSV column types (#140, #189), and is shown correctly in reference index (#173, #190).
as.integer(), NA and %in% (#83, #154, #148, #155, #159, #160).
library(duckplyr) calls methods_overwrite() (#164).
grepl().
intersect(), setdiff(), symdiff(), union(), and union_all() (#169).
NA and those used in an expression (#157).
head(-1) forwards to the default implementation (#131, #156).
left_join() and other join functions call auto_copy().
row_number() returns integer.
is.na(NaN) is TRUE.
summarise(count = n(), count = n()) creates only one column named count.
?df_from_file (@andreranza, #133, #134).
vec_ptype() does not materialize (#149).
expect_identical() to capture differences between doubles and integers.
df_to_parquet() to write to Parquet, new convenience functions df_from_csv(), duckdb_df_from_csv(), df_from_parquet() and duckdb_df_from_parquet() (#87, #89, #96, #128).
summarise() (#72, #106).
summarise() no longer restores subclass.
log10() and log().
fallback_sitrep() and related functionality for collecting telemetry data (#102, #107, #110, #111, #115). No data is collected by default, only a message is displayed once per session and then every eight hours. Opt in or opt out by setting environment variables.
group_by() and other methods to collect fallback information (#94, #104, #105).
suppressWarnings() as the identity function.
cli::cli_abort() over stop() or rlang::abort() (#114).
.data$a and .env$a.
integer, numeric, logical, Date, POSIXct, and difftime for now.
DUCKPLYR_METHODS_OVERWRITE is set to TRUE, loading duckplyr automatically calls methods_overwrite().
log() and log10().
methods_overwrite() and methods_restore() show a message.
grepl(x = NA) gives correct results.
auto_copy() for non-data-frame input.
distinct() now preserves order in corner cases (#77, #78).
log(0) and log(-1) (#75, #76).
mutate() that are actually representable in duckdb (#73).
ifelse(), support if_else() (#79).
dplyr_reconstruct() method (#48).
meta_replay().
arrange() in case of ties.
slice_sample(), not sample_n() or sample_frac() (#74).
IS NOT DISTINCT FROM for faster execution (duckdb/duckdb-r#41, #68).
summarise() keeps "duckplyr_df" class (#63, #64).
Fix compatibility with duckdb >= 0.9.1.
Skip tests that give different output on dev tidyselect.
Import utils::globalVariables().
Small README improvements (@maelle, #34, #57).
Fix 301 in README.
Improve documentation.
Work around problem with dplyr_reconstruct() in R 4.3.
Rename duckdb_from_file() to df_from_file().
Unexport private duckdb_rel_from_df(), rel_from_df(), wrap_df() and wrap_integer().
Reexport %>% and tibble().
R CMD check.
relexpr_window() for now.
Initial version, exporting:
- new_relational() to construct objects of class "relational"
- Generics rel_aggregate(), rel_distinct(), rel_filter(), rel_join(), rel_limit(), rel_names(), rel_order(), rel_project(), rel_set_diff(), rel_set_intersect(), rel_set_symdiff(), rel_to_df(), rel_union_all()
- new_relexpr() to construct objects of class "relational_relexpr"
- Expression builders relexpr_constant(), relexpr_function(), relexpr_reference(), relexpr_set_alias(), relexpr_window()