Many data sets of Brazilian censuses are too big to load in users’
RAM memory. To avoid this problem, {censobr} works with
files saved in .parquet
format and, by default, the
functions in {censobr} returns an Arrow
table rather than a data.frame
. There are a few really
simple alternative ways to work with Arrow tables in R without loading
the full data to memory. We cover four alternative approaches in this
vignette.
First, let’s read the 2010 mortality data, which we’ll use throughout vignette for illustration purposes.
library(censobr)
# read 2010 mortality data
df <- censobr::read_mortality(
year = 2010,
add_labels = 'pt',
showProgress = FALSE
)
{dplyr}
Because of the seamless integration between {arrow}
and
{dplyr}
, Arrow tables can be analyzed pretty much like a
regular data.frame
using the {dplyr}
syntax.
There is a small but important difference, though. When using {dplyr}
with an Arrow table, the operations are not executed immediately.
Instead, {dplyr} builds a lazy query plan that is only evaluated when
you explicitly ask for the results. To retrieve the actual results, you
need to call either:
collect()
: This brings the results into memory as a
regular data.frame
.compute()
: This materializes the result in the Arrow
format (e.g., as a new Arrow table).Without calling one of these, the query is just prepared but not executed, which is useful for delaying heavy computations until needed.
In the example below, we create a new Arrow table that only includes
the deaths records of men in the state of Rio de Janeiro without loading
the data to memory. Note that we only piece of data we
collect()
(i.e. load to memory) here are the first
observations of the data.
library(dplyr)
# Filter deaths of men in the state of Rio de Janeiro
rio <- df |>
filter(V0704 == 'Masculino' & abbrev_state == 'RJ')
head(rio) |>
collect()
#> code_muni code_state abbrev_state name_state code_region name_region
#> <int> <int> <char> <char> <int> <char>
#> 1: 3300100 33 RJ Rio de Janeiro 3 Sudeste
#> 2: 3300100 33 RJ Rio de Janeiro 3 Sudeste
#> 3: 3300100 33 RJ Rio de Janeiro 3 Sudeste
#> 4: 3300100 33 RJ Rio de Janeiro 3 Sudeste
#> 5: 3300100 33 RJ Rio de Janeiro 3 Sudeste
#> 6: 3300100 33 RJ Rio de Janeiro 3 Sudeste
#> code_weighting V0001 V0002 V0011 V0300 V0010 V1001 V1002
#> <char> <char> <char> <char> <num> <num> <char> <char>
#> 1: 3300100003001 33 00100 3300100003001 1285657 13.51819 3 05
#> 2: 3300100003001 33 00100 3300100003001 6084726 12.76395 3 05
#> 3: 3300100003002 33 00100 3300100003002 898500 10.68108 3 05
#> 4: 3300100003002 33 00100 3300100003002 2581116 11.61204 3 05
#> 5: 3300100003002 33 00100 3300100003002 3768175 12.93833 3 05
#> 6: 3300100003002 33 00100 3300100003002 4125020 10.66262 3 05
#> V1003 V1004 V1006 V0703 V0704 V7051 V7052 M0703 M0704
#> <char> <char> <char> <char> <char> <num> <num> <char> <char>
#> 1: 013 00 Urbana Março de 2010 Masculino 69 NA 2 2
#> 2: 013 00 Urbana Fevereiro de 2010 Masculino 84 NA 2 2
#> 3: 013 00 Urbana Abril de 2010 Masculino 38 NA 2 2
#> 4: 013 00 Urbana Maio de 2010 Masculino 54 NA 2 2
#> 5: 013 00 Rural Agosto de 2009 Masculino 31 NA 2 2
#> 6: 013 00 Urbana Setembro de 2009 Masculino 28 NA 2 2
#> M7051 M7052 V1005
#> <char> <char> <char>
#> 1: 2 2 Área urbanizada
#> 2: 2 2 Área não urbanizada
#> 3: 2 2 Área urbanizada
#> 4: 2 2 Área urbanizada
#> 5: 2 2 Área rural exclusive aglomerado rural
#> 6: 2 2 Área urbanizada
{duckdb}
{duckdb}
is another powerful library to work with
larger-than-memory data in R through database interface. There are
different ways to use {duckdb}
, but here cover three
alternatives
{duckdb}
& {dplyr}
One easy option is to combine {duckdb}
&
{dplyr}
. Note here that first you need to convert the Arrow
table into a DuckDB table with arrow::to_duckdb()
. Also
note that the you need to use a bit of SQL
syntax inside
the dplyr call. Using the same example as above:
library(duckdb)
library(dplyr)
library(arrow)
# Filter deaths of men in the state of Rio de Janeiro
rio1 <- df |>
arrow::to_duckdb() |>
filter(sql("V0704 LIKE '%Masculino%' AND abbrev_state = 'RJ'"))
head(rio1) |>
collect()
#> # A tibble: 6 × 26
#> code_muni code_state abbrev_state name_state code_region name_region
#> <int> <int> <chr> <chr> <int> <chr>
#> 1 3300100 33 RJ Rio de Janeiro 3 Sudeste
#> 2 3300100 33 RJ Rio de Janeiro 3 Sudeste
#> 3 3300100 33 RJ Rio de Janeiro 3 Sudeste
#> 4 3300100 33 RJ Rio de Janeiro 3 Sudeste
#> 5 3300100 33 RJ Rio de Janeiro 3 Sudeste
#> 6 3300100 33 RJ Rio de Janeiro 3 Sudeste
#> # ℹ 20 more variables: code_weighting <chr>, V0001 <chr>, V0002 <chr>,
#> # V0011 <chr>, V0300 <dbl>, V0010 <dbl>, V1001 <chr>, V1002 <chr>,
#> # V1003 <chr>, V1004 <chr>, V1006 <chr>, V0703 <chr>, V0704 <chr>,
#> # V7051 <dbl>, V7052 <dbl>, M0703 <chr>, M0704 <chr>, M7051 <chr>,
#> # M7052 <chr>, V1005 <chr>
{duckdb}
& SQL
Another alternative is to combine {duckdb}
with
{DBI}
using database interface and SQL
syntax.
library(duckdb)
library(DBI)
# create databse connection
con <- duckdb::dbConnect(duckdb::duckdb())
# register the data in the data base
duckdb::duckdb_register_arrow(con, 'mortality_2010_tbl', df)
# Filter deaths of men in the state of Rio de Janeiro
rio2 <- DBI::dbGetQuery(con, "SELECT * FROM 'mortality_2010_tbl' WHERE V0704 LIKE '%Masculino%' AND abbrev_state = 'RJ'")
head(rio2)
#> code_muni code_state abbrev_state name_state code_region name_region
#> 1 3300100 33 RJ Rio de Janeiro 3 Sudeste
#> 2 3300100 33 RJ Rio de Janeiro 3 Sudeste
#> 3 3300100 33 RJ Rio de Janeiro 3 Sudeste
#> 4 3300100 33 RJ Rio de Janeiro 3 Sudeste
#> 5 3300100 33 RJ Rio de Janeiro 3 Sudeste
#> 6 3300100 33 RJ Rio de Janeiro 3 Sudeste
#> code_weighting V0001 V0002 V0011 V0300 V0010 V1001 V1002 V1003
#> 1 3300100003001 33 00100 3300100003001 1285657 13.51819 3 05 013
#> 2 3300100003001 33 00100 3300100003001 6084726 12.76395 3 05 013
#> 3 3300100003002 33 00100 3300100003002 898500 10.68108 3 05 013
#> 4 3300100003002 33 00100 3300100003002 2581116 11.61204 3 05 013
#> 5 3300100003002 33 00100 3300100003002 3768175 12.93833 3 05 013
#> 6 3300100003002 33 00100 3300100003002 4125020 10.66262 3 05 013
#> V1004 V1006 V0703 V0704 V7051 V7052 M0703 M0704 M7051 M7052
#> 1 00 Urbana Março de 2010 Masculino 69 NA 2 2 2 2
#> 2 00 Urbana Fevereiro de 2010 Masculino 84 NA 2 2 2 2
#> 3 00 Urbana Abril de 2010 Masculino 38 NA 2 2 2 2
#> 4 00 Urbana Maio de 2010 Masculino 54 NA 2 2 2 2
#> 5 00 Rural Agosto de 2009 Masculino 31 NA 2 2 2 2
#> 6 00 Urbana Setembro de 2009 Masculino 28 NA 2 2 2 2
#> V1005
#> 1 Área urbanizada
#> 2 Área não urbanizada
#> 3 Área urbanizada
#> 4 Área urbanizada
#> 5 Área rural exclusive aglomerado rural
#> 6 Área urbanizada
{duckplyr}
A third alternative is the new {duckplyr}
package. This
this library is under development so it still does not cover many of the
{dplyr}
verbs. (See here).
Nonetheless, it looks like a really promising and seamless approach that
allow users to leverage the powerful capabilities of {duckdb}.
library(duckplyr)
rio3 <- df |>
duckplyr::filter(V0704 == 'Masculino' & abbrev_state == 'RJ')
head(rio3) |>
collect()
#> code_muni code_state abbrev_state name_state code_region name_region
#> <int> <int> <char> <char> <int> <char>
#> 1: 3300100 33 RJ Rio de Janeiro 3 Sudeste
#> 2: 3300100 33 RJ Rio de Janeiro 3 Sudeste
#> 3: 3300100 33 RJ Rio de Janeiro 3 Sudeste
#> 4: 3300100 33 RJ Rio de Janeiro 3 Sudeste
#> 5: 3300100 33 RJ Rio de Janeiro 3 Sudeste
#> 6: 3300100 33 RJ Rio de Janeiro 3 Sudeste
#> code_weighting V0001 V0002 V0011 V0300 V0010 V1001 V1002
#> <char> <char> <char> <char> <num> <num> <char> <char>
#> 1: 3300100003001 33 00100 3300100003001 1285657 13.51819 3 05
#> 2: 3300100003001 33 00100 3300100003001 6084726 12.76395 3 05
#> 3: 3300100003002 33 00100 3300100003002 898500 10.68108 3 05
#> 4: 3300100003002 33 00100 3300100003002 2581116 11.61204 3 05
#> 5: 3300100003002 33 00100 3300100003002 3768175 12.93833 3 05
#> 6: 3300100003002 33 00100 3300100003002 4125020 10.66262 3 05
#> V1003 V1004 V1006 V0703 V0704 V7051 V7052 M0703 M0704
#> <char> <char> <char> <char> <char> <num> <num> <char> <char>
#> 1: 013 00 Urbana Março de 2010 Masculino 69 NA 2 2
#> 2: 013 00 Urbana Fevereiro de 2010 Masculino 84 NA 2 2
#> 3: 013 00 Urbana Abril de 2010 Masculino 38 NA 2 2
#> 4: 013 00 Urbana Maio de 2010 Masculino 54 NA 2 2
#> 5: 013 00 Rural Agosto de 2009 Masculino 31 NA 2 2
#> 6: 013 00 Urbana Setembro de 2009 Masculino 28 NA 2 2
#> M7051 M7052 V1005
#> <char> <char> <char>
#> 1: 2 2 Área urbanizada
#> 2: 2 2 Área não urbanizada
#> 3: 2 2 Área urbanizada
#> 4: 2 2 Área urbanizada
#> 5: 2 2 Área rural exclusive aglomerado rural
#> 6: 2 2 Área urbanizada