Maps and plots with ColOpenData

ColOpenData can be used to access open geospatial data from Colombia. This data is retrieved from the National Geostatistical Framework (MGN), published by the National Administrative Department of Statistics (DANE). The MGN contains the political-administrative division and is used to reference census statistical information.

This package contains the 2018 version of the MGN, which also includes a summarized version of the National Population and Dwelling Census (CNPV) at different aggregation levels. Each level is stored in a separate dataset, which can be retrieved with the download_geospatial() function. The call used in this vignette passes four arguments: spatial_level, simplified, include_geom and include_cnpv.

Available levels of aggregation come from the official spatial division provided by DANE, with their names corresponding to:

Code     Level                          Name
DPTO     Department                     DANE_MGN_2018_DPTO
MPIO     Municipality                   DANE_MGN_2018_MPIO
MPIOCL   Municipality including Class   DANE_MGN_2018_MPIOCL
MZN      Block                          DANE_MGN_2018_MZN
SECR     Rural Sector                   DANE_MGN_2018_SECR
SECU     Urban Sector                   DANE_MGN_2018_SECU
SETR     Rural Section                  DANE_MGN_2018_SETR
SETU     Urban Section                  DANE_MGN_2018_SETU
ZU       Urban Zone                     DANE_MGN_2018_ZU
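
The relationship between the level codes and the dataset names in the table can be sketched in base R. The helper mgn_dataset_name() is hypothetical and is not part of ColOpenData; it only reproduces the naming pattern shown above, and it accepts lowercase codes on the assumption that they map to the uppercase ones, as the "dpto" value used later in this vignette suggests.

```r
# Hypothetical helper (not part of ColOpenData): build the MGN dataset
# name listed in the table from a spatial level code.
mgn_dataset_name <- function(level) {
  valid <- c(
    "DPTO", "MPIO", "MPIOCL", "MZN", "SECR", "SECU", "SETR", "SETU", "ZU"
  )
  code <- toupper(level)
  unknown <- !code %in% valid
  if (any(unknown)) {
    stop("Unknown spatial level: ", paste(level[unknown], collapse = ", "))
  }
  paste0("DANE_MGN_2018_", code)
}

mgn_dataset_name(c("dpto", "mpio"))
```

A lookup like this can be handy for validating a level code before requesting a large dataset.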

In this vignette you will learn:

  1. How to download geospatial data using ColOpenData.
  2. How to use census data included in geospatial datasets.
  3. How to visualize spatial data using leaflet and ggplot2.

We will use geospatial data at the department level (“dpto”) and calculate the percentage of dwellings with an internet connection in each department. We will then plot the result as a static map with ggplot2 and as a dynamic map with leaflet.

We will start by loading the required libraries.

library(ColOpenData)
library(dplyr)
library(sf)
library(ggplot2)
library(leaflet)

Disclaimer: all data is loaded into the environment of the user’s R session, but it is not saved to the user’s computer. Spatial datasets can be very large and might take a while to load.

Downloading geospatial data

First, we download the data with the download_geospatial() function, including the geometries and the census-related information. The simplified parameter requests a lighter version of the geometries, since simple plots do not require precise spatial information.

dpto <- download_geospatial(
  spatial_level = "dpto",
  simplified = TRUE,
  include_geom = TRUE,
  include_cnpv = TRUE
)
#> Original data is retrieved from the National Administrative Department
#> of Statistics (Departamento Administrativo Nacional de Estadística -
#> DANE).
#> Reformatted by package authors.
#> Stored by Universidad de Los Andes under the Epiverse TRACE iniative.

head(dpto)
#> # A tibble: 6 × 89
#>   codigo_departamento departamento    version    area latitud longitud encuestas
#>   <chr>               <chr>             <dbl>   <dbl>   <dbl>    <dbl>     <dbl>
#> 1 18                  Caquetá            2018 9.01e10   0.799    -74.0    163381
#> 2 19                  Cauca              2018 3.12e10   2.40     -76.8    622959
#> 3 86                  Putumayo           2018 2.60e10   0.452    -75.9    147797
#> 4 76                  Valle del Cauca    2018 2.07e10   3.86     -76.5   1674673
#> 5 94                  Guainía            2018 7.13e10   2.73     -68.8     13059
#> 6 99                  Vichada            2018 1.00e11   4.71     -69.4     24915
#> # ℹ 82 more variables: enc_etnico <dbl>, enc_no_etnico <dbl>,
#> #   enc_resguardo_indigena <dbl>, enc_comun_negras <dbl>,
#> #   enc_area_protegida <dbl>, enc_area_no_protegida <dbl>, un_vivienda <dbl>,
#> #   un_mixto <dbl>, un_no_res <dbl>, un_lea <dbl>,
#> #   un_mixto_no_res_industria <dbl>, un_mixto_no_res_comercio <dbl>,
#> #   un_mixto_no_res_servicios <dbl>, un_mixto_no_res_agro <dbl>,
#> #   un_mixto_no_res_sin_info <dbl>, un_no_res_industria <dbl>, …

To find out which column contains the internet-related information, we need the corresponding dataset dictionary, which can be downloaded with the geospatial_dictionary() function. This function takes the spatial level of the dataset and the language of the descriptions as parameters. For further information, please refer to the package’s documentation on dictionaries.

dict <- geospatial_dictionary(spatial_level = "dpto", language = "EN")

head(dict)
#> # A tibble: 6 × 4
#>   variable            type         length description                           
#>   <chr>               <chr>         <dbl> <chr>                                 
#> 1 codigo_departamento Text              2 Department code                       
#> 2 departamento        Text            250 Department name                       
#> 3 version             Long Integer     NA Year of validity of the department in…
#> 4 area                Double           NA Department area in square meters (Pla…
#> 5 latitud             Double           NA Centroid latitude coordinate of the d…
#> 6 longitud            Double           NA Centroid longitude coordinate of the …
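
Since the dictionary has many rows, one way to locate a variable is to search the description column for a keyword. The sketch below uses a toy dictionary with the same four columns as the output above. The helper find_variables() is hypothetical, and the rows for viviendas and viv_internet (types and descriptions) are placeholders written for this example rather than the official dictionary text.

```r
# Toy dictionary with the same columns as the output above.
# The viviendas and viv_internet rows are placeholders, not official text.
toy_dict <- data.frame(
  variable = c("codigo_departamento", "viviendas", "viv_internet"),
  type = c("Text", "Double", "Double"),
  length = c(2, NA, NA),
  description = c(
    "Department code",
    "Total number of dwellings (placeholder)",
    "Dwellings with internet connection (placeholder)"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: case-insensitive keyword search over the
# description column, returning the matching variables.
find_variables <- function(dict, keyword) {
  hits <- grepl(keyword, dict[["description"]], ignore.case = TRUE)
  dict[hits, c("variable", "description")]
}

internet_vars <- find_variables(toy_dict, "internet")
internet_vars[["variable"]]
```

The same function should work on the tibble returned by geospatial_dictionary(), assuming it keeps the variable and description columns shown above.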

To calculate the percentage of dwellings with an internet connection, we need the number of dwellings with an internet connection and the total number of dwellings in each department. According to the dictionary, these are stored in viv_internet and viviendas, respectively. We will calculate the percentage as follows:

internet_cov <- dpto %>%
  mutate(internet = round(100 * viv_internet / viviendas, 2)) # percentage, 0 to 100
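
To check the arithmetic on a small example, the sketch below computes the share of dwellings with internet on made-up counts and reports it both as a proportion and on a 0 to 100 scale. It uses base R instead of dplyr so that it runs without extra packages. The numbers are invented for illustration, and the zero-dwelling guard is an added assumption, not something the census data is known to need.

```r
# Made-up counts for illustration only; these are not census figures.
toy <- data.frame(
  departamento = c("A", "B", "C"),
  viviendas = c(200, 50, 0),
  viv_internet = c(50, 10, 0)
)

# Share of dwellings with internet. Rows with no dwellings get NA,
# which avoids the NaN that 0 / 0 would produce.
toy$share <- ifelse(
  toy$viviendas > 0,
  round(toy$viv_internet / toy$viviendas, 2),
  NA_real_
)

# The same quantity on a 0 to 100 scale.
toy$percent <- toy$share * 100
toy
```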

Static plots (ggplot2)

ggplot2 can generate static plots of spatial data with the geom_sf() geometry. Color palettes and themes can be defined for each plot through aesthetics and scales, which are described in the ggplot2 documentation. We will use a two-color sequential gradient to make the differences between departments more visible.

ggplot(data = internet_cov) +
  geom_sf(mapping = aes(fill = internet), color = NA) +
  theme_minimal() +
  theme(
    plot.background = element_rect(fill = "white", colour = "white"),
    panel.background = element_rect(fill = "white", colour = "white"),
    panel.grid = element_blank(),
    axis.text = element_blank(),
    axis.ticks = element_blank()
  ) +
  scale_fill_gradient("Percentage", low = "#10bed2", high = "#deff00") +
  ggtitle(
    label = "Internet coverage",
    subtitle = "Colombia"
  )

Dynamic plots (leaflet)

For dynamic plots, we can use leaflet, an open-source library for interactive maps. To reproduce the same map, we will first create the color palette.

colfunc <- colorRampPalette(c("#10bed2", "#deff00"))
pal <- colorNumeric(
  palette = colfunc(100),
  domain = internet_cov[["internet"]]
)
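
To see what the color ramp contains, the sketch below builds the same 100-step ramp and maps a few values to colors in base R. The helper value_to_color() is hypothetical: it is a rough analogue of a linear numeric palette, not how leaflet implements colorNumeric(), and it assumes the domain spans more than one distinct value.

```r
# colorRampPalette() lives in grDevices, which ships with base R.
colfunc <- grDevices::colorRampPalette(c("#10bed2", "#deff00"))
ramp <- colfunc(100)

# Hypothetical helper: rescale x to the domain and pick the nearest of
# the ramp colors. Not leaflet's implementation of colorNumeric().
value_to_color <- function(x, domain, colors = ramp) {
  rng <- range(domain, na.rm = TRUE)
  scaled <- (x - rng[1]) / (rng[2] - rng[1])
  idx <- round(scaled * (length(colors) - 1)) + 1
  colors[idx]
}

toy_values <- c(0.05, 0.30, 0.55) # made-up values
toy_colors <- value_to_color(toy_values, domain = toy_values)
toy_colors
```

The lowest value in the domain receives the first ramp color and the highest value receives the last one, which is the behavior the legend of the map relies on.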

With this color palette we can generate the interactive plot. The leaflet package also provides open-source base maps, such as OpenStreetMap and CartoDB. For further details on leaflet, please refer to the package’s documentation.

leaflet(internet_cov) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addPolygons(
    stroke = FALSE, # no borders, only the fill is drawn
    fillColor = ~ pal(internet),
    fillOpacity = 1,
    popup = ~ as.character(internet)
  ) %>%
  addLegend(
    position = "bottomright",
    pal = pal,
    values = ~internet,
    opacity = 1,
    title = "Internet Coverage"
  )
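
The popup in the map above shows only the numeric value. A more descriptive popup could combine the department name with the value, as in the base R sketch below. The helper popup_text() is hypothetical and the inputs are made up for illustration.

```r
# Hypothetical helper: combine a name and a value into popup text,
# with an optional unit suffix such as "%".
popup_text <- function(name, value, unit = "") {
  sprintf("%s: %.2f%s", name, value, unit)
}

# Made-up names and values for illustration.
popup_text(c("A", "B"), c(0.25, 0.2))
```

With the data from this vignette, the result of popup_text(internet_cov[["departamento"]], internet_cov[["internet"]]) could be passed to the popup argument of addPolygons().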