Spatial filtering

Callum Waite & Shandiya Balasubramaniam

2023-10-13

Biodiversity queries to the ALA usually require some spatial filtering. Let’s see how spatial data are stored in the ALA and a few different methods to spatially filter data with galah and other packages.

Most records in the ALA contain location data in the form of two key fields: decimalLatitude and decimalLongitude. As expected, these fields are the decimal coordinates of each occurrence, with south and west values denoted by negatives. While there may be some uncertainty in these values (see field coordinateUncertaintyInMeters), they are generally very accurate.

While these are very important and useful fields, very rarely will we encounter situations that require queries directly calling decimalLatitude and decimalLongitude. Instead, the ALA and galah have a number of features that make spatial queries simpler.

Contextual and spatial layers

Often we want to filter results down to some commonly defined spatial regions, such as states, LGAs or IBRA/IMCRA regions. The ALA contains a large range (>100) of contextual and spatial layers, in-built as searchable and queriable fields. They are denoted by names beginning with "cl", followed by an identifying number that may be up to 6 digits long. These fields are each based on shapefiles, and contain the names of the regions in these layers that each record lies in.

We strongly recommend using search_fields() to check whether a contextual layer already exists in the ALA that matches what you require before proceeding with other methods of spatial filtering. These fields are all able to be queried with filter()and so they are generally easier to use.

Suppose we are interested in querying records of the Red-Necked Avocet (Recurvirostra novaehollandiae) in the Coorong wetlands in South Australia. We can search the ALA fields for wetlands.

library(galah)
library(dplyr)
library(gt)
library(sf)
galah_config(email = "your_email_here", verbose = FALSE)
search_fields("wetlands")
## # A tibble: 2 × 3
##   id      description                       type  
##   <chr>   <chr>                             <chr> 
## 1 cl901   Directory of Important Wetlands   fields
## 2 cl11192 Ramsar_Wetlands_of_AustraliaWGS84 fields

Our search identifies that layer cl901 seems to match what we are looking for. We can then either view all possible values in the field with show_values(), or search again for our particular field.

search_fields("cl901") |> search_values("coorong")
## • Showing values for 'cl901'.
## # A tibble: 1 × 1
##   cl901                                      
##   <chr>                                      
## 1 The Coorong, Lake Alexandrina & Lake Albert

We can filter all occurrences for exact matches with this value, "Lake Eyre". Our galah query can be built as follows:

galah_call() |>
  identify("Recurvirostra novaehollandiae") |>
  filter(cl901 == "The Coorong, Lake Alexandrina & Lake Albert") |>
  collect() |>
  head(5) |>
  gt::gt()
## Retrying in 1 seconds.
## Retrying in 2 seconds.
## Retrying in 4 seconds.
recordID scientificName taxonConceptID decimalLatitude decimalLongitude eventDate occurrenceStatus dataResourceName
00630f58-ee78-4f69-ba39-718b0a1c3356 Recurvirostra novaehollandiae https://biodiversity.org.au/afd/taxa/c69e7308-527a-429d-a80d-143bd20b5100 -35.59293 139.0218 2008-11-18 PRESENT SA Fauna
00705077-abc2-4f28-aedd-11a479ec2d82 Recurvirostra novaehollandiae https://biodiversity.org.au/afd/taxa/c69e7308-527a-429d-a80d-143bd20b5100 -35.52833 138.8089 2016-03-06 PRESENT BirdLife Australia, Birdata
0071444e-ddf2-45dc-9a10-c6399686e306 Recurvirostra novaehollandiae https://biodiversity.org.au/afd/taxa/c69e7308-527a-429d-a80d-143bd20b5100 -35.57719 138.9923 2008-01-01 PRESENT SA Fauna
008a2e6c-26b4-4f4e-a428-b0330db53466 Recurvirostra novaehollandiae https://biodiversity.org.au/afd/taxa/c69e7308-527a-429d-a80d-143bd20b5100 -35.52831 138.8280 2017-11-19 17:15:00 PRESENT eBird Australia
00b11f13-0100-4471-b18b-7e9736ddeeaf Recurvirostra novaehollandiae https://biodiversity.org.au/afd/taxa/c69e7308-527a-429d-a80d-143bd20b5100 -36.18001 139.6589 2017-08-05 13:25:00 PRESENT eBird Australia

Filtering data to a polygon

While server-side spatial information is useful, there are likely to be cases where the shapefile or region you wish to query will not be pre-loaded as a contextual layer in the ALA. In this case, shapefiles can be introduced to the filtering process using the {sf} package and the geolocate() function. Shapefiles can be provided as an sf object, whether that is by importing them with sf::st_read() or taking a POLYGON or MULTIPOLYGON character string and transforming them with sf::st_as_sfc().

For instance, we might interested in species occurrences in King George Square, Brisbane. We can take the MULTIPOLYGON object for the square (as sourced from the Brisbane City Council) and transform it into sfc and then sf objects.

king_george_sq <- "MULTIPOLYGON(((153.0243 -27.46886, 153.0242 -27.46896, 153.0236 -27.46837, 153.0239 -27.46814, 153.0239 -27.46813, 153.0242 -27.46789, 153.0244 -27.46805, 153.0245 -27.46821, 153.0246 -27.46828, 153.0247 -27.46835, 153.0248 -27.46848, 153.0246 -27.4686, 153.0246 -27.46862, 153.0245 -27.46871, 153.0243 -27.46886)))" |>
  sf::st_as_sfc() |> 
  sf::st_as_sf()

We can provide this MULTIPOLYGON in our filter as the argument of geolocate() to assess which species have been recorded in King George Square.

galah_call() |>
  geolocate(king_george_sq) |>
  select(decimalLatitude, 
         decimalLongitude, 
         eventDate, 
         scientificName, 
         vernacularName) |>
  collect() |> 
  head(10) |>
  gt::gt()
decimalLatitude decimalLongitude eventDate scientificName vernacularName
-27.46862 153.0243 2006-03-30 09:32:00 Threskiornis moluccus Australian White Ibis
-27.46861 153.0244 2023-12-02 09:12:02 Cosmophasis baehrae NA
-27.46855 153.0239 2024-02-03 20:45:00 Mediastinia NA
-27.46851 153.0238 2023-04-08 17:49:00 Entomyzon cyanotis Blue-faced Honeyeater
-27.46842 153.0241 2022-08-30 11:29:00 Threskiornis moluccus Australian White Ibis
-27.46836 153.0241 2024-08-28 10:31:54 Vitellus NA
-27.46833 153.0241 2022-07-24 14:43:00 Threskiornis moluccus Australian White Ibis
-27.46831 153.0241 2024-10-13 14:34:00 Threskiornis moluccus Australian White Ibis
-27.46830 153.0240 2024-10-13 14:35:00 Gymnorhina tibicen Australian Magpie
-27.46817 153.0240 2024-07-23 10:00:53 Threskiornis moluccus Australian White Ibis

There is a second argument of geolocate() called type, which defaults to value "polygon". By setting the type argument to "bbox", the provided POLYGON or MULTIPOLYGON will be converted into the smallest bounding box (rectangle) that contains the POLYGON. In this case, records will be included that may not exactly lie inside the provided shape.

galah_call() |>
  geolocate(king_george_sq, type = "bbox") |>
  select(decimalLatitude, 
         decimalLongitude, 
         eventDate, 
         scientificName, 
         vernacularName) |>
  collect() |>
  head(10) |>
  gt::gt()
## Data returned for bounding box:
## xmin = 153.0236 xmax = 153.0248 ymin = -27.46896 ymax = -27.46789
decimalLatitude decimalLongitude eventDate scientificName vernacularName
-27.46889 153.0244 2015-10-09 Burhinus (Burhinus) grallarius Bush Stone-curlew
-27.46882 153.0236 2019-04-21 14:40:00 Threskiornis moluccus Australian White Ibis
-27.46882 153.0236 2005-08-28 12:18:00 Threskiornis moluccus Australian White Ibis
-27.46868 153.0238 2024-07-15 15:00:18 Pseudanapaea denotata NA
-27.46868 153.0238 2024-01-13 15:07:42 Tenodera australasiae Purplewinged Mantid
-27.46868 153.0238 2024-01-13 15:04:53 Pristhesancus plagipennis Bee-killer
-27.46868 153.0238 2024-01-13 15:06:26 Toxorhynchites (Toxorhynchites) speciosus NA
-27.46868 153.0238 2024-01-13 12:20:00 Xixuthrus NA
-27.46868 153.0238 2024-07-15 12:12:27 Columba (Columba) livia Rock Dove
-27.46868 153.0238 2024-01-13 15:07:42 Lyramorpha (Lyramorpha) rosea Litchi Stink Bug

Large shapefiles

The type argument with option "bbox" is provided because sf objects with >500 vertices will not be accepted by the ALA. In the event you have a large shapefile, using type = "bbox" will at least enable an initial reduction of the data that is downloaded, before finer filtering to the actual shapefile will obtain the desired set of occurrences. Alternatively, one can also perform the "bbox" reduction before passing the shape to geolocate() by using sf::st_bbox().

A common situation for this to occur is when a shapefile with multiple shapes is provided, where we are interested in grouping our results by each shape. Here is a mock workflow using a subset of a shapefile of all 2,184 Brisbane parks.

Let’s say we are interested in knowing which parks in the Brisbane postcode 4075 have the most occurrences of the Scaly-Breasted Lorikeet (Trichoglossus chlorolepidotus) since 2020. We can download the entire shapefile from the link above, and perform our filtering and summarising as follows:

brisbane_parks <- sf::st_read("path/to/Park___Locations.shp") |>
  sf::st_make_valid() |>
  filter(POST_CODE == 4075)
# Convert shapefile to a bounding box
brisbane_parks_bbox <- brisbane_parks |> sf::st_bbox()

# Find all occurrences of Trichoglossus chlorolepidotus in the bounding box in 2022
lorikeet_brisbane <- galah_call() |>
  filter(scientificName == "Trichoglossus chlorolepidotus", 
         year >= 2020) |>
  geolocate(brisbane_parks_bbox, type = "bbox") |>
  collect()
## Data returned for bounding box:
## xmin = 152.96331 xmax = 152.99668 ymin = -27.57737 ymax = -27.51606
## Retrying in 1 seconds.
# Filter records down to only those in the shapefile polygons
lorikeet_brisbane |>
  # Create a point geometry based on the occurrence coordinates
  sf::st_as_sf(coords = c("decimalLongitude", "decimalLatitude"), crs = sf::st_crs(brisbane_parks), remove = FALSE) |>
  # identify which park each occurrence sits in with st_intersects()
  mutate(intersection = sf::st_intersects(geometry, brisbane_parks) |> as.integer(),
         park = ifelse(is.na(intersection), NA, brisbane_parks$PARK_NAME[intersection])) |>
  # Filter out occurrences that did not occur in a park
  filter(!is.na(park)) |>
  # Drop the geometry column
  sf::st_drop_geometry() |>
  # Summarise the top 10 parks for lorikeet sightings in 2022
  group_by(park) |>
  summarise(counts = n()) |>
  arrange(desc(counts)) |>
  head(10)
## # A tibble: 10 × 2
##    park                           counts
##    <chr>                           <int>
##  1 NYUNDARE-BA PARK                  552
##  2 SHERWOOD ARBORETUM                472
##  3 FAULKNER PARK                      64
##  4 FORT ROAD BUSHLAND                 46
##  5 STRICKLAND TERRACE PARK            44
##  6 BENARRAWA RESERVE                  34
##  7 GRACEVILLE RIVERSIDE PARKLANDS     32
##  8 NOSWORTHY PARK                     17
##  9 HORACE WINDOW RESERVE              14
## 10 GRACEVILLE AVENUE PARK              7

Some shapefiles cover large geographic areas with the caveat that even the bounding box doesn’t restrict the number of records to a value that can be downloaded easily. In this case, we recommend more nuances and detailed methods that can be performed using looping techniques. One of our ALA Labs blog posts, Hex maps for species occurrence data, has been written detailing how to approach larger problems such as this.