Customizing clean_metadata()

In our “Getting started” tutorial, we worked with a set of files that matched the expected metadata patterns. However, your own files will often not follow these patterns.

Here we’ll go over how to customize ARUtools functions to work with your data.

For example, let’s assume you have two recordings: one at site 100-a45 on May 4th, 2020 at 5:25 am with ARU unit S4A1234, and the other at site 102-b56 on the same day at 5:40 am with ARU unit S4A1111.
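The file list itself is not shown here, so the following is a hypothetical reconstruction. It is consistent with the description above and with the output later on this page (paths starting with `site`, file names starting with `2020_05_04_0`), but the exact folder layout and time format are assumptions.

```r
# Hypothetical file paths matching the description above; the exact
# layout (site folder / date_time_aru.wav) is an assumption.
f <- c(
  "site100-a45/2020_05_04_05_25_00_s4a1234.wav",
  "site102-b56/2020_05_04_05_40_00_s4a1111.wav"
)

# Split each path into folder and file name to see the pieces
# that clean_metadata() will need to recognise.
pieces <- data.frame(folder = dirname(f), file = basename(f))
pieces
```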

If we try to clean this with the default arguments, we’re going to have some problems.

Regular expressions

First let’s talk a bit about how clean_metadata() extracts information.

This function uses regular expressions to match specific text patterns in the file path of each recording. Regular expressions are powerful, but they are also fairly complicated and can be confusing.

For example, by default, clean_metadata() matches site ids with the expression ((Q)|(P))((\\d{2}))(_|-)((\\d{1})).

Yikes!

Broken down, that means look for a “Q” or “P” (((Q)|(P))) followed by two digits (\\d{2}) followed by a separator, either _ or - (_|-) followed by a single digit (\\d{1}).
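To see this breakdown in action, we can try the default site pattern with base R’s regexpr() and regmatches(). This is only a sketch: the two paths are illustrative, and clean_metadata() may apply the pattern differently internally.

```r
# Default site id pattern, as broken down above
pattern_site <- "((Q)|(P))((\\d{2}))(_|-)((\\d{1}))"

# Illustrative paths: the first follows the default naming, the second does not
paths <- c(
  "a_BARLT10962_P01_1/P01_1_20200502T050000_ARU.wav",
  "site100-a45/2020_05_04_05_25_00_s4a1234.wav"
)

# Keep the first match in each path, or NA when nothing matches
hit <- regexpr(pattern_site, paths, perl = TRUE)
site_id <- rep(NA_character_, length(paths))
site_id[hit > 0] <- regmatches(paths, hit)
site_id
```

The first path yields a site id, while the second returns NA because it contains no “Q” or “P” followed by digits.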

This clearly doesn’t describe the sites in our example. Instead, you can supply your own regular expression.

m <- clean_metadata(project_files = f, pattern_site_id = "site\\d{3}-(a|b)\\d{2}")
#> Extracting ARU info...
#> Extracting Dates and Times...
#> Identified possible problems with metadata extraction:
#> ✖ No times were successfully detected (2/2)
#> ✖ No ARU ids were successfully detected (2/2)
m
#> # A tibble: 2 × 11
#>   file_name     type  path  aru_id manufacturer model aru_type site_id tz_offset
#>   <chr>         <chr> <chr> <chr>  <chr>        <chr> <chr>    <chr>   <chr>    
#> 1 2020_05_04_0… wav   site… <NA>   Wildlife Ac… Song… SongMet… site10… <NA>     
#> 2 2020_05_04_0… wav   site… <NA>   Wildlife Ac… Song… SongMet… site10… <NA>     
#> # ℹ 2 more variables: date_time <dttm>, date <date>
m$site_id
#> [1] "site100-a45" "site102-b56"

However, with sites that follow a reasonable pattern of a prefix, followed by digits and optionally a suffix with digits, it might be easier to use a helper function to create the regular expression for you.

For example, to create a site id pattern we can use create_pattern_site_id().

We specify the prefix text as well as how many digits we might expect, a separator, suffix text and how many suffix digits there might be.

pat_site <- create_pattern_site_id(
  prefix = "site", p_digits = 3,
  sep = "-",
  suffix = c("a", "b"), s_digits = 2
)
pat_site
#> [1] "((site))((\\d{3}))(-)((b)|(a))((\\d{2}))"

m <- clean_metadata(project_files = f, pattern_site_id = pat_site)
#> Extracting ARU info...
#> Extracting Dates and Times...
#> Identified possible problems with metadata extraction:
#> ✖ No times were successfully detected (2/2)
#> ✖ No ARU ids were successfully detected (2/2)
m$site_id
#> [1] "site100-a45" "site102-b56"

It can be useful to look at the default patterns in the functions to see what might be different in your data.

See ?create_pattern_date or any create_pattern function to pull up the documentation and explore the defaults as well as examples.

It can also be useful to test out a pattern before running all your files.

We can use the test_pattern() function to see if our pattern successfully extracts the site id from the first file in our list.

test_pattern(f[1], pat_site)
#> [1] "site100-a45"

Let’s continue customizing our metadata patterns by specifying ARU ids, dates and times.

pat_aru <- create_pattern_aru_id(arus = "s4a", n_digits = 4)

m <- clean_metadata(
  project_files = f,
  pattern_site_id = pat_site,
  pattern_aru_id = pat_aru,
  pattern_dt_sep = "_"
)
#> Extracting ARU info...
#> Extracting Dates and Times...
m
#> # A tibble: 2 × 11
#>   file_name     type  path  aru_id manufacturer model aru_type site_id tz_offset
#>   <chr>         <chr> <chr> <chr>  <chr>        <chr> <chr>    <chr>   <chr>    
#> 1 2020_05_04_0… wav   site… s4a12… Wildlife Ac… Song… SongMet… site10… <NA>     
#> 2 2020_05_04_0… wav   site… s4a11… Wildlife Ac… Song… SongMet… site10… <NA>     
#> # ℹ 2 more variables: date_time <dttm>, date <date>

Other options

Date order

Depending on your date formatting, you may also need to specify the order of the year, month and day, in addition to changing the pattern.

f <- c(
  "P01-1/05042020_052500_S4A1234.wav",
  "P01-1/05042020_054000_S4A1111.wav"
)

clean_metadata(
  project_files = f,
  pattern_dt_sep = "_",
  pattern_date = create_pattern_date(order = "mdy"),
  order_date = "mdy"
)
#> Extracting ARU info...
#> Extracting Dates and Times...
#> # A tibble: 2 × 11
#>   file_name     type  path  aru_id manufacturer model aru_type site_id tz_offset
#>   <chr>         <chr> <chr> <chr>  <chr>        <chr> <chr>    <chr>   <chr>    
#> 1 05042020_052… wav   P01-… S4A12… Wildlife Ac… Song… SongMet… P01-1   <NA>     
#> 2 05042020_054… wav   P01-… S4A11… Wildlife Ac… Song… SongMet… P01-1   <NA>     
#> # ℹ 2 more variables: date_time <dttm>, date <date>

Note that you need to specify it once when making the pattern, and then again when telling the function how to turn the extracted text into a date.

You can specify more than one order with c("mdy", "ymd"), but only do this if you know you have multiple orders in the file names. In particular, try to avoid using both mdy and dmy. Some of these dates can be ambiguous (for example, what order is 05/05/2020?) and may not be parsed correctly in these situations.
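The ambiguity is easy to reproduce with base R’s as.Date(), used here only as a stand-in for whatever date parser clean_metadata() uses internally. The same eight digits give two different, equally valid dates depending on the order assumed.

```r
# The same eight digits parsed under two different orders
x <- "05042020"
as_mdy <- as.Date(x, format = "%m%d%Y") # month-day-year: May 4th
as_dmy <- as.Date(x, format = "%d%m%Y") # day-month-year: April 5th
format(as_mdy)
format(as_dmy)

# Only some strings rule out one of the orders: there is no month 25,
# so this one cannot be parsed as month-day-year
not_mdy <- as.Date("25042020", format = "%m%d%Y")
is.na(not_mdy)
```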

Matching multiple patterns

f <- c(
  "P01-1/05042020_052500_S4A1234.wav",
  "P01-1/05042020_054000_S4A1111.wav",
  "Site10/2020-01-01T09:00:00_BARLT100.wav",
  "Site10/2020-01-02T09:00:00_BARLT100.wav"
)

Sometimes your files may use more than one pattern. You can address this problem in one of two ways.

One option is to run clean_metadata() twice and then join the outputs:

m1 <- clean_metadata(
  project_files = f,
  pattern_dt_sep = "_",
  pattern_date = create_pattern_date(order = "mdy"),
  order_date = "mdy"
)
#> Extracting ARU info...
#> Extracting Dates and Times...
#> Identified possible problems with metadata extraction:
#> ✖ Not all dates were successfully detected (2/4)
#> ✖ Not all times were successfully detected (2/4)
#> ✖ Not all ARU ids were successfully detected (2/4)
#> ✖ Not all sites were successfully detected (2/4)
m1 <- filter(m1, !is.na(date_time)) # omit ones that didn't work
m1
#> # A tibble: 2 × 11
#>   file_name     type  path  aru_id manufacturer model aru_type site_id tz_offset
#>   <chr>         <chr> <chr> <chr>  <chr>        <chr> <chr>    <chr>   <chr>    
#> 1 05042020_052… wav   P01-… S4A12… Wildlife Ac… Song… SongMet… P01-1   <NA>     
#> 2 05042020_054… wav   P01-… S4A11… Wildlife Ac… Song… SongMet… P01-1   <NA>     
#> # ℹ 2 more variables: date_time <dttm>, date <date>

m2 <- clean_metadata(
  project_files = f,
  pattern_site_id = create_pattern_site_id(prefix = "Site", s_digits = 0),
  pattern_aru_id = create_pattern_aru_id(n_digits = 3)
)
#> Extracting ARU info...
#> Extracting Dates and Times...
#> Identified possible problems with metadata extraction:
#> ✖ Not all dates were successfully detected (1/4)
#> ✖ Not all times were successfully detected (2/4)
#> ✖ Not all sites were successfully detected (2/4)
m2 <- filter(m2, !is.na(date_time)) # omit ones that didn't work
m2
#> # A tibble: 2 × 11
#>   file_name     type  path  aru_id manufacturer model aru_type site_id tz_offset
#>   <chr>         <chr> <chr> <chr>  <chr>        <chr> <chr>    <chr>   <chr>    
#> 1 2020-01-01T0… wav   Site… BARLT… Frontier La… BAR-… BARLT    Site10  <NA>     
#> 2 2020-01-02T0… wav   Site… BARLT… Frontier La… BAR-… BARLT    Site10  <NA>     
#> # ℹ 2 more variables: date_time <dttm>, date <date>

m <- bind_rows(m1, m2)
m
#> # A tibble: 4 × 11
#>   file_name     type  path  aru_id manufacturer model aru_type site_id tz_offset
#>   <chr>         <chr> <chr> <chr>  <chr>        <chr> <chr>    <chr>   <chr>    
#> 1 05042020_052… wav   P01-… S4A12… Wildlife Ac… Song… SongMet… P01-1   <NA>     
#> 2 05042020_054… wav   P01-… S4A11… Wildlife Ac… Song… SongMet… P01-1   <NA>     
#> 3 2020-01-01T0… wav   Site… BARLT… Frontier La… BAR-… BARLT    Site10  <NA>     
#> 4 2020-01-02T0… wav   Site… BARLT… Frontier La… BAR-… BARLT    Site10  <NA>     
#> # ℹ 2 more variables: date_time <dttm>, date <date>

With this approach you should check that the number of files in the end matches the number you expect.

nrow(m)
#> [1] 4
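Beyond counting rows, it can help to list exactly which files were dropped or picked up twice. The sketch below uses a small stand-in data frame rather than real clean_metadata() output, and assumes the `path` column holds the same strings that were passed to `project_files`.

```r
f <- c(
  "P01-1/05042020_052500_S4A1234.wav",
  "P01-1/05042020_054000_S4A1111.wav",
  "Site10/2020-01-01T09:00:00_BARLT100.wav",
  "Site10/2020-01-02T09:00:00_BARLT100.wav"
)

# Stand-in for the combined metadata (assumption: `path` matches the
# input strings); the third file is deliberately left out
m_demo <- data.frame(path = f[-3])

# Files that never made it into the combined result
dropped <- setdiff(f, m_demo$path)
dropped

# Files matched by both runs would show up as duplicated paths
duplicated_paths <- m_demo$path[duplicated(m_demo$path)]
duplicated_paths
```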

Another option is to supply multiple patterns to clean_metadata() or to the create_pattern_XXX() functions:

m <- clean_metadata(
  project_files = f,
  pattern_dt_sep = c("_", "T"),
  pattern_date = create_pattern_date(order = c("ymd", "mdy")),
  order_date = c("ymd", "mdy"),
  pattern_aru_id = create_pattern_aru_id(n_digits = c(3, 4)),
  pattern_site_id = create_pattern_site_id(
    prefix = c("P", "Site"),
    sep = c("-", ""),
    s_digits = c(1, 0)
  )
)
#> Extracting ARU info...
#> Extracting Dates and Times...
m
#> # A tibble: 4 × 11
#>   file_name     type  path  aru_id manufacturer model aru_type site_id tz_offset
#>   <chr>         <chr> <chr> <chr>  <chr>        <chr> <chr>    <chr>   <chr>    
#> 1 05042020_052… wav   P01-… S4A12… Wildlife Ac… Song… SongMet… P01-1   <NA>     
#> 2 05042020_054… wav   P01-… S4A11… Wildlife Ac… Song… SongMet… P01-1   <NA>     
#> 3 2020-01-01T0… wav   Site… BARLT… Frontier La… BAR-… BARLT    Site10  <NA>     
#> 4 2020-01-02T0… wav   Site… BARLT… Frontier La… BAR-… BARLT    Site10  <NA>     
#> # ℹ 2 more variables: date_time <dttm>, date <date>

Which approach you should use depends on the situation.

The first approach means that the patterns being matched are more rigid. There is less of a chance of accidentally matching an incorrect pattern. However, there is a chance of omitting files that don’t match either pattern.

The second approach is more flexible in matching patterns and allows you to do so all in one step, which is convenient. However, the more flexible a pattern is, the more opportunities there are to get incorrect matches and date parsing.

With both approaches, it is important to double check the results and make sure the ids and date/times make sense.

check_meta(m)
#> # A tibble: 3 × 11
#>   site_id aru_type  aru_id   type  n_files n_dirs n_days min_date           
#>   <chr>   <chr>     <chr>    <chr>   <int>  <int>  <int> <dttm>             
#> 1 P01-1   SongMeter S4A1111  wav         1      1      1 2020-05-04 05:40:00
#> 2 P01-1   SongMeter S4A1234  wav         1      1      1 2020-05-04 05:25:00
#> 3 Site10  BARLT     BARLT100 wav         2      1      2 2020-01-01 09:00:00
#> # ℹ 3 more variables: max_date <dttm>, min_time <time>, max_time <time>
check_meta(m, date = TRUE)
#> # A tibble: 4 × 10
#>   site_id aru_type  aru_id   type  date       n_files n_dirs n_days min_time
#>   <chr>   <chr>     <chr>    <chr> <date>       <int>  <int>  <int> <time>  
#> 1 P01-1   SongMeter S4A1111  wav   2020-05-04       1      1      1 05:40   
#> 2 P01-1   SongMeter S4A1234  wav   2020-05-04       1      1      1 05:25   
#> 3 Site10  BARLT     BARLT100 wav   2020-01-01       1      1      1 09:00   
#> 4 Site10  BARLT     BARLT100 wav   2020-01-02       1      1      1 09:00   
#> # ℹ 1 more variable: max_time <time>
check_problems(m)
#> # A tibble: 0 × 6
#> # ℹ 6 variables: path <chr>, aru_id <chr>, site_id <chr>, tz_offset <chr>,
#> #   date_time <dttm>, date <date>

unique(m$site_id)
#> [1] "P01-1"  "Site10"
unique(m$aru_id)
#> [1] "S4A1234"  "S4A1111"  "BARLT100"

Subsetting files

You may not want to extract metadata for every file in your list or directory. Perhaps some files are not relevant recordings, or formatting issues make it easier to split the files into separate groups first.

You can keep or omit files using the subset and subset_type arguments.

To keep only certain files, use the default subset_type = "keep". To omit certain files, use subset_type = "omit".

To keep only files with the “a” prefix (note that ^ means ‘at the start’):

clean_metadata(project_files = example_files, subset = "^a")
#> Extracting ARU info...
#> Extracting Dates and Times...
#> # A tibble: 14 × 11
#>   file_name     type  path  aru_id manufacturer model aru_type site_id tz_offset
#>   <chr>         <chr> <chr> <chr>  <chr>        <chr> <chr>    <chr>   <chr>    
#> 1 P01_1_202005… wav   a_BA… BARLT… Frontier La… BAR-… BARLT    P01_1   -0400    
#> 2 P01_1_202005… wav   a_BA… BARLT… Frontier La… BAR-… BARLT    P01_1   -0400    
#> 3 P02_1_202005… wav   a_S4… S4A01… Wildlife Ac… Song… SongMet… P02_1   <NA>     
#> 4 P02_1_202005… wav   a_S4… S4A01… Wildlife Ac… Song… SongMet… P02_1   <NA>     
#> # ℹ 10 more rows
#> # ℹ 2 more variables: date_time <dttm>, date <date>

To omit all files with the “a” prefix:

clean_metadata(project_files = example_files, subset = "^a", subset_type = "omit")
#> Extracting ARU info...
#> Extracting Dates and Times...
#> # A tibble: 28 × 11
#>   file_name     type  path  aru_id manufacturer model aru_type site_id tz_offset
#>   <chr>         <chr> <chr> <chr>  <chr>        <chr> <chr>    <chr>   <chr>    
#> 1 P01_1_202005… wav   j_BA… BARLT… Frontier La… BAR-… BARLT    P01_1   -0400    
#> 2 P01_1_202005… wav   j_BA… BARLT… Frontier La… BAR-… BARLT    P01_1   -0400    
#> 3 P02_1_202005… wav   j_S4… S4A01… Wildlife Ac… Song… SongMet… P02_1   <NA>     
#> 4 P02_1_202005… wav   j_S4… S4A01… Wildlife Ac… Song… SongMet… P02_1   <NA>     
#> # ℹ 24 more rows
#> # ℹ 2 more variables: date_time <dttm>, date <date>
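Before running clean_metadata(), a subset expression can be previewed with base R’s grepl(). This sketch assumes that subset is matched against the file path as a regular expression, which is what the ^ anchor above suggests. The three paths are illustrative.

```r
# Illustrative paths with different folder prefixes
files <- c(
  "a_BARLT10962_P01_1/P01_1_20200502T050000_ARU.wav",
  "j_BARLT10962_P01_1/P01_1_20200502T050000_ARU.wav",
  "o_BARLT10962_P01_1/P01_1_20200502T050000_ARU.wav"
)

# "^a" only matches paths that start with "a"
keep <- grepl("^a", files)
files[keep]  # what subset_type = "keep" would retain
files[!keep] # what subset_type = "omit" would retain

# Without the anchor, "a" matches every path (there is an "a" in "wav")
unanchored <- grepl("a", files)
unanchored
```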

Matching non-wave files

By default clean_metadata() looks for .wav files. If you want it to match something else, adjust the file_type argument.

f <- c(
  "a_BARLT10962_P01_1/P01_1_20200502T050000_ARU.mp4",
  "a_BARLT10962_P01_1/P01_1_20200503T052000_ARU.mp4"
)

Otherwise we’ll run into problems…

clean_metadata(project_files = f)
#> Error in `clean_metadata()`:
#> ! Did not find any 'wav' files.
#> ℹ Use `file_type` to change file extension for sound files
#> ℹ Check `project_dir`/`project_files` are correct

clean_metadata(project_files = f, file_type = "mp4")
#> Extracting ARU info...
#> Extracting Dates and Times...
#> # A tibble: 2 × 11
#>   file_name     type  path  aru_id manufacturer model aru_type site_id tz_offset
#>   <chr>         <chr> <chr> <chr>  <chr>        <chr> <chr>    <chr>   <chr>    
#> 1 P01_1_202005… mp4   a_BA… BARLT… Frontier La… BAR-… BARLT    P01_1   <NA>     
#> 2 P01_1_202005… mp4   a_BA… BARLT… Frontier La… BAR-… BARLT    P01_1   <NA>     
#> # ℹ 2 more variables: date_time <dttm>, date <date>
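If you are not sure which extensions are present, a quick tally with tools::file_ext() (part of the standard R distribution) can help you choose a value for file_type. This is just a convenience check, not something clean_metadata() requires.

```r
f <- c(
  "a_BARLT10962_P01_1/P01_1_20200502T050000_ARU.mp4",
  "a_BARLT10962_P01_1/P01_1_20200503T052000_ARU.mp4"
)

# Tally the file extensions to decide what to pass to `file_type`
ext <- tools::file_ext(f)
table(ext)
```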

Look arounds

In some cases identifying what lies before or after a string of interest can help with extracting the pattern of interest. Details of look arounds can be found on the “stringr” package website.

The following code shows a set of files that contain repeated patterns that match both site and project folders. If we run clean_metadata() it fails to detect dates and times.

f <- c(
  "//BARLTs/DeploymentProjectXYZsites_202223/XYZBrantAirstrip/20230519_RemoteTrip2223/00015998_20230519T210900-0400_SS23.wav",
  "//BARLTs/DeploymentProjectXYZsites_202223/XYZPermafrostPFSC-SP1/20230415_RemoteTrip2223/00015321_20230415T214700-0400_Owls23.wav",
  "//BARLTs/DeploymentProjectXYZsites_202223/XYZfoxden30/20230623_RemoteTrip2223/00015370_20230623T062000-0400_SR23.wav",
  "//BARLTs/DeploymentProjectXYZsites_202223/XYZfoxden107/20220922_RemoteTrip2223/00016130_20220922T000200-0400_NFC22.wav",
  "//BARLTs/DeploymentProjectXYZsites_00202223/XYZfoxden107/20230711_RemoteTrip2223/00016130_20230711T093600-0400_SR23.wav"
)

m <- clean_metadata(
  project_files = f,
  pattern_site_id = create_pattern_site_id(prefix = "XYZ\\w+", p_digits = 0:3, sep = c("", "-"), s_digits = 0:1),
  pattern_aru_id = create_pattern_aru_id(arus = "", n_digits = 8), quiet = TRUE
)
#> Identified possible problems with metadata extraction:
#> ✖ No dates were successfully detected (5/5)
#> ✖ No times were successfully detected (5/5)

Even worse, it returns the wrong site_id values.

m$site_id
#> [1] "XYZsites_202223"   "XYZsites_202223"   "XYZsites_202223"  
#> [4] "XYZsites_202223"   "XYZsites_00202223"

To tackle the site_id issue, we can add a look behind that takes advantage of the fact that the directory before the site_id always ends with “202223/”.

m_site_id_fix <- clean_metadata(
  project_files = f,
  pattern_site_id = create_pattern_site_id(
    prefix = "XYZ\\w+", p_digits = 0:3, sep = c("", "-"), s_digits = 0:1,
    look_behind = "202223/"
  ),
  pattern_aru_id = create_pattern_aru_id(arus = "", n_digits = 8), quiet = TRUE
)
#> Identified possible problems with metadata extraction:
#> ✖ No dates were successfully detected (5/5)
#> ✖ No times were successfully detected (5/5)

m_site_id_fix$site_id
#> [1] "XYZBrantAirstrip"  "XYZPermafrostPFSC" "XYZfoxden30"      
#> [4] "XYZfoxden107"      "XYZfoxden107"
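The effect of the look behind can be reproduced with a plain regular expression in base R. This is a simplified sketch using three of the paths above. The hand-written patterns are assumptions and are not the exact expressions that create_pattern_site_id() builds.

```r
f_demo <- c(
  "//BARLTs/DeploymentProjectXYZsites_202223/XYZBrantAirstrip/20230519_RemoteTrip2223/00015998_20230519T210900-0400_SS23.wav",
  "//BARLTs/DeploymentProjectXYZsites_202223/XYZPermafrostPFSC-SP1/20230415_RemoteTrip2223/00015321_20230415T214700-0400_Owls23.wav",
  "//BARLTs/DeploymentProjectXYZsites_00202223/XYZfoxden107/20230711_RemoteTrip2223/00016130_20230711T093600-0400_SR23.wav"
)

# Without a look behind, the first "XYZ..." in the path is the project folder
no_look <- regmatches(f_demo, regexpr("XYZ\\w+", f_demo, perl = TRUE))
no_look

# (?<=202223/) only accepts an "XYZ..." that directly follows "202223/";
# the look behind text itself is not part of the match
with_look <- regmatches(f_demo, regexpr("(?<=202223/)XYZ\\w+", f_demo, perl = TRUE))
with_look
```

Note that perl = TRUE is needed for look arounds in base R.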

This corrects the site_id values, but the generic “pattern_aru_id” means that clean_metadata() still fails to detect dates and times, because text matching “pattern_aru_id” is excluded when looking for dates and times. Adding look arounds to the ARU id pattern fixes this.

m_fix <- clean_metadata(
  project_files = f,
  pattern_site_id = create_pattern_site_id(
    prefix = "XYZ\\w+", p_digits = 0:3, sep = c("", "-"), s_digits = 0:1,
    look_behind = "202223/"
  ),
  pattern_aru_id = create_pattern_aru_id(arus = "", n_digits = 8, 
                                         sep = "", 
                                         look_behind = "RemoteTrip2223/", 
                                         look_ahead = "_"),
  quiet = TRUE
)
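To see why the look arounds matter for the ARU id, we can compare a bare eight-digit pattern with one that is anchored on both sides. As before, this is a base R sketch with hand-written patterns, which are assumptions rather than the exact output of create_pattern_aru_id().

```r
f_demo <- c(
  "//BARLTs/DeploymentProjectXYZsites_202223/XYZBrantAirstrip/20230519_RemoteTrip2223/00015998_20230519T210900-0400_SS23.wav",
  "//BARLTs/DeploymentProjectXYZsites_00202223/XYZfoxden107/20230711_RemoteTrip2223/00016130_20230711T093600-0400_SR23.wav"
)

# Any eight digits: picks up a date folder or part of the project folder
generic <- regmatches(f_demo, regexpr("\\d{8}", f_demo, perl = TRUE))
generic

# Eight digits directly after "RemoteTrip2223/" and directly before "_"
anchored <- regmatches(
  f_demo,
  regexpr("(?<=RemoteTrip2223/)\\d{8}(?=_)", f_demo, perl = TRUE)
)
anchored
```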

In the end, the look arounds helped us pull out stubborn site_id values and separate the overlapping patterns of aru_id and dates and times.
