The purpose of this vignette is to demonstrate a workflow for
managing data sets that have censored data. In the context of water
quality data, we often think of censored values as water quality
measurements reported as not-detected, less-than the quantitation limit,
between detection limit and limit of quantitation to name a few. For
example, in water quality, the value “<0.5” is typically thought of
as left censored. In the truest statistical sense of the meaning,
left-censoring implies no knowledge about the lower bound; however, in
most water resource applications, zero is a common lower bound.
Therefore, a value “<0.5” should be thought of as a water quality
measurement having a lower and upper bound of 0 and 0.5, respectively.
This allow baytrends
to rely on the survival
package which treats these values as “interval” censored data and are
referred to as Surv objects.
This vignette is divided into three primary sections. The first section describes the process of importing data from csv files into a dataframe with Surv objects. The second section is a more general workflow. The third section provides a list of miscellaneous functions and notes.
Importing CSV files that have censored data into R dataframes for analysis in baytrends is a critical data preparation step and involves two lines of code: 1) importing the CSV data into R and 2) converting water quality measurements with censored data to Surv objects (see below code).
The remainder of this section discusses the above lines of code by way of example.
Figure 1 represents a portion of data from an example CSV file. Note, there is no requirement for the CSV file to have the data stored in the column order shown. In this example, the three required fields: station, date, and layer are stored in the first three columns ([A]-[C]). Missing values for these columns are not permissable. The user can optionally include additional fields such as depth, time, etc.
Figure 1. Example CSV file.
Fields with no censoring. In the above example, Secchi depth (secchi), salinity (salinity), dissolved oxygen (do), water temperature (wtemp), total suspended solids (tss) are stored in the next group of columns as un-censored data. While possible, censoring is less common for these types of water quality variables and as a result the data are represented with a single column in the CSV file. The values entered in these columns should be numeric values with no symbols. Values shown as “NA” are un-reported entries. Un-reported values can happen for a number of reasons including not being measured, equipment failure, etc. CSV files can use “NA” or use blanks to represent un-reported entries. Depending on the situation and the user’s sensitivity to fully evaluating censored data, a variable such as total suspended solids could be treated as an un-censored variable as shown in this example or could be treated as a variable with censored data as discussed next.
Fields with censoring. To the right of the break in Figure 1, water quality fields include lower and upper bounds to represent censoring. Note, the field naming syntax using a suffix of “_lo” and “_hi”. For example, columns [U] and [V] show the lower and upper bounds for total dissolved phosphorus as tdp_lo and tdp_hi, respectively. The upper bound must be greater than or equal to the lower bound. When the upper bound is equal to the lower bound, it indicates that observation did not have censoring.
The code df_in <- loadData(file = "ExampleData.csv")
uses a baytrends’ built-in function to load data from a CSV file.
Experienced R programmers can use other functions to load data
(e.g. read.csv
, read.table
); however,
loadData
outputs meta data about the number of columns and
rows read and performs some useful steps such as converting date fields
into an R date field (i.e., POSIXct
) and is advantageous
for further analyses with baytrends and other situations. There is also
a companion baytrends function, loadExcel
, with similar
functionality to loadData
and can be used to load data from
an Excel workbook. See help (i.e., ??loadData and ??loadExcel) for more
information.
After importing the above CSV file into an R dataframe, the next step
is to convert fields with censored data to Surv objects as depicted in
Figure 2 using the code df_Surv <- makeSurvDF(df_in)
.
Fields shaded in green or blue are column to column comparable to the
CSV file. Columns to the right of the break, such as the yellow
highlighted column, tdp, is a Surv object and is derived from the two
corresponding columns, tdp_lo and tdp_hi in the CSV file (Figure 1).
Again, experienced R programmers can make Surv objects independently
from baytrends by using the following syntax for each field with
censored data using the total dissolved phosphorus data as an
example.
The baytends function, makeSurvDF
, peforms this process
for all fields in a dataframe with censored data. By default,
makeSurvDF
assumes the dataframe employs a suffix of “_lo”
to represent the lower bound and “_hi” to represent the upper bound
although this can be changed by the user. See help (i.e., ??makeSurvDF)
for more information. The makeSurvDF
function also performs
a quality assurance review of the data to make sure the upper bound is
greater than or equal to the lower bound. The function outputs the
number of data integrity issues by parameter and sets those observations
to missing (NA) so the code can continue whilst the analyst determines
the appropriate corrective action.
Figure 2. Example dataframe with Surv objects.
The previous section described the primary steps of importing CSV
files into R and converting fields with censored data to Surv objects
using the functions loadData
and makeSurvDF
.
Because many R functions do not recognize Surv objects, it may be
necessary to convert to Surv objects to standard numeric vectors for
some analyses. To address this need, baytrends includes two additional
functions, unSurvDF
and saveDF
to facilitate
extracting and saving lower and upper bounds from Surv objects. When
combined with the previously described functions, it is possible to have
a complete workflow as depicted in Figure 3. unSurvDF
and
saveDF
are described next.
Figure 3. Workflow for handling censored water quality data with baytrends and Surv objects.
As the name suggests, this function does the opposite of
makeSurvDF
. That is, in the below example code,
df1 <- unSurvDF(df0)
, df0 is a dataframe with Surv
objects. unSurvDF
passes through fields from
df0
that are not Surv objects and splits Surv objects into
two fields, the lower and upper bound, using a field naming syntax that
includes a suffix of “_lo” and “_hi”. The user can change the default
suffix (see help, i.e., ??unSurvDF for more information).
This function saves a dataframe to a CSV file using a default naming
convention that includes the name of the dataframe and a timestamp in
the file name. In general, we have found that saving a dataframe that
includes Surv objects do not facilitate analyses in other software such
as Excel. As in the case of loadData
it is straightforward
to use other functions to create csv files (e.g. write.csv
,
write.table
).
In the previous section, the sample code started at the top of the
diagram shown in Figure 3 and used loadData
and
makeSurvDF
to arrive at a dataframe with Surv objects as
shown at the bottom of the diagram. The example code in this section
starts at the bottom of the diagram and goes full circle such that the
dataframe df0 is identical to df3.
library(baytrends)
# Load the first 10 rows of the built-in baytrends dataframe
# dataCensored into a dataframe and just a few fields.
df0 <- dataCensored[1:10,c("station","date","layer","tss","tdp","tp")]
df0
# Convert the dataframe, df0, that has Surv objects to a
# dataframe where censored data are represented with
# "_lo/_hi" field name syntax.
df1 <- unSurvDF(df0)
df1
# Save dataframe, df1, where censored data are represented with
# "_lo/_hi" field name syntax to a csv file in the working
# directory.
saveDF(df1, folder = '.')
# Load data from csv file where censored data are represented
# with "_lo/_hi“ field name syntax to a dataframe of same
# structure.
df2 <- loadData("*.csv" ,folder = '.')
df2
# Convert dataframe, df2, where censored data are represented
# with "_lo/_hi“ field name syntax to a dataframe with Surv objects.
df3 <- makeSurvDF(df2)
df3
library(baytrends)
# Load the first 10 rows of the built-in baytrends dataframe
# dataCensored into a dataframe and just a few fields.
df0 <- dataCensored[1:10,c("station","date","layer","tss","tdp","tp")]
df0
#> station date layer tss tdp tp
#> 1 CB3.3C 1985-05-21 AP 6.9 0.0360 0.05120
#> 2 CB3.3C 1985-05-21 B 7.6 0.0170 0.03100
#> 3 CB3.3C 1985-05-21 BP 7.8 0.0080 0.02520
#> 4 CB3.3C 1985-05-21 S 7.0 0.0255 0.04760
#> 5 CB3.3C 1985-06-04 AP 5.0 [0.0000, 0.0050] [0.02310, 0.02810]
#> 6 CB3.3C 1985-06-04 B 10.4 [0.0040, 0.0065] [0.02375, 0.02625]
#> 7 CB3.3C 1985-06-04 BP 6.6 0.0320 0.04500
#> 8 CB3.3C 1985-06-04 S 5.8 0.0640 0.08410
#> 9 CB3.3C 1985-06-18 AP 17.5 0.0370 0.05510
#> 10 CB3.3C 1985-06-18 B 31.2 0.0140 0.02770
# Convert the dataframe, df0, that has Surv objects to a
# dataframe where censored data are represented with
# "_lo/_hi" field name syntax.
df1 <- unSurvDF(df0)
df1
#> station date layer tss tdp_lo tdp_hi tp_lo tp_hi
#> 1 CB3.3C 1985-05-21 AP 6.9 0.0360 0.0360 0.05120 0.05120
#> 2 CB3.3C 1985-05-21 B 7.6 0.0170 0.0170 0.03100 0.03100
#> 3 CB3.3C 1985-05-21 BP 7.8 0.0080 0.0080 0.02520 0.02520
#> 4 CB3.3C 1985-05-21 S 7.0 0.0255 0.0255 0.04760 0.04760
#> 5 CB3.3C 1985-06-04 AP 5.0 0.0000 0.0050 0.02310 0.02810
#> 6 CB3.3C 1985-06-04 B 10.4 0.0040 0.0065 0.02375 0.02625
#> 7 CB3.3C 1985-06-04 BP 6.6 0.0320 0.0320 0.04500 0.04500
#> 8 CB3.3C 1985-06-04 S 5.8 0.0640 0.0640 0.08410 0.08410
#> 9 CB3.3C 1985-06-18 AP 17.5 0.0370 0.0370 0.05510 0.05510
#> 10 CB3.3C 1985-06-18 B 31.2 0.0140 0.0140 0.02770 0.02770
# Save dataframe, df1, where censored data are represented with
# "_lo/_hi" field name syntax to a csv file in the working
# directory.
saveDF(df1, folder = '.')
# Load data from csv file where censored data are represented
# with "_lo/_hi“ field name syntax to a dataframe of same
# structure.
df2 <- loadData("*.csv" ,folder = '.')
#>
#>
#> |Description |Value |
#> |:--------------------------------------|:--------------------------------|
#> |1) File Name |df1_2024_07_26_100708.853504.csv |
#> |2) Folder Name |. |
#> |3) Primary Key |NA |
#> |4) Rows Read In |10 |
#> |5) Columns Read In |8 |
#> |6) Rows After Blank Rows Removed |10 |
#> |7) Columns After Blank Columns Removed |8 |
#> |8) Rows After Duplicate Rows Removed |10 |
#> |9) Rows After Duplicate PK Removed |n/a |
#> |10) date converted with |%Y-%Om-%d format |
df2
#> station date layer tss tdp_lo tdp_hi tp_lo tp_hi
#> 1 CB3.3C 1985-05-21 AP 6.9 0.0360 0.0360 0.05120 0.05120
#> 2 CB3.3C 1985-05-21 B 7.6 0.0170 0.0170 0.03100 0.03100
#> 3 CB3.3C 1985-05-21 BP 7.8 0.0080 0.0080 0.02520 0.02520
#> 4 CB3.3C 1985-05-21 S 7.0 0.0255 0.0255 0.04760 0.04760
#> 5 CB3.3C 1985-06-04 AP 5.0 0.0000 0.0050 0.02310 0.02810
#> 6 CB3.3C 1985-06-04 B 10.4 0.0040 0.0065 0.02375 0.02625
#> 7 CB3.3C 1985-06-04 BP 6.6 0.0320 0.0320 0.04500 0.04500
#> 8 CB3.3C 1985-06-04 S 5.8 0.0640 0.0640 0.08410 0.08410
#> 9 CB3.3C 1985-06-18 AP 17.5 0.0370 0.0370 0.05510 0.05510
#> 10 CB3.3C 1985-06-18 B 31.2 0.0140 0.0140 0.02770 0.02770
# Convert dataframe, df2, where censored data are represented
# with "_lo/_hi“ field name syntax to a dataframe with Surv objects.
df3 <- makeSurvDF(df2)
df3
#> station date layer tss tdp tp
#> 1 CB3.3C 1985-05-21 AP 6.9 0.0360 0.05120
#> 2 CB3.3C 1985-05-21 B 7.6 0.0170 0.03100
#> 3 CB3.3C 1985-05-21 BP 7.8 0.0080 0.02520
#> 4 CB3.3C 1985-05-21 S 7.0 0.0255 0.04760
#> 5 CB3.3C 1985-06-04 AP 5.0 [0.0000, 0.0050] [0.02310, 0.02810]
#> 6 CB3.3C 1985-06-04 B 10.4 [0.0040, 0.0065] [0.02375, 0.02625]
#> 7 CB3.3C 1985-06-04 BP 6.6 0.0320 0.04500
#> 8 CB3.3C 1985-06-04 S 5.8 0.0640 0.08410
#> 9 CB3.3C 1985-06-18 AP 17.5 0.0370 0.05510
#> 10 CB3.3C 1985-06-18 B 31.2 0.0140 0.02770
It is reasonable to assemble newly collected data with previous data in an environment outside of R and import the data en masse as described earlier. However, this step can also be performed inside R.
Normally, dataframes can be appended by using rbind
function within R. For dataframes that include Surv objects is is
necessary to use the bind_rows
function from the
dplyr
library as demonstrated in the below code.
library(baytrends)
#library(dplyr)
# Create two dataframes with Surv objects.
df11 <- dataCensored[1:10,c("station","date","layer","tss","tdp","tp")]
df11
df12 <- dataCensored[11:20,c("station","date","layer","tss","tdp","tp")]
df12
# Combine the two dataframe into one dataframe
df13 <- dplyr::bind_rows(df11,df12)
df13
library(baytrends)
#library(dplyr)
# Create two dataframes with Surv objects.
df11 <- dataCensored[1:10,c("station","date","layer","tss","tdp","tp")]
df11
#> station date layer tss tdp tp
#> 1 CB3.3C 1985-05-21 AP 6.9 0.0360 0.05120
#> 2 CB3.3C 1985-05-21 B 7.6 0.0170 0.03100
#> 3 CB3.3C 1985-05-21 BP 7.8 0.0080 0.02520
#> 4 CB3.3C 1985-05-21 S 7.0 0.0255 0.04760
#> 5 CB3.3C 1985-06-04 AP 5.0 [0.0000, 0.0050] [0.02310, 0.02810]
#> 6 CB3.3C 1985-06-04 B 10.4 [0.0040, 0.0065] [0.02375, 0.02625]
#> 7 CB3.3C 1985-06-04 BP 6.6 0.0320 0.04500
#> 8 CB3.3C 1985-06-04 S 5.8 0.0640 0.08410
#> 9 CB3.3C 1985-06-18 AP 17.5 0.0370 0.05510
#> 10 CB3.3C 1985-06-18 B 31.2 0.0140 0.02770
df12 <- dataCensored[11:20,c("station","date","layer","tss","tdp","tp")]
df12
#> station date layer tss tdp tp
#> 11 CB3.3C 1985-06-18 BP 21.30 0.0160 0.0353
#> 12 CB3.3C 1985-06-18 S 29.10 0.0260 0.0486
#> 13 CB3.3C 1985-07-09 AP 12.00 0.0270 0.0488
#> 14 CB3.3C 1985-07-09 B 20.80 0.0400 0.0539
#> 15 CB3.3C 1985-07-09 BP 9.45 0.0180 0.0419
#> 16 CB3.3C 1985-07-09 S 12.20 0.0190 0.0562
#> 17 CB3.3C 1985-07-24 AP 16.40 0.0150 0.0490
#> 18 CB3.3C 1985-07-24 B 17.00 0.0470 0.0560
#> 19 CB3.3C 1985-07-24 BP 34.60 0.0430 0.0580
#> 20 CB3.3C 1985-07-24 S 16.60 0.0145 0.0490
# Combine the two dataframe into one dataframe
df13 <- dplyr::bind_rows(df11,df12)
df13
#> station date layer tss tdp tp
#> 1 CB3.3C 1985-05-21 AP 6.90 0.0360 0.05120
#> 2 CB3.3C 1985-05-21 B 7.60 0.0170 0.03100
#> 3 CB3.3C 1985-05-21 BP 7.80 0.0080 0.02520
#> 4 CB3.3C 1985-05-21 S 7.00 0.0255 0.04760
#> 5 CB3.3C 1985-06-04 AP 5.00 [0.0000, 0.0050] [0.02310, 0.02810]
#> 6 CB3.3C 1985-06-04 B 10.40 [0.0040, 0.0065] [0.02375, 0.02625]
#> 7 CB3.3C 1985-06-04 BP 6.60 0.0320 0.04500
#> 8 CB3.3C 1985-06-04 S 5.80 0.0640 0.08410
#> 9 CB3.3C 1985-06-18 AP 17.50 0.0370 0.05510
#> 10 CB3.3C 1985-06-18 B 31.20 0.0140 0.02770
#> 11 CB3.3C 1985-06-18 BP 21.30 0.0160 0.03530
#> 12 CB3.3C 1985-06-18 S 29.10 0.0260 0.04860
#> 13 CB3.3C 1985-07-09 AP 12.00 0.0270 0.04880
#> 14 CB3.3C 1985-07-09 B 20.80 0.0400 0.05390
#> 15 CB3.3C 1985-07-09 BP 9.45 0.0180 0.04190
#> 16 CB3.3C 1985-07-09 S 12.20 0.0190 0.05620
#> 17 CB3.3C 1985-07-24 AP 16.40 0.0150 0.04900
#> 18 CB3.3C 1985-07-24 B 17.00 0.0470 0.05600
#> 19 CB3.3C 1985-07-24 BP 34.60 0.0430 0.05800
#> 20 CB3.3C 1985-07-24 S 16.60 0.0145 0.04900
As described earlier, unSurvDF
converts dataframes that
contain Surv objects into dataframes with separate columns representing
the lower and upper bounds. The companion function, unSurv
,
operates on a single field and returns a 3-column matrix as demonstrated
in the below code. See help (i.e., ??unSurv) for more information about
the returned columns.
dataCensored[1:10,c("station","date","layer","tss","tdp","tp")]
#> station date layer tss tdp tp
#> 1 CB3.3C 1985-05-21 AP 6.9 0.0360 0.05120
#> 2 CB3.3C 1985-05-21 B 7.6 0.0170 0.03100
#> 3 CB3.3C 1985-05-21 BP 7.8 0.0080 0.02520
#> 4 CB3.3C 1985-05-21 S 7.0 0.0255 0.04760
#> 5 CB3.3C 1985-06-04 AP 5.0 [0.0000, 0.0050] [0.02310, 0.02810]
#> 6 CB3.3C 1985-06-04 B 10.4 [0.0040, 0.0065] [0.02375, 0.02625]
#> 7 CB3.3C 1985-06-04 BP 6.6 0.0320 0.04500
#> 8 CB3.3C 1985-06-04 S 5.8 0.0640 0.08410
#> 9 CB3.3C 1985-06-18 AP 17.5 0.0370 0.05510
#> 10 CB3.3C 1985-06-18 B 31.2 0.0140 0.02770
unSurvDF(dataCensored[1:10,c("station","date","layer","tss","tdp","tp")])
#> station date layer tss tdp_lo tdp_hi tp_lo tp_hi
#> 1 CB3.3C 1985-05-21 AP 6.9 0.0360 0.0360 0.05120 0.05120
#> 2 CB3.3C 1985-05-21 B 7.6 0.0170 0.0170 0.03100 0.03100
#> 3 CB3.3C 1985-05-21 BP 7.8 0.0080 0.0080 0.02520 0.02520
#> 4 CB3.3C 1985-05-21 S 7.0 0.0255 0.0255 0.04760 0.04760
#> 5 CB3.3C 1985-06-04 AP 5.0 0.0000 0.0050 0.02310 0.02810
#> 6 CB3.3C 1985-06-04 B 10.4 0.0040 0.0065 0.02375 0.02625
#> 7 CB3.3C 1985-06-04 BP 6.6 0.0320 0.0320 0.04500 0.04500
#> 8 CB3.3C 1985-06-04 S 5.8 0.0640 0.0640 0.08410 0.08410
#> 9 CB3.3C 1985-06-18 AP 17.5 0.0370 0.0370 0.05510 0.05510
#> 10 CB3.3C 1985-06-18 B 31.2 0.0140 0.0140 0.02770 0.02770
unSurv(dataCensored[1:10,"tdp"])
#> lo hi type
#> [1,] 0.0360 0.0360 1
#> [2,] 0.0170 0.0170 1
#> [3,] 0.0080 0.0080 1
#> [4,] 0.0255 0.0255 1
#> [5,] 0.0000 0.0050 3
#> [6,] 0.0040 0.0065 3
#> [7,] 0.0320 0.0320 1
#> [8,] 0.0640 0.0640 1
#> [9,] 0.0370 0.0370 1
#> [10,] 0.0140 0.0140 1
In some instances, it is useful to simplify Surv objects to a single
field instead of lower and upper bounds. By default impute
and imputeDF
will compute the midpoint between the lower
and upper bound for a single Surv field or convert the entire dataframe,
respectively as demonstrated with the below code. See help (i.e.,
??imputeDF and ??impute) for more information on other imputation
options including “lower”, “upper”, “mid”, “norm”, and “lnorm”.
dataCensored[1:10,c("station","date","layer","tss","tdp","tp")]
#> station date layer tss tdp tp
#> 1 CB3.3C 1985-05-21 AP 6.9 0.0360 0.05120
#> 2 CB3.3C 1985-05-21 B 7.6 0.0170 0.03100
#> 3 CB3.3C 1985-05-21 BP 7.8 0.0080 0.02520
#> 4 CB3.3C 1985-05-21 S 7.0 0.0255 0.04760
#> 5 CB3.3C 1985-06-04 AP 5.0 [0.0000, 0.0050] [0.02310, 0.02810]
#> 6 CB3.3C 1985-06-04 B 10.4 [0.0040, 0.0065] [0.02375, 0.02625]
#> 7 CB3.3C 1985-06-04 BP 6.6 0.0320 0.04500
#> 8 CB3.3C 1985-06-04 S 5.8 0.0640 0.08410
#> 9 CB3.3C 1985-06-18 AP 17.5 0.0370 0.05510
#> 10 CB3.3C 1985-06-18 B 31.2 0.0140 0.02770
imputeDF(dataCensored[1:10,c("station","date","layer","tss","tdp","tp")])
#> station date layer tss tdp tp
#> 1 CB3.3C 1985-05-21 AP 6.9 0.03600 0.0512
#> 2 CB3.3C 1985-05-21 B 7.6 0.01700 0.0310
#> 3 CB3.3C 1985-05-21 BP 7.8 0.00800 0.0252
#> 4 CB3.3C 1985-05-21 S 7.0 0.02550 0.0476
#> 5 CB3.3C 1985-06-04 AP 5.0 0.00250 0.0256
#> 6 CB3.3C 1985-06-04 B 10.4 0.00525 0.0250
#> 7 CB3.3C 1985-06-04 BP 6.6 0.03200 0.0450
#> 8 CB3.3C 1985-06-04 S 5.8 0.06400 0.0841
#> 9 CB3.3C 1985-06-18 AP 17.5 0.03700 0.0551
#> 10 CB3.3C 1985-06-18 B 31.2 0.01400 0.0277
impute(dataCensored[1:10,"tdp"])
#> [1] 0.03600 0.01700 0.00800 0.02550 0.00250 0.00525 0.03200 0.06400 0.03700
#> [10] 0.01400