The R package funtimes
contains the function
beales
that can be used to implement Beale’s (Beale 1962) ratio estimator for estimating
total value. The function also calculates recommended sample size for
desired confidence level and absolute or relative error.
The Beale’s estimator is often used in ecology to compute total pollutant load, \(\widehat{Y}\), given a sample of the loads \(y_i\) and corresponding river flow or discharges, \(x_i\) (\(i = 1,\ldots,n\)): \[ \widehat{Y} =X\frac{\bar{y}}{\bar{x}}\frac{\left( 1+ \theta\frac{s_{xy}}{\bar{x}\bar{y}}\right)}{\left( 1+\theta\frac{s^2_x}{\bar{x}^2} \right)}, \] where \(\theta=n^{-1} - N^{-1}\), \(s_{xy}=(n-1)^{-1}\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})\), and \(s^2_{x}=(n-1)^{-1}\sum_{i=1}^n(x_i-\bar{x})^2\). Total flow, \(X=\sum_{i=1}^Nx_i\), is assumed to be known. If the data set for flow contains only \(n'\) observations (\(n\leqslant n'< N\)), we use an estimate \(\widehat{X}=\frac{N}{n'}\sum_{i=1}^{n'}x_i\) following formula (2.8) in Thompson (2012).
To install and load the package, run
install.packages("funtimes")
library(funtimes)
Help file for the function can be opened from R with:
?beales
The function uses the following groups of arguments as its inputs.
x
and y
(both are required) for discharge
and corresponding load measurements;level
defines the confidence level (optional; if not
specified, level = 0.95
is used, i.e., 95%);N
(optional, see details in the section below).verbose
(optional) is a logical value
(TRUE
or FALSE
) defining whether text output
should be shown. If not specified, its value is set to TRUE
to show the text outputs.p
relative error, ord
margin of error.The ideal case is when all discharge data are know, and only some measurements of loads are missing.
The inputs should be organized in vectors of same length. Consider a
toy example where ten measurements cover the whole period of interest
(i.e., the population size N = 10
):
<- c(60, 50, 90, 100, 80, 90, 100, 90, 80, 70)
discharge <- c(33, 22, 44, 48, NA, 44, 49, NA, NA, 36) loads
NA
s stand for missing values.
To estimate the total load for this period, use:
<- beales(x = discharge, y = loads)
B10 # [1] "Beale's estimate of the total (for population size 10) is 399.176 with 95% confidence interval from 391.315 to 407.037."
By default (the setting verbose = TRUE
), the function
shows text output. All estimates have been saved in the object
B10
and can be extracted from there. For example, see the
list of elements saved in B10
ls(B10)
# [1] "CI" "N" "estimate" "level" "n" "se"
then extract the population size and standard error of the load estimate
$N
B10# [1] 10
$se
B10# [1] 4.010797
If a different level of confidence (default is 95%) is needed, set it
using the argument level
:
<- beales(x = discharge, y = loads, level = 0.9)
B11 # [1] "Beale's estimate of the total (for population size 10) is 399.176 with 90% confidence interval from 392.578 to 405.773."
To suppress the text outputs, use verbose = FALSE
:
<- beales(x = discharge, y = loads, level = 0.9, verbose = FALSE) B12
It is common that some discharge data are missing. The function fills-in the missing discharge measurements with average estimates automatically. For example, now the first discharge value is missing:
<- c(NA, 50, 90, 100, 80, 90, 100, 90, 80, 70)
discharge2 <- c(33, 22, 44, 48, NA, 44, 49, NA, NA, 36) loads2
The NA
in discharge will be replaced by an average value
of the non-missing measurements, and the first pair of discharge and
load (average discharge and the corresponding load of 33) will still be
used in estimating covariance and other quantities. Simply use the
function in the same way as above:
<- beales(x = discharge2, y = loads2)
B20 # [1] "Beale's estimate of the total (for population size 10) is 394.35 with 95% confidence interval from 381.569 to 407.131."
In another case, both discharge and load data might be
missing. If they are not represented at all in the data vectors (by
NA
s), a simple trick is to set the population size,
N
, which is one of the arguments in the function. For
example, if the data above are ten monthly measurements, and an estimate
for the whole year (12 months) is required, set N = 12
in
the function:
<- beales(x = discharge2, y = loads2, N = 12)
B21 # [1] "Beale's estimate of the total (for population size 12) is 473.25 with 95% confidence interval from 455.161 to 491.339."
which is equivalent to adding two missing values to each vector, like this:
<- c(discharge2, NA, NA)
discharge22 <- c(loads2, NA, NA)
loads22 <- beales(x = discharge22, y = loads22)
B22 # [1] "Beale's estimate of the total (for population size 12) is 473.25 with 95% confidence interval from 455.161 to 491.339."
The other two arguments of the function, p
and
d
, allow the user to set the desired relative error or
margin of error, respectively, for sample size calculations. (If both
p
and d
are defined, the calculations will run
for p
.) The estimated sample size, \(\hat{n}\), is added to the output list as
the element nhat
, and an additional sentence is printed out
at the output.
For example, using our data for 10 months out of 12, estimate the sample size needed to estimate the total yearly load with the relative error up to 5%:
<- beales(x = discharge2, y = loads2, N = 12, p = 0.05)
B30 # [1] "Beale's estimate of the total (for population size 12) is 473.25 with 95% confidence interval from 455.161 to 491.339."
# [1] "To obtain a 95% confidence interval with a relative error of 5%, a sample of size 6 is required."
What if we increase the confidence of such interval (notice the differences in the last line of the output):
<- beales(x = discharge2, y = loads2, N = 12, p = 0.05, level = 0.99)
B31 # [1] "Beale's estimate of the total (for population size 12) is 473.25 with 99% confidence interval from 449.477 to 497.024."
# [1] "To obtain a 99% confidence interval with a relative error of 5%, a sample of size 8 is required."
Similarly, when the margin of error is set:
<- beales(x = discharge2, y = loads2, N = 12, d = 15)
B32 # [1] "Beale's estimate of the total (for population size 12) is 473.25 with 95% confidence interval from 455.161 to 491.339."
# [1] "To obtain a 95% confidence interval with a margin of error being 15, a sample of size 9 is required."
The estimated sample size can be extracted as follows:
$nhat
B32# [1] 9
x
and
y
are of different lengths.n
is the number of non-missing
values in y
(missing values in x
are
automatically replaced with an average of non-missing
x
).N
is set such
that N < length(x)
(more discharge samples than possible
in a given period) or if N <= n
(sample size is bigger
than or equals the population size). In the case when
N = n
, no estimation is needed, because the total load can
be calculated just by summing up all individual loads.n > 1
(for
estimating the variances and covariance), and \(\bar{x}\neq 0\) and \(\bar{y}\neq 0\).This vignette belongs to R package funtimes
. If you wish
to cite this page, please cite the package:
citation("funtimes")
#
# To cite package 'funtimes' in publications use:
#
# Lyubchich V, Gel Y, Vishwakarma S (2023). _funtimes: Functions for
# Time Series Analysis_. R package version 9.1.
#
# A BibTeX entry for LaTeX users is
#
# @Manual{,
# title = {funtimes: Functions for Time Series Analysis},
# author = {Vyacheslav Lyubchich and Yulia R. Gel and Srishti Vishwakarma},
# year = {2023},
# note = {R package version 9.1},
# }