distfreereg
PackageThe procedure implemented by distfreereg()
relies on
asymptotic behavior for accurate results. In particular, when the
assumptions of the procedure are met, the asymptotic distribution of the
observed statistic is equal to the distribution of the simulated
statistics. How large the sample size must be for this asymptotic
behavior to be sufficiently achieved is dependent on the details of each
application. To facilitate the exploration of sample size requirements,
distfreereg
provides the function
compare()
.
In the following example, we explore the required sample size when the true mean is \[f(X;\theta) = \theta_1 + \theta_2X_1 + \theta_3X_2,\] \(\theta=(2,5,-1)\), and the errors are independent standard normal errors. The covariate matrix is generated randomly. We start with a sample size of 10.
set.seed(20240913)
n <- 10
func <- function(X, theta) theta[1] + theta[2]*X[,1] + theta[3]*X[,2]
theta <- c(2,5,-1)
X <- matrix(rexp(2*n, rate = 1), nrow = n)
comp_dfr <- compare(theta = theta, true_mean = func, test_mean = func,
true_X = X, true_covariance = list(Sigma = 3), X = X,
covariance = list(Sigma = 3),
theta_init = rep(1, length(theta)))
This function generates a response vector Y
from
true_mean
using the given values of true_X
,
true_covariance
(by default, the errors are multivariate
normal), and theta
. It then passes Y
and most
of the remaining arguments to distfreereg()
. The observed
statistics and p-values corresponding to each simulated data set are
saved, and the process is repeated. The results are returned in an
object of class compare
.
These results can be explored graphically. Several options are
available. The default behavior of the compare
method for
plot()
produces a comparison of estimated cumulative
distribution functions (CDFs) for the observed and simulated statistics.
If the sample size is sufficiently large, these two curves should be
nearly identical.
This plot gives some cause for concern.
Another way of exploring this uses quantile–quantile plots. Below is a Q–Q plot that compares observed and simulated statistics. If the sample size is sufficiently large, these points should lie on the line \(y=x\).
The curvature indicates an insufficient sample size. Similar to this is the next plot, which is a Q–Q plot comparing p-values to uniform quantiles. Once again, if the sample size is sufficiently large, these points should lie on the line \(y=x\).
These plots all indicate that this sample size is insufficient for the asymptotic behavior for both statistics.
More details on this plot method are discussed here.
The compare
method for ks.test()
, which
compares the observed and simulated distributions, confirms that they
are still different:
##
## Asymptotic two-sample Kolmogorov-Smirnov test
##
## data: comp_dfr[["observed_stats"]][["KS"]] and comp_dfr[["mcsim_stats"]][["KS"]]
## D = 0.1215, p-value = 4.409e-12
## alternative hypothesis: two-sided
The default statistic is whatever is the first statistic appearing in the supplied object:
## [1] "KS" "CvM"
##
## Asymptotic two-sample Kolmogorov-Smirnov test
##
## data: comp_dfr[["observed_stats"]][["CvM"]] and comp_dfr[["mcsim_stats"]][["CvM"]]
## D = 0.1276, p-value = 2.783e-13
## alternative hypothesis: two-sided
Let us repeat the simulation with a sample size of 100.
n_2 <- 100
X_2 <- matrix(rexp(2*n_2, rate = 1), nrow = n_2)
comp_dfr_2 <- update(comp_dfr, true_X = X_2, X = X_2)
The following plots indicate that this sample size is sufficient.
A formal test confirms this assessment.
##
## Asymptotic two-sample Kolmogorov-Smirnov test
##
## data: comp_dfr_2[["observed_stats"]][["KS"]] and comp_dfr_2[["mcsim_stats"]][["KS"]]
## D = 0.0231, p-value = 0.7171
## alternative hypothesis: two-sided
The examples using compare()
above all illustrate
behavior of the function when the null hypothesis is true. It is also
important to investigate the behavior when it is false to learn about
the test’s power against particular alternative hypotheses. Using the
example in Comparing Observed
and Simulated Statistics, consider the case in which the true mean
function is quadratic in \(X_2\): \(f(x;\theta) = \theta_0 + \theta_1x_1 + \theta_2x_2
+ x_2^2\). Having verified above that a sample size of 100 is
sufficient for the asymptotic behavior to be sufficiently approximated,
this can be investigated as follows.
alt_func <- function(X, theta) theta[1] + theta[2]*X[,1] + theta[3]*X[,2] + 0.5*X[,2]^2
comp_dfr_3 <- update(comp_dfr_2, true_mean = alt_func, true_covariance = list(Sigma = 3))
As hoped, these curves are clearly different. Power estimates can be
obtained with rejection()
.
## stat alpha rate mcse
## 1 KS 0.05 0.853 0.003541059
## 2 CvM 0.05 0.863 0.003438473
Power for different significance levels can be found by changing the
alpha
argument.
## stat alpha rate mcse
## 1 KS 0.01 0.653 0.004760158
## 2 CvM 0.01 0.697 0.004595552