Introduction

unmconf is a package that employs a fully Bayesian hierarchical framework for modeling under the presence of unmeasured confounders using JAGS (Just Another Gibbs Sampler). The package handles both internal and external validation scenarios for up to two unmeasured confounders. Bayesian data analysis can be summarized in the following four steps: specifying the data model and prior, estimating model parameters, evaluating sampling quality and model fit, and summarizing and interpreting results. For users new to Markov Chain Monte Carlo (MCMC) software who wish to implement models involving unmeasured confounding, challenges arise in regard to understanding the syntax and how these programs handle missing data. The primary objective of unmconf is to address these challenges by creating a function that resembles glm() on the front end, while seamlessly implementing the necessary JAGS code on the back end. Functions are implemented to simplify the workflow using this model by acquiring data, modeling data, conducting diagnostics testing on the model, and analyzing results. With this package, users can perform robust fully Bayesian analyses, even without previous familiarity with JAGS syntax or data processing intricacies.

Bayesian Unmeasured Confounding Model

For the statistical model, we denote the continuous or discrete response \(Y\), the binary main exposure variable \(X\), the vector of \(p\) other perfectly observed covariates \(\boldsymbol{C}\), and the unmeasured confounder(s) relating to both \(Y\) and \(X\) \(U\). In the event of more unmeasured covariates, we denote them \(U_{1}\), \(U_{2}\), and so forth; these unmeasured confounders can be either binary or normally distributed.

In the scenario where there is a single unmeasured confounder, the joint distribution can be factorized as \(f(y, u|x, \boldsymbol{c}) = f(y|x, u, \boldsymbol{c}) f(u|x, \boldsymbol{c})\). For the second unmeasured confounder, the joint distribution can be factorized as \(f(y, u_{1}, u_{2}|x, \boldsymbol{c}) = f(y|x, \boldsymbol{c}, u_{1}, u_{2}) f(u_{1}|x, \boldsymbol{c}, u_{2}) f(u_{2}|x, \boldsymbol{c})\), giving the Bayesian unmeasured confounding model: \[\begin{equation}\label{unmconf-two-unmeasured-confounders} \begin{split} Y|x, \boldsymbol{c}, u_{1}, u_{2} & \ \ \sim \ \ D_{Y}(g_Y^{-1}(\boldsymbol{v}'\boldsymbol{\beta} +\boldsymbol{\lambda}'\boldsymbol{u}), \ \xi_y) \\ U_{1}|x, \boldsymbol{c}, u_{2} & \ \ \sim \ \ D_{U_{1}}(g_{U_1}^{-1}(\boldsymbol{v}'\boldsymbol{\gamma} + \zeta u_{2}), \ \xi_{u_{1}}) \\ U_{2}|x, \boldsymbol{c} & \ \ \sim \ \ D_{U_{2}}(g_{U_2}^{-1}(\boldsymbol{v}'\boldsymbol{\delta}), \ \xi_{u_{2}}), \end{split} \end{equation}\]

where \(\boldsymbol{v}' = [x~\boldsymbol{c}']'\) denotes the vector of the main exposure variable and all of the perfectly observed covariates and \(\boldsymbol{u}' = [\boldsymbol{u_1}, \boldsymbol{u_2}]'\) denotes the vector of two unmeasured confounders in the response model. This model is completed by the specification of a link function \(g_*\) and some family of distributions \(D_{*}\). Additional parameters for certain distributions–if any–are denoted \(\xi_{y}, \xi_{u_1}, \xi_{u_2}\). Examples of these would be \(\sigma^2\) for the variance of a normal distribution or \(\alpha\) for the shape parameter in the gamma distribution. unmconf allows for the user to work with a response from the normal, Poisson, gamma, or binomial distribution and unmeasured confounder(s) from the normal or binomial distribution. The package supports identity (normal), log (Poisson or gamma), and logit (Bernoulli) link functions.

Prior Structure

Prior distributions for the model parameters will be jointly defined as \(\pi(\boldsymbol{\theta})\), where \(\boldsymbol{\theta} = (\boldsymbol{\beta}, \boldsymbol{\lambda}, \boldsymbol{\gamma}, \zeta, \boldsymbol{\delta}, \boldsymbol{\xi})\). The default prior structure is weakly informative. The regression coefficients have a relatively non-informative prior with a mean of 0 and precision (inverse of the variance) of 0.1 when the response is discrete. When the response is continuous, the regression coefficients have a relatively non-informative prior with a mean of 0 and precision of 0.001. To further customize the analysis, users can specify custom priors using the priors argument within the modeling function, unm_glm(). The format for specifying custom priors is c("{parameter}[{covariate}]" = "{distribution}"). An example of this is below.

Families specified for the response and unmeasured confounder(s) may present nuisance parameters, necessitating the inclusion of their prior distributions as well. The precision parameter, \(\tau_*\), on a normal response or normal unmeasured confounder will have a Gamma(0.001, 0.001) as the default prior. Priors can also be elicited in terms of \(\sigma_*\) or \(\sigma_*^2\) through priors. The nuisance parameter, \(\alpha_y\), for a gamma response has a gamma distribution as the prior with both scale and rate set to 0.1. The aforementioned nuisance parameters are tracked and posterior summaries are provided as a default setting in the package, but this can be modified.

Generating Data

The function runm() is used to generate data. In the workflow of this package, this function is not required to use if one wishes to analyze a data set of interest. runm() can be used to perform a simulation study, if desired, and data is generated as follows. The perfectly measured covariates and unmeasured confounder(s) are generated independently of one another. The user can specify these variables’ families and their respective distributions as named lists in runm(). Then, the data frame will be generated using the named vector of response model coefficients, \(\boldsymbol{\beta}_R\), and treatment model coefficients, \(\boldsymbol{\beta}_T\), that the user provides in the function. runm() will model the following:

\[\begin{equation} \begin{split} \boldsymbol{C} & \ \ \sim \ \ D_{\boldsymbol{C}}(g^{-1}_{\boldsymbol{C}}(\theta_{\boldsymbol{C}})) \\ U_1 & \ \ \sim \ \ D_{U_1}(g^{-1}_{U_1}(\theta_{U_1}), \xi_{u_1}) \\ U_2 & \ \ \sim \ \ D_{U_2}(g^{-1}_{U_2}(\theta_{U_2}), \xi_{u_2}) \\ X | \boldsymbol{c}, u_{1}, u_{2} & \ \ \sim \ \ \text{Bernoulli}(\text{logit}^{-1}(\boldsymbol{w}'\boldsymbol{\beta}_T)) \\ Y| x, \boldsymbol{c}, u_{1}, u_{2} & \ \ \sim \ \ D_{Y}(g^{-1}_Y(\boldsymbol{z}'\boldsymbol{\beta_R}), \ \xi_y), \end{split} \end{equation}\]

where \(\boldsymbol{w}' = [\boldsymbol{c}'~u_1~u_2]'\) and \(\boldsymbol{z}' = [x~\boldsymbol{c}'~u_1~u_2]'\). All arguments in runm() have a default value assigned other than \(n\). So, in its simplest form, one can generate a data set consisting of 100 observations by calling runm(100). The default arguments can be customized to the user’s preference if there is a desired data generation structure. An example of internal and external validation data is provided.

Internal Validation

When the validation type is internal (default), the data will first be generated from some sample size n. Internal validation is available when the unmeasured confounder is ascertained for a, typically, small subset of the main study data. In certain cases, only internal validation data may be accessible and information on the unmeasured confounder is only known for a subset of the patients. A researcher can vary the argument missing_prop to determine how large validation studies need to be in a simulation. missing_prop will set the assigned proportion of the unmeasured confounder’s observations to NA. A more detailed example is below, assigned df, and will be the data frame used throughout the remainder of this vignette.

library("unmconf")
library("bayesplot")
library("ggplot2"); theme_set(theme_minimal())

set.seed(13L)
df <- 
  runm(n = 100,
       type = "int", 
       missing_prop = .75,
       covariate_fam_list = list("norm", "bin", "norm"),
       covariate_param_list = list(c(mean = 0, sd = 1), c(.3), c(0, 2)),
       unmeasured_fam_list = list("norm", "bin"),
       unmeasured_param_list = list(c(mean = 0, sd = 1), c(.3)),
       treatment_model_coefs = 
         c("int" = -1, "z1" = .4, "z2" = .5, "z3" = .4, 
           "u1" = .75, "u2" = .75),
       response_model_coefs =
         c("int" = -1, "z1" = .4, "z2" = .5, "z3" = .4,
           "u1" = .75, "u2" = .75, "x" = .75),
       response = "norm",
       response_param = c("si_y" = 1))

rbind(head(df, 5), tail(df, 5))
#> # A tibble: 10 × 7
#>         z1    z2     z3     u1    u2     x        y
#>      <dbl> <int>  <dbl>  <dbl> <int> <int>    <dbl>
#>  1  0.554      0 -5.69   0.614     1     0 -2.97   
#>  2 -0.280      1  3.43   0.413     0     0 -1.28   
#>  3  1.78       1 -2.46  -0.459     1     0 -0.993  
#>  4  0.187      0 -0.628 -0.673     0     0 -2.36   
#>  5  1.14       0 -0.140  0.193     0     1  0.230  
#>  6 -0.256      0  1.00  NA        NA     0  0.527  
#>  7 -1.23       0 -0.866 NA        NA     0 -0.479  
#>  8  0.214      1 -0.680 NA        NA     0 -0.822  
#>  9  0.0672     0 -4.11  NA        NA     0 -2.92   
#> 10  0.857      1 -0.360 NA        NA     0  0.00750

External Validation

For external validation, the sample size argument can be a vector of length 2 to represent the number of observations in the main study data and external validation data, respectively. The sample size argument can also be of length 1, where the sample size will be split in half for the two types of data (main study data will obtain the additional observation if n is odd). The main study data has the unmeasured confounders completely missing for all observations. The external data fully observes the unmeasured confounders but typically has no reference on the exposure-unmeasured confounder relationship. Thus, informative priors on the bias parameters for this relationship are needed to achieve convergence. That is, in the above model, \(\gamma_x\) and \(\delta_x\).

If a user has a main study data set and an external validation data set that are collected and separate from one another, they can be joined together through dplyr::bind_rows(). This function binds the data sets by row and keeps all of the columns, even if the columns do not match between data sets. Say that a researcher has a main study data set with a binary outcome, a binary treatment, three perfectly measured standard normal covariates, and two unmeasured confounders that are completely missing. Further, say that the researcher also has an external validation data set with the same outcome, two of the perfectly measured covariates from the main study data, and the two unmeasured confounders completely observed. If the third covariate that is missing from the external validation data is deemed unimportant for modeling, then one can still perform inference using this data. An example of this scenario is below

# Main Study Data
M <- tibble::tibble(
  "y" = rbinom(100, 1, .5),
  "x" = rbinom(100, 1, .3),
  "z1" = rnorm(100, 0, 1),
  "z2" = rnorm(100, 0, 1),
  "z3" = rnorm(100, 0, 1),
  "u1" = NA, 
  "u2" = NA
)

# External Validation Data
EV <- tibble::tibble(
  "y" = rbinom(100, 1, .5),
  "z1" = rnorm(100, 0, 1),
  "z2" = rnorm(100, 0, 1),
  "u1" = rnorm(100, 0, 1), 
  "u2" = rnorm(100, 0, 1)
)

df_ext <- dplyr::bind_rows(M, EV) |> 
  dplyr::mutate(x = ifelse(is.na(x), 0, x))

rbind(head(df_ext, 5), tail(df_ext, 5))
#> # A tibble: 10 × 7
#>        y     x     z1      z2      z3     u1     u2
#>    <int> <dbl>  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>
#>  1     0     1  0.761 -0.313  -0.952  NA     NA    
#>  2     1     0 -0.695 -0.942  -0.107  NA     NA    
#>  3     0     0  0.564  0.0399 -1.15   NA     NA    
#>  4     1     1  0.706  1.96   -0.744  NA     NA    
#>  5     1     0  0.418 -0.251  -0.0683 NA     NA    
#>  6     0     0  1.60  -0.232  NA       0.676 -0.584
#>  7     1     0 -1.41  -0.621  NA      -2.38   0.700
#>  8     1     0  0.988 -0.214  NA      -1.18  -0.206
#>  9     1     0 -0.561 -0.506  NA       0.474 -1.15 
#> 10     1     0  0.448  0.930  NA      -0.958 -0.533

Specifying and Fitting the Model

The main focus of this vignette should be around unm_glm(), which fits the posterior results of the Bayesian unmeasured confounding model through MCMC iterations. Upon acquiring all the relevant information, the unm_glm() function carries out two main tasks. Initially, it constructs a JAGS model and subsequently pre-processes the data for utilization by JAGS. This simplifies the process of performing a fully Bayesian analysis, as it spares users from the necessity of being familiar with JAGS syntax for both the model and data processing. The primary aim of the unmconf package, akin to rstanarm and brms, is to offer a user-friendly interface for Bayesian analysis, utilizing programming techniques familiar to R users. Users can input model information into unm_glm() in a similar manner as they would for the standard stats::glm() function, providing arguments like formula, family, and data.

The R language provides a straightforward syntax for denoting linear models, typically written as response ~ terms, where the coefficients are implicitly represented. Like other R functions for model fitting, users have the option to use the . - {vars} syntax instead of listing all predictors. For instance, if the user wants to model the first unmeasured confounder, smoking, given predictors age, weight, height, and salary, they can use either form2 = smoking ~ age + weight + height + salary or form2 = smoking ~ . - {response varaible}. To estimate the unmeasured confounder(s), we often model them conditioned on some or all of the perfectly measured covariates and the treatment. Once estimated, these unmeasured confounder(s) become predictors in the higher levels of the model structure. On the right-hand side of the ~, we define the linear combination of predictors that models the response variable. Additionally, the predictors can include polynomial regression (e.g.~ poly(z, 2)) and interactive effects (e.g.~x*z).

unm_glm() also accepts arguments that facilitate MCMC computation on the posterior distribution to be passed to coda.samples, such as such as n.iter, n.adapt, thin, and n.chains. The arguments specified have default values, but the user is encouraged to supply their own values given the lack of convergence that is sometimes observed when validation sample sizes are small or priors are particularly diffuse.

Internal Validation

For instance, consider the Bayesian unmeasured confounding model for the generated data set above, df, with a normal response, normal first unmeasured confounder, binary second unmeasured confounder, and three perfectly observed covariates. Further, let’s say that, through expert opinion, we have prior information that the effect on \(y\) from \(u_1\), \(\lambda_{u_1}\), is normally distributed with a mean of 0.5 and standard deviation of 1. JAGS parameterizes the normal distribution with precision rather than standard deviation, so we would use \(\tau_{u_1} = 1 / \sigma_{u_1}^2 = 1\) for the prior distribution \(\lambda_{u_1} \sim N(.5, \tau_{u_1} = 1)\). Using unm_glm(), we fit:

unm_mod <- 
  unm_glm(form1 = y ~ x + z1 + z2 + z3 + u1 + u2,     # y ~ .,
          form2 = u1 ~ x + z1 + z2 + z3 + u2,         # u1 ~ . - y,
          form3 = u2 ~ x + z1 + z2 + z3,              # u2 ~ . - y - u1,
          family1 = gaussian(),
          family2 = gaussian(),
          family3 = binomial(),
          priors = c("lambda[u1]" = "dnorm(.5, 1)"),
          n.iter = 10000, n.adapt = 4000, thin = 1,
          data = df)

External Validation

We fit the Bayesian unmeasured confounding model from the external validation data set above, df_ext. Suppose that we have knowledge on the relationship between the treatment effect and both unmeasured confounders through expert opinion, where each relationship is normally distributed. JAGS parameterizes the normal distribution with precision rather than standard deviation. So, from expert opinion, we get the prior distributions \(\gamma_x \sim N(1.1, 0.9)\) and \(\delta_x \sim N(1.1, 4.5)\). Using unm_glm(), we fit:

unm_mod_ext <- 
  unm_glm(form1 = y ~ x + z1 + z2 + u1 + u2,     # y ~ . - z3,
          form2 = u1 ~ x + z1 + z2 + u2,         # u1 ~ . - y - z3,
          form3 = u2 ~ x + z1 + z2,              # u2 ~ . - y - u1 - z3,
          family1 = binomial(),
          family2 = gaussian(),
          family3 = gaussian(),
          priors = c("gamma[x]" = "dnorm(1.1, 0.9)",
                     "delta[x]" = "dnorm(1.1, 4.5)"),
          n.iter = 4000, n.adapt = 2000, thin = 1,
          data = df_ext)

Export JAGS Code

By leveraging unm_glm(), users can conveniently implement complex Bayesian models, particularly those involving unmeasured confounders, without grappling with the intricacies of JAGS syntax or handling missing data. The Bayesian unmeasured confounding model structure of unmconf is currently set up to work with at most two unmeasured confounders. Instances may arise where the user may want to work with more than two unmeasured confounders. As an attempt to resolve this concern, the user can explicitly call the JAGS code that unm_glm() generates either through the argument code_only = TRUE in the function itself or through the separate function, jags_code(). With a starting point created by the model in unmconf, the user should be able to identify the syntax for JAGS code and thus add the layer(s) to the model structure and the respective prior distribution(s). Stopping at two unmeasured confounders allows for the package to run with confidence in the instance that individuals do not check for convergence and report the results of a poor model. Below shows both ways to extract a model’s JAGS code.

unm_glm(..., code_only = TRUE)
jags_code(unm_mod)

Evaluate Convergence and Model Fit

After the model is fit and before using the MCMC samples for inference, it is necessary for users assess whether the chains have converged appropriately. Hierarchical models with unmeasured confounding are often confronted with convergence issues. To aid in chain convergence, we heavily increased the burn-in length and MCMC iterations from the default values of 1000 and 2000 in the function unm_glm() to 6000 and 10000, respectively. Additional checks include the posterior kernel density plots appearing relatively smooth in shape and the trace, or history, plots of the chains should have very similar values across the iterations (i.e., they “mix” well and the chains intermingle).

The bayesplot package provides a variety of ggplot2-based plotting functions for use after fitting Bayesian models. bayesplot::mcmc_hist() plots a histogram of the MCMC draws from all chains, and bayesplot::mcmc_trace() performs a trace plot of the chains. Given that \(\beta_x\) is our parameter of interest, we only displayed the density and trace plots of this parameter below. The histogram appears smooth and without any jaggedness. The trace plot appears to mix well, as one cannot differentiate or identify patterns in the chains across the iterations for this model. Thus, there is no lack of convergence evident here. All parameters in the model upheld convergence standards.

bayesplot::mcmc_hist(unm_mod, pars = "beta[x]")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Histogram of the MCMC draws for the internal validation example. Smooth histogram is an indication of good convergence.

Histogram of the MCMC draws for the internal validation example. Smooth histogram is an indication of good convergence.

bayesplot::mcmc_trace(unm_mod, pars = "beta[x]")
Trace plot of the MCMC draws for the internal validation example. No patterns or diverging chains in the trace plot is an indication of good convergence.

Trace plot of the MCMC draws for the internal validation example. No patterns or diverging chains in the trace plot is an indication of good convergence.

Summarizing Results

Once the posterior samples have been computed, the unm_glm() function returns an R object as a list of the posterior samples, where the length of the list matches the number of chains. Calling the returned object explicitly, in our example unm_mod, will output the model’s call and a named vector of coefficients at each level of the Bayesian unmeasured confounding model. A more formal model summary comes through unm_summary(), which obtains and prints a summary table of the results. Every parameter is summarized by the mean of the posterior distribution along with its two-sided 95% credible intervals based on quantiles. When generating a data set via runm(), adding the data argument appends a column to the summary table consisting of the true parameter values that were assigned.

unm_summary(unm_mod, df) |>
  head(10)
#> # A tibble: 10 × 9
#>    param     true_value   mean     sd `2.5%`  `50%` `97.5%` r_hat   ess
#>    <chr>          <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl> <dbl> <dbl>
#>  1 beta[1]        NA    -0.962 0.274  -1.50  -0.958 -0.434   1.01 3370.
#>  2 beta[x]         0.75  0.995 0.423   0.163  0.993  1.82    1.00 3306.
#>  3 beta[z1]        0.4   0.511 0.203   0.121  0.508  0.916   1.00 4932.
#>  4 beta[z2]        0.5   0.261 0.384  -0.483  0.259  1.02    1.00 5874.
#>  5 beta[z3]        0.4   0.382 0.0957  0.191  0.383  0.567   1.01 3221.
#>  6 delta[1]       NA    -1.34  0.767  -2.96  -1.29   0.0779  1.00 1534.
#>  7 delta[x]       NA     1.07  1.03   -0.916  1.05   3.17    1.00 2098.
#>  8 delta[z1]      NA     0.365 0.588  -0.747  0.352  1.57    1.01 2679.
#>  9 delta[z2]      NA     1.41  1.19   -0.850  1.38   3.83    1.01 2094.
#> 10 delta[z3]      NA    -0.505 0.253  -1.05  -0.488 -0.0593  1.00 2696.

For model comparison, the deviance information criterion (DIC) and penalized expected deviance are provided through unm_dic(). DIC, a Bayesian version of AIC, is calculated by adding the “effective number of parameters” to the expected deviance and is computed through a wrapper around rjags::dic_samples().

As mentioned above, unmeasured confounders can be considered as parameters in the Bayesian paradigm and are therefore estimated when performing MCMC. Yet, unm_summary() does not track the estimate of these “parameters” from the model fit. For this, unm_backfill() pairs with the original data set to impute the missing values for the unmeasured confounders with the posterior estimates. The ten observations below come from df, where five of the unmeasured confounders are observed and five are unobserved in the internal validation data. Columns u1_observed and u2_observed are logical variables added to the original data frame to display which variables were originally observed versus imputed. Additionally, the values of u1 and u2 that respectively have FALSE in the previously mentioned columns are the imputed posterior estimates.

unm_backfill(df, unm_mod)[16:25, ]
#> # A tibble: 10 × 9
#>        z1    z2      z3      u1    u2     x      y u1_observed u2_observed
#>     <dbl> <int>   <dbl>   <dbl> <dbl> <int>  <dbl> <lgl>       <lgl>      
#>  1 -0.194     0 -1.97    0.0284     0     0 -1.53  TRUE        TRUE       
#>  2  1.40      0  0.568  -0.706      1     1  1.42  TRUE        TRUE       
#>  3  0.101     0  0.0589  0.635      0     1  0.430 TRUE        TRUE       
#>  4 -0.114     1  1.60    1.17       0     1  1.40  TRUE        TRUE       
#>  5  0.702     0 -2.61   -0.308      1     0  0.189 TRUE        TRUE       
#>  6  0.263     0  5.54   -0.704      0     1  0.442 TRUE        TRUE       
#>  7  1.84      0 -0.691  -1.27       1     0 -0.174 TRUE        TRUE       
#>  8  0.357     1  2.01    0.0956     0     0  2.89  TRUE        TRUE       
#>  9 -1.05      1  5.63   -0.447      1     1  1.55  TRUE        TRUE       
#> 10  0.620     0  1.66   -1.71       1     1  1.37  TRUE        TRUE

To visualize the results from the model fit, the quantile-based posterior credible intervals can be drawn using bayesplot::mcmc_intervals(). A credible interval plot from the worked example is below. This plots the credible intervals for all parameters from the posterior draws of all the chains. We modify the default setting for the outer credible interval to be 0.95 for comparison with the output results from unm_summary(). The light blue circle in the middle of each parameter’s interval portrays the posterior median. The bold, dark blue line displays the 50% credible interval and the thin, blue line covers the 95% credible interval. For simulation studies, where the true parameter values are known, calling the argument true_params will add a layer of light red circles to each credible interval to illustrate the true value used to generate the data. Here, the gamma and delta parameters do not have a true value because we generated the unmeasured confounders \(u_1\) and \(u_2\) as independent normal/Bernoulli random variables.

bayesplot::mcmc_intervals(unm_mod, prob_outer = .95, 
                          regex_pars = "(beta|lambda|gamma|delta|zeta).+") +
  geom_point(
    aes(value, name), data = tibble::enframe(attr(df, "params")) |>
      dplyr::mutate(name = gsub("int", "1", name)),
    color = "red", fill = "pink", size = 4, shape = 21
  )
A credible interval plot for the internal validation example. The light blue circle indicates the posterior median for each parameter, with the bold and thin blue lines displaying the 50% and 95% credible interval, respectively. Red circles are the true parameter values used in the simulation study.

A credible interval plot for the internal validation example. The light blue circle indicates the posterior median for each parameter, with the bold and thin blue lines displaying the 50% and 95% credible interval, respectively. Red circles are the true parameter values used in the simulation study.