Multi-Bias Examples

library(multibias)

Single Bias Example

We are interested in quantifying the effect of smoking (SMK) on coronary heart disease (CHD) using data from the Evans County Heart Study. Let’s inspect the dataframe we have for 609 participants aged 40 and older.

head(evans)
#>   ID CHD AGE CHL SMK ECG DBP SBP HPT
#> 1 21   0  56 270   0   0  80 138   0
#> 2 31   0  43 159   1   0  74 128   0
#> 3 51   1  56 201   1   1 112 164   1
#> 4 71   0  64 179   1   0 100 200   1
#> 5 74   0  49 243   1   0  82 145   0
#> 6 91   0  46 252   1   0  88 142   0

This is clearly not the most robust data, but, for purposes of demonstration, let’s proceed by pretending that our data was missing information on age. Let’s see what the resulting odds ratio estimate looks like without adjustment for age and with adding hypertension (HPT) as a known confounder.

biased_model <- glm(CHD ~ SMK + HPT,
                    family = binomial(link = "logit"),
                    data = evans)
or <- round(exp(coef(biased_model)[2]), 2)
or_ci_low <- round(exp(coef(biased_model)[2] -
                         1.96 * summary(biased_model)$coef[2, 2]), 2)
or_ci_high <- round(exp(coef(biased_model)[2] +
                          1.96 * summary(biased_model)$coef[2, 2]), 2)

print(paste0("Biased Odds Ratio: ", or))
#> [1] "Biased Odds Ratio: 1.99"
print(paste0("95% CI: (", or_ci_low, ", ", or_ci_high, ")"))
#> [1] "95% CI: (1.12, 3.55)"

This estimate, despite not adjusting for age, appears around what we would expect. In the IJE article The association between tobacco smoking and coronary heart disease the author notes: “Cigarette smokers have about twice as much coronary heart disease as non-smokers, whether measured by deaths, prevalence, or the incidence of new events.” However, we also know from studies like this BMJ meta-analysis that the effect estimate can vary greatly depending on the degree of cigarette consumption. We also observe a wide confidence interval due to the small sample size of the data. In real-world practice we would want to obtain better data that includes more observations, participant estimates of cigarettes smoked/day, and other important confounders including sex and race/ethnicity. In Epidemiology it is very rarely the case that we have the “perfect” data with everything we would like! We’ll proceed with the analysis knowing the limitations of the data.

Can we anticipate whether this biased odds ratio (without age-adjustment) is biased toward or away from the null? Let’s consider the association of the uncontrolled confounder with the exposure and outcome. In our data, age has a negative association with smoking (older people are less likely to be smokers) and a positive association with heart disease (older people are more likely to have CHD). These opposite associations are biasing the odds ratio towards the null, creating a distortion where those who are less likely to smoke are more likely to experience the outcome.

To adjust for AGE, where we’re assuming it’s missing in the data, let’s refer to the appropriate bias model for a binary uncontrolled confounder (we’re going to treat age as a binary indicator of over (1) or under (0) age 60):

It’s now time for the most difficult part of quantitative bias analysis - deriving bias parameters. Using all the information at our disposal, this is where we provide assumptions for how the relevant variables in our data are related to the uncontrolled confounder. This experience is comparable to that of providing priors in a Bayesian statistical analysis.

Starting with exposure, we’ll assume that smokers are half as likely to have an age >60 than non-smokers, those with CHD are 2.5x as likely to have an age>60 than those without CHD, and those with hypertension are 2x as likely to have an age>60 than those without HPT. To convert these relationships as parameters in the model, we’ll log-transform them from odds ratios (see below). Lastly, for the model intercept, we can use the following reasoning: what is the probability that a non-smoker (X=0) without CHD (Y=0) and HPT (C=0) is over age 60 in this population? We’ll say this is a 25% probability. We’ll use the ‘inverse logit’ function qlogis() from the stats package to convert this from a probability to the intercept coefficient of the logistic regression model.

u_0 <- qlogis(0.25)
u_x <- log(0.5)
u_y <- log(2.5)
u_c <- log(2)

Now let’s plug these bias parameters into adjust_uc() along with our data_observed object to obtain our bias-adjusted odds ratio.

df <- data_observed(
  data = evans,
  exposure = "SMK",
  outcome = "CHD",
  confounders = "HPT"
)

set.seed(1234)
adjust_uc(
  df,
  u_model_coefs = c(u_0, u_x, u_y, u_c)
)
#> $estimate
#> [1] 2.279489
#> 
#> $ci
#> [1] 1.261951 4.117491

We get an odds ratio of 2.16 (95% CI: 1.20, 3.91). This matches our expectation that the bias adjustment would pull the odds ratio away from the null. How does this result compare to the result we would get if age wasn’t missing in the data and was incorporated in the outcome regression? The results are below. It turns out that our bias-adjusted odds ratio using multibias is close to this complete-data odds ratio of 2.31.

full_model <- glm(CHD ~ SMK + HPT + AGE,
                  family = binomial(link = "logit"),
                  data = evans)
or <- round(exp(coef(full_model)[2]), 2)
or_ci_low <- round(exp(coef(biased_model)[2] -
                         1.96 * summary(full_model)$coef[2, 2]), 2)
or_ci_high <- round(exp(coef(biased_model)[2] +
                          1.96 * summary(full_model)$coef[2, 2]), 2)

print(paste0("Odds Ratio: ", or))
#> [1] "Odds Ratio: 2.31"
print(paste0("95% CI: (", or_ci_low, ", ", or_ci_high, ")"))
#> [1] "95% CI: (1.1, 3.6)"

Triple Bias Example

We are interested in quantifying the effect of exposure X on outcome Y. The causal system can be represented in the following directed acyclic graph (DAG):

The variables are defined:

It can be seen from this DAG that the data suffers from three sources of bias. There is uncontrolled confounding from (unobserved) variable U. The true exposure, X, is unobserved, and the misclassified exposure X* is dependent on both the exposure and outcome. Lastly, there is collider stratification at variable S since exposure and outcome both affect selection. The study naturally only assesses those who were selected into the study (i.e. those with S=1), which represents a fraction of all people in the source population from which we are trying to draw inference.

A simulated dataframe corresponding to this DAG, df_uc_emc_sel can be loaded from the multibias package.

head(df_uc_em_sel)
#>   Xstar Y C1 C2 C3
#> 1     1 0  0  0  1
#> 2     0 0  0  0  1
#> 3     0 0  0  0  0
#> 4     0 0  1  0  1
#> 5     1 0  0  0  1
#> 6     1 0  0  1  1

In this data, the true, unbiased exposure-outcome odds ratio (ORYX) equals ~2. However, when we run a logistic regression of the outcome on the exposure and confounders, we do not observe an odds ratio of 2 due to the multiple bias sources.

biased_model <- glm(Y ~ Xstar + C1 + C2 + C3, ,
                    family = binomial(link = "logit"),
                    data = df_uc_em_sel)
biased_or <- round(exp(coef(biased_model)[2]), 2)
print(paste0("Biased Odds Ratio: ", biased_or))
#> [1] "Biased Odds Ratio: 1.5"

The function adjust_uc_emc_sel() can be used here to “reconstruct” the unbiased data and return the exposure-outcome odds ratio that would be observed in the unbiased setting.

Models for the missing variables (U, X, S) are used to facilitate this data reconstruction. For the above DAG, the corresponding bias models are:

where j indicates the number of measured confounders.

To perform the bias adjustment, it is necessary to obtain values of these bias parameters. Potential sources of these bias parameters include internal validation data, estimates in the literature, and expert opinion. For purposes of demonstrating the methodology, we will obtain the exact values of these bias parameters. This is possible because for purposes of validation we have access to the data of missing values that would otherwise be absent in real-world practice. This source data is available in multibias as df_uc_emc_sel_source.

u_model <- glm(U ~ X + Y,
               family = binomial(link = "logit"),
               data = df_uc_em_sel_source)
x_model <- glm(X ~ Xstar + Y + C1 + C2 + C3,
               family = binomial(link = "logit"),
               data = df_uc_em_sel_source)
s_model <- glm(S ~ Xstar + Y + C1 + C2 + C3,
               family = binomial(link = "logit"),
               data = df_uc_em_sel_source)

In this example we’ll perform probabilistic bias analysis, representing each bias parameter as a single draw from a probability distribution. For this reason, we will run the analysis over 1,000 iterations with bootstrap samples to obtain a valid confidence interval. To improve performance we will run the for loop in parallel using the foreach() function in the doParallel package. Below we create a cluster, make a seed for consistent results, and specify the desired number of bootstrap repitions.

library(doParallel)

no_cores <- detectCores() - 1
registerDoParallel(cores = no_cores)
cl <- makeCluster(no_cores)

set.seed(1234)
nreps <- 1000
est <- vector(length = nreps)

Next we run the parallel for loop in which we apply the adjust_uc_em_sel() function to bootstrap samples of the df_uc_em_sel data. We specify the following arguments: the data, the exposure variable, the outcome variable, the confounder(s), the U model coefficients, the X model coefficients, and the S model coefficients. Since knowledge of the source data was known, the correct bias parameters can be applied. Using the results from the fitted bias models above, we’ll use Normal distribution draws for each bias parameter where the mean correponds to the estimated coefficient from the bias model and the standard deviation comes from the estimated standard deviation (i.e., standard error) of the coefficient in the bias model. Each loop iteration will now have slightly different values for the bias parameters, corresponding to our uncertainty in their estimates.

or <- foreach(i = 1:nreps, .combine = c,
              .packages = c("dplyr", "multibias")) %dopar% {

  df_sample <- df_uc_em_sel[sample(seq_len(nrow(df_uc_em_sel)),
                                   nrow(df_uc_em_sel),
                                   replace = TRUE), ]

  est[i] <- adjust_uc_em_sel(
    data_observed = data_observed(
      data = df_sample,
      exposure = "Xstar",
      outcome = "Y",
      confounders = c("C1", "C2", "C3")
    ),
    u_model_coefs = c(
      rnorm(1, mean = u_model$coef[1], sd = summary(u_model)$coef[1, 2]),
      rnorm(1, mean = u_model$coef[2], sd = summary(u_model)$coef[2, 2]),
      rnorm(1, mean = u_model$coef[3], sd = summary(u_model)$coef[3, 2])
    ),
    x_model_coefs = c(
      rnorm(1, mean = x_model$coef[1], sd = summary(x_model)$coef[1, 2]),
      rnorm(1, mean = x_model$coef[2], sd = summary(x_model)$coef[2, 2]),
      rnorm(1, mean = x_model$coef[3], sd = summary(x_model)$coef[3, 2]),
      rnorm(1, mean = x_model$coef[4], sd = summary(x_model)$coef[4, 2]),
      rnorm(1, mean = x_model$coef[5], sd = summary(x_model)$coef[5, 2]),
      rnorm(1, mean = x_model$coef[6], sd = summary(x_model)$coef[6, 2])
    ),
    s_model_coefs = c(
      rnorm(1, mean = s_model$coef[1], sd = summary(s_model)$coef[1, 2]),
      rnorm(1, mean = s_model$coef[2], sd = summary(s_model)$coef[2, 2]),
      rnorm(1, mean = s_model$coef[3], sd = summary(s_model)$coef[3, 2]),
      rnorm(1, mean = s_model$coef[4], sd = summary(s_model)$coef[4, 2]),
      rnorm(1, mean = s_model$coef[5], sd = summary(s_model)$coef[5, 2]),
      rnorm(1, mean = s_model$coef[6], sd = summary(s_model)$coef[6, 2])
    )
  )$estimate
}

Finally, we obtain the ORYX estimate and 95% confidence interval from the distribution of 1,000 odds ratio estimates. As expected, ORYX ~ 2, indicating that we were able to obtain an unbiased odds ratio from the biased data.

# odds ratio estimate
round(median(or), 2)
#> 2.02

# confidence interval
round(quantile(or, c(.025, .975)), 2)
#> 1.93 2.11