psc-vignette

Richard Jackson

2024-11-19

Introduction

The psc.R package implements the methods for applying Personalised Synthetic Controls, which allows for patients receiving some experimental treatment to be compared against a model which predicts their reponse to some control. This is a form of causal inference which differes from othere approaches in that

Data are only required on a single treatment - all counterfactual evidence is supplied by a parametric model

Causal inference, in theory at least, is estimated at a patient level - as opposed to estimating average effects over a population

In its basic form, this method creates a likelihood to compare a cohort of data to a parametric model. See (X) for disucssion on it’s use as a causal inference tool. To use this package, two basic peices of information are required, a dataset and a model against which they can be compared.

In this vignette, we will detail how the psc.r package is constructed and give some examples for it’s application in practice.

To load the library use the code

Methodology

The pscfit function compares a dataset (‘DC’) against a parametric model. This is done by selecting a likelihood which is identified by the type of CFM that is supplied. At present, two types of model are supported, a flexible parmaeteric survival model of type ‘flexsurvreg’ and a geleneralised linear model of type ‘glm’.

Where the CFM is of type ‘flexsurvreg’ the likeihood supplied is of the form:

\[L(D∣\Lambda,\Gamma_i)=\prod_{i=1}^{n} f(t_i∣\Lambda,\Gamma_i)^{c_i} S(t_i∣\Gamma,\Lambda_i)^{(1−c_i)}\] Where \(\Gamma\) defines the cumulative baseline hazard function, \(\Lambda\) is the linear predictor and \(t\) and \(c\) are the event time and indicator variables.

Where the CFM is of the type ‘glm’ the likelihood supplied is of the form:

\[L(x∣\Gamma_i) = \prod_{i=1}^{n} b (x∣ \Gamma_i )\exp\{\Gamma_i t(x)−c(\Gamma_i)\}\]

Where \(b(.)\), \(t(.)\) and \(c(.)\) represent the functions of the exponential family. In both cases, \(\Gamma\) is defiend as:

\[ \Gamma_i = \gamma x_i+\beta \]

Where \(\gamma\) are the model coefficients supplied by the CFM and \(\beta\) is the parameter set to measure the difference between the CFM and the DC.

Estimation is performed using a Bayesian MCMC procedure. Prior distributions for \(\Gamma\) (& \(\Lambda\)) are derived directly from the model coefficients (mean and variance covariance matrix) or the CFM. A bespoke MCMC routine is performed to estimate \(\beta\). Please see ‘?mcmc’ for more detials.

For the standard example where the DC contains information from only a single treatment, trt need not be specified. Where comparisons between the CFM and multiple treatments are require, a covariate of treamtne allocations must be specified sperately (using the ‘trt’ option).

Package Structure

The main function for using applying Personal Synthetic Controls is the pscfit() function which has two inputs, a Counter-Factual Model (CFM) and a data cohort (DC). Further arguments include

psc object

The output of the “pscfit()” function is an object of class ‘psc’. This class contains the following attributes

Postestimation functions

basic post estimation functions have been developed to work with the psc object, namely “print()”, “coef()”, “summary()” and “plot()”. For the first three of these these provided basic summaries of the efficacy parameter obtained from the posterior distribution.

Data

The psc.r package includes as example, a dataset which is derived from patients with advanced Hepatocellular Carcinoma (aHCC) who have all received some experimental treatment. The dataset is simply named ‘data’ and is loaded into the enviroment using the “data()” function

data <- psc::data

Included is a list of prognostic covariates:

Also included are the following structures

Lastly the dataset also inlclude a ‘trt’ variable to be used in the estimation of multiple treatment comparisons.

Examples

Survival

For an example with a survival outcome a model must be supplied which is contructed ont he basis of flexible parametric splines. This is contructed using the “flexsurvreg” function within the “flexsurv” package. An example is included within the ‘psc.r’ package names ‘surv.mod’ and is loaded using the ’data()” function:

surv.mod <- psc::surv.mod
surv.mod
#> Call:
#> flexsurvspline(formula = Surv(time, cen) ~ vi/age60 + ecog + 
#>     allmets + logafp + alb + logcreat + logast + aet, data = pros, 
#>     k = 3)
#> 
#> Estimates: 
#>              data mean  est       L95%      U95%      se        exp(est)
#> gamma0             NA   -7.57573  -9.90310  -5.24836   1.18746        NA
#> gamma1             NA    2.23506   1.37951   3.09060   0.43651        NA
#> gamma2             NA   -0.13758  -0.74322   0.46807   0.30901        NA
#> gamma3             NA    0.26981  -0.72101   1.26063   0.50553        NA
#> gamma4             NA   -0.03031  -0.60738   0.54676   0.29443        NA
#> viyes         0.28400    0.32856   0.08489   0.57224   0.12433   1.38897
#> ecog1         0.40400    0.45602   0.23429   0.67776   0.11313   1.57779
#> allmetsyes    0.69400    0.29767   0.03666   0.55869   0.13317   1.34672
#> logafp        5.42436    0.08320   0.05038   0.11602   0.01674   1.08676
#> alb          38.55600   -0.05539  -0.07809  -0.03269   0.01158   0.94612
#> logcreat      4.30929    0.71084   0.26457   1.15712   0.22770   2.03571
#> logast        4.08274    0.35000   0.15915   0.54086   0.09738   1.41907
#> aetHCV        0.21200   -0.52754  -0.86299  -0.19210   0.17115   0.59005
#> aetOther      0.37000   -0.01847  -0.26319   0.22624   0.12486   0.98170
#> vino:age60    0.06400   -0.02312  -0.03479  -0.01144   0.00596   0.97715
#> viyes:age60  -0.37000    0.00716  -0.00790   0.02221   0.00768   1.00718
#>              L95%      U95%    
#> gamma0             NA        NA
#> gamma1             NA        NA
#> gamma2             NA        NA
#> gamma3             NA        NA
#> gamma4             NA        NA
#> viyes         1.08859   1.77223
#> ecog1         1.26401   1.96946
#> allmetsyes    1.03734   1.74838
#> logafp        1.05167   1.12301
#> alb           0.92488   0.96784
#> logcreat      1.30286   3.18076
#> logast        1.17251   1.71748
#> aetHCV        0.42190   0.82523
#> aetOther      0.76860   1.25388
#> vino:age60    0.96581   0.98862
#> viyes:age60   0.99214   1.02246
#> 
#> N = 500,  Events: 360,  Censored: 140
#> Total time at risk: 5357.664
#> Log-likelihood = -1221.857, df = 16
#> AIC = 2475.714

In this example you can see that this is a model constructed with 3 internal knots and hence 5 parameters to describe the baseline cumulative hazard function. There are also prognostic covariates which match with the prognostic covaraites in the data cohort.

To begin it is worth looking at the performance of the model and looking at how the survival of patietns in the data cohort compare

sfit <- survfit(Surv(data$time,data$cen)~1)
plot(sfit)

Comparing the dataset to the model is then performed using

and we can view the attributes of the psc object that is created

attributes(surv.psc)
#> $names
#> [1] "model.type" "DC_clean"   "posterior" 
#> 
#> $class
#> [1] "psc"

For example to view the matrix contianing the draws of the posterior distribution we use

surv.post <- surv.psc$posterior
head(surv.post)
#>      gamma0   gamma1     gamma2       gamma3      gamma4      viyes     ecog1
#> 1 -7.575727 2.235057 -0.1375763  0.269810877 -0.03030718 0.32856278 0.4560236
#> 2 -6.780683 2.094588 -0.2416707  0.426359842 -0.11427285 0.46420313 0.4125872
#> 3 -8.166595 2.854355  0.3246109 -0.177704366 -0.05634337 0.24922591 0.4867196
#> 4 -7.237902 2.028600 -0.2078886  0.390438652 -0.12874138 0.08494123 0.2677455
#> 5 -9.221392 2.506457 -0.2823084  0.782952623 -0.50836239 0.16106053 0.6454054
#> 6 -6.708755 3.376205  0.2451198 -0.001670286 -0.10323773 0.46205333 0.3221144
#>   allmetsyes     logafp         alb  logcreat    logast     aetHCV    aetOther
#> 1  0.2976742 0.08319787 -0.05538982 0.7108430 0.3500019 -0.5275438 -0.01847276
#> 2  0.3755551 0.07327062 -0.08238572 0.8399776 0.2698487 -0.7031579 -0.13708391
#> 3  0.3904685 0.10749238 -0.03721428 0.6358550 0.3696237 -0.6729424  0.09611035
#> 4  0.3520148 0.13297722 -0.07874917 0.8787975 0.2696924 -0.6389370  0.07840792
#> 5  0.2011027 0.05252571 -0.06009978 1.0333564 0.4499503 -0.4155221 -0.18416553
#> 6  0.1499597 0.08507176 -0.06237660 0.7015127 0.1486670 -0.5239318 -0.08720093
#>    vino:age60 viyes:age60      beta      DIC
#> 1 -0.02311823 0.007156047 0.3784481       NA
#> 2 -0.03147339 0.007768860 0.3784481 283.6776
#> 3 -0.02487429 0.011579194 0.3784481 285.5835
#> 4 -0.02828730 0.003334481 0.3784481 290.8281
#> 5 -0.01967313 0.006422261 0.3784481 289.1245
#> 6 -0.01954534 0.018346008 0.3784481 285.2520

Inspection will show that there is a column for each parameter in the original model as well as ‘beta’ and ‘DIC’ vcolumns which give teh posterior estiamtes for \(\beta\) and the Deviance Informaiton Criterion respectively.

We can inspect the poterior distribution using the autocorrelation function, trace and stardard summary statistics:

Autocorrelation

acf(surv.post$beta)

Trace

plot(surv.post$beta,typ="s")

Summary

summary(surv.post$beta)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#> -0.1933  0.2672  0.3470  0.3475  0.4186  0.8164

Alternatively the standard post estimation functions can be used

print(surv.psc)
#> Counterfactual Model (CFM): 
#> A model of class 'flexsurvreg' 
#>  Fit with 3 internal knots
#> 
#> Formula: 
#> Surv(time, cen) ~ vi/age60 + ecog + allmets + logafp + alb + 
#>     logcreat + logast + aet
#> <environment: 0x11a0ddac0>
#> 
#> Call:
#>  CFM model + beta
#> 
#> Coefficients:
#>       median    2.5%      97.5%     Pr(x<0)   Pr(x>0) 
#> beta    0.3470    0.1069    0.5933    0.0038    0.9962
#> DIC   281.0320  273.6470  293.2567        NA        NA
coef(surv.psc)
#>           median        2.5%       97.5% Pr(x<0) Pr(x>0)
#> beta   0.3469552   0.1068934   0.5932915  0.0038  0.9962
#> DIC  281.0320206 273.6469902 293.2567024      NA      NA
summary(surv.psc)
#> Summary: 
#>  
#> 100 observations selected from the data cohort for comparison 
#> CFM of type flexsurvreg identified  
#> linear predictor succesfully obtained with a median of  3.15 
#> Average expected response: 9.1 
#> Average observed response: 6.366 
#> 
#> Counterfactual Model (CFM): 
#> A model of class 'flexsurvreg' 
#>  Fit with 3 internal knots
#> 
#> Formula: 
#> Surv(time, cen) ~ vi/age60 + ecog + allmets + logafp + alb + 
#>     logcreat + logast + aet
#> <environment: 0x11a0ddac0>
#> 
#> Call:
#>  CFM model + beta
#> 
#> Coefficients:
#>       median    2.5%      97.5%     Pr(x<0)   Pr(x>0) 
#> beta    0.3470    0.1069    0.5933    0.0038    0.9962
#> DIC   281.0320  273.6470  293.2567        NA        NA

To visualise the original model and the fit of the data, the plot function has been developed

plot(surv.psc)

GLM

The “pscfit()” object uses class of the model supplied to derive the likelihoiod and estimation procedure that is required. In this example, the “enrichwith” package is utilised to extract from the model the parameters of the exponential family. Important from the attributes of the GLM are the “family” statements which dictates the form of the likelihood. For each of the binary, continuous and Count data outcomes then, the syntax and the DC remains the same and it is the form of the CFM that dictates the analysis

Binary

Step 1: Load Model

bin.mod <- psc::bin.mod
bin.mod
#> 
#> Call:  glm(formula = event ~ vi/age60 + ecog + allmets + logafp + alb + 
#>     logcreat, family = "binomial", data = pros)
#> 
#> Coefficients:
#> (Intercept)        viyes        ecog1   allmetsyes       logafp          alb  
#>    -1.65694      0.80007      0.38279      0.40733      0.12941     -0.04993  
#>    logcreat   vino:age60  viyes:age60  
#>     0.76346     -0.01901      0.01380  
#> 
#> Degrees of Freedom: 570 Total (i.e. Null);  562 Residual
#>   (17 observations deleted due to missingness)
#> Null Deviance:       683 
#> Residual Deviance: 616   AIC: 634

Step 2: Fit psc object

Step 3: Review summary statistics

summary(psc.bin)
#> Summary: 
#>  
#> 100 observations selected from the data cohort for comparison 
#> CFM of type glm lm identified  
#> linear predictor succesfully obtained with a median of  1.2 
#> Average expected response: 0.755 
#> Average observed response: 0.48 
#> 
#> Counterfactual Model (CFM): 
#> A model of class 'GLM' 
#>  Family: binomial 
#>  Link: logit 
#> 
#> Formula: 
#> event ~ vi/age60 + ecog + allmets + logafp + alb + logcreat
#> 
#> Call:
#>  CFM model + beta
#> 
#> Coefficients:
#>       median   2.5%     97.5%    Pr(x<0)  Pr(x>0)
#> beta  -1.2148  -1.7068  -0.7058   1.0000   0.0000
#> DIC   56.7748  52.8003  64.5599       NA       NA

Step 4: Plot Output

plot(psc.bin)

Count

Step 1: Load Model

count.mod <- psc::count.mod
count.mod
#> 
#> Call:  glm(formula = count ~ ecog + logafp + alb + logcreat, family = "poisson", 
#>     data = data)
#> 
#> Coefficients:
#> (Intercept)         ecog       logafp          alb     logcreat  
#>    1.414244    -0.170884     0.017803    -0.007255    -0.056183  
#> 
#> Degrees of Freedom: 99 Total (i.e. Null);  95 Residual
#> Null Deviance:       127.2 
#> Residual Deviance: 125.1     AIC: 382.1
#data("count.mod")

Step 2: Fit psc object

Step 3: Review summary statistics

summary(psc.count)
#> Summary: 
#>  
#> 100 observations selected from the data cohort for comparison 
#> CFM of type glm lm identified  
#> linear predictor succesfully obtained with a median of  0.919 
#> Average expected response: 2.49 
#> Average observed response: 2.49 
#> 
#> Counterfactual Model (CFM): 
#> A model of class 'GLM' 
#>  Family: poisson 
#>  Link: log 
#> 
#> Formula: 
#> count ~ ecog + logafp + alb + logcreat
#> 
#> Call:
#>  CFM model + beta
#> 
#> Coefficients:
#>       median     2.5%       97.5%      Pr(x<0)    Pr(x>0)  
#> beta   0.001353  -0.224721   0.166299   0.494400   0.505600
#> DIC   14.303257  10.210796  24.531866         NA         NA

Step 4: Plot Output

plot(psc.count)

Continuous

Step 1: Load Model

cont.mod <- psc::cont.mod
cont.mod
#> 
#> Call:  glm(formula = os ~ vi/age60 + ecog + allmets + logafp + alb, 
#>     data = pros)
#> 
#> Coefficients:
#> (Intercept)        viyes        ecog1   allmetsyes       logafp          alb  
#>     2.16581     -2.57037     -2.97107     -2.13909     -0.37175      0.36754  
#>  vino:age60  viyes:age60  
#>     0.09174      0.02106  
#> 
#> Degrees of Freedom: 570 Total (i.e. Null);  563 Residual
#>   (17 observations deleted due to missingness)
#> Null Deviance:       33570 
#> Residual Deviance: 25300     AIC: 3803
#data("cont.mod")

Step 2: Fit psc object

Step 3: Review summary statistics

summary(psc.con)
#> Summary: 
#>  
#> 100 observations selected from the data cohort for comparison 
#> CFM of type glm lm identified  
#> linear predictor succesfully obtained with a median of  10.8 
#> Average expected response: 10.623 
#> Average observed response: 10.743 
#> 
#> Counterfactual Model (CFM): 
#> A model of class 'GLM' 
#>  Family: gaussian 
#>  Link: identity 
#> 
#> Formula: 
#> os ~ vi/age60 + ecog + allmets + logafp + alb
#> 
#> Call:
#>  CFM model + beta
#> 
#> Coefficients:
#>       median      2.5%        97.5%       Pr(x<0)     Pr(x>0)   
#> beta   9.869e-02  -6.574e-01   8.769e-01   3.772e-01   6.228e-01
#> DIC   -6.796e+03  -6.898e+03  -6.665e+03          NA          NA

Step 4: Plot Output

plot(psc.con)

Sub group Effects

An attractive feature of Personalised Synthetic Controls is its use in fitting sub-group effects. Whereas other casual inference tools typically make assumptions about population levels of balance and then further assume that this balance holds at sub-group levels, Personalised Synthetic Controls differ in that they estimate treatment effects at a patient level and then average across populations. To estimate sub-group effects then we need only to restrict estimation over some sub-group of the population. This can be achived in 2 ways;

Using an example where we wnat to see if the treatment effect is consistent by patients with ECOG=0 and ECOG =1

Sub group effects by restricting the population

summary(sub1)
#> Summary: 
#>  
#> 53 observations selected from the data cohort for comparison 
#> CFM of type flexsurvreg identified  
#> linear predictor succesfully obtained with a median of  2.84 
#> Average expected response: 12.605 
#> Average observed response: 8.135 
#> 
#> Counterfactual Model (CFM): 
#> A model of class 'flexsurvreg' 
#>  Fit with 3 internal knots
#> 
#> Formula: 
#> Surv(time, cen) ~ vi/age60 + ecog + allmets + logafp + alb + 
#>     logcreat + logast + aet
#> <environment: 0x12d5a4df8>
#> 
#> Call:
#>  CFM model + beta
#> 
#> Coefficients:
#>       median    2.5%      97.5%     Pr(x<0)   Pr(x>0) 
#> beta    0.2797   -0.0595    0.5951    0.0378    0.9622
#> DIC   150.6530  144.9974  159.7506        NA        NA
summary(sub2)
#> Summary: 
#>  
#> 47 observations selected from the data cohort for comparison 
#> CFM of type flexsurvreg identified  
#> linear predictor succesfully obtained with a median of  3.423 
#> Average expected response: 8.099 
#> Average observed response: 5.856 
#> 
#> Counterfactual Model (CFM): 
#> A model of class 'flexsurvreg' 
#>  Fit with 3 internal knots
#> 
#> Formula: 
#> Surv(time, cen) ~ vi/age60 + ecog + allmets + logafp + alb + 
#>     logcreat + logast + aet
#> <environment: 0x11ad399a8>
#> 
#> Call:
#>  CFM model + beta
#> 
#> Coefficients:
#>       median     2.5%       97.5%      Pr(x<0)    Pr(x>0)  
#> beta    0.42724    0.08917    0.79416    0.00920    0.99080
#> DIC   100.84940   95.35920  110.01908         NA         NA