The R package penalizedclr provides an implementation of the penalized logistic regression model that can be used in the analysis of matched case-control studies. The implementation allows to apply different penalties to different blocks of covariates, and is therefore particularly useful in the presence of multi-source heterogenous data, such as multiple layers of omics measurements. Both L1 and L2 penalties are implemented. Additionally, the package implements stability selection to allow for variable selection in the considered regression model.
You can install the released version of penalizedclr from CRAN with:
And the development version from GitHub with:
Load the package with:
Assume that we have \(K\) independent matched case-control pairs \((Y_{ki}, X_{ki})\), where \(Y_{ki}\), \(k=1,\ldots,K;\) \(i=1,2,\) is a binary variable indicating case control status (1 if case, 0 if control) and \(X_{ki}\) is the associated \(p\)-dimensional vector of covariates. The conditional logistic regression models the probability of being a case given that the observation belongs to the \(k\)-th pair as: \[ {\mathrm {logit}}\left[P(Y_{ki}=1 \mid S=k)\right] = \beta_{0k} + \beta^{T}X_{ki}, \quad k\in\left\{1,\ldots, K\right\}, i\in\left\{1,2\right\} \] where \(S\) is the matched pair id, \(\beta_{0k}\) is the pair specific intercept and \(\beta=(\beta_1, \ldots,\beta_p)^T\) is a \(p\)-dimensional vector of fixed effects.
In the present package we:
estimate \(\beta\) in the high dimensional setting in which the number of covariates \(p\) is much higher than the number of pairs \(K\). We consider a penalized conditional logistic regression, which adds a penalty to the conditional log likelihood. Motivated by current medical applications considering clinical and molecular data, we allow \(X_{ki}\) to be a merge of heteregenous data sources;
perform stability selection to identify important variables, that is, variables for which \(\beta_j\neq 0\), \(j\in\left\{1,\ldots,p\right\}\).
In this section we provide examples of how to fit a penalized conditional regression model with source-specific penalty parameters and how to perform variable selection with penalizedclr.
Initial settings and libraries to be loaded:
We generate a simple multi-source data set, with two groups of covariates containing 12 and 40 variables, respectively. As specified above, each case is matched to a control, and the probability of being a case in each stratum (case-control pair) is obtained from the linear predictor. An intercept is generated independently for each stratum.
# two groups of predictors
p <- c(12, 40)
# percentage and number of non-null variables
p_nz <- c(0.5, 0.2)
m_nz <- round(p*p_nz, 0)
# number of different strata (case-control pairs)
K <- 100
# number of cases and controls in each stratum (not necessarily 1:1 matching,
# other designs are also allowed)
n_cases <- 1
n_ctrl <- 1
# generating covariates
X = cbind(matrix(rnorm(p[1] * K * (n_cases + n_ctrl), 0, 1), ncol = p[1]),
matrix(rnorm(p[2] * K * (n_cases + n_ctrl), 0, 4), ncol = p[2]))
# coefficients
beta <- as.matrix(c(rnorm(m_nz[1], 4, 1),
rep(0, p[1] - m_nz[1]),
rnorm(m_nz[2], 2, 0.8),
rep(0, p[2] - m_nz[2])), ncol = 1)
# stratum membership
stratum <- rep(1:K, each= n_cases+n_ctrl)
# linear predictor
lin_pred <- X %*% beta
prob_case <- exp(lin_pred) / (1 + exp(lin_pred))
# generate the response
Y <- rep(0, length(stratum))
data_sim <- as_tibble(data.frame(stratum = stratum,
probability = prob_case,
obs_id = 1 : length(stratum)))
data_sim_cases <- data_sim %>%
group_by(stratum)%>%
sample_n(n_cases, weight = probability)
Y[data_sim_cases$obs_id] <- 1
The function penalized.clr fits a penalized conditional
logistic regression model with different penalties for different blocks
of covariates. The L1 penalty parameter lambda
(a numeric
vector of the length equal to the number of blocks) is specified by the
user.
This code illustrates how to fit penalized.clr with penalty parameters specified by the user. Here Y is the response vector, X is the multi-source matrix of covariates, stratum is the vector of ids of the matching pairs, and p is the vector of block dimensions. It has the same dimension as the vector of penalty parameters lambda. Penalty parameters are thus specified for each data source, i.e., a block of covariates. It is possible to standardize the variables by setting standardize = TRUE (TRUE by default).
fit1 <- penalized.clr(response = Y, penalized = X, stratum = stratum,
lambda = c(1,4), p = p, standardize = TRUE)
The function returns a list with elements:
penalized - Regression coefficients for the penalized covariates.
unpenalized - Regression coefficients for the unpenalized covariates.
converged - Whether the fitting process was judged to be converged.
lambda - The tuning parameter for L1 used.
alpha - The elastic net mixing parameter used.
Y - The response vector saved as a survival object used internally.
For instance, fit1$penalized
is a numeric vector
containing the regression coefficients for the penalized covariates. In
this simple example, we can compare estimated coefficients with the true
coefficients (the ones that generated our data), by constructing a
2 x 2
contingency table cross-tabulating true non-zero and
estimated non-zero coefficients:
str(fit1)
#> List of 6
#> $ penalized : num [1:52] 0.251 0.212 0.584 0.402 0.613 ...
#> $ unpenalized: num(0)
#> $ converged : logi TRUE
#> $ lambda : num [1:2] 1 4
#> $ alpha : num 1
#> $ Y : 'Surv' num [1:200, 1:2] 1+ 1 1+ 1 1 1+ 1+ 1 1 1+ ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : NULL
#> .. ..$ : chr [1:2] "time" "status"
#> ..- attr(*, "type")= chr "right"
nonzero_index <- (beta != 0) * 1 #index of nonzero coefficients
table(fitted = (fit1$penalized != 0) * 1, nonzero_index)
#> nonzero_index
#> fitted 0 1
#> 0 19 2
#> 1 19 12
The package penalizedclr is not limited to L1, i.e. lasso,
penalty. While L1 penalty is more suited for variable selection, in the
presence of highly correlated covariates, it can be useful to add small
amount of L2 penalty. The two can be combined in an elastic net
framework by specifying the mixing parameter alpha
that can
assume values between 0 and 1. Default alpha=1
gives
lasso.
The function stable.clr performs stability selection (Meinshausen and Bühlmann 2010 and Shah and Samworth 2013) for variable selection in penalized conditional logistic regression. For details on stability selection, we refer to the original publications. Briefly, a set of L1 penalties is considered, and for each considered value of the penalty, \(2B\) random subsamples of \(\lfloor K/2 \rfloor\) matched pairs are taken and a penalized model is estimated. For each covariate and penalty value, selection probability is estimated as the proportion of estimated models in which the associated coefficient estimate is different from zero. Finally, the overall selection probability of a variable is obtained by taking the maximum selection probability over all considered penalty values.
Parameter \(B\) is set to 100 by default, but can be changed by the user. Note that this choice will have an impact on the computation time, and higher values of \(B\) will lead to a slower computation.
The function returns a list containing the selection probabilities for each covariate, i.e. the proportion of estimated models in which the associated coefficient estimate is different from zero. The user can then set a threshold for selection probability (values ranging from 0.55 to 0.9 are recommended) and obtain the set of selected covariates.
The following code performs stability selection when all covariates are considered as a single block (a single data source) and a sequence of L1 penalties contains 3 values.
stable1 <- stable.clr(response = Y, B = 50,
penalized = X, stratum = stratum,
lambda.seq = c(1, 4, 6), parallel = TRUE)
The function returns a list with two elements:
Pistab a numeric vector giving estimates of selection probabilities for each penalized covariate.
lambda.seq a sequence of L1 penalties used.
To inspect the results, we can, for instance, print the covariates with selection probability higher than 0.6: 2, 3, 4, 5, 13, 16, 17, 18, 39, 47.
We can inspect the histogram of selection probabilities.
and compute variable selection classification accuracy:
(tab1 <- table(stable = (stable1$P>0.65) * 1, nonzero_index))
#> nonzero_index
#> stable 0 1
#> 0 37 7
#> 1 1 7
(tab1[1,1] + tab1[2,2])/(sum(tab1))
#> [1] 0.8461538
This histrogram shows null variables in blue and signal variables in red.
# plotting selection probabilities for true non-zero (red) and zero (blue) coefficients
s_prob_nonzero <- cut(stable1$P[nonzero_index == 1],
breaks = seq(0,1, by = 0.1), ordered_result = T)
s_prob_zero <- cut(stable1$P[nonzero_index == 0],
breaks = seq(0,1, by = 0.1), ordered_result = T)
barplot(table(s_prob_nonzero), col = 2)
barplot(table(s_prob_zero), add =T, col = 4 )
This code implements stability selection while taking into account
the block structure of covariates (\(p\)). To achieve this, we use the function
stable.clr.g which extends stable.clr to allow for
blocks (groups) of covariates. In this case, the function takes as an
argument lambda.list
, a list of numeric vectors of L1
penalties. Each vector has the length equal to the number of blocks,
with the \(i\)th element giving the
penalty to be applied to the \(i\)th
data source.
stable2 <- penalizedclr::stable.clr.g(response = Y, p = p,
standardize = TRUE,
penalized = X, stratum = stratum,
lambda.list = list(c(4,1), c(1,4)))
The selection probability histogram in this case.
It is possible to obtain a data adaptive sequence of L1 penalty parameters via the functions default.pf, that searches for the suitable vector of penalty factors (up to a multiplicative constant), and find.default.lambda that searches for the optimal L1 penalty for a given vector of penalty factors.
The first function, follows the procedure proposed by Gerhard Schulze
in his Master Thesis (Clinical Outcome Prediction based on Multi-Omics
Data: Extension of IPF-LASSO). In short, in the first step, a tentative
conditional logistic regression model is fit, either to each covariates
block separately (type.step1 = "sep"
) or to all blocks
jointly (type.step1 = "comb"
). The penalty factor for each
block is then obtained as the inverse of the mean of the absolute values
of the estimated coefficients of that block. In this way, blocks that
contain covariates with large estimated coefficients will be penalized
less. If all estimated coefficients pertaining to a block are zero, the
function returns a message. The function returns a list with penalty
factors corresponding to different blocks.
The second function relies on the cv.clogitL1
function
of the clogitL1
package to perform cross-validation and,
for a given vector of penalty factors, for instance obtained from
default.pf
returns the value of lambda
that
achieves the minimal cross-validated deviance, see documentation of
package clogitL1
for more details.
(pf <- default.pf(response = Y, stratum = stratum, penalized = X,
nfolds = 10, alpha = 0.3,
standardize = TRUE, p = p, type.step1 = "comb"))
#> $pf
#> [1] 32.70923 42.73989
(lambda <- find.default.lambda(response = Y, penalized = X,
standardize = TRUE, stratum = stratum,
p = p, pf.list = pf))
#> [[1]]
#> [1] 0.1375511
stable3 <- stable.clr.g(response = Y, p = p,
standardize = TRUE,
penalized = X, stratum = stratum,
lambda.list = list(c(pf[[1]][1]*as.numeric(lambda), pf[[1]][2]*as.numeric(lambda))))
The histogram of selection probabilities with true signal variables in red and null variables in blue.
The user can augment the data adaptive vector of penalty factors with other vectors.
alt.pf.list<- list(c(1,4), c(2,4), c(4,1), c(4,2), pf$pf)
alt.lambda <- find.default.lambda(response = Y, penalized = X,
standardize = TRUE, stratum = stratum,
p = p, pf.list = alt.pf.list)
lambda.matrix <- mapply(function (x,y) x*y, lambda, alt.pf.list)
lambda.list <- lapply(seq_len(ncol(lambda.matrix)), function(i) lambda.matrix[,i])
stable4 <- stable.clr.g(response = Y, p = p,
standardize = TRUE,
penalized = X, stratum = stratum,
lambda.list = lambda.list )
The histogram of selection probabilities:
Boulesteix, A. L., De Bin, R., Jiang, X., & Fuchs, M. (2017). IPF-LASSO: Integrative-penalized regression with penalty factors for prediction based on multi-omics data. Computational and mathematical methods in medicine, 2017.
Meinshausen, N., & Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417-473.
Shah, R. D., & Samworth, R. J. (2013). Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1), 55-80.