L0Learn
is a fast toolkit for L0-regularized learning.
L0 regularization selects the best subset of features and can outperform
commonly used feature selection methods (e.g., L1 and MCP) under many
sparse learning regimes. The toolkit can (approximately) solve the
following three problems \[
\min_{\beta_0, \beta} \sum_{i=1}^{n} \ell(y_i, \beta_0+ \langle x_i,
\beta \rangle) + \lambda ||\beta||_0 \quad \quad (L0)
\] \[
\min_{\beta_0, \beta} \sum_{i=1}^{n} \ell(y_i, \beta_0+ \langle x_i,
\beta \rangle) + \lambda ||\beta||_0 + \gamma||\beta||_1 \quad (L0L1)
\]
\[ \min_{\beta_0, \beta} \sum_{i=1}^{n} \ell(y_i, \beta_0+ \langle x_i, \beta \rangle) + \lambda ||\beta||_0 + \gamma||\beta||_2^2 \quad (L0L2) \]
where \(\ell\) is the loss function, \(\beta_0\) is the intercept, \(\beta\) is the vector of coefficients, and \(||\beta||_0\) denotes the L0 norm of \(\beta\), i.e., the number of non-zeros in \(\beta\). We support both regression and classification using one of the following loss functions: squared error loss (for regression), and logistic or squared hinge loss (for classification).
The parameter \(\lambda\) controls the strength of the L0 regularization (larger \(\lambda\) leads to fewer non-zeros). The parameter \(\gamma\) controls the strength of the shrinkage component (the L1 norm in the case of L0L1 or the squared L2 norm in the case of L0L2); adding a shrinkage term to L0 can be very effective in avoiding overfitting and typically leads to better predictive models. The fitting is done over a grid of \(\lambda\) and \(\gamma\) values to generate a regularization path.
The algorithms provided in L0Learn
are based on cyclic
coordinate descent and local combinatorial search. Many computational
tricks and heuristics are used to speed up the algorithms and improve
the solution quality. These heuristics include warm starts, active set
convergence, correlation screening, greedy cycling order, and efficient
methods for updating the residuals through exploiting sparsity and
problem dimensions. Moreover, we employed a new computationally
efficient method for dynamically selecting the regularization parameter
\(\lambda\) in the path. For more
details on the algorithms used, please refer to our paper Fast
Best Subset Selection: Coordinate Descent and Local Combinatorial
Optimization Algorithms.
The toolkit is implemented in C++ along with an easy-to-use R interface. In this vignette, we provide a tutorial on using the R interface. In particular, we will demonstrate how to use L0Learn’s main functions for fitting models, cross-validation, and visualization.
L0Learn can be installed directly from CRAN by executing:
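install.packages("L0Learn")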
If you face installation issues, please refer to the Installation Troubleshooting Wiki. If the issue is not resolved, you can submit an issue on L0Learn’s Github Repo.
To demonstrate how L0Learn
works, we will first generate
a synthetic dataset and then proceed to fitting L0-regularized models.
The synthetic dataset (y,X) will be generated from a sparse linear model
as follows:
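\[ y = X\beta^\dagger + \epsilon \]
where \(X\) is a \(500 \times 1000\) matrix with iid standard normal entries, the true coefficient vector \(\beta^\dagger\) has its first 10 entries equal to 1 and the remaining 990 entries equal to 0, and the entries of \(\epsilon\) are iid standard normal.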
This dataset can be generated in R as follows:
set.seed(1) # fix the seed to get a reproducible result
X = matrix(rnorm(500*1000),nrow=500,ncol=1000)
B = c(rep(1,10),rep(0,990))
e = rnorm(500)
y = X%*%B + e
We will use L0Learn to estimate B from the data (y,X). First we load L0Learn:
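library(L0Learn)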
We will start by fitting a simple L0 model and then proceed to the case of L0L2 and L0L1.
To fit a path of solutions for the L0-regularized model with at most
20 non-zeros using coordinate descent (CD), we use the
L0Learn.fit
function as follows:
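fit <- L0Learn.fit(X, y, penalty="L0", maxSuppSize=20)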
This will generate solutions for a sequence of \(\lambda\) values (chosen automatically by
the algorithm). To view the sequence of \(\lambda\) along with the associated support
sizes (i.e., the number of non-zeros), we use the print
method as follows:
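print(fit)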
#> lambda gamma suppSize
#> 1 0.068285500 0 0
#> 2 0.067602600 0 1
#> 3 0.055200200 0 2
#> 4 0.049032300 0 3
#> 5 0.040072500 0 6
#> 6 0.038602900 0 7
#> 7 0.037264900 0 8
#> 8 0.032514200 0 10
#> 9 0.001142920 0 11
#> 10 0.000821227 0 13
#> 11 0.000702292 0 14
#> 12 0.000669520 0 15
#> 13 0.000489938 0 17
To extract the estimated B for particular values of \(\lambda\) and \(\gamma\), we use the function
coef(fit,lambda,gamma)
. For example, the solution at \(\lambda = 0.0325142\) (which corresponds to
a support size of 10) can be extracted using
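coef(fit, lambda=0.0325142, gamma=0)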
#> 1001 x 1 sparse Matrix of class "dgCMatrix"
#>
#> Intercept 0.01052346
#> V1 1.01601401
#> V2 1.01829953
#> V3 1.00606260
#> V4 0.98308999
#> V5 0.97389545
#> V6 0.96148014
#> V7 1.00990709
#> V8 1.08535033
#> V9 1.02685765
#> V10 0.94236638
#> V11 .
#> V12 .
...
The output is a sparse vector of type dgCMatrix
. The
first element in the vector is the intercept and the rest are the B
coefficients. Aside from the intercept, the only non-zeros in the above
solution are coordinates 1, 2, 3, …, 10, which are the non-zero
coordinates in the true support (used to generate the data). Thus, this
solution successfully recovers the true support. Note that on some BLAS
implementations, the lambda
value we used above (i.e.,
0.0325142
) might be slightly different due to the
limitations of numerical precision. Moreover, all the solutions in the
regularization path can be extracted at once by calling
coef(fit)
.
The sequence of \(\lambda\)
generated by L0Learn
is stored in the object
fit
. Specifically, fit$lambda
is a list, where
each element of the list is a sequence of \(\lambda\) values corresponding to a single
value of \(\gamma\). Since L0 has only
one value of \(\gamma\) (i.e., 0), we
can access the sequence of \(\lambda\)
values using fit$lambda[[1]]
. Thus, the \(\lambda=0.0325142\) we used previously can
be accessed using fit$lambda[[1]][7]
(since it is the 7th
value in the output of print
). So the previous solution can
also be extracted using
coef(fit,lambda=fit$lambda[[1]][7], gamma=0)
.
We can make predictions using a specific solution in the grid using
the function predict(fit,newx,lambda,gamma)
where newx is a
testing sample (vector or matrix). For example, to predict the response
for the samples in the data matrix X using the solution with \(\lambda=0.0325142\), we call the prediction
function as follows:
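predict(fit, newx=X, lambda=0.0325142, gamma=0)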
#> 500 x 1 Matrix of class "dgeMatrix"
#> [,1]
#> [1,] 0.44584683
#> [2,] -1.52213221
#> [3,] -1.11755544
#> [4,] -0.93180249
#> [5,] -4.02095821
#> [6,] 2.02300763
#> [7,] 2.03371819
#> [8,] 0.92234198
#> [9,] 2.33359737
#> [10,] 0.92194909
#> [11,] -4.33165265
#> [12,] -2.13282518
#> [13,] -7.21962511
...
We can also visualize the regularization path by plotting the
coefficients of the estimated B versus the support size (i.e., the
number of non-zeros) using the plot(fit,gamma)
method as
follows:
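plot(fit, gamma=0)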
The legend of the plot presents the variables in the order they
entered the regularization path. For example, variable 7 is the first
variable to enter the path, and variable 6 is the second to enter. Thus,
roughly speaking, we can view the first \(k\) variables in the legend as the best
subset of size \(k\). To show the lines
connecting the points in the plot, we can set the parameter
showLines=TRUE
in the plot
function, i.e.,
call plot(fit, gamma=0, showLines=TRUE)
. Moreover, we note
that the output of the plot
function above is a
ggplot
object, which can be further customized using the
ggplot2
package.
We have demonstrated the simple case of using an L0 penalty. We can
also fit more elaborate models that combine L0 regularization with
shrinkage-inducing penalties like the L1 norm or squared L2 norm. Adding
shrinkage helps in avoiding overfitting and typically improves the
predictive performance of the models. Next, we will discuss how to fit a
model using the L0L2 penalty for a two-dimensional grid of \(\lambda\) and \(\gamma\) values. Recall that by default,
L0Learn
automatically selects the \(\lambda\) sequence, so we only need to
specify the \(\gamma\) sequence.
Suppose we want to fit an L0L2 model with a maximum of 20 non-zeros and
a sequence of 5 \(\gamma\) values
ranging between 0.0001 and 10. We can do so by calling
L0Learn.fit
with penalty="L0L2"
,
nGamma=5
, gammaMin=0.0001
, and
gammaMax=10
as follows:
fit <- L0Learn.fit(X, y, penalty="L0L2", nGamma = 5, gammaMin = 0.0001, gammaMax = 10, maxSuppSize=20)
L0Learn
will generate a grid of 5 \(\gamma\) values equi-spaced on the
logarithmic scale between 0.0001 and 10. Similar to the case of L0, we
can print a summary of the regularization path using the
print
function as follows:
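print(fit)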
#> lambda gamma suppSize
#> 1 0.003251690 10.000000000 0
#> 2 0.003219170 10.000000000 1
#> 3 0.002776000 10.000000000 2
#> 4 0.002687010 10.000000000 3
#> 5 0.002577800 10.000000000 3
#> 6 0.002062240 10.000000000 6
#> 7 0.001861720 10.000000000 7
#> 8 0.001710400 10.000000000 8
#> 9 0.001644570 10.000000000 9
#> 10 0.001153900 10.000000000 10
#> 11 0.000482462 10.000000000 10
#> 12 0.000385970 10.000000000 12
#> 13 0.000374391 10.000000000 14
#> 14 0.000363159 10.000000000 15
#> 15 0.000352264 10.000000000 15
#> 16 0.000281811 10.000000000 19
#> 17 0.000273357 10.000000000 19
#> 18 0.032139100 0.562341325 0
#> 19 0.031817800 0.562341325 1
#> 20 0.027437500 0.562341325 1
#> 21 0.021950000 0.562341325 4
#> 22 0.021291500 0.562341325 4
#> 23 0.017033200 0.562341325 9
#> 24 0.011404900 0.562341325 10
#> 25 0.004768580 0.562341325 10
#> 26 0.003814860 0.562341325 10
#> 27 0.003051890 0.562341325 10
#> 28 0.002441510 0.562341325 10
#> 29 0.001953210 0.562341325 10
...
The sequence of \(\gamma\) values
can be accessed using fit$gamma
. To extract a solution we
use the coef
method. For example, extracting the solution
at \(\lambda=0.0011539\) and \(\gamma=10\) can be done using
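coef(fit, lambda=0.0011539, gamma=10)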
#> 1001 x 1 sparse Matrix of class "dgCMatrix"
#>
#> Intercept -0.02424084
#> V1 0.04267751
#> V2 0.04765638
#> V3 0.04958919
#> V4 0.05100251
#> V5 0.04411968
#> V6 0.05385520
#> V7 0.05597895
#> V8 0.04374915
#> V9 0.05429957
#> V10 0.03644110
#> V11 .
#> V12 .
...
Similarly, we can predict the response at this pair of \(\lambda\) and \(\gamma\) for the matrix X using
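predict(fit, newx=X, lambda=0.0011539, gamma=10)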
The regularization path can also be plotted at a specific \(\gamma\) using
plot(fit, gamma)
. Finally, we note that fitting an L0L1
model can be done by just changing the penalty
to “L0L1” in
the above (in this case gammaMax
will be ignored since it
is automatically selected by the toolkit; see the reference manual for
more details).
By default, L0Learn
uses coordinate descent (CD) to fit
models. Since the objective function is non-convex, the choice of the
optimization algorithm can have a significant effect on the solution
quality (different algorithms can lead to solutions with very different
objective values). A more elaborate algorithm based on combinatorial
search can be used by setting the parameter
algorithm="CDPSI"
in the call to L0Learn.fit
.
CDPSI
typically leads to higher-quality solutions compared
to CD, especially when the features are highly correlated. CDPSI is
slower than CD; however, for typical applications it terminates in the
order of seconds.
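For example, a minimal sketch of such a call (the maxSuppSize value below is only illustrative):
fit <- L0Learn.fit(X, y, penalty="L0", algorithm="CDPSI", maxSuppSize=20)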
We will demonstrate how to use K-fold cross-validation (CV) to select
the optimal values of the tuning parameters \(\lambda\) and \(\gamma\). To perform CV, we use the
L0Learn.cvfit
function, which takes the same parameters as
L0Learn.fit
, in addition to the number of folds using the
nFolds
parameter and a seed value using the
seed
parameter (this is used when randomly shuffling the
data before performing CV).
For example, to perform 5-fold CV using the L0L2
penalty
(over a range of 5 gamma
values between 0.0001 and 0.1)
with a maximum of 50 non-zeros, we run:
cvfit = L0Learn.cvfit(X, y, nFolds=5, seed=1, penalty="L0L2", nGamma=5, gammaMin=0.0001, gammaMax=0.1, maxSuppSize=50)
Note that the object cvfit
has a member fit
(accessed using cvfit$fit
), which is the output of running
L0Learn.fit
on (y,X). The cross-validation errors can be
accessed using the cvMeans
attribute of cvfit
:
cvfit$cvMeans
is a list where the ith element,
cvfit$cvMeans[[i]]
, stores the cross-validation errors for
the ith value of gamma (cvfit$fit$gamma[i]
). To find the
minimum cross-validation error for every gamma
, we call the
min
function for every element in the list
cvfit$cvMeans
, as follows:
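# apply min() to each element of the list of CV errors
lapply(cvfit$cvMeans, min)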
#> [[1]]
#> [1] 1.283026
#>
#> [[2]]
#> [1] 1.000699
#>
#> [[3]]
#> [1] 0.9900166
#>
#> [[4]]
#> [1] 0.9899541
#>
#> [[5]]
#> [1] 0.9900055
The above output indicates that the 4th value of gamma achieves the
lowest CV error (=0.9899541
). We can plot the CV errors
against the support size for the 4th value of gamma, i.e.,
gamma = cvfit$fit$gamma[4]
, using:
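plot(cvfit, gamma=cvfit$fit$gamma[4])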
The above plot is produced using the ggplot2
package and
can be further customized by the user. To extract the optimal \(\lambda\) (i.e., the one with minimum CV
error) in this plot, we execute the following:
optimalGammaIndex = 4 # index of the optimal gamma identified previously
optimalLambdaIndex = which.min(cvfit$cvMeans[[optimalGammaIndex]])
optimalLambda = cvfit$fit$lambda[[optimalGammaIndex]][optimalLambdaIndex]
optimalLambda
#> [1] 0.00809626
To print the solution corresponding to the optimal gamma/lambda pair:
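# extract the solution at the optimal (lambda, gamma) pair identified above
coef(cvfit, lambda=optimalLambda, gamma=cvfit$fit$gamma[4])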
#> 1001 x 1 sparse Matrix of class "dgCMatrix"
#>
#> Intercept 0.01044999
#> V1 1.01472611
#> V2 1.01711155
#> V3 1.00494630
#> V4 0.98208003
#> V5 0.97274703
#> V6 0.96057359
#> V7 1.00895157
#> V8 1.08392099
#> V9 1.02580620
#> V10 0.94108163
#> V11 .
#> V12 .
...
The optimal solution (above) selected by cross-validation correctly recovers the support of the true vector of coefficients used to generate the data.
All the commands and plots we have seen in the case of regression
extend to classification. We currently support logistic regression
(using the parameter loss = "Logistic"
) and a smoothed
version of SVM (using the parameter loss="SquaredHinge"
).
To give some examples, we first generate a synthetic classification
dataset (similar to the one we generated in the case of regression):
set.seed(1) # fix the seed to get a reproducible result
X = matrix(rnorm(500*1000),nrow=500,ncol=1000)
B = c(rep(1,10),rep(0,990))
e = rnorm(500)
y = sign(X%*%B + e)
An L0-regularized logistic regression model can be fit by specifying loss = "Logistic" as follows:
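fit <- L0Learn.fit(X, y, loss="Logistic", maxSuppSize=20) # the maxSuppSize cap here is illustrative
print(fit)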
#> lambda gamma suppSize
#> 1 2.03415e+01 1e-07 1
#> 2 1.82815e+01 1e-07 2
#> 3 1.80478e+01 1e-07 2
#> 4 1.44382e+01 1e-07 5
#> 5 1.40051e+01 1e-07 5
#> 6 1.12041e+01 1e-07 7
#> 7 1.08679e+01 1e-07 7
#> 8 8.69435e+00 1e-07 10
#> 9 4.45223e+00 1e-07 10
#> 10 3.56178e+00 1e-07 10
#> 11 2.84943e+00 1e-07 10
#> 12 2.27954e+00 1e-07 10
#> 13 1.82363e+00 1e-07 10
#> 14 1.45891e+00 1e-07 10
...
The output above shows \(\gamma=10^{-7}\); by default, a small ridge regularization (with \(\gamma=10^{-7}\)) is added to ensure the existence of a solution. To extract the coefficients of the solution with \(\lambda = 8.69435\), we use:
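coef(fit, lambda=8.69435, gamma=1e-7)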
#> 1001 x 1 sparse Matrix of class "dgCMatrix"
#>
#> Intercept -0.1797751
#> V1 2.0617431
#> V2 2.2437744
#> V3 1.9869710
#> V4 2.1627060
#> V5 2.1201722
#> V6 2.1177628
#> V7 1.8449653
#> V8 2.4375011
#> V9 2.3808994
#> V10 2.1134238
#> V11 .
#> V12 .
...
The above indicates that the 10 non-zeros in the estimated model match those we used in generating the data (i.e., L0 regularization correctly recovered the true support). We can also make predictions at the latter \(\lambda\) using:
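predict(fit, newx=X, lambda=8.69435, gamma=1e-7)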
#> 500 x 1 Matrix of class "dgeMatrix"
#> [,1]
#> [1,] 5.786002e-01
#> [2,] 2.067591e-02
#> [3,] 5.357189e-02
#> [4,] 7.589991e-02
#> [5,] 1.460869e-04
#> [6,] 9.886345e-01
#> [7,] 9.620193e-01
#> [8,] 9.201535e-01
#> [9,] 9.912170e-01
#> [10,] 7.672292e-01
#> [11,] 3.756393e-05
#> [12,] 5.631011e-03
#> [13,] 2.491432e-07
...
Each row in the above is the probability that the corresponding
sample belongs to class \(1\). Other
models (i.e., L0L2 and L0L1) can be similarly fit by specifying
loss = "Logistic"
.
Finally, we note that L0Learn also supports a smoothed version of SVM
by using squared hinge loss (loss = "SquaredHinge"
). The
only difference from logistic regression is that the
predict
function returns \(\beta_0 + \langle x, \beta \rangle\) (where
\(x\) is the testing sample), instead
of returning probabilities. The latter predictions can be assigned to
the appropriate classes by using a thresholding function (e.g., the sign
function).
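As a minimal sketch (the particular \(\lambda\) index below is arbitrary and only for illustration):
fit <- L0Learn.fit(X, y, loss="SquaredHinge", maxSuppSize=20)
# predict returns beta_0 + <x, beta>; apply sign() to obtain class labels in {-1, +1}
yhat <- sign(as.vector(predict(fit, newx=X, lambda=fit$lambda[[1]][5], gamma=fit$gamma[1])))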
Starting in version 2.0.0, L0Learn supports sparse matrices of type dgCMatrix. If your sparse matrix uses a different storage format, please convert it to dgCMatrix before using it in L0Learn. L0Learn keeps the matrix sparse internally and thus is highly efficient if the matrix is sufficiently sparse. The API for sparse matrices is the same as that of dense matrices, so all the demonstrations in this vignette also apply to sparse matrices. For example, we can fit an L0-regularized model on a sparse matrix as follows:
# As an example, we generate a random, sparse matrix with
# 500 samples, 1000 features, and 10% nonzero entries.
X_sparse <- Matrix::rsparsematrix(nrow=500, ncol=1000, density=0.1, rand.x = rnorm)
# Below we generate the response using the same linear model as before,
# but with the sparse data matrix X_sparse.
y_sparseX <- as.vector(X_sparse%*%B + e)
# Call L0Learn.
fit_sparse <- L0Learn.fit(X_sparse, y_sparseX, penalty="L0")
# Note: In the setup above, X_sparse is of type dgCMatrix.
# If your sparse matrix is of a different type, convert it
# to dgCMatrix before calling L0Learn, e.g., using: X_sparse <- as(X_sparse, "dgCMatrix").
In certain applications, it is desirable to always include some of
the variables in the model and perform variable selection on others.
L0Learn
supports this option through the
excludeFirstK
parameter. Specifically, setting
excludeFirstK = K
(where K is a non-negative integer)
instructs L0Learn
to exclude the first K variables in the
data matrix X
from the L0-norm penalty (those K variables
will still be penalized using the L2 or L1 norm penalties). For
example, below we fit an L0
model and exclude the first 3
variables from selection by setting excludeFirstK = 3
:
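fit <- L0Learn.fit(X, y, penalty="L0", maxSuppSize=10, excludeFirstK=3) # the maxSuppSize value is illustrative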
Plotting the regularization path:
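plot(fit, gamma=0)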
We can see in the plot above that the first 3 variables are included in all the solutions along the path.
Starting in version 2.0.0, L0Learn supports bounds for CD algorithms
for all losses and penalties. (We plan to support bound constraints for
the CDPSI algorithm in the future). By default, L0Learn does not apply
bounds, i.e., it assumes \(-\infty \leq \beta_i \leq \infty\) for all \(i\). Users can supply the same
bounds for all coefficients by setting the parameters lows
and highs
to scalar values (these should satisfy:
lows <= 0
, lows != highs
, and
highs >= 0
). To use different bounds for the
coefficients, lows
and highs
can be both set
to vectors of length p
(where the i-th entry corresponds to
the bound on coefficient i).
All of the following examples are valid.
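# Scalar bounds applied to every coefficient (the values below are illustrative):
fit <- L0Learn.fit(X, y, penalty="L0", lows=-0.5, highs=0.5)
# Per-coefficient bounds can be supplied as vectors of length p: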
low_vector <- c(rep(-0.1, 500), rep(-0.125, 500))
fit <- L0Learn.fit(X, y, penalty="L0", lows=low_vector, highs=0.25)
We can verify that the estimated coefficients respect the bounds:
#> [1] 0.25
#> [1] -0.1
#> [1] -0.1222599
By default, L0Learn
selects the sequence of lambda
values in an efficient manner to avoid wasted computation (since close
\(\lambda\) values can typically lead
to the same solution). Advanced users of the toolkit can change this
default behavior and supply their own sequence of \(\lambda\) values. This can be done by supplying the \(\lambda\) values
through the parameter lambdaGrid
. L0Learn versions before
2.0.0 would also require setting the autoLambda
parameter
to FALSE
. This parameter remains in version 2.0.0 for
backwards compatibility, but is no longer needed or used.
Specifically, the value assigned to lambdaGrid should be a list of lists of decreasing positive values (doubles). The length of lambdaGrid (i.e., the number of sub-lists it stores) determines the number of \(\gamma\) values, which are spaced between gammaMin and gammaMax. In the case of the L0 penalty, lambdaGrid must be a list of length 1, whereas for the L0L2/L0L1 penalties it can contain any number of sub-lists. The ith element of lambdaGrid should be a decreasing sequence of positive lambda values, which the algorithm uses for the ith value of gamma. For example, to fit an L0 model with the user-specified sequence of lambda values 1, 1e-1, 1e-2, 1e-3, 1e-4, we run the following:
userLambda <- list()
userLambda[[1]] <- c(1, 1e-1, 1e-2, 1e-3, 1e-4)
fit <- L0Learn.fit(X, y, penalty="L0", lambdaGrid=userLambda, maxSuppSize=1000)
To verify the results we print the fit object:
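print(fit)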
#> lambda gamma suppSize
#> 1 1e+00 0 0
#> 2 1e-01 0 0
#> 3 1e-02 0 10
#> 4 1e-03 0 62
#> 5 1e-04 0 257
Note that the \(\lambda\) values above match the user-specified values. For the L0L2 and L0L1 penalties, the same can be done by supplying multiple sub-lists through the lambdaGrid parameter:
userLambda <- list()
userLambda[[1]] <- c(1, 1e-1, 1e-2, 1e-3, 1e-4)
userLambda[[2]] <- c(10, 2, 1, 0.01, 0.002, 0.001, 1e-5)
userLambda[[3]] <- c(1e-4, 1e-5)
# userLambda[[i]] must be a vector of positive decreasing reals.
fit <- L0Learn.fit(X, y, penalty="L0L2", lambdaGrid=userLambda, maxSuppSize=1000)
#> lambda gamma suppSize
#> 1 1e+00 10.00000000 0
#> 2 1e-01 10.00000000 0
#> 3 1e-02 10.00000000 0
#> 4 1e-03 10.00000000 8
#> 5 1e-04 10.00000000 132
#> 6 1e+01 0.03162278 0
#> 7 2e+00 0.03162278 0
#> 8 1e+00 0.03162278 0
#> 9 1e-02 0.03162278 10
#> 10 2e-03 0.03162278 26
#> 11 1e-03 0.03162278 65
#> 12 1e-05 0.03162278 625
#> 13 1e-04 0.00010000 311
#> 14 1e-05 0.00010000 409
Hussein Hazimeh and Rahul Mazumder. Fast Best Subset Selection: Coordinate Descent and Local Combinatorial Optimization Algorithms. Operations Research (2020).
Antoine Dedieu, Hussein Hazimeh, and Rahul Mazumder. Learning Sparse Classifiers: Continuous and Mixed Integer Optimization Perspectives. JMLR (to appear).