Risk Score Vignette

Introduction

Risk scores are sparse linear models that map an integer linear combination of covariates to the probability of an outcome occurring. Unlike standard regression models, risk score models use integer coefficients, often applied to dichotomous variables. This allows risk score predictions to be easily computed by adding or subtracting a few small numbers.

Risk scores developed heuristically by altering logistic regression models have decreased performance, as there is a fundamental trade-off between the model’s simplicity and its predictive accuracy. In contrast, this package presents an optimization approach to learning risk scores, where the constraints unique to risk score models are integrated into the model-fitting process, rather than implemented afterward. This vignette demonstrates how to use the riskscores package to build a risk score model to predict breast cancer diagnosis.

library(riskscores)

Optimization Problem

The riskscores package uses a simulated annealing (or cyclical coordinate descent if otherwise specified) algorithm to solve the following optimization problem.

\[\begin{equation} \begin{aligned} \min_{\gamma,\beta} \quad & -\frac{1}{n} \sum_{i=1}^{n} \left(\gamma y_i x_i^T \beta - \log(1 + \exp(\gamma x_i^T \beta))\right) + \lambda_0 \sum_{j=1}^{p} 1(\beta_{j} \neq 0)\\ \textrm{s.t.} \quad & l \le \beta_j \le u \; \; \; \forall j = 1,2,...,p\\ &\beta_j \in \mathbb{Z} \; \; \; \forall j = 1,2,...,p \\ &\beta_0, \gamma \in \mathbb{R} \\ \end{aligned} \end{equation}\]

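The $\lambda_0$ penalty on the number of non-zero coefficients encourages a sparse model, while the constraints restrict each coefficient $\beta_j$ to an integer between $l$ and $u$. The intercept $\beta_0$ and the multiplier $\gamma$ remain real-valued.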
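To make the objective concrete, the sketch below evaluates the penalized logistic loss (average negative log-likelihood plus the $\lambda_0$ penalty) in base R for a candidate solution. The function name `risk_objective` and the toy inputs are illustrative assumptions and not part of the riskscores API. `X` is assumed to include a leading column of ones, so the first element of `beta` plays the role of the intercept $\beta_0$ and is excluded from the penalty.

```r
# Illustrative helper (not part of riskscores): penalized logistic loss.
# X must have a leading column of 1s; beta[1] is the intercept beta_0.
risk_objective <- function(X, y, beta, gamma, lambda0) {
  eta <- gamma * as.vector(X %*% beta)        # scaled linear predictor
  loglik <- mean(y * eta - log1p(exp(eta)))   # average log-likelihood
  penalty <- lambda0 * sum(beta[-1] != 0)     # L0 penalty, intercept excluded
  -loglik + penalty
}

# Toy data, made up for illustration only
X <- cbind(1, c(1, 2, 3, 4), c(0, 1, 0, 1))
y <- c(0, 0, 1, 1)

# Adding a second non-zero coefficient costs an extra lambda0 in penalty
risk_objective(X, y, beta = c(-5, 2, 0), gamma = 0.5, lambda0 = 0.1)
risk_objective(X, y, beta = c(-5, 2, 1), gamma = 0.5, lambda0 = 0.1)
```

A candidate with more non-zero coefficients is only preferred if its improvement in fit outweighs the additional $\lambda_0$ per coefficient.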

Loading Example Data

First we’ll load in an example dataset. In this example, we want to develop a risk score model that predicts whether a breast tissue sample is malignant using features recorded during a biopsy. The breastcancer dataset was originally accessed from the UCI Repository and can be loaded into your environment from the riskscores package as follows:

data("breastcancer")

This dataset contains 683 observations and 9 features. Our goal is to develop a risk score model that predicts whether a breast tissue sample is malignant using 9 (or fewer) features recorded during a biopsy:

Clump thickness
Uniformity of cell size
Uniformity of cell shape
Marginal adhesion
Single epithelial cell size
Bare nuclei
Bland chromatin
Normal nucleoli
Mitoses

Data Preprocessing

Before building a risk score model, data often need to be preprocessed. Specifically, the dataset needs to have a binary outcome with all other variables containing either binary or integer values.

The breastcancer dataset is mostly ready to go. We’ll still need to split out our data into a matrix with all covariates (X) and a vector with the outcome data (y). In this case, the first column in our dataset contains the outcome variable.

y <- breastcancer[,1]
X <- as.matrix(breastcancer[,-1])
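Before fitting, it can help to confirm that the inputs meet the requirements described above. The helper below is a minimal sketch in base R; `check_risk_data` is a hypothetical name and not a riskscores function, and the toy values are made up for illustration.

```r
# Illustrative check (not a riskscores function):
# binary outcome and integer-valued covariates
check_risk_data <- function(X, y) {
  stopifnot(
    is.matrix(X), is.numeric(X),
    length(y) == nrow(X),
    all(y %in% c(0, 1)),     # binary outcome
    all(X == round(X))       # integer-valued covariates
  )
  invisible(TRUE)
}

# Toy example, values made up for illustration
X_toy <- cbind(ClumpThickness = c(5, 3, 8), Mitoses = c(1, 1, 2))
y_toy <- c(0, 0, 1)
check_risk_data(X_toy, y_toy)
```

The same call could be applied to the `X` and `y` objects created above; it stops with an error if any requirement is violated.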

Cross Validation

The penalty coefficient $\lambda_0$ controls the sparsity of the model – a larger value of $\lambda_0$ will result in fewer non-zero coefficients. We can use cross validation to find the optimal $\lambda_0$ value that creates a sufficiently sparse model without sacrificing performance.

Ideally, each cross-validation fold should contain an approximately equal proportion of cases. The riskscores package contains the function stratify_folds() that creates fold IDs with an equal proportion of cases in each fold. These fold IDs can be entered into the cv_risk_mod() function under the foldids parameter. Otherwise, cv_risk_mod() will set random fold IDs.

foldids <- stratify_folds(y, nfolds = 5, seed = 5)
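For intuition, stratified fold assignment can be sketched in a few lines of base R. This is not the package's implementation of stratify_folds(); the function name `stratified_fold_ids` and the toy outcome vector are assumptions made for illustration.

```r
# Illustrative sketch of stratified fold assignment
# (not the implementation used by stratify_folds())
stratified_fold_ids <- function(y, nfolds = 5, seed = 1) {
  set.seed(seed)
  ids <- integer(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    # Recycle fold labels 1..nfolds over this class, then shuffle them
    ids[idx] <- sample(rep_len(seq_len(nfolds), length(idx)))
  }
  ids
}

# Made-up outcome: 20 non-cases followed by 10 cases
y_toy <- rep(c(0, 1), times = c(20, 10))
fold_toy <- stratified_fold_ids(y_toy, nfolds = 5)

# Each fold receives 4 non-cases and 2 cases
table(fold = fold_toy, outcome = y_toy)
```

Assigning folds separately within each outcome class keeps the proportion of cases roughly constant across folds.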

The cv_risk_mod() function runs cross validation for a grid of possible $\lambda_0$ values. If the user does not specify the vector of $\lambda_0$ values to test, the program constructs this $\lambda_0$ sequence. The maximum $\lambda_0$ in this sequence is the smallest value such that all coefficients in the logistic regression model are zero. The minimum $\lambda_0$ in the sequence is calculated using the user-defined lambda_ratio argument. The $\lambda_0$ grid is created by generating nlambda values, evenly spaced on the log scale, from the minimum $\lambda_0$ to the maximum $\lambda_0$. We’ve set nlambda to 25, so the program will construct an appropriate sequence of 25 $\lambda_0$ values to test using cross validation.

cv_results <- cv_risk_mod(X, y, foldids = foldids, nlambda = 25)
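A grid that is evenly spaced on the log scale can be sketched as follows. The values of `lambda_max` and `lambda_ratio` are made up for illustration, and taking the minimum as `lambda_ratio * lambda_max` is an assumption about how the ratio is applied rather than a documented detail of cv_risk_mod().

```r
# Illustrative values; the package derives the maximum from the data
lambda_max   <- 0.4    # assumed
lambda_ratio <- 0.01   # assumed; minimum taken as lambda_ratio * lambda_max
nlambda      <- 25

# Evenly spaced on the log scale, then mapped back with exp()
lambda_grid <- exp(seq(log(lambda_max * lambda_ratio), log(lambda_max),
                       length.out = nlambda))

range(lambda_grid)

# Consecutive values differ by a constant multiplicative factor
ratios <- lambda_grid[-1] / lambda_grid[-nlambda]
unique(round(ratios, 10))
```

Log spacing places more grid points among the small $\lambda_0$ values, where the number of non-zero coefficients tends to change most quickly.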

Running plot() on a cv_risk_mod object creates a plot of mean deviance for each $\lambda_0$ value in the grid. The number of non-zero coefficients produced by each $\lambda_0$ value when fit on the full data is listed at the top of the plot. The $\lambda_0$ value with the lowest mean deviance (“lambda_min”) is indicated in red, and its standard deviation is marked with a red dashed line. Its precise value can be accessed by calling cv_results$lambda_min. If we want a sparser model, we could increase $\lambda_0$ to “lambda_1se”, the largest value whose mean deviance is within one standard error of “lambda_min”. This value can be accessed by calling cv_results$lambda_1se. In our example, “lambda_min” creates a model with 8 non-zero coefficients and “lambda_1se” creates a model with 4 non-zero coefficients.

plot(cv_results)

cv_results$lambda_min
#> [1] 0.005755162

cv_results$lambda_1se
#> [1] 0.03920946

To view a dataframe with the full cross-validation results (including both deviance and accuracy metrics), run cv_results$results.

tail(cv_results$results)
#> # A tibble: 6 × 8
#>   lambda0 mean_dev sd_dev mean_acc sd_acc mean_auc  sd_auc nonzero
#>     <dbl>    <dbl>  <dbl>    <dbl>  <dbl>    <dbl>   <dbl>   <int>
#> 1  0.0576     37.2   11.5    0.955 0.0139    0.988 0.00958       5
#> 2  0.0845     42.3   13.1    0.944 0.0232    0.984 0.00964       4
#> 3  0.124      49.6   18.7    0.934 0.0313    0.971 0.0317        4
#> 4  0.182      60.1   14.6    0.925 0.0300    0.961 0.0279        3
#> 5  0.267      76.5   25.8    0.889 0.0478    0.929 0.0396        2
#> 6  0.392      64.9   18.1    0.918 0.0237    0.958 0.0203        2

Fitting a Risk Score Model

We’ll fit a model on the full data using the function risk_mod(). We’ll use the “lambda_1se” value determined by cross-validation as our $\lambda_0$ parameter, which gives the sparser of the two candidate models.

mod <- risk_mod(X, y, lambda0 = cv_results$lambda_1se)

The integer risk score model can be viewed by calling mod$model_card. An individual’s risk score can be calculated by multiplying each covariate response by its respective number of points and then adding all points together. In our example below, a patient with a ClumpThickness value of 5, a UniformityOfCellShape value of 1, a BareNuclei value of 1, and a BlandChromatin value of 3 would receive a score of $9(5) + 6(1) + 7(1) + 8(3) = 82$.

`mod$model_card`
	Points
ClumpThickness	9
UniformityOfCellShape	6
BareNuclei	7
BlandChromatin	8
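The score arithmetic for the example patient can be reproduced in base R. The vectors below copy the points from the model card and the covariate values from the text; the object names are chosen for illustration.

```r
# Points from the model card
points <- c(ClumpThickness = 9, UniformityOfCellShape = 6,
            BareNuclei = 7, BlandChromatin = 8)

# Covariate values for the example patient
patient <- c(ClumpThickness = 5, UniformityOfCellShape = 1,
             BareNuclei = 1, BlandChromatin = 3)

# Multiply each response by its points and add everything up
score <- sum(points * patient[names(points)])
score
#> [1] 82
```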

Each score can then be mapped to a risk probability. The mod$score_map dataframe maps an integer range of scores to their associated risk. For this example dataset, mod$score_map includes a range of integer scores from 30 to 300, which are the minimum and maximum scores predicted from the training data. The table below shows a sample of these scores mapped to their associated risk. We can see that a patient who received a score of 150 would have an 82.70% risk of their tissue sample being malignant.

`mod$score_map`
Score	Risk
30	0.0009
60	0.0076
90	0.0617
120	0.3592
150	0.8270
180	0.9761
210	0.9971
240	0.9997
270	1.0000
300	1.0000

The function get_risk() can be used to calculate the risk from a given score (or a vector of scores). Likewise, the function get_score() calculates the score associated with a given risk (or vector of risk probabilities).

get_risk(mod, score = 150)
#> [1] 0.8270133

get_score(mod, risk = 0.8270133)
#> [1] 150
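The mapping between score and risk is consistent with a logistic function applied to $\gamma$ times the sum of the intercept and the score. The sketch below uses the rounded intercept and $\gamma$ values that summary(mod) reports later in this vignette; the helper names `risk_from_score` and `score_from_risk` are hypothetical and are not the package's own functions.

```r
intercept <- -128.1023    # rounded value reported by summary(mod)
gamma     <- 0.07145057   # multiplier reported by summary(mod)

# Logistic transform of the scaled score, and its inverse
risk_from_score <- function(score) plogis(gamma * (intercept + score))
score_from_risk <- function(risk)  qlogis(risk) / gamma - intercept

round(risk_from_score(150), 4)
#> [1] 0.827

round(score_from_risk(0.8270133), 1)
#> [1] 150
```

Because the intercept is rounded to four decimal places here, results agree with get_risk() and get_score() only up to small rounding differences.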

We can evaluate the model’s performance under different classification thresholds using the get_metrics() function.

get_metrics(mod, threshold = seq(0.1, 0.9, 0.1))
#>   threshold_risk threshold_score  accuracy sensitivity specificity
#> 1            0.1            97.4 0.9604685   0.9832636   0.9481982
#> 2            0.2           108.7 0.9707174   0.9748954   0.9684685
#> 3            0.3           116.2 0.9692533   0.9623431   0.9729730
#> 4            0.4           122.4 0.9692533   0.9623431   0.9729730
#> 5            0.5           128.1 0.9677892   0.9497908   0.9774775
#> 6            0.6           133.8 0.9648609   0.9414226   0.9774775
#> 7            0.7           140.0 0.9604685   0.9205021   0.9819820
#> 8            0.8           147.5 0.9531479   0.8953975   0.9842342
#> 9            0.9           158.9 0.9428990   0.8577406   0.9887387
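For reference, the three metrics at a single threshold can be computed by hand as sketched below. This is not the implementation used by get_metrics(): the helper name `threshold_metrics`, the use of `>=` at the threshold, and the toy data are all assumptions made for illustration.

```r
# Illustrative helper (not the package's implementation)
threshold_metrics <- function(risk, y, threshold) {
  pred <- as.integer(risk >= threshold)   # classify as a case at or above threshold
  c(accuracy    = mean(pred == y),
    sensitivity = sum(pred == 1 & y == 1) / sum(y == 1),
    specificity = sum(pred == 0 & y == 0) / sum(y == 0))
}

# Made-up risks and outcomes for illustration
risk_toy <- c(0.05, 0.20, 0.45, 0.60, 0.80, 0.95)
y_toy    <- c(0,    0,    1,    0,    1,    1)

threshold_metrics(risk_toy, y_toy, threshold = 0.5)
```

Raising the threshold classifies fewer subjects as cases, which is why sensitivity falls and specificity rises as we move down the table above.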

Generic Functions

summary

Running summary() on our model will return the intercept, the integer value of each non-zero coefficient, the $\gamma$ multiplier, the $\lambda_0$ regularizer, the deviance, and the AIC.

summary(mod)
#> 
#> Intercept: -128.1023
#> 
#> Non-zero coefficients:                      .
#> ClumpThickness        9
#> UniformityOfCellShape 6
#> BareNuclei            7
#> BlandChromatin        8
#> 
#> Gamma (multiplier):  0.07145057 
#> Lambda (regularizer):  0.03920946 
#> 
#> Deviance:  124.2174 
#> AIC:  144.2174

coef

A vector containing the risk score model intercept and integer coefficients can be accessed by calling coef() on the risk_mod object. This vector is also saved as $beta within the risk_mod object.

coef(mod) # equivalently: mod$beta
#>                Intercept           ClumpThickness     UniformityOfCellSize 
#>                -128.1023                   9.0000                   0.0000 
#>    UniformityOfCellShape         MarginalAdhesion SingleEpithelialCellSize 
#>                   6.0000                   0.0000                   0.0000 
#>               BareNuclei           BlandChromatin           NormalNucleoli 
#>                   7.0000                   8.0000                   0.0000 
#>                  Mitoses 
#>                   0.0000

We can map our integer score model to an equivalent logistic regression model by multiplying the intercept and integer coefficients by $\gamma$ (saved as $gamma in the risk_mod object).

coef(mod) * mod$gamma
#>                Intercept           ClumpThickness     UniformityOfCellSize 
#>               -9.1529791                0.6430551                0.0000000 
#>    UniformityOfCellShape         MarginalAdhesion SingleEpithelialCellSize 
#>                0.4287034                0.0000000                0.0000000 
#>               BareNuclei           BlandChromatin           NormalNucleoli 
#>                0.5001540                0.5716045                0.0000000 
#>                  Mitoses 
#>                0.0000000

The risk_mod object stores a glm object of this non-integer logistic regression model as $glm_mod.

coef(mod$glm_mod)
#>                Intercept           ClumpThickness     UniformityOfCellSize 
#>               -9.1529791                0.6430551                0.0000000 
#>    UniformityOfCellShape         MarginalAdhesion SingleEpithelialCellSize 
#>                0.4287034                0.0000000                0.0000000 
#>               BareNuclei           BlandChromatin           NormalNucleoli 
#>                0.5001540                0.5716045                0.0000000 
#>                  Mitoses 
#>                0.0000000

predict

Running predict() on a risk_mod object allows for three types of prediction, as the type parameter can be set to one of 'link', 'response', or 'score'. The first two options are the same as when predict() is run on a logistic glm object. The added 'score' option returns each subject’s score, as calculated from the integer coefficients in the risk score model.

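The table below compares the three possible prediction types for five example subjects. The covariates are abbreviated as CT (ClumpThickness), UCS (UniformityOfCellShape), BN (BareNuclei), and BC (BlandChromatin).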

Comparison of `predict()` outputs
Covariates				Prediction
CT	UCS	BN	BC	‘score’	‘link’	‘response’
5	1	1	3	82	-3.29	0.036
5	4	10	3	163	2.49	0.924
3	1	2	3	71	-4.08	0.017
6	8	4	3	154	1.85	0.864
4	1	1	3	73	-3.94	0.019

The ‘score’ is a linear combination of the covariates and their integer coefficients:

$\text{score} = 9(\text{CT}) + 6(\text{UCS}) + 7 (\text{BN}) + 8(\text{BC})$

The ‘link’ is a linear combination of the covariates using the full logistic regression equation:

$\text{link} = -9.15 + 0.643(\text{CT}) + 0.429(\text{UCS}) + 0.500(\text{BN}) + 0.572(\text{BC})$

The ‘response’ converts these link values to probabilities:

$\text{response} = e^{\text{link}}/(1+e^{\text{link}})$
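The three formulas can be checked against the comparison table in base R. The sketch below uses the rounded intercept and $\gamma$ from summary(mod) and the covariate values of the five example subjects; the object names are chosen for illustration, and the calculation does not call predict().

```r
points    <- c(CT = 9, UCS = 6, BN = 7, BC = 8)
intercept <- -128.1023    # rounded value reported by summary(mod)
gamma     <- 0.07145057

# Covariate values of the five example subjects
subjects <- rbind(c(5, 1, 1, 3),
                  c(5, 4, 10, 3),
                  c(3, 1, 2, 3),
                  c(6, 8, 4, 3),
                  c(4, 1, 1, 3))
colnames(subjects) <- names(points)

score    <- as.vector(subjects %*% points)   # integer score
link     <- gamma * (intercept + score)      # log-odds
response <- plogis(link)                     # probability

cbind(subjects, score,
      link = round(link, 2), response = round(response, 3))
```

Note that the link is simply the score shifted by the intercept and rescaled by $\gamma$, which is why the integer model and the logistic regression model give identical predictions.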

plot

The relationship between scores and risk can be visualized by calling plot() on a risk_mod object.

plot(mod, score_min = 30, score_max = 300)