kerntools provides R tools for working with a family
of Machine Learning methods called kernel methods. This package
implements several kernel functions for treating nonnegative and real
vectors, real matrices, categorical and ordinal variables, sets, and
strings. Several tools for studying the resulting kernel matrix or to
compare two kernel matrices are available. These diagnostic tools may be
used to infer the kernel(s) matrix(ces) suitability in model training.
kerntools
also provides functions for extracting the
feature importance of Support Vector Machines (SVMs) or displaying
customizable kernel Principal Components Analysis (PCA) plots. For
convenience, widespread model’s performance measures and feature
importance visualization are available for the user.
In this section several examples will be used to illustrate a typical
workflow of kerntools
.
Let’s suppose we are working with the well known iris
dataset. This dataset contains sepal and petal measurements for
N = 150 iris flowers. These flowers belong to three different
species: Iris setosa, Iris versicolor, and Iris
virginica. There are D = 4 numeric variables (also called
“features”) that are measured: Sepal.Length
,
Sepal.Width
, Petal.Length
and
Petal.Width
.
We first standardize these four variables so they have mean 0 and standard deviation 1. This places them on a common ground. Then, we compute our first kernel: the linear kernel.
iris_std <- scale(iris[,c( "Sepal.Length","Sepal.Width","Petal.Length", "Petal.Width")])
KL <- Linear(iris_std)
dim(KL)
#> [1] 150 150
class(KL)
#> [1] "matrix" "array"
The linear kernel is simply the pairwise inner product between all
N samples (in our case: flower measurements). The result is
“stored” on a matrix (the kernel matrix) that has dimensions
NxN. Furthermore, it is always symmetric and positive
semi-definite (PSD). To check the kernel between two samples (for
instance, 32 and 106), we can type either KL[32,106]
or
KL[106,32]
. It should be noted that kernel matrices
generated by kerntools
have class “matrix”, “array”. In
fact, to keep things simple, no function in this package requires any
special classes/objects not present in base R.
Next, we examine this kernel matrix further. Although it is not
excessively large, it is large enough to make simply typing
KL
in R unpractical. Instead, we may summarize its values
using a histogram:
This “almost-gaussian” shape is due to the features’ scaling. Further
parameters, including color, can be passed to histK()
(for
more info, check the documentation of graphics::hist()
).
The von Neumann entropy shown here is optional (in fact, this value can
be computed separately doing vonNeumann(KL)
). Entropy
values close to 0 indicate that all kernel matrix elements are very
similar, while values close to 1 indicate a high variability.
Another possibility is to visualize the whole kernel matrix with a heatmap:
Here, yellow denotes a high similarity between the samples, while red
denotes that the similarity is low (the colour palette is customizable
via the parameter color
). At a glance we see that the first
50 samples (I. setosa) have a higher intra-group similarity.
Also, they are very different from the samples 101-150 (which correspond
to I. virginica). Instead, I. versicolor is kind of
intermediate between these two groups.
To confirm our intuitions about the (dis)similarity among the three species, we may proceed to a widespread ordination method: the Principal Components Analysis (PCA). PCAs can be computed from kernel matrices very easily. In fact, using kernel matrices expands what a PCA can do, but this will be discussed in further sections. To display a beautiful PCA plot that colors the samples by species, we do:
Indeed, we can see that I. setosa and I. virginica
are the most different groups, while I. versicolor and
I.virginica are very close. The colors can be changed if
desired with the kPCA(...,colors)
parameter.
After seeing this plot, we can infer that a predictive model from these data will work well. Although we have a ton of machine learning methods at our disposal, in this vignette we will stick with the kernel methods. More specifically, we will use the most famous kernel method: the Support Vector Machine (SVM).
SVMs are not implemented in kerntools
.
However, they are in other R packages like kernlab
or
e1071
. Here we will use the ksvm()
function of
the former package:
library(kernlab)
set.seed(777)
## Training data
test_idx <- sample(1:150)[1:30] # 20% of samples
train_idx <- (1:150)[-test_idx]
KL_train <- KL[train_idx,train_idx]
## Model (training data)
linear_model <- kernlab::ksvm(x=KL_train, y=iris_species[train_idx], kernel="matrix")
linear_model
#> Support Vector Machine object of class "ksvm"
#>
#> SV type: C-svc (classification)
#> parameter : cost C = 1
#>
#> [1] " Kernel matrix used as input."
#>
#> Number of Support Vectors : 27
#>
#> Objective Function Value : -0.9459 -0.3184 -14.3709
#> Training error : 0.025
First and foremost: in prediction models, it is mandatory to have an additional test set so a honest estimation of the model’s performance can be computed (we will do this latter). Also, please note that in a real-world machine learning setting, training data should have been preprocessed first and then the same exact preprocessing should have been applied to test data. In our case, the only preprocessing was standardize the dataset: thus, the mean and standard deviation should have been computed from training data, and then these values should have been used for standardize both the training and test sets. That being said, in order to not interrupt the flow of this vignette, we will use leave things as they are.
Now returning to our (questionably obtained) model, we have a very low training error. The support vectors (which are the only samples that are relevant for us, as the rest are not used to define the SVM discriminating hyperplane) constitute only the 22% (approx) of samples. Not bad.
Before jumping to the test set, we may be interested in another topic: feature importance. This means studying which variables are considered more important by the model when discriminating between classes. Feature importance is important for avoiding “black-box models”: prediction models that we know that work well, but not why.
Obtaining the importances out of a SVM model can be somewhat
convoluted (this is discussed later in more depth) and sometimes
downright impossible. In our particular case, the only problem is that
we wanted to classify 3 classes (species)… but SVM classifiers are
binary. For discriminating 3 classes, kernlab
in fact
builds 3 classifiers: “setosa vs versicolor”, “setosa vs virginica”, and
“versicolor vs virginica”. These 3 classifiers constitute the
linear_model
object and the prediction of the class of a
sample is done by a voting scheme. To simplify things, for the features’
importance part, we will focus only on the third classifier: “versicolor
vs virginica”, which we have seen previously that are the two most
related species. The way to go here is to obtain the index of the
Support Vectors in our model, and their coefficients. All that is
gracefully provided by kernlab
. Then, we will return to
kerntools
and call the function svm_imp()
:
## Third model: Versicolor vs virginica
sv_index <- kernlab::alphaindex(linear_model)[[3]] # Vector with the SV indices
sv_coef <- kernlab::coef(linear_model)[[3]] # Vector with the SV coefficients
feat_imp3 <- svm_imp(X=iris_std[train_idx,],svindx=sv_index,coeff=sv_coef)
#> Do not use this function if the SVM model was created with the RBF,
#> Laplacian, Bray-Curtis, Jaccard/Ruzicka, or Kendall's tau kernels
Note that here we need the data that we used to compute
KL
: iris_std.
It is very important to
use this version of the dataset, and not any other version with more
pre-processing (or the “original” without pre-processing).
svm_imp()
has parameters like center
,
scale
and cos.norm
to take this widespread
normalization techniques into account, but it is better to play
safe.
Conveniently, kerntools
provides a function to visualize
the features’ importance:
#> $first_features
#> [1] "Petal.Length" "Petal.Width" "Sepal.Width" "Sepal.Length"
#>
#> $cumsum
#> [1] 1
#>
#> $barplot
#> [,1]
#> [1,] 0.7
#> [2,] 1.9
#> [3,] 3.1
#> [4,] 4.3
As we can see, the model considers the petals most discriminating that the sepals, and within the petals’ measures, the petal length.
If we return to the PCA, we could also check the weight of each
variable in the first and second PCs (that is, the ones we displayed).
To do so, it comes in handy the kPCA_imp()
function. Again,
this function requires the dataset that generated the KL
matrix:
loadings <- kPCA_imp(iris_std)
#> Do not use this function if the PCA was created with the RBF,
#> Laplacian, Bray-Curtis, Jaccard/Ruzicka, or Kendall's tau kernels
pcs <- loadings$loadings[1:2,]
pcs
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> PC1 0.5210659 -0.2693474 0.58041310 0.56485654
#> PC2 0.3774176 0.9232957 0.02449161 0.06694199
It seems that the first component is dominated by Petal Length and Petal Width but also by Sepal Length. The second component, that plays a role in discriminating I. versicolor and I.virginica, is dominated by Sepal Width. The PCA disagrees a bit with the SVM feature importances, but remember that in the latter we focused only on the “versicolor vs virginica” problem, while in the former we are looking at the ordination of the three classes. We may represent the contributions of the four features for the 1st PC, and to make things easier we will include the 2nd PC onto the barplot:
plotImp(pcs[1,], y=pcs[2,], ylegend="PC2",absolute=FALSE, main="PC1", leftmargin = 7, color="rosybrown1")
#> $first_features
#> [1] "Petal.Length" "Petal.Width" "Sepal.Length" "Sepal.Width"
#>
#> $cumsum
#> [1] 1
#>
#> $barplot
#> [,1]
#> [1,] 0.7
#> [2,] 1.9
#> [3,] 3.1
#> [4,] 4.3
We used absolute=FALSE
because the contribution of each
variable to the PC is relevant not only in magnitude, but also in sign.
Pink bars represent PC1, while the black line represents PC2 (parameter
y
). As we only wanted to see the order and the relative
magnitude, the X axis show the relative contribution (in the
plotImp()
function, relative=TRUE
by
default).
With kerntools
, we can draw the contributions on the PCA
plot as arrows. We only need the PCA plot (given by kPCA()
)
and the contributions (given by kPCA_imp()
):
(Note that the arrows are scaled to match with the original PCA plot.
They are somewhat orientative: their directions are correct, and longer
arrows represent a greater contribution to a PC that shorter arrows;
however, usually the arrows’ magnitudes do not coincide with the actual
magnitudes that can be computed from kPCA_imp()
).
And now, finally, we are going to check the performance in the test
set (considering the 3 classes). To do so, we subset KL
to
have a suitable test x training matrix, input the matrix into
our linear_model, and compare with the predicted species with the actual
species:
KL_test <- as.kernelMatrix(KL[test_idx,train_idx])
## Prediction (test data)
pred_class <- predict(linear_model,KL_test)
actual_class <- iris_species[test_idx]
pred_class
#> [1] versicolor versicolor virginica virginica versicolor virginica
#> [7] virginica virginica versicolor versicolor versicolor versicolor
#> [13] versicolor versicolor versicolor versicolor virginica versicolor
#> [19] versicolor versicolor virginica versicolor virginica versicolor
#> [25] versicolor virginica versicolor versicolor versicolor versicolor
#> Levels: setosa versicolor virginica
actual_class
#> [1] setosa setosa versicolor versicolor setosa setosa
#> [7] setosa versicolor setosa setosa versicolor virginica
#> [13] virginica virginica virginica virginica setosa versicolor
#> [19] setosa setosa setosa setosa versicolor setosa
#> [25] virginica versicolor virginica virginica virginica versicolor
#> Levels: setosa versicolor virginica
Mmmm… maybe we were a bit overconfident. It seems that the model has ignored I. setosa completely.
We can compute numerically how “good” (or wrong) our model is
according to different performance measures implemented in
kerntools
. When dealing with classification, all of them
need a contingency table that contrasts the actual and the predicted
classes (also known as “confusion matrix”). The most simple measure is
the accuracy: number of right predictions divided by the number of
predictions (which is the test set size: in our case, 30).
ct <- table(actual_class,pred_class) # Confusion matrix
ct
#> pred_class
#> actual_class setosa versicolor virginica
#> setosa 0 9 4
#> versicolor 0 3 5
#> virginica 0 9 0
Acc(ct) ## Accuracy
#> [1] 0.1
Acc_rnd(actual_class) ## Accuracy of the random model
#> [1] 0.3488889
As expected, our accuracy is overwhelmingly low. We can compare the result with the accuracy of the random model (according to the class distribution on the test): 0.35, which means that our model performs a lot worse than someone classifying at random.
We can explore other measures to infer what are the problems with our prediction model (for instance, if a species is systematically missclassified, etc.). For this example, we can compute these measures by class:
Prec(ct,multi.class = "none") ## Precision or Positive Predictive Value
#> setosa versicolor virginica
#> 0.0000000 0.1428571 0.0000000
Rec(ct,multi.class = "none") ## Recall or True Positive Rate
#> setosa versicolor virginica
#> 0.000 0.375 0.000
Spe(ct,multi.class = "none") ## Specificity or True Negative Rate
#> setosa versicolor virginica
#> 1.0000000 1.0000000 0.5714286
F1(ct,multi.class = "none") ## F1 (harmonic mean of Precision and Recall)
#> setosa versicolor virginica
#> 0.0000000 0.2068966 0.0000000
(In case we want the overall performance measure, we can compute the
mean of the three classes, or type
multi.class="macro"
).
The precision measure tell us that none of the samples predicted to be “virginica” or “setosa” are correct (in the case of “setosa”, because none was predicted), and only some (1/7) that were predicted to be “versicolor” were right. The recall shows that only 3/8 “versicolor” samples in the test were correctly classified as “versicolor”, while there is none of “setosa” or “virginica.” F1 is useful because it gives a “mean” of Precision and Recall. Meanwhile, the low specificity of “versicolor” points that a lot of samples that were not “versicolor” were predicted as such.
In the previous section we picked naively the first model we could train, with not-so-great results. Here, we are going to complicate things a bit hoping that we will obtain a model that works.
(Note: of course, iris flower classification is a simple task. In fact, it can be achieved pretty decently with the linear kernel, as can be deduced from the previous PCA: a linear classifier is enough to discriminate the flowers using only the 1st and 2nd PCs. However, for the sake of the example, we will use a different kernel in the present section).
The radial basis function (RBF) kernel is something like the “gold standard” among kernels. Unlike the linear kernel (which is the most simple or “plain” kernel), it is nonlinear: in fact, the RBF kernel is a universal approximator. We can compute it doing:
Kernel matrix values are typically between 0 and 1. The linear kernel
required only our dataset, but RBF()
has a (hyper)parameter
called gamma (g
for short). The value of this
hyperparameter should be decided by us, which is an important decision,
because it will affect the decision boundary of the kernel. Fortunately,
some heuristics to estimate a good gamma exist. kerntools
implement three of them, which are available in the function
estimate_gamma()
:
estimate_gamma(iris_std)
#> $d_criterion
#> [1] 0.25
#>
#> $scale_criterion
#> [1] 0.2512584
#>
#> $quantiles_criterion
#> 90% 50% 10%
#> 0.05570343 0.16670322 1.73533468
In the previous histogram we visualized the RBF with the gamma given by the “d_criterion” (and almost the one given by the “scale criterion”). The third heuristic gives us a distribution of “good” gamma values. Now, for the sake of comparison, we will compute the RBF kernel using the median of the “quantiles_criterion”:
Not only the histogram changes: the von Neumann entropy changes as
well. It important to remark that the RBF kernel is very sensitive to
the gamma values. The higher entropy with respect to that of the linear
kernel reflects that, here, we have a higher variability in the kernel
matrix values. (That can be also be deduced comparing both histograms.
Conversely, if we do heatK(Krbf)
, we will observe more
extreme values/colors than before). This paper recommends
an entropy between 0.3-0.5, so maybe this will be reflected on the SVM
model’s performance?
Now, we can also do a kernel PCA. Our previous kernel PCA used a linear kernel so, in reality, it was identical to a “normal” PCA. This time however we are using a different kernel and now we can actually say that this is a kernel PCA. The main difference is that the projection of samples is not going to be linear. Sometimes, this creates strange patterns that are difficult to interpret.
As later we are going to train a SVM model, it may occur to us that it would be great to do a PCA only with the training samples, so we can compare the prediction model with the PCA side by side. To do so, we will use the same training indices than in the previous section. Even better: what if we compute a (kernel) PCA with the training samples, and then project the test samples over them?
Krbf_train <- Krbf2[train_idx,train_idx]
Krbf_test <- Krbf2[test_idx,train_idx]
rbf_kpca <- kPCA(K=Krbf_train, Ktest=Krbf_test, plot = 1:2, y = iris_species[train_idx], title = "RBF PCA")
rbf_kpca$plot
(Note: remember that, in a real-world problem, the standardization of the dataset should have been done with the center and std deviation of the training set.)
Said and done! However, now the patterns on this kernel PCA are a
bit… radial. Still, I. setosa is again on one side, and I.
versicolor and I. virginica on the other. The red, green
and blue samples are the training samples, where the grey samples
correspond to the test samples we projected a posteriori (any
other color can be specified in the kPCA
parameter
na_col
).
What now? As we have generated more than one kernel matrix from the
same data (thanks to the linear and the RBF kernels), we may compare
these matrices. To do so, we can use a kerntools
function
called simK
:
simK(list(linear=KL,rbf_0.166=Krbf, rbf_0.25=Krbf2))
#> Remember that Klist should contain only kernel matrices (i.e. squared, symmetric and PSD).
#> This function does NOT verify the symmetry and PSD criteria.
#> linear rbf_0.166 rbf_0.25
#> linear 1.0000000 0.5208955 0.4803192
#> rbf_0.166 0.5208955 1.0000000 0.9898203
#> rbf_0.25 0.4803192 0.9898203 1.0000000
simK
will first remind us that a matrix should have
several mathematical properties to be a kernel matrix. When we work with
kernel matrices generated by kerntools
(that is:
Linear()
, RBF()
, etc.) everything will be
alright. However, you can come to kerntools
with your
precomputed kernel matrices (as long as they have class “matrix”,
“array”). kerntools
implicitly trusts the user knows what
he/she is doing, so remember using proper kernel matrices.
simK
returns a score between 0 and 1: 1 is complete
similarity, and 0 is complete dissimilarity. We can see that the two RBF
matrices are very similar, while the linear kernel matrix is around a
50% similar to the RBF matrices.
We could also compare the two PCAs. An option to do so is computing
the RV coefficient (Co-Inertia Analysis). However, the RV coefficient
between rbf_kpca
and kpca
should give the same
result than simK(list(KL,Krbf))
. It should be noted that
this equivalence only holds if the dataset is centered beforehand, as
PCAs usually are computed using centered data (for that reason we have
kPCA(..., center=TRUE)
by default). If a kernel matrix was
obtained from a non-centered data, it can be centered afterwards with
centerK()
(more of this in later sections).
Another way to compare two PCAs is called the Procrustes Analysis.
This analysis compares the correlation between two projections after
“removing” the translation, scale and rotation effects. Although is not
properly a kernel method, kerntools
can do a basic
Procrustes Analysis. In our data, we have a moderate Procrustes
correlation: 0.68 (the correlation coefficient is bounded between 0 and
1).
rbf_kpca <- kPCA(K=Krbf)
proc <- Procrustes(kpca$projection,rbf_kpca)
proc$pro.cor # Procrustes correlation
#> [1] 0.6862007
(This is a moment as good as any other to show that
kPCA()
can return the kernel PCA projection without
displaying any plot. In that case, all graphical parameters like colors,
labels, etc. are ignored.)
With all of these, we will train a brand new (and hopefully better) model. We will re-use the same training and test samples:
####### Model (training data)
model <- kernlab::ksvm(x=Krbf_train, y=iris_species[train_idx], kernel="matrix", C=10)
model
#> Support Vector Machine object of class "ksvm"
#>
#> SV type: C-svc (classification)
#> parameter : cost C = 10
#>
#> [1] " Kernel matrix used as input."
#>
#> Number of Support Vectors : 30
#>
#> Objective Function Value : -4.1707 -2.7089 -92.348
#> Training error : 0.008333
A very low training error, but now we are wiser. What about the performance in test?
Krbf_test <- as.kernelMatrix(Krbf_test)
####### Prediction (test data)
pred_class <- predict(model,Krbf_test)
actual_class <- iris_species[test_idx]
ct <- table(actual_class,pred_class) # Confusion matrix
Acc(ct) ## Accuracy
#> [1] 0.5
Wow, that seems a lot better! However, before we get excited, we must remember that this is a point estimation of accuracy, that comes from a specific test set (the 30 samples we chose randomly in the previous section). Another test set surely will deliver a different accuracy. What if we tried to compute a confidence interval (CI) to have an idea of how other test sets will behave?
kerntools
provides two ways to obtain a CI: the Normal
Approximation, and bootstrapping. Normal approximation is quicker, while
bootstrapping is usually considered to be safer (more details: here):
## Accuracy CI (95%)
Normal_CI(value = 0.5,ntest = 30) ## Accuracy CI (95%)
#> [1] 0.3210806 0.6789194
Boots_CI(target = actual_class, pred = pred_class, nboots = 2000,index = "acc")
#> 2.5% 97.5%
#> 0.5016667 0.3333333 0.7000000
Both functions default to a 95% CI, but that can be changed via the
confidence
parameter. According to the normal
approximation, the accuracy is 0.5 (0.32, 0.68), while according to
bootstrap strategy, it is 0.5 (0.33, 0.66). The CI is wide because the
test is very small (only 30 samples). However, with this test and CI
(with the 95% confidence) we cannot really assure that this
model is really different than the random model, which had an accuracy
of 0.35 (as we computed in the previous section). Useful reminder: next
time, we should choose a larger test set.
Before call it a day, we are going to compute again the other performance measures. This time, we will not compute them class by class, but on average (the “macro” approach):
Prec(ct) ## Precision or Positive Predictive Value
#> It is identical to weighted Accuracy
#> [1] 0.4357143
Rec(ct) ## Recall or True Positive Rate
#> [1] 0.4807692
Spe(ct) ## Specificity or True Negative Rate
#> [1] 0.8085477
F1(ct) ## F1 (harmonic mean of Precision and Recall)
#> [1] 0.4484848
If we desire to do, we can compute a CI for these values: for
instance, to bootstrap the macro-F1 value, we could simply type
index = "f1"
in the Boots_CI()
function. In
any case, we should congratulate ourselves because the performance is
clearly higher than last time. Training seriously a machine
learning model involves fine hyperparameter tuning (remember that C in
ksvm()
?) that we have almost completely skipped. That is:
we should use a strategy like, say, grid search, and compare the
performance measures of each hyperparameter combination via cross
validation, which is far beyond the purposes of this vignette (and
kerntools
).
Finally, a warning about computing feature importances in SVM and/or
feature contribution to PCs in PCA. In a few words: don’t do that when
using the RBF kernel. More precisely: of all kernel functions
implemented here, do not ever try to recover the feature contributions
for RBF()
, Laplace()
, Jaccard()
,
Ruzicka()
, BrayCurtis()
and
Kendall()
unless you know very well what you are
doing. If you type something like:
####### RBF versicolor vs virginica model:
sv_index <- kernlab::alphaindex(model)[[3]]
sv_coef <- kernlab::coef(model)[[3]]
svm_imp(X=iris_std[train_idx,],svindx=sv_index,coeff=sv_coef)
#> Do not use this function if the SVM model was created with the RBF,
#> Laplacian, Bray-Curtis, Jaccard/Ruzicka, or Kendall's tau kernels
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 0.8931422 0.7828191 12.3522663 10.0792776
something will be computed, but the result is wrong.
Do not ignore the warning raised here. This is not right because,
ultimately, all kernels behave like the linear kernel: they compute the
inner product of some features. But what features? That is the question.
Under the hood, all kernels “send” the original features (the feature
that live in the input space) to other space (usually
higher-dimensional) called the feature space, and they compute
the inner product there. The kernel conflates these two steps into one,
which usually simplifies a lot the calculations and saves a lot of
memory space: this is called the “kernel trick”. But to compute
analytically the feature contributions we need to go that feature space.
To make things worse, the feature space that the kernel implicitly is
using depends on things like the dimensionality of input data, the kind
of kernel, the specific value of its hyperparameters, etc. Going to this
feature space is only trivial for the linear kernel: only then input
space = feature space. Instead, for the RBF kernel, this feature space
is infinite dimensional. Some techniques to estimate it exist (see for
instance: Explicit
Approximations of the Gaussian Kernel), but they are not yet
implemented in kerntools
(and maybe they never are).
The natural workflow of this package has been shown (twice) in previous sections. For that reason, the remainder of this vignette will deal with more obscure (and interesting) matters concerning kernels for “non-standard” kinds of data. Also, we will delve deeper in the feature space and normalization effects.
Until now, we have worked with kernels for real vectors. That is, we
had a dataset that consisted of several features (four in our case: the
sepal and petal length and width) measured in a set of individuals (the
150 iris flowers). Another way of looking at it is considering that we
had 150 vectors of length 4 (incidentally, this is the way kernel
functions look at data). These were real vectors (or, at least, they
were after we standardized them). Unknowingly, we have also have worked
with kernels for real matrices: we compared three kernel matrices with
simK()
and the result was… another kernel matrix. In fact,
simK()
is simply a wrapper of Frobenius()
. In
the Frobenius kernel, the input of the function (the “objects” they work
with) are not vectors, but numeric matrices.
Most machine learning methods work primarily with real vectors or, in
some cases, matrices. In the case of the kernel methods, they can work
with virtually any kind of data we can think of. This is because what
the kernel method (SVM, kernel PCA, etc) sees is the kernel matrix, not
the original objects. So, as long as we can create a valid
(i.e. symmetric and PSD) kernel function for our objects, everything
will turn just well. The list of kernel functions is endless. Right now,
kerntools
can deal effortlessly with the following kinds of
data:
All of them are commonly used in different fields. For instace, categorical data is as widespread as numeric data (or more), text mining and content analysis is right now a very lively research field, and the Bray-Curtis and Ruzicka kernels are closely related to famous beta-diversity indices used in ecological studies.
We will illustrate how kerntools
works with categorical
variables. (For the rest of kernel functions, please read in detail
their specific documentation page). kerntools
includes a
categorical dataset called showdata
:
summary(showdata)
#> Favorite.color Favorite.actress Favorite.actor
#> blue :20 Sophie Turner :22 Peter Dinklage:20
#> red :20 Emilia Clarke :19 Kit Harington :17
#> black :14 Anya Chalotra :12 Henry Cavill :15
#> purple :12 Freya Allan :10 Lee Jung-jae :15
#> green : 8 Helena Bonham Carter: 8 Hyun Bin : 8
#> lightblue: 8 Lorraine Ashbourne : 7 Josh O'Connor : 7
#> (Other) :18 (Other) :22 (Other) :18
#> Favorite.show Liked.new.show
#> Game of Thrones :22 No :48
#> The witcher :17 Yes:52
#> Bridgerton :14
#> Squid game :11
#> The crown :11
#> La casa de papel: 8
#> (Other) :17
Here we can see 5 categorical features (class: “factor”). Typical
approaches to this kind of data involve recoding them to “dummy”
variables, so a single categorical variable is transformed to L
dummy variables (where L=number of classes, or using the R
nomenclature, number of levels). Presence of a given class is
marked with a 1 in its corresponding column, while the rest of entries
are 0. This is called one-hot-encoding, and in kerntools
this is done with the dummy_var()
function:
dummy_showdata <- dummy_data(showdata)
dummy_showdata[1:5,1:3]
#> Favorite.color_black Favorite.color_blue Favorite.color_green
#> 1 0 0 0
#> 2 1 0 0
#> 3 0 0 0
#> 4 0 1 0
#> 5 0 0 0
(sometimes, like in Linear Models, the design matrix contains
L-1 dummy variables. This kind of recoding can done with
model.matrix()
)
The approach using kernels is a bit different. Here we will work with the original dataset. The kernel will make the pairwise comparisons of the N = 100 samples and, for every pair of samples, it will ask: it this class equal in the two samples, or is it different? For example: “Favorite.color” is “red” in sample 1, “black” in sample 2, and “red” again in sample 3. The comparison of this categorical feature between samples 1-2 will return FALSE, while comparing samples 1-3 will return TRUE.
As we have not a single categorical variable but D=5, we
should combine the results of this comparison for the D
variables. That’s why we stated comp="sum"
: to make the
kernel sum the FALSES and TRUES of pairwise comparing “Favorite.color”,
“Favorite.actress”, “Favorite.actor”, “Favorite.show” and
“Liked.new.show” (we have also the option to average them, or compute a
weighting average, if we consider that some features are more important
than others). The histogram of KD
shows there are a few
identical samples, a lot of samples that are completely different, and
most of samples only are equivalent in one of the 5 features.
Now we have our kernel matrix! Now we can do with it whatever we
want, included training a prediction model or computing a kernel PCA.
Yes, exactly: although PCA is a technique that was originally created
for real numeric data, another advantage yet of kernels is that we can
do PCA of everything. For simplicity, here we will not train the
prediction model (but you are free to follow the steps shown in the
previous section), and will only show the kernel PCA As
showdata
contains the result of a (fictional) survey with
the idea to predict if people preferences could predict it they liked or
not a new show, this time, we are computing the kernel for features 1:4,
so we can color the kernel PCA with feature 5 (“Did you like the
show?”). Furthermore, for a change, we will also draw ellipses around
the centroids of the “Yes” group and the “No” group:
KD <- Dirac(showdata[,1:4], comp="sum",feat_space=TRUE)
dirac_kpca <- kPCA(KD$K,plot=1:2,y=factor(showdata$Liked.new.show),ellipse=0.66,title="Dirac kernel PCA")
dirac_kpca$plot
It seems that the group that liked the show forms a tight cluster, while people that do not liked it is scattered along the PC1. Now, we can study the contributions of each class and see if our intuitions are confirmed:
pcs <- kPCA_imp(KD$feat_space)
#> Do not use this function if the PCA was created with the RBF,
#> Laplacian, Bray-Curtis, Jaccard/Ruzicka, or Kendall's tau kernels
pc1 <- plotImp(pcs$loadings[1,],leftmargin=15,nfeat=10,absolute = FALSE, relative = FALSE,col ="bisque")
pc2 <- plotImp(pcs$loadings[2,],leftmargin=17,nfeat=10,absolute = FALSE, relative = FALSE, col="honeydew3")
(These plots show only the top 10 features.)
The PC1 seems lead especially by Game of Thrones (and related actors/actresses) to the right, and The Witcher (and related actors/actresses) to the left, with a small contribution of the black color. The PC2 (which seems more relevant than PC1) seems led for The Witcher on one side, and color blue and the Squid Game on the other (honorific mention here for Helena Bonham Carter). If we draw the arrows:
features <- union(pc1$first_features,pc2$first_features)
kPCA_arrows(plot=dirac_kpca$plot,contributions=t(pcs$loadings[1:2,features]),colour="grey15")
In summary, it would seem that a relevant fraction of these dataset is divided between the Game of Thrones and The Witcher fans, but both of them have one thing in common: they very clearly did not like the new show.
In the previous code we computed not only the Dirac kernel matrix,
but we also computed its feature space (feat_space=TRUE
).
Conveniently, the natural feature space where the Dirac kernel works is
the same generated by one-hot-encoding. That is what allowed us to
compute the contributions of each class (or dummy variable) to the PC1
and 2.
kerntools
provides functions for some normalization
techniques. Several are specific of kernel matrices, other are for data
matrices in general, and some of them are for both.
First we will see the normalization of data matrices or data.frames.
One of the most used techniques is standardization, which we saw in
previous sections. This is already implemented in R base via
scale(X,center=TRUE,scale=TRUE)
, which allows standardizing
(by column) or only centering or scaling (by column).
kerntools
has the minmax()
normalization,
which normalizes the dataset between 0 and 1. The centerX()
function centers a squared dataset by row or column (yes, only squared
datasets: to center by row a NxD dataset, you may use
scale(t(X),center=T,scale=F)
). Another useful function is
TSS()
, which operates too by row or column, and transforms
absolute frequencies to relative ones, so the row (or column) sums to 1.
In this vein, cosnormX()
normalizes by the L2 norm by row
(or column): that is, the norm of each row (or column) sums to 1. These
functions usually default to row-normalization, as this is the way
kernel functions look at data (minmax()
is an exception).
Finally, normalization for the Frobenius norm is available for data
matrices (and also kernel matrices) via frobNorm()
.
Apart from the Frobenius normalization, kerntools
has
two more normalization functions for kernel matrices:
cosNorm()
and centerK()
. The first one applies
the cosine normalization to a kernel matrix, so its maximum value is 1
(sometimes, this also bound the minimum value around 0). This operation
is related to cosnormX()
. In fact, when working with the
linear kernel (but only in that case!), these two operations are
equivalent:
KL1 <- Linear(cosnormX(iris[,1:4])) # important: by row
KL2 <- cosNorm(Linear(iris[,1:4])) # a third valid option is: Linear(iris[,1:4], cos.norm=TRUE)
simK(list(KL1=KL1,KL2=KL2))
#> Remember that Klist should contain only kernel matrices (i.e. squared, symmetric and PSD).
#> This function does NOT verify the symmetry and PSD criteria.
#> KL1 KL2
#> KL1 1 1
#> KL2 1 1
The second function is centerK()
(needless to say, this
function is somewhat related to centerX()
). Again,
centering the dataset by column and then computing the linear kernel, or
computing the linear kernel and then centering it with
centerK()
is equivalent. Then, why have two duplicated ways
for doing the same? Well, apart of speed (which expression is faster
depends on the dataset dimension, that is, if it has more rows or
columns), these expressions are only equivalent when using the linear
kernel. When using another kernel, cosnormX()
and
centerK()
normalize or center the kernel matrix… according
to features in feature space. Not in the input space. For this reason,
this:
center_iris <- scale(iris[,1:4],scale=FALSE,center=TRUE)
histK(RBF(center_iris,g=0.25),col="aquamarine")
and this:
don’t look alike in the slightest.
(Incidentally, it should be noted that RBF is translation-invariant
with respect to variables in input space. That is: standardization is
useful for this kernel, but simply centering is not.
RBF(iris[,1:4],g=0.25)
and using center_iris
is the same.)
In summary: if we want that the two are equivalent (like in the
linear kernel), we should computing the L2 norm or the centering using
the feature space, not the input space. This cannot be done for
the RBF()
function, although it is possible for the kernels
stated in the previous subsection.
Up to this moment we have explained normalization for real data. But
what about the other kinds of data? Well, kerntools
favors
the approach the other kinds of data are best dealt with kernel
functions. Also remember that some of the kerntools
kernel
functions of the previous subsection will also return the feature space
on demand. Still, kerntools
offers a basic one-hot-encoding
for categorical data. That is provided by dummy_data()
,
which converts a categorical dataset to a one-hot-encoded one. (This is
the feature space where the Dirac kernel works). This function requires
that the user specifies the number of levels per factor, but this can be
easily done with another function: dummy_var()
.
One advantage of using kernel matrices instead of original datasets is that kernel matrices can be combined very easily. For instance, imagine that we have two (or more) sources of data for the same individuals. In our example, dataset1 is numeric and has dimension NxD1, while the dataset2 has dimension NxD2 and contains other kind of data (for example categorical). You cannot sum or multiply these two datasets; however, you can sum or multiply their kernel matrices K1 and K2.
Let’s see a very simple illustration with the dataset
mtcars
.
dim(mtcars)
#> [1] 32 11
head(mtcars)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#> Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
mtcars
has 11 features: 9 are numeric while 2 can be
though as categorical: vs (engine, which can be V-shaped or straight),
and am (transmission: automatic or manual). We can split mtcars in two
parts: a data.frame with the numeric features and a data.frame with the
categorical ones:
Now we prepare a kernel matrix for each data.frame. For the dataset1 we use the linear kernel and for the dataset2 we use the Dirac kernel:
K <- array(dim=c(32,32,2))
K[,,1] <- Linear(MTCARS[[1]]) ## Kernel for numeric data
K[,,2] <- Dirac(MTCARS[[2]]) ## Kernel for categorical data
We can create a “consensus” kernel from K1 and K2 using a
kerntools
function:
It should be noted that, here, K1
has the same weight
than K2
when computing the final kernel, although
K1
has 9 variables and K2
has only 2. If we
want to weight equally each one of the 11 variables in the final kernel,
we can do:
coeff <- sapply(MTCARS,ncol)
coeff # K1 will weight 9/11 and K2 2/11.
#> num cat
#> 9 2
Kcons_var <- MKC(K,coeff=coeff)
Now, maybe we fancy comparing K1
and K2
to
our consensus kernel matrices:
simK(list(Kcons=Kcons,K1=K[,,1],K2=K[,,2]))
#> Remember that Klist should contain only kernel matrices (i.e. squared, symmetric and PSD).
#> This function does NOT verify the symmetry and PSD criteria.
#> Kcons K1 K2
#> Kcons 1.0000000 1.0000000 0.7871094
#> K1 1.0000000 1.0000000 0.7871069
#> K2 0.7871094 0.7871069 1.0000000
simK(list(Kcons_var=Kcons_var,K1=K[,,1],K2=K[,,2]))
#> Remember that Klist should contain only kernel matrices (i.e. squared, symmetric and PSD).
#> This function does NOT verify the symmetry and PSD criteria.
#> Kcons_var K1 K2
#> Kcons_var 1.0000000 1.0000000 0.7871074
#> K1 1.0000000 1.0000000 0.7871069
#> K2 0.7871074 0.7871069 1.0000000
Mmm… something strange is happening here. Shouldn’t K2
be more similar to the consensus matrix in the former case than in the
latter? This phenomenon is caused because we did not normalize dataset1
nor K1.
And then, we averaged K1
and
K2
, without taking into account that K1
has
very large values in comparison to K2
:
That is: we though that K2
was overrepresented in the
consensus kernel… but actually it was the other way around.
In previous sections we have seen the feature importance given by SVM models. That can be considered like some kind of a posteriori feature importance. However, we should be cautious regarding the implicit weights that we give to the features before training the model. We can think about this like some kind of a priori (my apologies to bayesian statisticians for this nomenclature!) feature importance. It is important to have this implicit weighting in mind because the SVM (or whatever other method we use) is not seeing the original data, but our kernel matrices. In fact, this “scale” problem arises in other (non-kernel) methods: for these reason we are advised to normalize or standardize numeric datasets.
Now, this time for real, we try to weight equally all our 11 features:
K[,,1] <- Linear(minmax(MTCARS[[1]])) ## Kernel for numeric data
K[,,2] <- Dirac(MTCARS[[2]],comp="sum") ## Kernel for categorical data
Kcons <- MKC(K)
Kcons_var <- MKC(K,coeff=coeff)
simK(list(Kcons=Kcons,K1=K[,,1],K2=K[,,2]))
#> Remember that Klist should contain only kernel matrices (i.e. squared, symmetric and PSD).
#> This function does NOT verify the symmetry and PSD criteria.
#> Kcons K1 K2
#> Kcons 1.0000000 0.9791777 0.9663366
#> K1 0.9791777 1.0000000 0.8939859
#> K2 0.9663366 0.8939859 1.0000000
simK(list(Kcons=Kcons_var,K1=K[,,1],K2=K[,,2]))
#> Remember that Klist should contain only kernel matrices (i.e. squared, symmetric and PSD).
#> This function does NOT verify the symmetry and PSD criteria.
#> Kcons K1 K2
#> Kcons 1.0000000 0.9977012 0.9222967
#> K1 0.9977012 1.0000000 0.8939859
#> K2 0.9222967 0.8939859 1.0000000
The details maybe sound a bit too specific, but the min-max
normalization bounds the numeric data between 0 and 1. The range [0,1]
is the same than the one-hot-encoding for categorical variables (the
feature space related to the Dirac kernel). In addition, we chose
comp=sum
because the linear kernel also “sums” each one of
the features. Now, both K1
and K2
are almost
equally similar to the consensus in the first case, and K2
is less similar in the second case. K1
and K2
have still a high similarity, but this is probably caused because
features in mtcars
are correlated.