This vignette covers changes between versions 2.0.1 and 2.0.2.
This document explains how to use the clustcurv R package for
clustering multiple nonparametric curves under the survival and
regression frameworks. To this end, we illustrate the use of the
package with real data sets. In the survival context, the algorithm
that determines the groups automatically is applied to the Veterans’
Administration Lung Cancer Data from the survival package.
For the regression analysis, the clustcurv R package includes a data
set, loaded with data(barnacle5), with measurements of rostro-carinal
length and dry weight of barnacles collected from five sites in Galicia
(northwest Spain).
We will use the Veterans’ Administration Lung Cancer Data,
data(veteran), to illustrate the package's capabilities for building
clusters of survival curves based on a covariate. This data set is
available in the survival package. In this study, a total of 137 males
with advanced inoperable lung cancer were randomized to either a
standard or a test chemotherapy. The primary endpoint for therapy
comparison was time to death, recorded for each patient. Other
covariates were also recorded. One of them is the categorical variable
histological type of tumor (celltype), with four levels: squamous,
small cell, adeno, and large cell.
After a regular installation with install.packages('clustcurv'), load
the packages and the data set.
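A minimal sketch of this step, assuming the survival package is available (it ships with standard R distributions). The veteran data are taken directly from the survival namespace, and the clustcurv line is left commented so that the snippet also runs before that package has been installed.

```r
# install.packages("clustcurv")  # run once
library(survival)                # provides the veteran data set
# library(clustcurv)             # uncomment once clustcurv is installed

# Take the data frame straight from the survival namespace.
veteran <- survival::veteran

# Columns used in the examples of this vignette.
str(veteran[, c("time", "status", "celltype")])

# Number of patients per histological type.
table(veteran$celltype)
```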
Clusters and estimates of the survival curves are obtained using the
ksurvcurves() function or the survclustcurves() function. The main
difference between them is that ksurvcurves(), given a fixed value of
\(K\), determines the group to which each survival function belongs,
whereas survclustcurves() additionally determines the number of groups
automatically. The functions verify that the data have been supplied
correctly and create kcurves and clustcurves objects, respectively.
Both functions determine the groups using either the \(K\)-means or the
\(K\)-medians optimization algorithm (algorithm = 'kmeans' or
algorithm = 'kmedians'). The first three arguments are required: time
is a vector of event times, status is a vector with the corresponding
event indicators, and x is the categorical covariate.
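To make the idea of grouping curves concrete, the following sketch estimates one Kaplan-Meier curve per level of celltype, evaluates the curves on a common time grid, and applies stats::kmeans() to them with \(K = 2\). It is only an illustration of the concept under simplifying assumptions (the grid, the use of kmeans() on raw curve values, and all object names are choices made here); it is not the algorithm implemented in ksurvcurves() or survclustcurves().

```r
library(survival)
veteran <- survival::veteran

# Common time grid on which all curves are compared (an arbitrary choice).
grid <- seq(0, 500, length.out = 50)

# One Kaplan-Meier curve per tumor type, evaluated on the grid.
# Result: a 50 x 4 matrix with one column per level of celltype.
km_curves <- sapply(levels(veteran$celltype), function(lev) {
  d <- veteran[veteran$celltype == lev, ]
  fit <- survfit(Surv(time, status) ~ 1, data = d)
  summary(fit, times = grid, extend = TRUE)$surv
})

# Treat each curve as one observation and split the curves into K = 2 groups.
set.seed(300716)
fit_k2 <- kmeans(t(km_curves), centers = 2, nstart = 10)
fit_k2$cluster
```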
It is often of interest to know both the assignment of the survival
curves to their groups and the automatically selected number of groups.
As mentioned, this is possible by means of the survclustcurves()
function. The following command provides an example of the output, here
using the \(K\)-medians algorithm (i.e. algorithm = 'kmedians'):
res <- survclustcurves(time = veteran$time, status = veteran$status,
                       x = veteran$celltype, algorithm = 'kmedians',
                       nboot = 100, seed = 300716)
#> Checking 1 cluster...
#> Checking 2 clusters...
#>
#> Finally, there are 2 clusters.
The function also includes an argument for reducing execution time by
parallelizing the testing procedure: cluster = TRUE. Relatedly, the
number of cores to be used in the parallelized procedure can be
specified with the argument ncores. By default, ncores = NULL, so that
the number of cores used is the number of cores of the machine minus
one.
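One plausible reading of this default can be sketched with the base parallel package. How clustcurv resolves ncores = NULL internally is an assumption here; the sketch only reproduces the "machine cores minus one" rule stated above.

```r
library(parallel)

ncores <- NULL  # same default as described in the text

if (is.null(ncores)) {
  available <- detectCores()              # may be NA on some platforms
  if (is.na(available)) available <- 2L   # fallback chosen for this sketch
  ncores <- max(1L, available - 1L)       # keep one core free, never fewer than one
}
ncores
```

The resulting value could then be passed explicitly together with cluster = TRUE in the call to survclustcurves().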
The following piece of code produces a short summary of the fit:
summary(res)
#>
#> Call:
#> survclustcurves(time = veteran$time, status = veteran$status,
#> x = veteran$celltype, nboot = 100, algorithm = "kmedians",
#> seed = 300716)
#>
#> Clustering curves in 2 groups
#>
#> Number of observations: 137
#> Cluster method: kmedians
#>
#> Factor's levels:
#> [1] "squamous" "smallcell" "adeno" "large"
#>
#> Clustering factor's levels:
#> [1] 1 2 2 1
#>
#> Testing procedure:
#>   H0    Tvalue pvalue
#> 1  1 20.822684   0.00
#> 2  2  7.480014   0.29
#>
#> Available components:
#> [1] "num_groups" "table" "levels" "cluster" "centers" "curves" "method" "data" "algorithm" "call"
As can be seen, the summary() function, as well as the print()
function, can be used to obtain brief information about the output of
survclustcurves().
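The printed summary can be read as a sequential test: \(K\) is increased until the null hypothesis of \(K\) groups is no longer rejected. The sketch below re-enters the values printed above by hand and applies that rule. The significance level alpha = 0.05 is an assumption made here, not something shown in the output.

```r
# Values copied from the summary output above.
tests <- data.frame(H0 = c(1, 2),
                    Tvalue = c(20.822684, 7.480014),
                    pvalue = c(0.00, 0.29))
alpha <- 0.05  # assumed significance level

# First K whose null hypothesis is not rejected.
k_selected <- tests$H0[which(tests$pvalue >= alpha)[1]]
k_selected

# Levels of celltype grouped by their cluster label.
lev <- c("squamous", "smallcell", "adeno", "large")
cl  <- c(1, 2, 2, 1)
groups <- split(lev, cl)
groups
```

With a fitted object, the same grouping should be recoverable from the levels and cluster components listed under "Available components".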
The graphical representation of the fitted model can be easily
obtained using the function autoplot(). The plot obtained by
specifying the arguments groups_by_color = FALSE and
interactive = TRUE represents the estimated survival curves for each
level of the factor celltype by means of the Kaplan-Meier estimator. As
expected, the survival of patients can be influenced by the cellular
type of the tumor.
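For readers who want to look at the same kind of curves without clustcurv, a rough base-R counterpart can be drawn with the survival package. This is not the autoplot() method described above, only a sketch of the underlying Kaplan-Meier estimates per tumor type.

```r
library(survival)
veteran <- survival::veteran

# One Kaplan-Meier curve per histological type.
km_fit <- survfit(Surv(time, status) ~ celltype, data = veteran)

# Static plot with one color per level of celltype.
plot(km_fit, col = 1:4, xlab = "Time", ylab = "Survival")
legend("topright", legend = levels(veteran$celltype), col = 1:4, lty = 1)
```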
The assignment of the curves to the two groups can be observed in
the following plot, obtained simply by setting groups_by_color = TRUE.
Equivalently, the following piece of code shows the input commands
and the results obtained with algorithm = 'kmeans'. The number of
groups and the assignments are the same as those obtained with
algorithm = 'kmedians'.
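The corresponding call presumably mirrors the earlier one, with only the algorithm argument changed. The sketch below is guarded so that it runs only when clustcurv is installed. The object name res_kmeans is chosen here for illustration, and no output is reproduced because none is given in this text.

```r
veteran <- survival::veteran

if (requireNamespace("clustcurv", quietly = TRUE)) {
  # Same arguments as the K-medians example, with algorithm = 'kmeans'.
  res_kmeans <- clustcurv::survclustcurves(time = veteran$time,
                                           status = veteran$status,
                                           x = veteran$celltype,
                                           algorithm = "kmeans",
                                           nboot = 100, seed = 300716)
  print(res_kmeans)
}
```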
We will use the barnacle growth data, data(barnacle5), to illustrate
the package's capabilities for building clusters of regression curves
based on a covariate. This data set (barnacle5) is available in the
clustcurv package. A total of 5000 specimens were collected from five
sites on the region’s Atlantic coastline, corresponding to the
stretches of coast where this species is harvested: Punta do Mouro,
Punta Lens, Punta de la Barca, Punta del Boy and Punta del Alba. Two
biometric variables were measured on each specimen: RC (rostro-carinal
length, the maximum distance across the capitulum between the ends of
the rostral and carinal plates) and DW (dry weight).
data("barnacle5")
head(barnacle5)
#>     DW   RC    F
#> 1 0.52 12.0 laxe
#> 2 1.46 18.9 laxe
#> 3 0.05  6.4 laxe
#> 4 0.17  9.4 laxe
#> 5 0.05  6.2 laxe
#> 6 0.41 12.2 laxe
Here, the aim is to study the relation between the RC and DW variables
along the coast, i.e., to analyze whether barnacle growth is similar in
all locations F or whether, by contrast, it is possible to detect
geographical differentiation in growth. To do this, the
regclustcurves() function is used with the input variables y, x and z,
by executing the following piece of code:
fit.bar <- regclustcurves(y = barnacle5$DW, x = barnacle5$RC, z = barnacle5$F,
                          nboot = 100, seed = 300716, algorithm = 'kmeans')
#> Checking 1 cluster...
#> Checking 2 clusters...
#> Checking 3 clusters...
#>
#> Finally, there are 3 clusters.
The output of this function can be inspected with the print() or
summary() functions. An example is shown below:
print(fit.bar)
#>
#> Call:
#> regclustcurves(y = barnacle5$DW, x = barnacle5$RC, z = barnacle5$F,
#> nboot = 100, algorithm = "kmeans", seed = 300716)
#>
#> Clustering curves in 3 groups
#>
#> Number of observations: 5000
#> Cluster method: kmeans
summary(fit.bar)
#>
#> Call:
#> regclustcurves(y = barnacle5$DW, x = barnacle5$RC, z = barnacle5$F,
#> nboot = 100, algorithm = "kmeans", seed = 300716)
#>
#> Clustering curves in 3 groups
#>
#> Number of observations: 5000
#> Cluster method: kmeans
#>
#> Factor's levels:
#> [1] "laxe" "lens" "barca" "boy" "alba"
#>
#> Clustering factor's levels:
#> [1] 2 3 1 1 2
#>
#> Testing procedure:
#>   H0     Tvalue pvalue
#> 1  1 0.94353014   0.00
#> 2  2 0.15463483   0.02
#> 3  3 0.02348982   0.46
#>
#> Available components:
#> [1] "num_groups" "table" "levels" "cluster" "centers" "curves" "method" "data" "algorithm" "call"
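To read the cluster labels against the sampling sites, the two vectors printed above can be paired by hand. The values are copied from the summary output; the object names are illustrative.

```r
# Values copied from the summary output above.
sites   <- c("laxe", "lens", "barca", "boy", "alba")
cluster <- c(2, 3, 1, 1, 2)

# One row per site, sorted by cluster label.
assignment <- data.frame(site = sites, cluster = cluster)
assignment[order(assignment$cluster), ]

# Number of sites per cluster.
sites_per_cluster <- table(assignment$cluster)
sites_per_cluster
```

Under this assignment barca and boy share one growth curve, laxe and alba share another, and lens forms a group of its own.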
As in the example with survival curves shown before, the results
obtained above can be plotted using the autoplot() function.