Clustering Multiple Nonparametric Curves

Nora M. Villanueva & Marta Sestelo

2024-10-23

This vignette covers changes between versions 2.0.1 and 2.0.2.

This document explains how to use the clustcurv R package for clustering multiple nonparametric curves under the survival and regression frameworks. To this end, we illustrate the use of the package with some real data sets. In the survival context, the algorithm that determines groups automatically is applied to the Veterans’ Administration Lung Cancer Data from the survival package. For the regression analysis, the clustcurv R package includes a data set, loaded with data(barnacle5), with measurements of rostro-carinal length and dry weight of barnacles collected from five sites in Galicia (northwest of Spain).

Clustering multiple survival curves

We will use the Veterans’ Administration Lung Cancer Data, data(veteran), to illustrate the package's capabilities for building clusters of survival curves based on a covariate. This data set is available in the survival package. In this study, a total of 137 males with advanced inoperable lung cancer were randomized to either a standard or a test chemotherapy. The primary endpoint for therapy comparison was time to death, recorded for each patient. Other covariates were also recorded. One of them is the categorical variable histological type of tumor, with four levels: squamous, small cell, adeno, and large cell.

Introduction

After regular installation with install.packages(), load the packages and the data set with

library(clustcurv)
library(survival)
head(veteran)
#>   trt celltype time status karno diagtime age prior
#> 1   1 squamous   72      1    60        7  69     0
#> 2   1 squamous  411      1    70        5  64    10
#> 3   1 squamous  228      1    60        3  38     0
#> 4   1 squamous  126      1    60        9  63    10
#> 5   1 squamous  118      1    70       11  65    10
#> 6   1 squamous   10      1    20        5  49     0

Algorithm for determining groups of survival curves

Based on \(K\)-medians

Clusters and estimates of the survival curves are obtained using the ksurvcurves() function or the survclustcurves() function. The main difference between them is that ksurvcurves(), given a fixed value of \(K\), determines the group to which each survival function belongs, whereas survclustcurves() is also able to determine the number of groups automatically. The functions verify that the data have been supplied correctly and create kcurves and clustcurves objects, respectively. Both functions allow determining groups using the optimization algorithm \(K\)-means or \(K\)-medians (e.g. algorithm = 'kmeans', or algorithm = 'kmedians'). The first three arguments must be supplied: time is a vector of event times, status holds the corresponding status indicators, and x is the categorical covariate.
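As a minimal sketch of the fixed-\(K\) case, the call below asks ksurvcurves() for two groups. The argument name k and the returned components levels and cluster are assumptions based on the description above and should be checked against ?ksurvcurves; the choice K = 2 is only illustrative.

```r
library(clustcurv)
library(survival)   # provides the veteran data set

# Fix the number of groups in advance (K = 2 is an arbitrary illustrative choice).
# The argument name `k` is an assumption; see ?ksurvcurves.
fit_k2 <- ksurvcurves(time = veteran$time, status = veteran$status,
                      x = veteran$celltype, k = 2,
                      algorithm = 'kmedians', seed = 300716)

# Group label assigned to each level of celltype
# (component names `levels` and `cluster` are assumed)
data.frame(level = fit_k2$levels, cluster = fit_k2$cluster)
```

No bootstrap test is involved in this sketch, since the number of groups is given rather than selected.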

It can be of interest to know both the assignment of the survival curves to the groups to which they belong and the automatically selected number of groups. As mentioned, this is possible by means of the survclustcurves() function. The following command provides an example of the output using the \(K\)-medians algorithm (i.e. algorithm = 'kmedians')

res <- survclustcurves(time = veteran$time, status = veteran$status,
x = veteran$celltype, algorithm = 'kmedians', nboot = 100, seed=300716)
#> Checking 1 cluster... 
#> Checking 2 clusters... 
#> 
#> Finally, there are 2 clusters.

The function above also includes an argument for reducing execution time by parallelizing the testing procedure: cluster = TRUE. Related to this argument, the number of cores to be used in the parallelized procedure can be specified with the argument ncores. By default, ncores = NULL, so that the number of cores used is equal to the number of cores of the machine minus 1.
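The default described above can be reproduced by hand, which is useful for checking how many cores would be used before launching a long run. The sketch below relies only on the base parallel package; the variable names available and n_cores are illustrative, and the commented call simply reuses the arguments shown earlier.

```r
library(parallel)

# detectCores() may return NA on some platforms, so fall back to 1
available <- detectCores()
n_cores <- if (is.na(available)) 1L else max(1L, available - 1L)
n_cores

# The value could then be passed explicitly, for example:
# res_par <- survclustcurves(time = veteran$time, status = veteran$status,
#                            x = veteran$celltype, algorithm = 'kmedians',
#                            nboot = 100, seed = 300716,
#                            cluster = TRUE, ncores = n_cores)
```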

The following piece of code can be executed for obtaining a small summary of the fit


summary(res)
#> 
#> Call:  
#> survclustcurves(time = veteran$time, status = veteran$status, 
#>     x = veteran$celltype, nboot = 100, algorithm = "kmedians", 
#>     seed = 300716)
#> 
#> Clustering curves in 2 groups
#> 
#> Number of observations:  137
#> Cluster method:  kmedians 
#> 
#> Factor's levels:
#> [1] "squamous"  "smallcell" "adeno"     "large"    
#> 
#> Clustering factor's levels:
#> [1] 1 2 2 1
#> 
#> Testing procedure:
#>   H0    Tvalue pvalue
#> 1  1 20.822684   0.00
#> 2  2  7.480014   0.29
#> 
#> Available components:
#>  [1] "num_groups" "table"      "levels"     "cluster"    "centers"    "curves"     "method"     "data"       "algorithm"  "call"

As can be seen, the summary() function, as well as the print() function, can be used to obtain some brief information about the output from survclustcurves().
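The two vectors printed under "Factor's levels" and "Clustering factor's levels" are aligned position by position, so the composition of each group can be read off by splitting one by the other. The sketch below copies the values from the summary output rather than refitting the model; with a fitted object, the components listed as "levels" and "cluster" would presumably play the same role.

```r
# Values copied from the summary output above
lev     <- c("squamous", "smallcell", "adeno", "large")
cluster <- c(1, 2, 2, 1)

# List the levels that fall in each group
groups <- split(lev, cluster)
groups
```

Here group 1 contains squamous and large, and group 2 contains smallcell and adeno.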

The graphical representation of the fitted model can be easily obtained using the autoplot() function. The plot obtained with the arguments groups_by_colour = FALSE and interactive = TRUE shows the estimated survival curves for each level of the factor celltype, computed by means of the Kaplan-Meier estimator. As expected, the survival of patients can be influenced by the cellular type of the tumor.

autoplot(res , groups_by_colour = FALSE, interactive = TRUE)
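For readers who want to inspect the underlying Kaplan-Meier estimates outside of clustcurv, a rough equivalent can be computed with survfit() from the survival package. This is only a sketch of the per-level curves; it does not reproduce the clustering or the styling of the plot above.

```r
library(survival)

# Kaplan-Meier estimate of the survival function for each cell type
km <- survfit(Surv(time, status) ~ celltype, data = veteran)

# Sample size, number of events and median survival time per curve
print(km)

# Base-graphics plot of the four curves
plot(km, col = 1:4, xlab = "Time", ylab = "Survival")
legend("topright", legend = levels(veteran$celltype), col = 1:4, lty = 1)
```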

The assignment of the curves to the two groups can be observed in the following plot by simply setting groups_by_colour = TRUE

autoplot(res , groups_by_colour = TRUE, interactive = TRUE)

Based on \(K\)-means

Equivalently, the following piece of code shows the input commands and the results obtained with algorithm = 'kmeans'. The number of groups and the assignments are the same as those obtained with algorithm = 'kmedians'.

res2 <- survclustcurves(time = veteran$time, status = veteran$status,
x = veteran$celltype, algorithm = 'kmeans', nboot = 100, seed=300716)
#> Checking 1 cluster... 
#> Checking 2 clusters... 
#> 
#> Finally, there are 2 clusters.

Clustering multiple regression curves

We will use the barnacle growth data, data(barnacle5), to illustrate the package's capabilities for building clusters of regression curves based on a covariate. This data set is available in the clustcurv package. A total of 5000 specimens were collected from five sites on the region’s Atlantic coastline, which correspond to the stretches of coast where this species is harvested: Punta do Mouro, Punta Lens, Punta de la Barca, Punta del Boy and Punta del Alba. Two biometric variables were measured for each specimen: RC (rostro-carinal length, the maximum distance across the capitulum between the ends of the rostral and carinal plates) and DW (dry weight).

data("barnacle5")
head(barnacle5)
#>     DW   RC    F
#> 1 0.52 12.0 laxe
#> 2 1.46 18.9 laxe
#> 3 0.05  6.4 laxe
#> 4 0.17  9.4 laxe
#> 5 0.05  6.2 laxe
#> 6 0.41 12.2 laxe

Here, the aim is to study the relationship between the RC and DW variables along the coast, i.e., to analyze whether barnacle growth is similar in all locations (F) or whether, by contrast, geographical differentiation in growth can be detected. To do this, the regclustcurves() function is used with the input variables y, x, and z, by executing the following piece of code

fit.bar <- regclustcurves(y = barnacle5$DW, x = barnacle5$RC, z = barnacle5$F,
                          nboot = 100, seed = 300716, algorithm = 'kmeans')
#> Checking 1 cluster... 
#> Checking 2 clusters... 
#> Checking 3 clusters... 
#> 
#> Finally, there are 3 clusters.

The output of this function can be inspected with the print() or summary() functions. An example follows

print(fit.bar)
#> 
#> Call:  
#> regclustcurves(y = barnacle5$DW, x = barnacle5$RC, z = barnacle5$F, 
#>     nboot = 100, algorithm = "kmeans", seed = 300716)
#> 
#> Clustering curves in 3 groups
#> 
#> Number of observations:  5000
#> Cluster method:  kmeans
summary(fit.bar)
#> 
#> Call:  
#> regclustcurves(y = barnacle5$DW, x = barnacle5$RC, z = barnacle5$F, 
#>     nboot = 100, algorithm = "kmeans", seed = 300716)
#> 
#> Clustering curves in 3 groups
#> 
#> Number of observations:  5000
#> Cluster method:  kmeans 
#> 
#> Factor's levels:
#> [1] "laxe"  "lens"  "barca" "boy"   "alba" 
#> 
#> Clustering factor's levels:
#> [1] 2 3 1 1 2
#> 
#> Testing procedure:
#>   H0     Tvalue pvalue
#> 1  1 0.94353014   0.00
#> 2  2 0.15463483   0.02
#> 3  3 0.02348982   0.46
#> 
#> Available components:
#>  [1] "num_groups" "table"      "levels"     "cluster"    "centers"    "curves"     "method"     "data"       "algorithm"  "call"
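The "Testing procedure" table can be read as a sequence of tests, one per candidate number of groups, where the selected number is the first one whose null hypothesis is not rejected. The sketch below applies that reading to the values printed above. The significance level alpha = 0.05 is an assumption, and the snippet only illustrates how to interpret the table, not the package's internal code.

```r
# Testing table copied from the summary output above
tab <- data.frame(H0     = 1:3,
                  Tvalue = c(0.94353014, 0.15463483, 0.02348982),
                  pvalue = c(0.00, 0.02, 0.46))

alpha <- 0.05  # assumed significance level

# Smallest number of groups whose null hypothesis is not rejected
k_selected <- tab$H0[which(tab$pvalue > alpha)[1]]
k_selected
```

With these values the result is 3, in agreement with the message "Finally, there are 3 clusters."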

As in the example with survival curves shown before, the results obtained above can be plotted using the autoplot() function

autoplot(fit.bar, groups_by_colour = TRUE, interactive = TRUE)
