Download a copy of the vignette to follow along here: a_complete_example.Rmd
We recommend you go through the simple example before working through this one.
This vignette walks through how the core functionality of this package, including:
Your data should be loaded into the R environment in the following format:
If you wish to use imputation to handle missingness in your data, you can take a look at the imputation vignette which outlines a basic workflow for meta clustering across multiple imputations of the same dataset.
The package comes with a few mock dataframes based on real data from the Adolescent Brain Cognitive Development study:
anxiety
(anxiety scores from the CBCL)depress
(depression scores from the CBCL)cort_t
(cortical thicknesses)cort_sa
(cortical surface areas in mm^2)subc_v
(subcortical volumes in mm^3)income
(household income on a 1-3 scale)pubertal
(pubertal status on a 1-5 scale)Here’s what the cortical thickness data looks like:
library(metasnf)
class(cort_t)
#> [1] "tbl_df" "tbl" "data.frame"
dim(cort_t)
#> [1] 188 152
str(cort_t[1:5, 1:5])
#> Classes 'tbl_df', 'tbl' and 'data.frame': 5 obs. of 5 variables:
#> $ unique_id: chr "NDAR_INV0567T2Y9" "NDAR_INV0GLZNC2W" "NDAR_INV0IZ157F8" "NDAR_INV0J4PYA5F" ...
#> $ mrisdp_1 : num 2.6 2.62 2.62 2.6 2.53
#> $ mrisdp_2 : num 2.49 2.85 2.29 2.67 2.76
#> $ mrisdp_3 : num 2.8 2.78 2.53 2.68 2.83
#> $ mrisdp_4 : num 2.95 2.85 2.96 2.94 2.99
cort_t[1:5, 1:5]
#> unique_id mrisdp_1 mrisdp_2 mrisdp_3 mrisdp_4
#> 1 NDAR_INV0567T2Y9 2.601 2.487 2.801 2.954
#> 2 NDAR_INV0GLZNC2W 2.619 2.851 2.784 2.846
#> 3 NDAR_INV0IZ157F8 2.621 2.295 2.530 2.961
#> 4 NDAR_INV0J4PYA5F 2.599 2.670 2.676 2.938
#> 5 NDAR_INV0OYE291Q 2.526 2.761 2.829 2.986
The first column unique_id
is the unique identifier
(UID) for all subjects in the data.
Here’s the household income data:
dim(income)
#> [1] 275 2
str(income[1:5, ])
#> Classes 'tbl_df', 'tbl' and 'data.frame': 5 obs. of 2 variables:
#> $ unique_id : chr "NDAR_INV0567T2Y9" "NDAR_INV0GLZNC2W" "NDAR_INV0IZ157F8" "NDAR_INV0J4PYA5F" ...
#> $ household_income: num 3 NA 1 2 1
income[1:5, ]
#> unique_id household_income
#> 1 NDAR_INV0567T2Y9 3
#> 2 NDAR_INV0GLZNC2W NA
#> 3 NDAR_INV0IZ157F8 1
#> 4 NDAR_INV0J4PYA5F 2
#> 5 NDAR_INV0OYE291Q 1
Putting everything in a list will help us get quicker summaries of all the data.
data <- list(
anxiety,
depress,
cort_t,
cort_sa,
subc_v,
income,
pubertal
)
# The number of rows in each dataframe:
lapply(data, dim)
#> [[1]]
#> [1] 275 2
#>
#> [[2]]
#> [1] 275 2
#>
#> [[3]]
#> [1] 188 152
#>
#> [[4]]
#> [1] 188 152
#>
#> [[5]]
#> [1] 174 31
#>
#> [[6]]
#> [1] 275 2
#>
#> [[7]]
#> [1] 275 2
# Whether or not each dataframe has missing values:
lapply(data,
function(x) {
any(is.na(x))
}
)
#> [[1]]
#> [1] TRUE
#>
#> [[2]]
#> [1] TRUE
#>
#> [[3]]
#> [1] FALSE
#>
#> [[4]]
#> [1] FALSE
#>
#> [[5]]
#> [1] FALSE
#>
#> [[6]]
#> [1] TRUE
#>
#> [[7]]
#> [1] TRUE
Some of our data has missing values and not all of our dataframes
have the same number of participants. SNF can only be run with complete
data, so you’ll need to either use complete case analysis (removal of
observations with any missing values) or impute the missing values to
proceed with the clustering. As mentioned above, metasnf
can be used to visualize changes in clustering results across different
imputations of the data.
For now, we’ll just examine the simpler complete-case analysis
approach by reducing our dataframes to only common and complete
observations. This can be made easier using the
get_complete_uids
function.
complete_uids <- get_complete_uids(data, uid = "unique_id")
print(length(complete_uids))
#> [1] 87
# Reducing dataframes to only common subjects with no missing data
anxiety <- anxiety[anxiety$"unique_id" %in% complete_uids, ]
depress <- depress[depress$"unique_id" %in% complete_uids, ]
cort_t <- cort_t[cort_t$"unique_id" %in% complete_uids, ]
cort_sa <- cort_sa[cort_sa$"unique_id" %in% complete_uids, ]
subc_v <- subc_v[subc_v$"unique_id" %in% complete_uids, ]
income <- income[income$"unique_id" %in% complete_uids, ]
pubertal <- pubertal[pubertal$"unique_id" %in% complete_uids, ]
The data_list
structure is a structured list of
dataframes (like the one already created), but with some additional
metadata about each dataframe. It should only contain the input
dataframes we want to directly use as inputs for the clustering. Out of
all the data we have available to us, we may be working in a context
where the anxiety and depression data are especially important outcomes,
and we want to know if we can find subtypes using the rest of the data
which still do a good job of separating out subjects by their anxiety
and depression scores.
We’ll start by creating a data list that stores our input features.
# Note that you do not need to explicitly name every single named element
# (data = ..., name = ..., etc.)
data_list <- generate_data_list(
list(
data = cort_t,
name = "cortical_thickness",
domain = "neuroimaging",
type = "continuous"
),
list(
data = cort_sa,
name = "cortical_surface_area",
domain = "neuroimaging",
type = "continuous"
),
list(
data = subc_v,
name = "subcortical_volume",
domain = "neuroimaging",
type = "continuous"
),
list(
data = income,
name = "household_income",
domain = "demographics",
type = "continuous"
),
list(
data = pubertal,
name = "pubertal_status",
domain = "demographics",
type = "continuous"
),
uid = "unique_id"
)
This process removes any observations that did not have complete data
across all provided input dataframes. If you’d like to keep track of
that information, you can set the “return_missing” parameter to
TRUE
and receive a list containing the data_list as well as
the removed observations. The structure of the data list is a nested
list tracking the data, the name of the dataframe, what domain (broader
source of information) the data belongs to, and the type of feature
stored in the dataframe. Options for feature type include “continuous”,
“discrete”, “ordinal”, and “categorical”.
The “uid” parameter is the name of the column in the dataframes that uniquely identifies each observation. Upon data list creation, the UID will be converted to “subjectkey” for consistency across other functions in the package.
We can get a summary of our constructed data_list
with
the summarize_dl
function:
summarize_dl(data_list)
#> name type domain length width
#> 1 cortical_thickness continuous neuroimaging 87 152
#> 2 cortical_surface_area continuous neuroimaging 87 152
#> 3 subcortical_volume continuous neuroimaging 87 31
#> 4 household_income continuous demographics 87 2
#> 5 pubertal_status continuous demographics 87 2
Each input dataframe now has the same 87 subjects with complete data. The width refers to the number of columns in each dataframe, which equals 1 for the UID (subjectkey) column + the number of features in the dataframe.
data_list
now stores all the features we intend on using
for the clustering. We’re interested in knowing if any of the clustering
solutions we generate can distinguish children apart based on their
anxiety and depression scores. To do this, we’ll also create a data list
storing only features that we’ll use for evaluating our cluster
solutions and not for the clustering itself. We’ll refer to this as the
“target_list”.
target_list <- generate_data_list(
list(anxiety, "anxiety", "behaviour", "ordinal"),
list(depress, "depressed", "behaviour", "ordinal"),
uid = "unique_id"
)
summarize_dl(target_list)
#> name type domain length width
#> 1 anxiety ordinal behaviour 87 2
#> 2 depressed ordinal behaviour 87 2
Note that it is not necessary to make use of a partition of input and out-of-model measures in this way. If you’d like to have no target list and instead use every single feature of interest for the clustering, you can stick to just using one data list.
The settings_matrix
stores all the information about the
settings we’d like to use for each of our SNF runs. Calling the
generate_settings_matrix
function with a specified number
of rows will automatically build a randomly populated
settings_matrix
.
set.seed(42)
settings_matrix <- generate_settings_matrix(
data_list,
nrow = 20,
min_k = 20,
max_k = 50
)
settings_matrix[1:5, ]
#> row_id alpha k t snf_scheme clust_alg cont_dist disc_dist ord_dist cat_dist
#> 1 1 0.5 29 20 2 1 1 1 1 1
#> 2 2 0.4 26 20 1 1 1 1 1 1
#> 3 3 0.3 44 20 2 2 1 1 1 1
#> 4 4 0.3 43 20 1 1 1 1 1 1
#> 5 5 0.5 29 20 2 2 1 1 1 1
#> mix_dist inc_cortical_thickness inc_cortical_surface_area
#> 1 1 1 0
#> 2 1 1 1
#> 3 1 1 0
#> 4 1 1 1
#> 5 1 1 1
#> inc_subcortical_volume inc_household_income inc_pubertal_status
#> 1 1 0 1
#> 2 1 1 1
#> 3 0 1 1
#> 4 0 1 1
#> 5 1 1 1
The columns are:
row_id
: Integer to keep track of which row is
whichalpha
- A hyperparameter for SNF (feature that
influences the subtyping process)k
- A hyperparameter for SNFt
- A hyperparameter for SNFsnf_scheme
- the specific way in which input data gets
collapsed into a final fused network (discussed further in the SNF
schemes vignette)clust_alg
- Which clustering algorithm will be applied
to the final fused network produced by SNF*_dist
- Which distance metric will be used for the
different types of data (discussed further in the distance
metrics vignette)inc_*
- binary columns indicating whether or not an
input dataframe is included (1) or excluded (0) from the corresponding
SNF run (discussed further in the settings
matrix vignette)Without specifying any additional parameters,
generate_settings_matrix
randomly populates these columns
and ensures that no generated rows are identical.
What’s important for now is that the matrix (technically a dataframe
in the R environment) contains several rows which each outline a
different but reasonable way in which the raw data could be converted
into a cluster solution. Further customization of the
settings_matrix
will enable you to access the broadest
possible space of reasonable cluster solutions that your data can
produce using SNF and ideally get you closer to a generalizable and
useful solution for your context. More on settings_matrix
customization can be found in the settings
matrix vignette.
Setting the optional seed
parameter (which will affect
the seed of your entire R session) ensures that the same settings matrix
is generated each time we run the code.
While we end up with a random set of settings here, there is nothing
wrong with manually altering the settings matrix to suit your needs. For
example, if you wanted to know how much of a difference one input
dataframe made, you could ensure that half of the rows included this
input dataframe and the other half didn’t. You can also add random rows
to an already existing dataframe using the
add_settings_matrix_rows
function (further discussed in the
vignette).
The batch_snf
function integrates the data in the
data_list
using each of the sets of settings contained in
the settings_matrix
. The resulting structure is an
solutions_matrix
which is an extension of the
settings_matrix
that contains columns specifying which
cluster each subject was assigned for the corresponding
settings_matrix
row.
solutions_matrix <- batch_snf(data_list, settings_matrix)
colnames(solutions_matrix)[1:30]
#> [1] "row_id" "alpha"
#> [3] "k" "t"
#> [5] "snf_scheme" "clust_alg"
#> [7] "cont_dist" "disc_dist"
#> [9] "ord_dist" "cat_dist"
#> [11] "mix_dist" "inc_cortical_thickness"
#> [13] "inc_cortical_surface_area" "inc_subcortical_volume"
#> [15] "inc_household_income" "inc_pubertal_status"
#> [17] "nclust" "subject_NDAR_INV0567T2Y9"
#> [19] "subject_NDAR_INV0J4PYA5F" "subject_NDAR_INV10OMKVLE"
#> [21] "subject_NDAR_INV15FPCW4O" "subject_NDAR_INV19NB4RJK"
#> [23] "subject_NDAR_INV1HLGR738" "subject_NDAR_INV1KR0EZFU"
#> [25] "subject_NDAR_INV1L3Y9EOP" "subject_NDAR_INV1TCP5GNM"
#> [27] "subject_NDAR_INV1ZHRDJ6B" "subject_NDAR_INV2EJ41YSZ"
#> [29] "subject_NDAR_INV2PK6C85M" "subject_NDAR_INV2XO1PHCT"
It goes on like this for some time.
Just like that, 20 different cluster solutions have been generated!
In practice, you may end up wanting to create hundreds or thousands
of cluster solutions at a time. If you have access to a multi-core
system, batch_snf
can be run with parallel processing
enabled. See ?batch_snf
or the parallel
processing vignette for more information.
You can pull the clustering results out of each row using the
get_cluster_solutions
function:
cluster_solutions <- get_cluster_solutions(solutions_matrix)
head(cluster_solutions)
#> subjectkey 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
#> 1 subject_NDAR_INV0567T2Y9 5 3 1 1 1 1 1 3 3 2 1 1 1 3 2 1 1 1 1 1
#> 2 subject_NDAR_INV0J4PYA5F 2 3 7 2 6 2 4 4 3 4 4 7 2 3 2 4 2 7 2 3
#> 3 subject_NDAR_INV10OMKVLE 1 2 3 1 4 2 3 1 2 2 1 2 2 2 1 1 3 4 2 2
#> 4 subject_NDAR_INV15FPCW4O 1 2 5 2 5 2 3 1 2 3 1 5 2 2 1 1 2 5 2 3
#> 5 subject_NDAR_INV19NB4RJK 4 2 9 1 7 2 3 2 2 3 8 6 2 2 1 5 3 2 2 2
#> 6 subject_NDAR_INV1HLGR738 4 2 9 1 7 2 3 2 2 1 8 6 2 2 1 5 3 2 2 2
Now that we have access to 20 different clustering solutions, we’ll need to find some way to pick an optimal one to move forward with for additional characterization. In this case, plotting or running stats manually on each of the solutions might be a reasonable way to determine which ones we like the most. But when the number of solutions generated goes up into the hundreds (or thousands), we’re going to need some more automated approaches.
The main approach we recommend using is the meta clustering approach described in Caruana et al., 2016. Meta clustering is a good approach to use when the criterion for what makes one cluster solution better than another is hard to formalize quantitatively (and hence, hard to fully automate).
The idea is to cluster the clustering solutions themselves to arrive at a manageable number of qualitatively similar meta clusters. From there, characterization of representative solutions from each meta cluster can be done to efficiently identify the top choice.
The first step is to calculate the adjusted Rand index (ARI) between each pair of cluster solutions. This metric tells us how similar the solutions are to each other, thereby allowing us to find clusters of cluster solutions.
We can visualize the resulting inter-cluster similarities with a
heatmap. First, call get_matrix_order
to get a hierarchical
clustering-based ordering of the rows in the adjusted rand indices.
meta_cluster_order <- get_matrix_order(solutions_matrix_aris)
# Just a vector of numbers
meta_cluster_order
#> [1] 10 19 6 13 3 18 5 12 7 17 20 4 15 9 2 14 1 8 11 16
This order can be passed into the
adjusted_rand_index_heatmap
function to get a clearer view
of existing meta clusters.
ari_hm <- adjusted_rand_index_heatmap(
solutions_matrix_aris,
order = meta_cluster_order
)
save_heatmap(
heatmap = ari_hm,
path = "./adjusted_rand_index_heatmap.png",
width = 330,
height = 300,
res = 100
)
The clustering solutions are along the rows and columns of the above
figure, and the cells at the intersection between two solutions show how
similar (big ARI) those solutions are to each other. The diagonals
should always be red, representing the maximum value of 1, as they show
the similarity between any clustering solution and itself.
Complete-linkage, Euclidean-distance based hierarchical clustering is
being applied to these solutions to obtain the row ordering. This is
also the default approach used by the ComplexHeatmap
package, the backbone of all heatmap functions in
metasnf
.
This heatmap is integral to understanding what meta clusters exist in your data (given the space of parameters you explored in the settings matrix). We will later see how to further customize this heatmap to add in rich information about how each plotted cluster solution differs on various measures of quality, different clustering settings, and different levels of separation on input and out-of-model measures.
First, we’ll divide the heatmap into meta clusters by visual
inspection. The indices of our meta cluster boundaries can be passed
into the adjusted_rand_index_heatmap
function as the
split_vector
parameter. You can determine this vector
through trial and error or by using the shiny_annotator
function (a wrapper around functionality from the
InteractiveComplexHeatmap
package).
A demonstration of the shiny app can be seen here: Meta Cluster Identification
Note that while the shiny app is running, your R console will be unresponsive.
Clicking on cell boundaries and tracking the row/column indices (printed to the R console as well as displayed in the app) can get us results looking something like this:
split_vec <- c(2, 5, 12, 17)
ari_mc_hm <- adjusted_rand_index_heatmap(
solutions_matrix_aris,
order = meta_cluster_order,
split_vector = split_vec
)
save_heatmap(
heatmap = ari_mc_hm,
path = "./ari_mc_hm.png",
width = 330,
height = 300,
res = 100
)
At this point, we have our meta clusters but are not yet sure of how they differ in terms of their structure across our input and out-of-model features.
We start by running the extend_solutions
function, which
will calculate p-values representing the strength of the association
between cluster membership (treated as a categorical feature) and any
feature present in a provided data list and/or target_list.
extend_solutions
also adds summary p-value measures
(min, mean, and max) of any features present in the target list.
# Only looking at our out-of-model p-values
extended_solutions_matrix <- extend_solutions(
solutions_matrix,
target_list = target_list
)
# What columns have been added?
old_cols <- colnames(extended_solutions_matrix) %in% colnames(solutions_matrix)
new_cols <- !old_cols
head(extended_solutions_matrix[, new_cols])
#> cbcl_anxiety_r_pval cbcl_depress_r_pval min_pval mean_pval max_pval
#> 1 0.7344315 0.7263495 0.7263495 0.7303905 0.7344315
#> 2 0.6644469 0.3238038 0.3238038 0.4941253 0.6644469
#> 3 0.3961268 0.6817113 0.3961268 0.5389191 0.6817113
#> 4 0.4843825 0.7564136 0.4843825 0.6203981 0.7564136
#> 5 0.2256718 0.7895325 0.2256718 0.5076021 0.7895325
#> 6 0.2566391 0.1324778 0.1324778 0.1945585 0.2566391
# Re-running to calculate the p-value for every single input and out-of-model
# feature:
extended_solutions_matrix <- extend_solutions(
solutions_matrix,
data_list = data_list,
target_list = target_list
)
Functionally, the only difference between the data_list
and target_list
arguments are that the
target_list
features will also be a part of those summary
measures.
# Also would work, but without any summary p-values
extended_solutions_matrix <- extend_solutions(
solutions_matrix,
data_list = c(data_list, target_list)
)
# Also would work, but now every feature would be part of the summaries
extended_solutions_matrix <- extend_solutions(
solutions_matrix,
target_list = c(data_list, target_list)
)
The summary p-value measures can be suppressed by setting
calculate_summaries = FALSE
.
This extended_solutions_matrix
we created can be passed
into the adjusted_rand_index_heatmap
function to easily
visualize the level of separation on each of our features for each of
our cluster solutions.
annotated_ari_hm <- adjusted_rand_index_heatmap(
solutions_matrix_aris,
order = meta_cluster_order,
split_vector = split_vec,
data = extended_solutions_matrix,
top_hm = list(
"Depression p-value" = "cbcl_depress_r_pval",
"Anxiety p-value" = "cbcl_anxiety_r_pval",
"Overall outcomes p-value" = "mean_pval"
),
bottom_bar = list(
"Number of Clusters" = "nclust"
),
annotation_colours = list(
"Depression p-value" = colour_scale(
extended_solutions_matrix$"cbcl_depress_r_pval",
min_colour = "purple",
max_colour = "black"
),
"Anxiety p-value" = colour_scale(
extended_solutions_matrix$"cbcl_anxiety_r_pval",
min_colour = "green",
max_colour = "black"
),
"Overall outcomes p-value" = colour_scale(
extended_solutions_matrix$"mean_pval",
min_colour = "lightblue",
max_colour = "black"
)
)
)
save_heatmap(
heatmap = annotated_ari_hm,
path = "./annotated_ari_hm.png",
width = 700,
height = 500,
res = 100
)
The adjusted_rand_index_heatmap
function wraps around a
more generic similarity_matrix_heatmap
function in this
package, which itself is a wrapper around
ComplexHeatmap::Heatmap()
. Consequently, anything that can
be done in the ComplexHeatmap
package can be done here.
This also makes the documentation for the ComplexHeatmap
package one of the best places to learn more about what can be done for
these heatmaps.
The data for the annotations don’t necessarily need to come from functions in metasnf either. If you wanted to, for example, highlight the solutions where two very specific observations happened to cluster together, you could easily add that in as another annotation:
extended_solutions_matrix2 <- extended_solutions_matrix |>
dplyr::mutate(
key_subjects_cluster_together = dplyr::case_when(
subject_NDAR_INVLF3TNDUZ == subject_NDAR_INVLDQH8ATK ~ TRUE,
TRUE ~ FALSE
)
)
annotated_ari_hm2 <- adjusted_rand_index_heatmap(
solutions_matrix_aris,
order = meta_cluster_order,
split_vector = split_vec,
data = extended_solutions_matrix2,
top_hm = list(
"Depression p-value" = "cbcl_depress_r_pval",
"Anxiety p-value" = "cbcl_anxiety_r_pval",
"Key Subjects Clustered Together" = "key_subjects_cluster_together"
),
bottom_bar = list(
"Number of Clusters" = "nclust"
),
annotation_colours = list(
"Depression p-value" = colour_scale(
extended_solutions_matrix$"cbcl_depress_r_pval",
min_colour = "purple",
max_colour = "black"
),
"Anxiety p-value" = colour_scale(
extended_solutions_matrix$"cbcl_anxiety_r_pval",
min_colour = "purple",
max_colour = "black"
),
"Key Subjects Clustered Together" = c(
"TRUE" = "blue",
"FALSE" = "black"
)
)
)
save_heatmap(
heatmap = annotated_ari_hm2,
path = "./annotated_ari_hm2.png",
width = 700,
height = 500,
res = 100
)
Now that we’ve visually delineated our meta clusters, we can get a quick summary of what sort of separation exists across our input and held out features by taking a closer look at one representative cluster solution from each meta cluster.
This can be achieved by the get_representative_solutions
function, which extracts one cluster solution per meta cluster based on
having the highest average ARI with the other solutions in that meta
cluster.
rep_solutions <- get_representative_solutions(
solutions_matrix_aris,
split_vector = split_vec,
order = meta_cluster_order,
extended_solutions_matrix
)
mc_manhattan <- mc_manhattan_plot(
rep_solutions,
data_list = data_list,
target_list = target_list,
hide_x_labels = TRUE,
point_size = 2,
text_size = 12,
threshold = 0.05,
neg_log_pval_thresh = 5
)
ggplot2::ggsave(
"mc_manhattan.png",
mc_manhattan,
height = 4,
width = 8,
dpi = 100
)
neg_log_pval_thresh
sets a threshold for the negative
log of the p-values to be displayed. At a value of 5, any p-value that
is smaller than 1e-5 will be truncated to 1e-5.
The plot as it is is a bit unwieldy plot given how many neuroimaging ROIs are present. Let’s take out the cortical thickness and surface area measures to make the plot a little clearer.
We’ll also be able to read some feature measures more clearly when we dial the number of features plotted back a bit, as well.
rep_solutions_no_cort <- dplyr::select(
rep_solutions,
-dplyr::contains("mrisdp")
)
mc_manhattan2 <- mc_manhattan_plot(
rep_solutions_no_cort,
data_list = data_list,
target_list = target_list,
point_size = 4,
threshold = 0.01,
text_size = 12,
domain_colours = c(
"neuroimaging" = "cadetblue",
"demographics" = "purple",
"behaviour" = "firebrick"
)
)
mc_manhattan2
ggplot2::ggsave(
"mc_manhattan2.png",
mc_manhattan2,
height = 8,
width = 12,
dpi = 100
)
Note that the Manhattan plot automatically uses a vertical line to
separate features in the data_list argument from those in the
target_list. Vertical boundaries can be controlled with the
xints
parameter.
If you see something interesting in your heatmap, you may be curious to know how that corresponds to the settings that were in your settings matrix.
You could certainly stack on each setting to the ARI heatmap as
annotations, but that may be a bit cumbersome given how many settings
there are. Another option is to start by taking a first look at entire
settings_matrix
, sorted by the meta cluster results,
through the settings_matrix_heatmap
function.
sm_hm <- settings_matrix_heatmap(
settings_matrix,
order = meta_cluster_order
)
save_heatmap(
heatmap = sm_hm,
path = "./settings_matrix_heatmap_ordered.png",
width = 400,
height = 500,
res = 75
)
This heatmap rescales all the columns in the settings_matrix to have
a maximum value of 1. The purpose of re-ordering the settings matrix in
this way is to see if any associations exist between certain settings
values and pairwise cluster solution similarities. If there are any
particular important settings, you can simply add them into your
adjusted rand index heatmap annotations. Recall that the
solutions_matrix
(and, by extension, the
extended_solutions_matrix
) is an extension of the settings
matrix, so no further data manipulation is needed to add a setting as a
heatmap annotation.
Give it a try with the code below:
annotated_ari_hm3 <- adjusted_rand_index_heatmap(
solutions_matrix_aris,
order = meta_cluster_order,
split_vector = c(11, 14),
data = extended_solutions_matrix,
top_hm = list(
"Depression p-value" = "cbcl_depress_r_pval",
"Anxiety p-value" = "cbcl_anxiety_r_pval",
"Key Subjects Clustered Together" = "key_subjects_cluster_together"
),
left_hm = list(
"Clustering Algorithm" = "clust_alg" # from the original settings
),
bottom_bar = list(
"Number of Clusters" = "nclust" # also from the original settings
),
annotation_colours = list(
"Depression p-value" = colour_scale(
extended_solutions_matrix$"cbcl_depress_r_pval",
min_colour = "purple",
max_colour = "black"
),
"Anxiety p-value" = colour_scale(
extended_solutions_matrix$"cbcl_anxiety_r_pval",
min_colour = "purple",
max_colour = "black"
),
"Key Subjects Clustered Together" = c(
"TRUE" = "blue",
"FALSE" = "black"
)
)
)
Quality metrics are another useful heuristic for the goodness of a
cluster that don’t require any contextualization of results in the
domain they may be used in. metasnf enables measures of silhouette
scores, Dunn indices, and Davies-Bouldin indices. To calculate these
values, we’ll need not only the cluster results but also the final fused
network (the similarity matrices produced by SNF) that the clusters came
from. These similarity matrices can be collected from the
batch_snf
using the return_similarity_matrices
parameter:
batch_snf_results <- batch_snf(
data_list,
settings_matrix,
return_similarity_matrices = TRUE
)
solutions_matrix <- batch_snf_results$"solutions_matrix"
similarity_matrices <- batch_snf_results$"similarity_matrices"
This time, the output of batch_snf
is a list. The first
element of the list is a single solutions_matrix, like what we usually
get. The second element is yet another list containing one
final fused network (AKA similarity matrix / similarity matrix) per SNF
run. Using those two lists, we can calculate the above mentioned quality
metrics:
silhouette_scores <- calculate_silhouettes(
solutions_matrix,
similarity_matrices
)
dunn_indices <- calculate_dunn_indices(
solutions_matrix,
similarity_matrices
)
db_indices <- calculate_db_indices(
solutions_matrix,
similarity_matrices
)
The first function is a wrapper around
cluster::silhouette
while the second and third come from
the clv package. clv isn’t set as a mandatory part of
the installation, so you’ll ned to install it yourself to calculate
these two metrics.
The original documentation on these functions can be helpful for interpreting and working with them:
metasnf offers tools to evaluate two different measures of stability:
You can learn more about running stability calculations in the stability and coclustering vignette.
If you can specify a metric or objective function that may tell you how useful a clustering solution will be for your purposes in advance, that makes the cluster selection process much less arbitrary.
There are many ways to go about doing this, but this package offers
one way through a target_list
. The target_list
contains dataframes what we can examine our clustering results over
through linear regression (continuous data), ordinal regression (ordinal
data), or the Chi-squared test (categorical data).
Just like when generating the initial data_list
, we need
to specify the name of the column in the provided dataframes that is
originally being used to uniquely identify the different observations
from each other with the uid
parameter.
We will next extend our solutions_matrix
with p-values
from regressing the target_list
features onto our generated
clusters.
extended_solutions_matrix <- extend_solutions(solutions_matrix, target_list)
colnames(extended_solutions_matrix)[1:25]
#> [1] "row_id" "alpha"
#> [3] "k" "t"
#> [5] "snf_scheme" "clust_alg"
#> [7] "cont_dist" "disc_dist"
#> [9] "ord_dist" "cat_dist"
#> [11] "mix_dist" "inc_cortical_thickness"
#> [13] "inc_cortical_surface_area" "inc_subcortical_volume"
#> [15] "inc_household_income" "inc_pubertal_status"
#> [17] "nclust" "subject_NDAR_INV0567T2Y9"
#> [19] "subject_NDAR_INV0J4PYA5F" "subject_NDAR_INV10OMKVLE"
#> [21] "subject_NDAR_INV15FPCW4O" "subject_NDAR_INV19NB4RJK"
#> [23] "subject_NDAR_INV1HLGR738" "subject_NDAR_INV1KR0EZFU"
#> [25] "subject_NDAR_INV1L3Y9EOP"
# Looking at the newly added columns
head(no_subs(extended_solutions_matrix))
#> row_id alpha k t snf_scheme clust_alg cont_dist disc_dist ord_dist cat_dist
#> 1 1 0.5 29 20 2 1 1 1 1 1
#> 2 2 0.4 26 20 1 1 1 1 1 1
#> 3 3 0.3 44 20 2 2 1 1 1 1
#> 4 4 0.3 43 20 1 1 1 1 1 1
#> 5 5 0.5 29 20 2 2 1 1 1 1
#> 6 6 0.4 26 20 2 1 1 1 1 1
#> mix_dist inc_cortical_thickness inc_cortical_surface_area
#> 1 1 1 0
#> 2 1 1 1
#> 3 1 1 0
#> 4 1 1 1
#> 5 1 1 1
#> 6 1 1 1
#> inc_subcortical_volume inc_household_income inc_pubertal_status nclust
#> 1 1 0 1 5
#> 2 1 1 1 3
#> 3 0 1 1 9
#> 4 0 1 1 2
#> 5 1 1 1 8
#> 6 1 1 1 2
#> cbcl_anxiety_r_pval cbcl_depress_r_pval min_pval mean_pval max_pval
#> 1 0.7344315 0.7263495 0.7263495 0.7303905 0.7344315
#> 2 0.6644469 0.3238038 0.3238038 0.4941253 0.6644469
#> 3 0.3961268 0.6817113 0.3961268 0.5389191 0.6817113
#> 4 0.4843825 0.7564136 0.4843825 0.6203981 0.7564136
#> 5 0.2256718 0.7895325 0.2256718 0.5076021 0.7895325
#> 6 0.2566391 0.1324778 0.1324778 0.1945585 0.2566391
If you just want the p-values:
target_pvals <- get_pvals(extended_solutions_matrix)
head(target_pvals)
#> row_id cbcl_anxiety_r_pval cbcl_depress_r_pval min_pval mean_pval max_pval
#> 1 1 0.7344315 0.7263495 0.7263495 0.7303905 0.7344315
#> 2 2 0.6644469 0.3238038 0.3238038 0.4941253 0.6644469
#> 3 3 0.3961268 0.6817113 0.3961268 0.5389191 0.6817113
#> 4 4 0.4843825 0.7564136 0.4843825 0.6203981 0.7564136
#> 5 5 0.2256718 0.7895325 0.2256718 0.5076021 0.7895325
#> 6 6 0.2566391 0.1324778 0.1324778 0.1945585 0.2566391
There is a heatmap for visualizing this too:
pval_hm <- pval_heatmap(target_pvals, order = meta_cluster_order)
save_heatmap(
heatmap = pval_hm,
path = "./pval_heatmap_ordered.png",
width = 400,
height = 500,
res = 100
)
These p-values hold no real meaning for the traditional hypothesis-testing context, but they are reasonable proxies of the magnitude of the effect size / separation of the clusters across the features in question. Here, they are just a tool to find clustering solutions that are well-separated according to the outcome measures you’ve specified. Finding a cluster solution like this is similar to a supervised learning approach, but where the optimization method is just random sampling. The risk for overfitting your data with this approach is considerable, so make sure you have some rigorous external validation before reporting your findings.
We recommend using label propagation (provided by the
SNFtool package in the groupPredict
function) for
validation: take the top clustering solutions found in some training
data, assign predicted clusters to some held out test subjects, and then
characterize those test subjects to see how well the clustering solution
seemed to have worked.
Here’s a quick step through of the complete procedure, from the beginning, with label propagation to validate our findings.
The metasnf
package comes equipped with a function to do
the training/testing split for you :)
# All the subjects present in all dataframes with no NAs
all_subjects <- data_list[[1]]$"data"$"subjectkey"
# Remove the "subject_" prefix to allow merges with the original data
all_subjects <- gsub("subject_", "", all_subjects)
# Dataframe assigning 80% of subjects to train and 20% to test
assigned_splits <- train_test_assign(train_frac = 0.8, subjects = all_subjects)
# Pulling the training and testing subjects specifically
train_subs <- assigned_splits$"train"
test_subs <- assigned_splits$"test"
# Partition a training set
train_cort_t <- cort_t[cort_t$"unique_id" %in% train_subs, ]
train_cort_sa <- cort_sa[cort_sa$"unique_id" %in% train_subs, ]
train_subc_v <- subc_v[subc_v$"unique_id" %in% train_subs, ]
train_income <- income[income$"unique_id" %in% train_subs, ]
train_pubertal <- pubertal[pubertal$"unique_id" %in% train_subs, ]
train_anxiety <- anxiety[anxiety$"unique_id" %in% train_subs, ]
train_depress <- depress[depress$"unique_id" %in% train_subs, ]
# Partition a test set
test_cort_t <- cort_t[cort_t$"unique_id" %in% test_subs, ]
test_cort_sa <- cort_sa[cort_sa$"unique_id" %in% test_subs, ]
test_subc_v <- subc_v[subc_v$"unique_id" %in% test_subs, ]
test_income <- income[income$"unique_id" %in% test_subs, ]
test_pubertal <- pubertal[pubertal$"unique_id" %in% test_subs, ]
test_anxiety <- anxiety[anxiety$"unique_id" %in% test_subs, ]
test_depress <- depress[depress$"unique_id" %in% test_subs, ]
# A data list with just training subjects
train_data_list <- generate_data_list(
list(train_cort_t, "cortical_thickness", "neuroimaging", "continuous"),
list(train_cort_sa, "cortical_sa", "neuroimaging", "continuous"),
list(train_subc_v, "subcortical_volume", "neuroimaging", "continuous"),
list(train_income, "household_income", "demographics", "continuous"),
list(train_pubertal, "pubertal_status", "demographics", "continuous"),
uid = "unique_id"
)
# A data list with training and testing subjects
full_data_list <- generate_data_list(
list(cort_t, "cortical_thickness", "neuroimaging", "continuous"),
list(cort_sa, "cortical_surface_area", "neuroimaging", "continuous"),
list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
list(income, "household_income", "demographics", "continuous"),
list(pubertal, "pubertal_status", "demographics", "continuous"),
uid = "unique_id"
)
# Construct the target lists
train_target_list <- generate_data_list(
list(train_anxiety, "anxiety", "behaviour", "ordinal"),
list(train_depress, "depressed", "behaviour", "ordinal"),
uid = "unique_id"
)
# Find a clustering solution in your training data
set.seed(42)
settings_matrix <- generate_settings_matrix(
train_data_list,
nrow = 5,
min_k = 10,
max_k = 30
)
train_solutions_matrix <- batch_snf(
train_data_list,
settings_matrix
)
extended_solutions_matrix <- extend_solutions(
train_solutions_matrix,
train_target_list
)
extended_solutions_matrix |> colnames()
#> [1] "row_id" "alpha"
#> [3] "k" "t"
#> [5] "snf_scheme" "clust_alg"
#> [7] "cont_dist" "disc_dist"
#> [9] "ord_dist" "cat_dist"
#> [11] "mix_dist" "inc_cortical_thickness"
#> [13] "inc_cortical_sa" "inc_subcortical_volume"
#> [15] "inc_household_income" "inc_pubertal_status"
#> [17] "nclust" "subject_NDAR_INV0567T2Y9"
#> [19] "subject_NDAR_INV0J4PYA5F" "subject_NDAR_INV10OMKVLE"
#> [21] "subject_NDAR_INV15FPCW4O" "subject_NDAR_INV19NB4RJK"
#> [23] "subject_NDAR_INV1HLGR738" "subject_NDAR_INV1KR0EZFU"
#> [25] "subject_NDAR_INV1L3Y9EOP" "subject_NDAR_INV1ZHRDJ6B"
#> [27] "subject_NDAR_INV2PK6C85M" "subject_NDAR_INV2XO1PHCT"
#> [29] "subject_NDAR_INV3CU5Y9BZ" "subject_NDAR_INV3MBSY16V"
#> [31] "subject_NDAR_INV3N0QFDLO" "subject_NDAR_INV3Y027GVK"
#> [33] "subject_NDAR_INV40Z7GVYJ" "subject_NDAR_INV49UPOXHJ"
#> [35] "subject_NDAR_INV4N5XGZE8" "subject_NDAR_INV4OWRB536"
#> [37] "subject_NDAR_INV4X80QUZY" "subject_NDAR_INV50JL2RXP"
#> [39] "subject_NDAR_INV5BRNFYQC" "subject_NDAR_INV6RVH5KZS"
#> [41] "subject_NDAR_INV6WBQCY2I" "subject_NDAR_INV752EFAQ0"
#> [43] "subject_NDAR_INV7QO93CJH" "subject_NDAR_INV84G9ONXP"
#> [45] "subject_NDAR_INV8EHP6W1U" "subject_NDAR_INV8MJFUKIW"
#> [47] "subject_NDAR_INV8WGK6ECZ" "subject_NDAR_INV94AKNGMJ"
#> [49] "subject_NDAR_INV9GAZYV8Q" "subject_NDAR_INV9IREH05N"
#> [51] "subject_NDAR_INV9KC3GVMU" "subject_NDAR_INV9NFKZ82A"
#> [53] "subject_NDAR_INV9S1BMDE5" "subject_NDAR_INVA68OU0YK"
#> [55] "subject_NDAR_INVADCYZ38B" "subject_NDAR_INVAYM8WTIN"
#> [57] "subject_NDAR_INVB8O4LAQV" "subject_NDAR_INVBAP80W1R"
#> [59] "subject_NDAR_INVBTRW1NUK" "subject_NDAR_INVCIXE0496"
#> [61] "subject_NDAR_INVCYBSZD0N" "subject_NDAR_INVD37Z9N61"
#> [63] "subject_NDAR_INVD61ZUBC7" "subject_NDAR_INVEQ1OBNSM"
#> [65] "subject_NDAR_INVEVBDLSTM" "subject_NDAR_INVEY0FMJDI"
#> [67] "subject_NDAR_INVFLU0YINE" "subject_NDAR_INVFNZPWMSI"
#> [69] "subject_NDAR_INVFY76P8AJ" "subject_NDAR_INVG3T0PXW6"
#> [71] "subject_NDAR_INVG8BRLSO9" "subject_NDAR_INVH1KV76BQ"
#> [73] "subject_NDAR_INVH3P4T8C2" "subject_NDAR_INVH4FZC2XB"
#> [75] "subject_NDAR_INVH8QN7WLT" "subject_NDAR_INVHERPS382"
#> [77] "subject_NDAR_INVHM3XS68O" "subject_NDAR_INVI1RKT9MX"
#> [79] "subject_NDAR_INVIZFV08RU" "subject_NDAR_INVJ574KX6A"
#> [81] "subject_NDAR_INVK3FL5CP2" "subject_NDAR_INVKB0CYO1H"
#> [83] "subject_NDAR_INVKHWS26UN" "subject_NDAR_INVKTUMPLXY"
#> [85] "subject_NDAR_INVL4NIUZYF" "subject_NDAR_INVLF3TNDUZ"
#> [87] "subject_NDAR_INVLI58ERQC" "subject_NDAR_INVLIQRM8KC"
#> [89] "subject_NDAR_INVLXDP1SWT" "subject_NDAR_INVMBOZVEA4"
#> [91] "subject_NDAR_INVMIWOSHJN" "cbcl_anxiety_r_pval"
#> [93] "cbcl_depress_r_pval" "min_pval"
#> [95] "mean_pval" "max_pval"
# The fifth row had the lowest minimum p-value across our outcomes
lowest_min_pval <- min(extended_solutions_matrix$"min_pval")
which(extended_solutions_matrix$"min_pval" == lowest_min_pval)
#> [1] 1
# Keep track of your top solution
top_row <- extended_solutions_matrix[4, ]
# Use the solutions matrix from the training subjects and the data list from
# the training and testing subjects to propagate labels to the test subjects
propagated_labels <- lp_solutions_matrix(top_row, full_data_list)
head(propagated_labels)
#> subjectkey group 4
#> 1 subject_NDAR_INV0567T2Y9 train 1
#> 2 subject_NDAR_INV0J4PYA5F train 1
#> 3 subject_NDAR_INV10OMKVLE train 2
#> 4 subject_NDAR_INV15FPCW4O train 1
#> 5 subject_NDAR_INV19NB4RJK train 2
#> 6 subject_NDAR_INV1HLGR738 train 1
tail(propagated_labels)
#> subjectkey group 4
#> 82 subject_NDAR_INVG5CI7XK4 test 1
#> 83 subject_NDAR_INVGDBYXWV4 test 1
#> 84 subject_NDAR_INVHEUWA52I test 2
#> 85 subject_NDAR_INVK9ULDQA2 test 1
#> 86 subject_NDAR_INVKYH529RD test 1
#> 87 subject_NDAR_INVLDQH8ATK test 2
You could, if you wanted, see how all of your clustering solutions propagate to the test set, but that would mean reusing your test set and removing the protection against overfitting conferred by this procedure.
propagated_labels_all <- lp_solutions_matrix(
extended_solutions_matrix,
full_data_list
)
head(propagated_labels_all)
#> subjectkey group 1 2 3 4 5
#> 1 subject_NDAR_INV0567T2Y9 train 1 1 5 1 10
#> 2 subject_NDAR_INV0J4PYA5F train 2 1 5 1 3
#> 3 subject_NDAR_INV10OMKVLE train 1 1 3 2 5
#> 4 subject_NDAR_INV15FPCW4O train 1 1 4 1 4
#> 5 subject_NDAR_INV19NB4RJK train 1 1 8 2 9
#> 6 subject_NDAR_INV1HLGR738 train 1 2 8 1 9
tail(propagated_labels_all)
#> subjectkey group 1 2 3 4 5
#> 82 subject_NDAR_INVG5CI7XK4 test 1 1 2 1 2
#> 83 subject_NDAR_INVGDBYXWV4 test 1 1 4 1 4
#> 84 subject_NDAR_INVHEUWA52I test 2 1 1 2 1
#> 85 subject_NDAR_INVK9ULDQA2 test 1 1 1 1 1
#> 86 subject_NDAR_INVKYH529RD test 1 1 7 1 7
#> 87 subject_NDAR_INVLDQH8ATK test 1 1 6 2 8
That’s all!
If you have any questions, comments, suggestions, bugs, etc. feel free to post an issue at https://github.com/BRANCHlab/metasnf.
Caruana, Rich, Mohamed Elhawary, Nam Nguyen, and Casey Smith. 2006. “Meta Clustering.” In Sixth International Conference on Data Mining (ICDM’06), 107–18. https://doi.org/10.1109/ICDM.2006.103.
Wang, Bo, Aziz M. Mezlini, Feyyaz Demir, Marc Fiume, Zhuowen Tu, Michael Brudno, Benjamin Haibe-Kains, and Anna Goldenberg. 2014. “Similarity Network Fusion for Aggregating Data Types on a Genomic Scale.” Nature Methods 11 (3): 333–37. https://doi.org/10.1038/nmeth.2810.