Suppose we have collected measurements about bacterial abundances from a number of samples, and those samples fall into one of several groups. We want to know if there is a statistically significant difference between the groups, that is, whether it looks like the microbiome samples from the different groups look like they could all have come from all come from the same distribution.
One good non-parametric family of tests for this problem is based on the Friedman-Rafsky1 test. The idea is to compute distances between the samples, create a graph based on those distances, and use the number of edges between samples of the same type (the number of “pure edges”) as a test statistic. We can then compute a \(p\)-value by comparing the observed test statistic to the distribution of the test statistic under the permutation distribution.
From the description above, we see that we have some choices to make.
We need to define a distance between the samples and choose a method for
creating a graph from those distances. These choices are responsible for
most of the arguments to graph_perm_test
, the primary
function in this package.
The distance
argument in graph_perm_test
allows you to specify a distance. This can be any distance implemented
in phyloseq
, and it should be taken from the following
list:
library(phyloseq)
unlist(distanceMethodList)
## UniFrac1 UniFrac2 DPCoA JSD vegdist1 vegdist2
## "unifrac" "wunifrac" "dpcoa" "jsd" "manhattan" "euclidean"
## vegdist3 vegdist4 vegdist5 vegdist6 vegdist7 vegdist8
## "canberra" "bray" "kulczynski" "jaccard" "gower" "altGower"
## vegdist9 vegdist10 vegdist11 vegdist12 vegdist13 vegdist14
## "morisita" "horn" "mountford" "raup" "binomial" "chao"
## vegdist15 betadiver1 betadiver2 betadiver3 betadiver4 betadiver5
## "cao" "w" "-1" "c" "wb" "r"
## betadiver6 betadiver7 betadiver8 betadiver9 betadiver10 betadiver11
## "I" "e" "t" "me" "j" "sor"
## betadiver12 betadiver13 betadiver14 betadiver15 betadiver16 betadiver17
## "m" "-2" "co" "cc" "g" "-3"
## betadiver18 betadiver19 betadiver20 betadiver21 betadiver22 betadiver23
## "l" "19" "hk" "rlb" "sim" "gl"
## betadiver24 dist1 dist2 dist3 designdist
## "z" "maximum" "binary" "minkowski" "ANY"
You can see the help page on distances for more information. The distance should be chosen carefully and should reflect the type of differences between samples you are interested in.
graph_perm_test
allows you to specify one of four
options for a type of graph: a minimum spanning tree, a \(k\)-nearest neighbors graph, and two types
of thresholded graphs. These are passed to the type
argument.
type = "mst"
creates a minimum
spanning tree. The minimum spanning tree places edges between the
samples so that all of the samples are connected and the sum of the
distances between samples connected by an edge is minimized.type = "knn"
creates a \(k\)-nearest neighbors graph. For each
sample, we place an edge between it and its \(k\) nearest neighbors. This of course
requires you to specify \(k\) with the
argument knn
. A small number, on the order of 1 to 3 is
likely a good choice.type = "threshold.distance"
creates a distance
threshold graph, and requires you to specify max.dist
. The
graph will be created by placing an edge between any pair of points
where the distance between them is less than max.dist
.type = "threshold.nedges"
creates a distance threshold
graph, and requires you to specify nedges
. The graph will
be created by computing distances between every pair of samples, and
placing an edge between the nedges
pairs of samples with
the smallest distances between them.Note that the knn
argument is only used with
type = "knn"
, the max.dist
argument is only
used if type = "threshold.distance"
, and the
nedges
argument is only used if
type = "threshold.nedges"
. type = "mst"
requires no additional arguments.
In some simulations we saw that the minimum spanning tree and k-nearest neighbors had the most power. The minimum spanning tree is the simplest choice since it doesn’t require specifying any further parameters, but if you have reason to believe that other types of graphs would be more appropriate in your application they are also available. The \(k\)-nearest neighbors graph might be desirable because it gives an interpretable test statistic: the number of nearest neighbors that are of the same type.
Suppose that we have collected the data in the
enterotype
dataset, which is available in the phyloseq
package as a phyloseq
object. We can load the data and look
at it with the following commands:
library(ggplot2)
# not necessary, but I like the white background with ggplot
theme_set(theme_bw())
library(phyloseqGraphTest)
data(enterotype)
enterotype
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 553 taxa and 280 samples ]
## sample_data() Sample Data: [ 280 samples by 9 sample variables ]
## tax_table() Taxonomy Table: [ 553 taxa by 1 taxonomic ranks ]
Suppose we want to test for differences between sequencing platforms
(the SeqTech
column in the sample data). We have also
decided we want to use the Jaccard dissimilarity and a \(k\)-nearest neighbors graph with \(k\) = 1 to perform our test. Then we would
use the following commands to run the test and view the output:
gt = graph_perm_test(enterotype,
sampletype = "SeqTech",
distance = "jaccard",
type = "knn",
knn = 1)
gt
## Output from graph_perm_test
## ---------------------------
## Observed test statistic: 197 pure edges
## 221 total edges in the graph
## Permutation p-value: 0.002
We see that the difference between sequencenig platforms is statistically significant, with a \(p\)-value of .002. The effect is also quite substantial: we see from the observed test statistic that out of the 221 total edges in the 1-nearest neighbors graph, 197 of them connect samples of the same type.
The output from graph_perm_test
is a
psgraphtest
object, which is a list containing information
about the test. The elements of the list are:
observed
: The observed test statistic, the number of
pure edges.
perm
: A vector containing the value of the test
statistic (the number of pure edges) in each of the permuted
datasets.
pval
: The p-value for the permutation test. This is
the fraction of times the number of pure edges in the permuted dataset
exceeded the number of pure edges in the observed dataset.
net
: The graph used for testing.
sampletype
: A vector containing the group label for
each sample.
type
: The type of graph used.
These can be inspected by hand, but the package also contains some functions for plotting the results.
The function plot_test_network
plots the graph we
created on the samples, the sample identities, and the edge types (pure
or mixed, i.e. edges between samples of the same type or edges between
samples of different types). Here we see that the nearest neighbor graph
connects largely samples of the same type.
plot_test_network(gt)
The function plot_permutations
will plot a histogram of
the number of pure edges in each of the permuted datasets along with the
number of pure edges in the observed dataset. For this dataset, we see
that the number of pure edges in the observed dataset is well outside of
the permutation distribution.
plot_permutations(gt)
There are a couple of other arguments to the
graph_perm_test
function. nperm
is the number
of permutations to use for the test. The default is 499, and it can be
increased or decreased depending on how much computational time you have
and how closely you want to approximate the full permutation
distribution.
You can also specify a stratifying variable using the
grouping
argument. This is necessary in repeated measures
designs. Suppose for instance that we have mice in two different
litters, and we would like to test for equality of the distributions
from the two litters. If we have more than one sample taken from each of
the mice, permuting the litter label over all the samples independently
will not give a valid test because of the dependence between samples
taken from the same mouse. We can fix this by considering the mice the
independent units and permuting the litter label over mouse instead of
over sample to obtain a valid test.
Friedman, J.H. and Rafsky, L.C. “Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests.” The Annals of Statistics (1979):697-717.↩︎