Compact letter displays (CLDs) are a popular way to display multiple comparisons, especially when there are more than a few means to compare. They are problematic, however, because they are prone to misinterpretation (more details later). Here we present some background on CLDs, and show some adaptations and alternatives that may be less prone to misinterpretation.
CLDs generalize an “underlining” technique shown in some old experimental design and analysis textbooks, where results may be displayed something like this:
trt1   ctrl   trt3   trt2   trt4
-----------
       ------------------
The observed means are sorted in increasing order, so in this illustration, trt1 has the lowest mean, ctrl has the next lowest, and trt4 has the highest. The underlines group the means such that the extremes of each group are not significantly different according to a statistical test conducted at a specified alpha level. So in this illustration, trt1 is significantly less than trt3, trt2, and trt4, but not ctrl; and in fact trt4 is significantly greater than all the others.
This grouping also illustrates the dangers created by careless interpretations. Some observers of this chart might say that “trt1 and ctrl are equal” and that “ctrl, trt3, and trt2 are equal”, when in fact we have merely failed to show they are different. Further confusion results because mathematical equality is transitive: these two statements of equality would imply that trt1 and trt2 must be equal, seemingly contradicting the finding that they are significantly different. Statistical nonsignificance does not have the transitivity property!
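A small numerical sketch may make the non-transitivity concrete. The three means, the common standard error of a difference, and the normal-approximation critical value below are all hypothetical choices for illustration; they are not taken from the underlining chart above.

```r
# Hypothetical means and a common SE for every pairwise difference (assumed values)
means <- c(trt1 = 10, ctrl = 11, trt2 = 12)
se_diff <- 0.6

# Two-sided 5% critical value, using a normal approximation for simplicity
crit <- qnorm(0.975)

# z ratio for every pair, then flag the pairs that differ significantly
z <- abs(outer(means, means, "-")) / se_diff
is_diff <- z > crit
print(is_diff)
```

With these numbers, neither trt1 vs. ctrl nor ctrl vs. trt2 is flagged, yet trt1 vs. trt2 is. Reading the two non-findings as equalities would contradict the third result.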
The underlining method becomes problematic in any case where the standard errors (SEs) of the comparisons are unequal – for example if we have unequal sample sizes, or a model with non-homogeneous variances. When the SEs are unequal, it is possible, for example, for two adjacent means to be significantly different, while two more distant ones do not differ significantly. If that happens, we can’t use underlines to group the means. The problem here is that lines are continuous, and that continuousness forces a continuum of groupings.
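To see how unequal SEs can break the underlining scheme, consider the following sketch. The means, sample sizes, and common standard deviation are hypothetical, and a normal approximation stands in for whatever test would actually be used.

```r
# Hypothetical sorted means, with one treatment having very few observations
means <- c(a = 10, b = 11, c = 11.5)
n     <- c(a = 50, b = 50, c = 2)
sigma <- 2                      # assumed common standard deviation

# SE of each pairwise difference: sigma * sqrt(1/n_i + 1/n_j)
se_diff <- sigma * sqrt(outer(1 / n, 1 / n, "+"))

# z ratios and significance flags at the two-sided 5% level
z <- abs(outer(means, means, "-")) / se_diff
is_diff <- z > qnorm(0.975)
print(round(se_diff, 3))
print(is_diff)
```

Here the adjacent means a and b differ significantly, while the more distant pair a and c does not, because every comparison involving c has a large SE. No set of contiguous underlines can represent that pattern.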
However, Piepho (2004) solved this problem by using symbols instead of lines, and creating a display where any two means associated with the same symbol are deemed to not be statistically different. Using symbols, it is possible to have non-contiguous groupings, e.g., it is possible for two means to share a symbol while an intervening one does not share the same symbol. Such a display is called a compact letter display. We do not absolutely require actual letters, just symbols that can be distinguished from one another. In the case where all the differences have equal SEs, the CLD will be the “same” as the result of grouping lines, in that each distinct symbol will span a contiguous range of means that can be interpreted as a grouping line.
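The defining property of such a display can be checked mechanically. The sketch below is not Piepho's algorithm; it only verifies, for a hypothetical three-mean display, that two means share a symbol exactly when they are not flagged as different. The helper share_symbol() and the example data are our own illustrative names.

```r
# TRUE if two grouping strings have at least one symbol in common
share_symbol <- function(g1, g2) {
  length(intersect(strsplit(g1, "")[[1]], strsplit(g2, "")[[1]])) > 0
}

# Hypothetical display: a and b differ; c is not shown to differ from either
groups <- c(a = "1", b = "2", c = "12")

# Matching significance flags (TRUE = significantly different)
is_diff <- matrix(FALSE, 3, 3, dimnames = list(names(groups), names(groups)))
is_diff["a", "b"] <- is_diff["b", "a"] <- TRUE

# Differences implied by the display: pairs that share no symbol
implied_diff <- !outer(groups, groups, Vectorize(share_symbol))
display_ok <- all(implied_diff == is_diff)
print(display_ok)
```

Note that symbol 1 is shared by a and c while the intervening mean b does not carry it, which is the kind of non-contiguous grouping that lines cannot show.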
The R package multcompView (Graves et al., 2019) provides an implementation of the Piepho algorithm. The multcomp package (Hothorn et al., 2008) provides a generic cld() function, and the emmeans package provides a cld() method for emmGrid objects.
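As a hedged illustration of the kind of input such software can work from, the sketch below passes a named logical vector to multcompLetters() in multcompView. The treatment names and flags are made up, and the code is guarded so that it does nothing if the package is not installed.

```r
# Hypothetical flags for three means: TRUE marks a significantly different pair.
# Names follow the "level1-level2" pattern that multcompLetters() splits on.
dif <- c("a-b" = FALSE, "a-c" = FALSE, "b-c" = TRUE)

if (requireNamespace("multcompView", quietly = TRUE)) {
  res <- multcompView::multcompLetters(dif)
  print(res$Letters)   # one string of letters per mean
}
```

Since only b and c are flagged as different, we would expect a to share a letter with each of them while b and c share none.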
As a running example, we simulate some data from an unbalanced design with 7 treatments labeled A, B, …, G, and fit a model to those data:
set.seed(22.10)
mu = c(16, 15, 19, 15, 15, 17, 16) # true means
n = c(19, 15, 16, 18, 29, 2, 14) # sample sizes
foo = data.frame(trt = factor(rep(LETTERS[1:7], n)))
foo$y = rnorm(sum(n), mean = mu[as.numeric(foo$trt)], sd = 1.0)
foo.lm = lm(y ~ trt, data = foo)
There are only four distinct true means underlying these seven treatments: Treatments B, D, and E have mean 15, treatments A and G have mean 16, and treatments F and C are solo players with means 17 and 19 respectively.
Let’s see a compact letter display for the marginal means. (Call this CLD #1)
library(emmeans)
library(multcomp) # provides the cld() generic
foo.emm = emmeans(foo.lm, "trt")
cld(foo.emm)
##  trt emmean    SE  df lower.CL upper.CL .group
##  B     14.6 0.246 106     14.1     15.1  1
##  E     15.0 0.177 106     14.6     15.3  1
##  D     15.3 0.224 106     14.8     15.7  1
##  G     15.3 0.254 106     14.8     15.9  1
##  A     16.4 0.218 106     15.9     16.8   2
##  F     16.6 0.673 106     15.2     17.9  12
##  C     19.3 0.238 106     18.9     19.8    3
##
## Confidence level used: 0.95
## P value adjustment: tukey method for comparing a family of 7 estimates
## significance level used: alpha = 0.05
## NOTE: If two or more means share the same grouping symbol,
## then we cannot show them to be different.
## But we also did not show them to be the same.
The default “letters” for the emmeans implementation are actually numbers, and we have three groupings indicated by the symbols 1, 2, and 3. This illustrates a case where grouping lines would not have worked, as we see in the fact that group 1 is not contiguous. We have (among other results) that treatment A differs significantly from treatments B, E, D, G, and C (at the default 0.05 significance level, with Tukey adjustment for multiple testing), and that C is significantly greater than all the other means since it is the only mean in group 3.
An annotation warns that two means in the same group are not necessarily the same; yet CLDs present a strong visual message that they are. The careless reader who makes this mistake will have trouble with the gap in group 1, asking how A can differ from G and yet G and F are “the same.” The explanation is that the SE of F is huge, owing to its very small sample size, so it is hard for it to be statistically different from other means. It is almost a gift to obtain a non-contiguous grouping like this, as it forces the user to think more carefully about what these groupings do and do not imply.
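The role of the SEs can be checked approximately by hand. The sketch below uses the sample sizes from the simulation and the rounded estimates from CLD #1. The residual standard deviation s = 0.952 is an assumption backed out from the reported SEs (for example, 0.673 × √2), and applying the studentized range distribution to √2·|t| is our assumption about what the “tukey method” annotation corresponds to. Because the inputs are rounded, the P values are only approximate.

```r
n   <- c(A = 19, B = 15, C = 16, D = 18, E = 29, F = 2, G = 14)  # sample sizes
emm <- c(A = 16.4, G = 15.3, F = 16.6)   # rounded estimates from CLD #1
s   <- 0.952                             # assumed residual SD (see lead-in)
df  <- 106                               # error df shown in the output

# SE of each estimated mean; F stands out because n = 2
se_mean <- s / sqrt(n)
print(round(se_mean, 3))

# Approximate Tukey-adjusted P value for one pair among k means
tukey_p <- function(i, j, k = 7) {
  se_diff <- s * sqrt(1 / n[[i]] + 1 / n[[j]])
  t_ratio <- abs(emm[[i]] - emm[[j]]) / se_diff
  ptukey(sqrt(2) * t_ratio, nmeans = k, df = df, lower.tail = FALSE)
}

p_AG <- tukey_p("A", "G")   # difference of about 1.1, small SE
p_GF <- tukey_p("G", "F")   # difference of about 1.3, large SE
print(c(A_vs_G = p_AG, G_vs_F = p_GF))
```

Under these assumptions the A vs. G comparison falls below 0.05 while G vs. F does not, even though the estimated G vs. F difference is the larger of the two.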
Given the discussion above, one might wonder if it is possible to construct a CLD in such a way that means sharing the same symbol are actually shown to be the same? The answer is yes (otherwise we wouldn’t have asked the question!) – and it is quite easy to do, thanks to two things:
First, the CLD algorithm works from a Boolean flag for each pair of means: TRUE for any pair that is statistically different (those means must receive different grouping letters), and FALSE otherwise; and the algorithm works for any such Boolean matrix.
Second, the flags can instead come from tests of equivalence: FALSE if two means are shown to be equivalent and TRUE if not shown to be equivalent.
For our example, suppose, based on subject-matter considerations, that two means that differ by less than 1.0 can be considered equivalent. In the emmeans setup, we specify that we want equivalence testing simply by providing this nonzero threshold value as a delta argument. In addition, we typically will not make multiplicity adjustments to equivalence tests. Here is the result we obtain (call this CLD #2):
cld(foo.emm, delta = 1, adjust = "none")
##  trt emmean    SE  df lower.CL upper.CL .equiv.set
##  B     14.6 0.246 106     14.1     15.1  1
##  E     15.0 0.177 106     14.6     15.3  12
##  D     15.3 0.224 106     14.8     15.7   2
##  G     15.3 0.254 106     14.8     15.9   2
##  A     16.4 0.218 106     15.9     16.8    3
##  F     16.6 0.673 106     15.2     17.9     4
##  C     19.3 0.238 106     18.9     19.8      5
##
## Confidence level used: 0.95
## Statistics are tests of equivalence with a threshold of 1
## P values are left-tailed
## significance level used: alpha = 0.05
## Estimates sharing the same symbol test as equivalent
So we obtain five groupings, but only two if we ignore those that apply to only one mean. We have that treatments B and E can be considered equivalent, and treatments E, D, and G are considered equivalent. It is also important to know that we cannot say that means in different groups are significantly different.
Unlike CLD #1, we are showing only groupings of means that we can show to be the same. The first four means, which were grouped together earlier, are now assigned to two equivalence groupings. And treatment F is not grouped with any other mean, which makes sense because we have so little data on that treatment that we can hardly say anything.
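For readers who want to see what one of these equivalence tests might look like numerically, here is a rough sketch for two pairs. It assumes the test statistic is (|difference| − delta) / SE with a left-tailed P value, which is our reading of the annotations under CLD #2. It also uses the rounded estimates from the table and an assumed residual SD of 0.952 backed out from the reported SEs, so the results are approximate. The helper equiv_p() is a name of our own.

```r
# Left-tailed P value for an equivalence test with threshold delta (assumed form)
equiv_p <- function(diff, se, delta, df) {
  pt((abs(diff) - delta) / se, df)
}

s     <- 0.952   # assumed residual SD
df    <- 106     # error df shown in the output
delta <- 1       # equivalence threshold used for CLD #2

# SEs of the differences, from the sample sizes n_B = 15, n_E = 29, n_D = 18
se_BE <- s * sqrt(1 / 15 + 1 / 29)
se_BD <- s * sqrt(1 / 15 + 1 / 18)

p_BE <- equiv_p(15.0 - 14.6, se_BE, delta, df)   # B vs. E
p_BD <- equiv_p(15.3 - 14.6, se_BD, delta, df)   # B vs. D
print(c(B_vs_E = p_BE, B_vs_D = p_BD))
```

Under these assumptions B vs. E falls below 0.05 (shown equivalent) and B vs. D does not, which agrees with B and E sharing symbol 1 while B and D share none.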
Another variation is to simply reverse all the Boolean flags we used in constructing CLD #1. Then two means will receive the same letter only if they are significantly different. Thus, we really obtain ungrouping letters. We label these groupings “significance sets.” The resulting display has a distinctively different appearance, because common symbols tend to be far apart rather than contiguous. (Call this CLD #3)
cld(foo.emm, signif = TRUE)
##  trt emmean    SE  df lower.CL upper.CL .signif.set
##  B     14.6 0.246 106     14.1     15.1  1
##  E     15.0 0.177 106     14.6     15.3   2
##  D     15.3 0.224 106     14.8     15.7    3
##  G     15.3 0.254 106     14.8     15.9     4
##  A     16.4 0.218 106     15.9     16.8  1234
##  F     16.6 0.673 106     15.2     17.9      5
##  C     19.3 0.238 106     18.9     19.8  12345
##
## Confidence level used: 0.95
## P value adjustment: tukey method for comparing a family of 7 estimates
## significance level used: alpha = 0.05
## Estimates sharing the same symbol are significantly different
Here we have five significance sets. By comparing with CLD #1, you can confirm that each significant difference shown explicitly here corresponds to one shown implicitly (by not sharing a group) in CLD #1.
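That correspondence can also be confirmed programmatically. The sketch below transcribes the symbol columns of CLD #1 and CLD #3 by hand and checks that each pair of distinct means shares a symbol in exactly one of the two displays. The helper share_symbol() is an illustrative name of our own, not part of any package.

```r
# TRUE if two symbol strings have at least one symbol in common
share_symbol <- function(g1, g2) {
  length(intersect(strsplit(g1, "")[[1]], strsplit(g2, "")[[1]])) > 0
}

# Symbol columns transcribed from the two displays
cld1 <- c(B = "1", E = "1", D = "1", G = "1", A = "2",    F = "12", C = "3")
cld3 <- c(B = "1", E = "2", D = "3", G = "4", A = "1234", F = "5",  C = "12345")

share1 <- outer(cld1, cld1, Vectorize(share_symbol))  # not shown to differ
share3 <- outer(cld3, cld3, Vectorize(share_symbol))  # significantly different

# Every pair of distinct means should share a symbol in exactly one display
off_diag <- row(share1) != col(share1)
complementary <- all(xor(share1[off_diag], share3[off_diag]))

# Number of significantly different pairs flagged by CLD #3
n_signif <- sum(share3[upper.tri(share3)])
print(c(complementary = complementary, n_signif = n_signif))
```

The check passes, and CLD #3 flags 10 of the 21 pairs as significantly different: A vs. each of B, E, D, and G, plus C vs. every other mean.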
Compact letter displays show symbols based on statistical testing results. In such tests, we have strong conclusions or findings – those that have small P values, and weak conclusions or non-findings – those where the P value is not less than some \(\alpha\). When we create visual flags such as grouping lines or symbols, those come across visually as findings, and the problem with standard CLDs is that those are the non-findings. We show two simple ways to use software that creates CLDs so that actual findings are flagged with symbols. It is hoped that people will find these modifications useful in visually displaying comparisons among means.
Graves, Spencer, Piepho, Hans-Peter, Selzer, Luciano, and Dorai-Raj, Sundar (2019). multcompView: Visualizations of Paired Comparisons. R package version 0.1-8, https://CRAN.R-project.org/package=multcompView
Hothorn, Torsten, Bretz, Frank, and Westfall, Peter (2008). Simultaneous Inference in General Parametric Models. Biometrical Journal 50(3), 346–363.
Piepho, Hans-Peter (2004). An algorithm for a letter-based representation of all pairwise comparisons. Journal of Computational and Graphical Statistics 13(2), 456–466.