Categorical morphological data (discrete characters) should be treated as factors when imported to calculate character distances, because the symbols used to represent different states are arbitrary (they could equally be represented by letters, as for DNA data). If continuous variables are used as phylogenetic characters, they should be read in from a separate file and treated as numeric data, since the input values for each state (e.g., 0.234, 2.456, 3.567, etc.) represent true distances between data points.

Categorical data including symbols for inapplicable and missing data (typically "-" and "?", respectively) will be read in and treated as separate categories of data relative to the numerical symbols for different character states ("0", "1", "2", etc.). Therefore, users have two options for handling inapplicable/missing data in morphological phylogenetic datasets before importing them into EvoPhylo: convert inapplicable/missing data to NA (option 1), or keep the original symbols (option 2).

In the example provided below, converting inapplicable/missing conditions to NA means that taxa with inapplicable/missing data are ignored when calculating inter-character distances. The resulting distance matrix will contain NaN for every pairwise comparison between characters that have no scored taxon in common (all comparisons including character 5, as well as the comparison between characters 4 and 7) (Table 2, in blue). Statistical tests and clustering methods cannot use matrices with NaN entries, so observations contributing excessive NaN would have to be removed. However, removing observations with excessive inapplicable/missing data is not possible for character partitioning, because each character in the dataset must be assigned to at least one partition (regardless of the amount of missing or inapplicable data).


| | Taxon A | Taxon B |
|---|---|---|
| Char1 | 0 | 0 |
| Char2 | 1 | 1 |
| Char3 | 0 | 0 |
| Char4 | 0 | ? |
| Char5 | ? | ? |
| Char6 | 1 | 1 |
| Char7 | ? | 1 |
| Char8 | 0 | 0 |
| Char9 | 1 | 1 |
| Char10 | 1 | 1 |

In addition, in comparisons between characters that include NA, the NA entries contribute no difference to the distance matrix. For instance, the distance between characters 6 (1, 1) and 7 (NA, 1) is 0 (Table 2, in red). The implicit assumption of option 1 (conversion to NA) is that unknown states contribute 0 distance. This approach therefore biases the distance matrix by minimizing the overall distance between characters to the lowest possible values: whatever the true condition represented by the unknown state, it is assumed to be equal to the known character states (e.g., character states scored as "1" for Taxa A and B).

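
The two effects described for option 1 can be reproduced with a small sketch. This is a plain-Python illustration of a Gower-style distance for categorical scores, not EvoPhylo's own code; the names `chars` and `gower_na` are hypothetical, `None` stands in for NA, and the scores are taken from characters 4 to 7 of the example table above.

```python
import math

# Scores for (Taxon A, Taxon B); None stands in for NA (assumption of this sketch).
chars = {
    "Char4": ("0", None),
    "Char5": (None, None),
    "Char6": ("1", "1"),
    "Char7": (None, "1"),
}

def gower_na(x, y):
    """Mean mismatch over taxa scored in both characters; NaN if none are shared."""
    shared = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if not shared:
        # No taxon is scored in both characters, so the distance is undefined.
        return math.nan
    return sum(a != b for a, b in shared) / len(shared)

d_6_7 = gower_na(chars["Char6"], chars["Char7"])  # only Taxon B is compared
d_4_7 = gower_na(chars["Char4"], chars["Char7"])  # no shared scored taxon
d_5_6 = gower_na(chars["Char5"], chars["Char6"])  # Char5 is entirely NA

print(d_6_7, d_4_7, d_5_6)
```

Under these assumptions, characters 6 and 7 come out at distance 0 because only Taxon B is compared, while any pair with no taxon scored in both characters has an undefined distance.
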
Alternatively, keeping the original inapplicable/missing data symbols (option 2) causes inapplicable/missing data to be treated as a distinct category relative to the numeric symbols. As a result, pairwise comparisons involving characters with unknown data no longer introduce NaN, allowing all characters to be considered (Table 3, in blue). This approach assumes that unknown states are always different from any known state, which biases the distance matrix by increasing the overall distance between characters. However, Gower distances (as used here) are normalized by the number of variables in the dataset (the number of taxa in this case), which reduces this bias. For instance, in a simple comparison between two characters sampled from two taxa (A and B), e.g., character 6 (1, 1) and character 7 ("?", 1) from the example in the online vignette, the raw distance between these characters is 1.0, but the Gower distance between them is 1/2 = 0.5 (Table 3, in red).
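
The normalization can be checked with a minimal sketch in which "?" is kept as an ordinary state. As before, this is a plain-Python illustration rather than EvoPhylo's implementation, and the names `chars` and `gower_symbols` are hypothetical; the scores are characters 4 to 7 of the example table.

```python
# Scores for (Taxon A, Taxon B); "?" is kept as its own category.
chars = {
    "Char4": ("0", "?"),
    "Char5": ("?", "?"),
    "Char6": ("1", "1"),
    "Char7": ("?", "1"),
}

def gower_symbols(x, y):
    """Number of mismatching taxa divided by the total number of taxa."""
    mismatches = sum(a != b for a, b in zip(x, y))  # raw distance
    return mismatches / len(x)                      # normalized by taxon count

d_6_7 = gower_symbols(chars["Char6"], chars["Char7"])  # raw 1 -> 1/2
d_4_7 = gower_symbols(chars["Char4"], chars["Char7"])  # raw 2 -> 2/2
d_4_5 = gower_symbols(chars["Char4"], chars["Char5"])  # raw 1 -> 1/2

print(d_6_7, d_4_7, d_4_5)
```

Note that in this sketch two "?" entries match each other, as any two identical symbols would, so only comparisons of "?" against a known state add to the distance.
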