Description of EDA plot types
Global PCA plots -
The global PCA plots present a view of the two prokaryotic domains and
are generated from the complete data set using all 223 benchmarks. To
maintain a consistent perspective, taxa are selected at the phylum or
class level and overlaid onto the base plot. The identity of a
highlighted point can be viewed by placing the cursor directly over it.
In all cases, global plots show the first two principal components,
which account for > 85% of the total variance within the data set.
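As a rough illustration of how the first two principal components and
their share of the total variance can be obtained, the following is a
minimal sketch using NumPy. The matrix layout (one row per sequence, one
column per benchmark), the function name pca_scores and the synthetic
data are assumptions made for the example; they are not the data or the
software used to produce the plots.

```python
import numpy as np

def pca_scores(data, n_components=2):
    """Return (scores, explained_variance_ratio) for a rows-by-columns matrix."""
    # Center each column so the components describe variance, not the mean.
    centered = data - data.mean(axis=0)
    # SVD of the centered matrix; squared singular values are proportional
    # to the eigenvalues of the covariance matrix.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    variance = s ** 2
    ratio = variance / variance.sum()
    scores = u[:, :n_components] * s[:n_components]
    return scores, ratio[:n_components]

# Synthetic stand-in (hypothetical): 50 "sequences" x 10 "benchmarks"
# driven by two dominant directions plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 2)) * np.array([10.0, 5.0])
loadings = rng.normal(size=(2, 10))
data = latent @ loadings + rng.normal(scale=0.1, size=(50, 10))

scores, ratio = pca_scores(data)
print(scores.shape)        # one (PC1, PC2) coordinate pair per row
print(float(ratio.sum()))  # fraction of total variance in PC1 + PC2
```

The two columns of scores are the coordinates that would be plotted; the
sum of ratio is the quantity quoted as a percentage of total variance.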
Phylum- and class-level PCA plots -
In some instances, adequate resolution of species is not possible within
the global PCA plots. Reasons for this include overlap of points within a
given 2-D space, point occlusion, or insufficient variability within
the 16S rDNA to provide meaningful separation using the global coordinate
system. In such cases, it is useful to recompute the principal components
for a subset of sequences and benchmarks, selected based on taxonomic
affiliation. Separation and visualization of subgroups are enhanced
by "rotating" the plots, which is accomplished by using various
two-way combinations of the first, second and third principal components.
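The recomputation and "rotation" described above can be sketched as
follows. This is only an illustration under stated assumptions: the
function name subset_pca, the label list and the random data are
hypothetical, and benchmark selection is represented by an optional list
of column indices.

```python
from itertools import combinations

import numpy as np

def subset_pca(data, labels, taxon, columns=None, n_components=3):
    """Recompute principal component scores for the rows labeled `taxon`."""
    mask = np.asarray(labels) == taxon
    subset = data[mask]
    if columns is not None:
        # Optionally restrict to a chosen subset of benchmarks (columns).
        subset = subset[:, columns]
    centered = subset - subset.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Hypothetical data: 30 sequences x 8 benchmarks with phylum labels.
rng = np.random.default_rng(1)
data = rng.normal(size=(30, 8))
labels = ["Firmicutes"] * 12 + ["Actinobacteria"] * 18

scores = subset_pca(data, labels, "Firmicutes")

# The three two-way views: (PC1, PC2), (PC1, PC3), (PC2, PC3).
views = {
    (i + 1, j + 1): scores[:, [i, j]]
    for i, j in combinations(range(3), 2)
}
for pair, coords in views.items():
    print(pair, coords.shape)
```

Each entry in views holds the 2-D coordinates for one pairing of
components, so plotting the three entries in turn gives the three
"rotated" perspectives on the same subgroup.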
Screeplots -
A screeplot plots the eigenvalues against their indices and breaks
visually into a steady downward slope and a gradual tailing away,
analogous to a mountain and the scree that accumulates at its base. The
breakpoint in the downward slope indicates the division between the
"important" principal components and the less important ones, which
make up the scree.
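The values shown in a screeplot can be computed as in the sketch below.
The data are synthetic, and the breakpoint rule (cut at the largest
ratio between successive eigenvalues) is a crude heuristic assumed for
the example; in practice the breakpoint is usually judged by eye.

```python
import numpy as np

def scree_eigenvalues(data):
    """Eigenvalues of the covariance matrix, largest first."""
    cov = np.cov(data, rowvar=False)
    # eigvalsh returns ascending eigenvalues for a symmetric matrix.
    return np.linalg.eigvalsh(cov)[::-1]

def important_components(eigenvalues):
    """Assumed heuristic: break at the largest ratio between neighbors."""
    ratios = eigenvalues[:-1] / eigenvalues[1:]
    return int(np.argmax(ratios)) + 1

# Hypothetical data: six columns, two of which carry most of the variance.
rng = np.random.default_rng(2)
data = rng.normal(scale=0.1, size=(60, 6))
data[:, 0] += rng.normal(scale=10.0, size=60)
data[:, 1] += rng.normal(scale=5.0, size=60)

eigenvalues = scree_eigenvalues(data)
for index, value in enumerate(eigenvalues, start=1):
    print(index, round(float(value), 4))  # the points of the screeplot

n_important = important_components(eigenvalues)
print("components before the break:", n_important)
```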
Dynamic heatmaps -
Non-optimized, dynamic heatmaps of subsets of the data used in the PCA
were generated to help explain the positioning of individual points in
some plots. Note that the scale and coloration change from one heatmap
to the next.
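The practical consequence of a changing scale is that the same raw value
can be drawn in different colors in different heatmaps. The sketch below
assumes, purely for illustration, that each heatmap stretches its color
scale between the minimum and maximum of its own subset; how the actual
heatmaps are scaled is not stated here, and the values are made up.

```python
def color_position(value, values):
    """Position of `value` on a 0-to-1 color scale fitted to `values`."""
    low, high = min(values), max(values)
    return (value - low) / (high - low)

# Two hypothetical subsets that both contain the raw value 0.2.
subset_a = [0.0, 0.1, 0.2]
subset_b = [0.0, 0.2, 0.4]

position_a = color_position(0.2, subset_a)  # top of subset_a's scale
position_b = color_position(0.2, subset_b)  # middle of subset_b's scale
print(position_a, position_b)
```

Colors should therefore be compared within a single heatmap and read
against that heatmap's own scale.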
The analysis -
One of the major problems plaguing the use of 16S rDNA for deterministic
purposes is the lack of a carefully vetted set of sequences in which the
taxonomic annotation has been reviewed and updated. Our analyses began
with a set of 6635 sequences (> 1399 nts, < 4% ambiguities) that had been
reported as coming from type strains or from strains of validly named
species. These are identified as the "unresolved" set, as a number of
taxonomic and nomenclatural errors remained within this data set. The
"resolved" set is a subset of 6377 sequences for which we could confirm
identity and taxonomic placement. Some likely placement errors,
indicative of misnamed species, remain within this subset. These are
predominantly within the phyla Firmicutes and Actinobacteria.
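The length and ambiguity criteria quoted above (> 1399 nts, < 4%
ambiguities) can be expressed as a simple filter. This is a sketch under
assumptions: sequences are taken to be ungapped strings, any character
other than A, C, G, T or U is counted as ambiguous, and the function
names are hypothetical. It does not reproduce the manual review of
taxonomy and nomenclature that distinguishes the "resolved" set.

```python
UNAMBIGUOUS = set("ACGTU")

def ambiguity_fraction(sequence):
    """Fraction of positions that are not A, C, G, T or U."""
    sequence = sequence.upper()
    if not sequence:
        return 1.0  # treat an empty sequence as entirely ambiguous
    ambiguous = sum(1 for base in sequence if base not in UNAMBIGUOUS)
    return ambiguous / len(sequence)

def passes_filter(sequence, min_length=1400, max_ambiguity=0.04):
    """True if the sequence is longer than 1399 nt with < 4% ambiguities."""
    long_enough = len(sequence) >= min_length
    return long_enough and ambiguity_fraction(sequence) < max_ambiguity

# Hypothetical test sequences.
clean = "ACGT" * 350                    # 1400 nt, no ambiguities
short = "ACGT" * 349                    # 1396 nt, too short
noisy = "N" * 60 + "ACGT" * 335         # 1400 nt, about 4.3% ambiguous
tolerable = "N" * 20 + "ACGT" * 345     # 1400 nt, about 1.4% ambiguous

for name, seq in [("clean", clean), ("short", short),
                  ("noisy", noisy), ("tolerable", tolerable)]:
    print(name, len(seq), passes_filter(seq))
```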