ADDENDUM – Self Organized Self Correcting Classifier Algorithm
(Garrity & Lilburn)
Copyright, Michigan State University, 2004, All rights
reserved
STEP 1. Data input
Read distance
matrix (ASCII file) into S-Plus matrix (DIST.MAT)
DIST.MAT is the output from PAUP
Create
corresponding taxonomy S-Plus data.frame from master taxonomy data.frame
(TAX.TABLE)
TAX.TABLE initially contains three columns (as factor variables): the
species name, the
numerical position of
the species in the current version of Bergey's Taxonomic Outline,
and the corresponding family to which the species is assigned with the
current version of
the taxonomy.
NOTE - each data
element (16S rRNA distance vector or corresponding taxonomy vector) is uniquely
identified by the RDP ID. This identifier serves as a pointer for
indexing into data structures.
Sort distance
matrix according to Bergey's Outline Sequence
Visualize the
DIST.MAT as a heat map (see gl_sup_fig1.png)
STEP 2. Reorder DIST.MAT (first pass)
NOTE - The goal
at this stage of the analysis is to examine DIST.MAT at the genus level, to
identify any potential misplacements that arise because of
misidentification of sequences
or because of a failure to identify synonyms and update the nomenclature
associated with
such sequences.
Create a vector
of genus names for each sequence (GENUS.NAME)
Create a vector
of unique genus names (TAXON.LEVEL)
Initialize an
S-Plus list for storing RDP-IDs for each unique genus, attach the corresponding
name to each list element (TAXON.SEQ.NEW)
FOR each unique
genus name
IF two or less representative sequences exists for a given genus
add RDP-IDs to the corresponding element in TAXON.SEQ.NEW
ENDIF
IF more than two representative sequences exist for a given genus
extract the genus-level submatrix from DIS.MAT and assign to TEMP.DIS
IF sum of the matrix == 0
add RDP-IDs to the corresponding element in TAXON.SEQ.NEW
NEXT
NOTE - This step is required to avoid failure of subsequent
clustering
ENDIF
Jitter the matrix to eliminate ties
FOR each row in TEMP.DIS
determine the rank-order of the evolutionary distance values
store in TEMP.ORDER
NEXT
FOR each column in TEMP.ORDER
identify the name of the top-ranked sequence, add to TEMP.REORDER
NEXT
Identify any names dropped because of a tie-score for top-ranking
(DROPPED.NAMES)
Reorder TEMP.DIS according to merged name vector (TEMP.REORDER, DROPPED.NAMES)
Reorder TEMP.ORDER according to merged name vector (TEMP.REORDER,
DROPPED.NAMES)
NOTE - at this point I will create a mask based on a sample-based
threshold score
of the second-best match.
This value is used to eliminate the problem
of artificially low values arising from self-matching. This next series
of
steps involve vectorized operations on TEMP.ORDER and TEMP.DIS to
extract
indices and values in a single step.
IF second-best matches are found within TEMP.ORDER
calculate the threshold value (default currently set to 90%ile)from
corresponding values in TEMP.DIS
ENDIF
IF a second-best matches are not found within TEMP.ORDER (this can occur
when
ties exist in TEMP.ORDER)
calculate the threshold value (default currently set to 90%ile)from
corresponding values in TEMP.DIS ranked at > 1 and < 2.
ENDIF
Create a logical matrix of neighbors (NEIGHBORHOOD)in which values >
90%ile set to T
Recast NEIGHBORHOOD as binary matrix
Multiply TEMP.DIS by NEIGHBORHOOD to create a mask of nearest neighbors
Hierarchical
Binary clustering of NEIGHBORHOOD along both dimensions
NOTE - current model employees binary distance matrix, complete linkage
and
reordering of the resulting model according to mean evolutionary
distance of each sequence in TEMP
Reorder TEMP.DIS along both dimensions according to the cluster analysis
ENDIF
NEXT
Create new index
into DIST.MAT from TAXON.SEQ.NEW
Plot heat
map (see
gl_sup_fig2.png)
NOTE - At this
point, misidentified sequences can be visualized in the matrix and can be
identified
through interactive visualization of the distance matrix in the heat map
viewer.
STEP 3. - Removal of misidentified sequences
Restructure
DIST.MAT based along TAXON.SEQ.NEW
Create modified
version of TAXON.SEQ.NEW (TAXON.SEQ.NEW2) by removal of misidentified
sequences identified in heat map
Create new index
into DIST.MAT from TAXON.SEQ.NEW2
Plot heat map
(see gl_sup_fig3.png)
STEP 4. - Repeat STEP 2 with misidentified strains
removed (see gl_sup_fig4.png)
STEP 5. - Visualization of revised DIST.MAT with
Misidentified sequences
NOTE - this
matrix is simply created by combining the revised index created from
TAXON.SEQ.NEW and the index of misidentified sequences(see gl_sup_fig6.png).
STEP 6. - Matching of unknown/misidentified strains to
known strains
NOTE - This
analysis requires the use of partial matrices to allow matching of the
unidentified
or misidentified sequence to its nearest neighbor.
Create ID.MAT
(DIST.MAT[(misidenfied,excluded sequences), correct sequences])
Jitter matrix to
eliminate ties
FOR each row in
ID.MAT
determine the rank-order of the evolutionary distance values
store in TEMP.ORDER
NEXT
FOR each column
in TEMP.ORDER
Create ordered matrix of evolutionary distances (TEMP.ORDER)
NEXT
Identify any
names dropped because of a tie-score for top-ranking (DROPPED.NAMES)
Reorder TEMP.DIS
according to merged name vector (TEMP.REORDER, DROPPED.NAMES)
Remove known
errors
NOTE - when working below the domain level, there are valid reasons for
dropping
sequences that are known to be in error that would otherwise
skew/distort the data.
FOR each
sequence
Identify name of the correctly identified sequence that is closest to
the unknown
or misidentified sequence (BEST.MATCH)
NEXT
Create a
dataframe (NEAREST.NEIGHBORS) from the row names of TEMP.ORDER and
columns 1 and 2 of TAX.TABLE, indexed along BEST.MATCH. Then, add a column of genus names
based on those appearing in the species names in column 1. Add appropriate column headers to
columns 4 and 5.
Create a revised
taxonomy table in which each sequence will be added back according to the best
match of a correctly identified sequence. (REVISED.TAX.TABLE)
STEP 7. Reorder the complete distance matrix with the
"identified" sequences added back
Restructure
DIST.MAT based on REVISED.TAX.TABLE
Repeat STEP 2
STEP 8. - Outlier detection
NOTE - As is the
case in a clustering routine, sequences that are clearly outside of a group
will be added back to the group with which closest affiliation is found
unless some
mechanism is included to identify those which exceed a threshold based
on sample statistics.
In this case, the sample statistics will be set to an evolutionary
distance > 3 stdev above
or below that of other members of a genus (TAXON.LEVEL).
FOR each
instance in TAXON.LEVEL
IF number of taxon members >= 4
NOTE - this step is necessary to have a reasonable minimal
sample size of 6
non-reflected values.
Create a temporary distance matrix (TEMP.REF)
Create a vector (OUTLIER) of evolutionary distance that fall below
3*stdev of the
lower triangle of TEMP.REF
Create a matrix (TEMP2)of logical values (as 0,1) that identify where
the values
in OUTLIER occur.
Calculate row sums for TEMP2 to identify which taxa are outliers
Extract names of outliers from TEMP2 for those rows in which the row sum
> 3Q of TEMP2
ENDIF
NEXT
Identify
outliers in REVISED.GAMMA.TAX by setting the MPI species and family to
"placement error"
STEP 9. - Resolution of placement errors
Create ID.MATRIX
for sequences identified as placement errors
Assign ID.MATRIX
to TEMP
Repeat STEP 6
STEP 10. - Add back mislabeled strains
Re-create
REVISED.TAX.TABLE from REVISED.TAX.TABLE (placement errors excluded) and
NEAREST.NEIGHBORS
NOTE - data frame built in row-wise fashion
Recreate
GENUS.NAMES and TAXON.LEVEL as in STEP-2
Reorder DIST.MAT
by REVISED.TAX.TABLE
Repeat STEP 2
(see gl_sup_fig8.png)
STEP 11. - Calculation of genus medioids
Restructure
DIST.MAT along both dimensions by TAXON.SEQ.NEW in STEP 10
Recreate vector
of GENUS.NAMES
Create TAXON
matrix (GENUS NAME x unique(GENUS.NAME))
Create
TAXON.NAMES (=unique(GENUS.NAME)
Create
TAXON.MEDIOIDS matrix
FOR each
TAXON.NAME
Extract taxon-specific submatrix from DIST.MAT
Estimate column means of submatrix
Store vector of column means in TAXON.MEDIODS matrix
NEXT
Assign taxon
names to rows and columns
Transpose
TAXON.MEDIOIDS
Create
TAXON.MEDIOIDS2 (square matrix)
FOR each taxon
name
Extract taxon-specific submatrix from DIST.MAT
IF only one vector occurs
store vector in TAXON.MEDIODS2 matrix
ENDIF
IF more than one vector occurs
Estimate column means of submatrix
store vector of column means in TAXON.MEDIODS2 matrix
ENDIF
NEXT
NOTE -
Restructure medioids by decremental column sort
Assign
TAXON.MEDOIDS2 to X
FOR (each row in
X)-1
Sort matrix column-wise in ascending order
Assign name of first sequence to TAXON.NAME
Reassign X to X[subtract first row,subtract first column]
NEXT
Identify the
last sequence by comparing names of TAXON.MEDOIDS2 with TAXON.NAME and add to
TAXON.NAME
Reorder
TAXON.MEDOIDS2 by TAXON.NAME to yield a smoothed matrix of medioids.
STEP 12. - Restructure the dataset according to the
plotting sequence of medioids(see gl_sup_fig9.png)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
SOSCC script and functions
#bioinformatics.demo<-function(heatmaps=FALSE)
heatmaps <- T
graphsheet()
#step1
gamma.dis<-read.eda("gamma.txt")
gamma.tax<-read.taxonomy(sequence.taxonomy, gamma.dis)
gamma.dis<-gamma.dis[gamma.tax[,3] > 1809 & gamma.tax[,3] <
3030,
gamma.tax[,3] > 1809 & gamma.tax[,3] < 3030]
#This step is
included to eleminate the RDP sequence of Marinobacter(Pseudomonas)
#stanier, which
exhibits some anomalies and significantly effects the final output
#scale.
gamma.dis<-gamma.dis[-grep("stanier*",
as.character(dimnames(gamma.dis)[[1]])),
-grep("stanier*", as.character(dimnames(gamma.dis)[[2]]))]
gamma.dis.bak<-gamma.dis
gamma.tax<-gamma.tax[gamma.tax[,3] > 1809 & gamma.tax[,3] <
3030, ]
gamma.tax<-gamma.tax[-grep("stanier*",
as.character(dimnames(gamma.tax)[[1]])),]
gamma.tax.bak<-gamma.tax
gamma.genus.all<-unique(outline1.3[outline1.3[,5]=="Gammaproteobacteria"
& outline1.3[,9]=="not.basonym" ,8])
if(heatmaps)
{
graphsheet()
heatmap(gamma.dis)
title(main="Unordered Gammaproteobacteria EDA matrix",
cex=0.75)
}
gamma.tax<-gamma.tax[order(gamma.tax[,3]),]
gamma.dis<-gamma.dis[dimnames(gamma.tax)[[1]],
dimnames(gamma.tax)[[1]]]
original.gamma.tax<-gamma.tax
original.gamma.dis<-gamma.dis
#step2
pass1<-step2(gamma.tax, gamma.dis)
misidentified<-pass1$misidentifiedInStep2
taxon.seq.new<-pass1$taxon.seq.new
pass1<-pass1[1:(length(pass1)-1)]
if(heatmaps)
{
graphsheet()
heatmap(gamma.dis[unlist(pass1), unlist(pass1)])
title(main="First pass ordering, Gammaproteobacteria EDA
matrix", cex=0.75)
label.heatmap(build.heatmap.label.table(gamma.tax,2,
unique(gamma.tax[,2]),unlist(pass1)),0.45)
}
#step3
gamma.dis<-gamma.dis[unlist(pass1), unlist(pass1)]
taxon.seq.new2<-setdiff(unlist(pass1),misidentified)
pass2<-unlist(taxon.seq.new2)
if(heatmaps)
{
graphsheet()
heatmap(gamma.dis[pass2, pass2])
title(main="First pass reorganization of Gammaprotebacteria EDA
matrix
misidentified strains removed." , cex=0.75)
label.heatmap(build.heatmap.label.table(gamma.tax[pass2,],2,
unique(gamma.tax[pass2,2]),unlist(pass2)),0.45)
}
#step4
gamma.tax<-gamma.tax[unlist(pass2),]
gamma.dis<-gamma.dis[unlist(pass2), unlist(pass2)]
pass3<-step2(gamma.tax, gamma.dis)
misidentified<-c(misidentified,pass3$misidentified)
pass3<-pass3[1:(length(pass3)-1)]
pass3<-setdiff(unlist(pass3),misidentified)
gamma.dis<-gamma.dis.bak
gamma.tax<-gamma.tax.bak
if(heatmaps)
{
graphsheet()
heatmap(gamma.dis[unlist(pass3), unlist(pass3)])
title(main="Second pass reorganization of Gammaprotebacteria EDA
matrix
misidentified strains removed.", cex=0.75)
label.heatmap(build.heatmap.label.table(gamma.tax[pass3,],2,
unique(gamma.tax[pass3,2]),unlist(pass3)),0.45)
}
correct.placements<-unlist(pass3)
if(heatmaps)
{
graphsheet()
heatmap(gamma.dis[misidentified, correct.placements])
title(main="Misidentified vs correctly identified
strains",cex=0.75)
}
#step5
excluded.strains<-setdiff(dimnames(gamma.tax)[[1]],
c(misidentified,as.vector(correct.placements)))
pass4<-unique(c(correct.placements, unique(c(misidentified,
excluded.strains))))
if(heatmaps)
{
graphsheet()
heatmap(gamma.dis[unlist(pass4), unlist(pass4)])
title(main="Misidentified and excluded strains added back to
matrix" ,cex=0.75)
label.heatmap(build.heatmap.label.table(gamma.tax[pass3,],2,
unique(gamma.tax[pass3,2]),unlist(pass3)),0.45)
}
#step6
revised.gamma.tax<-step6(gamma.dis[c(misidentified, excluded.strains),
correct.placements], gamma.tax, correct.placements)
known.errors<-revised.gamma.tax$known.errors[revised.gamma.tax$known.errors!=""]
nearest.neighbors<-revised.gamma.tax$nearest.neighbors
gamma.genus<-revised.gamma.tax$all.genus
revised.gamma.tax<-revised.gamma.tax$revised.tax.table
revised.gamma.tax[,3]<-as.character(revised.gamma.tax[,3])
revised.gamma.tax[known.errors,3]<-"known.error"
revised.gamma.tax[,3]<-as.factor(revised.gamma.tax[,3])
#step 7
pass5<-unlist(step7(revised.gamma.tax, gamma.dis, gamma.genus))
pass5<-setdiff(unlist(pass5), c(misidentified, known.errors))
misidentified<-unique(c(misidentified,dimnames(gamma.dis)[[1]][abs(apply(scale(gamma.dis),
1, max))>2.5]))
pass5<-setdiff(unlist(pass5), c(misidentified, known.errors))
if(heatmaps)
{
graphsheet()
heatmap(gamma.dis[unlist(pass5), unlist(pass5)])
title(main="Third pass reorganization of Gammaprotebacteria EDA
matrix
misidentified and excluded strains repositioned.", cex=0.75)
label.heatmap(build.heatmap.label.table(revised.gamma.tax[pass5,],2,
unique(revised.gamma.tax[pass5,2]),unlist(pass5)),0.45)
}
#step8
outlier<-0
place.error<-as.vector(unlist(step8(revised.gamma.tax, gamma.dis,
gamma.genus, taxon.seq.new))) # modified
revised.gamma.tax[,3]<-as.character(revised.gamma.tax[,3])
revised.gamma.tax[place.error,3]<-"placement error"
revised.gamma.tax[,3]<-as.factor(revised.gamma.tax[,3])
#Step9
id.matrix<-gamma.dis[revised.gamma.tax[,3]=="placement
error",
revised.gamma.tax[,3]!="placement error"]
nearest.neighbors<-step9(id.matrix, gamma.tax)
#Step10
revised.gamma.tax<-rbind(revised.gamma.tax[revised.gamma.tax[,3]!="placement
error",],
revised.gamma.tax[revised.gamma.tax[,3]=="placement
error",])
gamma.dis<-gamma.dis.bak
gamma.tax<-gamma.tax.bak
gamma.genus<-substring(as.character(revised.gamma.tax[,4]), 1,
regexpr("[^a-zA-Z]",
as.character(revised.gamma.tax[,4]))-1)
pass6<-step7(revised.gamma.tax, gamma.dis, gamma.genus)
seq2remove<-sort(unique(step10(revised.gamma.tax,
gamma.dis)))[-1]
pass6<-setdiff(unlist(pass6), seq2remove)
if(heatmaps)
{
graphsheet()
heatmap(gamma.dis[pass6,pass6])
title(main="Final rearragement of Gammaproteobacteria, \nBergey's
Family Order",cex=0.75)
label.heatmap(build.heatmap.label.table(revised.gamma.tax[pass6,],5,
unique(revised.gamma.tax[pass6,5]),pass6),0.45)
}
#Step 11
revised.gamma.tax<-revised.gamma.tax[pass6, ]
gamma.dis<-gamma.dis[pass6,pass6]
medioids<-step11(revised.gamma.tax, gamma.dis)
genus.order<-medioids$genus.order
#Step12
pass10<-step12(revised.gamma.tax, genus.order)
family.order<-pass10$family.order
pass10<-pass10$x
revised.gamma.tax<-revised.gamma.tax[pass10,]
graphsheet()
heatmap(gamma.dis[pass10,pass10])
title(main="Optimized rearragement of Gammaproteobacteria, \nbased
on genus mediods",
cex=0.75)
label.heatmap(build.heatmap.label.table(revised.gamma.tax[pass10,],5,
family.order,pass10),0.45)
+++++++++++++++++++++++++++++++++++++++++++FUNCTIONS+++++++++++++++++++++++++++++++++++++++++++++++++
Heatmap <- function(temp)
{
image(list(x = 1:dim(temp)[1], y = 1:dim(temp)[2], z =
as.matrix((temp))), axes = FALSE)
image.legend(as.matrix((temp)), x = nrow(temp) * 1.075, y = ncol(temp) *
1.05, size = c(0.125, 6.1), hor = F, cex = 0.66, tck = -0.01, mgp = c(0, 0.5,
0))
invisible()
}
build.heatmap.label.table <-
function(taxonomy.table, label.position = 2, label.source, taxonomy.order)
{
segment.table <- as.data.frame(matrix(0, length(label.source), 6))
dimnames(segment.table)[[1]] <- label.source
dimnames(segment.table)[[2]] <- c("x1", "y1",
"x2", "y2", "x3", "y3")
j
<- 1
segment.table[1, c(1, 3)] <- 0.5
for(i in 1:nrow(taxonomy.table)) {
if(taxonomy.table[i, label.position] == label.source[j]) {
next
}
if(taxonomy.table[i, label.position] != label.source[j]) {
j <- j + 1
segment.table[j, c(1, 3)] <- i - 0.5
}
}
segment.table[, 2] <- -1
segment.table[, 4] <- nrow(taxonomy.table[taxonomy.order, ]) * -0.01
segment.table[, 5] <- nrow(taxonomy.table[taxonomy.order, ]) * -0.0125
segment.table[1:(nrow(segment.table) - 1), 6] <-
segment.table[1:(nrow(segment.table) - 1), 1] +
(segment.table[2:nrow(segment.table), 1] - segment.table[1:(nrow(segment.table)
- 1), 1])/2
segment.table[nrow(segment.table), 6] <-
((nrow(taxonomy.table[taxonomy.order,
]) - segment.table[nrow(segment.table), 1]))/2 +
segment.table[nrow(segment.table), 1]
return(segment.table)
}
label.heatmap <- function(segment.table, size = 1)
{
for(i in 1:nrow(segment.table)) {
segments(segment.table[i, 1], segment.table[i, 2], segment.table[i, 3],
segment.table[i, 4])
text(segment.table[i, 6], segment.table[i, 5],
dimnames(segment.table)[[1]][i], crt = 90, cex = size, adj = 1)
segments(segment.table[i, 2], segment.table[i, 1], segment.table[i, 4] *
0.66, segment.table[i, 3])
text(segment.table[i, 5] * 0.66, segment.table[i, 6],
dimnames(segment.table)[[1]][i], cex = size, adj = 1)
}
}
read.eda <- function(file = "edafile")
{
return(eda.dis <- read.table(file, header = T, row.names = 1, sep =
"\t"))
}
read.taxonomy <- function(taxon.table, eda.dis,
species = 1, family = 6, taxon.seq = 9)
{
return(taxon.table[dimnames(eda.dis)[[1]], c(species, family, taxon.seq)])
}