ADDENDUM Self Organized Self Correcting Classifier Algorithm (Garrity & Lilburn)

 

Copyright, Michigan State University, 2004, All rights reserved

 

 

 

 

 

STEP 1. Data input

 

      Read distance matrix (ASCII file) into S-Plus matrix (DIST.MAT)

 

            DIST.MAT is the output from PAUP

 

      Create corresponding taxonomy S-Plus data.frame from master taxonomy data.frame (TAX.TABLE)

 

            TAX.TABLE initially contains three columns (as factor variables): the species name, the

 

            numerical position of the species in the current version of Bergey's Taxonomic Outline,

 

            and the corresponding family to which the species is assigned with the current version of

 

            the taxonomy.

 

      NOTE - each data element (16S rRNA distance vector or corresponding taxonomy vector) is uniquely

 

            identified by the RDP ID. This identifier serves as a pointer for indexing into data structures.

 

 

      Sort distance matrix according to Bergey's Outline Sequence

 

 

      Visualize the DIST.MAT as a heat map (see gl_sup_fig1.png)

 

 

STEP 2. Reorder DIST.MAT (first pass)

 

      NOTE - The goal at this stage of the analysis is to examine DIST.MAT at the genus level, to

 

            identify any potential misplacements that arise because of misidentification of sequences

 

            or because of a failure to identify synonyms and update the nomenclature associated with

 

            such sequences.

 

      Create a vector of genus names for each sequence (GENUS.NAME)

 

      Create a vector of unique genus names (TAXON.LEVEL)

 

      Initialize an S-Plus list for storing RDP-IDs for each unique genus, attach the corresponding

 

            name to each list element (TAXON.SEQ.NEW)

 

      FOR each unique genus name

 

            IF two or less representative sequences exists for a given genus

 

                  add RDP-IDs to the corresponding element in TAXON.SEQ.NEW

 

            ENDIF

 

            IF more than two representative sequences exist for a given genus

 

                  extract the genus-level submatrix from DIS.MAT and assign to TEMP.DIS

 

                  IF sum of the matrix == 0

 

                        add RDP-IDs to the corresponding element in TAXON.SEQ.NEW

 

                        NEXT

 

                        NOTE - This step is required to avoid failure of subsequent

 

                              clustering

 

                  ENDIF

 

                  Jitter the matrix to eliminate ties

 

                  FOR each row in TEMP.DIS

 

                        determine the rank-order of the evolutionary distance values

 

                        store in TEMP.ORDER

 

                  NEXT

                  FOR each column in TEMP.ORDER

 

                        identify the name of the top-ranked sequence, add to TEMP.REORDER

 

                  NEXT

 

                  Identify any names dropped because of a tie-score for top-ranking (DROPPED.NAMES)

 

                  Reorder TEMP.DIS according to merged name vector (TEMP.REORDER, DROPPED.NAMES)

 

                  Reorder TEMP.ORDER according to merged name vector (TEMP.REORDER, DROPPED.NAMES)

 

                  NOTE - at this point I will create a mask based on a sample-based threshold score

 

                        of the second-best match.  This value is used to eliminate the problem

 

                        of artificially low values arising from self-matching. This next series of

 

                        steps involve vectorized operations on TEMP.ORDER and TEMP.DIS to extract

 

                        indices and values in a single step.

 

                  IF second-best matches are found within TEMP.ORDER

 

                        calculate the threshold value (default currently set to 90%ile)from

 

                        corresponding values in TEMP.DIS

 

                  ENDIF

 

                  IF a second-best matches are not found within TEMP.ORDER (this can occur when

 

                        ties exist in TEMP.ORDER)

 

                        calculate the threshold value (default currently set to 90%ile)from

 

                        corresponding values in TEMP.DIS ranked at > 1 and < 2.

 

                  ENDIF

 

                  Create a logical matrix of neighbors (NEIGHBORHOOD)in which values > 90%ile set to T

 

                  Recast NEIGHBORHOOD as binary matrix

 

                  Multiply TEMP.DIS by NEIGHBORHOOD to create a mask of nearest neighbors

 

                  Hierarchical Binary clustering of NEIGHBORHOOD along both dimensions

 

                        NOTE - current model employees binary distance matrix, complete linkage and

 

                              reordering of the resulting model according to mean evolutionary

 

                              distance of each sequence in TEMP

 

                  Reorder TEMP.DIS along both dimensions according to the cluster analysis

 

            ENDIF

 

      NEXT

 

      Create new index into DIST.MAT from TAXON.SEQ.NEW

 

      Plot heat map     (see gl_sup_fig2.png)

 

      NOTE - At this point, misidentified sequences can be visualized in the matrix and can be identified

 

            through interactive visualization of the distance matrix in the heat map viewer.

 

 

 

STEP 3. - Removal of misidentified sequences

 

      Restructure DIST.MAT based along TAXON.SEQ.NEW

 

      Create modified version of TAXON.SEQ.NEW (TAXON.SEQ.NEW2) by removal of misidentified

 

            sequences identified in heat map

 

      Create new index into DIST.MAT from TAXON.SEQ.NEW2

 

      Plot heat map (see gl_sup_fig3.png)

 

    

 

STEP 4. - Repeat STEP 2 with misidentified strains removed (see gl_sup_fig4.png)

 

 

 

STEP 5. - Visualization of revised DIST.MAT with Misidentified sequences

 

      NOTE - this matrix is simply created by combining the revised index created from TAXON.SEQ.NEW and the index of misidentified sequences(see gl_sup_fig6.png).

 

 

 

STEP 6. - Matching of unknown/misidentified strains to known strains

 

      NOTE - This analysis requires the use of partial matrices to allow matching of the unidentified

 

            or misidentified sequence to its nearest neighbor.

 

      Create ID.MAT (DIST.MAT[(misidenfied,excluded sequences), correct sequences])

 

      Jitter matrix to eliminate ties

 

      FOR each row in ID.MAT

 

            determine the rank-order of the evolutionary distance values

 

            store in TEMP.ORDER

 

      NEXT

 

      FOR each column in TEMP.ORDER

 

            Create ordered matrix of evolutionary distances (TEMP.ORDER)

 

      NEXT

 

      Identify any names dropped because of a tie-score for top-ranking (DROPPED.NAMES)

 

      Reorder TEMP.DIS according to merged name vector (TEMP.REORDER, DROPPED.NAMES)

 

      Remove known errors

 

            NOTE - when working below the domain level, there are valid reasons for dropping

 

                  sequences that are known to be in error that would otherwise skew/distort the data.

 

      FOR each sequence

 

            Identify name of the correctly identified sequence that is closest to the unknown

 

                  or misidentified sequence (BEST.MATCH)

 

      NEXT

 

      Create a dataframe (NEAREST.NEIGHBORS) from the row names of TEMP.ORDER and

 

            columns 1 and 2 of TAX.TABLE, indexed along BEST.MATCH.  Then, add a column of genus names

 

            based on those appearing in the species names in column 1.  Add appropriate column headers to

 

            columns 4 and 5.

 

      Create a revised taxonomy table in which each sequence will be added back according to the best

 

            match of a correctly identified sequence. (REVISED.TAX.TABLE)

 

 

 

STEP 7. Reorder the complete distance matrix with the "identified" sequences added back

 

      Restructure DIST.MAT based on REVISED.TAX.TABLE

 

      Repeat STEP 2

 

 

 

STEP 8. - Outlier detection

 

      NOTE - As is the case in a clustering routine, sequences that are clearly outside of a group

 

            will be added back to the group with which closest affiliation is found unless some

 

            mechanism is included to identify those which exceed a threshold based on sample statistics.

 

            In this case, the sample statistics will be set to an evolutionary distance > 3 stdev above

 

            or below that of other members of a genus (TAXON.LEVEL).

 

      FOR each instance in TAXON.LEVEL

 

            IF number of taxon members >= 4

 

                  NOTE - this step is necessary to have a reasonable minimal sample size of 6

 

                        non-reflected values.

 

                  Create a temporary distance matrix (TEMP.REF)

 

                  Create a vector (OUTLIER) of evolutionary distance that fall below 3*stdev of the

 

                        lower triangle of TEMP.REF

 

                  Create a matrix (TEMP2)of logical values (as 0,1) that identify where the values

 

                        in OUTLIER occur.

 

                  Calculate row sums for TEMP2 to identify which taxa are outliers

 

                  Extract names of outliers from TEMP2 for those rows in which the row sum > 3Q of TEMP2

 

            ENDIF

 

      NEXT

 

      Identify outliers in REVISED.GAMMA.TAX by setting the MPI species and family to "placement error"

 

 

 

STEP 9. - Resolution of placement errors

 

      Create ID.MATRIX for sequences identified as placement errors

 

      Assign ID.MATRIX to TEMP

 

      Repeat STEP 6

 

 

 

STEP 10. - Add back mislabeled strains

 

      Re-create REVISED.TAX.TABLE from REVISED.TAX.TABLE (placement errors excluded) and NEAREST.NEIGHBORS

 

            NOTE - data frame built in row-wise fashion

 

      Recreate GENUS.NAMES and TAXON.LEVEL as in STEP-2

 

      Reorder DIST.MAT by REVISED.TAX.TABLE

 

      Repeat STEP 2 (see gl_sup_fig8.png)

 

 

 

STEP 11. - Calculation of genus medioids

 

      Restructure DIST.MAT along both dimensions by TAXON.SEQ.NEW in STEP 10

 

      Recreate vector of GENUS.NAMES

 

      Create TAXON matrix (GENUS NAME x unique(GENUS.NAME))

 

      Create TAXON.NAMES (=unique(GENUS.NAME)

 

      Create TAXON.MEDIOIDS matrix

 

      FOR each TAXON.NAME

 

            Extract taxon-specific submatrix from DIST.MAT

 

            Estimate column means of submatrix

 

            Store vector of column means in TAXON.MEDIODS matrix

 

      NEXT

 

      Assign taxon names to rows and columns

 

      Transpose TAXON.MEDIOIDS

 

      Create TAXON.MEDIOIDS2 (square matrix)

 

      FOR each taxon name

 

            Extract taxon-specific submatrix from DIST.MAT

 

            IF only one vector occurs

 

                  store vector in TAXON.MEDIODS2 matrix

 

            ENDIF

 

            IF more than one vector occurs

 

                  Estimate column means of submatrix

 

                  store vector of column means in TAXON.MEDIODS2 matrix

 

            ENDIF

 

      NEXT

 

      NOTE - Restructure medioids by decremental column sort

 

      Assign TAXON.MEDOIDS2 to X

 

      FOR (each row in X)-1

 

            Sort matrix column-wise in ascending order

 

            Assign name of first sequence to TAXON.NAME

 

            Reassign X to X[subtract first row,subtract first column]

 

      NEXT

 

      Identify the last sequence by comparing names of TAXON.MEDOIDS2 with TAXON.NAME and add to

 

            TAXON.NAME

 

      Reorder TAXON.MEDOIDS2 by TAXON.NAME to yield a smoothed matrix of medioids.

 

    

 

STEP 12. - Restructure the dataset according to the plotting sequence of medioids(see gl_sup_fig9.png)

 

 

 

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

 

SOSCC script and functions

 

 

 

#bioinformatics.demo<-function(heatmaps=FALSE)

 

heatmaps <- T

 

graphsheet()

 

      #step1

 

      gamma.dis<-read.eda("gamma.txt")

 

      gamma.tax<-read.taxonomy(sequence.taxonomy, gamma.dis)

 

      gamma.dis<-gamma.dis[gamma.tax[,3] > 1809 & gamma.tax[,3] < 3030,

 

            gamma.tax[,3] > 1809 & gamma.tax[,3] < 3030]

 

      #This step is included to eleminate the RDP sequence of Marinobacter(Pseudomonas)

 

      #stanier, which exhibits some anomalies and significantly effects the final output

 

      #scale.

 

      gamma.dis<-gamma.dis[-grep("stanier*", as.character(dimnames(gamma.dis)[[1]])),

 

            -grep("stanier*", as.character(dimnames(gamma.dis)[[2]]))]

 

      gamma.dis.bak<-gamma.dis

 

      gamma.tax<-gamma.tax[gamma.tax[,3] > 1809 & gamma.tax[,3] < 3030, ]

 

      gamma.tax<-gamma.tax[-grep("stanier*", as.character(dimnames(gamma.tax)[[1]])),]

 

      gamma.tax.bak<-gamma.tax

 

      gamma.genus.all<-unique(outline1.3[outline1.3[,5]=="Gammaproteobacteria"

 

            & outline1.3[,9]=="not.basonym" ,8])

 

      if(heatmaps)

 

            {

 

                  graphsheet()

 

                  heatmap(gamma.dis)

 

                  title(main="Unordered Gammaproteobacteria EDA matrix", cex=0.75)             

 

            }

 

      gamma.tax<-gamma.tax[order(gamma.tax[,3]),]

 

      gamma.dis<-gamma.dis[dimnames(gamma.tax)[[1]], dimnames(gamma.tax)[[1]]]

 

      original.gamma.tax<-gamma.tax

 

      original.gamma.dis<-gamma.dis

 

    

 

      #step2

 

      pass1<-step2(gamma.tax, gamma.dis)

 

      misidentified<-pass1$misidentifiedInStep2

 

      taxon.seq.new<-pass1$taxon.seq.new

 

      pass1<-pass1[1:(length(pass1)-1)]

 

      if(heatmaps)

 

      {

 

            graphsheet()

 

            heatmap(gamma.dis[unlist(pass1), unlist(pass1)])

 

            title(main="First pass ordering, Gammaproteobacteria EDA matrix", cex=0.75)

 

            label.heatmap(build.heatmap.label.table(gamma.tax,2,

 

                  unique(gamma.tax[,2]),unlist(pass1)),0.45)

 

 

 

      }

 

 

 

      #step3

 

      gamma.dis<-gamma.dis[unlist(pass1), unlist(pass1)]

 

      taxon.seq.new2<-setdiff(unlist(pass1),misidentified)                  

 

      pass2<-unlist(taxon.seq.new2)

 

      if(heatmaps)

 

      {

 

            graphsheet()

 

            heatmap(gamma.dis[pass2, pass2])

 

            title(main="First pass reorganization of Gammaprotebacteria EDA matrix

 

                  misidentified strains removed." , cex=0.75)

 

            label.heatmap(build.heatmap.label.table(gamma.tax[pass2,],2,

 

                  unique(gamma.tax[pass2,2]),unlist(pass2)),0.45)

 

 

 

      }

 

 

 

      #step4

 

      gamma.tax<-gamma.tax[unlist(pass2),]

 

      gamma.dis<-gamma.dis[unlist(pass2), unlist(pass2)]

 

      pass3<-step2(gamma.tax, gamma.dis)

 

      misidentified<-c(misidentified,pass3$misidentified)

 

      pass3<-pass3[1:(length(pass3)-1)]

 

      pass3<-setdiff(unlist(pass3),misidentified)

 

      gamma.dis<-gamma.dis.bak

 

      gamma.tax<-gamma.tax.bak

 

      if(heatmaps)

 

      {

 

            graphsheet()

 

            heatmap(gamma.dis[unlist(pass3), unlist(pass3)])

 

            title(main="Second pass reorganization of Gammaprotebacteria EDA matrix

 

                  misidentified strains removed.", cex=0.75)

 

            label.heatmap(build.heatmap.label.table(gamma.tax[pass3,],2,

 

                  unique(gamma.tax[pass3,2]),unlist(pass3)),0.45)

 

      }

 

      correct.placements<-unlist(pass3)

 

      if(heatmaps)

 

      {

 

            graphsheet()

 

            heatmap(gamma.dis[misidentified, correct.placements])

 

            title(main="Misidentified vs correctly identified strains",cex=0.75)

 

      }

 

 

 

      #step5

 

      excluded.strains<-setdiff(dimnames(gamma.tax)[[1]],

 

            c(misidentified,as.vector(correct.placements)))

 

      pass4<-unique(c(correct.placements, unique(c(misidentified, excluded.strains))))

 

      if(heatmaps)

 

      {

 

            graphsheet()

 

            heatmap(gamma.dis[unlist(pass4), unlist(pass4)])

 

            title(main="Misidentified and excluded strains added back to matrix" ,cex=0.75)

 

            label.heatmap(build.heatmap.label.table(gamma.tax[pass3,],2,

 

                  unique(gamma.tax[pass3,2]),unlist(pass3)),0.45)

 

 

 

      }

 

    

 

    

 

      #step6

 

      revised.gamma.tax<-step6(gamma.dis[c(misidentified, excluded.strains), correct.placements], gamma.tax, correct.placements)

 

          

 

      known.errors<-revised.gamma.tax$known.errors[revised.gamma.tax$known.errors!=""]

 

      nearest.neighbors<-revised.gamma.tax$nearest.neighbors

 

      gamma.genus<-revised.gamma.tax$all.genus

 

      revised.gamma.tax<-revised.gamma.tax$revised.tax.table

 

      revised.gamma.tax[,3]<-as.character(revised.gamma.tax[,3])

 

      revised.gamma.tax[known.errors,3]<-"known.error"

 

      revised.gamma.tax[,3]<-as.factor(revised.gamma.tax[,3])

 

 

 

      #step 7

 

 

 

      pass5<-unlist(step7(revised.gamma.tax, gamma.dis, gamma.genus))

 

      pass5<-setdiff(unlist(pass5), c(misidentified, known.errors))

 

      misidentified<-unique(c(misidentified,dimnames(gamma.dis)[[1]][abs(apply(scale(gamma.dis),

 

             1, max))>2.5]))

 

      pass5<-setdiff(unlist(pass5), c(misidentified, known.errors))

 

      if(heatmaps)

 

      {

 

            graphsheet()

 

            heatmap(gamma.dis[unlist(pass5), unlist(pass5)])

 

            title(main="Third pass reorganization of Gammaprotebacteria EDA matrix

 

                  misidentified and excluded strains repositioned.", cex=0.75)

 

            label.heatmap(build.heatmap.label.table(revised.gamma.tax[pass5,],2,

 

                  unique(revised.gamma.tax[pass5,2]),unlist(pass5)),0.45)

 

          

 

      }

 

 

 

      #step8

 

      outlier<-0

 

      place.error<-as.vector(unlist(step8(revised.gamma.tax, gamma.dis, gamma.genus, taxon.seq.new))) # modified

 

      revised.gamma.tax[,3]<-as.character(revised.gamma.tax[,3])

 

      revised.gamma.tax[place.error,3]<-"placement error"

 

      revised.gamma.tax[,3]<-as.factor(revised.gamma.tax[,3])

 

 

 

      #Step9

 

 

 

      id.matrix<-gamma.dis[revised.gamma.tax[,3]=="placement error",

 

            revised.gamma.tax[,3]!="placement error"]

 

      nearest.neighbors<-step9(id.matrix, gamma.tax)

 

 

 

 

 

      #Step10

 

      revised.gamma.tax<-rbind(revised.gamma.tax[revised.gamma.tax[,3]!="placement error",],

 

            revised.gamma.tax[revised.gamma.tax[,3]=="placement error",])

 

      gamma.dis<-gamma.dis.bak

 

      gamma.tax<-gamma.tax.bak

 

      gamma.genus<-substring(as.character(revised.gamma.tax[,4]), 1,

 

            regexpr("[^a-zA-Z]",  as.character(revised.gamma.tax[,4]))-1)

 

      pass6<-step7(revised.gamma.tax, gamma.dis, gamma.genus)

 

      seq2remove<-sort(unique(step10(revised.gamma.tax, gamma.dis)))[-1]                      

 

      pass6<-setdiff(unlist(pass6), seq2remove)

 

      if(heatmaps)

 

      {

 

            graphsheet()

 

            heatmap(gamma.dis[pass6,pass6])

 

            title(main="Final rearragement of Gammaproteobacteria, \nBergey's Family Order",cex=0.75)

 

            label.heatmap(build.heatmap.label.table(revised.gamma.tax[pass6,],5,

 

                  unique(revised.gamma.tax[pass6,5]),pass6),0.45)

 

      }

 

                

 

      #Step 11

 

      revised.gamma.tax<-revised.gamma.tax[pass6, ]

 

      gamma.dis<-gamma.dis[pass6,pass6]

 

      medioids<-step11(revised.gamma.tax, gamma.dis)

 

      genus.order<-medioids$genus.order

 

 

 

      #Step12

 

      pass10<-step12(revised.gamma.tax, genus.order)

 

      family.order<-pass10$family.order

 

      pass10<-pass10$x

 

      revised.gamma.tax<-revised.gamma.tax[pass10,]

 

      graphsheet()

 

      heatmap(gamma.dis[pass10,pass10])

 

      title(main="Optimized rearragement of Gammaproteobacteria, \nbased on genus mediods",

 

                  cex=0.75)

 

      label.heatmap(build.heatmap.label.table(revised.gamma.tax[pass10,],5,

 

                  family.order,pass10),0.45)

 

 

 

 

 

 

 

 

 

+++++++++++++++++++++++++++++++++++++++++++FUNCTIONS+++++++++++++++++++++++++++++++++++++++++++++++++

 

 

 

Heatmap <- function(temp)

 

{

 

   image(list(x = 1:dim(temp)[1], y = 1:dim(temp)[2], z = as.matrix((temp))), axes = FALSE)

 

   image.legend(as.matrix((temp)), x = nrow(temp) * 1.075, y = ncol(temp) * 1.05, size = c(0.125, 6.1), hor = F, cex = 0.66, tck = -0.01, mgp = c(0, 0.5, 0))

 

   invisible()

 

}

 

 

 

 

 

build.heatmap.label.table <- function(taxonomy.table, label.position = 2, label.source, taxonomy.order)

 

{

 

   segment.table <- as.data.frame(matrix(0, length(label.source), 6))

 

   dimnames(segment.table)[[1]] <- label.source

 

   dimnames(segment.table)[[2]] <- c("x1", "y1", "x2", "y2", "x3", "y3")

 

   j <- 1

 

   segment.table[1, c(1, 3)] <- 0.5

 

   for(i in 1:nrow(taxonomy.table)) {

 

       if(taxonomy.table[i, label.position] == label.source[j]) {

 

          next

 

       }

 

       if(taxonomy.table[i, label.position] != label.source[j]) {

 

          j <- j + 1

 

          segment.table[j, c(1, 3)] <- i - 0.5

 

       }

 

   }

 

   segment.table[, 2] <- -1

 

   segment.table[, 4] <- nrow(taxonomy.table[taxonomy.order,  ]) * -0.01

 

   segment.table[, 5] <- nrow(taxonomy.table[taxonomy.order,  ]) * -0.0125

 

   segment.table[1:(nrow(segment.table) - 1), 6] <- segment.table[1:(nrow(segment.table) - 1), 1] + (segment.table[2:nrow(segment.table), 1] - segment.table[1:(nrow(segment.table) - 1), 1])/2

 

   segment.table[nrow(segment.table), 6] <- ((nrow(taxonomy.table[taxonomy.order,  ]) - segment.table[nrow(segment.table), 1]))/2 + segment.table[nrow(segment.table), 1]

 

   return(segment.table)

 

}

 

 

 

 

 

label.heatmap <- function(segment.table, size = 1)

 

{

 

   for(i in 1:nrow(segment.table)) {

 

       segments(segment.table[i, 1], segment.table[i, 2], segment.table[i, 3], segment.table[i, 4])

 

       text(segment.table[i, 6], segment.table[i, 5], dimnames(segment.table)[[1]][i], crt = 90, cex = size, adj = 1)

 

       segments(segment.table[i, 2], segment.table[i, 1], segment.table[i, 4] * 0.66, segment.table[i, 3])

 

       text(segment.table[i, 5] * 0.66, segment.table[i, 6], dimnames(segment.table)[[1]][i], cex = size, adj = 1)

 

   }

 

}

 

 

 

 

 

read.eda <- function(file = "edafile")

 

{

 

   return(eda.dis <- read.table(file, header = T, row.names = 1, sep = "\t"))

 

}

 

 

 

 

 

 

 

read.taxonomy <- function(taxon.table, eda.dis, species = 1, family = 6, taxon.seq = 9)

 

{

 

   return(taxon.table[dimnames(eda.dis)[[1]], c(species, family, taxon.seq)])

 

}