Medical Bioinformatics in Cytomics

Data analysis in multiparameter flow cytometry
Data pattern classification by data sieving (CLASSIF1 algorithm)

G.Valet

1. Background

- Multiparameter flow cytometry data analysis typically concerns the determination of cell frequency within one or several multidimensional gates. An essential part of the potentially useful information, such as fluorescence intensities, average fluorescence surface densities, intercolour fluorescence ratios, coefficients of variation of the fluorescence, light scatter intensity or scatter and fluorescence ratio distributions of the various cell populations, remains unconsidered in this way. The evaluation of the many data columns deriving from such exhaustive analysis of flow cytometric data requires an easily scalable data classification method.

2. Potential of Data Pattern Classification

fig.1 Principle of data sieving

The CLASSIF1 algorithm (1993) performs an unsupervised, self-learning parallel classification of an essentially unlimited number of data columns. The columns are combined, for example, in packages of 50. The most discriminatory columns (fig.1) for up to 10 classification categories are iteratively determined for each package by percentile analysis (data sieving). The five most informative columns are inserted into new packages, followed by reclassification until 50 or fewer data columns remain for data pattern classification, providing disease classification masks with typically the 5 to 15 most discriminatory data columns (parameters) (fig.12, rightmost table column). Plotting the cumulative value distributions of the selected mask parameters for healthy individuals and infarction risk patients in a ROC (receiver operating characteristic) like way shows their classification potential (fig.1a), although the individual curves do not reach the overall potential of data pattern classification (fig.6A/6B).
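The package-wise sieving described above can be sketched in Python as follows. This is a minimal illustration and not the CLASSIF1 implementation: the function name sieve_columns, its parameters and the toy score table are hypothetical, and the percentile-based measure of discrimination is replaced here by an arbitrary score supplied by the caller.

```python
from typing import Callable, List, Sequence


def sieve_columns(
    columns: Sequence[str],
    score: Callable[[str], float],
    package_size: int = 50,
    keep_per_package: int = 5,
) -> List[str]:
    """Split columns into packages, keep the best few of each, and repeat
    until no more than `package_size` columns remain."""
    remaining = list(columns)
    while len(remaining) > package_size:
        survivors: List[str] = []
        for start in range(0, len(remaining), package_size):
            package = remaining[start:start + package_size]
            # Rank the package by the caller-supplied (hypothetical) score.
            ranked = sorted(package, key=score, reverse=True)
            survivors.extend(ranked[:keep_per_package])
        remaining = survivors
    return remaining


# Toy usage: 1000 hypothetical columns whose score is simply their index.
toy_scores = {f"c{i}": float(i) for i in range(1000)}
selected = sieve_columns(list(toy_scores), toy_scores.__getitem__)
print(len(selected), selected)
```

With these toy values, the 1000 columns shrink to 100 after the first pass and to 10 after the second, at which point fewer than 50 columns remain and the loop stops.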

fig.1a ROC curves for the best five discriminators

  CLASSIF1 classification is:
  - unsupervised, non-hierarchical, non-correlating
  - without statistics, mathematical assumptions or neural networks
  - fast, because only compare operations are used for classification
  - suitable for parallel computing
  - without parameter weighting
  - robust against outliers or missing values
  - interlaboratory standardizable

fig.12 data pattern of most discriminatory classification parameters

- The potential of data pattern classification in cytomics (system cytometry (1997), cell systems research) (fig.12) concerns the exhaustive knowledge extraction from flow cytometric (single cell molecular bioprofiles) or other multiparameter data by the determination of the most discriminatory data patterns for diagnosis or individualized prediction of disease progression (disease course prediction, outcome prediction). It permits the development of standardized, instrument and laboratory independent data pattern classifiers from flow cytometric list mode, flow bead array, high content image analysis, cDNA (Lymphochip, Affymetrix) or protein expression chip arrays, clinical chemistry, biomedical or clinical data for predictive medicine by cytomics (references: bioinformatics, medical and clinical cytomics, further references)

3. Classification Process

Numeric data values are first transformed into triple matrix characters (fig.2) in the following way:
- Lower and upper percentiles like 10% and 90% (fig.3) are calculated for each data column of reference patients.
- Data column values are then transformed (fig.3) by assigning: (- =diminished) to values below the lower percentile, (0 =unchanged) to values between the lower and upper percentiles and (+ =increased) to values above the upper percentile.
- Disease classification masks ("disease signatures") (fig.3a) for each classification category are determined from the most frequent triple matrix character in each data column of the learning set. Individual patients are classified according to the highest positional coincidence between the patient classification mask ("patient signature") and any one of the disease classification masks.
- A classification (confusion) matrix is established between the known predictive or diagnostic clinical classification of reference and abnormal patient samples and the same classification categories as assigned by the CLASSIF1 triple matrix classification. An ideal classification result is characterized by 100% specificity and sensitivity values in the diagonal boxes, as well as 100% negative and positive predictive values, while the values in all other boxes are 0% (fig.4).
- The data columns, as in the case of myocardial infarction risk assessment, are obtained by flow cytometric determination of activation antigens on peripheral blood thrombocytes. Four databases, each containing eleven parameters (fig.5), were calculated from the thrombocyte evaluation windows (gates) in two parameter histograms. They were obtained by projection of flow cytometric four parameter list mode data either onto the forward (FSC)/sideward (SSC) light scatter plane or onto the FSC/fluorescence_1 plane to calculate the expression of the thrombocyte surface antigens CD62, CD63 and thrombospondin as well as of spontaneously attached IgG in normal individuals or angiographically identified myocardial risk patients.
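The transformation and mask steps listed above can be illustrated with a short Python sketch. It is an assumption-laden example, not the original software: all function names, the linear-interpolation percentile, the tie-breaking of the most frequent character and the toy values are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def percentile(values: Sequence[float], pct: float) -> float:
    """Linear-interpolation percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def thresholds(reference: List[List[float]], lower: float = 10.0,
               upper: float = 90.0) -> List[Tuple[float, float]]:
    """Per-column (lower, upper) percentile limits of the reference records."""
    return [(percentile(col, lower), percentile(col, upper))
            for col in zip(*reference)]


def to_triple(record: Sequence[float], limits: List[Tuple[float, float]]) -> str:
    """Map each value to '-', '0' or '+' relative to the reference limits."""
    return "".join("-" if v < lo else "+" if v > hi else "0"
                   for v, (lo, hi) in zip(record, limits))


def disease_mask(signatures: List[str]) -> str:
    """Most frequent triple matrix character in each column position."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*signatures))


def classify(signature: str, masks: Dict[str, str]) -> List[str]:
    """Categories with the highest positional coincidence (may be several)."""
    hits = {cat: sum(a == b for a, b in zip(signature, mask))
            for cat, mask in masks.items()}
    best = max(hits.values())
    return [cat for cat, n in hits.items() if n == best]


# Toy learning set with two data columns (values are invented).
normal = [[10 + i, 100 + i] for i in range(11)]
risk = [[25, 105], [30, 104], [22, 120]]

limits = thresholds(normal)
masks = {
    "N": disease_mask([to_triple(r, limits) for r in normal]),
    "R": disease_mask([to_triple(r, limits) for r in risk]),
}
print(limits, masks)
print(classify(to_triple([27, 103], limits), masks))
print(classify(to_triple([5, 120], limits), masks))
```

The last call shows a record whose signature coincides equally well with both masks, which corresponds to a multiple classification.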

- Correct classification for all patient samples of the learning set as well as for the unknown test set (validation set) of patients is achieved (fig.6A,6B), indicating that the CLASSIF1 algorithm has identified a suitable discriminative data pattern of thrombocyte parameters (fig.7).

fig.6A/6B myocardial risk assessment

The selected parameters are statistically significantly different between normal individuals and myocardial risk patients (fig.8).

- The unknown test set of patients was defined prior to learning as data records 1,5,10,15... etc of the normal individuals as well as of the myocardial risk patients and remained hidden to the learning process. As an embedded test set, it was measured under the same conditions as the samples of the learning set.
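The classification (confusion) matrix of this chapter and the values derived from it can be sketched for the two-category case as follows. The helper names, the category labels "N" and "R" and the toy classifications are hypothetical; the sketch only shows how specificity, sensitivity and the predictive values follow from the counts in the matrix boxes.

```python
from typing import Dict, List, Tuple


def classification_matrix(known: List[str], assigned: List[str],
                          categories: List[str]) -> Dict[Tuple[str, str], int]:
    """Count records per (known category, assigned category) box."""
    counts = {(k, a): 0 for k in categories for a in categories}
    for k, a in zip(known, assigned):
        counts[(k, a)] += 1
    return counts


def percent(part: int, whole: int) -> float:
    """Percentage, or NaN when the denominator is empty."""
    return 100.0 * part / whole if whole else float("nan")


def two_class_rates(counts: Dict[Tuple[str, str], int],
                    negative: str = "N", positive: str = "R") -> Dict[str, float]:
    tn = counts[(negative, negative)]
    fp = counts[(negative, positive)]
    fn = counts[(positive, negative)]
    tp = counts[(positive, positive)]
    return {
        "specificity": percent(tn, tn + fp),
        "sensitivity": percent(tp, tp + fn),
        "negative_predictive_value": percent(tn, tn + fn),
        "positive_predictive_value": percent(tp, tp + fp),
    }


# Toy example: four normal (N) and four risk (R) records, invented results.
known = ["N"] * 4 + ["R"] * 4
assigned = ["N", "N", "N", "R", "R", "R", "R", "N"]
rates = two_class_rates(classification_matrix(known, assigned, ["N", "R"]))
ideal = two_class_rates(classification_matrix(known, known, ["N", "R"]))
print(rates)
print(ideal)
```

In the ideal case all four values are 100%, which corresponds to a matrix whose off-diagonal boxes are empty.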

4. Details of the Classification Process

- The principle of data pattern classification (fig.9) assures classification accuracy as well as classification multiplicity to take care of the many combinatorial possibilities between genotype and internal or external exposure influences on observed molecular cell phenotypes.
- Data patterns represent heat maps consisting of (-)=decreased, (0)=unchanged and (+)=increased characters instead of a green, yellow or red color code.
- The optimization of the initial classification, containing all data columns (fig.10A) is achieved by maximizing the sum of values in the diagonal boxes of the classification matrix. The classification process reduces the number of classification parameters in this example from initially 44 to finally 5 data columns by the exclusion of non informative parameters (fig.10B).
- Initially, the most frequent triple matrix character of each database column is inserted into the disease classification mask of each classification category (fig.11A). Data columns without discrimination between classification categories are removed (fig.11B).
- Data records (patient classification masks) are classified according to their highest positional coincidence with any one of the disease classification masks.
- Multiple classifications occur at equal positional coincidence with more than one of the disease classification masks (for example record #17 (N,R) fig.11A). They represent transitional states in disease or classification errors for example in small learning sets, or at comparatively small differences between the different classification categories. They also occur, as just shown, in the first triple matrix that is prior to the removal of non-informative data columns.
- The iterative optimization is achieved by the temporary removal of a single database column, or of variable combinations of two database columns, from the classification process, followed by reclassification. It is recorded whether the temporary absence of the column(s) has improved or deteriorated the classification result.
- The columns are then reinserted and the next data columns are temporarily removed until the positive or negative contribution of all data columns to the classification process is known.
- Data columns having improved the classification result by their absence are removed from further consideration. The remaining data columns represent the optimized disease classification masks.
- The data records of the learning set are reclassified against the disease classification masks to assess the achieved discrimination for the various classification categories (fig.12).
- The data records of the unknown test (validation) set are subsequently classified to verify the robustness of classification for unknown data records (fig.13).
- The classification operations described in chapters 3 and 4 are performed unsupervised, that is automatically, by the CLASSIF1 algorithm. No human interference is required once the classification process has been started.
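The optimization steps of this chapter can be sketched in Python as shown below. This is a simplified illustration under stated assumptions, not the CLASSIF1 code: only single columns are removed temporarily (pairs of columns are omitted), a multiple classification is counted as not correct, and all names and the toy signatures are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Record = Tuple[str, str]  # (known category, triple matrix signature)


def build_masks(records: List[Record], columns: List[int]) -> Dict[str, str]:
    """Most frequent character per selected column for each category."""
    by_cat: Dict[str, List[str]] = {}
    for cat, sig in records:
        by_cat.setdefault(cat, []).append(sig)
    return {cat: "".join(Counter(s[c] for s in sigs).most_common(1)[0][0]
                         for c in columns)
            for cat, sigs in by_cat.items()}


def drop_non_discriminating(records: List[Record], columns: List[int]) -> List[int]:
    """Remove columns whose mask character is identical in all categories."""
    masks = build_masks(records, columns)
    return [col for pos, col in enumerate(columns)
            if len({mask[pos] for mask in masks.values()}) > 1]


def diagonal_score(records: List[Record], columns: List[int]) -> float:
    """Sum of the per-category percentages of correctly classified records."""
    if not columns:
        return 0.0
    masks = build_masks(records, columns)
    correct: Counter = Counter()
    total: Counter = Counter()
    for cat, sig in records:
        reduced = [sig[c] for c in columns]
        hits = {m: sum(a == b for a, b in zip(reduced, mask))
                for m, mask in masks.items()}
        best = max(hits.values())
        winners = [m for m, n in hits.items() if n == best]
        total[cat] += 1
        if winners == [cat]:  # assumption: ties do not count as correct
            correct[cat] += 1
    return sum(100.0 * correct[cat] / total[cat] for cat in total)


def optimize_columns(records: List[Record], columns: List[int]) -> List[int]:
    """Temporarily remove each column; drop those whose absence helps."""
    baseline = diagonal_score(records, columns)
    harmful = [col for col in columns
               if diagonal_score(records, [c for c in columns if c != col]) > baseline]
    return [c for c in columns if c not in harmful]


# Toy learning set: column 0 discriminates, column 1 is noisy, column 2 is constant.
records = [("N", "0-0"), ("N", "0-0"), ("N", "0-0"), ("N", "0+0"),
           ("R", "++0"), ("R", "++0"), ("R", "++0"), ("R", "+-0")]
informative = drop_non_discriminating(records, [0, 1, 2])
optimized = optimize_columns(records, informative)
print(informative, optimized, diagonal_score(records, optimized))
```

In this toy case the diagonal sum rises from 150 with all three columns to 200 once only column 0 is retained.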

5. Robustness of Classification against Random Aberrations

- The classification of data records, unknown to a learned classifier, is important to avoid erroneous classifications of random statistical aberrations.
- To assess the susceptibility of the CLASSIF1 algorithm for the detection of random statistical aberrations, a 133 column wide data set of 40 data records was generated using a random number generator and kindly made available by Dr.W.Meyer and PD Dr.G.Haroske (Pathologisches Institut, Universität Dresden). The data columns had different means and coefficients of variation between 0.97-25.7% (CV=100*standard deviation/mean). For the classification, each second data record was assigned in sequence to either the arbitrary category#1 or category#2, resulting in 20 category#1 and 20 category#2 records.
- Records 1,5,10,15,20 of each category were assigned to the unknown test (validation) set prior to the learning phase. They remained hidden during the learning phase. This left a learning set of 15 category#1 and 15 category#2 data records.
- The classification of the learning set by the CLASSIF1 algorithm provided a specificity of 100% for the correct recognition of category#1 records at a sensitivity of 40% for the recognition of category#2 data records (fig.14A). Parameters #29, #77 and #133 (fig.15) were selected, with means ± standard deviation (SD) of 73.2±6.6/72.8±11.2, 53.3±8.0/59.0±11.3 and 25.0±3.7/25.0±5.9, at no statistical difference between the category#1 and category#2 means of the selected data columns.
- The classification of the unknown test set of records resulted in a low specificity of 33.7% for the correct recognition of category#1 records and a low sensitivity of 20.0% for category#2 records (fig.14B).
- The triple matrix display of the learning set (fig.16) and of the unknown test set (fig.17) shows the low quality of the classification of the random number data set.
- The result emphasizes the well known fact that the discrimination of random statistical aberrations in a learning set collapses typically during the classification of unknown test sets.
- This contrasts with the robustness of classification in the case of existing molecular differences, as in the discrimination of risk patients for myocardial infarction by increased thrombocyte activation antigens CD62, CD63 and thrombospondin (fig.5), where learning and test set patients are correctly classified in >95% of the cases and statistically significant differences of the selected parameters exist between both patient groups (fig.7).
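The layout of the random number experiment in this chapter can be reproduced in outline with the following Python sketch. The original data set is not available here, so the seed, the Gaussian distribution, the range of the column means and the use of the sample standard deviation are all assumptions; only the dimensions, the CV formula, the alternating category assignment and the test set positions are taken from the text.

```python
import random
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def coefficient_of_variation(values: Sequence[float]) -> float:
    """CV in percent: 100 * standard deviation / mean."""
    return 100.0 * stdev(values) / mean(values)


rng = random.Random(0)  # fixed seed (assumption) for reproducibility
N_RECORDS, N_COLUMNS = 40, 133

# Hypothetical column means; target CVs drawn from the reported 0.97-25.7% range.
column_means = [rng.uniform(20.0, 100.0) for _ in range(N_COLUMNS)]
column_cvs = [rng.uniform(0.97, 25.7) for _ in range(N_COLUMNS)]
data = [[rng.gauss(m, m * cv / 100.0) for m, cv in zip(column_means, column_cvs)]
        for _ in range(N_RECORDS)]

# Every second record goes to category 1 or category 2 in sequence.
categories = [1 if i % 2 == 0 else 2 for i in range(N_RECORDS)]

# Records 1, 5, 10, 15, 20 of each category form the hidden test set.
TEST_POSITIONS = {1, 5, 10, 15, 20}
learning: List[Tuple[int, List[float]]] = []
test: List[Tuple[int, List[float]]] = []
position = {1: 0, 2: 0}
for record, cat in zip(data, categories):
    position[cat] += 1
    (test if position[cat] in TEST_POSITIONS else learning).append((cat, record))

print(len(learning), len(test))
print(round(coefficient_of_variation([row[0] for row in data]), 1))
```

The learning and test sets built in this way could then be passed to any classifier to compare its performance on data that contain no real differences between the categories.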

6. Laboratory Independent Classification

- Triple matrix classifiers are inherently standardized onto the data of patient reference samples during the classification process (standardized multiparameter data classification (SMDC)). Classifiers are therefore comparable in an instrument and laboratory independent way, in case no differences between the reference groups of different institutions are detected by the CLASSIF1 algorithm. This is advantageous for consensus formation e.g. on leukemia, HIV and thrombocyte classifications by immunophenotyping.
- Reference sample data may be obtained in the case of blood cells from healthy individuals of similar age and sex distribution such as from blood donor samples (biological standardization). This overcomes comparability issues between standardized measurements on different flow cytometers in different institutions using labelled indicator molecules of the same specificity but from different sources (instrument and reagent standardization).
- The systematic analysis of patient classification masks provides information on individual genotypic and exposure influences on expressed data patterns. Such analysis may prove useful for the development of a periodic system of cells that is a relational cell classification system in analogy to the periodic system of elements. Using the molecular properties of tissue or hematopoietic stem cells as reference, different cell types and their activity states could be compared in a standardized way for example during disease development, under therapy but also during cell division, cell differentiation or cell migration.
- The performance of triple matrix classifiers depends on intralaboratory precision rather than on interlaboratory accuracy since measurement accuracy cancels out within certain limits through the normalization of the experimental values onto the respective mean values of the reference samples in each database column. Reference groups are typically constituted from age and sex matched individuals.
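The cancellation of measurement accuracy by normalization onto the reference mean can be illustrated with a small Python sketch. It assumes, for illustration only, that two laboratories differ by a purely multiplicative calibration factor; the function name, the factors and the values are hypothetical, and other kinds of deviation such as additive offsets would not cancel in this simple way.

```python
from statistics import mean
from typing import List, Sequence


def normalize_to_reference(values: Sequence[float],
                           reference: Sequence[float]) -> List[float]:
    """Express values relative to the mean of the reference samples."""
    ref_mean = mean(reference)
    return [v / ref_mean for v in values]


# Invented "true" values for one data column.
reference_true = [80.0, 100.0, 120.0]
patients_true = [150.0, 90.0]

# Hypothetical instrument calibration factors of two laboratories.
gain_lab_a, gain_lab_b = 1.0, 2.5

lab_a = normalize_to_reference([gain_lab_a * v for v in patients_true],
                               [gain_lab_a * v for v in reference_true])
lab_b = normalize_to_reference([gain_lab_b * v for v in patients_true],
                               [gain_lab_b * v for v in reference_true])
print(lab_a, lab_b)
```

Both laboratories obtain the same normalized patient values although their raw measurements differ by a factor of 2.5, provided each laboratory measures its reference and patient samples reproducibly.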

7. Comparison of data pattern with multifactorial classification

The earlier developed multifactorial classification software (DIAGNOS1 1987) determines initially the 5 most discriminatory database columns by percentile analysis, followed by the calculation of the 22 multifactors of these columns and finally of the 5 most discriminatory multifactors. Aneuploidy (only in case of tumors) together with the best discriminator of the 5 initial columns in conjunction with the best discriminating 2 multifactors of the altogether 27 columns database are evaluated by the software.

fig.18 DIAGNOS1 multifactorial classification

The multifactorial classification (fig.18) does not reach the results of data pattern classification (fig.6A/6B) although similar parameters are selected. This was also observed for other data sets. The 5 most discriminatory data columns of each of the IgG, CD62, CD63 and thrombospondin databases (4x5=20) are available for data pattern analysis, while multifactorial classification is restricted to the 5 most discriminatory of them plus the 22 derived multifactors, that is altogether 27 data columns. Despite the higher column number, less information is available than in data pattern analysis, where all of the most discriminatory columns (n=20) were included.

8. Examples

- FCS1.0 and 2.0 list mode files or dBase3 database exports from database (e.g. Access) or table calculation programs (e.g. Excel) are classified by the CLASSIF1 algorithm in various clinical situations:

myocardial infarction risk assessment (thrombocyte activation antigens)
overtraining risk in competition cyclists (lymphocyte antigens)
survival prognosis in melanoma patients (clinical data)

9. Conclusions

- The CLASSIF1 algorithm provides access to predictive medicine with >95% correct prediction of disease progression in individual patients as well as to standardized diagnostic classifications.
- The CLASSIF1 approach facilitates the elaboration of interlaboratory consensus classifiers in complex multiparameter data sieving or data mining analysis.
- As a practical consequence, diseases can be classified at institutions where no sufficient learning sets can be generated in reasonable times or where costly investigations are necessary to establish appropriate learning sets.
- Furthermore, the molecular and biochemical properties of many body cell systems during disease can be compared by standardized classification, e.g. blood leukocytes versus tissue or effusion leukocytes.

10. Evolution of Concept (Purdue CD-Series)

CD1 1996(1) ISBN: none	CD2 1996(2) ISSN: 1091-2037	CD3 1997(1) ISBN: 1-890473-02-2	CD4 1997(2) ISBN: 1-890473-03-0
CD5 2000 ISBN: 1-890475-05-7	CD6 2002 ISBN: 0-97117498-3-3	CD7 2003 ISBN: 0-97117498-8-4	CD8 2004 ISBN: 1-890473-C6-5

Internet: https://www.classimed.de
last update: Jan 04.2021
first display: Oct 10,1995