Medical Bioinformatics in Cytomics
- Multiparameter flow cytometry data analysis typically concerns the determination of cell frequency within one or several multidimensional gates. An essential part of the potentially useful information, such as fluorescence intensities, average fluorescence surface densities, intercolour fluorescence ratios, coefficients of variation of the fluorescence, light scatter intensity, or scatter and fluorescence ratio distributions of the various cell populations, remains unconsidered in this way. Evaluating the many data columns that derive from such an exhaustive analysis of flow cytometric data requires an easily scalable data classification method.
The CLASSIF1 algorithm implements a parallel classification
process for an essentially unlimited number of data columns.
For this purpose, the data columns are classified sequentially in
packages of, for example, 50 data columns to identify the most
discriminative columns for various classification categories.
After the first classification round, the most discriminative
columns of each data column package are combined to
new data column packages and reclassified in the same way
until 50 or fewer data columns remain. Their final classification
provides classification masks containing typically the
5 to 15 most discriminatory data columns (parameters) of the
The CLASSIF1 classification is
- non-hierarchical
- fast because only comparison operations are used for classification
- parallel computable
- without parameter weighting
- robust against missing values (no insertion of assumed parameter values)
- interlaboratory standardizable
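The package-wise data sieving described above can be sketched as follows. This is a minimal illustration, not the CLASSIF1 implementation: the function `sieve_columns`, the `select` callback (a stand-in for the classification of one package) and the toy scores are assumptions made for this example only.

```python
from typing import Callable, List, Sequence


def sieve_columns(
    columns: Sequence[str],
    select: Callable[[List[str]], List[str]],
    package_size: int = 50,
) -> List[str]:
    """Reduce a long column list package by package (hypothetical helper)."""
    remaining = list(columns)
    while len(remaining) > package_size:
        survivors: List[str] = []
        # Each package is classified on its own, so the packages are
        # independent of each other and could be processed in parallel.
        for start in range(0, len(remaining), package_size):
            package = remaining[start:start + package_size]
            survivors.extend(select(package))
        if len(survivors) >= len(remaining):
            # Assumption: stop when a round removes nothing, to avoid an endless loop.
            break
        remaining = survivors
    # Final classification of the last package gives the mask columns.
    return select(remaining)


# Toy stand-in for "most discriminative": keep the 10 best-scoring columns.
scores = {f"col{i}": (i * 37) % 1000 for i in range(1000)}


def keep_top(package: List[str], k: int = 10) -> List[str]:
    return sorted(package, key=scores.get, reverse=True)[:k]


mask_columns = sieve_columns(list(scores), keep_top)
print(len(mask_columns))  # number of columns left after sieving
```

In this toy run, 1000 columns shrink to 200, then to 40, and the final round keeps 10. A real selection criterion would replace `keep_top`.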
- The potential of data pattern classification in
(cell systems research) concerns the exhaustive
knowledge extraction from flow cytometric
(single cell molecular bioprofiles) or
other multiparameter data by the determination of the
most discriminative data patterns for individualized disease
course predictions or diagnostics.
- CLASSIF1 algorithmic data sieving (fig.1) and data pattern classification permit the development of standardized, instrument- and laboratory-independent data pattern classifiers for predictive medicine by cytomics or for diagnostic purposes. Suitable inputs are flow cytometric list mode data, flow bead arrays, high content image analysis, cDNA (Lymphochip, Affymetrix) or protein expression chip arrays, clinical chemistry, and biomedical or clinical data (literature references: bioinformatics, medical and clinical cytomics, further references).
- Numeric data values are transformed into
triple matrix characters
to permit subsequent data pattern classification.
- Lower and upper percentiles such as 10% and 90% (fig.3) are calculated for each data column of the reference patients.
- Data column values are transformed (fig.3) by assigning: (- =diminished) to values below the lower percentile, (0 =unchanged) to values between the lower and upper percentiles and (+ =increased) to values above the upper percentile.
- Disease classification masks ("disease signatures") (fig.3a) for each classification category are determined from the most frequent triple matrix character in each data column of the learning set. Individual patients are classified according to the highest positional coincidence between the patient classification mask ("patient signature") and any one of the disease classification masks.
- A classification (confusion) matrix is established between the known predictive or diagnostic clinical classification of reference and abnormal patient samples and the same classification categories as assigned by the CLASSIF1 triple matrix classification. An ideal classification result is characterized by 100% specificity and sensitivity values in the diagonal boxes, as well as 100% negative and positive predictive values, while the values in all other boxes are 0% (fig.4).
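The transformation and classification steps above can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the CLASSIF1 code: all function names and the toy data are hypothetical, missing values are represented as `None` and simply skipped, and ties in the most frequent character are resolved by first occurrence.

```python
from collections import Counter
from statistics import quantiles
from typing import Dict, List, Optional, Tuple

Record = List[Optional[float]]  # None marks a missing value


def percentile_limits(reference: List[Record], lower: int = 10,
                      upper: int = 90) -> List[Tuple[float, float]]:
    """Lower/upper percentile of each column, from reference records only."""
    limits = []
    for column in zip(*reference):
        values = [v for v in column if v is not None]
        cuts = quantiles(values, n=100, method="inclusive")  # cuts[p-1] = p-th percentile
        limits.append((cuts[lower - 1], cuts[upper - 1]))
    return limits


def to_triple(record: Record, limits: List[Tuple[float, float]]) -> str:
    """Translate numeric values into '-', '0', '+' (blank for missing)."""
    chars = []
    for value, (low, high) in zip(record, limits):
        if value is None:
            chars.append(" ")  # assumption: no value is substituted
        elif value < low:
            chars.append("-")
        elif value > high:
            chars.append("+")
        else:
            chars.append("0")
    return "".join(chars)


def disease_mask(signatures: List[str]) -> str:
    """Most frequent triple matrix character in each column of one category."""
    mask = []
    for column in zip(*signatures):
        present = [c for c in column if c != " "]
        mask.append(Counter(present).most_common(1)[0][0] if present else " ")
    return "".join(mask)


def classify(signature: str, masks: Dict[str, str]) -> List[str]:
    """Categories with the highest positional coincidence (ties give several)."""
    hits = {cat: sum(a == b and a != " " for a, b in zip(signature, mask))
            for cat, mask in masks.items()}
    best = max(hits.values())
    return [cat for cat, n in hits.items() if n == best]


# Toy learning set: 10 reference (N) records and 4 risk (R) records, 2 columns.
normal = [[10 + i, 5 + i % 2] for i in range(10)]
risk = [[25, 5], [30, 6], [22, None], [12, 5]]

limits = percentile_limits(normal)
masks = {
    "N": disease_mask([to_triple(r, limits) for r in normal]),
    "R": disease_mask([to_triple(r, limits) for r in risk]),
}
print(masks)
print(classify(to_triple([28, 5], limits), masks))
print(classify(to_triple([None, 5], limits), masks))
```

The last call shows how a record with a missing value can coincide equally well with two masks and so receive a multiple classification.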
- In the case of myocardial infarction risk assessment, the data columns are obtained by flow cytometric determination of activation antigens on peripheral blood thrombocytes. Four databases, each containing eleven parameters (fig.5), were calculated from the thrombocyte evaluation windows (gates) in two-parameter histograms. The histograms were obtained by projecting flow cytometric four-parameter list mode data onto either the forward (FSC)/sideward (SSC) light scatter plane or the FSC/fluorescence_1 plane to calculate the expression of the thrombocyte surface antigens CD62, CD63 and thrombospondin, as well as of spontaneously attached IgG, in normal individuals or angiographically identified myocardial risk patients.
- Correct classification is achieved for all patient samples of the learning set as well as of the unknown test set (validation set) (fig.6A,6B), indicating that the CLASSIF1 algorithm has identified a suitable discriminative data pattern of thrombocyte parameters (fig.7). The selected parameters differ statistically significantly between normal individuals and myocardial risk patients (fig.8).
- The unknown test set of patients was defined prior to learning as data records 1, 5, 10, 15, etc. of the normal individuals as well as of the myocardial risk patients, and remained hidden from the learning process. As an embedded test set, it was measured under very similar conditions to the samples of the learning set.
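The selection of such an embedded test set can be sketched as below. The helper names and the 20-record category are hypothetical and serve only to illustrate the 1, 5, 10, 15, ... pattern.

```python
from typing import List, Sequence, Tuple


def embedded_test_positions(n_records: int) -> List[int]:
    """1-based record positions 1, 5, 10, 15, ... within one category."""
    return [p for p in [1, *range(5, n_records + 1, 5)] if p <= n_records]


def split_category(records: Sequence) -> Tuple[list, list]:
    """Split one category into (learning set, hidden test set)."""
    test_pos = set(embedded_test_positions(len(records)))
    learning = [r for i, r in enumerate(records, start=1) if i not in test_pos]
    test = [r for i, r in enumerate(records, start=1) if i in test_pos]
    return learning, test


# Hypothetical category with 20 records, labelled by their record number.
learning_set, test_set = split_category(list(range(1, 21)))
print(test_set)
print(len(learning_set))
```

The split is applied to each category separately, and the test records are set aside before any learning takes place.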
- The principle of data pattern classification
assures classification accuracy as well
as classification multiplicity to take care of the many
combinatorial possibilities between genotype and
internal or external exposure influences on observed
molecular cell phenotypes.
- Data patterns represent heat maps consisting of (-)=decreased, (0)=unchanged and (+)=increased characters instead of a green, yellow or red color code.
- The optimization of the initial classification, which contains all data columns (fig.10A), is achieved by maximizing the sum of values in the diagonal boxes of the classification matrix (fig.10B). In this example, the classification process reduces the number of classification parameters from initially 44 to finally 5 data columns by excluding non-informative parameters.
- Initially, the most frequent triple matrix character of each database column is inserted into the disease classification mask of each classification category. Data columns without discrimination between classification categories are removed. The disease classification masks are then iteratively optimized (fig.11).
- Data records (patient classification masks) are classified according to their highest positional coincidence with any one of the disease classification masks.
- Multiple classifications occur at equal positional coincidence with more than one of the disease classification masks (for example record#17 (N,R) of the first classification mask, fig.10). They may represent transitional states in disease, or classification errors, for example in the case of small learning sets or of comparatively small differences in the selected parameters between categories. They may also occur in the first triple matrix prior to the iterative removal of non-informative parameters.
- The iterative optimization is achieved by temporarily removing a single database column, or variable combinations of two database columns, from the classification process, followed by reclassification. It is recorded whether the temporary absence of the column(s) has improved or worsened the classification result.
- The columns are then reinserted and the next data columns are temporarily removed until the positive or negative contribution of all data columns to the classification process is known.
- Data columns having improved the classification result by their absence are removed from further consideration. The remaining data columns represent the optimized disease classification masks.
- The data records of the learning set are reclassified against the disease classification masks to assess the achieved discrimination for the various classification categories (fig.12).
- The data records of the unknown test (validation) set are subsequently classified to verify the robustness of classification for unknown data records (fig.13).
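One pass of the temporary column removal described above can be sketched as follows. This is a simplified illustration under assumptions: only single columns are removed (column pairs are omitted for brevity), the diagonal sum is taken as the count of records assigned to exactly their known category, and multiple classifications are counted as incorrect. The function names and the toy learning set are hypothetical.

```python
from typing import Dict, List, Tuple

Learning = List[Tuple[str, str]]  # (known category, triple matrix signature)


def diagonal_sum(columns: List[int], learning: Learning,
                 masks: Dict[str, str]) -> int:
    """Number of records whose single best-matching mask is their known category."""
    correct = 0
    for category, signature in learning:
        hits = {cat: sum(signature[c] == mask[c] for c in columns)
                for cat, mask in masks.items()}
        best = max(hits.values())
        winners = [cat for cat, n in hits.items() if n == best]
        if winners == [category]:
            correct += 1
    return correct


def optimize_once(columns: List[int], learning: Learning,
                  masks: Dict[str, str]) -> List[int]:
    """Drop every column whose temporary absence improved the result."""
    baseline = diagonal_sum(columns, learning, masks)
    improved_by_absence = []
    for col in columns:
        trial = [c for c in columns if c != col]  # temporary removal
        if trial and diagonal_sum(trial, learning, masks) > baseline:
            improved_by_absence.append(col)
        # The column is reinserted implicitly: `columns` is never modified.
    return [c for c in columns if c not in improved_by_absence]


# Toy data: column 0 separates N from R, column 1 is constant, column 2 is noisy.
masks = {"N": "00+", "R": "+0-"}
learning = [("N", "00+"), ("N", "00-"), ("N", "00-"), ("N", "00+"), ("N", "00+"),
            ("R", "+0-"), ("R", "+0+"), ("R", "+0+"), ("R", "+0-"), ("R", "+0-")]

kept = optimize_once([0, 1, 2], learning, masks)
print(kept, diagonal_sum(kept, learning, masks))
```

In this toy case only the noisy column 2 is dropped. Column 1 stays because its absence does not improve the result, even though it does not discriminate between the categories.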
- The classification operations described in chapters 3 and 4 are performed unsupervised, that is, automatically by the CLASSIF1 algorithm, and do not require human intervention once the classification process has been started.
- The classification of data records, unknown to a learned
classifier, is important to avoid erroneous classifications
of random statistical aberrations.
- To assess the susceptibility of the CLASSIF1 algorithm to random statistical aberrations, a data set of 40 data records with 133 columns was generated using a random number generator and kindly made available by Dr.W.Meyer and PD Dr.G.Haroske (Pathologisches Institut, Universität Dresden). The data columns had different means and coefficients of variation between 0.97% and 25.7% (CV=100*standard deviation/mean). For the classification, the data records were assigned alternately to the arbitrary category#1 or category#2, resulting in 20 category#1 and 20 category#2 records.
- Records 1, 5, 10, 15 and 20 of each category were assigned to the unknown test (validation) set prior to the learning phase. They remained hidden during the learning phase. This left a learning set of 15 category#1 and 15 category#2 data records.
- The classification of the learning set by the CLASSIF1 algorithm provided a specificity of 100% for the correct recognition of category#1 records at a sensitivity of 40% for the recognition of category#2 data records (fig.14A). Parameters #29, #77 and #133 (fig.15) were selected, with means ± standard deviation (SD) of 73.2±6.6/72.8±11.2, 53.3±8.0/59.0±11.3 and 25.0±3.7/25.0±5.9, and with no statistical difference between the category#1 and category#2 means of the selected data columns.
- The classification of the unknown test set of records resulted in a low specificity of 33.7% for the correct recognition of category#1 records and a low sensitivity of 20.0% for category#2 records (fig.14B).
- The triple matrix display of the learning set (fig.16) and of the unknown test set (fig.17) shows the low quality of the classification of the random number data set.
- The result emphasizes the well known fact that the discrimination of random statistical aberrations in a learning set collapses typically during the classification of unknown test sets.
- This contrasts with the robustness of classification where molecular differences exist, as in the discrimination of risk patients for myocardial infarction by increased thrombocyte activation antigens CD62, CD63 and thrombospondin (fig.5). There, learning and test set patients are correctly classified in >95% of the cases, and statistically significant differences of the selected parameters exist between both patient groups (fig.7).
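To relate the percentages quoted for the random number learning set to record counts, the following sketch computes the usual two-category figures from a 2x2 classification matrix. The counts are an assumption derived from the text (100% of 15 category#1 records and 40% of 15 category#2 records, every record counted once). The predictive values are a consequence of those assumed counts and are not taken from the figures.

```python
from typing import Dict, Optional


def percent(numerator: int, denominator: int) -> Optional[float]:
    """Percentage, or None when the denominator is zero."""
    return 100.0 * numerator / denominator if denominator else None


def binary_metrics(tn: int, fp: int, fn: int, tp: int) -> Dict[str, Optional[float]]:
    """Category#1 is treated as 'negative', category#2 as 'positive'."""
    return {
        "specificity": percent(tn, tn + fp),  # category#1 correctly recognized
        "sensitivity": percent(tp, tp + fn),  # category#2 correctly recognized
        "npv": percent(tn, tn + fn),          # negative predictive value
        "ppv": percent(tp, tp + fp),          # positive predictive value
    }


# Assumed counts: 15 of 15 category#1 and 6 of 15 category#2 records correct.
learning_metrics = binary_metrics(tn=15, fp=0, fn=9, tp=6)
print(learning_metrics)
```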
- Triple matrix classifiers are inherently standardized onto the
patient reference samples during the classification process
(standardized multiparameter data classification (SMDC)).
Classifiers are therefore comparable in an instrument and
laboratory independent way, in case no differences between the
reference groups of different institutions are detected by the
This is advantageous for consensus formation e.g. on leukemia,
HIV and thrombocyte classifications by immunophenotyping.
- Reference sample data may be obtained in the case of blood cells from healthy individuals of similar age and sex distribution such as from blood donor samples (biological standardization). This overcomes comparability issues between standardized measurements on different flow cytometers in different institutions using labelled indicator molecules of the same specificity but from different sources (instrument and reagent standardization).
- The systematic analysis of patient classification masks provides information on individual genotypic and exposure influences on expressed data patterns. Such analysis may prove useful for the development of a periodic system of cells, that is, a relational cell classification system in analogy to the periodic system of elements. Using the molecular properties of tissue or hematopoietic stem cells as reference, different cell types and their activity states could be compared in a standardized way, for example during disease development and under therapy, but also during cell division, cell differentiation or cell migration.
- The performance of triple matrix classifiers depends on intralaboratory precision rather than on interlaboratory accuracy since measurement accuracy cancels out within certain limits through the normalization of the experimental values onto the respective mean values of the reference samples in each database column. Reference groups are typically constituted from age and sex matched individuals.
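Why a systematic calibration difference between instruments cancels out can be illustrated with a short sketch. It assumes a purely multiplicative difference between two hypothetical laboratories A and B and uses the 10% and 90% percentiles of each laboratory's own reference samples as thresholds. The function name and all values are invented for illustration.

```python
from statistics import quantiles
from typing import List


def triple(values: List[float], reference: List[float],
           lower: int = 10, upper: int = 90) -> str:
    """Triple matrix characters of one data column, relative to a reference group."""
    cuts = quantiles(reference, n=100, method="inclusive")  # cuts[p-1] = p-th percentile
    low, high = cuts[lower - 1], cuts[upper - 1]
    return "".join("-" if v < low else "+" if v > high else "0" for v in values)


# Laboratory A: reference samples and three patients for one parameter.
reference_a = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0]
patients_a = [8.0, 12.0, 25.0]

# Laboratory B measures the same samples with a 2.5-fold higher read-out (assumed).
factor = 2.5
reference_b = [v * factor for v in reference_a]
patients_b = [v * factor for v in patients_a]

print(triple(patients_a, reference_a))  # own reference, laboratory A
print(triple(patients_b, reference_b))  # own reference, laboratory B
print(triple(patients_b, reference_a))  # foreign reference: pattern is distorted
```

Both laboratories obtain the same pattern as long as each uses its own reference group. Applying the thresholds of laboratory A to the raw values of laboratory B does not reproduce it.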
- FCS1.0 and 2.0 list mode files or dBase3 database exports from database (e.g. Access) or table calculation programs (e.g. Excel) are classified by the CLASSIF1 algorithm in various clinical situations:
- the CLASSIF1 algorithm provides access to
with >95% correct disease
course prediction in individual patients as well as to
standardized diagnostic classifications.
- the CLASSIF1 approach facilitates the elaboration of interlaboratory consensus classifiers in complex multiparameter data sieving or data mining analysis.
- as a practical consequence, diseases can be classified at institutions where sufficient learning sets cannot be generated in a reasonable time or where costly investigations are necessary to establish appropriate learning sets.
- furthermore the molecular and biochemical properties of many body cell systems during disease can be compared by standardized classification e.g. blood leukocytes versus tissue or effusion leukocytes.
|© 2017 G.Valet|