Medical Bioinformatics in Cytomics
- Multiparameter flow cytometry data analysis typically concerns the determination of cell frequencies within one or several multidimensional gates. An essential part of potentially useful information, such as fluorescence intensities, average fluorescence surface densities, intercolour fluorescence ratios, coefficients of variation of the fluorescence, light scatter intensities, or scatter and fluorescence ratio distributions of the various cell populations, remains unconsidered in this way. The evaluation of the many data columns deriving from such exhaustive analysis of flow cytometric data requires an easily scalable data classification method.
The CLASSIF1 algorithm implements a parallel classification
process for an essentially unlimited number of data columns.
For this purpose, the columns are classified in packages of, for example,
50 data columns to identify the most discriminative columns
in each package for the discrimination of up to 10 different classification
categories.
After the first classification round, the most discriminative
columns of each data column package are combined into
new data column packages and reclassified in the same way
until 50 or fewer data columns remain. Their final classification
provides classification masks containing typically the
5 to 15 most discriminative data columns (parameters) of the
entire dataset.
The CLASSIF1 classification is
- parallel computable (non-hierarchical)
- fast because only compare operations are used for classification
- without parameter weighting
- robust against missing values (no insertion of assumed parameter values)
- interlaboratory standardizable
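The packaging scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the CLASSIF1 implementation: the per-column discrimination `score` field and the placeholder `classify_package` ranking are assumptions (in CLASSIF1 the discriminative columns emerge from the actual classification of each package).

```python
def classify_package(columns, select_n):
    # Placeholder for one package classification: rank the columns
    # of the package by a hypothetical discrimination score and
    # keep the best select_n of them.
    return sorted(columns, key=lambda c: c["score"], reverse=True)[:select_n]

def sieve_columns(columns, package_size=50, select_n=10):
    # Classify packages of columns in parallelizable rounds and
    # recombine the survivors until <= package_size columns remain.
    while len(columns) > package_size:
        survivors = []
        # Each package is independent and could run on its own processor.
        for i in range(0, len(columns), package_size):
            survivors.extend(classify_package(columns[i:i + package_size], select_n))
        columns = survivors
    # Final round yields the most discriminative columns of the dataset.
    return classify_package(columns, select_n)
```

Because each package is classified independently, the rounds can be distributed across processors, which is what makes the column count essentially unlimited.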
- The potential of data pattern classification in
cytomics
(cell systems research) lies in the exhaustive
knowledge extraction from flow cytometric
(single-cell molecular bioprofiles) or
other multiparameter data by the determination of the
most discriminative data patterns for individualized disease
course predictions or diagnostics.
- CLASSIF1 algorithmic data sieving
(fig.1)
and data pattern classification permit the development
of standardized, instrument and laboratory independent
data pattern classifiers from flow cytometric list mode, flow bead array,
high content image analysis, cDNA (Lymphochip, Affymetrix) or protein
expression chip arrays,
clinical chemistry, biomedical or clinical data for
predictive medicine
by cytomics
or for diagnostic purposes
(literature references:
bioinformatics,
medical and clinical cytomics,
further references
)
- Numeric data values are transformed into
triple matrix characters
(fig.2)
to permit subsequent data pattern classification.
- Lower and upper percentiles like 10% and 90%
(fig.3)
are calculated for each data column of the reference patients.
- Data column values are transformed
(fig.3)
by assigning:
(- =diminished)
to values below the lower percentile,
(0 =unchanged)
to values between the lower and upper percentiles and
(+ =increased)
to values above the upper percentile.
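The percentile transformation can be sketched as follows. This is a minimal sketch: the nearest-rank percentile convention and the 10%/90% defaults are illustrative assumptions.

```python
def percentile(sorted_vals, p):
    # Nearest-rank percentile (one of several common conventions).
    k = int(round(p / 100.0 * (len(sorted_vals) - 1)))
    return sorted_vals[max(0, min(len(sorted_vals) - 1, k))]

def triple_code(value, lo, hi):
    # Map a numeric value onto the triple matrix characters.
    if value < lo:
        return "-"   # diminished
    if value > hi:
        return "+"   # increased
    return "0"       # unchanged

def transform_column(reference_values, values, p_lo=10, p_hi=90):
    # Percentile thresholds are derived from the reference patients only.
    ref = sorted(reference_values)
    lo, hi = percentile(ref, p_lo), percentile(ref, p_hi)
    return [triple_code(v, lo, hi) for v in values]
```

Deriving the thresholds from the reference group alone is what later permits the instrument- and laboratory-independent standardization discussed below.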
- Disease classification masks ("disease signatures")
(fig.3a)
for each classification category are determined from the most
frequent triple matrix character in each data column of the learning set.
Individual patients are classified according to the highest positional
coincidence between the patient classification mask
("patient signature") and any one of the disease classification masks.
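Mask construction and positional-coincidence classification can be sketched as follows, assuming the records are already triple-matrix coded; the category labels in the usage example are illustrative.

```python
from collections import Counter

def disease_mask(coded_records):
    # Most frequent triple matrix character in each data column
    # of the learning set of one classification category.
    n_cols = len(coded_records[0])
    return [Counter(rec[c] for rec in coded_records).most_common(1)[0][0]
            for c in range(n_cols)]

def classify(patient_mask, disease_masks):
    # Count positional coincidences with each disease mask; ties
    # return several categories (multiple classification).
    scores = {cat: sum(p == m for p, m in zip(patient_mask, mask))
              for cat, mask in disease_masks.items()}
    best = max(scores.values())
    return [cat for cat, s in scores.items() if s == best]
```

For example, with masks `{"N": ["0","0","-"], "R": ["+","+","0"]}`, the patient signature `["+","+","-"]` coincides in two positions with R and one with N, so it is classified R.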
- A classification (confusion) matrix is established by comparing the
known predictive or diagnostic clinical classification of reference
and abnormal patient samples against the same
classification categories of the CLASSIF1 triple matrix
classification. An ideal classification result is characterized by 100%
values for specificity and sensitivity in the diagonal boxes
as well as for the negative and positive predictive values,
while the values in all other boxes are 0%
(fig.4).
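The confusion matrix bookkeeping can be sketched as below; the function names are illustrative. The diagonal percentage corresponds to specificity for the reference category and to sensitivity for a disease category.

```python
def confusion_counts(true_labels, predicted_labels, categories):
    # Rows: known clinical category; columns: CLASSIF1 category.
    counts = {t: {p: 0 for p in categories} for t in categories}
    for t, p in zip(true_labels, predicted_labels):
        counts[t][p] += 1
    return counts

def diagonal_percent(counts, category):
    # Fraction of records of `category` classified as `category`.
    row = counts[category]
    total = sum(row.values())
    return 100.0 * row[category] / total if total else 0.0
```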
- The data columns, as in the case of myocardial infarction risk
assessment, are obtained by flow cytometric determination of
activation antigens on
peripheral blood thrombocytes.
Four databases, each containing eleven parameters
(fig.5),
were calculated from the thrombocyte evaluation windows (gates) in
two-parameter histograms.
They were obtained by projection of flow cytometric
four-parameter list mode data either onto the forward (FSC)/sideward (SSC)
light scatter plane or onto the FSC/fluorescence_1 plane to calculate the expression
of the thrombocyte surface antigens CD62, CD63 and thrombospondin as
well as of spontaneously attached IgG in normal individuals
or angiographically identified myocardial risk patients.
- Correct classification for all patient samples of
the learning set as well as for the unknown test set
(validation set) of patients is achieved
(fig.6A,6B),
indicating that the CLASSIF1 algorithm has identified a
suitable discriminative data pattern of thrombocyte parameters
(fig.7).
The selected parameters are statistically
significantly different between normal individuals and
myocardial risk patients
(fig.8).
- The unknown test set of patients was defined prior to
learning as data records 1,5,10,15... etc of the normal individuals
as well as of the myocardial risk patients and remained hidden to
the learning process.
As an embedded test set, it was measured under very
similar conditions as the samples of the learning set.
- The principle of data pattern classification
(fig.9)
assures classification accuracy as well
as classification multiplicity to take care of the many
combinatorial possibilities between genotype and
internal or external exposure influences on observed
molecular cell phenotypes.
- Data patterns represent heat maps
consisting of (-)=decreased, (0)=unchanged and
(+)=increased characters instead of a green, yellow or red
color code.
- The optimization of the initial classification,
containing all data columns
(fig.10A),
is achieved by maximizing the sum of values in the
diagonal boxes of the classification matrix
(fig.10B).
The classification process reduces the number of classification
parameters in this example from initially 44
to finally 5 data columns by the exclusion of non informative
parameters.
- Initially, the most frequent triple matrix character of
each database column is inserted into the disease
classification mask of each classification category.
Data columns without discrimination between
classification categories are removed. The disease
classification masks are then iteratively optimized
(fig.11).
- Data records (patient classification masks) are classified according
to their highest positional coincidence
with any one of the disease classification masks.
- Multiple classifications occur at equal positional
coincidence with more than one of the disease classification
masks
(for example record #17 (N,R) of the first classification
mask, fig.10).
They may represent transitional states in disease
or classification errors, for example in the case of small
learning sets or of comparatively small differences
of the selected parameters amongst the different categories.
They may also occur in the first triple matrix
prior to the iterative removal of
non-informative parameters.
- The iterative optimization is achieved
by the temporary removal of a single database column or by variable
combinations of two database columns from the classification
process, followed by reclassification. It is recorded whether the
temporary absence of the column(s) has improved or deteriorated the
classification result.
- The columns are then reinserted and the next data columns are temporarily
removed until the positive or negative contribution of all data
columns to the classification process is known.
- Data columns having improved the classification result by their
absence are removed from further consideration. The remaining data
columns represent the optimized disease classification masks.
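The iterative optimization resembles greedy backward elimination. The sketch below is a simplification: it drops only single columns per trial (CLASSIF1 also tests pairs of columns), and `score_fn` is an assumed callback returning the diagonal sum of the classification matrix for a given column subset.

```python
def optimize_columns(columns, score_fn):
    # Temporarily drop each column, reclassify, and permanently
    # remove columns whose absence improves the diagonal sum.
    active = list(columns)
    improved = True
    while improved:
        improved = False
        base = score_fn(active)
        for col in list(active):
            trial = [c for c in active if c != col]
            if trial and score_fn(trial) > base:
                active = trial        # column was non-informative
                base = score_fn(active)
                improved = True
    return active                     # optimized classification mask
```

Under such a scheme, a run starting from, say, 44 columns shrinks the mask step by step as non-informative columns are excluded, matching the reduction described above.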
- The data records of the learning set
are reclassified against the disease classification masks to
assess the achieved discrimination for the various classification categories
(fig.12).
- The data records of the unknown test (validation) set
are subsequently classified to verify the robustness of
classification for unknown data records
(fig.13).
- The classification operations described in chapters 3 and 4
are performed unsupervised, that is automatically, by the CLASSIF1
algorithm and do not require human interference, once the classification
process has been started.
- The classification of data records, unknown to a learned
classifier, is important to avoid erroneous classifications
of random statistical aberrations.
- To assess the susceptibility of the CLASSIF1 algorithm to
the detection of random statistical aberrations, a 133-column
wide data set of 40 data records was generated using a
random number generator and kindly made available by
Dr.W.Meyer and PD Dr.G.Haroske
(Pathologisches Institut, Universität Dresden). The data columns had
different means and coefficients of variation between 0.97-25.7%
(CV=100*standard deviation/mean). For the classification, each second
data record was assigned in sequence to either the arbitrary category#1
or category#2, resulting in 20 category#1 and 20
category#2 records.
- Records 1,5,10,15,20 of each category were prior to the learning
phase assigned to the unknown test (validation) set. They remained
hidden during the learning phase. This left a learning set of
15 category#1 and 15 category#2 data records.
- The classification of the learning set by the CLASSIF1 algorithm
provided a specificity of 100% for the correct
recognition of category#1 records at a sensitivity of 40%
for the recognition of category#2 data records
(fig.14A).
Parameters #29, #77 and #133
(fig.15)
were selected, with means ± standard deviation (SD) of 73.2±6.6/72.8±11.2,
53.3±8.0/59.0±11.3 and 25.0±3.7/25.0±5.9, at no statistical
difference between the category#1 and category#2 means of the selected
data columns.
- The classification of the unknown test set of records
resulted in a low specificity of 33.7% for the correct
recognition of category#1 records and a low sensitivity of
20.0% for category#2 records
(fig.14B).
- The triple matrix display of the
learning set
(fig.16)
and of the unknown test set
(fig.17)
shows the low quality of the
classification of the random number data set.
- The result emphasizes the well known fact that the
discrimination of random statistical aberrations in a learning
set collapses typically during the classification of
unknown test sets.
- This contrasts with the robustness of classification in the case of
existing molecular differences, as for the discrimination of
risk patients for myocardial infarction by increased thrombocyte activation
antigens CD62, CD63 and thrombospondin
(fig.5)
where learning and test set patients are in >95% of the cases
correctly classified and statistically significant differences
of the selected parameters exist between both patient groups
(fig.7).
- Triple matrix classifiers are inherently standardized onto the
data of
patient reference samples during the classification process
(standardized multiparameter data classification (SMDC)).
Classifiers are therefore comparable in an instrument and
laboratory independent way, in case no differences between the
reference groups of different institutions are detected by the
CLASSIF1 algorithm.
This is advantageous for consensus formation e.g. on leukemia,
HIV and thrombocyte classifications by immunophenotyping.
- Reference sample data may be obtained in the case of blood
cells from healthy individuals of similar age and sex distribution
such as from blood donor samples (biological standardization).
This overcomes comparability issues between standardized
measurements on different flow cytometers in different institutions
using labelled indicator molecules of the same
specificity but from different sources (instrument and reagent
standardization).
- The systematic analysis of patient
classification masks provides information on individual
genotypic and exposure influences on
expressed data patterns. Such analysis may prove useful for
the development of a periodic system of cells
that is a relational cell classification
system in analogy to the periodic system of elements.
Using the molecular properties of tissue or hematopoietic
stem cells as reference, different
cell types and their activity states could be compared in a
standardized way for example during disease development,
under therapy but also during cell division, cell differentiation
or cell migration.
- The performance of triple matrix classifiers depends
on intralaboratory precision rather than on interlaboratory
accuracy since measurement accuracy cancels out
within certain limits through the normalization of the experimental
values onto the respective mean values of the reference samples in each
database column. Reference groups are typically constituted
from age and sex matched individuals.
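The normalization onto reference means can be sketched as follows. This is a simplification under the assumption that instrument-dependent deviations act as a multiplicative scale factor common to patient and reference measurements, which then cancels in the ratio.

```python
def normalize_to_reference(values, reference_values):
    # Express each measurement relative to the mean of the
    # reference group for the same data column. A constant
    # instrument scale factor multiplying both numerator and
    # denominator cancels out.
    ref_mean = sum(reference_values) / len(reference_values)
    return [v / ref_mean for v in values]
```

This is why intralaboratory precision matters more than interlaboratory accuracy: within limits, each laboratory's systematic bias divides out against its own reference group.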
- FCS1.0 and 2.0 list mode files or dBase3 exports from database programs (e.g. Access) or spreadsheet programs (e.g. Excel) are classified by the CLASSIF1 algorithm in various clinical situations:
- the CLASSIF1 algorithm provides access to
predictive medicine
with >95% correct disease
course prediction in individual patients as well as to
standardized diagnostic classifications.
- the CLASSIF1 approach facilitates the elaboration of
interlaboratory consensus classifiers in complex
multiparameter data sieving or data mining analysis.
- as a practical consequence, diseases
can be classified at institutions where no sufficient learning sets can
be generated in reasonable times or where costly investigations are
necessary to establish appropriate learning sets.
- furthermore the molecular and biochemical properties of
many body cell systems during disease can be compared
by standardized classification e.g. blood leukocytes versus
tissue or effusion leukocytes.
© 2019 G.Valet