User:MSInokuma/sandbox

Introduction
Probability state modeling (PSM) is a modeling algorithm developed by C. Bruce Bagwell that uses broadened quantile functions to visualize the relative order of biomarker modulations on cells in multidimensional cytometry data. The basic premise of PSM is that a probability-based axis can quantify the relative order of the changes in biomarker expression that occur in systems such as the ontogeny of immune cells.

Flow cytometry and its complementary technology, mass cytometry, are powerful tools that generate quantifiable high-dimensional data on an individual-cell basis. The technology has broad application in a wide variety of fields, including immunology, microbiology, and plant and marine biology, and is often used to identify cell populations, determine cell characteristics and function, detect biomarkers, and support prognostic or diagnostic applications. These rapidly growing technologies are driving the generation of increasingly complex datasets, which has led to the development of a range of computational analysis approaches.

A number of computational methods have been developed for flow cytometry bioinformatics, including preprocessing steps such as compensation for spectral overlap, transformation of data for visualization and analysis, normalization, and quality assessment. Cell populations have traditionally been identified with two-dimensional scatter plots, but as the dimensionality of the data has grown rapidly, a need has emerged for tools that identify, visualize and analyze populations in high-dimensional space.

A probability state model is a set of generalized Q functions, called expression profiles (EPs), which correspond to correlated measurements. The common cumulative probability axis can be a surrogate for time or cellular progression, such as maturation. A cell type is a user-defined construct: a collection of EPs used to identify a population of cells. The modeling program minimizes or maximizes one or more objective functions, making it suitable for automation, which can address the subjective nature of traditional gating approaches. Additionally, this approach avoids the “curse of dimensionality”, which compromises statistical outcomes and the computational performance of clustering methods commonly used in the analysis of high-dimensional flow and mass cytometry data.

Details
A probability state model represents a set of cell populations or “cell types”. Each cell type contains a set of broadened Q functions called expression profiles, which correspond to correlated measurements. To understand how PSM can model a number of correlated measurements, it is important to understand the relationships among histograms, P functions and Q functions, and their properties. A P function is also known as a cumulative distribution function, or CDF. A Q function is also known as a quantile function and is the inverse of the P function.
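The inverse relationship between a P function and a Q function can be checked numerically. The following is a minimal sketch, assuming empirical P and Q functions estimated from a sample with NumPy; the helper names `make_p_function` and `make_q_function` and the normal stand-in distribution are illustrative assumptions, not part of PSM itself.

```python
import numpy as np

def make_p_function(samples):
    """Return the empirical P function (CDF) of a 1-D sample."""
    sorted_x = np.sort(np.asarray(samples, dtype=float))
    n = sorted_x.size

    def p(x):
        # Fraction of samples less than or equal to x.
        return np.searchsorted(sorted_x, x, side="right") / n

    return p

def make_q_function(samples):
    """Return the empirical Q function (quantile function), the inverse of P."""
    sorted_x = np.sort(np.asarray(samples, dtype=float))

    def q(u):
        # np.quantile interpolates between order statistics for u in [0, 1].
        return np.quantile(sorted_x, u)

    return q

rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=10.0, size=5000)  # hypothetical stand-in measurements
p_x = make_p_function(x)
q_x = make_q_function(x)

u = np.array([0.1, 0.5, 0.9])
round_trip = p_x(q_x(u))  # evaluating P on Q(u) should approximately recover u
```

Evaluating `p_x` on the output of `q_x` returns values close to the original probabilities, which is the sense in which the Q function is the inverse of the P function.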

In Figure 1, Panel A illustrates a simulation of n = 5,000 measurements, x, randomly drawn from some distribution. The collection of all measurements is represented as X. Panel B shows a 100-bin histogram, hx, of this set of measurements, X. Panel C shows a probability histogram, px, where each bin’s frequency is divided by the total number of measurements, n. Panel D shows the cumulative probability distribution, Px, of px, where each successive bin’s value is the sum of its own probability and the probabilities of the preceding bins. The Px distribution can also be posed as a function, commonly known as a P function. The P function has the property that if each measurement, x, is evaluated by Px, the resultant values form a set of uniform random numbers, Ux, which have the relatively uniform frequency distribution shown in Panel E. Panel F is the inverse cumulative probability distribution, Qx, which can also be posed as a function, commonly known as a Q function. The Q function has the property that if another set of n uniform random numbers, U, is evaluated by the Q function, a set of equivalent measurements, Y, is produced, as shown in Panels G and H. As demonstrated in Panel H, the equivalency between these two data sets can be appreciated from the similar histograms hx and hy, whose differences are due only to counting error.

The pseudocode (Code 1) represents the basic logic of stochastic selection given some weight vector, W. The function obtains a uniform random number, u, ranging between 0 and 1. It then finds the element s in W whose cumulative probability, cP, is just greater than u. This part of the code is equivalent to evaluating the Q function formed by the weight vector, W, with a uniform random number. At the end, it dithers the result for the purpose of graphical representation.
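The stochastic selection logic described above can be sketched in Python. This is a minimal illustration written from the prose description, not a reproduction of Code 1; the function name `stochastic_select`, the normalization of the weights, and the choice of a uniform dither within the selected bin are assumptions.

```python
import random

def stochastic_select(weights, rng=random):
    """Pick an element of `weights` with probability proportional to its weight.

    Returns the selected index plus a uniform dither in [0, 1) (an assumed
    dithering scheme, intended only to spread points for plotting).
    """
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")

    u = rng.random()   # uniform random number in [0, 1)
    cumulative = 0.0   # running cumulative probability, cP
    # Fallback guards against floating-point rounding: last positive-weight element.
    selected = max(i for i, w in enumerate(weights) if w > 0)
    for s, w in enumerate(weights):
        cumulative += w / total
        if cumulative > u:  # first element whose cP is just greater than u
            selected = s
            break
    return selected + rng.random()

rng = random.Random(1)
weights = [1.0, 3.0, 0.0, 6.0]  # hypothetical weight vector W
draws = [stochastic_select(weights, rng) for _ in range(20000)]
counts = [0] * len(weights)
for d in draws:
    counts[int(d)] += 1  # integer part is the selected element
```

Over many draws, the fraction of selections of each element approaches its normalized weight, and an element with zero weight is never selected.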

The cumulative probability axis shown in Figure 1F is expressed more conveniently in PSM as a percent, which will be designated as . The relationships shown in Figure 1 can be represented using random variables. When the symbol U is written, it is understood to mean an infinite set of uniform random numbers ranging between 0 and 1. The data shown in Figure 1A were synthesized by evaluating a subset of uniform random numbers, Ux (Figure 1E), with the inverse function, Px⁻¹ (Figure 1F), which is also represented as Qx.
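The relationships among the quantities in Figure 1 can be summarized in compact form. The block below is a sketch that restates the text in LaTeX notation; the symbol u for a single cumulative probability and the equality-in-distribution notation are assumptions introduced here, and the equality holds only up to counting error for finite samples.

```latex
\begin{aligned}
U_x &= P_x(X) && \text{(Figure 1E)} \\
X &= P_x^{-1}(U_x) = Q_x(U_x) && \text{(Figures 1A and 1F)} \\
Y &= Q_x(U), \qquad Y \overset{d}{=} X && \text{(Figures 1G and 1H)} \\
\text{cumulative percent} &= 100 \cdot u, \qquad u \in [0, 1]
\end{aligned}
```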

Applications
The approach used in PSM is to display the relative order of cellular biomarker modulations on a probability-based axis. As an example, a probability state model was used to analyze flow cytometry experiment files from healthy human bone marrow samples stained with a set of immunophenotyping markers, in order to determine the cell-surface biomarkers that indicate developmental stages of B cell differentiation in the bone marrow. Each of the samples was stained with CD19, CD10, CD34, CD4. The markers CD20, TdT, CD81, CD22, CD44 and CD9 were included in a subset of samples. PSM uses broadened cumulative percent distributions, known as expression profiles, as its model components. Each expression profile has the same independent x-axis, cumulative percent, so a number of measurement dimensions can be “stacked”. When applied to cellular progressions, such as the differentiation of B cells in the bone marrow, the analysis illustrates the relative timing of cellular biomarker modulations as B cells differentiate.

A model was constructed to select events that were low for side scatter (SSC) and that expressed CD19. The analyses revealed that CD19-expressing B cells in healthy human bone marrow undergo at least three distinct coordinated changes, forming four stages labeled B1, B2, B3 and B4. At the end of B1, CD34 and TdT down-regulate while CD45, CD81 and CD20 slightly up-regulate. At the end of B2, CD45 and CD20 up-regulate. At the end of B3 and the beginning of B4, CD10, CD38 and CD81 down-regulate while CD22 and CD44 up-regulate. (See the B cell overlay figure and the figure showing the full model and its description.)
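The idea of stacking expression profiles on a shared cumulative percent axis can be sketched as follows. This is a simplified illustration, not the PSM implementation: it assumes each profile is a piecewise-linear function defined by hypothetical control points, it omits the broadening (line spread) of real expression profiles, and the marker names and values are invented placeholders rather than results from any study.

```python
import numpy as np

# Hypothetical control points: (cumulative percent, expression level) pairs.
# Names and values are illustrative only and are not taken from any data set.
profiles = {
    "marker_a": [(0, 3.0), (30, 3.0), (40, 1.0), (100, 1.0)],  # down-regulates at 30-40%
    "marker_b": [(0, 1.0), (60, 1.0), (70, 2.5), (100, 2.5)],  # up-regulates at 60-70%
}

def evaluate_profile(control_points, percent):
    """Evaluate a piecewise-linear profile at the given cumulative percent values."""
    xs, ys = zip(*control_points)
    return np.interp(percent, xs, ys)

def transition_midpoints(control_points):
    """Return the cumulative percent midpoint of each segment where the level changes."""
    mids = []
    for (x0, y0), (x1, y1) in zip(control_points, control_points[1:]):
        if y0 != y1:
            mids.append((x0 + x1) / 2)
    return mids

# All profiles share one x-axis, so they can be stacked and compared directly.
percent_axis = np.linspace(0, 100, 101)
stacked = {name: evaluate_profile(cp, percent_axis) for name, cp in profiles.items()}

# Order markers by where their first modulation falls on the shared axis.
order = sorted(profiles, key=lambda name: transition_midpoints(profiles[name])[0])
```

Because every profile is a function of the same cumulative percent axis, the positions of their transitions can be compared directly, which is what allows the relative timing of modulations to be read off a stacked display.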