DeMix

DeMix is a statistical method for deconvolving mixed cancer transcriptomes. It uses a linear mixture model to estimate the likely proportions of tumor and stromal cells in a sample. It was developed by Ahn et al.

DeMix explicitly considers four possible scenarios: matched tumor and normal samples, with reference genes; matched tumor and normal samples, without reference genes; unmatched tumor and normal samples, with reference genes; and unmatched tumor and normal samples, without reference genes.

Reference genes are a set of genes whose expression profiles have been accurately estimated from external data in all constituent tissue types.

Introduction
Solid tumor samples obtained from clinical practice are highly heterogeneous. They consist of multiple clonal populations of cancer cells as well as adjacent normal tissue, stromal cells, and infiltrating immune cells. This heterogeneity can complicate or bias various genomic data analyses. Isolating the expression data of each component from mixed samples in silico is therefore of substantial interest.

It is important to estimate and account for tumor purity, the percentage of cancer cells in the tumor sample, before downstream analyses. Owing to the marked differences between cancer and normal cells, it is possible to estimate tumor purity from high-throughput genomic or epigenomic data.

DeMix estimates the proportion and gene expression profile of cancer cells in mixed samples. The method assumes that the mixed sample is composed of only two cell types: cancer cells (with no gene expression profile known a priori) and normal cells (with known gene expression data, which can come from either tumor-matched or unmatched samples).

DeMix was developed for microarray data. Its authors showed that it is important to use the raw data as input, under the assumption that they follow a log-normal distribution as is the case for microarray data, instead of working with log-transformed data as most other methods did. DeMix estimates the variance of gene expression in the normal samples and uses it in the maximum likelihood estimation of the cancer cell gene expression and proportions, which implicitly gives each gene a gene-specific weight.

DeMix is the first method to apply a linear mixture model to gene expression levels before they are log-transformed. It analyzes data from heterogeneous tumor samples on the raw scale and estimates individual expression levels for each sample and each gene, including in an unmatched design.

Method
Let $$N_{ig} \sim LN(\mu_{N_g}, \sigma^2_{N_g})$$ and $$T_{ig} \sim LN(\mu_{T_g}, \sigma^2_{T_g})$$ be the expression levels for gene $$g$$ and sample $$i$$ from pure normal and pure tumor tissue, respectively. Here $$LN$$ denotes the $$\log_2$$ Normal distribution. When the $$\log_2$$ Normal assumption is violated, a deterioration in accuracy should be expected. The expression level from tumor tissue, $$T_{ig}$$, is not observed. Let $$Y_{ig}$$ denote the observed expression level of a clinically derived tumor sample, and let $$\pi_i$$, which is unknown, denote the proportion of tumor tissue in sample $$i$$. The raw measured data are modeled by the linear equation


 * $$Y_{ig}=\pi_iT_{ig}+(1-\pi_i)N_{ig}$$

Note that $$Y_{ig}$$ does not follow a $$\log_2$$ Normal distribution when both $$N_{ig}$$ and $$T_{ig}$$ follow a $$\log_2$$ Normal distribution.
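The following Python sketch illustrates why the model is written on the raw scale. It simulates the mixture equation and measures how far $$\log_2 Y$$ lies from the same linear combination applied to $$\log_2 T$$ and $$\log_2 N$$. The function name `simulate_mixed` and all parameter values are illustrative assumptions, not part of DeMix.

```python
import math
import random

def simulate_mixed(pi, mu_t, sd_t, mu_n, sd_n, rng):
    """Draw one (T, N, Y) triple under the raw-scale linear mixture model."""
    t = 2.0 ** rng.gauss(mu_t, sd_t)   # log2(T) ~ Normal(mu_t, sd_t^2)
    n = 2.0 ** rng.gauss(mu_n, sd_n)   # log2(N) ~ Normal(mu_n, sd_n^2)
    return t, n, pi * t + (1.0 - pi) * n

rng = random.Random(0)
pi = 0.6                               # assumed tumor proportion
draws = [simulate_mixed(pi, 9.0, 0.5, 7.0, 0.5, rng) for _ in range(1000)]

# Gap between log2(Y) and the linear mixture of the log-scale values.
# By concavity of the logarithm this gap is never negative, so a model
# that mixes log-transformed expression linearly is biased.
gaps = [math.log2(y) - (pi * math.log2(t) + (1.0 - pi) * math.log2(n))
        for t, n, y in draws]
mean_gap = sum(gaps) / len(gaps)
print(f"mean gap on the log2 scale: {mean_gap:.3f}")
```

The gap grows with the difference between tumor and normal expression, so the genes that carry the most information about $$\pi_i$$ are also the ones most affected by mixing on the log scale.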

There are mainly two steps in the DeMix method:

Step 1: Given the $$Y$$'s and the distribution of the $$N$$'s, the likelihood of observing $$Y$$ is maximized in order to search for $$\{\pi, \mu_T, \sigma^2_T\}$$.

Step 2: Given the $$\pi$$'s and the distributions of the $$T$$'s and the $$N$$'s, an individual pair $$(T, N)$$ is estimated for each sample and each gene.

These steps are then adapted to specific data scenarios.
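The two steps can be sketched in Python for a single sample under strong simplifications. This is not the DeMix implementation: it assumes that the tumor and normal parameters are already known for a handful of reference genes, it replaces the optimizer with a plain grid search over $$\pi$$, and it reads Step 2 as picking the pair $$(T, N)$$ with the highest joint density on the line $$Y = \pi T + (1-\pi)N$$. All function names and parameter values are hypothetical.

```python
import math

LN2 = math.log(2.0)

def log2normal_pdf(x, mu, sd):
    """Density of X when log2(X) ~ Normal(mu, sd^2)."""
    if x <= 0.0:
        return 0.0
    z = (math.log2(x) - mu) / sd
    return math.exp(-0.5 * z * z) / (x * LN2 * sd * math.sqrt(2.0 * math.pi))

def mixed_pdf(y, pi, mu_t, sd_t, mu_n, sd_n, steps=400):
    """Density of Y = pi*T + (1-pi)*N; N is integrated out with a midpoint rule."""
    upper = y / (1.0 - pi)             # T > 0 forces N < y / (1 - pi)
    h = upper / steps
    total = 0.0
    for k in range(steps):
        n = (k + 0.5) * h
        t = (y - (1.0 - pi) * n) / pi
        total += log2normal_pdf(n, mu_n, sd_n) * log2normal_pdf(t, mu_t, sd_t)
    return total * h / pi              # 1/pi: Jacobian of the change of variables

def estimate_pi(ys, genes, grid=None):
    """Step 1 (simplified): grid-search pi for one sample over reference genes.

    genes[g] = (mu_t, sd_t, mu_n, sd_n), all assumed known in this sketch.
    """
    if grid is None:
        grid = [k / 50.0 for k in range(5, 46)]   # 0.10, 0.12, ..., 0.90

    def loglik(pi):
        return sum(math.log(mixed_pdf(y, pi, *g) + 1e-300)
                   for y, g in zip(ys, genes))

    return max(grid, key=loglik)

def estimate_pair(y, pi, mu_t, sd_t, mu_n, sd_n, steps=2000):
    """Step 2 (simplified): most likely (T, N) with pi*T + (1-pi)*N = y."""
    upper = y / (1.0 - pi)
    best = None
    for k in range(steps):
        n = (k + 0.5) * upper / steps
        t = (y - (1.0 - pi) * n) / pi
        d = log2normal_pdf(n, mu_n, sd_n) * log2normal_pdf(t, mu_t, sd_t)
        if best is None or d > best[0]:
            best = (d, t, n)
    return best[1], best[2]

# Six hypothetical reference genes: (mu_t, sd_t, mu_n, sd_n) on the log2 scale.
genes = [(9.0, 0.3, 7.0, 0.3), (7.0, 0.3, 9.0, 0.3),
         (10.0, 0.3, 8.0, 0.3), (8.0, 0.3, 10.0, 0.3),
         (9.5, 0.3, 7.0, 0.3), (7.0, 0.3, 9.5, 0.3)]
true_pi = 0.6
# Noise-free toy observations: each Y mixes the medians of T and N.
ys = [true_pi * 2.0 ** mt + (1.0 - true_pi) * 2.0 ** mn for mt, _, mn, _ in genes]

pi_hat = estimate_pi(ys, genes)
t_hat, n_hat = estimate_pair(ys[0], pi_hat, *genes[0])
print(pi_hat, round(t_hat, 1), round(n_hat, 1))
```

The toy reference genes include some that are higher in tumor and some that are higher in normal tissue; a single gene alone constrains $$\pi$$ only weakly, because many combinations of $$\pi$$, $$T$$ and $$N$$ can produce the same $$Y$$.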

DeMix is implemented with the Nelder–Mead optimization procedure, which involves a numerical integration of the joint density. DeMix takes a two-stage approach: it first estimates the $$\pi_i$$s and then estimates the means and variances of gene expression based on the $$\hat{\pi}_i$$s. A joint model that estimates all parameters simultaneously would additionally incorporate the uncertainty in the tissue proportions. However, estimation under such a model can be computationally intensive and may not be suitable for the analysis of high-throughput data.
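For orientation, the density that has to be integrated numerically can be written out from the model above. The expression below follows from a change of variables from $$(T_{ig}, N_{ig})$$ to $$(Y_{ig}, N_{ig})$$ for $$0 < \pi_i < 1$$; it is a derivation under the stated model rather than a formula quoted from the DeMix paper, and the symbols $$f_T$$ and $$f_N$$ for the two $$\log_2$$ Normal densities are introduced here for illustration.

 * $$f_Y(y \mid \pi_i, \mu_{T_g}, \sigma^2_{T_g}) = \int_0^{y/(1-\pi_i)} \frac{1}{\pi_i}\, f_T\!\left(\frac{y-(1-\pi_i)\,n}{\pi_i}\right) f_N(n)\, dn$$

 * $$f_N(n) = \frac{1}{n \ln 2 \, \sigma_{N_g} \sqrt{2\pi}} \exp\!\left(-\frac{(\log_2 n - \mu_{N_g})^2}{2\sigma^2_{N_g}}\right)$$

Here $$f_T$$ has the same form as $$f_N$$ with $$(\mu_{T_g}, \sigma^2_{T_g})$$ in place of $$(\mu_{N_g}, \sigma^2_{N_g})$$, and the $$\pi$$ under the square root is the mathematical constant, not the tumor proportion. The upper limit of integration arises because $$T_{ig}$$ must be positive. The integral has no closed form, which is why a numerical integration is needed inside each likelihood evaluation.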

Usage
DeMix addresses four data scenarios: with or without reference genes, and matched or unmatched design. Although the algorithm requires a minimum of one reference gene, it is recommended to use at least 5 to 10 genes to reduce the potential influence of outliers and to identify an optimal set of $$\pi$$s. DeMix assumes that the mixed sample is composed of at most two cellular compartments, normal and tumor, and that the distributional parameters of normal cells can be estimated from other available data. For other situations, more complex modeling may be needed.