User:Debazthed/sandbox

Probalign is a sequence alignment tool that calculates a maximum expected accuracy alignment using partition function posterior probabilities. Base pair probabilities are estimated from a distribution over alignments that is similar to the Boltzmann distribution. The partition function is calculated using a dynamic programming approach.

= Algorithm =
The following describes the algorithm used by Probalign to determine the base pair probabilities.

Alignment score
To score an alignment of two sequences two things are needed:
 * a similarity function $$\sigma(x,y)$$ (e.g. PAM, BLOSUM, ...)
 * an affine gap penalty for a gap of length $$k$$: $$ g(k) = \alpha + \beta k$$

The score $$S(a)$$ of an alignment a is defined as:

$$ S(a) = \sum_{x_i-y_j \in a} \sigma(x_i,y_j) + \text{gap cost}$$
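The definition above can be illustrated with a minimal sketch. It assumes the alignment is given as two gapped rows of equal length, that `alpha` and `beta` are non-positive so the gap cost is added to the score as in the formula, and that adjacent gaps in different rows count as two separate gaps. The function name and the toy similarity function in the usage note are hypothetical and not part of Probalign.

```python
def alignment_score(row_x, row_y, sigma, alpha, beta):
    """Score S(a) of an alignment given as two gapped rows of equal length.

    Aligned pairs contribute sigma(x_i, y_j); a gap run of length k
    contributes g(k) = alpha + beta * k (assumed non-positive here).
    """
    if len(row_x) != len(row_y):
        raise ValueError("alignment rows must have equal length")
    score = 0.0
    prev = "M"  # state of the previous column: M (match), I or D
    for cx, cy in zip(row_x, row_y):
        if cx != "-" and cy != "-":
            state = "M"
            score += sigma(cx, cy)
        elif cx == "-" and cy == "-":
            raise ValueError("a column cannot consist of two gaps")
        elif cx == "-":
            state = "I"  # insertion (-, y_j)
        else:
            state = "D"  # deletion (x_i, -)
        if state != "M":
            # The first column of a gap run costs g(1) = alpha + beta,
            # every further column of the same run costs beta.
            score += beta if state == prev else alpha + beta
        prev = state
    return score
```

For example, with a toy similarity function `sigma = lambda a, b: 1.0 if a == b else -1.0`, the call `alignment_score("ACGT", "A--T", sigma, -2.0, -1.0)` adds two matches and one gap of length 2.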

Now the Boltzmann-weighted score of an alignment a is:

$$ e^{\frac{S(a)}{T}} = e^{\frac{\sum_{x_i-y_j \in a} \sigma(x_i,y_j) + \text{gap cost}}{T}} = \left( \prod_{x_i - y_j \in a} e^{\frac{\sigma(x_i,y_j)}{T}} \right) \cdot e^{\frac{\text{gap cost}}{T}}$$

where $$T$$ is a scaling factor.

The probability of an alignment, assuming a Boltzmann distribution, is given by

$$Pr[a|x,y] = \frac{e^{\frac{S(a)}{T}}}{Z}$$

where $$Z$$ is the partition function, i.e. the sum of the Boltzmann weights of all alignments.
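For very short sequences the partition function can be obtained by brute force, which makes the definition concrete. The sketch below enumerates the score of every alignment and normalises the Boltzmann weights. It is only an illustration: the function names are hypothetical, the gap parameters are assumed to be non-positive, and the enumeration takes exponential time, so it is not how Probalign computes $$Z$$.

```python
import math

def all_alignment_scores(x, y, sigma, alpha, beta):
    """Yield the score S(a) of every alignment of x and y (tiny inputs only)."""
    def extend(i, j, prev, score):
        if i == len(x) and j == len(y):
            yield score
            return
        if i < len(x) and j < len(y):  # align x[i] with y[j]
            yield from extend(i + 1, j + 1, "M", score + sigma(x[i], y[j]))
        if i < len(x):  # deletion (x_i, -)
            cost = beta if prev == "D" else alpha + beta
            yield from extend(i + 1, j, "D", score + cost)
        if j < len(y):  # insertion (-, y_j)
            cost = beta if prev == "I" else alpha + beta
            yield from extend(i, j + 1, "I", score + cost)
    yield from extend(0, 0, "M", 0.0)

def alignment_probabilities(scores, T):
    """Return (probabilities, Z) for a complete list of alignment scores."""
    weights = [math.exp(s / T) for s in scores]  # Boltzmann weights
    Z = sum(weights)                             # partition function
    return [w / Z for w in weights], Z
```

By construction the returned probabilities sum to 1. Lowering `T` concentrates the probability mass on the highest-scoring alignments, while raising it flattens the distribution.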

Dynamic Programming
Let $$Z_{i,j}$$ denote the partition function of the prefixes $$x_0,x_1,...,x_i$$ and $$y_0,y_1,...,y_j$$. Three different cases are considered:
 * 1) $$Z^{M}_{i,j}:$$ the partition function of all alignments of the two prefixes that end in a match.
 * 2) $$Z^{I}_{i,j}:$$ the partition function of all alignments of the two prefixes that end in an insertion $$(-,y_j)$$.
 * 3) $$Z^{D}_{i,j}:$$ the partition function of all alignments of the two prefixes that end in a deletion $$(x_i,-)$$.

Then we have: $$Z_{i,j} = Z^{M}_{i,j} + Z^{D}_{i,j} + Z^{I}_{i,j}$$

Initialization
The matrices are initialized as follows:
 * $$Z^{M}_{0,j} = Z^{M}_{i,0} = 0$$ for $$i, j > 0$$
 * $$Z^{M}_{0,0} = 1$$
 * $$Z^{D}_{0,j} = 0$$
 * $$Z^{I}_{i,0} = 0$$

Recursion
The partition function for the alignments of two sequences $$x$$ and $$y$$ is given by $$Z_{|x|,|y|}$$, which can be recursively computed:
 * $$Z^{M}_{i,j} = Z_{i-1,j-1} \cdot e^{\frac{\sigma(x_i,y_j)}{T}}$$
 * $$Z^{D}_{i,j} = Z^{D}_{i-1,j} \cdot e^{\frac{\beta}{T}} + Z^{M}_{i-1,j} \cdot e^{\frac{g(1)}{T}} + Z^{I}_{i-1,j} \cdot e^{\frac{g(1)}{T}}$$
 * $$Z^{I}_{i,j}$$ analogously: $$Z^{I}_{i,j} = Z^{I}_{i,j-1} \cdot e^{\frac{\beta}{T}} + Z^{M}_{i,j-1} \cdot e^{\frac{g(1)}{T}} + Z^{D}_{i,j-1} \cdot e^{\frac{g(1)}{T}}$$
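The initialization and recursion can be sketched as follows. This is a minimal illustration rather than Probalign's actual code: the function name is hypothetical, indices `i` and `j` are prefix lengths (so index 0 is the empty prefix), and extending a gap is assumed to multiply by $$e^{\beta/T}$$ while opening one multiplies by $$e^{g(1)/T}$$, with non-positive `alpha` and `beta`.

```python
import math

def partition_function(x, y, sigma, alpha, beta, T):
    """Partition function Z of all alignments of x and y, in O(|x||y|) time."""
    n, m = len(x), len(y)
    ZM = [[0.0] * (m + 1) for _ in range(n + 1)]  # alignments ending in a match
    ZD = [[0.0] * (m + 1) for _ in range(n + 1)]  # ... ending in a deletion
    ZI = [[0.0] * (m + 1) for _ in range(n + 1)]  # ... ending in an insertion
    ZM[0][0] = 1.0  # the empty alignment has weight e^0 = 1
    open_w = math.exp((alpha + beta) / T)  # e^{g(1)/T}: start a new gap
    ext_w = math.exp(beta / T)             # e^{beta/T}: extend a gap
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                prev_total = ZM[i-1][j-1] + ZD[i-1][j-1] + ZI[i-1][j-1]
                ZM[i][j] = prev_total * math.exp(sigma(x[i-1], y[j-1]) / T)
            if i > 0:
                ZD[i][j] = (ZD[i-1][j] * ext_w
                            + (ZM[i-1][j] + ZI[i-1][j]) * open_w)
            if j > 0:
                ZI[i][j] = (ZI[i][j-1] * ext_w
                            + (ZM[i][j-1] + ZD[i][j-1]) * open_w)
    return ZM[n][m] + ZD[n][m] + ZI[n][m]
```

For long sequences the entries can underflow or overflow ordinary floating point numbers, so a practical implementation would likely rescale the matrices or work in log space.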

Base pair probability
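Finally, the probability that positions $$x_i$$ and $$y_j$$ form a base pair, i.e. are aligned with each other, is given by:

$$P(x_i - y_j|x,y) = \frac{Z_{i-1,j-1} \cdot e^{\frac{\sigma(x_i,y_j)}{T}} \cdot Z'_{i+1,j+1}}{Z_{|x|,|y|}}$$

Here $$Z'_{i+1,j+1}$$ denotes the partition function of all alignments of the suffixes of $$x$$ and $$y$$ that start at positions $$i+1$$ and $$j+1$$. It can be computed with the same recursion applied to the reversed sequences.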
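The posterior probabilities can be sketched by combining a forward matrix over prefixes with a second matrix over suffixes. The sketch assumes that the last factor in the numerator refers to the alignments of the remaining suffixes, and obtains it by running the same recursion on the reversed sequences. All function names are hypothetical, and `alpha` and `beta` are assumed to be non-positive.

```python
import math

def partition_matrix(x, y, sigma, alpha, beta, T):
    """Z[i][j] for all prefix lengths i, j (index 0 is the empty prefix)."""
    n, m = len(x), len(y)
    ZM = [[0.0] * (m + 1) for _ in range(n + 1)]
    ZD = [[0.0] * (m + 1) for _ in range(n + 1)]
    ZI = [[0.0] * (m + 1) for _ in range(n + 1)]
    ZM[0][0] = 1.0
    open_w = math.exp((alpha + beta) / T)  # e^{g(1)/T}
    ext_w = math.exp(beta / T)             # e^{beta/T}
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                prev_total = ZM[i-1][j-1] + ZD[i-1][j-1] + ZI[i-1][j-1]
                ZM[i][j] = prev_total * math.exp(sigma(x[i-1], y[j-1]) / T)
            if i > 0:
                ZD[i][j] = ZD[i-1][j] * ext_w + (ZM[i-1][j] + ZI[i-1][j]) * open_w
            if j > 0:
                ZI[i][j] = ZI[i][j-1] * ext_w + (ZM[i][j-1] + ZD[i][j-1]) * open_w
    return [[ZM[i][j] + ZD[i][j] + ZI[i][j] for j in range(m + 1)]
            for i in range(n + 1)]

def match_posteriors(x, y, sigma, alpha, beta, T):
    """P[i][j]: probability that x[i] is aligned with y[j] (0-based)."""
    n, m = len(x), len(y)
    fwd = partition_matrix(x, y, sigma, alpha, beta, T)
    # Suffix partition functions via the same recursion on reversed sequences.
    rev = partition_matrix(x[::-1], y[::-1], sigma, alpha, beta, T)
    Z = fwd[n][m]
    P = [[0.0] * m for _ in range(n)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w = math.exp(sigma(x[i-1], y[j-1]) / T)
            # prefix weight * match weight * suffix weight, normalised by Z
            P[i-1][j-1] = fwd[i-1][j-1] * w * rev[n-i][m-j] / Z
    return P
```

Each row and each column of the resulting matrix sums to at most 1; the remaining probability mass is the probability that the position is aligned to a gap.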

= See also =
 * ProbCons
 * Multiple Sequence Alignment

= External Links =
 * Probalign Webservice