User:Strnaseqer/nussinov

The Nussinov algorithm is a nucleic acid structure prediction algorithm used in computational biology to predict the folding of an RNA molecule. It is based on dynamic programming. The algorithm was developed by Ruth Nussinov in the late 1970s.

Background
RNA origami occurs when an RNA molecule "folds" and binds to itself. This folding often determines the function of the RNA molecule. RNA folds at several levels; this algorithm predicts the secondary structure of the RNA.

Scoring
We score a folding by simply counting the total number of complementary base pairs it contains. Maximizing the score therefore maximizes the number of base pairs, which serves as a simple proxy for the total number of hydrogen bonds.
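The scoring rule can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes that "complementary" means the Watson-Crick pairs A-U and G-C, and the names `COMPLEMENTARY`, `score`, and `structure_score` are hypothetical helpers, not part of any library. Positions are 0-based here, unlike the 1-based indices used in the text.

```python
# Assumption: only the Watson-Crick pairs A-U and G-C count as complementary.
COMPLEMENTARY = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def score(a, b):
    """Return 1 if bases a and b are complementary, 0 otherwise."""
    return 1 if (a, b) in COMPLEMENTARY else 0


def structure_score(seq, pairs):
    """Score a folding given as a list of 0-based (k, j) index pairs."""
    return sum(score(seq[k], seq[j]) for k, j in pairs)
```

For example, `structure_score("GGGAAACCC", [(0, 8), (1, 7), (2, 6)])` counts three G-C pairs.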

Motivation
Consider an RNA sequence $$S$$ whose elements are taken from the set $$ \{A, U, C, G\}$$. Let us imagine we have an optimal solution to the subproblem of folding $$S_i$$ to $$S_{j-1}$$, and an optimal solution for folding $$S_u$$ to $$S_v$$ for every $$u, v$$ with $$i\leq u\leq v\leq j-1$$. Now, to fold $$S_i$$ to $$S_{j}$$, we have two options:


 * 1) Leave $$S_{j}$$ unpaired, and keep the structure of $$S_i$$ to $$S_{j-1}$$. The score for this folding will be equal to the score of the folding of $$S_i$$ to $$S_{j-1}$$, as no new base pairs were created.
 * 2) Pair $$S_{j}$$ with $$S_{k}$$, where $$i\leq k<j$$. The score for this folding will be the score of the base pairing, plus the scores of the best foldings of $$S_i$$ to $$S_{k-1}$$ and of $$S_{k+1}$$ to $$S_{j-1}$$.

Algorithm
Consider an RNA sequence $$S$$ of length $$n$$ such that $$S_i\in \{A, U, C, G\}$$.

Construct an $$n\times n$$ matrix $$M$$. Initialize $$M$$ such that

$$M(i, i)=0$$

$$M(i, i-1) = 0$$

for $$1\leq i\leq n$$.

$$M(i,j)$$ will contain the maximum score for the subsequence $$S_i...S_j$$. Now, fill in entries of $$M$$ up and to the right, so that

$$M(i,j) = \max\begin{cases}M(i, j-1) \\ \max_{i\leq k<j}\left[M(i, k-1)+M(k+1, j-1)+\text{Score}(S_k,S_j)\right]\end{cases}$$

where $$\text{Score}(S_k,S_j)=\begin{cases}1,&S_k\text{ and }S_j \text{ complementary}\\ 0,&\text{otherwise.}\end{cases}$$

After this step, we have a matrix $$M$$ where $$M(i,j)$$ represents the optimal score of the folding of $$S_i...S_j$$.
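The fill step can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a reference implementation: indices are 0-based, "complementary" is assumed to mean the Watson-Crick pairs A-U and G-C, and `nussinov_fill` and `cell` are hypothetical names. The helper `cell` returns 0 for empty subsequences, which plays the role of the $$M(i, i-1)=0$$ initialization without needing an extra matrix column.

```python
# Assumption: only the Watson-Crick pairs A-U and G-C count as complementary.
COMPLEMENTARY = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def nussinov_fill(seq):
    """Return the score matrix M, where M[i][j] is the best score for seq[i..j]."""
    n = len(seq)
    M = [[0] * n for _ in range(n)]  # the diagonal M[i][i] stays 0

    def cell(i, j):
        # Empty subsequences (j < i) score 0, mirroring M(i, i-1) = 0.
        return M[i][j] if i <= j else 0

    # Fill by increasing subsequence length so that every smaller
    # subproblem is already solved when it is needed.
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = cell(i, j - 1)  # option 1: leave seq[j] unpaired
            for k in range(i, j):  # option 2: pair seq[k] with seq[j]
                s = 1 if (seq[k], seq[j]) in COMPLEMENTARY else 0
                best = max(best, cell(i, k - 1) + cell(k + 1, j - 1) + s)
            M[i][j] = best
    return M
```

The optimal score for the whole sequence is then the top-right entry, `nussinov_fill(seq)[0][len(seq) - 1]`. The three nested loops give a running time cubic in the sequence length.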

To determine the structure by traceback, we first create an empty list of pairs $$P$$. We initialize with $$i=1,j=n$$. Then, we follow one of three scenarios.

 * 1) If $$j\leq i$$, the procedure stops.
 * 2) If $$M(i,j)=M(i,j-1)$$, then set $$i=i,j=j-1$$ and continue.
 * 3) Otherwise, find a $$k$$ with $$i\leq k<j$$ such that $$S_k$$ and $$S_j$$ are complementary and $$M(i,j)=M(i,k-1)+M(k+1,j-1)+1$$, append $$(k,j)$$ to $$P$$, then trace back both with $$i=i,j=k-1$$ and with $$i=k+1,j=j-1$$.

When the traceback finishes, $$P$$ contains all of the paired bases.
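The traceback can be sketched in Python as below. This is an illustrative sketch, not a definitive implementation: it uses 0-based indices, assumes "complementary" means the Watson-Crick pairs A-U and G-C, and replaces the recursion with an explicit stack. The names `fill`, `cell`, and `nussinov_pairs` are hypothetical. The matrix fill is repeated in compact form so that the block runs on its own.

```python
# Assumption: only the Watson-Crick pairs A-U and G-C count as complementary.
COMPLEMENTARY = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def cell(M, i, j):
    # Empty subsequences (j < i) score 0, mirroring M(i, i-1) = 0.
    return M[i][j] if i <= j else 0


def fill(seq):
    """Compact fill step: M[i][j] is the best score for seq[i..j]."""
    n = len(seq)
    M = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = cell(M, i, j - 1)
            for k in range(i, j):
                s = 1 if (seq[k], seq[j]) in COMPLEMENTARY else 0
                best = max(best, cell(M, i, k - 1) + cell(M, k + 1, j - 1) + s)
            M[i][j] = best
    return M


def nussinov_pairs(seq):
    """Return one optimal list of 0-based base pairs (k, j) for seq."""
    M = fill(seq)
    pairs = []
    stack = [(0, len(seq) - 1)]  # intervals still to be traced
    while stack:
        i, j = stack.pop()
        if j <= i:  # scenario 1: nothing left to pair
            continue
        if M[i][j] == cell(M, i, j - 1):  # scenario 2: seq[j] is unpaired
            stack.append((i, j - 1))
            continue
        for k in range(i, j):  # scenario 3: find the partner of seq[j]
            if (seq[k], seq[j]) in COMPLEMENTARY and \
                    M[i][j] == cell(M, i, k - 1) + cell(M, k + 1, j - 1) + 1:
                pairs.append((k, j))
                stack.append((i, k - 1))
                stack.append((k + 1, j - 1))
                break
    return pairs
```

Because several foldings may share the optimal score, this sketch returns only the first one it finds; taking the smallest qualifying `k` is an arbitrary tie-breaking choice.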

Limitations
The Nussinov algorithm does not account for the three-dimensional shape of RNA, nor does it predict RNA pseudoknots. Furthermore, in its basic form, it does not enforce a minimum stem-loop size. However, it is still useful as a fast algorithm for basic prediction of secondary structure.