User:Shanebarratt/sandbox



The three-base periodicity property in the field of Genomics is a property that is characteristic of protein-coding DNA sequences. The existence of this property can be shown by performing Fourier analysis on signals derived from segments of DNA sequences. Because of its predictive power, it has been used as a preliminary indicator in gene prediction.

DNA sequences are inherently signals as they are functions of an independent variable, position on the sequence. Thus, signal processing methods can be applied to them after the symbolic string is properly mapped to one (or more) numerical sequences. The reason for this periodicity is not known for sure, and is likely the result of a few interrelated factors.

History
This property has been dissected, tested and derived in a chronology of papers from different universities. The property was first observed as early as 1982, when Fickett observed periodicity in DNA sequences by looking at the autocorrelation function of individual nucleotide sequences and comparing them to that of a randomly generated sequence. Silverman and Linksker defined the Fourier transform of a sequence of bases, described how to "fourier analyze" it and proposed sample applications of this technique. Tsonis, Elsner and Tsonis did Fourier analysis ofn coding, non-coding and random sequences and proposed a reason for the 3-periodicity property found in coding sequences. Dodin proposed a method for analyzing the periodicity of DNA sequences based on the correlation function of the symbolic sequence. Tiwari, Ramachandran, Bhattacharya and Ramaswamy examined the signal-to-noise ratio of the period-3 peak within a sliding window over a sequence to identify likely coding regions.

Coding vs. Non-Coding DNA
DNA stores the information required to assemble, maintain and reproduce every living organism. A protein is a large molecule ("macromolecule") made up of smaller subunits, amino acids. DNA sequences are made up of codons, three-long nucleotide stretches, that correspond to specific amino acids. DNA creates RNA which then helps synthesize proteins. Thus, coding DNA is defined as the sections of the genome that are actually transcribed into amino acids in proteins. Noncoding DNA is sections of a DNA sequence that don't necessarily code for proteins. Identification of coding regions is important, as this information can be used in gene identification and then more generally full-genome annotation.

Formal Statement of the Property
The 3-periodicity property states that the spectral energy
 * $$|S[k]|^2 = |A[k]|^2 + |T[k]^2 + |C[k]|^2 + |G[k]|^2$$

derived from the DFT's of the four binary signals representing a DNA protein coding region of length $$N$$, exhibits a peak at discrete frequency $$k=N/3$$.

DNA Sequences as Numerical Sequences
Formally, a DNA sequence $$s[i]$$ is an ordered list of symbols from a dictionary of nucleotides $$(A,C,T,G)$$. There are multiple ways to view this as a numerical sequence, and they will all be applied to the same example:


 * $$s[i] = \{A, T, G, C, A, G, C\}$$


 * Binary Projection

Define a binary signal for each nucleotide, $$U_{\alpha}[i]$$, which is 0 when the i-th position in the sequence is the nucleotide $$\alpha$$ and 1 otherwise. Formally,


 * $$u_{\alpha}[i] = \begin{array}{cc}

\{ & \begin{array}{cc} 1 & s[i]\, \text{is} \,\alpha\\ 0 & \text{o/w} \end{array} \end{array}$$

This creates four signals which encode the position of the four nucleotides in the sequence. For the above example, the projected signals would be ATGCAGC, $$u_a$$ = 1000100, $$u_c$$ = 0001001, $$u_t$$ = 0100000, $$u_g$$ = 0010010


 * Correlation Function

Let $$E(i,j)$$ be a function that is 1 if $$s[i] = s[j]$$ and $$0$$ otherwise. Let $$c[t] = \sum_{i=t}^N E(i, i-t)$$. In other words, it counts the number of times that the nucleotide at a position is equal to the nucleotide t-away on the sequence. It is invariant to changing the labeling of the bases. $$c[0]$$ will always be equal to the length of the sequence. For the above example, the projected signals would be

$$c(0) = 7$$, $$c(1) = 0$$, $$c(2) = 0$$, $$c(3) = 2$$, $$c(4) = 1$$, $$c(5) = 0$$, $$c(6) = 0$$


 * Linear Map

For mathematical purposes, a gene sequence could be viewed as a signal by mapping each nucleotide into the range [1,4]. For the example above and map $$\{A -> 1, T -> 2, C -> 3, G -> 4\}$$, the sequence would manifest itself as $$1243143$$.

This process generates a single signal for the sequence, but raises questions such as which out of 24 (4!) maps should be used and what effect does this map have on our analysis. Unlike the other two mapping methods, this one is not invariant to changing the labeling of the bases but having the same structure (i.e. AACA -> TTGT), which could be detrimental for some applications.

Spectral Analysis of DNA Sequences
Once the DNA sequence has been converted into a numerical sequence, spectral analysis can be performed on that sequence. Recall the DFT is defined by the analysis equation

$$X[k] = \sum_{n=0}^{N-1} x[n] \cdot e^{-2 \pi i  k n / N},$$

and produces a N-long complex signal $$X[k]$$ that represents the spectrum of the signal. The index $$k$$ of the frequency response corresponds to an angular frequency $$\omega$$ of $$2 \pi k / N$$. For the 3-base periodicity property, the frequency content at $$k = N / 3$$ should be analyzed as it corresponds to a period of 3.

Recall that the power spectra of a sequence is equal to the magnitude of the frequency vector squared,

$$P[k] = |X[k]|^2$$

Tiwari applied the DFT to analysis of DNA sequences using the binary projection operator. They calculated the spectra of the four projected nucleotide sequences using the DFT. Call them $$C_{A}[k], C_{C}[k], C_{T}[k], C_{G}[k]$$. Then, the power spectra of all the signals were combined,

$$C[k] = |C_{A}[k]|^2 + |C_{C}[k]|^2 + |C_{T}[k]|^2 + |C_{G}[k]|^2.$$

Then, they calculated the Signal-to-Noise ratio of this signal and performed a threshold test on that value to determine whether or not that stretch of DNA is coding.

Anti-Notch Filter
Instead of computing a Discrete Fourier Transform on different segments of the signal, the analysis can be performed in the time-domain through the use of an anti-notch filter at frequency $$N/3$$, which is much faster and performs a similar operation. An anti-notch filter applied to the signals essentially approximates the magnitude of the DFT at $$N/3$$, and a similar method is used to threshold this value to determine whether or not the window is coding or non-coding. Vaidyanathan and Yoon implemented and evaluated this method. An ideal anti-notch filter at frequency $$N/3$$ has an impulse response of


 * $$h[n] = e^{2 \pi n j / 3}$$

and thus the frequency content of the output of a anti-notch filtered signal is $$X[k] = A[N/3]$$ if $$k=N/3$$ and 0 otherwise. By taking the magnitude of the time-domain signal, and invoking Parseval's Theorem, we get the magnitude of the frequency response. By the above logic, this is just $$|A[N/3]|^2$$ which is the quantity of interest.

Spectrograms
Spectrograms are a good way to view how the frequency content of a signal changes over time. The most common way to compute a spectrogram is to compute a Fourier transform over different segments of the signals, convert the frequency magnitude plot into an image, and concatenate those images. This is a useful way to visually identify coding and non-coding regions of DNA and to inspect other patterns that might exist.

As a Feature in Gene Identification
Identifying genes in a DNA sequence is harder than just finding what segments are coding and near impossible to identify by visually inspecting spectrograms. Genes are made up of both coding and non-coding regions, called introns and exons. Thus, the transition between coding and non-coding regions must be examined and analyzed properly to identify genes. Computing the "level" of 3-periodicity over different (possibly overlapping) windows of the sequence generates a plot of 3-periodicity over time.

These long stretches of coding vs. non-coding can then be classified as introns or exons and the entire segment heuristically labeled as gene or non-gene.

Why Can it Discriminate Coding from Non-Coding?
Binary signals can be parsed into something called a position count function (PCF), which counts the number of one's at phase $$s$$ at positions that are multiples of $$\omega$$. This is expressed mathematically for a binary signal $$A$$ as


 * $$C_w^A (s) = \sum_{i=0}^{\frac{N-1}{\omega}} A[\omega i + s].$$

For $$\omega=3$$, there are three PCFs $$(C_3^A(0)$$, $$C_3^A(1)$$ and $$C_3^A(2))$$ which each count the number of one's at positions $$\{0,3,6,...\}$$, $$\{1,4,7,...\}$$, $$\{2,5,8,...\}$$ respectively. It can be shown that the magnitude of the DFT at $$N/3$$ is in fact directly related to the PCF and is equal to


 * $$|\tilde{A}[N/3]|^2 = \frac{1}{2} [(C_3^A(0) - C_3^A(1))^2 + (C_3^A(1) - C_3^A(2))^2 + (C_3^A(2) - C_3^A(0))^2].$$

In other words, the spectral power at $$N/3$$ in the DFT depends only on the difference between the PCFs at different phases. This relation here is the fundamental to the explanation for why this method can identify coding regions. This is because the spectral power at $$N/3$$ will be determined by how the codons that make up the sequence are sampled.

If the codons are sampled uniformly at random, as they would be in a noncoding segment of DNA, there is a high chance that the PCFs would not differ by a significant amount and the power at that frequency will be low.

However, in a protein-coding sequence, the DNA sequence is made up of a string of codons which correspond to amino acids. Because the genetic code is degenerate (more than one codon map to a single amino acid) and samples from the amino acids rather than the codons, it is likely that the codons are not sampled uniformly thus leading to differences in the PCFs.

There is also empirical evidence for why this method works. In other words, over multiple studies this method has been able to discriminate coding vs. non-coding DNA segments. These are discussed in the next section.

Experiments
This method has been applied to sequence data from a number of organisms, details of which can be found in the references section. A few will be summarized here.

Tiwari, who wrote the paper to first apply DFT to analyzing periodicity of DNA sequences, applied this method to S.cerevisiae and H.influenzuae. For S.cerevisiae, they were able to locate 413 out of 483 probable genes (ORFs). For H.influenzuae, they were able to locate 167 out of 194 identified genes. In both studies, they had a zero false-positive rate.

Datta and Asif analyzed the algorithm's ability to identify coding regions of different lengths in chromosome III of C. elegans. Longer coding sequences are detected with higher probability. This seems to be a consequence of the Uncertainty principle (shorter-time signals spread out in frequency content) and the fact that fewer codons are provided in shorter sequences.

Pros

 * Requires no training data

The method can be run on any DNA sequence, where as other methods such as BLAST, FASTA and Smith-Waterman require empirical data.


 * Independent of variations in base compositions

This is because the total spectral power is


 * $$|S[k]|^2 = |A[k]|^2 + |T[k]^2 + |C[k]|^2 + |G[k]|^2,$$

and no one base contributes more than another.


 * Low false-positive rate


 * Robust to sequencing errors resulting in frameshifts

This is due to the property that shifting a sequence does not change the magnitude of its Discrete Fourier Transform.


 * $$x[n-n_0] \leftrightarrow X[k] e^{-j 2 \pi k n_0 / N}$$
 * $$|X[k] e^{-j 2 \pi k n_0 / N}| = |X[k]|$$

Cons

 * Three-base periodicity found to be lacking in some genes
 * Outperformed by modern empirical and specialized gene prediction methods

Binary Projection + Spectral Analysis
https://gist.github.com/sbarratt/03b3b04f6b779c0570b0b28958d1a860