Genetic saturation

Genetic saturation is the result of multiple substitutions at the same site in a sequence, or identical substitutions in different sequences, such that the apparent sequence divergence is lower than the actual divergence that has occurred. When two or more nucleotide sequences are compared, the observed differences reflect only the final state of each site. Nucleotides that undergo genetic saturation change multiple times, sometimes back to their original nucleotide or to the nucleotide present in the compared sequence. Without genetic information from intermediate taxa, it is difficult to know whether, and how much, saturation has occurred in an observed sequence. Genetic saturation occurs most rapidly in fast-evolving sequences, such as the hypervariable region of mitochondrial DNA, or in short tandem repeats such as those on the Y chromosome.

In phylogenetics, saturation effects result in long branch attraction, where the most distant lineages have misleadingly short branch lengths. Saturation also decreases the phylogenetic information contained in the sequences.

Multiple substitutions
Multiple substitutions take place when single nucleotides undergo multiple changes before reaching their final nucleotide identity. A sequence is said to be saturated when mutation has acted multiple times on the same nucleotides, so that the observed change in sequence is less than the historical change in sequence.
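
The gap between observed and historical change can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: every site is equally likely to be hit and every replacement nucleotide is equally likely. The function name `simulate_substitutions` and its parameters are hypothetical.

```python
import random

def simulate_substitutions(length, n_substitutions, seed=0):
    """Apply random substitutions to a copy of a random sequence and
    return the number of sites that differ from the ancestor."""
    rng = random.Random(seed)
    bases = "ACGT"
    ancestor = [rng.choice(bases) for _ in range(length)]
    descendant = list(ancestor)
    for _ in range(n_substitutions):
        site = rng.randrange(length)  # any site may be hit again
        # Assumption: each of the three other nucleotides is equally likely.
        descendant[site] = rng.choice([b for b in bases if b != descendant[site]])
    return sum(a != d for a, d in zip(ancestor, descendant))

# Observed differences lag further behind the true number of substitutions
# as that number grows on a 200-site sequence.
for n in (10, 100, 500, 2000):
    print(n, simulate_substitutions(200, n))
```

Under these assumptions the observed differences can never exceed the number of substitutions applied, and for large numbers of substitutions they are expected to level off near three quarters of the sites, however many further changes occur.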

Detection
It is possible to estimate the amount of saturation that a sequence might have undergone by estimating the substitution rate of the sequence and how much time has passed since divergence. Divergence rates are estimated from a variety of sources including ancestral DNA, fossil records and biogeographical events. This use of molecular clocks to determine divergence is controversial because of its potential for inaccuracy and the assumptions made in the model (such as a constant mutation rate for all branches), and it is used mostly as an estimation tool. Genetic saturation can also be estimated by comparing the number of observed differences in nucleotide sequences between multiple pairs of species. The number of observed substitutions between sequences of different species can be compared to the number of substitutions inferred from branch length, to find the approximate point where the number of inferred substitutions surpasses the number of observed substitutions. This method can give researchers an idea of the level of saturation of a particular gene, but it is thought to underestimate the amount of saturation, especially for very large branch lengths.
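
The comparison of observed and inferred substitutions can be sketched with a distance correction. The example below assumes the Jukes-Cantor (JC69) model, which the text above does not name and which is used here only because it is the simplest correction. The function name `jc69_distance` is hypothetical.

```python
import math

def jc69_distance(p):
    """Inferred substitutions per site from the observed proportion of
    differing sites p, assuming the Jukes-Cantor model."""
    if not 0 <= p < 0.75:
        # At p >= 0.75 the sequences are fully saturated under this model
        # and the distance can no longer be estimated.
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)

# The inferred distance pulls away from the observed one as p grows.
for p in (0.05, 0.2, 0.4, 0.6, 0.7):
    print(f"observed {p:.2f}  inferred {jc69_distance(p):.3f}")
```

Plotting observed against inferred distances for many pairs of species gives a curve that flattens where saturation sets in.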

Impact on phylogenetics
In the field of molecular phylogenetics, the distances and relationships between species are investigated by looking at the DNA, RNA or amino acid sequences of organisms. When phylogenetic trees are constructed without considering possible saturation, multiple substitutions can cause the distance between taxa to appear much smaller than the true distance. Multiple sequence alignment, a common technique used to construct phylogenies, relies on the comparison of homologous sequences. It can easily be confounded by genetic saturation because the homologous loci under investigation give no indication of whether more than one substitution at each nucleotide separates the taxa being described. Saturation decreases the amount of phylogenetic information that sequences can contain, especially when deep branches are involved. This is particularly evident in studies examining arthropod groups. Furthermore, saturation effects can lead to a gross underestimation of divergence time. This is mainly attributed to the randomization of the phylogenetic signal as mutations and substitutions accumulate. The effects of saturation can mask the true amount of divergence time, leading to inaccurate phylogenetic trees.



The principle of parsimony in genetic saturation analysis
Parsimony plays a fundamental role in genetic saturation analysis. This principle gives preference to the simplest explanation that can account for the data. With regard to genetic saturation, parsimony means that the preferred hypothesized relationship is the one that requires the smallest number of character changes. Using parsimony to analyze genetic saturation can lead to conflict when creating a phylogenetic tree. When only sequence data are used, it is possible to come up with numerous phylogenetic trees that are equally parsimonious.
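
Counting character changes on a candidate tree can be sketched as follows. The example uses the Fitch small-parsimony procedure for a single site on a rooted binary tree, which is an assumption made for illustration and is not named in the text above. The function name `fitch_score`, the nested-tuple tree format and the taxon names are hypothetical.

```python
def fitch_score(tree, states):
    """Minimum number of changes needed to explain one site.

    tree: nested 2-tuples of leaf names, e.g. (("A", "B"), ("C", "D"))
    states: dict mapping each leaf name to its nucleotide at the site
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:
            return common
        changes += 1  # the two subtrees disagree: one change is required
        return left | right

    visit(tree)
    return changes

tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))

informative = {"A": "G", "B": "G", "C": "T", "D": "T"}
print(fitch_score(tree_1, informative), fitch_score(tree_2, informative))  # 1 2

uninformative = {"A": "G", "B": "T", "C": "T", "D": "T"}
print(fitch_score(tree_1, uninformative), fitch_score(tree_2, uninformative))  # 1 1
```

The first site favors `tree_1`, while the second gives both trees the same score, which is how several trees can end up equally parsimonious.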

Long branch attraction
Genetic saturation contributes to long-branch attraction through its ability to greatly scramble genetic sequence without easily observable associated phenotypic changes. Long-branch attraction occurs when two relatively distant taxa appear to be closely related. The more substitutions that accumulate, the more likely it is for previously dissimilar sequences to share nucleotides and, as a result, to show apparent homology in phylogenetic tree calculations. Long-branch attraction due to saturation has been proposed as the cause of links in ancient phylogenies, and it calls into question even some of the earliest relationships between eukaryotes, archaea, and eubacteria.

Gene site saturation mutagenesis
Gene site saturation mutagenesis (GSSM) is a mutagenesis technique applied to one or more codons in a gene to create a library of variants covering all other codons at that position. It is used in biochemistry and protein engineering to explore the functions and characteristics of specific amino acid sequences. This systematic identification of amino acid substitutions allows researchers to look at every possible variant of each position. It provides crucial structural information about the protein of interest and identifies amino acid sequences that are more vital to the function of the protein.
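
The idea of a library covering all other codons at one position can be sketched in silico. This is only an illustration of the resulting set of sequences, not of the laboratory procedure. The function name `saturate_codon` and the nine-nucleotide example gene are hypothetical.

```python
from itertools import product

def saturate_codon(gene, codon_index):
    """Return every variant of `gene` in which the codon at `codon_index`
    (0-based) is replaced by one of the other 63 codons."""
    if len(gene) % 3 != 0:
        raise ValueError("gene length must be a multiple of 3")
    start = 3 * codon_index
    original = gene[start:start + 3]
    if len(original) != 3:
        raise IndexError("codon_index out of range")
    variants = []
    for bases in product("ACGT", repeat=3):
        codon = "".join(bases)
        if codon != original:
            variants.append(gene[:start] + codon + gene[start + 3:])
    return variants

# Hypothetical three-codon gene; the middle codon is saturated.
library = saturate_codon("ATGGCTAAA", 1)
print(len(library))  # 63
```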



Researchers often lean towards a one-step PCR-based approach to explore the specific effects of different variations in an amino acid of interest within a protein with GSSM. With a one-step PCR-based approach, researchers create a primer whose two ends have a sequence corresponding to the protein of interest. Only one codon of a three-codon amino acid sequence is substituted.

The type of codon set determines the number of sequences that can be derived from GSSM. To determine which codon set to use, researchers need to check the library quality at the DNA level, which means that a large amount of sequence data is needed. If each of the three codon positions can be replaced with any of the four nucleotides, researchers can code for all 20 amino acids. Although it is possible to code for all 20 amino acids in this way, it is not the most efficient method. The most efficient method is to use NNK codon degeneracy, also known as a limited codon set, in which N stands for any nucleotide and K for G or T. This method results in only 32 codons rather than 64.
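
The codon counts above can be checked by enumeration. The sketch below assumes the standard genetic code and the IUPAC meaning of K (G or T). The helper names `codon_set` and `summarize` are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def codon_set(third_position):
    """All codons with any base at positions 1 and 2 and one of the
    given bases at position 3."""
    return ["".join(c) for c in product("ACGT", "ACGT", third_position)]

def summarize(codons):
    """Return (number of codons, amino acids encoded, stop codons)."""
    encoded = [CODON_TABLE[c] for c in codons]
    amino_acids = {aa for aa in encoded if aa != "*"}
    return len(codons), len(amino_acids), encoded.count("*")

print("NNN", summarize(codon_set("ACGT")))  # (64, 20, 3)
print("NNK", summarize(codon_set("GT")))    # (32, 20, 1)
```

Under these assumptions the NNK set still encodes all 20 amino acids while halving the number of codons and leaving a single stop codon (TAG).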

Advantages of GSSM
In comparison to other techniques, GSSM offers unique advantages such as:
 * A complete analysis of every position in a given gene, which can be helpful in identifying critical positions. Critical positions are identified by analyzing the magnitude of the effects of mutagenesis, both positive and negative. GSSM can also identify positions that are more flexible, as mutagenesis at these positions will have less of an impact on the amino acid.
 * A residue-specific analysis, which allows researchers to create a schematic representation of the amino acid. This allows for more complex and detailed genetic research in further studies.
 * An ability to look at the effects of various amino acids without knowing any structural information about the protein. The data collected can then provide valuable insight into this area.
 * Fast delivery times and cost-efficiency.

GSSM opened up a whole frontier in genetic research, as it revolutionized fundamental beliefs about DNA. Before GSSM, researchers mutated DNA with radiation or with various chemicals, both of which are imprecise methods.