User:CNArmstrong/sandbox

Single-nucleotide polymorphism

In genetics, a single-nucleotide polymorphism (SNP /snɪp/; plural /snɪps/) is a substitution of a single nucleotide at a specific position in the genome that is present in a sufficiently large fraction of the population (e.g. 1% or more).[1]

For example, at a specific base position in the human genome, the C nucleotide may appear in most individuals, but in a minority of individuals, the position is occupied by an A. This means that there is a SNP at this specific position, and the two possible nucleotide variations – C or A – are said to be the alleles for this specific position.

SNPs pinpoint differences in our susceptibility to a wide range of diseases (e.g. sickle-cell anemia, β-thalassemia and cystic fibrosis).[2][3][4] The severity of illness and the way the body responds to treatments are also manifestations of genetic variations caused by SNPs. For example, a single-base mutation in the APOE (apolipoprotein E) gene is associated with a lower risk for Alzheimer's disease.[5]

A single-nucleotide variant (SNV) is a variation in a single nucleotide without any limitation on frequency. SNVs differ from SNPs in that a SNV detected in one organism could potentially be a SNP, but this cannot be determined from a single organism.[6][7] A SNP, however, means that the nucleotide varies within a species' population of organisms. SNVs may arise in somatic cells, in which case they are classified as somatic single-nucleotide variations or single-nucleotide alterations, and can be caused by cancer. SNVs also commonly arise in molecular diagnostics, such as the design of PCR primers to detect viruses, in which the viral RNA or DNA sample may contain SNVs.

Types of single-nucleotide polymorphism (SNPs)

Single-nucleotide polymorphisms may fall within coding sequences of genes, non-coding regions of genes, or in the intergenic regions (regions between genes). SNPs within a coding sequence do not necessarily change the amino acid sequence of the protein that is produced, due to degeneracy of the genetic code.

SNPs in the coding region are of two types: synonymous and nonsynonymous. Synonymous SNPs do not affect the protein sequence, while nonsynonymous SNPs change the amino acid sequence of the protein. Nonsynonymous SNPs are in turn of two types: missense and nonsense. All types of SNPs can have an observable phenotype:

·       SNPs in non-coding regions can manifest in a higher risk of cancer,[35] and may affect mRNA structure and disease susceptibility.[36] Non-coding SNPs can also alter the level of expression of a gene, as an eQTL (expression quantitative trait locus).

·       SNPs in coding regions:

o  Synonymous substitutions:

§  Do not change the amino acid in the protein, but can still affect its function in other ways. An example is a seemingly silent mutation in the multidrug resistance gene 1 (MDR1), which codes for a cellular membrane pump that expels drugs from the cell. The mutation can slow down translation and allow the peptide chain to fold into an unusual conformation, making the mutant pump less functional. In the MDR1 protein, for example, the C1236T polymorphism changes a GGC codon to GGT at amino acid position 412 of the polypeptide (both encode glycine), and the C3435T polymorphism changes ATC to ATT at position 1145 (both encode isoleucine).[37]

o  Nonsynonymous substitutions:

§  Missense: a single base change results in a change in an amino acid of the protein and in its malfunction, which leads to disease. For example, the c.1580G>T SNP in the LMNA gene replaces guanine with thymine at position 1580 (nt) of the DNA sequence, changing a CGT codon to CTT; at the protein level, arginine is replaced by leucine at position 527,[38] and at the phenotype level this manifests in overlapping mandibuloacral dysplasia and progeria syndrome.

§  Nonsense: a point mutation in a sequence of DNA that results in a premature stop codon, or nonsense codon, in the transcribed mRNA, and in a truncated, incomplete, and usually nonfunctional protein product (e.g. cystic fibrosis caused by the G542X mutation in the cystic fibrosis transmembrane conductance regulator gene).[39]
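The coding-region categories above can be illustrated with a short sketch. The Python example below is an illustration only and is not taken from any cited tool: it builds the standard genetic code table and labels a single-base codon change as synonymous, missense, or nonsense. The function name classify_coding_snp is hypothetical, and the GGA to TGA codon pair is used purely to show a glycine codon becoming a stop codon.

```python
from itertools import product

# Standard genetic code with codons enumerated in T, C, A, G order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(ref_codon: str, alt_codon: str) -> str:
    """Label a single-base codon change (hypothetical helper for illustration)."""
    ref_codon, alt_codon = ref_codon.upper(), alt_codon.upper()
    if ref_codon not in CODON_TABLE or alt_codon not in CODON_TABLE:
        raise ValueError("codons must be three DNA bases (A, C, G, T)")
    differences = sum(1 for r, a in zip(ref_codon, alt_codon) if r != a)
    if differences != 1:
        raise ValueError("a SNP changes exactly one base of the codon")

    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"   # same amino acid, protein sequence unchanged
    if alt_aa == "*":
        return "nonsense"     # premature stop codon
    # Any other amino acid change; a stop codon changing into an amino acid
    # is not discussed in the text and also ends up here.
    return "missense"


if __name__ == "__main__":
    print(classify_coding_snp("GGC", "GGT"))  # MDR1 C1236T codons from the text
    print(classify_coding_snp("CGT", "CTT"))  # LMNA c.1580G>T codons from the text
    print(classify_coding_snp("GGA", "TGA"))  # illustrative glycine-to-stop change
```

The sketch works on a single codon in isolation, so it says nothing about effects on translation speed, splicing, or folding, which the text notes can matter even for synonymous changes.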

SNPs that are not in protein-coding regions may still affect gene splicing, transcription factor binding, messenger RNA degradation, or the sequence of noncoding RNA. A SNP of this type that affects gene expression is referred to as an eSNP (expression SNP) and may lie upstream or downstream of the gene.

Applications

Association studies can determine whether a genetic variant is associated with a disease or trait.

A tag SNP is a representative single-nucleotide polymorphism in a region of the genome with high linkage disequilibrium (the non-random association of alleles at two or more loci). Tag SNPs are useful in whole-genome SNP association studies, in which hundreds of thousands of SNPs across the entire genome are genotyped.

Haplotype mapping: sets of alleles or DNA sequences can be clustered so that a single SNP can identify many linked SNPs.

Linkage disequilibrium (LD), a term used in population genetics, indicates non-random association of alleles at two or more loci, not necessarily on the same chromosome. It refers to the phenomenon that SNP alleles or DNA sequences that are close together in the genome tend to be inherited together. LD can be affected by two parameters (among other factors, such as population stratification): 1) the distance between the SNPs (the larger the distance, the lower the LD); 2) the recombination rate (the lower the recombination rate, the higher the LD).
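As a rough illustration of how non-random association between two SNPs can be quantified, the sketch below computes two commonly used LD statistics from a list of phased two-locus haplotypes: the coefficient D (the observed haplotype frequency minus the frequency expected under independence) and the squared correlation r². The function name, the input layout, and the example haplotypes are assumptions made for this example and do not come from the sources cited above.

```python
from typing import Iterable, Tuple


def linkage_disequilibrium(
    haplotypes: Iterable[Tuple[str, str]], allele_a: str, allele_b: str
) -> Tuple[float, float]:
    """Return (D, r_squared) for allele_a at locus 1 and allele_b at locus 2.

    Each haplotype is a (locus 1 allele, locus 2 allele) pair.
    """
    haps = list(haplotypes)
    n = len(haps)
    if n == 0:
        raise ValueError("at least one haplotype is required")

    p_a = sum(1 for h in haps if h[0] == allele_a) / n
    p_b = sum(1 for h in haps if h[1] == allele_b) / n
    p_ab = sum(1 for h in haps if h == (allele_a, allele_b)) / n

    # D is zero when the two alleles are inherited independently.
    d = p_ab - p_a * p_b
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("both loci must be polymorphic in the sample")
    return d, d * d / denominator


if __name__ == "__main__":
    # Alleles always travel together: complete LD.
    linked = [("C", "G"), ("C", "G"), ("A", "T"), ("A", "T")]
    # Every combination equally common: no LD.
    unlinked = [("C", "G"), ("C", "T"), ("A", "G"), ("A", "T")]
    print(linkage_disequilibrium(linked, "C", "G"))
    print(linkage_disequilibrium(unlinked, "C", "G"))
```

An r² close to 1 means one SNP predicts the other almost perfectly, which is the property that makes a tag SNP informative about its neighbours.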

Variations in the DNA sequences of humans can affect how humans develop diseases and respond to pathogens, chemicals, drugs, vaccines, and other agents. SNPs are also critical for personalized medicine. Applications include biomedical research, forensics, pharmacogenetics, and the study of disease causation, as outlined below.

More than 335 million SNPs have been found across humans from multiple populations. A typical genome differs from the reference human genome at 4 to 5 million sites, most of which (more than 99.9%) consist of SNPs and short indels.[10]

Within a genome

The genomic distribution of SNPs is not homogeneous; SNPs occur in non-coding regions more frequently than in coding regions or, in general, where natural selection is acting and "fixing" the allele (eliminating other variants) of the SNP that constitutes the most favorable genetic adaptation.[11] Other factors, like genetic recombination and mutation rate, can also determine SNP density.[12]

SNP density can be predicted by the presence of microsatellites: AT microsatellites in particular are potent predictors of SNP density, with long (AT)n repeat tracts tending to be found in regions of significantly reduced SNP density and low GC content.[13]

Within a population

There are variations between human populations, so a SNP allele that is common in one geographical or ethnic group may be much rarer in another. Within a population, SNPs can be assigned a minor allele frequency—the lowest allele frequency at a locus that is observed in a particular population.[14] This is simply the lesser of the two allele frequencies for single-nucleotide polymorphisms.
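The minor allele frequency described above can be computed directly from the alleles observed in a population sample. The sketch below assumes a biallelic SNP and a flat list of observed alleles (two per diploid individual); the function name and the example counts are hypothetical.

```python
from collections import Counter
from typing import Iterable


def minor_allele_frequency(alleles: Iterable[str]) -> float:
    """Return the lesser of the two allele frequencies at a biallelic SNP."""
    counts = Counter(alleles)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no alleles were observed")
    if len(counts) > 2:
        raise ValueError("this sketch only handles biallelic SNPs")
    if len(counts) < 2:
        return 0.0  # only one allele seen: the site is monomorphic in this sample
    return min(counts.values()) / total


if __name__ == "__main__":
    # Hypothetical sample: 90 copies of C and 10 copies of A at one position.
    sample = ["C"] * 90 + ["A"] * 10
    print(minor_allele_frequency(sample))
```

Because allele frequencies differ between populations, the same SNP can give different values depending on which population the sample is drawn from.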

With this knowledge, scientists have developed new methods for analyzing population structures in less-studied species.[15][16][17] Pooling techniques significantly lower the cost of the analysis.[citation needed] These techniques are based on sequencing a population in a pooled sample instead of sequencing every individual within the population separately. With new bioinformatics tools, it is possible to investigate population structure, gene flow and gene migration by observing the allele frequencies within the entire population. These protocols also make it possible to combine the advantages of SNPs with those of microsatellite markers.[18][19] However, some information is lost in the process, such as linkage disequilibrium and zygosity information.

Clinical research

The greatest importance of SNPs in clinical research is in comparing regions of the genome between cohorts (such as matched cohorts with and without a disease) in genome-wide association studies. SNPs have been used in genome-wide association studies as high-resolution markers in gene mapping related to diseases or normal traits. SNPs without an observable impact on the phenotype (so-called silent mutations) are still useful as genetic markers in genome-wide association studies, because of their quantity and their stable inheritance over generations.

Forensics

SNPs were historically used to match a forensic DNA sample to a suspect, but this use has been made obsolete by advancing STR-based DNA fingerprinting techniques. However, the development of next-generation sequencing (NGS) technology may allow more opportunities for the use of SNPs in predicting phenotypic clues such as ethnicity, hair color, and eye color with a good probability of a match.[citation needed] This can additionally be applied to increase the accuracy of facial reconstructions by providing information that may otherwise be unknown, and this information can be used to help identify suspects even without an STR DNA profile match.

Some disadvantages of using SNPs rather than STRs are that SNPs yield less information than STRs, so more SNPs are needed for analysis before a profile of a suspect can be created. Additionally, SNPs rely heavily on the presence of a database for comparative analysis of samples. However, for degraded or small-volume samples, SNP techniques are an excellent alternative to STR methods. SNPs (as opposed to STRs) offer an abundance of potential markers, can be fully automated, and allow a possible reduction of the required fragment length to less than 100 bp.[26]

Pharmacogenetics

Some SNPs are associated with the metabolism of different drugs.[28][29][30] SNPs can be mutations, such as deletions, which can inhibit or promote enzymatic activity; such a change in enzymatic activity can lead to decreased rates of drug metabolism.[31] The association of a wide range of human diseases, such as cancer, infectious diseases (AIDS, leprosy, hepatitis, etc.), autoimmune, neuropsychiatric and many other diseases, with different SNPs makes these SNPs relevant pharmacogenomic targets for drug therapy.[32]

Disease

A single SNP may cause a Mendelian disease. For complex diseases, however, SNPs do not usually function individually; rather, they work in coordination with other SNPs to manifest a disease, as in osteoporosis.[33] One of the earliest successes in this field was finding a single-base mutation in the non-coding region of APOC3 (the apolipoprotein C3 gene) that is associated with higher risks of hypertriglyceridemia and atherosclerosis.[34] Some diseases caused by SNPs include rheumatoid arthritis, Crohn's disease, breast cancer, Alzheimer's disease, and some autoimmune disorders. Large-scale association studies have been performed to attempt to discover additional disease-causing SNPs within a population, but a large number of them are still unknown.

Examples

rs6311 and rs6313 are SNPs in the Serotonin 5-HT2A receptor gene on human chromosome 13.[40]

A SNP in the F5 gene causes Factor V Leiden thrombophilia.[41]

rs3091244 is an example of a triallelic SNP in the CRP gene on human chromosome 1.[42]

TAS2R38 codes for PTC tasting ability, and contains 6 annotated SNPs.[43]

rs148649884 and rs138055828 in the FCN1 gene encoding M-ficolin crippled the ligand-binding capability of the recombinant M-ficolin.[44]

A SNP in the DNA mismatch repair gene PMS2 (rs1059060, Ser775Asn) is associated with increased sperm DNA damage and risk of male infertility.[45]

Databases

As there are for genes, bioinformatics databases exist for SNPs.

dbSNP is a SNP database from the National Center for Biotechnology Information (NCBI). As of June 8, 2015, dbSNP listed 149,735,377 SNPs in humans.[46][47]

Kaviar[48] is a compendium of SNPs from multiple data sources including dbSNP.

SNPedia is a wiki-style database supporting personal genome annotation, interpretation and analysis.

The OMIM database describes the association between polymorphisms and diseases (e.g., gives diseases in text form)

dbSAP – single amino-acid polymorphism database for protein variation detection[49]

The Human Gene Mutation Database provides gene mutations causing or associated with human inherited diseases and functional SNPs

The International HapMap Project, where researchers are identifying Tag SNPs to be able to determine the collection of haplotypes present in each subject.

GWAS Central allows users to visually interrogate the actual summary-level association data in one or more genome-wide association studies.

The International SNP Map working group mapped the sequence flanking each SNP by alignment to the genomic sequence of large-insert clones in GenBank. These alignments were converted to chromosomal coordinates, which are shown in Table 1.[50] This list has greatly increased since, with, for instance, the Kaviar database now listing 162 million single nucleotide variants (SNVs).

Chromosome  Length (bp)    All SNPs                 TSC SNPs
                           Total SNPs  kb per SNP   Total SNPs  kb per SNP
1           214,066,000    129,931     1.65         75,166      2.85
2           222,889,000    103,664     2.15         76,985      2.90
3           186,938,000    93,140      2.01         63,669      2.94
4           169,035,000    84,426      2.00         65,719      2.57
5           170,954,000    117,882     1.45         63,545      2.69
6           165,022,000    96,317      1.71         53,797      3.07
7           149,414,000    71,752      2.08         42,327      3.53
8           125,148,000    57,834      2.16         42,653      2.93
9           107,440,000    62,013      1.73         43,020      2.50
10          127,894,000    61,298      2.09         42,466      3.01
11          129,193,000    84,663      1.53         47,621      2.71
12          125,198,000    59,245      2.11         38,136      3.28
13          93,711,000     53,093      1.77         35,745      2.62
14          89,344,000     44,112      2.03         29,746      3.00
15          73,467,000     37,814      1.94         26,524      2.77
16          74,037,000     38,735      1.91         23,328      3.17
17          73,367,000     34,621      2.12         19,396      3.78
18          73,078,000     45,135      1.62         27,028      2.70
19          56,044,000     25,676      2.18         11,185      5.01
20          63,317,000     29,478      2.15         17,051      3.71
21          33,824,000     20,916      1.62         9,103       3.72
22          33,786,000     28,410      1.19         11,056      3.06
X           131,245,000    34,842      3.77         20,400      6.43
Y           21,753,000     4,193       5.19         1,784       12.19
RefSeq      15,696,674     14,534      1.08
Totals      2,710,164,000  1,419,190   1.91         887,450     3.05
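The "kb per SNP" columns in the table appear to be the sequence length divided by the number of SNPs, expressed in kilobases. The sketch below checks that reading against a few rows; the function name is hypothetical, and the interpretation of the column is an assumption inferred from the figures rather than something stated in the source.

```python
def kb_per_snp(length_bp: int, total_snps: int) -> float:
    """Average spacing between SNPs, in kilobases (assumed column definition)."""
    if total_snps <= 0:
        raise ValueError("total_snps must be positive")
    return length_bp / 1000 / total_snps


if __name__ == "__main__":
    # (label, length in bp, total SNPs) taken from the "All SNPs" columns above.
    rows = [
        ("1", 214_066_000, 129_931),
        ("Y", 21_753_000, 4_193),
        ("Totals", 2_710_164_000, 1_419_190),
    ]
    for label, length_bp, total_snps in rows:
        print(label, round(kb_per_snp(length_bp, total_snps), 2))
```

A lower value means SNPs are more densely spaced, so in this table the X and Y chromosomes are the most sparsely covered.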

Nomenclature

The nomenclature for SNPs includes several variations for an individual SNP and lacks a common consensus.

The rs### standard is that which has been adopted by dbSNP and uses the prefix "rs", for "reference SNP", followed by a unique and arbitrary number.[51] SNPs are frequently referred to by their dbSNP rs number, as in the examples above.

The Human Genome Variation Society (HGVS) uses a standard which conveys more information about the SNP. Examples are:

c.76A>T: "c." for coding region, followed by a number for the position of the nucleotide, followed by a one-letter abbreviation for the nucleotide (A, C, G, T or U), followed by a greater than sign (">") to indicate substitution, followed by the abbreviation of the nucleotide which replaces the former[52][53][54]

p.Ser123Arg: "p." for protein, followed by a three-letter abbreviation for the amino acid, followed by a number for the position of the amino acid, followed by the abbreviation of the amino acid which replaces the former.[55]
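To make the structure of these two notations concrete, the sketch below parses only the simple substitution forms described above (such as c.76A>T and p.Ser123Arg) with regular expressions. It is a deliberately minimal illustration: the full HGVS recommendations cover many more variant types and prefixes, and the function name and returned field names are assumptions made for this example.

```python
import re
from typing import Dict, Union

# "c." + position + reference base + ">" + replacement base
_CODING = re.compile(r"^c\.(\d+)([ACGTU])>([ACGTU])$")
# "p." + three-letter amino acid + position + three-letter amino acid
_PROTEIN = re.compile(r"^p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$")


def parse_substitution(notation: str) -> Dict[str, Union[str, int]]:
    """Split a simple coding or protein substitution into its parts."""
    match = _CODING.match(notation)
    if match:
        return {
            "level": "coding",
            "position": int(match.group(1)),
            "ref": match.group(2),
            "alt": match.group(3),
        }
    match = _PROTEIN.match(notation)
    if match:
        return {
            "level": "protein",
            "position": int(match.group(2)),
            "ref": match.group(1),
            "alt": match.group(3),
        }
    raise ValueError(f"unsupported notation: {notation!r}")


if __name__ == "__main__":
    print(parse_substitution("c.76A>T"))
    print(parse_substitution("p.Ser123Arg"))
```

Anything outside these two patterns, including an rs number, raises a ValueError rather than being guessed at.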

SNP analysis

SNPs can be easily assayed because they typically have only two possible alleles and three possible genotypes involving the two alleles (homozygous A, homozygous B and heterozygous AB), which has led to many possible techniques for analysis. Some include: DNA sequencing; capillary electrophoresis; mass spectrometry; single-strand conformation polymorphism (SSCP); single base extension; electrochemical analysis; denaturing HPLC and gel electrophoresis; restriction fragment length polymorphism; and hybridization analysis.

Programs for prediction of SNP effects

An important group of SNPs are those that correspond to missense mutations causing an amino acid change at the protein level. A point mutation of a particular residue can have different effects on protein function (from no effect to complete disruption of its function). Usually, a change between amino acids with similar size and physico-chemical properties (e.g. substitution of leucine with valine) has a mild effect, and vice versa. Similarly, if a SNP disrupts secondary structure elements (e.g. substitution to proline in an alpha helix region), such a mutation may affect the whole protein structure and function. Using these simple rules and many other rules derived by machine learning, a group of programs for the prediction of SNP effects was developed:[61]

SIFT: this program provides insight into how a laboratory-induced missense or nonsynonymous mutation will affect protein function, based on physical properties of the amino acid and sequence homology.

LIST[62][63] (Local Identity and Shared Taxa) estimates the potential deleteriousness of mutations resulting from the alteration of their protein functions. It is based on the assumption that variations observed in closely related species are more significant when assessing conservation compared to those in distantly related species.

SNAP2

SuSPect

PolyPhen-2

PredictSNP

MutationTaster

Variant Effect Predictor from the Ensembl project

SNPViz:[64] this program provides a 3D representation of the protein affected, highlighting the amino acid change so doctors can determine pathogenicity of the mutant protein.

PROVEAN

PhyreRisk is a database which maps variants to experimental and predicted protein structures.[65]

Missense3D is a tool which provides a stereochemical report on the effect of missense variants on protein structure.[66]

See also

Affymetrix

HapMap

Illumina

International HapMap Project

Short tandem repeat (STR)

Single-base extension

SNP array

SNP genotyping

SNPedia

Snpstr

SNV calling from NGS data

Tag SNP

TaqMan

Variome