Tag SNP

A tag SNP is a single-nucleotide polymorphism (SNP), located in a region of the genome with high linkage disequilibrium, that stands in for a group of SNPs called a haplotype. Tag SNPs make it possible to identify genetic variation and its association with phenotypes without genotyping every SNP in a chromosomal region. This reduces the expense and time of mapping genome areas associated with disease, since not every individual SNP has to be studied. Tag SNPs are useful in whole-genome SNP association studies, in which hundreds of thousands of SNPs across the entire genome are genotyped.

Linkage disequilibrium


Two loci are said to be in linkage equilibrium (LE) if their alleles are inherited independently of each other. If the alleles at those loci are inherited non-randomly, the loci are said to be in linkage disequilibrium (LD). LD is most commonly caused by physical linkage of genes. When two genes lie on the same chromosome, they can be in high LD, depending on the distance between them and the likelihood of recombination between the loci. However, LD can also arise from functional interactions, where even genes on different chromosomes jointly confer an evolutionarily selected phenotype or affect the viability of potential offspring.
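
As a rough illustration of how LD between two loci is commonly quantified, the sketch below computes the coefficient D = p_AB - p_A * p_B and the squared correlation r² from phased haplotypes. The function name `ld_stats`, the 0/1 allele coding, and the toy data are assumptions made for this example.

```python
def ld_stats(haplotypes, i, j):
    """Return (D, r2) for biallelic loci i and j, coded 0/1 in phased haplotypes."""
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n   # frequency of allele 1 at locus i
    p_b = sum(h[j] for h in haplotypes) / n   # frequency of allele 1 at locus j
    # Frequency of haplotypes carrying allele 1 at both loci
    p_ab = sum(1 for h in haplotypes if h[i] == 1 and h[j] == 1) / n
    d = p_ab - p_a * p_b                      # zero under linkage equilibrium
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0  # undefined for monomorphic loci
    return d, r2


# Toy data (hypothetical): loci 0 and 1 always travel together, locus 2 is independent.
haps = [(0, 0, 1), (0, 0, 0), (1, 1, 0), (1, 1, 1)]
print(ld_stats(haps, 0, 1))  # complete LD
print(ld_stats(haps, 0, 2))  # linkage equilibrium
```

In this toy data set, loci 0 and 1 carry the same information, so either could serve as a tag for the other, while locus 2 would need to be genotyped separately.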

LD is highest within families, because few meioses, and therefore few recombination events, separate the individuals. This is especially true between inbred lines. In populations, LD exists because of selection, physical closeness of genes that leads to low recombination rates, or recent crossing or migration. At the population level, processes that influence linkage disequilibrium include genetic linkage, epistatic natural selection, the rate of recombination, mutation, genetic drift, the mating system, genetic hitchhiking and gene flow.

When a group of SNPs is inherited together because of high LD, the SNPs tend to carry redundant information. Selecting a tag SNP as a representative of such a group reduces the redundancy when analyzing parts of the genome associated with traits or diseases. The regions of the genome in high LD that harbor a specific set of SNPs inherited together are also known as haplotypes. Tag SNPs are therefore representative of all SNPs within a haplotype.

Haplotypes
The selection of tag SNPs depends on the haplotypes present in the genome. Most sequencing technologies provide genotypes rather than haplotypes: they report which bases are present but give no phase information (on which specific chromosome each base appears). Haplotypes can be determined through molecular methods (allele-specific PCR, somatic cell hybrids). These methods distinguish which allele is present on which chromosome by separating the chromosomes before genotyping. They can be very time-consuming and expensive, so statistical inference methods have been developed as a less expensive, automated option. These statistical-inference software packages use parsimony, maximum likelihood, and Bayesian algorithms to determine haplotypes. A disadvantage of statistical inference is that a proportion of the inferred haplotypes could be wrong.
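
To make the phase problem concrete, the following sketch enumerates every pair of haplotypes that is consistent with one unphased genotype. The genotype coding (0, 1 or 2 copies of allele 1 per site) and the function name `compatible_haplotype_pairs` are assumptions for illustration. Real phasing software chooses among such candidates using population-level information.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """List unordered haplotype pairs consistent with an unphased genotype.

    genotype[k] is the number of copies (0, 1 or 2) of allele 1 at site k.
    """
    het_sites = [k for k, g in enumerate(genotype) if g == 1]
    pairs = set()
    for bits in product((0, 1), repeat=len(het_sites)):
        # Homozygous sites are fixed: 0 copies -> allele 0, 2 copies -> allele 1
        h1 = [g // 2 for g in genotype]
        h2 = list(h1)
        # Each heterozygous site can be phased in two ways
        for k, b in zip(het_sites, bits):
            h1[k], h2[k] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


# Two heterozygous sites give two possible phasings.
for pair in compatible_haplotype_pairs((1, 1, 2)):
    print(pair)
```

A genotype with k heterozygous sites (k ≥ 1) is compatible with 2^(k-1) haplotype pairs, which is why phase cannot be read directly from genotype data.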

Population differences
When haplotypes are used for genome-wide association studies, it is important to note the population being studied, because different populations often have different patterns of LD. One example is the contrast between African-descended populations and European- and Asian-descended populations. Since humans originated in Africa and spread into Europe and then the Asian and American continents, African populations are the most genetically diverse and have smaller regions of LD, while European- and Asian-descended populations have larger regions of LD due to the founder effect. When LD patterns differ between populations, SNPs that are associated in one population may not be associated in another, because the haplotype blocks differ. Tag SNPs, as representatives of the haplotype blocks, are therefore population-specific, and population differences should be taken into account when performing association studies.

GWAS
Almost every trait has both genetic and environmental influences. Heritability is the proportion of phenotypic variance in a population that is attributable to genetic variation. Association studies are used to determine the genetic influence on phenotypic presentation. Although mostly used for mapping diseases to genomic regions, they can also be used to map the genetic basis of any phenotype, such as height or eye color.

Genome-wide association studies (GWAS) use single-nucleotide polymorphisms (SNPs) to identify genetic associations with clinical conditions and phenotypic traits. They are hypothesis-free and take a whole-genome approach, comparing a large group of individuals who express a phenotype with a large group who do not. The ultimate goal of GWAS is to determine genetic risk factors that can be used to predict who is at risk for a disease, to understand the biological underpinnings of disease susceptibility, and to create new prevention and treatment strategies. The National Human Genome Research Institute and the European Bioinformatics Institute publish the GWAS Catalog, a catalog of published genome-wide association studies that highlights statistically significant associations between hundreds of SNPs and a broad range of phenotypes.
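
The case-control comparison at a single SNP can be sketched as a Pearson chi-square test on a 2x2 table of allele counts. This is one simple test among several used in practice. The function name `allelic_chi_square` and the counts below are hypothetical, and the sketch assumes every expected cell count is non-zero.

```python
def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square statistic (1 degree of freedom) for a 2x2 allele-count table."""
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    total = case_alt + case_ref + control_alt + control_ref
    row_sums = [sum(row) for row in table]
    col_sums = [case_alt + control_alt, case_ref + control_ref]
    chi2 = 0.0
    for r in range(2):
        for c in range(2):
            # Expected count if allele and disease status were independent
            expected = row_sums[r] * col_sums[c] / total
            chi2 += (table[r][c] - expected) ** 2 / expected
    return chi2


# Hypothetical counts: the alternate allele is enriched among cases.
print(allelic_chi_square(60, 40, 40, 60))
# No difference between cases and controls gives a statistic of zero.
print(allelic_chi_square(50, 50, 50, 50))
```

In a genome-wide setting the same test is repeated for every genotyped SNP, so the significance threshold has to be corrected for the very large number of tests.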



Due to the large number of possible SNP variants (more than 149 million as of June 2015), it is still very expensive to genotype every SNP. That is why GWAS use customizable arrays (SNP chips) to genotype only a subset of the variants, identified as tag SNPs. Most GWAS use products from the two primary genotyping platforms. The Affymetrix platform prints DNA probes on a glass or silicon chip; the probes hybridize to specific alleles in the sample DNA. The Illumina platform uses bead-based technology with longer DNA sequences, which gives better specificity. Both platforms can genotype more than a million tag SNPs using either pre-made or custom DNA oligos.

Genome-wide studies are predicated on the common disease-common variant (CD/CV) hypothesis, which states that common disorders are influenced by common genetic variation. Under this hypothesis, the effect size (penetrance) of common variants must be small relative to that of variants found in rare disorders. This means that a common SNP can explain only a small portion of the variance due to genetic factors, and that common diseases are influenced by multiple common alleles of small effect size. Another hypothesis is that common diseases are caused by rare variants that are synthetically linked to common variants. In that case the GWAS signal is an indirect (synthetic) association that arises from one or more rare causal variants in linkage disequilibrium with the common variant. It is important to recognize that this phenomenon is possible when selecting a group of tag SNPs. When a disease is found to be associated with a haplotype, some SNPs in that haplotype will have a synthetic association with the disease. Pinpointing the causal SNPs requires greater resolution in the selection of haplotype blocks. Since whole-genome sequencing technologies are rapidly changing and becoming less expensive, it is likely that they will replace current genotyping technologies and provide the resolution needed to pinpoint causal variants.

HapMap
Because whole-genome sequencing of individuals is still cost-prohibitive, the International HapMap Project was set up with the goal of mapping the human genome to haplotype groupings (haplotype blocks) that describe common patterns of human genetic variation. By mapping the entire genome to haplotypes, tag SNPs can be identified to represent the haplotype blocks examined by genetic studies. An important factor to consider when planning a genetic study is the frequency of specific alleles and the risk they incur. These factors can vary between populations, so the HapMap project used a variety of sequencing techniques to discover and catalog SNPs from different sets of populations. Initially the project sequenced individuals from the Yoruba population of African origin (YRI), residents of Utah with western European ancestry (CEU), unrelated individuals from Tokyo, Japan (JPT) and unrelated Han Chinese individuals from Beijing, China (CHB). The datasets were later expanded to include other populations (11 groups).

Steps for tag SNP selection
Selecting maximally informative tag SNPs is an NP-complete problem. However, algorithms can be devised that provide an approximate solution within a margin of error. The criteria needed to define a tag SNP selection algorithm are the following:


 * 1) Define area to search - the algorithm will attempt to locate tag SNPs in neighborhood N(t) of a target SNP t
 * 2) Define a metric to assess the quality of tagging - the metric needs to measure how well a target SNP t can be predicted from a set of its neighbors N(t), i.e. how well a tag SNP chosen as a representative of the SNPs in the neighborhood N(t) can predict the target SNP t. It can be defined as the probability that, for a pair of haplotypes i and j in which the target SNP t has different values, the SNP s also has different values. The informativeness of the metric can be expressed in graph-theoretic terms, where every SNP s is represented as a graph Gs whose nodes are haplotypes. Gs has an edge between the nodes (i,j) if and only if the values of s are different for the haplotypes Hi, Hj.
 * 3) Derive the algorithm to find representative SNPs - the goal of the algorithm is to find the smallest subset of tag SNPs that is maximally informative about every other target SNP
 * 4) Validate the algorithm
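
The graph-based metric in step 2 can be sketched as follows, under one reading of the definition: the informativeness of a tag set for a target SNP t is taken to be the fraction of edges of Gt that also appear in the graph of at least one tag. The function names `edges` and `informativeness` and the toy haplotypes are assumptions for this example.

```python
from itertools import combinations


def edges(haplotypes, snp):
    """Edge set of G_s: haplotype pairs (i, j) whose alleles differ at `snp`."""
    return {(i, j) for i, j in combinations(range(len(haplotypes)), 2)
            if haplotypes[i][snp] != haplotypes[j][snp]}


def informativeness(haplotypes, tags, target):
    """Fraction of the target's edges that are covered by at least one tag SNP."""
    target_edges = edges(haplotypes, target)
    if not target_edges:
        return 1.0  # monomorphic target: there is nothing to distinguish
    covered = set()
    for s in tags:
        covered |= edges(haplotypes, s) & target_edges
    return len(covered) / len(target_edges)


# Toy haplotypes (hypothetical): SNP 2 is the complement of SNP 0.
haps = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
print(informativeness(haps, [0], 2))  # SNP 0 distinguishes every pair SNP 2 does
print(informativeness(haps, [1], 2))  # SNP 1 covers only half of those pairs
```

Under this reading, a selection algorithm (step 3) would look for the smallest tag set whose informativeness is high for every target SNP in the neighborhood.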

Feature selection
Methods for selecting features fall into two categories: filter methods and wrapper methods. Filter algorithms are general preprocessing algorithms that do not assume the use of a specific classification method. Wrapper algorithms, in contrast, “wrap” the feature selection around a specific classifier and select a subset of features based on the classifier's accuracy using cross-validation.

The feature selection method suitable for selecting tag SNPs must have the following characteristics:
 * scale well for a large number of SNPs;
 * not require explicit class labeling and should not assume the use of a specific classifier because classification is not the goal of tagging SNP selection;
 * allow the user to select different numbers of tag SNPs for different amounts of tolerated information loss;
 * have comparable performance with other methods satisfying the first three conditions.

Selection algorithms
Several algorithms have been proposed for selecting tag SNPs. The first approach was based on a measure of the goodness of SNP sets and searched for SNP subsets that are small but attain a high value of the defined measure. Examining every SNP subset to find good ones is computationally feasible only for small data sets. Another approach uses principal component analysis (PCA) to find subsets of SNPs that capture the majority of the data variance. A sliding-window method is employed to apply PCA repeatedly to short chromosomal regions. This reduces the data produced and does not require exponential search time, yet it is still not feasible to apply the PCA method to large chromosomal data sets because of its computational cost. The most commonly used approach, the block-based method, exploits the linkage disequilibrium observed within haplotype blocks. Several algorithms have been devised to partition chromosomal regions into haplotype blocks, based on haplotype diversity, LD, the four-gamete test, or information complexity; tag SNPs are then selected from among the SNPs that belong to each block. The main presumption of this approach is that the SNPs are biallelic. The main drawback is that the definition of blocks is not always straightforward: even though there is a list of criteria for forming haplotype blocks, there is no consensus on them. In addition, selecting tag SNPs from local correlations ignores inter-block correlations.
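
One of the block criteria mentioned above, the four-gamete test, can be sketched briefly. For two biallelic sites, observing all four allele combinations is taken as evidence of a historical recombination between them. The greedy left-to-right partition below is an illustrative simplification, and the function names are assumptions, not those of a published tool.

```python
def passes_four_gamete_test(haplotypes, i, j):
    """True if fewer than four distinct two-site gametes are observed at sites i and j."""
    gametes = {(h[i], h[j]) for h in haplotypes}
    return len(gametes) < 4


def four_gamete_blocks(haplotypes):
    """Greedily split sites into consecutive blocks in which every pair of
    sites passes the four-gamete test."""
    n_sites = len(haplotypes[0])
    blocks, current = [], [0]
    for site in range(1, n_sites):
        if all(passes_four_gamete_test(haplotypes, prev, site) for prev in current):
            current.append(site)      # no evidence of recombination: extend the block
        else:
            blocks.append(current)    # close the block and start a new one
            current = [site]
    blocks.append(current)
    return blocks


# Toy haplotypes (hypothetical): sites 0 and 1 are in one block, site 2 recombined.
haps = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
print(four_gamete_blocks(haps))
```

Different block criteria generally produce different partitions of the same data, which is one source of the lack of consensus noted above.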

Unlike the block-based approach, a block-free approach does not rely on block structure. SNP frequency and recombination rates are known to vary across the genome, and some studies have reported LD over distances much longer than the reported maximum block sizes. Setting a strict border for the neighborhood is therefore undesirable, and the block-free approach looks for tag SNPs globally. Several algorithms do this. In one, the non-tagging SNPs are represented as Boolean functions of tag SNPs, and set-theory techniques are used to reduce the search space. Another searches for subsets of markers that can come from non-consecutive blocks; restricting the search to the neighborhood of each marker reduces the search space.

Optimizations
As the number of genotyped individuals and the number of SNPs in databases grow, tag SNP selection takes too much time to compute. To improve the efficiency of tag SNP selection, the algorithm first sets aside the assumption that SNPs are biallelic, and then compresses the length (number of SNPs) of the haplotype matrix by grouping the SNP sites that carry the same information. SNP sites that partition the haplotypes into the same groups are called redundant sites. SNP sites that contain distinct information within a block are called non-redundant sites (NRS). To compress the haplotype matrix further, the algorithm needs to find tag SNPs such that all haplotypes of the matrix can be distinguished. Using the idea of joint partition, an efficient tag SNP selection algorithm can be obtained.
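
The compression step can be illustrated for the biallelic case: two sites are redundant when they split the haplotypes into the same groups, that is, when their columns in the haplotype matrix are identical or complementary. The function name `nonredundant_sites` and the 0/1 coding are assumptions for this sketch, which is not the published algorithm.

```python
def nonredundant_sites(haplotypes):
    """Keep the first site from each group of sites that induce the same
    partition of the haplotypes (biallelic 0/1 coding assumed)."""
    seen, kept = set(), []
    for site in range(len(haplotypes[0])):
        column = tuple(h[site] for h in haplotypes)
        # Equal and complementary columns induce the same partition, so
        # normalise each column so that the first haplotype carries allele 0.
        key = column if column[0] == 0 else tuple(1 - a for a in column)
        if key not in seen:
            seen.add(key)
            kept.append(site)
    return kept


# Toy haplotype matrix (hypothetical): site 1 is the complement of site 0.
haps = [(0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 0, 1)]
print(nonredundant_sites(haps))
```

Only the retained sites need to be considered in the subsequent search for tag SNPs, which shortens the haplotype matrix without losing any distinguishing information.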

Validation of the accuracy of the algorithm
Depending on how the tag SNPs are selected, different prediction methods have been used during the cross-validation process. One approach employed a machine learning method to predict the left-out haplotype. Another approach predicted the alleles of a non-tagging SNP n from the tag SNPs that had the highest correlation coefficient with n. If a single highly correlated tag SNP t is found, the alleles are assigned so that their frequencies agree with the allele frequencies of t. When multiple tagging SNPs have the same (high) correlation coefficient with n, the common allele of n is favored. In this case the prediction method agrees well with the selection method, which uses PCA on the matrix of correlation coefficients between SNPs.

There are other ways to assess the accuracy of a tag SNP selection method. The accuracy can be evaluated by the quality measure R², which measures the association between the true number of haplotype copies, defined over the full set of SNPs, and the predicted number of haplotype copies, where the prediction is based on the subset of tagging SNPs. This measure assumes diploid data and explicit inference of haplotypes from genotypes.

Another assessment method, due to Clayton, is based on a measure of the diversity of haplotypes. The diversity is defined as the total number of differences in all pairwise comparisons between haplotypes. The difference between a pair of haplotypes is the sum of differences over all the SNPs. Clayton's diversity measure can be used to quantify how well a set of tag SNPs differentiates haplotypes. This measure is suitable only for haplotype blocks with limited haplotype diversity, and it is not clear how to use it for large data sets consisting of multiple haplotype blocks.
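
The diversity defined above can be computed directly. How to turn it into a score for a tag set is less clear from the description, so the sketch below makes an assumption: haplotypes are grouped by their alleles at the tag SNPs, and the score is the share of the total diversity that does not remain within those groups. The normalisation used in Clayton's original measure may differ, and both function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def diversity(haplotypes):
    """Total number of allelic differences over all pairs of haplotypes."""
    return sum(sum(a != b for a, b in zip(h1, h2))
               for h1, h2 in combinations(haplotypes, 2))


def diversity_explained(haplotypes, tags):
    """Assumed score: 1 - (diversity left within tag-defined groups) / (total diversity)."""
    total = diversity(haplotypes)
    if total == 0:
        return 1.0  # all haplotypes are identical
    groups = defaultdict(list)
    for h in haplotypes:
        groups[tuple(h[t] for t in tags)].append(h)
    residual = sum(diversity(group) for group in groups.values())
    return 1 - residual / total


# Toy haplotypes (hypothetical).
haps = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
print(diversity(haps))
print(diversity_explained(haps, [0]))     # SNP 0 alone leaves some diversity unexplained
print(diversity_explained(haps, [0, 1]))  # SNPs 0 and 1 separate all haplotypes
```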

Some recent works evaluate tag SNP selection algorithms by how well the tagging SNPs can be used to predict non-tagging SNPs. The prediction accuracy is determined using cross-validation, such as leave-one-out or hold-out. In leave-one-out cross-validation, for each sequence in the data set, the algorithm is run on the rest of the data set to select a minimum set of tagging SNPs, which are then used to predict the non-tagging SNPs of the left-out sequence.
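
A minimal sketch of leave-one-out evaluation is given below. The selection method is passed in as a function, `select_tags`, and the predictor is a deliberately simple majority vote among training haplotypes that match the left-out haplotype at the tag SNPs. Both the predictor and all names are assumptions for illustration, not a method taken from the literature.

```python
from collections import Counter


def predict_site(train, tags, site, query):
    """Predict the allele at `site` for `query` from training haplotypes that
    carry the same alleles as `query` at the tag sites."""
    matches = [h[site] for h in train if all(h[t] == query[t] for t in tags)]
    pool = matches or [h[site] for h in train]  # fall back to the overall majority
    return Counter(pool).most_common(1)[0][0]


def leave_one_out_accuracy(haplotypes, select_tags):
    """Fraction of non-tag alleles predicted correctly when each haplotype is
    held out in turn and the tags are re-selected on the remaining ones."""
    correct = total = 0
    for k, held_out in enumerate(haplotypes):
        train = haplotypes[:k] + haplotypes[k + 1:]
        tags = select_tags(train)
        for site in range(len(held_out)):
            if site in tags:
                continue
            total += 1
            correct += predict_site(train, tags, site, held_out) == held_out[site]
    return correct / total if total else 1.0


# Toy data (hypothetical): site 0 perfectly predicts site 1.
haps = [(0, 0), (0, 0), (1, 1), (1, 1)]
print(leave_one_out_accuracy(haps, lambda train: [0]))
```

Because the tags are re-selected inside the loop, the left-out haplotype never influences the selection, which is the point of the cross-validation described above.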

Tagger
Tagger is a web tool for evaluating and selecting tag SNPs from genotypic data such as that of the International HapMap Project. It uses pairwise methods and multimarker haplotype approaches. Users can upload genotype data in HapMap or pedigree format, and the linkage disequilibrium patterns will be calculated. Tagger options allow the user to specify chromosomal landmarks, which indicate regions of interest in the genome for picking tag SNPs. The program then produces a list of tag SNPs and their statistical test values, as well as a coverage report. It was developed by Paul de Bakker in the labs of David Altshuler and Mark Daly at the Center for Human Genetic Research of Massachusetts General Hospital and Harvard Medical School, at the Broad Institute.

CLUSTAG and WCLUSTAG
The freeware programs CLUSTAG and WCLUSTAG contain clustering and set-cover algorithms that obtain a set of tag SNPs representing all the known SNPs in a chromosomal region. The programs are implemented in Java and run on Windows as well as in Unix environments. They were developed by Sio-Iong Ao et al. at The University of Hong Kong.
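
The set-cover idea can be sketched with a generic greedy heuristic: a SNP "covers" every SNP with which its r² reaches a threshold, and the SNP covering the most still-uncovered SNPs is repeatedly chosen as a tag. This is not the CLUSTAG implementation. The function names, the 0/1 haplotype coding, and the threshold of 0.8 are assumptions for illustration.

```python
def r_squared(haplotypes, i, j):
    """Squared correlation between biallelic sites i and j (0/1 coding)."""
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n
    p_j = sum(h[j] for h in haplotypes) / n
    p_ij = sum(1 for h in haplotypes if h[i] == 1 and h[j] == 1) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    return (p_ij - p_i * p_j) ** 2 / denom if denom > 0 else 0.0


def greedy_tag_snps(haplotypes, threshold=0.8):
    """Greedy set cover: pick tags until every site is covered by some tag."""
    n_sites = len(haplotypes[0])
    # covers[s] = sites that s can represent; a site always covers itself
    covers = {s: {t for t in range(n_sites)
                  if t == s or r_squared(haplotypes, s, t) >= threshold}
              for s in range(n_sites)}
    uncovered, tags = set(range(n_sites)), []
    while uncovered:
        # Most newly covered sites first; ties go to the lowest site index
        best = max(range(n_sites), key=lambda s: (len(covers[s] & uncovered), -s))
        tags.append(best)
        uncovered -= covers[best]
    return tags


# Toy haplotypes (hypothetical): sites 0 and 1 are in perfect LD.
haps = [(0, 0, 1, 0), (0, 0, 0, 1), (1, 1, 0, 0), (1, 1, 1, 1)]
print(greedy_tag_snps(haps))
```

Greedy set cover does not guarantee the smallest possible tag set, but it is a standard approximation for this kind of NP-hard covering problem.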