User:SarahKusala/Exome Sequencing



Exome sequencing (also known as targeted exome capture) is a strategy to selectively sequence the coding regions of the human genome in order to identify novel genes associated with disorders. At present, whole-genome sequencing of large sample sizes is difficult, partly because of the high cost of the technique. An alternative approach is therefore used, in which certain regions of the genome are targeted, enriched and sequenced. This technique requires ~5% as much sequencing as a whole genome. The “exome” represents all the exons in the human genome (i.e., the protein-coding region of the genome). Exons are functionally important sequences of DNA that represent the regions in genes that are translated into protein. In total there are about 180,000 exons in the human genome. These protein-coding regions constitute about 1% of the human genome, which corresponds to about 30 Mb in length. It is estimated that these regions contain about 85% of disease-causing mutations.
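As a rough consistency check on the figures quoted above, the following Python sketch derives the genome size implied by "30 Mb is about 1%" and the average exon length implied by "180,000 exons in 30 Mb". The derived values are back-of-the-envelope estimates, not measurements, and the variable names are illustrative.

```python
# Figures quoted in the paragraph above.
EXOME_MB = 30           # approximate total length of the protein-coding exons
EXON_COUNT = 180_000    # approximate number of exons
EXOME_FRACTION = 0.01   # exons as a share of the whole genome

# Assumption: the genome size is inferred from the two figures above,
# it is not stated directly in the text.
genome_mb = EXOME_MB / EXOME_FRACTION

# Average exon length in base pairs (1 Mb = 1,000,000 bp).
mean_exon_bp = EXOME_MB * 1_000_000 / EXON_COUNT

print(f"implied genome size: {genome_mb:.0f} Mb")
print(f"mean exon length: {mean_exon_bp:.0f} bp")
```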



The approach of sequencing the complete coding region has the potential to be clinically relevant in genetic diagnosis because of the current understanding of the functional consequences of sequence variation. The goal of this approach is to identify the functional variation responsible for both mendelian and common diseases, such as Miller syndrome and Alzheimer’s disease respectively, without the high costs associated with whole-genome sequencing while maintaining high sequence coverage.

Exome Sequencing as an Efficient Strategy
Exome sequencing is a more efficient strategy than whole genome sequencing for identifying the rare causal variants of mendelian disorders because:


 * 1) 	Positional cloning strategies have reduced power to positively identify causal rare variants
 * 2) 	The majority of genetic variants that underlie mendelian disorders interfere with protein-coding sequences
 * 3) 	A large number of rare nonsynonymous substitutions are predicted to be deleterious
 * 4) 	Splice sites are sequences in which there is high functional variation

The exome represents an enriched portion of the genome where there is potential to identify variants with large effect sizes.

Mendelian Disorders
Rare diseases, defined as those affecting fewer than 200,000 individuals in the United States, are of interest because identifying their genetic basis can provide knowledge of biological pathways and therapeutic targets. It is suspected that there are more than 7,000 rare mendelian diseases, which together affect millions of people in the US. Many of the mendelian diseases studied to date are known to be caused by rare mutations that affect protein function, and the mutations known to cause them are located predominantly in protein-coding regions. Variants in non-coding regions, on the other hand, are likely to have weak or neutral effects on a disease phenotype. To date, the genetic basis of fewer than half of all rare monogenic disorders has been discovered. The identification of genetic variants is limited by the small number of affected individuals available, reduced penetrance, locus heterogeneity, and alleles that impair reproductive fitness. These factors make it difficult to map genetic traits by linkage analysis, and they reduce the power to detect variants using positional cloning. With exome sequencing, for both dominant and recessive traits, finding an excess of mutations in the same region across affected individuals provides evidence that a disease gene has been identified. A benefit of this approach is that it requires only a small number of unrelated cases to identify a causal gene.

Technological Platforms
The technical platforms used to carry out exome sequencing are DNA microarrays, for the capture and enrichment of the exome, and next-generation sequencing technologies.

Target-enrichment strategies
Target-enrichment methods allow genomic regions of interest to be selectively captured from a DNA sample prior to sequencing. Several target-enrichment strategies have been developed.

PCR
PCR has been one of the most commonly used enrichment strategies for more than 20 years. The approach is well suited to classical Sanger sequencing because a uniplex PCR generates a single DNA sequence, and a typical amplicon is similar in length to a Sanger read. Multiplex PCR reactions, which require several primers, are challenging, although strategies to get around this have been developed. A limitation of this method is the size of the genomic target, because a large amount of DNA is required. The PCR-based approach is highly effective, yet it cannot target genomic regions that are several megabases in length because of the cost and the quantity of DNA needed.

Molecular Inversion Probes (MIP)
This technique targets the amplification of multiple genomic regions. Accurate genotypes can be obtained from sequencing with this method. It is suggested to be useful for targeting a small number of regions in a large number of samples. Its major disadvantages for target enrichment are poor uniformity of capture across target regions and the cost of covering large target sets.

Hybrid Capture
This technique involves hybridizing shotgun libraries of genomic DNA to target-specific sequences on a microarray. Roche NimbleGen was the first to adapt this technology for next-generation sequencing, developing the Sequence Capture Human Exome 2.1M Array to capture ~180,000 coding exons. This method is both time-saving and cost-effective compared with traditional PCR-based methods. The Agilent Capture Array and the comparative genomic hybridization array can also be used for hybrid capture of target sequences. This technique requires a large amount of DNA as well as expensive equipment.

In-Solution Capture
This method was developed to improve on the hybridization capture target-enrichment method. In-solution capture, as opposed to array-based hybrid capture, has an excess of probes over template. The optimal target size is about 3.5 Mb in length, which allows for excellent sequence coverage of the target regions. For exome capture, solution-based and array-based methods perform equivalently.

Sequencing
There are several sequencing platforms available, including classical Sanger sequencing. Other platforms include the Roche 454 sequencer and the Illumina Genome Analyzer II, both of which have been used for exome sequencing.

Importance of Exome Sequencing
A study published in September 2009 described a proof-of-concept experiment to determine whether causal genetic variants could be identified using exome sequencing. The authors sequenced four individuals with Freeman-Sheldon syndrome (FSS) (OMIM 193700), a rare autosomal dominant disorder known to be caused by mutations in the gene MYH3. Eight HapMap individuals were also sequenced so that common variants could be removed before searching for the causal gene for FSS. After exclusion of common variants, the authors were able to positively identify MYH3. This confirmed that exome sequencing can be used to identify causal variants of rare disorders. It was the first reported study to use exome sequencing to identify the causal gene, in this case one already known, for a rare mendelian disorder.
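The filtering logic described above (discard variants seen in controls, then look for a gene hit in every affected individual) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function name, the data layout, and the placeholder genes and coordinates are all hypothetical.

```python
from typing import Dict, Set, Tuple

# A variant is identified here by (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]

def candidate_genes(
    cases: Dict[str, Dict[Variant, str]],
    control_variants: Set[Variant],
) -> Set[str]:
    """Return genes that carry at least one non-control variant in every case.

    `cases` maps a sample name to {variant: gene}. Genes, not variants, are
    intersected because unrelated cases may carry different mutations in
    the same gene.
    """
    per_case_genes = []
    for variants in cases.values():
        novel = {gene for var, gene in variants.items()
                 if var not in control_variants}
        per_case_genes.append(novel)
    return set.intersection(*per_case_genes) if per_case_genes else set()

# Hypothetical toy data: placeholder genes and coordinates, not real calls.
cases = {
    "case1": {("chr1", 100, "A", "G"): "GENE_A",
              ("chr2", 200, "C", "T"): "GENE_B"},
    "case2": {("chr1", 150, "G", "T"): "GENE_A",
              ("chr3", 300, "T", "C"): "GENE_C"},
}
controls = {("chr2", 200, "C", "T")}  # variants seen in control individuals

shared = candidate_genes(cases, controls)
print(shared)
```

In this toy example only GENE_A survives, since it is the only gene with a non-control variant in both cases.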

A second study reported exome sequencing of individuals with Miller syndrome (OMIM 263750), a rare mendelian disorder with autosomal recessive inheritance. Two siblings and two unrelated individuals with Miller syndrome were studied. The authors looked at variants with the potential to be pathogenic, such as nonsynonymous mutations, variants at splice acceptor and donor sites, and small insertions or deletions. Since Miller syndrome is a rare disorder, the causal variant is expected not to have been identified previously. Common single nucleotide polymorphisms (SNPs) found in public SNP databases and in previous exome sequencing studies were therefore used to further exclude candidate genes. After exclusion of these genes, the authors found mutations in DHODH that were shared among the individuals with Miller syndrome. Each individual with Miller syndrome was a compound heterozygote for DHODH mutations. These mutations were inherited, as each parent of an affected individual was found to be a carrier.
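For a recessive disorder, the search can be narrowed to genes in which every affected individual carries two previously unseen alleles (one homozygous variant or two heterozygous variants). The Python sketch below illustrates that idea only; the function name, data layout and placeholder identifiers are hypothetical, and it does not check phase, so two heterozygous variants on the same chromosome copy would be counted incorrectly.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# Each call is (variant id, gene, copies), where copies is 1 for a
# heterozygous call and 2 for a homozygous call.
Call = Tuple[str, str, int]

def recessive_candidates(
    cases: Dict[str, List[Call]],
    known_variants: Set[str],
) -> Set[str]:
    """Genes with at least two novel alleles in every affected individual."""
    per_case = []
    for calls in cases.values():
        allele_count = Counter()
        for variant_id, gene, copies in calls:
            if variant_id not in known_variants:  # drop database variants
                allele_count[gene] += copies
        per_case.append({g for g, n in allele_count.items() if n >= 2})
    return set.intersection(*per_case) if per_case else set()

# Hypothetical toy data: placeholder genes and variant ids, not real calls.
cases = {
    "sibling1":  [("v1", "GENE_X", 1), ("v2", "GENE_X", 1), ("v3", "GENE_Y", 1)],
    "sibling2":  [("v1", "GENE_X", 1), ("v2", "GENE_X", 1), ("v9", "GENE_Z", 2)],
    "unrelated": [("v4", "GENE_X", 1), ("v5", "GENE_X", 1),
                  ("v3", "GENE_Y", 1), ("v6", "GENE_Y", 1)],
}
known = {"v6"}  # stands in for variants already present in public databases

genes = recessive_candidates(cases, known)
print(genes)
```

Here GENE_Y drops out because one of its two variants in the unrelated case is already known, and GENE_Z drops out because it is hit in only one sibling.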

This was the first time exome sequencing was shown to identify a novel gene responsible for a rare mendelian disease. This finding demonstrates that exome sequencing has the potential to locate causative genes in complex diseases, which was not previously possible because of limitations in traditional methods. Targeted capture and massively parallel sequencing represent a cost-effective strategy, with high sensitivity and specificity, for detecting variants that cause protein-coding changes in individual human genomes.

Genotyping vs Exome Sequencing
Multiple technologies are available to identify causal genetic variants associated with disease. Each technology has its own technical, financial and throughput limitations. Microarrays, for example, require hybridization probes of known sequence and are therefore limited by probe design, which restricts the genetic changes that can be detected. The massively parallel sequencing technologies used for exome sequencing, on the other hand, now make it possible to identify the cause of many diseases whose cause is unknown. This technology addresses the present limitations of hybridization genotyping arrays and classical sequencing.

Although exome sequencing is an expensive method relative to other currently available technologies (e.g., hybridization-based technologies), it is an efficient strategy for identifying the genetic basis of rare mendelian disorders. This approach has become increasingly useful with the falling cost and increased throughput of whole genome sequencing. Even when only the exomes of individuals are sequenced, a large quantity of data and sequence information is generated, which requires a significant amount of data analysis. This in turn requires changes in the programs that align and assemble sequence reads.

Limitations
Exome sequencing can identify only those variants found in the coding regions of genes that affect protein function. It cannot identify the structural and non-coding variants associated with disease, which can be found using other methods such as whole genome sequencing. About 99% of the human genome is not covered by exome sequencing. Whole genome sequencing will eventually become a standard approach and allow a deeper understanding of genetic variation. At present, it is not practical because of the high cost and time associated with sequencing large numbers of genomes. Exome sequencing allows portions of the genome to be sequenced in at least 20 times as many samples as whole genome sequencing. For translating identified rare variants into the clinic, considerations of sample size and of the ability to interpret the results to provide a clinical diagnosis indicate that, with current knowledge in genetics, exome sequencing may be the most valuable approach.
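The "at least 20 times as many samples" figure follows from the statement earlier in this article that an exome needs ~5% as much sequencing as a whole genome. A minimal Python sketch of that arithmetic is given below; it assumes sequencing volume is the only constraint and ignores capture cost and uneven coverage, and the function name is illustrative.

```python
# Share of sequencing an exome needs relative to a whole genome
# (~5%, as stated earlier in this article).
EXOME_SHARE = 0.05

def exomes_for_capacity(whole_genomes: int, exome_share: float = EXOME_SHARE) -> int:
    """Exomes that fit in the sequencing capacity of `whole_genomes` genomes."""
    return round(whole_genomes / exome_share)

print(exomes_for_capacity(1))   # capacity of one whole genome
print(exomes_for_capacity(50))  # capacity of fifty whole genomes
```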

The statistical analysis of the large quantity of data generated by sequencing approaches is a challenge. False positive and false negative findings are a critical issue in genomic resequencing approaches. A few strategies have been developed to improve the quality of exome data, such as:


 * Comparing the correlation of genetic variants between sequencing and array-based genotyping methods
 * Comparing the coding SNPs to those of a whole genome sequenced individual with the disorder
 * Comparing the coding SNPs with Sanger sequencing of HapMap individuals
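The first strategy in the list above, checking sequencing calls against array-based genotypes, is often summarized as a concordance rate over the sites that both methods report. The Python sketch below shows one simple way to compute it. The site identifiers and genotypes are hypothetical, and the "A/G" genotype string format is an assumption made for illustration.

```python
from typing import Dict, Tuple

def normalize(genotype: str) -> Tuple[str, ...]:
    """Order the alleles so that 'G/A' and 'A/G' compare as equal."""
    return tuple(sorted(genotype.split("/")))

def genotype_concordance(seq_calls: Dict[str, str],
                         array_calls: Dict[str, str]) -> float:
    """Fraction of shared sites where both methods report the same genotype."""
    shared_sites = seq_calls.keys() & array_calls.keys()
    if not shared_sites:
        raise ValueError("the two call sets share no sites")
    matches = sum(normalize(seq_calls[s]) == normalize(array_calls[s])
                  for s in shared_sites)
    return matches / len(shared_sites)

# Hypothetical toy data: site ids and genotypes are placeholders.
seq_calls = {"site1": "A/G", "site2": "C/C", "site3": "T/T", "site4": "G/A"}
array_calls = {"site1": "G/A", "site2": "C/C", "site3": "T/C", "site5": "A/A"}

rate = genotype_concordance(seq_calls, array_calls)
print(f"concordance over shared sites: {rate:.2f}")
```

Sites reported by only one method (site4 and site5 here) are ignored, so the rate says nothing about variants the array was never designed to probe.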

Variants that cause rare recessive disorders would not be expected among the single nucleotide polymorphisms (SNPs) in public databases such as dbSNP. Genes for recessive disorders are usually easier to identify than genes for dominant disorders because a gene is less likely to carry more than one rare nonsynonymous variant. Catalogs of common variation drawn from exome-wide or genome-wide studies would be more reliable than HapMap. A challenge for this approach is that, as the number of sequenced exomes increases, the number of uncommon variants in dbSNP will also increase. It will be necessary to develop thresholds to define the common variants that are unlikely to be associated with a disease phenotype.
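One way to express such a threshold is as a maximum population allele frequency: a variant is discarded only if a database reports it above that frequency, rather than whenever it appears in the database at all. The Python sketch below illustrates the idea. The 1% cutoff, the field names and the toy records are assumptions chosen for illustration, not values taken from the studies discussed here.

```python
from typing import Dict, List, Optional

def filter_rare(variants: List[Dict[str, Optional[float]]],
                max_frequency: float = 0.01) -> List[Dict[str, Optional[float]]]:
    """Keep variants that are absent from the database or below the cutoff.

    A frequency of None means the variant has no database entry, so it is
    kept as a potentially novel variant.
    """
    kept = []
    for variant in variants:
        freq = variant["allele_frequency"]
        if freq is None or freq < max_frequency:
            kept.append(variant)
    return kept

# Hypothetical toy records: ids and frequencies are placeholders.
variants = [
    {"id": "v1", "allele_frequency": 0.25},   # common, removed
    {"id": "v2", "allele_frequency": 0.002},  # in the database but rare, kept
    {"id": "v3", "allele_frequency": None},   # not in the database, kept
]

rare = filter_rare(variants)
print([v["id"] for v in rare])
```

With a frequency cutoff, a rare variant such as v2 is retained even though it has a database entry, which addresses the concern that databases will accumulate uncommon variants as more exomes are sequenced.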

Genetic heterogeneity and population ethnicity are also major limitations, as they may increase the number of false positive and false negative findings, which makes the identification of candidate genes more difficult. It is possible to decrease the stringency of the threshold in the presence of heterogeneity and ethnic differences; however, this will reduce the power to detect variants as well.

Ethical Implications
New technologies in genomics have changed the way researchers approach both basic and translational research. Approaches such as exome sequencing significantly increase the amount of data generated from individual genomes, which has raised a series of questions about how to deal with the vast amount of information. Should the individuals in these studies be allowed access to their sequencing information? Is it possible to interpret these results for these individuals, and are the identified genetic variants clinically relevant? These data can lead to unexpected findings that complicate clinical utility and patient benefit. This area of genomics remains a challenge, and researchers are looking into how to address these questions.