Whole genome bisulfite sequencing

Whole genome bisulfite sequencing is a next-generation sequencing technology used to determine the DNA methylation status of single cytosines by treating the DNA with sodium bisulfite before high-throughput DNA sequencing. The DNA methylation status at various genes can reveal information regarding gene regulation and transcriptional activities. This technique was developed in 2009 along with reduced representation bisulfite sequencing after bisulfite sequencing became the gold standard for DNA methylation analysis.

Whole genome bisulfite sequencing measures single-cytosine methylation levels genome-wide and directly estimates the fraction of molecules that are methylated at each site, rather than enrichment levels. Currently, the technique has been able to assess approximately 95% of all cytosines in known genomes. With the improvement of library preparation methods and next-generation sequencing technology over the past decade, whole genome bisulfite sequencing has become an increasingly widespread and informative method for analyzing DNA methylation in epigenome-wide studies.

History
Prior to the development of whole genome bisulfite sequencing, genome methylation analysis relied heavily on early non-specific and differential methods such as paper chromatography, high-performance liquid chromatography, and thin-layer chromatography to analyze methylation profiles. These methods were limited because methylated DNA could not be amplified in vitro by polymerase chain reaction without losing its methylation status. As a result, many of these early methods relied on detecting and analyzing naturally occurring methylated cytosines in vivo rather than chemically methylated cytosines.

In 1970, a breakthrough occurred when it was discovered that treating DNA with sodium bisulfite deaminated cytosine residues into uracil. In the following decade, this discovery led to the finding that unmethylated cytosine reacted much faster with sodium bisulfite than did 5-methylcytosine. This difference in reaction rates made it possible to turn a chemical modification of DNA into an easily detectable sequence difference. Whole genome bisulfite sequencing was derived as a combination of this bisulfite treatment and next-generation sequencing technology, such as shotgun sequencing.

Whole genome sequencing was first applied to DNA methylation mapping at single-nucleotide resolution in Arabidopsis thaliana in 2008. Shortly after, in 2009, the first single-base-resolution DNA methylation map of the entire human genome was created using whole genome bisulfite sequencing. Since then, many protocols have been developed to improve the efficiency and efficacy of single-base mapping. As the costs of next-generation sequencing have decreased, whole genome bisulfite sequencing has become more widely used in clinical and experimental research, and multiple public datasets of genomic data have been established.

Method
The following steps describe one possible workflow for conventional whole genome bisulfite sequencing: target DNA extraction, bisulfite conversion, library amplification, and bioinformatics analysis. Different sequencing systems and analysis tools often adapt the technical parameters and the order of these steps to optimize assay coverage and efficacy.

DNA extraction
During library preparation, the DNA undergoes fragmentation, end repair, dA-tailing, and adapter ligation prior to bisulfite treatment and library amplification. Standard fragmentation for high-throughput platforms such as the Illumina Genome Analyser and Solexa uses nebulization to generate fragments that range from 0 to 1200 base pairs. After fragmentation, end repair enzymes and complementary adapters are applied to the DNA in an end-prep polymerase chain reaction and an adapter ligation reaction, respectively. Size selection occurs before the DNA is treated with sodium bisulfite.

Conventional methods of eukaryotic DNA preparation for sequencing use widely varying amounts of input DNA, from as little as 10 ng for newer NGS library approaches such as tagmentation to as much as 500-1000 ng.

Bisulfite conversion
The adapter-ligated DNA sample is treated with sodium bisulfite, a chemical compound that converts unmethylated cytosines into uracil, at low pH and high temperatures. The chemical reaction is depicted in Figure 1, where sulfonation occurs at the carbon-6 position of cytosine to produce the intermediate cytosine sulfonate. This intermediate then undergoes irreversible hydrolytic deamination to create uracil sulfonate. Under alkaline conditions, uracil sulfonate desulfonates to generate uracil.

This enables methylation detection by distinguishing methylated cytosines (5-methylcytosine), which resist bisulfite treatment, from the uracils produced from unmethylated cytosines. During amplification by polymerase chain reaction, the uracils are copied as thymines, so that only methylated cytosines are still read as cytosines. Their locations are then identified by comparing the bisulfite-treated sequence with the original DNA sequence.
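The conversion and comparison logic described above can be illustrated with a small in silico sketch. It is not a laboratory or production tool: the function names `bisulfite_convert` and `call_methylation` are hypothetical, the example sequence is invented, and the sketch assumes a single strand, complete conversion, and no sequencing errors.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Simulate bisulfite treatment followed by PCR on one DNA strand.

    Unmethylated C is deaminated to U and read as T after amplification;
    methylated C (5-methylcytosine) is protected and is still read as C.
    Assumes complete conversion (an idealization).
    """
    converted = []
    for index, base in enumerate(sequence):
        if base == "C" and index not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def call_methylation(original, converted):
    """Compare the original and converted strands at every cytosine."""
    calls = {}
    for index, (ref_base, read_base) in enumerate(zip(original, converted)):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[index] = "methylated"
        elif read_base == "T":
            calls[index] = "unmethylated"
        else:
            calls[index] = "ambiguous"
    return calls


# Invented example: only the cytosine at index 1 is methylated.
original = "ACGTCCGA"
converted = bisulfite_convert(original, methylated_positions={1})
calls = call_methylation(original, converted)
print(converted)
print(calls)
```

In real data, a C-to-T difference can also come from a genuine sequence variant or from a sequencing error, which is one reason actual pipelines work with many reads per position rather than a single comparison.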

Following bisulfite treatment, purification of the sample is required to remove unwanted products including bisulfite salts.

Library amplification
To amplify the epigenome library, bisulfite-treated DNA is primed to generate DNA with a specific tagging sequence. The 3' end of this sequence is then tagged again, creating DNA fragments with markers on either end. These fragments are amplified in a final polymerase chain reaction, after which the library is prepared for sequencing-by-synthesis. This is demonstrated in Figure 2, in which high-throughput sequencing systems developed by the biotechnology company Illumina perform comprehensive assays based on sequencing-by-synthesis of base pairs.

Bioinformatics analysis
Following library amplification, a series of analyses can be performed on the expanded library to determine various methylation characteristics or map a genome-wide methylation profile.

One such analysis aligns the new reads against the reference genome in order to directly compare locations of methylated cytosines and C-T mismatches. This requires alignment software such as SOAP for side-by-side comparison of the genomes. Another analysis is methylated cytosine calling, which computes methylated cytosine ratios by mapping probabilities based on read quality. This helps determine methylated cytosine locations across the genome. Finally, global trends of the methylome can be analyzed by calculating the distribution of methylated cytosines among the CG, CHG, and CHH sequence contexts (where H is A, C, or T) across the genome. These ratios can reflect features of whole genome methylation maps of certain species.
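As a rough illustration of the last two analyses, the sketch below computes a per-cytosine methylation level and the share of methylated cytosines in each sequence context. All names (`cytosine_context`, `summarize_methylome`, `min_level`) and the example counts are hypothetical. It assumes forward-strand cytosines only, a reference containing only A, C, G, and T, and a simple fixed threshold for calling a site methylated, whereas published pipelines typically use statistical tests that account for read quality and conversion rate.

```python
from collections import defaultdict


def cytosine_context(reference, index):
    """Classify the cytosine at index as 'CG', 'CHG', or 'CHH'.

    H stands for A, C, or T. Returns None if the base is not a cytosine
    or if it is too close to the end of the sequence to classify.
    """
    if reference[index] != "C":
        return None
    following = reference[index + 1:index + 3]
    if len(following) >= 1 and following[0] == "G":
        return "CG"
    if len(following) < 2:
        return None
    return "CHG" if following[1] == "G" else "CHH"


def summarize_methylome(reference, read_counts, min_level=0.5):
    """Return per-site methylation levels and context shares.

    read_counts maps a cytosine index to a pair
    (reads supporting C, reads supporting T). min_level is an assumed,
    illustrative cutoff for counting a site as methylated.
    """
    levels = {}
    context_totals = defaultdict(int)
    for index, (methylated, unmethylated) in read_counts.items():
        depth = methylated + unmethylated
        if depth == 0:
            continue  # no coverage, no estimate
        level = methylated / depth
        levels[index] = level
        context = cytosine_context(reference, index)
        if context is not None and level >= min_level:
            context_totals[context] += 1
    total = sum(context_totals.values())
    shares = {c: n / total for c, n in context_totals.items()} if total else {}
    return levels, shares


# Invented example: four cytosines at indices 1, 4, 7, and 10.
reference = "ACGTCAGCTTC"
read_counts = {1: (9, 1), 4: (6, 2), 7: (1, 9), 10: (5, 5)}
levels, shares = summarize_methylome(reference, read_counts)
print(levels)
print(shares)
```

The methylation level here is simply the number of reads that retain a cytosine divided by the total number of reads covering that position, which corresponds to the fraction of molecules methylated at the site.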

Applications
Due to its ability to screen methylation status at single-nucleotide resolution across a given genome, whole genome bisulfite sequencing has become increasingly promising as an aid to fundamental epigenomics research, to the development of novel hypotheses on DNA methylation, and to future large-scale epidemiological studies. This whole genome approach is also capable of sensitive cytosine-methylation detection in specific sequence contexts across an entire genome, which increases its potential to identify specific DNA methylation sites and their relation to the expression of certain genes.

DNA Methylation
The use of whole genome bisulfite sequencing to create the first human DNA methylome in 2009 helped identify a significant proportion of non-CG methylation. As a result, multiple single-base-resolution methylomes of the human genome continue to be produced in order to identify the role of intragenic DNA methylation in gene expression and regulation. Future studies aim to use whole genome bisulfite sequencing to investigate the role of DNA methylation in cellular processes such as cellular differentiation, embryogenesis, X-inactivation, genomic imprinting, and tumorigenesis. Single-nucleotide maps have already been produced for two human cell lines, H1 human embryonic stem cells and IMR90 fetal lung fibroblasts, in order to study patterns of non-CG methylation in human cells.

Developmental biology
Whole genome bisulfite sequencing has also been applied to developmental biology studies, in which non-CG methylation was found to be prevalent in pluripotent stem cells and oocytes. This technique helped researchers discover that non-CG methylation accumulated during oocyte growth and accounted for over half of all methylation in mouse germinal vesicle oocytes. Similarly, in plants, whole genome bisulfite sequencing was used to examine CG, CHH, and CHG methylation. It was then discovered that, unlike in mammals, the plant germline retained CG and CHG methylation but lost CHH methylation in microspores and sperm cells.

Other fields
The genome-wide scope of the approach has spurred many novel hypotheses on how whole genome bisulfite sequencing could be used in other fields, including disease diagnosis and forensic science. Studies have shown that whole genome bisulfite sequencing can detect abnormal methylation, in particular the hypermethylated tumor suppressor genes that are often seen in cancers including leukemia. Additionally, whole genome bisulfite sequencing has been applied to blood spot samples in forensic investigations to generate high-quality DNA methylation analyses of dried stains.

Technical concerns
The widespread use of whole genome bisulfite sequencing has been limited primarily by its high cost, complex data output, and minimum coverage requirements. Because of the large amount of DNA input required and the associated cost, many studies using whole genome bisulfite sequencing are run with few or no biological replicates. For human samples, the US National Institutes of Health (NIH) Roadmap Epigenomics Project recommends a minimum of 30x sequencing coverage and approximately 80 million aligned, high-quality reads to achieve accurate results. Consequently, large-scale genome-wide methylation profiling remains costly, since each experiment requires sequencing the entire genome many times over. Current studies aim to reduce the conventional minimum coverage requirements while maintaining mapping accuracy.
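One way to see why coverage matters for accuracy is to treat each read covering a cytosine as an independent draw from the population of molecules. Under that simplifying binomial assumption, which ignores incomplete conversion, sequencing errors, and amplification duplicates, the uncertainty of a per-site methylation estimate shrinks with the square root of the depth. The function name `methylation_standard_error` and the depths shown are illustrative choices, not values taken from any guideline.

```python
import math


def methylation_standard_error(level, depth):
    """Standard error of a methylation ratio under binomial sampling.

    level is the true fraction of methylated molecules (0 to 1) and
    depth is the number of reads covering the cytosine.
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    if not 0.0 <= level <= 1.0:
        raise ValueError("level must be between 0 and 1")
    return math.sqrt(level * (1.0 - level) / depth)


# A half-methylated site is the worst case for the binomial variance.
for depth in (5, 10, 30, 60):
    error = methylation_standard_error(0.5, depth)
    print(f"{depth:>3}x coverage: standard error {error:.3f}")
```

Because the error falls only with the square root of the depth, halving it requires four times as many reads, which is part of what makes high-coverage experiments expensive.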

Finally, the technique is also limited by the complexity of its data and the lack of sufficiently advanced analytical tools for downstream computation. The bioinformatics requirements for accurate data interpretation currently outpace existing tools, which limits the accessibility of sequencing results to the general public.

Biases and over-representation of DNA methylation
Additionally, there are biological limitations concerning various steps in the standard protocol, particularly in the library preparation method. One of the biggest concerns is the potential for bias in the base composition of sequences and over-representation of methylated DNA following bioinformatics analyses. Bias can arise from several unintended effects of bisulfite conversion, including DNA degradation. This degradation can cause uneven sequence coverage by misrepresenting genomic sequences and overestimating 5-methylcytosine values. Furthermore, the bisulfite conversion process only distinguishes unmethylated cytosine from 5-methylcytosine. As a result, its ability to discriminate between 5-methylcytosine and 5-hydroxymethylcytosine is limited. Another potential source of bias arises from polymerase chain reaction amplification of the library, which affects sequences with highly skewed base compositions because of high rates of polymerase errors in AT-rich, bisulfite-converted DNA.