End-sequence profiling

End-sequence profiling (ESP), sometimes called paired-end mapping (PEM), is a method based on sequence-tagged connectors. It was developed to facilitate de novo genome sequencing and is used to identify copy number changes and structural aberrations, such as inversions and translocations, at high resolution. Briefly, the target genomic DNA is isolated and partially digested with restriction enzymes into large fragments. Following size fractionation, the fragments are cloned into plasmids to construct artificial chromosomes, such as bacterial artificial chromosomes (BACs), whose insert ends are then sequenced and compared with the reference genome. Differences in orientation and length between the cloned fragments and the reference genome suggest copy number and structural aberrations.

Artificial chromosome construction
Before structural aberrations and copy number variation (CNV) in a target genome are analyzed with ESP, the genome is usually amplified and preserved by constructing artificial chromosomes. The classic vehicle is the bacterial artificial chromosome (BAC): the target DNA is randomly digested and the fragments are inserted into plasmids, which are transformed into bacteria and cloned. The inserted fragments are 150–350 kb in size. Another commonly used vector is the fosmid. BACs and fosmids differ in the size of the DNA insert: fosmids can hold only 40 kb fragments, and the shorter insert allows more accurate breakpoint determination.

Structural aberration detection
End-sequence profiling (ESP) can be used to detect structural variations such as insertions, deletions, and chromosomal rearrangements. Compared with other methods for examining chromosomal abnormalities, ESP is particularly useful for identifying copy-neutral abnormalities, such as inversions and translocations, that are not apparent from copy number variation alone. Both ends of each inserted fragment in the BAC library are sequenced on a sequencing platform. Variations are then detected by mapping the end reads onto a reference genome.

Inversion and translocation
Inversions and translocations are relatively easy to detect because they produce an invalid pair of end sequences. For instance, a translocation is indicated when the two ends of a pair map to different chromosomes in the reference genome. An inversion is indicated by an inconsistent orientation of the reads, where both ends of the insert map as plus ends or both map as minus ends.
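The two rules above can be sketched as a small classifier. This is a minimal illustration, not the method of any published tool: the `EndPair` record, its field names, and the `classify_pair` function are hypothetical, and the sketch assumes that a valid pair maps to one chromosome with one plus end and one minus end.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndPair:
    """Mapped ends of one clone (hypothetical record layout)."""
    chrom1: str
    strand1: str  # '+' or '-'
    chrom2: str
    strand2: str  # '+' or '-'


def classify_pair(pair: EndPair) -> str:
    """Label a pair using only chromosome and strand information."""
    # Ends on different reference chromosomes suggest a translocation.
    if pair.chrom1 != pair.chrom2:
        return "translocation"
    # Two plus ends or two minus ends suggest an inversion.
    if pair.strand1 == pair.strand2:
        return "inversion"
    # Assumption: one plus end and one minus end is the valid orientation.
    return "valid orientation"


print(classify_pair(EndPair("chr1", "+", "chr17", "-")))
print(classify_pair(EndPair("chr3", "+", "chr3", "+")))
print(classify_pair(EndPair("chr3", "+", "chr3", "-")))
```

A pair labelled as having a valid orientation may still be discordant in apparent size, so this check is only the first filter.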

Insertion and deletion
In the case of an insertion or a deletion, the paired ends map to the reference genome in a consistent orientation, but the reads are discordant in apparent size. The apparent size is the distance between the mapped BAC end sequences in the reference genome. If a BAC has an insert of length (l), a concordant mapping shows a fragment of size (l) in the reference genome. If the paired ends map closer together than distance (l), an insertion is suspected in the sampled DNA. A distance of (l < μ - 3σ) can be used as a cut-off to detect an insertion, where μ is the mean length of the insert and σ is the standard deviation. In the case of a deletion, the paired ends map further apart in the reference genome than the expected distance (l > μ + 3σ).
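The size test can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `classify_span` and the example library statistics are hypothetical, and the deletion cut-off is taken as the upper bound μ + 3σ, symmetric to the insertion cut-off μ - 3σ.

```python
def classify_span(apparent_size: float, mean: float, sd: float, k: float = 3.0) -> str:
    """Compare the mapped distance between paired ends with the insert-size distribution.

    apparent_size: distance between the two ends on the reference genome
    mean, sd:      mean (mu) and standard deviation (sigma) of the insert length
    k:             number of standard deviations used as the cut-off (3 in the text)
    """
    if apparent_size < mean - k * sd:
        # Ends map closer together than expected: extra DNA in the sample.
        return "insertion"
    if apparent_size > mean + k * sd:
        # Ends map further apart than expected: DNA missing from the sample.
        return "deletion"
    return "concordant"


# Hypothetical BAC library: mean insert 150 kb, standard deviation 10 kb,
# so the cut-offs are 120 kb and 180 kb.
for size in (100_000, 155_000, 200_000):
    print(size, classify_span(size, mean=150_000, sd=10_000))
```

In practice μ and σ would be estimated from the concordant pairs of the same library, since insert-size distributions differ between libraries.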

Copy number variation
In some cases, discordant reads can also indicate a CNV, for example in sequence repeats. For larger CNVs, the density of the reads varies with the copy number: an increase in copy number is reflected by an increased number of reads mapping to the same region of the reference genome.
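The read-density idea can be illustrated with a simple windowed count. This is a rough sketch only: the function names, the fixed window size, and the expected count per window are all hypothetical, and real analyses would also need to correct for mappability and cloning bias.

```python
from collections import Counter


def window_read_counts(positions, window_size):
    """Count mapped end reads per fixed-size window on one chromosome."""
    return Counter(pos // window_size for pos in positions)


def density_ratio(positions, window_size, expected_per_window):
    """Ratio of observed to expected reads for each window containing reads.

    A ratio well above 1 hints at a copy number gain in that window,
    and a ratio well below 1 hints at a loss. Windows with no reads
    are not listed.
    """
    counts = window_read_counts(positions, window_size)
    return {w: n / expected_per_window for w, n in sorted(counts.items())}


# Toy example: window 0 holds the expected 2 reads, window 1 holds 4.
positions = [10, 20, 1500, 1600, 1700, 1800]
print(density_ratio(positions, window_size=1000, expected_per_window=2))
```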

ESP history
ESP was first developed and published in 2003 by Dr. Collins and his colleagues at the University of California, San Francisco. Their study revealed the chromosome rearrangements and CNV of MCF7 human cancer cells at a 150 kb resolution, which was much finer than that of either CGH or spectral karyotyping at that time. In 2007, Dr. Snyder and his group improved the resolution of ESP to 3 kb by sequencing both ends of 3-kb DNA fragments without BAC construction. Their approach can identify deletions, inversions, and insertions with an average breakpoint resolution of 644 bp, which is close to the resolution of polymerase chain reaction (PCR).

ESP applications
Various bioinformatics tools can be used to analyze end-sequence profiling data. Common ones include BreakDancer, PEMer, VariationHunter, CommonLAW, GASV, and Spanner. ESP can be used to map structural variation at high resolution in disease tissue. The technique is mainly used on tumor samples from different cancer types. Accurate identification of copy-neutral chromosomal abnormalities is particularly important because translocations can lead to fusion proteins, chimeric proteins, or misregulated proteins that can be seen in tumors. The technique can also be used in evolutionary studies to identify large structural variations between different populations. Similar methods are being developed for various applications. For example, a barcoded Illumina paired-end sequencing (BIPES) approach was used to assess microbial diversity by sequencing the 16S V6 tag.

Advantages and limitations
The resolution of structural variation detection by ESP has been increased to a level similar to that of PCR, and can be further improved by selecting more evenly sized DNA fragments. ESP can be applied with or without constructed artificial chromosomes. With BACs, precious samples can be immortalized and preserved, which is particularly important for small quantities of samples that are planned for extensive analyses. Furthermore, BACs carrying rearranged DNA fragments can be directly transfected in vitro or in vivo to analyze the function of these rearrangements. However, BAC construction is still expensive and labor-intensive, so researchers should choose carefully which strategy suits a particular project. Because ESP only looks at short paired-end sequences, it has the advantage of providing useful information genome-wide without the need for large-scale sequencing. For the sequencing effort required for one entire genome, approximately 100-200 tumors can be profiled at a resolution greater than 150 kb.