scGET-seq

Single-cell genome and epigenome by transposases sequencing (scGET-seq) is a DNA sequencing method for profiling open and closed chromatin. In contrast to single-cell assay for transposase-accessible chromatin with sequencing (scATAC-seq), which only targets active euchromatin, scGET-seq is also capable of probing inactive heterochromatin.

This is achieved through the use of TnH, which is created by linking the chromodomain (CD) of heterochromatin protein-1-alpha (HP-1α) to the Tn5 transposase. TnH is then able to target histone 3 lysine 9 trimethylation (H3K9me3), a marker for heterochromatin.

Akin to RNA velocity, which uses the ratio of spliced to unspliced RNA to infer the kinetics of changes in gene expression over the course of cellular development, the ratio of TnH to Tn5 signals obtained from scGET-seq can be used to calculate chromatin velocity, which measures the dynamics of chromatin accessibility over the course of cellular developmental pathways.

History
Transcriptional regulation is tightly linked to chromatin states. Chromatin that is open, or permissive to transcription, makes up only 2-3% of the genome but encompasses 94.4% of transcription factor binding sites. Conversely, more tightly packed DNA, or heterochromatin, is responsible for genome organization and stability. Chromatin density also changes over the course of cellular differentiation, but there is a lack of high-throughput sequencing methods for directly assaying heterochromatin.

Many genome-related diseases, such as cancer, are closely linked to changes in the epigenome. Cancers in particular are characterized by single-cell heterogeneity, which can drive metastasis and treatment resistance. The mechanisms that underlie these processes are still largely unknown, although the advent of single-cell technologies, including single-cell epigenomics, has contributed greatly to their elucidation.

In 2015, ATAC-seq, which uses the Tn5 transposase to fragment and tag accessible chromatin, or euchromatin, for sequencing, became feasible at single-cell resolution. scGET-seq builds upon this technology by also capturing information on heterochromatin, giving a more comprehensive view of chromatin structure and dynamics within each cell.

Sample preparation
Sample preparation for scGET-seq starts with obtaining a suspension of nuclei from cells using a method appropriate for the starting material.

The next step is to produce the TnH transposase. Tn5 is a transposase that cuts genomic regions unbound by nucleosomes (open chromatin) and ligates adapters to them. HP-1α is a member of the HP1 family and is able to recognize and specifically bind to H3K9me3. Its chromodomain uses an induced-fit mechanism for recognizing this chromatin modification. Linking the first 112 amino acids of HP-1α, which contain the chromodomain, to Tn5 using a three poly-tyrosine-glycine-serine (TGS) linker creates the TnH transposase, which is capable of targeting heterochromatin marked by H3K9me3.

Library preparation is done using a modified protocol for single-cell ATAC-seq, where the nuclei suspension is sequentially incubated with the Tn5 transposase first, and then TnH.

Data analysis
The goals of the data analysis are:
 * 1) To identify and characterize distinct cell populations using clustering
 * 2) To profile chromatin accessibility across the genome
 * 3) To predict copy-number variants and single-nucleotide variants

Pre-processing

 * 1) Post-sequencing, reads need to be demultiplexed and mapped to the appropriate reference genome. Duplicated reads are identified and removed.
 * 2) "Peaks", or regions in the DNA enriched in the number of reads mapped, are identified.
 * 3) Quality control is performed, and cells with low numbers of reads or few detected features are filtered out.
 * 4) Four count matrices (matrices where each column is a cell and each row is a feature) are generated: Tn5-dhs, Tn5-complement, TnH-dhs and TnH-complement, representing signal from accessible and compacted chromatin.
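The quality-control step can be illustrated with a minimal sketch. It assumes the four matrices are already available as NumPy arrays with rows as features and columns as cells, and that all four share the same cell order. The function names and the thresholds are hypothetical and chosen only for illustration; they are not taken from the scGET-seq pipeline.

```python
import numpy as np

# Names of the four count matrices described above.
MATRIX_NAMES = ("Tn5-dhs", "Tn5-complement", "TnH-dhs", "TnH-complement")


def qc_mask(matrices, min_reads, min_features):
    """Return a boolean mask of cells (columns) that pass QC.

    Assumption: every matrix is features x cells with identical column order.
    Per-cell totals are pooled over all four matrices before thresholding.
    """
    reads = sum(m.sum(axis=0) for m in matrices.values())
    features = sum((m > 0).sum(axis=0) for m in matrices.values())
    return (reads >= min_reads) & (features >= min_features)


def apply_mask(matrices, keep):
    """Drop the same failing cells from every matrix."""
    return {name: m[:, keep] for name, m in matrices.items()}


# Toy data: 3 features x 4 cells per matrix; the last cell is nearly empty.
toy_block = np.array([[5, 4, 6, 0],
                      [3, 5, 4, 0],
                      [4, 3, 5, 1]])
toy = {name: toy_block.copy() for name in MATRIX_NAMES}

keep = qc_mask(toy, min_reads=20, min_features=8)  # illustrative thresholds
filtered = apply_mask(toy, keep)
```

Pooling the per-cell totals across matrices is one possible design choice; thresholds could equally be applied to each matrix separately.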

Dimension reduction, visualization and clustering
Each of the matrices is filtered of shared regions, then normalized and log2-transformed. Linear dimension reduction is done using principal component analysis (PCA). Groups of cells are identified using a k-nearest neighbors (k-NN) algorithm and the Leiden algorithm. Finally, the four matrices are combined using matrix factorization and UMAP reduction.
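As a rough sketch of the first steps for a single matrix, the following uses NumPy only. The depth-normalization scale factor, the brute-force neighbor search, and all function names are assumptions made for illustration. Leiden clustering and the final matrix factorization and UMAP steps need dedicated libraries and are not shown.

```python
import numpy as np


def normalize_log2(counts, scale=1e4):
    """Depth-normalize each cell (column), then apply log2(x + 1).

    The scale factor is an assumed convention, not a documented setting.
    """
    depth = counts.sum(axis=0, keepdims=True)
    depth = np.where(depth == 0, 1, depth)  # avoid division by zero
    return np.log2(counts / depth * scale + 1.0)


def pca(matrix, n_components):
    """PCA via SVD. Input is features x cells; output is cells x components."""
    x = matrix.T
    x = x - x.mean(axis=0, keepdims=True)  # center each feature
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def knn_indices(embedding, k):
    """Brute-force k nearest neighbors (Euclidean) for each cell."""
    diff = embedding[:, None, :] - embedding[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]


# Toy matrix: 4 features x 6 cells, two groups with opposite profiles.
counts = np.array([[20, 22, 21, 2, 3, 2],
                   [19, 21, 20, 3, 2, 3],
                   [2, 3, 2, 20, 22, 21],
                   [3, 2, 3, 19, 21, 20]], dtype=float)

embedding = pca(normalize_log2(counts), n_components=2)
neighbors = knn_indices(embedding, k=2)
```

In this toy example each cell's two nearest neighbors fall within its own group, which is the structure a graph-based clustering method such as Leiden would then partition.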

Cell identity annotation
There are two approaches to cell identity annotation: one based on feature annotation of ATAC peaks, and one based on integration with reference scRNA-seq data.

Current
By using the ratio of Tn5 to TnH signals, quantitative values describing how quickly and in what direction chromatin remodelling is taking place can be calculated (chromatin velocity). By isolating regions that are most dynamic and identifying which transcription factors bind there, chromatin velocity can be used to infer the dynamic epigenetic processes happening within a given cell and the contributions of various transcription factors to those processes.
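The ratio itself can be illustrated with a small sketch. It is not the chromatin velocity model, only the per-region, per-cell Tn5-to-TnH ratio that such a model would start from. The function name, the pseudocount, and the assumption that both matrices cover the same regions and cells in the same order are all hypothetical.

```python
import numpy as np


def log2_tn5_tnh_ratio(tn5, tnh, pseudocount=1.0):
    """Log2 ratio of Tn5 to TnH signal per region (row) and cell (column).

    Positive values indicate relatively more Tn5 (accessible) signal,
    negative values relatively more TnH (compacted) signal.
    The pseudocount is an assumed choice to keep the ratio finite.
    """
    tn5 = np.asarray(tn5, dtype=float)
    tnh = np.asarray(tnh, dtype=float)
    if tn5.shape != tnh.shape:
        raise ValueError("Tn5 and TnH matrices must have the same shape")
    return np.log2((tn5 + pseudocount) / (tnh + pseudocount))


# Toy example: 2 regions x 2 cells.
tn5_counts = [[7, 0],
              [1, 3]]
tnh_counts = [[0, 7],
              [1, 3]]
ratio = log2_tn5_tnh_ratio(tn5_counts, tnh_counts)
```

Here the first region is Tn5-dominated in the first cell and TnH-dominated in the second, while the second region is balanced in both cells.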

Future
Chromatin remodelling precedes changes in gene expression, so profiling it enhances the understanding of the trajectories and mechanisms of cellular changes. Thus, platforms and tools for the integration of multimodal data are areas of active research. Incorporating temporal and directional information through integration of chromatin velocity with RNA velocity has been proposed to reveal even more information about differentiation pathways.

Limitations
scGET-seq has some of the same limitations as scATAC-seq. Both methods require nuclei from samples with high cellular viability. Low cellular viability leads to high background DNA contamination that does not represent authentic biological signal. Additionally, the sparse and noisy nature of scATAC-seq and scGET-seq data makes analysis challenging, and there is no consensus yet on how best to handle these data.

Another limitation is that SNV results from scGET-seq still need to be validated by bulk genome sequencing. Even though mutations detected by bulk exome sequencing and by scGET-seq are highly correlated, scGET-seq fails to capture all exome SNVs.