Scaffolding (bioinformatics)

Scaffolding is a technique used in bioinformatics to link together a non-contiguous series of genomic sequences into a scaffold, consisting of sequences separated by gaps of known length. The sequences that are linked are typically contiguous sequences (contigs) corresponding to read overlaps. When creating a draft genome, individual reads of DNA are first assembled into contigs, which, by the nature of their assembly, have gaps between them. The next step is to bridge the gaps between these contigs to create a scaffold. This can be done using either optical mapping or mate-pair sequencing.
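As a minimal illustration of this definition, the sketch below joins contigs that are already ordered and oriented, filling each gap with a run of N characters of the estimated length. The function name build_scaffold and the toy sequences are hypothetical, and the use of N as the gap placeholder is an assumption of this example rather than something stated above.

```python
def build_scaffold(contigs, gap_lengths):
    """Join ordered contigs, representing each gap as a run of 'N' bases.

    contigs     -- list of contig sequences, already ordered and oriented
    gap_lengths -- estimated gap size between each adjacent pair of contigs
    """
    if len(gap_lengths) != len(contigs) - 1:
        raise ValueError("need exactly one gap length per adjacent contig pair")
    parts = [contigs[0]]
    for gap, contig in zip(gap_lengths, contigs[1:]):
        parts.append("N" * gap)  # unknown bases, but known (estimated) length
        parts.append(contig)
    return "".join(parts)


# Three toy contigs separated by gaps of 3 and 5 bases.
scaffold = build_scaffold(["ACGT", "GGCC", "TTAA"], [3, 5])
print(scaffold)
```

The hard part of scaffolding is deciding the order, orientation, and gap sizes; this sketch only shows what the resulting scaffold looks like once those are known.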

Assembly software
The sequencing of the Haemophilus influenzae genome marked the advent of scaffolding. That project generated a total of 140 contigs, which were oriented and linked using paired-end reads. The success of this strategy prompted The Institute for Genomic Research to develop the scaffolding program Grouper for their other sequencing projects. Until 2001, Grouper was the only stand-alone scaffolding software. After the Human Genome Project and Celera proved that it was possible to create a large draft genome, several similar programs were created. Bambus, created in 2003, was a rewrite of the original Grouper software that gave researchers the ability to adjust scaffolding parameters. It also allowed for optional use of other linking data, such as contig order in a reference genome.

Algorithms used by assembly software are very diverse, and can be classified as either graph-based or based on iterative marker ordering. Graph-based applications can order and orient over 10,000 markers, whereas iterative marker applications can handle a maximum of 3,000. Algorithms can be further classified as greedy or non-greedy, and as conservative or non-conservative. Bambus uses a greedy algorithm, so called because it joins together the contigs with the most links first. The algorithm used by Bambus 2 removes repetitive contigs before orienting and ordering the remaining contigs into scaffolds. SSPACE also uses a greedy algorithm, which begins building its first scaffold with the longest contig provided by the sequence data. SSPACE is the most commonly cited assembly tool in biology publications, likely because it is rated as significantly more intuitive to install and run than other assemblers.
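The "most links first" idea can be sketched in a few lines. The example below is an illustrative simplification, not the implementation used by Bambus or SSPACE: it ignores contig orientation and gap sizes, and the function name greedy_scaffold, the link-count dictionary, and the contig names are all hypothetical. Candidate joins are visited in order of decreasing link count, and a join is accepted only if both contigs still have a free end and the join would not close a loop.

```python
def greedy_scaffold(contigs, links):
    """Order contigs into chains, accepting the best-supported joins first.

    contigs -- list of contig names
    links   -- dict mapping (contig_a, contig_b) to the number of read
               pairs linking the two contigs
    """
    parent = {c: c for c in contigs}  # union-find forest, used to detect loops

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    neighbours = {c: [] for c in contigs}
    ranked = sorted(links.items(), key=lambda item: item[1], reverse=True)
    for (a, b), _count in ranked:
        if len(neighbours[a]) >= 2 or len(neighbours[b]) >= 2:
            continue  # a contig has only two ends to join
        if find(a) == find(b):
            continue  # already in the same scaffold; joining would form a loop
        neighbours[a].append(b)
        neighbours[b].append(a)
        parent[find(a)] = find(b)

    # Every accepted join set is a collection of simple chains; walk each
    # chain starting from one of its ends.
    scaffolds, seen = [], set()
    for start in contigs:
        if start in seen or len(neighbours[start]) == 2:
            continue
        chain, prev, cur = [start], None, start
        seen.add(start)
        while True:
            onward = [x for x in neighbours[cur] if x != prev]
            if not onward:
                break
            prev, cur = cur, onward[0]
            chain.append(cur)
            seen.add(cur)
        scaffolds.append(chain)
    return scaffolds


example_links = {("c1", "c2"): 5, ("c2", "c3"): 9, ("c1", "c3"): 2, ("c3", "c4"): 1}
print(greedy_scaffold(["c1", "c2", "c3", "c4"], example_links))
```

In this toy input the weakly supported c1-c3 link is rejected because c1 and c3 are already connected through c2, which is the kind of conflict a greedy strategy resolves in favour of the better-supported joins.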

In recent years, new kinds of assemblers have emerged that can integrate linkage data from multiple types of linkage maps. ALLMAPS is the first such program and can combine data from genetic maps, created using SNPs or recombination data, with physical maps such as optical or synteny maps.

Some software, such as ABySS and SOAPdenovo, contains gap-filling algorithms which, although they do not create any new scaffolds, serve to decrease the gap length between contigs within individual scaffolds. A standalone program, GapFiller, can close a larger number of gaps while using less memory than the gap-filling algorithms contained within assembly programs.

Utturkar et al. investigated the utility of several different assembly software packages in combination with hybrid sequence data. They concluded that the ALLPATHS-LG and SPAdes algorithms were superior to other assemblers in terms of the number, maximum length, and N50 length of contigs and scaffolds.
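N50 is not defined above, so a short worked example may help. Under the usual definition, the N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The sketch below computes it for a list of contig or scaffold lengths; the function name n50 and the example lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:  # reached at least half of the assembly
            return length
    raise ValueError("cannot compute N50 of an empty collection")


# Total length is 290; the two longest sequences (80 + 70 = 150) already
# cover at least half of it, so the N50 is 70.
print(n50([80, 70, 50, 40, 30, 20]))
```

A higher N50 generally indicates a more contiguous assembly, which is why it is reported alongside the number and maximum length of contigs and scaffolds.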

Scaffolding and next generation sequencing
Most high-throughput, next-generation sequencing platforms produce shorter reads than Sanger sequencing. These platforms can generate large quantities of data in short periods of time, but until methods were developed for de novo assembly of large genomes from short-read sequences, Sanger sequencing remained the standard method of creating a reference genome. Although Illumina platforms are now able to generate mate-pair reads with average lengths of 150 bp, they were originally only able to generate reads of 75 bp or less, which caused many in the scientific community to doubt that a reliable reference genome could ever be constructed with short-read technology. The increased difficulty of contig and scaffold assembly associated with the new technologies has created a demand for powerful new computer programs and algorithms capable of making sense of the data.

One strategy that incorporates high-throughput next-generation sequencing is hybrid sequencing, wherein several sequencing technologies are used at different levels of coverage so that their respective strengths complement each other. The release of the SMRT platform from Pacific Biosciences marked the beginning of single-molecule sequencing and long-read technology. It has been shown that 80-100X coverage with SMRT technology, which generates reads with an average length of 5456 bp, is usually sufficient to create a finished de novo assembly for prokaryotic organisms. When the funds for that level of coverage are not available, a researcher might decide to use a hybrid approach.
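To give a sense of what such a coverage target implies, the sketch below uses the standard relationship coverage = (number of reads x mean read length) / genome size to estimate how many reads are required. The 5 Mb genome size is a hypothetical value chosen for illustration, and the function name reads_needed is likewise an assumption of this example.

```python
def reads_needed(coverage, genome_size_bp, mean_read_length_bp):
    """Estimate the number of reads required to reach a target coverage.

    Rearranges coverage = reads * mean_read_length / genome_size and
    rounds up to a whole number of reads.
    """
    bases_required = coverage * genome_size_bp
    # Ceiling division on integers avoids floating-point rounding.
    return -(-bases_required // mean_read_length_bp)


# 100X coverage of a hypothetical 5 Mb prokaryotic genome with reads
# averaging 5456 bp.
print(reads_needed(100, 5_000_000, 5456))
```

The same arithmetic shows why short reads need far more sequences for the same depth: at a mean read length of 150 bp, the same target requires several million reads.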

Goldberg et al. evaluated the effectiveness of combining high-throughput pyrosequencing with traditional Sanger sequencing. With this approach they were able to greatly increase N50 contig length, decrease gap length, and even close one microbial genome.

Optical mapping
It has been shown that integrating linkage maps can aid de novo assemblies by providing long-range, chromosome-scale recombination data, without which assemblies can be subject to macro-ordering errors. Optical mapping is the process of immobilizing the DNA on a slide and digesting it with restriction enzymes. The fragment ends are then fluorescently tagged and stitched back together. For the last two decades, optical mapping has been prohibitively expensive, but recent advances in technology have reduced the cost significantly.
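One way to relate assembled contigs to a restriction-based map is to compute the fragment lengths that a given enzyme would be expected to produce from each contig. The sketch below is a simplified in silico digest, not a description of any particular optical mapping pipeline: it scans only the forward strand, treats the sequence as linear, and uses EcoRI (recognition site GAATTC, cutting after the first G) on a made-up sequence. The function name digest_fragment_lengths is hypothetical.

```python
def digest_fragment_lengths(sequence, site, cut_offset):
    """Return the fragment lengths produced by cutting at every site.

    sequence   -- linear DNA sequence (forward strand only)
    site       -- recognition sequence of the restriction enzyme
    cut_offset -- position of the cut within the site, counted from its start
    """
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    boundaries = [0] + cuts + [len(sequence)]
    return [right - left for left, right in zip(boundaries, boundaries[1:])]


# Toy 20-base sequence containing two EcoRI sites.
toy_contig = "AAGAATTCTTTTGAATTCCC"
print(digest_fragment_lengths(toy_contig, "GAATTC", 1))
```

The ordered list of fragment lengths acts as a fingerprint of the contig, which is the kind of long-range information that helps place contigs in the correct order.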