Read (biology)

In DNA sequencing, a read is an inferred sequence of base pairs (or base pair probabilities) corresponding to all or part of a single DNA fragment. A typical sequencing experiment involves fragmentation of the genome into millions of molecules, which are size-selected and ligated to adapters. The set of fragments is referred to as a sequencing library, which is sequenced to produce a set of reads.

Read length
Sequencing technologies vary in the length of reads produced. Reads of length 20-40 base pairs (bp) are referred to as ultra-short. Typical sequencers produce read lengths in the range of 100-500 bp. However, Pacific Biosciences platforms produce read lengths of approximately 1500 bp. Read length can affect the results of biological studies. For example, longer read lengths improve the resolution of de novo genome assembly and the detection of structural variants. It is estimated that read lengths greater than 100 kilobases (kb) will be required for routine de novo human genome assembly. Bioinformatic pipelines that analyze sequencing data usually take read length into account.

Generations of sequencing and read lengths
A genome is the complete genetic information of an organism or a cell. Single- or double-stranded nucleic acids store this information in a linear or circular sequence. To determine this sequence precisely, increasingly efficient technologies with higher accuracy, throughput and sequencing speed have been developed over time. Sanger and Maxam-Gilbert sequencing, both published in 1977, initiated the field of DNA sequencing and are classified as First Generation Sequencing technologies. First Generation Sequencing typically has read lengths of 400 to 900 base pairs.

In 2005, Roche’s 454 technology introduced a new sequencing approach capable of high throughput at low cost. This and similar technologies came to be known as Second Generation Sequencing or Next Generation Sequencing (NGS). One of the hallmarks of NGS is short sequence reads. NGS methods may sequence millions to billions of reads in a single run, and generating gigabases of sequence data takes only a few hours or days, which is a clear advantage over first-generation techniques such as Sanger sequencing. All NGS techniques produce short reads, i.e. 80–200 bases, as opposed to the longer reads produced by Sanger sequencing.

Beginning in the 2010s, new technologies ushered in the Third-Generation Sequencing (TGS) era. TGS describes methods that are capable of sequencing single DNA molecules without amplification. While Sanger and short-read sequencing (SRS) techniques can only produce read lengths of up to about one kilobase pair, third-generation sequencing technologies can produce read lengths of 5 to 30 kilobase pairs. The longest read ever generated by a third-generation sequencing technology is 2 million base pairs.

NGS and read mapping
Historically, only one individual per species was sequenced because of time and expense constraints, and its sequence served as the species' "reference" genome. These reference genomes can guide resequencing efforts in the same species by serving as a read mapping template. Read mapping is the process of aligning NGS reads to a reference genome. Any NGS application, such as genome variation calling, transcriptome analysis, transcription factor binding site calling, epigenetic mark calling, or metagenomics, requires read mapping, and the performance of these applications depends on accurate alignment. Furthermore, because the number of reads is so large, the mapping process must be efficient. Different methods are used to align reads to a reference genome, depending on how many mismatches and indels are allowed. Roughly speaking, the methods can be divided into two categories: the seed-and-extend approach and the filtering approach. Many short read aligners use the seed-and-extend strategy, such as BWA-SW, Bowtie 2, BatAlign, LAST, Cushaw2 and BWA-MEM. A filter-based approach is used by a number of methods such as SeqAlto, GEM and MASAI.
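
The seed-and-extend idea can be illustrated with a deliberately simplified sketch. The function names, the k-mer hash index, the ungapped comparison and the mismatch threshold below are all assumptions made for illustration; none of the aligners named above is implemented this way in detail, and real tools use far more compact indexes and also handle indels.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def map_read(read, reference, index, k, max_mismatches=2):
    """Return (position, mismatches) of the best ungapped placement, or None."""
    # Seed: exact k-mer matches propose candidate start positions for the read.
    candidates = set()
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], []):
            start = hit - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    # Extend: compare the whole read with the reference at each candidate.
    best = None
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return best

# Toy reference and reads, invented for this example.
reference = "ACGTTGCAAGGCTTACCGATGGA"
index = build_kmer_index(reference, k=4)
exact_hit = map_read("GGCTTACC", reference, index, k=4)         # identical to the reference
one_mismatch_hit = map_read("GGCTAACC", reference, index, k=4)  # one substituted base
```

Only positions that share at least one exact k-mer with the read are examined in full, which is what keeps the approach practical when the number of reads is very large.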

Genome assembly and sequence reads
In genomics, reconstructing genomes from DNA sequencing reads is a significant challenge. Because fragments are sampled at random, the retrieved reads span the entire genome roughly uniformly. Reads are stitched together computationally to reconstruct the genome. This process is known as de novo genome assembly.
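
As a rough illustration of how reads can be stitched together, the following sketch repeatedly merges the two sequences with the longest suffix-prefix overlap. It is a toy greedy procedure under strong assumptions (error-free reads, a single strand, no repeats); the function names and the `min_overlap` parameter are hypothetical, and it is not the algorithm of any particular assembler.

```python
def overlap_length(left, right, min_overlap):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no overlaps left: the assembly stays fragmented
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs

# Toy genome, invented for this example, sampled as 6 bp reads every 2 bp.
genome = "ATGCGTACGTTAGC"
reads = [genome[i:i + 6] for i in range(0, len(genome) - 5, 2)]
contigs = greedy_assemble(reads)
```

In this toy case the overlapping reads collapse into a single contig equal to the original sequence; with repeats or sequencing errors a greedy strategy can merge the wrong reads, which is one reason practical assemblers rely on graph-based formulations.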

Sanger sequencing has a larger read length than NGS. Two assemblers were developed for assembling Sanger sequencing reads: the OLC assembler Celera and the de Bruijn graph assembler Euler. These two methods were used to assemble the human reference genome. However, since Sanger sequencing is low-throughput and expensive, only a few genomes have been assembled with Sanger sequencing.

Second-generation sequencing reads are short, but these sequencing techniques can efficiently and cost-effectively sequence hundreds of millions of reads. Dedicated genome assemblers have been built to reconstruct genomes from such short sequences, and their success spawned several de novo genome assembly projects. Although this method is cost-effective, the reads are short and the repeat regions are long, resulting in fragmented assemblies.
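
The effect of repeats on short-read assembly can be seen in a small de Bruijn graph sketch. The sequence below is invented for illustration, the reads are idealised (error-free, one starting at every position), and the helper names are hypothetical. A node with more than one successor marks a point where the reads alone cannot decide which path the genome follows.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Link overlapping (k-1)-mers using every k-mer found in the reads."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def branching_nodes(graph):
    """Nodes with more than one successor, i.e. ambiguous continuations."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)

def sample_reads(genome, read_length):
    """Idealised error-free reads: one starting at every position."""
    return [genome[i:i + read_length] for i in range(len(genome) - read_length + 1)]

# Toy sequence in which the 5 bp stretch "TGCAT" occurs twice.
genome = "TGCATCGATTGCATAGC"

# Reads shorter than the repeat: the graph branches inside the repeat.
short_branches = branching_nodes(de_bruijn_graph(sample_reads(genome, 4), k=4))

# Reads longer than the repeat: every node has a single successor.
long_branches = branching_nodes(de_bruijn_graph(sample_reads(genome, 7), k=7))
```

With 4 bp reads the node `CAT` has two possible continuations, so an assembler would have to break the contig there; with 7 bp reads, which span the whole repeat, the ambiguity disappears.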

The arrival of third-generation sequencing has made very long reads (of 10,000 bp) available. Long reads are capable of resolving the ordering of repeat regions, although they have a high error rate (15–18%). A number of computational methods have been devised to correct errors in third-generation sequencing reads.

Assembling with short reads and assembling with long reads have different advantages and disadvantages owing to error rates and ease of assembly. Sometimes a hybrid method is preferred, in which short reads and long reads are combined to get a better result. There are two approaches: the first uses mate-pair reads and long reads to improve the assembly built from the short reads, and the second uses short reads to correct the errors in long reads.

Advantages and disadvantages of short reads
Second-generation sequencing generates short reads (of length < 300 bp) that are highly accurate (sequencing error rate of about 1%). Short read sequencing technologies have made sequencing much easier, faster and cheaper than Sanger sequencing. The August 2019 report from the National Human Genome Research Institute put the cost of sequencing a complete human genome at US$942.

The inability to sequence lengthy sections of DNA is a drawback shared by all second-generation sequencing technologies. To use NGS to sequence a large genome such as the human genome, the DNA must be fragmented and amplified in clones ranging from 75 to 400 base pairs, which is why NGS is also known as "short-read sequencing" (SRS). Once the short reads have been sequenced, reconstruction becomes a computational problem, and many computer programs and techniques have been developed to assemble the random clones into a contiguous sequence.

A necessary step in SRS is the polymerase chain reaction (PCR), which causes preferential amplification of repetitive DNA. SRS also fails to generate sufficient overlapping sequence from the DNA fragments. This constitutes a major challenge for de novo sequencing of a highly complex and repetitive genome like the human genome. Another challenge with SRS is the detection of large sequence changes, which is a major roadblock to studying structural variations.

Advantages and disadvantages of long reads
Third-generation sequencing produces long reads and is often referred to as long read sequencing (LRS). LRS technologies are capable of sequencing single DNA molecules without amplification. The availability of long reads is a great advantage: with NGS it is often difficult to generate a long continuous consensus sequence, because overlaps between short reads are hard to detect, and this lowers the overall quality of the assembly. LRS has been shown to considerably improve the quality of genome assemblies in several studies. Another advantage of LRS over NGS is that it can characterize a variety of epigenetic marks at the same time as it sequences the DNA.

The major challenges of LRS are accuracy and cost, although LRS is improving fast in both areas.