Bloom filters in bioinformatics

Bloom filters are space-efficient probabilistic data structures used to test whether an element is a member of a set. They require much less space than other data structures for representing sets, but queries have a nonzero false positive rate. Because different elements can hash to the same positions, a query for an element that was never added returns a positive whenever all of its positions have already been set by other elements. Assuming that each hash function selects every index of the Bloom filter with equal probability, the false positive rate is a function of the number of bits, the number of hash functions and the number of elements stored. This allows the user to manage the risk of a false positive by trading away some of the space savings of the Bloom filter.
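
The dependence of the false positive rate on these three quantities can be made concrete with the standard approximation (1 - e^(-kn/m))^k, where m is the number of bits, k the number of hash functions and n the number of stored elements. The short Python sketch below evaluates it for a few sizes; the function names are illustrative, and the formula assumes uniform, independent hash functions.

```python
import math


def false_positive_rate(m_bits: int, k_hashes: int, n_items: int) -> float:
    """Approximate false positive rate, assuming uniform independent hashes."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


def optimal_num_hashes(m_bits: int, n_items: int) -> int:
    """Number of hash functions that minimises the approximation above."""
    return max(1, round(m_bits / n_items * math.log(2)))


n = 1_000_000  # number of stored elements (illustrative)
for bits_per_item in (4, 8, 16):
    m = bits_per_item * n
    k = optimal_num_hashes(m, n)
    rate = false_positive_rate(m, k, n)
    print(f"{bits_per_item} bits/item, {k} hashes: rate ~ {rate:.5f}")
```

Doubling the number of bits per element lowers the false positive rate sharply, which is the space-versus-accuracy trade-off described above.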

Bloom filters are primarily used in bioinformatics to test the existence of a k-mer in a sequence or set of sequences. The k-mers of the sequence are indexed in a Bloom filter, and any k-mer of the same size can be queried against the Bloom filter. This is a preferable alternative to storing the k-mers of a sequence in a hash table, particularly when the sequence is very long, since holding large numbers of k-mers in memory is very demanding.
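
As a minimal sketch of this indexing step, the following Python code builds a small Bloom filter and inserts every k-mer of a toy sequence. The class, its parameters and the double-hashing scheme based on SHA-256 are assumptions made for illustration; real tools use faster hash functions and usually treat a k-mer and its reverse complement as the same element, which this sketch does not.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over strings (illustrative, not optimised)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive all positions from two 64-bit halves of one digest (double hashing).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] >> (pos % 8) & 1 for pos in self._positions(item))


def kmers(sequence: str, k: int):
    """Yield every overlapping k-mer of the sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


sequence = "ACGTACGTGACCTGA"  # toy sequence
k = 5
bf = BloomFilter(num_bits=1 << 16, num_hashes=3)
for kmer in kmers(sequence, k):
    bf.add(kmer)

print("ACGTA" in bf)  # a k-mer of the sequence, so always reported as present
print("TTTTT" in bf)  # not in the sequence; reported as present only on a false positive
```

Only the bit array is kept in memory, not the k-mers themselves, which is where the saving over a hash table comes from.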

Sequence characterization
The preprocessing step in many bioinformatics applications involves classifying sequences, primarily reads from a DNA sequencing experiment. For example, in metagenomic studies it is important to be able to tell whether a sequencing read belongs to a new species, and in clinical sequencing projects it is vital to filter out reads from the genomes of contaminating organisms. Many bioinformatics tools use Bloom filters to classify reads by querying the k-mers of a read against a set of Bloom filters generated from known reference genomes. Tools that use this method include FACS and BioBloom Tools. While these methods may not outclass other bioinformatics classification tools such as Kraken, they offer a memory-efficient alternative.
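
The general idea of classifying a read against several reference filters can be sketched as follows. This is not the algorithm of FACS or BioBloom Tools: the scoring rule (fraction of the read's k-mers found in a filter), the `min_fraction` threshold, the toy sequences and all names are assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Small illustrative Bloom filter for strings."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_filter(reference, k):
    bf = BloomFilter()
    for km in kmers(reference, k):
        bf.add(km)
    return bf


def classify(read, filters, k, min_fraction=0.5):
    """Return the best-matching reference name, or None if no filter reaches min_fraction."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return None
    best_name, best_score = None, 0.0
    for name, bf in filters.items():
        score = sum(km in bf for km in read_kmers) / len(read_kmers)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= min_fraction else None


K = 8
host = "ACGTACGTTAGCCGATAGGCTAACGT"        # toy reference sequences
contaminant = "TTTTGGGGCCCCAAAATTGGCCAA"
filters = {"host": build_filter(host, K), "contaminant": build_filter(contaminant, K)}

print(classify(host[3:20], filters, K))
print(classify(contaminant[2:18], filters, K))
print(classify("GAGAGAGAGAGAGAGA", filters, K))  # matches neither reference
```

A read that reaches the threshold in no filter is left unclassified, which is how such a scheme could flag reads from a species that is not among the references.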

A recent area of research with Bloom filters in sequence characterization is in developing ways to query raw reads from sequencing experiments. For example, how can one determine which reads contain a specific 30-mer in the entire NCBI Sequence Read Archive? This task is similar to the one accomplished by BLAST, but it involves querying a much larger dataset: while BLAST queries a database of reference genomes, this task requires returning the specific reads that contain the k-mer. BLAST and similar tools cannot handle this problem efficiently, so Bloom filter based data structures have been implemented for this purpose. Binary bloom trees are binary trees of Bloom filters that facilitate querying transcripts in large RNA-seq experiments. BIGSI borrows bitsliced signatures from the field of document retrieval to index and query the entirety of microbial and viral sequencing data in the European Nucleotide Archive. The signature of a given dataset is encoded as a set of Bloom filters from that dataset.

Genome assembly
The memory efficiency of Bloom filters has been used in genome assembly to reduce the space footprint of k-mers from sequencing data. Bloom filter based assembly methods combine Bloom filters and de Bruijn graphs into a structure called a probabilistic de Bruijn graph, which reduces memory usage at the cost of the false positive rate inherent to Bloom filters. Instead of being stored in a hash table, the de Bruijn graph is stored in a Bloom filter.

Using a Bloom filter to store the de Bruijn graph complicates the graph traversal step that builds the assembly, since edge information is not encoded in the Bloom filter. Graph traversal is accomplished by querying the Bloom filter for each of the four possible subsequent k-mers from the current node. For example, if the current node is for the k-mer ACT, then the next node must be for one of the k-mers CTA, CTG, CTC or CTT. If a queried k-mer exists in the Bloom filter, then the k-mer is added to the path. There are therefore two sources of false positives when querying the Bloom filter during traversal of the de Bruijn graph. First, one or more of the three false k-mers may exist elsewhere in the sequencing set and so return a positive; second, there is the aforementioned inherent false positive rate of the Bloom filter itself. Assembly tools that use Bloom filters must account for these sources of false positives in their methods. ABySS 2 and Minia are examples of assemblers that use this approach for de novo assembly.
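
The traversal described above can be sketched in Python as follows. The code is a simplified illustration, not the method of ABySS 2 or Minia: it extends a path only while exactly one successor is reported and stops at any branch, whereas real assemblers use additional techniques to detect and remove false branches. The Bloom filter class, the function names and the toy sequence are assumptions.

```python
import hashlib


class BloomFilter:
    """Small illustrative Bloom filter for strings."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))


def successors(kmer, bf):
    """Candidate next k-mers that the Bloom filter reports as present."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bf]


def extend_right(seed, bf, max_steps=1000):
    """Walk right from the seed k-mer while the next node is unambiguous."""
    contig, current, seen = seed, seed, {seed}
    for _ in range(max_steps):
        nxt = successors(current, bf)
        if len(nxt) != 1 or nxt[0] in seen:
            break  # dead end, branch (possibly caused by a false positive), or cycle
        current = nxt[0]
        seen.add(current)
        contig += current[-1]
    return contig


K = 4
sequence = "ATGGCGTACA"  # toy sequence whose 3-mers are all distinct
bf = BloomFilter()
for i in range(len(sequence) - K + 1):
    bf.add(sequence[i:i + K])

print(successors("ATGG", bf))
print(extend_right("ATGG", bf))
```

Each step costs four membership queries, because the edges are never stored: they are rediscovered from the k-mer set every time the graph is walked.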

Sequencing error correction
Next-generation sequencing (NGS) methods have allowed new genome sequences to be generated much faster and more cheaply than with the earlier Sanger sequencing methods. However, these methods have a higher error rate, which complicates downstream analysis of the sequence and can even give rise to erroneous conclusions. Many methods have been developed to correct the errors in NGS reads, but they use large amounts of memory, which makes them impractical for large genomes such as the human genome. Tools using Bloom filters have therefore been developed to address these limitations, taking advantage of their efficient memory usage. Musket and BLESS are examples of such tools. Both use the k-mer spectrum approach for error correction. The first step of this approach is to count the multiplicity of k-mers. BLESS uses only Bloom filters to store the counts, whereas Musket uses Bloom filters only to count unique k-mers and stores non-unique k-mers in a hash table, as described in a previous work.
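
The combination of a Bloom filter for first occurrences with a hash table for repeated k-mers can be sketched as below. This illustrates the general idea only and is not Musket's implementation; the class, the function name and the toy reads are assumptions. A k-mer enters the hash table only once the Bloom filter reports that it has been seen before, so k-mers that occur a single time, which are often sequencing errors, normally never occupy a hash table entry.

```python
import hashlib


class BloomFilter:
    """Small illustrative Bloom filter for strings."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))


def count_repeated_kmers(reads, k, bf):
    """Count k-mers seen at least twice; first sightings go only to the Bloom filter."""
    counts = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in bf:
                # Seen before (or a false positive): keep an exact count in the hash table.
                counts[kmer] = counts.get(kmer, 1) + 1
            else:
                bf.add(kmer)
    return counts


reads = ["ACGTACGT", "CGTACGTA", "TTTTACGT"]  # toy reads
counts = count_repeated_kmers(reads, k=4, bf=BloomFilter())
for kmer, count in sorted(counts.items()):
    print(kmer, count)
```

A Bloom filter false positive can place a single-occurrence k-mer in the table with an inflated count, so a scheme of this kind needs some way to tolerate or correct such entries.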

RNA-Seq
Bloom filters are also employed in some RNA-Seq pipelines. RNA-Skim clusters RNA transcripts and then uses Bloom filters to find sig-mers: k-mers that are found in only one of the clusters. These sig-mers are then used to estimate transcript abundance levels. Because it does not analyze every possible k-mer, RNA-Skim improves performance and memory usage, and it has been shown to work as well as previous methods.