User:Rzvogel/Draft of the page for MetaVelvet

MetaVelvet is an assembler developed by Toshiaki Namiki et al. It is a set of algorithms for assembling short-read metagenomic sequences. It extends the Velvet assembler, a set of algorithms for assembling short-read genomic sequences from a single species.

Algorithm
The algorithm assumes that when the Velvet assembler constructs a de Bruijn graph from the output of a high-throughput sequencer run on a metagenomic sample, the resulting graph is a collection of sub-graphs, each of which represents one species in the sample. MetaVelvet reuses many components of Velvet but adds a step that breaks the graph into these sub-graphs. It first uses Velvet to construct a de Bruijn graph for the metagenomic sample being assembled. To do this, the input short reads are hashed into k-mers that overlap each other by (k-1) nucleotides, because every node in the graph is a chain of k-mers in which adjacent k-mers overlap by (k-1) nucleotides. Every node has a twin node made up of the reverse complements of its k-mers. An edge connects two nodes when the last k-mer of one node overlaps the first k-mer of the other by (k-1) nucleotides. Once the graph is constructed, the next step is to find and separate the sub-graph for each species. This is done by constructing a histogram of the length-weighted frequencies of coverage values and using a Gaussian distribution to detect its peaks. Each peak is taken to represent one species in the graph. Two of the detected peaks are used to separate the graph into sub-graphs, and this is repeated until the graph is fully separated. The main difficulty is detecting and handling "chimeric nodes", which are nodes that belong to two or more sub-graphs. When such a node is found, it is split into two nodes, one for each sub-graph. Once the sub-graphs have been found, Velvet finishes running on each sub-graph, and the sequences and counts of the species in the sample can be determined.
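The k-mer hashing, the (k-1) overlap, and the twin relationship described above can be illustrated with a minimal Python sketch. It is a simplification and not Velvet's implementation: each node here holds a single k-mer rather than a chain of k-mers, and the function names (kmers, reverse_complement, de_bruijn_edges) are hypothetical names chosen for this illustration.

```python
# Illustrative sketch only; names are hypothetical and not part of Velvet.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def kmers(read, k):
    """Return the k-mers of a read in order; neighbours overlap by k-1 bases."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def reverse_complement(seq):
    """Return the reverse complement, the sequence a twin node would hold."""
    return seq.translate(_COMPLEMENT)[::-1]


def de_bruijn_edges(reads, k):
    """Collect an edge for every pair of consecutive k-mers in every read."""
    edges = set()
    for read in reads:
        ks = kmers(read, k)
        for a, b in zip(ks, ks[1:]):
            edges.add((a, b))  # a[1:] == b[:-1], a (k-1) overlap
    return edges
```

For example, kmers("ACGTA", 3) yields ACG, CGT and GTA, and each edge joins two k-mers that share k-1 = 2 nucleotides.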

Example


This is an example of how a de Bruijn graph constructed by Velvet can be decomposed into several sub-graphs that can then be used by Velvet. In this example, the twin nodes are omitted to make the image clearer.

Pseudocode
The algorithm takes as input a set of short reads from metagenomic data and a value k, the length of the k-mers. It mostly uses functions that are built into Velvet. The main addition is steps 4 to 8, in which the sub-graphs are constructed.

 1  MetaVelvet(reads, k)
 2     kmers = Hash all elements in reads to a set of k-mers    // This will use Velvet
 3     G = Construct a de Bruijn graph using kmers               // This will use Velvet
 4     SG = NULL                                                 // Will be a set of all sub-graphs
 5     hist = construct a histogram for G                        // Histogram of the "length-weighted frequencies" of coverage values
 6     for each peak p in hist:
 7        separate p as a sub-graph and add to SG                // Divide "chimeric nodes" if found
 8     end for
 9     finish running Velvet for each sub-graph in SG
 10 end MetaVelvet
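Steps 5 and 6 can be sketched in Python as follows. This is only an illustration under stated assumptions: each node is represented as a hypothetical (length, coverage) pair, the bin width is an arbitrary parameter, and peaks are found with a simple local-maximum test rather than the distribution fitting that MetaVelvet applies.

```python
from collections import defaultdict


def length_weighted_histogram(nodes, bin_width=1.0):
    """Sum node lengths per coverage bin.

    nodes is an iterable of (length, coverage) pairs (an assumed format).
    Returns a dict mapping bin index to total node length in that bin.
    """
    hist = defaultdict(float)
    for length, coverage in nodes:
        hist[int(coverage // bin_width)] += length
    return dict(hist)


def find_peaks(hist):
    """Return bins heavier than both neighbouring bins (absent bins count as 0)."""
    return sorted(
        b for b, w in hist.items()
        if w > hist.get(b - 1, 0.0) and w > hist.get(b + 1, 0.0)
    )
```

With nodes clustered around coverage 10 and coverage 30, the two clusters show up as two separate peaks, one per presumed species.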

The sub-graphs are separated by first finding the peaks in the histogram. For each peak, the algorithm then labels every node either "on" or "off", depending on whether or not it belongs to that peak's sub-graph: "on" nodes are in the sub-graph and "off" nodes are not. To find "chimeric nodes", the algorithm looks for nodes that have incoming edges from both an "on" node and an "off" node and outgoing edges to both an "on" node and an "off" node. Each such node is split into two, and the algorithm moves on.
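The chimeric-node test described above can be sketched in Python. The sketch assumes the graph is given as a list of (source, destination) edges and that the "on" labels are given as a set of node names; both representations and the function name find_chimeric_nodes are hypothetical, and the splitting step itself is not shown.

```python
def find_chimeric_nodes(edges, on_nodes):
    """Return nodes with both "on" and "off" predecessors and successors.

    edges is an iterable of (source, destination) pairs and on_nodes is the
    set of nodes labelled "on"; every other node is treated as "off".
    """
    preds, succs = {}, {}
    for src, dst in edges:
        succs.setdefault(src, set()).add(dst)
        preds.setdefault(dst, set()).add(src)

    def mixed(neighbours):
        # True when the neighbours include at least one "on" and one "off" node.
        has_on = any(n in on_nodes for n in neighbours)
        has_off = any(n not in on_nodes for n in neighbours)
        return has_on and has_off

    nodes = set(preds) | set(succs)
    return sorted(
        n for n in nodes
        if mixed(preds.get(n, ())) and mixed(succs.get(n, ()))
    )
```

In a graph where node x receives edges from one "on" and one "off" node and sends edges to one "on" and one "off" node, x is reported as chimeric, while a node whose neighbours all carry the same label is not.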

Complexity
The Big O running time of MetaVelvet is expected to equal that of Velvet, because separating the sub-graphs requires little work compared with the rest of Velvet. The number of "chimeric nodes" should be low, so splitting them creates few new nodes and does not increase the size of the graph enough to noticeably affect the running time of Velvet's remaining operations.

Proof of correctness
The algorithm relies on two main components. The first is that Velvet works, which has been shown and is supported by its use on real-world problems. The second is that the decomposition method works. The first graph that is constructed is in fact a collection of sub-graphs, because the metagenomic data used to construct it comes from many different species. If each species in the sample were extracted and run through the assembly process as a single-genome sample, it would produce the same graph that appears as a sub-graph for the metagenomic data. Since both of these components hold, it is reasonable to assume that MetaVelvet will work.

Related algorithms
Another algorithm that assembles short-read metagenomic sequences is Meta-IDBA by Yu Peng et al. It performs better on metagenomic samples in which the organisms differ only at the species level.