Talk:Sequence assembly

ALLPATHS
De-novo application is not in the list —Preceding unsigned comment added by 130.225.211.21 (talk) 10:29, 2 September 2009 (UTC)

Add to assembler table
a column that denotes support for paired end data. —Preceding unsigned comment added by 130.225.211.21 (talk) 12:14, 2 September 2009 (UTC)

Also separate the "presented / updated date" column into Date Available and Date Updated fields for better sorting. Scott Daniel (talk) 23:17, 21 April 2017 (UTC)

Algorithm?
What's with the algorithm? Surely, it would be more relevant to give the optimal (NP-complete) algorithm - than a very simplistic one?
 * The assembly problem is a variation of the Hamiltonian path problem, which is known to be NP-hard in its function problem version, known as the Travelling salesman problem. The use of a heuristic is then justified. However, I think that the pseudocode should be placed elsewhere, i.e. in the Shortest common supersequence article.
 * 213.121.151.178 03:37, 26 February 2006 (UTC)
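
For readers following this thread, the kind of simple heuristic being discussed can be sketched as a greedy merge: repeatedly join the two fragments with the longest suffix/prefix overlap. This is only an illustrative sketch, not the pseudocode that was in the article; the function names `overlap` and `greedy_superstring` are made up here, and the result is not guaranteed to be the shortest possible superstring.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments: list[str]) -> str:
    """Greedy heuristic: keep merging the pair with the largest overlap."""
    unique = list(dict.fromkeys(fragments))  # drop duplicates, keep order
    # Drop fragments that are wholly contained in another fragment.
    frags = [f for f in unique if not any(f != g and f in g for g in unique)]
    while len(frags) > 1:
        best_k, best_i, best_j = -1, 0, 1
        # All-pairs comparison: this is the quadratic step.
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for n, f in enumerate(frags) if n not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""


reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
print(greedy_superstring(reads))
```

The all-pairs loop is what makes this naive version quadratic in the number of fragments, which is the cost discussed in the next section.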

Complexity of de novo vs mapping assembly
I'm not happy with this. Naively comparing the fragments of size n to a genome of size m is O(nm), so not much better than O(n^2) for de novo. The O(log n) seems ridiculous and should probably be O(n log n), but still needs a cite. —Preceding unsigned comment added by Ketil (talk • contribs) 09:14, 26 February 2009 (UTC)


 * A lookup against a fixed set of sequences (like a genome) can be trimmed down to take a maximum of O(n) by using hashing techniques; on real-world data it's even closer to O(1). The paper on "SSAHA" gives quite a nice introduction. Extending the approach to the comparison of one set of sequences against (another or the same) set of sequences then adds an additional layer of complexity, but it's closer to O(log n) than O(n log n). However, papers citing the complexity of the overlapper algorithms are hard to come by; the authors normally don't go that deep into the details. BaChev (talk) 22:38, 27 February 2009 (UTC)
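
To make the hashing idea concrete, here is a minimal sketch of a k-mer hash index, assuming exact matches only. It is in the spirit of the approach described above but is not the SSAHA implementation; `build_kmer_index` and `candidate_positions` are hypothetical names. Building the index takes one pass over the reference, and each k-mer lookup afterwards is an expected constant-time dictionary access.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_positions(read: str, index: dict[str, list[int]], k: int) -> dict[int, int]:
    """Count, per implied read start on the reference, how many k-mers agree."""
    votes = defaultdict(int)
    for offset in range(len(read) - k + 1):
        # .get avoids creating empty entries for k-mers absent from the reference.
        for pos in index.get(read[offset:offset + k], ()):
            votes[pos - offset] += 1
    return dict(votes)


reference = "ACGTACGTTAGC"
index = build_kmer_index(reference, k=4)
print(candidate_positions("CGTTAG", index, k=4))
```

In this toy example all three 4-mers of the read vote for start position 5 on the reference, which is where the read matches exactly.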

Guys - I'm an amateur in this area compared to the folks commenting here, so I don't expect you to take what I say at face value without checking it out yourselves, but here goes... the complexity of checking all reads against all other reads, where there are N reads in total and each read is of length L, is N^2*L *in space and time*. That is *not* the same as N^2*L in time. If you use N space then the time factor is N*L, and since L is usually fixed for any sequencer (and quite small for short reads), it can be treated as a constant making the complexity - as most people normally use the expression - O(N). The data structure which allows you to do this is a 'trie' or in our case, specifically a 4-way branching tree of maximum depth 'L'. The space is not really proportional to N due to the shared branches from the root of the tree plus the fact that duplicate reads don't require extra storage space other than a count. Back in 2013 I wrote some code for assembling short reads using a trie data structure like this and I can confirm that it did indeed take time linearly proportional to N, and as a bonus, the trie construction was what we call 'embarrassingly parallel'. Performance is limited primarily by available RAM rather than CPU cores, but the RAM doesn't have to be all on the one node. A cluster will give an almost linear improvement in speed due to the linear improvement in available RAM. I found the software from 2013 on my backup drive recently and thought it might be worth putting on GitHub ( https://github.com/gtoal/genelab ) for anyone who wants to experiment with a trie-based OLC engine. It's fairly basic in terms of functionality, being primarily just the overlap engine, but I'm sure you guys could build your own code around it. Just to re-emphasise the benefit of the data structure, comparing one read against all reads in the trie is about the same cost as a simple string compare. You simply walk down the tree levels one letter at a time.
The trie structure is also far more space efficient than a hash table. I've written up what I remember about the code at that GitHub site. Because it's somewhat basic code that doesn't have a user interface to make it immediately usable, I'm not going to mention it on the main page here, but I hope you'll excuse me writing here in the Talk section because I'm 10 years out of touch with the DNA assembly world and this is the only obvious place I could find where DNA assemblers are being discussed in general terms. Anyway - to summarize - if you first convert all reads into a large trie, then all the operations you would normally do with de Bruijn algorithms, such as matching sequences or building contigs, can be done much, much faster using that trie than would be done by a de Bruijn style assembler. True, it moves the expensive computation from the de Bruijn overlap discovery to the trie-building, but the overall CPU is much less and the pre-computed fast data structure makes all following uses of it just incredibly fast. — Preceding unsigned comment added by 70.124.38.160 (talk) 01:09, 30 June 2023 (UTC)
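
For anyone who wants to see the shape of the data structure described in the comment above, here is a small sketch of a read trie with duplicate counts and a suffix/prefix overlap query. It is not the code from the linked repository; `ReadTrie`, `reads_with_prefix` and `overlaps` are names invented for illustration, and the overlap query assumes exact matches between reads of roughly equal length.

```python
class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}  # base ('A', 'C', 'G' or 'T') -> TrieNode
        self.count = 0      # number of reads that end exactly at this node


class ReadTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, read: str) -> None:
        """Walk down one letter at a time; duplicates only bump a counter."""
        node = self.root
        for base in read:
            if base not in node.children:
                node.children[base] = TrieNode()
            node = node.children[base]
        node.count += 1

    def reads_with_prefix(self, prefix: str) -> list[tuple[str, int]]:
        """Return (read, count) for every stored read starting with prefix."""
        node = self.root
        for base in prefix:
            node = node.children.get(base)
            if node is None:
                return []
        found, stack = [], [(node, prefix)]
        while stack:
            cur, text = stack.pop()
            if cur.count:
                found.append((text, cur.count))
            for base, child in cur.children.items():
                stack.append((child, text + base))
        return sorted(found)


def overlaps(trie: ReadTrie, read: str, min_overlap: int) -> list[tuple[int, str, int]]:
    """Find reads whose prefix equals a proper suffix of `read`."""
    hits = []
    for start in range(1, len(read) - min_overlap + 1):
        suffix = read[start:]
        for other, count in trie.reads_with_prefix(suffix):
            hits.append((len(suffix), other, count))
    return hits


trie = ReadTrie()
for r in ["ACGTAC", "GTACGG", "TACGGA", "ACGTAC"]:
    trie.insert(r)
print(trie.reads_with_prefix("ACG"))
print(overlaps(trie, "ACGTAC", min_overlap=3))
```

Each lookup walks at most as many levels as the query has letters, regardless of how many reads are stored, which is the property the comment relies on. This sketch tries every suffix of the read in turn, so it does more work per read than a single walk would.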

Comparative scoring framework?
Do any frameworks exist that challenge various assemblers to perform the same assembly and then score their results? --Jhannah (talk) 09:36, 17 April 2009 (UTC)


 * Not to my knowledge, but this would be a useful development if it were properly implemented. Currently the strategy for 'benchmarking' these technologies is to provide sequencing reads for which a high-quality reference sequence is known (people use a wide variety of different genomes for this) and simply compare the result to the standard. The different programs are aimed at different situations, so no standard measure is really appropriate. Look at some of the papers referenced in this article for examples of this. —Preceding unsigned comment added by 71.70.238.166 (talk) 21:48, 29 May 2009 (UTC)

Licenses
What the hell does "C++" in the ALLPATHS-LG entry mean? —Preceding unsigned comment added by 162.129.251.17 (talk) 17:55, 3 May 2011 (UTC)

Pictures and figures?
Can we add some? --Dan Bolser (talk) 03:57, 14 September 2012 (UTC)

'Available assemblers' list
This list is rather long - perhaps this section should be renamed 'Notable assemblers', and assemblers without their own article removed. Any thoughts on this? Amkilpatrick (talk) 17:59, 28 June 2017 (UTC)
 * I've been bold and cut this list back with this edit Amkilpatrick (talk) 11:03, 6 July 2017 (UTC)
 * You are treading a very, very fine line here. Who decides what is 'notable'? You? On what criteria: used today, number of citations, number of genomes assembled with them, availability, known by you, own pages in Wikipedia? None of these are good measures. E.g., Euler was the first assembler demonstrating feasibility of k-mer (de Bruijn graph) assembly, is that not notable? The Staden assembler and Celera assemblers were used for assembly of the human genome, parts of WGA continue to live nowadays in Canu for long read assembly, not notable? And so on for quite a number of assemblers which were on that list. I do agree that this article has its problems and should be revised, maybe by describing how the assembly problem evolved over time, leading to several approaches (overlap, k-mer, etc.) for different sequencing types which all behave differently (Sanger, 454, Illumina, PacBio & ONT) and using all kinds of tricks (ranging from massive parallelisation to data reduction like Bloom filters etc.). Simply culling the list is not helpful. BaChev (talk) 12:54, 19 July 2017 (UTC)
 * As I mentioned, the criterion for 'notability' was having their own articles in WP. I agree it's not a perfect measure, but I was basing this on the guidelines at WP:CSC, i.e., every entry has its own article, or no entry has its own article. I take your point that the list is now maybe too short, but the list of dozens that was there before was equally unhelpful. There is definitely scope for expanding the article with the description you suggested though, perhaps Sequence assembly or a history section would be suitable? Amkilpatrick (talk) 13:37, 19 July 2017 (UTC)
 * WP:CSC has a third category: "Short, complete lists of every item that is verifiably a member of the group." That list is probably still in the 32k boundary proposed. Whatever you do when pruning the list, be prepared to eventually have 'interesting' reactions. If you go back in time for this article you will see an edit war happened when someone tried to push his product in the then list by removing almost everything except that one product and a few fig leaf excuse entries. The current list was a way out of that. BaChev (talk) 16:22, 19 July 2017 (UTC)

Mapping assemblies
I was a bit confused about the section entitled De-novo vs. mapping assembly. It's my understanding that read mapping isn't an assembly technique, but rather something different. I think this article gives some indication of the difference between mapping and assembly. Mapping will only give you an idea of how your organism's genome sequence differs from that of another organism, and will miss any parts of the genome not present in the reference. I think it would be more appropriate to rename this section "De novo assembly vs read mapping", and use it to explain the difference between these two ways to compare the DNA of organisms, rather than referring to them as two methods of assembly.

Happy to discuss the terminology issue here, but if I don't hear back soon I will go ahead and make the edit. Maonao (talk) 08:57, 18 October 2017 (UTC)

Wiki Education assignment: Bioinformatics
— Assignment last updated by Tmmyn (talk) 05:15, 14 October 2022 (UTC)