Talk:K-mer

Spelling
I'm guessing that you meant to say "constant" rather than "consonant". --Rshphx1 (talk) 05:06, 14 March 2013 (UTC)

Is it an alternate term for n-gram?
K-mer and n-gram both seem to refer to the same concept. I think we should merge K-mer with n-gram.

Quote from tweet: "k-mer? q-gram? n-gram? I hate it when different researchers re-brand existing concepts using different names for no reason whatsoever." https://twitter.com/klmr/status/6081199634

And from a Q&A site: "Both, k-mer and Oligos have the same meaning in the context of Bioinformatics. Another synonym you will find is n-gram.

Oligonucleotides originates from the wet lab site of view whereas k-mer/n-mer/k-tuples/n-grams are coined by computational sequence analysis (such as cryptography or pattern matching). You will find all these words heavily used in NGS literature. The frequency depends on whether the author has a stronger background (or focus) on biology or informatics." https://www.biostars.org/p/10772/#10777

dennis97519 (talk) 14:07, 28 May 2015 (UTC)
 * Against. Please see my comment User_talk:Kku.


 * n-gram is the more comprehensive term and is application-neutral. People talk about k-mers (in analogy to poly-mer) when referring to the particular n-grams that you get when you apply the ideas from computational linguistics to bioinformatics/molecular biology. In particular, k-mers can also refer to very short concrete DNA sequences that are actually used for genetic engineering. That said, I would fervently vote against a merger. The theoretical basis for chopping up sequences into overlapping motifs can (or should) be found in the n-gram article; k-mer should really be left for the biological application. k-mer should refer to n-gram as the more general concept, and n-gram should contain a main_article link to the k-mer application of the concept. -- Kku 09:18, 29 May 2015 (UTC)

I've added a "See also" link (although a link in the introduction could also work): it seems logical for k-mer to contain information specific to the domain in an article separate from n-gram. I would also propose to move the Pseudocode section to n-gram or to just get rid of it. RFST (talk) 08:17, 18 June 2015 (UTC)


 * I've added explanatory hatnotes; I think this nomination can now be closed. fgnievinski (talk) 19:43, 9 September 2015 (UTC)

n-grams aren't really a generalization of k-mers, and the uses are different: in computational linguistics the emphasis is on the probability of occurrence of n-grams, whereas in DNA-sequence reconstruction the overlap between different k-mers is used to match them to one another. I'd prefer to see n-gram mentioned in the "See also" section (as it was previously), rather than at the start as "a broader coverage related to this topic". Sminthopsis84 (talk) 20:26, 9 September 2015 (UTC)

Why can't small k-mers detect repetition in DNA?
Jrotten9 added a lot of sections, and most of them are useful. However, the section "lower k-mer sizes" states:
 * Smaller k-mers also have the problem of not being able to resolve areas in the DNA where small microsatellites or repeats occur. This is because smaller k-mers will tend to sit entirely within the repeat region and is therefore hard to determine the amount of repetition that has actually taken place.
 * Eg.  For the subsequence ATGTGTGTGTGTGTACG, the amount of repetitions of TG will be lost if a k-mer size less than 16 is chosen. This is because most of the k-mers will sit in the repeated region and may just be discarded as repeats of the same k-mer instead of referring the amount of repeats.

It seems plausible that smaller k-mers tend to sit entirely within the repeat region, but for the example ATGTGTGTGTGTGTACG you specifically said that the number of repetitions of TG will be lost if the k-mer size is less than 16. Why?

I tried creating all k-mers with k=15 (that is, edges of 15 nucleotides) and then derived the 14-long nodes from those edges. In the resulting graph I observed no ambiguity: for k=15 there was only one path.

Note that in a de Bruijn graph not every pair of nodes can be connected: two nodes of length k-1 are joined only if a k-mer joining them (an edge of k symbols) has been seen in the short read(s). In other words, two nodes (length = k-1) can only connect if the last k-2 symbols of the first node equal the first k-2 symbols of the second node, AND the resulting k-mer has been seen in the short reads.

Taking the note above into account, the graph created for k=15 (edges of length 15 and nodes of length 14) from the short read ATGTGTGTGTGTGTACGATGTGTGTGTGTGTACG has only one path and no ambiguity, so no repeat is lost.
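
The check described above can be sketched in a few lines of Python. This is only a minimal illustration, not the method of any particular assembler: the function names `de_bruijn` and `branching_nodes` are made up for this comment, the input is the 17-base subsequence from the article's example, and "ambiguity" is taken to mean a node with more than one distinct successor or predecessor.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are observed k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each observed k-mer joins its (k-1)-prefix to its (k-1)-suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Return nodes with more than one distinct successor or predecessor."""
    preds = defaultdict(set)
    for node, succs in graph.items():
        for succ in succs:
            preds[succ].add(node)
    out_branch = {n for n, s in graph.items() if len(s) > 1}
    in_branch = {n for n, p in preds.items() if len(p) > 1}
    return out_branch | in_branch


seq = "ATGTGTGTGTGTGTACG"  # subsequence quoted from the article
print(branching_nodes(de_bruijn([seq], 15)))  # set(): a single unbranched path
print(branching_nodes(de_bruijn([seq], 4)))   # {'TGT'}: the repeat collapses
```

With k=15 the three observed k-mers chain four distinct 14-long nodes into one path, whereas with k=4 the node TGT is both entered and left in two different ways, which is the kind of collapse the quoted passage seems to describe.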

If I am wrong, please clarify your example; otherwise, please correct the article. Sinamomken (talk) 23:23, 4 November 2016 (UTC)

Italicise 'k' or not?
In the title "k" (of "k-mer") is italicised. But in the body it is not (also of "k-mer"). What should it be? --Mortense (talk) 17:56, 15 November 2017 (UTC)