Talk:N-gram

How does one discover which publications make up the Google n-gram results?? Download the whole corpus?? 😳

Wiki Education Foundation-supported course assignment
This article was the subject of a Wiki Education Foundation-supported course assignment, between 24 August 2020 and 9 December 2020. Further details are available on the course page. Student editor(s): Izabellahernandez.

Above undated message substituted from Template:Dashboard.wikiedu.org assignment by PrimeBOT (talk) 04:46, 17 January 2022 (UTC)

Application Question
Can you use an N-gram to analyze the frequency of words (3-?? letters) in the conversations of developing children (grouped by age) and recorded during play/work/dinner activities? What is the smallest word sample size that can be analyzed (good analyses seem to suggest up to 40,000 words; I wonder what a lower, yet valid, number would be)? Cheers, Doncorto 17:45, 24 May 2010 (UTC) —Preceding unsigned comment added by Doncorto (talk • contribs)
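As a rough illustration of the kind of analysis being asked about, here is a minimal Python sketch that counts word (unigram) frequencies in a transcript, keeping only words of at least a given length; the transcript, the length cutoff, and the tokenization are all placeholders, not a recommended methodology:

```python
from collections import Counter

def word_freqs_by_length(transcript, min_len=3):
    # Count word frequencies, keeping only words of at least min_len
    # letters (the "3-??" range above; the cutoff is a placeholder).
    words = [w.strip(".,!?").lower() for w in transcript.split()]
    return Counter(w for w in words if len(w) >= min_len)

# Hypothetical snippet of a child's recorded speech:
print(word_freqs_by_length("the doggy runs and the doggy jumps"))
```

Grouping such counters by age band would then be a matter of bookkeeping; the statistical question of a minimum valid sample size is separate and not answered by code like this.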

Merge Trigram and Bigram to N-Gram
They're just special cases. The bigram and trigram articles should be deleted, and their titles redirected to n-gram. 67.180.161.52 06:58, 10 October 2006 (UTC)

I see the point, but I vote no. There is so much literature (references) where 'bigram' or 'trigram' is the distinguishing feature that these will always be important topics in their own right (and there is some indication that the bigram may be the 'fundamental unit' of neuronal computation).

So people will likely want to go to bigram as a topic. And it does have a special 'place', just as binary is a special case of all bases and so deserves special treatment. quota 21:33, 10 October 2006 (UTC)

I agree with quota, although a more uniform treatment of bigram, trigram and n-gram would be nice... Skaakt 13:21, 19 October 2006 (UTC)

I agree as well; small entries that explain the equivalence, followed by a link to the general page, would be very helpful. User:Tdunning 9:41 PST, November 13, 2006

I think they should be merged. If there were one good page that explained what goes into picking the N for an N-gram, then it would be redundant to have the other pages. Further, n-grams are a concept, whether bigram, trigram, etc., where the value of n is not the most salient feature. - DustinSmith

Unfortunately that is not so. N- (or n-) grams are being used as 'trademarks' by some 'scientific' investigators. At best they are a useful abbreviation. But the meaning of 'bigram' and 'trigram' can be guessed from the word itself, as a back-formation from 'monogram' [the mono-, there, referring to the object, not the parts].

And of course the most salient feature of bigrams is that they have only two parts. That's why they are interesting ... quota

Umm, no to the merge. — Tuvok[T@lk/Improve me] 03:18, 2 March 2007 (UTC)

I also vote no. Most discussions of n-grams explicitly break out the terms bigram and trigram for special treatment. Anything of a higher order is simply labeled an n-gram. Dalebrearcliffe 18:06, 24 March 2007 (UTC)

I was hoping for some information similar to the Bigram page, specifically related to letter frequencies; that page linked me to this undecipherable N-gram page. Woodlore (talk) 00:23, 16 January 2009 (UTC)

G-Score
Can someone add to this article, or point me to where I can get more info on, the g-score referenced? The link does not point to a page.

Bayesian Analysis
Can someone point to a paper or article on "It is also possible to take a more principled approach to the statistics of n-grams, modeling similarity as the likelihood that two strings came from the same source directly in terms of a problem in Bayesian inference." —Preceding unsigned comment added by 203.161.97.253 (talk) 01:46, 28 April 2008 (UTC)

Too technical / insufficient context
I read this whole article and I don't quite understand the general context of this term. I understand the individual examples, but I'd like to see more practical applications, especially near the introduction and written in simpler terms with less jargon. TWCarlson (talk) 13:28, 10 September 2008 (UTC)

Yeah... none of this makes any fucking sense.

Wolfram Alpha n-grams.
Since Google is on this page, I was going to add that you can use WolframAlpha to calculate n-grams of a string. If no objections, I'll add it. --Mrebus (talk) 07:20, 15 June 2009 (UTC)

An n-gram is not a subsequence
The first sentence of the article says that "an n-gram is a subsequence of n items from a given sequence". So given the sequence "the pig is happy", by the definition at subsequence, "the happy" is a subsequence and thus also a 2-gram.

My understanding is that an n-gram must comprise consecutive items from a sequence e.g. a Substring. By this definition, "the happy" is not a 2-gram.

I propose that the article avoid using "subsequence" and instead use a term that denotes a sequence of consecutive elements from a larger sequence.

worch (talk) 23:32, 7 August 2010 (UTC)


 * Let's use the term "contiguous subsequence". The term "substring" implies that a subsequence consists of symbols, which isn't always the case with n-grams, as they may consist of larger units, e.g. whole words. -- X7q (talk) 06:13, 9 August 2010 (UTC)


 * Also, there actually are things like distant (or skip) n-grams which aren't contiguous subsequences. But they probably should be mentioned in a separate section of the article, not in the introduction. -- X7q (talk) 06:13, 9 August 2010 (UTC)
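To make the contiguity point concrete, here is a minimal Python sketch of contiguous n-gram extraction (a sliding window over the sequence, not arbitrary subsequences):

```python
def ngrams(seq, n):
    # Contiguous n-grams: every window of n consecutive items.
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

print(ngrams("the pig is happy".split(), 2))
# → [('the', 'pig'), ('pig', 'is'), ('is', 'happy')]
```

Note that ('the', 'happy') never appears: the window only yields consecutive items, which is exactly the "contiguous subsequence" reading proposed above.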

Typo in conditional probability?
I think it should read

"...predicts $$x_{i}$$ based on $$x_{i-1}, \dots, x_{i-n}$$. In Probability terms, this is nothing but $$P(x_{i} | x_{i-1}, \dots, x_{i-n})$$"

instead of

"...predicts $$x_{i}$$ based on $$x_{i}, x_{i-1}, \dots, x_{i-n}$$. In Probability terms, this is nothing but $$P(x_{i} | x_{i}, x_{i-1}, \dots, x_{i-n})$$."

because if the event $$x_i$$ is included in the condition, then its conditional probability equals $$1$$.
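For reference, the corrected conditioning can be checked numerically with a small maximum-likelihood estimate from bigram counts; this is only a sketch on a toy token sequence, not the article's method:

```python
from collections import Counter

def cond_prob(tokens, prev, word):
    # MLE estimate of P(word | prev) = count(prev, word) / count(prev),
    # counting prev only at positions that actually have a successor.
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])
    return bigrams[(prev, word)] / contexts[prev]

tokens = "a b a b a c".split()
print(cond_prob(tokens, "a", "b"))  # → 0.666… : 'a' is followed by 'b' twice out of three times
```

The estimate conditions on the preceding token only, matching $$P(x_i \mid x_{i-1})$$ for the bigram case; as the comment above notes, conditioning on $$x_i$$ itself would trivially give $$1$$.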

n-grams and n-gram models
n-grams themselves have a variety of applications, including collocation analysis, language identification, approximate string matching, etc. n-gram Markov models are also an important class of language model -- but are not the same thing as n-grams. The current article mixes them confusingly. I wonder if we should split them into two articles, and incorporate more material from Markov model (which currently doesn't even link to n-gram!) etc. --Macrakis (talk) 21:49, 27 November 2011 (UTC)
 * Good suggestion. Most of mentions to natural language applications and smoothing techniques in this article should be moved to an independent article about n-gram language models. A (hopefully, high-level) summary of the definition of n-gram language models and applications would be nice to have here, though.  --Whym (talk) 08:45, 28 November 2011 (UTC)
 * I agree, these subjects should definitely be split. I recently created the article n-gram language model, which was itself mostly a split off from content in the article Language model. One option would be to expand the scope of n-gram language model to cover any sort of n-grams (rather than only word sequences). Colin M (talk) 18:03, 10 March 2023 (UTC)

Removed disambiguation link for kernel
The following notation was part of the header in this N-gram talk page. I decided to replace kernel (mathematics) with kernel trick in the N-gram article, as that seemed the appropriate choice based on the context, i.e. kernel usage in ML for SVMs. I then removed the above note from this talk page. --FeralOink (talk) 21:56, 27 April 2012 (UTC)

K-mer?
K-mer seems to be the same as n-gram. Should we put those two together? dennis97519 (talk) 08:30, 29 May 2015 (UTC)


 * No. Clearly two different things. Nuvigil (talk) 16:09, 15 May 2017 (UTC)


 * @dennis97519, @Nuvigil: I added k-mer language to the article before realizing that there was this discussion here. Please accept my apologies and review my edits.  64.132.59.226 (talk) 17:48, 11 April 2018 (UTC)

v-gram, q-gram
I don't think I'm knowledgeable enough to do it, but someone should mention v-grams, aka variable-length grams, and q-grams (I think q-gram may be a synonym for n-gram?). — Preceding unsigned comment added by 72.182.34.126 (talk) 06:00, 26 August 2015 (UTC)

Contradiction in Skip-gram section
The article defines a k-skip-n-gram as "a length-n subsequence where the components occur at distance at most k from each other". Then it states that the in, rain Spain, in falls, Spain mainly, falls on, mainly the, and on plain are 1-skip-2-grams in the text the rain in Spain falls mainly on the plain. But I would say that those words occur at a distance two from each other, so at a distance more than one from each other, which contradicts the previously mentioned definition. So, is the distance between those words considered to be one and not two (there is only one word in between them), is the example incorrect, or is the definition incorrect? —Kri (talk) 00:13, 17 February 2017 (UTC)
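Under the common reading where k bounds the number of tokens *skipped* (so a 1-skip pair may have up to one word between its members, and adjacent pairs are usually included as well), the example pairs can be enumerated like this; this is a sketch of one interpretation, not a settled resolution of the definitional question above:

```python
def skip_bigrams(tokens, k):
    # Pairs (tokens[i], tokens[j]) with at most k tokens skipped
    # between them; k = 0 gives ordinary contiguous bigrams.
    return [(tokens[i], tokens[j])
            for i in range(len(tokens))
            for j in range(i + 1, min(i + 2 + k, len(tokens)))]

print(skip_bigrams("the rain in Spain".split(), 1))
# → [('the', 'rain'), ('the', 'in'), ('rain', 'in'), ('rain', 'Spain'), ('in', 'Spain')]
```

On this reading "distance at most k" in the article would mean "at most k intervening items", which matches the listed examples; if distance instead means positional difference, the definition and examples do indeed conflict.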

remove redirect from skipgram
I suggest we remove the redirect from skipgram since that is a distinct concept. DMH43 (talk) 15:32, 20 December 2023 (UTC)