Talk:Bigram

(I suspect that there is an error or an ambiguity in the information given below. If TH occurs 50 times in 200 letters of words, then half the text consists only of the bigram TH. This is incredible. Further, if I add up the occurrences of the 5 most frequent bigrams, they come to 200. 200 bigram occurrences would involve 400 letters. Do I misunderstand the definition of bigram occurrence frequency? If so, what led me to this misunderstanding? Ramani)


 * I didn't double-check the numbers, but it seems to me you're missing something very important: 1 bigram consists of 2 letters, but 200 bigram occurrences do not necessarily mean 400 letters. Look at the word 'banana' -- The bigrams in this word are 'ba', 'an', 'na', 'an', 'na'. That's 5 bigrams, but only 6 letters.


 * Also, please discuss articles in their respective talk pages, not in the article itself. -- AWendt 11:12, 19 September 2007 (UTC)
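
The 'banana' example above can be checked with a few lines of Python. This is only a minimal sketch: the helper name `letter_bigrams` is made up for illustration, and it assumes bigrams are taken within a single word, as the article's table does.

```python
def letter_bigrams(word: str) -> list[str]:
    """Return the overlapping two-letter sequences in a single word."""
    # Each bigram starts at position i and shares a letter with the next one,
    # so a word of n letters yields n - 1 bigrams.
    return [word[i:i + 2] for i in range(len(word) - 1)]


banana = letter_bigrams("banana")
print(banana)       # ['ba', 'an', 'na', 'an', 'na']
print(len(banana))  # 5 bigrams from 6 letters
```

Because neighbouring bigrams overlap, 200 bigram occurrences need roughly 200 letters, not 400.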

-

Thanks. You are right in saying that the number of bigrams in a sequence of n letters is (n-1). But that does not answer the question of how the numbers given in the article are to be interpreted. The article says:

"The most common letter bigrams in the English language are listed below, with the expected number of occurrences per 200 letters. In the analysis here, the bigrams are not permitted to span across consecutive words.

TH 50     AT 25       ST 20

ER 40     EN 25       IO 18

ON 39     ES 25       LE 18

AN 38     OF 25       IS 17

... ... ..."

I would expect the sum of the numbers shown above not to exceed 199, but the sum is larger. I would appreciate your double-checking the text. Ramani 12 Nov 07  —Preceding unsigned comment added by 122.167.157.78 (talk) 17:25, 12 November 2007 (UTC)


 * Why should the sum not exceed 199? The word BATH, for example, has both a TH and an AT in three letters. quota 21:13, 13 November 2007 (UTC)

There is definitely an error: the number of bigrams in n letters is at most n-1, but the counts in the table add up to much more than 199. 200 is probably a typo for 2000. —Preceding unsigned comment added by 128.97.19.56 (talk) 21:44, 31 March 2008 (UTC)


 * Indeed. Here's a reference: .  This quotes TH as occurring 5532 times in 40,000 words (which would be about 200,000 letters).  That's ~55 times in 2000 letters, which roughly matches the table above.


 * However, the table in the reference has HE and IN between TH and ER -- but the analysis over the larger sample should be better. I'll modify the article accordingly...  quota (talk) 08:56, 1 April 2008 (UTC)
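
The arithmetic in this thread can be sketched as follows. The counts are the twelve values quoted from the article above (the quoted table is truncated, so the real total is larger), and the 200,000-letter figure is the rough estimate given in the reply, not a measured value.

```python
# The twelve counts quoted from the article earlier in this thread.
quoted_counts = {
    "TH": 50, "AT": 25, "ST": 20,
    "ER": 40, "EN": 25, "IO": 18,
    "ON": 39, "ES": 25, "LE": 18,
    "AN": 38, "OF": 25, "IS": 17,
}

letters = 200
max_bigrams = letters - 1              # n letters hold at most n - 1 bigrams
total = sum(quoted_counts.values())    # 340

print(total, "quoted occurrences vs. at most", max_bigrams)
print("consistent with 200 letters:", total <= max_bigrams)   # False
print("consistent with 2000 letters:", total <= 2000 - 1)     # True

# Rescale the reference figure: 5532 occurrences of TH in about 200,000 letters
# (assumption from the reply above: 40,000 words at about 5 letters each).
th_per_2000 = 5532 * 2000 / 200_000
print(f"TH per 2000 letters: {th_per_2000:.2f}")              # 55.32
```

So the quoted counts cannot fit in 200 letters, while a per-2000-letters reading agrees with the reference to within about 10% for TH.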

Bigram frequencies wrong
I suspect there is an error in the bigram frequencies listed in the table. The expected frequency of a random bigram is about 0.15% (1 in 26 × 26 = 676 possible letter pairs), yet some of the most frequent bigrams in the table are already below that. — Preceding unsigned comment added by 129.31.242.3 (talk) 12:51, 14 December 2011 (UTC)
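
For reference, the 0.15% figure follows if one assumes a 26-letter alphabet with every ordered pair of letters equally likely. That is only a baseline for comparison, not a model of English. A minimal sketch:

```python
alphabet_size = 26                       # assumption: plain A-Z, case ignored
possible_bigrams = alphabet_size ** 2    # 676 ordered letter pairs
uniform_share = 1 / possible_bigrams     # share of each pair if all were equally likely

print(possible_bigrams)                  # 676
print(f"{uniform_share:.2%}")            # 0.15%
```

Any bigram that really is among the most common in English should come out well above this baseline.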

First-sentence ambiguity
>A bigram or digram is every sequence of two adjacent elements in a string of tokens ...

That wording is confusingly ambiguous; the "every" suggests that a bigram/digram is the set of all such sequences in the string.

Wouldn't "A bigram or digram is a sequence of two adjacent elements in a string of tokens ..." be a lot less ambiguous? — Preceding unsigned comment added by 128.229.4.2 (talk) 19:40, 7 November 2012 (UTC)


 * I agree, so I modified the first sentence of this article as you suggested. Please continue to make further improvements. --DavidCary (talk) 16:20, 7 March 2016 (UTC)

Does not meet Wikipedia standards
I just read the intro section and have no idea what it says. Please read Wikipedia guidelines for the lead section. Hypertexting jargon is not a license to use it, particularly in the lead section, which is supposed to be self-contained and designed especially for the lay reader. (Lazy hypertexting is always sloppy writing!) I also suggest giving some common examples. Also be aware that a list of true statements is not an explanation, nor even a description.

--2602:306:CFCE:1EE0:C1DD:ED7B:6321:F50D (talk) 14:26, 22 July 2017 (UTC) Doug Bashford

Bigram frequency on a small corpus
The section "Bigram frequency in the English language" shows a table with results "in a small English corpus" (40,000 words), and later references a site that has performed the same analysis over nearly a trillion words. Why not just show that analysis instead? The one shown here seems to be very inaccurate (e.g. "ch" isn't listed, despite having a frequency of 0.60%). —Cousteau (talk) 19:46, 25 May 2020 (UTC)