Wikipedia:Reference desk/Archives/Mathematics/2021 April 9

= April 9 =

Applying Zipf's Law
Zipf's Law states (roughly) that the frequency of a word is inversely proportional to its rank in a frequency table, where the constant of proportionality depends on the language. So if $$\mathrm{word}_i$$ is the ith word in the frequency table, then the probability of $$\mathrm{word}_i$$ occurring is $$P(W=\mathrm{word}_i) \approx \alpha/i,$$ where α can be determined by averaging over a suitable sample of words. I wanted to use this model to get an idea of how many words of a foreign language I would need to learn to reach a certain level of fluency. So if I want to reach 99% fluency, meaning I know a word 99% of the time I encounter one, I need to find N so that
 * $$\sum_{i=1}^N \frac{\alpha}{i} = .99.$$

The problem is that the series on the left does not converge, so by choosing N large enough I can get to any "probability", even a probability greater than 1. I've looked at variations on Zipf's law, but they all seem to suffer from about the same issue: they predict the ratios between frequencies but aren't much use for predicting cumulative frequencies as i grows large. What I'm doing now to get some kind of answer is to sum the frequencies manually and compare with the total number of words in the dataset, basically just looking at the raw data without any effort to model it. Is there a version of Zipf's law that is more compatible with what I'm trying to use it for? --RDBury (talk) 20:39, 9 April 2021 (UTC)
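The divergence described above is easy to see numerically. A minimal sketch (the value of α here is an arbitrary illustration, not fitted to any language): since the partial sums of α/i equal α·H_N, and the harmonic numbers H_N ~ ln N grow without bound, any fixed α eventually pushes the modelled "cumulative probability" past 1.

```python
def harmonic(n):
    """N-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))

alpha = 0.1  # illustrative constant of proportionality (an assumption)

# Find the first N where the modelled cumulative "probability" exceeds 1,
# demonstrating that the Zipf sum cannot be capped at 0.99 for all alpha.
total, n = 0.0, 0
while total <= 1.0:
    n += 1
    total += alpha / n

print(n, total)  # the partial sum alpha * H_N has crossed 1
```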
 * This model assumes that not only do you know $$N$$ words, but that they are precisely the $$N$$ most common words. Additionally, the relevant vocabulary can change dramatically depending on the setting. For shopping on the local farmers' market you need other words than for discussing the risks of the use of AI apps for decision making. --Lambiam 21:33, 9 April 2021 (UTC)
 * A fair criticism, but I'm assuming that I'll learn the most common words first. This may not be exactly true but I think it's close enough for an estimate. --RDBury (talk) 18:35, 10 April 2021 (UTC)


 * A version that does not suffer from a divergent tail is
 * $$f_K(i) = \left\lfloor \frac{K}{i} \right\rfloor.$$
 * Asymptotically, $$\textstyle{\sum_{i=1}^\infty f_K(i) = \sum_{i=1}^K f_K(i) \sim K\ln K}.$$
 * --Lambiam 21:06, 9 April 2021 (UTC)
 * The size of the corpus of English lemmas in Webster's Third New International Dictionary and in Wiktionary is about 500,000 (see List of dictionaries by number of words). That is the value of the above sum for $$K=45918.$$ --Lambiam 21:27, 9 April 2021 (UTC)
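The claimed correspondence between $$K=45918$$ and a corpus of about 500,000 tokens, as well as the $$K\ln K$$ asymptotic, can be checked directly. A short sketch (the constant is the value quoted in the comment above):

```python
import math

K = 45918  # value of K quoted above

# Total token count predicted by the floor model: sum of floor(K/i).
total = sum(K // i for i in range(1, K + 1))

print(total)            # close to 500,000
print(K * math.log(K))  # the asymptotic estimate K ln K
```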
 * I'm trying to find a model that's independent of the corpus size since, presumably, the less frequent the word, the less accurately a finite sample matches the actual probability. Plus, when you get down to the words that occur just once in a given corpus, you start to get misspellings, people's names, made-up words, and other one-offs that shouldn't really be counted as words of the given language. FWIW, I'm using the word lists at wikt:User:Matthias Buchmeier. --RDBury (talk) 18:35, 10 April 2021 (UTC)
 * The fact is that the constant in Zipf's law, in its original form, depends on the corpus size. This is also the case for modestly sized corpora, not containing maladroit malapropisms, mischievious misspellings, nonsensical nonces, or sequipedalian supercalifragilisticexpialidociosities. --Lambiam 20:16, 10 April 2021 (UTC)


 * If the size of the word list (the number of "types") equals $$K,$$ then an estimate of the corpus size (the number of "tokens") from which that list was collected is $$S=K\ln K.$$ You want to find $$N$$ such that $$\textstyle{\sum_{i=1}^N f_K(i)=pS},$$ where $$p$$ is the desired fluency level expressed as a fraction of unity. The lhs can be approximated by $$K\ln N.$$ The equation $$K\ln N=pK\ln K$$ is solved for $$N$$ by $$N=K^p.$$ For example, some sources give the number of words in War and Peace as 587,287. I don't know if these are words in the original Russian or a translation, but let us take $$S=500000$$ as above as a ballpark estimate, corresponding to $$K=45918$$ or thereabouts. Rounded to a whole number, $$45918^{0.99}=41244.$$ With the floor version above, the 0.99 level is reached for $$N=40918.$$ For comparison, the 0.90 level is already reached for $$N=11416,$$ so this is a clear example of diminishing returns. (It is not very encouraging that to know 99% of the words (tokens) in War and Peace you need to learn 89% of the types. And as you move on to Pushkin, there is probably a bunch of words that Tolstoy never used, so you don't reach the 99% fluency level for Pushkin.) --Lambiam 22:02, 10 April 2021 (UTC)
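The calculation above can be reproduced with a short script, comparing the exact floor-model cutoff with the closed-form shortcut $$N=K^p.$$ This is a sketch: the only input is the $$K$$ quoted above, and the exact totals may differ slightly from the rounded $$S=500000$$ used in the comment.

```python
K = 45918
# Corpus size S under the floor model: sum of floor(K/i).
total = sum(K // i for i in range(1, K + 1))

def level(p):
    """Smallest N whose cumulative floor-model frequency reaches p * total."""
    target = p * total
    running = 0
    for n in range(1, K + 1):
        running += K // n
        if running >= target:
            return n
    return K

n99 = level(0.99)
n90 = level(0.90)
print(n99, round(K ** 0.99))  # exact cutoff vs. the K^p approximation
print(n90, round(K ** 0.90))
```

The 0.99 figure tracks $$K^{0.99}$$ closely, while at the 0.90 level the approximation is rougher, since the step from $$\sum f_K(i)$$ to $$K\ln N$$ drops lower-order terms that matter more for smaller N.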