Wikipedia:Reference desk/Archives/Mathematics/2018 August 11

= August 11 =

How many samples you need to describe a language?
That's a question about linguistics, and besides that, it's also relevant for formal languages. The language or computing ref desks would be valid choice. But the tool is statistics, and I expect a mathematical answer, so here we go.

How broad must be a corpus to describe accurately a language? At least, you could aspire to capture above 99.x% of words and grammar structures. If you limit yourself to describing the language of newspaper articles (just to draw some limitation), how many articles would you need to analyze? How many articles that do not contribute with any new words/new rules do you need to get until you can safely stop? --Doroletho (talk) 13:29, 11 August 2018 (UTC)
 * First of all, formal languages are very different than natural languages so I'm not convinced that an answer for one will be entirely relevant to the other. For natural languages the problem seems similar to the Mark and recapture problem, to which the Lincoln index can be applied in some circumstances. One issue is that word frequency in a natural language is not nearly uniform but follows something akin to Zipf's Law; the articles point to research which allows for this but I didn't see anything freely available. Another problem is the number of words different languages have can vary tremendously, with the numbers of words in some languages being virtually infinite. The article on Inuit languages gives the example 'Tusaatsiarunnanngittualuujunga' which translates to "I cannot hear very well." Yet another issue is that natural languages are constantly changing so what may have been a sufficient sample 5 years ago might miss a significant portion of the language as used today, at least if you're going for 99%+ coverage.
 * The formal language case is complicated by the fact that there are many different classes (see Chomsky hierarchy), and conclusions about one class won't always work for another class. So a formal language is a set of strings defined by a rule undetermined complexity, and you're trying to guess the rule based on a subset, hindered by the fact that you can't say with certainty that a string isn't in the language just because it isn't in the sample. For example if the sample is {"ababab", "aababb", "aaabbb", "aabbab"}, is the language the set of all strings of 'a' and 'b' of length 6, all such strings with start with 'a' and end with 'b', all strings with 3 'a's and 3 'b's, or strings of length 6 in the Dyck language with brackets replaced by 'a' and 'b'? Any answer could be correct or "cabcab" might be a valid word as well in which case none of them are. In addition, the strings in a formal language can be any length and there are usually an infinite number of them. All of this points to the conclusion that statistical analysis alone is insufficient to determine the rule which generates a formal language. As another example, consider the language defined as the set of substrings of the output of some given Pseudorandom number generator. The number of strings in the language is actually quite small compared to the number of possible (i.e. random) strings, and the algorithm to generate the strings is probably relatively simple, but the idea is that statistical tests should be unable to distinguish between the pseudorandom and truly random outputs. Fortunately, the rule which defines a formal languages is usually known from the start, and you generally don't have to guess what they are based on samples.
 * So all this is a very long way of saying that I doubt there is any single answer which would work for any language, whether natural or formal. Probably the best you could do with natural language to use 90% of your sample to generate a word list, then use the remaining 10% as a test to see how complete it is. For formal languages you may not be able to find the specific rule easily, but perhaps there is less specific information you could glean. For example with a large enough sample size you could determine with reasonable accuracy the number of words of length 1, 2, .., 10, and then from that extrapolate an estimate for the number of words of length n. --RDBury (talk) 04:54, 13 August 2018 (UTC)