User:WereSpielChequers/typo study

This is a response to a 2011 study which hypothesised that "percentage of misspellings on Wikipedia articles through time, relative to total content, remains steady."

My own experience as a Wikipedian who has been fixing typos for more than four years is that typos are getting harder to find. WP:AWB users fix most obvious typos so efficiently that I no longer bother to look for conventional typos, though I still fix those I come across. I've now largely moved on to searching for typos that conventional spellcheckers won't pick up; for example, I abolished the Olympic sports of synchronised ventriloquism and discuss throwing. I still patrol some words such as thrity, because it is an Indian name as well as a typo for thirty. When I first checked "posses", over 80% of its uses in Wikipedia were typos for possess, possesses or poses. Pretty much all the preforming done by popstars is now performing, and while I haven't yet eliminated staring from Bollywood I'm making steady progress. So it surprised me to hear that someone had shown that we have an increasing number of typos.

Having read the methodology I would make the following criticisms.


 * 1) The study picked 2400 random articles. Testing a number of random articles is a perfectly legitimate approach, but some Wikipedia articles are read and/or edited dramatically more often than others. If our crowdsourcing model works, we should see a pattern: the less often an article is looked at, the less likely it is that an error will be spotted and fixed. More pertinently, if the errors are disproportionately in the articles that are rarely looked at, then the error rate perceived by Wikipedia readers will be lower than the error rate calculated by sampling random articles.
 * Recommendation: Test the theory that errors are disproportionately on less frequently read articles and, if so, weight future studies to give more emphasis to popular articles - the first 100 articles read each hour of a random day would probably do the trick.
 * 1) Though the number of Wikipedia articles has been growing over time, so has their average length. This is partly because existing articles tend to grow as people add information, and partly because standards have risen at New Pages and one-line stubs are unlikely to get past the new page patrol. Longer articles contain more words and so more opportunities for error, which means that measuring errors per article will tend to understate the effect of Wikipedia's quality improvement processes compared with measuring the ratio of incorrectly to correctly spelled words.
 * Recommendation: Measure typos in terms of errors per thousand words not errors per article (the study may have done this or something like it).
 * 1) The study took a random version of those articles rather than the current version. We have lots of vandalism, most of which gets automatically reverted within moments. Taking a random version of an article gives you as much chance of picking the vandalised version that persisted for 40 seconds as it does the reverted version that may have persisted for 40 days. More importantly, we have lots of editors who work on improving articles, including many who specialise in hunting down typos. Looking at a random version omits the very crowdsourcing and article improvement whose existence the study purports to disprove.
 * Recommendation: To measure the quality that Wikipedia has currently achieved, look at the current version of articles, not a random earlier version that has since been corrected.
 * 1) The study defined misspellings as words that are not in a list of valid English words built from Wiktionary and other sources. It screened out numbers, references and numbers ending in a letter, then compared the remaining words in each article against that list, treating the non-matches as misspellings. That means it wouldn't pick up most of the misspellings that I look for, as they are valid but incorrect words, but it would treat as misspellings any words that were not on that list, such as the names of people called Thrity. As anyone who has skimmed through our articles on towns outside the English-speaking world, Bollywood movies or anime would anticipate, this would lead to a large proportion of false positives. If you also consider the song titles, stage names and band names used by popular entertainers in the hip hop, punk and heavy metal genres, then this method will experience a very high rate of false positives - this may well account for some of the articles that are over 50% "misspellings" according to this study. The researcher has conceded that the "false positive rate is still high" but has not yet published that rate, nor the methodology for arriving at it.
 * Recommendation: False positives could be somewhat reduced by removing intrawiki links and quotations, especially where the sic template has been used, but better still the false positive rate needs to be measured and the results weighted by that measure. As a working assumption I would anticipate that this method would have a false positive rate of over 90% if run against the current version of some random Wikipedia articles.
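
The recommendations above could be sketched in code. This is only an illustration in Python: the function names and the sample figures are invented for the example and are not taken from the study, and it assumes that a false positive rate has already been estimated, for instance by hand-checking a sample of flagged words.

```python
def errors_per_thousand_words(flagged_words, total_words):
    """Flagged words per 1,000 words, so long and short articles are comparable."""
    if total_words == 0:
        return 0.0
    return 1000.0 * flagged_words / total_words


def adjust_for_false_positives(flagged_words, false_positive_rate):
    """Estimate genuine misspellings, given the share of flags that were wrong."""
    if not 0.0 <= false_positive_rate <= 1.0:
        raise ValueError("false_positive_rate must be between 0 and 1")
    return flagged_words * (1.0 - false_positive_rate)


def unweighted_rate(articles):
    """Mean error rate over articles, each article counting equally."""
    if not articles:
        return 0.0
    rates = [errors_per_thousand_words(f, w) for _, f, w in articles]
    return sum(rates) / len(rates)


def reader_weighted_rate(articles):
    """Mean error rate weighted by page views, i.e. what readers actually see."""
    total_views = sum(v for v, _, _ in articles)
    if total_views == 0:
        return 0.0
    weighted = sum(v * errors_per_thousand_words(f, w) for v, f, w in articles)
    return weighted / total_views


# Hypothetical sample: (page views, flagged words, total words) per article.
sample = [
    (10000, 1, 2000),  # a popular, well-watched article
    (10, 5, 500),      # a rarely read article
]

print(unweighted_rate(sample))       # every article counts the same
print(reader_weighted_rate(sample))  # dominated by the popular article
print(adjust_for_false_positives(50, 0.9))  # 50 flags at a 90% false positive rate
```

With these made-up figures the reader-weighted rate comes out far lower than the unweighted one, which is the effect the first criticism predicts if errors cluster in rarely read articles.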

An alternative way to measure Wikipedia's quality with regard to typos is to take a random typo, search for it on Wikipedia, see how often each occurrence really is a typo, and count how many needed correcting. The typo team lists quite a few examples at Typo_Team/works_completed. Note that most searches found that particular errors occurred at frequencies that were closer to 1 in a million articles than 1 in a hundred thousand. But they don't list how many are new errors.


 * As a reality check on 30th Dec 2011 I took a random typo from a new page, Assitant. I then searched for it in mainspace and found a total of 50 occurrences, fixed enough to confirm that it was occurring as a typo of Assistant, and filed a suggestion to have it added to AutoWikiBrowser (AWB), only to discover that it was already in AWB. One of the AWB users then did a run targeting that error and gave an analysis of the last 20 they fixed: "the average lag was 117.5 days (min 17, max 264)". I must admit that surprised me; I didn't realise that the AWB backlog was so long that a typo could persist for nearly nine months in a Wikipedia article. Perhaps we need to recruit more AWB users. But none of those twenty had been in Wikipedia for longer than nine months, which rather indicates that, at least for the sorts of typos that AWB can pick up, we do eventually fix them.