Wikipedia:Wikipedia Signpost/2009-06-22/Vandalism


 * Loren Cobb (User:Aetheling) holds a Ph.D. in mathematical sociology and is a research professor in the Department of Mathematical and Statistical Sciences at the University of Colorado Denver.

This study has a narrow focus: to determine the distribution of the length of time that vandalism remains on the English-language Wikipedia. This distribution is also known as the survival function for vandalism. The two primary results from this study are: (a) the median time to correction is down to four minutes, and (b) some subtle forms of vandalism still persist for months and even years.

In the past there have been other statistical studies, both formal and informal, of how long vandalism remains in Wikipedia until it is corrected, but almost all of them express their results as a mean time to correction (i.e., as a simple arithmetic average of the observed times). I will show in this study that the distribution function for time to correction has such a fat tail that the mean time to correction is both mathematically and substantively meaningless. The median time to correction, on the other hand, conveys useful information.

Methods
A random sample of 100 articles from the English language edition of Wikipedia was obtained through the use of the random article link in navigation toolbar. For each article, the history log was used to examine each recorded change, starting from the most recent, going back until a clear instance of vandalism was found. Then the changes were scanned in reverse order, going forward until the vandalism was corrected.

For each such instance of vandalism, the elapsed time until correction was computed, in minutes. These are the fundamental data on which this report is based.

In addition, some notes were taken on the general nature of the vandalism. All data collection occurred on 2009-06-11.

Results

 * 1) Of the 100 articles, fully 75 had never been vandalized.
 * 2) Of the 25 articles that were vandalized at least once, the most recent such instance of vandalism was eventually corrected in 23 articles.
 * 3) In five (20%) of the vandalized articles, the most recent instance of vandalism was corrected in less than one minute. A further four instances were corrected in less than two minutes.
 * 4) The median time to correction was four minutes.
 * 5) Two articles were found to have suffered vandalism that was never corrected. One of these was a subtle act of vandalism that was committed on 2007-02-23, and still not detected by the date of the study, 2009-06-11.

Discussion
A histogram of times to correction is shown in the chart to the right. Note that the horizontal axis is depicted on a logarithmic scale, to accomodate its enormously long right-hand tail.

In this histogram there are evidently two separate processes at work. The bulk of the histogram follows a curve that declines as a power function of elapsed time: this is the process by which ordinary readers and editors of Wikipedia stumble across and correct instances of vandalism.

The first two bars on the left, however, are significantly higher than the curve would suggest. The difference between the actual height of the bars and the height predicted by the curve is accounted for by the independent activity of Wikipedia's Recent Change Patrol (RCP). Members of the RCP typically monitor the Recent Change Log for suspicious edits. The RCP is able to correct most blatant vandalism within seconds of occurrence.

Both of these vandalism-correction processes act in concert to produce a remarkable result: the median time to correction for vandalism in this study was found to be just four minutes. Similar (unpublished) studies performed by this author one and two years ago yielded median times to correction of five and six minutes, respectively. It seems apparent that Wikipedia is improving its already impressive rate of vandalism detection and correction.

Problems with Mean Time to Correction
The fact that the estimated curve for the survival function is exponential on a graph whose horizontal axis is logarithmic indicates that the probability density function itself follows a power law distribution, also known as a Pareto distribution, given by the formula


 * $$f(x)=ax^{-b-1}$$

If the parameter $$b$$ in the above formula is less than one — as it is in this case — then the mean of the distribution is infinite. The practical significance of this unusual situation is that any sample mean calculated from empirical data conveys absolutely no information whatsoever about the typical length of time that it takes for an instance of vandalism to be corrected.

The only useful alternative to a sample mean in this situation is the sample median, which is fully robust with respect to long-tailed distributions.

Depending upon what assumptions are made concerning the rate of activity of the RCP, the parameter $$b$$ for the Pareto distribution lies in a range between about 0.25 and 0.40. This range is comfortably below one, indicating that the tail of the distribution is huge and that sample means are completely and utterly useless for describing the data.

Observations on types of vandalism
About 84% of the vandalism that I observed in this random sample seemed to be just adolescent fooling around. Of the 16% that appeared more adult, half seemed to be adult humor or anger, and half seemed to come from people whose intent was to leave a permanent but nearly invisible mark upon Wikipedia. For example, the perpetrator will carefully change the spelling of an obscure name to an incorrect form, or change a location to something that still looks plausible at first glance. I imagine them coming back over and over again to the page that they altered, to see if that subtle little change is still there. Perhaps this impulse is roughly the same as the one which causes people to carve their initials into trees, or to scratch them on rocks.

Conclusions
The fact that 50% of all vandalism is being detected and reverted within an estimated four minutes of appearance should go a long way to allay fears about the susceptibility of English-language Wikipedia articles to malicious vandalism. On the other hand, the fact that an estimated 10% of all vandalism endures for months and even years indicates that some new tools and strategies are needed for rooting out the most subtle and persistent forms of vandalism.

Raw data
The elapsed times (in minutes) to correction for the instances of vandalism found in this study were as follows: { 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 4, 5, 8, 9, 19, 73, 213, 490, 672, 2442, 14176, 152996 }. In addition, two cases of vandalism had never been corrected (until discovered by the author).

de:Wikipedia:Fragen zur Wikipedia/Archiv/2009/Woche 31