Wikipedia:Wikipedia Signpost/2007-10-08/Vandalism study

An academic study combining editing data with page view logs has added some new understanding about the quality and authorship of Wikipedia content. It concluded that frequent editors have the most impact on what Wikipedia readers see, while the effect of vandalism is small but still a matter of growing concern.

The results of the study are reported in a paper titled "Creating, Destroying, and Restoring Value in Wikipedia" (available in PDF), to be published in the GROUP 2007 conference proceedings. It was put together by a research group in the University of Minnesota department of computer science and engineering. Based on sampled data provided by the Wikimedia Foundation, showing every tenth HTTP request over a one-month period, they created a tool for estimating the page views for a Wikipedia article during a given timeframe.

In the absence of this type of data, previous studies have largely relied on an article's edit history for analysis. Interestingly, the study concluded that there is "essentially no correlation between views and edits in the request logs."

The study estimated a probability of less than one-half percent (0.0037) that the typical viewing of a Wikipedia article would find it in a damaged state. However, the chances of encountering vandalism on a typical page view seem to be increasing over time, although the authors identified a break in the trend around June 2006, late in the study period. They attributed this to the increased use of vandalism-repair bots.

Authorship and value
Addressing the debate over "who writes Wikipedia", whether most of the work is done by a core group or occasional passersby, the study introduced a new metric which it called the "persistent word view" (PWV). This gives credit to the contributor who added a sequence of words to an article, combined with how many times article revisions with that contributor's words were viewed. The study came down largely in favor of the core group theory, concluding, "The top 10% of editors by number of edits contributed 86% of the PWVs". However, it may not necessarily refute Aaron Swartz's contention that the bulk of contributions often comes from users who have not registered an account; the Minnesota researchers excluded such edits from parts of their analysis, citing the fact that IP addresses are not stable.

The study built on previous designs for analyzing the quality of Wikipedia articles, notably the "history flow" method developed by a team from the MIT Media Lab and IBM Research Center and the color-coded "trust" system created by two professors from the University of California, Santa Cruz. In their own way, both earlier approaches focused on the survival of text in an article over the course of its edit history. Refining these with its page view data, the Minnesota study argued that "our metric matches the notion of the value of content in Wikipedia better than previous metrics."

Damage control
Looking at the issue of vandalism, the study focused primarily on edits that would subsequently be reverted. Although the authors conceded this might include content disputes as well as vandalism, their qualitative analysis suggested that reverts served as a reasonable indicator of the presence of damaged content.

Statistically, they estimated that about half of all damage incidents were repaired on either the first or second page view. This fits in with the notion that obvious vandalism gets addressed as soon as someone sees it; even in the high-profile Seigenthaler incident it's unlikely that many readers saw the infamous version of the article at the time, as a previous Signpost analysis indicated. However, the study also found that for 11% of incidents, the damage persisted beyond an estimated 100 page views. A few went past 100,000 views, although the authors concluded after examining individual cases that the outliers were mostly false positives.