Wikipedia:Wikipedia Signpost/2006-10-30/Plagiarism cleanup

Wikipedia editors launched an effort last week to clean up a number of articles, following a report that found a significant amount of plagiarism in biographical articles. Most of the affected articles needed to have the plagiarized text removed, and several were deleted entirely, although in a few instances the source was in the public domain and the problem could be solved simply by adding correct attribution.

Plagiarism reported
The report, posted at wikipedia-watch.org/psamples.html, was produced by Daniel Brandt, who has been a critic of Wikipedia over the past year and previously played a role in the Seigenthaler incident. Focusing this time on deceased persons, rather than on the living, whose articles have become subject to heightened scrutiny since that incident, he reviewed a number of articles about people born before 1900. He claimed that about one percent of his sample contained indications of plagiarism, and concluded that the actual rate would be higher because of limitations in his methodology.

The focus of the analysis was plagiarism, rather than copyright infringement, although most of the cases qualify as examples of both. Brandt explained that he took sentences from Wikipedia articles and ran a Google search on them to see if other sites appeared in the results, while removing cases where attribution was already present in the Wikipedia article. He also attempted to exclude sites that mirror Wikipedia, although he conceded that in a very few cases, the text might have been copied from Wikipedia rather than the other way round.

Citing sources has long been Wikipedia policy, which also declares that there is "no tolerance for copyright violations". Existing processes help to ferret out possible copyright problems as cases are reported, and previous larger-scale problems have led to coordinated responses, such as that of the German Wikipedia last year after text copied from several print reference works was found (see archived story).

Reaction on Wikipedia
The appearance of the report prompted considerable activity and discussion. W.marsh began compiling a list as people reviewed the articles, to keep track of which editors were identified as being responsible for the plagiarism. This allowed others to go back and review the contributions of these editors in case any were serial plagiarists. While most have turned out to be one-time offenders, several individuals have contributed a number of copyright violations. So far around 100 articles have been cleaned of plagiarized text, in addition to the 142 originally reported by Brandt. The most egregious offender is User:RJNeb2, who has inserted dozens of film-related biographies apparently plagiarized from an offline source.

In one case, a well-established contributor and administrator, Olivier, had added copyrighted content to a number of articles. He was apparently the victim of a deception, however, because an outside site, www.nobel-winners.com, had copied its content from the Encyclopædia Britannica. The site included a notice that "All text is available under the terms of the GNU Free Documentation License", although it did not include or even link to the text of the license, and the site's boilerplate ended with a copyright notice saying "All Rights Reserved."

Work continues to clear out the problems in these articles and to locate others that have not yet been identified. As W.marsh pointed out, even the deleted articles generally covered valid subjects, but a number of them have yet to be re-created.