Wikipedia:Contributor copyright investigations/Darius Dhlomo/Task explanation

What is happening
A 'bot (bot approval discussion) is reverting or blanking all of the articles created by Darius Dhlomo, based upon a supplied list and a CCI investigation. There are about 10,000 articles in this list. We've already determined that a significant number of these articles contain prose copied from other sources. There is a further list of just over 13,000 articles that have had significant text added to them by Darius Dhlomo. These, too, are part of the investigation. A more general description of the incident can be found here.

The initial, first-pass operation of the 'bot is to blank the 10,000 articles that were created by Darius Dhlomo. What to do about the further 13,000 articles in the second list is still being discussed, but it is likely to involve a similar mass editing task by a 'bot, following on from this task. Our current plan is to revert the articles on the second list back to the revision prior to any additions by Darius Dhlomo, adding a notice to each article informing editors that this has happened and that the article requires review.

This process is not about determining the motive for the copyright infringement. That is being discussed separately. This is about the cleanup of the result. Our best understanding of the copyright law that applies is that we cannot, having become aware of this mass infringement, do nothing.

Why this is happening
This is being done because, after investigation, it was determined that Darius Dhlomo had violated copyright on a number of occasions. It turned out that this was happening on quite a large scale, and with a regular pattern. As a consequence, every article that Darius Dhlomo has created is now suspect and has to be reviewed for potential copyright infringement. Unfortunately, that turned out to be a huge number of articles, too many for the normal contributor copyright investigation process, where a small number of dedicated editors manually look through a list of a few hundred articles. The number of articles to review is over twenty-three thousand. A handful of people cannot cope with that amount of work.

Instead, we have opted for a process where articles are blanked and the editor community in general is asked to diligently and carefully review those articles that interest them for copyright problems.

The articles are being blanked as a precautionary measure. They aren't being deleted. The edit history remains. However, we cannot legitimately continue to have Wikipedia publish the text of what we suspect will be hundreds if not thousands of copyright violations, until such time as we get around to reviewing each article.

How many and which of those articles actually infringe?
The short factual answer is that we don't know. The long factual answer is that we don't know. At least three volunteers have independently sampled selections of a few hundred articles from the list, and have come to the conclusion that around 10% of the articles contain copyright violations. We have fairly good reason, based upon the editing patterns, to conclude that this rate extends to all 23,000 articles being investigated. The editing patterns found upon investigation have also led us to conclude that any prose content (i.e. anything more than just raw numbers, names, and dates) is likely not this editor's original writing.

The problem is that we have no mechanical way to determine which 10%. The way that the text was copied defeats automated comparison systems such as CorenSearchBot (which detected only a very few of the copyright violations, which is partly why no warning flags were raised earlier than this). Individual sentences or paragraphs were taken wholesale from prose sources, but were re-ordered. Some very light textual revisions were sometimes made. The results are close paraphrases: still copyright violation and plagiarism, but different enough to defeat automated text comparison mechanisms.
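To illustrate why re-ordering matters, here is a minimal Python sketch. It is not how CorenSearchBot or any other tool actually works; the function names and the sample sentences are invented for illustration only. It compares an order-sensitive similarity score over the whole text with a per-sentence score that ignores order.

```python
import re
from difflib import SequenceMatcher


def sentences(text):
    """Split text into sentences, lowercased with punctuation removed."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    cleaned = [re.sub(r"[^a-z0-9 ]", "", p.lower()).strip() for p in parts]
    return [c for c in cleaned if c]


def whole_text_similarity(source, candidate):
    """Order-sensitive similarity (0.0 to 1.0) over the full strings."""
    return SequenceMatcher(None, source, candidate).ratio()


def per_sentence_similarity(source, candidate):
    """Average, over candidate sentences, of the best match in the source.

    Sentence order is ignored, so re-ordered copying still scores highly.
    """
    src = sentences(source)
    cand = sentences(candidate)
    if not src or not cand:
        return 0.0
    best = [max(SequenceMatcher(None, c, s).ratio() for s in src) for c in cand]
    return sum(best) / len(best)


# Hypothetical example text, not taken from any real article or source.
source = ("He won the national title in 1998. He later moved to Spain. "
          "His best time came in Berlin.")
copied = ("His best time came in Berlin. He won the national title in 1998. "
          "He later moved to Spain.")

print("whole text:  ", round(whole_text_similarity(source, copied), 2))
print("per sentence:", round(per_sentence_similarity(source, copied), 2))
```

In this toy case the per-sentence score is perfect while the whole-text score falls well short of it, even though every sentence was copied verbatim. Light rewording lowers both scores further, which is one reason a mechanical check cannot replace a human reading the article against its likely sources.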

Furthermore, and equally unfortunately, the copied prose has sometimes been included amongst legitimate contributions made by other Wikipedians, or has been later modified and edited by other Wikipedians (creating derivative works, which we also have to exclude).

Thus we need humans to review the contents of the articles. This is where you come in.

What happens next
What happens next is you. You can help. We want you to help. If you came here because a link to this page turned up in an edit summary on your watchlist, we'd like you to review the articles that you are watching. The idea is that if everyone reviews just a few articles, this mountain ends up being moved by a thousand teaspoons all digging together.

Please read the instructions for what to do and help.

Where this was discussed
The full discussion, including the original CCI case discussion from late August and the subsequent discussion at the administrators' noticeboard for incidents, can be found at Administrators' noticeboard/Incidents/CCI, where we analyzed samples of articles, discussed options, and tried to come up with a means for managing such a huge investigation. There is also relevant discussion at User talk:Uncle G and User talk:Moonriddengirl (and their respective archives for September 2010).