Wikipedia talk:Contributor copyright investigations/Darius Dhlomo/Task explanation/sandbox

What is happening
A bot is blanking all of the articles created by Darius Dhlomo, based upon a supplied list and a CCI investigation. There are about 10,000 articles in this list, a significant number of which have been determined to contain text copied from other sources. Another 13,000 articles created by others later had material added by Darius Dhlomo, and these are also part of the investigation. The articles involved are almost entirely about athletes and sports events, and many of them have very little text (just names, figures, lists of results and events, etc.). According to Wikipedia's understanding of copyright, pure factual data is not copyrightable, so the concern is about text fragments in the articles that appear to have been copied from elsewhere, not the raw data. We don't have any reason to think there was deliberate infringement; rather, this issue arose because of the contributor's poor understanding of Wikipedia's copyright policy and of what constitutes unacceptable copying.

The initial bot operation will blank only the ~10,000 articles Darius Dhlomo actually created. The precise plan for handling the articles created by others but containing later significant contributions from Darius Dhlomo is still under discussion, but it is likely to be carried out by another bot operation developed as a follow-on to this one. The current idea is to revert those articles back to the revision immediately before Darius Dhlomo's first significant addition to the article, and insert a notice saying this has happened.
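As a rough illustration of the revert idea described above, the following sketch picks the revision immediately before a contributor's first significant addition. It is not the bot's actual code: the `Revision` record, the `revision_to_restore` function, and the 200-byte threshold for "significant" are all hypothetical, since how significance will be judged is still under discussion.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Revision:
    """Hypothetical, simplified view of one entry in an article's history."""
    rev_id: int
    user: str
    size_delta: int  # bytes added (+) or removed (-) by this revision


# Assumed cut-off for a "significant" addition; the real criterion is undecided.
SIGNIFICANT_BYTES = 200


def revision_to_restore(
    history: Sequence[Revision],
    contributor: str,
    threshold: int = SIGNIFICANT_BYTES,
) -> Optional[Revision]:
    """Return the revision just before the contributor's first significant addition.

    `history` must be ordered oldest first. Returns None when the contributor
    made no significant addition, or when their first significant addition is
    the very first revision (they created the article, so there is nothing
    earlier to revert to and the article would be blanked instead).
    """
    for index, rev in enumerate(history):
        if rev.user == contributor and rev.size_delta >= threshold:
            return history[index - 1] if index > 0 else None
    return None


history = [
    Revision(1, "Alice", 500),
    Revision(2, "Darius Dhlomo", 20),   # small edit, below the assumed threshold
    Revision(3, "Bob", 100),
    Revision(4, "Darius Dhlomo", 900),  # first significant addition
    Revision(5, "Carol", 50),
]
target = revision_to_restore(history, "Darius Dhlomo")
```

With the sample history above, `target` is revision 3, the last revision before the first large addition. Later edits by other contributors (revision 5 here) would be rolled back as well, which is why a notice on the article is part of the plan.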

After the articles in this current operation are blanked, the next step will be for editors like you to review the old content in the article history, in order to remove or rewrite any copied parts from the articles and restore the non-copied parts. More about this is further down, in the "How can you help?" section.

How many and which of those articles actually infringe?
The short factual answer is we don't know. The long factual answer is we don't know. The mid-length, speculative answer is that, of the few hundred of these articles that volunteers have manually examined so far, around 10% show clear signs of containing copied material, so we might extrapolate that to 10% of the total. But which 10%? We don't know, and the copying is fragmented enough to defeat automatic comparison systems like CorenSearchBot (which is part of why it was not detected sooner). This is not thousands of feature-length articles copied wholesale from magazines or the like. It's scattered sentences or paragraphs retyped or pasted here and there, often from unidentifiable sources (recognizable as probable copying only from the writing style), sometimes slightly rewritten but not by enough, wrapped around legitimate contributions made by Wikipedians, including many factual tables and lists made by Darius Dhlomo. The Wikipedia Signpost notice may have given the impression that all of the 23,000 articles under discussion have infringements. The more correct understanding is that while each of the 23,000 potentially has infringements, only a fraction appear to actually have them. But the number is significant enough that, as a precautionary measure, we have decided to revert all the potentially infringing contributions until they can all be reviewed, instead of waiting for concrete problems to be found. That's where you come in.

How can you help?
The CCI regulars can't possibly review that many thousands of articles in a reasonable amount of time. A much smaller CCI case has been going on for a year. Faced with a problem of this scale, editors quickly reached consensus to launch a bot to blank or revert all the uncertain articles. That means the suspected copyvio text is off the top-level pages of the site and out of search engines, but is still available in the article history for analysis purposes. The next task is for community members to examine the blanked material in the history and determine whether it is copied. Articles found to be entirely free of copying can be restored by reverting the bot. Articles containing copied material must have the copied text removed or rewritten before restoration. Articles with unremovable copying may have to be deleted. If you are an editor with no prior history of copyright problems, your assistance in reviewing will be greatly appreciated.

How to review
TBD

Who can review
TBD

Why this is happening
This is being done because, after investigation, it was determined that Darius Dhlomo had violated copyright on a number of occasions. It turned out that this was happening on quite a large scale, and with a regular pattern. As a consequence, every article that he has created is now suspect and has to be reviewed for potential copyright infringement. Unfortunately, that turned out to be a huge number of articles: almost 10,000. That is too many for the normal contributor copyright investigation process, in which a small number of dedicated editors manually look through a list of a few hundred articles. A handful of people cannot cope with that amount of work.

Instead, we have opted for a process where articles are blanked and the editor community in general is asked to diligently and carefully review those articles that interest them for copyright problems.

The articles are being blanked as a precautionary measure. Note that they aren't being deleted. They are being rolled back to the revision just prior to Darius's first edit, or (for the 10,000 or so articles created by Darius himself) completely blanked. The edit history remains. However, we cannot legitimately continue to have Wikipedia publish the text of what we suspect will be a fair number of copyright violations until we get around to reviewing each article.

Where this was discussed
The contributor copyright investigation case page for Darius Dhlomo can be found at Contributor copyright investigations/Darius Dhlomo. You will see some discussion there. There was additional discussion at the administrators' noticeboard for incidents, which you can find at Administrators' noticeboard/Incidents/CCI, where we tried to come up with a means for managing such a huge investigation.

What happens next
What happens next is you. You can help. We want you to help. If you came here because a link to this page turned up in an edit summary on your watchlist, we'd like you to review the articles that you are watching. The idea is that if everyone reviews just a few articles, this mountain ends up being moved by a thousand teaspoons all digging together.

Please read the instructions for what to do and help.

Related

 * Contributor_copyright_investigations/Darius_Dhlomo
 * Administrators'_noticeboard/Incidents/CCI