User:Cobo~enwiki

Please note: ''The following is written in the present tense, but the bot itself has not actually been finished yet. I am also seeking verification/permission before I start testing it on Wikipedia itself. It will perform a variety of tasks, including the detection and reversion of copyvios, and will eventually aid in the detection and reversion of easily verifiable types of vandalism.''

Cobo (Copyvio Bot) is the account for an experimental copyvio (copyright violation) bot operated by User:Veratien.

How Cobo Works
Cobo is written in Ruby and works by trawling the Recent Changes list for large new articles. When it finds such an article, it logs the find in the ##wikipedia-en-copyvios IRC channel on Freenode. Cobo then assigns the article a score based on predefined rules and, if the score is over a set threshold (currently 8), searches Google for pages similar to the article. It then fetches the source of both the Wikipedia article and the page it is being matched against, uses the first and last sentences of the Wikipedia article to work out where the main article text sits within that page, and extracts it. Finally, Cobo diffs plain-text versions of the two texts, with all HTML, wiki markup, newlines, runs of multiple spaces, etc. removed, and works out how closely the two match.
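The extraction and comparison steps described above might look roughly like the following. This is only a sketch: the helper names (plain_text, extract_span, lcs_length, match_percentage) are hypothetical, the markup stripping covers only the simplest cases, and a word-level longest common subsequence is assumed as a stand-in for whatever diff Cobo actually uses.

```ruby
# Hypothetical helpers; Cobo's real normalisation and diff are not documented here.

# Reduce HTML or wikitext to plain, single-spaced text.
def plain_text(source)
  text = source.gsub(/<[^>]*>/, " ")                            # HTML tags
  text = text.gsub(/\{\{[^{}]*\}\}/, " ")                       # simple {{templates}}
  text = text.gsub(/\[\[(?:[^\[\]|]*\|)?([^\[\]]*)\]\]/, '\1')  # [[target|label]] -> label
  text = text.gsub(/'{2,}/, "")                                 # ''italic'' and '''bold'''
  text.gsub(/\s+/, " ").strip                                   # newlines, runs of spaces
end

# Cut the page down to the span running from the article's first sentence
# to its last sentence; fall back to the whole page if either is missing.
def extract_span(page, article)
  sentences = article.split(/(?<=[.!?])\s+/)
  return page if sentences.empty?
  start  = page.index(sentences.first)
  finish = page.rindex(sentences.last)
  return page if start.nil? || finish.nil? || finish < start
  page[start...(finish + sentences.last.size)]
end

# Length of the longest common subsequence of two word arrays.
def lcs_length(a, b)
  prev = Array.new(b.size + 1, 0)
  a.each do |x|
    curr = [0]
    b.each_with_index do |y, j|
      curr << (x == y ? prev[j] + 1 : [prev[j + 1], curr[j]].max)
    end
    prev = curr
  end
  prev.last
end

# Percentage of the article's words that also appear, in order, in the page.
def match_percentage(article_source, page_source)
  article = plain_text(article_source)
  page    = extract_span(plain_text(page_source), article)
  a = article.downcase.split
  b = page.downcase.split
  return 0.0 if a.empty?
  100.0 * lcs_length(a, b) / a.size
end
```

Comparing word sequences rather than raw characters keeps the result insensitive to the whitespace and markup differences that the normalisation step is meant to remove.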

If the texts match by more than 90%, Cobo replaces the text of the article with the copyvio template. Because a diff of the complete article is used in this process, the chance of Cobo replacing a legitimate article with the copyvio template is very small. Cobo also checks the number of results returned by Google: the fewer results there are, the more likely the article is to be a copyright violation, which adds to the article's score.

In the event of a definite copyright violation, Cobo logs its replacement in ##wikipedia-en-copyvios for people to check, citing the original author of the article, the article name, and the size of the original article. (The larger it is, the more likely it is to have been copied from another source.)

If an article scores high enough to be a possible but not definite copyvio (currently 6), the bot logs the occurrence in ##wikipedia-en-copyvios but does not act on its find.
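The thresholds mentioned so far (a score over 8 to trigger a search, a match of more than 90% to replace the text, and a score of 6 for a possible copyvio) can be put together as a small decision function. The constant and method names below are hypothetical, and two details are assumptions rather than documented behaviour: that a score of exactly 6 counts as "possible", and that a high-scoring article whose text does not match is logged rather than ignored.

```ruby
# Threshold values are taken from the text; the names are hypothetical.
SEARCH_SCORE   = 8     # a score over this triggers the web search and diff
POSSIBLE_SCORE = 6     # a score this high is logged as a possible copyvio
MATCH_PERCENT  = 90.0  # a match above this is treated as a definite copyvio

# match_percent is nil when no comparison was made.
def decide(score, match_percent = nil)
  if score > SEARCH_SCORE && match_percent && match_percent > MATCH_PERCENT
    :replace_and_log   # definite: replace the text and log the replacement
  elsif score >= POSSIBLE_SCORE
    :log_only          # possible: log the find, do not edit (assumed inclusive)
  else
    :ignore
  end
end
```

For example, decide(9, 95.0) gives :replace_and_log, while decide(6) gives :log_only.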

Ideas for Expansion
In the future the bot will hopefully support the detection and reversion of certain common forms of vandalism, e.g. replacing a large article with a small amount of text containing a number of buzzwords, or replacing an article with solid block capitals. This will work on a similar principle to the above, assigning each edit a score based on the number of buzzwords, whether or not the author is anonymous, the percentage difference in size between the new article and the old one, whether the new article consists entirely of block capitals, whether the new article consists of one repeating phrase (e.g. GNAAGNAAGNAAGNAA), and so on for other obviously detectable forms of vandalism.
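A scoring function along these lines could be sketched as follows. Everything specific here is an assumption made for illustration: the method names, the point values, the 80% shrink cut-off, and the idea that the buzzword list is passed in by the caller. None of it reflects a finished design.

```ruby
# Hypothetical sketch; weights and cut-offs are illustrative guesses only.

# True when the text has at least one capital letter and no lower-case letters.
def all_caps?(text)
  text.match?(/[[:upper:]]/) && !text.match?(/[[:lower:]]/)
end

# True when the text, ignoring whitespace, is one phrase of two or more
# characters repeated back to back, e.g. "GNAAGNAAGNAAGNAA".
def repeated_phrase?(text)
  text.gsub(/\s+/, "").match?(/\A(.{2,}?)\1+\z/)
end

def vandalism_score(old_text, new_text, anonymous:, buzzwords: [])
  score = 0
  words = new_text.downcase.scan(/[[:alnum:]]+/)
  score += words.count { |w| buzzwords.include?(w) }  # one point per buzzword
  score += 1 if anonymous
  unless old_text.empty?
    # Percentage by which the article shrank (negative if it grew).
    shrink = 100.0 * (old_text.size - new_text.size) / old_text.size
    score += 3 if shrink > 80
  end
  score += 3 if all_caps?(new_text)
  score += 4 if repeated_phrase?(new_text)
  score
end
```

As with the copyvio score, the result would be compared against a threshold before the bot reverts anything, so that a single weak signal (an anonymous author, say) is never enough on its own.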

Please discuss this bot on its talk page, offering suggestions for advancement and improvement, and noting bugs where needed.

The bot is currently limited to making at most one edit every two minutes, and is being tested on live Wikipedia content.