User:Smalljim/AddBad

Recent changes patrol could be better!
It's possible to create a new generation of applications for recent changes patrol (RCP) on Wikipedia. The current tools don't make optimal use of all the information that's available in Wikipedia's IRC recent changes feed.

Available information
The information in a recent changes feed includes:
 * editor name or IP address
 * page edited, created or deleted
 * page size change
 * edit summaries, including standard ones like "page blanking"
 * edit filter hits
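
As a rough illustration, the fields listed above could be held in one small record per event. The sketch below is Python and is not taken from any existing tool. The line layout it parses is an assumed, simplified one (colour codes already stripped, edits only), and the names `RcEvent`, `LINE_RE` and `parse_line` are hypothetical.

```python
import ipaddress
import re
from dataclasses import dataclass
from typing import Optional

# ASSUMED simplified layout of one feed line, colour codes already removed:
#   [[Page title]] FLAGS URL * Editor * (+123) edit summary
# The real feed has more variants (log entries, edit filter hits, etc.).
LINE_RE = re.compile(
    r"^\[\[(?P<page>.+?)\]\]\s+(?P<flags>\S*)\s+(?P<url>\S*)\s+\*\s+"
    r"(?P<editor>.+?)\s+\*\s+\((?P<delta>[+-]?\d+)\)\s*(?P<summary>.*)$"
)


@dataclass
class RcEvent:
    page: str          # page edited or created
    editor: str        # editor name or IP address
    size_change: int   # page size change in bytes
    summary: str       # edit summary, e.g. "page blanking"
    is_ip: bool        # True when the editor field is an IP address


def looks_like_ip(editor: str) -> bool:
    """Return True if the editor field parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(editor)
        return True
    except ValueError:
        return False


def parse_line(line: str) -> Optional[RcEvent]:
    """Parse one line in the assumed layout; return None if it does not fit."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    editor = m.group("editor")
    return RcEvent(
        page=m.group("page"),
        editor=editor,
        size_change=int(m.group("delta")),
        summary=m.group("summary"),
        is_ip=looks_like_ip(editor),
    )
```

Lines that do not fit the assumed layout return `None`, so a caller can skip them or pass them to a separate handler.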

Configuration
To be most useful, the watchlists should allow regular expression matches. We can have configuration files such as:
 * whitelisted editors
 * an IP address – ASN match file
 * an IP geolocation file
 * watchlists of page names
 * watchlists of usernames, IPs, ASNs
 * watchlists of edit summaries
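
A watchlist that allows regular expression matches might be handled along these lines. This is a minimal Python sketch, assuming one pattern per line with `#` comment lines; that file convention and the names `compile_watchlist` and `first_match` are hypothetical.

```python
import re
from typing import Iterable, List, Optional, Pattern


def compile_watchlist(lines: Iterable[str]) -> List[Pattern[str]]:
    """Compile one regex per non-empty, non-comment line (assumed format)."""
    patterns = []
    for raw in lines:
        entry = raw.strip()
        if not entry or entry.startswith("#"):
            continue  # skip blank lines and comments
        patterns.append(re.compile(entry, re.IGNORECASE))
    return patterns


def first_match(patterns: List[Pattern[str]], text: str) -> Optional[str]:
    """Return the source of the first pattern found in text, or None."""
    for pattern in patterns:
        if pattern.search(text):
            return pattern.pattern
    return None
```

The same two functions would serve for page names, usernames and edit summaries, since each is just a string tested against a list of patterns.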

Detection
By using no more than the above information and configuration files we can detect many potentially unwanted actions made by non-whitelisted editors. These include:
 * edits by those who have had edit filter hits
 * further edits by editors who have already been reverted (and warned)
 * further edits to pages that have recently had reverts on them
 * large additions or removals of content, or blanking of pages
 * the creation of new pages by editors who have already been reverted or had pages deleted
 * unusually fast or prolific editing
 * edits to frequently-vandalised pages
 * IP editors making similar edits using the same ISP or from the same area
 * matches on edit summaries
 * and several others
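
To take one item from the list, unusually fast editing can be detected with a sliding time window per editor. The following Python sketch shows one way to do that. The class name `RateWatcher` and the default limits are hypothetical choices, not values from any existing tool.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class RateWatcher:
    """Flag editors who make more than max_edits within window_seconds."""

    def __init__(self, max_edits: int = 5, window_seconds: float = 60.0):
        self.max_edits = max_edits
        self.window_seconds = window_seconds
        self._times: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, editor: str, timestamp: float) -> bool:
        """Record one edit; return True if the editor is now over the limit."""
        times = self._times[editor]
        times.append(timestamp)
        # Drop edits that have fallen out of the window.
        while times and timestamp - times[0] > self.window_seconds:
            times.popleft()
        return len(times) > self.max_edits
```

Timestamps are passed in rather than read from the clock, so the same code works on a live feed or on a stored log.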

Although one of these actions in isolation may not be problematic, repeated actions or a combination of more than one of them is much more likely to be.

Making use of edit filter hits is possibly the most significant improvement that can be made (not least because the edit filters have access to the text diffs). I think all the large Wikipedias have extensive sets of edit filters that can detect many forms of vandalism and other inappropriate edits. One can envisage a closer association between the edit filters and a new generation of RCP applications, with the filters being adjusted more interactively.

User interface
After detecting potentially unwanted actions or vandalism, we have to decide how to present the information to the user. This could be minimal: simply presenting the most likely events one after another, as the current applications do. Or we could use an information-rich interface that shows details of all the recent events that pass a threshold, highlighted in some way according to the program's assessment of how bad they are. The user can then select the events he's most interested in. I prefer this approach.

AddBad
What follows is a description of a prototype application, provisionally called AddBad, that I have been developing to demonstrate the above principles. As a way of prioritising actions that may be worth looking at, the application awards "badness" points to editors based on events such as reverts, warnings, edit filter hits etc. AddBad has an information-rich interface and uses colour to highlight edits according to the badness accumulated by the editor. When it is running, it reports around one or two potentially bad edits per second (depending on activity, of course). These appear in a set of easily followed, constantly updating lists which, as can be seen, form a colourful display that is packed with relevant information.

As an example, an editor might accumulate 30 badness points for hitting an edit filter that warns that it has detected swear words in the edit. If the editor persists in posting the edit (despite the automatic warning), it will appear as a relatively low priority bad edit. A revert and a level 1 warning from another editor (or ClueBot NG) would award, say, 10 + 50 more badness points to the vandal editor. If the vandal then makes another edit we will be alerted with a brighter highlight reflecting the 90 badness points he now has. Further edits that result in reverts/warnings will add more badness, resulting in even brighter highlighting, and so on. If we ourselves revert/warn, a lot of badness is awarded to ensure that we can easily follow his subsequent edits. In the case of false alerts, we can easily zero an editor's badness, or add him to an "ignore today" list, or even add him to the whitelist. Alternatively, we can add badness to editors whose actions look suspicious.
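
The arithmetic in this example can be sketched in a few lines of Python. The point values 30, 10 and 50 are the ones used above. The event names, the highlight thresholds and the class `BadnessTracker` are hypothetical, chosen only to make the sketch runnable.

```python
from collections import defaultdict
from typing import Dict, Optional, Set

# Points per event type, using the example values from the text.
SCORES = {"filter_hit": 30, "revert": 10, "warning_level_1": 50}


class BadnessTracker:
    """Accumulate badness per editor and map it to a highlight level."""

    def __init__(self, scores: Optional[Dict[str, int]] = None):
        self.scores = dict(SCORES if scores is None else scores)
        self.badness: Dict[str, int] = defaultdict(int)
        self.ignored_today: Set[str] = set()

    def award(self, editor: str, event: str) -> int:
        """Add the points for one event and return the editor's new total."""
        self.badness[editor] += self.scores[event]
        return self.badness[editor]

    def zero(self, editor: str) -> None:
        """Reset an editor's badness after a false alert."""
        self.badness[editor] = 0

    def ignore_today(self, editor: str) -> None:
        self.ignored_today.add(editor)

    def highlight(self, editor: str) -> str:
        """Return a highlight level; the thresholds are assumed values."""
        points = self.badness.get(editor, 0)
        if editor in self.ignored_today or points <= 0:
            return "none"
        if points < 50:
            return "low"
        if points < 100:
            return "medium"
        return "high"
```

Keeping the scores in a plain dictionary means each event type can be re-weighted without touching the code.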

The configuration files in AddBad add a significant aspect that is not utilised by the present generation of anti-vandalism (AV) programs. For example, if an editor name is regex-matched in a config file, then every edit made by that editor is alerted using a distinctive highlight. If that editor hits an edit filter or gets reverted, badness is awarded as above, increasing the highlighting. Or he can be easily ignored if appropriate. Some vandals repeatedly hit the same page or range of pages, using different IP addresses or account names: edits by non-whitelisted editors to these pages can be notified too, with extra highlighting if there is a regex match on the editor name or IP, or on the ASN. Because the configuration files are persistent and changes to them can be applied immediately, there's a decent chance that vandalism like this can be tracked over long periods if necessary, even if it evolves.
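
Matching on the ASN needs a lookup from an editor's IP address to the ASN of the network it belongs to. Here is a minimal Python sketch using the standard `ipaddress` module. It assumes a match file with one `CIDR ASN` pair per line, which is a hypothetical format, and the function names are likewise illustrative.

```python
import ipaddress
from typing import Iterable, List, Optional, Tuple, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


def load_asn_table(lines: Iterable[str]) -> List[Tuple[Network, int]]:
    """Read 'CIDR ASN' lines (assumed format); '#' starts a comment."""
    table = []
    for raw in lines:
        entry = raw.split("#", 1)[0].strip()
        if not entry:
            continue
        cidr, asn = entry.split()
        table.append((ipaddress.ip_network(cidr), int(asn)))
    # Most specific networks first, so the first hit is the longest prefix.
    table.sort(key=lambda item: item[0].prefixlen, reverse=True)
    return table


def lookup_asn(table: List[Tuple[Network, int]], editor: str) -> Optional[int]:
    """Return the ASN for an IP editor, or None for names and unknown IPs."""
    try:
        addr = ipaddress.ip_address(editor)
    except ValueError:
        return None  # a registered account name, not an IP address
    for network, asn in table:
        if addr in network:
            return asn
    return None
```

A linear scan is enough for a small hand-maintained file. A full routing table would need a proper prefix tree, which is outside the scope of this sketch.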

Customisation of the config files also allows AddBad to be adapted to focus on particular aspects that the user is interested in; and further tailoring can be achieved by adjusting the badness points awarded for each type of action (each edit filter can have a different score, for example). This customisation would be beneficial when several recent changes patrollers are online at the same time, since it would reduce the likelihood that they are all chasing the same bad edits: a phenomenon with which every Huggle user will be familiar.

In addition to all the above, new page creations by non-whitelisted editors are displayed, as are speedy deletion requests and deletions of those pages. Reverts, warnings and blocks are shown as they happen too, as well as other relevant events such as AIV reports. It's quite possible to leave the application running in the background while working on something else and only take action when vividly-highlighted edits appear, or to sit back and just watch as the seamier side of Wikipedia is acted out before your eyes. It's reassuring to see how much vandalism is quickly reverted by the dedicated band of recent changes patrollers using the existing tools – but AddBad regularly reveals unwanted edits that have been missed by others.

Application details
In its prototype form, AddBad is a set of Perl scripts, with a bit of jQuery to make the web interface work. One script collects, massages and stores the IRC RC feed. A second script tails the output of the first, analyses each line to determine if, where, and how it should be shown, and uses Ajax to regularly update the scrolling lists on its webpage (which is served from a local web server), as shown above. Clicks on individual entries on the webpage can show diffs, or (at present) send an edit history or page history to the program user's logged-in Wikipedia session for processing. An integrated front end for reverting etc. (like Huggle's) would convert it into a fully-fledged anti-vandalism program.
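
The "tail the output of the first script" step can be illustrated as follows. This is a Python sketch of the general technique, not a translation of the Perl scripts; the function `read_new_lines` and its byte-offset approach are assumptions made for the example.

```python
from typing import List, Tuple


def read_new_lines(path: str, offset: int = 0) -> Tuple[List[str], int]:
    """Return complete lines added to the file since offset, plus a new offset.

    A trailing partial line (one the writer has not finished yet) is left
    in place and picked up on a later call.
    """
    with open(path, "rb") as fh:
        fh.seek(offset)
        data = fh.read()
    end = data.rfind(b"\n") + 1  # 0 when no complete line is available
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end
```

A polling loop would call this every second or so, pass each returned line to the analysis step, and keep the returned offset for the next call.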

At present I don't plan to put in the additional work that would make AddBad suitable for wider use, but could be persuaded if there's enough interest. However, I hope these notes describe some useful principles for anyone interested in creating a new generation recent changes patrol or anti-vandalism program (or enhancing the existing applications). I'd be happy to discuss these principles with any bona fide editors.