User talk:Crispy1989/Dataset

Naming the links
Should the links be numbered by default or named? Naming the links would help differentiate which edits are which. --209.244.31.53 (talk) 17:46, 8 April 2008 (UTC)
 * It doesn't matter which is used - either way, only the actual link is used. —Preceding unsigned comment added by Crispy1989 (talk • contribs) 20:35, 10 April 2008 (UTC)
 * Some editors are removing the labels. Labels would help with quality control and with reviewing the current composition of the dataset. 209.244.31.53 (talk) 18:45, 27 May 2008 (UTC)
 * What I meant by labels is some kind of description of each diff, so that reviewers don't have to open every diff. 209.244.31.53 (talk) 20:11, 13 October 2008 (UTC)

Easier way to build the data set?
Hi, I'd love to help build a better mousetrap, but there's got to be a faster way to gather the volume of data needed. Do you think you could relatively quickly whip up a tool that would give, say, a list of edits ClueBot has currently identified as vandalism, or that have been rolled back using admin rollback, plus a quick-and-dirty interface that would allow clicking a checkbox to verify that each link is in fact vandalism and then submitting the list en masse? That way, with the list of links, I could quickly open a chunk of them in tabs or pop-ups, verify them, and check the box. I think this would let lots of people help get your data very fast. Currently it requires cutting and pasting URLs from semi-random places; in short, it's tedious. - Taxman Talk 16:17, 23 May 2008 (UTC)
 * Yes, we are currently working on a tool that does almost exactly this. The only difference is that, instead of working from the current ClueBot's output, it uses random main-namespace revisions. The current ClueBot's output only includes edits that it thinks can be definitely classified as vandalism or not vandalism; it does not include fringe cases (which we want included here). The tool is almost finished. A link to it will be posted as soon as it is complete. Crispy1989 (talk) 17:58, 1 September 2008 (UTC)

A method
Open a page history with &limit=9999&action=history. From bottom to top, look at each diff for the page and capture every other diff, unless it is a special edit such as one adding more than a few words of original text, or this. Then call that page history searched through. 209.244.31.53 (talk) 18:50, 29 May 2008 (UTC)
 * Picking edits more randomly is better, but we must avoid selecting a diff we already have, and we still need to find particular kinds of edits. 209.244.31.53 (talk) 01:52, 30 July 2008 (UTC)
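The sampling scheme described in this thread (walk a page history from oldest to newest, capture every other diff, and skip diffs already in the dataset) can be sketched as a small helper. This is an illustrative sketch, not part of any actual tool; the function names and the idea of passing plain revision IDs are assumptions made here for illustration.

```python
def history_url(title):
    """Build the long-history URL mentioned above for a given page title."""
    return ("https://en.wikipedia.org/w/index.php?title="
            + title + "&limit=9999&action=history")


def sample_every_other(rev_ids, already_have=None):
    """Capture every other revision from a page history listed oldest-first,
    skipping revisions already present in the dataset.
    Hypothetical helper illustrating the sampling scheme described above."""
    seen = set(already_have or ())
    captured = []
    for i, rev_id in enumerate(rev_ids):
        # take revisions at even positions (every other diff, bottom-up)
        if i % 2 == 0 and rev_id not in seen:
            captured.append(rev_id)
    return captured
```

Edits flagged as "special" (e.g. adding substantial original text) would still need a manual check before being skipped, as the comment above suggests.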

Vandalism-enriched dataset starting point
You could start by harvesting the edits from the "Special:Contributions" page of users that appear in the block log. For extra credit, parse the reason field for "vandalism", "spamming", etc. Then you can begin your human reviewing and classification scheme, which has definite merit for creating a training set for an ANN.

12.44.50.248 (talk) 19:14, 3 July 2008 (UTC)
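One way to sketch the harvesting idea above: filter block-log entries by their stated reason, then treat the flagged users' contributions as vandalism candidates for human review. The entry shape below (dicts with 'user' and 'reason' keys) and the function name are assumptions for illustration only; in practice the block log would be read via the MediaWiki API.

```python
def users_blocked_for(entries, keywords=("vandal", "spam")):
    """Return the users from block-log entries whose stated block reason
    mentions vandalism or spamming.  `entries` is assumed to be a list of
    dicts with 'user' and 'reason' keys (a hypothetical shape)."""
    flagged = []
    for entry in entries:
        reason = (entry.get("reason") or "").lower()
        # case-insensitive substring match catches "Vandalism", "spamming", etc.
        if any(word in reason for word in keywords):
            flagged.append(entry["user"])
    return flagged
```

The resulting user list would then feed the contributions harvest, with every captured edit still going through the human classification step.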
 * My only concern with this is that it might result in a non-random sample of edits. Crispy1989 (talk) 17:59, 1 September 2008 (UTC)

 * What does seem to be very clear, however, is that you have to build up this dataset with some automated assistance, because based on what's been submitted so far I really don't see it getting large enough quickly enough otherwise (assuming, that is, that it needs to grow by at least an order of magnitude). I would have thought that a reasonable initial "vandalism" dataset would be everything against which rollback has been used (by users who have not had rollback rights subsequently withdrawn), plus everything reverted by ClueBot without a false positive being reported.  A reasonable initial "constructive edits" dataset might be the entire edit history of a reasonable number of respected, well-established users (selected based on, e.g., lots of edits and a clean or nearly clean block log, and/or maybe adminship).  I suppose that developers could probably assist you in getting at this sort of list with some suitably constructed DB queries.

 * Of course this approach will lead to some false positives and some false negatives, but I guess that the advantages of having a large dataset will probably outweigh this. Admittedly this is to some extent a hunch, but I am basing my hunch particularly on experience with the spam filtering in Gmail.  Gmail tends to be extremely good at spam filtering, and my guess is that it is heavily reliant on people training the filter using the "report as spam" button.  I would be very surprised if Google employ anyone to sift through double-checking these, so there will be some false positives (no doubt some of them even maliciously submitted).  And needless to say, there will also be a lot of false negatives, where people don't bother to click the "report as spam" link.  But if they can still do as well as they do with what they have, then this does suggest that the gain in compiling a large dataset is worth the risk of including some erroneous items if the errors are unlikely to be systematic.  — Alan✉ 18:24, 1 September 2008 (UTC)
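The bootstrapping heuristic proposed above (rollback by a still-trusted user, or a ClueBot revert with no false-positive report, implies vandalism; an edit by a respected, established user implies constructive) could be written as a simple labelling rule. All the field names below are hypothetical, invented only to illustrate the rule; they do not correspond to any real database schema.

```python
def provisional_label(edit):
    """Assign an initial label to an edit record using the bootstrapping
    heuristic suggested above.  All dict keys are hypothetical.
    Labels are provisional and expected to contain some noise."""
    if edit.get("rolled_back_by_trusted_user"):
        return "vandalism"
    if edit.get("reverted_by_cluebot") and not edit.get("false_positive_reported"):
        return "vandalism"
    if edit.get("author_is_established"):
        return "constructive"
    return "unlabelled"  # leave fringe cases for human review
```

As the comment notes, such labels trade some false positives and negatives for dataset size, on the bet that the errors are not systematic.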

 * We are indeed testing with a very large dataset that is entirely automatically generated by the current ClueBot. It classifies an edit as non-vandalism only when its score falls below an even stricter threshold than the one normally used in its scoring system.  The problem with a dataset containing even a few imperfections is that the neural network will tend to actually discover these imperfections and classify them correctly, even though they are labelled incorrectly in the dataset.  This completely skews the statistics we use to assess how good a given version of the neural network is.  We did create an interface to aid users in generating the dataset (see my main userpage), which will be what is primarily used.  We may also implement a system for manually reviewing entries that the neural network classifies contrary to the dataset; this could help eliminate most dataset errors even in a larger dataset.  Crispy1989 (talk) 05:40, 16 September 2008 (UTC)
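The stricter-threshold idea can be pictured as a two-cutoff rule: only edits scoring confidently at either extreme enter the automatically generated dataset, and the fringe cases in between are excluded. The cutoff values below are made up for illustration and are not ClueBot's actual settings.

```python
def auto_label(score, vandalism_cutoff=0.95, clean_cutoff=0.05):
    """Two-cutoff auto-labelling sketch (cutoff values are illustrative).
    Edits scoring between the cutoffs are fringe cases and return None,
    i.e. they are kept out of the automatically generated dataset."""
    if score >= vandalism_cutoff:
        return "vandalism"
    if score <= clean_cutoff:
        return "constructive"
    return None
```

This mirrors the point made earlier in the thread: the auto-generated dataset deliberately omits fringe cases, which is why a human-reviewed dataset is still needed to cover them.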