Wikipedia talk:WikiProject Vandalism studies/Study2/Archive2

Proposed table
To get things rolling I have made the following table to gather data. I know it is cumbersome, but I figured others can propose revisions. I made the table sortable so that should be interesting for analysis purposes. Remember 03:05, 13 April 2007 (UTC)

Table
ATOE means "at time of edit"
 * Table looks great! I think we should probably remove the columns for "change in time" and other derivative stats.  Those will be useful for crunching data, but because they can be gleaned from data in the table, I'm not sure they would need to be included in the data gathering table.  I imagine all serious crunching will actually be done outside of Wikipedia in another spreadsheet program. But great job with the table.  Now we just need to solve how to get those tricky data points.  Martschink 13:48, 14 April 2007 (UTC)


 * And how did you get the number of categories at the time of the edit? Did you use that tool at the bottom of this page? Martschink 14:36, 14 April 2007 (UTC)
 * No I just counted them up. Remember 15:34, 14 April 2007 (UTC)


 * I've reduced the width of the table, as well as the number of attributes per edit to be gathered, by doing the following:


 * Tweaking the column titles
 * Removing "Editor type" (both for editor and reverter); this is redundant because the user name is also listed; the editor type can be derived (on a spreadsheet, outside of the sampling)
 * Removing "change in time"; as noted by someone else, this can be calculated
 * Removing the edit number of the first edit of the editor; seems not that useful (just there to "prove" something?)
 * Removing low-value information about the reverter: # of edits at time of edit, date of first edit, edit number. The critical issues are whether the revert is by an IP address or not, and how long it was before the revert occurred.  If more information about reverters is needed, a relatively small sample should suffice.
 * Changing months from full spelling to three-letter abbreviation
 * Changing edit type to shorter descriptions


 * Before putting back any columns, I strongly urge a review of (a) the cost of gathering the additional data; (b) the value of the data. (For example, asking those working on the project to calculate elapsed times between edits obviously will take time, but doing this in an Excel spreadsheet does the same thing, so the value of human beings doing the calculations is nil, or negative, given the likelihood of human mistakes while doing arithmetic.)  -- John Broughton  (♫♫) 21:56, 14 April 2007 (UTC)


 * I like the revisions, John. At least one thing comes to mind.  We need the column for edit number because we have to know what edit the data relates to.  As we've viewed this so far, once we start collecting data, the only full column from the start will be the edit number column.  That lets volunteers know for which edits data needs gathering.  I'm not sure I agree about the low value of the information on the reverter, but I'm not sure I disagree either.  I suppose that as long as we have the reverter's ID and time of reversion, a subsequent extension of this study could go and pick that data up.  I'll ponder that one.  And I agree we need to shrink the form of the data.  In my list (way) above, I put a list of several different types of edit.  I think we can get those down to two-letter entries (e.g., NV = Not vandalism, LS = Link Spam). Martschink 23:28, 14 April 2007 (UTC)


 * And I think we can 86 the first column, data number. Given that this is all going to be sortable, I'm not sure we need to keep track of that (especially since we have a column for date the data was collected).  We still have the open problem about how to get some of the data ATOE, such as links to article.  One last thought: edit number and article name need to be separate b/c we may end up with more than one edit examined for one article. Martschink 23:34, 14 April 2007 (UTC)

Check out Village pump (technical), it contains some answers to our questions, but it is pretty technical. Remember 13:36, 15 April 2007 (UTC)

Random Edits study, formulation and structure
Alright, as it looks like the Random Edits study won the vote above by 2:1 (see above), let's get the blueprints laid out. I was thinking this time we should follow a breakdown using the scientific method. Let's start filling in areas we can and set things up to start work. (omitted until work is done: Abstract, Data, Results and Discussion, Conclusion, References). JoeSmack Talk 20:06, 6 April 2007 (UTC)
 * I went ahead and added this to the main page. I think we can start working on stuff there now unless anyone objects. Remember 16:45, 7 April 2007 (UTC)
 * I was WP:BOLD myself and proposed an outline procedure. Make any changes and comments as needed, including scrapping the whole thing.  This is just one idea. --Jayron32| talk | contribs  17:51, 7 April 2007 (UTC)
 * Some have talked about a recent changes study, but I think the consensus was that to get a truly random sampling we need to look at randomly selected edits (which would be selected by a number generator and then going straight to that edit). So I have moved your suggestions to below for further comments. Any other feelings? Remember 18:19, 7 April 2007 (UTC)
 * My understanding is that we're going to go forward with the random edit study, not the recent changes version. That said, I'd like to see the scope of this study be massive.  We need to look at far more edits than we will be able to look at ourselves.  I think we should proceed by (1) Figuring out what data we want to look at for each edit.  See the working list above.  (2) Spelling out a step-by-step procedure for examining an edit.  This way we can easily harness the work of new volunteers. (3) Generating the list of random edit numbers that we need data for.  (4) Starting to fill in the data.  I've talked this over with another contributor, so I'm going to make two further specific proposals.  First, that we examine 5000 edits.  That sounds like a huge number, but it will give the study credibility and attention.  That, coupled with a significant amount of information about each edit, will also make this study useful.  Economists dig massive amounts of information, and our study can help provide that (along with whatever our initial conclusions are).  Which leads me to my second proposal, that we collect the data into a comma separated value (CSV) chart.  CSV is a common denominator for spreadsheets and databases and will make it easy to load for sophisticated analysis.  I believe that is the way to add additional value to the study. Martschink 00:29, 8 April 2007 (UTC)
 * I concur. Pick a date range, generate a large set of random edit numbers within that range (after determining what edit #s correspond to that date range). The set should include more than the x edits we want to study, because not all edits will be in article space. Once the measurements are decided on, then decide on a format for recording the results, ensure all contributors understand what their task is (do a test round perhaps), divide the set of random edits among those involved, and off we go. (Still not sure how much I will participate, but thought I'd contribute here.) – Outriggr § 23:39, 20 April 2007 (UTC)
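The sampling step discussed above (a non-repeating list of random edit numbers, with extras to allow for edits outside article space) could be sketched roughly as follows. This is only an illustration in JavaScript: the function name `sampleEditIds`, the edit-number range and the 1.5x oversampling factor are all assumptions, not anything the project has decided.

```javascript
// Hypothetical helper: draw `count` distinct random edit numbers in [minId, maxId].
// Uses simple rejection sampling with a Set, which is fine when `count` is small
// compared with the size of the range.
function sampleEditIds(minId, maxId, count, random = Math.random) {
  const rangeSize = maxId - minId + 1;
  if (count > rangeSize) {
    throw new RangeError('cannot draw more distinct IDs than the range holds');
  }
  const chosen = new Set();
  while (chosen.size < count) {
    chosen.add(minId + Math.floor(random() * rangeSize));
  }
  // Sorted output makes it easier to split the list among volunteers.
  return [...chosen].sort((a, b) => a - b);
}

// Assumed figures, for illustration only.
const wanted = 5000;      // edits we hope to end up studying
const oversample = 1.5;   // extra draws, since not every edit is in article space
const ids = sampleEditIds(100000000, 110000000, Math.ceil(wanted * oversample));
console.log(ids.length, ids.slice(0, 3));
```

The real range would come from looking up which edit numbers correspond to the chosen date range, and the oversampling factor from a test round.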

Coder has volunteered help
I asked for help on the Bot request page and User:Autocracy has graciously volunteered to help so when we figure out what data we want to pull, he may be able to write a bot to make it easier. Remember 14:40, 8 April 2007 (UTC)
 * This looks interesting - if you need anything, I can help as well. ST47 Talk 16:07, 8 April 2007 (UTC)


 * OK, so here are my thoughts from what you've written up so far (using the table above as a reference):
 * I can create a table like the one above as output, and have an extra column at the end for signing off reviews.
 * To help with determining edit type, I can link to the point in the history list where the change is mid-point.
 * I can have a program populate all fields except "Edit type," "Notes on edit," "Revert Date," and "Reverter Name"
 * I can output results into the same format as that wiki table, and we can go from there. Further ideas or comments, anybody? -- Auto ( talk / contribs ) 14:39, 30 April 2007 (UTC)

Possible tools
I thought I would create a section for possible tools we could use to collect the relevant data. Remember 17:56, 9 April 2007 (UTC)
 * Stats on a particular article

Feeling slow today
I must be slow today.... can someone walk me through from the beginning what we would track and how we would actually track it? I am probably just missing something super obvious. Alex Jackl 03:45, 13 April 2007 (UTC)
 * The plan is to generate random numbers in a certain range and that will give us the edit number to investigate. We will then check out that edit and gather information about it.
 * We are currently debating what information to gather about that edit. Martschink believes that we should gather as much useful information about each edit as is possible so we can learn things like whether newer articles are vandalised more or whether new registered users vandalize more, etc.  The debate is taking place above (but no one has really offered any opinions yet).  Feel free to jump in and add your thoughts. Remember 13:07, 13 April 2007 (UTC)
 * I think I'm with Mart, let's get as much information per data point as possible. Measuring something is easy, but knowing what it is you are measuring is the hard part - interpretation seems to be the strong suit of this project and with good reason. The more paints we have to work with the more complete the picture will be in the end. :) JoeSmack Talk 16:01, 14 April 2007 (UTC)
 * Joe, how do you feel about the current table and data points? Remember 16:23, 14 April 2007 (UTC)
 * I think we need to make it look smaller or find a way to fit all the data on one screen and not have it drag off to the right so much (my browser has to sidescroll and I have a widescreen)! Any way to do that? It is a lot of info, and having it all in front of your eyes at once as opposed to having to scroll around would be important. JoeSmack Talk 16:48, 14 April 2007 (UTC)
 * That would be nice, but I don't know how to do it and have so much information in one row. Remember 17:16, 14 April 2007 (UTC)
 * Would it be weird if it was in columns? We could prolly fit like 10 data points wide doing that, and then just make a new table for the next 10. JoeSmack Talk 17:26, 14 April 2007 (UTC)
 * That's an idea but I like the ability to do sortable rows and that wouldn't be an ability with splitting up the table. I have revised the table to try to make it smaller.  Any other suggestions would be welcome. Remember 18:26, 14 April 2007 (UTC)

(undent) I've been bold and done more revisions, as described above; it now fits (at least on my screen). -- John Broughton (♫♫) 21:58, 14 April 2007 (UTC)

just a few ideas.
I saw you guys were doing this, and some things I'd personally thought would be interesting for a study would be: repeat rate of vandalism by a single person (how many warnings within a time, was the user blocked after some time etc etc.) and another thing which interested me from the previous study... if 25% of the reverts are done by "anonymous" users, then how many of those "anonymous" users are "not logged in" users. I think a decent estimation of such a fact can be made with CheckUser??, I'm not sure. The same can be said of the reverse of course: how many of the anonymous vandalism actions are by not logged in users. There are considerable privacy concerns here of course, I'm not sure how to deal with that, but I'm sure it's possible. Good luck with the study ! --TheDJ (talk • contribs • WikiProject Television) 00:27, 16 April 2007 (UTC)


 * The repeat rate of vandalism, and the impact of warnings, is really another separate study in itself.


 * As far as the extent to which reverts by IP editors (or vandalism) might in fact be done by registered editors who simply aren't logged in, that's an interesting question, but it's unclear how to answer it. CheckUser is really the only way to find out, I believe, but (a) that's pretty much limited to checking on sock puppets, because of privacy concerns, and (b) I'm not sure that CheckUser is used the way you think it is - that is, "Here's an IP address, tell me if it matches ANY registered user".  Rather, it's my impression that CheckUser goes something like this: "I think X and Y and Z are really the same person; do the IP addresses match?"  (The difference is critical; we have no idea who an anonymous reverter or vandal might be, so we're really saying "does this IP address match any of the 3.7 million registered users?"  That's a pretty challenging question to answer.)  -- John Broughton  (♫♫) 02:55, 20 April 2007 (UTC)

Clarification
Is the decision to analyze "random edits" in a given time frame, not "recent changes"?

If the goal is a large data set, we need specifically designed, programmed tools to gather data faster. Without that help, 5000 edits is too much for 8 or 9 volunteers. It would be easy to feed in random numbers for the edits to programs which could fill in fields like article name, hyperlink, edit date, user name, +/- 10 edit date, and creation date. Maybe size of edit too. If we work from such a generated data set, we can click the link to each edit, code its edit type and edit notes, and also gather other data that I am not sure programs could gather well because of the judgement involved, such as category count, # links to article and revert information, in a fraction of the time. Also we can put more time into coding atypical edits, for example edits filtered because there are not 10 edits prior. Venado 15:43, 23 April 2007 (UTC)
 * Yes the idea is to do random edits and not recent changes. I totally agree with the second idea that a program would do this much better than we could and we should probably just focus on doing the stuff that humans can do well (like saying what is vandalism and what isn't). Any help from anyone in figuring out how to set this up would be most appreciated. Remember 16:39, 23 April 2007 (UTC)
 * We have two bot coder volunteers up there, perhaps tap them on the shoulder via their talk pages? JoeSmack Talk 17:30, 23 April 2007 (UTC)
 * I did that with one, feel free to tap the other. Remember 17:51, 23 April 2007 (UTC)
 * Finito! :) JoeSmack Talk 18:06, 23 April 2007 (UTC)
 * Yes, if you have a set of edits we can probably get you some information - user, edit summary, time, article name, net difference in size. Do you have a list of edits? I can give you a list of random numbers as oldids, and use that as the bot's input. LMK. ST47 Talk 14:22, 24 April 2007 (UTC)
 * Thank you for doing this. I can develop random number list if needed.  We have not chosen the time frame yet but I can get the list when we do. Venado 17:06, 25 April 2007 (UTC)
 * I would suggest we study a recent discrete time period (e.g., all the edits in 2006 or if that is too big, all the edits from January through March 2007). User ST47, could you just provide an example of what information your bot could provide on a random edit?  For example, could you give a demonstration of all the info your bot can pull for edit number 87309971, which can be found here? Remember 17:22, 26 April 2007 (UTC)

Think big
First, you guys are doing a wonderful job. Results of your first study will be cited in many places, I am sure. Second - I'd like to invite all to General User Survey. Studying vandalism is important. But studying all of Wikipedians would be extremely useful... the project, unfortunately, is stalled due to lack of interest among developers (the question part is mostly done, but nobody has a workable idea to carry out a survey; this being the best one...). --Piotr Konieczny aka Prokonsul Piotrus 07:30, 24 April 2007 (UTC)

New revision to planned study proposed
I was thinking that, now that the article's history displays how large an article is at each point in its history (by stating how many bytes it is in parentheses on the history page), we could limit our study to a time period that has this feature incorporated into the wiki software, and this might add a lot of useful, easily gatherable data to our study. For instance, this would let us know the average size that a vandalized article was at the time of vandalism and how much information was added by a vandal when he vandalizes it. So I was thinking for this next study, we should just do a random sampling of all of the edits from May 1 to May 31, 2007 (the change to the wiki software took place on April 19). We could use the rest of this month to prepare and start the study on June 1, 2007. What do people think of this? Remember 20:46, 6 May 2007 (UTC)


 * There's certainly a strong argument for doing it that way. We'd lose seasonal effects (like April Fools' Day) and we wouldn't be able to show changes over long periods of time, but otherwise I don't think the overall quality of the study would be compromised.  Good plan.  Martschink 15:44, 7 May 2007 (UTC)


 * Yes, I don't see any reason to use older dates that contain less data. So we'd be gathering data for May 1 - May 31 from random edits, and start analyzing data in June? If so, let's focus on gathering data right now, namely how? JoeSmack Talk 16:25, 7 May 2007 (UTC)


 * We would have to wait until May ends so that we could randomly select edits from all of May. Therefore, we can't start the study until after May 31.  But we could get everything ready for the study. Remember 17:02, 7 May 2007 (UTC)


 * Agreed, we'd have to wait until the end of May to gather the data. Do we need to talk about numbers?  Also, do we know how we're going to generate the non-repeating list of random edit numbers?  And I'm still of the opinion that we should do a trial run.  Maybe we should try a 100-edit micro-study of April so we can work any kinks out of the system?  Martschink 19:03, 7 May 2007 (UTC)
 * Test run sounds good to me. As for a random number generator, how about this:

Wait... this is all very silly... I already collect the size of the article at the time of the edit regardless... I shall go ahead and add in the difference in size between the selected revision and the prior one (if it exists) for the 2006 samples which I collected the first 1,000 of last night. -- Auto ( talk / contribs ) 13:15, 4 June 2007 (UTC)

Small script
I made a JavaScript script that did a whole bunch of queries to output these results:
 * page: WIHT
 * rev: 32782407
 * namespace: 0
 * user: NetBot
 * editsum: Robot fixing template calls
 * timestamp: 2005-12-26T18:12:00Z
 * anon: false
 * minor: false
 * pageLength: 1962
 * pageHistoryIndex: 295
 * pageHistoryLeng: 311
 * tenBefore: 2005-09-14T04:14:50Z
 * tenAfter: 2006-04-05T03:55:01Z

This information was received with the function call (referring to this revision), and what this data reveals is the page, the revision number, the namespace number, the user who made the edit, the timestamp of the edit (in UTC), whether the user is an anon, whether the edit is minor, the page length in characters, what number revision the selected one is out of the total ones, the timestamp of the edit ten revisions prior to the queried one, and the timestamp of the edit ten revisions after the queried one. The script was 4310 characters long, and it could do other stuff, depending entirely on what the wikiproject wants in terms of data. Does anyone have any suggestions? Grace notes T § 00:15, 13 May 2007 (UTC)


 * First let me say: wow, this is awesome! Thanks for keeping with this and putting the time behind this code. Some questions: how long does this take to run for each data point? Can it be done for random data points like previously discussed in the date range May 1 - May 31? In terms of suggestions: can we have a PageLengthDif, i.e. how much was added or subtracted for the revision? Would it be too taxing in terms of processing time to capture the timestamp of edits 1-10 before and after the data point? Can it be checked if any of those 1-10 edits after the data point contain 'revert(ing)', 'rv', 'rvt', 'rvv', 'vdl' or 'vandal(ism)'? Would it be possible to check to see if the contribution that was added or subtracted is still present ('jhonny is STUPID') or not present ('"The Grell" is an episode of The Outer Limits television show.') in any of the 1-10 edits after the data point? These latter ones could give us more data on reverting vandalism. Again, thanks for all your hard work Gracenotes! :D JoeSmack Talk 18:18, 13 May 2007 (UTC)
 * The amount of time it takes to get the above data depends on the length of the history, because I create an array whose items are revisions in the page history, including the edit summary, revision number, time stamp, and other information for each revision. So the edit summary checking is doable; I'd assume that we might want to check the number of times rv, rvt, rvv, Undo, etc. appear, but also take note of whether such a string is included in the edit after the examined one.
 * The script is a bit slower than might be expected, because I can only grab the information for 50 revisions at a time. It would be possible to get only the (maximum) twenty-one revisions I needed, but I can't get the length of the page history without counting the number of items in the above-mentioned array.
 * As for the page length (in characters), I get that by getting the wikitext of an article and measuring the length. So to get the difference, I'd have to get the wikitext of two revisions; no problem.
 * By "content added or subtracted", do you mean compared to the current revision of the article, or compared to next revision not by the user? The latter seems to make a bit more sense, I think.
 * Or I could just compare the revisions before and after the user's. This only indicates whether the edit was immediately reverted. As for edits that are reverted in the long term, there would probably be too many false positives/negatives for sensing those. Grace notes T § 00:10, 14 May 2007 (UTC)
 * As for categories mentioned above, I can see how many categories appear in the article by scanning through the wikitext. This does not take care of categories included in templates, although I can get the wikitext of an article with all of the templates fully expanded, but those would be the current revision of the template. Backlinks do not look possible. Hope this helps in deciding which information to include! Grace notes T § 00:07, 14 May 2007 (UTC)
 * This looks awesome! Thanks for all your hard work!  Could you show me how to use this script?Remember 02:00, 14 May 2007 (UTC)
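For anyone following along, the two wikitext-based measurements described above (page length difference and category count) might look something like this. It is a rough sketch, not Gracenotes' actual script: the function names and the sample text are made up for illustration, and fetching the wikitext of the two revisions is left out.

```javascript
// Count [[Category:...]] links written directly in the wikitext.
// Categories that come in through templates are not seen, as noted above,
// and [[:Category:...]] (a plain link to a category) is deliberately not counted.
function countCategories(wikitext) {
  const matches = wikitext.match(/\[\[\s*Category\s*:[^\]]+\]\]/gi);
  return matches ? matches.length : 0;
}

// Difference in length (in characters) between a revision and the one before it.
// Positive means text was added, negative means text was removed.
function pageLengthDiff(previousWikitext, currentWikitext) {
  return currentWikitext.length - previousWikitext.length;
}

// Made-up sample text, for illustration only.
const before = "'''Example''' is a page.\n[[Category:Examples]]";
const after = before + "\njhonny is STUPID";

console.log(countCategories(after));        // 1
console.log(pageLengthDiff(before, after)); // 17
```

Note that a length difference of zero does not prove nothing changed; it only says the two revisions have the same number of characters.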


 * My thanks as well. That is a good go.  If the program catches next-edit reverts, that is a big help, as is recording the revert editor and time of revert.  That would reduce the number of reverts we have to hunt for data on personally. Venado 02:42, 17 May 2007 (UTC)
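As a rough idea of how the edit summary check discussed in this thread could work, here is a small JavaScript sketch. The keyword pattern is an assumption pieced together from the strings suggested above (rv, rvt, rvv, vdl, Undo, revert..., vandal...), and the shape of the `laterEdits` array is hypothetical; a summary match only flags a candidate revert and still needs a human look.

```javascript
// Assumed keyword list, based on the summaries suggested in this thread.
const REVERT_PATTERN = /\b(rv|rvt|rvv|vdl|undo|revert\w*|vandal\w*)\b/i;

function looksLikeRevert(editSummary) {
  return REVERT_PATTERN.test(editSummary);
}

// `laterEdits` is a hypothetical array of up to ten edits following the sampled
// one, oldest first, each with at least { revid, user, timestamp, summary }.
function summarizeReverts(laterEdits) {
  const flagged = laterEdits.filter((edit) => looksLikeRevert(edit.summary));
  return {
    flaggedCount: flagged.length,
    // Was the very next edit flagged? That is the "next edit revert" case.
    nextEditFlagged: laterEdits.length > 0 && looksLikeRevert(laterEdits[0].summary),
    // Who reverted and when, if anything was flagged.
    firstFlagged: flagged.length > 0 ? flagged[0] : null,
  };
}

// Made-up data, for illustration only.
const laterEdits = [
  { revid: 2, user: 'ExampleUser', timestamp: '2007-05-02T10:00:00Z', summary: 'fix typo' },
  { revid: 3, user: 'ExampleUser2', timestamp: '2007-05-02T10:05:00Z', summary: 'rv vandalism' },
];
console.log(summarizeReverts(laterEdits));
```

Reverts with an empty or unrelated summary would be missed, so this can only reduce the manual hunting, not replace it.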

I see that autocracy has done some nifty work below. Meanwhile, though, anyone interested can probably include if (wgPageName == 'Wikipedia:WikiProject_Vandalism_studies/Query') importScript('User:Gracenotes/vandinfo.js'); in their monobook.js, and go to WikiProject Vandalism studies/Query to run the script. (I'll get it to work in IE as soon as possible.) Grace notes T § 19:48, 17 May 2007 (UTC)


 * Man these bot things look awesome. So I will ask for some more magic to be built in to them. You guys are saying you can detect "next edit reverts", right? The major task in this study is to tell if an edit is vandalism or not. If we find a subsequent edit that reverts the text exactly back to the text found before the edit in question, then there are three possibilities (as I see it).
 * 1. The edit was vandalism
 * 2. The edit was reverting vandalism
 * 3. The edit is part of a content dispute
 * To distinguish between these possibilities will require a judgement call by the experimenter. And to make this judgement call they will need to look at the diffs between the edit in question and the edit that did the revert and probably any intermediate edits too. So ...
 * Would it be possible to have the bot search the history after the edit to find any edits that completely revert the edit (not just test if the next edit is a reversion)?
 * and
 * To display links that will bring up the diffs between the edits?
 * Another very useful link to have would be one that displays the diff between the previous edit and the edit in question.
 * Ttguy 22:21, 23 May 2007 (UTC)


 * Example


 * On the Genetically modified food page we have the following series of edits:
 * 131399491 - b4 version in question
 * 132294711 - version in question
 * 132294778 - intermediate version
 * 132294870 - reverting version


 * and let's say our random edit picker chose edit number 132294711.


 * Then it would be good for the bot to come up with a series of links like this...:

Diff Previous -> Edit in Question

Version in Question

Intermediate Edit1

Intermediate Edit2

Reverting edit

Sanity Check (Diff version b4 Edit in Question -> Edit that Reverts - should show no differences)
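To make the request above concrete, here is a hedged JavaScript sketch of the search and the links. It assumes the bot already has the page history as an array of `{ revid, text }` objects, oldest first; that structure and the function names are hypothetical. The revision numbers are the ones from the example, but the texts are placeholders, not the real page content. The links use the ordinary index.php `oldid`/`diff` parameters.

```javascript
// Build a MediaWiki diff link between two revision IDs.
function diffUrl(oldid, newid) {
  return `https://en.wikipedia.org/w/index.php?oldid=${oldid}&diff=${newid}`;
}

// Search the revisions after `revid` for one whose text is identical to the
// revision just before `revid`, i.e. a complete revert of the edit in question.
function findExactRevert(revisions, revid) {
  const i = revisions.findIndex((rev) => rev.revid === revid);
  if (i <= 0) return null; // not found, or no earlier revision to compare with
  const baseline = revisions[i - 1];
  for (let j = i + 1; j < revisions.length; j++) {
    if (revisions[j].text === baseline.text) {
      return {
        before: baseline.revid,
        reverting: revisions[j].revid,
        intermediate: revisions.slice(i + 1, j).map((rev) => rev.revid),
      };
    }
  }
  return null;
}

// Links a reviewer would want when making the judgement call.
function reviewLinks(revisions, revid) {
  const found = findExactRevert(revisions, revid);
  if (!found) return null;
  return {
    previousToEdit: diffUrl(found.before, revid),
    editToRevert: diffUrl(revid, found.reverting),
    sanityCheck: diffUrl(found.before, found.reverting), // should show no differences
    intermediate: found.intermediate,
  };
}

// Revision IDs from the example above; texts are placeholders.
const sampleRevisions = [
  { revid: 131399491, text: 'original' },
  { revid: 132294711, text: 'original plus change' },
  { revid: 132294778, text: 'original plus change, edited again' },
  { revid: 132294870, text: 'original' },
];
console.log(reviewLinks(sampleRevisions, 132294711));
```

Comparing full text catches only exact reverts; a partial revert, or a revert that also fixes something else, would not be found this way.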

Bot Progress
So, finals are over... and I'm finally taking a night to crank through most of this. I spent some time enhancing Perlwikipedia.pm, and now I'm working on writing up a full module. I see Gracenotes already has something in JS, but I'm going to code mine anyway :) Once it is working, I'll hopefully have a DB with 5,000 edits in it. One outstanding question: how do we handle a revid that has been deleted? -- Auto ( talk / contribs ) 01:09, 17 May 2007 (UTC)


 * Good question. The deleted article list is by article name only.  There are many speedy deletions per day leaving little trail.  We can probably get how many WP article edits there are per deletion, and can see the general category of reason given for each in the delete table and do analysis from that.  But we won't know data on the editor responsible or any other data on the article.  Admins with access rights can still see much of that data, but that would be a big hassle for admins if there are many in the random sample.  Question: can we assume deleted edits are associated with a fixed edit number, and that if a deleted edit came up in the sample it would be a #null# hit?  Can we presume non-article edit deletions, #null# hits, would be very rare, so any disappeared edit is probably to an article?  I have seen talk page edits permanently blanked too but I don't know if it is rare. Venado 02:32, 17 May 2007 (UTC)


 * Well, the thing I've run into is I can pick a revid that comes back "bad." There's no way that I'm aware of to track down the article or edit it relates to. Oh, and the bot is working awesomely right now. I have everything but creation, first edit, and number of edits done operating. -- Auto ( talk / contribs ) 03:44, 17 May 2007 (UTC)
 * At this unholy hour, I have now greatly expanded the capability of Perlwikipedia.pm, and made great progress.

Current Status

 * Edit Number: Done
 * Edit Date: Done
 * Edit Type: manual
 * Text Character Change: not implemented, but possible
 * Date of Ten Edits before: Done or last edit (e.g., if only 5 prior, then the 5th)
 * Date of Ten Edits later: Done same as 10 prior


 * Editor Type: not implemented, but possible
 * Editor Name: Done
 * Number of Edits: Done -- this is the number of edits to the editor's credit at the time of the edit, right?
 * That is correct.


 * Date of First Edit: Done -- This is the date the editor did his first edit ever anywhere on the Wiki, right?
 * That is correct.


 * All Reversion Info: Not implemented, intended to be manual
 * Article Size: Done
 * Number of Categories: Request not possible, implemented as current instead
 * Number of Links: Request not possible, implemented as current instead
 * Creation Date: Done follows full history of edits


 * Date of Collection: Done
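The fallback rule for the "ten edits before / later" fields in the status list above (use the furthest edit available when fewer than ten exist) can be sketched like this. It is a JavaScript illustration, not the Perl the bot is written in; the function name, the array of timestamps and the choice to return null when there is no neighbouring edit at all are assumptions.

```javascript
// `timestamps` is a hypothetical array of ISO timestamps for one page's history,
// oldest first; `index` is the position of the sampled edit in that array.
function neighbourDates(timestamps, index, distance = 10) {
  if (index < 0 || index >= timestamps.length) {
    throw new RangeError('index is outside the page history');
  }
  // Clamp to the ends of the history: if only 5 edits came before, use the 5th.
  const beforeIndex = Math.max(0, index - distance);
  const afterIndex = Math.min(timestamps.length - 1, index + distance);
  return {
    before: beforeIndex === index ? null : timestamps[beforeIndex],
    after: afterIndex === index ? null : timestamps[afterIndex],
    editsBefore: index - beforeIndex, // how many edits back we actually went
    editsAfter: afterIndex - index,   // how many edits forward we actually went
  };
}

// Made-up history of 8 daily edits, for illustration only.
const sampleTimestamps = Array.from({ length: 8 }, (_, i) =>
  new Date(Date.UTC(2007, 4, 1 + i)).toISOString()
);
console.log(neighbourDates(sampleTimestamps, 5));
```

Recording `editsBefore` and `editsAfter` alongside the dates lets the analysis tell a true ten-edit span apart from a clamped one.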

Proposed possible additional metrics:
 * date of previous semi-protection of article
 * date of previous un-protection of article
I believe from the above two we can determine if the article was semi-protected or not at the time of the edit. Ttguy 12:19, 23 May 2007 (UTC)

Api.php broken
At the moment, I have to hold off until I can get somebody to fix this. Specifically, whatever I query for revisions always returns the latest one for that page. This is a Wikipedia issue. -- Auto ( talk / contribs ) 18:01, 17 May 2007 (UTC)
 * I coded up a workaround on my end. I have one last outstanding issue that is related to api.php, and that is not being able to query the contrib history for an anonymous IP user. Thoughts please?
 * That's not possible with api.php, I believe. query.php might be more suitable, e.g. here. Grace notes T § 20:30, 17 May 2007 (UTC)

Sample Output
Notes: Some sampled revision numbers are deleted revisions... I decided that sampling those would be valuable, so I keep the data (examples #5, #12). Redirects behave weirdly (#9). I don't have contrib history for IP addresses (#1, #3, #4). I can format in columns to edit later... it is probably best if input for determining whether an edit is vandalism is put into an actual SQL database as we go along to prevent conflicts and allow us to read the data easily. It's probably more likely, though, that I'll just end up breaking everything into 100 sample sections and recombining with a bot. -- Auto ( talk / contribs ) 21:16, 17 May 2007 (UTC)


 * This is great! Finally we're going to have some raw chunks of data to work with! Thanks for all the work on this! :D JoeSmack Talk 16:43, 18 May 2007 (UTC)