Wikipedia talk:Counter-Vandalism Unit/Vandalism studies/Study2

Volunteers section
Please place your name here if you are willing to help us gather or analyze data for our next study:
 * Remember 12:38, 28 March 2007 (UTC)
 * Jonathan Stokes 16:56, 28 March 2007 (UTC)
 * JoeSmack Talk 02:26, 29 March 2007 (UTC)
 * Xiner Xiner (talk, email) 02:56, 29 March 2007 (UTC)
 * --Martschink 04:25, 30 March 2007 (UTC)
 * shoy 16:59, 30 March 2007 (UTC)
 * Alex Jackl 03:42, 13 April 2007 (UTC)
 * Venado 15:47, 21 April 2007 (UTC)
 * Ttguy 14:28, 29 May 2007 (UTC)
 * Leujohn 09:23, 26 November 2008 (UTC)

Data to collect for random edit study
I've started this section to put together the possible data to collect for each given edit. This assumes we do a random edit study. What data can we collect to help make the study a robust source of data? With more data, we can possibly add more to the sum of knowledge about vandalism beyond just the anonymous/registered distinction. Some of these have been mentioned above, but I thought this was important enough to collect together under its own section. And this is far from complete. These aren't even specific proposals, just thoughts. Dig it? Please feel free to add. For each edit: Remember 23:51, 1 April 2007 (UTC)
 * article length
 * number of wikipedia links to the article
 * number of edits to the article within 7 days (or some time period) of the edit
 * time between chosen edit and immediately preceding and following edit
 * number of categories the article falls under
 * edit date and time
 * article age
 * was article protected within previous N days?
 * was article protected within subsequent N days? Martschink 22:57, 1 April 2007 (UTC)
 * how long it took to revert the article
 * who reverted the article

Data to Collect (Redux)
Drawing on the above list, here is a more official proposal of the data the study should collect for each edit, along with an explanation of why it might be worth collecting, as well as challenges that may need to be overcome to collect it. Please keep in mind that for every item of data, we're looking for whatever it was at the time of the edit, not as it is now. Generally speaking, the information will fall into five categories: (1) Edit, (2) Editor, (3) Reversion, (4) Article, and (5) Housekeeping. For purposes of aesthetics, it might be preferable to move where the data are positioned in the collection table relative to each other, but that can be addressed later.

EDIT
 * Edit Number. The randomly chosen edit number to which all of the following data relates.
 * Edit Date. The date and time the edit was made. We need to use a standard clock for this, such as GMT, and I'm not sure if this is going to be a problem.
 * Edit Type. This is the data to indicate whether the edit is vandalism and, if so, what type:
 * NV Not Vandalism
 * OV Obvious Vandalism
 * POV Point of View
 * DV Deletion Vandalism
 * LS Linkspam
 * Text Character Change. This is intended to reflect how much the edit changed the article. I know the recent changes link gives a number for each recent change that indicates how much data was added or removed from the article.  I'm not sure how easily we can do the same when we're looking at randomly chosen edits.  Any ideas?  Can this be done by a bot?  Also, should it measure text characters or bytes?  What is the easiest and most meaningful?
 * Date of Ten Edits Before. This and the next point are meant to give a rough measure of the popularity of the article (so maybe it really belongs under the article category). Essentially, if you're gathering data for edit #46, this asks for the date and time of edit #36, thus providing a measure of how long it took for the article to be edited 10 times.
 * Date of Ten Edits Later. Just like the previous, except it would be the date and time of edit # 56. This will also be interesting for comparing reversion and edit rate.
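
The two "ten edits" fields above amount to a lookup in the article's revision history. Here is a minimal sketch, assuming the history has already been pulled into a chronologically sorted list of UTC timestamps; the function name `ten_edit_window` and the data layout are hypothetical, not part of any tool proposed here:

```python
from datetime import datetime, timedelta

def ten_edit_window(timestamps, index, span=10):
    """Return (time the `span` edits before took, time the `span` edits after took).

    `timestamps` is a chronologically sorted list of datetimes for one
    article's edits; `index` is the position of the sampled edit.
    A side is None when the article has fewer than `span` edits on that side.
    """
    before = timestamps[index] - timestamps[index - span] if index - span >= 0 else None
    after = timestamps[index + span] - timestamps[index] if index + span < len(timestamps) else None
    return before, after

# Made-up history for illustration: one edit every 6 hours.
start = datetime(2006, 1, 1)
history = [start + timedelta(hours=6 * i) for i in range(60)]

# Sampled edit #46: compares against edit #36 and edit #56.
before, after = ten_edit_window(history, 46)
print(before, after)
```

Returning None for very young or rarely edited articles keeps those cases visible in the dataset rather than silently filling in a value.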

EDITOR INFO
 * Editor Type. Whether the person who made the edit being studied was anonymous or registered. There has been a suggestion that we should further differentiate this point, and I tend to agree.  Any suggestions as to how?
 * Editor Login Name.
 * Number of Edits. This should be the number of edits the editor has made at the time of the edit in question. This is not the user's current number of edits.  I'm not sure what the best way to collect this is.  Is this a bot job?
 * Date of First Edit. This indicates how long the editor has been involved on our fine encyclopedia. It shouldn't be too hard to collect.  This combined with the previous data point should provide somewhat interesting info about the editor's average rate of edits up to that point.
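
As a rough illustration of how Number of Edits and Date of First Edit combine into an average edit rate, here is a small sketch. The function name `average_edit_rate` and the choice of edits per day as the unit are assumptions for the example:

```python
from datetime import datetime

def average_edit_rate(edit_count, first_edit, sampled_edit):
    """Average edits per day between the editor's first edit and the sampled edit.

    Returns None when both edits share the same timestamp (for example,
    when the sampled edit is the editor's first), since no rate is defined.
    """
    days = (sampled_edit - first_edit).total_seconds() / 86400
    return edit_count / days if days > 0 else None

# Made-up editor: 100 edits in the 10 days before the sampled edit.
rate = average_edit_rate(100, datetime(2006, 1, 1), datetime(2006, 1, 11))
print(rate)
```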

REVERSION INFO
 * Reversion Edit Number. The edit number for the reversion.
 * Reversion Date. Date and time of the reversion.
 * Reverter Type. Registered, Anonymous, Bot.  Should we further subdivide?
 * Reverter Login Name.

ARTICLE
 * Length/Size. What we need here is some sort of indication of the length or size of the article at the time of the edit in question. Does anyone know how that can be easily figured out?  This would be a very useful piece of information.
 * Number of Categories. This is the number of categories the article fell under at the time of the edit being examined (not current number of categories).
 * Number of Links to. The number of wikipedia links to the article at the time of the edit.  This would be a very important piece of information, but I'm not sure how to gather it.  The "What Links Here" button doesn't work because under our random edit method, that just shows what links to the index page.  And we can't do it by looking at a current version of the article, because what we want to know was how linked in was the article at the time of the edit/vandalism.  Does anyone know how we can do this?
 * Creation Date. The date and time the article was created.  Merges and moved articles may make this more difficult than it would seem at first.

HOUSEKEEPING Martschink 04:53, 9 April 2007 (UTC)
 * Login of Data Collector. The login of the wonderful person who collected the info.
 * Date of Collection. Date and time the data was collected.

Proposed range of edits to study
I propose that we use all of 2006 as the pool of edits to examine. So we would examine Edit numbers between the dates of January 1, 2006 (after edit number 33428086) and December 31, 2006 (before edit number 97631237). This would give us a total of 64,203,151 to draw from (97,631,237-33,428,086 = 64,203,151). What do others think? Is this too big? Remember 12:34, 30 April 2007 (UTC)
 * It depends on how many data points we plan on collecting, of course. :} shoy  12:41, 30 April 2007 (UTC)
 * I was thinking anywhere from 1000-5000. Statistically, how many data points would you need to analyze 64,203,151 edits? Remember 14:04, 30 April 2007 (UTC)
 * Is there any restriction on namespace proposed? Figuring out how many data points we need is something I can only give an idea of if we know what the final result we want confidence in is. Can we identify a stats geek somewhere? :) -- Auto ( talk / contribs ) 14:43, 30 April 2007 (UTC)
 * The stats geek (okay, engineer) cometh! 95% confidence is standard. If you want to test your earlier hypothesis that 4.6% of contributions are vandalism with 95% confidence, here's some numbers on how good your resolution will be vs. sample size:

Size  Power  Proportion 1  Proportion 1
1000  0.95   0.0859615     0.0177271
2000  0.95   0.0729399     0.0249319
3000  0.95   0.0675262     0.0283945
4000  0.95   0.0644013     0.0305415
5000  0.95   0.0623124     0.0320430
 * (Thanks Minitab.) shoy  16:59, 30 April 2007 (UTC)


 * Thanks. Here is an online sample size calculator. Do we know, though, how many of the 64,000,000 edits were article edits? Recent changes snapshots look like about 80% of edits are to articles plus lists, so with a population this size the number specific to articles probably doesn't matter; it won't change the sample size requirement. Vandalism study 1 had just 700 edits in its sample, which was enough to estimate the percent of total edits that were vandalism to within +-3 percent. But if we want the dataset to be more powerful for analyzing certain sets of edits, like differentiating them by the length of time the vandalizing editor has been contributing, that sample size might not be big enough. Venado 01:50, 2 May 2007 (UTC)
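
To make the thread above concrete, here is a minimal sketch of drawing random edit numbers from the proposed 2006 range and of how the margin of error shrinks with sample size. It uses the textbook normal approximation for a proportion, which is not the same calculation as the Minitab power figures quoted above. The function name, the seed, and the sample sizes are assumptions for illustration:

```python
import math
import random

FIRST_ID = 33_428_086   # last edit before 1 January 2006, per the proposal
LAST_ID = 97_631_237    # first edit after 31 December 2006, per the proposal
POPULATION = LAST_ID - FIRST_ID  # 64,203,151, as computed in the proposal

def margin_of_error(n, p=0.5, z=1.96, population=None):
    """Half-width of a normal-approximation 95% interval for a proportion.

    p=0.5 is the worst case; pass an expected rate (e.g. 0.046) to tighten it.
    """
    se = math.sqrt(p * (1 - p) / n)
    if population is not None:
        # Finite population correction; negligible when n << population.
        se *= math.sqrt((population - n) / (population - 1))
    return z * se

# Draw 1000 distinct edit numbers strictly between the two boundary edits.
rng = random.Random(2006)  # arbitrary seed so the draw is repeatable
sample_ids = rng.sample(range(FIRST_ID + 1, LAST_ID), 1000)

for n in (700, 1000, 2000, 5000):
    print(n, round(margin_of_error(n), 4), round(margin_of_error(n, p=0.046), 4))
```

With tens of millions of edits in the pool, the finite population correction changes almost nothing, which matches the point that the exact population size does not drive the sample size requirement.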

Analysis of data
Can we list the analysis we want to perform in one place? It can help to make sure we gather the critical data to do the analysis. Also, doing preliminary tests of the analysis will help to gauge whether the sample size is sufficient for each issue measured and tell us if more data needs to be gathered. Some I would add:
 * Assess the portion of editors using bots against vandalism (to see what kinds and number of vandalism edits aren't fixed with bots yet; also, in theory, an automated bot could fix some kinds without any user intervention)
 * Assess the proportion of vandal editors who become contributors later (probably limited because we can't tie later registered accounts to the original IPs)
 * Categories of articles which are most vandalized, for example political or celebrity. Venado 17:13, 2 May 2007 (UTC)

Missing a hypothesis and another proposed metric to collect
I note from the project page that we are missing a hypothesis for this study. I have become interested in this topic because of discussions about the pros and cons of semi-protection and the inconsistent enforcement policy on semi-protection - see Protection Policy talk.

I have had it put to me by Kusma that we should not overuse semi-protection because "semi-protection encourages throw-away accounts". So I think a hypothesis that could be worth considering is "does semi-protection of a page encourage throw-away accounts".

Now it would be pretty hard to answer this question directly. However, what you could find out is - "when a page changes status from non-protected to semi-protected does the "editor age" (time editor has been a wikipedian) for the vandal editor go down?". i.e. does semi-protection encourage the vandal to register so she can start vandalising the page again?

To do this we could look at the editor ages of registered vandal edits during periods of semi-protection and compare them to the same metric measured during periods of non-protection.

We might be able to measure this with the data that we proposed to capture in this study - we would need to have "editor age" which I think we might already be proposing to capture. But we would also need time since last semi-protection of this article at the time of the edit in question and we would want to know if semi-protection was in place on the article when it was being edited.

I am wondering, though, whether, for this sort of question at least, we should be concentrating on specific randomly chosen pages rather than just randomly chosen edits. It really depends on what the hypothesis is - which is why we should have one. If a large part of the data is gathered automatically and we were doing 5000 random edits, could we instead do 500 random pages and randomly choose 10 edits on these pages within the 1 month window we were considering? I just feel this might be better in some ways because you are guaranteed to have some sort of history of the pages in question - otherwise it is just random as to whether you hit the same page twice with the data sampling. One of our stats gurus might have some thoughts on this.

Just a few perhaps not so coherent thoughts.

Ttguy 12:13, 23 May 2007 (UTC)
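
The "500 random pages, 10 random edits each" idea in the comment above is a two-stage sample. Here is a minimal sketch, assuming the edits within the study window are already grouped by page in a dictionary; the function name `two_stage_sample` and the toy data are hypothetical:

```python
import random

def two_stage_sample(edits_by_page, n_pages=500, edits_per_page=10, seed=None):
    """Pick pages at random, then pick edits at random within each chosen page.

    `edits_by_page` maps a page title to the list of edit numbers made to
    that page inside the study window. Pages with fewer than
    `edits_per_page` edits contribute all the edits they have.
    """
    rng = random.Random(seed)
    pages = rng.sample(sorted(edits_by_page), min(n_pages, len(edits_by_page)))
    sample = {}
    for page in pages:
        edits = edits_by_page[page]
        sample[page] = rng.sample(edits, min(edits_per_page, len(edits)))
    return sample

# Toy data for illustration: 20 pages with between 3 and 22 edits each.
toy = {f"Page {i}": list(range(i * 100, i * 100 + 3 + i)) for i in range(20)}
sample = two_stage_sample(toy, n_pages=5, edits_per_page=10, seed=1)
for page, edits in sample.items():
    print(page, len(edits))
```

One thing to keep in mind with this design: every chosen page contributes up to the same number of edits, so edits on quiet pages are over-represented compared with a simple random sample of edits, and the analysis would probably need to weight for that.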


 * The hypotheses are essential to data collection: what gets collected from each article, which articles are selected, etc. The big finding of Study1 of course was that most vandalism is by un- or newly registered users.  I believe that a very interesting hypothesis is this:
 * High vandalism reduces productive Edits (those that are not made simply to revert bad Edits).

Productive Edits are an easier thing to measure than article quality improvement, but they are intended to measure the rate of article improvement. If the hypothesis is supported by the evidence, it would suggest that the way to improve article quality is by stopping vandalism. --Thomasmeeks 00:14, 26 May 2007 (UTC)

Article quality improvement could also be measured directly by comparing the vandalism rates of articles that made Featured Article (FA) status with those of a control group of articles that did not. --Thomasmeeks 01:07, 26 May 2007 (UTC)

Splitting up work
I have all the data collected (right now, 1000 edits... I can pull about 2,000 a night) in an sql database. I've posted the first 100 to WikiProject Vandalism studies/Study2/Dataset 1 as a sortable wikitable broken up into sections of 10.

The upside to this is it can be edited by section to prevent conflicts and make it easier to keep from being lost in the little edit box. The downside is that it makes having a sortable table almost useless.

I'm just fishing for thoughts on how we should break this up. The most technically optimal method would be to have a DB on a server that everyone can get into and edit one row at a time. Is it perhaps best to create a web interface for this project? -- Auto ( talk / contribs ) 00:04, 5 June 2007 (UTC)
 * I did the ultimate test and tried editing the table manually to update... hell no. I requested an account on the toolserver, and I'll write up a quick & dirty php form when I get access. Suddenly makes me wish WP had an OpenID kind of system... -- Auto ( talk / contribs ) 16:32, 5 June 2007 (UTC)
 * Sounds good to me. Set it up. Remember 16:40, 5 June 2007 (UTC)
 * Yeah, the last thing we want to do is make this process needlessly painful. Thanks for all the hard work Autocracy. JoeSmack Talk 16:46, 5 June 2007 (UTC)
 * This is great - I think the sections can be a lot bigger than 10 if that is easier. These first 100 are for edits in 2006.  Aren't we looking at May 2007 in this study?  Should the dataset be filtered before it is split to leave the main article edits?  It might be best to have the whole set ready and do some numbers like number of talk edits, user page edits, user talk edits, image edits, things like that which can be done without coding on our part.  Then we can separate groups of just article edits for further evaluation. Venado 20:40, 7 June 2007 (UTC)
 * We were looking at May 2007 because some people thought a change in the Wikimedia software would provide us with more data, but that change is superficial, and we could retrieve it for the older data. When I get on the toolserver, I'll write up a PHP interface for us to process the data. Autocracy 20:50, 15 June 2007 (UTC)
 * Thanks. I would think this didn't impact the required sample size we estimated earlier.  With the unfiltered sections set up earlier I was easily able to work 50 edits or so in an hour even though I was going slow and doing it just to help brainstorm ideas. Venado 23:48, 15 June 2007 (UTC)

statistician
Hello, I helped with the previous study after you had collected the data and were attempting to analyze it. Just an FYI, it's often said (by statisticians) that we could do more good before you start than after you conclude data collection. I've just glanced at the page as is and noticed you've given some thought to power, but there are other things that could help you (e.g. methods of increasing the power, help with conclusions). If you are interested, I'll join the project and try to help. Pdbailey 21:13, 10 October 2007 (UTC)
 * Any help would be appreciated, but right now it looks like this project is dead. It needs a champion to get it going again and I don't have the time to do it this time. Remember 21:27, 10 October 2007 (UTC)
 * I too would really like to see this project get going again, but I'm stuck working on my MSW for the next two years (i.e. barely enough time to sleep). It isn't hopeless though! Do you know anyone else who'd like to take it up too? JoeSmack Talk 00:08, 11 October 2007 (UTC)
 * I'm sorry the project is in trouble. I wasn't paying attention, but it looks to me like one possibility is that the project was too fragmented -- not enough focus on particular questions. Perhaps it would help to focus on questions that could be readily answered with fewer variables per record, and fewer records collected. In particular, the following may be efficient:
 * identify all users who were active in 2006, and the number of edits they have
 * sample editors so that editors with more edits are less likely to be selected
 * sample edits from each editor
 * This could make the survey much faster than simply doing a completely random sample. However, step one may not be feasible. It also only has more power if editors with more edits are less likely to be vandals--but it would be possible to start down this path and see if it's true. There may also be some way to approximate the above.
 * But this is neither here nor there if there isn't a great question to answer. Having a great question would also motivate people to participate. Pdbailey 01:10, 11 October 2007 (UTC)
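
The three steps suggested above can be sketched as a weighted draw. This is only a sketch under stated assumptions: the comment says editors with more edits should be "less likely" to be selected but does not say by how much, so the 1/edit-count weighting, the function name `sample_editors`, and the editor names are all made up for illustration:

```python
import random

def sample_editors(edit_counts, k, seed=None):
    """Draw k editors, with selection weight inversely proportional to edit count.

    `edit_counts` maps editor name to number of edits in the period
    (every listed editor is assumed to have at least one edit).
    random.choices draws with replacement, so an editor can appear twice.
    """
    rng = random.Random(seed)
    names = list(edit_counts)
    weights = [1 / edit_counts[name] for name in names]  # assumed weighting
    return rng.choices(names, weights=weights, k=k)

# Hypothetical editors: one with a single edit, one with a thousand.
counts = {"OneOffEditor": 1, "HeavyEditor": 1000}
picks = sample_editors(counts, k=1000, seed=42)
print(picks.count("OneOffEditor"), picks.count("HeavyEditor"))
```

Edits would then be sampled from each selected editor's contributions. Whatever weighting is chosen, it has to be recorded so the results can be re-weighted back to the full population of edits.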
 * Maybe it was too fragmented: really, I think it was an effort to meet people's motivation to help out. I like your questions though. Having recently stumbled upon User:Dragons_flight's work on edit analysis of the last 6 months (a sample of 118793 articles!), I think he might have a unique perspective on this approach. Tap him on the shoulder? JoeSmack Talk 02:35, 11 October 2007 (UTC)
 * I'd suggest holding off on that. And I'm sorry if I'm being obtuse, but what question, in particular, is of interest? What sort of answers would be interesting? Who cares about the answers? The questions are not designed to be offensive or put you on the defensive, but they are great questions to be able to answer before you embark on data collection. Pdbailey 19:50, 11 October 2007 (UTC)
 * The main question I have always wanted to answer is: what is the rate of vandalism on wikipedia, is it going up or down or stable and how long does it take for vandalism to be reverted. But that is just my opinion. Remember 13:03, 12 October 2007 (UTC)
 * Okay, that's a great question. In particular, your rate of change question is a very good (and interesting) question. I'd like to try to focus the rate question somewhat if you don't mind. If you do, please just ignore the rest of my comment. What does the rate mean? I ask this because there are several more specific questions that could be nested in your question that I think are interesting. Examples include, what is the probability that a page I'm looking at is vandalized, what is the probability that an edit is vandalism. Pdbailey 22:20, 12 October 2007 (UTC)