Wikipedia talk:Pending changes/Archive (Metrics)

=Archives from /Metrics=

Update coming soon
Hi everyone - just to let you know that we have an update to this page coming soon. We'll shortly be publishing statistics outlining per-page metrics on revisions under Pending Changes. Nimish Gautam and Devin Finzer (Devin is an intern working for the Wikimedia Foundation this summer) are working on the statistics.

Here are the per-article stats they are gathering:
 * Time under pc (secs)
 * Explicit approves
 * Explicit unapproves
 * Nonreverted anonymous under Pending
 * Reverted anonymous under Pending
 * Nonreverted loggedin under Pending
 * Reverted loggedin under Pending
 * Nonreverted anonymous not under Pending (e.g. old history)
 * Reverted anonymous not under Pending
 * Nonreverted loggedin not under Pending
 * Reverted loggedin not under Pending
 * Total anon edits while under Pending
 * Total explicit approves of anon

The output format they're producing is a .csv file. I'm not sure what the preferred way of publishing that sort of thing here is. Ideas? -- RobLa (talk) 17:22, 28 July 2010 (UTC)
 * A sortable table.
 * How are the stats being collected? Will the methodology be published? --MZMcBride (talk) 17:34, 28 July 2010 (UTC)
 * Hey there. The basic methodology for the recent stats collection is the following: We compiled a list of pending changes articles using the page log data dump (each time an article is put under pending changes, a "config" log action shows up in the page logs). We exported the revision history (from the last two months) for each of those articles and then parsed each one to obtain a .csv file with revision time, edit size, net size change, contributor id, etc. We also included whether certain edits were being reverted, using an algorithm based on MD5 hashes of the article text (let me know if you have questions about that). Because revision "acceptance" shows up in the page logs, we also had to parse the page log data snapshot at the same time and include that data in the revision history as well. We then compiled all the article .csv data into a master .csv document which includes the above statistics for each article. We'll publish a formal outline of the methodology, but let me know if you have any questions. Currently, all the scripts for our parsers (written in PHP) are available in the Mediawiki code repository under tools/analysis: http://svn.wikimedia.org/viewvc/mediawiki/trunk/tools/analysis/ ---Dfinzer (talk) 20:57, 28 July 2010 (UTC)
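The MD5-based revert detection Dfinzer mentions can be sketched roughly as follows. This is a minimal illustrative Python sketch, not the actual PHP parsers in tools/analysis; the function name and data shapes are assumptions. The core idea is that if a revision's full-text hash matches an earlier revision's hash, the page was restored to that earlier state, so everything in between counts as reverted.

```python
import hashlib

def find_reverted_revisions(revisions):
    """revisions: list of (rev_id, text) tuples in chronological order.

    A revision whose text hashes identically to an earlier revision is
    treated as a revert back to that state; all revisions strictly between
    the two are marked as reverted. Returns the set of reverted rev_ids.
    """
    seen = {}           # md5 hex digest -> index of first revision with that text
    reverted = set()
    for i, (rev_id, text) in enumerate(revisions):
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            # everything between the matched state and this revert is reverted
            for j in range(seen[digest] + 1, i):
                reverted.add(revisions[j][0])
        else:
            seen[digest] = i
    return reverted
```

For example, a history of ("alpha", "vandalism", "alpha") would mark the middle revision as reverted. Hashing the full text rather than diffing makes the check cheap, at the cost of only catching exact restorations.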
 * Thank you for the quick and detailed reply. I'm very familiar with MediaWiki's internal database structure and its software development process. I saw you commit your USERINFO file yesterday actually, but had no idea who you were. :-) (Perhaps add a note to your user page at mediawiki.org.) As I indicated above, a sortable table on a wiki page is the ideal presentation format for the final results, in my view. Users shouldn't be expected or required to read through .csv files in order to see the results. Though, a "top sheet" approach might be the best of both worlds; i.e., a final presentation on a wiki page that outlines everything and links to the raw data and a detailed explanation of the methodology used. --MZMcBride (talk) 23:42, 28 July 2010 (UTC)
 * Hi everyone, Devin created a csv file today, and I posted it to Bugzilla here: bug <s>24562</s> 24596. I've gotta run now, but I should be able to put this out in more usable form tomorrow if someone doesn't beat me to it. -- RobLa (talk) 01:48, 30 July 2010 (UTC)
 * I think you mean 24596. --MZMcBride (talk) 02:03, 30 July 2010 (UTC)
 * Oops...yup, thanks for the catch. -- RobLa (talk) 19:48, 30 July 2010 (UTC)

What is the meta-information, such as (but not limited to) start time and end time for each of the statistics? BrainMarble (talk) 19:33, 28 July 2010 (UTC)

Metrics collected so far
Here's a list of the metrics we've collected so far:
 * Per-page anonymous edit quality - table which breaks down per-article quality metrics
 * Per-page full stats - table which has per-article quality metrics. Superset of the anonymous edit quality table, but includes metrics which aren't quite ready for prime time. This table should be updated, hopefully before August 10, 2010
 * Special:ValidationStatistics - Standard statistics page built into the FlaggedRevs plugin

These links are now on Pending changes/Metrics as well. -- RobLa-WMF (talk) 01:41, 4 August 2010 (UTC)

Commentary on metrics
Metrics, like statistics, are most useful when they pertain directly to the goal of the study. The statistics listed so far (above by RobLa on 17:22, 28 July 2010; and by RobLa-WMF on 01:41, 4 August 2010) are descriptive statistics, useful in themselves for giving a sense of the environment or scope of the study.

However, during the end-of-trial evaluation of the pending changes policy, we would likely be interested first in how the metrics reveal answers to the following two questions:
 * How effective was the policy toward achieving its purpose?
 * How efficient were the trial policy's methods?

Both "how" questions imply a comparison between actions during the trial period and actions prior to the trial period. The trial period is about two months, overall. The pre-trial period used for comparison should be long enough to account for cyclic trends, perhaps up to a year before the start of the trial. (So far, none of the metrics appear to account for cyclic trends before the trial period.)

The duration of time intervals affects the quality of the evaluation. Shorter durations (hours, days) yield more detail but require greater effort. Longer durations (weeks, months) require less effort but yield less detail. Experimentation will help determine which duration to use in the evaluation. (So far, there appear to be only two time intervals in the metrics: "under pending" and "not under pending".)

The effectiveness question involves counts of groups of editing actions per time interval throughout both pre-trial and trial periods; reduction of the influence of cyclic trends; and comparison ratios for each of the groups of editing actions. (So far, only a set of raw counts appear in the metrics.)

The efficiency question implies calculating the time delay involved in taking corrective editing actions (such as revert, undo, unaccept) for the same time intervals as in the effectiveness question; and comparison ratios to reveal an increase or decrease in efficiency over time. (So far, none of the metrics are capable of revealing efficiency.) BrainMarble (talk) 18:55, 5 August 2010 (UTC)


 * Thanks for the feedback, BrainMarble. Suggestions on which exact metrics you'd like would be greatly appreciated.  Depending on the request, we might be able to accommodate it.  That said, a decision on whether to keep the feature around will probably need to be based on incomplete information.  Also, it's entirely possible to keep the feature for a while longer, but continue to use it in a limited fashion as everyone gets a chance to observe and collect data on it (e.g. keep the 2000 article limit that is still in place). -- RobLa-WMF (talk) 20:42, 5 August 2010 (UTC)


 * Hello, RobLa-WMF, and thanks for the reply. I had written (offline, for my own consumption) about some specific metrics in early July, and am including them now. I haven't been following changes in terminology or expansions in categories, so please allow for any dated terms that appear.


 * I would propose using a time interval one day in duration, throughout both pre-trial and trial periods. A one day interval duration should be easier to process than hours or weeks, and finely grained enough to depict cyclic trends. I would propose setting the beginning of the pre-trial period to be one year prior to the beginning of the trial period, principally to account for cyclic trends.


 * While writing in early July, I assumed the set of data to evaluate would be contained in the set of revision histories of the Wikipedia pages involved in the trial policy. We would select only the acts in the revision histories that took place during the pre-trial period and the trial period. We would be interested in acts including: saved edit; revert; undo; accept; unaccept. We would also be interested in the time of each act. We would count each of these acts per time interval, through both the pre-trial period and the trial period.


 * We would also calculate the time delay involved in taking corrective actions (revert, undo, unaccept). For each revert and undo, we would measure the amount of time back to the specific edit that that act reverted or undid. For each unaccept, we would likewise measure the amount of time back to the edit that that act did not accept. We would then group all these measurements into the time intervals in both the pre-trial period and the trial period. We would use the time of the corrective act to determine into which time interval to place the measurement. We would state these corrective-action time delay measurements in seconds.


 * As a quality check, the number of time delay measurements given for each time interval should equal the total count of corrective actions for that time interval.


 * The revision history act counts and the corrective action time delay measurements become the basis for comparisons, both within each time interval and between the pre-trial period and the trial period. These comparisons would help answer the evaluation questions.


 * When I wrote in early July, I assumed that revert, undo, accept, and unaccept are distinct, and are not also recorded as saved edit. I also assumed the meanings of the "act" terms such as "revert", "undo", "accept", "unaccept". I'm not sure, now, whether the assumptions hold true.


 * Comparisons for each time interval:
 * Give the ratio of the count of (revert + undo) to the count of (saved edit + accept). This approximates the portion of corrective actions not due to the pending changes trial policy.
 * Give the ratio of the count of unaccept to the count of (saved edit + accept). This approximates the portion of corrective actions due to the pending changes trial policy.
 * Calculate the difference between the first ratio and the second ratio. This difference indicates the relative influence between the corrective actions during that time interval.
 * Give the distribution of time delay measurements. For each distribution, give both the mean and the dispersion (e.g., standard deviation) of measurements.
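The per-interval ratios and their difference, as listed above, could be computed like this (a hedged sketch under the assumption that per-interval action counts are available as a simple mapping; the key names are illustrative, not from any actual metrics script):

```python
def interval_comparison(counts):
    """counts: per-interval action counts, e.g.
    {"saved_edit": 8, "accept": 2, "revert": 2, "undo": 1, "unaccept": 1}.

    Returns (non_pc_ratio, pc_ratio, difference) for the interval.
    """
    base = counts["saved_edit"] + counts["accept"]
    # portion of corrective actions not due to the pending changes policy
    non_pc_ratio = (counts["revert"] + counts["undo"]) / base
    # portion of corrective actions due to the pending changes policy
    pc_ratio = counts["unaccept"] / base
    return non_pc_ratio, pc_ratio, non_pc_ratio - pc_ratio
```

With the example counts in the docstring, the ratios would be 0.3 and 0.1, with a difference of 0.2, indicating more influence from corrective actions not due to the trial policy in that interval.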


 * Comparisons between periods:
 * Construct a diagram of the ratios of corrective actions per time interval, for the total (pre-trial + trial) period.
 * Construct a timeline of the ratio differences. Positive values over time indicate more influence by corrective actions not due to the trial policy. Negative values over time indicate more influence by corrective actions due to the trial policy.
 * Construct a timeline of the time delay measurement distributions, both mean and dispersion. The behavior of the two lines would indicate the efficiency of the methods involved.


 * I apologize if this sounds a bit cryptic. BrainMarble (talk) 02:47, 6 August 2010 (UTC)

Some raw data
Hi everyone, Devin provided the following raw data before leaving WMF for the summer:


 * Individual csvs per article
 * One big csv

I'm traveling right now, so I'm not in a good spot to provide a lot of commentary on what is in these files, but the gist is that they contain every edit to every article while it was under Pending Changes. I hope these are useful for anyone who wants to do some additional number crunching. -- RobLa-WMF (talk) 05:22, 15 August 2010 (UTC)
 * You had to go on vacation during the PC trial stop/continue discussion... Any way that we can get a summary of statistics (I mean, do you know anyone comfortable pulling all the numbers together)?  It'd be very useful over at Pending_changes/Closure Ocaasi (talk) 20:13, 19 August 2010 (UTC)
 * The summary of the data above is at Pending changes/Metrics. What other summary are you looking for? -- RobLa-WMF (talk) 21:54, 19 August 2010 (UTC)
 * Well, there seems to be a desire at the closure discussion for someone to explain what the statistics mean. You know, feed it to us with a small spoon. I checked out the data tables and can see some trends, but it might be good to have the analysis broken down by what kind of page it was (high traffic/low traffic), whether it was a BLP, or a sock-puppet page, or just a regular vandalism target, etc. I'm not at all suggesting you should do this, but I think there is some room for a summary of the summary, if you will. More like the "conclusions" section of the paper than the editor's introduction, which has been well-provided. Is there something I'm missing? Ocaasi (talk) 03:17, 20 August 2010 (UTC)
 * Or..., if you'd prefer a different vacation-busting task, is there someone who can take a look at the list of problems/feature requests on the PC Pending_changes/Closure working summary and give some estimates of whether-they-can/how-long-it-would-take-to fix/implement them? I'm not sure who else to ask and you seem to have your hands in the developer pot. Ocaasi (talk) 12:50, 20 August 2010 (UTC)
 * I posted some preliminary analysis. It contains different cuts of data to help folks focus in on particular sets of articles.  It's not exactly spoon-feeding, but hopefully it provides a view for users that don't want to go through the whole data set.  I think there is some pretty interesting stuff and I'd love to get feedback on the content dimension of the analysis.  The analysis is still a draft, so please feel free to add/comment. Howief (talk) 00:09, 21 August 2010 (UTC)
 * Thanks. I think whatever happens with the trial (I'm assuming there will be moderate support to keep it going), these statistics will be really important to evaluating the feature going forward, and any help with it would be really useful, especially as a (semi) organized tracking effort combined with some agreed upon performance metrics.  Ocaasi (talk) 20:51, 22 August 2010 (UTC)

=Archives from Metrics/Preliminary Analysis=

Comments
Sorry if this analysis is rough, but I am heading on vacation for 2 weeks, so I do not have time to provide additional analysis and/or commentary. Please feel free to contribute to the analysis, either on the talk page or directly on the article page. I'm hoping the skeleton I put together will engender discussion about the trial. Howief (talk) 00:13, 21 August 2010 (UTC)
 * No, this is a great start and just the kind of thing needed to get discussion going. Thanks! Ocaasi (talk) 21:23, 22 August 2010 (UTC)

Could the basic numbers be provided - the total number of anonymous edits (in total, and with unreverted, separately), and the average number of anonymous edits per article per day? Comparison to a control sample of the same number of articles over the same time periods that weren't protected by semi/full-protection (or the same articles that were protected by pending changes at an earlier time) would be good. I'm specifically interested in getting a feel for the number of edits that pending changes enabled/'saved' compared to semi-protection, and to the articles' unprotected states. Thanks. Mike Peel (talk) 20:12, 24 August 2010 (UTC)

Readership traffic
I found this analysis together with the Metrics page helpful, but not many have read them: page-view statistics show that Pending changes/Metrics/Preliminary Analysis and Pending changes/Metrics were each viewed about 10 to 20 times daily, compared to several hundred times for Pending_changes/Closure. The closure article does have links to the others, but maybe not prominently enough. -84user (talk) 12:54, 25 August 2010 (UTC)

Comparing rate of change with/without "pending changes"
Thanks, Howief - very helpful!

You say "One could potentially argue that an article with under a certain level of edits per day is simply not worth putting under pending changes." But since there is little cost to listing something as "Pending changes", I don't see why that would be true.

What I'd really like to see is a comparison of the rate of approved changes to articles before and after being put under "pending changes". If there is a significant increase, and if volunteers don't tire of reviewing changes, then it seems like it is helpful. --NealMcB (talk) 20:09, 22 August 2010 (UTC)


 * I agree, this is not what I wanted to see.


 * I want to see a comparison between before, during and after the Pending Changes trial.
 * I have a feeling that edits are decreasing as unemployment numbers rise through the economic crisis.
 * If I'm right: before the Pending Changes trial, we never had fewer than 5 reverts per minute on Huggle.
 * The first and second questions are: Does the Pending Changes trial frustrate anonymous editors? Are their destructive/reverted edits decreasing, both overall and on the Pending Changes trial articles?
 * The third and fourth questions are: Does the Pending Changes trial frustrate anonymous editors? Are their constructive/nonreverted edits decreasing, both overall and on the Pending Changes trial articles?
 * --Chris.urs-o (talk) 09:46, 10 September 2010 (UTC)


 * I agree, this analysis isn't very useful. I don't care what articles received the most edits. I care about things like the change in the number of edits per day when pending changes is activated, the median time to review an edit, etc. --Tango (talk)

What success should look like
The final graph shows that a smaller percent of IP edits were accepted during the trial than before it. Is this success, or failure? To me, it is total failure. Pending Changes is no good if it discourages good IPs more than it discourages bad IPs. The fact that the percent accepted went down on every article in the table should be very concerning. Either the criteria for acceptance went up, or good IPs were put off more than bad IPs. Or good IPs got alienated and turned into bad IPs. So, what happened over time during the trial period? Did the daily or weekly percent of IP edits that were accepted go up or down? 69.3.72.249 (talk) 04:44, 15 September 2010 (UTC)

Here is a snapshot of the graph I am referring to. 69.3.72.249 (talk) 04:46, 15 September 2010 (UTC)

Wrong diagnosis. IP vandalism goes up after the Pending Changes period begins... :( --Chris.urs-o (talk) 18:46, 17 September 2010 (UTC)
 * Different question. My question is what effect does Pending Changes have on good IP editors?  And does the effect on IP vandals more than offset the effect on good IP editors?  Show me that Pending Changes does more good than harm. 69.3.72.249 (talk) 19:15, 18 September 2010 (UTC)


 * 69.3, If criteria for acceptance goes up, why do you view this as failure? Doesn't that improve the quality of articles? Cliff (talk) 16:43, 17 March 2011 (UTC)

Why only anonymous edits?
PC affects more than only anonymous edits. Editors who are logged in but do not have reviewer privileges are also put in the "pending" queue. What do the results look like when non-anonymous users make edits? Cliff (talk) 16:41, 17 March 2011 (UTC)

=Archives from /Metrics/Anonymous edit quality=

Compared to what?
This page is great. But, it needs a last column which gives the unreverted/total anonymous edits % for each page which looks at the same amount of time the page was under PC, immediately prior to being under PC. We also need to identify whether that overlapped with another kind of page protection. Otherwise, we have data, internally comparable, but untethered to the past. Ocaasi (talk) 11:22, 7 August 2010 (UTC)
 * The problem with that is that the vast majority of these articles were under semi-protection prior to being under pending changes. Therefore, the number of anonymous edits before is zero. -- RobLa-WMF (talk) 15:40, 7 August 2010 (UTC)
 * Then we'll never know for sure, since I doubt anyone is willing to throw up two months of no page protection at all. What is the metric of "success", then? Is it shifts in anonymous reverts using PC over time? Is two months enough to chart a meaningful trajectory of vandalism on a page, or is there another metric I'm missing? Ocaasi (talk) 16:54, 7 August 2010 (UTC)
 * It cannot be both true that these pages qualified for page protection and that there is no data on IP edits before they were under PC. To qualify for PP a page must have been subject to attack, so by definition there must be some data on IP edits before it went to PP. If it was not under PP, there is data before it went to PC. Without the data on what was happening before PP we can have no idea whether simply unprotecting the page would have had the same result (in terms of attempted vandalism) as putting it under PC. Spinningspark 20:07, 16 March 2011 (UTC)


 * I spot-checked the logs for more than a dozen, and the vast majority of them had been under continuous semi-protection for more than two years. Comparing vandalism in 2007 (the most frequent date in my check) to activity in 2010 is likely to be invalid, and the number of articles that weren't under long-term semi-protection before being added to PC is so small that I doubt it would be statistically significant. If you want to gather that data, though, please feel free to do so. I'd start by adding a column that identifies the prior state of the articles in this table. WhatamIdoing (talk) 20:48, 16 March 2011 (UTC)
 * I neither designed this trial nor lobbied for it, so the suggestion that I should now put a large amount of effort into gathering data is a non-starter. If the comparison to the start of PP is invalid because the data is stale, then those articles were unsuitable for inclusion in the trial in the first place and should be excluded from the data set. If the remaining sample is too small (exactly how small is it? It is the absolute size of the sample, not the relative percentage of the population, which is important), then the whole trial is invalid and should be immediately stopped. But in any case, why should 2007 data be considered stale? There is no obvious reason that springs to mind. Certainly the rate of editing on Wikipedia as a whole may be different, as may be the ratio of IP to registered editors, but both of these factors can be weighted for once we know the actual numbers. Spinningspark 21:31, 16 March 2011 (UTC)


 * I also had nothing to do with the trial design. The trial apparently was conceived as a way to compare PC to semi-protection, not as a way to compare PC to zero protection.  It's not the trial I personally would have chosen to run, but it's the one that was, in fact, actually run, and the data collected is valid for the designed purpose.
 * (If it had been up to me, I would have selected twice as many long-term, stably semi-protected articles plus an equal number of articles that hadn't been protected within the last year, 100% in advance, actually randomized 50% of each group into PC, and permitted neither additions nor removals during the trial. But as you can tell, I didn't design the trial.)
 * From my small and non-randomized spot check, I estimate that there might be 50–150 articles that were both previously unprotected for a significant length of time and in the trial for more than a week or so.
 * The reason that you cannot validly compare 2007 against 2010 activity is that the entire community has seen serious changes in editing patterns, including fewer active editors, fewer new editors, and more automation. While this is known in very general terms, AFAICT nobody has enough information to even guess at how to weight the formula to account for these factors. Even comparing winter 2010 against summer 2010 is a problem, since there are significant seasonal variations.
 * If you want the answers, you'll have to put the work in. If getting the answers isn't worth your time and effort, then it's not worth anyone else's, either.  WhatamIdoing (talk) 23:14, 16 March 2011 (UTC)
 * I would say the onus is on those arguing to continue the trial/use it already to present evidence justifying that action. Spinningspark 20:30, 17 March 2011 (UTC)
 * I recall no editors seriously arguing to 'continue the trial'. "Use it already"—the position of a majority of editors—is not the same thing as "study it some more".  WhatamIdoing (talk) 21:11, 17 March 2011 (UTC)
 * There are enough words written already without getting unnecessarily picky. Spinningspark 21:33, 17 March 2011 (UTC)

Here is an obvious experiment: try PC on Today's Featured Article for 1 day. There is already tons of experience with vandalism on TFA under the old practice of leaving it unprotected as much as possible, and reverting vandalism on it. 69.111.194.167 (talk) 16:31, 15 April 2011 (UTC)

an observation or two
First, NICE TABLE! I was a bit surprised at the range of accepted edits. For articles with more than a few edits, the range is from 100% to 0%. I don't know what I might have expected, but it wasn't such a range. Going forward, I might suggest that articles with <20% or so accepted edits, and with an edit volume of more than a few a week, might be better on semi-protection. Otherwise I'm very happy with the way the test plot has come out. --Rocksanddirt (talk) 16:11, 25 August 2010 (UTC)

Dangerous silencing of living biographied people
We find this quite non-operative for certain cases. See the Antonio Arnaiz-Villena (false) biography: 1-A group of apparent linguists (Trigaranus, Akerbeltz, User:Dumu Eduba, Kwamikagami) started fighting somewhere else in Wikipedia to remove "Iberian-Guanche Inscriptions". The page was removed and they particularly disagree with the word "usko-mediterraneans". 2-After many months Dumu Eduba (only interested in linguistics) brought up false accusations against Arnaiz-Villena from doubtful newspapers written ten years ago that he found on the Internet (June 9th 2010). These accusations were published within 2-3 weeks' time and nothing was said anymore. 3-Arnaiz-Villena and his group have published quite a lot of papers in the last ten years and some books.

4-The accusations were shown to be induced and Judges invalidated them. 5-Now, at least User:Arnaiz1 (= Arnaiz-Villena himself) and User:Virginal6, who were both involved in the accusations and have all the documents, tried to write details about how the Judges made the false and induced accusations disappear: names of Judges, dates, sentence numbers, etc. (see the Arnaiz-Villena discussion). 6-They have silenced Arnaiz1 and Virginal6 because of sock puppets (it is not true). 7-The false biography is in Wikipedia and the living people concerned are silenced. 8-I would ask you in the name of the Arnaiz-Villena group to a) Push Dumu Eduba to finish the biography section on litigations (he has the data in the Discussion). b) Remove this part of the biography until it is finished (or leave this part as it was before June 9th 2010). 9-We are not allowed to give away Court sentences to nicknames. This has not been seen in Wikipedia yet. We will only give sentences to Wikipedia California Administrators. Please contact Antonio Arnaiz-Villena at or at aarnaiz@med.ucm.es or tel +34913941632. Symbio04 (talk) 10:11, 1 September 2010 (UTC) Symbio04 (talk) 10:58, 1 September 2010 (UTC)

Why only anonymous edits?
PC affects more than only anonymous edits. Editors who are logged in but do not have reviewer privileges are also put in the "pending" queue. What do the results look like when non-anonymous users make edits? Cliff (talk) 16:41, 17 March 2011 (UTC)

=Archives from /Metrics/Full table=

Definition of reverted changes
In the interests of accuracy and future research, could someone state precisely what "reverted" means in this context, and what algorithm [a link to the actual source code would be ideal!] was used to determine which revisions were reverted?--greenrd (talk) 12:09, 14 September 2010 (UTC)

Column totals
Are the totals for each of those columns available somewhere? --Anthonyhcole (talk) 18:11, 19 February 2011 (UTC)