Wikipedia talk:India Education Program/Analysis/Quantitative Analysis

Questions and remarks

 * 1) 21% content survival = net negative for the encyclopedia. I was expecting this to be low, but not this bad.
 * 2) How is content surviving cleanup defined? Is this an extrapolation from cleanup efforts so far or is it surviving content to analysis date/total content? If the latter, using it as a measure of content quality would be misleading; for instance the IEP CCI is only ~20% complete.
 * 3) What about userspace problems? What about images?
 * 4) Will the content survival ratio influence selection of classes for phase 2?
 * 5) Will the analysis be redone at a future date when the cleanup is essentially complete?
 * 6) Of the 1014 registered students, how many accounts did you analyze? Does this include duplicates, sockpuppets and non-existent users?

I await your responses. MER-C 10:01, 21 January 2012 (UTC)


 * Hi MER-C: I will let LiAnna / Ayush respond to the questions on how the analysis was done, but in response to your specific point about Phase 2, we have accepted the analysis, report, findings and feedback, and these will inform all aspects of our work going forward. Hisham (talk) 01:11, 22 January 2012 (UTC)
 * Hi MER-C, I'll start by thanking you for your cleanup efforts - I believe you have helped out with a significant chunk of the cleanup.


 * 1) Yes. For the future, it would be useful to do a similar analysis for new editors, so we have a baseline figure for content survival. I am currently working on running a similar analysis for the US/Canada program, and will share those numbers for reference.
 * 2) Using the WikiTrust API, we calculated the number of words added by a student that were live on the current revision of an article. We also pulled the total words added by that student, and then calculated the ratio of the two. As I noted in the caveats, the analysis cannot be considered final, since there might be more cleanup needed.
 * 3) I didn't look into userspace content, because content survival/word count does not have the same relevance. I am aware of issues with userspace content containing copyvios, though. I don't think the Wikipedia API offers a way of analyzing what happened with images, but I will look into how I can do that, since it would be useful for measuring contributions in the future.
 * 4) As Hisham noted, this (and other metrics) will inform work in India. Further, we're working on building some infrastructure that helps us measure engagement and contributions for all Wikipedia Education Program initiatives - India, US/Canada, and Egypt. The idea is to have a dynamic view of contributions, so we can monitor and respond to issues with greater immediacy.
 * 5) Yes, absolutely. As I mentioned above, we want to use similar analysis to inform all work going forward. When cleanup efforts can be considered complete, I'll run the numbers again.
 * 6) My list is based on the 7 December 2011 revision of the student list at India_Education_Program/Students.

Hope this helps, and I'm more than happy to answer any further questions. Akhanna (WMF) (talk) 03:26, 22 January 2012 (UTC)
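For readers who want to see the ratio from answer 2 spelled out, here is a minimal sketch. It assumes the per-student word counts have already been obtained somehow; it does not call the WikiTrust API, and the names `StudentContribution`, `survival_ratio` and `overall_survival`, as well as the sample figures, are hypothetical and not taken from the analysis.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class StudentContribution:
    """Word counts for one student (hypothetical record layout)."""
    username: str
    words_added: int       # total words the student added
    words_surviving: int   # of those, words still live in the current revision


def survival_ratio(words_added: int, words_surviving: int) -> Optional[float]:
    """Return surviving / added, or None when nothing was added."""
    if words_added == 0:
        return None
    return words_surviving / words_added


def overall_survival(contribs: Iterable[StudentContribution]) -> Optional[float]:
    """Pooled ratio: total surviving words over total added words."""
    contribs = list(contribs)
    total_added = sum(c.words_added for c in contribs)
    total_surviving = sum(c.words_surviving for c in contribs)
    return survival_ratio(total_added, total_surviving)


# Made-up sample data, for illustration only.
sample = [
    StudentContribution("student_a", 1000, 210),
    StudentContribution("student_b", 500, 105),
    StudentContribution("student_c", 0, 0),  # registered but added nothing
]

if __name__ == "__main__":
    for c in sample:
        print(c.username, survival_ratio(c.words_added, c.words_surviving))
    print("overall", overall_survival(sample))
```

Note that the pooled ratio weights students by how much they wrote, whereas averaging the per-student ratios would weight every student equally. The thread does not say which of the two the 21% figure is.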


 * Thanks for your replies. I forgot to mention previously, but I'd like to see standard deviations and plots as well. Content retention as a function of mainspace edits would be especially interesting. MER-C 09:22, 22 January 2012 (UTC)
 * I have added some plots to the page. It's interesting that you bring up retention vs. edit count: I have noticed some interesting trends in the US/Canada data that suggest there's a relationship. For the IEP data, I saw a low correlation: 0.58. I'd like to factor in whether content was added to an existing article or a new one; I think that might make a difference. Akhanna (WMF) (talk) 21:10, 23 January 2012 (UTC)
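As a rough illustration of the statistics requested above (standard deviation, and retention as a function of mainspace edits), the sketch below computes a sample standard deviation and a Pearson correlation coefficient. It assumes the 0.58 figure is a Pearson coefficient, which the thread does not state; the function name `pearson` and the input lists are hypothetical, and the numbers are made up rather than drawn from the IEP data.

```python
import math
import statistics
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(ss_x * ss_y)


# Made-up per-student values, for illustration only.
mainspace_edits = [3, 10, 25, 40, 8]
retention = [0.05, 0.20, 0.35, 0.30, 0.15]

if __name__ == "__main__":
    r = pearson(mainspace_edits, retention)
    print("r =", r)
    print("r^2 =", r * r)  # share of variance explained by a linear fit
    print("stdev of retention =", statistics.stdev(retention))
```

For reference, a coefficient of 0.58 corresponds to r² of about 0.34, so under a linear model roughly a third of the variance in retention would be associated with edit count.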