Wikipedia talk:Xiong's stats

Very interesting. First comment: clearly Something Happened in May 04. Presumably the introduction of clever templates, categories or both. These clearly vastly increased the scope for trivial edits. Pcb21| Pete 07:23, 20 May 2005 (UTC)


 * Ah yes. MediaWiki 1.3 was introduced in the last two weeks of May 2004 (see e.g. http://meta.wikimedia.org/w/wiki.phtml?title=MediaWiki_roadmap&oldid=53787 and http://mail.wikipedia.org/pipermail/wikipedia-l/2004-May/033089.html). In particular, this release introduced the template namespace and categories. Your analysis shows that these features have fundamentally changed editing: since that time the average edit contributes a lot less information, but fortunately a lot more edits are being made :). Pcb21| Pete 07:45, 20 May 2005 (UTC)


 * Or, to put it another way, people are spending their time fiddling with templates and categories rather than adding information. Hmm.


 * Many, many thanks for the fascinating analysis, by the way. -- ALoan (Talk) 10:34, 20 May 2005 (UTC)


 * Well, as Xiong rightly says, "Historical editing source data, including length and character of edit, broken down by namespace is [the Holy] grail." We can't tell from the stats data whether the "fiddling" edits have replaced the content-adding edits or merely come in addition to them. I hope and believe it is the latter. Recall also that it was simply easier to add content in 2002 than it is today because there were more substantial gaps! Pcb21| Pete 12:31, 20 May 2005 (UTC)


 * Oh, true: the total number of content-adding edits is probably greater than it was in the past, it is just that there is a mass of "fiddling" going on too... -- ALoan (Talk) 13:36, 20 May 2005 (UTC)

Grail
Well, if I'd meant "holy", I'd have said so; this is a more pedestrian grail. {grin} But yes, all questions about what kind of editing is going on need many more numbers. I've detailed what I think might be useful.

I'm new around here; so far all my requests for more hard data have run into dead ends. There are only two roads I see open, neither of which I am eager to travel:


 * Buy another machine and a big hard drive, download the entire database, download LAMP (or half of it anyway), study up, install the whole damn thing and query the database myself. Don't push me to that extremity, because from there it's only a short step to forking the project and once I've done that, I don't have to convince anybody of anything, for I will be the Jimbo.


 * Buttonhole every developer I can, with random begging letters. This is not the Chinese Way, and as much as I uphold the American Way, I've become irrevocably Easternized. In China, nobody ever cold calls -- it's unthinkable. We don't even go shopping in stores unless we know somebody. One reason for this flimsy preliminary analysis is to attract attention -- in the best American tradition of a cheap brass band on Main Street -- in hopes of getting the ear of Somebody who has the ear of Somebody who has the ear... it's the Chinese Way.

Any suggestions, talk to me. Obviously I don't know what I'm doing. -- Xiong 熊 talk * 17:41, 2005 May 20 (UTC)


 * Do you know precisely what information, beyond "what kind of editing", you want? Is it essentially an edit classification system? Here are some edit types I can think of:
 * revert
 * adding categories, templates, interwiki-links
 * spelling, grammar edits
 * "wikifying" - refactoring to conform to style/referencing guidelines
 * content addition
 * refactoring content
 * content removal


 * I imagine writing a program to do something like that classification is quite a task in itself. Remember that such a program would only need to take as input a diff of changes, so there is probably value in writing a function (in any language) that has the signature


 * edit_type detect_edit_type( string old_text, string new_text );


 * Getting this function implemented is independent of figuring out the best way to extract the text from the database.
 * Pcb21| Pete 08:50, 23 May 2005 (UTC)
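
A minimal sketch of how the proposed function might look in Python is given here as an illustration only. The category names follow the list above, but the regular expression, the `SMALL_EDIT` cut-off and the optional `earlier_revisions` argument (needed because a revert cannot be recognised from two texts alone) are all assumptions made for this example, and "wikifying" is not detected at all.

```python
import re
from collections import Counter
from enum import Enum


class EditType(Enum):
    REVERT = "revert"
    MARKUP = "categories, templates, interwiki links"
    COPYEDIT = "spelling or grammar"
    REFACTOR = "refactoring content"
    ADDITION = "content addition"
    REMOVAL = "content removal"


# Category links, interwiki links such as [[de:Katze]], and {{templates}}.
# Hypothetical pattern: nested templates and other namespaces are ignored.
MARKUP_RE = re.compile(r"\[\[(?:Category|[a-z]{2,3}):[^\]]*\]\]|\{\{[^{}]*\}\}")

SMALL_EDIT = 20  # characters; an arbitrary cut-off chosen for illustration


def detect_edit_type(old_text, new_text, earlier_revisions=()):
    """Guess the character of an edit from the text before and after it."""
    # A new revision identical to some earlier one is treated as a revert.
    if new_text in earlier_revisions:
        return EditType.REVERT
    # Strip category, interwiki and template markup, then compare the prose.
    old_body = MARKUP_RE.sub("", old_text)
    new_body = MARKUP_RE.sub("", new_text)
    old_words, new_words = old_body.split(), new_body.split()
    if old_words == new_words:
        return EditType.MARKUP  # only the stripped markup changed
    if Counter(old_words) == Counter(new_words):
        return EditType.REFACTOR  # same words in a different order
    delta = len(new_body) - len(old_body)
    if abs(delta) <= SMALL_EDIT:
        return EditType.COPYEDIT
    return EditType.ADDITION if delta > 0 else EditType.REMOVAL
```

For example, `detect_edit_type("The cat sat.", "The cat sat. [[Category:Cats]]")` falls into the markup category because the prose is unchanged once the category link is stripped. Mixed edits get whichever label matches first, so a real classifier would probably need to return several labels or a score per label.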

I did not have anything quite so ambitious in mind! When I began to think on the matter, I thought at first there would be two kinds of edits -- those that lengthened and those that shortened; I could pull that info from article length, plus or minus. A split second later, I realized that some edits maintain article length constant -- and that of these, some change many bytes, some change few. That's as far as I got down that road -- I don't have any method for obtaining even these metrics.
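
The two metrics described in the paragraph above (net change in length, and how many bytes actually changed) could be computed roughly as follows once both revisions are in hand. This is only a sketch: the function name `size_metrics` is made up for the example, and Python's standard `difflib` is used as a stand-in for whatever diff method would really be chosen.

```python
from difflib import SequenceMatcher


def size_metrics(old_text, new_text):
    """Return (direction, net byte delta, bytes touched) for one edit."""
    old_bytes = old_text.encode("utf-8")
    new_bytes = new_text.encode("utf-8")
    delta = len(new_bytes) - len(old_bytes)

    # Sum the size of every non-matching region reported by the diff.
    # autojunk=False switches off difflib's heuristic for long inputs.
    matcher = SequenceMatcher(None, old_bytes, new_bytes, autojunk=False)
    changed = sum(
        max(i2 - i1, j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )

    if delta > 0:
        direction = "lengthened"
    elif delta < 0:
        direction = "shortened"
    else:
        direction = "same length"
    return direction, delta, changed
```

An edit that fixes a single letter reports a delta of zero with one byte changed, while a rewrite that happens to keep the length constant reports a delta of zero with many bytes changed. A byte-level diff of this kind is slow on long articles, so it would suit a sample better than the whole database.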

Elaborating this statistical view, some edits do not really change any bytes, but reorder them. Byte-by-byte comparison would show a large difference, but a human eye would say, Oh, he just swapped those two paragraphs. If you knew to look for this sort of change, you might find it (by algorithm) fairly easily. Some edits -- quite a few, I'd say -- are of mixed character, with minor changes, major additions, and who-knows-what thrown in. It might be fairly easy to count the number of brackets or equals to see if wikilinks or section heads have been added (or removed); similarly, count braces to detect transclusion. (However, template substitution is an invisible act!)
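
The counting and reordering checks described above might be sketched like this. All names here are hypothetical, and the assumption that paragraphs are separated by blank lines is a simplification. As noted, a substituted template leaves no braces behind, so it cannot be counted this way.

```python
import re
from collections import Counter

# A line that starts and ends with two or more equals signs is a section head.
HEADING_RE = re.compile(r"^==+.*==+[ \t]*$", re.MULTILINE)


def markup_counts(text):
    """Count opening brackets, opening braces and section-head lines."""
    return {
        "wikilinks": text.count("[["),
        "transclusions": text.count("{{"),
        "headings": len(HEADING_RE.findall(text)),
    }


def markup_delta(old_text, new_text):
    """Positive numbers mean markup was added, negative that it was removed."""
    old, new = markup_counts(old_text), markup_counts(new_text)
    return {key: new[key] - old[key] for key in old}


def paragraphs(text):
    # Assumes paragraphs are separated by blank lines.
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def is_pure_reorder(old_text, new_text):
    """True when the same paragraphs appear in a different order."""
    old_paras, new_paras = paragraphs(old_text), paragraphs(new_text)
    return old_paras != new_paras and Counter(old_paras) == Counter(new_paras)
```

Counting `[[` also picks up category and image links, so the wikilink figure is an upper bound rather than an exact count.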

Anyway, a fascinating avenue of exploration. I wish I did have a brand-spanking new G5 and a steady income, permitting me unlimited leisure in which to query the database and figure out neato ways to parse it. For now, all I'd really hoped for was to see editing activity broken down by namespace.


 * For this more limited activity, I think the sensible option is to expand Erik's current scripts. It is clear that at least some (or perhaps, all) of his stats are limited to the mainspace, and so he already has some namespace detection (the namespace is probably in the database tables). How much work you will need to do will depend on how much enthusiasm he has for the project, I suppose. Pcb21| Pete 10:48, 23 May 2005 (UTC)

A word of caution: I've said that you can't pull anything useful or interesting out of a dataset unless you are looking for something. But the other side of that is, if you are looking too hard and too specifically for something, you will probably find it -- whether it is really there or not. -- Xiong 熊 talk * 10:37, 2005 May 23 (UTC)


 * Seems like a wise thing to say. Pcb21| Pete 10:48, 23 May 2005 (UTC)

IBM history flow
I don't know if you've seen this already, but IBM's History Flow has some methods to visualize changes in Wikipedia. Perhaps you can glean more of how they do it from their pages - I just see pretty pictures there :) -- grm_wnr Esc  14:14, 25 May 2005 (UTC)


 * That is a fascinating reference and I recommend it to anyone with related interest. IBM has created a tool, demoed it on WP (of course, they probably have a more accurate archive of the project than we have in Florida), and now they'll sell it to us -- or anybody else with the price. Hey, it's the American Way, I'm not complaining.


 * I believe I understand what's been done; and it does it very well. But it is not really applicable to this kind of investigation of the project as a whole; it is a tool for examining the edit character of single pages. Now if you threw on top of that a pattiducker and ran it on all the millions of pages in the corpus, then you might have something.


 * That said, anybody who wants to buy me this tool is free to do so. :) -- Xiong 熊 talk * 02:26, 2005 Jun 2 (UTC)

Wikimetrics
I don't know if it's the Chinese Way ;-) but I just wanted to point you to my paper on "Measuring Wikipedia", Research and the Wiki Research Bibliography. I also have a Weblog where I will try to summarize your research as soon as I have studied it a little more. I'd like to have a big "Wikipedia Data Warehouse" where all researchers can study the data collaboratively. I'm glad to see that more and more people are interested in it! Nice to get to know you, Xiong! -- Nichtich 18:51, 25 May 2005 (UTC)


 * Nichtich's paper is interesting, but does not talk much about trends over time. I would not have excluded EN from analysis merely due to Rambot; I'm not sure of the motivation. (In fact I concentrate solely on EN, since it has the longest history.)


 * I will have to follow the other links as I have time. -- Xiong 熊 talk * 16:18, 2005 Jun 2 (UTC)

Per Wikipedian
You remark "Per wikipedian this project is not growing at all -- it is shrinking!". I would imagine this reflects several things:
 * 1) As people participate for a while and become inactive, their accounts are still present, counting as Wikipedians.
 * 2) If someone decides to create a new identity on Wikipedia, their old account is still there, counting as a Wikipedian.
 * 3) Wikipedia is becoming far better known. One would expect that we will pick up more casual editors, people who are not particularly concerned with the project, but hear about it, make a few edits on one or two things they know about, and never become active participants. At least three people have mentioned to me that someone asked them to come edit an article on which they were expert; they did so, but did not become otherwise involved, at least not so far. -- Jmabel | Talk 01:54, May 26, 2005 (UTC)

Error
I'm sorry, but I cannot verify your analysis of mean article size and edits per article. First, the article size and its growth are much smaller; maybe you confused a row? Second, you cannot put a linear scale into a logarithmic scale! As I wrote in "Measuring Wikipedia" (page 4) for the German Wikipedia:
 * Until February 2005 all metrics (database size, number of words, links, active Wikipedians, very active Wikipedians) increased similarly with around 18 percent increase per month. Only the number of articles increased slower with 13.8 percent. [...] That means article's size, number of links and numbers of active Wikipedians per article continuously rise.

I have no doubt that this also holds for the English Wikipedia. Here is a small graph indicating that the mean number of edits per article also grows exponentially, but not as fast. Can you have a second look and update your statistics? -- Nichtich 11:56, 26 May 2005 (UTC)
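
The step from the quoted growth rates to "per article continuously rise" can be checked with a little arithmetic. The sketch below uses only the two figures quoted above for the German Wikipedia (about 18 percent per month for the other metrics, 13.8 percent for the number of articles); the function names are made up for the example, and treating both rates as constant is an assumption.

```python
import math


def relative_monthly_growth(metric_rate, article_rate):
    """Monthly growth of a per-article ratio when both quantities compound."""
    return (1 + metric_rate) / (1 + article_rate) - 1


def doubling_time_months(monthly_rate):
    """Months needed to double at a constant compound monthly rate."""
    return math.log(2) / math.log(1 + monthly_rate)


# Figures quoted above for the German Wikipedia.
per_article = relative_monthly_growth(0.18, 0.138)
print(f"per-article growth: {per_article:.2%} per month")
print(f"doubling time: {doubling_time_months(per_article):.1f} months")
```

On these figures the per-article quantities grow by roughly 3.7 percent a month. That is still exponential growth, only much slower than the totals, which is why it looks like a shallower straight line on a logarithmic scale.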