Wikipedia:Xiong's stats

This is a preliminary analysis of selected English Wikipedia statistics over the period from 2002 January to 2005 March. Data is examined for evidence of a shift in Wikipedian Community values and cultural makeup. This analysis is based on incomplete data and no certain conclusion may be drawn.

The analyst is Xiong Changnian.

Tools

 * Apple Macintosh "Gossamer" G3 300 MHz


 * Mac OS 9.2.2


 * FileMaker Pro 5.5


 * Microsoft Excel 9.0

Source
Erik Zachte maintains scripts which automatically prepare extensive, comprehensive charts and tables available for public viewing at. One such table is Wikipedia Statistics English. On 2005 April 28 this page is downloaded and munged to a form suitable for further analysis (Source #1).

Additional data is stored at. This consists of a large number of CSV and other types of files. From this is abstracted a single file, "StatisticsUsers.csv", as found on 2005 May 14 (Source #2). This file contains one record for each user in all languages.

It is not clear whether these CSV files contain information up-to-date at the time of download, or whether they are cut off at any particular moment.

Method
The primary value of Zachte's presentation, as it appears to me, is comparison between different Wikipedias -- for example, between English Wikipedia and German Wikipedia. I choose in this analysis to concentrate on English Wikipedia alone, as it has the longest history and I am interested in trend analysis -- comparison of the state of the project at various points in time and extrapolation to hypothetical future states.

Source #1 yields two datasets.

Dataset #1-1, "Articles", is abstracted from columns E, I, and H of the source. These represent, respectively, official article count; mean bytes per article; and mean edits per article. It is understood that these refer exclusively to pages in the main article namespace (ns0).

Dataset #1-2, "Wikipedians", is abstracted from columns A, D, and C of the source. A is defined as the total number of users who have ever made at least 10 edits ("contributors"); D is the total number who have, in the last month, made over 100 edits ("very active"); C is the number who have made over 5, but not more than 100 edits ("active"). (Note a peculiarity of this dataset that it is possible for a user to make over 5 edits in one month, but fewer than 10 ever; such a user is "active", but not a "contributor".)

Source #2 is pulled into FileMaker and all records of users other than EN deleted. 62,836 records remain. For each user, aggregate measures of editing activity are shown. This file contains no historical information, unfortunately; only activity in the last 30 days and all activity since project start.

Individual users are of no concern and some chosen usernames actually corrupt Excel, so arbitrary local user id numbers are assigned before export. We note it is a disadvantage that bots are not always clearly identified. This is Dataset #2-1.

This large number of records is unwieldy -- not just too many for this author's antique system to process, not just too many for Excel's hard limits -- but too many for clarity when charted. So a random sample of about 10% of the population is taken; this is Dataset #2-2, "Sampled Editors": 6259 users. The sampling method is to assign a pseudo-random number to each user record in Dataset #2-1 and select those with a value ≤0.10. Conveniently, this appears to have excluded all bots.
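The sampling step can be sketched as follows. This is a Python stand-in for the FileMaker procedure, not the original script; the function name, the seed, and the use of sequential id numbers are assumptions made for illustration.

```python
import random

def sample_users(user_ids, fraction=0.10, seed=None):
    """Keep each user whose pseudo-random draw is <= fraction."""
    rng = random.Random(seed)
    sample = []
    for uid in user_ids:
        # One independent draw per record, uniform on [0, 1).
        if rng.random() <= fraction:
            sample.append(uid)
    return sample

# Hypothetical local id numbers standing in for the 62,836 EN records.
population = range(1, 62837)
sampled = sample_users(population, fraction=0.10, seed=2005)
print(len(sampled))  # near 6,284, but not exactly 10% of the population
```

Because each record is kept or dropped independently, the sample size itself varies from run to run. That is consistent with a sample of 6259 rather than exactly one tenth of 62,836.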

Worksheet columns K, L, M and N are taken directly from the source; respectively, they are the given user's edits to "articles" for "all time", edits to articles for the last 30 days ("recent"), edits to "other" namespaces for all time, and edits to other namespaces for the last 30 days. Columns P and Q are totals generated by summing K+M and L+N. Column R is the ratio K/P, while S = L/Q.
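A minimal sketch of the derived columns follows. It uses descriptive field names in place of worksheet letters; those names are hypothetical. The two shares follow the definition given under "Sampled Editors": article edits over all edits, for all time and for recent edits respectively.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EditorRow:
    article_all: int     # edits to articles, all time
    article_recent: int  # edits to articles, last 30 days
    other_all: int       # edits to other namespaces, all time
    other_recent: int    # edits to other namespaces, last 30 days

    @property
    def total_all(self) -> int:
        # Total edits for all time (the worksheet's P).
        return self.article_all + self.other_all

    @property
    def total_recent(self) -> int:
        # Total edits for the last 30 days (the worksheet's Q).
        return self.article_recent + self.other_recent

    @property
    def article_share_all(self) -> Optional[float]:
        # Share of all-time edits made to articles (R); None if no edits.
        return self.article_all / self.total_all if self.total_all else None

    @property
    def article_share_recent(self) -> Optional[float]:
        # Share of recent edits made to articles (S); None if no recent edits.
        return self.article_recent / self.total_recent if self.total_recent else None

# Invented example row, not taken from the dataset.
row = EditorRow(article_all=110, article_recent=7, other_all=40, other_recent=3)
print(row.total_all, row.total_recent)  # 150 10
print(row.article_share_recent)         # 0.7
```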

Limitations
Data prior to 2002 January, as noted on the source page, is suspect and has been discarded. Other language projects were not included in the analysis as they did not all start at the same time; English Wikipedia is the root and oldest project.

Many questions are raised by this entirely preliminary analysis that cannot be answered with the data available in these sources. Fuller analysis requires much more detailed information, such as:


 * complete EDIT history:
   * datestamp of edit
   * id of page edited (numeric id)
   * namespace of page edited (ns#)
   * id of editor (internal id)
   * length of edit in bytes (number of bytes added or subtracted)
   * length of edit in bytes (number of bytes changed)

 * complete USER history:
   * id of editor, including anon-IPs
   * date of entry into system
   * other significant activity besides editing, if any
   * other user information can be processed out of EDIT history

 * complete PAGE history:
   * id of page
   * namespace of page
   * and for each reporting period:
     * date of reporting period
     * current size in bytes
   * this information is used to check deductions from EDIT history
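The three record types listed above might be modelled as below. This is only a sketch of the wished-for data, not a description of the actual MediaWiki schema; all class and field names are hypothetical. The helper shows the kind of cross-check the PAGE history is meant to support, assuming the edit history for the page is complete from its creation.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional

@dataclass
class EditRecord:
    timestamp: datetime
    page_id: int
    namespace: int
    editor_id: str       # account id, or IP address for anonymous edits
    bytes_delta: int     # net bytes added (+) or subtracted (-)
    bytes_changed: int   # bytes touched, regardless of net size change

@dataclass
class UserRecord:
    editor_id: str
    first_seen: date
    other_activity: Optional[str] = None

@dataclass
class PageSnapshot:
    page_id: int
    namespace: int
    period: date
    size_bytes: int

def size_from_edits(edits, page_id, up_to):
    """Sum net byte deltas for one page up to a date, for comparison
    against the size recorded in a PageSnapshot."""
    return sum(e.bytes_delta for e in edits
               if e.page_id == page_id and e.timestamp.date() <= up_to)

# Invented records for illustration only.
edits = [
    EditRecord(datetime(2005, 1, 1, 12, 0), 1, 0, "u1", 500, 500),
    EditRecord(datetime(2005, 2, 1, 12, 0), 1, 0, "u2", -100, 300),
]
snapshot = PageSnapshot(page_id=1, namespace=0, period=date(2005, 2, 28), size_bytes=400)
print(size_from_edits(edits, 1, snapshot.period) == snapshot.size_bytes)  # True
```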

A serious hard limit on all statistical analysis of this project may exist: It is not clear that any independent archive exists. Thus, all historical information must be processed out of the current database state. In theory, wiki architecture preserves all prior state information in current state, but this may not hold in practice -- and in the absence of archived prior states, it is impossible to prove. This situation, if true, will be exacerbated by database restructure incident to MediaWiki 1.5.

It is critical to trend analysis that historical information be preserved. This author hopes that archived prior states will be discovered and that statistical summaries of current state be logged exterior to the project database.

There are several questions regarding the preservation of certain actions, such as page protection and deletion. This information may or may not be lost. Resolution of these questions is a priority.

At this time, this author is not in possession of complete documentation of the database structure. An explicit declaration is available at SourceForge but of course this requires even more background to understand. m:Help:Database layout is 2 years out of date, but it does link to detailed explanations of the various database "tables". A more rigorous analysis must include accurate and current descriptions of the source and nature of each group of data.

Relative frequency of editing to article namespace and other namespaces is of great interest. Prima facie, article editing is of greater value than, say, talkspace editing. After all, the former produces value directly to the end user, the reader.

Caution must be exercised when moving from the general to the specific; a user who does nothing but edit talkspace may be performing a function useful to the project. This author disapproves of "rating" individuals by their namespace editing ratios, and does not intend to produce such ratings. However, the overall percentage of editing -- by number of edits and by bytes thus edited -- may be a valuable measure of "overhead" -- the work needed to glue the project together and make article editing possible.

The mean user has made about 10 times as many edits in all time as recently; for all time, about 11 "article" edits for every 4 "others"; recently, that has dropped to about 7 to 3 (17% decline).

Analysis of these variables is severely limited by the paucity of data. It is barely possible to extract any useful trend at all from a dataset consisting merely of aggregate and recent data.


 * Historical editing source data, including length and character of edit, broken down by namespace is this author's grail.

Usage
These charts include detail best viewed at full size. If you do not have a second, high-resolution monitor available, it is recommended that you download and print the charts before following the analysis.

The Excel workbook is available on request.

Articles


The first chart will be familiar to anyone who has a passing interest in Wikipedia growth. Although, like many human activities, it must eventually settle into a logistic curve, it appears as an exponential growth curve in this early stage. Excel generates an exponential curve with an excellent fit when the late-2002 Rambot anomaly is excluded.



Exponential growth is best plotted on a chart with a logarithmic axis; this makes the curve appear as a straight line. Shown on the same chart, but plotted against the right-hand linear axes, are two other data series of note: mean bytes per article and mean edits per article.
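The straight-line property suggests one way to fit such a curve: ordinary least squares on the logarithm of the count, leaving out the anomalous months. The sketch below assumes that approach; whether Excel's trendline does exactly this is not confirmed by the source. The series is synthetic, with an artificial bump standing in for the anomaly, and does not reproduce the real figures.

```python
import math

def fit_exponential(months, counts, exclude=()):
    """Least-squares fit of log(count) = a + b * month.
    Returns (starting count, growth rate per month)."""
    pts = [(m, math.log(c)) for m, c in zip(months, counts) if m not in exclude]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope

# Synthetic data: 5% monthly growth with an artificial bump in months 9-11.
months = list(range(24))
counts = [1000 * 1.05 ** m for m in months]
for m in (9, 10, 11):
    counts[m] *= 1.4

start, rate = fit_exponential(months, counts, exclude={9, 10, 11})
doubling_months = math.log(2) / rate
print(round(start), round(doubling_months, 1))
```

With the bump excluded, the fit recovers the underlying growth rate; with it included, both the starting value and the rate are distorted.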


 * Why is the number of mean edits per article rising so quickly?

Three years ago, the average article had been edited 3 or 4 times; two years ago, about 7 times; last year, 11 times; and this year, over 18 times. Obviously, old articles are being constantly re-edited; but should this effect not be swamped by the exponential influx of new articles? It seems all articles are being more heavily edited -- and that this trend is increasing.

Perhaps large amounts of new content are being added to existing articles? The average size of an article is increasing -- but not steadily.

Prior to Rambot, mean article size hovered around 1700 bytes; this represents a period of relatively rapid growth. Afterwards, article size steadily increased over the next five quarters -- a period of relative consolidation, although total number of articles continued to grow exponentially.

Three more quarters were flat at about 2200 bytes; we are currently in another period of consolidation. These fluctuations in article size growth are not reflected in edits per article, which figure has increased rapidly by a factor of five even as total article count has increased by roughly the same factor. Another way of putting this is that total number of edits has increased 22-fold since the end of the Rambot anomaly.
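The arithmetic behind the last statement is that total edits equal article count times mean edits per article, so the two growth factors multiply. A small sketch, with function names chosen only for illustration:

```python
def total_edits_factor(article_factor, edits_per_article_factor):
    """Growth factor of total edits, given the two component factors."""
    return article_factor * edits_per_article_factor

def implied_article_factor(total_factor, edits_per_article_factor):
    """Article-count factor implied by a total and a per-article factor."""
    return total_factor / edits_per_article_factor

print(total_edits_factor(5, 5))       # 25
print(implied_article_factor(22, 5))  # 4.4
```

An exactly fivefold rise in both quantities would give 25-fold. The quoted 22-fold figure corresponds to an article-count factor of about 4.4, which fits "roughly the same factor".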

There is a very sharp point of inflection in edits per article right in mid-May 2004, so the trend is accelerating. One is led to suspect that the same text is being edited more frequently, an effect that remains when others are accounted for.



Articles per Wikipedian


A smoking gun is found when we chart the number of articles in the project per Wikipedian in the community. Since both figures are growing exponentially, we might expect their ratio to be flat -- but surprisingly, the ratio is decaying along some function that appears either logarithmic or polynomial. Rambot, of course, added many articles without adding any users; but long after this ratio fell below pre-Rambot levels, it continues to fall, slowly but surely. It's uncertain whether this will level off somewhere above 20 articles per contributor.

This explains at least part of the rapid rise in edits per article: there are fewer articles to go around per editor. But it does not explain why edits per contributor is on the rise.

We finally catch a correlation to the 2Q 2004 edits-per-article inflection; even while total number of contributors was rising, edits per contributor turned around from a local minimum of about 380 and has gone up so quickly we may anticipate equalling Rambot before 2006.
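The expectation of a flat ratio holds only if both quantities grow at the same rate. The sketch below uses hypothetical starting values and rates, not fitted ones, to show that the ratio of two exponentials stays constant when the rates match and falls steadily when the contributor count grows faster.

```python
import math

def ratio_series(a0, a_rate, w0, w_rate, months):
    """Articles per Wikipedian when both counts grow exponentially."""
    return [(a0 * math.exp(a_rate * t)) / (w0 * math.exp(w_rate * t))
            for t in range(months)]

# Hypothetical values for illustration only.
same_rate = ratio_series(100_000, 0.05, 2_000, 0.05, 24)
faster_users = ratio_series(100_000, 0.05, 2_000, 0.06, 24)

print(round(same_rate[0], 3), round(same_rate[-1], 3))
print(round(faster_users[0], 3), round(faster_users[-1], 3))
```

Under these assumptions the ratio changes by a factor of exp(a_rate - w_rate) each month, so a falling ratio is what one would see if the userbase grows faster than the article count.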



Wikipedians


Community membership has, like article count, been growing exponentially. There is no obvious Rambot anomaly, since after all Rambot is a single user, albeit very active. The curve, plotted like the other charts on a log axis, is very flat indeed. It is not clear why two other curves are relatively noisy: "active" and "very active" editors. Note that the sum of these two curves does not come near that of all Wikipedians; the majority of users make 5 or fewer edits per month.


 * Either all users are, on the average, more active today than yesterday; or a smaller number of users are more and more active. Which is it?

We turn to a ratio of more-active members against less-active ones. The ratio of these two noisy curves is itself very noisy, but there are clear trends. If users are becoming more active on the average, then perhaps more of them will be more active as individuals.

But we see exactly the opposite effect. The proportion of merely "active" editors increases; that of "very active" editors decreases. What's even more surprising is that this steady decline is only the most recent trend. Prior to Rambot, this ratio was on the rise; the proportion of "very active" editors roughly doubled in about 6 months. After Rambot, the ratio retreated even more quickly before assuming the present irregular decline.



Sampled Editors


Unlike the others, this is an X-Y scatter chart. Both axes represent some numbers of edits; the only difference is scale -- the X axis, running along the bottom of the chart, covers a much wider range than the Y axis (5:1).

Thus the m=1 red line connects points of equal value on both axes; its slope is 1. Any of the data series from Dataset 2-1 or 2-2 can be plotted against either axis. In each case, a pair of data series is compared; the first series in the pair is plotted against the X axis and the second against Y.


 * Can we identify subpopulations?

The data (even sampled) covers an extremely wide range, with some editors exceeding 25,000 edits in all time -- and I don't think they are even bots. However, the median user makes only about 5 edits in all time. Thus the same data is examined on different scales. Unfortunately, at the smallest scale the integer nature of the data forces datapoints to stack up, obscuring density.

At first, the results are discouraging. At all scales, there seems to be a continuum of editors; the only obvious subpopulation is the very numerous "Tourists", who make a small number of edits and depart. Even so, there is no sharp boundary to this group.

The first pair, P-Q, compares all-time edits against recents. Naturally no user can make more edits recently than in all time, so none fall above the m=1 red line. K-M compares article edits for all time to others; L-N makes the same comparison for recents. Editors found above the red line have made more edits outside of article mainspace than within it.

At the larger scales, heavy clustering toward the "small end" of the chart is immediately apparent. Outliers are few; exceptionally heavy editors are rare indeed. At smaller scales, one feature does emerge: numbers of "dead" users who have made no edits at all recently, though they may have made thousands in all time.

At all scales there is a healthy preference for article editing, although a number of users are found above the line. This appears to hold for recent as well as all time editors. Note however that while many heavy all-time editors are found well below the line, recent heavy editors appear to tend closer to it. It's not clear how much this is an illusion caused by the preponderance of smaller and smaller editors.



This lopsided distribution naturally leads to thoughts of normalization. R and S, being ratios (of natural numbers) which may not exceed 1, are restricted to the range 0 to 1 inclusive. To recap, R is the ratio (for each editor) of article to all edits, for all time; S the same ratio for recent edits only. Axes are equal and the new m=1 line is shown in green. New red lines divide the chart into quadrants which correspond to the division of the prior set of charts by their m=1 red lines.

To restate the last point: In the first set of scatter charts, activity was plotted along both axes. The comparisons K-M and L-N fell below the red line when editors made mostly article edits; above when most edits were to other namespaces. In this scatter chart, these ratios are expressed directly. Editors who favor article editing tend to the right and top; those who favor other edits tend to the left and bottom of the chart. All-time editing preferences range from left to right, while recents range from bottom to top.

Since "dead" users have made no recent edits, their recent preferences do not exist and cannot be normalized against non-existent activity. On this chart, these users are forced below the X-axis; the indicated value of -0.10 is purely bogus.
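The handling of "dead" users can be sketched as follows. The recent ratio is undefined when there are no recent edits, so a sentinel value is substituted for plotting only. The function and constant names are hypothetical; the value -0.10 is the one given above.

```python
DEAD_SENTINEL = -0.10  # plotting position only, not a real ratio

def recent_article_ratio(article_recent, other_recent):
    """Share of recent edits made to articles, or None if there are none."""
    total = article_recent + other_recent
    if total == 0:
        return None  # "dead" user: no recent activity to normalize against
    return article_recent / total

def plot_y(article_recent, other_recent):
    """Y coordinate for the normalized scatter chart."""
    ratio = recent_article_ratio(article_recent, other_recent)
    return DEAD_SENTINEL if ratio is None else ratio

print(plot_y(3, 1))  # 0.75
print(plot_y(0, 5))  # 0.0
print(plot_y(0, 0))  # -0.1
```

Keeping the undefined case separate from a true ratio of zero matters here, since users on the X-axis (all recent edits outside article space) are a distinct group from users with no recent edits at all.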

Total activity is no longer shown, and only a few distinct patterns emerge. A number of editors are strung along the X-axis itself; their recent edits have all been to other namespaces. More editors are strung along the Y=1 line at the top; they have made only article edits recently, whatever their past performance. Between these, there is a rather undifferentiated mass about which it can only be said (a) that most users do prefer to edit articles; and (b) that there is a weak correlation between axes, suggesting that preferences are somewhat persistent.



The same dataset is plotted on this bubble chart twice -- each point appears in both series -- but the first series is weighted by K, all time edits; the second by L, recents. This weighting gives more "importance" to more active editors; otherwise, it is identical to the previous chart. The area of each bubble represents the given editor's overall activity. Please keep in mind that each editor is represented by two concentric bubbles, visible or not.

As before, "dead" users are forced to a bogus value; now we can see how much they contributed to the project overall. Some editors found at the top -- those that recently contribute to articles exclusively -- edited other pages more frequently over all time.

It appears that very few significantly active users have preferred strongly to edit outside article space. Those that have, have generally done so for all time; but there are a number of editors well below and to the right of the m=1 green line. These editors recently prefer to edit outside article space more than they did over all time. Most of these editors have not been especially active recently. They are not balanced by equal numbers of prolific editors who have inverted their preference in the opposite direction.

There is a sizable subpopulation just to the left of center; these editors have always slightly preferred other-editing. What makes this group notable is that they continue to be recently active.

Then, the great herd is found in the upper-right quadrant, who have edited articles and still do. The most active users, both recently and over all time, are found here. It's quite clear that the majority settle on an editing pattern and stick to it; both more- and less-active editors correlate well between recent and all-time editing ratios.



Preliminary Conclusions


It would be fatuous to draw any final conclusions from such sparse data. A great deal of further analysis is required, and far more data must be fed to the process. This author will venture only a few, entirely tentative, conclusions -- indications that something is happening.


 * Article count continues to increase exponentially. Between January 2003 and January 2004, article count doubled from about 100,000 to nearly 200,000; the next doubling was reached in less than 10 months. Extrapolation suggests the 1 million mark will be hit during the first half of 2006. Ten yearly doublings would push total article count over the 1 billion mark around 2015.


 * Obviously this cannot continue indefinitely; only a few more decades of such growth would result in a database bigger than the total computing resources of Men. Moore's Law has computing power doubling every 24 months, so even if Moore hits no limit, we will eventually crash into him.


 * It is much more likely that this growth will eventually settle into a logistic curve as some more limited resource is exhausted, to wit, editor time. However, there is no indication of this yet.


 * Mean article size grows less quickly and less steadily. Assuming a continuation of the growth-and-consolidation cycle, article size at the 1 million mark will still be less than 3 KB. This suggests that the bulk of new articles are small in size and that growth in size of older articles is of lesser significance.


 * Articles are being edited much more frequently without adding much more content. Those million articles will have been edited an average of perhaps 30 times each -- more than 1 edit per 100 bytes. For comparison, note that this bullet point alone is nearly 300 bytes in length.


 * We should dearly love to speculate on the nature of this increasingly intense editing, but absent data, must refrain.


 * Per Wikipedian this project is not growing at all -- it is shrinking! This is all the more surprising considering the continuous exponential growth of article count; but it seems that our userbase is growing faster than our database.


 * Why is the absolute number of articles increasing at all? Is it that so very many new users are coming on board? Or is a small core of users doing all the work? Our data is insufficient. It does not help us to examine member activity without specific knowledge of the nature of each edit. A small number of editors may be adding large quantities of content in only a few edits while others repeatedly edit the same articles again and again; or perhaps all editing activity is more or less equally effective in increasing count and length.


 * The number of edits per article is increasing; the number of articles per member is falling; the number of edits per member is rising; also the proportion of "very active" members is falling. As tempting as it is to cry doom based on these indications, the data does not quite support it. Much more detailed analysis must be performed before any sort of firm conclusion can be drawn from these trends. All we can say at this time is that these trends do appear statistically significant: they are not a by-product of noise, sampling error, or failure to normalize.


 * Frankly these trends baffle this author. I expected to find some evidence of the project's exponential growth rolling over into the inevitable logistic curve; I hoped for a leading indicator, perhaps -- a sort of canary. I was startled to find so many peculiar, presently inexplicable trends, especially with inflection apparently unrelated to Rambot.


 * Rambot's footprint is larger than immediately apparent. Our first chart showed a little bump, which we simply ignored when extrapolating article count. The deeper we dig, the more often Rambot appears to surface, and in unexpected ways. This author can find no explanation, for instance, why Rambot's introduction seems to have kicked off a decrease in the proportion of "very active" users.


 * The scatter chart and bubble chart analyses are seriously weakened (as noted in "Limitations") but distinct subpopulations of editors do emerge. These may or may not represent "Tourists" or "Vandals"; "dead", "burnt-out", and "reformed" "Old Heads"; "loud" and "quiet" "Dissidents" (or "Maintainers"), as well as the large mass of common "Soldiers". Absent more detailed data, it cannot even be determined whether any of these groups are increasing or decreasing in strength, let alone what actual functions they serve or roles they play in the project.
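The doubling arithmetic in the first conclusion above can be sketched as below. The one-year doubling period is an assumption taken from the phrase "yearly doublings"; the most recent doubling is reported as taking less than 10 months, so actual dates would shift accordingly. Function names are hypothetical.

```python
import math

def doublings_needed(start, target):
    """Number of doublings required to grow from start to target."""
    return math.log2(target / start)

def project(start, doubling_years, years):
    """Count after a number of years at a fixed doubling period."""
    return start * 2 ** (years / doubling_years)

# Ten yearly doublings from the 1 million mark: 2**10 = 1024 times larger.
print(project(1_000_000, 1.0, 10))                     # 1024000000.0

# Doublings from nearly 200,000 articles to 1 million.
print(round(doublings_needed(200_000, 1_000_000), 2))  # 2.32
```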

Something is happening; what it is, we cannot say without further study based on more and better-qualified data.

-- Xiong 熊 05:49, 2005 May 20 (UTC)

Discussion
Please see Talk.