User:Jrincayc/Wikipedia Growth Paper

I have printed out this version and will hand it in. Mutilate at will, tell me what I did wrong, what to do next, etc.

Abstract
I use a model of Wikipedia to attempt to explain its growth. Unfortunately, while the model I use does have explanatory power, I am unable to explain many of the coefficients.

Introduction to Wikipedia
Wikipedia, in a nutshell, is an online, multilingual encyclopedia which can be edited by anyone with an internet connection. It was begun on January 15, 2001 as an experiment to determine whether a less formal encyclopedia (compared to the more formal Nupedia) could be developed in an `open source' manner (see Britannica or Nupedia? The Future of Free Encyclopedias, http://www.kuro5hin.org/story/2001/7/25/103136/121 ). The most distinctive aspect of it is that almost every page on the site has an `edit this page' link (exceptions being pages like the front page that are especially prone to vandalism). If you click on this link, you are taken to a page where you can edit the article and make any changes that you want. New articles are created by following a link to an article that has no text.

Of course, this also means that vandalism is very easy. Hence, detecting and undoing vandalism must correspondingly be easy. There are two major features that help with this. The first is that a complete record of every edit and every version of every article is kept and made available. As part of this, it is very easy to go to an article, choose one of the older versions, and make it the current version, thereby removing the subsequent vandalism (called a revert). The other feature is that each person who is logged in gets a watchlist that shows when articles they are interested in change. They can then see the exact words that have been changed in the article. This allows edits to Wikipedia to be carefully examined, and reverted if they are vandalism, without incurring a large time cost.

Wikipedia currently has over 350,000 articles, and there are 10 languages with more than 10,000 articles. It is gaining hundreds of new articles a day, and there are around 10,000 edits every day. These are impressive figures for an encyclopedia that depends entirely on volunteer effort. The fact that the entire database of edits is downloadable makes examining Wikipedia further very interesting.

Coase's Penguin
The only mention of Wikipedia in a journal that I have found is the paper Coase's Penguin, or, Linux and the Nature of the Firm (Yochai Benkler, Yale Law Journal, Volume 112, Number 3, December 2002). This paper examines several instances of the creation of freely available informational and cultural works that anyone can contribute to, called peer production. The paper concludes that a major factor in helping these works get created is that transaction costs have been cut substantially compared to firm production or market production. In the context of Wikipedia, the relevant cost that has been cut is determining who is best to work on a given encyclopedia article. Each person who uses Wikipedia has a very good idea of their individual cost and the benefit of improving a particular article. If their individual cost is less than their individual benefit, then the individual can make the improvement. Wikipedia has access to far more individuals than a firm, so it is much more likely that a low cost, high benefit individual can be found. Also, the firm will not be able to costlessly determine the best individual in the firm. Trying to replicate Wikipedia with a market would either involve contracting less optimal individuals, or trying to contract thousands of people for small amounts of work. The search costs and the contracting costs involved with the latter would be huge. So, the peer production that occurs in Wikipedia may very well be the most efficient way to produce an encyclopedia, since the transaction costs of producing it with a firm or by a market are substantially greater.

Effect of edits and authors, Costs and Benefits
An edit that improves an article has two effects:
 * 1) it increases the overall quality of the encyclopedia
 * 2) it increases the quality of the article

The first effect is expected to increase visitors and hence edits. The second effect is expected to decrease the number of visitors capable of improving the article. An edit that is done by a different author is expected to have even more of an effect, since it will bring new ideas and perspectives.

So, as an article gets closer to the perfect article, the benefit of an additional edit will decrease. The cost will still be similar, or may even go up as the number of people capable of improving the article decreases and the amount of rewriting work increases.

The effect on the encyclopedia should be that more high-quality articles will bring in more people to read and potentially edit articles. This effect is in the opposite direction of the effect on the article.

Data gathered
The entire Wikipedia download was first run through a fast preprocessing program to remove the information that I was not interested in. The only information left for each edit was the article title, author name, article checksum, number of links in the article, edit date/time, and flags for the type of the edit and article (such as name-space, redirect, minor edit). The main thing this removed was the article text, which greatly reduced the amount of data that needed to be dealt with.
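A sketch of that preprocessing pass might look like the following. This is not the actual program used; the dict keys and the use of an MD5 checksum are my own assumptions for illustration, and a real pass would parse the dump format rather than take dicts.

```python
import hashlib
import re

def strip_edit(edit):
    """Reduce one raw edit record (a dict with hypothetical keys) to the
    metadata kept for analysis: title, author, checksum, link count,
    timestamp, and type flags.  The article text itself is dropped."""
    text = edit["text"]
    return {
        "title": edit["title"],
        "author": edit["author"],
        # a checksum lets identical versions be matched without keeping the text
        "checksum": hashlib.md5(text.encode("utf-8")).hexdigest(),
        # internal wiki links look like [[Target]] or [[Target|label]]
        "links": len(re.findall(r"\[\[[^\]]+\]\]", text)),
        "timestamp": edit["timestamp"],
        "minor": edit.get("minor", False),
        "redirect": text.lstrip().lower().startswith("#redirect"),
    }
```

For example, `strip_edit({"title": "Example", "author": "A", "timestamp": "2003-01-01", "text": "See [[Foo]] and [[Bar|bars]]."})` keeps only the metadata, with a link count of 2 and no article text.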

The next processes continued to remove extra information that was not needed. First, all non-articles were removed. The definition of an article is the standard Wikipedia definition: an article has at least one internal link, is not a redirect, and is in the main name-space (that is, it is not an image, a talk page, or otherwise). Next, reverts were removed. Any article history with the pattern A, B, C, where A and C had the same checksum and length, and B and C had different authors, was considered a revert, and changes B and C were not counted in any subsequent statistics. The number of reverts was tracked by month and encyclopedia.
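The revert filter can be sketched as below. The (author, checksum, length) tuple representation is my own assumption, not the paper's actual code, and edge cases (such as overlapping revert patterns) are handled in the simplest possible way.

```python
def remove_reverts(history):
    """Drop the A,B,C revert pattern: when C restores A's checksum and
    length, and B and C have different authors, discard B and C, keeping
    only A.  history is a time-ordered list of (author, checksum, length)
    tuples; returns the filtered history and the revert count."""
    kept = []
    reverts = 0
    i = 0
    while i < len(history):
        if (i + 2 < len(history)
                and history[i][1:] == history[i + 2][1:]      # A and C: same checksum/length
                and history[i + 1][0] != history[i + 2][0]):  # B and C: different authors
            kept.append(history[i])  # keep A; B and C are not counted
            reverts += 1
            i += 3
        else:
            kept.append(history[i])
            i += 1
    return kept, reverts
```

So a history of alice, vandal, bob (where bob's version matches alice's exactly) collapses to just alice's version, with one revert counted.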

The last processing step was to get the data into a form that OLS could be run on. For each month, the total number of articles in each of several categories was calculated (an example category would be articles that had 2 to 4 authors and 6 to 10 edits, written as AE:2to4_6to10). The number of bot edits (any edit by a user listed on Wikipedia:Bots) was also calculated, so that bot activity could be accounted for and disregarded.
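The bucketing can be sketched like this. The bucket edges are inferred from the category names used in the regressions below; the merged cells (such as AE:0to1_6toplus) would be applied afterwards and are not reproduced here.

```python
# bucket edges inferred from the AE:... category names in the regressions
AUTHOR_BUCKETS = [(0, 1, "0to1"), (2, 4, "2to4"),
                  (5, 10, "5to10"), (11, None, "11toplus")]
EDIT_BUCKETS = [(0, 1, "0to1"), (2, 3, "2to3"), (4, 5, "4to5"),
                (6, 10, "6to10"), (11, 20, "11to20"), (21, None, "21toplus")]

def bucket(n, buckets):
    """Return the name of the bucket containing n; a None upper edge
    means the bucket is unbounded above."""
    for low, high, name in buckets:
        if n >= low and (high is None or n <= high):
            return name

def ae_category(n_authors, n_edits):
    """Label an article by its author and edit counts,
    e.g. 3 authors and 7 edits -> 'AE:2to4_6to10'."""
    return "AE:%s_%s" % (bucket(n_authors, AUTHOR_BUCKETS),
                         bucket(n_edits, EDIT_BUCKETS))
```

Each month, counting how many articles fall into each label gives one row of regressors per encyclopedia.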

This produced data with the following summary statistics:

Here are the averages in a table, ordered by the number of authors going down and the number of edits going right. Note that the categories were chosen to ensure that each one had a reasonable number of articles, and some were combined for this reason (for example, AE:0to1_6toplus is a combined category).

Equation and Coefficients
&Delta;total = Beta_0 + Beta_1*bot_create + Beta_2*AE:0to1_0to1 + Beta_3*AE:0to1_2to3 + Beta_4*AE:0to1_4to5 + Beta_5*AE:0to1_6toplus + Beta_6*AE:2to4_2to3 + Beta_7*AE:2to4_4to5 + Beta_8*AE:2to4_6to10 + Beta_9*AE:2to4_11toplus + Beta_10*AE:5to10_4to5 + Beta_11*AE:5to10_6to10 + Beta_12*AE:5to10_11to20 + Beta_13*AE:5to10_21toplus + Beta_14*AE:11toplus_11to20 + Beta_15*AE:11toplus_21toplus

On the left is &Delta;total, the change in the total number of articles in a month for a given encyclopedia. The intercept is expected to be positive: the data cover only encyclopedias that have actually been started, and to get started they need to go from zero pages to some pages even when all the variables are zero. The next variable, bot_create, is the number of articles created that month by computer programs, referred to as bots. Its coefficient should be close to one, since one bot-created article in the month will result in one more article being created. (It might possibly spur on human authors, but that is unlikely within a month's time.)

The rest of the variables are variables that are dependent on the structure of the articles. These variables are expected to have coefficients that increase as the number of authors and edits increases, since based on the model of Wikipedia growth, articles with more authors and/or more edits are expected to be of higher quality, and so should draw in more readers, some of whom then proceed to create new articles. On the other hand, as the article ages, more of the links that it has to other articles will be to already existing articles, so I would expect that there will be some decrease in the value of the coefficients as the number of authors and edits increases.

&Delta;edits = Beta_0 + Beta_1*bot_total + Beta_2*AE:0to1_0to1 + Beta_3*AE:0to1_2to3 + Beta_4*AE:0to1_4to5 + Beta_5*AE:0to1_6toplus + Beta_6*AE:2to4_2to3 + Beta_7*AE:2to4_4to5 + Beta_8*AE:2to4_6to10 + Beta_9*AE:2to4_11toplus + Beta_10*AE:5to10_4to5 + Beta_11*AE:5to10_6to10 + Beta_12*AE:5to10_11to20 + Beta_13*AE:5to10_21toplus + Beta_14*AE:11toplus_11to20 + Beta_15*AE:11toplus_21toplus

The second equation tries to predict &Delta;edits, the number of new edits done in a month. The only variable that differs is bot_total, the number of edits done by bots, which is used in place of bot_create. Its coefficient should be one, since one edit by a bot should produce approximately one edit in that month (plus or minus any discouragement or encouragement of human authors). The coefficients on the article categories should be somewhat similar to the ones in the &Delta;total equation, since some of the same effects are occurring. Of course, they should be greater in magnitude than the ones in the &Delta;total equation, since an article only has to be created once, but it has to be edited multiple times to become a high quality article.

In general, I would expect the coefficients to be positive except when one of two things is happening; both depend on the fact that new authors are joining and old authors are leaving. First, if the current mix of articles decreases the number of new authors entering, that has a negative effect on the number of new edits done. So, if the current mix of articles is of poor quality, more potential authors might get discouraged with the poor quality of Wikipedia and never join. On the other hand, poor quality might instead prompt them to start editing; the telltale sign would be that low-quality articles cause more edits to be done and fewer new articles to be created. The other possible cause of negative coefficients is high quality articles, which would tend to discourage new authors since they will find few improvements left to make.
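Both equations are then fit by ordinary least squares. As a sketch of the mechanics (a single regressor for brevity, where the actual regressions have fifteen plus the bot variable, and toy data standing in for the real monthly counts):

```python
def simple_ols(y, x):
    """Single-regressor OLS: fit y = b0 + b1*x by least squares.
    Returns (b0, b1, r2), where r2 is the usual R-squared."""
    n = len(y)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)          # variation in x
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx                                 # slope
    b0 = my - b1 * mx                              # intercept
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    tss = sum((yi - my) ** 2 for yi in y)          # total sum of squares
    r2 = 1.0 - sum(r ** 2 for r in resid) / tss
    return b0, b1, r2
```

With y = 5 + 2x exactly, this recovers b0 = 5, b1 = 2, and an R2 of 1.0; the multivariate case used in the paper works the same way, just with a matrix solve in place of the closed-form slope.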

The Regression
Both equations were regressed on the data. Below is the &Delta;total result:

R2 = 0.8664

Well, it has a reasonably high R2, the intercept is positive, and the value for bot_create is close to one. Other than that, I have to say the values of the coefficients surprise me, and I have no good story to explain them. The only two that are significant at a 95% confidence level and positive are one author with 2 to 3 edits (AE:0to1_2to3) and 11 or more authors with 21 or more edits (AE:11toplus_21toplus). It is possible that the former demonstrates some kind of new article with lots of empty links, and the latter demonstrates the high-quality-encyclopedia attraction effect, but it is also possible that the data are just biased by something else. Some others that are significant and negative, such as AE:5to10_21toplus and AE:11toplus_11to20, do not follow a pattern that I can see.

Below are the structural coefficients arranged in a table:

The &Delta;edits regression yielded similarly puzzling results presented below:

R2 = 0.9597

Well, it has an even higher R2, a positive intercept, and the right value on the coefficient for bot_total. On the other hand, I can't think of a good explanation for the coefficients on AE:5to10_4to5 (+), AE:5to10_6to10 (-), AE:5to10_11to20 (+), AE:5to10_21toplus (-), and AE:11toplus_11to20 (-). Also, the value on AE:5to10_21toplus seems much lower than I would expect. I am quite suspicious that some of the coefficients are picking up an omitted variable bias, since they seem inexplicable. My best guess for the candidate is some kind of large-encyclopedia effect that is affecting the higher edit and author counts.

Conclusions
Something odd is happening with the data. The model explains quite a bit of the variation, but on the other hand, I would not have expected the signs on the coefficients that I have seen. The aggregate data I am using does not give sufficient insight to explain these values; I suspect I will have to work more closely with the individual article-level data to explain some of the effects seen.


 * the data
 * the programs used