Wikipedia:Researching Wikipedia

Researching Wikipedia (formerly known as State of Wikipedia) discusses some ways to quantitatively measure various aspects of the Wikipedia project and covers research done in that area. The subject is difficult, as Wikipedia may have different goals, and there are different ways of measuring the achievement of those goals.

Raw numbers
A simple, if crude, way of measuring success is to count the number of articles in Wikipedia. This information can be found on the Statistics page. The problem with just counting articles is: what is an "article"? A large percentage of our "articles" may be extremely short stubs, or may even consist of nothing but uncaught vandalism. Moreover, merging stubby articles leads to fewer, better articles without losing any content, so the count can fall while the encyclopedia improves. A more accurate measure of the size of Wikipedia is the number of characters or words in articles. As of October 2006, Wikipedia had 1.4 million articles with an average length of 3,300 characters.
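The difference between counting articles and counting characters can be shown with a small sketch. The article lengths below are made-up illustrative values, not real data, and `size_summary` is a hypothetical helper name; only the 1.4 million and 3,300 figures come from the text above.

```python
from statistics import mean


def size_summary(lengths):
    """Return (article count, total characters, mean length) for a list of article lengths."""
    return len(lengths), sum(lengths), mean(lengths)


# Hypothetical article lengths in characters (illustrative only).
before_merge = [200, 300, 5000, 8000]
# Merging the two stubs into one article keeps every character.
after_merge = [500, 5000, 8000]

count_before, chars_before, mean_before = size_summary(before_merge)
count_after, chars_after, mean_after = size_summary(after_merge)

# Rough total size implied by the October 2006 figures quoted above:
# 1.4 million articles times an average of 3,300 characters each.
estimated_total_chars = 1_400_000 * 3_300

print(count_before, count_after)   # article count drops after the merge
print(chars_before, chars_after)   # character total is unchanged
print(estimated_total_chars)
```

In this sketch the article count falls when the stubs are merged while the character total stays the same, which is why a character or word count is the steadier measure of size.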

Such a measurement gives no indication of the quality of content. It is much more difficult to estimate the number of good, useful, accurate, or balanced articles in Wikipedia. For this, we can only take into account articles that have been assessed in some way, whether as "featured", "good", "A-Class" or "B-Class" articles. As of February 2007, about one in 550 articles on Wikipedia is either "featured" or "good".

One way to think about the Statistics page is to consider it a measure of Wikipedia's success as a project rather than as a reference work. Since it is a project for producing a reference work (with community building being a side effect, not a secondary goal), assessing the success of the project is directly tied to assessing the reference work.

Relevance to the Web
Another way to consider Wikipedia's success is to ask how relevant Wikipedia's information is to the World Wide Web. How many hits per day does the Wikipedia site receive? How many readers come from Google? Which pages have a high Google PageRank?

A measure of Wikipedia's popularity is provided by its entry on Alexa, which shows its web traffic rankings.

One measure that is valuable, but difficult to automate, is to consider Top 10 Google hits. Of the subjects already in Wikipedia, how many have articles that are good enough references to rank highly on Google?

Yet another measurement might involve the number of, or degree to which, other sites use Wikipedia's content. The fact that a number of other sites trust the accuracy of Wikipedia's content is a strong indicator of its success.

Coverage
Another axis to consider is Wikipedia's coverage. Coverage is a measure of how much of the information we need in Wikipedia is already there. How well does Wikipedia "cover" the range of knowledge that it should?

One way to think of coverage is to imagine some kind of "endpoint" in the future – Edit Zero – where all the information that's Wikipedia-worthy is in the system. At that point, the work of Wikipedians will change from writing about existing subjects, to adding articles about new subjects as new people, events, countries, awards ceremonies, species, albums, books, and planets come into being. A measure of Wikipedia's current coverage would be to measure how many of the articles in that imagined encyclopedia already exist in some useful form.

This is, in most ways, an immeasurable metric. We don't know how many articles will be in Wikipedia at Edit Zero, so we can't know what percentage of those we already have. The best we can hope to do is approximate the "real" coverage metric with some ad hoc measurements.

Some proposed approximations:


 * Of the entries in the 1911 Encyclopædia Britannica, how many have corresponding Wikipedia articles? (Totally crude, but if we were back in 1911, wouldn't we want to have at least as much knowledge as the EB? Close to it?)
 * What percentage of Wikipedia searches come up empty? (This would measure what percentage of things Wikipedia readers think should be in the system are already there.)
 * Of the internal links inside Wikipedia, what percentage point nowhere? How many have non-stub articles at the endpoint? (This would measure what percentage of things Wikipedia authors think should be in the system are already there.)
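The link-based approximation in the last bullet can be sketched as a pair of ratios. This is only a minimal illustration under stated assumptions: the function name `link_coverage`, the 1,500-character stub threshold, and the sample titles and lengths are all hypothetical and are not taken from any Wikipedia tool or policy.

```python
def link_coverage(link_targets, article_lengths, stub_threshold=1500):
    """Approximate coverage from internal links.

    link_targets: titles that internal links point to (hypothetical input).
    article_lengths: dict mapping existing article titles to length in characters.
    stub_threshold: assumed cut-off below which an article counts as a stub.

    Returns (share of links with an existing target,
             share of links with a non-stub target).
    """
    targets = list(link_targets)
    if not targets:
        raise ValueError("no links to measure")
    existing = [t for t in targets if t in article_lengths]
    non_stub = [t for t in existing if article_lengths[t] >= stub_threshold]
    return len(existing) / len(targets), len(non_stub) / len(targets)


# Made-up sample: four linked titles, three of which exist, one of them a stub.
links = ["A", "B", "C", "D"]
articles = {"A": 4000, "B": 300, "C": 2500}

exists_share, non_stub_share = link_coverage(links, articles)
print(exists_share, non_stub_share)
```

The same ratio shape would apply to the other two bullets (Britannica entries with a matching article, or searches that return a result), given a suitable list of titles or queries.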

Note that the Edit Zero model is simplistic in expecting the number of Wikipedia-worthy articles to converge at some point in the future.

List of conducted studies and other resources
Wikipedia (primarily) and other Wikimedia projects are increasingly generating research concerned with studying phenomena responsible for their functioning. Some of that research has been published in professional academic journals, or presented at conferences: see Academic studies of Wikipedia.

However, a significant number of other inquiries are not published in such journals. As a result, the Wikipedia namespace on Wikipedia, as well as some pages on our Meta wiki and likely on other projects, has been increasingly filled with short research papers, essays and other resources. meta:Research is the place where such research is supposed to be coordinated, but in fact the majority of tools and papers can be found on the English Wikipedia. Below is a guide to those resources.

Category:Wikipedia statistics
Note 1: The most interesting and more-or-less up-to-date projects are bolded.

Note 2: Graphs, charts and the like should be added to Category:Wikipedia charts.

Keywords:
 * Editors: about editors
 * Users: about users
 * Articles: about articles
 * Technical: technical aspects of the projects (software, code...)

Category:Wikipedia resources for researchers
Category description:

This category aims to include resources for researchers in two capacities:
 * 1) Using Wikipedia as a research tool (see Researching with Wikipedia)
 * 2) About Wikipedia as a research subject (see Research)
We are interested in the second subcategory, which surprisingly has very few pages.

Category:Wikipedia tools
The following tools are useful for research and statistical analysis of Wikipedia and related projects.