User:Smallbones/Article quality prelim

Article quality on Wikipedia - a preliminary investigation

TL;DR: Wikipedia needs a method to answer questions such as “Is Wikipedia getting better?” or “Is any increase in average article quality due to better new articles, or to old articles getting better?” This exploratory data analysis expands upon a method proposed by Jimmy Wales: selecting random articles and comparing their quality now to earlier versions. 100 pairs of a current article and its 2 year old version are examined, along with 13 articles less than 2 years old. The stub-FA class ratings do not appear to be useful for this analysis. A proposed rating system was flawed, but may be adjusted for future use. The small increase in average article quality appears to be driven both by higher quality for new articles and by an increase in the quality of old articles. Article quality, as well as page views, varies across subject topics, but changes in the composition of Wikipedia by subject topic appear to be minimal. Improvements to the method used are discussed, along with a potential use of the method.

On September 6, 2015, a deceptively simple question was asked on Jimmy Wales’s talk page: "Is Wikipedia getting better?" The following day Wales proposed a simple method to answer it. "My favorite way of checking this is to "click random article" on 10 articles, and go back and look at them a year ago, 5 years ago, 10 years ago. Every time I have tried, it's unambiguous: Wikipedia is getting better by this test.--Jimbo Wales (talk) 08:28, 7 September 2015 (UTC)"

Several editors objected to this method, but the discussion cited few statistics on article quality. Wikipedia does not seem to keep track of article quality very well.

The simplicity of Wales’s method should be its strength. A basic "test of proportions" should give a credible answer if we can operationally define quality, and randomly sample enough articles. If the proportion of articles that have improved in quality is greater than 2 times the standard error, then we can conclude that on average articles are improving in quality. With a sample size of n=100, the standard error is 5%, at n=400 the s.e. = 2.5%. While n=400 would be a more powerful test, the usefulness of this test also depends on the time editors have to spend on it.
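As a minimal sketch of the arithmetic above (the function name and interface are my own, not from the discussion), the worst-case standard error and the two-standard-error threshold can be computed as:

```python
from math import sqrt

def improvement_test(n_improved, n_sampled):
    """Return (proportion improved, worst-case standard error,
    whether the proportion exceeds two standard errors)."""
    p = n_improved / n_sampled
    # The standard error of a proportion, sqrt(p*(1-p)/n), is largest
    # at p = 0.5: sqrt(0.25/n) = 5% at n=100 and 2.5% at n=400.
    se = sqrt(0.25 / n_sampled)
    return p, se, p > 2 * se
```

For example, `improvement_test(15, 100)` gives a 15% improvement rate against a 5% standard error, so the improvement would be detectable; 8 improved articles out of 100 would not clear the two-standard-error bar.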

Data collection
For this exploratory analysis n=100 pairs of a current article and its two year old version were examined. The sample was selected using the “random article” function that appears in the left-hand column on every Wikipedia page. 13 articles that were randomly selected during this process were less than 2 years old, and this sample is also examined.

Each article was placed into one of 13 subject categories or sub-categories, and data was gathered for each article on the assigned quality class (Stub-GA or unassessed) for the current article on September 13, 2015 and the article as it appeared on September 13, 2013. Page view data was gathered for the month of August 2015, and both the current article quality and the 2 year old article quality were assessed using the method briefly described below.


 * Quality level 0 - The article was less than 3 sentences long or had fewer than 2 sources in any format, including inline citations, bibliographic entries, and external links.
 * Level 1 - The article had at least 3 sentences and 2 sources in any format but did not qualify as level 2.
 * Level 2 - The article had at least 9 sentences and 5 inline citations but did not qualify as level 3.
 * Level 3 - The article had at least 13 sentences and 10 inline citations.

More detailed criteria were proposed before the rating began, but in practice almost all articles were rated by the simplified criteria above.
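The simplified criteria can be expressed as a single function (a sketch; the parameter names are my own):

```python
def quality_level(sentences, sources, inline_citations):
    """Rate an article 0-3 under the simplified criteria.

    sentences        -- number of sentences in the article
    sources          -- sources in any format (inline citations,
                        bibliographic entries, external links)
    inline_citations -- inline citations only
    """
    if sentences >= 13 and inline_citations >= 10:
        return 3
    if sentences >= 9 and inline_citations >= 5:
        return 2
    if sentences >= 3 and sources >= 2:
        return 1
    return 0
```

Checking each level in turn, from the most demanding down, means an article that meets the level 2 thresholds but not the level 3 thresholds is rated 2, as the criteria intend.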

Article subjects were categorized as follows:


 * Biography, with subcategories based on living or dead, and male or female
    * BLP, M; BLP, W; BDP, M; BDP, W
 * Biology, health, and medicine
 * Business, products and services
 * Culture and arts
    * CA, 1990- (before 1991); CA, 1991+ (after 1990)
 * Geography
    * GEO, W (western hemisphere); GEO, E (eastern hemisphere)
 * Hard sciences, technology, and math
 * History, politics and government
 * Other society, sports, religion, philosophy and social science

First, all biographies were placed in the biography category; then all business articles were placed in the business category; then all geography articles in the geography category. Following this order, almost no article subjects were difficult to categorize.

My work product is available at User:Smallbones/Wales method and the data as gathered is at User:Smallbones/Wales method. Other editors are invited to copy the data to their own user pages, apply their own quality assessment methods, or otherwise review or add to the data as they see fit. I ask that they report the results, if only on the talk page here.

Is the data representative?
The distribution of the stub-FA quality classes is compared to data taken from

Though the percentages of Start and unassessed articles are both about 5% higher than expected and the percentage of Stubs is 8% lower than expected, these differences are not unexpected given the small sample size. Similarly, the proportions in the subject categories shown below are roughly similar to those in 2008 and 2006 studies. Note that neither of those studies used exactly the same categories.

Quality results
The currently used method of classifying article quality is the Stub-GA class system. This method is not useful for this analysis since almost no articles were reclassified during the 2 year period. One article was reclassified from Stub to Start. Five articles which were unassessed at the start of the period were later assessed. The large majority of the articles were only assessed once in their lifetime. Class assignment also seems inconsistent. Consider the Start-class article Roszkowice, Choszczno County and the stub-class article Semera, both articles on geography.

The proposed quality levels system performed only slightly better. 41% of the 2 year old articles were rated at level 0, and 45% at level 1, so that, in effect, the system only had 2 levels rather than 4. Seven articles increased by one level, one article increased by 2 levels, and one article decreased by one level. If the articles were originally spread out over the 4 levels, there would likely be more variation seen over time. Rather than adjust the quality level criteria after viewing these results, I decided that the measure should be adjusted for a future study when analyzing a new sample of at least 400 articles.

Overall the 100 articles increased on average from a quality level of 0.77 to 0.85 over the 2 years. Articles which were new had an average quality level of 1.23, raising the ending quality level for all 113 articles to 0.89. Using this flawed measure of quality level we might conclude that the average increase in the quality of the 2 year old articles (.85 - .77 = .08) was twice as important as the increase due to new articles (.89 - .85 = .04), though a new test with an adjusted quality measure and a larger sample size would be needed to state this with any confidence.
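As a quick arithmetic check on the figures above, the ending average for all 113 articles follows from weighting the two groups by their counts:

```python
n_old, n_new = 100, 13          # 2-year-old pairs and newer articles
q_old_end, q_new = 0.85, 1.23   # average ending quality levels
blended = (n_old * q_old_end + n_new * q_new) / (n_old + n_new)
# blended rounds to 0.89, the ending level reported for all 113 articles
```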

The following table breaks down the average quality level by article subject.

Ten of the 13 categories and subcategories had an increase in average quality level, and one had no change, so it appears that quality generally increased across categories. The largest increases can be seen for biographies of men. (Unsurprisingly the biographies of men and of women differ in several ways.) Other than male biographies, no patterns are apparent in the rating variable, though a much larger sample size would be needed to accurately compare across subject categories.

Large differences are apparent, however, in page views across subject categories. It is no surprise that articles on popular culture in the CA, 1991+ subcategory had the highest number of page views. The high page views for GEO, E are due to a single article.

Page views on Wikipedia are generally believed to follow a power law distribution, where a few articles account for a large percentage of page views. The page view data in the sample are consistent with this belief. The following table shows page views broken down by quintiles (22 or 23 articles). The top quintile consists of the 22 articles with the highest number of page views and the lowest quintile consists of the 22 articles with the fewest page views. The average quality measure is shown for each quintile.

The top quintile accounts for about 86% of all page views and has by far the highest quality level. Average quality declines as the number of page views declines but is fairly steady in the 3 middle quintiles. The lowest quintile, however, accounts for less than 1% of page views and has an average quality level far below the 2nd quintile.
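A sketch of the quintile computation described above. The text does not say where the odd-sized groups sit, beyond noting that the top and bottom quintiles each hold 22 articles, so here the extra articles are placed in the middle groups (an assumption):

```python
def quintile_view_shares(page_views):
    """Sort articles by page views (highest first), split them into
    five near-equal groups, and return each group's share of views."""
    views = sorted(page_views, reverse=True)
    n, total = len(views), sum(views)
    base, extra = divmod(n, 5)
    # e.g. n=113 -> group sizes [22, 23, 23, 23, 22]
    sizes = [base] * 5
    for k in range(extra):
        sizes[1 + k] += 1   # grow the middle groups first
    shares, i = [], 0
    for size in sizes:
        shares.append(sum(views[i:i + size]) / total)
        i += size
    return shares
```

Fed synthetic power-law data (each article getting half the views of the one above it), the top quintile dominates the total, which is the pattern the sample shows.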

Discussion
While a small sample size and a poorly designed quality level measurement prevent us from making any firm conclusions, it is clear that the general method proposed by Jimmy Wales and extended here can address questions like:
 * Is any increase in average article quality due to better new articles, or to old articles getting better?
 * Does article quality or changes in quality vary according to article subject?
 * Are page views related to article quality?

The analysis suggests that article quality is increasing, driven more by improvements to old articles but also by higher quality in new articles. Biographies of men might be increasing in quality faster than other categories, and quality appears to be strongly related to page views.

The answers to these and similar questions may inform policy discussions or the implementation of new projects by the Wikipedia community and the Wikimedia Foundation as they try to address the question of how to increase article quality or understand the process of how quality develops in an article. As new questions arise, new variables may be added to the analysis and the straightforward methods used here may be modified.

Perhaps the most surprising aspect of this analysis is the number of very low quality “sub-stub” articles on Wikipedia. 41% of the sample articles did not contain at least three sentences and two sources. Given that 53% of all articles are classified as stubs, maybe this shouldn’t have been a surprise. But I suspect that most regular editors, myself included, spend most of their editing time on articles well above the sub-stub level. This analysis should serve to remind editors of the many low quality forgotten articles.

After almost 15 years of articles added by mostly anonymous editors, I suspect that Wikipedia has accumulated a lot of dust and debris, which is hiding under the bed where few people see it! One possible solution to this problem would be to have a program to improve, merge or delete many of these articles. This analysis suggests that 20% of all articles could be selected using very rudimentary methods and deleted while decreasing page views by less than 1% and greatly increasing average quality. Of course a more rigorous study would be needed before we should take such action.

I propose that another study be conducted with at least 400 random article pairs. If comparisons of article quality across subject topics are included, 800 random pairs may be needed. The quality level measure should be adjusted to be more sensitive to small changes in quality. I suggest a 5 level system with the requirement for advancing from the lowest level to be very basic, say two sentences and one source in any format. This would allow us to see if the lowest quality articles on Wikipedia are making any improvement. The second lowest quality category should also be narrowed, perhaps by only requiring 3 inline citations to advance to the 3rd level.

A time period of 4 years would allow us to see more quality changes as well, and would increase the sample size of the new articles (less than 4 years old).

The factors considered in assessing the quality levels should also be increased, though still kept easy to identify, or perhaps even be measured by bots. The number of sentences and the number of sources or inline citations are clearly part of perceived quality on Wikipedia, but there is more to quality than just size and citation. A third easily observable variable should be included, perhaps the number of editors in the last year (or average number of editors per year of life), or average number of edits per year.

I propose that a WikiProject be formed to carry out such investigations. Editors could propose research questions that they consider to be important, and participate in gathering the data and reporting the results. Cooperation with WMF efforts may greatly reduce the amount of time needed to do the required work. Currently, if a sample of 400 article pairs is needed, it might take 5 editors about 12 minutes per day to gather the data and rate 4 articles each. In the course of a month a sample of adequate size would be gathered and a record of how quality changes each month could be built.