User:Smallbones/1000 random results

What topics does Wikipedia cover? Has the coverage changed over time? These and similar questions are examined using 1001 random articles sampled in December 2015. The 18 categories and subcategories are dominated by biographies (27.8% of all articles), with biographies of men (23.8%) being 5.8 times as common as biographies of women (4.1%). Sports biographies (9.5%) form over a third of all biographies. All articles on sports total 16.0% of the sample. Geography (17.8%) and Culture & Arts (15.8%) are other major categories, with Science & Math (3.5%) having the lowest proportion of the categories in the sample. These proportions are fairly stable over time. Almost 90% of page views go to the 20% of articles with the highest page views.

The sample is intended to be used to investigate additional questions as they arise, with additional data gathered as necessary. As an example, 27 matched pairs of biographies of men and women were formed, and the growth of these articles, in bytes and in words of text, was recorded over the 5 years after they were started. The results suggest that, on average, all bios more than double in size over 5 years, but that bios of men start about 30% larger than bios of women and grow at a slightly faster rate. Modified data from ORES was also examined and appears to be a useful resource for addressing these and similar questions.

The sample
Using the random article function on Wikipedia, 1003 articles were selected during December 2015. Disambiguation pages and lists were not included in the sample. Data was gathered on the date of the first edit and November 2015 page views. Before any data could be gathered one article was deleted at AfD and another was merged. Both these articles were discarded from the sample.

See User:Smallbones/1000 random for the full sample.

Defining the categories
The categories are mutually exclusive, except for "sport" which is an overlay on the biography and society categories.
 * A biography includes any article on an individual, and is divided by living/deceased and male/female (BLP, M; BLP, F; BDP, M; BDP, F), and sports/non-sports. There were no articles in the sample on deceased sportswomen.
 * Biology, health, and medicine (BHM)
 * Business, products and services (BUS)
 * Culture and arts
   * on topics beginning in 1990 and earlier (CA, 1990-); and beginning 1991 and later (CA, 1991+) ("classical" vs. current)
 * Geography
   * for the Americas, aka the Western Hemisphere (GEO, W); for the Eastern Hemisphere (GEO, E)
 * History, politics and government (HIST)
 * Other society, sports, religion, philosophy and social science (SOC)
   * the subcategory on sports includes articles on teams, seasons, and leagues: anything on sports that is not a biography
 * Hard sciences, technology, and math (SCI)

The margin of error for sample proportions is generally given as ±2 standard errors. The SE for a category where the true population proportion is 10% in a sample of 1001 is Sqrt[1001(.10)(.90)]/1001 ≈ 1.0%, thus the margin of error is about 2%.
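The margin-of-error arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the function name `margin_of_error` and the parameter `z` are illustrative choices, and the formula is the usual standard error of a sample proportion, which is algebraically the same as the expression given above.

```python
import math

def margin_of_error(p, n, z=2.0):
    """Return (standard error, margin of error) for a sample proportion.

    p: assumed true population proportion (e.g. 0.10)
    n: sample size (1001 articles in this sample)
    z: number of standard errors used for the margin (2 in the text)
    """
    # sqrt(n * p * (1 - p)) / n simplifies to sqrt(p * (1 - p) / n)
    se = math.sqrt(p * (1 - p) / n)
    return se, z * se

se, moe = margin_of_error(0.10, 1001)
print(f"SE = {se:.2%}, margin of error = {moe:.2%}")
```

The same function can be called with other proportions, for example 0.278 for biographies, to see how the margin widens as the proportion moves toward 50%.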

The sample at User:Smallbones/1000 random was taken in December 2015

Change over time?
Articles were classified into 5 uneven time periods by the date they were started: before 2006, 2006-2007, 2008-2009, 2010-2012, 2013-2015. These periods roughly coincide with the times that the 1st million, 2nd million, 3rd million, 4th million, and 5th million articles were begun. Thus the expected number of articles in each period is approximately equal.
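As a small illustration of the classification, the period boundaries can be written as a lookup on the year of an article's first edit. The names `PERIODS` and `period_for` are hypothetical, and the sketch assumes that only the year is needed to assign a period.

```python
# (last year included in the period, label), in chronological order.
PERIODS = [
    (2005, "before 2006"),
    (2007, "2006-2007"),
    (2009, "2008-2009"),
    (2012, "2010-2012"),
    (2015, "2013-2015"),
]

def period_for(first_edit_year):
    """Return the time-period label for the year an article was started."""
    for last_year, label in PERIODS:
        if first_edit_year <= last_year:
            return label
    # The sample was drawn in December 2015, so later years should not occur.
    raise ValueError(f"year {first_edit_year} is outside the sampled range")

print(period_for(2004), "|", period_for(2009), "|", period_for(2014))
```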

While there are some changes in the number of articles in each category over time, there is little to suggest major changes taking place. The two most obvious features are the low number of articles in the sample started before 2006, and a large jump in geographic articles begun in 2008-2009, accompanied by a similarly sized fall in the number of Culture and Arts articles begun at this time.

The number of articles begun before 2006 is lower than in the following period for all six of the categories shown. One explanation might be that the definition of articles used by Wikipedia has changed since Wikipedia announced that it had over 1 million articles, and that many of these articles are no longer included in the encyclopedia. Alternatively many of those articles could have been deleted since that time in a much greater proportion than later articles.

The jump in the number of articles begun on geography, with a similar fall in the number of articles begun on culture and arts, during the middle period is more difficult to explain. One editor suggests that during this time arts and culture articles were being discouraged because they were perceived as low quality and commercially oriented. Editors who would have begun these articles perhaps turned to creating articles on geography instead.

Other than these two anomalies, the number in each category appears stable over time. The number of biographies begun since 2006 seems to be especially stable. Articles begun on geography, and perhaps on society, might be seen as falling since 2009, but this might also just be random fluctuation in the data.

Page views by quintile
The number of times an article is viewed is often described as following a power law - a few of the most viewed articles account for a very large proportion of total page views. The following table demonstrates this relation and suggests that 20% of Wikipedia's articles could be deleted with less than a 1% reduction in page views.

Of the 200 articles in the lowest quintile 29.5% were on the geography of the Eastern Hemisphere, 26.5% were on biographies, and 18.5% were on biology, health, or medicine.
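A quintile breakdown of this kind can be computed as sketched below. The function name `quintile_shares` is illustrative, and the page-view counts in the example are made up to show a skewed distribution; they are not taken from the sample.

```python
def quintile_shares(page_views):
    """Split articles into five equal-sized groups by page views (lowest
    first) and return each group's share of total page views."""
    ranked = sorted(page_views)
    n = len(ranked)
    total = sum(ranked)
    if total == 0:
        raise ValueError("total page views must be positive")
    shares = []
    for q in range(5):
        lo = q * n // 5          # first index of this quintile
        hi = (q + 1) * n // 5    # one past the last index
        shares.append(sum(ranked[lo:hi]) / total)
    return shares

# Made-up, heavily skewed page-view counts for ten articles.
views = [1, 2, 3, 5, 8, 20, 40, 100, 400, 1500]
shares = quintile_shares(views)
for i, share in enumerate(shares, start=1):
    print(f"quintile {i}: {share:.1%} of page views")
```

Applied to the November 2015 page views of the 1001 sampled articles, the last value returned would correspond to the share of views going to the most-viewed 20% of articles.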

Conclusions
As an editor I sometimes have questions that might be addressed with some basic statistics, such as "What's in Wikipedia?"

A random sample of 1001 articles was described and categorized, showing what type of articles appear in Wikipedia. Basic analysis suggests that the categories are fairly stable over time, and that the page view distribution across articles follows a power law.

This data set may be useful for addressing similar questions from other editors, though additional data may need to be gathered depending on the question. Please feel free to use this data in any way you wish, but please drop me a note about your conclusions. If you want some help in gathering more data, just ask.

As an example of the type of investigation that can be conducted using this data, the question of how biographical articles improve over time is addressed, together with a comparison of women's and men's biographies.

Biographies, women and men
The full sample has 5.8 times as many bios of men (237) as of women (41). For non-sports bios the ratio of men to women is 5.1 to 1; for sports bios, 7.6 to 1; for bios of deceased persons, 8.6 to 1.

These differences may reflect a relative lack of source material on women, especially in sports.

Average page views (apvs) in the sample follow a similar pattern: the ratio is 2.5 to 1 for all bios of men (1146 apvs) to all bios of women (456 apvs); 2.2 to 1 for non-sports bios; 2.8 to 1 for sports bios; 5.9 to 1 for deceased persons.

The differences in apvs may reflect a lack of interest in women's biographies or, alternatively, a lack of awareness that information on them is available, with more interest or awareness in the case of living women. Non-sports bios typically have more apvs than sports bios, with a ratio of 5.6 to 1; for men the ratio for non-sports to sports apvs is 5.8 to 1; for women 5.5 to 1.

Matched pair data subset
To address the question of how articles on women and men are formed and grow over time, matched pairs of bios of men and women were formed for articles begun in the period 2005-2010. Each of the 27 biographies of women started during this period was matched with a male bio started during the same year, with the same living/deceased status and same sports/non-sports status. Seventeen of the 27 start dates for the pairs were matched within 30 days, the longest difference being 103 days. The size of the article in bytes and other indications of quality were recorded 1 month after the first edit, and 13, 25, 37, 49, and 61 months after the first edit.

Using Wikipedia's article assessment ratings, the women's bios in the matched pairs are currently rated somewhat lower than men's bios. However, the number of articles with photos or other illustrations, and the total number of illustrations might indicate a very slightly higher quality for women's bios.

Article size in bytes was averaged for all bios of men and for all bios of women at the 6 dates relative to the article's first edit, showing how, on average, the articles grew over time. Article size more than doubled for both types of bios. Men's bios were 31% larger than women's bios one month after the first edit and 53% larger 5 years later.

To check this result, the number of words in the text of the article, excluding info boxes, categories, bibliographies, filmographies, and lists of achievements, was recorded for 1 month after the first edit and 61 months after the first edit. The averages for the 27 matched pairs are recorded in the table below.

The amount of text in men's bios is only 8% greater than in women's bios 1 month after the first edit, but is 71% greater 5 years later. The amount of text in men's bios grows 140%, similar to the growth in article size in bytes (137%). For women the growth in text is only 51%, much lower than the growth in bytes of 103%.
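The reported gaps and growth rates can be checked against each other with a short calculation: if men's bios start a given fraction larger and each group then grows by its own rate, the later gap follows directly. The function name `later_gap` is illustrative, and because the inputs are the rounded percentages quoted above, the results agree with the reported 53% and 71% only approximately.

```python
def later_gap(initial_gap, growth_men, growth_women):
    """Relative size gap (men's bios vs. women's bios) after both have grown.

    initial_gap:  how much larger men's bios are at the start (0.31 = 31%)
    growth_men:   growth of men's bios over the period (1.37 = 137%)
    growth_women: growth of women's bios over the period (1.03 = 103%)
    """
    return (1 + initial_gap) * (1 + growth_men) / (1 + growth_women) - 1

bytes_gap = later_gap(0.31, 1.37, 1.03)   # article size in bytes
words_gap = later_gap(0.08, 1.40, 0.51)   # words of text
print(f"bytes: men's bios about {bytes_gap:.0%} larger after 5 years")
print(f"words: men's bios about {words_gap:.0%} larger after 5 years")
```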

Using ORES as a quality measure
Finally, a fairly new estimate of article quality was used to examine the matched pairs data. ORES is a machine-learning system that uses data on article length, internal and external links, and other measures to predict how an editor will rate the quality of an article according to Wikipedia's article assessment ratings (Stub, Start, C, B, GA, and FA).

Slightly over 50% of articles are rated as stubs by editors and the ORES predictions share the same characteristic. This makes the examination of quality changes for low-rated articles difficult. Therefore, I made a simple modification to the ORES output to include an additional class, "sub-stub". ORES output includes a probability that an article will be rated as a stub. Based on an informal survey of about 2 dozen articles, I classified articles with a stub probability greater than 0.70 as sub-stubs, a threshold intended to reclassify about one-half of stubs as sub-stubs. The final quality classes I use range from 1 to 7:

1 (sub-stub); 2 (stub); 3 (start); 4 (C); 5 (B); 6 (GA); and 7 (FA).
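A minimal sketch of this modification is given below. It assumes that the predicted class label and the stub probability have already been extracted from the ORES output, and that the sub-stub rule applies only to articles predicted to be stubs; the names `CLASS_SCORES`, `SUB_STUB_THRESHOLD`, and `modified_class` are hypothetical and are not part of ORES.

```python
# Class scores 2-7 for the standard assessment classes; 1 is the added sub-stub.
CLASS_SCORES = {"Stub": 2, "Start": 3, "C": 4, "B": 5, "GA": 6, "FA": 7}
SUB_STUB_THRESHOLD = 0.70  # threshold chosen in the text

def modified_class(prediction, stub_probability, threshold=SUB_STUB_THRESHOLD):
    """Map a predicted class and stub probability to a 1-7 class score.

    prediction:       predicted assessment class, e.g. "Stub" or "Start"
    stub_probability: predicted probability that the article is a stub
    """
    # Assumption: only predicted stubs can be demoted to sub-stub.
    if prediction == "Stub" and stub_probability > threshold:
        return 1
    return CLASS_SCORES[prediction]

print(modified_class("Stub", 0.85))   # confidently a stub: sub-stub
print(modified_class("Stub", 0.55))   # less confidently a stub: stub
print(modified_class("Start", 0.30))  # unaffected by the modification
```

Passing the threshold as a parameter makes it easy to see how many stubs would be reclassified under a different cut-off.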

The tables below show how the 54 articles in the matched pair data were rated by the modified ORES output. One month after the articles' first edits, 28 articles were rated class 1 (sub-stubs), but five years later only 10 remained sub-stubs; the other 18 of these 28 (64.3%) increased by at least one class in quality. Of the 7 articles rated class 2 (stubs) one month after their first edits, 4 increased by at least one quality class. The remaining 3 stubs were joined by 7 former sub-stubs after 5 years, giving a total of 10 stubs. Four of 18 articles originally rated as class 3 (start) increased by 1 class to class 4 (C). Thus almost half (48.1%) of the articles rated by the modified ORES data show improvement over 5 years.
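The percentages in this paragraph can be recomputed from the stated counts, as in the sketch below. The three starting classes listed account for 53 of the 54 articles; the starting class of the remaining article is not stated, and the 48.1% figure is reproduced only under the assumption that this article did not improve. The variable names are illustrative.

```python
# (improved, total) counts per starting class, as stated in the text.
improved_by_class = {
    "sub-stub": (18, 28),
    "stub": (4, 7),
    "start": (4, 18),
}
TOTAL_ARTICLES = 54  # 27 matched pairs

# Assumption: the one article not covered above did not improve.
improved_total = sum(improved for improved, _ in improved_by_class.values())
overall_share = improved_total / TOTAL_ARTICLES
sub_stub_share = improved_by_class["sub-stub"][0] / improved_by_class["sub-stub"][1]

print(f"sub-stubs improved: {sub_stub_share:.1%}")
print(f"all articles improved: {improved_total} of {TOTAL_ARTICLES} = {overall_share:.1%}")
```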

Class scores of articles over time

Class scores were averaged for bios of women and men for one month and 61 months after the first edit, showing how these articles improved over five years. Women's bios averaged lower than men's bios for both periods, with the men's bios having a slightly greater quality increase. Less than half of the women's bios increased by at least one class, whereas slightly more than half of men's bios increased in quality.

Average class score

Discussion
Using several measures of article quality, the matched pairs data on biographies strongly suggest that article quality increases over five-year periods. Article length measured in bytes more than doubles over this time. Other measures show somewhat less of a quality increase, and the modified ORES data suggests that only about half of articles have a measurable quality increase.

Women's bios, according to several measures, are slightly lower in quality than men's bios one month after being started, and the difference increases over five years.