User:Nettrom/datasets/March 2014 popular stubs

See the data gathering method section below for details on how this data was gathered.

Links to pages listing articles per WikiProject
Our dataset contains 610 WikiProjects, split across four pages by the number of articles each project covers: more than 1,000 articles, between 100 and 1,000 articles, between 10 and 100 articles, and between 1 and 10 articles.


 * 1) WikiProjects with more than 1,000 articles (currently only WikiProject Biography)
 * 2) WikiProjects with more than 100 articles
 * 3) WikiProjects with more than 10 articles
 * 4) WikiProjects with more than 1 article
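The four-way split can be sketched in a few lines of Python. This is only an illustration: the function name `bucket_for` is hypothetical, and the handling of the exact boundary values (10, 100 and 1,000) is an assumption, since the description above does not say which page a project with exactly that many articles lands on.

```python
def bucket_for(article_count):
    """Return the page number (1-4) for a WikiProject with this many articles.

    Assumption: boundaries are strict lower bounds, so a project with
    exactly 1,000 articles goes to page 2, exactly 100 to page 3, and
    exactly 10 to page 4.
    """
    if article_count < 1:
        raise ValueError("a listed WikiProject covers at least one article")
    if article_count > 1000:
        return 1
    if article_count > 100:
        return 2
    if article_count > 10:
        return 3
    return 4


# Hypothetical article counts, for illustration only
example_counts = {"Project A": 2500, "Project B": 340, "Project C": 12, "Project D": 3}
pages = {name: bucket_for(count) for name, count in example_counts.items()}
```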

Data gathering method
The basis for identifying these stubs is the English Wikipedia dump of March 4, 2014. We first used an improved version of SuggestBot's article classifier to predict the assessment class of every article in the dump, finding over 2.6 million predicted stubs.



Access to data on viewership was kindly provided by Wiki ViewStats. A plot of the popularity of the 4,188 Featured Articles on English Wikipedia found in the dump is shown in the graph on the right. The dotted line shows the geometric mean popularity of approximately 128 views per day. We use the geometric mean because popularity is greatly skewed; for the same reason the X-axis in the graph is logarithmic (base 10, so a value of 1 corresponds to 10, 2 to 100, and 3 to 1,000).
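To illustrate why the geometric mean suits skewed popularity data, here is a minimal Python sketch. The view counts are made up for the example and are not taken from the dataset. How articles with zero views were treated is not stated above, so this sketch simply assumes all values are positive.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive numbers, computed as exp(mean(log(x)))."""
    if not values:
        raise ValueError("need at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical daily view counts: one very popular article dominates.
views = [10, 100, 1000, 100000]

arithmetic = sum(views) / len(views)  # pulled up strongly by the outlier
geometric = geometric_mean(views)     # much closer to a "typical" article

# Position of the geometric mean on a base-10 logarithmic X-axis
axis_position = math.log10(geometric)
```

On a base-10 logarithmic axis the geometric mean sits at the ordinary average of the log values, which is why a single line on such a plot summarises the distribution sensibly.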

Comparing the popularity of the 2.6 million predicted stubs against this Featured Article average, we found 12,322 articles that were above it. We then gathered the assessment classes of these articles and kept only those that were assessed solely as stubs as of the dump, resulting in the 6,140 articles listed in the table.

The table may later be extended with additional articles: the remaining articles had multiple assessment categories, so a bit of manual labour is needed to filter them before they can be added.
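The selection steps described in this section can be sketched roughly as follows. This is not the code used to build the dataset. The function name, the dictionary-based inputs and the example titles are all hypothetical, and placing every popular article that is not stub-only into a "needs review" list is a simplifying assumption.

```python
def select_popular_stubs(predicted_stubs, views_per_day, assessments, threshold):
    """Split popular predicted stubs into stub-only and needs-review lists.

    predicted_stubs: iterable of titles the classifier predicted as stubs
    views_per_day:   dict mapping title -> average daily views
    assessments:     dict mapping title -> set of assessment classes
    threshold:       popularity cut-off, e.g. the Featured Article
                     geometric mean of roughly 128 views per day
    """
    stub_only, needs_review = [], []
    for title in predicted_stubs:
        if views_per_day.get(title, 0) <= threshold:
            continue  # not above the cut-off
        if assessments.get(title, set()) == {"Stub"}:
            stub_only.append(title)      # assessed only as a stub
        else:
            needs_review.append(title)   # assumption: everything else is checked by hand
    return stub_only, needs_review


# Hypothetical example data
predicted = ["A", "B", "C", "D"]
views = {"A": 500, "B": 50, "C": 300, "D": 128}
classes = {"A": {"Stub"}, "B": {"Stub"}, "C": {"Stub", "Start"}, "D": {"Stub"}}

stub_only, needs_review = select_popular_stubs(predicted, views, classes, threshold=128)
```

In this example only "A" ends up in the stub-only list, "C" is set aside because it carries two assessment classes, and "B" and "D" are dropped for not exceeding the threshold.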