User:Smallbones/Wales method

On September 6, 2015, an editor asked a deceptively simple question at User talk:Jimbo Wales/Archive 194: "Is Wikipedia getting better?" The following day Jimmy Wales proposed a simple method to answer it: "My favorite way of checking this is to 'click random article' on 10 articles, and go back and look at them a year ago, 5 years ago, 10 years ago. Every time I have tried, it's unambiguous: Wikipedia is getting better by this test." --Jimbo Wales (talk) 08:28, 7 September 2015 (UTC)

More than one editor objected to this method, which I'll call the "Wales method" unless someone objects. The simplicity of the method should be its strength. A very basic "test of proportions" should give a credible answer if we can operationally define quality and randomly sample enough articles. If the difference between the proportion of articles that have improved in quality and the proportion that have not is greater than 2 x the standard error, then we can conclude that the typical article is improving in quality. With a sample size of n=100, the standard error of a proportion is at most 5%; at n=400 it is at most 2.5%. While n=400 would obviously give a more powerful test, the usefulness of this test also depends on the time editors have to spend on it. For this preliminary study, I'll use n=100.
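The rule of thumb above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the standard error uses the worst case p = 0.5 (which gives the 5% and 2.5% figures quoted), and each sampled article is assumed to fall into exactly one of two groups, "improved" or "not improved". Note that the sketch applies the rule as worded; a stricter reading would use the standard error of the difference between the two proportions, which is twice as large in the two-group case.

```python
import math


def standard_error(p: float, n: int) -> float:
    """Standard error of a sample proportion p estimated from n articles."""
    return math.sqrt(p * (1.0 - p) / n)


def improvement_test(improved: int, not_improved: int, z: float = 2.0) -> dict:
    """Apply the '2 x standard error' rule to counts of rated articles.

    Assumption: every article is counted in exactly one of the two groups.
    """
    n = improved + not_improved
    p = improved / n
    se = standard_error(0.5, n)   # worst-case s.e., as quoted in the text
    diff = p - (1.0 - p)          # improved share minus not-improved share
    return {
        "n": n,
        "p_improved": p,
        "difference": diff,
        "se": se,
        "passes": diff > z * se,
    }


if __name__ == "__main__":
    print(standard_error(0.5, 100))  # about 0.05
    print(standard_error(0.5, 400))  # about 0.025
```

Quadrupling the sample only halves the standard error, which is why the choice between n=100 and n=400 is largely a question of editor time.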

Also to save time, quality is compared over just one time interval: 2 years.

Defining quality
I'm skeptical that the currently defined quality classes are consistently applied to articles, and that they are updated as needed, so I'll define a different measure below. The current class system ranges from Stub, Start, C, B, Good article (GA), A, to Featured article (FA). The table shows how the classes were assigned in 2010 and 2015, both as total articles and as a percentage of articles. It also includes "featured lists" and "lists" as classes.

Since over half of all articles are classified as stubs, I'd like to divide this class roughly in half. To conserve time, I'll define 2 additional classes that can be quickly observed. Note that for n=100, 200 quality classifications will be needed. If each takes 1.5 minutes, 5 hours will be required to make the classifications.
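The time budget is simple arithmetic, and a small helper makes it easy to see how it scales with sample size. The function name and its parameters are hypothetical; the 1.5 minutes per rating and the two versions per article come from the paragraph above.

```python
def rating_hours(n_articles: int,
                 versions_per_article: int = 2,
                 minutes_per_rating: float = 1.5) -> float:
    """Hours needed to rate every version of every sampled article."""
    ratings = n_articles * versions_per_article
    return ratings * minutes_per_rating / 60.0


if __name__ == "__main__":
    print(rating_hours(100))  # 5.0 hours for the preliminary study
    print(rating_hours(400))  # 20.0 hours for the larger sample
```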

Quality definitions

 * Level 0 - Doesn't meet the requirements for Level 1, a "sub-stub"
 * Level 1 - Roughly a "good stub"
   * In 3 or more sentences, describes and defines the topic and places it in context in the body of knowledge (e.g. mentions a parent topic, or a general subject such as physics and 2 more closely related topics)
   * has at least two references, bibliographic entries or external links to sources of additional information
 * Level 2 - meets Level 1 requirements and
   * includes at least 3 passages or sections of at least 2 sentences each that cover a subtopic (e.g. history, current use)
   * has at least 5 inline citations
   * may not have a large warning tag at the top of the article (e.g. for POV, citations needed, notability)
   * text is clearly written and contains no obvious contradictions, or has at least one of the following
     * a photo or video
     * a bibliography or further reading section of at least 5 items
     * at least 10 inline citations
 * Level 3 - meets Level 2 requirements and
   * includes at least 5 passages of at least 2 sentences each that cover a subtopic
   * has at least 10 inline citations
   * text is clearly written and contains no obvious contradictions
   * has at least one of the following
     * a photo or video
     * an extensive bibliography
     * at least 20 inline citations
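To check that the levels are operational, the definitions above can be written as a small classifier. This is a sketch, not the rating procedure actually used: the `ArticleFeatures` fields are hypothetical names for things a rater would observe by hand, "extensive bibliography" is left as a rater's yes/no judgment, and the Level 2 condition is read literally as "clearly written, or at least one of the three extras".

```python
from dataclasses import dataclass


@dataclass
class ArticleFeatures:
    """Observations a rater records for one article version (hypothetical fields)."""
    intro_sentences: int = 0          # sentences describing and defining the topic
    places_in_context: bool = False   # mentions parent topic or related topics
    sources_listed: int = 0           # references, bibliography entries, external links
    subtopic_passages: int = 0        # passages of >= 2 sentences on a subtopic
    inline_citations: int = 0
    has_warning_tag: bool = False     # large tag at top (POV, notability, ...)
    clearly_written: bool = False     # clear text, no obvious contradictions
    has_media: bool = False           # a photo or video
    bibliography_items: int = 0
    extensive_bibliography: bool = False


def quality_level(a: ArticleFeatures) -> int:
    """Return the quality level 0-3 under the definitions above."""
    level1 = (a.intro_sentences >= 3 and a.places_in_context
              and a.sources_listed >= 2)
    if not level1:
        return 0

    extras2 = (a.has_media or a.bibliography_items >= 5
               or a.inline_citations >= 10)
    level2 = (a.subtopic_passages >= 3 and a.inline_citations >= 5
              and not a.has_warning_tag
              and (a.clearly_written or extras2))
    if not level2:
        return 1

    extras3 = (a.has_media or a.extensive_bibliography
               or a.inline_citations >= 20)
    level3 = (a.subtopic_passages >= 5 and a.inline_citations >= 10
              and a.clearly_written and extras3)
    return 3 if level3 else 2
```

Because each level requires the one below it, a warning tag caps an otherwise strong article at Level 1 under this reading.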

Work flow

 * Click the Random article link in the left hand column
 * If the "article" is a disambiguation page or list, click the Random article link again without recording anything
 * Otherwise, record the article name and permanent link (click in left hand column) in the article column of the table
 * Record the subject from the list below
 * Check whether a version of the article from two years ago is available; if not, click the Random article link again
 * If a 2 year old version is available, record its permanent link in the "Old version" column of the table. Rate that version's quality and record it in the "Old quality" column.
 * Return to the current version, rate that version's quality and record it in the "Current quality" column
 * Start again (or take a break)
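The clicking in the steps above could in principle be assisted by the MediaWiki Action API. The sketch below only builds request URLs and makes no network calls. The helper names are hypothetical, and the parameters (`list=random`, `prop=revisions` with `rvstart` and `rvdir=older`, and `oldid` for permanent links) are given as I understand the API, so they should be checked against its documentation before use. It does not detect disambiguation pages or lists; that check is left to the rater.

```python
from datetime import date
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"
INDEX = "https://en.wikipedia.org/w/index.php"


def years_before(day: date, years: int = 2) -> date:
    """Same calendar day `years` earlier (29 February falls back to the 28th)."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:
        return day.replace(year=day.year - years, day=28)


def random_article_url() -> str:
    """URL asking for one random page in the article namespace."""
    params = {"action": "query", "format": "json",
              "list": "random", "rnnamespace": 0, "rnlimit": 1}
    return API + "?" + urlencode(params)


def old_revision_url(title: str, today: date, years: int = 2) -> str:
    """URL asking for the newest revision at or before the cutoff date.

    An empty revision list in the response would mean no old version exists.
    """
    cutoff = years_before(today, years)
    params = {"action": "query", "format": "json",
              "prop": "revisions", "titles": title,
              "rvlimit": 1, "rvdir": "older",
              "rvstart": cutoff.strftime("%Y-%m-%dT00:00:00Z"),
              "rvprop": "ids|timestamp"}
    return API + "?" + urlencode(params)


def permanent_link(title: str, revid: int) -> str:
    """Permanent link to one specific revision of an article."""
    return INDEX + "?" + urlencode({"title": title, "oldid": revid})
```

The revision id returned by the second request is what would go into `permanent_link` for the "Old version" column.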

Other variables may be recorded later, using the permanent versions.

Data
September 13, 2015: 113 observations, 100 with both a current version and a version from 2 years ago

Topic categories
I would like to record a subject area with each random article, to see in general whether quality has been changing differently across subjects. I'd also like to be able to reuse these mutually exclusive subject areas in further studies, with each one covering about 5% of all articles, up to a maximum of 15%. Based on the above table and chart, I'll use the following subject categories and sub-categories:
 * Biography
   * BLP, M; BLP, W; BDP, M; BDP, W
 * Geography
   * GEO, W (for Western hemisphere); GEO, E
 * Culture and arts
   * CA, 1990-; and CA, 1991+ ("classical" vs. current)
 * Business, products and services
 * History, politics and government
 * Other society, sports, religion, philosophy and social science
 * Hard sciences, technology, and math
 * Biology, health, and medicine
 * Other/unclassifiable
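Once subjects have been recorded, it is easy to check whether each category lands in the intended 5% to 15% range. The following is a minimal sketch: the function names are hypothetical, and any labels passed in are whatever codes the rater chose to record (only the sub-category codes such as "BLP, M" or "GEO, E" come from the list above).

```python
from collections import Counter
from typing import Dict, Iterable


def category_shares(labels: Iterable[str]) -> Dict[str, float]:
    """Share of sampled articles recorded under each subject label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def out_of_range(labels: Iterable[str],
                 low: float = 0.05,
                 high: float = 0.15) -> Dict[str, float]:
    """Categories whose share falls below `low` or above `high`."""
    return {label: share
            for label, share in category_shares(labels).items()
            if share < low or share > high}
```

A category that shows up in `out_of_range` is a candidate for splitting (if too large) or merging with a neighbor (if too small) in later studies.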