User:Protonk/Article Feedback

Past research on the Article Feedback Tool has focused on assessing the tool itself rather than characterizing the picture of Wikipedia that the tool provides. In this article we explore the Article Feedback Tool dataset and attempt to clearly illustrate the relationship between feedback ratings and the Wikipedia community's own assessment systems. While a strong model cannot be constructed without additional data, a stronger understanding of the quality dimensions illustrated by both processes is offered.

Prior to the introduction of the Article Feedback Tool in 2011, differentiation and scoring of Wikipedia articles along lines of quality were mainly confined to the Article Assessment Quality Rating Scale. This scale divides articles into categories of varying distinctiveness based on expectations from individual interested WikiProjects as well as editors across the encyclopedia. Articles rated from Stub to A class are generally (though not exclusively) reviewed by an individual editor and, as a rule, are not subjected to external quality control. Project-wide classifications such as Featured Articles (FA), Good Articles (GA) and Featured Lists (FL) represent both more structured sets of expectations and stronger quality control.

However, despite these stronger quality control processes, these more stringent criteria share a qualitative nature with the A, B and C class assessments. It is difficult to reify editor decisions to promote an article and therefore nearly impossible to make cross-sectional comparisons. As will be described in detail below, the Article Feedback Tool (AFT) offers anonymous numerical reviews from both readers and regular editors. Past research efforts have shown higher average feedback ratings associated with articles holding higher project quality assessments (Unassessed vs. GA, GA vs. FA, etc.) and we confirm those findings. We also explore more rigorously the relationship between project assessment and feedback rating.

At the outset we should make clear that we hypothesize no causal relationship, nor do our data support any such conclusion. Both project quality assessment and article feedback rating are proxies for article quality. The best way to imagine both processes is as being jointly determined by "true" article quality. Beyond the limitation on causal claims, any model purporting to relate project quality assessment and rating average will suffer from a lack of identification. Further, assessed articles represent a vanishingly small percentage of articles overall. Consequently many naive models may be ill-posed or suffer from other numerical stability issues.

Such issues are not fatal. We present the work here as exploratory only. Follow-on projects, discussed in Further Research, will be informed by the analysis we provide today.

We first discuss the data used and both the rating and assessment systems. Next we introduce our two models; first a logit model and then a linear regression backing out the influence of article length on both rating and assessment. Results and future research are then discussed.

Article Feedback Tool
[[File:Heatmap of Correlation between Averages in Different Article Feedback Categories.svg|thumb|350px|Heatmap of correlation between averages in different article feedback categories]]

The Article Feedback Tool (AFT) presents a unique dataset with which to test the assessment decisions slotting articles into Good or Featured status. Anonymous and registered editors were given a chance to provide a quick numerical rating (1 to 5 on each of four categories) on the article page itself. Approximately 9,550,000 ratings were recorded across almost 800,000 articles. Among these 800,000 articles, 9,674 rated articles were GAs, FAs or FLs. Articles assessed in these top categories represent ~1.2% of the dataset, or a little more than twice the proportion of assessed articles in the English Wikipedia (0.51%). On the English Wikipedia 19,708 articles are assessed as GA/FL/FA, but the total number of articles is between three and four times the number of articles rated in our dataset.

The AFT itself underwent several iterations between its introduction as part of the Public Policy Initiative (the AFT component of which was named the public policy pilot) and the version currently displayed on the English Wikipedia. In all cases, however, the rating system presented users with four categories: trustworthiness, objectivity, completeness and quality of writing. The names for these categories have changed between revisions but the intended meaning has remained roughly fixed. Previous research has focused on the different rating categories but we tend to ignore them, as individual category ratings are highly correlated with rating averages. This is true for all articles in the sample, but it is especially true for articles which are likely to have been rated on all four categories by a single individual.



Readers and editors are free to rate a single article on one or more categories, but a large number of articles are rated a multiple of four times--suggesting that the framing of the feedback tool led readers to rate all four categories. We can see the result in aggregate terms as number of ratings per article as well as in terms of higher correlation. Among a subsample of articles rated fewer than 41 times each, the mean Pearson's correlation coefficient between individual categories and overall averages for articles rated a multiple of four times is 0.894, while for other articles it is 0.870 (a Mann-Whitney test rejects the null hypothesis that these two samples are equal with high confidence). In truth, we are making two suppositions. First, that articles rated a multiple of four times are more likely to have been rated by editors who rated all four categories than by some combination of editors who rated fewer than four. Second, that those editors rating all four categories tend to give approximately the same rating for each. As the data are anonymized we cannot make any strong claims about either hypothesis. However we feel we can make a sufficient case to deal only with the overall rating average rather than include each individual category in our models.
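The Mann-Whitney comparison described above can be sketched as follows. The per-article correlation values here are synthetic draws standing in for the real AFT figures (only the 0.894 vs. 0.870 means come from the text), so the printed statistic is illustrative only.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Hypothetical per-article correlations between category ratings and the
# overall average, split by whether the article's rating count is a
# multiple of four. These are stand-ins, not the actual AFT values.
corr_multiple_of_four = rng.beta(18, 2, size=500)   # centred near 0.89
corr_other = rng.beta(16, 2.4, size=500)            # centred near 0.87

# Nonparametric test of whether the two samples come from the same
# distribution, as used in the text.
stat, p = mannwhitneyu(corr_multiple_of_four, corr_other,
                       alternative="two-sided")
print(f"U = {stat:.0f}, p = {p:.3g}")
```

The Mann-Whitney test is a sensible choice here because the correlation coefficients are bounded and skewed, so a t-test's normality assumption would be hard to defend.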

Project Quality Assessment
Our project assessment criteria of interest are Good Articles, Featured Articles and Featured Lists. Both GAs and FA/FLs represent a very small fraction of Wikipedia articles in total and are selected based on a common rubric covering article quality, comprehensiveness and sourcing. However, the Featured Content process represents a considerably higher bar both in terms of the stated criteria and in terms of the mechanism used to promote articles. Featured Content is expected to represent the very best content and is judged by multiple editors, any one of whom can raise an objection and potentially scuttle the nomination. Good Articles are supposed to represent solidly above-average content and are generally judged by a single editor.

Both Good and Featured Content have mechanisms to delist articles which have degraded in quality over time or where the original assessment is deemed to have been erroneous. In addition, the Good Article project maintains a tracking category of articles which have been nominated but failed to be named as good articles (for a variety of reasons).

As a result we have 6 categories of project assessment: delisted Good Articles, delisted Featured Articles, failed Good Article nominees, Good Articles, Featured Lists and Featured Articles. Some of these categories are mutually exclusive; others are not. Where some overlap exists, the category with the higher unconditional mean feedback rating is applied (e.g. a former Featured Article may also be a Good Article and would be listed only as a Good Article in the data).

Data was collected from the AFT from July 16th, 2011 to September 19th, 2011. The category listings for the 6 project assessment factors were gathered on February 16th, 2012. In the intervening time the status of articles in the sample could have changed: an article assessed as a Featured Article during the rating period, but since subject to a drop in quality and delisted, would be recorded as a delisted Featured Article even though the AFT ratings reflect its contemporaneous quality. In general this is not a thorny problem. While we are not building an explicit model for article quality and therefore cannot make a clear estimate of the bias this may introduce, it may be reasonable to assume the bias is negligible. The project added roughly 1,000 Good Articles, lost roughly 75 Featured Articles and gained (net) roughly 45 Featured Lists. While the number of Good Articles is disconcerting, we have no data on the ex ante true quality of those articles. They may have been improved in the intervening time, recognized for their latent quality after a period, or some combination of the two.

Characteristics
The AFT database dump records a summary of all ratings recorded for each article between the July 16th, 2011 and September 19th, 2011 and reports the number of ratings received in each category, the length of the article in bytes and the article title as well as page id.

Ratings per Article


Article ratings follow a rough power law, with the modal number of ratings per article equal to one and a majority (438,408) of articles rated four or fewer times. The top end of the distribution is likewise characterized by an extreme disparity between successive articles. The most frequently rated article (Justin Bieber) was rated 24,431 times, with the next most frequently rated article (Born This Way) rated 10,792 times. Jimmy Wales's article was also rated more than 5,000 times, though users felt it deserved only a 1.106 out of 5--well below the average. Despite the large number of ratings these articles received, 80% of all ratings in the data are recorded for articles rated 158 times or fewer.
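The "80% of ratings" statistic above can be computed directly from per-article rating counts. The sketch below uses synthetic Zipf-distributed counts (the exponent 1.8 and sample size are assumptions, not fitted to the AFT data) to show the calculation.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical heavy-tailed ratings-per-article counts (Zipf-like),
# standing in for the real AFT counts.
counts = rng.zipf(1.8, size=100_000)

order = np.sort(counts)                 # articles from least to most rated
cum = np.cumsum(order) / order.sum()    # cumulative share of all ratings

# Smallest per-article rating count k such that articles rated k times or
# fewer account for at least 80% of all ratings (158 in the real data).
cutoff = order[np.searchsorted(cum, 0.80)]
print("80% of ratings fall on articles rated <=", cutoff, "times")
```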



Clustering of observations in articles with few ratings per article results in an interesting constraint on the distribution of rating averages. Splitting the dataset by ratings per article shows that while the mean of several important variables converges quickly on the grand mean, the distribution of rating averages shows some odd support as the number of ratings per article increases. Part of this is likely due to the same framing which impacted category correlation--a single reader rating an article on all four categories may be likely to rate each category the same, raising the likelihood of whole-number rating averages. It is also caused by a curious (but largely inconsequential) literal constraint on averages. An article rated once can only have a whole-number average (1, 2, 3, 4, 5). An article rated twice can only have averages differing by 0.5, and so on. This is largely a curiosity because while the potential distribution of rating averages is constrained at low numbers, the number of articles which exhibit those rating averages is not uniform. However, as a majority of articles in the sample are rated four times or fewer, this result is mildly interesting.
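The literal constraint on averages can be enumerated directly: with each rating an integer from 1 to 5, an article rated n times can only attain averages that are multiples of 1/n.

```python
from fractions import Fraction

def attainable_averages(n):
    # Each of the n ratings is an integer 1..5, so the total ranges over
    # n..5n and the average is total/n, a multiple of 1/n.
    return sorted({Fraction(total, n) for total in range(n, 5 * n + 1)})

print([float(a) for a in attainable_averages(1)])  # whole numbers only
print(len(attainable_averages(2)))                 # averages spaced 0.5 apart
print(len(attainable_averages(4)))                 # averages spaced 0.25 apart
```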



One method to illustrate the potential impact of low ratings per article on sample statistics of interest is to progressively censor the sample at increasing numbers of ratings per article and measure the correlations associated with relationships of interest. Correlations tend to converge quickly on their expected values, with any significant successive difference evaporating as we exclude articles with 25 or fewer ratings.
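The progressive censoring procedure can be sketched as follows. All the data here are synthetic: the rating counts, the two correlated variables, and the noise model (noisier averages for articles with few ratings) are assumptions chosen only to demonstrate the mechanics.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
# Synthetic articles: ratings-per-article plus two correlated quantities
# (stand-ins for, e.g., log length and rating average).
ratings_per_article = rng.zipf(1.8, size=n)
x = rng.normal(size=n)
noise_scale = 1.0 / np.sqrt(np.minimum(ratings_per_article, 50))
y = 0.5 * x + rng.normal(scale=noise_scale, size=n)  # noisier when few ratings

# Progressively censor low-rating articles and track the correlation.
for threshold in (0, 5, 10, 25):
    mask = ratings_per_article > threshold
    r = np.corrcoef(x[mask], y[mask])[0, 1]
    print(f"> {threshold:2d} ratings: n = {mask.sum():6d}, r = {r:.3f}")
```

Under this noise model the measured correlation rises and then stabilizes as the noisiest (least-rated) articles are excluded, mirroring the convergence described above.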



The quick convergence of correlation belies a noticeable but not necessarily problematic link between number of ratings per article and project quality assessment. To put it simply, many FAs and GAs have been rated four times or fewer--2,237 GAs and 585 FAs. The simplest explanation lies with the fact that both the GA and the FA process are essentially the purview of long-time editors, who are likely to focus on very different types of articles than intermittent or anonymous editors. Thus they represent a very non-random sample of readers. Articles which were listed as FAs or GAs but have since been delisted are more often popular articles whose maintenance has decreased as the support from dedicated editors has dropped relative to the overall attention the article has received.

Length
Article length ranges from 51 bytes (likely a redirect at the end of the summary period) to 460,112 bytes (2011 ITF Men's Circuit). 50 bytes was chosen as a cutoff for very likely redirects, which were removed from the sample (roughly 2,000 articles). Most uses of length, whether in descriptions or in models, log-transform article length in order to operate with a manageable dispersion. Article length is roughly log-normally distributed (so log(length) is roughly normal), though no results depend on this distribution as an assumption. There is a roughly positive relationship between average rating (along with project quality assessment) and article length. As article length is the primary piece of textual information we can gather on all articles in the sample, we explore both relationships further below.

Average Rating
As we indicated above, average rating is best visualized as generated by a process which jointly determines AFT rating and project quality assessment. As such, we should not be surprised to see that average rating increases with project quality assessment. It is heartening to note that FA/GAs are rated on average higher than their delisted counterparts. All in all, the increase in median rating is almost monotone across the various project quality assessments, with the exception of FA and GA, which are largely indistinguishable.

Another means to illustrate differences in group central tendencies is to employ Tukey's Honest Significant Difference test. A word of warning: this test is offered as a supplement to eyeballing the difference. The HSD assumes observations are i.i.d. and drawn from normal distributions--we can meet neither assumption. However the HSD is more conservative than a t-test comparison and best suited for multiple pairwise comparisons. The HSD compares the difference in group means across each pair of groups to a studentized range. Most groups of project quality assessments are indicated as distinct by the HSD, with a few exceptions which were relatively clear from the box plot.
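The HSD comparison can be sketched with SciPy's implementation (scipy.stats.tukey_hsd, available in recent SciPy releases). The group means, spreads and sizes below are invented stand-ins for the real assessment groups, so the printed p-values are illustrative only.

```python
import numpy as np
from scipy.stats import tukey_hsd

rng = np.random.default_rng(3)
# Hypothetical rating averages for three assessment groups; the real group
# labels and values come from the AFT data, not these draws.
unassessed = rng.normal(3.6, 0.8, size=200)
good       = rng.normal(4.0, 0.5, size=200)
featured   = rng.normal(4.1, 0.4, size=200)

# All pairwise mean comparisons against the studentized range.
res = tukey_hsd(unassessed, good, featured)
print(res.pvalue.round(4))  # matrix of pairwise p-values
```

As the text warns, the normality and i.i.d. assumptions do not hold for the AFT data, so this output should be read as a supplement to the box plot rather than a formal test.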

Model
Two models were chosen in order to tease out the potential dependence that article rating and project quality assessment may have on article length and rating counts (as well as on each other). Both length and number of ratings per article may vary with rating or assessment. Some groups (such as FA/FL) effectively represent truncated distributions with respect to length--while there is no listed minimum length, practical limits exist. The shortest Featured Article is HMS Temeraire (1798) at 5,600 bytes; 428,769 articles in the sample are shorter than this. Shorter articles, especially very short articles, may be stubs or bot-generated--not the kind of articles which would be rated highly. But very long articles may be rated poorly by editors if their length makes them hard to read or navigate. We have no strong priors about this relationship.

Ratings per article represents a more difficult relationship. As we showed above, many FA/GA/FLs are reviewed a small number of times (of course, most of the articles in the sample are reviewed a small number of times!) and we posited this may occur because dedicated Wikipedians get to work without interruption. However the basic model of Wikipedia revolves around distributed editing. All things being equal, a more popular article should be "better" and this should be reflected in both AFT rating and project quality assessment. As with article length, we do not enter in with a strong prior one way or another.

Choice model
Two models are fitted here, both logit regressions. First we fit a binomial logit regressing the probability of an article being assessed at all on article length, ratings per article and average article rating. In keeping with the broad exploratory theme, we are less interested in strong identification and more interested in illustrating the relationships. The second model fits an ordered logit against the probability of choosing an increasing project quality assessment (with assessments ordered by empirical median rating). We could technically fit an ordered logit to the entire dataset, but such a model isn't really meaningful when 6 of the 7 rankings are crammed into ~1% of the data.

Binomial logit
In order to fit a binomial logit we must maximize the following log-likelihood, where $$ \beta $$ represents our various estimated coefficients and $$ \mathit{X} $$ our data.


 * $$ \log [ L( \beta_0, \beta_1 ; \mathit{X} ) ] = \sum_{i=1}^n \mathit{Y}_i(\beta_0 + \beta_1\mathit{X}_i) - \sum_{i=1}^n \log[1 + e^{(\beta_0 + \beta_1\mathit{X}_i)}]$$

Maximizing this likelihood leaves us with a fitted model mapping the log-odds of $$ \mathit{Y}_i = 1 $$, that is to say, an article being assessed in the project quality assessment, to a linear equation:


 * $$ p_i = \beta_0 + \beta_1(Rating Avg.) + \beta_2 log(length) + \beta_3 log(Ratings / Article) + \epsilon_i $$

Where


 * $$ e^{p_i} = \frac{\pi_i}{1 - \pi_i}, \quad \pi_i = P(\mathit{Y}_i = 1) $$



We estimate the intercept and coefficients jointly. The intercept represents the log-odds of $$ \mathit{Y}_i = 1 $$ with all variables (rating average, length, ratings per article) at 0. In this case, as is often the case, the intercept is not meaningful as none of these variables are 0 in the data. Each individual coefficient is the marginal contribution of an increase of one in a given variable to the log-odds of assessment. For this analysis we chose to bootstrap the coefficient estimation in order to provide a reasonably non-parametric (the model itself contains parametric assumptions, of course) estimate of each parameter. All of our estimates are greater than zero and the length estimate is 1.85. We should be careful placing too much emphasis on interpretation of an individual coefficient where there are multiple explanatory variables, but the large parameter estimate offers some support for our continued investigation into the relationship between length and assessment/rating.
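The bootstrap estimation can be sketched by maximizing the log-likelihood given earlier with a general-purpose optimizer and resampling articles with replacement. Everything below is synthetic: the explanatory variables, the "true" coefficients used to generate outcomes, and the bootstrap size are assumptions for illustration, not the fitted AFT values.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n = 3000

# Synthetic stand-ins for the explanatory variables: intercept, rating
# average, log(length), log(ratings per article).
X = np.column_stack([
    np.ones(n),
    rng.normal(3.8, 0.7, size=n),
    rng.normal(8.0, 1.5, size=n),
    rng.normal(1.5, 1.0, size=n),
])
true_beta = np.array([-20.0, 1.0, 1.8, 0.3])  # illustrative, not fitted
y = (rng.random(n) < 1 / (1 + np.exp(-X @ true_beta))).astype(float)

def neg_log_lik(beta, X, y):
    # Negative of the log-likelihood in the text:
    # sum_i y_i x_i'beta - sum_i log(1 + exp(x_i'beta))
    eta = X @ beta
    return -(y @ eta - np.logaddexp(0.0, eta).sum())

def fit(X, y):
    return minimize(neg_log_lik, np.zeros(X.shape[1]), args=(X, y),
                    method="BFGS").x

# Bootstrap: refit on resampled articles and collect the coefficients.
boot = np.array([fit(X[idx], y[idx])
                 for idx in (rng.integers(0, n, size=n) for _ in range(50))])
print("bootstrap coefficient means:", boot.mean(axis=0).round(2))
```

The bootstrap distribution of each coefficient gives a confidence region without assuming a particular error distribution, which is the point made in the text; the logit's own parametric form of course remains.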

Ordered logit
An ordered logit is very similar to a binomial regression. Rather than a single choice variable there is a series of ordered responses. Estimation is still through MLE but is a bit more complicated than for the binomial logit. We first imagine the response term as having a latent continuous variable--so even though project quality assessment is inherently categorical, we treat it as a binning of a continuous variable and construct the disturbance term that way. For k breakpoints, a latent variable $$ \mathit{Y}^\star_j $$, parameter estimates $$ \beta $$ and a linear predictor $$ \eta $$ we can estimate:




 * $$ \eta_j = \sum_{i} \beta_i \mathit{X}_{ij} = E (\mathit{Y}^\star_j ) $$

Once we have computed the breakpoints we can estimate the probability for each (ordinal) event as:


 * $$ P(Y=1) = \frac{1}{1+\exp(\eta_j - k_1)}, \quad P(Y=2) = \frac{1}{1+\exp(\eta_j - k_2)} - \frac{1}{1+\exp(\eta_j - k_1)}, \quad \ldots $$



Unlike the binomial, there is no single intercept term. However the interpretation of our parameter estimates is the same. Each represents the marginal increase in the log-odds of moving up a rank category for a 1 unit increase in the explanatory variable. So a 1 unit increase in log(length) would result in a 0.4 increase in the log-odds of transitioning from a GA to an FA, and so forth. Now that we are examining only assessed articles (so former Good Articles represent the lowest ranked response), the impact of length diminishes substantially. The parameter estimate for ratings per article is negative, but this may be a consequence of many GA/FA/FLs having only 1 or 4 ratings rather than a broader trend across the whole dataset.
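The cutpoint formula above maps a linear predictor and a set of breakpoints to a full set of category probabilities. A minimal sketch, with an invented predictor value and invented cutpoints (not the fitted AFT estimates):

```python
import numpy as np

def ordered_logit_probs(eta, cutpoints):
    """Category probabilities for one observation, following the text:
    P(Y <= m) = 1 / (1 + exp(eta - k_m)), with successive differences
    giving each individual category's probability."""
    k = np.asarray(cutpoints, dtype=float)   # must be increasing
    cdf = 1.0 / (1.0 + np.exp(eta - k))      # cumulative P(Y <= m)
    cdf = np.concatenate([[0.0], cdf, [1.0]])
    return np.diff(cdf)

# Hypothetical linear predictor and three cutpoints -> four ordered
# categories (e.g. failed GA nominee < GA < FL < FA).
probs = ordered_logit_probs(eta=0.5, cutpoints=[-1.0, 0.0, 1.5])
print(probs.round(3))  # probabilities over the four ordered categories
```

Because the probabilities are differences of a single logistic CDF evaluated at the cutpoints, they are guaranteed to be non-negative and to sum to one whenever the cutpoints are increasing.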

Linear regression
Linear regression using ordinary least squares is the simplest and most common approach to inference and may be appropriate for this dataset. Assuming we treat average rating as the explained variable, a linear regression is a reasonable specification. However, the endogeneity problem limits the usefulness of such an approach. It doesn't make much sense to construct a simple linear regression relating average rating to the remaining explanatory variables without project quality assessment, nor is including project quality assessment an appropriate choice. What we can do is attempt to disambiguate the effects of one explanatory variable in particular--length. Using a strategy most often employed in labor econometrics, we regress our dependent variable of interest against a flexible specification for length. Then we can extract the residuals of this regression and regress those against our other explanatory variables. Our first regression, where $$ Y_i $$ is average rating:


 * $$ Y_i = \gamma_0 + \gamma_1 \log(Length) + \gamma_2 \log(Length)^2 + \gamma_3 \log(Length)^3 + \gamma_4 \log(Length)^4 + \epsilon_i $$

The choice of a high-order polynomial was somewhat arbitrary. A nonlinear regression could have been run with a more general smoothing function, but the basic results would have been the same. Once we have our residuals from the first regression we can construct the second:


 * $$ resid(Y_i) = \beta_0 + \beta_1 \log(No.\ Ratings) + \sum_{j=1}^{6} \delta_j (Assessment_j) + \epsilon_i $$



Here $$ \delta \,$$ are the various parameter estimates for project quality assessment. The results of the first regression are shown below:

Bear in mind that the explanatory variables are powers of log(length), so the parameter estimates relate a change in log(length) (which roughly varies from 3 to 13) to rating average. Another thing to keep in mind is that while these individual parameter estimates are statistically significant, that was nearly unavoidable for any reasonable relationship given the sample size. The AIC of this regression is enormous; as we can see, most of the variation in the data is not explained by this model. The intent of this initial regression is simply to isolate the relationship between length and average rating to a reasonable degree, and we have done so.
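The two-stage strategy can be sketched as follows. The data are synthetic stand-ins (a single assessment dummy instead of six, invented effect sizes and noise), chosen only to show the mechanics: fit a quartic in log(length), take residuals, then regress those residuals on the remaining variables.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000

# Synthetic stand-ins: log length, a binary assessment indicator, and a
# rating average that depends on both. These are assumptions, not AFT data.
log_len = rng.normal(8.0, 1.5, size=n)
assessed = (rng.random(n) < 0.05).astype(float)
rating = 3.0 + 0.08 * log_len + 0.3 * assessed + rng.normal(0, 0.6, size=n)

# Stage 1: flexible (quartic) polynomial in log length, as in the text.
stage1 = np.polynomial.Polynomial.fit(log_len, rating, deg=4)
resid = rating - stage1(log_len)

# Stage 2: regress the residuals on the remaining explanatory variables.
Z = np.column_stack([np.ones(n), assessed])
coef, *_ = np.linalg.lstsq(Z, resid, rcond=None)
print("assessment effect after removing length:", round(coef[1], 3))
```

Because the stage-1 polynomial absorbs whatever variation length explains, the stage-2 coefficient on assessment reflects only the length-adjusted difference in means, which is the quantity the second regression in the text targets.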

In fitting our second regression we should again remember that the predictive power of any model containing both average rating and project quality is very limited. In this case, we feel that something interesting has been exposed. Rather than the unconditioned differences in means between assessment categories, we now have conditioned means adjusted for variation in article length. The picture we paint is somewhat different from the box plot at the beginning.

The parameter estimates for Featured Lists, Featured Articles and Good Articles indicate the expected signs and remain statistically significant but the estimates for each of the delisted categories are no longer significantly different from zero. With the effects of article length removed, articles which have been assessed as Featured or Good retain their distinction from the bulk of unassessed articles in terms of rating averages but those which have been delisted or failed their nomination are no longer distinguishable from the remainder of the sample.

While such a result is not dispositive of any claim we have introduced here, it offers a valuable hint that project quality assessment may provide a meaningful source of variation in a more sophisticated and well-identified model. Had the results of this second regression been radically different we might have had cause to worry about the informational value of article assessment.

Discussion
A number of assumptions have been left unsaid in this small study. First, we have no idea if the article feedback data represent a random sample of Wikipedia articles. A look at the top rated articles indicates that they may not; rather, the sample may be driven disproportionately by articles which were popular in Summer 2011. This is to be expected, as this was the timeframe in which the data were gathered! In contrast, Featured Articles have existed in some form since 2003 and Good Articles since 2005. Many of the articles which were listed under these criteria then remain listed today but may not have attracted attention during the survey period. We have no way of knowing which way (if any) this source of error cuts. We are also making the implicit assumption that the type and genre of an article matter neither for the project quality assessment nor for the AFT rating average. It is easy to imagine article genres which could systematically attract ratings divergent from their project assessment, both positive and negative. A more complete study would estimate these effects and (potentially) control for them.

Despite these and other limitations, we feel this study has offered some clarity on the subject of article feedback ratings. Of particular note are the following results:


 * Dimensionality can be reduced by relying on the high degree of multi-collinearity in category ratings
 * Apparently significant differences in group mean ratings between delisted assessed articles and unassessed articles are not robust to controlling for other factors
 * The distribution of article views (and consequently, ratings) requires that we pay attention to articles with low ratings per article; characteristics of the data in these areas can be problematic for inference.

In a more comprehensive study we may be inclined to include some natural language processing elements, additional metadata and category/project information in our dataset. A numerical estimate of a reasonably complex model would run up against the curse of dimensionality. This study offers a reasonably strong prior argument that dropping constituent category ratings is not profligate. These ratings can be used if a particular model calls for them (e.g. looking at readability ratings across genres), but we have also given some cause to be suspicious of a naive interpretation of a single category.



Summary statistics, box plots and Tukey's HSD all indicated that nearly every assessment category, including those which were distinguished mainly by their former or prospective status, was rated significantly higher than unassessed articles. However, this difference evaporated for the delisted and failed categories when a simple correction for length was performed. An even simpler linear model fitting assessment and length on the right-hand side with rating average on the left-hand side would show roughly similar but unconvincing results. Former Good and Featured Articles along with former Good Article nominees cannot be said to be significantly different from unassessed articles, while GA/FA/FLs are rated significantly higher than unassessed articles. Even though we don't have an effective identification strategy, this is an encouraging result. It may mean, as a first cut, that the AFT is not systematically inviting perverse ratings and/or that the project quality assessment system matches fairly well with quality expectations. More on that final interpretation in a moment.

The last result is the most difficult to discuss as it is the least well explored. In contrast to the clicktracking research or research on articles effectively subject to flash-mob editing, the majority of ratings are applied to articles which will never be rated again. These articles, specifically all those with roughly 20-25 or fewer ratings, may have to be dealt with as a separate regime (or perhaps merely a dummy variable!) from the remainder of the articles. Alternately, they may be ignored. This may be a perfectly reasonable course of action, but the decision must be informed by the underlying data.

Further Research
Three core problems need to be addressed:


 * Endogeneity: Some mechanism must be developed to avoid the core issue of joint determination of quality.
 * Rarefied sample of concern: Restricting our analysis to GA/FA/FL/etc. is simply insufficient
 * More robust testing of distributional assumptions or reliance on non-parametric analysis

Endogeneity: Easily the largest hurdle in answering the question entirely. There are a few ways to attempt to solve the problem as well as a few potential pitfalls. First, we could analyze the articles themselves and produce a scoring system for maintenance templates, estimated reading level, etc. Such an approach is not a dead end. The primary problem with this approach is generating an independent proxy for quality without resorting to expert review or some strong, arbitrary assumptions. One potential solution would be to sidestep the creation of a new proxy for quality entirely and rely on some source of exogenous variation in either the rating averages or the assessment system in order to gain identification. This could be done with a regression discontinuity design--such a design would be challenging, as none of Wikipedia's major content assessment criteria generate a bright-line (or reasonably bright-line) breakpoint in an otherwise continuous variable. Another potential solution would be to exploit timing variations to generate some exogenous variation. This study made the broad (and potentially unsupportable) assumption that ratings, quality and assessment were fixed over time. It is more likely that all three varied across the survey period and after it. For different dynamic shocks, different proxies for quality are respectively flexible and rigid. A popular news item on a semi-protected article is most likely to change ratings first, content second and assessment last. An assessment/reassessment drive may work in the opposite direction.

Expanded classifications: GA/FA/FL are easy classifications to choose. Between the three of them and their delisted counterparts there are approximately 18,000 articles--enumerating those categories from the API is simple and takes relatively little time. A/B/C class (to say nothing of Start or Stub) are another problem entirely. There are many more of them, they can form multiple overlapping categories and their determination is arbitrary at best. Some projects treat A class review as a serious, structured process. Others merely require the appearance of effort on the part of the reviewer. Despite these problems, a complete discussion of project-assessed quality cannot be had without including most assessment categories.

Distributional assumptions: As this first work was primarily exploratory, we did not invest too much energy in constructing and validating a believable error structure for the feedback data. As later studies will work with more complex data, this will need to be improved upon. One potential solution may be to generate certain parameter estimates by bootstrapping, sidestepping the need to establish some distributional assumptions. However even that will often imply some structure in the error term.

Reasonable future paths for research may look like this:


 * Simple model, complex data
 * Build multiple samples of assessed and unassessed articles with roughly similar length and ratings per article (conditional resampling)
 * Perform a textual analysis of the then current revision for these samples, including template information, incoming/outgoing links, limited NLP, and category information.
 * Build an ordered probit (or classification tree) for project quality assessment and backtest the classification. May also use this information to test the value of article feedback rating as a prior in a similar classification scheme.


 * Dynamic model, simple data
 * Build a series of switch-on/switch-off data for assessment and ratings over the course of the survey period.
 * Test response to dynamic shocks and estimate a model for incorporating new information, backtest the model.
 * Potentially the easiest to do once all the data and shocks are programmed. Probably the hardest overall.

In any case, more work is yet to be done on the subject.