Wikipedia talk:WikiProject United States Public Policy/Assessment/Archive 1

Example
Let's assume that we have to rate the article Wilderness Act. What would the score look like? --Fschulenburg (Public Policy) (talk) 22:51, 29 June 2010 (UTC)

Open questions

 * What happens if I can rate some areas, but not all? E.g. I am able to fill out the areas "Format" and "Neutrality", but I have no clue about "Content"? --Fschulenburg (Public Policy) (talk) 22:51, 29 June 2010 (UTC)
 * What does a "1" and a "2" in the area "Neutrality" mean? --Fschulenburg (Public Policy) (talk) 22:58, 29 June 2010 (UTC)

ARoth (Public Policy Initiative) (talk) 23:31, 29 June 2010 (UTC)
 * By popular opinion, Neutrality is now scalable. For example, if an article has a neutral voice overall but one section seems biased, then you may want to score it a 2 in Neutrality.
 * I don't know exactly how to score if I am not familiar with a topic. My guess is that if I read an article and it seems like some historical context is missing, then the completeness-of-content score should go down. Likewise, if there is only 1 source cited then the accuracy score should go down.
 * What if the topic does not really need an image: is the image score a 0 or a 2? - I would go with 2 because the images are sufficient for the topic.

Ideas for improvement

 * I propose to specify what "0", "1", "2", etc. means. --Fschulenburg (Public Policy) (talk) 22:56, 29 June 2010 (UTC)

Results
ARoth (Public Policy Initiative) (talk) 23:39, 29 June 2010 (UTC)

Whenintransit (talk) 00:13, 30 June 2010 (UTC)

--Ldavis (Public Policy) (talk) 00:16, 30 June 2010 (UTC)

This was a good exercise. I like the streamlined version much better. Trouble spots in wording/explanation for me are noted below. My overall impression of the balance of the different score categories is that this doesn't put enough weight on the completeness of the content; an article could get a pretty high score for being a great stub, while if the same article was expanded to be essentially complete but with modest shortcomings in grammar, neutrality, and formatting, it could lose points all over the place and end up scoring lower.--Sross (Public Policy) (talk) 15:02, 30 June 2010 (UTC)
 * "Any images..." this implies that only the images that are actually there are to be evaluated; instead the question is how appropriately and adequately illustrated is the article? In this case, the image itself is a low-quality (very small) image, but it looks fine in the infobox.  But that's just a sort of "flavor" image; the important images are all missing: a portrait of the author of the act, maps of the initial wilderness areas and the areas today, typical examples of wilderness areas under the protection of the different wilderness services, and probably more (it's not clear what else is missing without having a fuller picture of what else would be in a complete version of the article).
 * "The accuracy and sufficiency of cited sources." In this case, there are no citations at all except for the statistics, while there is one source listed in the bibliography that seems reasonable but it's not clear what information in the article comes from it.  I'm also unclear about whether this score ought to represent sufficiency of cited sources to support what is actually in an article (meager as that may be) or sufficiency of sources compared to what a complete article would look like.  Probably the former, but it could use clarification.
 * Appropriate length is a tough one, because it will vary depending on who you ask. Wikipedia has some well-developed standards in this regard (Article size being the main one) but opinions will vary, especially if articles are being reviewed by non-Wikipedians. So appropriate Wikipedia article length should be defined, in broad strokes.
 * I didn't notice any grammar problems, but if there was a place to dock points for style, I would have done so.

Hybrid metric proposal
I tried to make a similar metric in spirit to what Amy has proposed, but rooted in Wikipedia's policies and expectations for high-quality articles. It has detailed breakdowns of scores for different aspects of article quality, but it also can translate into the standard Stub/Start/C/B scale and thus feed into the 1.0 assessment system without too much duplicated effort.

Explanation
Note: intermediate scores are possible between those listed below.

Comprehensiveness
The article covers all significant aspects of the topic, neglecting no major facts or details and placing the subject in context. Any score from 1 to 10 is possible.
 * The article is comprehensive, going into appropriate detail about all significant aspects of the topic, and using summary style where appropriate. - 10 points
 * The article is mostly comprehensive but falls short in one or more significant aspects of the topic. - 7 points
 * The article is well-developed in some aspects but requires major expansion in others. - 4 points
 * The article goes beyond a preliminary introduction, with at least some detail beyond a brief overview, but is far from comprehensive. - 3 points
 * The article is a stub, consisting of only a paragraph or two of brief introduction to the topic. - 1 point

Sourcing
The article is well-researched. It is verifiable and cites its sources, with inline citations to reliable sources for any material that is likely to be challenged and for all quotations. Any score from 0 to 6 is possible.
 * The article is well-sourced, such that readers can determine which information comes from which source. The most appropriate sources are used, including journal articles and scholarly monographs where possible. - 6 points
 * The article is mostly well-sourced, but has some material that is not sourced or does not use the most appropriate sources. - 4 points
 * A significant portion of the article is well-sourced, but the majority of it is not adequately sourced. - 2 points
 * The article contains only a bibliography, or only a small portion of the article is well-sourced. - 1 point
 * The article does not reference any reliable sources. - 0 points

Neutrality
The article has a neutral point of view, accurately representing significant points of view on the topic without advocating or placing inappropriate weight on particular viewpoints.
 * The article follows the NPOV policy. - 3 points
 * The article follows the NPOV policy, with only minor exceptions. - 2 points
 * Minor exceptions include subtle imbalances in the ways different comparably significant viewpoints are described, the exclusion of minor but still significant viewpoints when all major viewpoints are covered, etc. Such an article is neutral on the whole, but may have a few small problem areas.
 * The article mostly follows the NPOV policy for the viewpoints represented, but other major viewpoints are absent. - 1 point
 * The article falls significantly short of following the NPOV policy. - 0 points

Readability
The prose is engaging and of a professional standard, and there are no significant grammar problems.
 * The article has excellent style and grammar and is highly readable. - 3 points
 * The article is comprehensible and reasonably clear, but a need for copy editing is apparent. - 2 points
 * The organization, style and/or grammar of the article detract significantly from the reading experience. - 1 point
 * The article is difficult to understand and requires a thorough re-write. - 0 points

Illustrations
The article is illustrated as well as possible using images (and other media where appropriate) that follow the image use policy and have acceptable copyright status. The images are appropriately captioned.
 * The article is well-illustrated, with all or nearly all the appropriate images and captions. - 2 points
 * The article is partially illustrated, but more or better images should be added. - 1 point
 * The article has few or no illustrations, or inappropriate illustrations. - 0 points

Formatting
The article is organized and formatted according to Wikipedia standards and generally adheres to the manual of style.
 * The article is well-formatted and is mostly consistent with itself and with the manual of style. - 2 points
 * The article has modest deficiencies in format and/or deviates significantly from the manual of style. - 1 point
 * The article is poorly formatted such that the formatting detracts significantly from the reading experience. - 0 points

Translating scores
This is my first stab at interpreting what is supposed to be expected from articles of different classes on the 1.0 assessment scale, in terms of numerical ratings for the different aspects of article quality. For the lower classes, comprehensiveness and sourcing are the main things that differentiate articles of different classes; things like neutrality, style, layout, and illustrations quickly become important as well for the higher tiers of the assessment scale.


 * Stub - Any article with a 1 or 2 in comprehensiveness is Stub-class.
 * Start - Any article with a 3 or higher in comprehensiveness that doesn't qualify for a higher rating is Start-class.
 * C - An article must have at least a score of 4 in comprehensiveness and 2 in sourcing to qualify as C-class.
 * B - An article must have at least a score of 7 in comprehensiveness, 4 in sourcing, 2 in readability, and 2 in NPOV to qualify as B-class.
 * GA - Articles with at least 8 in comprehensiveness, 5 in sourcing, 2 in readability, 3 in NPOV, 2 in layout and 1 in illustrations may be good candidates to be nominated for Good Article status. (B is the highest rating automatically assigned by a numerical assessment.)
 * A - Articles with a 9 in comprehensiveness, 5 in sourcing, 3 in readability, 3 in NPOV, 2 in layout, and 2 in illustrations may be good candidates for an A-class review.
 * FA - Articles with full points in every category may be good Featured Article Candidates; even then, additional work may be necessary to comply fully with the manual of style.

--Sross (Public Policy) (talk) 01:37, 1 July 2010 (UTC)

Discussion

 * I like it. johnpseudo 14:51, 1 July 2010 (UTC)
 * I'd suggest much less emphasis on the pedestrian factors of style, and more upon comprehensiveness and coverage, which is where Wikipedia articles tend to be lacking. The point of this externally funded project, as I understand it, is to raise the level of coverage, not of copyediting. It's the intellectual quality of the articles for which we need the people in higher education. In my opinion, any article from this project should be at the highest level of comprehensiveness and referencing, and the number that meet this is the appropriate metric. Failing to meet these is the typical Wikipedia amateur work in the subject area. DGG ( talk ) 16:59, 1 July 2010 (UTC)
 * I disagree because we're explicitly inviting in people who don't already have a good grasp of the basics of writing a Wikipedia article. Comprehensiveness and coverage don't mean much if they're in the form of essays instead of encyclopedia articles. Nifboy (talk) 00:41, 2 July 2010 (UTC)
 * Yeah, we were talking about this yesterday; scores aren't really additive, because falling far short in one area can basically ruin any article, even if other areas would have high scores. That's why the sum of the scores doesn't matter (it's just a construct that makes analysis easier), what's more important is reaching certain levels of quality in all the different areas.  (This is what I tried to reflect with the "translating scores" part.)--Sross (Public Policy) (talk) 11:38, 2 July 2010 (UTC)
 * This now raises the question of how we are to evaluate, let alone measure quantitatively, such things as "comprehensiveness" and (what seems to be omitted) balance. Is this entirely impressionistic, and how will we align the impressions of various people? In all academic work I am aware of, the criterion becomes in the end the personal evaluation by experts. Who are our experts? This sort of evaluation is not really a collective task in the same way writing an article is. In the end, there has to be some explicit or at least understood standard -- and there has to be some way of validating that standard. I am not aware of any way this can be done in this subject except outside review by multiple experts, in the conventional academic peer review fashion -- and I have the strongest doubts about the actual validity of this, or at least the provable validity of this. DGG ( talk ) 02:14, 14 July 2010 (UTC)

Template for testing this out
Thanks to MSGJ, we have a template now that can automagically take this rating system and convert the numerical scores into standard ratings. Right now it's at WikiProject United States Public Policy/sandbox, but since this system seems to be enjoying a pretty good reception, I'm going to put that code into the actual banner soon if there are no objections. Please test it out; let's work out any bugs and refinements before we start using it live. Here's the basic syntax:

And here's an example of the output:

--Sross (Public Policy) (talk) 02:09, 6 July 2010 (UTC)
 * I moved it over to the live banner now, so feel free to try it out, with an eye out for any bugs or possible improvements.--Sross (Public Policy) (talk) 13:31, 6 July 2010 (UTC)

Accessibility
I don't know if it's within your remit to consider this, but one of the goals of Wikipedia is to make the encyclopedia as accessible as possible. You may already be aware of WP:Accessibility, but I'd particularly like to draw your attention to Alternative text for images, which is now pretty much expected for articles in the higher quality classifications. You may wish to consider how accessibility might factor into your assessment scheme. Regards --RexxS (talk) 01:47, 7 July 2010 (UTC)
 * hi RexxS, thanks for your comment, sorry for the delayed response. It seems that accessibility has a strong overlap with formatting and following the Wikipedia Manual of Style, so I think it is accounted for in the formatting part of the rubric. ARoth (Public Policy Initiative) (talk) 16:59, 17 September 2010 (UTC)

Team Members

 * My name is Amy Roth, and I am the Research Analyst for the project. I have a lot of research experience, but not much Wikipedia experience, other than using it a lot and thinking it is a great thing. Also, just to give you a heads up: I am expecting my first child very shortly, a girl, so I have been feeling the pressure to really get the assessment side of this project going before she comes. I realize that evaluating the Public Policy Initiative isn't quite as high on anyone else's priority list as it is on mine, but if someone is really interested and would like to be more involved I could really use the support. Thanks again for signing up, and I think we will be able to learn a lot about assessment and article quality through this project.ARoth (Public Policy Initiative) (talk) 21:49, 17 September 2010 (UTC)

Few questions and a couple of suggestions
I think that it does not exist yet. Recently, I participated in the WikiProject Films/Assessment/Tag & Assess 2009-2010. That project has over 60,000 articles, but check out that page when you get a moment. Maybe we could create some worklists similar to those of WP:FILM but with ranges of 10 or 20 articles per row instead of 200. This would help the reviewers in selecting the articles to review, moving from a random method to a more rational one.
 * Where are the lists of articles to review?


 * Actually, the assessment lists are on the pages created for each team member. This assessment is probably very different from other wikiprojects because it is designed as an experiment. While I encourage the assessment team to assess as many articles within the project as possible, the primary assessment goal is to complete assessments of the articles in the experiments so that we can draw some conclusions about assessment methods in Wikipedia and the impact of the Public Policy Initiative on article quality. The assessment requests probably seem random, because they are! In statistical analysis, sampling is everything, and the researcher must be able to prove random selection to make any claims about the results. So these articles were selected by an online random number generator and then designated to each team member by random draw. (It actually takes quite a while to design the experiment; the previous sentence captures about a week's work.) Each article in the first assessment experiment is reviewed by 5 different assessors. This experiment will tell us if the metric is any good, or more precisely how much the metric varies between different assessors.ARoth (Public Policy Initiative) (talk) 22:24, 24 September 2010 (UTC)
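As an illustration of the sampling design described above, here is a minimal sketch: articles drawn at random, each assigned to 5 assessors by random draw. All names, sizes, and the seed are made up for illustration; this is not the actual selection procedure used.

```python
import random

# Minimal sketch of the design described above: articles are sampled at
# random, and each sampled article is assigned to 5 assessors by random
# draw. All names and sizes here are hypothetical.
def design_experiment(articles, assessors, sample_size, reviews_per_article=5):
    rng = random.Random(2010)  # fixed seed so the draw can be reproduced
    sample = rng.sample(articles, sample_size)
    return {a: rng.sample(assessors, reviews_per_article) for a in sample}

articles = ["Article %d" % i for i in range(100)]
assessors = ["A", "B", "C", "D", "E", "F", "G"]
plan = design_experiment(articles, assessors, sample_size=10)
```

Because each draw uses `random.sample`, no article is selected twice and no assessor reviews the same article twice, which matches the requirement of provably random selection.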

I must confess that I do not like the adopted set of criteria for grading assignments. Why should we have different sets of values for the scores of each assessment area? Isn't it easier for the reviewers to use the same range <0-10> for each assessment area? The template itself and the related tool can multiply each of those values by a scale factor automatically. For example, we could multiply the Comprehensiveness value by 1, Sourcing by 0.6, Neutrality and Readability by 0.3, and Illustrations and Formatting by 0.2, and we would get about the same result. Or by 10, 6, 3, and 2 respectively, and we would get an integer between 0 and 260. It seems complicated, but the tool used for the analysis will make the calculations easily.
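The uniform-scale idea above can be sketched as follows. The scale factors come from the suggestion itself (10, 6, 3, 3, 2, 2 as integer weights); the dictionary and function names are made up for illustration.

```python
# Sketch of the uniform 0-10 scale idea above: integer scale factors of
# 10, 6, 3, 3, 2, and 2 give a weighted total between 0 and 260, and
# dividing by 10 recovers the current 26-point maximum.
WEIGHTS = {
    "comprehensiveness": 10,
    "sourcing": 6,
    "neutrality": 3,
    "readability": 3,
    "illustrations": 2,
    "formatting": 2,
}

def weighted_total(raw_scores):
    """raw_scores maps each category to a score on a uniform 0-10 scale."""
    return sum(WEIGHTS[cat] * s for cat, s in raw_scores.items())

full_marks = {cat: 10 for cat in WEIGHTS}
```

Using integer factors avoids fractional totals entirely; a full-marks article scores 260, which divided by 10 matches the 26-point maximum of the current rubric.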
 * Rubric


 * There are several reasons the quantitative metric was constructed this way. That is not to say it is the “right” way or the “only” way, and other experienced researchers may point out other possible methods. But it is part of my job to explain the methods and defend the results, so the metric must be one that generates results that I can analyze with appropriate statistical tools. I tried to explain the reasons as clearly as I could below:

ARoth (Public Policy Initiative) (talk) 22:24, 24 September 2010 (UTC)
 * 1) It is a rubric very similar to one that a professor would develop for a class paper assignment. Since we are looking to compare subject matter expert assessment to Wikipedian assessment, we need a metric that non-Wikipedians can easily learn. Also, using a rubric format that is widely used by university professors increases the likelihood that the research will be accepted by the academic community.
 * 2) The weights and thresholds of the metric reflect essential Wikipedia policies and principles.
 * 3) The metric needs to be quantitative to allow for more powerful statistical analysis. Non-parametric statistical tools, which are used for ranked qualitative data, usually require large sample sizes and don't typically generate very convincing results. Since resources for research in this project are pretty limited, we require a metric that can create meaningful results efficiently; that is why I pushed for a quantitative metric.
 * 4) It reduces variability if weighting is inherent in the scores. It is always a big challenge in statistical analysis to separate noise from real variation. Applying weights after values have been assigned typically increases the noise and makes it harder to capture real variation. That is why professors typically write the scores to reflect the final value of the question. For example, on an exam, there may be short essay questions worth 5 points each and then there may be a major concept question worth 20 points. If the professor were to allow 20 points for the short answer questions and then scale them later with a calculation, she would have many students complaining about why one short essay answer earned a 7/20 and one earned a 6/20 when they are essentially the same. The weight must be inherently understood when the value is assigned or it just increases noise instead of reflecting true variability.


 * Pay attention that:
 * Many WikiProjects and their assessment banners have an additional checklist for checking against the criteria for B-Class status. It applies to Start and C-Class articles:


 * b1 = 
 * b2 = 
 * b3 = 
 * b4 = 
 * b5 = 
 * b6 = 


 * Of course, these are true/false, yes/no choice questions, but they're very similar to the rubric proposed here.


 * To all reviewers: Do NOT forget that when an article belongs to more than one WikiProject, the quality scale must be the same across all WikiProjects. So, please change the class parameter value in each banner on that page.


 * I consulted Sage Ross, the online facilitator, about this point, because most of the articles I have seen that are tagged by multiple wikiprojects have different quality ratings for each one. I understand the convention to be that assessments are done on a per-WikiProject basis, but it is a good idea to update old assessments done by other projects, because most cases of conflicting assessments are cases where some are based on old versions.ARoth (Public Policy Initiative) (talk) 22:24, 24 September 2010 (UTC)
 * Yeah, it's per-project especially if you get outside of the basic stub-start-C-B range: Some large projects give A-class special treatment (see e.g. WP:MHRA at the Military History project), and not all templates support more esoteric classes like Redirect-Class. Nifboy (talk) 02:14, 25 September 2010 (UTC)
 * Can I please make a plea for editors not to alter quality assessments made by other WikiProjects, unless they are thoroughly conversant with the assessments used by that project. Several large projects differ in the options they use which are available in the assessment scale. For example WP:WikiProject Military history (one of the largest and most well organised) does not use C-Class (see here), so rating as C-Class in a MILHIST template can cause problems for their categorisation structure. It would be much more collaborative and polite to make a note on the article talk page and request a re-assessment from the WikiProject's talk page. You'll usually get a good response from the project, and this helps to build links between WikiProjects. Of course, if the project is moribund and you get no response after a while, then there's probably no problem with updating the assessment for them. Hope that helps. --RexxS (talk) 02:16, 25 September 2010 (UTC)


 * Remember that three levels, GA, FL and FA, are NOT assessments that can be assigned simply by a project member. These refer to external judgments of article quality made at WP:GA, WP:FL and WP:FA. If these tags are desired and the article meets the relevant criteria, it must be nominated at the appropriate venue and await comments. See also the guideline on the English Wikipedia.


 * At the moment, this project has 325 unassessed articles; 306 unknown-importance articles; 11 B-Class articles that will eventually need to be nominated for Good Article status, external judgments of article quality and peer reviews to move to the next level (GA); and only 1 GA-Class article. This means that, using the current rubric, it is very unlikely that an article will get more than 15-20 points.

That's all for now... ...and happy editing! –pjoef (talk • contribs) 11:58, 24 September 2010 (UTC)


 * I am glad for the discussion about this; pjoef raises some important issues. This wikiproject is somewhat of an aberration from other wikiprojects. As far as I know, this is the first time the foundation has attempted to impact the content of Wikipedia. Editors are still volunteers, of course – no one is paid to edit; through this project the foundation is simply targeting a subset of new contributors: public policy university students improving content through class assignments. I hope that most Wikipedians see the value of trying to get buy-in from the academic community. Broader university endorsement of Wikipedia will increase expert contribution, recruit more college student contributors, and increase diversity within the Wikipedia community.


 * One important goal of the PPI is to build credibility with the academic community about the quality of Wikipedia articles. I believe a lot of Wikipedia's content has been accurate for a number of years, but the academic community operates on slower communication cycles, so its impression of Wikipedia is behind the times. (The participating professors are the obvious exception.) The research and assessment aspect of this project is an important bridge to building that credibility. The challenge is to bring the essential Wikipedia policies into an article quality assessment that the academic community will recognize and respect. The best way I know to share information about the quality of Wikipedia with academia is to publish in peer-reviewed journals and present at conferences. The advantage of this is that these communications are also recognized by Wikipedia.ARoth (Public Policy Initiative) (talk) 22:24, 24 September 2010 (UTC)

Completed and a comment
I've completed my assessments; in a couple of cases other assessors had also completed theirs and it was interesting to compare the scores. I wonder if perhaps the Wikipedia experts will be harsher than the public policy experts when scoring readability, neutrality, formatting and illustrations, and less harsh on the other two areas -- sourcing and comprehensiveness. When the assessments are complete I think it would be good to compare scores in that way. Mike Christie (talk) 14:14, 26 September 2010 (UTC)
 * I'm also done, and I know (edit: I thought) I'm going to be on the extreme low end for grading neutrality, given that my philosophy is that "NPOV" does not stand for "No Point Of View", so even a factually-written stub gets hit hard on neutrality for missing POVs. I didn't give full marks for neutrality on any of my articles. (edit) Wow, I wasn't quite expecting others' scores to be like that. Nifboy (talk) 18:16, 26 September 2010 (UTC)
 * Interesting to hear that. My philosophy is that NPOV means reflecting mainly the POV expressed by the majority of reliable sources. If the article has no (or few) reliable sources, then it cannot verifiably be seen to be doing that. I'll be giving unsourced or poorly sourced articles low scores for "Neutrality" as a consequence. --RexxS (talk) 19:28, 26 September 2010 (UTC)
 * hi again Mike, yes! That's exactly what we are doing: comparing Wikipedian assessment to expert assessment. I suspect Wikipedians will typically be harsher critics; they care more about Wikipedia and have higher expectations, and they are more familiar with what constitutes a good article, so it is easier for them to identify articles that lack good qualities. It is interesting to hear two different philosophies on the NPOV policy; I wonder if it will result in very different scores in that area.... ARoth (Public Policy Initiative) (talk) 17:28, 27 September 2010 (UTC)
 * I completed mine, and am happy to find activity on this page. I am not surprised by the variance in assessments, particularly at this stage of project-level coordination, because each of us has our own established benchmarks; some have specific experience, some a particular expertise/interest, some a seat-of-the-pants understanding of what Wikipedia is, and others have experience in administering that. It is all grist for the mill, but I guess the biggest question to ask is: What is best for ARoth to crunch for the good of the project? Should we continue to use our own benchmarks, if we can define them consistently, or should we coordinate a more global understanding for easier numerical consistency? I can see advantages and disadvantages of both, with the former being more comfortable personally, being this early in a year-long effort. After a while, we may be able to translate assessments between us, without changing our particular methods of assessment, if we remain consistent. I consider that there is no problem with differences, particularly if those differences are consistent. For example, I strictly followed the rubric; i.e., used only the defined scores in the ranges, and did not vary between those specifically defined. In comprehensiveness, for instance, only scores of 10, 7, 4, 3 and 1 could be used; same for assessing sourcing, with only 6, 4, 2, 1 and 0 possible. This was how I approached the first article-set assessment, but am unaware if this was intended for the exercise; maybe a discussion is warranted. What do y'all say? CasualObserver'48 (talk) 07:55, 28 September 2010 (UTC)
 * I'd be interested to hear others' comments on examples of significant differences. For example, looking at your assessment page and mine, one article that we scored very differently is Equal Access to COBRA Act, which I gave a score of 4/26 to and you scored as 17/26.  Perhaps some discussion of how we arrived at these scores would lead to more consistency.  Amy, is there any reason for us not to engage in this sort of discussion?  Should we wait till the other assessors have completed their scoring, or should we just ask them not to get involved in this discussion till they've scored their articles? Mike Christie (talk) 09:30, 28 September 2010 (UTC)
 * I think my own scores, on the high end of the spectrum, come from the fact that I basically don't bother with anything above B-class anymore, and spend most of my time adding to shorter articles, so I differentiate more at the Stub and Start-class levels as opposed to the higher end of the spectrum. I reserved zeros for truly biased or unreadable prose. For Comprehensiveness, my view was informed by the fact that even FAs can be short, so I usually went looking for things that I thought were missing from the article, usually not knowing if the topic was better covered in some other article. Nifboy (talk) 10:52, 28 September 2010 (UTC)
 * That's an interesting thought. I spend most of my time working with FAs, either reviewing or writing them, and I suspect I am going to be at the low end of the scoring spectrum -- perhaps my expectations have been set too high. Mike Christie (talk) 12:08, 28 September 2010 (UTC)
 * It's Sturgeon's Law at work; 90% of our assessed articles are Stub- or Start-class. I consider typical C-class articles to have a comprehensiveness around 7, which gives me a lot of room to work below that. Nifboy (talk) 13:48, 28 September 2010 (UTC)
 * Mike et al., discussing the Equal Access to COBRA Act differences may be illustrative, but first I'll point to the considerable differences among the assessors' stated starting points noted here; these are succinctly different vantage points for our initial review, which leads to much of the variance. I have recently un-hidden my notes for the assessment, for others to see; I agree it is the way to do this. Another difference may be how each judges the notability and relevance of the article, generally. In its one sentence, this article weasels (missing domestic partner link) its description with just two internal links, one external link to bill specifics, and one ref specifically mentioning it; the LGBT tag also on the page, however, makes it much more informative in an expansive and very neutral way. Without that, I was thinking wheelchair ramps for access. For comprehensiveness, I gave it 4/10, because it 'is well-developed in some aspects but requires major expansion in others'; you gave it 1/10, because it 'is a stub, consisting of only a paragraph or two of brief introduction to the topic'. I certainly can't argue with your assessment, it is true; my difference with you is +3. For sourcing, I assessed 4/6: 'the article is mostly well-sourced, but has some material that is not sourced or does not use the most appropriate sources'. You assessed 0/6: 'the article does not reference any reliable sources'. I can't quite agree with that, but my difference here is +4. In these important parameters, I am at +7; that is considerable, but I think, defensible to a degree.


 * In neutrality, I gave it 2/3, considering from my note that it is 'more than a 1', which would indicate unequal coverage of one pov over another; I readily admit that NO points of view are specifically included, but do consider that method to be one of many ways of providing neutrality for a reader easily, with the LGBT tag very much indicating those differences in point of view, whether they are equal treatment, moral, nature vs nurture, or political. You rated it a 0/3, indicating that 'the article falls significantly short of following the NPOV policy', and you also stated above that NPOV 'does not stand for "No Point Of View"', and would therefore suffer. I can understand taking that view on highly debated topics, but not for me on this one, and there will likely be articles where we all may take that harder view, based on our backgrounds. I certainly have my areas in this regard, and may re-consider my assessing method because of that, after some discussion, simply because, if it is not done globally, it might not be done neutrally. I suggest discussion of a 'penalty premium' in comprehensiveness or neutrality, for side-stepping such elephant-in-the-room issues. My difference here is a +2, higher again.


 * In the parameters of readability, illustrations and formatting, the aggregate difference for me totals +4: 1 in readability, 2 in illustrations, and 1 in formatting. I assessed them all at full score, seeing 'no problems', 'nothing really needed' and 'no real problems', respectively. On illustrations, I do not feel that all articles need images; if none are really available or appropriate, shouldn't it be assessed at a full 2, rather than a 0? Maybe I should have lowered formatting by 1 more, because of that missing and informative link noted as weaseled above. I will also note that this assessment was my first ever; it was drudgery, but by the end it was getting more comfortable. I will wait for Amy to make some comments, but want to ask if there is any way we can change our assessments after some discussion. We are all different, but are also a team; we all should be in the same ballpark. Regards, CasualObserver'48 (talk) 05:34, 29 September 2010 (UTC)


 * Yeah, there's a huge difference between assessing the criteria against what's there ("are there enough sources for these two paragraphs?") versus assessing the criteria against a theoretical FA ("are there enough sources for an entire article?"). I prefer the former method because, for instance, adding unsourced paragraphs to a short-but-sourced article adds to the comprehensiveness but subtracts from the sourcing, which IMO is better than having it simply add to comprehensiveness. Nifboy (talk) 14:35, 29 September 2010 (UTC)

A couple things (my opinions, your mileage may vary, and Amy may have a bit of a different perspective, although I think this is roughly what our shared understanding is):
 * 1) The way sourcing is supposed to work (this is implied in the example rating in the video, but we should revise the rubric to make it clear) is that the sourcing score is based on the existing content, not sourcing of the theoretical FA.
 * 2) Neutrality is a bit different, because NPOV is not about lack of bias, it's about having all significant views fairly represented.  So missing points of view should count against it, although that should still be relative to the size of the article; in a really well-developed article, there's more room to devote to more minor views, whereas a properly neutral mid-sized article would only give detail about the more significant views and maybe summarize a range of minor views in a single sentence.
 * 3) The total score for an article doesn't really mean anything.  What matters more is whether ratings from different people translate to roughly consistent scores on the 1.0 scale.  The way I think of it is that the quality of an article isn't really additive; the relationship between different aspects is more complicated.  An outstandingly illustrated, well-sourced, perfectly readable and well-formatted one-sentence stub... is still just a stub, only a little bit more valuable than a dreadful one-sentence stub.  That said, for Amy's purposes, having a common understanding of what the individual scores mean is definitely important.  But for stubs, nothing besides completeness is going to matter all that much; the system was designed to capture quality differences between relatively well-developed articles.

--Sage Ross - Online Facilitator, Wikimedia Foundation (talk) 15:50, 29 September 2010 (UTC)
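As an aside, the non-additive behaviour Sage describes in point 3) can be sketched roughly in code. The point maxima below follow the values mentioned in this discussion (10 for comprehensiveness, 6 for sourcing, 3 for neutrality, 2 for illustrations); the readability and formatting maxima, the band cutoffs, and the function itself are invented purely for illustration and are not part of the actual rubric:

```python
# Hypothetical sketch, not the real metric: combine subscores into a
# 25-point total, then let completeness cap the 1.0-scale class so a
# perfectly formatted one-sentence stub still assesses as a stub.

def assess(completeness, sourcing, neutrality, readability, formatting, illustrations):
    """completeness 0-10, sourcing 0-6, neutrality 0-3, others 0-2."""
    total = (completeness + sourcing + neutrality
             + readability + formatting + illustrations)
    # Illustrative cutoffs on the 25-point total.
    bands = [(22, "FA/GA range"), (16, "B"), (10, "C"), (5, "Start"), (0, "Stub")]
    for cutoff, label in bands:
        if total >= cutoff:
            cls = label
            break
    # Completeness acts as a gate: a near-empty article stays a stub
    # no matter how polished its other aspects are.
    if completeness <= 2:
        cls = "Stub"
    return total, cls
```

On this sketch, an outstandingly illustrated, well-sourced, perfectly readable one-sentence stub scores a respectable total but still lands in the "Stub" class, matching the intuition above.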


 * This type of discussion is exactly what needs to be figured out in quality assessment. I am not an expert in what makes article quality; I have opinions, but I don't think they should have any more weight than anyone else's. In fact, probably less, because I am a new Wikipedian and have less understanding of the policies and community. The comments here illustrate some important flaws in the quantitative metric, and it would help the research immensely to have a metric that is fairly consistent. In something as subjective as article quality, it will never be like taking a temperature, where there is a fairly precise reading, but it can definitely be better. The goal of the metric/rubric was to create a tool that allowed different reviewers to produce approximately the same results. It is essential to a measurement tool that the results be reproducible; then we can say with some degree of confidence that these data indicate whatever it is we are testing. If through use of the metric/rubric we realize that some aspects result in different interpretations and therefore significantly different scores, maybe we should decide on what the best interpretation is and clarify it in the rubric. For example, should a stub article be unable to achieve full NPOV score or formatting score? I don't know, there are good arguments either way, but I would like there to be consensus that is clearly defined in the metric so that the scores from different reviewers in these areas are more similar. ARoth (Public Policy Initiative) (talk) 17:05, 29 September 2010 (UTC)

Clarity lacking
I arrived here via the watchlist header that reads "A team is forming to test Wikipedia's article assessments, as part of the Public Policy Initiative. Interested article reviewers can sign up now". It links to a subsection of this project page. Only after attempting to make something of where I landed did I scroll to the top of the page and read "WikiProject United States Public Policy". So.... I'm assuming it's of general interest to Wikipedians (usually implied by the watchlist mention)? What is the "Public Policy Initiative"? Is it an initiative unrelated to "WikiProject United States Public Policy", but its test bed just happens to be "WikiProject United States Public Policy"? Clarity is lacking.

I'm not personally looking for answers to any of these questions, but after spending five minutes with the page I arrived at from the watchlist, I have next to no idea what this is about. My sense is that this wikiproject is the test bed for some new, more general initiative, but that's far from clear. If it is, the new initiative needs to spend some more time on communications if it is going to advertise broadly. Just some feedback. Riggr Mortis (talk) 04:59, 29 September 2010 (UTC)
 * Thanks for pointing that out. I've added a pointer to the broader context for this assessment experiment, so that hopefully it will be easier for others coming in from their watchlists to figure out what this is all about.  This piece from The Signpost discusses the assessments and how they fit in to the Public Policy Initiative and Wikipedia more broadly, and this earlier piece gives an overview of the Public Policy Initiative.--Sage Ross - Online Facilitator, Wikimedia Foundation (talk) 15:08, 29 September 2010 (UTC)
 * There is also the page on the outreach wiki describing the project in more detail. ARoth (Public Policy Initiative) (talk) 17:08, 29 September 2010 (UTC)

Is the metric any good?
As I understand it, the purpose of this exercise is to examine the quality of the metric used for the assessment. I've now finished, so I'll throw in my 2d on the metric as initial observations. I'm referring to the section WP:WikiProject United States Public Policy/Assessment: I'll have a further think and perhaps add to these observations later. --RexxS (talk) 15:07, 29 September 2010 (UTC)
 * 1) Most of the "Assessment areas", such as 'Sourcing', employ level descriptors which are predominantly objective, while one, 'Readability', has level descriptors which are almost wholly subjective. Subjective descriptors will inherently introduce greater variability among assessors.
 * 2) I believe that not using the full range of integer scores available between level descriptors will produce an unnecessary quantisation in the values and reduce the discrimination of the assessment. As an example, 'Sourcing' has:
 * - "The article is mostly well-sourced, but has some material that is not sourced or does not use the most appropriate sources. - 4 points"
 * - "A significant portion of the article is well-sourced, but the majority of it is not adequately sourced. - 2 points"
 * An article that is about 50% well-sourced falls below the former but above the latter; giving 3 points in such cases increases differentiation.
 * 3) For what it's worth, each of the articles I reviewed produced an assessment that agreed with my initial impression of the article on the Stub–FA scale (which didn't always correspond with the assessment already made by other projects). However, that scale has some pretty broad bands (Start & C), so perhaps my experience is not surprising.
 * Excellent points. I'll give my thoughts.
 * While the areas do differ to some extent in how subjective they are, I don't think it's as stark a difference as you suggest. Sourcing, completeness and neutrality all vary based on the assessor's background knowledge, expectations, etc., while readability, in broad strokes, is usually pretty obvious to most people. It may be very subjective to determine which of two similarly written articles is more readable, but it's pretty objective that, say, the typical New York Times article is more readable than the typical article in a scientific journal (or the typical essay by a beginning writing student).
 * As the rubric notes for sourcing, "Any score from 0 to 6 is possible." The intention is that the full range be used.  In drafting the descriptions of specific scores, it didn't seem useful to define every single score number with prose.  But yeah, an article that is 50% well-sourced is supposed to get a 3.  We can expand the rubric if people think that's necessary, but the more detail it includes the more people will gloss over it.
 * The intention was to design a numerical system that would match up well with (and, hopefully, give a little more consistency to the broad bands in) the standard system. So yeah, if anyone is getting results from this system that differ significantly from what you would rate them in the standard system, that's a shortcoming, and we should discuss how it might be improved.
 * --Sage Ross - Online Facilitator, Wikimedia Foundation (talk) 15:28, 29 September 2010 (UTC)
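Sage's clarification that "an article that is 50% well-sourced is supposed to get a 3" amounts to interpolating between the descriptor values rather than snapping to them. A minimal sketch of that reading (the linear mapping and the function name here are illustrative assumptions, not anything stated in the rubric):

```python
# Illustrative only: derive a 0-6 sourcing score from the fraction of
# the article that is well-sourced, using the full integer range rather
# than only the descriptor values (0, 2, 4, 6).

def sourcing_score(fraction_well_sourced):
    """Map a fraction in [0.0, 1.0] to an integer score in [0, 6]."""
    if not 0.0 <= fraction_well_sourced <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return round(6 * fraction_well_sourced)
```

Under this reading, an article that is about 50% well-sourced (falling between the 4-point and 2-point descriptors RexxS quotes) gets a 3, which is the differentiation his point 2) asks for.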
 * Thanks Sage, I accept your criticism that my presentation of point 1 is too stark a contrast. However, consider an example of an expert medical practitioner assessing a medical article: they will probably know what the mainstream view is, and what the best sources are. Their judgement on sourcing is likely to correspond more closely to the actual range of sources than the judgement of the average reviewer. The difference between their assessment scores is the difference between an objective reality (what sources actually exist) and the limited knowledge of the non-expert. Now think about how they might judge the readability of a medical article which is peppered with jargon. The expert might find that perfectly readable, while the non-expert might find it heavy reading without a medical dictionary! But in this case, is there an objective reality that corresponds to either view? I believe not; they are both entitled to their opinions. That's rather more what I had in mind when I called readability "subjective". Does that make sense? --RexxS (talk) 16:18, 29 September 2010 (UTC)
 * Yep, definitely. And that's something where we don't have super well-defined norms for what audience to write for (in part because we don't know that much about who the readers of specialized topic articles are and what level of complexity they can handle, which is one potential benefit of the Article Feedback Tool), so the best we can hope for in this case is to use the metric to catch the bad stuff that everyone can agree is bad, while the legitimate disagreements (which amount to style choices) add a bit of noise to that. But in practice, I don't see issues of readability as that far out of line with other aspects of article quality; editors seem generally capable of coming to agreement about what's appropriate for any given article.--Sage Ross - Online Facilitator, Wikimedia Foundation (talk) 16:34, 29 September 2010 (UTC)