User:Sphilbrick/Sandbox re noindexing of new articles by novices

What problem is being addressed?
Brand-new users, many of them unfamiliar with Wikipedia's guidelines, create pages that violate many rules. Many of these pages are indexed by search engines such as Google. Links to Wikipedia pages often score very high in Google's page-rank system, so these pages are often among the first presented to a person using a search engine. In the case of some rule violations (poor grammar), it is only mildly troubling that such a page would appear in a search engine. Other violations, however, such as copyright infringement and libelous statements about living persons, are far more serious. While established editors can and do run afoul of the rules, pages created by the newest Wikipedians are more likely to be troublesome.

I did a very unscientific review of a few recently created pages. Results are here: User:Sphilbrick/Sandbox for support of proposal. The sample size is too small to draw broad conclusions, but it does illustrate some of the issues. In short, I reviewed a number of new pages created by users with fewer than ten edits and found eight that raised concerns. Six of the eight are already in Google and can still be found there, even though, in some cases, the underlying page has been deleted.

What are the benefits?
Under this proposal, none of the eight pages identified would be found using a search engine outside of Wikipedia. One of these pages is flagged for a copyright violation, and all but one of the others reflect poorly on Wikipedia.

While this benefit is admittedly modest, I estimate that dozens of problem pages are created each day by non-autoconfirmed users, and these pages are finding their way into search engines. If a better estimate of the potential count is needed, I can do a more rigorous analysis, but I'll note that my review covered fewer pages than are generated in a single day, so the annual total is almost certainly in the thousands. (However, I don't know how long it takes a deleted page to drop out of Google, so I don't have an estimate of the number of problem pages in existence at any one time.)

What are the costs?
I see four types of costs. The list is somewhat long, but each item is minor:
 * 1) Implementation time
 * 2) Education time
 * 3) Template removal time
 * 4) False positive cost

I am not a programmer (IANAP), so I don't know precisely what is necessary to implement this, but I assume that when a user creates a new page, code runs that would need modification. That code needs to determine whether the user is autoconfirmed, but my understanding is that this check already happens on every user action, so it adds no cost. The code would have to create the new page in a slightly different way for the two types of users, but adding a template should not be a major cost. (Does the system already add a non-patrolled template, or is that handled a different way?)
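The mechanism described here is simple enough to sketch. The following is an illustrative outline only, assuming hypothetical names (`User`, `create_page`, `NOINDEX_TEMPLATE`) that are not MediaWiki's actual interfaces; it merely shows the decision being proposed:

```python
from dataclasses import dataclass

# Sketch only: these names are hypothetical, not MediaWiki's real API.
NOINDEX_TEMPLATE = "{{NOINDEX}}"  # template that keeps the page out of search engines


@dataclass
class User:
    edit_count: int
    account_age_days: int

    @property
    def is_autoconfirmed(self) -> bool:
        # English Wikipedia's usual autoconfirmed threshold: 4 days and 10 edits.
        return self.account_age_days >= 4 and self.edit_count >= 10


def create_page(author: User, wikitext: str) -> str:
    """Return the wikitext to store for a new page, prepending the
    noindex template when the author is not yet autoconfirmed."""
    if author.is_autoconfirmed:
        return wikitext
    return NOINDEX_TEMPLATE + "\n" + wikitext
```

Under this sketch, a page from a brand-new account carries the template from the moment of creation, while an established editor's page is untouched; any later editor can remove the template simply by editing the text.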

If the proposal is implemented, user manuals would need a rewrite. The cost is not zero, and I may be underestimating the number of places affected, but it is not major.

If the proposal is implemented, someone will have to remove the template at some point. However, most of these pages will carry templates relating to other issues, so removing this template when the other problem messages are removed should add only a minor amount of time.

Finally, there is a possibility that a new user will create an excellent article about a breaking news item, and search-engine users will run across the article later than they would under the present approach. While this needs to be mentioned, I find it difficult to believe it could be a meaningful problem. New pages relating to breaking news are highly likely to be patrolled, and it seems very unlikely that such a page would escape review by an autoconfirmed user able to remove the template for more than a few minutes.

Background
This proposal was inspired by a discussion in Technical here. This proposal stands on its own, but interested readers may want to follow the link to see, for example, why removal based on a time limit was criticized, or why a proposal requiring an administrator to change the status was considered a problem.

Alternative
I feel that the noindex template rule should apply more broadly than just to users who are not yet autoconfirmed; I would prefer a threshold of something like a few hundred edits and a few months. However, creating a new class of users is likely to be a big deal, so I chose to make the distinction using autoconfirmed status. I do note that Tor users have rules applied at a different point: 90 days and 100 edits. If that status is easily available, I would find it a better choice.
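For comparison, the stricter Tor-style threshold can be expressed as a simple predicate. The function name and parameters below are illustrative only; the defaults are the 90-day/100-edit figures cited above:

```python
def meets_strict_threshold(edit_count: int, account_age_days: int,
                           min_edits: int = 100, min_days: int = 90) -> bool:
    # Defaults are the Tor-user figures (90 days, 100 edits); a "few
    # hundred edits and a few months" rule would simply raise them.
    return account_age_days >= min_days and edit_count >= min_edits
```

The point of the parameters is that the cutoff is a policy choice, not a technical one: the same check works whether the community picks the autoconfirmed numbers, the Tor numbers, or something stricter.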

Third pass review
Selection criteria:
 * Review Special:NewPages for first eight hours 23 May 2009
 * Identify new page created by user with <10 edits, and without user page
 * Review page to subjectively determine whether the world would be a better place if the entry could be found in a search engine.
 * Add to list, identify whether the answer is "yes", "no" or "maybe".
 * Exclude redirects
 * Sort list to place more problematic pages first

Observations:
 * These articles are newer, having been created today
 * Over one third (nine) of the identified articles should not (in my opinion) be in a search engine
 * More relevantly, seven of the 25 are already identified by other editors as possibly deserving deletion.
 * Seven of the nine articles identified by me are already in Google
 * Six of the seven articles identified by other Wikipedians as problematic are already in Google

First pass review
Selection criteria:
 * Review Special:NewPages
 * Identify new page created by user with <10 edits
 * Review page to subjectively determine whether the world would be a better place if the entry could be found in a search engine.
 * Add to list if the answer is "no" or "maybe".
 * Stop after eight entries (selected from fewer than 1000 new articles)

Second pass review
Selection criteria:
 * Review Special:NewPages for 25 April 2009
 * Identify new page created by user with <10 edits, and without user page
 * Review page to subjectively determine whether the world would be a better place if the entry could be found in a search engine.
 * Add to list, identify whether the answer is "yes", "no" or "maybe".
 * Exclude redirects
 * Sort list to place more problematic pages first

Observations:
 * Higher ratio of acceptable pages than in the first review, but this is partly because pages already found to be seriously problematic have been removed from Wikipedia, and thus from the list of new pages, even though some can still be found in search engines
 * While many of the pages deserve to be found in search engines, they will be once any editor removes the noindex tag, so false positives are not a significant problem