Wikipedia:Search engine indexing (proposal)

Search engines such as Google and Bing deliver search results by using computer programs called web crawlers to 'surf' the internet, looking for new pages to add to their search indices and for updates to previously 'crawled' pages. These potentially intrusive programs are governed by a set of standards that allow website owners to control which pages the crawlers may visit, and which links they may follow to reach new pages. In the context of Wikipedia, this means that we have the ability to control which pages are accessible to web crawlers, and hence which pages are returned by search engines such as Google.

Background
From Wikipedia's founding, all of its content was made accessible to web crawlers and search engines. Robots.txt, the file that controls web crawler access, was used primarily to block individual web crawlers that were making excessively long or rapid crawls and hence draining system resources. This meant that, in addition to all our encyclopedic content, enormous amounts of discussion, dispute, and drama were made available to external searches. This material is the subject of a considerable number of complaints to the OTRS service, and can often contain unwanted personal information about users, undesirably heated debates about article subjects, and other content that does nothing to enhance Wikipedia's reputation as a professional encyclopedia. In 2006 the German Wikipedia held a 'Meinungsbild' (roughly analogous to an RfC) and asked the developers to exclude all talk namespaces from web crawlers (see ), in an attempt to control some of this content.

Wikipedia's powerful presence as the internet's eighth most popular website gives all our pages very heavy weighting in search engine rankings; a Wikipedia page that matches the search term entered is almost guaranteed a place in the top ten results, regardless of the actual page content. While this status is extremely positive for our articles and content, it is not always beneficial:
 * "copyright violations" lists Copyright violations as its second result.
 * "basic dignity" links second to Basic dignity, a user essay.

In June 2006, MediaWiki was enhanced to give developers the ability to exclude individual namespaces from being indexed by web crawlers. This functionality was extended in February 2008 to allow developers to set indexing policy on individual pages. Finally, in July 2008, users were given the ability to manually set indexing policies for individual pages using two magic words, __INDEX__ and __NOINDEX__; the developers can configure the namespaces in which these magic words function.
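As a sketch of how these magic words take effect (the exact `content` value of the emitted meta element depends on the wiki's configuration, so the output shown here is illustrative rather than definitive):

```text
An editor adds the behavior switch to a page's wikitext:

    __NOINDEX__

MediaWiki then emits a robots meta element in the rendered page's HTML
head, which compliant crawlers read as "do not index this page":

    <meta name="robots" content="noindex,follow"/>
```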

Until late 2008, the poor quality of Wikipedia's own internal search engine meant that editors relied upon Google to find material for internal purposes, such as past discussions, useful help pages, and other information. In October 2008, the internal search function was significantly improved, providing all the functionality already available through search engines such as Google and incorporating a number of features unique to Wikipedia, such as automatic identification of redirects and page sections, and more appropriate search rankings. This made internal search a superior method to external searches such as Google for finding internal content. In December 2008, further updates to the MediaWiki software enabled the insertion of inline search buttons to search through sets of subpages, such as the archives of talk pages or the Administrators' noticeboard.

As a result, virtually all editorial pages have been spidered (indexed by search engines such as Google). When Wikipedia was a smaller website this was not a major concern; as a "top 5-10 website" it is. Discussion of Wikipedia users, including their internal actions as editors, is routinely a "top hit" for those individuals long after they edit, and pages outside mainspace and the well-patrolled parts of other namespaces may contain large amounts of unchecked, unverified user writing, which any user may place within a variety of namespaces. Unless significantly problematic and actively noticed, such pages may go unchecked and remain spidered as Wikipedia content for years.

Our visitors and readers look for encyclopedic content, not inward-facing discussions and disputes between users. Our readers come first. There is considerable content we want the public to find and see: that is the end product of the project.

The rest, including popular project pages such as AfD, all "talk" namespaces, dispute resolution pages, user pages, and so on, is of little benefit to the project if indexed by search engines. Many of these pages also raise considerable concerns about privacy and the ease of finding harmful material (user disputes and allegations) on Google, concerns that far outweigh any help the pages give the project. We do not need them publicized; they are internal (editorial use) pages.

It is proposed that it is finally time to close the gap. Rather than NOINDEXing individual pages in a mostly ad-hoc fashion, there is no strong continuing rationale for any "internal" page to be spidered at all, and several existing problems would be reduced by ending the practice. Internal search can be used to find such material, and spidering can be switched off for anything that is not genuinely of public note as our "output/product".

A prior discussion has taken place at Wikipedia:Village pump (policy)#NOINDEX of all non-content namespaces (Dec 2008 - Jan 2009). This proposal is being set up to formally see if consensus exists to request these changes, and to identify the technical means to do so.

Proposal
The proposed changes fall into two areas: technical, and procedural, as described below.

Technical
The Wikipedia:, MediaWiki: and Template: subject namespaces, and all talk namespaces, are set to be unindexed by default; that is, no pages in these namespaces will be indexed by web crawlers, and hence they will not appear in search engine results, although all pages will continue to be visible in Wikipedia's own internal search results.

In addition, the magic words __INDEX__ and __NOINDEX__ are disabled in the MediaWiki: and Help: subject namespaces, and in all talk namespaces. This has the effect of 'locking in' the default setting so that it cannot be changed on a per-page basis.
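For the technically curious, MediaWiki exposes both of these controls through site configuration. The fragment below is a hedged sketch only: the setting names ($wgNamespaceRobotPolicies and $wgExemptFromUserRobotsControl) are real MediaWiki options, but the namespaces and values shown are illustrative, not the actual change that would be requested from the developers:

```php
# Illustrative LocalSettings.php fragment (not the actual configuration):

# Default robot policy per namespace
$wgNamespaceRobotPolicies = array(
    NS_PROJECT   => 'noindex,nofollow',  # Wikipedia:
    NS_MEDIAWIKI => 'noindex,nofollow',  # MediaWiki:
    NS_TEMPLATE  => 'noindex,nofollow',  # Template:
    NS_TALK      => 'noindex,nofollow',  # likewise for the other talk namespaces
);

# Namespaces where __INDEX__ / __NOINDEX__ are ignored,
# 'locking in' the defaults set above:
$wgExemptFromUserRobotsControl = array( NS_MEDIAWIKI, NS_HELP, NS_TALK );
```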

The new indexing settings are shown graphically in the table to the right.

Procedural
With these changes, it becomes necessary to develop new guidelines to govern the use of the magic words __INDEX__ and __NOINDEX__ in those namespaces where they function.


 * INDEX in User: namespace


 * INDEX in Wikipedia: namespace


 * Pages such as policies, guidelines, and 'any well-recognized stable reference pages' (consensus basis) will remain indexed.
 * Other pages may be individually indexed on a case-by-case basis (consensus basis).


 * NOINDEX in File: namespace

Some content (non-encyclopedic material such as bug reports, internal project logos, etc.) may be NOINDEXed on a consensus basis. A discussion of NOINDEXing non-free media is likely to take place separately from this proposal.


 * INDEX in Template: namespace


 * NOINDEX in Category: namespace

'Maintenance' categories will be manually NOINDEXed; all other categories (i.e. content categories) should not be overridden and will remain indexed.


 * NOINDEX in Portal: namespace

Implementation

 * Once this page is complete, the community will be asked to consider the proposals to change the index status of the various namespaces as described above. The different parts of this proposal will be put forward separately so that editors may pick and choose their preferences on a per-namespace basis.
 * For those namespaces where consensus is reached, WMF and technical users will be asked to determine the most appropriate way to implement the decision.

FAQ

 * Will this be a problem if users rely on Google to find non-content in Wikipedia?
 * No. In November 2008 the site's internal search was enhanced. The new search handles complex queries of the same kind as Google, and has other features which make it better than Google for searching these spaces.
 * For example, internal search can handle the same boolean expressions and "page title" searches as Google advanced search, but it also understands namespaces and page "sections", can match words containing wildcards, and so on, which Google cannot. In addition, the many pages that are already NOINDEXed can be found by internal search but are invisible to Google.
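A few illustrative queries may help. The general shape follows the improved internal search; exact operator support may vary by software version, so treat these as examples rather than a reference:

```text
intitle:"copyright violations"      matches the phrase in page titles only
Wikipedia talk: civility block*     limits the search to one namespace;
                                    the * wildcard matches block, blocked, ...
"request for comment" -archive      phrase search with a term excluded
```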


 * What will users need to know?
 * Users will need to use internal search rather than external search to find material within past discussions. Once they get used to clicking "search" rather than "Google", they will find that the same formats as Google Advanced Search are accepted, and also that more directly useful information is available to Wikipedians searching past discussions, such as limiting a search to specific namespaces, or "section" and "section title" information, than they had when using Google.


 * How will users be notified?
 * Such a change requires clear advance notice. Users would be notified of the change by a clear banner and noticeboard posts a month in advance, and directed to a useful link and help information. Other means of making the switchover easy would also be used as fully as possible. New users would pick up "this is how one searches discussions" in the same way that they pick up how to review history revisions, or markup, or any other Wikipedia editorial know-how.


 * What else might happen during the month's advance notice?
 * By the time the technical side has been discussed and a month's notice has passed, it is likely that most of the obvious project-space pages needing to be INDEXed, or those where consensus would form, will have been tagged as INDEXed. Users are unlikely to wait :)


 * Will this affect Wikipedia's rankings?
 * Wikipedia is ranked near the top on many topics because its content is very heavily referenced. The impact of this proposal is very difficult to predict.


 * Why is Project space being proposed to be indexed the way it is?
 * Short answer: pages we would want to spider in project space are likely to change relatively slowly in number or location. The ones we do not want to spider are written at the drop of a hat or are obscure, and likely far outnumber them. So we default to not indexing unless indexing is decided upon.


Slightly longer answer: Project space contains a wide range of material. It can include, like userspace, almost any user writing, provided it appears superficially to be about the project or of project interest: discussions; disputes; negative material on users; essays on the viewpoints of any editor; and considerable other unchecked material.

It also contains a significant amount of genuinely valuable material that is as much our "output/product" as any article: our policies, guidelines, explanations of processes, well-recognized stable pages on Wikipedia/Wikimedia, reference data, and so on.

Project space is a mix of all of these. Some of it should be spidered (broadly, the valuable material above and anything else consensus endorses). A lot of it is unchecked, and new material may be added at any time. Since policies and guidelines can be collectively indexed simply via their respective templates, the number of stable, valuable reference pages changes slowly, and the number of other pages grows far faster and is unchecked, it is easier and more effective to default to NOINDEX and then index as exceptions anything (or any group or category of pages) that consensus says is valuable.


 * Can a namespace actually be set as "no index, not overridable"?
 * Short answer: Yes, both MediaWiki developers and en.wiki admins can make these settings, although the most effective solution involves a combination of both.
 * Full answer: A page can be set to not be indexed in a number of ways. Web crawlers used by search engines check for a file called "robots.txt" at the root of a webserver, and use it to set global parameters for which paths on the site the crawler may access. Wikipedia's robots.txt file is viewable at http://en.wikipedia.org/robots.txt. Entries can be added to the file either by Wikimedia developers, or by en.wiki admins by editing MediaWiki:Robots.txt; entries added by the developers override those added by en.wiki admins. Secondly, meta HTML tags may be added to the header of individual pages to tell web crawlers that visit the page to 'ignore' it. Several MediaWiki configuration settings allow these tags to be set on a wiki-wide, per-namespace, and per-page basis. Finally, wiki users can add a behavior switch to a page's wikimarkup to manually add an HTML meta element: the switch __NOINDEX__ adds a meta element that excludes the page from crawlers, while the switch __INDEX__ removes excluding meta elements that may have been added by other methods. These switches are usually applied by adding the templates {{NOINDEX}} and {{INDEX}}, which lets us easily keep track of their use.

Meta HTML tags cannot override restrictions set in the robots.txt file: a page excluded by robots.txt will never be fetched, so a local override in its markup will never be noticed. Finally, the namespaces in which the __INDEX__ and __NOINDEX__ switches are recognised can be set via another MediaWiki configuration setting; currently the switches are disabled in the mainspace only.

Using these options, we can ask the developers to implement any permutation of default state and overrides for any namespace (using the MediaWiki configuration settings), and also block both individual pages (using __NOINDEX__) and page hierarchies (using MediaWiki:Robots.txt) on an ongoing basis.
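For illustration, a robots.txt excerpt in the style used on Wikipedia might look like the following. The directives are standard robots.txt syntax, but the specific paths are examples, not the live file's contents:

```text
# Applies to all compliant crawlers
User-agent: *
# Keep deletion debates and their subpages out of search engines
Disallow: /wiki/Wikipedia:Articles_for_deletion/
# Exclude an entire talk namespace
Disallow: /wiki/Wikipedia_talk:
```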


 * Isn't this page pointless, since the community has already decided that it wants pages outside mainspace to be indexed?


 * The community has never had the opportunity to form a consensus on this issue; as explained above, the ability to restrict web crawler access to pages was implemented long after Wikipedia's formation, and until recently the poor internal search function made noindexing impractical. Now that the situation has changed, we can form a legitimate consensus. Don't forget that, even if the community had previously decided that non-mainspace pages should be indexed (which it has not), such a consensus can change over time as circumstances change, such as with the updated internal search.