Wikipedia:Link intersection


 * "Inside every link is a tag waiting to get out" -- David Weinberger

"Link intersection" is an informal tagging system meant to complement the more formal taxonomies that we have been creating using categories. Instead of adding tags to articles, we would use the wikilinks that already exist in every article. These links are essentially a tagging system that is going unused: every time we link one article to another, we in effect create a tag, and all of these tags are managed in a pure wiki way. All that is missing is a way to harvest information from them. A new "Find similar articles" tool would do this by finding all articles that contain the same set of user-selected wikilinks. If the outgoing links of two articles intersect, the articles themselves are very likely to be similar.

This proposal requires a change to the MediaWiki software. "Link intersection" could give users valuable new research tools, and it could improve the categorization system by taking over some of the functions that system now handles.

There are three components to this proposal:
 * 1) Improving the "What links here" tool.
 * 2) Creating a "Find similar articles" tool -- the user interface for finding link intersections.
 * 3) Related changes to the categorization system.

Using links as "Keywords"
Wikilinks can be thought of as keywords. They have been added to articles to show that more information is available about the topics mentioned there. Using them as keywords has several advantages over a separate tagging system:
 * Wikilinks are limited to the names of existing articles, which keeps keywords from becoming trivial or silly.
 * Wikilinks are disambiguated. This means that each keyword has a specific definition -- the article it is linked to.
 * Wikilinks have already been added to virtually every article so the only thing needed to implement this system is the software.

Keyword searches -- improving "What links here"
"What links here" can already be filtered by namespace. If you are reading the article about Suspension bridge, clicking on "What links here" shows all pages that link to Suspension bridge. But the average user of Wikipedia reading about bridges is probably not looking for user pages or project pages that link to the article. The form at the top of the results page therefore includes a select box that narrows the list to the article namespace: Articles that link to "Suspension bridge".

But for the sake of usability, it has been suggested that the limitation to article space should be the default, and the results should be alphabetized or ordered through some relevance heuristic. So in effect, this modification would make "What links here" into a more effective keyword search.
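The suggested behaviour can be sketched in a few lines of Python. This is only an illustration, not MediaWiki code: the `Page` type, the `what_links_here` function, and the namespace labels are hypothetical names chosen for the example, and the backlink data is made up.

```python
from typing import NamedTuple, Optional


class Page(NamedTuple):
    namespace: str  # illustrative labels such as "Article", "User", "Wikipedia"
    title: str


def what_links_here(backlinks, namespace: Optional[str] = "Article"):
    """Return the titles of linking pages, filtered by namespace and alphabetized.

    The default restricts results to the article namespace, as the proposal
    suggests; pass namespace=None to list pages from all namespaces.
    """
    selected = [
        page for page in backlinks
        if namespace is None or page.namespace == namespace
    ]
    return sorted(page.title for page in selected)


# Made-up pages that link to "Suspension bridge".
backlinks = [
    Page("User", "Example user"),
    Page("Article", "Golden Gate Bridge"),
    Page("Wikipedia", "Example project page"),
    Page("Article", "Brooklyn Bridge"),
]

print(what_links_here(backlinks))
# ['Brooklyn Bridge', 'Golden Gate Bridge']
```

Alphabetical order is used here because it is the simplest of the orderings mentioned; a relevance heuristic would replace the `sorted` call.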

This should be much more efficient and useful than doing a regular text search. The most important improvement is much less noise. If there is a link such as Bridge, it points to a specific article (unlike Bridges which is a disambiguation page, as is Bridge (disambiguation)), so the context of the word is clear. It won't be linked when used in other ways such as "burned a bridge" or "bridge to the future". If an article has an outgoing link, it means that the specific subject matter is relevant to the article, and thus acts like a smart tag.

Link intersections
For keyword searches to be effective, you want to be able to find articles based on multiple keywords, not just the single search provided by "What links here". The core of this proposal is a new tool called "Find similar pages", which does multiple keyword searches. It is essentially a way to create and display the intersection of multiple "What links here" requests.

Being able to enter many different keywords, and then do an intersection of the links to each keyword would be a very effective way to search for information. The intersection of New York, Suspension bridge, and Subway would likely bring up the suspension bridges in New York that carry the Subway. Of course there will be some other articles that appear unexpectedly.

Here's how it would work: Let's say you're looking at an article like Golden Gate Bridge and you want to find some similar articles. You'd click on a new item in the toolbox that says "find similar pages" and you'd see a list of all the outgoing links from the article. Next to every link there would be check boxes, and you could check off the ones you are interested in, and then click on a button that says "find similar pages". You'd get a list of all the articles that have the same combination of outgoing links. This is the intersection of all the articles that link to each tag.
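As a rough sketch of the idea, the tool amounts to intersecting the "What links here" sets of the selected links. The Python below assumes a toy in-memory link table; the function names `build_backlinks` and `find_similar` and the sample data are hypothetical and only stand in for whatever the software would actually use.

```python
from collections import defaultdict


def build_backlinks(outgoing):
    """Invert {page: outgoing links} into {target: pages that link to it}."""
    backlinks = defaultdict(set)
    for page, links in outgoing.items():
        for target in links:
            backlinks[target].add(page)
    return backlinks


def find_similar(backlinks, selected, exclude=None):
    """Intersect the 'What links here' sets of every selected link."""
    sets = [backlinks.get(target, set()) for target in selected]
    if not sets:
        return []
    result = set.intersection(*sets)  # returns a new set; the index is untouched
    result.discard(exclude)           # drop the article the user started from
    return sorted(result)


# Toy data for illustration only; not taken from the real link tables.
outgoing = {
    "Manhattan Bridge": {"New York", "Suspension bridge", "Subway"},
    "Williamsburg Bridge": {"New York", "Suspension bridge", "Subway"},
    "Golden Gate Bridge": {"San Francisco", "Suspension bridge"},
    "Grand Central Terminal": {"New York", "Subway"},
}

backlinks = build_backlinks(outgoing)
checked = ["New York", "Suspension bridge", "Subway"]
print(find_similar(backlinks, checked, exclude="Manhattan Bridge"))
# ['Williamsburg Bridge']
```

Each checked box simply adds one more set to the intersection, so checking more links can only narrow the result.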

The default for the tool would show similar pages in the same namespace. Other namespaces, or all namespaces could also be selected. This is the same as the proposed overhaul for "What links here".

There might be different ways to view the list of links to select when using the tool:
 * Plain list. All the links appear in the order they first appear on the page.
 * Alphabetical list. All the links on the page would be sorted alphabetically.
 * Section lists. The links would appear in the order they appear in the article, and the section headings would also be displayed.
 * Article view. This would display the page with the text in grey except for the links and the checkboxes that have been added.
 * (Note: All four options are linked to mock-ups)

Technical considerations
The major obstacle to this proposal might be server load. If all the selected articles have many thousands of incoming links, finding the intersections could add too much to the server load. However, most general terms are not linked. For example, biographies rarely link the word 'person'. If huge intersections are a problem, there might be some ways to control them. If an article has a huge number of incoming links, it might be grayed out in the list. There is likely to be a similar, more specific link that wouldn't be grayed out. For example, if 'Bridge' had too many links, you'd be able to choose 'Suspension bridge'. Since the list of links will be fairly long, this might not be too much of a limitation.
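One way the graying-out rule could work is a simple cutoff on the number of incoming links. The sketch below is an assumption, not part of the proposal: the function name `selectable_links`, the `max_backlinks` threshold of 10,000, and the link counts are all invented for illustration.

```python
def selectable_links(outgoing_links, backlink_counts, max_backlinks=10_000):
    """Pair each outgoing link with a flag: True if its checkbox stays enabled.

    Links whose targets have more than max_backlinks incoming links would be
    grayed out, so they cannot be included in an intersection.
    """
    return [
        (link, backlink_counts.get(link, 0) <= max_backlinks)
        for link in outgoing_links
    ]


# Invented incoming-link counts, for illustration only.
counts = {"Bridge": 25_000, "Suspension bridge": 800, "San Francisco": 40_000}
links = ["Bridge", "Suspension bridge", "San Francisco"]

for link, enabled in selectable_links(links, counts):
    print(f"{link}: {'selectable' if enabled else 'grayed out'}")
```

With these made-up counts, only 'Suspension bridge' remains selectable, which matches the example in the paragraph above.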


 * As I see the present database schema, fetching the intersection of pages that link to a set of pages should be quite an efficient operation, requiring less database access than the existing "what links here" (though it would require a bit more memory for the query optimizer). An indexed subset of an indexed subset (...) is something SQL should be able to handle quite well.  --LDC (talk) 16:34, 25 February 2008 (UTC)
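To make the comment above concrete, here is a minimal sketch of such a query, run against SQLite from Python. It assumes a simplified, hypothetical table `links(from_page, to_title)` rather than the actual MediaWiki schema, and the rows are toy data. The function name `link_intersection` is likewise invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical simplified link table; the primary key doubles as an index
# that lets the database look up all pages linking to a given title.
conn.execute(
    "CREATE TABLE links ("
    " from_page TEXT NOT NULL,"
    " to_title TEXT NOT NULL,"
    " PRIMARY KEY (to_title, from_page))"
)
conn.executemany(
    "INSERT INTO links (from_page, to_title) VALUES (?, ?)",
    [
        ("Manhattan Bridge", "New York"),
        ("Manhattan Bridge", "Suspension bridge"),
        ("Manhattan Bridge", "Subway"),
        ("Williamsburg Bridge", "New York"),
        ("Williamsburg Bridge", "Suspension bridge"),
        ("Williamsburg Bridge", "Subway"),
        ("Golden Gate Bridge", "Suspension bridge"),
        ("Grand Central Terminal", "New York"),
        ("Grand Central Terminal", "Subway"),
    ],
)


def link_intersection(conn, targets):
    """Return pages that link to every title in targets."""
    targets = list(dict.fromkeys(targets))  # drop duplicates, keep order
    if not targets:
        return []
    placeholders = ", ".join("?" for _ in targets)
    query = f"""
        SELECT from_page
        FROM links
        WHERE to_title IN ({placeholders})
        GROUP BY from_page
        HAVING COUNT(DISTINCT to_title) = ?
        ORDER BY from_page
    """
    rows = conn.execute(query, (*targets, len(targets)))
    return [row[0] for row in rows]


print(link_intersection(conn, ["New York", "Suspension bridge", "Subway"]))
# ['Manhattan Bridge', 'Williamsburg Bridge']
```

The `GROUP BY ... HAVING COUNT(DISTINCT ...)` form is one common way to express an intersection in SQL; chained self-joins or `INTERSECT` would be alternatives, and which is fastest would depend on the database and its indexes.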

Changes to the categorization system
The goal of adding a link intersection system is to lift some of the burden now being carried by the categorization system. If link intersections are seen as the way to tag articles and do keyword searches, then this requirement would be removed from the categorization system. Many existing categories could probably be deleted, and far fewer of the categories that are now seen as over-categorization would be created.

There are currently several overlapping views about the purpose of Wikipedia's categories:
 * Categories are a tool for browsing
 * Categories are a means of classifying articles
 * Categories are an index of a subject
 * Categories are used to do a database search (e.g. American Film Actors)

Using categories to do a database search is often in conflict with the other uses of categories. Categories get diffused into small, specific groupings, which depopulates the parent categories that serve as a general index. With tools for link intersection, there would be less need to break up categories into very specialized small groups.

Proof of concept implementation
A working proof of concept implementation of Link intersection and Category intersection combined is available at
 * http://tools.wmflabs.org/dschwenbot/intersection