User:Dispenser/Checklinks

Checklinks is a tool that checks external links on Wikimedia Foundation wikis. It parses a page, queries all of its external links, and classifies them using heuristics. The tool runs in two modes: an on-the-fly mode for instant results on individual pages, and a project scan mode for producing reports for interested WikiProjects.

The tool is typically used in one of two ways: in the article review process, as a link auditor that makes sure the links are working; and as a link manager, with which links can be reviewed, replaced with a working or archive link, given citation information, tagged, or removed.

Background
Link rot is a major problem for the English Wikipedia, more so than for other websites, since most external links are used to reference sources. Some dead links are caused by content being moved around without proper redirection, others come to require micropayments after a certain time period, and others simply vanish. With perhaps a hundred links in an article, it becomes an ordeal to ensure that all the links and references are working correctly. Even featured articles that appear on our main page have had broken links. Some Wikipedians have built scripts to scan for dead links, and there are giant aging lists, like those at Link rot, last updated in late 2006. However, these scripts still require manually checking whether the link is still there, searching for a replacement, and finally editing the broken link. Much of this work is repetitive and an inefficient use of an editor's time. Checklinks attempts to increase efficiency as much as possible by combining the most-used features into a single interface.

Running
Type or paste the page's URL, its title, or a wikilink into the input box. All major languages and projects are supported. MediaWiki search functionality is not supported in this interface at this time.
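As a rough illustration of how such input might be normalized, the function below is a hypothetical sketch, not the tool's actual code; treating en.wikipedia.org as the default project is an assumption made for the example.

```python
import re
from urllib.parse import urlparse, unquote

def parse_input(text):
    """Resolve a URL, a [[wikilink]], or a bare title to (host, title).

    Hypothetical helper sketching the kind of normalization such a tool
    performs; the default project is an assumption.
    """
    text = text.strip()
    # Full article URL, e.g. https://en.wikipedia.org/wiki/Link_rot
    if text.startswith(("http://", "https://")):
        parts = urlparse(text)
        title = unquote(parts.path.split("/wiki/", 1)[-1]).replace("_", " ")
        return parts.netloc, title
    # [[wikilink]] syntax, possibly piped: [[Link rot|rot]]
    m = re.match(r"\[\[([^|\]]+)(?:\|[^\]]*)?\]\]", text)
    if m:
        return "en.wikipedia.org", m.group(1).strip()
    # Otherwise treat the input as a plain page title
    return "en.wikipedia.org", text
```

All three input forms resolve to the same page, e.g. `parse_input("[[Link rot]]")` and `parse_input("https://en.wikipedia.org/wiki/Link_rot")`.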

Interface

 * Page heading
 * 1) Name of the article for the set of links below
 * 2) Other tools that allow checking of contributions and basic page information.
 * 3) "Save changes" is used after setting actions with the drop-downs.
 * Link
 * 1) Reference number
 * 2) The external link.  May contain information extracted from cite web or citation.
 * 3) HTTP status code; tooltip contains the reason as stated by the web server.
 * 4) Analysis information.  In this example the tool determined that one of the 302 redirects was likely a dead link.
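The status-code column might be classified along the following lines; the categories and groupings below are assumptions for illustration, not the tool's actual heuristics.

```python
def classify_status(code):
    """Rough classification of an HTTP status code, mirroring the kind
    of link triage described above. The buckets are illustrative
    assumptions, not the tool's real rules."""
    if code in (200, 203, 206):
        return "working"
    if code in (301, 308):
        return "permanent redirect"   # the site asks that the URL be updated
    if code in (302, 303, 307):
        return "redirect (review)"    # may mask a soft 404
    if code in (404, 410):
        return "dead"
    if 500 <= code < 600:
        return "server error (retry later)"
    return "needs review"
```

A 302 deliberately lands in a "review" bucket rather than "working", matching the cautious treatment of redirects described later in this page.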

Repair
Once the page has fully loaded, select an article to work on. Click on the link to make sure the tool has correctly identified the problem (errors can be reported on the talk page). If the link is incorrect, you can try a Google search to locate it again, right-click and copy the URL, and paste it into the prompt created by the "Input correct URL" or "Input archive URL" option. The color of the box on the left changes to reflect the type of replacement that will be performed on the URL. When you're finished, click "Save changes" and the tool will merge your changes and present a preview or a diff before letting you save.
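The merge-and-preview step can be sketched with Python's standard difflib. This is a minimal illustration under the assumption of a plain string substitution; the real tool also updates citation templates.

```python
import difflib

def replace_url(wikitext, old_url, new_url):
    """Swap one URL for another and return (new_text, unified_diff).

    A minimal sketch of a merge-and-preview step; function name and
    behavior are illustrative assumptions, not the tool's code.
    """
    new_text = wikitext.replace(old_url, new_url)
    diff = "\n".join(difflib.unified_diff(
        wikitext.splitlines(), new_text.splitlines(),
        fromfile="before", tofile="after", lineterm=""))
    return new_text, diff

text = "See [http://example.com/old report] for details."
new, diff = replace_url(text, "http://example.com/old",
                        "https://example.com/new")
```

Showing the diff before saving gives the editor the same last-chance review that the tool's "Save changes" flow provides.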

Redirects
There are principally two types of redirect in use: HTTP 301 (permanent redirect) and HTTP 302 (temporary redirect). With the former, the site is recommending that the URL be updated to the new address. The latter, in contrast, is optional and should be reviewed by a human operator.

Some links are access redirects, used to avoid the need to log into a system; these effectively serve as permalinks. Finally, there are redirects that point to fake or soft 404 pages. Do not blindly change these links!
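One way to inspect a single redirect hop without following it, and to label it as permanent or temporary, is sketched below using only the Python standard library. This is an illustration, not the tool's code; the helper names are our own.

```python
import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so we can inspect them."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None   # returning None makes urllib raise HTTPError instead

def redirect_kind(status):
    """Label a redirect status code using the terminology above."""
    return {301: "permanent", 308: "permanent",
            302: "temporary", 303: "temporary", 307: "temporary"}.get(status)

def probe(url):
    """Fetch one hop of a URL and report (status, kind, location).
    Makes a network call; illustrative only."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        resp = opener.open(url, timeout=10)
        return resp.status, None, None
    except urllib.error.HTTPError as e:
        return e.code, redirect_kind(e.code), e.headers.get("Location")
```

A "permanent" label suggests the target URL could be adopted; a "temporary" one, per the guidance above, deserves a human look before any change.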

Do not "fix" redirects

 * It removes access to the archive history kept by WebCite and the Wayback Machine under the old URL
 * WP:NOTBROKEN reasons that the cost of an edit far exceeds the value of fixing a MediaWiki redirect. The same can be said of redirects on external links.

Archives
The Wayback Machine is a valuable tool for dead link repair. The simplest way to get the list of links from the Wayback Machine is to click on the row. You can also load the results manually and paste them in using the "Use archive URL" option. The software will attempt to insert the URL using the archiveurl parameter of cite web.
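The Wayback Machine also exposes a public availability API at archive.org/wayback/available, which returns the closest archived snapshot for a URL as JSON. A minimal sketch of querying it follows; the helper names are our own, not part of the tool.

```python
import json
import urllib.parse
import urllib.request

API = "https://archive.org/wayback/available"

def availability_query(url, timestamp=None):
    """Build a request URL for the Wayback Machine availability API."""
    params = {"url": url}
    if timestamp:                  # YYYYMMDD narrows to the closest snapshot
        params["timestamp"] = timestamp
    return API + "?" + urllib.parse.urlencode(params)

def closest_snapshot(response_json):
    """Extract the closest archived snapshot URL, or None if unarchived."""
    snap = response_json.get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return snap["url"]
    return None

def find_archive(url):
    """Query the API over the network (illustrative)."""
    with urllib.request.urlopen(availability_query(url), timeout=10) as r:
        return closest_snapshot(json.load(r))
```

The returned snapshot URL is exactly the sort of value that would go into the archiveurl parameter mentioned above.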

Tips

 * Most non-news links can be found again by searching for the title of the link. This is the default search setup.
 * Links can be taken from the Google results by right-clicking, selecting "Copy Link Location", and inputting them through the drop-down.
 * Always check the link by clicking on it (not the row): some websites do not like how tools send requests (false positives), or the tool may not be smart enough to handle a site's incorrect error handling (false negatives).
 * Non-HTML documents can sometimes be found by searching for their file name.
 * If Google turns up the same link, leave it be: it has recently or temporarily become dead, and you will not find a replacement until Google's index is updated.
 * You may wish to email the webmaster asking them to use redirection to keep the old links working.
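The file-name tip above can be automated with a small helper that pulls the file name out of a URL for use as a search query. This is a hypothetical convenience function, not part of the tool.

```python
import posixpath
from urllib.parse import urlparse, unquote

def search_terms(url):
    """Suggest a search query for a dead non-HTML link: the bare file
    name often turns up mirrors of the same document elsewhere."""
    path = unquote(urlparse(url).path)   # decode %20 and friends
    return posixpath.basename(path)
```

For example, a dead PDF link yields just its file name, which can be pasted into a search engine as-is.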

Internal workings
The tool downloads the wikitext using the edit page. It checks that the page exists and is not a redirect. Then it processes the markup: escaping certain comments so they are visible, removing nowiki'd parts, expanding link templates, numbering bracketed links, adding reference numbers, and marking links tagged with dead link. Since templates are not actually expanded, some do not work as intended, most notably external link templates. A possible remedy is to use a better parser, such as mwlib from Collection. The parsed page can be seen by appending  to the end of the URL.
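One of the preprocessing steps above, numbering bracketed external links, might look roughly like this. It is a deliberately simplified sketch; the tool's real parsing is more involved.

```python
import re

# Matches [http://... optional label] style bracketed external links
BRACKETED = re.compile(r"\[(https?://[^\s\]]+)([^\]]*)\]")

def number_links(wikitext):
    """Prefix each bracketed external link with a reference number and
    collect the URLs, sketching one preprocessing pass described above."""
    urls = []
    def repl(m):
        urls.append(m.group(1))
        return f"[{len(urls)}] {m.group(0)}"
    return BRACKETED.sub(repl, wikitext), urls

text = "See [http://example.com site] and [http://example.org/a]."
marked, urls = number_links(text)
```

The collected URL list is what a checker would then probe, while the numbered markup maps each result back to its place in the page.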

Limitations

 * BBC.com has blocked the tool in the past, and this domain is now disabled to avoid waiting for connection timeouts.
 * External links transcluded from templates are excluded on purpose, as the tool would not be able to modify them when saving.

Linking
It is preferable to link using the interwiki prefix. Change the link as such: checklinks. Linking to a specific page (swap  for  ): Edip Yuksel links.

Documentation TODO

 * Add information on how websites can opt out of scanning
 * Break things up so it can be read non-linearly (i.e. use pictures, bullets)
 * Explain why detection isn't 100%. Give examples of websites that return 404 for existing content, others that are dead until the disks on the server finish spinning up, those that return 200 on error pages, etc.
 * Users don't seem to understand that they can make edits WITH the tool, or search the Internet Archive Wayback Machine and WebCite (which can archive pages, too).