User:H3llBot/ADL

This task combats link rot by using the Internet Archive's Wayback Machine to provide archived copies of now-dead links in references and citations, or by marking them with the dead link tag if no suitable archived copy is available.

The bot currently only processes citation templates that have url and accessdate set. The recognized citations are: Citation, Cite news, Cite web, Cite journal, Cite book, Cite mailing list, Cite video, Vcite web, Vcite book, Vcite news, and Vcite journal. The bot will attempt to retrieve the archived copy from Wayback and add archiveurl and archivedate to the citation (the bot will respect whitespace formatting). The bot will also add a comment, so it is possible to track bot-added archives. Failing that, it will mark dead links with the dead link tag, or, if the citation was preemptively archived, flip its dead-link flag from no to yes.
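The insertion step can be pictured with a simplified sketch (illustrative only, not the bot's actual code; the comment text and function name are made up) that appends the archive parameters before the template's closing braces:

```python
import re

def add_archive(citation, archiveurl, archivedate):
    """Insert |archiveurl= and |archivedate= before the closing braces of a
    citation template, plus a comment so bot-added archives can be tracked.
    Hypothetical sketch of the step described above."""
    if "archiveurl" in citation:
        return citation   # already archived; leave the citation untouched
    addition = ("|archiveurl=%s |archivedate=%s <!--added by bot-->"
                % (archiveurl, archivedate))
    # Replace the final "}}" with the new parameters plus closing braces.
    return re.sub(r"\}\}\s*$", addition + "}}", citation)
```

The real bot also respects the citation's existing whitespace formatting, which this sketch glosses over.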

The retrieved Wayback archive's date is either (1) the closest archived copy before the citation's accessdate, up to 3 months back, or (2) the first archived copy after the access date, up to 1 month ahead (this used to be ±6 months). The date format is derived from the Use dmy dates or Use mdy dates templates, or failing that from the citation's accessdate or date field.
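The window selection described above can be sketched as a small pure function (illustrative only; the 3-month and 1-month windows are approximated as 92 and 31 days):

```python
from datetime import date

def pick_snapshot(snapshots, accessdate):
    """Pick an archive date from a list of snapshot dates:
    (1) the closest snapshot at or before accessdate, up to ~3 months back;
    (2) otherwise the first snapshot after accessdate, up to ~1 month ahead.
    Returns None if no snapshot falls inside either window."""
    before = [s for s in snapshots
              if s <= accessdate and (accessdate - s).days <= 92]
    if before:
        return max(before)   # closest copy before the access date
    after = [s for s in snapshots
             if s > accessdate and (s - accessdate).days <= 31]
    if after:
        return min(after)    # first copy after the access date
    return None
```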

Dead links are URLs whose HTTP status response is 404 or 301. Other error codes and failed connections are ignored. The 404 check is carried out twice within 3 days (it used to be 1 day) to make sure the link is really dead and not just down for maintenance. GET requests (as opposed to HEAD) are used and redirects are followed, as some servers redirect to both 404 and 200 pages.
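A minimal sketch of this check (illustrative; the real bot's HTTP handling is more involved): fetch with GET while following redirects, then only call a link dead when two checks, spaced days apart, both return a dead code:

```python
import urllib.request, urllib.error

DEAD_CODES = (404, 301)   # per the task description; other codes are ignored

def check_status(url):
    """GET (not HEAD) with redirects followed; return the final HTTP status
    code, or None on a connection failure. Hypothetical sketch."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code
    except (urllib.error.URLError, OSError):
        return None

def is_dead(first_status, second_status, dead_codes=DEAD_CODES):
    """A link counts as dead only if both checks (run days apart)
    returned a dead code."""
    return first_status in dead_codes and second_status in dead_codes
```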

FAQ
Q: You marked a link as Dead link, but there is a copy on Wayback!

A: Usually the available copies fall outside the date range the bot is comfortable using. Wayback itself is also not always reliable: the bot makes secondary attempts, with multiple retries and delays, when the Internet Archive returns connection errors, but even that sometimes fails.

Q: How many times will your bot keep coming back to the same page and making changes, can't you do them at once?

A: Wayback is not always reliable (in fact, it is quite unreliable most of the time, with frequent timeouts). Often retrieval fails at one time and succeeds for the same link at another. Even the implemented retries and delays do not always work. Hopefully, return visits will mean fixing more links.
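A generic retries-with-delays wrapper of the kind mentioned above might look like this (a sketch; the bot's actual retry counts and delays are not documented here):

```python
import time

def with_retries(fetch, url, attempts=3, base_delay=5.0):
    """Call fetch(url), retrying on connection errors with increasing
    delays between attempts; re-raise after the final failure."""
    for i in range(attempts):
        try:
            return fetch(url)
        except ConnectionError:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** i))   # e.g. 5 s, 10 s, 20 s, ...
```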

Q: The link isn't dead! You marked it as dead.

A: Some websites don't like bots and use various ways of detecting automated processes, the simplest being a check of the user-agent and referrer headers. The bot fakes these, but even then some sites may wrongly return a 404 Not Found page instead of the 403 Forbidden they should.
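For illustration, faking these headers in Python could look like the following (the header values are made up, not the bot's real ones):

```python
import urllib.request

def browser_request(url, referer=None):
    """Build a request with browser-like User-Agent and Referer headers,
    since some sites serve 404s to clients they don't recognize.
    Hypothetical sketch."""
    headers = {"User-Agent": "Mozilla/5.0 (compatible; ExampleBot)"}
    if referer:
        headers["Referer"] = referer
    return urllib.request.Request(url, headers=headers)
```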

Q: The link isn't dead! You added archive parameters to it anyway.

A: Sometimes websites are temporarily down and wrongly return 404 instead of 503. Even though the bot rechecks every dead link, both visits may fall within this maintenance window. Also, preserving archive copies for live links is not actually wrong, though it can be misleading if the citation is not flagged as still live.

Q: The linked Wayback page says there is no archive available! Why did the bot add bad URLs?

A: Make sure it is not a temporary problem; individual Wayback servers are often down. Otherwise, the page was available when the link was added. The Internet Archive respects robots.txt and requests for content removal, so any copyright holder can contact them and ask for pages to be taken down. This doesn't happen often and is very time-consuming to verify reliably.

Q: The bot only marked 1 or 2 links, but there are more dead, even from the same domain.

A: This is probably because the bot had already seen that link with that access date in another article, but has not yet checked all the links in this one. It should get back to this article eventually and mark the rest. This happens rarely.

Codes
This task covers several "sub-tasks", marked with a code (in the edit summary or page link redirect) as follows:
 * ADL – archive dead links: adds archiveurl to citation(s) with a successfully retrieved archived copy
 * MCD – mark citation dead: adds dead link to citation(s) for which no archived copy could be retrieved
 * MDY – mark citation expired: flags citation(s) that are now dead but have a preemptively archived copy
 * RDT – remove dead-link tag: removes dead link from citation(s) with a successfully retrieved/added archived copy

TODO

 * Check bare external links
 * Check manually written references
 * Parse revision history to find url insertion dates when accessdate is missing
 * Use WebCite as alternative to Wayback

BRFA
The bot request for approval is available at WP:BRFA/H3llBot 2. Addenda: H3llBot 2b, H3llBot 2c

Relevant links

 * WP:DEADLINK, WP:DEADREF
 * WikiProject_External_links/Webcitebot2 – a task force of WP:EL dedicated to link repair
 * Dead link related: 1 2 3 4 5
 * WebCiteBOT related: 1 2 3 4 5 6
 * Access dates by bots: VP 1, VP 2, no consensus


 * Other similar bot BRFAs:
 * Tim1357's DASHBot 11, a bot serving the same purpose (description)
 * Ocolon's Ocobot for finding dead links
 * Anomie's AnomieBOT 60 for replacing and archiving certain domains
 * ThaddeusB's DeadLinkBOT for replacing certain domains
 * Merlissimo's MerlLinkBot for replacing certain domains
 * ThaddeusB's WebCiteBOT for preemptively archiving new links
 * Bots/Requests for approval/BlevintronBot
 * Bots/Requests for approval/BOTijo 10