User:GreenC/WaybackMedic 2



WaybackMedic is a bot that fixes problems with links to the Internet Archive Wayback Machine (and to some WebCite and other archive links). Most edits will be fix #2 below.

The difference between WaybackMedic 1 & 2:


 * WM2 processes all articles containing Wayback links (about 380k). WM1 processed a subset of articles (about 140k).
 * WM2 will do fix #2 on all links. WM1 only did fix #2 if another fix was done on the same URL at the same time.

The bot operator is User:Green Cardamom. Please leave problem notices on my talk page.


 * WM fixes:


 * WM examines:


 * templates inside and outside ref pairs.
 * Citation templates inside and outside ref pairs.
 * Bare URLs outside templates. If these return a 404 or similar error, they are replaced with the regular (non-archive) URL.


 * WM design:


 * Multiple HTTP header status code checks at the application layer to verify Wayback URLs.
 * Timeouts and retries built in to the web transfer agent settings (wget).
 * Multiple checks of the Wayback API using multiple dates to ensure a page really is unavailable.
 * Re-checks the API results by looking at the header to ensure it really is a good page.
 * If IA returns a "404 Bummer. The machine that serves this file is down." page, treat it as a code 200 and leave the link alone.
 * If no Wayback available, checks Memento for alternative archives such as Library of Congress, WebCite and a few dozen others.
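The checks above can be sketched roughly as follows. This is only an illustrative outline, not the bot's actual code: the function names, the `fetch` callback, the retry count and the exact marker string are assumptions made for the example. The `fetch` callback stands in for whatever performs the HTTP request (the bot itself uses wget).

```python
from typing import Callable, Iterable, Optional, Tuple

# Assumed marker, taken from the "404 Bummer" message quoted in the list above.
BUMMER_MARKER = "The machine that serves this file is down"


def snapshot_is_ok(status: int, body: str = "") -> bool:
    """Return True if the response should be treated as a working snapshot."""
    if status == 200:
        return True
    # A 404 "Bummer" page means an IA server is down, not that the snapshot
    # is missing, so it is treated like a 200 and the link is left alone.
    if status == 404 and BUMMER_MARKER in body:
        return True
    return False


def find_working_snapshot(
    candidates: Iterable[str],
    fetch: Callable[[str], Tuple[int, str]],
    retries: int = 3,
) -> Optional[str]:
    """Try each candidate snapshot URL in order and return the first good one.

    `fetch` (hypothetical) returns (status_code, body) and raises OSError on
    a timeout or other transfer failure, in which case the URL is retried.
    """
    for url in candidates:
        for _ in range(retries):
            try:
                status, body = fetch(url)
            except OSError:
                continue  # transient failure: retry the same URL
            if snapshot_is_ok(status, body):
                return url
            break  # definite bad answer: move on to the next candidate
    return None  # nothing worked; a caller could now try other archives
```

Passing the fetcher in as a parameter keeps the decision logic separate from the network layer, so the same checks can be exercised with canned responses.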

Statistics

 * August 20, 2016

WaybackMedic 2 checked ~400k articles containing Wayback links on en.Wikipedia, the full corpus of every article containing Wayback links as of August 20, 2016. It found about 1.1 million Wayback links, of which about 15,000 were not working. WM2 fixed 3,786 of these by finding a new snapshot date and 680 by finding an alternative archive service; the remaining 11,125 were deleted from Wikipedia. Of those deletions, about 7,000 were due to robots.txt; the rest were due to status codes 301/302 (infinite loop), 303, 400, 401, 403 (non-robots.txt), 406, 409, 415, 500, 502, 504 and 521.

WaybackMedic 2 edited 203,902 articles in total. Most edits were fix #2 (439,170 changes), which mostly amounted to changing http to https and adding "/web/" to the URL. The run also uncovered a large number of "Log dead URL" and "Log emptyarch" cases (fix #3), as well as a very large number of "Log date mismatch" cases (fix #8).
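As an illustration of the kind of change fix #2 makes, here is a minimal sketch of the URL normalization. It is not the bot's actual code: the function name and the regular expression are assumptions for the example, and it handles only `web.archive.org` URLs with a numeric timestamp.

```python
import re

# Assumed URL shape: scheme, web.archive.org, optional "web/", a numeric
# timestamp (with optional modifier such as "id_"), then the original URL.
_WAYBACK = re.compile(
    r"^https?://web\.archive\.org/(?:web/)?(?P<rest>\d{4,14}[^/]*/.+)$",
    re.IGNORECASE,
)


def normalize_wayback_url(url: str) -> str:
    """Force https and the "/web/" path prefix on a Wayback snapshot URL.

    URLs that do not look like Wayback snapshots are returned unchanged.
    """
    match = _WAYBACK.match(url)
    if match is None:
        return url
    return "https://web.archive.org/web/" + match.group("rest")
```

For example, `normalize_wayback_url("http://web.archive.org/20100101000000/http://example.com/")` returns `https://web.archive.org/web/20100101000000/http://example.com/`, while a URL already in that form comes back as-is.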

Links

 * Bot Approval
 * Trial runs