User:GreenC/WaybackMedic



WaybackMedic (WM) is a bot that fixes problems with links to the Internet Archive. Mostly it removes or replaces Wayback links that don't work (#4 below).


 * WM fixes:


 * WM examines:


 * templates inside ref pairs.
 * Citation templates inside and outside ref pairs.
 * Bare Wayback URLs outside templates. If these return a 404 or similar error, WM replaces them with the regular URL. WM is currently unable to add in this case.
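
As an illustration of the bare-URL case, here is a minimal Python sketch of how a dead snapshot URL could be swapped for the regular URL. It assumes the common snapshot form `web.archive.org/web/<timestamp>/<original URL>`. The function names `split_wayback_url` and `replace_dead_snapshot` are hypothetical and are not taken from the WM source.

```python
import re

# Snapshot URLs of the common form
#   http(s)://web.archive.org/web/<timestamp>[modifier]/<original URL>
# where the optional modifier is two letters plus "_" (for example "id_").
WAYBACK_RE = re.compile(
    r"^https?://(?:web\.)?archive\.org/web/(\d{1,14})(?:[a-z]{2}_)?/(.+)$",
    re.IGNORECASE,
)
SCHEME_RE = re.compile(r"^[a-z][a-z0-9+.-]*://", re.IGNORECASE)


def split_wayback_url(url):
    """Return (timestamp, original_url), or None if url is not a snapshot URL."""
    match = WAYBACK_RE.match(url.strip())
    if match is None:
        return None
    timestamp, original = match.groups()
    # Snapshot paths sometimes omit the scheme of the original URL.
    if not SCHEME_RE.match(original):
        original = "http://" + original
    return timestamp, original


def replace_dead_snapshot(url, is_dead):
    """Swap a dead bare Wayback URL for the original URL; keep anything else."""
    parts = split_wayback_url(url)
    if parts is None or not is_dead:
        return url
    return parts[1]
```

The caller decides whether the snapshot is dead. The sketch only covers the string handling and makes no network requests.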


 * WM design:


 * Performs multiple HTTP checks at the application layer if Wayback reports an error, to account for brief outages or intermittent responses.
 * In addition, timeouts and retries are built into the settings of the web transfer agent (wget).
 * Queries the Wayback API multiple times with different dates to ensure a page really is unavailable.
 * Re-checks the API results by looking at the header to ensure the snapshot really is a good page.
 * If IA returns a 404 with the message "Bummer. The machine that serves this file is down.", treats it as a code 200 and leaves the link alone.
 * If no Wayback snapshot is available, checks Memento for alternative archives such as the Library of Congress, WebCite and a few dozen others.
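
The design points above can be sketched in Python as follows. This is an illustrative outline under stated assumptions, not WM's actual code. `fetch`, `find_wayback` and `find_memento` are hypothetical callables supplied by the caller, and the retry count and delay are made-up defaults. The sketch also assumes the public Wayback Availability endpoint (`https://archive.org/wayback/available`) and its JSON layout; whether WM queries that exact endpoint is not stated here.

```python
import json
import time
from urllib.parse import urlencode

# Text of the outage page that IA serves with a 404 status.
BUMMER = "The machine that serves this file is down"


def is_snapshot_alive(fetch, url, attempts=3, delay=1.0):
    """Check url up to `attempts` times; any good answer counts as alive.

    `fetch` is a hypothetical callable returning (status_code, body_text).
    """
    for attempt in range(attempts):
        status, body = fetch(url)
        if status == 200:
            return True
        if status == 404 and BUMMER in body:
            # Archive-side outage: treat like a 200 and leave the link alone.
            return True
        if attempt + 1 < attempts:
            time.sleep(delay)  # pause before the next application-layer check
    return False


def availability_queries(original_url, timestamps):
    """Build one Availability API query per candidate date (YYYYMMDD...)."""
    base = "https://archive.org/wayback/available?"
    return [base + urlencode({"url": original_url, "timestamp": ts})
            for ts in timestamps]


def closest_snapshot(api_json):
    """Return the snapshot URL from an API response, or None if unusable."""
    closest = json.loads(api_json).get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available") and closest.get("status") == "200":
        return closest["url"]
    return None


def resolve(url, fetch, find_wayback, find_memento, attempts=3, delay=1.0):
    """Decide what to do with one archive link, in the order listed above."""
    if is_snapshot_alive(fetch, url, attempts, delay):
        return ("keep", url)
    for action, finder in (("new_snapshot", find_wayback),
                           ("alt_archive", find_memento)):
        replacement = finder(url)
        if replacement:
            return (action, replacement)
    return ("remove", None)
```

Passing the network functions in as arguments keeps the decision logic separate from the transfer agent, so lower-level timeouts and retries (for example wget's own settings) can be configured independently.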

Statistics

 * August 2015 to June 6, 2016

WaybackMedic checked ~140k articles edited by Cyberbot II from August 2015 to June 6, 2016. It found ~374,596 Wayback links (including duplicates), of which 29,171 were dead, spread across 17,978 articles. It fixed 8,785 by finding a new snapshot date and 661 by finding an alternative archive service through Mementoweb.org. The remaining 19,602 were deleted from Wikipedia (blocked by robots.txt, a missing page, or a link that was never good to begin with). Other fixes and problems were logged and corrected.
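
To put these counts in proportion, the short Python sketch below computes the shares implied by the figures quoted above. The variable names and the `pct` helper are illustrative only. The three listed outcomes add up to 29,048, so 123 of the 29,171 dead links are not broken down in the text.

```python
# Figures as quoted in the paragraph above.
links_checked = 374596
dead = 29171
fixed_new_snapshot = 8785
fixed_alt_archive = 661
deleted = 19602


def pct(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100.0 * part / whole, 1)


dead_rate = pct(dead, links_checked)            # share of all links found dead
new_snapshot_share = pct(fixed_new_snapshot, dead)
alt_archive_share = pct(fixed_alt_archive, dead)
deleted_share = pct(deleted, dead)

# Dead links not covered by the three listed outcomes.
unaccounted = dead - (fixed_new_snapshot + fixed_alt_archive + deleted)

print(dead_rate, new_snapshot_share, alt_archive_share, deleted_share, unaccounted)
```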

Links

 * Bot Request for Approval
 * WaybackMedic on GitHub


 * DISCLAIMER

Please report problems.