User:GreenC/WaybackMedic 2.1

WaybackMedic 2.1 is a bot that adds and maintains links from the list of known web archive services in use on the English Wikipedia.

Edits made after 2017-01-07 are by version 2.1.

The bot operator is User:GreenC. The bot account is User:GreenC bot. The bot (software) is "WaybackMedic".


 * WM fixes:


 * Technical details:


 * Real-time operations, no link database.
 * Uses many APIs, including the Internet Archive, Memento and WebCite APIs, and "Timemap" APIs at individual service sites.
 * Multiple HTTP header status code checks at the application (WaybackMedic) layer.
 * Additional timeouts and retries built into the web transfer libraries.
 * Additional operating-procedure-level checks against network and other errors (semi-supervised).
 * Multiple redundant checks of the APIs using multiple dates to ensure a page really is unavailable.
 * Accepts API results but then verifies them by looking at page headers and/or contents.
 * If IA returns a 404 page reading "Bummer. The machine that serves this file is down.", treat it as a code 200 and leave the link alone.
 * If a link is policy-blocked by robots.txt, log it but leave it alone; in the future, the robots.txt block may be lifted by the site owner or IA.
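The decision rules in the list above can be sketched roughly as follows. This is a minimal illustration in Python, not WaybackMedic's actual code: the function names, the return labels, and the `fetch` callback are hypothetical, and the "Bummer" marker string is taken from the rule above on the assumption that it appears verbatim in the page body. No network access is performed; the caller supplies `fetch`.

```python
from typing import Callable, Optional, Sequence, Tuple

# Assumed marker text (from the rule above); the wording IA actually serves may differ.
BUMMER_MARKER = "The machine that serves this file is down"

# Hypothetical callback: (url, timestamp) -> (status, body), or None on a network error.
Fetch = Callable[[str, str], Optional[Tuple[int, str]]]


def classify_snapshot(status: int, body: str, robots_blocked: bool = False) -> str:
    """Return 'keep', 'log-robots' or 'dead' for one archive response."""
    if robots_blocked:
        return "log-robots"  # log it, but leave the link alone
    if status == 404 and BUMMER_MARKER in body:
        return "keep"  # temporary outage at the archive: treat like a 200
    if 200 <= status < 300:
        return "keep"
    return "dead"


def really_unavailable(url: str, timestamps: Sequence[str],
                       fetch: Fetch, retries: int = 3) -> bool:
    """True only if every probed date gives a definite 'dead' answer."""
    if not timestamps:
        return False  # nothing was checked, so nothing can be concluded
    for ts in timestamps:
        for _ in range(retries):
            result = fetch(url, ts)
            if result is None:
                continue  # network error: try again
            status, body = result
            if classify_snapshot(status, body) != "dead":
                return False  # one working snapshot is enough
            break  # definite 'dead' for this date; move to the next date
        else:
            return False  # all retries failed: inconclusive, leave the link alone
    return True
```

The sketch errs on the side of leaving links alone: a link is reported unavailable only when every date has been probed successfully and each probe came back dead.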

Statistics
The bot runs through a batch of articles about every 2-3 months, taking a break in between. Below are some stats from the first two runs.

Run 1: 2016-12-15 -> 2017-03-15
From December 15, 2016 to March 15, 2017, WaybackMedic processed 336,271 articles. This set represents articles edited by InternetArchiveBot from July 2016 -> February 2017 plus articles requiring merge of ->. WaybackMedic made 115,066 changes in 47,810 articles. All changes are logged and available on request, e.g. which articles had the Archive.is URL fix. Diffs of each article before and after the edit are also saved and searchable.

Bummer           : 507     (Wayback links that return "Bummer page not found")
Robots.txt       : 6477    (Wayback links blocked by robots.txt)
Bogusapi         : 13273   (Wayback API-returned links that don't match real status code)
API mismatch     : 17117   (Wayback API returned fewer records than sent.)
JSON mismatch    : 28972   (Wayback API returned different size JSON)
Discovered       : 47810   (Number of articles edited by WaybackMedic)
Log 404          : 9894    (Dead wayback links)
Log emptyarch    : 2001    (Empty archiveurl arguments)
Log emptyway     : 0       (Ref has an empty )
Log encode       : 0       (URL misencoded)
Log spurious 1   : 191     (Spurious "|1=" parameter)
Log trail        : 3       (URL has a trailing bad character)
Log dead URL     : 185     (|url= is dead even though dead-url=no, archiveurl is dead and no )
Log skindeep     : 8466    (changes to URL are skindeep)
Log doubleurl    : 416     (Double archive.org URL error)
Log datemismatch : 27433   (Date in archive URL doesn't match archivedate argument in cite template)
Log wrong https  : 895     (https and :80 conflict)
Log WAM          : 32198   (webarchive merge)
Log stray dead   : 2709    (stray  - straydt.awk)
Log WC|IS->IA    : 1022    (Convert WebCite|Archive.is to Wayback et al.)
Log short url    : 10552   (WebCite URL elongated - webcitlong.awk)
Log short url    : 522     (Archive.is URL elongated - archiveis.awk)
Log citeaddl     : 256     (webarchive merge - citeaddl.awk)
Log nowikiway    : 41      (Wayback mangled a certain way)
Log br bug       : 0       (br bug)
Log miss timest  : 3043    (Timestamp missing from IA URL)
Log embeded way  : 559     (embedded wayback template in cite template)
Log embeded wa   : 18      (embedded cite template in webarchive template)
Log switch URL   : 6051    (archive in url= field)
Log dead /items/ : 281     (/items/ URL dead replacement)
Log x2 webarch   : 2311    (double webarchive template)
Log pct encode   : 15      (pct encode magic characters in URLs)
New alt archive  : 1009    (Replaced with archive URL found at Mementoweb.org)
New IA link      : 509     (Added new IA link)
New IA date      : 1642    (Changed snapshot date)
Redirects        : 52      (Page was a redirect)
Zombie links     : 650     (Links needing removal by hand)
Wayback RM       : 2836    (Wayback link deleted)
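As an illustration of what the "Log datemismatch" and "Log miss timest" entries measure, the check might look something like the sketch below. It is written in Python for readability and is not the bot's own code; the function names and the regular expression are assumptions, and it only handles Wayback URLs of the form `web.archive.org/web/<timestamp>/<url>` with at least an 8-digit (YYYYMMDD) timestamp.

```python
import re
from datetime import date
from typing import Optional

# Assumed URL shape: https://web.archive.org/web/YYYYMMDDhhmmss/<original URL>
WAYBACK_TS = re.compile(r"web\.archive\.org/web/(\d{8})")


def snapshot_date(archive_url: str) -> Optional[date]:
    """Return the snapshot date embedded in a Wayback URL, or None if missing."""
    match = WAYBACK_TS.search(archive_url)
    if match is None:
        return None  # no usable timestamp in the URL
    ts = match.group(1)
    try:
        return date(int(ts[:4]), int(ts[4:6]), int(ts[6:8]))
    except ValueError:
        return None  # digits present but not a valid calendar date


def date_mismatch(archive_url: str, archivedate: date) -> bool:
    """True when the URL's snapshot date differs from the cite template's archivedate."""
    snap = snapshot_date(archive_url)
    return snap is not None and snap != archivedate
```

For example, `date_mismatch("https://web.archive.org/web/20170107123000/http://example.com/", date(2017, 1, 8))` flags a mismatch, while a URL with no timestamp is reported by `snapshot_date` as `None` rather than as a mismatch.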


 * Links found

Wayback All      : 1099355 (Wayback links total found)
WebCite All      : 39120   (WebCite links total found)
Archive.is All   : 1288    (Archive.is links total found)
Loc.gov All      : 410     (Loc.gov links total found)
Portugal All     : 180     (Portugal links total found)
Stanford All     : 30      (Stanford links total found)
Archive-it All   : 76      (Archive-it.org links total found)
Bibalex All      : 17      (Bibalex.org links total found)
NatArchiveUK All : 4668    (National Archives (UK) links total found)
Europa Archives  : 2       (Europa Archives (Ireland) links total found)
Perma.cc All     : 0       (Perma.CC links total found)
PRONI All        : 0       (PRONI links total found)
UK Parliament    : 1       (UK Parliament links total found)
UK Web Archive   : 125     (UK Web Archive (British Library) links total found)
Canada All       : 68      (Canada links total found)
Catalonian All   : 1       (Catalonian links total found)
Singapore Archiv : 10      (Singapore Archives links total found)
Slovenian Archiv : 1       (Slovenian Archives links total found)
Freezepage.com   : 1524    (Freezepage.com links total found)
Webharvest.gov   : 4       (US Nat. Archives links total found)
NLA AU ALL       : 2610    (AU Nat. Archives links total found)
archiveorg items : 419     (Archive.org /items/ total found)

Run 2: 2017-03-19 -> 2017-04-07
From March 19, 2017 to April 7, 2017, WaybackMedic processed 149,195 articles. These were all articles on English Wikipedia containing a template. WaybackMedic checked each tagged link and replaced it with a working archive if one was available. It also made other standard fixes. The number of links saved was 31,317.


 * Archive.org: 12,804
 * Archive.is: 16,541
 * Webcite: 20
 * Library of Congress: 413
 * National Archives UK: 405
 * NLA Australia: 6
 * arquivo.pt (Portugal): 284
 * Stanford University: 17
 * Archive-It.org: 584
 * BibAlex: 86
 * National Archives Iceland: 61
 * Europa Archives Ireland: 29
 * Proni Web Archives: 6
 * Parliament UK: 34
 * UK Web Archive (British Library): 55
 * Canada: 8

The Archive.is count is so high because most of the articles had already been checked for archive.org saves on previous runs of IABot. Archive.is has many pages unavailable anywhere else, and WaybackMedic is the only bot adding Archive.is links. Generally WaybackMedic uses Archive.is as a last resort.
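The "last resort" rule can be pictured with a small sketch. This is only an illustration in Python under stated assumptions: the function name, the service keys, and the idea that candidates arrive as a mapping of service name to working archive URL are all hypothetical, and the source does not say how the other services are ranked among themselves, so the sketch simply takes them in the order given.

```python
from typing import Dict, Optional

LAST_RESORT = "archive.is"  # the only ordering rule stated in the text above


def choose_archive(candidates: Dict[str, str]) -> Optional[str]:
    """Pick a working archive URL, using Archive.is only if nothing else is available.

    `candidates` maps a service name to a working archive URL (hypothetical shape).
    """
    preferred = [name for name in candidates if name != LAST_RESORT]
    if preferred:
        return candidates[preferred[0]]  # first non-Archive.is service, in given order
    return candidates.get(LAST_RESORT)  # None when there are no candidates at all
```

With both an archive.org and an Archive.is candidate, the archive.org URL is returned; with only an Archive.is candidate, that one is used.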

Links

 * WaybackMedic 2.5
 * WaybackMedic 2.0
 * WaybackMedic 1.0
 * Bot Approval
 * Trial runs