Wikipedia:Bots/Requests for approval/GreenC bot 2


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved

GreenC bot 2
Operator:

Time filed: 15:49, Friday, June 17, 2016 (UTC)

Automatic, Supervised, or Manual: Automatic

Programming language(s): Nim and AWK

Source code available: WaybackMedic on GitHub

Function overview: User:Green Cardamom/WaybackMedic 2

Links to relevant discussions (where appropriate): Bots/Requests for approval/GreenC bot - first revision approved and successfully completed.

Edit period(s): one time run

Estimated number of pages affected: ~380,000 pages have Wayback links as of July 20 2016

Exclusion compliant (Yes/No): Yes

Already has a bot flag (Yes/No): Yes

Function details: The bot is nearly the same as the first bot (User:Green Cardamom/WaybackMedic), with these differences:


 * 1) In fix #2, the bot now always applies the fix instead of only applying it when other changes are also made. For example, it will convert all web.archive.org http links to secure https even if that is the only change to the article. This modification amounts to commenting out the skindeep function, so it requires no new code.
 * 2) The first bot was limited in scope to articles previously edited by Cyberbot II. This one will look at all articles on the English Wikipedia containing Wayback Machine links, somewhere around 380k. The bot determines target articles by regex'ing a Wikipedia database dump prior to running.
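Fix #2 above can be illustrated with a short sketch. The bot itself is written in Nim and AWK; this Python fragment, with a hypothetical helper name and a simplified URL pattern, only mirrors the described http-to-https conversion for Wayback links:

```python
import re

# Hypothetical helper illustrating fix #2: upgrade plain-HTTP Wayback
# Machine links to HTTPS. The real bot's matching rules may differ.
WAYBACK_HTTP = re.compile(r"http://(web\.archive\.org/web/[^\s|\]}<]+)")

def secure_wayback_links(wikitext: str) -> str:
    """Rewrite every http:// Wayback link in the wikitext to https://."""
    return WAYBACK_HTTP.sub(r"https://\1", wikitext)

print(secure_wayback_links(
    "[http://web.archive.org/web/20100101000000/http://example.com Archived]"
))
```

Note the pattern anchors on the `web.archive.org` host, so the embedded original URL after the timestamp is carried along unchanged.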

Most of the edits will be URL formatting fix #2. Fix #4 will impact somewhere around 5% of the links (based on stats from the first run of WaybackMedic). The rest of the fixes should be minimal, affecting 1% of links or less.

Discussion

 * I assume the difference in #2 is just how you're pulling a list of articles, not any coding change to the bot. Is this using the exact same code as the last bot except for commenting out the skindeep bit? Did the issues at the previous trial (bugs relating to alternative archives) pose no problem in the full run? If yes to both, this seems like something that could be speedily approved to run in 25,000 article batches with a 48-72 hour hold between them. If I'm understanding correctly, the only change is the removal of a simple function, and there seems to be no room for new bugs to have been introduced. ~ RobTalk 16:37, 17 June 2016 (UTC)
 * Essentially yes. Before, if it found a URL needing fix #4 and fix #2 in the same link, it did both fixes on that link (eg. changed the snapshot date (#4) and added https (#2)). If, however, it found only fix #2 in a link, it ignored it as being too "skin deep", ie. just a URL format change. So now the bot will fix those skin-deep cases. There is no change to the code, essentially, other than it no longer ignores the "skin deep" cases (only fix #2), and it will run against all articles with Wayback links, not just a subset of them edited by Cyberbot II. The edits themselves will be the same as before, so the code is not changed. There were a couple of minor issues that came up during the run that were fixed in the code and in the Wikipedia articles. I won't run the bot until after July 1, when the next database dump becomes available, since that is where the article list will be pulled from. -- Green  C  17:21, 17 June 2016 (UTC)
 * Sorry, I phrased that ambiguously. By #2, I meant the second bullet point above, not fix #2. Nothing in the actual code of this task changed to widen the scope from articles edited by a previous bot to all articles, right? It's just in the manner in which you're pulling articles from the database dump? ~ RobTalk 19:34, 17 June 2016 (UTC)
 * Nothing in the code changed to widen the scope of task other than explained in bullet #1 above. -- Green  C  01:19, 18 June 2016 (UTC)

— xaosflux  Talk 03:25, 25 June 2016 (UTC)
 * Note Community feedback solicited on WP:VPR due to large run size. — xaosflux  Talk 01:34, 18 June 2016 (UTC)
 * Just to be clear, I'm not intending community input here to be a requirement to move forward - just ensuring that there is advertisement. — xaosflux  Talk 14:40, 18 June 2016 (UTC)

Trial 1
WM will process in batches of 100 articles each, but some articles may not need changes so the number of edits will vary within each batch.


 * Trial batch 5 (July 02) (51)
 * Trial batch 4 (July 01) (62)
 * Trial batch 3 (July 01) (42)
 * Trial batch 2 (July 01) (33)
 * Trial batch 1 (June 30) (48)


 * -- Green  C  15:08, 4 July 2016 (UTC)
 * In this edit why was the content removed? — xaosflux  Talk 02:52, 8 July 2016 (UTC)
 * It appears the original URL is working; it's possible that's why. ~ Rob 13 Talk 04:26, 8 July 2016 (UTC)
 * Am I missing something - that condition doesn't appear to be on this list. — xaosflux  Talk 04:53, 8 July 2016 (UTC)


 * This is fix #4 on that list. If an archive URL is not working, the bot tries to find a working snapshot date; if it can't find one, the archive link is removed, as happened here. In this case, since the original URL is still working, it didn't leave a dead link. However, there is a problem -- the archive URL is working. The bot keeps logs, so I checked the JSON returned by the Wayback API, which shows the URL was not available at Wayback. But the bot also does a header check to verify, since the API is sometimes wrong. The header check also returned unavailable (at the time it ran). I just re-ran a dry run and it came back as link available - the problem doesn't appear to be with the bot. If I had to guess, it's robots.txt, as that is the most common reason links come and go from Wayback. robots.txt files are controlled by the owners of the website. -- Green  C  13:14, 8 July 2016 (UTC)
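The first step of the two-step availability check described above might look like the following sketch. It assumes the JSON shape returned by the Wayback Availability API (archive.org/wayback/available); the follow-up header check needs network access, so it is only noted in a comment, and the function name is illustrative rather than the bot's own:

```python
import json

# Sketch of step one of the availability check: inspect the Wayback
# Availability API's JSON. The real bot follows up with an HTTP header
# request, since the API is sometimes wrong.
def closest_snapshot(api_json: str):
    """Return the closest available snapshot URL from the API response, or None."""
    data = json.loads(api_json)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None  # caller would then verify via headers before removing the link

sample = '{"archived_snapshots": {"closest": {"available": true, "url": "http://web.archive.org/web/20160101000000/http://example.com", "timestamp": "20160101000000", "status": "200"}}}'
print(closest_snapshot(sample))
```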


 * -- Green  C  20:23, 31 July 2016 (UTC)
 * Sorry this has sat open so long, regarding the last problem discussed above - do you plan on removing content from articles (in that example case a useful reference link) anytime you are not able to instantaneously validate it? This is information that could be useful to readers, and I'm quite wary about having reference information removed.  Do you have more of an explanation? —  xaosflux  Talk 03:01, 18 August 2016 (UTC)
 * - It's removing a non-working archive link. Any human editor would do the same. One might make the case that if it's non-working due to robots.txt on Internet Archive, it could start working again in the future -- however, until then (if ever) we have a broken archive link, which is exactly what the bot exists to fix. One could "preserve" the broken link in a comment or on the talk page, but what's the point? Anyone can check IA using the original URL; there's no information needing preservation. It's better to remove broken links where they exist and let bots like IABot (and other tools) re-add them if they become available, as normal. BTW I've already done this for tens of thousands of articles and didn't have any complaints or concerns, nor during the last Bot Request. -- Green  C  14:54, 18 August 2016 (UTC)

Trial 2

 * (5000 article targets) (Note: this is for targeting 5000 articles only, with between 0-5000 edits as appropriate for those 5000 targets.) This should be the final trial round and is 1% of the estimated targets. —  xaosflux  Talk 15:06, 18 August 2016 (UTC)

Sorry if you don't mind me asking, but what is the rationale for a second trial? The bot has already been tested extensively on 100,000 articles. The point of the request was to extend the number of articles to the whole site, plus some minor changes which are working. -- Green  C  15:18, 18 August 2016 (UTC)
 * As your final run is so large, just want one last check point. You do not need to personally evaluate it - if you can run 5000 targets, just give a count of how many updates were needed - if there are no complaints in a week I think you are good to go. —  xaosflux  Talk 15:41, 18 August 2016 (UTC)
 * Ok no problem. -- Green  C  17:40, 18 August 2016 (UTC)
 * Out of curiosity, what happens if the Wayback Machine goes down for some reason while the bot is running? Would the bot begin to remove every archive link as non-working, or is there some check to prevent this from happening? ~ Rob 13 Talk 18:04, 18 August 2016 (UTC)
 * It handles that a number of ways. I can describe the details if you're interested. --  Green  C  18:58, 18 August 2016 (UTC)
 * As long as it fails gracefully in this scenario, I don't need to hear details. Just don't want a server outage to result in the bot going wild. ~ Rob 13 Talk 03:01, 20 August 2016 (UTC)
 * It's a good question. The design philosophy is to sanity-check data and, on failure, skip and log. Errors end up in logs, not in Wikipedia. Critical failures at the network level (such as timeouts or the Wayback API not responding, which happens) get logged and the articles are reprocessed in a future batch. When processing the first 140k articles for WaybackMedic #1, it never went wild, even during API outages. -- Green  C  12:40, 20 August 2016 (UTC)
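The skip-and-log philosophy described above can be sketched roughly as follows. The function and logger names are hypothetical (the real bot is Nim/AWK); the point is only the shape: failures are logged and queued for a later batch rather than producing edits.

```python
import logging

# Illustrative "skip and log" batch loop: any per-article failure
# (timeout, bad API response, failed sanity check) is logged and the
# article is queued for reprocessing instead of being edited.
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("waybackmedic-sketch")

def process_batch(articles, process_one):
    """Run process_one over each article; return the titles to retry later."""
    retry_later = []
    for title in articles:
        try:
            process_one(title)
        except Exception as exc:       # network error or sanity-check failure
            log.warning("skipping %s: %s", title, exc)
            retry_later.append(title)  # reprocess in a future batch
    return retry_later
```

A site-wide Wayback outage would thus fill the retry queue rather than triggering mass link removals.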


 * .. targeting 5000 articles resulted in around 2200 article updates. Received a report about an incomplete date due to a recent change in cite template requirements. Noticed a problem with garbage-data URLs ("https://web.archive.org/web/20150315074902/https://web.archive.org/web/20150315074902/http:...") - the bot will now skip processing those; they need manual fixing (this one could be fixed by a bot, but others are too mangled). -- Green  C  02:25, 20 August 2016 (UTC)
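Detecting the doubled-prefix garbage URLs mentioned above could be done with a pattern like this sketch (illustrative only; the bot's actual check may be broader, since some mangled forms are messier than a clean doubled prefix):

```python
import re

# Hypothetical detector for mangled Wayback URLs where the
# web.archive.org/web/<timestamp>/ prefix appears twice in a row.
DOUBLED = re.compile(
    r"https?://web\.archive\.org/web/\d+/https?://web\.archive\.org/web/"
)

def is_garbage_wayback(url: str) -> bool:
    """True if the URL contains a doubled Wayback prefix and should be skipped."""
    return bool(DOUBLED.search(url))
```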

If I may suggest an additional feature for future runs: there may be articles in which  has a functioning WBM link but   is empty or missing. It would be nice if this bot could fix this issue by extracting the archive date information from the WBM URL. --bender235 (talk) 19:19, 22 August 2016 (UTC)


 * Ok, this is now part of Fix #3. I'm hesitant to do major feature additions this late in the BRFA, but this is simple to check and fix. It will also log. I'll keep an eye on it in the first batch; manual testing shows no problems. I don't think there will be too many of these, since the CS generates a red error and they will likely get fixed manually after a while. -- Green  C  21:40, 22 August 2016 (UTC)


 * Thanks. I just realized there would be another issue (at least theoretically; I didn't check if we have those cases): there might be articles in which the information in  contradicts the actual date from the WBM URL (imagine, for instance, that the editor put the date when he added the archive link to Wikipedia rather than the date when WBM archived the page; or, even simpler, typos). Cases like these could be fixed/corrected based on the WBM URL's archive-date information. --bender235 (talk) 00:43, 23 August 2016 (UTC)
 * Alright, it will now verify that archivedate matches the date in the Wayback URL and, if not, change archivedate. There is one exception: if the archivedate is in dmy format and the page doesn't have a or a , it will leave it alone. The reason is editors often forget to use the dmy template, and I don't want the bot to undo proper formatting (the bot defaults to mdy). Note: This was not a big change to the code; I've tested every combo I can think of on a test case, every change is being logged, and when the first batch runs I'll keep a tight eye on it. I don't think it will be a very common occurrence. --  Green  C  14:37, 23 August 2016 (UTC)
 * I just ran the bot on the 5500 articles of the trial using only the two new features added above (other features disabled). It found about 400 article changes. I checked them manually and see no problems. That was a good suggestion, bender235; there are a fair number of problems out there. They were all in Fix #8, none in Fix #3. -- Green  C  19:57, 24 August 2016 (UTC)
 * You're welcome. I hope this bot gets final approval soon. --bender235 (talk) 19:07, 25 August 2016 (UTC)
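The archivedate verification discussed in the thread above can be sketched as follows, assuming the standard 14-digit timestamp in Wayback URLs. The helper names and formatting rules are illustrative, not the bot's own code:

```python
import re
from datetime import date, datetime

# Sketch: pull the 14-digit snapshot timestamp out of a Wayback URL and
# render the date it implies, defaulting to mdy as the bot does.
TIMESTAMP = re.compile(r"web\.archive\.org/web/(\d{14})")

def date_from_wayback_url(url: str):
    """Return the snapshot date encoded in a Wayback URL, or None."""
    m = TIMESTAMP.search(url)
    if m is None:
        return None
    return datetime.strptime(m.group(1), "%Y%m%d%H%M%S").date()

def format_archivedate(d: date, dmy: bool = False) -> str:
    """Render a date as '15 March 2015' (dmy) or 'March 15, 2015' (mdy)."""
    month = d.strftime("%B")
    return f"{d.day} {month} {d.year}" if dmy else f"{month} {d.day}, {d.year}"
```

Comparing the rendered date against the existing archivedate value flags the mismatches bender235 anticipated.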


 * Task approved. — xaosflux  Talk 13:13, 4 September 2016 (UTC)


 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.