Wikipedia:Bots/Requests for approval/WildBot 8


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Request Expired.

WildBot 8
Operator:

Time filed: 23:47, Friday December 6, 2013 (UTC)

Automatic, Supervised, or Manual: Automatic

Programming language(s): Python, wikitools

Source code available: Sure, prompt me

Function overview: Replace broken urls to *.thecanadianencyclopedia.com with working ones to thecanadianencyclopedia.ca

Links to relevant discussions (where appropriate): Bot requests/Archive 57

Edit period(s): one run

Estimated number of pages affected: ~3468

Exclusion compliant (Yes/No): No, one-time run

Already has a bot flag (Yes/No): Yes

Function details: Swap broken URLs for tested good ones. I have assembled a mapping of certain URL updates for The Canadian Encyclopedia based on lookups in the Wayback Machine of all external URLs that match *.thecanadianencyclopedia.com, and used that to generate and test combinations of URLs against thecanadianencyclopedia.ca until I got a 200 success. Links to the home page of the site will be stripped. URLs where I couldn't get a successful hit will be left unchanged. Variations on the dead link templates are added to or removed from the article to reflect the status of external links; they're only removed for thecanadianencyclopedia. The work parameter of the various cite templates is altered to change hyperlinks into domains.
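A minimal sketch of the substitution step described above, assuming the translation list is a plain dictionary from old URL to tested replacement. The function name `swap_urls`, the regular expression, and the example URLs in the usage note are illustrative assumptions, not the bot's actual code.

```python
import re

# Matches external URLs on any subdomain of thecanadianencyclopedia.com.
# The character class stops at whitespace and at wikitext delimiters
# (closing bracket, pipe, closing brace, angle brackets, double quote).
OLD_URL = re.compile(
    r'https?://(?:[\w-]+\.)*thecanadianencyclopedia\.com/[^\s\]|}<>"]*'
)


def swap_urls(wikitext, url_map):
    """Replace each old URL that has a tested replacement in url_map.

    URLs without an entry in url_map are left unchanged, mirroring the
    rule that unresolved links are not touched.
    """
    def replace(match):
        old = match.group(0)
        return url_map.get(old, old)

    return OLD_URL.sub(replace, wikitext)
```

For example, with a hypothetical one-entry map from `http://www.thecanadianencyclopedia.com/index.cfm?PgNm=TCE&Params=EXAMPLE1` to `http://www.thecanadianencyclopedia.ca/example-new-path`, only that exact URL is rewritten; every other link on the page passes through untouched.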

Discussion
— HELL KNOWZ  ▎TALK 22:30, 11 December 2013 (UTC)
 * 10 trial edits in. I made a stupid error: my HTML comment was not an actual comment. I have fixed all the edits.
 * More interesting errors:
 * "corrected" to a 404. The substitution did what it was told. The translation list was mis-populated because of a parsing error on http://web.archive.org/web/20110929060526/http://www.thecanadianencyclopedia.com/index.cfm?PgNm=TCE&Params=U1ARTU0000203 where the "writing" section was determined to be the article title.  An inspection of the translation list shows this has not occurred on any other occasion, nor has "author" nor "bibliography" been the target of any translations.  It was just good luck catching this one.  Fixed on a couple of levels - tighter regex matching, plus those headings have been added to the blacklist.
 * shows a replacement within a ref tag where the link is followed by a dead link tag. Ought I be removing these dead link tags? Josh Parris 02:44, 12 December 2013 (UTC)
 * Do I get this right -- you are comparing the actual page content on wayback archived version to find matches? Or title?
 * Yes, you should remove dead links after the citation or reference tag if you fix them. — HELL KNOWZ  ▎TALK 13:16, 13 December 2013 (UTC)
 * The technique I'm using for translating from the old URL to the new one is:
 * Check for a 302 redirect sometime in 2012. The redirect will be to a URL similar to what's used now, with quite a few variations - a trailing slash may or may not be required, the order of words may have changed, parts of the path might have been moved around.
 * Failing that, the Wayback Machine's copy will have an article title, which might be transformable in various ways into the corresponding URL in the new website
 * All I'm doing is checking for a 200 status code to confirm a match - do you think I ought to be doing something less naive?
 * I'll get onto removing deadlink tags; it might be easy, or perhaps not. Josh Parris 11:48, 14 December 2013 (UTC)
 * I think that's good enough -- I don't think there would be obvious false positives, especially if you use their own 302s. — HELL KNOWZ  ▎TALK 12:03, 14 December 2013 (UTC)
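The translation technique discussed in this thread could be sketched roughly as follows. This is an illustration under stated assumptions, not the bot's code: the `/en/article/<slug>` path pattern, the function names, and the injected `fetch_status` callable (which returns an HTTP status code for a URL) are all hypothetical. Injecting the status lookup keeps the sketch testable without network access.

```python
import re
import unicodedata

NEW_BASE = "http://www.thecanadianencyclopedia.ca"  # new domain named in the request


def _words(text):
    """Lowercase ASCII words of a title, with accents folded away."""
    folded = unicodedata.normalize("NFKD", text)
    ascii_text = folded.encode("ascii", "ignore").decode("ascii")
    return re.findall(r"[a-z0-9]+", ascii_text.lower())


def candidate_urls(title):
    """Guess new-site URLs from an archived article title.

    Tries the title's word order as given, the "Given Surname" order for
    "Surname, Given" titles, and each with and without a trailing slash.
    The /en/article/ path is an assumed pattern for illustration only.
    """
    orderings = [_words(title)]
    if "," in title:
        last, _, first = title.partition(",")
        orderings.append(_words(first) + _words(last))
    urls = []
    for words in orderings:
        if not words:
            continue
        slug = "-".join(words)
        for suffix in ("", "/"):
            url = f"{NEW_BASE}/en/article/{slug}{suffix}"
            if url not in urls:
                urls.append(url)
    return urls


def resolve(redirect_target, archived_title, fetch_status):
    """Return the first candidate that answers 200, or None.

    redirect_target: Location of an archived 302, if one was found.
    archived_title:  article title scraped from the archived copy, if any.
    """
    candidates = []
    if redirect_target:
        candidates.append(redirect_target)
        # The trailing slash may or may not be required, so try both forms.
        if redirect_target.endswith("/"):
            candidates.append(redirect_target.rstrip("/"))
        else:
            candidates.append(redirect_target + "/")
    if archived_title:
        candidates.extend(candidate_urls(archived_title))
    for url in candidates:
        if fetch_status(url) == 200:
            return url
    return None
```

Archived redirects are tried before title-derived guesses, following the observation above that the site's own 302s are the least likely source of false positives.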

— HELL KNOWZ  ▎TALK 12:03, 14 December 2013 (UTC)
 * Wow, that expanded the source code dramatically. I selected ten articles that had dead link and canadianencyclopedia urls. Performed 10 trial edits, highlights include:
 * Shows the dead link template being removed for the repaired link
 * Shows how subst doesn't work for bots (fixed)
 * demonstrates removing a URL from a cite template's work parameter
 * So, it seems all went well. Josh Parris 06:55, 17 December 2013 (UTC)


 * Okay, these look good, but that's quite a range of functionality. — HELL KNOWZ  ▎TALK 13:47, 18 December 2013 (UTC)

More trial since the addition of code and just a larger sample. — HELL KNOWZ  ▎TALK 13:47, 18 December 2013 (UTC)
 * with results here. Points of note:
 * has the bot swapping out a dead url for text, which would be fine except this is a url= field. I've removed this functionality from the bot and will leave it to humans to clean up these urls. But shows I removed it wrong; I should have detected those URLs and done nothing, rather than treating them as any other URL. Fixed.
 * has the bot making supplemental fixes but not the main fix of swapping dead urls. This was due to a logic bug in the code to detect null edits - fixed.
 * I stand ready for another trial. Josh Parris 00:59, 19 December 2013 (UTC)

without removing external links from work. — HELL KNOWZ  ▎TALK 10:28, 19 December 2013 (UTC)
 * Functionality altered to reflect this. Josh Parris 10:40, 19 December 2013 (UTC)
 * after 50 edits. Every edit seems fine.
 * I did get a scare from, but looking at http://web.archive.org/web/20120315000000*/http://www.thecanadianencyclopedia.com/index.cfm?PgNm=TCE&Params=U1ARTU0002865 I'm reassured that the bot isn't at fault. Josh Parris 11:27, 19 December 2013 (UTC)


 * What's up with the first dead link here? Or here, although here it is dead. This happens in many pages, are you now checking unrelated links for 404s (I must have misread this from the ever-changing function details)? — HELL KNOWZ  ▎TALK 17:17, 21 December 2013 (UTC)


 * Sorry for the delay in responding; my Internet is temperamental right now.
 * The problem with that first link is that it was to an HTML anchor, and anchors weren't being stripped (now fixed). No other links were incorrectly marked as dead.
 * As an aside, do you have any insight as to why cURL and my browser agree that a 404 is returned for http://www.lethbridge.ca/NR/rdonlyres/D4CEB98B-9F18-4786-870D-84A06E1533FC/310/LethbridgeProfile2003.pdf yet Python's httplib thinks it's a 200?
 * Yes, I figured checking all URLs was a value-add during swapping the Canadian URLs, given I had to check the deadurls weren't dead - you know, "free functionality". Josh Parris 04:42, 26 December 2013 (UTC)
 * It says that because "checking 404 is not an easy task". None of our 404-checking and archiving bots (including mine) are running, simply due to all the continuous issues that they have and all the ingenious ways web developers break them. That's why I didn't think you were also checking all dead links. That will extend this BRFA and trials a lot and I really recommend this be a separate task (or I'll go mad). Not to mention, you cannot check a link once; you need to come back in a week or so and check it again, or there will be tons of false positives on temporarily 404ed sites. In your case, there was possibly a redirect (one of several ways to do it) or a different page version served, as there might be an agent or referrer check or cookie requirement or any of the many subtle HTTP protocol options. It could just be some broken headers or inconsistencies between curl and your typical browsers. Not setting up a cookie container has led to many sites failing on me. — HELL KNOWZ  ▎TALK 20:02, 26 December 2013 (UTC)
 * Gotcha. Pulled 404 checking for anything other than the target URLs. Josh Parris 03:45, 27 December 2013 (UTC)
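Two of the points raised in this exchange can be illustrated with a short sketch: anchors should be stripped before a link is checked, and a single 404 should not be enough to call a link dead. The function names and the seven-day default are assumptions chosen to match the "come back in a week or so" advice; this is not code from any of the bots mentioned.

```python
from datetime import datetime, timedelta
from urllib.parse import urldefrag


def strip_fragment(url):
    """Drop the #anchor, which is never sent to the server."""
    return urldefrag(url).url


def is_confirmed_dead(checks, min_gap=timedelta(days=7)):
    """Decide whether a link may be tagged as dead.

    checks is a list of (datetime, status_code) observations in any order.
    The link counts as dead only if every observation was a 404 and the
    first and last observations are at least min_gap apart.
    """
    if len(checks) < 2:
        return False
    ordered = sorted(checks)
    if any(status != 404 for _, status in ordered):
        return False
    return ordered[-1][0] - ordered[0][0] >= min_gap
```

A rule like this only addresses temporary outages. It does nothing about the agent, referrer, and cookie problems described above, which need a more browser-like HTTP client.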

— HELL KNOWZ  ▎TALK 10:48, 27 December 2013 (UTC)
 * Sorry for the delay; flaky Internet. After 20 edits, every edit seems fine. Josh Parris 09:17, 28 December 2013 (UTC)
 * I've got solid Internet under my feet now, so BAGAssistanceNeeded Josh Parris 20:39, 5 January 2014 (UTC)

Note that I haven't (yet) gone through previous trials link by link. — HELL KNOWZ  ▎TALK 21:59, 5 January 2014 (UTC)
 * -- doesn't look like the right one
 * -- same
 * -- same (macleans again)
 * -- broken link


 * If only there was an exasperated sigh template I could invoke here.
 * The Ben Johnson (and Kurdish protest) edits show the "check for a 200 status" rule isn't adequate. I'll work up something more robust in the face of this.
 * The Eva Rose York edit is actually fine.
 * The The Queensway – Humber Bay edit is particularly galling, as running the list generator against the page today pulls up the 404 and can't resolve it, but going to the URL in the article redirects to a valid article. The site operator has not only 404'd their old URL, they've made the older one work by redirecting it to their new one. I'm going to have to throw away my old translation list and regenerate it.
 * I'll ping back once I've made the necessary code changes. Expect a two week delay. Josh Parris 21:40, 6 January 2014 (UTC)
 * That fix was easier than I thought.
 * It seems something similar to the macleans thing happened with French articles, so I already had code to simply strip it out.
 * I've coded up a fix to the 404.
 * I'm going to review all the edits the bot made since the start of time and confirm they correlate to what the bot would now do, and repair anything that's wrong. Josh Parris 14:49, 8 January 2014 (UTC)
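The thread does not record what the more robust check looked like. One possible approach, sketched here purely as an assumption, is to accept a candidate URL only when it returns 200 and its page title also contains every word of the title expected from the archived copy. All function names are hypothetical.

```python
import re
from html import unescape


def extract_title(html):
    """Return the text of the first <title> element, or None."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html,
                      re.IGNORECASE | re.DOTALL)
    return unescape(match.group(1)).strip() if match else None


def _word_set(text):
    """Lowercase alphanumeric words of text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def looks_like_same_article(status, html, expected_title):
    """Accept a candidate only if it is a 200 AND its title matches.

    A match means every word of expected_title appears in the page's
    <title>, regardless of order or punctuation.
    """
    if status != 200:
        return False
    page_title = extract_title(html)
    if not page_title:
        return False
    expected = _word_set(expected_title)
    return bool(expected) and expected <= _word_set(page_title)
```

Comparing word sets rather than exact strings tolerates reordered names and site-name suffixes in the title, at the cost of occasionally accepting a different page whose title happens to contain the same words.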

Anything new about this task? 46.107.88.236 (talk) 16:45, 24 January 2014 (UTC)
 * Any progress? (t) Josve05a  (c) 13:12, 2 April 2014 (UTC)

-- slakr \ talk / 07:03, 12 April 2014 (UTC)
 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.