Wikipedia:Bots/Requests for approval/Gaelan Bot


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was

Gaelan Bot
Operator:

Time filed: 11:25, Wednesday, February 20, 2019 (UTC)

Automatic, Supervised, or Manual: automatic

Programming language(s): Python

Source code available: here (currently implements heuristics to fix URLs; code to actually edit the fixed URL into the article is pending this BRFA)

Function overview: Fix Category:Pages with URL errors in obvious cases

Links to relevant discussions (where appropriate):

Edit period(s): One big run, then regularly to fix new cases

Estimated number of pages affected: ~500

Exclusion compliant (Yes/No): Yes

Already has a bot flag (Yes/No): No

Function details: This bot attempts to fix common errors that cause articles to be placed into Category:Pages with URL errors. Specifically, it tries to:


 * Prepend "https://" or "http://" (preferring the former)
 * Remove a space (only if there is exactly one space and the portions before and after it both contain slashes; the goal is to catch spaces accidentally inserted into the URL without making a mess of cases where the url field contains both a URL and some English text, which the bot can't currently fix)

In either case, it checks that the resulting URL is accessible and returns a 200 status code before making changes. Testing against the first 200 articles in the category, the bot was able to fix 9% of the articles.
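A minimal sketch of these two heuristics, with the accessibility check factored out so it can be swapped for testing (illustrative only; the bot's actual source is linked above):

```python
import urllib.request

def url_ok(url):
    """Return True if the URL loads and answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except Exception:
        return False

def fix_url(url, check=url_ok):
    """Try the two heuristics above; return the fixed URL or None."""
    # Heuristic 1: prepend a scheme, preferring https://.
    if not url.startswith(("http://", "https://")):
        for scheme in ("https://", "http://"):
            candidate = scheme + url
            if check(candidate):
                return candidate
        return None
    # Heuristic 2: remove a single stray space, but only when the text
    # on both sides of it contains a slash (so a URL followed by an
    # English caption is left alone).
    if url.count(" ") == 1:
        before, after = url.split(" ")
        if "/" in before and "/" in after:
            candidate = before + after
            if check(candidate):
                return candidate
    return None
```

The function names and structure here are assumptions for illustration; the point is that every candidate is verified against the live site before any edit is made.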

Discussion

 * I've found a few additional heuristics to add. I'll try and update the BRFA tonight (but the current heuristics are mostly unchanged, so feel free to comment on those). Gaelan 💬✏️ 21:33, 20 February 2019 (UTC)
 * Headbomb {t · c · p · b} 01:24, 21 February 2019 (UTC)
 * I've temporarily added the confirmed group. —&thinsp;JJMC89&thinsp; (T·C) 04:55, 21 February 2019 (UTC)
 * 20-ish captchas and two 20-minute waits to avoid edit filters later, there was one problematic edit: . The resulting URL redirects to their homepage, which the bot should have detected (and not made an edit). However, the HTTP library I used was configured to automatically follow redirects, so the redirect went undetected. This has been fixed. By the way, the new heuristics I mentioned in a comment above:


 * Translate Unicode domain names into IDN form (e.g. http://клубзнамя.рф/members/andrey-arkhipov.html -> http://xn--80abskfjf9b9g.xn--p1ai/members/andrey-arkhipov.html). They're technically equivalent, but the templates don't support the former.
 * Add |plainurl=yes to Google books
 * Replace ":/" and "//" with "://"
 * Fix double schemes ("https://https://")
 * Remove
 * None of these are as common as the first three, but all but one (":/" and "//") did show up in the trial run. Gaelan 💬✏️ 05:48, 21 February 2019 (UTC)
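The separator and double-scheme fixes from this list are simple string rewrites; a sketch (assumed function name, and the result would still be verified against the live site before editing):

```python
import re

def fix_scheme(url):
    """Repair mangled scheme separators and doubled schemes."""
    # Fix double schemes, e.g. "https://https://example.com".
    url = re.sub(r"^(https?://)(?:https?://)+", r"\1", url)
    # Repair a broken separator: "http:/example.com" or
    # "http//example.com" -> "http://example.com". The lookahead
    # leaves already-correct URLs untouched.
    url = re.sub(r"^(https?)(:/|//)(?!/)", r"\1://", url)
    return url
```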
 * 1) This, this, and this are all pretty WTF to me. Headbomb {t · c · p · b} 01:34, 23 February 2019 (UTC)


 * 2) I can't see this type of edit being approved on a large scale per WP:CONTEXTBOT. There's no way to tell if the space should be removed, or replaced by


 * Headbomb {t · c · p · b} 01:34, 23 February 2019 (UTC)
 * 1) That's the bot converting domain names into IDN format. энциклопедия-урала.рф and xn8sbanercnjfnpns8bzb7hyb.xn--p1ai are the same domain—note that if you type the latter into a browser address bar, it turns into the former. The citation templates expect domain names to be ASCII, so the bot makes the conversion. Maybe it'd be better to fix the templates instead.
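For what it's worth, the conversion itself is mechanical; a sketch using Python's standard-library "idna" codec, which Punycode-encodes each non-ASCII label of the hostname (illustrative only, and it ignores rare cases such as userinfo in the netloc):

```python
from urllib.parse import urlsplit, urlunsplit

def to_idna(url):
    """Re-encode a URL's hostname in IDN (ASCII/Punycode) form,
    leaving the path, query, and fragment untouched."""
    parts = urlsplit(url)
    host = parts.hostname.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path,
                       parts.query, parts.fragment))
```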
 * 2) In all cases (except the Google Books fix, because the template is assumed to generate a valid URL and expanding the template would be technically challenging), the bot checks any resulting URLs to make sure they work (i.e. they load and return an HTTP 200). Therefore, it should be pretty unlikely for the bot to make an incorrect fix. (In this case, the website in question actually completely ignores that part of the URL, presumably only caring about the ID that appears later. But if a dash were required, the website would most likely return a 404 and the bot wouldn't have made the edit.) Gaelan 💬✏️ 02:25, 23 February 2019 (UTC)
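Given the redirect problem mentioned earlier in the trial, a detail worth spelling out: with Python's urllib it takes a custom handler to stop redirects from being followed transparently, so a 200 from the final page of a redirect chain is not mistaken for success. A sketch (names assumed, not the bot's actual code):

```python
import urllib.request
import urllib.error

class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Refuse to follow redirects: returning None makes any 3xx
    response surface as an HTTPError instead of being followed."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

_opener = urllib.request.build_opener(NoRedirectHandler)

def loads_without_redirect(url):
    """True only if the URL itself answers HTTP 200; a redirect
    (e.g. to the site's homepage) counts as a failure."""
    try:
        with _opener.open(url, timeout=10) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False  # includes any 3xx intercepted above
    except OSError:
        return False  # connection errors, timeouts, etc.
```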


 * Alright, well if the final URL is checked, I don't really have objections to that part, without the Russian ones. Not saying that won't be approved in the future, but not without a CS1 discussion first. Headbomb {t · c · p · b} 02:39, 23 February 2019 (UTC)


 * The bot crashed once because it hit the spam filter (an article cited a blacklisted site, but because the link was broken, it previously wasn't counted). The bot now handles that situation (by doing nothing). I'm reviewing the edits now. Gaelan 💬✏️ 03:54, 23 February 2019 (UTC)
 * OK. These edits all deleted spaces in questionable-to-straight-up-wrong circumstances. They're all in the query string (presumably because sites tend to silently ignore invalid query parameters instead of erroring), so I'll disable deleting spaces that fall after the start of the query string. There are also a few edits where the bot correctly fixes the URL but adds a space at the end for no reason; I'll look into that. Everything else seemed fine to me. Gaelan 💬✏️ 04:23, 23 February 2019 (UTC)
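A sketch of the tightened space-removal rule, assuming the plan above of refusing to touch anything past the start of the query string (function name is an assumption for illustration):

```python
def remove_space(url):
    """Delete a single stray space, unless it sits in the query
    string, where a 200 response proves nothing because many sites
    silently ignore invalid query parameters."""
    if url.count(" ") != 1:
        return None
    space = url.index(" ")
    qmark = url.find("?")
    if qmark != -1 and space > qmark:
        return None  # space is inside the query string; leave it alone
    before, after = url.split(" ")
    if "/" in before and "/" in after:
        return before + after
    return None
```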
 * was also problematic. Headbomb {t · c · p · b} 05:46, 23 February 2019 (UTC)
 * I think that one was fine: the markup was messed up both before and after the bot edit, but the bot's output seems to be less of a mess than what it was before. That being said, given the relatively small number of articles this will hit (~500, judging by the current success rate), maybe I should just do this semi-automated. Gaelan 💬✏️ 06:29, 23 February 2019 (UTC)
 * here's a ping. Gaelan 💬✏️ 06:29, 23 February 2019 (UTC)


 * If the number of articles is low, I would suggest you seriously consider a semi-automatic run instead. Experience shows that one should not underestimate the sheer perversity of website behaviours. You mention silent redirects above. It's also common to return a 404 page with a 200 HTTP status, or to return different content without going by way of an actual 302 redirect, or to return random content for any URL (especially common on usurped sites). URLs starting with "//" are technically valid (it's a protocol-relative URL; MediaWiki uses these internally, and they're common for transitions from http to https). Trying to code for every single perverse edge case will drive you batty: much less effort overall if you can inject a human brain for sanity checking. I also agree that the IDN issue at the very least needs further consideration. I am somewhat surprised at the assertion that the CS1 templates require ASCII here (they're usually very good about i18n), so some effort should definitely be put towards double-checking what's going on there and determining what the most appropriate fix is. I'll also add that for certain problems, helpful bots can create really pathological problems. I'm fuzzy on the details, but I had one case where AWB changed something like an endash inside a URL into a hyphen. This broke the link and prevented finding an archived version, because web archives are invariably indexed by URL. By the time I came by trying to fix the dead link, I had to manually trawl through years' worth of page history to figure out what had happened, and even detecting this was mostly luck. Anyone less obsessive would most likely have just removed the cite as a permanent dead link. The changes you are proposing are disposed to creating such problems. It's not an easy problem you're trying to tackle, is what I'm saying. --Xover (talk) 10:26, 23 February 2019 (UTC)
 * in favor of a semi-automated tool. Gaelan 💬✏️ 00:25, 25 February 2019 (UTC)
 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.