Wikipedia:Bots/Requests for approval/KolbertBot


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was

KolbertBot
Operator:

Time filed: 21:01, Thursday, August 10, 2017 (UTC)

Automatic, Supervised, or Manual: Supervised

Programming language(s): Python

Source code available: Pywikibot

Function overview: replace http links with https, if available.

Links to relevant discussions (where appropriate): Why we should convert external links to HTTPS wherever possible

Edit period(s): One-time

Estimated number of pages affected: ~35,000

Namespace(s): mainspace and template

Exclusion compliant (Yes/No): Yes

Function details: Modify the following links
 * University of British Columbia -  to
 * WebCite -  to
 * The Register -  to
 * University of Ottawa -  to
 * University of Saskatchewan -  to
 * University of Regina -  to
 * Simon Fraser University -  to
 * University of Victoria -  to
 * Vancouver Island University -  to
 * Capilano University -  to
 * Emily Carr University of Art and Design -  to
 * University of the Fraser Valley -  to
 * Trinity Western University -  to
 * British Columbia Institute of Technology -  to
 * University of British Columbia Okanagan -  to
 * Thompson Rivers University -  to
 * University of Winnipeg -  to
 * Université de Saint-Boniface -  to
 * Brandon University -  to
 * University of New Brunswick -  to
 * Université de Moncton -  to
 * Memorial University of Newfoundland -  to
 * Dalhousie University -  to
 * Saint Mary's University (Halifax) -  to
 * Concordia University -  to
 * McGill University -   to
 * McGill University -   to
 * Bishop's University -   to
 * Université du Québec à Montréal -  to
 * University of Alberta -  to
 * University of Calgary -  to
 * University of Lethbridge -  to
 * First Nations University of Canada -  to
 * Université de Sherbrooke -  to
 * Université Laval -  to
 * Carleton University -  to
 * Carleton University -  to
 * Carleton University -  to
 * Université du Québec à Trois-Rivières -  to
 * McMaster University -  to
 * University of Waterloo -  to
 * University of Waterloo -  to
 * University of Fredericton -  to
 * Government of Alberta -  to
 * Government of Manitoba -  to
 * Government of Ontario -  to
 * Government of Nova Scotia -  to
 * Government of Nova Scotia -  to
 * Government of Newfoundland and Labrador -  to
 * Government of Canada -  to
 * Wired UK -  to
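
As a rough illustration of the function described above, the per-link rewrite could be sketched in Python as follows. This is not the bot's actual code: the function name `upgrade_to_https` and the `HTTPS_HOSTS` set are hypothetical, and the three hosts are simply taken from links that appear in the discussion, on the assumption that each serves the same content over HTTPS.

```python
from urllib.parse import urlsplit

# Hypothetical whitelist (an assumption for this sketch): hosts believed to
# serve the same content over HTTPS as over HTTP.
HTTPS_HOSTS = {"www.ubc.ca", "www.theregister.co.uk", "www.webcitation.org"}


def upgrade_to_https(url: str) -> str:
    """Return url with its scheme switched to https if its host is whitelisted."""
    parts = urlsplit(url)
    if parts.scheme == "http" and parts.hostname in HTTPS_HOSTS:
        # Only the scheme changes; host, path, query and fragment are kept.
        return parts._replace(scheme="https").geturl()
    return url
```

Links to hosts outside the set are returned unchanged, which matches the "if available" condition in the function overview.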

Discussion
Should this be approved, what would be the preferred edit summary format? Jon Kolbert (talk) 21:01, 10 August 2017 (UTC)

The plan to change WebCite links to https://www.webcitation.com/ appears to be incorrect. WebCite is at www.webcitation.org, not www.webcitation.com. —RP88 (talk) 23:55, 10 August 2017 (UTC)
 * Thank you for spotting that transcription error; it has been amended. Jon Kolbert (talk) 00:27, 11 August 2017 (UTC)

Sometimes URLs are embedded in archive URLs. For example (non-working):
 * https://www.webarchive.org.uk/wayback/archive/20100803155857/http://www.theregister.co.uk/

In this case one wouldn't change to https://www.theregister.co.uk/ because it may break the webarchive.org.uk URL, which may interpret it as a different URL and be unable to find the archive. This isn't a real-world example, but there are ones like it. Plus, changing to https wouldn't do anything anyway since it's part of the path. -- Green  C  00:52, 11 August 2017 (UTC)
 * I do believe that supervising the edits while they take place is the right step to prevent such erroneous replacements, but in any case, should a mistake happen, both of the following links work:
 * https://www.webarchive.org.uk/wayback/archive/20130202200017/http://www.babraham.ac.uk/
 * https://www.webarchive.org.uk/wayback/archive/20130202200017/https://www.babraham.ac.uk/
 * I have also tested this with webcitation.org and web.archive.org:
 * https://www.webcitation.org/5J1lvgxQH?url=http://fire.prohosting.com/hud607/uncommon/reference/usa/army.html
 * https://www.webcitation.org/5J1lvgxQH?url=https://fire.prohosting.com/hud607/uncommon/reference/usa/army.html
 * https://web.archive.org/web/20170727004838/http://www.ubc.ca/
 * https://web.archive.org/web/20170727004838/https://www.ubc.ca/
 * In each of the three cases, both links are fully functional. Jon Kolbert (talk) 01:18, 11 August 2017 (UTC)
 * I wonder if this, therefore, confirms all cases? I would be happier to see confirmation from the archive.org website that this is the case. Apart from that, this is a very good bot task - and one that needs doing. I believe that with the link provided we have consensus - and we have done similar tasks in the past (The Guardian springs to mind as one that was recently done, although I can't remember the name of the bot). TheMagikCow (T) (C) 10:40, 11 August 2017 (UTC)
 * Fiddling around with URLs in WebCitation links is kind of sort of pushing the boundaries of WP:COSMETICBOT, likewise with other archive URLs. As the operator of IABot, I would really advise against having your bot handle these, and would leave them to IABot. As a BAG member I see no issues with this task, and redundancy never hurts for a supported task.— CYBERPOWER  ( Chat ) 11:01, 11 August 2017 (UTC)

Jon Kolbert, it should be possible to avoid the archive URLs (of which there are many) by regex'ing URLs from the source including http + the first character preceding, and if that character is "/" skip. -- Green  C  13:57, 11 August 2017 (UTC)
 * ✅ Perfect, would it also be permissible to use AWB as well? Cheers. Jon Kolbert (talk) 18:47, 11 August 2017 (UTC)
 * What would you use AWB for?— CYBERPOWER  ( Chat ) 14:36, 12 August 2017 (UTC)
 * I was originally using Pywikibot with a list generated from Special:LinkSearch, but I find that it'll be easier to manage and sort replacement lists in AWB instead, especially if using Regex. Jon Kolbert (talk) 15:06, 12 August 2017 (UTC)
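
The skip rule suggested above (ignore any http that is immediately preceded by "/") might look something like the following Python sketch. It is a variant of the suggestion that uses a negative lookbehind instead of capturing the preceding character, and it covers a single host only. The names `PATTERN` and `upgrade_links` are hypothetical; the regex actually used by the bot is not shown in this discussion.

```python
import re

# Hypothetical pattern for one host (an assumption for this sketch).
# (?<!/) is a negative lookbehind: the match is rejected when the character
# immediately before "http://" is "/", as it is when the link sits inside
# the path of an archive URL.
PATTERN = re.compile(r"(?<!/)http://(www\.theregister\.co\.uk)")


def upgrade_links(wikitext: str) -> str:
    """Switch matching links to https, leaving path-embedded URLs untouched."""
    return PATTERN.sub(r"https://\1", wikitext)
```

With this pattern, `[http://www.theregister.co.uk/ The Register]` becomes an https link, while the embedded URL in https://www.webarchive.org.uk/wayback/archive/20100803155857/http://www.theregister.co.uk/ is left alone. The check only covers URLs embedded in a path: a URL passed in a query string, as in the `?url=http://...` WebCite form, is preceded by "=" and would not be skipped by this rule.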


 * — CYBERPOWER  ( Chat ) 15:13, 12 August 2017 (UTC)
 * I've requested AWB access for the bot here. Jon Kolbert (talk) 15:46, 12 August 2017 (UTC)
 * ✅, though obviously the bot doesn't have a flag yet. Primefac (talk) 15:50, 12 August 2017 (UTC)
 * I've also requested the confirmed permission here as modifying links triggers a CAPTCHA Jon Kolbert (talk) 16:08, 12 August 2017 (UTC)
 * - I've noticed that while the bot only pulls pages based on active links, it can still trip up where archive links are used. I've done some additional testing using The Globe and Mail, a link that has been archived many times, and it also replaces the link inside the archive URL. While not broken, it's not ideal. Jon Kolbert (talk) 17:47, 12 August 2017 (UTC)
 * Please explain this edit.— CYBERPOWER  (Around ) 20:10, 12 August 2017 (UTC)
 * I was unable to find a way in AWB to both change non-archive URLs and skip archive URLs. I have since decided that using Pywikibot should be the best option here as skipping changes with  has worked without a hitch, as shown here. Would it be possible to run a trial using Pywikibot? Jon Kolbert (talk) 09:59, 15 August 2017 (UTC)
 * Hi Jon, could the regex be run on this testcases page and see what happens? User:GreenC/testcases/kolbert -- Green  C  14:06, 15 August 2017 (UTC)


 * . This trial is for testing Pywikibot. Please run on Green's test cases first.— CYBERPOWER  ( Chat ) 13:02, 16 August 2017 (UTC)
 * - Trial complete, I ran it on Green's test cases without issue, as well as the ~46 remaining pages with uottawa.ca links. Cheers! Jon Kolbert (talk) 05:26, 17 August 2017 (UTC)
 * — CYBERPOWER  ( Chat ) 18:22, 17 August 2017 (UTC)
 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.