Wikipedia:Bots/Requests for approval/ScannerBot


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at Bots/Noticeboard. The result of the discussion was

ScannerBot
Operator:

Time filed: 01:48, Thursday, May 5, 2022 (UTC)

Function overview: Removes tracker tags in Twitter links.

Automatic, Supervised, or Manual: Automatic

Programming language(s): Python

Source code available: gist

Links to relevant discussions (where appropriate):

Edit period(s): One time run

Estimated number of pages affected: <3000 per this query

Namespace(s): Mainspace

Exclusion compliant (Yes/No): Yes

Function details: Finds twitter.com URLs and removes the query parameters named s, t, or cxt.
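As a rough illustration of the function details above, the parameter removal could be sketched with Python's standard urllib.parse module. This is not the code from the linked gist; the helper name strip_tracking and the host check are assumptions made for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameter names taken from the function details above.
TRACKING_PARAMS = {"s", "t", "cxt"}


def strip_tracking(url: str) -> str:
    """Return url without the tracking parameters (hypothetical helper)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Only touch URLs whose host is twitter.com or a subdomain of it.
    if host != "twitter.com" and not host.endswith(".twitter.com"):
        return url
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(key, value) for key, value in pairs if key not in TRACKING_PARAMS]
    if len(kept) == len(pairs):
        # Nothing to remove: return the input untouched to avoid re-encoding.
        return url
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Returning the input unchanged when no tracker is present keeps the sketch from rewriting URLs it has no reason to edit.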

Discussion
if a bot account is needed, I will probably use. 0xDEADBEEF (T C) 01:51, 5 May 2022 (UTC)
 * This bot appears to have edited since this BRFA was filed. Bots may not edit outside their own or their operator's userspace unless approved or approved for trial. AnomieBOT ⚡ 10:53, 5 May 2022 (UTC) - AnomieBOT (talk • contribs) has made few or no other edits outside this topic.
 * This bot has edited its own BRFA page. Bot policy states that the bot account is only for edits on approved tasks or trials approved by BAG; the operator must log into their normal account to make any non-bot edits. AnomieBOT ⚡ 11:40, 5 May 2022 (UTC)
 * 0xDEADBEEF (T C) 11:43, 5 May 2022 (UTC)
 * I'm not entirely sure how much I want to be commenting with my BAG hat on, but based on previous tasks that were approved I am not convinced that as a bot task this is fully formed yet. Based on the supposed list of URLs where this tracking is located, the scanner isn't working right either, because there are a few false positives that I know exist out there that are not on the list. If 0xDeadbeef wants to use JWB on their main account they are welcome to and do not require BAG approval. On that note, though, I have moved this BRFA to the bot's page to make it officially a BRFA. Primefac (talk) 14:41, 7 May 2022 (UTC)
 * And, on a minor note, this has prompted me to run Task 17 again... Primefac (talk) 14:49, 7 May 2022 (UTC)
 * I didn't have a method for determining that they are actually parameters of a URL. I tested with a Python script that just matched on keywords within the source. I didn't know that there were previous tasks. I will take a look at those and perhaps amend the regex to match more parameters. 0xDEADBEEF (T C) 02:30, 8 May 2022 (UTC)
 * 0xDEADBEEF (T C) 02:40, 8 May 2022 (UTC)
 * "Based on the supposed list of URLs where this tracking is located, the scanner isn't working right either": For the record, I didn't know that CirrusSearch allowed regex searching, so I used pywikibot. Now I will probably use  to generate a list of articles to fix with JWB. 0xDEADBEEF  (T C) 04:06, 8 May 2022 (UTC)


 * Note: The functionality and the scope of the bot were made more specific. See page history for more details. 0x Deadbeef  (T C) 06:28, 14 May 2022 (UTC)
 * Regex? Primefac (talk) 15:13, 14 May 2022 (UTC)
 * You can look at the gist I linked.  is used to match the URL, and then urllib is used to parse it and remove the parameters. 0x Deadbeef   (T C) 15:19, 14 May 2022 (UTC)
 * You'll likely want  for regex, to escape the   characters. (Same for below). Headbomb {t · c · p · b} 01:13, 17 May 2022 (UTC)
 * I embedded the regex as a Python raw string which does not need to escape forward slashes. 0x Deadbeef  (T C) 01:17, 17 May 2022 (UTC)
 * But dots still need escaping? Headbomb {t · c · p · b} 01:56, 17 May 2022 (UTC)
 * Yes because . and \. have different meanings in regex. 0x Deadbeef  (T C) 02:30, 17 May 2022 (UTC)
 * I know. Just surprised one needs escaping and the other doesn't. Not important, if the code works, it works. Headbomb {t · c · p · b} 10:24, 17 May 2022 (UTC)
 * @Headbomb, for what it's worth, I believe it's because some non-Python regex is enclosed in / . . . /, so  needs to be escaped, but in Python a regex is just given as a string ' . . . ' Qwerfjkl  talk  14:22, 29 May 2022 (UTC)
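The escaping point in the exchange above can be checked with a small sketch. The pattern is a simplified stand-in chosen for the example, not the bot's actual regex: in a Python raw string the forward slash is an ordinary character, while the dot still has to be escaped to match a literal ".".

```python
import re

# Hypothetical, simplified patterns for illustration only.
# "/" needs no escaping in a Python raw string; "." does, to be literal.
ESCAPED = re.compile(r"https?://twitter\.com/\w+")
UNESCAPED = re.compile(r"https?://twitter.com/\w+")


def matches(pattern: re.Pattern, text: str) -> bool:
    """Return True if pattern is found anywhere in text."""
    return pattern.search(text) is not None


# Both patterns accept the real host name.
real_host_ok = matches(ESCAPED, "https://twitter.com/user")
# Only the unescaped dot accepts an arbitrary character in its place.
lookalike_escaped = matches(ESCAPED, "https://twitterXcom/user")
lookalike_unescaped = matches(UNESCAPED, "https://twitterXcom/user")
```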
 * You'll want to detect primary URLs, or skip archive URLs; changing those will break them. Archive URLs can be 20+ types, so it's probably easiest to detect if the twitter URL starts with "/" (example in Brandon Clarke). --  Green  C  16:15, 14 May 2022 (UTC)
 * Yeah, I should probably match [^/] or  for it to be primary. 0x Deadbeef   (T C) 02:07, 15 May 2022 (UTC)
 * Great, thanks. Also WebCite like  .. couple others use   vs. "/" as the break point.  --  Green  C  03:12, 15 May 2022 (UTC)
 * Hmm, then it would be hard to distinguish a template parameter from a URL parameter in a URL... 0x Deadbeef  (T C) 04:03, 15 May 2022 (UTC)
 * Right, I can't say what the regex would be. One method is to match every string "/https?://twitter" and convert it to "__hidestring__" (same with "?url=") - and when done, convert those hidden strings back before saving the article. The "__hidestring__" might be "__hidestring-fs-http__" or "__hidestring-fs-https__" so you know how to revert back. Or really best, save the literal string in a table and use the table identifier as the hidden string so it can be restored. That way it can match on "/https?://(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\\-]*[a-zA-Z0-9])[.])*twitter" which will capture all hostname(s) such as "/http://beta.twitter" --  Green  C  17:33, 15 May 2022 (UTC)
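A minimal sketch of the table-based variant suggested above, assuming the embedded archive URLs are always introduced by "/" or "?url=". The function names, the placeholder format, and the simplified host pattern are all made up for this example.

```python
import re

# Twitter URLs embedded in an archive URL, i.e. preceded by "/" or "?url=".
# The host pattern is a simplification of the one quoted in the comment above.
EMBEDDED = re.compile(
    r"(?:/|\?url=)https?://(?:[a-zA-Z0-9-]+\.)*twitter\.com[^\s|\]}<]*"
)
PLACEHOLDER = re.compile(r"__hidestring-(\d+)__")


def hide_embedded(text):
    """Swap each embedded URL for a placeholder; return (new_text, table)."""
    table = []

    def _hide(match):
        table.append(match.group(0))
        return f"__hidestring-{len(table) - 1}__"

    return EMBEDDED.sub(_hide, text), table


def restore_embedded(text, table):
    """Put the saved literal strings back before the page is saved."""
    return PLACEHOLDER.sub(lambda m: table[int(m.group(1))], text)
```

Between the two calls, a tracker-removal step would only see primary URLs, because the archived copies are hidden behind placeholders.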
 * Okay I used a negative lookbehind and you can look at the tests here: https://regexr.com/6lmgl 0x Deadbeef  (T C) 23:18, 15 May 2022 (UTC)
 * 0x Deadbeef  (T C) 04:25, 16 May 2022 (UTC)
 * Nice. Very rarely there are also protocol-relative URLs (WP:PRURL), e.g. . They are so uncommon, and can be tricky, that it would probably be OK to skip or log them if they don't fit the regex. --  Green  C  05:21, 16 May 2022 (UTC)
 * A quick search seems to show that it is fine. I've fixed all three that appeared from that search. 0x Deadbeef  (T C) 06:52, 16 May 2022 (UTC)
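The negative-lookbehind idea from this thread might look roughly like the sketch below. It is not the pattern from the regexr link; the regex and the function name are assumptions. Lookbehinds in Python's re module must be fixed-width, so the "/" and "?url=" cases are written as two separate assertions, which also keeps a template parameter such as |url= from being mistaken for an archive prefix.

```python
import re

# Primary Twitter URLs: not preceded by "/" or "?url=" (assumed archive prefixes).
PRIMARY = re.compile(
    r"(?<!/)(?<!\?url=)https?://(?:[\w-]+\.)*twitter\.com/[^\s|\]}<]*"
)


def find_primary_urls(wikitext):
    """Return Twitter URLs that are not embedded inside an archive URL."""
    return [match.group(0) for match in PRIMARY.finditer(wikitext)]
```

Protocol-relative links are not matched by this sketch, in line with the suggestion to skip or log them.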
 * Number of pages affected has been lowered following a quick search with insource:. 0x Deadbeef  (T C) 04:23, 21 May 2022 (UTC)
 * BAG assistance needed Requesting BAG assistance due to stale BRFA. 0x Deadbeef  (T C) 05:08, 27 May 2022 (UTC)
 * To be clear: This BRFA has been inactive for some time. Primefac told me that they wanted input from other BAG members first. I would like to know if this is declined or approved for trial. Thanks. 0x Deadbeef 07:43, 28 May 2022 (UTC)
 * Looks fine to me for trial. All issues raised above appear addressed anyway. -- Green  C  19:05, 29 May 2022 (UTC)
 * Let's give it a try. - The Earwig (talk) 21:18, 30 May 2022 (UTC)
 * 0x Deadbeef 04:57, 31 May 2022 (UTC)
 * Deadbeef, I checked one edit and noticed the Wayback link actually works with the tracker removed. Who knew. After all that above :) Wayback magic. But I can't say this holds true for every link; it's the kind of thing you would have to verify with a header check on the Wayback link with tracking removed. It would be like an added feature to the bot, only if you wanted to try. - Green  C  06:18, 31 May 2022 (UTC)
 * So I tried querying the Wayback Machine API to fix archive.org URLs: Looking at the preview of the bot's edits, it looks fine. Perhaps it needs an extended trial? 0x Deadbeef  08:01, 31 May 2022 (UTC)
 * (@The Earwig) 0x Deadbeef 11:52, 4 June 2022 (UTC)
 * That's great; since it checks that there is a copy in the API, it should be good to go. - Green  C  15:35, 4 June 2022 (UTC)
 * BAG assistance needed 0x Deadbeef 05:28, 12 June 2022 (UTC)
 * Thanks for your patience. Edits look good. I am fine with the expanded functionality for Wayback links and don't see a need for an extra trial provided you monitor these changes. - The Earwig (talk) 02:35, 13 June 2022 (UTC)
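The discussion above does not say which Wayback Machine endpoint the bot queries. As one possible sketch, the public availability API at archive.org/wayback/available could be used to confirm that a copy exists for the cleaned URL before an archive link is rewritten. The function names here are hypothetical, and the choice of endpoint is an assumption.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the Wayback Machine availability API.
API = "https://archive.org/wayback/available"


def availability_query(url, timestamp=None):
    """Build the API request URL for a (tracker-free) target URL."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp  # e.g. "20220531" to prefer a nearby capture
    return f"{API}?{urlencode(params)}"


def closest_snapshot(payload):
    """Return the snapshot URL from a decoded API response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available") and closest.get("status") == "200":
        return closest.get("url")
    return None


def lookup(url, timestamp=None):
    """Query the API over the network; returns a snapshot URL or None."""
    with urlopen(availability_query(url, timestamp), timeout=30) as response:
        return closest_snapshot(json.load(response))
```

If lookup returns None, the safer choice in this sketch would be to leave the existing archive URL as it is.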
 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at Bots/Noticeboard.