Wikipedia:Bots/Requests for approval/Yapperbot 3


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at Bots/Noticeboard. The result of the discussion was

Yapperbot 3
Operator:

Time filed: 14:13, Saturday, June 20, 2020 (UTC)

Automatic, Supervised, or Manual: automatic

Programming language(s): Golang

Source code available: https://github.com/mashedkeyboard/yapperbot-scantag

Function overview: Scans every article on Wikipedia, and checks for configured patterns. When it finds a pattern, it tags the article with an appropriate maintenance tag.

Links to relevant discussions (where appropriate): Bot requests (now archived here)

Edit period(s): Continuous

Estimated number of pages affected: Hard to say; as it's configurable, potentially the entire corpus of Wikipedia articles

Exclusion compliant (Yes/No): Yes

Already has a bot flag (Yes/No): Yes

Function details: GreenC was looking for a bot that would populate a tracking category for templates that are unclosed and contained within a   tag. I started out making this, but then realised that, rather than having people make a mishmash of bots that each scanned all articles with a different regex, it would be a better idea to have one bot parsing the pages and doing the scanning for all the regexes people wanted to match over pages. So, that's what I've made.

Scantag, the name I've given to this task, is a dynamic system for creating rules for pattern matching articles that need maintenance using regular expressions, and then maintenance tagging them according to the associated matches. At present, it is only configured for the specific request that GreenC has made; however, its configuration (as is becoming a recurring theme with my bots!) is entirely on-wiki, so it can be reconfigured on-the-fly. You can see its configuration file here. This is in a user JSON file, so it is only editable by administrators and myself through the bot account; I think this should be a sufficient threshold to prevent abuse, but was considering getting the content model changed to JS to make it interface protected, instead, due to the potential for danger inherent in the task. Thoughts on this would be appreciated.

Whilst the edit filter regex matches changes, it is designed only to be used for preventing serious issues that actively harm the wiki, and there's a limit to the number of rules that it can have - after all, a user is waiting. Scantag, on the other hand, is a deliberately slow process - it runs with a short maxlag, a high number of retries for maxlag, and after every edit it waits a full ten seconds before continuing. This brings with it the advantage that, while it may be a slow process, it can be used for a lot more than the edit filter would ever be. Because it's looking over every single article, it can also be useful for finding and tagging articles that would be impossible to find through a standard Elasticsearch regex search, because the search would simply time out. Case in point, the maintenance tagging that we're talking about here - but potentially, the same system could be useful for a number of other applications that involve matching patterns in articles.
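As a rough illustration of the "high number of retries for maxlag" behaviour described above, the sketch below retries a request whenever it fails with a maxlag error and gives up after a fixed number of retries. It is not taken from the bot's source: `retryOnMaxlag`, `errMaxlag`, and the retry counts are hypothetical names and values chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errMaxlag is a hypothetical stand-in for the API reporting that
// replication lag is above the requested maxlag value.
var errMaxlag = errors.New("maxlag exceeded")

// retryOnMaxlag runs call, retrying up to maxRetries more times while the
// error is errMaxlag, and sleeping for wait between attempts.
func retryOnMaxlag(maxRetries int, wait time.Duration, call func() error) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		err = call()
		if !errors.Is(err, errMaxlag) {
			return err // success, or an error that retrying will not fix
		}
		time.Sleep(wait)
	}
	return fmt.Errorf("giving up after %d retries: %w", maxRetries, err)
}

func main() {
	calls := 0
	err := retryOnMaxlag(5, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errMaxlag // simulate a lagged server on the first two calls
		}
		return nil
	})
	fmt.Println("calls:", calls, "error:", err)
}
```

In a real run the wait between attempts would be seconds rather than milliseconds, and the ten-second pause after each edit would sit outside this helper.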

The task works as follows:


 * 1) The bot examines the User:Yapperbot/Scantag.json file, reads the rules, and compiles the regexes.
 * 2) The bot then iterates through the latest database dump's list of page titles in NS0.
 * 3) For every title in NS0, the bot retrieves the wikitext.
 * 4) The bot matches each regex (currently only the one) specified in Scantag.json against the wikitext of the article.

If there is no match, the bot skips to the next article. If the bot matches the regex, however, it performs the following steps:


 * 1) Check the "noTagIf" regex specified corresponding to the main regex. This is a rule designed to check for where the article has already been tagged with the correct maintenance tag.
 * 2) Prefix the article with the corresponding "prefix" property in the JSON file, if there is one.
 * 3) Suffix the article with the corresponding "suffix" property in the JSON file, if there is one.
 * 4) Edit the page, with an edit summary linking to the task page, and listing the "detected" parameter as the reason.
 * 5) Wait ten seconds before moving on. This is a safety mechanism to prevent a situation where a badly-written regex causes the bot to go completely haywire, editing every single article it comes across.
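The matching and tagging steps above can be sketched roughly as follows. This is a minimal illustration, not the bot's actual code: the `Rule` struct layout, the example pattern, and the template name "Example maintenance tag" are all assumptions made for the example, and the real bot does a full edit to respect maintenance template ordering rather than simple concatenation.

```go
package main

import (
	"fmt"
	"regexp"
)

// Rule mirrors the properties named in the task description. The struct
// layout itself is an assumption, not the bot's real type.
type Rule struct {
	Regex    *regexp.Regexp // main pattern to look for
	NoTagIf  *regexp.Regexp // nil stands for "noTagIf": false
	Prefix   string         // text to put before the article, if any
	Suffix   string         // text to put after the article, if any
	Detected string         // reason listed in the edit summary
}

// Apply returns the tagged wikitext and true when the rule matches and the
// page does not already carry the tag; otherwise it returns the input and false.
func (r Rule) Apply(wikitext string) (string, bool) {
	if !r.Regex.MatchString(wikitext) {
		return wikitext, false
	}
	if r.NoTagIf != nil && r.NoTagIf.MatchString(wikitext) {
		return wikitext, false
	}
	return r.Prefix + wikitext + r.Suffix, true
}

// exampleRule builds a purely illustrative rule; the pattern and the
// template name are made up for this sketch.
func exampleRule() Rule {
	return Rule{
		Regex:    regexp.MustCompile(`(?i)\{\{cite [^{}]*</ref>`),
		NoTagIf:  regexp.MustCompile(`(?i)\{\{example maintenance tag\}\}`),
		Prefix:   "{{Example maintenance tag}}\n",
		Detected: "unclosed citation template",
	}
}

func main() {
	rule := exampleRule()
	page := "Text.<ref>{{cite web |title=Foo</ref>"
	if tagged, ok := rule.Apply(page); ok {
		fmt.Printf("would save (detected: %s):\n%s\n", rule.Detected, tagged)
	}
}
```

Running the rule a second time on the tagged text returns false, because the noTagIf pattern then matches; that check is what stops the bot from tagging the same article twice.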

In common with other Yapperbot tasks, the bot respects the kill page at User:Yapperbot/kill/Scantag, so in the event of an emergency, it could be turned off that way.

Because writing the regexes involved requires not only a good knowledge of regex, but for those regexes to be JSON escaped as well to stay in the JSON string correctly, and because of the potential for issues to come up as a result, there is also a sandbox for the rules. I or any other administrator configuring a Scantag rule would be able to set one up to test in here. Rules in the sandbox generate a report at User:Yapperbot/Scantag.sandbox, explaining exactly what the bot has understood from the JSON it's been given, and rendering an error if there are any obvious problems (e.g. failure to compile one of the regexes, noTagIf being set to anything other than a regex or false, etc.). Each rule can also have a "testpage" parameter, specifying a test page with the prefix "User:Yapperbot/Scantag.sandbox/tests/", which is designed as a place to set up tests to make sure that the corresponding regex is matching when it's supposed to, and not matching when it's not. An example of one of these is.
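To make the sandbox checks concrete, here is a hedged sketch of the kind of validation described: decode a JSON-escaped rule, confirm the main regex compiles, and confirm that noTagIf is either a compilable regex or false. The JSON key "regex" and the function name `validateRule` are assumptions for the example; only "noTagIf" is named in the description above.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"regexp"
)

// rawRule holds the two fields this sketch checks. The "regex" key name is
// an assumption; "noTagIf" comes from the task description.
type rawRule struct {
	Regex   string          `json:"regex"`
	NoTagIf json.RawMessage `json:"noTagIf"` // a regex string, or false
}

// validateRule reports the first obvious problem with a rule, or nil.
// A missing noTagIf is treated as an error in this sketch.
func validateRule(data string) error {
	var raw rawRule
	if err := json.Unmarshal([]byte(data), &raw); err != nil {
		return fmt.Errorf("invalid JSON: %w", err)
	}
	if raw.Regex == "" {
		return errors.New("regex is missing or empty")
	}
	if _, err := regexp.Compile(raw.Regex); err != nil {
		return fmt.Errorf("regex does not compile: %w", err)
	}
	var pattern string
	if err := json.Unmarshal(raw.NoTagIf, &pattern); err == nil {
		if _, err := regexp.Compile(pattern); err != nil {
			return fmt.Errorf("noTagIf does not compile: %w", err)
		}
		return nil
	}
	var flag bool
	if err := json.Unmarshal(raw.NoTagIf, &flag); err == nil && !flag {
		return nil
	}
	return errors.New("noTagIf must be a regex string or false")
}

func main() {
	// Backslashes are doubled because the regex lives inside a JSON string.
	rules := []string{
		`{"regex": "\\{\\{cite [^{}]*</ref>", "noTagIf": false}`,
		`{"regex": "(", "noTagIf": false}`,
		`{"regex": "a", "noTagIf": true}`,
	}
	for _, rule := range rules {
		fmt.Println(validateRule(rule))
	}
}
```

The first rule shows the JSON escaping issue mentioned above: the regex `\{\{cite` has to be written as `\\{\\{cite` inside the JSON string.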

I appreciate that this is a fair bit more complicated than the previous bot tasks, so I'm absolutely available to answer any questions! There are specific instructions for admins on how to deal with Scantag rule requests on the task page. I think there is also an open question here as to whether each rule would require a separate BRFA. Fundamentally, what's going on here isn't all too different from a "retroactive edit filter", of sorts, so I should think either the default restriction for JSON files to only admins editing, or changing the content model to JS so only interface admins can edit, should be sufficient to protect from misuse; however, I'd definitely like to hear BAG members' thoughts on this.

Discussion

 * This bot is currently proposing to check for the "CS1|2 non-closed }} issue". How do you propose that "new changes" be proposed/vetted/implemented? Primefac (talk) 18:31, 30 June 2020 (UTC)
 * I'd envisage the process to be similar to that which is used for edit filters, and indeed have modelled the system around many of the same assumptions, but I'm absolutely open to any better suggestions! To give an overview of what I'm thinking of, though:
   * Proposing new changes would happen through User talk:Yapperbot/Scantag, where I've set up a requests system very similar to that of WP:EFR. In much the same way, I'd expect that anything that is posted there has a clearly-demonstrated need, and in cases where it is envisaged to affect a large number of pages, a clear consensus so to do. Any editor would be welcome to propose and discuss rules there, just like EFR, and as discussed below, I or any sysop would then be able to implement them.
   * Vetting changes would take place in two stages: community review of the rule requests from any interested users (much like a BRFA or an EFR) if applicable, as well as (hopefully!) other experienced technical editors and myself, and then implementation review - i.e. checking that the regexes that are written are sane and will run correctly. I'll talk a bit more about this below, as it leads into implementation.
   * Implementing changes would be done by me through the Yapperbot account or by any other administrator who edits the JSON file containing the rules. Because this process is non-urgent by its very nature, I would expect that even a sysop making a request would go through the same processes as any other request - there's no reason for them to skip directly to editing the JSON file. As I've mentioned in the instructions up at User:Yapperbot/Scantag, all changes would be expected to be tested in the sandbox first before actually being implemented; I'm also considering adding a separate "live" parameter to the actual JSON file, which would denote whether a rule should be live or on a dry run. This would allow more complex regexes to be tested on the entire Wikipedia text, with the bot saving to a page a list of the pages the regex would match, prior to it actually modifying those pages.
   Hopefully that clears things up a bit, let me know if there's anything that's not clear though! All of this is just "how it's struck me as being best", not "how it is definitively best", so any thoughts are definitely appreciated. As I mentioned in the original BRFA text, I'm particularly interested in thoughts on whether this is actually better to be restricted to interface administrators only rather than all administrators (or perhaps the sandbox should be admins, and the real rules intadmins? or perhaps even the sandbox and "dry run" rules being admins only, and the real rules intadmins?) PS. I appreciate that this is a chunky and annoying wall of text; sorry this BRFA is a bit more complex than the others! Naypta ☺ | ✉ talk page | 18:52, 30 June 2020 (UTC)
 * This bot appears to be fetching page texts from the API individually for every page. If it's going to do that for 6 million pages, that's horribly inefficient. Please batch the queries - it's possible for bots to query the texts of up to 500 pages in one single request, which is more efficient for the server. See mw:API:Etiquette. I see you're already handling edit conflicts, which is great (as they would occur more often because of the larger duration between fetching and editing).
 * Regarding the editing restrictions, I don't think there's a need to restrict it to intadmins. Just put a banner as an editnotice asking admins not to edit unless they know what they're doing. (non-BAG comment) SD0001 (talk) 14:05, 2 July 2020 (UTC)
 * I had a chat with some of the team in either #wikimedia-cloud or #wikimedia-operations on IRC (one or the other, I don't recall which, I'm afraid) who had indicated that there wouldn't be an issue with doing it this way, so long as maxlag was set appropriately (which is deliberately low here, at 3s). I didn't initially want to do too many page requests in a batch, for fear of ending up with a ton of edit conflicts towards the end of the batch; even with the ability to handle edit conflicts, it's expensive both in terms of client performance and also in terms of server requests to do so. That being said, batching some of the requests could be an idea - if either you or anyone else has a feel for roughly what that batch limit ought to be, I'd appreciate any suggestions, as this is the first time I'm building a bot that parses the entire corpus. Naypta ☺ | ✉ talk page | 14:38, 2 July 2020 (UTC)
 * Now I actually read the task description. Since the bot is only editing the very top or bottom of the page, it is unlikely to run into many conflicts. Edit conflicts are only raised if the edits touched content in nearby areas; the rest are auto-merged using diff3. I'd be very surprised if you get more than 5-10 editconflicts in a batch of 500. So if you want to reduce the number of server requests (from about 1000 to about 510 per 500 pages), batching is what I'd use. If you choose to do this, you'd want to give the jsub command enough memory to avoid an OOM. SD0001 (talk) 16:09, 2 July 2020 (UTC)
 * Sure, thanks for the recommendation - I'll plonk them all into batches then. You're right that it's only editing the very top and bottom, but it does need to do a full edit (because of maintenance template ordering) rather than just a prependpage and appendpage, which is unfortunate, so edit conflict issues might still come about from that. No harm in giving it a go batched and seeing how it goes though! I'll make sure it gets plenty of memory assigned on grid engine to handle all those pages - a few gigabytes should do it in all cases. Naypta ☺ | ✉ talk page | 16:13, 2 July 2020 (UTC)
 * ✅ - batching implemented. I've also changed the underlying library I use to interact with MediaWiki to make it work with multipart encoding, so it can handle large pages and large queries, like these batches, an awful lot better. Naypta ☺ | ✉ talk page | 22:20, 3 July 2020 (UTC)
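For readers following the batching discussion, a minimal sketch of the idea is shown here: split the title list into groups of at most 500 and build one content query per group, with maxlag set to 3 as mentioned above. The helper names `batchTitles` and `queryParams` are hypothetical and this is not the bot's implementation; the query parameters are the standard MediaWiki ones for fetching revision content.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// batchTitles splits titles into consecutive groups of at most size entries.
func batchTitles(titles []string, size int) [][]string {
	if size < 1 {
		return nil
	}
	var batches [][]string
	for start := 0; start < len(titles); start += size {
		end := start + size
		if end > len(titles) {
			end = len(titles)
		}
		batches = append(batches, titles[start:end])
	}
	return batches
}

// queryParams builds the parameters for one request that fetches the
// current wikitext of every title in the batch.
func queryParams(batch []string) url.Values {
	return url.Values{
		"action":  {"query"},
		"prop":    {"revisions"},
		"rvprop":  {"content"},
		"rvslots": {"main"},
		"titles":  {strings.Join(batch, "|")},
		"maxlag":  {"3"},
		"format":  {"json"},
	}
}

func main() {
	titles := []string{"Alpha", "Beta", "Gamma"}
	// A batch size of 2 keeps the demo output short; a bot account would use 500.
	for _, batch := range batchTitles(titles, 2) {
		fmt.Println(queryParams(batch).Encode())
	}
}
```

With 500 titles per request, a run needs one read per batch plus one write per tagged page, which is where the rough "1000 to 510 requests per 500 pages" estimate in the discussion comes from.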

Primefac (talk) 22:22, 2 August 2020 (UTC) Operator inactive. No prejudice against re-opening upon their return. Primefac (talk) 21:02, 14 November 2020 (UTC)
 * Any progress? Primefac (talk) 15:38, 10 November 2020 (UTC)
 * The proposing bot op hasn't edited any WMF wiki since August 2. An email or talk page notice might be in order. --Izno (talk) 21:00, 14 November 2020 (UTC)
 * Thanks. Primefac (talk) 21:02, 14 November 2020 (UTC)
 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at Bots/Noticeboard.