Wikipedia:Bots/Requests for approval/DYK-Tools-Bot


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at Bots/Noticeboard. The result of the discussion was

DYK-Tools-Bot
Operator:

Time filed: 00:35, Thursday, December 15, 2022 (UTC)

Function overview: A bot to assist in various tasks related to WP:DYK maintenance.

Automatic, Supervised, or Manual: Automatic

Programming language(s): Python

Source code available: https://github.com/roysmith/dyk-tools

Links to relevant discussions (where appropriate):
 * WT:DYK

Edit period(s): Hourly

Estimated number of pages affected: Category:Pending DYK nominations currently has 268 entries

Namespace(s): Template

Exclusion compliant (Yes/No): No

Function details: This is a proposal for a new bot to help out at WP:DYK. A big part of the back-end work of DYK is building prep sets. Each set consists of 8 "hooks", which are chosen from those proposed in nominations. The selection of hooks needs to comply with an absurdly large number of rules. These rules include:


 * The hook must be previously approved, indicated by a checkmark icon on the nomination template.
 * Once approved, a hook can be unapproved by somebody raising an objection, requiring that it be re-approved
 * If you are the author of a hook or have approved it, you can't promote it to a set yourself
 * The first hook in a set must include an image (which in turn must be approved)
 * Within a set, it is strongly discouraged to run two hooks that are biographies next to each other
 * It is similarly strongly discouraged to run two hooks about American topics next to each other
 * The total number of biography and/or American topic hooks in a set is capped
 * Between sets, it is discouraged to have the lead hooks be of similar types
 * Certain hooks are tagged to be run on particular dates
 * And so on

In the current process, people building prep sets scan the list of pending hooks looking for ones that meet all the requirements. It would be good to have a tool which automates as much of this as possible and presents to the human a list of potential hooks that might fit a given slot. It would then be up to the human to confirm the suitability and pick from the suggestions presented (or ignore them completely).

A POC implementation of the evaluation system is currently running on Toolforge. Source is available on GitHub.

The next step is to repackage the nomination evaluation code as a bot which runs under cron on toolforge. This would:


 * Run at some reasonable interval. Hourly seems like a good starting point.  Based on some initial measurements, I estimate a run will take a couple of minutes to complete.
 * Iterate over the articles in Category:Pending DYK nominations to find nominations to examine.
 * For each unassessed nomination, evaluate it to determine if it's a biography and/or an American topic.
 * Add Category:Pending DYK biographies and/or Category:Pending DYK American hooks to the nomination template as appropriate. The edit summary will include a link back to the bot's user page.  A human can override the automatic assignments by adding or deleting classification templates manually.
 * Keep track of which nominations it has processed so it doesn't keep reprocessing the same ones. Any nomination which already has any of the classification templates will be automatically skipped.  Thus, if a human does a manual evaluation, the bot will never override the human.
 * Iterate over Category:Pending DYK biographies and Category:Pending DYK American hooks to find any templates which are no longer in Category:Pending DYK nominations and remove the classification categories.
 * An alternative to that would be to have the bot edit the DYKsubpage which is on every nomination, adding new parameters to indicate the categories. That will clean up the cats automatically when the DYKsubpage is processed during the nomination close process.


 * I'll implement some kind of emergency button so anybody can stop it if it goes haywire.
 * Assert will be used to prevent logged-out editing (I need to figure out how that works in pywikibot).
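
The steps listed above can be sketched roughly as follows. This is a minimal illustration only, not the dyk-tools implementation: the `Nomination` class, the `run_once` function, and the in-memory `processed` set are all hypothetical stand-ins, and no wiki access is performed.

```python
from dataclasses import dataclass, field

# Hypothetical names throughout; none of this is taken from the dyk-tools source.
BIO_CAT = "Category:Pending DYK biographies"
US_CAT = "Category:Pending DYK American hooks"
MANAGED = {BIO_CAT, US_CAT}


@dataclass
class Nomination:
    title: str
    is_biography: bool
    is_american: bool
    categories: set = field(default_factory=set)


def run_once(pending, processed):
    """Classify each pending nomination at most once; return the titles edited."""
    edited = []
    for nom in pending:
        # Skip anything already handled, or already classified (e.g. by a human).
        if nom.title in processed or nom.categories & MANAGED:
            continue
        new_cats = set()
        if nom.is_biography:
            new_cats.add(BIO_CAT)
        if nom.is_american:
            new_cats.add(US_CAT)
        nom.categories |= new_cats
        processed.add(nom.title)
        # Only nominations that actually gained a category count as edits.
        if new_cats:
            edited.append(nom.title)
    return edited
```

In this sketch a nomination that already carries one of the managed categories is never touched, which mirrors the stated goal that the bot never overrides a human's manual evaluation.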

Future work will be to build a tool that a user can run (probably as part of the existing toolforge web service) to filter based on these categories and/or other criteria. I could also see additional classification categories being added in the future if needed.

The code that touches the wiki is pywikibot. The web app is Flask.

I don't anticipate the need to persist much data. For what little state I need, I'll probably use redis to keep things simple.

I've created User:DYK-Tools-Bot.

Discussion
So to clarify, this BRFA is about the addition and removal of Category:Pending DYK biographies and/or Category:Pending DYK American hooks to pages in Category:Pending DYK nominations? How does it make this assessment? I presume by the associated article/article talk containing certain categories (like biographies or america-related wikiprojects)? Or some other heuristic? ProcrastinatingReader (talk) 23:24, 15 December 2022 (UTC)


 * The code is Article.is_biography and Article.is_american. The gist is:
 * Biography: there's a birth year category or an infobox which descends from Category:People and person infobox templates.
 * American: the word "american" appears in the intro, there's a category that ends in "in the united states", or there's a link anywhere in the article to a US State (or state-like area) page.
 * These are probably not perfect, but they seem to be working. The heuristics can always be tweaked.  Errors (in either direction) are not critical, since this is just an aid to a human who makes the final decision. -- RoySmith (talk) 23:48, 15 December 2022 (UTC)
 * Will the bot also be differentiating between approved and non-approved noms, by the way? theleekycauldron (talk • contribs) (she/her) 03:06, 16 December 2022 (UTC)
 * The existing code certainly has the ability to figure out if a nomination is approved. Ultimately I envision a front-end where you can say, for example, "Show me all the non-American biographies that are approved".  But that's not something the bot part of this needs to know about when it's assigning categories. -- RoySmith (talk) 03:24, 16 December 2022 (UTC)
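
For illustration, the biography and American heuristics described in this thread could be approximated as below. This is a hedged sketch, not the actual `Article.is_biography` / `Article.is_american` code: the function signatures are assumptions, category names are passed without the "Category:" prefix, and the infobox check is reduced to a boolean the caller is assumed to have computed.

```python
import re


def is_biography(categories, infobox_is_person):
    """True if there is a birth-year category or a person infobox."""
    # A birth-year category is assumed to look like "1950 births".
    has_birth_year = any(re.fullmatch(r"\d{1,4} births", c) for c in categories)
    return has_birth_year or infobox_is_person


def is_american(intro, categories, linked_titles, state_titles):
    """True if any of the three 'American' signals is present."""
    # Signal 1: the word "american" appears in the intro.
    if "american" in intro.lower():
        return True
    # Signal 2: a category ends in "in the united states".
    if any(c.lower().endswith("in the united states") for c in categories):
        return True
    # Signal 3: the article links to a US state (or state-like area) page.
    return bool(set(linked_titles) & set(state_titles))
```

As noted in the thread, errors in either direction are tolerable here because a human makes the final decision; a plain substring test like the one above would, for example, also match "Latin American".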

ProcrastinatingReader (talk) 13:36, 16 December 2022 (UTC)


 * OK, thanks. I haven't actually written the bot code yet; I assume the 7 days runs from whenever I turn it on? -- RoySmith (talk) 14:27, 16 December 2022 (UTC)
 * Yeah ProcrastinatingReader (talk) 16:52, 16 December 2022 (UTC)


 * Template:Did you know nominations/Rosa Diaz – fictional character articles are usually treated as biographical for the purposes of prep set building. theleekycauldron (talk • contribs) (she/her) 21:45, 18 December 2022 (UTC)
 * Thanks. https://github.com/roysmith/dyk-tools/issues/3 -- RoySmith (talk) 21:49, 18 December 2022 (UTC)
 * I've done one full run, kicked off manually. Took longer than I expected:
 * but additional runs should take a lot less time since they will just be working on the new nominations. -- RoySmith (talk) 21:59, 18 December 2022 (UTC)

I'm concerned because when you have a list that looks like this:
 : something
 :: something
 ::: something
 :::: something
 DYK-Tools-Bot was here
 ::::: something
we end up with a LISTGAP problem for screenreaders. Moving it outside DYKsubpage would theoretically prevent users from talking around it. theleekycauldron (talk • contribs) (she/her) 06:18, 26 December 2022 (UTC)
 * I saw this on my watchlist. Roy, I think it would be best for soliciting feedback to have a link to this BRFA in both the bot's edit summaries and on its user page – I don't currently have anything to say on the task itself, but a user who wanted to would have to go to WP:BRFA (which is not linked on the user page) and find the specific subpage, which is not very accessible. Thanks, Sdrqaz (talk) 22:40, 18 December 2022 (UTC)
 * Thanks for the suggestion. I've added a link to here from the bot's user page.  I'll provide something better as things progress. -- RoySmith (talk) 00:40, 19 December 2022 (UTC)
 * Users are breaking MOS:LISTGAP by continuing lists after DYK-Tools-Bot was here is placed without being integrated into the list. I would suggest either placing the template outside of DYKsubpage or integrating it into the template with a  parameter. Same goes for any relevant categories, although those could be made parameters of DYK-Tools-Bot was here as well. theleekycauldron (talk • contribs) (she/her) 08:42, 21 December 2022 (UTC)
 * Yeah, I'll work on that, thanks. My template-fu is kind of weak, but I'll see what I can figure out.  I'm not a fan of the whole "HTML comments as delimiters" thing; it really breaks the model of being able to parse wikitext in some structured way. -- RoySmith (talk) 14:03, 21 December 2022 (UTC)
 * All right, works for me. Do you know of a way to sort these into the "Approved" and "Pending" categories as well? This could be as simple as checking whether it's transcluded to WP:DYKN or WP:DYKNA. Would be a huge help. theleekycauldron (talk • contribs) (she/her) 12:10, 22 December 2022 (UTC)
 * There's already code which knows how to follow the chain of approvals and dis-approvals.  I'm working on a fix for the issue you pointed out yesterday, I want to get that out the door before I look at other stuff. -- RoySmith (talk) 12:42, 22 December 2022 (UTC)
 * @Theleekycauldron How does Special:Diff/1129331559 look. Will that work? -- RoySmith (talk) 19:42, 24 December 2022 (UTC)
 * @RoySmith: Sigh, my mistake – I was under the impression that DYKsubpage came with tags pre-installed. Now the categories are being transcluded onto WP:DYKN. I'd say probably your best bet is gonna be adding  and   templates to DYKsubpage. I'm happy to assist you with that, if you'd like :) theleekycauldron (talk • contribs) (she/her) 22:29, 24 December 2022 (UTC)
 * I had an earlier version that wrapped the cats in noinclude tags, but I got rid of that in the latest go-round because it added a lot of complication to the code. I'm really hesitant to bury this in the DYKsubpage template because that will add its own layer of complication and cross-dependencies.  What I'm thinking is Pending DYK biographies which would look something like:
 * but I haven't been able to find the right combination of tags that would let the category apply to the Template:Did you know nominations/... page, but not to the pages that include that. I'd certainly appreciate help figuring that out. -- RoySmith (talk) 22:43, 24 December 2022 (UTC)
 * @Theleekycauldron OK, I think I've got this figured out. Take a look at:
 * https://test.wikipedia.org/wiki/Template:Pending_DYK_biographies
 * https://test.wikipedia.org/wiki/Template:Did_you_know_nominations/East_Germany–Zanzibar_relations
 * https://test.wikipedia.org/wiki/Template_talk:Did_you_know
 * https://test.wikipedia.org/wiki/Talk:East_Germany–Zanzibar_relations
 * Did_you_know_nominations/East_Germany–Zanzibar_relations is in Category:Pending DYK biographies, the other two, which transclude the first, are not in the category. From my coding point of view, all the bot needs to do is add or remove Pending DYK biographies, so the code is relatively clean.  Will that work for you?
 * With only a small amount of encouragement, I could go off on a frothing-at-the-mouth rant about Mediawiki markup language, but I'll behave myself. -- RoySmith (talk) 02:11, 25 December 2022 (UTC)
 * @RoySmith: Okay, I was definitely very wrong! Seems that it absolutely needs to go inside the DYKsubpage template, because otherwise the note persists after the nomination is closed. Other than that (and please please pretty please a category for noms transcluded to WP:DYKNA), looks good to me! theleekycauldron (talk • contribs) (she/her) 11:03, 25 December 2022 (UTC)
 * @Theleekycauldron Let's take a step back. What is it that you actually are concerned about which moving the categories inside or outside DYKsubpage will solve? -- RoySmith (talk) 14:59, 25 December 2022 (UTC)
 * Related question: is there some documentation which describes how the morass of DYK templates are supposed to work? Looking at the source for DYKsubpage I can't make heads or tails of what's supposed to be happening.  Specifically, I've been trying to dig my way down to where the "(Review or comment . Article history)" text is produced and can't find it. -- RoySmith (talk) 15:40, 25 December 2022 (UTC)
 * Ugh, the problem was that it's not in a template at all. It's in Module:DYK nompage links. -- RoySmith (talk) 15:46, 25 December 2022 (UTC)


 * That makes sense. But it's at odds with your last statement, Seems that it absolutely needs to go inside the DYKsubpage template -- RoySmith (talk) 15:06, 26 December 2022 (UTC)
 * @RoySmith: that would be because I erred in offering that solution, despite the LISTGAP concerns – when a nomination is closed, the DYK-Tools-Bot was here persists in otherwise-blank transclusions, which is not great. theleekycauldron (talk • contribs) (she/her) 23:12, 27 December 2022 (UTC)
 * I think part of the problem is that DYK-Tools-Bot was here is producing user-visible text to begin with. My original intent was that it would have no visible text, and I'm planning to go back to that.  It was intended as just a boolean marker that the bot would use to keep track of whether it had processed a nomination yet.  I'll add something to the template documentation explaining what it is.
 * So, I think we're in agreement now that I'll put DYK-Tools-Bot was here after DYKsubpage and before the "Please do not write below this line" comment?
 * Template:Did you know nominations/E. Daniel Cherry is an interesting case. In this edit,  ignored the "do not write below this line" and put the DYK checklist below the line.  I'm not sure if anything really cares about that.  Which is another way of saying I'm not sure why we even have that comment line. -- RoySmith (talk) 23:57, 27 December 2022 (UTC)
 * @RoySmith: It's because of the aforementioned problem – if it's not in the DYKsubpage at time of close, it won't be against the pale blue and will be transcluded in places it shouldn't be. People do routinely ignore that line, it'd be nice to do something about it. theleekycauldron (talk • contribs) (she/her) 00:00, 28 December 2022 (UTC)
 * I think that works, yes :) theleekycauldron (talk • contribs) (she/her) 00:01, 28 December 2022 (UTC)
 * @RoySmith Oops. ~ ONUnicorn (Talk|Contribs) problem solving 03:50, 28 December 2022 (UTC)

I'm wary of bloating up the DYK page with that long message, honestly. Is it really required? Can this not just be stated on the bot/bot talk page (which someone will probably look at once the bot re-adds the template, if only to complain), or in a comment in the wikitext which doesn't show but will be visible to someone removing the template? ProcrastinatingReader (talk) 00:33, 28 December 2022 (UTC)


 * @ProcrastinatingReader @Theleekycauldron Leeky and I just had a quick zoom conversation where we cleared up a lot of confusion. The bottom line is I'll get rid of the message completely.  And all the DYK-Tools-Bot stuff will go at the very end of the page, after the HTML comment.
 * Most of my confusion had to do with what the HTML comment means. While it says, "Place comments above this line", what it really means is "Place comments inside the DYKsubpage template".  If you're thinking of the page as a sequence of lines of text, those two interpretations lead to the same meaning.  But I was thinking of the page as a structured tree of nodes, which led me to think what I should be doing is putting my stuff after the DYKsubpage but before the start of the comment. -- RoySmith (talk) 00:58, 28 December 2022 (UTC)

Cron running
I've now got this running as an hourly cron job:

 (venv) tools.dyk-tools@tools-sgebastion-11 [~] toolforge-jobs list
 Job name:    Job type:             Status:
 -----------  --------------------  ---------------------------
 dykbot-cron  schedule: 43 * * * *  Last schedule time: unknown

I believe I've incorporated all of the comments above. Please let me know if there's anything of concern, and of course, feel free to block DYK-Tools-Bot if it starts doing something stupid. Now that this is running in automated mode, I'm assuming the 7-day trial clock is now running. -- RoySmith (talk) 04:09, 2 January 2023 (UTC)


 * I pushed a few changes out today. There are some improvements to the "American" detection heuristics, and I've moved to the Pending DYK biographies scheme instead of the raw Category:Pending DYK biographies tags. -- RoySmith (talk) 03:50, 6 January 2023 (UTC)
 * @ProcrastinatingReader The 7-day trial period is up in a couple of hours. What's next? -- RoySmith (talk) 02:15, 9 January 2023 (UTC)
 * Oh, I see. I'm supposed to say  -- RoySmith (talk) 13:42, 9 January 2023 (UTC)
 * Is DYK-Tools-Bot was here really required? Can you track state within the bot, using some kind of database or even a buffered flatfile if you want to avoid spinning up new infrastructure? On Toolforge it might be easier to spin up a db (see wikitech:Help:Toolforge/Database), which saves you having to deal with on-disk logic. The bot seems to be editing far more pages than necessary ('necessary' being cases where it has to add a category). ProcrastinatingReader (talk) 15:58, 18 January 2023 (UTC)
 * Strictly speaking, yes, it would be possible to track the state managed by DYK-Tools-Bot was here, but keeping everything on the DYK pages seemed more logical. The intent is that the categories can be overridden by humans (either adding or removing).  DYK-Tools-Bot was here provides a human-visible indication that the bot has done its thing.  Lacking that, a human would have no way to know if the bot mis-categorized the nomination as not belonging to any of its managed categories, or simply hasn't been there yet. -- RoySmith (talk) 16:25, 18 January 2023 (UTC)
 * Why not run the bot more frequently? Any reason it can't run every 5 minutes or so - I suspect it's not really intensive on the API? (Or webhooks, if the API offers that, though I think it's probably overkill.) My thinking is: If it runs frequently enough then by the time a human DYK editor starts to wonder if they need to add categories manually, the answer will be yes, as the bot has very likely visited it.
 * If we really do want to make it obvious and possible for other users to verify state, you could maintain the flatfile onwiki under the bot's userspace, though IMO that's probably overkill and adds some overhead. ProcrastinatingReader (talk) 16:33, 18 January 2023 (UTC)
 * I agree with Proc, it would be nice if we didn't need the template. The main advantage of keeping state on-wiki tends to be that humans can "reset" the bot without needing maintainer assistance. But since you've already suggested having a web tool, maybe that tool could also surface whether an article has been processed or not?
 * Alternatively, I also like leeky's suggestion above of having it be a parameter to the existing DYKsubpage template, though I'd suggest something like, rather than being named exclusively to the bot. Legoktm (talk) 18:53, 18 January 2023 (UTC)
 * One other suggestion, you could alternatively just check the page history to see if the bot has edited the nomination before. Since you're on Toolforge and have access to the database replicas, it shouldn't be too difficult to have a query that gives you a list of pages in the pending nominations category that the bot hasn't edited yet. Legoktm (talk) 18:55, 18 January 2023 (UTC)
 * That can't differentiate between "looked at and decided not to add any categories" and "hasn't been looked at yet". -- RoySmith (talk) 20:06, 18 January 2023 (UTC)
 * @ProcrastinatingReader @Legoktm I've given this some thought overnight, and I still think what I'm doing now makes the most sense. Yes, I could store the state somewhere else, but that would just be more complicated, spreading things out into different places, with additional opportunities for things to get out of sync.  Yes, I could set an attribute on DYKsubpage, but I don't see how that's fundamentally any different from what I'm doing now, and it will add additional coupling between two basically unrelated things.  DYKsubpage is already crazy complicated as it is now, and adding another parameter will just make it more complicated.
 * So, other than the other ways to do this existing, what about those other ways makes them better than what I'm doing now? -- RoySmith (talk) 17:58, 19 January 2023 (UTC)
 * The main concerns I had with tracking state on the pages are:
 * Excess edits - the bot will have to edit every DYK page created, even though it only needs to modify categories for a fraction of them
 * In the past, people (esp less technical editors) have been quite sensitive to changes in template/page structure in processes they interact with (the edit cited above is an example). I hope this won't confuse anyone, but I saw the slight change in structure here as unnecessary/avoidable
 * Generally think it's bad practice for bots to track state on the pages they potentially wish to edit. The approach is not scalable, in general.
 * ProcrastinatingReader (talk) 18:04, 19 January 2023 (UTC)
 * @ProcrastinatingReader I've spent some time looking at what it would take to convert this to using a SQL database to keep track of which nominations have been processed. It will add considerable complexity.  So, let me make sure I understand your position.  Are you saying, "I suggest you consider other ways to do this", or "The way you're doing it now is an absolute blocker to getting this approved"? -- RoySmith (talk) 17:03, 21 January 2023 (UTC)
 * @ProcrastinatingReader I haven't heard back in a couple of days. Just making sure you saw my question above.  Could you respond, please? -- RoySmith (talk) 14:18, 24 January 2023 (UTC)
 * I can't comment on the architecture of your app, of course, but in general I don't think implementing state tracking (in a database system, or using flatfile) should be too difficult. If on toolforge, since they deploy database infrastructure and provide you with a database, it should be a bit easier. Worst case, tracking state in a subpage of your bot's userspace seems doable.
 * w.r.t. your specific question, I imagine some BAG member would probably approve this with state tracking on the DYK nomination page, as you're doing currently. It's slightly easier to justify since the scope is limited (limited to DYK nominations and not, say, large swathes of articles or talk pages). I'd ask for a second opinion on that, myself. ProcrastinatingReader (talk) 16:39, 24 January 2023 (UTC)
 * BAG assistance needed Second opinion requested on whether this is approved in its current state or if the changes suggested above are required for approval. -- RoySmith (talk) 18:11, 24 January 2023 (UTC)

SQL-based version
This afternoon I deployed a version of the bot which does not store state on wiki pages. The code is available for inspection at https://github.com/roysmith/dyk-tools/tree/botlog. I have performed two supervised runs manually, with no observed problems. -- RoySmith (talk) 23:35, 27 January 2023 (UTC)
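
As a rough illustration of off-wiki state tracking of this kind, the sketch below records each processed nomination in a small SQL table, so that "looked at and decided not to add any categories" is distinguishable from "hasn't been looked at yet". It is not the schema used on the botlog branch: the table name `bot_log`, the column names, and the helper functions are all assumptions, and Python's built-in `sqlite3` stands in for whatever database the deployed bot uses.

```python
import sqlite3


def open_log(path=":memory:"):
    """Open (and if needed create) the hypothetical processing log."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bot_log ("
        " title TEXT PRIMARY KEY,"
        " categories TEXT NOT NULL,"
        " processed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def mark_processed(conn, title, categories):
    """Record a nomination as processed, even if no categories were added."""
    conn.execute(
        "INSERT OR REPLACE INTO bot_log (title, categories) VALUES (?, ?)",
        (title, ",".join(sorted(categories))),
    )
    conn.commit()


def unprocessed(conn, pending_titles):
    """Return the pending titles that have no row in the log yet."""
    seen = {row[0] for row in conn.execute("SELECT title FROM bot_log")}
    return [t for t in pending_titles if t not in seen]
```

With a log like this the bot only needs to edit a nomination when it actually adds a category, which addresses the excess-edits concern raised above.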


 * @ProcrastinatingReader I see you were off-wiki for a few days, but looks like you're back. Can this be approved now? -- RoySmith (talk) 15:36, 29 January 2023 (UTC)
 * ProcrastinatingReader (talk) 19:49, 29 January 2023 (UTC)
 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at Bots/Noticeboard.