Wikipedia:Bots/Requests for approval/GreenC bot 7


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was

GreenC bot 7
Operator:

Time filed: 01:14, Friday, December 28, 2018 (UTC)

Function overview: Add the unreferenced and no footnotes tags to the tops of pages that have no references or are missing inline footnotes.

Automatic, Supervised, or Manual: Automatic

Programming language(s): GNU Awk and BotWikiAwk framework

Source code available: Yes (TBU)

Links to relevant discussions (where appropriate):
 * Wikipedia_talk:WikiProject_Unreferenced_articles
 * Village_pump_(proposals)

Edit period(s): one time run

Estimated number of pages affected: ~ 130,000

Namespace(s): Mainspace articles

Exclusion compliant (Yes/No): Yes

Function details:

As background, members of New Page Patrollers (WP:NPP) have caught up on tagging the backlog of new pages. However, there is still an older backlog of articles, created from day 1 up to about 2012, many of which remain untagged. Estimates suggest half a million or more could be untagged. A request was made on BOTREQ by an NPP member. I took a try at creating an algorithm to detect when a page could reasonably be tagged. Dry-run tests on 10,000 articles show it to be successful. Discussion at Wikipedia_talk:WikiProject_Unreferenced_articles shows support for an automated bot to help find articles needing attention and tag the pages.

Test results are available at User:GreenC/data/noref

The bot will start slowly and be fully supervised initially, running in batches, checking results.

Discussion
What kinds of main namespace pages are you exempting? (e.g. Redirects, Disambiguation pages?) — xaosflux  Talk 01:56, 28 December 2018 (UTC)
 * User:GreenC/data/noref lists stubs, redirects, any containing one of 1500+ templates (eg. and dabs), the HTML tag , any that begin with "List of ..", "Index of .." or " in .." --  Green  C  02:22, 28 December 2018 (UTC)
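The kind of exemption check described in this reply can be sketched as follows. This is an illustrative Python sketch, not the bot's code (the bot is written in GNU Awk and its source is not shown here). The function name `should_skip`, the stub pattern, and the redirect pattern are assumptions; only the "List of" and "Index of" title prefixes, stubs, and redirects come from the reply itself, and the 1500+ template list is not reproduced.

```python
import re

# Title prefixes named in the reply; the third, partially elided prefix is left out.
SKIP_TITLE_PREFIXES = ("List of ", "Index of ")

# Assumption: a redirect page starts with "#REDIRECT" (any letter case).
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\b", re.IGNORECASE)

# Assumption: stub templates are named "stub" or end in "stub", e.g. {{France-stub}}.
STUB_RE = re.compile(r"\{\{[^{}|]*stub\s*(\||\}\})", re.IGNORECASE)


def should_skip(title: str, wikitext: str) -> bool:
    """Return True if the page falls into one of the exempt groups sketched here."""
    if title.startswith(SKIP_TITLE_PREFIXES):
        return True
    if REDIRECT_RE.match(wikitext):
        return True
    if STUB_RE.search(wikitext):
        return True
    return False
```

A page passing this check would still need the template-based and other exclusions mentioned above before it could be considered for tagging.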


 * Personally, I would also skip stubs and articles with a dedicated 'further reading/external links' from the run. Headbomb {t · c · p · b} 17:51, 2 January 2019 (UTC)
 * Right, "stubs" are already skipped. If there is an external links section they get . -- Green  C  13:41, 9 January 2019 (UTC)

Let's see it in action then. Headbomb {t · c · p · b} 16:45, 9 January 2019 (UTC)
 * The sum total of three editors from a niche project who've supported this so far isn't really representative of the community and we'd really need a broader discussion, ideally at the village pump. There are segments of the community that take issue with the perceived indiscriminate tag-bombing performed by human editors, so I'm not sure having a bot take up this activity could be completely uncontroversial. There are differences of opinion on whether unreferenced should be placed on any unreferenced article, or only on those where the fact of being unreferenced is not immediately obvious to readers (ex. the article is long) and there is good reason to doubt the veracity of the content. And even if we assume that all unreferenced articles should be tagged, I'm not sure how a bot could do that within acceptable bounds of the rate of false positives. The bot currently employs a good deal of nuance (I like that it excludes lists and articles with external links), but I don't see how it could reasonably detect all types of references. Sources can be present without templates or ref tags (an article might only have a bibliography list at the end, and in-text attribution like "According to Strand's lengthy article in the 1953 issue of the JBF" is acceptable even in the absence of such a bibliography), or they may be implicit in the external links to standard identifiers in some types of infoboxes, or in the authority control templates at the end of articles. Heck, I've seen even human editors do a poor job of figuring out if an article is unreferenced, so I'm not confident a bot could do that either. – Uanfala (talk) 14:33, 12 January 2019 (UTC)


 * Maybe so, but that does not preclude a small scope trial to see exactly what is being proposed in action. And keep in mind a small amount of false positive is acceptable. Headbomb {t · c · p · b} 15:23, 12 January 2019 (UTC)
 * For the next round I'm going to make more entries to the test data results. An admirable User:Boleyn tagged some of the previous test results so more would be good. Boleyn is a great example who has already benefited from this bot to make improvements as part of the NPP process. -- Green  C  14:05, 13 January 2019 (UTC)


 * Rather than asking if the bot can deal with these things or how, Uanfala declares the bot "could not do that". It is unclear whether Uanfala has looked at the test data results. Every issue raised was already encountered during testing and I coded for it. The bot edits like a discriminate person would; it's been trained and can be further trained. The question is, which article in the test data do you take exception to? If the position is no tags at all, why does unreferenced have 200,000 and why does NPP add tags systematically every day (this tool might add another 20k so a 10% increase). There is consensus for tagging and WP:NPP has been widely lauded for their work which involves significant tagging. If the position is zero mistakes then that is unreasonable for bot or person. If the position is it makes too many mistakes, that is pure conjecture, see test results, and misses the fact this bot is being carefully run by a programmer who is checking results, taking feedback and continually improving it. -- Green  C  14:05, 13 January 2019 (UTC)
 * I acknowledge you've done great work with the bot, and apologies if I haven't been clear enough, and for having to repeat what I've already written above. The major issue is that there is so far no meaningful consensus for allowing a bot to tag thousands of articles: to get consensus for something that affects so many articles, there needs to be a well-attended discussion at a place like the village pump. And no, the fact that two editors have so far stated this would be a good idea (especially given that one of them is known for her extreme views as to what constitutes appropriate tagging) is very far from that. Yes, I did look at the test results (and incidentally, that's where I came across an article with Authority control that was earmarked for tagging – I wouldn't have thought of that otherwise). And again, I'm sorry if this sounds like I'm simply postulating without seeing the data, but I think we would all agree that even in this age of AI optimism, it's unlikely for a bot to be able to make good judgements as to which articles need to be tagged: again, the tag is not meant to be placed on every article without blue superscript numbers (though some NPP reviewers seem to do just that), but only where there's good reason to alert the reader to it. – Uanfala (talk) 18:16, 13 January 2019 (UTC)
 * The bot is not "AI". I am not an "optimist" who thinks computers are the solution for everything. The bot does not put a tag on every article "without a blue subscript", it is more nuanced than that. The question again is why anyone (bot or person) would not tag the articles identified. If you think Authority control should be skipped, that is a trivial feature. -- Green  C  19:23, 13 January 2019 (UTC)

feel free to proceed with the trial whenever you want. Headbomb {t · c · p · b} 18:49, 13 January 2019 (UTC)

- please withdraw the BRFA. It is going to prove too controversial. Not that I agree (otherwise I wouldn't have made the BRFA) but there are evidently some old wounds in the community about tagging and this bot will reopen old battle scars. And there is more than one way to make use of the tool: its purpose is to discover and identify potential candidate articles, information others can do with as they please. If there is support to re-open the BRFA it should go through VP or an RFC first. -- Green  C  15:28, 14 January 2019 (UTC)


 * Well, it's your BRFA, so you can withdraw it if you want (simply put BotWithdrawn somewhere here). I still think you should proceed with the trial personally. Headbomb {t · c · p · b} 15:31, 14 January 2019 (UTC)
 * Well, thank you for the support. There are trial "edits" in the test data so we can see what it does (would do) after processing 16,000 articles. If you or anyone else would like to start an RfC that is fine by me, but I don't fancy leading this fight; it will not be pretty. This was an interesting technical challenge and the data it produces can still be posted either way. I will hold off on withdrawing for a bit in case anyone wants to initiate a consensus discussion. -- Green  C  15:43, 14 January 2019 (UTC)


 * Will you proceed with a trial or not? Because if not, there is little point in keeping this open. Headbomb {t · c · p · b} 15:45, 14 January 2019 (UTC)
 * You mean just adding the tag? Normally page edits can be complex and/or error prone so they need be trialed, but dropping a tag at the top of a page is trivial and not error prone. In this case the problem is more about consensus. --  Green  C  15:51, 14 January 2019 (UTC)


 * I mean doing edits yes. That's what a trial is. Headbomb {t · c · p · b} 15:52, 14 January 2019 (UTC)

I have opened a discussion at Village_pump_(proposals) to ease or confirm 's fears. Any watching this page may be interested in commenting there. Thanks all for comments. Sorry to get in the middle of the BRFA here. Cheers! Ajpolino (talk) 17:20, 14 January 2019 (UTC)


 * Sigh... Headbomb {t · c · p · b} 18:11, 14 January 2019 (UTC)
 * It's been over a month since the discussion was opened. How does it look? Qzekrom (talk) 16:21, 24 February 2019 (UTC)

Hi. Would a page such as List of United States Supreme Court cases, volume 586 be tagged by this bot?

I'm not convinced that tagging pages such as Grand Ducal Highness with is particularly helpful, but shrug. If it's a one-time run, people can presumably just remove the ugly tags if they don't like them. --MZMcBride (talk) 01:41, 15 January 2019 (UTC)


 * To my understanding, it would be no to both: the first is a list (excluded) and the second has external links (also excluded). Headbomb {t · c · p · b} 01:43, 15 January 2019 (UTC)
 * Right the first one is no since it is a "List of" .. the second would get assuming there is consensus. In this page, User:GreenC/data/noref/14001-15000 it shows three independent bot algorithms. Which algorithm(s) the bot deploys is up to the community. It can do the first, or all three, or some combo. By default it will do all three, but if there is concern about the two  algos one or both could be dropped. --  Green  C  02:52, 15 January 2019 (UTC)

The bot doesn't have support for parenthetical referencing, either bare parentheticals or ones generated by harv. It lists Nummer 5 as an article to be tagged as type 2 no footnotes though the article has inline citations generated by harv and harvnb. It also lists Fossilized affixes in Austronesian languages to be tagged as type 2 no footnotes though it uses parenthetical citations with page numbers. Wugapodes [thɑk] [ˈkan.ˌʧɹɪbz] 08:25, 17 January 2019 (UTC)
 * Thanks. The harv etc is fixed. The parenthetical citation method is not common and hard to check for, it may be a good idea to tag these anyway for community attention so they can be converted. More developed articles using the parenthetical method will likely get skipped as they will probably contain other bits of info that will flag it for bypass. -- Green  C  15:50, 17 January 2019 (UTC)
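To make the two cases in this exchange concrete, here is a rough Python sketch of how harv-style templates and bare parenthetical citations might be detected. It is not the bot's implementation (the bot is written in GNU Awk). The function name and both regular expressions are assumptions, and the parenthetical pattern is a loose heuristic that will both miss real citations and match some ordinary parentheses such as "(London 1999)".

```python
import re

# Assumption: any template whose name starts with "harv" (harv, harvnb, ...) counts.
HARV_RE = re.compile(r"\{\{\s*harv\w*\s*\|", re.IGNORECASE)

# Ordinary footnote markup.
REF_TAG_RE = re.compile(r"<ref[\s>/]", re.IGNORECASE)

# Heuristic for bare author-date citations such as "(Smith 1999, p. 12)"
# or "(Blust 2009: 45)".
PAREN_CITE_RE = re.compile(
    r"\([A-Z][A-Za-z'\-]+"                           # first author surname
    r"(?: (?:and|&) [A-Z][A-Za-z'\-]+| et al\.)?"    # optional second author / et al.
    r",? \d{4}[a-z]?"                                # year, optional letter suffix
    r"(?:[,:] ?(?:pp?\. ?)?\d+(?:[-–]\d+)?)?"        # optional page or page range
    r"\)"
)


def has_inline_citations(wikitext: str) -> bool:
    """Return True if any of the three citation styles sketched above is found."""
    return bool(
        REF_TAG_RE.search(wikitext)
        or HARV_RE.search(wikitext)
        or PAREN_CITE_RE.search(wikitext)
    )
```

Because the parenthetical heuristic is imprecise, a real run would likely need manual review of its matches, which is consistent with the point above that this citation style is hard to check for.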

- 20 edits at Special:Contributions/GreenC_bot on January 17 ("via noref bot") -- Green  C  20:11, 19 January 2019 (UTC)
 * , Easy link to this trial's edits:  SQL Query me!  05:09, 24 January 2019 (UTC)
 * Note that the tags must be added after any hatnote templates (if any exist). The following code may help:  (source: Twinkle) SD0001 (talk) 15:46, 25 January 2019 (UTC)
 * That probably will not occur, as currently the bot skips anything with a preexisting banner, in an abundance of caution to avoid over-tagging; presumably someone or something has looked at it and decided it needed that banner and not this one. We've got lower hanging fruit than piling on banners. Eventually they can be revisited when the tracking categories are reduced. -- Green  C  19:54, 31 January 2019 (UTC)
 * The list is extensive, thousands, I made a separate program to auto-generate which templates to avoid (including their alias/redirect names), then a function to autogenerate a lengthy regex. -- Green  C  19:57, 31 January 2019 (UTC)
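The idea of generating one long regex from a list of template names and their aliases can be sketched as below. This is a Python illustration under stated assumptions, not the bot's GNU Awk code. The helper names are hypothetical, the three template names in the usage note are only examples, and the treatment of the first letter (case-insensitive) and of spaces versus underscores (interchangeable) reflects how template names generally behave in wikitext.

```python
import re


def _name_pattern(name: str) -> str:
    """Build a regex fragment matching one template name."""
    parts = [p for p in re.split(r"[ _]+", name.strip()) if p]
    escaped = [re.escape(p) for p in parts]
    first = parts[0]
    if first[0].isalpha():
        # First letter of a template name matches in either case.
        escaped[0] = "[" + first[0].upper() + first[0].lower() + "]" + re.escape(first[1:])
    # Spaces and underscores are interchangeable inside the name.
    return "[ _]+".join(escaped)


def build_template_regex(names):
    """Compile one regex that matches a call to any of the given templates."""
    unique = sorted({n.strip() for n in names if n.strip()})
    alternatives = "|".join(_name_pattern(n) for n in unique)
    # The name must be followed by "|" or "}}" so that longer names do not match.
    return re.compile(r"\{\{\s*(?:Template:)?(?:" + alternatives + r")\s*(?:\||\}\})")
```

For example, `build_template_regex(["Unreferenced", "Refimprove", "More citations needed"])` yields a pattern that matches `{{unreferenced|date=May 2010}}` but not `{{Unreferenced section}}`, since the name must be followed directly by a pipe or the closing braces.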

The RfC seems to have run its course (over 30 days). There are at least 4 ways this bot can run, from ultra-conservative to regular-conservative, so it need not be a black and white decision. There is currently some sort of majority that wants the bot to run in some form and their voices will hopefully not be ignored. I also note, many of the opposer arguments are based in FUD; as the bot writer I can assert most of those claims are either technically incorrect or very much on the margins based on real-world testing results. -- Green  C  16:52, 24 February 2019 (UTC)
 * You should request a formal close of that RFC (so it's not you yourself interpreting the consensus, but the closer: it will avoid fights down the line). I think the contentious issue is pretty binary: either the community supports tagging unreferenced articles or it does not. Everything else is a detail that can be figured out or ironed out. I happen to be against such indiscriminate tagging, but if the wider community consensus is in favor of it then having a bot do it is ipso facto a good idea; and if consensus is against then neither bot nor NPP should be doing so systematically. --Xover (talk) 19:02, 25 February 2019 (UTC)
 * Yes, it should be closed. You realize NPP was not even notified of the RfC? If there is a re-do after a Review they might properly be given a chance to participate and see how the results trend. Any ironing out discussions they should be there too. Last I checked there are at least 200k uses of which strongly suggests there is consensus for tagging, so that is not the "contentious" issue; it is something else. It's not accuracy, because the bot is even more accurate than human taggers evidently. It's more discriminate than manual tagging. It's not volume, because the bot will be limited in total edits; it won't go hog wild or over-tag pages. The only thing left is an opinion against tagging generally, which is a minority position on Wikipedia. --  Green  C  23:33, 25 February 2019 (UTC)
 * I didn't check whether the RFC was done right. If there's a question about that which may skew the results you should flag that for the closer, and possibly suggest a new RFC done correctly. I agree about your other points: the only real issue is whether the tags are desirable or not (the rest are essentially non-issues and technical matters). On that point, though, I'm not so sure it is a minority position: it's entirely possible that those opposed have just been overwhelmed by their endless addition and are not currently making a stink about it. And the NPP are neither neutral nor have any special authority on the issue: they are one important constituency within the community, but they most definitely have their own blinders and biases (it's the inevitable result of spending all your time fighting dragons). For existing (pre-2012) articles they are not even the most relevant constituency: NPP's remit is unpatrolled new articles. In any case, a well-publicised and well-constructed RFC that's properly closed would be a good foundation for this task. My recommendation is simply that you make sure you have that so you can point at it if there are complaints. --Xover (talk) 05:41, 26 February 2019 (UTC)
 * Closure requested. Thanks for the input! Ajpolino (talk) 01:27, 26 February 2019 (UTC)
 * Well, the RFC has been closed. Assessed consensus is in favor of tagging unreferenced articles, but not articles lacking footnotes. Note #2 from the closer is already done. Note #3 is already done(?) since I believe you referenced the trial results at least twice in your comments on the RFC. As for note #1, I'm not sure if it's feasible (or desirable) to add a parameter to  to categorize the tagged articles separately. If it's possible to do this while also having them added to Category:All articles lacking sources (which the unreferenced tag currently does), I think that would be most desirable since it would facilitate searching through all unreferenced articles (which would be possible, but a slight pain in the butt if a chunk of unreferenced articles are in some different category). Thanks all for your help in moving this along! I'll get out of your hair now and promise not to start any further trouble. Thoughts on moving forward? Ajpolino (talk) 04:13, 1 March 2019 (UTC)
 * In the discussion, I don't believe a link was given to the trial (personally, I was unaware of the trial until I read through the brfa itself), but rather only to User:GreenC/data/noref. Since the trial was only 20 edits, #3 should be construed as after an extended trial - meaning that, if the bot continues forward, it should have an extended trial, after which a link to the contributions should be posted. I'm not saying you need a whole extra RfC, but that would be a good time to ping NPP if you want to. However, the close should not be construed to suggest that the 20 edits already made satisfy the notification to the community. Thanks, --DannyS712 (talk) 04:24, 1 March 2019 (UTC)
 * 1. The category idea doesn't make much sense to me, either. To implement that would require special permanent code just for this bot in or a fork of that template just for this bot. It's not worth introducing all those forking or one-off special case complications that no one will remember soon enough and creates complications for those clearing out categories. Either the bot is working correctly or it's not, and if not, then it will stop. It won't be full speed and walk away, it will be carefully monitored just like every other bot I run. If folks just want a list of articles the bot tagged, that can be done easily by searching on the bot's unique edit-comment string (I can leave instructions on the bot page).
 * 2. In terms of trial edits, that's fine the BAG operations can determine, but this BRFA, the RfC, other previous discussions and whatever trial edits themselves (pinging watchlists) are all notifications, it's how BRFA is designed to work and why it takes a long time for BAG operators to close things because they give time for people to look and respond. -- Green  C  05:50, 1 March 2019 (UTC)
 * A source parameter to that adds the article to a per-source category is not a big deal. The majority of instances of the template won't make use of it, but if there is a need for it elsewhere (another bot, another task for this bot, some kind of categorization for the NPP, etc.) it's there and can be used. The category created would typically be a sub-cat of the relevant main maintenance category. It's a bit excessive for the (lack of) risk this task entails, but it's neither excessively onerous nor entirely wasted. *shrug* Would you like me to open a thread on it over at Template talk:Unreferenced or would you prefer to do so yourself?  If you would prefer not to deal with the template changes I can open a thread about it at Template talk:Unreferenced? And on the bright side, the RFC demonstrates that there is explicit support for tagging unreferenced articles (much to my chagrin :|) with, contingent only on it meeting a reasonable rate of false positives in an extended trial. So get source added to ; ask BAG for a couple hundred trial edits; and post a link to the diffs at VP/P for review. --Xover (talk) 07:17, 1 March 2019 (UTC) Edited: --Xover (talk) 14:28, 1 March 2019 (UTC)
 * FWIW Xover this bot might make 25,000 edits, total. By comparison has about 220k instances. So an increase of maybe 10% is not a big deal. If you want to implement it in the template I'll add it to the code but it's a needless complication IMO that will soon be forgotten and unused. Now, further VP/P discussions? That could throw this BRFA off the rails and leave the BAG operator uncertain with how to proceed. This is how it should work: The RfC closed in support of the bot. We return to BRFA and the BAG operator takes over. They approve (more) trial edits. Anyone can monitor the BRFA and trials and make comments for an extended period. The BAG operator then ultimately decides to accept or reject the bot based on community input and technical performance. It's always been done this way. The RFC says "notify the community once they have finished the bot's trial", BRFA is "the community" and this BRFA is well advertised and if someone wants to post more notifications elsewhere they are welcome to do so, but the RFC doesn't change the normal BRFA process. Doing so undermines that process and makes it hard for BAG operators to close it out when there are open, unresolved or disputed discussions happening elsewhere. --  Green  C  16:14, 1 March 2019 (UTC)
 * Just a notice to say that I'm not going to review this BRFA. I'll let someone else interpret what that mess of an RFC means. This is why I asked for trial first, so that the RFC could happen without FUD. Headbomb {t · c · p · b} 05:56, 1 March 2019 (UTC)
 * Headbomb, that's not really fair though. Prior to the RfC, there were an extensive number of (dry run) edits based on 18,000 article checks. If the edits are live or dry run, it's the same thing when checking for false pos. Processing 18,000 articles is no small thing for a trial, and it was done dry run because it couldn't make a large number of live edits - though it also made some live edits - overall it went beyond what was requested. The results were available for the RFC and probably why it closed in favor of the bot, it demonstrated it is effective. The FUD was because people didn't look at the (dry) trial edits or were fearful of them because they were accurate (ie. fearful of a flood of tagging since it can be automated). --  Green  C  16:14, 1 March 2019 (UTC)


 * A dry run is near meaningless to people who need diffs to see the bot live. Myself included. And because they couldn't review things, FUD happened. You preferred to do the RFC before trial, so now you're bound by it, rather than have proof that most of the FUD was baseless. Maybe the outcome of the RFC would have been the same, but I wager the footnotes thing would have passed once all the false positives and corner cases were addressed. Headbomb {t · c · p · b} 18:28, 1 March 2019 (UTC)

- there will be a replacement BRFA soon. -- Green  C  17:18, 2 March 2019 (UTC)
 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.