Wikipedia:Bots/Requests for approval/GreenC bot 11


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was

GreenC bot 11
Operator:

Time filed: 17:02, Sunday, March 3, 2019 (UTC)

Automatic, Supervised, or Manual: automatic

Programming language(s): GNU Awk

Source code available: TBU

Function overview: Add template to target articles. The template currently has about 220,000 instances; the bot will add about 25,000 more, or about a 10% increase.

Links to relevant discussions (where appropriate): Village_pump_(proposals) (RFC)

Edit period(s): one time run

Estimated number of pages affected: 25,000 (est)

Exclusion compliant (Yes/No): Yes

Already has a bot flag (Yes/No): Yes

Function details:


 * Implement the RFC
 * How the bot works is described at User:GreenC bot/Job 11/How
 * The bot previously processed 18,000 articles listing which articles it would tag. Available at User:GreenC/data/noref
 * The bot previously made 20 trial edits however the addition of  should be ignored as that was done before the results of the RFC changed the bot scope to.
 * Per the RFC, discussion is open at Template_talk:Unreferenced for adding GreenC bot argument to

Discussion
Informing previous BRFA participants of new BRFA: --  Green  C  17:24, 3 March 2019 (UTC)


 * BAG notes, I encouraged GreenC to restart this BRFA. Avoiding FP's is important, and this is not an "easy" filter.  The prior RfC is supportive in general of the tagging - but it has to be accurate.  There could be a small margin of error, but we need to focus on reducing it.  Feedback on FP avoidance and examples is extremely welcome below, thank you! —  xaosflux  Talk 17:47, 3 March 2019 (UTC)

Hi GreenC, for Skip if section named "External links", "References", "Sources", "Further reading", "Bibliography", "Notes", "Footnotes" if you have that as equals can you change it to contains? I've run across pages with sections such as "Literature and References". — xaosflux  Talk 17:35, 3 March 2019 (UTC)
 * Will do. -- Green  C  18:13, 3 March 2019 (UTC)
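
The difference between an "equals" and a "contains" test on section titles can be sketched as follows. This is only an illustration in Python, not the bot's GNU Awk source; the function names, the heading regex, and the sample wikitext are assumptions made up for the example, while the title list is the one quoted above.

```python
import re

# Section titles quoted in the request above.
SKIP_TITLES = ["External links", "References", "Sources", "Further reading",
               "Bibliography", "Notes", "Footnotes"]

# Assumption: headings are written as "== Title ==" with 2 to 6 equals signs.
HEADING_RE = re.compile(r"^(={2,6})\s*(.*?)\s*\1\s*$", re.MULTILINE)


def section_titles(wikitext):
    """Return the text of every section heading found in the wikitext."""
    return [m.group(2) for m in HEADING_RE.finditer(wikitext)]


def has_skip_section(wikitext, exact=False):
    """True if any heading equals (exact=True) or contains a skip title."""
    for title in section_titles(wikitext):
        for skip in SKIP_TITLES:
            matched = (title == skip) if exact else (skip in title)
            if matched:
                return True
    return False


# Hypothetical page with the heading mentioned in the request.
sample = "Intro text.\n\n== Literature and References ==\n* Some book\n"
print(has_skip_section(sample, exact=True))   # equality misses the heading
print(has_skip_section(sample, exact=False))  # containment catches it
```

The containment test here is case-sensitive, so a heading such as "literature and references" would still slip through; whether to normalize case is a separate design choice.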

I've had a look at a dozen or so of the articles identified at User:GreenC/data/noref. Here are a few articles that point to ways to potentially adjust the selection criteria:
 * 1) Vemuri – this is a surname article and it doesn't need sources as it mostly serves as a disambiguation page
 * 2) Dom Aleixo Timorese – not unsourced. I guess it needs to be taken into account that the bibliography section might have a different title ("Literature" in this case). Also, if articles with external links are to be excluded, then articles with Authority control will need to be excluded as well.
 * 3) Callichirus – similarly, it has a Taxonbar.
 * 4) Fukushima's Theorem – it has hand-formatted citations in a section called "Journal articles"
 * 5) Cordichelys – weren't stubs meant to be excluded?
 * 6) There are quite a few articles on films, TV episodes, books or music albums (like Parade (Bottom) or The Platinum Collection (Blue album)) that indeed list no sources, but a fair amount of whose content – plot synopses, track listings and the like – is obviously sourced to the publication that is the subject of the article. I don't think tagging with unsourced is a good idea, but there certainly is an underlying issue and that's the fact that they don't use any secondary sources. A more appropriate tag would probably be Primary sources, though its use normally entails some form of editorial judgement. – Uanfala (talk) 17:36, 3 March 2019 (UTC)


 * It will now filter anything with "surname" in a category name. Normally it would have been filtered out by one of the index templates in Category:Set index article templates but the page has none which is in error.
 * Authority control can be filtered. "Literature" can be added to the section title list.
 * Taxonbar, Authority control and others are now removed via Category:External link templates to linked data sites with reciprocal links
 * Section titles with "Articles" are now filtered (the section title words are case and plural sensitive)
 * It does not tag articles marked as stubs in an abundance of caution, but that doesn't mean stubby articles without sources can't or shouldn't be tagged. The article is unsourced and should be tagged. It was actually tagged previously, but some sort of deletion-by-redirect reversal caused the tags to be lost. The bot uncovered this problem.
 * There is no source, primary or otherwise. The presumption of a source is not the same as a literal source, i.e. what is the name of the source, where is it located, who is the author, what date was it accessed, etc.; all of that is missing. There is no verifiable source. That is why we have this tag, so the community can be made aware of articles like this that need a source. -- Green  C  18:13, 3 March 2019 (UTC)
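
A category-name filter of the kind described in the first reply above ("surname" anywhere in a category name) might look roughly like this. It is a Python sketch rather than the bot's actual Awk code; the regex, the function names, the case-insensitive comparison, and the sample category are all assumptions for illustration.

```python
import re

# Assumption: categories appear in wikitext as [[Category:Name]] or
# [[Category:Name|sort key]].
CATEGORY_RE = re.compile(r"\[\[\s*Category\s*:\s*([^\]|]+)", re.IGNORECASE)


def categories(wikitext):
    """Return the category names linked from the wikitext."""
    return [m.group(1).strip() for m in CATEGORY_RE.finditer(wikitext)]


def in_filtered_category(wikitext, keywords=("surname",)):
    """True if any category name contains one of the keywords.

    The comparison ignores case, which is an assumption of this sketch.
    """
    for name in categories(wikitext):
        lowered = name.lower()
        if any(keyword.lower() in lowered for keyword in keywords):
            return True
    return False


# Hypothetical page text; the category name is invented for the example.
page = "Vemuri is a surname.\n\n[[Category:Indian surnames]]\n"
print(in_filtered_category(page))
```

Passing further keywords through the `keywords` parameter extends the same check to other categories without changing the function.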

BAGAssistanceNeeded - the bot is ready to begin trials. -- Green  C  14:08, 6 March 2019 (UTC)
 * Go ahead and run a trial with your adjusted parameters. — xaosflux  Talk 00:46, 7 March 2019 (UTC)
 * diffs. --  Green  C  18:03, 13 March 2019 (UTC)
 * I skimmed these pretty quickly so I may have missed some. Thoughts:
 * Sawsan, looks like the same problem as Vemuri discussed above. Can you also just skip articles with "Given name" categories? (same for Gaurav)
 * Municipalities of Central Finland is basically a list article, but I can't think of a clever way to skip it. Maybe it's best articles like that have a reference anyway...
 * Communes_of_the_Aisne_department is a bona fide list. Maybe you can skip articles with "List" in the category names? This one was in Category:Lists of communes of France. (Members of the 5th Dáil and Duchess of Brabant (by marriage) would also be skipped with this).
 * Otherwise looks great! Ajpolino (talk) 20:43, 13 March 2019 (UTC)
 * Given-name articles have sources (see the Category tree for example Abdul Hamid or William or Alexander). List-of articles also have sources eg. List of counties in New York.  --  Green  C  21:27, 13 March 2019 (UTC)
 * Courtesy ping ^ -- The SandDoctor Talk 16:58, 16 March 2019 (UTC)
 * I'm not arguing that any kind of article ought not have references, but we pitched this in the RfC as a conservative bot skipping stubs, lists, et al. So if it's not too much trouble (and maybe it is), I think it'd be best if we skipped lists even if they aren't titled "List of..."... Also someone added a source to one of the articles you tagged in your most recent test run. So that's somewhat validating. That was kind of the point of all this. Thanks for all your work! Ajpolino (talk) 17:41, 16 March 2019 (UTC)
 * There was no 'pitch' to skip lists, nor can I think of any reason to; they have sources just like any other article. --  Green  C  18:40, 16 March 2019 (UTC)
 * Also, the most recent run suggests there will be around 10,000 edits, not the 25,000 originally thought, due to the addition of filters suggested by Uanfala. Each filter causes a significant reduction. To put 10,000 in perspective, that is 0.00175 of all articles (about one-fifth of one percent) or an increase in the template's instances of about 5%. These to me are conservative numbers. --  Green  C  18:52, 16 March 2019 (UTC)
 * Ah, sorry to be stuck on this point, but just to clarify does the bot in its current configuration skip articles that have titles "List of..."? I think that was in your original exclusion list (per the old BRFA) but perhaps you've decided against it. Ajpolino (talk) 20:00, 16 March 2019 (UTC)
 * Ah indeed it is filtering 'list of' articles, sorry! Not sure what I was thinking, losing track. OK, more filtering can be done on the category layer as you suggested. My code notes say the reason for filtering 'list of' articles is that it was picking up too many false positives. Also rethinking Given-name articles, those also are already filtered by way of the Set Index templates and those showing up here are edge cases that are not properly templated, so they should also be filtered on the category level. Thanks for your better memory keeping this straight :) -- Green  C  22:19, 16 March 2019 (UTC)
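
The proportions quoted in this thread for the revised 10,000-edit estimate can be checked with a few lines of arithmetic. The 220,000 existing template instances come from the function overview; the total of roughly 5.7 million articles is an assumption of this sketch, chosen because it is consistent with the 0.00175 fraction stated above.

```python
estimated_edits = 10_000      # revised estimate from the thread
existing_tags = 220_000       # template instances per the function overview
total_articles = 5_700_000    # assumption: approximate article count

# Share of all articles that would receive the tag.
fraction_of_articles = estimated_edits / total_articles

# Relative growth in the number of template instances.
tag_increase = estimated_edits / existing_tags

print(f"{fraction_of_articles:.5f}")  # 0.00175
print(f"{tag_increase:.1%}")          # 4.5%
```

Under these assumptions the tag would reach a little under 0.2% of articles and grow the template's use by roughly 4.5%, in line with the rounded figures given in the discussion.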

BAGAssistanceNeeded - above new filters added, ready for next trial, recommend another 50. -- Green  C  14:09, 17 March 2019 (UTC)


 * I'd like to see a bigger trial here, odd cases can be hard to find until this has more volume. — xaosflux  Talk 17:47, 22 March 2019 (UTC)

Hi Ajpolino and anyone else: Trial complete. The bot's contrib history is mixed in with other tasks so I made User:GreenC/data/noref/trial March 28 for 300. Feel free to edit this page with notes and comments. I have not checked yet but appreciate help finding problems and possible solutions. It was about 100k articles. -- Green  C  15:34, 29 March 2019 (UTC)
 * Made it through the first 50. In general looks great! There was one disambiguation page that has a category that's a sub-cat of Category:Disambiguation pages (I didn't know that category had sub-categories; learn something new every day). It doesn't need a ref, so maybe you could either skip all sub-cats of Category:Disambiguation pages, or if it's easier just skip categories with "Disambiguation" in the name? Annotated it on your list. Might get a chance to look through the rest of them in a bit. Ajpolino (talk) 20:59, 29 March 2019 (UTC)

In the future, please remember to use the template (or others relevant). Otherwise, this sort of thing can go unnoticed (even when directly viewing this BRFA) and lead to unnecessary waits. -- The SandDoctor Talk 21:13, 29 March 2019 (UTC)
 * I'm not sure if Line of succession to the Moroccan throne should have been tagged - it clearly says "According to..." before the info given. --DannyS712 (talk) 22:25, 30 March 2019 (UTC)
 * More analysis:


 * San Roque Catholic School - it shouldn't have both "no references" and "citation needed" tags on the same page
 * Neighborhoods in Key West, Florida - it starts with "The following is a list", which you can use to sort out such pages
 * Same with Kart Racing Championships
 * Georgian monarchs family tree of Bagrationi dynasty of Kakheti - not an article (no prose) so probably shouldn't be tagged
 * Battle of Beijing - dab page (though not tagged as dab), is there any way to skip that?
 * Zigor (opera) - Last sentence says "Information extracted from the libretto that accompanies the recording of Gernika, realised in November and December 2007. Symphonic Orchestra of Euskadi, Choir Society of Bilbao, Jose Ramon Encinar (musical director). DECCA, 2 CD (0028947667957)"
 * Senegalese Popular Bloc - revision tagged was Special:Permalink/889949685, which said "Source: Zuccarelli, François. La vie politique sénégalaise (1940-1988). Paris: CHEAM, 1988."
 * Englische Schulredensarten für den Sprachenunterricht - not under a proper section heading, but has a clear source ("Rückoldt, Armin: Englische Schulredensarten für den Sprachunterricht. 2. Edition. Leipzig, Roßberg'sche Verlagsbuchhandlung, 1909. 80 Pgs.")
 * Outline of statistics - outlines generally don't have sources themselves, since they are just guides to other pages
 * 1985 Quebec school board elections - sourced; ends with "Source: Sandy Senyk, "School board elections drew low voter turnout," Montreal Gazette, 11 December 1985, A5."
 * Lewkowicz - surname pages are another type of disambiguation pages
 * Burials at the Novodevichy Cemetery - just a list of links to other articles, probably shouldn't be tagged
 * Sherlock Yack - last section (Sherlock Yack) ends with "According to the "Sherlock Yack – Zoo Detective" collection Michel Amelin and Colonel Mustard, published by Editions Milan. © Milan – 2010"
 * Marie Lesueur - multiple inline sources referenced ("Almanach des spectacles for 1820 stated", "1822 Revue des spectacles described")
 * Bhagwat - family names are a type of dab pages
 * Transports Montreux–Vevey–Riviera - sourced, ends with "Details of the report were published in "24 Heures", Riviera-Chablais edition dated 21–22 June 2008."
 * Mount Langya - untagged dab page
 * Deaths in October 1966 - such lists are usually not sourced
 * I have looked through all 301 pages edited, and there are a number of false positives. I didn't list a few of them because they were the same issue (list, dab, etc) but a number had inline sources or a note at the bottom explaining the source and were still tagged. --DannyS712 (talk) 22:58, 30 March 2019 (UTC)
 * DannyS712, I have copied your comments as in-line annotations to the list. When I am done responding/fixing there I will ping.  --  Green  C  00:55, 31 March 2019 (UTC)
 * - inline response to above. Feel free to continue the inline discussions it helps me to keep it organized in one place/line. --  Green  C  02:10, 31 March 2019 (UTC)
 * Seen - I left a note on your talk page with a question. Thanks for pointing out the outlines and deaths issues. --DannyS712 (talk) 02:13, 31 March 2019 (UTC)
 * Answered there, and anything else let me know. Thanks for checking through these, great improvements. -- Green  C  16:47, 31 March 2019 (UTC)

Because of the number of false positives discovered by DannyS712, and new filters added, I think it would be a good idea to next test with a dry run ie. post a list of 300 like before, but it won't make the final step of adding the tag, only listing which articles it would tag. I'll start this process now and post results when ready. -- Green  C  14:33, 2 April 2019 (UTC)

Next round dry-run trial results to test the latest filters: User:GreenC/data/noref/trial April 3 for 300 -- Green  C  15:25, 3 April 2019 (UTC)
 * I checked all 300, added inline comments, and added new filters. Two unable to resolve: 1920 Toronto municipal election and Gaius (biblical figure) (might be OK to tag). -- Green  C  16:40, 3 April 2019 (UTC)

Started another 300. -- Green  C  16:47, 3 April 2019 (UTC)
 * I didn't get to it between the "another 300" and the next dry-run trial, but I analyzed the trial April 3 for 300 list too and added notes - see my notes there --DannyS712 (talk) 05:42, 4 April 2019 (UTC)

New trial results: User:GreenC/data/noref/trial April 4 for 300 -- Green  C  16:24, 4 April 2019 (UTC)
 * reviewed --DannyS712 (talk) 01:52, 5 April 2019 (UTC)
 * I saw your responses - I guess I would be more conservative with tags, but that's just my preference, and shouldn't be interpreted in my role as the closer of the discussion. --DannyS712 (talk) 01:59, 8 April 2019 (UTC)
 * DannyS712, I appreciate your help thus far; this is a difficult bot and a lot of work. I was assuming your involvement here was a personal interest. You have made a lot of good filter recommendations that have been implemented and have improved the bot. There are some I disagree with, and I would be happy to demonstrate any of those in two ways: by adding sources to them, and by showing other articles like them that have sources. There is also the problem that some of these can't be effectively filtered, so those might be a little less conservative with tagging, yet they can be justified as needing sources. -- Green  C  14:24, 8 April 2019 (UTC)

Next 300: User:GreenC/data/noref/trial April 8 for 300 -- Green  C  00:31, 9 April 2019 (UTC)
 * Danny checked already. Some new filters added. Number of problems are much less. Will start processing the next 300. I think we should consider tagging the previous 900 since they have been manually verified, minus the ones identified as a problem. It is about 9% complete out of 5.5 million. -- Green  C  00:31, 9 April 2019 (UTC)
 * if you want I can do that with AWB - I personally checked almost every page, and definitely the last 600 --DannyS712 (talk) 00:53, 10 April 2019 (UTC)
 * Sounds great. Thanks for the offer! There is also another 300 ready to post. Would you do all 1200? --  Green  C  14:57, 10 April 2019 (UTC)
 * Maybe at the end. User:GreenC/data/noref/trial April 4 for 300 --DannyS712 (talk) 20:51, 10 April 2019 (UTC)
 * ✅ --DannyS712 (talk) 21:12, 10 April 2019 (UTC)


 * Next 300: User:GreenC/data/noref/trial April 10 for 300 -- Green  C  15:02, 10 April 2019 (UTC)
 * Verified about 60%, skipped the obvious sports and music articles, found one problem with a dab page. --  Green  C  15:32, 10 April 2019 (UTC)
 * I assumed you did the first 60%, so I went through 200-300. In general, looks great! One wishy-washy kinda disambiguation page. One list article that should probably be skipped (or better yet, replaced by a category), and two lists that should probably have references. Thanks again for all the leg work you're doing on this !! Ajpolino (talk) 19:35, 10 April 2019 (UTC)
 * Taking stock, thinking the bot is running pretty clean now and maybe it's time to ping for BAG help. What do you think, or do you want to do more trial-trials? -- Green  C  22:10, 10 April 2019 (UTC)
 * I think it's good to go - the last run didn't have any true false positives --DannyS712 (talk) 22:21, 10 April 2019 (UTC)
 * Also I'm not going to tag manually the April 8 and 10 results - maybe have those as an extended trial that the bot actually edits, after implementing the filters, so that anyone who notices a reference we missed has a chance to speak up? --DannyS712 (talk) 22:22, 10 April 2019 (UTC)
 * Ok great. Ajpolino is probably offline but stated "generally looks great" so taking that as encouragement to keep going. Good idea re: having the bot do it. -- Green  C  23:28, 10 April 2019 (UTC)


 * BAGAssistanceNeeded .. the bot is ready to resume live trials. It might be the April 3, 8 & 10 trial-trials comprising ~900 articles (which have been manually verified but not edited yet) and/or new. --  Green  C  23:28, 10 April 2019 (UTC)
 * So sorry for the delay. Just to clarify: you would like a 900 edit extended trial? -- The SandDoctor Talk 17:15, 27 April 2019 (UTC)
 * that would be good. Even better could we make it 1,800? The 900 previously verified as being acceptable, plus another 900 in full-auto mode no pre-verification. -- Green  C  03:26, 28 April 2019 (UTC)
 * per your above comment. Please take all the time you need in completing this trial and post back here when done. -- The SandDoctor Talk 08:02, 28 April 2019 (UTC)
 * The following have been tagged with (except where noted with in-line comments). The "*" are the 900 approved for this trial:
 * User:GreenC/data/noref/trial March 28 for 300 (tagged March 28)
 * User:GreenC/data/noref/trial April 3 for 300 (tagged May 5) (*)
 * User:GreenC/data/noref/trial April 4 for 300 (tagged April 10 by DannyS712 with AWB)
 * User:GreenC/data/noref/trial April 8 for 300 (tagged May 6) (*)
 * User:GreenC/data/noref/trial April 10 for 300 (tagged May 6) (*)
 * The trial is approved for 1,800 total, it will begin working on another 900 which are not pre-screened like the above. -- Green  C  16:26, 7 May 2019 (UTC)

- the second part of the trial, the 900, is done. Which is 1,800 total. The problem is there was a bug in the program and the run ended up making 1,889 instead of 900 edits so the total for trial is 2,789 (1889+900). Since it takes days to find this many, I didn't notice it had gone past the 900 limit! Sorry about that, looking into why it didn't stop, but regardless there have been no complaints about the tagging so far. This bug won't matter in the final run, it will stop once it processes all articles. -- Green  C  18:05, 10 May 2019 (UTC)
 * Bug fix. Forgot to increment the counter when it found a match. -- Green  C  18:36, 10 May 2019 (UTC)
 * I'm not sure what the official protocol is here, but I spot checked about 150 of these and saw no problems (for others interested, it's easiest to find if you sort GreenC bot's contribution history from 8 May to 10 May). Ajpolino (talk) 05:33, 29 May 2019 (UTC)
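
The run-limit bug reported above (the counter was not incremented when a match was found, so the run went past its limit) can be illustrated with a short sketch. This is Python, not the bot's Awk code, and `run_trial`, `should_tag`, and the sample data are hypothetical names invented for the example.

```python
def run_trial(articles, should_tag, limit):
    """Collect articles to tag, stopping once `limit` matches are found."""
    tagged = []
    count = 0
    for title in articles:
        if count >= limit:
            break  # trial quota reached; stop scanning
        if should_tag(title):
            tagged.append(title)
            # Without this increment, count stays at 0 and the limit
            # check above never fires, which is the kind of bug described.
            count += 1
    return tagged


# Hypothetical data: "articles" are numbers, and even ones count as matches.
result = run_trial(range(10), lambda n: n % 2 == 0, limit=3)
print(result)  # [0, 2, 4]
```

Checking the limit against `len(tagged)` instead of a separate counter would remove the possibility of the two drifting apart.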

BAGAssistanceNeeded - How about we close the BRFA. The bot was first proposed in December (2018). It went through a major VP RFC and had community approval. It has gone through extensive testing and trials with multiple people looking at the results. There has been so much testing that the bot is 30% done (1.67m out of 5.7m articles scanned). The last trial results were posted almost 4 weeks ago with no response. Let's finish out the remaining 70%. -- Green  C  16:36, 2 June 2019 (UTC)
 * Didn't have time to do any checks before now, but from what I've seen above and through spot-checks everything looks good. Primefac (talk) 12:11, 15 June 2019 (UTC)
 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.