Wikipedia:Bots/Requests for approval/Femto Bot 7


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.

Femto Bot 7
Operator:

Time filed: 21:05, Sunday, January 17, 2016 (UTC)

Automatic, Supervised, or Manual: Automatic

Programming language(s): perl

Source code available: no

Function overview: Creates/updates lists of articles by location that require images.

Links to relevant discussions (where appropriate): Several-year-old request from User:WereSpielChequers.

Edit period(s): Will probably do a full scan monthly, might remove "fixed" items from the list daily.

Estimated number of pages affected: Initially 185 pages for UK articles (originally given as one). May well add more, as there seems to be a demand for this type of report.

Exclusion compliant (Yes/No): n/a (will only be editing its "own" pages).

Already has a bot flag (Yes/No): Bot flag needs re-enabling.

Function details: Produces a report of pages in a given geographical area which require images. Details may vary by region. For example the UK page will have OSGB grid squares, which make it easier to find Geograph images.
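
As a rough illustration of the grid-square labelling mentioned above, the following Python sketch derives the two-letter 100 km square from an OSGB36 easting and northing. It assumes the coordinates have already been converted from latitude/longitude (a separate, more involved step), and the function name `osgb_square` is hypothetical; the bot's own code is not published, so this is not necessarily how it works.

```python
import string

# The 25 letters used by the National Grid: the alphabet without "I".
GRID_LETTERS = string.ascii_uppercase.replace("I", "")


def osgb_square(easting: float, northing: float) -> str:
    """Return the two-letter 100 km grid square for OSGB36 metres.

    Hypothetical helper: assumes easting/northing are already in the
    National Grid and lie within its standard extent.
    """
    if not (0 <= easting < 700_000 and 0 <= northing < 1_300_000):
        raise ValueError("outside the standard National Grid extent")
    e100k = int(easting // 100_000)
    n100k = int(northing // 100_000)
    # Letters are laid out in 5x5 blocks, reading west to east and
    # north to south, so the row index is counted down from the top.
    row = 19 - n100k
    first = row - row % 5 + (e100k + 10) // 5
    second = (row * 5) % 25 + e100k % 5
    return GRID_LETTERS[first] + GRID_LETTERS[second]
```

For example, a point at easting 530000, northing 189000 (roughly the Haringey area used in the sample layout) falls in square TQ.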

UK example layout

 * A105 road in Haringey, England.
 * Asanté Academy of Chinese Medicine in Haringey, England.
 * Basketball at the 1948 Summer Olympics in Haringey, England.
 * Brownswood (ward) in Haringey, England.
 * Channing School in Haringey, England.
 * Chestnuts Park in Haringey, England.
 * Crouch Hill in Haringey, England.
 * East Warwick Reservoir in Haringey, England.
 * Getter's Talmud Torah in Haringey, England.
 * Greek Secondary School of London in Haringey, England.
 * Greig City Academy in Haringey, England.
 * Grove House School in Haringey, England.
 * Guy Chester Centre in Haringey, England.
 * Heartlands High School in Haringey, England.
 * High Maynard Reservoir in Haringey, England.
 * Highgate Wood Secondary School in Haringey, England.
 * Hornsey (parish) in Haringey, England.
 * Hornsey School for Girls in Haringey, England.

Discussion
Where is the report produced at? How specific are the regions, e.g. how many report pages do you expect to create?

Sounds like a straight-forward list-generating task that doesn't involve direct article edits. We can probably look at your sample report to see if there's any issues, though I can't foresee anything. — HELL KNOWZ  ▎TALK 22:40, 17 January 2016 (UTC)
 * I've yet to see how big the (UK) report will be, but I expect a few thousand entries, which would go on one page - probably in my user space while I am trialling. Later it may go in Wikipedia space, probably under the "Database reports" hierarchy.
 * All the best: Rich Farmbrough, 21:59, 18 January 2016 (UTC).


 * We could also do with a list for Ireland, and ideally split out lists for Wales and Scotland. The finer the geography the better the chance of getting wikiprojects involved. At the other end of the scale, if there are other parts of the world with as many images on Commons it would be good to get lists (the UK is 0.1% of the Earth's landmass and about 10% of Wikimedia Commons, so there can't be many places currently as well covered on Commons). But we might have photographers interested in lists of articles without images in an area they are or will be in.  Ϣere Spiel  Chequers  13:41, 23 January 2016 (UTC)


 * Any updates, Rich? — Earwig   talk  06:40, 31 January 2016 (UTC)
 * Yes I have, for example, this which is fairly raw still. All the best: Rich Farmbrough, 16:24, 1 February 2016 (UTC).


 * And here is the type of page generated for Aberdeen. All the best: Rich Farmbrough, 23:57, 18 February 2016 (UTC).


 * All the best: Rich Farmbrough, 23:59, 18 February 2016 (UTC).


 * How often is this going to edit, and how many pages per day? Ideally I'd like some assurance of hard maximums for now, given prior concerns from the community. -- slakr  \ talk / 04:34, 19 February 2016 (UTC)
 * I envision once per month, and possibly updates on demand; it will only be editing pages created by me to hold the results. As to the number of pages, it may be one or two (British Isles or UK and Ireland), one per country/territory (Ireland, Scotland, Wales, NI, IOM, Jersey, Guernsey) or one per primary subdivision for the UK and Ireland.
 * All the best: Rich Farmbrough, 11:52, 19 February 2016 (UTC).


 * Really useful reports, thanks Rich. As for updating the more frequently the better please, daily would be better than monthly. It is much easier to teach newbies to add images using this list if they don't have to be told to remove entries they have added images for.  Ϣere Spiel  Chequers  13:27, 19 February 2016 (UTC)
 * It's quite a nice tool for spotting errors too. I have recovered an academy from the sea, and moved an arboretum to its correct country already.  All the best: Rich Farmbrough, 13:52, 19 February 2016 (UTC).

User:Rich Farmbrough/temp138. Only surprise is that there are some empty subdivisions. All the best: Rich Farmbrough, 23:15, 21 February 2016 (UTC).


 * , ... any comments? All the best: Rich Farmbrough, 04:08, 25 February 2016 (UTC).

14:24, 25 February 2016 (UTC)
 * That was wayyy more than "one page." In fact, you've made thousands of edits, and the bot is still running and has made, in some cases, three edits to a single page in 24 hours. -- slakr  \ talk / 05:01, 25 February 2016 (UTC)
 * ...and whatever's going on here and here is clearly broken. -- slakr \ talk / 05:04, 25 February 2016 (UTC)
 * ...blocked for now. Normally I wouldn't be as itchy with the trigger finger on this (and userspace edits are normally fine), but:
 * There are bugs.
 * The bot continued editing after it was marked "trial complete."
 * Mismatch between number of edits bot was claimed to be making per day and number of pages ("once per month" and "it may be one or two").
 * Prior arbcom issues and sanctions related to automated edits.
 * Anyone is free to unblock at their discretion, but I don't personally feel comfortable with allowing this bot to constantly make edits over hundreds or thousands of pages, multiple times throughout the day, without some sort of clear boundary. The crossing of boundaries even within this BRFA is, for me, cause for concern, especially considering this is but one of hundreds of countries that this could happen on with this rather open-ended task description.
 * -- slakr \ talk / 05:36, 25 February 2016 (UTC)
 * It's a shame that you used the block button rather than using your words.
 * Your "thousands of edits to hundreds of pages" are 666 edits to 185 pages.
 * Since I am running in user space there is no need for me to stop developing the bot merely because the trial is complete. BAG's imprimatur is only needed if and when the reports are moved to Wikipedia: space.
 * Your reference to "clearly broken" behaviour is quite accurate; however, that diagnostic identified a difficulty in following redirects, a feature I have now successfully added.
 * Pretty clearly until the bot was run there was no way to know the size of the data set. It is too large for a single page.  Moreover the user requirements expressed by User:WereSpielChequers are for a more fine-grained solution.  Thirdly the finer granularity enables smaller deltas.
 * As for bugs there are currently three known expressions, two of which are probably caused by Femto Bot not having the bot flag, and hence being limited to the number of items returnable from an API call, and one which is of unknown cause, but probably due to an API call failing to serve a page. I was hoping to resolve the third today.
 * This task has already been held up for nearly four years by bureaucracy, it is not good to hold it up further. The task as it stands meets with the criteria of being harmless and useful.
 * All the best: Rich Farmbrough, 14:24, 25 February 2016 (UTC).
 * Oh yes and the spec is clear "might remove "fixed" items from the list daily". All the best: Rich Farmbrough, 14:42, 25 February 2016 (UTC).


 * I think the main issue is that you specified "one page", the trial was for "UK report" (singular) without daily fixes and then you confirmed "one page", but we have 183 pages and many more edits. You did not communicate this, you simply went ahead with it. I gave you a quick trial to let you produce a report so we can focus on the merits of the task and not the operator. I would have approved the task as clearly useful for singular per-country pages with reasonable edit intervals. With ArbCom sanctions, I would have expected you to be triple-careful and I hoped you'd show us such diligence. But you have significantly miscommunicated and over-edited the trial. If such a trial is indication of future past approval, I cannot approve or endorse it. — HELL KNOWZ  ▎TALK 15:25, 25 February 2016 (UTC)
 * In response to Slakr's query I said


 * I don't think this could have been clearer. The time to have made objections would have been in response to that statement.
 * All the best: Rich Farmbrough, 18:03, 25 February 2016 (UTC).


 * Well I think my request "We could also do with a list for Ireland, and ideally split out lists for Wales and Scotland. The finer the geography the better the chance of getting wikiprojects involved." was even clearer. Note nobody including or  have explained why they want this bot to be less useful.  Ϣere  Spiel  Chequers  23:02, 25 February 2016 (UTC)

I'm fine with the split pages, but up to 185 edits per day every day seems excessive for a simple reporting task. What happens if (when) we extend it to multiple countries? Thousands of edits per day? The numbers don't "feel" right; the practical number of edits must really be much smaller, given that most imageless pages aren't going to have an image added every day. I am not clear from the above explanation why so many edits were made after the initial dumps—whether due to testing or bugs or intended behavior—and whether this frequency will decrease over time. I also think this sort of report ideally belongs off-wiki (e.g., Labs), updated much more frequently (hourly?) without extra resources spent storing historical data, but... — Earwig   talk  02:54, 26 February 2016 (UTC)


 * While I concur with what both and  have said, I don't have a truly serious issue with a bot task that generates these reports periodically, provided there are sane, appropriate, hard limits: limits that I still have yet to be assured of.  On a related note, if you need a consistent, constantly-updated, live view of a data set (especially if it's only for one or two people who've thus far claimed to need it), hundreds to thousands of edits per day is likely the Wrong Way(tm) from a technical standpoint.  However, even ignoring that (after all, we work with what we have within reason, and we've had, and still do have, bots that do stuff that populate templates several times throughout the day), the issue this all comes back to is one of communication.  While I can see that you clearly intended "daily" to mean "many throughout the day, however often I want," you'll have to understand that in context, I, personally, did not interpret "daily" to mean that, and clearly others shared in that misinterpretation.
 * Furthermore, while I appreciate that you have now updated the function details of the bot to reflect a two-orders-of-magnitude difference from the original page count, the bot's trial was approved with the phrase "initially one" in mind. That's not to say it's in any way inexorable (scope changes frequently happen during these things), but usually it's the result of a conversation, not "SURPRISE! I have altered the deal; pray I do not alter it further."  Just because you unilaterally decide on a scope increase doesn't mean that we are at fault for somehow failing to "get the memo."  Don't get me wrong; it's sorta like how I'd love to notify my bank that when I said I'd pay the loan monthly, I really meant a few times each year. I'm guessing it wouldn't go over so well.
 * Anyway, I guess we could continue talking about the "why" or we could move on to addressing the concerns that were raised (e.g., bugs, edit rate concerns, long-term technical suitability). To start, I'd suggest clearly stating the upper limits of this task in a manner that doesn't cut you a blank check and picking an update frequency that's both in-line with the other database reports and also in line with the realistic level of urgency for the task.
 * -- slakr \ talk / 06:49, 26 February 2016 (UTC)
 * Hi Slakr, there was a conversation, and this wasn't Rich springing the frequency on us as a surprise. I requested that he create multiple pages and subdivide them so we can offer them to wikiprojects. Rich responded to that request. So may I suggest you strike your "SURPRISE! I have altered the deal; pray I do not alter it further. Just because you unilaterally decide on a scope increase" comment. As for how many people will use them, that depends on their success either as a newbie training exercise or as an additional report for wikiprojects, if either takes off I'm confident this will have rather more than one or two users.  Ϣere Spiel  Chequers  08:49, 26 February 2016 (UTC)
 * I understand that you feel that I should have expressed myself in a manner more suitable to your liking, however I feel it's a hyperbolic reference to ostensibly the best Star Wars movie, said by a sort of tragic hero who, deep down, is good, a genius, but has been vilified or even feared by people who didn't understand him (and with whom he becomes frustrated). That's where any sort of semblance ends, however, because far from labelling him a villain, I actually support Rich in his efforts to improve stuff and laud his patience with the community over time.  In fact, I could have sworn that I posted favorable words somewhere back when people were angry with him about using find-replace and whether that even counted as automation.  Automation's a great thing when it's properly thought through, but it can still be frustrating to deal with people dealing with automation, especially when it malfunctions.  Even worse, the best of intentions can come off to others as something negative. That's part of the reason we're here: to try to keep unnecessary drama from happening again.  I do understand that you supported his changes, but they were still made without actual approval. Others disagreed: more specifically, the people who fairly regularly volunteer to sanity-check bots for technical issues and their impact.  Still, for what it's worth, normally I wouldn't even be saying this (WP:BURO and all), but in this particular case, there are prior issues combined with community expectations that result in what I feel are valid justifications for significantly heightened scrutiny. -- slakr  \ talk / 09:43, 26 February 2016 (UTC)
 * I haven't seen that movie. All the best: Rich Farmbrough, 11:36, 26 February 2016 (UTC).


 * Hi Slakr, leaving aside which film(s) people may or may not be alluding to, you have accused Rich of making changes without prior discussion, and I have pointed to where that discussion took place with me publicly requesting things, Rich responding and no-one objecting. This is not just me supporting Rich, this is me requesting that he do something and him agreeing without public objection.  Ϣere  Spiel  Chequers  18:40, 26 February 2016 (UTC)

You are correct that the number of changes daily is expected to be small or indeed often zero (out of a data set this size). However there was a constant flux of items between the two sets, it was therefore important to identify why this was happening. The reason is, as I surmised above, that the API is returning a blank response (or possibly no response, or a very delayed response) in about 10 cases per 100,000 (this is distinct from the "missing" response). Obviously these items were matching the criteria for "no image", and, if they had images, resulting in two additional edits, one to add to the report and one, next time they were correctly interrogated, to be removed again. It should be stressed that without additional testing I would not have been/will not be able to: All the best: Rich Farmbrough, 11:36, 26 February 2016 (UTC).
 * 1) Come up with any estimate of the number of daily changes
 * 2) Identify the issues with the server responses
 * 3) Test bugfixes related to API calls
 * 4) Update the pages to reflect the current state of the  bot.
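
The blank-response problem described above can be sketched as follows. This is only an illustration under stated assumptions: it supposes a single-page `prop=images` query returning the classic JSON layout (pages keyed by page id), and the names `Result` and `classify` are hypothetical. The point is that an empty or unparseable reply is kept apart from a genuine "no image" answer, so it cannot push a page onto a report by mistake.

```python
import json
from enum import Enum


class Result(Enum):
    HAS_IMAGE = "has_image"
    NO_IMAGE = "no_image"
    MISSING = "missing"    # the API explicitly says the page does not exist
    UNKNOWN = "unknown"    # blank or unparseable reply: ask again later


def classify(body: str) -> Result:
    """Classify one raw API reply for a single page (assumed JSON shape)."""
    if not body or not body.strip():
        return Result.UNKNOWN
    try:
        pages = json.loads(body)["query"]["pages"]
    except (ValueError, KeyError, TypeError):
        return Result.UNKNOWN
    if not isinstance(pages, dict) or not pages:
        return Result.UNKNOWN
    page = next(iter(pages.values()))
    if not isinstance(page, dict):
        return Result.UNKNOWN
    if "missing" in page:
        return Result.MISSING
    return Result.HAS_IMAGE if page.get("images") else Result.NO_IMAGE
```

A caller would leave the report untouched for any page classified as `UNKNOWN` and query it again on the next run, which avoids the add-then-remove pair of edits.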

I understand you want "hard limits", however in the process of bedding in even indicative limits are hard to provide. Or to be more accurate, hard limits are not useful, though easy to provide (if someone adds an image to each group each day then there will be 187 edits per day). More realistically I would expect a trickle of edits, maybe one or two per day if that, and a relatively small number of updates at month end. To get a feel for these figures I need to put in a datum (which is partly why I have been updating the pages when I discover an issue) and run the same code over a period. I don't even know yet if a daily run is worth the effort, or if a weekly run might be better.

In terms of other territories, I intend to cover Ireland, the Isle of Man, Jersey and Guernsey as part of this regular update. The only other part of the British Isles excluded is square MX, which includes Rockall (and which I could easily, and maybe will add).

The motivation for keeping these pages updated is newbie-friendliness - they are specifically targeted as an introductory editathon task. For other territories this may not be the case, (certainly the ready-made store of CC-BY images at Geograph does not exist) and indeed there is scope for moving to an "on request" production.

All the best: Rich Farmbrough, 15:26, 26 February 2016 (UTC).


 * I'm willing to forgive the earlier incident—while unusual and inadvisable, it wasn't actively harmful—and this task is minor enough that it shouldn't be jeopardized by that. As long as we operate within reasonable bounds, I think we can move forward. Rich, can't you work through the bugfixes and let it run locally for a bit to estimate the expected edit frequency? —  Earwig   talk 00:00, 27 February 2016 (UTC)
 * I have been running locally, to pick up stuff like blank API results. And of course there was a deal of work to transform co-ordinates to OSGB and establish the administrative division places lie in.  But ultimately it needs to write to a MediaWiki wiki running the same version as en:WP. Only that can really test the updating functionality.  I have been conservative with my runs, even though they are in my user-space.  Most of them are selective updates, including three single page updates. See log.
 * As far as bug expressions are concerned the biggest is caused by FB not having a bot flag and not being able to get full results from the API.
 * All the best: Rich Farmbrough, 02:14, 27 February 2016 (UTC).


 * A few things:
 * The API is accessible to unflagged bots; you just might have to make a few extra requests (as opposed to getting 500-5000 results all in one request). That's why generators have been added that essentially retrieve 'n+1' results and hint at where you continue from.  For example, this query is self-limited to 4 results at a time, but by passing the generator continuation parameters it gives you back, this followup query gives you the next 4.  You can max out that limit to 500 (or I think "max" nowadays, if you want) and just keep getting pages and pages of data way beyond 500.  If that's not sufficient, which queries are you using?  Do you happen to have examples? Maybe we can help out.
 * Have you considered using the production database replicas available to tool labs users? It also helps to avoid hammering the api servers.  It's the preferred way of doing big or complex queries.  Check out wikitech:Help:MySQL queries if you want a few examples of raw SQL queries if you're unaccustomed to dealing with the mw schema.
 * Perl allows you to echo data to a terminal or log file or simulate dry runs without editing the site live. To say it's impossible to commit to hard limits or predict the behavior of your bot without making live edits for something as straightforward as this is dubious at best.
 * -- slakr \ talk / 03:16, 27 February 2016 (UTC)
 * Yes I know about the max and generators. Currently I am hitting the limit on probably less than one percent of queries.  I monitor this number and it seems likely that with the bot limit this will not occur.  I do not wish to add unnecessary code.
 * Yes I have considered using tool labs, and maybe I will in the future. I don't like the insistence that one's code must be given away, it seems both bureaucratic and totalitarian.
 * These are two completely different issues.
 * Hard limits: If WereSpielChequers has a wildly successful editathon, or if we have a "Wiki Loves British Places" month, then many pages can have images added. Conversely, if someone discovers that several thousand photographs have been stolen from a copyright collection, that could also trigger a wave of changes.  Of course if you want to create an artificial hard limit, say a maximum of five changes per day, that could be done, though it seems boneheaded.
 * Emulation working.
 * I do use local copies. But a utf8 file is not the same format as a url.  Moreover the wiki is the ideal vessel not only for ensuring that we are looking at the actual results, as uploaded by the program and as processed by the API, but it is perfect for showing which subdivisions have changed,  by looking at contribs, and which  items have been added or removed using diffs.  To give another example a red-link in one of these reports is a red flag.  On a local copy it says nothing.
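
The continuation mechanism raised in the exchange above can be sketched in a few lines. This is a minimal illustration, not the bot's code: `query_all` and the injected `fetch` callable are hypothetical names, and the reply layout (a `continue` object whose values are merged into the next request, with pages keyed by page id) is assumed to be the classic JSON format of the MediaWiki Action API.

```python
from typing import Callable, Dict, Iterator


def query_all(fetch: Callable[[Dict[str, str]], dict],
              params: Dict[str, str]) -> Iterator[dict]:
    """Yield every page of a generator query, following continuation.

    `fetch` is a caller-supplied function that sends one API request
    and returns the decoded JSON reply (hypothetical interface).
    """
    request = dict(params)
    while True:
        reply = fetch(request)
        # Assumed layout: reply["query"]["pages"] is a dict keyed by page id.
        yield from reply.get("query", {}).get("pages", {}).values()
        cont = reply.get("continue")
        if not cont:
            break
        # Start again from the original parameters and add whatever
        # continuation values the API handed back.
        request = {**params, **cont}
```

Because the loop keeps asking until no `continue` object is returned, the per-request limit only affects how many requests are made, not how many results are eventually seen.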


 * Having said all that I do indeed have some statistics:
 * I compared the output of two runs one completing on 26/02/2016 at 13:12 and one completing on 27/02/2016 at 16:45. The number of items different was one, Sulby Reservoir, which has had an image added (by me, as it happens).
 * On the basis of this I think that your fears of the servers being overwhelmed by a tidal wave of edits can be assuaged somewhat.
 * All the best: Rich Farmbrough, 02:06, 28 February 2016 (UTC).

(and see above). I had a look at the Recent Deaths report, this has been updated 3 times in the same day. FemtoBot 7 will not run more than once per day, so it won't do that.

I think the issue here is conflating testing frequency with running frequency.

All the best: Rich Farmbrough, 01:40, 29 February 2016 (UTC).

(Should have been) removed from lists

 * 1) ISPS Handa Ladies European Masters in Buckinghamshire, England.
 * 2) Leusdon in Devon, England.
 * 3) Little Totham in Essex, England.
 * 4) Frongoch internment camp in Gwynedd, Wales.
 * 5) Queen Elizabeth II Hospital in Hertfordshire, England.
 * 6) Old Dalby Test Track in Nottinghamshire, England.
 * 7) Sulby Reservoir in unknown subdivision.


 * Reasons
 * 1) Logo added.
 * 2) Geograph image added.
 * 3) Geograph image added.
 * 4) Image added.
 * 5) Image added.
 * 6) 2 images added.
 * 7) Geograph image added.

(Should have been) added to lists

 * Ravens Wood School in or near Kent, England.
 * Cecil Jones Academy in Southend-on-Sea, England.
 * Harris Academy Purley in Croydon, England.
 * Pennyhill Park Hotel in Surrey, England.

 * Reasons
 * In each of these cases a logo was removed. In the first three it should probably have been moved from Commons to en:wp.

Clearly the indication here is that because 11 changes were made over 7 days, the high end of my estimate of one or two changes a day is pretty accurate.
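
The arithmetic behind that estimate can be reproduced with a small set comparison. The sketch below is illustrative only: `compare_runs` is a hypothetical name, and just the changed titles listed above are included, since unchanged entries cancel out in the set differences.

```python
from typing import Set, Tuple


def compare_runs(previous: Set[str], current: Set[str],
                 days: float) -> Tuple[Set[str], Set[str], float]:
    """Return (removed, added, changes per day) between two report runs."""
    removed = previous - current   # gained an image, so dropped from the lists
    added = current - previous     # lost an image, so added to the lists
    return removed, added, (len(removed) + len(added)) / days


# Titles taken from the two lists above.
previous_run = {
    "ISPS Handa Ladies European Masters", "Leusdon", "Little Totham",
    "Frongoch internment camp", "Queen Elizabeth II Hospital",
    "Old Dalby Test Track", "Sulby Reservoir",
}
current_run = {
    "Ravens Wood School", "Cecil Jones Academy",
    "Harris Academy Purley", "Pennyhill Park Hotel",
}

removed, added, per_day = compare_runs(previous_run, current_run, days=7)
print(len(removed), len(added), round(per_day, 2))
```

Seven removals plus four additions over seven days gives 11/7, or about 1.57 changes per day.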

It would of course be possible to code a hard limit on the number of updates per day. If someone can present cogent reason to knowingly leave inaccuracies I would be prepared to consider it.

All the best: Rich Farmbrough, 22:16, 3 March 2016 (UTC).


 * This seems reasonable to me. ? —  Earwig   talk 03:08, 4 March 2016 (UTC)


 * I think a hard maximum of 1 update per page per day is reasonable. I don't, however, feel that this task's run frequency is in any way on par with Recent Deaths, clearly for obvious reasons (e.g., current events, blp, impending influx of edits and editors).  That said, if Rich can demonstrate emphatic consensus from numerous editors that those reports absolutely need to be updated constantly throughout the day, I'd be willing to re-examine the merits, but as of now, I see, if anything, the exact opposite: that the lack of substantial justification means a lower edit rate is preferable (despite it being less OCD-friendly). -- slakr  \ talk / 04:18, 4 March 2016 (UTC)


 * I hope we can move forward here. Should we do another trial to check that the daily updating is as we expect it to be? No more than 1 update/page/day should be fine. —  Earwig   talk 19:45, 14 March 2016 (UTC)
 * I have no objections. =) -- slakr \ talk / 00:51, 19 March 2016 (UTC)
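
A limit of one update per page per day, as agreed above, could be enforced with a guard along the following lines. This is a sketch under assumptions rather than the bot's implementation: the class name `DailyPageLimiter` is hypothetical, days are taken to be UTC calendar dates supplied by the caller, and the state lives only in memory (a real bot would need to persist it or consult the page history).

```python
from datetime import date
from typing import Dict


class DailyPageLimiter:
    """Allow at most one edit per page per calendar day."""

    def __init__(self) -> None:
        # Maps page title to the date of its most recent permitted edit.
        self._last_edit: Dict[str, date] = {}

    def try_edit(self, page: str, today: date) -> bool:
        """Return True and record the edit if the page is still unedited today."""
        if self._last_edit.get(page) == today:
            return False
        self._last_edit[page] = today
        return True
```

The bot would call `try_edit` immediately before saving a report page and skip the save when it returns `False`, leaving the change to be picked up by the next day's run.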

Trial 2
—  Earwig   talk 03:02, 19 March 2016 (UTC) - the updates over the last 6 days have averaged about 1/50 of an edit per page per day. All the best: Rich Farmbrough, 01:13, 27 March 2016 (UTC).


 * All the best: Rich Farmbrough, 01:17, 29 March 2016 (UTC).


 * BAGAssistanceNeeded All the best: Rich Farmbrough, 23:49, 2 April 2016 (UTC).


 * —  Earwig   talk 20:53, 3 April 2016 (UTC)
 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.