Wikipedia talk:Bot Approvals Group/Guide

Creation of the guide
I'm going to put this out of chronological order, because it'll make more sense in the archives later on. I'm going to ping all active and semi-active BAG members to get their input on this guide.

This way no one will feel left out, and we'll get more feedback and ideas from people. I'm wondering if we should post notices on WP:BON/WT:BOTPOL too for transparency's sake? Headbomb {talk / contribs / physics / books} 18:26, 12 February 2017 (UTC)

Ramp up vs extended
I feel we are splitting hairs a bit trying to differentiate ramp-up style trials from just the extended ones. Ramp-up is just an extended trial, when done in multiple steps. It's a neat way to phrase it and a common occurrence for complex impactful tasks, but do we really need the instruction creep? What about trials that last for several weeks if not months, because there are so few pages or edits are deadline-based? Or trials where the bot edits a small subset of pages, but for several days or weeks? Also, an extended trial happens after a regular trial -- it's not a "longer trial", it's an "additional trial". A regular trial can be very long and an extended trial could be super-short to verify some issue that occurred in the first trial (that was only caught because the trial was long enough). How would that fit the guide? I like the idea, but I feel we shouldn't stray too far into formalizing any "kinds" of trials. Perhaps as additional notes and common practice examples.

P.S. It would be interesting to gather some stats on BRFAs -- trial count, conditions, any extensions, participants, etc. — HELL KNOWZ  ▎TALK 12:31, 12 February 2017 (UTC)


 * Updated the table with that feedback in mind. Better? Headbomb {talk / contribs / physics / books} 14:49, 12 February 2017 (UTC)


 * I think so. It doesn't imply trial and ramp-up are wholly different, which was my main issue. — HELL KNOWZ  ▎TALK 14:57, 12 February 2017 (UTC)


 * I would disagree there. An extended trial is one that is used when the first small trial is complete, but then another one has to be done. At this point, an extended trial means that the bot is not approved and is in more testing to prove itself. A ramp-up is used when the bot is approved, but the BAG are cautious about a large number of edits - so place restrictions on the number of edits for the first few days. This lets any minor technical glitches be fixed. TheMagikCow (talk) 15:11, 18 February 2017 (UTC)


 * To be honest, that's a question for as he put the table together. In practice, we have trials and extended trials before approval. After approval, I've seen very few BRFAs with extra conditions. Only recently,  has approved a bot some 5 times with a "ramp up" schedule. I'm not even sure the "ramp up" in the trial table is the same "ramp up" as after approval. This isn't standard practice, whatever that means. (Personally, I think these should be part of trial before approval; see this discussion.)  —  HELL KNOWZ  ▎TALK 15:30, 18 February 2017 (UTC)
 * I've done it both ways - with extended trials with phases, and an initial throttle requirement. I think that approver discretion should be available. —  xaosflux  Talk 15:35, 18 February 2017 (UTC)

Well, initially I thought a ramp up approval would make sense in a "yeah seems fine, but just in case, have a ramp up rollout", but Hellknowz et al have made, I believe, a compelling case for having ramp up trials instead. So as far as best practices are concerned, ramp up as part of the trial makes more sense, since this implies the technical review, or the consensus gathering, isn't quite over. Headbomb {talk / contribs / physics / books} 15:42, 18 February 2017 (UTC)

Consensus before trial
"If consensus has been demonstrated, or can reasonably presumed, BAG members have the discretion to allow the proposed bot to undergo trial to judge its technical soundness."

I feel this is too strongly worded for when consensus may not be clear or be WP:SILENCE. A short trial for technical verification, or to even garner further input, should be acceptable before consensus is reached. Leaving a BRFA open after trial to gather input is pretty standard. We've had editors unclear on this before, believing that bots were being "approved" without consensus when they were simply trialed. I think it should make it clear that anything less than approval isn't approval and that a trial does not imply eventual approval. BAG should take care not to mislead the botop to code and run a task that they don't think will have consensus. But sometimes it's inevitable that issues are discovered and wider consensus is requested only once the trial runs. — HELL KNOWZ  ▎TALK 12:44, 12 February 2017 (UTC)


 * Not quite sure what exactly the issue with the current wording is here, especially since the next sentence is "Trials can also be used to help determine consensus if relevant communities have been notified, but failed to engage in dialogue after a reasonable amount of time has elapsed.". Headbomb {talk / contribs / physics / books} 14:45, 12 February 2017 (UTC)


 * Maybe I'm reading too much into it. It sounds a bit like it precludes trials for technical tasks without clear consensus. — HELL KNOWZ  ▎TALK 14:59, 12 February 2017 (UTC)


 * What I want to say by "if consensus has been demonstrated, or can reasonably be presumed" is that consensus has either been shown to exist, or that a reasonable person would agree that the task is likely to be supported by the community (e.g. the task isn't "Have all non-free images link to the GNU Manifesto"), then trials can be granted. No one would grant a trial for that short of having an RFC conclude that this is something desired. Likewise for a typo bot, and other bots that are unlikely to achieve consensus.


 * I'm open to better phrasings though. Headbomb {talk / contribs / physics / books} 15:30, 12 February 2017 (UTC)


 * The typical case is where consensus is still being built (discussions started), but it is likely to form, yet we don't want to delay the trial needlessly, especially when the edits are complex and would better demonstrate what the bot is trying to achieve to participants. In other words -- a (small, first) trial is part of the discussion, demonstration and consensus, rather than only a result of the discussion and consensus. — HELL KNOWZ  ▎TALK 15:58, 12 February 2017 (UTC)


 * Regarding technical edits for demonstration purposes etc, these are often needed to be able to demonstrate what some of the more complex BRFAs are even going to do - I think we should be more strongly encouraging BRFA applications to include something like "10 demonstration edits made under the operator's account". —  xaosflux  Talk 16:05, 12 February 2017 (UTC)
 * Unless the task needs some right that the bot account doesn't have, that's as likely to be more annoying than just making the 10 demonstration edits under the bot's account since it requires putting the user-account credentials into the bot code which may include having to set up OAuth or BotPasswords for the user account. Anomie⚔ 16:37, 12 February 2017 (UTC)
 * I'm ok with it being under the bot account too - as long as they file the BRFA and clearly indicate it is for that specific BRFA. —  xaosflux  Talk 19:13, 12 February 2017 (UTC)
 * That's OK too (I was thinking of this for new operators that don't yet have bots). — xaosflux  Talk 03:31, 25 February 2017 (UTC)

Require bot operators to self-report expected bot cost in human time
Bot designers can intend for bots to solicit and consume more human time or less human time. All things being equal, bots should consume less human time, be more discreet, and give priority to human activity over bot activity.

At Wikipedia_talk:Bot_policy I am in a conversation which talks this through. The change that I would like is for the bot review process to require bot operators to self-report a likely minimum and likely maximum amount of human time which their bot will consume. I am not advocating for a particular cut off, but in general, a bot which does a high value activity and consumes less human time is better than a bot which does a lower value activity and more human time. For consumption of human labor to be part of the discussion we need a measurement of this, which is challenging, but I think that operator self-reporting during the approval process is a good place to start.

A common response which I hear to this proposal is "It is hard to measure how much human time a bot consumes, therefore by default we should assume that all bots consume zero human time and human time costs should not be a factor in considering the value of a bot." I want to push back against this perspective. I want to avoid any administrative burden on anyone, but as bots do more in wiki, we should establish some community norms on how much human time bots solicit.  Blue Rasberry  (talk)  15:55, 21 February 2018 (UTC)
 * what I'm not seeing so far is a lot of people agreeing with you for creating a brightline rule of any sort, at least not yet. Interesting conversation though - and # of edits per namespace estimates are certainly something that can be asked of operators.  I'm not sold on the "solicit" component - if a bot makes talk notes of an "informational" nature, as opposed to also including an (optional) call to action (that is the "solicitation" component) it is still creating just as many pages.  Bots making lots of new pages, or perhaps expanding to "new talk sections" in general can also help trigger a "broader input" period for sure. —  xaosflux  Talk 16:04, 21 February 2018 (UTC)
 * I confirm that I have found no one who agrees with what I am proposing and that I am taking a fringe position. The mainstream opinion at this time is that bots will not consume human time in any way worth mentioning, measuring, or considering. Although I think this issue is urgent to address now and a major concern, and I think that the popular thought is out of touch with reality, I also recognize that the Wiki community process goes at its own pace and no individual can force an issue. I am the odd one here and most probably that means that I am mistaken, incorrect, misinformed, and wrong. I often am about these issues. I appreciate anyone who considers the issue regardless of the extent to which they agree or disagree.  Blue Rasberry   (talk)  16:37, 21 February 2018 (UTC)
 * "easy wins" are asking operators to give edit estimates per namespace, and using those to trigger the existing "broader advertisement" type requests (e.g. Village Pump advertisements for comment) - will this help your concern at all? — xaosflux  Talk 16:48, 21 February 2018 (UTC)
 * I will take what I can get, but I would like to request a time estimate. Some bots by design have the bot communicate requests for human labor, and some bots by design only make brief logs of what they did and only ask that anyone respond to complain or contest the activity. I want to make the review process distinguish the bots which have inherent design features to solicit large amounts of human time and labor versus those bots which by design both do not make appeals for human time and labor and are unlikely to attract large amounts of human time and labor. Edit counts per namespace are useful but a large number of edits which do not solicit human response are different from large numbers of friendly-seeming messages each of which is making a request for a human time investment and interaction. I want to flag the time consuming activity.  Blue Rasberry   (talk)  18:26, 21 February 2018 (UTC)
 * But the review process already does that and this was already flagged. That's why we ask bot operators to describe the bot task, and use the discussion to clarify things. What we have here is a case of one bot taking a 'large' amount of human review time via talk page messages (which was clearly known at the time of approval, "If the bot makes any changes to the page, a talk page notice is placed alerting the editors there that Cyberbot has tinkered with a ref." with an estimated number of pages affected ranging from 130,000 to 1,000,000). And while it had consensus to do that before, now consensus is against that, and accordingly it stopped posting such messages. Headbomb {t · c · p · b} 14:20, 22 February 2018 (UTC)

And concerning "It is hard to measure how much human time a bot consumes, therefore by default we should assume that all bots consume zero human time and human time costs should not be a factor in considering the value of a bot.", no one has said that. What was said is that how much human time a bot consumes can't be measured, and trying to come up with estimates of that doesn't yield any insight on whether or not a task should be done. It's not that we assume such time is zero (it clearly isn't), it's that having a number for this (e.g. this bot task is estimated to require 100 person-hours out of volunteers) doesn't help make decisions in any way. People will bicker about whether something is 50 person-hours, 100 person-hours, 1000 person-hours, waste time on refining the estimate to get more precise numbers, come up with various scenarios yielding different estimates, ... for what is essentially a completely useless number. Headbomb {t · c · p · b} 14:27, 22 February 2018 (UTC)
 * I am unable to understand your perspective or how you could think this way. I am unable to recognize any understanding that you have of what I am saying. I cannot understand your emphasis on how "the review process already does that and this was already flagged" when it seems likely that I have not communicated any of this effectively to you. Would any of the following constitute a measurement?
 * Someone records a screenshare of a user fulfilling one of the requests made by the bot. They do it after practice. The time that it takes to perform the activity which this bot requested forms a basis for measuring how much time that a bot requests.
 * Someone records a screenshare of a user looking for the first time at the talk page where a bot request is posted. They do it having never seen such a bot request. The time that it takes them to report understanding of the bot request forms a basis for measuring how much time the bot consumes of anyone, experienced Wikipedian or new, to understand the bot request on first sight.
 * Someone records a screenshare of an experienced Wikipedian scrolling past the bot request in an attempt to ignore it. The time that it takes to avoid the message forms a basis for measuring how much time that the bot consumes by people actively seeking to avoid it.
 * After taking these measurements we multiply it by the number of spam posts, page views to the articles in the time range when the spam post was active, then use this to estimate minimal, maximal, and likely times which this activity will consume.
 * I am baffled at how you could dismiss user time as immeasurable or useless to know. I take it as given that if a bot could be designed to take more human time or less human time then it should consume less time. In order to minimize time consumption, we should know roughly how much time a bot process takes. The range of the estimate might be 100-1000 hours, but in the case of IABot, the discussed range seems to be 0 hours - 10,000 hours, and I think we can do better for the future. If someone proposes a bot with intent that it will consume 10,000 hours there should be no misunderstanding that it will take 0 human time.
 * I hope that you are enjoying this conversation and will continue to participate for so long as it seems productive to you.  Blue Rasberry   (talk)  16:58, 3 March 2018 (UTC)
 * Honestly, this discussion seems like a waste of time looking for an end result that only wastes other people's time even more. We aren't business analysts, this isn't a six sigma process, and we aren't budgeting for people's time or the WMF's money. Additionally, I sure as hell won't be coming up with estimates on how much time I use or my bot uses of others just for the sake of having some arbitrary time measurement. The scope of the bot is considered at the time the BRFA is filed, so there is no benefit to what you are seeking. It is simply an unnecessary and incalculable constraint. Nihlus  17:41, 3 March 2018 (UTC)
 * sums it up pretty well. I asked for examples of where this would make an actual difference in the approvals process, and you didn't offer anything. You have a solution looking for a problem. Headbomb {t · c · p · b} 20:44, 3 March 2018 (UTC)