Wikipedia:Bots/Requests for approval/JCW-CleanerBot


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was

JCW-CleanerBot
Operator:

Time filed: 16:20, Thursday, August 17, 2017 (UTC)

Automatic, Supervised, or Manual: Sometimes automatic, sometimes semi-automated

Programming language(s): WP:AWB

Source code available: Simple regexes

Function overview: To properly format journal names on Wikipedia, with AWB general fixes as a side-benefit. For the exact details, see User:JCW-CleanerBot.

Links to relevant discussions (where appropriate):

Edit period(s): After data dumps

Estimated number of pages affected: 10,000 ish?

Namespace(s): Main space mostly. Occasionally in other spaces.

Exclusion compliant (Yes/No): Yes

Function details:
 * Note: For the exact details, see User:JCW-CleanerBot.

The idea is to look in WP:JCW, and find common misspellings of journal names and abbreviations. I plan on doing edits such as
 * Closing entries with exposed '(journal)' disambiguators.
 * Fixing bad abbreviations.
 * Fixing bad journal names such as Animal Behavior → Animal Behaviour
 * Fixing non-standard pseudo-ISO 4 abbreviations to correct ISO 4 abbreviations ( →  )
 * This excludes pseudo-ISO 4 abbreviations which may be standard in a field, such as . The bot will leave those alone.
 * Fixing capitalization mistakes, such as  →
 * And other fixes of similar nature.
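
As an illustration of the first item, a minimal Python sketch follows. It assumes that "closing" an exposed disambiguator means piping the link so that only the journal name is displayed. The bot itself runs on AWB; the pattern, the function name and the sample citation here are hypothetical and are not the bot's actual rules.

```python
import re

# Assumption: only unpiped links of the form [[Name (journal)]] inside a
# |journal= parameter are targeted. Links that are already piped contain
# a "|" before the closing brackets and therefore do not match.
EXPOSED_DISAMBIGUATOR = re.compile(
    r"(\|\s*journal\s*=\s*)\[\[([^\[\]|]+?) \(journal\)\]\]"
)


def pipe_exposed_disambiguators(wikitext):
    """Rewrite [[Name (journal)]] as [[Name (journal)|Name]] in |journal=."""
    return EXPOSED_DISAMBIGUATOR.sub(r"\1[[\2 (journal)|\2]]", wikitext)


if __name__ == "__main__":
    sample = "{{cite journal |title=X |journal=[[Nature (journal)]] |year=2001}}"
    print(pipe_exposed_disambiguators(sample))
```

Because a piped link no longer matches the pattern, running the function a second time leaves the text unchanged.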

The bot will mostly operate on the journal parameter of citation templates, but will sometimes clean up non-journal entries when it can guarantee that it will not screw things up, for example by removing dots in places where dots shouldn't be removed ( →  ), or by changing the capitalization of things that should not be changed (  →  ).

I'll be performing database scans and developing regexes tailored to each misspelling. Sure-shot ones will be performed automatically; tricky ones will be performed semi-automatically. I have done this for years on my main account, but have often left the big ones with hundreds of instances because I couldn't be bothered to mindlessly click 'save' 200 times. Headbomb {t · c · p · b} 16:20, 17 August 2017 (UTC)
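
A rough Python sketch of what one "sure-shot" rule could look like is given here. It is not the bot's AWB configuration: the table `JOURNAL_FIXES`, the pattern and the function name are assumptions made for illustration, and only the Animal Behavior → Animal Behaviour pair comes from the request above.

```python
import re

# Hypothetical correction table: exact wrong value -> correct journal name.
JOURNAL_FIXES = {
    "Animal Behavior": "Animal Behaviour",
}

# Captures the value of a |journal= parameter up to the next "|" or "}}".
# Assumption: the value contains no nested templates or links.
JOURNAL_PARAM = re.compile(r"(\|\s*journal\s*=\s*)([^|{}]*?)(\s*)(?=\||\}\})")


def fix_journal_params(wikitext):
    """Replace a |journal= value only when it exactly equals a known error."""
    def repl(match):
        value = match.group(2)
        fixed = JOURNAL_FIXES.get(value, value)  # unknown values are kept
        return match.group(1) + fixed + match.group(3)

    return JOURNAL_PARAM.sub(repl, wikitext)


if __name__ == "__main__":
    sample = ("{{cite journal |title=Animal Behavior in birds "
              "|journal=Animal Behavior |year=2001}}")
    print(fix_journal_params(sample))
```

Looking the whole parameter value up in a table, rather than searching for the misspelling anywhere in the page, is what keeps the article title in the sample and longer journal names that merely start with the same words untouched.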

Discussion

 * I personally don't have a problem with any of the points except number 3, which I believe violates WP:CONTEXTBOT. Is there any consensus for this particular fix?— CYBERPOWER  ( Chat ) 18:00, 17 August 2017 (UTC)
 * For Animal Behavior → Animal Behaviour? That's not WP:CONTEXTBOT, that's fixing a wrong journal name to a correct journal name. This is no different than changing say Isabel Boulay to Isabelle Boulay. Headbomb {t · c · p · b} 18:28, 17 August 2017 (UTC)
 * Fair enough.— CYBERPOWER  ( Chat ) 18:35, 17 August 2017 (UTC)
 * Thank you for including the exception for field-standard non-ISO journal abbreviations. That would have been my only concern, but you have already addressed it. —David Eppstein (talk) 18:01, 17 August 2017 (UTC)

I suggest a 100 edit (semi-automated) trial. I could also do a 50 edit automated trial, but that would be pretty boring (e.g. 13x  → , followed by 25x   →  , and so on.) Headbomb {t · c · p · b} 18:52, 17 August 2017 (UTC)
 * — CYBERPOWER  (Around ) 06:24, 20 August 2017 (UTC)


 * - Trial, part 1. I took the liberty of extending the task to also cover  →   in magazine since it's just as easy to find in database scans. Task worked perfectly, with no errors to report. This is an example of a task that would run automatically.
 * Could I also include those in publisher/work and  →  ? I haven't done this in the trial, but I've tested it on my main account  and it works just as flawlessly as those above. Headbomb {t · c · p · b} 00:02, 21 August 2017 (UTC)
 * This batch of edits looks good, so go ahead. I'm checking the next batch.— CYBERPOWER  ( Chat ) 16:41, 21 August 2017 (UTC)
 * - Trial, part 2. Semi-automated mode. I skipped typo fixing, but I plan to run with them when in this mode. I exceeded my trial by 2 edits so it would be easier to resume (I stopped at entry 250 in WP:JCW/TAR). I've done those edits for years without anything contentious / any objections coming out of it, so I don't see why it would become contentious now, but it would be good to have a detailed review + questions before full approval. Headbomb {t · c · p · b} 20:31, 20 August 2017 (UTC)
 * I'm a bit concerned about scope creep. I agree with everything you're doing, but it's a lot to be done in one bot task. Could you put together a subpage in the bot userspace (or your userspace) that explains each of the fixes fully, ideally with diff examples? Further, could you link that subpage from the edit summary of the bot? I just want to make sure that the average editor understands what's being done in the edits. ~ Rob 13 Talk 00:22, 21 August 2017 (UTC)
 * Done at User:JCW-CleanerBot. I'll update the edit summary to point to it during runs. Headbomb {t · c · p · b} 01:45, 21 August 2017 (UTC)
 * I'll ping to ensure review of User:JCW-CleanerBot. I've broken down the 6 main types of fixes I've been doing in the past years. Headbomb {t · c · p · b} 02:23, 21 August 2017 (UTC)


 * So regarding your second batch of edits, what happened here? No visible change in output and no change within the bot's scope.  At least I don't see one.— CYBERPOWER  ( Chat ) 16:53, 21 August 2017 (UTC)
 * It changed  to   (plus a couple of MOS:PERCENT fixes). Headbomb {t · c · p · b} 16:58, 21 August 2017 (UTC)


 * Remaining issues have been discussed on IRC.— CYBERPOWER  ( Chat ) 17:11, 21 August 2017 (UTC)

Phase 2

 * — CYBERPOWER  ( Chat ) 17:10, 21 August 2017 (UTC)


 * Part 1A trial. Headbomb {t · c · p · b} 17:15, 21 August 2017 (UTC)
 * The edits themselves look good, but none of them demonstrate Task A. I wanted to see the bot fix some typos here.— CYBERPOWER  ( Chat ) 18:08, 21 August 2017 (UTC)
 * I'm not running typo fixing on task A, because I'm doing this one automatically and WP:CONTEXTBOT applies. I'll find examples of typo fixing in the semi-automated run below however. Headbomb {t · c · p · b} 18:12, 21 August 2017 (UTC)


 * Part E/F trial. Edits containing typo fixing:  . Those are the only two instances of typos in this run.
 * In, the change is in the last entry.
 * In, the publication is fully bilingual. I suppose I could have used "Journal of Orofacial Orthopedics / Fortschritte der Kieferorthopädie", but as the cited title was in English, the English title of the journal was chosen (as is normally done).
 * and screwed up [didn't copy-paste things correctly], but I rectified it.
 * Headbomb {t · c · p · b} 18:10, 21 August 2017 (UTC)
 * Task A according to your bot's page is typo fixing, but ok.— CYBERPOWER  ( Chat ) 18:18, 21 August 2017 (UTC)
 * Oh, I see what you mean. I thought you meant 1A for extending the disambiguator logic to magazines/publishers, so that's the trial I did. I brainfarted there (mostly because I didn't see the point in testing legit typo/misspelling fixes since those are foolproof). I'll do 50 of those then, although it'll be a pretty boring trial. Headbomb {t · c · p · b} 18:23, 21 August 2017 (UTC)
 * Real task A trial. Headbomb {t · c · p · b} 18:50, 21 August 2017 (UTC)
 * So I have been giving this some thought, and these typo fixes are strictly applied to journal names, correct?— CYBERPOWER  ( Chat ) 07:17, 22 August 2017 (UTC)
 * Well, there would be two types of typo fixes. In automated mode, it's a tightly controlled 'match EXACTLY this in journal' and nowhere else. In semi-automated mode, there may be additional typo fixes (AWB-based ones, always manually reviewed), and I'm letting the find/replace apply outside of journal, but only if it refers to the actual publication, and not random words.
 * So for a case like
 * In automated mode, the bot would do
 * while in semi-automated mode, the bot would do
 * In all cases, the bot would leave the first use of paleontology alone, as this is WP:ENGVAR stuff. The other uses concern the actual name of the publication and ought to be spelled correctly, and those are the cases the bot is interested in. In automated mode the bot only touches journal (e.g. ). In semi-automated mode, I can catch more (e.g. ) since I'm making sure those it touches are legit cases of a typo. Headbomb {t · c · p · b} 09:34, 22 August 2017 (UTC)
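
The split between the two modes can be sketched in Python as follows. The example citations from this reply are not preserved here, so the sketch reuses the Animal Behavior name from earlier in the request with made-up sample text. The `Candidate` class and `find_candidates` function are hypothetical and only illustrate the idea that matches inside journal are safe to apply unattended, while matches elsewhere are queued for a human decision.

```python
import re
from dataclasses import dataclass


@dataclass
class Candidate:
    start: int          # offset of the misspelled name in the wikitext
    end: int
    needs_review: bool  # True if a human must confirm before saving


def find_candidates(wikitext, wrong, semi_automated):
    """List occurrences of the exact string `wrong` that may be edited.

    Automated mode: only exact |journal= values, never needing review.
    Semi-automated mode: every other occurrence too, flagged for review.
    """
    in_journal = re.compile(
        r"\|\s*journal\s*=\s*(" + re.escape(wrong) + r")\s*(?=\||\}\})"
    )
    found = [Candidate(*m.span(1), needs_review=False)
             for m in in_journal.finditer(wikitext)]
    if semi_automated:
        sure = {c.start for c in found}
        for m in re.finditer(re.escape(wrong), wikitext):
            if m.start() not in sure:
                found.append(Candidate(m.start(), m.end(), needs_review=True))
    return sorted(found, key=lambda c: c.start)


if __name__ == "__main__":
    text = ("Work on animal behavior is broad. See ''Animal Behavior'' 5:1. "
            "{{cite journal |journal=Animal Behavior |year=2001}}")
    for c in find_candidates(text, "Animal Behavior", semi_automated=True):
        print(text[c.start:c.end], c.needs_review)
```

In this sample the lowercase running-text use is never matched because the search is case-sensitive. A capitalized running-text use would be matched in semi-automated mode, which is why those candidates carry `needs_review=True` rather than being saved directly.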


 * — CYBERPOWER  ( Chat ) 09:37, 22 August 2017 (UTC)
 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.