Wikipedia:Bots/Requests for approval/PhuzBot 2


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Request Expired.

PhuzBot 2
Operator:

Time filed: 07:12, Sunday August 7, 2011 (UTC)

Automatic or Manual: Automatic unsupervised, after initial trial period

Programming language(s): AutoWikiBrowser standard with custom settings

Source code available: Yes

Function overview: Clean up capitalization of headers on pages (example: ==External Links== to ==External links==), AWB genfixes, RegExTypoFixes, and removal of spaces from headers (example: == External links == to ==External links==)

Links to relevant discussions (where appropriate):

Edit period(s): One time run, then shortly after each database dump is released

Estimated number of pages affected: Roughly 30,000 initially, much fewer afterwards (likely to be less than 1000 per run)

Exclusion compliant (Y/N): Y

Already has a bot flag (Y/N): N

Function details: As explained in the overview, this bot will clean up the capitalization of headers on pages. Additionally, any pages that have issues with the capitalization of headers will also have any AWB genfixes and typos fixed (from RegExTypoFix's engine and typo list). Initially, this will be using Rich Farmbrough's settings file, but I may expand it when I find more common issues that need correcting. To see an example of what this would do if automated, please check User:Phuzion/HeaderFixes, which is a series of edits that I allowed without skipping or modifying any of the edits that AWB proposed. If requested, I can remove the spacing rules, as the spacing is technically an immaterial change to the page: it is not rendered.
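
The find/replace approach described above can be sketched in Python. This is only an illustration under stated assumptions, not the AWB settings file itself: the `HEADER_FIXES` table and the `fix_headers` function are hypothetical names, and only the "External Links" entry comes from this request.

```python
# Hypothetical table of exact header replacements (not the real settings file).
HEADER_FIXES = {
    "==External Links==": "==External links==",
    "==See Also==": "==See also==",  # assumed entry, for illustration only
}


def fix_headers(wikitext: str) -> str:
    """Replace known mis-capitalized headers, one line at a time."""
    fixed_lines = []
    for line in wikitext.split("\n"):
        # Only whole-line matches are replaced, so body text is left alone.
        fixed_lines.append(HEADER_FIXES.get(line, line))
    return "\n".join(fixed_lines)
```

Because the table holds exact strings, a spaced header such as `== External Links ==` is not matched by this sketch.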

Discussion
How does the bot know which headers are to be capitalized and which not? Automatic typo fixing is not allowed -- WP:SPELLBOT. GenFixes should not be applied without other substantial change(s) to the page -- WP:COSMETICBOT. There is no consensus to remove (or add) spaces around headers, and bots should definitely not do so. There are exceptions for fixing inconsistent style or matching a style across the page. — HELL KNOWZ  ▎TALK 08:12, 7 August 2011 (UTC)
 * The bot knows which headers to fix capitalization for based on the settings file that I previously linked. It's a simple Find/Replace setup.  Disabling the spacing fix should be an easy task, but I don't think I can set it to only apply spacing fixes when applying other fixes.  Genfixes can be set up so that they are only applied if there are other changes being made to the article.  Typos fixes are simple to disable. Phuzion (talk) 08:21, 7 August 2011 (UTC)
 * OK, so basically this task is to fix header capitalization with genfixes on. I'll wait for some other BAG input. — HELL KNOWZ  ▎TALK 08:34, 7 August 2011 (UTC)
 * Doesn't SmackBot (now Helpful Pixie Bot) already do this, except Rich likes to add the stupid spaces instead of removing them? Anomie⚔ 12:31, 7 August 2011 (UTC)
 * I do see that he has that listed on the bot's tasks page, and the BRFA for that ongoing task is here. I would still like to run this, as I am in the process of building a rather large backlog of articles that could definitely use some help from this task. I plan to have the bot scan a major portion of the wiki (at least 10% initially; ideally I'd like to have it scan 100%, but AWB kinda crashes when you have 4 million+ article titles loaded into it. Sidenote: as of the last database dump, the text file version of the article title list is 168 MB!). To address the spacing issue, I have completely disabled the regex for dealing with spacing, so that there will be no bot battles. Until the folks at WP:MOS can come to a conclusion on whether we want the spaces or not, I'll just ignore them for now. Phuzion (talk) 13:03, 7 August 2011 (UTC)

Hang on, so this bot would only make changes to the capitalization of headers? To me, that seems like a cosmetic change that does not warrant an edit, unless it is done alongside other things. - EdoDodo  talk 15:35, 12 August 2011 (UTC)
 * Manual of Style (capital letters) mandates sentence-case for headers. Since this produces a visual difference, it is not in the realm of WP:COSMETICBOT. WP:COSMETICBOT applies to things like changing ==Foobar== to == Foobar ==, or making changes like to  , which do not make any visual difference in the article. Headbomb {talk / contribs / physics / books} 15:10, 14 August 2011 (UTC)


 * Looking through the settings file, most of these corrections seem to be fine, although I do have a couple questions/concerns (somewhat nitpicky sorry):
 * One of the corrections is "Board of Education" → "Board of education" - it seems like most BOE's will be titled "Board of Education" as a multi-word proper noun, which should remain capitalized as is, right?
 * All of the replacement things are of the form "==header text==", that is, no spaces between the equals signs. In short, it's not going to catch anything of the form "== header text ==". Unless I missed something.
 * "World Series of Poker bracelets" → "World Series of Poker bracelets" has no effect.
 * Where are you getting your list of articles from - are you just going through the entire wiki in blocks, or targeting particular articles? (I know you kinda addressed this above, just clarifying)
 * (Random question out of curiosity) Where did you get all these headers from - especially the weird ones about birds?
 * Thanks. Hers fold  (t/a/c) 04:52, 17 August 2011 (UTC)


 * I will admit that I did not write the settings file, I simply reviewed it to make sure it wasn't going to do anything that would be considered vandalism (replacing headers with profanity, etc).
 * I think, based on the formatting of Board of education, that WP:MOS would prefer sentence-case over title case.
 * I am not 100% sure, but I believe that the first two regular expressions take care of that, and standardize it to ==header text== . I can try it across a few pages to check and see what the behavior is based on a few different cases.
 * That is something that I can look at, it might be something that was inserted incorrectly and simply fixed rather than removing it.
 * I will be using a database backup to generate the initial list of articles using AWB's database scanner functionality for the full run, but for a trial run, I could just click the "Random pages" button a few dozen times and see if it catches anything.
 * As addressed above, I did not write the settings file for this, Rich Farmbrough did. I am simply utilizing an existing tool.  In the future, I could search through and find more cases that could use correction, but for now, I'll start with this.  I couldn't tell you where he got them, but I'm sure he'd fill you in if you asked on his talk page.
 * If there's anything I can do to help out, let me know. Phuzion (talk) 02:11, 18 August 2011 (UTC)

A bot should not be changing "== Header ==" to "==Header==" or vice versa (unless you have consensus for this). There has (as far as I know) never been anything close to consensus on how these should be formatted. Rich's settings file reflects Rich's preference, which may not necessarily be the existing style or editor preference of any particular article. As an example, MediaWiki inserts new sections with "== Header ==" syntax. — HELL KNOWZ  ▎TALK 09:06, 18 August 2011 (UTC)


 * While this could be somewhat annoying, it may work to duplicate all of these replacement settings with == this style ==; it's possible you could do this by copying all of them into Microsoft Word or some other word processor, then using Find/Replace to replace ">==" with ">== " and "==<" with " ==<", then copy/pasting the whole mess in after the existing ones. Hers fold  (t/a/c) 00:59, 19 August 2011 (UTC)


 * Or change the regex to find, replace  , which will produce:


 * — HELL KNOWZ  ▎TALK 08:17, 22 August 2011 (UTC)
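
One way to handle both spacing styles with a single rule, sketched here in Python rather than in AWB's own find/replace syntax, is to capture the whitespace on each side of the header text and put it back unchanged. The pattern and the `fix_header_case` name are assumptions for illustration; this is not the regex from the comment above.

```python
import re

# Matches a level-2 "External Links" header, capturing the optional
# spaces or tabs on each side so they can be restored as they were.
HEADER_RE = re.compile(r"^==([ \t]*)External Links([ \t]*)==$", re.MULTILINE)


def fix_header_case(wikitext: str) -> str:
    """Fix the capitalization while leaving the existing spacing untouched."""
    return HEADER_RE.sub(r"==\1External links\2==", wikitext)
```

With this shape, one rule covers both `==External Links==` and `== External Links ==`, and the bot neither adds nor removes spaces.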


 * How do you calculate 30,000 as the estimated number of pages? Seems like a useful bot, but should definitely not do anything about spacing around title headings. The other bot (Rich's) should also not be changing spacing around headings. Pointless edits. --72.208.2.14 (talk) 10:23, 22 August 2011 (UTC)
 * I guesstimated the 30,000 page count originally, but I think that number may be higher after further reviewing the database. I originally did not have the "no limits" plugin loaded, which limited me to approximately 30,000 pages to be loaded.  I could go and re-scan the database with the no limits plugin loaded, which should result in a higher number.  I can start this process tomorrow, and get back to you in a couple of days.  Phuzion (talk) 02:25, 24 August 2011 (UTC)
 * For single article edits -- WP:PERF -- as long as the task is a substantial change. — HELL KNOWZ  ▎TALK 06:38, 24 August 2011 (UTC)

Which genfixes are on for this? I generally feel uncomfortable with bots doing genfixes, as it merely clutters the diff and "hides" the main, substantial edit. Besides, AWB genfixes tend to change and you never know what they will consider a "genfix" tomorrow. Most of the time, they are not critical, and are best left for the next human doing a manual AWB fix. I'm not sure how popular this opinion is, so other BAGgers? — HELL KNOWZ  ▎TALK 06:38, 24 August 2011 (UTC)


 * I don't think it should be too large a concern - provided the bot is set not to make edits when only genfixes would be made, it's fairly clear that this bot would only be fixing headers. Genfixes by definition are uncontroversial. Hers fold  (t/a/c) 21:15, 24 August 2011 (UTC)
 * Genfixes - yes; interpretation of what genfixes are - no. Since the operator initially proposed changing the header whitespace and is using another user's settings file with no modifications, I want to make sure we are on the same page. — HELL KNOWZ  ▎TALK 21:39, 24 August 2011 (UTC)
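
The "skip when only genfixes would be made" rule discussed here can be pictured roughly as follows. This is a sketch in Python, not AWB's implementation: `propose_edit` and both callables are hypothetical stand-ins for the header rules and the general fixes.

```python
from typing import Callable, Optional


def propose_edit(
    wikitext: str,
    fix_headers: Callable[[str], str],
    general_fixes: Callable[[str], str],
) -> Optional[str]:
    """Return the text to save, or None if the page should be skipped."""
    with_headers = fix_headers(wikitext)
    if with_headers == wikitext:
        # No header change: skip the page, even if general fixes would apply.
        return None
    # A substantive change exists, so the minor fixes may ride along.
    return general_fixes(with_headers)
```

Under this rule an edit is saved only when a header actually changes, so general fixes never produce an edit on their own.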


 * Haven't heard back from bot operator; several outstanding issues, including issues related to WP:SPELLBOT. -- slakr \ talk / 04:40, 5 October 2011 (UTC)
 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.