Wikipedia:Bots/Requests for approval/PkbwcgsBot 12


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was

PkbwcgsBot 12
Operator:

Time filed: 13:10, Monday, December 24, 2018 (UTC)

Function overview: The bot will fix a range of Unicode control characters in articles. This is an extension to Task 1.

Automatic, Supervised, or Manual: Supervised

Programming language(s): AWB

Source code available: AWB

Links to relevant discussions (where appropriate):

Edit period(s): Five times a week

Estimated number of pages affected: 100-250 at a time

Namespace(s): Mainspace

Exclusion compliant (Yes/No): Yes

Function details: This is an extension to Task 1 as I am already fixing Unicode Control Characters there. However, this task does more fixes to error 16 and fixes a range of Unicode control characters that WPCleaner can't fix. The following will be removed:
 * U+200E - Left-to-right mark (the bot will be careful when it comes to Arabic text and other foreign text as this is a supervised task)
 * U+FEFF - Byte order mark (all instances of this can be safely removed)
 * U+200B - Zero-width space (the bot will be careful when it comes to Arabic text and other foreign text as this is a supervised task)
 * U+2028 - Line separator (all instances of this can be safely removed)
 * U+202A - Left-to-right embedding (the bot will be careful when it comes to Arabic text and other foreign text as this is a supervised task)
 * U+202C - Pop directional formatting (the bot will be careful when it comes to Arabic text and other foreign text as this is a supervised task)
 * U+202D - Left-to-right override (the bot will be careful when it comes to Arabic text and other foreign text as this is a supervised task)
 * U+202E - Right-to-left override (the bot will be careful when it comes to Arabic text and other foreign text as this is a supervised task)
 * U+00AD - Soft hyphen (all instances of this can be safely removed)
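As a rough illustration of the removal list above, the following Python sketch separates the characters described as always safe to remove from those that need a human check first. The actual task uses AWB find-and-replace rules, so the function names here (`strip_control_characters`, `needs_review`) are hypothetical, and the code points are the standard Unicode values for the named characters. The soft hyphen is left out of the sketch.

```python
import re

# Characters the request describes as always safe to remove:
# byte order mark (U+FEFF) and line separator (U+2028).
ALWAYS_REMOVE_RE = re.compile("[\ufeff\u2028]")

# Directional and zero-width characters (LRM, ZWSP, LRE, PDF, LRO, RLO).
# These can be meaningful in right-to-left text, so this sketch only
# removes them once a human has reviewed the page.
REVIEW_FIRST_RE = re.compile("[\u200e\u200b\u202a\u202c\u202d\u202e]")


def needs_review(text):
    """Return True if the text contains a character that needs a human check."""
    return REVIEW_FIRST_RE.search(text) is not None


def strip_control_characters(text, reviewed=False):
    """Remove the always-safe characters, and the others only if reviewed."""
    text = ALWAYS_REMOVE_RE.sub("", text)
    if reviewed:
        text = REVIEW_FIRST_RE.sub("", text)
    return text
```

Keeping the two groups in separate patterns mirrors the supervised nature of the task: the unconditional removals can run on every page, while the rest stay untouched until the operator confirms them.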

The following will be turned into a space:
 * U+2004 - Three-per-em space
 * U+2005 - Four-per-em space
 * U+2006 - Six-per-em space
 * U+2007 - Figure space
 * U+2008 - Punctuation space
 * U+00A0 - No-break space (cases of U+00A0 that are acceptable per MOS:NBSP will not be removed; this is the most frequent Unicode character in WP:WCW error 16)
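The space replacements listed above could be sketched in Python as follows. This is only an illustration of the idea, not the AWB rule set the bot uses. The names `normalize_spaces` and `count_nbsp` are hypothetical, the code points are the standard Unicode values for the named characters, and the no-break space is deliberately only counted rather than replaced, since some uses are acceptable per MOS:NBSP.

```python
import re

# Three-per-em, four-per-em, six-per-em, figure and punctuation spaces.
FANCY_SPACE_RE = re.compile("[\u2004\u2005\u2006\u2007\u2008]")

NBSP = "\u00a0"  # no-break space, handled separately


def normalize_spaces(text):
    """Replace each of the listed typographic spaces with a plain U+0020."""
    return FANCY_SPACE_RE.sub(" ", text)


def count_nbsp(text):
    """Count no-break spaces so an operator can decide what to do with them."""
    return text.count(NBSP)
```

For example, `normalize_spaces("10\u2005km")` gives `"10 km"`, while a string containing U+00A0 is returned unchanged and reported by `count_nbsp`.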

The bot will use regex find-and-replace rules. General fixes will be switched on, but typo fixing will be turned off as it is not required for this task.

Discussion
I'm not sure about some of these. In particular, U+00AD may have been added by editors to specify the proper place for long words to be broken, and U+00A0 should more likely be turned into the &nbsp; entity than changed into U+0020. The same might apply to the other space characters: editors may have specifically used these in preference to U+0020. Anomie⚔ 17:06, 24 December 2018 (UTC)
 * After going through the WP:WCW list, there are no instances of U+00AD anywhere. However, if it does come up, then I will replace it with a hyphen. U+00A0 takes up more bytes than a regular space (U+0020) so it is easier to leave a space. The other space characters can be safely replaced as they are unnecessary and they mostly come up in citations. See 1 which is taking out U+2005 which is four-per-em space, 2 which is taking out U+2008 which is punctuation space, 3 which is taking out U+2005 again, 4 which is taking out U+2008 again and 5 which is also taking out U+2008. All these occurred inside citations. Pkbwcgs (talk) 17:43, 24 December 2018 (UTC)
 * Replacing U+00AD with a hyphen would not be correct either. You'd want to replace it with &shy; or the like. For NBSP, "takes up more bytes" is a very poor argument, and replacing it with a plain space could break situations described at MOS:NBSP. A figure space might be intentionally used to make columns of numbers line up properly where U+0020 would be a different width, and so on. I don't object to fixing things where specific fancy spaces don't make a difference, but you're arguing that they're never appropriate and that strikes me as unlikely. Anomie⚔ 17:55, 24 December 2018 (UTC)
 * There are no cases of U+00AD so the bot doesn't need to handle that. In terms of U+00A0, I will make sure my regex replaces the cases described at MOS:NBSP with &nbsp; or otherwise skips them. Pkbwcgs (talk) 18:04, 24 December 2018 (UTC)
 * If you're not intending to handle U+00AD after all, you should remove mention of U+00AD from the task entirely. (I see you struck it) As for "the cases described", good luck in managing to identify every variation of those cases. It would probably be better to just make that part of the task be manually evaluated rather than "always replace". Anomie⚔ 18:09, 24 December 2018 (UTC)
 * The bot will still strip U+00A0 in wikilinks because replacing it with &nbsp; there is not going to work. Pkbwcgs (talk) 18:15, 24 December 2018 (UTC)
 * Replacing the cases stated at MOS:NBSP is trickier than I thought so I am going to skip those cases manually. This task is supervised. Pkbwcgs (talk) 18:20, 24 December 2018 (UTC)
 * BAG assistance needed I have made some amendments to this task, including reducing the edit period to five times a week and adding general fixes so that the removal of Unicode control characters and general fixes can be combined. I have also specified that the non-breaking space will not be removed in the cases described at MOS:NBSP; the general fixes will replace those cases with "&nbsp;". Pkbwcgs (talk) 20:10, 17 January 2019 (UTC)
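The U+00A0 handling discussed in this thread can be sketched roughly as below, assuming that "strip" inside wikilinks means replacing the character with a plain space and that every other occurrence is left for manual evaluation. The simplified wikilink pattern and the function names are hypothetical and are not taken from the bot's AWB settings.

```python
import re

NBSP = "\u00a0"

# Simplified pattern for [[...]] wikilinks; nested brackets are not handled.
WIKILINK_RE = re.compile(r"\[\[[^\[\]]*\]\]")


def fix_nbsp_in_wikilinks(text):
    """Replace U+00A0 with a plain space, but only inside wikilinks."""
    return WIKILINK_RE.sub(lambda m: m.group(0).replace(NBSP, " "), text)


def remaining_nbsp_positions(text):
    """List the offsets of U+00A0 left over for the operator to check by hand."""
    return [i for i, ch in enumerate(text) if ch == NBSP]
```

With this split, a link such as `[[New\u00a0York]]` is normalized automatically, while a measurement such as `5\u00a0km` is only reported so the operator can decide whether it falls under MOS:NBSP.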

. I concur with the edit summary tweak - no point in putting in the "replacement" field when it's all unicode whitespace. Primefac (talk) 13:55, 7 April 2019 (UTC)
 * Primefac (talk) 00:45, 20 January 2019 (UTC)
 * The edits are located here. WP:GenFixes were switched on as stated for this task. I will point out a couple of good edits. This edit removed a non-breaking space from the infobox. Because of that character, the "distributor" field had disappeared from the infobox; once the bot removed the character, the field re-appeared, which makes it a good edit. There were some good general fixes in this edit as well as the removal of a non-breaking space character. This edit is also a good edit because it changed the direction of text from right-to-left to left-to-right. Before, the right-to-left text would have been confusing, but now that the direction has changed it is no longer confusing. That edit removed the U+202E character, which is "Right-to-left override". Some edits removed non-breaking spaces within citations, another control character was removed in some edits in Arabic text, and a few edits removed U+2008, which is punctuation space. Pkbwcgs (talk) 20:02, 20 January 2019 (UTC)
 * It might take me a few days to be able to verify any of these (and I have zero issue if another BAG gets to it first), but as a note it's much more helpful to point us to the bad/incorrect edits. In other words, we know how the bot is supposed to run, and pointing us to runs where the bot did what it was supposed to is... kind of pointless. Primefac (talk) 20:14, 22 January 2019 (UTC)
 * , I don't know if you wanted to go through these or not, given your previous interest/concerns. Primefac (talk) 19:52, 28 January 2019 (UTC)
 * For easier reference, and in case multiple BAG members want to split the work, all of the diffs are listed at User:DannyS712/sandbox6 (I was bored and wanted to create a regex to convert the HTML of a contributions list to a wikitext-friendly list) --DannyS712 (talk) 00:49, 24 March 2019 (UTC)
 * I have little else to do, so I started looking through the trial edits. Three of them remove control characters from text in languages that are written in the other direction. Should they be removed? Also, you may want to deactivate the setting that adds the replacements made to the edit summary, since they can't really be understood and it just looks like a bunch of blank space. --DannyS712 (talk) 20:56, 29 March 2019 (UTC)
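One way around the unreadable edit summaries mentioned above would be to describe each invisible character by code point and Unicode name instead of pasting it literally. The Python sketch below shows the idea using the standard `unicodedata` module. The helpers `describe` and `summarize_removed` are hypothetical, and this is not a feature of AWB.

```python
import unicodedata


def describe(ch):
    """Render a character as 'U+XXXX NAME' so it is visible in plain text."""
    name = unicodedata.name(ch, "UNKNOWN")
    return f"U+{ord(ch):04X} {name}"


def summarize_removed(chars):
    """Build a readable, de-duplicated summary of the characters removed."""
    return ", ".join(describe(c) for c in sorted(set(chars)))
```

For instance, `summarize_removed("\u202e\u2008\u2008")` yields `U+2008 PUNCTUATION SPACE, U+202E RIGHT-TO-LEFT OVERRIDE`, which a reviewer can read without having to guess at blank space.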
 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.