Wikipedia:Bots/Requests for approval/FrescoBot 14


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was

FrescoBot 14
Operator:

Time filed: 14:32, Sunday, September 10, 2017 (UTC)

Automatic, Supervised, or Manual: auto

Programming language(s): python

Source code available: standard pywikipedia

Function overview: removing unwanted invisible special characters such as the LTR mark, SHY or the no-break space

Links to relevant discussions (where appropriate): Specifically:
 * Manual of Style/Text formatting: the only invisible characters in the editable text should be spaces and tabs.
 * LTR mark (U+200E)
 * CHECKWIKI 16
 * Village pump (technical)/Archive 101
 * Village pump (technical)/Archive 112
 * no-break space (U+00A0)
 * MOS:NBSP

Edit period(s): monthly

Estimated number of pages affected:
 * LTR marks within wikilinks and category tags = ~2100
 * no-break space = ~4300
 * SHY = 81
 * zero width space = 147
 * line separator = 82
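A minimal sketch of how counts like these could be gathered from page text. It is not the operator's actual dump-scanning code; the `TARGETS` table and the `count_invisible` helper are hypothetical names introduced only for illustration.

```python
from collections import Counter

# Code points named in this request; the labels are illustrative only.
TARGETS = {
    "\u200e": "LTR mark",
    "\u00a0": "no-break space",
    "\u00ad": "SHY",
    "\u200b": "zero width space",
    "\u2028": "line separator",
}


def count_invisible(text):
    """Return a Counter of the targeted invisible characters found in text."""
    return Counter(TARGETS[ch] for ch in text if ch in TARGETS)
```

Running this over each page of a dump and counting the pages with a non-empty result would give per-character page estimates. Note that it counts every occurrence, whereas the LTR figure above covers only marks inside wikilinks and category tags.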

Namespace(s): articles, categories, files. Manually also on templates, help, wikipedia and portal.

Exclusion compliant (Yes/No): yes

Function details:
 * It will remove unwanted left-to-right marks (U+200E) from wikilinks and category tags. This is actually part of CHECKWIKI 16. This character is invisible, so in order to see it you have to cut and paste for example   in this webpage. Sometimes they are cut and pasted in Hot Cat or other tools and then appear at the end of the category tag. Less often they are also cut and pasted within wikilinks or placed randomly within the page source. The point: there is absolutely no use for them within any wikilinks or category tags, they are known to break bots and other tools, they are completely invisible and should be avoided. Maybe they could be removed almost everywhere on the encyclopedia, but removing them within wikilinks and other markup is always safe. There is consensus for this fix and it is not cosmetic, since it actually prevents real problems. I have been running this substitution for years in order to sanitize the text before applying my other fixes, and nobody has objected on the merits; I just had to clearly explain in the edit summary that the bot was removing an unwanted invisible mark. I added it along the way in order to avoid mistakes when parsing wikilinks and categories, and some months ago I realized that I had never properly asked for explicit approval for it.
 * It will remove LTR (U+200E), RTL (U+200F) and other invisible control characters (from U+2027 to U+202F) in the image name within image markup, category markup and other selected safe locations. Same reasons.
 * It will replace the no-break space character (U+00A0) with a regular space everywhere it is used as a plain space. It is stored as U+00A0 in the database, but it looks like a space in the browser. It is considered a really bad practice (MOS:NBSP) and breaks bots and tools. Browsers get confused and tricky problems occur: for example, if you search for "joined Dionysus" on this page with the latest Firefox everything is fine, but with IE11 you will not find anything.
 * It will remove everywhere (manually) the character SHY (U+00AD), aka soft hyphen. It is an invisible hyphen. It creates confusion and breaks links (eg. any­one vs. anyone), bots (eg. exceptions will not be identified) and tools (it is not the same word). See Soft hyphen (SHY) – a hard problem?. Pretty rare.
 * It will remove everywhere (manually) the character zero width space (U+200B). The same problems as SHY: History​ of Quebec vs. History of Quebec. Pretty rare as well.
 * It will replace everywhere (manually) the unicode character "line separator" (U+2028) with a standard newline character or a space. Please note that there is no use for it and it poses the same problems as SHY.
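The substitutions listed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the bot's actual code: the names `WIKILINK`, `strip_marks_in_links` and `sanitize` are hypothetical, the link pattern ignores nested markup, and the `include_manual` flag merely stands in for the fixes that the request says are reviewed by hand.

```python
import re

LRM, RLM = "\u200e", "\u200f"
NBSP, SHY, ZWSP, LSEP = "\u00a0", "\u00ad", "\u200b", "\u2028"

# Matches [[...]] links, including [[Category:...]] tags.
# Assumption: nested brackets are not handled.
WIKILINK = re.compile(r"\[\[[^\[\]]*\]\]")


def strip_marks_in_links(text):
    """Remove LRM/RLM only inside [[...]] markup; leave the rest untouched."""
    def clean(match):
        return match.group(0).replace(LRM, "").replace(RLM, "")
    return WIKILINK.sub(clean, text)


def sanitize(text, include_manual=False):
    """Apply the automatic fixes; optionally also the manually reviewed ones."""
    text = strip_marks_in_links(text)
    # No-break space used as a plain space becomes a regular space.
    text = text.replace(NBSP, " ")
    if include_manual:
        # SHY and zero width space are dropped; line separator becomes a newline.
        text = text.replace(SHY, "").replace(ZWSP, "")
        text = text.replace(LSEP, "\n")
    return text
```

For example, `sanitize("[[Category:Foo\u200e]]")` returns `"[[Category:Foo]]"`, while an LTR mark outside link markup is left alone. A real run would also need to skip the places where a no-break space is intentional, which this sketch does not attempt.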

Discussion
Bottom line: every single fix in the list solves real problems, so they cannot be considered cosmetic. Many bots cannot run safely with these characters around because exceptions may go unidentified and unexpected problems arise. The number of affected pages is not high, so it is definitely worth it. -- Basilicofresco  (msg) 14:32, 10 September 2017 (UTC)
 * How will the bot determine whether these invisible characters are safe to remove, versus them being present to fix things (and therefore should probably be turned into an entity instead of being removed)? For example, LRM to fix cases where adjacent RTL text is causing numbers and punctuation to be displayed as RTL, non-breaking space to prevent separation of numbers and units, SHY to indicate word breaks in tables, and so on. Anomie⚔ 02:10, 11 September 2017 (UTC)
 * (Struck the LRM bit, re-reading the description I see the replacement is proposed only in wikilinks) Anomie⚔ 02:13, 11 September 2017 (UTC)
 * The point is that nobody deliberately places non-breaking spaces using the invisible Unicode special character. All the U+00A0 characters I saw are placed at points where a &nbsp; is not useful. These characters appear with cut-and-paste jobs from word processors / old browsers / other odd sources full of invisible characters exposed to the user. With some trial edits you will be able to see how (badly) these invisible non-breaking spaces are used. --  Basilicofresco  (msg) 04:55, 11 September 2017 (UTC)
 * There are at least a few people who use raw NBSP in their signature, which I notice because I routinely see Village pump edits that change them to &nbsp; when someone else is using a non-standard editor that does that substitution. On some platforms it's easy enough to type the character, e.g. I can type Compose-Space-Space to get one . Anomie⚔ 20:19, 11 September 2017 (UTC)
 * I do not plan to run it on talk pages. However, you mean that in rare cases we could find at least a few invisible non-breaking spaces properly used... It is not a big deal but ok, I can manually check every invisible non-breaking space after a digit in order to decide whether it is better to replace it with a space or a &nbsp;. After all, there are only 188 invisible non-breaking spaces between a digit and   or a letter. As I said, SHY will be replaced manually, so I should be able to identify any exception. --  Basilicofresco  (msg) 05:56, 12 September 2017 (UTC)
 * If there are no additional questions I would start a trial run. --  Basilicofresco  (msg) 07:35, 20 September 2017 (UTC)
 * I mean: there is consensus, there are no objections or questions, twelve days have passed and a new dump file is almost ready. I humbly dare to suggest that the time for a test run has come. Please... -- Basilicofresco  (msg) 11:43, 23 September 2017 (UTC)
 * . Headbomb {t · c · p · b} 13:40, 25 September 2017 (UTC)
 * -- Basilicofresco  (msg) 23:04, 26 September 2017 (UTC)
 * This awaiting your review. :-)— CYBERPOWER  ( Chat ) 13:40, 9 October 2017 (UTC)
 * thanks for the notice. I knew I had one going on, but couldn't find it. Headbomb {t · c · p · b} 13:51, 9 October 2017 (UTC)

Do you have links to diffs? Headbomb {t · c · p · b} 13:52, 9 October 2017 (UTC)
 * Well, the last 50 edits are the trial run. However, here is the permanent link. As you can see, each edit has a proper summary. Let me know if you have any questions or doubts. -- Basilicofresco  (msg) 10:12, 12 October 2017 (UTC)
 * I would like to stress that apparently there are no objections or questions, and all the discussions showed that (it is not a popular topic but) there is consensus about a task like this one. This kind of invisible misplaced character can create problems for many tools and bots, so let's do it. --  Basilicofresco  (msg) 06:43, 19 October 2017 (UTC)

I don't really have an issue with anything from the trial, but how would the bot handle multiple fixes (e.g. both a non-breaking space and a line separator) at once? Also, for the edit summaries, it would be more accurate to have "Removing misplaced..." rather than simply "misplaced...". Headbomb {t · c · p · b} 15:22, 19 October 2017 (UTC)
 * The word "removing" in the edit summary sounds good, I will insert it. I do not expect to find many pages with more than one type of problematic character (eg. among the 6400+ pages above, less than 1% required more than one type of substitution), so I will fix any additional type of problematic character as a cosmetic fix: a page with a non-breaking space and a line separator will have an edit summary like Bot: removing misplaced special no-break space character and minor changes. -- Basilicofresco  (msg) 12:30, 21 October 2017 (UTC)
 * Headbomb {t · c · p · b} 16:47, 21 October 2017 (UTC)
 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.