User:Retro/Character encoding

This page is not limited to discussion of Unicode code points, but they are currently its dominant focus.

I was able to painlessly generate a list of Unicode code point redirects using Special:AllPages, and I templated them all with R from Unicode code in a semi-automated way (I reviewed each edit, and made corrections to the redirect target when appropriate).


 * Many Unicode code point redirects don't exist. These could probably be created in a semi-automated way by utilizing the target of each code point's related character. It may be possible to transition to automated creation, but the main issue is that code point redirects should ideally redirect to a specific section in the article about the character's encoding, and the names for those sections have not been standardized. I would like to do a comprehensive sampling of section names in character articles so I can determine the best path towards standardization (there is one section name that I am a bit against here: "In popular culture" is a bit broad; I currently prefer something like "Encoding").
 * Examples of code points lacking redirects: U+2024, U+2022
 * Some articles don't mention the Unicode code points at all.
 * Right-to-left mark
 * 's a related case study; I need to be careful to avoid double redirects if I create new redirects. I should also have high certainty in what the final resulting product should be before I create many redirects.
 * These were my original thoughts about automating mass-creation of such redirects: I could make a bot that finds where the actual Unicode character's page targets on Wikipedia, checks that the page exists, then uses that indirect target to create the new redirect (to avoid creating a double redirect). There's so much more, though. Section-specific redirects are preferable to me, and that would require standardization of section titles.
 * For character articles I think the Unicode code point is worth a mention in the infobox.
 * Latin-1 Supplement (Unicode block): Why does the character block have "XXX" for some of the acronyms even though they are known and even appear in the tooltip text? This is probably a broader issue than this single article, and should be investigated closely.
 * Non-breaking space: Are code blocks the standard way to include Unicode character point codes? Another question that can only be effectively answered with some automated searching.
 * What follows is an idea I thought of, but have since rejected: What if all the Unicode code points were linked, and the unassigned ones redirected to the article on unassigned Unicode code points? It might be too annoying though, and only trivially useful. Instead, the absence of a page should communicate the message that the code point is unassigned.
 * R from Unicode code still has a few problems:
 * Whether to include R to related topic is an open question. At present, R to related topic is used on many pages that are Unicode code point redirects. I'm currently leaning against. I think using it is redundant for Unicode code points, because they already include R from Unicode code.
 * This gave me an idea of how I can improve the Unicode redirect category, but I've made a mistake in my implementation.
 * Category:Redirects from Unicode codes is not showing up on R from Unicode code's page even though I included it.
 * Neither is Category:Redirects from Unicode characters on R from Unicode.
 * Why is it taking so long for the template's cats to show the page's presence in the cat for R from Unicode code?
 * The template still hasn't appeared in Category:Redirects from Unicode codes; I suspect I made a mistake somewhere. Perhaps it's because of the actual parameterized form in other templates, and the categories on the bottom had no effect.
 * The related category's description is wrong because I inserted the template in the wrong parameter: Category:Redirects from Unicode codes. Oh well, I'll fix it in the future.
 * If rcat shell in parameters is so common, why not just make a parameter entirely for that. like yes
 * Each character article should mention its relevant Unicode block on the page. Otherwise, one has to go to Unicode block to reverse engineer the parent block. This could probably even be templated, but it would require a bit of cleverness to design elegantly.
 * Example: É
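
The redirect-creation idea in the notes above (look up where the character's own page ends up, then point the new code point redirect straight at that final target so no double redirect is created) could be sketched roughly as follows. This is only a minimal sketch, not a working bot: the `PAGES` table, the targets listed in it, and the function names are hypothetical stand-ins for data that would really be fetched from the wiki.

```python
from typing import Optional, Tuple

# Hypothetical snapshot of existing pages: title -> redirect target,
# or None when the page is a real article. The targets are illustrative only.
PAGES = {
    "\u2022": "Bullet (typography)",        # the character page for U+2022
    "Bullet (typography)": None,
    "\u2024": "One dot leader",             # the character page for U+2024
    "One dot leader": "Leader (typography)",
    "Leader (typography)": None,
}


def resolve(title: str, pages: dict, max_hops: int = 10) -> Optional[str]:
    """Follow redirects from `title` until a real article is reached.

    Returns None if the page is missing, the chain loops, or it is too long.
    """
    seen = set()
    while title in pages and title not in seen and len(seen) < max_hops:
        seen.add(title)
        target = pages[title]
        if target is None:
            return title  # reached an article, not a redirect
        title = target
    return None


def propose_redirect(char: str, pages: dict) -> Optional[Tuple[str, str]]:
    """Suggest (code point title, final target) for a missing redirect."""
    code_point_title = f"U+{ord(char):04X}"
    if code_point_title in pages:
        return None  # the redirect already exists; leave it for manual review
    final_target = resolve(char, pages)
    if final_target is None:
        return None  # nothing safe to point at
    return code_point_title, final_target
```

For the two examples of missing redirects listed above, `propose_redirect("\u2022", PAGES)` and `propose_redirect("\u2024", PAGES)` each return a pair whose target is an article rather than another redirect. Section-specific targets are deliberately left out, since the section names have not been standardized.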

Related articles
 * I remember a Tom Scott video about how these marks allowed an Android message to crash the messaging application... it hints at the complexities involved, and links to the Unicode Consortium page, a good place for information about these sorts of things.
 * For Bullet (typography), I don't think it should have a subsection in Unicode; most characters have a higher level Unicode description table.
 * Right-to-left mark and Left-to-right mark are probably similar enough to be merged. But the target name should probably not be either of the original article titles.
 * U+00A4 is an exception to the normal redirect templates and uses R from systematic
 * Ugh, I don't like that Character point redirects to an article about roleplaying games. Perhaps it should be a disambiguation. But maybe I'm the only one who has confused it with "Code point".

Involved people

 * has been involved in recategorizing Unicode character codes from R from code to R from Unicode code.
 * has also been involved in this process consistently.
 * (now indef blocked) was also somewhat involved. They created 79 of the redirects.
 * used a category sortkey once.
 * has created 16 of the redirects.
 * created one page.
 * created one page.
 * created one page.
 * added a redirect shell to one page.
 * . Ah, but they were the one who ended up eventually blocked.
 * created 22 of the redirects.

Miscellaneous notes
I made these notes while I was adding R from Unicode code to the redirects. They are currently unsorted. While editing, I generally made the decision to preserve the original target in most cases, except if there was clearly a separate page that was a much better target. This decision was both to save time, and to allow different options of section linking or its lack to remain manifest within the redirect template links. I can analyze the links more systematically later.


 * The question here is whether section redirects of the following form are acceptable:
 * For consistency, I should match the previous character (in the U+009x row) I just edited as well.
 * U+009E:
 * REDIRECT ANSI_escape_code
 * Yeah, but U+009F redirects here. Is it consistent?
 * List of Unicode characters is too large to really be that useful in finding pages. But I'll leave it for now. (One of the redirects linked to List_of_Unicode_characters)
 * Should the Unicode code point for Yen sign redirect to the 'Code points' section?
 * The Broken Bar section should have a note that 'broken bar symbol' redirects here.
 * Ordinal indicator could redirect to the 'Encoding' section.
 * I suppose redirection to Not is legit. But it should probably be to a section on the characters used to represent the concept, not just the concept itself.
 * Why is 'soft hyphen' invisible? It's weird, I wonder what the deal is with my browser encoding.
 * Square and cube algebra might need hatnotes; I can imagine other uses (perhaps a section redirect would suffice.)
 * How is a micro sign different from a mu? I think maybe it should redirect to the unit (what do the ambiguous pronouns refer to!), because mu is almost certainly a different letter in the encoding in some Greek section of Unicode's scheme.
 * The pictures displayed for 'mu' and 'micro-' contradict; this should be resolved. I suspect 'micro-' is in the wrong, because it prescribes a difference, but font designers won't necessarily respect that difference.
 * Unicode subscripts and superscripts might be a better target for some of the previous superscripts I ran into (2 and 3).
 * Superscripts and Subscripts: this too? It's redundant and unnecessary if the other page exists.
 * There ya go Unicode consortium, adding characters with specifications that nobody uses.
 * Number Forms should use a template, instead of doing fractions with raw tags.
 * Wasn't sure about the use of decimal HTML codes in the U+00Cx block, so I left them like &#197;
 * Could use 'Circumflex in digital character sets' (of course, could remove the 'Circumflex in')
 * Some of these are nonsensically inconsistent. Why doesn't 'Ô' deserve its own page? Seems reasonable enough to me, especially with all the other similar pages I've seen. It shouldn't redirect to Circumflex like I just supported.
 * Multiplication has two encoding sections: 'In computer software' and 'Unicode'
 * U+00D9
 * Not entirely happy that it redirects to grave accent, the same as &. They should have their own page, in my opinion. Grave accent is *way* too general.
 * Why does Ø have a separate 'Computers' and 'Encoding' section? I left the 'Computers' section. Ideally, they would both be merged into 'Encoding', since that's the only way characters are related to computers in most cases: the method the character is encoded in the computer.
 * Unicode, U+066D ٭ ARABIC FIVE POINTED STAR (HTML &#1645; ·
 * Generating the HTML right after the Unicode could be automated by a template; there's no need to type any of it out, except the character's code point.
 * Oh, U+2189 is used in basketball scorekeeping. I'm actually interested in that; perhaps I shouldn't redirect so hastily. But the basketball article doesn't currently mention it, so it needs expansion before being back-targeted.
 * Smiley
 * The people who make individual pages for Braille pattern dots are a bit insane. I can't believe that there's so much Japanese text. It's essentially a data dump.
 * What happened here, ?:
 * https://en.wikipedia.org/w/index.php?title=U%2B2826&action=edit&oldid=587077485
 * https://en.wikipedia.org/w/index.php?title=U%2B2834&type=revision&diff=587129512&oldid=587077530
 * Editor's note: I'm not sure what I was referring to in this parenthetical: (This again?)
 * Braille pattern dots-0 I thought this was inconsistent with the previous braille character articles, but now I think I may have been wrong.
 * -27 -127
 * Why are those both missing from the page, if it's supposed to encompass all the 7 and 8 dots combinations? This looks a bit fishy with the braille redirection consistency, but I gather dots 7 and 8 are less used.
 * I see, but I don't like those as much; it looks inconsistent, and I haven't looked at it carefully, only a general spot check for the braille.
 * Oh, I missed out on adding R to section where appropriate. Oh well.
 * could *definitely* have an automated check based on page content.
 * For U+FFFE and U+FFFF, I wonder if there's a page on Unicode non-characters; I could probably find a section somewhere in Unicode that covers it better.
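
Two of the notes above lend themselves to a small illustration: deriving the decimal HTML reference from nothing but the code point, and recognizing noncharacters such as U+FFFE and U+FFFF. The sketch below is in Python rather than template syntax, and every function name in it is made up for illustration; it only shows that the code point label is enough input. The noncharacter rule used is the one in the Unicode Standard: U+FDD0 to U+FDEF, plus the last two code points of every plane.

```python
import unicodedata


def parse_code_point(label: str) -> int:
    """Turn a label such as 'U+066D' into the integer 0x066D."""
    if not label.upper().startswith("U+"):
        raise ValueError(f"not a code point label: {label!r}")
    value = int(label[2:], 16)
    if not 0 <= value <= 0x10FFFF:
        raise ValueError(f"outside the Unicode range: {label!r}")
    return value


def html_reference(label: str) -> str:
    """Decimal HTML numeric character reference for a code point label."""
    return f"&#{parse_code_point(label)};"


def is_noncharacter(label: str) -> bool:
    """True for U+FDD0..U+FDEF and for the last two code points of any plane."""
    cp = parse_code_point(label)
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def describe(label: str) -> str:
    """One-line summary: code point, character name if any, HTML reference."""
    cp = parse_code_point(label)
    name = unicodedata.name(chr(cp), "<no name>")
    return f"U+{cp:04X} {name} (HTML {html_reference(label)})"
```

Here `html_reference("U+066D")` gives `&#1645;`, matching the reference quoted in the notes, and `is_noncharacter("U+FFFE")` is true while `is_noncharacter("U+F8FF")` is false, since U+F8FF is a private use code point rather than a noncharacter.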

RfD
I tried to do a comprehensive search to check if any Unicode code point redirects have been RfD'd. The main discussion I found was this, where many participants mentioned the 'U+' redirects specifically and said they're fine.

I will probably RfD it eventually.
 * Not sure about U+E for getting the entire block... I've just skipped it entirely. I think it should be RFD'd, but I've still got to get my arguments together.
 * If they should, there should possibly be a hatnote noting that a range of characters redirects to the page. But I don't think there's necessarily an unambiguous target, or that this template is unambiguous for Unicode code points anyway.
 * U+F8FF: I guess this is the last character of the block, but I dunno if private use area codes should redirect; they're not unambiguous enough to really need targets for each Unicode code point.
 * is a bit silly; very few people would misspell a Unicode code with O's instead of 0's, and, carried ad infinitum, this would create a great many redirects. I don't think this should exist.
 * I also don't think should exist. Lowercase versions are of debatable necessity. I don't think anyone styles them lowercase, and search results probably turn up the uppercase redirect anyway, so it's unnecessary. But I need to investigate how search results work before committing to this RfD.
 * U+1F360 could be a section redirect instead of being redirected to the whole block, as I'm currently doing.
 * But I'm just going to leave it as a redirect to Sweet potato, even though I strongly disagree with it.
 * These were my attempts at searching, but I don't think it's worth spending any more time searching:
 * Here's a useful search:
 * https://en.wikipedia.org/w/index.php?search=insource%3A%2FU%5C%2B%2F&prefix=Wikipedia%3ARedirects+for+d&title=Special:Search&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1&searchToken=eb5h8jo3b1fgi99vo43e4q7qe
 * Even better search that doesn't catch "GNU+Linux":
 * https://en.wikipedia.org/w/index.php?search=insource%3A%2FU%5C%2B%5B0-9a-fA-F%5D%2F&prefix=Wikipedia%3ARedirects+for+d&title=Special%3ASearch&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1
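
URL-decoded, the two searches above use the patterns `U\+` and `U\+[0-9a-fA-F]`. The sketch below reproduces their behaviour with Python's `re` module, which may differ in details from the regex dialect the wiki search uses. The third pattern, `STRICT`, is my own hypothetical tightening and is not taken from either search; it is there because the second pattern would still catch something like "GNU+Debian", where the letter after the plus sign happens to be a hex digit.

```python
import re

# The pattern from the first search: any "U+" at all.
LOOSE = re.compile(r"U\+")

# The pattern from the second search: "U+" followed by one hex digit.
BETTER = re.compile(r"U\+[0-9a-fA-F]")

# Hypothetical stricter variant: "U+" not preceded by a letter,
# followed by four to six hex digits and then no further hex digit.
STRICT = re.compile(r"(?<![A-Za-z])U\+[0-9A-Fa-f]{4,6}(?![0-9A-Fa-f])")


def mentions_code_point(text: str, pattern: re.Pattern = STRICT) -> bool:
    """True if the text contains something the pattern treats as a code point."""
    return pattern.search(text) is not None
```

With these, "GNU+Linux" is matched by `LOOSE` but by neither `BETTER` nor `STRICT`, while "U+1F360" is matched by all three.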

Other

 * Naming conventions (Unicode) (draft): This page is largely historical at this point; we now have specific guidelines about Unicode characters (which should be relatively straightforward), and this no longer serves its purpose as a centralized proposal to organize discussion; there are better places to have this discussion that would get more awareness. I can probably carry this out myself, since justification will be easy to come by.
 * I wonder if a bot could search for pages to flag as historical; obviously the bot can't make the nuanced decision on its own, but at least it could search when the last edit was. If the last edit was more than a year ago, a page is suspect; it should probably just be back-prioritized. But sometimes people randomly edit pages, so it would still have false negatives.
 * Space: I think Space (punctuation) should be mentioned in the hatnote instead of Spacebar. But I guess I'll check out the page views.
 * \=: List of LaTeX escapes (it's used to make a macron)
 * Code point
 * Strictly, these definitions imply that it’s meaningless to say ‘this is character U+265E’. U+265E is a code point, which represents some particular character; in this case, it represents the character ‘BLACK CHESS KNIGHT’, ‘♞’. In informal contexts, this distinction between code points and characters will sometimes be forgotten.
 * Just make sure to cite the specific version of the documentation. But this cites other sources; it's too secondary.
 * Why do we have disambiguating notes in redirect templates? Oh I guess it's because they're mostly intended for editors, not readers.
 * Regarding Redirect category:
 * I wonder if there's any way to detect this and do it with a bot. (Note I implemented it wrong in the linked diff.)
 * The 'to' and 'from' should probably have consistency with the descriptions on the redirect page; a bot could definitely enforce that (or maybe some crazy transclusion filtering could take place? but that would probably be too expensive.)
 * R template index is nowhere near comprehensive enough; it doesn't include R from code
 * Why, when I search for 'U+F8FF', is the only result that shows up 'U+f8ff'? U+F8FF is a page, but the search only shows the lowercase form. I should test how this works by creating another similar page, but I need to ensure I understand how search works beforehand. I think this can be deleted because the other result will show up in search; having the lowercase form is likely redundant. But I need to work on evidence, not intuition and a philosophy of minimizing the number of pages.
 * The phrasing in search results "redirect from" preceding is ambiguous about which title is the redirect. A clearer interface would suffix the link with "redirects here".
 * Maybe CirrusSearch should be noted more prominently in the lead of Help:Searching, instead of being confined to a hatnote.
 * https://en.wikipedia.org/wiki/Special:Search?search=Article+title+grep&prefix=User+talk%3AJarry1250%2F&fulltext=Search+archives&fulltext=Search&ns0=1
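
The idea above of a bot that flags possibly historical pages by the date of their last edit comes down to a simple age filter. Here is a minimal sketch, assuming the last-edit timestamps have already been collected somehow; the function name, the page titles in the usage example, and the default threshold of 365 days (the "more than a year" from the note) are illustrative. The output is only a list of candidates for a human to review.

```python
from datetime import datetime, timedelta


def stale_pages(last_edits: dict, now: datetime, max_age_days: int = 365) -> list:
    """Return the titles whose last edit is older than max_age_days.

    last_edits maps a page title to the datetime of its most recent edit.
    """
    cutoff = now - timedelta(days=max_age_days)
    return sorted(title for title, edited in last_edits.items() if edited < cutoff)


# Usage with made-up titles and dates:
example = {
    "Hypothetical draft A": datetime(2017, 5, 1),
    "Hypothetical draft B": datetime(2019, 11, 15),
}
candidates = stale_pages(example, now=datetime(2020, 1, 1))
```

As the note says, a single stray edit resets the clock, so a page like the second one above would be missed even if it is in practice abandoned.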