Module talk:Unicode chart

Notes about notes
―cobaltcigs 17:55, 9 September 2019 (UTC)
 * I'm not convinced the "Notes" section at the bottom is worth the space it takes up, and I only added it as a proof-of-concept gesture to mimic existing layout convention. A collapsible (show/hide, just like the section above) section at the bottom with an additional list/table of character info (one per line) would certainly be feasible and only require a few more lines of code. Its hugeness of screen space would be the primary concern, because its expansion would displace other page content possibly including wrapped text or floating images (unlike navboxes, which occupy 100% width at the very bottom).
 * We should just give first and last rows for blocks with character names derived from code points (CJK, Tangut, Nushu, ...), so the largest block is Hangul Syllables with 11,184 code points, which I agree is too long for this approach. But the next biggest blocks are Yi Syllables (1,168), Egyptian Hieroglyphs (1,072), Mathematical Alphanumeric Symbols (1,024), and Cuneiform (1,024), which I think should be acceptable if the names list is initially hidden. I don't see that displacement of other text and images would be an issue, especially as the code charts are mostly only used in the corresponding Unicode Block name articles. BabelStone (talk) 11:33, 10 September 2019 (UTC)
 * One intuitive solution would be to mimic typical charmap program behavior by using a Javascript click handler on each character cell that populates the footer area (of about the same size as the "Notes" section, maybe slightly smaller) with the cursor-selectable name of the last clicked-upon codepoint, plus its  and any additional info we care to pull from Module:Unicode data (replacing any previous content). I could whip up a demo for that in the next few days. I just worry that it might be too interactive to be widely accepted.
 * Nice idea but I am also concerned that turning Wikipedia into an app is a step too far. I'd like to see a prototype of it though. BabelStone (talk) 11:33, 10 September 2019 (UTC)
 * A third approach might be to render the entire list (of names and whatnot) in a vertically scrollable footer panel containing "section" links, such that clicking on the character cell would cause the footer to scroll to and highlight (similar behavior to reflist anchors) the appropriate line. This might be even less popular.
 * I think this is the best solution, regardless of WP:SCROLL. Only 50 blocks with non-algorithmic character names have more than 128 code points, so if we make the scroll window 128 rows only the 50 largest blocks will be affected. BabelStone (talk) 11:33, 10 September 2019 (UTC)
 * On the other hand, some philosophies may have changed over the years. I mean, we do have interactive scrolling maps that pop up in a fullscreen div now (see example).
 * I haven't formed any opinion yet on how to handle combining character positioning, other than "oh god, I hope it's something other than " lol.
 * Personally I prefer NBSP as the base for combining characters as dotted circle (which we currently use) often interferes with the character. BabelStone (talk) 11:33, 10 September 2019 (UTC)

Existing charts
Interesting approach to create the Unicode code charts dynamically but I have many questions. Most only apply if this module is intended to replace the existing chart templates... DRMcCreedy (talk) 04:51, 10 September 2019 (UTC)
 * 1) What problem is this new approach solving? Is it just duplicating/replacing the existing templates?  If not, what will this module be used for?
 * 2) Do the charts get created every time they're displayed? If so, do we care about the extra processing incurred?
 * 3) How to handle fonts?  I saw the post at Template talk:Script and the notes above so I know this is a known issue.
 * 4) ✅ How to handle a varying number of reserved characters?  The current charts leave off the "Gray areas" notice if there are no non-assigned code points because having the "gray areas" notice for those blocks would be confusing. And the wording changes if there is only one non-assigned code point.
 * 5) ✅ How to handle charts with additional footnotes?  For example, Template:Unicode chart Arabic.  And for the existing charts, the notes are indeed valuable.
 * 6) ✅ How to handle non-characters?  For example, U+FDD0-FDEF in Template:Unicode chart Arabic Presentation Forms-A.
 * 7) How to handle combining marks (which are referenced above)?  Some charts have special additions for some combining characters.  For example, U+A980 in Template:Unicode chart Javanese uses a dotted circle. Other combining marks, like U+1D242 in Template:Unicode chart Ancient Greek Musical Notation use a non-breaking space.  Some combining marks use no additional character at all.
 * 8) How to handle characters with dashed boxes?  For example, U+0600-0605, 061C, and 06DD in the Template:Unicode chart Arabic chart.
 * 9) How to handle control(ish) characters where we don't want the actual character in the chart?  For example, U+061C in the Template:Unicode chart Arabic chart, and more obviously, control characters in Template:Unicode chart C0 Controls and Basic Latin and Template:Unicode chart C1 Controls and Latin-1 Supplement.
 * 10) How to create character name aliases?  See U+061C in Template:Unicode chart Arabic and the control characters in Template:Unicode chart C0 Controls and Basic Latin and Template:Unicode chart C1 Controls and Latin-1 Supplement.
 * 11) How to handle block-specific formatting? For example Template:Unicode chart Javanese has a specific height and some of the characters in Template:Unicode chart Control Pictures use a different font size.
 * 12) How to handle character links?  Like, I'm not a fan of linking specific characters (but others are).  It looks like your code, optionally, will link every character if an article exists, but this could increase the number of linked characters. And many characters aren't linked to the character itself, like U+2245 in Template:Unicode chart Mathematical Operators.  Some link to wikt, like U+2105 in Template:Unicode chart Letterlike Symbols and all the characters in Template:Unicode chart CJK Unified Ideographs Extension A.
 * 13) ✅ Some blocks have special parameters that need to be taken into account: Template:Unicode chart Alphabetic Presentation Forms, Template:Unicode chart Enclosed Alphanumeric Supplement, Template:Unicode chart Enclosed CJK Letters and Months, Template:Unicode chart Halfwidth and Fullwidth Forms, Template:Unicode chart Miscellaneous Symbols, and Template:Unicode chart Supplemental Symbols and Pictographs. As with most of these questions, this only applies if you're replacing existing chart templates.
 * 14) How to determine the chart name?  Most charts use the block name for the title but some don't.  For example, "C0 Controls and Basic Latin" is the chart name for the "Basic Latin" block.
 * 15) How to determine what to link the chart name to.  For example, the Template:Unicode chart Kangxi Radicals chart links to "Kangxi radical#Unicode".  Most either link to the block name itself or the block name with "(Unicode block)" appended.
 * 16) Will the new approach be used for the list charts that make up List of CJK Unified Ideographs, part 1 of 4 and List of CJK Unified Ideographs Extension B (Part 1 of 7)?

―cobaltcigs 18:20, 10 September 2019 (UTC)
 * 1. Consistency of format, avoidance of stupidity like.
 * ✅ 2 (and 3). Here are four profiler outputs for the testcases page. Note that this is the total churning of five unicode charts transcluded on the same page (indirectly through the test case template/module in fact). Even with those factors the processing stats are at a small fraction of allowable limits in every case except for  (which should probably be the first feature taken out). Actual overhead in the wild would be lower. Based on the percentages at the bottom, it looks like the single worst bottleneck is the grand   statement at Template:Script. We could probably save at least 40% on parser juice by skipping that and moving its fairly trivial functionality (that of choosing a css class and a definition for same, having already obtained an ISO 15924 code from here) into some module.
 * ✅ Note: I'd like to get away from using script anyway if possible, for reasons outlined at Template talk:Script. ―cobaltcigs 20:21, 10 September 2019 (UTC)
 * ✅ With the inner use of the script template and the outer use of the test case template now skipped, the "CPU time usage" is down to 0.436 seconds for rendering the same five charts as above. ―cobaltcigs 22:44, 12 September 2019 (UTC)
 * 4. ✅ Keeping a count of reserved codepoints and rendering the "note" as plural/singular/blank will be a trivial step. I just didn't think of it. I do question whether the footnote system is the appropriate way to present this.
 * 5. ✅ My first version of the module actually did have a parameter accepting whole refs. I just took it out when I got the impression every existing template had the same two notes. I can put it back.
 * 6. ✅ Preview of  has them showing up as normally reserved codepoints (the default assumption based on lines missing from here), rather than choking. If we want to give the "permanently reserved" codepoints a different background and auto-generate a footnote explaining what this means, we'd have to maintain a list of them somewhere. Does anything like this occur in other blocks?
 * These are in fact easily detectable. Disregard above comments. ―cobaltcigs 16:15, 11 September 2019 (UTC)
 * Also 6. I'd be more immediately concerned about this cell-stretching monstrosity at U+FDFD, which seems to be a consequence of using script in places where the original chart template does not.
 * 7. Not sure yet. I did see some interesting suggestions here.
 * 8. Depends on what the rationale is for drawing these boxes, and whether it can be detected in any way from Unicode data. Or whether it needs to be listed elsewhere as a special case. Or whether the boxes are needed at all. I don't see a footnote explaining what the boxes even indicate. No hints on my own system either.
 * 9 and 10. Each of the display-aliased characters in the templates you mentioned returns false for the Module:Unicode data function, except for U+0020 SPACE and U+00A0 NO-BREAK SPACE, which return true for  . So both of these traits can easily be tested. Choosing the replacement alias we want would require maintaining a list of same. I'm not sure a printable space character should be aliased in this manner. Maybe the cell background should be a different color with a footnote explaining yes, a whitespace character is there, and yes, you can copy it and paste it elsewhere. Also not sure "XXX" is appropriate for U+0080–0081. Maybe we want to display "PAD" and "HOP" instead?
 * 11. The existing chart for Javanese shows up with a cell height of 80px which seems excessive for the apparent line height of 33px on my screen. Preview of module output for Javanese looks fine. Better in my opinion. Maybe I just don't have the right fonts installed. But yes, cell height/width params can be added if there's a demonstrated need for this. Otherwise the browser should be trusted to stretch cells for large characters as needed. See "Also 6" above.
 * 12. I think if they are going to be linked, they shouldn't be piped to something else unless the character itself is an illegal title char and even then it shouldn't be linked to anything other than a title that paraphrases said character (e.g. ). Making ≅ a disambiguation page (then piping the link to a more specific topic because linking to disambiguation pages is bad) was a mistake in my opinion. And nothing on Letterlike Symbols should link to wikt. Probably only the CJK Ideographs and such (which represent whole words and where wikt has, or should have, a page of that exact title which Wikipedia will never have) should link to wikt. This could be added as a separate   mode.
 * 12, continued. If the character title is a redirect to some other page (such as a list of emojis, or an article about the subject represented by some symbol), that's fine. Someday the character itself might become a separate article, which is also fine. The template need not know or care about that. I'm thinking a list of link aliases for bad-title chars (mapping  to   and so on) would be a good solution. But only if we're going to be linking the characters at all, which is unclear.
 * 13.✅ I did keep the optional start/end parameters, because I figured subdivision would be wanted in some blocks for reasons including hugeness. Note that these need not be multiples of 16. The module will pad leftover cells accordingly with  which is currently styled the same as   but this can be changed.
 * / parameters have been scrapped in favor of a single   parameter which can contain multiple ranges (connected by hyphen or en dash, and separated from each other by comma, whitespace, the word "and", or in fact anything that's not a hex digit).
 * 14 and 15. If the unicode block display names can't be made to exactly match the "official" names in all cases, we'll need a (hopefully short) list of aliases. Adding a blocknamelink parameter which continues to default to  if empty would be easy and sufficient. Let's try to avoid having three sets of names wherever possible.
 * ✅  and  parameters added for differing cases. ―cobaltcigs 13:13, 14 September 2019 (UTC)
 * ✅ 16. I don't see why not. See 13.
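
The multi-range parameter described under item 13 can be sketched in Python as follows. This is only an illustration of the parsing rule stated above, not the module's actual Lua code; the function name `parse_ranges` is hypothetical. One wrinkle worth noting: the letters "a" and "d" in the word "and" are themselves hex digits, so this sketch assumes the word is removed before the digits are scanned.

```python
import re

def parse_ranges(text):
    """Parse a string such as '0600-0605, 061C and 06DD' into (start, end) pairs."""
    # Remove the standalone word "and" first; otherwise its letters
    # 'a' and 'd' would be read as hex digits.
    cleaned = re.sub(r"\band\b", " ", text, flags=re.IGNORECASE)
    ranges = []
    # One hex number, optionally joined to a second by a hyphen or en dash.
    # Anything else between matches acts as a separator and is skipped.
    pattern = r"([0-9A-Fa-f]+)(?:\s*[-–]\s*([0-9A-Fa-f]+))?"
    for match in re.finditer(pattern, cleaned):
        start = int(match.group(1), 16)
        end = int(match.group(2), 16) if match.group(2) else start
        ranges.append((start, end))
    return ranges
```

A single code point comes back as a range whose start and end are equal, so the caller can treat every entry the same way.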

I have some follow-up: DRMcCreedy (talk) 23:09, 10 September 2019 (UTC)
 * 3 Could the font just be a passed parm? Most charts don't use a specific font.
 * 5 The following blocks have specific footnotes: Template:Emoji (Unicode block), Template:Unicode chart Hangul Jamo, Template:Unicode chart Superscripts and Subscripts, and Template:Unicode chart Sutton SignWriting. Additionally, blocks with non-characters have the "Black areas indicate noncharacters (code points that are guaranteed never to be assigned as encoded characters in the Unicode Standard)" footnote: Template:Unicode chart Arabic Presentation Forms-A and Template:Unicode chart Specials.  And these blocks have deprecated notes: Template:Unicode chart General Punctuation, Template:Unicode chart Khmer, Template:Unicode chart Miscellaneous Technical, Tags (Unicode block), and Template:Unicode chart Tibetan.
 * 6 ✅ There are only 66 non-characters (https://www.unicode.org/faq/private_use.html#nonchar3) and Unicode has promised not to add any more. I think the black background is effective and would want to keep it.  I think it's safer not to put non-characters themselves into the charts as they are "not normally interchanged with other users"  (https://www.unicode.org/faq/private_use.html#nonchar2).  The code points are U+FDD0-FDEF, FFFE-FFFF, 1FFFE-1FFFF, 2FFFE-2FFFF, 3FFFE-3FFFF, 4FFFE-4FFFF, 5FFFE-5FFFF, 6FFFE-6FFFF, 7FFFE-7FFFF, 8FFFE-8FFFF, 9FFFE-9FFFF, AFFFE-AFFFF, BFFFE-BFFFF, CFFFE-CFFFF, DFFFE-DFFFF, EFFFE-EFFFF, FFFFE-FFFFF, and 10FFFE-10FFFF.
 * 8 The "Dashed Box Convention" is explained at https://www.unicode.org/versions/Unicode12.0.0/ch24.pdf#G8175 It's an oversight not having a note explaining this convention.  It was added to match Unicode's charts.  I think it's useful.  Depending on the font, without the dashed box U+0602 is easily confusable with U+060E, U+1F1E6 looks the same as capital A, etc.  As far as I know there's no way to determine which characters get a dashed box programmatically.  As of version 12.1 it's used on U+0000-0020, 007F-00A0, 00AD, 034F, 0600-0605, 061C, 06DD, 070F, 08E2, 0CF1-0CF2, 0D4E, 0F0C, 1039, 115F-1160, 17B4-17B5, 17D2, 180B-180E, 1A60, 1BAB, 1CF5-1CF6, 2000-200F, 2011, 2028-202F, 205F-2064, 2066-206F, 2D7F, 2E3A-2E3B, 3000, 303E, 3164, AAF6, FE00-FE0F, FEFF, FFA0, FFF9-FFFB, 10A3F, 11003-11004, 1107F, 110BD, 110CD, 111C2-111C3, 11A3A, 11A47, 11A84-11A89, 11A99, 11D45-11D46, 11D97, 13430-13438, 16F8F-16F92, 1BC9D, 1BCA0-1BCA3, 1D159, 1D173-1D17A, 1DA9B-1DA9F, 1DAA1-1DAAF, 1F1E6-1F1FF, E0001, E0020-E007F, and E0100-E01EF.
 * 10 Unicode charts use XXX (in a dotted box) for U+0080, 0081, and 0099 and I don't think Wikipedia's charts should contradict the cited source. (For some arcane history of these three characters, I recommend http://unicode.org/pipermail/unicode/2015-October/002876.html) I think the only way of determining the abbreviations to use in the charts is a hardcoded table.  They don't always match an alias.  For example U+E007F is displayed as "END". A lot of the code points that use the dashed box convention display abbreviations.  I haven't compiled a definitive list.
 * ✅ 13 In Template:Unicode chart Enclosed CJK Letters and Months the hangul subset isn't contiguous. Nor is the emoticon subset of Template:Unicode chart Miscellaneous Symbols.  I didn't add these features so I don't know what reaction you'll get from removing them.
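
The noncharacter list in item 6 follows a simple pattern (U+FDD0–FDEF, plus the last two code points of each of the 17 planes), so it does not need to be stored as a table. A minimal Python sketch, with the hypothetical name `is_noncharacter`; how the module itself detects these is not shown here:

```python
def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and for the last two code points of every plane."""
    # (cp & 0xFFFE) == 0xFFFE holds exactly when the low 16 bits are FFFE or FFFF.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

# Enumerate the whole code space (U+0000..U+10FFFF) to confirm the count.
NONCHARACTERS = [cp for cp in range(0x110000) if is_noncharacter(cp)]
```

The enumeration yields 66 code points (32 in the Arabic Presentation Forms-A run, plus 2 in each of 17 planes), matching the figure quoted from the Unicode FAQ.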


 * 9 I think the current solution to control characters and invisible format characters is best, i.e. use the acronym or abbreviation in a dotted square, following the example of the official Unicode code charts. The new Basic Latin and Latin-1 Supplement charts show the control codes as reserved which is incorrect (they are assigned, with the general category Cc, but do not have formal character names, although they do have formal character name aliases). I also notice that U+003D (=) and U+007C (|) do not display properly. BabelStone (talk) 11:02, 11 September 2019 (UTC)

Update:
 * I've restored the  parameter. Any refs inputted here will be numbered before the auto-generated refs. Perhaps I should also have it sanitize anything that's not actually a   by wrapping it in a   tag so it doesn't appear in the title bar.
 * I prefer having the auto-generated refs first; that way the version, which covers the whole chart, is the very first one, and additional notes, usually covering just a few codepoints, are at the end. This is just a preference. DRMcCreedy (talk) 21:47, 12 September 2019 (UTC)


 * I've added a  parameter that allows multiple ranges to be specified. Potentially in the wrong order, even. Perhaps they should be force-sorted ascendingly. And sanitized to avoid duplication due to overlap.
 * Black blocks were actually easy to detect. Previous code assumed anything containing "<" was  when it can actually be   or  . Whoops. It's all right there in Module:Unicode data. Will work on control chars next.
 * I've discovered Module:Unicode data/aliases includes (among other things) abbreviations for control characters. It does in fact use PAD and HOP.
 * The three characters that Unicode displays as "XXX" do indeed have abbreviations in NameAliases.txt but they all have a type of "figment" as in "figment of one's imagination". I feel strongly that we shouldn't assign abbreviations to the charts that contradict the ones used in the actual, cited Unicode charts. DRMcCreedy (talk) 21:47, 12 September 2019 (UTC)


 * I gave the control characters a light blue background and an explanatory footnote similar to those for RESERVED and NONCHARACTER. Also dashed boxes around the abbreviations, which are loaded from here. Some have multiple abbreviations. The current behavior is to choose the last one, because at brief glance that seemed most correct in most cases. I'd rather we move the "official" or preferred abbreviation to the top and consistently select the first one instead. I've yet to research what, if anything, might be broken by changing abbreviation order.
 * Module:Unicode data/aliases is generated from Unicode's NameAliases.txt file. It looks like it is in the same order, so any tweaking we do to the order would be problematic when the file is updated.  If we changed the script that creates aliases we would just be moving the logic from the chart script to the generation script.  Other users of alias may not have the same requirement, so I think the determination of what to use in the charts belongs in the chart script.  I have another abbreviation issue but I'll do that in a new section for clarity. DRMcCreedy (talk) 21:47, 12 September 2019 (UTC)

―cobaltcigs 09:17, 12 September 2019 (UTC)
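
The force-sorting and overlap sanitizing suggested in the update above could look something like the following Python sketch. It assumes the ranges have already been parsed into (start, end) integer pairs; the name `normalize_ranges` is hypothetical, and merging ranges that merely touch (as well as those that overlap) is a design choice of this sketch rather than anything decided in the discussion.

```python
def normalize_ranges(ranges):
    """Sort (start, end) pairs ascending and merge any that overlap or touch."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or directly follows the previous range: extend it
            # instead of listing the shared code points twice.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged
```

With this step in place, ranges given in the wrong order or with overlap still produce each code point once, in ascending order.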

The font problem, explained
The only way to load custom css definitions is through the  extension tag. This can be produced in the module by preprocessing the previous wikitext/pseudo-html, or by using  to the same effect. Either way, the  page must be of "content model: Sanitized CSS" meaning it must be in the template namespace and have a title ending with ".css" which puts you in a mode that checks for syntax errors and disallows the use of templates, modules, parserFunctions, or anything other than hard-coded css (with a few features excluded for security reasons).

In practice that means there's no way for a template/module parameter such as  (or for any string of text obtained or composed at module runtime) to create a reusable css. So any user-supplied font specs would need to be hard-coded as a  attribute to be used at all. Workarounds include, in descending order of sloppiness:
 * 1) Duplicating that much code in the   attribute of every single   cell (which would be stupid as hell).
 * 2) Assigning the bulky   crap one time only to the root   element, then having the   css (conveniently everything that's not a character cell   is a  ) loaded from here attempt to negate any foreseeable user input back to the default so that the table's style attribute appears to only affect the   (codepoint grid) cells. This would be very difficult to do well, considering the defaults we'd seek to revert to could differ according to user skin and other environmental factors.
 * 3) Continue using script within each cell and suffer its inefficiency and incompleteness.
 * 4) Placing this much css (more to be added later) on a single acceptable css source page, then importing it via.
 * 5) Make a better version of Template:Script by dividing the css into 154 one-liner subpages of CSS, each named to reflect the ISO 15924 code, and imported only when the need for it is detected (using this). Needing more than one in the same table will most likely be rare, so the question of how many small loads are processor-equal to one big load is probably not even worth testing.
 * 6) Avoid forking and turn the original Template:Script into what we want (use consistent names, include everything, and use a module instead of the switch statement and sub-template spaghetti logic).
I'm prepared to go with #4 for now, then upgrade to #5–6 only after all the other issues are addressed. ―cobaltcigs 09:17, 12 September 2019 (UTC)
 * I've never been very keen on specifying fonts on the Wikipedia side, because 1) most fonts for most Unicode scripts are not available on most users devices without downloading them; 2) in the past editors have tended to specify fonts that they have on their own system so that it looks nice for them, without considering other users; and 3) the Wikipedia specified fonts may override users' font preferences set in their browser (or in Wikipedia settings). Personally I would rather not specify any fonts, and leave it to the user's browser to apply an appropriate font, but I know that this is a minority view, so I'm OK with your suggested solution. BabelStone (talk) 13:06, 12 September 2019 (UTC)
 * My understanding was that certain browsers would show the little squares even if a suitable font was installed, unless specifically told to use that font. I have no idea whether this is (still?) accurate. I suppose I could add a parameter like . Then we could ask several Windows users whether all the charts look okay with no fonts specified. ―cobaltcigs 19:04, 16 September 2019 (UTC)
 * ✅  parameter now exists as an option. ―cobaltcigs 21:38, 17 September 2019 (UTC)

Formatting abbreviations
Besides worrying about which abbreviations are used in the charts, there's an issue of formatting. Today, long ones are often split into two or more lines to control the width of the chart. An extreme example is NULL NOTE HEAD in Template:Unicode chart Musical Symbols but this practice happens in other places like Template:Unicode chart Mongolian and Template:Unicode chart Variation Selectors Supplement. I haven't checked to see if the abbreviations are always in a dashed box but maybe we could have a parm like ...|abbr|1D15|NULL NOTE &nbsp;HEAD&nbsp; to preserve the ability to format these in the current fashion. In any case, formatting is something to consider. DRMcCreedy (talk) 21:47, 12 September 2019 (UTC)
 * Eww. See User:BabelStone/sandbox for an attempt to replicate that (without any  crap, which is great!). Note that 1D173–1D17A are identified as "format" characters in this file, but "NULL NOTE HEAD" is not. Hence the difference in css/color. The pink can of course be changed later. ―cobaltcigs 20:45, 13 September 2019 (UTC)
 * Wow, I've never realised that U+1D159 is not a format character. Are there any other characters displayed as a dashed box around text that are not format or control characters? I don't think so (variation selectors are gc=Mn). The worrying thing is there seems to be no way of extracting the information from the UCD, so it relies on visually checking the Unicode code charts, but what if it changes suddenly to a graphic character in a new version of Unicode? My gut feeling is that gc=So is wrong if the character has no visible glyph and is not whitespace. BabelStone (talk) 22:52, 13 September 2019 (UTC)
 * I couldn't immediately work out where you are specifying a smaller font size for "NULL NOTE HEAD" compared with "Begin Beam" etc. I think that all the dashed boxes need a smaller font size because (on my system at least) the dashed letters are much larger size than Basic Latin letters, and make the cells overwide. Can we simply add "font-size:75%" for td.box in Template:Unicode chart/styles.css, or is there more to it? BabelStone (talk) 23:30, 13 September 2019 (UTC)
 * This text uses span.small-1 { font-size:80%; } span.small-2 { font-size:59%; } wherein the suffix digit is determined by the number of spaces converted to linebreaks in whatever text is shown (which may be read from the aliases file or from a override parameter). Then the property white-space:pre; forces   to show up as literal linebreaks so we don't have to resort to . Thus one-word abbreviations such as   use the same size as regular chars. All of this can be easily changed. For now, I've tightened the dashed box and cell margins/padding a little bit. ―cobaltcigs 10:08, 14 September 2019 (UTC)
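
The class selection described in the last reply can be sketched in Python as follows. The function name `abbreviation_markup` is hypothetical, and capping the suffix at 2 is an assumption made here only because the quoted css defines just `small-1` and `small-2`; the module itself may behave differently for longer abbreviations.

```python
def abbreviation_markup(text):
    """Return (css_class, display_text) for an abbreviation shown in a dashed box."""
    words = text.split()
    # Each space between words becomes one line break.
    breaks = max(len(words) - 1, 0)
    # One-word abbreviations get no class and keep the regular size.
    # Assumption: only small-1 and small-2 exist, so larger counts are capped.
    css_class = f"small-{min(breaks, 2)}" if breaks else None
    # With white-space:pre in the css, "\n" renders as a literal line break.
    return css_class, "\n".join(words)
```

For example, "NULL NOTE HEAD" has two spaces and so maps to `small-2` with the three words stacked, while a one-word abbreviation such as "PAD" is returned unchanged with no class.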

Version
There have been many past discussions about how to determine which Unicode version to show in the footnote of the chart. Because they were manually updated, it wasn't practical to have a master switch for the version. If the charts are created using Module:Unicode data it might be possible to do away with the mindless updating I do once a year for all the charts. A new Module:Unicode data/version item could be added that is manually updated after all of the other Module:Unicode data files are updated. Basically, it's just a string field to say "We've updated all the other data to version x". If the version footnote was pulled from that string, it would alleviate a lot of manual effort. It would mean adding Module:Unicode data/version to the list of "regenerate the charts if tables x, y, and z change". DRMcCreedy (talk) 21:47, 12 September 2019 (UTC)
 * FYI: After a few updates, all of the Module:Unicode data subpages are now up-to-date (Unicode version 12.1). DRMcCreedy (talk) 04:45, 14 September 2019 (UTC)
 * I do like the idea of centralizing the version string. Even as a single-purpose one-liner module return "12.1" would be fine. ―cobaltcigs 10:17, 14 September 2019 (UTC)
 * ✅ ―cobaltcigs 09:25, 17 September 2019 (UTC)
 * P.S. Can I get a complete list of subpages that have actually changed so I can update my localhost wiki (on which I test most of this stuff before posting) accordingly? ―cobaltcigs 10:42, 14 September 2019 (UTC)
 * I updated Module:Unicode data/category, Module:Unicode data/control, Module:Unicode data/scripts, Module:Unicode data/names/002, and Module:Unicode data/names/003. Some changes were unrelated to the release of v12.1.  For example, U+2BC9, 2BFF, and 2E4F were missing for some reason. DRMcCreedy (talk) 16:43, 14 September 2019 (UTC)
 * On Wiktionary I updated the U+2xxx names module and several others for 12.0 back in March, but I didn't bother with the Wikipedia modules because they weren't being used. But I'm glad to see that now they are at last. — Eru·tuon 09:01, 16 September 2019 (UTC)

Perhaps it would be helpful, now, to put an edit notice on all Unicode data subpages that says "Please remember to update Module:Unicode data/version if applicable" etc. ―cobaltcigs 09:25, 17 September 2019 (UTC)

Pink cells
A footnote says “Pink cells indicate non-printable format characters.” That is untrue: they currently indicate all format characters, some of which are printable. It would be more useful, I think, to highlight default ignorable characters. Gorobay (talk) 03:27, 14 September 2019 (UTC)
 * Okay. I shall try to figure out how to distinguish between these using the available data modules. ―cobaltcigs 11:19, 14 September 2019 (UTC)
 * There wasn't a data module for the Default Ignorable property, so I added  to Module:Unicode data/sandbox and created Module:Unicode data/derived core properties. — Eru·tuon 10:06, 16 September 2019 (UTC)


 * Alternatively I could just take out the word "non-printable" (my own misconception at the time I wrote it) and make the existing footnote statement correct.
 * But assuming "default-ignorable" is the more important concept, is my understanding correct that the characters identified by :
 * (a) are a proper superset of  characters, and
 * (b) are a disjoint set of  characters?
 * This is important to the extent that we can't highlight the same cell in two colors. The  css would be retired if we go this route, but the   designation would still be conveyed in the info panel below—which is (relatively speaking) a much newer feature than the pink highlight.
 * Counting the default-ignorables and determining whether to show the footnote as plural/singular/none (just like the others) will be a two-minute coding task given the available functions. What kind of verbiage would we want in the footnote for the default-ignorables? Seems like we should try to briefly explain what that means. ―cobaltcigs 18:43, 17 September 2019 (UTC)
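
The plural/singular/none footnote logic mentioned here (and in items 4 above) amounts to a three-way choice on a count. A minimal Python sketch; the function names are hypothetical and the footnote wording is placeholder text, not the wording used by the module or the existing templates:

```python
def count_footnote(count, singular, plural):
    """Pick the footnote text for a count, or None when no footnote is needed."""
    if count == 0:
        return None       # nothing to explain, so the note is left off
    if count == 1:
        return singular
    return plural

def reserved_footnote(count):
    # Placeholder wording, for illustration only.
    return count_footnote(
        count,
        "Grey area indicates a non-assigned code point",
        "Grey areas indicate non-assigned code points",
    )
```

The same helper would serve for noncharacters, control characters, and default-ignorables once each has its own pair of wordings.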

General Punctuation, row U+206x
I've noticed on User:BabelStone/sandbox that 2061–2064 and 206A–206F appear to have neither a visible glyph nor an official abbreviation. Assuming this isn't simply a font deficiency on my end, would inserting things like this based on the pdf be wildly inappropriate (see parameters)? And I don't just mean the words ASS and NADS, lol. ―cobaltcigs 11:19, 14 September 2019 (UTC)


 * I can't find a file in the Unicode Character Database that lists the display forms for the dotted box characters. They aren't in NamesList.txt, which is parsed into the PDF that you linked to. So they would have to be gathered manually from the PDFs, unless they can be found somewhere else. — Eru·tuon 04:13, 18 September 2019 (UTC)
 * As far as I know, there isn't anything in the UCD. I've always determined dotted box notation manually. BTW: I think the display_20xx parms above are appropriate.  DRMcCreedy (talk) 04:40, 18 September 2019 (UTC)
 * To clarify, "manually" would mean by visual approximation. Copy/paste gives us private-use codepoints assigned to arbitrary glyphs which represent the whole abbreviation (in some font that probably doesn't exist outside the PDF). So much eww. ―cobaltcigs 13:39, 18 September 2019 (UTC)
 * If you're interested, the fonts with the dashed glyphs (SpecialsUC4/5/6.ttf) are bundled with the free Unibook application that is used to generate the Unicode and ISO/IEC 10646 code charts. BabelStone (talk) 16:06, 18 September 2019 (UTC)

Info panel demo
Click the links and hold your breath. ―cobaltcigs 03:13, 16 September 2019 (UTC)


 * Thanks, I like the idea, especially showing a large version of the character. I think the U+ and character name do not need to be in a huge bold font (maybe just normal bold), and the "(assigned)" is redundant -- using a normal font and removing "assigned" should also reduce the annoying horizontal expansion and contraction of the box as you click on characters with names of different lengths. I suppose the UTF-8 is useful to some people, but I would remove the characterization of the UTF-8 hex values as I cannot see how they could be useful. BabelStone (talk) 09:05, 16 September 2019 (UTC)


 * "assigned" is the default phrase returned when a character in question is not "control", "format", "surrogate", "private-use", "unassigned", "space-separator", "line-separator", or "paragraph-separator". The struck-out categories will probably never be part of any chart, which leaves five that are potentially interesting.
 * ✅ Making the chart stay continuously at  would probably help. Setting th 8% and td 5.75% would add up to the same, and might also be helpful.
 * I've got it loading named character entity references from a subpage in addition to calculating the numeric ones, which is probably the single crowdpleasingest information here. The UTF-8 is of interest to the extent that it's what our urlencoding uses (Δ is 0xCE 0x94 and %CE%94). UTF-16 less so, but I thought about it.


 * The mojibake depiction of these bytes as separate chars was slightly helpful when debugging but not meant as a serious feature.
 * ―cobaltcigs 10:13, 16 September 2019 (UTC)
 * Nice method – I was surprised that it could be done without JavaScript! Maybe instead of the values from Module:Unicode data/control, which include only some of the General Categories, the table could show the long name of the actual General Category. (I've added the long names of the General Categories to Module:Unicode data/category.) — Eru·tuon 10:37, 16 September 2019 (UTC)


 * It relies upon the css selector to show/hide the panel for any given codepoint. I think this would be nearly adequate if not for the vertical anchor-jumping. I suppose moving the info panel to the top (below the pdf link and above the column headers) would make it slightly less annoying, but it would look weird. Another consequence of this is that whenever multiple charts are present on the same page, opening an info panel on one chart will close the info panels on all the others. So using Javascript would probably be better. It would only require convincing the right people that the feature is worthwhile and not too app-like.
 * For now I've reduced the size of the bold-face character name from 125% to 110%, set the root  element to full page-width, and set the columns to fixed percentages that add up to 100%.
 * I've also removed the 'Amiri' font from the  css class, because it makes the U+FDFD ligature wide enough to make these percentages meaningless. I don't know if other characters are similarly affected. I'll need to install the first three fonts to test whether they have the same problem (or, indeed, others).
 * I've now made it pull the "long name" (which appears to always be more interesting than the word "assigned") from Erutuon's info. Hopefully it's never  and hopefully the extra info won't be overwritten by updates.
 * ―cobaltcigs 20:50, 16 September 2019 (UTC)
 * You can rely on  never returning   (at least when supplied a valid code point);   guarantees that. The return value is either a "real" category when the code point is found in   or   or Cn (Unassigned). — Eru·tuon 22:40, 16 September 2019 (UTC)
 * Oops. Actually, what I said is true of Module:Unicode data/sandbox, but at the moment Module:Unicode data is buggy. — Eru·tuon 23:35, 30 September 2019 (UTC)
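For readers following the thread, here is a rough Python sketch of the values discussed in this section: the status phrase with its "assigned" default, the UTF-8 bytes, and the percent-encoded form. It is not the module's Lua code. The mapping from General Category codes to the eight phrases is an assumption about how such a lookup could work, the function names are hypothetical, and Python's `unicodedata` reflects whichever Unicode version ships with the interpreter.

```python
import unicodedata
from urllib.parse import quote

# Assumed mapping from General Category to the phrases listed above;
# anything not in this table falls back to "assigned".
SPECIAL_LABELS = {
    "Cc": "control",
    "Cf": "format",
    "Cs": "surrogate",
    "Co": "private-use",
    "Cn": "unassigned",
    "Zs": "space-separator",
    "Zl": "line-separator",
    "Zp": "paragraph-separator",
}


def status_label(cp: int) -> str:
    """Phrase for a code point; unicodedata.category() always returns a
    two-letter code ("Cn" for unassigned), so there is no empty case."""
    return SPECIAL_LABELS.get(unicodedata.category(chr(cp)), "assigned")


def utf8_hex(cp: int) -> str:
    """UTF-8 bytes as space-separated hex. Surrogates cannot be encoded
    and raise UnicodeEncodeError."""
    return " ".join(f"0x{b:02X}" for b in chr(cp).encode("utf-8"))


def percent_encoded(cp: int) -> str:
    """The same bytes in URL percent-encoding."""
    return quote(chr(cp), safe="")


print(status_label(0x0394), utf8_hex(0x0394), percent_encoded(0x0394))
# assigned 0xCE 0x94 %CE%94
```

The last line reproduces the Δ example given earlier in the section (0xCE 0x94 and %CE%94).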

Selectability: CSS vs. plain text
Putting general category after character name is good; show/hide is good; 100% width of chart is very good. At present you cannot select and copy the entire info panel information: UTF-8 and HTML headings, as well as parentheses around general category, are not selected, and there is no space between character name and general category so the two are concatenated on copy. Can we make all parts of the info panel copyable, and separate parenthesised general category from character name by a space character rather than putting in different cells? BabelStone (talk) 10:15, 17 September 2019 (UTC)
 * What you see there is actually an intentional css effect (see instances of  on the styles.css page). This is similar to navboxes (example) where they use a spaced U+00B7 MIDDLE DOT  as a non-selectable separator for list items . Here I've used commas and spaces instead, and also used the same technique for   list labels. It could all easily be reverted to plain text. I'll await further discussion about whether it should, because it did take a bit of work to make it look right. And really, the whole idea here was actually to help users copy the   html character entity reference without accidentally including the adjacent comma. ―cobaltcigs 12:01, 17 September 2019 (UTC)
 * Ah, that explains it. Personally, I prefer plain text so that the user can select everything. I think we can dispense with the comma between HTML forms (semi-colon followed by a comma just looks weird), and separate them with a space. BabelStone (talk) 16:11, 18 September 2019 (UTC)

Actual aliases vs. corrections
Can we have a demo of the info panel for a block with one or more characters that have a formal alias? I suggest Vertical Forms with its horrendously long name and alias for FE18. BabelStone (talk) 10:22, 17 September 2019 (UTC)
 * Eww, a spelling error ("BRAKCET"). So the correctly spelled name is currently not loaded at all because it's recorded in the aliases file as a  rather than an  . Aliases are currently loaded by the module (see control characters in the Latin chart above), whereas corrections will be a new concept which I'm not yet sure how best to handle. Do we want to show the misspelled title (maybe with a sic tag, even) and note the correction as such on the next line? Or should we just replace it outright without comment? I suppose I'll begin reviewing the other  s vs. what names they are correcting, to see how trivial or major their differences tend to be. For now, here's what the Vertical Forms block currently looks like: ―cobaltcigs 12:01, 17 September 2019 (UTC)


 * If we do want the corrections to appear below the boldface name, similarly to aliases (see screenshot from ye olde localhoste), I'm ready to update this module accordingly. Note: I did check the list and confirm no codepoint has both a correction and an alias. Perhaps we also want some kind of footnote explaining aliases and corrections to the reader, but I'll hold off on that. ―cobaltcigs 13:55, 17 September 2019 (UTC)
 * I like how the correction is shown directly below the name in your screenshot; it makes it easy to compare the two. — Eru·tuon 17:57, 17 September 2019 (UTC)
 * ✅ And with that in mind I've also removed the font-size enlargement of the bold-face character name. ―cobaltcigs 18:17, 17 September 2019 (UTC)
 * The code point name is immutable so it should always be shown as-is, as you're doing (even when it's clearly wrong). As far as the data is concerned, "correction" is just another type of alias like "alternate", "abbreviation", and "figment".  I think all alias types should be shown using the "Type: ALIAS" format without need for an explanation.  It looks like that isn't being done for code points like U+0093, etc.  Lastly, I wouldn't count on there never being a second alias to a code point with a correction type alias.  There's no such restriction, even though that's the case right now. DRMcCreedy (talk) 18:56, 17 September 2019 (UTC)
 * For the sake of clarity I'll use U+000A as a more extreme example. Are you saying it should look like this?
 * (visually at least; never mind the style attributes approximating current css class effects; actual markup will be much shorter)
 * i.e. putting aliases of all types (including abbreviations) in a single list, in the order given in the aliases file, with zero regard for what type of alias they are, and without choosing any of them to replace the name  at the top. I can do that once certain that's what you mean. Let's also revisit the question of how to decide which of multiple abbreviations should be shown in the box. ―cobaltcigs 20:09, 17 September 2019 (UTC)

My concern with using different chart abbreviations than Unicode is that there is no right answer. If someone were to change the Wikipedia chart abbreviation for U+000A from LF to NL would that be wrong/revertable? What about LINE ? Or LFEED ? If we don't have a definitive way to determine the chart abbreviation we open ourselves up to edit wars. Being able to cite the actual Unicode chart gives us one, definitive chart abbreviation. Great work so far, BTW. DRMcCreedy (talk) 22:10, 17 September 2019 (UTC)
 * Related: Can I also get your opinion on whether to put atypical abbreviations in the boxes for above? ―cobaltcigs 20:15, 17 September 2019 (UTC)
 * Yes. I'd display all of the aliases in the order they appear in NameAliases.txt (which is preserved in Module:Unicode data/aliases).  But I also think the type of alias is useful to know.  My preference would look like this:
 * ✅, see above. Using   because is for poetry and mailing addresses. And I've just noticed the word "alias" won't actually appear to the reader. ―cobaltcigs 17:09, 18 September 2019 (UTC)
 * As far as which abbreviation to use in the Wikipedia chart, I think it should match the official, cited Unicode chart. I'm guessing that a lot of them match the first/only abbreviation type of named alias but obviously not always.  As you mentioned, U+206x is a good example of chart abbreviations that don't match named aliases.  I'm thinking a table of chart abbreviations would be required.  You could probably default the chart abbreviation if no exception is found but would it be worth the processing to not find a match first or is it faster to just add them all to a table?

The citation might be overkill. Although the nuances are pretty complicated so maybe the citation is justified. DRMcCreedy (talk) 02:04, 18 September 2019 (UTC)
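To make the "Type: ALIAS" layout concrete, here is a minimal Python sketch (not the module's Lua) that reads lines in the `code;alias;type` layout of NameAliases.txt and lists every alias in file order with its type as a label. The sample lines are typed in by hand for illustration and should be checked against the real file; the function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Hand-typed sample in the NameAliases.txt layout (code;alias;type).
SAMPLE = """\
# illustrative excerpt, not the full file
000A;LINE FEED;control
000A;NEW LINE;control
000A;END OF LINE;control
000A;LF;abbreviation
000A;EOL;abbreviation
000A;NL;abbreviation
FE18;PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET;correction
"""


def parse_aliases(text: str) -> Dict[int, List[Tuple[str, str]]]:
    """Map each code point to its (alias, type) pairs, keeping file order."""
    result: Dict[int, List[Tuple[str, str]]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        code, alias, kind = line.split(";")
        result.setdefault(int(code, 16), []).append((alias, kind))
    return result


def alias_lines(aliases: Dict[int, List[Tuple[str, str]]], cp: int) -> List[str]:
    """One "Type: ALIAS" line per alias, with no alias type singled out."""
    return [f"{kind.capitalize()}: {alias}" for alias, kind in aliases.get(cp, [])]


for line in alias_lines(parse_aliases(SAMPLE), 0x000A):
    print(line)
```

Because the pairs are stored in a list per code point, a code point that one day gains both a correction and another alias type needs no special handling.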
 * Okay, clearly I misinterpreted "I think all alias types should be shown using the Type: ALIAS" to mean "replace more specific alias-type labels with the word ALIAS". Makes a lot more sense with a picture drawn, glad I asked.
 * So my actual concern about U+206x is that stand-in symbols might be mistaken for the actual glyph even by readers otherwise familiar with "normal" control/format character abbreviations which consist of multiple capital letters. So some explanatory footnotes might really be needed there.
 * Agreed. My first draft of a note would be "A dashed box indicates characters which normally have no visible display or only modify the display of other characters. "
 * Currently the display text can be overridden from the calling environment (ultimately, a block-specific template) for all assigned codepoints with few restrictions, which has been done in the U+206x example (and less constructively in the "Vulgar" Latin sandbox section). If we do load a master list of favored abbreviations from a sub-module (containing everything from  to  ), the  parameters could be totally deleted.
 * ✅ and
 * ―cobaltcigs 23:14, 17 September 2019 (UTC)
 * Oops, I completely forgot about the parm.  I like the idea of a master list because it centralizes the data but either approach will work. DRMcCreedy (talk) 02:04, 18 September 2019 (UTC)
 * +1 for a master list. BabelStone (talk) 16:13, 18 September 2019 (UTC)
 * ✅ ―cobaltcigs 06:42, 19 September 2019 (UTC)
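The "master list with a default" idea raised above could look something like this. It is a Python sketch under stated assumptions, not the actual contents of any sub-module: the two override entries are illustrative guesses for row U+206x that would need checking against the PDF charts, the alias data is a hand-typed excerpt, and `chart_abbreviation` is a hypothetical name.

```python
from typing import Dict, List, Optional, Tuple

# Illustrative override entries only; verify against the official charts.
DISPLAY_OVERRIDES: Dict[int, str] = {
    0x206B: "ASS",
    0x206E: "NADS",
}

# Hand-typed excerpt of (alias, type) pairs in alias-file order.
ALIASES: Dict[int, List[Tuple[str, str]]] = {
    0x000A: [
        ("LINE FEED", "control"),
        ("NEW LINE", "control"),
        ("END OF LINE", "control"),
        ("LF", "abbreviation"),
        ("EOL", "abbreviation"),
        ("NL", "abbreviation"),
    ],
}


def chart_abbreviation(cp: int) -> Optional[str]:
    """Look in the master list first, then fall back to the first alias
    of type "abbreviation"; return None if neither exists."""
    if cp in DISPLAY_OVERRIDES:
        return DISPLAY_OVERRIDES[cp]
    for alias, kind in ALIASES.get(cp, []):
        if kind == "abbreviation":
            return alias
    return None
```

With this shape the fallback costs one extra dictionary lookup per cell. The real trade-off is probably less about speed than about whether the first listed abbreviation can be trusted to match the cited chart; listing every entry explicitly avoids that question.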

Master list complete
See Module:Unicode chart/display and make any corrections/amendments as needed. Maybe I missed a few reading all those PDFs. Except for the CJK blocks where even "skimming" would be too generous a term. params will be whacked soon. ―cobaltcigs 04:38, 19 September 2019 (UTC)
 * ―cobaltcigs 06:42, 19 September 2019 (UTC)
 * I've reviewed the list and made some changes. DRMcCreedy (talk) 17:28, 28 September 2019 (UTC)

Going horizontal
I've made the utf/html info slide to the right rather than downward when an alias list is present. Seems like a more efficient use of space. Seems to look okay next to the infamous BRAKCET correction, which I've confirmed is the longest string in the alias file. ―cobaltcigs 20:05, 18 September 2019 (UTC)
 * I don't like the other information being forced to the right when there's an alias. It's unexpected and I don't think the savings in vertical space make up for it.  Sorry, it just looks misaligned to me.
 * Unrelated to the down vs. side option, I have two comments on the displayed information when you click on a code point:
 * First, can we move the hex HTML escape sequence before the decimal one (&#x... / &#...)? I've never understood why someone would go through the trouble of calculating the decimal value of a code point in order to create an HTML escape sequence but maybe that's just me.  In any case, having the hex value first would align better with the UTF-16 information directly above it. Hopefully the hex usage is more common anyway so it would make sense putting it first.
 * Second, instead of the wording "Introduced in Unicode version x", I'd like to use more precise wording that the source uses. This wording change seems trivial but it gets around the messy issue of various pre-1.1 characters.  If Age is 1.1 (the earliest shown in the file), it would say "Assigned as of Unicode 1.1".  Otherwise it would say "Newly assigned in Unicode x". Thanks. DRMcCreedy (talk) 17:28, 28 September 2019 (UTC)
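Both requests are small enough to sketch. The Python below is only an illustration of the proposed behaviour, not the module's Lua: hex reference before decimal, and the two Age wordings quoted above. The function names are hypothetical, and the Age value is assumed to arrive as a plain string such as "1.1" or "13.0".

```python
from html.entities import codepoint2name
from typing import List, Optional


def numeric_references(cp: int) -> List[str]:
    """Numeric character references, hex form first, then decimal."""
    return [f"&#x{cp:X};", f"&#{cp};"]


def named_reference(cp: int) -> Optional[str]:
    """Named reference if Python's HTML 4 entity table has one, else None."""
    name = codepoint2name.get(cp)
    return f"&{name};" if name else None


def age_note(age: str) -> str:
    """Wording keyed on the Age value; 1.1 is the earliest value listed."""
    if age == "1.1":
        return "Assigned as of Unicode 1.1"
    return f"Newly assigned in Unicode {age}"


print(numeric_references(0x0394))  # ['&#x394;', '&#916;']
print(age_note("1.1"))             # Assigned as of Unicode 1.1
```

Special-casing only the string "1.1" sidesteps the pre-1.1 history entirely, which seems to be the point of the suggested wording.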

Named subsets added
To more thoroughly address DRMcCreedy's item #13, I've added a way to refer to pre-defined named subsets in lieu of inputting a. I suppose it may also be feasible to do unions/differences/intersections at some point, if there's a demand for it.

Also new is the black line indicating skipped rows. Seems like a helpful feature.

The block name is also optional now. If omitted, there's no PDF link. But we can still set a display title and a link target for the subject. This would allow greater flexibility in generating a chart that transcends block divisions, such as "all control characters" (the subset name for which could be "special" in that it's generated by a function reading an existing data file, rather than hardcoded). But here's a sillier example for now.

―cobaltcigs 13:45, 20 September 2019 (UTC)
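As a rough picture of how named subsets, set operations, and the skipped-row marker could fit together, here is a Python sketch. It is not the module's Lua: the subset names and contents are hypothetical examples, and it assumes charts are laid out in rows of 16 code points.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical named subsets; a real list would live in a data page.
NAMED_SUBSETS: Dict[str, Set[int]] = {
    "ascii_digits": set(range(0x30, 0x3A)),
    "ascii_uppercase": set(range(0x41, 0x5B)),
    "ascii_lowercase": set(range(0x61, 0x7B)),
}


def chart_rows(code_points: Iterable[int]) -> List[Tuple[int, bool]]:
    """Return (row_start, skipped_before) for each 16-wide row that
    contains at least one selected code point. skipped_before is True
    when one or more rows were omitted just above this row, which is
    where the heavy black line would be drawn."""
    rows = sorted({cp >> 4 for cp in code_points})
    result = []
    previous = None
    for row in rows:
        skipped = previous is not None and row > previous + 1
        result.append((row << 4, skipped))
        previous = row
    return result


# Union of two subsets: rows U+003x and U+006x, with a gap in between.
selection = NAMED_SUBSETS["ascii_digits"] | NAMED_SUBSETS["ascii_lowercase"]
print(chart_rows(selection))  # [(48, False), (96, True), (112, False)]
```

Since the subsets are plain sets, unions, differences, and intersections come for free (`|`, `-`, `&`) if there turns out to be demand for them.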


 * I'd lean towards a jagged line like a ripped piece of paper but the thick black line is certainly noticeable enough for the user to realize something's going on. I would, however, like the notes to say "heavy" or "thick" black line because every row has a "black horizontal line". DRMcCreedy (talk) 17:28, 28 September 2019 (UTC)

Orientation of glyphs for vertical scripts
For scripts such as Mongolian and Phags-pa which are written in vertical columns, the glyphs in the font have horizontal orientation so that complete runs of horizontal text can be rotated into vertical orientation by a higher level protocol (commonly CSS). Currently, in our code charts we rotate the glyphs into vertical orientation. This used to match the Unicode code charts, which used to show vertically-oriented glyphs for Mongolian and Phags-pa, but a few years back the editor of the Unicode code charts deliberately changed the Mongolian and Phags-pa code charts to show horizontally-oriented glyphs to reflect how the glyphs are represented at the font level. My question is, should we continue to rotate glyphs in the dynamic Mongolian and Phags-pa charts or should we leave them in horizontal orientation to match the current Unicode code charts? My preference is to rotate into vertical orientation as this matches user expectation (it is how Mongolian and Phags-pa glyphs are presented in books on these scripts). BabelStone (talk) 08:12, 28 September 2019 (UTC)
 * I don't have a strong preference, although I do think Unicode showing them horizontally seems strange. Vertical seems better. DRMcCreedy (talk) 17:28, 28 September 2019 (UTC)

Unicode 13.0
Unicode 13.0 will be released in March. Can we complete outstanding work on the Unicode chart module by then? Or shall we continue to use the old Unicode chart templates for the Unicode 13 update? BabelStone (talk) 10:16, 10 January 2020 (UTC)

The cell displayed for U+E003B (TAG SEMICOLON) contains a colon in a dashed box instead of a semicolon
The chart shown on page Tags_(Unicode_block) shows the various tag characters as their normal version in a dashed box, but the character shown in the box for U+E003B (TAG SEMICOLON) is a colon instead of a semicolon. I'm not quite sure where/how to update the template. 81.107.76.114 (talk) 00:29, 12 August 2020 (UTC)

Missing end tag for table
Unicode chart has little usage guidance, and I came to Module talk:Unicode chart (this very page), which has 6 missing end tags for , all associated with Unicode chart. So I went to Pages that link to "Template:Unicode chart". There are 6 pages that transclude Unicode chart, and they all have missing end tags for .

So, my request is either abandon this project, or write some usage notes that include how to use it without leaving a missing end tags lint error for . —Anomalocaris (talk) 07:32, 8 October 2023 (UTC)
 * Vanisaac mistakenly got rid of the end of the table while inserting this module into Template:Unicode chart. SWinxy added it back, but inside the noinclude tag. I just moved it so that it was transcluded. I'm not sure the module should be in the template at this point because it's still marked as "pre-alpha" and hasn't been worked on since 2019, but I'm not going to try to evaluate that. — Eru·tuon 20:48, 8 October 2023 (UTC)
 * Ah thank you. I must've thought that Module:Unicode chart somehow emitted a |} upon transclusion of this template, but not when the module was invoked, hence why I put the |} in the noinclude. SWinxy (talk) 21:38, 8 October 2023 (UTC)
 * Erutuon: Thank you for taking care of this! —Anomalocaris (talk) 22:59, 8 October 2023 (UTC)

Trying again from scratch
When I stumbled across this (April 2024), Template:Unicode chart wasn't working and no one seemed to be actively working on it. I sent a message to User:Cobaltcigs (the last person who edited Module:Unicode chart), and when I didn't hear back, I went ahead and started trying to build my own version in the sandbox. The pages I'm using are:
 * Lua: Module:Unicode chart/sandbox
 * CSS: Template:Unicode chart/sandbox/styles.css
 * Template: Template:Unicode chart/sandbox

After a couple of days, I've created something that works in the majority of testcases, although there are some edge cases for unusual characters that still need to be ironed out. You can see my version at:
 * Testcases: Template:Unicode chart/sandbox/testcases

- Eievie (talk) 18:22, 22 April 2024 (UTC)