Talk:UTF-8/Archive 5

This page is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.

How is utf8mb3 exactly the same as CESU-8?

Spitzak, you've repeatedly asserted that MySQL UTF8mb3 and CESU-8 are exactly the same in the edit comments. I believe you, but I can't follow you, because the source materials seem to say otherwise, and the citations seem insufficient.

In Unicode Technical Report #26, CESU-8 is explicitly defined to support supplemental characters: "In CESU-8, supplementary characters are represented as six-byte sequences". Whereas the MySQL 8.0 Reference Manual explicitly states that supplemental characters are not supported: "Supports BMP characters only (no support for supplementary characters)". And the MySQL 3.23, 4.0, 4.1 Reference Manual (when utf8mb3 first appears, as "utf8") says the same: "The ucs2 and utf8 character sets do not support supplementary characters that lie outside the BMP."

How do you reconcile these conflicting definitions of CESU-8 and utf8mb3? Is one of them wrong, or do they require further interpretation? If so, is that cited somewhere? I checked the citations, but I'm not seeing how they back up what you're saying -- they only seem to note that utf8mb3 doesn't support supplemental characters. If what you're saying is in fact true, I think further explication is needed beyond saying it is so, because the MySQL docs and UTR#26 seem to suggest that utf8mb3 and CESU-8 are definitionally different, at least when perused by a non-expert like myself trying to learn about the subject.

While I think the introductory paragraph is trying to shed some light, "many programs" is vague and not cited, and nor is it cited that MySQL is definitively one of those many programs, and nor is it cited that MySQL "transforms UCS-2 codes to three bytes or fewer" for utf8mb3. Does it? How do we know?

If what you're trying to say is that when UTF-16 supplemental characters are converted to UTF-8 as though they are UCS-2 (and not UTF-16), the result is what came to be called CESU-8, then I think you also need to say that while utf8mb3 is not intended to support supplemental characters at all, it functionally operates as CESU-8 if they are present. And ideally that should be backed up with a citation, or an example sufficient to demonstrate that this article is not the only place where one will find this assertion.
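To make that mechanism concrete, here is a minimal sketch of the conversion as UTR #26 describes it: the text is first turned into UTF-16 code units, and each 16-bit unit is then encoded as though it were a standalone UCS-2 character. The function name `cesu8_encode` is made up for this illustration. It shows only what CESU-8 output looks like and says nothing about what MySQL actually does with such input, which is the point in dispute.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8: UTF-16 code units, each encoded like a UCS-2 character."""
    out = bytearray()
    # "surrogatepass" lets unpaired surrogate halves through instead of raising.
    utf16 = text.encode("utf-16-be", "surrogatepass")
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        if unit < 0x80:
            # One byte: plain ASCII.
            out.append(unit)
        elif unit < 0x800:
            # Two bytes.
            out.append(0xC0 | (unit >> 6))
            out.append(0x80 | (unit & 0x3F))
        else:
            # Three bytes; surrogate halves (0xD800-0xDFFF) also land here.
            out.append(0xE0 | (unit >> 12))
            out.append(0x80 | ((unit >> 6) & 0x3F))
            out.append(0x80 | (unit & 0x3F))
    return bytes(out)


supplementary = "\U0001F600"                   # a character outside the BMP
print(cesu8_encode(supplementary).hex())       # eda0bdedb880 (6 bytes)
print(supplementary.encode("utf-8").hex())     # f09f9880 (4 bytes)
```

For BMP-only text the output is byte-for-byte the same as standard UTF-8; the two differ only once a supplementary character or a lone surrogate appears.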

And, even if you're right that utf8mb3 and CESU-8 (and Oracle UTF8) are technically identical, it's still not correct to say that "MySQL calls [UTF-16 supplemental characters converted to UTF-8 as though they were UCS-2 characters] utf8mb3", because MySQL quite clearly defines utf8mb3 as being BMP-only; so MySQL is not "calling" anything involving supplemental characters utf8mb3.

Having now been trying to understand this for hours, I think this Oracle document explains it pretty well: "The UTF8 character set encodes characters in one, two, or three bytes...If supplementary characters are inserted into a UTF8 database...the supplementary characters are treated as two separate, user-defined characters that occupy 6 bytes." If what you're saying is correct (and I don't know that it is, because I don't have anything authoritative saying so), then it sounds like this could be equally applicable to utf8mb3. The article could make that clear, if properly cited or demonstrated.

TL;DR: It's not accurate to describe utf8mb3 as having any representation of supplemental characters, even if it technically can do so as described by CESU-8, because it is defined otherwise. Further, claiming utf8mb3 is technically identical to CESU-8 warrants citation or demonstration, and the claim would benefit from greater clarity. Ivanxqz (talk) 00:45, 15 September 2020 (UTC)

Both of them translate a UTF-16 supplemental pair into exactly the same 6 bytes, and unpaired surrogate halves into exactly the same 3 bytes; therefore they are identical. Spitzak (talk) 21:20, 15 September 2020 (UTC)

Can you cite this anywhere? No original research, etc. The only source for your information is you. (And you haven't responded to anything that I wrote above, not even the TLDR -- even if technically identical, which you have only asserted and not cited, MySQL does not "call" CESU-8 "utf8mb3" as you state -- utf8mb3 explicitly does not support supplemental characters, and therefore any handling of them in the style of CESU-8 is an accident, not a design.) Ivanxqz (talk) 04:55, 16 September 2020 (UTC)

I decided to rewrite the CESU-8 section for what I think is greater clarity and accuracy. I included that CESU-8 in utf8mb3 is possible (though unsupported), on the basis of Spitzak's claim that it's the case. I noted that it needs a citation. I think it's not actually true, though, on the basis of Bernt's counter-demonstration at Talk:CESU-8#Comments, which I also just verified myself, and also the original references regarding utf8mb3 in the previous version, but I'll leave it for now. (Spitzak? Can you show somewhere why your claim that utf8mb3 can support supplemental characters via CESU-8 is accurate?)

I also gave utf8mb3 its own section again, since it is definitionally not CESU-8, even if technically it's the same thing (which, again, I don't think it is). It's like saying that Mountain Standard Time and Pacific Daylight Time are the same thing; they represent the exact same time of day in California and Arizona in the winter, but they're not the same thing, because they have different definitions. Ivanxqz (talk) 10:53, 16 September 2020 (UTC)

Adoption and non-adoption

https://softwareengineering.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful

Under "Adoption": "Internally in software usage is even lower, with UCS-2, UTF-16, and UTF-32 in use, particularly in the Windows API, but also by Python". What I don't like about this is that the Windows API only has a Unicode API for one encoding (plus legacy code pages). It used to be UCS-2 (in Windows versions that I believe are all now discontinued), but it's now UTF-16. And it doesn't have direct indexing to Unicode characters, so what follows isn't too helpful (it's outdated, from the UCS-2 era): "This is due to a belief that direct indexing of code points is more important than 8-bit compatibility". I think we should concentrate first on the main alternative to UTF-8 in use, UTF-16, then possibly explain programming languages. Since there are many, and that text misrepresents Python (it also stores Latin1 internally), maybe just leave it out? Just as text on other encodings such as GB 18030 was moved to another page, possibly we need not mention all UTF-8 alternatives, or what all programming languages do, e.g. Python, as it's not strictly about adoption, rather non-adoption? comp.arch (talk) 12:34, 26 March 2021 (UTC)
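The point about direct indexing can be illustrated with a small sketch. The helper name `utf16_code_units` is hypothetical; it just splits a string into 16-bit UTF-16 code units so they can be compared with code points.

```python
from typing import List


def utf16_code_units(text: str) -> List[int]:
    """Return the UTF-16 code units of text as integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


sample = "a\U0001F600b"          # 'a', one supplementary character, 'b'
units = utf16_code_units(sample)

print(len(sample))               # 3 code points
print(len(units))                # 4 code units
print(hex(units[1]))             # 0xd83d, a high surrogate, not a character
```

Index 1 in the code-unit array is half of a surrogate pair, so an index into UTF-16 storage is not an index into the code points. Under UCS-2, where every character was one unit, the two coincided.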

In the work I do, the #1 impediment to using UTF-8 is that Qt uses UTF-16. The #2 impediment is that Python does not use UTF-8; in our code it uses UTF-32, though you are correct that they are trying to improve this to some selection between 8-, 16-, and 32-bit storage of UTF-32 based on the highest code point value, and also by caching a UTF-8 version, as they have finally realized the cost of conversion. It is also quite likely that the underlying reason Python and Qt don't use UTF-8 is the Windows API using UTF-16, so for me that is the #3 reason (though for Windows programmers it probably is #1). In any case Python and Qt are extremely similar in their guilt in preventing adoption of UTF-8 and should and must be mentioned together. Spitzak (talk) 19:06, 26 March 2021 (UTC)
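The width selection mentioned here can be observed from Python itself. This is a rough sketch that assumes CPython 3.3 or later, where `sys.getsizeof` reflects the internal string storage; other implementations may report different numbers. The helper name `bytes_per_char` is made up for this illustration.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> float:
    """Estimate storage per character by comparing two strings of the same character."""
    # Build both strings fresh so cached single-character objects are not measured.
    short = sys.getsizeof(ch * 2)
    long = sys.getsizeof(ch * (n + 2))
    # The fixed object header cancels out, leaving n characters' worth of storage.
    return (long - short) / n


for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}: {bytes_per_char(ch)} bytes per character")
```

On CPython this should report 1 byte per character for ASCII and Latin-1 text, 2 for other BMP characters, and 4 once any supplementary character is present, which is consistent with both the "stores Latin1 internally" remark and the description of storage chosen by highest code point.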

Microsoft first developed its "Multi-Byte Character Set" APIs for Windows NT in the early 1990s, before UTF-8 had achieved much usage, and when Japanese Kanji character sets were more practically important than Unicode. UTF-8, if Microsoft programmers even knew about it at that time, would not have helped them deal with Shift_JIS or whatever... AnonMoos (talk) 22:51, 26 March 2021 (UTC)

Microsoft was pretty far ahead of everybody in figuring out multi-byte character encodings, and thus was in a better position to start using UTF-8. I was working for Sun and they were way behind and convinced that 16-bit characters were necessary, and they were even incapable of handling 8-bit non-ASCII in any intelligent way, often insisting on converting it to 3 octal digits. Microsoft really blew it when they decided to scrap all that work and use UCS-2. Some of this may have been misguided political correctness; there was certainly sentiment that Americans should not get the "better" 1-byte codes. The end result is that ASCII-only software still exists even today! Spitzak (talk) 23:46, 26 March 2021 (UTC)

Microsoft was part of the initial alliance that launched Unicode, of course, but I find it difficult to imagine how it could have done a bunch of UTF-8 software implementation work in the early 1990s, which was then pulled out and replaced by 16-bit wide character interfaces. UTF-8 apparently didn't even exist until September 1992, at a time when Microsoft's MBCS people had to be focused mainly on making Japanese character sets work on the forthcoming Windows NT operating system (there was certainly more money to be made from that than from Unicode in 1992-1993). UTF-8 wasn't even introduced as a formal proposal until 1993, the same year that the first version of Windows NT was released, so the dates don't really seem to align... AnonMoos (talk) 16:14, 29 March 2021 (UTC)

What I meant is that the multi-byte Japanese encodings you are talking about were much more similar to UTF-8 and should have provided a method of transitioning to it, and Microsoft was doing far more to support these transparently than others. Instead Microsoft abandoned all the progress they had made with multibyte encodings to try to use UCS-2, and we are all paying the price even today. Spitzak (talk) 19:03, 29 March 2021 (UTC)

OK -- I'm skeptical as to whether Shift_JIS could prepare the way for UTF-8 in any practical or concretely-useful way, but now I understand what you're saying... AnonMoos (talk) 13:34, 31 March 2021 (UTC)