Talk:UTF-8/Archive 3

Three- and four-byte sequences
"There are 256 × 256 − 128 × 128 not-pure-ASCII two-byte sequences, and of those, only 1920 encode valid UTF-8 characters (the range U+0080 to U+07FF), so the proportion of valid not-pure-ASCII two-byte sequences is 3.9%."

This is correct: the figure is 3.9% of those which are not pure ASCII, because a two-byte string has only two ways to be UTF-8: either it is ASCII, or it represents one code point U+0080–U+07FF.

"Similarly, there are 256 × 256 × 256 − 128 × 128 × 128 not-pure-ASCII three-byte sequences, and 61,406 valid three-byte UTF-8 sequences (U+000800 to U+00FFFF minus surrogate pairs and non-characters), so the proportion is 0.41%; finally, there are $256^{4} − 128^{4}$ non-ASCII four-byte sequences, and 1,048,544 valid four-byte UTF-8 sequences (U+010000 to U+10FFFF minus non-characters), so the proportion is 0.026%."

This is incorrect. Although it is true that the number of non-ASCII $n$-byte sequences is equal to $256^{n} − 128^{n}$, those which are valid UTF-8 comprise not only single encoded code points, but also sequences containing ASCII characters. For $n$ = 3, [ASCII][U+0080–U+07FF] and [U+0080–U+07FF][ASCII], each of 1920×128 states, must be added to "61,406 valid three-byte UTF-8 sequences (U+000800 to U+00FFFF minus surrogate pairs and non-characters)". Things become even more complicated for $n$ = 4. Incnis Mrsi (talk) 10:45, 7 February 2012 (UTC)

Ironically, for $n$ = 3 the probability for a non-pure-ASCII string to be UTF-8 is only slightly lower than for $n$ = 2. The case of three MSB-set bytes ¤¤¤, which dilutes the non-ASCII three-byte strings (otherwise having the same probabilities as for $n$ = 2), has a validity ratio of about 1/32, not much lower than the average of about 1/24. Something got into me to write about exponential decrease. Maybe the condition that at least one byte must be non-ASCII should be dropped to avoid original research about this paradox? Incnis Mrsi (talk) 14:55, 7 February 2012 (UTC)
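The counting argument above can be checked with a short sketch. It assumes the per-length code point counts quoted in the comment (surrogates and non-characters excluded); the function names are made up for illustration and nothing here comes from the article itself.

```python
# Code points encodable in exactly k bytes, using the counts quoted above
# (assumption: surrogates and non-characters are excluded, as in the comment).
SINGLE_COUNTS = {1: 128, 2: 1920, 3: 61406, 4: 1048544}

def valid_sequences(n):
    """Count n-byte strings that are concatenations of valid encodings."""
    total = [1] + [0] * n  # total[0] = 1 stands for the empty string
    for length in range(1, n + 1):
        total[length] = sum(
            count * total[length - k]
            for k, count in SINGLE_COUNTS.items()
            if k <= length
        )
    return total[n]

def valid_non_ascii(n):
    """Valid n-byte strings containing at least one non-ASCII byte."""
    return valid_sequences(n) - 128 ** n

def ratio(n):
    """Share of non-pure-ASCII n-byte strings that are valid UTF-8."""
    return valid_non_ascii(n) / (256 ** n - 128 ** n)

for n in (2, 3, 4):
    print(n, valid_non_ascii(n), f"{ratio(n):.2%}")
```

For $n$ = 3 this counts 61,406 + 2 × 1920 × 128 = 552,926 valid non-ASCII strings, a ratio of roughly 3.8%, which is indeed only slightly below the 3.9% for $n$ = 2.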

Advantages / Disadvantages
The section giving the relative merits of UTF-8 and UTF-16 is mostly bogus, referring for the most part to the merits of understanding your tools before using them, rather than differences between the tools themselves. The relative lengths of encoded strings is a valid comparison but could be stated far more succinctly. -- Elphion (talk) 18:49, 28 January 2012 (UTC)

UTF-8 versus UTF-16
Firstly, I removed the item that says “A simplistic parser for UTF-16 is unlikely to convert invalid sequences to ASCII. Since the dangerous characters in most situations are ASCII, a simplistic UTF-16 parser is much less dangerous than a simplistic UTF-8 parser.” since it's 1) totally unclear what it means, 2) no reference was provided for 2.5 years.

Secondly, the third item (“In UCS-2 (but not UTF-16) Unicode code points are all the same size...”) compares UTF-8 to UCS-2, so it's quite inadequate for this section. Moreover, the argument that some encoding allows constant-time random-access to code points, or constant time measurement of the number of them, is argumentative. I have seen no real world use for "counting code points", especially because of the complexities of Unicode where abstract characters (code points) don't correspond to user perceived characters. E.g. combining characters, or bidirectional marks.

If there's no objection I'll remove this item too. bungalo (talk) 13:00, 5 May 2012 (UTC)
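A small illustration of why counting code points says little about user-perceived characters. This is only a sketch using Python's standard `unicodedata` module; the variable names are arbitrary.

```python
import unicodedata

composed = "\u00e9"     # "é" as a single precomposed code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT

# Both render as one user-perceived character, but the counts differ.
code_points = (len(composed), len(decomposed))
utf8_bytes = (len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))

# Normalizing to NFC maps the decomposed form onto the precomposed one.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

print(code_points, utf8_bytes, same_after_nfc)
```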


 * Though I agree with you I still feel something has to be done because of the huge crowd of CS101 graduates who believe there is some horrible problem with the fact that characters are different sizes (though for some reason they see no problem with words, lines, paragraphs, sentences, and UTF-16 characters being different sizes). They will look for this text, and add it if it is not there, and think they have done a great service to the world of software engineering. And it will have to be reverted over and over.
 * It would be nice if *something* clear, concise, yet not sounding like preaching, could be put in there to stop this. The above text is certainly far from perfect but has persisted far longer than most. Spitzak (talk) 21:25, 7 May 2012 (UTC)

external link
I believe the external links section should reference the new page "utf8everywhere.org". It is a web site arguing that UTF-8 should be the default choice of encoding. 92.50.128.254 (talk) 05:30, 22 May 2012 (UTC)


 * We don't usually put non-notable promotional pages in external links.--Prosfilaes (talk) 17:29, 22 May 2012 (UTC)


 * 92.50.128.254 -- It's semi-interesting, but it's just one person's lament about Microsoft Windows APIs.  By the way, during most of the 1990s Microsoft paid a certain lip-service to Unicode as the wave of the future, but was practically concerned with Asian codesets such as Shift-JIS / CP932 far more than with Unicode, so Unicode was then somewhat of an afterthought with respect to East Asian character sets, and used some of the internal code infrastructure... AnonMoos (talk) 18:10, 22 May 2012 (UTC)


 * It was written by more than one person, actually. bungalo (talk) 12:15, 22 June 2012 (UTC)

Byte vs Octet
Mfwitten has changed most "byte"s to "octet"s (twice now, reverting the revert that suggested a discussion here), so I've changed them back and started this discussion. While Mfwitten is correct that "byte" simpliciter is not precise, the article makes clear that we are talking about 8-bit bytes, so there is no ambiguity. "Byte" is used far more frequently than "octet" in this domain – BOM, e.g., is the byte order mark, not the octet order mark – so it is not incorrect (as Mfwitten claims). Indeed, it is common on machines whose bytes are not 8 bits to see UTF-8 bytes stuffed into bytes, not octets. -- Elphion (talk) 02:37, 7 May 2012 (UTC)
 * Thanks. As you point out, "BOM" rather gives the game away, and it is not up to Wikipedia to convert people to use the "correct" terms. Johnuniq (talk) 04:20, 7 May 2012 (UTC)


 * I did not change "byte order marker" to "octet order marker". A byte order marker is used because the byte order of the architecture of the serializing computer is in question. Mfwitten (talk) 05:37, 7 May 2012 (UTC)


 * In the English language, "byte" is the term used by almost all working computer programmers, while "octet" appears in certain somewhat stiltedly-phrased standards documents, or is used by some whose native language is not English. Computers with non-8-bit bytes began to fall by the wayside in the 1980s, and by now are historical curiosities... AnonMoos (talk) 05:11, 7 May 2012 (UTC)


 * First:
 * I did not change all occurrences of "byte" to "octet", but rather only relevant occurrences.
 * I did not change "byte order marker" to "octet order marker". A byte order marker is used because the byte order of the architecture of the serializing computer is in question.
 * Stuffing the values of the octets of a UTF-8 octet stream into a computer architecture's native non-octet bytes makes for a native byte stream that is no longer a UTF-8 octet stream; serializing the result in terms of native bytes would not be valid UTF-8.
 * Second:
 * RFC 3629 for UTF-8 uses the term "octet" very much more than "byte"; for example:
 * A UTF-8 string is a sequence of octets representing a sequence of UCS characters. An octet sequence is valid UTF-8 only if it matches the following syntax...
 * The Unicode standard frequently uses the term "code unit" (rather than "byte" or "octet"), which is defined as follows:
 * Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange.
 * Code units are particular units of computer storage. Other character encoding standards typically use code units defined as 8-bit units—that is, octets. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form.
 * The Unicode standard also states:
 * The term byte, as used in this standard, always refers to a unit of eight bits. This corresponds to the use of the term octet in some other standards.
 * All of this suggests that the term "octet" is preferable to "byte" for the purposes of writing an article about UTF-8. Mfwitten (talk) 05:35, 7 May 2012 (UTC)
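To make the "code unit" definition quoted above concrete, here is a rough sketch that counts code units for one character in each encoding form. It relies only on Python's built-in codecs; the variable names are arbitrary.

```python
ch = "\U00024B62"  # one code point outside the Basic Multilingual Plane

# Code-unit counts: 8-bit units in UTF-8, 16-bit in UTF-16, 32-bit in UTF-32.
utf8_units = len(ch.encode("utf-8"))
utf16_units = len(ch.encode("utf-16-be")) // 2
utf32_units = len(ch.encode("utf-32-be")) // 4

print(utf8_units, utf16_units, utf32_units)
```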
 * I would not object to the change of "sequence of bytes" to "sequence of octets" in some particular context. But the word "byte" focuses the reader's attention on the point that it is the minimal unit. It is more important to understand UTF-8 than the exact size of the byte. Incnis Mrsi (talk) 09:04, 7 May 2012 (UTC)

No, of course you didn't convert all of the "bytes" to "octets" -- just the vast majority of them. So I stand corrected, and I will (almost) accept that as a friendly amendment -- but that doesn't excuse your editing another user's remarks, a practice you should avoid.

The evidence you produce from the Unicode standard suggests to me that they would be perfectly happy talking about "bytes" as long as it's clear what size we're talking about. Standards often use fussy technical words for conciseness, but that doesn't make it incumbent on the rest of us to use their vocabulary -- and in this case most of us don't. "Byte" is well-understood to mean "octet" in this context (and as AnonMoos points out, increasingly in virtually all others as well). Since it is by far the more popular word, there's no real point in not using it. Vocabulary policing is not Wikipedia's job.

I did not make my point about BOM clearly enough. I was not pointing out that you hadn't changed BOM to OOM, but rather adducing BOM as evidence that octets are routinely called bytes, even by Unicode people. Your remark about the endianness of the machine is not correct: machines are capable of emitting either order, and the BOM marks the order of the bytes in the stream, regardless of the endianness of either the sending or the receiving machine.
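The point that the BOM marks the order of bytes in the stream, whatever machine produced it, can be sketched with Python's standard codecs. Any machine can emit either form; the names below are illustrative only.

```python
import codecs

BOM = "\ufeff"

# The same code point serialized in both byte orders, independent of the
# endianness of the machine running this code.
big_endian = BOM.encode("utf-16-be")
little_endian = BOM.encode("utf-16-le")

# The generic "utf-16" decoder reads the BOM to pick the byte order.
from_be_stream = (codecs.BOM_UTF16_BE + "A".encode("utf-16-be")).decode("utf-16")
from_le_stream = (codecs.BOM_UTF16_LE + "A".encode("utf-16-le")).decode("utf-16")

print(big_endian.hex(), little_endian.hex(), from_be_stream, from_le_stream)
```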

Finally, a native handling of UTF-8 that assigns each octet to, say, a 9-bit byte (and, yes, my students always look at me funny when I confess that that's the architecture I cut my teeth on) doesn't mean it's not a UTF-8 encoding, perfectly acceptable for internal storage (and likely the only practical one). Nobody's talking about transmitting something like that over the Internet (so the question of "validity" never arises), and it minimizes the difficulty of processing UTF-8 on the machine in question. The point is that UTF-8 is easiest to process as bytes, regardless of the size of the bytes in the native architecture.

-- Elphion (talk) 06:57, 7 May 2012 (UTC)


 * Of course you will accept my amendment. Your comment sets the tone for this discussion, and that tone was based on a gross misrepresentation, which needed to be corrected.


 * The evidence from the standard shows that the authors deliberately avoid the use of the word "byte", certainly not for conciseness (as they use "code unit" instead), but rather for preciseness, a goal for which a source of information like Wikipedia should strive as well.


 * As for the BOM, please note that octets may be considered bytes, but bytes may not necessarily be considered octets. The order in which bytes (or indeed octets) are serialized has NOTHING to do with how many bits compose a byte, so there is no particular value in specifying that the bytes in question are, say, octets or nonets. Hence, the general term "byte order mark" is used, particularly because the issue is really a matter of computer architecture beyond the scope of the Unicode standard.


 * What is the value in maintaining Joe Programmer's confused terminology? There are better terms not only that exist, but that are indeed used extensively in this very domain (as you put it). Making the distinctions that I would like to make (and that many others have made in actual standards documents) establishes a cleaner understanding of the involved concepts and separations of concerns (as seen, for example, in the nature of the term "byte order mark" just discussed). Mfwitten (talk) 16:09, 7 May 2012 (UTC)


 * Unfortunately for your proposed article changes, "Joe Programmer's confused terminology" is in fact by far the most commonly-accepted and widespread English usage, and it gives rise to no possible "confusion" whatsoever in the current context... AnonMoos (talk) 16:56, 7 May 2012 (UTC)


 * The value in maintaining "Joe Programmer's confused terminology" is the same one as writing this in English instead of Lojban; it's the language people are familiar with, and it turns out to be completely unambiguous in practice. You may as well go around pointing out that we can't specify the weight of things without specifying where they're at, and thus should be specifying their mass.--Prosfilaes (talk) 11:05, 11 May 2012 (UTC)

"Byte" (meaning octet in this context) is correct because the octet is the minimal character size in UTF-8 and the sizes of all characters are integer multiples of it, so it is a "minimal addressable unit" of this data format. Although "octet" (as a specific definition) is correct too, the use of this term throughout the text would distract the reader's attention in most instances. Incnis Mrsi (talk) 07:00, 7 May 2012 (UTC)


 * Mfwitten -- Some people would prefer that the word "byte" be replaced by "octet", but this has not happened in common English usage, where the word "octet" has the flavor of stuffy bureaucratic standards jargon, or foreignese said by those whose native language is not English. In an article about UTF-9 and UTF-18, it's highly essential to distinguish between 8-bit and 9-bit bytes (or "octets" and "nonets"), but for an article about UTF-8, it's enough to say once that a byte always has 8 bits (as is commonly the case) and then forget about any alternatives... AnonMoos (talk) 08:39, 7 May 2012 (UTC)
 * I do not see this comment as a valid argument. "Stuffy bureaucratic jargon" is a matter of AnonMoos' personal taste, and non-standard sizes of byte are actually possible. But UTF-8 operates on 8-bit bytes (see my comment above and other comments), not any other, and the size is explicitly given. Of course, on a hypothetical platform where bytes have fewer than 256 states, octets would not be "bytes", but the article does not consider (and should not consider) such theoretical possibilities. Incnis Mrsi (talk) 09:04, 7 May 2012 (UTC)


 * These are called "bytes". Any practical implementation will place each code unit of a UTF-8 string (or *any* 8-bit data) into whatever C calls a "byte" even if it has more than 8 bits in it, therefore "byte" is in fact the correct term for any programmer working with UTF-8. In addition the documentation that any programmer encounters will call these things "bytes" about 50% of the time (the other 50% of the time they will likely be called "characters"). Usage of the word "octet" is so small as to be un-measurable.
 * The word "octet" is intended when talking about *bits*. Such as when a larger code unit is divided up into bit fields, "octet" means that exactly 8 are used, and any remaining ones are used for another purpose.
 * I strongly suspect the real reason for putting "octet" in the text is a deliberate attempt to obfuscate the UTF-8 description because you want to discourage its use (due to having made an investment in UTF-16). Shame on you. Spitzak (talk) 20:53, 7 May 2012 (UTC)


 * Well, this thread has given rise to some of the weakest and indeed weirdest rebuttals and accusations. Good day! Mfwitten (talk) 23:29, 7 May 2012 (UTC)

Extension past 0x7fffffff
Although referenced, this text does not belong in the section that describes UTF-8, since this extension is not part of UTF-8 (RFC 2279 or RFC 3629). Extending past 0x7fffffff also makes it incompatible with a signed 32-bit data type. MegaHasher (talk) 18:36, 2 June 2012 (UTC)


 * The section contains lots of information that is not part of the UTF-8 standard, but is informative anyway. Spitzak (talk) 05:41, 3 June 2012 (UTC)


 * This extension could be one instructor's throw-away comment in a lecture note. It does not even have a name as far as I am aware. Is this really "notable" as defined in Wikipedia? If it has a name at least we can deposit that section of text under "Other Variants". MegaHasher (talk) 06:52, 3 June 2012 (UTC)


 * It may belong to "Compared to UTF-16 > Advantages". And you don't need a name for this. It's not a variant, like CESU-8 for example, it's a natural extension of the pattern that is already used to encode sequences of arbitrary natural numbers. bungalo (talk) 10:59, 16 June 2012 (UTC)


 * Any system using it would produce UTF-8-like data that would be rejected by any standards-conformant UTF-8 processor. I don't see how it's not a variant.--Prosfilaes (talk) 19:39, 22 June 2012 (UTC)
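Both positions can be illustrated with a sketch. The hypothetical `encode_pattern` below simply continues the lead-byte pattern up to six bytes (31 bits, the old RFC 2279 range); it is not an implementation of any named standard or of the extension discussed above. The second half shows that a conformant decoder, here Python's built-in one, rejects anything longer than four bytes.

```python
def encode_pattern(cp):
    """Hypothetical encoder: extend the UTF-8 bit pattern up to 6 bytes."""
    if cp < 0x80:
        return bytes([cp])
    # (number of bytes, exclusive upper limit of the code value)
    limits = ((2, 0x800), (3, 0x10000), (4, 0x200000),
              (5, 0x4000000), (6, 0x80000000))
    for nbytes, limit in limits:
        if cp < limit:
            break
    else:
        raise ValueError("value needs more than 31 bits")
    lead_marker = (0xFF << (8 - nbytes)) & 0xFF  # 0xC0, 0xE0, 0xF0, 0xF8, 0xFC
    out = []
    for _ in range(nbytes - 1):
        out.append(0x80 | (cp & 0x3F))  # continuation byte: 10xxxxxx
        cp >>= 6
    out.append(lead_marker | cp)        # lead byte carries the top bits
    return bytes(reversed(out))

# Within the standard range the pattern agrees with real UTF-8.
euro = encode_pattern(0x20AC)

# Just past the 4-byte range the same pattern yields a 5-byte sequence...
extended = encode_pattern(0x200000)

# ...which a standards-conformant decoder refuses.
try:
    extended.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True

print(euro.hex(), extended.hex(), rejected)
```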

The UTF-8 encoding table
Because it was difficult to see the distinction between some of the colors used in the second table here, I switched them to background colors ("highlighting") instead, which produced far more contrast and clearly showed which bits were going where. I also broke it into nybbles instead of bytes - a common technique when dealing with bit patterns and transformation to/from hex digits. While I still wanted to come up with a little more contrast between the padding bits and the significant ones, I think the result (here) was better than the original, though:

Apparently, Elphion disagrees, and reverted it as follows, with the comment "prev version is far easier to read":

Anyone else have an opinion? How about black foreground on softer pastel background? If we don't like my "highlighting", can we at least color the hex values in the last column to show where they came from, and break the bit patterns into nybbles, like:

Thanks, —[ Alan M 1 (talk) ]— 22:47, 7 June 2012 (UTC)


 * I find the colored background visually very hard to process; partly it's because there's no padding between the background and the content (and it would be hard to add any without muddying the division between the bytes). I also don't like the division into nybbles; the division into bytes is far more important, and is made less obvious by the nybble divisions.  To increase the color contrast, one might try bolding the content (below). -- Elphion (talk) 23:11, 7 June 2012 (UTC)


 * AlanM1 -- the bright backgrounds remind me of the aesthetic of early-1990s ANSI.SYS BBS's, and really make the text harder to read, since the background is highlighted at the expense of the text. The technical term is "angry fruit salad"... -- AnonMoos (talk) 10:18, 8 June 2012 (UTC)


 * OK – I understand the overuse issue – I just think this is a case that benefits from it, but the bolding makes the color changes more visible. How about a little lighter purple, though:
 * [Table for the character 𤭢, with columns: Character, Binary code point, Binary UTF-8, Hexadecimal UTF-8; cell contents not preserved]
 * Also, I don't like the latest revision that was made – adding more bits to the left of the first row (instead of just showing the significant number to demonstrate that it is limited to 7 bits) and the left-alignment takes away from understanding the concept IMO. Numbers are generally best right-aligned – I don't know why this should be an exception. —[ Alan M 1 (talk) ]— 22:21, 8 June 2012 (UTC)


 * The lighter purple looks good, and I agree about the alignment too: right alignment (despite the earlier table) is much better.  I'm less enthusiastic about keeping nybble divisions even for the code point; if it is retained, there needs to be a clearer boundary between the bytes. -- Elphion (talk) 23:42, 8 June 2012 (UTC)

I'm confused. What is the formula for generating the "Binary code point" values from the "Binary UTF-8"? For example in the table, how does '11000010 10100010' (Binary UTF-8) become '00000000 10100010' (Binary code point)? Another example, for the character 'U+24B62', why does the third byte in the Binary Code Point start with '01' instead of '10' like the previous examples?

I don't understand how these conversions happen. — Preceding unsigned comment added by 86.14.215.103 (talk) 21:58, 7 July 2012 (UTC)


 * The "binary code point" is just the Unicode number, in binary. The table is attempting to show the translation between this number and UTF-8 by coloring in the binary digits in each that are caused by the same digit in the other one. The black digits in the UTF-8 are fixed ones that are always as given for a sequence of that many bytes. The black ones in the code point are ones that must be zero for a code that produces a sequence of UTF-8 of that many bytes. It might help to either remove the spaces or put them every 4 digits in the code point to remove some of the similarity to the UTF-8 to make this clearer? Spitzak (talk) 00:11, 10 July 2012 (UTC)
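The conversion asked about can be sketched as follows: strip the fixed marker bits from each byte and concatenate what is left. The hypothetical `decode_one` handles a single sequence of up to four bytes and deliberately omits the checks a real decoder needs (overlong forms, surrogates, range limits).

```python
def decode_one(data):
    """Sketch: recover the code point from one UTF-8 sequence."""
    lead = data[0]
    if lead < 0x80:                 # 0xxxxxxx: ASCII, no marker to strip
        nbytes, cp = 1, lead
    elif lead >> 5 == 0b110:        # 110xxxxx: two-byte sequence
        nbytes, cp = 2, lead & 0x1F
    elif lead >> 4 == 0b1110:       # 1110xxxx: three-byte sequence
        nbytes, cp = 3, lead & 0x0F
    elif lead >> 3 == 0b11110:      # 11110xxx: four-byte sequence
        nbytes, cp = 4, lead & 0x07
    else:
        raise ValueError("not a valid lead byte")
    if len(data) != nbytes:
        raise ValueError("wrong number of bytes for this lead byte")
    for byte in data[1:]:
        if byte >> 6 != 0b10:       # continuation bytes are 10xxxxxx
            raise ValueError("not a continuation byte")
        cp = (cp << 6) | (byte & 0x3F)
    return cp

# '11000010 10100010' from the table becomes U+00A2.
cent = decode_one(bytes([0b11000010, 0b10100010]))

# F0 A4 AD A2 becomes U+24B62.
han = decode_one(bytes([0xF0, 0xA4, 0xAD, 0xA2]))

print(f"U+{cent:04X}", f"U+{han:04X}")
```

The code point column is plain binary, so its bytes need not start with '10'; only the UTF-8 column carries the marker bits.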


 * Thanks User:Spitzak, this is a good explanation. The key point is that 'The black digits in the UTF-8 are fixed ones that are always as given for a sequence of that many bytes.' This is not very clearly stated in the article, and makes it hard to understand. I feel there should be some sort of tutorial explanation underneath the table. Something like this:

 * 1) Start with the Unicode character U+00A2
 * 2) This requires 2 bytes to encode because...
 * 3) The binary code point for U+00A2 is...
 * 4) To convert the binary code point to the binary UTF-8 format, it requires padding numbers (shown in black) because...
 * 5) The final Hexadecimal UTF-8 is therefore ...
--86.14.215.103 (talk) 06:49, 14 July 2012 (UTC)
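One possible way to fill in the outline above for U+00A2, written as a sketch with bit strings so each step stays visible. The variable names are arbitrary and the wording of the steps is an assumption, not the text that ended up in the article.

```python
cp = 0x00A2  # step 1: the character U+00A2

# Step 2: values from U+0080 to U+07FF need 11 payload bits, hence 2 bytes.
assert 0x80 <= cp <= 0x7FF

# Step 3: the code point written as 11 binary digits.
bits = format(cp, "011b")

# Step 4: prefix the fixed marker bits, 110 for the lead byte and 10 for
# the continuation byte, then split the payload 5 + 6.
lead = "110" + bits[:5]
continuation = "10" + bits[5:]

# Step 5: the resulting bytes in hexadecimal.
encoded = bytes([int(lead, 2), int(continuation, 2)])

print(bits, lead, continuation, encoded.hex())
```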

The description of how the black bits are determined is already in the article, just before the table. I've added a note to the table calling attention to that. -- Elphion (talk) 11:53, 14 July 2012 (UTC)

The design of AlanM1 uses aggressive background colours and almost no padding – I do not like this. BTW I feel these overextended binary, nibble and hex calculations are redundant and partially even off-topic. Octal will be more demonstrable, of course. Incnis Mrsi (talk) 16:41, 14 July 2012 (UTC)

I've added a step-by-step encoding example similar to what 86.14.215.103 proposed, although I put it above the table to hopefully make the table clear to the reader as soon as they get to it. I'm also pondering whether to do a similar explanation of how an overlong encoding comes about... -- Perey (talk) 18:15, 18 July 2012 (UTC)

I think the step-by-step guide is great. It provides extreme clarity, so that everyone can understand the concept. I tidied it up a little. --86.14.215.103 (talk) 09:04, 21 July 2012 (UTC)

Define "overlong sequence"
I found the use of the terminology "overlong" confusing on this page. It is not given a very clear definition and is used before that vague definition is given. I think I figured it out from a combination of this page and another off-Wikipedia, but perhaps someone more expert than me should make an appropriate edit, providing an example and explanation of such a sequence. I believe that would help make this more accessible. Joshua Davis (talk) 21:54, 27 June 2012 (UTC)


 * It's included as part of the comments of "00:20, 29 January 2012" and "11:56, 29 January 2012" above: any two-byte sequence beginning with C0 or C1; any three-byte sequence beginning with E0 followed by an 80-9F byte; and any 4-byte sequence beginning with F0 followed by an 80-8F byte... AnonMoos (talk) 00:22, 28 June 2012 (UTC)


 * P.S. There were some further possibilities in the original version of the standard, but irrelevant to the current version of UTF8... AnonMoos (talk) 00:25, 28 June 2012 (UTC)


 * The point is that an encoding is over-long if it uses more bytes than the minimum required for the given codepoint. One can add more bytes by encoding more leading zeroes (the canonical example encodes 0 as C0 80 instead of 00 to avoid using the NUL character), but it's more important to have a unique correspondence between codepoints and encodings.  As Moos implies, the quickest way to check for overlong sequences is to check them against known forbidden patterns. -- Elphion (talk) 02:53, 28 June 2012 (UTC)
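The forbidden patterns listed above translate into a short check. This is a sketch only: the hypothetical `is_overlong_start` looks at the first two bytes and does no other validation.

```python
def is_overlong_start(data):
    """Sketch: does this sequence begin with a known overlong pattern?"""
    if not data:
        return False
    first = data[0]
    if first in (0xC0, 0xC1):       # two-byte forms of U+0000..U+007F
        return True
    if len(data) >= 2:
        second = data[1]
        if first == 0xE0 and 0x80 <= second <= 0x9F:  # three-byte, < U+0800
            return True
        if first == 0xF0 and 0x80 <= second <= 0x8F:  # four-byte, < U+10000
            return True
    return False

# The canonical example: NUL encoded as C0 80 instead of 00.
overlong_nul = b"\xc0\x80"

# A strict decoder such as Python's rejects the overlong form.
try:
    overlong_nul.decode("utf-8")
    accepted = True
except UnicodeDecodeError:
    accepted = False

print(is_overlong_start(overlong_nul), accepted)
```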


 * Thank you very much for the responses. What you've replied with, Elphion, I think would make a nice clarifying addition to the page. As it stands, "overlong" is used before it's explained and the explanation is not as clear as what you've just said. Joshua Davis (talk) 13:41, 28 June 2012 (UTC)

Sorting UTF-8 as bytes
Dcoetzee recently made this edit, commenting that "sorting a UTF-8 string as an array of bytes obviously doesn't work". Spitzak reverted it, stating that "it does work as bytes, thats the point". Now, it seems to me that Dcoetzee is correct: if you sort an array of bytes that happen to represent UTF-8 characters--specifically, multi-byte characters--not only will you not end up with the characters in Unicode order, you won't end up with UTF-8 characters at all.
 * Example: Sort the string "€$¢" as an array of UTF-8 bytes. Pre-sorting, the array of UTF-8 bytes is [E2, 82, AC, 24, C2, A2]. After sorting as bytes, the array becomes [24, 82, A2, AC, C2, E2], which (after the first byte) is invalid UTF-8. (In Unicode order, it should be "$¢€", which is [24, C2, A2, E2, 82, AC].)

I believe this is what Dcoetzee meant by the reverted edit, but I'd like to know what Spitzak meant before going and un-reverting the change--it seems to me I must be missing something that Spitzak understands better than I. -- Perey (talk) 12:58, 17 September 2012 (UTC)
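The "€$¢" example can be reproduced directly; a sketch using only Python built-ins, with arbitrary variable names.

```python
text = "€$¢"
raw = text.encode("utf-8")

# Sorting the individual bytes of one string scrambles the sequences.
sorted_bytes = bytes(sorted(raw))
try:
    sorted_bytes.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False

# Sorting the characters by code point and then encoding keeps it valid.
by_code_point = "".join(sorted(text))

print(raw.hex(), sorted_bytes.hex(), still_valid, by_code_point)
```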
 * Of course, Dcoetzee's wording is better than Spitzak's one. Incnis Mrsi (talk) 15:57, 17 September 2012 (UTC)
 * I am confused. I thought this was not about sorting a UTF-8 string as an array of bytes, but about sorting an array of UTF-8 strings lexicographically using bytewise comparison of the strings. — Tobias Bergemann (talk) 17:17, 17 September 2012 (UTC)
 * Tobias is correct: the point in the article is that lexicographic sorting of strings, whether considered codepoint by codepoint or ubyte by ubyte, yields the same result.  In this context, Spitzak is correct.  Dcoetzee is correct that sorting an array of ubytes numerically will yield a different result, but that's not what is being discussed.  The article's text should be clarified. -- Elphion (talk) 17:35, 17 September 2012 (UTC)

[outdent] First attempt at revision: -- Elphion (talk) 17:54, 17 September 2012 (UTC)
 * Sorting a set of UTF-8 encoded strings lexicographically as strings of unsigned bytes yields the same order as sorting the corresponding Unicode strings lexicographically by codepoint.
 * This is sorting of more than one string, not sorting of the bytes inside a string. The above quote looks good except there is no need for the first "lexicographically" because there really isn't any other way to sort unsigned bytes. I believe the words "set of UTF-8 encoded strings" remove the ambiguity about what kind of sort is being done. Spitzak (talk) 06:24, 18 September 2012 (UTC)
 * Spitzak, stop edit warring on lexicographical order links, please. I do not see any ambiguity either about total ordering of octets or codepoints, or in the article about lexicographical order in general. If you have some doubts about the link, or a proposal to further increase the precision, then express it here. Incnis Mrsi (talk) 11:56, 18 September 2012 (UTC)
 * The lexicographical order article does describe that the order of an earlier code supersedes the order of the later code. However it does not convey the far more important information that the order chosen for each code is what we would call "numeric" order (i.e. the smaller number is first, regardless of what Unicode character it represents). Since the lexicographical article uses a lot of letters in its descriptions, I am worried that some readers may think that sorting UTF-8 by bytes puts the Unicode into alphabetical order (i.e. it puts á near a). Spitzak (talk) 19:25, 18 September 2012 (UTC)

[outdent] Does this answer the problem? I guess the real question is whether this operation is of sufficient utility that this order-preserving property is a notable advantage of UTF-8. Since this order is not "alphabetic order" for some languages, is it an operation that is frequently performed? I find it useful, e.g., that the CJK ideographs are presented in radical + stroke count order, but I don't know whether their Unicode order is regarded as *the* canonical order -- Elphion (talk) 14:53, 20 September 2012 (UTC)
 * Sorting a set of UTF-8 encoded strings lexicographically as strings of unsigned bytes yields the same order as sorting the corresponding Unicode strings lexicographically by numerical codepoint value.
 * I'm not sure whether the link to lexicographical order is actually helpful. Although that article explains the notion (and a link to that is a good thing), it is such a mathematically oriented treatment that its use in processing data is rather obscured.  There is a link there to Alphabetical order, but no discussion that the one is not the other (typically even in English, since alphabetic order is typically case-insensitive). It would be useful to mention that they're not the same in the bullet point I've suggested above. -- Elphion (talk) 15:10, 20 September 2012 (UTC)
 * The order is useful for algorithms that manage sets of strings or use them as keys. It means that if you are using UTF32 strings as well as UTF8, the fastest and simplest methods of sorting these strings produce the same order and allow these to interoperate. This is not true of UTF16 and UTF32. Actual "alphabetical ordering" is ENORMOUSLY complex and ill-defined so it should NEVER be used for management of sets or lookup keys.
 * I think the article on "lexicographical order" is misleading. Yes it says that the order is determined by the first unequal element in the two arrays being compared. However it gives examples using letters, and the term "lexicographical" does not define the other requirement for this to work: the values must be compared as unsigned binary numbers. For that reason I think the link should be removed as it is misleading. Spitzak (talk) 20:29, 20 September 2012 (UTC)
 * That sounds to me like an argument for fixing lexicographical order rather than omitting the link. -- Elphion (talk) 21:04, 20 September 2012 (UTC)

Okay, so (dis)use of "lexicographical order" aside, it seems to me that the original issue came down to the use of "strings" and "arrays". If I've got it right, what Dcoetzee was fixing was the apparent statement that internally sorting a string of Unicode characters ('cab' -> 'abc') as a string of UTF-8 bytes works -- it doesn't. (Hence why Dcoetzee replaced "strings" with "characters".) What (I think) Spitzak was defending was that sorting an array of Unicode strings ('dog', 'fish', 'cat' -> 'cat', 'dog', 'fish'), as an array of arrays of UTF-8 bytes, does work.

The reason that the original wording implied the wrong meaning was that it said "Sorting of UTF-8 strings as arrays of bytes" -- it wasn't clear whether the plural meant "sorting strings in general, each sorted internally", or "a group of strings sorted against each other". The present wording ("a set of UTF-8 encoded strings as strings of unsigned bytes") is better, but still not good... that "strings as strings" phrasing just seems to be asking for misunderstandings. -- Perey (talk) 13:50, 24 September 2012 (UTC)


 * Yes, we've already agreed that the original wording is not clear enough. The challenge is stating it without getting hopelessly (and illegibly) technical.  Perhaps something like:
 * Sorting a set of UTF-8 encoded strings lexicographically (each string considered as an array of unsigned bytes) yields the same order as sorting the corresponding set of Unicode strings lexicographically by numerical codepoint value.
 * That's already bordering on illegibility. Part of the problem is that informal terms for collectives ("set", "array", "list") already have been pressed into service not entirely consistently as technical data structures in various languages.  In your summary, for example, saying that we are trying to sort arrays of strings as arrays of arrays of bytes is just as confusing as "strings as strings" -- and the original set of strings might be a list, say, rather than an array.  The technical jargon is obscuring the forest here.
 * -- Elphion (talk) 15:47, 24 September 2012 (UTC)
 * It is indeed. I was merely trying to summarise, because I think the question of whether or not to say "lexicographical order" was obscuring the real problems with wording (which you summed up much better than I). So you could say... we couldn't see that we couldn't see the forest for the trees? ;-)
 * Awful jokes aside, here's a suggestion: We go the opposite way and produce a simpler, possibly less rigorous, but much more understandable explanation. Here's a first draft...
 * Sorting text strings by their UTF-8 representations will put them in the same order as sorting by Unicode codepoint numbers.
 * And then, if necessary, we provide an example to show what we mean. -- Perey (talk) 09:55, 25 September 2012 (UTC)
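
One possible example of the kind suggested above, as a short Python sketch. The sample strings are arbitrary; the sketch relies on Python comparing `str` values by code point and `bytes` values as unsigned bytes.

```python
# A set of strings sorted three ways.
strings = ["z", "\u00e9", "\uff5e", "\U00010000", "abc", "ab"]

# Python orders str objects by code point value.
by_codepoint = sorted(strings)

# bytes objects are ordered lexicographically as unsigned bytes.
by_utf8 = sorted(strings, key=lambda s: s.encode("utf-8"))
by_utf16 = sorted(strings, key=lambda s: s.encode("utf-16-be"))

utf8_matches = by_utf8 == by_codepoint     # same order
utf16_matches = by_utf16 == by_codepoint   # differs for these strings
```

The UTF-16 order differs because U+10000 is stored as the surrogate pair D800 DC00, which sorts before U+FF5E when compared as code units or bytes.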

By the way, there's a parallel section at ASCII... -- AnonMoos (talk) 20:46, 20 September 2012 (UTC)

addition of   template
How is this information supposed to be conveyed in reasonably compact and organized form? If the list format was rewritten as narrative paragraphs, I really don't think that this would increase readability, and would probably make it harder to find information. AnonMoos (talk) 01:32, 27 November 2012 (UTC)

Default Character Encoding
"is also increasingly being used as the default character encoding i"

This is quite true, even the programming language Ruby, as of version 2.0 and higher, now uses UTF-8 by default. It used to use US-ASCII before as default. 62.46.197.236 (talk) 15:54, 8 May 2013 (UTC)

Moved from article
(Should the word deprecated be added here like this | They supersede the definitions given in the following deprecated and/or obsolete works: ? Cya2evernote (talk) 14:31, 11 February 2014 (UTC))

Noncharacters

 * Incnis Mrsi made a change to state that surrogates and noncharacters may not be encoded in UTF-8, and I changed this to only surrogates as noncharacters can be legally represented in UTF-8. BIL then reverted my edit with the comment "Noncharacters, such as reverse byte-order-mark 0xFFFE, shall not be encoded, and software are allowed to remove or replace them in the same ways as for single surrogates". This is simply untrue, and I am pretty sure that nowhere in the Unicode Standard does it specify that noncharacters should be treated as illegal codepoints such as unpaired surrogates. In fact the Unicode Standard Corrigendum #9: Clarification About Noncharacters goes out of its way to explain that noncharacters are permissible for interchange, and that they are called noncharacters because "they are permanently prohibited from being assigned standard, interchangeable meanings, rather than that they are prohibited from occurring in Unicode strings which happen to be interchanged". I think it is clear that noncharacters can legitimately be exchanged in encoded text, and as they can be represented in UTF-8, the article should not claim that they cannot be represented in UTF-8. BabelStone (talk) 18:04, 5 March 2014 (UTC)


 * The Unicode standard seems only concerned with making sure UTF-16 can be used. The noncharacters mentioned can be encoded in UTF-16 no problem. Only the surrogate halves cannot be encoded in UTF-16 so they are trying to fix this by declaring them magically illegal and pretending they don't happen. So there is a difference and user BIL is correct. (Note that I think UTF-16 is seriously broken and should have provided a method of encoding a continuous range, just like UTF-8 can encode the range 0x80-0xff even though those values are also 'surrogate halves'.) Spitzak (talk) 05:38, 7 March 2014 (UTC)
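
For reference, the behaviour of one widely used implementation can be checked with a short Python sketch. It only shows what CPython's strict UTF-8 codec does and is not a ruling on what the standard requires; the helper name `strict_utf8_accepts` is made up for this illustration.

```python
# Noncharacter U+FFFE: CPython's strict UTF-8 codec encodes and decodes it.
noncharacter = "\ufffe"
encoded = noncharacter.encode("utf-8")
round_trip = encoded.decode("utf-8")

def strict_utf8_accepts(text: str) -> bool:
    """Return True if the strict UTF-8 codec will encode `text`."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

# A lone surrogate half is refused by the same codec.
lone_surrogate_ok = strict_utf8_accepts("\ud800")
```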

5- and 6-byte encodings
UTF-8, as it stands, does not know 5- and 6-byte encodings, that’s a fact. Having this very important fact buried in the third paragraph after the table of “design of UTF-8 as originally proposed” is just misleading. I would even prefer a table with those encodings removed altogether, which would be still better than the current version. I agree it might be good to show them, but we need to be very clear there is an important caveat in there. I fail to see how a slight color in background could be confusing (maybe the single-cell in the 4-byte encodings? I do not insist on that one), I was more afraid of a (correct) reminder about accessibility than that.

If you dislike coloring, which device would you find acceptable? Maybe just a thicker line below the 4-byte row? Whatever, it just needs some distinction.

--Mormegil (talk) 17:20, 24 February 2013 (UTC)


 * I don't agree that some distinction is required in the table. It is presented as a table illustrating the original design.  That's a fact, as you say.  If you're concerned that people won't get the message that encodings conforming to RFC 3629 limit the range, then move that proviso into the sentence introducing the table. Trying to indicate it graphically in the table will just muddy the idea behind the design, and will require more explanatory fine print. -- Elphion (talk) 20:57, 24 February 2013 (UTC)


 * Basically agree with Elphion... AnonMoos (talk) 00:46, 25 February 2013 (UTC)


 * Agree here too. It clearly states this is the *ORIGINAL* design. The reason the table is used is that the repeating pattern is far easier to see with 6 lines than with 4. It is immediately followed with a paragraph that gives the dates and standards where the design was truncated. I also think it is far clearer to show the 6 rows and then state that the last two were removed, than to show 4 rows and then later show the 2 rows that were removed (and the 5/6 byte sequences are an important part of UTF-8 history so they must be here). (talk) 06:35, 25 February 2013 (UTC)


 * I think the table should not include the obsolete 5/6 byte sequences, at all. Very misleading - it fooled me. — Preceding unsigned comment added by 90.225.89.28 (talk) 12:28, 2 June 2013 (UTC)


 * The table shall not contain the 5-6 byte sequences. The article shall present first what UTF-8 is today, and then, as a separate section, describe what it was many years ago. It is very confusing to present the material in the order of chronological development. Keep in mind that many come here to take a quick reference for UTF-8 as it is today, and the history is not that important for them. In other words, the article shall present the material in "the most important things go first" order. bungalo (talk) 12:16, 7 September 2013 (UTC)


 * There is still a lot of confusion among programmers, who think that UTF-8 can be as long as 6 bytes, and therefore it's "bad". Looking at this article, or "Joel on unicode" say, explains why. I blame you for this confusion. Many readers will look at the diagrams only, and don't bother to read the text. It is legitimate, and you should take them into consideration. That UTF-8 was once a 6-byte encoding is irrelevant for anything but historical curiosity. bungalo (talk) 12:42, 7 September 2013 (UTC)


 * It's legitimate for a programmer to only look at the diagram? I'm not sure how that makes sense; if you understand the syllogism "UTF-8 can be as long as 6 bytes" -> "it's bad" you should understand enough about Unicode to understand that UTF-8 is not 6 bytes long. In any case, I don't know of any force that can stop a hypothetical programmer that dismisses a technology based on looking at the diagrams in a Wikipedia article. --Prosfilaes (talk) 21:06, 7 September 2013 (UTC)

Ybungalobill -- if those people just have the patience to scroll down to the big table, then they can see things that should be avoided highlighted in bright red... AnonMoos (talk) 23:07, 7 September 2013 (UTC)

Keep the original table. The cutoff is in the middle of the 4-byte sequences, so I do not believe truncating the table between 4 and 5 byte sequences makes any sense. The longer sequences make the pattern used much more obvious.Spitzak (talk) 21:07, 8 September 2013 (UTC)

UTF-8 supports 5- and 6-byte values perfectly well - UNICODE doesn't use them, and thus UNICODE-in-UTF-8 is restricted to the more limited range. (to belabor a point) Encoding high-end UTF-8 beyond the UNICODE range is perfectly legitimate, just don't call it UNICODE - unless UNICODE itself has (in some probably near future) expanded beyond the range it's using today. (more belaboring) The 0x10FFFF is a UNICODE-specific constraint, not one of UTF-8.-- — Preceding unsigned comment added by 70.112.90.192 (talk • contribs)


 * Unicode = ISO 10646 or UCS. UTF = UCS Transformation Format. That is, what UTF-8 is designed to process doesn't use values above 0x10FFFF, and so 5- and 6-byte values are irrelevant. There's no anticipation of needing them; there's 1,000 years of space at the current rate of growth of Unicode, which is expected to trend downward.
 * You can encode stuff beyond 0x10FFFF, but it's no longer a UCS Transformation Format. I'm not sure why you'd do this--hacking non-text data into a nominal text stream?--but it's a local hack, not something that has ever been useful nor something that is widely supported.--Prosfilaes (talk) 12:57, 28 February 2014 (UTC)


 * No, what the UTF-8 encoding scheme was "designed to process" was the full 2^31 space. The UTF-8 standard transformation format uses it only for the Unicode codepoints, and a compliant UTF-8 decoder would report out-of-range values as errors. I think we make that abundantly clear in the article. But "1,000 years of space at the current rate of growth" reminds me of "640K ought to be enough for anybody".  Whether we'll ever need to look for larger limits is a moot point.  There's no particular reason to prohibit software from considering such sequences.  And it's certainly not a good reason to obscure the history of the scheme. I think the article currently strikes the right balance between history and current practice. -- Elphion (talk) 18:26, 5 March 2014 (UTC)


 * It is incoherent to say "the full 2^31 space" without the context that implies "the full 2^31 space of Unicode". So it's not "no"; and in fact, I would say the emphasis is wrong; they wanted to support Unicode/ISO 10646, no matter what its form, not the 2^31 space. There is good reason to stop software from considering such sequences; "if you find F5, reject it" is much safer than adding poorly-tested code to process it, just to reject it at a later level, and discouraging ad-hoc extensions to standard protocols is its own good. libtiff has had security holes because it supported features that nobody had noticed hadn't worked in years. Whether we'll ever need to look for larger limits is not a moot point; writing unneeded, possibly buggy code for a situation that may come up is not wise.


 * If you want a copy of every book Harper Lee wrote, how many bookcases are you going to put up? Personally, I'm not going to put up multiple bookcases on the nigh-inconceivable chance that somehow dozens of new books are going to appear from her pen. We knew that memory was something people were going to use more of, but every single character anyone can think of encoding, including many that nobody cares about, fits on four Unicode planes, some 240,000 characters, with plenty of blank space.--Prosfilaes (talk) 03:43, 6 March 2014 (UTC)


 * It is not incoherent: everybody (even you) knows what is meant.  The scheme was designed when Unicode was expected to include 2^31 codepoints, and that is what the scheme was designed to cover.  As for broken software, nothing you say will prevent it from being written.  The only reasonable defense is to write and promote good software.  Software that parses 5 and 6 byte sequences as well as unused 4 byte sequences is not necessarily bad software.  In terms of safety, I would argue that well tested parsing routines that handle 5- and 6-bytes sequences are inherently safer than adding special case rejections at an early stage.  It is certainly a more flexible approach.  And the analogy with physical bookcases is not particularly apt; keeping code flexible adds only minimal overhead.  And in any event, your opinion or mine about how software should go about handling out-of-range sequences is really beyond the scope of this article.  It suffices that a compliant reader report the errors. -- Elphion (talk) 14:38, 6 March 2014 (UTC)


 * It is incoherent outside that context, and once we explicitly add that context it changes things. What it was designed to process is ISO-10646; the fact that they planned for a lot larger space is a minor detail. In terms of safety, you're saying that well-tested parsing routines that have  => error are less safe than ... => some number that has to be filtered away later? If you believe your opinion about this subject is beyond scope, then don't bring it up. The simple fact is that UTF-8 in the 21st century only supports four-byte sequences, and that no encoder or decoder in history has ever had reason to handle anything longer. Emphasis should be laid on what it is, not what it was. --Prosfilaes (talk) 23:23, 6 March 2014 (UTC)


 * "You keep using that word. I do not think it means what you think it means." (:-) -- Elphion (talk) 15:40, 7 March 2014 (UTC)


 * The original design did in fact aim to cover the full 2^31 space. Ken Thompson's proposal  states:  "The proposed UCS transformation format encodes UCS values in the range [0,0x7fffffff] using multibyte characters of lengths 1, 2, 3, 4, 5, and 6 bytes." -- Elphion (talk) 16:08, 7 March 2014 (UTC)


 * The original design did cover the then-full 2^31 space. But that's in the technical part of the document; the aim of UTF-8 is stated above:


 * With the approval of ISO/IEC 10646 (Unicode) as an international standard and the anticipated wide spread use of this universal coded character set (UCS), it is necessary for historically ASCII based operating systems to devise ways to cope with representation and handling of the large number of characters that are possible to be encoded by this new standard.


 * So, no, it did not aim to cover the full 2^31 space, it aimed to handle "the large number of characters that are possible to be encoded by this new standard."--Prosfilaes (talk) 22:28, 7 March 2014 (UTC)


 * That is a weird interpretation of that sentence. That some characters are "possible to be encoded" does not say anything about what "could" be encoded by that method. −Woodstone (talk) 06:02, 8 March 2014 (UTC)


 * I don't understand your response. "Could" and "possible" mean basically the same thing. I think that sentence is their goal, to cover the characters of Unicode, not the 2^31 space.--Prosfilaes (talk) 22:01, 8 March 2014 (UTC)
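
The layout quoted from Thompson's proposal (values in [0, 0x7fffffff], lengths 1 to 6 bytes) can be sketched in Python as follows. The function name `encode_original_scheme` is made up for this illustration, and the sketch deliberately performs none of the checks a modern encoder needs (it does not reject surrogates or values above U+10FFFF).

```python
def encode_original_scheme(cp: int) -> bytes:
    """Encode cp with the 1- to 6-byte layout of the original design."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("outside the original 31-bit range")
    if cp < 0x80:
        return bytes([cp])            # single byte, identical to ASCII
    # (exclusive upper limit, sequence length, lead-byte marker bits)
    for limit, length, marker in (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    ):
        if cp < limit:
            break
    out = bytearray(length)
    # Fill continuation bytes from the end, six payload bits each.
    for i in range(length - 1, 0, -1):
        out[i] = 0x80 | (cp & 0x3F)
        cp >>= 6
    out[0] = marker | cp              # remaining high bits go in the lead byte
    return bytes(out)
```

For every code point that today's UTF-8 allows, this produces the same bytes as a current encoder; the two schemes only diverge in what they refuse.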

Hi, I just wanted to say that I was using this article for research, and I also found the table to be confusing. It isn't inherently wrong, but as-is it belongs in a History or Technical Background section, not at the top of Description which should reflect current standards and practice. If the table does stay, I think it should be updated to clarify current usage *within the table itself* with a note, color coding, etc. Perhaps we can unite around the general principle that tables/charts/diagrams should be self sufficient, and not rely on surrounding prose for critical clarifications. Proxyma (talk) 15:03, 6 July 2014 (UTC)


 * No, there is no reason to have two very similar tables. In addition the pattern is much easier to see with the 5 & 6 byte lines. Furthermore, a table "reflecting current usage" would have to somehow stop in the *middle* of the 4-byte line. Including the entire 4-byte line is misleading. Nobody seems to have any idea how to do that. Please leave the table as-is. This has been discussed enough.Spitzak (talk) 02:38, 7 July 2014 (UTC)


 * This discussion seems to be based on different opinions about what is easier and more straightforward, so it's hard for me to see how the case has been closed. I gave my feedback because as a new reader I experienced the confusion others warned about here, and I think it's important to focus on the semi-casual reader. Perhaps it's human irrationality, but when readers see a big chart at the top, they interpret it as authoritative, and wouldn't consider parsing the rest of the text to see if it's later contradicted. I agree that two similar charts may be overkill, but in that case we should remove the one which has been inaccurate for more than a decade. Proxyma (talk) 03:03, 7 July 2014 (UTC)


 * It would be useful if you could describe how you were confused. The table is quite clear, showing the layout for codepoints U+0000 to U+7FFFFFFF.  The accompanying text explains that the current standard uses the scheme for the initial portion up to U+10FFFF, which goes into the 4-byte area but does not exhaust it.  This seems perfectly clear to me.  Any table trying to show the "21-bit" space directly would not be nearly as clear; it would obscure the design of the encoding, and would require more verbiage to explain it.  The one improvement I would suggest is that the reduction of the codespace to U+10FFFF might usefully come before the table, so that the reader understands immediately that the full scheme is not currently used by Unicode. --- Elphion (talk) 04:23, 7 July 2014 (UTC)


 * Elphion, I think you and I basically agree. The only modification I'd make to your proposal is to suggest that the clarification of the codespace reduction be made within the table itself. As I said, I think tables/charts/graphs/etc should be self-contained with respect to essential information. The possible exception is a caption, but that's effectively part of what it's captioning. As for why I was confused, it was because the table didn't include such a clarification. I think sometimes it's difficult for those of us who edit an article to see it "with fresh eyes" like a new reader. When we look at the table, we're already aware of the content of the following prose because we've already read it. Proxyma (talk) 06:44, 8 July 2014 (UTC)


 * There have been endless attempts to colorize the table and split line 4 to "clarify it". All results are obviously less clear and have been reverted. They hid the pattern (by splitting line 4) and they just had to add more text than is currently attached to explain what the colored portion did. Or they did not split line 4 but used 3 colors and added even more text than is currently attached. Face it, it is impossible. Stop trying. Only possible change may be to move some of the text before the table, but I think that is less clear than the current order of "original design:", table, "modified later by this rfc...". That at least is in chronological order.Spitzak (talk) 18:56, 8 July 2014 (UTC)
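
The point that the cutoff falls in the middle of the 4-byte row can be checked against a current decoder. The sketch below uses Python's built-in strict UTF-8 codec; the helper name `try_decode` is made up for this illustration.

```python
def try_decode(raw: bytes):
    """Decode with the strict UTF-8 codec, or return None on error."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None

last_valid = try_decode(b"\xf4\x8f\xbf\xbf")          # U+10FFFF
just_past = try_decode(b"\xf4\x90\x80\x80")           # pattern for 0x110000
lead_f5 = try_decode(b"\xf5\x80\x80\x80")             # lead byte F5
end_of_row = try_decode(b"\xf7\xbf\xbf\xbf")          # pattern for 0x1FFFFF
five_bytes = try_decode(b"\xf8\x88\x80\x80\x80")      # old 5-byte form
```

Only the first sequence decodes, although all four 4-byte sequences fit the same bit pattern in the table.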

Proposal of UTF-8 use lists
The article's introduction has an assertion that needs a citation:
 * "UTF-8 is also increasingly being used as the default character encoding in operating systems, programming languages, APIs, and software applications"

It is difficult to find a single source for all these applications. The alternative is to start a Wikipedia list covering all surveys:


 * ~ List of software that is UTF-8 compatible

So, in the list, grouped as below, add tables with the columns "name", "extent of compatibility", "has support for UTF-8" and "uses UTF-8 as default". Tables:


 * Standards:
 * Operating system specifications compatible with UTF-8 (ex. POSIX);
 * Programming language specifications compatible with UTF-8 (ex. Python);
 * Web protocols compatible with UTF-8 (ex. SOAP);


 * Software:
 * Operating systems compatible with UTF-8;
 * Compilers compatible with UTF-8;
 * Mobile APIs compatible with UTF-8;
 * ... compatible with UTF-8;

--Krauss (talk) 11:17, 10 August 2014 (UTC)


 * There's no real limit to these lists, and no clear definition. Is Unix v7 compatible with UTF-8 because you can store arbitrary non-ASCII bytes in filenames? A lot of Unix and Unix programs are high-bit safe. Python isn't especially compatible with UTF-8; it can input any number of character sets, and I believe its internal encoding is nonstandard. Likewise, a lot of programs can process UTF-8 as one character set among many.--Prosfilaes (talk) 21:12, 10 August 2014 (UTC)


 * I think that there are two simple and objective criteria:
 * a kind of "self-determination": the software states (e.g. in its manual pages) that it is UTF-8 compatible;
 * a kind of confirmation: other sources confirm the UTF-8 compatibility.
 * No more, no less... It is enough for the list's objectives, for users, etc. See EXAMPLES below. --Krauss (talk) 00:51, 18 August 2014 (UTC)

Examples
Draft illustrating the use of the two types of references, indicating "self-determination", and "confirming that it does".
 * Python3:
 * Source code is UTF-8 compatible. Self-determination ref-1 and ref-2. Independent sources: S1. "By default, Python source files are treated as encoded in UTF-8.", [Van Rossum, Guido, and Fred L. Drake Jr. Python tutorial. Centrum voor Wiskunde en Informatica, 1995]. S2. "In Python 3, all strings are sequences of Unicode characters". diveintopython3.net.
 * Built-in functions are UTF-8 compatible. Self-determination string — Common string operations. Independent sources: ...
 * Support at the core language level: no.


 * PHP5:
 * Source code is UTF-8 compatible. ...
 * SOME built-in functions are UTF-8 compatible. See  functions and PCRE... and str_replace and some other ones.
 * Not compatible, but automatically accepts UTF-8 source code and incorporates compatible libraries such as mb*, PCRE, etc.
 * Support at the core language level: no. (see PHP6 history).

— Preceding unsigned comment added by Krauss (talk • contribs) 00:51, 18 August 2014
 * MySQL: yes, has compatible modes. ...
 * PostgreSQL: yes, has compatible modes. ...
 * libXML2: uses UTF-8 as default (support at the core level)...
 * I don't think a list of software compatible with UTF-8 is useful. Eventually, all software that is used in any notable manner will be UTF-8 compatible. To do the job properly would require exhaustive mentions of versions and a definition of "compatible" (Lua is compatible with UTF-8 but has no support for it). Such a list is not really suitable here. Johnuniq (talk) 01:25, 18 August 2014 (UTC)
 * Maybe UTF-8 usage is increasing but I don't think it is taking any lead. The heavily used languages C# and Java use UTF-16 as default and Windows does also. I don't think that will change in short term. --BIL (talk) 07:58, 18 August 2014 (UTC)
 * Sure, but even Notepad can read and write UTF-8 these days, so it would feature on a list of software compatible with UTF-8. I can't resist spreading the good word: http://utf8everywhere.org/ Johnuniq (talk) 11:00, 18 August 2014 (UTC)
 * Oppose - as Johnuniq says, this list will be huge and essentially useless. RossPatterson (talk) 10:41, 18 August 2014 (UTC)
 * I have no idea what it means for Python 3 to not have "support at the core language level". It reads in and writes out UTF-8 and hides the details of the encoding of the Unicode support. I don't think this is a productive thing to add to the page.--Prosfilaes (talk) 22:00, 18 August 2014 (UTC)
 * Oppose, per Johnuniq's explanation. Such a list would be too long, it would never be complete, and it would doubtfully be used for the intended purpose. -- Dsimic (talk | contribs) 08:20, 22 August 2014 (UTC)

Next step ...

 * 1) Remove the assertion "UTF-8 is also increasingly being used as the default character encoding in operating systems, programming languages, APIs, and software applications" from the article's introduction. It needs a citation but, as demonstrated, will never get one.
 * 2) ... Think about another kind of list, tractable and smaller, like "List of software that is FULLY UTF-8 compatible"; that is, discuss here what "fully compatible" means nowadays. Examples: LibXML2 can be shown as "configured with UTF8 by default" and "fully compatible"; PHP was looking for "full compatibility" and "Unicode integration" with PHP6, but abandoned the project.

--Krauss (talk) 09:35, 22 August 2014 (UTC)

A bit of searching found these:

https://developer.apple.com/library/mac/qa/qa1173/_index.html

https://developer.gnome.org/pango/stable/pango-Layout-Objects.html#pango-layout-set-text

http://wayland.freedesktop.org/docs/html/apa.html#protocol-spec-wl_shell_surface-request-set_title

Spitzak (talk) 00:51, 24 August 2014 (UTC)

Double error correction
[Figure: Graph indicating that UTF-8 (light blue) exceeded all other encodings of text on the Web in 2007, and that by 2010 it was nearing 50% (ref "MarkDavis2010"). Given that some ASCII (red) pages represent UTF-8 as entities, it is more than half (ref "html4_w3c").]

The legend says "This may include pages containing only ASCII but marked as UTF-8. It may also include pages in CP1252 and other encodings mistakenly marked as UTF-8, these are relying on the browser rendering the bytes in errors as the original character set"... But that is not the original idea: we cannot count something "mistakenly marked as UTF-8", even if it exists. The point is that there are a lot of ASCII pages that contain symbols which web browsers map to UTF-8.

PubMed Central, for example, has 3.1 MILLION articles in ASCII that use real UTF-8 through entity encoding. None of them is a mistake.

The old text (see thumb here) has a note, <ref name="html4_w3c">, which reads: "HTML 4.01 Specification, Section 5 - HTML Document Representation", W3C Recommendation 24 December 1999. Asserts "Occasional characters that fall outside this encoding may still be represented by character references. These always refer to the document character set, not the character encoding. (...) Character references are a character encoding-independent (...).". See also Unicode and HTML/Numeric character references.

This old text also has some confusion (!)... so I corrected it to "Many ASCII (red) pages have also some ISO 10646 symbols represented by entities,[ref] that are in the UTF-8 repertoire. That set of pages may be counted as UTF-8 pages."

--Krauss (talk) 22:45, 23 August 2014 (UTC)


 * I reverted this as you seem to have failed to understand it.


 * First, an Entity IS NOT UTF-8!!!!!!! They contain only ascii characters such as '&' and digits and ';'. They can be correctly inserted into files that are NOT UTF-8 encoded and are tagged with other encodings.


 * Marking an ASCII file as UTF-8 is not a mistake. An ASCII file is valid UTF-8. However since it does not contain any multi-byte characters it is a bit misleading to say these files are actually "using" UTF-8.


 * Marking CP1252 as UTF-8 is very common, especially when files are concatenated, and browsers recognize this due to encoding errors. This graph also shows these mis-identified files as UTF-8 but they are not really.


 * Spitzak (talk) 23:58, 23 August 2014 (UTC)
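
Two of the points above (an ASCII file is valid UTF-8, and CP1252 text labelled as UTF-8 shows up through decoding errors) can be illustrated with a small Python sketch. The sample texts are arbitrary.

```python
# Pure ASCII bytes decode to the same text under either label.
ascii_bytes = b"plain text"
same_text = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")

# CP1252 text containing a non-ASCII letter, wrongly treated as UTF-8.
cp1252_bytes = "caf\u00e9".encode("cp1252")    # b'caf\xe9'
try:
    cp1252_bytes.decode("utf-8")
    mislabel_detected = False
except UnicodeDecodeError:
    # 0xE9 announces a three-byte sequence that never arrives.
    mislabel_detected = True
```

Not every CP1252 byte string fails in this way, so an encoding error is a strong hint of mislabelling rather than a complete test.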


 * Sorry about my initial confused text. Now we have another problem here, about the interpretation of W3C standards and statistics.
 * 1. RFC 2047 (MIME Content-Transfer-Encoding) interpretation, as used in the charset or encoding attributes of HTTP (content-type header with charset) and HTML4 (meta http-equiv): it says what must be interpreted as an "ASCII page" and what is a "UTF-8 page". Your assertion "an ASCII file is valid UTF-8" is a distortion of these considerations.
 * 2. W3C standards, HTML 4.01 (1999): they say that you can add some special symbols (ISO 10646, as expressed by the standard) to an ASCII page by means of entities. Since before 2007, what all web browsers do with such special symbols is replace the entity with a UTF-8 character (rendering the entity as its standard UTF-8 glyph).
 * 3. Statistics: this kind of statistical report must first use the technical standard's options and variations. These options have concrete consequences that can be relevant to "counting web pages". User mistakes may make a good subject for statistical hypothesis testing, but you must first prove that they exist and that they are relevant... In this case, you must prove that the "user mistake" is more important than the technical standard's option. In an encyclopedia, we do not present an unproven hypothesis, nor an irrelevant one.
 * --Krauss (talk) 10:23, 24 August 2014 (UTC)


 * An ASCII file is valid UTF-8. That's irrefutable fact. To speak of "its standard UTF-8 glyph" is a category error; UTF-8 doesn't have glyphs, as it's merely a mapping from bytes to Unicode code points.--Prosfilaes (talk) 21:23, 24 August 2014 (UTC)


 * To elaborate on the second point above: Krauss is conflating "Unicode" and "UTF-8".  They are not the same.  A numerical character entity in HTML (e.g., &#355; or &#x0163;) is a way of representing a Unicode codepoint using only characters in the printable ASCII range.  A browser finding such an entity will use the codepoint number from the entity to determine the Unicode character and will use its font repertoire to attempt to represent the character as a glyph.  But this process does not involve UTF-8 encoding -- which is a different way of representing Unicode codepoints in the HTML byte stream.  The ASCII characters of the entity might themselves be encoded in some other scheme:  the entity in the stream might be ASCII characters or single-byte UTF-8 characters, or even UTF-16 characters, taking 2 bytes each.  But the browser will decode them as ASCII characters first and then, keying on the "&#...;" syntax, use them to determine the codepoint number in a way that does not involve UTF-8. -- Elphion (talk) 21:58, 24 August 2014 (UTC)


 * I agree the problem is that Krauss is confusing "Unicode" with "UTF-8". Sorry I did not figure that out earlier.Spitzak (talk) 23:28, 25 August 2014 (UTC)
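
The distinction can be made concrete with a short Python sketch. It uses the standard library function `html.unescape` as a stand-in for what a browser does with a numeric character reference; the list of byte encodings is an arbitrary selection for illustration.

```python
import html

entity = "&#355;"     # six printable ASCII characters

# Store the entity in several byte encodings, decode it again, then resolve it.
encodings = ["ascii", "utf-8", "utf-16-be", "cp1252"]
resolved = {
    enc: html.unescape(entity.encode(enc).decode(enc)) for enc in encodings
}

# The hexadecimal form names the same code point.
from_hex = html.unescape("&#x0163;")

# UTF-8 represents that code point with entirely different bytes.
utf8_of_character = "\u0163".encode("utf-8")
entity_as_utf8 = entity.encode("utf-8")
```

Whatever the byte encoding of the page, the reference resolves to U+0163, and the bytes C5 A3 that UTF-8 would use for that character never appear in the stream.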


 * Our job as Wikipedia editors is not to interpret the standards, nor to determine what is and isn't appropriate to count as UTF-8 "usage". That job belongs to the people who write the various publications that we cite as references in our articles.  Mark Davis's original post on the Official Google Blog, from whence this graph came and which we (now) correctly cite as its source, doesn't equivocate about the graph's content or meaning.  Neither did his previous post on the topic.  Davis is clearly a reliable source, even though the posts are on a blog, and we should not be second-guessing his claims.  That job belongs to others (or to us, in other venues), and when counter-results are published, we should consider using them. RossPatterson (talk) 11:13, 25 August 2014 (UTC)


 * Thanks for finding the original source. It clearly states that the graph is not just a count of the encoding id from the html header, but actually examines the text, and thus detects ASCII-only (I would assume this also detects UTF-8 when marked with other encodings, and other encodings like CP1252 even if marked as UTF-8): "We detect the encoding for each webpage; the ASCII pages just contain ASCII characters, for example... Note that we separate out ASCII (~16 percent) since it is a subset of most other encodings. When you include ASCII, nearly 80 percent of web documents are in Unicode (UTF-8)." The caption needs to be fixed up. Spitzak (talk) 23:28, 25 August 2014 (UTC)
 * Krauss nicely points out below Erik van der Poel's methodology at the bottom of Karl Dubost's W3C blog post, which makes it explicit that the UTF-8 counts do not include ASCII: "Some documents come with charset labels declaring iso-8859-1, windows-1252 or even utf-8 when the byte values themselves are never greater than 127. Such documents are pure US-ASCII (if no ISO 2022 escape sequences are encountered).". RossPatterson (talk) 17:24, 27 August 2014 (UTC)
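
A content-based classification along the lines of the quoted descriptions might look like the following Python sketch. It is an assumption-laden illustration, not the method actually used for the graph: the function name `classify` and the three labels are made up, and ISO 2022 escape sequences (mentioned in the quotation) are ignored.

```python
def classify(raw: bytes) -> str:
    """Classify page bytes by content, ignoring any declared charset label."""
    if all(b < 0x80 for b in raw):
        return "ascii"                # no byte value greater than 127
    try:
        raw.decode("utf-8")
        return "utf-8"                # non-ASCII bytes form valid sequences
    except UnicodeDecodeError:
        return "other"                # e.g. CP1252 or ISO-8859-1 text

entity_page = classify(b"<p>caf&#233;</p>")
utf8_page = classify("<p>caf\u00e9</p>".encode("utf-8"))
cp1252_page = classify("<p>caf\u00e9</p>".encode("cp1252"))
```

Under such a scheme a page that uses only character references counts as ASCII, whatever its label says.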

Wow, a lot of discussion! So many intricate nuances of interpretation; sorry, I imagined something simpler when I started...

Mark Davis does not say anything about HTML entities or about "user mistakes", so, suggestion: let's remove it from the article's text. --Krauss (talk) 03:33, 26 August 2014 (UTC)
 * "Unicode" vs "UTF8": Mark Davis uses "Unicode (UTF8...)" in the legend, and later, in the text, writes "As you can see, Unicode has (...)". So, for his audience, "Unicode" and "UTF8" are nearly the same thing (only a very specialized audience feels pain with it). Here, in our discussion, it is difficult to know what technical level we must use.
 * About Mark Davis's methodology, etc.: no citation, only a vague "Every January, we look at the percentage of the webpages in our index that are in different encodings"... But SEE HERE a similar discussion, by those who did the job (the data have been compiled by Erik van der Poel)
 * Trying an answer to the glyph discussion: the Wikipedia glyph article is a little bit confusing (let's review!); see W3C use of the term. In not-so-technical jargon, or even in the W3C's "loose sense", we can say that there is a set of "standard glyphs/symbols" that are represented in a subset of "UTF-8-like symbols", and are in neither ASCII nor CP1252 "symbols"... Regular people see that "ASCII≠CP1252" and "UTF8≠CP1252"... So, even regular people see that "ASCII≠UTF8" in the context of the illustration, and that HTML entities are mapped to something that is a subset of UTF8-like symbols.
 * Neither W3C page you point to says anything about UTF-8, and I don't have a clue where you're getting "UTF-8-like symbols" from. Unicode is the map from code points to symbols and all the associated material; UTF-8 is merely a mapping from bytes to code points. The fact that it can be confusing to some does not make it something we should conflate.--Prosfilaes (talk) 06:00, 27 August 2014 (UTC)
 * My text only says "W3C use of the term" (the term "glyph", not the term "UTF-8"), and there (at the linked page) is a table with a "Glyph" column, with images showing the typical symbols. This W3C use of the term "glyph" as typical symbol conflicts with the Wikipedia's thumb illustration with the text "various glyphs representing the typical symbol". Perhaps W3C is wrong, but since 2010 we need Refimprove (Wikipedia's glyph definition "needs additional citations for verification").
 * About my bolded suggestion, "let's remove it", OK? Do we need to wait or vote, or can we do it now? --Krauss (talk) 12:38, 27 August 2014 (UTC)
 * I'm confused. Is Krauss questioning Mark Davis's reliability as a reference for this article?  It seems to me that the graphs he presents are entirely appropriate to this article, especially after reading Erik van der Poel's methodology, as described in his 2008-05-08 post at the bottom of Karl Dubost's W3C blog post, which is designed to recognize UTF-8 specifically, not just Unicode in general. RossPatterson (talk) 17:16, 27 August 2014 (UTC)
 * Sorry for my English; I take Mark Davis and W3C to be reliable sources (!). I think Mark Davis and W3C write some articles for the general public and other articles for specialized technical people... We here cannot confront "specialized text" with "loose text", even by the same author: this confrontation will obviously generate some "false evidence of contradiction" (see e.g. the "Unicode" vs "UTF8", "glyph" vs "symbol", etc. debates about correct use of the terms). About Erik van der Poel's explanations, well, this is another discussion, where I agree with your first paragraph about it, "Our job as Wikipedia editors (...)". Now I only want to check the suggestion ("let's remove it from article's text" above). --Krauss (talk) 11:13, 28 August 2014 (UTC)

It appears this discussion is moot - the graph image has been. RossPatterson (talk) 03:41, 30 August 2014 (UTC)
 * Thanks, fixed. --Krauss (talk) 17:25, 30 August 2014 (UTC)

Backward compatibility:
Re: ''One-byte codes are used only for the ASCII values 0 through 127. In this case the UTF-8 code has the same value as the ASCII code. The high-order bit of these codes is always 0. This means that UTF-8 can be used for parsers expecting 8-bit extended ASCII even if they are not designed for UTF-8.''

I'm a non-guru struggling with W3C's strong push to UTF-8 in a world of ISO-8859-1 and windows-1252 text editors, but either I have misunderstood this completely or else it is wrong? Seven-bit is the same in ASCII or UTF-8, sure; but in 8-bit extended ASCII (whether "extended" to ISO-8859-1, windows-1252 or whatever), a byte with the MSB "on" is one byte in extended ASCII, two bytes in UTF-8. A parser expecting "8-bit extended ASCII" will treat each of the UTF-8 bytes as a character. Result, misery. Or have I missed something? Wyresider (talk) 19:18, 5 December 2014 (UTC)


 * No, it is not a problem unless your software decides to take two things that it thinks are "characters" and insert another byte in between them. In 99.999999% of the cases when reading the bytes in, the bytes with the high bit set will be output unchanged, still in order, and thus the UTF-8 is preserved. You might as well ask how programs handle English text when they don't have any concept of correct spelling and each word is a bunch of bytes that it looks at individually. How do the words get read and written when the program does not understand them? It is pretty obvious how it works, and this is why UTF-8 works too. Spitzak (talk) 19:52, 5 December 2014 (UTC)


 * Wyresider -- This has been discussed in the article talk archives. Most of the time, if a program doesn't mess with what it doesn't understand, or treats sequences of high-bit-set characters as unanalyzable units, then simple filter etc. programs will often pass non-ASCII UTF8 characters through unaltered.  It's a design feature of UTF8 which is designed to lighten the programming load of transition from single-byte to UTF8 -- though certainly not an absolute guarantee of backward compatibility... AnonMoos (talk) 14:51, 7 December 2014 (UTC)


 * In a world of ISO-8859-1 and Windows-1252 text editors? What world is that? I live in a world where the most-spoken language is Chinese, which clears a billion users alone, and the text editors that come with any remotely recent version of Linux, Windows or Macs, or any version of Android or iOS, support UTF-8 (or at least Unicode). There's no magic button that makes UTF-8 work invariably with systems expecting 8-bit extended ASCII (or Windows-1252 with systems expecting 8-bit extended ASCII to not use C1 control codes 80-9F), but UTF-8 works better than, say, Big5 (which uses sub-128 values as part of multibyte characters) or ISO-2022-JP (which can use escape sequences to define sub-128 values to mean a character set other than ASCII).--Prosfilaes (talk) 13:45, 8 December 2014 (UTC)


 * Wikipedia Talk pages are not a forum, but to be crystal clear, ASCII bytes have a high bit of zero and are UTF-8-clean, and anything that has a high bit of one isn't ASCII and will almost certainly have some bytes that will be treated differently in a UTF-8 context. A parser expecting data encoded in Windows codepage 1252 or in ISO 8859-1 isn't parsing ASCII, and won't understand UTF-8 correctly. RossPatterson (talk) 00:09, 9 December 2014 (UTC)


 * There are many parsers that don't expect UTF-8 but work perfectly with it. An example is the printf "parser". The only sequence of bytes it will alter starts with an ASCII '%' and contains only ASCII (such as "%0.4f"). All other byte sequences are output unchanged. Therefore all multibyte UTF-8 characters are preserved. Another example is filenames: on Unix, for instance, the only bytes that mean anything are NUL and '/'; all other bytes are considered part of the filename, and are not altered. Therefore all UTF-8 multibyte characters can be parts of filenames. Spitzak (talk) 02:24, 9 December 2014 (UTC)
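
The byte-transparency argument in this thread can be illustrated with a minimal Python sketch. The helpers `split_path` and `ascii_upper` are hypothetical stand-ins for "software that only understands ASCII"; they are not taken from any real program. The last lines show the opposite case raised by Wyresider and RossPatterson, where a consumer actually interprets the high-bit bytes as ISO-8859-1.

```python
# Hypothetical byte-oriented helpers that know nothing about UTF-8.
def split_path(path: bytes) -> list:
    """Split on the ASCII '/' byte only, as a Unix-style path parser would."""
    return path.split(b"/")

def ascii_upper(data: bytes) -> bytes:
    """Uppercase ASCII letters a-z; pass every other byte through unchanged."""
    return bytes(b - 32 if 0x61 <= b <= 0x7A else b for b in data)

name = "docs/résumé/日本語.txt".encode("utf-8")

# No byte of a multibyte UTF-8 sequence is below 0x80, so none can equal '/'.
decoded_parts = [p.decode("utf-8") for p in split_path(name)]

# Only ASCII letters change; the multibyte sequences survive intact.
shouted = ascii_upper(name).decode("utf-8")

# A consumer that *interprets* the bytes as ISO-8859-1 sees two characters
# where the UTF-8 author meant one.
misread = "é".encode("utf-8").decode("latin-1")

print(decoded_parts, shouted, misread)
```

The sketch passes data through unchanged in the first two cases, which is the situation Spitzak and AnonMoos describe; it is only the third case, where bytes are rendered or reinterpreted, that produces the "misery" from the original question.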

Many errors
I'm not an expert here, but I am an engineer and I do recognize when I read something that's illogical.

There are 2 tables:

https://en.wikipedia.org/wiki/UTF-8#Description

https://en.wikipedia.org/wiki/UTF-8#Codepage_layout

They cannot both be correct. If the one following #Description is correct, then the one following #Codepage_layout must be wrong.

Embellishing on the table that follows #Description:

1-byte scope: 0xxxxxxx = 7 bits = 128 code points.

2-byte scope: 110xxxxx 10xxxxxx = 11 bits = 2048 additional code points.

3-byte scope: 1110xxxx 10xxxxxx 10xxxxxx = 16 bits = 65536 additional code points.

4-byte scope: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx = 21 bits = 2097152 additional code points.

The article says: "The next 1,920 characters need two bytes to encode". As shown in the #Description table, 11 bits add 2048 code points, not 1920 code points. The mistake is in thinking that the 1-byte scope and the 2-byte scope overlap so that the 128 code points in the 1-byte scope must be deducted from the 2048 code points in the 2-byte scope. That's wrong. The two scopes do not overlap. They are completely independent of one another.

The text following the #Codepage_layout table says: "Orange cells with a large dot are continuation bytes. The hexadecimal number shown after a "+" plus sign is the value of the 6 bits they add." This implies there's a scope that looks like this:

2-byte scope: 01111111 10xxxxxx = 6 bits = 64 additional code points.

While that's possible, it conflicts with the #Description table. These discrepancies seem pretty serious to me. So serious that they put into doubt the entire article.

MarkFilipak (talk) 03:13, 23 February 2015 (UTC)


 * 11000001 10000000 encodes the same value as 01000000. So, yes, they do overlap.--Prosfilaes (talk) 03:24, 23 February 2015 (UTC)
 * Huh? The scopes don't overlap. Perhaps you mean that they map to the same glyph? Are you sure? I don't know because I've not studied the subject. If this is a logical issue with me, it's probably a logical issue with others. Perhaps a section addressing this issue is appropriate, eh?
 * Also, what about the "Orange cells" text and the 2-byte scope I've added to be consistent? That scope conflicts with the other table. Do you have a comment about that? Thank you. --MarkFilipak (talk) 03:53, 23 February 2015 (UTC)
 * Perhaps this is what's needed. What do you think?
 * 1-byte scope: 0xxxxxxx = 7 bits = 128 code points.
 * 2-byte scope: 1100000x 10xxxxxx = 7 bits = 128 alias code points that map to the same points as 0xxxxxxx.
 * 2-byte scope: 1100001x 10xxxxxx = 7 bits = 128 additional code points.
 * 2-byte scope: 110001xx 10xxxxxx = 8 bits = 256 additional code points.
 * 2-byte scope: 11001xxx 10xxxxxx = 9 bits = 512 additional code points.
 * 2-byte scope: 1101xxxx 10xxxxxx = 10 bits = 1024 additional code points.
 * 3-byte scope: 1110xxxx 10xxxxxx 10xxxxxx = 16 bits = 65536 additional code points.
 * 4-byte scope: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx = 21 bits = 2097152 additional code points.
 * --MarkFilipak (talk) 05:54, 23 February 2015 (UTC)


 * For the first question, it seems you don't understand "code points" the way the article means. "Code points" here refer to Unicode code points. The unicode code points are better described in the Plane (Unicode) article. In the UTF-8 encoding the Unicode code points (in binary numbers) are directly mapped to the x:es in this table:
 * 1-byte scope: 0xxxxxxx = 7 bits = 128 possible values.
 * 2-byte scope: 110xxxxx 10xxxxxx = 11 bits = 2048 possible values.
 * 3-byte scope: 1110xxxx 10xxxxxx 10xxxxxx = 16 bits = 65536 possible values.
 * 4-byte scope: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx = 21 bits = 2097152 possible values.
 * That means that in the 2-byte scheme you could encode the 2048 first code points, but you are not allowed to encode the first 128 code points, as described in the Overlong encodings section. And similarly it would be possible to encode all the 65536 first code points in the 3-byte scheme, but you are only allowed to use the 3-byte scheme from the 2049th code point. And the 4-byte scheme is used from the 65537th to the 1114112th (the last one) code point.
 * For your second question, continuation bytes (starting with 10) are only allowed after start bytes (starting with 11), not after "ascii bytes" (starting with 0). The "ascii bytes" are only used in the 1-byte scope. Boivie (talk) 10:58, 23 February 2015 (UTC)
 * Thanks for your reply. What you wrote is inconsistent with what Prosfilaes wrote, to wit: "11000001 10000000 encodes the same value as 01000000." Applying the principle of Overlong encodings, it seems to me that "11000001 10000000" is an overlong encoding (i.e., the encoding is required to be "01000000"), therefore, unless I misunderstand the principle of overlong encoding, what Prosfilaes wrote about "11000001 10000000" is wrong. I'll let you two work it out between you.
 * It occurs to me that the "1100000x 10xxxxxx" scope could therefore be documented as follows:
 * 2-byte scope: 1100000x 10xxxxxx = 7 bits = 128 illegal codings (see: Overlong encodings).
 * Should it be so documented? Would that be helpful?
 * Look, I don't want to be a pest, but this article seems inconsistent and to lack comprehensiveness. I have notions, not facts, so I can't "correct" the article. I invite all contributors who do have the facts to consider what I've written. I will continue to contribute critical commentary if encouraged to do so, but my lack of primary knowledge prohibits me from making direct edits on the source document. Regards --MarkFilipak (talk) 15:12, 23 February 2015 (UTC)
 * I see nothing wrong with Prosfilaes' comment. Using the 2-byte scheme, 11000001 10000000 would decode to code point 1000000, even if it would be the wrong way to encode it. I am also not sure it would be helpful to include illegal byte sequences in the first table under the Description header. It is clearly stated in the table from which code point to which code point each scheme should be used. The purpose of the table seems to be to show how to encode each code point, not to show how not to encode something. Boivie (talk) 17:16, 23 February 2015 (UTC)
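
The bit patterns and range limits discussed above can be sketched in Python as follows. `encode_utf8` and `naive_decode_2byte` are hypothetical names introduced only for this illustration; the sketch assumes the four schemes exactly as listed in the thread and uses Python's built-in strict decoder to show how a conforming decoder treats the overlong pair.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point with the shortest scheme that is allowed for it."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp < 0x80:        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

def naive_decode_2byte(b: bytes) -> int:
    """Read the 11 payload bits of a 2-byte pattern with no validity check."""
    return ((b[0] & 0x1F) << 6) | (b[1] & 0x3F)

overlong = bytes([0b11000001, 0b10000000])
naive_value = naive_decode_2byte(overlong)  # payload bits read as 0b1000000

try:
    overlong.decode("utf-8")
    strict_accepts = True
except UnicodeDecodeError:
    strict_accepts = False

print(hex(naive_value), strict_accepts)
```

Read this way, both statements in the thread hold at once: the payload bits of 11000001 10000000 spell the same number as 01000000, and a strict decoder still refuses the sequence because a shorter encoding exists.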


 * 1, Boivie, you wrote, "I see nothing wrong with Prosfilaes' comment." Assuming that you agree that "11000001 10000000" is overlong, and that overlong encodings "are not valid UTF-8 representations of the code point", then you must agree that "11000001 10000000" is invalid. How can what Prosfilaes wrote, "11000001 10000000 encodes the same value as 01000000," be correct if it's invalid? If it's invalid, then "11000001 10000000" doesn't encode any Unicode code point. Comment? --MarkFilipak (talk) 20:09, 23 February 2015 (UTC)
 * 2, Regarding whether invalid encodings should be shown as invalid so as to clarify the issue in readers' minds, I ask: What's wrong with that? I assume you'd like to make the article as easily understood as possible. Comment? --MarkFilipak (talk) 20:09, 23 February 2015 (UTC)
 * 3, Regarding the Orange cells quote: "Orange cells with a large dot are continuation bytes. The hexadecimal number shown after a "+" plus sign is the value of the 6 bits they add", it is vague and misleading because,
 * 3.1, those cells don't add 6 bits, they cause a whole 2nd byte to be added, and
 * 3.2, they don't actually add because the 1st byte ("0xxxxxxx") doesn't survive the addition -- it's completely replaced.
 * Describing the transition from
 * this: 00000000, 00000001, 00000010, ... 01111111, to
 * this: 11000010 10000000, 11000010 10000001, 11000010 10000010, ... 11000010 10111111,
 * as resulting from "the 6 bits they add" is (lacking the detail I supply in the 2 preceding sentences) going to confuse or mislead almost all readers. It misled me. Now that I understand the process architecture, I can interpret "the 6 bits they add" as sort of a metaphorical statement, but there is a better way.
 * My experience as an engineer and documentarian is to simply show the mapping from inputs to outputs (encodings to Unicode code points in this case) and trust that readers will see the patterns. Trying to explain the processes without showing the process is not the best way. I can supply a table that explicitly shows the mappings which you guys can approve or reject, but I need reassurance up front that what I produce will be considered. If not open to such consideration, I'll bid you adieu and move on to other aspects of my life. Comment? --MarkFilipak (talk) 20:09, 23 February 2015 (UTC)
 * ¢ U+00A2 is encoded as C2 A2, and if you look in square C2 you find 0080 and in square A2 you find +22. If you in hexadecimal add the continuation byte's +22 to the start byte's 0080 you get 00A2, which is the code point we started with. So the start byte gives the first bits, and the continuation byte gives the last six bits in the code point.
 * I have no idea why a transition from the 1-byte scheme to the 2-byte scheme would be at all relevant in that figure. Boivie (talk) 21:02, 23 February 2015 (UTC)
 * Thank you for the explanation. --MarkFilipak (talk) 21:14, 23 February 2015 (UTC)
 * "Orange cells with a large dot are continuation bytes..." "White cells are the start bytes for a sequence of multiple bytes". Duh! You mean that the Orange cells aren't part of a "sequence of multiple bytes"? This article is awful and you guys just don't get it. I'm not going to waste my time arguing. I'm outta here. Bye. --MarkFilipak (talk) 21:25, 23 February 2015 (UTC)
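
Boivie's worked example for ¢ (U+00A2) can be checked with a few lines of Python. This is only a sketch of the arithmetic behind the code page table; the variable names `lead_base` and `cont_offset` are made up here to match the "0080" and "+22" cell labels mentioned above.

```python
encoded = "¢".encode("utf-8")       # two bytes: C2 A2
lead, cont = encoded

lead_base = (lead & 0x1F) << 6      # value shown in the start-byte cell
cont_offset = cont & 0x3F           # value shown after "+" in the continuation cell
code_point = lead_base + cont_offset

print(f"{lead:02X} {cont:02X} -> {lead_base:04X} + {cont_offset:02X} = U+{code_point:04X}")
```

The start byte supplies the high bits and the continuation byte supplies the low six bits, which is all that "the 6 bits they add" is meant to convey.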


 * The tables are correct. The "scopes" as you call them do overlap. Every one of the lengths can encode a range of code points starting at zero, therefore 100% of the smaller length is overlapped. However the UTF-8 definition further states that when there is this overlap, only the shorter version is valid. The longer version is called an "overlong encoding" and that sequence of bytes should be considered an error. So the 1-byte sequences can do 2^7 code points, or 128. The 2-byte sequences have 11 bits and thus appear to do 2^11 code points or 2048, but exactly 128 of these are overlong because there is a 1-byte version, thus leaving 2048-128 = 1920, just as the article says. In addition two of the lead bytes for 2-byte sequences can *only* start an overlong encoding, so those bytes (C0,C1) can never appear in valid UTF-8 and thus are colored red in the byte table. Spitzak (talk) 20:02, 23 February 2015 (UTC)
 * Thank you for the explanation. --MarkFilipak (talk) 20:13, 23 February 2015 (UTC)
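
The figure of 1,920 can also be confirmed by brute force. The following Python sketch simply tries every possible pair of bytes against Python's built-in strict UTF-8 decoder and keeps the pairs that decode to exactly one character; it assumes nothing beyond that decoder's behaviour.

```python
valid = []
for first in range(0x100):
    for second in range(0x100):
        pair = bytes([first, second])
        try:
            text = pair.decode("utf-8")
        except UnicodeDecodeError:
            continue
        # Two ASCII bytes decode to two characters; keep only the pairs
        # that form a single two-byte character.
        if len(text) == 1:
            valid.append(pair)

lead_bytes = sorted({pair[0] for pair in valid})

print(len(valid), hex(lead_bytes[0]), hex(lead_bytes[-1]))
```

The surviving lead bytes run from C2 to DF, which gives 30 lead bytes times 64 continuation bytes; C0 and C1 never appear because every sequence they could start would be overlong.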

Backward compatibility II
It rates a mention that UTF-8 byte sequences never use a zero byte except for the ASCII NUL. This means that processors that expect NUL-terminated character strings (e.g. C and C++ string libraries) can cope with UTF-8.

Paul Murray (talk) 03:56, 23 May 2015 (UTC)


 * ASCII bytes always represent themselves. I don't think we need to belabor that for \0.--Prosfilaes (talk) 15:08, 23 May 2015 (UTC)


 * Backward compatibility with ASCII is mentioned in the very first paragraph. Spitzak (talk) 05:11, 31 May 2015 (UTC)


 * Should this UTF-8 article explicitly mention that UTF-8 is compatible with software and hardware designed to expect null-terminated strings terminated by a single zero byte, while UTF-16 is incompatible? Yes, I agree with Paul Murray. --DavidCary (talk) 02:23, 18 June 2015 (UTC)
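
The contrast with UTF-16 can be shown with a short Python sketch. `c_strlen` is a hypothetical helper that only mimics what a C-style `strlen` would do (count bytes up to the first zero byte); it is not a binding to the C library.

```python
def c_strlen(buf: bytes) -> int:
    """Mimic C strlen(): number of bytes before the first zero byte."""
    end = buf.find(b"\x00")
    return len(buf) if end == -1 else end

text = "héllo"
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

utf8_has_nul = 0 in utf8     # no zero byte unless the text contains U+0000
utf16_has_nul = 0 in utf16   # every ASCII character carries a zero high byte

print(c_strlen(utf8), len(utf8))    # the whole UTF-8 string is seen
print(c_strlen(utf16), len(utf16))  # UTF-16 is cut off after the first byte
```

In UTF-8 the only code point that produces a zero byte is U+0000 itself, so a NUL-terminated string routine sees the full text, whereas the same routine stops at the second byte of the UTF-16 form.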