Talk:UTF-16

UTF-16 and UCS-2 as one topic
The history of this page makes it look like there was never anything but a redirect to UCS here (at UTF-16), but I am fairly certain there was a separate entry for UTF-16 in the recent past.

I do not like the idea of redirecting to UCS. While UCS should mention and perhaps summarize what encodings it defines, I strongly feel that the widely-used UTF-8, UTF-16, and UTF-32 encodings should have their own entries, since they are not exclusively tied to the UCS (as UCS-2 and UCS-4 are) and since they require much prose to accurately explain. Therefore, I have replaced the UCS redirect with a full entry for UTF-16. --mjb 16:53, 13 October 2002


 * I no longer feel so strongly about having both encoding forms discussed in the same article. That's fine. However, I do have a problem with saying that they are just alternative names for the same encoding form. You can't change the definition of UCS-2 and UTF-16 in this way just because the names are often conflated; the formats are defined in standards, and there is a notable difference between them. Both points (that they're slightly different, and that UCS-2 is often mislabeled UTF-16 and vice-versa) should be mentioned. I've edited the article accordingly today. &mdash; mjb 23:37, 13 October 2005 (UTC)


 * I would also like to see UCS-2 more clearly separated from UTF-16 - they are quite different, and it's important to make it clear that UCS-2 is limited to just the 16-bit codepoint space defined in Unicode 1.x. This will become increasingly important with the adoption of GB18030 for use in mainland China, which requires characters defined in Unicode 3.0 that are beyond the 16-bit BMP space. &mdash; Richard Donkin 09:07, 18 November 2005 (UTC)


 * I wanted to know something about UTF-16. Talking about UCS-2 is confusing. - Some anonymous user


 * I agree strongly with the idea that this entire article as of Sept. 27, 2014 seems to be a discussion of UCS-2 and UTF-16. I've never heard of UCS-2 (nor do I care to learn about it, especially in an article about something that supersedes it). I came to this article to briefly learn the difference between UTF-8 and UTF-16, from a practical point of view, and I found nothing useful. This article drones on and on about what UCS-2 was and how UTF-16 differs, which couldn't possibly be of interest except as a regurgitation of information easily found elsewhere, and only to the very few people who care about UCS-2. It's like taking 5 pages to explain the difference between a Bic lighter and a Zippo: just unnecessary, and of doubtful use to 99% of the people looking for an understanding of the differences between UTF-8, UTF-16, and UTF-32. Needs a complete rewrite from a modern perspective. I've read that UTF-8 is ubiquitous on the web; if so, why should we care about UTF-16, and especially UCS-2??? 72.172.1.40 (talk) 20:02, 27 September 2014 (UTC)


 * For what it is worth, UCS-2 is used to encode strings in JavaScript. Peaceray (talk) 19:07, 29 May 2015 (UTC)

 * Stop! Standard time! If you look into the standard you'll find "Where ECMAScript operations interpret String values, each element is interpreted as a single UTF-16 code unit. However, ECMAScript does not place any restrictions or requirements on the sequence of code units in a String value, so they may be ill-formed when interpreted as UTF-16 code unit sequences." Now it does make sense to conflate UCS-2 and UTF-16 whenever you only care about the BMP (Basic Multilingual Plane). However, in an encyclopedic article I expect both concise information and accuracy. Assarbad (talk) 09:55, 15 October 2015 (UTC)


 * .NET & the Windows API apparently use UCS-2 as well. Peaceray (talk) 19:19, 29 May 2015 (UTC)


 * Both Java and Windows may sometimes say they use UCS-2, but it is often unclear whether they really are limited to UCS-2. If the API does not actually do something special with non-BMP code points or with surrogate halves, then it "supports" UTF-16 just fine. An API that only treats slash, colon, null, and a few other code points specially, as the majority of Windows APIs do, therefore supports UTF-16. Officially filenames are UTF-16, so any API that accepts a filename stored as 16-bit units is UTF-16, no matter what the documentation says. Spitzak (talk) 04:54, 31 May 2015 (UTC)

UTF-16LE BOMs Away!
Concerning the text explaining UTF-16LE and UTF-16BE, would it not be better that instead of saying,


 * A BOM at the beginning of UTF-16LE or UTF-16BE encoded data is not considered to be a BOM; it is part of the text itself.

we say something like,


 * No BOM is required at the beginning of UTF-16LE or UTF-16BE encoded data and, if present, it would not be understood as such, but instead be mistaken as part of the text itself.

--Chris 17:27, 12 January 2006 (UTC)
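For illustration, the distinction being discussed shows up directly in Python's codecs: the byte-order-detecting codec consumes a leading BOM, while the explicitly-ordered UTF-16LE codec leaves U+FEFF in the text (a small sketch, not from the article):

```python
raw = b'\xff\xfeH\x00i\x00'   # little-endian "Hi" preceded by the byte sequence FF FE

# With byte-order detection, FF FE is consumed as a BOM...
print(repr(raw.decode('utf-16')))     # 'Hi'
# ...with an explicit byte order, U+FEFF remains part of the text
print(repr(raw.decode('utf-16-le')))  # '\ufeffHi'
```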

Clean-up?
Compare the clean, snappy, introductory paragraph of the UTF-8 article to the confusing ramble that starts this one. I want to know the defining characteristics of UTF-16 and I don't want to know (at this stage) what other specifications might or might not be confused with it. Could someone who understands this topic consider doing a major re-write.

A good start would be to have one article for UTF-16 and another article for UCS-2. The UTF-16 article could mention UCS-2 as an obsolete first attempt and the UCS-2 article could say that it is obsolete and is replaced by UTF-16. --137.108.145.11 17:02, 19 June 2006 (UTC)


 * I rewrote the introduction to hopefully make things clearer; please correct if you find technical errors. The rest of the article also needs some cleanup, which I may attempt.  I disagree that UTF-16 and UCS-2 should be separate articles, as they are technically so similar.  Dmeranda 15:12, 18 October 2006 (UTC)


 * Agreed, despite the different names they are essentially different versions of the same thing
 * UCS-2 --> 16 bit unicode format for unicode versions <= 3.0
 * UTF-16 --> 16 bit unicode format for unicode versions >= 3.1
 * Plugwash 20:13, 18 October 2006 (UTC)

Surrogate Pair Example Wrong?
The example: 119070 (hex 1D11E) / musical G clef / D834 DD1E: the surrogate pair should be D874 DD1E for 1D11E. Can somebody verify that and change the example? —Preceding unsigned comment added by 85.216.46.173 (talk)
 * No, the surrogate pair in the article is correct (and btw this incorrect "correction" has come up many times in the article's history before)


 * 0x1D11E - 0x10000 = 0x0D11E
 * 0x0D11E = 0b00001101000100011110
 * split the 20-bit number in half
 * 0b0000110100 = 0x0034
 * 0b0100011110 = 0x011E
 * add the surrogate bases
 * 0x0034 + 0xD800 = 0xD834
 * 0x011E + 0xDC00 = 0xDD1E


 * -- Plugwash 18:39, 8 November 2006 (UTC)
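The steps listed above can be sketched in Python (an illustrative helper written for this page; the name `encode_surrogate_pair` is invented here):

```python
def encode_surrogate_pair(cp):
    """Encode a supplementary code point (U+10000..U+10FFFF) as a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000                 # the 20-bit value
    high = 0xD800 + (v >> 10)        # top 10 bits plus the high-surrogate base
    low = 0xDC00 + (v & 0x3FF)       # bottom 10 bits plus the low-surrogate base
    return high, low

# U+1D11E (musical G clef)
print([hex(u) for u in encode_surrogate_pair(0x1D11E)])  # ['0xd834', '0xdd1e']
```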

Decoding example
Could there be an example for decoding the surrogate pairs, similar in format to the encoding example procedure? Neilmsheldon 15:29, 27 December 2006 (UTC)
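A decoding walk-through along those lines simply reverses the encoding arithmetic; here is a hedged Python sketch (the helper name is invented for this page):

```python
def decode_surrogate_pair(high, low):
    """Combine a high/low surrogate pair back into a single code point."""
    assert 0xD800 <= high <= 0xDBFF, "not a high (lead) surrogate"
    assert 0xDC00 <= low <= 0xDFFF, "not a low (trail) surrogate"
    # undo the surrogate bases, reassemble the 20-bit value, re-add 0x10000
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print(hex(decode_surrogate_pair(0xD834, 0xDD1E)))  # 0x1d11e
```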

Java not UTF-16?
After reading through the documentation for the Java Virtual Machine (JVM) (See Java 5 JVM section 4.5.7), it seems to me that Java does not use UTF-16 as claimed. Instead it uses a modified form of UTF-8, but where it still uses surrogate pairs for supplemental codepoints (each surrogate being UTF-8 encoded though); so it's a weird nonstandard UTF-8/UTF-16 mishmash. This is for the JVM, which is the "byte code". I don't know if Java (the language) exposes something more UTF-16 like than the underlying bytecode, but it seems clear that the bytecode does not use UTF-16. Can somebody more Java experienced than me please verify this and correct this article if necessary. - Dmeranda 06:00, 23 October 2007 (UTC)
 * The serialisation format and the bytecode format do indeed store strings in "modified UTF-8" but the strings stored in memory and manipulated by the application are UTF-16. Plugwash 09:25, 23 October 2007 (UTC)
Verified! (2012)

Current documentation states it uses UTF-8 internally, with one exception: it uses an 'invalid UTF-8' sequence to encode the NUL character, so that strlen/strcmp (which depend on \0 (NUL) ending the string) keep working. I'm not sure why this was done, since when thinking through that problem (I was seeing if there was a case where an 'ASCII' null might be embedded in a UTF-8 encoded string), a *valid* UTF-8 string can't have a 0 byte except as a NUL, since every byte of even the longest encoding for a 32-bit value (unused, as a maximum of 4 bytes is required for full Unicode support) requires the high bit to be 1. A buggy UTF-8 implementation might try to rely on the fact that the first byte specifies the number of data bytes for the char, and that the top 2 bits of each following byte are ignored (the spec says they must be 10). Since data-wise they are ignored, one could encode UTF-8 data improperly and still have it be decodable by a non-validating UTF-8 decoder -- but the same string might have an embedded NUL and cause problems.

I think people got the idea that Java was UTF-16 because they didn't have to call a conversion routine on Windows -- but that's because the version for Windows was built to do the conversion automatically. Astara Athenea (talk) 22:29, 23 January 2012 (UTC)


 * "Current documentation states it uses UTF-8 internally, with 1 exception" WHICH documentation do you think says that? The documentation for java.lang.String clearly states "A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String." It is true that UTF-8 based formats are used for strings in some circumstances (serialisation, classfile formats, etc.) but the language's core string type has always been based on a sequence of 16-bit quantities. When Unicode was 16-bit these represented Unicode characters directly; now they represent UTF-16 code units. Plugwash (talk) 23:27, 23 January 2012 (UTC)

Sorry for the delay -- didn't see your Q. Cited the text from the java documentation (http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.4.7), the paragraph I quoted has been updated. It now lists two differences between standard UTF-8 and the JavaVM's format:


 * There are two differences between this format and the "standard" UTF-8 format. First, the null character is encoded using the 2-byte format rather than the 1-byte format, so that modified UTF-8 strings never have embedded nulls. Second, only the 1-byte, 2-byte, and 3-byte formats of standard UTF-8 are used. The Java virtual machine does not recognize the four-byte format of standard UTF-8; it uses its own two-times-three-byte format instead.
 * For more information regarding the standard UTF-8 format, see Section 3.9 Unicode Encoding Forms of The Unicode Standard, Version 6.0.0.

Astara Athenea (talk) 19:07, 24 September 2012 (UTC)
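The two quoted differences can be illustrated with a small Python sketch (a hypothetical `to_modified_utf8` helper written for this discussion, not Java's actual code):

```python
def to_modified_utf8(s):
    """Sketch of the JVM's "modified UTF-8": NUL uses the 2-byte (overlong)
    form, and supplementary characters become two 3-byte sequences, one per
    UTF-16 surrogate half (the "two-times-three-byte" format)."""
    out = bytearray()
    for ch in s:
        cp = ord(ch)
        if cp == 0:
            out += b'\xc0\x80'            # 2-byte NUL: no embedded zero byte
        elif cp < 0x10000:
            out += ch.encode('utf-8')     # BMP: same as standard UTF-8
        else:
            v = cp - 0x10000              # encode each surrogate half as 3 bytes
            for unit in (0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)):
                out += chr(unit).encode('utf-8', 'surrogatepass')
    return bytes(out)

print(to_modified_utf8('\x00').hex())        # c080
print(to_modified_utf8('\U0001D11E').hex())  # eda0b4edb49e (6 bytes, not standard UTF-8's 4)
```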

Windows: UCS-2 vs UTF-16
UTF-16 is the native internal representation of text in the Microsoft Windows NT/2000/XP/CE

Older Windows NT systems (prior to Windows 2000) only support UCS-2

That sounds like a contradiction. Besides, this blog indicates UTF-16 wasn't really supported by Windows until XP.

--Kokoro9 (talk) 12:44, 30 January 2008 (UTC)
 * I think surrogate support could be enabled in 2K but I'm not positive on that. Also IIRC even XP doesn't have surrogate support enabled by default. As with Java, Windows uses 16-bit Unicode quantities, but whether surrogates are supported depends on the version and the settings. The question is how best to express that succinctly in the introduction. Plugwash (talk) 13:05, 30 January 2008 (UTC)


 * I've found this: "Note: Windows 2000 introduced support for basic input, output, and simple sorting of supplementary characters. However, not all system components are compatible with supplementary characters. Also, supplementary characters are not supported in Windows 95/98/Me." "If you are developing a font or IME provider, note that pre-Windows XP operating systems disable supplementary character support by default. Windows XP and later systems enable supplementary characters by default." (Source)
 * That seems to indicate Windows 2000 supports UTF-16 at some level. On the other hand, I think NT should be removed from the list of UTF-16 supporting OSs.--Kokoro9 (talk) 17:38, 30 January 2008 (UTC)

NT is a somewhat generic term that can refer to either the Windows NT series (3.51, 4.0, 2000, XP, etc.) or to specific versions that used NT in the product name (pretty much just 3.51 and 4.0). NT 3.51 and 4.0 were based on UCS-2. It would probably be more accurate to leave the "NT" out entirely, since the "2000" and "XP" stand on their own (not to mention that NT does not apply to CE). —Preceding unsigned comment added by 24.16.241.70 (talk) 09:37, 19 April 2010 (UTC)


 * Simply use NT-platform or NT-based and you'll be fine. But especially for a topic like UCS-2 versus UTF-16 it is of utmost importance to distinguish Windows versions. Windows XP, to my knowledge, introduced full Unicode support. Assarbad (talk) 09:58, 15 October 2015 (UTC)

Does what Windows calls Unicode include a BOM? Or is the endianness implicit?--87.162.6.159 (talk) 19:35, 2 May 2010 (UTC)


 * Windows will guess that a file is UTF-16LE if there is no BOM. However this is now considered a bug (see bush hid the facts), and the lack of a BOM should instead indicate UTF-8 or a legacy encoding (which are easy to distinguish). Most Windows software inserts a BOM into all UTF-16 files. Spitzak (talk) 17:31, 3 May 2010 (UTC)


 * To be picky, the Windows OS itself doesn't usually interpret the BOM at all. Windows APIs generally accept UTF-16 parameters and the application is expected to figure out what encoding to use when reading text data. The BOM only applies to unmarked Unicode text, which the OS rarely deals with directly. XML files are an exception (the OS does process XML files directly), but the Unicode processing semantics for XML files is fairly well-specified. Individual applications on the Windows platform (including those that come with Windows such as Notepad) may have specific methods for detecting the encoding of a text file. It is probably most common to see applications save text files encoded in the system's default ANSI code page (not Unicode). In particular, Notepad will check for a BOM, and if there is no BOM, it will use a heuristic to try to guess the encoding. The heuristic has been known to make somewhat silly mistakes (incorrectly treating some short ANSI files as UTF-16, making them show up as nonsense Chinese characters), so the algorithm has been adjusted in recent versions of Windows to try to reduce the chance of this type of mistake. —Preceding unsigned comment added by 24.16.241.70 (talk) 07:38, 8 June 2010 (UTC)
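The BOM check that applications such as Notepad start with can be sketched as follows (a hypothetical `sniff_utf16` helper; real detectors fall back to further heuristics or the ANSI code page when no BOM is found):

```python
def sniff_utf16(data):
    """Return the UTF-16 flavor indicated by a leading BOM, if any."""
    if data.startswith(b'\xff\xfe'):
        return 'utf-16-le'
    if data.startswith(b'\xfe\xff'):
        return 'utf-16-be'
    return None   # no BOM: apply heuristics or assume a legacy encoding

print(sniff_utf16('\ufeffHi'.encode('utf-16-le')))  # utf-16-le
print(sniff_utf16(b'Hi'))                           # None
```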

UCS-2 use in SMS deserves mention
By far most SMS messages are encoded in either the 7-bit GSM default alphabet or UCS-2, especially in GSM. In CDMA other encodings are also supported, like a Japanese encoding and a Korean encoding, but those are minority shares.

Also see: Short message service — Preceding unsigned comment added by 92.70.2.16 (talk) 15:02, 23 July 2008 (UTC)


 * As of ETSI TS 123 038 V16.0.0 (2020-07) (3GPP TS 23.038 version 16.0.0 Release 16), which appears, from this 3GPP listing of the 23.XXX specifications, to be the latest revision, UCS-2 ("UCS2") appears to be the only flavor of Unicode-like encodings supported. However, this page from Twilio claims that:
 * "These differences turn out not to matter in practice, because due to the lack of support for the UCS-2 encoding, in modern programming languages smartphones tend to just decode UCS-2 messages as UTF-16 Big Endian. This is good news, because it means in practice we can send non-BMP characters, such as Emoji characters, over SMS, despite the fact that the spec doesn't strictly allow it."


 * and the same Google search that found that seems to have found a bunch of other pages that speak of "UCS-2" and "UTF-16" in SMS as interchangeable, so perhaps, in practice, UTF-16 is used for at least some messages, even though the specification doesn't support it. Guy Harris (talk) 03:38, 11 February 2022 (UTC)

UTF-16 and Windows
In Windows 7, UTF-16 is still only supported for a small set of functions (i.e. IME and some controls). It is not supported for filenames or most API functions. The SDK documentation does not speak of UTF-16; it says just "Unicode" and means UCS-2. The sources in the section "Use in major operating systems and environments" say just this, but the sentence tells us it supports UTF-16 natively, which is wrong. 217.235.6.183 (talk) 23:45, 29 January 2010 (UTC)


 * You probably think that if a non-BMP character returns 2 for wcslen then "UTF-16 is not supported". In fact this is the CORRECT answer. The ONLY place "support" is needed is in the display of strings, at which point you need to consider more than one word in order to choose the right glyph. Note that to display even BMP characters correctly you need to consider more than one word, due to combining characters, kerning, etc.


 * Anybody claiming that a string length function should return a value other than 2 for a non-BMP character in UTF-16 (or any value other than 4 for a non-BMP character in UTF-8) is WRONG. Go learn -- like, actually try to write some software -- so you can know that measuring memory in anything other than fixed-size units is USELESS. The fact that your ancient documentation says "characters" is because it was written when text was ASCII-only and bytes and "characters" were the same thing. Your misconception is KILLING I18N with obscenely complex and useless APIs, and this needs to be stopped!!! Spitzak (talk) 04:37, 30 January 2010 (UTC)


 * Please keep the discussion to the content and don't assume my programming skills. Have you bothered to read the sources given in this section? Try to pass a filename with surrogates to a Windows 7 API function... 217.235.27.234 (talk) 14:41, 30 January 2010 (UTC)


 * Quick check seems to reveal that Windows on FAT32 and NTFS will create files with surrogate halves in their names using CreateFile, and returns the same strings when the directory is listed. It is true that a lot of I/O does not render them correctly (i.e. it renders them as 2 UCS-2 characters) but I fully expected that and do not consider it a failure. Exactly what are you saying it does wrong? —Preceding unsigned comment added by Spitzak (talk • contribs) 06:06, 3 February 2010 (UTC)


 * You're also welcome to write software that will decode the names of all the files in a directory as floating point numbers. There seems to be some confusion here between character sets and encodings. The concept applicable to specifying the rules for legal file names on an OS or file storage system is the character set. How a file's name is represented in the directory involves an encoding, of course, but at the API level you're dealing with characters from a character set. —Largo Plazo (talk) 02:36, 27 March 2010 (UTC)

From how I read your response and the remark you responded to, you're talking about the same thing exactly, but using different terms. I presume by character set you mean the sum of possible characters, which take the visual form of glyphs when displayed, right? The encoding, however, provides the code points (to stick with Unicode terminology) assigned to each character. So the use case is storage (not just on disk) versus display, as you pointed out correctly. But I don't follow on the last statement. Why would an API taking a file name care about the character? Isn't it sufficient at this layer to count the number of, say, 16-bit unsigned words after normalization? The important aspect being: after normalization. Assarbad (talk) 10:18, 15 October 2015 (UTC)
 * By character set you mean the abstract concept of characters, right? Assigned to a given code point using an encoding and visualized through glyphs, or ...?

Different parts of Windows have different levels of support for Unicode. The file system essentially ignores the encoding of the file names. In fact, the file system pretty much only cares that the file names don't use null, back-slash, or forward-slash (maybe colon). Higher-level parts of the OS add checks for other things like *, ?, |, etc. In addition, the file system has only a very simple concept of uppercase/lowercase that does not change from locale to locale. All of these limitations are due to the fact that the file system is solving a specific problem (putting a bunch of bits on the disk with a name that is another bunch of bits -- any higher meaning for the name or contents of a file is up to the user). At the other extreme, the Uniscribe rendering engine does all kinds of tricks to make things like Hebrew and Arabic come out right (combining ligatures, bidirectional text rendering, etc.). To make a long story short, the parts of Windows that actually need to deal with UTF-16 can correctly deal with it. Other parts don't care, and simply store and transmit 16-bit values. To me, that sounds like the right way to deal with it. So I think the page is fully accurate in indicating that Windows supports UTF-16. —Preceding unsigned comment added by 24.16.241.70 (talk) 09:30, 19 April 2010 (UTC)

Can someone provide a citation or example as to how one can create invalid UTF-8 registry or file names, and how that matters to the API itself? Neither the filesystem nor the registry does anything in UTF-8, and they don't use UTF-8 in any way, shape, or form. Therefore I don't see how it needs even enter the discussion (nor can I see how Windows could or would do anything incorrectly here). If nobody can source this, then that content should be removed. Billyoneal (talk) 22:09, 9 December 2010 (UTC)


 * A malicious program can actually change the bytes on the disk and make the registry entry names be invalid UTF-8. Apparently there are ways to achieve this other than writing bytes on the disk, but I don't know them. The Windows UTF-16 to UTF-8 translator is unable to produce these byte arrangements, and is therefore unable to modify or delete those registry entries. Spitzak (talk) 18:49, 10 December 2010 (UTC)


 * I think you are confusing some things here. Both the regular and the Native NT API refer to the registry only in terms of UTF-16 strings, any UTF-16 specific issues would not cause a difference between the APIs.  There is a well-known trick where the native API allows key names that include the null (U+0000) char but the Win32 API and GUI tools do not, thus allowing the creation of hard to access/delete keys, however this has nothing to do with UTF-16.  Finally manipulating the undocumented on-disk registry file format can create all kinds of nonsense that the kernel cannot deal with, if that file format uses UTF-8 for some strings (a quick check shows that it might) creating an invalid file with invalid or non-canonical UTF-8 could cause problems, but they would not be related to UTF-16. 94.145.87.181 (talk) 00:03, 26 December 2010 (UTC)

First, up to and including Windows XP, removing a non-BMP character from the system edit control required pressing backspace twice. So it wasn't fully supported in Windows XP. This bug was fixed in Vista or Windows 7, though.

Second, it isn't fully supported even in Windows 7. Two examples I can come up with right now:
 * Writing UCS-2 to the console is supported if the font has the appropriate characters, but UTF-16 non-BMP characters are displayed as two 'invalid' characters (two squares, two question marks, whatever...) independently of the font.
 * You can't find out which non-BMP characters are available in some font. GetFontUnicodeRanges, which is supposed to do this, returns the ranges as 16-bit values.

I'm sure there are more. 82.166.249.66 (talk) 17:46, 26 December 2011 (UTC)

First reference to apache xalan UTF-16 surrogate pair bug
The first reference to the apache xalan bug seems broken to me. I'm very interested in that particular problem, could someone fix the link? What was it all about? What was the problem? It is really unclear how it is right now. —Preceding unsigned comment added by 160.85.232.196 (talk) 12:52, 20 April 2010 (UTC) Never mind, found the bug; the URL was missing an ampersand (&). Fixed it.

Document changes for lead and first half
The reorganization and changes attempted to improve the following points. Comments or additional suggestions are welcome. StephenMacmanus (talk) 12:00, 13 November 2010 (UTC)
 * The lede for the article only described UTF-16. Since it is now a joint article about both UTF-16 and UCS-2, it should include both methods.
 * The first paragraph used far too much jargon, in my opinion. Immediately using Unicode terms like "BMP" and "plane" and so forth without more background doesn't illuminate the topic. I moved most of these terms to later in the document and provided enlarged descriptions.
 * The encoding section of the article only described the method for larger values from U+10000 and up. The simple mapping of UCS-2 and UTF-16 for the original 16-bit values was never mentioned.
 * The section about byte order encoding had several issues. I think these occurred more from confusing writing than actual errors.
 * First, it implied the byte order issue was related to the use of 4-byte values in UTF-16.
 * Second, the explanation about the BOM value reverses the cause and effect. The value was chosen for other reasons, then designated as an invisible character because it was already the BOM, to hide it when not discarded.
 * Next, the discussion recommends various behaviors for using this feature, which isn't appropriate for an encyclopedia article.
 * Finally, it doesn't clearly explain the purpose of the encoding schemes, since it misleadingly states that endianness is "implicit" when it is part of the label. I think the original meaning referred to the absence of a BOM in the file for those cases, but it wasn't clear.


 * I do not think these changes are an improvement.
 * The lead indeed talked about both UTF-16 and UCS-2, in two paragraphs, rather than mangling them together.
 * References to documents and all history removed from lead.
 * Obviously not all 65536 UCS-2 characters are the same, the values for surrogate halves changed. Later paragraph saying that only "assigned" characters match is more accurate.
 * UTF-16BE/LE require that the leading BOM be treated as the original nbsp, not "ignored". Endianness definitely is "implicit", and a backwards BOM in theory should be the character 0xFFFE (though it might signal software that something is wrong).
 * If there is documentation that the use as BOM was chosen before the nbsp behavior, it should have a reference, and add it to the BOM page as well.
 * Errors in software rarely require "conversion"; it is in fact disagreement between software about character counts that causes bugs. The old description of errors was better.
 * Description of encoding is incredibly long and confusing. Previous math-based one was much better. Don't talk about "planes", please, saying "subtract 0x10000" is about a million times clearer than "lower the plane number by 1".

Spitzak (talk) 20:27, 15 November 2010 (UTC)


 * StephenMacmanus raises some good points, but I agree with Spitzak that the previous version was much easier to read. I prefer the older lede to the new one, and the older discussion of the encoding for characters beyond BMP is better than the new one.  "BMP" is in fact defined before use in the old version. -- Elphion (talk) 02:50, 16 November 2010 (UTC)

Python UTF-8 decoder?
The article says 'the UTF-8 decoder to "Unicode" produces correct UTF-16' - that seems to be either wrong or something that could use a longer explanation. Was 'the UTF-16 encoder from "Unicode" produces correct UTF-16' meant? Kiilerix (talk) 15:05, 30 July 2011 (UTC)


 * When compiled for 16-bit strings, a "Unicode" string cannot possibly contain actual Unicode code points, since there are more than 2^16 of them. Experiments with the converter from UTF-8 to "Unicode" reveal that a 4-byte UTF-8 sequence representing a character > U+FFFF turns into two "characters" in the "Unicode" string that correctly match the UTF-16 code units for this character. However I believe in most cases Python "Unicode" strings are really UCS-2; in particular the simple encode API means that encoders cannot look at more than one code unit at a time. On Unix where Python is compiled for 32-bit strings, the encoder from "Unicode" to UTF-16 does in fact produce surrogate pairs. Spitzak (talk) 20:41, 1 August 2011 (UTC)
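On current Python 3 (PEP 393 flexible strings; the narrow/wide build distinction is gone), the behavior described above can be checked directly -- len() counts code points, while the UTF-16 encoder emits surrogate pairs (a quick sketch):

```python
s = '\U0001D11E'                     # musical G clef, a non-BMP character
print(len(s))                        # 1 code point (a narrow build reported 2)
print(s.encode('utf-16-be').hex())   # d834dd1e -- a surrogate pair on the wire
```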

UTF-16 considered harmful
An IP editor added a good-faith blurb to the lead saying that "experts" recommend not using UTF-16. When I reverted that essentially unsourced claim, 85.etc responded with another good-faith addition to the lead with the following edit comment: ''Better wording. If you guys really care about Unicode please do not just remove this text if you do not like it. Instead try to make it suitable in Wikipedia so we can let this message get to people.''

Although the newer blurb is an improvement -- it talks about a specific group that makes this recommendation, rather than the nebulous "experts" -- it provides no evidence that this group is a Reliable Source, a concept that lies at the root of WP. The opinion expressed by the group remains an unauthoritative one, and in any event it does not belong in the lead; so I have reverted this one too.

Moreover, the IP is under the misapprehension that WP's purpose is to "let this message get to people". WP is not a forum, and not a soap box: advocacy is not the purpose of these articles. There are good reasons for using UTF-16 in some circumstances, good reasons for using UTF-8 in others. Discussing those is fair game, and we already link to an article comparing the encodings. But WP is not the place to push one over the other.

And we do care about Unicode. It is in no danger of being replaced. The availability of three major encodings suitable in various ways to various tasks is a strength, not a weakness. Most of what the manifesto (their word) at UTF-8 Everywhere rails against is the incredibly broken implementation of Unicode in Windows. They have some reasonable advice for dealing with that, but it is not a cogent argument against using UTF-16. The choice of encoding is best left to the application developers, who have a better understanding of the problem at hand.

-- Elphion (talk) 21:07, 15 May 2012 (UTC)

I believe you didn't actually read it, since the manifesto doesn't say much about broken implementation of Unicode in Windows. You must be confusing it with the "UTF-16 considered harmful" topic on SE. What the manifesto argues is that the diversity of different encodings, in particular five (UTF-16/32 come in two flavors) major encodings for Unicode, is a weakness. Especially for multi-platform code. In fact, how is it a strength? Having different interfaces is never a strength, it's a headache (think of multiple AC power plugs, multiple USB connectors, differences in power grid voltages and frequencies, etc...) bungalo (talk) 12:36, 22 June 2012 (UTC)


 * I am never particularly impressed by the argument that "you did not agree with me, so you must not have read what I wrote". I did read the manifesto and, except for the cogent advice for dealing with Windows, found it full of nonsense, on a par with "having different interfaces is never a strength" -- an argument that taken to its logical conclusion would have us all using the same computer language, or all speaking English, as Eleanor Roosevelt once proposed in a fit of cultural blindness.  A healthy, robust program (or culture) deals with the situation it finds, rather than trying to make the world over in its own image. -- Elphion (talk) 13:35, 22 June 2012 (UTC)


 * Excuse me, where have I said "you did not agree with me, so you must not have read what I wrote"? Don't put words in my mouth that I haven't said.
 * I said that your claim that "Most of what the manifesto [...] rails against is the incredibly broken implementation of Unicode in Windows" is totally false, since there is only one item that talks about this ("UTF-16 is often misused as a fixed-width encoding, even by Windows native programs themselves ..."). Claiming that one sentence is the majority of a ten-page essay is a distortion of reality, and assuming you are a rational person (and not a troll), the only explanation I could come up with is that you didn't actually read the whole thing. I would be happy if it turns out I was mistaken, but in that case you'll have to take back your words.
 * "found it full of nonsense"
 * Heh? Which nonsense? I find it factually correct. I would appreciate it if you could write some constructive criticism so you (or I) could forward it to the authors.
 * "an argument that taken to its logical conclusion would have us all using the same computer language"
 * The implication is false, because computer languages are not interfaces.
 * "or all speaking English"
 * Yet one more hasty conclusion. On the global scale, English is the international language, and for a good reason: not everyone who wants to communicate with the rest of the world can learn all the languages out there, so there is a silent agreement that there must be *some* international language. Granted, the choice of the international language is subject to dispute, but that's irrelevant to the fact that some language must be chosen. The same applies on the local scale: people of each culture communicate in some specific language. No communication would be possible if each person spoke her own invented language.
 * "A healthy, robust program (or culture) deals with the situation it finds"
 * This is exactly what the manifesto proposes—a way to deal with the current situation of the diversity of encodings.
 * bungalo (talk) 14:34, 22 June 2012 (UTC)
 * Note: Btw, I'm not that anonymous editor and I'm not protecting her edit. bungalo (talk) 14:38, 22 June 2012 (UTC)

Ahem: "I believe you didn't actually read it." (Was that a "hasty conclusion" too?) The clear implication is that, since I didn't take from the article the same conclusion you did, I must not have read it. This is never a good argument. If that's not what you meant, then don't write it that way. I have in fact read the article, and did so when considering 85.etc's edit.

The manifesto is not a bad article, but it is informed from start to finish by the broken implementation of wide characters in Windows (and I agree about that). It covers this in far more than "one item", and the shadow of dealing with Windows lies over the entire article and many of its recommendations. I have no problem with that. But the article does more than suggest "a way to deal with the current situation of the diversity of encodings" -- it has a clear agenda, recommending in almost religious terms (consider even the title) that you shun encodings other than UTF-8. The arguments it advances for that are not strong, especially since there is well-vetted open source code for handling all the encodings.

And computer languages certainly are interfaces, ones I am particularly grateful for. Dealing with computers in 0s and 1s is not my forte.

-- Elphion (talk) 16:21, 22 June 2012 (UTC)

Difficulty of converting from UCS-2 to UTF-16
I removed a sentence that was added saying "all the code that handles text has to be changed to handle variable-length characters".

This is not true, and this misconception is far more common when discussing UTF-8, where it is equally untrue and has led to a great deal of wasted effort converting UTF-8 to other forms to "avoid" the actually non-existent problem.

As far as I can tell, the chief misconception is that strlen or indexing will somehow "not work" if the index is not a "number of characters". And for some reason the obvious solution of indexing using the fixed-size code units is dismissed as though it violates the laws of physics.

There seems to be the idea that somehow having a number that could point into the "middle" of a letter is some horrendous problem, perhaps causing your computer to explode or maybe the universe to end. Most programmers would consider it pretty trivial and obvious to write code to "find the next word" or "find the Nth word" or "find the number of words in this string" while using an indexing scheme that allows them to point into the "middle" of a word. But for some reason they turn into complete morons when presented with UTF-8 or UTF-16 and literally believe that it is impossible.

The other misconception is that somehow this "number of characters" is so very useful and important that it makes sense to rewrite everything so that strlen returns this value. This is horribly misguided. This value is ambiguous (when you start working with combining characters, invisible ones, and ambiguous ideas about characters in various languages) and is really useless for anything. It certainly does not determine the length of a displayed string, unless you restrict the character set so much that you are certainly not using non-BMP characters anyway. The number of code units, however, is very useful, as it is directly translated into the amount of memory needed to store the string.

In fact on Windows all equivalents of strlen return the number of code units. The changes were limited to code for rendering strings, and fixes to text editor widgets. The Windows file system was updated to store UTF-16 from UCS-2 with *NO* changes, and it is quite possible to create files with invalid UTF-16 names and use them.

This same observation also applies to the use of UTF-8. Here the misconception that "everything has to be rewritten" is even more pervasive. I don't know how to stop it, but I and obviously several others have to continually edit to keep this misinformation out of Wikipedia. Spitzak (talk) 23:25, 23 September 2012 (UTC)
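Spitzak's claim that code-unit indexing handles variable-length encoding without trouble can be sketched in a few lines of Python (the helper name code_point_at is made up for this illustration; the code units are plain ints):

```python
def code_point_at(units, i):
    """Decode the code point starting at index i of a list of UTF-16
    code units, returning (code_point, next_index).  A lone surrogate
    is returned as-is, so an index pointing into the "middle" of a
    surrogate pair is harmless: the caller just gets that unit."""
    u = units[i]
    if 0xD800 <= u <= 0xDBFF and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
        cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
        return cp, i + 2
    return u, i + 1

# "a" followed by U+10400 (a supplementary-plane character)
units = [0x0061, 0xD801, 0xDC00]
assert code_point_at(units, 0) == (0x0061, 1)
assert code_point_at(units, 1) == (0x10400, 3)
```

Note that len(units), the code-unit count, is exactly the quantity Spitzak describes as useful: it translates directly into storage size, with no need for a "number of characters".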

Encoding of D800-DFFF code points in UTF-16
The article says, "It is possible to unambiguously encode them [i.e. code points D800-DFFF] in UTF-16, as long as they are not in a lead + trail pair, by using a code unit equal to the code point."

This doesn't make sense to me. It seems to be saying that you just drop in a code unit like D800 or DC00 and software should understand that you intend it to be a standalone code point rather than part of a surrogate pair. I suppose if this results in an invalid code pair being formed (such as a lead unit without a valid trail unit following), then the decoder could fall back to treating it as a standalone code unit, but what if you want to encode point D800 followed immediately by point DC00? Wouldn't any decoder treat the sequence D800 DC00 as a valid surrogate pair rather than as a pair of illegal code points? If so, then the statement that UTF-16 can "unambiguously encode them" (i.e. all values in the range D800-DFFF) is not true.

If I have misunderstood the point here, please clarify it in the article, because it isn't making sense to me the way it's written. — Preceding unsigned comment added by 69.25.143.32 (talk) 18:09, 4 October 2012 (UTC)

Never mind, I figured it out. "as long as they are not in a lead + trail pair" is intended to mean "as long as you don't have a D800-DBFF unit followed by a DC00-DFFF unit" (which would be interpreted as a surrogate pair). I had previously interpreted "as long as it doesn't look like a legal surrogate pair" to mean "as long as you don't try to shoehorn the illegal code point into a 20-bit value and package it up as a surrogate pair", which wouldn't work. I'll adjust the article's wording to be more clear. — Preceding unsigned comment added by 69.25.143.32 (talk) 20:03, 4 October 2012 (UTC)
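The ambiguity worked out above is easy to demonstrate in Python, whose surrogatepass error handler permits lone surrogates in encoding and decoding (a sketch):

```python
# Two lone surrogates, U+D800 followed by U+DC00, encoded back to back.
data = "\ud800\udc00".encode("utf-16-le", "surrogatepass")
assert data == b"\x00\xd8\x00\xdc"

# The decoder cannot tell this from a real surrogate pair, so the
# two lone surrogates come back as the single code point U+10000.
decoded = data.decode("utf-16-le", "surrogatepass")
assert decoded == "\U00010000"
```

This is exactly the case the article's caveat ("as long as they are not in a lead + trail pair") excludes: that particular sequence does not survive a round trip.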

1st reference
The 1st reference (math) is not actually a reference, it should be a note instead. — Preceding unsigned comment added by Tharos (talk • contribs) 14:23, 23 August 2013 (UTC)

UTF-16 transformation algorithm
In my revert of Zilkane's edits, I got one too many 0s in the edit comment. It should read: and "0x100000" is incorrect. The point is that the value subtracted from the code point is 0x10000 (with 4 zeros), not 0x100000 (with 5 zeros). This converts a value in the range 0x1,0000..0x10,FFFF (the code points above BMP) monotonically into a value in the contiguous 20-bit range 0x0,0000..0xF,FFFF, which is then divided into two 10-bit values to turn into surrogates. -- Elphion (talk) 16:23, 6 September 2013 (UTC)
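The arithmetic Elphion describes (subtract 0x10000, split the resulting 20-bit value into two 10-bit halves) can be written out as a short sketch; the function name to_surrogates is made up for this illustration:

```python
def to_surrogates(cp):
    """Map a supplementary code point (U+10000..U+10FFFF) to its
    UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000                  # contiguous 20-bit range 0x00000..0xFFFFF
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

assert to_surrogates(0x10437) == (0xD801, 0xDC37)  # U+10437, a common example
assert to_surrogates(0x10FFFF) == (0xDBFF, 0xDFFF) # highest code point
```

Subtracting 0x100000 (five zeros) instead would make even U+10000 map to a negative value, which is why the extra zero in the edit comment was wrong.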


 * I want to edit the UTF-16 convert formula:
 * U' = yyyyyyyyyyxxxxxxxxxx // U - 0x10000
 * W1 = 110110yyyyyyyyyy // 0xD800 + yyyyyyyyyy
 * W2 = 110111xxxxxxxxxx // 0xDC00 + xxxxxxxxxx
 * to
 * U ∈ [U+010000, U+10FFFF]
 * U' = ⑲⑱⑰⑯⑮⑭⑬⑫⑪⑩ ⑨⑧⑦⑥⑤④③②①⓪ // U - 0x10000
 * W₁ = 110110⑲⑱ ⑰⑯⑮⑭⑬⑫⑪⑩           // 0xD800 + ⑲⑱⑰⑯⑮⑭⑬⑫⑪⑩
 * W₂ = 110111⑨⑧ ⑦⑥⑤④③②①⓪           // 0xDC00 + ⑨⑧⑦⑥⑤④③②①⓪
 * what do you think? Diguage (talk) 14:36, 8 May 2022 (UTC)


 * I find the current version more readable, though it could use some highlighting and spacing and/or punctuation (e.g. with vertical bars). -- Elphion (talk) 17:08, 8 May 2022 (UTC)
 * This formula is misleading at best. The yyyyyyyyyy value in its first appearance is a different string of bits than in the second and third appearances. 85.250.226.28 (talk) 21:54, 29 June 2023 (UTC)
 * No, yyyyyyyyyy signifies the same 10-bit sequence in all appearances. -- Elphion (talk) 14:19, 30 June 2023 (UTC)

What does "U+" mean?
In the section title "Code points U+0000 to U+D7FF and U+E000 to U+FFFF" and in the text therein, example: "The first plane (code points U+0000 to U+FFFF) contains...", a "U+" notation is used. This notation is also encountered in many other places, both inside and outside Wikipedia. I've never seen a meaning assigned. What does "U+" mean? - MarkFilipak (talk) 16:30, 22 July 2014 (UTC)


 * "U+" is the standard prefix to indicate that what follows is the hexadecimal representation of a Unicode codepoint, i.e., the hex number of that "character" in Unicode. See Unicode. -- Elphion (talk) 17:02, 22 July 2014 (UTC)
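Elphion's description can be illustrated with a small helper (the function name u_plus is made up for this sketch); the convention pads the hex number to at least four digits:

```python
def u_plus(ch):
    """Format a character's code point in U+XXXX notation (at least
    four uppercase hex digits, per Unicode convention)."""
    return f"U+{ord(ch):04X}"

assert u_plus("A") == "U+0041"
assert u_plus(chr(0x1F600)) == "U+1F600"  # points above U+FFFF use more digits
```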

This article is incoherent. It needs a complete revision.
The lede talks mostly about UCS-2. UCS-2 is OBSOLETE! Perhaps the article means UCS as described in the latest ISO 10646 revision??? I do NOT have a solid understanding of UTF-16, but if my math is correct: 0x0000 to 0xD7FF plus 0xE000 to 0xFFFF is 55,296 + 8,192 = 63,488. Doesn't this mean that there are 2,048 code points from the BMP which are NOT addressable in UTF-16? (65,536 − 63,488 = 2,048.)

In the description it says: "The code points in each plane have the hexadecimal values xx0000 to xxFFFF, where xx is a hex value from 00 to 10, signifying the plane to which the values belong." Code points have "values" different from their code point values?!?! Wow, somebody really is inarticulate.

Let's agree that any binary representation of a "character" depends on some system of rules, i.e. a standard. We should also (maybe? I hope!) be able to agree that "code page" is ambiguous and IMPLEMENTATION DEPENDENT (I can't speak to whether the term exists in any of the ISO/Unicode standards, but it is clear that different standards (ANSI, Microsoft, DBCS, etc.) define "code page" differently). So using this common (but poorly defined) term in this article needs to be done with much more caution.

I really don't understand why UCS is used as the basis of comparison for this article on UTF-16. Shouldn't Unicode 6.x be THE basis? I would suggest that if there is a need for an article on UCS, then there should be an article on UCS, or at least a comparison article between it and Unicode. Afaics, this article lacks any mention of non-character Unicode code points, as well as text directionality, composition, glyphs, and, importantly, graphemes. If UTF-16 is similar to UCS, then this is a serious deficiency. It also seems to me that Microsoft is a huge part of the reason UTF-16 persists, and how it is implemented in various MS products and platforms should be addressed here. Also, I see virtually NOTHING on UTF-8, which is (it seems) much more common on the internet. 
All-in-all, this article does not present a clear picture of what "standard" UTF-16 is, and what the difference is between that and what its common implementations do. — Preceding unsigned comment added by 72.172.1.40 (talk) 20:15, 28 September 2014 (UTC)


 * I have already tried to delete any mention of "planes" but it got reverted, so I gave up. I agree that planes are irrelevant to describing UTF-16. You are correct that there are 2048 points that can't be put in UTF-16 and this causes more trouble than some people are willing to admit because they don't want to admit the design is stupid.Spitzak (talk) 21:45, 28 September 2014 (UTC)

Example for 軔 is wrong
The last example seems to be wrong. It lists the symbol "軔", which is the codepoint U+8ED4 (JavaScript: "軔".charCodeAt(0).toString(16) and Python 3: hex(ord("軔")) agree), but the example says it is the codepoint U+2F9DE. It looks similar, but it isn't the same codepoint.

panzi (talk) 23:48, 25 October 2014 (UTC)


 * You're quite right. U+2F9DE is a compatibility ideograph that decomposes to U+8ED4 軔 on normalization. As any software might (and presumably has in this case) silently apply normalization and therefore convert U+2F9DE to U+8ED4 it would be best not to use a compatibility ideograph for the example, but use a unified ideograph. BabelStone (talk) 12:32, 26 October 2014 (UTC)


 * OK. I just grabbed something from our Wikibooks UNICODE pages in the range of bit patterns I wanted to demonstrate. Thanks for catching the wrong 0xD bit pattern, too! —[ Alan M 1 (talk) ]— 02:40, 29 October 2014 (UTC)
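BabelStone's explanation, that U+2F9DE is a compatibility ideograph which normalization silently turns into U+8ED4, can be checked with Python's unicodedata module (a sketch):

```python
import unicodedata

# U+2F9DE is a CJK compatibility ideograph with a singleton canonical
# decomposition to U+8ED4, so even NFC normalization replaces it.
compat = chr(0x2F9DE)
assert unicodedata.normalize("NFC", compat) == chr(0x8ED4)
```

This makes it plausible that the original example's U+2F9DE was silently normalized to U+8ED4 somewhere along the editing pipeline.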

Use of + and −
The formulas use the two signs + and −. While the latter is used in the correct arithmetic sense, the former is apparently used in the sense of a text concatenation. (Or in the sense of some preschool kids who think that 1+2=12.) Should we use a dedicated operator from Concatenation, or use plain text? (Presumably, the intention might have been to emulate overloading, but that's very confusing here, where all operands can be seen as integers. Possibly there was some intention to express this through selective formatting, but that's quite idiosyncratic.) ◅ Sebastian 13:39, 11 August 2023 (UTC)
 * the former is apparently used in the sense of a text concatenation No, just ordinary addition. 0xD800 = 0b1101100000000000 and 0xDC00 = 0b1101110000000000, so the values to the left of the comment are the result of adding those two hex values to yyyyyyyyyy and xxxxxxxxxx, respectively. Guy Harris (talk) 20:10, 11 August 2023 (UTC)
 * You're right, thanks! I missed that the values without the 0x prefix are meant to be binary. That's already so in the source, but still it may not be the best presentation for an encyclopedic article, as it's not immediately obvious to all readers. ◅ Sebastian 14:58, 12 August 2023 (UTC)
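Guy Harris's point (that adding a 10-bit value to 0xD800 merely looks like concatenating bit strings, because the low ten bits of 0xD800 and 0xDC00 are all zero) can be shown in a couple of lines (a sketch):

```python
assert 0xD800 == 0b1101100000000000  # low 10 bits are zero
assert 0xDC00 == 0b1101110000000000  # likewise

yyyyyyyyyy = 0b1000000001            # an arbitrary 10-bit value
# Ordinary addition and bitwise OR ("concatenation" of the bit
# patterns) give the same result here:
assert 0xD800 + yyyyyyyyyy == 0xD800 | yyyyyyyyyy == 0b1101101000000001
```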

Why comments?
 Headline inserted 12 August 2023 

Oh, and what's the point of the double slashes? Presumably they stand for comments, but, with the rest of the formulas correct, the appropriate sign would simply be =. Note that the // are rather idiosyncratic as well, as they're not in the source. ◅ Sebastian 17:01, 11 August 2023 (UTC)
 * Yes, the point of the double slashes is to introduce comments, C++/C99-and-later-style. The comments are there to indicate the formula used to calculate the values to the left of the comments; the items to the left of the comments and the right of the equal signs are the bit representations of U', W1, and W2. And, yes, showing them as {name} = {bit representation} = {formula} might also work. Guy Harris (talk) 20:14, 11 August 2023 (UTC)
 * Of course double slashes are used by some languages for comments. But that's not the question here. My question is: what's the point? Why use an unintroduced syntax (which is probably not known to all readers) to “illustrate” something that can be expressed with a generally known sign like the equal sign? And even if every reader knew the meaning of //: if an explanation itself needs a (meta-) comment, then it's not a good explanation. ◅ Sebastian 14:14, 12 August 2023 (UTC)