Talk:Byte order mark

Misc
Detailed discussion of BOM does not add to understanding of endianness, and BOM can be taken as a separate concept, so I've moved it back to its own article.

It really was messy in the endianness article, especially as BOM has its own category links, external links, and the like.

--Pengo 00:52, 27 Oct 2004 (UTC)

edits by Cherlin
Some of these edits seem rather dodgy to me.

used-->misused : you claim that using the BOM to mark text as being in a utf- format is misuse, yet http://www.unicode.org/unicode/uni2book/ch13.pdf ("specials" section "Byte Order Mark (BOM)" heading) states that the byte sequence may be used to indicate both byte order and character set.

"contrary to its definition" : you claim that use of the BOM on utf-8 is contrary to its definition, yet http://www.unicode.org/unicode/uni2book/ch13.pdf ("specials" section "Byte Order Mark (BOM)" heading) says otherwise.

FF FE 00 00-->00 00 FF FE (already reverted) : encoding the code point FEFF in little endian utf-32 would give FF FE 00 00 as was in the original not 00 00 FF FE as your edit states. Furthermore the table that was there before your edit exactly corresponds to the information given in http://www.unicode.org/unicode/uni2book/ch13.pdf ("specials" section "Byte Order Mark (BOM)" heading)
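
The byte sequences in dispute can be checked directly. A minimal sketch, assuming Python 3.8 or later and its built-in codecs (an illustration only, not something the posts above used), that encodes U+FEFF in each UTF-32 byte order:

```python
# Encode the code point U+FEFF explicitly in each UTF-32 byte order.
BOM = "\ufeff"

utf32_le = BOM.encode("utf-32-le")  # least significant byte first
utf32_be = BOM.encode("utf-32-be")  # most significant byte first

print(utf32_le.hex(" ").upper())  # FF FE 00 00
print(utf32_be.hex(" ").upper())  # 00 00 FE FF
```

Neither result is 00 00 FF FE, the sequence introduced by the reverted edit.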

Unless I see good justification for these edits I will be reverting the two that I have not already reverted. Plugwash 16:13, 24 Dec 2004 (UTC)


 * It is now two days since you made the edits and you have not responded. Furthermore, I find you to be a very new contributor who has got into trouble elsewhere and made few other edits. I am therefore reverting the rest of the edits you made to this page. Plugwash 02:23, 27 Dec 2004 (UTC)

Concerning UTF-16 big endian vs little endian
I have noticed that the Python interpreter reverses the byte order of UTF-16 big endian and little endian as compared to what is actually in the Unicode standard when given invalid input. When Python's codecs module is used to read UTF-8 text in from a file and write UTF-16 text out to another file, and the original UTF-8 file begins with the non-character U+FFFE (encoded as EF BF BE), the non-character is accepted as if it were the byte order mark U+FEFF and the resulting UTF-16 file has the opposite byte order of what was requested. I observed this on multiple platforms and Python versions.

The point is, if you are having trouble with the byte order of UTF-16 text, check your libraries/tools for problems, and verify everything using hexadecimal viewers. You may find incorrect assumptions are being made in your tools or libraries.

Canistota (talk) 14:47, 12 March 2009 (UTC)

Canistota: It's not only python, the description of UTF-16LE and UTF-16BE is reversed/wrong at this page. The UTF16-LE BOM is \xfe\xff resp. "\376\377", the UTF16-BE BOM is \xff\xfe resp. "\377\376" if read bytewise. This can be observed with every tool accepting BOMs or iconv, but in the meantime there are several tools which took the reverse wikipedia BOM, thus have it wrong. — Preceding unsigned comment added by ReiniUrban (talk • contribs) 13:27, 27 October 2016 (UTC)
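
For reference when checking with a hex viewer, a minimal sketch (assuming Python 3.8 or later and its standard codecs module, used here only as an illustration and not as one of the tools named above) that prints the bytes U+FEFF produces in each UTF-16 byte order:

```python
import codecs

bom = "\ufeff"
le = bom.encode("utf-16-le")
be = bom.encode("utf-16-be")
print("UTF-16LE:", le.hex(" ").upper())  # UTF-16LE: FF FE
print("UTF-16BE:", be.hex(" ").upper())  # UTF-16BE: FE FF

# The standard library exposes the same byte strings as constants.
assert le == codecs.BOM_UTF16_LE
assert be == codecs.BOM_UTF16_BE

# U+FFFE is a noncharacter, not a BOM; in UTF-8 it is EF BF BE.
assert "\ufffe".encode("utf-8") == b"\xef\xbf\xbe"

# Read with the wrong byte order, a BOM decodes as U+FFFE.
assert le.decode("utf-16-be") == "\ufffe"
```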

Byte Order Mark in UTF-8
Does anyone know why Windows software likes to put a BOM at the front of UTF-8 files? Isn't it true that the order is unambiguous, and thus it does nothing for any endianness problems? Is it simply a way of flagging a file as containing UTF-8 instead of ASCII? -R. S. Shaw 23:38, 5 Jun 2005 (UTC)
 * Yeah, it's simply used to mark the file as being UTF-8 rather than the system's legacy encoding. Plugwash 00:25, 6 Jun 2005 (UTC)


 * Whenever you save a file as UTF-8 in Windows Notepad, the UTF-8 BOM is prepended to it. You can use a different editor (a non–Unicode-aware editor or a hex editor) to remove the BOM. If the file contains one or more legal UTF-8 sequences, and only legal UTF-8 sequences, then removing the BOM will have no effect on the file; it’ll still be UTF-8. If the file contains only ASCII and you remove the BOM, Notepad will flag it as ANSI (8-bit codepage mode). If the file contains a BOM and you insert an illegal sequence into it (like a single FF byte in the middle of the text, or C2 E4, etc), then the file will stay intact, but if it hasn’t got a BOM and you insert such a sequence, it’ll revert to ANSI, and legal UTF-8 sequences too will be viewed in Notepad according to the current Windows ANSI codepage semantics (for example CF 80 as Ï€ instead of π if you’re on a US WinXP). --Shlomital 22:33, 2005 Jun 11 (UTC)


 * On Czech WinXP it works the same. Notepad marks it with BOM for easier recognition of the encoding, but does not require it. It is an unexpectedly tolerant approach.
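
The CF 80 example above can be reproduced outside Notepad. A minimal sketch, assuming Python 3.7 or later and taking cp1252 as the "ANSI" code page of a US Windows system; guess_like_notepad is a hypothetical helper that mirrors the behaviour described in this thread, not Notepad's actual algorithm:

```python
import codecs

raw = b"\xcf\x80"
assert raw.decode("utf-8") == "π"    # read as UTF-8
assert raw.decode("cp1252") == "Ï€"  # read as Windows "ANSI" (cp1252)

def guess_like_notepad(data: bytes, ansi: str = "cp1252") -> str:
    """Hypothetical heuristic: a BOM wins, else valid non-ASCII UTF-8, else ANSI."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return ansi  # an illegal sequence without a BOM falls back to ANSI
    # Pure ASCII without a BOM is reported as ANSI, as described above.
    return ansi if text.isascii() else "utf-8"
```

With a BOM present the sketch never falls back, which matches the observation that a BOM'ed file "stays intact" even after an illegal byte is inserted.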

Why is the byte sequence EF BB BF chosen to be the mark?
Is there a reason? Or did someone just pick it by chance? —Preceding unsigned comment added by 117.104.188.16 (talk) 10:03, 25 January 2011 (UTC)


 * That is U+FEFF (the value of the BOM character) in UTF-8 encoding. It is what a translator from UTF-16 to UTF-8 that was completely unaware of the BOM would produce by translating the BOM character. Spitzak (talk) 19:37, 25 January 2011 (UTC)
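
The derivation can be spelled out. A minimal sketch, assuming Python 3.8 or later, that packs the 16 bits of U+FEFF into the three-byte UTF-8 pattern 1110xxxx 10xxxxxx 10xxxxxx by hand and compares the result with the built-in encoder:

```python
cp = 0xFEFF  # 1111 1110 1111 1111

b1 = 0xE0 | (cp >> 12)          # 1110xxxx: top 4 bits
b2 = 0x80 | ((cp >> 6) & 0x3F)  # 10xxxxxx: middle 6 bits
b3 = 0x80 | (cp & 0x3F)         # 10xxxxxx: low 6 bits

manual = bytes([b1, b2, b3])
print(manual.hex(" ").upper())  # EF BB BF
assert manual == "\ufeff".encode("utf-8")
```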

Why is this a problem?
"as it will interfere with correct processing of important codes such as the hash-bang at the start of an interpreted script. It may also interfere with source for programming languages." All those tools are free software or have free software equivalents and it must be relatively easy to make them ignore the mark. Shinobu (talk) 10:18, 20 November 2007 (UTC)
 * True, though I could see that doing more harm than good: imagine you wrote your script on your desktop and it ran fine, but when you put it on the production server an invisible character stopped it from running. Plugwash (talk) 10:22, 20 November 2007 (UTC)


 * That assumes that the "free software" is of varied quality, not following a standard. That may be true.  However the context for the quote was biased to support this situation. Tedickey (talk) 11:18, 20 November 2007 (UTC)


 * "All those tools are free software or have free software equivalents": no, not proprietary Unixes, and yes they are still around. -- intgr [talk] 11:27, 20 November 2007 (UTC)
 * The de-facto standard is for tools (including such core OS components as the binary loader) to recognise a script by the first two bytes of a file being "#!". If some versions of some tools start ignoring a preceding BOM but others don't (free software DOES NOT mean you can force your changes on your distro maker or server host) then IMO there is likely to be far more confusion than if scripts with a BOM universally fail (which afaict is the status quo). Plugwash (talk) 12:57, 20 November 2007 (UTC)


 * uh - no. No one's presented any evidence of scripts which would be ambiguous if someone provided a loader which handles BOM.  Tedickey (talk) 13:10, 20 November 2007 (UTC)


 * I think the real question for Unix shell scripts is, what is the native character encoding that /bin/sh supports? Can you have a shell variable "$STRAßE"?  An environment variable of the same name?  What about Chinese?  My bet is that the Unix shells only support ASCII text, in which case a byte order mark is inappropriate.  After all, the kernel is looking for the bytes 2321, not the characters "#!". Canistota (talk) 23:28, 12 March 2009 (UTC)


 * Shell scripts support non-ASCII characters just fine (for instance in string literals - variable names may be more optimistic). The encoding is LC_CTYPE.  But this is irrelevant to the recognition of the #! sequence, which is not performed by the shell in any case. Ewx (talk) 08:59, 13 March 2009 (UTC)


 * Python and Perl also support UTF-8 encoding well, including with a BOM, although the shebang does not. — Preceding unsigned comment added by 84.97.14.22 (talk) 16:28, 21 July 2012 (UTC)
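
The byte-level point can be illustrated without involving any shell. A minimal sketch, assuming Python 3; looks_like_script is a hypothetical stand-in for the two-byte comparison a loader performs, not actual kernel code:

```python
import codecs

def looks_like_script(first_bytes: bytes) -> bool:
    """Mimic a loader that only compares the first two bytes with 23 21 ("#!")."""
    return first_bytes[:2] == b"#!"

plain = "#!/bin/sh\necho hi\n".encode("utf-8")
with_bom = codecs.BOM_UTF8 + plain  # EF BB BF now precedes 23 21

print(looks_like_script(plain))     # True
print(looks_like_script(with_bom))  # False
```

The check never decodes any text, so the script's character encoding plays no part in whether "#!" is recognised.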

"All those tools are free software or have free software equivalents and it must be relatively easy to make them ignore the mark." – In addition to what User:Plugwash writes above, I do not believe you can convince even a large minority of Unix users that placing a piece of crippled, limited character-encoding metadata into general files is a good idea. Although I only read about it just now, BOM for UTF-8 strikes me as an unusually stupid idea. The section on BOM in RFC 3629 illustrates some reasons why; it is full of heuristics and language that you rarely see in RFCs ("without a good reason", "only when really necessary", "an attempt at diminishing this uncertainty").

Should I interpret the article as if Windows Notepad is the only widely spread software which actually creates UTF-8 BOMs? It would make sense; Microsoft do not care about plain text editing – they are more into "one application, one proprietary file format" – and they have historically not cared about the usefulness of Notepad.

JöG (talk) 09:13, 29 March 2008 (UTC)

OK, now I see the article says "Quite a lot of Windows software (including Windows Notepad)". But it would be interesting to know if popular, serious text editors on Windows (emacs, vim, UltraEdit and popular Windows-specific editors) do this by default. JöG (talk) 09:18, 29 March 2008 (UTC)


 * You named two ports to Windows and one native. That's a rather small and unrepresentative example.  There are many Windows editors.  Btw, the comment regarding interprocess communication is unnecessary, since it adds no factual information.  Take a look at Windows PowerShell, which has to be doing this transparently.  Tedickey (talk) 11:01, 29 March 2008 (UTC)


 * * UltraEdit: If you select "UTF8" when you save, it adds the BOM without giving you a choice in the matter.
 * * Vim for Windows: It doesn't give the option to save as UTF8 and does not add a BOM, but when it opens a BOM'ed file it retains the BOM when saving. -- leuce (talk) 20:58, 30 March 2009 (UTC)
 * Vim's a port (and it doesn't recognize some of the Windows text formats). By the way, there are probably hundreds of applications to discuss in this manner. Tedickey (talk) 21:05, 30 March 2009 (UTC)
 * I agree -- I merely tested these two because they were mentioned by someone previously, plus the three mentioned below. -- leuce (talk) 14:44, 31 March 2009 (UTC)

In response to JöG's post, here are some Windows programs and whether they add a BOM to UTF8 or not.

--leuce (talk) 13:11, 29 March 2009 (UTC)
 * Akelpad: Gives user a choice, but BOM is suggested by default.
 * MS Word XP: Adds BOM, gives no option not to add BOM. If you open a BOM'ed UTF-8 file in MS Word, it autodetects the encoding as UTF-8; if you open a non-BOM'ed file in MS Word, it makes a guess based on the characters it contains, but if all characters are present in the ANSI scheme, it will save such a file as ANSI, not UTF-8.
 * OpenOffice.org 3.0: Adds BOM, gives no option not to add BOM.

Too technical!
OK, I understand everything in the article, since I'm a unicodopath, but the intro should say: The intro is a bit too technical for being an intro. The current text qualifies as a technical description intended for me and you, not any outsider. The missing nouns that should be in the intro are: computer, data coding, natural languages. L8R.  Said: Rursus   ☻   10:15, 25 April 2008 (UTC)
 * Unicode is a computer encoding of the characters of all languages (in principle),
 * The byte order mark is designed so that a computer that reads it can guess (with a reasonable probability) that the data text is probably Unicode, and
 * Guess what kind of Unicode encoding, since there are many - the article already says that, I just wanted to stress that it shall.
 * I think this set of recommendations is met or eliminated in the current article's text. The explanation that Unicode intends to capture all human languages belongs in the Unicode article (and it's there, and there's a link over there in the first sentence here). The notion that the BOM has the purpose of identifying Unicode (rather than some other encoding entirely) is not, so far as I can see, justified by the primary references, and is significantly undermined by the fact that BOM is in all contexts optional. The "which Unicode encoding" part is, as acknowledged, already captured. Jackrepenning (talk) 22:56, 6 August 2010 (UTC)

How to remove it
There should be a section in this page discussing how to remove it. The only reason 99% of people would ever come to this page is because they are trying to remove this ugly little thing from a web page they are developing. The 1% of people who come because they are interested in it may be getting what they want but not the rest of us. —Preceding unsigned comment added by Tjayrush (talk • contribs) 16:44, 6 February 2009 (UTC)

There is a nice, easy-to-use piece of software called bomstrip that makes removing this thing quick work on Linux. I didn't want to edit the page directly but perhaps an interested party can. —Preceding unsigned comment added by Tjayrush (talk • contribs) 18:08, 6 February 2009 (UTC)

Added remove script to Unwanted BOMs section. In Linux:
 1. Search for files containing a BOM by running this command: grep -rl $'\xEF\xBB\xBF'
 2. For each file in the search results above:
 a. Run vi.
 b. From inside vi, type the command (including the ":" sign) :set nobomb
 c. Save and exit with :wq
— Preceding unsigned comment added by Drormik (talk • contribs) 13:45, 10 August 2012 (UTC)

To be exact: These commands are not for vi but for vim (which is the most popular vi clone). A non-vim implementation of vi (e.g. ex-vi, nvi, ...) most likely will not have an option "nobomb". --Meillo (talk) 18:53, 31 January 2021 (UTC)
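
Where neither bomstrip nor vim is at hand, the same removal can be sketched in a few lines. This is a minimal illustration assuming Python 3; strip_utf8_bom is a hypothetical helper name, and it rewrites the file in place, so it is worth trying on a copy first:

```python
import codecs
from pathlib import Path

def strip_utf8_bom(path) -> bool:
    """Remove a leading EF BB BF from the file; return True if one was removed."""
    p = Path(path)
    data = p.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False  # nothing to do, file left untouched
    p.write_bytes(data[len(codecs.BOM_UTF8):])
    return True
```

The function works on raw bytes, so the rest of the file is written back unchanged whatever it contains.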

Whether Unicode standards recommends UTF-8 BOM or not
The text is "Use of a BOM is neither required nor recommended for UTF-8" (and this already appears in the cite!). That seems like a pretty clear "not recommended" to me - "neither fish nor fowl" means "not fish and not fowl", it doesn't mean "not fish and not specifically fowl". Ewx (talk) 07:47, 31 March 2009 (UTC)


 * And then it goes on to say that applications still must expect that it'll happen. May as well address the complete sentence, rather than construe a (reasonably) carefully worded comment into a completely negative recommendation. Tedickey (talk) 10:07, 31 March 2009 (UTC)


 * The Wikipedia text already points out that it may be encountered, and a recommendation not to *use* it doesn't contradict that at all. Ewx (talk) 13:54, 31 March 2009 (UTC)


 * But that is the point -- the Unicode standard does not contain a recommendation not to use it. -- leuce (talk) 14:37, 31 March 2009 (UTC)


 * Yes it does! The text in the standard is "Use of a BOM is neither required NOR RECOMMENDED for UTF-8" (emphasis mine).  That is not an absence of a recommendation to use it and it is certainly not an absence of a recommendation not to use it; it is a straightforward and clear recommendation not to use it.  Ewx (talk) 07:49, 1 April 2009 (UTC)


 * Indeed (the emphasis is yours). Use the complete sentence, or find another source which supports your viewpoint. Tedickey (talk) 10:56, 1 April 2009 (UTC)


 * This is completely ridiculous. The text is right there.  It says it's not recommended. Ewx (talk) 08:07, 2 April 2009 (UTC)


 * Well, I suspect this is a sticky point. I have searched chapters 2 and 16 of the Unicode standard for references to BOM, byte order mark and UTF-8, and in my opinion the reference under discussion here is the only instance in the standard that speaks even remotely negatively about a UTF-8 BOM.  In all other cases where the UTF-8 BOM is mentioned or discussed, it is mentioned as a matter of course in an informational, neutral tone, without making any value judgements or any indication that the UTF-8 BOM is deprecated.  My personal take on this reference is that people who want to implement the Unicode standard might wonder why the Unicode standard keeps making reference to the UTF-8 BOM (also in chapter 16) as if it were a valid construct, and they might come under the impression that the Unicode consortium actually recommends using a UTF-8 BOM even though it is not required. -- leuce (talk) 15:34, 1 April 2009 (UTC)


 * Having read comments by some of the people involved (in the topic itself...), my impression is that the statement is a compromise between two viewpoints, neither of which dominated in writing the source we're discussing. Tedickey (talk) 16:31, 1 April 2009 (UTC)


 * The phrase "X does not recommend Y" can have two meanings. It can mean that X recommends many things, but that Y is not one of the things that X recommends.  Or, it can mean that X makes a recommendation *against* Y.  The Unicode article does not recommend against a BOM... it simply does not make a recommendation in favour of it.  My gripe is that the wiki article before I edited did create the impression that the Unicode standard recommends against the use of a BOM.  Even if one quotes directly from the Unicode standard, if quoted in a different context it can certainly give a slightly different impression of what the standard intends to say. --  leuce (talk) 14:37, 31 March 2009 (UTC)


 * Agree. And (for instance), if you consult some of the secondary sources, it's easy to come up with one that is wholly in favor of one or another viewpoint.  (Some are completely absurd, but I see those reflected on this page ;-) Tedickey (talk) 10:06, 2 April 2009 (UTC)


 * A recent flurry of edits has opened this can of worms again, and the text has grown decidedly text-booky and verbose. I’ve reverted to the state pre-edits. Firstly, we cannot interpret the Unicode standard for it. The text comes straight from the source. The reader is going to have to decide for “himself” what that means. There is no other authoritative source and therefore we are not allowed to interpret it for the reader. The cited mailing list thread is not authoritative; it is just one of hundreds of discussions all over the Web on the topic, each coming to its own conclusions. Secondly, it makes no sense to prognosticate at length over the reliability or unreliability of the UTF-8 BOM as a signal for UTF-8 encoding. Go find some reliable reference if you feel something definitive needs to be said about it. The article is fine as it is, particularly since these observations about the unreliability of the UTF-8 BOM apply equally well to the UTF-16 BOMs. A file of unknown provenance can never, with 100% confidence, be stated to be in any encoding whatever, or even to be text even though it might be the collected works of Shakespeare in 7-bit ASCII. The best you can state completely confidently is that the content is not in some particular encoding due to a violation of the encoding’s standards. Strebe (talk) 19:57, 14 July 2012 (UTC)

May I make this edit?
Current:

"While UTF-8 does not have byte order issues, a BOM encoded in UTF-8 may nonetheless be encountered, and it is explicitly allowed by the Unicode standard[1], the Unicode standard does not specifically recommend its usage[2]. It only identifies a file as UTF-8 and does not state anything about byte order.[3]"

When I read these two sentences, it almost sounds as if the Unicode standard identifies a file as UTF-8 :-) That second sentence doesn't really fit anymore.  Besides, it repeats what has been explained elsewhere.  I suggest we remove it or move it somewhere else in the article. -- leuce (talk) 14:55, 31 March 2009 (UTC)

Article needs an example
This article should include an example of a byte-order-mark. DMahalko (talk) 23:56, 15 June 2009 (UTC)


 * The article currently presents the definition and its encodings, including how it looks when rendered naively in various ways. David Spector (talk) 14:21, 28 March 2013 (UTC)

Why the dash in byte-order?
The Unicode specification reads "byte order mark", not "byte-order mark". Why was this article's name changed? On the face of it, this article title is wrong. Strebe (talk) 04:01, 28 July 2009 (UTC)


 * Proper English would dictate the use of the hyphen. See http://en.wikipedia.org/wiki/English_compound#Hyphenated_compound_adjectives - Blueguy 65.0.223.146 (talk) 00:25, 7 August 2009 (UTC)


 * This article is about something that has a name. The name, by the body that coined it, is "byte order mark". It is not encyclopædic to "correct" established terminology; that is editorializing. This article's title is wrong. Strebe (talk) 09:05, 8 August 2009 (UTC)


 * Wikipedia rules say to name articles as the thing is called on the street and in life, not as it's called in the dictionary or how it should be called; Strebe is right. 88.148.214.15 (talk) 20:35, 12 October 2009 (UTC)


 * The following discussion is an archived discussion of a requested move. Please do not modify it. Subsequent comments should be made in a new section on the talk page. No further edits should be made to this section. 

The result of the move request was moved. Skomorokh, barbarian  11:07, 27 October 2009 (UTC)

Byte-order mark → Byte order mark — Cannot move back to old name without administrator intervention. Strebe (talk) 09:57, 18 October 2009 (UTC)
 * The above discussion is preserved as an archive of a requested move. Please do not modify it. Subsequent comments should be made in a new section on this talk page. No further edits should be made to this section.

Which text editors add a BOM to the beginning of text files?
"Some text editing software in a UTF-8 environment on MS Windows adds a BOM to the beginning of text files." Which ones? Tisane (talk) 02:57, 24 February 2010 (UTC)


 * Probably a long list (Visual Studio .NET for instance) Tedickey (talk) 09:20, 24 February 2010 (UTC)

The BOM will make a batch file not executable on Windows…
I removed this completely misleading remark of October 28. First, it is not impossible to remove the BOM even in Windows, so the conclusion about so-called "ANSI" has no grounds. Second, user BIL correctly stated that the native encoding for .bat is CP437 but forgot to mention that non-Western Windows localisations actually use a different OEMCP (see below a sample with code page 866); in any case this matter is quite off-topic and irrelevant. And, most important, .BATs starting with the BOM do execute:

T:\>test.bat
T:\>я╗┐echo ╨╗╤П╨╗╤П╨╗╤П╨╗╤П╨╗╤П╨╗╤П
'я╗┐echo' is not recognized as an internal or external command, operable program or batch file.
T:\>ver
Microsoft Windows XP [Version 5.1.2600]
T:\>

The test.bat file contains: Incnis Mrsi (talk) 18:15, 6 February 2011 (UTC)

Huh? Your example shows EXACTLY the problem: the BOM is not removed but is considered part of the "echo" command and therefore the .bat file fails to work. Spitzak (talk) 19:22, 6 February 2011 (UTC)
 * I see "the problem", but such .BATs do execute, contrary to the statement quoted in the topic. As there is an error with the first line, there is, obviously, an easy workaround: skip the first line, say, leave it empty. This is all WP:OR, just like the deleted speculations. So I see no reason to keep BIL’s controversial OR in Wikipedia. Incnis Mrsi (talk) 20:36, 6 February 2011 (UTC)
 * If you leave the first line in the bat file empty, and save it as UTF-8, there will still be a BOM there, which will cause an error message, but the bat file will be executable. What I wanted to describe is that the Windows command prompt and bat files do not recognise BOM or Unicode. There might be a workaround, but still.--BIL (talk) 21:34, 6 February 2011 (UTC)
 * I agree that you have an extremely literal interpretation of the word "execute". Yes, for almost any text file, the program the text file is for will start running, will open the file, and will actually read bytes from it, and only fail when it fails to interpret the line as the user of the text editor intended. By this criterion ALL programs "work with a BOM". However, that is a pretty useless definition. Spitzak (talk) 21:23, 7 February 2011 (UTC)

Please, Spitzak and Strebe: do tango it into a good text here at Talk. If you cannot solve it here, it will not be a good text in the article page for sure. I really would like to read the good article on this. -DePiep (talk) 20:40, 8 February 2011 (UTC)


 * The rationale of this edit is wrong: Without the BOM it would NOT be "the wrong encoding"


 * The character encoding is declared as part of the text file contents only if there is a BOM and only within Unicode environments. If there is not a BOM, or if the environment is not Unicode, then the character encoding is determined externally. You cannot claim that a file sent to the DOS command line is UTF-8, since, by definition, the file is DOS 437. It does not matter how the file was constructed or what its history was or whether it contains a BOM; when you sent it to the DOS command line, you implicitly declared that it was Code page 437, which is not a Unicode environment. If that is not what you intend, then you simply sent the command line the wrong file. Strebe (talk) 00:04, 9 February 2011 (UTC)
 * I shortened the text and wrote that batch files do not support Unicode and therefore not the BOM. Note that echo does not support Unicode, for example writing echo From Genève to Zürich in a batch file gives From Gen├¿ve to Z├╝rich, and Unicode file names do not work either.--BIL (talk) 10:34, 9 February 2011 (UTC)
 * Saying "the text has an encoding" shows that you completely do not understand why the BOM is not being recommended by some. Without the BOM, a UTF-8 file containing only ASCII letters is identical to an ASCII file. So it is simultaneously in UTF-8 encoding and also in ASCII encoding and DOS 437 encoding and ISO-8859-1 encoding and CP1252 encoding. The entire design of UTF-8 was to allow this, to eliminate the need to identify and transmit encodings. However, it is defeated by the addition of the BOM, which makes the file no longer in these encodings, all for a completely invisible letter that programs now have to decode and add to their input syntax just so they can skip it! And don't print that bull about "batch files are DOS 437": if that was the problem, the batch file would produce a "this is in the wrong encoding" error, not complain about the inability to find a command that happens to be equal to the first ASCII command with the three bytes of the BOM added to the start. In reality, batch files are streams of bytes, and the byte values that happen to match the ASCII space and CR and LF and a few other values have some meaning. This is not an "encoding" at all. Spitzak (talk) 20:31, 9 February 2011 (UTC)


 * You might consider calming down and perhaps finding some soothing hobbies. You have no idea what I understand and do not understand, and I really am not interested in these sorts of petty pissing matches or discussing who’s stupid. I’m interested in improving Wikipedia. I can’t imagine anyone else is interested in such flaming, either.


 * We agree that a BOM is not recommended; after all, we must agree because that is what Unicode states. I have no disagreement with the first half of your diatribe. You might reconsider your rant about DOS 437, on the other hand. It is not the job of text processing systems in non-Unicode environments to recognize Unicode conventions. The Unicode Consortium recognizes this and takes pains to make sure no one thinks they’re imposing Unicode on everyone, especially systems that existed before Unicode. Batch files existed long before Unicode. It cannot be batch processing systems’ responsibility to declare that the encoding is wrong because they don’t even know that it’s “wrong”. It’s NOT wrong; by using the file as a batch command, you have imposed DOS 437 semantics onto the file. Therefore your assertion that batch file processing ought to produce a “This is the wrong encoding” is nonsense. What you are calling a BOM is not a BOM in a batch file; it is a sequence of three characters: the “intersection” glyph from set theory, and two box-corner symbols. Just because Unicode came along does not deprive DOS 437 (or any other encoding) of its upper ASCII register, which you seem to be arguing for by claiming it’s “not an encoding”.


 * The important thing here is that the declaration of the encoding system is not part of the file’s content; it is externally imposed. A BOM has specific meaning within the Unicode environment. It does not outside of it. Batch files are outside the Unicode environment. It really is that simple. Strebe (talk) 01:00, 10 February 2011 (UTC)


 * It is obviously a waste of time trying to explain this. Basically though: if a program takes some bytes in a buffer and puts them on a device that interprets them according to encoding X, then that buffer is in encoding X. It does not matter if that program does not understand encoding X or that it was written decades before encoding X existed. The bytes are in that encoding because they are interpreted as though they are in that encoding. Anyway I am going to delete the windows batch file comment because adding "it is in DOS 437" makes the argument completely nonsensical. Spitzak (talk) 19:36, 10 February 2011 (UTC)
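
The characters quoted in this thread can be reproduced by decoding UTF-8 bytes with the OEM code pages mentioned. A minimal sketch, assuming Python 3 and its bundled cp437 and cp866 codecs; it only illustrates how the bytes are displayed and does not model cmd.exe itself:

```python
import codecs

bom = codecs.BOM_UTF8  # the three bytes EF BB BF

# The same three bytes read under two OEM code pages.
assert bom.decode("cp437") == "∩╗┐"  # intersection glyph plus two box corners
assert bom.decode("cp866") == "я╗┐"  # as in the code page 866 sample

# UTF-8 text shown by a console that assumes cp437.
garbled = "From Genève to Zürich".encode("utf-8").decode("cp437")
assert garbled == "From Gen├¿ve to Z├╝rich"

# Cyrillic text in UTF-8 shown under cp866.
assert "ля".encode("utf-8").decode("cp866") == "╨╗╤П"

# The first token of a BOM-prefixed line is no longer "echo".
first_token = (bom + b"echo hello").decode("cp866").split()[0]
assert first_token == "я╗┐echo"
```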

Dubious claim in "Representations of byte order marks by encoding" section
The GB-18030 section has the following claim: "[132] and [149] are unmapped ISO-8859-1 characters". But my understanding is that these characters aren't unmapped even in ISO-8859, but are C1 control characters; 0-31 is the C0 control area and 128-159 the C1 control area. This is why the mapping by Windows-X of higher Unicode characters to the latter range can cause problems.

I think this section needs to be edited by a knowledgeable person. — 93.97.40.177 (talk) 07:00, 16 June 2011 (UTC)


 * You are confusing the character values produced after decoding with the bytes that are in the encoding. 132 is a value of one of the bytes in the GB18030 encoding of the BOM. It and 3 other bytes decode into the value 0xFEFF. Spitzak (talk) 19:09, 16 June 2011 (UTC)
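
The distinction between encoded bytes and the decoded value can be shown directly. A minimal sketch, assuming Python 3.8 or later and its bundled gb18030 codec:

```python
encoded = "\ufeff".encode("gb18030")

print(list(encoded))              # [132, 49, 149, 51]
print(encoded.hex(" ").upper())   # 84 31 95 33

# The four bytes together decode to the single code point U+FEFF.
decoded = encoded.decode("gb18030")
print(hex(ord(decoded)))          # 0xfeff
```

Here 132 and 149 are byte values inside the encoded form; they are never interpreted as characters on their own.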

From which version of text editors recognize/do not recognize UTF-8 without BOM in the beginning of text files?
From which versions do text editors recognize, or fail to recognize, UTF-8 without a BOM at the beginning of text files? Once all text editors recognize UTF-8 without a BOM, the BOM will not be necessary anymore... — Preceding unsigned comment added by 86.75.236.140 (talk) 10:09, 30 June 2012 (UTC)

«One reason the UTF-8 BOM is not recommended is that pieces of software without Unicode support may accept UTF-8 bytes at certain points inside a text but not at the start of a text.»
«One reason the UTF-8 BOM is not recommended is that pieces of software without Unicode support may accept UTF-8 bytes at certain points inside a text but not at the start of a text.» The formulation of this sentence looks strange and illogical from my point of view: if a piece of software does not support UTF-8, the presence of a BOM helps to indicate that this software is not compatible with UTF-8. — Preceding unsigned comment added by 86.75.236.140 (talk) 10:12, 30 June 2012 (UTC)


 * The rest of the paragraph explains it. Strebe (talk) 18:59, 30 June 2012 (UTC)


 * I understood the sentence just now: the intent is to say that the BOM should not be used, for backward compatibility with legacy software which accepts 8-bit bytes regardless of encoding.
 * It seems very specific to me, although I understand such a specific case can be considered by Wikipedia. I assume that in 2012 there is very little software that has this issue.
 * I am not sure the case of a compiler is a good example. To be verifiable, the name and version of a supposedly incompatible compiler (for instance PHP 5) should be given as an example/reference. For the two compilers I searched for, I understand that this issue is solved and a BOM can be used:
 * A seven-year-old compiler (Visual C++ 2005).
 * Another compiler example is gcc fortran five years ago, which treated it as a bug that was then corrected.
 * So I would prefer a sentence which states that the BOM is for fully Unicode-compatible software, and that for old software (from the XXth century ;-) ) the BOM should be avoided. Stating the Unicode position in a more neutral way might be better still.
 * Above all, the explanation should be simplified to be easily understandable.
 * To be more neutral, Wikipedia should not focus only on the POSIX position but also consider the Unicode and Microsoft ones.


Imagine a second function, called matchUTF8, that does pattern matching to determine whether a file contains only valid UTF-8. It is in no way "error-prone" (it is in fact a good deal more reliable than relying on the odds that the first 3 bytes happen to not be the BOM in non-UTF-8 text). You can argue about whether it is "complicated" or "slow", but imho it is neither of them when compared to the next function.

Let's imagine a third function, called getLegacyEncoding, which examines a text string. It returns whichever encoding, other than UTF-8, you are interested in. I think it is fair to describe this function as "complicated, error-prone, or slow", and in fact that is exactly what the text is referring to.

Okay, let's write a program that uses these functions and returns true/false as to whether you think the file is UTF-8:

Version 1:

isUTF8(f): return hasUTF8BOM(f)

Version 2:

isUTF8(f): return matchUTF8(f)

Wait a second! Neither version calls getLegacyEncoding! So why is there some text talking about that when discussing using version 2?

Okay, maybe your concern is that you do need to figure out the legacy encoding. Let's try some new functions that figure out the encoding using the above:

Version 1:

getEncoding(f): return hasUTF8BOM(f) ? UTF8 : getLegacyEncoding(f)

Version 2:

getEncoding(f): return matchUTF8(f) ? UTF8 : getLegacyEncoding(f)

Oops! You need to call getLegacyEncoding in both of them!

Do you understand? The complexity of distinguishing legacy encodings is irrelevant when choosing whether to use the BOM or not. Both cases need or can ignore it equally. Therefore mentioning how hard that is in the context of not using a BOM is misleading.

Spitzak (talk) 01:16, 21 May 2015 (UTC)
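The two versions of getEncoding above can be sketched in Python. This is only an illustration of the argument, not code from the article or from any particular library: the function names mirror the pseudocode, `match_utf8` assumes that "valid UTF-8" means "accepted by Python's strict decoder", and `get_legacy_encoding` is a hypothetical placeholder for the hard detection step.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    # True if the data starts with EF BB BF.
    return data.startswith(codecs.BOM_UTF8)


def match_utf8(data: bytes) -> bool:
    # Assumption: "valid UTF-8" means Python's strict decoder accepts it.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def get_legacy_encoding(data: bytes) -> str:
    # Hypothetical placeholder for the "complicated, error-prone, or slow"
    # step; a real detector would inspect the bytes.
    return "latin-1"


def get_encoding_v1(data: bytes) -> str:
    # Version 1: trust the BOM, otherwise fall back to legacy detection.
    return "utf-8" if has_utf8_bom(data) else get_legacy_encoding(data)


def get_encoding_v2(data: bytes) -> str:
    # Version 2: validate as UTF-8, otherwise fall back to legacy detection.
    return "utf-8" if match_utf8(data) else get_legacy_encoding(data)
```

Both versions end in the same fallback call; they differ only in how BOM-less UTF-8 is treated, which version 1 hands to the legacy detector.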


 * There isn’t anything wrong with your logic. Again, it’s about your interpretation of the article’s meaning and intent. The complexity of distinguishing legacy encodings is relevant because many programmers will choose a library that does not look specifically for UTF-8 indicators first but instead acts as a general detector for character sets. Not only that, even if the programmer does the right thing “herself” for detecting encoding, a choice she makes about encoding affects systems that she has no control over: that text will end up in places where her choice of encoding suffers because the downstream system uses a slow, buggy character encoding detector, or is a Microsoft product. The article helpfully points out that UTF-8 is easier to detect, which you could read as a hint that the programmer should try to detect UTF-8 first if the problem domain is likely to be Unicode. But you seem to be so obsessed with the horrifying thought that someone might treat UTF-8 as a peer to other encodings that you interpret the article as, “If you don’t use a BOM then you’re stuck dealing with the messy legacy encoding problem.” Well, in a sense you are stuck with such problems, because you have no control over how others will handle your text. Meanwhile if you choose to use a BOM instead, your text just works with Microsoft products, and presumably any character encoding detection system will recognize the encoding reliably as well. But the article emphasizes neither BOM nor BOM-less. It is neutral and merely points out a few facts about the consequences of choice. As it should. Strebe (talk) 08:45, 22 May 2015 (UTC)

Huh?
'''Programs expecting UTF-8 may show these or error indicators, depending on how they handle UTF-8 encoding errors. In all cases they will probably display the rest of the file as garbage (a UTF-16 text containing ASCII only will be fairly readable). '''

The italicized sentence here is not at all clear to me. Is it saying that a program expecting a UTF-8 file will display a UTF-16 file as garbage outside of the parts of the file that are ASCII only? Why doesn't it just say that? I first interpreted it as saying that a file beginning with a UTF-16 BOM would be displayed as garbage because of the BOM, which is what the context suggests, and which is obviously silly because of the way UTF-8 resyncs. Also, saying that the display of an ASCII-only file in this case is fairly readable is a bit of a stretch: for instance, is a highlighted NUL between each pair of characters fairly readable?

In fact, if this sentence is talking about how a program expecting UTF-8 displays UTF-16 in general, why is it in the article on BOM anyway without some clarification about what this has to do with BOM? 2601:646:8D01:8A90:29F4:44B8:5515:B1EE (talk) 07:25, 3 October 2015 (UTC)


 * The italicized text is just some random observation by some random editor and doesn't seem to have anything to do with BOM. Feel free to fix it. Strebe (talk) 07:47, 3 October 2015 (UTC)
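What a UTF-8 consumer sees when handed ASCII-only UTF-16 text can be checked with a short Python sketch. It assumes the consumer replaces invalid bytes with U+FFFD, as Python's `errors="replace"` handler does; other programs may show different error indicators or reject the file outright.

```python
import codecs

# ASCII-only text encoded as little-endian UTF-16 with a BOM.
data = codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")

# Decode it as if it were UTF-8, substituting U+FFFD for invalid bytes.
shown = data.decode("utf-8", errors="replace")

# The two BOM bytes (FF FE) are each invalid in UTF-8; the letters survive,
# but every one of them is followed by a NUL character.
print(repr(shown))
```

Only the BOM bytes are decoding errors here. The rest decodes without error but has a NUL after every letter, which is the readability problem raised in the comment above.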

UTF-32LE matching UTF-16LE
I added a mention that the UTF-32LE BOM is the same byte pattern as a UTF-16LE BOM followed by a null (0) character. I thought this was interesting and indicates an example where blindly obeying the bit patterns does not work. Somebody else thought to add a lot of text which I think amounts to "text starting with null is very uncommon so this is not a problem". He seems to think it is because null is a string terminator in null-terminated strings, but actually that makes a leading null *MORE* common, since zero-length strings are probably by far the most common. The real reason this is not a problem is that UTF-32 is so very rare that there is no reason to test for it.

I guess I'll just delete this, as the tiny anecdote has inflated to a mess of unreadable text. Spitzak (talk) 19:09, 27 January 2016 (UTC)
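The overlap between the two byte patterns is easy to confirm with the BOM constants in Python's standard `codecs` module. The `sniff_bom` helper and the two lookup orders below are illustrative assumptions, not any particular program's detection logic.

```python
import codecs

# FF FE 00 00 is both "UTF-32LE BOM" and "UTF-16LE BOM followed by U+0000".
assert codecs.BOM_UTF32_LE == codecs.BOM_UTF16_LE + "\x00".encode("utf-16-le")

BOMS_LONGEST_FIRST = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]
BOMS_SHORTEST_FIRST = list(reversed(BOMS_LONGEST_FIRST))


def sniff_bom(data, table):
    # Hypothetical helper: return the first encoding whose BOM prefixes data.
    for bom, name in table:
        if data.startswith(bom):
            return name
    return None


utf16_leading_nul = codecs.BOM_UTF16_LE + "\x00abc".encode("utf-16-le")
utf32_text = codecs.BOM_UTF32_LE + "abc".encode("utf-32-le")

# Whichever order is used, one of the two inputs is misidentified.
print(sniff_bom(utf16_leading_nul, BOMS_LONGEST_FIRST))
print(sniff_bom(utf32_text, BOMS_SHORTEST_FIRST))
```

Checking longer BOMs first misreads UTF-16LE text that begins with a null; checking shorter BOMs first misreads genuine UTF-32LE. The byte pattern alone cannot settle it.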

Repeated deletion of math showing chances of misidentifying another encoding as UTF-8
This has been deleted as OR, but it is actually based on several answers from Stack Exchange, which is not allowed as a source. It also has provable math statements in it. Strebe does not seem to understand it, making two mistakes. First, a 1/15 chance of an error is not an 85% chance of being correct; it is 93.3%. Second, that is for ONE character: the chance of N random multibyte sequences all being valid is 1/15^N, so the chance of a correct result is 1-1/15^N, and the odds quickly become astronomical. For instance, the chance of finding 7 UTF-8 characters without first finding an invalid sequence is 1/170,859,375. This can be compared to the 1/16,777,216 chance of the first three bytes being the BOM, a chance that is assumed to be zero by defenders of the BOM method.

This question of odds is asked quite a few times on the internet, though there are a lot of incorrect answers. I thought it would be useful to provide some kind of answer. Spitzak (talk) 22:44, 29 June 2017 (UTC)
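The comparison can be reproduced with exact fractions in Python. The sketch takes the 1/15 per-character figure as given and assumes uniformly random, independent bytes, which real files are not, so it only illustrates the arithmetic.

```python
from fractions import Fraction

# Assumption taken from the comment above: a random sequence starting with a
# high-bit byte is a valid UTF-8 multibyte character with probability ~1/15.
P_VALID_CHAR = Fraction(1, 15)


def p_all_valid(n: int) -> Fraction:
    # Chance that n independent random multibyte sequences are all valid.
    return P_VALID_CHAR ** n


# Chance that three uniformly random bytes happen to equal EF BB BF.
P_RANDOM_BOM = Fraction(1, 256 ** 3)

for n in range(1, 8):
    print(n, p_all_valid(n), p_all_valid(n) < P_RANDOM_BOM)
```

Under these assumptions, seven valid multibyte characters in a row are already less likely to occur by accident than a spurious three-byte BOM.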

'Consuming'
A tiny issue of language. My amendment was reverted with the comment 'As in, to “ingest” and serially “use up” the incoming stream. This is standard terminology.' As a native English speaker, I am confident my edit was an improvement. As a reader, I found the word 'consuming' for what a program does with an input string both jarring and jargon. It lacks clarity, and is not consistent with English usage. Detection of character encoding and byte order does not 'ingest', 'consume' or 'use up' anything. If this were standard terminology, why is there no common method consume? Instead we have read and open, the standard analogy being with a book. That analogy is surely clearer to a reader. (Conversely 'emit' may be used occasionally in a computing context as a substitute for 'print', but not 'excrete' or 'egest'.) I'd not seen the word 'consume' in any Unicode documentation I've been reading recently. The first language of English Wikipedia is English. I'll reinstate the correction once and once only. --Cedderstk 06:43, 29 October 2018 (UTC)
 * In standard computer terms, reading implies only the transfer of data, e.g. from a file or network into memory. Consuming involves interpreting the data. Of course it is possible to do both in parallel, but the reading part is independent of the type of data (for example it would be the same whether it was text, an image, etc.) and then consuming depends on the type of data. If multiple types of data are accepted, the first step in consuming the data is often to identify the type of the data, and the next step would be to process it depending on the type that was identified. The article is really talking about this step of identifying the type of data, which is often the first step in consuming the data after it is read. The byte order mark tells it that it is text (and not, say, an image) and also the specific encoding of the text, so that it can then be processed/parsed. I didn't revert but I don't think "reading" or "accessing" is an improvement; both of those terms refer to transfer of raw data that occurs before it is consumed or interpreted, but the byte order mark is about interpreting the data that was read. -LiberatorG (talk) 15:44, 29 October 2018 (UTC)
 * How about "interpreting" then? "consuming" does have the problem that (at least for many readers) it implies the destruction of the original data. Spitzak (talk) 17:16, 29 October 2018 (UTC)
 * I have no issue with "interpreting". -LiberatorG (talk) 18:01, 29 October 2018 (UTC)

Odds of UTF-8 being in random byte stream
There are 128 bytes with the high bit set, the following bytes can have 256 values each. Therefore there are $128×256^{N-1}$ N-byte sequences starting with a byte with the high bit set.

If there are M characters encoded in UTF-8 using N bytes, the chance of a byte with the high bit set starting a valid N-byte character is $M / (128×256^{N-1})$.


 * For N=2, M is 0x800 - 0x80 = 0x780.
 * For N=3, M is 0x10000 - 0x800 = 0xF800 (I am allowing surrogate halves)
 * For N=4, M is 0x110000 - 0x10000 = 0x100000.

As the valid sequences are disjoint the odds of finding the different lengths can simply be added.

$0x780/(128×256) + 0xF800/(128×256×256) + 0x100000/(128×256×256×256) ≈ 0.0586 + 0.0076 + 0.00049 ≈ 0.06665$

This is really close to 1/15. In fact if you do this in integers it is 143130624/2147483648 which reduces to 273/4096, and 15*273 is 4095. Spitzak (talk) 23:42, 26 April 2019 (UTC)
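The integer version of this sum can be checked with Python's `fractions` module. The counts M are the ones listed above (surrogate halves allowed), and the sketch assumes the same model of uniformly random bytes.

```python
from fractions import Fraction

# Number of characters M that need exactly N bytes in UTF-8
# (surrogate halves are counted, as in the list above).
chars_by_length = {
    2: 0x800 - 0x80,
    3: 0x10000 - 0x800,
    4: 0x110000 - 0x10000,
}

# There are 128 * 256**(N-1) N-byte sequences starting with a high-bit byte.
p_valid = sum(
    Fraction(m, 128 * 256 ** (n - 1)) for n, m in chars_by_length.items()
)

print(p_valid, float(p_valid))
```

The exact result is 273/4096, and since 15 × 273 = 4095 this is just below 1/15.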


 * I don’t understand your notation. In binary:
 * N = 2, M = 110xxxxx (and therefore 32 values)
 * N = 3, M = 1110xxxx (and therefore 16 values)
 * N = 4, M = 11110xxx (and therefore 8 values)
 * The hex numbers are how many characters are encoded with N bytes. For N=2 for instance the number of characters it could encode is 0x800 (2^11), but 0x80 of these are overlong encodings of 1-byte characters, so the total is 0x800 - 0x80 = 0x780.
 * Since you already stipulated the top bit being set, that’s (32+16+8)/128 probability of a random byte being a legitimate first byte of a multibyte UTF-8 character. However, for those to be legitimate, the following byte must be:
 * M = 10xxxxxx (and therefore 64/256 values)
 * And this is not an independent probability, since the preceding byte’s legitimacy as a first byte depends on it. Byte 3 and Byte 4 have similar requirements as Byte two, but the results are dominated by Bytes 1 and 2 and so I didn’t bother with them. Also, not all of the xxx bits result in real characters, but again, most of that space is populated in the two-byte space, and so I ignore it.
 * And this is why these sorts of factoids need to be cited, not just generated. We shouldn’t even be having this conversation. Strebe (talk) 05:26, 27 April 2019 (UTC)


 * You are wrong. The byte stream is *RANDOM*. This means the value of byte 1 is independent of the value of byte 0, completely the opposite of what you said. You also added 5 invalid lead bytes. So a quick approximation like you are doing is (30+16+5)/128 * 64/256, which is close to 1/10, this is an overestimate as it counts many invalid 3 and 4 byte sequences. An underestimate would be to say that only 2-byte leads are valid, which gives a value close to 1/17. My math is in fact correct and gives a value near 1/15.
 * I don't think any citation is needed for obvious math. However what I wanted a citation for is some analysis for *real* text, or even real binary data, which is not random. I believe the odds of a byte having the high bit set is significantly less than 1/2 in real text, but at this point I do think a citation is needed.
 * I am unclear where you are getting your impression that the second byte somehow magically has greater odds of being correct, though I think the underlying misconception is why people insist on the BOM and can do such muddled thinking about UTF-8. It seems there is a chain of thought whereby just the existence of UTF-8 in the Universe somehow forces patterns that are invalid UTF-8 to either be physically impossible or (in your thinking) less likely than valid patterns, perhaps as some kind of quantum physics effect? Besides simple errors like yours of over-estimating the chances that a random stream will be valid UTF-8, this thinking has caused the far more serious problem of designs that assume turning an array of bytes into "Unicode" and then back again is a lossless operation, and thus storing strings as "Unicode" rather than UTF-8 internally is acceptable. Spitzak (talk) 20:02, 27 April 2019 (UTC)
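The two quick bounds quoted above (about 1/10 and about 1/17) can be reproduced in Python. The lead-byte ranges are the ones the UTF-8 definition allows (C2 to DF, E0 to EF, F0 to F4), and the brute-force check assumes Python's strict decoder as the definition of validity.

```python
from fractions import Fraction

LEADS_2 = range(0xC2, 0xE0)  # 30 valid two-byte lead bytes
LEADS_3 = range(0xE0, 0xF0)  # 16 valid three-byte lead bytes
LEADS_4 = range(0xF0, 0xF5)  # 5 valid four-byte lead bytes

P_CONTINUATION = Fraction(64, 256)  # second byte must be 10xxxxxx

# Overestimate: any valid lead byte followed by one continuation byte.
n_leads = len(LEADS_2) + len(LEADS_3) + len(LEADS_4)
over = Fraction(n_leads, 128) * P_CONTINUATION

# Underestimate: count complete two-byte characters only.
under = Fraction(len(LEADS_2), 128) * P_CONTINUATION


def decodes(raw: bytes) -> bool:
    # Assumption: validity means Python's strict UTF-8 decoder accepts it.
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Brute force over every two-byte sequence starting with a high-bit byte.
valid_pairs = sum(
    decodes(bytes([b0, b1])) for b0 in range(0x80, 0x100) for b1 in range(256)
)

print(float(over), float(under), valid_pairs)
```

The exact figure of 273/4096 lies between the two bounds, and the brute-force count matches the 0x780 two-byte characters used in the calculation.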


 * There is a reason Wikipedia does not permit WP:OR, so stop trying to engage in WP:OR and stop telling people what is going on in their own heads and fantasizing that you are the only person in the world who understands these things. I’m not interested. 1:10 versus 1:15 is meaningless, and only serves to demonstrate your obsession with this topic that has incited, over the span of many years, a long history of poor edits under the misconception that someone might construe the article to be advocating using the BOM. It doesn’t; it never has; and it’s just boggling that you cannot move on with your life. The world doesn’t work anything like your rant implies. Strebe (talk) 21:45, 27 April 2019 (UTC)

"the text stream's encoding is Unicode"
Unicode is no encoding. It is a charset and provides several encodings. Thus, it should either read "the text stream's character set is Unicode" or the "text stream has a Unicode encoding". --Meillo (talk) 18:56, 31 January 2021 (UTC)
 * I don’t think anyone is confused by the present verbiage, which is merely shorthand for “the text stream’s encoding is one of Unicode’s encoded forms”. Unicode isn’t a “charset”, either; Unicode is a standard that defines a character repertoire, a code point for each abstract character in that repertoire, several encoded forms for the set of code points, and a lot of rules and recommendations. I don’t think either of your proposed changes is satisfactory. In the first case, Unicode does not use the term “character set” because the term is not well defined. In the second case, almost any text stream “has a Unicode encoding”; i.e., can be represented in a Unicode encoding by some transformation. Strebe (talk) 00:40, 1 February 2021 (UTC)
 * So, in the end, the presence of a BOM only really tells that there (most likely) is some Unicode stuff involved? ;-) I really would like to get rid of the words "the encoding is Unicode". There is so much confusion in the whole topic, and wordings like this one add to it. I like the long form (“the text stream’s encoding is one of Unicode’s encoded forms”) much more, as it provides more clarity. But actually, is this really what the bullet point wants to point out? In relation to the third bullet point ("Which Unicode character encoding is used."), the second one maybe should rather be: "The fact that the text stream is Unicode, to a high level of confidence;" (i.e. omitting "'s encoding"). --Meillo (talk) 03:07, 1 February 2021 (UTC)
 * I’m fine with “the text stream is Unicode”. Yes, the BOM just says Unicode is probably involved. Strebe (talk) 04:50, 1 February 2021 (UTC)