Talk:Unicode equivalence

Technical tone
The tone of this article is really technical. babbage (talk) 04:23, 4 October 2009 (UTC)

Useful link
Is it OK to add a link to some software that I found useful? It's called charlint and it's a Perl script that can be used for normalisation. It can be found at http://www.w3.org/International/charlint/ Wrecktaste (talk) 15:54, 21 June 2010 (UTC)

Redirect
Glyph Composition / Decomposition redirects here, but the term glyph is not used in this article. — Christoph Päper 15:50, 27 August 2010 (UTC)

Subset
Mathematically speaking, the compatible forms are subsets of the canonical ones. But that sentence is a bit confusing and should probably be rewritten. 213.100.90.101 (talk) 16:36, 11 March 2011 (UTC)
 * Then please do so. I prefer a readable Unicode description. -DePiep (talk) 22:46, 11 March 2011 (UTC)

Rationale for equivalence
The following rationale was offered for why UNICODE introduced the concept of equivalence:
 * it was desirable that two different strings in an existing encoding would translate to two different strings when translated to Unicode, therefore if any popular encoding had two ways of encoding the same character, Unicode needed to as well.

AFAIK, this is only part of the story. The main problem (duplicated chars and composed/decomposed ambiguity) was not inherited from any single prior standard, but from the merging of multiple standards with overlapping character sets. One of the reasons was the desire to incorporate several preexisting character sets while preserving their encoding as much as possible, to simplify the migration to Unicode. Thus, for example, the ISO-Latin-1 set is included exactly in the first 256 code positions, and several other national standards (Russian, Greek, Arabic, etc.) were included as well. Some attempt was made to eliminate duplication; so, for example, European punctuation is encoded only once (mostly in the Latin-1 segment). Still, some duplicates remained, such as the ANGSTROM SIGN (originating from a set of miscellaneous symbols) and the LETTER A WITH RING ABOVE (from Latin-1).
Another reason was the necessary inclusion of combining diacritics: first, to allow for all possibly useful letter-accent combinations (such as the umlaut-n used by a certain rock band) without wasting an astronomical number of code points, and, second, because several preexisting standards used the decomposed form to represent accented letters.
Yet another reason was to preserve traditional encoding distinctions between typographic forms of certain letters, for example the superscript and subscript digits of Latin-1, the ligatures of PostScript, Arabic, and other typographically oriented sets, and the circled digits, half-width katakana and double-width Latin letters which had their own codes in standard Japanese charsets.
All these features meant that Unicode would allow multiple encodings for identical or very similar characters, to a much greater degree than any previous standard, thus negating the main advantage of a standard and making text search a nightmare. Hence the need for the standard normal forms.
Canonical equivalence was introduced to cope with the first two sources of ambiguity above, while compatibility was meant to address the last one. Jorge Stolfi (talk) 14:49, 16 June 2011 (UTC)
 * I agree it would be nice to find a source that gives the exact reasons. There are better quotes in some other Unicode articles on Wikipedia. However, except for the precomposed characters, all your reasons amount to the same thing: "an existing character set had N ways of encoding this character and thus Unicode needed N ways".
 * Precomposed characters were certainly mostly driven by the need to make it easy to convert existing encodings, and to make it easy to render readable output from most Unicode text. There may have been existing character sets with both precomposed and combining diacritics; if so, this would fall under the first explanation. But I doubt that would have led to the vast number of combined characters in Unicode. Spitzak (talk) 18:58, 16 June 2011 (UTC)


 * So Unicode equivalence is necessary. The question you want to answer is: why were NFD and NFC introduced? 86.75.160.141 (talk) 21:31, 24 October 2012 (UTC)
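The two sources of duplication discussed above (singleton duplicates like the angstrom sign, and composed vs. decomposed spellings) can be observed directly with Python's standard unicodedata module. A small illustrative sketch; the character choices are just examples, not an exhaustive list:

```python
import unicodedata

angstrom   = '\u212b'    # ANGSTROM SIGN, inherited from a symbol set
a_ring     = '\u00c5'    # LATIN CAPITAL LETTER A WITH RING ABOVE, from Latin-1
decomposed = 'A\u030a'   # LATIN CAPITAL LETTER A + COMBINING RING ABOVE

# All three spellings are canonically equivalent; NFC folds them to U+00C5.
assert {unicodedata.normalize('NFC', s)
        for s in (angstrom, a_ring, decomposed)} == {a_ring}

# The typographic variants (superscripts, ligatures, full-width forms)
# are only *compatibility* equivalent, so they need NFKC, not NFC:
assert unicodedata.normalize('NFC', '\u00b2') == '\u00b2'   # SUPERSCRIPT TWO unchanged
assert unicodedata.normalize('NFKC', '\u00b2') == '2'       # folded to plain digit
```

This also shows why two normal forms per equivalence relation exist: NFC and NFD pick the composed and decomposed representative of each canonical equivalence class, while NFKC/NFKD additionally erase the compatibility distinctions.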

Usage
This article says nothing about how Unicode equivalence is used. This means some text is missing.

I do not know of much software that relies on or supports Unicode equivalence, but there is at least one example: Wikipedia. Unicode equivalence is recognized by the Wikipedia software in a way that allows users of both NFD and NFC systems to reach the same article despite the internal difference in normalization form.

Some people would probably need a reference to prove this, but I do not bring any reference, only this demonstration:

For instance those two pages are a single article:
 * http://fr.wikipedia.org/wiki/Identit%C3%A9
 * http://fr.wikipedia.org/wiki/Identite%CC%81

The same occurs with Cancún: Canc%C3%BAn and Cancu%CC%81n (though the colors might differ, for some obscure and not obvious reason):
 * http://en.wikipedia.org/wiki/Canc%C3%BAn
 * http://en.wikipedia.org/wiki/Cancu%CC%81n
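The two URL spellings above can be checked programmatically: percent-decoding yields a precomposed (NFC) string in one case and a decomposed (NFD) string in the other, and canonical normalization maps one onto the other. A sketch using only the Python standard library:

```python
import unicodedata
from urllib.parse import unquote

nfc = unquote('Canc%C3%BAn')   # 'Canc\u00fan': precomposed U+00FA
nfd = unquote('Cancu%CC%81n')  # 'Cancu\u0301n': 'u' + COMBINING ACUTE ACCENT

# Different code point sequences...
assert nfc != nfd
# ...but canonically equivalent, which is presumably why the wiki
# software can treat both URLs as the same page title.
assert unicodedata.normalize('NFC', nfd) == nfc
assert unicodedata.normalize('NFD', nfc) == nfd
```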

I suggest using this information to improve the article, without turning the article into a «how to use Wikipedia» guide. 86.75.160.141 (talk) 19:14, 11 October 2012 (UTC)

I found this article while looking for why 𝓌𝒾𝓀𝒾𝓅𝓮𝒹𝒾𝒶.org seemed to land me on wikipedia.org; it might be a good illustrating example. 155.94.127.118 (talk) 23:36, 4 September 2019 (UTC)
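That domain trick is a compatibility-equivalence effect: the mathematical script letters carry compatibility decompositions to the plain ASCII letters, so NFKC folds them away. (Browsers actually apply the IDNA/UTS #46 mapping to domain names, but for these characters the result coincides with plain NFKC.) A sketch:

```python
import unicodedata

fancy = '𝓌𝒾𝓀𝒾𝓅𝓮𝒹𝒾𝒶.org'  # mathematical script letters, e.g. U+1D4CC for 'w'

# NFC leaves the script letters alone: they are not canonically
# equivalent to ASCII, only compatibility-equivalent.
assert unicodedata.normalize('NFC', fancy) == fancy

# NFKC applies the <font> compatibility decompositions and folds
# the whole string down to the plain ASCII domain name.
assert unicodedata.normalize('NFKC', fancy) == 'wikipedia.org'
```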

Well-formedness
"Well-formedness" refers to whether the sequences of 8-bit, 16-bit or 32-bit storage units properly define a sequence of characters (technically, 'scalar values'). Having combining characters without base characters makes a string 'defective'. There are other faults in a well-formed string that have no name, such as broken Hangul syllable blocks, characters in the wrong order (not all scripts have been set up so that canonical equivalence will 'eliminate' ambiguities), and variation selectors in the wrong places. RichardW57 (talk) 00:49, 17 June 2014 (UTC)

External links modified
Hello fellow Wikipedians,

I have just added archive links to one external link on Unicode equivalence. Please take a moment to review my edit. If necessary, add after the link to keep me from modifying it. Alternatively, you can add to keep me off the page altogether. I made the following changes:
 * Added archive https://web.archive.org/20100109162824/http://forums.macosxhints.com:80/archive/index.php/t-99344.html to http://forums.macosxhints.com/archive/index.php/t-99344.html

When you have finished reviewing my changes, please set the checked parameter below to true to let others know.

Cheers.—cyberbot II  Talk to my owner :Online 18:31, 18 January 2016 (UTC)

Canonicality
Currently the article states (under Combining and precomposed characters) that "In general, precomposed characters are defined to be canonically equivalent to the sequence of their base letter and subsequent combining diacritic marks, in whatever order these may occur", but also (under Canonical ordering) that the canonical decomposition (of U+1EBF) U+0065 U+0302 U+0301 "is not equivalent with U+0065 U+0301 U+0302". Either this is a contradiction and should be fixed, or some further clarification would help.
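The two statements are reconcilable: canonical ordering only reorders combining marks with *different* canonical combining classes. U+0302 and U+0301 both have combining class 230, so their relative order is significant and the swapped sequence is genuinely not equivalent. A sketch demonstrating both cases with Python's unicodedata:

```python
import unicodedata

# Full canonical decomposition of U+1EBF (e with circumflex and acute):
assert unicodedata.normalize('NFD', '\u1ebf') == 'e\u0302\u0301'

# Both marks have canonical combining class 230, so canonical
# reordering never swaps them: the other order stays distinct.
assert unicodedata.combining('\u0302') == unicodedata.combining('\u0301') == 230
assert unicodedata.normalize('NFD', 'e\u0301\u0302') != 'e\u0302\u0301'
assert unicodedata.normalize('NFC', 'e\u0301\u0302') != '\u1ebf'

# Marks with different combining classes DO get reordered:
# U+0323 (dot below, class 220) sorts before U+0301 (class 230),
# so both input orders normalize to the same sequence.
a = unicodedata.normalize('NFD', 'e\u0301\u0323')
b = unicodedata.normalize('NFD', 'e\u0323\u0301')
assert a == b == 'e\u0323\u0301'
```

So the "in whatever order" wording in the article is too broad; it holds only across different combining classes.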

German Article
There seems to be a German version, which is not linked: https://de.wikipedia.org/wiki/Normalisierung_(Unicode) Skillabstinenz (talk) 20:22, 23 June 2020 (UTC)

Naming

 * Why do we have this article named «Unicode equivalence» instead of «Unicode normalization forms»? It's confusing.

AXO NOV (talk) ⚑ 10:43, 21 January 2022 (UTC)