Talk:List of XML and HTML character entity references

Canonical Reference
The W3C standard XML Entity Definitions for Characters (1 April 2010) is the final authority on entity names. The original ISO standards committee (ISO/IEC JTC 1/SC 34) invited the W3C MathML working group to take over the maintenance and development of entity names, and the Unicode Consortium accepts the ISO recommendation. Since there is one defining document for all entity names, it should be referenced as the authoritative document. Other references for entity names should be shown for historical reasons, since some entity names have been associated with different characters over time (for example, 'lang' and 'rang' moved from U+2329 and U+232A to U+27E8 and U+27E9 respectively). —Preceding unsigned comment added by Joejava (talk • contribs) 17:04, 16 November 2012 (UTC)

Octal anyone?
I don't think that the standard references octal numbers, but some of my text tools (e.g. Unix's od(1) command) output octal representations of data. It sure would be convenient to search on whatever we've got without having to convert to hex. Since was the one that triggered this thought, here's a proposal for an alternative.

And... since I'm guessing that this is machine generated, here's a Perl snippet that prints the (augmented) cell.

MichaelRWolf (talk) 14:03, 28 November 2009 (UTC)

P.S. I'd be glad to flesh out this line of Perl to generate the entire table, should you like.
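
For anyone who wants to experiment with the octal idea, here is a minimal sketch in Python (not the Perl snippet mentioned above, which is not reproduced on this page). The function name `format_code_point_cell` and the exact cell layout are assumptions made for illustration only.

```python
def format_code_point_cell(code_point: int) -> str:
    """Return a cell showing hex, decimal and octal forms of one code point.

    The layout 'U+HHHH (decimal, octal NNN)' is a hypothetical format,
    not the one used in the article's table.
    """
    return f"U+{code_point:04X} ({code_point}, octal {code_point:o})"


# Print the augmented cell for a couple of sample code points.
for cp in (0x00A0, 0x2013):
    print(format_code_point_cell(cp))
```

With a cell like this, searching the page for the octal value reported by od(1) would find the row directly.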

&#nnn; or &#nnnn;
Some cheat sheets show 3-digit references, some show 4-digit references. If I'm correct, the 3-digit references refer to ISO-8859-1 and the 4-digit references refer to ISO 10646/Unicode.

For example, I'd like to use an en dash on my site, but I'm not sure whether to use  or  …

Which should I be using, or does it depend on my encoding (or something else)?

Thanks,

Wulf (2006-08-28T23:28:00Z)


 * Your encoding and the number of digits don't matter, but the range of numbers represented by those digits does.  through , whether you write them like that or with any number of leading zeroes (or in hexadecimal form preceded by 'x'), are technically not allowed in HTML documents, and if they were, they would, according to the specs, refer to non-printing control codes.


 * Browsers that render some refs in that range as if they were references to Windows-1252 bytes, rather than UCS code points, are doing so only for backward compatibility with pre-HTML 4 browsers that were trying to accommodate authors who were using those refs in an attempt to put certain then-illegal characters (such as the Euro symbol, en dash, em dash, and curved quotation marks) in their documents. If you use the proper codes for the characters you want (most of which would indeed require 4 digits), you should see them in all modern browsers and environments. —mjb 05:36, 30 August 2006 (UTC)


 * Thanks :) –Wulf 03:30, 1 September 2006 (UTC)
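
The point about leading zeroes and the Windows-1252 range can be tried out with a short sketch. It uses Python's standard `html.unescape`, which follows the HTML5 parsing rules; the specific references below (`&#150;`, `&#8211;` and so on) are chosen purely for illustration and are not necessarily the ones discussed above.

```python
import html

# A reference in the 128-159 range. Parsers that follow the HTML5 rules
# treat it as the Windows-1252 byte 0x96 and remap it to the en dash.
legacy = html.unescape("&#150;")

# References to the en dash's real code point, U+2013, written in
# several equivalent ways.
decimal = html.unescape("&#8211;")
padded = html.unescape("&#008211;")      # leading zeroes make no difference
hexadecimal = html.unescape("&#x2013;")  # hexadecimal form preceded by 'x'
named = html.unescape("&ndash;")

print([f"U+{ord(c):04X}" for c in (legacy, decimal, padded, hexadecimal, named)])
```

All five decode to the same character here only because of the backward-compatibility remapping; the 4-digit form is the one that refers to the en dash directly.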

need to add
ř is a Czech character that is used in the name of the composer Dvořák, but I don't know the rest of the information for that row. I just know it would be useful to list. Symphony Girl (talk) 00:43, 6 May 2008 (UTC)
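
One way to collect the information for such a row is sketched below. It relies on Python's `html.entities.html5` table, which mirrors the HTML5 list of named references (so it is newer than the HTML 4 list this comment was written against); the helper name `describe` is hypothetical.

```python
import unicodedata
from html.entities import html5


def describe(char: str) -> dict:
    """Collect code point, Unicode name and HTML5 entity names for one character."""
    names = sorted(name for name, value in html5.items() if value == char)
    return {
        "code_point": f"U+{ord(char):04X}",
        "decimal": ord(char),
        "unicode_name": unicodedata.name(char),
        "entity_names": names,
    }


print(describe("ř"))
```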

character entity reference
I'd like to know what allowable names are for non-numeric entity references. a-z, numbers, dashes seem to be allowed, but what about underscores? Other characters? Case sensitivity? How long can a name be?

Also, it appears that, at least in SGML, entity values are not restricted to one character. Is there a length limit, and how does it compare to XML? 85.178.100.140 (talk) 17:40, 8 December 2007 (UTC)

Vertical bar
What is the code for "|"? Since the code for the broken vertical bar exists, shouldn't one exist for the "original", unbroken version? __meco (talk) 14:40, 9 June 2010 (UTC)
 * (U+007C). &#124; . fileformat.info. Dan ☺ 19:54, 10 June 2010 (UTC)
 * &#x7c;. —Tamfang (talk) 20:17, 10 June 2010 (UTC)

In the article. __meco (talk) 21:14, 10 June 2010 (UTC)
 * We're funnin' ya. Since the common-or-garden pipe is not a special character in HTML, nor an extension to the "original" character set, it needs no code other than "|"; but any character can be specified by its Unicode number, as shown above.  Same goes for the "original" unaccented 'e'. —Tamfang (talk) 02:34, 11 June 2010 (UTC)

Case sensitivity of named character entities
The article does not mention anywhere whether (XML and/or HTML) named entities are case sensitive or not.

I.e. do &apos; &APOS; &Apos; and &apoS; all signify the same apostrophe character, or is only the first of the preceding list valid?

For HTML character entities, there are separate definitions that differ only by case (e.g. &Oslash; and &oslash; for an upper-/lowercase letter "O" with a forward slash, Ø and ø). But does the standard allow "free case" where no ambiguity exists?

—Preceding unsigned comment added by Mortenhattesen (talk • contribs) 08:31, 6 December 2010 (UTC)

-- No idea how to reply but they are case-sensitive in both HTML and XML. —Preceding unsigned comment added by 83.85.115.123 (talk) 17:57, 4 January 2011 (UTC)
 * Entity names have been case sensitive since HTML 2.0. See RFC 1866, section 3.2.3, which says "Element and attribute names are not case sensitive, but entity names are. For example, `<BLOCKQUOTE>', `<BlockQuote>', and `<blockquote>' are equivalent, whereas `&amp;' is different from `&AMP;'."


 * However, the OP's question asked about &apos; &APOS; &Apos; and &apoS;. None of those are valid entity names for HTML 2.0 through 4.01. &apos; is part of the HTML 5.0 proposal and is in XHTML 1.0. --Marc Kupper|talk 18:24, 5 September 2011 (UTC)
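
The case sensitivity described in this thread can be checked with a small sketch. It uses Python's `html.unescape`, which implements the HTML5 named references rather than the HTML 2.0 to 4.01 lists discussed above, so `&apos;` is recognised and HTML5 happens to define `&AMP;` as a separate name of its own. Unrecognised spellings are simply left as they are.

```python
import html

refs = ("&Oslash;", "&oslash;", "&amp;", "&AMP;",
        "&apos;", "&APOS;", "&Apos;", "&apoS;")

# Map each reference to what an HTML5-style decoder makes of it.
decoded = {ref: html.unescape(ref) for ref in refs}

for ref, text in decoded.items():
    status = "unchanged (not a defined name)" if text == ref else repr(text)
    print(f"{ref:10} -> {status}")
```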

Apos entity
HTML 4 doesn't include the "apos" entity. However, with "apos", the list consists of 253 items. — Preceding unsigned comment added by 85.50.221.168 (talk) 14:55, 31 October 2011 (UTC)

Title
As XML does not have "character entity references" but "predefined entities" is this the best title? Widefox (talk) 10:13, 13 June 2012 (UTC)

HTML5
HTML5 adds a truckload of new named references, and changes a few from HTML 4.0 (like &lang; and &rang;). How should we handle this? 08:25, 15 October 2014 (UTC)

Perpendicular or bottom?
Unicode spec says:

22A4 ⊤ DOWN TACK

= top

→ 2E06 ⸆  raised interpolation marker

→ 1F768 🝨  alchemical symbol for crucible-4

22A5 ⊥ UP TACK

= base, bottom

→ 27C2 ⟂  perpendicular

So how is the XML perp defined? 22A5 would not make sense

I'm sorry I don't have time to investigate now :( — Preceding unsigned comment added by 37.152.9.190 (talk) 17:29, 2 December 2015 (UTC)
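
A quick way to look into this is sketched below, assuming Python's `html` module (which follows the HTML5 named references, not necessarily every older XML entity set) and its bundled `unicodedata` database.

```python
import html
import unicodedata

# Decode the named reference and report which code point it maps to.
perp = html.unescape("&perp;")
print(f"&perp; -> U+{ord(perp):04X} {unicodedata.name(perp)}")

# For comparison, the character that the Unicode charts cross-reference.
print(f"U+27C2 {unicodedata.name(chr(0x27C2))}")
```

In the HTML5 list, at least, perp maps to U+22A5 (UP TACK) rather than U+27C2.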

Spaces
This page defines a complete set of space codes in the range U+2000 to U+200B but does not give them character entity codes. This page shows some, possibly all of them. Sorry, I do not feel moved to chase up their history and add the missing ones to this table. — RHaworth (talk · contribs) 10:02, 19 January 2019 (UTC)

Updated spec from WHATWG
I understand that since this W3C announcement, the canonical reference for the named entities is the WHATWG’s list of named references. I updated the spec link and table accordingly. Two major changes are that:
 * 1) some entities are also valid without the trailing semicolon (they seem to be those of the DTD HTMLLat1, and some of HTMLspecial);
 * 2) some entities correspond to two code points (but still to one grapheme). Rangitoto2 (talk) 06:21, 17 September 2019 (UTC)
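
Both changes can be observed in Python's `html.entities.html5` dictionary, which is generated from the HTML5 list of named references. The sketch below is only an illustration using that table; it is not the code used to update the article.

```python
from html.entities import html5

# 1) Names that are also valid without the trailing semicolon.
no_semicolon = sorted(name for name in html5 if not name.endswith(";"))

# 2) Names whose replacement text is more than one code point.
multi_code_point = {name: value for name, value in html5.items() if len(value) > 1}

print(len(no_semicolon), "names without semicolon, e.g.", no_semicolon[:5])
print(len(multi_code_point), "names with several code points, e.g. NotEqualTilde; =",
      [f"U+{ord(c):04X}" for c in html5["NotEqualTilde;"]])
```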

Automated checking
I programmatically verified that the table respects a few rules:


 * A. all named entities from the spec are in the table
 * B. no named entity outside of the spec is in the table
 * C. the decimal code points corresponding to the named entities are as per the spec
 * D. the code points are in format "U+HHHH (D)"
 * E. the hexadecimal value of the code points match the decimal value
 * F. the default order of the entities is a) per ascending number of code points, b) per ascending value of the code points
 * G. there are no duplicate code points (so that named entities with the same code points are grouped in the same row)
 * H. the descriptions of the named entities consist of the name of the Unicode code points as per the Unicode standard, optionally followed by a wiki reference and/or additional text in parentheses
 * I. the characters match the decimal code points

This checks the correctness of four of the table's columns: “Names”, “Character”, “Unicode code point (decimal)”, “Description”. I am not sure about the other columns, and I may have made mistakes. In particular I have added entities with the value “HTML 5.0” for the “Standard” column, but I think that the WHATWG only has a living standard (as opposed to the W3C which has versions). So please feel free to fix those if need be. — Rangitoto2 (talk) 06:21, 17 September 2019 (UTC)
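
As an illustration of what rules D and E amount to, here is a minimal Python sketch that works on the text of a single cell. It is not the JavaScript used for the checks above; the function name `check_code_point_cell` and the choice of 4 to 6 uppercase hex digits are assumptions.

```python
import re

# Rule D: the cell must look like "U+HHHH (D)".
CELL = re.compile(r"U\+([0-9A-F]{4,6}) \((\d+)\)")


def check_code_point_cell(cell: str) -> list[str]:
    """Return a list of problems found in one 'U+HHHH (D)' cell (empty if none)."""
    match = CELL.fullmatch(cell)
    if match is None:
        return [f"{cell!r} is not in the format 'U+HHHH (D)'"]
    hex_value = int(match.group(1), 16)
    dec_value = int(match.group(2))
    # Rule E: the hexadecimal and decimal values must agree.
    if hex_value != dec_value:
        return [f"{cell!r}: hex {match.group(1)} is {hex_value}, not {dec_value}"]
    return []


for sample in ("U+00A0 (160)", "U+00A0 (161)", "u+a0 160"):
    print(sample, "->", check_code_point_cell(sample) or "ok")
```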

Making use of the code
To anyone maintaining the table, please consider making use of the code I used to check the rules mentioned above. This is JavaScript code to run in the browser console. Note that if you do not trust this code, do not execute it. Running untrusted code can present security risks.

To make use of the code, go to the article page, then open the JavaScript console of the browser (F12 is a common shortcut for that), then paste the following snippets:

This assigns the  element of the table to a variable. It will be needed for the various checks. — Rangitoto2 (talk) 06:21, 17 September 2019 (UTC)

Check list of named entities
There are two steps to perform the checks A, B, C. First open a tab on https://whatwg.org, and in the console paste the following function:

Call it as follows to replace the content of the page with the JavaScript object  which contains the spec of the named entities. Copy the content of that page, and paste it in the console of the tab for the wikipedia article.

Then, paste the following function which checks the wikipedia table using the object above.

Call it as follows:

Fix all errors before proceeding. — Rangitoto2 (talk) 06:21, 17 September 2019 (UTC)

Check code points
The following function performs the checks D, E, F, G:

Call it as follows:

Fix all errors before proceeding. — Rangitoto2 (talk) 06:21, 17 September 2019 (UTC)

Check descriptions
To perform the check H, two steps are necessary. First, go to the Unicode data page URL at https://www.unicode.org/Public/UNIDATA/UnicodeData.txt and run the following code:

This will transform the data in the page into the JS object  needed to perform the description check. Copy the updated content of the page, and paste it in the JavaScript console of the Wikipedia article page. Beware, however, that it may slow down your browser considerably (on my system, after I pasted the object in the console, trying to use the DOM inspector on that page caused Firefox to freeze). I suspect that this is because the object is very large.

Then paste the following function to check the table:

Call it as follows:

Currently it outputs two warnings, as there is extra explanation for the following rows:

The description of entities equiv, Congruent has extra text after the Unicode name of its code point(s):
-> Unicode name is "identical to"
-> extra tailing text is "; sometimes used for 'equivalent to' or 'congruent'"

The description of entities nequiv, NotCongruent has extra text after the Unicode name of its code point(s):
-> Unicode name is "not identical to"
-> extra tailing text is "; sometimes used for 'not congruent'"

— Rangitoto2 (talk) 06:21, 17 September 2019 (UTC)
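
The idea behind check H can be sketched in Python for a single code point. This swaps in the standard `unicodedata` module for the downloaded UnicodeData.txt object, handles only one code point per description, and uses a hypothetical helper name, `extra_description_text`; it is not the function referred to above.

```python
import unicodedata


def extra_description_text(code_point: int, description: str):
    """Return the text that follows the Unicode name in a description.

    Returns None if the description does not start with the Unicode name
    (compared case-insensitively), and '' if there is no extra text.
    """
    name = unicodedata.name(chr(code_point)).lower()
    if not description.lower().startswith(name):
        return None
    return description[len(name):]


print(repr(extra_description_text(
    0x2261, "identical to; sometimes used for 'equivalent to' or 'congruent'")))
```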

Check characters
The following function performs the check I:

Call it as follows: — Rangitoto2 (talk) 06:21, 17 September 2019 (UTC)

Automate table row code creation
Here is a helper function which generates a new row for entities with a given code point. It needs the object  generated above.

It can be called for example as follows: — Rangitoto2 (talk) 06:21, 17 September 2019 (UTC)

Representation in HTML
There should be another (second?) column in the table in List of XML and HTML character entity references showing each entity as it is written in code, e.g. &Tab;, &NewLine;, &excl; etc. It's hard to find e.g. the "ge" entity now; searching for "&ge;" would be much easier.

Do you support this, or could I add it? (see this) Segu (talk) 20:12, 15 February 2020 (UTC)

Does Wikipedia support HTML5 ?
The codes from HTML 4 and earlier work — for example &Acirc; ( Â ) — but the codes from HTML5 do not work for me — for example &Scedil; ( Ş - &Scedil; ). I tried in Chrome and Firefox.

Does Wikipedia not support HTML5? —  Ark25  (talk) 17:44, 28 March 2020 (UTC)
 * It doesn't work for me either, in SeaMonkey 2.53.2 (Gecko 60.3.2) on Linux. I see your Ş when written in UTF-8, but the following entity is spelled out as an uninterpreted entity. What does work is using either the decimal value (&#350; gives Ş) or the hex value (&#x15E; gives Ş and &#x15e; gives Ş), but of course to use them you have to know the decimal or hex value of the code point, which is not always easy to find. The Unicode scripts chart and the Unicode character name index can help you. Every script or character link there leads to a PDF file containing a part of the current Unicode character list. — Tonymec (talk) 18:27, 28 March 2020 (UTC)


 * This is remarkably hard to find a solid answer for. 8-(
 * AFAICS, Mediawiki has been HTML5 (in the output) for some years. They've also adopted the idea of a rigorously XML-compliant output model, and with full Unicode support. There's also strong encouragement for authors to use Unicode directly, rather than entities.
 * An XML output model raises an old issue with XHTML: which entities are permitted? For some XML parsing models, none of them (except the five XML entities) are usable. In others, the HTML DTD is parsed (or assumed) and the HTML entities are permissible. But which set of entities? In particular, HTML5 doesn't indicate the DTD to be used (it's implicit, by defined HTML5 behaviour outside the normal XML or SGML parsing models). Clearly (by observation), Mediawiki passes HTML 4 entities through as entities but anything else (including HTML5 entities) is escaped. I can't explain this choice, and I can't find a source for the decision. I'm puzzled as to why: passing them through would work (HTML5 is accepted as effectively universal), converting them to characters would work (it's Unicode clear throughout), but this gives editors unexpected behaviour depending on whether an entity is HTML5 or HTML 4.
 * Note that this isn't a browser behaviour. What Mediawiki puts out in the source for a page is only too clear. Andy Dingley (talk) 01:51, 29 March 2020 (UTC)
 * Really strange, imo. I just created a .htm file in Notepad, and Chrome, Firefox and IE have no problem showing &Scedil; as "Ş". Why MediaWiki refuses to parse it is quite a mystery. —  Ark25  (talk) 21:42, 17 April 2020 (UTC)
 * Possibly this might help one day? Unless it has been removed by this. All the best: Rich Farmbrough  (the apparently calm and reasonable) 17:53, 17 May 2020 (UTC).
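
For editors who need the decimal or hex value of a code point in order to use a numeric reference instead of an HTML5 name, here is a minimal Python sketch. It assumes Python's `html.unescape`, which knows the HTML5 names; the helper name `numeric_references` is hypothetical.

```python
import html


def numeric_references(named_ref: str) -> list[tuple[str, str]]:
    """Return (decimal, hex) numeric references for each code point of a named reference."""
    text = html.unescape(named_ref)
    if text == named_ref:
        # html.unescape leaves unknown names untouched.
        raise ValueError(f"{named_ref!r} is not a recognised named reference")
    return [(f"&#{ord(c)};", f"&#x{ord(c):X};") for c in text]


print(numeric_references("&Scedil;"))
```

The output gives the two numeric forms that were reported above to work where the name itself is not interpreted.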