User:Topbanana/Reports/This article contains a malformed HTML entity

Overview
The list below shows articles containing malformed HTML entities. It was generated on 7th December 2004 from the 15th November 2004 database dump.

Preamble
Malformed HTML entities may not be rendered correctly in some browsers. They should begin with an ampersand, end with a semi-colon and contain a valid token in the middle.
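
The three conditions above can be sketched as a small check. This is only an illustration, not part of the report generator: the function name `is_valid_entity` is hypothetical, and Python's standard `html.entities.name2codepoint` table (the HTML 4 entity names) is assumed to stand in for the set of "valid tokens".

```python
import re
from html.entities import name2codepoint

# Well-formed reference: '&', then a named or numeric token, then ';'.
ENTITY = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

def is_valid_entity(text):
    """Return True if text is exactly one well-formed, known entity."""
    match = ENTITY.fullmatch(text)
    if match is None:
        return False
    token = match.group(1)
    # Numeric references are accepted on form alone; names must be known.
    return token.startswith("#") or token in name2codepoint

# Missing semicolon, missing ampersand and a misspelt name are all rejected.
for sample in ["&nbsp;", "&nbsp", "nbsp;", "&nsbp;", "&#x391;"]:
    print(sample, is_valid_entity(sample))
```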

How does this work?

 * 1) Examine the links shown on the report below and correct those that are wrong.
 * 2) Delete from the list those things you've fixed, or
 * 3) Score out false positive suggestions by enclosing them within strikethrough tags, and explain why the suggested correction is inappropriate
 * 4) Optionally, mark your edit comment with the following text, so that other users are drawn to this page and more people help fix wiki errors: Help Wikipedia fix suspected malformed HTML entities - click here!

Regenerating this report
This report is generated from a Link Analysis Database using the SQL:

-- Rebuild the lookup table of entity names to check.
DROP TABLE IF EXISTS html_entity;

CREATE TABLE html_entity ( code varchar(32) NOT NULL, PRIMARY KEY( code ) ) ENGINE=MyISAM;

INSERT INTO html_entity VALUES ( 'sup1' ), ( 'sup2' ), ( 'sup3' ), ( 'amp' ), ( 'lt' ), ( 'gt' ), ( 'nbsp' ), ( 'mdash' );

-- List each article whose text contains '&code' not followed by ';'.
SELECT concat( '*', art_title, ' - check ', group_concat( code ) )
FROM art, html_entity
WHERE art_text REGEXP concat( '&', code, '([^;]|$)' )
GROUP BY art_title
ORDER BY art_title;
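
For readers without the Link Analysis Database to hand, the same logic can be sketched in Python. This is an approximation under stated assumptions: the `articles` dictionary is a hypothetical stand-in for the `art` table, the function names are invented for illustration, and Python's `re` module is not identical to MySQL's REGEXP (for example in case sensitivity).

```python
import re

# The same entity names the SQL inserts into html_entity.
CODES = ["sup1", "sup2", "sup3", "amp", "lt", "gt", "nbsp", "mdash"]

def unclosed_codes(text):
    """Return the codes that occur as '&code' not followed by ';'."""
    hits = []
    for code in CODES:
        # Same pattern the query builds: '&' + code + '([^;]|$)'.
        if re.search("&" + code + "([^;]|$)", text):
            hits.append(code)
    return hits

def report(articles):
    """articles maps title -> wikitext (hypothetical stand-in for art)."""
    lines = []
    for title in sorted(articles):
        hits = unclosed_codes(articles[title])
        if hits:
            lines.append("*" + title + " - check " + ",".join(hits))
    return lines

sample = {
    "Example_A": "10 m&sup2 of floor, then 5&nbsp;m",
    "Example_B": "AT&amp;T is fine here",
}
print("\n".join(report(sample)))
```

Only Example_A is listed, because its `&sup2` lacks the closing semicolon while every entity in Example_B is closed.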

Suggested improvements

 * Specify the line on which the HTML is malformed, and/or specify the nature of the malformation.
 * This regexp only searches for one type of malformation: leaving off the final semicolon. --ChrisRuvolo 17:58, 18 Nov 2004 (UTC)
 * Flag false positives that aren't really malformed HTML entities, for example the use of &amp;c instead of etc or similar. (There were several of these in Abbey.) It would also be worth searching for &amp;c;, which is valid if uncommon English masquerading as a correctly formed but non-existent HTML entity. -- Avaragado 20:59, 24 Nov 2004 (UTC)
 * Well, even if they're not supposed to be HTML entities, they still make malformed HTML and should have the ampersand replaced with &amp;amp; DopefishJustin (&#12539;&#8704;&#12539;) 23:16, Nov 29, 2004 (UTC)
 * Search for other HTML entities. Common ones should include: lt, gt, amp, nbsp, mdash, sup1, sup3, and numeric entities (e.g. &amp;#x391;). A regexp like this might work:

REGEXP '&amp;(sup[123]|amp|lt|gt|nbsp|mdash|#x[0-9a-fA-F]*)([^;]|$)'


 * --ChrisRuvolo 22:59, 24 Nov 2004 (UTC)
 * How about REGEXP '&amp;(sup[123]|amp|lt|gt|nbsp|mdash|#x[0-9A-Fa-f]+|#[0-9]+)([^;]|$)' – ABCD 23:53, 29 Dec 2004 (UTC)
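
One point worth checking before adopting either numeric pattern above: because `[0-9]+` may match fewer digits than are present, a correctly closed numeric entity such as `&#913;` can still match, with the last digit taken as the "not a semicolon" character. The sketch below demonstrates this with Python's `re` module, writing `&amp;` as a literal `&`. `TIGHTENED` is a hypothetical variant, not something proposed in the discussion, and MySQL's regexp engine may differ in other details.

```python
import re

# ABCD's pattern as posted, with '&amp;' written as a literal '&'.
AS_POSTED = re.compile(
    r"&(sup[123]|amp|lt|gt|nbsp|mdash|#x[0-9A-Fa-f]+|#[0-9]+)([^;]|$)"
)

# Hypothetical tightening: after a numeric token the next character must not
# be another digit, so the match cannot stop short inside the number.
TIGHTENED = re.compile(
    r"&((sup[123]|amp|lt|gt|nbsp|mdash)([^;]|$)"
    r"|#x[0-9A-Fa-f]+([^;0-9A-Fa-f]|$)"
    r"|#[0-9]+([^;0-9]|$))"
)

def flags(pattern, text):
    """Return True if the pattern reports an unclosed entity in text."""
    return pattern.search(text) is not None

# Columns: text, result as posted, result with the tightened variant.
for text in ["m&sup2 ", "&#913", "&#913;", "&#x391;"]:
    print(repr(text), flags(AS_POSTED, text), flags(TIGHTENED, text))
```

The first two inputs are genuinely unclosed and both patterns flag them. The last two are correctly closed, yet the posted pattern flags them and the tightened one does not.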


 * Search for missing leading ampersand. Possible regexp:

REGEXP '(^|[^&amp;])(sup[123]|amp|lt|gt|nbsp|mdash|#x[0-9a-fA-F]*);'


 * --ChrisRuvolo 22:59, 24 Nov 2004 (UTC)
 * Note, this needs improvement. As it stands, it would hit false positives showing how HTML entities should look.  e.g.: &amp;amp;sup2;  --ChrisRuvolo 23:03, 24 Nov 2004 (UTC)
 * And also words like "lamp", "felt", etc, when followed by a semi-colon. IMHO checking for missing leading ampersand is unnecessary. I'm not sure I've ever seen one of these in the wild - people forget the semi-colon all the time, but remember the ampersand. -- Avaragado 23:13, 24 Nov 2004 (UTC)
 * This is true. I withdraw my suggestion. --ChrisRuvolo 23:56, 24 Nov 2004 (UTC)
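
The false positives described in this thread are easy to reproduce. The sketch below runs the withdrawn pattern (with `&amp;` written as a literal `&`) through Python's `re` module against a few made-up sample strings; it is only an illustration of why the suggestion was dropped.

```python
import re

# ChrisRuvolo's withdrawn pattern, with '&amp;' written as a literal '&'.
MISSING_AMP = re.compile(
    r"(^|[^&])(sup[123]|amp|lt|gt|nbsp|mdash|#x[0-9a-fA-F]*);"
)

samples = [
    "area in m sup2;",                # intended hit: ampersand really is missing
    "a lamp; a felt; hat",            # ordinary words ending in 'amp;' and 'lt;'
    "write &amp;sup2; for a square",  # text showing how an entity should look
    "m&sup2;",                        # correctly formed entity
]
for text in samples:
    print(repr(text), bool(MISSING_AMP.search(text)))
```

Only the last sample is left alone; the second and third are flagged even though nothing in them is malformed.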


 * List updated. I've opted for a slightly more databasey mechanism to report on several different HTML entities. Yes, this still just checks for unclosed entities - there's work to be done on other malformations, not forgetting out-and-out typos such as &nsbp;. However, as it stands, the list shows enough problems to keep everyone occupied for now ;) - TB 09:30, 2004 Dec 7 (UTC)
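
Out-and-out typos such as &nsbp; are closed correctly, so the unclosed-entity check cannot see them. One possible approach, sketched here purely as an assumption and not as part of the report, is to look up each closed name in a table of known entities and suggest near matches. The function name is hypothetical; `html.entities.name2codepoint` and `difflib.get_close_matches` are from the Python standard library.

```python
import re
from difflib import get_close_matches
from html.entities import name2codepoint

# Closed named references: '&', a name, then ';'.
NAMED = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def suspicious_entities(text):
    """Return (name, suggestions) pairs for closed entities with unknown names."""
    results = []
    for name in NAMED.findall(text):
        if name not in name2codepoint:
            # Up to five known names that are textually close to the typo.
            suggestions = get_close_matches(name, list(name2codepoint), n=5)
            results.append((name, suggestions))
    return results

print(suspicious_entities("10&nsbp;km and 5&nbsp;m"))
```

The suggestions are only textual near misses, so a human still has to pick the intended entity.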


 * It's been nearly a year. Can we get another run of this report, TB?  Thanks.  --ChrisRuvolo (t) 04:47, 1 November 2005 (UTC)

List
'''The list below omits sup2s for now - most seem to have been fixed after the database dump I'm working from was taken. - TB 11:21, 2004 Dec 7 (UTC)'''

False Positives

 * Franklin_W._Olin_College_of_Engineering - check lt
 * false positive&mdash;part of the URL for an external link
 * U.S._presidential_election,_2012 - check nbsp
 * false positive&mdash;nothing found in substub article, no significant history
 * University_of_Dallas - check lt
 * false positive&mdash;it's part of the URL for an external link
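
Two of the false positives above come from query strings inside external-link URLs. A future run could perhaps blank out URLs before testing for unclosed entities. The sketch below shows the idea under stated assumptions: the `URL` pattern is a rough approximation rather than a full URL grammar, and the function name and sample text are invented for illustration.

```python
import re

# Rough URL matcher: scheme, then everything up to whitespace or a bracket.
URL = re.compile(r"(?:https?|ftp)://[^\s\]<>\"]+")
UNCLOSED = re.compile(r"&(sup[123]|amp|lt|gt|nbsp|mdash)([^;]|$)")

def unclosed_outside_urls(text):
    """Return unclosed entity names, ignoring anything inside a URL."""
    without_urls = URL.sub(" ", text)
    return [m.group(1) for m in UNCLOSED.finditer(without_urls)]

text = "[http://example.org/search?a=1&lt=2 results] and 5 &lt 6"
print(unclosed_outside_urls(text))
```

Here the `&lt=2` inside the link is ignored, while the bare `&lt` in the running text is still reported.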

Additional problems found

 * List_of_Famicom_games - Malformed HTML entity fixed, but faulty table markup leaves an extraneous &lt;tr&gt;&lt;td&gt; at top. Fixed --Phil | Talk 12:26, Mar 9, 2005 (UTC)