User:Chrisahn/CommentTooLong

 See http://lists.wikimedia.org/pipermail/wikitech-l/2009-October/thread.html#45709 

''' I fixed it. '''

What's the problem?
I use mwdumper (latest SVN revision 57818) to import jawiki-20090927-pages-articles.xml into MySQL, but I get an error:


 * Data too long for column 'rev_comment'

The problem is that the xml file contains a revision comment that is 257 bytes long, but the column accepts at most 255 bytes.

What may be the cause?
On the Wikipedia page, the comment ends with the byte 'e3', while in the xml file it ends with 'ef bf bd'.

I think the cause is something like this:


 * Comments are truncated to 255 bytes when they are stored.
 * In this case, this means that a three-byte UTF-8 sequence is cut off after its first byte (hex value e3), so the comment ends with an invalid one-byte UTF-8 sequence.
 * The dump process has to generate valid UTF-8 (otherwise, most XML parsers wouldn't accept the file), so it replaces the invalid one-byte UTF-8 sequence by the 'replacement character' U+FFFD, which has the three-byte UTF-8 sequence 'ef bf bd'. See UTF-8 table U+FFF0 to U+100EF.
 * In this case, the comment grows from 255 bytes to 257 bytes.

But also see Odd...

How to fix it?
MediaWiki should make sure that a comment contains only valid UTF-8 sequences, even when it is truncated. This may mean that it has to be truncated to less than 255 bytes.

Alternatively, the dump process could drop invalid UTF-8 sequences instead of replacing them.

Yet another fix: mwdumper should make sure that a comment is at most 255 bytes long and truncate it if necessary.

Hacked it
I implemented the mwdumper fix / hack and checked it in at dbpedia.svn.sourceforge.net. Seems to fix that problem for me. Feel free to copy that code back to mediawiki if you want.

The gory details
The culprit is the comment of this diff.

In the HTML code of that page, the comment ends with the byte 'e3'.

Here's the relevant section from the xml file:

7th ストリート/メトロセンター駅 (ロサンゼルス郡都市圏交通局) 1876626 27886676 2009-09-10T15:23:02Z まあイケる冗談 332172 ページ 7th ストリート/メトロセンター駅 (ロサンゼルス郡都市圏交通局) を 7th ストリート/メトロセンター駅 へ移動: 7th ストリート/メトロセンターという名前の駅は 今のところ他には� #転送 7th ストリート/メトロセンター駅

Here are the 257 bytes of the comment, ending with the replacement character 'ef bf bd':

00000000 20 20 20 20 20 20 3c 63  6f 6d 6d 65 6e 74 3e e3  |      .| 00000010 83 9a e3 83 bc e3 82 b8  20 5b 5b 37 74 68 20 e3  |........ 00000020 82 b9 e3 83 88 e3 83 aa  e3 83 bc e3 83 88 2f e3  |............../.| 00000030  83 a1 e3 83 88 e3 83 ad  e3 82 bb e3 83 b3 e3 82  |................| 00000040  bf e3 83 bc e9 a7 85 20  28 e3 83 ad e3 82 b5 e3  |....... (.......| 00000050  83 b3 e3 82 bc e3 83 ab  e3 82 b9 e9 83 a1 e9 83  |................| 00000060  bd e5 b8 82 e5 9c 8f e4  ba a4 e9 80 9a e5 b1 80  |................| 00000070  29 5d 5d 20 e3 82 92 20  5b 5b 37 74 68 20 e3 82  |) ... 00000080 b9 e3 83 88 e3 83 aa e3  83 bc e3 83 88 2f e3 83  |............./..| 00000090  a1 e3 83 88 e3 83 ad e3  82 bb e3 83 b3 e3 82 bf  |................| 000000a0  e3 83 bc e9 a7 85 5d 5d  20 e3 81 b8 e7 a7 bb e5  |...... .......| 000000b0 8b 95 3a 20 37 74 68 20  e3 82 b9 e3 83 88 e3 83  |..: 7th ........| 000000c0 aa e3 83 bc e3 83 88 2f  e3 83 a1 e3 83 88 e3 83  |......./........| 000000d0 ad e3 82 bb e3 83 b3 e3  82 bf e3 83 bc e3 81 a8  |................| 000000e0 e3 81 84 e3 81 86 e5 90  8d e5 89 8d e3 81 ae e9  |................| 000000f0 a7 85 e3 81 af e4 bb 8a  e3 81 ae e3 81 a8 e3 81  |................| 00000100 93 e3 82 8d e4 bb 96 e3  81 ab e3 81 af ef bf bd  |................| 00000110 3c 2f 63 6f 6d 6d 65 6e  74 3e 0a                 | .|

Even more (probably irrelevant) details
mysql Ver 14.12 Distrib 5.0.67, for suse-linux-gnu (x86_64) using readline 5.2

This is how I start the MySQL server:

Here's how I run the import:

Here's the affected table. AFAIK, tinyblob allows 255 bytes:

-- MySQL dump 10.11 -- -- Host: localhost   Database: dbpedia_ja -- -- -- Server version      5.0.67

[SNIP...]

-- -- Table structure for table `revision` --

DROP TABLE IF EXISTS `revision`; SET @saved_cs_client    = @@character_set_client; SET character_set_client = utf8; CREATE TABLE `revision` ( `rev_id` int(10) unsigned NOT NULL auto_increment,  `rev_page` int(10) unsigned NOT NULL,  `rev_text_id` int(10) unsigned NOT NULL,  `rev_comment` tinyblob NOT NULL,  `rev_user` int(10) unsigned NOT NULL default '0',  `rev_user_text` varbinary(255) NOT NULL default '',  `rev_timestamp` binary(14) NOT NULL default '\0\0\0\0\0\0\0\0\0\0\0\0\0\0',  `rev_minor_edit` tinyint(3) unsigned NOT NULL default '0',  `rev_deleted` tinyint(3) unsigned NOT NULL default '0',  `rev_len` int(10) unsigned default NULL,  `rev_parent_id` int(10) unsigned default NULL,  PRIMARY KEY  (`rev_id`) ) ENGINE=MyISAM AUTO_INCREMENT=28201307 DEFAULT CHARSET=binary MAX_ROWS=10000000 AVG_ROW_LENGTH=1024; SET character_set_client = @saved_cs_client;

[SNIP...]

MySQL charset settings:

mysql> show variables like '%char%'; +--++ +--++ +--++
 * Variable_name           | Value                      |
 * character_set_client    | utf8                       |
 * character_set_connection | utf8                      |
 * character_set_database  | utf8                       |
 * character_set_filesystem | binary                    |
 * character_set_results   | utf8                       |
 * character_set_server    | utf8                       |
 * character_set_system    | utf8                       |
 * character_sets_dir      | /usr/share/mysql/charsets/ |

Odd...
The comment at Infektion durch Protozoen in dewiki-20090917-pages-articles.xml should have crashed as well... but for some reason, it didn't...

Infektion durch Protozoen 862403 8433859 2005-08-13T00:08:51Z Muck 77837 Infektion durch Protozoen wurde nach Protozoeninfektion verschoben: Zur Vereinheitlichung der verschiedenen Infektionsarten wie Virusinfektion, Pilzinfektion, Wurminfektion, Bakterielle Infektion usw. Die Bezeichnung &quot;Infektion durch Protozoen&quot; f�      #redirect Protozoeninfektion

00000000 20 20 20 20 20 20 3c 63  6f 6d 6d 65 6e 74 3e 5b  |      [| 00000010 5b 49 6e 66 65 6b 74 69  6f 6e 20 64 75 72 63 68  |[Infektion durch| 00000020 20 50 72 6f 74 6f 7a 6f  65 6e 5d 5d 20 77 75 72  | Protozoen]] wur| 00000030 64 65 20 6e 61 63 68 20  5b 5b 50 72 6f 74 6f 7a  |de nach  00000040  6f 65 6e 69 6e 66 65 6b  74 69 6f 6e 5d 5d 20 76  |oeninfektion v| 00000050  65 72 73 63 68 6f 62 65  6e 3a 20 5a 75 72 20 56  |erschoben: Zur V| 00000060  65 72 65 69 6e 68 65 69  74 6c 69 63 68 75 6e 67  |ereinheitlichung| 00000070 20 64 65 72 20 76 65 72  73 63 68 69 65 64 65 6e  | der verschieden| 00000080 65 6e 20 49 6e 66 65 6b  74 69 6f 6e 73 61 72 74  |en Infektionsart| 00000090 65 6e 20 77 69 65 20 56  69 72 75 73 69 6e 66 65  |en wie Virusinfe| 000000a0 6b 74 69 6f 6e 2c 20 50  69 6c 7a 69 6e 66 65 6b  |ktion, Pilzinfek| 000000b0 74 69 6f 6e 2c 20 57 75  72 6d 69 6e 66 65 6b 74  |tion, Wurminfekt| 000000c0 69 6f 6e 2c 20 42 61 6b  74 65 72 69 65 6c 6c 65  |ion, Bakterielle| 000000d0 20 49 6e 66 65 6b 74 69  6f 6e 20 75 73 77 2e 20  | Infektion usw. | 000000e0 44 69 65 20 42 65 7a 65  69 63 68 6e 75 6e 67 20  |Die Bezeichnung | 000000f0 26 71 75 6f 74 3b 49 6e  66 65 6b 74 69 6f 6e 20  |&amp;quot;Infektion | 00000100 64 75 72 63 68 20 50 72  6f 74 6f 7a 6f 65 6e 26  |durch Protozoen&amp;| 00000110 71 75 6f 74 3b 20 66 ef  bf bd 3c 2f 63 6f 6d 6d  |quot; f....|

Infektion durch Protozoen wurde nach Protozoeninfektion verschoben: Zur Vereinheitlichung der verschiedenen Infektionsarten wie Virusinfektion, Pilzinfektion, Wurminfektion, Bakterielle Infektion usw. Die Bezeichnung "Infektion durch Protozoen" f� 255 5B5B496E66656B74696F6E2064757263682050726F746F7A6F656E5D5D207775726465206E616368205B5B50726F746F7A6F656E696E66656B74696F6E5D5D207665727363686F62656E3A205A75722056657265696E686569746C696368756E672064657220766572736368696564656E656E20496E66656B74696F6E73617274656E20776965205669727573696E66656B74696F6E2C2050696C7A696E66656B74696F6E2C205775726D696E66656B74696F6E2C2042616B74657269656C6C6520496E66656B74696F6E207573772E204469652042657A656963686E756E672022496E66656B74696F6E2064757263682050726F746F7A6F656E222066EF