User:Spacemartin/Pinyin Wiki

Click here to go to Pinyin Wiki!

Pinyin Wiki is a Pinyin mirror of the Chinese Wikipedia to help language learners.


 * Important note: some pages occasionally display incorrectly. For instance, at the time of writing, the Main Page (首页) shows Chinese characters instead of Pinyin. The server appears to have cached some pages while I was developing new functionality. Hopefully the cached copy will be regenerated at some point and will then display Pinyin. In the meantime, click through to another page until you see Pinyin, or use the Random Page feature (随机页面 or Alt+Shift+x).

Feedback
If you have any feedback about Pinyin Wiki, please leave it on the Talk Page.

All feedback is welcome. I am particularly interested in how helpful Pinyin Wiki is for people learning to read Chinese. I am not a native Chinese speaker, and I currently have only a small vocabulary of Chinese words and characters, so it is hard for me to assess the usefulness of Pinyin Wiki.

I know that the Pinyin is not perfect, but it is hopefully better than nothing. I have encountered some obvious mistakes where the word boundary falls in the wrong place, e.g. "1990 niánfǎ míngliǎo" for "1990年发明了" (instead of "1990 nián fǎmíngliǎo"), but by and large the output looks plausible. I am able to recognise words like "jièshào" and "gōngzuò", which is certainly a lot easier than trying to pick them out from a page full of undifferentiated syllables. One way to improve the word splitting might be to use frequency data, probably with some kind of dynamic programming, to decide which words to use (instead of the current greedy algorithm). I could probably build a database of word frequencies by processing the Chinese Wikipedia, but that would take a while.
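
The frequency-based idea could look roughly like the sketch below. This is only an illustration and not something Pinyin Wiki currently does: the function name `segment`, the penalty for unknown characters, and the counts in `TOY_FREQ` are all invented for the example. It picks the split whose words have the highest combined log-probability, rather than always taking the longest match.

```python
import math

# Invented counts for illustration only; real values would have to be
# measured, e.g. by processing the Chinese Wikipedia.
TOY_FREQ = {"年": 50, "发明": 30, "了": 60, "年发": 1, "明了": 5, "发": 5, "明": 5}


def segment(text, freq, max_len=4):
    """Split text into words, maximising the sum of log-probabilities."""
    total = sum(freq.values())
    # Hypothetical penalty for a single character missing from the table.
    unknown = math.log(1 / (total * 10))
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i] is the best score for text[:i]
    back = [0] * (n + 1)            # back[i] is where the last word starts
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            word = text[j:i]
            if word in freq:
                score = math.log(freq[word] / total)
            elif i - j == 1:
                score = unknown     # unknown single characters pass through
            else:
                continue            # unknown multi-character strings are skipped
            if best[j] + score > best[i]:
                best[i] = best[j] + score
                back[i] = j
    # Walk the back-pointers from the end to recover the chosen words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


print(segment("年发明了", TOY_FREQ))
```

With these toy counts the rare candidate "年发" loses to the more frequent "发明", which is the behaviour the greedy matcher lacks.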

Issues

 * It does not allow login or editing - you would need to go to the Chinese Wikipedia for that.


 * At the moment, the tooltips show characters in the mainland Chinese standard (simplified characters). You can select another variant, but the setting does not persist when you go to another page. I may try to find a solution for this.


 * Search is non-functional. Again, I may try to look for a solution at some point. The random article link works, though, and you can use the Chinese Wikipedia to find an article and then paste the title into the URL of Pinyin Wiki.


 * History does not work - and in any case, the server used by Pinyin Wiki does not have access to old versions of the page content.


 * The Math extension is not enabled, so you will occasionally see MathML (XML versions of mathematical expressions) interspersed with the article text.


 * Images are not displayed. I may try to solve this, but I'm not sure whether it's practical as it might involve using up large amounts of disk space on the tool server.

Pinyin details
The wiki uses the Chinese Wiktionary and the Unihan database (which you can browse here) to find a Pinyin reading for each character or word (of 2-4 characters). The readings are displayed in place of the corresponding character or characters, with a space between each word. The characters are still present as tooltips, and if you are using a desktop or laptop computer you can view them by hovering over a word with the mouse pointer.

First, it uses a greedy matching algorithm to find a word of 2-4 characters which is present in the Chinese Wiktionary, and whose entry has (or appears to have) a valid Hanyu Pinyin entry. The ASCII part of the Pinyin is checked against the Unihan database to ensure that each part of it is a valid reading for the corresponding character. The Pinyin is then corrected where necessary by moving tone marks to the correct positions, changing -uen to -un, and inserting required apostrophes. If this whole process succeeds, the Pinyin is used. (In practice, a dump of the Chinese Wiktionary was processed once to produce a JSON file containing all these readings, which the web server then uses.)
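
As a rough illustration of the correction step, here is a minimal sketch. It is not the code Pinyin Wiki runs: the function names are made up, and it assumes the word has already been split into lower-case syllables, one per character. It applies the standard placement rule (a or e takes the mark, otherwise the o of "ou", otherwise the last vowel), rewrites -uen as -un, and adds an apostrophe before a non-initial syllable that starts with a, e or o.

```python
import unicodedata

# Combining diacritics used for the four tones (macron, acute, caron, grave).
TONE_MARKS = {"\u0304": 1, "\u0301": 2, "\u030c": 3, "\u0300": 4}
COMBINING = {tone: mark for mark, tone in TONE_MARKS.items()}
VOWELS = "aeiouü"


def split_tone(syllable):
    """Return (syllable without tone mark, tone number); 0 means unmarked."""
    tone = 0
    plain = []
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            plain.append(ch)
    # Recompose so that u + diaeresis becomes a single "ü" again.
    return unicodedata.normalize("NFC", "".join(plain)), tone


def place_tone(plain, tone):
    """Put the tone mark on the vowel that should carry it."""
    if tone == 0:
        return plain
    if "a" in plain:
        pos = plain.index("a")
    elif "e" in plain:
        pos = plain.index("e")
    elif "ou" in plain:
        pos = plain.index("o")
    else:
        vowel_positions = [i for i, ch in enumerate(plain) if ch in VOWELS]
        if not vowel_positions:
            return plain  # nothing to mark, e.g. a bare "ng"
        pos = vowel_positions[-1]
    marked = plain[:pos + 1] + COMBINING[tone] + plain[pos + 1:]
    return unicodedata.normalize("NFC", marked)


def join_word(syllables):
    """Correct each syllable and join them into one Pinyin word."""
    out = []
    for i, syllable in enumerate(syllables):
        plain, tone = split_tone(syllable)
        if len(plain) > 3 and plain.endswith("uen"):
            plain = plain[:-3] + "un"
        fixed = place_tone(plain, tone)
        if i > 0 and plain.startswith(("a", "e", "o")):
            fixed = "'" + fixed
        out.append(fixed)
    return "".join(out)


print(join_word(["xī", "ān"]), join_word(["haǒ"]), join_word(["duēn"]))
```

Working on the decomposed (NFD) form keeps the tone mark separate from the diaeresis of ü, so the mark can be moved without losing the dots.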

If no multi-character word can be found, then it tries to find the most popular reading for each individual character in the kHanyuPinlu field from the Unihan database, which ranks readings by frequency of occurrence, based in part on the frequencies given in Xiandai Hanyu Pinlu Cidian (full details here). This database field sometimes contains invalid Pinyin (the tone marks are sometimes in the wrong place), so if necessary it is corrected. Only a small proportion of characters have a kHanyuPinlu entry.

If all else fails, it uses the kMandarin field from Unihan - I believe this covers pretty much all characters in current use. According to Unicode, this is "The most customary pinyin reading for this character." It is intended to be a stable reading of the character for collation and transliteration purposes in word processors, spreadsheets, etc. (It forms part of the CLDR.)
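
Put together, the lookup order described above can be sketched as follows. This is a simplified illustration with hypothetical names: `words` stands in for the readings taken from the Chinese Wiktionary, `pinlu` for kHanyuPinlu and `mandarin` for kMandarin, and the tooltips are left out. The toy tables in the usage example are invented and are not copied from Unihan.

```python
def to_pinyin(text, words, pinlu, mandarin, max_len=4):
    """Convert text to space-separated Pinyin using the three lookup stages."""
    out = []
    i = 0
    while i < len(text):
        # Greedy step: try the longest candidate first (4, then 3, then 2 characters).
        for size in range(min(max_len, len(text) - i), 1, -1):
            chunk = text[i:i + size]
            if chunk in words:
                out.append(words[chunk])
                i += size
                break
        else:
            # No multi-character word matched: fall back to single-character
            # readings, preferring the frequency-ranked table, and leave the
            # character unchanged if neither table knows it.
            ch = text[i]
            out.append(pinlu.get(ch) or mandarin.get(ch) or ch)
            i += 1
    return " ".join(out)


# Toy tables, invented for this example.
words = {"介绍": "jièshào", "工作": "gōngzuò"}
pinlu = {"了": "le"}
mandarin = {"了": "liǎo", "我": "wǒ"}

print(to_pinyin("我介绍工作", words, pinlu, mandarin))
print(to_pinyin("我了", words, pinlu, mandarin))
```

Because the loop always commits to the longest match at the current position, it never reconsiders an earlier choice, which is how splits such as "niánfǎ míngliǎo" can arise.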

MediaWiki details
Pinyin Wiki runs a somewhat modified MediaWiki instance, with the following characteristics:


 * It gets most of the data from a (partially redacted) read-only mirror of the Chinese Wikipedia in MySQL. This provides almost everything except article text.


 * Most of the cache is either disabled or running off Redis.


 * The msg_resource table is in a separate, writeable database - I did this to avoid having to install memcached.


 * The article text is read from a multistream bzip2 database dump file, which is indexed by an SQLite database.


 * The word and character readings are all read from a single PHP file which is "required" from LocalSettings.php - this defines $wgPinyin, an associative array (i.e. mapping) from each Chinese word or character onto the best matching Pinyin. In this way, I (hopefully) benefit from the PHP opcode cache.
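
To illustrate the idea behind the multistream dump and its SQLite index, here is a small self-contained sketch. It is not the actual Pinyin Wiki code (which runs inside MediaWiki), and everything in it is an assumption made for the example: the `page_index` table, its columns, and the simplified "title, tab, text" line format used in place of the real dump's XML. The point is that each bzip2 stream can be decompressed on its own once its byte offset is known, so a page can be fetched without reading the whole file.

```python
import bz2
import sqlite3


def write_demo_dump(dump_path, db, streams):
    """Write each dict of {title: text} as one bzip2 stream and index its pages."""
    db.execute(
        "CREATE TABLE page_index "
        "(title TEXT PRIMARY KEY, offset INTEGER, length INTEGER)"
    )
    with open(dump_path, "wb") as dump:
        for pages in streams:
            raw = "".join(f"{title}\t{text}\n" for title, text in pages.items())
            blob = bz2.compress(raw.encode("utf-8"))
            offset = dump.tell()  # byte position where this stream starts
            dump.write(blob)
            db.executemany(
                "INSERT INTO page_index VALUES (?, ?, ?)",
                [(title, offset, len(blob)) for title in pages],
            )
    db.commit()


def read_page(dump_path, db, title):
    """Return the text of one page, decompressing only the stream that holds it."""
    row = db.execute(
        "SELECT offset, length FROM page_index WHERE title = ?", (title,)
    ).fetchone()
    if row is None:
        return None
    offset, length = row
    with open(dump_path, "rb") as dump:
        dump.seek(offset)
        blob = dump.read(length)
    raw = bz2.decompress(blob).decode("utf-8")
    for line in raw.split("\n"):
        name, _, text = line.partition("\t")
        if name == title:
            return text
    return None
```

A call such as `read_page(path, db, "首页")` touches only the one compressed stream containing that title, which keeps lookups cheap even when the dump itself is very large.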