User:Ctxz2323/sandbox

memos:
 * title: Chinese character orders -
 * "Chinese character sorting" redirected here

Chinese character order, also called Chinese character sorting (漢字排序, 汉字排序), is the way in which a Chinese character set is sorted into a sequence for the convenience of information retrieval. It may also refer to a sorted sequence of Chinese characters.

English dictionaries and indexes are normally arranged in alphabetical order for quick lookup. Chinese, however, is written with tens of thousands of different characters rather than the few dozen letters of an alphabet, so more complicated sorting methods are needed. The orders or sorting methods of Chinese dictionaries are traditionally divided into three categories: (1) form-based orders, including stroke-based orders and component-based orders (the latter including radical-based orders, among others), (2) sound-based orders, including Pinyin-based order and Bopomofo-based order, and (3) meaning-based orders. For modern Chinese, there are also frequency orders, in which words or characters are sorted by their frequency of use in a text corpus.

Chinese dictionaries include character dictionaries (zidian, 字典) and word dictionaries (cidian, 詞典, 词典). Chinese word orders are based on character orders. Single-character words are arranged by character sorting directly, and multi-character words can be sorted character by character in a similar way. In the following sections, there is a general introduction to the orders and sorting methods currently in use, focused on those which are more popular and effective.

Form-based orders
In this category of orders, words are sorted according to various features of the forms or shapes of Chinese characters. Compared with sound-based sorting, form-based sorting has the advantages of (a) allowing a word to be looked up without knowing its pronunciation, and (b) collating large character sets effectively without support from other methods. There are two subcategories of form-based orders: stroke-based orders and component-based orders.

Stroke-based orders
The important orders of this category include:

Stroke-count order
In this order, Chinese characters are sorted by their number of strokes in ascending order. A character with fewer strokes is put before those with more strokes. For example, the different characters in "漢字筆畫, 汉字笔画" (Chinese character strokes) are sorted into "汉(5)字(6)画(8)笔(10)[筆(12)畫(12)]漢(14)", where stroke counts are given in parentheses. (Note that both 筆 and 畫 have 12 strokes, so their relative order cannot be determined by stroke-count order alone.)

Stroke-count order was used in the Kangxi Dictionary to arrange the radicals and the characters under each radical when the dictionary was first compiled in the 1710s.
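The following is a minimal Python sketch of stroke-count sorting, using only the stroke counts quoted in the example above. The `STROKE_COUNT` table and the function name are illustrative assumptions; a real system would read stroke counts from a full character database.

```python
# Stroke counts copied from the example in the text (an illustrative,
# hand-made table, not a complete stroke database).
STROKE_COUNT = {"汉": 5, "字": 6, "画": 8, "笔": 10, "筆": 12, "畫": 12, "漢": 14}


def stroke_count_order(chars):
    """Sort characters by ascending stroke count.

    Python's sort is stable, so characters with equal counts simply keep
    their input order; stroke count alone does not decide between them.
    """
    return sorted(chars, key=lambda ch: STROKE_COUNT[ch])


result = "".join(stroke_count_order("漢字筆畫汉笔画"))
tie_kept_as_given = stroke_count_order("畫筆")  # both have 12 strokes
```

Because 筆 and 畫 tie at 12 strokes, their relative position in the output depends only on the order in which they were supplied.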

Stroke-count-stroke-order sorting
This is a combination of stroke-count sorting and stroke-order sorting. Characters are first arranged by stroke count in ascending order. Stroke-order sorting is then used to sort characters with the same number of strokes. These characters are first arranged by their first strokes according to an order of stroke groups (such as “heng (横), shu (竖), pie (撇), dian (点), zhe (折)”, or “dian (点), heng (横), shu (竖), pie (撇), zhe (折)”). If the first strokes belong to the same group, the characters are sorted by their second strokes in the same way, and so on. In the example of the previous section, both 筆 and 畫 have 12 strokes. 筆 starts with stroke ㇓ of the pie (撇) group, and 畫 starts with ㇕ of the zhe (折) group, and pie comes before zhe in the group order, so 筆 comes before 畫. Hence the different characters in "汉字笔画, 漢字筆畫" are finally sorted into "汉(5)字(6)画(8)笔(10)筆(12)畫(12)漢(14)", where each character has a unique position.
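As a rough illustration, the two-level comparison can be expressed as a sort key made of the stroke count followed by the ranks of the stroke groups. The sketch below assumes the "heng, shu, pie, dian, zhe" group order and, to stay short, records only the first stroke of each character; the table `CHAR_DATA` is a hand-made assumption, and a complete table would list every stroke.

```python
# Rank of each stroke group, assuming the order heng, shu, pie, dian, zhe.
GROUP_RANK = {"heng": 0, "shu": 1, "pie": 2, "dian": 3, "zhe": 4}

# character -> (stroke count, stroke groups in writing order).
# Only the FIRST stroke is recorded here (a simplification for this sketch).
CHAR_DATA = {
    "汉": (5, ("dian",)),
    "字": (6, ("dian",)),
    "画": (8, ("heng",)),
    "笔": (10, ("pie",)),
    "筆": (12, ("pie",)),
    "畫": (12, ("zhe",)),
    "漢": (14, ("dian",)),
}


def count_then_order_key(ch):
    """Key: stroke count first, then the stroke-group ranks in order."""
    count, groups = CHAR_DATA[ch]
    return (count, tuple(GROUP_RANK[g] for g in groups))


result = "".join(sorted("漢畫筆画笔字汉", key=count_then_order_key))
```

With this key, 筆 (12 strokes, first stroke pie) sorts before 畫 (12 strokes, first stroke zhe) regardless of input order.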

Stroke-count-stroke-order sorting was used in Xinhua Zidian (新华字典, Xinhua Chinese Character Dictionary) and Xiandai Hanyu Cidian (现代汉语词典, Contemporary Chinese Word Dictionary) before the national standard for stroke-based sorting was released in 1999.

GB stroke-based order
GB Stroke-Based Order, in full the GB13000.1 Character Set Chinese Character Order (Stroke-Based Order) (GB13000.1字符集汉字字序（笔画序）规范), is a standard released by the National Language Commission of China in 1999. It is an enhanced version of stroke-count-stroke-order sorting. According to this standard, characters are first sorted by stroke count, then by stroke order (using the five families of heng, shu, pie, dian and zhe). Characters with the same stroke count and stroke order are then sorted by primary-secondary stroke order. For example, 子 and 孑 have the same five-family stroke order (㇐ and ㇀ both belong to the heng family), but under the primary-secondary stroke order rule the primary stroke ㇐ comes before the secondary stroke ㇀, so 子 comes before 孑. If two characters have the same stroke count, stroke order and primary-secondary strokes, they are sorted according to the mode of stroke combination: separation precedes connection, and connection precedes intersection. For example, 八 comes before 人, and 人 comes before 乂. The standard contains further rules for more precise sorting.
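The stroke-combination rule can be sketched as one more level appended to the sort key. The example below is only an illustration under stated assumptions: it covers stroke count, five-family stroke order and combination mode, it omits the primary-secondary level and the standard's other rules, and the table `CHAR_DATA` (including the treatment of the second stroke of 八, 人 and 乂 as belonging to the dian family) is a hand-made assumption rather than data taken from the standard.

```python
FAMILY_RANK = {"heng": 0, "shu": 1, "pie": 2, "dian": 3, "zhe": 4}

# Separation precedes connection, and connection precedes intersection.
COMBINATION_RANK = {"separate": 0, "connected": 1, "intersecting": 2}

# character -> (stroke count, stroke families, combination mode).
# Hand-made for this sketch; not copied from the standard itself.
CHAR_DATA = {
    "八": (2, ("pie", "dian"), "separate"),
    "人": (2, ("pie", "dian"), "connected"),
    "乂": (2, ("pie", "dian"), "intersecting"),
}


def gb_like_key(ch):
    """Key: stroke count, then stroke families, then combination mode."""
    count, families, mode = CHAR_DATA[ch]
    return (count, tuple(FAMILY_RANK[f] for f in families), COMBINATION_RANK[mode])


result = "".join(sorted("乂人八", key=gb_like_key))
```

The three characters tie on the first two levels, so only the combination mode decides their order.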

YES order
YES is a simplified stroke-based sorting method free from stroke counting and grouping, without compromise in accuracy. It has been successfully applied to the indexing of all the characters in Xinhua Zidian and Xiandai Hanyu Cidian. In this joint index, a Chinese character can be looked up to find its pinyin and Unicode code point, in addition to its page numbers in the two popular dictionaries.

Component-based orders
In this category, characters are sorted by one or more components.

Radical-based orders
A radical (bùshǒu, 部首, or section head) is a common component shared by a group of characters. The radical usually lies in the upper part or on the left side of a character and helps to express its meaning. For example, 江 (river), 湖 (lake) and 海 (sea) all have the radical 氵 (水, water), which indicates that they are related to water; 推 (push), 拉 (pull) and 打 (beat) share the radical 扌 (手, hand), and are actions normally involving the hands. In radical-based order, all the characters sharing a radical are put under that radical to form a radical family or section. Different families are arranged by their leading radicals in stroke-based order, and characters inside a family are also sorted by their strokes.

In many contemporary dictionaries, including Xinhua Zidian, Xiandai Hanyu Cidian and Oxford Chinese Dictionary, the radical-based character lookup system consists of three indexes or tables: a radical index, a character lookup index, and an index of characters with radicals that are difficult to find, all sorted in stroke-based order. To look up a character (such as 家, home) in a dictionary (e.g., Xinhua Zidian, 12th edition), first identify its radical (the component 宀 at the top). Count the radical's strokes (3 strokes) and find it in the radical index, which is in stroke-based order. Once it is found, note the page number (p. 49) to its right. Then, on that page, find the radical family in the character lookup table, which is also in stroke-based order. Count the number of strokes in the rest of the character (家 has 7 strokes besides the radical 宀) and find the target character within the family. The page number to its right (217) is the page in the main body of the dictionary where the character's entry appears (character entries in the main body of Xinhua Zidian are sorted by Pinyin). Characters whose radicals are difficult to identify can be looked up in the Index of Characters with Radicals Difficult to Find, which is in stroke-based order.
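The two-step lookup described above can be modelled with two nested tables. The sketch below is only an illustration: the table layout and the names `RADICAL_INDEX`, `CHARACTER_LOOKUP` and `look_up` are assumptions, and the only data included are the page numbers quoted in the 家 example.

```python
# Radical index: radical stroke count -> radical -> page of its family
# in the character lookup table.
RADICAL_INDEX = {3: {"宀": 49}}

# Character lookup table: radical -> strokes outside the radical ->
# character -> page of the entry in the dictionary's main body.
CHARACTER_LOOKUP = {"宀": {7: {"家": 217}}}


def look_up(char, radical, radical_strokes, remaining_strokes):
    """Follow the radical index, then the character lookup table."""
    family_page = RADICAL_INDEX[radical_strokes][radical]
    entry_page = CHARACTER_LOOKUP[radical][remaining_strokes][char]
    return family_page, entry_page


family_page, entry_page = look_up("家", "宀", radical_strokes=3, remaining_strokes=7)
```

Each step narrows the search by a stroke count, which is why the reader has to identify the radical and count strokes twice.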

The first radical system in history was created by the Chinese scholar Xu Shen in his dictionary Shuowen Jiezi (説文解字, 说文解字) almost two thousand years ago, in the Eastern Han dynasty. This dictionary, which uses a total of 540 radicals, is still available today. Another milestone is the Kangxi radical system employed in the Kangxi Dictionary of 1716, in the reign of the Kangxi Emperor, with the number of radicals reduced to 214. The Kangxi radical sorting method is still in use in China, Japan and Korea. It is also used for the official order of Unicode CJK Unified Ideographs. The latest standard radical table in mainland China is the Table of Indexing Chinese Character Components, with a list of 201 radicals.

Four-corner order
Chinese characters are written in the form of a square block. The Four-Corner Method assigns a 4-digit code to a character, each digit representing one corner of the block. The four corner digits appear in the sequence "upper-left, upper-right, lower-left, lower-right". For example, the code of the character 顏 (meaning "face") is 0128, where the first digit 0 represents the upper-left component 亠, 1 the upper-right 一, 2 the lower-left ㇓, and 8 the lower-right 八.

A fifth digit can be added to represent an extra part above the lower-right corner to gain higher sorting accuracy. For example, the extended code of the character 佳 is 24214, where the fifth digit 4 represents the component 十 above the final 一 in the lower-right corner.

When a set of characters is encoded in four-corner codes, the characters are sorted in ascending order by the first four digits (followed by the fifth digit where it exists).
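A minimal sketch of four-corner ordering follows, using only the two codes quoted above. The helper name and the decision to treat a missing fifth digit as an empty string (which sorts before any digit) are assumptions made for illustration.

```python
# Codes quoted in the text: 顏 has a plain 4-digit code, 佳 an extended one.
FOUR_CORNER = {"顏": "0128", "佳": "24214"}


def four_corner_key(code):
    """Split a code into (first four digits, optional fifth digit)."""
    return (code[:4], code[4:5])


result = sorted(FOUR_CORNER, key=lambda ch: four_corner_key(FOUR_CORNER[ch]))
```

The codes are compared as digit strings rather than integers so that a leading zero, as in 0128, is preserved.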

Cangjie-code order
In this method, Chinese characters are arranged alphabetically by their codes used in Cangjie input method. The Cangjie code of a character is a string of English letters each representing a selected Cangjie component in the character. For example, the Cangjie codes of the characters in 漢字排檢法 (Methods for Chinese character sorting and retrieving) are 漢(ETLO)字(JND)排(QLMY)檢(DOMO)法(EGI), and can be sorted into a Cangjie-code order of 檢(DOMO)法(EGI)漢(ETLO)字(JND)排(QLMY).
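Since Cangjie codes are plain strings of Latin letters, this order reduces to ordinary string comparison. The sketch below uses only the five codes quoted above; the table name is an illustrative assumption.

```python
# Cangjie codes quoted in the text.
CANGJIE = {"漢": "ETLO", "字": "JND", "排": "QLMY", "檢": "DOMO", "法": "EGI"}

# Sort the characters alphabetically by their codes.
result = "".join(sorted("漢字排檢法", key=lambda ch: CANGJIE[ch]))
```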

Sound-based orders
There are two sound representation systems in Mandarin Chinese or Putonghua, i.e., Pinyin and Bopomofo. Accordingly we have two methods of sound-based sorting for modern standard Chinese.

Pinyin-based order
In this method, Chinese characters are sorted alphabetically by their Pinyin (pīnyīn, 拼音). For example, 汉字拼音排序法 (Pinyin sorting method of Chinese characters) is sorted into "法(fǎ)汉(hàn)排(pái)拼(pīn)序(xù)音(yīn)字(zì)", with pinyin in parentheses. Pinyin syllables with the same letters are ordered by tone, in the order tone 1, tone 2, tone 3, tone 4 and tone 5 (light tone), such as "妈(mā), 麻(má), 马(mǎ), 骂(mà), 吗(ma)". Characters with the same sound, i.e., the same Pinyin letters and tone, are normally sorted by a stroke-based method.

Words of multiple characters can be sorted in two different ways. One is to sort character by character: if the first characters are the same, sort by the second character, and so on. For example, 归并(guībìng), 归还(guīhuán), 规划(guīhuà), 鬼话(guǐhuà), 桂花(guìhuā). This method is used in Xiandai Hanyu Cidian. The other is to sort by the pinyin letters of the whole word, and then by tones when the letters of two words are the same. For example, 归并(guībìng), 规划(guīhuà), 鬼话(guǐhuà), 桂花(guìhuā), 归还(guīhuán). This method is used in the ABC Chinese–English Dictionary.

Pinyin-based sorting is very convenient for looking up words whose pronunciation and Pinyin spelling are known, but it cannot be used to find a word whose sound is unknown.
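One way to implement the "letters first, then tone" comparison is to split each tone-marked syllable into its base letters and a tone number. The sketch below does this with Unicode decomposition; the function name and the small `PINYIN` table are illustrative assumptions, and the stroke-based tie-break for characters with identical sounds is not implemented.

```python
import unicodedata

# Combining marks used for Pinyin tones 1 to 4 after NFD decomposition.
TONE_MARKS = {"\u0304": 1, "\u0301": 2, "\u030c": 3, "\u0300": 4}


def pinyin_key(syllable):
    """Return (letters without tone mark, tone number); no mark means tone 5."""
    tone = 5
    letters = []
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            letters.append(ch)
    # Recompose so that any remaining marks (such as the dots of ü) stay attached.
    return (unicodedata.normalize("NFC", "".join(letters)), tone)


# Readings copied from the example in the text.
PINYIN = {"汉": "hàn", "字": "zì", "拼": "pīn", "音": "yīn",
          "排": "pái", "序": "xù", "法": "fǎ"}

ordered_chars = sorted("汉字拼音排序法", key=lambda ch: pinyin_key(PINYIN[ch]))
ordered_tones = sorted(["mà", "ma", "mǎ", "mā", "má"], key=pinyin_key)
```

Comparing the letters before the tone is what keeps all the "ma" syllables together, ordered from tone 1 to the light tone.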
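The difference between the two methods comes down to the shape of the sort key. The sketch below is a hedged illustration: the readings are copied from the examples above, while the data layout, the function names and the small `STROKES` table (used only to separate 归 from 规, which share the sound guī) are assumptions.

```python
# word -> list of (character, pinyin letters, tone), one item per character.
WORDS = {
    "归并": [("归", "gui", 1), ("并", "bing", 4)],
    "归还": [("归", "gui", 1), ("还", "huan", 2)],
    "规划": [("规", "gui", 1), ("划", "hua", 4)],
    "鬼话": [("鬼", "gui", 3), ("话", "hua", 4)],
    "桂花": [("桂", "gui", 4), ("花", "hua", 1)],
}

# Stroke counts for the only same-sound pair in this sample. Characters
# missing from the table default to 0, a shortcut that is adequate only
# for this small data set.
STROKES = {"归": 5, "规": 8}


def by_character(word):
    """Compare character by character: letters, tone, then stroke count."""
    return [(letters, tone, STROKES.get(ch, 0), ch) for ch, letters, tone in WORDS[word]]


def by_whole_word(word):
    """Compare the letters of the whole word first, then the tones."""
    letters = "".join(item[1] for item in WORDS[word])
    tones = [item[2] for item in WORDS[word]]
    return (letters, tones)


character_order = sorted(WORDS, key=by_character)
whole_word_order = sorted(WORDS, key=by_whole_word)
```

In the second ordering 归还 moves to the end because "guihuan" follows "guihua" alphabetically, whatever the tones are.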

Bopomofo-based order
Bopomofo, or Phonetic Symbols (zhùyīn fúhào, 注音符號, 注音符号), is a Chinese phonetic system created by the Commission on the Unification of Pronunciation (讀音統一會) in 1913, and formally issued by the Ministry of Education of the Chinese Government in 1918. It consists of a table (or alphabet) of 37 letters or symbols in the order of "ㄅㄆㄇㄈㄉㄊㄋㄌㄍㄎㄏㄐㄑㄒㄓㄔㄕㄖㄗㄘㄙㄚㄛㄜㄝㄞㄟㄠㄡㄢㄣㄤㄥㄦㄧㄨㄩ" and 5 tone diacritics of “ˉ, ˊ, ˇ, ˋ, ˙”.

Chinese characters are sorted according to the Bopomofo spellings of their sounds, following the order of the alphabet table: first by letters, then by tones in the order of first tone, second tone, third tone, fourth tone, and fifth tone (also called neutral tone or light tone). For example, the Bopomofo order for the characters in “注音字母排序法 (Bopomofo-based sorting)” is “排(ㄆㄞˊ)母(ㄇㄨˇ)法(ㄈㄚˇ)序(ㄒㄩˋ)注(ㄓㄨˋ)字(ㄗˋ)音(ㄧㄣ)”. Characters with the same sound are normally sorted by a stroke-based method.

The first dictionary sorted in Bopomofo order was 國語辭典 (Guoyu Dictionary), published in 1937, which was followed by many other dictionaries. Bopomofo is more popular in Taiwan than in mainland China, where Pinyin is predominant.
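A sketch of this comparison in Python follows. Each symbol is replaced by its position in the 37-symbol table quoted earlier, and the tone is stored as a number instead of being parsed from the diacritic; the data layout and names are illustrative assumptions.

```python
# The 37 symbols in table order, as quoted in the text.
ALPHABET = "ㄅㄆㄇㄈㄉㄊㄋㄌㄍㄎㄏㄐㄑㄒㄓㄔㄕㄖㄗㄘㄙㄚㄛㄜㄝㄞㄟㄠㄡㄢㄣㄤㄥㄦㄧㄨㄩ"

# character -> (Bopomofo symbols, tone number 1-5), from the example in the text.
READINGS = {
    "注": ("ㄓㄨ", 4), "音": ("ㄧㄣ", 1), "字": ("ㄗ", 4), "母": ("ㄇㄨ", 3),
    "排": ("ㄆㄞ", 2), "序": ("ㄒㄩ", 4), "法": ("ㄈㄚ", 3),
}


def bopomofo_key(ch):
    """Key: positions of the symbols in the table, then the tone number."""
    symbols, tone = READINGS[ch]
    return ([ALPHABET.index(s) for s in symbols], tone)


result = "".join(sorted("注音字母排序法", key=bopomofo_key))
```

Using explicit table positions keeps the sketch independent of how the symbols happen to be encoded.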

Dialect-sound orders
In addition to the sounds of standard Chinese, Chinese characters can be sorted by the sounds of dialects as well, for example by the Jyutping (a Cantonese romanization) of Cantonese, which is widely spoken in Hong Kong.

In Jyutping, the sound of a Chinese character is represented by a string of English letters, followed by a digit from 1 to 6 to represent the tone. For instance, the Jyutping order of the characters in “粵拼排檢法 (Jyutping-based sorting and retrieving)” is “法[faat3]檢[gim2]粵[jyut6]排[paai4]拼[ping3]”, where Jyutping spellings are in square brackets.

The most serious limitation of sound-based orders is that they cannot be used to look up words whose pronunciation is unknown. That is why dictionaries collated by sound often provide additional indexes in form-based orders.
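The same "letters first, then tone" idea can be sketched for this romanization by splitting each spelling into its letters and its final tone digit. The readings below are the ones quoted in the example; the function and table names are illustrative assumptions.

```python
import re

# Readings quoted in the text: letters followed by a tone digit from 1 to 6.
READINGS = {"粵": "jyut6", "拼": "ping3", "排": "paai4", "檢": "gim2", "法": "faat3"}


def romanization_key(spelling):
    """Split a spelling such as 'faat3' into ('faat', 3)."""
    match = re.fullmatch(r"([a-z]+)([1-6])", spelling)
    if match is None:
        raise ValueError(f"unexpected spelling: {spelling!r}")
    return (match.group(1), int(match.group(2)))


result = "".join(sorted("粵拼排檢法", key=lambda ch: romanization_key(READINGS[ch])))
```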

Meaning-based orders
Meaning-based orders, also called semantics-based orders, arrange characters and words in a hierarchical structure of semantic categories. The earliest surviving Chinese dictionary, Erya (dating from the 3rd century BC), is arranged by semantic classification. Its words are divided into nineteen categories, each with a large number of entries. An entry is a list of synonyms, which are explained by a commonly used word. For instance, in the entry "林、烝、天、帝、皇、王、后、辟、公、侯，君也", the ending 君也 says that the preceding words are synonyms of "君 (king)".

Modern semantically sorted dictionaries include "同义词词林" and "实用广州话分类词典". Their classification systems are much more accurate and detailed than those of the ancient dictionaries, but they still need indexes of radicals or strokes. That means meaning-based sorting is not powerful enough to function as an independent sorting method.

Semantics-based sorting involves these questions: What are the categories and subcategories to use? How to put a word into its category and subcategory? How to arrange the categories and subcategories in order? How to arrange the words in the lowest subcategories in order? And the answers to these questions may vary between the user and compiler of the dictionary, and that will lead to difficulties in word lookup.

In fact, radical-based sorting is meaning-based to a certain degree, because in many cases the radical represents the semantic category of a character, e.g., radical 氵(water) in character 江(river), 扌(hand) in 推 (push), 木(wood) in 椅(chair).

Frequency-based orders
In this category of orders, Chinese characters are sorted by their frequency of use, normally in descending order, so that the most frequently used character is at the top of the list. A frequency list is created from a text corpus. In corpus linguistics, the frequency of a character is the ratio of its number of occurrences in the corpus to the total number of characters in the corpus, usually expressed as a percentage.
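The definition above translates directly into a few lines of Python. The sketch below builds a descending frequency list from a toy string; the corpus is made up for illustration, and a real one would first be stripped of punctuation and non-Chinese text.

```python
from collections import Counter


def frequency_list(corpus):
    """Return (character, percentage) pairs, most frequent first."""
    counts = Counter(corpus)
    total = sum(counts.values())
    return [(ch, 100 * n / total) for ch, n in counts.most_common()]


# Toy corpus of six characters, invented for this example.
toy_list = frequency_list("的不的一的不")
```

Here 的 occurs three times out of six, so it heads the list with a frequency of 50%.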

The first frequency list of Chinese characters based on a corpus was created by Chen Heqin (陳鶴琴). In the 1920s, he and his assistants spent two years manually counting the characters in a corpus of 554,478 characters, and obtained 4,261 different characters with frequency information. The top 10 characters in their frequency list are (in descending order): "的 (of), 不 (no, not), 一 (one, a/an), 了 (had, done), 是 (be), 我 (I, me), 上 (on, up), 他 (he, him), 有 (have, has), 人 (person, people)".

In 2001, the Chinese University of Hong Kong published a number of frequency lists on the Web, entitled "Hong Kong, Mainland China and Taiwan Chinese Frequency: a trans-regional diachronic survey". The frequency data came from a large corpus with a number of sub-corpora representing the Chinese language in the three regions of Hong Kong, Mainland China and Taiwan in two time periods, the 1960s and the 1980s/1990s. Each sub-corpus includes about 5,000 different characters, as shown by their frequency lists.

These frequency lists show that the 100 most frequently used characters in the 1980s/1990s cover (i.e., have a cumulative frequency of) 41.00% of the Hong Kong texts of that period, 41.34% of the Mainland texts, and 41.88% of the Taiwan texts. That is more than 4 out of every 10 characters in each of the three regions. The 1,000 most frequently used characters in the 1980s/1990s cover 89.25% of the Hong Kong texts of that period, 90.26% of the Mainland texts, and 88.74% of the Taiwan texts. Similar results can be found in the frequency lists of the 1960s.
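Coverage in this sense is the cumulative frequency of the top N characters. A minimal sketch follows; the function name is an assumption and the toy corpus is invented, so the numbers it produces are unrelated to the survey figures quoted above.

```python
from collections import Counter


def coverage(corpus, top_n):
    """Percentage of the corpus accounted for by its top_n most frequent characters."""
    counts = Counter(corpus)
    total = sum(counts.values())
    covered = sum(n for _, n in counts.most_common(top_n))
    return 100 * covered / total


toy_corpus = "的不的一的不"  # invented six-character corpus
top1 = coverage(toy_corpus, 1)
top2 = coverage(toy_corpus, 2)
```

In this toy corpus the single most frequent character already covers half of the text, and the top two cover five sixths of it.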

As a matter of fact, both meaning-based sorting and frequency-based sorting are employed in other languages as well, though often at word level, not at character level.

Orders of words
A Chinese word consists of one or more characters. Single-character words can be sorted by a character order, and multi-character words can be sorted character by character in a similar way. For example, according to the Pinyin, radical and stroke-based orders used in Xiandai Hanyu Cidian (7th edition), the five words [爱,好,好事,好人,好人家] would be arranged in the following orders:
 * Pinyin-based sorting: "爱(ài), 好(hǎo), 好人(hǎorén), 好人家(hǎorénjiā), 好事(hǎoshì)".
 * Radical-based sorting: "好(radical 女 of 3 strokes), 好事(事: radical 一 of 1 stroke), 好人(人: radical 人 of 2 strokes), 好人家, 爱(radical 爪 of 4 strokes)".
 * Stroke-based sorting: "好(6 strokes), 好人(人：2 strokes), 好人家, 好事(事：8 strokes), 爱(10 strokes)".
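The stroke-based ordering in the list above can be reproduced with a key that lists the stroke count of each character in turn. This is only a sketch: the stroke counts are those given above plus 家 (3 strokes for 宀 and 7 for the rest, as in the lookup example), the names are illustrative, and ties between different characters with equal counts are not resolved.

```python
# Stroke counts for the characters in the five sample words.
STROKES = {"爱": 10, "好": 6, "事": 8, "人": 2, "家": 10}


def word_stroke_key(word):
    """Key: the stroke count of each character, compared position by position."""
    return tuple(STROKES[ch] for ch in word)


words = ["爱", "好", "好事", "好人", "好人家"]
stroke_order = sorted(words, key=word_stroke_key)
```

Tuple comparison places a shorter word before any longer word that starts with it, which is why 好 precedes 好人 and 好人 precedes 好人家.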

Computer sorting
Chinese text can also be sorted automatically by computer. For example, in Microsoft Windows and Office, users can sort Chinese characters or words in any of the following orders:
 * Unicode order. (This is, generally speaking, Kangxi radical order.)
 * Pinyin order. (More popular in Chinese Mainland)
 * Bopomofo order. (More popular in Taiwan)
 * Stroke-based order. (More widely used in Hong Kong)
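As a small illustration of the first option, Python's built-in `sorted` compares strings by Unicode code point, so sorting a string of characters yields Unicode order without any extra data. The other orders depend on pronunciation or stroke data and typically require a collation library; that remark is general background, not a description of how Windows or Office implement them.

```python
text = "汉字笔画"

# Built-in sorting compares code points, i.e. Unicode order.
by_code_point = sorted(text)

# Show the code point behind each character.
code_points = [f"U+{ord(ch):04X}" for ch in by_code_point]
```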