Chinese computational linguistics

Chinese computational linguistics is the scientific study and information processing of the Chinese language by means of computers. The purpose is to obtain a better understanding of how the language works and to bring more convenience to language applications. The term Chinese computational linguistics is often employed interchangeably with Chinese information processing, though the former may sound more theoretical while the latter more technical. ^[1]

Rather than introducing computational linguistics in general, this article focuses on issues specific to the Chinese language that differ from those of English and other languages. The contents include Chinese character information processing, word segmentation, proper noun recognition, natural language understanding and generation, corpus linguistics, and machine translation.^[1]

Chinese character information processing

Chinese character information technology (IT) is the computer processing of Chinese characters. While the English writing system uses only a few dozen distinct characters, the Chinese language needs a much larger character set. There are over ten thousand characters in the Xinhua Dictionary.^[2] In the Unicode multilingual character set of 149,813 characters, 98,682 (about two thirds) are Chinese.^[3] Computer processing of Chinese characters is therefore considerably more demanding than that of most other writing systems.

Chinese character input

Computer input of Chinese characters is by no means as easy as that of English. English is written with 26 letters and a handful of other characters, and each character is assigned to a key on the keyboard. Chinese could be input in a similar way, but that would require a huge keyboard with at least thousands of keys, and searching for a character on it would be a daunting job.^[4] An alternative is to encode each Chinese character as a string of English characters, enabling Chinese input on an English keyboard. This method has in fact become predominant for Chinese computer input.

Sound-based encoding is normally based on an existing Latin-letter scheme for Chinese phonetics, such as the Pinyin scheme for Mandarin Chinese (Putonghua) and the Jyutping scheme for Cantonese. The input code of a Chinese character is its phonetic letter string, optionally followed by a number representing the tone. For example, the Putonghua Pinyin input code of 香港 (Hong Kong) is "xianggang" or "xiang1gang3", and the Cantonese Jyutping code is "hoenggong" or "hoeng1gong2", all of which can be easily input via an English keyboard.
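As a rough illustration of sound-based input codes, the following Python sketch strips the optional tone digits from a code and looks the result up in a tiny table. The table `CODE_TABLE` and the function names are hypothetical and exist only for this example; an actual input method relies on a far larger dictionary and on candidate ranking.

```python
import re

# Hypothetical toy table: toneless input code -> candidate words.
CODE_TABLE = {
    "xianggang": ["香港"],  # Putonghua Pinyin
    "hoenggong": ["香港"],  # Cantonese Jyutping
}


def strip_tones(code: str) -> str:
    """Remove the optional tone digits, e.g. 'xiang1gang3' -> 'xianggang'."""
    return re.sub(r"[0-9]", "", code.lower())


def lookup(code: str) -> list:
    """Return the candidates stored for an input code, ignoring tone digits."""
    return CODE_TABLE.get(strip_tones(code), [])


print(lookup("xiang1gang3"))
print(lookup("hoenggong"))
```

Both the toned and the toneless forms of a code resolve to the same table entry, which mirrors the optional status of the tone number described above.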

A Chinese character can alternatively be input by form-based encoding. Most Chinese characters can be divided into a sequence of components, each of which is in turn composed of a sequence of strokes in writing order. There are a few hundred basic components,^[5] far fewer than the number of characters. By representing each component with an English letter and arranging the letters in the writing order of the character, the designer of an input method obtains a letter string that can be used as an input code on the English keyboard. The designer can also define a rule for selecting representative letters from the string if it is too long. For example, in the Cangjie input method, the character 疆 (border) is encoded as "NGMWM", corresponding to the components "弓土一田一", with some components omitted. Popular form-based encoding methods include Wubi (五笔) in the Mainland and Cangjie (仓颉) in Taiwan and Hong Kong.^[6]
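The correspondence between the Cangjie code and the components in the example above can be sketched in a few lines of Python. Only the four keys needed for 疆 are listed, and the names `CANGJIE_KEYS` and `decode` are made up for this illustration.

```python
# Cangjie keys used by the example character; not the full key layout.
CANGJIE_KEYS = {"N": "弓", "G": "土", "M": "一", "W": "田"}


def decode(code: str) -> str:
    """Translate a Cangjie letter code into its component sequence."""
    return "".join(CANGJIE_KEYS[key] for key in code.upper())


print(decode("NGMWM"))  # 弓土一田一
```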

The most important feature of intelligent input methods is the use of contextual constraints to select candidate characters. For example, in Microsoft Pinyin, when the user types the input code "daxuejiaoshou", the result is "大学教授 / 大學教授" (university professor); when the user types "daxuepiaopiao", the computer suggests "大雪飘飘 / 大雪飄飄" (heavy snow flying). Though the toneless Pinyin letters of 大学 and 大雪 are both "daxue", the computer can make a reasonable selection based on the subsequent words.^[7]
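One simple way to picture context-based selection is to score each combination of candidates by how often the two words occur together. The Python sketch below does this with invented counts; `CANDIDATES`, `PAIR_COUNTS` and `choose` are hypothetical, and nothing here is meant to describe how Microsoft Pinyin is actually implemented.

```python
# Hypothetical candidates for each toneless Pinyin code.
CANDIDATES = {
    "daxue": ["大学", "大雪"],
    "jiaoshou": ["教授"],
    "piaopiao": ["飘飘"],
}

# Invented co-occurrence counts standing in for corpus statistics.
PAIR_COUNTS = {("大学", "教授"): 50, ("大雪", "飘飘"): 20}


def choose(first_code: str, second_code: str) -> str:
    """Return the candidate pair with the highest co-occurrence count."""
    pairs = [
        (a, b)
        for a in CANDIDATES[first_code]
        for b in CANDIDATES[second_code]
    ]
    best = max(pairs, key=lambda pair: PAIR_COUNTS.get(pair, 0))
    return "".join(best)


print(choose("daxue", "jiaoshou"))  # 大学教授
print(choose("daxue", "piaopiao"))  # 大雪飘飘
```

The same code "daxue" yields a different word depending on what follows it, which is the effect described in the paragraph above.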

Chinese character encoding for information interchange

Inside the computer, each character is represented by an internal code. When a character is sent between two machines, it is transmitted in an information interchange code. Nowadays, information interchange codes such as ASCII and Unicode are often employed directly as internal codes.

The first GB (Guobiao, "national standard") Chinese character encoding standard is GB2312, released by the PRC in 1980. It includes 6,763 Chinese characters, of which the 3,755 frequently used ones are sorted by Pinyin and the rest by radicals (indexing components). GB2312 was designed for simplified Chinese characters; traditional characters that have been simplified are not covered. The code of a character is a two-byte number, usually written in hexadecimal. For instance, the GB codes of 香 and 港 in 香港 (Hong Kong) are CFE3 and B8DB respectively. GB2312 is still in use on some computers and on the Web, though newer standards with extended character sets, such as GB13000.1 and GB18030, have been released.^[8] The latest of these is GB18030, which supports both simplified and traditional Chinese characters and is consistent with the Unicode character set.^[9]

The Big5 encoding standard was designed by five big IT companies in Taiwan in the early 1980s and has been the de facto standard for representing traditional Chinese in computers ever since. Big5 is widely used in Taiwan, Hong Kong and Macau. The original Big5 standard included 13,053 Chinese characters, with none of the simplified characters of the Mainland. Each character is encoded with a two-byte code, usually written in hexadecimal, for example 香 (ADBB), 港 (B4E4) and 龍 (C073). Chinese characters in the Big5 character set are arranged in radical order. Extended versions of Big5 include Big-5E and Big5-2003, which add some simplified characters and Hong Kong Cantonese characters.^[10]
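The two-byte codes quoted for GB2312 and Big5 can be reproduced with the `gb2312` and `big5` codecs in Python's standard library. The helper name `legacy_codes` is made up for this sketch, and it assumes every character in the input maps to exactly two bytes, which holds for the Chinese characters used here.

```python
def legacy_codes(text: str, codec: str) -> list:
    """Encode text and return one 4-digit hexadecimal code per character."""
    encoded = text.encode(codec)
    # Each Chinese character occupies two bytes in these encodings.
    return [encoded[i:i + 2].hex().upper() for i in range(0, len(encoded), 2)]


print(legacy_codes("香港", "gb2312"))
print(legacy_codes("香港龍", "big5"))

# The traditional form 龍 is outside GB2312, so encoding it fails.
try:
    "龍".encode("gb2312")
except UnicodeEncodeError:
    print("龍 is not in GB2312")
```

The final lines illustrate the point that traditional characters which have a simplified form are not covered by GB2312.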

The full version of the Unicode standard represents a character with a 4-byte digital code, providing a huge encoding space to cover all characters of all languages in the world. The Basic Multilingual Plane (BMP) is a 2-byte kernel version of Unicode with 2^16 = 65,536 code points for important characters of many languages. There are 27,522 characters in the CJKV (China, Japan, Korea and Vietnam) Ideographs Area, including all the simplified and traditional Chinese characters in GB2312 and Big5. In Unicode 15.1, there is a multilingual character set of 149,813 characters, of which 98,682 (about two thirds) are Chinese characters sorted by Kangxi radicals. Even very rarely used characters are available. For example: H (0048), K (004B), 香 (9999), 港 (6E2F), 龍 (9F8D), 龙 (9F99), 龖 (9F96), 龘 (9F98), 𪚥 (2A6A5).^[11]^[12]
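Code points like those listed above can be inspected directly in Python, whose strings are sequences of Unicode code points. The sketch below prints the code point of each example character, whether it lies in the BMP, and how many bytes it takes in UTF-8; the helper names `code_point` and `in_bmp` are invented for this example.

```python
def code_point(ch: str) -> str:
    """Return the code point of a single character as hexadecimal digits."""
    return f"{ord(ch):04X}"


def in_bmp(ch: str) -> bool:
    """True if the character lies in the Basic Multilingual Plane."""
    return ord(ch) <= 0xFFFF


for ch in "HK香港龍龙龖龘𪚥":
    utf8_length = len(ch.encode("utf-8"))
    print(ch, code_point(ch), "BMP" if in_bmp(ch) else "beyond BMP", utf8_length)
```

Characters beyond the BMP need more than four hexadecimal digits and occupy four bytes in UTF-8, whereas Chinese characters inside the BMP occupy three.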

Unicode is becoming more and more popular. UTF-8, an encoding form of Unicode, was reported to be used by 98.1% of all websites as of December 2023. It is widely believed that Unicode will ultimately replace all other information interchange codes and internal codes, putting an end to confusion between codes.^[13]

Chinese character output

Like English and other languages, Chinese characters are output on printers and screens in different fonts and styles. The most popular Chinese fonts are the Song (宋体), Kai (楷体), Hei (黑体) and Fangsong (仿宋体) families.^[14]

Fonts appear in different sizes. In addition to the international measurement system of points, Chinese characters are also measured by size numbers (called zihao, 字号) invented by an American for Chinese printing in 1859. ^[15]

Word segmentation

It is straightforward to recognize words in English text because they are separated by spaces. However, Chinese words are not separated by any boundary markers. Hence, word segmentation is the first step for text analysis of Chinese. For example,

中文信息学报 (Chinese original text)
中文 信息 学报 (word-segmented text)
Chinese information journal (word-by-word English translation)
Journal of Chinese Information Processing (English name)

Chinese word segmentation on a computer is carried out by matching characters in the Chinese text against a lexicon (a list of Chinese words), either forward from the beginning of the sentence or backward from the end. There are two kinds of segmentation ambiguity: the intersection type (交集型歧义字段) and the polysemous type (多义型歧义字段).^[16]

Typically an intersection ambiguity is in the format of

ABC, where A, AB, BC and C are all words in the lexicon.

It is possible to divide the original character string into word AB followed by C, or A followed by BC. For example, ‘美国会’ may be segmented as ‘美 国会’ (the US Congress) or ‘美国 会’ (the US can/will).

The most common form of polysemous segmentation ambiguity is AB, where A, B, and AB are all words. That means the character string can be regarded as one single word or be divided into two. For example, consider the string ‘可以’ in the following sentences:

(1) 你	可以	坐下。
    you	can	sit down.
    You can sit down.
(2) 你	可	以	他们	为	样板。
    you	can	take	them 	as	example.
    You can take them as an example.

Word segmentation ambiguities can be resolved with contextual information, using linguistic rules and word co-occurrence probabilities derived from Chinese corpora. Matches with longer words are usually more reliable. The accuracy of automatic word segmentation has reached 95%.^[17] However, 100% accuracy cannot be guaranteed in the foreseeable future, because that would require a complete understanding of the text. An alternative solution is to encourage people to write with words separated, as is done in English.^[18] But that does not mean computer word segmentation will no longer be needed, because even in English, word segmentation is required for speech analysis.
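Lexicon matching from both directions, with a preference for longer words, can be sketched as forward and backward maximum matching. The Python example below uses a four-word toy lexicon taken from the intersection-ambiguity example; the function names and the lexicon are assumptions made for illustration, and practical segmenters add statistical disambiguation on top of such matching.

```python
# Toy lexicon from the intersection-ambiguity example (illustrative only).
LEXICON = {"美", "美国", "国会", "会"}
MAX_LEN = max(len(word) for word in LEXICON)


def forward_max_match(text: str) -> list:
    """Scan left to right, always taking the longest lexicon word."""
    words, i = [], 0
    while i < len(text):
        for n in range(min(MAX_LEN, len(text) - i), 0, -1):
            # A single character is accepted even if it is not in the lexicon.
            if text[i:i + n] in LEXICON or n == 1:
                words.append(text[i:i + n])
                i += n
                break
    return words


def backward_max_match(text: str) -> list:
    """Scan right to left, always taking the longest lexicon word."""
    words, j = [], len(text)
    while j > 0:
        for n in range(min(MAX_LEN, j), 0, -1):
            if text[j - n:j] in LEXICON or n == 1:
                words.append(text[j - n:j])
                j -= n
                break
    return words[::-1]


print(forward_max_match("美国会"))   # ['美国', '会']
print(backward_max_match("美国会"))  # ['美', '国会']
```

When the two directions disagree, as they do here, the string contains a segmentation ambiguity that has to be settled by context.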

Proper noun recognition

A proper noun is the name of a person, a place, an institution, etc. In English it is written with the initial letter of each word capitalized, for example, ‘Mr. John Nealon’, ‘America’ and ‘Cambridge University’. Chinese proper nouns, however, are usually not marked typographically in any way.^[19]

Recognition of the names of people and places in Chinese text can be supported by a list of names. However, such a list can never be complete, considering the huge number of places and people all over the world, not to mention that names constantly appear, change and disappear. Some names also coincide with common nouns. For example, there is a town named 民众 (Minzhong) in southern China, and 民众 is also a common noun meaning ‘people’. Therefore, recognition of the names of people and places has to make use of their distinguishing features in internal composition and external context. Corpora annotated with proper nouns can also serve as a useful reference.^[19]

A person’s name not found in the dictionary can be recognized with a list of surnames and titles, for example ‘张大方先生’ and ‘李经理’, where 张 (Zhang) and 李 (Li) are Chinese surnames, and 先生 (Mr.) and 经理 (Manager) are titles. In 张大方说, 张大方 can be successfully recognized as a person’s name by the rule that a Chinese given name normally follows the surname and consists of 1 or 2 characters, and by the fact that people can speak (说).
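The surname-and-title rule can be approximated with a regular expression. The Python sketch below uses a two-entry surname list and the cue words from the examples above; all of the names (`SURNAMES`, `CUES`, `find_person_names`) are hypothetical, and a usable recognizer would need much larger lists plus statistical evidence.

```python
import re

SURNAMES = "张李"               # hypothetical, tiny surname list
CUES = ["先生", "经理", "说"]    # titles, plus a verb that takes a human subject

# A surname, then up to two further Chinese characters (as few as possible),
# which must be followed by one of the cue words.
NAME_RE = re.compile(
    "([" + SURNAMES + "][\u4e00-\u9fff]{0,2}?)(?=" + "|".join(CUES) + ")"
)


def find_person_names(text: str) -> list:
    """Return the name candidates that precede a title or the verb 说."""
    return NAME_RE.findall(text)


print(find_person_names("张大方先生"))  # ['张大方']
print(find_person_names("李经理"))      # ['李']
print(find_person_names("张大方说"))    # ['张大方']
```

The pattern allows a bare surname before a title so that ‘李经理’ is covered; requiring at least one given-name character before 说 would be a stricter reading of the rule.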

Place names also have characteristics useful for computer recognition. For example, in ‘在广东省中山市民众镇’, the component words 省 (province), 市 (city) and 镇 (town) are end markers of place names, while 在 (in, at, on) is a preposition that frequently appears in front of a location.
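The end markers and the preposition cue can likewise be turned into a small pattern-matching sketch in Python. The names `PLACE_RE` and `find_places` are invented for this illustration, and the pattern handles only the three markers mentioned above.

```python
import re

# One or more Chinese characters (as few as possible) closed by an end marker:
# 省 (province), 市 (city) or 镇 (town).
PLACE_RE = re.compile(r"[\u4e00-\u9fff]+?[省市镇]")


def find_places(text: str) -> list:
    """Return place-name candidates found after the preposition 在, if any."""
    # str.find returns -1 when 在 is absent, so scanning then starts at 0.
    start = text.find("在") + 1
    return PLACE_RE.findall(text[start:])


print(find_places("在广东省中山市民众镇"))  # ['广东省', '中山市', '民众镇']
```

Such a pattern over-generates on ordinary words that happen to end in a marker character, which is one reason contextual evidence and annotated corpora are still needed.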

The accuracy of computer recognition has reached around 90% for personal names and 95% for place names.^[17]

Journals and proceedings

Journal of Chinese Information Processing (http://jcip.cipsc.org.cn/CN/home)
International Journal of Computational Linguistics and Chinese Language Processing (IJCLCLP) (https://www.aclclp.org.tw/journal/index.php)
China National Conference on Chinese Computational Linguistics (https://link.springer.com/conference/cncl)
Rocling Proceedings (https://www.aclclp.org.tw/pub_proce.php)

References

Citations

[1] Zhang 2016, p. 420.
[2] Language Institute 2020.
[3] "Unicode Statistics". www.unicode.org. Retrieved 2023-12-08.
[4] Su 2014, p. 218.
[5] National Language Commission 1997.
[6] Zhang 2016, p. 422.
[7] Su 2014, p. 222.
[8] Su 2014, pp. 213–215.
[9] Lunde, Ken (4 August 2022). "The GB 18030-2022 Standard". Medium. Retrieved 7 August 2022.
[10] "[chinese mac] Character Sets". chinesemac.org. Retrieved 2023-11-24.
[11] "Unicode Statistics".
[12] Unicode Consortium 2023.
[13] "Usage Statistics and Market Share of UTF-8 for Websites, December 2023". w3techs.com. Retrieved 2023-12-08.
[14] Li 2013, p. 62.
[15] Zhang 2006.
[16] Liu 2000, pp. 58–61.
[17] Xu 2006.
[18] Zhang 1998.
[19] Zhang 2016, p. 427.

Works cited

Fromkin, Victoria; Rodman, Robert (1993). An Introduction to Language (5th ed.). New York: HBJ. ISBN 0-03-075379-1.
Language Institute, Chinese Academy of Social Sciences (2020). 新华字典 [Xinhua Dictionary] (in Chinese) (12th ed.). Beijing: Commercial Press. ISBN 978-7-100-17093-2.
Li, Dasui 李大遂 (2013). 简明实用汉字学 [Concise and Practical Chinese Characters] (in Chinese) (3rd ed.). Beijing: Peking University Press. ISBN 978-7-301-21958-4.
Liu, Kaiying 刘开瑛 (2000). 中文文本自动分词和标注 [Automatic word segmentation and annotation of Chinese text] (in Chinese). Beijing: Commercial Press. ISBN 7-100-03068-4.
National Language Commission (1997). Chinese Character Component Standard of GB13000.1 Character Set for Information Processing (PDF). Beijing: National Language Commission of China.
Su, Peicheng 苏培成 (2014). 现代汉字学纲要 [Essentials of Modern Chinese Characters] (in Chinese) (3rd ed.). Beijing: The Commercial Press (商务印书馆). ISBN 978-7-100-10440-1.
Unicode Consortium (2023). Unicode Standard, Version 15.1.0. Mountain View, CA: Unicode Consortium.
Xu, Jialu; Fu, Yonghe (2006). 中文信息处理现代汉语词汇研究 [Morphological Studies in Modern Chinese Information Processing] (in Chinese). Guangzhou: Guangdong Education Press (广东教育出版社).
Zhang, Xiaoheng (1998). "也谈汉语书面语的分词问题 -- 分词连写十大好处" [Written Chinese Word Segmentation Revisited: Ten Advantages of Word-segmented Writing]. Journal of Chinese Information Processing. 12 (3): 57–63.
Zhang, Xiaoheng (2006). "字形的“号制”“点制”与“米制”" [The Number, Point and Metric Systems of Font Size]. Computer Engineering and Applications (计算机工程与应用). 42 (10): 175–177, 215.
Zhang, Xiaoheng (2016). "Computational Linguistics". The Routledge Encyclopedia of the Chinese Language. Oxford: Routledge. pp. 420–437. ISBN 978-0-415-53970-8.
