Chinese computational linguistics

Chinese computational linguistics is a subset of computational linguistics: the scientific study and information processing of the Chinese language by means of computers. Its purpose is both to gain a better understanding of how the language works and to make language applications more convenient. The term Chinese computational linguistics is often used interchangeably with Chinese information processing, though the former sounds more theoretical and the latter more technical.

Rather than introducing computational linguistics in a general sense, this article will focus on the unique issues involved with implementing the Chinese language compared to other languages. The contents include Chinese character information processing, word segmentation, proper noun recognition, natural language understanding and generation, corpus linguistics, and machine translation.

Chinese character information processing
Chinese character information processing is the computer processing of Chinese characters. While the English writing system uses a few dozen different characters, Chinese needs a much larger character set: the Xinhua Dictionary alone contains over ten thousand characters, and of the 149,813 characters in the Unicode multilingual character set, 98,682 (about two thirds) are Chinese. Computer processing of Chinese characters is therefore among the most demanding of any language.

Chinese character input
Computer input of Chinese characters is more complicated than for languages with smaller character sets. English, for example, is written with 26 letters and a handful of other characters, each assigned to its own key on the keyboard. In principle Chinese characters could be input the same way, but the approach is impractical for most applications: it would require a massive keyboard with thousands of keys, and locating individual characters on it would be slow and difficult. The alternative is to keep the standard Latin keyboard and encode each Chinese character as a string of Latin letters; this is the predominant method of Chinese character input today.

Sound-based encoding builds on an existing Latin transcription scheme for Chinese phonetics, such as the Pinyin scheme for Mandarin Chinese (Putonghua) or the Jyutping scheme for Cantonese. The input code of a character is its romanized letter string, optionally followed by a number representing the tone. For example, the Putonghua Pinyin input code of 香港 (Hong Kong) is "xianggang" or "xiang1gang3", and the Cantonese Jyutping code is "hoenggong" or "hoeng1gong2", all easily typed on an English keyboard.

A Chinese character can alternatively be input by form-based encoding. Most Chinese characters can be divided into a sequence of components, each in turn composed of a sequence of strokes in writing order. There are only a few hundred basic components, far fewer than the number of characters. By representing each component with a Latin letter and listing the letters in the character's writing order, an input method designer obtains a letter string that can serve as an input code on an English keyboard; if the string is too long, a rule can select representative letters from it. For example, in the Cangjie input method the character 疆 (border) is encoded as "NGMWM", corresponding to the components "弓土一田一", with some components omitted. Popular form-based methods include Wubi (五笔) in the Mainland and Cangjie (仓颉) in Taiwan and Hong Kong.

The most important feature of intelligent input is the use of contextual constraints to select among candidate characters. In Microsoft Pinyin, for example, typing the input code "daxuejiaoshou" yields "大学教授 / 大學教授" (university professor), while typing "daxuepiaopiao" yields "大雪飘飘 / 大雪飄飄" (heavy snow flying). Although the toneless Pinyin of both 大学 and 大雪 is "daxue", the computer makes a reasonable selection based on the words that follow.
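As an illustrative sketch (not Microsoft Pinyin's actual algorithm), this kind of context-based candidate selection can be modeled with word-pair scores. The dictionary entries and scores below are invented solely for the two examples:

```python
# Toy sketch of contextual candidate selection in a Pinyin input method.
# Homophones share the code "daxue"; the following word disambiguates them.
CANDIDATES = {
    "daxue": ["大学", "大雪"],   # university / heavy snow
    "jiaoshou": ["教授"],        # professor
    "piaopiao": ["飘飘"],        # flying, fluttering
}
# Plausibility of word pairs (a real system would use corpus bigram counts).
BIGRAM = {("大学", "教授"): 0.9, ("大雪", "教授"): 0.01,
          ("大学", "飘飘"): 0.01, ("大雪", "飘飘"): 0.8}

def convert(codes):
    """For each ambiguous code, pick the candidate that best fits
    the next code's default candidate."""
    words = []
    for i, code in enumerate(codes):
        cands = CANDIDATES[code]
        if len(cands) == 1 or i + 1 >= len(codes):
            words.append(cands[0])
            continue
        nxt = CANDIDATES[codes[i + 1]][0]
        words.append(max(cands, key=lambda w: BIGRAM.get((w, nxt), 0)))
    return "".join(words)
```

With these toy scores, `convert(["daxue", "jiaoshou"])` selects 大学 and `convert(["daxue", "piaopiao"])` selects 大雪, mirroring the behavior described above.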

Chinese character encoding for information interchange
Inside the computer, each character is represented by an internal code; when a character is sent between two machines, it travels as an information interchange code. Nowadays interchange codes such as ASCII and Unicode are often employed directly as internal codes. The first GB Chinese character encoding standard was GB2312, released in the PRC in 1980. It includes 6,763 Chinese characters: 3,755 frequently-used ones sorted by Pinyin, and the rest sorted by radicals (indexing components). GB2312 was designed for simplified characters; traditional characters that have been simplified are not covered. Each character's code is a two-byte value, usually written in hexadecimal; for instance, the GB codes of 香 and 港 (Hong Kong) are CFE3 and B8DB respectively. GB2312 is still in use on some computers and on the web, though newer versions with extended character sets, such as GB13000.1 and GB18030, have been released. The latest GB encoding, GB18030, supports both simplified and traditional characters and is consistent with the Unicode character set.

The Big5 encoding standard was designed by five big IT companies in Taiwan in the early 1980s and has been the de facto standard for representing traditional Chinese on computers ever since; it is widely used in Taiwan, Hong Kong and Macau. The original Big5 standard includes 13,053 Chinese characters, with none of the Mainland's simplified characters. Each character is encoded with a two-byte code, for example 香 (ADBB), 港 (B4E4), 龍 (C073). Characters in the Big5 set are arranged in radical order. Extended versions include Big-5E and Big5-2003, which add some simplified characters and Hong Kong Cantonese characters.

The full version of the Unicode standard represents a character with a 4-byte digital code, providing a huge encoding space to cover all characters of all languages in the world. The Basic Multilingual Plane (BMP) is a 2-byte kernel version of Unicode with 2^16=65,536 code points for important characters of many languages. There are 27,522 characters in the CJKV (China, Japan, Korea and Vietnam) Ideographs Area, including all the simplified and traditional Chinese characters in GB2312 and Big5 traditional. In Unicode 15.0, there is a multilingual character set of 149,813 characters, among which overs 98,682 (about 2/3) are Chinese sorted by Kangxi Radicals. Even very rarely-used characters are available. For example: H (0048) K (004B), 香 (9999), 港 (6E2F), 龍(9F8D), 龙 (9F99), 龖 (9F96), 龘 (9F98), 𪚥 (2A6A5).
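The example codes quoted in the last three paragraphs can be reproduced with Python's built-in gb2312 and big5 codecs (a quick check, assuming the standard codec tables):

```python
# GB2312 and Big5 are two-byte interchange codes; Unicode assigns each
# character a single code point. Helpers print them in hexadecimal.
def gb(ch):   return "".join(f"{b:02X}" for b in ch.encode("gb2312"))
def big5(ch): return "".join(f"{b:02X}" for b in ch.encode("big5"))
def uni(ch):  return f"{ord(ch):04X}"

print(gb("香"), gb("港"))        # CFE3 B8DB
print(big5("香"), big5("港"))    # ADBB B4E4
print(uni("香"), uni("港"), uni("龍"), uni("龙"))  # 9999 6E2F 9F8D 9F99
```

Note that encoding a traditional-only character such as 龍 with the gb2312 codec, or a simplified-only character with big5, raises a `UnicodeEncodeError`, illustrating the coverage gaps described above.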

Unicode is becoming more and more popular; UTF-8 is reportedly used by 98.1% of all websites. It is widely believed that Unicode will ultimately replace all other information interchange codes and internal codes, ending encoding confusion altogether.

Chinese character output
Like English and other languages, Chinese characters are output on printers and screens in different fonts and styles. The most popular Chinese fonts are the Song (宋体), Kai (楷体), Hei (黑体) and Fangsong (仿宋体) families.

Fonts come in different sizes. In addition to the international measurement system of points, Chinese characters are also measured by size numbers (called zihao, 字号), a system introduced by an American for Chinese printing in 1859.

Word segmentation
It is straightforward to recognize words in English text because they are separated by spaces. However, Chinese words are not separated by any boundary markers. Hence, word segmentation is the first step for text analysis of Chinese. For example,

中文信息学报 (original Chinese text)
中文 信息 学报 (word-segmented text)
Chinese information journal (word-by-word English gloss)
Journal of Chinese Information Processing (official English name)

Chinese word segmentation on a computer is carried out by matching the characters of a text against a lexicon (a list of Chinese words), either forward from the beginning of the sentence or backward from its end. Two kinds of segmentation ambiguity arise: the intersection type (交集型歧义字段) and the polynomial type (多义型歧义字段).

Typically an intersection ambiguity has the form ABC, where A, AB, BC and C are all words in the lexicon, so the string can be divided either as AB followed by C or as A followed by BC. For example, '美国会' may be segmented as '美国 会' (the US can/will) or as '美 国会' (the US Congress).

The most common form of polynomial ambiguity is AB, where A, B and AB are all words, so the string can be treated as a single word or divided into two. Consider the string '可以' in the following sentences:

(1) 你 可以 坐下。 (you can sit-down) "You can sit down."
(2) 你 可 以 他们 为 样板。 (you can take them as example) "You can take them as an example."

Segmentation ambiguities can be resolved with contextual information, using linguistic rules and probabilities of word co-occurrence derived from Chinese corpora; matches of longer words are usually more reliable. The correctness rate of automatic word segmentation has reached 95%, but 100% correctness cannot be guaranteed in the foreseeable future, because that would require a complete understanding of the text. An alternative solution is to encourage people to write with words already separated, as in English. Even then, computer word segmentation would still be needed, because word segmentation is required for speech analysis even in English.
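A minimal sketch of dictionary-based segmentation by forward maximum matching, with a toy lexicon containing only the words needed for the examples above:

```python
# Forward maximum matching (FMM) word segmentation against a tiny lexicon.
LEXICON = {"中文", "信息", "学报", "美国", "国会", "美", "会"}
MAX_LEN = 2  # length of the longest word in this toy lexicon

def fmm_segment(text):
    """At each position take the longest lexicon word;
    fall back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(text):
        for n in range(min(MAX_LEN, len(text) - i), 0, -1):
            if text[i:i + n] in LEXICON or n == 1:
                words.append(text[i:i + n])
                i += n
                break
    return words
```

On the intersection ambiguity '美国会', forward matching yields 美国 / 会, whereas matching backward from the end of the string would yield 美 / 国会; the two directions disagree exactly where intersection ambiguities occur, which is one simple way to detect them.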

Proper noun recognition
A proper noun is the name of a person, place, institution, etc. In English it is written with the initial letter of each word capitalized, as in 'Mr. John Nealon', 'America' and 'Cambridge University'. Chinese proper nouns, however, are usually not marked in any way.

Recognition of personal and place names in Chinese text can be supported by name lists, but such lists can never be complete, given the huge number of places and people in the world and the fact that names constantly appear, change and disappear. Moreover, some names coincide with ordinary words: the town of 民众 (Minzhong) in southern China, for example, shares its name with a common noun meaning 'people'. Recognition therefore has to exploit the distinguishing features of names in their internal composition and external context. Corpora annotated with proper nouns also serve as a useful reference.

A personal name not found in the dictionary can be recognized with the help of lists of surnames and titles, as in '张大方先生' and '李经理', where 张 (Zhang) and 李 (Li) are Chinese surnames, and 先生 (Mr.) and 经理 (Manager) are titles. In 张大方说, 张大方 can be recognized as a person's name by the rule that a Chinese given name normally follows the surname and consists of one or two characters, together with the fact that people can speak (说).

Place names also have characteristics useful for computer recognition. In '在广东省中山市民众镇', for example, the component words 省 (province), 市 (city) and 镇 (town) are end markers of place names, while 在 (in, at, on) is a preposition that frequently appears in front of a location.
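The surname/title and suffix cues described above can be sketched as simple rules over a segmented word list; the word lists here are illustrative, not exhaustive, and a real system would combine many more features:

```python
# Toy rule-based proper noun spotting on word-segmented input.
SURNAMES = {"张", "李"}
TITLES = {"先生", "经理"}
PLACE_SUFFIXES = {"省", "市", "镇"}   # province, city, town

def find_person(words):
    """A surname followed by a 1-2 character given name forms a
    candidate personal name (a following title would confirm it)."""
    names = []
    for i, w in enumerate(words):
        if w in SURNAMES and i + 1 < len(words) and 1 <= len(words[i + 1]) <= 2:
            names.append(w + words[i + 1])
    return names

def find_places(words):
    """Words ending in a place-name suffix are candidate place names."""
    return [w for w in words if len(w) > 1 and w[-1] in PLACE_SUFFIXES]
```

For example, `find_person(["张", "大方", "先生"])` returns the candidate 张大方, and `find_places` picks out 广东省, 中山市 and 民众镇 from the segmented example sentence.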

The correctness rate of computer recognition has reached around 90% for personal names and 95% for place names.

<!-- to be edited:

Chinese Natural Language Understanding and Generation
The ultimate goal of computational linguistics is to specify a theory of natural language understanding and generation in such detail that a person could write a computer program that understands and generates language.

Natural language understanding
Computationally, natural language understanding is the conversion of a sentence into an internal representation that can be used to solve relevant problems (Allen, 1995). For example, the sentence '约翰有一本书' (John has a book) can be converted into the predicate logic expression own(John, B1) and is-a(B1, BOOK), which allows the computer to answer such questions as 'Does John own anything?' and 'What is owned by John?'.

Computer analysis for natural language understanding can be separated into three steps:

(1) Syntactic analysis, also called parsing, maps a sentence to a syntactic description. For example, the sentence '他买了一个苹果' (He bought an apple) can be parsed into the representation S(NP(N(他)), VP(V(买了), NP(ART(一个), N(苹果)))).

(2) Semantic analysis converts the syntactic description into a form representing the literal meaning of the sentence, independent of context. A possible interpretation of the example sentence is (PAST EVENT, BUY-OBJ, (AGENT '他' PERSON), (OBJ '苹果' FRUIT/COMPUTER/MOBILE), (QUANTITY '一个')), where '苹果' (apple) may be a fruit, a computer or a mobile phone; an ambiguity not revealed at the syntactic level.

(3) Pragmatic analysis maps the semantic representation to a representation of the intended meaning of the sentence, making use of context and world knowledge, including common sense. If the previous sentence was 'He was thirsty', the apple should be a fruit, and the computer will output an analysis result like Bought(他, 苹果) and is-a(苹果, fruit).
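The internal-representation idea can be sketched in a few lines: the text's example facts are stored as tuples, and the two sample questions become simple queries. The predicate names follow the example above; everything else is illustrative:

```python
# "Understanding" as conversion to queryable predicate-logic facts.
# Facts for the sentence 约翰有一本书 (John has a book).
FACTS = {("own", "John", "B1"), ("is-a", "B1", "BOOK")}

def owns_anything(person):
    """'Does <person> own anything?'"""
    return any(p == "own" and a == person for p, a, _ in FACTS)

def owned_by(person):
    """'What is owned by <person>?'"""
    return [o for p, a, o in FACTS if p == "own" and a == person]
```

Here `owns_anything("John")` is true and `owned_by("John")` returns the book entity B1, the kind of inference the internal representation is meant to support.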
Natural language generation
From the point of view of speech acts, natural language generation involves plan generation and plan execution: the task of the software is to produce natural language sentences that achieve given linguistic goals (Allen, 1995). Natural language generation can likewise be divided into three steps:

(1) Planning speech actions: generate a plan that achieves the goal of speaking. Suppose, for example, the system's goal is to have a locked door (DOOR1) opened, the system knows that the key (KEY2) is on the desk called DESK3, and the user does not know where the key is. The goal is-open(DOOR1) is the effect of the action OPEN(USER, DOOR1, KEY2), which has two preconditions: INTEND(USER, OPEN(USER, DOOR1, KEY2)) and HAS(USER, KEY2). The first precondition is the effect of REQUEST(SYS, USER, OPEN(USER, DOOR1, KEY2)), which the computer can express. The second is the effect of the action PICK-UP(USER, KEY2), whose precondition KNOW(USER, LOCATION(KEY2)) is in turn the effect of the speech action INFORM(SYS, USER, is-on(KEY2, DESK3)). The result is a plan of two speech actions: REQUEST(SYS, USER, OPEN(USER, DOOR1, KEY2)), followed by INFORM(SYS, USER, is-on(KEY2, DESK3)).

(2) Generating semantic meanings: for example, the speech action REQUEST(SYS, USER, OPEN(OPENER, OBJECT, KEY)) can be realized with the semantic frame OPEN(AGENT = OPENER, THEME = OBJECT, TOOL = KEY).

(3) Generating sentences: this can be performed using generation rules or an augmented transition network (ATN). For our example, the following simplified rules can be used for English and Chinese generation:

English: G(ACTION) = G(>AGENT) G(>VERB) G(>THEME) {with G(>TOOL)}
Chinese: G(ACTION) = G(>AGENT) {用 G(>TOOL)} G(>VERB) G(>THEME)

In Chinese, the phrase representing the tool of an action is placed before the action verb.

The introduction above is based on traditional linguistics.
Recently there has been more and more research on natural language processing based on corpora, which will be our next topic.

Chinese Corpora and Corpus Linguistics
A corpus is a body of texts stored in a computer, normally built to represent a language or a sub-language. Corpus linguistics involves mainly three kinds of activity: (a) development of corpora, (b) development of tools for effective use of corpora, and (c) corpus-based linguistic research and language applications.

Corpora are built for specific purposes and aim at optimal representativeness of their languages. Texts are selected from the natural language the corpus is meant to represent and stored in the computer as files. In the early days texts were typed in manually from hard copy; nowadays they are more often captured from soft copy, or generated by OCR and speech recognition. To be more useful, many corpora are annotated with additional information such as word segmentation, part-of-speech tagging, syntactic annotation and semantic annotation. Annotation can be performed by the computer using statistical models derived from a training corpus manually annotated by humans (McEnery et al., 2006).

Because a corpus represents a language with authentic texts, it has many uses in language research and application. For example, we can consult a corpus to test a hypothesis in linguistic inquiry, provide machine translation with bilingual aligned sentences, improve dictionaries with real-life word meanings and examples, and support language learning with real-life expressions and their frequencies of use.

Some important Chinese corpora
 * The 'Academia Sinica Balanced Corpus of Modern Chinese' (Taiwan), known as the Sinica Corpus, is an influential corpus designed by Professors Keh-jiann Chen and Chu-Ren Huang. Every text is word-segmented and each word is tagged with its part of speech. Texts are collected from different areas to form a representative sample of modern (traditional) Chinese. The corpus is available on the web (http://www.sinica.edu.tw/SinicaCorpus).
 * The 'Linguistic Variations in Chinese Speech Communities' (LIVAC) synchronous corpus, designed by Professor Benjamin Tsou in Hong Kong, contains texts from representative Chinese newspapers and electronic media of Hong Kong, Taiwan, Beijing, Shanghai, Macau and Singapore. Material from the various communities is collected synchronously, offering an environment for comparative studies of Chinese. The corpus can be accessed at http://www.livac.org/.
 * The 'National Language Commission Corpora' (国家语委语料库, Mainland) comprise two member corpora. One is a corpus of modern (simplified) Chinese, allowing free information retrieval over 20 million characters of word-segmented, part-of-speech-tagged text. The other is a corpus of ancient Chinese with around 100 million characters of texts from the Zhou (周朝) to the Qing (清朝) dynasties. Both can be accessed at http://www.cncorpus.org/.
 * The Center for Chinese Linguistics Corpus (the CCL Corpus) at Peking University has three subcorpora: a corpus of modern Chinese with 581,794,456 characters of text, a corpus of ancient Chinese with 201,668,719 characters, and a Chinese-English aligned bilingual corpus especially useful for translation and comparative studies. It supports retrieval with disjoint keywords and flexible context lengths on both left and right, and is available at http://ccl.pku.edu.cn:8080/ccl_corpus/.

Corpus information retrieval
In corpus linguistics, words and word-forms may not be the same. The English word 'study', for example, appears in the forms 'study', 'studying' and 'studied', and its Chinese counterpart may appear in traditional form as '學習' or in simplified form as '学习'. It is normally safer to use word-forms in corpus information retrieval.

Frequency lists of word-forms and characters
The computer can count the occurrences of each word-form or character in a corpus and sort them by frequency of use. Such lists are very useful for language research, learning and application. Figure 2 shows a section of the Chinese character frequency lists of Hong Kong, the Mainland and Taiwan built by the Chinese University of Hong Kong; the complete lists and much more useful data can be found at http://humanum.arts.cuhk.edu.hk/Lexis/chifreq/.

Figure 2. Chinese character frequency of Hong Kong, Mainland and Taiwan.

Concordances
A concordance is a collection of the occurrences of a word-form, each in its own textual environment. The program that automatically generates a concordance from a corpus is called a concordance program, arguably the most important tool of corpus information retrieval. The most commonly-used layout is KWIC (Key Word In Context), in which the word-form under examination appears in the center of each line, set off by color or extra spaces on both sides; the length of the context is set according to the purpose at hand. Figure 3 is a sample concordance for the keyword '中文' (Chinese) retrieved from the Academia Sinica Balanced Corpus of Modern Chinese.

Figure 3. A sample of a concordance for keyword '中文'.

In addition to KWIC, there are line, sentence, paragraph, and even whole-text contexts. The entries of a concordance can be sorted by their order of occurrence in the original texts or by the neighboring words to the right or left. A concordance provides rich contextual and statistical information on the real-life use of a word. More details on Chinese corpus linguistics can be found in the textbook by Huang and Li (2002).
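A minimal KWIC concordance program might look like the sketch below; real concordancers add sorting options and configurable context types, and the corpus lines here are invented examples:

```python
# KWIC (Key Word In Context) concordance: every occurrence of the
# keyword, centered, with a fixed window of context on each side.
def kwic(texts, keyword, width=6):
    lines = []
    for t in texts:
        start = 0
        while (i := t.find(keyword, start)) != -1:
            left = t[max(0, i - width):i]
            right = t[i + len(keyword):i + len(keyword) + width]
            lines.append(f"{left:>{width}}[{keyword}]{right}")
            start = i + len(keyword)
    return lines

for line in kwic(["我学中文", "中文信息处理很有用"], "中文"):
    print(line)
```

Each output line shows the keyword bracketed in the center with up to six characters of context on either side, the basic KWIC layout described above.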

Chinese Machine Translation and Machine Assisted Translation
Machine translation (MT) is translation by machine: the conversion of text or speech in one natural language (the source language) into text or speech of the same meaning in another natural language (the target language) by means of a computer.

Introduction to Chinese machine translation and machine assisted translation
Chinese machine translation commenced not long after the release of the 'Weaver Memorandum' in 1949, customarily considered the starting point of MT research worldwide. In 1956, a project on Russian-Chinese machine translation was included in the national plan for scientific research (Feng 2004). The first commercial MT software in China was Transtar (译星), a rule-based English-Chinese system released by the China National Software and Service Co., Ltd. in 1988. Another important system was 863-IMT/EC (Intelligent MT for English-Chinese translation), developed by the Institute of Computer Technology, Academia Sinica, and later commercialized by the Huajian Group (华建集团). Translation between Chinese and many other languages is also available on Systran, Microsoft Translator and Google Translate. In addition to fully automatic translation, machine assisted translation is employed for Chinese using tools such as Trados and Google Translator Toolkit; home-grown tools of this category include Yaxin CAT by Yaxin Software (雅信) and Huajian IAT by the Huajian Group, both supporting the development and application of translation memories and terminology databases.

Approaches to machine translation
An MT system normally consists of a user interface, a source language analyzer, a target language generator, and the supporting dictionaries and knowledge bases. There are two basic approaches to developing MT systems: rule-based and corpus-based. Traditionally, MT is based on linguistic rules and artificial intelligence.
Rule-based methods parse the source text to create an intermediate symbolic representation, from which the text in the target language is generated. In recent years, a number of MT systems based on sentence-aligned bilingual corpora and statistical computation have also been developed. The following is a simplified example of rule-based English-Chinese machine translation with syntactic analysis. The English source sentence is 'He works in the office.' The translation consists of three steps.

Step 1: Find the syntactic structure of the source sentence according to some grammatical theory, as shown in Figure 4.a.

a. English syntactic structure: S(NP(PRON(He)), VP(V(works), PP(P(in), NP(ART(the), N(office)))))

b. Chinese syntactic structure: S(NP(PRON(他)), VP(PP(P(在), NP(ART(这), N(办公室))), V(工作)))

Figure 4. Rule-based English-Chinese machine translation with syntactic analysis.

Step 2: Convert the source language syntactic structure into a structure of the target language, as shown in Figure 4.b. In accordance with Chinese grammar, the PP subtree is moved in front of the verb it modifies.

Step 3: Look up the corresponding words in the bilingual dictionary and generate a sentence in the target language: 他在这办公室工作.

More often there are ambiguities, which have to be resolved by semantic or even pragmatic analysis. If we change the input to a slightly different sentence such as 'The thief was caught by the bank.', the task becomes much trickier. If 'bank' refers to a riverside, the prepositional phrase 'by the bank' expresses the location of the event, and the Chinese translation should be '那个小偷在河岸附近被抓到'. If 'bank' refers to a financial institution, 'by the bank' may still express a location, giving '那个小偷在银行附近被抓到'; alternatively, the financial bank may play the role of agent, in which case the translation becomes '那个小偷被银行抓到'. Selecting among these three interpretations requires semantic and pragmatic analysis based on contextual information and background knowledge.

MT evaluation
There are various methods of evaluating machine translation output. The traditional way is to have human judges assess a translation's quality.
Though time-consuming, human evaluation is still the most reliable method. Automated metrics include BLEU, METEOR and WER. BLEU, the most commonly used, compares the output against multiple reference translations prepared by human translators, taking into account matches of both words and word order (Papineni et al. 2002). In his book A Practical Guide for Translators (5th edition), Samuelsson-Brown (2010: 82) writes: 'Development has been slow since the first serious attempts at MT were made 60 or more years ago. These attempts were limited by contemporary hardware, software and other factors. The facility is becoming more of a viable option, but still needs a skilled translator or language editor to make the result acceptable'. Fully automatic high-quality MT remains a remote dream, and MT output still needs human post-editing. As for MT between English and Chinese, Chinese-English translation is poorer than English-Chinese because the analysis of Chinese is more challenging and less developed.
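The three-step rule-based example above (Figure 4) can be sketched in a few lines. The tree encoding, the single PP-movement rule, and the toy bilingual dictionary are all illustrative simplifications:

```python
# Rule-based transfer MT sketch: parse trees are (label, children)
# tuples with word strings as leaves.
DICT = {"He": "他", "works": "工作", "in": "在",
        "the": "这", "office": "办公室"}

def transfer(tree):
    """Structure conversion: move any PP child of a VP in front of
    its verb sibling, per the Chinese grammar rule in Step 2."""
    label, children = tree
    if label == "VP":
        pps = [c for c in children if not isinstance(c, str) and c[0] == "PP"]
        rest = [c for c in children if isinstance(c, str) or c[0] != "PP"]
        children = pps + rest
    return (label, [c if isinstance(c, str) else transfer(c)
                    for c in children])

def generate(tree):
    """Generation: walk the tree left to right, translating leaves."""
    label, children = tree
    return "".join(DICT.get(c, c) if isinstance(c, str) else generate(c)
                   for c in children)

# Step 1 output: the English parse of 'He works in the office.'
parse = ("S", [("NP", [("PRON", ["He"])]),
               ("VP", [("V", ["works"]),
                       ("PP", [("P", ["in"]),
                               ("NP", [("ART", ["the"]),
                                       ("N", ["office"])])])])])

print(generate(transfer(parse)))  # 他在这办公室工作
```

Without the transfer step, left-to-right generation would produce 他工作在这办公室, showing why the structure conversion of Step 2 is needed.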

Chinese-Chinese Machine Translation
Translation happens not only between languages but also within a language. This section introduces two important types of intra-Chinese machine translation: Chinese dialect MT and simplified-traditional Chinese conversion.

Chinese dialect MT
Automatic translation is more achievable here, since inter-dialect differences are much less serious than inter-language differences. The following discussion of inter-dialect MT emphasizes Cantonese and Putonghua (Zhang 1998b).

Dialects and Chinese dialects
The dialects of a language are its systematic variations, developed when speakers of a common language are separated geographically and socially; among a group of dialects, normally one serves as the lingua franca. Inter-dialect differences exist in pronunciation, vocabulary and syntax, but they are usually insignificant in comparison with the similarities the dialects share, and it has been claimed that the dialects of one language are mutually intelligible (Fromkin and Rodman 1993: 276). Nevertheless, the seven major Chinese dialects (the Northern dialect, with Putonghua as its standard version, plus Cantonese, Wu, Min, Hakka, Xiang and Gan) are for the most part mutually unintelligible, and inter-dialect translation is required for successful communication, especially between Cantonese, the most influential dialect in South China, and Putonghua, the lingua franca of China (Yuan 1989).

Linguistic considerations in dialect MT
Most differences among the dialects of a language lie in pronunciation: words with similar written forms are often pronounced differently in different dialects. For example, the word 香港 (Hong Kong) is pronounced xiang1gang3 in Putonghua but hoeng1gong2 in Cantonese. There are also lexical differences, although dialects share most of their words: the word for 'umbrella' is 雨伞 (yu3san3) in Putonghua and 遮 (ze1) in Cantonese.
Differences in syntactic structure are less common but linguistically more complicated and computationally more challenging. The positions of some adverbs, for example, vary from dialect to dialect. To express 'You go first':

Putonghua: 你 先 走 (ni3 xian1 zou3; you first go)
Cantonese: 你 行 先 (nei5 hang4 sin1; you go first)

Word processing in dialect MT
Inter-dialect differences mostly lie in words. The Cantonese vocabulary contains some seven to eight thousand dialect words (including idioms and fixed phrases) whose forms differ from their Putonghua counterparts, about one third of the total Cantonese vocabulary. For historical reasons, Hong Kong Cantonese is linguistically more distant from Putonghua than the Cantonese of other regions in Mainland China. One can easily spot Cantonese dialect articles in Hong Kong newspapers that are totally unintelligible to Putonghua speakers, while Putonghua articles are easily understood by Cantonese speakers. The critical task in Cantonese-to-Putonghua MT is therefore word processing, especially the recognition and translation of dialect words (Zhang 1999). The most challenging issue is Cantonese ambiguous words that have no semantically equivalent counterparts in Putonghua. For example, the Putonghua word 桔 (ju2, orange) has much wider coverage than the Cantonese 桔 (gwat1): besides the Cantonese 桔, it also covers the fruits Cantonese calls 柑 (gam1) and 橙 (caang2). Conversely, the Cantonese 行 semantically covers both Putonghua 走 (go, walk) and 行 (row).

Simplified-traditional Chinese conversion
For historical reasons, modern Chinese is written in both traditional and simplified characters, which quite frequently makes text conversion between the two scripts indispensable.
Computer-based simplified-traditional Chinese conversion is available in MS Word, Google Translate and many language tools on the web, and has reached very high precision. However, because of one-to-many relationships between simplified and traditional characters (for example, the simplified character 干 corresponds to 干, 乾 and 幹 in traditional Chinese), 100% correct conversion cannot be guaranteed, and human proofreading is needed, especially when high-quality output is required. Zhang (2011) has developed a tool that goes on to support human proofreading after converting a text between simplified and traditional Chinese. Its main features are:

 * Bi-directional simplified-traditional conversion with four options: (a) simplified to Hong Kong traditional Chinese, (b) simplified to Taiwan traditional Chinese, (c) simplified to high-frequency traditional characters, and (d) traditional to simplified Chinese.
 * Support for human proofreading by (a) highlighting all characters with one-to-many simplified-traditional relationships, (b) providing relevant dictionary information for reference, and (c) correcting mistakes automatically with a single click.
 * Use of standard and frequently-used characters and punctuation marks in the target writing system.

The tool is available on the web at http://myweb.polyu.edu.hk/~ctxzhang/jfj/ and http://www.mypolyuweb.hk/~ctxzhang/jfj/, and further improvement is in progress.

Normally, simplified-traditional conversion is performed character by character; for example, simplified 汉字信息处理 (Chinese character information processing) is converted to traditional 漢字信息處理. MS Word can also translate common terms (though not always correctly), in which case the conversion becomes 漢字資訊處理, because in Taiwan and Hong Kong 'information' is more often rendered 資訊 than 信息.
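A character-to-character converter with proofreading support can be sketched as follows; the mapping table is a tiny illustration, and a real tool would use full conversion tables plus context to rank the candidates:

```python
# Simplified-to-traditional conversion that flags one-to-many
# characters for human proofreading, as in the tool described above.
S2T = {"汉": ["漢"], "字": ["字"], "信": ["信"], "息": ["息"],
       "处": ["處"], "理": ["理"],
       "干": ["干", "乾", "幹"]}   # one-to-many: needs a human

def convert(text):
    """Return (converted_text, positions_flagged_for_proofreading)."""
    out, flagged = [], []
    for i, ch in enumerate(text):
        options = S2T.get(ch, [ch])   # unknown characters pass through
        out.append(options[0])        # default: first candidate
        if len(options) > 1:
            flagged.append(i)         # one-to-many: highlight for a human
    return "".join(out), flagged
```

Converting 汉字信息处理 yields 漢字信息處理 with nothing flagged, while any occurrence of 干 is flagged so a proofreader can choose among 干, 乾 and 幹.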
A comparative study on simplified-traditional Chinese conversion tools can be found in a recently published paper (Zhang, 2014).

-->

Journals and proceedings

 * Journal of Chinese Information Processing (http://jcip.cipsc.org.cn/CN/home)
 * International Journal of Computational Linguistics and Chinese Language Processing (IJCLCLP) (https://www.aclclp.org.tw/journal/index.php)
 * China National Conference on Chinese Computational Linguistics (https://link.springer.com/conference/cncl)
 * Rocling Proceedings (https://www.aclclp.org.tw/pub_proce.php)