Chinese character IT

Chinese character IT is the information technology for computer processing of Chinese characters. While the English writing system uses a few dozen different characters, the Chinese language needs a much larger character set. There are over ten thousand characters in the Xinhua Dictionary. In the Unicode multilingual character set of 149,813 characters, 98,682 (about two-thirds) are Chinese. As a result, computer processing of Chinese characters is considerably more demanding than that of most other writing systems.

Compared with other languages, Chinese raises special issues in three areas: computer input, internal encoding, and output of Chinese characters.

Character input
Computer input of Chinese characters is by no means as easy as input of English. English is written with 26 letters and a handful of other characters, and each character is assigned to a key on the keyboard. Chinese could be input in a similar way, but that would require a huge keyboard with thousands of keys, and searching for a character on such a keyboard would be a daunting job.

People did try to 'shrink' the Chinese keyboard by putting multiple characters on one key. That turned the original one-step input procedure into two steps for the writer:
 * 1) pressing the key for the character group of the target character,
 * 2) selecting the target character in the group.

The resulting keyboard still remained clumsy, because if you put more characters on one key, the key becomes bigger to make the characters recognizable, and selecting a character from a large group is difficult. Additionally, it is not easy to group the characters evenly in a reasonable and easy-to-learn way. Another drawback of a Chinese keyboard for direct whole character input is its inconsistency with English input.

An alternative way is to encode each Chinese character in English characters, enabling Chinese input on an English keyboard. As a matter of fact, this method has become predominant for Chinese computer input. The software of an encoding input method includes a character-code table. When an ASCII input code is typed on the English keyboard, the software will search for matching Chinese characters in the table. If there are multiple characters sharing the same code, they will be presented to the user for selection. To make the input method easy to learn, encoding must be based on distinctive features in forms, sounds or meanings of Chinese characters. Because the meanings of characters tend to be more abstract and complicated, input encoding is normally based on the sound or form.
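The table lookup described above can be sketched in a few lines of Python. This is only an illustration: the table contents, the function names `candidates` and `choose`, and the idea of selecting by list index are assumptions made for the example, not the design of any particular input method.

```python
# Hypothetical miniature character-code table. Real input methods ship
# tables with tens of thousands of entries.
CODE_TABLE = {
    "xiang": ["香", "想", "向"],
    "gang": ["港", "刚", "钢"],
    "xianggang": ["香港"],
}


def candidates(input_code: str) -> list:
    """Return the characters or words registered under an ASCII input code."""
    return CODE_TABLE.get(input_code.lower(), [])


def choose(input_code: str, index: int = 0) -> str:
    """Pick one candidate; `index` models the user's selection among duplicates."""
    options = candidates(input_code)
    if not options:
        raise KeyError(f"no entry for input code {input_code!r}")
    return options[index]


if __name__ == "__main__":
    print(candidates("xiang"))   # several characters share this code
    print(choose("xianggang"))   # a single match needs no selection
```

When a code maps to several characters, the whole list is shown and the user's pick supplies the index; when it maps to one entry, that entry can be committed directly.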

Sound-based encodings
Sound-based encoding is normally based on an existing Latin character scheme for Chinese phonetics, such as pinyin for Putonghua, and Jyutping for Cantonese. The input code of a Chinese character is its pinyin letter string followed by an optional number representing the tone. For example, the Putonghua pinyin input code of 香港 (Hong Kong) is xianggang or xiang1gang3, and the Cantonese Jyutping code is hoenggong or hoeng1gong2, all of which can be easily input via an English keyboard. In Putonghua pinyin, there are two letters not appearing on the English keyboard: ê and ü. According to the national standard, ê should be represented by 'ea', and ü by 'v' in the pinyin input code. In some Chinese input software ê is also represented as 'e^', and ü as 'u:' or 'uu'. Popular sound-based input methods in China include Microsoft Pinyin, Sogou Pinyin, Google Pinyin and Jyutping on the mainland and Hong Kong, and bopomofo in Taiwan.
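As a rough illustration of how such input codes can be derived, the following Python sketch converts numbered pinyin into a keyboard-friendly ASCII string using the substitutions mentioned above (ê as 'ea', ü as 'v'). The function name `to_input_code` and the `keep_tones` switch are assumptions made for this example.

```python
import re
import unicodedata


def to_input_code(pinyin: str, keep_tones: bool = True) -> str:
    """Turn numbered pinyin such as 'xiang1gang3' into an ASCII input code."""
    # Normalise so that 'ü' and 'ê' are single precomposed characters.
    code = unicodedata.normalize("NFC", pinyin).lower()
    # Substitutions described in the text: ê -> 'ea', ü -> 'v'.
    code = code.replace("ê", "ea").replace("ü", "v")
    if not keep_tones:
        # Tone numbers are optional in the input code, so they can be dropped.
        code = re.sub(r"[1-5]", "", code)
    return code


if __name__ == "__main__":
    print(to_input_code("xiang1gang3"))
    print(to_input_code("xiang1gang3", keep_tones=False))
    print(to_input_code("nü3"))
```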

There are a number of advantages for sound-based encoding:
 * 1) Easy to learn because most Chinese writers have already got a good command of Putonghua and pinyin.
 * 2) Consistent with Chinese language learning.
 * 3) Allows simplified and traditional Chinese characters to be input in a similar way.
 * 4) Allows writing Chinese and English on the same keyboard.

The main shortcoming of sound-based encoding is its high degree of duplicate encoding, with homophonous Chinese characters sharing the same code. A Chinese character is normally pronounced as one syllable. Putonghua has only about 400 distinct syllables if tones are ignored, or approximately 1,200 when tones are counted, whereas there are tens of thousands of Chinese characters. On average, therefore, each syllable has to cover more than 10 characters. This problem can be largely solved by inputting Chinese word by word instead of character by character, because most words in modern Chinese consist of more than one character and duplicate encoding is much less frequent at the word level. For example, the pinyin of 香港 (Hong Kong) is unique to the word, while each of the characters 香 and 港 shares its pronunciation with many other characters. Another limitation of sound-based Chinese input is that the user must know the pronunciation of a Chinese character before inputting it. This issue can be solved by form-based encoding.

Form-based encodings
A Chinese character can alternatively be input according to its form (or shape) and structure. Most Chinese characters can be divided into a sequence of components each of which is in turn composed of a sequence of strokes in writing order. For example, the character 福 ('good fortune', 'happiness') can be decomposed as

There are a few hundred basic components, far fewer than the number of characters. By representing each component with an English letter and arranging the letters in the writing order of the character, the input method creator obtains a letter string that can be used as an input code on the English keyboard. The creator can also design a rule for selecting representative letters from the string if it is too long. For example, in the Cangjie input method, the character 疆 ('border') is encoded as "NGMWM", corresponding to the components "弓土一田一", with some components omitted.

Stroke-based coding is simpler than component-based coding, but the codes tend to be longer. There are approximately 30 to 40 distinctive strokes in Chinese characters. They are usually classified into five categories, heng (一), shu (丨), pie (丿), dian (丶) and zhe (𠃍), for dictionary lookup and Chinese input on a mobile phone. For Chinese input with an ASCII keyboard, two strokes can be combined to form 5*5=25 different pairs for mapping to the English letters. For example, in the input method ZYQ, the sequence of stroke pairs '一一, 一丨, 一丿, ..., 𠃍丿, 𠃍丶, 𠃍𠃍' is represented by 'a, b, c, ..., w, x, y' respectively. Popular form-based encoding methods include Wubi on the mainland and Cangjie in Taiwan and Hong Kong.
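The stroke-pair mapping can be sketched as follows. The text gives only the first three and last three pairs; this example assumes the 25 pairs are enumerated in row-major order over the five stroke categories, which reproduces those six pairs but is otherwise an inference. The function `encode_strokes`, and its refusal to handle an odd number of strokes, are simplifications for illustration and not the actual rules of ZYQ.

```python
import string

# The five stroke categories in the order given in the text:
# heng, shu, pie, dian, zhe.
STROKE_CLASSES = ["一", "丨", "丿", "丶", "𠃍"]

# Assumed row-major enumeration: pair (i, j) gets letter number 5*i + j.
PAIR_TO_LETTER = {
    first + second: string.ascii_lowercase[5 * i + j]
    for i, first in enumerate(STROKE_CLASSES)
    for j, second in enumerate(STROKE_CLASSES)
}


def encode_strokes(strokes: list) -> str:
    """Encode a sequence of stroke categories two strokes at a time."""
    if len(strokes) % 2:
        raise ValueError("this sketch only handles an even number of strokes")
    return "".join(
        PAIR_TO_LETTER[strokes[k] + strokes[k + 1]]
        for k in range(0, len(strokes), 2)
    )


if __name__ == "__main__":
    print(len(PAIR_TO_LETTER))
    print(encode_strokes(["一", "丨", "丿", "丶"]))
```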

The pros and cons of form-based input methods are complementary to those of sound-based methods. The major advantage of form-based methods lies in their low degree of duplicate encoding, which enables high-speed input of Chinese characters. The major shortcoming is that they are difficult to learn. Normally students have to remember over one hundred components and their corresponding English letters. In addition, they have to learn the complicated rules for breaking a character into a sequence of components and making a selection among them.

Optical character recognition
Chinese characters can also be input into the computer by optical character recognition (OCR), handwriting recognition and speech recognition based on technology similar to that of English.

Compared with English, Chinese OCR and handwriting recognition are more difficult, because there are thousands of different commonly-used characters instead of 26 letters. Generally speaking, recognition of printed characters is more accurate than recognition of handwritten characters, because printed forms are more standardized. There are OCR tools for different fonts, including the popular Song, Kai and Hei. In comparison with offline handwriting recognition, online handwriting recognition is more efficient, because the computer 'sees' not only the written character but also the procedure of writing it.

Speech recognition
Speech recognition converts a continuous speech signal into a sequence of words. There are two problems: the variation in pronunciation of words by different speakers and the existence of homophones such as 'pair', 'pear' and 'pare' in English, and 攻势, 公式, 公示 (gong1shi4) in Chinese. Speech recognition relies on corpus statistical methods and linguistic rules. A helpful feature of Chinese is that each character is pronounced with one syllable.

Both Chinese character recognition and speech recognition have reached application level. However, neither can guarantee 100% correctness without human proofreading or online character selection.

Intelligent input engines
The most important feature of intelligent input is the application of contextual constraints to the selection of candidate characters. For example, in Microsoft Pinyin, when the user types the input code "daxuejiaoshou", they get 大学教授 (university professor); when they type "daxuepiaopiao", the computer suggests 大雪飘飘 (heavy snow flying). Though the non-diacritical pinyin letters of 大学 and 大雪 are both "daxue", the computer can make a reasonable selection based on the subsequent words.
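One simple way to picture context-based selection is to score each combination of candidates by how often the words occur together. The sketch below is not how Microsoft Pinyin works internally; the candidate lists, the co-occurrence counts and the function `pick` are all invented for illustration, and the input is assumed to be already split into syllable groups.

```python
# Hypothetical candidates per input code.
CANDIDATES = {
    "daxue": ["大学", "大雪"],
    "jiaoshou": ["教授"],
    "piaopiao": ["飘飘"],
}

# Hypothetical co-occurrence counts, standing in for corpus statistics.
PAIR_COUNTS = {
    ("大学", "教授"): 50,
    ("大雪", "飘飘"): 30,
}


def pick(first_code: str, next_code: str) -> str:
    """Choose the candidate pair with the highest co-occurrence count."""
    pairs = (
        (first, second)
        for first in CANDIDATES[first_code]
        for second in CANDIDATES[next_code]
    )
    best = max(pairs, key=lambda pair: PAIR_COUNTS.get(pair, 0))
    return "".join(best)


if __name__ == "__main__":
    print(pick("daxue", "jiaoshou"))
    print(pick("daxue", "piaopiao"))
```

The same ambiguous code "daxue" resolves differently depending on what follows it, which is the behaviour described in the example above.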

Intelligent Chinese input also makes use of corpus information and linguistic rules. The computer's selection among ambiguous Chinese characters is not always correct, and further improvement is required.

Other input
In the Chinese writing system, there are graphemes other than complete Chinese characters, such as punctuation marks (e.g. '。', '、' and '《》'), strokes (e.g. '丿', '𠃍' and '乚'), radicals (e.g. '氵', '宀' and '刂'), and letters used for romanization, such as the vowel letters with diacritics used in pinyin and the Yale romanization of Cantonese (e.g. 'ā', 'á', 'ǎ', 'à').

Facilities available on Microsoft Windows, Office and the web make it possible to input almost all of these Chinese auxiliary characters. They range from the input of punctuation marks in general Chinese input methods, to inputting diacritical pinyin with soft keyboards, to inputting strokes and radicals from the Unicode website and by Unicode-character conversion, as well as special tools on the Web for inputting pinyin and other characters. More information on non-logogram input can be found in a paper that includes a list of 280 non-ASCII non-logograms, each annotated with its Unicode code point and an input code designed by the paper's author. It is also possible to input a character in Microsoft Word by typing its Unicode code point and pressing Alt+X.

Chinese character encoding for information interchange
Inside the computer each character is represented by an internal code. When a character is sent between two machines, it is in information interchange code. Nowadays, information interchange codes, such as ASCII and Unicode, are often directly employed as internal codes. The following sections will introduce the most important encoding standards used in Chinese information technology, including GB, Big5 and Unicode.

GB
GB stands for Guobiao, short for "Guojia Biaozhun" (国家标准, 'national standard') in Putonghua, and is the prefix for reference numbers of official standards issued by the People's Republic of China.

The first GB Chinese character encoding standard was GB2312, released in 1980. It includes 6,763 Chinese characters, with the 3,755 frequently-used ones sorted by pinyin and the rest by radicals (indexing components). GB2312 was designed for simplified Chinese characters; traditional characters which have been simplified are not covered. The code of a character is represented by a two-byte hexadecimal number. For instance, the GB codes of 香港 (Hong Kong) are CFE3 and B8DB respectively. GB2312 is still in use on some computers and the WWW, though newer versions with extended character sets, such as GB13000.1 and GB18030, have been released.

The latest version of GB encoding is GB18030. GB18030 supports both simplified and traditional Chinese characters, and is consistent with Unicode's character set.
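The GB codes quoted above can be checked with Python's built-in `gb2312` and `gb18030` codecs. The helper name `encoded_hex` is an assumption for this sketch; it returns `None` when a character is outside the chosen character set.

```python
from typing import Optional


def encoded_hex(char: str, codec: str) -> Optional[str]:
    """Return the byte code of `char` in `codec` as hex, or None if unsupported."""
    try:
        return char.encode(codec).hex().upper()
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    for char in "香港":
        print(char, encoded_hex(char, "gb2312"))
    # The traditional character 龍 has been simplified to 龙, so it is
    # expected to be missing from GB2312 but present in GB18030.
    print("龍", encoded_hex("龍", "gb2312"), encoded_hex("龍", "gb18030"))
```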

Big5
Big5 encoding was designed by five big IT companies in Taiwan in the early 1980s, and has been the de facto standard for representing traditional Chinese in computers ever since. Big5 is widely used in Taiwan, Hong Kong and Macau. The original Big5 standard included 13,053 Chinese characters, with none of the simplified characters of the Mainland. Each character is encoded with a two-byte hexadecimal code, for example 香 (ADBB), 港 (B4E4) and 龍 (C073). Chinese characters in the Big5 character set are arranged in radical order. Extended versions of Big5 include Big-5E and Big5-2003, which include some simplified characters and Hong Kong Cantonese characters.

Unicode
Unicode is the most influential international standard for multilingual character encoding. It is consistent with (or virtually equivalent to) the standard ISO/IEC 10646. The full Unicode code space runs from 0 to 10FFFF (hexadecimal), so a code point fits in a 4-byte code, providing a huge encoding space to cover all characters of all languages in the world. The Basic Multilingual Plane (BMP) is the 2-byte kernel of Unicode, with 2^16=65,536 code points for important characters of many languages. There are 27,522 characters in the CJKV (China, Japan, Korea and Vietnam) Ideographs Area, including all the simplified and traditional Chinese characters in GB2312 and Big5.

In Unicode 15.1, there is a multilingual character set of 149,813 characters, among which 98,682, about two-thirds, are Chinese characters, sorted by Kangxi radicals. Even very rarely-used characters are available. The following are some example characters with their Unicode code points in brackets: H (0048), K (004B), 香 (9999), 港 (6E2F), 龍 (9F8D), 龙 (9F99), 龖 (9F96), 龘 (9F98), 𪚥 (2A6A5).
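The code points listed above can be reproduced with Python's built-in `ord` and `chr`. The helper names below are assumptions for this sketch; `in_bmp` simply tests whether a code point fits in the 2-byte Basic Multilingual Plane.

```python
def code_point(char: str) -> str:
    """Return the Unicode code point of a single character as hex digits."""
    return f"{ord(char):04X}"


def from_code_point(hex_code: str) -> str:
    """Return the character for a hexadecimal code point such as '9999'."""
    return chr(int(hex_code, 16))


def in_bmp(char: str) -> bool:
    """True if the character lies in the Basic Multilingual Plane."""
    return ord(char) <= 0xFFFF


if __name__ == "__main__":
    rare = from_code_point("2A6A5")  # the last example character in the list
    for char in ["H", "K", "香", "港", "龍", "龙", rare]:
        utf8_len = len(char.encode("utf-8"))
        print(char, code_point(char), "BMP" if in_bmp(char) else "non-BMP",
              f"{utf8_len} bytes in UTF-8")
```

Characters outside the BMP, such as the last one, need four bytes in UTF-8, whereas the BMP Chinese characters shown here need three.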

All the 5,009 characters of the Hong Kong Supplementary Character Set (HKSCS) are included in Unicode. HKSCS was developed by the Hong Kong government as a collection of locally specific Chinese characters not available on the computer in the early days, for instance 咗 (already), 嘢 (thing), 脷 (tongue), and 曱甴 (cockroach).

As GB, Big5 and Unicode are used concurrently in Chinese encoding, a computer may mistakenly interpret a text with an encoding standard different from the one it was written in. The text is then displayed with wrong characters, a phenomenon called "luànmǎ" (garbled characters), which occasionally happens on the Web or in emails. This problem is often solved by manual selection of the encoding or character set (as in Web browsers) or by code conversion beforehand.
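The effect is easy to reproduce. The following sketch writes 香港 as GB2312 bytes and then reads the same bytes as Big5; the function name `misread` is an assumption for this example, and `errors="replace"` is used so that any byte sequence invalid in the second encoding becomes a replacement character instead of raising an error.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode `text` in one encoding and decode the bytes in another."""
    raw = text.encode(written_as)
    return raw.decode(read_as, errors="replace")


if __name__ == "__main__":
    original = "香港"
    print(misread(original, "gb2312", "big5"))    # wrong characters
    print(misread(original, "gb2312", "gb2312"))  # correct when codecs match
```

Selecting the right encoding, as a browser's encoding menu does, corresponds to passing the same codec for reading as was used for writing.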

Unicode is becoming more and more popular. It is reported that UTF-8 (Unicode) is used by 98.1% of all websites. It is widely believed that Unicode will ultimately replace all other information interchange codes and internal codes, and that garbled characters will then no longer occur.

Typefaces
Like English and other languages, Chinese characters are output on printers and screens in different fonts and styles. The most popular Chinese fonts are the Song (宋体), Kai (楷体), Hei (黑体) and Fangsong (仿宋体) families, for example, 汉字字体 (Song) 汉字字体 (Kai) 汉字字体 (Hei or Black) 汉字字体 (FangSong)

Font size
Fonts appear in different sizes. In addition to the international measurement system of points, Chinese characters are also measured by size numbers (called zihao, 字号), invented by an American for Chinese printing in 1859. Table 1 lists all the numbered font sizes available in the Chinese version of MS Word and their equivalents in points.

Table 1: Chinese font sizes in numbers, points and mm

字号 (Number)	点数 (pt)	毫米 (mm)	Example
八号 (#8)	5	1.76	中文
七号 (#7)	5.5	1.93	中文
小六号 (#small 6)	6.5	2.28	中文
六号 (#6)	7.5	2.64	中文
小五号 (#small 5)	9	3.16	中文
五号 (#5)	10.5	3.69	中文
小四号 (#small 4)	12	4.22	中文
四号 (#4)	14	4.92	中文
小三号 (#small 3)	15	5.27	中文
三号 (#3)	16	5.62	中文
小二号 (#small 2)	18	6.33	中文
二号 (#2)	22	7.73	中文
小一号 (#small 1)	24	8.44	中文
一号 (#1)	26	9.14	中文
小初号 (#small primary)	36	12.65	中文
初号 (#primary)	42	14.76	中文

This table is particularly useful for Chinese typesetting on computers not supporting font sizes in numbers. For example, from the table, we get to know that Chinese size number 3 (三号) is equivalent to 16 points, or 5.62mm high, as shown by the example characters.
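The relationship between the point and millimetre columns can be sketched in code. The conversion factor used here, 25.4/72.27 mm per point, is an assumption: it is inferred from the values in Table 1 rather than stated in the text. Only a few rows of the table are included, and the names `ZIHAO_POINTS` and `zihao_to_mm` are chosen for this example.

```python
# Assumed conversion factor, inferred from the values in Table 1.
MM_PER_POINT = 25.4 / 72.27

# A few size numbers and their point equivalents, copied from Table 1.
ZIHAO_POINTS = {
    "初号": 42,
    "三号": 16,
    "小四号": 12,
    "五号": 10.5,
    "八号": 5,
}


def points_to_mm(points: float) -> float:
    """Convert a size in points to millimetres, rounded to two decimals."""
    return round(points * MM_PER_POINT, 2)


def zihao_to_mm(name: str) -> float:
    """Look up a Chinese size number and return its height in millimetres."""
    return points_to_mm(ZIHAO_POINTS[name])


if __name__ == "__main__":
    for name, points in ZIHAO_POINTS.items():
        print(name, points, "pt =", zihao_to_mm(name), "mm")
```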

The image of a Chinese character in a particular font is represented in the computer either by a matrix of dots (a dot matrix font, also called a bitmapped font) or by outlines (an outline font), as is the case in English.
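As a minimal illustration of the dot matrix idea, the sketch below stores a glyph as one integer per row and prints it as a grid. The 8x8 bitmap for 十 is hand-drawn for this example and does not come from any real font; practical Chinese bitmap fonts use larger matrices.

```python
# Hand-drawn 8x8 bitmap for 十 (hypothetical, not taken from a real font).
# Each integer is one row; the most significant bit is the leftmost dot.
GLYPH_SHI = [
    0b00010000,
    0b00010000,
    0b00010000,
    0b11111111,
    0b00010000,
    0b00010000,
    0b00010000,
    0b00010000,
]


def render(rows: list, width: int = 8) -> str:
    """Draw a bitmap glyph as text, one character per dot."""
    lines = []
    for row in rows:
        dots = [
            "█" if (row >> (width - 1 - col)) & 1 else "·"
            for col in range(width)
        ]
        lines.append("".join(dots))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(GLYPH_SHI))
```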