Chinese character frequency

Chinese character frequency is the frequency with which characters are used in written Chinese. It is calculated on a corpus, i.e., a collection of texts representing one or more languages. The frequency of a character is the ratio of the number of its occurrences to the total number of characters in the corpus, given by the formula $F_{i} = \frac{n_{i}}{N} \times 100\%$,

where $n_{i}$ is the number of times a certain Chinese character $i$ appears in the corpus, and $N$ is the total number of (occurrences of) characters in the corpus.
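The formula can be illustrated with a minimal Python sketch. The function name `char_frequencies`, the sample sentence, and the choice to count only code points in the basic CJK Unified Ideographs block are assumptions made for this illustration; a real survey would define its own rules for what counts as a character.

```python
from collections import Counter


def char_frequencies(corpus: str) -> dict[str, float]:
    """Return F_i = n_i / N * 100 (in percent) for each character in the corpus."""
    # Assumption: only the basic CJK Unified Ideographs block (U+4E00..U+9FFF)
    # is counted, so punctuation, digits and Latin letters are ignored.
    chars = [c for c in corpus if "\u4e00" <= c <= "\u9fff"]
    total = len(chars)  # N: total number of character occurrences
    if total == 0:
        return {}
    counts = Counter(chars)  # n_i for each distinct character
    return {c: n * 100 / total for c, n in counts.items()}


# Hypothetical one-sentence "corpus" with 10 character occurrences.
sample = "我的书是我的，不是他的。"
for char, freq in sorted(char_frequencies(sample).items(), key=lambda kv: -kv[1]):
    print(char, f"{freq:.1f}%")
```

In this toy sample 的 occurs 3 times out of 10, so its frequency is 30%.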

Chinese character frequency is fundamental to the quantitative linguistics of Chinese, and serves as a reference for Chinese language teaching and information processing.

Origins
The first person to make a serious statistical study of the frequency of Chinese characters was Chen Heqin. In the 1920s, he and his assistants spent over two years manually counting and comparing the characters in a corpus of six categories of texts. The corpus contained a total of 554,478 characters, comprising 4,261 different character forms. They then compiled a book entitled Applied Lexis of Vernacular Chinese. The 10 most frequently-used characters in their corpus are, in descending order of frequency,

的 (of), 不 (no, not), 一 (one, a(n)), 了 (PERF), 是 (to be), 我 (I/me), 上 (on, up), 他 (he/him), 有 (to have), 人 (person).

A trans-regional diachronic survey
In 2001, the Chinese University of Hong Kong (CUHK) published a number of frequency lists on the Web, entitled "Hong Kong, Mainland China and Taiwan Chinese Frequency: a Trans-regional Diachronic Survey". The frequency data came from a large corpus made up of sub-corpora representing the Chinese used in the three regions of Hong Kong, mainland China and Taiwan in two time periods, the 1960s and the 1980s/1990s. Each sub-corpus consists of approximately 660,000 characters, making a total of 3,970,514 characters for the whole corpus. Each sub-corpus includes about 5,000 different characters, as shown by its frequency list.

From the data of these frequency lists, some important and interesting features of Chinese can be discovered:


 * 的, 一 and 是 are the three most frequently-used characters across the regions and time periods of the corpora, and 的 ranks first in all the frequency lists.
 * The 10 most frequently-used characters are highly consistent across the three regions and two time periods. That means a character that is frequently used in one region or period is very likely to be frequently used in another region or period.
 * The 100 most frequently-used characters in the 1980s/1990s cover (i.e., have an accumulated frequency of) 41.00% of the Hong Kong texts of that period, 41.34% of the Mainland texts, and 41.88% of the Taiwan texts. That is more than 4 out of every 10 characters in each of the three regions.
 * The 1,000 most frequently-used characters in the 1980s/1990s cover 89.25% of the Hong Kong texts of that period, 90.26% of the Mainland texts, and 88.74% of the Taiwan texts.
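The coverage figures above are accumulated frequencies of the top-ranked characters. As a rough sketch of how such a figure could be computed, the following Python function sums the counts of the `top_n` most frequent characters and divides by the corpus size. The function name `coverage` and the toy data are assumptions for illustration and are not taken from the CUHK survey.

```python
from collections import Counter


def coverage(corpus_chars, top_n: int) -> float:
    """Percentage of the corpus covered by its top_n most frequent characters."""
    counts = Counter(corpus_chars)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # most_common(top_n) returns (character, count) pairs in descending order.
    covered = sum(n for _, n in counts.most_common(top_n))
    return covered * 100 / total


# Hypothetical toy corpus: 的 x4, 一 x3, 是 x2, 人 x1 (10 occurrences in total).
toy = list("的的的的一一一是是人")
print(coverage(toy, 1))  # share covered by the single most frequent character
print(coverage(toy, 3))  # share covered by the three most frequent characters
```

With the toy data, the top character covers 40% of the occurrences and the top three cover 90%.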

The top 10 characters in the frequency lists for the three regions in the 1980s/1990s are Hong Kong: 的，一，是，不，人，有，在，了，我，中; Taiwan: 的，一，是，不，人，在，有，我，了，中; Mainland: 的，一，是，了，不，在，有，人，我，他.

More information can be found in the English Users' Guide on the home page.

Frequencies in different divisions
Most of the frequency studies described above cover the general usage of Chinese characters. In addition, there are frequency counts for the use of Chinese characters in a particular field, such as news reporting, literature and art, or information technology.

There are also frequency lists for linguistic divisions. Polyphonic characters may be counted separately according to their different pronunciations, for example, the frequencies of 的 (de), 的 (di1), 的 (di2) and 的 (di4). Polysemous characters may be counted separately according to their different meanings, for example, 里 (裡/裏, inside) and 里 (里, 0.5 km). There are also frequencies for different parts of speech, for example, 花 (n) and 花 (v), and for combinations of the above divisions.

Application of frequency statistics
Chinese character frequency is essential to quantitative research on Chinese characters, and has been applied to language teaching, dictionary compilation, the compilation of character lists, Chinese character information processing, etc.

Chinese character utility decline rate
The use of Chinese characters is concentrated mainly in a small number of frequently-used characters. Zhou Youguang summarized the Chinese character utility decline rate on the basis of frequency statistics from various sources. Its basic content is:

The coverage rate of the 1,000 most frequently-used characters in a corpus is about 90%, which means the missing rate is about 10%. For every additional 1,400 next most frequent characters, the missing rate is reduced to 10% of its previous value. For example, the missing rate of the 1,000 + 1,400 = 2,400 most frequently-used characters is approximately 10% × 10% = 1% of the corpus, which means the coverage rate is 99%. The missing rate of the 2,400 + 1,400 = 3,800 most frequently-used characters is about 1% × 10% = 0.1%, and the coverage rate is 99.9%. The rule is also supported by later experimental results.
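The arithmetic of the rule can be sketched in a few lines of Python. The function name `missing_rate` is hypothetical, and the smooth interpolation between the stated steps (1,000, 2,400, 3,800, ...) is an assumption of this sketch; the rule itself is only stated for whole steps of 1,400 characters.

```python
def missing_rate(num_chars: int) -> float:
    """Approximate fraction of a corpus NOT covered by the num_chars most
    frequent characters, following the utility decline rate described above."""
    if num_chars < 1000:
        raise ValueError("the rule is only stated for 1,000 characters or more")
    # Each additional 1,400 characters multiplies the missing rate by 0.1.
    # Assumption: fractional steps are interpolated exponentially.
    steps = (num_chars - 1000) / 1400
    return 0.1 ** (1 + steps)


for n in (1000, 2400, 3800):
    rate = missing_rate(n)
    print(n, f"missing {rate:.2%}", f"coverage {1 - rate:.2%}")
```

For 1,000, 2,400 and 3,800 characters this reproduces the missing rates of about 10%, 1% and 0.1% given in the text.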

Decreasing rate of frequently-used character strokes
The basic content of the decreasing rate of frequently-used character strokes is:

The application rate of a character is inversely proportional to its number of strokes, that is, characters with high application rates have fewer strokes on average. This is supported by the data in the article Stroke numbers. According to the data in its second and third tables, the average number of strokes of the 3,500 frequently-used characters is 9.74, and the average number of strokes of the 7,000 commonly-used characters (a superset of the 3,500 characters) is 10.75. That means that, generally speaking, frequently-used characters have fewer strokes than less frequently-used characters.

The reason is convenience of writing. If a character with many strokes is used frequently, people will tend to simplify it. If there are multiple variant characters with the same function, then, other things being equal, the one with fewer strokes is more likely to be used.

Distribution rate and application rate
When determining the importance of a character, in addition to its frequency of use, it is often necessary to consider its distribution rate. The formula for calculating the distribution rate is

$D_{i} = t_{i}/T$, where $D_{i}$ is the distribution rate of character or word $i$, $t_{i}$ is the number of texts in which the character or word appears, and $T$ is the total number of texts in the corpus.
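As a minimal sketch of the distribution rate, the following Python function counts the texts that contain a given character and divides by the number of texts. The function name `distribution_rate` and the four sample texts are assumptions made for illustration.

```python
def distribution_rate(texts: list[str], char: str) -> float:
    """Return D_i = t_i / T for one character over a list of texts."""
    if not texts:
        return 0.0
    # t_i: number of texts in which the character appears at least once.
    texts_with_char = sum(1 for text in texts if char in text)
    return texts_with_char / len(texts)  # T: total number of texts


# Hypothetical corpus of four very short texts.
texts = ["我的书", "他的书", "人人有书", "不是"]
print(distribution_rate(texts, "书"))  # appears in 3 of the 4 texts
print(distribution_rate(texts, "的"))  # appears in 2 of the 4 texts
```

Note that a character repeated many times in a single text still contributes only one text to the count, which is what distinguishes the distribution rate from the frequency.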

Application rate is a combination of distribution rate and frequency. A newer calculation formula is:

$U_{i} = \frac{F_{i} D_{i}}{\sum_{j=1}^{n} F_{j} D_{j}}$

where $U_{i}$ is the application rate of character $i$, $F_{i}$ is the frequency of character $i$, $D_{i}$ is the distribution rate of character $i$, and $n$ is the total number of different characters. Because of the normalizing denominator, the cumulative application rate approaches 1 as more characters are included, and equals 1 when summed over all $n$ characters.
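The application rate formula can be sketched in Python as follows. The function name `application_rates` and the sample texts are hypothetical, and the sketch assumes the texts have already been reduced to the characters that should be counted (no punctuation filtering is done here).

```python
from collections import Counter


def application_rates(texts: list[str]) -> dict[str, float]:
    """Return U_i = F_i * D_i / sum_j(F_j * D_j) for every character."""
    # n_i: occurrences of each character over the whole corpus.
    occurrences = Counter(c for text in texts for c in text)
    total_chars = sum(occurrences.values())
    if total_chars == 0:
        return {}
    # t_i: number of texts containing each character (set() counts a text once).
    text_counts = Counter(c for text in texts for c in set(text))
    num_texts = len(texts)
    # Unnormalized weight F_i * D_i for each character.
    weights = {
        c: (occurrences[c] / total_chars) * (text_counts[c] / num_texts)
        for c in occurrences
    }
    norm = sum(weights.values())
    return {c: w / norm for c, w in weights.items()}


# Hypothetical corpus of three short texts.
rates = application_rates(["的的一", "的是", "人"])
print(rates)
print(sum(rates.values()))  # the rates of all characters sum to 1 (up to rounding)
```

In this toy corpus 的 is both the most frequent and the most widely distributed character, so it receives two thirds of the total application rate, more than its raw frequency share of one half.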

Application in media
Large-scale surveys conducted over the years by the Ministry of Education and the State Language Commission of the PRC have shown that the use of Chinese characters and words follows a strong distribution pattern. The number of different characters used in modern Chinese is stable at about 12,000, and the number of different words has stabilized at around 2.5 million.

The numbers of most frequently-used characters needed for coverage rates of 80%, 90%, and 99% are about 590, 940, and 2,400 respectively. The numbers of words needed for coverage rates of 80%, 90%, 95%, and 99% are about 4,900, 14,000, 32,000, and 241,000 respectively. Words whose frequency of use changes markedly from the previous year reflect the hot topics of social life and media attention in that year.
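Figures of this kind answer the inverse question to coverage: how many of the most frequent items are needed to reach a given coverage rate. A minimal Python sketch follows, with a hypothetical function name `chars_needed` and toy data that are not taken from the surveys.

```python
from collections import Counter
from itertools import accumulate


def chars_needed(corpus_chars, target_percent: float):
    """Smallest number of top-ranked characters whose accumulated frequency
    reaches target_percent of the corpus, or None if it is never reached."""
    counts = sorted(Counter(corpus_chars).values(), reverse=True)
    total = sum(counts)
    # accumulate() yields the running total of occurrences by rank.
    for rank, running in enumerate(accumulate(counts), start=1):
        if running * 100 >= target_percent * total:
            return rank
    return None


# Hypothetical toy corpus: 的 x4, 一 x3, 是 x2, 人 x1 (10 occurrences in total).
toy = list("的的的的一一一是是人")
for target in (40, 80, 99):
    print(target, chars_needed(toy, target))
```

The same function works for words if the input is a sequence of segmented words rather than characters.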