Talk:GB 2312

Proofreading (2011)
"The value of the first byte is from 0xA1-0xF7 (161-247), while the value of the second byte is from 0xA1-0xFE (161-254). Hence, like UTF-8, it is possible to check if a byte is part of a two-byte construct when using EUC-CN."

These two sentences don't make sense to me. How does the second sentence follow from the first?


 * It's incorrect as far as I can tell. In EUC-CN it's not possible to check whether a byte is the tail of a two-byte construct. With UTF-8 you can, because a tail byte starts with binary 10 while a lead byte starts with binary 11.


 * "Compared to UTF-8, GB2312 (whether native or encoded in EUC-CN) is also more storage efficient, since Chinese characters are limited to a maximum of two bytes each, while UTF-8 uses at least three bytes."


 * That line is incorrect as well. UTF-8 has 2048 two-byte sequences. I'll go ahead and fix the article. --Scandum (talk) 00:20, 8 May 2011 (UTC)


 * CJK Unified Ideographs (Unicode block) has a minimum code point of 4E00, well outside of the double-byte UTF-8 range. Always consider the context: GB 2312 is a Chinese encoding. --Artoria2e5 emits crap 13:24, 29 September 2016 (UTC)
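
The two points in this thread can be checked with a short Python sketch. It assumes Python's built-in "gb2312" codec, which produces EUC-CN bytes; the helper names utf8_byte_role and euc_cn_possible_roles are made up for illustration and are not taken from the article. The ranges used for EUC-CN are the ones quoted at the top of the thread.

```python
def utf8_byte_role(b):
    """Classify one byte by its role in UTF-8, using only its top bits."""
    if b < 0x80:
        return "ascii"          # 0xxxxxxx
    if b & 0xC0 == 0x80:
        return "continuation"   # 10xxxxxx
    return "lead"               # 11xxxxxx (validity is not checked here)


def euc_cn_possible_roles(b):
    """Return every role a byte could play in EUC-CN.

    The lead range (0xA1-0xF7) lies inside the trail range (0xA1-0xFE),
    so a single byte is often ambiguous without its context.
    """
    roles = set()
    if b < 0x80:
        roles.add("ascii")
    if 0xA1 <= b <= 0xF7:
        roles.add("lead")
    if 0xA1 <= b <= 0xFE:
        roles.add("trail")
    return roles


ch = "外"
utf8 = ch.encode("utf-8")     # three bytes for a CJK Unified Ideograph
euc = ch.encode("gb2312")     # two bytes in EUC-CN

print(len(utf8), [utf8_byte_role(b) for b in utf8])
print(len(euc), [sorted(euc_cn_possible_roles(b)) for b in euc])
```

In this sketch every UTF-8 byte gets exactly one role, while both EUC-CN bytes of 外 come back as possible lead and trail bytes, which is the ambiguity described above.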

External links modified
Hello fellow Wikipedians,

I have just modified 2 external links on GB 2312. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:
 * Added archive https://web.archive.org/web/20160303230643/http://cs.nyu.edu/~yusuke/tools/unicode_to_gb2312_or_gbk_table.html to http://www.cs.nyu.edu/~yusuke/tools/unicode_to_gb2312_or_gbk_table.html
 * Corrected formatting/usage for http://www.itscj.ipsj.or.jp/ISO-IR/058.pdf

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

Cheers.— InternetArchiveBot  (Report bug) 12:18, 9 October 2017 (UTC)

EUC-CN conversion issues
"To map the code points to bytes, add 158 (0x98) to the row number of the code point to form the high byte, and add 158 column number of the code point to form the low byte. The row number is the code point integer divided by 94, and the column the code point modulo 94.

For example, if you have the GB2312 code point 4566 ("外", which means foreign), the high byte will be 4566/94+158=206=0xCE, and the low byte will come from 4566%94+158=212=0xD4. So, the full encoding is 0xCED4=52948."

This section does not appear to be correct. The example given, code point 4566 (row 45, column 66; see the character at https://archive.org/details/GB2312-1980/page/n17), is converted to EUC-CN by adding 160 (0xA0) to both the row value and the column value, resulting in the two-byte value 0xCDE2 (45 + 160 = 205 (0xCD), 66 + 160 = 226 (0xE2)). The value currently on the page, 0xCED4, is a different character (卧, code point 4652, row 46, column 52).

Both of these values (0xCDE2 and 0xCED4) and the characters they represent can be verified by viewing the Unicode to GB2312 conversion table at https://web.archive.org/web/20160303230643/http://cs.nyu.edu/~yusuke/tools/unicode_to_gb2312_or_gbk_table.html and looking at characters U+5916 (外) and U+5367 (卧) and seeing the values listed underneath each.

Additionally, the constants given in the current section, 158 and 0x98, are not the same value: 158 in decimal is 0x9E, and 0x98 is 152.

It also looks like before the edit for 15 December 2016, this section was correct. HalfCap (talk) 23:29, 29 November 2018 (UTC)

I went ahead and made the changes based on the information above. HalfCap (talk) 14:39, 10 December 2018 (UTC)
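
For anyone who wants to re-check the arithmetic above, here is a minimal Python sketch of the corrected mapping. The function names rowcol_to_euc_cn and euc_cn_to_rowcol are hypothetical, and the check against real characters assumes Python's built-in "gb2312" codec decodes EUC-CN bytes.

```python
def rowcol_to_euc_cn(row, col):
    """Map a GB 2312 row/column pair (each 1-94) to two EUC-CN bytes
    by adding 0xA0 (160) to each value."""
    if not (1 <= row <= 94 and 1 <= col <= 94):
        raise ValueError("row and column must be in 1..94")
    return bytes([row + 0xA0, col + 0xA0])


def euc_cn_to_rowcol(data):
    """Inverse mapping: subtract 0xA0 from each of the two bytes."""
    if len(data) != 2:
        raise ValueError("expected exactly two bytes")
    return data[0] - 0xA0, data[1] - 0xA0


# Code point 4566 is row 45, column 66; code point 4652 is row 46, column 52.
wai = rowcol_to_euc_cn(45, 66)
wo = rowcol_to_euc_cn(46, 52)

print(wai.hex().upper(), wai.decode("gb2312"))   # CDE2 外
print(wo.hex().upper(), wo.decode("gb2312"))     # CED4 卧
print(euc_cn_to_rowcol(b"\xCE\xD4"))             # (46, 52)
```

Note that the four-digit code point is read as two decimal digits of row followed by two decimal digits of column, not as a single integer to divide by 94.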