Unicode is a standard that defines a universal character set.
We've got lots of languages with lots of characters that computers should ideally display. Unicode assigns each character a unique number, or code point.
A character encoding is a way to convert text data into binary numbers to store in memory.
A Chinese character: | 汉 |
Its Unicode code point: | U+6C49 |
Convert 6C49 to binary: | 01101100 01001001 |
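The hex-to-binary step can be checked in PHP. This is only a small sketch; the variable names are illustrative and not part of any library:

```php
<?php
// Code point of 汉, written as a hex literal.
$codePoint = 0x6C49;

// Format as 16 binary digits, padded on the left with zeros.
$bits = sprintf('%016b', $codePoint);

// Split into two groups of 8 bits so it reads like the table above.
$grouped = implode(' ', str_split($bits, 8));

echo $grouped, "\n"; // 01101100 01001001
```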
Let's say we decide to store this character on our hard drive. To do that, we need to store the character in binary format. We could simply store it as is: '01101100 01001001'. Done!
But wait a minute, is '01101100 01001001' one character or two characters? You know it is one character because I told you, but when a computer reads it, it has no idea. So we need some sort of "encoding" to tell the computer to treat it as one.
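To see the ambiguity concretely, here is a small sketch that stores the two raw bytes and reads them back the way a one-byte-per-character reader (plain ASCII) would. The variable name is just for illustration:

```php
<?php
// The two bytes from the example, stored with no encoding information.
$raw = "\x6C\x49";

// Read one byte per character, these are two ordinary letters.
echo $raw, "\n";         // lI
echo strlen($raw), "\n"; // 2
```

Nothing in the bytes themselves says whether they are the two letters "l" and "I" or a single character; that is the information an encoding has to add.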
1st Byte | 2nd Byte | 3rd Byte | 4th Byte | Number of Free Bits | Maximum Expressible Unicode Value |
0xxxxxxx | | | | 7 | 007F hex (127) |
110xxxxx | 10xxxxxx | | | (5+6)=11 | 07FF hex (2047) |
1110xxxx | 10xxxxxx | 10xxxxxx | | (4+6+6)=16 | FFFF hex (65535) |
11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | (3+6+6+6)=21 | 10FFFF hex (1,114,111) |
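The ranges in the table can be turned into a small lookup. The sketch below assumes a hypothetical helper name, utf8_byte_count, which is not a built-in PHP function:

```php
<?php
// Hypothetical helper: how many bytes UTF-8 needs for a code point,
// following the "maximum expressible value" column of the table.
function utf8_byte_count(int $codePoint): int
{
    if ($codePoint < 0 || $codePoint > 0x10FFFF) {
        throw new InvalidArgumentException('Not a Unicode code point');
    }
    if ($codePoint <= 0x7F) {
        return 1; // 0xxxxxxx
    }
    if ($codePoint <= 0x7FF) {
        return 2; // 110xxxxx 10xxxxxx
    }
    if ($codePoint <= 0xFFFF) {
        return 3; // 1110xxxx 10xxxxxx 10xxxxxx
    }
    return 4;     // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
}

echo utf8_byte_count(0x41), "\n";   // 1
echo utf8_byte_count(0x6C49), "\n"; // 3
```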
According to the table above, if we want to store this character in the 'UTF-8' format, we need to prefix each byte with a 'header'. Our Chinese character is 16 bits long (count the binary digits yourself), so we will use the format on row 3, as it provides enough space:
Header | Place holder | Fill in our Binary | Result |
1110 | xxxx | 0110 | 11100110 |
10 | xxxxxx | 110001 | 10110001 |
10 | xxxxxx | 001001 | 10001001 |
Writing out the result in one line:
11100110 10110001 10001001
This is the UTF-8 (binary) value of the Chinese character, or E6 B1 89 in hex! (Confirm it yourself: http://www.fileformat.info/info/unicode/char/6c49/index.htm)
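The header-and-fill steps can also be written with bit operations. This is a minimal sketch for the 3-byte row only; utf8_encode_3byte is a hypothetical name, and for brevity it does not reject the surrogate range U+D800 to U+DFFF:

```php
<?php
// Hypothetical helper: encode a code point from the 3-byte range
// (U+0800 to U+FFFF) by filling the header/placeholder pattern.
function utf8_encode_3byte(int $codePoint): string
{
    if ($codePoint < 0x800 || $codePoint > 0xFFFF) {
        throw new InvalidArgumentException('Outside the 3-byte range');
    }
    $b1 = 0xE0 | ($codePoint >> 12);         // header 1110 + top 4 bits
    $b2 = 0x80 | (($codePoint >> 6) & 0x3F); // header 10 + middle 6 bits
    $b3 = 0x80 | ($codePoint & 0x3F);        // header 10 + low 6 bits
    return chr($b1) . chr($b2) . chr($b3);
}

$bytes = utf8_encode_3byte(0x6C49);

// Show each byte as 8 binary digits.
$binary = [];
foreach (str_split($bytes) as $byte) {
    $binary[] = sprintf('%08b', ord($byte));
}

echo implode(' ', $binary), "\n"; // 11100110 10110001 10001001
echo bin2hex($bytes), "\n";       // e6b189
```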
Summary
A Chinese character: | 汉 |
Its Unicode code point: | U+6C49 |
Convert 6C49 to binary: | 01101100 01001001 |
Embed 6C49 as UTF-8: | 11100110 10110001 10001001 |
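The summary can be cross-checked against PHP itself. The sketch below relies on the "\u{...}" string escape (available since PHP 7.0), which inserts the UTF-8 bytes of a code point:

```php
<?php
// "\u{6C49}" expands to the UTF-8 encoding of U+6C49.
$han = "\u{6C49}";

echo bin2hex($han), "\n"; // e6b189
echo strlen($han), "\n";  // 3 (strlen counts bytes, not characters)
```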
"Unicode" isn't an encoding, although unfortunately, a lot of documentation imprecisely uses it to refer to whichever Unicode encoding that particular system uses by default. On Windows and Java, this often means UTF-16; in many other places, it means UTF-8. Properly, Unicode refers to the abstract character set itself, not to any particular encoding.
ANSI: There's no one fixed ANSI encoding; there are lots of them. Usually when people say "ANSI" they mean "the default locale/codepage for my system", which in the .NET Framework is obtained via Encoding.Default. It is often Windows-1252 but can be another code page.
Encoding: ASCII, UTF-7, UTF-8, UTF-16 and UTF-32 are examples of encodings. UTF-7, UTF-8 and UTF-16 are variable-length encodings, while UTF-32 is fixed-length and always uses four bytes per code point.
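To compare how many bytes the same character takes in different encodings, here is a rough sketch. It assumes the iconv extension is available, and byte_lengths is a hypothetical helper name:

```php
<?php
// Hypothetical helper: byte length of a UTF-8 string in three encodings.
// The big-endian (BE) names are used so that no byte order mark is added.
function byte_lengths(string $utf8): array
{
    return [
        'UTF-8'  => strlen($utf8),
        'UTF-16' => strlen(iconv('UTF-8', 'UTF-16BE', $utf8)),
        'UTF-32' => strlen(iconv('UTF-8', 'UTF-32BE', $utf8)),
    ];
}

$samples = [
    'U+0041'  => "A",
    'U+6C49'  => "\u{6C49}",
    'U+1F600' => "\u{1F600}",
];

foreach ($samples as $label => $text) {
    $n = byte_lengths($text);
    printf("%s: UTF-8=%d UTF-16=%d UTF-32=%d\n",
        $label, $n['UTF-8'], $n['UTF-16'], $n['UTF-32']);
}
// U+0041: UTF-8=1 UTF-16=2 UTF-32=4
// U+6C49: UTF-8=3 UTF-16=2 UTF-32=4
// U+1F600: UTF-8=4 UTF-16=4 UTF-32=4
```

The UTF-8 and UTF-16 columns change from row to row, while UTF-32 stays at four bytes.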
Process 1: // UTF-8 string to code point, decoded by hand
function _uniord($c) {
    // $c[0] replaces the old $c{0} syntax, which was removed in PHP 8.0.
    $b = ord($c[0]); // the lead byte decides the sequence length
    if ($b <= 127) // 1 byte: 0xxxxxxx
        return $b;
    if ($b >= 192 && $b <= 223) // 2 bytes: 110xxxxx
        return ($b-192)*64 + (ord($c[1])-128);
    if ($b >= 224 && $b <= 239) // 3 bytes: 1110xxxx
        return ($b-224)*4096 + (ord($c[1])-128)*64 + (ord($c[2])-128);
    if ($b >= 240 && $b <= 247) // 4 bytes: 11110xxx
        return ($b-240)*262144 + (ord($c[1])-128)*4096 + (ord($c[2])-128)*64 + (ord($c[3])-128);
    // The 5- and 6-byte forms below come from the original UTF-8 design;
    // RFC 3629 no longer allows them.
    if ($b >= 248 && $b <= 251)
        return ($b-248)*16777216 + (ord($c[1])-128)*262144 + (ord($c[2])-128)*4096 + (ord($c[3])-128)*64 + (ord($c[4])-128);
    if ($b >= 252 && $b <= 253)
        return ($b-252)*1073741824 + (ord($c[1])-128)*16777216 + (ord($c[2])-128)*262144 + (ord($c[3])-128)*4096 + (ord($c[4])-128)*64 + (ord($c[5])-128);
    if ($b >= 254) // error: never a valid lead byte
        return FALSE;
    return 0; // 128-191: a continuation byte cannot start a character
}
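// code point to UTF-8 string
function _unichr($o) {
    if (function_exists('mb_chr')) {
        // mb_chr() is available from PHP 7.2
        return mb_chr(intval($o), 'UTF-8');
    } elseif (function_exists('mb_convert_encoding')) {
        // the 'HTML-ENTITIES' pseudo-encoding is deprecated as of PHP 8.2
        return mb_convert_encoding('&#'.intval($o).';', 'UTF-8', 'HTML-ENTITIES');
    } else {
        // fallback: only correct for ASCII (0-127)
        return chr(intval($o));
    }
}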
Process 2: // code point to UTF-8 string
function unichr($i) {
return iconv('UCS-4LE', 'UTF-8', pack('V', $i));
}
// UTF-8 string to code point
function uniord($s) {
return unpack('V', iconv('UTF-8', 'UCS-4LE', $s))[1];
}
Note: The difference between the Unicode code points of Bangla and English digits is 2486. For example, ০ is U+09E6 (2534) and 0 is U+0030 (48).
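As an illustration of that fixed offset, the sketch below converts English digits to Bangla digits by adding 2486 to each digit's code point. to_bangla_digits is a hypothetical helper name, and the 3-byte UTF-8 encoding is written out by hand so the example needs no extensions:

```php
<?php
// Hypothetical helper: replace ASCII digits with Bangla digits using
// the offset 2486 (U+09E6 minus U+0030).
function to_bangla_digits(string $text): string
{
    $out = '';
    foreach (str_split($text) as $ch) {
        $code = ord($ch);
        if ($code >= 48 && $code <= 57) {  // ASCII '0'..'9'
            $cp = $code + 2486;            // lands in U+09E6..U+09EF
            // These code points fall in the 3-byte UTF-8 range.
            $out .= chr(0xE0 | ($cp >> 12))
                  . chr(0x80 | (($cp >> 6) & 0x3F))
                  . chr(0x80 | ($cp & 0x3F));
        } else {
            $out .= $ch;                   // every other byte passes through
        }
    }
    return $out;
}

echo to_bangla_digits('2024'), "\n"; // ২০২৪
```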