UTF-8, Unicode and ASCII code of characters

What is Unicode?

Unicode is a character set or standard.

We've got lots of languages with lots of characters that computers should ideally display. Unicode assigns each character a unique number, or code point.

What is encoding?

A character encoding is a rule for converting text (a sequence of code points) into bytes, so that it can be stored in memory or on disk, and for converting those bytes back into text.

Example

A Chinese character: 汉
Its Unicode value: U+6C49
Convert 6C49 to binary: 01101100 01001001

Let's say we decide to store this character on our hard drive. To do that, we need to store the character in binary format. We could simply store it as is: '01101100 01001001'. Done!

But wait a minute, is '01101100 01001001' one character or two characters? You know this is one character because I told you, but when a computer reads it, it has no idea. So we need some sort of "encoding" to tell the computer to treat it as one.
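
To see the ambiguity concretely, here is a small PHP sketch (the variable name is only illustrative). It takes the two raw bytes 6C 49 and reads them the way a program expecting one byte per character would:

```php
<?php
// The two raw bytes from the example above: 0x6C and 0x49.
$raw = chr(0x6C) . chr(0x49);

// A reader that assumes one byte per character (ASCII) sees two letters.
echo $raw, "\n";          // lI
echo strlen($raw), "\n";  // 2 (strlen counts bytes, not characters)
```

Nothing in the bytes themselves says whether they are one 16-bit character or two 8-bit ones, which is the gap an encoding such as UTF-8 fills.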

Binary format of bytes in sequence

1st Byte   2nd Byte   3rd Byte   4th Byte   Number of Free Bits   Maximum Expressible Unicode Value
0xxxxxxx   -          -          -          7                     007F hex (127)
110xxxxx   10xxxxxx   -          -          (5+6)=11              07FF hex (2047)
1110xxxx   10xxxxxx   10xxxxxx   -          (4+6+6)=16            FFFF hex (65535)
11110xxx   10xxxxxx   10xxxxxx   10xxxxxx   (3+6+6+6)=21          10FFFF hex (1,114,111)
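
The ranges in the table can be turned into a small lookup. The sketch below is only an illustration: `utf8ByteCount` is a made-up helper name, not a PHP built-in, and it does not reject the surrogate range U+D800 to U+DFFF.

```php
<?php
// Hypothetical helper: how many bytes UTF-8 needs for a code point,
// following the four rows of the table above.
function utf8ByteCount($codePoint) {
    if ($codePoint < 0 || $codePoint > 0x10FFFF) {
        return false;                     // outside the Unicode range
    }
    if ($codePoint <= 0x7F)   return 1;   // 7 free bits
    if ($codePoint <= 0x7FF)  return 2;   // 11 free bits
    if ($codePoint <= 0xFFFF) return 3;   // 16 free bits
    return 4;                             // 21 free bits, capped at 10FFFF
}

echo utf8ByteCount(0x41), "\n";     // 1
echo utf8ByteCount(0x6C49), "\n";   // 3
echo utf8ByteCount(0x1F600), "\n";  // 4
```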

According to the table above, if we want to store this character using the 'UTF-8' format, we need to prefix its bits with some 'headers'. Our Chinese character fits in 16 bits (count the binary digits yourself), so we will use the format in row 3, which provides exactly 16 free bits:

Header   Placeholder   Fill in our binary   Result
1110     xxxx          0110                 11100110
10       xxxxxx        110001               10110001
10       xxxxxx        001001               10001001

Writing out the result in one line:

11100110 10110001 10001001

This is the UTF-8 (binary) value of the Chinese character, or E6 B1 89 in hex! (Confirm it yourself: http://www.fileformat.info/info/unicode/char/6c49/index.htm)
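
The same header-and-fill steps can be written with bit operations. This is a minimal sketch rather than a production encoder: `utf8Encode` is a hypothetical helper name, and it assumes the caller passes a valid code point (no range or surrogate checks).

```php
<?php
// Hypothetical helper: encode one code point as UTF-8 by hand.
// Each chr() call builds one byte: header bits OR-ed with payload bits.
function utf8Encode($cp) {
    if ($cp <= 0x7F) {
        return chr($cp);                                  // 0xxxxxxx
    }
    if ($cp <= 0x7FF) {
        return chr(0xC0 | ($cp >> 6))                     // 110xxxxx
             . chr(0x80 | ($cp & 0x3F));                  // 10xxxxxx
    }
    if ($cp <= 0xFFFF) {
        return chr(0xE0 | ($cp >> 12))                    // 1110xxxx
             . chr(0x80 | (($cp >> 6) & 0x3F))            // 10xxxxxx
             . chr(0x80 | ($cp & 0x3F));                  // 10xxxxxx
    }
    return chr(0xF0 | ($cp >> 18))                        // 11110xxx
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

$bytes = utf8Encode(0x6C49);
echo bin2hex($bytes), "\n";       // e6b189

// Show every byte as 8 binary digits.
$bits = [];
for ($i = 0; $i < strlen($bytes); $i++) {
    $bits[] = str_pad(decbin(ord($bytes[$i])), 8, '0', STR_PAD_LEFT);
}
echo implode(' ', $bits), "\n";   // 11100110 10110001 10001001
```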

Summary

A Chinese character: 汉
Its Unicode value: U+6C49
Convert 6C49 to binary: 01101100 01001001
Embed 6C49 as UTF-8: 11100110 10110001 10001001

More explanation

"Unicode" isn't an encoding, although unfortunately, a lot of documentation imprecisely uses it to refer to whichever Unicode encoding that particular system uses by default. On Windows and Java, this often means UTF-16; in many other places, it means UTF-8. Properly, Unicode refers to the abstract character set itself, not to any particular encoding.

ANSI: There's no one fixed ANSI encoding; there are lots of them. Usually when people say "ANSI" they mean "the default locale/code page for my system" (in the .NET Framework this is exposed as Encoding.Default), which is often Windows-1252 but can be another code page.

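Encoding: ASCII, UTF-7, UTF-8, UTF-16 and UTF-32 are examples of encodings. UTF-7, UTF-8 and UTF-16 are variable-length encodings (UTF-8 uses 1 to 4 bytes per code point, UTF-16 uses 2 or 4), while UTF-32 is fixed-length and always uses 4 bytes.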
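
A quick way to compare how many bytes each encoding spends on a character is to build the bytes and count them. The sketch below only covers characters up to U+FFFF outside the surrogate range, where UTF-16 is simply the 16-bit code point and UTF-32 the 32-bit code point, so pack() is enough. The variable names are illustrative.

```php
<?php
// Byte counts for two characters in three encodings.
// Keys are the UTF-8 strings, values are the code points.
$samples = ['A' => 0x41, "\u{6C49}" => 0x6C49];

foreach ($samples as $utf8 => $cp) {
    $utf16 = pack('n', $cp);   // 16-bit big-endian (UTF-16BE for these characters)
    $utf32 = pack('N', $cp);   // 32-bit big-endian (UTF-32BE)
    printf("U+%04X  UTF-8: %d  UTF-16: %d  UTF-32: %d bytes\n",
        $cp, strlen($utf8), strlen($utf16), strlen($utf32));
}
// U+0041  UTF-8: 1  UTF-16: 2  UTF-32: 4 bytes
// U+6C49  UTF-8: 3  UTF-16: 2  UTF-32: 4 bytes
```

The UTF-8 width changes between the two characters while the UTF-32 width stays at 4 bytes.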

Unicode for Bangla characters in PHP

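PHP's ord() function returns the value of the first byte of a string, not the Unicode code point of its first character. For ASCII characters the two are the same, but every Bangla character (U+0980 to U+09FF) is encoded in UTF-8 as three bytes starting with 0xE0, so ord() returns 224 for all of them. To get the code point of a character, we can use either of the two processes described below.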
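
The behaviour of ord() is easy to check. In this short sketch the Bangla letter ka (U+0995) is written with PHP 7's "\u{...}" escape, so the result does not depend on how the source file is saved; the variable name is illustrative.

```php
<?php
// Bangla letter ka, U+0995, as a UTF-8 string.
$ka = "\u{0995}";

echo strlen($ka), "\n";    // 3 (three UTF-8 bytes)
echo bin2hex($ka), "\n";   // e0a695
echo ord($ka), "\n";       // 224 (0xE0, the first byte only)
```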

Process 1: 
// UTF-8 character to code point: decodes the first character of $c by hand.
function _uniord($c) {
    $b0 = ord($c[0]);                        // lead byte
    if ($b0 <= 127)                          // 0xxxxxxx: 1 byte (ASCII)
        return $b0;
    if ($b0 >= 192 && $b0 <= 223)            // 110xxxxx: 2 bytes
        return ($b0-192)*64 + (ord($c[1])-128);
    if ($b0 >= 224 && $b0 <= 239)            // 1110xxxx: 3 bytes
        return ($b0-224)*4096 + (ord($c[1])-128)*64 + (ord($c[2])-128);
    if ($b0 >= 240 && $b0 <= 247)            // 11110xxx: 4 bytes
        return ($b0-240)*262144 + (ord($c[1])-128)*4096 + (ord($c[2])-128)*64 + (ord($c[3])-128);
    // 128-191 is a continuation byte and 248-255 never starts a character
    // in UTF-8 as defined by RFC 3629 (at most 4 bytes per character).
    return FALSE;
}

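// Code point to UTF-8 character.
function _unichr($o) {
    if (function_exists('mb_chr')) {
        // mb_chr() needs PHP 7.2+ with the mbstring extension.
        return mb_chr(intval($o), 'UTF-8');
    }
    // Fallback without mbstring: decode a numeric HTML entity.
    return html_entity_decode('&#'.intval($o).';', ENT_NOQUOTES, 'UTF-8');
}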

Process 2: 
// code point to UTF-8 string
function unichr($i) {
    return iconv('UCS-4LE', 'UTF-8', pack('V', $i));
}

// UTF-8 string to code point
function uniord($s) {
    return unpack('V', iconv('UTF-8', 'UCS-4LE', $s))[1];
}
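
Note: The difference between the code point of a Bangla digit and that of the corresponding English (ASCII) digit is 2486. For example, Bangla zero is U+09E6 (2534) and '0' is U+0030 (48).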
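
As an illustration of how that offset can be used, the sketch below converts Bangla digits in a string to English digits. `banglaDigitsToEnglish` is a hypothetical helper name, and the three-byte encoding step is written out by hand (all Bangla digits fall in the three-byte UTF-8 range) so that no extension is required.

```php
<?php
// Hypothetical helper: replace Bangla digits with ASCII digits using the
// fixed offset of 2486 between the two digit ranges.
function banglaDigitsToEnglish($text) {
    $map = [];
    for ($d = 0; $d <= 9; $d++) {
        $cp = 48 + $d + 2486;                       // U+09E6 .. U+09EF
        // Encode the code point as three UTF-8 bytes.
        $bangla = chr(0xE0 | ($cp >> 12))
                . chr(0x80 | (($cp >> 6) & 0x3F))
                . chr(0x80 | ($cp & 0x3F));
        $map[$bangla] = chr(48 + $d);               // '0' .. '9'
    }
    return strtr($text, $map);                      // other text is left as is
}

echo banglaDigitsToEnglish("\u{09E8}\u{09E6}\u{09E8}\u{09E6}"), "\n";  // 2020
```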

© copyright-2020 Rejaul