ASCII
ASCII (American Standard Code for Information Interchange) uses a 7-bit binary system to represent 128 different characters. The first 32 characters in this set are control characters such as newline or backspace. ASCII does a clever thing: it assigns the character A the value 65, so the binary representation of A is 1000001. You can therefore look at a 7-bit binary character, ignore the first two bits, and read off the letter's position in the alphabet. To add to this, lowercase a starts at 97, which is 1100001, so the same trick works for lowercase letters, and the two cases differ by a single bit.
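A minimal Python sketch of this trick; masking with the low five bits (0b11111) is just one way to ignore the first two digits:

```python
# A letter's position in the alphabet sits in the low five bits of its code.
for ch in "Aa":
    code = ord(ch)                              # 'A' -> 65, 'a' -> 97
    print(ch, code, format(code, "07b"), "position:", code & 0b11111)

# Upper and lower case differ only in bit 5 (value 32): 65 ^ 32 == 97.
print(chr(ord("A") ^ 0b100000))                 # prints 'a'
```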
Unicode, UTF-8, UTF-16
Since we need a way to communicate across languages in the modern world, an encompassing character set covering all of them is needed. Unicode assigns every character a unique number called a code point. One advantage of Unicode over other possible sets is that its first 256 code points are identical to ISO-8859-1, and hence also to ASCII. In addition, the majority of commonly used characters have code points that fit in only two bytes, in a region called the Basic Multilingual Plane (BMP). A character encoding is then needed to access this character set.
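In Python, ord() and chr() expose code points directly; a quick sketch (the sample characters are arbitrary picks):

```python
# ord() returns a character's code point; chr() converts one back.
for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:   # 'A', 'é', '中', '😀'
    cp = ord(ch)
    region = "BMP" if cp <= 0xFFFF else "outside BMP"
    print(f"{ch!r} is U+{cp:04X} ({region})")
```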
Memory considerations
- UTF-8:
  - 1 byte: standard ASCII
  - 2 bytes: Arabic, Hebrew, most European scripts
  - 3 bytes: the rest of the BMP
  - 4 bytes: all remaining Unicode characters
- UTF-16:
  - 2 bytes: the BMP
  - 4 bytes: all remaining Unicode characters
If you work mostly with ASCII characters, UTF-8 is more memory efficient (one byte per character instead of two). For text dominated by BMP characters that need three bytes in UTF-8, such as most CJK scripts, UTF-16 needs only two bytes per character, so UTF-8 can be up to 1.5 times larger.
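A quick way to check this yourself is to compare encoded sizes; the sample strings below are arbitrary choices, and "utf-16-le" is used so the byte order mark doesn't inflate the count:

```python
samples = {
    "mostly ASCII": "hello world",
    "CJK (3-byte UTF-8, 2-byte UTF-16)": "\u4f60\u597d\u4e16\u754c",  # 你好世界
}
for name, text in samples.items():
    utf8 = len(text.encode("utf-8"))
    utf16 = len(text.encode("utf-16-le"))   # explicit byte order: no BOM
    print(f"{name}: {utf8} bytes as UTF-8, {utf16} bytes as UTF-16")
```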
Encoding basics
- UTF-8: For the standard ASCII characters (0-127), the UTF-8 bytes are identical to the ASCII ones. This makes UTF-8 ideal when backward compatibility with existing ASCII text is required. Other characters require anywhere from 2 to 4 bytes. This works by reserving some bits in each of these bytes to indicate that it is part of a multi-byte character; in particular, the first bit of each such byte is 1, so it can never clash with the ASCII characters (see the first sketch after this list).

- UTF-16: For BMP characters, the UTF-16 representation is simply the two bytes of the code point. For non-BMP characters, however, UTF-16 uses surrogate pairs: a combination of two two-byte units maps to one non-BMP character. These units come from the BMP numeric range but are reserved by the Unicode standard, so they are never valid as BMP characters on their own. In addition, since UTF-16 has a two-byte basic unit, it is affected by endianness. To compensate, a reserved byte order mark (BOM) can be placed at the beginning of a data stream to indicate endianness; if you are reading UTF-16 input and no endianness is specified, you must check for it (see the second sketch below).
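To illustrate the UTF-8 bit patterns described above, a small sketch (the character choices are arbitrary; lead bytes start with 110, 1110, or 11110, and continuation bytes with 10):

```python
# Every byte of a multi-byte UTF-8 sequence has its high bit set to 1,
# so none of them can be mistaken for a 7-bit ASCII byte.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:   # 1, 2, 3, and 4 bytes
    encoded = ch.encode("utf-8")
    bits = " ".join(format(b, "08b") for b in encoded)
    print(f"{ch!r} -> {len(encoded)} byte(s): {bits}")
```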
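The surrogate-pair arithmetic is fixed by the Unicode standard: subtract 0x10000 from the code point, then split the remaining 20 bits across a high surrogate based at 0xD800 and a low surrogate based at 0xDC00. A sketch of that, plus the BOM behaviour of Python's generic "utf-16" codec:

```python
def surrogate_pair(cp):
    # Standard UTF-16 encoding of a code point outside the BMP.
    assert cp > 0xFFFF
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

hi, lo = surrogate_pair(0x1F600)            # U+1F600 is '😀'
print(f"U+1F600 -> {hi:#06x} {lo:#06x}")    # 0xd83d 0xde00

# The generic "utf-16" codec prepends a BOM so readers can detect endianness;
# on a little-endian machine the stream starts with ff fe.
data = "\U0001F600".encode("utf-16")
print(data[:2].hex(), data.decode("utf-16"))  # BOM is consumed on decoding
```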
As can be seen, UTF-8 and UTF-16 are nowhere near compatible with each other. So if you're doing I/O, make sure you know which encoding you are using!
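In Python, for instance, open() falls back to a platform-dependent default encoding unless you pass one explicitly, which is an easy way to get bitten (the file name below is just an example):

```python
text = "caf\u00e9"                            # 'café'
with open("example.txt", "w", encoding="utf-8") as f:
    f.write(text)

# Reading the bytes back with the wrong encoding silently produces mojibake:
with open("example.txt", encoding="latin-1") as f:
    print(f.read())                           # 'cafÃ©'

with open("example.txt", encoding="utf-8") as f:
    print(f.read())                           # 'café'
```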
Sources
Source of information: Stack Overflow - What is Unicode
Video explanation of Unicode: Tom Scott - UTF-8