Character Encoding Basics


https://static.1991421.cn/2022/2022-07-05-225814.jpeg

Common encoding settings

  1. IDE/editors

    Screenshot shows configuring the encoding for a single file in WebStorm.

  2. Page source

    Screenshot of the page source in Chrome.

  3. Network requests

    Screenshot of the encoding info shown for an XHR request in Chrome DevTools.

  4. URLs

  5. Programming languages

Why encodings exist

Inside a computer, everything is ultimately stored as the binary digits 0 and 1. An encoding defines how characters map to those bits, so that text can be stored, transmitted, and read back as the same characters.

https://static.1991421.cn/2022/2022-07-05-225147.jpeg

ASCII vs. Unicode vs. UTF-8

A few concepts show up frequently; it helps to separate them clearly.

  1. ASCII is an encoding based on the Latin alphabet that represents 128 characters, so it cannot represent many languages (Chinese included).

    • Those 128 characters cover a-z, A-Z, 0-9, punctuation such as +-, and some control characters such as ACK.
  2. Unicode is a character set whose goal is to assign a code to every symbol in every language. For example, the Chinese character 中 (“middle”) is U+4E2D, while the Japanese katakana character ウ (“u”) is U+30A6. Every symbol maps to a unique code point. A code point is just a number; storing every code point in a fixed-width form (for example, 4 bytes each) would waste space for text that is mostly ASCII. Unicode is kept in sync with the ISO/IEC 10646 standard, the Universal Coded Character Set (UCS).

  3. UTF-8 and ASCII are encodings: they define how code points are stored or transmitted as bytes. UTF-8 is one implementation of Unicode. For example, 中 (U+4E2D) is stored as the three bytes E4 B8 AD when encoded in UTF-8.

    • UTF-8 is backward compatible with ASCII.
    • If Unicode is the interface, the various UTF encodings are implementations.
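
The difference between a code point and its encoded bytes can be checked directly. The following is a minimal sketch in Python (a language chosen here only for illustration), using the character 中 from the list above; the variable names are arbitrary.

```python
# Code point vs. encoded bytes for the character 中 (U+4E2D).
ch = "中"

code_point = ord(ch)                  # the Unicode code point, as an integer
utf8_bytes = ch.encode("utf-8")       # variable width: 1 to 4 bytes per code point
utf16_bytes = ch.encode("utf-16-be")  # 2 bytes here (big-endian, no BOM)
utf32_bytes = ch.encode("utf-32-be")  # fixed width: always 4 bytes per code point

print(f"U+{code_point:04X}")          # U+4E2D
print(utf8_bytes.hex(" "))            # e4 b8 ad
print(utf16_bytes.hex(" "))           # 4e 2d
print(utf32_bytes.hex(" "))           # 00 00 4e 2d

# ASCII text is byte-for-byte identical in UTF-8, which is what
# "backward compatible with ASCII" means in practice.
ascii_text = "Hello"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)                     # True

# The same ASCII text in a fixed-width 4-byte form is four times larger.
print(len(ascii_text.encode("utf-8")), len(ascii_text.encode("utf-32-be")))  # 5 20
```

The last line illustrates why a fixed-width form is wasteful for mostly-ASCII text, and why a variable-width encoding such as UTF-8 is commonly preferred for storage and transport.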

Timeline

  1. The first ASCII standard was published in 1963.
  2. Unicode 1.0 arrived in 1991.
  3. UTF-8 debuted in 1992.

Where is a file’s encoding stored?

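Per the Unicode spec, a file may start with the character U+FEFF (zero-width no-break space) used as a byte order mark (BOM), which signals the byte order and, by implication, the Unicode encoding in use. The BOM is optional: a plain-text file without one carries no record of its encoding, so the reader has to be told the encoding or guess it.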
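
As a rough illustration, a BOM check only needs to compare the first few bytes of a file against the known BOM sequences. The sketch below uses the constants from Python's standard `codecs` module; the helper name `sniff_bom` is hypothetical, and the function says nothing about files that have no BOM.

```python
import codecs

# BOM byte sequences from the standard library. The UTF-32 entries come
# first because the UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE
# BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(b"\xef\xbb\xbfhello"))                               # utf-8
print(sniff_bom(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")))     # utf-16-le
print(sniff_bom(b"hello"))                                           # None
```

A `None` result does not mean the file is ASCII; it only means the file gives no explicit hint, which is the common case for UTF-8 files.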

Other encodings

Base64

Characteristics:

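  • Any binary data can be converted into printable ASCII text, so it can travel through channels that only handle text (for example, embedded in JSON or an email body).
  • It provides a lightweight form of obfuscation, but it is not encryption: anyone can decode it.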

Aside from Base64, there is Base58. Compared with Base64, it drops characters that are easy to confuse: the digit 0, uppercase O, uppercase I, lowercase l, and the symbols + and /, i.e., 64 − 6.
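
A short sketch of the Base64 round trip, using Python's standard `base64` module. Base58 is not in the standard library, so it is left out here.

```python
import base64

raw = "中".encode("utf-8")            # three bytes: e4 b8 ad
encoded = base64.b64encode(raw)       # printable ASCII bytes
decoded = base64.b64decode(encoded)   # back to the original bytes

print(encoded.decode("ascii"))        # 5Lit
print(decoded == raw)                 # True

# Every 3 input bytes become 4 output characters; shorter input is
# padded with "=" to a multiple of 4.
print(base64.b64encode(b"M").decode("ascii"))    # TQ==
print(base64.b64encode(b"Ma").decode("ascii"))   # TWE=
print(base64.b64encode(b"Man").decode("ascii"))  # TWFu
```

The 3-to-4 ratio means Base64 output is roughly a third larger than the input, which is the price of staying within printable characters.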

URL encoding

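JavaScript provides helpers such as encodeURI and encodeURIComponent for URL encoding. encodeURI is meant for a whole URL and leaves reserved characters such as /, ?, & and = intact; encodeURIComponent is meant for a single component, such as a query-string value, and escapes those characters too.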
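
Percent-encoding itself is not specific to JavaScript. As a rough sketch, Python's `urllib.parse.quote` performs a comparable escape. It is an analog chosen for illustration, not the JavaScript API, and its set of unescaped characters differs slightly from that of encodeURI and encodeURIComponent.

```python
from urllib.parse import quote, unquote

# Non-ASCII characters are first encoded as UTF-8, then each byte is
# written as %XX.
encoded_char = quote("中")
print(encoded_char)                      # %E4%B8%AD

# By default "/" is left alone, which suits a whole path.
path = quote("/docs/a b")
print(path)                              # /docs/a%20b

# With safe="" reserved characters are escaped too, which suits a single
# query-string value.
value = quote("a&b=c/d", safe="")
print(value)                             # a%26b%3Dc%2Fd

print(unquote(encoded_char))             # 中
```

The `%E4%B8%AD` output is the UTF-8 byte sequence of 中 written one byte at a time, which shows that URL encoding sits on top of a character encoding rather than replacing it.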

Why mojibake happens

Mojibake appears when the encoding used to read or display the data doesn’t match the encoding that was used to store it.

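A quick example: ISO-8859-1 is based on ASCII and adds 96 characters in the 0xA0–0xFF range for Latin languages with diacritics, but it still cannot represent Chinese characters, so UTF-8 bytes for Chinese text read as ISO-8859-1 show up as runs of accented Latin characters. Likewise, UTF-16 is not byte-compatible with UTF-8 or ASCII, so reading one as the other also garbles the text.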
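
The mismatch is easy to reproduce. A minimal sketch in Python, storing 中 as UTF-8 and then decoding the bytes as ISO-8859-1:

```python
original = "中"
stored = original.encode("utf-8")        # bytes on disk: e4 b8 ad

# Decoding with the wrong encoding yields one character per byte.
garbled = stored.decode("iso-8859-1")
print(repr(garbled))                     # 'ä¸\xad'

# ISO-8859-1 maps every byte to a character, so nothing was lost and the
# mistake can be undone by reversing the two steps.
recovered = garbled.encode("iso-8859-1").decode("utf-8")
print(recovered == original)             # True

# Going the other way fails outright: ISO-8859-1 has no slot for 中.
try:
    original.encode("iso-8859-1")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)                         # False
```

The recovery step only works because the wrong decoder happened to be lossless; a decoder that replaces unknown bytes with a placeholder character would destroy the original data.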

Detecting encodings

https://github.com/aikuyun/iterm2-zmodem/blob/master/iterm2-recv-zmodem.sh

