Character Encoding Basics

Jul 3, 2022

https://static.1991421.cn/2022/2022-07-05-225814.jpeg

Common encoding settings

IDE/editors
Screenshot shows configuring the encoding for a single file in WebStorm.
Page source
Screenshot of the page source in Chrome.
Network requests
Screenshot of the encoding info shown for an XHR request in Chrome DevTools.

URLs
Programming languages

Why encodings exist

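Inside a computer, everything is ultimately stored as the binary digits 0 and 1. An encoding defines how characters are mapped to those bits when text is stored or transmitted, and how the bits are mapped back to characters when it is read and rendered.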
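To make the "everything is bits" point concrete, here is a minimal sketch that prints the bits behind a short string. The helper name `to_bits` is made up for this illustration, and UTF-8 is assumed as the default encoding.

```python
def to_bits(text: str, encoding: str = "utf-8") -> str:
    """Return the encoded bytes of `text` as space-separated 8-bit groups."""
    return " ".join(format(byte, "08b") for byte in text.encode(encoding))


print(to_bits("A"))   # 01000001
print(to_bits("Hi"))  # 01001000 01101001
```

The same text produces different bits under a different encoding, which is why both sides have to agree on one.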

https://static.1991421.cn/2022/2022-07-05-225147.jpeg

ASCII vs. Unicode vs. UTF-8

A few concepts show up frequently; it helps to separate them clearly.

ASCII is a 7-bit encoding based on the Latin alphabet. It defines only 128 characters, so it cannot represent most other writing systems (Chinese included).
- Those 128 characters cover a-z, A-Z, 0-9, punctuation such as + and -, and control characters such as ACK.
Unicode is a character set whose goal is to assign a code to every symbol in every language. For example, the Chinese character 中 (“middle”) is U+4E2D, while the Japanese katakana ウ (“u”) is U+30A6. Every symbol maps to a unique code point. A code point is just a number, not a byte layout; storing every code point in a fixed-width form (four bytes each, as UTF-32 does) would waste space for text that is mostly ASCII. The Unicode repertoire is kept in sync with ISO/IEC 10646, the Universal Coded Character Set (UCS).
UTF-8 and ASCII are encodings: they define how code points are stored or transmitted as bytes. UTF-8 is one encoding of Unicode. For example, 中 (U+4E2D) becomes the three bytes E4 B8 AD when encoded in UTF-8.
- UTF-8 is backward compatible with ASCII.
- If Unicode is the interface, the various UTF encodings are implementations.
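The interface/implementation split can be checked directly: one code point, several byte layouts. This is a small sketch using Python's built-in codecs; the helper name `encoded_forms` is hypothetical, and the big-endian variants are picked only so that no BOM is added.

```python
def encoded_forms(ch: str) -> dict:
    """Map each encoding name to the hex bytes it produces for `ch`."""
    encodings = ("utf-8", "utf-16-be", "utf-32-be")
    return {enc: ch.encode(enc).hex(" ").upper() for enc in encodings}


ch = "中"
print(f"U+{ord(ch):04X}")  # U+4E2D, the code point (the "interface")
for name, hex_bytes in encoded_forms(ch).items():
    print(name, hex_bytes)  # the bytes (one "implementation" each)

# ASCII text is byte-for-byte identical in UTF-8.
print("A".encode("ascii") == "A".encode("utf-8"))  # True
```

UTF-8 needs three bytes for 中, UTF-16 two, and UTF-32 four, while an ASCII letter stays a single byte in UTF-8.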

Timeline

The first ASCII standard was published in 1963.
Unicode 1.0 arrived in 1991.
UTF-8 debuted in 1992.

Where is a file’s encoding stored?

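Per the Unicode spec, a file may start with a special character, U+FEFF (historically named zero-width no-break space), used as a byte order mark (BOM). In UTF-16 and UTF-32 it indicates the byte order; in UTF-8 it appears as the bytes EF BB BF and acts only as an optional signature. The BOM is optional, so a file without one carries no marker of its encoding.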
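As an illustration, a BOM check only needs to look at the first few bytes. The sketch below relies on the BOM constants from Python's standard `codecs` module; the function name `sniff_bom` and the returned labels are choices made for this example, not part of any standard.

```python
import codecs

# Check the 4-byte UTF-32 BOMs first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return an encoding label if `data` starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom("中".encode("utf-8-sig")))  # utf-8-sig
print(sniff_bom("中".encode("utf-8")))      # None: no BOM, nothing to detect
```

A result of `None` does not mean the data is not UTF-8; it only means the file carries no BOM.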

Other encodings

Base64

Characteristics:

Any binary data can be converted into printable ASCII text, so it can be embedded in, edited as, or sent through plain text.
It provides only a lightweight form of obfuscation: anyone can decode it, so it is not encryption.

Aside from Base64, there is Base58. Compared with Base64, it drops six characters that are easy to confuse or awkward in text: the digit 0, uppercase O, uppercase I, lowercase l, and the symbols + and /, leaving 64 − 6 = 58 characters.
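A short sketch of both points, using Python's standard `base64` module: a round trip through Base64, then the 64 − 6 arithmetic behind Base58. The second half only counts the remaining characters; actual Base58 variants order their alphabet differently, so this is not a Base58 encoder.

```python
import base64
import string

# Round trip: bytes -> printable ASCII -> the same bytes.
raw = "中".encode("utf-8")            # b'\xe4\xb8\xad'
encoded = base64.b64encode(raw)
print(encoded)                        # b'5Lit'
print(base64.b64decode(encoded) == raw)  # True, trivially reversible

# The standard Base64 alphabet, minus the six characters Base58 drops.
base64_alphabet = (
    string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
)
base58_chars = [c for c in base64_alphabet if c not in "0OIl+/"]
print(len(base64_alphabet), len(base58_chars))  # 64 58
```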

URL encoding

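JavaScript provides helpers such as encodeURI and encodeURIComponent for URL encoding (percent-encoding). Both escape non-ASCII characters as their UTF-8 bytes. encodeURI leaves URL delimiters such as : / ? & = # intact, while encodeURIComponent escapes them too, which makes it the right choice for a single query parameter value.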
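The same percent-encoding idea can be tried outside the browser. The sketch below uses Python's standard `urllib.parse.quote`, which is only roughly analogous to the JavaScript helpers (its set of unescaped characters is not identical), so treat it as an illustration of the mechanism rather than a drop-in equivalent.

```python
from urllib.parse import quote, unquote

# Non-ASCII characters become their UTF-8 bytes, each written as %XX.
print(quote("中"))                    # %E4%B8%AD

# By default "/" is kept, which suits a path.
print(quote("a b&c=d/e"))             # a%20b%26c%3Dd/e

# With safe="" every delimiter is escaped, which suits one parameter value.
print(quote("a b&c=d/e", safe=""))    # a%20b%26c%3Dd%2Fe

# Decoding reverses the process.
print(unquote("%E4%B8%AD"))           # 中
```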

Why mojibake happens

Mojibake appears when the encoding used to read or display the data doesn’t match the encoding that was used to store it: the bytes are intact, but they are mapped to the wrong characters.

A quick example: ISO-8859-1 is based on ASCII and adds 96 characters in the 0xA0–0xFF range for Latin languages with diacritics, but it still cannot represent Chinese characters. UTF-16 is not byte-compatible with ASCII or UTF-8: even ASCII characters take two bytes, so UTF-16 data read as UTF-8 (or the reverse) comes out garbled.
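Mojibake is easy to reproduce on purpose. In this sketch the bytes are written as UTF-8 and then deliberately read as ISO-8859-1 (called `latin-1` in Python); the variable names are illustrative only.

```python
original = "中"
stored = original.encode("utf-8")          # bytes E4 B8 AD

# Reading with the wrong encoding: each byte becomes one Latin-1 character.
garbled = stored.decode("latin-1")
print(len(original), len(garbled))         # 1 3

# The bytes were never damaged, so reversing the wrong step recovers the text.
recovered = garbled.encode("latin-1").decode("utf-8")
print(recovered == original)               # True

# ISO-8859-1 has no slot for 中 at all, so encoding it fails outright.
try:
    original.encode("latin-1")
except UnicodeEncodeError:
    print("latin-1 cannot represent this character")

# UTF-16 uses two bytes even for ASCII, so it is not byte-compatible with UTF-8.
print("A".encode("utf-8"), "A".encode("utf-16-be"))  # b'A' b'\x00A'
```

The round trip only works here because Latin-1 maps every byte to some character; with other wrong encodings, bytes may be replaced or rejected and the original text can be lost.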

Detecting encodings

https://github.com/aikuyun/iterm2-zmodem/blob/master/iterm2-recv-zmodem.sh

Last updated on Jul 3, 2022
