Overview
Unicode Transformation Format (UTF) is a family of character encodings that can represent every code point in the Unicode character set. The most common variants are UTF-8, UTF-16, and UTF-32, each with different trade-offs in space, compatibility, and ease of processing.
Technical Details
UTF-8
- Variable-width encoding (1-4 bytes per code point)
- Backward compatible with ASCII (any ASCII text is already valid UTF-8)
- The dominant encoding on the web
- Space-efficient for ASCII-heavy text (byte counts per character are shown in the sketch below)
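To make the variable width concrete, the following sketch uses the standard TextEncoder API (available in modern browsers and Node.js) to report how many bytes each character needs in UTF-8; the sample characters are arbitrary choices.

// Byte length of individual characters in UTF-8 (1-4 bytes per code point)
const encoder = new TextEncoder();
for (const ch of ["A", "é", "€", "😊"]) {
  // encode() returns a Uint8Array holding the character's UTF-8 bytes
  console.log(ch, encoder.encode(ch).length);
}
// A 1, é 2, € 3, 😊 4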
UTF-16
- Variable-width encoding (2 or 4 bytes per code point)
- Used internally by Windows, Java, and JavaScript strings
- Code points above U+FFFF are encoded as surrogate pairs (see the sketch below)
- Endianness dependent (UTF-16BE and UTF-16LE byte orders exist)
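The surrogate-pair arithmetic can be shown in a few lines. This is a minimal sketch, not a standard API; the function name toSurrogatePair is illustrative, and the emoji code point is just an example.

// Split a supplementary code point (above U+FFFF) into a UTF-16 surrogate pair
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);  // top 10 bits of the offset
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits of the offset
  return [high, low];
}
console.log(toSurrogatePair(0x1F60A).map(u => u.toString(16))); // ["d83d", "de0a"]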
UTF-32
- Fixed-width encoding (4 bytes per code point)
- Constant-time indexing by code point
- Memory intensive, especially for ASCII-heavy text
- Endianness dependent (UTF-32BE and UTF-32LE byte orders exist, as shown below)
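Because every code point occupies exactly four bytes, the main encoding decision left is byte order. A small sketch using the standard DataView API writes the same code point in both orders; the variable names are illustrative.

// Write one UTF-32 code unit in big-endian and little-endian byte order
const buf = new ArrayBuffer(4);
const view = new DataView(buf);

view.setUint32(0, 0x1F60A, false); // big-endian (UTF-32BE)
console.log([...new Uint8Array(buf)]); // [0, 1, 246, 10]   i.e. 00 01 F6 0A

view.setUint32(0, 0x1F60A, true);  // little-endian (UTF-32LE)
console.log([...new Uint8Array(buf)]); // [10, 246, 1, 0]   i.e. 0A F6 01 00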
Examples
Character Encoding Examples
Character: A   UTF-8: 41           UTF-16: 0041        UTF-32: 00000041
Character: €   UTF-8: E2 82 AC     UTF-16: 20AC        UTF-32: 000020AC
Character: 😊  UTF-8: F0 9F 98 8A  UTF-16: D83D DE0A   UTF-32: 0001F60A
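The UTF-8 column above can be reproduced directly in JavaScript. Below is a quick sketch using TextEncoder; the hex formatting is only for display.

// Print the UTF-8 bytes of each sample character as hex
const enc = new TextEncoder();
for (const ch of ["A", "€", "😊"]) {
  const hex = [...enc.encode(ch)].map(b => b.toString(16).toUpperCase().padStart(2, "0"));
  console.log(ch, hex.join(" ")); // A 41 / € E2 82 AC / 😊 F0 9F 98 8A
}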
Implementation
JavaScript Example
// UTF-8 encoding with TextEncoder
const text = "Hello 😊";
const utf8Bytes = new TextEncoder().encode(text);
console.log(utf8Bytes); // Uint8Array [72, 101, 108, 108, 111, 32, 240, 159, 152, 138]
// UTF-16 encoding: one array element per UTF-16 code unit
const utf16Units = new Uint16Array(text.length);
for (let i = 0; i < text.length; i++) {
  utf16Units[i] = text.charCodeAt(i); // charCodeAt returns UTF-16 code units, not code points
}
console.log(utf16Units); // Uint16Array [72, 101, 108, 108, 111, 32, 55357, 56842]
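// UTF-16 decoding (assumption: the array holds valid UTF-16, including matched surrogate pairs);
// String.fromCharCode treats each argument as a UTF-16 code unit
const roundTripped = String.fromCharCode(...utf16Units);
console.log(roundTripped); // "Hello 😊"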
// Decoding UTF-8 bytes back to a string with TextDecoder
const decoder = new TextDecoder('utf-8');
const decodedText = decoder.decode(utf8Bytes);
console.log(decodedText); // "Hello 😊"
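One caveat worth noting from the examples above: text.length and charCodeAt work in UTF-16 code units, so the emoji counts as two units, while for...of, the spread operator, and codePointAt work in whole code points. A short sketch, reusing the text variable defined above:

// Code units vs. code points for the same string
console.log(text.length);                      // 8 - UTF-16 code units (the emoji is a surrogate pair)
console.log([...text].length);                 // 7 - code points (spread iterates by code point)
console.log(text.codePointAt(6).toString(16)); // "1f60a" - full code point of the emoji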