Unicode

Unicode is a character encoding standard designed to represent and handle text and symbols from virtually all writing systems in the world. It provides a unified and consistent way to encode characters and symbols, regardless of language, script, or platform.

Character Set

Unicode assigns a unique number (called a code point) to every character, symbol, and diacritic used in human writing systems, including Latin, Greek, Cyrillic, Arabic, Chinese, Japanese, and many others, as well as emojis (❤️). It aims to cover all written languages and scripts worldwide.

Code Points

Each character in Unicode is identified by a unique code point, written as U+ followed by a hexadecimal number (e.g., U+0041 for the Latin letter 'A'). Code points range from U+0000 to U+10FFFF, giving 1,114,112 possible values. Well over 140,000 of these are currently assigned to characters, which leaves plenty of room for expansion.
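As a rough illustration of the link between characters and code points, the short Python sketch below converts in both directions using the built-in ord and chr functions. The helper name code_point_label is made up for this example and is not part of any library.

```python
def code_point_label(ch: str) -> str:
    """Return the U+XXXX label for a single character (hypothetical helper)."""
    # ord() gives the code point as an integer; format it as at least 4 hex digits.
    return f"U+{ord(ch):04X}"


print(code_point_label("A"))            # U+0041
print(code_point_label("\u20ac"))       # euro sign: U+20AC
print(code_point_label("\U0001F600"))   # grinning face emoji: U+1F600

# chr() goes the other way, from a code point back to a character.
print(chr(0x0041))                      # A
```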

Unicode Bit Length

Unlike ASCII, which uses a fixed 7 bits per character, Unicode can be represented using various bit sizes, depending on the encoding scheme chosen. The most common Unicode encoding schemes are:

UTF-8

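Unicode characters in UTF-8 are encoded using 8, 16, 24, or 32 bits (1 to 4 bytes), depending on the specific character being encoded. The first 128 code points each use a single byte that is identical to their ASCII encoding. UTF-8 is the most common format, largely because it is backwards compatible with ASCII and is the most space efficient for text made up mainly of Latin characters.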
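To see the variable length in practice, the following minimal Python sketch encodes four sample characters with the built-in str.encode method and counts the bits used. The helper name utf8_bits and the choice of sample characters are assumptions made for illustration only.

```python
def utf8_bits(ch: str) -> int:
    """Number of bits used to store one character in UTF-8 (hypothetical helper)."""
    return len(ch.encode("utf-8")) * 8


# Sample characters written as escapes: A, e-acute, euro sign, grinning face emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    # Show the character, its size in bits, and the raw bytes in hexadecimal.
    print(ch, utf8_bits(ch), "bits:", encoded.hex(" "))
```

The four samples should come out at 8, 16, 24, and 32 bits respectively, matching the four sizes described above.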

UTF-16

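Unicode characters in UTF-16 are encoded using either 16 or 32 bits (2 or 4 bytes). Characters with code points up to U+FFFF use a single 16-bit unit, while those above U+FFFF use two 16-bit units, known as a surrogate pair.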
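The two UTF-16 sizes can be checked with a short Python sketch like the one below. It assumes the big-endian variant without a byte order mark ("utf-16-be") so that only the character's own bytes are counted. The helper name utf16_bits is hypothetical.

```python
def utf16_bits(ch: str) -> int:
    """Number of bits used to store one character in UTF-16 (hypothetical helper)."""
    # "utf-16-be" is an assumption: it avoids the 2-byte byte order mark
    # that Python's plain "utf-16" codec would add at the start.
    return len(ch.encode("utf-16-be")) * 8


print(utf16_bits("A"))            # 16
print(utf16_bits("\U0001F600"))   # 32 (grinning face emoji, above U+FFFF)

# The emoji is stored as two 16-bit units.
print("\U0001F600".encode("utf-16-be").hex(" ", 2))   # d83d de00
```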

UTF-32

Unicode characters in UTF-32 are encoded using a fixed 32 bits (4 bytes) for each character. This encoding scheme provides a straightforward and fixed-length representation for all Unicode characters.
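As a sketch of the trade-off between the three schemes, the Python example below measures how many bytes the same text needs in each encoding. The function name encoded_sizes is made up for this example, and the big-endian codecs without a byte order mark are an assumption chosen to keep the counts simple.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the size in bytes of text under each encoding (hypothetical helper)."""
    # The "-be" codecs are an assumption: they leave out the byte order mark.
    encodings = ["utf-8", "utf-16-be", "utf-32-be"]
    return {name: len(text.encode(name)) for name in encodings}


print(encoded_sizes("Hello"))
# {'utf-8': 5, 'utf-16-be': 10, 'utf-32-be': 20}

print(encoded_sizes("\U0001F600"))   # a single emoji
# {'utf-8': 4, 'utf-16-be': 4, 'utf-32-be': 4}
```

For plain English text, UTF-32 uses four times as much space as UTF-8, which is the price paid for every character having the same fixed length.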

