Unicode is a character encoding standard designed to represent and handle text and symbols from virtually all writing systems in the world. It provides a unified and consistent way to encode characters and symbols, regardless of language, script, or platform.
Unicode assigns a unique number (called a code point) to every character, symbol, and diacritic used in human writing systems, including Latin, Greek, Cyrillic, Arabic, Chinese, Japanese, and many others (including emojos!❤️ ). It aims to cover all written languages and scripts worldwide.
Each character in Unicode is identified by a unique code point, typically represented in hexadecimal format (e.g., U+0041 for the Latin letter 'A'). Unicode currently defines over 143,000 code points, with room for expansion.
Unlike ASCII, Unicode can be represented using various bit sizes, depending on the encoding scheme chosen. The most common Unicode encoding schemes are :
Unicode characters in UTF-8 are encoded using 8, 16, 24, or 32 bits, depending on the specific character being encoded. This is the most common format as it is the most space efficient.
Unicode characters in UTF-16 are encoded using either 16 or 32 bits.
Unicode characters in UTF-32 are encoded using a fixed 32 bits (4 bytes) for each character. This encoding scheme provides a straightforward and fixed-length representation for all Unicode characters.
What is Unicode?
Character set