(Unicode Transformation Format-8) A Unicode encoding that uses from one to four bytes per character. Characters in the ASCII range, which cover ordinary English text, take a single byte each, identical to their regular ASCII encoding. See Unicode and ASCII.
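The one-to-four-byte range can be checked with a short Python sketch. The sample characters and the helper name `utf8_length` are illustrative choices, not part of any standard API; only the built-in `str.encode` is assumed.

```python
# Sample characters chosen (as an illustration) to need 1, 2, 3 and 4 bytes.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]


def utf8_length(ch: str) -> int:
    """Return the number of bytes used to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")

# Pure ASCII text produces the same bytes under both encodings.
text = "plain English text"
same_as_ascii = text.encode("utf-8") == text.encode("ascii")
```

Escapes are used instead of literal characters so the example does not depend on the encoding of the file it is pasted into.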
(character) UTF-8 - (UCS transformation format 8) An
ASCII-compatible multibyte Unicode and UCS encoding,
used by Java and Plan 9.
The Unicode character set originally occupied a 16-bit code
space. The most obvious encoding of that space (known as UCS-2)
consists of a sequence of 16-bit words. Such strings can contain
bytes like '\0' or '/', which have a special meaning in filenames
and other C library function parameters. In addition, the
majority of Unix tools expect ASCII files and can't read
16-bit words as characters without major modifications. For
these reasons, UCS-2 is not a suitable external encoding of
Unicode in filenames, text files, environment variables, etc.
The ISO 10646 Universal Character Set (UCS), a superset of
Unicode, was originally defined with a 31-bit code space, and
the obvious UCS-4 encoding for it (a sequence of 32-bit words)
has the same problems.
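A minimal Python sketch of the problem with fixed-width encodings follows. It assumes Python's `utf-16-be` and `utf-32-be` codecs as stand-ins for UCS-2 and UCS-4 (they agree for the characters used here); the helper `has_special_bytes` and the sample string are hypothetical.

```python
def has_special_bytes(data: bytes) -> bool:
    """True if the byte string contains a NUL byte or a '/' byte."""
    return b"\x00" in data or b"/" in data


# 'a' followed by U+012F; neither character is NUL or '/'.
text = "a\u012f"

ucs2 = text.encode("utf-16-be")  # 16-bit words, big-endian
ucs4 = text.encode("utf-32-be")  # 32-bit words, big-endian
utf8 = text.encode("utf-8")

for name, data in [("UCS-2", ucs2), ("UCS-4", ucs4), ("UTF-8", utf8)]:
    print(name, data.hex(" "), has_special_bytes(data))
```

The 16-bit word for U+012F has 0x2F ('/') as its low byte, and the word for 'a' has 0x00 as its high byte, so C code that treats the buffer as a filename would misread both.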
The UTF-8 encoding of Unicode and UCS avoids the problems of
fixed-length Unicode encodings because an ASCII file encoded
in UTF-8 is exactly the same as the original ASCII file, and
every byte of a non-ASCII character is guaranteed to have the
most significant bit set (bit 0x80). This means that normal
tools for text searching etc. work as expected.
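The high-bit guarantee can be illustrated with a small Python sketch. The function name `non_ascii_bytes_have_high_bit` and the sample path are made up for the example; the byte-level search stands in for what an ASCII-only tool would do.

```python
def non_ascii_bytes_have_high_bit(text: str) -> bool:
    """Check that every byte of every non-ASCII character has bit 0x80 set."""
    for ch in text:
        if ord(ch) >= 0x80:
            if not all(b & 0x80 for b in ch.encode("utf-8")):
                return False
    return True


# A path-like string mixing ASCII and non-ASCII characters.
line = "na\u00efve caf\u00e9/menu.txt"
data = line.encode("utf-8")

# A byte-oriented search for '/' can only match the real separator,
# because no byte of a multibyte character is below 0x80.
slash_index = data.find(b"/")
directory = data[:slash_index].decode("utf-8")
filename = data[slash_index + 1:].decode("utf-8")
print(directory, "|", filename)
```

Splitting at the byte offset never cuts a multibyte character in half here, which is why both halves decode cleanly.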
UTF-8 is defined in RFC 3629, which obsoletes RFC 2279 and
limits the encoding to at most four bytes per character.
["File System Safe UCS Transformation Format (FSS_UTF)",
X/Open Preliminary Specification, X/Open Company Ltd.,
Document Number: P316. This information also appears in
ISO/IEC 10646, Annex P].
Plan 9 UTF manual entry.