s i s t e m a o p e r a c i o n a l m a g n u x l i n u x | ~/ · documentação · suporte · sobre |
Next
Previous
Contents
1. Introduction
1.1 Why Unicode?
People in different countries use different characters to represent the words of their native languages. Nowadays most applications, including email systems and web browsers, are 8-bit clean, i.e. they can operate on and display text correctly provided that it is represented in an 8-bit character set, like ISO-8859-1. There are far more than 256 characters in the world - think of cyrillic, hebrew, arabic, chinese, japanese, korean and thai -, and new characters are being invented now and then. The problems that come up for users are:
The solution of this problem is the adoption of a world-wide usable character
set. This character set is Unicode
http://www.unicode.org/.
For more info about Unicode, do `
1.2 Unicode encodings
This reduces the user's problem of dealing with character sets to a technical problem: How to transport Unicode characters using the 8-bit bytes? 8-bit units are the smallest addressing units of most computers and also the unit used by TCP/IP network connections. The use of 1 byte to represent 1 character is, however, an accident of history, caused by the fact that computer development started in Europe and the U.S. where 96 characters were found to be sufficient for a long time. There are basically four ways to encode Unicode characters in bytes:
The space requirements for encoding a text, compared to encodings currently in use (8 bit per character for European languages, more for Chinese/Japanese/Korean), is as follows. This has an influence on disk storage space and network download speed (when no form of compression is used).
Given the penalty for US and European documents caused by UCS-2, UTF-16, and UCS-4, it seems unlikely that these encodings have a potential for wide-scale use. The Microsoft Win32 API supports the UCS-2 encoding since 1995 (at least), yet this encoding has not been widely adopted for documents - SJIS remains prevalent in Japan. UTF-8 on the other hand has the potential for wide-scale use, since it doesn't penalize US and European users, and since many text processing programs don't need to be changed for UTF-8 support. In the following, we will describe how to change your Linux system so it uses UTF-8 as text encoding.
Footnotes for C/C++ developers
The Microsoft Win32 approach makes it easy for developers to produce
Unicode versions of their programs: You "#define UNICODE" at the top
of your program and then change many occurrences of ` Moreover, there is an endianness issue with UCS-2 and UCS-4. The IANA character set registry http://www.isi.edu/in-notes/iana/assignments/character-sets says about ISO-10646-UCS-2: "this needs to specify network byte order: the standard does not specify". Network byte order is big endian. And RFC 2152 is even clearer: "ISO/IEC 10646-1:1993(E) specifies that when characters the UCS-2 form are serialized as octets, that the most significant octet appear first." Whereas Microsoft, in its C/C++ development tools, recommends to use machine-dependent endianness (i.e. little endian on ix86 processors) and either a byte-order mark at the beginning of the document, or some statistical heuristics(!). The UTF-8 approach on the other hand keeps `
1.3 Related resources
Markus Kuhn's very up-to-date resource list: Roman Czyborra's overview of Unicode, UTF-8 and UTF-8 aware programs: http://czyborra.com/utf/#UTF-8 Some example UTF-8 files:
Next Previous Contents |