s i s t e m a o p e r a c i o n a l m a g n u x l i n u x | ~/ · documentação · suporte · sobre |
4.8. Character Encoding4.8.1. Introduction to Character EncodingFor many years Americans have exchanged text using the ASCII character set; since essentially all U.S. systems support ASCII, this permits easy exchange of English text. Unfortunately, ASCII is completely inadequate in handling the characters of nearly all other languages. For many years different countries have adopted different techniques for exchanging text in different languages, making it difficult to exchange data in an increasingly interconnected world. More recently, ISO has developed ISO 10646, the ``Universal Mulitple-Octet Coded Character Set (UCS). UCS is a coded character set which defines a single 31-bit value for each of all of the world's characters. The first 65536 characters of the UCS (which thus fit into 16 bits) are termed the ``Basic Multilingual Plane'' (BMP), and the BMP is intended to cover nearly all of today's spoken languages. The Unicode forum develops the Unicode standard, which concentrates on the UCS and adds some additional conventions to aid interoperability. Historically, Unicode and ISO 10646 were developed by competing groups, but thankfully they realized that they needed to work together and they now coordinate with each other. If you're writing new software that handles internationalized characters, you should be using ISO 10646/Unicode as your basis for handling international characters. However, you may need to process older documents in various older (language-specific) character sets, in which case, you need to ensure that an untrusted user cannot control the setting of another document's character set (since this would significantly affect the document's interpretation). 4.8.2. Introduction to UTF-8Most software is not designed to handle 16 bit or 32 bit characters, yet to create a universal character set more than 8 bits was required. Therefore, a special format called ``UTF-8'' was developed to encode these potentially international characters in a format more easily handled by existing programs and libraries. UTF-8 is defined, among other places, in IETF RFC 2279, so it's a well-defined standard that can be freely read and used. UTF-8 is a variable-width encoding; characters numbered 0 to 0x7f (127) encode to themselves as a single byte, while characters with larger values are encoded into 2 to 6 bytes of information (depending on their value). The encoding has been specially designed to have the following nice properties (this information is from the RFC and Linux utf-8 man page):
In short, the UTF-8 transformation format is becoming a dominant method for exchanging international text information because it can support all of the world's languages, yet it is backward compatible with U.S. ASCII files as well as having other nice properties. For many purposes I recommend its use, particularly when storing data in a ``text'' file. 4.8.3. UTF-8 Security IssuesThe reason to mention UTF-8 is that some byte sequences are not legal UTF-8, and this might be an exploitable security hole. UTF-8 encoders are supposed to use the ``shortest possible'' encoding, but naive decoders may accept encodings that are longer than necessary. If an attacker intentionally creates an unusually long format, input checkers might not notice the problem. The RFC describes the problem this way:
A longer discussion about this is available at Markus Kuhn's UTF-8 and Unicode FAQ for Unix/Linux at http://www.cl.cam.ac.uk/~mgk25/unicode.html. 4.8.4. UTF-8 Legal ValuesThus, when accepting UTF-8 input, you need to check if it's valid UTF-8. Here is a list of all legal UTF-8 sequences; any character sequence not matching this table is not a legal UTF-8 sequence. In the following table, the first three rows list the legal UTF-8 sequences in binary, hexadecimal, and octal. The last row lists the UCS code region that the sequence encodes to (in hexadecimal). In the binary column, the ``x'' indicates that either a 0 or 1 is legal in the sequence. In the other columns, a ``-'' indicates a range of legal values (inclusive). Of course, just because a sequence is a legal UTF-8 sequence doesn't mean that you should accept it (see the other issues discussed in this book). Table 4-1. Legal UTF-8 Sequences
I should note that in some cases, you might want to cut slack (or use internally) the hexadecimal sequence C0 80. This is an overlong sequence that, if permitted, can represent ASCII NUL (NIL). Since C and C++ have trouble including a NIL character in an ordinary string, some people have taken to using this sequence when they want to represent NIL as part of the data stream; Java even enshrines the practice. Feel free to use C0 80 internally while processing data, but technically you really should translate this back to 00 before saving the data. Depending on your needs, you might decide to be ``sloppy'' and accept C0 80 as input in a UTF-8 data stream. If it doesn't harm security, it's probably a good practice to accept this sequence since accepting it aids interoperability. 4.8.5. UTF-8 Illegal ValuesThe UTF-8 character set is one case where it's possible to enumerate all illegal values (and prove that you've enumerated them all). You detect an illegal sequence by checking for two things: (1) is the initial sequence legal, and (2) if it is, is the first byte followed by the required number of valid continuation characters? Performing the first check is easy; the following is provably the complete list of all illegal UTF-8 initial sequences: Table 4-2. Illegal UTF-8 initial sequences
The second step is to check if the correct number of continuation characters are included in the string. If the first byte has the top 2 bits set, you count the number of ``one'' bits set after the top one, and then check that there are that many continuation bytes which begin with the bits ``10''. So, binary 11100001 requires two more continuation bytes. Again, although C0 80 is technically an illegal UTF-8 sequence, for interoperability purposes you might want to accept it as a synonym for character 0 (NIL). 4.8.6. UTF-8 Related IssuesThis section has discussed UTF-8, because it's the most popular multibyte encoding of UCS, simplifying a lot of international text handling issues. However, it's certainly not the only encoding; there are other encodings, such as UTF-16 and UTF-7, which have the same kinds of issues and must be validated for the same reasons. Another issue is that some phrases can be expressed in more than one way in ISO 10646/Unicode. For example, some accented characters can be represented as a single character (with the accent) and also as a set of characters (e.g., the base character plus a separate composing accent). These two forms may appear identical. There's also a zero-width space that could be inserted, with the result that apparently-similar items are considered different. Beware of situations where such hidden text could interfere with the program. This is an issue that in general is hard to solve; most programs don't have such tight control over the clients that they know completely how a particular sequence will be displayed (since this depends on the client's font, display characteristics, locale, and so on). |