6. Making your programs Unicode aware
6.1 C/C++
The C `
For normal text handling
The ISO/ANSI C standard contains, in an amendment which was added in 1995,
a "wide character" type `wchar_t'.
Good references for this API are
Advantages of using this API:
Drawbacks of this API:
Portability notes
A `
In detail, here is what the Single Unix specification says about the `
One particular consequence is that in portable programs you shouldn't use
non-ASCII characters in string literals. That means, even though you
know the Unicode double quotation marks have the codes U+201C and U+201D,
you shouldn't write a string literal
Here is a survey of the portability of the ISO/ANSI C facilities on various Unix flavours. GNU glibc-2.2 will support all of it, but for now we have the following picture.
As a consequence, I recommend using the restartable and multithread-safe wcsr/mbsr functions, forgetting about those systems which don't have them (Irix, HP-UX, AIX), and using the UTF-8 locale plug-in libutf8_plug.so (see below) on those systems which permit you to compile programs which use these wcsr/mbsr functions (Linux, Solaris, OSF/1).
Similar advice, given by Sun in http://www.sun.com/software/white-papers/wp-unicode/, section "Internationalized Applications with Unicode", is: To properly internationalize an application, use the following guidelines:
If, for some reason, in some piece of code, you really have to assume that
`wchar_t' is Unicode (for example, if you want to do special treatment of
some Unicode characters), you should make that piece of code conditional
upon the result of
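A check of this kind, combined with the restartable functions recommended above, might look like the following sketch. It is only an illustration under stated assumptions: the helper names wchar_is_unicode and count_curly_quotes are hypothetical, the `__STDC_ISO_10646__' macro comes from the later C99 standard and is not defined on every system, and the nl_langinfo(CODESET) fallback is a heuristic that assumes a POSIX system.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: nonzero if wchar_t values may be treated as
   Unicode code points in the current locale. */
static int wchar_is_unicode(void)
{
#ifdef __STDC_ISO_10646__
    /* C99 implementations define this macro when wchar_t holds
       ISO 10646 values in every locale. */
    return 1;
#else
    /* Heuristic fallback (an assumption of this sketch): trust
       wchar_t only when the locale's encoding is UTF-8. */
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
#endif
}

/* Hypothetical helper: count the quotation marks U+201C and U+201D in
   a multibyte string, decoding it with the restartable mbrtowc().
   Returns -1 on an invalid sequence, or when wchar_t cannot be
   assumed to be Unicode. */
static long count_curly_quotes(const char *s)
{
    mbstate_t state;
    size_t left;
    long count = 0;

    assert(s != NULL);
    if (!wchar_is_unicode())
        return -1;

    left = strlen(s);
    memset(&state, 0, sizeof state);    /* initial conversion state */
    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, left, &state);

        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;                  /* invalid or truncated sequence */
        if (n == 0)
            break;                      /* decoded a NUL character */
        if (wc == 0x201C || wc == 0x201D)
            count++;
        s += n;
        left -= n;
    }
    return count;
}
```

A program would call setlocale(LC_ALL, "") once at startup before using either helper, so that the conversion follows the user's locale.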
The libutf8 library
A portable implementation of the ISO/ANSI C API, which supports 8-bit locales and UTF-8 locales, can be found in libutf8-0.7.3.tar.gz.
Advantages:
The Plan9 way
The Plan9 operating system, a variant of Unix, uses UTF-8 as character
encoding in all applications. Its wide character type is called
`Rune'.
Drawback of this API:
For graphical user interface
The Qt-2.0 library http://www.troll.no/ contains a fully-Unicode QString class. You can use the member functions QString::utf8 and QString::fromUtf8 to convert to/from UTF-8 encoded text. The QString::ascii and QString::latin1 member functions should not be used any more.
For advanced text handling
The previously mentioned libraries implement Unicode aware versions of the ASCII concepts. Here are libraries which deal with Unicode concepts, such as titlecase (a third letter case, different from uppercase and lowercase), distinction between punctuation and symbols, canonical decomposition, combining classes, canonical ordering and the like.
For conversion
Two kinds of conversion libraries, which support UTF-8 and a large number of 8-bit character sets, are available:
iconv
The iconv implementation by Ulrich Drepper, contained in the GNU glibc-2.1.3. ftp://ftp.gnu.org/pub/gnu/glibc/glibc-2.1.3.tar.gz. The iconv manpages are now contained in ftp://ftp.win.tue.nl/pub/linux-local/manpages/man-pages-1.29.tar.gz.
The portable iconv implementation by Bruno Haible. ftp://ftp.ilog.fr/pub/Users/haible/gnu/libiconv-1.3.tar.gz
The portable iconv implementation by Konstantin Chuguev <joy@urc.ac.ru>. ftp://ftp.urc.ac.ru/pub/local/OS/Unix/converters/iconv-0.4.tar.gz
Advantages:
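As an illustration of the iconv API itself, here is a minimal sketch that converts an ISO-8859-1 string to UTF-8 with iconv_open(), iconv() and iconv_close(). The helper name latin1_to_utf8 is hypothetical, and the sketch assumes that the installed iconv implementation knows the encoding names "ISO-8859-1" and "UTF-8"; some implementations spell these names differently.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert the NUL-terminated ISO-8859-1 string
   `in' to UTF-8, storing the NUL-terminated result in `out'.
   Returns the number of bytes written (not counting the NUL), or
   (size_t)-1 if the conversion could not be done. */
static size_t latin1_to_utf8(const char *in, char *out, size_t outsize)
{
    iconv_t cd;
    char *inp;
    char *outp;
    size_t inleft, outleft, rc;

    assert(in != NULL && out != NULL && outsize > 0);

    inp = (char *)in;           /* iconv() takes char **, but only reads the input */
    outp = out;
    inleft = strlen(in);
    outleft = outsize - 1;      /* keep room for the terminating NUL */

    cd = iconv_open("UTF-8", "ISO-8859-1");     /* (tocode, fromcode) */
    if (cd == (iconv_t)-1)
        return (size_t)-1;      /* conversion not supported */

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;      /* for example, output buffer too small */

    *outp = '\0';
    return (size_t)(outp - out);
}
```

Each ISO-8859-1 byte expands to at most two UTF-8 bytes, so an output buffer of twice the input length plus one byte is always large enough for this particular conversion.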
librecode
librecode by François Pinard ftp://ftp.gnu.org/pub/gnu/recode/recode-3.5.tar.gz.
Advantages:
Drawbacks:
ICU
International Components for Unicode
http://oss.software.ibm.com/icu/
(look also at
http://oss.software.ibm.com/icu/icuhtml/API1.5/).
IBM's internationalization library also has conversion facilities, declared
in `ucnv.h'.
Advantages:
Drawbacks:
Other approaches
6.2 Java
Java has Unicode support built into the language. The type `char' denotes a Unicode character, and the `java.lang.String' class denotes a string built up from Unicode characters.

Java can display any Unicode characters through its windowing system AWT, provided that 1. you set the Java system property "user.language" appropriately, 2. the /usr/lib/java/lib/font.properties.language font set definitions are appropriate, and 3. the fonts specified in that file are installed. For example, in order to display text containing Japanese characters, you would install Japanese fonts and run "java -Duser.language=ja ...". You can combine font sets: in order to display Western European, Greek and Japanese characters simultaneously, you would combine the files "font.properties" (covers ISO-8859-1), "font.properties.el" (covers ISO-8859-7) and "font.properties.ja" into a single file. (This is untested.)

The interfaces java.io.DataInput and java.io.DataOutput have methods called `readUTF' and `writeUTF' respectively. But note that they don't use UTF-8; they use a modified UTF-8 encoding: the NUL character is encoded as the two-byte sequence 0xC0 0x80 instead of 0x00, and the encoded string is preceded by a two-byte length field. Because the encoded data never contains a 0x00 byte, the same encoding can also be terminated by a 0x00 byte instead, as is done for strings passed through the Java native interface. Encoded this way, strings can contain NUL characters and nevertheless need not be prefixed with a length field, and the C <string.h> functions like strlen() and strcpy() can be used to manipulate them.
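The NUL handling of this modified UTF-8 encoding can be sketched in C. The function below is only an illustration, not code taken from Java: the name encode_modified_utf8 is hypothetical, the input is assumed to be an array of 16-bit code units (as in a Java `char' array), and each code unit is encoded on its own. It produces the variant terminated by a 0x00 byte, so that strlen() works on the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: encode n 16-bit code units as modified UTF-8
   and append a terminating 0x00 byte. `out' must have room for
   3 * n + 1 bytes. Returns the number of bytes written, not counting
   the terminator. */
static size_t encode_modified_utf8(const uint16_t *in, size_t n,
                                   unsigned char *out)
{
    size_t len = 0;
    size_t i;

    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        uint16_t c = in[i];

        if (c >= 0x0001 && c <= 0x007F) {
            /* ASCII other than NUL: one byte */
            out[len++] = (unsigned char)c;
        } else if (c <= 0x07FF) {
            /* Two bytes; this branch also turns U+0000 into 0xC0 0x80 */
            out[len++] = (unsigned char)(0xC0 | (c >> 6));
            out[len++] = (unsigned char)(0x80 | (c & 0x3F));
        } else {
            /* Three bytes for the rest of the 16-bit range */
            out[len++] = (unsigned char)(0xE0 | (c >> 12));
            out[len++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[len++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    out[len] = 0x00;    /* the only 0x00 byte in the output */
    return len;
}
```

Since an embedded NUL becomes 0xC0 0x80, strlen() on the output returns the full encoded length even when the original string contained NUL characters.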
6.3 Lisp
The Common Lisp standard specifies two character types: `base-char' and `character'. It's up to the implementation to support Unicode or not. The language also specifies a keyword argument `:external-format' to `open', as the natural place to specify a character set or encoding.

Among the free Common Lisp implementations, only CLISP
http://clisp.cons.org/
supports Unicode. You need a CLISP version from March 2000 or newer.
ftp://clisp.cons.org/pub/lisp/clisp/source/clispsrc.tar.gz.
The types `base-char' and `character' are both equivalent to 16-bit Unicode.
The functions

Among the commercial Common Lisp implementations:

LispWorks http://www.xanalys.com/software_tools/products/ supports Unicode. The type `base-char' is equivalent to ISO-8859-1, and the type `simple-char' (subtype of `character') contains all Unicode characters. The encoding used for file I/O can be specified through the `:external-format' argument, for example '(:UTF-8). Limitations: Encodings cannot be used for socket I/O. The editor cannot edit UTF-8 encoded files.

Eclipse http://www.elwood.com/eclipse/eclipse.htm supports Unicode. See http://www.elwood.com/eclipse/char.htm. The type `base-char' is equivalent to ISO-8859-1, and the type `character' contains all Unicode characters. The encoding used for file I/O can be specified through a combination of the `:element-type' and `:external-format' arguments to `open'. Limitations: Character attribute functions are locale dependent. Source and compiled source files cannot contain Unicode string literals.

The commercial Common Lisp implementation Allegro CL will have Unicode support in its upcoming release 6.0.
6.4 Ada95
Ada95 was designed for Unicode support and the Ada95 standard library features special ISO 10646-1 data types Wide_Character and Wide_String, as well as numerous associated procedures and functions. The GNU Ada95 compiler (gnat-3.11 or newer) supports UTF-8 as the external encoding of wide characters. This allows you to use UTF-8 in both source code and application I/O. To activate it in the application, use "WCEM=8" in the FORM string when opening a file, and use compiler option "-gnatW8" if the source code is in UTF-8. See the GNAT ( ftp://cs.nyu.edu/pub/gnat/) and Ada95 ( ftp://ftp.cnam.fr/pub/Ada/PAL/userdocs/docadalt/rm95/index.htm) reference manuals for details.
6.5 Python
Python 2.0 (http://starship.python.net/crew/amk/python/writing/new-python/new-python.html) will contain Unicode support. In particular, it will have a data type `unicode', representing a Unicode string, a module `unicodedata' for the character properties, and a set of converters for the most important encodings. See http://starship.python.net/crew/lemburg/unicode-proposal.txt for details.
6.6 JavaScript/ECMAscript
Since JavaScript version 1.3, strings are always Unicode. There is no character type, but you can use the \uXXXX notation for Unicode characters inside strings. No normalization is done internally, so it expects to receive Unicode Normalization Form C, which the W3C recommends. See http://developer.netscape.com/docs/manuals/communicator/jsref/js13.html#Unicode for details and http://developer.netscape.com/docs/javascript/e262-pdf.pdf for the complete ECMAscript specification.
6.7 Tcl
Tcl/Tk started using Unicode as its base character set with version 8.1. Its internal representation for strings is UTF-8. It supports the \uXXXX notation for Unicode characters. See http://dev.scriptics.com/doc/howto/i18n.html.
6.8 Perl
Perl 5.6 stores strings internally in UTF-8 format, if you write
`use utf8;'
at the beginning of your script. length() returns the number of
characters of a string. For details, see the Perl-i18n FAQ at
http://rf.net/~james/perli18n.html.