the help desk

A Look at EUC Encoding

The November Help Desk focused on the basics of PC character sets and JIS and Shift-JIS encoding, and in December we looked at Unicode, a relatively new 16-bit encoding method already implemented on Microsoft's Windows NT operating system. This month, we round out our tutorial on character sets and encoding methods by examining EUC (Extended UNIX Code), the encoding method used in most UNIX environments.

by Steven Myers

You will recall from the November Help Desk that the high number of (often lengthy) escape sequences used to signify a character set change in JIS encoding makes for highly inefficient internal storage. The Shift-JIS encoding method avoids this problem by designating certain byte values to always be the first byte of a two-byte character code. A major drawback of Shift-JIS, however, is that the available encoding space is severely limited, so it is used primarily to encode only the 6,355 kanji in the JIS X 0208-1990 character set.

The best of both worlds

EUC (Extended UNIX Code), the encoding standard found on most UNIX workstations, attempts to capture the best of both worlds by providing for the inclusion of four different character sets without requiring the use of escape sequences. Furthermore, whereas JIS and Shift-JIS are Japanese-specific encoding methods, EUC can be "localized" for a particular country simply by using appropriate character sets from the language of that country. Code set 0 is always set to the local version of ASCII (this would be JIS-Roman for Japan), while use of the remaining three code sets -- and their implementation -- is left up to each country.

In this sense, EUC conforms to the code page model used on DOS/Windows PCs. That is, the user can specify which "version" of EUC to use, such as EUC-J for Japanese or EUC-KR for Korean. The EUC locale tells the system which character sets to use for code sets 1, 2, and 3. Whereas EUC-J would use code set 1 to encode the JIS X 0208-1990 character set (Japanese kanji), for example, EUC-KR would employ the same code set 1 for the Korean KS C 5601-1992 character set.
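The effect of the locale can be illustrated with a small sketch. Python and its codec names "euc_jp" and "euc_kr" are choices made for this example, not something the column itself relies on. The same two bytes fall in code set 1 under either locale, but the character they name depends on which locale is applied.

```python
# The same byte pair, interpreted under two different EUC locales.
raw = bytes([0xB0, 0xA1])

as_japanese = raw.decode("euc_jp")  # code set 1 read as JIS X 0208
as_korean = raw.decode("euc_kr")    # code set 1 read as KS C 5601

# Print the Unicode code point and the character for each reading.
print(hex(ord(as_japanese)), as_japanese)
print(hex(ord(as_korean)), as_korean)
```

Under the Japanese locale the pair decodes to a kanji; under the Korean locale it decodes to a hangul syllable. Nothing in the bytes themselves records which reading was intended, which is the code page problem in miniature.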

Note, however, that this scheme does not escape the problem of having to deal with multiple code pages -- which can be a real headache for developers and programmers, many of whom have long called for a "unified" coding system such as Unicode. Yet efforts at making Unicode more widespread are meeting with considerable resistance from countries that fear their particular rendering of a character will die out as it is merged with similar-looking characters from other countries into a single code point. (For more discussion of this, see "A Unicode Tutorial" (page 9) and "Will Unicode Kill Japanese Kanji" (page 15) in our December issue.)

There is little doubt that the coding scheme of the future must be truly international and allow for the inclusion of the characters from virtually all languages within a fixed-width encoding space. In practical terms, though, EUC appears at this point to be the most efficient of the "non-controversial" encoding methods.

Japanese implementations of EUC

Two different methods are used to implement EUC in the encoding of Japanese character sets. The most widely used of these is called "packed format"; it includes not only 1- and 2-byte characters, but also 3-byte values. Figure 1 shows the distribution of code space for Japanese packed format EUC. Note that, as in Shift-JIS, the value of a byte determines whether it is to be taken as a single-byte character code, or as the first byte of a 2- or 3-byte value. Also note that the 3-byte codes are used to encode characters from the JIS X 0212-1990 standard (kanji that are encountered less frequently).
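As a rough illustration of how the first byte decides the width of a character, here is a minimal Python sketch. The byte ranges are those commonly documented for Japanese packed-format EUC (0x8E introduces a 2-byte code set 2 character, 0x8F introduces a 3-byte code set 3 character, and 0xA1-0xFE begins a 2-byte code set 1 character); they are stated here as assumptions rather than read off Figure 1. The function name split_euc_jp is hypothetical, and the sketch does no validation of trailing bytes.

```python
def split_euc_jp(data: bytes):
    """Yield (code_set, raw_bytes) for each character in packed-format EUC.

    Assumed layout: 0x8E -> code set 2 (2 bytes), 0x8F -> code set 3
    (3 bytes), 0xA1-0xFE -> code set 1 (2 bytes), anything else -> code
    set 0 (1 byte). Malformed input is not detected.
    """
    i = 0
    while i < len(data):
        first = data[i]
        if first == 0x8E:
            code_set, width = 2, 2
        elif first == 0x8F:
            code_set, width = 3, 3
        elif 0xA1 <= first <= 0xFE:
            code_set, width = 1, 2
        else:
            code_set, width = 0, 1
        yield code_set, data[i:i + width]
        i += width


# One character from each code set, back to back, with no escape sequences.
sample = b"A" + b"\xb0\xa1" + b"\x8e\xb1" + b"\x8f\xb0\xa1"
for code_set, raw in split_euc_jp(sample):
    print(code_set, raw.hex())
```

Because the decision is made from the first byte alone, a reader of the stream never needs to remember an earlier escape sequence, which is the advantage over JIS encoding described above.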

The other Japanese implementation of EUC, known as "complete two-byte format," is much less common than packed format. As in Unicode, all the values in complete two-byte format EUC are 16 bits wide, even the JIS-Roman/ASCII values of code set 0. Figure 2 shows the range of code values assigned to Japanese complete two-byte format EUC.
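To make the fixed-width idea concrete, the following Python sketch maps one packed-format character to a 16-bit value. The layout it assumes is the one usually documented for this format, not one taken from Figure 2: code sets 0 and 2 get a zero high byte, code set 1 keeps both bytes unchanged, and code set 3 drops the 0x8F prefix and clears the high bit of the second byte. The function name to_fixed_width and its arguments are hypothetical.

```python
def to_fixed_width(code_set: int, raw: bytes) -> int:
    """Map one packed-format EUC character to a 16-bit value.

    Assumed layout (not taken from the article's Figure 2):
      code set 0: 0x00 followed by the single byte
      code set 1: both bytes kept unchanged
      code set 2: 0x8E prefix dropped, 0x00 high byte
      code set 3: 0x8F prefix dropped, high bit of last byte cleared
    """
    if code_set == 0:
        return raw[0]
    if code_set == 1:
        return (raw[0] << 8) | raw[1]
    if code_set == 2:
        return raw[1]
    if code_set == 3:
        return (raw[1] << 8) | (raw[2] & 0x7F)
    raise ValueError(f"unknown code set: {code_set}")


# The letter "A" occupies a full 16 bits in this format.
print(f"{to_fixed_width(0, b'A'):04X}")
print(f"{to_fixed_width(1, bytes([0xB0, 0xA1])):04X}")
```

Under these assumptions, the pattern of high bits in the two bytes identifies the code set, so every character can occupy exactly 16 bits at the cost of doubling the storage needed for plain ASCII text.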

Conclusion

Over a period of three months, we have examined the basic encoding methods used to process Japanese text: JIS, Shift-JIS, EUC, and the Unicode international encoding standard. Because the relative positions of characters in the Japanese national character sets are kept somewhat consistent in the JIS, Shift-JIS (excluding JIS X 0212), and EUC encoding methods, conversion among these three methods is fairly straightforward (and has become a standard inclusion in Japanese-capable applications). Conversion to/from Unicode is not quite so simple, though, requiring the use of mapping tables (which can be obtained from the Unicode Consortium).
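The reason such conversion is straightforward can be sketched in a few lines of Python. The arithmetic below is the widely published conversion for JIS X 0208 characters, not code from this series, and the function names are hypothetical. Each function takes the two 7-bit JIS bytes of a character (0x21-0x7E each) and returns the corresponding bytes in the target encoding.

```python
def jis_to_euc(j1: int, j2: int) -> bytes:
    """JIS X 0208 two-byte code -> EUC code set 1: set the high bit of each byte."""
    return bytes([j1 | 0x80, j2 | 0x80])


def jis_to_sjis(j1: int, j2: int) -> bytes:
    """JIS X 0208 two-byte code -> Shift-JIS.

    Two JIS rows are folded into each Shift-JIS first byte; the second
    byte is offset differently for odd and even rows.
    """
    row_offset = 0x70 if j1 < 0x5F else 0xB0
    if j1 % 2:                       # odd row
        cell_offset = 0x20 if j2 > 0x5F else 0x1F
    else:                            # even row
        cell_offset = 0x7E
    return bytes([((j1 + 1) >> 1) + row_offset, j2 + cell_offset])


# The first kanji in JIS X 0208 has the JIS code 0x3021.
print(jis_to_euc(0x30, 0x21).hex())
print(jis_to_sjis(0x30, 0x21).hex())
```

No lookup table is involved: the character's row and cell position carries over from one encoding to the next, so conversion is pure arithmetic. Going to or from Unicode offers no such shortcut, which is why mapping tables are required.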

At present, the main issue concerning the future of character sets and encoding methods appears to be whether or not the benefits of using an international scheme (such as Unicode) are strong enough to offset the potential conflicts among countries that perceive a threat to their language and culture. The efforts of Unicode supporters notwithstanding, this problem may not be completely resolved until 32-bit character code values and a virtually unlimited encoding area become more practical. Until then, users will likely have to make do with existing, "partial-solution" methods.




(c) Copyright 1996 by Computing Japan magazine