How Computers Speak Japanese

A review of Ken Lunde's Understanding Japanese Information Processing

The revolutionary shift by the Japanese PC market to the DOS/V standard has at last provided a common, familiar platform for hardware and software developers alike. As a (perhaps unintended) side effect, it has also opened the door for foreign software firms to start developing and localizing their products for the Japanese market. Programmers eyeing opportunities in the Japanese market, as well as users who would like to know more about how Japanese text is handled electronically, will find Ken Lunde's book a valuable source of information and advice.

New opportunities in the Japanese market have created increased interest in Japanese computing and the methods used to electronically process text in the Japanese language. For those who are doing, or thinking about doing, any kind of work in this area, Ken Lunde's Understanding Japanese Information Processing provides a wealth of detailed information about the encoding methods and processing tools and techniques necessary for this kind of development. It also includes a valuable bibliography for those who want or need to delve deeper into this complex topic.

Japanese character sets

The first few chapters of Lunde's book provide an introduction to the Japanese writing system and give a general overview of Japanese information processing techniques. Some basic terminology is defined (see the sidebar for a sampling), and the evolution of Japanese character sets (both electronic and non-electronic) is described in detail.

In addition to ASCII, four other electronic character sets (JIS-Roman, half-width katakana, JIS X 0208-1990, and JIS X 0212-1990) are used heavily in Japan and considered to be "national" character sets. Lunde covers each of these sets in detail, as well as some other Asian character sets (primarily those used in China and Korea). He also covers international sets, including Unicode. (Unicode is essentially an amalgamation of character set standards from all over the world into a single, unified set with 65,536 character positions, which can be visualized as a 256-row by 256-cell matrix. Lunde notes that the actual number of characters in Unicode is constantly changing, and that further changes can be expected in the future.)
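To make the row/cell picture concrete, the short C sketch below splits a 16-bit Unicode value into its row (high byte) and cell (low byte). It is only an illustration; the code point U+65E5 (the kanji for "day") is chosen here as an arbitrary example and is not taken from the book.

    #include <stdio.h>

    /* Illustration of the 256-row by 256-cell view of Unicode described
     * above: the high byte of a 16-bit code selects the row, the low byte
     * selects the cell within that row. */
    int main(void)
    {
        unsigned int code = 0x65E5;              /* example: the kanji "day" */
        unsigned int row  = (code >> 8) & 0xFF;  /* high byte */
        unsigned int cell = code & 0xFF;         /* low byte  */

        printf("U+%04X sits in row %u, cell %u of the 256 x 256 matrix\n",
               code, row, cell);
        return 0;
    }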

Encoding methods

The second section of the book deals with the numerous encoding methods used to process Japanese text. Lunde reduces these to three basic schemes: JIS, Shift-JIS, and EUC. JIS (Japanese Industrial Standard) encoding, which uses seven bits to encode characters, is modal: escape sequences are used to signal a change between character sets, or modes. While not very efficient for internal representation, this method is widely used for electronic transmission such as e-mail. Also, because the seven bits used for the actual data encoding correspond to the printable ASCII character set, each kanji character can be represented as a sequence of two ASCII characters. Lunde provides a specific coding illustration (see figure on page 42) to show how JIS encoding with escape sequences works.

Shift-JIS encoding was developed by Microsoft Corporation and is implemented as the internal code on a wide assortment of platforms. In contrast to JIS, Shift-JIS (SJIS) is non-modal: if the numeric value of a character falls within a particular range (81-9F or E0-EF hex for SJIS encoding), it is treated as the first byte, and the byte that follows as the second byte, of a double-byte character. Because no escape sequences are necessary, this type of non-modal encoding is generally considered much more efficient for internal processing. (The figure on page 42 shows the same example in SJIS coding. Note that the range of the first byte falls completely within the extended set of ASCII characters, meaning that there is no standard representation, because the characters encoded in the extended ASCII set vary according to the implementation.)
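As a rough illustration of the non-modal property, the sketch below walks a Shift-JIS byte stream and uses the first-byte ranges quoted above to decide whether each character occupies one byte or two. The sample byte values spell "nihongo" (Japanese); the helper name is_sjis_lead is our own and does not come from the book.

    #include <stdio.h>

    /* Returns 1 if the byte falls in the Shift-JIS first-byte ranges
     * mentioned above (0x81-0x9F or 0xE0-0xEF), i.e. it begins a
     * double-byte character. */
    static int is_sjis_lead(unsigned char c)
    {
        return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xEF);
    }

    int main(void)
    {
        /* "Nihongo" (three kanji) in Shift-JIS, followed by plain ASCII.
         * Because SJIS is non-modal, no escape sequences are needed; in JIS
         * encoding the same text would require ESC $ B before the kanji and
         * ESC ( B before the ASCII letters. */
        unsigned char text[] = { 0x93, 0xFA, 0x96, 0x7B, 0x8C, 0xEA,
                                 'A', 'B', 'C' };
        size_t i = 0, n = sizeof text;

        while (i < n) {
            if (is_sjis_lead(text[i]) && i + 1 < n) {
                printf("double-byte character: %02X %02X\n",
                       text[i], text[i + 1]);
                i += 2;
            } else {
                printf("single-byte character: %02X ('%c')\n",
                       text[i], text[i]);
                i += 1;
            }
        }
        return 0;
    }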

EUC (Extended UNIX Code) encoding, sometimes called UNIXized JIS or UJIS, was developed as a method for processing multiple character sets in general, not just Japanese. This scheme is used as the internal code for most UNIX workstations that are set up to support Japanese. Like SJIS, EUC is a non-modal encoding method and resembles SJIS in its internal representation. It consists of four separate code sets: the primary code set (which is the ASCII character set) and three additional code sets that can be specified by the user (and are generally used for non-Roman characters).
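For a sense of how those code sets fit together in practice, here is a minimal sketch assuming the conventional Japanese assignment of the EUC code sets (ASCII in code set 0, JIS X 0208 in code set 1, half-width katakana behind the SS2 byte 0x8E, and JIS X 0212 behind the SS3 byte 0x8F). The byte values and the helper name euc_jp_char_len are our own illustration, not taken from the book.

    #include <stdio.h>

    /* Rough classifier for the start of the next character in an EUC-JP
     * byte stream: returns how many bytes that character occupies. A sketch
     * of the conventional Japanese code-set assignment, not a validator. */
    static int euc_jp_char_len(const unsigned char *p)
    {
        if (p[0] < 0x80)      /* code set 0: ASCII, one byte                */
            return 1;
        if (p[0] == 0x8E)     /* SS2 -> code set 2: half-width katakana,    */
            return 2;         /*        one byte (0xA1-0xDF) follows        */
        if (p[0] == 0x8F)     /* SS3 -> code set 3: JIS X 0212,             */
            return 3;         /*        two bytes (0xA1-0xFE each) follow   */
        return 2;             /* code set 1: JIS X 0208, two bytes,         */
                              /*        both in the range 0xA1-0xFE         */
    }

    int main(void)
    {
        /* "Nihongo" in EUC-JP (C6FC CBDC B8EC) followed by ASCII "ok". */
        const unsigned char text[] = { 0xC6, 0xFC, 0xCB, 0xDC, 0xB8, 0xEC,
                                       'o', 'k', '\0' };
        const unsigned char *p = text;

        while (*p) {
            int len = euc_jp_char_len(p);
            printf("%d-byte character starting with %02X\n", len, p[0]);
            p += len;
        }
        return 0;
    }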

Japanese input/output

Lunde goes on in the following section to discuss the software and hardware used for Japanese input, including FEPs (Front End Processors) and the various types of Japanese keyboards currently in use. One chapter focuses on Japanese output and offers a detailed analysis of both printer output and display monitor output. Processing techniques, such as code conversion algorithms and text stream handling algorithms, are also covered. Lunde provides the C source code for several routines of interest.
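One staple of such code conversion is the two-byte Shift-JIS-to-JIS transformation. The routine below is a compact sketch of that widely published algorithm, offered purely as an illustration; it is not claimed to be one of the routines printed in the book, and the expected JIS values in the comment were worked out by hand for the sample word.

    #include <stdio.h>

    /* Convert one double-byte character from Shift-JIS to its JIS value in
     * place. A sketch of the widely used conversion, not necessarily the
     * routine printed in the book. */
    static void sjis_to_jis(unsigned char *hi, unsigned char *lo)
    {
        unsigned char c1 = *hi, c2 = *lo;
        int adjust      = c2 < 0x9F;                  /* odd or even JIS row */
        int row_offset  = (c1 < 0xA0) ? 0x70 : 0xB0;  /* fold the two SJIS   */
                                                      /* first-byte ranges   */
        int cell_offset = adjust ? ((c2 > 0x7F) ? 0x20 : 0x1F) : 0x7E;

        *hi = (unsigned char)(((c1 - row_offset) << 1) - adjust);
        *lo = (unsigned char)(c2 - cell_offset);
    }

    int main(void)
    {
        /* "Nihongo" in Shift-JIS; each pair should come out as its JIS
         * value (0x467C, 0x4B5C, 0x386C). */
        unsigned char text[] = { 0x93, 0xFA, 0x96, 0x7B, 0x8C, 0xEA };
        size_t i;

        for (i = 0; i < sizeof text; i += 2) {
            unsigned char hi = text[i], lo = text[i + 1];
            sjis_to_jis(&hi, &lo);
            printf("SJIS %02X%02X -> JIS %02X%02X\n",
                   text[i], text[i + 1], hi, lo);
        }
        return 0;
    }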

Tools for Japanese text processing

The final two sections of the book contain a survey of the Japanese text processing tools currently available and a look at Japanese e-mail and network domains. The chapter on processing tools covers a wide variety of existing software: operating systems, input software, text editors, word processors, page layout software, online dictionaries, machine translation software, and terminal software.

A good starting point

Understanding Japanese Information Processing is an excellent and well-organized guide for anyone wishing to learn the basics of processing Japanese text on a computer. Lunde's writing style is clear and concise, and he includes several diagrams to aid in visualizing the various encoding methods. One feature of the book that potential developers should find particularly useful is an extensive set of appendices that contain code conversion tables, Japanese corporate character-set standards, corporate encoding methods, character lists and mapping tables, software and document sources, and mailing lists. Lunde also provides a helpful "advice to developers" section at the end of several chapters; these contain his personal recommendations and tips on localizing products for the Japanese market.

As Lunde himself notes, readers should not expect to find much information here on specific design or market issues. Developers hoping to design their own Japanese applications will need to consult other sources for information on internationalization and localization (some reference manuals of this type are mentioned in the bibliography). Also, the often problematic cultural aspects of software localization (issues such as kanji sorts and Japan's date and time formats) are not addressed in this book. Lunde's focus is not the Japanese software market, but rather the fundamentals of Japanese text processing. This book is best considered a general starting point for potential developers and others interested in Japanese information processing.

Publication information: Lunde, Ken. Understanding Japanese Information Processing. Sebastopol, CA: O'Reilly & Associates, Inc., 1993. ISBN 1-56592-043-0.

A terminology sampler

Selected terms from the glossary of Understanding Japanese Information Processing

Bitmapped font. A font whose character shapes are defined by arrays of bits.

Code position. The numeric code within an encoding method that is used to refer to a specific character.

Encoding. The correspondence between numerical character codes and the final printable glyphs.

Escape character. The control character (0x1B) that is used as part of an escape sequence. Escape sequences are used in JIS encoding to switch between one- and two-byte-per-character modes.

Escape sequence. A string of characters that contains one or more escape characters, and is used to signify a shift in mode of some sort. In the case of the Japanese character set, they are used to shift between one- and two-byte-per-character modes, and to shift between different character sets or different versions of the same character set.

JIS. Japanese Industrial Standard. The name of the standards established by JISC. Also the name of the encoding method used for the JIS X 0208-1990 and JIS X 0212-1990 character set standards.

JISC. Japanese Industrial Standards Committee. The name of the organization that establishes the JIS standards.

JIS Level 1 kanji. The name given to the 2,965 characters that constitute the first set of kanji in JIS X 0208-1990. Ordered by pronunciation.

JIS Level 2 kanji. The name given to the 3,390 characters that constitute the second set of kanji in JIS X 0208-1990. Ordered by radical, then by total number of strokes.

JIS X 0208-1990. The latest version of the document that describes the Japanese character set standard; 6,879 characters are enumerated.

Outline font. A font whose characters are described mathematically in terms of lines and curves. Often referred to as a scalable font, because its characters can be scan-converted to bitmaps of any desired size and orientation.

Wide character. A character represented by 16 bits.

In addition to providing a thorough, platform-independent discussion of Japanese text-processing issues, Lunde describes some as-yet-unsolved problems involving Japanese output and text transmission. One of the more interesting of these is the gaiji problem. Many companies and users have defined their own characters and fonts, and they run into trouble when they try to transmit these user- and corporate-defined characters to systems that do not support them. Lunde observes that while there is currently no elegant solution to this problem, a necessary step in finding one might be to "embed character data, both bitmapped and outline, into files when they are transmitted. This includes a mechanism for detecting which characters are user-defined." The first person who can offer a viable, platform-independent solution to this problem, says Lunde, will be "rewarded well by the Japanese computer industry."