Software localization on Windows


- by Atsushi Kaneko -

 
Software localization is more complicated than mere translation of the text and messages that a user sees. Additional key considerations include internationalization of software source code and full localization of the software user interface for language, time, currency, script, and culture. Prior to starting any localization project, there are several steps you can take in order to minimize the resources required and maximize the number of languages for the localization.

For the first step - internationalization of the software source code - the main considerations are not to touch the core component during the actual localization work and to break the job down into separate, localizable components. This helps reduce the scale of resources required.

Internationalization of source code

The first task is to extract code strings into Windows resource files. This is a step in basic software development as well as in software localization. By extracting all strings that need to be translated from the core components, you don't have to touch the core itself and you can focus only on translation of the user interface (UI) message text.
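As a rough illustration of why this separation helps, the sketch below keeps every UI string in a per-language table that the core code looks up by ID. It is written in portable C rather than as a real Windows resource file, and the names in it (the IDS_ constants, the tables, and ui_string) are hypothetical, chosen only for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical string IDs; a real project would keep these in a shared header. */
enum string_id { IDS_OPEN, IDS_SAVE, IDS_QUIT, IDS_COUNT };

/* One table per language. A NULL entry means "not translated yet". */
static const char *const strings_en[IDS_COUNT] = { "Open", "Save", "Quit" };
static const char *const strings_ja[IDS_COUNT] = { "開く", "保存", NULL };

/* Return the UI string for `id` in language `lang`, falling back to English
   when the language is unknown or the entry has not been translated. */
const char *ui_string(const char *lang, enum string_id id)
{
    const char *s = NULL;

    if ((int)id < 0 || id >= IDS_COUNT)
        return "";
    if (lang != NULL && strcmp(lang, "ja") == 0)
        s = strings_ja[id];
    return s != NULL ? s : strings_en[id];
}
```

The core code only ever calls ui_string(lang, IDS_SAVE) and never contains the text itself, so a translator can fill in the Japanese table without touching the core.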

The second task is to adapt the code to country-specific conventions and formats, including date, time, and currency. This is very important but is often neglected at the programming stage. For example, whereas Americans usually write a date in the order of month, day, then year, in Japan the year comes first, followed by month and day. The Windows system provides dates according to a default convention (which is user customizable), and your application should obtain all such data from the operating system rather than hard-coding a format. Font names that appear in the UI are another example of data that needs localization, since they vary by language, character set, and Windows version.
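The idea of asking the system for the date format can be sketched with the standard C library alone. This is only a portable illustration, not the Windows-specific route (on Windows, the Win32 call GetDateFormat plays a similar role); the function names format_local_date and use_user_locale are made up for this example.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <time.h>

/* Adopt the user's environment settings instead of the default "C" locale.
   Returns 1 on success, 0 if the environment names an unavailable locale. */
int use_user_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}

/* Format a date according to the currently selected locale.
   Returns the number of bytes written, or 0 if the buffer is too small. */
size_t format_local_date(char *buf, size_t size, const struct tm *when)
{
    assert(buf != NULL && when != NULL);
    /* "%x" asks for the locale's own date representation, so the code
       never hard-codes month/day/year or year/month/day order. */
    return strftime(buf, size, "%x", when);
}
```

Under the default "C" locale, "%x" produces month/day/year (for example 03/05/24 for March 5, 2024). After use_user_locale() succeeds, the same call follows whatever convention the user has configured.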

The third task is to make the string buffers long enough to store the translated strings, or to allocate sufficient space at runtime. Strings are often longer after translation than the original (English) strings. If a translated message does not fit into its buffer, it will either be truncated or overflow the buffer, which corrupts adjacent memory and can crash the application. One good approach is to treat the buffer length as a variable computed at runtime, or to make the buffer longer than the longest expected translation.
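One way to size a buffer at runtime is to measure the formatted message first and allocate exactly what it needs. The sketch below assumes a C99-conforming vsnprintf (older Microsoft runtimes offered only _vsnprintf, which behaves differently), and format_message is a hypothetical helper name.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Format a message into a heap buffer sized to fit the result.
   The caller frees the returned pointer; NULL is returned on failure. */
char *format_message(const char *fmt, ...)
{
    va_list args, args_copy;
    int needed;
    char *buf = NULL;

    assert(fmt != NULL);
    va_start(args, fmt);
    va_copy(args_copy, args);

    /* First pass: a size of 0 writes nothing and reports the length needed. */
    needed = vsnprintf(NULL, 0, fmt, args_copy);
    va_end(args_copy);

    if (needed >= 0) {
        buf = malloc((size_t)needed + 1);      /* +1 for the terminator */
        if (buf != NULL)
            vsnprintf(buf, (size_t)needed + 1, fmt, args);
    }
    va_end(args);
    return buf;
}
```

Because the length is computed from the translated format string itself, a longer translation simply produces a larger allocation instead of a truncated or overflowing message.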

The fourth task is to provide for the handling of Asian (multibyte) character sets. Asian character sets such as Japanese, Korean, and traditional or simplified Chinese are generally referred to as multibyte character sets (or two-byte or double-byte sets) since they use more than one byte to store most characters. Why is this? It's a simple question of quantity. Single-byte (Western) character sets need to encode no more than a few dozen letters, plus digits and punctuation, to provide basic functionality, and these fit comfortably into the 256 values of one byte. In contrast, the complicated kanji characters (ideograms) used by Japanese (originally adopted from Chinese) run to 40,000 and more (counting highly specialized characters - a mere 2,000 are sufficient for reading a daily newspaper), and a single 8-bit byte is obviously inadequate to encode even a fraction of them.

Unicode vs. MBCS (Multibyte character set)

The design of Unicode is based on the simplicity and consistency of ASCII, but goes far beyond ASCII's limited ability to encode only the Latin alphabet. The Unicode standard provides sufficient capacity to encode all characters used for the written languages of the world. It was originally designed as a 16-bit encoding with code points for more than 65,000 characters, and to keep character coding simple and efficient it does not use complex modes or escape codes. The standard has since been extended to more than one million code points: in the UTF-16 form that Windows uses, most characters in everyday use still occupy a single 16-bit value, while the remainder are stored as a pair of 16-bit values called a surrogate pair.

Enabling MBCS with Visual C/C++

To enable MBCS when programming with Visual C/C++, it is necessary to understand the concept of "code page". A code page is a character set that includes numbers, punctuation marks, and other glyphs. Different languages and locales may use different code pages. For example, ANSI code page 1252 is used for American English, and OEM code page 932 is used for Japanese kanji & hiragana (hiragana is one of the two phonetic character sets that Japanese uses). A code page can be represented in a table as a mapping of characters to single or multibyte values. Many code pages share the ASCII character set for characters in the range 0x00 - 0x7F.

The Microsoft runtime library uses the following types of code pages:

* system-default ANSI code page,

* locale code page, and

* multibyte code page.

System-default ANSI code page: at startup, the runtime system automatically sets the multibyte code page to the system-default ANSI code page, which it obtains from the operating system.

Locale code page: the behavior of a number of runtime routines depends on the current locale setting, which includes the locale code page. By default, all locale-dependent routines in the Microsoft runtime library use the code page that corresponds to the "C" locale. At runtime you can change or query the locale code page in use with a call to setlocale.

Multibyte code page: the behavior of most of the multibyte-character routines in the runtime library depends on the current multibyte code page setting. By default, these routines use the system-default code page. At runtime you can query and change the multibyte code page with _getmbcp and _setmbcp, respectively.

Most multibyte-character routines (_ismb, _mbs and _mbc routines) in the Microsoft runtime library recognize double-byte character sequences according to the current multibyte code page setting, although some multibyte-character routines depend on locale code page. Because these runtime libraries obtain and use the value of the operating system code page (system default code page), you don't have to set the multibyte code page unless you need to change it to a value other than the system default.
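Querying and changing the locale with setlocale can be sketched in standard C as follows. The helper names are hypothetical, and the sketch deliberately leaves out the Microsoft-specific _getmbcp and _setmbcp calls so that it compiles with any C runtime.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Return the name of the locale currently used for character handling.
   Passing NULL to setlocale queries the setting without changing it. */
const char *current_ctype_locale(void)
{
    const char *name = setlocale(LC_CTYPE, NULL);
    return name != NULL ? name : "";
}

/* Report whether character handling still uses the startup "C" locale. */
int using_c_locale(void)
{
    return strcmp(current_ctype_locale(), "C") == 0;
}

/* Try to switch character handling to the locale called `name`.
   Returns 1 on success; on failure returns 0 and the locale is unchanged. */
int try_set_ctype_locale(const char *name)
{
    assert(name != NULL);
    return setlocale(LC_CTYPE, name) != NULL;
}
```

A program that has not called setlocale starts in the "C" locale, so using_c_locale() reports true until a call such as try_set_ctype_locale("") adopts the user's settings. Locale names themselves differ between platforms, so check the return value rather than assuming a given name exists.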

Avoiding errors in handling multibyte characters

Most errors in handling multibyte characters involve the trailing byte, i.e. the second byte of a double-byte character, being mistaken for an independent ASCII character. Because the trailing byte range overlaps the ASCII code range, it is impossible to identify a trailing byte in a string without knowing whether the byte before it is a leading byte. In other words, when examining a stored string, you must find the leading byte first and then treat the byte that follows it as the trailing byte. To find a leading byte, use IsDBCSLeadByte(); you can then either skip the trailing byte that follows or advance the pointer with CharNext().
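A classic instance of this error is searching a Japanese path for a backslash. In Shift-JIS (code page 932) the kanji "表" is stored as the bytes 0x95 0x5C, and 0x5C is also the ASCII backslash. The sketch below hard-codes the Shift-JIS leading byte ranges so that it runs anywhere; on Windows you would call IsDBCSLeadByte() instead. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Leading-byte ranges of Shift-JIS (code page 932): 0x81-0x9F, 0xE0-0xFC. */
int sjis_is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Find the first single-byte character `target` in `s`, stepping over
   double-byte characters so that a trailing byte is never matched.
   Returns a pointer to the match, or NULL if there is none. */
const char *sjis_find_char(const char *s, char target)
{
    assert(s != NULL);
    while (*s != '\0') {
        if (sjis_is_lead_byte((unsigned char)*s)) {
            if (s[1] == '\0')
                break;          /* truncated character at end of string */
            s += 2;             /* skip leading byte and trailing byte */
            continue;
        }
        if (*s == target)
            return s;
        s++;
    }
    return NULL;
}
```

A plain byte search such as strchr would report a backslash inside "表", whereas sjis_find_char skips the pair and finds only the real path separator.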

The IsDBCSLeadByte() call tests whether the specified byte is in the leading byte range of the default code page. If a string may include multibyte characters, use this API to look for the first leading byte, and always start testing at the beginning of the string. The CharNext() call advances the pointer by one character width, while the CharPrev() call moves the pointer back by one character width.
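To show why scanning must start at the beginning of the string, here is a portable sketch of what pointer stepping involves for Shift-JIS text. The functions sjis_char_next and sjis_char_prev are hypothetical stand-ins that mirror the argument order of CharNext() and CharPrev(); they are not the Windows implementations.

```c
#include <assert.h>
#include <stddef.h>

/* Leading-byte ranges of Shift-JIS (code page 932): 0x81-0x9F, 0xE0-0xFC. */
static int sjis_is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Advance by one character (one or two bytes), never past the terminator. */
const char *sjis_char_next(const char *p)
{
    assert(p != NULL);
    if (*p == '\0')
        return p;
    if (sjis_is_lead_byte((unsigned char)*p) && p[1] != '\0')
        return p + 2;
    return p + 1;
}

/* Step back one character from `cur`. The walk begins at `start` because
   a byte cannot be classified without the bytes that come before it. */
const char *sjis_char_prev(const char *start, const char *cur)
{
    const char *p, *prev;

    assert(start != NULL && cur != NULL && cur >= start);
    p = prev = start;
    while (p < cur && *p != '\0') {
        prev = p;               /* remember where this character began */
        p = sjis_char_next(p);
    }
    return prev;
}
```

Stepping backward costs a scan from the start of the string, which is why code that walks a string forward once is usually preferable to code that moves back and forth.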

Sample case 1

First, obtain all the Windows resource files from the English software development environment. Next, translate the resource files and put them back into the English environment. Then, rebuild the software from scratch, so that the translated resource files can be compiled and linked with other object files to make localized executable files. In this case, since you can use the original development environment, you don't have to change the original development structure and you can simply copy the original build environment.

Sample case 2

Again, first obtain the Windows resource files from the English software development environment and, as before, translate them. Next, however, compile the translated Windows resource files into binary files, and then bind those resource binaries into the executable files.

In this second case, you don't have to keep the original development environment: you can remove all .CPP and .OBJ files, so you need only the Windows resource files and the associated makefiles. Other scenarios are possible. For example, you can build resource-only DLLs so that the UI and strings are physically separate from the core components. You then translate the Windows resource files, compile them into binary files, create resource DLLs from the translated binaries, and finally replace the resource DLLs in the original English product with the translated ones. This last approach might be the simplest way to do the localization, since you don't need the development environment.

Using localization tools to translate Windows resource files

Now, you have come to the final stage where you can simply focus on the Windows resource files. At this stage, you can translate strings in the Windows resource files directly, and a tool like Microsoft Developer Studio can do the button size or dialog box size adjustment.

However, there is a more efficient way to translate Windows resources: using the localization tools that are now available. With a localization tool, you can extract only the strings from the Windows resource files and build a simple database (table) of those strings, so that translators can enter the translated text in a translation column next to the English string column. The translated strings can then be placed back into the resource files easily. The localization tool also lets you visually resize buttons on dialog boxes or re-format an entire dialog box.
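The extraction step that such a tool performs can be sketched as splitting each string entry into an identifier and its text. The line layout assumed here (an identifier followed by a quoted string, as in a STRINGTABLE block) and the function name parse_string_entry are assumptions made for this example; real tools also handle escaped quotes, dialogs, and menus.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Split one string-table line, e.g.   IDS_OPEN  "Open file"
   into its identifier and its text. Returns 1 on success, 0 if the line
   does not match or a field does not fit its buffer.
   Assumption: the text contains no embedded or escaped quotes. */
int parse_string_entry(const char *line,
                       char *id, size_t id_size,
                       char *text, size_t text_size)
{
    const char *p, *id_end, *open, *close;
    size_t id_len, text_len;

    assert(line != NULL && id != NULL && text != NULL);
    p = line;
    while (isspace((unsigned char)*p))
        p++;                                    /* skip indentation */
    id_end = p;
    while (*id_end != '\0' && *id_end != ',' && *id_end != '"' &&
           !isspace((unsigned char)*id_end))
        id_end++;                               /* end of identifier */
    open = strchr(id_end, '"');
    close = (open != NULL) ? strchr(open + 1, '"') : NULL;
    if (id_end == p || open == NULL || close == NULL)
        return 0;

    id_len = (size_t)(id_end - p);
    text_len = (size_t)(close - open - 1);
    if (id_len >= id_size || text_len >= text_size)
        return 0;
    memcpy(id, p, id_len);
    id[id_len] = '\0';
    memcpy(text, open + 1, text_len);
    text[text_len] = '\0';
    return 1;
}
```

Each successful call yields one row for the translators' table: the identifier as the key and the English text as the source column, with the translation column left to be filled in.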

The main purposes of using localization tools are to make the task allocation between software engineers and translators clear, to share the glossary with other translations, and to do localization engineering concurrent with the core code development.

Today, most vendors are planning to release localized software simultaneously or very close to the English release. To achieve this, you have to start the localization work at a very early stage of the initial software development period, and localization must be conducted concurrently with the core development.

Atsushi Kaneko is localization manager at Autodesk in the US, and helps introduce basic software localization techniques to companies who are planning to start Japanese localization projects. Contact him c/o:editors@cjmag.co.jp


