Globalizing the web with XML

- by Steven Myers and Ryan Shaw -

Promising to open up a whole new world of internationalized online applications and document formats, XML (eXtensible Markup Language) has been gathering industry support and media hype at a steady clip for the last two years. What is XML? Will it free developers from the ever-present annoyances related to character encoding, multilingual data sharing, and other software internationalization issues?

While just about everyone involved in Web development has at least heard of XML and has some idea of its touted advantages over HTML, it seems that only a relative handful of developers have any clear understanding of what XML is really about or how the syntax of an XML document is structured. Because XML has yet to be fully supported by the major browsers, it has not made its way into the development mainstream, and good books or tutorials on the topic are nowhere near as plentiful as they are for most other Web-related technologies. Still, the excitement and hype surrounding XML continue to grow.

The eXtensible Markup Language, as it turns out, is not really a markup language at all, but rather a set of rules for creating markup languages. Known as a meta-language, XML is actually a subset of SGML (Standard Generalized Markup Language), which itself is a huge, all-encompassing language designed to accommodate every conceivable markup language in every possible domain, from military simulations to mathematics to obscure publishing applications (see Figure 1). Because of its sheer bulk and the difficulty of implementing a browser to handle the standard, SGML never became a practical option for moving massive numbers of documents over a network. As a result, early Web pioneers turned to HTML (actually an extremely small and specific application of SGML, but far more manageable), which quickly became the de facto standard for publishing information on the Internet.

In the process, however, Web publishers lost the ability to include descriptions of the information inside the document. HTML is a "hard-coded" language that specifies document layout and nothing else. As an interchange format, HTML is essentially useless. The main objective of XML is to bring back the freedom and flexibility of SGML without having to support every last detail of the SGML standard, much of which is not relevant to the needs of most content providers.

Why another Web language?

Although most Web developers and businesses are well aware of the current hype surrounding XML, many remain skeptical about the true value of incorporating support for yet another document format. Indeed, for developers who have spent the last four years struggling to master the idiosyncrasies of HTML and learning all the tricks and workarounds to make documents look the same in different Web browsers, the benefits of doing it all over again for yet another markup language can be far from obvious. So what exactly does XML have to offer that warrants the necessary investment of time, effort and money?

In some sense, XML is to data interchange what Java is to software development - a technology that by its very nature facilitates openness, reuse, and cross-platform compatibility. Whereas Java allows a single program to run on multiple platforms, XML is supposed to enable the transfer of data between any two systems. At the moment, content providers are constrained to the single, rigid HTML document format and must rely on either their own ingenuity or the proprietary extensions of browser vendors if they want to do anything not supported by that format. The primary purpose of XML is simply to provide a standard, widespread method for making these extensions.

The online music store

One of the key differences between HTML and XML is that the latter allows a particular document to be associated with its own document-type definition (DTD), which specifies how the document is to be structured. For example, most online music shops have a standard format for Web pages on their site that displays details for a particular CD. This format specifies the relevant information for the CD (e.g. title, artist, category, review, and price), as well as the layout (what the web customer sees in the browser). Using XML, this information would be spelled out in the DTD, and documents using the DTD would have to include specific tags for each item. Viewed in this light, HTML pages can be regarded as XML documents that all conform to a single, rigid DTD - the one for Web pages.
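As a rough illustration of the music-store example, the sketch below embeds a small DTD in a CD document and reads it with Python's standard library. The element names come from the list above (title, artist, category, review, price), but the DTD itself, the sample values, the currency attribute, and the helper names missing_fields and read_cd are assumptions made for illustration. Python's built-in parser reads the DTD but does not validate against it, so the check for required tags is done by hand here; a validating parser would enforce the DTD directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical CD document with an inline DTD. The element names follow
# the article's list; everything else is invented for illustration.
CD_DOCUMENT = """<!DOCTYPE cd [
  <!ELEMENT cd (title, artist, category, review, price)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT artist (#PCDATA)>
  <!ELEMENT category (#PCDATA)>
  <!ELEMENT review (#PCDATA)>
  <!ELEMENT price (#PCDATA)>
  <!ATTLIST price currency CDATA #REQUIRED>
]>
<cd>
  <title>Example Album</title>
  <artist>Example Artist</artist>
  <category>Jazz</category>
  <review>A made-up review for illustration.</review>
  <price currency="JPY">2500</price>
</cd>"""

REQUIRED_FIELDS = ["title", "artist", "category", "review", "price"]


def missing_fields(xml_text):
    """Return the required child tags that the root element lacks."""
    root = ET.fromstring(xml_text)
    present = {child.tag for child in root}
    return [name for name in REQUIRED_FIELDS if name not in present]


def read_cd(xml_text):
    """Return the CD's fields as a simple tag -> text mapping."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}
```

Because every detail sits in its own named tag, a program can pick out the price or the artist without knowing anything about how the page will eventually be laid out.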

Currently, most businesses engaging in online commerce over the Web store all of their important information about clients, products, and so forth in some type of relational database. Web pages are then created by first pulling the information from the database, then formatting it to conform to a particular HTML template that will be displayed to site visitors. The problem here is that the process of writing code to query the database, format the resulting information, and display details consistently in different Web browsers can be extremely tedious and prone to error.

It is therefore no accident that the structure of an XML document is very similar to that of both a database row and a software object. XML tags correspond roughly to the names of database fields and object attributes. Just as attributes can themselves be objects with their own sets of attributes, XML tags can be nested in much the same manner, allowing the author to specify object structures inside the document rather than having to convert field/value pairs to objects in Java code, then back into field/value pairs for entry into the database. Furthermore, once support for XML is built directly into the database, Web authors will be able to make a query from, say, a Java servlet, and retrieve the results as a set of XML documents already formatted for display in the browser (see sidebar - Using XML on
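The correspondence between nested tags and nested object attributes can be sketched in a few lines of Python. The record below, its field names, and the helpers to_xml and from_xml are all hypothetical; the sketch only shows that a record with a nested attribute maps onto nested tags and back again without an intermediate field/value conversion step.

```python
import xml.etree.ElementTree as ET


def to_xml(tag, value):
    """Build an element from a value; nested dicts become nested tags."""
    element = ET.Element(tag)
    if isinstance(value, dict):
        for field, field_value in value.items():
            element.append(to_xml(field, field_value))
    else:
        element.text = str(value)
    return element


def from_xml(element):
    """Inverse of to_xml: leaf elements become strings, others become dicts."""
    if len(element) == 0:
        return element.text or ""
    return {child.tag: from_xml(child) for child in element}


# Hypothetical record, as it might come back from a database query.
# The artist is itself an object with its own attributes.
cd_record = {
    "title": "Example Album",
    "artist": {"name": "Example Artist", "country": "Japan"},
    "price": "2500",
}

cd_element = to_xml("cd", cd_record)
xml_text = ET.tostring(cd_element, encoding="unicode")
```

This simple mapping assumes that field names are valid XML tag names and that no field repeats; a real schema would need to handle lists, attributes, and empty values explicitly.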


Not surprisingly, one of the major forces pushing for widespread XML adoption is the EDI (Electronic Data Interchange) community, comprising representatives from a wide variety of industry groups dedicated to the development of common formats and standards for electronic commerce and data interchange. The 1998 Tokyo XML Conference, held last December under the sponsorship of Toshiba Advanced Systems and the Asahi Shimbun, featured an in-depth seminar by David Webber, one of the co-founders of the XML/EDI Internet Group. Following an overview of the group's vision for applying XML within the existing EDI framework, Webber reviewed the progress made so far and discussed the remaining obstacles which must be overcome to make the XML/EDI vision a reality.

Although traditional EDI (in the form of various industry-specific standards for exchanging data) has proven to be highly cost-effective for the elite group of companies that have embraced it, Webber pointed out that these organizations are still few and far between. While 95% of the Fortune 1000 companies use EDI in some capacity, a global survey of companies with the capability to use EDI showed that a mere 2% are actually using it. Clearly, the larger firms would benefit greatly by convincing their smaller trading partners to also embrace the EDI standards, but the large up-front investment of time and money puts traditional EDI beyond the reach of most small- to medium-size companies.

However, by integrating XML with existing EDI standards, these larger companies can leverage the flexibility and reach of the Internet to bring the smaller enterprises into the fold. Smaller companies can use relatively inexpensive XML-enabled tools to generate documents that can be easily integrated into their larger trading partners' existing EDI processes. Both sides benefit from the resulting reduction of low-level processing costs and optimization of high-level business practices and workflows. Furthermore, the document-centric nature of XML allows for seamless integration of purchase orders, invoices, etc., with word processing and spreadsheet programs - a major time-saving advantage that would be inconceivable with traditional EDI.

The hope is that by combining XML and EDI, a unified architecture will eventually emerge throughout the enterprise, with relational databases, ERP systems, intra-nets, Web servers and EDI processes becoming a set of loosely-coupled, distributed systems using XML as the standard data format. The result would be a comprehensive solution that incorporates not only the secure, proven, and legally binding infrastructure of traditional EDI, but also the added scalability, maintainability, and ease-of-use provided by XML.

Japanese interest in XML

Unlike HTML and other traditional formats for exchanging data, the XML specification was designed from the start to be fully international, with complete support for Unicode. The importance of this design goal has not been lost on Japanese developers. The homepage for the Japan XML User Group sports a prominently-displayed quote from Jon Bosak of Sun Microsystems, who chairs the W3C XML Coordination Group: "XML is more than a technology; it is the first step in the true globalization of information and information technology. Japan is in an ideal position to take a leadership role in this movement." Indeed, Japanese companies - both venture startups and the more conservative ichi-ryuu (conglomerate) companies - have been relatively quick to jump on the XML bandwagon in announcing support for the standard and in developing XML-related tools.
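To make the Unicode point concrete, the following sketch writes the same small document in two Unicode encodings and reads both back with Python's standard library. The element names and the Japanese title are invented for illustration. The byte streams differ, but a parser recovers identical characters from each.

```python
import xml.etree.ElementTree as ET

# Hypothetical CD record with a Japanese title.
cd = ET.Element("cd")
title = ET.SubElement(cd, "title")
title.text = "日本の音楽"

# Serialize the same document in two Unicode encodings.
as_utf8 = ET.tostring(cd, encoding="utf-8")
as_utf16 = ET.tostring(cd, encoding="utf-16")

# The bytes differ, but the parser recovers the same characters from both.
title_from_utf8 = ET.fromstring(as_utf8).findtext("title")
title_from_utf16 = ET.fromstring(as_utf16).findtext("title")
```

Support for legacy, non-Unicode encodings varies from parser to parser, so this sketch deliberately sticks to UTF-8 and UTF-16, which every XML processor is required to read.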

One of these forward-looking startups, Tokyo-based Infoteria Inc., announced in late January the release of iPEX, the world's first commercial XML processing engine. "XML has high potential to describe not only documents but all kinds of computer data," says Pina Hirano, the company's CEO. "With iPEX, software developers can create XML-enabled applications much more quickly and cheaply than they could by using XML modules developed in-house from scratch."

Other companies actively pursuing XML development include Fujitsu Software Information and Toshiba Advanced Systems, which recently released a series of XML-related tools for tasks such as XML authoring, creation of DTDs, and conversion to/from XML and word-processing formats. Fuji Xerox Information Systems sponsors the SGML/XML Cafe website, which contains a wealth of Japanese-language and Japan-specific information about XML. In January, the Digital Open Network Co-op (also known as D-ONE and Digital Worker's Co-op) - a nationwide, cross-industry interest group that monitors developments in multimedia and Internet technologies - created an official business unit devoted exclusively to XML. The new group states that it will focus primarily on surveys and reports about new XML applications developed overseas and their potential uses. Another excellent repository of Japan-specific news can be found on the Japan XML User Group (JXMLUG) website.

Infoteria, Fuji Xerox, JXMLUG, and several others were out in strength for Japan's first-ever XML Developers Day. The technical conference was held March 13 in Shibuya and featured several Unicode-related seminars, one of which was presented by Martin Duerst, a key figure in the Unicode community. The conference was sponsored by ASCII and Nihon Keiei Kyokai (Japan Management Association) and chaired by Shin Murata of the Nihon Keizai Shimbun. If this sudden increase in XML-related conferences, trade shows, and industry news articles is any indication, XML adoption in Japan could well reach a critical mass by the end of this year.

Potential potholes for XML

Although the potential benefits of a unification of XML/EDI are clear, companies and developers hoping to use XML in useful, non-trivial e-commerce applications still have a tough road ahead of them. In his Tokyo seminar, Webber specified five areas in particular that need further development if an XML/EDI system is to succeed:

  1. more sophisticated tools are needed for handling XML;

  2. EDI standards organizations and companies already using EDI need to develop support for XML-based EDI;

  3. industries must develop repositories of standards particular to their field;

  4. standard templates must be developed for mapping between other data formats and XML; and

  5. intelligent agents must be developed for handling low-level system issues such as version control and transaction facilitation.

Much of this development is already underway. There are dozens of tools and several programming APIs available already for dealing with XML. In addition, several major efforts are underway in industries ranging from finance to chemical engineering to build repositories of standard data definitions. Adobe has been active in the development of the Precision Graphics Markup Language (PGML), a 2D, scalable graphic language that is supposed to be simple enough for casual users while at the same time incorporating all of the advanced features required by professional graphic artists. Another consortium is developing the Weather Observation Format (OMF) for standardization of weather observation reports. Finally, Tokyo's Linc Media (Computing Japan's parent company - Ed.) is currently at work on an XML data definition for music notation that could be used for Web-based music applications such as THETA (see

All of the major database and tool vendors have announced that their products will support XML by the end of 1999, and Microsoft and Netscape are both claiming that full XML support will be incorporated into version 5 of their respective browsers. Still, many within the industry are skeptical about the actual motives and intentions of the major software companies. After all, Microsoft and Netscape do not stand to gain much at all from an open document format, and if history is any indication, both will soon be adding their own proprietary extensions to the standard. What the XML movement needs most at the moment is a simple, practical (yet flashy) example that can penetrate even the thickest of managerial mindsets to demonstrate the potential of the language. Java received this boost in the form of applets, which came along at just the right time to create a media sensation. If XML can somehow replicate this feat, it stands a strong chance of becoming a legitimate cross-platform, cross-application data interchange format.

Steve Myers heads the Software Development Group at Linc Media. Ryan Shaw is a developer on Steve's team. They can be reached at and ryan@
