The FTAG Model for Creating Fault-tolerant Software



This month, the focus is on an ongoing collaborative research effort between prominent computer scientists from the Tokyo Institute of Technology and the University of Arizona to implement a model for designing more robust software.

by Steven Myers

One of the chief goals of software engineers who work on mission-critical systems is to write code that is as robust as possible. Certain types of software systems simply must work properly at all times, no matter what happens. In order to aid software designers in the creation of fault-tolerant software, Professors Takuya Katayama of the Tokyo Institute of Technology (TITech) and Rick Schlichting of the University of Arizona have undertaken a joint research project aimed at implementing a new attribute-based programming model, called FTAG.

The collaboration started in 1990, when Prof. Schlichting began a sabbatical at TITech, funded by a grant from the US National Science Foundation. A follow-on grant from the NSF supported exchanges over the next two years, and more recent support for the project has come from the US Office of Naval Research. Currently, most of the actual technical work is going on at TITech and the Japan Advanced Institute of Science and Technology (JAIST, where Prof. Katayama is also an active faculty member).

The FTAG model

In the FTAG model, a program is written as a series of module decompositions, with provisions for redoing and replicating modules to meet fault-tolerance requirements. In simple terms, the program starts with one main "top" module, which is broken down into smaller modules in a recursive, tree-like fashion. The model consists primarily of two parts: type definitions and module definitions. FTAG has essentially the same set of primitive types found in traditional programming languages, such as C and Pascal, and it supports type constructors that can be used to build more complex types, such as arrays and records.
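The recursive, tree-like decomposition described above can be illustrated with a short Python sketch. This is not FTAG notation, which this article does not show. The names sum_module, Attrs, and trace are hypothetical, chosen only to show a top module breaking down into child modules whose output attributes are combined by the parent.

```python
from typing import Any, Dict, List

Attrs = Dict[str, Any]  # hypothetical: a module's named attribute values


def sum_module(inputs: Attrs, trace: List[str], depth: int = 0) -> Attrs:
    """Hypothetical module that sums inputs["values"] by decomposition."""
    values = inputs["values"]
    # Record each node of the computation tree, indented by its depth.
    trace.append("  " * depth + f"sum_module({values})")

    # Base case: the module is small enough to compute its output directly.
    if len(values) <= 1:
        return {"total": values[0] if values else 0}

    # Decomposition: split the work between two child modules and pass
    # each child only the input attributes it needs.
    mid = len(values) // 2
    left = sum_module({"values": values[:mid]}, trace, depth + 1)
    right = sum_module({"values": values[mid:]}, trace, depth + 1)

    # The parent combines the output attributes of its children.
    return {"total": left["total"] + right["total"]}
```

Calling sum_module({"values": [1, 2, 3, 4]}, trace) with an empty list for trace fills trace with one entry per node of the tree, with the top module first and its descendants indented beneath it.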

The fault-tolerance features of FTAG include built-in functions for redoing, replication, and stable-object access. The redoing function replaces a portion of the computation tree with a new computation. This is used as a mechanism for replacing a part of a computation that has failed.

With each computation, a set of attribute values is stored that can be tested to determine the validity of the computation. If a failure is detected in a certain module, then the entire execution starting at that module is discarded and recomputed. (This action does not affect the execution of other program modules.)
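As a rough illustration of redoing, the following Python sketch tests a module's output attributes and, if the test fails, discards the result and recomputes that module from the same inputs. Everything here is an assumption made for illustration: run_with_redo, flaky_sum, steady_sum, the validity test, and the retry limit are hypothetical and are not taken from FTAG itself.

```python
from typing import Any, Callable, Dict, Tuple

Attrs = Dict[str, Any]  # hypothetical: a module's named attribute values


def run_with_redo(
    module: Callable[[Attrs, int], Attrs],
    inputs: Attrs,
    is_valid: Callable[[Attrs], bool],
    max_attempts: int = 3,  # assumption: the article gives no retry limit
) -> Tuple[Attrs, int]:
    """Run a module, redoing it while its output attributes fail the test."""
    for attempt in range(1, max_attempts + 1):
        outputs = module(inputs, attempt)
        if is_valid(outputs):
            return outputs, attempt
        # Invalid result: discard it and recompute from the same inputs.
    raise RuntimeError(f"module still invalid after {max_attempts} attempts")


def flaky_sum(inputs: Attrs, attempt: int) -> Attrs:
    """Simulates a failure by returning a corrupted total on the first try."""
    if attempt == 1:
        return {"total": -1}
    return {"total": sum(inputs["values"])}


def steady_sum(inputs: Attrs, attempt: int) -> Attrs:
    """A module that always computes the correct total."""
    return {"total": sum(inputs["values"])}


def non_negative(outputs: Attrs) -> bool:
    """Validity test on the output attribute (inputs here are non-negative)."""
    return outputs["total"] >= 0


def top(inputs: Attrs) -> Attrs:
    """Top module: decomposes into two children, each guarded by redo."""
    values = inputs["values"]
    mid = len(values) // 2
    left, left_tries = run_with_redo(
        flaky_sum, {"values": values[:mid]}, non_negative
    )
    right, right_tries = run_with_redo(
        steady_sum, {"values": values[mid:]}, non_negative
    )
    return {
        "total": left["total"] + right["total"],
        "attempts": (left_tries, right_tries),
    }
```

In this sketch only the failed child is recomputed; its sibling runs once and is left alone, which mirrors the point that redoing one module does not affect the execution of the others.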

Replication enables copies of an FTAG module to be created and executed in parallel, providing backups in the event of a failure in the execution of one of the modules. The stable-object access feature, meanwhile, provides a means for determining which attribute values are important enough to be stored somewhere other than in main memory, so that they can be retrieved if the reconstruction of a computation becomes necessary.
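Replication and stable-object access can be sketched in the same spirit. In the Python code below, replicate runs several copies of a module in parallel threads and returns the first result that completes without error, while StableStore writes selected attribute values to a file so that they outlive main memory. Both are hypothetical stand-ins: the use of threads, a JSON file, and the names replicate, StableStore, and square are assumptions, not features of FTAG.

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Dict

Attrs = Dict[str, Any]  # hypothetical: a module's named attribute values


def replicate(
    module: Callable[[Attrs, int], Attrs], inputs: Attrs, copies: int = 3
) -> Attrs:
    """Run `copies` replicas in parallel and return the first success."""
    errors = []
    with ThreadPoolExecutor(max_workers=copies) as pool:
        futures = [pool.submit(module, inputs, i) for i in range(copies)]
        for future in as_completed(futures):
            try:
                return future.result()
            except Exception as exc:  # this replica failed; try the others
                errors.append(exc)
    raise RuntimeError(f"all {copies} replicas failed: {errors}")


class StableStore:
    """Keeps chosen attribute values in a file instead of main memory."""

    def __init__(self, path: str) -> None:
        self.path = path

    def load_all(self) -> Attrs:
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as handle:
            return json.load(handle)

    def save(self, name: str, value: Any) -> None:
        data = self.load_all()
        data[name] = value
        # Write to a temporary file first, then swap it into place, so a
        # crash during the write does not corrupt the existing copy.
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w") as handle:
            json.dump(data, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, self.path)


def square(inputs: Attrs, replica_id: int) -> Attrs:
    """Example module; replica 0 simulates a crash."""
    if replica_id == 0:
        raise RuntimeError("simulated crash in replica 0")
    return {"result": inputs["x"] ** 2}
```

Here replicate(square, {"x": 7}) still produces a result even though one replica fails, and a value saved through StableStore can be read back by a fresh StableStore object opened on the same path, as would be needed when reconstructing a computation.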

The advantages of FTAG

FTAG offers a number of advantages for writing fault-tolerant software. Programs are static and declarative in nature, making this type of software easier to understand and to build incrementally. Also, syntactic and semantic definitions are kept completely separate, contributing to program readability. Finally, programs in FTAG exhibit a high degree of locality: information is passed between functions only through attributes, and only between functions that have a parent/child relationship in the execution tree.

FTAG is well-suited to implementation on a loosely-coupled multi-processor system, such as a cluster of workstations. These systems are of special interest to designers of fault-tolerant software because they consist of multiple processors with independent failure modes, and are thus more prone to partial failures than are traditional systems.

Execution in FTAG depends only on the presence of certain attribute values, so a simple scheme can be used for allocating module decompositions to the processors. A node in the computation tree is assigned to a processor upon creation, with that processor being responsible for all communication between the node and its children. All nodes can be executed in parallel.
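The article does not say which allocation scheme is used, so the Python sketch below assumes the simplest one imaginable: round-robin assignment at node creation. The names Node, Allocator, and links are hypothetical. The sketch shows each node receiving a processor when it is created, and lists the parent-to-child links that the parent's processor would be responsible for.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Optional, Tuple


@dataclass
class Node:
    """A node in the computation tree, bound to one processor."""
    name: str
    processor: int
    children: List["Node"] = field(default_factory=list)


class Allocator:
    """Assigns each new node to a processor (assumed round-robin policy)."""

    def __init__(self, num_processors: int) -> None:
        self.num_processors = num_processors
        self._created = count()  # number of nodes created so far

    def create(self, name: str, parent: Optional[Node] = None) -> Node:
        # The processor is fixed at creation time and never changes.
        node = Node(name, next(self._created) % self.num_processors)
        if parent is not None:
            parent.children.append(node)
        return node


def links(node: Node) -> List[Tuple[str, int, str, int]]:
    """List (parent, parent_cpu, child, child_cpu) for every tree edge.

    The parent's processor handles communication on each listed edge.
    """
    result = []
    for child in node.children:
        result.append((node.name, node.processor, child.name, child.processor))
        result.extend(links(child))
    return result
```

For example, with three processors, creating a top node, two children a and b, and a grandchild c under a places the four nodes on processors 0, 1, 2, and 0 respectively. Because attribute values travel only along parent/child edges, the list returned by links(top) describes all of the inter-node communication in the tree.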

Future work on the project will focus on implementing various fault-tolerant paradigms using the FTAG framework, and investigating the features needed to realize each paradigm. According to Prof. Katayama, the next step in the project will involve the programming of practical applications using the model to test the true benefits of the FTAG approach.

For more information about the FTAG project, send e-mail to Prof. Schlichting (rick@cs.arizona.edu) or Prof. Katayama (katayama@cs.titech.ac.jp).

Many readers are no doubt familiar with Prof. Schlichting through his JapanCS project, which is a completely separate activity from that described here. The goal of the JapanCS project is to help make research results in Japanese computing and computer science more accessible to people outside Japan. Schlichting and his students operate a Usenet newsgroup called comp.research.japan and maintain an electronic archive at cs.arizona.edu (accessible via anonymous ftp and the Web).