Fault Tolerance with FAWS
Presenting a Client-Transparent Fault Tolerant System for SOAP-based Web services
The e-business community uses web services for its critical activities and expects those services to remain fully available in the presence of failures. However, most existing fault tolerant systems for web services do not transparently handle requests whose processing was in progress when the failure occurred. This article describes a new scheme for providing client-transparent fault tolerance for SOAP-based web services, using a portable fault tolerant system called FAWS.
Fault Tolerance and its Techniques
To understand the role of fault tolerance, it is important to first understand what it means for a system to tolerate faults, and, thereby, be dependable. Dependability is a global concept that encapsulates the following attributes:
- Availability indicates that a system is ready to be used immediately. In general, it refers to the probability that the system is operating correctly at any given moment and is available to perform its functions on behalf of its users. In other words, a highly available system is one that will be working most times.
- Reliability indicates that a system can run continuously without failure. In contrast to availability, reliability is defined in terms of a time interval instead of an instant in time. A highly reliable system is one that will continue to work without interruption over a relatively long period of time. This is a subtle, yet important, difference when compared to availability.
- Safety refers to the probability that a system is either functioning properly or is in a safe failed state. A system in a safe failed state will not cause any major disruption. For example, process control systems, such as those used for controlling nuclear power plants and manned space vehicles, are required to provide a high degree of safety. If such control systems fail even for a brief moment, the effects could be disastrous.
- Maintainability refers to how easily a failed system can be repaired. A highly maintainable system may also show a high degree of availability, especially if failures can be detected and repaired automatically. That said, automatic recovery from failures is difficult to achieve in practice. Often, dependable systems are also required to provide a high degree of security.
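The difference between availability and reliability described above can be made concrete with two standard textbook formulas that are not part of the original article: steady-state availability, MTBF / (MTBF + MTTR), and, assuming a constant failure rate, reliability over an interval, exp(-t / MTBF). The sketch below uses purely hypothetical numbers for a system that fails about once an hour but recovers within a second.

```python
import math


def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def reliability(interval_hours: float, mtbf_hours: float) -> float:
    """Probability of running for the whole interval without failure,
    assuming a constant failure rate (exponential model)."""
    return math.exp(-interval_hours / mtbf_hours)


# Hypothetical system: fails about once an hour, recovers in one second.
mtbf = 1.0
mttr = 1.0 / 3600.0

availability = steady_state_availability(mtbf, mttr)
one_day_reliability = reliability(24.0, mtbf)

print(f"availability:        {availability:.5f}")
print(f"24-hour reliability: {one_day_reliability:.2e}")
```

Under these assumptions the system is highly available at any given instant, yet it is very unlikely to run for a full day without interruption, which is exactly the distinction the two definitions draw.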
When designing a dependable computing system (a fault tolerant system), it is important to understand the definitions of failure, error, and fault:
- Failure occurs when the delivered service of a system or a component deviates from its specification.
- Error is that part of the system state that is liable to lead it to failure. An error affecting the service is an indication that a failure is occurring (or has occurred).
- Fault is the adjudged or hypothesized cause of an error.
Fault tolerance requires some form of redundancy, which is the use of additional information, resources, or time beyond what is needed for normal system operation. There are two types of redundancy - time redundancy and space redundancy. Time redundancy is based on using extra execution time, whereas space redundancy is based on using extra physical resources, such as extra processors, memory, disks, or communication links. A classic example of space redundancy is process replication.
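As a rough illustration of the two kinds of redundancy, the sketch below contrasts retrying one operation (time redundancy) with trying several replicas in turn (space redundancy). The function names and the use of `ConnectionError` to signal a fault are assumptions made for this example only.

```python
from typing import Callable, Sequence


def with_time_redundancy(operation: Callable[[], str], attempts: int) -> str:
    """Retry the same operation on the same resource (extra execution time)."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except ConnectionError as error:
            last_error = error
    raise ConnectionError(f"all {attempts} attempts failed") from last_error


def with_space_redundancy(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in turn (extra physical resources)."""
    last_error = None
    for replica in replicas:
        try:
            return replica()
        except ConnectionError as error:
            last_error = error
    raise ConnectionError("all replicas failed") from last_error


# Hypothetical operation that suffers two transient faults, then succeeds.
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient fault")
    return "ok"


def dead_replica() -> str:
    raise ConnectionError("replica down")


def healthy_replica() -> str:
    return "ok from backup"


time_result = with_time_redundancy(flaky, attempts=3)
space_result = with_space_redundancy([dead_replica, healthy_replica])
print(time_result, "|", space_result)
```

Time redundancy only helps with transient faults, whereas space redundancy can also mask a permanently failed resource, at the cost of extra hardware.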
For server processes, space redundancy is usually achieved through replication, so providing fault tolerance in practice means providing replication. There are two well-known process replication schemes - active replication and passive replication.
- Active Replication

Figure 1: Active Replication with "voter"
- Passive Replication - Primary Backup

Figure 2: Simple Primary Backup System with State Update
As shown in the figure, the client sends a request to the primary server. The primary server does its work and sends the resulting state update to the backup server. If the backup server applies the update successfully, it sends an acknowledgement to the primary server. After receiving the acknowledgement, the primary server sends a reply to the client. If the primary server goes down, all requests are forwarded to the backup server. Since both servers maintain the same state, the backup server's state does not need to be updated at failover.
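The primary-backup exchange just described can be sketched in a few lines. This is a minimal in-memory model, not FAWS code: the class names, the key-value state, and the boolean acknowledgement are all assumptions made for illustration.

```python
class BackupServer:
    """Holds a copy of the primary's state and acknowledges each update."""

    def __init__(self) -> None:
        self.state = {}

    def apply_update(self, key: str, value: int) -> bool:
        self.state[key] = value
        return True  # acknowledgement sent back to the primary

    def handle(self, key: str, value: int) -> str:
        # After failover the backup serves requests directly.
        self.state[key] = value
        return f"stored {key} (served by backup)"


class PrimaryServer:
    """Processes a request, updates the backup, then replies to the client."""

    def __init__(self, backup: BackupServer) -> None:
        self.state = {}
        self.backup = backup
        self.alive = True

    def handle(self, key: str, value: int) -> str:
        if not self.alive:
            raise ConnectionError("primary is down")
        self.state[key] = value                      # do the work
        if not self.backup.apply_update(key, value):  # state update + ack
            raise RuntimeError("backup did not acknowledge the update")
        return f"stored {key}"                       # reply only after the ack


backup = BackupServer()
primary = PrimaryServer(backup)

reply = primary.handle("balance", 100)
state_matches = primary.state == backup.state

# Simulate a primary failure: requests now go to the backup, whose state
# is already up to date.
primary.alive = False
failover_reply = backup.handle("balance", 120)
print(reply, "|", failover_reply)
```

Because the primary replies only after the acknowledgement arrives, any update the client has seen confirmed is guaranteed to exist on the backup as well.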
In passive replication without a state update scheme, the primary does not update the backup regularly. The backup must either perform a warm start or run a stateless process. When the primary server goes down, the backup server comes into action.
Requirement of a Fault Tolerance System for Web Services
Web services often suffer from long response times or temporary non-availability. For certain web-based applications, like online banking, stock trading, reservation processing and shopping, such outages are unacceptable. Fault tolerance techniques can be used to address these problems.
SOAP has become the most popular technology for developing web services applications. But SOAP-based web services lack fault tolerance support when they are used in applications like the ones mentioned above. The major problems faced by clients that use SOAP to access a web service are service non-availability and request resending. When a client sends a request and the server fails while the request is being processed, the client is disconnected and does not receive the desired response. Instead, the client receives either a connection timeout exception or a socket closed exception. The client then has to resend the SOAP request to access the service. Such undesirable situations can be prevented by using a fault tolerant system for SOAP-based web services.
FAWS has been developed to address the non-availability of SOAP-based web services while remaining transparent to clients. FAWS comes with a fault detector that detects server and service failures so that the web service stays available to clients when they occur. FAWS works by routing SOAP requests to the standby secondary server when the fault detector detects a failure in the primary server.
FAWS is based on a distributed object architecture. It consists of several components that operate independently of each other. FAWS guarantees that the "fault-tolerated" system continues to work during a single component failure. Further, the distributed architecture lowers the risk of complete system failure.
FAWS' front end acts as the web server and receives SOAP requests. When a request is received, it is forwarded to the primary web server. The response from the server is sent to the respective client. The front end logs each received request before it is sent to the primary server, and this frees the clients from having to resend the request in case of a primary failure (while the request is being processed). Since clients only need be aware of FAWS' front-end address (to access the web service), they needn't bother about the underlying primary and secondary servers.
FAWS Architecture
Figure 3 presents an overview of the high-level object architecture of FAWS. The figure highlights how the four major components of the system - FT-Front, FT-Admin, FT-Detector and FT-Monitor - interact with the client and with the primary and backup servers. Each of these components works independently and communicates with one or more other components in the system (for example, FT-Detector communicates with FT-Admin).

Figure 3: FAWS Architecture
- FT-Front - FT-Front is the front end of FAWS. It is the only component of the system that interacts with both the client and the primary web server. FT-Front listens for SOAP requests on a specific port set by the administrator. All web services under FAWS are published with the IP address and port number of FT-Front. The client sees FT-Front as the web server that provides the web services. This mechanism hides the primary and redundant backup servers from the client. The state management of clients is also handled by FT-Front. FT-Front uses Remote Method Invocation (RMI) to communicate with FT-Admin. FT-Admin starts up FT-Front by providing configuration data such as the IP address of the primary web server and the maximum number of resends per request. Also, since FAWS maintains a message log, FT-Front can fail over to a new primary server dynamically when it is notified by FT-Admin.
- FT-Admin - FT-Admin is the core of the system. It can be used to manage the fault tolerance system as a single system and to manage applications as if they were running on a single server. FT-Admin communicates constantly with FT-Detector and FT-Front to provide uninterrupted service. FT-Admin provides two services - Replication Management and Configuration Management. Replication management is responsible for maintaining replicated servers and for failover operations, while configuration management is responsible for system initialization and configuration changes.
- FT-Detector - FT-Detector detects both software failures (process failures) and hardware failures (machine failures) and notifies FT-Admin accordingly. A running web server listens on a specific port, so by checking whether that port is accepting connections, FT-Detector can decide whether the web server process is up or down. FT-Detector tracks machine failures using ICMP. It sends ICMP echo requests to each machine periodically and waits a certain time for a reply (the checking period can be set using FT-Monitor; the default interval is 1 second). If no reply arrives, it resends the ICMP request and waits again. If the resent request also goes unanswered, FT-Detector decides that the machine has failed. FT-Detector and FT-Admin communicate using RMI.
- FT-Monitor - FT-Monitor is the main graphical user interface of FAWS. It provides two major functionalities to the system administrator (or the system user). It provides the facility to set the system configuration at the initial stage as well as at run time, and provides system status and graphical representation of distributed FAWS components.
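The two checks attributed to FT-Detector above can be sketched as follows. This is an illustrative approximation, not the FAWS implementation: `port_is_open` and `is_failed` are hypothetical names, and because sending real ICMP echo requests normally needs raw-socket privileges, the machine check takes the probe as a callable so that any reachability test can be plugged in.

```python
import socket
from typing import Callable


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def is_failed(probe: Callable[[], bool], resends: int = 1) -> bool:
    """Declare failure only if the first probe and every resend go unanswered."""
    for _ in range(1 + resends):
        if probe():
            return False
    return True


# Demonstration against a throwaway local listener (not a real web server).
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]

up_while_listening = port_is_open("127.0.0.1", port)
listener.close()
up_after_close = port_is_open("127.0.0.1", port)

# Scripted probe: the first echo goes unanswered, the resend is answered.
answers = iter([False, True])
machine_failed = is_failed(lambda: next(answers), resends=1)

print(up_while_listening, up_after_close, machine_failed)
```

Resending once before declaring a failure trades a slightly longer detection time for fewer false alarms caused by a single lost packet.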
FAWS is capable of handling multiple clients. For simplicity, we look only at how FAWS operates with one client (see Figure 4). The numbers in the figure correspond to the steps in the process.

Figure 4: Operation of FAWS
The following is a step-by-step description of the process:
- Client sends the SOAP request to FT-Front.
- FT-Front accepts the client, creates a separate thread to handle the client, and starts the client thread.
- Client thread running in FT-Front logs the message.
- Client thread gets the current primary IP address and forwards the request to primary server.
- Primary server processes the request and sends the response to the respective client thread in FT-Front.
- Client thread sends the received response from the server to the client.
- Client thread removes the logged SOAP request and terminates.
- FT-Detector detects a failure (either machine failure or web server process failure) in primary.
- FT-Detector notifies FT-Admin about the failure. FT-Admin selects one of the backup servers as the primary and notifies FT-Monitor to show the current status.
- FT-Admin notifies the FT-Front about the new primary server. FT-Front changes its current primary address to the new address.
- When FT-Admin changes the primary server, FT-Front discards the requests sent to the failed primary and accesses the message log to get the logged messages, which had been sent to the previous primary.
- FT-Front sends the acquired logged requests to the new primary. After that the system operation is similar to the process sequence described in steps 5, 6, and 7.
- FT-Front gets an exception while sending a received request to the primary server because the primary has failed.
- FT-Front requests a new primary address from FT-Admin. This failure detection mechanism built into FT-Front guarantees the full operation of FAWS in case of an FT-Detector failure. After that the system operation is similar to the process sequence described in steps 10, 11, 12 and 13.
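The logging, forwarding, and replay behaviour in the steps above can be approximated with a small in-memory model. Everything here is a hypothetical stand-in: servers are plain callables, `choose_new_primary` plays the role of FT-Admin, and `ConnectionError` represents the exception FT-Front sees when the primary is down.

```python
from typing import Callable, Dict

Server = Callable[[str], str]


class FrontEnd:
    """Toy stand-in for FT-Front: logs, forwards, and replays requests."""

    def __init__(self, servers: Dict[str, Server], primary: str,
                 choose_new_primary: Callable[[str], str],
                 max_resends: int = 1) -> None:
        self.servers = servers
        self.primary = primary
        self.choose_new_primary = choose_new_primary  # stands in for FT-Admin
        self.max_resends = max_resends
        self.message_log: Dict[int, str] = {}
        self._next_id = 0

    def handle(self, soap_request: str) -> str:
        request_id = self._next_id
        self._next_id += 1
        self.message_log[request_id] = soap_request  # log before forwarding
        resends = 0
        while True:
            try:
                response = self.servers[self.primary](soap_request)
            except ConnectionError:
                if resends >= self.max_resends:
                    raise  # give up; the request stays in the log
                resends += 1
                self.primary = self.choose_new_primary(self.primary)
                continue  # replay the logged request on the new primary
            del self.message_log[request_id]  # remove the entry once answered
            return response


def failed_primary(request: str) -> str:
    raise ConnectionError("primary is down")


def healthy_backup(request: str) -> str:
    return f"backup handled {request}"


front = FrontEnd(
    servers={"primary": failed_primary, "backup": healthy_backup},
    primary="primary",
    choose_new_primary=lambda failed: "backup",
)
reply = front.handle("<soap:Envelope/>")
print(reply)
```

The client calls `handle` once and receives a response even though the first server failed. Note that replaying a logged request assumes the request is safe to process a second time, which this sketch does not check.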
FAWS was built with extensibility and configurability in mind. Almost all of its functionality can be redefined and extended. I have successfully implemented the proposed system, FAWS, for web services. Of course, there is room for future enhancements. Several features could be added to make this solution more versatile, such as the following:
- Active Replication - the whole idea behind the fault tolerance system is to provide full availability during web service failures and to prevent single points of failure in the system itself. The current system can keep the service available when the web services fail. However, it could be extended to replicate all of its own components across the network so that it also tolerates single points of failure among the system components.
- Load Balancing - in the current system, there is only one primary server and it does all the work in web services. The clients connect to this primary server through FT-Front, while the backup servers languish in the background. The system can be extended to allow for multiple primary servers in the FT-Group that share the workload according to a fair policy. This can avoid congestion and work overloading when there are large numbers of web service requests coming to the system, which cannot be handled by a single server.
- Failback Operation - the current FAWS system can be extended such that service automatically does a failback to the original primary server when a failed server comes back online. In the current system, the failed server must be manually removed from the FT-Group and re-included after it has recovered.
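As one possible reading of the "fair policy" mentioned under Load Balancing, the sketch below distributes requests across several primaries in round-robin order. The article does not specify a policy, so round-robin and the `RoundRobinGroup` name are assumptions chosen only for illustration.

```python
import itertools
from typing import Iterable, List


class RoundRobinGroup:
    """Hands out servers from a group in a fixed rotating order."""

    def __init__(self, servers: Iterable[str]) -> None:
        self._cycle = itertools.cycle(list(servers))

    def next_server(self) -> str:
        return next(self._cycle)


group = RoundRobinGroup(["server-1", "server-2", "server-3"])
assignments: List[str] = [group.next_server() for _ in range(6)]
print(assignments)
```

Round-robin ignores how busy each server is; a policy based on current load would need feedback from the servers, which is outside the scope of this sketch.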
With its distributed object architecture and individual components, FAWS effectively increases the availability of web services in the presence of failures. Since FAWS ensures transparent fault tolerance, the client is not aware of any failure occurring in the server. With the ability to work under any operating system environment, FAWS provides portability and openness that other fault tolerance systems lack. The ability to fail over without losing any SOAP requests that are being processed during failover is another useful feature of FAWS.
Deepal Jayasinghe is an Apache committer. His current work involves the architecting and developing of Apache Axis2, which is the next version of the highly influential Apache Axis project. His expertise is majorly in Distributed computing and Fault tolerance systems. He is currently one of the core developers for Apache Axis2 working under a fellowship from Lanka Software Foundation and specializes in the completely new deployment model for Axis2.



