What Is High Availability | Part 2
Design of a high availability system
Ironically, adding more components to the overall system can undermine efforts to achieve high availability. This is because complex systems inherently have more potential failure points and are more difficult to implement correctly. Most highly available systems follow a simple design pattern: a single high-quality, multi-purpose physical system with comprehensive internal redundancy, running all interdependent functions, paired with a second system at a separate physical location.
This classic design pattern is common among financial institutions, for example. The computing and communications industry established the Service Availability Forum to foster the creation of highly available network infrastructure products, systems and services. The same basic design principle applies beyond information technology, in fields such as nuclear power, aerospace and medical care.
High availability requires a suitable hosting environment: power supply, floor-level air conditioning with particulate filtering, maintenance service, security service, and protection against malicious acts and theft. Attention must also be paid to the risk of fire and water damage. Power and communication cables must follow multiple routes and be buried. They should not be exposed in the building's underground garage, which is too often seen in buildings in Paris. These criteria are the first to take into account when choosing a hosting provider (when renting high-availability premises).
For each level of the architecture, for each component, and for each connection between components, the following must be established:
- How to detect a failure?
- How the component is secured: made redundant, backed up, etc. Examples: backup server, clustered system, WebSphere clustering, RAID storage, backups, dual SAN attachment, degraded mode, unused spare hardware ready to be installed.
- How the switchover to the backup or to degraded mode is triggered: manually after analysis, or automatically?
- How to ensure that the backup system starts from a stable, known state. Examples: starting from a copy of the database and reapplying the archive logs, restarting batches from a known state, two-phase commit for transactions updating multiple data repositories, etc.
- How the application restarts on the backup system. Examples: application restart, restart of interrupted batches, activation of a degraded mode, takeover of the failed server's IP address by the backup server, etc.
- How to resume any in-flight transactions or sessions. Examples: session persistence on the application server, a mechanism that guarantees a response to a client for a transaction that completed successfully before the failure but for which the client never received an answer, etc.
- How to return to the nominal situation. Examples:
~~ If a degraded mode allows transactions to be stored in a file while a database is down, how are those pending transactions re-applied when the database becomes active again?
~~ If a failed component has been deactivated, how is it reintroduced into active service (e.g., need to resynchronize data, retest the component, etc.)?
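The first questions in the checklist above (how a failure is detected, and whether the switchover is manual or automatic) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `FailoverMonitor` class, the `probe` callable standing in for a real health check, and the threshold of three consecutive missed checks are hypothetical, not part of any specific product.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailoverMonitor:
    """Tracks health-check results for a primary node and decides when to switch."""
    probe: Callable[[], bool]   # hypothetical health check: True if the primary answers
    max_missed: int = 3         # assumed number of consecutive failures before acting
    automatic: bool = True      # False: only flag the situation for an operator
    missed: int = 0
    active: str = "primary"
    needs_operator: bool = False

    def tick(self) -> str:
        """Run one health check and return the node that should serve traffic."""
        if self.active != "primary":
            # Already switched; returning to the nominal situation is a separate step.
            return self.active
        if self.probe():
            self.missed = 0     # a healthy answer resets the counter
            return self.active
        self.missed += 1
        if self.missed >= self.max_missed:
            if self.automatic:
                self.active = "backup"
            else:
                self.needs_operator = True
        return self.active
```

Requiring several consecutive missed checks is one common way to avoid switching on a single lost probe; the right threshold depends on how costly a false switchover is compared with a longer outage.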
Load balancing and sensitivity
Sensitivity is often managed by making elements redundant behind a load balancing mechanism. For such a system to offer a real gain in reliability, check that if one element fails, the remaining elements have sufficient capacity to provide the service.
In other words, in the case of two active servers with load balancing, a single server must be able to handle the entire load. With three servers, a single server must be able to handle 50% of the load (assuming that the probability of an incident on two servers at the same time is negligible). To ensure reliability, it is not necessary to have many servers backing each other up. For example, an element that is 99% reliable and is made redundant once gives a reliability of 99.99% (the probability that both elements fail at the same time is 1/100 × 1/100 = 1/10,000).
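The arithmetic above can be checked with a short sketch. The function names are hypothetical, and the reliability formula assumes that failures are independent, which is the same assumption the text makes.

```python
def required_capacity_share(servers: int, tolerated_failures: int = 1) -> float:
    """Fraction of the total load each surviving server must be able to carry."""
    survivors = servers - tolerated_failures
    if survivors < 1:
        raise ValueError("at least one server must survive")
    return 1.0 / survivors


def combined_reliability(element_reliability: float, copies: int) -> float:
    """Reliability of `copies` redundant elements, assuming independent failures."""
    # The group fails only if every copy fails at the same time.
    return 1.0 - (1.0 - element_reliability) ** copies
```

With two servers, `required_capacity_share(2)` gives 1.0 (the whole load), and with three servers it gives 0.5. `combined_reliability(0.99, 2)` is approximately 0.9999, matching the 99.99% figure.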
Differential redundancy
Redundancy of an element is usually achieved by using several identical components. To be effective, this assumes that the failure of one component is random and independent of the failure of the other components. This is the case, for example, for hardware failures.
This is not the case for all failures: for example, a flaw in the operating system or a software malfunction in a component can occur on all components at once when the conditions are right. For this reason, when the application is extremely sensitive, redundancy can be built from components of different kinds that perform the same functions. This can lead to:
- Choosing servers of different kinds, with different operating systems and different infrastructure software products,
- Developing the same component twice, each time respecting the contracts that apply to the component's interface.
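As a toy illustration of the second option, the sketch below defines one interface contract and two independently written implementations of it, with a helper that falls back to the alternate if the primary fails. The names (`Sorter`, `BuiltinSorter`, `InsertionSorter`, `sort_with_fallback`) are hypothetical, and sorting is only a stand-in for a real business component.

```python
from typing import List, Protocol, Sequence


class Sorter(Protocol):
    """Interface contract: return the values in ascending order."""
    def sort(self, values: Sequence[int]) -> List[int]: ...


class BuiltinSorter:
    """First implementation, relying on the language's built-in sort."""
    def sort(self, values: Sequence[int]) -> List[int]:
        return sorted(values)


class InsertionSorter:
    """Second implementation, written independently with a different algorithm."""
    def sort(self, values: Sequence[int]) -> List[int]:
        result: List[int] = []
        for value in values:
            i = len(result)
            # Walk left until the insertion point for `value` is found.
            while i > 0 and result[i - 1] > value:
                i -= 1
            result.insert(i, value)
        return result


def sort_with_fallback(values: Sequence[int], primary: Sorter, alternate: Sorter) -> List[int]:
    """Use the primary implementation; switch to the alternate if it raises."""
    try:
        return primary.sort(values)
    except Exception:
        return alternate.sort(values)
```

Because the two implementations share no code, a bug that makes one of them fail on a given input is less likely to affect the other in the same way.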
Processes That Improve Availability
There are two distinct roles in these processes:
Processes that reduce the number of failures
Since prevention is better than cure, implementing control processes that reduce the number of incidents on the system improves availability. Two processes can play this role:
- Change management: 60% of errors are related to a recent change. By implementing a formalized process, accompanied by adequate tests (carried out in a proper pre-production environment), many incidents can be eliminated.
- Proactive error management: incidents can often be detected before they occur, for example through increasing response times. A process dedicated to this task and provided with adequate tools (measurement system, reporting, etc.) can intervene even before the incident happens.
By implementing these processes, many incidents can be avoided.
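As a minimal sketch of the proactive approach described above, the class below watches a rolling average of response times and raises a warning before the service actually fails. The class name, the window size and the 500 ms warning level are assumptions chosen for illustration; real monitoring tools offer far richer rules.

```python
from collections import deque
from statistics import mean


class ResponseTimeWatch:
    """Warns when the rolling mean of response times exceeds a threshold."""

    def __init__(self, window: int = 5, warn_ms: float = 500.0):
        self.samples = deque(maxlen=window)  # keeps only the most recent measurements
        self.warn_ms = warn_ms

    def record(self, response_ms: float) -> bool:
        """Store one measurement; return True when a warning should be raised."""
        self.samples.append(response_ms)
        # Wait until the window is full so a single slow request does not trigger.
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and mean(self.samples) > self.warn_ms
```

Averaging over a window rather than reacting to individual requests is a deliberate choice here: it trades a slightly later warning for fewer false alarms.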
Processes that reduce the duration of outages
Breakdowns always happen eventually. At that point, the error recovery process is essential for restoring the service as quickly as possible. This process must have one goal: enabling users to use the service again as quickly as possible. A permanent repair should be avoided at this stage because it takes much longer; instead, the process should put a workaround in place.
High availability cluster
A high availability cluster (as opposed to a computing cluster) is a cluster of computers whose goal is to provide a service whilst avoiding downtime.
Redundancy with voting system
In this mode, several components process the same inputs and therefore produce (in principle) the same outputs.
The results produced by all the components are collected, and an algorithm is then applied to produce the final result. The algorithm can be simple (majority vote) or more complex (mean, weighted mean, median, etc.). The aim is to eliminate erroneous results caused by a malfunction of one of the components and/or to obtain a more reliable result by combining several slightly different results.
This process:
- Does not allow load balancing
- Introduces the problem of reliability of the component managing the voting algorithm
This method is commonly used in the following cases:
- Systems based on sensors (e.g., temperature sensors) for which the sensors are redundant
- Systems in which several different components perform the same function and for which a better outcome can be achieved by combining the results produced by the components (e.g., a pattern recognition system using multiple algorithms for a better recognition rate).
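The two simplest voting algorithms mentioned above, majority vote and median, can be sketched as follows. The function names are hypothetical, and the strict-majority rule (more than half of the outputs must agree) is an assumption; other systems may accept a simple plurality.

```python
from collections import Counter
from statistics import median
from typing import Hashable, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the output produced by a strict majority of components."""
    if not outputs:
        raise ValueError("no component outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no strict majority among component outputs")
    return value


def median_reading(readings: Sequence[float]) -> float:
    """Combine redundant sensor readings; the median ignores a single outlier."""
    return median(readings)
```

For example, with three temperature sensors reporting 20.1, 20.3 and 85.0, `median_reading` returns 20.3, so the faulty third sensor does not distort the result. Note that the voter itself remains a single point of failure, as pointed out earlier.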
Continued…

