Fault Tolerance

Written by Charles Peacock

Fault tolerance is the ability of a hardware or software system to keep operating when part of it fails. In the past, the term usually referred to systems that could keep running through a power failure. More recently, it is used for computer hardware and software that can detect a problem and keep working through it.

How Fault Tolerance Works

On both the hardware and software levels, fault tolerance usually relies on redundant systems: two or more systems perform the same tasks simultaneously, so that if one goes down another can take over without interruption. Because this means duplicating your systems, it can be complex and expensive.
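The takeover idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the replicas are plain functions, they are tried one after another rather than run simultaneously, and every name here (call_with_failover, broken_primary, healthy_backup, AllReplicasFailed) is made up for the example.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when every redundant replica has failed."""


def call_with_failover(replicas: Sequence[Callable[[int], int]], request: int) -> int:
    """Send the request to each replica in turn and return the first success."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except Exception as exc:  # this replica is down; move on to the next one
            errors.append(exc)
    raise AllReplicasFailed(errors)


# Hypothetical replicas: the primary is simulated as being down.
def broken_primary(x: int) -> int:
    raise ConnectionError("primary is down")


def healthy_backup(x: int) -> int:
    return x * 2


print(call_with_failover([broken_primary, healthy_backup], 21))  # prints 42
```

The caller never sees the primary's failure, because the backup answers in its place. An error is raised only when every replica has failed.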

At the hardware level, redundancy means duplicating physical components: processors, hard drives, and even networks, everything that makes the system work. While this is expensive, it is a necessity for large corporations or organizations that cannot tolerate any downtime.

On the software level, fault tolerance is less expensive but no less complex. Redundancy can be built into the software application itself, so that if something goes wrong the application can shut down the failing portion of its processes. It then switches to the processes that are still working while it restarts the part that had a problem.
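A rough sketch of that shut-down, switch, and restart cycle follows. It assumes a deliberately simplified model in which each "process" is an in-memory object and a restart is just a counter and a flag. Component, Supervisor, and flaky are hypothetical names for illustration, not part of any real framework.

```python
class Component:
    """One redundant part of the application (hypothetical)."""

    def __init__(self, name, handler):
        self.name = name
        self.handler = handler
        self.healthy = True
        self.restarts = 0

    def restart(self):
        # A real restart would re-create a process; here we only record it.
        self.restarts += 1
        self.healthy = True


class Supervisor:
    """Routes work to the primary and falls back to the standby on failure."""

    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby

    def handle(self, request):
        try:
            return self.primary.name, self.primary.handler(request)
        except Exception:
            self.primary.healthy = False  # shut down the failing part
            result = self.standby.name, self.standby.handler(request)
            self.primary.restart()  # bring the failed part back for next time
            return result


calls = {"n": 0}


def flaky(text):
    """Simulated handler that fails on its first call only."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("simulated fault")
    return text.upper()


sup = Supervisor(Component("primary", flaky), Component("standby", str.upper))
first = sup.handle("hello")   # served by the standby after the primary fails
second = sup.handle("world")  # served by the restarted primary
```

In this sketch the first request is answered by the standby and the second by the restarted primary, so the caller gets a result both times.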
