Fault-tolerant computer systems
Fault-tolerant computer systems are systems designed around the concepts of fault tolerance. In essence, they must continue to operate at a satisfactory level in the presence of faults.
Types of fault tolerance
Fault-tolerant computer systems are designed to handle several kinds of failure: hardware faults, such as hard disk failures, input or output device failures, and other temporary or permanent failures; software bugs and errors; interface errors between hardware and software, including driver failures; operator errors, such as erroneous keystrokes, bad command sequences, or installation of unexpected software; and physical damage or other flaws introduced to the system from an outside source [Dhiraj K. Pradhan, Fault-Tolerant Computer System Design, 1996, pp. 135-138, ISBN 0-13-057887-8].
Hardware fault tolerance is the most common application of these systems and is designed to prevent failures caused by hardware components. Typically, components have multiple backups and are separated into smaller "segments" that act to contain a fault, and extra redundancy is built into all physical connectors, power supplies, fans, etc. [Jan Vytopil, Formal Techniques in Real-Time and Fault-Tolerant Systems: Second International Symposium, Nijmegen, the Netherlands, January 8-10, 1992, Proceedings, Springer, 1991, ISBN 3540550925, 9783540550921]. Special software and instrumentation packages are used to detect failures and to apply techniques such as fault masking. Fault masking hides a fault by keeping a backup component ready to execute each instruction as soon as it is sent, and it relies on a voting protocol: if the main component and the backups do not give the same result, the flawed output is ignored.
Software fault tolerance is based more on nullifying programming errors, using either real-time redundancy or static "emergency" subprograms that fill in for programs that crash. There are many ways to conduct such fault regulation, depending on the application and the available hardware [Dhiraj K. Pradhan, Fault-Tolerant Computer System Design, 1996, pp. 221-235, ISBN 0-13-057887-8].
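The "emergency" subprogram idea can be illustrated with a minimal Python sketch. It is not taken from the cited sources: it assumes a crash shows up as a raised exception, and the names `run_with_fallback`, `primary_average`, and `fallback_average` are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Routine = Callable[[Sequence[float]], float]


def run_with_fallback(primary: Routine, fallback: Routine,
                      data: Sequence[float]) -> Tuple[float, bool]:
    """Run `primary`; if it crashes, run `fallback` in its place.

    Returns (result, used_fallback).
    """
    try:
        return primary(data), False
    except Exception:
        # Assumption: any exception counts as a crash of the primary routine.
        return fallback(data), True


def primary_average(data: Sequence[float]) -> float:
    # Hypothetical primary routine with a latent bug: it divides by zero
    # when the input is empty.
    return sum(data) / len(data)


def fallback_average(data: Sequence[float]) -> float:
    # Hypothetical emergency routine: simpler, but defined for every input.
    return sum(data) / len(data) if data else 0.0
```

Calling `run_with_fallback(primary_average, fallback_average, [])` returns the fallback's result together with a flag showing that the primary routine failed, so the caller can log the event for later repair.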
History
The first known fault-tolerant computer was SAPO, built in 1951 in Czechoslovakia by Antonin Svoboda [Daniel P. Siewiorek, C. Gordon Bell, Allen Newell, Computer Structures: Principles and Examples, McGraw-Hill, 1982, p. 155, ISBN 0070573026, 9780070573024]. Its basic design was magnetic drums connected via relays, with a voting method of memory error detection. Several other machines were developed along this line, mostly for military use. Eventually, they separated into three distinct categories: machines that would last a long time without any maintenance, such as the ones used on NASA space probes and satellites; computers that were very dependable but required constant monitoring, such as those used to monitor and control nuclear power plants or supercollider experiments; and finally, computers with a high amount of runtime that would be under heavy use, such as many of the supercomputers used by insurance companies for their probability monitoring.
Most of the development in so-called LLNM (Long Life, No Maintenance) computing was done by NASA during the 1960s [Daniel P. Siewiorek, C. Gordon Bell, Allen Newell, Computer Structures: Principles and Examples, McGraw-Hill, 1982, p. 189, ISBN 0070573026, 9780070573024], in preparation for Project Apollo and other research efforts. NASA's first machine went into a space observatory, and its second attempt, the JSTAR computer, was used in Voyager. This computer kept backup memory arrays for use in memory recovery, and thus it was called the JPL Self-Testing-And-Repairing computer. It could detect its own errors and either fix them or bring up redundant modules as needed. The computer is still working today.
Hyper-dependable computers were pioneered mostly by aircraft manufacturers [Daniel P. Siewiorek, C. Gordon Bell, Allen Newell, Computer Structures: Principles and Examples, McGraw-Hill, 1982, p. 210, ISBN 0070573026, 9780070573024], nuclear power companies, and the railroad industry in the USA. These industries needed computers with massive amounts of uptime that would fail gracefully enough under a fault to allow continued operation, while relying on constant human monitoring of the computer output to detect faults. IBM developed the first computer of this kind for NASA, for guidance of Saturn V rockets; later, BNSF, Unisys, and General Electric built their own [Daniel P. Siewiorek, C. Gordon Bell, Allen Newell, Computer Structures: Principles and Examples, McGraw-Hill, 1982, p. 223, ISBN 0070573026, 9780070573024].
In general, the early efforts at fault-tolerant design focused mainly on internal diagnosis, where a fault would indicate that something was failing and a worker could replace it. SAPO, for instance, had a method by which faulty memory drums would emit a noise before failure [M. R. Neilforoshan, "Fault tolerant computing in computer design", Journal of Computing Sciences in Colleges, Volume 18, Issue 4 (April 2003), pp. 213-220, ISSN 1937-4771]. Later efforts showed that, to be fully effective, the system had to be self-diagnosing and self-repairing: it had to isolate a fault and then bring in a redundant backup while signaling the need for repair. This is known as N-model redundancy, where faults trigger automatic fail-safes and a warning to the operator, and it is still the most common form of level one fault-tolerant design in use today.
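The isolate, switch, and alert sequence can be sketched in Python as follows. This is only an illustration under stated assumptions, not a description of any historical machine: modules are modeled as plain callables, a fault is assumed to surface as an exception, and `StandbySystem`, `broken_module`, and `healthy_module` are hypothetical names.

```python
class StandbySystem:
    """One active module plus spares; a failed module is isolated and replaced."""

    def __init__(self, modules):
        self.spares = list(modules)
        self.active = self.spares.pop(0)   # first module starts as the active one
        self.isolated = []                 # modules taken out of service
        self.alerts = []                   # warnings for the operator

    def compute(self, x):
        while True:
            try:
                return self.active(x)
            except Exception as err:
                # Isolate the faulty module and warn the operator.
                self.isolated.append(self.active)
                self.alerts.append(f"module failed: {err}; repair needed")
                if not self.spares:
                    raise RuntimeError("no redundant modules left") from err
                # Bring a redundant backup into service and retry.
                self.active = self.spares.pop(0)


def broken_module(x):
    # Hypothetical module that always faults.
    raise IOError("drum fault")


def healthy_module(x):
    # Hypothetical working module.
    return x * 2
```

With `system = StandbySystem([broken_module, healthy_module])`, a call to `system.compute(21)` is answered by the backup, and `system.alerts` holds one repair request for the isolated module.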
Voting was another early method, as discussed above: multiple redundant components operate constantly and check each other's results. If, for example, four components report an answer of 5 and one component reports an answer of 6, the other four "vote" that the fifth component is faulty and have it taken out of service. This is called M out of N majority voting.
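The voting rule above can be sketched in a few lines of Python. The function name `majority_vote` and its interface are hypothetical; the sketch assumes each component's output is a hashable value and that the threshold `m` is chosen by the designer.

```python
from collections import Counter


def majority_vote(outputs, m):
    """Return (agreed_value, faulty_indices) for M out of N voting.

    `outputs` holds one result per redundant component. The most common
    value wins if at least `m` components report it; components that
    disagree with it are reported as faulty.
    """
    value, votes = Counter(outputs).most_common(1)[0]
    if votes < m:
        raise ValueError("no value reached the required m votes")
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty
```

For the example in the text, `majority_vote([5, 5, 6, 5, 5], m=3)` agrees on 5 and flags the component at index 2, which could then be taken out of service.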
Historically, the trend has been to move away from N-model redundancy and toward M out of N voting, because systems grew more complex and it became harder to ensure that the transition from a fault-free state to a faulty state did not disrupt operations.
Fault tolerance verification and validation
The most important design requirement for a fault-tolerant computer system is that it actually meets its reliability requirements. This is verified by using failure models to simulate various failures and analyzing how well the system reacts. These statistical models are very complex, involving probability curves, specific fault rates, latency curves, error rates, and the like. The most commonly used models are HARP, SAVE, and SHARPE in the USA, and SURF or LASS in Europe.
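To give a sense of the arithmetic involved, the sketch below computes the reliability of an M out of N arrangement in the simplest possible model. It is far cruder than the tools named above and does not reproduce any of them: it assumes every component works independently with the same probability `r` and that the voter itself never fails. The function name `m_out_of_n_reliability` is hypothetical.

```python
from math import comb


def m_out_of_n_reliability(m: int, n: int, r: float) -> float:
    """Probability that at least m of n independent components are working.

    Assumes identical components, each working with probability r,
    and a perfect voter (binomial model).
    """
    return sum(comb(n, k) * r**k * (1 - r)**(n - k) for k in range(m, n + 1))
```

Under these assumptions, three components with `r = 0.9` and a 2 out of 3 vote give 3 × 0.9² × 0.1 + 0.9³ = 0.972, higher than the 0.9 of a single component.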
Fault tolerance research
Research into the kinds of tolerances needed for critical systems involves a large amount of interdisciplinary work. The more complex the system, the more carefully all possible interactions have to be considered and prepared for. Given the importance of high-value systems in transport, utilities, and the military, the range of topics that research touches on is very wide: it runs from obvious subjects such as software modeling and reliability, or hardware design, to more arcane ones such as stochastic models, graph theory, formal or exclusionary logic, parallel processing, remote data transmission, and more [Shunji Osaki, Toshihiko Nishio, Reliability Evaluation of Some Fault-Tolerant Computer Architectures, Springer, 1980, ISBN 3540102744, 9783540102748].
* Fault Tolerant System
* [http://184.108.40.206/search?q=cache:uBL7iMOpV9UJ:www.cs.ucla.edu/~rennels/article98.pdf+Fault-tolerant+computer+systems&hl=en&ct=clnk&cd=13&gl=us&client=firefox-a Primer on Fault-Tolerant Computer Systems from UCLA]
* [http://www.freepatentsonline.com/5099485.html A fault-tolerant patent with a lot of basic information on specific ways to detect faults]
Wikimedia Foundation. 2010.