Self-Repairing Computers

By Roland Piquepaille

Our computers are probably 10,000 times faster than they were twenty years ago, but operating them has become much more complex. You have all experienced a PC crash or seen a large Internet site disappear. What can be done to improve the situation? This Scientific American article describes a new approach.

A group of research collaborators at Stanford University and the University of California at Berkeley has taken a new tack, by accepting that computer failure and human operator error are facts of life. Rather than trying to eliminate computer crashes -- probably an impossible task -- our team concentrates on designing systems that recover rapidly when mishaps do occur. We call our approach recovery-oriented computing (ROC).

Here are the four principles behind their approach.

Our team is exploring four principles to guide the construction of "ROC-solid" computing systems. The first is speedy recovery: problems are going to happen, so engineers should design systems that recover quickly. Second, suppliers should give operators better tools with which to pinpoint the sources of faults in multicomponent systems. Third, programmers ought to build systems that support an "undo" function (similar to those in word-processing programs), so operators can correct their mistakes. Last, computer scientists should develop the ability to inject test errors; these would permit the evaluation of system behavior and assist in operator training.

Let's concentrate on the first principle, faster reboots.

Most systems take a long time to reboot and, worse, may lose data in the process. Instead we believe that engineers should design systems so that they reboot gracefully. If one were to look inside a computer, one would see that it is running numerous different software components that work together.
Frequently, only one of these modules may be encountering trouble, but when a user reboots a computer, all the software it is running stops immediately. If each of its separate subcomponents could be restarted independently, however, one might never need to reboot the entire collection. Then, if a glitch has affected only a few parts of the system, restarting just those isolated elements might solve the problem.
George Candea and James Cutler, Stanford graduate students on our team, have focused on developing this independent-rebooting technique, which we call micro-rebooting.
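
To make the idea concrete, here is a minimal Python sketch of restarting only the faulty parts of a system. It is my own illustration, not the Stanford team's code: the `Component` class, the `healthy` flag and the `micro_reboot` function are hypothetical names, and a real system would need some way to detect faults and to restart a module without disturbing its neighbors.

```python
class Component:
    """A hypothetical software module that can be restarted on its own."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.restarts = 0

    def restart(self):
        # Restarting clears the fault for this component only.
        self.healthy = True
        self.restarts += 1


def micro_reboot(components):
    """Restart only the unhealthy components and return their names."""
    restarted = []
    for component in components:
        if not component.healthy:
            component.restart()
            restarted.append(component.name)
    return restarted


system = [Component("web"), Component("cache"), Component("mail")]
system[1].healthy = False  # simulate a glitch in a single module
restarted = micro_reboot(system)
print(restarted)
```

In this toy run only the "cache" component is restarted, while "web" and "mail" keep running untouched.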

After implementing the idea by manually modifying an existing application, they saw a fivefold-faster return to service.

[This] was much more valuable than a fivefold increase in the time between failures (better reliability), even though either measure would yield the same level of improved availability. We believe that a variety of computing systems exhibit such a threshold.
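
The claim that either measure yields the same availability can be checked with a little arithmetic. The sketch below assumes the usual steady-state formula, availability = MTTF / (MTTF + MTTR), which the excerpt does not spell out. The numbers (1,000 hours between failures, 10 hours to recover) are made up for illustration.

```python
from fractions import Fraction


def availability(mttf, mttr):
    """Fraction of time a system is up, assuming MTTF / (MTTF + MTTR)."""
    return Fraction(mttf) / (Fraction(mttf) + Fraction(mttr))


# Hypothetical baseline: fails every 1000 hours, takes 10 hours to recover.
baseline = availability(1000, 10)

# Option 1: recover five times faster (MTTR divided by 5).
faster_recovery = availability(1000, 2)

# Option 2: fail five times less often (MTTF multiplied by 5).
fewer_failures = availability(5000, 10)

print(baseline, faster_recovery, fewer_failures)
```

Under this formula both options give exactly the same availability, so the authors' preference for faster recovery has to come from something other than the raw availability figure.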

Here are a few words about the "undo" principle, a capability that is not available today, even in large data centers.

Our group is working on an undo capability for e-mail systems that is aimed at the place where messages are stored. Berkeley graduate student Aaron Brown and one of us (Patterson) have recently completed the prototype of an e-mail system featuring an operator undo utility.
Suppose a conventional e-mail storage server gets infected by a virus. The system operator must disinfect the server, a laborious job. Our system, however, would record all the server's activities automatically, including discarded messages. If the system gets infected, the operator could employ the undo command to "turn back the clock" to before the arrival of the virus. Software that attacks that virus could then be downloaded. Finally, the operator could "play forward" all the e-mail messages created after the infection, returning the system to normal operation.
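
The rewind, repair and replay steps described above can be sketched in a few lines of Python. This is only my own toy model, not the Berkeley prototype. The log format (a time-ordered list of timestamp and message pairs), the `undo_repair_replay` function and the `is_infected` check are all assumptions made for illustration.

```python
def undo_repair_replay(log, infection_time, is_infected):
    """Rebuild a mailbox from a time-ordered log of (timestamp, message) pairs.

    log            -- every message the server recorded, oldest first
    infection_time -- the moment the operator wants to rewind to
    is_infected    -- the repair: a check that rejects infected messages
    """
    # Rewind: keep only what was stored before the infection arrived.
    mailbox = [message for timestamp, message in log if timestamp < infection_time]

    # Replay: re-deliver later messages, now filtered by the repair.
    for timestamp, message in log:
        if timestamp >= infection_time and not is_infected(message):
            mailbox.append(message)
    return mailbox


log = [(1, "hello"), (2, "VIRUS payload"), (3, "lunch?")]
mailbox = undo_repair_replay(log, infection_time=2, is_infected=lambda m: "VIRUS" in m)
print(mailbox)
```

The key design point is that nothing is ever thrown away: because the log keeps every message, the legitimate mail that arrived after the infection survives the rollback.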

I don't know about you, but I think they have a good approach. Please read the long and dense article if you want to know more. Or grab a copy of Scientific American and read it next weekend.

Source: Armando Fox and David Patterson, for Scientific American, May 12, 2003