QUALITY AND RELIABILITY ENGINEERING INTERNATIONAL, vol. 25, no. 8, pp. 1015-1028, 2009 (SCI-Expanded)
In this paper, we propose an architectural design for a dual computer system (DCS) that operates in real time with fault tolerance implemented purely in hardware. The novel design allows hardware to perform the following key services: determining the fault type (temporary or permanent) and localizing the faulty computer without using self-testing techniques or diagnosis routines. We also propose a non-trivial sequence of fault-tolerance services in which the determination of the fault type and the recovery of computational processes after a temporary fault are carried out before fault localization. Our design has several benefits: the designed hardware shortens the recovery point time period; the proposed non-trivial sequence of fault-tolerant services reduces to two the number of logical segments that must be re-run to recover the computational processes; and the determination of the fault type allows only the computer with a permanent fault to be eliminated. These contributions bring an increase in both system performance and the degree of system reliability. Copyright (C) 2009 John Wiley & Sons, Ltd.