Hardware- and Software-Fault Tolerance, Design and Assessment of Dependable Computer Systems

Lecture

May 28 14:00
Hotel Beau Site

LAAS-CNRS

This lecture covers the main design and assessment issues to be considered when developing dependable computer systems. It is organized into four main parts. After a short introduction motivating the relevance of the topic, the first part briefly introduces the general concepts and terminology of dependable computing, including the notions of fault, error and failure and the main approaches to dependability: fault tolerance, fault removal and fault forecasting. The second part addresses the fault tolerance techniques (encompassing error detection, error recovery and fault masking) that can be used to cope with accidental faults (physical disturbances, software bugs, etc.) and, to some extent, malicious faults (e.g., attacks, intrusions). In particular, several forms of redundancy (space, temporal, data, etc.), as well as the important notion of diversified design, are described and illustrated by means of examples. The third part covers the methods and techniques, both analytical (stochastic processes) and empirical (controlled experiments), that can be used to objectively assess the coverage of the fault tolerance mechanisms and then infer the level of dependability achieved. The actual impact of fault-tolerant architectures on dependability, leading to the essential notion of coverage (with respect to fault tolerance), is precisely identified and exemplified. A special focus is put on controlled experiments based on fault injection techniques (hardware-, simulation-, and software-based fault injection). The fourth and last part describes the most recent trends in controlled experiments aimed at developing benchmarks for robustness testing and for fairly comparing the dependability features of several computer systems and components. Finally, a few concluding remarks outline some emerging challenges and future trends in the domain of dependable computing.
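
As a rough illustration of the coverage notion mentioned in the abstract, the following Python sketch runs a small simulated fault injection campaign against a triple modular redundancy (TMR) voter and reports the fraction of injected faults that are masked. It is not taken from the lecture material: the function names, the 8-bit word width, and the crude model of a common-mode fault (the same bit flipped in two replicas) are all assumptions chosen for illustration.

```python
import random


def majority_vote(a, b, c):
    """Bitwise 2-out-of-3 majority vote over three replica outputs."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(value, bit):
    """Inject a single bit-flip fault into an integer value."""
    return value ^ (1 << bit)


def run_campaign(trials, double_fault_rate, width=8, seed=0):
    """Estimate masking coverage of a TMR voter by simulated fault injection.

    Each trial corrupts one replica. With probability double_fault_rate
    (a hypothetical parameter), a second replica is corrupted at the same
    bit position, a crude stand-in for a common-mode fault.
    Returns the fraction of trials in which the voter output was correct.
    """
    rng = random.Random(seed)
    masked = 0
    for _ in range(trials):
        correct = rng.randrange(1 << width)
        replicas = [correct, correct, correct]
        bit = rng.randrange(width)
        first, second = rng.sample(range(3), 2)
        replicas[first] = flip_bit(replicas[first], bit)
        if rng.random() < double_fault_rate:
            replicas[second] = flip_bit(replicas[second], bit)
        if majority_vote(*replicas) == correct:
            masked += 1
    return masked / trials


if __name__ == "__main__":
    for rate in (0.0, 0.1, 0.5):
        coverage = run_campaign(trials=10_000, double_fault_rate=rate)
        print(f"double-fault rate {rate:.1f}: estimated coverage {coverage:.3f}")
```

Under this simplified model, every single fault is masked, while a fault that hits two replicas identically defeats the voter. The estimated coverage therefore falls as the share of common-mode faults grows, which hints at why the abstract pairs redundancy with diversified design.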

Prerequisites

Suggested preliminary readings

[Arlat et al. 1990] J. Arlat, M. Aguera, L. Amat, Y. Crouzet, J.-C. Fabre, J.-C. Laprie, E. Martins and D. Powell, “Fault Injection for Dependability Validation: A Methodology and Some Applications”, IEEE Transactions on Software Engineering, 16 (2), pp.166-182, February 1990.
[Laprie et al. 1990] J.-C. Laprie, J. Arlat, C. Béounes and K. Kanoun, “Definition and Analysis of Hardware-and-Software Fault-Tolerant Architectures”, Computer, 23 (7), pp.39-51, July 1990.
[Carreira et al. 1999] J. V. Carreira, D. Costa and J. G. Silva, “Fault Injection Spot-checks Computer System Dependability”, IEEE Spectrum, 36, pp.50-55, August 1999.
[Koopman & DeVale 1999] P. Koopman and J. DeVale, “Comparing the Robustness of POSIX Operating Systems”, in Proc. 29th Int. Symp. on Fault-Tolerant Computing (FTCS-29), (Madison, WI, USA), pp.30-37, IEEE CS Press, 1999.
[Avižienis et al. 2004] A. Avižienis, J.-C. Laprie, B. Randell and C. Landwehr, “Basic Concepts and Taxonomy of Dependable and Secure Computing”, IEEE Transactions on Dependable and Secure Computing, 1 (1), pp.11-33, Jan.-March 2004.
[Siewiorek et al. 2004] D. P. Siewiorek, R. Chillarege and Z. Kalbarczyk, “Reflection on Industry Trends and Experimental Research in Dependability”, IEEE Transactions on Dependable and Secure Computing, 1 (2), pp.109-127, 2004.

Learning outcomes

This course will familiarize you with the challenges of nanoscale technologies and with effective design practices for overcoming them, at both the chip level and the system level. You will gain a good understanding of how to incorporate test and reliability into a design from day one, and you will be prepared for the paradigm shift toward resilient design for even more effective designs.

Syllabus

Introduction: Motivation and Outline
Part 1: Basic Concepts and Terminology

The Notion of Dependability
The Dependability Attributes
Dependability Threats: Fault, Error, Failure Pathologies
Dependability Procurement
Dependability Assessment

Part 2: Fault-Tolerant Computer Architectures

Error Detection

Error Detecting Codes
Replication and Comparison
Temporal and Execution Checks
Likelihood Checks
Structured Data Checks
Wrapping
Self-Checking Component

System Recovery

Backward Error Recovery (Roll-back)
Forward Error Recovery (Roll-forward)
Error Compensation

Design Diversity

Redundancy and Common Mode Failures
Diversification Techniques

Examples of Fault-Tolerant Computer Systems

Airbus A3XX Series
Boeing 777
Ansaldo’s Computer Based Interlocking
Safe and Secure Maintenance Laptop

Part 3: Experimental Assessment of Dependability

Dependability Evaluation

Fault Tolerance Coverage
Fault Injection-based Assessment

Fault Injection Techniques

Hardware-implemented (Physical) Injection
Simulation-based Injection
Software-implemented Injection
Combined and Emerging Techniques

Examples of Experimental Results

The MARS Fault-Tolerant Distributed System
The Delta-4 Dependable Distributed Architecture

Part 4: Dependability Benchmarking

Requirements and Characteristics
Robustness Benchmark
Fault Tolerance Benchmark
Integration of (C)OTS Components
Selected Examples

The Ballista Environment
The MAFALDA Prototype Tool
The DBench Project

Conclusion: Wrap up, Emerging Challenges and Future Trends