Hardware- and Software-Fault Tolerance, Design and Assessment of Dependable Computer Systems

Lecture

May 28 14:00
Hotel Beau Site


Jean Arlat
LAAS-CNRS

This lecture covers the main design and assessment issues to be considered when developing dependable computer systems. After a short introduction motivating the relevance of the topic, it is organized into four main parts.

The first part briefly introduces the general concepts and terminology of dependable computing, including the notions of fault, error and failure, and the main approaches to dependability: fault tolerance, fault removal and fault forecasting.

The second part addresses the fault tolerance techniques (encompassing error detection, error recovery and fault masking) that can be used to cope with accidental faults (physical disturbances, software bugs, etc.) and, to some extent, malicious faults (e.g., attacks, intrusions). In particular, several forms of redundancy (spatial, temporal, data, etc.), as well as the important notion of diversified design, are described and illustrated by means of examples.

The third part covers the methods and techniques, both analytical (stochastic processes) and empirical (controlled experiments), that can be used to objectively assess the coverage of the fault tolerance mechanisms and then infer the level of dependability achieved. The actual impact of fault-tolerant architectures on dependability, which leads to the essential notion of coverage (with respect to fault tolerance), is precisely identified and exemplified. A special focus is put on controlled experiments based on fault injection techniques (hardware-, simulation-, and software-based fault injection).

The fourth and last part describes the most recent trends in controlled experiments aimed at developing benchmarks for robustness testing and for fairly comparing the dependability features of several computer systems and components. Finally, a few concluding remarks outline some emerging challenges and future trends in the domain of dependable computing.
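As an informal illustration of how fault masking, fault injection and coverage relate, the sketch below runs a small software-implemented fault injection campaign against a triplicated computation with majority voting. It is not taken from the lecture material: the workload `compute`, the single bit-flip fault model, and the names `flip_bit`, `majority` and `run_campaign` are all assumptions chosen for the example.

```python
import random


def compute(x: int) -> int:
    """Hypothetical 8-bit workload; stands in for the replicated function."""
    return (x * 3 + 7) & 0xFF


def flip_bit(value: int, bit: int) -> int:
    """Assumed fault model: flip one bit of an 8-bit value."""
    return value ^ (1 << bit)


def majority(a, b, c):
    """Return the value shared by at least two replicas, or None if all differ."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None


def run_campaign(n_experiments: int, n_faults: int, seed: int = 0) -> float:
    """Estimate coverage as the fraction of experiments with a correct voted output."""
    rng = random.Random(seed)
    tolerated = 0
    for _ in range(n_experiments):
        x = rng.randrange(256)
        expected = compute(x)
        # Three replicas compute the same result (spatial redundancy).
        outputs = [compute(x) for _ in range(3)]
        # Inject n_faults bit flips, each into a randomly chosen replica output.
        for _ in range(n_faults):
            replica = rng.randrange(3)
            outputs[replica] = flip_bit(outputs[replica], rng.randrange(8))
        if majority(*outputs) == expected:
            tolerated += 1
    return tolerated / n_experiments


if __name__ == "__main__":
    print("estimated coverage, 1 fault :", run_campaign(1000, 1))
    print("estimated coverage, 2 faults:", run_campaign(3000, 2))
```

Under this fault model a single injected fault is always masked, since two replicas still agree on the correct value. With two faults the voter delivers the correct result only when both flips hit the same replica, so the estimated coverage drops to roughly one third. Real campaigns inject faults into hardware, simulation models or running software rather than into output values, and they also have to account for how representative the injected faults are.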

Prerequisites

Suggested preliminary readings

  • [Arlat et al. 1990] J. Arlat, M. Aguera, L. Amat, Y. Crouzet, J.-C. Fabre, J.-C. Laprie, E. Martins and D. Powell, “Fault Injection for Dependability Validation — A Methodology and Some Applications”, IEEE Trans. on Software Engineering, 16 (2), pp.166-182, February 1990.
  • [Laprie et al. 1990] J.-C. Laprie, J. Arlat, C. Béounes and K. Kanoun, “Definition and Analysis of Hardware-and-Software Fault-Tolerant Architectures”, Computer, 23 (7), pp.39-51, July 1990.
  • [Carreira et al. 1999] J. V. Carreira, D. Costa and J. G. Silva, “Fault Injection Spot-checks Computer System Dependability”, IEEE Spectrum, 36, pp.50-55, August 1999.
  • [Koopman & DeVale 1999] P. Koopman and J. DeVale, “Comparing the Robustness of POSIX Operating Systems”, in Proc. 29th Int. Symp. on Fault-Tolerant Computing (FTCS-29), (Madison, WI, USA), pp.30-37, IEEE CS Press, 1999.
  • [Avižienis et al. 2004] A. Avižienis, J.-C. Laprie, B. Randell and C. Landwehr, “Basic Concepts and Taxonomy of Dependable and Secure Computing”, IEEE Transactions on Dependable and Secure Computing, 1 (1), pp.11-33, Jan.-March 2004.
  • [Siewiorek et al. 2004] D. P. Siewiorek, R. Chillarege and Z. Kalbarczyk, “Reflections on Industry Trends and Experimental Research in Dependability”, IEEE Transactions on Dependable and Secure Computing, 1 (2), pp.109-127, 2004.

Learning outcomes

This course will familiarize you with the challenges of nanoscale technologies and with effective design practices to overcome them, at both the chip level and the system level. You will gain a good understanding of how to incorporate test and reliability into the design from day one, and will be prepared for the paradigm shift toward resilient design, which aims at even more effective designs.

Syllabus

Introduction: Motivation and Outline
Part 1: Basic Concepts and Terminology

  1. The Notion of Dependability
  2. The Dependability Attributes
  3. Dependability Threats: Fault, Error, Failure Pathologies
  4. Dependability Procurement
  5. Dependability Assessment
Part 2: Fault-Tolerant Computer Architectures
  1. Error Detection
    1. Error Detecting Codes
    2. Replication and Comparison
    3. Temporal and Execution Checks
    4. Likelihood Checks
    5. Structured Data Checks
    6. Wrapping
    7. Self-Checking Component
  2. System Recovery
    1. Backward Error Recovery (Roll-back)
    2. Forward Error Recovery (Roll-forward)
    3. Error Compensation
      • Error Detection and Compensation
      • Error Masking
      • Error Correcting Codes
  3. Design Diversity
    1. Redundancy and Common Mode Failures
    2. Diversification Techniques
      • Recovery Blocks
      • N-Version Programming
      • N-Self-Checking Programming
  4. Examples of Fault-Tolerant Computer Systems
    1. Airbus A3XX Series
    2. Boeing 777
    3. Ansaldo’s Computer Based Interlocking
    4. Safe and Secure Maintenance Laptop
Part 3: Experimental Assessment of Dependability
  1. Dependability Evaluation
    1. Fault Tolerance Coverage
    2. Fault Injection-based Assessment
  2. Fault Injection Techniques
    1. (Physical) Hardware-implemented
    2. Simulation-based Injection
    3. Software-implemented
    4. Combined and Emerging Techniques
  3. Examples of Experimental Results
    1. The MARS Fault-Tolerant Distributed System
    2. The Delta-4 Dependable Distributed Architecture
Part 4: Dependability Benchmarking
  1. Requirements and Characteristics
  2. Robustness Benchmark
  3. Fault Tolerance Benchmark
  4. Integration of (C)OTS Components
  5. Selected Examples
    1. The Ballista Environment
    2. The MAFALDA Prototype Tool
    3. The DBench Project
Conclusion: Wrap up, Emerging Challenges and Future Trends