Designing systems that are Dependable and Secure: Measurement, Analysis and Design

Lecture

May 27 11:15
Hotel Beau Site


Ravishankar K. Iyer
University of Illinois at Urbana-Champaign

A quick way to learn the principles and techniques used to build and validate reliable computing systems and networks.

This course introduces a system-level view, spanning both hardware and software, of design issues in reliable computing. The material covers a broad spectrum of hardware and software error detection and recovery techniques. The lectures discuss how the hardware and software techniques interact; e.g., which techniques can be provided in the hardware, operating system, and network communication layers, and which can be provided via a distributed software layer or in the application itself.

After introducing basic concepts and terms, including reliability, availability, and hardware and software fault models, the course continues with discussions of hardware redundancy, coding techniques, signature-based error checking, processor-level error detection and recovery (e.g., duplicate execution and comparison), checkpointing and recovery (in single-process and distributed environments), software fault tolerance techniques (e.g., process pairs, robust data structures, recovery blocks, and N-version programming), and finally, network-specific issues (e.g., providing consistent data and reliable communications). The capabilities and applicability of the techniques discussed are illustrated with examples of real applications and systems.

Prerequisites & suggested preliminary readings

Basic understanding of computer systems, hardware and software. BS or MS in Computer Engineering or Computer Science.

Learning outcomes

You will learn the concepts, principles, and practices that jointly underlie the development of systems that are reliable and secure. You will be exposed to new and challenging application domains and computing paradigms that are being implemented in practice and studied in research. Overall, the course will prepare you to develop or research new systems and technologies with respect to their resiliency (dependability and security).

Syllabus

  1. Introduction
    • System view of high-availability design
    • Fault models
    • Example of high-availability system

  2. Hardware redundancy
    • Basic approaches to hardware redundancy
    • Static and dynamic redundancy
    • Voting

  3. Error detection techniques
    • Timers, watchdogs, heartbeats
    • Audits, assertions, control-flow and program-invariant checks
    • Operating system exception handling
    • Example application

  4. Coding techniques
    • Error detecting and error correcting codes
    • Hamming codes
    • Codes for storage and communication
    • Codes for arithmetic operations

  5. Processor-level detection and recovery
    • Instruction retry, duplication, multithreading
    • Checker processor
    • RSE (reliability and security engine)

  6. Disk arrays (RAID)
    • Organization of RAIDs
    • Example design and evaluation of a cache-based RAID controller

  7. Checkpointing and recovery
    • Forward and backward error recovery
    • Checkpoint and recovery in networked systems
    • Synchronous checkpointing and recovery
    • Asynchronous checkpointing and recovery
    • Checkpointing in distributed databases
    • IRIX operating system checkpoint and restart

  8. Software fault tolerance
    • Process pairs
    • Robust data structures
    • N-version programming
    • Recovery blocks
    • Recovery routines - example of IBM MVS

  9. Network-specific issues
    • Broadcast protocols
    • Agreement protocols (Byzantine agreement, Consensus, Interactive consistency)
    • Application of agreement algorithms

  10. High-availability middleware
    • Replication
    • Self-checking processes
    • Application example

  11. Dependability validation
    • Validation methods
    • Design phase - Hierarchical fault simulation
    • Prototype phase - Hardware- or software-implemented fault injection
    • Operational phase - Measurement of field systems