Designing Systems That Are Dependable and Secure: Measurement, Analysis and Design

Lecture

May 27 11:15
Hotel Beau Site

University of Illinois at Urbana-Champaign

A quick way to learn the principles and techniques used to build and validate reliable computing systems and networks.

This course introduces a system-level view, spanning both hardware and software, of design issues in reliable computing. The material covers a broad spectrum of hardware and software error detection and recovery techniques. The lectures discuss how hardware and software techniques interact: which techniques can be provided in the hardware, operating system, and network communication layers, and which can be provided by a distributed software layer or in the application itself.

After introducing basic concepts and terms, including reliability, availability, and hardware and software fault models, the course continues with discussions of hardware redundancy, coding techniques, signature-based error checking, processor-level error detection and recovery (e.g., duplicate execution and comparison), checkpointing and recovery (in single-process and distributed environments), software fault tolerance techniques (e.g., process pairs, robust data structures, recovery blocks, and N-version programming), and finally, network-specific issues (e.g., providing consistent data and reliable communications). The capabilities and applicability of the techniques discussed are illustrated with examples of real applications and systems.

Prerequisites & suggested preliminary readings

A basic understanding of computer systems, both hardware and software. A BS or MS in Computer Engineering or Computer Science.

Learning outcomes

You will learn the concepts, principles, and practices that jointly underlie the development of systems that are reliable and secure. You will be exposed to new and challenging application domains and computing paradigms that are being implemented in practice and studied in research. Overall, the course will prepare you to develop or research new systems and technologies with attention to their resiliency (dependability and security).

Syllabus

Introduction

System view of high availability design
Fault models
Example of high-availability system

Hardware redundancy

Basic approaches to hardware redundancy
Static and dynamic redundancy
Voting

Error detection techniques

Timers, watchdogs, heartbeats
Audits, assertions, control-flow and program-invariant checks
Operating system exception handling
Example application

Coding techniques

Error detecting and error correcting codes
Hamming codes
Codes for storage and communication
Codes for arithmetic operations

Processor-level detection and recovery

Instruction retry, duplication, multithreading
Checker processor
RSE (reliability and security engine)

Disk arrays (RAID)

Organization of RAIDs
Example design and evaluation of a cache-based RAID controller

Checkpointing and recovery

Forward and backward error recovery
Checkpoint and recovery in networked systems
Synchronous checkpointing and recovery
Asynchronous checkpointing and recovery
Checkpointing in distributed databases
IRIX operating system checkpoint and restart

Software fault tolerance

Process pairs
Robust data structures
N-version programming
Recovery blocks
Recovery routines - example of IBM MVS

Network-specific issues

Broadcast protocols
Agreement protocols (Byzantine agreement, consensus, interactive consistency)
Application of agreement algorithms

High-availability middleware

Replication
Self-checking processes
Application example

Dependability validation

Validation methods
Design phase - hierarchical fault simulation
Prototype phase - hardware- or software-implemented fault injection
Operational phase - measurement of field systems