Robust System Design: Overcoming Reliability Challenges

Lecture

May 26 09:30
Hotel Beau Site


Subhasish Mitra
Stanford University

Today’s mainstream electronic systems typically assume that transistors and interconnects operate correctly over their useful lifetime. With enormous complexity and significantly increased vulnerability to failures compared to the past, future system designs cannot rely on such assumptions. At the same time, there is explosive growth in our dependency on such systems.

Robust system design is essential to ensure that future systems perform correctly despite rising complexity and increasing disturbances. For coming generations of silicon technologies, several causes of hardware failures, largely benign in the past, are becoming significant at the system level. With extreme miniaturization of circuits, factors such as transient errors, device degradation, and variability induced by manufacturing and operating conditions are becoming important. Design margins are being squeezed to achieve high energy efficiency, yet expanded design margins are required to cope with variability and transistor aging. Even if error rates stay constant on a per-bit basis, total chip-level error rates grow with the scale of integration. Moreover, difficulties with traditional burn-in can leave early-life failures unscreened.
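The scaling argument can be illustrated with a minimal sketch. It assumes that bits fail independently, so that raw per-bit rates simply add up, and it expresses rates in FIT (failures per 10^9 device-hours). The per-bit rate and bit counts below are hypothetical placeholders chosen for illustration, not measured data, and the function names are not taken from the talk.

```python
# Illustrative sketch: chip-level raw error rate under a constant per-bit rate.
# Assumption: bits fail independently, so their rates are additive.

def chip_fit(per_bit_fit: float, num_bits: int) -> float:
    """Raw chip-level error rate in FIT (failures per 1e9 device-hours)."""
    return per_bit_fit * num_bits


def mean_hours_between_errors(fit: float) -> float:
    """Mean time between raw errors, in hours, for a given FIT rate."""
    return 1e9 / fit


# Hypothetical per-bit rate, held constant across all design sizes.
PER_BIT_FIT = 1e-3

if __name__ == "__main__":
    for num_bits in (10**6, 10**8, 10**10):
        fit = chip_fit(PER_BIT_FIT, num_bits)
        hours = mean_hours_between_errors(fit)
        print(f"{num_bits:>14,d} bits: {fit:12.1f} FIT, "
              f"one raw error every {hours:12.1f} hours on average")
```

Under these assumptions the per-bit rate never changes, yet each hundredfold increase in bit count raises the chip-level rate a hundredfold and shortens the mean time between raw errors by the same factor. Derating, which the sketch ignores, would lower the observed rate.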

This talk will address the following major robust system design objective: cost-effective tolerance and prediction of failures in hardware during system operation. Significant recent progress in robust system design impacts almost every aspect of future systems, from ultra-large-scale networked systems, all the way to their nanoscale components.

Prerequisites & suggested preliminary readings

Basic concepts in digital circuits, systems, computer architecture, and some knowledge of VLSI testing.

Learning outcomes

The audience will learn circuit- and system-level modeling and design aspects of errors (rather than just technology and physics aspects). Supporting data on designs and technologies, together with technology trends, will be covered. New techniques for analyzing the circuit- and system-level impact of errors will be discussed. New error resilience techniques will be presented. Finally, an extensive bibliography will be provided.

Syllabus

Basic concepts in reliability; various reliability failure mechanisms; basic ideas of reliability, data integrity, silent data corruption, and availability; overview of circuit- and system-level impact of errors; estimation strategies; derating factors; resilience techniques: Built-In Soft Error Resilience, Soft Error Correcting Combinational Logic, ECC, Concurrent Error Detection, Parity Prediction and other coding-theoretic techniques, Multi-threading, Software Implemented Hardware Fault Tolerance, application-dependent techniques, on-line self-test and diagnostics, and error-resilient system architectures.