Soft Errors: Sources and Mitigation

Lecture

May 26 15:30
Hotel Beau Site


Dan Alexandrescu
iRoC Technologies

Perturbations induced by Single Event Effects (SEEs, more widely known as Soft Errors) may cause system downtime, data corruption and maintenance incidents. SEEs are therefore a threat to overall system reliability, and engineers are increasingly concerned with the analysis and mitigation of radiation-induced failures, even for commercial systems operating in a natural working environment.
SEEs are physical phenomena that depend strongly on the technological process and the design implementation. At the other end of the scale, any SEE-induced fault can have system-wide consequences. Any Soft Error Rate (SER) analysis approach therefore requires a multitude of competencies. The reliability engineer has to interact with all the actors in the design flow, from the technology/library provider to the system architect, while taking into account the reliability targets required by the final application.
This lecture presents an overview of Single Event Effects in complex ASICs, from process aspects to system-wide consequences. The study relies on a complete approach that integrates tightly with the design flow, enabling the reliability engineer to closely support the circuit designers in improving the overall Soft Error Rate of the system. Additionally, we present an introduction to the available methods and approaches for improving the Soft Error performance of the circuit through architectural and design choices, with the firm goal of improving customer experience with high-availability products.
By covering all these subjects, this lecture hopes to improve SER awareness in the electronic design field and to offer practical solutions that support both SER analysis and improvement efforts.

Prerequisites

  • Fundamental knowledge of the physics of microelectronic devices
  • Basic knowledge of the microelectronic VLSI design flow and of cell and circuit design
  • Elements of electronic systems reliability

Suggested preliminary reading

Syllabus

The course is organized as follows:

  1. The core module presents an overview of Single Event Effects, a justification of the academic and industrial interest in this issue, and the impact of SEEs on the reliability of electronic devices and systems. We will discuss how SEEs are produced and how they propagate, how they manifest at each circuit abstraction level, and an overall SER characterization flow that aims to provide accurate information about the SEE resilience of the circuit. The first step in this characterization flow consists of evaluating the interaction of energetic particles with the electronic circuit at process, transistor and cell level. The resulting event (such as a Single Event Transient or Upset, a Bit Upset, a Multiple Cell Upset or a Single Event Latch-up) will affect the function of the device, causing temporary or permanent effects. The system-wide consequences can be very diverse: reboots, crashes, data corruption and hardware failures, with an obvious impact on the performance of the overall equipment.

  2. This module presents practical approaches for integrating SER efforts into the various design phases (architecture, RTL or High-Level Synthesis, gate level, cell level), with the overall goal of analyzing the performance and resilience of the system with respect to SEEs. The outcomes of this process are twofold:

    1. Quantitative results, such as the SER data for the various cells, signals or blocks of the circuit, allowing the computation of SER metrics for each feature of the design.

    2. Qualitative results about the behavior of the circuit in the presence of errors that will be used during the error mitigation stage, especially when devising error protection strategies.

  3. The manufactured system can also be tested and validated using a variety of hardware testing techniques. Radiation tests are very useful in advanced manufacturing stages: they can help identify the most frequent out-of-spec issues, validate designs and quantify field risk. Their most powerful advantage is that the products are tested as a whole in conditions that are very close to the natural atmospheric environment. Testing also helps debug software problems, especially in the presence of hardware failures, and helps isolate critical hardware/software modules. It is a very effective tool for comparing expectations against observed results. In particular, it allows the contribution of SER to the downtime of the product to be evaluated. A correct and timely estimation of the SER helps avoid later problems during deployment.

  4. Since the overall SEE management is a shared responsibility that requires inter- and intra-company collaboration, this module deals with the impact of SEE on the supply chain from the technology provider (foundry) to the system integrator and final users.

  5. This module provides the SER knowledge that the system architect and the designers need in order to direct implementation choices, select a design hardening methodology and establish a failure recovery/mitigation strategy, and that the support engineers need in order to help the final users of the design build reliable systems. For the SER improvement task, multiple approaches are possible: process improvements, hardened cells, circuit and system error mitigation techniques, etc. Most of these solutions come at some added cost, so a compromise must be found between error handling capability and cost overhead.
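
As a rough illustration of the quantitative roll-up mentioned in module 2 and of the cost compromise mentioned in module 5, the sketch below sums per-cell SER contributions into a design-level figure and compares it before and after a mitigation. It is not the method taught in the lecture. It assumes a simple linear model (instance count × raw rate × derating factor), and every name (`CellGroup`, `effective_fit`, `harden`) and every number is hypothetical. Rates are expressed in FIT, where 1 FIT is one failure per 10^9 device-hours.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CellGroup:
    """A group of identical cells; all fields are illustrative assumptions."""
    name: str
    count: int               # number of instances in the design
    raw_fit_per_cell: float  # intrinsic SER per instance, in FIT
    derating: float          # fraction of raw events that become failures (0..1)
    area: float = 1.0        # relative area per instance (arbitrary units)


def effective_fit(groups):
    """Sum the derated SER contribution of every cell group, in FIT."""
    return sum(g.count * g.raw_fit_per_cell * g.derating for g in groups)


def total_area(groups):
    """Sum the relative area of every cell group."""
    return sum(g.count * g.area for g in groups)


def mtbf_hours(fit):
    """Convert a FIT rate (failures per 1e9 device-hours) to hours between failures."""
    return 1e9 / fit


def harden(group, residual, area_factor):
    """Model a mitigation as a residual error fraction plus an area overhead."""
    return replace(
        group,
        raw_fit_per_cell=group.raw_fit_per_cell * residual,
        area=group.area * area_factor,
    )


# Hypothetical design: the values are placeholders, not measured data.
design = [
    CellGroup("flip_flops", count=100_000, raw_fit_per_cell=1e-4, derating=0.10),
    CellGroup("sram_bits", count=1_000_000, raw_fit_per_cell=1e-3, derating=0.05),
]

# Protect only the memory, assuming 1% of its errors remain and 20% more area.
protected = [design[0], harden(design[1], residual=0.01, area_factor=1.2)]

for label, groups in (("baseline", design), ("protected", protected)):
    fit = effective_fit(groups)
    print(f"{label}: {fit:.2f} FIT, MTBF {mtbf_hours(fit):.3g} h, "
          f"area {total_area(groups):.0f}")
```

Comparing the two printed lines shows the kind of compromise involved: the failure rate drops in exchange for additional area. A real analysis would replace the single derating factor with the separate contributions obtained from the characterization flow.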