Dependable Processor Design

Lecture

May 28 14:00
Hotel Beau Site

In this session, I’ll try to bring together the fault-tolerance theory that you have learnt and show how it can be applied in a real, practical example, based around an existing dependable processor.

Embedded processors are used in many applications that require a defined level of reliability, safety and/or availability. There are many approaches to providing the required level of fault tolerance – at the circuit, logic, microarchitecture, chip and system level – each of which incurs a certain cost.

In many volume applications, for example in the automotive market, you need to achieve a balance between reaching the required level of reliability, safety and/or availability and the additional cost involved. So when designing a dependable processor, you need to consider not only the kinds of faults that might occur but also the effect that these faults might have on system operation – and think carefully about how to protect against these faults without adding too much to the system cost.

In this session, I’ll start by saying what I mean by dependability in the context of an embedded processor and then take a look at safety standards and what these imply about processor design. I’ll then discuss the kinds of faults that might affect a processor’s operation – hard versus transient faults, latent faults and wear-out mechanisms – and how these faults might be detected (and possibly corrected).

I’ll then look at how the requirements for reliability, safety and availability can be translated into real systems and discuss some dependable architectures. Using the example of an embedded processor that was designed to satisfy this market, I’ll look in detail at how features such as ECC on the memories, error caches, dual-core lock step and processor diversity address these requirements. I’ll also look at how external monitoring hardware can be used in conjunction with an embedded processor to achieve the required dependability.

Finally, I’ll discuss some future challenges in dependable processor design, including the effect of process scaling. I’ll briefly describe some experimental test structures that could be used to detect and mitigate failure mechanisms such as wear-out.

Prerequisites & suggested preliminary readings

No particular requirements. This session will build on the theory that has been presented in earlier sessions and the suggested preliminary reading for those earlier sessions will be equally applicable to this session.

Learning outcomes

  • What is meant by dependability in the context of a processor
  • Requirements of safety standards and how these can be incorporated into a processor
  • How to analyse a processor for dependability
  • An understanding of possible architectures for processor dependability
  • An appreciation of some examples of existing dependable processor designs
  • An understanding of how extra hardware can be added to a processor to build a dependable system
  • An appreciation of some future challenges

Syllabus

  • Dependability as a combination of reliability, availability, maintainability and safety (RAMS)
  • Dependability vs. security
  • Principles of functional safety: functional safety standards like IEC 61508 and ISO 26262
  • Requirements of IEC 61508 and ISO 26262 for processors
    • Requirements for HW random failures (for permanent and transient faults)
    • Requirements for systematic failures (verification of a processor, avoidance or validation of unpredictable instructions)
    • Requirements for common-cause failures (clock, reset, temperature, EMC)
    • Requirements for Worst Case Execution Time
  • Safety and Dependability analysis of a processor
    • FMEA/FMEDA of a processor
    • Computing the failure rate of a processor
    • Vulnerability factors of a processor
    • Fault injection of a processor
    • The “safety manual” of a processor
  • Architectures for processor dependability
    • Dependable processor (e.g. working at cell level, Razor, Logic BIST)
    • Homogeneous redundancy (e.g. Dual-core lock-step, TMR)
    • Asymmetric redundancy (e.g. TMR, YOGITECH’s faultRobust CPU, Challenge-Response architecture)
    • Achieving dependability by software (e.g. designing a SW Test Library to test a processor at run-time)
    • Monitoring a processor with watchdogs and MPU/TPU units
  • SW aspects of processor dependability
    • HW-SW interactions and configurations in a processor
    • A safe and dependable compiler
  • Lock-step architecture using an ARM processor as the example (Cortex-R4, Cortex-R5, Cortex-R7)
    • Lock-step and compare outputs to avoid common-mode failures (delays, guard rings, macro rotation)
    • Split-lock functionality
  • ECC and parity schemes to handle faults in the memory and bus interconnect of an ARM processor
    • Error caches
  • Safety and dependability of L1, L2 and L3 caches – MBIST and repairable memories
  • A safety eco-system for the ARM Cortex-M3 processor
    • Highlights of the FMEDA of the ARM Cortex-M3 processor
    • fRCPU_armcm3, a tightly coupled, optimized and diverse supervisor for the ARM Cortex-M3 processor
    • Failure identification and fail-operational strategies for an ARM Cortex-M3 with fRCPU
    • fRSTL_armcm3, a SW Test Library to reach the SIL 2 / ASIL B safety integrity level for the ARM Cortex-M3 processor
    • fRMEM and fRBUS - IPs to detect faults in the memory and bus system of a Cortex-M3 processor
  • Challenges for the future
    • Effect of process scaling on dependable processor design
    • On-chip monitors to detect and mitigate in-service degradation and/or failures – e.g. wear-out: NBTI, oxide degradation
    • Low-power systems - detecting and correcting state corruption after power-down