IRPS

Session 2B – System Reliability

Session Co-Chairs: Kingsuk Maitra, Microsoft; Werner Kanert, Infineon
Section E

10:45 a.m. – Session Introduction

10:50 a.m.

2B.1 Software-based Dynamic Reliability Management for GPU Applications

S. Li, V. Sridharan*, S. Gurumurthi*, S. Yalamanchili, Georgia Institute of Technology, *Advanced Micro Devices, Inc.

In this paper we advocate a framework for dynamic reliability management (DRM) for GPU applications based on the idea of plug-n-play software-based reliability enhancement (SRE). The approach entails first assessing the vulnerability of GPU kernels to soft errors in program-visible structures. This assessment is performed on a low-level intermediate program representation rather than the application source. Second, this assessment guides selective injection of code implementing SRE techniques to protect the most vulnerable data. Code injection occurs transparently at runtime using a just-in-time (JIT) compiler. Thus, reliability enhancement is selective, transparent, on-demand, and customizable. We argue that this flexible, automated, software-based DRM framework can provide an important, cost-effective approach to scaling the reliability of large systems. We present the results of a proof-of-concept implementation on NVIDIA GPUs, demonstrating the ability to traverse a range of performance-reliability tradeoffs.

11:15 a.m.

2B.2 Time Ordered Events CPU Reliability Assessment

I. Sauciuc, R. Kwasnick, R. Akhter, M. Ojha, M. Tse, D. Mani, C. Beas, G. Kaur, Intel Corporation

IC product use conditions (UCs) are needed to enable accurate reliability modeling in the context of knowledge-based qualification. We describe a method of use condition development that is based on the sequence of user foreground events from field surveys. Events are converted to temperature and voltage use condition traces, accounting for lab data on representative workloads and thermal modeling. The temperature and voltage data are paired for client CPU UC data and used to estimate the reliability risks for both silicon and thermo-mechanical failure mechanisms. The new approach is validated using field consumer data. We also describe the future work needed to account for concurrent events and the implications for the time ordered events (TOE) methodology.

11:40 a.m.

2B.3 Power-Supply Impact on the Reliability of mid-1X TLC NAND Flash Memories

C. Zambelli, P. King, P. Olivo, L. Crippa*, R. Micheloni*, Università degli Studi di Ferrara, *Microsemi Corporation

NAND Flash memories are complex systems that include many heterogeneous blocks that must work together to ensure highly reliable information storage. Many efforts in the reliability community are devoted to investigating the reliability loss of this storage medium from a cell device physics point of view, whereas little importance is given to the other blocks that constitute such a system. In this work we present a reliability threat to NAND Flash memories that arises in the high-voltage circuitry of the memory: the dependence on the power supply. Through the experimental characterization of TLC mid-1X samples and SPICE simulations of the high-voltage blocks, we have investigated the possible sources of this new reliability issue.