
Reconfigurable solutions to increase GPGPUs reliability

Pierpaolo Narducci

Reconfigurable solutions to increase GPGPUs reliability.

Supervisors: Matteo Sonza Reorda, Josie Esteban Rodriguez Condia, Luca Sterpone, Boyang Du. Politecnico di Torino, Master's degree programme in Electronic Engineering (Ingegneria Elettronica), 2020

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:

The aim of this thesis is to propose different implementations that improve fault mitigation in the reference FlexGrip model, which represents a simplified version of the NVIDIA GPU architecture. Such solutions are not widely included in GPGPUs because of the limited reliability requirements of the applications they were originally intended for. Three solutions have been developed and are presented in the thesis document:

- A dynamically configurable built-in self-repair (BISR) mechanism aimed at reducing the impact of permanent faults in the Scalar Processor (SP) cores of GPGPUs. This first mechanism is based on spare SP modules that replace a faulty SP once a fault affecting it is detected. The architecture places cold stand-by modules (Spare SPs, or SSPs) in parallel with the existing SPs. Two switching units, based on meta-crossbar structures, handle the input and output data-path interconnections of the SPs. A purpose-built instruction controls the replacement of a faulty SP with an SSP. This method is flexible because it does not require any change in the application software. Experimental results show that the solution introduces a moderate area overhead. The strategy seems particularly suitable for long-term missions, since it mitigates the effects of fault accumulation in the SP cores.

- A dynamic duplication with comparison (DDWC) mechanism intended to harden the Scalar Processor units in GPGPUs. This second mechanism targets the detection of permanent faults that may arise inside the SPs. The architecture adds one SP unit that computes the same operations as a selected SP. A reconfiguration instruction dynamically selects the target SP to be monitored. Experimental results show that the proposed mechanism introduces a limited area overhead while significantly increasing the in-field fault detection capabilities of the GPGPU. Thanks to its flexibility, low hardware overhead, and moderate performance degradation, this strategy could be effectively employed to increase the reliability of GPGPUs adopted in safety-critical applications.

- A combined solution that merges dynamic self-repair (BISR) and dynamic duplication with comparison (DDWC) to obtain a robust fault mitigation solution. This last mechanism allows BISR and DDWC to be used at the same time. The number of cold stand-by modules placed in parallel with the existing SPs is configurable from 1 up to 8, and each module can support both BISR and DDWC. Two switching units, based on meta-crossbar structures, handle the input and output data-path interconnections of the SPs. A configuration instruction controls the switching units and thereby implements the mechanism. Compared with the two previous architectures, this is the most complete and optimized one. Experimental results show that the solution introduces a moderate area overhead (up to 14% in the worst scenario) and a moderate performance degradation. This strategy mitigates the effects of fault accumulation in the SP cores and increases the reliability of GPGPUs.
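
The interplay between the two mechanisms can be illustrated with a small behavioural model. The sketch below is not taken from the thesis or from FlexGrip: the class `SPArray`, its methods, the stuck-at fault in `faulty_add`, and the choice of 8 SPs with 2 spares are all hypothetical, chosen only to show the idea of a checker spare detecting a mismatch (DDWC) and a second spare taking over the faulty lane (BISR).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Op = Callable[[int, int], int]  # behavioural stand-in for one SP data path


def healthy_add(a: int, b: int) -> int:
    """Fault-free 32-bit addition."""
    return (a + b) & 0xFFFFFFFF


def faulty_add(a: int, b: int) -> int:
    """Hypothetical permanent fault: result bit 0 is stuck at 0."""
    return healthy_add(a, b) & 0xFFFFFFFE


@dataclass
class SPArray:
    """Hypothetical model: primary SPs plus cold stand-by spares (SSPs)."""
    lanes: List[Op]
    spares: List[Op]
    remap: Dict[int, int] = field(default_factory=dict)  # lane -> spare (BISR)
    monitored: Optional[int] = None                      # lane under DDWC
    checker: Optional[int] = None                        # spare used as checker

    def _free_spare(self) -> int:
        used = set(self.remap.values())
        if self.checker is not None:
            used.add(self.checker)
        for idx in range(len(self.spares)):
            if idx not in used:
                return idx
        raise RuntimeError("no free spare SP")

    def configure_ddwc(self, lane: int) -> None:
        """Stand-in for the reconfiguration instruction: pick the SP to monitor."""
        if self.checker is None:
            self.checker = self._free_spare()
        self.monitored = lane

    def repair(self, lane: int) -> None:
        """Stand-in for the repair instruction: route the lane to a free spare."""
        self.remap[lane] = self._free_spare()

    def execute(self, operands: List[Tuple[int, int]]) -> Tuple[List[int], List[int]]:
        """Run one operation per lane; return results and mismatching lanes."""
        results, mismatches = [], []
        for lane, (a, b) in enumerate(operands):
            unit = self.spares[self.remap[lane]] if lane in self.remap else self.lanes[lane]
            value = unit(a, b)
            if lane == self.monitored and self.spares[self.checker](a, b) != value:
                mismatches.append(lane)
            results.append(value)
        return results, mismatches


lanes = [healthy_add] * 8
lanes[3] = faulty_add  # inject the hypothetical fault into SP 3
array = SPArray(lanes=lanes, spares=[healthy_add, healthy_add])
operands = [(i, i + 1) for i in range(8)]

array.configure_ddwc(3)
before, detected = array.execute(operands)
array.repair(3)
after, still_detected = array.execute(operands)
print(detected, still_detected, before[3], after[3])  # [3] [] 6 7
```

In this model the application-level call to `execute` is unchanged before and after the repair, which mirrors the abstract's point that the mechanisms need no change in the application software. The real designs operate on hardware data paths through crossbar switching units, which this sketch reduces to a dictionary lookup.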

Supervisors: Matteo Sonza Reorda, Josie Esteban Rodriguez Condia, Luca Sterpone, Boyang Du
Academic year: 2019/20
Publication type: Electronic
Number of Pages: 56
Subjects:
Degree programme: Master's degree programme in Electronic Engineering (Ingegneria Elettronica)
Degree class: New organization > Master science > LM-29 - ELECTRONIC ENGINEERING
Collaborating companies: UNSPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/14482