Politecnico di Torino

Design of a distributed control unit for reconfigurable CNN accelerators

Nicolo' Morando

Design of a distributed control unit for reconfigurable CNN accelerators.

Supervisors: Andrea Calimera, Valerio Tenace. Politecnico di Torino, Master's degree in Computer Engineering (Corso di laurea magistrale in Ingegneria Informatica), 2018

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.


Over the last few years, deep learning (DL) has become pervasive in many scientific and industrial fields. The effectiveness of DL techniques, aided by the widespread availability of user-friendly tools developed by big ICT companies (Google and Facebook, to name a few), is pushing the state of the art in artificial intelligence and has made Convolutional Neural Networks (CNNs) a de facto standard for visual reasoning applications. CNNs are complex computational models inspired by the mechanisms that regulate the primary visual cortex of the brain, where images captured by the eyes are processed to extract meaning from the surrounding environment (e.g., face recognition through feature detection). A typical CNN is composed of an input layer that feeds images to the computational stages, an output layer that produces the final answer to the classification task, and several hidden layers where feature extraction takes place. From a functional perspective, CNNs can be divided into two main regions: feature extraction and classification. The former is where most computations take place, and it is mainly composed of a specific kind of layer: the convolutional (CONV) layer, where several multidimensional matrix-vector multiplications are carried out between input images (or feature maps) and abstract filters learned by the CNN itself. Since even the simplest CNN model contains several thousand different filters, it is not surprising that the huge computational effort required to run DL algorithms is rapidly becoming a serious concern. The problem is exacerbated by the fact that most computing hardware platforms are not yet tailored to execute DL algorithms efficiently. For these reasons, a number of dedicated hardware accelerators for DL applications have recently been introduced.
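To make the computational cost of a CONV layer concrete, the following is a minimal Python sketch of a direct convolution written as nested multiply-accumulate (MAC) loops. It is an illustration only and is not taken from the thesis: the function name `conv2d_direct`, the nested-list data layout, and the choice of stride 1 with no padding are all assumptions.

```python
def conv2d_direct(image, filters):
    """Direct CONV layer (stride 1, no padding), illustrative only.

    image:   C x H x W nested lists (input channels, height, width)
    filters: K x C x R x S nested lists (K filters, each C x R x S)
    Returns (output, macs): K x (H-R+1) x (W-S+1) feature maps and the
    number of multiply-accumulate operations performed.
    Note: like most CNN frameworks, this computes a cross-correlation
    (the filter is not flipped).
    """
    C, H, W = len(image), len(image[0]), len(image[0][0])
    K, R, S = len(filters), len(filters[0][0]), len(filters[0][0][0])
    out_h, out_w = H - R + 1, W - S + 1
    output = [[[0.0] * out_w for _ in range(out_h)] for _ in range(K)]
    macs = 0
    for k in range(K):                      # one output feature map per filter
        for y in range(out_h):
            for x in range(out_w):
                acc = 0.0
                for c in range(C):          # accumulate over input channels
                    for r in range(R):
                        for s in range(S):
                            acc += image[c][y + r][x + s] * filters[k][c][r][s]
                            macs += 1
                output[k][y][x] = acc
    return output, macs


# Toy input: one 3x3 channel and one 2x2 filter (hypothetical values).
image = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
filters = [[[[1, 0], [0, 1]]]]
out, macs = conv2d_direct(image, filters)
print(out, macs)
```

Under these assumptions the MAC count is K * C * R * S * (H - R + 1) * (W - S + 1), so it grows multiplicatively with the number of filters, channels, and output pixels, which is the effort the abstract refers to.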
Because they are composed of several processing elements (PEs) capable of carrying out specific mathematical operations, these ad hoc accelerators can dramatically reduce execution times and the energy per operation. However, in most cases, information sharing among PEs is only partially exploited, leaving room for substantial performance improvements. In this thesis we present a hardware-software co-design tool called INRI, which makes it possible to deploy a fine-tuned dataflow for specific architectures, such that superfluous data movements and power consumption are minimized. We investigate different techniques, including data reuse, smart activation/deactivation policies for PEs in the idle state, and specific pixel-clustering algorithms. The performance of the proposed tool is supported by experimental results obtained with different hardware configurations running well-known CNN models, such as AlexNet, VGG-16 and ZFNet. Results demonstrate that our approach can reduce the energy consumption of CNNs by 25%, with an acceptable accuracy loss of 2%.
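The idea of deactivating idle PEs can be illustrated with a toy energy model. The sketch below is not the INRI tool and does not reproduce the thesis's policies or results: the function `array_energy`, the per-PE energy costs, the 16-PE array, and the schedule are hypothetical values chosen only to show the arithmetic.

```python
def array_energy(active_per_cycle, num_pes, e_active=1.0, e_idle=0.5,
                 gate_idle=False):
    """Toy energy estimate (arbitrary units) for a PE array.

    active_per_cycle: number of busy PEs in each cycle of a schedule
    num_pes:          total PEs in the array
    e_active, e_idle: assumed per-cycle energy of a busy / idle PE
    gate_idle:        if True, idle PEs are switched off and cost nothing
    """
    total = 0.0
    for active in active_per_cycle:
        if not 0 <= active <= num_pes:
            raise ValueError("active PE count out of range")
        idle = num_pes - active
        idle_cost = 0.0 if gate_idle else e_idle
        total += active * e_active + idle * idle_cost
    return total


# Hypothetical schedule: utilisation drops in the last two cycles.
schedule = [16, 16, 12, 4]
baseline = array_energy(schedule, num_pes=16)
gated = array_energy(schedule, num_pes=16, gate_idle=True)
print(baseline, gated)
```

In this model the saving comes entirely from cycles in which the array is under-utilised, so the benefit of a deactivation policy depends on how well the dataflow keeps the PEs busy.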

Supervisors: Andrea Calimera, Valerio Tenace
Academic year: 2018/19
Publication type: Electronic
Number of Pages: 82
Degree programme: Master's degree in Computer Engineering (Corso di laurea magistrale in Ingegneria Informatica)
Degree class: New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING
Joint supervision institution: UPM - ETSIT - Universidad Politécnica de Madrid (Spain)
Collaborating companies: UNSPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/9071