
Implementation of a hardware accelerator for Deep Neural Networks based on Sparse Representations of Feature Maps

Yu Hao

Implementation of a hardware accelerator for Deep Neural Networks based on Sparse Representations of Feature Maps.

Rel. Maurizio Martina. Politecnico di Torino, Corso di laurea magistrale in Ingegneria Elettronica (Electronic Engineering), 2022

PDF (Tesi_di_laurea) - Tesi (10MB)
License: Creative Commons Attribution Non-commercial No Derivatives.

Archive (ZIP) (Documenti_allegati) - Other (18MB)
License: Creative Commons Attribution Non-commercial No Derivatives.
Abstract:

Deep learning, one of the most remarkable machine learning techniques of recent years, has achieved great success in many fields such as speech recognition, image analysis, and autonomous driving. However, a neural network requires billions of multiply-and-accumulate (MAC) operations, which makes single-frame inference slow and energy-hungry. To address these limitations, researchers from the University of Zurich and ETH Zurich developed NullHop, a flexible and efficient hardware accelerator architecture that exploits the sparsity of neuron activations. NullHop uses a novel sparse matrix compression algorithm to encode the input data into two elements: a Sparsity Map (SM) and a Non-Zero Value List (NZVL). This scheme reduces both computation time and energy consumption thanks to two main features: 1) it skips zero-value pixels in the input layers without wasting clock cycles or performing redundant MACs; 2) the compression reduces the external memory requirements and, with them, the large energy cost of every memory access. This thesis work targets the implementation of a hardware accelerator based on NullHop in a hardware description language (VHDL). Simulation results from ModelSim show that the accelerator completes the computation of one input layer of dimension 6×6 with 16 input channels, a 3×3 kernel, 128 output channels, ReLU enabled, and 2×2 max-pooling enabled in 2900 clock cycles (around 2270 cycles in the computation unit). For a 64×64×16 input layer with a 3×3 kernel and 128 output channels, the total time is 292255 cycles. The accelerator not only reduces latency thanks to the sparsity of the input data, but also reduces the workload of the MAC units, since no zero-value pixel is forwarded to the computation unit. Furthermore, ReLU and max-pooling are performed on the fly during computation, which brings further savings.
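As an illustration of the SM/NZVL compression scheme described in the abstract, the following is a minimal Python sketch, not the thesis's VHDL implementation: it assumes the SM is a one-bit-per-pixel mask of non-zero positions and the NZVL stores the non-zero activations in scan order; the function names encode_sm_nzvl and decode_sm_nzvl are illustrative only.

import numpy as np

def encode_sm_nzvl(feature_map):
    """Encode a feature map into a Sparsity Map (SM) and a Non-Zero Value List (NZVL).

    SM: one bit per pixel, 1 where the activation is non-zero.
    NZVL: only the non-zero activations, in scan order.
    """
    flat = np.asarray(feature_map).ravel()
    sm = (flat != 0).astype(np.uint8)   # would be a packed bit vector in hardware
    nzvl = flat[flat != 0]              # zero-value pixels are never stored or forwarded
    return sm, nzvl

def decode_sm_nzvl(sm, nzvl, shape):
    """Rebuild the dense feature map from SM + NZVL (for verification)."""
    flat = np.zeros(sm.size, dtype=nzvl.dtype if nzvl.size else np.float32)
    flat[sm.astype(bool)] = nzvl
    return flat.reshape(shape)

if __name__ == "__main__":
    fmap = np.array([[0, 3, 0],
                     [7, 0, 0],
                     [0, 0, 5]])
    sm, nzvl = encode_sm_nzvl(fmap)
    print("SM:  ", sm)    # [0 1 0 1 0 0 0 0 1]
    print("NZVL:", nzvl)  # [3 7 5]
    assert (decode_sm_nzvl(sm, nzvl, fmap.shape) == fmap).all()

With this encoding, only the pixels flagged in the SM reach the MAC array, which is how the accelerator skips zero-value activations without wasted cycles.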

Relators: Maurizio Martina
Academic year: 2021/22
Publication type: Electronic
Number of Pages: 73
Subjects:
Degree course: Corso di laurea magistrale in Ingegneria Elettronica (Electronic Engineering)
Classe di laurea: New organization > Master science > LM-29 - ELECTRONIC ENGINEERING
Collaborating companies: UNSPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/22727