Hardware Accelerator for LSTM Neural Networks using High-Level Synthesis

Chen Xie

Hardware Accelerator for LSTM Neural Networks using High-Level Synthesis.

Supervisors: Massimo Poncino, Daniele Jahier Pagliari. Politecnico di Torino, Master's degree programme in Electronic Engineering (Ingegneria Elettronica), 2020

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.

Neural networks are widely used in applications such as machine translation and speech recognition. Among the different types of neural networks, recurrent neural networks (RNNs) based on the Long Short-Term Memory (LSTM) architecture have become popular for processing time series. To improve accuracy, the size of LSTM models continues to grow. Matrix-vector multiplications (MxV) are the most computation-intensive and time-consuming operations in LSTM inference. Because they can perform these operations with high performance and low power consumption, Field-Programmable Gate Arrays (FPGAs) have become popular platforms for accelerating LSTM inference. On FPGAs, two issues have become prominent: finding the best accelerator architecture for a given objective, and combining it with algorithm-level optimizations. In particular, one of the most common optimizations for LSTMs is weight pruning, which reduces the number of computations and the memory occupation by transforming the dense MxV into a sparse matrix-vector multiplication (SpMxV). Accelerating SpMxV raises new issues, such as managing unstructured sparse matrices and their irregular memory access patterns. In this thesis, a new LSTM accelerator for FPGAs is proposed, which addresses the two aforementioned problems. On the one hand, the complexity of design space exploration is tackled using high-level synthesis (HLS), which allows many different implementations to be generated from the same high-level specification by changing a few synthesis directives. As a result, several accelerator implementations have been realized, among which a system designer can select according to their requirements. On the other hand, the proposed accelerator is made compatible with a popular constrained pruning method for LSTMs, known as Bank-Balanced Sparsity (BBS), which maintains model accuracy at high sparsity levels while still enabling an efficient FPGA implementation.
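To make the idea of bank-balanced pruning and the resulting SpMxV more concrete, the following is a minimal C++ sketch and not the code of the thesis. It assumes the usual formulation of BBS, in which every matrix row is split into equal-size banks and the same number of largest-magnitude weights is kept in each bank. All names and sizes (BbsMatrix, bbs_prune, bbs_spmxv, ROWS, COLS, BANKS, KEEP) are hypothetical, and the HLS directives appear only as comments, marking where a Vivado HLS design could place them, so that the sketch builds with any C++ compiler.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical sizes, chosen only for illustration.
constexpr std::size_t ROWS = 2;                   // output vector length
constexpr std::size_t COLS = 8;                   // input vector length
constexpr std::size_t BANKS = 2;                  // banks per matrix row
constexpr std::size_t BANK_SIZE = COLS / BANKS;   // columns per bank
constexpr std::size_t KEEP = 2;                   // weights kept per bank
static_assert(COLS % BANKS == 0, "banks must divide a row evenly");
static_assert(KEEP <= BANK_SIZE, "cannot keep more weights than a bank holds");

// Compressed storage: every bank holds exactly KEEP (value, offset) pairs,
// so all banks have the same workload and a fixed memory footprint.
struct BbsMatrix {
    float value[ROWS][BANKS][KEEP];
    std::uint8_t index[ROWS][BANKS][KEEP];  // column offset inside the bank
};

// Prune a dense matrix: in each bank keep the KEEP largest-magnitude weights.
void bbs_prune(const float dense[ROWS][COLS], BbsMatrix& out) {
    for (std::size_t r = 0; r < ROWS; ++r) {
        for (std::size_t b = 0; b < BANKS; ++b) {
            bool taken[BANK_SIZE] = {};
            for (std::size_t k = 0; k < KEEP; ++k) {
                std::size_t best = 0;
                float best_mag = -1.0f;  // below any magnitude, so zeros qualify
                for (std::size_t i = 0; i < BANK_SIZE; ++i) {
                    const float mag = std::fabs(dense[r][b * BANK_SIZE + i]);
                    if (!taken[i] && mag > best_mag) {
                        best = i;
                        best_mag = mag;
                    }
                }
                taken[best] = true;
                out.value[r][b][k] = dense[r][b * BANK_SIZE + best];
                out.index[r][b][k] = static_cast<std::uint8_t>(best);
            }
        }
    }
}

// Sparse matrix-vector product y = W * x on the compressed matrix.
void bbs_spmxv(const BbsMatrix& w, const float x[COLS], float y[ROWS]) {
    for (std::size_t r = 0; r < ROWS; ++r) {
        // #pragma HLS PIPELINE  (possible directive: overlap successive rows)
        float acc = 0.0f;
        for (std::size_t b = 0; b < BANKS; ++b) {
            // #pragma HLS UNROLL  (possible directive: process banks in parallel)
            for (std::size_t k = 0; k < KEEP; ++k) {
                const std::size_t offset = w.index[r][b][k];
                assert(offset < BANK_SIZE);  // reads never leave the bank
                acc += w.value[r][b][k] * x[b * BANK_SIZE + offset];
            }
        }
        y[r] = acc;
    }
}
```

Because each bank always contributes the same number of products and only reads its own slice of the input vector, the loop bounds are fixed at compile time. That regularity is what makes this kind of sparsity easier to map onto parallel hardware than unstructured sparsity. Enabling or changing the commented directives is one way in which different implementations could be derived from the same source.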
The proposed design has been written in C++, synthesized using Xilinx Vivado HLS and deployed on a Xilinx Zynq System-on-Chip (SoC). These SoCs include an ARM processor alongside the FPGA fabric; the processor has been programmed with a software driver that triggers the accelerator and collects the results. After implementation, the performance of accelerators of different sizes is evaluated. Speedups of 20x and 44x are achieved with low resource occupation, and power consumption is 5x lower than that of an Intel Core i5-4460 CPU. Finally, a complete design space exploration has been performed by implementing the design with a variety of optimization directives.

Supervisors: Massimo Poncino, Daniele Jahier Pagliari
Academic year: 2019/20
Publication type: Electronic
Number of Pages: 97
Degree programme: Master's degree programme in Electronic Engineering (Ingegneria Elettronica)
Degree class: New organization > Master science > LM-29 - ELECTRONIC ENGINEERING
Collaborating companies: UNSPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/14465