Andrea Redoglia
Hardware-Software Codesign of an Accelerator for Quantized Neural Networks in a Low-Power SoC.
Supervisors: Mario Roberto Casu, Luca Urbinati, Edward Manca. Politecnico di Torino, Master's degree programme in Electronic Engineering, 2024
License: Creative Commons Attribution Non-commercial No Derivatives.
Abstract: |
Recent advances in artificial intelligence and the rapid growth of real-world applications based on Neural Networks (NNs) pose new challenges in optimizing their execution. Edge computing aims to address these challenges. Built on embedded, low-power System-on-Chip (SoC) devices, it is a natural target for NN deployment. However, these devices usually lack specialized hardware to execute the target workloads efficiently. Moreover, the main operations required by NNs are Multiply-and-Accumulate (MAC) operations on high-precision operands, which can result in a number of operations too large for CPU-based SoCs to handle efficiently. In this context, two optimizations are possible. On the one hand, NNs can be trained using low-precision operands, such as 8-bit integers, yielding Quantized NNs (QNNs). On the other hand, SoCs can integrate dedicated accelerators that handle the computational patterns of specific NN layers. Starting from QNNs obtained with the first optimization, this thesis focuses on the second and provides a path for the codesign of custom hardware and tightly coupled software that uses it to enable efficient NN computation. The accelerator presented uses precision-scalable multipliers, that is, multipliers that increase the number of operations performed in parallel when the operands have reduced precision. One such approach is the Sum-Together (ST) multiplier, which, when operands have reduced precision, performs several multiplications in parallel and sums the independent products before returning the result. For example, a 16-bit ST multiplier can compute one 16-bit, two 8-bit, or four 4-bit multiplications in parallel. In the latter two cases the products are summed together, so more MACs are computed at the same time at reduced precision, and latency decreases as operand precision decreases.
I integrated an ST MAC unit into an accelerator that also includes the memory and control logic needed for processing. To integrate the final architecture into an SoC for verification, I used the Embedded Scalable Platform (ESP), a design-automation tool developed by Columbia University. The accelerator was described in C++ and synthesized with Mentor Catapult HLS. The final SoC is composed of four tiles: the first is a memory interface that connects the Network-on-Chip (NoC) to the external DDR; the second is the low-power 32-bit RISC-V Ibex core, developed by ETH Zurich and the University of Bologna; the third is my accelerator; and the fourth is an I/O tile. After defining this structure, I generated the SoC RTL description with the ESP automated integration flow. I then developed the bare-metal software baseline, which implements a memory tiling algorithm that partitions the data so that each portion fits in the Private Local Memory (PLM) of the accelerator. The hardware and software were tested together in simulation in the QuestaSim environment. Finally, the custom software was integrated into TensorFlow Lite for Microcontrollers (TFLM), an open-source ML inference framework. |
---|---|
Supervisors: | Mario Roberto Casu, Luca Urbinati, Edward Manca |
Academic year: | 2024/25 |
Publication type: | Electronic |
Number of pages: | 83 |
Subjects: | |
Degree programme: | Master's degree programme in Electronic Engineering |
Degree class: | New system > Master's degree > LM-29 - ELECTRONIC ENGINEERING |
Collaborating companies: | NOT SPECIFIED |
URI: | http://webthesis.biblio.polito.it/id/eprint/34102 |
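
The Sum-Together behaviour described in the abstract (one 16-bit, two 8-bit, or four 4-bit products, summed when there is more than one) can be illustrated with a small behavioural model in C++. This is only a sketch under stated assumptions, not the thesis's HLS source: the names `st_multiply` and `sign_extend` are hypothetical, the sub-words are assumed to be packed from the least significant bits upward, and the operands are assumed to be signed two's-complement values, which the abstract does not specify.

```cpp
#include <cassert>
#include <cstdint>

// Sign-extend the low `bits` bits of `v` to a 32-bit signed integer.
// (Hypothetical helper; assumes two's-complement sub-words.)
static int32_t sign_extend(uint32_t v, int bits) {
    const uint32_t mask = (1u << bits) - 1u;
    const uint32_t sign = 1u << (bits - 1);
    v &= mask;
    return static_cast<int32_t>(v ^ sign) - static_cast<int32_t>(sign);
}

// Behavioural model of a 16-bit Sum-Together multiplier (hypothetical name).
// `precision` is the sub-word width in bits: 16, 8, or 4.
// Each 16-bit operand is viewed as 16 / precision packed sub-words;
// matching sub-words are multiplied and the products are summed.
int32_t st_multiply(uint16_t a, uint16_t b, int precision) {
    assert(precision == 16 || precision == 8 || precision == 4);
    const int lanes = 16 / precision;  // 1, 2, or 4 parallel products
    int32_t sum = 0;
    for (int i = 0; i < lanes; ++i) {
        const int32_t ai = sign_extend(a >> (i * precision), precision);
        const int32_t bi = sign_extend(b >> (i * precision), precision);
        sum += ai * bi;  // in hardware, all lanes are computed concurrently
    }
    return sum;
}
```

With these assumptions, `st_multiply(0x0203, 0x0405, 8)` treats the operands as the 8-bit pairs (2, 3) and (4, 5) and returns 2*4 + 3*5 = 23, while the same call with `precision` set to 16 returns the single product 515 * 1029. The loop only models the arithmetic result; the latency benefit comes from the hardware evaluating all lanes in one multiplier pass.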
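
The memory tiling step mentioned in the abstract, where data is divided so that it fits in the accelerator's Private Local Memory, might look roughly like the following C++ sketch. The abstract does not describe the actual algorithm, so this shows only the simplest one-dimensional case: `Tile`, `make_tiles`, and the `plm_capacity` parameter are hypothetical names, and the real implementation presumably tiles along NN layer dimensions rather than over a flat buffer.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One contiguous chunk of a buffer, small enough to fit in the PLM.
// (Hypothetical type, introduced only for this illustration.)
struct Tile {
    std::size_t offset;  // index of the first element in the chunk
    std::size_t length;  // number of elements in the chunk
};

// Split `total` elements into chunks of at most `plm_capacity` elements.
// Every chunk is full-sized except possibly the last one.
std::vector<Tile> make_tiles(std::size_t total, std::size_t plm_capacity) {
    std::vector<Tile> tiles;
    if (plm_capacity == 0) {
        return tiles;  // nothing can be loaded into an empty PLM
    }
    for (std::size_t off = 0; off < total; off += plm_capacity) {
        tiles.push_back({off, std::min(plm_capacity, total - off)});
    }
    return tiles;
}
```

For example, `make_tiles(10, 4)` yields three chunks with (offset, length) pairs (0, 4), (4, 4), and (8, 2). A bare-metal driver could then iterate over the chunks, transferring each one to the accelerator and collecting the partial results.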