Mixed-precision Quantization and Inference of MLPerf Tiny DNNs on Precision-Scalable Hardware Accelerators

Marco Alessio Terlizzi

Supervisors: Mario Roberto Casu, Luca Urbinati. Politecnico di Torino, Master's degree programme in Ingegneria Elettronica (Electronic Engineering), 2023

License: Creative Commons Attribution Non-commercial No Derivatives.


Over the past ten years, Deep Learning has made great strides, with significant advances in a variety of Artificial Intelligence (AI) applications ranging from image classification to speech recognition. Nevertheless, the unprecedented performance attained by Deep Neural Networks (DNNs) comes at the cost of high computational complexity and power consumption, making them unsuitable for deployment on resource-constrained devices such as embedded hardware. As a result, a field known as TinyML has emerged, aiming to develop efficient and accurate models for the ever-growing market of Internet-of-Things (IoT) devices. Moving both training and inference to the edge offers several advantages, including enhanced data privacy, lower latency, and improved energy efficiency. These goals are pursued from multiple angles, such as designing networks that execute fewer operations and reducing the precision of network parameters through quantization. In this regard, this thesis analyzes how mixed-precision quantization can reduce the computational footprint and latency of deep neural networks running on hardware accelerators.

First, QKeras, an open-source quantization library, is used to quantize four neural network architectures from the MLPerf Tiny benchmark, namely MobilenetV1, Resnet, FC-AutoEncoder, and DS-CNN, and to determine an optimal mixed-precision configuration for each. Our findings show that this technique reduces the number of bits by XX while keeping the test accuracy within XX of the floating-point counterparts. Second, the networks are executed in software on precision-scalable hardware accelerators for DNN operators such as 2D convolution (2DConv), depthwise convolution (DWConv), and fully-connected (FC) layers. In particular, these accelerators feature reconfigurable Sum-Together multipliers inside their MAC units, which can compute N = 1, 2, 4 multiplications in parallel on 16/N-bit operands, thus reducing latency when inputs and weights use low precision.
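For reference, the fixed-point rounding performed by a QKeras quantizer such as quantized_bits(bits, integer) can be sketched in pure Python. This is a simplified illustrative model, not the library's actual implementation:

```python
def quantize_fixed_point(x, bits=4, integer=0):
    """Round x onto a signed fixed-point grid with `bits` total bits,
    `integer` integer bits, and one implicit sign bit (a simplified
    model of what a quantizer like QKeras's quantized_bits does)."""
    frac = bits - integer - 1          # fractional bits
    step = 2.0 ** -frac                # quantization step size
    lo = -2.0 ** integer               # most negative representable value
    hi = 2.0 ** integer - step         # most positive representable value
    return min(max(round(x / step) * step, lo), hi)
```

For example, with 4 bits and no integer bits the grid step is 0.125, so 0.3 quantizes to 0.25 and anything above 0.875 saturates; a mixed-precision search picks a different (bits, integer) pair per layer.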
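The Sum-Together idea behind the reconfigurable multipliers can be illustrated in software with a toy unsigned N = 2 example (the actual MAC units operate on signed 16/N-bit operands in hardware): two low-precision operand pairs are packed into one wide multiplication, and their dot product appears in the middle bit field of the result.

```python
def sum_together_n2(a0, a1, b0, b1, bits=4):
    """Compute a0*b0 + a1*b1 with a single wide multiplication by
    packing two unsigned `bits`-bit operand pairs (toy N=2 sketch
    of the Sum-Together technique)."""
    s = 2 * bits + 2                  # field width: each product needs
                                      # 2*bits bits, their sum one more
    packed_a = a0 | (a1 << s)         # A = a0 + a1 * 2**s
    packed_b = b1 | (b0 << s)         # B = b1 + b0 * 2**s (note the swap)
    product = packed_a * packed_b     # one multiplication
    # A*B = a0*b1 + (a0*b0 + a1*b1)*2**s + a1*b0*2**(2s):
    # the dot product sits in the middle s-bit field.
    return (product >> s) & ((1 << s) - 1)
```

This is why lowering operand precision directly lowers latency: the same 16-bit multiplier array produces N partial dot products per cycle instead of one.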
These accelerators are designed in C to take advantage of high-level synthesis (HLS) tools. In the process, we also investigate the effect of reducing the bit-width of some of the accelerators' internal variables, such as the quantization scaling factors, in order to save hardware resources (e.g., multiplier bit-widths) without affecting the accuracy of the four networks. Finally, following the SoC integration design flow enabled by ESP, the 2DConv accelerator was synthesized and integrated into an SoC with a RISC-V processor, and then simulated in QuestaSim, showing a speedup of XX over a pure-software implementation.
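The scaling-factor bit-width reduction can be pictured with a standard integer requantization scheme (a hypothetical sketch for illustration, not the accelerators' exact datapath): the floating-point scale is approximated by an integer mantissa of limited bit-width plus a right shift, so the rescaling multiplier shrinks along with the mantissa.

```python
import math

def fixed_point_scale(scale, mantissa_bits=8):
    """Approximate a positive float scale as m / 2**shift, with the
    integer mantissa m normalized to exactly `mantissa_bits` bits
    (hypothetical scheme for illustration)."""
    shift = mantissa_bits - 1 - math.floor(math.log2(scale))
    return round(scale * (1 << shift)), shift

def requantize(acc, m, shift):
    """Rescale an integer accumulator: (acc * m) >> shift, rounding
    to nearest by adding half an LSB before the shift."""
    return (acc * m + (1 << (shift - 1))) >> shift
```

For instance, a scale of 1/255 becomes m = 129 with shift = 15, so an accumulator value of 1000 requantizes to 4, matching round(1000/255); narrowing mantissa_bits trades multiplier width for a coarser approximation of the scale.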

Supervisors: Mario Roberto Casu, Luca Urbinati
Academic year: 2022/23
Publication type: Electronic
Number of Pages: 82
Degree programme: Master's degree in Ingegneria Elettronica (Electronic Engineering)
Degree class: LM-29 - Electronic Engineering (Master of Science)
Collaborating institutions: Politecnico di Torino
URI: http://webthesis.biblio.polito.it/id/eprint/26664