Design of a Novel Precision Scalable Multiplier to Improve Quantized Neural Network Computation on a Low-Power RISC-V Processor

Edward Manca

Supervisor: Mario Roberto Casu. Politecnico di Torino, Master's degree programme in Computer Engineering (Ingegneria Informatica), 2023

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:	In recent years, accelerating Neural Network (NN) algorithms has become an important research topic. Among the possible approaches, each with different tradeoffs, this thesis focuses on implementing a Precision Scalable (PS) multiplier placed inside the pipeline of the Ibex low-power RISC-V processing core. A PS multiplier can execute more than one multiplication (MUL) in parallel when the operations have a lower precision than the maximum the multiplier supports. The target of this architecture is the acceleration of Quantized Neural Networks (QNNs). QNNs are an optimization of standard NNs that use only integer operations to ease deployment on low-power, performance-constrained devices, which are the intended use case of the Ibex core. Starting from the standard Baugh-Wooley (BW) multiplier architecture, I derived a novel PS multiplier whose precision can be changed from 4 to 16 bits. Parallel execution at reduced precision involves two types of operations: Sum-Separate (SS), which provides independent MUL results, and Sum-Together (ST), in which the results are also summed before being returned. Moreover, I added support for the Multiply and Accumulate (MAC) class of instructions for all MUL instructions, including SS and ST. These instructions suit different contexts within the QNN acceleration domain, given the differences in optimal execution among the most common ML algorithms for the edge scenario. On the basis of the two multiplier units originally present in the Ibex core, I defined two conceptually similar structures, both as a purely behavioral description and as a directly synthesizable gate-level description. In the first case, synthesis tools are able to derive the most appropriate multipliers capable of meeting the PS execution requirements. In the latter case, I used the previously mentioned PS BW multiplier as a basis, adapting it to this context.
The final configurations of the multiplier structures I designed can execute a single 32-bit MUL, two 16-bit MULs, four 8-bit MULs, or eight 4-bit MULs in parallel, all supporting the SS and ST modes as well as MAC operations. These configurations define a set of custom instructions to be added to the RISC-V ISA. To support the new instructions, I also extended the decode unit of the Ibex core. The added instructions thus belong to the MUL and MAC classes, with precision ranging from 32 bits down to 4 bits; at the lowest precision, up to 8 operations are replaced by a single instruction that executes them in parallel. To compile high-level C/C++ code, I modified the GCC compiler to accept the assembly format of the new instructions, so that they can be included as inline assembly in C/C++ source code and compiled easily. To evaluate the performance of the modified processor with respect to the original design, I developed a set of three benchmarks, one for each of the three most common QNN algorithms in the edge scenario: Fully Connected layer, Convolutional layer, and Depth-Wise Convolutional layer. For each of these, I demonstrated the performance advantages of the reduced-precision instructions, using as a metric the execution time of each modified Ibex core running on an FPGA board. Finally, I compared all Ibex core variations, targeting silicon synthesis on a 28 nm production process, and analyzed power, performance, and area metrics to derive more comprehensive conclusions.
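The difference between the Sum-Separate and Sum-Together modes named in the abstract can be illustrated with a small software reference model for the four 8-bit MULs configuration. This is only a sketch under stated assumptions: the function names (`pack8`, `lane8`, `ss_mul8`, `st_mac8`) are hypothetical and are not the thesis's instruction mnemonics, operands are assumed to be signed, and the Sum-Separate model returns the products in an array because the abstract does not say how the hardware packs independent results into a register.

```c
#include <stdint.h>

/* Hypothetical helper: pack four signed 8-bit values into one 32-bit word,
   lane 0 in the least significant byte. */
static uint32_t pack8(int a0, int a1, int a2, int a3) {
    return ((uint32_t)a0 & 0xFFu)
         | (((uint32_t)a1 & 0xFFu) << 8)
         | (((uint32_t)a2 & 0xFFu) << 16)
         | (((uint32_t)a3 & 0xFFu) << 24);
}

/* Extract lane i (0..3) of a packed word and sign-extend it. */
static int32_t lane8(uint32_t w, int i) {
    int32_t v = (int32_t)((w >> (8 * i)) & 0xFFu);
    return v >= 128 ? v - 256 : v;
}

/* Sum-Separate (assumed semantics): four independent products.
   The output array is a modelling choice, not the hardware result format. */
static void ss_mul8(uint32_t a, uint32_t b, int32_t out[4]) {
    for (int i = 0; i < 4; i++) {
        out[i] = lane8(a, i) * lane8(b, i);
    }
}

/* Sum-Together with accumulation (assumed semantics): the four products
   are summed and added to acc, as in a dot-product step. Passing acc = 0
   models the plain Sum-Together MUL. */
static int32_t st_mac8(uint32_t a, uint32_t b, int32_t acc) {
    for (int i = 0; i < 4; i++) {
        acc += lane8(a, i) * lane8(b, i);
    }
    return acc;
}
```

In this model, one call to `st_mac8` stands in for four scalar multiply-accumulate steps of a dot product, the kind of inner loop found in a Fully Connected layer, while `ss_mul8` keeps the four products apart for cases where they belong to different outputs.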
Supervisors:	Mario Roberto Casu
Academic year:	2022/23
Publication type:	Electronic
Number of pages:	117
Subjects:
Degree programme:	Master's degree programme in Computer Engineering (Ingegneria Informatica)
Degree class:	New regulations > Master's degree > LM-32 - COMPUTER ENGINEERING
Partner companies:	NOT SPECIFIED
URI:	http://webthesis.biblio.polito.it/id/eprint/27724
