Politecnico di Torino

Integration of a Precision Scalable Multiplier in the AHA CGRA Framework

Boris Assenov Alexiev

Integration of a Precision Scalable Multiplier in the AHA CGRA Framework.

Rel. Mario Roberto Casu, Edward Manca. Politecnico di Torino, Corso di laurea magistrale in Ingegneria Elettronica (Electronic Engineering), 2025

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.
Abstract:

Modern neural networks continue to expand in parameter count and number of layers, pushing the limits of training and inference hardware. Since convolutional, fully connected, and attention layers are dominated by multiply–accumulate (MAC) operations, the throughput and energy cost of multiplication largely determine overall system efficiency. Two techniques are especially effective: reducing numeric precision via quantization of weights and activations, and exploiting hardware parallelism to perform many reduced-precision multiplications concurrently. This thesis explores the latter technique by integrating a reconfigurable multiplier into the Agile Hardware Approach (AHA) design flow, targeting coarse-grained reconfigurable arrays (CGRAs). The integrated unit is precision-scalable, enabling more efficient hardware utilization as the bit-width required by the workload varies. The goal is to accelerate multiplication-intensive kernels, typical of neural network layers, by executing multiple operations in parallel on the same hardware unit when the bit-width is reduced. Several architectural variants are examined: accumulation performed inside the processing element (PE) versus in an external register file; splitting outputs into low- and high-bit paths to improve data reuse and routing; and input reordering to enable operand decomposition and independent sub-word parallelism. For each option, the thesis outlines microarchitectural trade-offs, placement within the PE datapath, and expected impacts on area, frequency, and utilization. Although some variants are not presently feasible within the AHA framework, they reveal promising directions and design patterns for future exploration. The current AHA mapping methodology limits the number of micro-operations per instruction within a PE. To mitigate this constraint, the work proposes practical compiler-flow modifications.
In particular, targeted changes to rewrite rules and mapping files allow the toolchain to recognize precision-scalable multiply operators and emit the required control sequences without overhauling the broader infrastructure. These adjustments preserve compatibility with the existing AHA workflow while enabling precision-scalable operations. An experimental implementation validates the approach: custom Verilog for the PE, paired with updated rewrite rules and mapping files, integrates cleanly into the AHA flow. RTL simulations confirm the functional correctness of reduced-precision multiplication and indicate favorable throughput and efficiency trends, suggesting benefits for future neural-network accelerators based on this approach, even though several variants remain impractical under current constraints. Collectively, the results provide a path for introducing reconfigurable, precision-scalable arithmetic into CGRAs and identify the remaining trade-offs for subsequent work.
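The operand-decomposition principle behind a precision-scalable multiplier can be illustrated with a short arithmetic model. This is a hypothetical sketch of the generic sub-word parallelism technique, not code from the thesis: four 8×8 sub-multipliers are either combined into one 16×16 product in full-precision mode, or run as four independent 8×8 multiplications in reduced-precision mode, reusing the same hardware.

```python
# Illustrative model (assumed, not the thesis RTL): a precision-scalable
# multiplier built from four 8x8 sub-multipliers.

def split16(x):
    """Split an unsigned 16-bit operand into (high, low) 8-bit halves."""
    return (x >> 8) & 0xFF, x & 0xFF

def mul16_from_8x8(a, b):
    """Full-precision mode: one 16x16 product assembled from the four
    8x8 partial products, using a*b = (ah*bh << 16)
    + ((ah*bl + al*bh) << 8) + al*bl."""
    ah, al = split16(a)
    bh, bl = split16(b)
    return (ah * bh << 16) + ((ah * bl + al * bh) << 8) + al * bl

def mul8x4(a_list, b_list):
    """Reduced-precision mode: the same four sub-multipliers deliver
    four independent 8x8 products per cycle."""
    return [(a & 0xFF) * (b & 0xFF) for a, b in zip(a_list, b_list)]
```

In full-precision mode the partial products are shifted and summed into one result; in reduced-precision mode that combining logic is bypassed and each sub-product is routed out independently, which is what lets the same datapath execute several neural-network multiplications in parallel when operands are quantized to fewer bits.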

Supervisors: Mario Roberto Casu, Edward Manca
Academic year: 2025/26
Publication type: Electronic
Number of pages: 59
Subjects:
Degree course: Corso di laurea magistrale in Ingegneria Elettronica (Electronic Engineering)
Degree class: Nuovo ordinamento > Laurea magistrale > LM-29 - INGEGNERIA ELETTRONICA
Collaborating companies: Politecnico di Torino
URI: http://webthesis.biblio.polito.it/id/eprint/38696