
Development of an Advanced Configurable DMA System for Edge AI Accelerators in a 16nm Low Power RISC-V Microcontroller

Tommaso Terzano

Supervisors: Luciano Lavagno, Luigi Giuffrida, Davide Schiavone. Politecnico di Torino, Master's degree programme in Electronic Engineering, 2024

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial Share Alike.

Abstract:

Artificial Intelligence has been a key driver of technological innovation over the past decade, influencing fields such as image recognition, natural language processing, autonomous driving, and complex system modeling. Edge AI has emerged as a promising alternative to traditional cloud-based solutions: data is processed directly on the device, enabling real-time operation and enhancing privacy and data integrity. Each edge device can acquire data from multiple sensors simultaneously, improving the network's ability to extract meaningful features from the surrounding environment. Microcontrollers are a popular choice for edge devices thanks to their versatility and short time-to-market, but they impose several constraints, such as limited computational resources and tight power budgets, that must be considered during deployment. In this context, X-HEEP is a configurable, extensible, open-source 32-bit RISC-V MCU developed at the Embedded Systems Laboratory (ESL) at EPFL. The primary focus of this thesis is the enhancement of X-HEEP's Direct Memory Access (DMA) system to make it suitable for Edge AI applications. To handle the multiple data streams typical of edge computing, the DMA system has been extended with a configurable number of channels and connected to the system bus through a customizable number of master ports, increasing memory bandwidth in proportion to the number of channels. To reduce power consumption, each DMA channel can also be clock-gated when not in use. Like its cloud-based counterpart, Edge AI relies heavily on neural networks, whose workloads are dominated by matrix operations (GEMM). To boost performance and meet timing and energy requirements, specialized accelerators such as multi-core clusters, in/near-memory macros, and systolic arrays have emerged.
At ESL, two innovative SRAM-based low-power architectures, Caesar and Carus, have been designed to function primarily as memory while also performing scalar and vector computations. To leverage these accelerators for convolutions, the foundation of CNNs, input tensors and filters must be reshaped through a transformation called im2col. If the CPU handles this process, it becomes time-intensive, potentially negating the accelerator's advantage and making data transfer the primary performance bottleneck. To address this challenge and complement these accelerators, this thesis extends X-HEEP's DMA system with 2D transfer capabilities, as well as on-the-fly data transformations such as transposition, zero-padding, and sign extension. In addition, a dedicated accelerator for the im2col transformation was developed, converting tensors into matrix form so that convolutions can be executed as optimized GEMM routines. The novelty of this implementation lies in its use of the DMA subsystem to optimize data movement. The Always-On Peripheral Bus (AOPB) was developed for this very purpose: it allows external accelerators to directly access X-HEEP's peripherals, including the DMA subsystem, reducing the overall CPU load. This IP will be integrated into Heepatia, a 16nm implementation of X-HEEP. For its verification, a software-based platform called VerifHEEP was developed: a Python library tailored for X-HEEP that enables users to quickly build a complete software-based self-test (SBST) verification environment, targeting both simulation tools and the PYNQ-Z2 FPGA board. Successful testing of the accelerator demonstrates that it performs the im2col transformation more than five times faster than CPU-based implementations.
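For readers unfamiliar with the transformation, the sketch below shows the general im2col idea in NumPy: every convolution window is flattened into a column, so the whole convolution collapses into a single GEMM. This is a minimal illustration only; the function name, data layout, and parameters are hypothetical and do not reflect the thesis accelerator's actual interface.

```python
import numpy as np

def im2col(x, kh, kw, stride=1, pad=0):
    """Reshape a (C, H, W) tensor into a (C*kh*kw, out_h*out_w) matrix
    whose columns are flattened convolution patches (illustrative layout)."""
    c, h, w = x.shape
    # Zero-padding, one of the on-the-fly transforms the DMA can apply.
    x = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out_h = (h + 2 * pad - kh) // stride + 1
    out_w = (w + 2 * pad - kw) // stride + 1
    cols = np.empty((c * kh * kw, out_h * out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, i*stride:i*stride+kh, j*stride:j*stride+kw]
            cols[:, i * out_w + j] = patch.ravel()
    return cols

# Convolution as GEMM: filters flattened to rows, multiplied by the im2col matrix.
x = np.arange(2 * 4 * 4, dtype=np.float32).reshape(2, 4, 4)  # 2-channel 4x4 input
f = np.ones((3, 2 * 3 * 3), dtype=np.float32)  # 3 filters of shape (2, 3, 3)
y = f @ im2col(x, 3, 3)  # (3, 4): one row per filter, one column per output pixel
```

Offloading this patch gathering to a 2D-capable DMA is attractive precisely because, as the loop nest shows, im2col is pure strided data movement with no arithmetic.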

Supervisors: Luciano Lavagno, Luigi Giuffrida, Davide Schiavone
Academic year: 2024/25
Publication type: Electronic
Number of pages: 118
Subjects:
Degree programme: Master's degree programme in Electronic Engineering
Degree class: New regulations > Master's degree > LM-29 - ELECTRONIC ENGINEERING
Joint supervision institution: École polytechnique fédérale de Lausanne - EPFL (SWITZERLAND)
Collaborating companies: NOT SPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/33222