Jessica Marossero
Accelerating Transformer Inference on Heterogeneous Multi-Accelerator SoCs using ESP.
Supervisors: Daniele Jahier Pagliari, Alessio Burrello, Luca Carloni, Mohamed Amine Hamdi. Politecnico di Torino, Master's degree in Electronic Engineering (Ingegneria Elettronica), 2024
PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives. Download (5MB)
Abstract:
Transformers have become essential in deep learning, excelling in tasks such as natural language processing and computer vision. However, they are computationally expensive, especially in their attention layers, which require large-scale matrix multiplications with complexity quadratic in the sequence length. Coupling general-purpose processors with specialized hardware accelerators is therefore critical to deploy Transformers efficiently on embedded systems with limited resources. The Embedded Scalable Platform (ESP) is a pioneering open-source research platform that enables the design of such heterogeneous SoCs by integrating multiple types of tiles in a 2D mesh architecture. This modular design allows efficient integration of third-party accelerators, enabling rapid prototyping and exploration of novel architectures.

This thesis focuses on the integration of the state-of-the-art Integer Transformer Accelerator (ITA) within ESP. ITA accelerates the execution of Transformer models by employing 8-bit quantization and custom hardware optimizations that improve the efficiency and reduce the memory footprint of attention, including an efficient implementation of the softmax function, a key component of this type of layer. Its integration in ESP required incorporating private local memories (PLMs) within ITA to store data locally during computation, minimizing external memory accesses. A controller was designed to manage DMA transactions, ensuring efficient data movement between the system memory and the PLMs. Additionally, a hardware socket was generated to interface ITA with the ESP platform; this socket handles communication between the accelerator and the other system components, allowing ITA to be integrated seamlessly into the SoC architecture. The final SoC consists of a memory tile, an Ariane RISC-V CPU tile, an ITA tile, and an I/O tile.

On the software side, a bare-metal application was written to validate the functionality of ITA within the ESP system. This application demonstrated the capability of ITA to perform attention computations taken from a real-world Transformer model, working in coordination with the Ariane processor. The results showed a significant improvement when using ITA to accelerate attention layers compared to a purely software solution running entirely on Ariane. Lastly, the flexibility of ESP was leveraged to explore performance scalability by increasing the number of ITA accelerator tiles, with each tile processing a different attention head in parallel. In summary, this thesis demonstrates the successful design and integration of a specialized hardware accelerator for Transformer models, exploiting the flexibility and modularity of ESP. The final SoC represents a promising solution for deploying resource-intensive machine learning models on embedded systems.
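For reference, the quadratic cost mentioned in the abstract stems from the score matrix of standard scaled dot-product attention (a textbook formulation, not specific to this thesis):

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,
\qquad Q, K, V \in \mathbb{R}^{n \times d_k}
```

Computing the score matrix $QK^{\top}$ takes $O(n^2 d_k)$ multiply-accumulate operations, i.e., quadratic in the sequence length $n$; this is the term that ITA's 8-bit datapath and hardware softmax target.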
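The bare-metal flow the abstract describes (the Ariane core configures each ITA tile, triggers it on one attention head, and polls for completion) can be pictured with the minimal sketch below. It is illustrative only: the base addresses, register offsets, and register names are assumptions for exposition, not the actual ESP socket or ITA register map.

```c
/* Hypothetical sketch of per-tile configuration, launch, and polling.
 * All addresses and register offsets below are invented for illustration. */
#include <stdint.h>

#define NUM_ITA_TILES  4                                  /* assumed tile count     */
#define ITA_BASE(i)    (0x60000000u + (i) * 0x10000u)     /* assumed MMIO base      */

/* Assumed register offsets (illustrative only). */
#define REG_SRC_ADDR   0x00   /* DMA source: Q/K/V data for this head               */
#define REG_DST_ADDR   0x04   /* DMA destination: attention output buffer           */
#define REG_SEQ_LEN    0x08   /* sequence length n                                  */
#define REG_CMD        0x0C   /* write 1 to start the accelerator                   */
#define REG_STATUS     0x10   /* bit 0 set when computation is done                 */

static inline void mmio_write(uintptr_t addr, uint32_t val)
{
    *(volatile uint32_t *)addr = val;
}

static inline uint32_t mmio_read(uintptr_t addr)
{
    return *(volatile uint32_t *)addr;
}

int main(void)
{
    /* Assign one attention head to each ITA tile and launch them in parallel.
     * In the real design, data movement into the PLMs is handled by the DMA
     * controller; here it is abstracted into the src/dst addresses. */
    for (int head = 0; head < NUM_ITA_TILES; head++) {
        uintptr_t base = ITA_BASE(head);
        mmio_write(base + REG_SRC_ADDR, 0x80000000u + head * 0x100000u);
        mmio_write(base + REG_DST_ADDR, 0x90000000u + head * 0x100000u);
        mmio_write(base + REG_SEQ_LEN, 64);
        mmio_write(base + REG_CMD, 1);
    }

    /* Poll every tile until all heads are done, then validate on Ariane. */
    for (int head = 0; head < NUM_ITA_TILES; head++) {
        while ((mmio_read(ITA_BASE(head) + REG_STATUS) & 1u) == 0)
            ;
    }
    return 0;
}
```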
Supervisors: Daniele Jahier Pagliari, Alessio Burrello, Luca Carloni, Mohamed Amine Hamdi
Academic year: 2024/25
Publication type: Electronic
Number of pages: 98
Degree program: Master's degree in Electronic Engineering (Ingegneria Elettronica)
Degree class: New regulations > Master's degree > LM-29 - Electronic Engineering
Collaborating institutions: Columbia University
URI: http://webthesis.biblio.polito.it/id/eprint/33131