Design and analysis of VLSI architectures for Transformers.

Davide Dura

Supervisors: Maurizio Martina, Guido Masera, Alberto Marchisio. Politecnico di Torino, Master's degree course in Electronic Engineering (Corso di laurea magistrale in Ingegneria Elettronica), 2022

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.

Neural networks have recently been a major field of innovation, with a growing number of applications relying on Machine Learning algorithms. A large share of these are Natural Language Processing (NLP) algorithms, which handle words, sentences and groups of sentences. Machine translation, text generation, sentiment analysis and question answering are just some examples of NLP tasks. In this scope, the model that has gained the most popularity is the Transformer, thanks to its great adaptability to different objectives. This network architecture is based on the attention mechanism and has exceeded the performance of the previously used recurrent and convolutional neural networks. Several different models based on the Transformer already exist: its encoder-decoder nature leaves a lot of room for exploration by changing the parameter values or the layer configuration. BERT (Bidirectional Encoder Representations from Transformers) and the Universal Transformer are two particular models derived from the Transformer. However, the Transformer has a large structure and many parameters, which makes any hardware implementation difficult and expensive to realize. These drawbacks translate into complex resources, a large memory footprint and high latency. This work analyzes the state of the art in hardware realizations of the Transformer and proposes some ideas for designing the network as a whole. A divide-and-conquer approach is used to design the single layers and sub-layers of the architecture, while resource reuse and alternative structures are still taken into account. Quantization is key to obtaining an integer-only architecture and to reducing both memory requirements and resources. Starting from an entirely quantized model, the hardware design is developed for a single Encoder layer; it is reasonable to assume that different configurations can be realized by replicating the architecture.
The main focus is on matrix multiplication and the non-linear functions. The former is the most important operation, since it accounts for the majority of the network computation and is also costly in terms of area. It is implemented as a matrix of Multiply-and-Accumulate (MAC) elements, which is simulated and synthesized for different dimensions to identify the trend and estimate larger structures. The non-linear functions, on the other hand, are complex because of the type of operations they require. Linear algorithms approximating them are taken from the literature and translated into hardware solutions, whose behaviour is compared against a software model to verify the correctness of their results. Connecting the separate sub-layers is the duty of the control part of the design, which is also described to show possible solutions. Finally, the adaptability of the design to other types of Transformer is evaluated.
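
As an illustration of the integer-only matrix multiplication described above, the following is a minimal software sketch, not the architecture developed in the thesis. It assumes symmetric 8-bit quantization with one scale per matrix; the function names (quantize, mac_array_matmul, dequantize) and the scale values are hypothetical and chosen only for this example. Each output element owns one accumulator and performs one multiply-and-accumulate per inner step, mimicking a grid of MAC elements.

```python
def quantize(matrix, scale):
    """Symmetric 8-bit quantization (assumed scheme): round(x / scale), clamped to [-127, 127]."""
    return [[max(-127, min(127, round(x / scale))) for x in row] for row in matrix]


def mac_array_matmul(a_q, b_q):
    """Integer matrix product computed by a rows x cols grid of MAC elements.

    Grid element (i, j) holds one integer accumulator and executes one
    multiply-and-accumulate for every inner index k.
    """
    rows, inner, cols = len(a_q), len(b_q), len(b_q[0])
    acc = [[0] * cols for _ in range(rows)]
    for k in range(inner):          # one step of the whole grid per inner index
        for i in range(rows):
            for j in range(cols):
                acc[i][j] += a_q[i][k] * b_q[k][j]
    return acc


def dequantize(acc, scale_a, scale_b):
    """Map integer accumulators back to real values for comparison with a float model."""
    return [[v * scale_a * scale_b for v in row] for row in acc]


# Example values (assumptions for illustration only).
a = [[0.5, -1.0], [0.25, 0.75]]
b = [[1.0, 0.5], [-0.5, 0.25]]
scale_a = scale_b = 1.0 / 127

approx = dequantize(mac_array_matmul(quantize(a, scale_a), quantize(b, scale_b)),
                    scale_a, scale_b)
exact = [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
```

Comparing approx with exact plays the role of the software reference model mentioned in the abstract: the integer result should match the floating-point product up to a small quantization error.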

Supervisors: Maurizio Martina, Guido Masera, Alberto Marchisio
Academic year: 2022/23
Publication type: Electronic
Number of Pages: 87
Degree course: Master's degree course in Electronic Engineering (Corso di laurea magistrale in Ingegneria Elettronica)
Degree class: New organization > Master science > LM-29 - ELECTRONIC ENGINEERING
Collaborating companies: UNSPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/25517