
Efficient Transformer attentions in time series forecasting

Andrea Arcidiacono

Efficient Transformer attentions in time series forecasting.

Supervisors: Francesco Vaccarino, Rosalia Tatano. Politecnico di Torino, Master's degree programme in Data Science and Engineering, 2022

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.


Transformer-based architectures are neural network architectures originally developed for natural language processing. The key innovation of these state-of-the-art architectures is the self-attention mechanism. These models have been deployed in several settings beyond natural language, including video and images. However, they are hard to scale up for industrial applications because the attention mechanism has quadratic time and memory complexity in the input sequence length. There has therefore been extensive research into new variants of these architectures that address this problem by approximating the quadratic-cost attention matrix, making the model more efficient and more lightweight. This thesis analyzes the recently proposed efficient attention mechanisms of Performer, BigBird and Informer and applies them to the task of time series forecasting. In particular, starting from the implementation of the Informer, the attention mechanisms of Performer and BigBird are integrated into its architecture, resulting in four models to be tested: Informer with the vanilla attention mechanism, Informer with the so-called ProbSparse attention, Informer+Performer, and Informer+BigBird. We compare the performance of each variant, on easily accessible and scalable hardware, on two real industrial problems, focusing on performance versus resources needed. The results suggest that forecasting accuracy is similar for all tested models, but that computational performance and resource usage vary substantially. In fact, many efficient architectures remove the quadratic complexity problem but introduce more complex algorithms to compute the attention approximation, which increases the memory overhead.
Thus, the results suggest that any efficient Transformer modification has to be chosen carefully according to the task and dataset characteristics, since these efficient models might not bring significant savings in the computational resources needed to train and test them.
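
To illustrate the quadratic cost the abstract refers to, here is a minimal sketch of vanilla scaled dot-product attention for a single head, written in plain Python. It is not taken from the thesis or from the Informer code base. The function names (`softmax`, `vanilla_attention`, `attention_matrix_entries`) and the list-of-lists representation are assumptions made for illustration only. The point is that the score matrix holds one entry per (query, key) pair, so its size grows with the square of the sequence length.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def vanilla_attention(Q, K, V):
    """Scaled dot-product attention for one head (illustrative sketch).

    Q, K and V are lists of L vectors (lists of floats).
    Returns the output vectors and the L x L attention matrix.
    """
    d = len(Q[0])
    scale = 1.0 / math.sqrt(d)
    # One score per (query, key) pair: L * L entries in total.
    scores = [
        [scale * sum(q_i * k_i for q_i, k_i in zip(q, k)) for k in K]
        for q in Q
    ]
    # Each row is normalized into weights that sum to 1.
    weights = [softmax(row) for row in scores]
    # Each output vector is a weighted average of the value vectors.
    out = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return out, weights


def attention_matrix_entries(seq_len):
    """Number of entries in the full attention matrix for a given length."""
    return seq_len * seq_len
```

With this sketch, `attention_matrix_entries(96)` is 9216 and `attention_matrix_entries(192)` is 36864: doubling the input window quadruples the matrix that must be computed and stored. Efficient variants such as those studied in the thesis avoid materializing this full matrix, at the price of extra bookkeeping of their own.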

Supervisors: Francesco Vaccarino, Rosalia Tatano
Academic year: 2021/22
Publication type: Electronic
Number of Pages: 86
Degree programme: Master's degree programme in Data Science and Engineering
Degree class: New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING
Collaborating companies: ADDFOR S.p.A
URI: http://webthesis.biblio.polito.it/id/eprint/22741