Destiny Jarymaya Okpekpe
Capabilities and Application of Deep Learning Recurrent Models.
Supervisor: Lia Morra. Politecnico di Torino, Master's degree programme in Data Science And Engineering, 2024
License: Creative Commons Attribution Non-commercial No Derivatives.
Abstract: |
Despite the major advances in language modelling since the introduction of the transformer architecture, reasoning remains one of the skills of the human brain that deep learning models struggle most to replicate. Since one of the main challenges is to efficiently recall information seen in the past, the Associative Recall (AR) synthetic task has gained importance as a proxy for language modelling and as a benchmark for selecting promising large language models. A series of recurrent-gated models (such as H3, Mamba and Hyena), built to overcome the O(L^2) computational complexity of the attention module, has recently gained popularity for solving AR even on long sequences (more than 10,000 tokens). However, when scaled and trained on real language tasks, these models still cannot match the performance of transformers. This thesis investigates the reasons for this gap and identifies three main factors responsible for it: (1) AR is not challenging enough to be a proxy for language, (2) recurrent models rely heavily on proper optimization to update their hidden state efficiently, and (3) while transformers benefit most from scaling in depth, recurrent models benefit most from scaling in width. When reasoning over sequences, another difference between transformers and recurrent models is the role of positional embeddings, since in the latter the relative position of tokens is given by their order in the sequence (implicit causality). The question is then how to reconcile data modalities that are not sequential, such as 3D point clouds, with the inherently directional (or bidirectional), order-dependent processing of recurrent models like Mamba.
In this thesis, a new method is proposed to convert point clouds into 1D sequences that maintain 3D spatial structure with no need for data replication, allowing Mamba's sequential processing to be applied effectively in an almost permutation-invariant manner. In contrast to other works, the proposed method does not require positional embeddings and allows for shorter sequence lengths while still surpassing Transformer-based models in both accuracy and efficiency. |
---|---|
Supervisors: | Lia Morra |
Academic year: | 2024/25 |
Publication type: | Electronic |
Number of pages: | 69 |
Subjects: | |
Degree programme: | Master's degree programme in Data Science And Engineering |
Degree class: | New system > Master's degree > LM-32 - COMPUTER ENGINEERING |
Joint supervision institution: | ETH Zurich (SWITZERLAND) |
Collaborating companies: | ETH Zurich |
URI: | http://webthesis.biblio.polito.it/id/eprint/34019 |