
Transformer-based architectures for long biological sequences

Vittorio Pipoli

Transformer-based architectures for long biological sequences.

Supervisors: Elena Maria Baralis, Elisa Ficarra, Marta Lovino, Giuseppe Attanasio. Politecnico di Torino, Master's degree programme in Data Science and Engineering, 2022

Abstract:

Gene expression is the process by which the information encoded in genes is transcribed and translated into a functional product, allowing cells to react to external stimuli and carry out their main functions. Gene expression is therefore of primary importance in life, and fully understanding this phenomenon may help cancer diagnosis and drug discovery. State-of-the-art deep learning techniques such as Xpresso and ExPecto predict gene expression levels by applying convolutional layers to the raw DNA sequence of each gene (tens of thousands of base pairs long) extracted from the reference genome.

Learning from long sequences is challenging due to the intrinsic nature of DNA: models must learn to extract both local patterns and long-range dependencies. Current approaches employ convolution for compression and for learning from the local context, but they are inefficient at modeling long-range interactions because of the narrow local receptive field of convolutional layers. Moreover, the cited works embed the biological sequences with one-hot encoding, which produces sparse matrices from which it can be hard to extract patterns. This thesis therefore takes on both the embedding challenge and the long-range dependency problem, offering two transformer-based approaches, ConvTransformer and FNetCompression, along with data visualization techniques to interpret the results.

The first main issue I faced in this thesis is sequence embedding. Biological sequences may be very long, on the order of tens of thousands of base pairs; hence, it is crucial to compress this information properly, balancing the loss of information due to shrinkage against the computational cost of modeling massive inputs. The latter holds especially for transformer-based architectures, where the multi-head attention mechanism has a complexity that scales quadratically with the input length: doubling the sequence length quadruples the number of pairwise attention scores. The proposed approach uses a Word2Vec embedding layer combined with Conv1D layers, skip connections, and average pooling; these steps reduce the sequence length to a dimension feasible for transformer-based architectures.

The second main issue I focused on is modeling long-range DNA interactions. Convolutional solutions have a narrow local receptive field and need extremely deep architectures to attend to the whole sequence. Conversely, transformer-based architectures attend to the whole sequence from the first layer, and are thus capable of modeling the long-term relationships of DNA. As anticipated, this thesis proposes two transformer-based models. The first is intended to reach the highest possible performance. The second reaches 98% of the former's performance by exploiting a Fourier encoder block; moreover, it allows the sequence length to be compressed by up to 95%, which leads to a dramatic reduction in complexity.

Experiments on the Xpresso and ExPecto datasets show that my models outperform Xpresso: Xpresso's mean R^2 is 0.57, whereas ConvTransformer reaches a mean R^2 of 0.65 and FNetCompression 0.61. My models also stand the comparison with ExPecto despite different experimental conditions: ExPecto's mean Spearman correlation is 0.74, while ConvTransformer reaches 0.75 and FNetCompression 0.74.
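To make the compression step concrete, below is a minimal sketch of a front-end in the spirit described above: a dense k-mer embedding (in place of sparse one-hot vectors) followed by Conv1D stages with skip connections and average pooling. Module names, the 3-mer vocabulary, kernel size, and the number of stages are illustrative assumptions, not the thesis's exact configuration.

```python
import torch
import torch.nn as nn

class ConvCompressionBlock(nn.Module):
    """One Conv1D + skip connection + average-pooling stage.

    Each stage halves the sequence length, so stacking a few of them
    shrinks a tens-of-thousands-bp input to a length a transformer
    encoder can attend over.
    """
    def __init__(self, channels: int, kernel_size: int = 9):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2)
        self.act = nn.ReLU()
        self.pool = nn.AvgPool1d(kernel_size=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, length); skip connection around the conv
        x = x + self.act(self.conv(x))
        return self.pool(x)  # (batch, channels, length // 2)

class CompressionFrontEnd(nn.Module):
    def __init__(self, vocab_size: int = 4 ** 3, dim: int = 128,
                 num_stages: int = 4):
        super().__init__()
        # Dense embedding of k-mer tokens (a Word2Vec-style lookup
        # table), replacing a sparse one-hot representation.
        self.embed = nn.Embedding(vocab_size, dim)
        self.stages = nn.ModuleList(
            ConvCompressionBlock(dim) for _ in range(num_stages))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens).transpose(1, 2)  # (batch, dim, length)
        for stage in self.stages:
            x = stage(x)
        return x.transpose(1, 2)                # (batch, length/16, dim)

# Four halving stages shrink a 20,000-token sequence to 1,250 positions.
tokens = torch.randint(0, 64, (2, 20_000))
print(CompressionFrontEnd()(tokens).shape)  # torch.Size([2, 1250, 128])
```

Each pooling stage halves the length, so four stages reduce a 20,000-token input sixteen-fold, short enough for a standard transformer encoder to process.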
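The quadratic cost that motivates this compression is easy to quantify: self-attention materializes an L x L score matrix per head. A toy calculation with illustrative lengths (not figures from the thesis):

```python
# Self-attention compares every position with every other one, so the
# score matrix per head has L * L entries.
for L in (20_000, 1_000):  # raw gene sequence vs. ~95%-compressed one
    print(f"L={L:>6}: {L * L:,} pairwise scores per head")
# L= 20000: 400,000,000 pairwise scores per head
# L=  1000: 1,000,000 pairwise scores per head
```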
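FNetCompression's cheaper encoder can be sketched in the style of FNet (Lee-Thorp et al., 2021), where self-attention is replaced by an unparameterized 2D Fourier transform that mixes tokens in O(n log n) rather than O(n^2) time. This is a hedged sketch under that assumption; the layer sizes are placeholders, not the thesis's exact block.

```python
import torch
import torch.nn as nn

class FourierEncoderBlock(nn.Module):
    """FNet-style encoder block: token mixing via a 2D FFT instead of
    self-attention, costing O(n log n) rather than O(n^2)."""
    def __init__(self, dim: int = 128, hidden: int = 512):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Mix information across the sequence and feature dimensions
        # with a parameter-free Fourier transform; keep the real part.
        mixed = torch.fft.fft2(x.float()).real
        x = self.norm1(x + mixed)
        return self.norm2(x + self.ff(x))

x = torch.randn(2, 1_000, 128)
print(FourierEncoderBlock()(x).shape)  # torch.Size([2, 1000, 128])
```

Because the Fourier mixing has no learned parameters, such a block trades a small accuracy gap (the 98% figure reported above) for a substantial reduction in compute and memory.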

Supervisors: Elena Maria Baralis, Elisa Ficarra, Marta Lovino, Giuseppe Attanasio
Academic year: 2021/22
Publication type: Electronic
Number of pages: 109
Additional information: Confidential thesis. Full text not available
Subjects:
Degree programme: Master's degree programme in Data Science and Engineering
Degree class: New regulations > Master of Science > LM-32 - COMPUTER ENGINEERING
Collaborating companies: Politecnico di Torino
URI: http://webthesis.biblio.polito.it/id/eprint/22586