Flavio Spuri
Optimizing Genome Representations for Cancer Type Classification.
Supervisor: Alfredo Benso. Politecnico di Torino, Master's degree programme in Data Science and Engineering, 2024
License: Creative Commons Attribution Non-commercial No Derivatives.
Abstract: | In recent years, Large Language Models (LLMs) have been successfully adapted to the field of Genomics, as shown by models such as DNABERT, DNABERT-2, and Nucleotide Transformer. Despite this, their application in the challenging field of Cancer Genomics remains unexplored. This thesis examines whether cancer genome analysis can benefit from large, pre-trained Transformer-based models, focusing specifically on the newly introduced HyenaDNA architecture. In HyenaDNA, traditional Attention Layers are replaced by so-called Hyena Filters, which consist of recursions of an element-wise multiplicative gating and a long convolution. This design allows longer sequences to be processed at single-base resolution and at subquadratic computational cost, properties that align well with the specific needs of Cancer Genomics. The study begins by assessing HyenaDNA's ability to represent genomic data. Using UMAP, we visualized embeddings computed by a HyenaDNA model pre-trained on the Human Reference Genome, showing that the model effectively learns to distinguish between different genomic regions. Following this, we fine-tuned pre-trained HyenaDNA models on the novel task of classifying cancer types directly from the mutated sequences. Given the lack of existing datasets tailored to this application, we built a custom dataset combining data from the DepMap dataset and the Human Reference Genome. Comparing the performance obtained across various experimental settings, we conclude that this approach shows significant potential for use in Cancer Genomics. |
---|---|
Supervisors: | Alfredo Benso |
Academic year: | 2023/24 |
Publication type: | Electronic |
Number of pages: | 68 |
Subjects: | |
Degree programme: | Master's degree programme in Data Science and Engineering |
Degree class: | New system > Master's degree > LM-32 - Computer Engineering |
Joint supervision institution: | TU Delft (Netherlands) |
Collaborating companies: | Not specified |
URI: | http://webthesis.biblio.polito.it/id/eprint/31803 |
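
The abstract describes Hyena Filters as recursions of an element-wise multiplicative gating and a long convolution, with subquadratic cost. The following is a minimal NumPy sketch of that idea on a single channel, not the thesis's or HyenaDNA's actual implementation. The function names `fft_long_conv` and `hyena_like_recurrence` are hypothetical, and the gates and filters are random arrays standing in, by assumption, for the learned quantities of a real model. The long convolution is computed with an FFT, which is what keeps the cost at O(L log L) in the sequence length L rather than the O(L^2) of full attention.

```python
import numpy as np


def fft_long_conv(u, h):
    """Causal convolution of u with a filter h of the same length, via FFT."""
    L = u.shape[0]
    n = 2 * L  # zero-pad so the circular convolution equals the linear one
    y = np.fft.irfft(np.fft.rfft(u, n=n) * np.fft.rfft(h, n=n), n=n)
    return y[:L]  # keep the first L outputs (the causal part)


def hyena_like_recurrence(v, gates, filters):
    """Alternate a long convolution and an element-wise gate, once per step."""
    z = v
    for x, h in zip(gates, filters):
        z = x * fft_long_conv(z, h)  # gate (element-wise) after long convolution
    return z


# Illustrative inputs only: random stand-ins for learned projections and filters.
rng = np.random.default_rng(0)
L, order = 1024, 2
v = rng.standard_normal(L)
gates = [rng.standard_normal(L) for _ in range(order)]
filters = [rng.standard_normal(L) / L for _ in range(order)]

y = hyena_like_recurrence(v, gates, filters)
print(y.shape)
```

Because every step is either an element-wise product or an FFT-based convolution, the output keeps one value per input position, which is consistent with the single-base resolution mentioned in the abstract.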