Heuristic Algorithm for Predicting Alternatively Spliced mRNAs with Pre-Trained LLM in Cancer

Gustavo Nicoletti Rosa

Supervisors: Stefano Di Carlo, Roberta Bardini, Alessandro Savino, Matteo Cereda, Lorenzo Martini. Politecnico di Torino, degree course not specified, 2025

PDF (thesis, 74 MB)
License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:

Alternative splicing is the process by which a single pre-mRNA can be spliced into many different mRNA isoforms. It is of great evolutionary importance because it allows a single gene to produce a variety of proteins. In cancer, however, the spliceosome machinery produces aberrant isoforms or alters isoform expression levels; these changes interfere with biological pathways and thereby alter the behavior of the cell. Studying novel cancer isoforms is essential for developing therapies that suppress their expression or exploit the new epitopes they expose, and it also provides a deeper understanding of the disease. The relatively new long-read sequencing technology represents the transcriptome more accurately than older short-read sequencing. Still, not all isoforms have been sequenced, and each cell, in each state, produces a different set of isoforms. Predicting possible isoforms is therefore a problem worth addressing. As this thesis shows, generating all possible isoforms solely from the main splicing signals of the genome takes a prohibitive amount of time and yields mostly inaccurate results. Hence, we propose a heuristic algorithm for predicting tumoral isoforms that incorporates a Large Language Model pre-trained on RNA long reads from multiple tumoral cell lines. We evaluate the algorithm by analysing its perplexity and computation time, and by comparing its predictions with a prostate cancer long-read dataset provided by the Istituto Italiano di Genomica Medica (IIGM).
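
To give a feel for two quantities mentioned in the abstract, the following minimal Python sketch illustrates (a) how quickly the number of candidate isoforms grows under exhaustive enumeration and (b) how perplexity is conventionally computed from per-token log-probabilities. It is not the algorithm of the thesis. The function names are hypothetical, and the counting model is a deliberate simplification: it assumes the first and last exon are always kept and every internal exon is independently included or skipped, ignoring alternative 5'/3' splice sites and intron retention, so it understates the true search space.

```python
import math


def count_exon_skipping_isoforms(n_exons: int) -> int:
    """Count isoforms under a simplified exon-skipping model (assumption):
    the first and last exon are always kept, and each internal exon is
    independently either included or skipped."""
    if n_exons < 1:
        raise ValueError("a gene needs at least one exon")
    internal_exons = max(n_exons - 2, 0)
    return 2 ** internal_exons


def perplexity(token_log_probs: list[float]) -> float:
    """Perplexity of a sequence: exp of the mean negative natural-log
    probability that the model assigned to each token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # The candidate count doubles with every additional internal exon.
    for n in (5, 10, 20, 50):
        print(f"{n} exons -> {count_exon_skipping_isoforms(n)} candidate isoforms")

    # A model that assigns probability 0.25 to every token has a
    # perplexity of about 4: it is as uncertain as a uniform choice
    # among four options.
    uniform = [math.log(0.25)] * 8
    print(f"perplexity: {perplexity(uniform):.3f}")
```

Even under this restricted model, a 50-exon gene yields 2^48 candidates, which is why an unguided enumeration is impractical and a heuristic that prunes the search is attractive. Lower perplexity indicates that the language model finds a sequence less surprising.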

Supervisors: Stefano Di Carlo, Roberta Bardini, Alessandro Savino, Matteo Cereda, Lorenzo Martini
Academic year: 2025/26
Publication type: Electronic
Number of pages: 101
Subjects:
Degree course: NOT SPECIFIED
Degree class: New system > Master's degree > LM-32 - INGEGNERIA INFORMATICA (Computer Engineering)
Collaborating companies: Italian Institute for Genomic Medicine (IIGM)
URI: http://webthesis.biblio.polito.it/id/eprint/37899