Mattia Cappelli
Multi-domain data fusion for colorectal cancer prognosis.
Supervisors: Maurizio Rebaudengo, Marta Lovino, Francesco Ponzio, Elisa Ficarra. Politecnico di Torino, Master's degree programme in Data Science And Engineering, 2021
PDF (Tesi_di_laurea)
- Thesis
Access restricted to: staff users only until 17 December 2024 (embargo date). License: Creative Commons Attribution Non-commercial No Derivatives. File size: 7 MB. |
Abstract: |
This thesis presents a methodological study of prognosis prediction for colorectal cancer that considers different types of biological data. Specifically, I use survival analysis to predict the overall survival risk, which is closely related to the expected lifetime of cancer patients. The data types are histopathological images, miRNA, gene expression, methylation and clinical data, all selected for the task from the TCGA-COAD project in the GDC databases. The thesis is devoted to developing an optimized framework for survival prediction in colorectal cancer patients, with a focus on the integration of omics data and histopathology images. The two difficulties in data integration are the high dimensionality of the omics data and the extraction of features from the images. The first is mitigated by feature selection. The images, being unstructured data, are handled in a self-supervised fashion to overcome the lack of labels and the varying number of images per patient. The self-supervised model adapts the AEGAN (AutoEncoder Generative Adversarial Network) architecture to extract features from the images using both the discriminator and the encoder of the model. The model is trained in a completely non-informative way using two external datasets, one for training and the other for evaluation. Survival analysis is then addressed with both unimodal and multimodal approaches: in the unimodal approach a single type of data is considered, while in the multimodal setting different types of data are included in the dataset. Two survival models are used: a linear one, the Cox Proportional Hazards model, and a non-linear one based on a neural network, DeepSurv. 
The most relevant results concern the features extracted from the images: the model can outperform supervised and unsupervised methods such as PathologyGAN. The results also show the importance of high-level clinical features, which achieve the best results across the experiments. Furthermore, I found that the linear model works better with clinical features, whereas the non-linear model is more effective with omics data. In conclusion, the thesis also highlights the need for a more suitable data fusion method: although the results are in line with many other works, they do not reach the state of the art. |
---|---|
Supervisors: | Maurizio Rebaudengo, Marta Lovino, Francesco Ponzio, Elisa Ficarra |
Academic year: | 2021/22 |
Publication type: | Electronic |
Number of pages: | 75 |
Subjects: | |
Degree programme: | Master's degree programme in Data Science And Engineering |
Degree class: | New system > Master's degree > LM-32 - COMPUTER ENGINEERING |
Collaborating companies: | NOT SPECIFIED |
URI: | http://webthesis.biblio.polito.it/id/eprint/21092 |
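
The abstract names two survival models, the Cox Proportional Hazards model and DeepSurv. Both are commonly fitted by minimising the negative Cox partial log-likelihood; the only difference is where the per-patient risk score comes from (a linear combination of features for Cox, a neural network output for DeepSurv). The following is a minimal sketch of that loss in plain Python, not code from the thesis: the function name, argument names and the toy values are assumptions made for illustration, and ties are handled in the simple Breslow style.

```python
import math


def cox_neg_partial_log_likelihood(risk_scores, times, events):
    """Negative Cox partial log-likelihood (hypothetical helper, Breslow-style ties).

    risk_scores: per-patient log-risk (linear predictor for Cox PH,
                 network output for a DeepSurv-style model).
    times:       observed follow-up time for each patient.
    events:      1 if death was observed, 0 if the patient was censored.
    """
    n = len(times)
    total = 0.0
    for i in range(n):
        if not events[i]:
            # Censored patients contribute only through the risk sets of others.
            continue
        # Risk set: every patient still under observation at times[i].
        denominator = sum(
            math.exp(risk_scores[j]) for j in range(n) if times[j] >= times[i]
        )
        total += risk_scores[i] - math.log(denominator)
    return -total


# Toy example (invented values): three patients, all with an observed event.
times = [1.0, 2.0, 3.0]
events = [1, 1, 1]
good = cox_neg_partial_log_likelihood([2.0, 1.0, 0.0], times, events)
bad = cox_neg_partial_log_likelihood([0.0, 1.0, 2.0], times, events)
```

In the toy example, `good` assigns the highest risk to the patient who dies first and therefore yields a lower loss than `bad`, which reverses the ranking.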