Multi-domain data fusion for colorectal cancer prognosis

Mattia Cappelli

Multi-domain data fusion for colorectal cancer prognosis.

Supervisors: Maurizio Rebaudengo, Marta Lovino, Francesco Ponzio, Elisa Ficarra. Politecnico di Torino, Master's degree course in Data Science And Engineering, 2021

PDF (Tesi_di_laurea) - Thesis
Restricted to: Repository staff only until 17 December 2024 (embargo date).
License: Creative Commons Attribution Non-commercial No Derivatives.

Download (7MB)

This thesis presents a methodological study of prognosis prediction for colorectal tumors using different types of biological data. Specifically, I use survival analysis to predict the overall survival risk, which is closely related to the expected lifetime of cancer patients. The data types are histopathological images, miRNA, gene expression, methylation and clinical data, all selected for the task from the TCGA-COAD project in the GDC databases. The goal is an optimized framework for survival prediction in colorectal cancer patients, with a focus on integrating omics data and histopathology images. Data integration poses two difficulties: the high dimensionality of the omics data and feature extraction from the images. The first is mitigated by feature selection. The images, being unstructured data, are handled in a self-supervised fashion to overcome the lack of labels and the varying number of images per patient. The self-supervised model adapts the AEGAN (AutoEncoder Generative Adversarial Network) architecture to extract features from the images using both the discriminator and the encoder of the model. The model is trained in a completely non-informative way, using two external datasets: one for training and the other for evaluation. The survival analysis is then carried out with both unimodal and multimodal approaches: in the unimodal approach a single type of data is considered, while in the multimodal setting different types of data are combined in the dataset. Two survival models are used: a linear one, the Cox Proportional Hazards model, and a non-linear one based on a neural network, DeepSurv.
The most relevant results concern the features extracted from the images: the model can outperform both supervised and unsupervised methods such as PathologyGAN. The results also show the importance of high-level clinical features, which achieve the best results among the various experiments. Furthermore, I found that the linear model works better with clinical features, whereas the non-linear model is more effective with omics data. In conclusion, the thesis also highlights the need for a more suitable data fusion method: although the results are in line with many other works, they do not reach the state of the art.
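
For readers unfamiliar with the survival models named in the abstract, the following is a minimal illustrative sketch, not the thesis code. It shows the negative Cox partial log-likelihood, which is the objective commonly used both for the Cox Proportional Hazards model (with a linear risk score) and for DeepSurv-style networks (with a network output as risk score), together with Harrell's concordance index, a common evaluation metric for such models. The function names and the toy data are assumptions made for illustration; the abstract does not state which metric or implementation the thesis uses.

```python
import math


def cox_neg_log_partial_likelihood(risk_scores, times, events):
    """Negative Cox partial log-likelihood (ties handled Breslow-style).

    risk_scores: one predicted log-risk per patient (higher = higher risk).
    times: observed follow-up time per patient.
    events: 1 if death was observed, 0 if the patient was censored.
    """
    n = len(times)
    log_likelihood = 0.0
    for i in range(n):
        if not events[i]:
            continue  # censored patients only appear in risk sets
        # Risk set: patients still under observation at time times[i].
        risk_set_sum = sum(
            math.exp(risk_scores[j]) for j in range(n) if times[j] >= times[i]
        )
        log_likelihood += risk_scores[i] - math.log(risk_set_sum)
    return -log_likelihood


def concordance_index(risk_scores, times, events):
    """Harrell's C-index: fraction of comparable pairs ranked correctly.

    A pair (i, j) is comparable when patient i had an observed event
    and a strictly shorter time than patient j.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): four patients, the third one is censored.
toy_times = [5, 8, 12, 20]
toy_events = [1, 1, 0, 1]
toy_risk = [2.0, 1.0, 0.5, -1.0]  # higher risk assigned to earlier deaths

toy_c_index = concordance_index(toy_risk, toy_times, toy_events)
toy_loss = cox_neg_log_partial_likelihood(toy_risk, toy_times, toy_events)
```

In this sketch a linear model would produce `risk_scores` as a weighted sum of the input features, while a DeepSurv-style model would produce them with a neural network; the loss and the metric are the same in both cases.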

Supervisors: Maurizio Rebaudengo, Marta Lovino, Francesco Ponzio, Elisa Ficarra
Academic year: 2021/22
Publication type: Electronic
Number of Pages: 75
Degree course: Master's degree course in Data Science And Engineering
Degree class: New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING
Collaborating companies: UNSPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/21092