
A Deep Learning approach to integrate histological images and DNA methylation values

Margheret Casaletto

A Deep Learning approach to integrate histological images and DNA methylation values.

Supervisors: Elisa Ficarra, Marta Lovino, Francesco Ponzio. Politecnico di Torino, Master's degree programme in Computer Engineering (Ingegneria Informatica), 2021

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:

This thesis investigates the integration of a specific category of biomedical images, histological images, with DNA methylation data. I consider colon cancer data from The Cancer Genome Atlas (TCGA) repository and, for the images, I also exploit an additional set of Regions Of Interest (ROIs). To this end, I train an image classification model to predict malignancy in the images and then analyze how methylation affects its predictions by exploiting the correlation between the features extracted from the two data types.

The input data consist of methylation samples, labeled as healthy or tumor, and images, which are also globally labeled as tumor or healthy. First, I split both data types into a train set and a test set, taking care that image and methylation data are both available for the same patient. Next, I develop two parallel pipelines that perform the same tasks on the two data types, each exploiting an ML/DL approach suited to the nature of its data.

For methylation, after a preprocessing step, I train multiple genomic classifiers and analyze their prediction scores on the test set. All the trained genomic classifiers achieve an accuracy higher than 94%. I then evaluate two dimensionality reduction techniques, Principal Component Analysis (PCA) and Autoencoders (AE), to extract different feature sets from the methylation train set. I train a Support Vector Machine (SVM) classifier on each extracted feature set and choose the one that achieves the best scores on the test set.

For the images, after cutting them into smaller crops, I exploit a well-known Convolutional Neural Network (CNN) architecture, VGG16, to develop the image feature extractor model. After a hyperparameter tuning procedure, I perform a complete fine-tuning of VGG16 on the ROIs. I evaluate a second model obtained by performing a further complete fine-tuning on part of the TCGA train set.
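As a minimal sketch of the patient-matched train/test division mentioned in the abstract, the snippet below keeps only patients that have both an image and a methylation sample and then splits them by patient ID. The function name `split_matched_patients`, the 20% test fraction, the fixed seed, and the patient identifiers are all assumptions made for illustration; the abstract does not state how the split was actually implemented.

```python
import random


def split_matched_patients(image_ids, methylation_ids, test_fraction=0.2, seed=0):
    """Split patients that have both data types into train and test ID lists.

    test_fraction and seed are illustrative assumptions, not values from the thesis.
    """
    # Keep only patients for which both an image and a methylation sample exist.
    matched = sorted(set(image_ids) & set(methylation_ids))
    # Shuffle with a local, seeded generator so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(matched)
    # Reserve at least one patient for the test set.
    n_test = max(1, round(len(matched) * test_fraction))
    return matched[n_test:], matched[:n_test]


# Hypothetical patient identifiers, for illustration only.
image_patients = ["P01", "P02", "P03", "P04", "P05", "P06"]
methylation_patients = ["P02", "P03", "P04", "P05", "P06", "P07"]
train_ids, test_ids = split_matched_patients(image_patients, methylation_patients)
```

Splitting on patient IDs, rather than on individual crops or samples, keeps all data from one patient on the same side of the split, which is what allows the two data types to be compared per patient later on.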
The CNN is the first stage of a feature extraction pipeline that ends with a PCA step. I classify the test set images with both models and obtain two baseline results. I then extract the train set features with both models and train a Multi-Layer Perceptron (MLP) on each feature set. I choose the second model because its MLP classifies the test set more similarly to the respective baseline, which indicates that the extracted features are representative. This MLP becomes the actual image Baseline classification model.

I integrate the features extracted from the image and methylation data, for both the train and the test sets, using Mutual Information (MI) and Pearson correlation. The correlation is computed between every crop of a given patient and that patient's methylation data. The results are used to discard the image crops whose correlation value falls below a threshold: for MI, I choose a threshold equal to 0; for Pearson, I discard crops whose correlation value is below the first quartile (25%) of the maximum correlation value.

In the final part of the work, I compare the Baseline prediction results with those obtained from the same model after excluding the under-threshold crops described above. This yields three sets of prediction scores on the test set (Baseline, MI, and Pearson). Assuming that an image is globally labeled as tumor if at least 10% of its crops are labeled as tumor, I conclude that the MI-based approach is the best.
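The Pearson-based crop filtering and the 10% image-labeling rule described in the abstract can be sketched as follows. This is only one possible reading: it assumes that each crop and the patient's methylation sample are represented by feature vectors of equal length, and that the Pearson threshold is 25% of the highest correlation among that patient's crops. The function names and the example vectors are hypothetical, and the MI variant (threshold 0) is not shown.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention chosen here for constant vectors
    return cov / sqrt(var_x * var_y)


def keep_correlated_crops(crop_features, methylation_features, fraction=0.25):
    """Return indices of crops whose correlation is at least fraction * max correlation.

    Note: if the maximum correlation is negative, no crop passes this test;
    the abstract does not say how that case is handled.
    """
    scores = [pearson(f, methylation_features) for f in crop_features]
    threshold = fraction * max(scores)
    return [i for i, s in enumerate(scores) if s >= threshold]


def image_label(crop_labels, min_tumor_fraction=0.10):
    """Label an image as tumor if at least min_tumor_fraction of its crops are tumor."""
    tumor = sum(1 for label in crop_labels if label == "tumor")
    return "tumor" if tumor / len(crop_labels) >= min_tumor_fraction else "healthy"


# Hypothetical feature vectors for one patient, for illustration only.
methylation = [1.0, 2.0, 3.0, 4.0]
crops = [
    [1.0, 2.0, 3.0, 4.0],  # perfectly correlated with the methylation vector
    [4.0, 3.0, 2.0, 1.0],  # perfectly anti-correlated
    [1.0, 3.0, 2.0, 4.0],  # strongly, but not perfectly, correlated
]
kept = keep_correlated_crops(crops, methylation)
```

In this toy case the anti-correlated crop is discarded and the other two are kept; the image-level label would then be computed with `image_label` on the predictions of the remaining crops only.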

Supervisors: Elisa Ficarra, Marta Lovino, Francesco Ponzio
Academic year: 2020/21
Publication type: Electronic
Number of pages: 94
Subjects:
Degree programme: Master's degree programme in Computer Engineering (Ingegneria Informatica)
Degree class: New system (Nuovo ordinamento) > Master's degree > LM-32 - Ingegneria Informatica (Computer Engineering)
Collaborating companies: Politecnico di Torino
URI: http://webthesis.biblio.polito.it/id/eprint/19248