
A data analytics integrative approach for multi-omics clustering in leukemia samples

Stefano Nardella


Supervisors: Elisa Ficarra, Marta Lovino. Politecnico di Torino, Master's degree programme in Ingegneria Informatica (Computer Engineering), 2021

License: Creative Commons Attribution Non-commercial No Derivatives.


In recent decades, the falling cost of next-generation sequencing (NGS) technologies has made many types of omics data (e.g., transcriptomics, genomics) widely available. This thesis focuses on a multi-omics approach that clusters patients so that similar ones are assigned to the same cluster, considering all data sources simultaneously. The proposed method has been evaluated on patients affected by myeloid and lymphoid leukemias (AML, ALL). It considers two types of transcriptomics data, miRNA and mRNA expression. Expression measures the quantity of a molecule in the sample, which is crucial in regulating transcriptional and post-transcriptional processes.

Many techniques for multi-omics clustering of samples are presented. Among them, tools based on joint dimensionality reduction (jDR) techniques, such as JIVE and GCCA, deserve mention. The main issue with jDR techniques is that they rely on a direct computation of the distances between all the samples in the original input space. The proposed technique overcomes this limitation by exploiting a neural network model: it computes distances using pseudo-samples (also called centroids) generated by the neural network to identify the two classes, AML and ALL. The network's output is a matrix of centroids generated from the data distribution of the input omics.

The method is based on a Multi-Layer Perceptron (MLP) architecture that takes the omics matrices as independent inputs. The network has 2 hidden layers for each input omic. The last hidden layers of each omic are concatenated and sent to the output layer. A custom loss function is implemented to minimize the error between the output value and the actual value. Several custom loss functions were considered. The final loss is based on the Mean Squared Error (MSE) computed on both input omics, with the two MSE values combined as their sum divided by their product.
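The architecture and loss described above can be illustrated with a minimal NumPy sketch. It is not the thesis implementation: the layer sizes, the ReLU activation, the random initialisation, the output width, and the small `eps` guard against division by zero are all assumptions made for illustration. The abstract does not specify how the two MSE terms are derived from the network output, so `combined_loss` is shown on two plain numbers only.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    """Randomly initialised (weights, bias) for one dense layer (assumed init)."""
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

def relu(x):
    # Activation choice is an assumption; the abstract does not name one.
    return np.maximum(x, 0.0)

def branch(x, layers):
    """Pass one omic matrix through its own stack of hidden layers."""
    for weights, bias in layers:
        x = relu(x @ weights + bias)
    return x

def forward(x_mirna, x_mrna, params):
    """Two independent branches, concatenated before a shared output layer."""
    h_mirna = branch(x_mirna, params["mirna"])
    h_mrna = branch(x_mrna, params["mrna"])
    merged = np.concatenate([h_mirna, h_mrna], axis=1)
    weights, bias = params["out"]
    return merged @ weights + bias

def combined_loss(mse_mirna, mse_mrna, eps=1e-12):
    """Sum of the two MSE values divided by their product.

    Algebraically this equals 1/mse_mirna + 1/mse_mrna.
    eps is an assumed guard against division by zero.
    """
    return (mse_mirna + mse_mrna) / (mse_mirna * mse_mrna + eps)

# Toy dimensions (hypothetical): 6 patients, 20 miRNA and 50 mRNA features.
n_patients, n_mirna, n_mrna, hidden, n_out = 6, 20, 50, 8, 4
params = {
    "mirna": [dense(n_mirna, hidden), dense(hidden, hidden)],  # 2 hidden layers
    "mrna": [dense(n_mrna, hidden), dense(hidden, hidden)],    # 2 hidden layers
    "out": dense(2 * hidden, n_out),
}
x_mirna = rng.normal(size=(n_patients, n_mirna))
x_mrna = rng.normal(size=(n_patients, n_mrna))

output = forward(x_mirna, x_mrna, params)
print(output.shape)              # (6, 4)
print(combined_loss(2.0, 4.0))   # approximately 0.75
```

Keeping one branch per omic lets each data type have its own input width before the concatenation, which is the property the abstract emphasises.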
In addition, no Y label needs to be given as input in the training phase: the proposed method computes an 'artificial' Y label from the expression values of the input omics matrices. This contribution is beneficial since the Y label is not always known for this kind of problem. For each patient, the artificial label is computed as the average expression value of all its features, consistent with the literature. The output of the neural network is a matrix containing the centroids for each omic. After computing the distances between patients and centroids, I assigned each sample to its closest centroid. Unlike jDR methods, the proposed approach does not compute the distances between all the patients but only between patients and centroids. This computation generates a dataset that contains the patient-centroid associations. From it, a similarity matrix is computed; this matrix is square and binary. An entry is 1 if the two samples belong to the same centroid and 0 otherwise. Then, I applied various clustering techniques to both the similarity matrix and the original data, with and without PCA dimensionality reduction. A custom evaluation function was designed to assess the performance of the clustering techniques. It verifies whether each technique has correctly matched the cluster labels, counts the correct matches, and returns a compatibility percentage. This metric increases by about 20% for the clustering applied to the input omics and for the clustering applied with PCA to the similarity matrix.
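The post-processing steps in this paragraph can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the thesis code: the function names, the Euclidean distance, the toy centroid values (in the thesis they come from the network), and the label-swap handling in `compatibility` for the two-class case are all hypothetical choices.

```python
import numpy as np

def artificial_label(x):
    """Artificial Y label: mean expression over all features of each patient."""
    return np.asarray(x, dtype=float).mean(axis=1)

def assign_to_centroids(samples, centroids):
    """Index of the closest centroid for each sample (Euclidean distance assumed)."""
    diffs = samples[:, None, :] - centroids[None, :, :]
    distances = np.linalg.norm(diffs, axis=2)  # shape: (n_samples, n_centroids)
    return distances.argmin(axis=1)

def similarity_matrix(assignment):
    """Square binary matrix: 1 if two samples share a centroid, 0 otherwise."""
    assignment = np.asarray(assignment)
    return (assignment[:, None] == assignment[None, :]).astype(int)

def compatibility(true_labels, cluster_labels):
    """Percentage of correct matches for two classes labelled 0 and 1.

    Cluster ids are arbitrary, so the better of the two possible
    label mappings is used (an assumption about the evaluation).
    """
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    direct = np.mean(true_labels == cluster_labels)
    swapped = np.mean(true_labels == 1 - cluster_labels)
    return 100.0 * max(direct, swapped)

# Toy data (hypothetical): 4 patients, 2 features, 2 centroids.
samples = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
centroids = np.array([[0.0, 0.5], [5.5, 5.0]])

assignment = assign_to_centroids(samples, centroids)
similarity = similarity_matrix(assignment)
score = compatibility([1, 1, 0, 0], assignment)

print(assignment)   # [0 0 1 1]
print(similarity)
print(score)        # 100.0
```

Only patient-to-centroid distances are computed here, so the cost grows with the number of centroids rather than with the number of patient pairs.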

Supervisors: Elisa Ficarra, Marta Lovino
Academic year: 2020/21
Publication type: Electronic
Number of Pages: 86
Degree programme: Master's degree programme in Ingegneria Informatica (Computer Engineering)
Degree class: New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING
Collaborating companies: UNSPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/19252