Data integration for the analysis of the oncogenic potential of gene fusions

Venere Sabrina Barrese

Supervisors: Elisa Ficarra, Marta Lovino. Politecnico di Torino, Master's degree programme in Biomedical Engineering, 2020

License: Creative Commons Attribution Non-commercial No Derivatives.

The life cycle of a cell is closely tied to DNA replication and transcription. Under certain conditions the DNA may break and create aberrant products known as gene fusions. A gene fusion is made up of two genes, usually from different chromosomes, which together are called a gene pair. After the breaking event a portion of either gene can be lost, and the point at which each gene breaks is known as its breakpoint. Some gene fusions have been proven to be related to certain types of cancer; these are defined as driver gene fusions. Gene fusion detection tools are commonly used to identify gene fusions in a biological sample. However, these tools detect a high number of putative fusions in tumor samples and often cannot confidently label them as oncogenic, which suggests the need for more insight into the role of gene fusions in cancer. This thesis examines three elements in the evaluation of the oncogenic potential of gene fusions: transcription factors (TFs), gene ontologies (GOs) and micro-RNAs (miRNAs). Under the assumption that these elements can characterize a gene fusion, two machine learning methods were used to distinguish driver fusions from passenger events (i.e. gene fusions not related to cancer): support vector machines (SVMs) and the multilayer perceptron (MLP). The classifiers were trained on 1765 thoroughly validated gene fusions and tested on 5246 samples. The training samples and the oncogenic test samples come from an ensemble of databases analyzed by Lovino M. et al. in DEEPrior, while the healthy test samples were extracted from the paper by Babiceanu M. et al. The developed method first uses the gene names and the breakpoints to extract the following features for both genes: the percentage of the gene retained after the fusion event, the putative role assigned by the Cancermine database (Lever J. et al.), and whether the two genes are transcribed in the same direction.
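
The initial features described above can be illustrated with a minimal Python sketch. It is not the thesis code: the `GeneSide` record, the coordinate convention, and the rule that the 5' partner keeps everything upstream of its breakpoint while the 3' partner keeps the rest are all assumptions made for illustration, and the Cancermine role is passed in as a plain string because the abstract does not say how it is encoded.

```python
from dataclasses import dataclass


@dataclass
class GeneSide:
    """One partner gene of a fusion (hypothetical record, not from the thesis)."""
    start: int       # first genomic coordinate of the gene (inclusive)
    end: int         # last genomic coordinate of the gene (inclusive)
    strand: str      # "+" or "-"
    breakpoint: int  # genomic coordinate where the gene breaks


def retained_fraction(gene: GeneSide, is_five_prime: bool) -> float:
    """Fraction of the gene length kept in the fusion.

    Assumption: the 5' partner keeps the stretch from its transcription start
    up to the breakpoint, and the 3' partner keeps the remainder of the gene.
    """
    length = gene.end - gene.start + 1
    bp = min(max(gene.breakpoint, gene.start), gene.end)  # clamp into the gene
    # Bases between the transcription start and the breakpoint (strand-aware).
    if gene.strand == "+":
        upstream = bp - gene.start + 1
    else:
        upstream = gene.end - bp + 1
    kept = upstream if is_five_prime else length - upstream
    return kept / length


def fusion_features(five: GeneSide, three: GeneSide, role_5p: str, role_3p: str) -> dict:
    """Collect the three feature types named in the abstract for one gene pair."""
    return {
        "retained_5p": retained_fraction(five, is_five_prime=True),
        "retained_3p": retained_fraction(three, is_five_prime=False),
        "same_direction": int(five.strand == three.strand),
        "role_5p": role_5p,  # role label looked up elsewhere (placeholder)
        "role_3p": role_3p,
    }


# Made-up coordinates, used only to show the call.
example = fusion_features(
    GeneSide(start=1000, end=1999, strand="+", breakpoint=1249),
    GeneSide(start=5000, end=5999, strand="-", breakpoint=5600),
    role_5p="oncogene",
    role_3p="unknown",
)
```

The retained fractions are kept as values between 0 and 1; multiplying by 100 gives the percentage mentioned in the abstract.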
Training and cross-validation were performed on the training set using these features, yielding a cross-validation AUC higher than 88% for both the linear SVM and the MLP. Next, the complete set of 181 transcription factors (https://amp.pharm.mssm.edu) was used to train and cross-validate the classifiers on the training set. Combining the previously defined features with the 181 transcription factors improved the performance metrics of both classifiers. An analogous process was carried out to integrate the information from the gene ontologies. The GOs were obtained with the Biomart tool (Yates A. et al.), gathering a total of 5125 features. Finally, the association between miRNAs and genes was retrieved from Targetscan (Agarwal V. et al.) as a list of probabilities defining the strength of the relationship between each miRNA and gene. A total of 333 miRNAs were identified as features. The final method is an MLP with 4 layers trained on the initial features, TFs, GOs and miRNAs. The cross-validation accuracy, precision, recall and AUC were 90%, 86%, 99% and 0.88, respectively. The same metrics computed on the test set were 81%, 78%, 86% and 0.81. The complete pipeline proved able to integrate the different sources of data and to discriminate driver from passenger gene fusions with adequate reliability. The tool outperformed Oncofuse (Shugay M. et al.), a similar tool found in the literature.
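
For readers unfamiliar with the four reported metrics, the following Python sketch shows how accuracy, precision, recall and AUC can be computed from classifier outputs. It is an illustration only: the label convention (1 for driver, 0 for passenger), the 0.5 decision threshold, and the toy scores are assumptions and do not come from the thesis.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, with 1 as the driver class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def roc_auc(y_true, scores):
    """AUC as the probability that a random driver scores above a random
    passenger, counting ties as one half (pairwise formulation)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def classification_report(y_true, scores, threshold=0.5):
    """Compute the four metrics quoted in the abstract for one set of scores."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "auc": roc_auc(y_true, scores),
    }


# Toy labels and scores, invented for the example.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
report = classification_report(labels, scores)
```

Accuracy, precision and recall depend on the chosen threshold, whereas AUC depends only on how the scores rank drivers against passengers, which is one reason it is commonly reported alongside the other three.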

Supervisors: Elisa Ficarra, Marta Lovino
Academic year: 2019/20
Publication type: Electronic
Number of Pages: 84
Degree programme: Master's degree programme in Biomedical Engineering
Degree class: New organization > Master science > LM-21 - BIOMEDICAL ENGINEERING
Collaborating companies: UNSPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/15031