Interpretable Machine Learning for malware characterization and identification

Filippo Giovagnini

Supervisors: Antonio Lioy, Andrea Atzeni. Politecnico di Torino, Master's degree programme in Ingegneria Informatica (Computer Engineering), 2023

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:

Malware remains a pervasive and evolving threat to cybersecurity. The rapid proliferation of new malware variants calls for new approaches to timely identification and classification. This thesis studies the development of a machine learning model for this task. Its primary objective is a model for malware identification and classification that prioritizes interpretability: the model should give clear insight into its decision-making process, so that security analysts can understand which features and characteristics drive each classification. This is essential for building confidence in automated cybersecurity systems.
First, I surveyed the state of the art in interpretable machine learning models applied to malware identification and categorization, consistently giving more weight to interpretability than to performance. I analyzed the earliest studies on the topic in detail and then moved on to the most recent and significant ones. Most of these studies avoided neural networks because of their computational cost and preferred traditional ML algorithms such as Random Forest, Gaussian Naive Bayes, Decision Tree, K-Nearest Neighbors and Support Vector Machine. Nevertheless, I wanted to harness the power of neural networks and found a very promising project whose authors used the Grad-CAM algorithm, which had never been used before in this field. They first converted Android applications into images, used these images to train a convolutional neural network, and then applied Grad-CAM to the images for interpretability purposes.
The real advantage would come from adding a reverse engineering phase at the end of this pipeline. This phase should automate the conversion of an image region back into memory and thus into code, so that the security analyst has immediate access to the suspicious code. This is what I did in my thesis, following the work of these authors.
The process began with the collection of a large and diverse dataset of Android malware samples, together with a pool of benign applications. I first used this dataset with the deep neural network developed in that project, but the results were unsatisfactory, so I built another model on top of the Inception neural network developed and pre-trained by Google, adding some layers to the base model to perform classification. With this new model, performance was very promising.
The reverse engineering phase relies on a legend file created while the application images are generated. A decompiler produces the opcodes of the application, each opcode is mapped to an ASCII character, and the characters are stored in a text file in the order in which they appear in the code. The positions where classes begin and end are recorded in the legend text file, which makes it possible to go back from an image area to the corresponding code. The results are significant: the analyst now has immediate access to the precise classes to which the suspect code belongs and can study them further, from manual review to more automated analysis.
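The legend idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the opcode table, the class names, the function names and the one-character-per-pixel, row-by-row image layout are all assumptions made for illustration only.

```python
from bisect import bisect_right

# Hypothetical opcode-to-character table; the actual mapping used in the
# thesis is not given in the abstract.
OPCODE_TO_CHAR = {
    "nop": "A",
    "move": "B",
    "const": "C",
    "invoke-virtual": "D",
    "return-void": "E",
}


def encode_classes(classes):
    """Concatenate opcode characters class by class and record boundaries.

    `classes` is a list of (class_name, [opcode, ...]) pairs in code order.
    Returns (text, legend); each legend entry is (class_name, start, end),
    with `end` exclusive and both offsets counted in characters of `text`.
    """
    chars, legend, offset = [], [], 0
    for name, opcodes in classes:
        encoded = [OPCODE_TO_CHAR[op] for op in opcodes]
        chars.extend(encoded)
        legend.append((name, offset, offset + len(encoded)))
        offset += len(encoded)
    return "".join(chars), legend


def classes_in_region(legend, width, top, left, bottom, right):
    """Return the classes whose characters fall inside an image rectangle.

    Assumes one character per pixel, laid out row by row with `width`
    pixels per row; `bottom` and `right` are exclusive.
    """
    starts = [start for _, start, _ in legend]
    total = legend[-1][2] if legend else 0
    found = []
    for row in range(top, bottom):
        for col in range(left, right):
            offset = row * width + col
            if offset >= total:
                continue  # padding pixel past the end of the code
            # Last legend entry whose start is at or before this offset.
            name = legend[bisect_right(starts, offset) - 1][0]
            if name not in found:
                found.append(name)
    return found


# Two made-up classes, rendered as a 4-pixel-wide image.
text, legend = encode_classes([
    ("Lcom/example/Main;", ["const", "invoke-virtual", "return-void"]),
    ("Lcom/example/Helper;", ["move", "nop", "move", "return-void"]),
])
highlighted = classes_in_region(legend, width=4, top=1, left=0, bottom=2, right=4)
```

In this sketch, a rectangle highlighted by an interpretability method such as Grad-CAM would be passed to `classes_in_region`, which returns the names of the classes an analyst should inspect first.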

Supervisors: Antonio Lioy, Andrea Atzeni
Academic year: 2023/24
Publication type: Electronic
Number of pages: 95
Subjects:
Degree programme: Master's degree programme in Ingegneria Informatica (Computer Engineering)
Degree class: New system > Master's degree > LM-32 - INGEGNERIA INFORMATICA
Collaborating companies: NOT SPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/29462