Machine Learning for malware characterization and identification

Marco Saracino

Supervisors: Antonio Lioy, Andrea Atzeni. Politecnico di Torino, Master's degree programme in Computer Engineering (Corso di laurea magistrale in Ingegneria Informatica), 2023

PDF (Tesi_di_laurea) - Thesis (4MB)
License: Creative Commons Attribution Non-commercial No Derivatives.

Archive (ZIP) (Documenti_allegati) - Other (135MB)
License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:

Malware is one of the most important security threats today. Malicious programs have evolved over time, becoming more numerous and more complex. Zero-day malware is new malware that is already widespread on the Internet but has not yet been identified. Traditional signature-based detection systems fail to detect these new malicious files: because the files have not yet been analyzed, the systems have no valid signature with which to identify them and produce false negatives when examining them. To identify and classify malware without relying on signatures, I tried different machine learning techniques to understand which algorithm was best suited to the task. First, I looked for suitable datasets and analyzed the available malware to see which features could be extracted. These features were then adapted into data structures that could be given as input to the selected algorithms. Four machine learning algorithms were then selected for testing. To choose them, I studied the state of the art, compared the results reported in earlier research on this problem, and determined which four algorithms were the most promising. During this study, I noted that each work extracted only a few features from its dataset: usually the authors extracted one or two features from the malware and used them with a single machine learning algorithm. I therefore decided to take a different approach. I set myself the goal of extracting as many features as possible from my dataset and using them with each of my chosen algorithms. In this way, I was able to determine, for each feature, which algorithm was best for malware detection and classification.
The features chosen were binary file size, byte n-grams, opcode n-grams, the number of occurrences of each individual opcode, entropy, API n-grams, and the presence or absence of each individual API function. The algorithms chosen were Random Forest, K-nearest neighbors, Support Vector Machine and Gradient Boosting Classifier. The results showed that Random Forest and Gradient Boosting Classifier perform better in terms of accuracy. In addition, because SVMs perform a training phase for each class in the dataset, they take a long time to train. Byte n-grams were the feature with which the algorithms performed best, although opcode n-grams also performed excellently. To check that these findings generalize to other datasets, the same procedure was repeated with a second dataset, obtaining very similar results. The results are significant and suggest that machine learning algorithms for malware identification and classification may be a solution to one of the biggest threats in the modern world.
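
As an illustration of three of the features named above (binary file size, n-grams of bytes, and entropy), the following is a minimal Python sketch. It is not taken from the thesis: the function names, the choice of n = 2, the use of Shannon entropy over single bytes, and the toy byte strings standing in for real binaries are all assumptions made for the example.

```python
import math
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count every contiguous n-byte sequence in data (empty if data is shorter than n)."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in Counter(data).values())


def feature_vector(data: bytes, vocabulary: list, n: int = 2) -> list:
    """Fixed-length vector: file size, entropy, then one count per vocabulary n-gram."""
    counts = byte_ngrams(data, n)
    return [float(len(data)), shannon_entropy(data)] + [float(counts[g]) for g in vocabulary]


# Toy byte strings standing in for real binaries (hypothetical data, not from the thesis).
samples = [b"\x90\x90\x90\x90\xcc", b"MZ\x00\x01MZ\x00\x02"]

# Vocabulary: every n-gram seen in any sample, sorted so the column order is stable.
all_counts = Counter()
for sample in samples:
    all_counts.update(byte_ngrams(sample))
vocabulary = sorted(all_counts)

# One row per sample; all rows have the same length.
vectors = [feature_vector(sample, vocabulary) for sample in samples]
```

Rows built this way have a fixed length, which is the shape that classifiers such as the four named above generally expect as input. With real binaries the n-gram vocabulary grows very quickly, so in practice it would probably need to be restricted, for example to the most frequent n-grams.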

Supervisors: Antonio Lioy, Andrea Atzeni
Academic year: 2022/23
Publication type: Electronic
Number of pages: 100
Subjects:
Degree programme: Master's degree programme in Computer Engineering (Corso di laurea magistrale in Ingegneria Informatica)
Degree class: New system > Master's degree > LM-32 - Computer Engineering (Ingegneria Informatica)
Collaborating companies: Politecnico di Torino
URI: http://webthesis.biblio.polito.it/id/eprint/26794