Malware Family Classification with Semi-Supervised Learning

Maria Letizia Colangelo

Supervisors: Antonio Lioy, Andrea Atzeni. Politecnico di Torino, Master's degree programme in Computer Engineering (Ingegneria Informatica), 2023

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.

In recent years, the spread of malware has increased exponentially, posing a significant challenge for cybersecurity experts. When faced with the constantly evolving landscape of unknown threats, including zero-day attacks, traditional signature-based approaches to malware detection have proven insufficient. Furthermore, adversaries adapt by modifying their malicious code, which further reduces the efficacy of signature-based detection. As a solution to these problems, machine learning models have been used to develop behaviour-based malware detection systems, because of their ability to generalise from data and detect previously unseen malware. These systems inspect code in order to identify any malicious or potentially harmful actions it performs. Supervised learning shows promising results in detecting malicious code, but it is significantly limited by the considerable manual effort required to label both malware and benign instances. Unsupervised learning methods, while reducing the labelling effort, struggle with accurate categorisation, especially in complex tasks like malware detection. A potential solution lies in a hybrid approach, called semi-supervised learning (SsL), which combines labelled and unlabelled data. SsL has the potential to enhance algorithm performance by leveraging a limited number of labelled samples together with a large amount of unlabelled data. However, whether SsL offers advantages over supervised learning in the context of malware detection remains unclear in the current state of the art, given the absence of a rigorous evaluation. This thesis focuses on semi-supervised learning for malware family classification, aiming to explore its benefits compared to the supervised approach. The research objective is to identify the circumstances under which unlabelled data can be effectively employed to enhance detection accuracy.
To achieve this, I employed an evaluation framework from the current state of the art that allows a rigorous assessment of the benefits of SsL. In particular, I implemented an image-based classification method, in line with the increasing adoption of this approach by researchers, which is driven by the promising outcomes and the continuous improvements in the field of image processing using machine learning. This alternative technique converts malware executables into grayscale images, which are then used to perform malware detection; it is motivated by the visual similarities among malware images from the same family and the differences among malware images of distinct families. Various combinations of machine learning algorithms and feature sets are employed to provide a comprehensive overview of the performance of the semi-supervised approach. The analysis shows that these algorithms behave differently in response to various challenges, including dataset distribution, feature input and the proportion of labelled data. The evaluation uses various performance metrics to validate the results, revealing how they vary under different conditions.
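
As an illustration of the conversion of executables into grayscale images mentioned in the abstract, the following is a minimal sketch and not the implementation used in the thesis. It assumes the common convention of mapping each byte of the file to one pixel intensity (0 to 255) and filling rows of a fixed width; the function name `bytes_to_grayscale`, the default width of 256 and the zero-padding of the last row are all hypothetical choices made for this example.

```python
import math


def bytes_to_grayscale(data: bytes, width: int = 256) -> list[list[int]]:
    """Map each byte to one pixel (0 = black, 255 = white), row by row.

    Assumption: fixed row width and zero-padding of the final row.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    height = math.ceil(len(data) / width)
    # Zero-pad the tail so that the last row is complete.
    padded = data + bytes(height * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


# Toy stand-in for the raw bytes of an executable (not real malware).
sample = bytes(range(10))
image = bytes_to_grayscale(sample, width=4)
print(image)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 0, 0]]
```

The resulting two-dimensional array of intensities can then be passed to any image feature extractor or classifier. In practice the row width is often chosen as a function of the file size, which this sketch does not attempt.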

Supervisors: Antonio Lioy, Andrea Atzeni
Academic year: 2023/24
Publication type: Electronic
Number of Pages: 98
Degree programme: Master's degree programme in Computer Engineering (Ingegneria Informatica)
Degree class: New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING
Collaborating companies: Politecnico di Torino
URI: http://webthesis.biblio.polito.it/id/eprint/29460