Automatic Malware Signature Generation

Michele Crepaldi

Supervisors: Antonio Lioy, Andrea Atzeni. Politecnico di Torino, Master's degree course in Computer Engineering (Ingegneria Informatica), 2021

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:

In recent years the proliferation of malicious software (malware) has increased massively: according to the AV Atlas Dashboard, about 440,000 new malware and PUA (Potentially Unwanted Application) samples are detected every day at the time of writing, and this number is predicted to keep growing. The total number of known malicious programs (and PUAs) for Microsoft Windows grew from about 55 million in 2011 to about 400 million in 2016, and to nearly 830 million at the time of writing. This sheer volume of samples in the wild makes detection through manually generated signatures (patterns that identify malicious code) infeasible, and it creates an urgent need for tools that can automatically detect malware and, where possible, describe it in a human-interpretable way. Several methodologies have been proposed over the years, ranging from signature-based detection (especially with YARA rules) to various Machine Learning (ML) approaches such as Decision Trees, Naive Bayes models and Neural Networks. This thesis presents a novel model that builds on previous work in ML-based automatic detection and description of PE (Microsoft Windows Portable Executable) malware. It also introduces a new evaluation procedure for the learned implicit representation (signature) of malware samples, which may demonstrate its applicability to the malware family prediction and ranking tasks. The model is trained on an open-source, large-scale dataset of malware and benignware samples, with the aim of creating high-quality implicit signatures capable of correctly detecting (and describing) unseen malware samples, as well as obfuscated malware and new variants, with high True Positive Rate (TPR) and high Recall at low False Positive Rates (FPRs).
The Proposed Model's results on the different tasks (both the ones it was trained on, namely Malicious/Benign label prediction and descriptive tag prediction, and the additional malware family prediction and ranking tasks) were compared with those of previous models. In particular, the ALOHA model proposed by Rudd et al. and the Joint Embedding model described by Ducau et al. were selected as reference models. The results show that the Proposed Model generates implicit signatures (sample embeddings) that yield higher TPR, Accuracy, Recall, Precision and F1 Score at low false positive rates than those produced by these previous methods on the corresponding tasks. When the Proposed Model's learned representations were tested on the malware family prediction and ranking tasks, however, the results were less promising. Therefore, a new Malware Family Classifier model was built on top of the Proposed Model's base topology. This new model was trained and evaluated on the malware family classification task using a specially crafted dataset of 10,000 PE files, reusing the parameters from a previous Proposed Model training run through transfer learning. The new Family Classifier provided more meaningful, although not exceptional, results on the family classification task, while also demonstrating the potential of transfer learning in this context. Future work that overcomes some of the final model's limitations may be very useful to the IT security field in the current scenario, and could even enable the generation of explicit (and thus more interpretable) signatures derived from the learned implicit ones.
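
The abstract reports detection quality as TPR at low FPRs. As an illustration only, and not the evaluation code used in the thesis, the following minimal sketch shows one common way such a figure can be computed from detector scores: choose the lowest decision threshold whose false positive rate on benign samples stays within a budget, then read off the true positive rate at that threshold. The function name `tpr_at_fpr`, the label convention (1 = malicious, 0 = benign) and the toy scores are all assumptions made for this example.

```python
def tpr_at_fpr(scores, labels, max_fpr):
    """Return (threshold, tpr, fpr) for the lowest threshold with FPR <= max_fpr.

    Hypothetical helper for illustration. A sample is flagged as malicious
    when its score is >= threshold; labels use 1 = malicious, 0 = benign.
    """
    benign = [s for s, y in zip(scores, labels) if y == 0]
    malicious = [s for s, y in zip(scores, labels) if y == 1]
    if not benign or not malicious:
        raise ValueError("need at least one benign and one malicious sample")

    # Fallback: flag nothing, so both rates are zero.
    best = (float("inf"), 0.0, 0.0)
    # Candidate thresholds are the observed scores, from highest to lowest.
    for threshold in sorted(set(scores), reverse=True):
        fpr = sum(s >= threshold for s in benign) / len(benign)
        if fpr > max_fpr:
            break  # FPR can only grow as the threshold is lowered
        tpr = sum(s >= threshold for s in malicious) / len(malicious)
        best = (threshold, tpr, fpr)
    return best


# Toy data, invented for this example (not taken from the thesis).
toy_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
toy_labels = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]

for budget in (0.0, 0.2):
    threshold, tpr, fpr = tpr_at_fpr(toy_scores, toy_labels, budget)
    print(f"FPR budget {budget}: threshold={threshold}, TPR={tpr:.2f}, FPR={fpr:.2f}")
```

Comparing two models at the same FPR budget, rather than at a fixed score threshold, keeps the comparison fair when their score distributions differ.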

Supervisors: Antonio Lioy, Andrea Atzeni
Academic year: 2021/22
Publication type: Electronic
Number of pages: 172
Subjects:
Degree course: Master's degree course in Computer Engineering (Ingegneria Informatica)
Degree class: New regulations > Master's degree > LM-32 - Computer Engineering
Collaborating companies: Not specified
URI: http://webthesis.biblio.polito.it/id/eprint/20400