DGA Detection with Big Data approaches

Luigi De Luca

DGA Detection with Big Data approaches.

Supervisor: Paolo Garza. Politecnico di Torino, Master's degree course in Ingegneria Informatica (Computer Engineering), 2022

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:

Domain generation algorithms (DGAs) are present in various malware families and are used to periodically generate a large number of domain names through which the malware can communicate with its command and control servers. DGAs have quickly become the main method attackers use to communicate remotely with the malicious tools they have created, replacing hard-coded lists of domain names and IP addresses, which become useless once they have been blocked. Compared with these earlier methods, DGAs are easy to implement, difficult to block, and may be impossible to predict in advance. The core of a DGA is the domain generator, which can produce a random string of characters, a concatenation of random words taken from a dictionary, a constant part followed by a changing suffix, a constant part preceded by a changing prefix, and so on. The purpose of this thesis is to study DGA detection: it analyzes the characteristics of DGA domain names and builds a model that can distinguish legitimate domain names from DGA-based ones. Two approaches to identifying DGA domain names have been analyzed and tested: a supervised machine learning approach based on feature extraction, and a deep learning approach based on text classification. The traditional machine learning classifiers used are Random Forest and XG-Boost, while the two deep learning models are based on a Neural Network (NN): the first uses a Long Short-Term Memory (LSTM), the second a Bidirectional LSTM. The dataset used to validate these models consists of real domain names and DGA-based ones, and is divided into a training set and a testing set for model evaluation.
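To make the feature-extraction approach concrete, the following is a minimal Python sketch of lexical features that could be computed from a domain name before it is passed to a classifier. The abstract does not list the features actually used in the thesis, so the feature set (length, character entropy, digit ratio, vowel ratio) and the function names `shannon_entropy` and `extract_features` are illustrative assumptions. The sketch also assumes the input is a registered domain without subdomains, so only the leftmost label is analyzed.

```python
import math
from collections import Counter


def shannon_entropy(s: str) -> float:
    """Shannon entropy (bits per character) of the characters in s."""
    if not s:
        return 0.0
    n = len(s)
    counts = Counter(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def extract_features(domain: str) -> dict:
    """Hypothetical lexical features; not the thesis's actual feature set."""
    # Assumption: analyze only the leftmost label ("google" in "google.com").
    label = domain.lower().split(".")[0]
    n = len(label)
    return {
        "length": n,
        "entropy": shannon_entropy(label),
        "digit_ratio": sum(ch.isdigit() for ch in label) / n if n else 0.0,
        "vowel_ratio": sum(ch in "aeiou" for ch in label) / n if n else 0.0,
    }
```

Randomly generated labels tend to have higher character entropy and fewer vowels than human-chosen names, which is why features of this kind are a common starting point; dictionary-based DGAs are harder to separate with such features alone.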
Validation is performed in two ways. In the first, the dataset is randomly split into a training set and a testing set. In the second, a different set of DGA families is used as the testing set, to simulate the case in which the model encounters something it has never seen, namely new DGA families. The models are trained on the training set and evaluated on the testing set, and the results are analyzed with several metrics, the most important being the accuracy and the values of True Positives, False Positives, True Negatives and False Negatives. Based on the two validations, the results obtained, and the time performance, the best solution turns out to be the XG-Boost classifier with feature extraction. Specifically, in the second validation, which is the most important because it simulates a real-world situation, the overall accuracy of the XG-Boost classifier is around 92%, the True Positive percentage is slightly less than 89%, and the False Positive percentage is less than 4%. XG-Boost is therefore chosen mainly for the robustness it has shown when encountering algorithms it has never seen before.
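The second validation scheme and the reported metrics can be sketched in Python as follows. This is an illustration under stated assumptions, not the thesis's code: the helper names `holdout_family_split` and `confusion_metrics` are hypothetical, legitimate domains are assumed to be split at random between the two sets, and the True Positive and False Positive percentages are interpreted as rates (TP / (TP + FN) and FP / (FP + TN)), which the abstract does not state explicitly.

```python
import random


def holdout_family_split(samples, test_families, legit_test_fraction=0.2, seed=0):
    """Split (domain, family) pairs so that test DGA families are unseen in training.

    family is None for legitimate domains (assumption: these are split at random).
    """
    rng = random.Random(seed)
    train, test = [], []
    for domain, family in samples:
        if family is None:
            target = test if rng.random() < legit_test_fraction else train
        elif family in test_families:
            target = test   # held-out DGA family: never shown during training
        else:
            target = train
        target.append((domain, family))
    return train, test


def confusion_metrics(y_true, y_pred):
    """Accuracy, true positive rate and false positive rate (1 = DGA, 0 = legitimate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "tpr": tp / (tp + fn) if (tp + fn) else 0.0,
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
    }
```

Holding out whole families, rather than individual domains, prevents the test set from containing names produced by a generator the model has already seen, which is what makes this scheme a stricter test of generalization than a random split.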

Supervisors: Paolo Garza
Academic year: 2022/23
Publication type: Electronic
Number of pages: 80
Subjects:
Degree course: Master's degree course in Ingegneria Informatica (Computer Engineering)
Degree class: New system > Master's degree > LM-32 - INGEGNERIA INFORMATICA (Computer Engineering)
Partner companies: DATA Reply S.r.l. con Unico Socio
URI: http://webthesis.biblio.polito.it/id/eprint/25547