Politecnico di Torino

Automatic detection of new phishing domains using machine learning

Chiara Aliberti

Automatic detection of new phishing domains using machine learning.

Supervisor: Marco Mellia. Politecnico di Torino, Corso di laurea magistrale in Ingegneria Informatica (Computer Engineering), 2020


Phishing is an online attack that tries to deceive its victims into revealing sensitive information, such as credentials or credit card details. The user is usually lured by email (whaling, spear phishing, clone phishing) to malicious webpages that mimic the content of known legitimate websites. Over the years, phishing detection has become increasingly important, mostly because of the fast-paced nature of the attack and the ever-growing scale of phishing campaigns: the majority of phishing websites stay alive for less than 24 hours, and thousands of new phishing domains are registered every day. This trend underlines the importance of narrowing the window of vulnerability between the moment a website goes online and the moment it is detected as malicious. This thesis work was carried out at the IT company Ermes Cyber Security to tackle this issue by attempting to identify phishing campaigns in their early phases, starting the detection process from newly registered public-key certificates. For each domain collected, extra data is gathered, such as DNS records, the WHOIS record, and various details about the domain's webpages. Using the features extracted from this data, two machine learning models were chosen to predict the legitimacy of the domains: a random forest classifier and a support vector machine (SVM) classifier. Both models were trained on a set of features extracted from a manually selected ground-truth dataset of over 500 legitimate domains and over 430 phishing domains. After various steps to improve the models, such as feature selection, hyperparameter tuning, and PCA, the results obtained showed promise. The random forest classifier achieves an accuracy of 91.4% and a precision of 94.9%, while the SVM classifier obtains similar results, with an accuracy of 92.5% and a precision of 95.4%.
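For readers less familiar with the two metrics quoted above, the following is a minimal sketch of how accuracy and precision are computed for a binary classifier. It is not taken from the thesis: the names `ConfusionCounts` and `confusion_from_labels` are hypothetical, the labels in the usage example are made-up toy data, and treating "phishing" as the positive class is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Outcome counts for a binary classifier (assumption: 'phishing' is positive)."""
    tp: int  # phishing domains predicted as phishing
    fp: int  # legitimate domains predicted as phishing
    tn: int  # legitimate domains predicted as legitimate
    fn: int  # phishing domains predicted as legitimate


def confusion_from_labels(y_true, y_pred, positive="phishing"):
    """Count the four outcomes from parallel lists of true and predicted labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return ConfusionCounts(tp, fp, tn, fn)


def accuracy(c: ConfusionCounts) -> float:
    """Fraction of all domains that were classified correctly."""
    total = c.tp + c.fp + c.tn + c.fn
    return (c.tp + c.tn) / total if total else 0.0


def precision(c: ConfusionCounts) -> float:
    """Fraction of domains flagged as phishing that really are phishing."""
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


if __name__ == "__main__":
    # Toy labels for illustration only; these are not the thesis results.
    y_true = ["phishing", "phishing", "legitimate", "legitimate", "phishing"]
    y_pred = ["phishing", "legitimate", "legitimate", "phishing", "phishing"]
    counts = confusion_from_labels(y_true, y_pred)
    print(f"accuracy={accuracy(counts):.3f} precision={precision(counts):.3f}")
    # prints: accuracy=0.600 precision=0.667
```

Under this reading, a precision above the accuracy means that the domains flagged as phishing are wrong less often than the predictions overall, which matters when false alarms on legitimate domains are costly.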
The resulting models were then run in the wild to classify over 240,000 domains collected over several months, for a total of over 41.6 GB of raw data analyzed. The labeled domains were then sampled and manually checked to verify the correctness of the classification and the overall trustworthiness of the classifiers produced in this thesis work.
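The abstract does not say how the labeled domains were sampled for the manual check. One plausible approach, sketched here purely as an assumption, is to draw a fixed number of domains at random from each predicted class so that both classes are reviewed. The function name `sample_for_review` and the example domains are hypothetical.

```python
import random
from collections import defaultdict


def sample_for_review(predictions, per_label, seed=0):
    """Draw up to `per_label` domains at random from each predicted class.

    predictions: dict mapping a domain name to its predicted label.
    Returns a dict mapping each label to the list of sampled domains.
    """
    rng = random.Random(seed)  # fixed seed so the review sample is reproducible
    by_label = defaultdict(list)
    for domain, label in predictions.items():
        by_label[label].append(domain)

    sample = {}
    for label in sorted(by_label):
        domains = sorted(by_label[label])  # stable order before sampling
        k = min(per_label, len(domains))   # never ask for more than exist
        sample[label] = rng.sample(domains, k)
    return sample


if __name__ == "__main__":
    # Made-up predictions using reserved example domains.
    predictions = {
        "login-update.example": "phishing",
        "secure-pay.example": "phishing",
        "account-verify.example": "phishing",
        "news.example": "legitimate",
    }
    for label, domains in sample_for_review(predictions, per_label=2).items():
        print(label, domains)
```

Sampling per predicted class rather than over the whole set keeps the rarer class from being under-represented in the manual check, at the cost of the sample no longer reflecting the class proportions observed in the wild.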

Supervisors: Marco Mellia
Academic year: 2020/21
Publication type: Electronic
Number of Pages: 81
Additional Information: Restricted thesis. Full text not available
Degree programme: Corso di laurea magistrale in Ingegneria Informatica (Computer Engineering)
Degree class: New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING
Collaborating companies: ERMES CYBER SECURITY S.R.L.
URI: http://webthesis.biblio.polito.it/id/eprint/15989