Automatic detection of suspicious websites based on community detection algorithms in multi-graph context

Fabio Tecco

Automatic detection of suspicious websites based on community detection algorithms in multi-graph context.

Supervisor: Paolo Garza. Politecnico di Torino, Master's degree programme in Data Science and Engineering, 2024

Abstract:	In cybersecurity, graph analysis has become a key methodology for identifying malicious websites and attack campaigns. It exploits the structure of a graph and the relationships between its nodes to find anomalous patterns or suspicious communities that may indicate malicious activity. This thesis develops a methodology for automatically identifying suspicious websites not yet known to be malicious, starting from websites already known to be malicious. To this end, several community detection algorithms were applied to multi-graphs built in Neo4j, a widely used graph database management system. The multi-graphs were populated with preprocessed features extracted from datasets containing only passive detection information about websites, i.e. information obtained without interacting with the sites directly. A distinctive property of these datasets is that the information was collected when a website's SSL/TLS certificate was created or renewed. The methodology rests on the hypothesis that a malicious website is not isolated but belongs to a community of other malicious websites, for example because they are part of the same attack campaign. The thesis investigates whether community detection algorithms can correctly and efficiently identify the malicious websites in such a community that are not yet known. The methodology was evaluated by querying URLSCAN, a website scanner for suspicious and malicious URLs, to determine whether each website labeled as suspicious by the procedure was benign or malicious; URLSCAN was treated as a fully reliable oracle.
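The pipeline described above (link sites through shared passive features, detect communities, flag the unknown members of any community that contains a known malicious site, then score the flags against an oracle) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration: the feature strings, the function names, the use of a simple deterministic label propagation as the community detection step, and the plain dictionary standing in for URLSCAN verdicts. The thesis itself works on Neo4j multi-graphs, and this abstract does not name the algorithms it used.

```python
from collections import Counter, defaultdict


def build_site_graph(site_features):
    """Link two sites when they share at least one feature value (hypothetical features)."""
    by_feature = defaultdict(set)
    for site, feats in site_features.items():
        for feat in feats:
            by_feature[feat].add(site)
    adj = {site: set() for site in site_features}
    for sites in by_feature.values():
        for site in sites:
            adj[site] |= sites - {site}
    return adj


def label_propagation(adj, max_rounds=20):
    """Asynchronous label propagation; ties go to the smallest label so runs are repeatable."""
    labels = {node: node for node in adj}
    for _ in range(max_rounds):
        changed = False
        for node in sorted(adj):
            if not adj[node]:
                continue  # isolated node keeps its own label
            counts = Counter(labels[nb] for nb in adj[node])
            top = max(counts.values())
            best = min(lab for lab, cnt in counts.items() if cnt == top)
            if labels[node] != best:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels


def flag_suspicious(labels, seeds):
    """Return non-seed sites that share a community with a known malicious seed."""
    seed_labels = {labels[s] for s in seeds if s in labels}
    return sorted(n for n in labels if labels[n] in seed_labels and n not in seeds)


def precision(suspicious, oracle):
    """Fraction of flagged sites the oracle confirms as malicious (oracle is assumed truthful)."""
    if not suspicious:
        return 0.0
    hits = sum(1 for site in suspicious if oracle.get(site, False))
    return hits / len(suspicious)


# Toy data: feature values are invented, not taken from the thesis datasets.
site_features = {
    "a.example": {"ip:1", "issuer:X"},
    "b.example": {"ip:1"},
    "c.example": {"issuer:X"},
    "d.example": {"ip:2"},
    "e.example": {"ip:2"},
}
labels = label_propagation(build_site_graph(site_features))
suspicious = flag_suspicious(labels, seeds={"a.example"})
score = precision(suspicious, oracle={"b.example": True, "c.example": False})
```

In this toy run the seed `a.example` pulls `b.example` and `c.example` into its community, while `d.example` and `e.example` form a separate one and are not flagged. A feature shared by many unrelated sites would merge benign and malicious sites into one community, so precision depends on how discriminating the features are.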
However, the results did not show good performance and highlighted several critical issues related to the high computational and memory resources the methodology requires: the former limit the scalability of graph construction, and the latter limit the projection of the graphs into RAM, without which the community detection algorithms cannot run. In conclusion, the procedure has shown that, when the datasets contain information too general to clearly distinguish a malicious website from a benign one, community detection algorithms cannot identify with high accuracy the malicious websites belonging to the same attack community. One idea that could still be explored to reach a definitive answer in this research area is to use a local community detection algorithm. Such an algorithm builds only the community surrounding a seed malicious website, rather than all communities as in this work. The next steps would then be to aggregate the communities found for each malicious node and check whether this leads to better results.
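The local approach proposed as future work can be sketched as a greedy expansion around each seed, followed by a union of the resulting communities. This is one possible reading, not the method of the thesis: the quality score (internal edges divided by internal plus boundary edges), the `max_size` cap, and the function names are all assumptions chosen to keep the example small.

```python
def community_score(adj, members):
    """Share of edges touching the community that stay inside it."""
    internal_twice = 0  # each internal edge is seen from both endpoints
    boundary = 0
    for u in members:
        for v in adj[u]:
            if v in members:
                internal_twice += 1
            else:
                boundary += 1
    internal = internal_twice // 2
    total = internal + boundary
    return internal / total if total else 0.0


def local_community(adj, seed, max_size=50):
    """Grow a community from one seed, adding the neighbour that most improves the score."""
    members = {seed}
    score = community_score(adj, members)
    while len(members) < max_size:
        frontier = sorted({v for u in members for v in adj[u]} - members)
        best_node, best_score = None, score
        for cand in frontier:
            cand_score = community_score(adj, members | {cand})
            if cand_score > best_score:
                best_node, best_score = cand, cand_score
        if best_node is None:
            break  # no neighbour improves the score
        members.add(best_node)
        score = best_score
    return members


def aggregate_suspicious(adj, seeds):
    """Union of the local communities of all seeds, excluding the seeds themselves."""
    flagged = set()
    for seed in seeds:
        flagged |= local_community(adj, seed)
    return flagged - set(seeds)


# Toy graph: two triangles joined by a single bridge edge c-d.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e", "f"},
    "e": {"d", "f"},
    "f": {"d", "e"},
}
around_a = local_community(adj, "a")
flagged = aggregate_suspicious(adj, {"a", "d"})
```

Only the neighbourhood of each seed is explored, so the whole graph never has to be partitioned. This is the property that makes the idea attractive given the memory limits reported above.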
Supervisors:	Paolo Garza
Academic year:	2023/24
Publication type:	Electronic
Number of pages:	106
Additional information:	Thesis under embargo. Full text not available
Subjects:
Degree programme:	Master's degree programme in Data Science and Engineering
Degree class:	New degree system > Master's degree > LM-32 - COMPUTER ENGINEERING
Collaborating companies:	ERMES CYBER SECURITY SRL
URI:	http://webthesis.biblio.polito.it/id/eprint/31026
