Retrieval-Augmented Social Media Intelligence: Detecting and Reporting of High-Risk Communication Patterns using Large Language Models

Simona Berte'

Retrieval-Augmented Social Media Intelligence: Detecting and Reporting of High-Risk Communication Patterns using Large Language Models.

Supervisors: Andrea Atzeni, Paolo Dal Checco. Politecnico di Torino, Master's degree programme in Cybersecurity, 2025

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:

The growing prevalence of social media has generated massive amounts of digital data, making these platforms a primary source for open-source intelligence (OSINT) and social media intelligence (SOCMINT). The proliferation of high-risk content poses a critical challenge for intelligence teams, who require advanced tools to identify and analyze these phenomena effectively. Large Language Models (LLMs), as part of the broader integration of Artificial Intelligence into the intelligence cycle, offer significant opportunities to automate and enhance analytical processes, enabling faster and more efficient handling of the vast data available on social networks. This work develops a system based on Retrieval-Augmented Generation (RAG), designed to support intelligence teams in the automated analysis of Twitter profiles. The system identifies communication patterns linked to high-risk phenomena and generates preliminary reports that guide further investigation. A knowledge base framework was designed for four risk categories (terrorism and extremism, cybercrime and hacking, hate speech and cyberbullying, mental health), with the initial implementation focused on terrorism. The knowledge base contains multiple documents that describe the distinctive linguistic and communicative features, such as keywords, hashtags, and emojis, that correspond to specific subcategories. These documents are derived from sources including intelligence reports, academic papers, and behavioral studies. Tweets are collected, preprocessed, and classified either into one of the four categories or as neutral. The system then identifies the most prevalent category for the profile as a whole and associates each analyzed tweet with specific subcategories within that category by measuring the semantic similarity between the tweet and the relevant documents.
The final structured report, generated by the LLM, justifies the classification with evidence drawn from the communication patterns distinctive to each subcategory. Validation on terrorism-related profiles demonstrates the system's ability to identify relevant communication patterns and generate coherent preliminary reports. Performance depends on prompt design, the quality of the RAG knowledge base, and the underlying LLM. The system gives intelligence teams an efficient tool for the preliminary assessment of potentially critical profiles, contributing to the automation of intelligence processes. The RAG-enhanced approach enables contextualized and transparent analysis, supporting decision-making by providing both classifications and textual evidence.
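
The two profile-level steps described in the abstract (choosing the most prevalent category, then matching each tweet to subcategory documents by similarity) can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the label strings, and the placeholder document texts are all hypothetical, and a simple bag-of-words cosine similarity stands in for whatever semantic embedding the actual system uses. The per-tweet category labels are assumed to come from an upstream classifier that is not shown.

```python
from collections import Counter
from math import sqrt


def bow(text):
    """Lowercased bag-of-words vector (assumption: stand-in for a real sentence embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def prevalent_category(labels, neutral="neutral"):
    """Most frequent non-neutral label across a profile's tweets, or None if all are neutral."""
    counts = Counter(label for label in labels if label != neutral)
    return counts.most_common(1)[0][0] if counts else None


def closest_subcategory(tweet, docs):
    """Return (subcategory, score) for the knowledge-base document most similar to the tweet."""
    vec = bow(tweet)
    scored = ((name, cosine(vec, bow(text))) for name, text in docs.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical per-tweet labels, as an upstream classifier might produce them.
labels = ["terrorism", "neutral", "terrorism", "hate_speech"]
category = prevalent_category(labels)

# Placeholder knowledge-base documents for the chosen category (neutral dummy text).
docs = {"subcategory_a": "alpha beta gamma", "subcategory_b": "delta epsilon zeta"}
name, score = closest_subcategory("beta gamma today", docs)
print(category, name, round(score, 3))
```

In a setup like the one the abstract describes, the matched subcategory and the tweet text would then be passed to the LLM as retrieved context for the report; that step depends on the model and prompt used and is left out here.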

Supervisors: Andrea Atzeni, Paolo Dal Checco
Academic year: 2025/26
Publication type: Electronic
Number of pages: 212
Subjects:
Degree programme: Master's degree programme in Cybersecurity
Degree class: New organization > Master's degree > LM-32 - COMPUTER ENGINEERING
Collaborating companies: NOT SPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/37932