Adversarial RAG-based approach to counter-narrative generation

Erfan Bayat

Adversarial RAG-based approach to counter-narrative generation.

Supervisors: Luca Cagliero, Aurora Gensale. Politecnico di Torino, Master's degree programme in Data Science And Engineering, 2025

PDF (Tesi_di_laurea) - Thesis
Access restricted to: staff users only until 24 October 2026 (embargo date).
License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:	The widespread dissemination of hate speech across digital platforms has encouraged the development of LLM-based counter-narrative defenses. Existing approaches face critical limitations in adaptability, evidence integration, and transparency. Although human-generated datasets and fine-tuning have paved the way in recent studies, these methods struggle with evolving hate patterns, require costly retraining, and often produce generic responses that lack the evidence-based reasoning needed to counter sophisticated, harmful arguments. Moreover, exposing human annotators to hate speech during dataset creation raises significant ethical concerns and introduces systematic biases. This thesis presents a novel adversarial debate framework in which LLM agents assume opposing roles: one defends problematic positions and the other develops evidence-based counterarguments over structured eight-turn debates. This process generates more sophisticated hate and counter-hate speech than that found in current state-of-the-art datasets. By pruning the debates and storing the relevant ones in a graph-based knowledge repository, we create a real-time content moderation system that requires no fine-tuning and can be used on the fly with minimal knowledge-base storage. To further enhance the system, we developed an adaptive assessment component that dynamically scores each input and mitigates potential harm according to its hate score. Evaluation shows consistent improvements over zero-shot and RAG-based baselines, with stronger evidence-based answers. Cross-model validation confirms robust transferability, while ablation and human studies validate the effectiveness and reliability of the hate assessment system.
Supervisors:	Luca Cagliero, Aurora Gensale
Academic year:	2025/26
Publication type:	Electronic
Number of pages:	92
Subjects:
Degree programme:	Master's degree programme in Data Science And Engineering
Degree class:	New organization > Master of Science > LM-32 - COMPUTER ENGINEERING
Collaborating companies:	Politecnico di Torino
URI:	http://webthesis.biblio.polito.it/id/eprint/37847
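
The debate structure described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the names `Turn`, `run_debate`, `stub_advocate`, and `stub_counter` are hypothetical, the agents are placeholders rather than LLM calls, and it assumes that "eight-turn" means eight turns in total, alternating between the two roles with the position-defending agent speaking first.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    index: int  # 0 for the seed claim, then 1..n_turns
    role: str   # "claim", "advocate", or "counter"
    text: str


# Hypothetical agent type: maps the debate history to the next utterance.
Agent = Callable[[List[Turn]], str]


def run_debate(claim: str, advocate: Agent, counter: Agent,
               n_turns: int = 8) -> List[Turn]:
    """Alternate the two agents for n_turns turns (assumed: advocate first)."""
    history: List[Turn] = [Turn(0, "claim", claim)]
    for i in range(1, n_turns + 1):
        if i % 2 == 1:
            role, agent = "advocate", advocate
        else:
            role, agent = "counter", counter
        history.append(Turn(i, role, agent(history)))
    return history


# Placeholder agents; a real system would query an LLM with the history.
def stub_advocate(history: List[Turn]) -> str:
    return f"[defends the position, replying to turn {history[-1].index}]"


def stub_counter(history: List[Turn]) -> str:
    return f"[evidence-based rebuttal of turn {history[-1].index}]"


debate = run_debate("example claim", stub_advocate, stub_counter)
```

Under these assumptions, `debate` holds the seed claim followed by eight alternating turns. Pruning, graph storage, and hate scoring are left out because the abstract does not specify them.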
