Alessandro Buriasco
An Open-Source Multilingual RAG Pipeline for Policy Analysis.
Supervisor: Alessandro Aliberti. Politecnico di Torino, Master's degree programme in Management Engineering (Ingegneria Gestionale), 2026
PDF (Tesi_di_laurea) - Thesis (3MB)
License: Creative Commons Attribution Non-commercial No Derivatives.
Abstract
Environmental policy analysis typically requires exploring a large multilingual corpus that spans heterogeneous formats, jurisdictions and institutional portals, with no unified access point. Despite rapid recent progress in Retrieval-Augmented Generation (RAG) pipelines and LLM-assisted document processing, current methods do not fully address multilingual content while also ensuring legal compliance and maintaining structured domain taxonomies over an incrementally growing database. This thesis presents a two-stage, open-source system for the automated ingestion and natural-language querying of around 700 climate policy documents in twenty-one languages, producing a vector knowledge base of more than 360,000 semantically structured chunks. The first stage combines four components: a traffic-light license validation model, which joins multilingual heuristic pattern matching and trusted-domain boosting with LLM-based classification via Mistral-Nemo (Ollama); recursive directory exploration with targeted file filtering across institutional portals; automated PDF quality gating; and LLM-assisted metadata extraction into a predefined glossary of hazard types, resilience criteria and other characterizing fields.
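As an illustration of how a traffic-light license check of this kind might be wired together, the following is a minimal sketch and not the thesis implementation. The pattern lists, the trusted domains, the score thresholds, the green/yellow/red labels and the `llm_classify` callback (a stand-in for a call to a local model such as Mistral-Nemo) are all assumptions introduced for the example.

```python
import re
from typing import Callable, Optional
from urllib.parse import urlparse

# Hypothetical patterns; the actual multilingual rules are not given in the abstract.
OPEN_PATTERNS = [
    r"creative commons attribution",
    r"\bcc[ -]by\b",
    r"open government licen[cs]e",
    r"licence ouverte",          # French
    r"dominio p[uú]blico",       # Spanish / Portuguese
]
RESTRICTED_PATTERNS = [
    r"all rights reserved",
    r"tous droits réservés",     # French
    r"alle rechte vorbehalten",  # German
]
# Hypothetical trusted institutional domains.
TRUSTED_DOMAINS = {"europa.eu", "unfccc.int"}


def host_is_trusted(url: str) -> bool:
    """True if the URL's host is a trusted domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in TRUSTED_DOMAINS)


def classify_license(
    text: str,
    url: str,
    llm_classify: Optional[Callable[[str], str]] = None,
) -> str:
    """Return 'green', 'yellow' or 'red' for a document's license statement."""
    lowered = text.lower()
    score = 0.0
    score += sum(1.0 for p in OPEN_PATTERNS if re.search(p, lowered))
    score -= sum(1.0 for p in RESTRICTED_PATTERNS if re.search(p, lowered))
    if host_is_trusted(url):
        score += 0.5  # assumed boost: nudges the score but cannot decide alone

    if score >= 1.0:
        return "green"
    if score <= -1.0:
        return "red"
    # Ambiguous case: defer to the LLM when one is supplied.
    if llm_classify is not None:
        verdict = llm_classify(text)
        if verdict in {"green", "yellow", "red"}:
            return verdict
    return "yellow"
```

In this sketch the heuristics settle the clear cases cheaply, and the model is consulted only when the score is inconclusive, which keeps the number of LLM calls low on a large corpus.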
The second stage implements a RAG architecture based on hierarchical parent-child chunking, in which each child chunk is encoded together with a textual prefix from its parent segment, allowing the encoder to retain broader document context within local embeddings.
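The parent-child scheme can be sketched as follows. This is a simplified illustration under stated assumptions, not the thesis code: splitting is done on whitespace-delimited words, the prefix is taken as the first words of the parent, and the names `ChildChunk`, `split_words`, `build_child_chunks` and all size parameters are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChildChunk:
    parent_id: int
    text: str           # the child's own text, shown at retrieval time
    text_to_embed: str  # parent prefix + child text, passed to the encoder


def split_words(text: str, size: int) -> List[str]:
    """Split text into consecutive pieces of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_child_chunks(
    document: str,
    parent_size: int = 200,
    child_size: int = 50,
    prefix_words: int = 15,
) -> List[ChildChunk]:
    """Build child chunks whose embedding input carries a parent prefix."""
    chunks: List[ChildChunk] = []
    for parent_id, parent in enumerate(split_words(document, parent_size)):
        # Assumed prefix rule: the opening words of the parent segment.
        prefix = " ".join(parent.split()[:prefix_words])
        for child in split_words(parent, child_size):
            chunks.append(ChildChunk(parent_id, child, f"{prefix}\n{child}"))
    return chunks
```

Only `text_to_embed` would be sent to the embedding model; keeping `text` and `parent_id` separately lets a retriever return the clean child passage or expand to the full parent segment.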
