
Daniele Mansillo
Multimodal RAG for Slide Presentations with Synthetic Data Generation and Anonymization.
Supervisors: Daniele Apiletti, Simone Monaco. Politecnico di Torino, Master's degree programme in Data Science And Engineering, 2025
PDF (Tesi_di_laurea)
- Thesis
License: Creative Commons Attribution Non-commercial No Derivatives. Download (5MB) |
Abstract: |
In recent years there has been a surge in the development and adoption of Retrieval-Augmented Generation (RAG) pipelines, as they constitute a cost-effective, flexible, and highly customizable way to leverage the advantages of LLMs on private and custom data. While modern RAG pipelines can work with almost any type of data, existing document processing systems focus predominantly on textual content, often ignoring visual elements. This text-centric approach may suffice when text is the main information carrier, but it fails to extract all meaningful insights from documents like slide presentations, where content is equally distributed across text, charts, images, and tables that often interact with each other to convey complete information. Given the rising popularity and performance of multimodal models and their lack of substantial integration in RAG pipelines, we chose to bridge this gap by building an effective RAG pipeline capable of processing slide presentations in PDF format and accurately responding to queries requesting information available in different data modalities. The main goal of this research is to identify the optimal approach for embedding and retrieving multimodal slide content in order to provide high-quality answer generation, with a particular focus on questions that require information from multiple slides. We also take hardware constraints into account, exploring and developing techniques that reduce computational cost without significantly compromising performance. To identify the optimal techniques at each stage of the pipeline we employed a multi-step approach, comparing, and if necessary developing, various techniques at each phase, from embedding through answer generation. Due to the lack of readily available shared data, we designed two synthetic dataset generation techniques based on state-of-the-art multimodal LLMs. 
The first technique focuses on generating question-answer pairs from the content of multiple slides simultaneously, addressing a common limitation of existing methods, which typically rely on single-image inputs. The second technique introduces a novel anonymization process that leverages recent multimodal LLMs to disguise sensitive or identifying information in slide presentations. This method can anonymize individual slides and optionally extend their context to generate coherent and complete synthetic presentations. It can interpret and reproduce slide content, including charts and visual layouts, using LaTeX as a markup language, ultimately producing synthetic slides and presentations that are visually indistinguishable from authentic ones. The main contribution of the project is a robust RAG pipeline capable of embedding multimodal information extracted from slide presentations and generating accurate answers based on the extracted multimodal data. To support its implementation we introduce synthetic data generation and anonymization techniques customized for slide-based documents. This research aims to support the advancing field of enterprise document intelligence by providing a comprehensive framework for multimodal content processing. The developed pipeline also offers practical solutions for organizations seeking to easily extract information from their slide presentations. |
---|---|
Supervisors: | Daniele Apiletti, Simone Monaco |
Academic year: | 2024/25 |
Publication type: | Electronic |
Number of pages: | 100 |
Subjects: | |
Degree programme: | Master's degree programme in Data Science And Engineering |
Degree class: | New system > Master's degree > LM-32 - INGEGNERIA INFORMATICA (Computer Engineering) |
Collaborating companies: | Not specified |
URI: | http://webthesis.biblio.polito.it/id/eprint/36326 |