Retrieval Augmented Generation for financial documents analysis and summarization

Alessandro Mosca

Retrieval Augmented Generation for financial documents analysis and summarization.

Supervisors: Luca Cagliero, Giuseppe Gallipoli, Lorenzo Vaiani, Simone Papicchio. Politecnico di Torino, Master's degree programme in Data Science and Engineering, 2024

Abstract:	In the banking sector, trend and risk analysis are essential tasks. Analysts routinely examine documents to extract insights, identify trends, and advise investors on suitable actions. These documents often contain images and tables as well as text, which makes them challenging to analyze with traditional Natural Language Processing techniques. Multimodal Large Language Models (LLMs) are one tool that can facilitate the analysis of such visually rich documents. These models are trained on a vast corpus of documents and other data sources, enabling them both to generate human-like text and to retain knowledge embedded within documents. To accelerate document analysis, banking organizations are interested in giving an LLM access to the knowledge in their documents without sharing the original documents. In the literature, the most common method for achieving this is a Retrieval Augmented Generation (RAG) system. A RAG system splits a document into smaller chunks, encodes these chunks as embeddings, and retrieves the top-k chunks most similar to the user's natural language query. The retrieved chunks are then provided to the LLM in the prompt, effectively supplying the model with additional knowledge for analysis.
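The chunk, embed, retrieve, and prompt steps described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the fixed-size word chunking, and the bag-of-words `embed` function are assumptions made for illustration, with the word-count vector standing in for the learned embedding model a real RAG system would use.

```python
import math
import re
from collections import Counter


def split_into_chunks(text, max_words=40):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding (assumption): lowercase word counts instead of a learned encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(chunks, query, k=2):
    """Return the k chunks most similar to the query, best match first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)
    return ranked[:k]


def build_prompt(chunks, query):
    """Place the retrieved chunks in the prompt ahead of the user's question."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Hypothetical example chunks, not taken from the thesis data.
chunks = [
    "Net interest income rose on higher lending rates.",
    "The bank opened three new branches in Turin.",
    "Credit risk exposure to the energy sector increased.",
]
query = "What happened to credit risk exposure?"
prompt = build_prompt(retrieve_top_k(chunks, query, k=1), query)
```

The resulting `prompt` string is what would be sent to the LLM; only the retrieved chunks, not the full document, appear in it.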
To improve the accuracy of the retrieval step within the RAG system, three variations have been implemented: (1) paragraph-based retrieval, which computes embeddings from the paragraph text and compares them with the user's query; (2) question-based retrieval, which uses an LLM to generate, for each document element, a set of possible questions whose embeddings are then compared with the user's question to enhance semantic alignment; and (3) tag-based retrieval, which, given a set of user-defined tags with corresponding descriptions, retrieves the most relevant paragraphs by comparing document elements with the tag descriptions, allowing a more focused retrieval based on the input tags. The aim of this thesis is to evaluate these three retrieval strategies within the RAG system. It also compares the output of the RAG system with a summary generated by feeding the retrieved document chunks into a summarizer, to assess when an explicit summarization step is beneficial for the user.
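The difference between the three retrieval variants lies in which text is compared with which. The following Python sketch illustrates that difference under stated assumptions: the function names are hypothetical, `embed` is a toy word-count vector standing in for a learned embedding model, the generated questions are supplied as a precomputed mapping rather than produced by an LLM, and scoring a paragraph by its best-matching question or tag description is an assumed aggregation rule, not one confirmed by the abstract.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding (assumption): lowercase word counts instead of a learned encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def paragraph_based(paragraphs, query, k=1):
    """(1) Compare the paragraph text itself with the user's query."""
    q = embed(query)
    return sorted(paragraphs, key=lambda p: cosine(embed(p), q), reverse=True)[:k]


def question_based(questions_by_paragraph, query, k=1):
    """(2) Compare pre-generated questions with the user's question.

    questions_by_paragraph maps each paragraph to questions assumed to have
    been generated beforehand by an LLM. A paragraph's score is that of its
    best-matching question (assumed aggregation).
    """
    q = embed(query)

    def best(p):
        return max((cosine(embed(g), q) for g in questions_by_paragraph[p]), default=0.0)

    return sorted(questions_by_paragraph, key=best, reverse=True)[:k]


def tag_based(paragraphs, tag_descriptions, selected_tags, k=1):
    """(3) Compare paragraphs with the descriptions of the selected tags."""
    targets = [embed(tag_descriptions[t]) for t in selected_tags]

    def best(p):
        e = embed(p)
        return max((cosine(e, t) for t in targets), default=0.0)

    return sorted(paragraphs, key=best, reverse=True)[:k]


# Hypothetical example data, not taken from the thesis.
income = "Net interest income rose on higher lending rates."
risk = "Credit risk exposure to the energy sector increased."
questions = {
    income: ["Why did net interest income rise?"],
    risk: ["Which sector drove credit risk?"],
}
tags = {
    "risk": "credit risk exposure and default",
    "income": "interest income and revenue",
}
```

With this data, `tag_based([income, risk], tags, ["risk"])` ranks the credit risk paragraph first without any user query, which reflects the more focused, tag-driven retrieval the abstract describes.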
Supervisors:	Luca Cagliero, Giuseppe Gallipoli, Lorenzo Vaiani, Simone Papicchio
Academic year:	2024/25
Publication type:	Electronic
Number of pages:	82
Additional information:	Thesis under embargo. Full text not available
Subjects:
Degree programme:	Master's degree programme in Data Science and Engineering
Degree class:	New system > Master's degree > LM-32 - Computer Engineering
Collaborating companies:	INTESA SANPAOLO INNOVATION CENTER SPA
URI:	http://webthesis.biblio.polito.it/id/eprint/33876
