Empowering Enterprises with Lightweight Large Language Models: Automated "Rule Card" Extraction from Grant Documents

Hesamedin Alemifar

Supervisors: Gianvito Urgese, Vittorio Fra. Politecnico di Torino, Master's degree programme in Digital Skills For Sustainable Societal Transitions, 2025

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:	As the volume of unstructured data grows, organizations, and small and medium-sized enterprises (SMEs) in particular, need automated solutions that extract information accurately while preserving data privacy. This thesis addresses that need with a lightweight, open-source Large Language Model (LLM) pipeline that extracts structured "Rule Cards" from Italian grant and funding documents. Because it runs locally with models such as Llama 3.1, the system mitigates the privacy risks of sharing data with external servers. The pipeline is modular, comprising PDF parsing, Optical Character Recognition (OCR), chunk-based text segmentation, and domain-specific prompt engineering. Compared with frontier models (e.g., GPT-4, Gemini Pro), the smaller open-source models showed competitive performance as measured by BERTScore, while retaining the advantages of reduced computational overhead and on-premise deployment. Some limitations remain: overly long or imprecise instructions occasionally led to missing or incomplete information, and hallucination occurred when the model attempted to generate details absent from the source document. Despite these issues, focused prompts and verification steps minimized the impact of errors, underscoring the pipeline's adaptability and potential in real-world settings. By demonstrating the viability of lightweight LLMs for specialized tasks, the thesis opens avenues for future research, such as fine-tuning multimodal models to improve OCR for Italian texts and extending the pipeline to additional data types. Overall, the findings indicate that domain-tuned, open-source LLMs can effectively extract structured information while maintaining privacy, offering a practical and scalable solution for SMEs and other organizations.
Supervisors:	Gianvito Urgese, Vittorio Fra
Academic year:	2024/25
Publication type:	Electronic
Number of pages:	84
Subjects:
Degree programme:	Master's degree programme in Digital Skills For Sustainable Societal Transitions
Degree class:	New system > Master's degree > LM-91 - Techniques and Methods for the Information Society
Collaborating companies:	COMPETENCE INDUSTRY MANUFACTURING 4.0 S.C.A.R.L.
URI:	http://webthesis.biblio.polito.it/id/eprint/34436
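
The abstract mentions chunk-based text segmentation and domain-specific prompting without giving details. The following is only a minimal sketch of how such a step might look, not the thesis's actual implementation: the field names, chunk sizes, prompt wording, and the helper names `chunk_text`, `build_prompt`, and `merge_cards` are all hypothetical, and the call to a local model is deliberately left out.

```python
from typing import Dict, List, Optional

# Hypothetical Rule Card fields; the thesis's real schema is not given in the abstract.
RULE_CARD_FIELDS = ["beneficiaries", "deadline", "funding_amount"]


def chunk_text(text: str, max_chars: int = 2000, overlap: int = 200) -> List[str]:
    """Split text into overlapping character windows (sizes are assumptions)."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    step = max_chars - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def build_prompt(chunk: str, fields: List[str] = RULE_CARD_FIELDS) -> str:
    """Build a focused extraction prompt for one chunk (wording is illustrative)."""
    return (
        "Extract the following fields from the grant text below: "
        f"{', '.join(fields)}.\n"
        "Answer in JSON. Use null for any field not stated in the text.\n\n"
        f"TEXT:\n{chunk}"
    )


def merge_cards(
    partial_cards: List[Dict[str, Optional[str]]],
    fields: List[str] = RULE_CARD_FIELDS,
) -> Dict[str, Optional[str]]:
    """Combine per-chunk results, keeping the first non-empty value per field."""
    merged: Dict[str, Optional[str]] = {field: None for field in fields}
    for card in partial_cards:
        for field in fields:
            if merged[field] is None and card.get(field):
                merged[field] = card[field]
    return merged
```

In this sketch each chunk would be sent to a locally hosted model with the prompt from `build_prompt`, and the parsed answers passed to `merge_cards`. Asking for null on missing fields is one simple way to discourage the hallucinated details the abstract describes; it is an assumption here, not a documented design choice of the thesis.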