Empowering Enterprises with Lightweight Large Language Models: Automated "Rule Card" Extraction from Grant Documents

Hesamedin Alemifar

Empowering Enterprises with Lightweight Large Language Models: Automated "Rule Card" Extraction from Grant Documents.

Supervisors: Gianvito Urgese, Vittorio Fra. Politecnico di Torino, Master's degree programme in Digital Skills For Sustainable Societal Transitions, 2025

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract

In the face of rapidly expanding unstructured data, organizations, especially small and medium-sized enterprises (SMEs), require automated solutions that not only offer accurate information extraction but also preserve data privacy. This thesis addresses such needs by introducing a lightweight, open-source Large Language Model (LLM) pipeline designed to extract structured "Rule Cards" from Italian grant and funding documents. By running locally with models like Llama 3.1, the system mitigates potential privacy risks associated with sharing data on external servers. The proposed pipeline employs a modular approach encompassing PDF parsing, Optical Character Recognition (OCR), chunk-based text segmentation, and domain-specific prompt engineering. Compared to frontier models (e.g., GPT-4, Gemini Pro), these smaller open-source models demonstrated competitive performance, as measured by BERTScore, while retaining the advantages of reduced computational overhead and on-premise deployment.
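The chunk-based segmentation and prompt-construction stages mentioned above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's actual implementation: the function names, the chunk size and overlap values, the prompt wording, and the Rule Card field names are all hypothetical.

```python
from typing import List, Sequence

# Hypothetical Rule Card fields, chosen only for illustration.
RULE_CARD_FIELDS = ("beneficiaries", "deadline", "funding_amount")


def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into overlapping chunks, breaking at a space when possible."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("require max_chars > 0 and 0 <= overlap < max_chars")
    chunks: List[str] = []
    start, n = 0, len(text)
    while start < n:
        end = min(start + max_chars, n)
        if end < n:
            # Prefer to cut at the last space in the window so words stay whole.
            # The search starts past the overlap so the next start always advances.
            cut = text.rfind(" ", start + overlap + 1, end)
            if cut != -1:
                end = cut
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= n:
            break
        # Step back by `overlap` characters so context spans chunk boundaries.
        start = end - overlap
    return chunks


def build_prompt(chunk: str, fields: Sequence[str] = RULE_CARD_FIELDS) -> str:
    """Assemble an extraction prompt for one chunk (wording is an assumption)."""
    field_list = "\n".join(f"- {name}" for name in fields)
    return (
        "Extract the following fields from the grant text below.\n"
        "Answer in JSON and use null for any field the text does not state.\n"
        f"Fields:\n{field_list}\n\nText:\n{chunk}"
    )
```

Each prompt would then be sent to a locally hosted model, and the per-chunk answers merged into a single Rule Card. Asking for null when a field is absent is one simple way to discourage the model from inventing values.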

Some limitations remain: overly long or imprecise instructions occasionally led to missing or incomplete information, and hallucination occurred when the model attempted to generate details that were absent from the source document.