Gabriele Lorenzo
Evaluating Large Language Models in Production Workflows: Methods, Challenges, and Case Studies.
Advisor: Paolo Garza. Politecnico di Torino, Master's degree programme in Ingegneria Informatica (Computer Engineering), 2025
Abstract
This thesis presents the work carried out during an internship at Datadog within the LLM Observability team, focusing on methods for evaluating large language models (LLMs) in production environments. Unlike static academic benchmarks, industrial settings require evaluation pipelines that account for operational constraints such as latency, cost, and scalability, while also addressing evolving risks like hallucinations and prompt injection attacks. Three main contributions are presented. First, the design of a benchmarking script that standardized evaluation workflows, enabling reproducible comparisons across datasets, models, and prompt variants. This tool proved critical in diagnosing and resolving customer-facing issues, such as high false positive rates in failure-to-answer evaluations. Specifically, the refined prompt substantially reduced false positives on customer data, improving precision for categories such as "no content response" (from 9% to 45%) and "refusal to answer" (from 59% to 62%), while preserving high recall.
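The abstract does not reproduce the benchmarking script itself. As a rough illustration of the workflow it describes (sweeping every dataset x model x prompt-variant combination and reporting precision/recall against labeled data), here is a minimal sketch; all names in it (`benchmark`, `precision_recall`, the toy keyword baseline, the sample data) are hypothetical stand-ins, not Datadog's implementation.

```python
# Hypothetical sketch of a benchmarking harness in the spirit of the one
# described in the abstract; names and data are illustrative only.
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Example:
    text: str
    label: bool  # True if this response is a genuine failure-to-answer


def precision_recall(preds: list[bool], labels: list[bool]) -> tuple[float, float]:
    """Precision = TP/(TP+FP), recall = TP/(TP+FN) over boolean predictions."""
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum(not p and l for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def benchmark(datasets, models, prompts):
    """Run every (dataset, model, prompt variant) combination on the same
    labeled examples so prompt changes can be compared reproducibly."""
    for (ds_name, examples), (model_name, model), (prompt_name, prompt) in product(
        datasets.items(), models.items(), prompts.items()
    ):
        preds = [model(prompt.format(text=ex.text)) for ex in examples]
        p, r = precision_recall(preds, [ex.label for ex in examples])
        print(f"{ds_name} | {model_name} | {prompt_name}: "
              f"precision={p:.2f} recall={r:.2f}")


if __name__ == "__main__":
    # Toy stand-ins: two labeled responses, a keyword-based "model",
    # and two prompt variants to compare.
    datasets = {"customer_sample": [
        Example("I cannot answer that.", True),
        Example("The answer is 42.", False),
    ]}
    models = {"keyword_baseline": lambda prompt: "cannot" in prompt.lower()}
    prompts = {"v1": "Classify: {text}", "v2": "Is this a refusal? {text}"}
    benchmark(datasets, models, prompts)
```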