
Evaluating Large Language Models in Production Workflows: Methods, Challenges, and Case Studies

Gabriele Lorenzo

Evaluating Large Language Models in Production Workflows: Methods, Challenges, and Case Studies.

Supervisor: Paolo Garza. Politecnico di Torino, Corso di laurea magistrale in Ingegneria Informatica (Computer Engineering), 2025

Abstract:

This thesis presents the work carried out during an internship at Datadog within the LLM Observability team, focusing on methods for evaluating large language models (LLMs) in production environments. Unlike static academic benchmarks, industrial settings require evaluation pipelines that account for operational constraints such as latency, cost, and scalability, while also addressing evolving risks like hallucinations and prompt injection attacks. Three main contributions are presented. First, the design of a benchmarking script that standardized evaluation workflows, enabling reproducible comparisons across datasets, models, and prompt variants. This tool proved critical in diagnosing and resolving customer-facing issues, such as high false positive rates in failure-to-answer evaluations. Specifically, the refined prompt substantially reduced false positives on customer data, improving categories such as "no content response" (precision rising from 9% to 45%) and "refusal to answer" (from 59% to 62%), while preserving high recall. Second, the implementation of a shadow experiment framework that routed production traffic to self-hosted models, allowing direct comparison with API-based baselines. These experiments revealed the trade-offs between large, expensive models such as Llama-70B and smaller, more efficient alternatives like Qwen3-4B. Finally, a series of fine-tuning experiments on prompt injection detection demonstrated that lightweight open-weight models can be adapted through supervised and reinforcement learning approaches to achieve strong task-specific accuracy, while maintaining significantly lower inference costs. The findings underline the importance of reproducibility, continuous monitoring, and specialization. They also suggest that while small models cannot rival API-based models on general-purpose accuracy, they can become viable and cost-effective alternatives when fine-tuned for specific evaluation tasks.
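The abstract describes the benchmarking script and its precision/recall comparisons only at a high level, and the full text is not available. Purely as an illustration of the kind of comparison such a tool enables, the sketch below scores a single judge prompt variant against a small labelled dataset and reports per-category precision and recall; every name in it (evaluate_judge, toy_judge, the record layout, the category labels) is a hypothetical placeholder and not the thesis's actual implementation.

```python
"""Minimal, illustrative benchmarking harness (not from the thesis)."""
import json
from collections import defaultdict


def evaluate_judge(records, judge_fn):
    """Score one judge/prompt variant on labelled records.

    records  -- iterable of dicts with 'text', 'category', and 'label'
                (True if the response really is a failure to answer)
    judge_fn -- callable(text) -> bool wrapping an LLM judge with one prompt variant
    Returns per-category precision and recall.
    """
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for rec in records:
        predicted = judge_fn(rec["text"])
        cat = rec["category"]
        if predicted and rec["label"]:
            tp[cat] += 1            # correctly flagged failure
        elif predicted and not rec["label"]:
            fp[cat] += 1            # false positive
        elif not predicted and rec["label"]:
            fn[cat] += 1            # missed failure

    report = {}
    for cat in set(tp) | set(fp) | set(fn):
        p_den, r_den = tp[cat] + fp[cat], tp[cat] + fn[cat]
        report[cat] = {
            "precision": tp[cat] / p_den if p_den else 0.0,
            "recall": tp[cat] / r_den if r_den else 0.0,
        }
    return report


def toy_judge(text):
    # Stand-in for an LLM judge: flags empty answers and explicit refusals.
    return not text.strip() or "i can't help" in text.lower()


if __name__ == "__main__":
    dataset = [  # tiny labelled sample, purely illustrative
        {"text": "", "category": "no_content_response", "label": True},
        {"text": "Here is the answer you asked for.", "category": "no_content_response", "label": False},
        {"text": "I can't help with that.", "category": "refusal_to_answer", "label": True},
    ]
    print(json.dumps(evaluate_judge(dataset, toy_judge), indent=2))
```

Comparing two prompt variants would then amount to running the same harness with two different judge_fn wrappers over the same labelled dataset and diffing the per-category reports.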

Supervisor: Paolo Garza
Academic year: 2025/26
Publication type: Electronic
Number of pages: 70
Additional information: Confidential thesis. Full text not available
Subjects:
Degree programme: Corso di laurea magistrale in Ingegneria Informatica (Computer Engineering)
Degree class: New regulation > Master's degree > LM-32 - INGEGNERIA INFORMATICA
Joint supervision institution: TELECOM ParisTech (FRANCE)
Collaborating companies: Datadog France
URI: http://webthesis.biblio.polito.it/id/eprint/38645