Gabriele Lorenzo
Evaluating Large Language Models in Production Workflows: Methods, Challenges, and Case Studies.
Advisor: Paolo Garza. Politecnico di Torino, Master's degree programme in Ingegneria Informatica (Computer Engineering), 2025
Abstract
This thesis presents the work carried out during an internship at Datadog within the LLM Observability team, focusing on methods for evaluating large language models (LLMs) in production environments. Unlike static academic benchmarks, industrial settings require evaluation pipelines that account for operational constraints such as latency, cost, and scalability, while also addressing evolving risks like hallucinations and prompt injection attacks. Three main contributions are presented. First, the design of a benchmarking script that standardized evaluation workflows, enabling reproducible comparisons across datasets, models, and prompt variants. This tool proved critical in diagnosing and resolving customer-facing issues, such as high false positive rates in failure-to-answer evaluations. Specifically, the refined prompt substantially reduced false positives on customer data, improving precision for categories such as "no content response" (from 9% to 45%) and "refusal to answer" (from 59% to 62%), while preserving high recall.
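The abstract does not reproduce the benchmarking script itself. As a rough illustration of the workflow it describes (sweeping every dataset x model x prompt-variant combination and reporting precision/recall against labeled data), here is a minimal sketch; all names in it (`benchmark`, `precision_recall`, the toy keyword baseline, the sample data) are hypothetical stand-ins, not Datadog's implementation.

```python
# Hypothetical sketch of a benchmarking harness in the spirit of the one
# described in the abstract; names and data are illustrative only.
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Example:
    text: str
    label: bool  # True if this response is a genuine failure-to-answer


def precision_recall(preds: list[bool], labels: list[bool]) -> tuple[float, float]:
    """Precision = TP/(TP+FP), recall = TP/(TP+FN) over boolean predictions."""
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum(not p and l for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def benchmark(datasets, models, prompts):
    """Run every (dataset, model, prompt variant) combination on the same
    labeled examples so prompt changes can be compared reproducibly."""
    for (ds_name, examples), (model_name, model), (prompt_name, prompt) in product(
        datasets.items(), models.items(), prompts.items()
    ):
        preds = [model(prompt.format(text=ex.text)) for ex in examples]
        p, r = precision_recall(preds, [ex.label for ex in examples])
        print(f"{ds_name} | {model_name} | {prompt_name}: "
              f"precision={p:.2f} recall={r:.2f}")


if __name__ == "__main__":
    # Toy stand-ins: two labeled responses, a keyword-based "model",
    # and two prompt variants to compare.
    datasets = {"customer_sample": [
        Example("I cannot answer that.", True),
        Example("The answer is 42.", False),
    ]}
    models = {"keyword_baseline": lambda prompt: "cannot" in prompt.lower()}
    prompts = {"v1": "Classify: {text}", "v2": "Is this a refusal? {text}"}
    benchmark(datasets, models, prompts)
```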