Pietro Bertorelle
Stress testing chatbots: evaluating factuality, reasoning, abstraction, and other safety challenges.
Supervisor: Antonio Vetrò. Politecnico di Torino, Master's degree programme in Ingegneria Informatica (Computer Engineering), 2025
License: Creative Commons Attribution.
Abstract
The widespread use of Large Language Models (LLMs) in complex tasks has highlighted significant risks, including misinformation and bias. Moreover, the concept of 'agentic AI' (autonomous systems capable of performing tasks without human intervention) encourages the use of LLMs for tasks they are not designed for, given their probabilistic nature. In response, agentic benchmarks were introduced to assess the reliability of these models on real-world tasks. This thesis proposes a benchmark aimed at assessing the state of the art in complex planning, factual accuracy, and mathematical problem solving, analyzing the most popular chatbots from a technical point of view together with the risks they entail.
The benchmark tests were performed on Gemini, Gemma, and Llama, spanning 10 different versions of these models.
