Pietro Bertorelle
Stress testing chatbots: evaluating factuality, reasoning, abstraction, and other safety challenges.
Supervisor: Antonio Vetro'. Politecnico di Torino, NOT SPECIFIED, 2025
PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution. Download (4MB)
Abstract:
The widespread use of Large Language Models (LLMs) in complex tasks has highlighted significant risks, including misinformation and bias. Moreover, the introduction of the 'agentic AI' concept, i.e. autonomous systems capable of performing tasks without human intervention, encourages the use of LLMs for tasks they are not designed for, given their probabilistic nature. In response, agentic benchmarks were introduced to assess the reliability of these models on real-world tasks. This thesis proposes a benchmark aimed at understanding the state of the art in complex planning, factual accuracy, and mathematical problem-solving, analyzing the most popular chatbots from a technical point of view along with the risks they entail. The benchmark tests were performed on Gemini, Gemma, and Llama, spanning 10 different versions of these models. The benchmark consists of 22 questions repeated more than 273,000 times using different prompting methodologies, such as zero-shot Chain of Thought (CoT), in which the model is asked to solve the problem step by step, and Program of Thoughts (PoT), which requires the model to write Python code that solves the problem. The benchmark results identify several limitations in the tasks analyzed, as well as the development trajectory of some models. Specifically, the resolution of mathematical tasks with Gemini improved significantly across versions, and, in general, all models analyzed were more accurate in this category than in planning. Some limitations of the PoT methodology also emerge on real-world mathematical problems, where zero-shot CoT improves accuracy. Testing the factuality category revealed that LLMs struggle to recognize an incorrect statement or a trick question. Meanwhile, analysis of planning ability revealed an inability to handle overlapping plans in the submitted questions. For instance, all LLMs achieved accuracy close to 0% when the problems involved performing seemingly contradictory sub-actions to reach the goal, as in 'Tower of Hanoi' or 'Blocks World'. The findings highlight some of the current challenges LLMs face in achieving agency, such as their inability to handle complex planning and factual nuances. Even in the mathematical domain, where the best results were achieved, the need for third-party intervention was shown. The findings also show that a more granular study of LLMs' capabilities is needed to ensure their responsible use.
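
As context for the two prompting methodologies named in the abstract, the following is a minimal illustrative sketch of zero-shot CoT and PoT prompting. The prompt wording, the sample question, and the `run_pot_code` helper are assumptions for illustration only; they are not taken from the thesis's actual harness.

```python
# Illustrative sketch of zero-shot CoT vs. PoT prompting (hypothetical
# templates; the thesis's actual prompts and harness are not reproduced).
import contextlib
import io

QUESTION = "A train travels 120 km in 1.5 hours. What is its average speed in km/h?"


def zero_shot_cot_prompt(question: str) -> str:
    # Zero-shot Chain of Thought: append a step-by-step cue to the bare question.
    return f"{question}\nLet's think step by step."


def pot_prompt(question: str) -> str:
    # Program of Thoughts: ask the model to answer with executable Python
    # whose printed output is the answer.
    return (
        f"{question}\n"
        "Write a Python program that computes the answer and prints it. "
        "Return only the code."
    )


def run_pot_code(code: str) -> str:
    # In PoT, the generated program is executed and its stdout is the answer.
    # (A real harness would sandbox this; exec on untrusted output is unsafe.)
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue().strip()


if __name__ == "__main__":
    print(zero_shot_cot_prompt(QUESTION))
    print()
    print(pot_prompt(QUESTION))
    # Execution step on a plausible model completion for the PoT prompt:
    sample_completion = "speed = 120 / 1.5\nprint(speed)"
    print("PoT answer:", run_pot_code(sample_completion))  # -> 80.0
```

The design difference the sketch highlights: zero-shot CoT leaves the arithmetic to the model's generated text, while PoT delegates it to a Python interpreter, which is why the two methodologies can diverge in accuracy on mathematical tasks.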
| | |
|---|---|
| Supervisor: | Antonio Vetro' |
| Academic year: | 2025/26 |
| Publication type: | Electronic |
| Number of pages: | 97 |
| Subjects: | |
| Degree programme: | NOT SPECIFIED |
| Degree class: | New regulations > Master's degree > LM-32 - COMPUTER ENGINEERING |
| Partner companies: | NOT SPECIFIED |
| URI: | http://webthesis.biblio.polito.it/id/eprint/37632 |