
Analysis of the impact of language and context in prompts on synthetic data generation with Large Language models

Gioele Giachino


Supervisors: Antonio Vetro', Marco Rondina. Politecnico di Torino, Master's degree programme in Data Science And Engineering, 2025

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:

The increasing use of Large Language Models (LLMs) across many domains has raised concerns about how easily they can perpetuate stereotypes and contribute to biased decisions or patterns. Focusing on gender and professional bias, this thesis examines how LLMs respond to ambiguous prompts and how those responses contribute to biased dynamics. The analysis follows a structured experimental method: the models are given different prompts involving three combinations of professions, each characterized by a hierarchical relationship. The study uses Italian, a language with extensive grammatical gender distinctions, to highlight potential limitations in current LLMs’ ability to generate objective text in non-English languages. Two popular LLM-based chatbots are examined, namely OpenAI ChatGPT and Google Gemini. Automating the query phase via APIs makes it practical to run each prompt many times, collecting a wider range of responses for a more comprehensive assessment. To establish adequate evaluation metrics, we calculated conditional probabilities that relate the LLM response to the male or female pronoun present in the input prompt. The results highlight how LLM-generated synthetic content can reinforce stereotypes, raising ethical concerns about its use in everyday applications. The presence of bias in AI-generated text can have significant implications in many fields, including the workplace. Understanding these risks is pivotal to developing mitigation strategies and ensuring that AI-based systems do not increase social inequalities, but rather contribute to more equitable and balanced outcomes. Future research directions include extending the study to additional chatbots or languages, refining prompt engineering methods, and covering a larger set of professional pairs.
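
As an illustration of the evaluation idea described in the abstract, the following is a minimal Python sketch of repeating each prompt several times and estimating the conditional probability of a response given the pronoun in the prompt. It is not the code used in the thesis. The function names, the prompt strings, the "senior"/"junior" response labels, and the canned fake chatbot are all assumptions made for this example; a real run would replace the fake with calls to the chatbot APIs and would need a step that maps each free-text answer to a label.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Dict, List, Tuple

# (pronoun used in the prompt, label assigned to the model's response)
Record = Tuple[str, str]


def collect_records(
    ask: Callable[[str], str], prompts: Dict[str, str], n_iterations: int
) -> List[Record]:
    """Send every prompt n_iterations times and keep (pronoun, response) pairs."""
    records: List[Record] = []
    for pronoun, prompt in prompts.items():
        for _ in range(n_iterations):
            records.append((pronoun, ask(prompt)))
    return records


def conditional_probabilities(records: List[Record]) -> Dict[Record, float]:
    """Estimate P(response | pronoun) as count(pronoun, response) / count(pronoun)."""
    pair_counts = Counter(records)
    pronoun_counts = Counter(pronoun for pronoun, _ in records)
    return {
        (pronoun, response): count / pronoun_counts[pronoun]
        for (pronoun, response), count in pair_counts.items()
    }


def make_fake_chatbot() -> Callable[[str], str]:
    """Hypothetical stand-in for an API call: returns canned labels in a fixed cycle."""
    canned = {
        "prompt with male pronoun": cycle(["senior", "senior", "junior"]),
        "prompt with female pronoun": cycle(["junior", "junior", "junior", "senior"]),
    }

    def ask(prompt: str) -> str:
        return next(canned[prompt])

    return ask


# Hypothetical placeholder prompts, not the prompts used in the thesis.
prompts = {
    "male": "prompt with male pronoun",
    "female": "prompt with female pronoun",
}
records = collect_records(make_fake_chatbot(), prompts, n_iterations=12)
probs = conditional_probabilities(records)

for (pronoun, response), p in sorted(probs.items()):
    print(f"P({response} | {pronoun}) = {p:.2f}")
```

In this toy setup, a large gap between P(senior | male) and P(senior | female) would be the kind of signal such a metric is meant to surface; the numbers produced here come only from the canned responses and say nothing about the real chatbots.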

Supervisors: Antonio Vetro', Marco Rondina
Academic year: 2024/25
Publication type: Electronic
Number of pages: 107
Subjects:
Degree programme: Master's degree programme in Data Science And Engineering
Degree class: New system > Master's degree > LM-32 - Computer Engineering
Collaborating companies: Not specified
URI: http://webthesis.biblio.polito.it/id/eprint/35266