
Human-Aligned Speech Language Models with Preference Alignment Data Collection

Vincenzo Montana

Supervisors: Eliana Pastor, Alkis Koudounas. Politecnico di Torino, Master's degree programme in Computer Engineering (Ingegneria Informatica), 2025

PDF (Tesi_di_laurea) - Thesis
Access restricted to: staff users only until 12 June 2027 (embargo date).
License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:

Preference alignment techniques have achieved remarkable results in aligning Large Language Models (LLMs) with human values through output comparison. However, these methods rely critically on human-annotated preference data, whose collection remains a major challenge because of scalability and consistency issues. The complexity increases further in the multimodal domain, where annotators may focus on isolated aspects of a given modality (e.g., speech tone or rhythm) rather than on its overall communicative intent. This work addresses these challenges in the speech domain. The primary goal is to collect human preference data on speech-based interactions while ensuring that annotators are properly guided to provide consistent and meaningful feedback. To this end, a multi-stage speech pipeline that emulates a full conversation with a digital assistant was designed. The pipeline includes modules for Automatic Speech Recognition (ASR), text-to-text reasoning with an LLM, and Text-to-Speech (TTS) synthesis, allowing end-to-end evaluation of spoken dialogues. A dedicated web-based annotation platform was developed to support the comparison of different model outputs under controlled and fair conditions, helping annotators focus on relevant linguistic and paralinguistic cues. The work also describes in detail the database schema and its management, along with the data export formats adopted for organizing and analyzing the collected feedback. Furthermore, a systematic analysis of conversational datasets was carried out to identify suitable initial user requests, with particular attention to the SLURP dataset and its data selection process. Both real user utterances and synthetically generated data, produced through LLM- and TTS-based pipelines, were used to construct a diverse and realistic set of conversational scenarios. The thesis was conducted within a research project in collaboration with Amazon AGI.
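
As a rough illustration of the ASR, LLM, and TTS chain and of the pairwise comparison described in the abstract, the following minimal Python sketch wires three pluggable stages together and stores one annotator judgement. It is not the thesis implementation: every name in it (`Turn`, `PreferencePair`, `run_pipeline`, `record_preference`, and the stub stages) is a hypothetical placeholder, and the stubs pass text through as bytes instead of calling real speech models.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; the actual interfaces are not given in the abstract.
Asr = Callable[[bytes], str]   # user audio -> transcript
Llm = Callable[[str], str]     # transcript -> text reply
Tts = Callable[[str], bytes]   # text reply -> synthesized audio


@dataclass(frozen=True)
class Turn:
    """One assistant turn produced by a single pipeline configuration."""
    transcript: str
    reply_text: str
    reply_audio: bytes


@dataclass(frozen=True)
class PreferencePair:
    """One annotator judgement between two candidate turns (assumed record shape)."""
    request_id: str
    chosen: Turn
    rejected: Turn


def run_pipeline(audio: bytes, asr: Asr, llm: Llm, tts: Tts) -> Turn:
    """Run the three stages in order: recognition, reasoning, synthesis."""
    transcript = asr(audio)
    reply_text = llm(transcript)
    return Turn(transcript, reply_text, tts(reply_text))


def record_preference(request_id: str, a: Turn, b: Turn,
                      annotator_prefers_a: bool) -> PreferencePair:
    """Store the preferred turn as `chosen` and the other as `rejected`."""
    chosen, rejected = (a, b) if annotator_prefers_a else (b, a)
    return PreferencePair(request_id, chosen, rejected)


# Stub stages for illustration only: audio is treated as UTF-8 encoded text.
def stub_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def stub_llm_short(text: str) -> str:
    return f"OK: {text}"


def stub_llm_long(text: str) -> str:
    return f"Sure, I will handle this request: {text}"


def stub_tts(text: str) -> bytes:
    return text.encode("utf-8")
```

Running the same request through two configurations (here, two stub LLMs) yields two `Turn` objects that an annotator could compare; the resulting `PreferencePair` follows the chosen/rejected layout commonly used for preference data, which is an assumption about the export format rather than something stated in the abstract.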

Supervisors: Eliana Pastor, Alkis Koudounas
Academic year: 2025/26
Publication type: Electronic
Number of pages: 70
Subjects:
Degree programme: Master's degree programme in Computer Engineering (Ingegneria Informatica)
Degree class: New regulations > Master's degree > LM-32 - COMPUTER ENGINEERING (INGEGNERIA INFORMATICA)
Collaborating companies: Politecnico di Torino
URI: http://webthesis.biblio.polito.it/id/eprint/38672