Mohamad Samaei
Data Collection and Generation for Preference Alignment in Speech Language Models.
Supervisors: Eliana Pastor, Alkis Koudounas. Politecnico di Torino, Master's degree programme in Data Science And Engineering, 2025
PDF (Tesi_di_laurea)
- Thesis
Access restricted to: staff users only until 12 June 2027 (embargo date). License: Creative Commons Attribution Non-commercial No Derivatives. Download (1MB)
| Abstract: |
Despite recent advances, Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) models still misrecognize words in challenging conditions, which limits their reliability. Reinforcement Learning from Human Feedback (RLHF) limits hallucinations in speech models by replacing purely statistical learning with a human-aligned optimization objective that rewards factual, grounded, and faithful outputs while penalizing hallucinated content. Current speech assistants are typically trained on proprietary data and rely on metrics such as Word Error Rate (WER) to demonstrate their performance. At the same time, RLHF methods have largely focused on text-only models, leaving a gap in tools and datasets for applying preference alignment training to spoken dialogue systems. This thesis addresses these gaps by presenting an open-source implementation of (i) a data generation and extraction pipeline for conversational speech agents and (ii) an annotation platform for collecting human feedback, with the goal of enabling RLHF in speech models. This work is conducted within a research project in collaboration with Amazon AGI. The speech data is generated, extracted, and converted to support the task of Question-Answering (QA), chosen in order to simulate smart assistants as closely as possible. The data generation and extraction pipeline supports two complementary data sources. First, it generates synthetic assistant-style dialogues by prompting different Large Language Models to simulate users speaking to a smart home assistant across everyday topics. Second, it extracts real human speech samples from existing human-spoken datasets. This thesis specifically focuses on HeySQuAD, a dataset for spoken QA consisting of audio question recordings along with the source paragraphs and ground-truth answer spans. For this dataset, diverse and representative utterances are selected using Agglomerative Clustering and the Furthest-First algorithm.
These algorithms are evaluated by comparing the resulting subset with the original set using clustering quality metrics. The annotation platform uses an HTML-based frontend, a Flask backend, and a database managed with SQLAlchemy and Alembic. It supports both fine-grained and multi-option evaluation of speech and text models. All dialogues produced by the data pipeline are run through a chain of ASR and TTS models. Human annotators compare the outputs of parallel models and mark the superior one. They also detect discrepancies between the audio and text outputs and record structured feedback according to pre-defined evaluation criteria. Together, the pipeline and platform provide an end-to-end, auditable, and reproducible framework for generating, extracting, and human-evaluating speech data for RLHF in speech models. All components are released as open source. |
|---|---|
| Supervisors: | Eliana Pastor, Alkis Koudounas |
| Academic year: | 2025/26 |
| Publication type: | Electronic |
| Number of pages: | 73 |
| Subjects: | |
| Degree programme: | Master's degree programme in Data Science And Engineering |
| Degree class: | New system > Master's degree > LM-32 - COMPUTER ENGINEERING |
| Collaborating companies: | NOT SPECIFIED |
| URI: | http://webthesis.biblio.polito.it/id/eprint/38767 |
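
The abstract names the Furthest-First algorithm as one way to select diverse utterances. As a rough illustration only, and not the thesis implementation (which this record does not show), the following is a minimal sketch of a greedy furthest-first traversal. The function name `furthest_first`, the use of Euclidean distance, the fixed starting index, and the toy 2-D points standing in for utterance embeddings are all assumptions made for this example.

```python
import math
from typing import List, Sequence


def furthest_first(points: Sequence[Sequence[float]], k: int, start: int = 0) -> List[int]:
    """Greedy furthest-first traversal: return indices of k mutually distant points.

    Hypothetical helper for illustration; Euclidean distance and the
    starting index are assumptions, not details taken from the thesis.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        # Pick the point that is furthest from the current selection.
        idx = max(range(len(points)), key=nearest.__getitem__)
        selected.append(idx)
        # Refresh nearest-selected distances using the newly added point.
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[idx]))
    return selected


# Toy 2-D "embeddings" (assumed data): two tight groups plus one outlier.
toy_points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
subset = furthest_first(toy_points, k=3)
print(subset)  # [0, 4, 2]
```

On the toy points the traversal picks one index from each distant region (0, then 4, then 2), which is the kind of spread a diversity-oriented subset selection aims for.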



Creative Commons License - Attribution 3.0 Italy