Explainable AI for Speech Data: From words to phoneme

Raul Gatto

Supervisors: Eliana Pastor, Alkis Koudounas. Politecnico di Torino, Master's degree programme in Data Science And Engineering, 2025

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:	As speech-based technologies such as virtual assistants become increasingly present in daily life, concerns about the opacity of their decision-making processes increase the need for transparency and interpretability in these systems. Deep learning has significantly improved the performance of Automatic Speech Recognition (ASR) systems, but it has also turned them into black boxes by increasing their complexity and opacity. This thesis addresses these challenges by applying Explainable AI (XAI) techniques to ASR systems. It moves from the current word-level explanations to a finer granularity, phoneme-level explanations, in order to quantify the contributions of sub-word units, which had not been explored before. To achieve this, the work adapts existing model-agnostic explanation methods, namely Leave-One-Out (LOO), LIME, and SHAP, traditionally used to explain image and text classification, so that they perform perturbations at the phoneme level. It also introduces SHAP-based explanations in this context for the first time. A forced alignment mechanism provides accurate timestamps for each phoneme, which allows perturbations to be performed as precisely as possible. A sliding-window phoneme aggregation approach is also introduced to obtain insights at intermediate levels of granularity between phonemes and words, such as syllables or graphemes. The explanation pipeline starts from the raw audio input, followed by transcription and phonemization, forced alignment for timestamp extraction, and finally perturbation-based methods that quantify each phoneme’s influence on the model’s output. The results confirm the added value of phoneme-level explanations: they reveal some expected patterns as well as some counterintuitive insights that can help in further studying the model’s behavior.
In one example, high importance was assigned to apparently unimportant phonemes, pointing to potential errors in training and a mismatch with expectations. The methods produce similar results with different trade-offs between accuracy and computational efficiency, with LOO and LIME offering results comparable to SHAP in a fraction of the time. Multiple parameter configurations were also explored, showing how even subtle components can influence the classification outcome. These results are a starting point for extending the program with better alignment models for multiple languages, with the potential to improve the interpretability and debugging of ASR systems.
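The perturbation step described in the abstract can be illustrated with a minimal Leave-One-Out sketch. It is not the thesis implementation: the segment format `(label, start_sample, end_sample)`, the function names `window_segments` and `loo_importance`, the use of silence (zeroed samples) as the perturbation, and the `score` callable standing in for the model's output are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical segment format: (phoneme label, start sample, end sample),
# as might be derived from forced-alignment timestamps.
Segment = Tuple[str, int, int]


def window_segments(segments: Sequence[Segment], size: int) -> List[Segment]:
    """Merge `size` consecutive phonemes into one unit, sliding by one phoneme."""
    if size < 1:
        raise ValueError("size must be at least 1")
    n = len(segments)
    if n == 0:
        return []
    size = min(size, n)  # a window larger than the utterance covers all of it
    merged = []
    for i in range(n - size + 1):
        group = segments[i:i + size]
        label = " ".join(name for name, _, _ in group)
        merged.append((label, group[0][1], group[-1][2]))
    return merged


def loo_importance(
    audio: Sequence[float],
    segments: Sequence[Segment],
    score: Callable[[Sequence[float]], float],
) -> List[Tuple[str, float]]:
    """Leave-One-Out: silence one segment at a time and measure the score drop."""
    base = score(audio)
    results = []
    for label, start, end in segments:
        end = min(end, len(audio))  # keep the mask inside the signal
        perturbed = list(audio)     # copy, so the input is never modified
        perturbed[start:end] = [0.0] * max(0, end - start)
        results.append((label, base - score(perturbed)))
    return results


if __name__ == "__main__":
    # Toy stand-ins: a constant signal and a "model" that only looks at samples 2..3.
    audio = [1.0] * 6
    phonemes = [("HH", 0, 2), ("AH", 2, 4), ("L", 4, 6)]
    toy_score = lambda a: sum(a[2:4])
    print(loo_importance(audio, phonemes, toy_score))
    print(loo_importance(audio, window_segments(phonemes, 2), toy_score))
```

A larger drop in the score means the removed unit mattered more to the output. Passing the output of `window_segments` instead of the raw phoneme list gives scores for coarser units between phonemes and words, in the spirit of the sliding-window aggregation mentioned above.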
Supervisors:	Eliana Pastor, Alkis Koudounas
Academic year:	2024/25
Publication type:	Electronic
Number of pages:	87
Subjects:
Degree programme:	Master's degree programme in Data Science And Engineering
Degree class:	New system > Master's degree > LM-32 - COMPUTER ENGINEERING
Collaborating companies:	Not specified
URI:	http://webthesis.biblio.polito.it/id/eprint/36342
