Gabriele Tomatis
Can you hear what I’ve learned? Explaining audio transformer-based models through embedding sonification.
Supervisors: Eliana Pastor, Alkis Koudounas. Politecnico di Torino, Master's degree programme in Ingegneria Informatica (Computer Engineering), 2025
PDF (Tesi_di_laurea) - Tesi. Licence: Creative Commons Attribution Non-commercial No Derivatives.
Abstract
Since their introduction, transformer models have demonstrated strong performance in the analysis of structured data such as images, time series, and audio. Their ability to solve the most diverse tasks quickly made them the state of the art in a wide variety of domains. How they reason, however, is still an open issue, as they translate these data into embedding representations that only they can comprehend. So far, only a few works have tried to tackle this problem using methods proposed in the field of Explainable AI. The aim of this discipline is to make AI models interpretable, and thus trustworthy and reliable; this is impossible to achieve if we do not understand how the models reason.
To address these issues, we take advantage of the Descript Audio VAE, a model specifically trained to compress and reconstruct an audio waveform by passing it through a latent-space representation.
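The encode-to-latent-then-decode pipeline described above can be sketched in miniature. The snippet below is an illustrative stand-in only: the actual Descript Audio VAE is a learned neural codec, whereas here a simple PCA-style linear autoencoder over waveform frames mimics the same compress/reconstruct flow (all function names and the 8x compression setting are assumptions for the sketch, not part of the thesis).

```python
import numpy as np

# Sketch of a latent-space audio codec: frame the waveform, project each
# frame onto a small learned basis (encode), then map back (decode).
# This is NOT the Descript Audio VAE, just a linear analogue of its pipeline.

def frame(wave, size):
    """Split a 1-D waveform into non-overlapping frames of `size` samples."""
    n = len(wave) // size
    return wave[: n * size].reshape(n, size)

def fit_linear_autoencoder(frames, latent_dim):
    """Learn a PCA basis: the top `latent_dim` right-singular vectors."""
    mean = frames.mean(axis=0)
    _, _, vt = np.linalg.svd(frames - mean, full_matrices=False)
    return mean, vt[:latent_dim]          # basis: (latent_dim, frame_size)

def encode(frames, mean, basis):
    return (frames - mean) @ basis.T      # latent codes, one row per frame

def decode(latents, mean, basis):
    return latents @ basis + mean         # reconstructed frames

# Toy input: a noisy 220 Hz sinusoid, 4096 samples.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 4096, endpoint=False)
wave = np.sin(2 * np.pi * 220 * t) + 0.1 * rng.standard_normal(t.size)

frames = frame(wave, 64)                  # 64 frames of 64 samples
mean, basis = fit_linear_autoencoder(frames, latent_dim=8)
z = encode(frames, mean, basis)           # compressed representation
recon = decode(z, mean, basis)

ratio = frames.size / z.size              # 8x compression per frame
err = np.mean((frames - recon) ** 2)      # small: the sinusoid lives in
                                          # a low-dimensional subspace
```

A real neural codec replaces the linear projection with deep convolutional encoder/decoder networks, but the roles of `z` (the latent embedding that sonification methods operate on) and `decode` (the path back to audible audio) are the same.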