Speech-Text Cross-Modal Learning through Self-Attention Mechanisms

Damiano Bonaccorsi. Speech-Text Cross-Modal Learning through Self-Attention Mechanisms. Supervisors: Eliana Pastor, Alkis Koudounas, Moreno La Quatra. Politecnico di Torino, Master's degree programme in Computer Engineering (Corso di laurea magistrale in Ingegneria Informatica), 2023

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:

Speech, with its various elements such as intonation and non-verbal vocalisations, is considered to be the earliest form of human language. However, existing systems for spoken language understanding mostly focus on the textual content and disregard these additional components. Recent advances in speech language modelling have enabled the development of speech-based language models, called SpeechLMs. Nevertheless, text remains the primary mode of communication on the internet. Given this premise, the objective of this thesis is to analyse current state-of-the-art speech models and to design a novel approach that combines the speech and text modalities, obtaining an architecture capable of leveraging the advantages of both. To do so, we adapt the approach of VisualBERT, a previous work that introduces a simple and flexible framework for modelling a wide range of vision-and-text tasks, to the speech and text modalities. Visual tokens obtained from patches of an image are replaced with audio tokens obtained from patches of the audio spectrogram computed on a speech sample, effectively replacing vision with speech. The concatenation of text and speech tokens is fed into a series of transformer layers that implicitly align tokens from the two modalities via self-attention. We name our model SpectroBERT. Through experiments on common speech-text multimodal tasks, such as Audio Question Answering (AQA) and Speech Emotion Recognition (SER), we demonstrate that SpectroBERT is able to implicitly align text and speech features while retaining the simple and flexible formulation of its predecessor.
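
The input pipeline described in the abstract (spectrogram patches turned into audio tokens, concatenated with text tokens, then mixed by self-attention) can be illustrated with a minimal pure-Python sketch. This is not the thesis implementation: the patch size, the model width `d_model`, the random projection, the stand-in text embeddings, the `segment_ids` modality markers, and the helper names `patchify`, `project`, and `self_attention_weights` are all assumptions made for illustration. Learned query/key/value projections and positional embeddings are omitted.

```python
import math
import random


def patchify(spectrogram, patch_h, patch_w):
    """Split a 2D spectrogram (freq bins x time frames) into flattened,
    non-overlapping patches of size patch_h x patch_w."""
    n_freq, n_time = len(spectrogram), len(spectrogram[0])
    patches = []
    for f0 in range(0, n_freq - patch_h + 1, patch_h):
        for t0 in range(0, n_time - patch_w + 1, patch_w):
            patches.append([spectrogram[f][t]
                            for f in range(f0, f0 + patch_h)
                            for t in range(t0, t0 + patch_w)])
    return patches


def project(vectors, weights):
    """Linear projection; weights holds d_model rows, each of length d_in."""
    return [[sum(x * w for x, w in zip(vec, row)) for row in weights]
            for vec in vectors]


def self_attention_weights(tokens):
    """Scaled dot-product attention weights, using the tokens themselves as
    queries and keys (learned Q/K/V projections are left out of this sketch)."""
    scale = math.sqrt(len(tokens[0]))
    weights = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


random.seed(0)
d_model = 8  # hypothetical embedding width

# Toy "spectrogram": 4 frequency bins x 6 time frames of random values.
spectrogram = [[random.random() for _ in range(6)] for _ in range(4)]

# 2x2 patches give 2 * 3 = 6 patches, each flattened to 4 values.
speech_patches = patchify(spectrogram, patch_h=2, patch_w=2)

# Random stand-in for a learned patch-embedding matrix (d_model x 4).
patch_proj = [[random.gauss(0, 0.1) for _ in range(4)] for _ in range(d_model)]
speech_tokens = project(speech_patches, patch_proj)

# Random stand-ins for the embeddings of 3 text tokens.
text_tokens = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(3)]

# Concatenate both modalities into one sequence; 0 marks text, 1 marks speech.
sequence = text_tokens + speech_tokens
segment_ids = [0] * len(text_tokens) + [1] * len(speech_tokens)

# Each row is a distribution over all 9 tokens, text and speech alike.
attn = self_attention_weights(sequence)
```

Because every token attends over the whole concatenated sequence, each row of `attn` spreads weight across both text and speech positions. In a trained model, this shared attention is where an implicit alignment between the two modalities could emerge.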

Supervisors: Eliana Pastor, Alkis Koudounas, Moreno La Quatra
Academic year: 2023/24
Publication type: Electronic
Number of pages: 93
Subjects:
Degree programme: Master's degree programme in Computer Engineering (Corso di laurea magistrale in Ingegneria Informatica)
Degree class: New system > Master's degree > LM-32 - COMPUTER ENGINEERING
Collaborating companies: Not specified
URI: http://webthesis.biblio.polito.it/id/eprint/29585