
Transformer-based speech and text recognition models in the context of the Next-Generation Aircraft's Virtual Assistant

Ludovica Mazzucco

Transformer-based speech and text recognition models in the context of the Next-Generation Aircraft's Virtual Assistant. Rel. Luigi De Russis. Politecnico di Torino, Corso di laurea magistrale in Ingegneria Informatica (Computer Engineering), 2024.

Abstract:

The aim of this thesis is to present possible implementations of two Machine Learning tasks, Speech To Text and Text To Intent paired with Named Entity Recognition, in view of their deployment as core technologies of a Virtual Assistant running on board the next-generation Fighter. The work is the outcome of a six-month internship at Leonardo Labs, the R&D department of Leonardo SpA, based in Turin.

As far as the Speech To Text module is concerned, the OpenAI Whisper neural network is used as the base model to be fine-tuned on the downstream task, with a dataset generated by collecting audio recordings through a Graphical User Interface implemented with the Python framework Streamlit. The behaviour on the test set of the pre-trained model, of the model fine-tuned on the clean dataset, and of the model fine-tuned on the dataset with audio effects applied has been compared: fine-tuning produces a sharp reduction of the error percentage and, for some versions of the pre-trained Whisper, data augmentation further improves the results.

The Text To Intent part, on the other hand, relies on the Google BERT Large Language Model, used as the backbone of a neural network adapted for the additional task of Named Entity Recognition. Three distinct architectures have been considered, all with BERT as their fundamental component but with different Named Entity heads: the first has a single output linear layer; the second adds one more linear layer before the output; the third adds to the second a link from the output of the Intent head to the input of the Named Entity head. In this case the dataset used for training and validation derives from a file containing a limited set of hypothetical input utterances, well suited to this context, that contain variable words; it is then augmented by substituting those variables with feasible names in order to obtain a significant number of samples. The evaluation of the three architectures shows that the third one requires slightly more memory but performs more effectively, both on the test set and in a demonstration with user-defined sentences.

Both models have been tested with task-specific evaluation metrics, namely Word Error Rate (WER) for Speech To Text and accuracy and F1-score for Text To Intent and Named Entity Recognition, while profiling techniques have been used to record resource usage in terms of memory occupancy and time consumption.
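The abstract does not report implementation details, but a minimal sketch of the fine-tuning and Word Error Rate evaluation it describes could look as follows, assuming the Hugging Face transformers implementation of Whisper and the jiwer package; the checkpoint size, learning rate, and the sample audio/transcript are placeholders rather than values taken from the thesis.

```python
# Hypothetical sketch: one fine-tuning step of a pre-trained Whisper checkpoint
# on a single (audio, transcript) pair, followed by a WER check with jiwer.
# Checkpoint, learning rate and the dummy sample are illustrative assumptions.
import numpy as np
import torch
import jiwer
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Placeholder standing in for one recording collected through the Streamlit GUI.
audio = np.zeros(16000, dtype=np.float32)   # 1 s of silence at 16 kHz
transcript = "request landing clearance"

features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
labels = processor.tokenizer(transcript, return_tensors="pt").input_ids

# One optimisation step on the downstream transcription task.
model.train()
loss = model(input_features=features, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()

# Generate a hypothesis and score it against the reference transcript.
model.eval()
with torch.no_grad():
    predicted_ids = model.generate(features)
hypothesis = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print("WER:", jiwer.wer(transcript, hypothesis))
```

In the actual pipeline the same loss/step loop would run over batches of the recorded dataset (clean or with audio effects applied), with WER computed over the whole test set rather than a single utterance.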
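Likewise, the third Text To Intent architecture can be pictured as a hypothetical PyTorch module with BERT as backbone, an Intent head on the pooled representation, and a two-layer Named Entity head that also receives the intent logits; the way the link is realised here (concatenation to every token representation) and the label counts are assumptions for illustration only.

```python
# Hypothetical sketch of the third architecture: BERT backbone, intent head,
# and a Named Entity head whose input also receives the intent head's output.
# Hidden size and numbers of intents/entity tags are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class IntentEntityModel(nn.Module):
    def __init__(self, num_intents=10, num_entity_tags=15, hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        d = self.bert.config.hidden_size
        self.intent_head = nn.Linear(d, num_intents)             # pooled [CLS] -> intent
        self.entity_hidden = nn.Linear(d + num_intents, hidden)  # extra layer of head 2
        self.entity_out = nn.Linear(hidden, num_entity_tags)     # token-level tags

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        intent_logits = self.intent_head(out.pooler_output)           # (B, I)
        tokens = out.last_hidden_state                                 # (B, T, d)
        # "Link" of architecture 3: broadcast the intent logits over the tokens
        # and feed them to the Named Entity head together with the embeddings.
        link = intent_logits.unsqueeze(1).expand(-1, tokens.size(1), -1)
        hidden = torch.relu(self.entity_hidden(torch.cat([tokens, link], dim=-1)))
        entity_logits = self.entity_out(hidden)                        # (B, T, E)
        return intent_logits, entity_logits

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
batch = tokenizer(["set heading two seven zero"], return_tensors="pt")
model = IntentEntityModel()
intent_logits, entity_logits = model(batch["input_ids"], batch["attention_mask"])
print(intent_logits.shape, entity_logits.shape)
```

Under this reading, the first and second architectures described above would correspond to dropping, respectively, both the extra hidden layer and the link, or only the link.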
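The template-based construction of the Text To Intent dataset can also be illustrated with a small, purely hypothetical sketch in which placeholder variables inside a handful of utterance templates are substituted with feasible values; the templates, slot names, and values below are invented and do not come from the thesis.

```python
# Hypothetical sketch of the dataset augmentation: expand utterance templates
# by substituting their variable words with feasible values, keeping track of
# the intent label and of the entities introduced by each substitution.
import itertools

templates = [
    ("set altitude to {ALTITUDE} feet", "SET_ALTITUDE"),
    ("contact {CALLSIGN} on {FREQUENCY}", "CONTACT_STATION"),
]
slot_values = {
    "ALTITUDE": ["five thousand", "ten thousand"],
    "CALLSIGN": ["tower", "approach"],
    "FREQUENCY": ["one one eight decimal five", "one two one decimal nine"],
}

def expand(template, intent):
    slots = [s for s in slot_values if "{" + s + "}" in template]
    for combo in itertools.product(*(slot_values[s] for s in slots)):
        text = template
        for slot, value in zip(slots, combo):
            text = text.replace("{" + slot + "}", value)
        yield {"text": text, "intent": intent, "entities": dict(zip(slots, combo))}

dataset = [sample for tpl, intent in templates for sample in expand(tpl, intent)]
print(len(dataset), dataset[0])
```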
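Finally, the resource profiling mentioned for both models could be approximated along these lines; the thesis does not name its tooling, so time.perf_counter and the torch.cuda memory statistics are assumptions standing in for whatever was actually used.

```python
# Hypothetical sketch: wall-clock time of a single forward pass and, when a GPU
# is available, the peak memory allocated by PyTorch during the call.
import time
import torch

def profile_call(model, **inputs):
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    with torch.no_grad():
        outputs = model(**inputs)
    elapsed = time.perf_counter() - start
    peak_mib = (torch.cuda.max_memory_allocated() / 2**20
                if torch.cuda.is_available() else float("nan"))
    return outputs, elapsed, peak_mib  # result, seconds, peak GPU memory in MiB
```

For example, profile_call(model, input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]) would time one inference of the intent/entity model sketched above.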

Advisor: Luigi De Russis
Academic year: 2024/25
Publication type: Electronic
Number of pages: 85
Additional information: Confidential thesis. Full text not available.
Subjects:
Degree program: Corso di laurea magistrale in Ingegneria Informatica (Computer Engineering)
Degree class: New degree system > Master of Science > LM-32 - INGEGNERIA INFORMATICA
Collaborating companies: LEONARDO SPA
URI: http://webthesis.biblio.polito.it/id/eprint/33047