Data Driven: AI Voice Cloning

Alessandro Emmanuel Pecora

Supervisors: Luca Cagliero, Moreno La Quatra, Lorenzo Vaiani. Politecnico di Torino, Master's degree programme in Data Science And Engineering, 2023

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:	As humans, we transmit a significant amount of information through speech. Evolution has developed an entire organ with the function of modulating audio signals for communication purposes, and speech is the most commonly used communication channel among humans. In the field of speech processing, several transformations can be used to extract value from speech data. Applications range from clinical settings, such as detecting Parkinson's disease from voice samples, to the media industry, where software for automatic dubbing in multiple languages can be developed using speech processing methods. This thesis focuses on two specific tasks within the field of speech processing: speaker recognition (SR) and text-to-speech synthesis (TTS). Speaker recognition involves determining an individual's identity from their voice, while text-to-speech synthesis entails creating natural-sounding human speech waveforms from input text. Overall, this work contributes to the field of speech processing by improving the performance of the analyzed SR models, and it demonstrates the effectiveness of well-constructed datasets in reducing data requirements for TTS. Moreover, emphasis is placed on the effective use of speaker embeddings. These are low-dimensional vectors that capture unique characteristics of a speaker's voice, serving as a voice print, and are employed to condition the TTS models. The result is a Voice Cloner System that integrates the two tasks. The system is capable of synthesizing speech in previously unheard voices using approximately 5 s of input speech audio. It operates in a zero-shot learning setting and leverages speaker embeddings. Additionally, a demo application was released, showcasing the capabilities of the Voice Cloner and providing an implementation for future exploration.
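The idea of a speaker embedding acting as a voice print can be illustrated with a minimal sketch: two utterances are attributed to the same speaker when their embeddings point in a similar direction. The function names, the toy 3-dimensional vectors, and the 0.7 threshold below are assumptions made for illustration only; they are not taken from the thesis, whose embeddings come from trained neural models.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(emb_a, emb_b, threshold=0.7):
    """Accept the pair as one speaker if similarity reaches the threshold.

    The threshold is a hypothetical value; in practice it is tuned on
    held-out verification trials.
    """
    return cosine_similarity(emb_a, emb_b) >= threshold


# Toy embeddings (hypothetical values, far smaller than real ones).
enrolled = [0.9, 0.1, 0.2]
same_voice = [0.8, 0.2, 0.1]
other_voice = [-0.1, 0.9, -0.3]

print(same_speaker(enrolled, same_voice))
print(same_speaker(enrolled, other_voice))
```

In a zero-shot cloning setup, the same kind of vector would be extracted from a short reference clip and passed to the TTS model as a conditioning input.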
For SR, the thesis explores the Speaker Identification (SI) and Speaker Verification (SV) sub-tasks, and how to use them to generate speaker embeddings. Various speaker embedding techniques are examined, with a focus on X-vectors extracted using the SV objective. Two well-known deep learning architectures, WAVLM+ and ECAPA-TDNN, are enhanced using a newer loss function, the Generalized End-to-End (GE2E) loss. Experimental results on the VoxCeleb 1 dataset demonstrate that models trained with the proposed loss outperform existing pretrained models. In the TTS domain, the Tacotron 2 and FastSpeech2 architectures are investigated. Tacotron 2 uses LSTM encoders and decoders with attention mechanisms, while FastSpeech2 employs transformer-based models to convert phonemes into mel-spectrograms. Case studies on Italian voices demonstrate the benefits of incorporating pangram utterances (phrases that include all the letters of a language's alphabet), confirming that data quality is often more important than quantity. Various integrations for Voice Cloning Systems are examined; the final architecture combines ECAPA-TDNN for generating speaker embeddings and FastSpeech2 for speech synthesis. This integration successfully enables voice cloning and allows for pitch, energy, and duration modulation of the output audio. The Voice Cloner is evaluated using the Voice Clone Error Rate (VC-ER) and Word Error Rate (WER) metrics. VC-ER measures the similarity between cloned and real speaker voices using the SV task, while WER assesses the accuracy of the synthesized speech.
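As an illustration of the WER metric mentioned above, the following minimal sketch uses the standard definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name and the example sentences are assumptions; the abstract does not describe the evaluation tooling, and in a TTS evaluation the hypothesis would typically be a transcript of the synthesized audio produced by a speech recognizer.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one of six reference words is dropped.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.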
Supervisors:	Luca Cagliero, Moreno La Quatra, Lorenzo Vaiani
Academic year:	2022/23
Publication type:	Electronic
Number of pages:	79
Subjects:
Degree programme:	Master's degree programme in Data Science And Engineering
Degree class:	New system > Master's degree > LM-32 - COMPUTER ENGINEERING
Collaborating companies:	Not specified
URI:	http://webthesis.biblio.polito.it/id/eprint/27738
