RobVC: An End-to-End Self-Supervised Voice Conversion

Ahmadreza Farmahini Farahani

RobVC: An End-to-End Self-Supervised Voice Conversion.

Supervisors: Santa Di Cataldo, Francesco Ponzio, Alessio Mascolini. Politecnico di Torino, Master's degree programme in Computer Engineering (Ingegneria Informatica), 2025

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract

Current systems are mainly based on text-to-voice conversion rather than audio-to-audio voice generation, which yields better overall output quality but loses the characteristics of the speaker's voice. Moreover, current voice generation approaches struggle to preserve the speaker's emotions and voice, producing mechanical voices that lack intonation. Additionally, most of the models currently available for voice generation are too heavy to be used in real-time systems. Furthermore, the few audio-to-audio systems available tend to generate mechanical, flat and emotionless voices and are not able to generalise (e.g. a voice the model has not seen requires a fine-tuning step before it can be converted correctly and precisely).