
Automated creation of Podcasts empowered by Text-To-Speech

Simone Sasso


Supervisors: Antonio Vetro', Giovanni Garifo. Politecnico di Torino, Master's degree programme in Ingegneria Informatica (Computer Engineering), 2022

License: Creative Commons Attribution Non-commercial No Derivatives.


The goal of Text-to-Speech (TTS) is to synthesize human-like speech from text. Over the last decade, this research field has improved dramatically thanks to significant advances in deep learning. TTS models based on neural networks can now produce speech that is almost indistinguishable from a human voice. Consequently, the technology has become increasingly popular and has considerably improved the way people interact with machines. Despite this progress, neural TTS is far from a solved problem and still has several critical issues. Both training and inference require heavy computational resources, and models tend to make mistakes on corner cases or on text from a domain different from that of the training set. This thesis examines the development of a pipeline that generates podcasts by using a Text-to-Speech model to read news articles. Since many different neural TTS architectures exist, the thesis discusses the motivations that led to the choice of the final model. The model was trained on a high-performance computing cluster using a public-domain Italian dataset. To adapt it to the synthesis of long news texts, an additional preprocessing step was introduced in the pipeline. Care was taken to implement a normalizer that correctly handles technical text, which is crucial when dealing with economic or scientific articles. The thesis also explains how the model was fine-tuned on a smaller dataset from a different speaker, successfully converting the synthesized voice in a short amount of time thanks to transfer learning. As of today, there is a lack of high-quality open-source TTS models outside the commercial services offered by big tech companies. The main reason is that creating a TTS dataset is an expensive process that requires aligning transcripts to tens of hours of recorded speech.
To generate the dataset used for fine-tuning, a different approach was followed. Leveraging recent improvements in the Speech-to-Text field, the dataset generation process was automated without the need for existing transcripts. Hopefully, the same technique can be applied to generate datasets for low-resource languages, which suffer from a scarcity of training data. Finally, the thesis describes how the model was deployed as a microservice and explores the strategies used to mitigate long inference times.
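
The abstract does not say how the preprocessing step and the normalizer are implemented. Purely as an illustration of the idea, the following minimal Python sketch expands a few symbols into Italian words and then groups whole sentences into short chunks that a TTS model could synthesize one at a time. The function names, the symbol table, and the 200-character limit are all assumptions made for this example and are not taken from the thesis.

```python
import re

# Hypothetical expansion table (assumption): the thesis's actual
# normalization rules are not described in the abstract.
SYMBOLS = {"%": " per cento", "€": " euro", "&": " e "}


def normalize(text: str) -> str:
    """Expand a few symbols into spoken words and collapse whitespace."""
    for symbol, spoken in SYMBOLS.items():
        text = text.replace(symbol, spoken)
    return re.sub(r"\s+", " ", text).strip()


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole in its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

In a pipeline of this kind, each chunk returned by `split_into_chunks(normalize(article))` would be passed to the synthesizer separately and the resulting audio segments concatenated, which keeps every input close to the sentence lengths a model typically sees during training.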

Supervisors: Antonio Vetro', Giovanni Garifo
Academic year: 2022/23
Publication type: Electronic
Number of Pages: 76
Degree programme: Master's degree programme in Ingegneria Informatica (Computer Engineering)
Degree class: New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING
Collaborating companies: COLUMN SRL
URI: http://webthesis.biblio.polito.it/id/eprint/25614