Politecnico di Torino

Deep Neural Networks for Speaker Verification

Salvatore Sarni

Deep Neural Networks for Speaker Verification.

Rel. Sandro Cumani. Politecnico di Torino, Corso di laurea magistrale in Ingegneria Informatica (Computer Engineering), 2020

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.


Speaker identification and speaker verification are the main tasks in the field of speaker recognition. The former involves inferring the speaker of an utterance from a set of possible identities, whereas the latter aims at assessing whether a claimed identity corresponds to the speaker of a given speech segment. Thanks to advances in the field of Deep Learning, Deep Neural Networks (DNN) have recently become the state-of-the-art technique for utterance representation in the speaker recognition field. The DNN approach consists of training a neural network to extract speaker embeddings, i.e. fixed-dimensional utterance representations that contain speaker-discriminant information. DNN embeddings significantly outperform previous state-of-the-art methods such as i-vectors in terms of verification accuracy. One of the most effective architectures for speaker embedding extraction is the Time Delay Neural Network (TDNN), which is able to model long-range temporal dependencies. In this work we start with an analysis of the effectiveness of TDNNs for the speaker verification task. We also investigate the combination of TDNNs with other well-known architectures, such as Residual (ResNet) and recurrent neural networks, with the aim of improving the verification accuracy and possibly lowering the computational cost of embedding extraction. Traditionally, speaker embedding networks are trained on a set of background speakers using a multi-class classification paradigm: acoustic features are propagated through the network and aggregated by a pooling layer, which is often employed as, or followed by, the embedding layer. The network output consists of a softmax layer that computes speaker posterior probabilities. The cross-entropy function is used as the objective function during training. Network training requires the extraction of acoustic features for a large number of speech segments.
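The pooling step described above can be sketched in isolation. A common choice in embedding networks is statistics pooling, which concatenates the per-dimension mean and standard deviation of the frame-level features, so that utterances of any duration map to a vector of fixed size. The function name and dimensions below are illustrative, not taken from the thesis; a minimal pure-Python sketch:

```python
import math

def stats_pooling(frames):
    """Aggregate a variable-length sequence of T frame-level feature
    vectors (each of dimension D) into a fixed 2*D-dimensional
    utterance vector: per-dimension means followed by standard
    deviations. Illustrative sketch, not the thesis implementation."""
    T = len(frames)
    D = len(frames[0])
    means = [sum(f[d] for f in frames) / T for d in range(D)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / T)
            for d in range(D)]
    return means + stds

# Utterances of different lengths map to vectors of the same size,
# which is what makes a fixed-dimensional embedding layer possible.
u1 = stats_pooling([[1.0, 2.0], [3.0, 4.0]])   # 2 frames
u2 = stats_pooling([[0.5, 0.5]] * 7)           # 7 frames
assert len(u1) == len(u2) == 4
```

In a real network this pooling sits between the frame-level (e.g. TDNN) layers and the segment-level layers, and the embedding is taken from an affine layer applied to its output.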
In this work we analyze how growing dataset size, dataset diversity and the use of augmentation techniques impact the recognition accuracy, and propose effective methods to train the models in scenarios with limited computational resources. While the cross-entropy approach works well in classification tasks, it might not be the most effective choice to produce information-rich utterance representations for speakers that have not been seen during training. We therefore also analyze different objective functions that are inspired by solutions adopted in the face recognition field to increase the robustness of the DNN embeddings. Finally, the standard TDNN pooling layer consists of a simple temporal average of the DNN-transformed acoustic features. In this work, we also consider alternative pooling approaches.
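The abstract does not name the specific face-recognition-inspired objectives studied; a well-known representative of that family is the additive angular margin softmax (ArcFace-style), which penalises the target-class logit by adding a margin to the angle between the embedding and the class weight vector. A minimal sketch of that margin, with margin and scale values chosen only for illustration:

```python
import math

def aam_logit(cos_theta, margin=0.2, scale=30.0):
    """Additive angular margin on the target-class logit: recover the
    angle from its cosine, add the margin m, then rescale. Hypothetical
    parameter values; not taken from the thesis."""
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    return scale * math.cos(theta + margin)

# The margin lowers the target logit relative to plain scaled softmax,
# so training must pull same-speaker embeddings into tighter clusters
# to compensate.
plain = 30.0 * 0.9        # standard scaled logit for cos(theta) = 0.9
margined = aam_logit(0.9)  # margin-penalised target logit
assert margined < plain
```

The non-target logits are left unchanged, so the margin only makes the target class harder to claim, which is what encourages the larger inter-speaker angular separation these losses are used for.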

Supervisor: Sandro Cumani
Academic year: 2019/20
Publication type: Electronic
Number of Pages: 67
Degree programme: Master's degree in Ingegneria Informatica (Computer Engineering)
Degree class: New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING
Collaborating companies: UNSPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/15245