EXPLORING TRANSFORMER MODEL FOR ACOUSTIC SCENE CLASSIFICATION

Federico Piovesan

Supervisors: Marcello Chiaberge, Luis Conde Bento, Mónica Jorge Carvalho De Figueiredo. Politecnico di Torino, Master's degree programme in Mechatronic Engineering (Ingegneria Meccatronica), 2024

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:

Sounds carry a large amount of information about the environment and the events that take place in it. Deep learning architectures can automatically extract and interpret this information from acoustic signals, an ability that is pivotal in numerous applications, such as multimedia retrieval, context-aware devices, robotics, and intelligent monitoring systems. Since its introduction, the Vision Transformer (ViT) architecture has shown remarkable results in a diverse array of AI tasks, including those related to acoustics. The DCASE competition, with its acoustic challenges, has pushed research in the field. Moreover, over the last two years, DCASE Task 1 has served as an effective benchmark for showcasing the performance of ViT models in Acoustic Scene Classification (ASC) problems. This thesis explores the capabilities of the ViT model in this context, using the TAU Urban Acoustic Scenes dataset. It seeks to evaluate a Transformer-based solution that remains robust despite variations in recording conditions and devices, and to assess the architecture’s capabilities given the stringent data constraints and challenges of the task. The Transformer’s substantial data requirements, large model size, and intrinsic training difficulties, together with the use of a consumer-grade GPU, necessitated efficient strategies to optimize performance and generalization. The role of the Transformer in a knowledge distillation framework was examined, where it served both as a teacher guiding a smaller model and as a student learning from a larger model. The importance of data augmentation and pre-training was also emphasized, confirming the significant challenges mentioned above. The results highlight the need to provide the Transformer with extensive data and to adopt an adequate training strategy in order to fully leverage its capabilities. The thesis also provides insights into future directions, including domain adaptation, to further enhance the model’s robustness and applicability in diverse ASC scenarios.
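
The teacher/student arrangement mentioned in the abstract can be illustrated with a minimal sketch of a commonly used distillation objective: a weighted sum of the usual cross-entropy on the true label and a temperature-softened KL divergence between teacher and student outputs. The abstract does not state which loss the thesis uses, so the function names, the `temperature` and `alpha` parameters, and the three-class logits below are all illustrative assumptions, not the author's implementation.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_sum for s in scaled]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Hypothetical distillation objective (assumed form, not from the thesis).

    alpha weights the hard-label cross-entropy; (1 - alpha) weights the
    KL divergence KL(teacher || student) computed at the given temperature.
    """
    # Hard-label term: cross-entropy of the student against the true class.
    ce = -log_softmax(student_logits)[label]

    # Soft-label term: KL divergence between softened distributions.
    log_t = log_softmax(teacher_logits, temperature)
    log_s = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))

    # The T^2 factor keeps the soft term's gradient scale comparable
    # across temperatures.
    return alpha * ce + (1.0 - alpha) * temperature ** 2 * kl


# Illustrative logits for a made-up three-class problem.
student = [0.2, 1.5, -0.3]
teacher = [0.1, 2.4, -1.0]
loss = distillation_loss(student, teacher, label=1)
```

In this sketch the same function covers both roles described in the abstract: the Transformer's logits are passed as `teacher_logits` when it guides a smaller model, and as `student_logits` when it learns from a larger one.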

Supervisors: Marcello Chiaberge, Luis Conde Bento, Mónica Jorge Carvalho De Figueiredo
Academic year: 2023/24
Publication type: Electronic
Number of Pages: 98
Subjects:
Degree programme: Master's degree programme in Mechatronic Engineering (Ingegneria Meccatronica)
Degree class: New organization > Master science > LM-25 - AUTOMATION ENGINEERING
Joint supervision institution: Universidade de Coimbra (Portugal)
Collaborating organizations: University of Coimbra
URI: http://webthesis.biblio.polito.it/id/eprint/31917