Federico Piovesan
EXPLORING TRANSFORMER MODEL FOR ACOUSTIC SCENE CLASSIFICATION.
Rel. Marcello Chiaberge, Luis Conde Bento, Mónica Jorge Carvalho De Figueiredo. Politecnico di Torino, Master of science program in Mechatronic Engineering, 2024
PDF (Tesi_di_laurea) - Thesis
Licence: Creative Commons Attribution Non-commercial No Derivatives.
Abstract
Sounds carry a large amount of information about the environment and the events that take place in it. Deep learning architectures can automatically extract and interpret these acoustic signals, an ability that is pivotal in numerous applications, such as multimedia retrieval, context-aware devices, robotics, and intelligent monitoring systems. Since its introduction, the Vision Transformer (ViT) architecture has shown remarkable results in a diverse array of AI tasks, including those related to acoustics. The DCASE competition, with its acoustic challenges, has pushed research in the field. Moreover, over the last two years, DCASE Task 1 has served as an effective benchmark for showcasing the performance of ViT models on Acoustic Scene Classification (ASC) problems.
This thesis aims to explore the capabilities of the ViT model in this context, using the TAU Urban Acoustic Scenes Dataset.
