Low-complexity neural networks for robust acoustic scene classification in wearable audio devices

Michele Panariello

Supervisor: Antonio Servetti. Politecnico di Torino, Master's degree programme in Data Science and Engineering, 2022

License: Creative Commons Attribution Non-commercial No Derivatives.

This work concerns the design of a machine learning pipeline that performs acoustic scene classification (ASC) on a pair of headphones by means of a convolutional neural network (CNN). ASC is the task of recognizing a scene (e.g. bus, park, office) from the sounds it produces (e.g. engine noise, birds chirping, typing sounds). In our setting, the goal is to make the headphones context-aware in order to enhance the user experience. We capture audio from the microphone of the headphones and run the CNN on their hardware to perform classification in real time. A challenging aspect of the task is the lack of recordings from the microphone of the headphones, which forces us to resort to external data sources. This can be problematic: training on audio acquired with a microphone different from the one used in the final device may cause a data distribution shift and degrade classification performance (a phenomenon known as "device mismatch"). Moreover, because of the embedded environment, only a low-complexity CNN can be used, which may limit modeling accuracy. We define the set of acoustic scenes to classify by seeking a balance among the capabilities of the neural network, the possible use cases of the product, and the labeled data that is publicly available. We assess the impact of device mismatch by re-recording part of our original test set through the microphone of our headphones and testing our model on the resulting samples; we establish that no appreciable performance degradation occurs. We then devise a technique to simulate the availability of further data from the target microphone. We test how well the neural network generalizes to various kinds of input perturbations, such as wind, unseen acoustic scenes and reverberation. We propose a number of approaches to make the model more robust to such perturbations, including different data augmentation techniques and the integration of a hidden Markov model into the system.
We show that there is a fundamental trade-off between robustness and classification performance. We attempt to bypass that trade-off by increasing the complexity of the model. First, we increase the time and frequency resolution of the input to the CNN. Second, we switch to a deeper neural network architecture while keeping the same memory complexity. Lastly, we combine the two approaches. Our experiments suggest that merely deepening the network does not lead to appreciable improvements; instead, the preprocessing is the real performance bottleneck of the system. With a more fine-grained input audio representation, we are able to enhance classification performance and robustness simultaneously, albeit at the cost of higher memory usage. Overall, our work illustrates how deploying a lightweight algorithm in an embedded system requires finding a sweet spot among performance, robustness, model complexity and product requirements.
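The abstract mentions integrating a hidden Markov model into the system but does not say how. As a purely illustrative sketch, and not necessarily the method used in the thesis, one common way to do this is to run an HMM forward filter over the per-frame CNN outputs so that isolated misclassifications are smoothed out. The function name `hmm_filter`, the `stay_prob` parameter, the symmetric transition matrix, and the use of CNN posteriors as emission scores (which implicitly assumes a uniform class prior) are all assumptions made for this example.

```python
def hmm_filter(posteriors, stay_prob=0.95):
    """Smooth per-frame class probabilities with an HMM forward pass.

    posteriors: list of frames, each a list of class probabilities
                (hypothetical CNN outputs, at least two classes).
    stay_prob:  assumed probability of remaining in the same scene
                from one frame to the next (illustrative value).
    """
    n = len(posteriors[0])
    # Assumption: all scene changes are equally likely.
    switch_prob = (1.0 - stay_prob) / (n - 1)
    belief = [1.0 / n] * n  # uniform initial belief
    smoothed = []
    for frame in posteriors:
        # Prediction step: propagate the belief through the transitions.
        predicted = [
            stay_prob * belief[j] + switch_prob * (1.0 - belief[j])
            for j in range(n)
        ]
        # Update step: weight by the CNN output and renormalize.
        unnorm = [predicted[j] * frame[j] for j in range(n)]
        z = sum(unnorm)
        belief = [u / z for u in unnorm]
        smoothed.append(belief)
    return smoothed


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


# Toy input: class 0 throughout, with one isolated frame favouring class 1.
frames = [[0.9, 0.1]] * 3 + [[0.4, 0.6]] + [[0.9, 0.1]] * 3
raw_labels = [argmax(f) for f in frames]
smoothed_labels = [argmax(b) for b in hmm_filter(frames)]
```

In this toy input, the fourth frame taken on its own favours class 1, while the filtered belief stays on class 0. A higher `stay_prob` makes the output more stable but slower to react to genuine scene changes, which is one concrete form of the robustness versus performance trade-off described above.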

Supervisors: Antonio Servetti
Academic year: 2021/22
Publication type: Electronic
Number of Pages: 116
Degree programme: Master's degree programme in Data Science and Engineering
Degree class: New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING
Collaborating companies: UNSPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/22584