Facial Expression Recognition: Performance and Saliency Map Comparison Between Humans and CNNs

Federica Amato


Supervisors: Federica Marcolin, Alessia Celeghin, Elena Carlotta Olivetti. Politecnico di Torino, Master's degree in Computer Engineering (Ingegneria Informatica), 2025

Abstract:	This thesis explores the use of different neural networks for Facial Expression Recognition (FER) on images. FER is a Computer Vision and Pattern Recognition task that aims to identify the emotion felt by a subject from image- or video-based facial data. Facial expressions are a form of nonverbal communication, so FER has applications in healthcare, education, criminal detection, and marketing. One of the challenges in FER is the inherent variability in how individuals express emotions: people may display the same emotion differently and often blend multiple emotions at once (e.g., happiness and surprise). Furthermore, several emotions share similar facial expressions, which makes them difficult to tell apart for both human observers and AI models. For instance, surprise and fear can look alike, increasing the likelihood of misclassification. This study focuses on Convolutional Neural Networks (CNNs), which excel at recognizing visual patterns and can match human-level performance. Using RGB images, the study compares various CNN models with human evaluations. Visualization tools including GradCAM, Bubbles, and External Perturbations were used to examine how people and models interpret facial emotions. Both pre-trained models and truncated pre-trained models were analyzed; the latter showed better generalization and performance. One possible reason is that the last layers of a pre-trained model extract abstract features that are better suited to the pre-training dataset than to the FER task. To improve feature extraction and classification, techniques such as self-attention, patch extraction, and Global Average Pooling (GAP) were introduced. A pre-classification module with batch normalization and L2 regularization was also used to improve robustness and prevent overfitting.
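As an illustration of one of the techniques named above, the following is a minimal sketch of Global Average Pooling in plain Python. It is not the thesis implementation: the function name `global_average_pooling`, the channel-first nested-list layout, and the toy values are assumptions made only for this example. GAP collapses each feature map produced by the convolutional layers into a single number (its mean), so a stack of C maps becomes a vector of C values that a classifier can consume.

```python
def global_average_pooling(feature_maps):
    """Reduce C feature maps of size H x W to a vector of C channel means.

    `feature_maps` is assumed to be a list of C channels, each a list of
    H rows holding W floats (channel-first layout, chosen for this sketch).
    """
    pooled = []
    for channel in feature_maps:
        # Flatten the H x W map and average all of its activations.
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


# Toy input: 3 channels of size 2 x 2 (hypothetical activations).
feature_maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
    [[-1.0, 1.0], [-1.0, 1.0]],
]
pooled = global_average_pooling(feature_maps)
print(pooled)  # one value per channel
```

Because the output length depends only on the number of channels, GAP adds no trainable parameters, which is one common reason it is used in place of large fully connected layers.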
In addition to these methods, Focal Loss was used to further improve model performance and to better handle class imbalance, since it induces the model to pay greater attention to samples that are difficult to classify. Label Smoothing was also applied; it helps prevent the model from becoming overconfident in its predictions, encouraging better generalization. The training and validation sets are derived from the union of BU3DFE, DDCF, MMI, and additional labeled images from previous projects (Celeghin et al., 2023), with the train set and validation set kept subject-independent. The Bosphorus dataset, in contrast, was used exclusively to build the test set, in order to ensure a more robust evaluation of the models. The results suggest that these techniques improve model performance and bring it closer to that of humans.
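The two loss-related techniques can be sketched together as follows. This is a hedged illustration in plain Python, not the thesis code: the abstract does not state how Focal Loss and Label Smoothing were combined or which hyperparameters were used, so the soft-target formulation, the function names, and the defaults `gamma=2.0` and `epsilon=0.1` are assumptions. Focal Loss scales each cross-entropy term by (1 - p)^gamma, which shrinks the contribution of samples the model already classifies confidently; Label Smoothing replaces the one-hot target with a slightly softened distribution.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Soft targets: (1 - epsilon) on the true class plus epsilon spread uniformly."""
    off = epsilon / num_classes
    return [(1.0 - epsilon) + off if k == true_class else off
            for k in range(num_classes)]


def focal_loss(logits, true_class, gamma=2.0, epsilon=0.0):
    """Focal loss for one sample, optionally with smoothed targets.

    With gamma=0 and epsilon=0 this reduces to ordinary cross-entropy.
    """
    probs = softmax(logits)
    targets = smooth_labels(true_class, len(logits), epsilon)
    return sum(-q * (1.0 - p) ** gamma * math.log(max(p, 1e-12))
               for p, q in zip(probs, targets))


# Hypothetical logits for a 3-class problem: the model strongly favours class 0.
logits = [4.0, 0.0, 0.0]
easy = focal_loss(logits, true_class=0)  # confident and correct
hard = focal_loss(logits, true_class=1)  # confident and wrong
print(easy < hard)
```

In this toy case the well-classified sample contributes far less to the loss than the misclassified one, which is the behaviour that lets training concentrate on hard or under-represented expressions.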
Supervisors:	Federica Marcolin, Alessia Celeghin, Elena Carlotta Olivetti
Academic year:	2024/25
Publication type:	Electronic
Number of pages:	194
Additional information:	Thesis under embargo. Full text not available
Subjects:
Degree programme:	Master's degree in Computer Engineering (Ingegneria Informatica)
Degree class:	New system > Master's degree > LM-32 - COMPUTER ENGINEERING
Collaborating companies:	Politecnico di Torino
URI:	http://webthesis.biblio.polito.it/id/eprint/35492
