
A Comparison of Deep Learning and Human Visual Attention in Facial Expression Recognition: An Eye-Tracking and Explainable AI Approach

Adele Roggia

Supervisors: Federica Marcolin, Elena Carlotta Olivetti, Alessia Celeghin. Politecnico di Torino, Master's degree programme in Biomedical Engineering, 2025

Abstract:

Deep learning (DL) models have become so effective at classifying human emotions from facial expressions that they are approaching, and in some cases surpassing, human performance. However, the decision process behind the models’ output remains unclear. On the assumption that DL models and the human brain operate in partly similar ways when visually analysing emotional facial expressions, this work compares seven DL models with the behavioural responses of a group of 54 human subjects on an emotion recognition task based on facial expressions of Ekman’s basic emotions. Human visual attention was studied through eye-tracking data recorded during the facial emotion recognition task, while the DL models were trained and tested on the same dataset of images. Eye-tracking data were processed with Pupil Cloud tools and custom code to map the gaze coordinates directly onto the images. Saliency maps were computed both with three Explainable Artificial Intelligence (XAI) techniques and from the human eye-tracking data, and were subsequently aggregated into Canonical Saliency Maps. Neither the DL models nor the human participants achieved high accuracy. Both the inter-observer agreement among subjects and the agreement among the DL models showed congruent errors, confirming that some images were intrinsically ambiguous. Human visual analysis showed a distinctive gaze pattern for each participant and no significant differences in classifying one emotion from another. Similarly, the saliency maps obtained from the DL models focus on distinctive areas of the face, depending on the classified emotion and the attentional framework. For the same DL architecture, and in general for all the DL networks tested, perturbation-based XAI techniques such as Bubbles and External Perturbations seem to highlight similar areas of interest on the saliency maps, while gradient-based methods such as Grad-CAM focus on different regions. Compared with human behaviour, Bubbles is the method that most closely resembles the canonical saliency maps of humans but, in general, the results show that DL models and humans do not share patterns in observing the face.
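
The comparison between gaze-derived and model-derived saliency maps can be illustrated with a minimal sketch. It is not the thesis code: the abstract does not state how gaze coordinates were smoothed or which similarity measure was used, so the Gaussian accumulation, the `sigma` value, the Pearson correlation, and the names `gaze_to_saliency` and `pearson` are all assumptions chosen for illustration. The "model" map below is a synthetic stand-in, not the output of an XAI method.

```python
import math


def gaze_to_saliency(gaze_points, width, height, sigma=2.0):
    """Accumulate one isotropic Gaussian per gaze point, then scale to [0, 1].

    Assumption: gaze points are (x, y) pixel coordinates already mapped
    onto the stimulus image; sigma is an illustrative smoothing width.
    """
    grid = [[0.0] * width for _ in range(height)]
    for gx, gy in gaze_points:
        for y in range(height):
            for x in range(width):
                d2 = (x - gx) ** 2 + (y - gy) ** 2
                grid[y][x] += math.exp(-d2 / (2.0 * sigma ** 2))
    peak = max(max(row) for row in grid)
    if peak > 0:
        grid = [[v / peak for v in row] for row in grid]
    return grid


def pearson(map_a, map_b):
    """Pearson correlation between two same-sized maps, flattened row by row."""
    a = [v for row in map_a for v in row]
    b = [v for row in map_b for v in row]
    if len(a) != len(b):
        raise ValueError("maps must have the same size")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant map carries no spatial information
    return cov / math.sqrt(var_a * var_b)


# Hypothetical gaze samples clustered on the upper part of a 16x16 image.
human_map = gaze_to_saliency([(4, 3), (5, 3), (11, 3)], 16, 16)
# Synthetic stand-in for a model saliency map centred on the lower part.
model_map = gaze_to_saliency([(8, 12)], 16, 16)

print("human vs human:", pearson(human_map, human_map))
print("human vs model:", pearson(human_map, model_map))
```

A correlation close to 1 would indicate that two maps highlight the same regions, while values near or below 0 indicate that they do not. Other measures could be substituted for `pearson` without changing the structure of the sketch.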

Supervisors: Federica Marcolin, Elena Carlotta Olivetti, Alessia Celeghin
Academic year: 2024/25
Publication type: Electronic
Number of pages: 79
Additional information: Restricted thesis. Full text not available
Subjects:
Degree programme: Master's degree programme in Biomedical Engineering
Degree class: New system > Master's degree > LM-21 - BIOMEDICAL ENGINEERING
Collaborating companies: NOT SPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/34885