
Arousal and Valence Recognition in Videos: Comparing the Power of Traditional Machine Learning and Deep Learning Models.

Martino Conversano

Supervisors: Gabriella Olmo, Gianluca Amprimo. Politecnico di Torino, Master's degree programme in Biomedical Engineering (Corso di laurea magistrale in Ingegneria Biomedica), 2023

License: Creative Commons Attribution Non-commercial No Derivatives.

This thesis explores the recognition of continuous emotional states from images and video, with the goal of improving our understanding of human emotions and the role of non-verbal cues in their expression. This is a critical area of research with numerous practical applications in fields such as mental health, human-computer interaction, and marketing. One of the most important perspectives on emotion recognition is the affective state, which can be described by two primary dimensions: arousal and valence. Arousal refers to the intensity or energy level of the emotion, while valence refers to its pleasantness or unpleasantness. More specifically, this thesis focuses on the automatic recognition of arousal and valence from video frames containing human subjects' faces, by applying machine learning and deep learning techniques. The purpose of the study is to compare the performance of simpler models (e.g., SVM, MLP) with that of deep learning architectures (e.g., ResNet, VGG, MobileNet), in order to assess whether simpler models can achieve comparable performance on the task, given effective preprocessing of the input data. As preprocessing, the raw images were cropped and realigned. Then, face landmarks were computed using the MediaPipe library, and Histograms of Oriented Gradients (HOG) using the Py-feat library. To reduce the number of features obtained, principal component analysis was performed on the HOGs. The data employed contain more than five hours of video recordings of stress-eliciting experiments in a controlled environment (e.g., a public speaking task in front of an audience). They include video clips of different subjects exhibiting a variety of expressions, annotated with arousal and valence values for each video frame. Several state-of-the-art deep learning models, including Convolutional Neural Networks (CNNs), were used to evaluate the performance in recognizing arousal and valence.
Results showed that deep learning models did not necessarily outperform traditional machine learning models in recognizing arousal and valence; therefore, powerful preprocessing based on relevant features of the input image could produce similar results while avoiding the long training times typical of deep architectures. This work may contribute to the development of more accurate and reliable video recognition systems based on simpler and faster models.
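
The "simpler model" branch described in the abstract (PCA on the HOG descriptors, then a classical regressor) can be sketched roughly as follows. This is a minimal illustration, not the thesis code. It makes several assumptions: the features are synthetic random arrays standing in for MediaPipe landmarks and Py-feat HOGs, the array sizes and the 20 PCA components are arbitrary, and the landmarks are concatenated with the reduced HOGs. It also assumes scikit-learn's SVR stands in for the SVM, and that RMSE is the score; the abstract does not name a metric.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption): one row per video frame.
n_frames, n_hog, n_landmark_coords = 200, 500, 60
hog = rng.normal(size=(n_frames, n_hog))                   # HOG descriptors
landmarks = rng.normal(size=(n_frames, n_landmark_coords))  # flattened landmark coordinates

# Synthetic per-frame labels (assumption): column 0 = arousal, column 1 = valence.
targets = np.tanh(landmarks[:, :2] + 0.1 * rng.normal(size=(n_frames, 2)))

# Simple hold-out split by frame index (assumption; see the note below).
n_train = 150
train, test = slice(0, n_train), slice(n_train, n_frames)


def reduce_hog(hog_train, hog_test, n_components=20):
    """Fit PCA on the training HOGs only, then project both sets."""
    pca = PCA(n_components=n_components, random_state=0)
    return pca.fit_transform(hog_train), pca.transform(hog_test)


hog_train_red, hog_test_red = reduce_hog(hog[train], hog[test])

# Concatenate landmarks with the reduced HOGs (assumed feature layout).
X_train = np.hstack([landmarks[train], hog_train_red])
X_test = np.hstack([landmarks[test], hog_test_red])

# SVR predicts a single output, so wrap it to fit one regressor per dimension.
model = make_pipeline(StandardScaler(), MultiOutputRegressor(SVR(kernel="rbf")))
model.fit(X_train, targets[train])

pred = model.predict(X_test)
rmse = np.sqrt(np.mean((pred - targets[test]) ** 2, axis=0))
print("RMSE (arousal, valence):", rmse)
```

Fitting the PCA on the training frames only keeps information from the test frames out of the projection. With real recordings, splitting by subject rather than by frame index would likely give a more honest estimate, since consecutive frames of the same person are highly correlated.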

Supervisors: Gabriella Olmo, Gianluca Amprimo
Academic year: 2022/23
Publication type: Electronic
Number of Pages: 76
Degree programme: Master's degree in Biomedical Engineering (Corso di laurea magistrale in Ingegneria Biomedica)
Degree class: New organization > Master science > LM-21 - BIOMEDICAL ENGINEERING
Collaborating companies: Politecnico di Torino
URI: http://webthesis.biblio.polito.it/id/eprint/26210