MARE-Graph: Multimodal Action Recognition in Egocentric video with Graph Neural Network

Domenico Mereu

Supervisors: Giuseppe Bruno Averta, Simone Alberto Peirone, Francesca Pistilli, Antonio Alliegro. Politecnico di Torino, Master's degree programme in Data Science And Engineering, 2024

PDF (Tesi_di_laurea) - Thesis (8MB)
License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:

In recent years, the spread of affordable wearable cameras such as the GoPro has driven growing interest in the first-person perspective, known as egocentric vision. The proximity of the camera to the action allows for in-depth analysis of human behavior and human-environment interaction. Egocentric vision is exploited in numerous applications, including augmented and mixed reality, human-robot interaction and behavior understanding. Video analysis tasks are inherently multimodal and therefore demand the integration of diverse modalities. Additional modalities provide complementary information, addressing the limitations of any single modality and enhancing the robustness and accuracy of egocentric action recognition systems. Nevertheless, integrating diverse modalities introduces challenges arising from data heterogeneity, distinct preprocessing needs, and the varying computational demands of each modality. Recent studies in egocentric vision have explored graph-based approaches to build hierarchical representations of human activities or to extract topological maps of physical space. Other research has shown the adaptability of Graph Neural Networks (GNNs) to multimodal contexts. This thesis extends this line of work by leveraging GNNs for action recognition, enhancing temporal reasoning over action sequences and supporting integration and cooperation between different modalities. We combine GNNs with a cross-modal attention mechanism, enabling each modality to explore the content of the others and supporting robust cooperation. To further demonstrate the effectiveness of our approach in exploiting the synergies between modalities, we explore scenarios in which a specific modality is unavailable at test time, for example because of computational constraints or efficiency requirements.
Our cross-modal interaction mechanism learns representations that remain robust to potential modality loss. Experiments show a significant boost in accuracy compared to various baselines. This underscores the effectiveness of GNNs in handling multimodal contexts across diverse scenarios.

Supervisors: Giuseppe Bruno Averta, Simone Alberto Peirone, Francesca Pistilli, Antonio Alliegro
Academic year: 2023/24
Publication type: Electronic
Number of Pages: 82
Subjects:
Degree programme: Master's degree programme in Data Science And Engineering
Degree class: New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING
Collaborating companies: UNSPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/30811