
Maria Rosa Scoleri
Towards Egocentric Scene Graph Understanding with Graph Neural Networks.
Supervisors: Tatiana Tommasi, Antonio Alliegro. Politecnico di Torino, Master's degree programme in Computer Engineering (Corso di laurea magistrale in Ingegneria Informatica), 2024
PDF (Tesi_di_laurea)
- Thesis
Restricted to: Repository staff only until 31 October 2025 (embargo date). License: Creative Commons Attribution Non-commercial No Derivatives. Download (27MB) |
Abstract: |
Egocentric vision is a domain of computer vision centered on video data captured from wearable devices such as head-mounted cameras. Videos from the user's viewpoint offer unique insights into human behavior and environmental contexts, with applications in augmented reality, activity recognition, and human-computer interaction. This thesis aims to develop a model that extracts relevant features from egocentric videos by exploiting labels constructed from scene graphs, which summarize the content of a given frame with verb-object-relationship triplets. Moreover, we propose a novel approach to the action anticipation task using graph-structured encoded data. We employ a Graph Neural Network (GNN) where visual features extracted from video frames serve as GNN nodes, while edges model the relationships between them. The training of the GNN employs verb-object-relationship triplets as labels, allowing the model to learn relevant frame features for egocentric tasks. To complement this framework, a Variational Autoencoder (VAE) compresses the graph-encoded data into a rich latent space. The VAE’s encoder is used to build a dataset in which each video is described as a sequence of latent-encoded frames. These frame sequences constitute the training data for a Diffusion Model aimed at performing the action anticipation task, which consists of predicting the next action (verb + noun) given a set of known frames. The initial encoded frames of the sequence are kept noise-free to act as conditioning inputs, guiding the diffusion model in generating the next action. Overall, this thesis highlights the potential of scene graphs for egocentric video understanding, presents a first attempt at next-action anticipation with diffusion models, and discusses open problems and future directions. |
---|---|
Supervisors: | Tatiana Tommasi, Antonio Alliegro |
Academic year: | 2024/25 |
Publication type: | Electronic |
Number of Pages: | 100 |
Subjects: | |
Degree programme: | Corso di laurea magistrale in Ingegneria Informatica (Computer Engineering) |
Degree class: | New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING |
Collaborating companies: | UNSPECIFIED |
URI: | http://webthesis.biblio.polito.it/id/eprint/33133 |