
Maria Rosa Scoleri
Towards Egocentric Scene Graph Understanding with Graph Neural Networks.
Supervisors: Tatiana Tommasi, Antonio Alliegro. Politecnico di Torino, Master's degree programme in Ingegneria Informatica (Computer Engineering), 2024
PDF (Tesi_di_laurea)
- Thesis
Restricted access: staff users only until 31 October 2025 (embargo date). License: Creative Commons Attribution Non-commercial No Derivatives. Download (27MB)
Abstract:
Egocentric vision is a branch of computer vision centered on video data captured from wearable devices such as head-mounted cameras. Videos recorded from the user's viewpoint offer unique insights into human behavior and environmental context, with applications in augmented reality, activity recognition, and human-computer interaction. This thesis develops a model that extracts relevant features from egocentric videos by exploiting labels built from scene graphs, which summarize the content of a frame as verb-object-relationship triplets. It also proposes a novel approach to the action anticipation task based on graph-structured encodings. We employ a Graph Neural Network (GNN) in which visual features extracted from video frames serve as nodes, while edges model the relationships between them. The GNN is trained with verb-object-relationship triplets as labels, so that it learns frame features relevant to egocentric tasks. To complement this framework, a Variational Autoencoder (VAE) compresses the graph-encoded data into a rich latent space. The VAE's encoder is used to build a dataset of videos described as sequences of latent-encoded frames. These frame sequences constitute the training data for a Diffusion Model that performs action anticipation, i.e. predicting the next future action (verb + noun) given a set of known frames. The initial encoded frames of each sequence are kept noise-free and act as conditioning inputs, guiding the diffusion model in generating the next action. Overall, this thesis highlights the potential of scene graphs for egocentric video understanding, presents a first attempt at next-action anticipation with diffusion models, and discusses open problems and future directions.
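
The abstract only sketches the pipeline at a high level; the snippet below is a minimal, illustrative PyTorch sketch of the frame-encoding stage, not code from the thesis. All names and sizes (SimpleGNNLayer, GraphVAEEncoder, the 128-d features, the 32-d latent, the dense adjacency matrix) are assumptions made here for illustration: frame-level visual features act as graph nodes, one message-passing step mixes them along scene-graph edges, and a VAE encoder maps the pooled graph embedding to a per-frame latent, which would later be stacked into the sequences used to condition the diffusion model.

```python
import torch
import torch.nn as nn


class SimpleGNNLayer(nn.Module):
    """One round of mean-aggregation message passing over a dense adjacency matrix."""

    def __init__(self, dim):
        super().__init__()
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, x, adj):
        # x: (num_nodes, dim) visual node features; adj: (num_nodes, num_nodes) 0/1 edges
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        msg = adj @ x / deg                      # mean of neighbour features
        return torch.relu(self.update(torch.cat([x, msg], dim=-1)))


class GraphVAEEncoder(nn.Module):
    """Pools node embeddings into one graph vector and outputs a Gaussian latent."""

    def __init__(self, dim, latent_dim):
        super().__init__()
        self.gnn = SimpleGNNLayer(dim)
        self.to_mu = nn.Linear(dim, latent_dim)
        self.to_logvar = nn.Linear(dim, latent_dim)

    def forward(self, x, adj):
        h = self.gnn(x, adj).mean(dim=0)         # graph-level embedding for one frame
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return z, mu, logvar


# Toy usage: 5 nodes (e.g. hand and objects detected in one frame) with 128-d features.
x = torch.randn(5, 128)
adj = (torch.rand(5, 5) > 0.5).float()
z, mu, logvar = GraphVAEEncoder(128, 32)(x, adj)
print(z.shape)  # torch.Size([32]) -- one latent per frame; a video becomes a sequence of latents
```

In the sketch a single mean-pooled message-passing layer stands in for whatever GNN architecture the thesis actually uses, and the diffusion model itself is omitted.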
| Field | Value |
|---|---|
| Supervisors | Tatiana Tommasi, Antonio Alliegro |
| Academic year | 2024/25 |
| Publication type | Electronic |
| Number of pages | 100 |
| Subjects | |
| Degree programme | Master's degree programme in Ingegneria Informatica (Computer Engineering) |
| Degree class | New regulations > Master's degree > LM-32 - INGEGNERIA INFORMATICA |
| Partner companies | NOT SPECIFIED |
| URI | http://webthesis.biblio.polito.it/id/eprint/33133 |