Maria Rosa Scoleri
Towards Egocentric Scene Graph Understanding with Graph Neural Networks.
Supervisors: Tatiana Tommasi, Antonio Alliegro. Politecnico di Torino, Master's degree programme in Computer Engineering (Ingegneria Informatica), 2024
PDF (Tesi_di_laurea) - Thesis
Access restricted to: staff users only until 31 October 2025 (embargo date). License: Creative Commons Attribution Non-commercial No Derivatives. File size: 27MB.
Abstract:
Egocentric vision is a domain of computer vision centered on video data captured from wearable devices such as head-mounted cameras. Videos from the user's viewpoint offer unique insights into human behavior and environmental contexts, with applications in augmented reality, activity recognition, and human-computer interaction. This thesis aims to develop a model that extracts relevant features from egocentric videos by exploiting labels constructed from scene graphs, which summarize the content of a given frame with verb-object-relationship triplets. Moreover, we propose a novel approach to the action anticipation task using graph-structured encoded data. We employ a Graph Neural Network (GNN) in which visual features extracted from video frames serve as nodes, while edges model the relationships between them. The GNN is trained with verb-object-relationship triplets as labels, allowing the model to learn frame features relevant to egocentric tasks. To complement this framework, a Variational Autoencoder (VAE) compresses the graph-encoded data into a rich latent space. The VAE’s encoder is then used to build a dataset in which each video is described as a sequence of latent-encoded frames. These frame sequences constitute the training data for a Diffusion Model that performs action anticipation, that is, predicting the next action (verb + noun) given a set of known frames. The initial encoded frames of the sequence are kept noise-free to act as conditioning inputs, guiding the diffusion model in generating the next action. Overall, this thesis highlights the potential of scene graphs for egocentric video understanding, presents a first attempt at next-action anticipation with diffusion models, and discusses open problems and future directions.
Supervisors: Tatiana Tommasi, Antonio Alliegro
Academic year: 2024/25
Publication type: Electronic
Number of pages: 100
Subjects: (none listed)
Degree programme: Master's degree programme in Computer Engineering (Ingegneria Informatica)
Degree class: New system > Master's degree > LM-32 - Computer Engineering (Ingegneria Informatica)
Collaborating companies: Not specified
URI: http://webthesis.biblio.polito.it/id/eprint/33133
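The abstract states that the initial encoded frames of each sequence are kept noise-free so that they condition the diffusion model. The sketch below illustrates that idea only; it is not taken from the thesis. It assumes the standard closed-form forward process x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps, and the function name `noise_future_frames`, the parameters `num_context` and `alpha_bar`, and the toy latent vectors are all hypothetical. The thesis's actual noise schedule, latent size, and number of conditioning frames are not given in this record.

```python
import math
import random


def noise_future_frames(latents, num_context, alpha_bar, rng):
    """Noise every frame after the first `num_context` frames.

    latents     -- list of latent frame vectors (lists of floats); hypothetical layout
    num_context -- number of leading frames kept clean as conditioning (assumption)
    alpha_bar   -- cumulative signal level in [0, 1] for the chosen diffusion step
    rng         -- random.Random instance, passed in for reproducibility
    """
    if not 0.0 <= alpha_bar <= 1.0:
        raise ValueError("alpha_bar must lie in [0, 1]")
    signal = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    noisy = []
    for index, frame in enumerate(latents):
        if index < num_context:
            # Conditioning frames are copied through unchanged (noise-free).
            noisy.append(list(frame))
        else:
            # Frames to be predicted are diffused toward Gaussian noise.
            noisy.append(
                [signal * value + sigma * rng.gauss(0.0, 1.0) for value in frame]
            )
    return noisy


# Toy sequence: 5 frames with 4 latent dimensions each (made-up values).
latents = [[0.1 * i + j for j in range(4)] for i in range(5)]
noisy = noise_future_frames(latents, num_context=3, alpha_bar=0.5,
                            rng=random.Random(0))
print(noisy[:3] == latents[:3])  # conditioning frames are untouched
```

In a setup like this, a denoising network would see the clean context frames alongside the noised ones at every step, so the generated future frames stay consistent with the observed past.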