Amirshayan Nasirimajd
Sequential Domain Generalisation for Egocentric Action Recognition.
Supervisors: Giuseppe Bruno Averta, Chiara Plizzari, Simone Alberto Peirone, Marco Ciccone. Politecnico di Torino, Master's degree programme in Data Science and Engineering, 2024
Full text (thesis): PDF (Tesi_di_laurea), 30 MB. Licence: Creative Commons Attribution Non-commercial No Derivatives.
Abstract: Thanks to the widespread popularity and accessibility of wearable devices, a substantial volume of egocentric (first-person) video data has become readily available, driving growing research interest in egocentric vision understanding. The field holds significant potential in several areas, especially robotics and the analysis of human behaviour: insights into human behaviour from an egocentric perspective can help roboticists build robots with more human-like visual capabilities and a deeper comprehension of their surroundings. One of the main applications of egocentric vision is recognising the activities carried out by the camera wearer. A key limitation when deploying action recognition models in real-world scenarios, however, is that visual appearance data such as RGB inputs vary considerably under data distributions different from the training set, which inevitably degrades model performance. This issue is commonly known as domain shift, and extensive efforts have been made to increase model robustness across diverse domains (domain generalisation). To tackle this problem, we present an action recognition method that relies on the temporal context of actions; the rationale is that sequences of actions do not depend on the layout or appearance of the environment, so modelling them yields a more generalised model and mitigates the adverse impact of domain shift. In this thesis, we present Sequential Domain Generalisation (SeqDG), a reconstruction-based architecture that improves the generalisation of action recognition models by means of a language model and a dual encoder-decoder that refines the feature representation. The model is trained with a visual-text sequence reconstruction objective (SeqRec) that uses contextual information from both the text and visual modalities to reconstruct the central action of a sequence. Furthermore, we introduce SeqMix, a technique that mixes actions sharing the same label but coming from different domains, making the model more robust to visual changes. We evaluate the approach's effectiveness for domain generalisation on the EPIC-KITCHENS dataset: the model is trained on one set of environments and tested on new, unseen environments. Extensive experiments show that our method improves on the baseline by up to +2.4%, suggesting that the proposed method strengthens model robustness and generalisation under domain shift.
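
To make the dual encoder-decoder and the SeqRec objective more concrete, here is a minimal, illustrative PyTorch sketch. It assumes (the abstract does not state this) transformer encoders over fixed-length windows of pre-extracted per-action features; all names, dimensions, and the fusion layer (`SeqRecSketch`, `mask_token`, `DIM`, `SEQ_LEN`) are hypothetical choices for illustration, not the thesis implementation.

```python
import torch
import torch.nn as nn

DIM, SEQ_LEN = 512, 5   # hypothetical feature size and sequence window
CENTER = SEQ_LEN // 2   # index of the central action to reconstruct


class SeqRecSketch(nn.Module):
    """Illustrative dual encoder-decoder: encode the visual and text
    sequences with the central action masked out, fuse the two
    contexts, and reconstruct the central visual feature."""

    def __init__(self, dim=DIM, heads=8, layers=2):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(dim))
        self.vis_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), layers)
        self.txt_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), layers)
        self.decoder = nn.Linear(2 * dim, dim)  # fuse the two modalities

    def forward(self, vis_seq, txt_seq):
        vis_seq = vis_seq.clone()
        vis_seq[:, CENTER] = self.mask_token            # hide the centre
        v = self.vis_enc(vis_seq)[:, CENTER]            # visual context
        t = self.txt_enc(txt_seq)[:, CENTER]            # textual context
        return self.decoder(torch.cat([v, t], dim=-1))  # reconstruction


# SeqRec-style training step: regress the hidden central feature.
model = SeqRecSketch()
vis = torch.randn(8, SEQ_LEN, DIM)   # pre-extracted visual features
txt = torch.randn(8, SEQ_LEN, DIM)   # narration (text) embeddings
loss = nn.functional.mse_loss(model(vis, txt), vis[:, CENTER])
```

In the thesis the textual side comes from a language model over action narrations; random tensors stand in here only to keep the sketch self-contained.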
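Similarly, a hedged sketch of the SeqMix idea: swap an action's features with those of a same-label action recorded in a different environment, so the model sees label-consistent but visually different inputs. The per-action formulation and the name `seqmix` are assumptions for illustration only.

```python
import torch


def seqmix(features, labels, domains):
    """Illustrative SeqMix: for each action, substitute the features of
    a randomly chosen action with the same label from another domain,
    when such a counterpart exists."""
    mixed = features.clone()
    for i in range(features.size(0)):
        # candidates: same action label, different recording environment
        candidates = (
            (labels == labels[i]) & (domains != domains[i])
        ).nonzero().flatten()
        if candidates.numel() > 0:
            j = candidates[torch.randint(candidates.numel(), (1,))].item()
            mixed[i] = features[j]
    return mixed


# Example: 6 actions, 512-d features, labels, and source kitchen ids.
feats = torch.randn(6, 512)
labels = torch.tensor([0, 1, 0, 1, 2, 2])
domains = torch.tensor([0, 0, 1, 1, 0, 1])
augmented = seqmix(feats, labels, domains)
```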
| Field | Value |
|---|---|
| Supervisors | Giuseppe Bruno Averta, Chiara Plizzari, Simone Alberto Peirone, Marco Ciccone |
| Academic year | 2023/24 |
| Publication type | Electronic |
| Number of pages | 69 |
| Degree programme | Master's degree programme in Data Science and Engineering |
| Degree class | LM-32 - Computer Systems Engineering |
| Collaborating company | Politecnico di Torino |
| URI | http://webthesis.biblio.polito.it/id/eprint/30812 |