Leveraging Relative-Norm-Alignment in higher norm feature space for Cross-Domain First Person Action Recognition

Riccardo Zaccone

Leveraging Relative-Norm-Alignment in higher norm feature space for Cross-Domain First Person Action Recognition.

Supervisors: Barbara Caputo, Mirco Planamente, Chiara Plizzari. Politecnico di Torino, Master's degree programme in Ingegneria Informatica (Computer Engineering), 2022


PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:	First Person Action Recognition (FPAR) has recently attracted considerable interest from the research community, mainly due to the increasing spread of wearable devices, the release of large and well-annotated datasets, and substantial investment in novel technologies such as autonomous drones and robots and self-driving systems. This setting nevertheless presents important challenges, most notably ego-motion and domain shift in feature space. Approaches in the literature often exploit multiple modalities to help mitigate these problems. However, domain shift affects each modality differently, so it is important to develop algorithms that better leverage the complementarity among modalities, making the model resilient across domains and better able to recognize actions under various domain shifts. The literature proposes domain adaptation techniques to address such problems: methods that mitigate the performance drop that occurs when a model trained on source data is used on target data that do not follow the same probability distribution. However, such techniques require some knowledge of the target distribution, an assumption that is often too strong. Domain Generalization techniques tackle this scenario but, while several methods exist for related tasks such as image classification, the literature on video domain generalization is still scarce. This work focuses on Domain Generalization in first person action recognition and proposes an approach that takes advantage of the multi-modal nature of the perceptual input.
We motivate our approach in light of recent progress in understanding the challenges of multi-modal training: jointly training multi-modal networks is harder than training their uni-modal counterparts, because different modalities overfit and generalize at different rates, so the joint optimization of the related branches is sub-optimal. To this end, we study the relative norm alignment (RNA-Net) approach and propose it as a valuable technique to leverage multi-modal correlations in the input streams and as a valid regularizer; we further propose an extension that guides the norm alignment towards higher feature norm regions. Our experiments show that RNA-Net++ effectively enhances the performance of the model it is applied to, by leveraging a learn-to-re-balance task that ensures a consensus mean feature norm among modality streams.
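As a rough illustration of what "a consensus mean feature norm among modality streams" could mean in practice, the sketch below computes a norm-alignment penalty for two streams. The abstract does not state the loss, so the specific form used here (squared deviation from 1 of the ratio between the two mean feature norms) is an assumption, and the function names and toy data are hypothetical. The extension towards higher feature norm regions is not sketched, since the abstract gives no details on it.

```python
import math


def mean_feature_norm(features):
    """Mean L2 norm over a batch of feature vectors (a list of lists of floats)."""
    return sum(math.sqrt(sum(x * x for x in f)) for f in features) / len(features)


def relative_norm_alignment_loss(feats_a, feats_b, eps=1e-8):
    """Assumed form: squared deviation from 1 of the ratio of mean feature norms.

    The value approaches zero when both streams share the same mean feature norm.
    """
    ratio = mean_feature_norm(feats_a) / (mean_feature_norm(feats_b) + eps)
    return (ratio - 1.0) ** 2


# Toy batches for two hypothetical modality streams.
feats_a = [[3.0, 4.0], [6.0, 8.0]]  # norms 5 and 10, mean 7.5
feats_b = [[0.0, 2.5], [5.0, 0.0]]  # norms 2.5 and 5, mean 3.75

loss_unbalanced = relative_norm_alignment_loss(feats_a, feats_b)  # ratio close to 2
loss_balanced = relative_norm_alignment_loss(feats_a, feats_a)    # ratio close to 1
```

In a training loop, a term of this kind would typically be added to the classification loss with a weighting coefficient, so that it acts as a regularizer on the feature extractors of the individual streams.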
Supervisors:	Barbara Caputo, Mirco Planamente, Chiara Plizzari
Academic year:	2021/22
Publication type:	Electronic
Number of pages:	118
Subjects:
Degree programme:	Master's degree programme in Ingegneria Informatica (Computer Engineering)
Degree class:	New system > Master's degree > LM-32 - INGEGNERIA INFORMATICA
Collaborating companies:	Politecnico di Torino
URI:	http://webthesis.biblio.polito.it/id/eprint/23434
