EGO-T^3: Test Time Training for Egocentric videos

Simone Alberto Peirone

Supervisors: Barbara Caputo, Mirco Planamente, Chiara Plizzari. Politecnico di Torino, Master's degree programme in Computer Engineering (Ingegneria Informatica), 2022

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:	In the last few years, advances in wearable cameras have led to growing interest in egocentric (first-person) vision. Capturing activities from the user's perspective allows a more in-depth study of human behavior than the third-person setting, because the sensors are much closer to the actions and embed a natural form of attention that stems from the direction of the human gaze. The research community has benefited greatly from egocentric vision across a variety of tasks, such as human-object interaction, action prediction and anticipation, wearer pose estimation, and video anonymization. A crucial aspect of several video-related tasks is their multimodal nature: audio, RGB, and optical flow provide complementary insights that are critical to a thorough understanding of the real world. On the other hand, continuous head movement, variations in lighting conditions, and differences in the way humans complete the same task are sources of bias that tie the model's predictions more tightly to the training domain, reducing its ability to generalize to unseen environments. Several Domain Adaptation (DA) techniques have been proposed to make models more robust. Among these, Unsupervised Domain Adaptation (UDA) combines labeled source data and unlabeled target data to reduce the distance between the features extracted from different domains. However, real-world applications require more flexibility, as target samples are often scarce, unrepresentative, or even private, which limits the applicability of UDA. Test Time Training (TTT) appears to be a viable solution to these issues: domain adaptation is performed directly at test time, under the simple assumption that input samples provide clues about the actual distribution of the target domain that can be used to improve predictions.
With TTT, models undergo multiple adaptation steps at test time by minimizing an adaptation loss on target data and updating normalization statistics. This work provides, for the first time, a comparative analysis of multiple adaptation techniques on the EPIC-KITCHENS dataset. Particular attention is given to their dependence on batch normalization layers and to the impact of repeated adaptation steps, two critical concerns for real-time and power-constrained applications. Experiments show accuracy gains of up to 3.6% (absolute) over several baselines across a variety of settings, suggesting that TTT effectively improves model performance in dynamic environments.
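The adaptation loop described in the abstract (update normalization statistics on target data, then take a few gradient steps on an adaptation loss) can be illustrated with a toy example. The sketch below is not the thesis implementation: it assumes prediction entropy as the adaptation loss, a fixed linear classifier `W`, and a single normalization layer with scale `gamma` and shift `beta`. All names, the `momentum` blending rule, and the toy data are hypothetical choices made for illustration only.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def entropy(p):
    return -sum(v * math.log(v) for v in p if v > 0.0)


def batch_stats(batch):
    # Per-feature mean and (biased) variance of a batch of feature vectors.
    n, d = len(batch), len(batch[0])
    mean = [sum(x[k] for x in batch) / n for k in range(d)]
    var = [sum((x[k] - mean[k]) ** 2 for x in batch) / n for k in range(d)]
    return mean, var


def normalize(x, mean, var, eps=1e-5):
    return [(x[k] - mean[k]) / math.sqrt(var[k] + eps) for k in range(len(x))]


def logits(W, gamma, beta, xhat):
    # Affine normalization output followed by a fixed linear classifier.
    h = [g * v + b for g, v, b in zip(gamma, xhat, beta)]
    return [sum(w_k * h_k for w_k, h_k in zip(row, h)) for row in W]


def mean_entropy(W, gamma, beta, xhats):
    total = sum(entropy(softmax(logits(W, gamma, beta, xh))) for xh in xhats)
    return total / len(xhats)


def adapt(W, gamma, beta, run_mean, run_var, batch,
          steps=5, lr=0.1, momentum=0.5):
    n, d = len(batch), len(batch[0])
    # Step 1: blend the stored (source) statistics with the target batch.
    mean, var = batch_stats(batch)
    run_mean = [(1 - momentum) * r + momentum * m for r, m in zip(run_mean, mean)]
    run_var = [(1 - momentum) * r + momentum * v for r, v in zip(run_var, var)]
    xhats = [normalize(x, run_mean, run_var) for x in batch]
    # Step 2: gradient descent on the mean prediction entropy,
    # updating only gamma and beta (the classifier W stays frozen).
    gamma, beta = list(gamma), list(beta)
    for _ in range(steps):
        g_gamma, g_beta = [0.0] * d, [0.0] * d
        for xh in xhats:
            p = softmax(logits(W, gamma, beta, xh))
            h = entropy(p)
            # Analytic gradient of the entropy w.r.t. each logit.
            dz = [-pj * (math.log(max(pj, 1e-12)) + h) for pj in p]
            for k in range(d):
                back = sum(dz[j] * W[j][k] for j in range(len(W)))
                g_beta[k] += back / n
                g_gamma[k] += back * xh[k] / n
        gamma = [g - lr * gg for g, gg in zip(gamma, g_gamma)]
        beta = [b - lr * gb for b, gb in zip(beta, g_beta)]
    return gamma, beta, run_mean, run_var, xhats


# Toy usage: two features, two classes, a target batch shifted away
# from the assumed source statistics (mean 0, variance 1).
W = [[1.0, 0.0], [0.0, 1.0]]
gamma0, beta0 = [1.0, 1.0], [0.0, 0.0]
target_batch = [[2.0, 1.0], [3.0, 1.5], [1.0, 2.5], [2.5, 0.5]]
gamma, beta, run_mean, run_var, xhats = adapt(
    W, gamma0, beta0, [0.0, 0.0], [1.0, 1.0], target_batch)
before = mean_entropy(W, gamma0, beta0, xhats)
after = mean_entropy(W, gamma, beta, xhats)
print("mean entropy before/after adaptation:", before, after)
```

Restricting the update to the normalization parameters keeps each step cheap, which relates to the abstract's concern about repeated adaptation steps in real-time and power-constrained settings. Whether the techniques compared in the thesis use this loss or this parameter subset is not stated in the abstract.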
Supervisors:	Barbara Caputo, Mirco Planamente, Chiara Plizzari
Academic year:	2022/23
Publication type:	Electronic
Number of pages:	92
Subjects:
Degree programme:	Master's degree programme in Computer Engineering (Ingegneria Informatica)
Degree class:	New system > Master's degree > LM-32 - COMPUTER ENGINEERING
Collaborating companies:	Not specified
URI:	http://webthesis.biblio.polito.it/id/eprint/24483
