Giuseppe Atanasio
HIMU-MAE: Exploiting Head-mounted Inertial Measurement Unit with Masked Autoencoders for Egocentric Vision.
Supervisors: Giuseppe Bruno Averta, Gabriele Goletto, Simone Alberto Peirone. Politecnico di Torino, Master's degree programme in Ingegneria Informatica (Computer Engineering), 2024
Restricted access: staff only until 13 December 2025 (embargo date). License: Creative Commons Attribution-NonCommercial-NoDerivatives.
Abstract:
Deep learning models have become central to computer vision, excelling across a variety of tasks when trained on large labeled datasets. However, supervised training has scalability issues, since gathering quality labeled data is a costly and time-intensive process. Self-Supervised Learning (SSL) methods offer a viable alternative to this paradigm, as they enable models to learn directly from input data without task-specific labels. This approach produces general representations that can be reused across diverse tasks and domains, including those with limited annotations. This study focuses on egocentric vision, a field aimed at capturing user actions and interactions with the environment from a first-person perspective. In this context, different sensors are typically adopted to capture human activity from different perspectives. Certain modalities, such as RGB, capture environmental context but may miss fine-grained motion details outside the camera's field of view. In contrast, motion-based sensors focus on the wearer's movements, such as head and limb motion, complementing video data while offering low energy consumption. Thus, the two can be combined in a typical multimodal setting. Several methods have been developed to obtain robust self-supervised representations from visual data. One of them is the Masked AutoEncoder (MAE), which masks part of the input and learns to reconstruct the masked portion from the visible one. However, applying SSL to other modalities, specifically motion-based ones, remains underexplored. In this work, we underscore the potential of SSL on low-power, motion-focused IMU data, highlighting its benefits for downstream tasks in egocentric vision. To achieve this, we use the large-scale Ego-Exo4D dataset without labels to generate MAE representations for IMU data. Specifically, we enhance the reconstruction process by assisting it with visual data, resulting in a more effective representation.
We demonstrate superior performance in two distinct settings: first, action recognition on a subset of Ego-Exo4D data, where a single IMU sensor is mounted on the head of the camera wearer; and second, action localization on the WEAR dataset, where four IMU devices are mounted on the limbs.
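The MAE pretraining scheme described in the abstract — masking part of the input and reconstructing the masked portion from the visible one — can be sketched for IMU windows as follows. This is a toy NumPy illustration, not the thesis code: the window length, the 6 channels (3-axis accelerometer plus gyroscope), and the 75% mask ratio are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_imu_window(x, mask_ratio=0.75, rng=rng):
    """Randomly mask timesteps of an IMU window (T, C), MAE-style.

    Returns the visible timesteps, a boolean mask (True = masked),
    and the indices of the kept timesteps.
    """
    T = x.shape[0]
    n_keep = int(T * (1 - mask_ratio))
    perm = rng.permutation(T)
    keep_idx = np.sort(perm[:n_keep])
    mask = np.ones(T, dtype=bool)
    mask[keep_idx] = False
    return x[keep_idx], mask, keep_idx

def masked_mse(pred, target, mask):
    """MAE reconstruction loss, computed only on masked timesteps."""
    return float(((pred - target) ** 2)[mask].mean())

# Toy 2-second window at 100 Hz with 6 channels (accel + gyro axes).
x = rng.standard_normal((200, 6))
visible, mask, keep_idx = mask_imu_window(x)

# Placeholder "decoder": copy visible timesteps, predict zeros elsewhere.
# A real MAE would replace this with an encoder-decoder network.
pred = np.zeros_like(x)
pred[keep_idx] = visible
loss = masked_mse(pred, x, mask)
```

During training, the loss is minimized only over the masked positions, which forces the model to infer missing motion from the visible context rather than simply copying its input.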
Supervisors: Giuseppe Bruno Averta, Gabriele Goletto, Simone Alberto Peirone
Academic year: 2024/25
Publication type: Electronic
Number of pages: 93
Degree programme: Master's degree programme in Ingegneria Informatica (Computer Engineering)
Degree class: New system > Master's degree > LM-32 - Computer Engineering
Collaborating companies: NOT SPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/33948