
Towards Temporal Consistency in Egocentric Object Detection for Open-Vocabulary Navigation

Antonio De Cinque. Towards Temporal Consistency in Egocentric Object Detection for Open-Vocabulary Navigation. Supervisors: Giuseppe Bruno Averta, Claudia Cuttano, Gabriele Tiboni, Marco Ciccone. Politecnico di Torino, 2024.

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.
Download (87MB)
Abstract:

Egocentric object detection is a critical aspect of robotic navigation and interaction within dynamic and complex home environments. The primary objective of this research is to explore the challenges and solutions associated with achieving temporal consistency in egocentric object detection, particularly in the scope of the Open-Vocabulary Mobile Manipulation (OVMM) challenge. This is contextualized within the HomeRobot 3D simulation environment, where a robot (Hello Robot Stretch) is tasked with navigating a household and bringing an object from one place to another. The perception module of the robot is enabled by open-vocabulary object detection models, such as DETIC (Detecting Twenty-thousand Classes using Image-level Supervision). These models have shown promise in recognizing a wide range of objects given an arbitrary text prompt. However, their performance is often hindered by the egocentric view and the lack of temporal coherence, leading to "noisy" predictions and inconsistencies across consecutive frames. This problem arises because the model processes each frame in isolation, producing predictions that lack continuity and coherence across time. To address this, we investigate the integration of Spatio-Temporal Adapters (ST-Adapters) within the DETIC model, aiming to enhance the model's ability to maintain temporal consistency without compromising its open-vocabulary capabilities. We highlight the limitations of DETIC in maintaining temporal consistency, particularly in the detection of small or partially occluded objects. By incorporating ST-Adapters, we investigate an approach that equips the model with a spatio-temporal inductive bias, allowing for more coherent and reliable object detections over time. The HomeRobot simulation environment leverages a subset of the Habitat Synthetic Scenes Dataset (HSSD), featuring high-quality 3D scenes. For our analyses, we extracted frames from exploration videos, complete with object ground-truth annotations. To assess the performance of the models, we conduct evaluations using two distinct test sets: one in-domain, consisting of scenes similar to those encountered during the training phase, and another out-of-domain, comprising household scenes not seen in training. The hypothesis under investigation is that an improvement in per-frame detections translates into enhanced temporal consistency in the 3D simulation environment.
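To make the ST-Adapter idea concrete, the following is a minimal sketch of a spatio-temporal adapter block in the spirit of the ST-Adapter literature (a bottleneck with a depthwise 3D convolution over a short clip of per-frame features). The layer sizes, tensor layout, and the point at which such a block would be inserted into DETIC's backbone are illustrative assumptions, not the thesis' exact configuration.

```python
# Illustrative ST-Adapter block: channel down-projection, depthwise 3D convolution
# across the temporal axis, activation, channel up-projection, residual connection.
# Sizes and layout are assumptions for the sketch, not the thesis' actual setup.
import torch
import torch.nn as nn


class STAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 96, kernel_size=(3, 1, 1)):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)            # channel down-projection
        self.dwconv = nn.Conv3d(
            bottleneck, bottleneck, kernel_size,
            padding=tuple(k // 2 for k in kernel_size),
            groups=bottleneck,                            # depthwise: one filter per channel
        )
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)              # channel up-projection
        nn.init.zeros_(self.up.weight)                    # start as identity: residual branch = 0
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, H, W, C) — a short clip of per-frame backbone feature maps.
        residual = x
        x = self.down(x)                                  # (B, T, H, W, b)
        x = x.permute(0, 4, 1, 2, 3)                      # (B, b, T, H, W) for Conv3d
        x = self.dwconv(x)                                # mixes information across neighbouring frames
        x = x.permute(0, 2, 3, 4, 1)                      # back to (B, T, H, W, b)
        x = self.up(self.act(x))
        return residual + x                               # residual keeps the frozen backbone's output


if __name__ == "__main__":
    clip = torch.randn(2, 4, 14, 14, 768)                 # 2 clips, 4 frames, 14x14 tokens, dim 768
    out = STAdapter(dim=768)(clip)
    print(out.shape)                                       # torch.Size([2, 4, 14, 14, 768])
```

Zero-initializing the up-projection makes the adapter an identity mapping at the start of training, so the frozen, pretrained detector behaves exactly as before and its open-vocabulary capability is preserved while the temporal mixing is learned gradually; this is a common adapter design choice and matches the constraint stated in the abstract.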

Supervisors: Giuseppe Bruno Averta, Claudia Cuttano, Gabriele Tiboni, Marco Ciccone
Academic year: 2023/24
Publication type: Electronic
Number of pages: 94
Subjects:
Degree programme: Not specified
Degree class: New regulations > Master's degree > LM-32 - INGEGNERIA INFORMATICA (Computer Engineering)
Collaborating companies: Not specified
URI: http://webthesis.biblio.polito.it/id/eprint/31030