
Temporal tracking for identifying objects in smart bins using multiple visual architectures

Matteo Gravile

Supervisors: Bartolomeo Montrucchio, Antonio Costantino Marceddu. Politecnico di Torino, Master's degree programme in Ingegneria Informatica (Computer Engineering), 2025

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.

Download (16MB)
Abstract:

This thesis addresses the problem of visually tracking objects within images captured by smart bins to determine whether an object is new or has already been observed. Unlike traditional object detection, where each instance is recognized independently, this work focuses on the temporal dimension and on the system's ability to retain a memory of previously seen objects, thus improving continuous and automated waste monitoring.

The dataset used, consisting of approximately 7,000 images annotated in Common Objects in Context (COCO) format, includes bounding boxes, segmentation masks, object categories (about 60 classes), and a custom "new" attribute indicating whether an object is new with respect to the temporal context. This attribute takes the value "yes" for newly appearing objects and "no" for those already seen.

After an initial phase of annotation refinement and cleanup, three experimental architectures were designed and compared to predict the "new" label. The first architecture, used as a baseline, combines a ResNet50 visual feature extractor with a supervised Multilayer Perceptron (MLP) classifier that receives as input a set of geometric and similarity-based metrics (cosine similarity, Intersection over Union, centroid distance, area ratio) computed between the current object and those stored in memory. The second approach is a Memory-Augmented Network (MAN) that integrates a learnable Transformer module to maintain a dynamic memory of previously observed objects: the current object is processed in relation to this memory to produce a contextualized representation, which an MLP then classifies. Finally, a Siamese network was developed and trained on both pairs and triplets of objects to learn discriminative visual embeddings via supervised losses (Contrastive Loss and Triplet Loss). During inference, the similarity between the current object and those in memory is combined with geometric features and fed to an MLP for final classification.

The results, evaluated using standard metrics such as accuracy, precision, recall, F1-score, and Area Under the Curve (AUC), highlight the strengths and limitations of each method. The best-performing configuration was the Siamese network trained with triplet loss and followed by a supervised MLP, achieving 0.787 accuracy, 0.747 precision, 0.906 recall, 0.818 F1-score, and 0.85 AUC. It was followed by the Memory-Augmented Network with MLP (F1 = 0.781) and the baseline ResNet50 + MLP (F1 = 0.761).

These results demonstrate the effectiveness of advanced architectures in recognizing previously seen objects, even in visually complex and dynamic environments. Overall, this work contributes to the development of intelligent waste management systems capable of incorporating temporal reasoning into visual recognition, addressing real-world challenges with robustness and adaptability. The proposed solutions, based on supervised learning and memory-aware models, represent a promising step toward adaptive and continuous object recognition in urban and environmental scenarios.
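To make the baseline's input concrete, the sketch below computes the four comparison metrics named in the abstract for one (current object, memory object) pair. It is a minimal illustration, not code from the thesis: the box format ([x1, y1, x2, y2] in pixels), the source of the embeddings, and the function name pairwise_features are all assumptions.

```python
# Minimal sketch of the baseline's hand-crafted comparison features.
# Assumptions (not from the thesis): boxes are [x1, y1, x2, y2] in pixels,
# and embeddings are pre-extracted ResNet50 feature vectors.
import numpy as np

def pairwise_features(emb_cur, box_cur, emb_mem, box_mem):
    """Return the four metrics fed to the MLP for one (current, memory) pair."""
    # Cosine similarity between the two visual embeddings.
    cos = float(np.dot(emb_cur, emb_mem) /
                (np.linalg.norm(emb_cur) * np.linalg.norm(emb_mem) + 1e-8))

    # Intersection over Union of the two bounding boxes.
    x1 = max(box_cur[0], box_mem[0]); y1 = max(box_cur[1], box_mem[1])
    x2 = min(box_cur[2], box_mem[2]); y2 = min(box_cur[3], box_mem[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_cur = (box_cur[2] - box_cur[0]) * (box_cur[3] - box_cur[1])
    area_mem = (box_mem[2] - box_mem[0]) * (box_mem[3] - box_mem[1])
    iou = inter / (area_cur + area_mem - inter + 1e-8)

    # Euclidean distance between the box centroids.
    c_cur = ((box_cur[0] + box_cur[2]) / 2, (box_cur[1] + box_cur[3]) / 2)
    c_mem = ((box_mem[0] + box_mem[2]) / 2, (box_mem[1] + box_mem[3]) / 2)
    centroid_dist = float(np.hypot(c_cur[0] - c_mem[0], c_cur[1] - c_mem[1]))

    # Ratio of the smaller to the larger box area.
    area_ratio = min(area_cur, area_mem) / (max(area_cur, area_mem) + 1e-8)

    return np.array([cos, iou, centroid_dist, area_ratio], dtype=np.float32)
```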
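The memory-augmented idea can likewise be sketched as attention over a bank of stored embeddings. Everything below (the use of PyTorch, the layer sizes, and the MemoryAugmentedClassifier name) is an illustrative guess at the kind of module the abstract describes, not the thesis's implementation.

```python
# Hedged sketch of a memory-augmented module: the current object's embedding
# attends over previously seen embeddings, and a small MLP head classifies
# the contextualized result as new ("yes") or already seen ("no").
import torch
import torch.nn as nn

class MemoryAugmentedClassifier(nn.Module):
    def __init__(self, dim=512, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Sequential(  # MLP head -> probability that "new" = "yes"
            nn.Linear(2 * dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, query, memory):
        # query: (B, 1, dim) current object; memory: (B, M, dim) stored objects.
        ctx, _ = self.attn(query, memory, memory)   # contextualized query
        feats = torch.cat([query, ctx], dim=-1).squeeze(1)
        return torch.sigmoid(self.head(feats))

model = MemoryAugmentedClassifier()
p_new = model(torch.randn(2, 1, 512), torch.randn(2, 8, 512))
print(p_new.shape)  # torch.Size([2, 1])
```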
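Finally, a hedged sketch of how the best-performing configuration could be trained: a Siamese ResNet50 encoder optimized with a triplet margin loss so that crops of the same physical object embed close together while different objects are pushed apart. The backbone choice, margin, learning rate, and helper names are assumptions rather than the thesis's exact setup.

```python
# Illustrative triplet-loss training step for a Siamese encoder.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet50(weights=None)
backbone.fc = nn.Identity()                  # use the 2048-d pooled features
criterion = nn.TripletMarginLoss(margin=1.0)
opt = torch.optim.Adam(backbone.parameters(), lr=1e-4)

def training_step(anchor, positive, negative):
    """anchor/positive: crops of the same object; negative: a different object."""
    za, zp, zn = backbone(anchor), backbone(positive), backbone(negative)
    loss = criterion(za, zp, zn)   # pull (a, p) together, push (a, n) apart
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Dummy batch of four 224x224 RGB crops, just to show the expected shapes.
x = lambda: torch.randn(4, 3, 224, 224)
print(training_step(x(), x(), x()))
```

At inference, the learned embedding similarity would be combined with geometric features such as those in the first sketch and passed to an MLP for the final "new"/"seen" decision, as the abstract describes.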

Supervisors: Bartolomeo Montrucchio, Antonio Costantino Marceddu
Academic year: 2024/25
Publication type: Electronic
Number of pages: 107
Subjects:
Degree programme: Master's degree programme in Ingegneria Informatica (Computer Engineering)
Degree class: New regulations > Master's degree > LM-32 - INGEGNERIA INFORMATICA
Collaborating companies: RE LEARN S.R.L.
URI: http://webthesis.biblio.polito.it/id/eprint/36423