Adapting Vision-Language Models for Open-Vocabulary Object Detection through Prompt Learning

Claudio Macaluso

Supervisors: Barbara Caputo, Fabio Cermelli, Gabriele Rosi. Politecnico di Torino, Master's degree programme in Computer Engineering (Ingegneria Informatica), 2025

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract

In recent years, foundational vision-language models have opened new opportunities for addressing open-vocabulary object detection, with applications such as automatic image annotation. However, despite their generalization ability, these models often lack the specialization required to adapt efficiently to novel datasets or domains, especially in low-data regimes. This thesis investigates the use of prompt learning techniques, originally developed in the natural language processing field, to enhance the adaptability of vision-language models for object detection. By leveraging the intrinsic fusion of text and visual modalities in these architectures, we extend current baselines with prompt-based methods and evaluate their performance in few-shot setups. The results demonstrate that the proposed approaches consistently outperform the baseline method, showing the potential of prompt learning to specialize foundational models for the task of image annotation with minimal supervision.