Mitigating the Modality Gap in Vision-Language Pre-Trained Models

Iman Morovatian

Mitigating the Modality Gap in Vision-Language Pre-Trained Models.

Supervisor: Giuseppe Rizzo. Politecnico di Torino, Master's degree programme in Data Science And Engineering, 2025

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract

One of the key challenges in vision-language models is the modality gap: the misalignment between image and text embeddings in a shared latent space, which arises from inherent differences between the two modalities. This gap hinders tasks that rely on tight integration of visual and textual information, such as image-text retrieval, caption generation, and cross-modal understanding. While previous research has explored the causes of the modality gap and its effects on various downstream tasks, comprehensive studies of how model architecture influences it remain limited. This thesis investigates how model architecture contributes to the modality gap, with a particular focus on shared-encoder architectures, where both images and text are processed by the same encoder network.
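To make the notion of a gap concrete, one commonly used proxy is the Euclidean distance between the centroids of the L2-normalized image embeddings and the L2-normalized text embeddings. The sketch below illustrates that proxy only; it is an assumption for illustration, not necessarily the measure adopted in the thesis, and the function names and toy vectors are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def centroid(vectors):
    """Return the component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def modality_gap(image_embeddings, text_embeddings):
    """Distance between the centroids of the two normalized embedding sets.

    Illustrative proxy only (assumed, not taken from the thesis).
    Inputs are lists of non-zero vectors that share one dimensionality.
    """
    image_center = centroid([l2_normalize(v) for v in image_embeddings])
    text_center = centroid([l2_normalize(v) for v in text_embeddings])
    return math.dist(image_center, text_center)


# Toy 2-D embeddings: images cluster along one axis, texts along the other.
toy_images = [[2.0, 0.0], [5.0, 0.0]]
toy_texts = [[0.0, 3.0], [0.0, 1.0]]
gap = modality_gap(toy_images, toy_texts)
```

A value of zero means the two centroids coincide; larger values indicate that the image and text embeddings occupy separate regions of the shared space.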

Shared-encoder models offer potential benefits in terms of efficiency and parameter sharing, but they also introduce challenges related to modality-specific representations