Mitigating the Modality Gap in Vision-Language Pre-Trained Models

Iman Morovatian

Supervisor: Giuseppe Rizzo. Politecnico di Torino, Master's degree programme in Data Science And Engineering, 2025

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:	A key challenge in vision-language models is the modality gap: image and text embeddings remain misaligned when projected into a shared latent space because of inherent differences between the two modalities. This gap hampers tasks that rely on integrating visual and textual information, such as image-text retrieval, caption generation, and cross-modal understanding. Previous research has explored the causes of the modality gap and its effects on various downstream tasks, but comprehensive studies of how model architecture influences the gap remain limited. This thesis investigates the role of model architecture in the modality gap, focusing on shared-encoder architectures, in which images and text are processed by the same encoder network. Shared-encoder models offer potential benefits in efficiency and parameter sharing, but they also introduce challenges related to modality-specific representations. Building on prior work, the thesis proposes a novel method to mitigate the modality gap within the shared-encoder architecture. The approach integrates specific loss functions and fine-tuning strategies designed to encourage better alignment between visual and textual embeddings. Extensive experiments evaluate the method, demonstrating that it reduces the modality gap and improves performance on two critical downstream tasks: image-text retrieval and vector arithmetic-based operations. The thesis also compares the shared-encoder architecture with the more traditional dual-encoder architecture, highlighting the strengths and limitations of each in terms of modality alignment, computational efficiency, and downstream task performance. The findings contribute to a deeper understanding of the modality gap in vision-language models and offer insights into architectural choices and training strategies that can enhance cross-modal learning.
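
As a rough illustration of the modality gap described in the abstract, the sketch below computes one commonly used measure: the Euclidean distance between the centroids of L2-normalized image and text embeddings. The choice of metric, the function names, and the synthetic toy embeddings are all assumptions made for illustration; the thesis may define and measure the gap differently.

```python
import math
import random


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def modality_gap(image_embs, text_embs):
    """Distance between the centroids of the normalized embeddings.

    Assumed metric for illustration, not necessarily the one used in the thesis.
    """
    image_center = centroid([l2_normalize(v) for v in image_embs])
    text_center = centroid([l2_normalize(v) for v in text_embs])
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(image_center, text_center)))


# Synthetic toy data: each "modality" is a noisy cloud around its own direction.
rng = random.Random(0)
DIM = 8


def toy_embeddings(center, n=200, noise=0.1):
    return [[c + rng.gauss(0.0, noise) for c in center] for _ in range(n)]


image_center_dir = [1.0] + [0.0] * (DIM - 1)
text_center_dir = [0.0, 1.0] + [0.0] * (DIM - 2)

images = toy_embeddings(image_center_dir)
texts_far = toy_embeddings(text_center_dir)    # text cloud in a different region
texts_near = toy_embeddings(image_center_dir)  # text cloud overlapping the images

gap_separated = modality_gap(images, texts_far)
gap_aligned = modality_gap(images, texts_near)

print(f"gap with separated modalities: {gap_separated:.3f}")
print(f"gap with aligned modalities:   {gap_aligned:.3f}")
```

In this toy setting the separated clouds give a much larger value than the overlapping ones, which is the behaviour a mitigation method would aim to move toward on real image and text embeddings.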
Supervisors:	Giuseppe Rizzo
Academic year:	2024/25
Publication type:	Electronic
Number of pages:	59
Subjects:
Degree programme:	Master's degree programme in Data Science And Engineering
Degree class:	New system > Master's degree > LM-32 - Computer Engineering
Collaborating companies:	FONDAZIONE LINKS
URI:	http://webthesis.biblio.polito.it/id/eprint/36865
