
Multimodal Arithmetic for Zero-Shot Composed Image Retrieval: A Contrastive Post-Pre-Training approach of Vision-Language Models

Marco Magnanini

Supervisors: Giuseppe Rizzo, Federico D'Asaro, Luca Catalano. Politecnico di Torino, Master's degree programme in Data Science and Engineering, 2025

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:

Vision–Language Models (VLMs) have emerged as powerful general-purpose models, capable of transferring to a wide range of downstream tasks in a zero-shot manner. These models are typically trained with contrastive objectives on large-scale image–text datasets, aligning images and text into a shared embedding space. Although effective for many applications, tasks such as Composed Image Retrieval (CIR), which consists of retrieving a target image given a reference image and a natural-language modification, pose unique challenges. Classical CIR approaches rely on curated triplet datasets (reference image, modification text, target image), which are difficult to scale and limited in diversity. This work introduces Multimodal Arithmetic Loss (MA-Loss), a training objective that learns compositional reasoning directly from readily available image–text pairs, eliminating reliance on costly triplet supervision. Unlike triplets, which require manual curation and annotation, image–text pairs can be collected at scale from the web, making them a practical foundation for large and diverse datasets. MA-Loss models semantic differences as structured transformations in a shared embedding space, aligning textual modifications with the corresponding visual changes. This formulation enables CIR in a zero-shot setting while scaling naturally to heterogeneous web-sourced data. To ground the design of MA-Loss, we conduct a systematic study of multimodal arithmetic on the SIMAT benchmark, analyzing the relationship between embedding-space geometry (e.g., modality gap, alignment, uniformity) and compositional reasoning ability. Experiments show that a CLIP model post-pre-trained on MSCOCO with the MA-Loss objective achieves a new state of the art on SIMAT with a score of 46%, surpassing the previous best of 42%. Applying MA-Loss to CIR in a zero-shot setting, we evaluate on the FashionIQ and CIRR benchmarks. Despite using a relatively small dataset for post-pre-training, our method achieves results comparable to similar state-of-the-art pair-based approaches, while outperforming others on both benchmarks. These findings suggest that modeling semantic differences, rather than absolute representations, offers a scalable and effective alternative for compositional retrieval tasks.
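The multimodal arithmetic the abstract refers to can be sketched in a minimal, library-free form: compose a query by shifting the reference image embedding along the direction of the modification text, then retrieve the gallery image with the highest cosine similarity. This is an illustrative sketch only, not the thesis's actual MA-Loss implementation; the scaling factor `lam` and the toy vectors are hypothetical placeholders, and in practice the embeddings would come from a CLIP-style image and text encoder.

```python
import math

def l2_normalize(v):
    # Project a vector onto the unit hypersphere, as CLIP-style
    # embeddings typically are before similarity comparison.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def compose_query(ref_image_emb, mod_text_emb, lam=1.0):
    # Arithmetic composition: move the reference image embedding
    # along the direction encoded by the modification text.
    # `lam` (assumed here) would control the strength of the edit.
    return l2_normalize([r + lam * t for r, t in zip(ref_image_emb, mod_text_emb)])

def retrieve(query_emb, gallery_embs):
    # Rank gallery images by cosine similarity to the composed query
    # and return the index of the best match (zero-shot retrieval).
    def cos(a, b):
        a, b = l2_normalize(a), l2_normalize(b)
        return sum(x * y for x, y in zip(a, b))
    return max(range(len(gallery_embs)), key=lambda i: cos(query_emb, gallery_embs[i]))

# Toy example: a query shifted toward a second axis retrieves the
# gallery item lying between the two directions.
query = compose_query([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
best = retrieve(query, [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]])
```

The point of benchmarks such as SIMAT is precisely to test whether such additive edits in the shared embedding space land near the correct target image, which is why the thesis studies how embedding-space geometry (modality gap, alignment, uniformity) affects this behavior.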

Supervisors: Giuseppe Rizzo, Federico D'Asaro, Luca Catalano
Academic year: 2025/26
Publication type: Electronic
Number of pages: 91
Subjects:
Degree programme: Master's degree in Data Science and Engineering
Degree class: New regulations > Master's degree > LM-32 - COMPUTER ENGINEERING
Partner companies: FONDAZIONE LINKS
URI: http://webthesis.biblio.polito.it/id/eprint/38773