Multimodal Arithmetic for Zero-Shot Composed Image Retrieval: A Contrastive Post-Pre-Training approach of Vision-Language Models

Marco Magnanini

Supervisors: Giuseppe Rizzo, Federico D'Asaro, Luca Catalano. Politecnico di Torino, Master's degree programme in Data Science and Engineering, 2025

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract

Vision–Language Models (VLMs) have emerged as powerful general-purpose models that transfer to a wide range of downstream tasks in a zero-shot manner. These models are typically trained with contrastive objectives on large-scale image-text datasets, aligning images and text in a shared embedding space. Although this alignment is effective for many applications, some tasks pose unique challenges. One of them is Composed Image Retrieval (CIR), which consists of retrieving a target image given a reference image and a natural language modification. Classical CIR approaches rely on curated triplet datasets (reference image, modification text, target image), which are difficult to scale and limited in diversity. This work introduces Multimodal Arithmetic Loss (MA-Loss), a training objective that learns compositional reasoning directly from readily available image-text pairs, eliminating reliance on costly triplet supervision.
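To make the contrastive objective mentioned above concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive loss over a batch of paired image and text embeddings. It is not the MA-Loss, whose formulation is not given in this abstract. The function names, the toy two-dimensional embeddings, and the temperature value of 0.07 are assumptions chosen for illustration only.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Hypothetical helper: image_embs[i] and text_embs[i] are a matched pair.

    The temperature is an assumed illustrative value, not taken from the thesis.
    """
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)

    # logits[i][j]: scaled cosine similarity between image i and text j.
    logits = [
        [sum(a * b for a, b in zip(images[i], texts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    # Image-to-text direction: image i should select text i among all texts.
    image_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image direction: text j should select image j among all images.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_image = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return (image_to_text + text_to_image) / 2


# Toy batch: two orthogonal embeddings, first paired correctly, then mismatched.
pairs = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_contrastive_loss(pairs, pairs)
shuffled = symmetric_contrastive_loss(pairs, pairs[::-1])
print(aligned < shuffled)
```

In this toy batch the correctly paired embeddings give a loss near zero, while the mismatched pairing gives a much larger loss, which is the pressure that pulls matching images and texts together in the shared space.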

Unlike triplets, which require manual curation and annotation, image–text pairs can be collected at scale from the web, making them a practical foundation for large and diverse datasets.