
Visual Context Meets Translation: A CycleGAN Approach to Multimodal Neural Machine Translation

Ahmad Sidani

Ahmad Sidani. Visual Context Meets Translation: A CycleGAN Approach to Multimodal Neural Machine Translation.

Advisors: Luca Cagliero, Giuseppe Gallipoli. Politecnico di Torino, Master's degree programme in Data Science and Engineering, 2025

PDF (Tesi_di_laurea) - Thesis (2MB)
Restricted access: staff users only until 25 July 2026 (embargo date).
Licence: Creative Commons Attribution Non-commercial No Derivatives.
Abstract:

Neural Machine Translation (NMT) has made remarkable progress with the Transformer architecture, but it still struggles with ambiguous or context-dependent language that could benefit from visual context. Multimodal Machine Translation (MMT) addresses this by incorporating information from images into the translation process to improve accuracy, especially for image-related or ambiguous content. However, most MMT approaches rely on multimodal parallel corpora, which are scarce for many language pairs. This thesis introduces a CycleGAN-based multimodal translation architecture that can be trained without direct sentence-pair annotations by using images as a pivot between languages. The model builds on visual-semantic representations from CLIPTrans, so that source- and target-language representations are aligned in a common vision-language feature space. Training relies on a cycle-consistency objective: the system produces a translation and then translates the result back into the source language to reconstruct the input sentence, enforcing semantic consistency without access to ground-truth translations. The architecture supports both unsupervised and hybrid training procedures, so monolingual image-caption datasets in two languages (optionally supplemented with a small parallel dataset) can be exploited to train high-quality translation models. A key novelty is the integration of large-scale pre-trained vision-language models (VLMs) such as CLIP and its multilingual variant M-CLIP. Through prefix-tuning, these models inject visual and linguistic semantics from a strong shared embedding space into the translation model. This allows visual context to guide generation even when images are unavailable at inference time, effectively "hallucinating" grounded semantics from text alone. By leveraging the generalized visual-textual priors of VLMs, the translation system becomes more robust and less dependent on expensive triplet data. We evaluate the proposed framework on the Multi30K benchmark, demonstrating its effectiveness across different levels of supervision. In a fully supervised setting, the model achieves competitive translation quality. Notably, even in a purely unsupervised scenario with no parallel sentences, the model produces translations that substantially outperform text-only baselines, underscoring the benefit of visual grounding. Furthermore, a hybrid training scenario that combines unpaired multimodal data with a small amount of parallel data yields additional performance gains. These results show that integrating visual context through VLM-based embeddings and cycle-consistent training substantially reduces the need for parallel corpora, opening new possibilities for multimodal translation in low-resource settings.
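To make the cycle-consistency idea in the abstract concrete, the following minimal PyTorch sketch shows one unsupervised training step: a source sentence is translated, back-translated, and reconstructed, with a visual prefix derived from a CLIP-style embedding prepended to the translator's input (prefix-tuning). All module names (VisualPrefixEncoder, TinyTranslator, cycle_step), dimensions, and the toy GRU translator are illustrative assumptions for this sketch, not the thesis' actual implementation, which builds on pre-trained multilingual models and M-CLIP.

    # Illustrative sketch only: toy modules standing in for the real
    # vision-language encoder and pre-trained multilingual translator.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    VOCAB, D_MODEL, PREFIX_LEN = 1000, 64, 4

    class VisualPrefixEncoder(nn.Module):
        """Maps a (frozen) CLIP-style embedding to a sequence of prefix
        vectors prepended to the translator's input (prefix-tuning)."""
        def __init__(self, clip_dim=512):
            super().__init__()
            self.proj = nn.Linear(clip_dim, PREFIX_LEN * D_MODEL)

        def forward(self, clip_emb):                    # (B, clip_dim)
            b = clip_emb.size(0)
            return self.proj(clip_emb).view(b, PREFIX_LEN, D_MODEL)

    class TinyTranslator(nn.Module):
        """Toy stand-in for a seq2seq translation model."""
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(VOCAB, D_MODEL)
            self.core = nn.GRU(D_MODEL, D_MODEL, batch_first=True)
            self.out = nn.Linear(D_MODEL, VOCAB)

        def forward(self, tokens, prefix):              # tokens: (B, T) ids
            x = torch.cat([prefix, self.embed(tokens)], dim=1)
            h, _ = self.core(x)
            return self.out(h[:, PREFIX_LEN:])          # per-token logits

    def cycle_step(src_tokens, clip_emb, fwd, bwd, prefixer, optimizer):
        """One unsupervised step: translate src -> tgt (greedy pseudo-target),
        translate back tgt -> src, and reconstruct the original sentence."""
        prefix = prefixer(clip_emb)
        with torch.no_grad():                           # no gold translation used
            tgt_tokens = fwd(src_tokens, prefix).argmax(-1)
        recon_logits = bwd(tgt_tokens, prefix)          # back-translation
        loss = F.cross_entropy(recon_logits.reshape(-1, VOCAB),
                               src_tokens.reshape(-1))  # reconstruction loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    if __name__ == "__main__":
        fwd, bwd, prefixer = TinyTranslator(), TinyTranslator(), VisualPrefixEncoder()
        opt = torch.optim.Adam([*fwd.parameters(), *bwd.parameters(),
                                *prefixer.parameters()], lr=1e-3)
        src = torch.randint(0, VOCAB, (8, 12))          # dummy source sentences
        clip = torch.randn(8, 512)                      # dummy CLIP embeddings
        print("cycle loss:", cycle_step(src, clip, fwd, bwd, prefixer, opt))

A full system would run the cycle in both directions (source to target to source, and target to source to target) so that both translators receive gradients, and at inference time the image embedding can be replaced by a text embedding from the shared CLIP space, matching the "hallucinated" visual grounding described above.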

Advisors: Luca Cagliero, Giuseppe Gallipoli
Academic year: 2024/25
Publication type: Electronic
Number of pages: 66
Subjects:
Degree programme: Master's degree programme in Data Science and Engineering
Degree class: New regulations > Master's degree > LM-32 - COMPUTER ENGINEERING
Collaborating companies: NOT SPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/36337