Luca Villani
Leveraging the Visual Capabilities of Transformers in Multimodal Machine Translation.
Supervisors: Luca Cagliero, Lorenzo Vaiani. Politecnico di Torino, Master of Science program in Data Science and Engineering, 2024
Licence: Creative Commons Attribution Non-commercial No Derivatives.
Abstract
Machine translation (MT) has advanced considerably since the arrival of Deep Neural Networks (DNNs). The introduction of the Transformer architecture, with its flexible data handling, opened the door to a new field: Multimodal Machine Translation (MMT). MMT aims to combine text with other information, such as images, to improve translation accuracy. Although MMT is growing rapidly, challenges remain. One is the scarcity of datasets that pair translations with other modalities. Another is how to represent different data types effectively and then combine them in a way that captures the overall meaning. This thesis proposes a new architecture using three Transformers: one for text, one for a general image representation, and one for detecting objects in the image.
The goal is to determine whether using both general and object-level image features improves translation quality.
