Structural-Semantic Dynamic Graph Learning for Document Visual QA

Xiao Huan

Structural-Semantic Dynamic Graph Learning for Document Visual QA.

Supervisors: Luca Cagliero, Lorenzo Vaiani, Davide Napolitano. Politecnico di Torino, Master's degree programme in Data Science And Engineering, 2025

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:	With advances in Natural Language Processing (NLP) and Computer Vision (CV), Document Visual Question Answering (Document VQA) has become an important research area in both industry and academia. Visual documents contain various elements, such as images, tables, text paragraphs, and graphs. The challenge arises from their multimodal nature and complex structure: text and images must be processed together, often across multiple pages. Traditional question answering techniques are designed primarily for text-only or image-only inputs, which makes them ineffective for questions that require both textual and visual elements. Even when the modalities are integrated, gaps can remain in how they interact and align. Some models capture relations between elements to handle the complex structure of documents, but these approaches are limited to intra-page relationships and rely on static weights when aggregating node information. To address these challenges, I propose a framework that uses a cross-modal model to extract embeddings, integrates information through document-level structural-semantic graphs, and employs dynamic weight learning to improve aggregation. To capture semantic relationships, I use the cross-modal embeddings as node features and compute the similarity between multimodal nodes to construct a semantic graph. To capture document-level structural information, I use logical and spatial relations and connect elements across pages to construct structural graphs. To improve information aggregation, I employ Graph Neural Networks (GNNs), specifically Graph Attention Networks (GATs), which dynamically learn attention scores that assign appropriate weights to neighboring nodes. Through a macro-to-micro model analysis, I selected a global GNN learning architecture that lets the model learn global relationships across the structural and semantic graphs simultaneously.
The document-level graphs and cross-modal nodes preserve the original structure of paragraphs and images without splitting them, which allows the model to construct a more coherent document representation. By using multiple semantic and structural graphs, the model captures global contextual relationships from different perspectives, improving relational understanding. Additionally, the dynamic GAT weight learning mechanism makes training more flexible, allowing the model to focus adaptively on critical information. Experimental results surpass the baseline, demonstrating the effectiveness of the proposed framework. This improvement, which traditional single-modality or page-level approaches do not attain, establishes a strong foundation for future research in Document VQA.
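
The two mechanisms named in the abstract, a similarity-based semantic graph and attention-weighted aggregation over neighbors, can be illustrated with a small sketch. This is not the thesis implementation: the toy embeddings, the cosine threshold of 0.8, the function names, and the fixed vectors `a_src` and `a_dst` (stand-ins for parameters that a GAT would learn) are all assumptions made for illustration, and the linear projection that a real GAT layer applies to node features is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_semantic_edges(embeddings, threshold=0.8):
    """Link node pairs whose cosine similarity reaches the (assumed) threshold."""
    n = len(embeddings)
    # Every node starts with a self-loop so it always attends to itself.
    neighbors = {i: [i] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def attention_weights(embeddings, neighbors, a_src, a_dst):
    """GAT-style scores: softmax over neighbors of LeakyReLU(a_src.h_i + a_dst.h_j)."""
    weights = {}
    for i, nbrs in neighbors.items():
        src = sum(a * h for a, h in zip(a_src, embeddings[i]))
        scores = [
            leaky_relu(src + sum(a * h for a, h in zip(a_dst, embeddings[j])))
            for j in nbrs
        ]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights[i] = {j: e / total for j, e in zip(nbrs, exps)}
    return weights


def aggregate(embeddings, weights):
    """Replace each node feature by the attention-weighted sum of its neighbors."""
    dim = len(embeddings[0])
    return [
        [sum(w * embeddings[j][d] for j, w in weights[i].items()) for d in range(dim)]
        for i in weights
    ]


# Toy node features: two similar nodes and one unrelated node (hypothetical data).
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
neighbors = build_semantic_edges(embeddings)
weights = attention_weights(embeddings, neighbors, a_src=[0.5, -0.5], a_dst=[1.0, 0.3])
updated = aggregate(embeddings, weights)
```

In this toy setup the first two nodes become neighbors because their embeddings point in nearly the same direction, while the third node keeps only its self-loop and is therefore left unchanged by the aggregation. A static scheme would instead fix the neighbor weights in advance, for example as a uniform average.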
Supervisors:	Luca Cagliero, Lorenzo Vaiani, Davide Napolitano
Academic year:	2024/25
Publication type:	Electronic
Number of pages:	81
Subjects:
Degree programme:	Master's degree programme in Data Science And Engineering
Degree class:	New system > Master's degree > LM-32 - COMPUTER ENGINEERING
Collaborating companies:	NOT SPECIFIED
URI:	http://webthesis.biblio.polito.it/id/eprint/35233
