Wiktoria Woronko
Detection and Mitigation of Issues in Video Summarization Using Video Large Language Models.
Supervisors: Luca Cagliero, Lorenzo Vaiani. Politecnico di Torino, Master's degree programme in Data Science and Engineering, 2025
PDF (Tesi_di_laurea) - Thesis
Access restricted to staff only until 24 October 2026 (embargo date). License: Creative Commons Attribution Non-commercial No Derivatives. Download (4MB)
Abstract:
Video Large Language Models (VLLMs) have recently demonstrated strong capabilities in multimodal understanding. Yet, their effectiveness in structured video-to-text summarization remains insufficiently studied, as most evaluations focus solely on plain-text outputs. This thesis investigates the zero-shot capabilities of six state-of-the-art VLLMs across five distinct summary formats: plain text, event-based, spatially contextualized, timeline, and spatio-temporal. The analysis reveals persistent limitations in structured summaries, particularly in providing valid temporal annotations and adhering to required output formats. To address these challenges, four categories of mitigation strategies are introduced. The first employs Chain-of-Thought prompting, where external video knowledge extracted from lightweight models is injected into prompts to guide temporal and spatial reasoning. The second applies LLM-based post-hoc refinement, which identifies and corrects structural errors through tailored instructions. To further exploit external knowledge, two novel pipelines are designed to integrate auxiliary cues into structured summarization. The multimodal co-summarization strategy preprocesses external video knowledge into a provisional summary using a lightweight TextLLM, which is then combined with the video input by a VLLM to generate the final structured output. QA-assisted hierarchical summarization leverages VLLMs' video question answering capabilities to extract event-level details from pre-segmented clips, based on timestamps obtained from external models, and merges them into coherent summaries. Experiments conducted on a new mixed-type benchmark of 100 videos sampled from five datasets demonstrate that these strategies yield substantial improvements. QA-assisted hierarchical summarization eliminates formatting issues and enhances spatio-temporal alignment, while multimodal co-summarization achieves statistically significant gains in timeline evaluation.
Object-based cues contribute to improved temporal grounding in both strategies, though they frequently introduce event fragmentation. In contrast, action-based cues consistently enhance event detection and temporal coherence, while scene-based cues strengthen spatial grounding. The findings confirm that combining VLLMs with lightweight external models and TextLLMs enables more reliable, cost-effective, and contextually grounded video summarization without the need for fine-tuning. The proposed benchmark, issue taxonomy, and mitigation frameworks provide a foundation for future research on video-to-text summarization.
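The QA-assisted hierarchical pipeline described above can be sketched in a few lines. This is a purely illustrative toy, not code from the thesis: every function and type name below is hypothetical, and a stub stands in for the real VLLM call. It only shows the control flow the abstract describes: externally obtained timestamps pre-segment the video into clips, the (stubbed) VLLM answers one event-level question per clip, and the answers are merged into timeline-style entries.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Clip:
    start: float  # seconds
    end: float

def segment_by_timestamps(duration: float, boundaries: List[float]) -> List[Clip]:
    """Split [0, duration] into clips at externally detected boundaries."""
    points = [0.0] + sorted(b for b in boundaries if 0.0 < b < duration) + [duration]
    return [Clip(s, e) for s, e in zip(points, points[1:])]

def qa_summarize(clips: List[Clip], ask: Callable[[Clip, str], str]) -> List[str]:
    """Ask one event-level question per clip, then merge the answers
    into timeline-style summary entries."""
    question = "What event happens in this clip?"
    return [f"[{c.start:.1f}s-{c.end:.1f}s] {ask(c, question)}" for c in clips]

# Stub standing in for a real VLLM's video question answering call.
def fake_vllm(clip: Clip, question: str) -> str:
    return f"event between {clip.start:.0f}s and {clip.end:.0f}s"

clips = segment_by_timestamps(30.0, [10.0, 20.0])
summary = qa_summarize(clips, fake_vllm)
print("\n".join(summary))
```

The hierarchical structure comes from the two levels: a cheap external model decides *where* events are (clip boundaries), while the VLLM decides *what* happens in each clip.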
| Field | Value |
|---|---|
| Supervisors: | Luca Cagliero, Lorenzo Vaiani |
| Academic year: | 2025/26 |
| Publication type: | Electronic |
| Number of pages: | 103 |
| Subjects: | |
| Degree programme: | Master's degree programme in Data Science and Engineering |
| Degree class: | New system > Master's degree > LM-32 - COMPUTER ENGINEERING |
| Collaborating companies: | NOT SPECIFIED |
| URI: | http://webthesis.biblio.polito.it/id/eprint/37893 |



Creative Commons License - Attribution 3.0 Italy