Wiktoria Woronko
Detection and Mitigation of Issues in Video Summarization Using Video Large Language Models.
Supervisors: Luca Cagliero, Lorenzo Vaiani. Politecnico di Torino, Master's degree programme in Data Science And Engineering, 2025
PDF (Tesi_di_laurea)
- Thesis
Access restricted to: staff users only until 24 October 2026 (embargo date). License: Creative Commons Attribution Non-commercial No Derivatives. Download (4MB)
Abstract
Video Large Language Models (VLLMs) have recently demonstrated strong capabilities in multimodal understanding. Yet, their effectiveness in structured video-to-text summarization remains insufficiently studied, as most evaluations focus solely on plain-text outputs. This thesis investigates the zero-shot capabilities of six state-of-the-art VLLMs across five distinct summary formats: plain text, event-based, spatially contextualized, timeline, and spatio-temporal. The analysis reveals persistent limitations in structured summaries, particularly in providing valid temporal annotations and adhering to required output formats. To address these challenges, four categories of mitigation strategies are introduced. The first employs Chain-of-Thought prompting, where external video knowledge extracted from lightweight models is injected into prompts to guide temporal and spatial reasoning.
The second applies LLM-based post-hoc refinement, which identifies and corrects structural errors through tailored instructions.
