Large-scale video scene retrieval through Transformer Encoder

Lorenzo De Nisi

Supervisor: Andrea Calimera. Politecnico di Torino, Master's degree programme in Data Science and Engineering, 2021

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract

Over the last few years, the production of multimedia content has grown rapidly. Such data is a valuable source of information, but exploiting its potential requires automating processes that are currently carried out by humans. A large portion of multimedia data consists of video. From social media and streaming services to security, video is one of the most immediate media for conveying information. Combining the expressiveness of written text with vision is the foundation of Vision-Language understanding, which is often employed for automatic supervision, moderation and anomaly detection. This thesis moves in this direction, investigating different solutions for an application capable of performing retrieval and detection on a video, starting from a textual description of the desired scene.

Experiments were conducted with Transformer-based architectures, with particular attention to efficiency at scale and real-time capabilities, analyzing how the trade-off between latency and precision changes as the input resolution is increased and the architectures are altered.