
Large-scale video scene retrieval through Transformer Encoder

Lorenzo De Nisi

Large-scale video scene retrieval through Transformer Encoder.

Rel. Andrea Calimera. Politecnico di Torino, Corso di laurea magistrale in Data Science and Engineering, 2021

Full text: PDF (Tesi_di_laurea), 23 MB.
License: Creative Commons Attribution Non-commercial No Derivatives.

Over the last few years, the production of multimedia content has experienced rapid growth. Such data constitutes a valuable source of information, but leveraging that potential requires automating human processes. A large portion of multimedia data consists of video: from social media and streaming services to security, video is one of the most immediate mediums for conveying information. Combining the expressivity of written text with vision is the foundation of Vision-Language understanding, often employed for automatic supervision, moderation, and anomaly detection. This thesis follows that direction, investigating different solutions for an application capable of performing retrieval and detection on a video, starting from a textual description of the desired scene. Experiments are conducted with Transformer-based architectures, with particular attention to scale efficiency and real-time capabilities, analyzing the trade-off between latency and precision while increasing input resolution and altering the architectures.

Several approaches are considered. First, treating single frames as input data, image retrieval is performed with the TERN architecture; aiming at real-time inference, a faster single-stage object detector is proposed in place of the original two-stage model. Second, processing short video windows instead of single frames, video retrieval is performed with the CLIP4Clip architecture, with a study of the impact of different input resolutions. For both approaches, real-time capabilities are evaluated. Lastly, to test image and video retrieval models on a different domain, a common retrieval dataset is created from security camera recordings, annotated with a self-labelling approach based on a captioning model.

The results show that, by switching to a single-stage detector, TERN inference time is reduced tenfold, at the cost of a noticeable drop in metrics. For the video retrieval solution, the experiments demonstrate that increasing input size is beneficial for precision up to a certain resolution, at the cost of higher inference time. On the additional dataset, the original TERN architecture achieved the best results, far ahead of the modified single-stage version, which pays the price for its higher speed. CLIP4Clip models performed close to the original TERN, with the potential advantage of exploiting the temporal dimension to recognize more actions. Overall, the experiments confirm the suitability of both approaches: switching to a single-stage object detector is an effective way to speed up inference but can also lead to performance degradation, and increasing resolution is costly, especially in training and inference time, but brings noticeable benefits. Finally, on the additional dataset, the pretrained weights generalize readily to a new domain without specific fine-tuning, although fine-tuning would still improve performance.
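The retrieval step shared by both approaches described above can be sketched as ranking pre-computed scene embeddings by cosine similarity to a query embedding. The minimal sketch below assumes the embeddings have already been produced by the text and visual Transformer encoders (e.g. TERN or CLIP4Clip); the vectors and function names are hypothetical, not the thesis's actual implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank_scenes(query_emb, scene_embs):
    """Return (scene_index, score) pairs sorted by descending similarity."""
    scores = [(i, cosine(query_emb, e)) for i, e in enumerate(scene_embs)]
    return sorted(scores, key=lambda p: p[1], reverse=True)

# Hypothetical pre-computed embeddings: one text query, three video scenes.
query = [0.9, 0.1, 0.0]
scenes = [[0.1, 0.9, 0.0], [0.8, 0.2, 0.1], [0.0, 0.0, 1.0]]
ranking = rank_scenes(query, scenes)  # scene 1 ranks first here
```

In practice the scene embeddings are indexed offline, so query time is dominated by a single text-encoder forward pass plus this similarity ranking, which is what makes the latency/precision trade-off studied in the thesis meaningful.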

Relator: Andrea Calimera
Academic year: 2021/22
Publication type: Electronic
Number of Pages: 108
Degree programme: Corso di laurea magistrale in Data Science and Engineering
Degree class: LM-32 - Computer Systems Engineering
Collaborating company: ADDFOR S.p.A.
URI: http://webthesis.biblio.polito.it/id/eprint/20468