Politecnico di Torino

Long-Term Temporal Attention in Efficient Human Action Recognition Architectures

Lorenzo Atzeni

Long-Term Temporal Attention in Efficient Human Action Recognition Architectures.

Rel. Andrea Bottino. Politecnico di Torino, Corso di laurea magistrale in Data Science and Engineering, 2021

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.


Human activity recognition focuses on automatically understanding the activities performed by humans. The field is of particular interest thanks to many real-world applications such as video indexing/retrieval, surveillance, human-machine interaction and physical activity recognition. Different data modalities have been used to solve this task, such as skeleton data, optical flow, accelerometer data and point clouds; the modality can be chosen depending on the application, the hardware and particular constraints such as latency. This work focuses on video data. In the last decade, the field of RGB video-based action recognition has made huge progress, mainly due to advances in deep learning and the emergence of high-quality large-scale datasets. However, many challenges remain. One is the variety of ways an action can appear, which makes it hard for an action recognition system to generalize to unseen videos. Another is the high computational requirements of current action recognition systems, mainly due to the high dimensionality of the input: RGB videos have two spatial dimensions and a temporal dimension, and the temporal dimension remains a major challenge. It is difficult for current action recognition systems to reason about events that happened far in the past or to grasp details located in particular frames along the temporal dimension. The goal of this thesis is to improve the system's ability to capture information from the temporal dimension while keeping its computational requirements low. This is achieved by combining efficient convolutional architectures built with Neural Architecture Search with a classification head based on the Transformer architecture. In this work, the MoViNet architecture family is used as the backbone.
The efficient convolutional architecture acts as a feature extractor, while the Transformer architecture processes the extracted features. Thanks to its ability to process sequential information, the Transformer reasons about the extracted features, attending to long-term relationships between frames and to salient information contained in one or more particular frames. Three Transformer architectures with different computational requirements are investigated. The first experiments are performed on two small datasets, HMDB51 and UCF101. The results show that the architecture with the lowest computational requirements still performs on par with both the more computationally expensive ones and the original convolutional architecture. We hypothesize that this may be due to the small amount of data and the low temporal complexity of the two datasets. To further investigate the new architectures, experiments are performed on the Something-Something dataset, which is both more complex in the temporal dimension and larger than the datasets used in the previous experiments. These experiments show that the Transformer classification head is a valid alternative to the original classification head, with the advantage of a lower number of parameters in its lightest implementation.
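The design described above — a convolutional backbone producing per-frame features, followed by a Transformer head that attends across the temporal dimension before classifying — can be sketched as follows. This is a minimal illustrative sketch, not the thesis's actual implementation: the feature dimension (480), layer counts and head counts are assumptions chosen for the example, and the MoViNet backbone is stood in for by random per-frame features.

```python
# Hedged sketch of a Transformer classification head on top of per-frame
# features from a (not shown) convolutional backbone such as a MoViNet.
# All hyperparameters here are illustrative, not those of the thesis.
import torch
import torch.nn as nn


class TransformerHead(nn.Module):
    """Attends over per-frame features and classifies the clip.

    Assumes the backbone outputs a (batch, frames, feat_dim) tensor of
    spatially pooled features, one vector per frame.
    """

    def __init__(self, feat_dim=480, num_classes=51, num_layers=2, num_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, feats):
        # feats: (B, T, D) -- self-attention lets every frame attend to
        # every other frame, capturing long-term temporal relationships.
        attended = self.encoder(feats)
        pooled = attended.mean(dim=1)  # temporal average pooling
        return self.classifier(pooled)  # (B, num_classes)


head = TransformerHead()
# Stand-in for backbone output: 2 clips, 16 frames, 480-dim features.
logits = head(torch.randn(2, 16, 480))
print(logits.shape)  # torch.Size([2, 51])
```

Because self-attention connects every pair of frames directly, the head can relate events far apart in time, which is harder for the purely convolutional classification head it replaces; `num_classes=51` here mirrors the 51 classes of HMDB51.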

Supervisor: Andrea Bottino
Academic year: 2021/22
Publication type: Electronic
Number of Pages: 96
Degree programme: Corso di laurea magistrale in Data Science and Engineering
Degree class: New organization > Master of Science > LM-32 - COMPUTER SYSTEMS ENGINEERING
Partner company: ADDFOR S.p.A.
URI: http://webthesis.biblio.polito.it/id/eprint/21216