Human Pose Estimation aboard Nano-drones Using Tiny Vision Transformers

Ovidiu Ioan Jitaru

Supervisors: Daniele Jahier Pagliari, Beatrice Alessandra Motetti, Alessio Burrello. Politecnico di Torino, Master's degree programme in Data Science and Engineering, 2024

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:	Drone usage is increasing thanks to technological advances, versatility, and applicability to a wide range of tasks. Improved drone capabilities and hardware miniaturization, such as the development of smaller and more powerful embedded systems, allow AI to be integrated and run efficiently on board. Standard-size drones can be equipped with powerful Graphics Processing Units (GPUs) that allow the use of complex neural networks to solve perception tasks. Nano-drones, however, with their extremely small dimensions and power envelope, are severely limited in the computation they can support. More optimized solutions are therefore needed to enable onboard AI, preserving the real-time response of the perception system while maximizing performance on the considered task. The current state of the art (SoA) for computer vision tasks on nano-drones relies on lightweight CNN architectures, which can be executed effectively by the onboard MCU-class processor without high-performance GPUs. However, many recent works in computer vision have shown that the Vision Transformer (ViT) architecture frequently outperforms SoA CNNs on several tasks. Optimizing ViT architectures by reducing their size and latency for real-time applications, so that they become suitable for deployment on nano-drones, is therefore an interesting challenge. The task considered in this thesis is human pose estimation, a computer vision task aimed at identifying the position and orientation of a person. In this particular case, the drone's objective is to position itself in front of a person and follow them while keeping a constant distance from the subject. The thesis analyzes and develops, through optimization techniques, an efficient ViT model to be deployed on nano-drones that mount an AI deck with the GAP8 System-on-Chip.
Two datasets containing images collected in two separate laboratories are used to train and assess the performance of distinct perception modules. Multiple approaches are explored, including the evaluation of different ViT architecture configurations and the assessment of the benefits of a pre-training step prior to fine-tuning on our task. ViTs are then compared with MobileNet, a CNN that achieves SoA results on the considered benchmark, and finally structured pruning is applied to reduce the model's size while preserving its original performance. The ViT and MobileNet networks are compared using the Mean Absolute Error (MAE) between the predicted x, y, z, and ϕ coordinates and the ground-truth ones. Our results show that the ViT obtains a total error over all 4 predicted axes that is 12% lower than that of MobileNet. By applying structured pruning techniques, we reach a 30% compression of the model in terms of number of parameters while still matching the performance of MobileNet.
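
As an illustration of the evaluation metric described in the abstract, the following minimal sketch computes a per-axis MAE over the four outputs (x, y, z, ϕ), sums it into a total error, and expresses one model's total error as a relative reduction with respect to a baseline. The function names, the array layout (one row per sample, columns in the order x, y, z, ϕ), and the sample values are assumptions made for this example and are not taken from the thesis. In particular, ϕ is treated here as a plain scalar difference; whether the thesis wraps the angular error is not stated in the abstract.

```python
import numpy as np


def per_axis_mae(pred, target):
    """Mean absolute error for each output column (assumed order: x, y, z, phi)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    if pred.shape != target.shape or pred.ndim != 2 or pred.shape[1] != 4:
        raise ValueError("expected two arrays of shape (n_samples, 4)")
    # Average the absolute differences over samples, one value per axis.
    return np.abs(pred - target).mean(axis=0)


def total_error(pred, target):
    """Sum of the four per-axis MAEs (one possible reading of 'total error')."""
    return float(per_axis_mae(pred, target).sum())


def relative_reduction(candidate_error, baseline_error):
    """Fraction by which the candidate's error is lower than the baseline's."""
    return (baseline_error - candidate_error) / baseline_error


# Made-up values purely for illustration; they are not results from the thesis.
target = np.array([[1.0, 0.0, 0.5, 0.0],
                   [2.0, 0.5, 0.5, 1.0]])
pred = np.array([[1.2, 0.1, 0.5, 0.2],
                 [1.8, 0.5, 0.3, 1.0]])

mae = per_axis_mae(pred, target)
print("per-axis MAE:", mae)
print("total error:", total_error(pred, target))
```

Under this reading, a statement such as "12% lower total error" corresponds to `relative_reduction(vit_total, mobilenet_total)` being about 0.12, where both totals are computed on the same test set.
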
Supervisors:	Daniele Jahier Pagliari, Beatrice Alessandra Motetti, Alessio Burrello
Academic year:	2024/25
Publication type:	Electronic
Number of pages:	62
Subjects:
Degree programme:	Master's degree in Data Science and Engineering
Degree class:	New system > Master's degree > LM-32 - Computer Engineering
Collaborating companies:	Not specified
URI:	http://webthesis.biblio.polito.it/id/eprint/33873
