
3DYogaSeg: A New Dataset and Benchmark for Skeleton-Based Action Recognition and Segmentation in Yoga Videos

Edoardo Marchetti

3DYogaSeg: A New Dataset and Benchmark for Skeleton-Based Action Recognition and Segmentation in Yoga Videos.

Supervisors: Giuseppe Bruno Averta, Chiara Plizzari. Politecnico di Torino, NOT SPECIFIED, 2024

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.

Download (7 MB)
Abstract:

In recent years, online platforms for home exercise programs, particularly yoga, have grown significantly in popularity. Yoga involves a variety of poses, known as "asanas", where small differences can change the name and nature of the exercise. However, many video tutorials do not provide accurate names or descriptions for these poses, which presents a challenge for beginners. Automated tools for identifying yoga poses can therefore offer critical support to users approaching these sessions. Although there is a vast literature on video understanding, there is a significant gap in the recognition of specific exercises in yoga videos, largely due to the lack of adequate datasets. To address this problem, we introduce a new dataset for identifying and segmenting yoga poses from skeletal data. Skeletal data are far less affected by changes in viewpoint and background than traditional RGB data, providing a more stable and reliable basis for analysis. We designed a linear interpolation process to merge videos, creating a dataset that supports both action recognition and segmentation. In total, we collected 2115 videos from YouTube covering 58 different asanas, with an average of 34 sequences per pose and an average duration of 13 seconds per video. The skeletal sequences were extracted with the BlazePose model provided by MediaPipe, an open-source toolkit developed by Google for computer vision tasks. Using the MS-GCN network, which combines the graph convolutions of ST-GCN with the temporal segmentation capabilities of MS-TCN, we conducted an ablation study to analyze how the characteristics of the interpolation-based dataset affect performance. In particular, we analyzed the impact of the number of exercises per video, of allowing an exercise to be repeated multiple times in the same video, and of the number of transition frames used to connect two clips.
The experiments showed that building the dataset from videos that contain fewer, non-repeated exercises linked by a high number of transition frames increases the frame-level accuracy of MS-GCN. We believe that using linear interpolation to join videos is a solid starting point for converting datasets designed for action recognition into datasets suitable for temporal action segmentation.
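The clip-merging process described above can be sketched as follows. This is a minimal illustration, not the thesis code: the function name, array shapes, and the strategy of interpolating between the last frame of one clip and the first frame of the next are assumptions based on the abstract's description of linearly interpolated transition frames between skeleton sequences.

```python
import numpy as np

def join_clips(clip_a, clip_b, n_transition):
    """Concatenate two skeleton sequences, bridging them with
    linearly interpolated transition frames.

    clip_a, clip_b: arrays of shape (frames, joints, 3) holding
    3D joint coordinates (e.g. the 33 BlazePose landmarks).
    n_transition: number of synthetic frames inserted between the
    last frame of clip_a and the first frame of clip_b.
    """
    last, first = clip_a[-1], clip_b[0]
    # Weights strictly between 0 and 1, so the endpoint frames
    # themselves are not duplicated in the transition.
    weights = np.linspace(0.0, 1.0, n_transition + 2)[1:-1]
    transition = np.stack([(1 - w) * last + w * first for w in weights])
    return np.concatenate([clip_a, transition, clip_b], axis=0)
```

Joining a 5-frame clip to a 4-frame clip with `n_transition=10` yields a 19-frame sequence; each intermediate frame is a convex combination of the two boundary poses, so the skeleton drifts smoothly from one asana into the next.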

Supervisors: Giuseppe Bruno Averta, Chiara Plizzari
Academic year: 2023/24
Publication type: Electronic
Number of pages: 97
Subjects:
Degree course: NOT SPECIFIED
Degree class: New regulation > Master's degree > LM-32 - COMPUTER ENGINEERING
Partner companies: GYMNASIO S.R.L.
URI: http://webthesis.biblio.polito.it/id/eprint/30810