Improving performance for Multi-Label Classification on Imbalanced Problems using Data Augmentation Techniques

Marileni Sinioraki

Improving performance for Multi-Label Classification on Imbalanced Problems using Data Augmentation Techniques.

Supervisors: Flavio Giobergia, Simona Mazzarino, Luca Gilli. Politecnico di Torino, Master's degree programme in Data Science And Engineering, 2025

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract

This thesis presents the development of an emotion classification pipeline designed to improve multi-label text classification, with a particular focus on the underrepresentation of minority classes. In multi-label tasks, class imbalance often causes models to perform worse on infrequent labels than on frequent ones. To mitigate this issue, the proposed approach generates synthetic sentences that enrich the minority-class samples and improve overall classification performance. The study begins by establishing a baseline model based on the BERT-base-uncased architecture, trained on the original dataset. A data-driven analysis is then conducted to identify the most representative examples of underperforming labels. These examples are used as input for two data augmentation methods: a traditional synonym replacement technique and a large language model (LLM)-based generation approach.

For the latter, different prompting strategies are explored to improve the relevance, quality, and diversity of the generated text.
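
As a rough illustration of the synonym replacement step mentioned in the abstract, the following minimal Python sketch augments only the samples that carry an infrequent label. It is not the thesis implementation: the `SYNONYMS` table, the function names, and the frequency `threshold` are all hypothetical, and rare labels are chosen here by raw frequency, whereas the abstract describes a data-driven analysis of underperforming labels.

```python
import random
from collections import Counter

# Hypothetical toy synonym table (an assumption for this sketch); a real
# pipeline would more likely draw synonyms from a lexical resource.
SYNONYMS = {
    "happy": ["glad", "joyful"],
    "sad": ["unhappy", "sorrowful"],
    "scared": ["afraid", "frightened"],
}


def minority_labels(label_sets, threshold):
    """Return labels whose share of samples is below `threshold`."""
    counts = Counter(label for labels in label_sets for label in labels)
    n = len(label_sets)
    return {label for label, c in counts.items() if c / n < threshold}


def synonym_replace(sentence, rng, p=1.0):
    """Swap each word found in SYNONYMS for a random synonym with probability p."""
    out = []
    for word in sentence.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < p:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)


def augment(dataset, threshold=0.5, seed=0):
    """Append one synthetic copy of every sample that carries a rare label.

    `dataset` is a list of (text, labels) pairs. The labels are copied
    unchanged, which assumes that a synonym swap preserves the emotion.
    """
    rng = random.Random(seed)
    rare = minority_labels([labels for _, labels in dataset], threshold)
    synthetic = [
        (synonym_replace(text, rng), labels)
        for text, labels in dataset
        if rare & set(labels)
    ]
    return dataset + synthetic


if __name__ == "__main__":
    data = [
        ("i am happy today", ["joy"]),
        ("so happy and glad", ["joy"]),
        ("i am scared and sad", ["fear", "sadness"]),
    ]
    for text, labels in augment(data):
        print(labels, text)
```

In this toy run only the third sentence is augmented, because `fear` and `sadness` each appear in a third of the samples while `joy` appears in two thirds. A sentence with no word in the synonym table would be copied unchanged, so a practical version would filter out such duplicates.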