Improving performance for Multi-Label Classification on Imbalanced Problems using Data Augmentation Techniques

Marileni Sinioraki

Improving performance for Multi-Label Classification on Imbalanced Problems using Data Augmentation Techniques.

Supervisors: Flavio Giobergia, Simona Mazzarino, Luca Gilli. Politecnico di Torino, Master's degree programme in Data Science And Engineering, 2025

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:

This thesis presents an emotion classification pipeline designed to improve multi-label text classification, with a particular focus on the underrepresentation of minority classes. In multi-label tasks, class imbalance often causes models to perform worse on infrequent labels than on frequent ones. To mitigate this issue, the proposed approach generates synthetic sentences to enrich the minority class samples and improve overall classification performance. The study begins by establishing a baseline model based on the BERT-base-uncased architecture, trained on the original dataset. A data-driven analysis is then conducted to identify the most representative examples of underperforming labels. These examples serve as input for two data augmentation methods: a traditional synonym replacement technique and a large language model (LLM)-based generation approach. For the latter, different prompting strategies are explored to improve the relevance, quality, and diversity of the generated text. The quality of the synthetic data is evaluated against the original samples using appropriate metrics, and the augmented datasets are used to retrain the baseline model and assess performance improvements. The results show that LLM-based augmentation can improve performance on minority classes more effectively than the traditional technique. All code and implementations developed for this work are publicly available in a GitHub repository to support transparency and reproducibility.
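
As an illustration of the traditional augmentation route mentioned in the abstract, the following is a minimal sketch of synonym replacement applied only to samples that carry a rare label. It is not the thesis code: the `SYNONYMS` table, the frequency `threshold`, and the function names `minority_labels`, `synonym_replace`, and `augment` are all hypothetical, and rare labels are selected here by simple frequency rather than by the data-driven analysis of underperforming labels that the thesis describes.

```python
import random
from collections import Counter

# Hypothetical toy synonym table (assumption). A real pipeline would more
# likely draw candidates from a lexical resource; these entries are
# illustrative only.
SYNONYMS = {
    "happy": ["glad", "joyful"],
    "sad": ["unhappy", "sorrowful"],
    "angry": ["furious", "irate"],
}


def minority_labels(label_sets, threshold):
    """Return the labels that occur in fewer than `threshold` samples."""
    counts = Counter(label for labels in label_sets for label in labels)
    return {label for label, n in counts.items() if n < threshold}


def synonym_replace(sentence, rng, p=1.0):
    """Swap each word found in SYNONYMS for a random synonym with probability p.

    Tokenization is a plain whitespace split, so words with attached
    punctuation are left unchanged.
    """
    out = []
    for word in sentence.split():
        candidates = SYNONYMS.get(word.lower())
        if candidates and rng.random() < p:
            out.append(rng.choice(candidates))
        else:
            out.append(word)
    return " ".join(out)


def augment(samples, threshold, seed=0):
    """Append one synonym-replaced copy of every sample with a minority label.

    `samples` is a list of (text, labels) pairs; the synthetic copy keeps
    the full label set of the sentence it was derived from.
    """
    rng = random.Random(seed)
    rare = minority_labels([labels for _, labels in samples], threshold)
    extra = []
    for text, labels in samples:
        if rare & set(labels):
            new_text = synonym_replace(text, rng)
            if new_text != text:  # skip copies where nothing was replaced
                extra.append((new_text, labels))
    return samples + extra
```

Calling `augment(samples, threshold=2)` on a list of `(text, labels)` pairs returns the original samples followed by the synthetic ones. Because each copy inherits every label of its source sentence, augmenting a rare label can also add examples of a frequent label that co-occurs with it, which is one reason multi-label augmentation needs more care than the single-label case.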

Supervisors: Flavio Giobergia, Simona Mazzarino, Luca Gilli
Academic year: 2025/26
Publication type: Electronic
Number of pages: 73
Subjects:
Degree programme: Master's degree programme in Data Science And Engineering
Degree class: New system > Master's degree > LM-32 - COMPUTER ENGINEERING
Collaborating companies: ClearBox AI Solutions S.R.L.
URI: http://webthesis.biblio.polito.it/id/eprint/38761