Marileni Sinioraki
Improving performance for Multi-Label Classification on Imbalanced Problems using Data Augmentation Techniques.
Supervisors: Flavio Giobergia, Simona Mazzarino, Luca Gilli. Politecnico di Torino, Master's degree programme in Data Science And Engineering, 2025
Thesis (PDF). License: Creative Commons Attribution Non-commercial No Derivatives.
| Abstract: | This thesis presents the development of an emotion classification pipeline designed to improve multi-label text classification, with a particular focus on addressing the underrepresentation of minority classes. In multi-label tasks, class imbalance often causes models to perform poorly on infrequent labels compared to majority ones. To mitigate this issue, the proposed approach generates synthetic sentences to enrich minority class samples and enhance overall classification performance. The study begins by establishing a baseline model based on the BERT-base-uncased architecture, trained on the original dataset. A data-driven analysis is then conducted to identify the most representative examples of underperforming labels. These examples are used as input for two data augmentation methods: a traditional synonym replacement technique and a large language model based generation approach. For the latter, different prompting strategies are explored to improve the relevance, quality, and diversity of the generated text. The quality of the synthetic data is evaluated against original samples using appropriate metrics, and the augmented datasets are used to retrain the baseline model to assess performance improvements. The results demonstrate that large language model based augmentation can effectively enhance the performance of minority classes compared to traditional techniques. All code and implementations developed for this work are made publicly available in a GitHub repository to support transparency and reproducibility. |
|---|---|
| Supervisors: | Flavio Giobergia, Simona Mazzarino, Luca Gilli |
| Academic year: | 2025/26 |
| Publication type: | Electronic |
| Number of pages: | 73 |
| Subjects: | |
| Degree programme: | Master's degree programme in Data Science And Engineering |
| Degree class: | New system > Master's degree > LM-32 - COMPUTER ENGINEERING |
| Collaborating companies: | ClearBox AI Solutions S.R.L. |
| URI: | http://webthesis.biblio.polito.it/id/eprint/38761 |



Creative Commons License - Attribution 3.0 Italy