Daven Loris Speranza
Data Augmentation for Imbalanced Fund-Policy Classification in Financial NLP: A Comprehensive Evaluation.
Supervisor: Riccardo Coppola. Politecnico di Torino, Master's degree programme in Ingegneria Matematica (Mathematical Engineering), 2025
License: Creative Commons Attribution Non-commercial No Derivatives.
Abstract
The asset-management industry produces long, jargon-dense policy texts (e.g., KIIDs) that must be classified along several taxonomies (asset class, geographic area, and others) for compliance and analytics. Manual labeling is costly and not always objective. Modern encoders can read these documents from start to finish, but long inputs and class imbalance make the problem challenging in practice. Can textual data augmentation (DA) mitigate class imbalance and improve fund-policy classification without compromising label validity? More specifically: (i) how do lightweight local DA methods compare with online paraphrasing; (ii) when do long-context encoders (Longformer-base) materially outperform short-context ones (RoBERTa-base); and (iii) what gains, if any, derive from applying these augmentations? We build a supervised model over KIID investment policies and evaluate two encoder families.
We test selective/noising DA and controlled paraphrases, using validation checks (semantic similarity, clustering indices, PCA maps) to verify the augmentations.
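As a rough illustration of what a noising augmentation gated by a similarity check can look like, the sketch below perturbs a policy sentence with random word deletion and swapping and keeps the result only if it stays close to the original. It is not the thesis's implementation: the function names, the thresholds, and the example sentence are hypothetical, and token-level Jaccard overlap is used only as a simple stand-in for the semantic-similarity check mentioned in the abstract, which would more plausibly rely on sentence embeddings.

```python
import random


def random_deletion(tokens, p, rng):
    """Drop each token with probability p, always keeping at least one token."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def random_swap(tokens, n_swaps, rng):
    """Swap n_swaps randomly chosen pairs of positions in a copy of tokens."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def jaccard(a, b):
    """Token-set overlap in [0, 1]; an assumed stand-in for semantic similarity."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def augment(text, p_delete=0.1, n_swaps=1, min_similarity=0.7, seed=0, max_tries=10):
    """Return a noised variant of text that passes the similarity gate, else None.

    All default values are illustrative assumptions, not tuned settings.
    """
    rng = random.Random(seed)
    tokens = text.split()
    if not tokens:
        return None
    for _ in range(max_tries):
        candidate = random_swap(random_deletion(tokens, p_delete, rng), n_swaps, rng)
        # Accept only candidates that differ from the source yet remain similar.
        if candidate != tokens and jaccard(tokens, candidate) >= min_similarity:
            return " ".join(candidate)
    return None


if __name__ == "__main__":
    policy = "The fund invests mainly in European equities and may use derivatives for hedging"
    print(augment(policy))
```

In an imbalanced setting, such a function would typically be applied only to minority-class examples, with each accepted variant inheriting the label of its source text; the similarity gate is what guards against augmentations that drift far enough to invalidate that label.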
