Daven Loris Speranza
Data Augmentation for Imbalanced Fund-Policy Classification in Financial NLP: A Comprehensive Evaluation.
Supervisor: Riccardo Coppola. Politecnico di Torino, Master's degree in Mathematical Engineering, 2025
PDF (Tesi_di_laurea)
- Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.
| Abstract: |
The asset-management industry produces long, jargon-dense policy texts (e.g., KIIDs) that must be classified along several taxonomies (asset class, geographic area, and others) for compliance and analytics. Manual labeling is costly and not always objective. Modern encoders can read these documents from start to finish, but long inputs and class imbalance make the problem challenging in practice. Can textual data augmentation (DA) mitigate class imbalance and improve fund-policy classification without compromising label semantics? More specifically: (i) how do lightweight local DA methods compare with online paraphrasing; (ii) when do long-context encoders (Longformer-base) materially outperform short-context ones (RoBERTa-base); and (iii) what gains, if any, derive from applying these augmentations? We build a supervised model over KIID investment policies and evaluate two encoder families. We test selective/noising DA and controlled paraphrases, with validation checks to verify the augmentations (semantic similarity, clustering indices, PCA maps). Models are tuned with a fixed validation protocol and assessed on macro-/weighted-F1, accuracy, and per-class recall; results are broken down by taxonomy and input length. We show that DA consistently improves macro-averaged metrics and minority-class recall compared to baselines, with the simplest mixes delivering robust, low-variance gains. LLM paraphrases can add diversity and usually increase DA-related gains, but they require care to avoid label drift. Longformer-base outperforms RoBERTa-base on long policies when label features are dispersed across paragraphs; on short curated inputs, RoBERTa remains competitive at lower cost. Augmentation validation confirms semantic preservation for local methods and identifies risky example drift in generative DA.
Textual DA is a reliable, low-risk, and low-resource remedy (compared with collecting additional real-world examples) for imbalance in fund-policy classification, and encoder choice should follow the corpus length profile. Future work could include cost-sensitive training and calibrated decision rules; stricter filters for generative DA and the implementation of new local techniques; and larger ad hoc pretraining and error analysis to target classes that remain poorly handled. |
|---|---|
| Supervisors: | Riccardo Coppola |
| Academic year: | 2025/26 |
| Publication type: | Electronic |
| Number of pages: | 122 |
| Subjects: | |
| Degree programme: | Master's degree in Mathematical Engineering |
| Degree class: | New system > Master's degree > LM-44 - MATHEMATICAL AND PHYSICAL MODELLING FOR ENGINEERING |
| Collaborating companies: | FIDA s.r.l. |
| URI: | http://webthesis.biblio.polito.it/id/eprint/38152 |
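
The abstract reports macro-F1, weighted-F1, accuracy, and per-class recall. The following is a minimal sketch, not taken from the thesis, of why macro-averaged metrics and minority-class recall are the informative ones under class imbalance. The function names (`per_class_scores`, `summarize`), the labels `equity` and `commodity`, and the 90/10 toy split are all hypothetical and chosen only for illustration.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1, tp + fn)
    return scores


def summarize(y_true, y_pred):
    """Compute accuracy, macro-F1, weighted-F1, and per-class recall."""
    scores = per_class_scores(y_true, y_pred)
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Macro-F1: unweighted mean over classes, so a rare class counts
    # as much as a frequent one.
    macro_f1 = sum(s[2] for s in scores.values()) / len(scores)
    # Weighted-F1: mean weighted by class support, so it is dominated
    # by the majority class.
    weighted_f1 = sum(s[2] * s[3] for s in scores.values()) / n
    recall = {label: s[1] for label, s in scores.items()}
    return {"accuracy": accuracy, "macro_f1": macro_f1,
            "weighted_f1": weighted_f1, "recall": recall}


# Hypothetical imbalanced toy set: 90 majority and 10 minority examples.
# The classifier recovers only 2 of the 10 minority examples.
y_true = ["equity"] * 90 + ["commodity"] * 10
y_pred = ["equity"] * 98 + ["commodity"] * 2

result = summarize(y_true, y_pred)
print(result)
```

On this toy set, accuracy is 0.92 and weighted-F1 is about 0.895, while macro-F1 is about 0.645 and minority-class recall is 0.2. The support-weighted figures hide the minority-class failure, which is why improvements from augmentation are most visible in macro-averaged metrics and per-class recall.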



Creative Commons License - Attribution 3.0 Italy