Data Augmentation for Imbalanced Fund-Policy Classification in Financial NLP: A Comprehensive Evaluation

Daven Loris Speranza

Data Augmentation for Imbalanced Fund-Policy Classification in Financial NLP: A Comprehensive Evaluation.

Supervisor: Riccardo Coppola. Politecnico di Torino, Master's degree programme in Mathematical Engineering (Corso di laurea magistrale in Ingegneria Matematica), 2025

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract

The asset-management industry produces long, jargon-dense policy texts (e.g., KIIDs) that must be classified along several taxonomies (asset class, geographic area, and others) for compliance and analytics. Manual labeling is costly and can be subjective. Modern encoders can read these documents from start to finish, but long inputs and class imbalance make the problem challenging in practice. Can textual data augmentation (DA) mitigate class imbalance and improve fund-policy classification without distorting the meaning of the labels? More specifically: (i) how do lightweight local DA methods compare with online paraphrasing; (ii) when do long-context encoders (Longformer-base) materially outperform short-context ones (RoBERTa-base); and (iii) what gains, if any, derive from applying these augmentations? We build a supervised classifier on KIID investment policies and evaluate two encoder families.

We test selective/noising DA and controlled paraphrases, and we verify the augmented samples with validation checks (semantic similarity, clustering indices, PCA maps).
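
To make the idea of a noising augmentation followed by a similarity check concrete, the following minimal Python sketch may help. It is not the thesis implementation. The function names (`noise_text`, `bow_cosine`, `augment_with_check`), the deletion probability, the 0.8 threshold, and the sample policy sentence are all assumptions made for illustration. In particular, the bag-of-words cosine used here is only a simple stand-in for an embedding-based semantic similarity.

```python
import math
import random
from collections import Counter


def noise_text(text, p_delete=0.1, n_swaps=1, seed=None):
    """Return a noised copy of `text`: random word deletion plus adjacent-word swaps."""
    rng = random.Random(seed)
    tokens = text.split()
    # Drop each token with probability p_delete, falling back to the first token
    # so that a non-empty input never becomes empty.
    kept = [t for t in tokens if rng.random() >= p_delete] or tokens[:1]
    # Swap a few adjacent token pairs to perturb word order locally.
    for _ in range(n_swaps):
        if len(kept) < 2:
            break
        i = rng.randrange(len(kept) - 1)
        kept[i], kept[i + 1] = kept[i + 1], kept[i]
    return " ".join(kept)


def bow_cosine(a, b):
    """Cosine similarity between lowercase bag-of-words count vectors.

    Assumption: a crude proxy for semantic similarity, used here only to keep
    the sketch dependency-free.
    """
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def augment_with_check(text, n_candidates=5, min_sim=0.8, seed=0):
    """Generate noised candidates and keep those similar enough to the source."""
    kept = []
    for k in range(n_candidates):
        candidate = noise_text(text, seed=seed + k)
        # Discard unchanged copies and candidates that drift too far from the source.
        if candidate != text and bow_cosine(text, candidate) >= min_sim:
            kept.append(candidate)
    return kept


if __name__ == "__main__":
    # Hypothetical policy sentence, invented for this example.
    policy = ("The fund invests mainly in European equities and may use "
              "derivatives for hedging purposes")
    for sample in augment_with_check(policy):
        print(round(bow_cosine(policy, sample), 3), sample)
```

In a setting like the one described, such a filter would typically be applied only to minority-class documents, with the threshold tuned so that accepted samples still carry the label of their source text.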