Vittorio Pellegrini
Self-Supervised Fine-Tuning of sentence embedding models using a Smooth Inverse Frequency model.
Rel. Paolo Garza. Politecnico di Torino, Corso di laurea magistrale in Data Science And Engineering, 2023
Abstract: |
Sentence embedding models play a key role in the field of Natural Language Processing. They can be exploited for the resolution of several tasks like sentence paraphrasing, sentence similarity, and sentence clustering. Fine- tuning pre-trained models for sentence embedding extraction is a common practice that allows it to reach state-of-the-art performance on downstream tasks. Nevertheless, this practice usually requires labeled data sets. This thesis project aims to overcome this issue by introducing a novel technique for the automatic creation of a target set for fine-tuning sentence embedding models for a specific downstream task. The technique is evaluated on three distinct tasks: sentence paraphrasing, sentence similarity, and sentence clustering. The results demonstrate a significant improvement in sentence embedding models when employing the Smooth Inverse Frequency technique for automatic extraction and labeling of sentence pairs. In the paraphrasing task, the proposed technique yields a noteworthy enhancement of 2.3% in terms of F1-score compared to the baseline results. Moreover, it showcases a 0.2% improvement in F1-score when compared to the ideal scenario where real labels are utilized. For the sentence similarity task, the proposed method achieves a Pearson score of 0.71, surpassing the baseline model’s score of 0.476. However, it falls short of the ideal model trained with human annotations, which attains a Pearson score of 0.845. Regarding the clustering task, from a quantitative standpoint, the best model achieves a harmonic mean (calculated using DBCV and cophenetic score) of 0.693, outperforming the baseline score of 0.671. Nevertheless, the qualitative assessment did not demonstrate a substantial improvement for the clustering task, highlighting the need for exploring alternative techniques to enhance performance in this area. |
---|---|
Relators: | Paolo Garza |
Academic year: | 2023/24 |
Publication type: | Electronic |
Number of Pages: | 78 |
Additional Information: | Tesi secretata. Fulltext non presente |
Subjects: | |
Corso di laurea: | Corso di laurea magistrale in Data Science And Engineering |
Classe di laurea: | New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING |
Ente in cotutela: | KTH - Kungl. Tekniska Hogskolan (Royal Institute of Technology) (SVEZIA) |
Aziende collaboratrici: | Gavagai |
URI: | http://webthesis.biblio.polito.it/id/eprint/28611 |
Modify record (reserved for operators) |