Self-Supervised Fine-Tuning of sentence embedding models using a Smooth Inverse Frequency model

Vittorio Pellegrini

Self-Supervised Fine-Tuning of sentence embedding models using a Smooth Inverse Frequency model.

Supervisor: Paolo Garza. Politecnico di Torino, Master's degree programme in Data Science And Engineering, 2023

Abstract:

Sentence embedding models play a key role in Natural Language Processing. They can be used for tasks such as sentence paraphrasing, sentence similarity, and sentence clustering. Fine-tuning pre-trained sentence embedding models is a common practice that allows them to reach state-of-the-art performance on downstream tasks. However, this practice usually requires labeled data sets. This thesis aims to overcome that requirement by introducing a novel technique that automatically creates a target set for fine-tuning sentence embedding models on a specific downstream task. The technique is evaluated on three distinct tasks: sentence paraphrasing, sentence similarity, and sentence clustering. The results demonstrate a significant improvement in sentence embedding models when the Smooth Inverse Frequency technique is used to automatically extract and label sentence pairs. In the paraphrasing task, the proposed technique improves the F1-score by 2.3% over the baseline, and by 0.2% over the ideal scenario in which real labels are used. In the sentence similarity task, the proposed method achieves a Pearson score of 0.71, surpassing the baseline model's score of 0.476 but falling short of the ideal model trained with human annotations, which attains a Pearson score of 0.845. In the clustering task, from a quantitative standpoint, the best model achieves a harmonic mean (computed from the DBCV and cophenetic scores) of 0.693, outperforming the baseline score of 0.671. The qualitative assessment, however, did not show a substantial improvement for clustering, which highlights the need to explore alternative techniques for this task.
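As an illustration of the Smooth Inverse Frequency (SIF) idea named in the abstract, the following is a minimal sketch of the commonly used SIF formulation: each sentence is a frequency-weighted average of its word vectors, with weight a / (a + p(w)), followed by removal of the dominant common direction. Since the full text of the thesis is not available, this is not the author's implementation, and how the thesis extracts and labels sentence pairs from such embeddings is not specified here. The function names, the smoothing value `a`, the toy vocabulary, and the random word vectors are all assumptions made for the example; the dominant direction is only approximated by power iteration.

```python
import math
import random
from collections import Counter


def sif_embeddings(sentences, word_vectors, a=1e-3, iterations=200):
    """Return one SIF-style embedding (list of floats) per tokenized sentence."""
    # Unigram probabilities p(w) are estimated from the given sentences (assumption).
    counts = Counter(tok for sent in sentences for tok in sent)
    total = sum(counts.values())
    dim = len(next(iter(word_vectors.values())))

    # Step 1: frequency-weighted average, weight = a / (a + p(w)).
    rows = []
    for sent in sentences:
        known = [tok for tok in sent if tok in word_vectors]
        row = [0.0] * dim
        for tok in known:
            weight = a / (a + counts[tok] / total)
            for k in range(dim):
                row[k] += weight * word_vectors[tok][k] / len(known)
        rows.append(row)

    # Step 2: approximate the dominant (uncentered) direction by power iteration.
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        proj = [sum(r[k] * v[k] for k in range(dim)) for r in rows]
        nxt = [sum(p * r[k] for p, r in zip(proj, rows)) for k in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        v = [x / norm for x in nxt]

    # Step 3: remove each embedding's projection onto that direction.
    result = []
    for r in rows:
        p = sum(r[k] * v[k] for k in range(dim))
        result.append([r[k] - p * v[k] for k in range(dim)])
    return result


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)


# Hypothetical toy data: random vectors stand in for pre-trained word embeddings.
rng = random.Random(0)
vocab = ["the", "cat", "sat", "a", "dog", "ran", "stocks", "fell"]
toy_vectors = {w: [rng.gauss(0.0, 1.0) for _ in range(8)] for w in vocab}
corpus = [["the", "cat", "sat"], ["a", "cat", "sat"],
          ["the", "dog", "ran"], ["stocks", "fell"]]

emb = sif_embeddings(corpus, toy_vectors)
print(cosine(emb[0], emb[1]), cosine(emb[0], emb[3]))
```

Similarity scores of this kind could in principle be used to rank candidate sentence pairs without human labels, but any thresholding or labeling rule would be a further assumption beyond what the abstract states.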

Supervisors: Paolo Garza
Academic year: 2023/24
Publication type: Electronic
Number of Pages: 78
Additional Information: Thesis under embargo. Full text not available
Subjects:
Degree programme: Master's degree programme in Data Science And Engineering
Degree class: New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING
Joint supervision institution: KTH - Kungl. Tekniska Högskolan (Royal Institute of Technology) (SWEDEN)
Collaborating companies: Gavagai
URI: http://webthesis.biblio.polito.it/id/eprint/28611