
Shayan Taghinezhad Roudbaraki
Benchmarking Synonym Extraction Methods in Domain-Specific Contexts.
Supervisors: Luca Cagliero, Luca Gioacchini, Irene Benedetto. Politecnico di Torino, Master's degree programme in Ingegneria Informatica (Computer Engineering), 2025
PDF (Tesi_di_laurea)
- Thesis
License: Creative Commons Attribution Non-commercial No Derivatives. Download (3MB) |
Abstract: |
Accurate synonym identification is crucial for several Natural Language Processing tasks, and it is especially challenging in specialized domains. The difficulties stem from unique vocabularies, domain jargon, semantic shift of words used outside general-language contexts, and limited domain-specific resources for synonym detection. This thesis analyzes methods for synonym extraction in domain-specific contexts by evaluating a subset of techniques on a multi-domain dataset that includes terms, their usage contexts, and ground-truth synsets in domains such as agriculture, automotive, economy, geography, legal, medical, and technology. The analysis includes synonym extraction using traditional lexical resources such as WordNet, distributional semantic models such as fastText in their various available forms, domain-specific corpus training and fine-tuning, and contextual embedding models such as BERT. Clustering algorithms applied to combined term and definition representations are also investigated. For a more thorough analysis, Named Entity Recognition for term identification is explored and compared with information extraction models and large language models (LLMs) on the same task. Additionally, the capabilities of LLMs for definition generation and synonym grouping are explored. Experiments are evaluated with standard Precision, Recall, and F1-score metrics adapted for synset recovery, and with recall for term identification. The research concludes that the proposed multi-step approach is currently the most effective for synset creation. It consists of term identification and definition generation by an LLM, unsupervised clustering, and a final refinement of the clusters by an LLM. |
---|---|
Supervisors: | Luca Cagliero, Luca Gioacchini, Irene Benedetto |
Academic year: | 2024/25 |
Publication type: | Electronic |
Number of pages: | 78 |
Subjects: | |
Degree programme: | Master's degree programme in Ingegneria Informatica (Computer Engineering) |
Degree class: | New system > Master's degree > LM-32 - COMPUTER ENGINEERING |
Collaborating companies: | MAIZE S.R.L. |
URI: | http://webthesis.biblio.polito.it/id/eprint/36445 |
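
The abstract mentions Precision, Recall, and F1-score adapted for synset recovery but does not spell out the adaptation. As an illustration only, one common way to score grouped terms is pairwise: every pair of terms sharing a predicted synset counts as a predicted synonym pair and is checked against the pairs implied by the ground-truth synsets. The sketch below assumes that pairwise formulation, which may differ from the thesis's definition; the function names `synonym_pairs` and `pairwise_prf` and the toy synsets are hypothetical.

```python
from itertools import combinations


def synonym_pairs(synsets):
    """Return the set of unordered term pairs that share a synset."""
    pairs = set()
    for synset in synsets:
        # Sort so each unordered pair has one canonical (a, b) form.
        for a, b in combinations(sorted(set(synset)), 2):
            pairs.add((a, b))
    return pairs


def pairwise_prf(predicted, gold):
    """Pairwise precision, recall, and F1 of predicted synsets against gold.

    Assumption: a synonym pair is correct if both terms also share a
    gold synset. This is one possible adaptation, not necessarily the
    one used in the thesis.
    """
    pred_pairs = synonym_pairs(predicted)
    gold_pairs = synonym_pairs(gold)
    true_pos = len(pred_pairs & gold_pairs)
    precision = true_pos / len(pred_pairs) if pred_pairs else 0.0
    recall = true_pos / len(gold_pairs) if gold_pairs else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data, invented for illustration only.
gold = [{"car", "automobile", "vehicle"}, {"lawyer", "attorney"}]
predicted = [{"car", "automobile"}, {"vehicle", "lawyer", "attorney"}]

precision, recall, f1 = pairwise_prf(predicted, gold)
print(precision, recall, f1)
```

In this toy case the prediction recovers two of the four gold pairs and introduces two spurious pairs by placing "vehicle" in the wrong group, so the pairwise view penalizes both the missed and the wrongly merged terms.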