
Towards Named Entity Disambiguation with Knowledge Graph embeddings

Felice Paolo Colliani

Towards Named Entity Disambiguation with Knowledge Graph embeddings.

Supervisors: Antonio Vetro', Giuseppe Futia, Giovanni Garifo. Politecnico di Torino, Master's degree programme in Mathematical Engineering, 2023

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:

Context: In recent years, interest in biomedicine has grown rapidly, particularly in the application of knowledge mining algorithms. Extracting knowledge from the scientific literature helps professionals make well-informed decisions supported by relevant documents. This thesis presents a novel approach to the Named Entity Disambiguation (NED) task, applied to the biomedical field. The approach combines pre-trained language models with graph technologies. The methodology is not limited to biomedicine and could be applied to other domains. The biomedical domain is used as a case study because it is one of the most complex, owing to the vast number of entities and the lack of clarity in the available literature.

State of the art: In a complex domain such as biomedicine, entity recognition alone is not sufficient. The same mention (e.g. "diabetes") can refer to different concepts (e.g. type 2 diabetes). Assigning a unique identity to each entity mentioned in a text is the task referred to as NED. Earlier NED frameworks relied mainly on the contextual information surrounding the entity to be disambiguated, taking inspiration from human reasoning and attempting to find patterns among word embeddings. Newer approaches integrate Knowledge Bases (KBs), which adds context to the word embeddings and increases disambiguation accuracy. However, KBs are often incomplete, which leads to unreliable results.

Method and contribution: This thesis works with annotated text from biomedical papers and proposes a novel approach that leverages a Siamese Neural Network (SNN) and integrates Knowledge Graph (KG) embeddings. The input of the network is the concatenation of two components: (i) the text embedding of an annotated sentence, produced by a language model pre-trained on the medical domain; (ii) the KG embedding of a candidate entity, computed by graph learning methods on SNOMED, a well-known medical KG. The output is a score between 0 and 1, representing the probability that the SNOMED entity corresponds exactly to the medical entity annotated in the sentence. The approach therefore goes beyond the context of individual words in a sentence and also exploits the topology of the KG. In addition, Neo4j full-text search simplifies candidate selection, so that during the testing phase the trained network is given the most plausible entities to score.

Results: This method has only recently been tested and has rarely been used in a domain as complex as biomedicine. Nevertheless, it achieves acceptable accuracy on two well-known, challenging datasets of annotated documents, MedMentions and BC5CDR. When a more relaxed accuracy constraint is considered, it reaches an even higher score. Notably, this performance is comparable with that of the ScispaCy models, which were trained on a dataset containing 70 times as many entities as the one used in this study.

Conclusions and Future work: The proposed solution, based on the integration of KG embeddings and text embeddings, shows promising results comparable to the state of the art, even though comparable solutions achieved their results after training on much larger datasets. A future, fully integrated version of the pipeline developed in this thesis could lead to better results and make it possible to apply the approach to other domains.
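The scoring step described under "Method and contribution" can be illustrated with a minimal sketch. It is not the architecture from the thesis: a plain one-hidden-layer feed-forward scorer stands in for the SNN, random vectors stand in for the language-model and SNOMED embeddings and for the trained weights, a fixed dictionary stands in for the candidates that Neo4j full-text search would return, and all names (`score_pair`, `rank_candidates`, `hit_at_k`, the `candidate_*` ids, the dimensions) are hypothetical. The `hit_at_k` helper shows one possible reading of the "more relaxed constraint on accuracy" mentioned in the results, namely counting a prediction as correct when the gold entity appears among the k best-scored candidates.

```python
import numpy as np


def score_pair(text_emb, kg_emb, params):
    """Score one (sentence, candidate entity) pair; returns a value between 0 and 1."""
    W1, b1, w2, b2 = params
    x = np.concatenate([text_emb, kg_emb])   # (i) text embedding + (ii) KG embedding
    h = np.maximum(0.0, W1 @ x + b1)         # one ReLU hidden layer (an assumption)
    logit = float(w2 @ h + b2)
    return 1.0 / (1.0 + np.exp(-logit))      # sigmoid maps the logit to a probability-like score


def rank_candidates(text_emb, candidates, params):
    """Sort candidate ids by descending match score."""
    scored = [(cid, score_pair(text_emb, emb, params)) for cid, emb in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def hit_at_k(ranked, gold_id, k):
    """Relaxed accuracy: is the gold entity among the k best-scored candidates?"""
    return gold_id in [cid for cid, _ in ranked[:k]]


# Toy data: random vectors stand in for real embeddings and trained weights.
rng = np.random.default_rng(0)
TEXT_DIM, KG_DIM, HIDDEN = 8, 4, 6           # hypothetical sizes
params = (
    rng.normal(size=(HIDDEN, TEXT_DIM + KG_DIM)),
    np.zeros(HIDDEN),
    rng.normal(size=HIDDEN),
    0.0,
)
text_emb = rng.normal(size=TEXT_DIM)
# Placeholder ids, not real SNOMED identifiers.
candidates = {f"candidate_{i}": rng.normal(size=KG_DIM) for i in range(5)}

ranked = rank_candidates(text_emb, candidates, params)
for cid, score in ranked:
    print(f"{cid}: {score:.3f}")
```

With strict accuracy only the top-ranked candidate counts (`hit_at_k(ranked, gold_id, 1)`); raising `k` relaxes the criterion, which is why a relaxed score can only be equal to or higher than the strict one.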

Supervisors: Antonio Vetro', Giuseppe Futia, Giovanni Garifo
Academic year: 2023/24
Publication type: Electronic
Number of pages: 77
Subjects:
Degree programme: Master's degree programme in Mathematical Engineering
Degree class: New system > Master's degree > LM-44 - Mathematical and Physical Modelling for Engineering
Collaborating companies: Not specified
URI: http://webthesis.biblio.polito.it/id/eprint/29054