Design of a document retrieval system using Transformer-based models and a domain specific ontology

Emanuele Mottola

Design of a document retrieval system using Transformer-based models and a domain specific ontology.

Supervisors: Antonio Vetro', Juan Carlos De Martin, Giuseppe Futia. Politecnico di Torino, Master's degree programme in Ingegneria Informatica (Computer Engineering), 2020

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:

The scientific literature and internal research documents that an institution produces are a key source of information for its members. To access this material effectively and retrieve the needed information beyond a keyword-based approach, a Transformer-based language model tailored to the semiconductor supply chain domain is employed together with an ontology of the same domain, the Digital Reference [1], to build a document retrieval system over the pool of documents of the Infineon Corporate Supply Chain Innovation department. The Bidirectional Encoder Representations from Transformers (BERT) model [2] is further pre-trained on a text corpus drawn from the semiconductor supply chain literature and then used to power SentenceBERT [3] for creating sentence embeddings. By measuring the similarity score between the embedding of the query and the sentence embeddings of the documents, the system retrieves the documents relevant to the query posed by the user. With the same mechanism, the classes of the Digital Reference are annotated, resulting in an ontology populated with documents that are shown to the user according to the match between query keywords and class names. The first results of the system are presented: the F-measure reaches 0.58 and the mean Average Precision 0.45.
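
The retrieval and evaluation steps named in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the toy two-dimensional vectors (standing in for SentenceBERT embeddings), the use of cosine similarity as the similarity score, the choice to score a document by its best-matching sentence, and the balanced F1 form of the F-measure are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(
    query_emb: Sequence[float],
    doc_sentence_embs: Dict[str, List[Sequence[float]]],
) -> List[Tuple[str, float]]:
    """Score each document by its best-matching sentence (an assumed
    aggregation rule), then sort by score, highest first."""
    scores = []
    for doc_id, sentences in doc_sentence_embs.items():
        best = max((cosine_similarity(query_emb, s) for s in sentences), default=0.0)
        scores.append((doc_id, best))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def average_precision(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision@k taken at each rank k where a relevant document appears."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def f1_score(retrieved: Set[str], relevant: Set[str]) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    true_positives = len(retrieved & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(retrieved)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: real embeddings would come from a sentence encoder.
query = [1.0, 0.0]
documents = {
    "doc_a": [[0.9, 0.1], [0.0, 1.0]],
    "doc_b": [[0.0, 1.0]],
    "doc_c": [[0.5, 0.5]],
}
ranking = rank_documents(query, documents)
ranked_ids = [doc_id for doc_id, _ in ranking]
print(ranked_ids)
print(average_precision(ranked_ids, {"doc_a", "doc_b"}))
print(f1_score(set(ranked_ids[:2]), {"doc_a", "doc_b"}))
```

The mean Average Precision reported in the abstract would be the mean of `average_precision` over a set of evaluation queries; how the thesis chose the retrieval cut-off for the F-measure is not stated in the abstract, so the top-2 cut-off above is arbitrary.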

Supervisors: Antonio Vetro', Juan Carlos De Martin, Giuseppe Futia
Academic year: 2020/21
Publication type: Electronic
Number of Pages: 85
Subjects:
Degree programme: Master's degree programme in Ingegneria Informatica (Computer Engineering)
Degree class: New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING
Joint supervision institution: Karlsruhe Institute of Technology (Germany)
Collaborating companies: Infineon Technologies AG
URI: http://webthesis.biblio.polito.it/id/eprint/16055