
Analysis of semi-structured data based on Named Entity Recognition and Computer Vision techniques

Federico Lorenzo Pes

Analysis of semi-structured data based on Named Entity Recognition and Computer Vision techniques.

Supervisor: Luca Cagliero. Politecnico di Torino, Master's degree programme in Data Science And Engineering, 2023

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.


Extracting information from invoices is a highly recurrent task, which makes it a natural candidate for automation. The main challenge is that the text layout of an invoice may vary from issuer to issuer. We refer to this type of data as semi-structured. Hence, while rule-based techniques may provide excellent results for a given layout, they must be manually adapted to each new one. In Natural Language Processing (NLP), this task can be framed as Named Entity Recognition (NER), a token classification task that detects one or more tokens and assigns them a label corresponding to a real-world entity. Although word embeddings and transformer-based techniques dominate the NLP landscape, they struggle with this type of data, because the information to extract depends not only on the context of each word but also on the document's structure. The layout and the relative position of each word on the page are therefore important for extracting information. Recently, Graph Neural Networks (GNNs) have been applied to different fields of research, including NLP. The basic idea of this type of neural network is to build a graph from the dataset by defining nodes and edges. GNNs can exploit many different types of features to create a graph: nodes can represent words, while edges can represent any type of relationship. Furthermore, different tasks, such as node classification, can be performed with these networks. These models rely on a mechanism in which each node shares information with its linked nodes and updates its embedding using the information it receives. This concept allows us to treat each document as a graph in which every word is a node and edges connect each node to its closest nodes in the document. The focus of this project is to develop an end-to-end pipeline to extract entities from semi-structured data. The solution involves both Computer Vision and NLP, which are linked by Graph Neural Networks.
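The node update described above can be illustrated with a minimal sketch. It assumes a plain mean aggregation with no learned weights, which is a simplification; actual GNN layers use trainable parameters and non-linearities, and the abstract does not specify the architecture used in the thesis. The function name `message_passing_step` and the toy data are hypothetical.

```python
from typing import Dict, List


def message_passing_step(
    embeddings: Dict[int, List[float]],
    neighbors: Dict[int, List[int]],
) -> Dict[int, List[float]]:
    """One round of mean-aggregation message passing (illustrative only)."""
    updated: Dict[int, List[float]] = {}
    for node, own in embeddings.items():
        # Messages are the current embeddings of the linked nodes.
        messages = [embeddings[n] for n in neighbors.get(node, [])]
        if not messages:
            # An isolated node keeps its embedding unchanged.
            updated[node] = list(own)
            continue
        dim = len(own)
        mean_msg = [sum(m[i] for m in messages) / len(messages) for i in range(dim)]
        # Assumption: combine own embedding and neighborhood by a plain average.
        updated[node] = [(own[i] + mean_msg[i]) / 2 for i in range(dim)]
    return updated


# Toy document graph: three word nodes linked in a chain 0 - 1 - 2.
embeddings = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
updated = message_passing_step(embeddings, neighbors)
```

After one step, each node's embedding already mixes in information from its direct neighbors; repeating the step would propagate information from nodes further away.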
First, the scanned documents are passed to DocTR to obtain the words and their bounding boxes. The bounding boxes are used to model the document as a graph. Subsequently, word embedding models such as BERT and FastText are used to obtain a representation of each word. Finally, the node embeddings and the graph structure are passed to the Graph Neural Network to classify each token. The contributions of the thesis are a hand-crafted dataset of 1400 invoices and an analysis of performance on this task over a dataset with a heterogeneous composition of invoices, together with an overview of the different techniques for the three main steps: graph construction, token embedding creation, and graph architecture.
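As a rough illustration of the graph construction step, the sketch below links each word to its k nearest words by bounding-box center distance. This is only one possible strategy and is an assumption: the abstract says edges connect the closest nodes and that several construction techniques are compared, but it does not state which one is used. The `(x_min, y_min, x_max, y_max)` box format, the function names, and the sample words and coordinates are all hypothetical toy values, not DocTR output.

```python
import math
from typing import Dict, List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in relative page coordinates.
Box = Tuple[float, float, float, float]


def box_center(box: Box) -> Tuple[float, float]:
    """Return the center point of a bounding box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


def knn_graph(boxes: List[Box], k: int = 2) -> Dict[int, List[int]]:
    """Link every word to its k closest words, measured between box centers."""
    centers = [box_center(b) for b in boxes]
    graph: Dict[int, List[int]] = {}
    for i, ci in enumerate(centers):
        # Euclidean distance from word i to every other word.
        others = [(math.dist(ci, cj), j) for j, cj in enumerate(centers) if j != i]
        others.sort()
        graph[i] = [j for _, j in others[:k]]
    return graph


# Toy page: three words on one line near the top, one word near the bottom.
words = ["Invoice", "No.", "12345", "Total"]
boxes = [
    (0.10, 0.10, 0.20, 0.12),
    (0.22, 0.10, 0.26, 0.12),
    (0.28, 0.10, 0.36, 0.12),
    (0.10, 0.80, 0.18, 0.82),
]
graph = knn_graph(boxes, k=2)
```

The resulting adjacency lists, together with one embedding per word, are the kind of input a node classification model would consume. Note that the edges produced this way are directed: a word far from everything else still receives k outgoing links, but other words may not link back to it.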

Supervisors: Luca Cagliero
Academic year: 2022/23
Publication type: Electronic
Number of Pages: 74
Degree programme: Master's degree programme in Data Science And Engineering
Degree class: New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING
Collaborating companies: SPRINT REPLY S.R.L. CON UNICO SOCIO
URI: http://webthesis.biblio.polito.it/id/eprint/27739