Analysis of semi-structured data based on Named Entity Recognition and Computer Vision techniques

Federico Lorenzo Pes

Supervisor: Luca Cagliero. Politecnico di Torino, Master's degree programme in Data Science And Engineering, 2023

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract

Extracting information from invoices is a highly recurrent task, which makes it a natural candidate for automation. The main challenge is that the text layout of an invoice may vary from issuer to issuer. We refer to this type of data as semi-structured. Hence, while rule-based techniques may provide excellent results for a given layout, they must be manually adapted to each new one. In Natural Language Processing (NLP), this task can be framed as Named Entity Recognition (NER), a token classification task that detects one or more tokens and assigns them a label corresponding to a real-world entity.
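
The contrast between a layout-specific rule and token-level labelling can be illustrated with a minimal Python sketch. It is not taken from the thesis: the two invoice snippets, the regular expression, the label names (ISSUER, INVOICE_NUMBER, TOTAL) and the use of the common BIO tagging scheme are all assumptions made for illustration.

```python
import re

# Hypothetical invoice snippets from two issuers (invented for illustration).
INVOICE_A = "Invoice No: 2023-001\nDate: 12/03/2023\nTotal: 150.00 EUR"
INVOICE_B = "ACME S.r.l.\n2023-001 (invoice number)\nAmount due EUR 150.00"

# A rule written for issuer A's layout: the label "Total:" precedes the value.
TOTAL_RULE_A = re.compile(r"Total:\s*([\d.]+)")


def extract_total(text):
    """Return the total as a string, or None when the rule does not match."""
    match = TOTAL_RULE_A.search(text)
    return match.group(1) if match else None


def bio_tags(tokens, entity_spans):
    """Build one BIO tag per token.

    entity_spans maps (start, end) token index ranges (end exclusive)
    to an entity type; tokens outside every span are tagged "O".
    """
    tags = ["O"] * len(tokens)
    for (start, end), label in entity_spans.items():
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# The rule works on the layout it was written for and fails on the other one.
total_a = extract_total(INVOICE_A)
total_b = extract_total(INVOICE_B)

# NER view of the second invoice: whitespace tokens with hand-assigned spans
# (a trained model would predict these tags instead).
tokens_b = INVOICE_B.split()
tags_b = bio_tags(
    tokens_b,
    {(0, 2): "ISSUER", (2, 3): "INVOICE_NUMBER", (8, 9): "TOTAL"},
)

for token, tag in zip(tokens_b, tags_b):
    print(f"{token:12s} {tag}")
```

Here the rule returns a value for the first snippet and `None` for the second, whereas the tag sequence describes the entities independently of where they appear in the text. In practice the tags would come from a trained token classifier rather than from hand-written spans.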

While word-embedding and transformer-based techniques dominate the NLP landscape, they struggle with this type of data, since its interpretation depends not only on the context of each word but also on the document’s structure.