Removing personal information from structured documents: a graphic and text based solution

Antonio Madaro

Supervisors: Fabrizio Lamberti, Lia Morra, Valentina Gatteschi. Politecnico di Torino, Master's degree programme in Computer Engineering (Corso di laurea magistrale in Ingegneria Informatica), 2021

Abstract

In recent years, data analytics has been increasingly applied to scanned documents stored as images. However, documents of this kind contain a large amount of sensitive or potentially sensitive information, referred to as “PII” (Personally Identifiable Information) in compliance with the GDPR, the General Data Protection Regulation. The project presented in this Master's thesis was carried out in collaboration with Reale Mutua Assicurazioni and aims to develop a complete anonymization tool for scanned documents, in particular structured documents such as invoices. To handle the text contained in images, OCR (Optical Character Recognition) is used, and its output is processed and analyzed by a NER (Named Entity Recognition) module that recognizes several classes of information.

The NER module is based both on fuzzy regular expressions for pattern-based attributes and on specific recognition techniques, based on the semantic content of words, for free-structured attributes.
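
As an illustration of the pattern-based part only, the following is a minimal sketch and not the implementation used in the thesis. It assumes a hypothetical attribute (an 11-digit VAT number introduced by a "P.IVA" label) and approximates fuzzy matching with the standard `re` module, using character classes that tolerate a few common OCR confusions such as `O` read in place of `0`. The confusion table, the label pattern, and the function names are all assumptions made for this example.

```python
import re

# Assumed OCR confusions: letters that are often read in place of digits.
OCR_DIGIT_FIXES = str.maketrans(
    {"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"}
)

# A "digit" that also accepts the look-alike letters listed above.
FUZZY_DIGIT = r"[0-9OoIlSB]"

# Hypothetical pattern-based attribute: a label such as "P.IVA", "P. IVA"
# or "p iva", an optional colon, then exactly 11 (possibly misread) digits.
VAT_PATTERN = re.compile(
    r"((?i:p\.?\s?iva):?\s*)(" + FUZZY_DIGIT + r"{11})\b"
)


def find_vat_numbers(text):
    """Return the VAT-like values found in OCR text, with digits normalized."""
    return [m.group(2).translate(OCR_DIGIT_FIXES) for m in VAT_PATTERN.finditer(text)]


def redact_vat_numbers(text, mask="*"):
    """Replace each matched value with mask characters, keeping the label."""
    return VAT_PATTERN.sub(lambda m: m.group(1) + mask * len(m.group(2)), text)


if __name__ == "__main__":
    ocr_text = "Fattura n. 12 - P.IVA: 0123456789O Totale 100"
    print(find_vat_numbers(ocr_text))
    print(redact_vat_numbers(ocr_text))
```

Tolerant character classes only cover substitutions that keep the length of the value unchanged. Insertions or deletions introduced by OCR would require a true approximate matcher based on edit distance, which this sketch does not attempt.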