Removing personal information from structured documents: a graphic and text based solution.

Antonio Madaro

Supervisors: Fabrizio Lamberti, Lia Morra, Valentina Gatteschi. Politecnico di Torino, Master's degree programme in Computer Engineering (Corso di laurea magistrale in Ingegneria Informatica), 2021

Abstract:

Preserving privacy in image analytics: removing personal information from images

In recent years, data analytics has been increasingly applied to scanned documents stored as images. However, such documents contain a large amount of sensitive or potentially sensitive information, referred to as “PII” (Personally Identifiable Information), in compliance with the GDPR, the General Data Protection Regulation. The project presented in this Master's thesis was carried out in collaboration with Reale Mutua Assicurazioni and aims to develop a complete anonymization tool for scanned documents, in particular structured documents such as invoices. To handle text contained in images, OCR (Optical Character Recognition) is used, and its output is processed and analyzed by a NER (Named Entity Recognition) module that recognizes several classes of information. The NER module is based both on fuzzy regular expressions for pattern-based attributes and on specific recognition techniques based on the semantic content of words for free-structured attributes. Words detected as sensitive information by the NER module are obscured and replaced with their general category by the Anonymization module. However, in this type of document the graphical structure plays a fundamental role in deciphering the semantic content by grouping together related information, effectively replacing the role played by traditional grammar and syntax in natural language processing. For instance, contact information is interpreted differently depending on whether it is associated with a person or with a company. In addition, the presence of keywords can improve the accuracy of the NER module, especially in the case of OCR errors or ambiguities. For this reason, a document segmentation module based on a Deep Neural Network has been developed. It was trained on a synthetic dataset of structured documents produced by a custom document generator.
Combining the anonymization tool with the document segmentation tool allows a more selective anonymization, applied only to PII sections such as customer information, while preserving information related to companies. The tool was tested on both real and synthetic documents.
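
As a rough illustration of the pattern-based part of the pipeline described above, the following Python sketch replaces matches with their category label and applies the replacement only to segments marked as PII sections. It is not the thesis implementation: the standard `re` module with OCR-tolerant character classes stands in for fuzzy regular expressions, and the patterns, the category names (`FISCAL_CODE`, `EMAIL`), the segment labels, and the function names are all assumptions made for the example.

```python
import re

# Assumption: characters that OCR commonly confuses with digits
# (O for 0, I and l for 1, S for 5, B for 8) are accepted in digit positions.
DIGIT = r"[0-9OIlSB]"

# Illustrative patterns only; the abstract does not list the actual classes.
PATTERNS = {
    # Shape of an Italian fiscal code: 6 letters, 2 digits, letter,
    # 2 digits, letter, 3 digits, letter (16 characters in total).
    "FISCAL_CODE": re.compile(
        rf"\b[A-Z]{{6}}{DIGIT}{{2}}[A-Z]{DIGIT}{{2}}[A-Z]{DIGIT}{{3}}[A-Z]\b"
    ),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}

# Hypothetical segment labels, as a segmentation step might produce them.
PII_SECTIONS = {"customer"}


def anonymize(text: str) -> str:
    """Replace every pattern match with its category label in brackets."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def anonymize_segments(segments):
    """Anonymize only the segments whose label marks a PII section."""
    return [
        (label, anonymize(text) if label in PII_SECTIONS else text)
        for label, text in segments
    ]


if __name__ == "__main__":
    # The fiscal code below contains an OCR error: "8O" instead of "80".
    segments = [
        ("company", "Contact: info@acme.example"),
        ("customer", "CF: RSSMRA8OA01H501U mail mario.rossi@example.com"),
    ]
    for label, text in anonymize_segments(segments):
        print(f"{label}: {text}")
```

In this sketch the company segment is left untouched while the customer segment has its fiscal code and e-mail address replaced, which mirrors the selective behaviour described in the abstract. A real system would need a far richer set of classes and a more robust form of approximate matching.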

Supervisors: Fabrizio Lamberti, Lia Morra, Valentina Gatteschi
Academic year: 2020/21
Publication type: Electronic
Number of pages: 137
Additional information: Thesis under embargo. Full text not available
Subjects:
Degree programme: Master's degree programme in Computer Engineering (Corso di laurea magistrale in Ingegneria Informatica)
Degree class: New regulations (Nuovo ordinamento) > Master's degree (Laurea magistrale) > LM-32 - INGEGNERIA INFORMATICA
Collaborating companies: REALE MUTUA ASSICURAZIONI
URI: http://webthesis.biblio.polito.it/id/eprint/18166