Personal Data Detection in Free Text

Gabriele Gioetto

Personal Data Detection in Free Text.

Supervisor: Giuseppe Rizzo. Politecnico di Torino, Master's degree course in Data Science And Engineering, 2023

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:

An HR (Human Resources) department in a large organization receives inquiries and requests from employees on many, quite different topics. For example, an employee can send requests dealing with health conditions, compensation and taxation, or life events (marriage, death of a relative, ...). These data can support many queries that are useful for analysis purposes (for example: ‘How many people had COVID during 2021?’). However, HR tickets typically contain personal data, which cannot be processed without the consent of the data subject according to the European privacy regulation (GDPR). To be able to process documents containing personal data, we can identify the pieces of information that qualify as personal data in a communication and subsequently anonymize that information using appropriate techniques. A significant part of this problem lies in the complex nature of personal data according to GDPR: personal data are defined as ‘any piece of information that can be connected to an identified or identifiable natural person’. They comprise obvious identifiers such as social security numbers and email addresses, but also elements like ‘the Italian intern working for SAP in the South of France’. To the best of our knowledge, there is no public dataset of HR tickets that can be used to train machine learning models, the main reason being the difficult nature of this type of data. Synthetic data, which are artificial data generated from original data by a model trained to reproduce the characteristics and structure of the original data, follow a data-protection-by-design approach.
To address the need for a large dataset of HR tickets, we created a taxonomy of tickets, found real data that can be used as support to create synthetic tickets, and developed Ticket Generator, an application that can produce as many tickets as needed across different categories. We also released a dataset of previously created tickets and showcase some possible use cases of the dataset.

Supervisors: Giuseppe Rizzo
Academic year: 2022/23
Publication type: Electronic
Number of pages: 91
Subjects:
Degree course: Master's degree course in Data Science And Engineering
Degree class: New organization > Master's degree > LM-32 - INGEGNERIA INFORMATICA (Computer Engineering)
Joint supervision institution: INSTITUT EURECOM (France)
Collaborating companies: SAP Labs France
URI: http://webthesis.biblio.polito.it/id/eprint/26685