Privacy Preserving Data Mining: a distributed approach to data anonymization

Paolo Alberto

Privacy Preserving Data Mining: a distributed approach to data anonymization.

Supervisor: Paolo Garza. Politecnico di Torino, Master's degree programme in Data Science and Engineering, 2021

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract

As Data Science algorithms find more real-world applications, data privacy and the protection of sensitive information have become increasingly debated topics. This is especially true given the direction European legislation has taken on the protection of EU citizens' data. While some software solutions for data anonymization are already available on the market, none of them is well suited to Big Data applications. In this project we propose a distributed computing approach to data anonymization, leveraging the Apache Spark engine to run privacy-preserving algorithms inside a large-scale data processing environment.

We also explore the topic of data classification, with the goal of predicting the appropriate level of privacy when new data is uploaded to the system.