Paolo Alberto
Privacy Preserving Data Mining: a distributed approach to data anonymization.
Supervisor: Paolo Garza. Politecnico di Torino, Master's degree programme in Data Science and Engineering, 2021
PDF (Tesi_di_laurea, 6MB) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.
Abstract: |
With an increasing number of real-world applications of Data Science algorithms, data privacy and the protection of sensitive information have become increasingly debated topics. This is especially true given the direction taken by European legislation on the data protection of EU citizens. While some software solutions that perform data anonymization are already available on the market, none of them is well suited to Big Data applications. In this project we propose a distributed computing approach to data anonymization, leveraging the Apache Spark engine to run privacy-preserving algorithms inside a large-scale data processing environment. We will also explore the topic of data classification, with the goal of predicting the appropriate level of privacy when new data is uploaded to the system. The final product will be a software library capable of querying multiple data sources and applying the required algorithms to the result. These computations will be performed with two main goals in mind: protecting the sensitive data of individuals while preserving as much information as possible for analysts and data scientists to work with. |
---|---|
Supervisors: | Paolo Garza |
Academic year: | 2021/22 |
Publication type: | Electronic |
Number of Pages: | 79 |
Subjects: | |
Degree programme: | Master's degree programme in Data Science and Engineering |
Degree class: | New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING |
Collaborating companies: | Agile Lab S.r.l. |
URI: | http://webthesis.biblio.polito.it/id/eprint/21215 |
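
The abstract names two competing goals: protecting the sensitive data of individuals and preserving as much information as possible for analysts. The record does not say which anonymization algorithms the thesis implements, so the following is only a minimal sketch of one well-known technique, k-anonymity through generalization and suppression, written in plain Python rather than Apache Spark. The record layout, the field names (`age`, `zip`, `diagnosis`) and the helper functions are all hypothetical and chosen for illustration. In a distributed engine, the per-group counting step would presumably correspond to a group-by aggregation.

```python
from collections import Counter

# Hypothetical helpers: none of these names come from the thesis.

def generalize_age(age, width=10):
    """Replace an exact age with a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_zip(zip_code, keep=3):
    """Keep the first `keep` characters of a postal code and mask the rest."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

def quasi_id(record):
    """The generalized attributes that could be linked back to a person."""
    return (record["age"], record["zip"])

def anonymize(records, k):
    """Generalize quasi-identifiers, then suppress groups smaller than k."""
    generalized = [
        {
            "age": generalize_age(r["age"]),
            "zip": generalize_zip(r["zip"]),
            "diagnosis": r["diagnosis"],  # sensitive value, kept for analysis
        }
        for r in records
    ]
    group_sizes = Counter(quasi_id(r) for r in generalized)
    return [r for r in generalized if group_sizes[quasi_id(r)] >= k]

def is_k_anonymous(records, k):
    """True if every quasi-identifier combination occurs at least k times."""
    sizes = Counter(quasi_id(r) for r in records)
    return all(n >= k for n in sizes.values())

# Made-up sample data, for illustration only.
records = [
    {"age": 34, "zip": "10121", "diagnosis": "flu"},
    {"age": 37, "zip": "10129", "diagnosis": "asthma"},
    {"age": 31, "zip": "10125", "diagnosis": "flu"},
    {"age": 52, "zip": "20133", "diagnosis": "diabetes"},
]

released = anonymize(records, k=2)
```

With `k=2`, the three records that share the generalized pair `("30-39", "101**")` are released, while the single record in the `("50-59", "201**")` group is suppressed. Wider age bands or shorter postal prefixes would make groups larger and reduce suppression, at the cost of less precise data for analysts, which is the trade-off the abstract describes.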