Politecnico di Torino

Design and implementation of a real time data lake in cloud

Vincenzo Siciliani

Design and implementation of a real time data lake in cloud.

Supervisor: Tania Cerquitelli. Politecnico di Torino, Master's degree course (Corso di laurea magistrale) in Ingegneria Informatica (Computer Engineering), 2021

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.


This thesis is based on my work experience in a project carried out by NTT Data Italia for one of its major clients in the media sector: the design and implementation of a real-time Big Data platform in a cloud environment. The goal of the project is to guide the client in evolving its data management technologies by migrating from an architecture based on several data warehouses to a single data lake that centralizes all data. The new platform allows the client's business users to perform their analyses more easily, quickly, and accurately, and enables data scientists to develop prediction models that join data from different sources and departments. To achieve these results, we analyzed the AS-IS architecture of the client's databases, the final requirements, and how to implement them on a cloud platform such as Google Cloud Platform, of which the client is a partner, using the features of the tools made available by the cloud provider in terms of availability, scalability, security, and cost optimization. The thesis explains the architectural and data model choices that led to the creation of the different logical and abstraction levels on the data, and how the platform's distributed computing software components (batch and streaming) were implemented. In addition, the solutions implemented to manage data anonymization (GDPR) and data lineage are detailed. Finally, the thesis presents the CI/CD methodologies used to deploy new code or new analysis flows quickly (while ensuring backward compatibility) and the monitoring solutions adopted to check the status and performance of the platform in real time, in order to ensure the correctness and freshness of the data shown to business end users.

Supervisors: Tania Cerquitelli
Academic year: 2021/22
Publication type: Electronic
Number of Pages: 83
Degree course: Master's degree course (Corso di laurea magistrale) in Ingegneria Informatica (Computer Engineering)
Degree class: New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING
Collaborating companies: NTT DATA Italia
URI: http://webthesis.biblio.polito.it/id/eprint/20576