A Modern Solution for Big Data Management: the Data Lake

Matteo Garbarino

Supervisor: Paolo Garza. Politecnico di Torino, Master's degree programme in Data Science And Engineering, 2024

Abstract:

The ever-growing diffusion and complexity of digital technologies since their introduction in the 20th century have led to an exponential increase in generated and stored data. Nowadays, every institution and company must adopt data storage and processing techniques that align with its business needs. In data-intensive contexts, where data plays a pivotal role in operations, the challenge of big data management has emerged. Standard off-the-shelf databases often fall short of the technical requirements and performance demands of these scenarios. The need for specialized solutions tailored to modern challenges has therefore become apparent, leading to the introduction of a wide variety of new technologies. This thesis analyzed the most prevalent big data solutions, ultimately focusing on one of the latest advancements in the field: the data lake. It provided an overview of various design paradigms along with their respective advantages, and presented the technical details of the technologies that can be employed. Furthermore, it outlined a practical use case for this technology and reported the development phases required to design and implement new data lake components. In a subsequent phase of the work, the processing infrastructure was monitored through a dedicated interactive dashboard. The development phases were described not only from the standpoint of a data specialist but also with emphasis on the adopted software engineering practices. This encompassed end-to-end data pipeline curation, from the initial design to subsequent implementation, testing, and maintenance. The objective of constructing a cutting-edge big data storage and processing system was attained, ensuring the execution of an efficient and reliable data pipeline. Additionally, a complementary monitoring solution was successfully developed, whose design made it possible to observe the most salient aspects of the data transformations and to proactively detect potential malfunctions.

Supervisors: Paolo Garza
Academic year: 2023/24
Publication type: Electronic
Number of Pages: 91
Additional Information: Thesis under embargo. Full text not available
Subjects:
Degree programme: Master's degree programme in Data Science And Engineering
Degree class: New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING
Collaborating companies: UNSPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/31018