Design and implementation of data science pipelines: a new paradigm based on analytics engineers

Ferdinando Micco

Design and implementation of data science pipelines: a new paradigm based on analytics engineers.

Supervisor: Paolo Garza. Politecnico di Torino, Master's degree programme in Computer Engineering (Ingegneria Informatica), 2023

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:

Data is an increasingly critical strategic asset for companies in every sector and of every size. Without a solid foundation of analytics engineering, an organization risks poor-quality data, manual and fragmented processes, unreliable analyses, and long delivery times. Fortunately, there are tools that help implement analytics engineering best practices efficiently and at scale. One of these is dbt (data build tool), an open-source platform that simplifies the transformation, documentation, and testing of data models. The main focus of the thesis is to implement a modern pipeline solution that incorporates all the best practices of analytics engineering. The inclusion of an analytics engineer within a data team represents a new paradigm in data-driven organizations. The study aims to show that such a solution is feasible and that adopting it can bring increased efficiency, higher-quality data, and faster time to insights. Moreover, this project served as the starting point for a collaboration with a company that has specific data quality requirements. The collaboration provided valuable insights into the practical implementation of the pipeline and helped tailor the approach to the company's data quality needs. The proposed solution will use cutting-edge tools and techniques, such as dbt, to transform, document, and test data models. The whole architecture will be implemented as a serverless solution on a cloud computing platform to provide the required elasticity, scalability, and cost-effectiveness. The improved reliability of data analysis, coupled with the faster time to insights, will allow organizations to make data-driven decisions more quickly and confidently.

Supervisors: Paolo Garza
Academic year: 2022/23
Publication type: Electronic
Number of pages: 60
Subjects:
Degree programme: Master's degree programme in Computer Engineering (Ingegneria Informatica)
Degree class: New system > Master's degree > LM-32 - COMPUTER ENGINEERING
Collaborating companies: DATA Reply S.r.l. con Unico Socio
URI: http://webthesis.biblio.polito.it/id/eprint/27737