Declarative Data Pipelines: implementing a logical model through automated code generation

Matteo Donadio. Declarative Data Pipelines: implementing a logical model through automated code generation. Supervisor: Paolo Garza. Politecnico di Torino, Master's degree programme in Data Science And Engineering, 2024

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:

The design and operation of data pipelines that extract, transform, and store large data sets are central to data engineering. This thesis, developed in collaboration with Agile Lab S.r.l., introduces a logical model that establishes a clear, standardized approach to data pipeline architecture: a structured framework for defining entities, their interrelationships, and the operational rules needed to build effective and reliable data pipelines. To bridge the gap between the theoretical model and practical implementation, a tool has also been implemented that automates the generation of executable pipeline code and is designed to be independent of any specific data management tool. It follows a declarative programming approach and currently generates Python code for Apache Airflow, while remaining adaptable to other technologies as needed. By abstracting away the complexities of configuration, it lets data engineers focus on specifying goals and pipeline logic, significantly improving development efficiency and reducing the likelihood of errors. The usefulness of the model and its accompanying tool is demonstrated through a real-world use case: building a COVID-19 data analytics pipeline. This example shows that the tool adheres to the logical model, efficiently translates high-level design specifications into operational workflows, and enforces model-imposed constraints such as acyclicity, non-concurrency, and idempotency, ensuring the robustness and scalability of the pipeline.
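
To make the declarative idea in the abstract concrete, the following is a minimal sketch of how a pipeline specification might be checked for acyclicity and then turned into Airflow source code. It is not the thesis tool: the `SPEC` layout and the functions `validate_acyclic` and `generate_airflow_code` are hypothetical names chosen for illustration, and the emitted code assumes Airflow 2.4 or later (`EmptyOperator`, the `schedule` argument). Only the acyclicity constraint is illustrated; non-concurrency and idempotency are not covered.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical declarative spec: each task maps to the tasks it depends on.
# Task names are assumed to be valid Python identifiers.
SPEC = {
    "dag_id": "covid_analytics",
    "tasks": {
        "extract": [],
        "transform": ["extract"],
        "load": ["transform"],
    },
}


def validate_acyclic(tasks):
    """Return the tasks in dependency order, or raise ValueError on a cycle."""
    try:
        return list(TopologicalSorter(tasks).static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of nodes that form the detected cycle.
        raise ValueError(f"pipeline contains a cycle: {exc.args[1]}") from exc


def generate_airflow_code(spec):
    """Emit Airflow DAG source code (as a string) for a validated spec."""
    order = validate_acyclic(spec["tasks"])
    lines = [
        "from datetime import datetime",
        "",
        "from airflow import DAG",
        "from airflow.operators.empty import EmptyOperator",
        "",
        f"with DAG(dag_id={spec['dag_id']!r}, start_date=datetime(2024, 1, 1),",
        "         schedule=None, catchup=False) as dag:",
    ]
    # Placeholder operators stand in for the real extract/transform/load steps.
    for name in order:
        lines.append(f"    {name} = EmptyOperator(task_id={name!r})")
    # One dependency edge per declared upstream task.
    for name in order:
        for upstream in spec["tasks"][name]:
            lines.append(f"    {upstream} >> {name}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(generate_airflow_code(SPEC))
```

Because the generator only produces text, the sketch runs without Airflow installed; the engineer edits the specification, and a cyclic dependency is rejected before any code is emitted.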

Supervisors: Paolo Garza
Academic year: 2023/24
Publication type: Electronic
Number of pages: 69
Subjects:
Degree programme: Master's degree programme in Data Science And Engineering
Degree class: New system > Master's degree > LM-32 - COMPUTER ENGINEERING
Collaborating companies: Agile Lab S.r.l.
URI: http://webthesis.biblio.polito.it/id/eprint/31795