Declarative Data Pipelines: implementing a logical model through automated code generation

Matteo Donadio

Declarative Data Pipelines: implementing a logical model through automated code generation.

Supervisor: Paolo Garza. Politecnico di Torino, Master's degree programme in Data Science and Engineering, 2024

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:

Designing and operating data pipelines that extract, transform, and store large data sets is central to data engineering. This thesis, developed in collaboration with Agile Lab S.r.l., introduces a logical model that establishes a clear, standardized approach to data pipeline architecture: a structured framework for defining entities, their interrelationships, and the operational rules needed to build effective and reliable pipelines. To bridge the gap between the theoretical model and practical implementation, the thesis also implements a tool that automates the generation of executable pipeline code and is designed to be independent of any specific data management tool. The tool takes a declarative approach: it generates Python code for Apache Airflow while remaining flexible enough to target other technologies as needed. By abstracting away configuration complexity, it lets data engineers focus on specifying goals and pipeline logic, improving development efficiency and reducing the likelihood of errors. The usefulness of the model and its accompanying tool is demonstrated through a real-world use case, the construction of a COVID-19 data analytics pipeline. The example shows that the tool adheres to the logical model, efficiently translates high-level design specifications into operational workflows, and enforces model-imposed constraints such as acyclicity, non-concurrency, and idempotency, which support the robustness and scalability of the pipeline.
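To make the declarative idea concrete, the following is a minimal sketch and not the thesis tool itself. It assumes a hypothetical dictionary-based specification (`SPEC`, with `dag_id`, `command`, and `depends_on` keys invented for this illustration), checks the acyclicity constraint with Python's standard `graphlib`, and renders the source text of an Apache Airflow DAG file. The generated text assumes Airflow 2.4 or a later 2.x release (`schedule=` parameter, `airflow.operators.bash.BashOperator`); `max_active_runs=1` is used here only as one plausible way to express non-concurrency, and idempotency is not addressed.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical declarative spec: each task declares a shell command and its
# upstream tasks. Task names are assumed to be valid Python identifiers.
SPEC = {
    "dag_id": "covid_analytics",
    "tasks": {
        "extract": {"command": "echo extract", "depends_on": []},
        "transform": {"command": "echo transform", "depends_on": ["extract"]},
        "load": {"command": "echo load", "depends_on": ["transform"]},
    },
}


def ordered_tasks(spec):
    """Return task names in dependency order.

    Raises ValueError if a dependency is unknown or the graph has a cycle.
    """
    tasks = spec["tasks"]
    for name, task in tasks.items():
        for dep in task["depends_on"]:
            if dep not in tasks:
                raise ValueError(f"task {name!r} depends on unknown task {dep!r}")
    graph = {name: set(task["depends_on"]) for name, task in tasks.items()}
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        # args[1] holds the list of nodes that form the detected cycle.
        raise ValueError(f"pipeline is not acyclic: {exc.args[1]}") from exc


def generate_airflow_code(spec):
    """Render the spec as the source text of an Airflow DAG file."""
    order = ordered_tasks(spec)  # validates the spec before any code is emitted
    lines = [
        "from datetime import datetime",
        "",
        "from airflow import DAG",
        "from airflow.operators.bash import BashOperator",
        "",
        f"with DAG(dag_id={spec['dag_id']!r}, start_date=datetime(2024, 1, 1), "
        "schedule=None, catchup=False, max_active_runs=1) as dag:",
    ]
    for name in order:
        command = spec["tasks"][name]["command"]
        lines.append(
            f"    {name} = BashOperator(task_id={name!r}, bash_command={command!r})"
        )
    for name in order:
        for dep in spec["tasks"][name]["depends_on"]:
            lines.append(f"    {dep} >> {name}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(generate_airflow_code(SPEC))
```

Because the generator only emits text, it runs without Airflow installed; the output would need to be placed in an Airflow DAGs folder to be scheduled. Keeping validation (`ordered_tasks`) separate from rendering means a different backend could reuse the same checks, which mirrors the tool-independence goal described in the abstract.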

Supervisor: Paolo Garza
Academic year: 2023/24
Publication type: Electronic
Number of Pages: 69
Subjects:
Degree programme: Master's degree programme in Data Science and Engineering
Degree class: New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING
Collaborating companies: Agile Lab S.r.l.
URI: http://webthesis.biblio.polito.it/id/eprint/31795