Matteo Donadio
Declarative Data Pipelines: implementing a logical model through automated code generation.
Supervisor: Paolo Garza. Politecnico di Torino, Master's degree programme in Data Science And Engineering, 2024
License: Creative Commons Attribution Non-commercial No Derivatives.
Abstract: |
The design and operation of data pipelines that extract, transform, and store large data sets are central to data engineering. This thesis, developed in collaboration with Agile Lab S.r.l., introduces a logical model that establishes a clear and standardized approach to data pipeline architecture. The model provides a structured framework for defining entities, their interrelationships, and the operational rules needed to build effective and reliable data pipelines. To bridge the gap between the theoretical model and practical implementation, a tool has also been implemented that automates the generation of executable pipeline code and is designed to be independent of any specific data management tool. The tool follows a declarative programming approach: it generates Python code for Apache Airflow while remaining flexible enough to adapt to other technologies as needed. By abstracting away the complexities of configuration, it lets data engineers focus on specifying goals and pipeline logic, which improves development efficiency and reduces the likelihood of errors. The usefulness of the model and its accompanying tool is demonstrated through a real-world use case: building a COVID-19 data analytics pipeline. This example shows that the tool adheres to the logical model, efficiently translates high-level design specifications into operational workflows, and enforces model-imposed constraints such as acyclicity, non-concurrency, and idempotency, which support the robustness and scalability of the pipeline. |
---|---|
Supervisors: | Paolo Garza |
Academic year: | 2023/24 |
Publication type: | Electronic |
Number of Pages: | 69 |
Subjects: | |
Degree programme: | Master's degree programme in Data Science And Engineering |
Degree class: | New organization > Master science > LM-32 - COMPUTER SYSTEMS ENGINEERING |
Collaborating companies: | Agile Lab S.r.l. |
URI: | http://webthesis.biblio.polito.it/id/eprint/31795 |
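
The abstract mentions a declarative specification from which pipeline code is generated, with acyclicity enforced as a model constraint. The following is only a minimal illustrative sketch of that general idea, not the thesis's actual tool: the specification format, the task names, and the functions `topological_order` and `render_dependencies` are all hypothetical. It checks acyclicity with Kahn's algorithm and renders each dependency as a line of text in the `upstream >> downstream` notation that Apache Airflow uses; it does not import or run Airflow.

```python
from collections import deque

# Hypothetical declarative spec (an assumption for illustration):
# each task name maps to the list of tasks it depends on.
PIPELINE_SPEC = {
    "extract_cases": [],
    "clean_cases": ["extract_cases"],
    "aggregate_by_region": ["clean_cases"],
    "store_report": ["aggregate_by_region"],
}


def topological_order(spec):
    """Return tasks in a valid execution order.

    Raises ValueError if a dependency is undeclared or the graph has a cycle.
    """
    for task, deps in spec.items():
        for dep in deps:
            if dep not in spec:
                raise ValueError(f"{task!r} depends on unknown task {dep!r}")

    # Number of unmet dependencies per task, and the reverse edges.
    remaining = {task: len(deps) for task, deps in spec.items()}
    dependents = {task: [] for task in spec}
    for task, deps in spec.items():
        for dep in deps:
            dependents[dep].append(task)

    # Kahn's algorithm: repeatedly schedule tasks with no unmet dependencies.
    ready = deque(task for task, count in remaining.items() if count == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for child in dependents[task]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)

    # Any task never scheduled is part of (or downstream of) a cycle.
    if len(order) != len(spec):
        raise ValueError("pipeline specification contains a cycle")
    return order


def render_dependencies(spec):
    """Validate the spec, then emit one 'upstream >> downstream' line per edge."""
    topological_order(spec)
    return [f"{dep} >> {task}" for task, deps in spec.items() for dep in deps]
```

Calling `render_dependencies(PIPELINE_SPEC)` returns the dependency lines as strings, while a specification such as `{"a": ["b"], "b": ["a"]}` is rejected with a `ValueError` before any text is generated. Non-concurrency and idempotency, the other constraints named in the abstract, are not covered by this sketch.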