
Dynamic Data Processing Pipelines: A Framework for Modular and Scalable Systems

Danial Soltanali Khalili

Dynamic Data Processing Pipelines: A Framework for Modular and Scalable Systems.

Supervisors: Alessandro Aliberti, Edoardo Patti. Politecnico di Torino, Master's degree programme in Digital Skills For Sustainable Societal Transitions, 2025

PDF (Tesi_di_laurea) - Thesis
Restricted access: staff only until 25 February 2026 (embargo date).
License: Creative Commons Attribution Non-commercial No Derivatives.

Download (2MB)
Abstract:

This thesis introduces a framework for creating and managing dynamic data processing pipelines, focused on generating self-sufficient executable Python code. The framework is designed for the efficient description, compatibility checking, and execution of data processing tasks, and addresses the need for flexible pipeline creation in modern data environments, where existing tools often fall short in adaptability. Key aspects of the framework include:

- Dynamic pipeline creation based on user inputs, allowing workflows to be adjusted quickly as data sources evolve.
- Ease of creating new processing blocks, which are self-contained executable Python scripts, enabling the system to expand its capabilities over time.
- Handling of temporal data, particularly time series data, which is crucial for the GAIA platform for which the framework was initially developed.
- Management of dependencies and contradictions between processing blocks.
- Compatibility checks between data types to prevent errors and maintain data integrity.
- Comprehensive logging for monitoring, diagnosis, and auditing.

The framework uses a modular architecture comprising internal components, such as a Block Generator, a DAG (Directed Acyclic Graph) Generator, and APIs, as well as external components such as MongoDB and PyPIServer. The Block Generator transforms Python scripts into functional processing nodes, and the DAG Generator constructs pipeline execution flows; Apache Airflow handles workflow orchestration. The system's design follows a dual-user approach: experts create processing blocks, and consumers use them to build complex pipelines. The implementation includes detailed processes for block generation, block description, pipeline description, and pipeline evaluation. The system also features a flexible adapter that retrieves data from various sources, including GAIA, SQL databases, and other APIs, with runtime type checking.
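To make the idea of typed, composable processing blocks concrete, here is a minimal sketch. It is not the thesis' actual API: the `Block` dataclass, `check_compatibility`, and the example blocks are hypothetical stand-ins for the Block Generator's output and the framework's compatibility checks between data types.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Block:
    """A self-contained processing node with declared input/output types.

    Hypothetical sketch: in the thesis, blocks are executable Python
    scripts produced by the Block Generator; here a block is just a
    typed callable so the compatibility check can be shown end to end.
    """
    name: str
    input_type: type
    output_type: type
    run: Callable[[Any], Any]

def check_compatibility(blocks: list[Block]) -> None:
    """Reject a pipeline whose adjacent blocks have mismatched types."""
    for a, b in zip(blocks, blocks[1:]):
        if a.output_type is not b.input_type:
            raise TypeError(
                f"{a.name} -> {b.name}: "
                f"{a.output_type.__name__} != {b.input_type.__name__}"
            )

def run_pipeline(blocks: list[Block], data: Any) -> Any:
    """Validate the chain, then feed each block's output to the next."""
    check_compatibility(blocks)
    for block in blocks:
        data = block.run(data)
    return data

# Two toy blocks: parse a CSV string into floats, then average them.
parse = Block("parse", str, list, run=lambda s: [float(x) for x in s.split(",")])
mean = Block("mean", list, float, run=lambda xs: sum(xs) / len(xs))

print(run_pipeline([parse, mean], "1,2,3"))  # → 2.0
```

Checking types once, before execution, is what lets an invalid pipeline fail at composition time rather than mid-run, which is the property the abstract attributes to the framework's compatibility checks.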
The thesis evaluates the framework's performance in dynamic pipeline creation, ease of block creation, handling of temporal data, management of dependencies, and compatibility checking. It also explores the system's practical applications, potential impact, and areas for future improvement.
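The dependency management evaluated above reduces, at its core, to ordering blocks so that every dependency runs first and rejecting cyclic (non-DAG) pipelines. The sketch below uses Kahn's algorithm; the `dag_order` function and its edge format are illustrative assumptions, not the thesis' DAG Generator.

```python
from collections import deque

def dag_order(edges: dict[str, list[str]]) -> list[str]:
    """Topologically sort blocks, raising if dependencies form a cycle.

    `edges` maps a block name to the blocks that consume its output
    (a hypothetical encoding of the pipeline's dependency graph).
    """
    nodes = set(edges) | {d for ds in edges.values() for d in ds}
    indegree = {n: 0 for n in nodes}
    for downstream in edges.values():
        for d in downstream:
            indegree[d] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    order: list[str] = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for d in edges.get(n, []):
            indegree[d] -= 1
            if indegree[d] == 0:
                queue.append(d)
    if len(order) != len(nodes):
        raise ValueError("cycle detected: pipeline is not a DAG")
    return order

print(dag_order({"load": ["clean"], "clean": ["aggregate"]}))
# → ['load', 'clean', 'aggregate']
```

A valid ordering is exactly what an orchestrator such as Apache Airflow needs to schedule the generated pipeline, and the cycle check is one way the "contradictions between processing blocks" mentioned in the abstract can be caught before execution.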

Supervisors: Alessandro Aliberti, Edoardo Patti
Academic year: 2024/25
Publication type: Electronic
Number of pages: 60
Subjects:
Degree programme: Master's degree programme in Digital Skills For Sustainable Societal Transitions
Degree class: New system > Master's degree > LM-91 - Techniques and Methods for the Information Society
Collaborating companies: ALPHAWAVES S.R.L.
URI: http://webthesis.biblio.polito.it/id/eprint/34439