Adaptation and Implementation of an automated Corporate ETL Framework on Microsoft Fabric: A Technical Approach to Integrating Data Workflows into a Modern Cloud Platform

Guillermo Jose' Gallucci

Adaptation and Implementation of an automated Corporate ETL Framework on Microsoft Fabric: A Technical Approach to Integrating Data Workflows into a Modern Cloud Platform.

Supervisor: Paolo Garza. Politecnico di Torino, Master's degree programme in Data Science and Engineering, 2025

License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:	The exponential growth of data generated by computational systems, users, and devices has introduced significant challenges for real-time processing, resource optimization, analysis, and the provision of training material for AI systems. On-premise infrastructures, while capable of high computational power, are constrained by hardware and tool configuration, which limits scalability and collaboration. To address these limitations, this research focuses on Microsoft Fabric, a cloud-based SaaS platform that integrates data processing and real-time analytics. The research, carried out in partnership with Mediamente Consulting, aims to adapt and implement the corporate automated data integration framework within Microsoft Fabric and to evaluate its performance relative to the on-premise version. The study also seeks to identify a best-practice architecture among the various solutions that Fabric provides. Data from CSV and Excel sources is ingested into OneLake, Microsoft Fabric’s centralized storage. The pipeline loads the data into Delta tables within a data lake architecture. Downstream processing layers are designed for incremental loads: they propagate only new or updated records, which are then checked for compliance with data quality and referential integrity constraints. The final stage has three implementations for performance comparison: Spark-based, SQL-based, and Dataflows-based. The Spark- and SQL-based implementations performed markedly better than the Dataflows implementation. Nevertheless, overall performance in Fabric was worse than that of the on-premise version, except on large datasets, where Fabric outperformed it. These results are attributed to Fabric’s optimized cloud engines: the SQL engine benefits from query optimization, Spark excels at distributed computation on very large datasets, and Dataflows incurs overhead from its low-code abstraction layer, resulting in lower performance.
Supervisors:	Paolo Garza
Academic year:	2025/26
Publication type:	Electronic
Number of pages:	118
Subjects:
Degree programme:	Master's degree programme in Data Science and Engineering
Degree class:	New system > Master's degree > LM-32 - COMPUTER ENGINEERING
Collaborating companies:	Mediamente Consulting srl
URI:	http://webthesis.biblio.polito.it/id/eprint/37851
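
The incremental-load step described in the abstract (propagate only new or updated records, then check referential integrity before writing) can be illustrated with a minimal sketch. This is not the thesis's code: it uses plain Python and an in-memory dictionary in place of Delta tables, and every name in it (`Record`, `customer_id`, `run_increment`, the watermark on `updated_at`) is a hypothetical chosen for illustration. On Fabric the same idea would typically be expressed as a merge into a Delta table from Spark or SQL.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Record:
    key: int              # business key of the row (hypothetical)
    customer_id: int      # foreign key to a hypothetical customer table
    amount: float
    updated_at: datetime  # last-modified timestamp used as the watermark


def select_incremental(source, watermark):
    """Keep only rows modified after the last successful load."""
    return [r for r in source if r.updated_at > watermark]


def split_by_integrity(rows, valid_customer_ids):
    """Separate rows whose foreign key resolves from rows that violate it."""
    accepted = [r for r in rows if r.customer_id in valid_customer_ids]
    rejected = [r for r in rows if r.customer_id not in valid_customer_ids]
    return accepted, rejected


def upsert(target, rows):
    """Insert new keys and overwrite existing ones (merge semantics)."""
    merged = dict(target)
    for r in rows:
        merged[r.key] = r
    return merged


def run_increment(target, source, watermark, valid_customer_ids):
    """Run one incremental load and return the new state.

    Assumption: the watermark advances past rejected rows as well, so
    they must be quarantined by the caller rather than retried.
    """
    changed = select_incremental(source, watermark)
    accepted, rejected = split_by_integrity(changed, valid_customer_ids)
    new_watermark = max((r.updated_at for r in changed), default=watermark)
    return upsert(target, accepted), rejected, new_watermark
```

Calling `run_increment` with an empty target, the previous watermark, and the set of known customer identifiers returns the merged target, the rows that failed the integrity check, and the watermark to store for the next run.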
