Andrea Ferretti
Distributed Arrow-based Shuffle Operation using Arrow Flight RPC.
Supervisors: Paolo Garza, Andrea Fonti. Politecnico di Torino, Master's degree programme in Data Science and Engineering, 2023
License: Creative Commons Attribution Non-commercial No Derivatives.
Abstract
In the era of big data, the ability to efficiently process and analyze massive datasets is crucial for businesses and organizations across various domains. To meet the demands of processing such voluminous data, distributed data processing frameworks have emerged as powerful tools. These frameworks leverage the parallelism of distributed computing clusters to execute complex computations in a scalable and efficient manner. One key operation in these frameworks is the shuffle step, which involves redistributing and grouping data across nodes to enable subsequent data transformations and aggregations. Apache Arrow, an open-source in-memory data format and associated libraries, has gained significant attention in the data processing community due to its columnar representation and efficient memory utilization.
DataFusion, an open-source query engine built on Apache Arrow, has also emerged in the field of distributed data processing thanks to its performance optimizations and compatibility with the Arrow data format.
