Andrea Ferretti
Distributed Arrow-based Shuffle Operation using Arrow Flight RPC.
Supervisors: Paolo Garza, Andrea Fonti. Politecnico di Torino, Master of Science program in Data Science and Engineering, 2023
PDF (Tesi_di_laurea), Thesis, 3MB
Licence: Creative Commons Attribution Non-commercial No Derivatives.
Abstract
In the era of big data, the ability to efficiently process and analyze massive datasets is crucial for businesses and organizations across various domains. To meet the demands of processing such voluminous data, distributed data processing frameworks have emerged as powerful tools. These frameworks leverage the parallelism of distributed computing clusters to execute complex computations in a scalable and efficient manner. One key operation in these frameworks is the shuffle step, which involves redistributing and grouping data across nodes to enable subsequent data transformations and aggregations. Apache Arrow, an open-source in-memory data format and associated libraries, has gained significant attention in the data processing community due to its columnar representation and efficient memory utilization.
DataFusion, an open-source query engine built on Apache Arrow, has also been emerging in the field of distributed data processing due to its performance optimizations and compatibility with the Arrow data format.
