
Distributed Arrow-based Shuffle Operation using Arrow Flight RPC

Andrea Ferretti

Distributed Arrow-based Shuffle Operation using Arrow Flight RPC.

Supervisors: Paolo Garza, Andrea Fonti. Politecnico di Torino, Master's degree programme in Data Science And Engineering, 2023

PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives.

Abstract:

In the era of big data, the ability to efficiently process and analyze massive datasets is crucial for businesses and organizations across many domains. Distributed data processing frameworks have emerged to meet this demand: they exploit the parallelism of distributed computing clusters to execute complex computations in a scalable and efficient manner. One key operation in these frameworks is the shuffle step, which redistributes and groups data across nodes to enable subsequent transformations and aggregations. Apache Arrow, an open-source in-memory data format with associated libraries, has gained significant attention in the data processing community thanks to its columnar representation and efficient memory utilization. DataFusion, an open-source query engine built on Apache Arrow, has also been emerging in the field of distributed data processing thanks to its performance optimizations and compatibility with the Arrow data format. DataFusion offers substantial performance improvements over traditional query engines. However, it still relies on a centralized shuffle step, which can become a performance bottleneck in certain scenarios. To address this limitation, this thesis explores the feasibility of implementing a distributed shuffle step in DataFusion using Arrow Flight RPC, thereby enabling better scalability. Arrow Flight RPC, also based on Apache Arrow, is a high-performance remote procedure call framework designed for efficient data transfer between processes and machines. The primary objective of this thesis is to develop a prototype implementation that demonstrates the potential of using Arrow Flight RPC for distributed shuffling.
Although achieving a fully distributed shuffle step in DataFusion within the scope of this work may be ambitious, the prototype implementation serves as a foundation for evaluating the performance gains, limitations, and challenges of the distributed shuffle approach using Arrow Flight RPC. While the primary objective of distributing DataFusion's shuffle step using Arrow Flight RPC was not accomplished within the scope of this thesis, the prototype provides a solid foundation for further exploration and development. The findings contribute to the understanding of how Arrow Flight RPC can be leveraged for distributed shuffle operations and pave the way for future research and optimization in the field of distributed data processing. In summary, this thesis laid the groundwork for a distributed shuffle step in DataFusion using Arrow Flight RPC: the prototype showcased the potential benefits of the approach and offered insights into its performance characteristics.
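
To make the shuffle step mentioned in the abstract concrete, the following is a minimal illustrative sketch of hash partitioning and is not taken from the thesis. It uses plain Python lists in place of Arrow record batches and an in-process loop in place of network transfer; the function names (`partition_for`, `shuffle_write`, `shuffle`, `aggregate`), the sample rows, and the choice of CRC32 as the hash are all assumptions made for illustration. In a distributed setting, the exchange simulated in `shuffle` is the part that would travel between machines, for example over Arrow Flight RPC as the thesis proposes.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a target partition using a stable hash (CRC32, an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle_write(rows, num_partitions):
    """Split one node's local rows into buckets, one per destination partition."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[partition_for(key, num_partitions)].append((key, value))
    return buckets


def shuffle(nodes_rows, num_partitions):
    """Simulate the exchange: every node sends bucket i to partition i.

    Here the 'send' is a list append; a real system would move the data
    over the network.
    """
    partitions = [[] for _ in range(num_partitions)]
    for rows in nodes_rows:
        for target, bucket in shuffle_write(rows, num_partitions).items():
            partitions[target].extend(bucket)
    return partitions


def aggregate(partition):
    """Sum values per key; valid because all rows for a key share a partition."""
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)


# Hypothetical input: two nodes, each holding part of the data.
node_a = [("apple", 1), ("pear", 2), ("apple", 3)]
node_b = [("pear", 4), ("plum", 5), ("apple", 6)]

partitions = shuffle([node_a, node_b], num_partitions=3)
results = [aggregate(p) for p in partitions]
```

After the shuffle, every row with a given key sits in exactly one partition, so each partition can be aggregated independently and the per-partition results never overlap on a key.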

Supervisors: Paolo Garza, Andrea Fonti
Academic year: 2022/23
Publication type: Electronic
Number of pages: 55
Subjects:
Degree programme: Master's degree programme in Data Science And Engineering
Degree class: New organization > Master's degree > LM-32 - COMPUTER ENGINEERING
Collaborating companies: Agile Lab S.r.l.
URI: http://webthesis.biblio.polito.it/id/eprint/27735