
Dario Antonio Ruta
Design and implementation of a deployment tool for modular DNN inference using ZeroMQ-based GPU-aware communication.
Rel. Carla Fabiana Chiasserini, Corrado Puligheddu. Politecnico di Torino, Corso di laurea magistrale in Ict For Smart Societies (Ict Per La Società Del Futuro), 2025
PDF (Tesi_di_laurea)
- Thesis
License: Creative Commons Attribution Non-commercial No Derivatives. Download (8MB)
Abstract:
Deep Neural Networks (DNNs) are the fundamental structure adopted to provide smart services in a wide range of AI applications. However, DNN-based tasks have high computing requirements, posing significant challenges to their deployment on small, resource-constrained devices such as mobile phones or IoT devices. To address this issue, some solutions rely on model compression techniques to limit both the computational burden on the device and the model's memory footprint. Other strategies involve partial or full task offloading to more powerful computing platforms placed at the edge of new-generation mobile networks (5G-MEC), ensuring low latency and near-zero computing cost on the mobile device. In such a scenario, mobile devices can treat DNN tasks as on-demand services. However, although more abundant than those of mobile devices, the computing resources of edge servers are limited. It is therefore of paramount importance to manage and optimize them so as to maximize the task acceptance rate. In this context, promising results emerge from sharing segments of layers (blocks) of DNN architectures among similar offloaded tasks. However, coping with parallel model execution in a highly dynamic scenario with strict latency requirements during offloading poses several challenges. This thesis presents BlockFlow, a high-performance deployment tool for modular and dynamic DNN inference. It manages the complete lifecycle of each DNN block as well as the communication channels between blocks. Moreover, BlockFlow incorporates TensorMQ, a GPU-aware communication paradigm based on the ZeroMQ library for inter-block communication at inference time. The system design and the technical motivations behind the implementation choices are discussed in detail.
BlockFlow provides a high degree of flexibility and adaptability across different computing paradigms, such as single-node-single-GPU, single-node-multi-GPU, and multi-node, and it can be adopted in any context where DNNs are to be exposed as services. TensorMQ addresses the "ping-pong" problem between CPU and GPU in single-node-single-GPU setups for modular DNN architectures by avoiding redundant tensor copies between the host and the device, thus increasing overall system performance.
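The single-node "ping-pong" avoidance described above can be illustrated with a small sketch: instead of serializing the tensor payload (which would force a device-to-host copy), adjacent blocks exchange only a lightweight handle over a ZeroMQ in-process socket, and the consumer resolves the handle against memory that never left the device. This is a hypothetical illustration, not the thesis's TensorMQ API: the names `TENSOR_STORE`, `ENDPOINT`, `block_a`, and `block_b` are invented for the example, and a plain Python list stands in for a GPU-resident tensor.

```python
import threading
import zmq

# Stand-in for device-resident memory: the payload stays here and is
# never copied into a ZeroMQ message.
TENSOR_STORE = {}
ENDPOINT = "inproc://block-link"   # in-process transport between blocks

def block_a(sock):
    """Producer block: computes a tensor, sends only metadata + handle."""
    tensor = [1.0, 2.0, 3.0]        # imagine this lives on the GPU
    TENSOR_STORE["t0"] = tensor
    sock.send_json({"handle": "t0", "shape": [3]})   # no payload on the wire
    sock.recv_json()                # wait for downstream ack
    sock.close()

def block_b(sock, out):
    """Consumer block: resolves the handle and operates on shared memory."""
    msg = sock.recv_json()
    tensor = TENSOR_STORE[msg["handle"]]   # payload never crossed the socket
    out.append(sum(tensor))
    sock.send_json({"ack": True})
    sock.close()

ctx = zmq.Context()
# Create, bind, and connect in one thread, then hand each socket to its block.
a_sock = ctx.socket(zmq.PAIR)
a_sock.bind(ENDPOINT)
b_sock = ctx.socket(zmq.PAIR)
b_sock.connect(ENDPOINT)

results = []
a = threading.Thread(target=block_a, args=(a_sock,))
b = threading.Thread(target=block_b, args=(b_sock, results))
a.start(); b.start()
a.join(); b.join()
ctx.term()
print(results[0])   # 6.0
```

In a real multi-process or multi-node deployment the in-process dictionary would have to be replaced by a mechanism such as CUDA IPC memory handles or a TCP transport with explicit serialization; the point of the sketch is only the design choice of decoupling the control message from the tensor payload.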
Supervisors: Carla Fabiana Chiasserini, Corrado Puligheddu
Academic year: 2024/25
Publication type: Electronic
Number of pages: 99
Subjects:
Degree programme: Corso di laurea magistrale in Ict For Smart Societies (Ict Per La Società Del Futuro)
Degree class: Nuovo ordinamento > Laurea magistrale > LM-27 - INGEGNERIA DELLE TELECOMUNICAZIONI
Collaborating companies: NOT SPECIFIED
URI: http://webthesis.biblio.polito.it/id/eprint/36555