Sadegh Jamishi
Latency-Aware DNN Inference with Adaptive Batching for Edge Task Offloading.
Supervisors: Carla Fabiana Chiasserini, Corrado Puligheddu. Politecnico di Torino, NOT SPECIFIED, 2025
PDF (Tesi_di_laurea) - Thesis
License: Creative Commons Attribution Non-commercial No Derivatives. Download (5MB)
Archive (ZIP) (Documenti_allegati) - Other
License: Creative Commons Attribution Non-commercial No Derivatives. Download (57MB)
Abstract:

Edge computer-vision systems must satisfy low-latency requirements even when computation and network resources are scarce. The novelty of this thesis lies in investigating how admission control, batching, and concurrency must be designed together to maximize the number of tasks completed without deadline violations. First, we perform an empirical characterization of modern inference frameworks (e.g., PyTorch, NVIDIA TensorRT, YOLO). The findings show that batching and parallelism improve throughput but hit diminishing returns as host-side processing saturates. Motivated by this, we present a communication–computation model that captures rate-dependent uploads, limited bandwidth, and asynchronous task arrivals in a single compact form. To address the scheduling problem, we introduce Greedy-JBAS, a simple batching algorithm based on earliest-deadline-first ordering with upload and inference feasibility checks. It achieves high completion ratios, plans in milliseconds, and nearly matches the performance of more costly optimization-based formulations (e.g., solved with Gurobi), while clearly outperforming fixed-batch and mobile-edge-computing baselines. Overall, the contributions of this thesis are: (i) a reproducible empirical mapping of batching and concurrency behavior in modern inference stacks, (ii) a formal yet practical unified communication–computation model for edge inference, and (iii) a scalable scheduler that does not trade deployability for efficiency. These contributions aim to provide actionable guidance for building latency-aware edge AI pipelines and to open new opportunities for exploiting host-side parallelism.

| | |
|---|---|
| Supervisors: | Carla Fabiana Chiasserini, Corrado Puligheddu |
| Academic year: | 2025/26 |
| Publication type: | Electronic |
| Number of pages: | 78 |
| Subjects: | |
| Degree course: | NOT SPECIFIED |
| Degree class: | New system > Master's degree > LM-27 - TELECOMMUNICATIONS ENGINEERING |
| Partner companies: | NOT SPECIFIED |
| URI: | http://webthesis.biblio.polito.it/id/eprint/37741 |
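To make the scheduling idea in the abstract concrete, the following is a minimal Python sketch of a greedy batch-admission loop in the spirit of Greedy-JBAS: tasks are considered in earliest-deadline-first order, and each is admitted only if its upload and the enlarged batch's inference can still finish before every admitted deadline. All names (`Task`, `greedy_jbas`), the sequential-upload model, and the `batch_latency` function are illustrative assumptions, not the thesis's actual formulation.

```python
from dataclasses import dataclass

@dataclass
class Task:
    arrival: float     # time the task becomes available (s)
    size_bits: float   # input size to upload (bits)
    deadline: float    # absolute completion deadline (s)

def greedy_jbas(tasks, bandwidth_bps, batch_latency, max_batch):
    """Greedy EDF batching sketch with feasibility checks.

    batch_latency(b) -> inference latency (s) for a batch of size b.
    Returns the list of admitted tasks forming one batch.
    """
    admitted = []      # (upload_finish, task) pairs, in admission order
    link_free = 0.0    # time the shared uplink becomes free
    for t in sorted(tasks, key=lambda x: x.deadline):   # EDF order
        if len(admitted) == max_batch:
            break
        # Upload feasibility: sequential transfers over one shared link.
        up_finish = max(link_free, t.arrival) + t.size_bits / bandwidth_bps
        # Inference feasibility: the batch starts once its last upload lands.
        finish = up_finish + batch_latency(len(admitted) + 1)
        # Admit only if no task in the enlarged batch misses its deadline
        # (EDF order means earlier-admitted tasks have the tightest ones).
        if finish <= t.deadline and all(finish <= a.deadline for _, a in admitted):
            admitted.append((up_finish, t))
            link_free = up_finish
    return [t for _, t in admitted]
```

On a 10 Mbit/s link with 1 Mbit inputs and a linear batch-latency model, the loop rejects a task whose deadline is tighter than a single upload, then stops growing the batch as soon as the extra batch latency would violate an already-admitted deadline.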



Creative Commons License - Attribution 3.0 Italy