Giandomenico Lacatena
Adaptive Layer Placement for Pipeline‑Parallel LLM Inference at the Edge.
Supervisors: Alessio Sacco, Guido Marchetto, Doriana Monaco. Politecnico di Torino, Master's degree programme in Ingegneria Informatica (Computer Engineering), 2026
PDF (Tesi_di_laurea) - Thesis. License: Creative Commons Attribution Non-commercial No Derivatives. Download (4MB)
Abstract
The deployment of Large Language Models (LLMs) is shifting from centralized cloud environments toward edge-oriented distributed architectures located closer to data sources, driven by the requirements of real-time applications. One strategy for fitting these models within limited device memory is pipeline parallelism, which partitions a model's layers across different nodes to achieve concurrency. However, this approach introduces substantial communication overhead, accounting for up to 40% of total execution time, which can significantly degrade end-to-end performance. Choosing an effective deployment configuration, which must account for factors such as GPU characteristics, inter-node communication latency, and model size, can help mitigate this overhead. Current state-of-the-art approaches focus on maximizing resource utilization but fail to adapt to frequently changing conditions. As a result, they struggle in scenarios where a node's communication bottleneck outweighs its computation time.
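
To make the placement trade-off concrete, here is a minimal sketch of how one might score contiguous layer splits across heterogeneous nodes, picking the partition that minimizes the bottleneck stage time (compute plus incoming transfer latency). All layer costs, node speeds, and link latencies below are hypothetical profiling numbers for illustration; this is not the adaptive placement algorithm developed in the thesis.

```python
from itertools import combinations

def stage_time(layer_costs, start, end, node_speed):
    """Compute time of layers [start, end) on a node, scaled by its relative speed."""
    return sum(layer_costs[start:end]) / node_speed

def best_partition(layer_costs, node_speeds, link_latency):
    """Brute-force the contiguous layer split that minimizes the pipeline
    bottleneck: the slowest stage's compute time plus the latency of the
    link feeding it. Inputs are assumed profiling measurements."""
    n_layers, n_nodes = len(layer_costs), len(node_speeds)
    best, best_bounds = float("inf"), None
    # Choose n_nodes - 1 cut points between layers.
    for cuts in combinations(range(1, n_layers), n_nodes - 1):
        bounds = (0, *cuts, n_layers)
        # Effective stage time = compute + incoming activation transfer.
        times = [
            stage_time(layer_costs, bounds[i], bounds[i + 1], node_speeds[i])
            + (link_latency[i - 1] if i > 0 else 0.0)
            for i in range(n_nodes)
        ]
        if max(times) < best:
            best, best_bounds = max(times), bounds
    return best_bounds, best

# Example: 8 layers, 3 heterogeneous edge nodes, 2 inter-node links (all assumed).
layer_ms = [5, 5, 6, 6, 7, 7, 8, 8]   # per-layer compute time, ms
speeds = [1.0, 0.8, 1.2]              # relative GPU throughput
links_ms = [3.0, 10.0]                # inter-node latencies, ms
print(best_partition(layer_ms, speeds, links_ms))
```

Minimizing the slowest stage reflects that a pipeline's steady-state throughput is bounded by its bottleneck; an adaptive scheme along the lines the abstract describes would re-run such a search as measured link latencies and node loads change.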