Giandomenico Lacatena
Adaptive Layer Placement for Pipeline‑Parallel LLM Inference at the Edge.
Supervisors: Alessio Sacco, Guido Marchetto, Doriana Monaco. Politecnico di Torino, Master's degree programme in Ingegneria Informatica (Computer Engineering), 2026
PDF (Tesi_di_laurea) - Thesis. License: Creative Commons Attribution Non-commercial No Derivatives. Download (4MB)
Abstract
The deployment of Large Language Models (LLMs) is shifting from centralized cloud environments toward edge-oriented distributed architectures located closer to data sources, driven by the requirements of real-time applications. One strategy for fitting these models within limited device memory is pipeline parallelism, which partitions a model's layers across different nodes to achieve concurrency. However, this approach introduces substantial communication overhead, accounting for up to 40% of total execution time, which can significantly degrade end-to-end performance. Choosing an effective deployment configuration, which must account for factors such as GPU characteristics, inter-node communication latency, and model size, can help mitigate this overhead. Current state-of-the-art approaches focus on maximizing resource utilization but fail to adapt to frequently changing conditions. As a result, they struggle in scenarios where a node's communication bottleneck outweighs its computation time.
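
To make the placement trade-off concrete, here is a minimal sketch of how one might score contiguous layer splits across heterogeneous nodes, picking the partition that minimizes the bottleneck stage time (compute plus incoming transfer latency). All layer costs, node speeds, and link latencies below are hypothetical profiling numbers for illustration; this is not the adaptive placement algorithm developed in the thesis.

```python
from itertools import combinations

def stage_time(layer_costs, start, end, node_speed):
    """Compute time of layers [start, end) on a node, scaled by its relative speed."""
    return sum(layer_costs[start:end]) / node_speed

def best_partition(layer_costs, node_speeds, link_latency):
    """Brute-force the contiguous layer split that minimizes the pipeline
    bottleneck: the slowest stage's compute time plus the latency of the
    link feeding it. Inputs are assumed profiling measurements."""
    n_layers, n_nodes = len(layer_costs), len(node_speeds)
    best, best_bounds = float("inf"), None
    # Choose n_nodes - 1 cut points between layers.
    for cuts in combinations(range(1, n_layers), n_nodes - 1):
        bounds = (0, *cuts, n_layers)
        # Effective stage time = compute + incoming activation transfer.
        times = [
            stage_time(layer_costs, bounds[i], bounds[i + 1], node_speeds[i])
            + (link_latency[i - 1] if i > 0 else 0.0)
            for i in range(n_nodes)
        ]
        if max(times) < best:
            best, best_bounds = max(times), bounds
    return best_bounds, best

# Example: 8 layers, 3 heterogeneous edge nodes, 2 inter-node links (all assumed).
layer_ms = [5, 5, 6, 6, 7, 7, 8, 8]   # per-layer compute time, ms
speeds = [1.0, 0.8, 1.2]              # relative GPU throughput
links_ms = [3.0, 10.0]                # inter-node latencies, ms
print(best_partition(layer_ms, speeds, links_ms))
```

Minimizing the slowest stage reflects that a pipeline's steady-state throughput is bounded by its bottleneck; an adaptive scheme along the lines the abstract describes would re-run such a search as measured link latencies and node loads change.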