Francesco Camilli
Distributed AI fabrics: a network-side perspective.
Supervisors: Paolo Giaccone, Emilio Leonardi. Politecnico di Torino, Master's degree programme in Communications Engineering, 2026
License: Creative Commons Attribution Non-commercial No Derivatives.
Abstract
The rapid scaling of Large Language Models (LLMs) has transformed distributed training into a network-intensive workload, where communication efficiency increasingly dominates overall performance. As model size grows, the exchange of gradients between computing devices becomes a critical bottleneck, especially in geographically distributed or resource-constrained environments. This thesis investigates the impact of network latency and bandwidth constraints on decentralized LLM training, with a specific focus on the Ring All-Reduce gradient synchronization algorithm. To analyze this phenomenon, a controlled experimental environment was implemented using Docker containerization on a single physical server, where multiple nodes are interconnected through a software-defined network to emulate a decentralized topology.
The study characterizes network traffic at the packet level through tcpdump and Wireshark captures.
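The abstract centers on the Ring All-Reduce synchronization algorithm. As a minimal single-process sketch of its chunk schedule (my own illustration, not code from the thesis): each of N nodes splits its gradient into N chunks, runs N-1 scatter-reduce steps in which every node forwards one chunk to its ring successor for accumulation, and then N-1 all-gather steps that circulate the fully reduced chunks.

```python
# Single-process sketch of Ring All-Reduce gradient synchronization
# (illustrative only; real frameworks such as NCCL run the same chunk
# schedule with true point-to-point transfers between devices).
from typing import List

def ring_all_reduce(grads: List[List[float]]) -> List[List[float]]:
    """Sum-reduce N equal-length gradient vectors over a logical ring."""
    n = len(grads)
    length = len(grads[0])
    assert length % n == 0, "vector length must be divisible by the node count"
    size = length // n
    # Split each node's gradient into n chunks.
    chunks = [[g[i * size:(i + 1) * size] for i in range(n)] for g in grads]

    # Phase 1, scatter-reduce: in each of n-1 steps every node sends one
    # chunk to its successor, which adds it element-wise to its own copy.
    # Afterwards node r holds the fully reduced chunk (r + 1) mod n.
    for step in range(n - 1):
        for r in range(n):
            c = (r - step) % n          # chunk node r forwards this step
            dst = (r + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], chunks[r][c])]

    # Phase 2, all-gather: n-1 more steps circulate the reduced chunks
    # so that every node ends up with the complete summed vector.
    for step in range(n - 1):
        for r in range(n):
            c = (r + 1 - step) % n      # reduced chunk node r forwards
            dst = (r + 1) % n
            chunks[dst][c] = list(chunks[r][c])

    # Reassemble each node's chunks into one flat vector.
    return [[x for chunk in node for x in chunk] for node in chunks]

# Example: 4 nodes, node r contributes the constant vector [r] * 8;
# every node ends with the element-wise sum 0 + 1 + 2 + 3 = 6.
result = ring_all_reduce([[float(r)] * 8 for r in range(4)])
assert all(vec == [6.0] * 8 for vec in result)
```

Each node sends and receives only 2(N-1)/N of the gradient size per synchronization regardless of N, which is why the algorithm's sensitivity to per-link latency and bandwidth, rather than to node count, makes it a natural subject for the emulation study described above.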
