Augusto Leogrande
Temporal Background Key-Value Reuse for Efficient Vision Transformer Inference.
Supervisors: Alessio Sacco, Guido Marchetto, Flavio Esposito. Politecnico di Torino, Master's degree programme in Computer Engineering (Ingegneria Informatica), 2026
License: Creative Commons Attribution Non-commercial No Derivatives.
Abstract
Vision Transformers (ViTs) are compelling for edge deployment because they can operate on compact token representations instead of full images, and they have shown impressive capabilities in video understanding. However, their high computational cost, combined with the large volume of continuous video data, poses a major challenge for real-time deployment on resource-constrained edge devices in Internet of Things (IoT) environments. Most efficiency methods for ViTs target single images or per-frame optimization. Token reduction techniques lower intra-frame computation but leave substantial temporal redundancy untouched in videos, where large regions remain static across consecutive frames. Recent video token-reduction works exploit this redundancy by skipping redundant operations, but they preserve full-resolution token sets and the quadratic complexity of attention, which limits scalability in long-context and high-resolution settings.
While these approaches are sound, they primarily target inference-time efficiency and do not address the underlying representation complexity, which remains a major bottleneck in large-scale video modeling.
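To make the idea of exploiting temporal redundancy concrete, the sketch below shows the general pattern of background key-value reuse: between consecutive frames, the key/value projections are recomputed only for tokens whose content changed, while static ("background") tokens reuse the cached entries. This is a minimal, dependency-free illustration of the pattern under assumed names (`kv_with_background_reuse`, the change threshold `tau`), not the thesis's actual implementation.

```python
# Illustrative sketch (NOT the thesis's implementation): recompute key/value
# projections only for tokens that changed between consecutive frames, and
# reuse the cached K/V for static background tokens.

def matvec(w, x):
    """Multiply a d_out x d_in weight matrix by a token vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def kv_with_background_reuse(tokens, prev_tokens, k_cache, v_cache,
                             w_k, w_v, tau=1e-6):
    """Return (K, V, changed_mask), reusing cached K/V for static tokens."""
    keys, values, changed = [], [], []
    for t, p, k, v in zip(tokens, prev_tokens, k_cache, v_cache):
        moved = max(abs(a - b) for a, b in zip(t, p)) > tau
        changed.append(moved)
        if moved:                        # foreground token: recompute K/V
            keys.append(matvec(w_k, t))
            values.append(matvec(w_v, t))
        else:                            # background token: reuse cached K/V
            keys.append(k)
            values.append(v)
    return keys, values, changed

# Toy example: 4 tokens of dimension 2 and simple diagonal projections.
w_k = [[1.0, 0.0], [0.0, 1.0]]
w_v = [[0.5, 0.0], [0.0, 0.5]]
frame0 = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
k0 = [matvec(w_k, t) for t in frame0]
v0 = [matvec(w_v, t) for t in frame0]

frame1 = [row[:] for row in frame0]
frame1[2] = [5.5, 6.0]                   # only token 2 moves between frames

k1, v1, changed = kv_with_background_reuse(frame1, frame0, k0, v0, w_k, w_v)
print(sum(changed))                      # prints 1: three K/V pairs were reused
```

Because static tokens are bit-identical across the two frames in this toy case, the reused cache entries match a full recomputation exactly; with a nonzero `tau` on real video, the reuse instead trades a bounded approximation error for skipping most of the projection work per frame.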