Andrea Vannozzi
Post Training Low Rank Approximation for KV Cache Compression in Large Language Models.
Supervisors: Daniele Jahier Pagliari, Alessio Burrello, Luca Benfenati. Politecnico di Torino, Master's degree program in Ingegneria Informatica (Computer Engineering), 2026
PDF (Tesi_di_laurea) - Thesis. License: Creative Commons Attribution Non-commercial No Derivatives.
Abstract
Autoregressive decoding in large Transformer language models is frequently memory-bandwidth bound: each generated token reads and updates an ever-growing key-value (KV) cache. As context length and batch size increase, KV-cache storage and traffic can dominate inference cost, limiting throughput and deployment scalability. A common mitigation is to compress the KV cache by projecting per-head key and value matrices to a lower rank and storing only the projections. However, many post-training projection methods optimize proxy objectives such as variance preservation or reconstruction fidelity, which do not explicitly account for the end-to-end behavior of the decoder during generation. This thesis studies KV-cache compression from a functional perspective: can projection bases optimized to preserve decoder-layer outputs yield better memory-performance trade-offs than proxy-driven approximations? We propose a post-training framework in which lightweight predictors, trained offline on a calibration set while keeping the language model frozen, output orthonormal projection bases for keys and values.
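As a rough sketch of the caching mechanics described above (the tensor sizes, the randomly generated basis, and the PyTorch calls below are illustrative assumptions of this note, not code from the thesis):

```python
import torch

torch.manual_seed(0)

# Illustrative sizes only; the thesis does not fix these values.
seq_len, d_head, rank = 64, 128, 32

# Orthonormal basis B (d_head x rank) with B.T @ B = I. In the proposed
# framework, B would come from a lightweight predictor trained offline;
# here we take the Q factor of a random matrix purely for illustration.
B, _ = torch.linalg.qr(torch.randn(d_head, rank))

k = torch.randn(seq_len, d_head)  # per-head keys accumulated so far
v = torch.randn(seq_len, d_head)  # per-head values

# Cache only the projections: seq_len x rank instead of seq_len x d_head,
# a 4x reduction in stored bytes and per-token traffic at these sizes.
k_cached, v_cached = k @ B, v @ B

# Before attention, reconstruct approximately: X is replaced by (X @ B) @ B.T.
k_hat, v_hat = k_cached @ B.T, v_cached @ B.T
print(k.shape, "->", k_cached.shape)  # torch.Size([64, 128]) -> torch.Size([64, 32])
```

Because B has orthonormal columns, the reconstruction (X @ B) @ B.T is the orthogonal projection of X onto the span of the basis, so the cached representation loses only the components of the keys and values outside that subspace.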
Rather than minimizing reconstruction error on cached tensors, the training objective directly minimizes the discrepancy between full-rank and compressed decoder-layer outputs, aligning compression with the quantity that governs downstream generation.
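A minimal sketch of that functional objective under the same illustrative assumptions, with single-head scaled-dot-product attention standing in for a full frozen decoder layer (all names and shapes here are this note's, not the thesis'):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_head, rank = 64, 128, 32
B, _ = torch.linalg.qr(torch.randn(d_head, rank))  # orthonormal basis, as above

q = torch.randn(1, seq_len, d_head)
k = torch.randn(1, seq_len, d_head)
v = torch.randn(1, seq_len, d_head)

# Single-head attention stands in for the frozen decoder layer; the thesis
# objective compares full decoder-layer outputs on calibration data.
out_full = F.scaled_dot_product_attention(q, k, v)
out_comp = F.scaled_dot_product_attention(q, (k @ B) @ B.T, (v @ B) @ B.T)

# Functional loss: discrepancy between full-rank and compressed outputs,
# not reconstruction error on the cached K/V tensors themselves.
loss = F.mse_loss(out_comp, out_full)
print(f"functional loss at rank {rank}: {loss.item():.4e}")
```

In the described framework, such a loss would be backpropagated only into the lightweight basis predictor, leaving the language model weights frozen.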