Andrea Vannozzi
Post Training Low Rank Approximation for KV Cache Compression in Large Language Models.
Supervisors: Daniele Jahier Pagliari, Alessio Burrello, Luca Benfenati. Politecnico di Torino, Master's degree program in Ingegneria Informatica (Computer Engineering), 2026
PDF (Tesi_di_laurea) - Thesis. License: Creative Commons Attribution Non-commercial No Derivatives.
Abstract
Autoregressive decoding in large Transformer language models is frequently memory-bandwidth bound: each generated token reads and updates an ever-growing key-value (KV) cache. As context length and batch size increase, KV-cache storage and traffic can dominate inference cost, limiting throughput and deployment scalability. A common mitigation is to compress the KV cache by projecting per-head key and value matrices to a lower rank and storing only the projections. However, many post-training projection methods optimize proxy objectives such as variance preservation or reconstruction fidelity, which do not explicitly account for the end-to-end behavior of the decoder during generation. This thesis studies KV-cache compression from a functional perspective: can projection bases optimized to preserve decoder-layer outputs yield better memory-performance trade-offs than proxy-driven approximations? We propose a post-training framework in which lightweight predictors, trained offline on a calibration set while keeping the language model frozen, output orthonormal projection bases for keys and values.
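As a rough sketch of the caching mechanics described above (the tensor sizes, the randomly generated basis, and the PyTorch calls below are illustrative assumptions of this note, not code from the thesis):

```python
import torch

torch.manual_seed(0)

# Illustrative sizes only; the thesis does not fix these values.
seq_len, d_head, rank = 64, 128, 32

# Orthonormal basis B (d_head x rank) with B.T @ B = I. In the proposed
# framework, B would come from a lightweight predictor trained offline;
# here we take the Q factor of a random matrix purely for illustration.
B, _ = torch.linalg.qr(torch.randn(d_head, rank))

k = torch.randn(seq_len, d_head)  # per-head keys accumulated so far
v = torch.randn(seq_len, d_head)  # per-head values

# Cache only the projections: seq_len x rank instead of seq_len x d_head,
# a 4x reduction in stored bytes and per-token traffic at these sizes.
k_cached, v_cached = k @ B, v @ B

# Before attention, reconstruct approximately: X is replaced by (X @ B) @ B.T.
k_hat, v_hat = k_cached @ B.T, v_cached @ B.T
print(k.shape, "->", k_cached.shape)  # torch.Size([64, 128]) -> torch.Size([64, 32])
```

Because B has orthonormal columns, the reconstruction (X @ B) @ B.T is the orthogonal projection of X onto the span of the basis, so the cached representation loses only the components of the keys and values outside that subspace.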
Rather than minimizing reconstruction error on cached tensors, the training objective directly minimizes the discrepancy between full-rank and compressed decoder-layer outputs, aligning compression with the quantity that governs downstream generation.
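A minimal sketch of that functional objective under the same illustrative assumptions, with single-head scaled-dot-product attention standing in for a full frozen decoder layer (all names and shapes here are this note's, not the thesis'):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_head, rank = 64, 128, 32
B, _ = torch.linalg.qr(torch.randn(d_head, rank))  # orthonormal basis, as above

q = torch.randn(1, seq_len, d_head)
k = torch.randn(1, seq_len, d_head)
v = torch.randn(1, seq_len, d_head)

# Single-head attention stands in for the frozen decoder layer; the thesis
# objective compares full decoder-layer outputs on calibration data.
out_full = F.scaled_dot_product_attention(q, k, v)
out_comp = F.scaled_dot_product_attention(q, (k @ B) @ B.T, (v @ B) @ B.T)

# Functional loss: discrepancy between full-rank and compressed outputs,
# not reconstruction error on the cached K/V tensors themselves.
loss = F.mse_loss(out_comp, out_full)
print(f"functional loss at rank {rank}: {loss.item():.4e}")
```

In the described framework, such a loss would be backpropagated only into the lightweight basis predictor, leaving the language model weights frozen.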