TurboQuant: A First-Principles Walkthrough
Compressing AI vectors to 2–4 bits per number without losing accuracy.

Modern language models store large tables of high-dimensional vectors: KV caches, embeddings, attention keys. TurboQuant compresses each coordinate of these vectors to 2–4 bits with provably near-optimal distortion, no memory overhead for scale factors, and no training or calibration. This page explains how it works.

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate · Zandieh, Daliri, Hadian, Mirrokni · 2025 · arXiv:2504.19874
PolarQuant: Quantizing KV Caches with Polar Transformation · Han, Kacham, Karbasi, Mirrokni, Zandieh · 2025 · arXiv:2502.02617
QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead · Zandieh, Daliri, Han · 2024 · arXiv:2406.03482

The single load-bearing idea: in high dimensions, a random rotation turns every input vector into one whose coordinates follow a known, fixed distribution. A codebook designed once for that distribution can then be reused for every input (a code sketch of this appears at the end of this primer). Everything else on this page is the construction that puts this observation to work.

§0 · Primer: jargon decoder

Eight ideas the rest of the page is built on. Each mini-demo below covers one concept used later. Skip the ones you already know.

§0.1 · Vector

A list of numbers. An arrow in space.

A vector is an ordered list: [0.3, −1.2]. Geometrically it is an arrow from the origin. A d-dimensional vector is an arrow in $d$-space, hard to picture past 3-D, but the rules are the same.

[Interactive demo: drag the vector's tip; the readout shows its coordinates and length.]

§0.2 · Length ‖x‖ & Inner Product ⟨x,y⟩

How much one vector points along another.

Length: $\|x\| = \sqrt{x_1^2 + x_2^2 + \dots}$. Inner product: $\langle x,y\rangle = x_1 y_1 + x_2 y_2 + \dots = \|x\|\|y\|\cos\theta$. The inner product is largest and positive when the two arrows point in the same direction, zero when they are perpendicular, and negative when they point in opposite directions, reaching its most negative value when they point exactly opposite.

[Interactive demo: drag either tip; the readout shows ‖x‖, ‖y‖, ⟨x,y⟩, and the angle θ.]

§0.3 · Mean Squared Error

Why we square the mistake.

Error is the distance between a guess and the truth. Scoring a guess by the signed error lets positive and negative errors cancel, so the score fails to penalise being off. Squaring forces every error to count as a positive number and gives big errors a larger penalty than small ones. The guess that minimises the mean of squared errors is the data's average: it is the unique number that minimises the sum of squared distances to the points.

The first moment of a quantity $X$ is its mean $\mathbb{E}[X]$; the second moment is the mean of its square $\mathbb{E}[X^2]$. A zero-mean variable has a vanishing first moment because positive and negative deviations cancel. Its second moment is strictly positive whenever any deviation is nonzero, because squared values are nonnegative and cannot cancel. The MSE above is itself a second moment: that of the residual error. This distinction returns in §7, where the per-input gap $\tilde y - y$ averages to zero in the first moment, while its square averages to a strictly positive quantity in the second.

The average has a property we will use in §7. It lies between the data's most extreme points, so its magnitude is smaller than at least one of them. When a quantizer compresses a whole bin of values down to the bin's average, the stored value is smaller in…
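As a numeric companion to §0.2, here is a minimal numpy check of the length and inner-product formulas. The vectors are illustrative choices, not values from the demo above.

```python
import numpy as np

x = np.array([0.3, -1.2])
y = np.array([1.0, 0.5])

# Length: square each coordinate, sum, take the root.
length = np.sqrt(np.sum(x ** 2))        # same as np.linalg.norm(x)

# Inner product: multiply matching coordinates and sum.
inner = np.sum(x * y)                   # same as x @ y

# Recover the angle from <x, y> = ||x|| ||y|| cos(theta).
cos_theta = inner / (np.linalg.norm(x) * np.linalg.norm(y))
angle_deg = np.degrees(np.arccos(cos_theta))

print(length, inner, angle_deg)
```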
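For §0.3, a short sketch with made-up data: a brute-force scan confirms that the average is the guess minimising the MSE, and the residual around the mean shows the vanishing first moment versus the strictly positive second moment.

```python
import numpy as np

data = np.array([0.2, 1.5, -0.7, 0.9, 0.1])   # illustrative data

def mse(guess, data):
    return np.mean((data - guess) ** 2)

# Scan candidate guesses: the minimiser of the MSE is the average.
guesses = np.linspace(-2, 2, 4001)
best = guesses[np.argmin([mse(g, data) for g in guesses])]
print(best, data.mean())                        # both 0.4

# Residual around the mean: first moment vanishes (signed errors
# cancel), second moment is strictly positive (squares cannot cancel).
residual = data - data.mean()
print(residual.mean(), (residual ** 2).mean())
```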
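Finally, a minimal sketch of the load-bearing idea from the introduction. This is not TurboQuant itself: the paper uses fast structured rotations and its own near-optimal codebook construction, while here a dense random orthogonal matrix and the classical 2-bit Lloyd-Max levels for a unit Gaussian (±0.45, ±1.51, rescaled to variance 1/d) stand in as assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024

# A random orthogonal matrix: QR-decompose a Gaussian matrix.
# (Stand-in for the paper's fast structured rotations.)
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

# Any fixed unit vector, however lopsided its coordinates...
x = np.zeros(d)
x[0] = 1.0

# ...is rotated into a vector whose coordinates look like i.i.d.
# Gaussians with variance 1/d: a known, fixed distribution.
y = Q @ x
print(y.std(), 1 / np.sqrt(d))   # ~0.031 vs 0.03125

# A codebook designed once for that shape is reused for every input.
# Assumed here: 2-bit Lloyd-Max levels for N(0, 1), scaled to var 1/d.
codebook = np.array([-1.51, -0.45, 0.45, 1.51]) / np.sqrt(d)
codes = np.abs(y[:, None] - codebook[None, :]).argmin(axis=1)
y_hat = codebook[codes]           # codes are in {0,1,2,3}: 2 bits each
print(np.mean((y - y_hat) ** 2))  # small, input-independent distortion

# Rotations are invertible, so dequantization undoes Q; orthogonality
# preserves the distortion back in the original basis.
x_hat = Q.T @ y_hat
print(np.mean((x - x_hat) ** 2))
```

No per-input scale factor is stored: because the rotated coordinates always follow the same distribution, the same fixed codebook serves every vector, which is the source of the "zero overhead" claim in the introduction.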