Google Research has unveiled TurboQuant, a highly efficient compression algorithm designed to drastically reduce the memory required to run large language models (LLMs). Announced in March 2026, the technique promises to shrink a model's "working memory" by at least 6x while delivering up to 8x faster inference, with virtually no loss in accuracy.
LLMs face a persistent scaling problem. As context windows grow, the memory required to store key-value (KV) caches expands proportionally, consuming GPU memory and slowing inference.
The Google Research team has developed three compression algorithms: TurboQuant, PolarQuant, and Quantized Johnson-Lindenstrauss (QJL). All three are designed to compress those caches aggressively without degrading model output quality.
Vector quantization has long been used to compress the high-dimensional numerical representations that AI models process. The technique reduces memory by mapping continuous values to smaller, discrete sets of numbers. The persistent limitation of conventional approaches is that they require storing quantization constants in high precision for every small block of data, adding between one and two extra bits per number. For systems already under memory pressure, that overhead offsets a significant share of the compression gains.
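To make the overhead concrete, here is a minimal sketch of conventional block-wise quantization. The block size, bit width, and min/max scaling scheme are illustrative assumptions, not the specific method Google's team compared against; the point is simply that each small block must carry full-precision constants, which works out to extra bits per stored value.

```python
import numpy as np

def blockwise_quantize(x, block_size=32, bits=4):
    """Quantize a vector in small blocks, storing full-precision
    constants per block -- the overhead conventional schemes pay.
    Hypothetical illustration; parameters are assumptions."""
    x = x.reshape(-1, block_size)
    lo = x.min(axis=1, keepdims=True)   # per-block minimum (fp32)
    hi = x.max(axis=1, keepdims=True)   # per-block maximum (fp32)
    scale = (hi - lo) / (2**bits - 1)
    safe = np.where(scale == 0, 1, scale)  # guard constant blocks
    codes = np.round((x - lo) / safe).astype(np.uint8)
    return codes, lo, scale

x = np.random.randn(1024).astype(np.float32)
codes, lo, scale = blockwise_quantize(x)

# Payload: 4 bits per value. Overhead: two fp32 constants per
# 32-value block = 64 bits / 32 values = 2 extra bits per value.
payload_bits = 4
overhead_bits = (32 + 32) / 32
print(payload_bits, overhead_bits)  # prints: 4 2.0
```

With these (assumed) parameters, the bookkeeping adds two bits on top of every four-bit value, a 50% overhead, which is exactly the kind of cost the article says offsets the compression gains.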
TurboQuant addresses this by combining two underlying methods. PolarQuant handles the primary compression step by converting standard Cartesian coordinate vectors into polar coordinates. A conventional quantizer records position along each axis independently, requiring normalization steps that vary based on the data. PolarQuant maps pairs of coordinates to a polar system, expressing them as a radius and an angle. Because the angular distribution is predictable and concentrated, the method eliminates the normalization step and the overhead costs it generates.
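The polar-coordinate idea can be sketched as follows. This is a simplified illustration of the concept described above, not PolarQuant itself: the bit width and the uniform angle grid are assumptions, and the radius is left unquantized here for clarity. Because every angle falls in the fixed range [-pi, pi], one global grid covers all pairs and no per-block normalization constants are needed.

```python
import numpy as np

def polar_quantize_pairs(v, angle_bits=8):
    """Pair up coordinates, convert each (x, y) pair to (radius,
    angle), and quantize the angle on a fixed global grid.
    Illustrative sketch; parameters are assumptions."""
    pairs = v.reshape(-1, 2)
    r = np.hypot(pairs[:, 0], pairs[:, 1])        # radius per pair
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])  # angle in [-pi, pi]
    levels = 2**angle_bits
    # Fixed range means one grid for all data: no per-block scale.
    codes = np.round((theta + np.pi) / (2 * np.pi) * (levels - 1))
    return r, codes.astype(np.uint16)

def polar_dequantize(r, codes, angle_bits=8):
    levels = 2**angle_bits
    theta = codes / (levels - 1) * 2 * np.pi - np.pi
    pairs = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    return pairs.reshape(-1)
```

A round trip through an 8-bit angle grid reconstructs each pair to within a small fraction of its radius, which is why a concentrated, predictable angular distribution makes the fixed grid viable.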