1-bit LLMs become practical to serve
Research · 2025-07-03
Source: arxiv.org
Ternary and 1-bit weight schemes cut memory enough to run large models on modest GPUs.
Quantising weights down to ternary or a single bit sounds lossy, but models trained with that constraint from the start keep most of their quality while sharply cutting the memory needed to serve them.
That saving is what lets bigger models fit on the cards listed on the prices page.
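To get a feel for the size of the saving, here is a rough back-of-the-envelope sketch of weight memory at different bit widths. The 70-billion-parameter figure is a hypothetical model size picked only for illustration, and the function name is ours. The estimate covers weights alone: it ignores activations, the KV cache, any layers kept at higher precision, and packing overhead, so real deployments will need more.

```python
import math


def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


# Hypothetical model size, chosen only for illustration.
N_PARAMS = 70e9

# Bits per weight under each scheme. A ternary weight carries log2(3)
# bits of information; simple storage often rounds that up to 2 bits.
schemes = {
    "16-bit baseline": 16.0,
    "ternary (ideal packing)": math.log2(3),
    "ternary (2-bit packing)": 2.0,
    "1-bit": 1.0,
}

for name, bits in schemes.items():
    print(f"{name:>24}: {weight_memory_gib(N_PARAMS, bits):7.1f} GiB")
```

Under these assumptions a 1-bit scheme needs one sixteenth of the weight memory of a 16-bit baseline, and ternary weights land between one eighth and roughly one tenth, depending on how tightly they are packed.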