1-bit LLMs become practical to serve

Research | 2025-07-03 | Source: arxiv.org

Ternary and 1-bit weight schemes cut memory enough to run large models on modest GPUs.

Quantising weights down to ternary values or a single bit sounds lossy, but when trained for it, these models keep most of their quality while slashing the memory needed to serve them.

That's what lets bigger models fit on the cards on the prices page.
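To give a rough sense of the saving, the sketch below estimates the memory taken by the weights alone at different bit widths. The 70-billion-parameter model size and the helper name `weight_memory_gib` are assumptions chosen for illustration, not figures from the source. It also ignores activations, the KV cache, packing overhead, and any layers kept at higher precision, so real deployments will need more.

```python
import math

# Information content of one ternary weight in {-1, 0, +1}: log2(3) bits.
# Real packing schemes may use slightly more (assumption: ideal packing here).
TERNARY_BITS = math.log2(3)


def weight_memory_gib(num_params: int, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    num_params = 70_000_000_000  # hypothetical model size, for illustration only
    for label, bits in [("fp16", 16.0), ("ternary", TERNARY_BITS), ("1-bit", 1.0)]:
        gib = weight_memory_gib(num_params, bits)
        print(f"{label:>8}: {gib:7.1f} GiB")
```

Under these assumptions, ternary weights take roughly a tenth of the space of 16-bit weights, and 1-bit weights a sixteenth.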

Read the original at arxiv.org ↗