What is ternary quantization (and what it isn't)

Outlier · solo-built in Grand Rapids · published 2026-05-19 · last updated 2026-05-20

Quick answer

Ternary stores each weight as -1, 0, or +1. That's about 1.58 bits per weight.
The pitch is ~10× smaller than 16-bit (16 / 1.58 ≈ 10). The catch is you have to train for it (quantization-aware training).
In 2026 the tooling and hardware support are still research-grade. Not production-ready.
4-bit MLX is what you actually want for local AI on a Mac right now.

Ternary quantization squeezes every model weight down to one of three values: -1, 0, or +1. On paper the payoff is huge, about 10× smaller than 16-bit floats. Compare that to the 4× you get from the 4-bit quantization most local-AI tools ship. So why isn't everyone using it? Because in 2026 it only really works on models that were trained for it, and it still hasn't pushed 4-bit off the throne as the production default.

How quantization works in one paragraph

Out of the box, model weights live as 16-bit floating point numbers (bf16 or fp16). Quantization swaps those for lower-precision integers plus a small scaling factor per group. 8-bit halves the memory. 4-bit quarters it. 2-bit cuts it 8×. Ternary needs three states, so it lands at 1.58 bits per weight, which works out to roughly 10× smaller than 16-bit. Lower precision buys you two things that matter a lot for local AI: smaller files on disk, and less memory bandwidth chewed up during inference.
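
To make "integers plus a small scaling factor per group" concrete, here is a minimal sketch in plain Python. It assumes a simple symmetric scheme with one 16-bit scale per group; real formats differ in the details (some also store a per-group offset), and the function names and sample values are made up for illustration.

```python
def quantize_group(weights, bits=4):
    """Symmetric quantization of one group: signed integer codes plus one scale."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Recover approximate weights from the integer codes."""
    return [c * scale for c in codes]


def effective_bits(bits, group_size, scale_bits=16):
    """Bits per weight once the per-group scale is amortized over the group."""
    return bits + scale_bits / group_size


group = [0.12, -0.40, 0.03, 0.27, -0.09, 0.35, -0.21, 0.00]  # illustrative values
codes, scale = quantize_group(group, bits=4)
restored = dequantize_group(codes, scale)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(codes, round(scale, 4), round(max_err, 4))
print(effective_bits(4, group_size=64))  # 4.25
```

The last line is why "4-bit" files come out a bit larger than a clean quarter of the 16-bit size: the scales take up space too.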

What ternary specifically is

The canonical reference here is the BitNet b1.58 paper from 2024. Every weight becomes -1, 0, or +1. Three states means log₂(3) ≈ 1.58 bits per weight, at least in theory. In practice you pack 5 ternary weights into a single byte (3⁵ = 243 fits inside 8 bits), which lands you at about 1.6 bits per weight in the real world.
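
The 5-per-byte packing is just base-3 arithmetic. Here's one way it could look, as a sketch: the function names and the digit order are my own choices for illustration, not a format any particular runtime uses.

```python
def pack_trits(trits):
    """Pack ternary weights (-1, 0, +1) into bytes, 5 per byte, as base-3 digits."""
    out = bytearray()
    for i in range(0, len(trits), 5):
        chunk = trits[i:i + 5]
        value = 0
        for t in reversed(chunk):
            value = value * 3 + (t + 1)   # map -1, 0, +1 to digits 0, 1, 2
        out.append(value)                 # never exceeds 3**5 - 1 = 242
    return bytes(out)


def unpack_trits(data, count):
    """Inverse of pack_trits; count trims the padding in the last byte."""
    trits = []
    for byte in data:
        for _ in range(5):
            trits.append(byte % 3 - 1)    # lowest base-3 digit comes out first
            byte //= 3
    return trits[:count]


weights = [1, 0, -1, -1, 1, 0, 0, 1, -1, 1]
packed = pack_trits(weights)
assert unpack_trits(packed, len(weights)) == weights
print(8 * len(packed) / len(weights))  # 1.6
```

Ten weights land in two bytes, which is where the 1.6 bits per weight comes from.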

The math gets cheaper too. Multiplying by -1, 0, or +1 is just a sign-flip, a zero-out, or a pass-through. You don't need multiplication hardware for the weight side of the dot product at all. That's the second thing ternary promises: cheaper matmul.
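
Here's what "no multiplication on the weight side" looks like as a toy sketch. It's plain Python for readability, not a real kernel, and it leaves out any scale factor that would be applied to the result afterwards.

```python
def ternary_dot(weights, activations):
    """Dot product where every weight is -1, 0, or +1: only adds and subtracts."""
    total = 0.0
    for w, x in zip(weights, activations):
        if w == 1:
            total += x      # pass-through
        elif w == -1:
            total -= x      # sign-flip
        # w == 0 contributes nothing, so the activation is skipped
    return total


weights = [1, 0, -1, 1]
activations = [0.5, 2.0, 1.5, -0.25]
print(ternary_dot(weights, activations))  # -1.25
```

The result matches the ordinary sum of products, but the loop body never multiplies anything.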

What the catches are

Two big ones. First, ternary only shines on models that were trained with it baked in (quantization-aware training, or QAT). Take a normal pretrained 16-bit model and crush it down to ternary after the fact, and quality drops in a way you'll notice. The model was never optimized to be representable in 1.58 bits, so you're asking it to do something it wasn't built for. It technically works. The output just gets worse.
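
You can get a feel for the post-training hit with a rough experiment. The sketch below ternarizes synthetic Gaussian weights using absmean-style rounding (scale by the mean absolute value, then round and clip), in the spirit of what the BitNet b1.58 paper describes, and compares the reconstruction error against a plain 4-bit baseline. The weights are random stand-ins, not a real model, and weight error is only a proxy for output quality.

```python
import random


def ternarize_absmean(weights, eps=1e-8):
    """Scale by mean |w|, then round and clip every weight to -1, 0, or +1."""
    scale = sum(abs(w) for w in weights) / len(weights)
    codes = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return codes, scale


def quantize_absmax(weights, bits):
    """Baseline for contrast: symmetric integer quantization with one scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def relative_error(weights, codes, scale):
    """RMS reconstruction error divided by the RMS of the original weights."""
    num = sum((w - c * scale) ** 2 for w, c in zip(weights, codes))
    den = sum(w * w for w in weights)
    return (num / den) ** 0.5


random.seed(0)  # fixed seed so the synthetic weights are reproducible
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]  # stand-in, not a real model

ternary_err = relative_error(weights, *ternarize_absmean(weights))
int4_err = relative_error(weights, *quantize_absmax(weights, bits=4))
print(f"ternary: {ternary_err:.2f}  4-bit: {int4_err:.2f}")
```

On weights like these, the ternary error comes out several times larger than the 4-bit one. Training with the ternary constraint in place lets the model adapt to that coarseness instead of having it imposed after the fact.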

Second is the boring stuff: tooling and hardware. 4-bit runs everywhere already. MLX, llama.cpp, vLLM, basically every major inference framework. Ternary needs custom kernels because the standard matmul wasn't written for {-1, 0, +1}. In 2026 the production tooling is still playing catch-up, and most ternary models you'll run into are research demos, not things people actually shipped.

Where 4-bit sits in 2026

4-bit is the sweet spot, and it's not close. The quality you lose against 16-bit is small enough that you won't see it on most tasks. The tooling is mature. Apple Silicon (via Metal), NVIDIA GPUs, and AMD GPUs (via ROCm) all have well-optimized 4-bit inference kernels. Outlier ships every tier in MLX 4-bit:

Nano 4B → ~3 GB
Lite 9B → ~6 GB
Quick 26B → ~16 GB
Core 27B → ~16 GB
Vision 35B-A3B → ~20 GB
Plus 397B-A17B → ~209 GB

The quality holds up. At 4-bit, Core 27B's coding accuracy stays within rounding noise of the 16-bit base model. You pay basically nothing in quality and get a file about 4× smaller.

When ternary might matter

There are two places ternary actually gets interesting:

Mobile / battery-constrained. Trying to cram a 7B model onto a phone with 6 GB RAM and a 5W power budget? Ternary's smaller bandwidth footprint saves you memory and battery at the same time. That's exactly where the research money is going.
Specialized accelerators. Build a chip designed around ternary operations and the speedup over 4-bit gets real. Nobody sells one commercially yet, but research chips exist.
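
For the phone scenario, the back-of-the-envelope math looks like this. It's a rough sketch that counts weights only: KV cache, activations, the OS, and per-group scale overhead are all ignored, and the helper name is made up for illustration.

```python
def weight_footprint_gb(params_billions, bits_per_weight):
    """Approximate weight storage in decimal GB: billions of params x bits / 8."""
    return params_billions * bits_per_weight / 8


RAM_GB = 6.0  # the phone from the example above

for label, bits in [("16-bit", 16), ("4-bit", 4), ("ternary (packed)", 1.6)]:
    size = weight_footprint_gb(7, bits)
    verdict = "fits" if size < RAM_GB else "does not fit"
    print(f"{label}: {size:.1f} GB, {verdict} in {RAM_GB:.0f} GB")
```

A 7B model is about 14 GB at 16-bit, 3.5 GB at 4-bit, and 1.4 GB packed ternary. Since every generated token has to read the weights from memory, the smaller footprint also means fewer bytes moved per token, which is where the battery savings come from.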

On a Mac in 2026? Neither of those is your problem. Unified memory plus Metal-tuned 4-bit kernels make 4-bit the obvious pick for the hardware you have now and the hardware coming next.

The short answer

So: ternary is a legitimate research direction. It might genuinely matter in the next 2–3 years as mobile and edge accelerators grow up. What it isn't is a free lunch you can pour over your existing pretrained models to make them about 2.5× smaller than 4-bit with no quality hit. For local AI on a Mac today, 4-bit is the right format, and it's what Outlier ships. So does pretty much every other Mac-native local-AI tool.

Frequently asked questions

What is ternary quantization?

Storing each model weight as one of three values (-1, 0, +1), about 1.58 bits per weight, roughly 10× smaller than 16-bit floats.

Is ternary better than 4-bit for local AI?

Not in 2026. Ternary needs quantization-aware training and lacks mature tooling; 4-bit is the production sweet spot with near-zero quality loss.

When will ternary quantization matter?

Likely for mobile and edge accelerators where memory bandwidth and power are tightly constrained, as hardware support matures.
