How a 397B model runs on consumer hardware — the architecture

Outlier · solo-built in Grand Rapids · published 2026-05-19 Last updated 2026-05-20

Quick answer

Four design choices stacked: MoE routing, 4-bit quantization, paged streaming, Apple unified memory.
Each one buys roughly 10×. Stack them and 800 GB naive collapses to 11 GB peak RAM.
Apple unified memory kills the PCIe copy step a PC build is stuck with.
End result: a 397B-parameter MoE model running on a 64 GB Mac at ~2.1 tok/s.

Running 397 billion parameters on a 64 GB Mac sounds impossible. In the standard architecture it is. What makes it work is four design choices stacked on top of each other: MoE routing in the model, 4-bit quantization in the weight format, paged expert streaming in the runtime, and Apple's unified memory underneath it all. Pull any single one out and the math falls apart.

Choice 1: the model is MoE, not dense

A dense 397B model runs every token through every parameter. Qwen3.5-397B-A17B doesn't. It's a Mixture-of-Experts model, so each token only touches 10 of 512 experts at each layer. That "A17B" in the name is the giveaway: about 17 billion parameters of compute fire per token, even though the model holds 397B worth of knowledge total.

This is the whole reason paging works. A dense model wants all its weights on every token, so you can't park any of them on disk. MoE keeps most of the weights idle most of the time, and that's the gap everything else exploits.

Choice 2: 4-bit quantization

Qwen ships its weights as 16-bit floats. At that precision, 397B parameters lands at roughly 800 GB on disk, and you'd need about that much RAM just to load it. Drop the weights to 4-bit integers (with a small floating-point scaling factor per group) and the same model shrinks to around 209 GB.

People worry about quality here, and in 2026 they mostly shouldn't. Done well, 4-bit barely moves the needle. The MLX-format quantization Outlier ships keeps the model's reasoning intact to within rounding noise, and you walk away 4× smaller on disk and 4× smaller in memory.

Choice 3: paged expert streaming

209 GB still doesn't fit in 64 GB of RAM. So you don't load all of it. Paged expert streaming keeps the shared backbone resident, holds a bounded cache of the experts you've used recently, and pulls the rest off the SSD only when a token actually calls for them. The router picks the experts for each token, and the runtime fires off a queue of concurrent SSD reads so the latency hides behind the compute.

That's what Outlier's V9 paged engine does. On an M1 Ultra with 64 GB, here's what it measures out to: peak OS-level RSS around 11 GB, decode holding steady at about 2.1 tok/s.

Choice 4: Apple unified memory + fast SSD

The Mac happens to be built for exactly this pattern. Two reasons stand out.

Unified memory. The GPU and CPU share one pool of RAM. When the streaming runtime reads an expert off the SSD into a buffer, the GPU can use that buffer for the matmul right away. No PCIe copy step. No shuffling memory back and forth between CPU and GPU. A PC build with a discrete GPU pays a multi-hundred-millisecond copy cost per expert, which drags the whole thing down by roughly 10×.
Fast on-package SSD. A modern Apple Silicon Mac sustains 5–15 GB/s sequential read off its internal NVMe. That's fast enough to hand over a few hundred MB of expert weights inside the time budget you need for ~3 tok/s generation.

Same logic explains the rest. External Thunderbolt SSDs lag (usually 2–3 GB/s sustained), and Intel Macs aren't supported at all because they don't have the unified memory that makes the SSD→GPU path direct.

The combined math

Naive	With MoE	+ 4-bit	+ Streaming
Dense 397B at 16-bit: ~800 GB	Active 17B at 16-bit: ~34 GB	Active 17B at 4-bit: ~8.5 GB	Active subset cached in RAM: ~11 GB peak
Need 800 GB RAM	Still need all 397B somewhere	Total ~209 GB on disk	Total ~209 GB on disk, ~11 GB peak in RAM

Every step knocks off an order of magnitude. No single trick gets you there. Stacked together, they put the model on hardware a working developer can actually buy.

The honest tradeoff

None of this beats the cloud on raw speed. Run the same model on a server with 8 H100 GPUs and it's far quicker. Plus 397B does about 2.1 tok/s on your Mac against ~50–80 tok/s on a dedicated server, and that's a real gap, not a rounding error. So what do you get back? No API key. No rate limit. Your code never leaves the machine. No bill once the weights are on disk. For long-form coding, architecture review, anything where waiting a bit is fine, that trade tends to pay off.

Frequently asked questions

How does a 397B model fit on a 64 GB Mac?

Four stacked choices: MoE routing, 4-bit quantization, paged expert streaming, and Apple's unified memory. Each cuts the requirement by roughly an order of magnitude.

Why does Apple Silicon help specifically?

Unified memory means SSD-streamed expert weights land directly in GPU-accessible memory with no PCIe copy step, which a PC build would need.

Is it as fast as the cloud?

No. About 2.1 tok/s locally versus 50 to 80 on a dedicated server. You trade speed for privacy, no API key, and no recurring bill.

Try Outlier free

Free Nano + Lite — local, private, no account. Pro $20/mo or $149/yr adds everything (Plus 397B, Marathon mode, Computer use, Deep Research v3, long context to 128K). Lifetime Pro from $99 (Founding 200, first 200 seats) or $200 (Founders 500). Apple Silicon only.

Download for Mac