How to run a 397B-parameter model on a 64 GB Mac Studio

Outlier · solo-built in Grand Rapids · published 2026-05-19 · Last updated 2026-05-20

Quick answer

Qwen3.5-397B-A17B is a 209 GB MoE model at MLX 4-bit. Big.
Outlier's V9 paged streaming engine runs it on a 64 GB Mac Studio at ~2.1 tok/s.
Generation peaks at ~11 GB of OS-level RSS. Not 209 GB.
You need Apple Silicon (M1+). SSD bandwidth is what actually holds you back.

A 397-billion-parameter model has no business running on a 64 GB Mac. The full Qwen3.5-397B-A17B weights eat about 209 GB on disk, and every rule of thumb says you want a server with hundreds of GB of RAM and a fat GPU to touch it. Then you actually try it on the Mac, and it runs anyway.

The naive math

Start with the obvious approach: Qwen3.5-397B-A17B at MLX 4-bit quantization. The HuggingFace repo (mlx-community/Qwen3.5-397B-A17B-4bit) weighs in at ~209 GB. Loading every parameter into unified memory at once means a Mac with at least that much RAM, plus a real margin on top for the KV cache and the activation tensors. The OS needs to breathe too. A 192 GB machine is already too small for 209 GB of weights, so in practice you're looking at a Mac Studio Ultra with 256 GB of unified memory or more. That's a $7,000–$10,000 machine.
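The same arithmetic as a few lines of Python. This is a back-of-the-envelope sketch, not anything Outlier ships: the 209 GB figure is the repo size quoted above, while the 16 GB margin for KV cache, activations, and the OS is an assumed placeholder you should adjust for your own workload.

```python
def fits_in_ram(weights_gb: float, ram_gb: float, overhead_gb: float) -> bool:
    """True if the weights plus a working margin fit in unified memory."""
    return weights_gb + overhead_gb <= ram_gb


WEIGHTS_GB = 209.0   # MLX 4-bit repo size quoted in the text
OVERHEAD_GB = 16.0   # assumed margin for KV cache, activations, and the OS

for ram_gb in (64, 128, 256):
    verdict = "fits" if fits_in_ram(WEIGHTS_GB, ram_gb, OVERHEAD_GB) else "does not fit"
    print(f"{ram_gb} GB Mac: {verdict}")
```

Even with the margin set to zero, nothing under 209 GB of RAM passes the check, which is the whole problem the rest of this post works around.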

A 64 GB Mac Studio can't hold the whole thing resident. No chance. The capacity engine, which is the blunt load-everything-into-RAM path, just refuses to start. So you've got two options. Don't run the model, or change the rules.

Why MoE makes this tractable

The trick is in how the model is built. Qwen3.5-397B-A17B is a Mixture-of-Experts model. The "397B" counts every parameter across all the experts. The "A17B" is how many are actually active per token. So for each token a router picks 10 of 512 experts, and only those 10 do any work. The other 502 just sit there.

So at any given moment the model only needs about 17B parameters' worth of weights to compute the next token. The rest of the 397B is dead weight, for that token at least. Which raises a question. What if you kept only the experts you need in memory and streamed the rest off the SSD when the router asks for them?
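To make the routing step concrete, here is a toy top-k router in plain Python. It is a sketch under loud assumptions: the scores are random stand-ins for real router logits, and `route` is a hypothetical helper, not Qwen's or Outlier's actual code. Only the 10-of-512 shape and the 17B-of-397B ratio come from the text above.

```python
import heapq
import random

NUM_EXPERTS = 512  # experts the router chooses from, per the text
TOP_K = 10         # experts picked per token, per the text


def route(scores: list[float], k: int = TOP_K) -> list[int]:
    """Return the indices of the k highest-scoring experts."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)


rng = random.Random(0)
scores = [rng.random() for _ in range(NUM_EXPERTS)]  # stand-in for router logits

chosen = route(scores)
idle = NUM_EXPERTS - len(chosen)

print(f"active experts: {sorted(chosen)}")
print(f"idle experts for this token: {idle}")
print(f"active parameter share: {17 / 397:.1%}")
```

The chosen set changes from token to token, which is why the engine can't simply pin ten experts in RAM and forget the rest.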

Paged expert streaming, in plain terms

That's the whole idea behind Outlier's V9 paged engine. The expert tensors live on disk as memory-mapped safetensors files and stay there. When the router calls for an expert on a given token, the engine reads just that expert's slice off the disk, a few hundred MB at a time, runs it through the activations, and lets the OS page it straight back out. Nothing lingers.
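Here is roughly what "read just that expert's slice" looks like with Python's standard `mmap` module. Treat it as an illustration, not Outlier's implementation: the file is a tiny stand-in built on the fly, every expert is an assumed fixed 4 KB, and `load_expert` is a hypothetical helper. A real safetensors file stores per-tensor byte offsets in a JSON header instead of using a fixed stride.

```python
import mmap
import os
import tempfile

EXPERT_BYTES = 4096  # toy size; the real slices are far larger
NUM_EXPERTS = 8      # toy count for the demo

# Build a stand-in weights file: expert i is EXPERT_BYTES copies of byte i.
with tempfile.NamedTemporaryFile(delete=False) as f:
    for i in range(NUM_EXPERTS):
        f.write(bytes([i]) * EXPERT_BYTES)
    path = f.name


def load_expert(mapped: mmap.mmap, expert_id: int) -> bytes:
    """Copy one expert's byte range out of the mapping."""
    start = expert_id * EXPERT_BYTES
    return mapped[start:start + EXPERT_BYTES]


with open(path, "rb") as f:
    # Mapping the file reserves address space; pages load only when touched.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        blob = load_expert(mapped, 5)

os.remove(path)
print(len(blob), blob[0])
```

Only the pages backing expert 5 get pulled in. The other seven experts never leave the disk, and the OS is free to drop the touched pages once memory gets tight.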

Here are the numbers from a 64 GB Mac Studio M1 Ultra running Plus 397B in V9 paged inference.

Peak OS-level RSS: about 11 GB
Generation speed: about 2.1 tokens per second
SSD read pressure: real but capped, via concurrent pread at queue depth 24
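One plausible reading of "concurrent pread at queue depth 24" is a pool of workers that keeps at most 24 positional reads in flight against the same file descriptor. The sketch below shows that pattern with the standard library. The thread pool, the chunk size, and the helper names are all assumptions made for illustration; the source doesn't say how Outlier schedules its reads.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

QUEUE_DEPTH = 24    # maximum reads in flight, matching the figure above
CHUNK_BYTES = 1024  # toy chunk size (assumption)
NUM_CHUNKS = 100    # toy chunk count (assumption)


def read_chunk(fd: int, index: int) -> bytes:
    """Positional read: os.pread never moves a shared file offset."""
    return os.pread(fd, CHUNK_BYTES, index * CHUNK_BYTES)


# Build a stand-in file: chunk i is CHUNK_BYTES copies of byte i.
with tempfile.NamedTemporaryFile(delete=False) as f:
    for i in range(NUM_CHUNKS):
        f.write(bytes([i]) * CHUNK_BYTES)
    path = f.name

fd = os.open(path, os.O_RDONLY)
try:
    with ThreadPoolExecutor(max_workers=QUEUE_DEPTH) as pool:
        # pool.map returns results in request order, whatever order reads finish in.
        chunks = list(pool.map(lambda i: read_chunk(fd, i), range(NUM_CHUNKS)))
finally:
    os.close(fd)
    os.remove(path)

total = sum(len(c) for c in chunks)
print(f"read {total} bytes in {len(chunks)} chunks")
```

Capping the pool is what keeps the read pressure bounded: the SSD always has work queued, but never more than 24 requests at once.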

2.1 tok/s is slow. No way around it. Claude Opus runs ~80–100 tok/s in the cloud, and that's the comparison people will reach for. But this model is doing the full job. Every token routes through the real 397B logic. The only difference is that the experts arrive from disk instead of already sitting in RAM.

Hardware floor and tradeoffs

None of this works without the hardware underneath it. Apple Silicon gives you fast unified-memory bandwidth, and on M1 Pro/Max/Ultra and newer the NVMe SSDs hold several GB/s of sequential read. That's what makes paging fast enough to bother with. So what does Outlier actually ask for to run Plus?

RAM: 64 GB unified memory. 32 GB runs out of headroom once the active expert set and the KV cache both grow during a long conversation.
Storage: 250 GB free for the model download. The full 209 GB of weights comes down once, and after that everything streams from those local files. The network's out of the loop.
CPU/GPU: Apple Silicon M1/M2/M3/M4. The MLX backend won't touch an Intel Mac.
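If you'd rather check a machine against that list than eyeball it, a preflight script is a few lines. This is a hypothetical sketch, not an Outlier tool: `meets_floor` and `probe` are names invented here, the thresholds are the ones listed above, and `probe` assumes a Unix-like system where `os.sysconf` exposes the page count. A Python build running under Rosetta will report `x86_64` even on Apple Silicon.

```python
import os
import platform
import shutil

MIN_RAM_GB = 64         # unified memory floor from the list above
MIN_FREE_DISK_GB = 250  # free space for the model download


def meets_floor(machine: str, system: str, ram_gb: float, free_disk_gb: float) -> list[str]:
    """Return the unmet requirements; an empty list means the machine qualifies."""
    problems = []
    if not (system == "Darwin" and machine == "arm64"):
        problems.append("needs an Apple Silicon Mac")
    if ram_gb < MIN_RAM_GB:
        problems.append(f"needs {MIN_RAM_GB} GB RAM, found {ram_gb:.0f} GB")
    if free_disk_gb < MIN_FREE_DISK_GB:
        problems.append(f"needs {MIN_FREE_DISK_GB} GB free disk, found {free_disk_gb:.0f} GB")
    return problems


def probe() -> tuple[str, str, float, float]:
    """Read this machine's architecture, OS, RAM (GiB), and free disk (GB)."""
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    free_bytes = shutil.disk_usage(os.path.expanduser("~")).free
    return platform.machine(), platform.system(), ram_bytes / 2**30, free_bytes / 1e9


if __name__ == "__main__":
    issues = meets_floor(*probe())
    print("ready for Plus 397B" if not issues else "\n".join(issues))
```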

Set this against the capacity engine and the trade is clean. V9 paged inference peaks at about 11 GB where the capacity path needs the full 209 GB of weights resident, roughly 19× less peak memory. But now SSD bandwidth is your ceiling instead of memory bandwidth. Long context with lots of cache reuse, and the gap shrinks. Fresh prompt with cold experts, and it opens right back up.

When to use it

Reach for Plus 397B on the V9 paged engine when you want the strongest local reasoning in the lineup and you're willing to wait for it. Long-form analysis. Code review across a big diff. The kind of architectural back-and-forth where a few minutes of generation buys you something actually worth reading. If you want a reply in seconds, this is the wrong tool. Drop down to Core 27B (20.7 tok/s on M1 Ultra) or Lite 9B (53.4 tok/s) instead. Outlier ships all of them, and you can swap mid-conversation without losing your place.
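To put "willing to wait" in numbers, here is the wait for one reply on each tier, using the tok/s figures quoted above. The 500-token reply length is an assumption picked for illustration, and the estimate covers generation only, ignoring prompt processing and any cold-expert warm-up.

```python
# Generation speeds on M1 Ultra, as quoted in the text.
TOK_PER_SEC = {"Plus 397B": 2.1, "Core 27B": 20.7, "Lite 9B": 53.4}

REPLY_TOKENS = 500  # assumed reply length


def seconds_for(tokens: int, tok_per_sec: float) -> float:
    """Generation time only; prompt processing is not included."""
    return tokens / tok_per_sec


for name, rate in TOK_PER_SEC.items():
    secs = seconds_for(REPLY_TOKENS, rate)
    print(f"{name}: {secs:.0f} s ({secs / 60:.1f} min)")
```

At that length Plus lands around four minutes while the smaller tiers finish in well under thirty seconds, so the choice comes down to whether the answer is worth the wait.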

Frequently asked questions

Can you really run a 397B model on a 64 GB Mac?

Yes. Outlier's V9 paged engine runs Qwen3.5-397B-A17B on a 64 GB Mac Studio at about 2.1 tokens per second, using roughly 11 GB of peak memory by streaming expert weights from the SSD on demand.

How much disk space does the 397B model need?

About 209 GB for the MLX 4-bit weights. They download once, then inference streams from those local files with no further network use.

Why does a 397B MoE model fit when a 70B dense model might not?

Because a Mixture-of-Experts model activates only a fraction of its parameters per token (17B of 397B here), so the engine keeps active experts in RAM and pages the rest from disk.

Try Outlier free

Free Nano + Lite — local, private, no account. Pro $20/mo or $149/yr adds everything (Plus 397B, Marathon mode, Computer use, Deep Research v3, long context to 128K). Lifetime Pro from $99 (Founding 200, first 200 seats) or $200 (Founders 500). Apple Silicon only.
