How to run Qwen3.5-397B on a Mac without 512 GB of RAM

Outlier · solo-built in Grand Rapids · published 2026-05-19 · Last updated 2026-05-20

Quick answer

The MoE design fires only 10 of 512 experts for any given token.
Paged expert streaming parks the active experts in RAM and pulls the rest off the SSD.
What you get: 11 GB peak RAM, 2.1 tok/s on a 64 GB Mac Studio Ultra.
You'll want an internal NVMe SSD (5–15 GB/s read). External Thunderbolt SSDs run ~2× slower.

Qwen3.5-397B-A17B is a Mixture-of-Experts model from Alibaba. At MLX 4-bit quantization the full weights come to around 209 GB. Everyone tells you to go buy a server, and for a 70B dense model they're right. Qwen's MoE design gives you an out: keep the active experts in unified memory and stream the rest off your SSD as you need them.

The model in one paragraph

Qwen3.5-397B-A17B carries 512 experts per layer. For each token, a router picks 10 of them and only those 10 do work. The "397B" is every parameter across all the experts added up. The "A17B" is what's actually active per token. That router decision is content-dependent, so different tokens wake up different experts. Across a long conversation, though, expert usage settles into a fairly stable pattern. That stability is what makes caching pay off.
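
To make the routing step concrete, here is a minimal sketch of top-k expert selection in plain Python. The 512 and 10 come from the paragraph above; everything else is an assumption for illustration. The random logits stand in for a real router output, and renormalizing the weights over only the selected experts is a common MoE convention, not something confirmed for Qwen's router.

```python
import math
import random

NUM_EXPERTS = 512   # experts per layer, from the article
TOP_K = 10          # experts activated per token, from the article


def route(router_logits, k=TOP_K):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Pick the k largest logits; ties resolve toward the lower index.
    top = sorted(range(len(router_logits)),
                 key=lambda i: (-router_logits[i], i))[:k]
    # Softmax over just the selected logits so the weights sum to 1
    # (an assumed convention, see the note above).
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


rng = random.Random(0)
# Stand-in for the router's per-token scores; a real model computes these.
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
chosen = route(logits)
print(len(chosen), "of", NUM_EXPERTS, "experts active")
```

Only the experts in `chosen` would need their weights in memory for this token, which is the property the rest of the article leans on.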

Why a 64 GB Mac is enough

If only 10 of 512 experts ever fire per token, there's no reason to keep all 512 sitting in RAM. What you actually need is small:

The shared backbone (attention layers, embedding, router weights). It's a tiny slice of the total and it stays resident.
A cache of the experts you've used recently, sized to whatever RAM you've got free.
An on-disk copy of all 512 experts, ready to read into the cache the moment one's missing.
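
The second and third items can be sketched as a least-recently-used cache that falls back to disk on a miss. This is a toy model under stated assumptions, not Outlier's V9 engine: `ExpertCache`, `fake_loader`, and the capacity of 2 are all hypothetical names and values chosen for illustration.

```python
from collections import OrderedDict


class ExpertCache:
    """LRU cache keyed by (layer, expert_id); values are whatever the loader returns."""

    def __init__(self, capacity, load_from_disk):
        self.capacity = capacity              # max experts kept resident
        self.load_from_disk = load_from_disk  # hypothetical callable: key -> weights
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, layer, expert_id):
        key = (layer, expert_id)
        if key in self.entries:
            self.entries.move_to_end(key)     # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        weights = self.load_from_disk(key)    # the SSD read would happen here
        self.entries[key] = weights
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        return weights


def fake_loader(key):
    # Placeholder for reading one expert's tensors from the on-disk copy.
    return f"weights for layer {key[0]}, expert {key[1]}"


cache = ExpertCache(capacity=2, load_from_disk=fake_loader)
for expert_id in [3, 7, 3, 9, 7]:
    cache.get(layer=0, expert_id=expert_id)
print(cache.hits, cache.misses)  # prints: 1 4
```

With room for only two experts, the second request for expert 7 misses because expert 9 pushed it out. A larger capacity turns that miss into a hit, which is why the cache is sized to whatever RAM is free.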

On a 64 GB Mac Studio Ultra running Outlier's V9 paged engine, peak OS-level RSS during Plus 397B generation lands around 11 GB. The other ~50 GB of RAM goes to the OS and whatever else you've got open. Generation runs at roughly 2.1 tokens per second. Memory isn't the bottleneck. The SSD is. Specifically, how fast it can pull an expert's weights when the router asks for one that isn't cached.

What the SSD needs to be

This is where the SSD earns its keep. Apple's first-party Mac Studio SSDs sustain several GB/s of sequential read. A first-generation M1 Ultra reads at ~7 GB/s, and M2 Ultra and M3 Max go higher. Plenty to feed the expert demand for one decoded token every ~480 ms, which is the regime that lands you at ~2.1 tok/s.
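
As a back-of-the-envelope check, the sketch below computes the most data an SSD could deliver in one token's time slot at 2.1 tok/s. It assumes the whole slot is spent on sequential reads, so it is a ceiling and not a measurement. The article does not state how many bytes a token actually pulls, and the function name is made up for this example.

```python
TOKENS_PER_S = 2.1  # generation rate quoted in the article


def max_read_per_token_gb(read_gb_per_s, tokens_per_s=TOKENS_PER_S):
    """Upper bound on GB readable during one token's time slot."""
    return read_gb_per_s / tokens_per_s


# 5, 7 and 15 GB/s are the internal-SSD figures mentioned in the article.
for gb_per_s in (5.0, 7.0, 15.0):
    budget = max_read_per_token_gb(gb_per_s)
    print(f"{gb_per_s:>4.1f} GB/s -> at most {budget:.2f} GB per token")
```

Real reads are smaller and more scattered than this ceiling suggests, since only the experts missing from the cache need to come off disk.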

Two things follow from that.

External Thunderbolt SSDs are slower, usually 2–3 GB/s sustained. Plus still runs on one, just slower. Figure roughly half the tok/s.
A nearly-full internal SSD lets macOS fragment its writes, and read speeds can regress. Keep 100+ GB free.
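
A quick way to check the free-space advice is Python's standard `shutil.disk_usage`. In this sketch the 100 GB threshold comes from the line above, while the `"/"` default is an assumption; point it at whichever volume holds the model.

```python
import shutil

MIN_FREE_GB = 100  # headroom suggested in the article


def free_gb(path="/"):
    """Free space on the volume containing `path`, in decimal gigabytes."""
    return shutil.disk_usage(path).free / 1e9


def has_headroom(path="/", min_free_gb=MIN_FREE_GB):
    """True if the volume has at least `min_free_gb` free."""
    return free_gb(path) >= min_free_gb


print(f"{free_gb():.0f} GB free; enough headroom: {has_headroom()}")
```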

The honest tradeoff

Nothing here is free, so let's be straight about it. Against the capacity engine (the path that crams everything into RAM), V9 paged inference burns about 10× less peak memory, and the price is being SSD-bound rather than memory-bound. On long contexts, once expert usage settles, the cache holds the hot set and SSD reads taper off. On a fresh prompt the experts are all cold, so every token misses and triggers a read. Your first response is always the slowest one.
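
The cold-start effect shows up even in a toy simulation. The sketch below assumes a skewed (1/rank) expert popularity and a cache that never evicts. Both are simplifications chosen for illustration, not measurements of the real router or of V9, and the function names are hypothetical.

```python
import random

NUM_EXPERTS, TOP_K = 512, 10   # from the article
TOKENS, WINDOW = 200, 50       # hypothetical run length and averaging window


def pick_experts(rng, weights, k):
    """Draw k distinct expert ids, favoring heavily weighted ones."""
    chosen = set()
    while len(chosen) < k:
        chosen.add(rng.choices(range(len(weights)), weights=weights)[0])
    return chosen


def simulate_misses(seed=0):
    rng = random.Random(seed)
    weights = [1.0 / (i + 1) for i in range(NUM_EXPERTS)]  # assumed skew
    cached = set()
    misses_per_token = []
    for _ in range(TOKENS):
        needed = pick_experts(rng, weights, TOP_K)
        misses_per_token.append(len(needed - cached))  # experts read from SSD
        cached |= needed                               # no eviction in this toy
    return misses_per_token


misses = simulate_misses()
first = sum(misses[:WINDOW]) / WINDOW
last = sum(misses[-WINDOW:]) / WINDOW
print(f"avg misses per token: first {WINDOW} = {first:.1f}, last {WINDOW} = {last:.1f}")
```

The very first token misses on all 10 experts because nothing is cached yet. Later tokens mostly reuse experts that are already resident.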

Against cloud inference, the math is a speed hit of roughly 40–50× (2.1 tok/s versus the ~80–100 you'd get from Claude Opus). In return you get no API key, no rate limit, no shipping your code off to someone's server, and no monthly bill once the model's on disk. If you're doing an architecture review or grinding through a long debugging session or a real code refactor, that swap usually favors running local. For "what's the syntax for X in Python," it doesn't.
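
To get a feel for the gap in wall-clock terms, this sketch converts the quoted rates into waiting time for a single response. The 500-token response length is a hypothetical figure; the rates are the ones quoted above.

```python
def seconds_for(tokens, tokens_per_s):
    """Time to generate `tokens` at a steady rate, ignoring prompt processing."""
    return tokens / tokens_per_s


RESPONSE_TOKENS = 500  # hypothetical response length

rates = {
    "local, 2.1 tok/s": 2.1,
    "cloud, 80 tok/s": 80.0,
    "cloud, 100 tok/s": 100.0,
}
for label, rate in rates.items():
    secs = seconds_for(RESPONSE_TOKENS, rate)
    print(f"{label}: {secs:.0f} s ({secs / 60:.1f} min)")
```

A few minutes per answer is tolerable for a long review or refactor and frustrating for a one-line syntax question, which is the tradeoff described above.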

Setting it up

Easiest route is Outlier's Mac app. Download the DMG, install it, switch to the Pro tier. Plus 397B needs that Pro tier: $20/mo, $149/yr, $99 lifetime (Founding 200, first 200 seats), or $200 lifetime (Founders 500). The app takes care of the model download, the streaming engine, and the chat / agent UI for you. You can roll your own with mlx-lm and a custom streaming loader, but that's a pile of Python plumbing and you lose everything around it.

Frequently asked questions

Do you need 512 GB of RAM to run Qwen3.5-397B?

No. With paged expert streaming the model runs on a 64 GB Mac at about 2.1 tok/s and roughly 11 GB peak RAM.

Where are the model weights stored?

On your SSD as MLX 4-bit safetensors, about 209 GB. Experts stream into memory on demand as the router selects them.

Does SSD speed matter?

Yes. Internal Apple Silicon NVMe sustaining 5 to 15 GB/s is what you want. External Thunderbolt SSDs deliver roughly half the throughput, so the model still runs on one, just correspondingly slower.

Try Outlier free

Free Nano + Lite — local, private, no account. Pro $20/mo or $149/yr adds everything (Plus 397B, Marathon mode, Computer use, Deep Research v3, long context to 128K). Lifetime Pro from $99 (Founding 200, first 200 seats) or $200 (Founders 500). Apple Silicon only.
