What is a paged MoE inference engine?

Outlier · solo-built in Grand Rapids · published 2026-05-19 · Last updated 2026-05-20

Quick answer

MoE means only k of N experts fire per token (10 of 512 in Qwen3.5-397B-A17B).
Paged inference keeps the active experts in RAM and streams the rest off your SSD.
Same trick as OS virtual memory. Bounded working set, pages pulled on demand.
Outlier's V10 engine runs Plus 397B at ~11 GB RAM, ~2.1 tok/s on a 64 GB Mac.

A paged MoE inference engine runs Mixture-of-Experts models without holding every weight in memory at once. It leans on one fact about MoE: only a handful of experts fire per token. So it parks the active ones in RAM and pulls the rest off disk when they're needed. That's the whole reason a 397B-parameter model fits on a 64 GB Mac.

What MoE actually is

Start with a normal model. A standard ("dense") network runs every token through every parameter. A 70B dense model burns 70B parameters' worth of compute on each token, every time, no exceptions.

MoE breaks that pattern. A Mixture-of-Experts model chops each layer into a bunch of small "expert" sub-networks and adds a router. The router looks at the token, picks k experts from the pool (often 2 to 8, and 10 in the model discussed here), and only those k do anything. Everyone else sits idle for that token.
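A minimal sketch of that routing step, in plain Python. The function name route_top_k and the toy scores are made up for illustration, and renormalizing the softmax over just the chosen experts is only one common convention; real models differ in the details.

```python
import math

def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Indices of the k largest logits, best first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over just the selected logits so the k weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Hypothetical router scores for a pool of 8 experts.
logits = [0.1, 2.3, -1.0, 0.7, 1.9, -0.4, 0.0, 1.1]
chosen = route_top_k(logits, k=2)
print(chosen)  # experts 1 and 4 are selected; the other six stay idle
```

Everything not in chosen does no work for this token, which is the property the rest of this article leans on.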

Take Qwen3.5-397B-A17B. 397B total parameters, but only 17B active per token. That "A17B" is your compute bill per token, roughly what a 17B dense model costs. The "397B" is how much the thing actually knows.
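The ratios are easy to check yourself. This sketch only uses the figures quoted in this article (397B total, 17B active, 10 of 512 experts); the variable names are illustrative.

```python
total_params = 397e9       # "397B"
active_params = 17e9       # "A17B"
experts_per_layer = 512
experts_fired = 10

# Share of all parameters that do work on a given token.
active_fraction = active_params / total_params

# Share of the expert pool the router selects per token.
expert_fraction = experts_fired / experts_per_layer

print(f"active parameters: {active_fraction:.1%}")
print(f"experts fired:     {expert_fraction:.2%}")
```

The parameter share comes out a bit higher than the expert share. One plausible reading is that the always-on parts of the model (attention, embeddings, router) count toward the active total on every token.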

The opportunity for paging

So if only 17B of those 397B parameters do anything on a given token, why hold all 397B in RAM? You don't have to. What you actually need is:

The shared backbone (attention layers, embedding, the router itself).
A cache of recently-used experts, sized to whatever RAM you've got.
An on-disk store holding all the experts, ready for the cache to pull from.

Operating systems have done this forever. It's virtual memory: don't keep the whole working set resident, page things in when you need them. The pages here just happen to be entire expert tensor blocks, usually a few hundred MB apiece.
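Here is a rough sketch of the cache piece, assuming a least-recently-used eviction policy and a byte budget. ExpertCache and load_from_disk are hypothetical names, and LRU is an assumption for illustration, not a statement about which policy Outlier's engine uses.

```python
from collections import OrderedDict

class ExpertCache:
    """LRU cache of expert weight blocks, bounded by a byte budget."""

    def __init__(self, budget_bytes, load_from_disk):
        self.budget_bytes = budget_bytes
        self.load_from_disk = load_from_disk  # hypothetical: expert_id -> bytes
        self.blocks = OrderedDict()           # expert_id -> bytes, oldest first
        self.used_bytes = 0
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(expert_id)  # mark as most recently used
            return self.blocks[expert_id]
        self.misses += 1
        block = self.load_from_disk(expert_id)
        self.blocks[expert_id] = block
        self.used_bytes += len(block)
        # Evict least recently used blocks until we are back under budget.
        while self.used_bytes > self.budget_bytes and len(self.blocks) > 1:
            _, evicted = self.blocks.popitem(last=False)
            self.used_bytes -= len(evicted)
        return block

# Toy store: every expert is a 4-byte block; the cache holds two of them.
cache = ExpertCache(budget_bytes=8, load_from_disk=lambda eid: bytes([eid]) * 4)
for eid in [0, 1, 0, 2, 1]:
    cache.get(eid)
print(cache.hits, cache.misses, list(cache.blocks))
```

The budget is the knob that matters: a bigger budget means more hits and fewer disk reads, and the working set never grows past it.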

The engine, step by step

Walk through one generated token:

Embed the new token, run the attention layers (always resident).
In each MoE layer, the router picks k experts to use for this token's MLP.
For each chosen expert, check the in-RAM cache. Hit → use it. Miss → read it from SSD.
Compute the expert MLPs, combine, produce next-token logits.
Sample the next token, repeat.
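The middle steps (route, check the cache, compute, combine) can be sketched for a single layer like this. Everything here is a toy under stated assumptions: moe_layer_step and read_expert_from_ssd are hypothetical names, the cache is a plain unbounded dict, router scores are assumed to be non-negative, and each "expert" is just a function that scales its input.

```python
def moe_layer_step(hidden, router_scores, k, cache, read_expert_from_ssd):
    """One token through one MoE layer: route, fetch experts, combine."""
    # Router picks the k best-scoring experts.
    chosen = sorted(range(len(router_scores)),
                    key=router_scores.__getitem__, reverse=True)[:k]
    total = sum(router_scores[i] for i in chosen)
    output = [0.0] * len(hidden)
    for expert_id in chosen:
        # Cache hit uses RAM; a miss falls through to the SSD read.
        if expert_id not in cache:
            cache[expert_id] = read_expert_from_ssd(expert_id)
        expert = cache[expert_id]
        # Run the expert and add its weighted contribution.
        weight = router_scores[expert_id] / total
        for j, value in enumerate(expert(hidden)):
            output[j] += weight * value
    return output

ssd_reads = []  # records which experts had to come from "disk"

def read_expert_from_ssd(expert_id):
    ssd_reads.append(expert_id)
    scale = expert_id + 1  # toy expert: multiply the input by (id + 1)
    return lambda h: [scale * x for x in h]

cache = {}
scores = [0.1, 0.6, 0.1, 0.2]
out_first = moe_layer_step([1.0, 2.0], scores, 2, cache, read_expert_from_ssd)
out_second = moe_layer_step([1.0, 2.0], scores, 2, cache, read_expert_from_ssd)
print(out_first, ssd_reads)
```

The second call touches the same two experts and triggers no new reads, which is the warm-cache behavior the walkthrough describes.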

Where does the time go? Cache hits are basically free, just a memory access. Misses are bound by SSD bandwidth. A modern Apple Silicon SSD sustains 5–15 GB/s, so reading a single expert block of a few hundred MB takes a few tens of milliseconds. Run enough tokens and the hot experts settle into cache, so the average read cost drops as the cache warms up.
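A back-of-envelope version of that cost, as a sketch. The 300 MB block size and the 90% hit rate are assumed example values (the article only says "a few hundred MB" and gives no hit rate), and hits are treated as costing nothing.

```python
def expert_read_ms(expert_mb, ssd_gb_per_s):
    """Milliseconds to read one block: MB divided by GB/s gives ms (1 GB = 1000 MB)."""
    return expert_mb / ssd_gb_per_s

def average_fetch_ms(hit_rate, miss_ms):
    """Expected fetch cost per expert when cache hits are treated as free."""
    return (1.0 - hit_rate) * miss_ms

slow = expert_read_ms(300, 5)    # low end of the quoted bandwidth range
fast = expert_read_ms(300, 15)   # high end of the quoted bandwidth range
warm = average_fetch_ms(0.9, slow)

print(f"cold read: {fast:.0f} to {slow:.0f} ms")
print(f"average at 90% hits: about {warm:.0f} ms")
```

The takeaway is the shape, not the exact figures: the miss penalty is fixed by the SSD, so the hit rate is what moves the average.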

The numbers, in practice

Numbers from Outlier's V10 paged engine running Qwen3.5-397B-A17B on a 64 GB Mac Studio M1 Ultra:

Peak OS-level RSS: about 11 GB (the in-RAM capacity engine, which keeps every weight resident, would want ~209 GB plus headroom for the same model)
Decode speed: about 2.1 tokens per second
SSD read pattern: concurrent pread with queue depth 24, sustained

2.1 tok/s is a measured number, not a brochure figure. Sure, it's slower than the in-RAM capacity engine. But that engine wants 192+ GB of RAM, which most people simply don't have. The alternative isn't a faster run. It's no run.
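For a feel of what "concurrent pread with queue depth 24" looks like, here is a small sketch using a thread pool. It is one way to keep 24 reads in flight, not a description of how Outlier's engine does it; read_blocks, the 1 KiB block size, and the temp-file "expert store" are all made up for the demo. os.pread is POSIX-only, so this runs on macOS and Linux.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

QUEUE_DEPTH = 24  # the queue depth quoted in the numbers above

def read_blocks(path, requests, queue_depth=QUEUE_DEPTH):
    """Read (offset, length) ranges from one file, up to queue_depth at a time."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=queue_depth) as pool:
            # os.pread takes an explicit offset, so threads can share one
            # descriptor without fighting over a file position.
            # Note: pread may return fewer bytes than asked; a real reader
            # would loop until each range is complete.
            return list(pool.map(lambda r: os.pread(fd, r[1], r[0]), requests))
    finally:
        os.close(fd)

# Toy "expert store": 48 blocks of 1 KiB, block i filled with byte value i.
BLOCK = 1024
with tempfile.NamedTemporaryFile(delete=False) as f:
    for i in range(48):
        f.write(bytes([i]) * BLOCK)
    store_path = f.name

blocks = read_blocks(store_path, [(i * BLOCK, BLOCK) for i in (3, 17, 40)])
os.remove(store_path)
print([b[0] for b in blocks])
```

Keeping many reads outstanding is what lets an SSD get near its sustained bandwidth; issuing expert reads one at a time would leave most of it unused.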

Where paged MoE wins and loses

Wins:

Runs models that flat-out wouldn't fit on your hardware otherwise.
The memory footprint stays bounded, so the rest of your machine is still yours to use.
Cache locality means your common usage patterns stay quick.

Loses:

Cold expert misses hit the SSD, not memory. Your first response is the slowest one.
Dense (non-MoE) models get no benefit here. They genuinely touch every weight on every token.
An external SSD over Thunderbolt delivers about half the throughput of internal NVMe, so paging slows down to match.

Why this matters for local AI

This changes who gets to run these models. Before paged MoE, a flagship-class MoE model meant shelling out for a $7,000+ Mac Studio Ultra with 192 GB of unified memory. Now the same model runs on a 64 GB Mac Studio at roughly half the price. That architectural trick is the reason an individual developer, not just a well-funded lab, can actually run Plus 397B.

Frequently asked questions

What is paged MoE inference?

Running a Mixture-of-Experts model by keeping active experts in RAM and streaming the rest from SSD on demand, similar to how an OS pages virtual memory.

Why does paging only work for MoE models?

Because MoE activates just a few experts per token, so most weights can stay on disk. Dense models need all weights every token and can't be paged this way.

How fast is paged MoE inference?

On a 64 GB Mac, Outlier's V10 engine runs Plus 397B at about 2.1 tok/s with roughly 11 GB peak RAM.

Try Outlier free

Free Nano + Lite — local, private, no account. Pro $20/mo or $149/yr adds everything (Plus 397B, Marathon mode, Computer use, Deep Research v3, long context to 128K). Lifetime Pro from $99 (Founding 200, first 200 seats) or $200 (Founders 500). Apple Silicon only.