Concept

What is a paged Mixture-of-Experts model?

Last updated 2026-06-18 · Outlier v1.11.469

Quick answer

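A paged Mixture-of-Experts (MoE) model streams its expert weights from disk instead of holding all of them in memory. A 397B-parameter MoE that does not fit in unified memory has to be paged this way. The router picks a subset of experts per token; the engine keeps a cache of recently used experts resident in RAM and reads the rest on demand.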

Why do paged Mixture-of-Experts models matter for local AI on Apple Silicon?

The decision to run a model locally on a Mac comes down to three numbers: weight size on disk, peak generation memory, and the memory bandwidth feeding the decode loop. The concept above bears directly on each of those.

A Mixture-of-Experts model splits its feed-forward block into many small experts; the router picks a small subset per token. Qwen3.5-397B-A17B has 60 layers, 512 experts per layer, and a configured top-k of 10. That means a single decode step touches 10 experts out of 512 at each of the 60 layers (600 expert evaluations per token), even though all 512 experts per layer are stored on disk.
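The routing arithmetic above can be checked in a few lines. This is a minimal sketch, not the engine's router: the layer, expert, and top-k counts come from the text, while the route helper and the random scores are illustrative assumptions.

```python
import heapq
import random

# Figures quoted in the text for Qwen3.5-397B-A17B.
NUM_LAYERS = 60
EXPERTS_PER_LAYER = 512
TOP_K = 10

total_experts = NUM_LAYERS * EXPERTS_PER_LAYER      # every expert stored on disk
active_per_token = NUM_LAYERS * TOP_K               # experts touched by one decode step
active_fraction = active_per_token / total_experts  # share of experts used per token


def route(scores, k):
    """Return the indices of the k highest router scores (hypothetical helper)."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)


# Stand-in router scores for one layer; a real router produces these from the token.
rng = random.Random(0)
layer_scores = [rng.random() for _ in range(EXPERTS_PER_LAYER)]
selected = route(layer_scores, TOP_K)

print(f"experts on disk:         {total_experts}")
print(f"experts used per token:  {active_per_token}")
print(f"active fraction:         {active_fraction:.2%}")
print(f"selected in one layer:   {sorted(selected)}")
```

Fewer than 2% of the experts take part in any one token, which is the property that makes paging workable.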

Paging is the trick that makes this fit on a 64 GB Mac. The engine keeps the K most recently used experts resident, reads the rest from the safetensors shards on demand, and serves the routed selection out of that mix.

What is the concrete number?

On Outlier’s Plus tier, K is locked at 20 because a 5-prompt sweep showed that K=4 fails coherence (3/5), K=32 regresses speed by 2.5%, and K=48 adds only 1.3%, which is within noise.

How does this play out in the Outlier shipping lineup?

Outlier’s Plus tier sets K_override=20, cache_gb=8.0, OUTLIER_MMAP_EXPERTS=0, and lazy-loads the full state dict.
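One way to picture these settings together is a small config object. This is a sketch only: PlusTierConfig and load_config are hypothetical names, and the assumption that OUTLIER_MMAP_EXPERTS is read from the environment as "0" or "1" is ours; only the three values come from the text.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class PlusTierConfig:
    """Hypothetical container for the Plus-tier settings quoted in the text."""
    k_override: int = 20        # K_override=20
    cache_gb: float = 8.0       # cache_gb=8.0
    mmap_experts: bool = False  # OUTLIER_MMAP_EXPERTS=0


def load_config(env=os.environ):
    """Build the config, treating OUTLIER_MMAP_EXPERTS=1 as 'use mmap' (assumed parsing)."""
    return PlusTierConfig(mmap_experts=env.get("OUTLIER_MMAP_EXPERTS", "0") == "1")


default_cfg = load_config({})
mmap_cfg = load_config({"OUTLIER_MMAP_EXPERTS": "1"})
print(default_cfg)
print(mmap_cfg)
```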

What is the v1.9 implication?

Page-aligned pread() fanout with libdispatch groups is the Flash-MoE technique that would close the roughly 2.7× gap between 1.59 tok/s (Outlier today) and 4.36 tok/s (Flash-MoE on M3 Max). It is on the v1.9 backlog.
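The idea of a page-aligned pread() fanout can be sketched in Python. This is not the Flash-MoE implementation: a ThreadPoolExecutor stands in for libdispatch groups, the function names are hypothetical, and the byte ranges are made up. Each request is widened to page boundaries, read with os.pread (which takes an explicit offset, so threads can share one file descriptor), and trimmed back to the bytes asked for.

```python
import mmap
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

PAGE = mmap.PAGESIZE  # OS page size for this machine


def aligned_span(offset, length, page=PAGE):
    """Widen [offset, offset+length) to page boundaries; return (start, span)."""
    start = (offset // page) * page
    end = -(-(offset + length) // page) * page  # ceiling to the next page boundary
    return start, end - start


def pread_aligned(fd, offset, length):
    """Read a page-aligned span with pread, then trim to the requested bytes."""
    start, span = aligned_span(offset, length)
    buf = os.pread(fd, span, start)  # may return fewer bytes near end of file
    skip = offset - start
    return buf[skip:skip + length]


def fanout_read(path, requests, workers=4):
    """Issue all (offset, length) reads concurrently; results keep request order."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(lambda req: pread_aligned(fd, *req), requests))
    finally:
        os.close(fd)


# Demo on a 16 KiB scratch file standing in for a weight shard.
data = bytes(range(256)) * 64
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
    path = tmp.name
try:
    requests = [(5, 10), (5000, 100), (16000, 384)]
    chunks = fanout_read(path, requests)
finally:
    os.remove(path)

print([len(c) for c in chunks])
```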

What does “paged Mixture-of-Experts” not mean?

“Paged Mixture-of-Experts” is sometimes used as a marketing term. The figure cited above (K locked at 20 on Outlier’s Plus tier, chosen from a 5-prompt sweep) is the empirically measured one. If a cleaner number appears in someone’s pitch deck, ask for the provenance file that produced it; if there is no provenance file, treat the number as marketing.

Where can I read more about paged Mixture-of-Experts models?

The Plus-tier paging logic and the K_override sweep that produced the locked configuration are documented in sprints/v18_plus_ship/artifacts/K_SWEEP_RESULTS.md and the accompanying K_SWEEP_RESULTS JSON files; the engine code is in desktop_app/backend/engine_v9_loader.py.

How does the Outlier paged engine actually fetch experts?

The V9 paged loader patches the SwitchGLU forward pass so that on the routing step, the IDs of the top-K experts are computed first, then the weights for any expert not currently in the LRU cache are pulled from the safetensors shard files via standard seek+read calls. Memory-mapped reads were tried and rejected (OUTLIER_MMAP_EXPERTS=1 caused an 8× throughput regression in our 2026-04 testing).
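To make the seek+read path concrete, here is a minimal sketch of pulling one tensor's bytes out of a safetensors-layout file. It is not the V9 loader: the helper names and the tensor names are hypothetical, and the toy file holds raw bytes rather than real weights. It relies only on the published safetensors layout: an 8-byte little-endian header length, a JSON header whose entries carry data_offsets, then the data buffer.

```python
import json
import os
import struct
import tempfile


def write_toy_shard(path, tensors):
    """Write byte blobs in safetensors layout (toy stand-in for a real shard)."""
    header, offset = {}, 0
    for name, blob in tensors.items():
        header[name] = {
            "dtype": "U8",
            "shape": [len(blob)],
            "data_offsets": [offset, offset + len(blob)],  # relative to data start
        }
        offset += len(blob)
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))
        f.write(header_bytes)
        for blob in tensors.values():
            f.write(blob)


def read_index(path):
    """Parse the JSON header once; return it with the data buffer's file offset."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return header, 8 + header_len


def fetch_expert_bytes(path, header, data_start, name):
    """Plain seek+read of one tensor's byte range (no mmap)."""
    begin, end = header[name]["data_offsets"]
    with open(path, "rb") as f:
        f.seek(data_start + begin)
        return f.read(end - begin)


with tempfile.TemporaryDirectory() as tmp:
    shard = os.path.join(tmp, "toy.safetensors")
    write_toy_shard(shard, {
        "layers.0.experts.3": b"expert-3-weights",  # hypothetical tensor names
        "layers.0.experts.7": b"expert-7-weights",
    })
    header, data_start = read_index(shard)
    fetched = fetch_expert_bytes(shard, header, data_start, "layers.0.experts.7")

print(fetched)
```

Parsing the header once and keeping the offsets in memory means each cache miss costs one seek and one read of exactly the bytes needed.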

The cache is keyed on (layer, expert_id). At cache_gb=8.0 the LRU holds roughly 240 expert tensors at the model’s 4-bit quantization. Cache hits cost no disk I/O; cache misses pay the disk-read latency, which is why NVMe is mandatory.
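A cache keyed on (layer, expert_id) with least-recently-used eviction can be sketched as follows. This is an illustration under simplifying assumptions, not Outlier's code: ExpertLRU is a hypothetical class, capacity is counted in entries rather than bytes, and the loader callback stands in for the disk read.

```python
from collections import OrderedDict


class ExpertLRU:
    """Toy LRU cache keyed on (layer, expert_id); capacity is an entry count."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader          # called on a miss, e.g. a disk read
        self._store = OrderedDict()   # oldest entry first, newest last
        self.hits = 0
        self.misses = 0

    def get(self, layer, expert_id):
        key = (layer, expert_id)
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = self.loader(layer, expert_id)
        self._store[key] = value
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the least recently used
        return value


def fake_disk_read(layer, expert_id):
    """Stand-in for fetching expert weights from a shard."""
    return f"weights[{layer},{expert_id}]"


cache = ExpertLRU(capacity=2, loader=fake_disk_read)
for layer, expert_id in [(0, 1), (0, 2), (0, 1), (0, 3), (0, 2)]:
    cache.get(layer, expert_id)

print(f"hits={cache.hits} misses={cache.misses}")
```

In this trace only the third lookup hits. Expert (0, 2) is evicted when (0, 3) arrives and has to be read again, which is the miss cost that a larger cache or a faster disk reduces.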

How does paging connect to specific tiers?

This concept is what makes the Plus tier possible at all. The other six tiers (Nano through Vision) are dense or small-MoE and load entirely into unified memory; only Plus needs the paged loader.

What is the smallest configuration that exercises this concept?

You need a 32 GB or larger Mac plus 209 GB of free disk to load the Plus tier and reproduce the paging behavior. There is no smaller-tier proxy for the routed-expert page-fault path.

Download Outlier for Mac

Requires Apple Silicon (M1, M2, M3, or M4) — Intel Macs are not supported. macOS 12+.

Outlier runs entirely on your Mac. No prompts leave the device. Apache 2.0 model weights.