A 397B-parameter MoE that does not fit in unified memory has to be paged from disk. The router picks a subset of experts per token; the engine keeps K experts resident in RAM and reads the rest on demand.
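The decision to run a model locally on a Mac comes down to three numbers: weight size on disk, peak generation memory, and the memory bandwidth feeding the decode loop. Expert paging bears directly on each of those: the full weight set stays on disk, only a bounded cache of experts stays resident, and every cache miss adds a disk read to the decode loop.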
A Mixture-of-Experts model splits its feed-forward block into many small experts; the router picks a small subset per token. Qwen3.5-397B-A17B has 60 layers, 512 experts per layer, and a configured top-k of 10. That means the active path through a single decode step touches 10 experts out of 512 at each of the 60 layers, even though the weights for all 512 experts in every layer are stored on disk.
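The arithmetic behind that claim is worth spelling out. The short sketch below uses only the three figures quoted above (60 layers, 512 experts per layer, top-k of 10); the variable names are illustrative and not taken from the engine.

```python
# Figures quoted in the text; the names are illustrative only.
LAYERS = 60
EXPERTS_PER_LAYER = 512
TOP_K = 10

total_experts = LAYERS * EXPERTS_PER_LAYER   # every expert stored on disk
active_per_token = LAYERS * TOP_K            # experts touched by one decode step
active_fraction = active_per_token / total_experts

print(f"{active_per_token} of {total_experts} experts touched per token "
      f"({active_fraction:.2%})")
```

Roughly 2% of the experts participate in any one token, which is the headroom that paging exploits.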
Paging is the trick that makes this fit on a 64 GB Mac. The engine keeps the K most recently used experts resident, reads the rest from the safetensors shards on demand, and serves the routed selection out of that mix.
On Outlier’s Plus tier, K is locked at 20 because a 5-prompt sweep showed that K=4 fails coherence (3/5), K=32 regresses speed by 2.5%, and K=48 adds only 1.3%, which is within noise.
Outlier’s Plus tier sets K_override=20, cache_gb=8.0, OUTLIER_MMAP_EXPERTS=0, and lazy-loads the full state dict.
Page-aligned pread() fanout with libdispatch groups is the Flash-MoE technique that would close the gap from 1.59 tok/s (Outlier today) to 4.36 tok/s (Flash-MoE on M3 Max). It is on the v1.9 backlog.
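To make the idea concrete, here is a minimal Python sketch of a page-aligned pread() fanout. It is not the Flash-MoE implementation: a thread pool stands in for libdispatch groups, and the function names (pread_full, aligned_pread, fanout_read) and the (offset, length) range format are assumptions made for illustration.

```python
import mmap
import os
from concurrent.futures import ThreadPoolExecutor

PAGE = mmap.PAGESIZE  # OS page size (16 KiB on Apple Silicon, 4 KiB on most Linux)

def pread_full(fd, length, offset):
    """Call os.pread until `length` bytes are read or EOF is reached."""
    chunks = []
    while length > 0:
        chunk = os.pread(fd, length, offset)
        if not chunk:  # EOF
            break
        chunks.append(chunk)
        offset += len(chunk)
        length -= len(chunk)
    return b"".join(chunks)

def aligned_pread(fd, offset, length):
    """Read [offset, offset + length) through the enclosing page-aligned window."""
    start = offset - (offset % PAGE)          # round start down to a page boundary
    span = offset + length - start
    aligned_len = -(-span // PAGE) * PAGE     # round length up to whole pages
    window = pread_full(fd, aligned_len, start)
    return window[offset - start : offset - start + length]

def fanout_read(path, ranges, workers=8):
    """Issue one aligned pread per (offset, length) range, concurrently."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(lambda r: aligned_pread(fd, *r), ranges))
    finally:
        os.close(fd)
```

Because pread() takes an explicit offset and does not move a shared file position, many reads can be in flight on one file descriptor at once, which is what makes the fanout safe.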
“Paged mixture-of-experts” is sometimes used as a marketing term. The K=20 figure cited above is the empirically measured one. If a cleaner number appears in someone’s pitch deck, ask for the provenance file that produced it; if there is no provenance file, treat the number as marketing.
The Plus-tier paging logic and the K_override sweep that produced the locked configuration live in sprints/v18_plus_ship/artifacts/K_SWEEP_RESULTS.md and K_SWEEP_RESULTS JSON files; the engine code is in desktop_app/backend/engine_v9_loader.py.
The V9 paged loader patches the SwitchGLU forward pass so that on the routing step, the IDs of the top-K experts are computed first, then the weights for any expert not currently in the LRU cache are pulled from the safetensors shard files via standard seek+read calls. Memory-mapped reads were tried and rejected (OUTLIER_MMAP_EXPERTS=1 caused an 8× throughput regression in our 2026-04 testing).
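The order of operations described here (pick the expert IDs first, then fetch only the misses with seek and read) can be sketched as follows. This is an illustration under stated assumptions, not the code in engine_v9_loader.py: the plain dict cache, the `index` mapping of (layer, expert_id) to a byte range, and both function names are hypothetical.

```python
import io

def top_k_ids(router_logits, k):
    """Indices of the k largest router logits, highest first."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    return order[:k]

def load_missing(layer, expert_ids, cache, shard, index):
    """Fetch any non-resident expert with seek + read, then return all selected."""
    for eid in expert_ids:
        key = (layer, eid)
        if key in cache:
            continue                      # already resident: no disk access
        offset, length = index[key]       # hypothetical byte-range index
        shard.seek(offset)
        cache[key] = shard.read(length)
    return [cache[(layer, eid)] for eid in expert_ids]

# Toy usage: four 4-byte "experts" in an in-memory shard, one already cached.
shard = io.BytesIO(b"AAAABBBBCCCCDDDD")
index = {(0, i): (4 * i, 4) for i in range(4)}
cache = {(0, 1): b"BBBB"}
ids = top_k_ids([0.1, 0.9, 0.3, 0.7], k=2)
blobs = load_missing(0, ids, cache, shard, index)
```

In the toy run only one of the two selected experts triggers a read; the other is served from the cache.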
The cache is keyed on (layer, expert_id). At cache_gb=8.0 the LRU holds roughly 240 expert tensors at the model’s 4-bit quantization. Cache hits incur no disk I/O; cache misses pay the disk-read latency, which is why NVMe is mandatory.
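A byte-budgeted LRU keyed on (layer, expert_id) can be sketched in a few lines. The class below is a hypothetical illustration of the policy, not Outlier’s cache: the name ExpertLRU, the `loader` callback, and the hit/miss counters are all assumptions.

```python
from collections import OrderedDict

class ExpertLRU:
    """LRU cache of expert blobs, bounded by total bytes rather than entry count."""

    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.used = 0
        self.items = OrderedDict()   # oldest entry first
        self.hits = 0
        self.misses = 0

    def get(self, key, loader):
        """Return the blob for key, calling loader(key) only on a miss."""
        if key in self.items:
            self.items.move_to_end(key)   # mark as most recently used
            self.hits += 1
            return self.items[key]
        self.misses += 1
        blob = loader(key)                # stands in for the disk read
        self.items[key] = blob
        self.used += len(blob)
        # Evict least recently used entries until back under budget.
        while self.used > self.budget and len(self.items) > 1:
            _, old = self.items.popitem(last=False)
            self.used -= len(old)
        return blob

# Toy usage: an 8-byte budget holds two 4-byte experts.
lru = ExpertLRU(budget_bytes=8)
load = lambda key: b"xxxx"
lru.get((0, 1), load)
lru.get((0, 2), load)
lru.get((0, 1), load)   # hit; (0, 2) becomes the oldest entry
lru.get((0, 3), load)   # miss; evicts (0, 2)
```

Bounding by bytes rather than entry count mirrors a setting such as cache_gb, where the budget is expressed as memory and the number of resident experts follows from their size.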
Expert paging is what makes the Plus tier possible at all. The other six tiers (Nano through Vision) are dense or small-MoE and load entirely into unified memory; only Plus needs the paged loader.
You need a 32 GB or larger Mac plus 209 GB of free disk to load the Plus tier and reproduce the paging behavior. There is no smaller-tier proxy for the routed-expert page-fault path.
Requires Apple Silicon (M1, M2, M3, or M4); Intel Macs are not supported. macOS 12+.
Outlier runs entirely on your Mac. No prompts leave the device. macOS 12+ on Apple Silicon (arm64). Apache 2.0 model weights.