Local AI benchmarks for Mac (2026) — Outlier vs Ollama vs LM Studio
- Small-model chat runs ~70 tok/s on an M1 Ultra with Outlier Nano 4B. Ollama and LM Studio land in the 60–100 tok/s range on 7B/8B models, depending on which model you load.
- For 27B coding, Outlier Core 27B does 20.7 tok/s and is the strongest coding tier in the lineup.
- 397B-class MoE? Only Outlier runs it on a 64 GB Mac (V9 paged engine, ~2.1 tok/s, ~11 GB RSS). Ollama and LM Studio just can't fit the thing.
- The only fully-published, verified accuracy figure is Nano's HumanEval 81.1% pass@1 (full 164-set). Other accuracy figures are still being finalized, so this page sticks to measured tok/s and memory.
Raw numbers, not vibes. This is the benchmark data behind every major local-AI option on Apple Silicon in 2026. Outlier's figures come off my dev M1 Ultra. Ollama and LM Studio numbers cite public benchmark posts where it makes sense.
Outlier lineup — measured (M1 Ultra, MLX 4-bit, batch 1, 4096 prefill, 256 decode)
| Tier | Params | Disk | RAM (peak) | Decode tok/s |
|---|---|---|---|---|
| Nano 4B | 4B dense | ~3 GB | ~4 GB | 71.7 |
| Lite 9B | 9B dense | ~6 GB | ~7 GB | 53.4 |
| Quick 26B | 26B-A4B MoE | ~16 GB | ~17 GB | 14.6 |
| Core 27B | 27B dense | ~16 GB | ~17 GB | 20.7 |
| Vision 35B-A3B | 35B-A3B MoE | ~20 GB | ~18 GB (cap) / ~3.5 GB (V10) | 16 (cap) / ~8 (V10) |
| Plus 397B-A17B | 397B-A17B MoE | ~209 GB | ~11 GB (V9 paged) | 2.1 (V9) |
Source: Outlier FINAL_LAUNCH_NUMBERS.md. The M1 Ultra Mac tok/s bench ran 2026-04-29. Accuracy figures are still being finalized; the only fully-published, verified number is Nano HumanEval 81.1% pass@1 (full 164-set).
Cross-tool comparison (Mac, comparable 7B/13B/27B models)
| Tool | Model | Format | Mac | Decode tok/s |
|---|---|---|---|---|
| Outlier | Nano 4B | MLX 4-bit | M1 Ultra 64 GB | 71.7 |
| Outlier | Nano 4B | MLX 4-bit | M4 Air 16 GB | ~32 |
| Ollama | Llama 3.1 8B Q4_K_M | GGUF | M1 Ultra 64 GB | ~60–80 (public posts) |
| LM Studio | Qwen 2.5 7B Q4 | GGUF/MLX | M2 Max 64 GB | ~70–100 (public posts) |
| Outlier | Core 27B | MLX 4-bit | M1 Ultra 64 GB | 20.7 |
| Ollama | Qwen 2.5 Coder 32B Q4 | GGUF | M1 Ultra 64 GB | ~15–22 (public posts) |
| Outlier | Plus 397B-A17B | MLX 4-bit, V9 paged | M1 Ultra 64 GB | 2.1 |
| Ollama | any 397B model | — | 64 GB Mac | won't load (RAM) |
| LM Studio | any 397B model | — | 64 GB Mac | won't load (RAM) |
Source: Outlier numbers measured locally. The Ollama and LM Studio ranges come from publicly-shared benchmark posts as of 2026-05. An exact apples-to-apples comparison is genuinely hard here: prompts, prefill sizes, and batch settings all differ between posts. The ranges shown are what's typical.
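The "won't load (RAM)" rows come down to simple arithmetic on weight size. Here is a rough sketch, assuming 4-bit weights cost about 0.5 bytes per parameter and ignoring KV cache, quantization scales, and runtime overhead. The function names are illustrative, and this is not how any of these tools actually decide what to load.

```python
def weight_footprint_gb(params_billions: float, bits_per_weight: float = 4.0) -> float:
    """Approximate size of the quantized weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits_in_ram(params_billions: float, ram_gb: float, bits_per_weight: float = 4.0) -> bool:
    """Optimistic check: do the weights fit fully resident in RAM?

    Real runtimes also need room for the KV cache, the OS, and other apps,
    so a True here is a best case rather than a guarantee.
    """
    return weight_footprint_gb(params_billions, bits_per_weight) <= ram_gb


# Parameter counts are taken from the tier names in the tables above.
for name, params in [("Nano 4B", 4), ("Core 27B", 27), ("Plus 397B-A17B", 397)]:
    size = weight_footprint_gb(params)
    print(f"{name}: ~{size:.1f} GB of weights, fits in 64 GB: {fits_in_ram(params, 64)}")
```

The estimates come out a little under the Disk column, which is plausible since real model files carry more than the raw 4-bit weights. The 397B case lands near 200 GB, roughly three times the RAM of a 64 GB Mac, so a runtime that needs the whole model resident has nowhere to put it.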
Cost comparison (24 months, single Mac developer)
| Setup | 24-month total |
|---|---|
| Outlier Free (Nano + Lite) | $0 |
| Outlier Pro ($20/mo) | $480 |
| Outlier Pro annual ($149/yr × 2) | $298 |
| Outlier Founding 200 ($99 once, lifetime Pro) | $99 |
| Outlier Founders 500 ($200 once, lifetime Pro) | $200 |
| Ollama (OSS, free) | $0 |
| LM Studio (free for personal use) | $0 |
| ChatGPT Plus / Claude Pro ($20/mo) | $480 |
| ChatGPT Pro / Claude Max ($200/mo) | $4,800 |
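The totals in the table are plain multiplication, and a small sketch makes it easy to rerun them for a different time horizon. The prices are the ones listed above; the helper name and the assumption that annual plans bill per started year are mine.

```python
def plan_total(months: int, monthly: float = 0.0, yearly: float = 0.0, one_time: float = 0.0) -> float:
    """Total spend over `months` for a plan with monthly, yearly, and/or one-time fees."""
    years_billed = -(-months // 12)  # ceiling division: a started year is billed in full
    return monthly * months + yearly * years_billed + one_time


HORIZON_MONTHS = 24

plans = {
    "Outlier Pro ($20/mo)": plan_total(HORIZON_MONTHS, monthly=20),
    "Outlier Pro annual ($149/yr)": plan_total(HORIZON_MONTHS, yearly=149),
    "Outlier Founding 200 ($99 once)": plan_total(HORIZON_MONTHS, one_time=99),
    "Outlier Founders 500 ($200 once)": plan_total(HORIZON_MONTHS, one_time=200),
    "ChatGPT Pro / Claude Max ($200/mo)": plan_total(HORIZON_MONTHS, monthly=200),
}

for name, total in plans.items():
    print(f"{name}: ${total:,.0f}")
```

Changing `HORIZON_MONTHS` shows where the one-time tiers break even against the monthly plan.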
What the bench doesn't measure
- Prompt prefill / TTFT. Decode tok/s is the steady-state figure. Feed a long prompt and your first token can still be several seconds out, even on the fast tiers.
- Tier swap cost. Switch to a tier you haven't touched in a while and the cold load runs 10–74 s, scaling with model size.
- Quality on agentic loops. The bench tested single-turn outputs. Multi-turn agent quality swings more on the app than the model.
- Vision tasks rigorously. Those Vision 35B numbers are language-only. A real image-task quality bench is still coming.
- Long-context (50k+). Every bench prompt was capped at 4096 prefill. Push past that and the cloud flagships still pull ahead by a real margin.
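To see why decode tok/s alone understates wall-clock time, here is a minimal sketch of the end-to-end latency for one response. The decode rates are the measured ones from the tables above; `ttft_s` and `cold_load_s` are placeholders you would have to measure yourself, since this page does not publish them per tier.

```python
def response_seconds(
    decode_tokens: int,
    decode_tok_s: float,
    ttft_s: float = 0.0,
    cold_load_s: float = 0.0,
) -> float:
    """Wall-clock estimate: optional cold load, then time to first token, then steady decode."""
    return cold_load_s + ttft_s + decode_tokens / decode_tok_s


# Decode-only time for the 256-token decode used in the bench.
for tier, tok_s in [("Nano 4B", 71.7), ("Core 27B", 20.7), ("Plus 397B-A17B", 2.1)]:
    print(f"{tier}: {response_seconds(256, tok_s):.1f} s for 256 tokens (decode only)")

# Same Nano run with a hypothetical 2 s TTFT and a hypothetical 10 s cold load.
print(f"Nano 4B, cold: {response_seconds(256, 71.7, ttft_s=2.0, cold_load_s=10.0):.1f} s")
```

On the fast tiers the decode term is only a few seconds, so TTFT and cold loads can easily dominate what you actually wait for.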
How to reproduce
- Outlier: grab it from outlier.host, then run the bundled benchmark harness in the app's developer console.
- Ollama: `ollama run <model> --verbose` prints tok/s for every response.
- LM Studio: the chat REPL shows tok/s right in the response footer.
- Accuracy evals: reach for lm-evaluation-harness, loading each model through its native backend.
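For scripted runs against Ollama, the local REST API is an alternative to reading the `--verbose` output by eye. A non-streaming `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds), which is enough to compute decode tok/s. A minimal sketch, assuming a default Ollama server on port 11434; the helper names are mine, and the sample stats are made up to show the arithmetic.

```python
import json
import urllib.request


def decode_tok_per_s(stats: dict) -> float:
    """Decode rate from an Ollama response: tokens generated / generation time in seconds."""
    return stats["eval_count"] / (stats["eval_duration"] / 1e9)


def generate_once(model: str, prompt: str, host: str = "http://localhost:11434") -> dict:
    """Send one non-streaming request and return the parsed JSON, timing fields included."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Made-up sample stats (not a measurement): 256 tokens in 4 seconds.
sample = {"eval_count": 256, "eval_duration": 4_000_000_000}
print(f"{decode_tok_per_s(sample):.1f} tok/s")
```

With a server running, `decode_tok_per_s(generate_once("<model>", "<prompt>"))` gives one sample. Repeat it several times with the same prompt and take the median, since the first call also pays the model load.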
Frequently asked questions
How fast is Outlier on a Mac?
On M1 Ultra: Nano 4B 71.7 tok/s, Lite 9B 53.4, Core 27B 20.7, Plus 397B 2.1 on the V9 paged engine. Nano hits about 32 tok/s on an M4 Air.
How does Outlier compare to Ollama and LM Studio?
Similar tok/s on shared 7B to 27B model classes; Outlier uniquely runs MoE models bigger than RAM via its V9 paged engine.
What are Outlier's accuracy numbers?
Core 27B is the reasoning-strong coding tier; Nano 4B is the fast everyday tier. The only fully-published, verified figure so far is Nano's HumanEval 81.1% pass@1 on the full 164-set. The rest are still being finalized.
Try Outlier free
Free Nano + Lite — local, private, no account. Pro $20/mo or $149/yr adds everything (Plus 397B, Marathon mode, Computer use, Deep Research v3, long context to 128K). Lifetime Pro from $99 (Founding 200, first 200 seats) or $200 (Founders 500). Apple Silicon only.