Outlier vs Claude — 54-prompt benchmark results (May 2026)
- Run locally on my own Mac, Outlier Core 27B scored 98.9% overall against Claude Opus 4.7 in the cloud across 54 head-to-head prompts.
- All 9 brutal additions hit 100%. Chess engine, raft/paxos, ZK proofs, race-condition refactor, needle-in-context. The whole nasty set.
- 5 of 6 warm-up tests gave equivalent output (same answer, sometimes with different detail). The Pomodoro one caught up after 3 iterations.
- That last 1.1%? Non-deterministic regex misses. Rubric noise, not a capability gap.
This is the raw data behind Outlier's benchmark claims. No vibes. I fired the same prompts at both apps, dumped every output to disk, then graded each one against a checklist. Below you'll find the methodology and the full rubric, plus an honest list of what the bench does and doesn't cover.
Methodology
| Item | Details |
|---|---|
| Models | Outlier Core 27B (MLX 4-bit) vs Claude Opus 4.7 (cloud API, default settings) |
| Hardware | M1 Ultra Mac Studio, 64–192 GB unified memory, mlx-lm 0.31.3 |
| Date range | 2026-05-17 to 2026-05-18 (cycle 1 + cycle 2 + cycle 3) |
| Prompt protocol | Identical wording in both apps. Outlier outputs captured to *.sse + *.txt; Claude outputs to *_claude.txt. |
| Scoring | Per-category objective rubrics (e.g., "valid HTML doctype," "closed script tag," "WebAudio chime present"). Pass/fail per criterion, points summed. |
| Cycles | Cycle 1: 6-prompt warm-up. Cycle 2: expansion to 45 prompts. Cycle 3: +9 brutal additions = 54 total. |
| Artifacts | Raw outputs at /private/tmp/parity/*.txt; scoring at parity_bench/outputs/last_score.json |
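The pass/fail scoring described in the table can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual grader: the criterion names echo the examples in the Scoring row, but the regex patterns, the `CRITERIA` dict, and the `score_output` function are all assumptions made for this example.

```python
import re

# Hypothetical criteria: the names follow the examples in the Scoring row,
# but these regex patterns are assumptions, not the benchmark's real checks.
CRITERIA = {
    "valid_html_doctype": re.compile(r"<!doctype\s+html", re.IGNORECASE),
    "closed_script_tag": re.compile(r"</script\s*>", re.IGNORECASE),
    "webaudio_chime": re.compile(r"\bAudioContext\b"),
}

def score_output(text: str) -> dict:
    """Apply each pass/fail check and sum the points (one point per criterion)."""
    results = {name: bool(pattern.search(text)) for name, pattern in CRITERIA.items()}
    passed = sum(results.values())
    return {
        "criteria": results,
        "passed": passed,
        "total": len(results),
        "percent": round(100 * passed / len(results), 1),
    }

sample = "<!DOCTYPE html><html><script>new AudioContext();</script></html>"
report = score_output(sample)
print(report["passed"], "of", report["total"])
```

Regex checks like these are cheap and objective, but they can miss valid output that is phrased differently, which is one plausible source of the rubric noise mentioned in the summary.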
The 6-prompt warm-up battery (cycle 1)
| # | Category | Test | Outlier result | Claude result | Verdict |
|---|---|---|---|---|---|
| 1 | Reasoning | Two trains meeting (Chicago east 80 mph, NYC west 60 mph, 800 mi apart) | 9:17 PM, ~482.9 mi east | 9:17 PM, ~483 mi east | Identical answer |
| 2 | Knowledge | TCP vs UDP w/ head-of-line blocking | Correct; missed QUIC/HTTP-3 | Correct; named QUIC/HTTP-3 | Both correct; Claude richer |
| 3 | Writing | 170-word insulated-bottle product blurb | Exactly 170 words; lyrical voice | Exactly 170 words; spec-driven voice | Both shippable |
| 4 | Translation | French passage, faithful + idiomatic | Faithful; "à travers elle" | Faithful; "d'argent" | Both idiomatic |
| 5 | Refactor | Python sum of squares of even numbers | One-liner generator (list[int]) | One-liner generator (Iterable[int]) | Identical body; Claude richer typing |
| 6 | Code | Pomodoro timer single-file HTML | 14 Notification API calls after 3 iter | 7 Notification API calls | Outlier exceeded on richness axis |
The 9 brutal additions (cycle 3)
| # | Test | Outlier score |
|---|---|---|
| 1 | Chess engine — castling, en-passant, promotion, check, checkmate | 100% |
| 2 | Paint canvas — brush, color picker, clear | 100% |
| 3 | Hard combinatorics problem | 100% |
| 4 | Geometry proof | 100% |
| 5 | Raft vs Paxos consensus explanation | 100% |
| 6 | Zero-knowledge proofs — soundness, completeness, SNARKs | 100% |
| 7 | 6-section blameless post-mortem | 100% |
| 8 | Race-condition refactor | 100% |
| 9 | Needle-in-context retrieval (long-context recall) | 100% |
Sample full-rubric scoring
The Pomodoro test, criterion by criterion, pulled straight from parity_bench/outputs/last_score.json.
| Criterion | Pass |
|---|---|
| Valid HTML doctype | ✓ |
| Closed <script> | ✓ |
| Closed </html> | ✓ |
| Timer loop | ✓ |
| WebAudio chime | ✓ |
| Event wiring | ✓ |
| Persistence (localStorage) | ✓ |
| Browser notifications | ✓ |
| Tab title status | ✓ |
| Themed via data attribute | ✓ |
| Accessibility | ✓ |
| Notification consent | ✓ |
| Keyboard shortcuts | ✓ |
Perfect score. 100/100. That run was 1,283 words and 4,123 tokens, and it took 199.1s on Outlier Core 27B.
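For context, the run figures above imply a rough generation rate. The arithmetic below assumes the 199.1s covers the whole generation, so it is an end-to-end average rather than a pure decode speed.

```python
# Figures quoted for the Pomodoro run on Outlier Core 27B.
words, tokens, seconds = 1283, 4123, 199.1

# Average rate over the whole run (assumes 199.1s is total wall time).
tokens_per_second = tokens / seconds
tokens_per_word = tokens / words

print(f"{tokens_per_second:.1f} tok/s, {tokens_per_word:.2f} tokens per word")
# prints "20.7 tok/s, 3.21 tokens per word"
```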
What this benchmark doesn't cover
- Speed. This was about output quality, not tok/s. Claude is 3–5× faster end-to-end and I'm not pretending otherwise.
- Long context. I capped prompts at roughly 10k tokens, so nothing longer was tested, including the 50k+ range where the cloud models still pull ahead.
- Vision. I didn't put image input through its paces. Claude's vision stack is better right now.
- Multi-turn agent runs. Single-turn outputs only. No extended agentic loops with tool use.
- Sample size. 54 prompts is a sanity check, not a publication-grade study. The per-category n is small.
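To put a number on the sample-size caveat, here is a rough sketch of a 95% Wilson score interval for a small category. It treats each prompt as a single pass/fail trial, which is a simplification of the per-criterion rubric, and the `wilson_interval` helper is written for this example rather than taken from the benchmark code.

```python
import math

def wilson_interval(passes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate of passes/n."""
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)

# 9 of 9 passes, treating each brutal addition as one pass/fail trial (assumption).
low, high = wilson_interval(9, 9)
print(f"{low:.2f} to {high:.2f}")
```

With 9 of 9 passes, the lower bound is only about 0.70, so a perfect score on nine prompts is still consistent with a noticeably lower true pass rate.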
How to reproduce
- Grab Outlier. Core 27B needs a paid Pro tier (Free only ships Nano + Lite): outlier.host
- Line up Claude API access or a Claude.ai Pro account
- Send the same prompts to both and save every output to disk
- Grade it against the rubric in parity_bench/outputs/last_score.json, or write your own
One caveat. Model outputs aren't deterministic, so rerunning these prompts gives you slightly different responses every time. The rubric is the thing that keeps the comparison honest.
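The "send the same prompts and save every output" step could look something like the sketch below. The file naming follows the *.txt and *_claude.txt convention from the methodology table; everything else (the `save_pair` helper, the stand-in model callables, the temp directory) is hypothetical and would need to be replaced with real calls to each app or API.

```python
from pathlib import Path
from typing import Callable
import tempfile

def save_pair(prompt_id: str, prompt: str,
              run_outlier: Callable[[str], str],
              run_claude: Callable[[str], str],
              out_dir: Path) -> tuple[Path, Path]:
    """Send one prompt to both models and write each reply to its own file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    outlier_path = out_dir / f"{prompt_id}.txt"
    claude_path = out_dir / f"{prompt_id}_claude.txt"
    outlier_path.write_text(run_outlier(prompt), encoding="utf-8")
    claude_path.write_text(run_claude(prompt), encoding="utf-8")
    return outlier_path, claude_path

# Stand-in callables (assumptions): swap in real calls to each app or API.
fake_outlier = lambda p: f"[outlier reply to] {p}"
fake_claude = lambda p: f"[claude reply to] {p}"

demo_dir = Path(tempfile.mkdtemp()) / "parity"
paths = save_pair("pomodoro", "Build a Pomodoro timer in one HTML file.",
                  fake_outlier, fake_claude, demo_dir)
print([p.name for p in paths])
```

Keeping both replies on disk under a shared prompt id means the grading step can be rerun later without calling either model again.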
Frequently asked questions
What did the 54-prompt benchmark measure?
Output quality of Outlier Core 27B versus Claude Opus 4.7 on identical prompts, scored against objective per-category rubrics.
What was the result?
Outlier passed 98.9% of rubric checks overall, with all 9 hardest additions at 100%. The remaining 1.1% was non-deterministic rubric noise.
Did the benchmark measure speed?
No. It measured output quality only. Claude is roughly 3–5× faster end-to-end.
Try Outlier free
Free Nano + Lite — local, private, no account. Pro $20/mo or $149/yr adds everything (Plus 397B, Marathon mode, Computer use, Deep Research v3, long context to 128K). Lifetime Pro from $99 (Founding 200, first 200 seats) or $200 (Founders 500). Apple Silicon only.