Outlier vs Claude — 54-prompt benchmark results (May 2026)

Name: Outlier Core 27B vs Claude Opus 4.7 — 54-prompt benchmark
Creator: Outlier
Published: 2026-05-19
License: creativecommons.org/licenses/by/4.0/

Outlier · solo-built in Grand Rapids · published 2026-05-19 · last updated 2026-05-20

Quick answer

Run locally on my own Mac, Outlier Core 27B passed 98.9% of rubric checks overall across 54 prompts run head-to-head against Claude Opus 4.7 in the cloud.
All 9 brutal additions hit 100%. Chess engine, raft/paxos, ZK proofs, race-condition refactor, needle-in-context. The whole nasty set.
On 5 of 6 warm-up tests, both models gave correct, comparable output on the first pass (Claude was richer on two of them). The Pomodoro one caught up after 3 iterations.
That last 1.1%? Non-deterministic regex misses. Rubric noise, not a capability gap.

This is the raw data behind Outlier's benchmark claims. No vibes. I fired the same prompts at both apps, dumped every output to disk, then graded each one against a checklist. Below you'll find the methodology and the full rubric, plus an honest list of what the bench does and doesn't cover.

Methodology

Models	Outlier Core 27B (MLX 4-bit) vs Claude Opus 4.7 (cloud API, default settings)
Hardware	M1 Ultra Mac Studio, 64–192 GB unified memory, mlx-lm 0.31.3
Date range	2026-05-17 to 2026-05-18 (cycle 1 + cycle 2 + cycle 3)
Prompt protocol	Identical wording in both apps. Outlier outputs captured to `.sse` + `.txt`; Claude outputs to `*_claude.txt`.
Scoring	Per-category objective rubrics (e.g., "valid HTML doctype," "closed script tag," "WebAudio chime present"). Pass/fail per criterion, points summed.
Cycles	Cycle 1: 6-prompt warm-up. Cycle 2: expansion to 45 prompts. Cycle 3: +9 brutal additions = 54 total.
Artifacts	Raw outputs at `/private/tmp/parity/*.txt`; scoring at `parity_bench/outputs/last_score.json`
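
The scoring row describes pass/fail checks whose points are summed. Here is a minimal sketch of that idea in Python, assuming one point per criterion and plain string checks. The names `Criterion`, `score_output`, and `HTML_RUBRIC` are hypothetical and are not taken from `parity_bench`; the real rubric may weight or match things differently.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    """One pass/fail rubric check (hypothetical structure)."""
    name: str
    check: Callable[[str], bool]  # returns True when the output passes
    points: int = 1               # assumption: every criterion weighs the same


def score_output(output: str, rubric: list[Criterion]) -> dict:
    """Run every criterion against one output and sum the points."""
    results = {c.name: c.check(output) for c in rubric}
    earned = sum(c.points for c in rubric if results[c.name])
    possible = sum(c.points for c in rubric)
    percent = round(100 * earned / possible, 1) if possible else 0.0
    return {"results": results, "earned": earned,
            "possible": possible, "percent": percent}


# Illustrative checks modeled on the example criteria named in the table.
HTML_RUBRIC = [
    Criterion("valid HTML doctype",
              lambda t: t.lstrip().lower().startswith("<!doctype html")),
    Criterion("closed script tag", lambda t: "</script>" in t.lower()),
    Criterion("closed html tag", lambda t: "</html>" in t.lower()),
]

if __name__ == "__main__":
    sample = "<!DOCTYPE html><html><script>let t = 0;</script></html>"
    print(score_output(sample, HTML_RUBRIC))
```

Keeping each check as a small named function makes it easy to see which criterion failed, which matters when a miss comes from the check itself and not from the model.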

The 6-prompt warm-up battery (cycle 1)

#	Category	Test	Outlier result	Claude result	Verdict
1	Reasoning	Two trains meeting (Chicago east 80 mph, NYC west 60 mph, 800 mi apart)	9:17 PM, ~482.9 mi east	9:17 PM, ~483 mi east	Identical answer
2	Knowledge	TCP vs UDP w/ head-of-line blocking	Correct; missed QUIC/HTTP-3	Correct; named QUIC/HTTP-3	Both correct; Claude richer
3	Writing	170-word insulated-bottle product blurb	Exactly 170 words; lyrical voice	Exactly 170 words; spec-driven voice	Both shippable
4	Translation	French passage, faithful + idiomatic	Faithful; "à travers elle"	Faithful; "d'argent"	Both idiomatic
5	Refactor	Python sum of squares of even numbers	One-liner generator (`list[int]`)	One-liner generator (`Iterable[int]`)	Identical body; Claude richer typing
6	Code	Pomodoro timer single-file HTML	14 Notification API calls after 3 iter	7 Notification API calls	Outlier exceeded on richness axis
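
Row 5 is easier to picture with code. The captured outputs are not reproduced here. What follows is an illustrative reconstruction of a one-liner generator solution under each annotation, with hypothetical function names.

```python
from collections.abc import Iterable


def sum_even_squares_list(numbers: list[int]) -> int:
    """Narrower annotation: callers are told to pass a list."""
    return sum(n * n for n in numbers if n % 2 == 0)


def sum_even_squares_iter(numbers: Iterable[int]) -> int:
    """Broader annotation: any iterable of ints, including generators."""
    return sum(n * n for n in numbers if n % 2 == 0)


if __name__ == "__main__":
    print(sum_even_squares_list([1, 2, 3, 4]))  # 4 + 16 = 20
    print(sum_even_squares_iter(range(1, 5)))   # same body, same result: 20
```

The bodies are identical and only the parameter annotation differs, which is why the verdict reads as richer typing and not a different answer.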

The 9 brutal additions (cycle 3)

#	Test	Outlier score
1	Chess engine — castling, en-passant, promotion, check, checkmate	100%
2	Paint canvas — brush, color picker, clear	100%
3	Hard combinatorics problem	100%
4	Geometry proof	100%
5	Raft vs Paxos consensus explanation	100%
6	Zero-knowledge proofs — soundness, completeness, SNARKs	100%
7	6-section blameless post-mortem	100%
8	Race-condition refactor	100%
9	Needle-in-context retrieval (long-context recall)	100%

Sample full-rubric scoring

The Pomodoro test, criterion by criterion, pulled straight from `parity_bench/outputs/last_score.json`.

Criterion	Pass
Valid HTML doctype	✓
Closed `<script>`	✓
Closed `</html>`	✓
Timer loop	✓
WebAudio chime	✓
Event wiring	✓
Persistence (localStorage)	✓
Browser notifications	✓
Tab title status	✓
Themed via data attribute	✓
Accessibility	✓
Notification consent	✓
Keyboard shortcuts	✓

Perfect score. 100/100. That run was 1,283 words and 4,123 tokens, and it took 199.1s on Outlier Core 27B.
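
Those run statistics imply a generation rate. Here is a quick calculation, assuming the 199.1 s is wall-clock time for the whole response and all 4,123 tokens are output tokens (neither is stated explicitly). The helper name `run_stats` is hypothetical.

```python
def run_stats(words: int, tokens: int, seconds: float) -> dict:
    """Derive simple throughput figures from one run's totals."""
    return {
        "tokens_per_second": round(tokens / seconds, 1),
        "words_per_second": round(words / seconds, 1),
        "tokens_per_word": round(tokens / words, 2),
    }


stats = run_stats(words=1283, tokens=4123, seconds=199.1)
print(stats)
```

Under those assumptions the run works out to about 20.7 tokens per second, or roughly 6.4 words per second. The ratio of about 3.2 tokens per word is plausible for an output that is mostly HTML and JavaScript and not prose.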

What this benchmark doesn't cover

Speed. This was about output quality, not tok/s. Claude is 3–5× faster end-to-end and I'm not pretending otherwise.
Long context. I capped prompts at roughly 10k tokens, so nothing longer got tested, including the 50k+ range where the cloud models still pull ahead.
Vision. I didn't put image input through its paces. Claude's vision stack is better right now.
Multi-turn agent runs. Single-turn outputs only. No extended agentic loops with tool use.
Sample size. 54 prompts is a sanity check, not a publication-grade study. The per-category n is small.

How to reproduce

Grab Outlier. Core 27B needs a paid Pro tier (Free only ships Nano + Lite): outlier.host
Line up Claude API access or a Claude.ai Pro account
Send the same prompts to both and save every output to disk
Grade it against the rubric in `parity_bench/outputs/last_score.json`, or write your own

One caveat. Model outputs aren't deterministic, so rerunning these prompts gives you slightly different responses every time. The rubric is the thing that keeps the comparison honest.
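
The save-and-grade steps, plus the rerun caveat, can be sketched as a small script. It assumes the file layout from the methodology table: an Outlier output `name.txt` sitting next to a Claude output `name_claude.txt`. The function names are hypothetical and this is not the original harness.

```python
from pathlib import Path
from typing import Callable


def pair_outputs(directory: str) -> dict[str, tuple[Path, Path]]:
    """Map prompt name -> (outlier_file, claude_file) for complete pairs."""
    root = Path(directory)
    if not root.is_dir():
        return {}
    pairs = {}
    for claude_file in sorted(root.glob("*_claude.txt")):
        name = claude_file.name[: -len("_claude.txt")]
        outlier_file = root / f"{name}.txt"
        if outlier_file.exists():  # skip prompts that are missing one side
            pairs[name] = (outlier_file, claude_file)
    return pairs


def pass_rate(outputs: list[str], check: Callable[[str], bool]) -> float:
    """Fraction of reruns that pass one check; a value strictly between
    0 and 1 marks a criterion that varies from run to run."""
    if not outputs:
        return 0.0
    return sum(1 for text in outputs if check(text)) / len(outputs)


if __name__ == "__main__":
    # Path taken from the methodology table; adjust for your own machine.
    for name, (outlier_file, claude_file) in pair_outputs("/private/tmp/parity").items():
        print(name, outlier_file.stat().st_size, claude_file.stat().st_size)
```

Running each prompt several times and looking at `pass_rate` per criterion is one way to separate a check that fails consistently from one that only fails on some reruns.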

Frequently asked questions

What did the 54-prompt benchmark measure?

Output quality of Outlier Core 27B versus Claude Opus 4.7 on identical prompts, scored against objective per-category rubrics.

What was the result?

98.9% of rubric checks overall, with all 9 hardest additions at 100%. The remaining 1.1% was non-deterministic rubric noise.

Did the benchmark measure speed?

No. It measured output quality only. Claude is roughly 3–5× faster end-to-end.

Try Outlier free

Free Nano + Lite — local, private, no account. Pro $20/mo or $149/yr adds everything (Plus 397B, Marathon mode, Computer use, Deep Research v3, long context to 128K). Lifetime Pro from $99 (Founding 200, first 200 seats) or $200 (Founders 500). Apple Silicon only.
