Outlier vs Claude — 54-prompt benchmark results (May 2026)

Quick answer
  • Run locally on my own Mac, Outlier Core 27B scored 98.9% overall against Claude Opus 4.7 in the cloud across 54 head-to-head prompts.
  • All 9 brutal additions hit 100%, including the chess engine, Raft/Paxos, ZK proofs, race-condition refactor, and needle-in-context. The whole nasty set.
  • 5 of 6 warm-up tests were at parity with Claude on the first pass. The Pomodoro one caught up after 3 iterations.
  • That last 1.1%? Non-deterministic regex misses. Rubric noise, not a capability gap.

This is the raw data behind Outlier's benchmark claims. No vibes. I fired the same prompts at both apps, dumped every output to disk, then graded each one against a checklist. Below you'll find the methodology and the full rubric, plus an honest list of what the bench does and doesn't cover.

Methodology

  • Models: Outlier Core 27B (MLX 4-bit) vs Claude Opus 4.7 (cloud API, default settings)
  • Hardware: M1 Ultra Mac Studio, 64–192 GB unified memory, mlx-lm 0.31.3
  • Date range: 2026-05-17 to 2026-05-18 (cycle 1 + cycle 2 + cycle 3)
  • Prompt protocol: Identical wording in both apps. Outlier outputs captured to *.sse + *.txt; Claude outputs to *_claude.txt.
  • Scoring: Per-category objective rubrics (e.g., "valid HTML doctype," "closed script tag," "WebAudio chime present"). Pass/fail per criterion, points summed.
  • Cycles: Cycle 1: 6-prompt warm-up. Cycle 2: 45-prompt expansion. Cycle 3: +9 brutal additions = 54 total.
  • Artifacts: Raw outputs at /private/tmp/parity/*.txt; scoring at parity_bench/outputs/last_score.json
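
To make the scoring row concrete, here is a minimal sketch of pass/fail rubric grading with points summed. The `Criterion` class, the regex patterns, and the one-point weights are my assumptions for illustration; the actual patterns and weights behind last_score.json aren't reproduced on this page.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    pattern: str     # hypothetical regex, not the bench's real pattern
    points: int = 1  # assumed weight


# Illustrative criteria, loosely based on the examples named in the scoring row.
HTML_RUBRIC = [
    Criterion("Valid HTML doctype", r"<!doctype\s+html>"),
    Criterion("Closed script tag", r"</script\s*>"),
    Criterion("WebAudio chime present", r"\b(?:webkit)?AudioContext\b"),
]


def score(output: str, rubric: list[Criterion]) -> dict:
    """Mark each criterion pass/fail, then sum the points of the passes."""
    results = {
        c.name: bool(re.search(c.pattern, output, re.IGNORECASE)) for c in rubric
    }
    earned = sum(c.points for c in rubric if results[c.name])
    total = sum(c.points for c in rubric)
    return {
        "criteria": results,
        "earned": earned,
        "total": total,
        "percent": round(100 * earned / total, 1),
    }


sample = (
    "<!DOCTYPE html><html><body>"
    "<script>const ctx = new AudioContext();</script>"
    "</body></html>"
)
report = score(sample, HTML_RUBRIC)
print(report["earned"], "/", report["total"])
```

The same function grades either model's output, so neither side gets a friendlier checklist.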

The 6-prompt warm-up battery (cycle 1)

| # | Category | Test | Outlier result | Claude result | Verdict |
|---|---|---|---|---|---|
| 1 | Reasoning | Two trains meeting (Chicago east 80 mph, NYC west 60 mph, 800 mi apart) | 9:17 PM, ~482.9 mi east | 9:17 PM, ~483 mi east | Identical answer |
| 2 | Knowledge | TCP vs UDP w/ head-of-line blocking | Correct; missed QUIC/HTTP-3 | Correct; named QUIC/HTTP-3 | Both correct; Claude richer |
| 3 | Writing | 170-word insulated-bottle product blurb | Exactly 170 words; lyrical voice | Exactly 170 words; spec-driven voice | Both shippable |
| 4 | Translation | French passage, faithful + idiomatic | Faithful; "à travers elle" | Faithful; "d'argent" | Both idiomatic |
| 5 | Refactor | Python sum of squares of even numbers | One-liner generator (list[int]) | One-liner generator (Iterable[int]) | Identical body; Claude richer typing |
| 6 | Code | Pomodoro timer single-file HTML | 14 Notification API calls after 3 iter | 7 Notification API calls | Outlier exceeded on richness axis |
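
For a sense of what test 5 is asking for, here is a sketch of a one-liner generator version of the refactor, using the broader `Iterable[int]` annotation. The function name is hypothetical and this is not either model's verbatim output, just an instance of the shape the table describes.

```python
from collections.abc import Iterable


def sum_even_squares(numbers: Iterable[int]) -> int:
    """Sum the squares of the even values in `numbers`."""
    return sum(n * n for n in numbers if n % 2 == 0)


print(sum_even_squares([1, 2, 3, 4]))  # 2*2 + 4*4 = 20
print(sum_even_squares(range(7)))      # 0 + 4 + 16 + 36 = 56
```

Annotating the parameter as `Iterable[int]` rather than `list[int]` accepts ranges, tuples, and generators without changing the body, which is the sense in which the typing is "richer."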

The 9 brutal additions (cycle 3)

| # | Test | Outlier score |
|---|---|---|
| 1 | Chess engine: castling, en-passant, promotion, check, checkmate | 100% |
| 2 | Paint canvas: brush, color picker, clear | 100% |
| 3 | Hard combinatorics problem | 100% |
| 4 | Geometry proof | 100% |
| 5 | Raft vs Paxos consensus explanation | 100% |
| 6 | Zero-knowledge proofs: soundness, completeness, SNARKs | 100% |
| 7 | 6-section blameless post-mortem | 100% |
| 8 | Race-condition refactor | 100% |
| 9 | Needle-in-context retrieval (long-context recall) | 100% |

Sample full-rubric scoring

The Pomodoro test, criterion by criterion, pulled straight from parity_bench/outputs/last_score.json.

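| Criterion | Pass |
|---|---|
| Valid HTML doctype | Yes |
| Closed <script> | Yes |
| Closed </html> | Yes |
| Timer loop | Yes |
| WebAudio chime | Yes |
| Event wiring | Yes |
| Persistence (localStorage) | Yes |
| Browser notifications | Yes |
| Tab title status | Yes |
| Themed via data attribute | Yes |
| Accessibility | Yes |
| Notification consent | Yes |
| Keyboard shortcuts | Yes |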

Perfect score. 100/100. That run was 1,283 words and 4,123 tokens, and it took 199.1s on Outlier Core 27B.
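
Those three figures imply a rough throughput, worked out below. This is plain arithmetic on the reported numbers, and it assumes the 199.1s is end-to-end wall-clock time for the run; the page doesn't say whether prompt processing is included.

```python
# Figures reported for the Pomodoro run on Outlier Core 27B.
words = 1_283
tokens = 4_123
seconds = 199.1

tokens_per_second = tokens / seconds
tokens_per_word = tokens / words

print(f"{tokens_per_second:.1f} tokens/s")   # 20.7 tokens/s
print(f"{tokens_per_word:.2f} tokens/word")  # 3.21 tokens/word
```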

What this benchmark doesn't cover

How to reproduce

  1. Grab Outlier. Core 27B needs a paid Pro tier (Free only ships Nano + Lite): outlier.host
  2. Line up Claude API access or a Claude.ai Pro account
  3. Send the same prompts to both and save every output to disk
  4. Grade it against the rubric in parity_bench/outputs/last_score.json, or write your own
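
Steps 3 and 4 go more smoothly if every saved output can be matched to its counterpart. Here is a minimal sketch that pairs files using the naming convention from the methodology (`*.txt` for Outlier, `*_claude.txt` for Claude). The function name and the demo prompt ids are hypothetical, and the demo writes placeholder files to a temp directory rather than touching /private/tmp/parity.

```python
import tempfile
from pathlib import Path


def pair_outputs(directory: Path) -> dict[str, tuple[Path, Path]]:
    """Map prompt id -> (outlier_file, claude_file) for prompts that have both."""
    pairs = {}
    for claude_file in sorted(directory.glob("*_claude.txt")):
        prompt_id = claude_file.name.removesuffix("_claude.txt")
        outlier_file = directory / f"{prompt_id}.txt"
        if outlier_file.exists():
            pairs[prompt_id] = (outlier_file, claude_file)
    return pairs


# Demo with placeholder files; "pomodoro" and "trains" are made-up ids.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("pomodoro.txt", "pomodoro_claude.txt", "trains.txt"):
        (root / name).write_text("placeholder output")
    demo_pairs = pair_outputs(root)
    print(sorted(demo_pairs))  # only "pomodoro" has both files
```

Prompts with only one side saved are skipped, so a missing file can't quietly turn into a one-sided score.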

One caveat. Model outputs aren't deterministic, so rerunning these prompts gives you slightly different responses every time. The rubric is the thing that keeps the comparison honest.
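
One way to separate rubric noise from a real gap is to rerun a prompt several times and look at how often each check passes. The sketch below is an assumption about how you might do that, not the bench's own tooling; the two doctype patterns and the sample outputs are invented to show a brittle regex missing output that is functionally equivalent.

```python
import re
from collections import Counter


def pass_rates(runs: list[str], checks: dict[str, str]) -> dict[str, float]:
    """Fraction of runs in which each regex check matched."""
    hits = Counter()
    for output in runs:
        for name, pattern in checks.items():
            if re.search(pattern, output):
                hits[name] += 1
    return {name: hits[name] / len(runs) for name in checks}


# Hypothetical checks: same intent, different strictness.
checks = {
    "doctype (case-sensitive)": r"<!DOCTYPE html>",
    "doctype (case-insensitive)": r"(?i)<!doctype\s+html>",
}

# Invented outputs standing in for three reruns of one prompt.
runs = [
    "<!DOCTYPE html><html></html>",
    "<!doctype html><html></html>",
    "<!DOCTYPE html>\n<html></html>",
]

rates = pass_rates(runs, checks)
for name, rate in rates.items():
    print(f"{name}: {rate:.0%}")
```

A check that passes on some reruns and fails on others for equivalent output points at the pattern, not the model.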

Frequently asked questions

What did the 54-prompt benchmark measure?

Output quality of Outlier Core 27B versus Claude Opus 4.7 on identical prompts, scored against objective per-category rubrics.

What was the result?

98.9% of rubric checks overall, with all 9 hardest additions at 100%. The remaining 1.1% was non-deterministic rubric noise.

Did the benchmark measure speed?

No. It measured output quality only. Claude is roughly 4x faster on raw decode.

Try Outlier free

Free Nano + Lite — local, private, no account. Pro $20/mo or $149/yr adds everything (Plus 397B, Marathon mode, Computer use, Deep Research v3, long context to 128K). Lifetime Pro from $99 (Founding 200, first 200 seats) or $200 (Founders 500). Apple Silicon only.