Long-form writing on the architecture, comparisons, and economics of running AI locally on Apple Silicon. Plus raw benchmark data with full methodology.
Raw side-by-side outputs. 98.9% of rubric checks passed overall, 100% on 9 brutal tests including a chess engine, Raft/Paxos, and ZK proofs.
Dataset: Decode tok/s and memory for the current V9 paged engine vs the retired V10 and V11 streaming engines, on Plus 397B and Vision 35B. 5-run methodology.
Dataset: Outlier numbers vs Ollama and LM Studio public benchmarks: tok/s, MMLU, HumanEval, memory footprint, cost.
Paged expert streaming, the actual numbers, and why MoE makes it tractable.
The stack, the tradeoffs, and where local AI still loses to the cloud.
How 209 GB of weights run with 11 GB peak RSS on a 64 GB Mac Studio.
Setup, what the agent can do, and where it still trails Cursor or Claude Code.
The data path, the telemetry surface, where the boundary really sits.
The agent loop primitives, what works locally, what's different.
Per-Mac-Studio tier-by-tier guide with measured tok/s on M1 Ultra.
Screen perception, action execution, where the local version trails.
The honest 2026 comparison: where each one actually wins.
CLI-first engine vs polished GUI. Which local-AI tool to use.
Two open-source ways to run local AI, compared honestly.
Built-in convenience vs a model you own and run fully offline.
Local Mac coding agent vs cloud terminal coding agent. Bench numbers, honest tradeoffs.
Where each tool fits: cross-platform OSS GGUF vs Mac-native MoE streaming.
Open-source cross-platform GGUF app vs Mac-native paged streaming. Honest picks.
Polished chat GUI vs Mac-native coding agent. Where each is the right pick.
Task-by-task breakdown of where local wins and where Claude Code still wins.
The bench in detail, with raw scoring and reproducibility notes.
Four serious local-AI options in 2026, feature-by-feature.
What the bar actually is, and the real options that meet it.
In-editor vs standalone setups; where the predictive-completion gap lives.
Honest side-by-side: price, caps, privacy, offline. Where each one actually wins.
A plain-English guide to AI that runs on your own machine.
What to actually check — and why it's safer for your data.
The architecture, plainly. How streaming experts from SSD breaks the RAM ceiling.
Four stacked design choices: MoE + 4-bit + streaming + unified memory.
{-1, 0, +1} weights. Where the research sits in 2026; why 4-bit is still production.
The data path walkthrough. What stays local, what doesn't, and the airplane-mode verification.
24-month comparison with breakeven analysis. Includes hardware reuse and electricity.
Unified memory, on-package SSD, Metal kernels. The technical why.
Cloud has message caps, token caps, fair-use limits. Local has wattage. What changes.
Plain-language definitions for the terms that come up when you start using local AI.
Parameters, tokens, quantization — the plain-English guide to how LLMs actually work.
Context windows explained: why your AI loses the thread and how local AI handles it differently.
Honest answer: Claude's weights are closed. But open models at 98.9% parity can run locally.
Apple Silicon has GPU cores built in. Why a Mac doesn't need a separate graphics card for local AI.
Yes — Qwen2-VL and Llama Vision run locally. Images never leave your device.
The popular CLI model runner explained. What it does, who it's for, and how it compares.
Unreleased campaigns and client data stay on your Mac.
Confidential source documents never leave your machine.
Student data stays on your device; free to start.
Client confidentiality without the cloud — privileged data stays on your Mac.
Five Macs instead of five seats. Flat cost, and each person's data stays put.
Draft session notes on-device — nothing leaves the room, no vendor in the chain.
Unpublished results and embargoed data never get uploaded or trained on.
Client tax data, payroll, and pre-close financials stay on your own Mac.
Free where it counts, and donor data never leaves the staffer's machine.
Process PHI on-device — no AI vendor in your business-associate chain.
Confidential financials and MNPI never leave your Mac.
Flat cost, no per-seat API bills, pre-public code stays local.
NDA-safe by design — every client's data stays on your machine.
Your manuscript is never uploaded or used for training.
No subscription, no usage caps, works offline.
Source material that never touches a third party's servers.
The model family Outlier is built on — how to run it yourself.
Quantized GGUF, step by step, sized to your RAM.
The honest path, the sizes, and the RAM you need.
The opt-out steps — and the only guaranteed fix.
A practical five-step migration that keeps the work.
Step-by-step: download, drag, run. ~5 minutes, no account.
Download once, turn Wi-Fi off, keep working. The genuinely offline path.
The most common Mac config, and what it actually handles well.
16 GB vs 32 GB vs 64 GB — what each runs.
Model sizes (2.4–209 GB) and how to reclaim space.
RAM, disk, and download fixes for a model that won't start.
Nano runs at 32 tok/s. 7B is tight but doable. Here's the RAM math.
Whisper transcribes locally, Outlier summarizes. No audio ever uploaded.
Reference: Which models fit your RAM, with disk sizes and tok/s.
Outlier, Ollama, LM Studio, Jan, GPT4All — ranked and compared.
Qwen, Llama, DeepSeek, Mistral — which model family for what.
Ranked: Outlier, Qwen2.5-Coder, DeepSeek-Coder, Codestral.
Outlier free tier, Ollama, LM Studio, Jan — what 'free' means.
The most private AI runs on-device. How to verify it.
ChatGPT, Claude, Outlier, Ollama, LM Studio — tested and compared honestly.
The buildout you can't vote on — and the one personal lever you actually have.
Yes — every prompt travels to a building and back. What runs without one.
If big companies are metering AI to control cost, you get the tighter end.
Per-seat pricing only climbs. The math behind five Macs instead of five seats.
The fix isn't buying RAM. It's getting more AI out of the Mac you already own.
Add up the bill, move the 90% that doesn't need a meter, keep only what earns it.
Yes, if it lives on your device. Why the cloud can't and on-device can.
Your four real options when Claude or ChatGPT cuts you off mid-session.
The unit economics. Caps aren't a bug, they're the business model.
What's actually unmetered in 2026. Spoiler: it runs on your hardware.
ChatGPT + Claude + Cursor + Copilot + Perplexity, added up honestly.
Weights on your disk can't be deprecated, repriced, or taken away.
GPT-4 is gone from ChatGPT. Every rented model has an expiry date.
Stored, sometimes trained on, and court-preservable. With receipts.
The 2026 default scoreboard, and where every opt-out hides.
Datacenters, inference, and the on-device alternative. Sourced.
Billions of gallons for cooling. Your Mac is air-cooled.
An honest accounting, both directions. No green badge.
The real annual cost, the caps, and an honest look at who should keep it.
What ChatGPT, Claude, and Gemini store, who can read it, and how to opt out.
Free Nano + Lite. Pro $20/mo or $149/yr adds everything (Plus 397B included). Lifetime Pro from $99 (Founding 200) or $200 (Founders 500). Apple Silicon only.