MCP, MoE, paged inference — the local-AI glossary
- MCP = Model Context Protocol. One standard so agents can talk to tools.
- MoE = Mixture-of-Experts. Only a few sub-networks fire per token.
- Paged inference = weights stream off the SSD on demand, like OS virtual memory.
- MLX = Apple's GPU array library. What runs local AI on a Mac.
Three years ago, none of these words existed. Local AI grew its own jargon fast, and most of it never gets explained. This page does that, in plain English, roughly in the order you'll hit each term. Keep it open and ctrl-F whatever tripped you up.
MCP — Model Context Protocol
Anthropic shipped this open standard in 2024. It's how agents reach real tools. An MCP server publishes a list of functions it can run (read a file, query a database, fire off a Slack message) plus a schema for each. The agent app is the client. It discovers those functions and calls them when the model emits a tool call. The payoff: tools stop being locked to one app. That same filesystem MCP runs in Outlier, Claude Desktop, Continue.dev, and more.
MoE — Mixture of Experts
Picture a layer stuffed with "expert" sub-networks and a little router out front. Every token, the router taps k of those experts (say 10 out of 512 in Qwen3.5-397B) and leaves the rest asleep. Sum every expert and you get total parameters. Count only the k that fired and you get active parameters per token. The gap is huge: Qwen3.5-397B-A17B carries 397B total but lights up just 17B per token.
Paged inference
Your model is bigger than your RAM. Paged inference handles that: keep only the parts in use resident, pull the rest off the SSD on demand. It pays off most with MoE, since each token touches just a sliver of the weights. That's how Outlier's V9 paged engine fits Plus 397B onto a 64 GB Mac.
MLX
Apple's answer to PyTorch and NumPy, built from scratch for their own silicon. It leans hard on Apple Silicon's unified memory and Metal GPU. A model labeled "MLX 4-bit" has been quantized to fit MLX's matmul kernels. Every Outlier tier ships in MLX 4-bit.
GGUF
The format llama.cpp uses, plus everything built on it like Ollama and LM Studio's default backend. Tuned for tight quantization, runs basically everywhere. GGUF is not MLX, though. You can convert between them, but each targets a different runtime, so it's rarely worth the round trip.
4-bit quantization
Store each weight as a 4-bit integer instead of a 16-bit float, with one tiny floating-point scaling factor per group. The model shrinks roughly 4× smaller, on disk and in memory. The quality you give up? On most real tasks you genuinely can't see it.
KV cache
Attention layers produce key and value tensors, and the KV cache holds them across tokens so each new one doesn't re-scan every word before it. Skip it and generation crawls. The catch: the cache grows with your context length. Push context far enough and the KV cache outgrows the model weights themselves.
Tok/s (tokens per second)
The number you watch for raw speed. Decode tok/s is how many output tokens stream out per second after prefill. Core 27B hits about 22 tok/s on an M1 Ultra; Plus 397B on the V9 paged engine lands around 2.1 tok/s. Bigger model, slower stream. That's the trade.
TTFT — time to first token
The wait between hitting send and the first token landing. Almost all of it is prefill, the model chewing through your prompt before it answers. Long prompt? TTFT can stretch to several seconds, even on a quick model. Outlier's prefill heartbeat exists for that moment, so the UI keeps a pulse instead of freezing.
Cold start / cold load
Call an idle model and its weights have to load into unified memory first. That's the cold start. Plus 397B takes about 74 seconds to wake up. Core 27B, a few. Once it's hot it answers right away, so the cold load is a one-time tax.
Quantization-aware training (QAT)
With QAT the model trains already knowing it'll end up at low precision, so it learns weights that survive the squeeze without losing quality. The opposite of post-training quantization, which compresses a finished 16-bit model after the fact. Ternary models usually need QAT. 4-bit models mostly get away without it.
Speculative decoding
A clever speed trick. A small "draft" model guesses the next several tokens, then the big model checks them all in one parallel pass. When the draft is right, and it often is, you bank several tokens for one big-model step. Catch is, Qwen3's hybrid attention blocks it at the architecture level. So Outlier can't ship it on the Plus, Vision, or Core tiers.
Agent loop
The engine under any coding agent, and it's just a cycle: take the user's message, run the model, pull the tool calls out of what it wrote, run them once you've approved them, feed the results back, repeat until it's done. That loop is the whole point of Outlier's Agent mode. Claude Code runs on essentially the same one.
Frequently asked questions
What does MCP stand for?
Model Context Protocol, an open standard for connecting AI agents to external tools so the same tool works across different agent apps.
What is MoE in AI?
Mixture-of-Experts: a model where a router activates only a few expert sub-networks per token, so active parameters are far fewer than total.
What is paged inference?
Streaming model weights from SSD on demand instead of holding them all in RAM, which lets MoE models larger than your memory run.
Try Outlier free
Free Nano + Lite — local, private, no account. Pro $20/mo or $149/yr adds everything (Plus 397B, Marathon mode, Computer use, Deep Research v3, long context to 128K). Lifetime Pro from $99 (Founding 200, first 200 seats) or $200 (Founders 500). Apple Silicon only.
Download for Mac