What is a large language model (LLM)? A plain-English guide (2026)

Matt Kerr · June 17, 2026 · Updated June 2026

Quick answer

A large language model is a neural network trained on a large body of text. It learns to predict the next token (a word fragment) given everything that came before it. Do that one token at a time and you get a response. The "large" part refers to the number of parameters (weights) inside the network, which can range from about a billion to hundreds of billions.

Every AI assistant you've used in the last few years — whether it's a chatbot in a browser or a model running on your Mac — is built on this same idea. The terminology around LLMs gets complicated fast, so this article unpacks the pieces that actually matter: parameters, training, context windows, and what it takes to run one locally.

What does "large" mean?

There's no fixed threshold where a language model becomes "large." The term stuck because models in the 2020s were orders of magnitude bigger than what came before. Today "large" is a relative label, not a precise spec.

In practice, models are described by their parameter count: 1.5B, 7B, 70B, 671B. The number tells you roughly how much capacity the model has to absorb patterns during training and how much RAM it needs to run. A 7B-parameter model is genuinely capable for everyday work. A 671B model like DeepSeek-R1 (a mixture-of-experts architecture) is far larger, but it also requires hardware most people don't own.

The more useful question isn't "how large?" but "large enough for what?" A 1.5B model handles quick Q&A and summarization. An 8B model handles most coding and writing tasks. You need 70B+ for highly specialized or research-grade work where errors carry real cost.

How LLMs actually work

Training an LLM is a prediction exercise run at enormous scale. You feed the model a huge corpus of text — web pages, books, code repositories, scientific papers — and teach it one thing: given the tokens so far, what token comes next?

The model makes a prediction, the prediction is compared to the actual next token in the training data, and the weights (parameters) are adjusted slightly to reduce the error. Repeat this billions of times across trillions of tokens and the network internalizes statistical patterns about language: grammar, facts, reasoning chains, code syntax.
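
The predict, compare, adjust loop described above can be sketched with a toy model. This is an illustration only, not how any production LLM is trained: it assumes a made-up five-word vocabulary, a bigram model (the prediction depends only on the previous token), and plain gradient descent. The names `VOCAB`, `TEXT`, `train_step`, and the learning rate are all hypothetical.

```python
import math

# Toy vocabulary and training text (hypothetical, for illustration only).
VOCAB = ["the", "cat", "sat", "mat", "on"]
TEXT = ["the", "cat", "sat", "on", "the", "mat"]
IDX = {tok: i for i, tok in enumerate(VOCAB)}

# Parameters: one row of scores (logits) per previous token.
weights = [[0.0] * len(VOCAB) for _ in VOCAB]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_step(prev_tok, next_tok, lr=0.5):
    """Predict, measure the error, nudge the weights. Returns the loss."""
    row = weights[IDX[prev_tok]]
    probs = softmax(row)
    target = IDX[next_tok]
    loss = -math.log(probs[target])  # cross-entropy for the true next token
    # Gradient of cross-entropy w.r.t. the logits is (probs - one_hot(target)).
    for i in range(len(row)):
        grad = probs[i] - (1.0 if i == target else 0.0)
        row[i] -= lr * grad
    return loss


def epoch():
    """One pass over every (previous token, next token) pair in TEXT."""
    pairs = list(zip(TEXT, TEXT[1:]))
    return sum(train_step(p, n) for p, n in pairs) / len(pairs)


first = epoch()
for _ in range(50):
    last = epoch()
print(f"average loss: {first:.3f} -> {last:.3f}")
```

The average loss falls as the weights are adjusted, but it never reaches zero here because "the" is followed by two different words in the toy text. A real LLM runs the same kind of update with billions of parameters and a far richer network.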

At inference time — when you type a message — the model runs forward through the network once per token it generates. It looks at your input plus everything it has already written, then produces a probability distribution over possible next tokens and samples from it. That's the whole mechanism. There's no database lookup, no retrieval from a knowledge store by default, and no "understanding" in the philosophical sense — just learned pattern completion, applied very quickly.
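
The generation loop can be sketched in a few lines. This is a minimal sketch under stated assumptions: `toy_logits` is a hypothetical stand-in for the network's forward pass (a real model computes these scores from its weights), and the four-entry vocabulary, the `<end>` marker, and the `temperature` parameter are illustrative choices, not any specific model's API.

```python
import math
import random

VOCAB = ["<end>", "hello", "world", "again"]


def toy_logits(tokens):
    """Hypothetical stand-in for a forward pass: one score per vocab entry."""
    last = tokens[-1]
    if last == "hello":
        return [0.0, -2.0, 3.0, 0.0]
    if last == "world":
        return [2.0, -2.0, -2.0, 1.0]
    return [3.0, 0.0, -2.0, -2.0]


def softmax(logits, temperature=1.0):
    """Convert scores to a probability distribution over the vocabulary."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt, max_new_tokens=10, temperature=1.0, seed=0):
    """Append one sampled token at a time until <end> or the token limit."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        # One forward pass per generated token, over everything so far.
        probs = softmax(toy_logits(tokens), temperature)
        next_tok = rng.choices(VOCAB, weights=probs, k=1)[0]
        if next_tok == "<end>":
            break
        tokens.append(next_tok)
    return tokens


print(generate(["hello"]))
```

Because the next token is sampled rather than always taken as the single most likely one, the same prompt can produce different continuations with different seeds.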

What are parameters?

Parameters are the numbers inside the network — the weights on billions of connections between artificial neurons. After training, these weights are fixed. They encode, in a distributed way, everything the model learned: facts, grammar, code patterns, reasoning heuristics.

When you load a model, you're loading those weights into RAM. A 7B-parameter model at full 32-bit float precision needs roughly 28 GB of RAM — too much for most consumer hardware. This is where quantization comes in.
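
The 28 GB figure is simple arithmetic: parameters times bytes per parameter. The sketch below reproduces it; the function name `weight_memory_gb` is hypothetical, it uses decimal gigabytes, and it counts only the raw weights with no runtime overhead. The 16-bit row is an addition here, included because many open weights are published at half precision.

```python
BYTES_PER_GB = 1e9  # decimal gigabytes, matching the figures in the text


def weight_memory_gb(n_params, bits_per_weight):
    """Raw size of the weights alone, ignoring any runtime overhead."""
    return n_params * bits_per_weight / 8 / BYTES_PER_GB


for bits in (32, 16, 4):
    print(f"7B at {bits}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
```

This prints 28.0 GB, 14.0 GB, and 3.5 GB. The last number is the raw figure for 4-bit weights; real files are somewhat larger.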

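Quantization compresses weights from 32-bit floats down to 4-bit integers, nominally an 8x cut in memory at modest quality cost. Real 4-bit files come out somewhat larger than the raw arithmetic suggests (7B parameters × 4 bits is 3.5 GB), because they also store scaling factors and typically keep some layers at higher precision. In practice, a 7B model at 4-bit quantization is about 4.5 GB on disk. That's why you can run a capable model on a MacBook Air with 16 GB of RAM.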

Quantization isn't free — you lose some precision in the weights — but for most tasks the quality difference versus full-precision is small enough that you won't notice it in practice.
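
To make the precision loss concrete, here is a minimal sketch of symmetric 4-bit quantization: each weight is replaced by an integer from -8 to 7 plus one shared scale factor. This is a simplified illustration, not the scheme any particular model file uses; real 4-bit formats are more elaborate. The function names and sample weights are hypothetical.

```python
def quantize_4bit(weights):
    """Map floats to integers in [-8, 7] plus one shared float scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the 4-bit integers."""
    return [v * scale for v in q]


original = [0.12, -0.53, 0.98, -0.07, 0.31, -0.88, 0.44, 0.02]
q, scale = quantize_4bit(original)
restored = dequantize(q, scale)
worst = max(abs(a - b) for a, b in zip(original, restored))
print("4-bit codes:", q)
print(f"worst-case error: {worst:.3f} (scale {scale:.3f})")
```

Each restored weight lands within half a scale step of the original. That rounding error is the "lost precision" in question, and it stays small relative to the largest weight in the group.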

Open vs. closed models

Some LLMs are closed: the weights are never released publicly and you access the model only through an API or a chat interface. Others are open: the weights are published and anyone can download and run them.

Model	Access	Developer	Notes
GPT-4o	Closed	OpenAI	API + ChatGPT only; weights not released
Claude Sonnet	Closed	Anthropic	API + Claude.ai only; weights not released
Llama 3 (8B / 70B)	Open	Meta	Weights published; downloadable and runnable locally
Qwen2.5 (1.5B – 72B)	Open	Alibaba	Weights published; wide size range for different hardware
Mistral 7B	Open	Mistral AI	Weights published; compact, efficient for its size
DeepSeek-R1 (671B)	Open	DeepSeek	Mixture-of-experts; weights published, requires significant hardware

Closed models are often more capable at the high end, but they require an internet connection, send your prompts to a remote server, and charge per token. Open models run locally, cost nothing per query after you have the weights, and keep your data on your device.

Context window explained

A context window is how many tokens the model can see at once. Everything inside the window — your system prompt, the conversation so far, any documents you've pasted in — is available to the model when it generates the next token. Everything outside the window is invisible.
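
One simple way an application might keep a long conversation inside the window is to drop the oldest tokens first. The sketch below assumes that policy; it is not a description of how any particular app or model handles overflow. The function `fit_to_window` and its parameters are hypothetical, and tokens are represented as plain list items.

```python
def fit_to_window(system_tokens, history_tokens, window, reserve_for_reply):
    """Keep the system prompt and as much recent history as fits."""
    budget = window - reserve_for_reply - len(system_tokens)
    if budget <= 0:
        raise ValueError("system prompt alone does not fit in the window")
    # Older tokens fall off the front; the model never sees them.
    return system_tokens + history_tokens[-budget:]


system = ["S", "S"]
history = list(range(10))  # stand-ins for ten conversation tokens
print(fit_to_window(system, history, window=8, reserve_for_reply=2))
# ['S', 'S', 6, 7, 8, 9]
```

Tokens 0 through 5 are outside the window in this example, so nothing the model generates next can depend on them.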

Token counts aren't word counts. English text runs roughly 750 words per 1,000 tokens. By that ratio, a 128k-token context window holds about 96,000 words, roughly the length of a novel.
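
The conversion is a rule of thumb, so a rough calculator is enough. The sketch below assumes the 0.75 words-per-token ratio quoted above and treats "128k" as 128,000 tokens; actual ratios vary by tokenizer and language. The function names are hypothetical.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb: about 750 words per 1,000 tokens


def tokens_to_words(n_tokens):
    """Approximate English word count for a given number of tokens."""
    return round(n_tokens * WORDS_PER_TOKEN)


def words_to_tokens(n_words):
    """Approximate token count for a given number of English words."""
    return round(n_words / WORDS_PER_TOKEN)


print(tokens_to_words(128_000))  # 96000
print(words_to_tokens(1_500))    # 2000
```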

Some reference points from published model specs:

GPT-4o: 128k tokens
Claude: 200k tokens
Most local open models: 8k – 128k tokens depending on the model and how much RAM you have

A larger context window isn't always better in practice. Models can lose track of information in very long contexts, and a bigger context window uses more RAM and slows down inference. For most tasks — writing, coding, Q&A — 8k to 32k tokens is plenty.

Running an LLM on your own Mac

Running a model locally means three things: the weight files live on your disk, computation happens on your chip, and nothing leaves the device. No API call. No server. No usage cap.

Apple Silicon makes this practical for most people. The M-series chips have a unified memory architecture — the CPU, GPU, and neural engine all share the same RAM pool — which means a model that needs 8 GB of memory doesn't require a discrete GPU. A MacBook Air with 16 GB handles 7B and 8B models comfortably. A Mac with 32 GB or 64 GB can run models in the 27B–32B range.

The Outlier app ships three models for Mac:

Outlier Nano (1.5B) — runs at 32 tokens per second on an M4 MacBook Air; free tier
Outlier Lite (3B) — more capable, still fast on any M-series Mac; free tier
Outlier Core (27B) — serious coding and writing work; requires more RAM

The size-to-memory math: at 4-bit quantization, a 7B model is about 4.5 GB on disk and in RAM. A 27B model is roughly 17 GB. That's the number you need to compare against your Mac's unified memory spec.
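
That comparison can be turned into a quick check. The sketch below scales the 7B ≈ 4.5 GB figure linearly to other sizes, which reproduces the roughly 17 GB quoted for 27B. The 6 GB headroom for macOS, other apps, and the context is an assumption chosen for illustration, not a measured requirement, and the function names are hypothetical.

```python
GB_PER_BILLION_PARAMS_4BIT = 4.5 / 7  # derived from the 7B ≈ 4.5 GB figure
HEADROOM_GB = 6.0  # assumed allowance for macOS, apps, and context


def estimated_size_gb(billions_of_params):
    """Estimated 4-bit model size, scaled linearly from the 7B figure."""
    return billions_of_params * GB_PER_BILLION_PARAMS_4BIT


def fits(billions_of_params, unified_memory_gb, headroom_gb=HEADROOM_GB):
    """True if the model plus the assumed headroom fits in unified memory."""
    needed = estimated_size_gb(billions_of_params) + headroom_gb
    return needed <= unified_memory_gb


for size in (1.5, 3, 7, 27):
    print(f"{size}B: ~{estimated_size_gb(size):.1f} GB, "
          f"fits in 16 GB: {fits(size, 16)}, fits in 32 GB: {fits(size, 32)}")
```

With these assumptions, a 7B model fits on a 16 GB Mac while a 27B model needs the 32 GB tier.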

Sources and methodology: The numbers in this article come from published model cards and measured inference runs on my own M-series Macs. Context window figures are from official model documentation as of mid-2026. Quantization math (4-bit ≈ 8x compression vs. 32-bit float) is standard across the open-weights community.

Frequently asked questions

What is a large language model in simple terms?

A large language model is a neural network trained on a large amount of text. It learns statistical patterns from that text, then uses those patterns to predict the next token (word fragment) given whatever you've typed. Do this one token at a time and you get a response.

How many parameters does an LLM need to be useful?

It depends on the task. A 1.5B-parameter model like Outlier Nano handles simple Q&A and summarization at 32 tokens per second on an M4 MacBook Air. A 7B or 8B model handles most everyday coding and writing tasks. You only need 70B+ parameters for highly specialized or research-grade work where errors carry real cost.

Can I run an LLM on my Mac without a GPU?

Yes, you don't need a discrete GPU. Apple Silicon (M1 and later) has a unified memory architecture where the CPU and GPU share the same RAM pool. A model like Outlier Core (27B parameters) runs on a MacBook Pro with enough unified memory to hold its weights, roughly 17 GB at 4-bit quantization. The neural engine and GPU cores built into the M-series chip handle the matrix math.

Try a local LLM on your Mac — free

Outlier Nano and Lite are free tiers. Download the app, get the weights, and run your first query in under two minutes. Nothing leaves your Mac.
