Why AI forgets what you said — context windows explained
AI doesn't actually "remember" anything — it only sees what's in its current context window. When a conversation goes longer than that limit, older messages get dropped or summarized. Cloud services also clear context between sessions.
You spend twenty minutes building up context with a chatbot — pasting in a document, explaining what you need, getting the model dialed in — and then it contradicts something you said near the start. Or you come back the next day and it has no idea who you are. Neither is a glitch. It's exactly how these systems work, and once you understand the mechanism, the behavior stops being mysterious.
The context window — AI's working memory
A language model has no persistent memory between calls. Every time you send a message, the model receives a single block of text — the conversation so far, stitched together — processes it all at once, and produces a reply. That block of text is the context window: the total amount of text the model can "see" in one pass. Everything outside it is invisible.
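As a rough illustration of that statelessness, here is a minimal Python sketch. The `Conversation` class, the `send` helper, and the `fake_model` stand-in are all hypothetical names invented for this example, not any real chatbot's API. The point is only that the app rebuilds the full block of text on every turn and the model sees nothing else.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Hypothetical app-side record of a chat; the model itself stores nothing."""
    system_prompt: str
    turns: list = field(default_factory=list)  # list of (role, text) pairs

    def build_context(self, new_user_message: str) -> str:
        # Stitch system instructions, every earlier turn, and the new
        # message into the single block of text the model will see.
        lines = [f"system: {self.system_prompt}"]
        lines += [f"{role}: {text}" for role, text in self.turns]
        lines.append(f"user: {new_user_message}")
        return "\n".join(lines)


def fake_model(context: str) -> str:
    # Stand-in for a real model call. It can only use what is in `context`.
    return f"(reply based on {len(context.splitlines())} lines of context)"


def send(convo: Conversation, message: str) -> str:
    context = convo.build_context(message)  # rebuilt from scratch each turn
    reply = fake_model(context)
    convo.turns.append(("user", message))
    convo.turns.append(("assistant", reply))
    return reply
```

Calling `send` twice on the same `Conversation` shows the context growing: the second call carries the first exchange along with it, because the app resent it, not because the model remembered it.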
The unit of measurement is tokens. A token is roughly 0.75 words, so 1,000 tokens works out to about 750 words of ordinary prose. The context window size, measured in tokens, sets a hard ceiling on how much of a conversation a model can hold at once. A 128,000-token window — which sounds large — fits roughly 96,000 words in a single session, about the length of a full novel. That's genuinely a lot, but it is finite, and long technical sessions with pasted code or documents burn through it faster than you'd expect.
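The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the 0.75 ratio is the rough figure quoted here, real tokenizers vary by model and by content (code and non-English text usually cost more tokens per word), and the function names are made up for the example.

```python
import math

WORDS_PER_TOKEN = 0.75  # rough ratio from the text; real tokenizers vary


def tokens_to_words(tokens: int) -> int:
    """Approximate how many words of ordinary prose fit in `tokens`."""
    return round(tokens * WORDS_PER_TOKEN)


def words_to_tokens(words: int) -> int:
    """Approximate how many tokens `words` of ordinary prose will cost."""
    return math.ceil(words / WORDS_PER_TOKEN)


print(tokens_to_words(1_000))    # 750
print(tokens_to_words(128_000))  # 96000
print(words_to_tokens(6_000))    # 8000
```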
What you type, what the model replies, any documents you paste in, and system instructions from the app all count against the same limit. The model isn't selectively reading; it reads the whole window every single time it responds.
What happens when the window fills
When the context window fills up, something has to give. There are two common approaches, and neither is painless.
Truncation: the oldest messages simply fall off the front. The model never sees them again, as if they never existed. You can ask a question that directly contradicts a rule you established twenty exchanges ago, and the model has no way to notice — that part of the conversation is gone.
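A minimal sketch of truncation, assuming a crude word-based token estimate (the `estimate_tokens` heuristic and the `truncate_to_fit` name are illustrative, not how any particular app does it). It walks backward from the newest message and stops as soon as the budget is spent, so whatever is oldest is what gets dropped.

```python
import math


def estimate_tokens(text: str) -> int:
    # Crude heuristic: about 0.75 words per token. Not a real tokenizer.
    return math.ceil(len(text.split()) / 0.75)


def truncate_to_fit(messages: list, max_tokens: int) -> list:
    """Keep the most recent messages that fit in the budget; drop the oldest."""
    kept = []
    used = 0
    for msg in reversed(messages):  # newest first
        cost = estimate_tokens(msg)
        if used + cost > max_tokens:
            break  # this message and everything older falls out of the window
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    "rule: always answer in French",
    "what is the capital of Spain",
    "and of Italy",
]
print(truncate_to_fit(history, max_tokens=12))
# ['what is the capital of Spain', 'and of Italy']
```

In the example, the rule set in the first message is exactly the part that no longer fits, which is why the model can break it without noticing.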
Summarization: some systems compress older turns into a summary and keep that in the window instead of the raw text. This preserves a rough outline of earlier context but loses the specifics. If you pasted a ten-page document early on, what the model retains after summarization is something like "the user shared a document about contract terms" — not the actual terms.
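Here is a sketch of the summarization approach under loud assumptions: `naive_summary` is a placeholder that just keeps the first few words, where a real system would ask a model to write the summary, and `compress_history` is a hypothetical name. It shows the shape of the tradeoff, with recent turns kept verbatim and everything older collapsed into one short line.

```python
def naive_summary(messages: list, max_words: int = 12) -> str:
    # Placeholder summarizer: a real system would call a model here.
    words = " ".join(messages).split()
    clipped = " ".join(words[:max_words])
    suffix = " ..." if len(words) > max_words else ""
    return f"Summary of earlier turns: {clipped}{suffix}"


def compress_history(messages: list, keep_recent: int = 2) -> list:
    """Replace everything except the last `keep_recent` messages with a summary."""
    cut = len(messages) - keep_recent
    if cut <= 0:
        return list(messages)  # nothing old enough to compress
    older, recent = messages[:cut], messages[cut:]
    return [naive_summary(older)] + recent
```

Whatever the summarizer leaves out is unrecoverable from inside the window, which matches the "document about contract terms" example above.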
Both approaches can produce the impression that the model has forgotten you, because it has. It isn't withholding information or being difficult. The data is simply no longer in the window.
Why cloud AI forgets between sessions
The context window problem explains forgetting within a single conversation. But cloud AI also forgets between conversations, and that's a separate mechanism.
Cloud AI services — ChatGPT, Claude, Gemini — run on remote servers. Each conversation is its own isolated session. When you close the tab and come back the next day, you start a fresh context window with nothing in it. The server does not retain your previous conversation as active context by default. There's no ambient memory carrying over; it's a blank slate every time.
Some cloud services offer opt-in memory features. ChatGPT has Memory; Claude has its own memory feature. These work by having the model write short summaries or facts to a separate store, which then get injected into future sessions. They help, but they're not full recall. The raw text of your past conversations is not replayed verbatim into each new session; you get curated summaries, and what gets saved is largely decided by the system, not by you. If a detail wasn't captured, it's gone.
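The general pattern can be sketched like this. It is a hypothetical illustration of "save short facts, inject them later" and not a description of how ChatGPT or Claude implement their memory features; `MemoryStore` and `start_session` are invented names.

```python
class MemoryStore:
    """Hypothetical store of short facts that outlives any single session."""

    def __init__(self):
        self.facts = []

    def remember(self, fact: str) -> None:
        if fact not in self.facts:  # skip exact duplicates
            self.facts.append(fact)

    def as_preamble(self) -> str:
        if not self.facts:
            return ""
        bullets = "\n".join(f"- {fact}" for fact in self.facts)
        return f"Known facts about the user:\n{bullets}"


def start_session(store: MemoryStore, first_message: str) -> list:
    """Open a fresh context window, seeded only with the saved facts."""
    preamble = store.as_preamble()
    return [preamble, first_message] if preamble else [first_message]
```

Only what was passed to `remember` reaches the new session. Anything else from earlier conversations is absent from the new window.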
Context window sizes compared
Different models have different window sizes. Here's how the major ones compare:
| Model | Context window | Approx. words |
|---|---|---|
| GPT-4o | 128k tokens | ~96,000 |
| Claude 3.5 Sonnet | 200k tokens | ~150,000 |
| Llama 3.1 8B | 128k tokens | ~96,000 |
| Most local 7B models | 8k–32k tokens (default) | ~6,000–24,000 |
The gap between a frontier cloud model and a default local model can be significant. A 7B model set to 8k tokens holds about 6,000 words — roughly a long magazine article. A cloud model at 128k–200k tokens holds an entire book. That difference matters a lot for tasks like reviewing a codebase or analyzing a long document in one shot.
That said, context window size on local models is often configurable. Many models support longer windows than their default setting, and the tradeoff is memory: a longer context uses more RAM.
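To get a feel for that memory cost, here is a rough estimate of the attention cache (the "KV cache") that grows with context length. The numbers are assumptions: the defaults below are the published Llama 3.1 8B shape (32 layers, 8 key/value heads, head dimension 128) with 16-bit cache values, and the function name is made up. Actual usage depends on the runtime, which may quantize or otherwise shrink the cache, and model weights come on top of this.

```python
def kv_cache_bytes(
    context_tokens: int,
    n_layers: int = 32,        # assumed: Llama 3.1 8B
    n_kv_heads: int = 8,       # assumed: grouped-query attention
    head_dim: int = 128,       # assumed
    bytes_per_value: int = 2,  # assumed: 16-bit cache
) -> int:
    """Estimate cache size: one key and one value vector per layer, head, and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


GIB = 1024 ** 3
print(kv_cache_bytes(8_192) / GIB)    # 1.0
print(kv_cache_bytes(131_072) / GIB)  # 16.0
```

Under these assumptions, going from an 8k window to the full 128k window multiplies the cache from about 1 GiB to about 16 GiB, which is why defaults on local models tend to be conservative.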
How local AI handles this differently
I run local AI on my Macs using Outlier, and the context behavior is meaningfully different from cloud AI in two ways.
First, there's no server session timeout. A cloud AI session can lose its context not just when the window fills but when the service decides the session is over: a tab closing, an inactivity timeout, or a backend reset. With a local model, the conversation lives in memory on your machine for as long as your app is open. Nothing resets it from the outside.
Second, local sessions are saveable and resumable. Cloud AI sessions are ephemeral by design — the chat history is a display artifact, not active context you can reload into a fresh window with the same effect. A local app can serialize the actual context state and resume it. When you pick up a session in Outlier, the model sees the same window it had before, not a reconstruction from a chat log.
None of this changes the fundamental physics. A local model still has a context window with a token limit, and filling it still causes older content to get dropped or summarized. The difference is that you have more control over what happens to that context, and no external service is resetting it under you.
Practical tips for working with context limits
Once you understand what's happening, a few habits help you stay within the window and avoid surprises:
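- Pin critical information where it can't fall off. If there's a rule, persona, or fact that must survive the whole session, put it in a system prompt if the app supports one, since apps typically resend the system prompt with every request. If the first message is your only option, restate the rule every so often: truncation drops from the front, so early content is the first to go.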
- Paste in key facts when you start a new session. Don't rely on the model to remember what you told it last week. Copy the three bullet points that matter and paste them at the start. It takes ten seconds and prevents a lot of confusion.
- Use summarization intentionally. Before a long session gets close to the limit, ask the model to summarize what you've established so far. Then start a new session with that summary as the opening message. You get a clean window with the important context preserved.
- Watch for contradiction as a signal. If the model contradicts something you said earlier, that's often a sign the original turn fell out of the window. It's not confused — it literally doesn't have that information anymore.
- Trim documents before pasting. If you're analyzing a long file, paste only the relevant section rather than the full document. Every token you spend on context you don't need is a token you can't spend on the conversation.
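The summarize-and-restart habit can be sketched in a few lines. Everything here is illustrative: the token estimate is the same crude 0.75-words-per-token heuristic, the function names are hypothetical, and the 80% threshold in the usage note is an arbitrary choice rather than a recommendation from any vendor.

```python
import math


def estimate_tokens(text: str) -> int:
    # Crude heuristic: about 0.75 words per token. Not a real tokenizer.
    return math.ceil(len(text.split()) / 0.75)


def window_usage(messages: list, window_tokens: int) -> float:
    """Fraction of the context window the conversation has used so far."""
    used = sum(estimate_tokens(m) for m in messages)
    return used / window_tokens


def build_handoff(summary: str, key_facts: list) -> str:
    """Opening message for a fresh session that carries the essentials over."""
    lines = ["Context carried over from a previous session:", summary, "Key facts:"]
    lines += [f"- {fact}" for fact in key_facts]
    return "\n".join(lines)
```

A possible workflow: when `window_usage` climbs past something like 0.8, ask the model for a summary, pass it to `build_handoff` along with the few facts that must not be lost, and paste the result as the first message of a new session.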
Keep more context on your own machine
Outlier runs local AI on your Mac — long context sessions with no server resets, no usage caps, and no data leaving your device.