Outlier  ›  Learn  ›  Why AI forgets conversations

Why AI forgets what you said — context windows explained

Quick answer

AI doesn't actually "remember" anything — it only sees what's in its current context window. When a conversation goes longer than that limit, older messages get dropped or summarized. Cloud services also clear context between sessions.

You spend twenty minutes building up context with a chatbot — pasting in a document, explaining what you need, getting the model dialed in — and then it contradicts something you said near the start. Or you come back the next day and it has no idea who you are. Neither is a glitch. It's exactly how these systems work, and once you understand the mechanism, the behavior stops being mysterious.

The context window — AI's working memory

A language model has no persistent memory between calls. Every time you send a message, the model receives a single block of text — the conversation so far, stitched together — processes it all at once, and produces a reply. That block of text is the context window: the total amount of text the model can "see" in one pass. Everything outside it is invisible.

The unit of measurement is tokens. A token is roughly 0.75 words, so 1,000 tokens works out to about 750 words of ordinary prose. The context window size, measured in tokens, sets a hard ceiling on how much of a conversation a model can hold at once. A 128,000-token window — which sounds large — fits roughly 96,000 words in a single session, about the length of a full novel. That's genuinely a lot, but it is finite, and long technical sessions with pasted code or documents burn through it faster than you'd expect.

What you type, what the model replies, any documents you paste in, and system instructions from the app all count against the same limit. The model isn't selectively reading; it reads the whole window every single time it responds.

What happens when the window fills

When the context window fills up, something has to give. There are two common approaches, and neither is painless.

Truncation: the oldest messages simply fall off the front. The model never sees them again, as if they never existed. You can ask a question that directly contradicts a rule you established twenty exchanges ago, and the model has no way to notice — that part of the conversation is gone.

Summarization: some systems compress older turns into a summary and keep that in the window instead of the raw text. This preserves a rough outline of earlier context but loses the specifics. If you pasted a ten-page document early on, what the model retains after summarization is something like "the user shared a document about contract terms" — not the actual terms.

Both approaches can produce the impression that the model has forgotten you, because it has. It isn't withholding information or being difficult. The data is simply no longer in the window.

Why cloud AI forgets between sessions

The context window problem explains forgetting within a single conversation. But cloud AI also forgets between conversations, and that's a separate mechanism.

Cloud AI services — ChatGPT, Claude, Gemini — run on remote servers. Each conversation is its own isolated session. When you close the tab and come back the next day, you start a fresh context window with nothing in it. The server does not retain your previous conversation as active context by default. There's no ambient memory carrying over; it's a blank slate every time.

Some cloud services offer opt-in memory features. ChatGPT has Memory; Claude has its own. These work by having the model write short summaries or facts to a separate store, which then get injected into future sessions. They help, but they're not full recall. The raw text of every conversation you've ever had is not stored and replayed verbatim — you get curated summaries, and what gets saved is decided by the system, not by you. If a detail wasn't captured, it's gone.

Context window sizes compared

Different models have different window sizes. Here's how the major ones compare:

Model Context window Approx. words
GPT-4o 128k tokens ~96,000
Claude 3.5 Sonnet 200k tokens ~150,000
Llama 3.1 8B 128k tokens ~96,000
Most local 7B models 8k–32k tokens (default) ~6,000–24,000

The gap between a frontier cloud model and a default local model can be significant. A 7B model set to 8k tokens holds about 6,000 words — roughly a long magazine article. A cloud model at 128k–200k tokens holds an entire book. That difference matters a lot for tasks like reviewing a codebase or analyzing a long document in one shot.

That said, context window size on local models is often configurable. Many models support longer windows than their default setting, and the tradeoff is memory: a longer context uses more RAM.

How local AI handles this differently

I run local AI on my Macs using Outlier, and the context behavior is meaningfully different from cloud AI in two ways.

First, there's no server session timeout. Cloud AI sessions can expire not just when the window fills but when the service decides the session is over — a tab closing, an inactivity timeout, or a backend reset. With a local model, the conversation lives in memory on your machine for as long as your app is open. Nothing resets it from the outside.

Second, local sessions are saveable and resumable. Cloud AI sessions are ephemeral by design — the chat history is a display artifact, not active context you can reload into a fresh window with the same effect. A local app can serialize the actual context state and resume it. When you pick up a session in Outlier, the model sees the same window it had before, not a reconstruction from a chat log.

None of this changes the fundamental physics. A local model still has a context window with a token limit, and filling it still causes older content to get dropped or summarized. The difference is that you have more control over what happens to that context, and no external service is resetting it under you.

Practical tips for working with context limits

Once you understand what's happening, a few habits help you stay within the window and avoid surprises:

The numbers behind this article. Context window sizes are published in the respective model documentation: GPT-4o at 128k tokens (OpenAI), Claude 3.5 Sonnet at 200k tokens (Anthropic), Llama 3.1 8B at 128k tokens (Meta). Token-to-word ratio of approximately 0.75 words per token is the commonly cited rule of thumb and matches outputs from OpenAI's tokenizer tool. Local model defaults of 8k–32k refer to the typical out-of-the-box configuration for quantized 7B models in tools like Ollama and llama.cpp; the configurable ceiling is higher.

Keep more context on your own machine

Outlier runs local AI on your Mac — long context sessions with no server resets, no usage caps, and no data leaving your device.

Download Outlier