Context Windows Are Not the Answer
Why throwing more tokens at the problem makes AI worse, not better

When More Context Made Everything Worse
At CloneForce, we were running long planning and execution sessions — creating detailed plans and then executing them in the same conversation. The token count would climb as the session went on. Plans, code changes, test results, error logs — all accumulating in context.
At some point, the AI would start making mistakes. Not subtle ones. It would reference code it had already changed. It would forget decisions made earlier in the session. It would hallucinate function signatures that didn't exist. The longer the session ran, the worse the output got — confident, detailed, and wrong.
We didn't know to compact the context programmatically. We didn't even know that was a thing. We just knew that long sessions produced bad results and short sessions produced good ones. Once we figured out what was happening — that the context window was working against us, not for us — we added guardrails. Shorter sessions. Scoped context. Breaking work into pieces that fit cleanly instead of letting everything accumulate.
That was the moment I stopped trusting context windows.
My Misconception
I'll be honest — when context windows started getting bigger, I assumed more was better. More information, better answers. Just feed it everything and let the AI sort it out. That felt like the obvious conclusion, and I ran with it for a while.
Learning that the opposite was true was one of those humbling moments. The kind where you realize you've been doing it wrong and the fix is so simple it almost feels insulting. After using these tools in production, I learned that once a session fills roughly half the context window, the AI gets noticeably worse. Not a little worse — significantly worse. There's a well-documented phenomenon called "lost in the middle" — the AI pays strong attention to the beginning and end of the context but loses track of information in the middle. The bigger the window, the bigger the dead zone in the middle.
This means a million-token context window doesn't solve the problem. It makes it worse. You now have a bigger middle section where critical information gets lost. The AI will confidently reference something from the first few thousand tokens while completely ignoring the fix you need that's buried at token 400,000. Compacting the context — making it smaller and more focused — consistently gives better results. Models will get better at this over time, but it's a fundamental problem today, and throwing more tokens at it isn't the answer.
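To make that concrete, here's a minimal sketch of the kind of guardrail we ended up adding: cap each session at a token budget and keep only the system prompt plus the most recent turns. The message shape and the 4-characters-per-token estimate are illustrative placeholders, not how any particular API counts tokens.

```python
# Sketch: keep a session under a token budget by dropping the oldest
# non-system turns first. Token counts are rough estimates (4 chars
# per token); a real system would use the model's actual tokenizer.

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def compact_history(messages: list[dict], budget_tokens: int) -> list[dict]:
    """Keep the system prompt and the most recent turns within budget.

    messages: [{"role": ..., "content": ...}, ...], oldest first.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    kept: list[dict] = []
    used = sum(estimate_tokens(m["content"]) for m in system)
    # Walk from newest to oldest, keeping whatever still fits.
    for msg in reversed(rest):
        cost = estimate_tokens(msg["content"])
        if used + cost > budget_tokens:
            break
        kept.append(msg)
        used += cost
    return system + list(reversed(kept))
```

With a 200K window, setting the budget at around half of that keeps sessions out of the degradation zone described above.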
The Bigger-Is-Better Trap
Every few months, another model drops with a headline number that's supposed to make you gasp. 100K tokens. 200K. A million. The pitch is always the same: now you can load your entire codebase, your entire knowledge base, your entire life story, and the model will just handle it.
It's seductive. I get it. Who wouldn't want an AI that can hold an entire project in its head at once?
But here's what the marketing doesn't tell you.
Cost and latency scale linearly (or worse). Every token you send gets processed. More context means slower responses and a bigger bill. I've seen developers dump entire repos into prompts for questions that needed maybe 500 tokens of actual context. They're paying ten times more for worse results.
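A quick back-of-envelope illustration of that linear scaling. The per-token price here is a hypothetical placeholder; substitute your provider's actual rate.

```python
# Input cost scales with every token you send, relevant or not.
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000  # hypothetical: $3 per million input tokens

def input_cost(tokens: int) -> float:
    return tokens * PRICE_PER_INPUT_TOKEN

# A focused prompt vs. a context-stuffed one:
print(f"500-token prompt:    ${input_cost(500):.6f}")
print(f"50,000-token prompt: ${input_cost(50_000):.6f}")
```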
Hallucination risk compounds. The more unrelated information you pack into context, the more surface area the model has to draw false connections. It starts pattern-matching across boundaries that shouldn't be crossed. That's not intelligence — that's a very expensive game of telephone.
What Actually Works
After building Open Notebook (the AI on this site that knows my career) and working on CloneForce's agent pipelines, I've learned the same lesson over and over: the best AI systems use less context, not more. The skill is in curation.
Here's what I reach for instead of a bigger window:
Targeted Retrieval (RAG)
Retrieval-Augmented Generation flips the paradigm. Instead of giving the model everything and hoping it finds what's relevant, you find what's relevant first and give the model only that.
RAG originally existed to solve a different problem: early models had small context windows and would truncate your input. You had to retrieve only the relevant pieces because everything didn't fit. When context windows got bigger, a lot of people thought RAG was obsolete — why bother retrieving when you can just dump everything in?
Turns out RAG was solving a deeper problem than token limits. It was solving the quality problem. It's still a great tool for getting better results — but it's not a silver bullet either. You can't use RAG for everything. Some problems need the model to hold a lot of context at once. The point is that bigger windows didn't replace RAG — they revealed that we need both, used thoughtfully.
In Open Notebook, when someone asks "What's BJT's experience with React?", the system doesn't dump my entire career history into the prompt. It retrieves the specific chunks about my React projects, my frontend work, the relevant parts of my resume — maybe 1,500 tokens total — and the model answers from that focused context. The responses are faster, cheaper, and dramatically more accurate.
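This isn't Open Notebook's actual retrieval code, but a toy sketch of the mechanic: score chunks against the query, keep the top few, and prompt with only those. Word overlap stands in for the embedding similarity a real system would use, and the chunk texts are made-up examples.

```python
# Toy retrieval: score chunks by word overlap with the query and keep
# the top-k. Production systems use embeddings, but the mechanic is the
# same: retrieve a focused slice, then prompt with only that slice.
import re

def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z']+", text.lower()))

def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    query_words = words(query)
    scored = sorted(
        chunks,
        key=lambda c: len(query_words & words(c)),
        reverse=True,
    )
    return scored[:k]

chunks = [
    "Built the frontend for Open Notebook in React and TypeScript.",
    "Machined precision parts to spec in a previous career.",
    "Led a React migration from class components to hooks.",
    "Wrote agent pipelines for CloneForce in Python.",
]
relevant = retrieve("What's your experience with React?", chunks, k=2)
prompt = "Answer from this context only:\n" + "\n".join(relevant)
```

The model sees two focused chunks instead of the whole corpus, which is the entire point.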
Chunking Problems Into LLM-Friendly Pieces
Large language models think in tokens, and they think best in manageable amounts. The art is breaking your problem down so each prompt gets exactly what it needs. This is a design skill, not a prompting trick. It means thinking about information architecture before you ever write a system prompt.
When I'm working on a multi-file bug, I don't paste the whole project. I trace the call chain, identify the three or four files involved, and send just those — in a logical order, with clear labels. The model does dramatically better because I did the thinking up front.
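That workflow can be sketched as a small helper: gather only the files in the call chain, label them clearly, and order them logically. The file paths and contents below are hypothetical placeholders standing in for real source files.

```python
# Sketch: build one focused prompt from only the files involved in the
# bug, in call-chain order, with explicit labels so the model knows
# where each piece comes from.

def build_context(files: list[tuple[str, str]], task: str) -> str:
    """files: (path, content) pairs, ordered to follow the call chain."""
    sections = [f"=== {path} ===\n{content}" for path, content in files]
    return "\n\n".join(sections) + f"\n\nTask: {task}"

call_chain = [
    ("api/routes.py", "def get_user(id): return service.load(id)"),
    ("core/service.py", "def load(id): return repo.fetch(id)"),
    ("db/repo.py", "def fetch(id): ..."),
]
prompt = build_context(call_chain, "fetch() returns None for valid ids; trace why.")
```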
Smart Context Management
Knowing what to leave out matters as much as knowing what to include. Every token that isn't directly relevant is noise. It's not just wasted space — it's actively working against you by diluting the model's attention on the tokens that matter.
This is the same principle behind good engineering in any domain. You don't hand a machinist a 10,000-page manual when they need the tolerance spec for one part. You hand them the drawing. The constraint is a feature.
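One way to operationalize "knowing what to leave out": give every candidate item a relevance score and drop anything below a threshold. The scores here are hand-assigned placeholders; in practice they would come from embedding similarity or a reranker.

```python
# Sketch: prune context items below a relevance threshold instead of
# sending everything. Hand the machinist the drawing, not the manual.

def prune(items: list[tuple[str, float]], threshold: float = 0.5) -> list[str]:
    """Keep only items whose relevance score clears the bar."""
    return [text for text, score in items if score >= threshold]

candidates = [
    ("Tolerance spec for part A-113", 0.92),
    ("Full 10,000-page machine manual", 0.20),
    ("Setup notes for the relevant fixture", 0.65),
]
context = prune(candidates)
# Only the spec and the fixture notes survive; the manual is noise.
```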
Context Caching
For systems where the same prefix gets reused across requests — like a consistent system prompt or a shared knowledge base — context caching lets you pay for that prefix once and reuse it across multiple completions. It's not glamorous, but it's the kind of practical optimization that separates production systems from demo-ware. (I'll dig deeper into caching strategies in a future post.)
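Providers implement this server-side, but the idea is easy to sketch: key the expensive prefix work on a hash of the prefix, so identical prefixes are paid for once. This illustrates the mechanics only and is not any provider's API; `process_prefix` is a stand-in for the prefill work a real cache would save.

```python
# Sketch of the idea behind prefix caching: hash the shared prefix and
# reuse the result on subsequent requests with the same prefix.
import hashlib

_prefix_cache: dict[str, str] = {}

def process_prefix(prefix: str) -> str:
    # Stand-in for the expensive prefill work the provider would cache.
    return f"processed:{len(prefix)}"

def get_processed_prefix(prefix: str) -> tuple[str, bool]:
    key = hashlib.sha256(prefix.encode()).hexdigest()
    if key in _prefix_cache:
        return _prefix_cache[key], True   # cache hit: prefix cost paid once
    result = process_prefix(prefix)
    _prefix_cache[key] = result
    return result, False

system_prompt = "You are the Open Notebook assistant. Answer from the provided context."
_, hit1 = get_processed_prefix(system_prompt)  # first request pays
_, hit2 = get_processed_prefix(system_prompt)  # reuse
```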
The Mechanic in Me
I came up through trades before I wrote code. Auto body, mechanical engineering, machining. In every one of those disciplines, the lesson was the same: precision beats volume. You don't fix a dent with a bigger hammer. You fix it with the right tool, applied to the right spot, with the right amount of force.
AI engineering is no different. The developers who build reliable systems aren't the ones with the biggest context windows. They're the ones who understand what information the model actually needs and architect their systems to deliver exactly that.
This is what I mean when I say architecture beats brute force. Anyone can stuff a million tokens into a prompt. It takes an engineer to figure out which 2,000 tokens actually matter.
Your Turn
Next time you're working with an LLM and you're tempted to expand the context window, try the opposite. Strip it down. Give the model less, but make every token count. See what happens.
I think you'll be surprised at how much better "less" performs. And if you want to see this philosophy in action, ask the AI sidebar on this site a question about my career. That's a RAG system running on curated, chunked context — not a giant context dump. The difference shows.
What's the most overcomplicated prompt you've ever written that worked better when you simplified it? I'd genuinely love to hear about it.
— Bill John Tran