The Context Window Trap
Claude has a 1M token context window. Gemini offers 2M. The instinct is obvious: just throw everything in. All the docs, all the code, all the conversation history. Let the model figure it out.
I've run Claude Code at high context for months. I've built RAG systems, long-context agents, and document analysis pipelines. The 1M window is real and useful. But treating it as an excuse to skip information architecture is a trap that costs you quality, latency, and money.
the attention degradation problem
Context windows have gotten bigger. Attention hasn't gotten proportionally better.
The "needle in a haystack" benchmarks show models can find a specific fact buried in 100K+ tokens. What they don't test: can the model reason across information spread throughout 100K tokens? That's a different capability, and it degrades faster than retrieval.
In practice, I've observed:
- 0-30K tokens: Full reasoning quality. The model tracks all the context effectively.
- 30K-100K tokens: Reasoning still works but the model starts missing connections between distant sections. It handles explicit references but misses implicit relationships.
- 100K-500K tokens: Retrieval still works (it can find things you point it to) but synthesis degrades. Ask it to "summarize the key themes across all these documents" and it'll miss some.
- 500K-1M tokens: Functional for lookup and search. Unreliable for complex reasoning that depends on integrating information from multiple locations.
These aren't hard cutoffs — they're gradients. And they depend on the task. Simple factual lookup stays accurate much deeper into the context than multi-document reasoning.
the cost math
Context window pricing is linear: everything in the window is billed as input tokens, so 500K tokens of context costs the same whether 5K of it is useful or all of it is.
Real example from my workflow:
Approach 1: Stuff 200K tokens of codebase into context
- Input: 200K tokens × $3/M (Claude Sonnet) = $0.60 per query
- 50 queries/day = $30/day
Approach 2: RAG retrieval of relevant 5K tokens
- Embedding: negligible
- Input: 5K tokens × $3/M = $0.015 per query
- 50 queries/day = $0.75/day
40x cost difference. And the RAG approach often produces better answers because it surfaces only the relevant code instead of making the model scan 200K tokens for the relevant section.
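The arithmetic, as a quick sketch (prices are per million input tokens and will drift, so treat the $3 figure as a placeholder):

def daily_cost(tokens_per_query: int, queries_per_day: int,
               price_per_m_tokens: float = 3.00) -> float:
    # Input-token cost only; output tokens cost the same either way
    return tokens_per_query / 1_000_000 * price_per_m_tokens * queries_per_day

full_context = daily_cost(200_000, 50)  # $30.00/day
rag = daily_cost(5_000, 50)             # $0.75/day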
when large context actually helps
There are cases where big windows are the right choice:
Single-document analysis. Reading a 50-page contract or a long research paper. The document is a coherent unit and the model needs the full thing to answer questions about it. Chunking would lose cross-references and context.
Codebase understanding. When you need the model to understand how modules interact, dependency chains, or architectural patterns — having the full codebase in context beats retrieval. This is why Claude Code works: it loads relevant files and builds understanding.
Conversation continuity. Long development sessions where earlier decisions constrain later ones. "We decided to use the repository pattern in message 15, now in message 45 we need to extend it." Losing that context breaks continuity.
Few-shot learning at scale. When you need 50+ examples to teach a complex pattern, the context window is your classroom. More examples = better pattern recognition, and large windows enable it.
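Mechanically, that mostly means packing examples until you hit a token budget. A minimal sketch, assuming a rough 4-characters-per-token estimate and a simple input/output example format:

def build_many_shot_prompt(task_instructions: str,
                           examples: list[tuple[str, str]],
                           budget_tokens: int = 200_000) -> str:
    parts = [task_instructions]
    used = len(task_instructions) // 4  # rough ~4 chars/token estimate
    for example_input, example_output in examples:
        block = f"\nInput: {example_input}\nOutput: {example_output}"
        if used + len(block) // 4 > budget_tokens:
            break  # stop before blowing the context budget
        parts.append(block)
        used += len(block) // 4
    return "\n".join(parts)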
when to chunk instead
Multiple independent documents. If you're querying across 500 support tickets, 500 product reviews, or 500 blog posts — these are independent units. Retrieve the relevant 10, not all 500.
Knowledge bases with factual lookups. "What's our refund policy?" doesn't need 100K tokens of company documentation. It needs the one page that answers the question. RAG wins here overwhelmingly.
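A minimal version of that retrieval path, assuming the OpenAI embeddings API and a pre-chunked knowledge base (the model name and k value are placeholders):

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def retrieve(query: str, chunks: list[str], chunk_vecs: np.ndarray, k: int = 5) -> list[str]:
    # Cosine similarity between the query and pre-embedded doc chunks
    q = embed([query])[0]
    sims = (chunk_vecs @ q) / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

Embed the knowledge base once up front, then send the model ~5K tokens of retrieved chunks per query instead of the whole corpus.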
Repetitive or boilerplate content. Log files, data dumps, CSV exports. These are structured data masquerading as text. Parse them into a database and query structurally, don't feed them to an LLM.
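For the structured-data case the sketch is even simpler: load the export into SQLite and answer questions with SQL. The file name, table name, and status column below are made up, and this assumes clean CSV column names:

import csv
import sqlite3

def load_csv(path: str, table: str = "events") -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    cols = list(rows[0].keys())
    conn.execute(f"CREATE TABLE {table} ({', '.join(cols)})")
    conn.executemany(
        f"INSERT INTO {table} VALUES ({', '.join('?' for _ in cols)})",
        [tuple(r.values()) for r in rows],
    )
    return conn

conn = load_csv("support_export.csv")  # hypothetical export
counts = conn.execute(
    "SELECT status, COUNT(*) FROM events GROUP BY status ORDER BY 2 DESC"
).fetchall()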
the architecture I use
def decide_context_strategy(task_type: str, corpus_size: int) -> str:
    if corpus_size < 30_000:
        # Small enough — just use full context
        return "full_context"
    if task_type == "single_document_analysis":
        # Coherent document — use full context up to 100K
        return "full_context" if corpus_size < 100_000 else "summarize_then_query"
    if task_type == "factual_lookup":
        # Independent facts — always RAG
        return "rag_retrieval"
    if task_type == "cross_document_reasoning":
        # Need synthesis — map-reduce pattern
        return "map_reduce"
    if task_type == "conversation_continuity":
        # Rolling summary + recent turns
        return "rolling_summary"
    # Default: RAG with generous retrieval
    return "rag_retrieval"

Map-reduce for cross-document reasoning:
import asyncio

async def map_reduce_analysis(documents: list[str], question: str):
    # Map: analyze each document independently
    analyses = await asyncio.gather(*[
        analyze_document(doc, question) for doc in documents
    ])
    # Reduce: synthesize analyses (much smaller than original docs)
    synthesis = await synthesize(analyses, question)
    return synthesis

Each document gets analyzed individually (map). The analyses — much shorter than the originals — get combined for synthesis (reduce). This works when the total corpus is 500K tokens but each document is 5K.
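analyze_document and synthesize are left undefined above. A minimal sketch of what they might look like with the Anthropic Python SDK; the model name and prompts are placeholders, not my exact ones:

from anthropic import AsyncAnthropic

client = AsyncAnthropic()
MODEL = "claude-sonnet-4-20250514"  # placeholder model name

async def analyze_document(doc: str, question: str) -> str:
    # Map step: one small, focused call per document
    resp = await client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                f"Document:\n{doc}\n\nQuestion: {question}\n"
                "Extract only the material relevant to the question."
            ),
        }],
    )
    return resp.content[0].text

async def synthesize(analyses: list[str], question: str) -> str:
    # Reduce step: combine the short per-document notes into one answer
    notes = "\n\n".join(analyses)
    resp = await client.messages.create(
        model=MODEL,
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": (
                f"Per-document notes:\n{notes}\n\nQuestion: {question}\n"
                "Answer using only these notes."
            ),
        }],
    )
    return resp.content[0].text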
Rolling summary for long conversations:
def build_conversation_context(
    full_history: list, summary: str, recent_turns: int = 10
):
    return [
        {"role": "system", "content": f"Conversation summary: {summary}"},
        *full_history[-recent_turns * 2:],  # last N turns
    ]

async def maybe_update_summary(full_history: list, summary: str) -> str:
    # Update summary every 20 turns (40 messages)
    if len(full_history) % 40 == 0:
        summary = await summarize_conversation(full_history)
    return summary

The summary captures key decisions and context. Recent turns provide the immediate conversation. Together, they're ~5K tokens instead of the full 50K+ token history.
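Tying the pieces together, one turn of the loop might look like this; call_model and summarize_conversation stand in for your actual LLM calls, and both names are assumptions:

async def handle_turn(user_msg: str, full_history: list, summary: str):
    # Append the user message, build the trimmed context, call the model,
    # then refresh the rolling summary if we've hit the interval.
    full_history.append({"role": "user", "content": user_msg})
    context = build_conversation_context(full_history, summary)
    reply = await call_model(context)  # hypothetical LLM call
    full_history.append({"role": "assistant", "content": reply})
    summary = await maybe_update_summary(full_history, summary)
    return reply, summary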
my rules of thumb
- Under 30K tokens: Just use full context. Don't over-engineer.
- 30K-100K tokens: Full context for coherent documents, RAG for independent units.
- 100K-500K tokens: Almost always RAG or map-reduce. Full context only for codebase understanding.
- 500K+ tokens: Always structured retrieval. The model will lose things at this scale.
- Always ask: "Does the model need all of this, or just the relevant parts?"
The context window is a capability, not a strategy. Use it when the task demands it. For everything else, send less and get more.