The Context Window Trap
Claude has a 1M token context window. Gemini offers 2M. The instinct is obvious: just throw everything in. All the docs, all the code, all the conversation history. Let the model figure it out.
I've run Claude Code at high context for months. I've built RAG systems, long-context agents, and document analysis pipelines. The 1M window is real and useful. But treating it as an excuse to skip information architecture is a trap that costs you quality, latency, and money.
the attention degradation problem
Context windows have gotten bigger. Attention hasn't gotten proportionally better.
The "needle in a haystack" benchmarks show models can find a specific fact buried in 100K+ tokens. What they don't test: can the model reason across information spread throughout 100K tokens? That's a different capability, and it degrades faster than retrieval.
In practice, I've observed:
- 0-30K tokens: Full reasoning quality. The model tracks all the context effectively.
- 30K-100K tokens: Reasoning still works but the model starts missing connections between distant sections. It handles explicit references but misses implicit relationships.
- 100K-500K tokens: Retrieval still works (it can find things you point it to) but synthesis degrades. Ask it to "summarize the key themes across all these documents" and it'll miss some.
- 500K-1M tokens: Functional for lookup and search. Unreliable for complex reasoning that depends on integrating information from multiple locations.
These aren't hard cutoffs — they're gradients. And they depend on the task. Simple factual lookup stays accurate much deeper into the context than multi-document reasoning.
the cost math
Context window pricing is linear: everything in the window is billed as input tokens, so 500K tokens of context costs the same whether 5K of it is useful or all of it is.
Real example from my workflow:
Approach 1: Stuff 200K tokens of codebase into context
- Input: 200K tokens × $3/M (Claude Sonnet) = $0.60 per query
- 50 queries/day = $30/day
Approach 2: RAG retrieval of relevant 5K tokens
- Embedding: negligible
- Input: 5K tokens × $3/M = $0.015 per query
- 50 queries/day = $0.75/day
40x cost difference. And the RAG approach often produces better answers because it surfaces only the relevant code instead of making the model scan 200K tokens for the relevant section.
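The arithmetic, as a quick sketch (prices are per million input tokens and will drift, so treat the $3 figure as a placeholder):

def daily_cost(tokens_per_query: int, queries_per_day: int,
               price_per_m_tokens: float = 3.00) -> float:
    # Input-token cost only; output tokens cost the same either way
    return tokens_per_query / 1_000_000 * price_per_m_tokens * queries_per_day

full_context = daily_cost(200_000, 50)  # $30.00/day
rag = daily_cost(5_000, 50)             # $0.75/day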
when large context actually helps
There are cases where big windows are the right choice:
Single-document analysis. Reading a 50-page contract or a long research paper. The document is a coherent unit and the model needs the full thing to answer questions about it. Chunking would lose cross-references and context.
Codebase understanding. When you need the model to understand how modules interact, dependency chains, or architectural patterns — having the full codebase in context beats retrieval. This is why Claude Code works: it loads relevant files and builds understanding.
Conversation continuity. Long development sessions where earlier decisions constrain later ones. "We decided to use the repository pattern in message 15, now in message 45 we need to extend it." Losing that context breaks continuity.
Few-shot learning at scale. When you need 50+ examples to teach a complex pattern, the context window is your classroom. More examples = better pattern recognition, and large windows enable it.
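Mechanically, that mostly means packing examples until you hit a token budget. A minimal sketch, assuming a rough 4-characters-per-token estimate and a simple input/output example format:

def build_many_shot_prompt(task_instructions: str,
                           examples: list[tuple[str, str]],
                           budget_tokens: int = 200_000) -> str:
    parts = [task_instructions]
    used = len(task_instructions) // 4  # rough ~4 chars/token estimate
    for example_input, example_output in examples:
        block = f"\nInput: {example_input}\nOutput: {example_output}"
        if used + len(block) // 4 > budget_tokens:
            break  # stop before blowing the context budget
        parts.append(block)
        used += len(block) // 4
    return "\n".join(parts)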
when to chunk instead
Multiple independent documents. If you're querying across 500 support tickets, 500 product reviews, or 500 blog posts — these are independent units. Retrieve the relevant 10, not all 500.
Knowledge bases with factual lookups. "What's our refund policy?" doesn't need 100K tokens of company documentation. It needs the one page that answers the question. RAG wins here overwhelmingly.
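A minimal version of that retrieval path, assuming the OpenAI embeddings API and a pre-chunked knowledge base (the model name and k value are placeholders):

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def retrieve(query: str, chunks: list[str], chunk_vecs: np.ndarray, k: int = 5) -> list[str]:
    # Cosine similarity between the query and pre-embedded doc chunks
    q = embed([query])[0]
    sims = (chunk_vecs @ q) / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

Embed the knowledge base once up front, then send the model ~5K tokens of retrieved chunks per query instead of the whole corpus.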
Repetitive or boilerplate content. Log files, data dumps, CSV exports. These are structured data masquerading as text. Parse them into a database and query structurally, don't feed them to an LLM.
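For the structured-data case the sketch is even simpler: load the export into SQLite and answer questions with SQL. The file name, table name, and status column below are made up, and this assumes clean CSV column names:

import csv
import sqlite3

def load_csv(path: str, table: str = "events") -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    cols = list(rows[0].keys())
    conn.execute(f"CREATE TABLE {table} ({', '.join(cols)})")
    conn.executemany(
        f"INSERT INTO {table} VALUES ({', '.join('?' for _ in cols)})",
        [tuple(r.values()) for r in rows],
    )
    return conn

conn = load_csv("support_export.csv")  # hypothetical export
counts = conn.execute(
    "SELECT status, COUNT(*) FROM events GROUP BY status ORDER BY 2 DESC"
).fetchall()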
the architecture I use
def decide_context_strategy(task_type: str, corpus_size: int) -> str:
    if corpus_size < 30_000:
        # Small enough — just use full context
        return "full_context"
    if task_type == "single_document_analysis":
        # Coherent document — use full context up to 100K
        return "full_context" if corpus_size < 100_000 else "summarize_then_query"
    if task_type == "factual_lookup":
        # Independent facts — always RAG
        return "rag_retrieval"
    if task_type == "cross_document_reasoning":
        # Need synthesis — map-reduce pattern
        return "map_reduce"
    if task_type == "conversation_continuity":
        # Rolling summary + recent turns
        return "rolling_summary"
    # Default: RAG with generous retrieval
    return "rag_retrieval"

Map-reduce for cross-document reasoning:
import asyncio

async def map_reduce_analysis(documents: list[str], question: str):
    # Map: analyze each document independently
    analyses = await asyncio.gather(*[
        analyze_document(doc, question) for doc in documents
    ])
    # Reduce: synthesize analyses (much smaller than original docs)
    synthesis = await synthesize(analyses, question)
    return synthesis

Each document gets analyzed individually (map). The analyses — much shorter than the originals — get combined for synthesis (reduce). This works when the total corpus is 500K tokens but each document is 5K.
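analyze_document and synthesize are left undefined above. A minimal sketch of what they might look like with the Anthropic Python SDK; the model name and prompts are placeholders, not my exact ones:

from anthropic import AsyncAnthropic

client = AsyncAnthropic()
MODEL = "claude-sonnet-4-20250514"  # placeholder model name

async def analyze_document(doc: str, question: str) -> str:
    # Map step: one small, focused call per document
    resp = await client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                f"Document:\n{doc}\n\nQuestion: {question}\n"
                "Extract only the material relevant to the question."
            ),
        }],
    )
    return resp.content[0].text

async def synthesize(analyses: list[str], question: str) -> str:
    # Reduce step: combine the short per-document notes into one answer
    notes = "\n\n".join(analyses)
    resp = await client.messages.create(
        model=MODEL,
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": (
                f"Per-document notes:\n{notes}\n\nQuestion: {question}\n"
                "Answer using only these notes."
            ),
        }],
    )
    return resp.content[0].text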
Rolling summary for long conversations:
def build_conversation_context(
    full_history: list, summary: str, recent_turns: int = 10
):
    return [
        {"role": "system", "content": f"Conversation summary: {summary}"},
        *full_history[-recent_turns * 2:],  # last N turns
    ]

async def maybe_update_summary(full_history: list, summary: str) -> str:
    # Update summary every 20 turns (40 messages)
    if len(full_history) % 40 == 0:
        summary = await summarize_conversation(full_history)
    return summary

The summary captures key decisions and context. Recent turns provide the immediate conversation. Together, they're ~5K tokens instead of the full 50K+ token history.
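Tying the pieces together, one turn of the loop might look like this; call_model and summarize_conversation stand in for your actual LLM calls, and both names are assumptions:

async def handle_turn(user_msg: str, full_history: list, summary: str):
    # Append the user message, build the trimmed context, call the model,
    # then refresh the rolling summary if we've hit the interval.
    full_history.append({"role": "user", "content": user_msg})
    context = build_conversation_context(full_history, summary)
    reply = await call_model(context)  # hypothetical LLM call
    full_history.append({"role": "assistant", "content": reply})
    summary = await maybe_update_summary(full_history, summary)
    return reply, summary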
my rules of thumb
- Under 30K tokens: Just use full context. Don't over-engineer.
- 30K-100K tokens: Full context for coherent documents, RAG for independent units.
- 100K-500K tokens: Almost always RAG or map-reduce. Full context only for codebase understanding.
- 500K+ tokens: Always structured retrieval. The model will lose things at this scale.
- Always ask: "Does the model need all of this, or just the relevant parts?"
The context window is a capability, not a strategy. Use it when the task demands it. For everything else, send less and get more.