AI Context Windows Explained

What context windows are, how they differ across models, and the strategies enterprises use to manage context for cost, accuracy, and security.

12 min read
Axiom Studio Team · Engineering

What Is a Context Window

A context window (also called a context length or context limit) is the maximum amount of text a large language model can process in a single request — the combined size of your input prompt and the model's output response, measured in tokens. Think of it as the model's working memory: everything it can "see" and reason about at once.

One token is roughly ¾ of an English word. A 128K-token context window can hold approximately 96,000 words — about the length of a full novel. A 1M-token window can hold roughly 750,000 words — enough for an entire codebase or a shelf of documents.

A one-line definition

A context window is the total number of tokens (input + output) an LLM can process in a single request — its working memory for that interaction.

Context windows have grown dramatically: GPT-3 launched with a 2K-token (2,048) window in 2020; by 2026, production models routinely support 128K–1M tokens, with Google's Gemini offering up to 2M in preview. But bigger is not always better: cost, latency, and accuracy all change with context size.

Context Window Limits by Model

Context limits vary widely across providers and models. Enterprise teams need to understand these limits because they directly affect which tasks each model can handle, what it will cost, and how you architect your AI workflows.

| Model | Provider | Context window | Approx. words | Note |
| --- | --- | --- | --- | --- |
| GPT-4o | OpenAI | 128K tokens | ~96K words | Default for ChatGPT Plus |
| GPT-4.1 | OpenAI | 1M tokens | ~750K words | Long-context variant |
| Claude Sonnet 4 | Anthropic | 200K tokens | ~150K words | Standard context |
| Claude Opus 4 | Anthropic | 200K tokens | ~150K words | Extended thinking available |
| Gemini 2.5 Pro | Google | 1M tokens | ~750K words | Up to 2M in preview |
| Gemini 2.0 Flash | Google | 1M tokens | ~750K words | Cost-optimized |
| Mistral Large | Mistral | 128K tokens | ~96K words | European provider |
| Llama 3.3 70B | Meta | 128K tokens | ~96K words | Open-weight model |

Context limits as of May 2026. Token counts vary by tokenizer; word approximations assume ~1.3 tokens per English word.

A critical nuance: the advertised context window is shared between input and output. If a model has a 128K-token window and you use 120K tokens of input, the model can only generate ~8K tokens of output. Enterprise workloads that need both large inputs (long documents, full codebases) and detailed outputs (comprehensive analyses, full implementations) must budget the window carefully.
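
The shared-window arithmetic can be sketched as a small helper. The function name and the optional `max_output_cap` argument are illustrative assumptions; the cap models the separate per-request output limit that some providers enforce on top of the context window.

```python
def remaining_output_tokens(context_window, input_tokens, max_output_cap=None):
    """Return how many tokens are left for the model's response."""
    if input_tokens > context_window:
        raise ValueError("input does not fit in the context window")
    remaining = context_window - input_tokens
    # Assumption: a provider-specific output cap, if any, is passed in by the caller.
    if max_output_cap is not None:
        remaining = min(remaining, max_output_cap)
    return remaining


# The example from the text: 120K tokens of input in a 128K window.
print(remaining_output_tokens(128_000, 120_000))  # 8000
```

Running this check before sending a request makes it clear early when a long document leaves too little room for the analysis you expect back.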

Tokens vs Words: How Context Is Measured

Models don't process words — they process tokens, subword units determined by the model's tokenizer. The relationship between tokens and words varies by language, content type, and tokenizer.

  • English prose: ~1.3 tokens per word. A 10,000-word document ≈ 13,000 tokens.
  • Source code: ~1.5–2.5 tokens per word (variable names, operators, and syntax consume extra tokens). A 1,000-line file can be 3,000–8,000 tokens depending on the language.
  • JSON/structured data: 2–3× more tokens than the equivalent natural language because of brackets, keys, and formatting.
  • Non-English languages: Languages like Chinese, Japanese, and Korean can use 2–3× more tokens per word equivalent, reducing effective context capacity.

For enterprise cost planning, always estimate in tokens, not words. Most providers charge per token, and a codebase that "looks" small in lines of code can consume a surprisingly large share of the context window.
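
As a rough planning aid, the ratios above can be turned into a back-of-the-envelope estimator. This is a sketch only: the dictionary keys and function name are made up for illustration, the ratios are the approximate figures quoted in the list, and a real estimate should come from the target model's own tokenizer.

```python
# Approximate tokens-per-word ranges (low, high) taken from the list above.
TOKENS_PER_WORD = {
    "english_prose": (1.3, 1.3),
    "source_code": (1.5, 2.5),
}


def estimate_tokens(word_count, content_type):
    """Return a (low, high) token estimate for the given word count."""
    low, high = TOKENS_PER_WORD[content_type]
    return round(word_count * low), round(word_count * high)


print(estimate_tokens(10_000, "english_prose"))  # (13000, 13000)
print(estimate_tokens(1_000, "source_code"))     # (1500, 2500)
```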

Enterprise Impact of Context Windows

Context window size has five direct impacts on enterprise AI deployments. Each one affects cost, performance, or risk — and most teams underestimate at least three of them.

| Area | The problem | Enterprise mitigation |
| --- | --- | --- |
| Cost | Larger context = more tokens = higher cost. A 200K-token prompt costs roughly 67× as much as a 3K-token prompt at the same per-token rate. | RAG to inject only relevant context; prompt caching for repeated system prompts; model routing to send simple queries to cheaper models |
| Latency | Time-to-first-token scales with input length. A 128K-token prompt can take 5-15 seconds before the model starts responding. | Pre-summarize long documents; cache processed context; use streaming to show partial responses while generation continues |
| Accuracy | Models struggle with the "lost in the middle" effect: information in the center of long contexts gets less attention than content at the start or end. | Place critical information at the start of the context; use RAG to surface only high-relevance chunks; test retrieval quality regularly |
| Security | Stuffing the context window with full documents risks exposing sensitive data (PII, credentials, IP) to the model provider. | PII redaction before context injection; data classification policies; self-hosted models for sensitive workloads; audit logging of what enters context |
| Governance | No visibility into what's being sent to models as context. Teams paste full databases, code repos, or customer records into prompts. | LLM gateway logging; context-size alerting; per-team policies on maximum context size and data classification |

The net effect: enterprises cannot simply "use the biggest context window available" and move on. Context management is an active engineering discipline, not a model selection checkbox.

Context Management Strategies

Five strategies let enterprises work effectively within and around context window limits. Most production systems combine two or three of these.

RAG (Retrieval-Augmented Generation)

Retrieve only the relevant chunks from a vector store and inject them into the prompt. Keeps context lean and focused. The most common enterprise pattern.

Best for: Large knowledge bases, documentation QA, customer support

Summarization / compression

Condense long documents or conversation history into shorter summaries before passing to the model. Trades fidelity for capacity.

Best for: Long conversations, meeting transcripts, multi-document synthesis
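
One way to apply this to conversation history is to keep the most recent turns verbatim and collapse everything older into a single summary. In the sketch below, `summarize` is a hypothetical callable supplied by the caller (in practice, a call to a cheaper model); the demo uses a trivial placeholder so the example runs on its own.

```python
def compress_history(turns, keep_recent, summarize):
    """Replace all but the last `keep_recent` turns with one summary turn."""
    if keep_recent < 0:
        raise ValueError("keep_recent must be non-negative")
    split = len(turns) - keep_recent
    if split <= 0:
        return list(turns)  # nothing old enough to compress
    older, recent = turns[:split], turns[split:]
    return [f"Summary of earlier turns: {summarize(older)}"] + recent


def placeholder_summary(turns):
    # Placeholder only: a real summarizer would condense the content itself.
    return f"{len(turns)} earlier turns omitted"


history = ["turn 1", "turn 2", "turn 3", "turn 4"]
print(compress_history(history, keep_recent=2, summarize=placeholder_summary))
```

The fidelity trade-off is visible here: whatever the summarizer drops from the older turns is no longer available to the model.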

Chunking & sliding windows

Break input into overlapping chunks, process each independently, then merge results. Works for extraction tasks where full coherence isn't required.

Best for: Document extraction, log analysis, code scanning
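
A minimal sketch of the overlapping split, assuming the input has already been tokenized into a list (the function name and parameters are illustrative):

```python
def sliding_chunks(tokens, size, overlap):
    """Split a token list into chunks of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last chunk already reaches the end of the input
    return chunks


print(sliding_chunks(list(range(10)), size=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

The overlap keeps an entity that straddles a chunk boundary visible in at least one chunk, at the cost of processing some tokens twice.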

Hierarchical context

Maintain multiple layers — project-level context (always loaded), feature-level (loaded per task), ephemeral (current conversation). Agents like Claude Code use this pattern.

Best for: Agentic coding, multi-session workflows, project-scoped AI
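
The layering can be sketched as an assembly step that always includes the durable layers and fills the remaining budget with the newest conversation turns. This is an illustrative sketch, not how any particular agent implements it; the function names are hypothetical, and the word count is a crude stand-in for a real tokenizer.

```python
def count_tokens(text):
    # Assumption: word count as a crude stand-in for a real tokenizer.
    return len(text.split())


def assemble_context(project, feature, conversation, budget):
    """Always load project and feature layers, then fit the newest turns."""
    layers = [project, feature]
    used = sum(count_tokens(layer) for layer in layers)
    if used > budget:
        raise ValueError("durable layers alone exceed the budget")
    kept = []
    for turn in reversed(conversation):  # newest turns first
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return layers + kept[::-1]  # restore chronological order


context = assemble_context(
    project="use tabs",
    feature="fix login bug",
    conversation=["a b c d e", "x y"],
    budget=8,
)
print(context)  # ['use tabs', 'fix login bug', 'x y']
```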

Prompt caching

Cache the processed representation of static context (system prompts, reference docs) so the model does not recompute it on every request. Reduces latency and cost.

Best for: Repeated queries against same context, high-volume agent loops
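
The bookkeeping behind this can be illustrated with a toy model. It does not reflect how any provider implements caching internally; it only tracks which tokens would be processed fresh and which would be served from cache when the static prefix repeats. The class name is hypothetical, and word counts stand in for token counts.

```python
import hashlib


class PrefixCache:
    """Toy model: count fresh vs. cached tokens for a repeated static prefix."""

    def __init__(self):
        self._seen = set()
        self.fresh_tokens = 0
        self.cached_tokens = 0

    def request(self, static_prefix, dynamic_suffix):
        key = hashlib.sha256(static_prefix.encode("utf-8")).hexdigest()
        prefix_tokens = len(static_prefix.split())  # word count as a stand-in
        if key in self._seen:
            self.cached_tokens += prefix_tokens   # prefix reused
        else:
            self._seen.add(key)
            self.fresh_tokens += prefix_tokens    # first time: full processing
        self.fresh_tokens += len(dynamic_suffix.split())  # suffix is always fresh


cache = PrefixCache()
cache.request("you are a helpful assistant", "hi there")
cache.request("you are a helpful assistant", "bye")
print(cache.fresh_tokens, cache.cached_tokens)  # 8 5
```

The saving grows with the ratio of static prefix to dynamic suffix, which is why long system prompts and reference documents benefit most.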

The choice between strategies depends on your use case. RAG is the default for knowledge-base queries; hierarchical context is emerging as the standard for agentic workflows where agents need project awareness across sessions; prompt caching is table-stakes for any high-volume deployment.

Context Windows and AI Agents

AI agents — coding agents, research agents, customer support agents — have a unique relationship with context windows. Unlike single-turn chatbot queries, agents run multi-step workflows where context accumulates across dozens or hundreds of tool calls within a single session.

  • Context accumulation: Every tool call result, file read, and command output adds to the running context. A coding agent investigating a bug might consume 50K tokens of context before it writes its first line of code.
  • Context compression: Advanced agents (Claude Code, Devin) automatically summarize earlier conversation turns when approaching the context limit, preserving key decisions while discarding verbatim outputs.
  • Durable context: Project-level context files (CLAUDE.md, architecture docs) are loaded into every agent session, consuming a fixed portion of the context window. This is worth the cost because it prevents the agent from re-discovering conventions on every task.
  • MCP (Model Context Protocol): A standardized way for agents to access external tools and data sources. Instead of stuffing everything into the context window, MCP lets agents fetch information on demand — only what they need, when they need it.
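
The accumulate-then-compress behavior described in the first two bullets can be sketched as follows. This is a simplified illustration under stated assumptions, not the mechanism used by Claude Code or Devin: the class, the 80% threshold, and the fixed 50-token stub are all hypothetical choices.

```python
class AgentContext:
    """Track accumulated context and stub out old entries near the limit."""

    def __init__(self, limit, compact_at=0.8):
        self.limit = limit
        self.compact_at = compact_at  # assumption: compact above 80% usage
        self.entries = []             # list of (label, token_count)

    @property
    def used(self):
        return sum(tokens for _, tokens in self.entries)

    def add(self, label, tokens, stub_tokens=50):
        self.entries.append((label, tokens))
        i = 0
        # Replace the oldest verbatim outputs with short summaries, never the newest.
        while self.used > self.limit * self.compact_at and i < len(self.entries) - 1:
            old_label, old_tokens = self.entries[i]
            if old_tokens > stub_tokens:
                self.entries[i] = (f"summary of {old_label}", stub_tokens)
            i += 1


ctx = AgentContext(limit=1000)
ctx.add("read a.py", 500)
ctx.add("run tests", 400)
print(ctx.entries)  # [('summary of read a.py', 50), ('run tests', 400)]
print(ctx.used)     # 450
```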

The agent context budget rule

Reserve 30-40% of the context window for the agent's accumulated work. If using a 200K-token model, plan for ~80K of system prompt and project context, ~80K for accumulated tool results, and ~40K for the agent's reasoning and output.
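
The split in the example can be written as a small helper. The function name, dictionary keys, and default percentages are illustrative assumptions that reproduce the 200K-token example above; integer percentages keep the arithmetic exact.

```python
def agent_budget(context_window, durable_pct=40, tool_pct=40):
    """Split a context window into durable context, tool results, and output."""
    if durable_pct + tool_pct > 100:
        raise ValueError("shares exceed 100% of the window")
    durable = context_window * durable_pct // 100
    tools = context_window * tool_pct // 100
    return {
        "durable_context": durable,
        "tool_results": tools,
        "reasoning_and_output": context_window - durable - tools,
    }


print(agent_budget(200_000))
# {'durable_context': 80000, 'tool_results': 80000, 'reasoning_and_output': 40000}
```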

The Cost of Context

Context window usage is the single largest driver of LLM costs in enterprise deployments. Understanding the math is essential for AI FinOps.

  • Linear cost scaling: Input tokens are priced per-token. Doubling your context size doubles your input cost. At $15/M input tokens (Claude Opus 4), a 200K-token prompt costs $3.00 per request.
  • Prompt caching discount: Cached input tokens cost 10-90% less than fresh tokens. For repetitive workloads (same system prompt, same reference docs), caching can cut context costs by 50-80%.
  • The hidden multiplier: Agents make many model calls per task. A coding agent that makes 30 calls with 50K tokens of context each consumes 1.5M input tokens per task — potentially $22+ at Opus pricing. Choosing the right model tier per call is critical.
  • Output vs input pricing: Output tokens typically cost 3-5× more than input tokens. Long, detailed responses are disproportionately expensive compared to large inputs.
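
The figures in the first and third bullets can be reproduced with simple arithmetic. The function names are illustrative, and the $15 per million input tokens is the Opus rate quoted above; the sketch ignores output tokens and caching discounts.

```python
def input_cost_usd(tokens, usd_per_million):
    """Cost of sending `tokens` input tokens at a per-million-token rate."""
    return tokens * usd_per_million / 1_000_000


def task_cost_usd(calls, tokens_per_call, usd_per_million):
    """Input cost of an agent task that makes several model calls."""
    return input_cost_usd(calls * tokens_per_call, usd_per_million)


print(input_cost_usd(200_000, 15))    # 3.0
print(task_cost_usd(30, 50_000, 15))  # 22.5
```

Tracking the second number per task, rather than the per-token rate alone, is what exposes the agent multiplier.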

The enterprise response is not to minimize context — it is to manage it. Route simple tasks to small-context, cheap models. Use prompt caching for repeated context. Apply RAG so only relevant information enters the window. Track cost per task, not just cost per token.

How Axiom Manages Context

Context management is not a standalone problem — it is embedded in every layer of the enterprise AI stack, from the gateway that routes requests to the agents that consume context.

Optimize context across every model call

Axiom's Unified AI Gateway provides prompt caching, model routing based on context size, token-level cost attribution, and context-size alerting — so you get the right context to the right model at the right cost. Combined with MCP Gateway for on-demand tool access, enterprises can manage context as infrastructure, not as a per-developer afterthought.
