What is the largest context window available?

As of May 2026, Google's Gemini 2.5 Pro and OpenAI's GPT-4.1 share the largest production context window at 1M tokens, and Gemini 2.5 Pro offers 2M in preview. Claude Opus 4 and Claude Sonnet 4 support 200K tokens. Most open-weight models like Llama 3.3 support 128K tokens.

How do context windows affect enterprise AI costs?

Context window usage is the largest driver of LLM costs. Input tokens are priced per-token, so doubling context size doubles input cost. A 200K-token prompt with Claude Opus 4 costs about $3.00 per request. AI agents compound this by making 20-50 model calls per task. Enterprises manage costs through RAG (retrieve only relevant context), prompt caching (50-80% savings on repeated context), and model routing (send simple queries to cheaper models).

What is the best strategy for managing context window limits?

The most common strategies are: RAG (retrieve relevant chunks from a vector store instead of sending full documents), summarization (condense long inputs before passing to the model), hierarchical context (maintain project/feature/ephemeral context layers), and prompt caching (cache static system prompts to reduce cost and latency). Most enterprise systems combine two or three of these strategies.

AI Context Windows Explained

Q: What is a context window in AI?

A context window (also called context length or context limit) is the maximum number of tokens a large language model can process in a single request — the combined size of your input prompt and the model's output response. It functions as the model's working memory for that interaction. One token is roughly ¾ of an English word.

What context windows are, how they differ across models, and the strategies enterprises use to manage context for cost, accuracy, and security.

Axiom Studio Team · Engineering

What Is a Context Window

A context window (also called a context length or context limit) is the maximum amount of text a large language model can process in a single request — the combined size of your input prompt and the model's output response, measured in tokens. Think of it as the model's working memory: everything it can "see" and reason about at once.

One token is roughly ¾ of an English word. A 128K-token context window can hold approximately 96,000 words — about the length of a full novel. A 1M-token window can hold roughly 750,000 words — enough for an entire codebase or a shelf of documents.

A one-line definition

A context window is the total number of tokens (input + output) an LLM can process in a single request — its working memory for that interaction.

Context windows have grown dramatically: GPT-3 launched with a 2K-token window (2,048 tokens) in 2020; by 2026, production models routinely support 128K–1M tokens, with Google's Gemini offering up to 2M in preview. But bigger is not always better: cost, latency, and accuracy all change with context size.

Context Window Limits by Model

Context limits vary widely across providers and models. Enterprise teams need to understand these limits because they directly affect which tasks each model can handle, what it will cost, and how you architect your AI workflows.

| Model | Provider | Context window | Approx. words | Note |
| --- | --- | --- | --- | --- |
| GPT-4o | OpenAI | 128K tokens | ~96K words | Default for ChatGPT Plus |
| GPT-4.1 | OpenAI | 1M tokens | ~750K words | Long-context variant |
| Claude Sonnet 4 | Anthropic | 200K tokens | ~150K words | Standard context |
| Claude Opus 4 | Anthropic | 200K tokens | ~150K words | Extended thinking available |
| Gemini 2.5 Pro | Google | 1M tokens | ~750K words | Up to 2M in preview |
| Gemini 2.0 Flash | Google | 1M tokens | ~750K words | Cost-optimized |
| Mistral Large | Mistral | 128K tokens | ~96K words | European provider |
| Llama 3.3 70B | Meta | 128K tokens | ~96K words | Open-weight |
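
As a rough illustration of how these limits feed into architecture decisions, the sketch below checks whether a request fits a model's window before it is sent. The `CONTEXT_LIMITS` dictionary and both helper functions are hypothetical names, not a provider API. The limits are copied from this page, with "128K" and "1M" treated as round decimal numbers, which is an approximation.

```python
# Context limits in tokens, copied from this page.
# "128K" and "1M" are treated as round decimal values (an assumption).
CONTEXT_LIMITS = {
    "GPT-4o": 128_000,
    "GPT-4.1": 1_000_000,
    "Claude Sonnet 4": 200_000,
    "Claude Opus 4": 200_000,
    "Gemini 2.5 Pro": 1_000_000,
    "Gemini 2.0 Flash": 1_000_000,
    "Mistral Large": 128_000,
    "Llama 3.3 70B": 128_000,  # from the FAQ's 128K figure for Llama 3.3
}


def fits_in_window(model: str, input_tokens: int, max_output_tokens: int) -> bool:
    """Return True if the input plus the reserved output fits the model's window."""
    return input_tokens + max_output_tokens <= CONTEXT_LIMITS[model]


def models_that_fit(input_tokens: int, max_output_tokens: int) -> list[str]:
    """List every model whose window can hold the whole request."""
    return [
        model
        for model in CONTEXT_LIMITS
        if fits_in_window(model, input_tokens, max_output_tokens)
    ]
```

Because the window covers input and output together, the check reserves room for the response instead of comparing the prompt alone against the limit.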

Tokens vs Words: How Context Is Measured

Models don't process words — they process tokens, subword units determined by the model's tokenizer. The relationship between tokens and words varies by language, content type, and tokenizer.

English prose: ~1.3 tokens per word. A 10,000-word document ≈ 13,000 tokens.
Source code: ~1.5–2.5 tokens per word (variable names, operators, and syntax consume extra tokens). A 1,000-line file can be 3,000–8,000 tokens depending on the language.
JSON/structured data: 2–3× more tokens than the equivalent natural language because of brackets, keys, and formatting.
Non-English languages: Languages like Chinese, Japanese, and Korean can use 2–3× more tokens per word equivalent, reducing effective context capacity.

For enterprise cost planning, always estimate in tokens, not words. Most providers charge per token, and a codebase that "looks" small in lines of code can consume a surprisingly large share of the context window.
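
A minimal sketch of the estimate described above, assuming the multipliers quoted in this section. The names `TOKENS_PER_WORD`, `estimate_tokens`, and `share_of_window` are illustrative, and the source-code multiplier is simply the midpoint of the 1.5–2.5 range. For billing-grade numbers, a real system would count tokens with the target model's own tokenizer.

```python
# Tokens-per-word multipliers taken from the figures in this section.
# The source-code value is the midpoint of the quoted 1.5-2.5 range (an assumption).
TOKENS_PER_WORD = {
    "english_prose": 1.3,
    "source_code": 2.0,
}


def estimate_tokens(word_count: int, content_type: str = "english_prose") -> int:
    """Estimate the token count for a given number of words."""
    return round(word_count * TOKENS_PER_WORD[content_type])


def share_of_window(tokens: int, window_tokens: int) -> float:
    """Return the fraction of a context window that the tokens would occupy."""
    return tokens / window_tokens
```

With these assumptions, a 10,000-word English document comes out at about 13,000 tokens, or roughly 10% of a 128K-token window.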

Enterprise Impact of Context Windows

Context window size has five direct impacts on enterprise AI deployments. Each one affects cost, performance, or risk — and most teams underestimate at least three of them.

| Area | The problem | Enterprise mitigation |
| --- | --- | --- |
| Cost | Larger context = more tokens = higher cost. A 200K-token prompt costs roughly 67× more than a 3K-token prompt at the same per-token rate. | RAG to inject only relevant context; prompt caching for repeated system prompts; model routing to send simple queries to cheaper models |
| Latency | Time-to-first-token scales with input length. A 128K-token prompt can take 5-15 seconds before the model starts responding. | Pre-summarize long documents; cache processed context; use streaming to show partial responses while generation continues |
| Accuracy | Models struggle with 'lost in the middle': information in the center of long contexts gets less attention than content at the start or end. | Place critical information at the start of the context; use RAG to surface only high-relevance chunks; test retrieval quality regularly |
| Security | Stuffing the context window with full documents risks exposing sensitive data (PII, credentials, IP) to the model provider. | PII redaction before context injection; data classification policies; self-hosted models for sensitive workloads; audit logging of what enters context |
| Governance | No visibility into what's being sent to models as context. Teams paste full databases, code repos, or customer records into prompts. | LLM gateway logging; context-size alerting; per-team policies on maximum context size and data classification |

The net effect: enterprises cannot simply "use the biggest context window available" and move on. Context management is an active engineering discipline, not a model selection checkbox.
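
The governance mitigations (context-size alerting and per-team maximums) can be sketched as a small policy check. This is a hypothetical illustration: `ContextPolicy`, `check_request`, and the 80% alert threshold are assumptions for the example, not the behavior of any particular gateway.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextPolicy:
    """Hypothetical per-team policy on how much context a request may carry."""

    max_input_tokens: int          # hard limit: requests above this are rejected
    alert_fraction: float = 0.8    # warn once usage reaches this share of the limit


def check_request(policy: ContextPolicy, input_tokens: int) -> str:
    """Classify a request as 'allow', 'alert', or 'reject' under the policy."""
    if input_tokens > policy.max_input_tokens:
        return "reject"
    if input_tokens >= policy.alert_fraction * policy.max_input_tokens:
        return "alert"
    return "allow"
```

A gateway could log every decision, which gives the visibility into context usage that the governance row describes.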

Context Management Strategies

Five strategies let enterprises work effectively within and around context window limits. Most production systems combine two or three of these.

RAG (Retrieval-Augmented Generation)

Retrieve only the relevant chunks from a vector store and inject them into the prompt. Keeps context lean and focused. The most common enterprise pattern.

Best for: Large knowledge bases, documentation QA, customer support
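
A toy sketch of the retrieval step, under loud assumptions: word-count vectors and cosine similarity stand in for a real embedding model and vector store, and `retrieve` and `build_prompt` are hypothetical helper names. It shows only the shape of the pattern, in which the chunks are ranked against the query and only the top few are injected.

```python
import math
import re
from collections import Counter


def _vectorize(text: str) -> Counter:
    """Bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    query_vec = _vectorize(query)
    ranked = sorted(chunks, key=lambda c: _cosine(query_vec, _vectorize(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, chunks: list[str], top_k: int = 2) -> str:
    """Inject only the retrieved chunks into the prompt, not the whole corpus."""
    context = "\n".join(f"- {chunk}" for chunk in retrieve(query, chunks, top_k))
    return f"Context:\n{context}\n\nQuestion: {query}"
```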

Summarization / compression

Condense long documents or conversation history into shorter summaries before passing to the model. Trades fidelity for capacity.

Best for: Long conversations, meeting transcripts, multi-document synthesis
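
One way this might look for conversation history is sketched below. The `summarize` argument is a placeholder for whatever model call produces the summary, `count_tokens` is a rough word-based estimate, not a real tokenizer, and `compress_history` is a hypothetical name.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    """Rough estimate at ~1.3 tokens per word; a real system would use the model's tokenizer."""
    return round(len(text.split()) * 1.3)


def compress_history(
    turns: list[str],
    budget_tokens: int,
    summarize: Callable[[str], str],
    keep_recent: int = 2,
) -> list[str]:
    """Replace older turns with one summary when the history exceeds the budget.

    The most recent `keep_recent` turns are kept verbatim.
    """
    total = sum(count_tokens(turn) for turn in turns)
    if total <= budget_tokens or len(turns) <= keep_recent:
        return list(turns)
    split = len(turns) - keep_recent
    older, recent = turns[:split], turns[split:]
    summary = summarize("\n".join(older))
    return [f"[Summary of earlier turns] {summary}"] + recent
```

Keeping the latest turns verbatim limits the fidelity loss to older material, which is the trade-off this strategy accepts.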

Chunking & sliding windows

Break input into overlapping chunks, process each independently, then merge results. Works for extraction tasks where full coherence isn't required.

Best for: Document extraction, log analysis, code scanning
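
A minimal sketch of overlapping chunks and a simple merge step, assuming the input is already split into tokens. `sliding_chunks` and `merge_findings` are illustrative names, and the merge shown here (order-preserving de-duplication) is only one plausible way to combine per-chunk extraction results.

```python
def sliding_chunks(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split tokens into chunks of chunk_size that share `overlap` tokens with their neighbor."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the input
    return chunks


def merge_findings(per_chunk: list[list[str]]) -> list[str]:
    """Merge per-chunk results, dropping duplicates found in overlapping regions."""
    seen = set()
    merged = []
    for findings in per_chunk:
        for item in findings:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged
```

The overlap exists so that an item straddling a chunk boundary appears whole in at least one chunk; de-duplication then removes the repeats it causes.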

Hierarchical context

Maintain multiple layers — project-level context (always loaded), feature-level (loaded per task), ephemeral (current conversation). Agents like Claude Code use this pattern.

Best for: Agentic coding, multi-session workflows, project-scoped AI
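
The three layers can be sketched as a function that assembles the prompt for one task. This is an illustrative assumption about how such layering could be wired, not a description of how any specific agent implements it; `build_context` and the section labels are hypothetical.

```python
def build_context(
    project_context: str,
    feature_contexts: dict[str, str],
    active_feature: str,
    conversation: list[str],
) -> str:
    """Assemble layered context: project (always), one feature (per task), conversation (ephemeral)."""
    sections = [
        "[Project]\n" + project_context,
        f"[Feature: {active_feature}]\n" + feature_contexts[active_feature],
        "[Conversation]\n" + "\n".join(conversation),
    ]
    return "\n\n".join(sections)
```

Only the active feature's notes are loaded, so context for unrelated features never consumes window space.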

Prompt caching

Cache the processed representation of static context (system prompts, reference docs) so the model does not reprocess it on every request. Reduces latency and cost.

Best for: Repeated queries against same context, high-volume agent loops
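
The idea can be illustrated with a toy local model of a prefix cache. Real prompt caching happens on the provider side and its mechanics differ by vendor; the `PrefixCache` class below is a hypothetical stand-in that counts whitespace-separated words as a proxy for processing work.

```python
import hashlib


class PrefixCache:
    """Toy stand-in for provider-side prompt caching, keyed by the static prefix."""

    def __init__(self) -> None:
        self._seen: set[str] = set()
        self.hits = 0
        self.misses = 0

    def process(self, static_prefix: str, dynamic_suffix: str) -> int:
        """Return how many words had to be freshly processed for this request."""
        key = hashlib.sha256(static_prefix.encode("utf-8")).hexdigest()
        suffix_work = len(dynamic_suffix.split())
        if key in self._seen:
            self.hits += 1
            return suffix_work  # the prefix was already processed; only the suffix is new
        self.misses += 1
        self._seen.add(key)
        return len(static_prefix.split()) + suffix_work
```

The first request pays for the full prompt; later requests with the same static prefix pay only for the part that changed, which is why the pattern suits repeated queries against the same context.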

The choice between strategies depends on your use case. RAG is the default for knowledge-base queries; hierarchical context is emerging as the standard for agentic workflows where agents need project awareness across sessions; prompt caching is table-stakes for any high-volume deployment.

Context Windows and AI Agents

AI agents — coding agents, research agents, customer support agents — have a unique relationship with context windows. Unlike single-turn chatbot queries, agents run multi-step workflows where context accumulates across dozens or hundreds of tool calls within a single session.

Context accumulation: Every tool call result, file read, and command output adds to the running context. A coding agent investigating a bug might consume 50K tokens of context before it writes its first line of code.
Context compression: Advanced agents (Claude Code, Devin) automatically summarize earlier conversation turns when approaching the context limit, preserving key decisions while discarding verbatim outputs.
Durable context: Project-level context files (CLAUDE.md, architecture docs) are loaded into every agent session, consuming a fixed portion of the context window. This is worth the cost because it prevents the agent from re-discovering conventions on every task.
MCP (Model Context Protocol): A standardized way for agents to access external tools and data sources. Instead of stuffing everything into the context window, MCP lets agents fetch information on demand — only what they need, when they need it.

The agent context budget rule

Reserve 30-40% of the context window for the agent's accumulated work. If using a 200K-token model, plan for ~80K of system prompt and project context, ~80K for accumulated tool results, and ~40K for the agent's reasoning and output.
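
The split in this rule is simple arithmetic, sketched below. The function name and the 40/40/20 default percentages are assumptions chosen to reproduce the 200K example above; they are not a fixed standard.

```python
def agent_budget(
    window_tokens: int,
    static_pct: int = 40,
    accumulated_pct: int = 40,
    output_pct: int = 20,
) -> dict[str, int]:
    """Split a context window into three token budgets using integer percentages."""
    if static_pct + accumulated_pct + output_pct != 100:
        raise ValueError("percentages must sum to 100")
    return {
        "system_and_project": window_tokens * static_pct // 100,
        "accumulated_tool_results": window_tokens * accumulated_pct // 100,
        "reasoning_and_output": window_tokens * output_pct // 100,
    }
```

For a 200K-token window, the defaults give 80K, 80K, and 40K tokens, matching the allocation described in the rule.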

The Cost of Context

Context window usage is the single largest driver of LLM costs in enterprise deployments. Understanding the math is essential for AI FinOps.

Linear cost scaling: Input tokens are priced per-token. Doubling your context size doubles your input cost. At $15/M input tokens (Claude Opus 4), a 200K-token prompt costs $3.00 per request.
Prompt caching discount: Cached input tokens cost 10-90% less than fresh tokens. For repetitive workloads (same system prompt, same reference docs), caching can cut context costs by 50-80%.
The hidden multiplier: Agents make many model calls per task. A coding agent that makes 30 calls with 50K tokens of context each consumes 1.5M input tokens per task — potentially $22+ at Opus pricing. Choosing the right model tier per call is critical.
Output vs input pricing: Output tokens typically cost 3-5× more than input tokens. Long, detailed responses are disproportionately expensive compared to large inputs.
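
The arithmetic behind these figures can be written out as a short sketch. The function names are illustrative, the $15 per million input tokens is the price quoted above, and the cache discount is a parameter because the actual discount varies by provider.

```python
def input_cost_usd(input_tokens: int, price_per_million: float) -> float:
    """Input cost scales linearly with the number of tokens."""
    return input_tokens / 1_000_000 * price_per_million


def agent_task_cost_usd(calls: int, tokens_per_call: int, price_per_million: float) -> float:
    """Total input cost of a task in which every call resends its context."""
    return input_cost_usd(calls * tokens_per_call, price_per_million)


def cached_input_cost_usd(
    cached_tokens: int,
    fresh_tokens: int,
    price_per_million: float,
    cache_discount: float,
) -> float:
    """Input cost when part of the prompt is billed at a discounted cached rate.

    cache_discount is the fraction taken off the normal price (0.5 means 50% off).
    """
    cached = input_cost_usd(cached_tokens, price_per_million) * (1 - cache_discount)
    return cached + input_cost_usd(fresh_tokens, price_per_million)
```

At $15 per million tokens, `input_cost_usd(200_000, 15.0)` comes to about $3.00 and `agent_task_cost_usd(30, 50_000, 15.0)` to $22.50, matching the figures above.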

The enterprise response is not to minimize context — it is to manage it. Route simple tasks to small-context, cheap models. Use prompt caching for repeated context. Apply RAG so only relevant information enters the window. Track cost per task, not just cost per token.
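
Model routing can be sketched as picking the cheapest configured model that can handle the request. Everything in this example is hypothetical: the `ModelTier` fields, the `route` function, and any tier names or prices passed in are placeholders, not real provider pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    """Hypothetical description of one model available for routing."""

    name: str
    max_context_tokens: int
    input_price_per_million: float
    handles_complex_tasks: bool


def route(tiers: list[ModelTier], input_tokens: int, complex_task: bool) -> ModelTier:
    """Pick the cheapest tier whose window fits the input and whose capability matches the task."""
    candidates = [
        tier
        for tier in tiers
        if tier.max_context_tokens >= input_tokens
        and (tier.handles_complex_tasks or not complex_task)
    ]
    if not candidates:
        raise ValueError("no configured model can handle this request")
    return min(candidates, key=lambda tier: tier.input_price_per_million)
```

Simple, short requests fall through to the cheap tier, while the premium tier is used only when the input size or task complexity requires it.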

How Axiom Manages Context

Context management is not a standalone problem — it is embedded in every layer of the enterprise AI stack, from the gateway that routes requests to the agents that consume context.

Optimize context across every model call

Axiom's Unified AI Gateway provides prompt caching, model routing based on context size, token-level cost attribution, and context-size alerting — so you get the right context to the right model at the right cost. Combined with MCP Gateway for on-demand tool access, enterprises can manage context as infrastructure, not as a per-developer afterthought.
