Reducing Token Utilization in MCP Tools

Token consumption in MCP-based agents comes from three sources: the tool schemas injected into the system prompt at the start of each request, the arguments sent with each tool call, and the results returned by the server and injected back into context. In a simple three-tool agent, this overhead is negligible. In a production system with 50 tools, 20-call task sequences, and multi-megabyte file reads, token costs dominate both latency and pricing.

This article breaks down where tokens go in an MCP system and the techniques that reduce consumption most effectively.

[IMAGE: Token budget breakdown diagram — tool schemas vs arguments vs results for a typical agent session]

Where Tokens Go in an MCP Agent

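Before optimizing, measure. A typical 20-call coding agent session might consume tokens like this (the schema row counts a single injection; because schemas are resent on every request, their cumulative cost is a multiple of that figure):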

Source	Approximate tokens	Notes
Tool schemas (all 50 tools, every request)	8,000–15,000	Injected fresh on every API call
Tool arguments (20 calls)	200–500	Usually small
Tool results (20 calls)	5,000–50,000	Highly variable — file reads, test output
Conversation history	3,000–20,000	Grows with session length
Total	16,000–85,000	Per session

Tool schemas and tool results are the dominant costs. Arguments are almost always negligible. Conversation history grows unavoidably, but tool schema injection and result size are fully under the developer’s control.
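
To make the per-request versus per-session distinction concrete, here is a minimal estimator sketch. The names `SessionProfile` and `estimateSessionTokens` are illustrative, and the model is deliberately simple: it assumes schemas are resent in full on every request, and it ignores conversation history and any provider-side prompt caching.

```typescript
// Rough session-level token estimate. All names here are illustrative.
interface SessionProfile {
  schemaTokensPerRequest: number; // tool schemas, resent on every request
  requests: number;               // model API calls in the session
  argumentTokens: number;         // total across all tool calls
  resultTokens: number;           // total across all tool calls
}

interface SessionEstimate {
  schema: number;
  args: number;
  results: number;
  total: number;
}

function estimateSessionTokens(p: SessionProfile): SessionEstimate {
  // Schemas are the only component multiplied by the request count.
  const schema = p.schemaTokensPerRequest * p.requests;
  const total = schema + p.argumentTokens + p.resultTokens;
  return { schema, args: p.argumentTokens, results: p.resultTokens, total };
}

// Example: the low-end schema size from the table, over 20 requests.
const exampleEstimate = estimateSessionTokens({
  schemaTokensPerRequest: 8_000,
  requests: 20,
  argumentTokens: 500,
  resultTokens: 50_000,
});
```

Under these assumptions the schema component alone reaches 160,000 tokens, which is why schema size and tool count are worth measuring per request rather than per session.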

Technique 1: Tool Schema Compression

Every tool’s name, description, and inputSchema is injected into the LLM request before the model generates any response. With 50 tools, this can easily consume 10,000+ tokens — on every API call in the session.

Write Concise Descriptions Without Sacrificing Precision

The most impactful optimization is writing shorter descriptions that remain precise. Verbose descriptions are often redundant.

// Before: 87 tokens in description alone
{
  name: "read_file",
  description: "Read the complete contents of a file from the filesystem. This tool allows you to examine source code files, configuration files, documentation, or any other text-based content. You should use this tool when you need to understand what a specific file contains before making modifications. The tool returns the raw file contents as a string. For reading multiple files at once, consider using read_multiple_files instead.",
}

// After: 31 tokens — same guidance, less padding
{
  name: "read_file",
  description: "Read a file's full contents. Use when you know the exact path. For multiple files at once, use read_multiple_files.",
}

Reduction: 56 tokens per tool. If all 50 tools are trimmed by a similar amount, a session with 10 API calls saves 56 × 50 × 10 = 28,000 tokens, roughly $0.08–0.28 depending on the model.

What to cut: Restatements of the obvious (“this tool allows you to”), filler phrases (“you should use this when”), verbose parameter walkthroughs that belong in parameter descriptions.

What to keep: Disambiguation from similar tools, constraints (size limits, path restrictions), the primary use case.
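
One way to keep descriptions short over time is a small audit that runs in CI. The sketch below is an assumption-laden illustration, not a real tokenizer: `estimateTokens` uses a rough four-characters-per-token ratio, and `FILLER_PHRASES`, `auditDescriptions`, and the 40-token budget are hypothetical choices.

```typescript
// Flags tool descriptions that exceed a token budget or contain filler.
// The 4-characters-per-token ratio is a rough heuristic, not a tokenizer.
interface ToolSummary {
  name: string;
  description: string;
}

const FILLER_PHRASES = ["this tool allows you to", "you should use this"];

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function auditDescriptions(tools: ToolSummary[], maxTokens = 40): string[] {
  const warnings: string[] = [];
  for (const tool of tools) {
    const tokens = estimateTokens(tool.description);
    if (tokens > maxTokens) {
      warnings.push(`${tool.name}: ~${tokens} tokens (budget ${maxTokens})`);
    }
    const lower = tool.description.toLowerCase();
    for (const phrase of FILLER_PHRASES) {
      if (lower.indexOf(phrase) !== -1) {
        warnings.push(`${tool.name}: filler phrase "${phrase}"`);
      }
    }
  }
  return warnings;
}
```

For exact counts, swap the heuristic for the tokenizer that matches your model; the audit structure stays the same.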

Minimize inputSchema Verbosity

JSON Schema can be written verbosely or concisely. The schema itself contributes to token count:

// Before: 68 tokens
{
  "type": "object",
  "properties": {
    "path": {
      "type": "string",
      "description": "The absolute filesystem path to the file that you want to read"
    }
  },
  "required": ["path"],
  "additionalProperties": false
}

// After: 32 tokens
{
  "type": "object",
  "properties": {
    "path": { "type": "string", "description": "Absolute file path" }
  },
  "required": ["path"]
}

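additionalProperties: false adds tokens without adding much functional value for most agents: models rarely send properties that are not in the schema, and the server can reject unknown keys during its own validation. Omit it unless your validation layer or a strict structured-output mode requires it.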

[IMAGE: Token count comparison between verbose and concise tool schema definitions]

Omit Rarely Used Optional Parameters

If a parameter is optional and rarely used, consider whether it needs to be in the schema at all. Every property in inputSchema adds tokens and cognitive load for the model.

// Before: exposes 5 parameters, most never used
{
  name: "search_files",
  inputSchema: {
    properties: {
      pattern: { type: "string" },
      path: { type: "string" },
      case_sensitive: { type: "boolean" },
      include_binary: { type: "boolean" },
      max_results: { type: "number" }
    }
  }
}

// After: 2 parameters handle 95% of use cases
{
  name: "search_files",
  description: "Regex search across files. Returns up to 100 matches.",
  inputSchema: {
    properties: {
      pattern: { type: "string", description: "Regex pattern" },
      path: { type: "string", description: "Directory to search (default: repo root)" }
    },
    required: ["pattern"]
  }
}
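
Removing a parameter from the schema does not mean removing the behavior: the server can apply fixed defaults internally. A minimal sketch, assuming hypothetical names (`SearchOptions`, `SEARCH_DEFAULTS`, `resolveSearchOptions`) and default values chosen only for illustration:

```typescript
// Options the server still honors internally but no longer exposes to the model.
interface SearchOptions {
  pattern: string;
  path: string;
  caseSensitive: boolean;
  includeBinary: boolean;
  maxResults: number;
}

// Illustrative defaults; pick values that fit your own server.
const SEARCH_DEFAULTS = {
  path: ".",            // assumed here to mean the repo root
  caseSensitive: true,
  includeBinary: false,
  maxResults: 100,      // matches "Returns up to 100 matches" in the description
};

function resolveSearchOptions(args: { pattern: string; path?: string }): SearchOptions {
  return {
    pattern: args.pattern,
    path: args.path ?? SEARCH_DEFAULTS.path,
    caseSensitive: SEARCH_DEFAULTS.caseSensitive,
    includeBinary: SEARCH_DEFAULTS.includeBinary,
    maxResults: SEARCH_DEFAULTS.maxResults,
  };
}
```

Stating the fixed limit in the tool description, as the trimmed schema does, tells the model about the behavior without spending a schema property on it.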

Technique 2: Dynamic Tool Filtering

The biggest schema optimization: don’t inject all tools on every request. Filter the active tool set based on task context.

Static Scoping by Agent Role

Different agents need different tools. A code-review agent doesn’t need database write tools. A documentation agent doesn’t need shell execution. Assign tool subsets by role at session initialization:

// `Tool` and `allTools` come from your MCP server's tool registry.
const TOOL_SETS: Record<string, string[]> = {
  "code-reviewer": ["read_file", "read_multiple_files", "search_files", "git_diff", "git_log"],
  "implementer": ["read_file", "edit_file", "write_file", "execute_command", "git_add", "git_commit"],
  "analyst": ["read_file", "search_files", "query_database", "directory_tree"],
};

function getToolsForRole(role: string): Tool[] {
  // Unknown roles get no tools rather than the full set.
  const toolNames = TOOL_SETS[role] ?? [];
  return allTools.filter(t => toolNames.includes(t.name));
}

A code-reviewer injecting 5 tools instead of 50 uses roughly 90% fewer schema tokens per request, assuming the tools have similar schema sizes.

Semantic Tool Selection

For agents with large, variable tool sets, use semantic search to select relevant tools based on the current user message:

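# Sketch: embed the user message and each tool description, keep the top-k tools.
# `embed` is any text-to-vector function you supply; each tool is any object
# with `name` and `description` attributes.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_tools(user_message: str, all_tools: list, embed, k: int = 10) -> list:
    message_embedding = embed(user_message)
    # Tool embeddings only change when descriptions do; cache them in production.
    scores = {t.name: cosine_similarity(message_embedding, embed(t.description))
              for t in all_tools}
    top_k = set(sorted(scores, key=scores.get, reverse=True)[:k])
    return [t for t in all_tools if t.name in top_k]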

This approach is documented in research on tool retrieval for LLM agents (Patil et al., 2023) and is used in frameworks such as LangChain. It’s particularly effective when tool count exceeds ~30, where injecting all tools degrades both token efficiency and model decision quality.
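
The same idea in TypeScript, with tool embeddings computed once at startup instead of on every request. This is a sketch under loud assumptions: `toyEmbed` is a bag-of-words stand-in so the example runs offline, not a real embedding model, and `ToolInfo`, `buildToolIndex`, and `selectTools` are hypothetical names.

```typescript
// Toy stand-in for a real embedding model: a bag-of-words term-count vector.
type Vector = Record<string, number>;

function toyEmbed(text: string): Vector {
  const vec: Vector = Object.create(null); // no inherited keys such as "constructor"
  for (const word of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    vec[word] = (vec[word] ?? 0) + 1;
  }
  return vec;
}

function norm(v: Vector): number {
  let sum = 0;
  for (const term of Object.keys(v)) sum += v[term] * v[term];
  return Math.sqrt(sum);
}

function cosine(a: Vector, b: Vector): number {
  let dot = 0;
  for (const term of Object.keys(a)) dot += a[term] * (b[term] ?? 0);
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}

interface ToolInfo { name: string; description: string }
interface IndexedTool { tool: ToolInfo; vector: Vector }

// Embed every tool description once, at startup, rather than per request.
function buildToolIndex(tools: ToolInfo[]): IndexedTool[] {
  return tools.map((tool) => ({ tool, vector: toyEmbed(tool.description) }));
}

// Rank tools by similarity to the message and keep the k best.
function selectTools(message: string, index: IndexedTool[], k = 10): ToolInfo[] {
  const query = toyEmbed(message);
  return index
    .map(({ tool, vector }) => ({ tool, score: cosine(query, vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map(({ tool }) => tool);
}
```

In practice you would replace `toyEmbed` with calls to your embedding provider and may want to always include a small set of core tools regardless of score, since a retrieval miss leaves the model unable to call the tool at all.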

[IMAGE: Semantic tool selection flow — user message → embedding → top-k tool retrieval → filtered schema injection]

Technique 3: Result Truncation and Pagination

Tool results are often the largest single source of token consumption. A full read_file on a 1,500-line source file injects ~15,000 tokens into context. An npm test run on a large suite might produce 50,000 characters of output.

Truncate at the Server

The MCP server should apply sensible defaults:

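import { promises as fs } from "node:fs";
import { CallToolRequestSchema } from "@modelcontextprotocol/sdk/types.js";

const MAX_FILE_CHARS = 20_000; // ~5,000 tokens at roughly 4 characters per token

// `server` is your existing MCP Server instance.
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === "read_file") {
    // Validate the path against your allowed root before reading.
    const filePath = String(request.params.arguments?.path);
    const contents = await fs.readFile(filePath, "utf-8");

    if (contents.length > MAX_FILE_CHARS) {
      return {
        content: [{
          type: "text",
          text: contents.slice(0, MAX_FILE_CHARS) +
            `\n\n[TRUNCATED: file is ${contents.length} chars. ` +
            `Showing first ${MAX_FILE_CHARS}. ` +
            `Use read_file_range to read specific line ranges.]`
        }]
      };
    }
    return { content: [{ type: "text", text: contents }] };
  }
  throw new Error(`Unknown tool: ${request.params.name}`);
});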

The truncation message is critical — it tells the agent the file was cut and offers a recovery path (read_file_range). Without it, the agent may assume it saw the whole file.
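
The handler above reports sizes in characters, while read_file_range takes line numbers. A possible refinement, sketched here with the hypothetical helper `truncateAtLineBoundary`, is to cut on a line boundary and report line numbers so the agent can request the next range directly.

```typescript
interface TruncatedText {
  text: string;
  truncated: boolean;
  lastLineShown: number; // 1-indexed
  totalLines: number;
}

function truncateAtLineBoundary(contents: string, maxChars: number): TruncatedText {
  const lines = contents.split("\n");
  if (contents.length <= maxChars) {
    return { text: contents, truncated: false, lastLineShown: lines.length, totalLines: lines.length };
  }
  const kept: string[] = [];
  let used = 0;
  for (const line of lines) {
    const cost = line.length + 1; // +1 for the newline
    if (used + cost > maxChars) break;
    kept.push(line);
    used += cost;
  }
  if (kept.length === 0) {
    // The first line alone exceeds the budget: cut it mid-line so the result is never empty.
    kept.push(lines[0].slice(0, maxChars));
  }
  const notice =
    `[TRUNCATED: showing lines 1-${kept.length} of ${lines.length}. ` +
    `Use read_file_range to read further lines.]`;
  return {
    text: kept.join("\n") + "\n\n" + notice,
    truncated: true,
    lastLineShown: kept.length,
    totalLines: lines.length,
  };
}
```

One limitation of this sketch: when a single line is longer than the budget (minified code, for example), the line is cut mid-line and the notice still counts it as shown.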

Offer Range-Read Tools

Pair truncating tools with range variants:

{
  name: "read_file_range",
  description: "Read specific lines from a file. Use when read_file returned a truncated result and you need to see a specific section.",
  inputSchema: {
    properties: {
      path: { type: "string" },
      start_line: { type: "number", description: "1-indexed start line" },
      end_line: { type: "number", description: "1-indexed end line (inclusive)" }
    },
    required: ["path", "start_line", "end_line"]
  }
}

This transforms a potentially 50,000-token file read into a targeted 500-token read of the relevant section.
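
A minimal sketch of the range logic behind such a tool, operating on file contents that have already been read. The function name `readLineRange` and the choice to prefix each line with its number are assumptions made for illustration.

```typescript
// Returns lines startLine..endLine (1-indexed, inclusive), clamped to the file.
function readLineRange(contents: string, startLine: number, endLine: number): string {
  const lines = contents.split("\n");
  const start = Math.max(1, Math.floor(startLine));
  const end = Math.min(lines.length, Math.floor(endLine));
  if (start > end) {
    // Tell the agent how long the file is so it can correct the request.
    return `[No lines in range ${startLine}-${endLine}; file has ${lines.length} lines.]`;
  }
  return lines
    .slice(start - 1, end)
    .map((line, i) => `${start + i}: ${line}`)
    .join("\n");
}
```

The line-number prefix costs a few tokens per line, but it lets the agent refer to exact lines in follow-up calls; drop it if your agent never needs that.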

Filter Command Output

Shell command output is often mostly noise. Test runners print passing tests. Build tools print verbose dependency resolution. The agent typically needs only failures:

// For test runner results, filter to failures only
function filterTestOutput(rawOutput: string): string {
  const lines = rawOutput.split("\n");
  const failureLines = lines.filter(line =>
    line.includes("FAIL") ||
    line.includes("Error") ||
    line.includes("✗") ||
    line.includes("×") ||
    /^\s+at /.test(line)  // stack trace lines
  );

  if (failureLines.length === 0) {
    // No failure markers found: return the runner's summary line if there is one.
    const summary = lines.find(l => /\d+ (tests?|specs?) passed/.test(l));
    // Otherwise return the tail of the output rather than claiming success,
    // since an unrecognized output format could hide failures.
    return summary ?? lines.slice(-5).join("\n");
  }

  return failureLines.join("\n");
}

Before: 12,000 tokens of test output including 200 passing tests. After: 800 tokens of the 3 failing tests and their stack traces.

Technique 4: Selective Field Inclusion

When tools return structured data (JSON objects, database rows, API responses), return only the fields the agent needs — not the full object.

Database Query Results

// Bad: returns entire row including all columns
{
  name: "get_user",
  // Returns: { id, email, created_at, updated_at, hashed_password, salt,
  //            stripe_customer_id, metadata, preferences, ... }
}

// Good: returns only fields useful to an agent
{
  name: "get_user",
  description: "Get user by ID. Returns id, email, role, and created_at.",
  // Returns: { id, email, role, created_at }
}

If the agent ever needs additional fields, add a get_user_full tool — but make the common case cheap.
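
Field selection is easiest to enforce with an allow-list applied just before the result is serialized. The sketch below uses a generic `pick` helper; `UserRow`, `AGENT_USER_FIELDS`, and `toAgentUser` are hypothetical names, and the row shape only mirrors the fields mentioned above.

```typescript
// Copies only the listed keys from source into a new object.
function pick<T extends object, K extends keyof T>(source: T, keys: readonly K[]): Pick<T, K> {
  const out = {} as Pick<T, K>;
  for (const key of keys) {
    out[key] = source[key];
  }
  return out;
}

// Hypothetical row shape, mirroring the fields named above.
interface UserRow {
  id: number;
  email: string;
  role: string;
  created_at: string;
  hashed_password: string;
  salt: string;
}

// Allow-list of fields the agent may see.
const AGENT_USER_FIELDS = ["id", "email", "role", "created_at"] as const;

function toAgentUser(row: UserRow) {
  return pick(row, AGENT_USER_FIELDS);
}
```

An allow-list is a deliberate choice over a deny-list: a column added to the table later stays hidden from the agent until someone adds it explicitly, which also keeps secrets such as password hashes out of context by default.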

GitHub API Results

The GitHub API returns enormous JSON objects for pull requests, issues, and commits. Filter before returning:

// `github`, `owner`, and `repo` are assumed to be configured elsewhere in the server.
async function getPullRequest(prNumber: number) {
  const pr = await github.pulls.get({ owner, repo, pull_number: prNumber });
  // Return only what an agent actually uses
  return {
    number: pr.data.number,
    title: pr.data.title,
    body: pr.data.body,
    state: pr.data.state,
    head: pr.data.head.ref,
    base: pr.data.base.ref,
    mergeable: pr.data.mergeable, // may be null while GitHub is still computing it
    statuses_url: pr.data.statuses_url, // commit statuses endpoint for the head commit
  };
  // Not: pr.data (which includes 200+ fields, ~8,000 tokens)
}

[IMAGE: Before/after comparison of GitHub PR API response — full response vs filtered response]

Technique 5: Batching Tool Calls

Each tool call is a full round-trip: model generates call → host executes → result injected → model generates next call. Reducing the number of round-trips directly reduces both latency and the growth of conversation history (which is re-injected on every request).

Batch Reads

The most common batching opportunity: reading multiple files.

// Without batching: 5 round-trips, 5× conversation history growth
read_file("src/a.ts") → result → read_file("src/b.ts") → result → ...

// With batching: 1 round-trip
read_multiple_files(["src/a.ts", "src/b.ts", "src/c.ts", "src/d.ts", "src/e.ts"]) → all results

Implement read_multiple_files in your filesystem server:

{
  name: "read_multiple_files",
  description: "Read multiple files in one call. More efficient than calling read_file repeatedly. Returns a map of path → contents (or path → error message for files that fail).",
  inputSchema: {
    properties: {
      paths: { type: "array", items: { type: "string" }, maxItems: 20 }
    },
    required: ["paths"]
  }
}

// Implementation
const results: Record<string, string> = {};
await Promise.all(
  paths.map(async (p) => {
    try {
      results[p] = await fs.readFile(validatePath(p, root), "utf-8");
    } catch (e) {
      results[p] = `ERROR: ${(e as Error).message}`;
    }
  })
);
// Compact JSON: pretty-printing would add whitespace tokens to every result.
return { content: [{ type: "text", text: JSON.stringify(results) }] };
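
Batching concentrates result size into one call, so a batch of large files can exceed any single-file limit. One possible guard, sketched with the hypothetical helper `applyBatchBudget`, splits a total character budget evenly across the files in the batch.

```typescript
// Split a total character budget evenly across the files in a batch result.
function applyBatchBudget(
  results: Record<string, string>,
  totalMaxChars: number
): Record<string, string> {
  const paths = Object.keys(results);
  if (paths.length === 0) return {};
  const perFile = Math.floor(totalMaxChars / paths.length);
  const out: Record<string, string> = {};
  for (const p of paths) {
    const text = results[p];
    out[p] =
      text.length <= perFile
        ? text
        : text.slice(0, perFile) +
          `\n[TRUNCATED: ${text.length} chars total, showing first ${perFile}.]`;
  }
  return out;
}
```

An even split is the simplest policy and wastes budget when some files are small; redistributing unused budget to larger files is a natural refinement.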

Combine Search and Read

A common pattern: search_files → parse results → multiple read_file calls. Instead, offer a search_and_read tool that returns matches with surrounding context in one call:

{
  name: "search_and_read",
  description: "Search for a pattern and return matching lines with context. Use instead of search_files + read_file when you just need to see where a pattern appears and its surrounding code.",
  inputSchema: {
    properties: {
      pattern: { type: "string" },
      path: { type: "string" },
      context_lines: { type: "number", description: "Lines before/after match (default: 5)" }
    },
    required: ["pattern"]
  }
}
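
The core of such a tool is a search that returns each match with its surrounding lines and merges windows that overlap, so no line is sent twice. A minimal sketch over contents already in memory; `ContextBlock` and `searchWithContext` are hypothetical names, and the pattern is treated as a JavaScript regular expression.

```typescript
interface ContextBlock {
  startLine: number; // 1-indexed, inclusive
  endLine: number;   // 1-indexed, inclusive
  text: string;
}

function searchWithContext(contents: string, pattern: string, contextLines = 5): ContextBlock[] {
  const re = new RegExp(pattern); // throws on an invalid pattern
  const lines = contents.split("\n");
  const blocks: ContextBlock[] = [];
  lines.forEach((line, i) => {
    if (!re.test(line)) return;
    const startLine = Math.max(0, i - contextLines) + 1;
    const endLine = Math.min(lines.length - 1, i + contextLines) + 1;
    const last = blocks[blocks.length - 1];
    if (last && startLine <= last.endLine + 1) {
      // Overlaps or touches the previous window: extend it instead of repeating lines.
      last.endLine = Math.max(last.endLine, endLine);
    } else {
      blocks.push({ startLine, endLine, text: "" });
    }
  });
  // Fill in the text once the final window boundaries are known.
  for (const block of blocks) {
    block.text = lines.slice(block.startLine - 1, block.endLine).join("\n");
  }
  return blocks;
}
```

Returning line numbers with each block lets the agent follow up with a range read if the default context turns out to be too narrow.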

Technique 6: Result Caching

For tools that are called repeatedly with the same arguments — read_file on the same unchanged file, get_user for the same ID, git_log for the same repo — cache results at the MCP server layer.

const cache = new Map<string, { result: ToolResult; cachedAt: number }>();
const CACHE_TTL_MS = 30_000; // 30 seconds

function cacheKey(toolName: string, args: unknown): string {
  return `${toolName}:${JSON.stringify(args)}`;
}

async function callWithCache(toolName: string, args: unknown, fn: () => Promise<ToolResult>): Promise<ToolResult> {
  const key = cacheKey(toolName, args);
  const cached = cache.get(key);

  if (cached && Date.now() - cached.cachedAt < CACHE_TTL_MS) {
    return cached.result;
  }

  const result = await fn();
  cache.set(key, { result, cachedAt: Date.now() });
  return result;
}

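Caching is especially effective for read_file during an implementation session, where the agent often re-reads a file it has already seen. Note that a server-side cache by itself saves execution time and backend load, not tokens: a cached result is still injected into context in full. To avoid paying the token cost again, the server can answer a repeat read of an unchanged file with a short "unchanged since last read" notice instead of the full contents.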

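Important: Invalidate cache entries after write operations. If edit_file modifies src/auth.ts, immediately evict every cache entry whose arguments reference that path.

if (toolName === "edit_file" || toolName === "write_file") {
  const modifiedPath = (args as { path: string }).path;
  // Keys embed JSON-encoded arguments, so match the JSON-encoded (quoted) path.
  // This also avoids evicting src/auth.tsx when src/auth.ts changes.
  const needle = JSON.stringify(modifiedPath);
  for (const key of cache.keys()) {
    if (key.includes(needle)) cache.delete(key);
  }
}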
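
A cache hit on the server still returns the full result to the model. To reduce tokens on repeat reads, the server can track what it has already sent and reply with a short notice when nothing changed. The sketch below is one way to do that under stated assumptions: `SeenFiles` and `fnv1a` are hypothetical names, and FNV-1a is a small non-cryptographic hash used only for change detection, so rare collisions are possible.

```typescript
// FNV-1a: a small non-cryptographic hash, used here only to detect changes.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16);
}

// Tracks what the agent has already been sent in this session, keyed by path.
class SeenFiles {
  private hashes = new Map<string, string>();

  // Returns the full contents on first read or after a change,
  // and a short notice when the agent already holds identical contents.
  respond(path: string, contents: string): string {
    const hash = fnv1a(contents);
    if (this.hashes.get(path) === hash) {
      return `[UNCHANGED: ${path} is identical to the version you read earlier (hash ${hash}).]`;
    }
    this.hashes.set(path, hash);
    return contents;
  }
}
```

This only helps while the earlier read is still in the model's context. If the host trims or summarizes history, the agent needs a way to force a full re-read, for example an explicit flag on the tool.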

[IMAGE: Cache hit/miss flow diagram with invalidation on write]

Measuring the Impact

Before and after applying these techniques to a representative coding task (implement a simple feature, run tests, commit):

Metric	Without optimization	With optimization	Reduction
Schema tokens per request	12,400	2,100	83%
Average result tokens per call	3,200	890	72%
Total session tokens	74,000	18,500	75%
Session cost (claude-opus-4-6)	~$1.11	~$0.28	75%
Median task completion time	38s	19s	50%
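
The reduction column can be reproduced from the before and after values with a few lines of arithmetic. In this sketch, `reductionPercent` and `sessionCostUsd` are illustrative helpers, and the rate of $15 per million tokens is inferred from the table's own cost row (74,000 tokens for about $1.11), not a quoted price.

```typescript
// Percentage reduction, rounded to the nearest whole percent.
function reductionPercent(before: number, after: number): number {
  return Math.round(((before - after) / before) * 100);
}

// Session cost at a single blended rate per million tokens.
function sessionCostUsd(tokens: number, usdPerMillionTokens: number): number {
  return (tokens / 1_000_000) * usdPerMillionTokens;
}

const schemaReduction = reductionPercent(12_400, 2_100);   // schema tokens per request
const resultReduction = reductionPercent(3_200, 890);      // average result tokens per call
const sessionReduction = reductionPercent(74_000, 18_500); // total session tokens
const costBefore = sessionCostUsd(74_000, 15);
const costAfter = sessionCostUsd(18_500, 15);
```

Running the same calculation on your own before and after measurements is a quick way to check which technique actually moved the total.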

Optimization is not free — it requires careful tool design and server-side implementation work. But for high-frequency production systems, the returns justify the investment.

Priority Order

Not all techniques have equal leverage. Apply in this order:

1. Dynamic tool filtering: highest impact. Injecting 10 tools instead of 50 saves ~80% of schema tokens immediately.
2. Result truncation: second highest. A single large file read can cost more tokens than the entire schema. Truncate with recovery paths.
3. Batch tools: eliminates round-trips, reduces conversation history growth.
4. Schema compression: polish after the above. Marginal gains on an already-filtered schema.
5. Selective fields: important for structured data (DB, APIs), less impactful for file operations.
6. Caching: implementation overhead, but pays off for repeated reads in long sessions.

Where Axiom Fits

Token reduction works best when tool access is governed centrally rather than tuned separately inside every agent host. The MCP Gateway is the right boundary for server-side filtering, rate limits, result truncation, and audit logs because every tool call passes through it before reaching the underlying MCP server.

For teams optimizing both tool tokens and model tokens, the Unified AI Gateway combines MCP tool governance with LLM Gateway routing, budgets, and cost observability. VibeFlow adds the delivery context around those calls: which work item caused the agent session, which files changed, which reviews passed, and whether the token spend produced approved software output.
