Agentic Coding: How AI Agents Write, Test & Ship Code
Autonomous AI agents that plan, write, test, and ship code on their own — Claude Code, Devin, Cursor agent mode — plus benchmarks, team structures, and the governance they need to be safe.
What Is Agentic Coding
Agentic coding is a software development practice where autonomous AI agents — not human developers — drive the inner loop of writing code. The agent reads a task, plans an approach, edits files, runs commands, observes the results, and iterates until the task is complete. A human reviews the output rather than the keystrokes.
Tools like Claude Code, Devin, Cursor's agent mode, GitHub Copilot Workspace, and OpenAI Codex CLI are all expressions of this pattern. They share three core capabilities: they read your codebase as context, they take actions through real tools (shells, file editors, browsers, test runners, git), and they loop on their own observations until the work is done or a guardrail stops them.
The shift matters because it changes the unit of delegation. With code completion, you delegate a line. With chat-assisted coding, you delegate a function. With agentic coding, you delegate an entire task — "fix bug #142," "add the export endpoint," "migrate this module to TypeScript" — and review the result rather than co-writing it.
Agentic Coding vs Vibecoding
Agentic coding and vibecoding are often conflated, but they describe different practices. Vibecoding is the broader cultural shift to natural-language-driven software creation — including everything from prompt-driven prototyping to autonomous agents. Agentic coding is the specific subset where the work is delegated to an AI agent that loops on its own, with the developer reviewing rather than co-writing.
A useful framing: vibecoding describes how the work is initiated (intent in natural language). Agentic coding describes who runs the loop (the agent, not the developer). You can vibecode without an agent: typing into Cursor's Composer is vibecoding without delegation. You can also run agents on traditional, fully specified tickets, which is agentic coding without much vibe.
Why the distinction matters
How the Agent Loop Works
Every agentic coding tool implements some variation of the same four-step loop: plan, act, observe, reflect. The agent reads the task, decides what to do, takes an action through a tool, reads the result, and updates its plan. The loop runs until the task succeeds, the agent decides it cannot proceed, or a guardrail terminates the run.
1. Plan: read task, choose approach
2. Act: edit files, run commands
3. Observe: read tool output, errors
4. Reflect: update plan, retry, or ship
The agent loop runs until the task succeeds, fails, or hits a guardrail.
Plan means the agent reads the task description, scans relevant files, and proposes an approach. Modern agents typically write the plan as text — a list of file edits, commands to run, and acceptance criteria — before touching anything.
Act is where the agent uses tools: it edits files through a file API, runs commands in a sandboxed shell, makes HTTP calls, or executes git operations. This is the riskiest step because the agent is now mutating real state.
Observe is the agent reading tool output: stdout, stderr, file contents after an edit, test results, type errors. The quality of observation determines whether the agent catches its own mistakes.
Reflect closes the loop. The agent decides whether the action moved it closer to the goal, whether to retry with a different approach, or whether the task is done and a PR can be opened.
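The four steps above can be sketched as a small driver function. This is a minimal illustration, not how any particular tool implements its loop: the `plan`, `act`, and `is_done` callables, the `LoopResult` type, and the `max_steps` budget are all hypothetical names introduced here, and the stubs at the bottom stand in for a real model and test runner.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class LoopResult:
    status: str  # "succeeded", "gave_up", or "guardrail"
    steps: int
    log: list = field(default_factory=list)


def run_agent_loop(
    task: str,
    plan: Callable[[str, list], Optional[str]],
    act: Callable[[str], str],
    is_done: Callable[[str], bool],
    max_steps: int = 10,
) -> LoopResult:
    """Run plan -> act -> observe -> reflect until one of three exits."""
    observations = []
    log = []
    for step in range(1, max_steps + 1):
        action = plan(task, observations)  # 1. Plan: choose the next action
        if action is None:  # the agent decides it cannot proceed
            return LoopResult("gave_up", step - 1, log)
        output = act(action)  # 2. Act: call a tool
        observations.append(output)  # 3. Observe: keep the tool output
        log.append((step, action, output))
        if is_done(output):  # 4. Reflect: did this reach the goal?
            return LoopResult("succeeded", step, log)
    return LoopResult("guardrail", max_steps, log)  # step budget exhausted


# Stubs standing in for a model and a test runner (assumptions for the demo).
def stub_plan(task, observations):
    return "run tests" if observations else "edit file"


def stub_act(action):
    return "tests passed" if action == "run tests" else "file edited"


result = run_agent_loop("fix bug #142", stub_plan, stub_act,
                        lambda out: out == "tests passed")
print(result.status, result.steps)
```

The three return paths correspond to the three ways the loop ends: success, the agent giving up, or a guardrail (here, a simple step budget) terminating the run.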
What Agents Do Well — and Where They Fail
Agentic coding shines on a specific shape of work: tasks with clear acceptance criteria, scoped to a small number of files, where the codebase has good tests or types to ground the agent's observations. Bug fixes with reproducible failures, small CRUD endpoints, refactors that are mechanical but tedious, dependency upgrades, and test backfills are all well-suited.
Agents struggle when the task requires cross-system context they cannot see (a deployment quirk, a tribal-knowledge constraint), when acceptance is ambiguous ("make this faster" without a target), or when the only feedback signal is human judgment ("does this UX feel right?"). Agents also tend to fail silently when test coverage is sparse — they will report success because nothing failed, even when nothing was actually verified.
Common failure modes
- Confident hallucination. The agent invents an API that doesn't exist, runs the build (which doesn't catch it), and ships.
- Reward hacking. The test suite has gaps; the agent edits the test rather than the code to make it pass.
- Scope creep. The agent rewrites adjacent code "while it's there" and the diff balloons.
- Loop-without-progress. The agent burns tokens and tool calls retrying the same broken approach.
Each of these is a governance problem more than a model problem. Better models reduce frequency; only governance catches what slips through.
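Some of these failure modes leave traces that a governance layer can check mechanically after a run. The sketch below is one possible set of heuristics, assuming a hypothetical `RunSummary` record produced by whatever logs the run; the path conventions for test files and the repeat threshold are illustrative assumptions. It does not attempt to detect hallucinated APIs, which need a type checker or real tests.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class RunSummary:
    changed_files: list  # paths in the agent's diff
    task_files: list     # paths the task was expected to touch
    actions: list        # commands and edits, in order
    tests_run: int       # tests actually executed during the run


def is_test_path(path: str) -> bool:
    """Heuristic: file lives in a test directory or follows a test naming pattern."""
    p = PurePosixPath(path)
    in_test_dir = any(part in ("test", "tests") for part in p.parts[:-1])
    return in_test_dir or p.name.startswith("test_") or p.stem.endswith("_test")


def flag_failure_modes(run: RunSummary, max_repeats: int = 3) -> list:
    flags = []
    # Reward hacking: the diff touches only tests, never the code under test.
    if run.changed_files and all(is_test_path(p) for p in run.changed_files):
        flags.append("reward_hacking")
    # Scope creep: files changed that the task did not call for.
    if any(p not in run.task_files for p in run.changed_files):
        flags.append("scope_creep")
    # Loop without progress: the same action repeated back to back.
    streak = 1
    for prev, cur in zip(run.actions, run.actions[1:]):
        streak = streak + 1 if cur == prev else 1
        if streak >= max_repeats:
            flags.append("no_progress")
            break
    # Silent success: nothing failed because nothing was verified.
    if run.tests_run == 0:
        flags.append("unverified")
    return flags


suspicious = RunSummary(
    changed_files=["tests/test_export.py"],
    task_files=["src/export.py"],
    actions=["pytest", "pytest", "pytest"],
    tests_run=0,
)
print(flag_failure_modes(suspicious))
```

Flags like these are prompts for a human reviewer, not verdicts; a legitimate test backfill, for instance, would trip the first check.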
The Tooling Landscape
The agentic coding tool market has matured rapidly. Most tools fall into one of three shapes: an editor-embedded agent (Cursor, Windsurf), a CLI agent that runs in your repo (Claude Code, Codex CLI), or a hosted autonomous engineer (Devin, Copilot Workspace). The differences come down to where the agent runs, how it gets work, and what governance surface it offers.
- Claude Code: CLI agent in your repo. Long-horizon refactors, multi-file edits.
- Cursor Agent Mode: editor-embedded agent. Tight feedback with editor state.
- Devin: hosted autonomous SWE. Self-running on Linear/Jira tickets.
- GitHub Copilot Workspace: issue-to-PR pipeline. Native to GitHub flow.
- OpenAI Codex CLI: local CLI agent. Sandboxed shell + file ops.
- VibeFlow: governance + orchestration. Tracked work, audit trails, multi-agent coordination.
Choosing an agentic coding tool
- Editor-embedded agents (Cursor, Windsurf) — best for interactive coding where you want to stay in the loop. Lower governance burden, lower autonomy.
- CLI agents (Claude Code, Codex CLI) — best when you want the agent to run long-horizon tasks against your full repo. Higher autonomy, less editor coupling.
- Hosted autonomous engineers (Devin, Copilot Workspace) — best when you want the agent to pick up tickets and ship PRs without human babysitting. Highest autonomy, highest governance requirement.
How Axiom differs
Most agentic coding tools optimize for solo developer velocity. Axiom's VibeFlow sits one layer up — it gives any agent (Claude Code, Cursor, Codex CLI, Devin) a tracked work item, persistent context, audit logs, branch isolation, and human review gates. You don't replace your agent; you make it auditable.
Governing Agentic Coding
The reason agentic coding needs more governance than other AI-assisted development is autonomy. An agent that runs for thirty minutes against your repo can edit dozens of files, make dozens of tool calls, spend real money on inference, and commit code that nobody watched being written. Without guardrails, you have no audit trail and no ability to catch a misbehaving agent before it ships.
- Tracked work items. Every agent run is bound to a task ID, so there are no orphan commits.
- Branch + worktree isolation. Agents work on dedicated branches, so concurrent runs do not collide in the same working tree.
- Tool allowlists. Agents only call gateway-approved tools (LLM, MCP, A2A).
- Execution logs. Every plan, edit, command, and observation is captured.
- Human review gates. Agents propose PRs; humans approve before merge.
- Persistent context. Architecture decisions carry forward between agent sessions.
The non-negotiables are tracked work, branch isolation, tool allowlists, and execution logs. With those four in place, an agent's run becomes traceable: you can see what task it claimed, which branch it touched, which tools it called, and what it observed at each step. Without them, debugging an agent failure means asking "what happened?" and getting no answer.
The two governance layers worth investing in early are an LLM gateway (so every model call is logged, costed, and policy-checked) and an MCP gateway (so tool calls are scoped, allowlisted, and observable). These give you visibility without changing how the agent itself works.
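To make the allowlist-plus-logging idea concrete, here is a minimal in-process sketch. It is not an MCP or LLM gateway implementation; `ToolGateway`, its methods, and the tool names are hypothetical and only illustrate the shape: every call is checked against an allowlist, and every call (including denied ones) is written to a log tied to the task ID.

```python
import json
import time
from typing import Any, Callable


class ToolGateway:
    """Hypothetical in-process gateway: allowlist check plus execution log."""

    def __init__(self, task_id: str, allowlist: set):
        self.task_id = task_id
        self.allowlist = set(allowlist)
        self.tools: dict = {}
        self.log: list = []

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self.tools[name] = fn

    def call(self, name: str, **kwargs: Any) -> Any:
        allowed = name in self.allowlist and name in self.tools
        entry = {"task_id": self.task_id, "tool": name, "args": kwargs,
                 "allowed": allowed, "ts": time.time()}
        self.log.append(entry)  # denied calls are logged too
        if not allowed:
            raise PermissionError(f"tool {name!r} is not allowlisted")
        result = self.tools[name](**kwargs)
        entry["result"] = repr(result)
        return result

    def dump_log(self) -> str:
        # One JSON object per line, ready to ship to a log store.
        return "\n".join(json.dumps(e, default=str) for e in self.log)


gateway = ToolGateway("TASK-142", allowlist={"read_file"})
gateway.register("read_file", lambda path: f"<contents of {path}>")
gateway.register("deploy", lambda: "deployed")  # registered but not allowlisted

gateway.call("read_file", path="src/export.py")
try:
    gateway.call("deploy")
except PermissionError as exc:
    print(exc)
print(gateway.dump_log())
```

Because the check sits between the agent and the tools, the agent itself does not need to change, which is the property the paragraph above describes.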
VibeFlow turns any agent into a governed agent
VibeFlow assigns every agent run a tracked work item, isolates work in a worktree, captures the full execution log, and gates merges on human review. Bring your own agent — Claude Code, Cursor, Codex CLI — and inherit the governance layer you need to ship autonomously without losing the audit trail.
How Teams Roll It Out
Adopting agentic coding works best in stages. Jumping straight from manual development to autonomous agents almost always produces a governance crisis within a quarter: too many commits, no audit trail, surprising AI bills, and a few notable production incidents. Teams that get lasting value usually move through three stages.
Stage 1 — Supervised single-agent
One developer runs one agent (Claude Code, Cursor agent mode) on small, well-scoped tasks they would have done themselves. The developer watches every step, intervenes when the agent goes off-track, and learns where the agent's competence ends. Goal: build intuition for what agents can and can't do in your codebase.
Stage 2 — Tracked single-agent
Agents pick up tasks from a queue (Linear, Jira, VibeFlow). Each run is bound to a tracked work item, runs on its own branch, and opens a PR for human review. The developer's role shifts from co-writing to reviewing. Goal: prove that an agent can reliably ship a class of tasks (bug fixes, small features) without supervision.
Stage 3 — Multi-agent with persona specialization
Multiple agents with distinct personas (developer, QA, security reviewer, architect) coordinate through a shared work tracker. Different agents claim different work item types. Worktree isolation prevents collisions. Goal: scale agent capacity beyond what a single developer can supervise.
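The per-task branch and worktree isolation used in Stages 2 and 3 can be sketched with plain git. The helper below only builds the `git worktree add -b <branch> <path> <base>` command; the `agent/<persona>/<task>` branch naming scheme, the `../worktrees` directory, and the `main` base branch are assumptions for illustration, not a convention any of the tools above prescribes.

```python
import re
from pathlib import PurePosixPath


def branch_for(task_id: str, persona: str = "developer") -> str:
    """Derive a branch name from the tracked work item (hypothetical scheme)."""
    slug = re.sub(r"[^a-z0-9]+", "-", task_id.lower()).strip("-")
    return f"agent/{persona}/{slug}"


def worktree_command(task_id: str, persona: str = "developer",
                     root: str = "../worktrees", base: str = "main") -> list:
    """Build the git command that gives one agent run its own checkout."""
    branch = branch_for(task_id, persona)
    path = str(PurePosixPath(root) / branch.replace("/", "-"))
    # git worktree add -b <new-branch> <path> <commit-ish>
    return ["git", "worktree", "add", "-b", branch, path, base]


cmd = worktree_command("BUG-142")
print(" ".join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` inside a repository would create the worktree; it is left out here so the sketch has no side effects. Separate working directories keep two agents from overwriting each other's files, though their branches can still conflict at merge time.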
The honest version
Agentic Coding Tools in Depth
Each major agentic coding tool takes a different approach to autonomy, context, and integration. Understanding the trade-offs helps you match the right tool to the right class of work.
Agentic Coding with Claude Code
Claude Code is Anthropic's CLI agent for software development. It runs inside your terminal with full access to your repo, shell, and git. You give it a task in natural language; it plans an approach, edits files, runs tests, and iterates until the work is done. Key strengths: long-horizon multi-file refactors, deep codebase understanding via its large context window, and native support for hooks, MCP servers, and custom tool integrations. Claude Code can also run in "headless" mode for CI/CD pipelines, making it suitable for automated code review and batch operations.
Agentic Coding with Cursor
Cursor's agent mode embeds an autonomous loop inside the editor. You describe a task, and the agent edits files, runs terminal commands, and iterates — but within the context of your open workspace. The advantage is tight feedback: you see every edit as it happens, can steer the agent mid-run, and benefit from editor-specific context (open tabs, linter results, type errors). The trade-off is lower autonomy — Cursor agents are designed for interactive sessions, not background execution on a queue of tickets.
Agentic Coding with Devin
Devin by Cognition is a hosted autonomous software engineer. It connects to your issue tracker (Linear, Jira), picks up tickets, spins up a development environment, writes and tests code, and opens PRs — all without a human in the loop during execution. Devin's strength is fully autonomous ticket-to-PR execution with its own sandboxed compute. The trade-off is less control over the execution environment and higher governance requirements since the agent runs remotely with broad access.
Tool choice is less important than governance
Building an AI Development Team
The end state of agentic coding is not one agent replacing one developer — it's a team of specialized agents, each with a distinct persona and set of responsibilities, coordinating through a shared work tracker. This is what an AI development team looks like in practice.
- Developer Agent. Picks up tickets, writes code, opens PRs. Tools: Claude Code, Devin, Codex CLI.
- QA Agent. Reviews PRs, runs tests, flags regressions. Tools: Claude Code + test harness.
- Security Reviewer. Scans diffs for vulnerabilities, checks dependencies. Tools: Claude Code + SAST tools.
- Architect Agent. Reviews for patterns, consistency, tech debt. Tools: Claude Code + context docs.
The coordination model mirrors how human teams work: a developer agent writes the code, a QA agent reviews and tests it, a security agent checks for vulnerabilities, and an architect agent ensures consistency with existing patterns. Each agent has different tool access, different review authority, and different escalation paths.
What makes this work is not the agents themselves — it's the orchestration layer. Without tracked handoffs between agents, you get collisions, duplicated work, and no audit trail for who did what. The work tracker becomes the single source of truth: which agent claimed which task, what branch it's working on, what status it's in, and what the execution log shows.
Teams running multi-agent development today typically use 2–4 agent personas. The developer persona handles the bulk of implementation. QA and security personas run as reviewers on every PR. The architect persona is invoked selectively for larger changes that affect shared patterns or interfaces.
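One way to picture the tracked handoffs described above is as a small state machine on the work item. In this sketch the status names, the `HANDOFFS` table, and the `claim` and `complete` helpers are all hypothetical; a real work tracker would persist this state and handle concurrency. The selectively invoked architect persona is left out to keep the pipeline linear.

```python
from dataclasses import dataclass, field
from typing import Optional

# status -> (persona allowed to claim it, status after a passing result)
HANDOFFS = {
    "in_development": ("developer", "in_qa"),
    "in_qa": ("qa", "in_security_review"),
    "in_security_review": ("security", "awaiting_human_review"),
}


@dataclass
class WorkItem:
    item_id: str
    status: str = "in_development"
    claimed_by: Optional[str] = None
    history: list = field(default_factory=list)


def claim(item: WorkItem, persona: str) -> bool:
    """Only the persona that owns the current status may claim an unclaimed item."""
    owner = HANDOFFS.get(item.status, (None, None))[0]
    if persona != owner or item.claimed_by is not None:
        return False
    item.claimed_by = persona
    return True


def complete(item: WorkItem, persona: str, passed: bool = True) -> None:
    """Record the outcome and hand off, or send the item back to development."""
    if item.claimed_by != persona:
        raise ValueError(f"{persona} has not claimed {item.item_id}")
    next_status = HANDOFFS[item.status][1]
    item.history.append((item.status, persona, passed))
    item.status = next_status if passed else "in_development"
    item.claimed_by = None


item = WorkItem("TASK-142")
claim(item, "developer")
complete(item, "developer")
claim(item, "qa")
complete(item, "qa", passed=False)  # QA flags a regression
print(item.status, item.history)
```

The `history` list is the audit trail of who did what; the final status in the pipeline is a human review gate rather than an automatic merge.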
Orchestrate your AI development team
VibeFlow coordinates multi-agent teams with persona-based work assignment, worktree isolation, execution logging, and status-driven handoffs — so your developer, QA, and security agents work together without collisions.
Benchmarks and Adoption Data
The most widely cited benchmark for agentic coding capability is SWE-Bench Verified — a curated set of real GitHub issues from popular Python repositories. Agents are given the issue description and must produce a patch that passes the project's test suite. It tests end-to-end capability: reading context, planning a fix, editing code, and verifying correctness.
SWE-Bench Verified scores as of May 2025. Dash indicates no published result. Scores change rapidly — check leaderboards for the latest.
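The pass/fail criterion behind a benchmark score like this can be illustrated with a simplified scoring function. This is a sketch of the idea, not the official evaluation harness: it assumes each instance lists the tests the patch must fix (`fail_to_pass`) and the tests that must keep passing (`pass_to_pass`), and that `passed_tests` holds the tests that passed after the agent's patch was applied. The instance IDs and test names below are made up.

```python
def is_resolved(fail_to_pass, pass_to_pass, passed_tests) -> bool:
    """An instance counts only if the fix works and nothing else regressed."""
    passed = set(passed_tests)
    return set(fail_to_pass) <= passed and set(pass_to_pass) <= passed


def resolved_rate(instances: dict) -> float:
    """Fraction of instances resolved; the headline number on a leaderboard."""
    if not instances:
        raise ValueError("no instances to score")
    resolved = sum(is_resolved(**fields) for fields in instances.values())
    return resolved / len(instances)


# Hypothetical results for two instances.
runs = {
    "repo-a-101": {
        "fail_to_pass": ["test_export_csv"],
        "pass_to_pass": ["test_import", "test_auth"],
        "passed_tests": ["test_export_csv", "test_import", "test_auth"],
    },
    "repo-b-202": {
        "fail_to_pass": ["test_parse_date"],
        "pass_to_pass": ["test_parse_int"],
        "passed_tests": ["test_parse_date"],  # the fix broke test_parse_int
    },
}
print(resolved_rate(runs))
```

The second instance shows why the regression check matters: a patch that fixes the reported issue but breaks an existing test does not count as resolved.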
Beyond benchmarks, adoption data from 2025 developer surveys tells a consistent story: roughly 75% of professional developers use AI coding tools regularly, up from 44% in 2023. Agentic usage (delegating entire tasks rather than accepting line-by-line completions) is growing fastest among teams with existing CI/CD maturity, since test suites give agents a feedback signal to loop on.
What the benchmarks don't tell you
SWE-Bench measures whether an agent can fix a single issue in a well-tested open-source project. Real enterprise work is harder: proprietary codebases with thin test coverage, multi-service architectures, tribal knowledge not captured in docs, and acceptance criteria that require human judgment. Benchmark scores are a useful signal for model capability, but they don't predict how well an agent will perform in your repo.
The practical metrics that matter more: task completion rate (what percentage of assigned tasks does the agent ship without human rework?), review rejection rate (how often does a human reviewer send the PR back?), and cost per task (inference spend per successfully completed work item). These are specific to your codebase and only discoverable through piloting.
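These three metrics are simple to compute once each pilot task is recorded. The sketch below assumes a hypothetical `TaskRecord` with the fields shown, and makes two interpretive choices that a team may want to define differently: a task counts as completed only if it shipped without human rework, and review rejection rate is the share of tasks whose PR was sent back at least once.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    task_id: str
    shipped: bool            # the PR was merged
    reworked: bool           # a human had to rewrite part of it
    review_rejections: int   # times a reviewer sent the PR back
    cost_usd: float          # inference spend for this task


def pilot_metrics(records: list) -> dict:
    if not records:
        raise ValueError("no task records")
    total = len(records)
    completed = [r for r in records if r.shipped and not r.reworked]
    rejected = [r for r in records if r.review_rejections > 0]
    spend = sum(r.cost_usd for r in records)  # failed runs still cost money
    return {
        "task_completion_rate": len(completed) / total,
        "review_rejection_rate": len(rejected) / total,
        "cost_per_completed_task": spend / len(completed) if completed else float("inf"),
    }


records = [
    TaskRecord("bug-fix", True, False, 0, 1.0),
    TaskRecord("small-feature", True, False, 1, 2.0),
    TaskRecord("refactor", True, True, 0, 3.0),
    TaskRecord("dependency-bump", False, False, 0, 2.0),
]
print(pilot_metrics(records))
```

Dividing total spend (including failed runs) by completed tasks is deliberate: it prices in the cost of attempts that never shipped.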
Getting Started
A pragmatic first month: pick one agent (Claude Code is a reasonable default for repo-grounded work), pick one developer to be the agent owner, pick five tasks of the right shape (bug fix, small feature, refactor, test backfill, dependency bump), and run them under supervision. Track which tasks succeeded, which failed, and what kind of failure each was. This becomes your team's empirical model of "what is this agent good for."
Pick the tool
Start with one. Claude Code if you want a CLI agent in your repo. Cursor agent mode if your team already lives in Cursor. Devin if you want hosted autonomous runs against tickets. You can always add more — but operating two agents in parallel before you've governed one well is the fastest path to a mess.
Wrap it in governance early
Even on day one, route inference through a logged gateway, run the agent on a dedicated branch per task, and require a human-approved PR before merge. The cost of adding governance later — after dozens of unattributed commits — is much higher than building it in from the first run.
Decide on your scope ceiling
Until you've built confidence, cap what an agent is allowed to do unsupervised: maximum diff size, no migrations, no deployment commands, no production credentials. Lift the ceiling as track record warrants it.
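A scope ceiling like this can be written down as data and checked before a run's output is accepted. The sketch below is one possible shape, with every threshold and pattern an assumption: the 400-line limit, the `migrations/` path marker, the command substrings, and the `PROD` environment-variable marker are placeholders to replace with your own. Substring matching is deliberately crude and will produce false positives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopeCeiling:
    max_changed_lines: int = 400
    forbidden_paths: tuple = ("migrations/",)
    forbidden_commands: tuple = ("deploy", "kubectl apply", "terraform apply")
    forbidden_env_markers: tuple = ("PROD",)


def check_scope(ceiling, changed_lines, changed_paths, commands, env_names) -> list:
    """Return a list of ceiling violations; an empty list means the run is in scope."""
    problems = []
    if changed_lines > ceiling.max_changed_lines:
        problems.append(f"diff too large: {changed_lines} > {ceiling.max_changed_lines}")
    for path in changed_paths:
        if any(marker in path for marker in ceiling.forbidden_paths):
            problems.append(f"forbidden path: {path}")
    for command in commands:
        if any(bad in command for bad in ceiling.forbidden_commands):
            problems.append(f"forbidden command: {command}")
    for name in env_names:
        if any(marker in name.upper() for marker in ceiling.forbidden_env_markers):
            problems.append(f"production credential exposed: {name}")
    return problems


ceiling = ScopeCeiling()
print(check_scope(ceiling, 120, ["src/export.py"], ["pytest -q"], ["CI"]))
print(check_scope(ceiling, 900, ["db/migrations/0007_add_col.py"],
                  ["make deploy"], ["PROD_DB_PASSWORD"]))
```

Lifting the ceiling then becomes a reviewed change to one small object, for example `ScopeCeiling(max_changed_lines=800)`, rather than an informal agreement.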
Run agentic coding with a real audit trail
VibeFlow gives you the control plane for agentic coding — tracked work items, worktree isolation, full execution logs, persistent context, and human review gates. Bring Claude Code, Cursor, or Devin; inherit the governance you need to ship faster without losing accountability.