The Agent Control Plane | Colby Callahan

The next major infrastructure category is not another chatbot wrapper. It is the agent control plane.

AI is moving from a system that answers questions to a system that takes actions. Agents read repositories, call tools, use terminals, query databases, operate browsers, write code, open pull requests, create documents, interact with SaaS systems, and coordinate with other agents. That changes the central engineering question.

The old question was: can the model produce a useful answer?

The new question is: can we safely operate a fleet of semi-autonomous software actors?

An agent that only chats is an application feature. An agent that can call tools, spend money, mutate state, touch source code, access internal data, or trigger production-adjacent workflows is a distributed-system participant. Treating it as anything less leads to unobservable automation, unclear accountability, runaway spend, and security controls that arrive only after the blast radius is already real.

The right abstraction is the agent control plane: the infrastructure layer that makes agents observable, governable, attributable, bounded, measurable, and stoppable.

Why this is happening now

Three shifts are converging.

First, agent frameworks are becoming real runtimes. The latest agent SDKs are no longer just prompt wrappers. They include tools, handoffs, sessions, memory, guardrails, human-in-the-loop mechanisms, tracing, and sandbox execution. The runtime is becoming a first-class software layer.

Second, tool protocols are standardizing. MCP gives agents a common way to connect with external data, tools, and workflows. A2A points toward a world where agents built by different vendors and frameworks can communicate and coordinate across enterprise systems. This is useful, but it also means agent activity will cross more trust boundaries.

Third, the security and observability ecosystems are catching up. OWASP now explicitly calls out risks such as prompt injection, excessive agency, sensitive information disclosure, supply-chain exposure, and unbounded consumption. OpenTelemetry has started defining GenAI semantic conventions for model, agent, workflow, retrieval, and tool spans. These are not academic details. They are signals that agent systems are becoming production systems.

When a technology gets runtimes, protocols, security taxonomies, and telemetry conventions, the next missing piece is the operating layer.

That operating layer is the control plane.

Agents are not just another app surface

Traditional developer tools are mostly deterministic. A CI job runs a known script. A deployment pipeline follows a known graph. A service emits logs with relatively stable schemas. Humans may make mistakes, but the system usually does not invent new plans during execution.

Agents are different.

They are dynamic planners operating over partially trusted context. They choose which files to inspect, which tools to call, how long to persist, when to retry, when to ask for help, and what changes to make. Their behavior depends on model output, prompt context, repository state, tool results, memory, policy, and harness design. The same high-level task can produce different execution paths across runs.

That flexibility is the value. It is also the risk.

The more useful agents become, the more they need the seriousness we already apply to production compute: identity, permissions, observability, audit trails, rate limits, policy enforcement, runtime isolation, cost controls, and emergency stop mechanisms.

The control plane versus the data plane

A useful way to reason about agent infrastructure is to separate the data plane from the control plane.

The data plane is where agent work happens: model calls, tool calls, file reads, shell commands, browser actions, API requests, code generation, test execution, and pull request creation.

The control plane is where agent work is governed: identity, policy, routing, budgets, telemetry, auditability, evaluations, approvals, incident response, and kill switches.

If the data plane is where agents act, the control plane is where the organization decides what agents are allowed to do, observes what they actually did, and intervenes when they should stop.

This matters because most early agent systems overbuild the data plane and underbuild the control plane. They make agents more capable before making them more governable.

That order is backwards for enterprise deployment.

Design principles

Server-side telemetry beats client-side trust.
Identity is the root of control.
Policy belongs outside the model.
Tool calls are the risk boundary.
Cost must be tied to outcomes.
Memory must have provenance.
Human review should be risk-based.
Every agent needs a targeted stop path.

These principles are intentionally simple because they need to survive organizational scale. A platform team should not need every agent harness, product team, and workflow owner to independently rediscover them.

The minimum viable agent control plane

A credible agent control plane needs at least twelve primitives.

1. Agent identity

Every agent instance needs durable identity.

Not just the human who launched it. Not just the API key it used. Not just the team that owns the workflow. The platform should know: this is agent instance X, launched by user Y, for workflow Z, against repo or resource A, using model B, with policy profile C, budget D, and runtime E.

Without instance identity, you cannot answer basic operational questions:

who started this?
what is it allowed to do?
what tools can it call?
what did it touch?
how much did it spend?
which policy decisions affected it?
can I stop this one agent without breaking everything else?

Agent identity is the root of every other control.
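
As a minimal sketch of what such an identity record could look like, the Python below models the X/Y/Z/A/B/C/D/E fields from the paragraph above. The class name, field names, and the "agent.*" label keys are assumptions made for illustration; they are not taken from any SDK or telemetry standard.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AgentIdentity:
    """Identity for one agent instance. Field names are illustrative."""
    launched_by: str     # user Y
    workflow: str        # workflow Z
    resource: str        # repo or resource A
    model: str           # model B
    policy_profile: str  # policy profile C
    budget_usd: float    # budget D
    runtime: str         # runtime E
    # Instance X: generated once per run, never shared between runs.
    instance_id: str = field(default_factory=lambda: f"agent-{uuid.uuid4()}")

    def labels(self) -> dict:
        """Flat metadata to attach to every event this instance emits."""
        return {
            "agent.instance_id": self.instance_id,
            "agent.launched_by": self.launched_by,
            "agent.workflow": self.workflow,
            "agent.resource": self.resource,
            "agent.model": self.model,
            "agent.policy_profile": self.policy_profile,
            "agent.runtime": self.runtime,
        }
```

Two runs launched by the same user for the same workflow still get different instance IDs, which is what makes it possible to stop or audit one run without touching the other.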

2. Session reconstruction

Agent activity is scattered across model calls, streaming events, tool invocations, logs, job runners, source-control actions, browser sessions, CI systems, and human review.

The control plane has to stitch those fragments into a coherent timeline.

A useful session timeline should show the original task, user, agent identity, model calls, tool calls, file reads, file writes, command execution, external actions, policy decisions, cost, retries, errors, handoffs, human approvals, and final outcome.

If the only way to understand what happened is to grep logs across five systems, the platform is not ready for serious autonomy.
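
The stitching step can be sketched in a few lines, assuming every source system already stamps its events with the agent instance ID and a timestamp. The "instance_id", "ts", and "kind" keys are hypothetical; a real pipeline would also have to handle clock skew and late-arriving events.

```python
from collections import defaultdict
from typing import Iterable


def build_timelines(*sources: Iterable[dict]) -> dict[str, list[dict]]:
    """Merge events from several systems into one ordered list per agent.

    Each event is assumed to carry "instance_id" and "ts" keys.
    """
    timelines: dict[str, list[dict]] = defaultdict(list)
    for source in sources:
        for event in source:
            timelines[event["instance_id"]].append(event)
    for events in timelines.values():
        # Order by timestamp so the session reads as a single story.
        events.sort(key=lambda e: e["ts"])
    return dict(timelines)
```

The sketch only works if identity is attached at the source, which is why instance identity comes first in the list of primitives.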

3. Tool-call ledger

Tool calls are where agent risk becomes concrete.

Reading a README is low-risk. Reading secrets, writing files, querying customer data, calling cloud APIs, modifying payment settings, posting messages, or opening pull requests requires stronger controls.

The control plane should maintain a queryable ledger of tool invocations: tool name, tool version, arguments, result class, latency, caller identity, policy decision, downstream side effects, and sensitivity level.

This is the difference between “the agent did something weird” and “this exact agent called this exact tool with these arguments at this time under this policy.”
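
A rough sketch of such a ledger follows, using an in-memory list as a stand-in for a real store. The record fields mirror the list above; the class names, the example values in the comments, and the query filters are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolCall:
    ts: float
    instance_id: str      # caller identity
    tool: str
    tool_version: str
    arguments: dict
    result_class: str     # e.g. "ok", "error", "denied"
    latency_ms: int
    policy_decision: str  # e.g. "allow", "deny", "needs_approval"
    sensitivity: str      # e.g. "low", "high"
    side_effects: tuple = ()  # downstream effects, e.g. ("opened_pr",)


class ToolCallLedger:
    """Append-only, in-memory stand-in for a queryable ledger."""

    def __init__(self) -> None:
        self._calls: list[ToolCall] = []

    def record(self, call: ToolCall) -> None:
        self._calls.append(call)

    def query(self, instance_id: Optional[str] = None,
              tool: Optional[str] = None,
              sensitivity: Optional[str] = None) -> list[ToolCall]:
        """Return calls matching every filter that was supplied."""
        return [
            c for c in self._calls
            if (instance_id is None or c.instance_id == instance_id)
            and (tool is None or c.tool == tool)
            and (sensitivity is None or c.sensitivity == sensitivity)
        ]
```

Because tool arguments and results can themselves contain sensitive data, a production ledger would likely need redaction or access controls on the arguments field; the sketch stores them as-is.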

4. Policy enforcement outside the model

Prompt-level governance is not enough.

A prompt can instruct an agent not to do something. A control plane can prevent it.

Policy should answer concrete questions:

Can this agent access this repository?
Can it call this MCP server?
Can it invoke this tool with these arguments?
Can it write files or only read them?
Can it open network connections?
Can it use this model?
Can it spend more than this budget?
Can it access production-adjacent resources?
Can it act without human approval?

The policy engine should be centralized, auditable, and independent of the agent’s self-reported intentions.
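
To illustrate the difference between instructing and preventing, here is a small deny-by-default decision function that runs outside the model. PolicyProfile, ActionRequest, and the three decision strings are hypothetical names, and the checks cover only a few of the questions listed above.

```python
from dataclasses import dataclass, field


@dataclass
class PolicyProfile:
    allowed_repos: set = field(default_factory=set)
    allowed_tools: set = field(default_factory=set)
    write_allowed: bool = False
    budget_usd: float = 0.0
    approval_required_tools: set = field(default_factory=set)


@dataclass
class ActionRequest:
    repo: str
    tool: str
    writes: bool
    spent_usd: float  # spend so far, as measured by the platform


def decide(profile: PolicyProfile, request: ActionRequest) -> tuple[str, str]:
    """Return (decision, reason). Nothing is allowed unless a rule permits it."""
    if request.repo not in profile.allowed_repos:
        return "deny", "repo not in scope"
    if request.tool not in profile.allowed_tools:
        return "deny", "tool not allowed"
    if request.writes and not profile.write_allowed:
        return "deny", "read-only profile"
    if request.spent_usd >= profile.budget_usd:
        return "deny", "budget exhausted"
    if request.tool in profile.approval_required_tools:
        return "needs_approval", "tool requires human approval"
    return "allow", "within policy"
```

The inputs come from the platform's own records (identity, measured spend, the requested tool), not from anything the agent says about its intentions. Returning a reason alongside the decision keeps every outcome auditable.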

5. Capability manifests

Agents need explicit capability manifests.

A manifest should define what the agent can use: models, tools, MCP servers, data sources, filesystem scope, network scope, credential scope, memory scope, and approval requirements.

This is the agent equivalent of a Kubernetes pod spec, IAM policy, and runtime profile combined into one operational contract.

The default should not be ambient authority. The default should be least privilege.
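
One way to make least privilege the default is to have an empty manifest grant nothing. The sketch below assumes POSIX-style workspace paths and invents the class and method names; it shows only tool and filesystem-write scope, not the full list of manifest fields.

```python
import posixpath
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class CapabilityManifest:
    """Everything defaults to empty, so an unconfigured agent can do nothing."""
    models: frozenset = frozenset()
    tools: frozenset = frozenset()
    mcp_servers: frozenset = frozenset()
    write_paths: tuple = ()  # directory roots the agent may write under
    network_hosts: frozenset = frozenset()

    def can_use_tool(self, tool: str) -> bool:
        return tool in self.tools

    def can_write(self, path: str) -> bool:
        # Collapse ".." segments first so "/repo/../etc" cannot pass as "/repo".
        # This is a lexical check only; symlinks need handling in the runtime.
        target = PurePosixPath(posixpath.normpath(path))
        return any(target.is_relative_to(root) for root in self.write_paths)
```

For example, a manifest with write_paths=("/workspace/repo",) accepts "/workspace/repo/src/a.py" and rejects both "/workspace/repo/../etc/passwd" and "/workspace/repo2/a.py".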

6. Runtime isolation

Agents should run in constrained environments with explicit boundaries around filesystem access, network access, credentials, secrets, and tool availability.

The sandbox does not need to be perfect to be valuable. It needs to make the safe path normal and the dangerous path explicit.

For coding agents, isolation should answer:

what workspace is mounted?
what files can be read?
what files can be written?
what commands can run?
what network destinations are allowed?
what credentials are available?
what artifacts leave the sandbox?

An agent with a terminal is not just generating text. It is operating a computer.

7. Cost attribution

AI spend without attribution becomes organizational fog. Everyone knows the bill is growing. Nobody knows which workflows are creating value.

A good control plane attributes cost by user, team, repo, workflow, task class, model, prompt template, context source, tool pattern, and outcome.

The important unit is not dollars per request. It is dollars per successful outcome.

A $20 agent run that fixes a production bug is cheap. A $2 run that loops over irrelevant files and accomplishes nothing is expensive.
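
The arithmetic behind that claim is simple enough to sketch. The function below assumes each session record carries a cost, a success flag, and some grouping key such as workflow; the key names are hypothetical.

```python
from collections import defaultdict


def cost_per_success_by(sessions: list[dict], key: str) -> dict[str, float]:
    """Total spend divided by successful outcomes, grouped by `key`.

    A group with spend but no successes is reported as infinity,
    which is the honest price of a workflow that never delivers.
    """
    spend: dict[str, float] = defaultdict(float)
    wins: dict[str, int] = defaultdict(int)
    for s in sessions:
        spend[s[key]] += s["cost_usd"]
        wins[s[key]] += 1 if s["succeeded"] else 0
    return {
        group: (spend[group] / wins[group] if wins[group] else float("inf"))
        for group in spend
    }
```

Fed one successful $20 bug-fix session and two failed $2 refactor sessions, grouping by workflow yields 20.0 for the bug fix and infinity for the refactor, even though the refactor workflow spent far less in total.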

8. Model routing and context economics

Model choice is a control-plane decision.

Some tasks need the strongest model. Many do not and should be routed to a cheaper one. Some requests should reuse a cached prefix. Some should have their context compacted first. Some should be rejected before they burn tokens. Some should require approval because the context contains sensitive data.

The control plane should reason about model routing, prompt caching, context compaction, budget limits, latency targets, and fallback strategies.

This is where AI cost control becomes engineering, not finance.
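
A routing decision of this kind might look like the following sketch. The model names, task classes, token threshold, and action strings are all placeholders chosen for illustration, not recommendations.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical task classes that justify the most capable model.
HARD_TASKS = frozenset({"architecture_change", "incident_debugging"})


@dataclass
class RouteRequest:
    task_class: str
    context_tokens: int
    contains_sensitive_data: bool
    remaining_budget_usd: float


def route(req: RouteRequest,
          max_context_tokens: int = 100_000) -> tuple[str, Optional[str]]:
    """Return (action, model). Model is None unless the action is "route"."""
    if req.remaining_budget_usd <= 0:
        return "reject", None              # stop before burning tokens
    if req.contains_sensitive_data:
        return "needs_approval", None      # a human decides first
    if req.context_tokens > max_context_tokens:
        return "compact_then_retry", None  # shrink the context, then re-route
    model = "strong-model" if req.task_class in HARD_TASKS else "cheap-model"
    return "route", model
```

The ordering matters: budget and sensitivity checks run before any model is chosen, so a request that should never reach a model is stopped at the gateway.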

9. Evaluations and outcome feedback

An agent platform without evals is an opinion machine.

The control plane should measure success rates by task class, model, tool set, memory strategy, prompt strategy, repo type, cost band, and risk tier. It should track regressions over time, retries per success, cost per success, human intervention rate, review rejection rate, failure modes, and downstream incidents.

Outcome feedback matters because agent activity is not the same as agent value.

For coding agents, useful outcome signals include:

did the patch compile?
did CI pass?
was the pull request accepted?
did reviewers request changes?
was the issue closed?
was the change later reverted?
did the user abandon the session?

Without outcome feedback, agents optimize for motion instead of results.

10. Human-in-the-loop routing

Human review should be risk-based, not universal.

The goal is not to force humans to manually approve every low-risk action forever. The goal is to define risk tiers, automate safe paths, escalate ambiguous cases, and preserve accountability.

A good control plane routes work differently depending on action type, confidence, historical success rate, data sensitivity, blast radius, and reversibility.

Low-risk documentation edits may only need post-hoc audit. Medium-risk code changes may need normal review. High-risk production actions may need explicit approval before execution.
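
Those three tiers can be expressed as a small routing function. This is a deliberately coarse sketch: the parameter names and the returned mode strings are assumptions, and a real router would also weigh confidence and historical success rate.

```python
def review_mode(action_type: str,
                touches_production: bool,
                reversible: bool,
                sensitive_data: bool) -> str:
    """Map an action to one of three review paths."""
    # High risk: production impact, sensitive data, or no way to undo.
    if touches_production or sensitive_data or not reversible:
        return "approve_before_execution"
    # Low risk: reversible documentation edits are audited afterwards.
    if action_type == "docs_edit":
        return "post_hoc_audit"
    # Everything else goes through the normal review process.
    return "normal_review"
```

The high-risk check comes first on purpose, so a documentation edit that touches sensitive data is still escalated rather than waved through.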

11. Memory governance

Agent memory is useful only if it is governed.

Memory should have provenance, freshness, scope, trust level, expiration, and deletion semantics. A memory created by a successful merged pull request is different from a memory inferred during a failed run. A human-authored convention is different from a model-generated guess. A repo-specific memory is different from a global instruction.

Ungoverned memory becomes a persistence layer for stale assumptions and prompt injection.
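
As a sketch of what governed memory could mean in practice, the code below attaches provenance, scope, and a time-to-live to each memory and filters on them before anything reaches a prompt. The three source labels and their trust ranking are assumptions for illustration; an organization would define its own.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical trust ranking: higher numbers are trusted more.
TRUST_ORDER = {"model_inferred": 0, "merged_pr": 1, "human_authored": 2}


@dataclass(frozen=True)
class Memory:
    text: str
    source: str          # provenance: a key of TRUST_ORDER
    scope: str           # e.g. "repo:payments" or "global"
    created_at: datetime
    ttl: timedelta       # expiration window

    def is_fresh(self, now: datetime) -> bool:
        return now < self.created_at + self.ttl


def usable(memories: list[Memory], scope: str, now: datetime,
           min_source: str = "merged_pr") -> list[Memory]:
    """Keep memories that are unexpired, in scope, and trusted enough."""
    floor = TRUST_ORDER[min_source]
    return [
        m for m in memories
        if m.is_fresh(now)
        and m.scope in (scope, "global")
        and TRUST_ORDER[m.source] >= floor
    ]
```

With the default floor, a guess inferred by the model during a run is never replayed into a later session, which closes off one path by which an injected instruction could persist.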

12. Kill switches

There must be a way to stop bad behavior at multiple layers: one agent instance, one session, one workflow, one team, one tool, one model, one identity, one MCP server, or the entire platform.

Revoking a shared credential is not enough. Killing all agent traffic is too blunt. The control plane needs targeted intervention.

A real kill switch should prevent new model calls, cancel pending tool calls where possible, terminate the runtime, block automatic restart, preserve forensic state, and emit an audit event.

If you cannot stop a specific agent, you are not operating it. You are hoping it behaves.
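
Targeted intervention can be sketched as a registry of scoped stop flags that gateways consult before every model or tool call. The class, scope names, and label keys are hypothetical, and the sketch covers only two of the behaviors listed above: blocking new calls and emitting an audit record. Cancelling in-flight calls, terminating the runtime, and preserving forensic state would live elsewhere.

```python
import time


class StopRegistry:
    """Scoped stop flags that gateways check before every call."""

    SCOPES = ("platform", "team", "workflow", "tool", "model", "instance")

    def __init__(self) -> None:
        self._stops: set[tuple[str, str]] = set()
        self.audit_log: list[dict] = []

    def stop(self, scope: str, value: str, reason: str) -> None:
        if scope not in self.SCOPES:
            raise ValueError(f"unknown scope: {scope}")
        self._stops.add((scope, value))
        # Every stop leaves an audit record of what was stopped and why.
        self.audit_log.append(
            {"ts": time.time(), "scope": scope, "value": value, "reason": reason}
        )

    def is_stopped(self, labels: dict[str, str]) -> bool:
        """True if any label on this call matches an active stop."""
        if ("platform", "*") in self._stops:
            return True  # the global red button
        return any((scope, value) in self._stops
                   for scope, value in labels.items())
```

Stopping ("instance", "a1") blocks only that run, while stopping ("tool", "shell") blocks every agent's use of that one tool and leaves the rest of the platform running.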

The Staff Engineer lens

A Staff+ engineer’s job is not just to build the agent runtime. It is to define the operating model that lets many teams safely adopt agents without every team reinventing identity, policy, telemetry, evaluation, cost attribution, and incident response.

That is the real leverage point.

The problem is not the first agent. The problem is the next hundred agent workflows: different teams, different repos, different data sources, different tools, different risk levels, different cost profiles, and different definitions of success.

The Staff+ move is to turn that chaos into a platform primitive.

What I would build first

I would not start by trying to build the perfect agent platform.

I would start by building the control-plane spine.

In the first 30 days:

assign every agent run a durable instance identity
emit normalized model-call events
emit normalized tool-call events
attach workflow, repo, team, user, model, and policy metadata
produce a basic session timeline

In the first 60 days:

attribute cost by team, repo, workflow, model, and session
track outcomes such as CI pass, PR opened, PR merged, task abandoned, review rejected, or human intervention required
identify retry loops and high-cost failed sessions
add workflow-level dashboards
define risk tiers for human approval

In the first 90 days:

add scoped stop controls for agent instance, workflow, tool, and model
expose cost per successful outcome
detect stuck sessions while they are still running
log policy decisions centrally
produce an executive view of adoption, value, cost, and risk

That is enough to change the conversation.

The platform moves from “we have agents doing things” to “we can operate agents as a governed production system.”
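
Retry-loop detection, one of the items in that plan, does not need to be sophisticated to be useful. A first version, assuming normalized tool-call events with "tool" and "arguments" keys (hypothetical names), can simply look for the same call repeated back to back:

```python
import json


def longest_repeat_run(tool_calls: list[dict]) -> int:
    """Length of the longest run of identical consecutive tool calls."""
    longest = current = 0
    previous = None
    for call in tool_calls:
        # Same tool plus same arguments counts as the same call.
        fingerprint = (call["tool"],
                       json.dumps(call["arguments"], sort_keys=True))
        current = current + 1 if fingerprint == previous else 1
        longest = max(longest, current)
        previous = fingerprint
    return longest


def is_retry_loop(tool_calls: list[dict], threshold: int = 3) -> bool:
    """Flag a session once a call repeats `threshold` times in a row."""
    return longest_repeat_run(tool_calls) >= threshold
```

The threshold of three is an arbitrary starting point. This heuristic misses loops that alternate between two calls, so it is a floor for detection rather than a complete answer.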

The architecture pattern

The architecture I would expect to become common looks like this:

Identity layer

Issues agent instance IDs and binds them to users, services, workflows, teams, resources, and policy profiles.

Model gateway

Routes model calls, records token usage, applies model policy, manages caching, captures request metadata, and emits GenAI telemetry.

Tool gateway

Brokers tool and MCP access, validates arguments, records tool calls, applies allow/deny/approval policy, and tracks downstream side effects.

Runtime manager

Creates isolated workspaces, applies capability manifests, controls filesystem/network/credential access, manages lifecycle, and exposes termination hooks.

Event stream

Publishes normalized model, tool, runtime, policy, cost, and outcome events.

Session store

Builds the agent timeline: what happened, in what order, under which identity and policy.

Policy engine

Evaluates access, risk, budget, approval, data sensitivity, model choice, and action authorization.

Evaluation layer

Measures success, failure, regression, cost per outcome, and drift across models, workflows, tools, prompts, and memory strategies.

Operations console

Lets operators inspect active sessions, investigate failures, pause workflows, disable tools, enforce budgets, and kill targeted agent runs.

That is the control plane. It is not a dashboard. It is an operating system for agent authority.
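
To make one of these components concrete, here is a minimal sketch of the tool gateway as a server-side choke point: every call is authorized, executed, and recorded in one place. The class and its authorize and emit callbacks are assumptions for illustration; in the architecture above they would be backed by the policy engine and the event stream.

```python
import time
from typing import Any, Callable


class ToolGateway:
    """Brokers tool calls: authorize first, execute, always emit an event."""

    def __init__(self,
                 authorize: Callable[[str, str, dict], bool],
                 emit: Callable[[dict], None]) -> None:
        self._authorize = authorize  # stand-in for the policy engine
        self._emit = emit            # stand-in for the event stream
        self._tools: dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def call(self, instance_id: str, tool: str, arguments: dict) -> Any:
        allowed = (tool in self._tools
                   and self._authorize(instance_id, tool, arguments))
        event = {"instance_id": instance_id, "tool": tool,
                 "arguments": arguments,
                 "decision": "allow" if allowed else "deny"}
        if not allowed:
            self._emit(event)  # denied calls are recorded too
            raise PermissionError(f"{tool} denied for {instance_id}")
        started = time.monotonic()
        try:
            result = self._tools[tool](**arguments)
            event["result_class"] = "ok"
            return result
        except Exception:
            event["result_class"] = "error"
            raise
        finally:
            event["latency_ms"] = int((time.monotonic() - started) * 1000)
            self._emit(event)
```

Because the agent never holds a direct reference to the tool, the telemetry does not depend on the client choosing to log anything.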

The executive metric

The control plane should eventually answer one executive question:

How much useful autonomous work did agents complete, at what cost, with what risk, under what controls?

That is the question that matters. Not total prompts. Not total tokens. Not total agent sessions. Those are activity metrics.

The control plane exists to turn agent activity into accountable business value.

What to measure first

The fastest way to make this real is to measure the right things.

Start with these metrics:

active agent sessions
model calls per session
tool calls per session
cost per session
cost per successful outcome
session reconstruction coverage
tool-call attribution coverage
policy decision coverage
cache hit rate
retry-loop detection rate
human intervention rate
review acceptance rate
kill-switch time to effect
failed-session spend
top workflows by spend
top workflows by value

The most important metric is not total AI spend.

The most important metric is value-normalized autonomy: how much useful work the agent completed, at what cost, with what risk, under what controls.

What good looks like in 3–6 months

A serious enterprise agent platform should not need a year to become observable and governable. Within 3–6 months, the platform should be able to answer the operational questions that determine whether agents are ready for broader autonomy.

At minimum, operators should be able to answer these questions in seconds:

Which agents are running right now?
Who launched each agent?
What workflow, repo, team, or resource is each agent operating against?
What model is each agent using?
What tools has each agent called?
What did each session cost?
Which sessions are stuck, looping, or failing to make progress?
Which workflows are creating useful outcomes?
Which workflows are mostly burning tokens?
Which model, prompt, tool, or memory change caused a regression?
Which actions required human approval?
Which policy decisions blocked or modified agent behavior?
Can we stop one agent without stopping the whole platform?

The 3-month version does not need to be perfect. It needs the control-plane spine:

durable agent instance identity
model-call telemetry
tool-call telemetry
session reconstruction
cost attribution
outcome tracking
policy decision logging
targeted stop controls

The 6-month version should harden that spine into operational maturity:

workflow-level dashboards
cost-per-success reporting
retry-loop and stuck-session detection
human-approval routing by risk tier
regression tracking across models, prompts, tools, and memory
budget controls by team, workflow, model, and task class
incident workflows for unsafe or wasteful agent behavior
kill switches for agent instance, session, workflow, tool, model, identity, and platform

This is the difference between agent experimentation and agent operations.

If the platform cannot answer these questions after 3–6 months, the organization is probably accumulating agent activity rather than building agent infrastructure.

The anti-patterns

Anti-pattern 1: Client logs as the source of truth

Client logs are useful, but they are not authoritative. They can be missing, inconsistent, disabled, or bypassed. A production control plane needs server-side telemetry at enforceable choke points.

Anti-pattern 2: Prompt-only safety

Prompts can express policy. They cannot reliably enforce it. Serious systems enforce policy in gateways, tools, runtimes, credentials, and approval paths.

Anti-pattern 3: Shared credentials

If every agent uses the same token, you cannot attribute action, revoke precisely, or apply least privilege. Shared credentials erase accountability.

Anti-pattern 4: No cost-to-outcome loop

A spend dashboard without outcomes encourages blunt cost cutting. A cost-to-outcome dashboard encourages better routing, caching, compaction, and workflow design.

Anti-pattern 5: Global shutdown as the only control

A giant red button is necessary but insufficient. Operators need scoped controls: instance, workflow, tool, model, identity, team, and platform.

Anti-pattern 6: Memory without provenance

Memory that cannot explain where it came from, when it was last validated, and whether it came from a successful outcome should not be trusted.

Why this category matters

Every large software organization is going to face the same adoption curve.

At first, a few engineers use agents locally. Then teams use agents for repetitive code changes, test fixes, documentation, release notes, migrations, dependency updates, support workflows, and operational tasks. Then agents become embedded in CI, code review, incident response, data analysis, customer operations, and internal platforms.

At small scale, trust is informal. At enterprise scale, trust needs infrastructure.

The companies that win will not simply be the ones that buy the smartest model. They will be the ones that build the best operating layer around models: context management, tool governance, gateway observability, memory governance, evaluation, cost control, and runtime safety.

Models will keep improving. That is not the bottleneck.

The bottleneck is whether organizations can safely turn model capability into production authority.

The mistake to avoid

The common mistake is treating agent governance as a legal document, a dashboard, a prompt template, or a committee process.

Those are useful, but they are not enough.

The right design is architectural. The control plane should sit where identity, model calls, tool calls, runtime execution, and policy decisions can be observed and enforced. It should not depend on every client doing the right thing. It should not require every team to invent its own audit trail. It should not rely on asking the agent to behave.

Agents are becoming a new class of production actor. They need production-grade infrastructure.

The goal is not to spend a year designing the perfect platform. The goal is to make the first 90 days create enough visibility, control, and economic accountability that every future agent workflow becomes easier to trust.

Final thought

The most important enterprise AI question is not “which model is smartest?”

It is: “Can we safely give AI systems more authority?”

The answer will depend less on chat UX and more on control planes.

The future belongs to organizations that can run agents the way they run serious distributed systems: observable, governable, attributable, bounded, measurable, and stoppable.

That is the agent control plane.

Research notes used for this revision

OpenAI Agents SDK documentation: agent primitives include agents, handoffs, guardrails, sandbox agents, sessions, MCP integration, human-in-the-loop, and tracing.
Model Context Protocol documentation: MCP is an open-source standard for connecting AI applications to external data sources, tools, and workflows.
Google Agent2Agent announcement: A2A is an open protocol for agent-to-agent communication and coordination across enterprise platforms, launched April 9, 2025 with support from more than 50 partners.
OWASP Top 10 for LLM Applications 2025: relevant risks include prompt injection, sensitive information disclosure, supply chain, excessive agency, and unbounded consumption.
OpenTelemetry GenAI semantic conventions: current development conventions include GenAI model, agent, workflow, retrieval, and tool spans; tool-call arguments/results are explicitly treated as potentially sensitive.
NIST AI Risk Management Framework and Generative AI Profile: risk management for AI systems should be incorporated into design, development, use, and evaluation, with GenAI-specific risks called out in the 2024 profile.

Source links

OpenAI Agents SDK: https://openai.github.io/openai-agents-python/
Model Context Protocol: https://modelcontextprotocol.io/docs/getting-started/intro
Google Agent2Agent announcement: https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/
OWASP Top 10 for LLM Applications 2025: https://genai.owasp.org/llm-top-10/
OpenTelemetry GenAI semantic conventions: https://opentelemetry.io/docs/specs/semconv/gen-ai/
NIST AI Risk Management Framework: https://www.nist.gov/itl/ai-risk-management-framework