ColbyCallahan

AI Cost Control Is an Engineering Discipline

Updated May 24, 2026

AI cost control is not a finance problem. It is an engineering discipline.

Companies will not control model spend by telling engineers to be careful. That may work for a few weeks. It will not survive real adoption. The moment AI becomes embedded in developer workflows, CI systems, support operations, internal tools, eval pipelines, and autonomous agents, cost becomes a platform problem.

The right question is not “how do we make people use AI less?”

The right question is “how do we make every token economically accountable?”

AI spend is different from cloud spend

Cloud infrastructure cost is already hard, but the primitives are familiar: compute, storage, network, databases, queues, and logs.

AI spend has a stranger shape.

A developer can create a large cost event with one workflow. An agent can repeatedly send the same repository context. A tool can retry a failing task dozens of times. A prompt can quietly grow until every request carries a huge context tax. A high-end model can be used for work a smaller model could handle. A background automation can burn tokens while producing no durable value.

The cost unit is not just infrastructure usage. It is reasoning usage.

That means traditional cost dashboards are insufficient. You need to know what the model was asked to do, why it needed that context, whether the output succeeded, and whether a cheaper path would have produced the same result.

Tokens are not all the same

The first mistake is treating all tokens as one bucket.

A modern model platform may expose several cost-relevant token classes:

  • uncached input tokens
  • cached input tokens
  • cache-creation tokens
  • output tokens
  • reasoning tokens
  • tool-result tokens reintroduced into context
  • retrieval tokens
  • image, audio, or multimodal tokens
  • tokens used in evals and retries

These categories behave differently. They differ in price, in latency impact, and in which optimization levers apply to them.

For example, OpenAI prompt caching can activate automatically on long prompts and reports cache hits through cached_tokens. Anthropic exposes cache-read and cache-creation token fields, supports explicit cache breakpoints, and charges cache reads at a different multiplier than cache writes.

The engineering implication is simple: cost control requires token accounting, not just request accounting.

A platform that only records “this request cost $0.14” cannot optimize much. A platform that records where the tokens came from, whether they were cached, which workflow used them, and whether the run succeeded can improve.
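A minimal sketch of per-class token accounting. The `TokenUsage` record and the prices in `PRICES_PER_MTOK` are hypothetical placeholders for illustration, not any provider's real rates, and only four of the token classes listed above are modeled:

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; real rates vary by provider and model.
PRICES_PER_MTOK = {
    "uncached_input": 3.00,
    "cached_input": 0.30,
    "cache_creation": 3.75,
    "output": 15.00,
}


@dataclass(frozen=True)
class TokenUsage:
    uncached_input: int = 0
    cached_input: int = 0
    cache_creation: int = 0
    output: int = 0


def cost_breakdown(usage: TokenUsage) -> dict[str, float]:
    """Return estimated USD cost per token class, so each class can be tracked separately."""
    return {
        name: getattr(usage, name) * price / 1_000_000
        for name, price in PRICES_PER_MTOK.items()
    }


usage = TokenUsage(uncached_input=2_000, cached_input=40_000, output=1_000)
breakdown = cost_breakdown(usage)
total = sum(breakdown.values())
```

With these assumed prices, the 40,000 cached tokens cost less than the 1,000 output tokens. A single per-request total would hide that.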

Do not optimize tokens in isolation

The wrong cost strategy is to minimize tokens.

The right strategy is to minimize cost per successful outcome.

A stronger model that solves a task in one pass may be cheaper than a weaker model that loops, retries, and fails. A large prompt prefix may be cheap if it is cached and reused across hundreds of sessions. A high-cost workflow may be worthwhile if it eliminates repeated human toil.

AI cost control is not token starvation. It is economic routing.

The first principle: attribute cost to outcomes

The weakest AI cost dashboard shows spend only by model and user.

That is not useless, but it is not enough.

A useful cost system attributes spend by:

  • user
  • team
  • repository or resource
  • workflow
  • task type
  • agent session
  • model
  • prompt template
  • cache key or prefix
  • tool chain
  • routing decision
  • retry group
  • outcome

The outcome part is the most important.

A $50 agent run that resolves an expensive operational issue may be a bargain. A $1 run that does nothing useful is waste. The unit of analysis should be cost per successful outcome, not just cost per request.

For coding agents, outcomes might include:

  • pull request opened
  • CI passed
  • review accepted
  • issue closed
  • migration completed
  • build fixed
  • release notes generated
  • human intervention avoided
  • change merged without revert

Without outcome attribution, organizations optimize for lower bills instead of higher leverage.
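Cost per successful outcome can be sketched in a few lines, assuming each run is recorded with its workflow, its cost, and whether it succeeded. The record shape and the sample numbers are hypothetical:

```python
from collections import defaultdict

# Hypothetical run records: (workflow, cost_usd, succeeded)
runs = [
    ("ci-fix", 4.00, True),
    ("ci-fix", 6.00, False),
    ("ci-fix", 2.00, True),
    ("release-notes", 0.50, True),
    ("release-notes", 0.50, True),
]


def cost_per_success(records):
    """Total spend (failed runs included) divided by successful runs, per workflow."""
    spend = defaultdict(float)
    wins = defaultdict(int)
    for workflow, cost, ok in records:
        spend[workflow] += cost
        wins[workflow] += int(ok)
    # None marks workflows that spent money without a single success.
    return {w: (spend[w] / wins[w] if wins[w] else None) for w in spend}


result = cost_per_success(runs)
```

Failed runs stay in the numerator on purpose: the $6 failure is part of what each successful CI fix really cost.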

The AI cost equation

At a high level, agent cost is driven by factors that compound. Read the expression below as a mental model, not literal arithmetic:

cost = model_price
     × tokens
     × retries
     × workflow_frequency
     × failure_rate
     × context_duplication
     × routing_quality

Most teams focus on model price. That is only one dimension.

A cheap model that fails five times may be more expensive than a stronger model that succeeds once. A prompt that carries 80,000 irrelevant tokens may waste more money than a bad model choice. A retry loop that never changes strategy is a cost incident. A cached prefix can be worth more than a procurement negotiation.
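The cheap-model-versus-strong-model comparison can be made concrete. This sketch assumes independent attempts with a constant success probability, so the expected number of attempts until the first success is 1/p. The prices and probabilities are invented for illustration:

```python
def expected_cost_per_success(cost_per_attempt: float, success_probability: float) -> float:
    """Expected spend until the first success, assuming independent attempts
    with a constant success probability (geometric distribution, mean 1/p attempts)."""
    if not 0 < success_probability <= 1:
        raise ValueError("success_probability must be in (0, 1]")
    return cost_per_attempt / success_probability


# Hypothetical numbers for illustration only.
cheap = expected_cost_per_success(cost_per_attempt=0.40, success_probability=0.2)
strong = expected_cost_per_success(cost_per_attempt=1.50, success_probability=0.9)
```

Under these assumptions the cheap model costs about $2.00 per success and the strong model about $1.67, even though each strong attempt is almost four times as expensive. Real retries are rarely independent, so treat this as a first approximation.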

Cost control is systems design.

The repeatable waste patterns

Most AI waste comes from a few patterns.

1. Repeated context ingestion

Agents often begin by exploring the same repository: README, build files, package manifests, architecture docs, test setup, conventions, recent failures, and dependency structure.

If developers launch multiple sessions per day in the same repo, the same initial context is rediscovered repeatedly.

The fix is not “use fewer agents.” The fix is cached repo cognition: stable repo maps, reusable summaries, file-hash invalidation, deterministic prompt prefixes, and task-specific retrieval.
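File-hash invalidation can be sketched as follows. The in-memory `summary_cache` and the `summarize` callback are stand-ins: a real system would use a shared store and an actual model call.

```python
import hashlib


def repo_fingerprint(files: dict[str, bytes]) -> str:
    """Hash paths and contents in sorted order, so the same repo state
    always yields the same key regardless of traversal order."""
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")
        digest.update(hashlib.sha256(files[path]).digest())
    return digest.hexdigest()


summary_cache: dict[str, str] = {}  # hypothetical in-memory store


def get_repo_summary(files: dict[str, bytes], summarize) -> str:
    """Reuse a cached summary until any tracked file changes."""
    key = repo_fingerprint(files)
    if key not in summary_cache:
        summary_cache[key] = summarize(files)  # the expensive model call would go here
    return summary_cache[key]
```

Any change to a tracked file produces a new fingerprint, so stale summaries are never served. The trade-off is that a one-line change invalidates the whole summary; per-directory fingerprints would be a finer-grained variant.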

2. Unstable prompt prefixes

Prompt caching depends on repeatable prefixes. If the platform shuffles tool definitions, injects timestamps near the top, serializes maps nondeterministically, or mixes stable and volatile context, it destroys cache hits.

Caching is not just a provider feature. It is a prompt-architecture discipline.

Static material belongs early. Variable material belongs late. Tool definitions need deterministic ordering. Repo summaries need stable formatting. Dynamic tool results should not poison reusable prefixes.
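One way to enforce that ordering, sketched with the standard library only. The function names and the prompt layout are assumptions for illustration, not a provider API:

```python
import hashlib
import json


def build_prompt(system: str, tools: list[dict], repo_summary: str,
                 volatile: dict) -> tuple[str, str]:
    """Return (stable_prefix, full_prompt). Stable material comes first and is
    serialized deterministically; volatile material is appended last."""
    ordered_tools = sorted(tools, key=lambda t: t["name"])
    stable_prefix = "\n".join([
        system,
        json.dumps(ordered_tools, sort_keys=True, separators=(",", ":")),
        repo_summary,
    ])
    # Timestamps, user input, and tool results go after the reusable prefix.
    tail = json.dumps(volatile, sort_keys=True)
    return stable_prefix, stable_prefix + "\n" + tail


def prefix_hash(prefix: str) -> str:
    """Short fingerprint for tracking how often the same prefix recurs."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:16]
```

Logging `prefix_hash` per request gives a provider-independent signal: if the hash changes between requests that should share a prefix, something volatile has leaked into the stable section.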

3. Overpowered model selection

Not every task needs the strongest model.

Log classification, release-note drafting, small deterministic edits, summarization, and formatting may not need frontier-level reasoning. Complex debugging, multi-file refactors, ambiguous architecture decisions, and high-risk agent actions often do.

The router should choose based on task class, risk, context size, historical success rate, latency target, budget, and fallback behavior.

The router should be measured by cost per successful outcome, not average cost per request.

4. Context bloat

Long context windows are useful, but they can become a tax.

If every request carries irrelevant history, verbose logs, duplicated files, old tool results, stale memory, and uncompressed error output, the system pays repeatedly for context that no longer changes the answer.

Context should be managed like a memory hierarchy:

  • hot context
  • summarized context
  • retrievable context
  • cached context
  • discarded context

The goal is not maximum context. The goal is sufficient context at minimum cost.

5. Verbose tool outputs

Tools are a hidden cost amplifier.

A test command that emits 20,000 lines of logs may be cheap to run but expensive to feed back into the model. A search tool that returns entire documents instead of snippets creates context bloat. A shell tool that dumps irrelevant warnings into every turn quietly increases spend.

Tool outputs need budgets, truncation, summarization, structure, and relevance scoring.

6. Retry loops

Retries are valuable when they converge. They are waste when they repeat the same failed strategy.

A cost-aware platform should detect low-progress loops:

  • same command failures
  • same tool calls
  • same files reread
  • same error class
  • same model-generated plan
  • same patch failure
  • same CI failure
  • same reviewer rejection pattern

At some point, the platform should change strategy, downgrade, escalate, or stop.

7. Evals on the wrong path

Evals are essential, but they can become expensive if every experiment runs synchronously on premium models.

Many eval workloads are ideal for batch processing, flex or lower-priority processing, sampling, smaller judges, cached fixtures, and staged escalation.

Production traffic and offline eval traffic should not share the same cost path.

8. Unbounded background automation

Interactive AI use has natural human backpressure. Background agents do not.

A scheduled workflow can run every hour, across every repo, with expensive models, retry loops, and no owner watching. This is where budget incidents happen.

Background automations need explicit budgets, owner identity, rate limits, kill switches, and outcome thresholds.

The Staff+ cost-control agenda

The Staff+ version of AI cost control is not asking engineers to use AI less.

It is building the platform mechanisms that make efficient usage the default:

  • route tasks to the cheapest model that reliably succeeds
  • cache stable context
  • compact irrelevant history
  • cap runaway loops
  • attribute spend to teams and workflows
  • connect spend to outcomes
  • expose showback before enforcing chargeback
  • make expensive behavior visible while the session is still running

The goal is not less AI. The goal is higher return on every model dollar.

What I would build first

I would start with cost visibility that changes behavior inside one quarter.

In the first 30 days:

  1. attribute every model request to user, team, workflow, repo, model, and agent session
  2. record input tokens, output tokens, cached tokens, latency, and estimated cost
  3. separate interactive usage from automated agent workflows
  4. identify the top repeated prompt prefixes and context blocks
  5. build a basic cost-by-workflow dashboard

In the first 60 days:

  1. calculate cost per completed task, not just cost per request
  2. identify high-cost failed sessions
  3. detect retry loops and repeated context ingestion
  4. add prompt-cache hit/miss reporting
  5. define model-routing tiers by task class and risk level

In the first 90 days:

  1. expose cost per successful outcome
  2. add budgets by workflow, model, team, and session
  3. add live alerts for runaway spend
  4. route low-risk tasks to cheaper models where success rates hold
  5. publish showback reports that connect AI spend to actual outcomes

That is the point where the cost conversation changes from “AI is expensive” to “we know which usage is valuable, which usage is waste, and which platform mechanisms reduce waste without slowing adoption.”

The engineering controls

1. Gateway-level accounting

The model gateway should record:

  • organization
  • user or service
  • team
  • workflow
  • agent session
  • request class
  • prompt template
  • model requested
  • model actually used
  • input tokens
  • output tokens
  • reasoning tokens
  • cached tokens
  • cache-creation tokens
  • latency
  • error class
  • retry group
  • policy decision
  • estimated cost
  • outcome link

If cost data lives only in vendor invoices or local logs, it arrives too late and with too little context.
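A minimal sketch of such a record, covering a subset of the fields above. The class name, field names, and model names are hypothetical; a real gateway would follow its own schema or an existing telemetry convention.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ModelCallRecord:
    team: str
    workflow: str
    agent_session: str
    model_requested: str
    model_used: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    latency_ms: int
    estimated_cost_usd: float
    retry_group: Optional[str] = None
    error_class: Optional[str] = None
    outcome_id: Optional[str] = None  # filled in later, once the outcome is known


def to_log_line(record: ModelCallRecord) -> str:
    """Serialize one call as a JSON line for a downstream cost pipeline."""
    return json.dumps(asdict(record), sort_keys=True)
```

Recording both `model_requested` and `model_used` makes routing decisions auditable, and the nullable `outcome_id` lets the outcome be joined in after the session ends.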

2. Prompt-cache engineering

Prompt caching is most valuable when large stable prefixes repeat. Agent systems naturally create these prefixes: system instructions, tool definitions, repository summaries, coding conventions, dependency graphs, and previously discovered project structure.

Good caching design requires:

  • stable prompt prefixes
  • deterministic ordering
  • separation of stable and volatile context
  • cache-key strategy
  • cache-hit metrics
  • cache-write metrics
  • TTL awareness
  • invalidation when repo state changes
  • comparison of cached versus uncached cost
  • privacy review of cache behavior

The key metric is not “is prompt caching enabled?” It is “what share of eligible input tokens are served from cache?”
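That share can be computed from per-request token counts. As a simplifying assumption, this sketch treats every input token as cache-eligible and uses hypothetical field names:

```python
def cache_hit_share(requests: list[dict]) -> float:
    """Share of input tokens served from cache across a set of requests.
    Each request dict uses hypothetical keys 'cached_input' and 'uncached_input'."""
    cached = sum(r["cached_input"] for r in requests)
    total = cached + sum(r["uncached_input"] for r in requests)
    return cached / total if total else 0.0


sample = [
    {"cached_input": 9_000, "uncached_input": 1_000},   # warm prefix
    {"cached_input": 0, "uncached_input": 10_000},      # cold start or broken prefix
]
share = cache_hit_share(sample)
```

Here caching is “enabled” for both requests, yet only 45% of input tokens were served from cache. Breaking the figure down by workflow or prompt template shows where prefixes are unstable.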

3. Model routing

A model router should not be a static dropdown. It should be policy-driven.

Inputs should include:

  • task type
  • user intent
  • context size
  • risk tier
  • historical model success rate
  • latency target
  • budget
  • data sensitivity
  • fallback strategy
  • need for tool use
  • expected reversibility

The router should produce an explicit decision: model selected, reason, fallback, expected cost, and confidence.

A routing system without explanations will be hard to debug and hard to trust.
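A deliberately small sketch of a router that returns an explainable decision. The two tiers, their costs, the model names, and the 90% success bar are all assumptions; a production router would weigh more of the inputs listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingDecision:
    model: str
    reason: str
    fallback: str
    expected_cost_usd: float
    confidence: float


# Hypothetical tiers: (model name, estimated USD per task).
TIERS = {"small": ("small-model", 0.02), "large": ("large-model", 0.60)}


def route(task_type: str, risk_tier: str, success_rates: dict[str, float],
          min_success_rate: float = 0.9) -> RoutingDecision:
    """Pick the small tier when the task is not high risk and the small tier's
    historical success rate clears the bar; otherwise use the large tier."""
    small_name, small_cost = TIERS["small"]
    large_name, large_cost = TIERS["large"]
    small_ok = success_rates["small"] >= min_success_rate
    if risk_tier != "high" and small_ok:
        return RoutingDecision(
            model=small_name,
            reason=f"{task_type}: small tier meets the {min_success_rate:.0%} success bar",
            fallback=large_name,
            expected_cost_usd=small_cost,
            confidence=success_rates["small"],
        )
    why = "high-risk task" if risk_tier == "high" else "small tier below success bar"
    return RoutingDecision(
        model=large_name,
        reason=f"{task_type}: {why}",
        fallback="human-review",
        expected_cost_usd=large_cost,
        confidence=success_rates["large"],
    )
```

Because every decision carries a `reason` and a `fallback`, a surprising bill can be traced back to the rule that produced it.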

4. Context budgets

Every workflow should have a context budget.

For example:

  • maximum initial repo context
  • maximum tool-output size
  • maximum rolling transcript size
  • maximum retrieved chunks
  • maximum log lines
  • maximum memory insertions
  • maximum retry context

Budgeting context forces prioritization. It also makes failures diagnosable.

When a task fails, you can ask: did it fail because the model was weak, because the context was wrong, because the budget was too tight, or because the workflow was ill-posed?
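A context budget can be enforced with a simple greedy pass. This sketch assumes each context block carries a priority and a token count (both hypothetical inputs) and reports what was dropped, which is what makes the failure questions above answerable:

```python
def fit_to_budget(blocks: list[tuple[int, str, int]],
                  max_tokens: int) -> tuple[list[str], list[str]]:
    """Greedy selection. Blocks are (priority, name, token_count); a lower
    priority number means more important. Returns (kept, dropped) names."""
    kept, dropped, used = [], [], 0
    for _priority, name, tokens in sorted(blocks):
        if used + tokens <= max_tokens:
            kept.append(name)
            used += tokens
        else:
            dropped.append(name)
    return kept, dropped


blocks = [
    (2, "old_transcript", 6_000),
    (0, "task", 500),
    (1, "failing_test", 1_500),
    (3, "full_log", 20_000),
]
kept, dropped = fit_to_budget(blocks, max_tokens=8_000)
```

Logging `dropped` alongside the run outcome lets a later review tell “the budget was too tight” apart from “the model was weak”.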

5. Tool-output shaping

Tools should return model-friendly outputs.

That means:

  • structured summaries instead of raw dumps
  • top-K relevant snippets instead of whole files
  • diff-aware file views
  • test failure extraction
  • error normalization
  • log compaction
  • source links for expansion
  • hard output-size limits

A good agent tool should optimize for decision quality per token.
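Test failure extraction with a hard size limit might look like this. The failure markers and the line cap are assumptions to tune per tool:

```python
def shape_test_output(raw: str, max_lines: int = 40,
                      markers: tuple[str, ...] = ("FAILED", "ERROR")) -> str:
    """Keep lines that mention a failure marker, capped at max_lines.
    A trailing note records how much was omitted so the model can ask for more."""
    lines = raw.splitlines()
    relevant = [ln for ln in lines if any(m in ln for m in markers)]
    kept = relevant[:max_lines]
    omitted = len(lines) - len(kept)
    if omitted:
        kept.append(f"[{omitted} of {len(lines)} lines omitted]")
    return "\n".join(kept)


raw = "\n".join([f"ok test_{i}" for i in range(1000)] + ["FAILED test_x - AssertionError"])
shaped = shape_test_output(raw)
```

A 1,001-line log becomes two lines: the failure and a note that the rest exists. The omission note matters, because silent truncation can mislead the model into thinking it saw everything.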

6. Loop detection and stop rules

Cost control needs runtime stop rules.

Examples:

  • stop after N failed attempts with no new evidence
  • stop when cost exceeds expected value
  • stop when the same command fails repeatedly
  • stop when the model rereads the same files
  • stop when CI failure class does not change
  • stop when patch attempts increase diff size without increasing pass rate
  • escalate when model confidence drops below threshold

A retry is a strategy only if the next attempt is different.
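The “same command fails repeatedly” rule can be sketched by hashing a normalized error signature. The normalization here (lowercasing, masking numbers and hex addresses) is one assumption among many possible ones, and `LoopDetector` is a hypothetical name:

```python
import hashlib
import re
from collections import Counter


def error_signature(command: str, stderr: str) -> str:
    """Mask volatile details (numbers, hex addresses) so the same failure
    hashes to the same signature across attempts."""
    normalized = re.sub(r"0x[0-9a-f]+|\d+", "N", stderr.strip().lower())
    return hashlib.sha256(f"{command}\n{normalized}".encode("utf-8")).hexdigest()[:12]


class LoopDetector:
    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.seen = Counter()

    def should_stop(self, command: str, stderr: str) -> bool:
        """Record a failed attempt; return True once the same failure
        has occurred max_repeats times."""
        sig = error_signature(command, stderr)
        self.seen[sig] += 1
        return self.seen[sig] >= self.max_repeats
```

A `True` result does not have to mean “abort”. It can equally trigger a strategy change or an escalation; the point is that the platform notices.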

7. Offline and batch routing

Not every workload needs immediate synchronous processing.

Good candidates for cheaper offline paths include:

  • evals
  • data enrichment
  • summarization backfills
  • repo indexing
  • documentation generation
  • periodic report generation
  • non-urgent codebase analysis
  • bulk classification

OpenAI’s cost optimization guidance explicitly highlights reducing requests, minimizing tokens, selecting smaller models, Batch API, and flex processing as cost levers. The general principle applies regardless of provider: match urgency and value to the processing tier.
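Matching urgency to a processing tier can start as a very small policy function. The tier names and the 24-hour threshold below are illustrative assumptions, not a provider's API:

```python
def processing_tier(interactive: bool, deadline_hours: float) -> str:
    """Map urgency to a hypothetical processing tier name."""
    if interactive:
        return "synchronous"   # a human is waiting
    if deadline_hours >= 24:
        return "batch"         # can tolerate delayed, bulk completion
    return "low-priority"      # not interactive, but needed sooner than a batch window
```

The useful part is that the workflow has to declare a deadline at all. Workloads with no stated urgency tend to default to the most expensive path.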

8. Showback before chargeback

Start with showback: make spend visible by team, workflow, repo, model, and outcome.

Chargeback can come later. Early chargeback often creates bad incentives: hidden usage, local workarounds, and underinvestment in shared infrastructure.

The first goal is not blame. It is shared reality.

9. Cost budgets as policy

Budgets should be policy objects, not spreadsheet reminders.

They should exist at multiple layers:

  • user
  • team
  • workflow
  • repo
  • model
  • agent session
  • tool class
  • environment

Budgets should trigger actions: warn, downgrade model, compact context, pause workflow, require approval, or stop.
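A budget as a policy object, sketched under the assumption that actions are tied to fractions of the limit. The scope naming, thresholds, and action names are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    scope: str        # e.g. "workflow:nightly-refactor" (hypothetical naming scheme)
    limit_usd: float
    # (fraction of limit, action) pairs; the highest crossed threshold wins.
    thresholds: tuple = (
        (1.0, "stop"),
        (0.9, "require_approval"),
        (0.75, "downgrade_model"),
        (0.5, "warn"),
    )

    def action_for(self, spent_usd: float) -> str:
        """Return the action for the highest threshold crossed, or 'allow'."""
        ratio = spent_usd / self.limit_usd
        for fraction, action in sorted(self.thresholds, reverse=True):
            if ratio >= fraction:
                return action
        return "allow"


budget = Budget(scope="workflow:nightly-refactor", limit_usd=100.0)
```

The gateway would call `action_for` before each request in the scope, so the response is graduated rather than a single hard cutoff at the limit.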

10. Eval-backed optimization

Every cost optimization should be evaluated.

A smaller model is not cheaper if it fails more. A summary is not cheaper if it drops the key fact. A cache is not better if it breaks privacy expectations. A context budget is not better if it increases human intervention.

The right metric is cost per successful outcome with quality held constant.
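“Quality held constant” can be turned into an acceptance gate. This sketch assumes eval results are summarized as run counts, success counts, and total cost (hypothetical keys), and that a one-point drop in success rate is the tolerance:

```python
def accept_optimization(baseline: dict, candidate: dict,
                        max_quality_drop: float = 0.01) -> bool:
    """Accept a change only if it lowers cost per success without dropping
    the success rate by more than max_quality_drop.
    Inputs use hypothetical keys: 'runs', 'successes', 'total_cost_usd'."""

    def success_rate(r: dict) -> float:
        return r["successes"] / r["runs"]

    def cost_per_success(r: dict) -> float:
        return r["total_cost_usd"] / r["successes"] if r["successes"] else float("inf")

    quality_held = success_rate(candidate) >= success_rate(baseline) - max_quality_drop
    return quality_held and cost_per_success(candidate) < cost_per_success(baseline)


baseline = {"runs": 100, "successes": 90, "total_cost_usd": 90.0}
cheaper_same_quality = {"runs": 100, "successes": 90, "total_cost_usd": 45.0}
cheaper_but_worse = {"runs": 100, "successes": 60, "total_cost_usd": 20.0}
```

The second candidate has the lowest cost per success of the three, yet the gate rejects it because the success rate fell from 90% to 60%. That is the failure mode a cost-only metric would miss.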

What good looks like in 3–6 months

Within 3–6 months, an AI platform should be able to explain where model spend goes and which usage creates value.

The platform should answer these questions in seconds:

  • Which teams, workflows, repos, and agents are spending the most?
  • Which sessions are burning tokens without making progress?
  • Which workflows have the best cost per successful outcome?
  • Which workflows have high failed-session spend?
  • Which prompts and context blocks repeat across sessions?
  • Which requests are hitting prompt cache?
  • Which requests should be routed to cheaper models?
  • Which tasks become more expensive after model, prompt, tool, or memory changes?
  • Which budget limits are being approached?
  • Which active sessions should be stopped or escalated?

The 3-month version needs attribution: request metadata, token accounting, cache accounting, session cost, workflow cost, and basic outcome linkage.

The 6-month version should add optimization: routing policies, context compaction, prompt-cache engineering, budget controls, live anomaly detection, and cost-per-success reporting.

The point is not to slow down AI adoption. The point is to make adoption economically accountable.

The executive metric

The executive metric for AI cost control is not total spend.

It is return on model spend:

How many successful outcomes did the organization buy with each model dollar?

That framing avoids the two bad extremes: unlimited spend with no accountability, and blunt restrictions that suppress useful adoption.

A mature platform should make the efficient path the default and the wasteful path visible.

Design principles

  1. Optimize cost per successful outcome, not tokens in isolation.
  2. Treat model routing as a platform decision, not a user preference.
  3. Cache stable context aggressively, but measure cache effectiveness.
  4. Separate valuable high-cost work from wasteful high-cost work.
  5. Detect runaway spend while the session is still running.
  6. Showback should come before chargeback.
  7. Budgets should be scoped by team, workflow, model, and session.
  8. Cost controls should increase trust in AI adoption, not suppress it.

What to measure

A useful AI cost dashboard should show:

Spend

  • total spend by model
  • spend by team
  • spend by workflow
  • spend by repository
  • spend by agent session
  • spend by request class
  • spend by retry group
  • failed-session spend

Token economics

  • input tokens
  • output tokens
  • reasoning tokens
  • cached input tokens
  • cache-creation tokens
  • cache hit rate
  • cache write rate
  • prompt prefix reuse
  • average context size by workflow
  • tool-output token share

Efficiency

  • cost per successful outcome
  • cost per merged PR
  • cost per CI fix
  • cost per accepted review
  • cost per human intervention avoided
  • cost per eval datapoint
  • retries per success
  • time to useful action

Optimization impact

  • model routing savings
  • prompt caching savings
  • context compaction savings
  • batch/offline processing savings
  • smaller-model success rate
  • failed-run termination savings
  • top waste patterns

The key dashboard is not “who spent the most?”

The key dashboard is “which workflows create the most value per dollar, and which ones silently waste money?”

The cultural mistake

The worst cultural response is to shame engineers for using AI.

That creates hidden usage, local workarounds, and underinvestment in platform solutions. Engineers will still use the tools because the productivity gain is real.

A better stance is: use AI aggressively, but make the platform smart enough to optimize the spend.

That means central routing, caching, policy, measurement, evals, and feedback loops.

The organizational operating model

A mature AI cost program needs three owners.

Platform engineering

Builds the model gateway, routing, attribution, caching, budgets, and telemetry.

Product and workflow owners

Define outcomes, value, quality thresholds, and acceptable latency.

Finance and leadership

Set macro budgets, approve investment, and measure ROI.

The platform team should not be asked to “cut AI spend” in isolation. It should be asked to improve cost per successful outcome.

That distinction matters.

Final thought

AI cost is not a bill to be explained after the fact. It is a runtime signal.

Every model call should carry enough metadata to answer: who requested this, what workflow is it part of, what did it cost, did it succeed, could it have been cheaper, and should we allow this pattern again?

The companies that answer those questions will scale AI usage without losing financial control.

The companies that cannot answer them will eventually choose between uncontrolled spend and blunt restrictions.

AI cost control should be built into the platform from the beginning.

Research notes used for this revision

  • OpenAI prompt caching documentation: prompt caching is intended to reduce latency and cost, works on recent models, is based on exact prefix matches, reports cache hits through cached_tokens, and is more effective when static prompt content appears before variable content.
  • Anthropic prompt caching documentation: caching supports automatic and explicit cache controls, default 5-minute TTL, optional 1-hour TTL, cache-read/cache-creation token fields, and different pricing multipliers for cache writes and reads.
  • OpenTelemetry GenAI semantic conventions: GenAI spans and metrics include token usage, cache-read input tokens, cache-creation input tokens, streaming timing, and model metadata.
  • OpenAI cost optimization documentation: recommended levers include reducing requests, minimizing tokens, selecting smaller models, Batch API, and flex processing.
  • OWASP LLM10:2025 Unbounded Consumption: LLM applications can face denial-of-wallet, excessive inference, resource exhaustion, and service degradation risks; mitigations include rate limiting, quotas, monitoring, throttling, access controls, and graceful degradation.