Empirica Agent Economy Series — Course Lesson


Executive Summary

LLM API costs are not fixed infrastructure expenses — they are variable, per-unit charges that scale directly with agent activity. For a single developer querying an API occasionally, the economics are trivial. For an agent fleet processing millions of requests daily, cost structure becomes a primary architectural constraint.

This lesson covers the mechanics of per-token pricing, the levers available to reduce cost through caching and routing, and how to build decision logic that keeps agent fleets economically viable. The content connects directly to the broader agent economy: agents that cannot manage their own compute costs cannot operate autonomously at scale.

Who this is for: Developers building agent systems, technical product managers, and anyone designing multi-agent pipelines where API spend is a real budget line.


1. Per-Token Economics Fundamentals

What a Token Is

A token is the atomic unit of LLM computation. Roughly speaking, one token corresponds to about four characters of English text, or approximately three-quarters of a word. The exact mapping varies by tokenizer — code, non-Latin scripts, and structured data (JSON, XML) often tokenize less efficiently than plain prose.

Practical implications:

  • A 1,000-word document is approximately 1,300–1,500 tokens
  • A dense JSON payload of the same character count may be 1,600–2,000 tokens
  • Code with verbose variable names tokenizes worse than abbreviated code
  • Whitespace and formatting characters consume tokens
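As a rough illustration of the four-characters-per-token rule of thumb, the sketch below estimates token counts from character length. The function name and the default ratio are illustrative assumptions; it is not a tokenizer, and real counts should come from the provider's own tokenizer.

```python
import math

# Rough average for English prose (an assumption, not a tokenizer).
CHARS_PER_TOKEN = 4.0


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Estimate the token count from character length, rounding up."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


prose = "word " * 1000          # 5,000 characters
print(estimate_tokens(prose))   # 1250
```

For JSON or code, passing a lower `chars_per_token` (for example 3) gives a more conservative estimate, in line with the less efficient tokenization described above.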

Input vs. Output Tokens

All major LLM APIs price input and output tokens separately, and output tokens are consistently more expensive — typically 3× to 5× the input rate. This asymmetry exists because generating tokens requires sequential autoregressive computation, while processing input tokens can be parallelised more efficiently.

Cost asymmetry consequences for agent design:

  • Verbose system prompts are cheaper than verbose model responses
  • Asking a model to "think step by step" in its output increases cost substantially
  • Structured output formats (JSON with many fields) cost more than terse responses
  • Chain-of-thought reasoning, while often more accurate, carries a direct cost premium

The Token Budget Mental Model

Every agent interaction has a token budget with two components:

Total Cost = (Input Tokens × Input Rate) + (Output Tokens × Output Rate)

For a typical agent turn:

  • System prompt: 200–2,000 tokens (fixed per call)
  • Conversation history / context: 0–100,000+ tokens (grows with session length)
  • User message / task: 10–500 tokens
  • Tool call results injected into context: 100–10,000 tokens
  • Model response: 50–2,000 tokens

The context window — the total tokens the model can process at once — sets a hard ceiling. Costs scale linearly within that ceiling.
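The cost formula above can be sketched as a small function. The rates used here ($3 and $15 per 1M tokens) are hypothetical placeholders chosen to reflect a 5× output premium, not any provider's actual prices.

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_rate_per_m: float, output_rate_per_m: float) -> float:
    """Cost in dollars of one call; rates are dollars per 1M tokens."""
    total = input_tokens * input_rate_per_m + output_tokens * output_rate_per_m
    return total / 1_000_000


# Hypothetical rates: $3 per 1M input tokens, $15 per 1M output tokens.
cost = call_cost(input_tokens=6_000, output_tokens=500,
                 input_rate_per_m=3.0, output_rate_per_m=15.0)
print(f"${cost:.4f}")  # $0.0255
```

In this example the 500 output tokens account for roughly 30% of the cost despite being under 8% of the tokens, which is the asymmetry described earlier.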


2. Cost Drivers and Pricing Models

Pricing Tiers Across Model Classes

LLM providers offer models across a wide capability-cost spectrum. Under the current market structure, the rough tiers are:

| Tier                   | Typical Use Case                    | Relative Cost (per 1M tokens)  |
|------------------------|-------------------------------------|--------------------------------|
| Frontier (largest)     | Complex reasoning, novel tasks      | High (baseline)                |
| Mid-tier               | General instruction following       | 10×–20× cheaper than frontier  |
| Small/fast             | Classification, extraction, routing | 50×–100× cheaper than frontier |
| Specialised fine-tuned | Domain-specific narrow tasks        | Varies; often mid-tier pricing |

These ratios shift as providers compete and new models release, but the structural spread between tiers has remained wide. The implication: using a frontier model for every task in an agent fleet is equivalent to shipping all packages by overnight courier when most could go standard post.

Context Length as a Cost Multiplier

Long-context models charge per token regardless of whether the model "attends" to all of it equally. Filling a 128K context window costs 128× as much in input tokens as a 1K context call, even if only a small fraction of that context is relevant to the query.

Key cost drivers in practice:

  • Retrieval-augmented generation (RAG): Each retrieved chunk adds tokens. Retrieving 10 chunks of 500 tokens each adds 5,000 input tokens per call.
  • Tool results: API responses, database query results, and web content injected into context can be large and poorly compressed.
  • Multi-turn conversation history: Without truncation or summarisation, history grows unboundedly.
  • Few-shot examples: Including 5 examples of 200 tokens each adds 1,000 tokens to every call.
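To make the history driver concrete, the sketch below totals the input tokens billed over a session in which every call resends the full history. The token sizes are illustrative assumptions. Because each call pays again for all earlier turns, the session total grows roughly with the square of the number of turns, even though each individual call is priced linearly.

```python
def cumulative_input_tokens(turns: int, system_tokens: int,
                            user_tokens: int, reply_tokens: int) -> int:
    """Total input tokens billed across a session with untruncated history."""
    total = 0
    history = 0
    for _ in range(turns):
        # Each call sends the system prompt, all prior turns, and the new message.
        total += system_tokens + history + user_tokens
        # The new message and the model's reply both join the history.
        history += user_tokens + reply_tokens
    return total


with_history = cumulative_input_tokens(turns=8, system_tokens=500,
                                       user_tokens=100, reply_tokens=300)
without_history = 8 * (500 + 100)
print(with_history, without_history)  # 16000 4800
```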

Batch vs. Real-Time Pricing

Several providers offer batch inference at significant discounts (often 50% off real-time rates) in exchange for higher latency, with responses delivered within hours rather than seconds. For agent tasks that are not latency-sensitive (overnight data processing, bulk document analysis, asynchronous research tasks), batch pricing roughly halves cost with no quality tradeoff.


3. Caching Strategies for Cost Reduction

Prompt Caching: The Mechanics

Prompt caching allows providers to store the KV (key-value) cache of a processed prompt prefix, so that subsequent requests sharing that prefix do not re-process those tokens. The provider charges a reduced rate (typically 50%–90% discount) for cache hits on input tokens.

How it works structurally:

  1. A request arrives with a long system prompt + short user message
  2. The provider processes and caches the system prompt's KV state
  3. The next request with the identical system prompt prefix hits the cache
  4. Only the new user message tokens are processed at full rate
  5. The cached prefix tokens are charged at the discounted cache-hit rate

Requirements for cache hits:

  • The cached prefix must be byte-identical (exact match, including whitespace)
  • The prefix must meet a minimum length threshold (typically 1,024 tokens)
  • Cache entries expire after a provider-defined window (minutes to hours)
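The billing effect of a cache hit can be sketched as follows. The rate, the 90% discount, and the function itself are illustrative assumptions; the 1,024-token minimum is the typical threshold mentioned above. Some providers also bill cache writes at a premium, which this sketch ignores.

```python
def input_cost_with_cache(prefix_tokens: int, suffix_tokens: int,
                          input_rate_per_m: float, cache_discount: float,
                          cache_hit: bool, min_prefix_tokens: int = 1024) -> float:
    """Input cost in dollars for one call; rates are dollars per 1M tokens."""
    cacheable = prefix_tokens >= min_prefix_tokens
    if cache_hit and cacheable:
        prefix_rate = input_rate_per_m * (1 - cache_discount)
    else:
        prefix_rate = input_rate_per_m  # miss, or prefix too short to cache
    total = prefix_tokens * prefix_rate + suffix_tokens * input_rate_per_m
    return total / 1_000_000


# Hypothetical: 2,000-token static prefix, 200-token message, $3 per 1M tokens.
miss = input_cost_with_cache(2_000, 200, 3.0, 0.9, cache_hit=False)
hit = input_cost_with_cache(2_000, 200, 3.0, 0.9, cache_hit=True)
print(f"miss ${miss:.4f}, hit ${hit:.4f}")  # miss $0.0066, hit $0.0012
```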

Designing Prompts for Cache Efficiency

Cache efficiency requires deliberate prompt architecture:

[STATIC — cache this]
System prompt (role, instructions, constraints)
Few-shot examples
Tool definitions
Domain knowledge / reference material

[DYNAMIC — not cached]
Conversation history
Current user message
Tool call results

The static prefix should be maximised; dynamic content should be appended at the end. Any modification to the static section — even a single character — breaks the cache and forces full reprocessing.

Anti-patterns that destroy cache efficiency:

  • Injecting timestamps or session IDs into the system prompt
  • Randomising example order on each call
  • Including user-specific data in the static section
  • Dynamic few-shot selection that changes the prefix
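One way to keep the prefix byte-stable is to build it once as a constant and only ever append dynamic content after it. The sketch below follows that layout; the prompt text, function names, and the fingerprint check are illustrative assumptions rather than any provider's API.

```python
import hashlib

# Built once at import time: no timestamps, session IDs, or per-user data.
STATIC_PREFIX = "\n\n".join([
    "You are a support agent. Answer concisely.",                # instructions
    "Example: Q: How do I reset my password? A: Use the link.",  # fixed-order example
    "Tools: lookup_order(order_id)",                             # tool definitions
])


def build_prompt(history: list[str], user_message: str) -> str:
    """Place the static prefix first and append all dynamic content after it."""
    dynamic = "\n".join(history + [user_message])
    return STATIC_PREFIX + "\n\n" + dynamic


def prefix_fingerprint(prompt: str) -> str:
    """Hash the leading characters to detect accidental prefix drift."""
    head = prompt[:len(STATIC_PREFIX)]
    return hashlib.sha256(head.encode("utf-8")).hexdigest()


a = build_prompt([], "Where is my order?")
b = build_prompt(["User: hi", "Agent: hello"], "Cancel my order")
print(prefix_fingerprint(a) == prefix_fingerprint(b))  # True
```

Logging the fingerprint alongside each call is one inexpensive way to notice when a deployment change has silently altered the prefix.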

Application-Level Caching

Beyond provider-side prompt caching, application-level caching stores complete request-response pairs:

  • Exact match cache: Hash the full prompt; return stored response if hash matches. Effective for repeated identical queries (e.g., FAQ-style agent interactions).
  • Semantic cache: Embed the query; return stored response if cosine similarity exceeds a threshold. Catches paraphrased versions of the same question. Requires an embedding model and a vector store.
  • Deterministic task cache: For tasks with deterministic outputs (format conversion, fixed calculations), cache indefinitely. For tasks with time-sensitive outputs (news summaries, market data), set TTL accordingly.
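A minimal sketch of the exact-match variant with an optional TTL is shown below, using an in-memory dictionary. The class name and interface are assumptions for illustration; a production cache would more likely sit in a shared store, and a semantic cache would replace the hash lookup with a similarity search.

```python
import hashlib
import time


class ExactMatchCache:
    """In-memory response cache keyed by a hash of the full prompt."""

    def __init__(self, clock=time.monotonic):
        self._store = {}      # key -> (response, expires_at or None)
        self._clock = clock

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def put(self, prompt: str, response: str, ttl_seconds=None) -> None:
        """Store a response; ttl_seconds=None means it never expires."""
        expires_at = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._store[self._key(prompt)] = (response, expires_at)

    def get(self, prompt: str):
        """Return the cached response, or None on a miss or expired entry."""
        key = self._key(prompt)
        entry = self._store.get(key)
        if entry is None:
            return None
        response, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._store[key]
            return None
        return response


cache = ExactMatchCache()
cache.put("Convert 5 km to miles", "3.11 miles")                   # deterministic
cache.put("Summarise today's news", "(summary)", ttl_seconds=900)  # time-sensitive
print(cache.get("Convert 5 km to miles"))  # 3.11 miles
```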

Cache hit rate economics: If an agent fleet has a 30% cache hit rate on calls averaging 2,000 input tokens, and cached calls cost 80% less, the effective input token cost drops by 24% fleet-wide — before any other optimisation.
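The 24% figure is the hit rate multiplied by the discount, assuming cached and uncached calls have the same average input size. A short check, with an illustrative function name:

```python
def effective_input_cost_reduction(hit_rate: float, cached_discount: float) -> float:
    """Fraction of fleet input cost saved, assuming equal token volume per call."""
    return hit_rate * cached_discount


saving = effective_input_cost_reduction(hit_rate=0.30, cached_discount=0.80)
print(f"{saving:.0%}")  # 24%
```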

Context Compression

When full caching is not possible, compressing what enters the context window reduces cost:

  • Summarise conversation history rather than passing raw turns. A 10-turn conversation of 3,000 tokens can often be summarised to 300 tokens with minimal information loss for most tasks.
  • Chunk and filter retrieved documents — pass only the most relevant sentences, not full documents.
  • Truncate tool outputs — API responses often contain metadata, headers, and fields irrelevant to the agent's task. Strip them before injection.
  • Use structured formats efficiently — a flat JSON object is cheaper than a nested one with repeated keys.
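As one example of truncating tool outputs, the sketch below keeps only an allow-list of fields from a JSON response and serialises it compactly. The payload and field names are invented for illustration, and the function assumes the response is a top-level JSON object.

```python
import json


def compact_tool_output(raw_json: str, keep_fields: set[str]) -> str:
    """Keep only the allow-listed top-level fields and drop extra whitespace."""
    data = json.loads(raw_json)
    trimmed = {key: value for key, value in data.items() if key in keep_fields}
    return json.dumps(trimmed, separators=(",", ":"))


raw = ('{"order_id": "A17", "status": "shipped", "request_id": "f3c9", '
       '"headers": {"x-rate-limit": 100}, "eta_days": 2}')
compact = compact_tool_output(raw, {"order_id", "status", "eta_days"})
print(compact)  # {"order_id":"A17","status":"shipped","eta_days":2}
print(len(raw), "->", len(compact), "characters")
```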

4. Model Routing and Selection Logic

The Core Routing Principle

Not every task requires the same model. Routing is the practice of directing each agent request to the cheapest model capable of handling it adequately. The economic leverage is large: routing 70% of calls to a model that costs 20× less than the frontier model reduces that portion's cost by 95%.
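Note that the 95% figure applies to the routed portion only. The fleet-wide saving depends on the remaining calls that still go to the frontier model, as this short calculation shows (assuming equal token volume per call; the function name is illustrative):

```python
def blended_cost_fraction(routed_share: float, cost_ratio: float) -> float:
    """Fleet cost as a fraction of the all-frontier baseline."""
    return (1 - routed_share) + routed_share / cost_ratio


fraction = blended_cost_fraction(routed_share=0.70, cost_ratio=20)
print(f"{1 - fraction:.1%} total saving")  # 66.5% total saving
```

The unrouted 30% dominates the remaining bill, which is why routing is usually combined with caching and compression on the frontier calls.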

Routing Dimensions

By task complexity:

  • Simple extraction, classification, yes/no decisions → small/fast model
  • Multi-step reasoning, code generation, novel synthesis → frontier model
  • Structured data transformation → mid-tier or fine-tuned model

By latency requirement:

  • Sub-second user-facing response → real-time, fast model
  • Background processing, no user waiting → batch API, any model

By context length:

  • Short context (< 4K tokens) → any model, optimise for cost
  • Long context (> 32K tokens) → only models with sufficient windows; compare per-token rates carefully as long-context pricing varies

By output format:

  • JSON with schema → models with reliable structured output support
  • Free-form prose → broader model selection
  • Code → models with strong code benchmarks

Routing Implementation Patterns

Static routing (rule-based): Define task types in advance and map each to a model tier. Simple, predictable, zero overhead. Works well when task types are known and stable.

IF task_type == "classification" → small_model
IF task_type == "summarisation" AND length < 2000 → mid_model  
IF task_type == "reasoning" OR task_type == "code_gen" → frontier_model
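The rules above translate directly into a small Python function. The final fallback is an assumption added here: the rules do not say what to do with unknown task types or long summarisation jobs, so this sketch sends them to the most capable tier.

```python
def route(task_type: str, length: int = 0) -> str:
    """Map a task to a model tier using fixed rules."""
    if task_type == "classification":
        return "small_model"
    if task_type == "summarisation" and length < 2000:
        return "mid_model"
    if task_type in ("reasoning", "code_gen"):
        return "frontier_model"
    # Assumption: anything not covered by the rules goes to the frontier tier.
    return "frontier_model"


print(route("classification"))              # small_model
print(route("summarisation", length=1500))  # mid_model
print(route("summarisation", length=5000))  # frontier_model
```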

Dynamic routing (classifier-based): A lightweight model (or embedding-based classifier) evaluates each incoming request and assigns it to a tier. Adds one small model call per request but can route more accurately than static rules. The routing call itself should cost < 1% of the routed call's cost to be economical.

Cascade routing (try-cheap-first): Send the request to a cheaper model first. If the response meets a quality threshold (checked by a validator or a second model), return it. If not, escalate to a more capable model. Effective when most requests are simple but the distribution has a long tail of hard cases.
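A minimal sketch of the cascade pattern follows. The models are stand-in functions and the validator is a trivial non-empty check; both are assumptions for illustration, and a real validator would apply task-specific quality checks.

```python
from typing import Callable, Sequence


def cascade(prompt: str,
            models: Sequence[Callable[[str], str]],
            is_acceptable: Callable[[str], bool]) -> tuple[str, int]:
    """Try models cheapest-first; return (response, index of answering model)."""
    if not models:
        raise ValueError("at least one model is required")
    response = ""
    for index, model in enumerate(models):
        response = model(prompt)
        if is_acceptable(response):
            return response, index
    # Nothing passed validation: fall back to the last (strongest) model's answer.
    return response, len(models) - 1


def small_model(prompt: str) -> str:      # stand-in for a cheap model
    return "" if "why" in prompt else "yes"


def frontier_model(prompt: str) -> str:   # stand-in for an expensive model
    return "detailed explanation"


def non_empty(response: str) -> bool:
    return bool(response.strip())


print(cascade("is it open?", [small_model, frontier_model], non_empty))
# ('yes', 0)
print(cascade("why did it fail?", [small_model, frontier_model], non_empty))
# ('detailed explanation', 1)
```

Escalated requests pay for both calls, so the pattern only saves money when the cheap model succeeds often enough to offset the wasted first attempts.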

Cost-quality frontier tracking: Maintain a live model registry with current pricing and benchmark scores. Route based on the Pareto frontier — for a given quality requirement, always use the cheapest model that meets it. Update the registry as providers change pricing.
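Selecting from such a registry can be sketched as below. The model names, prices, and quality scores are invented placeholders; in practice the scores would come from your own task-specific evaluation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelEntry:
    name: str
    cost_per_m_tokens: float   # dollars per 1M tokens (hypothetical)
    quality_score: float       # 0 to 1 on your own rubric (hypothetical)


def cheapest_meeting(registry: list[ModelEntry],
                     min_quality: float) -> Optional[ModelEntry]:
    """Return the cheapest model whose score meets the bar, or None."""
    eligible = [m for m in registry if m.quality_score >= min_quality]
    if not eligible:
        return None
    return min(eligible, key=lambda m: m.cost_per_m_tokens)


REGISTRY = [
    ModelEntry("small", 0.2, 0.80),
    ModelEntry("mid", 1.0, 0.90),
    ModelEntry("frontier", 15.0, 0.97),
]

print(cheapest_meeting(REGISTRY, 0.85).name)  # mid
print(cheapest_meeting(REGISTRY, 0.99))       # None
```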

Fallback and Redundancy

Agent fleets operating at scale need routing logic that handles provider outages, rate limits, and latency spikes:

  • Primary/fallback pairs: Each model slot has a designated fallback at the same or adjacent tier
  • Rate limit awareness: Track token consumption per minute/day against provider limits; pre-emptively route to secondary providers before hitting limits
  • Latency circuit breakers: If a provider's p95 latency exceeds a threshold, route away until it recovers
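The latency circuit breaker can be sketched with a rolling window of recent latencies and a nearest-rank p95. The class, thresholds, and provider names are assumptions for illustration. Recovery is not modelled here: a real breaker would send occasional probe requests to the primary so that new samples can close it again.

```python
import math
from collections import deque


class LatencyBreaker:
    """Opens when the rolling p95 latency exceeds a threshold."""

    def __init__(self, threshold_s: float, window: int = 100, min_samples: int = 20):
        self.threshold_s = threshold_s
        self.min_samples = min_samples
        self._samples = deque(maxlen=window)  # keeps only the most recent calls

    def record(self, latency_s: float) -> None:
        self._samples.append(latency_s)

    def p95(self) -> float:
        """Nearest-rank 95th percentile of the current window."""
        if not self._samples:
            return 0.0
        ordered = sorted(self._samples)
        rank = max(1, math.ceil(0.95 * len(ordered)))
        return ordered[rank - 1]

    def is_open(self) -> bool:
        enough = len(self._samples) >= self.min_samples
        return enough and self.p95() > self.threshold_s


def pick_provider(primary: str, fallback: str, breaker: LatencyBreaker) -> str:
    """Route to the fallback while the primary's breaker is open."""
    return fallback if breaker.is_open() else primary


breaker = LatencyBreaker(threshold_s=2.0)
for _ in range(20):
    breaker.record(0.5)
print(pick_provider("provider_a", "provider_b", breaker))  # provider_a
for _ in range(5):
    breaker.record(5.0)
print(pick_provider("provider_a", "provider_b", breaker))  # provider_b
```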

5. Agent Fleet Optimization

Fleet-Level vs. Single-Agent Economics

A single agent making 100 calls/day at $0.01/call costs $1/day, which is negligible. A fleet of 1,000 agents each making 1,000 calls/day at the same rate costs $10,000/day, or $3.65M/year. The arithmetic is linear, but the consequences are not: spend at this scale forces architectural decisions that single-agent deployments can ignore.
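The figures above follow from a simple product, shown here as a quick check (the function name is illustrative):

```python
def daily_cost(agents: int, calls_per_agent_per_day: int, cost_per_call: float) -> float:
    """Daily spend in dollars for a fleet with uniform call volume."""
    return agents * calls_per_agent_per_day * cost_per_call


single = daily_cost(1, 100, 0.01)
fleet = daily_cost(1_000, 1_000, 0.01)
print(f"${single:,.2f}/day vs ${fleet:,.0f}/day (${fleet * 365:,.0f}/year)")
# $1.00/day vs $10,000/day ($3,650,000/year)
```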

Token Budget Allocation Per Agent Role

In a multi-agent system, different agent roles have different cost profiles:

| Agent Role           | Dominant Cost Driver                     | Optimisation Priority                   |
|----------------------|------------------------------------------|-----------------------------------------|
| Orchestrator         | Long context (managing subagent outputs) | Context compression, summarisation      |
| Researcher/retriever | RAG chunk injection                      | Chunk filtering, semantic deduplication |
| Executor/tool-caller | Tool output injection                    | Output truncation, structured parsing   |
| Validator/critic     | Short, focused prompts                   | Small model routing                     |
| Synthesiser          | Long output generation                   | Output length control, structured formats |

Shared Context and Context Reuse

When multiple agents in a fleet process the same document, knowledge base, or system context, sharing that context rather than re-injecting it per agent reduces redundant token spend:

  • Shared prompt prefix caching: All agents in a fleet use the same system prompt, maximising cache hit rates across the fleet
  • Centralised context store: Rather than each agent maintaining its own conversation history, a shared store allows agents to read only the delta since their last call
  • Pre-computed embeddings: Embed documents once; all agents query the same vector store rather than re-embedding per agent

Cost Monitoring and Alerting

Fleet cost management requires instrumentation:

  • Per-agent cost tracking: Log tokens in/out per agent per call; aggregate by agent role, task type, and time period
  • Cost anomaly detection: Flag agents whose per-call token consumption exceeds 2× their historical average — often indicates prompt injection, runaway loops, or context accumulation bugs
  • Budget caps with graceful degradation: When an agent approaches its token budget, switch to a cheaper model or reduce context rather than failing hard
  • Cost attribution: In multi-tenant or multi-project deployments, attribute costs to the originating task or user for accurate unit economics
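The tracking, anomaly, and budget-cap ideas above can be sketched together. The class, the 80% soft limit, and the model names are illustrative assumptions; the 2× factor is the threshold from the list.

```python
from collections import defaultdict


class CostTracker:
    """Tracks tokens per agent and flags calls far above the agent's average."""

    def __init__(self, anomaly_factor: float = 2.0):
        self.anomaly_factor = anomaly_factor
        self._totals = defaultdict(int)   # agent_id -> total tokens so far
        self._calls = defaultdict(int)    # agent_id -> number of calls so far

    def record(self, agent_id: str, input_tokens: int, output_tokens: int) -> bool:
        """Log one call; return True if it exceeds factor x the prior average."""
        tokens = input_tokens + output_tokens
        calls = self._calls[agent_id]
        anomalous = False
        if calls > 0:
            average = self._totals[agent_id] / calls
            anomalous = tokens > self.anomaly_factor * average
        self._totals[agent_id] += tokens
        self._calls[agent_id] += 1
        return anomalous

    def total_tokens(self, agent_id: str) -> int:
        return self._totals[agent_id]


def model_for_budget(used_tokens: int, budget_tokens: int,
                     preferred: str, cheaper: str, soft_limit: float = 0.8) -> str:
    """Degrade to a cheaper model once usage passes the soft limit."""
    return cheaper if used_tokens >= soft_limit * budget_tokens else preferred


tracker = CostTracker()
print(tracker.record("researcher-1", 1_000, 200))  # False (no history yet)
print(tracker.record("researcher-1", 900, 300))    # False
print(tracker.record("researcher-1", 4_000, 500))  # True (4,500 > 2 x 1,200)
print(model_for_budget(850, 1_000, "frontier", "mid"))  # mid
```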

6. Real-World Cost Scenarios

Scenario A: Document Processing Pipeline

Setup: 10,000 documents/day, each ~2,000 words. Task: extract structured data (entities, dates, amounts) and generate a 100-word summary.

Naive approach: Send each document to a frontier model with a 500-token system prompt.

  • Input: (500 system + 2,600 doc) × 10,000 = 31M tokens/day (2,000 words at ~1.3 tokens per word is 2,600 tokens)
  • Output: 130 tokens × 10,000 = 1.3M tokens/day (the 100-word summary only; extracted fields add to this)
  • At frontier pricing, this is expensive and unnecessary for extraction tasks.

Optimised approach:

  1. Route extraction to a fine-tuned small model (50× cheaper)
  2. Route summarisation to mid-tier model (15× cheaper)
  3. Enable prompt caching on the static system prompt (saves up to ~15% of input cost; note that a 500-token prompt is below the typical 1,024-token caching minimum, so the static prefix must first be extended, e.g. with few-shot examples, to qualify)
  4. Use batch API for both tasks (50% discount)
  5. Strip document metadata before injection (reduces average doc tokens by ~20%)

Result: Combined optimisations can reduce cost by 80%–95% versus the naive approach, with comparable or better accuracy on well-defined extraction tasks.

Scenario B: Interactive Customer-Facing Agent

Setup: 50,000 conversations/day, average 8 turns, mix of simple FAQ and complex troubleshooting.

Cost drivers:

  • Conversation history grows each turn: by turn 8, history may be 3,000–5,000 tokens
  • Real-time latency is required, so batch pricing is not available
  • Quality must be consistent, so not all calls can be routed to small models

Optimisation levers:

  1. Dynamic routing: Classify each user message; route ~60% (simple FAQ, greetings, status checks) to small model
  2. History summarisation: After turn 4, summarise earlier turns rather than passing raw history
  3. Prompt caching: Static system prompt + FAQ knowledge base cached; only conversation delta processed at full rate
  4. Semantic response cache: Cache responses to the 200 most common questions; ~20% hit rate on FAQ queries

Result: Effective cost per conversation reduced by 40%–60% versus unoptimised deployment.

Scenario C: Autonomous Research Agent Fleet

Setup: 200 agents running continuously, each performing web research, synthesis, and report generation.

Cost profile: Dominated by long-context synthesis calls — agents accumulate 20,000–80,000 tokens of research material before generating reports.

Optimisations:

  1. Aggressive chunk filtering: Before synthesis, run a small model to score relevance of each retrieved chunk; discard bottom 50% by relevance score
  2. Hierarchical summarisation: Summarise each source before adding to the synthesis context
  3. Model routing for sub-tasks: Use small models for relevance scoring, mid-tier for per-source summarisation, frontier only for final synthesis
  4. Shared knowledge cache: When multiple agents research overlapping topics, share retrieved and summarised content via a central store


7. Decision Framework for Practitioners

The Cost Optimisation Decision Tree

1. Is this task latency-sensitive?
   NO → Use batch API (50% discount, stop here for this lever)
   YES → Continue

2. Is this an exact or near-exact repeated query?
   YES → Check application cache first; serve cached response
   NO → Continue

3. What is the task complexity?
   LOW (classify, extract, format) → Route to small model
   MEDIUM (summarise, transform, moderate reasoning) → Route to mid-tier
   HIGH (novel reasoning, complex code, multi-step planning) → Route to frontier

4. Is the context > 8K tokens?
   YES → Apply compression (summarise history, filter chunks) before sending
   NO → Continue

5. Does the prompt have a static prefix > 1K tokens?
   YES → Ensure prompt caching is enabled and prefix is byte-stable
   NO → Continue

6. Is output length controllable?
   YES → Set explicit max_tokens; use structured output formats
   NO → Investigate why and constrain if possible
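The decision tree can be encoded as a function that returns the list of levers to apply. This sketch reads each step as an independent lever rather than an early exit, which is one interpretation of the tree; the `Task` fields and action strings are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Task:
    latency_sensitive: bool
    repeated_query: bool
    complexity: str                 # "low", "medium", or "high"
    context_tokens: int
    static_prefix_tokens: int
    output_length_controllable: bool


TIER_BY_COMPLEXITY = {"low": "small", "medium": "mid-tier", "high": "frontier"}


def plan(task: Task) -> list[str]:
    """Walk the six questions in order and collect the applicable levers."""
    actions = []
    if not task.latency_sensitive:
        actions.append("use batch API")
    if task.repeated_query:
        actions.append("check application cache first")
    actions.append(f"route to {TIER_BY_COMPLEXITY[task.complexity]} model")
    if task.context_tokens > 8_000:
        actions.append("compress context before sending")
    if task.static_prefix_tokens > 1_000:
        actions.append("enable prompt caching with a byte-stable prefix")
    if task.output_length_controllable:
        actions.append("set explicit max_tokens")
    else:
        actions.append("investigate unbounded output")
    return actions


overnight_extraction = Task(latency_sensitive=False, repeated_query=False,
                            complexity="low", context_tokens=12_000,
                            static_prefix_tokens=1_500,
                            output_length_controllable=True)
for step in plan(overnight_extraction):
    print(step)
```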

Cost vs. Quality Tradeoff Calibration

Not all tasks have the same quality requirements. A useful calibration exercise:

  • Run a sample of 100 representative tasks through both your current model and a cheaper alternative
  • Score outputs on task-specific criteria (accuracy, format compliance, completeness)
  • Calculate the quality delta — if the cheaper model scores 94% vs. 97% on your rubric, ask whether that 3% gap justifies a 10× cost difference for this specific task
  • Segment by task type — the quality gap between model tiers varies enormously by task; extraction and classification gaps are often negligible, while complex reasoning gaps can be large
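The calibration exercise reduces to comparing mean scores against the cost ratio. The sketch below uses pass/fail scores on 100 tasks that reproduce the 97% vs. 94% example; the per-task costs are hypothetical.

```python
def compare_models(scores_current: list[float], scores_cheaper: list[float],
                   cost_current: float, cost_cheaper: float) -> dict:
    """Summarise the quality gap and cost ratio between two models."""
    mean_current = sum(scores_current) / len(scores_current)
    mean_cheaper = sum(scores_cheaper) / len(scores_cheaper)
    return {
        "quality_current": mean_current,
        "quality_cheaper": mean_cheaper,
        "quality_delta": mean_current - mean_cheaper,
        "cost_ratio": cost_current / cost_cheaper,
    }


# 1 = task passed the rubric, 0 = failed; 100 representative tasks each.
current = [1] * 97 + [0] * 3
cheaper = [1] * 94 + [0] * 6
result = compare_models(current, cheaper, cost_current=0.010, cost_cheaper=0.001)
print(f"delta {result['quality_delta']:.2f}, cost ratio {result['cost_ratio']:.0f}x")
```

Running the same comparison per task type, rather than on the pooled sample, is what reveals where the cheaper model is safe to use.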

When Not to Optimise

Cost optimisation has diminishing returns and real risks:

  • Do not route safety-critical decisions to small models to save cost — the failure mode cost (incorrect medical, legal, or financial output) exceeds any token savings
  • Do not over-compress context for tasks where completeness matters — a summarised legal document may omit a clause that changes the answer
  • Do not cache responses for tasks where freshness is required — a cached answer to "what is the current price of X" is worse than no answer
  • Do not optimise prematurely — instrument first, identify the actual cost drivers, then optimise. The largest cost driver is rarely where intuition points.

8. Connection to Agent Economy

LLM API cost structure is not merely an engineering concern — it is a foundational economic constraint on the agent economy as a whole.

Agents as Economic Actors with Compute Budgets

In autonomous agent systems, each agent effectively has a compute budget denominated in tokens. An agent that cannot complete its task within its token budget either fails, produces degraded output, or requires human intervention — all of which have costs. Token budget management is therefore a core agent capability, not an afterthought.

This connects directly to the broader agent economy framework: agents that acquire, process, and sell information (as covered in the series material on knowledge markets) must price their services above their compute costs to remain viable. An agent whose research synthesis costs $0.50 per report in compute cannot sell that report for $0.30 and remain operational. The per-token cost floor sets the minimum viable price for any agent-produced output.

Cost Structure as a Competitive Moat

Agents (or the systems deploying them) that achieve lower per-unit compute costs can undercut competitors on price while maintaining margins, or reinvest savings into higher-quality models for the same budget. Caching strategies, routing efficiency, and context compression are therefore sources of competitive advantage in agent marketplaces — not just internal engineering hygiene.

The Commoditisation Trajectory

LLM API prices have fallen substantially and continue to fall as model efficiency improves and competition increases. This has two implications:

  1. Absolute cost optimisation matters less over time — a 10× price reduction makes today's optimisation effort worth less tomorrow
  2. Relative cost optimisation matters more — as prices fall, the agents with the best cost structures can operate at scales that were previously uneconomical, opening new market segments

The practical implication: invest in cost architecture (routing logic, caching infrastructure, monitoring) that remains valuable as prices change, rather than optimising for today's specific price points.

Agent-to-Agent Cost Delegation

In multi-agent systems where agents delegate tasks to subagents (as covered in the series material on agent-to-agent payment protocols), cost attribution becomes a settlement question. When Agent A delegates a research task to Agent B, who pays for Agent B's token consumption? The answer must be encoded in the delegation protocol as a pre-negotiated token budget, a cost-plus pricing model, or a fixed-fee arrangement. Undefined cost delegation is a common source of runaway spend in agent fleet deployments.
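Of the three options, the pre-negotiated token budget is the simplest to sketch: the delegating agent attaches a hard cap, and the subagent's usage is charged against it. The class and field names below are hypothetical and do not correspond to any specific protocol.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a charge would exceed the delegated budget."""


@dataclass
class DelegatedBudget:
    payer: str            # agent that funds the work
    delegate: str         # agent that consumes the tokens
    max_tokens: int
    used_tokens: int = 0

    def charge(self, tokens: int) -> int:
        """Record usage and return the remaining budget; reject overruns."""
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        if self.used_tokens + tokens > self.max_tokens:
            raise BudgetExceeded(
                f"{self.delegate} would exceed the {self.max_tokens}-token "
                f"budget granted by {self.payer}"
            )
        self.used_tokens += tokens
        return self.max_tokens - self.used_tokens


budget = DelegatedBudget(payer="agent_a", delegate="agent_b", max_tokens=50_000)
print(budget.charge(12_000))  # 38000
```

Rejecting the overrun before the call is made, rather than reconciling afterwards, is what prevents the runaway spend described above.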


Key Takeaways

Economics:

  • Output tokens cost 3×–5× more than input tokens; minimise unnecessary output verbosity
  • Model tiers span 50×–100× cost differences; routing is the highest-leverage single optimisation
  • Batch APIs offer ~50% discounts for latency-tolerant tasks; use them

Caching:

  • Prompt caching requires byte-identical static prefixes; architect prompts with a stable prefix and dynamic suffix
  • Application-level semantic caching can achieve 20%–40% hit rates on conversational workloads
  • Context compression (history summarisation, chunk filtering) reduces input tokens without provider-side changes

Routing:

  • Static rule-based routing is sufficient for well-defined task taxonomies
  • Cascade routing (try cheap first, escalate on failure) handles mixed-complexity workloads efficiently
  • Maintain a live model registry; pricing and capability rankings shift frequently

Fleet operations:

  • Instrument cost per agent role before optimising; the largest cost driver is rarely obvious
  • Shared prompt prefix caching across a fleet multiplies cache efficiency
  • Undefined cost delegation in agent-to-agent systems causes runaway spend

Strategic:

  • Token cost floor sets the minimum viable price for any agent-produced service
  • Cost architecture (routing, caching, monitoring) is a durable competitive advantage even as absolute prices fall
  • Do not optimise safety-critical paths for cost; the failure mode cost dominates


Further Reading

The following topic areas extend the concepts in this lesson:

  • Tokenisation mechanics: How different tokenizers (BPE, SentencePiece, tiktoken) handle different content types — relevant for estimating costs before calling APIs
  • KV cache internals: The transformer architecture's key-value cache and why prefix matching enables provider-side caching — understanding the mechanism helps design cache-friendly prompts
  • Vector databases and semantic caching: Infrastructure for application-level semantic response caching at scale
  • LLM benchmarking methodology: How to run your own quality-cost calibration across model tiers for your specific task distribution
  • Agent payment protocols and cost settlement: How multi-agent systems encode cost attribution in delegation contracts — covered in the Empirica Agent Economy Series on agent-to-agent payment protocols
  • Knowledge market pricing: How per-token compute costs propagate into the pricing of agent-produced information goods — covered in the Empirica Agent Economy Series on agent memory and knowledge markets

Empirica Agent Economy Series. This lesson assumes familiarity with basic LLM API concepts. No prior economics background required.