LLM API Cost Structure for Agent Fleets: Per-Token Economics, Caching, and Model Routing
1. Overview
Autonomous agent fleets are now the dominant per-token consumer of frontier LLM APIs, with workloads characterised by long context windows, repeated prompt prefixes, and chained reasoning steps that amplify token consumption by 10–100× over chatbot baselines. The economic structure of these workloads is shifting from a flat "tokens × rate" pricing model toward a multi-tier regime that combines prompt caching discounts, batch-mode rebates, and aggressive small-model routing — collectively reshaping the unit economics of agent reasoning. This synthesis quantifies the cost stack agents face in 2025–2026, models caching ROI across workload archetypes, and frames model routing as an explicit economic optimisation problem with direct implications for where Empirica's structured research API sits in an agent's cost graph.
2. Key Findings
- Frontier token prices have compressed roughly 80–90% in 24 months on a quality-adjusted basis. OpenAI GPT-4o is priced at $2.50/M input and $10/M output tokens (https://openai.com/api/pricing/); Anthropic Claude Sonnet 4 at $3/M input and $15/M output (https://www.anthropic.com/pricing); Google Gemini 2.5 Pro at $1.25/M input under 200k context (https://ai.google.dev/pricing). The "small/cheap" tier (GPT-4o-mini at $0.15/$0.60 per M, Claude Haiku at $0.80/$4, Gemini Flash at $0.075/$0.30) is roughly 4–17× cheaper than the same vendor's flagship on the input prices quoted here (about 17× for OpenAI and Google, about 4× for Anthropic), which makes it the natural target for routine classification, extraction, and routing subtasks.
- Output tokens are 4–5× more expensive than input tokens across all major vendors. For agent loops with long planning preambles and short tool-call outputs, ~70–85% of spend is input-side; for content-generation agents the ratio inverts. This asymmetry is the single most important variable in caching ROI calculations.
- Prompt caching now offers 50–90% discounts on cached input tokens. Anthropic bills cache reads at 10% of the base input rate, after a cache write billed at a 25% premium, with a 5-minute TTL (https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching); OpenAI caches automatically at 50% off for prompts ≥1024 tokens (https://platform.openai.com/docs/guides/prompt-caching); DeepSeek offers context caching at roughly 10% of input price. At a 10% cached rate, cache hits on a 20k-token system prompt across 100 agent turns reduce input spend by ~85% versus uncached repetition, provided turns arrive within the TTL so the cache stays warm; at a 50% cached rate the same workload saves just under half.
- Agent workloads exhibit extreme prefix repetition. Empirical observation from multi-agent frameworks like AutoGen [P6] shows that conversational agents reuse system prompts, tool schemas, and few-shot exemplars across the entire session — typically 80–95% of input tokens are constant prefix material. This is the ideal pattern for prefix caching and KV-cache reuse via PagedAttention-style serving.
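- Batch APIs deliver an additional 50% discount at the cost of up to 24-hour latency (OpenAI Batch API, Anthropic Message Batches). For non-interactive agent workloads such as overnight enrichment, backfills, and evaluation runs, batch routing can halve marginal cost again. Where a vendor lets the batch and cache discounts combine, cached input tokens are billed at about 5% of list price (0.5 × 0.1), a ~95% discount versus naive synchronous flagship calls; output tokens and uncached input receive only the 50% batch discount.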
- Model routing techniques like RouteLLM and FrugalGPT report 50–85% cost reductions at <2% quality degradation on benchmark mixes by cascading queries from cheap to expensive models, escalating only on low-confidence outputs. The economic question for agent fleets is no longer "which model" but "what routing policy."
- Reasoning models (o1, o3, DeepSeek R1, Claude with extended thinking) invert the input-heavy cost structure of typical agent loops: spend becomes output-dominated. o1-preview is priced at $15/$60 per M and bills hidden reasoning tokens as output, so a request produces 5–20× the output token count of a standard completion. A single complex agent step can cost $0.10–$2.00 (at $60/M, roughly 1.7k–33k billed output tokens), making reasoning-model gating one of the highest-leverage cost decisions. DeepSeek R1's open weights and ~$0.55/$2.19 hosted pricing [P1] partially relieve this but introduce vendor diversification questions.
- The frontier training cost barrier remains $100M+ per flagship model [P7], which concentrates inference supply among ~5 vendors and means agent fleets cannot meaningfully negotiate token rates; the remaining cost levers are routing, caching, batching, and replacing tokens with structured knowledge.
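The caching, batch, and cascade arithmetic in the findings above can be sketched in a few lines of Python. This is a minimal illustration, not a billing model: the multipliers are taken from the figures quoted in the findings, while the names (CachePricing, input_cost_usd, saving, stacked_input_multiplier, cascade_saving), the 1k-token per-turn suffix, and the 20% escalation rate are assumptions introduced here for illustration. It also assumes every turn after the first is a cache hit, which holds only if turns arrive within the cache TTL.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CachePricing:
    """Multipliers relative to the base input rate (illustrative, from the findings)."""
    write_multiplier: float  # billed on the turn that writes the cache
    read_multiplier: float   # billed on each later cache hit


# Hypothetical profiles built from the quoted figures, not official SDK objects.
ANTHROPIC_STYLE = CachePricing(write_multiplier=1.25, read_multiplier=0.10)
OPENAI_STYLE = CachePricing(write_multiplier=1.00, read_multiplier=0.50)


def input_cost_usd(prefix_tokens: int, suffix_tokens: int, turns: int,
                   rate_per_m: float, pricing: Optional[CachePricing] = None) -> float:
    """Input cost over `turns`; assumes every turn after the first hits the cache."""
    per_token = rate_per_m / 1_000_000
    if pricing is None:
        prefix_units = float(turns)  # uncached: prefix billed in full every turn
    else:
        prefix_units = pricing.write_multiplier + (turns - 1) * pricing.read_multiplier
    # The per-turn suffix (new user/tool content) is never cached in this sketch.
    return per_token * (prefix_tokens * prefix_units + suffix_tokens * turns)


def saving(prefix_tokens: int, suffix_tokens: int, turns: int,
           rate_per_m: float, pricing: CachePricing) -> float:
    """Fraction of input spend saved by caching versus uncached repetition."""
    base = input_cost_usd(prefix_tokens, suffix_tokens, turns, rate_per_m)
    cached = input_cost_usd(prefix_tokens, suffix_tokens, turns, rate_per_m, pricing)
    return 1 - cached / base


def stacked_input_multiplier(cache_read: float, batch: float) -> float:
    """Price multiplier on cached input when cache and batch discounts combine."""
    return cache_read * batch


def cascade_saving(cheap_cost: float, flagship_cost: float, escalation_rate: float) -> float:
    """Saving of a cheap-first cascade versus always calling the flagship.

    Every query pays the cheap model; a fraction `escalation_rate` also pays the flagship.
    """
    expected = cheap_cost + escalation_rate * flagship_cost
    return 1 - expected / flagship_cost


if __name__ == "__main__":
    print(f"cache saving, 10% rate: {saving(20_000, 1_000, 100, 3.0, ANTHROPIC_STYLE):.1%}")
    print(f"cache saving, 50% rate: {saving(20_000, 1_000, 100, 2.5, OPENAI_STYLE):.1%}")
    print(f"cache + batch multiplier: {stacked_input_multiplier(0.10, 0.50):.2f}")
    print(f"cascade saving at 20% escalation: {cascade_saving(0.15, 2.50, 0.2):.1%}")
```

Under these assumptions, a 20k-token prefix with a 1k-token uncached suffix over 100 turns saves about 85% of input spend at the 10% cached rate and about 47% at the 50% rate. The cascade figure is sensitive to the escalation rate, which in practice depends on the confidence threshold of the routing policy and has to be measured per workload.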
3. Agent Service Patterns — What Agents Buy and Why
Agent fleets purchase three economically distinct categories of LLM service, each with different optimisation surfaces.