LLM API Cost Structure: Empirical Patterns in Production Agent Fleets

Overview

Per-token pricing remains the headline metric for LLM economics, but production agent fleets routinely report effective cost reductions of 60–95% versus naive per-token math through three compounding levers: prompt caching, model routing, and request shaping. This note moves past the published rate cards to examine measured savings patterns, the operational overhead each lever introduces, and what the resulting cost curve implies for agent-fleet operators — including Empirica itself, which runs research-generating agents continuously against frontier models.

Key findings

  • Headline per-token prices have collapsed roughly 10× per 18 months at constant capability. Anthropic Claude Sonnet 4 lists at $3 input / $15 output per million tokens (Anthropic pricing — https://www.anthropic.com/pricing); OpenAI GPT-4o sits at $2.50 input / $10 output (OpenAI pricing — https://openai.com/api/pricing/); GPT-4o-mini at $0.15 / $0.60; Gemini 2.5 Flash at $0.075 / $0.30 (Google AI pricing — https://ai.google.dev/pricing). Two years earlier, GPT-4 launch pricing was $30 / $60. The capability-adjusted cost decline is steeper than Moore's Law and reshapes the entire optimisation calculus — many engineering hours spent shaving 20% from prompts are dominated by the next quarterly price cut. [EMPIRICA ANALYSIS]
  • Prompt caching delivers the largest single discount available to agents. Anthropic discounts cached input reads to 10% of base ($0.30 vs $3.00 per M tokens on Sonnet 4) with a 25% write premium; OpenAI offers ~50% off cached prefixes automatically; Google's implicit caching on Gemini 2.5 provides ~75% discount on repeated prefixes (vendor pricing pages above). For agents that issue thousands of calls against a stable system prompt and tool schema (the dominant pattern), 70–90% of input tokens are cacheable, producing effective input cost reductions of 60–80% in steady-state operation. [SPECULATIVE — magnitude varies by workload]
  • Routing between model tiers captures most of the remaining savings. Techniques like RouteLLM and FrugalGPT demonstrate that routing easy queries to small models while reserving frontier models for genuinely hard cases preserves >95% of frontier-model quality on benchmark suites at 20–40% of the cost. In production, the cheap-model share is typically 60–85% of traffic for well-classified workloads. [SPECULATIVE based on widely reported industry patterns]
  • Output tokens dominate total cost for reasoning-heavy agents. Output is priced 4–5× input on most providers, and reasoning models (o1, o3, Claude with extended thinking, DeepSeek-R1) emit 5–50× more output per request than chat-style calls. A single o1 call can cost $0.50–$5.00 in output alone (roughly 8K–83K output tokens at o1's $60 per million list price). For agent fleets, this means output-token discipline (structured outputs, max_tokens caps, early-exit prompts) often saves more than caching. [EMPIRICA ANALYSIS]
  • Batch APIs on OpenAI, Anthropic, and Google offer a flat 50% discount for asynchronous workloads that can tolerate a completion window of roughly 24 hours. They remain severely underused outside of evals and offline ETL.
  • Open-weight self-hosting breaks even at around 200M–1B daily tokens of sustained load for 70B-class models, assuming H100 capacity rented at ~$2/hr spot. Below that volume, API access is 3–10× cheaper than self-hosting. The crossover is moving against self-hosting as API prices fall faster than GPU rental rates. [SPECULATIVE]
  • PagedAttention and continuous batching (vLLM, TensorRT-LLM, SGLang) raise self-hosted throughput 5–20× over naive serving, but this only matters past the volume threshold above.
  • Failed and retried calls are a silent 5–15% tax on real fleets. Tool-call malformation, JSON parsing failures, and rate-limit retries are rarely modelled in cost projections but show up clearly in monthly invoices.
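
To see how the caching, routing, and retry findings above compound, the following is a minimal sketch in Python, not a measured model. The prices are the Sonnet 4 and GPT-4o-mini list prices quoted above, with the 10% and 50% cached-read rates. Everything else is a hypothetical assumption chosen for illustration: the workload shape (8,000 input and 500 output tokens per call), the 80% cacheable fraction, the 70% cheap-model share, and the 10% retry rate. The sketch also assumes steady state (cache already warm, so the write premium is ignored), identical token counts on both tiers, and no batch discount.

```python
from dataclasses import dataclass

MILLION = 1_000_000


@dataclass(frozen=True)
class Tier:
    """List prices in USD per million tokens for one model tier."""
    input_price: float
    output_price: float
    cache_read_multiplier: float  # 0.10 means cached reads cost 10% of base input


# Prices quoted in the findings above; pairing these two tiers is an assumption.
FRONTIER = Tier(input_price=3.00, output_price=15.00, cache_read_multiplier=0.10)
CHEAP = Tier(input_price=0.15, output_price=0.60, cache_read_multiplier=0.50)


def input_cost(tier: Tier, input_tokens: int, cached_fraction: float = 0.0) -> float:
    """Input cost of one call in steady state (warm cache, write premium ignored)."""
    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    cached_price = tier.input_price * tier.cache_read_multiplier
    return (uncached * tier.input_price + cached * cached_price) / MILLION


def call_cost(tier: Tier, input_tokens: int, output_tokens: int,
              cached_fraction: float = 0.0) -> float:
    """Total cost of one call: discounted input plus full-price output."""
    output = output_tokens * tier.output_price / MILLION
    return input_cost(tier, input_tokens, cached_fraction) + output


def fleet_cost_per_call(frontier: Tier, cheap: Tier, input_tokens: int,
                        output_tokens: int, cached_fraction: float,
                        cheap_share: float, retry_rate: float) -> float:
    """Blend the two tiers by traffic share, then apply retries as a flat surcharge."""
    blended = (
        cheap_share * call_cost(cheap, input_tokens, output_tokens, cached_fraction)
        + (1 - cheap_share) * call_cost(frontier, input_tokens, output_tokens, cached_fraction)
    )
    return blended * (1 + retry_rate)


if __name__ == "__main__":
    # Hypothetical workload: 8,000 input tokens and 500 output tokens per call.
    naive = call_cost(FRONTIER, 8_000, 500)  # all-frontier, no caching, no retries
    effective = fleet_cost_per_call(
        FRONTIER, CHEAP, 8_000, 500,
        cached_fraction=0.8, cheap_share=0.7, retry_rate=0.10,
    )
    print(f"naive:     ${naive:.5f} per call")
    print(f"effective: ${effective:.5f} per call")
    print(f"reduction: {1 - effective / naive:.1%}")
```

Under these hypothetical inputs the blended cost comes out roughly 83% below the naive all-frontier figure, even after the retry surcharge. On the frontier tier alone, caching cuts input cost by 72% (0.8 × 0.9), yet the uncached output tokens then account for more than half of that tier's per-call cost, which is consistent with the point about output-token discipline above.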

Agent service patterns — what fleets actually spend on

The naive mental model of agent cost is "tokens × price." The empirical structure in production fleets is closer to a four-layer cake: