LLM API Cost Structure for Agent Fleets: Per-Token Economics, Caching, and Model Routing
A structured course lesson for all audiences — from first-time builders to fleet operators
Executive Summary
Running a fleet of AI agents at production scale is fundamentally a cost-engineering problem. The dominant cost driver is LLM inference, priced per token — a unit so small it appears trivial until multiplied across thousands of concurrent agents making hundreds of calls per hour. Three levers control the economics: how tokens are priced, how many tokens are consumed, and which model handles each request. Mastering all three — per-token economics, caching strategy, and model routing — is the difference between a profitable agent fleet and one that burns budget faster than it creates value.
This lesson builds a complete mental model for each lever, then assembles them into an operational playbook.
1. Per-Token Economics Fundamentals
What a Token Is
A token is the atomic unit of LLM computation — roughly 0.75 words in English, though this varies by language, code, and special characters. Every API call converts input text into tokens (the prompt) and generates output tokens (the completion). Both are billed, but at different rates.
The Asymmetry That Matters Most
Output tokens cost more than input tokens — typically 3× to 5× more, depending on the provider and model. This asymmetry has direct architectural consequences:
- Verbose system prompts are cheaper than verbose model responses
- Asking a model to "think step by step" in its output is expensive; pre-computing reasoning and injecting it as input context is cheaper
- Streaming long completions is a cost signal, not just a latency signal
Price Tiers Across Model Classes
| Model Class | Typical Input Price | Typical Output Price | Best Use Case |
|---|---|---|---|
| Frontier (e.g., GPT-4-class) | $10–$30 / 1M tokens | $30–$60 / 1M tokens | Complex reasoning, final synthesis |
| Mid-tier (e.g., GPT-3.5-class) | $0.50–$3 / 1M tokens | $1.50–$6 / 1M tokens | Classification, extraction, summarization |
| Small/fast (e.g., Haiku-class) | $0.25–$1 / 1M tokens | $1–$3 / 1M tokens | Routing, triage, simple Q&A |
| Open-weight self-hosted | Infrastructure cost only | Infrastructure cost only | High-volume, latency-sensitive, private data |
Note: Prices shift frequently. Treat these as order-of-magnitude anchors, not billing guarantees.
The Compounding Effect in Agent Fleets
A single agent call is cheap. A fleet is not. Consider:
- 100 agents × 50 calls/hour × 2,000 tokens/call = 10 million tokens/hour
- At frontier pricing ($30/1M output tokens), that is $300/hour in output costs alone
- Over a 720-hour month: $216,000 — from one cost component of one model tier
This arithmetic explains why token efficiency is a first-order engineering concern, not an optimization afterthought.
2. Token Counting and Cost Prediction Models
Why Prediction Is Hard
Token counts are not word counts. The same semantic content can tokenize differently depending on:
- Language: Non-Latin scripts tokenize less efficiently (more tokens per word)
- Code: Structured syntax often tokenizes compactly; variable names vary
- Formatting: Markdown headers, JSON brackets, and whitespace all consume tokens
- Model family: Each provider uses a different tokenizer (BPE variants, SentencePiece, etc.)
Practical Counting Tools
- Tiktoken (OpenAI): Open-source, accurate for GPT-family models
- Tokenizer endpoints: Most providers expose a count-tokens API call before committing to inference
- Heuristic budgeting: For English prose, 1,000 tokens ≈ 750 words ≈ 1.5 pages — useful for back-of-envelope planning
Building a Cost Prediction Model
A reliable fleet cost model requires four inputs per agent task type:
- Mean prompt token count (system prompt + context + user message)
- Mean completion token count (sampled from production or staging runs)
- Call frequency (calls per agent per hour, by task type)
- Model assignment (which tier handles this task type)
Multiply these through the pricing table, sum across task types, and you have a cost-per-agent-hour figure that can be tracked against actual billing.
Variance Is the Hidden Risk
Mean token counts mislead. A task with a mean of 800 tokens but a 95th-percentile of 8,000 tokens (e.g., when agents retrieve long documents) can blow cost budgets on tail events. Always model the distribution, not just the mean.
3. Caching Strategies: Context Window Optimization
The Core Insight
The most expensive token is one you've already paid for and are paying for again. Caching eliminates redundant computation. There are two distinct caching problems in agent fleets:
- Prompt caching: Reusing the KV-cache of a repeated prefix across multiple API calls
- Semantic caching: Storing complete responses and returning them for semantically equivalent queries without hitting the model at all
Context Window as a Cost Surface
Every model has a maximum context window (e.g., 128K tokens for some frontier models). Filling that window on every call is expensive. Strategies to reduce context size:
- Summarization compression: Replace raw conversation history with a running summary
- Retrieval-augmented generation (RAG): Inject only the retrieved chunks relevant to the current query, not the entire knowledge base
- Sliding window: Retain only the N most recent turns plus a compressed summary of earlier turns
- Structured state: Store agent state as compact JSON rather than natural language narration
The Retrieval-vs-Context Trade-off
Putting more context in the window increases accuracy on tasks requiring broad recall but increases cost linearly. RAG reduces cost but introduces retrieval latency and retrieval error risk. The optimal balance depends on task type:
- High-precision tasks (legal, medical, financial): Larger context windows often justified
- High-throughput tasks (classification, routing, extraction): Aggressive compression and RAG preferred
4. Prompt Caching Implementation
How Provider-Side Prompt Caching Works
Several major providers (Anthropic, OpenAI, Google) now offer prompt caching — a mechanism where the KV-cache of a long, repeated prompt prefix is stored server-side and reused across calls. The economics are significant:
- Cache hit: Input tokens in the cached prefix are billed at a steep discount (typically 50–90% reduction)
- Cache write: The first call that populates the cache is billed at standard or slightly elevated rates
- Cache lifetime: Caches expire (typically minutes to hours depending on provider); high-frequency calls are needed to amortize the write cost
Structural Requirements for Cache Efficiency
Prompt caching only works if the prefix is identical across calls. This imposes a structural discipline on prompt design:
- System prompt first, always: Place the static system prompt at the very beginning of every call
- Static before dynamic: All fixed content (instructions, persona, tool definitions, few-shot examples) must precede any dynamic content (user query, retrieved context, conversation history)
- No timestamps or session IDs in the prefix: Any variable content in the cached region breaks the cache hit
Calculating Cache ROI
Cache savings per call = (cached_prefix_tokens × standard_price) - (cached_prefix_tokens × cache_hit_price)
Break-even calls = cache_write_cost / savings_per_call
For a 10,000-token system prompt at $15/1M tokens standard price and $1.50/1M cache-hit price: - Savings per cache hit: 10,000 × ($15 - $1.50) / 1,000,000 = $0.135 per call - If cache write costs $0.1875 (at 1.25× standard rate): break-even at 2 calls
Any agent making more than 2 calls with the same system prompt benefits from caching. Most production agents make thousands.
Semantic Caching Layer
For queries that are not identical but are semantically equivalent, a semantic cache sits in front of the LLM:
- Embed the incoming query
- Search a vector store of previous (query, response) pairs
- If cosine similarity exceeds a threshold, return the cached response
- Otherwise, call the LLM and store the new pair
Trade-off: Semantic caches introduce staleness risk (cached answers may be outdated) and require careful threshold tuning (too aggressive → wrong answers returned; too conservative → low hit rate).
5. Model Routing Architectures
The Routing Premise
Not every task requires a frontier model. Routing — directing each request to the cheapest model capable of handling it adequately — is the highest-leverage cost reduction available to fleet operators. A 10× price difference between model tiers means routing even 50% of traffic to a cheaper model cuts total inference cost by roughly 45%.
Routing Architectures
1. Rule-Based Routing - Classify tasks by type at the application layer (e.g., "summarization" → mid-tier, "multi-step reasoning" → frontier) - Fast, predictable, zero additional LLM cost - Brittle: requires manual maintenance as task distributions shift
2. Classifier-Based Routing - A small, cheap classifier model (or fine-tuned embedding model) scores each incoming request for complexity - Routes to the appropriate tier based on score thresholds - Adds latency (one extra inference call) but that call is cheap if the classifier is small - Requires labeled training data: human-annotated examples of "this task needed frontier" vs "this task was fine on mid-tier"
3. Cascade Routing (Try-Cheap-First) - Send every request to the cheapest model first - Evaluate the response quality with a lightweight judge (another small model or rule-based check) - If quality is insufficient, escalate to the next tier - Risk: Latency doubles on escalated calls; not suitable for latency-sensitive applications - Benefit: No training data required; quality threshold is the only tunable parameter
4. LLM-as-Router - A small LLM reads the task and outputs a routing decision - More flexible than rule-based, cheaper than using a frontier model for routing - Introduces a meta-cost: the router itself consumes tokens
Routing Decision Matrix
| Architecture | Latency Impact | Training Data Required | Maintenance Burden | Best For |
|---|---|---|---|---|
| Rule-based | None | None | High | Stable, well-defined task types |
| Classifier | Low (+50–100ms) | Yes | Medium | High-volume fleets with labeled data |
| Cascade | High on escalation | None | Low | Quality-critical, latency-tolerant |
| LLM-as-router | Low–Medium | None | Low | Dynamic, diverse task distributions |
6. Cost-Performance Trade-offs by Model Class
The Pareto Frontier of Model Selection
Model selection is not a binary frontier-vs-cheap decision. It is a continuous trade-off across three dimensions:
- Cost (tokens priced per call)
- Latency (time to first token, total generation time)
- Capability (accuracy on the target task)
The Pareto-optimal choice depends on which dimension is the binding constraint for a given task.
Task-to-Model Mapping Framework
| Task Type | Capability Requirement | Recommended Tier | Rationale |
|---|---|---|---|
| Intent classification | Low | Small/fast | Binary or multi-class output; small models competitive |
| Named entity extraction | Low–Medium | Mid-tier | Structured output; few-shot prompting sufficient |
| Summarization | Medium | Mid-tier | Quality degrades gracefully; cost savings large |
| Code generation (simple) | Medium | Mid-tier | Well-defined output; mid-tier models strong |
| Multi-step reasoning | High | Frontier | Chain-of-thought quality matters; errors compound |
| Novel synthesis / strategy | High | Frontier | Requires broad world knowledge and reasoning depth |
| Tool call parsing | Low | Small/fast | Structured JSON output; deterministic enough for small models |
| Final user-facing response | Medium–High | Mid-tier or Frontier | Depends on quality bar and user expectations |
Capability Degradation Is Non-Linear
Moving from frontier to mid-tier on a simple task may cost 10× less with 2% accuracy loss. Moving from frontier to mid-tier on a complex reasoning task may cost 10× less with 40% accuracy loss. Benchmark your specific tasks before committing to a routing policy.
7. Real-World Fleet Optimization Case Studies
Case Study Pattern 1: The Long System Prompt Problem
Scenario: An agent fleet uses a 15,000-token system prompt (detailed instructions, tool definitions, few-shot examples). Each agent makes 200 calls/day. Fleet size: 500 agents.
Before optimization: 15,000 tokens × 200 calls × 500 agents = 1.5 billion input tokens/day. At $15/1M: $22,500/day.
After prompt caching: Cache hit rate of 95% (5% cache misses due to cache expiry). Effective input token cost for cached portion: $1.50/1M. - Cache hits: 1.425B tokens × $1.50/1M = $2,137 - Cache misses: 75M tokens × $15/1M = $1,125 - Total: $3,262/day — an 85% reduction
Case Study Pattern 2: Cascade Routing for a Classification-Heavy Fleet
Scenario: A document processing fleet where 70% of tasks are classification/extraction (mid-tier sufficient) and 30% require synthesis (frontier required).
Naive approach: All calls to frontier. 10M tokens/day × $30/1M output = $300/day.
With cascade routing: 70% handled by mid-tier at $3/1M output = $63. 30% escalated to frontier = $90. Total: $153/day — 49% reduction.
Additional cascade cost: 10M tokens × $0.50/1M (small model first pass) = $5/day. Net saving still ~47%.
Case Study Pattern 3: Context Compression via RAG
Scenario: Agents previously injected 50,000 tokens of background context per call. RAG implementation retrieves only the 3 most relevant chunks (~2,000 tokens).
Token reduction: 48,000 tokens/call × call volume. At 1M calls/month: 48 billion tokens saved. At $15/1M: $720,000/month saved — offset by vector search infrastructure costs (typically orders of magnitude cheaper).
8. Practical Cost Reduction Playbook
Priority-Ordered Actions
Tier 1: Immediate, low-risk (implement first)
- Enable prompt caching on all providers that support it. Restructure prompts so static content leads. Break-even is typically 2–3 calls.
- Audit system prompt length. Remove redundant instructions. Every 1,000 tokens removed from a 500-agent fleet saves proportionally across all calls.
- Switch tool-call parsing and intent classification to small models. These tasks do not require frontier capability.
Tier 2: Medium effort, high return
- Implement RAG for any agent that currently injects large static knowledge bases into context.
- Build a task classifier to route requests by complexity. Even a simple keyword-based classifier captures significant savings.
- Set output token limits per task type. Unconstrained
max_tokensparameters allow models to generate far more than needed.
Tier 3: Infrastructure investment, large-scale fleets
- Deploy semantic caching for high-volume, repetitive query patterns (e.g., FAQ-style agent interactions).
- Evaluate self-hosted open-weight models for tasks where data privacy, latency, or volume make API costs prohibitive.
- Implement cascade routing with automated quality evaluation to continuously optimize the routing threshold.
Anti-Patterns to Avoid
- Putting dynamic content before static content in prompts: Destroys cache hit rates
- Using frontier models for structured output tasks: JSON extraction does not require GPT-4-class reasoning
- Ignoring the 95th-percentile token distribution: Budget based on mean, blow budget on tail
- Caching without staleness controls: Semantic caches return outdated answers if not invalidated on knowledge updates
- Routing without benchmarking: Assuming mid-tier is "good enough" without measuring task-specific accuracy degradation
9. Monitoring and Attribution Framework
What to Instrument
Cost without attribution is noise. Every token consumed should be tagged with:
- Agent ID: Which agent instance generated the call
- Task type: What category of work was being done
- Model tier: Which model handled the request
- Cache status: Hit, miss, or write
- Call outcome: Success, error, escalation
Key Metrics Dashboard
| Metric | Formula | Alert Threshold |
|---|---|---|
| Cost per agent per hour | Total spend / (agents × hours) | >20% above baseline |
| Cache hit rate | Cache hits / total calls | <80% for high-frequency agents |
| Escalation rate | Tier-2+ calls / total calls | >40% (suggests routing misconfiguration) |
| Token efficiency ratio | Useful output tokens / total tokens consumed | <30% (suggests context bloat) |
| Tail cost ratio | P95 call cost / P50 call cost | >10× (suggests unbounded context growth) |
Attribution for Chargeback and Optimization
In multi-tenant or multi-product fleets, cost attribution enables:
- Per-product P&L: Understanding which agent workflows are cost-efficient vs cost-draining
- Routing policy feedback loops: If a task type consistently escalates, the routing classifier needs retraining
- Anomaly detection: Sudden cost spikes often indicate prompt injection, runaway loops, or retrieval failures returning massive documents
Logging Architecture
Call record schema:
{
"timestamp": "ISO8601",
"agent_id": "string",
"task_type": "enum",
"model": "string",
"input_tokens": "int",
"output_tokens": "int",
"cached_tokens": "int",
"cache_status": "hit|miss|write",
"cost_usd": "float",
"latency_ms": "int",
"escalated": "bool"
}
Aggregate this into a time-series store. Alert on rolling 1-hour cost anomalies. Review weekly by task type to identify drift in token distributions.
10. Age-Grouped Learning Paths
🟢 Ages 10–14: The Token Vending Machine
Core concept: Imagine an AI like a vending machine that charges you per word — but it charges more for words it makes up than words you put in. If you ask it a short question and it gives a long answer, the long answer costs more.
Key ideas at this level: - Tokens are like puzzle pieces that make up words - You pay for pieces going in AND pieces coming out - Coming-out pieces cost more - If you ask the same question 1,000 times, you're paying 1,000 times — unless you save the answer
Activity: Count the tokens in your name. Now count them in a full sentence. Notice how longer text = more tokens = more cost.
🔵 Ages 15–18: Building Your First Cost Mental Model
Core concept: Every AI API call has a price tag determined by token count × price per token. Output costs more than input. Smarter models cost more than simpler ones. Your job as a builder is to match the right model to the right task.
Key ideas at this level: - The input/output price asymmetry and why it matters for prompt design - Why you wouldn't use a Ferrari to drive to the corner shop (frontier models for simple tasks) - Caching as "don't pay twice for the same thing" - Routing as "match the tool to the job"
Practical exercise: Take a simple chatbot project. Estimate its monthly token cost at three model tiers. Calculate the break-even point for adding a routing layer.
🟡 Ages 19–25: The Developer's Cost Engineering Toolkit
Core concept: You are building systems that make thousands of API calls. The difference between a profitable product and an unprofitable one often lives in token efficiency, caching architecture, and routing logic — not in the core AI capability.
Key ideas at this level: - Implement tiktoken or equivalent to count tokens before calling the API - Structure prompts: static system prompt first, dynamic content last - Enable prompt caching on Anthropic/OpenAI — it's a configuration flag, not a rebuild - Build a simple task classifier to route cheap tasks to cheap models - Log every call with cost metadata from day one
Project: Build a two-tier routing system for a document Q&A agent. Measure cost-per-query before and after routing. Target: 40%+ cost reduction without measurable accuracy loss.
🟠 Ages 26–40: Fleet Operators and Product Builders
Core concept: At fleet scale, token economics compound into significant P&L line items. Your optimization stack should include prompt caching, RAG-based context compression, classifier-based routing, and semantic caching — layered in priority order by implementation cost vs return.
Key ideas at this level: - Build cost attribution into your observability stack from the start — retrofitting is expensive - Model the full token distribution (mean + P95 + P99), not just averages - Treat routing policy as a continuously trained system, not a one-time configuration - Evaluate open-weight self-hosted models when monthly API spend exceeds the infrastructure break-even point - Semantic caching requires staleness management — build invalidation logic before deploying
Decision framework: For each agent task type, answer: (1) What is the minimum model tier that achieves acceptable accuracy? (2) What is the cache-hit potential? (3) What is the context compression opportunity? Optimize in that order.
🔴 Ages 40+: Executive and Strategic Lens
Core concept: LLM inference cost is a variable cost that scales with agent activity — unlike most software infrastructure. Understanding the three control levers (token volume, model tier, caching) allows you to model unit economics, set pricing for AI-powered products, and make informed build-vs-buy decisions.
Key ideas at this level: - Inference cost is not fixed — it scales with usage, task complexity, and architectural choices - A 10× model price difference does not mean 10× capability difference for most tasks - Caching and routing are engineering investments with calculable ROI — treat them as capital allocation decisions - Cost monitoring is a competitive intelligence function: cost per outcome tells you where your AI workflows are efficient and where they are not - The build-vs-buy decision for self-hosted models has a clear break-even formula: compare monthly API spend against annualized infrastructure + engineering cost
Key Takeaways and Decision Framework
The Three Levers, Summarized
| Lever | Primary Mechanism | Typical Savings Potential | Implementation Complexity |
|---|---|---|---|
| Token volume reduction | RAG, context compression, output limits | 30–80% | Medium |
| Prompt caching | Provider-side KV-cache reuse | 50–90% on cached prefix | Low |
| Model routing | Task-to-tier matching | 30–60% on total spend | Medium–High |
The Decision Sequence
When optimizing a new agent fleet, apply levers in this order:
- Enable caching first — lowest effort, immediate return, no accuracy risk
- Compress context — audit what's actually needed in the context window; remove the rest
- Route by task type — start with rule-based, graduate to classifier-based as volume grows
- Add semantic caching — only after the above are in place and you have data on query repetition rates
- Evaluate self-hosting — only when monthly API spend justifies the infrastructure investment
The Non-Negotiable Principle
Measure before you optimize. Every cost reduction strategy requires a baseline. Instrument your fleet on day one. Log tokens, costs, cache status, and task types. Without this data, optimization is guesswork. With it, every lever has a calculable ROI.
This lesson is part of Empirica's Agent Economics curriculum. The frameworks presented are architecture-agnostic and apply across major LLM API providers. Pricing figures are illustrative order-of-magnitude anchors; verify current rates directly with providers before financial modeling.