A structured course lesson for all audiences — from first-time builders to fleet operators

Executive Summary

Running a fleet of AI agents at production scale is fundamentally a cost-engineering problem. The dominant cost driver is LLM inference, priced per token — a unit so small it appears trivial until multiplied across thousands of concurrent agents making hundreds of calls per hour. Three levers control the economics: how tokens are priced, how many tokens are consumed, and which model handles each request. Mastering all three — per-token economics, caching strategy, and model routing — is the difference between a profitable agent fleet and one that burns budget faster than it creates value.

This lesson builds a complete mental model for each lever, then assembles them into an operational playbook.

1. Per-Token Economics Fundamentals

What a Token Is

A token is the atomic unit of LLM computation — roughly 0.75 words in English, though this varies by language, code, and special characters. Every API call converts input text into tokens (the prompt) and generates output tokens (the completion). Both are billed, but at different rates.

The Asymmetry That Matters Most

Output tokens cost more than input tokens — typically 3× to 5× more, depending on the provider and model. This asymmetry has direct architectural consequences:

Verbose system prompts are cheaper than verbose model responses
Asking a model to "think step by step" in its output is expensive; pre-computing reasoning and injecting it as input context is cheaper
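Streaming long completions is a cost signal, not just a latency signal: streaming shortens the perceived wait, but every generated token is still billed at the output rate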

Price Tiers Across Model Classes

Model Class	Typical Input Price	Typical Output Price	Best Use Case
Frontier (e.g., GPT-4-class)	$10–$30 / 1M tokens	$30–$60 / 1M tokens	Complex reasoning, final synthesis
Mid-tier (e.g., GPT-3.5-class)	$0.50–$3 / 1M tokens	$1.50–$6 / 1M tokens	Classification, extraction, summarization
Small/fast (e.g., Haiku-class)	$0.25–$1 / 1M tokens	$1–$3 / 1M tokens	Routing, triage, simple Q&A
Open-weight self-hosted	Infrastructure cost only	Infrastructure cost only	High-volume, latency-sensitive, private data

Note: Prices shift frequently. Treat these as order-of-magnitude anchors, not billing guarantees.

The Compounding Effect in Agent Fleets

A single agent call is cheap. A fleet is not. Consider:

100 agents × 50 calls/hour × 2,000 output tokens/call = 10 million output tokens/hour
At frontier pricing ($30/1M output tokens), that is $300/hour in output costs alone
Over a 720-hour month: $216,000, from one cost component of one model tier

This arithmetic explains why token efficiency is a first-order engineering concern, not an optimization afterthought.
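The fleet arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `fleet_token_cost` is hypothetical, and it assumes every token is billed at a single flat rate, which is the same simplification the example uses.

```python
def fleet_token_cost(agents: int, calls_per_hour: float, tokens_per_call: int,
                     price_per_million: float, hours: float = 1.0) -> float:
    """Cost in USD, treating every token as billed at one flat rate."""
    tokens = agents * calls_per_hour * tokens_per_call * hours
    return tokens / 1_000_000 * price_per_million


hourly = fleet_token_cost(100, 50, 2_000, 30.0)
monthly = fleet_token_cost(100, 50, 2_000, 30.0, hours=720)
print(f"${hourly:,.0f}/hour, ${monthly:,.0f}/month")  # $300/hour, $216,000/month
```

Changing any single input (fleet size, call rate, tokens per call, price) scales the result linearly, which is why each of them is a lever worth measuring.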

2. Token Counting and Cost Prediction Models

Why Prediction Is Hard

Token counts are not word counts. The same semantic content can tokenize differently depending on:

Language: Non-Latin scripts tokenize less efficiently (more tokens per word)
Code: Common keywords and syntax often tokenize compactly; long or unusual identifiers can split into many tokens
Formatting: Markdown headers, JSON brackets, and whitespace all consume tokens
Model family: Each provider uses a different tokenizer (BPE variants, SentencePiece, etc.)

Practical Counting Tools

Tiktoken (OpenAI): Open-source, accurate for GPT-family models
Tokenizer endpoints: Most providers expose a count-tokens API call before committing to inference
Heuristic budgeting: For English prose, 1,000 tokens ≈ 750 words ≈ 1.5 pages — useful for back-of-envelope planning
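For back-of-envelope planning, the word-based heuristic can be wrapped in a small helper. This is a rough sketch that assumes English prose and the 0.75 words-per-token ratio used in this lesson; the function name `estimate_tokens` is hypothetical, and a real tokenizer or a provider's count-tokens endpoint should be used for anything billing-related.

```python
import math

WORDS_PER_TOKEN = 0.75  # heuristic from this lesson; English prose only


def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count (not a real tokenizer)."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)


print(estimate_tokens("one two three"))  # 4
```

The estimate will drift for code, JSON, and non-Latin scripts, so treat it as a planning number only.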

Building a Cost Prediction Model

A reliable fleet cost model requires four inputs per agent task type:

Mean prompt token count (system prompt + context + user message)
Mean completion token count (sampled from production or staging runs)
Call frequency (calls per agent per hour, by task type)
Model assignment (which tier handles this task type)

Multiply these through the pricing table, sum across task types, and you have a cost-per-agent-hour figure that can be tracked against actual billing.
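The four-input model can be sketched as follows. The price table, the `TaskProfile` class, and the sample task numbers are all illustrative assumptions, not real rates or measurements.

```python
from dataclasses import dataclass

# Illustrative prices in USD per 1M tokens as (input, output); not real rates.
PRICES = {"frontier": (15.0, 30.0), "mid": (1.0, 3.0), "small": (0.5, 1.5)}


@dataclass
class TaskProfile:
    name: str
    mean_prompt_tokens: float
    mean_completion_tokens: float
    calls_per_agent_hour: float
    tier: str


def cost_per_agent_hour(tasks: list[TaskProfile]) -> float:
    """Sum of (per-call cost x call frequency) across task types, in USD."""
    total = 0.0
    for task in tasks:
        in_price, out_price = PRICES[task.tier]
        per_call = (task.mean_prompt_tokens * in_price
                    + task.mean_completion_tokens * out_price) / 1_000_000
        total += per_call * task.calls_per_agent_hour
    return total


tasks = [
    TaskProfile("triage", 500, 20, 40, "small"),
    TaskProfile("synthesis", 4_000, 800, 5, "frontier"),
]
print(f"${cost_per_agent_hour(tasks):.4f} per agent-hour")
```

In this made-up profile the five frontier calls dominate the forty small-model calls, which is the kind of imbalance the model is meant to surface.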

Variance Is the Hidden Risk

Mean token counts mislead. A task with a mean of 800 tokens but a 95th-percentile of 8,000 tokens (e.g., when agents retrieve long documents) can blow cost budgets on tail events. Always model the distribution, not just the mean.
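A short sketch of why the mean hides tail risk. The sample data is hypothetical, and the `percentile` helper uses the simple nearest-rank definition rather than any particular library's interpolation method.

```python
import math
import statistics


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: most calls are small, a few retrieve long documents.
samples = [400] * 90 + [1_000] * 4 + [8_000] * 6

print("mean:", statistics.mean(samples))
print("p50: ", percentile(samples, 50))
print("p95: ", percentile(samples, 95))
```

Here the mean sits below 1,000 tokens while the 95th percentile is 8,000, so a budget sized on the mean would be exceeded by the tail calls.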

3. Caching Strategies: Context Window Optimization

The Core Insight

The most expensive token is one you've already paid for and are paying for again. Caching eliminates redundant computation. There are two distinct caching problems in agent fleets:

Prompt caching: Reusing the KV-cache of a repeated prefix across multiple API calls
Semantic caching: Storing complete responses and returning them for semantically equivalent queries without hitting the model at all

Context Window as a Cost Surface

Every model has a maximum context window (e.g., 128K tokens for some frontier models). Filling that window on every call is expensive. Strategies to reduce context size:

Summarization compression: Replace raw conversation history with a running summary
Retrieval-augmented generation (RAG): Inject only the retrieved chunks relevant to the current query, not the entire knowledge base
Sliding window: Retain only the N most recent turns plus a compressed summary of earlier turns
Structured state: Store agent state as compact JSON rather than natural language narration
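The sliding-window strategy can be sketched as below. The `naive_summary` function is a placeholder assumption; a real system would call a cheap model to produce the summary, and the function names are hypothetical.

```python
from typing import Callable


def naive_summary(turns: list[str]) -> str:
    # Placeholder: a real system would call a cheap model here.
    return f"[summary of {len(turns)} earlier turns]"


def sliding_window(turns: list[str], keep_last: int,
                   summarize: Callable[[list[str]], str] = naive_summary) -> list[str]:
    """Keep the newest `keep_last` turns verbatim; fold older turns into one summary."""
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    if len(turns) <= keep_last:
        return list(turns)
    older, recent = turns[:-keep_last], turns[-keep_last:]
    return [summarize(older)] + recent


history = ["a", "b", "c", "d", "e"]
print(sliding_window(history, keep_last=2))
```

The context now grows with `keep_last` rather than with conversation length, at the price of whatever detail the summary drops.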

The Retrieval-vs-Context Trade-off

Putting more context in the window increases accuracy on tasks requiring broad recall but increases cost linearly. RAG reduces cost but introduces retrieval latency and retrieval error risk. The optimal balance depends on task type:

High-precision tasks (legal, medical, financial): Larger context windows often justified
High-throughput tasks (classification, routing, extraction): Aggressive compression and RAG preferred

4. Prompt Caching Implementation

How Provider-Side Prompt Caching Works

Several major providers (Anthropic, OpenAI, Google) now offer prompt caching — a mechanism where the KV-cache of a long, repeated prompt prefix is stored server-side and reused across calls. The economics are significant:

Cache hit: Input tokens in the cached prefix are billed at a steep discount (typically 50–90% reduction)
Cache write: The first call that populates the cache is billed at standard or slightly elevated rates
Cache lifetime: Caches expire (typically minutes to hours depending on provider); high-frequency calls are needed to amortize the write cost

Structural Requirements for Cache Efficiency

Prompt caching only works if the prefix is identical across calls. This imposes a structural discipline on prompt design:

System prompt first, always: Place the static system prompt at the very beginning of every call
Static before dynamic: All fixed content (instructions, persona, tool definitions, few-shot examples) must precede any dynamic content (user query, retrieved context, conversation history)
No timestamps or session IDs in the prefix: Any variable content in the cached region breaks the cache hit
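A minimal sketch of the static-before-dynamic discipline. The prompt text, tool names, and the `prefix_fingerprint` helper are hypothetical. Providers match cached prefixes on tokens, not characters, so the hash here is only a local sanity check that the leading region does not change between calls.

```python
import hashlib

STATIC_PREFIX = "\n".join([
    "You are a support agent for a hypothetical product.",  # instructions
    "Tools: search_docs(query), create_ticket(summary)",     # tool definitions
    "Example: Q: reset password -> A: use search_docs",      # few-shot example
])


def build_prompt(user_query: str, retrieved: str, timestamp: str) -> str:
    """Static content first; everything that varies per call goes last."""
    dynamic = f"Context: {retrieved}\nTime: {timestamp}\nUser: {user_query}"
    return STATIC_PREFIX + "\n" + dynamic


def prefix_fingerprint(prompt: str, prefix_len: int = len(STATIC_PREFIX)) -> str:
    """Hash of the leading region that is expected to be cacheable."""
    return hashlib.sha256(prompt[:prefix_len].encode("utf-8")).hexdigest()


good_a = build_prompt("How do I log in?", "doc 12", "10:00")
good_b = build_prompt("Where is my invoice?", "doc 40", "10:05")
bad_a = "Time: 10:00\n" + STATIC_PREFIX  # anti-pattern: variable content leads
bad_b = "Time: 10:05\n" + STATIC_PREFIX

print(prefix_fingerprint(good_a) == prefix_fingerprint(good_b))  # True
print(prefix_fingerprint(bad_a) == prefix_fingerprint(bad_b))    # False
```

A check like this can run in CI to catch a template change that accidentally moves variable content into the cached region.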

Calculating Cache ROI

Savings per cache hit = cached_prefix_tokens × (standard_price − cache_hit_price)
Break-even calls = cache_write_cost / savings_per_cache_hit, rounded up

For a 10,000-token system prompt at $15/1M tokens standard price and $1.50/1M cache-hit price:

Savings per cache hit: 10,000 × ($15 − $1.50) / 1,000,000 = $0.135 per call
Cache write cost at 1.25× the standard rate: 10,000 × $18.75 / 1,000,000 = $0.1875
Break-even: $0.1875 / $0.135 ≈ 1.4, which rounds up to 2 calls

This ratio is conservative, because the write call would have cost the standard $0.15 even without caching; only the $0.0375 premium is true overhead. Any agent making 2 or more calls with the same system prompt within the cache lifetime benefits from caching. Most production agents make thousands.
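The break-even point can also be computed by comparing total cost with and without caching, rather than by the simpler ratio. This sketch assumes one cache write followed only by hits (no expiry), and the function names are hypothetical.

```python
import math


def savings_per_hit(prefix_tokens: int, standard_price: float, hit_price: float) -> float:
    """USD saved on each call that hits the cache (prices are per 1M tokens)."""
    return prefix_tokens * (standard_price - hit_price) / 1_000_000


def break_even_calls(prefix_tokens: int, standard_price: float,
                     hit_price: float, write_multiplier: float) -> int:
    """Smallest n where (1 write + n-1 hits) costs less than n standard calls."""
    standard = prefix_tokens * standard_price / 1_000_000
    hit = prefix_tokens * hit_price / 1_000_000
    write = standard * write_multiplier
    # write + (n - 1) * hit < n * standard  =>  n > (write - hit) / (standard - hit)
    return math.floor((write - hit) / (standard - hit)) + 1


print(round(savings_per_hit(10_000, 15.0, 1.50), 4))  # 0.135
print(break_even_calls(10_000, 15.0, 1.50, 1.25))     # 2
```

Both approaches agree on 2 calls for this example; they can diverge when the write premium is large or the hit discount is small, so it is worth running the numbers for each provider's actual rates.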

Semantic Caching Layer

For queries that are not identical but are semantically equivalent, a semantic cache sits in front of the LLM:

Embed the incoming query
Search a vector store of previous (query, response) pairs
If cosine similarity exceeds a threshold, return the cached response
Otherwise, call the LLM and store the new pair

Trade-off: Semantic caches introduce staleness risk (cached answers may be outdated) and require careful threshold tuning (too aggressive → wrong answers returned; too conservative → low hit rate).
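The four steps above can be sketched with an in-memory cache. The `embed` function here is a toy bag-of-words stand-in for a real embedding model, the linear scan stands in for a vector store, and the 0.8 threshold is an arbitrary assumption that would need tuning on real traffic.

```python
import math
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: bag-of-words counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries: list[tuple[Counter, str]] = []  # (query vector, response)

    def get_or_call(self, query: str, call_llm: Callable[[str], str]) -> tuple[str, bool]:
        """Return (response, was_cache_hit)."""
        vec = embed(query)
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return best[1], True
        response = call_llm(query)  # cache miss: pay for inference, then store
        self.entries.append((vec, response))
        return response, False
```

This sketch has no expiry or invalidation, so it would exhibit exactly the staleness risk described above; a production cache needs both.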

5. Model Routing Architectures

The Routing Premise

Not every task requires a frontier model. Routing — directing each request to the cheapest model capable of handling it adequately — is the highest-leverage cost reduction available to fleet operators. A 10× price difference between model tiers means routing even 50% of traffic to a cheaper model cuts total inference cost by roughly 45%.
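The 45% figure follows from a simple blended-cost calculation. This sketch assumes requests consume the same number of tokens regardless of which tier serves them; the function name is hypothetical.

```python
def remaining_cost_fraction(share_routed_cheap: float, price_ratio: float) -> float:
    """Fraction of original spend left after routing.

    price_ratio is expensive-tier price divided by cheap-tier price.
    """
    return (1 - share_routed_cheap) + share_routed_cheap / price_ratio


remaining = remaining_cost_fraction(0.5, 10)
print(f"{1 - remaining:.0%} saved")  # 45% saved
```

The same function shows diminishing returns on price ratio: at 50% routed traffic, even an infinitely cheap model caps savings at 50%, so the routed share matters more than the price gap.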

Routing Architectures

1. Rule-Based Routing

Classify tasks by type at the application layer (e.g., "summarization" → mid-tier, "multi-step reasoning" → frontier)
Fast, predictable, zero additional LLM cost
Brittle: requires manual maintenance as task distributions shift

2. Classifier-Based Routing

A small, cheap classifier model (or fine-tuned embedding model) scores each incoming request for complexity
Routes to the appropriate tier based on score thresholds
Adds latency (one extra inference call) but that call is cheap if the classifier is small
Requires labeled training data: human-annotated examples of "this task needed frontier" vs "this task was fine on mid-tier"

3. Cascade Routing (Try-Cheap-First)

Send every request to the cheapest model first
Evaluate the response quality with a lightweight judge (another small model or rule-based check)
If quality is insufficient, escalate to the next tier
Risk: Latency doubles on escalated calls; not suitable for latency-sensitive applications
Benefit: No training data required; quality threshold is the only tunable parameter

4. LLM-as-Router

A small LLM reads the task and outputs a routing decision
More flexible than rule-based, cheaper than using a frontier model for routing
Introduces a meta-cost: the router itself consumes tokens
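Two of these architectures, rule-based and cascade, can be sketched in a few lines. The tier names, the rule table, and the `call_model` and `is_good_enough` callables are hypothetical placeholders for a real model client and a real quality judge.

```python
from typing import Callable

TIERS = ["small", "mid", "frontier"]  # cheapest first; names are illustrative
RULES = {"summarization": "mid", "multi-step reasoning": "frontier"}


def rule_based_route(task_type: str, default: str = "frontier") -> str:
    """Static lookup; unknown task types fall back to the safest (most capable) tier."""
    return RULES.get(task_type, default)


def cascade(request: str,
            call_model: Callable[[str, str], str],
            is_good_enough: Callable[[str], bool]) -> tuple[str, str]:
    """Try tiers cheapest-first; return (tier, response) from the first that passes the judge."""
    response = ""
    for tier in TIERS:
        response = call_model(tier, request)
        if is_good_enough(response):
            return tier, response
    # Nothing passed the judge: return the top-tier answer anyway.
    return TIERS[-1], response
```

Note that the cascade pays for every tier it tries, so its real cost depends on the escalation rate as much as on the price gap between tiers.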

Routing Decision Matrix

Architecture	Latency Impact	Training Data Required	Maintenance Burden	Best For
Rule-based	None	None	High	Stable, well-defined task types
Classifier	Low (+50–100ms)	Yes	Medium	High-volume fleets with labeled data
Cascade	High on escalation	None	Low	Quality-critical, latency-tolerant
LLM-as-router	Low–Medium	None	Low	Dynamic, diverse task distributions

6. Cost-Performance Trade-offs by Model Class

The Pareto Frontier of Model Selection

Model selection is not a binary frontier-vs-cheap decision. It is a continuous trade-off across three dimensions:

Cost (tokens priced per call)
Latency (time to first token, total generation time)
Capability (accuracy on the target task)

The Pareto-optimal choice depends on which dimension is the binding constraint for a given task.

Task-to-Model Mapping Framework

Task Type	Capability Requirement	Recommended Tier	Rationale
Intent classification	Low	Small/fast	Binary or multi-class output; small models competitive
Named entity extraction	Low–Medium	Mid-tier	Structured output; few-shot prompting sufficient
Summarization	Medium	Mid-tier	Quality degrades gracefully; cost savings large
Code generation (simple)	Medium	Mid-tier	Well-defined output; mid-tier models strong
Multi-step reasoning	High	Frontier	Chain-of-thought quality matters; errors compound
Novel synthesis / strategy	High	Frontier	Requires broad world knowledge and reasoning depth
Tool call parsing	Low	Small/fast	Structured JSON output; deterministic enough for small models
Final user-facing response	Medium–High	Mid-tier or Frontier	Depends on quality bar and user expectations

Capability Degradation Is Non-Linear

Moving from frontier to mid-tier on a simple task may cost 10× less with 2% accuracy loss. Moving from frontier to mid-tier on a complex reasoning task may cost 10× less with 40% accuracy loss. Benchmark your specific tasks before committing to a routing policy.

7. Real-World Fleet Optimization Case Studies

Case Study Pattern 1: The Long System Prompt Problem

Scenario: An agent fleet uses a 15,000-token system prompt (detailed instructions, tool definitions, few-shot examples). Each agent makes 200 calls/day. Fleet size: 500 agents.

Before optimization: 15,000 tokens × 200 calls × 500 agents = 1.5 billion input tokens/day. At $15/1M: $22,500/day.

After prompt caching: Cache hit rate of 95% (5% cache misses due to cache expiry). Effective input token cost for the cached portion: $1.50/1M.

Cache hits: 1.425B tokens × $1.50/1M = $2,137.50
Cache misses: 75M tokens × $15/1M = $1,125
Total: $3,262.50/day, roughly an 85% reduction
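The before-and-after figures for this scenario can be checked with a short calculation. This sketch follows the scenario's simplification that cache misses are billed at the standard rate (it ignores any cache-write premium); the function name is hypothetical.

```python
def daily_prefix_cost(prefix_tokens: int, calls_per_day: int, agents: int,
                      standard_price: float, hit_price: float, hit_rate: float) -> float:
    """Daily USD cost of the system-prompt tokens at a given cache hit rate."""
    total_tokens = prefix_tokens * calls_per_day * agents
    hit_cost = total_tokens * hit_rate * hit_price / 1_000_000
    miss_cost = total_tokens * (1 - hit_rate) * standard_price / 1_000_000
    return hit_cost + miss_cost


before = daily_prefix_cost(15_000, 200, 500, 15.0, 1.50, hit_rate=0.0)
after = daily_prefix_cost(15_000, 200, 500, 15.0, 1.50, hit_rate=0.95)
print(f"before ${before:,.2f}, after ${after:,.2f}, saved {1 - after / before:.1%}")
```

Re-running with a lower hit rate shows how sensitive the saving is to cache expiry: the uncached 5% of calls already accounts for about a third of the remaining cost.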

Case Study Pattern 2: Cascade Routing for a Classification-Heavy Fleet

Scenario: A document processing fleet where 70% of tasks are classification/extraction (mid-tier sufficient) and 30% require synthesis (frontier required).

Naive approach: All calls to frontier. 10M tokens/day × $30/1M output = $300/day.

With cascade routing: 70% handled by mid-tier at $3/1M output = 7M tokens × $3/1M = $21. 30% escalated to frontier = 3M tokens × $30/1M = $90. Total: $111/day, a 63% reduction.

Additional cascade cost: 10M tokens × $0.50/1M (small model first pass) = $5/day. Net cost: $116/day, a saving of about 61%.

Case Study Pattern 3: Context Compression via RAG

Scenario: Agents previously injected 50,000 tokens of background context per call. RAG implementation retrieves only the 3 most relevant chunks (~2,000 tokens).

Token reduction: 48,000 tokens/call × call volume. At 1M calls/month: 48 billion tokens saved. At $15/1M: $720,000/month saved — offset by vector search infrastructure costs (typically orders of magnitude cheaper).

8. Practical Cost Reduction Playbook

Priority-Ordered Actions

Tier 1: Immediate, low-risk (implement first)

Enable prompt caching on all providers that support it. Restructure prompts so static content leads. Break-even is typically 2–3 calls.
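Audit system prompt length. Remove redundant instructions. The saving multiplies across every call: at 200 calls/day per agent, each 1,000 tokens removed from a 500-agent fleet's prompt eliminates 100 million input tokens per day ($1,500/day at $15/1M, before caching discounts).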
Switch tool-call parsing and intent classification to small models. These tasks do not require frontier capability.

Tier 2: Medium effort, high return

Implement RAG for any agent that currently injects large static knowledge bases into context.
Build a task classifier to route requests by complexity. Even a simple keyword-based classifier captures significant savings.
Set output token limits per task type. Unconstrained max_tokens parameters allow models to generate far more than needed.

Tier 3: Infrastructure investment, large-scale fleets

Deploy semantic caching for high-volume, repetitive query patterns (e.g., FAQ-style agent interactions).
Evaluate self-hosted open-weight models for tasks where data privacy, latency, or volume make API costs prohibitive.
Implement cascade routing with automated quality evaluation to continuously optimize the routing threshold.

Anti-Patterns to Avoid

Putting dynamic content before static content in prompts: Destroys cache hit rates
Using frontier models for structured output tasks: JSON extraction does not require GPT-4-class reasoning
Ignoring the 95th-percentile token distribution: Budget based on mean, blow budget on tail
Caching without staleness controls: Semantic caches return outdated answers if not invalidated on knowledge updates
Routing without benchmarking: Assuming mid-tier is "good enough" without measuring task-specific accuracy degradation

9. Monitoring and Attribution Framework

What to Instrument

Cost without attribution is noise. Every token consumed should be tagged with:

Agent ID: Which agent instance generated the call
Task type: What category of work was being done
Model tier: Which model handled the request
Cache status: Hit, miss, or write
Call outcome: Success, error, escalation

Key Metrics Dashboard

Metric	Formula	Alert Threshold
Cost per agent per hour	Total spend / (agents × hours)	>20% above baseline
Cache hit rate	Cache hits / total calls	<80% for high-frequency agents
Escalation rate	Tier-2+ calls / total calls	>40% (suggests routing misconfiguration)
Token efficiency ratio	Useful output tokens / total tokens consumed	<30% (suggests context bloat)
Tail cost ratio	P95 call cost / P50 call cost	>10× (suggests unbounded context growth)
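The alert thresholds in this table can be encoded as a simple check. This is a sketch only: the metric key names and the `check_alerts` function are hypothetical, and the thresholds are copied from the table rather than tuned for any real fleet.

```python
def check_alerts(metrics: dict[str, float], baseline_cost_per_agent_hour: float) -> list[str]:
    """Return the names of metrics that cross the dashboard's alert thresholds."""
    alerts = []
    if metrics["cost_per_agent_hour"] > 1.20 * baseline_cost_per_agent_hour:
        alerts.append("cost_per_agent_hour")      # more than 20% above baseline
    if metrics["cache_hit_rate"] < 0.80:
        alerts.append("cache_hit_rate")
    if metrics["escalation_rate"] > 0.40:
        alerts.append("escalation_rate")          # possible routing misconfiguration
    if metrics["token_efficiency_ratio"] < 0.30:
        alerts.append("token_efficiency_ratio")   # possible context bloat
    if metrics["tail_cost_ratio"] > 10:
        alerts.append("tail_cost_ratio")          # possible unbounded context growth
    return alerts


current = {
    "cost_per_agent_hour": 1.30,
    "cache_hit_rate": 0.90,
    "escalation_rate": 0.50,
    "token_efficiency_ratio": 0.50,
    "tail_cost_ratio": 4.0,
}
print(check_alerts(current, baseline_cost_per_agent_hour=1.00))
```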

Attribution for Chargeback and Optimization

In multi-tenant or multi-product fleets, cost attribution enables:

Per-product P&L: Understanding which agent workflows are cost-efficient vs cost-draining
Routing policy feedback loops: If a task type consistently escalates, the routing classifier needs retraining
Anomaly detection: Sudden cost spikes often indicate prompt injection, runaway loops, or retrieval failures returning massive documents

Logging Architecture

Call record schema:
{
  "timestamp": "ISO8601",
  "agent_id": "string",
  "task_type": "enum",
  "model": "string",
  "input_tokens": "int",
  "output_tokens": "int",
  "cached_tokens": "int",
  "cache_status": "hit|miss|write",
  "cost_usd": "float",
  "latency_ms": "int",
  "escalated": "bool"
}

Aggregate this into a time-series store. Alert on rolling 1-hour cost anomalies. Review weekly by task type to identify drift in token distributions.
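One way to hold these records in code and roll them up by task type is sketched below. The field names mirror the schema above; the `summarize_by_task` function and its output keys are hypothetical, and a real deployment would write to a time-series store rather than aggregate in memory.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CallRecord:
    timestamp: str       # ISO 8601
    agent_id: str
    task_type: str
    model: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    cache_status: str    # "hit", "miss", or "write"
    cost_usd: float
    latency_ms: int
    escalated: bool


def summarize_by_task(records: list[CallRecord]) -> dict[str, dict[str, float]]:
    """Roll call records up into per-task-type cost, cache hit rate, and escalation rate."""
    grouped: dict[str, list[CallRecord]] = defaultdict(list)
    for record in records:
        grouped[record.task_type].append(record)
    summary = {}
    for task_type, calls in grouped.items():
        n = len(calls)
        summary[task_type] = {
            "calls": n,
            "cost_usd": sum(c.cost_usd for c in calls),
            "cache_hit_rate": sum(c.cache_status == "hit" for c in calls) / n,
            "escalation_rate": sum(c.escalated for c in calls) / n,
        }
    return summary
```

Grouping by `task_type` first keeps the weekly review aligned with the routing policy, since routing decisions are made per task type.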

10. Age-Grouped Learning Paths

🟢 Ages 10–14: The Token Vending Machine

Core concept: Imagine an AI like a vending machine that charges you per word — but it charges more for words it makes up than words you put in. If you ask it a short question and it gives a long answer, the long answer costs more.

Key ideas at this level:

Tokens are like puzzle pieces that make up words
You pay for pieces going in AND pieces coming out
Coming-out pieces cost more
If you ask the same question 1,000 times, you're paying 1,000 times, unless you save the answer

Activity: Count the tokens in your name. Now count them in a full sentence. Notice how longer text = more tokens = more cost.

🔵 Ages 15–18: Building Your First Cost Mental Model

Core concept: Every AI API call has a price tag determined by token count × price per token. Output costs more than input. Smarter models cost more than simpler ones. Your job as a builder is to match the right model to the right task.

Key ideas at this level:

The input/output price asymmetry and why it matters for prompt design
Why you wouldn't use a Ferrari to drive to the corner shop (frontier models for simple tasks)
Caching as "don't pay twice for the same thing"
Routing as "match the tool to the job"

Practical exercise: Take a simple chatbot project. Estimate its monthly token cost at three model tiers. Calculate the break-even point for adding a routing layer.

🟡 Ages 19–25: The Developer's Cost Engineering Toolkit

Core concept: You are building systems that make thousands of API calls. The difference between a profitable product and an unprofitable one often lives in token efficiency, caching architecture, and routing logic — not in the core AI capability.

Key ideas at this level:

Implement tiktoken or equivalent to count tokens before calling the API
Structure prompts: static system prompt first, dynamic content last
Enable prompt caching on Anthropic/OpenAI; depending on the provider it is automatic or a small request-level setting, not a rebuild
Build a simple task classifier to route cheap tasks to cheap models
Log every call with cost metadata from day one

Project: Build a two-tier routing system for a document Q&A agent. Measure cost-per-query before and after routing. Target: 40%+ cost reduction without measurable accuracy loss.

🟠 Ages 26–40: Fleet Operators and Product Builders

Core concept: At fleet scale, token economics compound into significant P&L line items. Your optimization stack should include prompt caching, RAG-based context compression, classifier-based routing, and semantic caching — layered in priority order by implementation cost vs return.

Key ideas at this level:

Build cost attribution into your observability stack from the start; retrofitting is expensive
Model the full token distribution (mean + P95 + P99), not just averages
Treat routing policy as a continuously trained system, not a one-time configuration
Evaluate open-weight self-hosted models when monthly API spend exceeds the infrastructure break-even point
Semantic caching requires staleness management; build invalidation logic before deploying

Decision framework: For each agent task type, answer: (1) What is the minimum model tier that achieves acceptable accuracy? (2) What is the cache-hit potential? (3) What is the context compression opportunity? Optimize in that order.

🔴 Ages 40+: Executive and Strategic Lens

Core concept: LLM inference cost is a variable cost that scales with agent activity — unlike most software infrastructure. Understanding the three control levers (token volume, model tier, caching) allows you to model unit economics, set pricing for AI-powered products, and make informed build-vs-buy decisions.

Key ideas at this level:

Inference cost is not fixed; it scales with usage, task complexity, and architectural choices
A 10× model price difference does not mean 10× capability difference for most tasks
Caching and routing are engineering investments with calculable ROI; treat them as capital allocation decisions
Cost monitoring is a competitive intelligence function: cost per outcome tells you where your AI workflows are efficient and where they are not
The build-vs-buy decision for self-hosted models has a clear break-even formula: compare annualized API spend against annualized infrastructure plus engineering cost

Key Takeaways and Decision Framework

The Three Levers, Summarized

Lever	Primary Mechanism	Typical Savings Potential	Implementation Complexity
Token volume reduction	RAG, context compression, output limits	30–80%	Medium
Prompt caching	Provider-side KV-cache reuse	50–90% on cached prefix	Low
Model routing	Task-to-tier matching	30–60% on total spend	Medium–High

The Decision Sequence

When optimizing a new agent fleet, apply levers in this order:

Enable caching first — lowest effort, immediate return, no accuracy risk
Compress context — audit what's actually needed in the context window; remove the rest
Route by task type — start with rule-based, graduate to classifier-based as volume grows
Add semantic caching — only after the above are in place and you have data on query repetition rates
Evaluate self-hosting — only when monthly API spend justifies the infrastructure investment

The Non-Negotiable Principle

Measure before you optimize. Every cost reduction strategy requires a baseline. Instrument your fleet on day one. Log tokens, costs, cache status, and task types. Without this data, optimization is guesswork. With it, every lever has a calculable ROI.

This lesson is part of Empirica's Agent Economics curriculum. The frameworks presented are architecture-agnostic and apply across major LLM API providers. Pricing figures are illustrative order-of-magnitude anchors; verify current rates directly with providers before financial modeling.

LLM API Cost Structure for Agent Fleets: Per-Token Economics, Caching, and Model Routing