Executive Summary

LLM API costs are the dominant operational expense for autonomous agent fleets. Traditional infrastructure costs scale with users and compute; LLM costs scale with tokens, the atomic units of text a language model processes. A fleet of 100 agents, each making 1,000 calls per day, can generate millions of tokens per hour. Without deliberate cost architecture, expenses compound faster than capability gains.

This lesson covers three interlocking cost levers:

Per-token economics — how pricing is structured, what drives token counts, and where waste hides
Caching strategies — how to avoid paying for the same tokens twice
Model routing — how to match task complexity to model cost, at fleet scale

Who this is for: This lesson is structured for multiple audiences. Each section includes a learning path indicator: 🟢 Beginner | 🟡 Intermediate | 🔴 Advanced. Read at the level appropriate to your role, or traverse all three to build full-stack understanding.

Core Concepts: Per-Token Economics Explained

🟢 Beginner: What Is a Token?

A token is not a word. It is a chunk of text, typically 3–4 characters in English, that a language model processes as a single unit. The sentence "The agent retrieved the document" is 5 words and comes to roughly 5–7 tokens, depending on the tokenizer.

Why this matters for cost:

- Every API call to a hosted LLM is billed by tokens consumed
- Billing is split into input tokens (what you send) and output tokens (what the model returns)
- Output tokens are almost always priced higher than input tokens, typically 3–5× more per token

Simple mental model:

Sending a 500-word system prompt = ~650 input tokens. Getting a 200-word response = ~260 output tokens. Multiply by your per-token rate. Multiply by call volume. That is your cost.

🟡 Intermediate: The Anatomy of an API Call's Token Budget

Every agent API call has four token-cost components:

Component	Description	Typical Cost Weight
System prompt	Instructions, persona, constraints	High — repeated every call
Context window	Prior conversation, retrieved documents	Variable — grows with task depth
User/task input	The actual query or instruction	Low–Medium
Model output	The generated response	Highest per-token rate

Key insight: In most agent architectures, the system prompt and injected context dominate input token costs, not the task itself. A 2,000-token system prompt sent 10,000 times per day is 20 million input tokens per day before any task content is added, which often costs more than the actual work being done.

Token cost drivers to audit:

- Verbose system prompts with redundant instructions
- Full document injection when only excerpts are needed
- Conversation history that grows unbounded across turns
- JSON schemas or tool definitions repeated in every call
- Whitespace, markdown formatting, and XML tags that add tokens without adding information

🔴 Advanced: Pricing Tiers, Context Windows, and the Quadratic Cost Problem

Pricing tier structure across major providers follows a consistent pattern:

Frontier models (e.g., GPT-4-class, Claude Opus-class): $10–$75 per million tokens input; $30–$150 per million tokens output
Small, fast models (e.g., GPT-4o-mini-class, Claude Haiku-class): $0.10–$1.50 per million tokens input; $0.30–$6 per million tokens output
Open-weight self-hosted (e.g., Llama 3, Mistral): Compute cost only — no per-token fee, but infrastructure overhead applies

The quadratic cost problem in long-context agents:

Attention mechanisms in transformers scale quadratically with context length in compute terms, but providers typically charge linearly per token. The effective cost is still superlinear: longer contexts increase latency, raise error rates (lost-in-the-middle failures), and make retries more likely, and every retry re-bills the entire context. A 128K-token context is therefore not simply 4× more expensive than a 32K context; once retries and latency penalties are included, total operational cost can reach 6–10×.

Batch API pricing: Several major providers, including OpenAI and Anthropic, offer asynchronous batch endpoints at a 50% discount for non-real-time workloads. For agent tasks that do not require an immediate response (data enrichment, document classification, background research), batch mode is often the single highest-leverage cost reduction available.

Caching Strategies for Cost Reduction

🟢 Beginner: The Core Idea

If an agent sends the same system prompt 10,000 times, you are paying for those tokens 10,000 times. Caching means storing the processed result of those tokens so you only pay once — or pay a reduced rate on subsequent uses.

Two types of caching:

1. Prompt caching: the API provider caches the processed representation of repeated prompt prefixes
2. Semantic caching: your infrastructure caches full responses to semantically similar queries

🟡 Intermediate: Prompt Caching Mechanics

Provider-level prompt caching (available on Anthropic, OpenAI, and others) works as follows:

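Depending on the provider, you either mark a portion of your prompt as cacheable (Anthropic) or caching is applied automatically to repeated prefixes above a minimum length (OpenAI); in both cases the target is the system prompt and any static context
On the first call, the provider processes and stores the KV (key-value) cache for that prefix; some providers bill this cache write at a small premium over the normal input rate
On subsequent calls with the same prefix, the provider reads from cache and charges a reduced rate (typically 10–25% of normal input token cost) instead of the full reprocessing cost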

Requirements for effective prompt caching:

- The cached prefix must be identical; even a single character change invalidates the cache
- Cache entries have TTLs (time-to-live), typically 5 minutes to 1 hour depending on provider
- The cached portion must appear at the beginning of the prompt, before any dynamic content

Practical design rule: Structure prompts as [static system prompt] + [static context] + [dynamic task input]. Never interleave dynamic content into the static prefix.
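The design rule above can be sketched as a small prompt builder. This is an illustrative sketch, not a provider API: the function names `static_prefix`, `build_prompt`, and `prefix_fingerprint` are hypothetical, and the fingerprint is simply a convenient way to notice when the static prefix changes by accident.

```python
import hashlib


def static_prefix(system_prompt: str, static_context: str) -> str:
    """Join the parts of the prompt that never change between calls."""
    return f"{system_prompt}\n\n{static_context}"


def build_prompt(system_prompt: str, static_context: str, task_input: str) -> str:
    """Place dynamic input strictly after the static prefix."""
    return f"{static_prefix(system_prompt, static_context)}\n\n{task_input}"


def prefix_fingerprint(system_prompt: str, static_context: str) -> str:
    """Hash of the static prefix; log it per call to spot unintended changes."""
    prefix = static_prefix(system_prompt, static_context)
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()
```

Logging the fingerprint alongside each call makes it easy to correlate a drop in cache hit rate with a prefix edit, since even a trailing space produces a different hash.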

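Savings estimate: For a fleet where system prompts average 2,000 tokens and each agent makes 500 calls per day, prompt caching reduces input token costs on the system prompt by 75–90%. A 100-agent fleet sends 100 × 500 × 2,000 = 100 million system prompt tokens per day, or about 36.5 billion per year. At $3 per million input tokens that is roughly $82,000–$99,000 in annual savings; at $10 per million it is roughly $270,000–$330,000, assuming nearly every call hits the cache.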
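The arithmetic behind an estimate like this can be checked with a few lines. The sketch below assumes a hypothetical input price of $3 per million tokens and that every call hits the cache; both are illustrative assumptions, not figures from a provider price sheet.

```python
def annual_prefix_cost_usd(agents: int, calls_per_agent_per_day: int,
                           prefix_tokens: int, input_price_per_m: float,
                           days: int = 365) -> float:
    """Uncached yearly cost of resending the same prefix on every call."""
    tokens = agents * calls_per_agent_per_day * prefix_tokens * days
    return tokens / 1_000_000 * input_price_per_m


# Hypothetical inputs: 100 agents, 500 calls each per day, 2,000-token prefix.
uncached = annual_prefix_cost_usd(100, 500, 2_000, input_price_per_m=3.00)
savings_low = uncached * 0.75   # cached reads billed at 25% of the input rate
savings_high = uncached * 0.90  # cached reads billed at 10% of the input rate
print(f"uncached: ${uncached:,.0f}; savings: ${savings_low:,.0f} to ${savings_high:,.0f}")
```

Real savings will be lower when cache entries expire between calls, so the measured cache hit rate should be multiplied in before budgeting against this number.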

🔴 Advanced: Semantic Caching and Cache Invalidation Architecture

Semantic caching operates at the application layer, not the provider layer:

Incoming query is embedded using a lightweight embedding model
Embedding is compared against a vector store of previously answered queries
If cosine similarity exceeds a threshold (typically 0.92–0.97), the cached response is returned without an API call
Cache miss triggers a normal API call; the response is stored with its embedding

Architecture components:

- Embedding model: Use the cheapest available (text-embedding-3-small, or a self-hosted model); embedding costs should be <1% of the savings generated
- Vector store: Redis with vector search, Qdrant, or Pinecone for low-latency retrieval
- Similarity threshold tuning: Lower thresholds increase hit rate but risk returning stale or imprecise answers; higher thresholds are safer but reduce savings

Cache invalidation triggers:

- Time-based TTL (mandatory for any factual or time-sensitive content)
- Source document version change (for RAG-based agents)
- Explicit invalidation on tool output change
- Confidence scoring: if the cached response contains hedged language ("as of my last update"), flag for re-evaluation
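The lookup flow and the time-based TTL can be sketched in memory as follows. This is a minimal illustration under stated assumptions: `SemanticCache` is a hypothetical class, the `embed` callable is supplied by the caller (any embedding model), and a linear scan stands in for the vector store that a production system would use.

```python
import math
import time
from typing import Callable, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    def __init__(self, embed: Callable[[str], Sequence[float]],
                 threshold: float = 0.94, ttl_seconds: float = 4 * 3600,
                 clock: Callable[[], float] = time.time) -> None:
        self._embed = embed
        self._threshold = threshold
        self._ttl = ttl_seconds
        self._clock = clock
        # Each entry: (embedding, response, time stored)
        self._entries: List[Tuple[Sequence[float], str, float]] = []

    def get(self, query: str) -> Optional[str]:
        """Return the best cached response at or above the threshold, else None."""
        now = self._clock()
        # Time-based invalidation: drop entries older than the TTL.
        self._entries = [e for e in self._entries if now - e[2] < self._ttl]
        query_vec = self._embed(query)
        best, best_score = None, self._threshold
        for vec, response, _ in self._entries:
            score = cosine_similarity(query_vec, vec)
            if score >= best_score:
                best, best_score = response, score
        return best

    def put(self, query: str, response: str) -> None:
        """Store a response after a cache miss."""
        self._entries.append((self._embed(query), response, self._clock()))
```

A caller would try `get` first, make the API call only on `None`, then `put` the result. The other invalidation triggers (document version change, tool output change) would need explicit delete hooks that this sketch omits.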

Hybrid caching architecture for agent fleets:

Incoming agent query
        ↓
[Semantic cache lookup] → HIT → Return cached response (cost: ~$0.00002 embedding only)
        ↓ MISS
[Prompt cache check] → HIT → Reduced-rate API call (cost: 10-25% of normal input)
        ↓ MISS
[Full API call] → Store in semantic cache → Return response

In high-repetition agent workloads, this three-layer architecture can reduce effective per-query cost by 60–85%.

Model Routing and Fleet Optimization

🟢 Beginner: Not Every Task Needs the Most Expensive Model

A frontier model typically costs 10–50× more per token than a small, fast model, and the gap can be wider at the extremes of the price range. Most tasks in an agent fleet do not require frontier capability. Routing means automatically sending each task to the cheapest model that can handle it reliably.

Simple routing heuristic:

- Classification, extraction, summarization → small/fast model
- Multi-step reasoning, code generation, novel synthesis → frontier model
- Structured output with known schema → small model with validation

🟡 Intermediate: Routing Architectures

Static routing assigns task types to model tiers at design time:

Task type registry:
  "classify_intent"     → model: haiku-class    | max_tokens: 50
  "extract_entities"    → model: haiku-class    | max_tokens: 200
  "generate_report"     → model: sonnet-class   | max_tokens: 2000
  "complex_reasoning"   → model: opus-class     | max_tokens: 4000

Dynamic routing evaluates task complexity at runtime before committing to a model:

A lightweight classifier (or a cheap model) scores the incoming task on complexity dimensions: ambiguity, required reasoning depth, domain specificity, output length
The score maps to a model tier
The task is dispatched to the selected model

Cascade routing (try-cheap-first):

1. Send task to cheapest capable model
2. Evaluate output quality using a validation function (schema check, confidence score, or a cheap judge model)
3. If quality threshold not met, escalate to next tier
4. Log escalation rate per task type to refine routing rules over time
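The cascade steps above can be sketched as a loop over tiers ordered from cheapest to most expensive. The names `cascade` and `CascadeResult` are hypothetical, and each tier's `call_model` is a caller-supplied function standing in for a real API client.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class CascadeResult:
    model: str         # tier that produced the returned output
    output: str
    escalations: int   # number of tiers tried before this one
    accepted: bool     # False if even the last tier failed validation


def cascade(task: str,
            tiers: List[Tuple[str, Callable[[str], str]]],
            is_acceptable: Callable[[str], bool]) -> CascadeResult:
    """Try tiers in order; stop at the first output that passes validation."""
    if not tiers:
        raise ValueError("at least one tier is required")
    for index, (model, call_model) in enumerate(tiers):
        output = call_model(task)
        accepted = is_acceptable(output)
        # Return on success, or on the final tier even if validation failed.
        if accepted or index == len(tiers) - 1:
            return CascadeResult(model, output, index, accepted)
    raise AssertionError("unreachable")
```

Logging `escalations` per task type gives the escalation rate directly. Note that an escalated task pays for every tier it passed through, so cascade routing only saves money when most tasks are accepted early.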

Cost impact of routing: In a mixed-workload fleet, 60–80% of tasks typically qualify for small-model handling. Routing correctly reduces average cost-per-task by roughly 60–80% compared to sending all tasks to a frontier model, because the tasks that remain on the frontier tier set a floor on savings.

🔴 Advanced: Fleet-Level Optimization and Cost-Quality Pareto Frontiers

Fleet-level routing treats the agent fleet as a portfolio, not a collection of independent agents:

Concurrency budgeting: Assign token-per-minute (TPM) budgets per agent tier. High-priority agents get guaranteed TPM allocation; background agents consume remaining capacity.
Rate limit arbitrage: Distribute calls across multiple provider accounts or regions to avoid rate limit throttling, which causes retries and inflates cost.
Provider routing: Maintain live cost and latency data for equivalent models across providers (e.g., GPT-4o vs. Claude Sonnet vs. Gemini Pro). Route to lowest-cost provider meeting latency SLA at call time.

Cost-quality Pareto frontier analysis:

For each task type, empirically measure:

- Output quality score (task-specific rubric, 0–1)
- Cost per successful completion (including retries)

Plot quality vs. cost across model options. A model is on the Pareto frontier if no other option is both cheaper and at least as good. A task type served by a model off the frontier is overspending, because a frontier model delivers the same quality for less. Among frontier models, choose the cheapest one that meets the task's quality requirement; choosing one below that requirement under-serves the task.
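A small sketch of the frontier computation, reading the frontier as the set of options for which no alternative is both cheaper and at least as good. The function names and the sample measurements in the usage note are hypothetical.

```python
from typing import List, Optional, Tuple

Option = Tuple[str, float, float]  # (model name, cost per success in USD, quality 0-1)


def pareto_frontier(options: List[Option]) -> List[Option]:
    """Keep options that no other option beats on both cost and quality."""
    frontier = []
    for name, cost, quality in options:
        dominated = any(
            other_cost <= cost and other_quality >= quality
            and (other_cost < cost or other_quality > quality)
            for _, other_cost, other_quality in options
        )
        if not dominated:
            frontier.append((name, cost, quality))
    return sorted(frontier, key=lambda option: option[1])


def cheapest_meeting(options: List[Option], min_quality: float) -> Optional[Option]:
    """Cheapest frontier option whose quality meets the requirement, if any."""
    for option in pareto_frontier(options):
        if option[2] >= min_quality:
            return option
    return None
```

For example, with measurements `[("small", 0.001, 0.80), ("mid", 0.01, 0.92), ("large", 0.05, 0.93), ("other", 0.02, 0.85)]`, the option `"other"` is dominated by `"mid"` and drops out, and a quality requirement of 0.90 selects `"mid"`.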

Predictive token budgeting:

Train a lightweight regression model on historical call data:

- Input features: task type, input token count, agent state, time of day
- Output: predicted output token count

Use predictions to pre-allocate max_tokens tightly. Oversized max_tokens parameters do not increase cost directly (you pay for tokens generated, not reserved), but they leave runaway generations uncapped, and on providers that count requested output tokens against rate limits they reduce usable throughput.
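As a simplified illustration of predictive budgeting, the sketch below fits an ordinary least-squares line on a single feature (input token count) rather than the full feature set listed above. The function names, the 25% headroom, and the floor of 16 tokens are assumptions for the example.

```python
import math
from typing import Callable, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("input token counts must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def output_token_predictor(input_tokens: Sequence[float],
                           output_tokens: Sequence[float]) -> Callable[[float], float]:
    """Return a function predicting output tokens from input tokens."""
    slope, intercept = fit_line(input_tokens, output_tokens)
    return lambda n: slope * n + intercept


def max_tokens_budget(predicted: float, headroom: float = 1.25, floor: int = 16) -> int:
    """Turn a prediction into a max_tokens value with some safety margin."""
    return max(floor, math.ceil(predicted * headroom))
```

In practice one predictor per task type is a reasonable starting point, and the headroom should be tuned against the observed truncation rate.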

Learning Paths by Experience Level

These paths are designed for different entry points, defined by depth of experience with AI systems and cost engineering.

Path A: New to AI APIs (0–6 months experience)

Goal: Understand what you are paying for and why it matters.

Recommended sequence:

1. Read "What Is a Token?" (Beginner sections above)
2. Run a token counter on your current prompts (use tiktoken for OpenAI-compatible models)
3. Calculate your current monthly token spend using your API dashboard
4. Identify your three largest token consumers
5. Apply one caching strategy (start with provider-level prompt caching)

Key metric to track: Cost per agent task completion

Path B: Building Agent Systems (6 months–2 years experience)

Goal: Implement cost-aware architecture from the start.

Recommended sequence:

1. Audit all system prompts for token waste (redundancy, verbose formatting)
2. Implement prompt caching for all static prefixes
3. Build a task-type registry and implement static routing
4. Add batch API calls for all non-real-time workloads
5. Instrument cost per task type in your observability stack

Key metric to track: Cost per task type, escalation rate in cascade routing

Path C: Operating Agent Fleets at Scale (2+ years, production systems)

Goal: Optimize cost-quality tradeoffs across a heterogeneous fleet.

Recommended sequence:

1. Build semantic caching layer with tuned similarity thresholds
2. Implement dynamic routing with runtime complexity scoring
3. Construct cost-quality Pareto frontiers per task type
4. Deploy provider routing with live cost/latency arbitrage
5. Build predictive token budgeting using historical call data
6. Implement TPM budgeting and concurrency management per agent tier

Key metric to track: Fleet-level cost efficiency ratio (value delivered per dollar spent)

Practical Implementation Guide

Step 1: Instrument Before You Optimize

You cannot optimize what you do not measure. Add the following to every API call log:

{
  "call_id": "uuid",
  "agent_id": "string",
  "task_type": "string",
  "model": "string",
  "input_tokens": int,
  "output_tokens": int,
  "cached_tokens": int,
  "cost_usd": float,
  "latency_ms": int,
  "success": bool,
  "retry_count": int
}

Aggregate by task_type and agent_id daily. With daily aggregation in place, cost anomalies typically surface within 48 hours of instrumentation.
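One way to represent the log record and the daily aggregation in code is sketched below. The `CallRecord` dataclass mirrors the fields listed above; the `cost_by` helper is a hypothetical name for the grouping step.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class CallRecord:
    call_id: str
    agent_id: str
    task_type: str
    model: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    cost_usd: float
    latency_ms: int
    success: bool
    retry_count: int


def cost_by(records: Iterable[CallRecord], field: str) -> Dict[str, float]:
    """Sum cost_usd grouped by a record field such as 'task_type' or 'agent_id'."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[getattr(record, field)] += record.cost_usd
    return dict(totals)
```

Running `cost_by(records, "task_type")` and `cost_by(records, "agent_id")` over one day of records yields the two views needed to spot a task type or agent whose spend has drifted.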

Step 2: Prompt Compression Audit

For each system prompt in your fleet:

Count tokens (use provider tokenizer, not word count)
Remove: duplicate instructions, redundant examples, decorative formatting
Replace verbose descriptions with structured schemas where possible
Target: reduce system prompt token count by 30–50% without capability loss
A/B test compressed vs. original on a sample of tasks before fleet-wide deployment

Step 3: Implement Optimizations in Priority Order

Priority	Strategy	Implementation Effort	Expected Savings
1	Provider prompt caching	Low (1–2 days)	40–70% of input costs
2	Batch API for async tasks	Low (1–3 days)	50% on eligible calls
3	Static model routing	Medium (1–2 weeks)	60–80% cost reduction
4	Semantic caching	High (2–4 weeks)	20–60% additional reduction
5	Dynamic routing + Pareto optimization	High (1–2 months)	10–30% additional reduction

Step 4: Establish Cost Budgets Per Agent

Assign each agent type a daily token budget. Implement hard stops at 90% of budget with alerting. This prevents runaway costs from prompt injection attacks, infinite loops, or unexpected task complexity spikes.
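A budget guard of this kind can be sketched as a small counter checked before each call. `TokenBudget` is a hypothetical class; the 0.9 default mirrors the 90% hard stop described above, and resetting the counter at the start of each day is left to the caller.

```python
class TokenBudget:
    """Tracks daily token use for one agent type and blocks calls near the limit."""

    def __init__(self, daily_limit: int, stop_fraction: float = 0.9) -> None:
        if daily_limit <= 0 or not 0 < stop_fraction <= 1:
            raise ValueError("daily_limit must be positive and stop_fraction in (0, 1]")
        self.daily_limit = daily_limit
        self.stop_fraction = stop_fraction
        self.used = 0

    def allow(self, estimated_tokens: int = 0) -> bool:
        """True if the call fits under the hard-stop threshold."""
        return self.used + estimated_tokens <= self.daily_limit * self.stop_fraction

    def record(self, tokens: int) -> None:
        """Add the tokens actually consumed by a completed call."""
        self.used += tokens

    def reset(self) -> None:
        """Call at the start of each budget day."""
        self.used = 0
```

The caller checks `allow(estimate)` before dispatching, raises an alert when it returns `False`, and calls `record` with the billed token count afterwards.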

Cost Calculation Frameworks

Framework 1: Baseline Cost Estimation

Daily cost = 
  (avg_input_tokens × input_price_per_token × daily_calls)
  + (avg_output_tokens × output_price_per_token × daily_calls)

Example:
  avg_input_tokens = 3,000
  avg_output_tokens = 500
  input_price = $3.00 / 1M tokens = $0.000003
  output_price = $15.00 / 1M tokens = $0.000015
  daily_calls = 10,000

  Daily cost = (3,000 × 0.000003 × 10,000) + (500 × 0.000015 × 10,000)
             = $90 + $75
             = $165/day = ~$4,950/month
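The same formula as a function, with prices expressed per million tokens as provider price sheets usually state them. The function name is hypothetical; the inputs are the example values above.

```python
def daily_cost_usd(avg_input_tokens: float, avg_output_tokens: float,
                   input_price_per_m: float, output_price_per_m: float,
                   daily_calls: int) -> float:
    """Baseline daily spend from average token counts and per-million prices."""
    per_call = (avg_input_tokens * input_price_per_m
                + avg_output_tokens * output_price_per_m) / 1_000_000
    return per_call * daily_calls


daily = daily_cost_usd(3_000, 500, 3.00, 15.00, 10_000)
monthly = daily * 30
print(f"${daily:,.2f}/day, ${monthly:,.2f}/month")
```

The result reproduces the $165/day and roughly $4,950/month figures from the worked example.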

Framework 2: Caching ROI Calculation

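Cache hit rate = cached_calls / total_calls
Cached prefix cost = full_prefix_cost × cached_price_multiplier (typically 0.10–0.25)

Monthly savings = 
  total_calls × cache_hit_rate × full_prefix_cost × (1 - cached_price_multiplier)

(full_prefix_cost is the uncached input cost of the cacheable prefix for one call;
 output tokens and dynamic input are billed at the normal rate. total_calls is per month.)

Break-even: implementation_cost / monthly_savings = months to ROI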
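A sketch of the ROI calculation in code. One caveat it makes explicit: the discount applies only to the cached input prefix, so the per-call cost passed in should be the uncached cost of that prefix rather than of the whole call. Parameter names and the sample numbers are hypothetical; `cached_price_multiplier` plays the role of the 0.10–0.25 factor above.

```python
def monthly_cache_savings(monthly_calls: int, cache_hit_rate: float,
                          prefix_cost_per_call_usd: float,
                          cached_price_multiplier: float) -> float:
    """Savings from paying the cached rate on the prefix for every cache hit."""
    return (monthly_calls * cache_hit_rate * prefix_cost_per_call_usd
            * (1 - cached_price_multiplier))


def months_to_break_even(implementation_cost_usd: float,
                         monthly_savings_usd: float) -> float:
    """Months until cumulative savings cover the implementation cost."""
    if monthly_savings_usd <= 0:
        return float("inf")
    return implementation_cost_usd / monthly_savings_usd


# Hypothetical: 300,000 calls/month, 90% hit rate, a 2,000-token prefix at
# $3 per million tokens ($0.006 per call), cached reads at 10% of the input rate.
savings = monthly_cache_savings(300_000, 0.90, 0.006, 0.10)
payback = months_to_break_even(4_000, savings)
```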

Framework 3: Routing Savings Calculation

Baseline cost = all_calls × frontier_model_cost_per_call

Routed cost = 
  (calls_to_small_model × small_model_cost_per_call)
  + (calls_to_mid_model × mid_model_cost_per_call)
  + (calls_to_frontier × frontier_model_cost_per_call)

Routing savings % = (baseline_cost - routed_cost) / baseline_cost × 100
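The routing formula translates directly into a function over per-tier call counts and per-call costs. The tier names, the call mix, and the per-call costs in the usage example are hypothetical.

```python
from typing import Dict


def routing_savings_pct(calls_by_tier: Dict[str, int],
                        cost_per_call_by_tier: Dict[str, float],
                        frontier_tier: str = "frontier") -> float:
    """Percentage saved versus sending every call to the frontier tier."""
    total_calls = sum(calls_by_tier.values())
    baseline = total_calls * cost_per_call_by_tier[frontier_tier]
    if baseline <= 0:
        raise ValueError("baseline cost must be positive")
    routed = sum(count * cost_per_call_by_tier[tier]
                 for tier, count in calls_by_tier.items())
    return (baseline - routed) / baseline * 100


# Hypothetical mix of 100 calls and per-call costs in USD.
pct = routing_savings_pct(
    {"small": 65, "mid": 25, "frontier": 10},
    {"small": 0.001, "mid": 0.005, "frontier": 0.05},
)
```

With this mix the frontier tier handles 10% of calls but accounts for most of the routed cost, which is why the share of calls that cannot be moved off the frontier tier bounds the achievable savings.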

Framework 4: Total Cost of Ownership (TCO) for Self-Hosted vs. API

API TCO = token_costs + (engineering_hours × hourly_rate)

Self-hosted TCO = 
  GPU_compute_cost
  + model_serving_infrastructure
  + engineering_hours (higher — ops burden)
  + model_update_and_fine-tuning_costs

Break-even volume: API_TCO = Self-hosted_TCO
Solve for token_volume where self-hosting becomes cheaper
(Typically: >500M tokens/month for mid-tier models, >2B tokens/month for frontier-equivalent)
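Solving for the break-even volume can be sketched under a strong simplifying assumption: self-hosted cost is treated as a fixed monthly amount that does not grow with volume, while API cost is a fixed engineering amount plus a blended per-token price. The function name and the sample figures are hypothetical.

```python
def break_even_tokens_per_month(self_hosted_monthly_usd: float,
                                api_fixed_monthly_usd: float,
                                api_price_per_m_tokens: float) -> float:
    """Monthly token volume above which self-hosting is cheaper than the API."""
    if api_price_per_m_tokens <= 0:
        raise ValueError("API price must be positive")
    extra_fixed = self_hosted_monthly_usd - api_fixed_monthly_usd
    if extra_fixed <= 0:
        return 0.0  # self-hosting is already cheaper at any volume
    return extra_fixed / api_price_per_m_tokens * 1_000_000


# Hypothetical: $3,000/month self-hosted, $2,000/month API engineering cost,
# $2 blended API price per million tokens.
volume = break_even_tokens_per_month(3_000, 2_000, 2.0)
```

In reality GPU capacity has to be added in steps as volume grows, so the self-hosted side is a staircase rather than a flat line, and the break-even should be recomputed at each capacity step.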

Advanced Topics: Dynamic Routing and Predictive Scaling

Dynamic Routing Implementation

A production dynamic router requires:

Complexity classifier:

- Input: raw task text + metadata (task type, agent state, history length)
- Output: complexity score (0–1) or discrete tier (low/medium/high)
- Model: a fine-tuned small classifier, or a rule-based system for well-defined task taxonomies
- Latency budget: <50ms, so the router does not add meaningful latency to the call

Routing decision table:

complexity < 0.3  → small model (haiku-class)
complexity 0.3–0.7 → mid model (sonnet-class)
complexity > 0.7  → frontier model (opus-class)
complexity = UNCERTAIN → cascade: try mid, escalate if quality < threshold

Feedback loop:

- Log routing decisions and outcomes
- Weekly: compute quality scores per task type per model tier
- Update routing thresholds based on observed quality-cost data
- Alert when escalation rate for a task type exceeds 20% (indicates routing miscalibration)
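The decision table and the escalation alert can be sketched as two small functions. The names are hypothetical, `None` stands in for the UNCERTAIN case, and the complexity score itself is assumed to come from a separate classifier.

```python
from typing import Dict, List, Optional


def select_tier(complexity: Optional[float]) -> str:
    """Map a complexity score to a tier; None means the classifier was uncertain."""
    if complexity is None:
        return "cascade"  # try mid, escalate if quality is below threshold
    if not 0.0 <= complexity <= 1.0:
        raise ValueError("complexity must be between 0 and 1")
    if complexity < 0.3:
        return "small"
    if complexity <= 0.7:
        return "mid"
    return "frontier"


def miscalibrated_task_types(escalations: Dict[str, int],
                             totals: Dict[str, int],
                             alert_rate: float = 0.20) -> List[str]:
    """Task types whose escalation rate exceeds the alert threshold."""
    return sorted(
        task_type for task_type, total in totals.items()
        if total > 0 and escalations.get(task_type, 0) / total > alert_rate
    )
```

Running `miscalibrated_task_types` over the weekly logs gives the list of task types whose thresholds should be reviewed first.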

Predictive Scaling for Agent Fleets

Agent workloads have predictable temporal patterns (business hours, batch job schedules, user behavior cycles). Predictive scaling reduces cost by:

Pre-warming caches before anticipated demand spikes
Shifting batch workloads to off-peak hours, when rate limits are less contested and, for the few providers that offer it, off-peak pricing applies
Adjusting concurrency limits to avoid rate limit retries during peak periods

Scaling signal sources:

- Historical call volume by hour/day
- Upstream event triggers (user login patterns, scheduled jobs, external API webhooks)
- Queue depth in agent task queues

Implementation: A time-series forecasting model (even simple ARIMA or exponential smoothing) on historical TPM data provides sufficient accuracy for pre-warming and batch scheduling decisions.
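Simple exponential smoothing, the lighter of the two methods mentioned above, fits in a few lines. The function name and the smoothing factor `alpha=0.5` are illustrative assumptions; the series would be historical TPM readings for the same hour of day.

```python
from typing import Sequence


def exponential_smoothing_forecast(series: Sequence[float], alpha: float = 0.5) -> float:
    """One-step-ahead forecast: a weighted average that favors recent readings."""
    if not series:
        raise ValueError("series must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level
```

A higher `alpha` reacts faster to recent changes and a lower one smooths out noise. This method has no seasonality term, so keeping one series per hour-of-day slot is a simple way to capture daily cycles.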

Case Studies and Real-World Scenarios

Scenario 1: The Unbounded Context Problem

Situation: A customer support agent fleet accumulates full conversation history in every API call. After 10 turns, each call contains 8,000+ tokens of history.

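Cost impact: Per-call input tokens grow linearly with conversation length, so the cumulative input cost of a conversation grows roughly quadratically with the number of turns. The call at turn 20 carries about 10× the history of the call at turn 2.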

Solution applied:

- Implement conversation summarization: after every 5 turns, compress history into a 200-token summary
- Retain only the last 2 full turns verbatim for immediate context
- Result: input token count stabilized at ~1,500 regardless of conversation length
- Cost reduction: 75% on input tokens for long conversations
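The summarize-and-retain pattern from this scenario can be sketched as a small memory class. `ConversationMemory` is a hypothetical name, and `summarize` is a caller-supplied function (in practice a cheap model call) that folds older turns into the running summary; enforcing the 200-token summary length is left to that function.

```python
from typing import Callable, List


class ConversationMemory:
    """Keeps a running summary plus the most recent turns verbatim."""

    def __init__(self, summarize: Callable[[str, List[str]], str],
                 summarize_every: int = 5, keep_last: int = 2) -> None:
        if keep_last < 1 or summarize_every < 1:
            raise ValueError("keep_last and summarize_every must be at least 1")
        self._summarize = summarize
        self._summarize_every = summarize_every
        self._keep_last = keep_last
        self._turns_since_summary = 0
        self.summary = ""
        self.turns: List[str] = []

    def add_turn(self, text: str) -> None:
        self.turns.append(text)
        self._turns_since_summary += 1
        due = self._turns_since_summary >= self._summarize_every
        if due and len(self.turns) > self._keep_last:
            older = self.turns[:-self._keep_last]
            # Fold the previous summary and the older turns into a new summary.
            self.summary = self._summarize(self.summary, older)
            self.turns = self.turns[-self._keep_last:]
            self._turns_since_summary = 0

    def context(self) -> List[str]:
        """History to send with the next call: summary first, then recent turns."""
        return ([self.summary] if self.summary else []) + self.turns
```

Between summarization points the verbatim history grows by up to `summarize_every` turns, so the per-call context is bounded rather than perfectly constant.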

Scenario 2: The Repeated Research Problem

Situation: A research agent fleet answers similar questions repeatedly. Analysis shows 40% of incoming queries are semantically equivalent to a query answered in the past 24 hours.

Solution applied:

- Deployed semantic caching with 0.94 cosine similarity threshold
- Cache TTL: 4 hours for factual queries, 24 hours for stable reference queries
- Result: 38% of calls served from cache at near-zero cost
- Monthly savings: 35% reduction in total API spend

Scenario 3: The One-Model-Fits-All Fleet

Situation: An agent fleet uses a frontier model for all tasks, including simple classification and entity extraction tasks that represent 65% of call volume.

Solution applied:

- Audited task type distribution: 65% simple extraction, 25% structured generation, 10% complex reasoning
- Implemented static routing: small model for extraction, mid model for generation, frontier for reasoning
- Validated quality on 1,000-task sample before fleet-wide deployment
- Result: 72% reduction in per-task cost; quality metrics unchanged on extraction and generation tasks

Scenario 4: The Prompt Bloat Audit

Situation: System prompts across a 50-agent fleet averaged 4,200 tokens. Engineering team had added instructions incrementally over 18 months without removal.

Solution applied:

- Token audit revealed: 800 tokens of duplicate safety instructions, 600 tokens of outdated tool descriptions, 400 tokens of verbose examples replaceable with schemas
- Compressed prompts to average 1,800 tokens
- Enabled prompt caching on the compressed prefix
- Combined effect: 70% reduction in system prompt token costs

Key Takeaways and Decision Trees

The Five Laws of LLM Fleet Cost Management

Measure first. Instrument every call before optimizing anything. Intuition about where costs live is usually wrong.
Static content is the biggest waste. System prompts and repeated context dominate input costs. Cache them.
Output tokens are the most expensive. Constrain output length with explicit max_tokens and structured output formats.
Task complexity is not uniform. 60–80% of agent tasks in most fleets can be handled by models costing 10–50× less than frontier.
Caching compounds. Prompt caching + semantic caching + routing together achieve savings no single strategy can.

Decision Tree: Which Optimization to Implement First

START: What is your primary cost driver?
│
├─ Input tokens too high?
│   ├─ System prompt > 1,500 tokens? → Compress prompt first
│   ├─ Context window growing unbounded? → Implement conversation summarization
│   └─ Same prompt sent repeatedly? → Enable provider prompt caching
│
├─ Output tokens too high?
│   ├─ Responses longer than needed? → Add explicit length constraints + structured output
│   └─ High retry rate? → Fix prompt clarity; every retry re-bills both input and output tokens
│
├─ Model tier too expensive?
│   ├─ All tasks on frontier model? → Implement static routing immediately
│   └─ Routing exists but escalation rate > 20%? → Recalibrate routing thresholds
│
└─ Overall volume too high?
    ├─ Repeated queries? → Implement semantic caching
    └─ Non-real-time tasks? → Move to batch API (50% discount)

Quick Reference: Cost Reduction by Strategy

Strategy	Implementation Time	Typical Cost Reduction	Risk Level
Prompt compression	1–3 days	20–40% of input costs	Low
Provider prompt caching	1–2 days	40–70% of input costs	Low
Batch API for async tasks	1–3 days	50% on eligible calls	Low
Static model routing	1–2 weeks	60–80% overall	Medium
Semantic caching	2–4 weeks	20–60% additional	Medium
Dynamic routing	1–2 months	10–30% additional	Medium–High
Self-hosted models	2–6 months	60–90% at scale	High

Further Resources and Next Steps

Immediate Actions (This Week)

Run a token audit on your top 5 most-called agents. Use your provider's token counter or tiktoken.
Enable prompt caching on any agent whose static system prompt is long enough to qualify; providers set a minimum cacheable prefix length, commonly around 1,024 tokens.
Check your batch eligibility — identify which agent tasks do not require real-time response.
Add cost logging to every API call if not already present.

Skills to Build Next

Prompt engineering for compression: Learn to express the same constraints in fewer tokens using structured formats (JSON schema, XML tags, numbered lists) rather than prose.
Embedding-based retrieval: Semantic caching and RAG (retrieval-augmented generation) share infrastructure — building one enables the other.
Observability for LLM systems: Tools like LangSmith, Helicone, or custom dashboards on top of your call logs enable the continuous measurement that cost optimization requires.
Model evaluation methodology: Routing decisions require quality benchmarks. Learn to build task-specific evaluation rubrics before deploying routing logic.

Concepts That Connect to This Lesson

Agent memory architecture — how agents store and retrieve context affects token consumption directly; shorter effective memory means lower per-call token counts
Discovery infrastructure — how agents find and select tools affects the size of tool-definition payloads included in prompts
Knowledge market economics — the cost of acquiring structured knowledge via API must be weighed against the token cost of generating equivalent knowledge via LLM inference

The Core Principle to Carry Forward

Every token has a price. Every agent call is a purchasing decision. The teams that build cost-efficient agent fleets treat LLM APIs the way finance teams treat cloud infrastructure: with budgets, monitoring, optimization cycles, and a clear understanding that waste at scale is not a rounding error — it is a strategic liability.

Cost efficiency and capability are not in tension. A well-routed, well-cached fleet of agents using the right model for each task will outperform — in both quality and cost — a fleet that applies maximum compute to every problem indiscriminately.

This lesson is part of Empirica's agent infrastructure curriculum. It is designed to be traversed non-linearly — return to advanced sections as your implementation matures.

LLM API Cost Structure for Agent Fleets: A Multi-Audience Course Lesson on Per-Token Economics, Caching, and Model Routing