LLM API Cost Optimization for Agent Fleets: Beyond Per-Token Economics

Course Lesson | Empirica Agent Economy Series


Executive Summary: Cost Levers Beyond Token Count

Running a fleet of AI agents at scale is fundamentally an economics problem. Most teams start by watching token counts — and stop there. That's a mistake.

Token price is one variable in a multi-dimensional cost function. The teams that achieve 60–80% cost reductions on real workloads do so by attacking three levers simultaneously: per-token economics (choosing the right model for each task), caching (eliminating redundant computation), and model routing (dynamically matching task complexity to model capability). This lesson covers all three, plus the fleet-scale patterns that compound their effects.

What you'll be able to do after this lesson: - Calculate true cost-per-task, not just cost-per-token - Design a caching strategy that pays off within weeks - Build a routing layer that cuts costs without degrading output quality - Identify the five highest-leverage optimization patterns for agent fleets


Part 1: Per-Token Economics Fundamentals

The Unit Economics of LLM Inference

LLM APIs price on two dimensions: input tokens and output tokens. Output tokens are consistently more expensive than input tokens — often 3–5× more — because generation is computationally heavier than prefill. This asymmetry has direct design implications.

Key pricing variables to track: - Input token price (per million tokens) - Output token price (per million tokens) - Context window size (affects how much you can cache vs. re-send) - Batch vs. real-time pricing (batch is typically 50% cheaper with latency tradeoff) - Tier discounts (volume commitments can reduce effective rates significantly)

The Hidden Multipliers

Raw token price understates true cost. The actual cost-per-task depends on:

  • Context bloat: System prompts, tool definitions, and conversation history that get re-sent on every call. A 2,000-token system prompt sent 10,000 times per day costs more than the actual task tokens in many workloads.
  • Retry overhead: Failed or low-quality outputs that require re-runs. A model that's 10% cheaper but requires 25% more retries is net more expensive.
  • Output verbosity: Models differ significantly in how many tokens they use to express the same information. Verbose models inflate output costs.
  • Tool call loops: Agentic workflows that call tools iteratively accumulate tokens across multiple turns. A five-turn tool loop at GPT-4 prices can cost 10× a single-shot call.

Calculating True Cost-Per-Task

Cost per task = 
  (input_tokens × input_price) 
  + (output_tokens × output_price) 
  + (retry_rate × average_retry_cost) 
  + (tool_calls × average_tool_call_cost)

Build this calculation before optimizing. Teams that skip it often optimize the wrong variable.


Part 2: Caching Strategies — When Repetition Pays

Why Caching Matters More for Agents Than Chatbots

Single-turn chatbots have low repetition. Agent fleets have high repetition — the same system prompts, tool schemas, retrieved documents, and reasoning scaffolds appear across thousands of calls. This makes caching disproportionately valuable in agentic contexts.

Three Caching Layers

1. Prompt Caching (Provider-Level)

Several major LLM providers now offer prompt caching: if the prefix of your input matches a recently cached prefix, you pay a reduced rate (typically 50–90% less) for those cached tokens. The key design requirement is prefix stability — the cacheable portion must appear at the start of the prompt and must not change between calls.

Design implication: Structure prompts so that static content (system instructions, tool definitions, background context) comes first, and dynamic content (user query, current task state) comes last. This maximizes the cacheable prefix length.

2. Semantic Caching (Application-Level)

Store previous (query, response) pairs and retrieve them when a new query is semantically similar above a threshold. Unlike exact-match caching, semantic caching handles paraphrases and near-duplicates.

When it pays off: High query repetition rates (FAQ-style agents, classification tasks, document processing pipelines). Low payoff for highly variable creative or analytical tasks.

Implementation cost: Requires a vector store, embedding calls, and a similarity threshold decision. The threshold is critical — too low and you return stale or mismatched answers; too high and hit rates drop.

3. Intermediate Result Caching

In multi-step agent pipelines, cache the outputs of expensive intermediate steps (web searches, document parsing, sub-agent calls) rather than re-running them when the same input recurs. This is especially valuable in workflows where multiple downstream agents consume the same upstream result.

Cache Hit Rate Economics

A cache that costs $X/month to operate breaks even when:

Monthly savings = hit_rate × daily_calls × avg_cost_per_call × 30

For most production agent fleets processing more than a few thousand calls per day, even a 20% hit rate on a well-structured semantic cache generates positive ROI within the first month.

Caching Anti-Patterns to Avoid

  • Caching non-deterministic outputs: If your agent's response needs to reflect current state (prices, availability, live data), caching introduces staleness risk.
  • Over-aggressive TTLs: Long cache lifetimes save money but can serve outdated information. Match TTL to the rate of change of the underlying data.
  • Ignoring cache invalidation cost: Invalidating large caches on model updates or prompt changes has operational overhead. Plan for it.

Part 3: Model Routing ROI — Matching Task Complexity to Cost

The Core Insight

Not all tasks in an agent fleet require the same model. A frontier model like GPT-4o or Claude 3.5 Sonnet is appropriate for complex reasoning, nuanced writing, and multi-step planning. It is wasteful for classification, extraction, summarization of short texts, and routing decisions.

The price gap between frontier and capable mid-tier models is typically 10–20×. Routing even 50% of tasks to a cheaper model can cut total inference spend by 40–60%.

Building a Routing Layer

A routing layer intercepts each task before it reaches the LLM and assigns it to a model tier based on predicted complexity.

Routing signals to use: - Task type (classification vs. generation vs. reasoning) - Input length and structure - Required output format (structured JSON is often easier for smaller models) - Confidence threshold requirements - Latency constraints (smaller models are faster)

Routing approaches:

Approach How it works Best for
Rule-based Explicit task-type → model mapping Stable, well-defined task taxonomies
Classifier-based Small model predicts required capability tier Mixed or ambiguous task streams
Cascading Try cheap model first; escalate if confidence is low Tasks where quality is verifiable
LLM-as-router Use a cheap model to classify the task High-volume, heterogeneous fleets

Cascading: The Highest-ROI Pattern

Cascading routes every task to the cheapest capable model first. If the output passes a quality gate (confidence score, format validation, human-in-the-loop spot check), it's accepted. If not, the task escalates to a more capable model.

For workloads where 60–70% of tasks are routine, cascading can achieve frontier-model quality at mid-tier cost on the majority of volume.

Quality gate options: - Output format validation (did the model return valid JSON?) - Confidence scores from the model's logprobs - A cheap verifier model that checks the output - Deterministic rule checks (does the answer contain required fields?)

Model Routing Pitfalls

  • Routing overhead: The routing decision itself costs tokens and latency. Keep routing prompts short and use the cheapest possible model for the routing call.
  • Quality regression at the tail: Cascading works well on average but can fail on edge cases. Monitor escalation rates — a sudden spike signals distribution shift.
  • Vendor lock-in risk: Multi-model routing requires abstraction layers. Build against a unified interface (LiteLLM, a custom gateway) rather than vendor SDKs directly.

Part 4: Fleet-Scale Optimization Patterns

Pattern 1: Tiered Agent Architecture

Assign different model tiers to different agent roles:

  • Orchestrator agents (planning, task decomposition): frontier models
  • Worker agents (execution, extraction, formatting): mid-tier models
  • Validator agents (output checking, format verification): small/cheap models or rule-based systems

This mirrors how human organizations allocate senior vs. junior resources — expensive judgment at the top, efficient execution in the middle, automated checking at the bottom.

Pattern 2: Batch Processing Windows

Many agent tasks are not latency-sensitive. Queuing them for batch processing (where providers offer 50% discounts) can halve inference costs for background workloads. Separate your fleet's real-time path (user-facing, latency-sensitive) from its batch path (background enrichment, analysis, indexing) and price them differently.

Pattern 3: Context Window Management

Longer contexts cost more. In long-running agent sessions, implement:

  • Sliding window summarization: Compress older conversation turns into a summary rather than retaining full history.
  • Selective retrieval: Instead of injecting all available context, retrieve only the most relevant chunks for each call.
  • State externalization: Move agent state into structured storage (a database or key-value store) rather than keeping it in the context window.

Pattern 4: Output Length Control

Instruct models explicitly to be concise. Phrases like "respond in under 100 words" or "return only the JSON object, no explanation" can reduce output token counts by 30–50% on tasks where verbosity adds no value. This is one of the cheapest optimizations available — it costs one line of prompt engineering.

Pattern 5: Observability-Driven Optimization

You cannot optimize what you don't measure. Instrument every LLM call with:

  • Model used
  • Input and output token counts
  • Latency
  • Task type
  • Retry count
  • Cache hit/miss

Aggregate this into a cost dashboard broken down by task type and agent role. The highest-cost task types are your optimization targets. Teams that skip this step spend engineering time optimizing low-volume paths while high-volume expensive paths go unaddressed.


Part 5: Empirica's Role in Cost-Aware Agent Design

Empirica's research infrastructure is designed with agent fleet economics in mind. Several structural choices directly reduce the cost of agents that consume Empirica's outputs:

  • Structured, token-efficient outputs: Empirica produces content in formats (structured Markdown, JSON-compatible schemas) that minimize the parsing and re-processing work agents need to do, reducing downstream token consumption.
  • Agent-readable discovery infrastructure: Empirica's use of llms.txt, agents.json, and OpenAPI patterns means agents can locate and consume relevant content without expensive search-and-summarize loops.
  • Modular note architecture: Content is chunked at the concept level, so agents can retrieve precisely the context they need rather than ingesting large documents and filtering in-context.

These design choices compound with the caching and routing strategies described above. An agent that retrieves a compact, well-structured Empirica note as its context prefix — and caches that prefix — pays far less per task than one that retrieves unstructured web content and processes it in-context.


Practical Decision Framework: Cost vs. Quality Tradeoffs

Use this framework when deciding how aggressively to optimize:

Step 1: Classify Your Tasks

Task type Quality sensitivity Recommended tier
Classification, extraction Low–Medium Small/mid-tier model
Summarization Medium Mid-tier model
Structured data generation Medium Mid-tier with format validation
Complex reasoning, planning High Frontier model
Creative generation High Frontier model
Routing, validation Low Smallest capable model

Step 2: Estimate Repetition Rate

  • High repetition (>30% similar queries) → invest in semantic caching
  • Stable system prompts → enable provider-level prompt caching immediately
  • Variable, unique queries → caching ROI is low; focus on routing

Step 3: Set Quality Gates

Define what "good enough" means for each task type before routing. Without a quality gate, cascading is just hoping the cheap model works.

Step 4: Measure, Then Optimize

Run one week of instrumented baseline before making changes. Optimization without baseline data produces anecdotes, not improvements.

Step 5: Sequence Your Interventions

  1. Prompt caching (lowest effort, immediate payoff for any fleet with stable system prompts)
  2. Output length control (one line of prompt engineering, 30–50% output token reduction)
  3. Model routing (higher engineering effort, highest total savings)
  4. Semantic caching (requires infrastructure, high payoff for repetitive workloads)
  5. Batch processing (requires workflow restructuring, high payoff for non-latency-sensitive tasks)

Case Studies: Real Fleet Optimization Wins

Case A: Document Processing Pipeline

Setup: An agent fleet processing 50,000 documents per day for extraction and classification. Each call used a frontier model with a 1,500-token system prompt.

Interventions applied: - Enabled prompt caching on the static system prompt → ~60% reduction in input token cost for the system prompt portion - Routed classification subtasks to a mid-tier model → 80% of calls now use the cheaper model - Added output length instruction ("return only the JSON schema") → output tokens reduced by 40%

Result: Total inference cost reduced by approximately 65% with no measurable quality degradation on the extraction task.

Case B: Customer-Facing Agent with Mixed Task Types

Setup: A customer service agent handling 10,000 conversations per day. Tasks ranged from simple FAQ retrieval to complex complaint resolution.

Interventions applied: - Implemented cascading: simple queries routed to mid-tier model first, escalated to frontier on low-confidence outputs - Semantic cache on FAQ-type queries (high repetition rate identified via observability) - Sliding window summarization for long conversations

Result: 55% of conversations handled entirely by mid-tier model. Semantic cache hit rate of 38% on FAQ queries. Total cost reduction of approximately 50%.

Case C: Background Research Fleet

Setup: Agents running nightly research and enrichment tasks, not user-facing.

Interventions applied: - Moved entire fleet to batch processing mode - No other changes

Result: 50% cost reduction with no quality impact. Latency increased from seconds to hours — acceptable for a background task.


Key Takeaways & Implementation Checklist

Core Concepts

  • Token price is not cost: True cost-per-task includes context bloat, retries, tool call loops, and output verbosity.
  • Output tokens cost more than input tokens: Design for concise outputs explicitly.
  • Caching ROI is highest for agent fleets: High repetition rates make caching far more valuable than in single-turn applications.
  • Model routing is the highest-leverage optimization: A 10–20× price gap between tiers means even imperfect routing generates large savings.
  • Observability is a prerequisite: You cannot optimize without per-call cost data broken down by task type.

Implementation Checklist

Immediate (this week): - [ ] Enable provider-level prompt caching if your provider supports it - [ ] Add output length instructions to all prompts where verbosity is unnecessary - [ ] Instrument all LLM calls with token counts, model, task type, and latency

Short-term (this month): - [ ] Build a cost dashboard broken down by task type and agent role - [ ] Identify your top 3 highest-cost task types - [ ] Implement rule-based routing for clearly separable task types

Medium-term (this quarter): - [ ] Evaluate semantic caching for your highest-repetition task types - [ ] Build or adopt a cascading routing layer with quality gates - [ ] Separate real-time and batch workloads; move batch to discounted processing

Ongoing: - [ ] Monitor escalation rates in cascading systems for distribution shift - [ ] Review cache TTLs against data freshness requirements quarterly - [ ] Re-benchmark routing thresholds when you upgrade or change models


This lesson is part of Empirica's Agent Economy Series. Related lessons cover API service categories for agent consumption, discovery infrastructure patterns, and structured output design for agent-readable content.