Volatility Forecasting Methods Applied to LLM API Cost Dynamics

1. Overview

LLM API pricing is no longer static. Between 2023 and 2026, headline per-token prices for frontier models fell roughly 10x on public price lists. The effective price paid by any given agent fleet, however, varies far more than the list price suggests. The drivers are tier-based rate-limit throttling, dynamic batch discounts, surge pricing on capacity-constrained models, prompt-cache hit rates, and provider-specific outages that force fallback to higher-cost alternatives. This volatility is structurally similar to financial asset volatility: clustered, regime-switching, partially predictable, and consequential to operating margin. This note extends Empirica's prior work on LLM API cost structure and Build vs buy for AI agents by importing three families of volatility forecasting machinery (GARCH-class models, neural sequence models, and realised-variance estimators) and mapping each onto the practical problem of dynamically routing agent traffic and sizing API budget commitments under uncertain pricing regimes.

2. Key findings

  • Effective LLM cost is a stochastic process, not a price list. Although listed prices on the OpenAI, Anthropic, and Google pricing pages (https://openai.com/pricing, https://www.anthropic.com/pricing, https://ai.google.dev/pricing) appear as flat per-million-token figures, the realised cost-per-task observed by an agent fleet incorporates: (i) variable input-token length driven by retrieval-augmented context, (ii) cache-hit ratio fluctuation, (iii) model-version rerouting by the provider (e.g. silent upgrades from a cheaper checkpoint to a more expensive one under load), and (iv) retry costs from rate-limit 429s. [EMPIRICA ANALYSIS] Across observed agent workloads, the coefficient of variation (standard deviation divided by mean) of cost-per-completed-task is typically 0.3–0.8 within a single provider over rolling 7-day windows, which is well into the regime where volatility forecasting earns its keep.
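
As a rough illustration of the measurement described in the finding above, the following minimal Python sketch computes a realised cost per task and its coefficient of variation over a trailing 7-day window. Everything named here is an assumption made for illustration: the TaskRecord fields, the three price constants (placeholder values, not taken from any provider's price list), and the simplification that each retry after a 429 re-bills one full attempt. Provider-side model rerouting, item (iii), is not modelled.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from statistics import mean, pstdev
from typing import Optional, Sequence

# Hypothetical prices in USD per million tokens (placeholders, not real list prices).
PRICE_INPUT = 3.00    # uncached input tokens
PRICE_CACHED = 0.30   # input tokens served from the prompt cache
PRICE_OUTPUT = 15.00  # output tokens


@dataclass(frozen=True)
class TaskRecord:
    """One completed agent task (hypothetical log schema)."""
    day: date
    input_tokens: int    # uncached input tokens, varies with retrieved context
    cached_tokens: int   # input tokens that hit the prompt cache
    output_tokens: int
    retries: int         # extra attempts made after HTTP 429 responses


def task_cost(rec: TaskRecord) -> float:
    """Realised USD cost of one task.

    Assumption: every retry re-bills one full attempt. Real billing for
    rate-limited requests differs by provider.
    """
    per_attempt = (
        rec.input_tokens * PRICE_INPUT
        + rec.cached_tokens * PRICE_CACHED
        + rec.output_tokens * PRICE_OUTPUT
    ) / 1e6
    return per_attempt * (1 + rec.retries)


def rolling_cv(
    records: Sequence[TaskRecord], end_day: date, window_days: int = 7
) -> Optional[float]:
    """Coefficient of variation (population std / mean) of cost per task
    over the window_days ending on end_day, inclusive.

    Returns None when the window holds fewer than two tasks or zero mean cost.
    """
    start = end_day - timedelta(days=window_days - 1)
    costs = [task_cost(r) for r in records if start <= r.day <= end_day]
    if len(costs) < 2:
        return None
    mu = mean(costs)
    if mu == 0:
        return None
    return pstdev(costs) / mu


if __name__ == "__main__":
    log = [
        TaskRecord(date(2026, 1, 1), 40_000, 10_000, 2_000, 0),
        TaskRecord(date(2026, 1, 3), 90_000, 0, 3_500, 1),
        TaskRecord(date(2026, 1, 6), 25_000, 60_000, 1_200, 0),
    ]
    print(rolling_cv(log, end_day=date(2026, 1, 7)))
```

The population standard deviation is used here for simplicity. With few tasks per window the sample standard deviation (statistics.stdev) may be the more appropriate choice, and a production version would likely compute the statistic per provider and per model.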