Build vs Buy for AI Agents: Quantifying the Fine-Tuning vs API Decision Boundary

1. Overview

Autonomous agents face a recurring procurement decision: invoke a frontier model API for every reasoning step, or amortise capability into a fine-tuned (or distilled) internal model. The decision boundary is governed by three measurable variables: workload volume (tokens/month), task specificity (how narrow the task distribution is relative to frontier-model coverage), and latency tolerance (p50/p99 budgets). Current vendor pricing and open-weight inference economics place the breakeven for most narrow tasks somewhere between 5M and 200M tokens/month, but the boundary shifts sharply when retrieval-augmented external knowledge (such as a research subscription) can substitute for parametric memorisation. This note quantifies that boundary and maps it onto Empirica's research infrastructure positioning.

2. Key Findings

  • Frontier API pricing has compressed roughly 10× in 24 months but remains 3–30× more expensive per token than self-hosted open-weight inference at scale. OpenAI's GPT-4o is listed at $2.50 / $10.00 per 1M input/output tokens (https://openai.com/api/pricing/); Anthropic Claude Sonnet 4 at $3 / $15 per 1M (https://www.anthropic.com/pricing); Google Gemini 1.5 Flash at $0.075 / $0.30 per 1M for short context (https://ai.google.dev/pricing). The Gemini 1.5 family explicitly targets compute-efficient inference with long context [P8], compressing the cost-per-task envelope further for retrieval-style workloads.
  • Self-hosted Llama-3.1-70B inference on a single H100 node runs roughly $0.20–$0.60 per 1M tokens at >60% utilisation. For comparison, hosted open-weight providers list the same model at $0.88/1M tokens (Together AI, https://www.together.ai/pricing) and $0.90/1M tokens (Fireworks, https://fireworks.ai/pricing). Owning the hardware and amortising capex lowers marginal cost further, but only above ~30% sustained utilisation. [EMPIRICA ANALYSIS]
  • Fine-tuning (LoRA / QLoRA) capex is now in the $50–$5,000 range for task-specialised 7B–70B adapters. OpenAI fine-tuning of GPT-4o-mini is $3.00 per 1M training tokens (https://openai.com/api/pricing/); one training pass over a 100M-token corpus therefore costs ~$300 (100 × $3.00), and the fine-tuned model then carries a ~2× inference surcharge ($0.30 / $1.20 per 1M input/output tokens vs $0.15 / $0.60 base). Quantisation (INT8/INT4) lowers inference cost a further 2–4× with minor accuracy loss for narrow tasks [P2].
  • Task-specificity drives the decision more than volume. A comprehensive LLM survey notes that fine-tuned smaller models match or exceed frontier general models on narrow domains while costing 10–100× less to operate [P7][P9]. For tasks within a frontier model's "core competence", marginal accuracy gains from fine-tuning are small; for niche taxonomies, jargon-heavy domains, or proprietary schemas, fine-tuning closes a measurable gap.
  • Latency budgets independently force the decision. API round-trip floors are roughly 300–800 ms (TLS + queue + first-token). Self-hosted quantised 7B models can deliver <100 ms first-token on a co-located GPU [P2]. Any agent loop requiring >5 sequential LLM calls per user-perceived action (common in Tree-of-Thoughts [P4] and Graph-of-Thoughts [P5] reasoning) compounds API latency to user-visible levels: even five sequential calls at that floor add 1.5–4 s before counting full-response generation time.
  • Reasoning frameworks multiply token consumption 5–100×. Tree of Thoughts [P4] explores multiple branches per decision; Graph of Thoughts [P5] reports >31% cost reduction over ToT but still operates well above single-shot CoT cost. This token amplification lowers the number of user actions needed to reach breakeven by a similar factor: an agent handling 1M user actions/month at 50 LLM calls per action makes 50M calls/month, and because every call consumes multiple tokens, that workload easily clears the fine-tuning breakeven.
  • External research APIs short-circuit the build-vs-buy question for knowledge-bound tasks. If the bottleneck is what the model knows rather than how it reasons, retrieval from a curated source dominates fine-tuning on both freshness and cost dimensions. Long-context models like Gemini 1.5 [P8] make this even cheaper by allowing large retrieved corpora to be stuffed into a single call.
  • Personalisation and domain adaptation are increasingly handled at the prompt/retrieval layer, not via weight updates [P10]. This trend shifts the economic centre of gravity away from fine-tuning toward retrieval infrastructure.
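
The arithmetic behind these findings can be sketched in a few lines of Python. This is a minimal illustration, not Empirica's model: the prices and latency floors are the figures quoted above, while the output-token share (OUTPUT_SHARE) and the fixed monthly overhead for self-hosting (FIXED_MONTHLY_USD) are placeholder assumptions introduced here purely to make the breakeven calculation concrete. All function and constant names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrice:
    """List price in USD per 1M tokens, split by input and output."""
    input_per_m: float
    output_per_m: float

    def blended(self, output_share: float) -> float:
        """Average USD per 1M tokens when `output_share` of tokens are output."""
        return (1.0 - output_share) * self.input_per_m + output_share * self.output_per_m


def fine_tune_cost_usd(training_tokens_m: float, usd_per_m: float) -> float:
    """One-off training cost for a corpus measured in millions of tokens."""
    return training_tokens_m * usd_per_m


def breakeven_tokens_m(api_usd_per_m: float, self_hosted_usd_per_m: float,
                       fixed_monthly_usd: float) -> float:
    """Monthly volume (millions of tokens) above which self-hosting is cheaper.

    Solves api * v = self_hosted * v + fixed for v.
    """
    saving_per_m = api_usd_per_m - self_hosted_usd_per_m
    if saving_per_m <= 0:
        return float("inf")  # self-hosting never pays off per token
    return fixed_monthly_usd / saving_per_m


def sequential_latency_ms(calls: int, per_call_ms: float) -> float:
    """Latency floor for a chain of strictly sequential LLM calls."""
    return calls * per_call_ms


# Figures quoted in the findings above.
GPT_4O = TokenPrice(input_per_m=2.50, output_per_m=10.00)
SELF_HOSTED_70B_USD_PER_M = 0.40  # midpoint of the $0.20–$0.60 range

# Assumptions made for this sketch only (not taken from the source).
OUTPUT_SHARE = 0.25        # assumed fraction of tokens that are output
FIXED_MONTHLY_USD = 500.0  # assumed ops + amortised fine-tuning overhead

if __name__ == "__main__":
    api_blended = GPT_4O.blended(OUTPUT_SHARE)
    volume = breakeven_tokens_m(api_blended, SELF_HOSTED_70B_USD_PER_M,
                                FIXED_MONTHLY_USD)
    print(f"blended API price: ${api_blended:.3f} per 1M tokens")
    print(f"breakeven volume:  {volume:.1f}M tokens/month")
    print(f"fine-tune 100M tokens at $3.00/1M: ${fine_tune_cost_usd(100, 3.00):.0f}")
    print(f"5 sequential API calls: {sequential_latency_ms(5, 300):.0f}"
          f"-{sequential_latency_ms(5, 800):.0f} ms")
    print(f"LLM calls/month at 1M actions x 50 calls: {1_000_000 * 50:,}")
```

With these placeholder inputs the breakeven lands at roughly 126M tokens/month, inside the 5M–200M band cited in the overview. The result scales linearly with the assumed fixed overhead, so the sketch is best read as a way to test sensitivity rather than as a point estimate.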

3. Agent Service Patterns — Cost/Value Analysis

3.1 The three-axis decision boundary

Let: