Build vs Buy for AI Agents: Quantifying the Fine-Tune vs External API Decision Boundary
1. Overview
Agent fleet operators face a recurring capital allocation question: when does it become cheaper to fine-tune (or distill) an internal model for a recurring capability versus continuing to call an external API? The decision is not purely about per-token cost — it is a multi-variable optimisation across query volume, latency SLA, accuracy tolerance, training data availability, model staleness risk, and operational overhead. This note quantifies the crossover point using public vendor pricing (OpenAI, Anthropic, Together, Fireworks, AWS, Modal) and a first-principles TCO model, and maps the decision to Empirica's own positioning as a "buy" option for the research-knowledge slice of agent workloads.
2. Key findings
- Pure inference economics favor "buy" until ~10–50M queries/month for general reasoning. At OpenAI's GPT-4o-mini pricing of $0.15/M input + $0.60/M output tokens (https://openai.com/api/pricing/) and Anthropic Claude Haiku at $0.25/M + $1.25/M (https://www.anthropic.com/pricing), a typical 2K-input / 500-output agent call costs ~$0.0006. Self-hosting a fine-tuned Llama-3.1-8B on a single H100 at ~$2/hr (Modal, Together, Lambda public rates: https://www.together.ai/pricing, https://modal.com/pricing) achieves ~50–80 tok/s sustained — roughly 200–400 such calls/hr, or amortised ~$0.005–$0.01 per call before utilisation gaps. The external API wins until utilisation approaches 60–80%.
- Fine-tuning capex is now a rounding error; opex dominates. LoRA fine-tunes of 7–13B models cost $50–$500 on Together/Fireworks (https://docs.fireworks.ai/fine-tuning/fine-tuning-models). The relevant fixed cost is not training — it is MLOps headcount, eval harness construction, and drift monitoring, typically $15–40K/month for a small team. (speculative) based on observed startup org structures.
- Latency SLAs below ~300ms p50 push agents toward self-hosting or speculative-decoded small models — frontier API p50 is 400–900ms cold and 200–600ms warm (artificialanalysis.ai public benchmarks). A 7B model on dedicated GPU regularly hits sub-150ms TTFT.
- Accuracy floors above ~92% on narrow tasks are reachable via distillation. Public results from the Orca, Phi, and Gemma distillation lines, plus vendor case studies from OpenAI's fine-tuning docs (https://platform.openai.com/docs/guides/fine-tuning), show 7–13B fine-tunes routinely match GPT-4-class accuracy on bounded domains (classification, extraction, structured generation). For open-ended reasoning and long-horizon planning, the gap persists.
- Training data availability is the binding constraint more often than cost. Empirica's prior note on "Multi-agent systems with specialised subagents" identified that most agent fleets generate <10K labelled trajectories/month — below the ~50K threshold where supervised fine-tuning reliably beats prompt-engineered frontier models for complex tasks.
- Routing layers compress the decision. Techniques like RouteLLM and FrugalGPT demonstrate that hybrid stacks — cheap local model for 70–80% of queries, frontier API for the hard residual — typically beat either pure strategy by 40–70% on blended cost while preserving quality. The build-vs-buy question becomes "what fraction of the distribution do we internalise?" not a binary.
3. Agent service patterns: what the TCO model actually looks like
Define the monthly TCO of a capability as: