Empirica Technologies

1. Overview

The multi-agent capability market has moved from theoretical framework to visible commercial offerings (published protocols, shipping products, public pricing), though usage telemetry remains private to the vendors. This synthesis focuses on observable delegation patterns: which tasks vendors visibly sell and document as outsourced by orchestrators to specialised subagents, what pricing structures have emerged, and where the cost-benefit threshold sits between calling an external capability and internalising it through fine-tuning, prompt engineering, or a self-hosted small model. The principal-agent framing from [P7] and [P9] gives the theoretical scaffolding; public vendor pricing pages give the observable market structure. Citation labels are keyed in the References section below.

2. Key Findings

Delegation appears concentrated, not uniform. Across widely used orchestrator frameworks (LangGraph, CrewAI, AutoGen, OpenAI's Agents SDK, Anthropic's MCP-based clients), the outsourced capabilities most visible in public framework documentation and integration catalogues fall into five categories: (i) web search and retrieval, (ii) structured research/knowledge lookup, (iii) code execution sandboxes, (iv) browser/automation actions, and (v) document parsing/OCR. This is a qualitative reading of what the frameworks document and integrate, not a measurement of call volumes — usage telemetry is private (see Open Questions). Inference itself — the LLM call — is the meta-capability that is always outsourced unless the operator runs its own weights.
Principal-agent liability creates a moat for branded capability providers. [P7] argues that LLM-based agentic systems inherit classical principal-agent problems: information asymmetry, misaligned incentives, monitoring cost, and ambiguous liability allocation. The practical consequence is that orchestrators have an incentive to prefer subagent providers with auditable outputs, SLAs, and contractual indemnity over cheaper but opaque alternatives — a preference consistent with the market positioning of contracted search APIs (Exa, Tavily, Perplexity Sonar, Brave Search) relative to uncontracted scraping.
Outsourcing can prevail when monitoring is costly. In [P9]'s principal/prime-agent/subagent model, outsourcing can be the prevailing mode when the principal's information processing is limited and when collusion between agents is a risk under insourcing. Translated to agent stacks: when the orchestrator cannot cheaply verify a subagent's intermediate reasoning, the model's logic favours paying a single bundled fee to a vertically integrated capability vendor (e.g., a research API that returns cited, structured output) over assembling raw web-scrape + summariser internally.
Pricing has bifurcated into per-call and subscription tiers. Public pricing observed across the capability layer (vendor pricing pages, cited by URL):
- Exa API (https://exa.ai/pricing): per-search and per-content-retrieval, with volume discounts.
- Tavily (https://tavily.com/#pricing): free tier + per-1k-search pricing, subscription bundles.
- Perplexity Sonar API (https://docs.perplexity.ai/guides/pricing): per-million-token input/output plus per-request search surcharge.
- Browserbase (https://www.browserbase.com/pricing): per-browser-minute.
- Firecrawl (https://www.firecrawl.dev/pricing): per-page-crawled with subscription bundles.
- E2B code sandboxes (https://e2b.dev/pricing): per-second of compute.
- OpenAI pricing (https://openai.com/api/pricing): per-token tiers plus tool surcharges (web search, file search, code interpreter).
- Anthropic pricing (https://www.anthropic.com/pricing): per-token with prompt-caching discounts up to ~90%.
The pattern: specialised subagents charge per-unit-of-work that maps to their marginal cost (page crawled, browser minute, search executed), while LLM platforms charge per token. Several of the pricing pages above combine usage-based and subscription components; the hybrid structure dampens budget variance, which matters for autonomous fleets whose call volume is hard to forecast.
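The variance-dampening effect of a hybrid structure can be illustrated with a small sketch. Every number below (the per-call price, the flat fee, the included-call allowance, the overage price, and the monthly call volumes) is a hypothetical value chosen for illustration; none is taken from a vendor pricing page, and the function names are likewise invented here.

```python
from statistics import mean, pstdev


def per_call_cost(calls: int, price_per_call: float) -> float:
    """Pure usage-based bill: cost scales linearly with call volume."""
    return calls * price_per_call


def hybrid_cost(calls: int, monthly_fee: float, included_calls: int,
                overage_price: float) -> float:
    """Flat fee covering an allowance, plus a per-call charge beyond it."""
    overage = max(0, calls - included_calls)
    return monthly_fee + overage * overage_price


# Hypothetical bursty fleet: call volume swings widely month to month.
MONTHLY_CALLS = [20_000, 180_000, 40_000, 160_000, 30_000, 170_000]

per_call_bills = [per_call_cost(c, price_per_call=0.005) for c in MONTHLY_CALLS]
hybrid_bills = [
    hybrid_cost(c, monthly_fee=500.0, included_calls=150_000, overage_price=0.004)
    for c in MONTHLY_CALLS
]

if __name__ == "__main__":
    for name, bills in (("per-call", per_call_bills), ("hybrid", hybrid_bills)):
        print(f"{name:9s} mean={mean(bills):8.2f} stdev={pstdev(bills):8.2f}")
```

With these made-up inputs the hybrid plan produces the higher average bill but a much smaller month-to-month spread. Different assumed allowances or overage prices can reverse the expected-cost ordering, so the sketch shows the shape of the trade-off rather than a ranking.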
Cost-benefit threshold for outsource vs fine-tune. As an order-of-magnitude heuristic — an illustrative estimate rather than a measured figure: a small-model fine-tune (LoRA on a 7–13B base) reaches breakeven against API consumption only when (a) the task is repetitive, (b) call volume exceeds roughly 10⁶–10⁷ calls/month at current per-call rates, and (c) the capability does not depend on fresh external data. Anything requiring real-time information (search, news, prices) is structurally outsource-only — fine-tuning cannot internalise freshness.
Management practice variance predicts capability-stack quality. [P4]'s finding that productivity differences across firms reflect management practice variance — monitoring, targets, incentives — suggests a close parallel for agent fleet operators. Fleets that systematically measure subagent reliability (latency p99, factuality, schema-conformance) are positioned to extract more value per dollar of capability spend than those that don't. The same instrumentation gap that explains firm-level productivity dispersion may be reproducing itself in agent operations.
Delegation is a primary consumer experience category. [P2] identifies delegation explicitly as one of four core consumer-AI interaction modes (alongside data capture, classification, and social). A plausible implication: consumer-facing agents that visibly delegate to specialised providers (e.g., "checked with Wolfram", "verified via Perplexity") may gain trust faster than monolithic agents — creating a marketing incentive for orchestrators to advertise their subagent stack, which in turn drives demand for branded capability layers.

3. Agent Service Patterns: What Is Bought, Why, and at What Price

Pattern A — Research and retrieval. In research-oriented agent fleets, search, retrieval, and structured-knowledge APIs plausibly represent the largest single category of non-LLM API spend, though no public benchmark exists to quantify the share. The economic logic: an LLM call that needs grounded facts costs roughly the same whether grounding succeeds or not, but the value of the call collapses without grounding. The trade-off favours the per-search fee whenever the cost of hallucinated output — verification effort, or discarded work — exceeds it; no public cost model pins down where that line sits for a given fleet.
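One way to make this trade-off concrete is an expected-cost comparison. The sketch below is an illustrative assumption, not a published cost model: it introduces hypothetical probabilities of a bad output with and without grounding, plus a single hypothetical cost per bad output (verification effort or discarded work), and checks whether the expected saving exceeds the per-search fee.

```python
def expected_saving(p_bad_ungrounded: float, p_bad_grounded: float,
                    cost_per_bad_output: float) -> float:
    """Expected cost avoided per call by grounding (hypothetical model)."""
    if not 0.0 <= p_bad_grounded <= p_bad_ungrounded <= 1.0:
        raise ValueError("expected 0 <= p_bad_grounded <= p_bad_ungrounded <= 1")
    return (p_bad_ungrounded - p_bad_grounded) * cost_per_bad_output


def grounding_pays_off(p_bad_ungrounded: float, p_bad_grounded: float,
                       cost_per_bad_output: float, search_fee: float) -> bool:
    """True when the expected saving from grounding exceeds the search fee."""
    saving = expected_saving(p_bad_ungrounded, p_bad_grounded, cost_per_bad_output)
    return saving > search_fee


if __name__ == "__main__":
    # All inputs are made-up illustrative values.
    print(grounding_pays_off(0.20, 0.05, cost_per_bad_output=0.50, search_fee=0.01))
    print(grounding_pays_off(0.20, 0.05, cost_per_bad_output=0.05, search_fee=0.01))
```

Under these assumed inputs the first case favours paying the fee and the second does not. The inputs a real fleet would need (failure rates and the cost of a bad output) are exactly the quantities for which no public data exists.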

Pattern B — Code execution and computational tools. Sandboxes (E2B, Modal, Daytona, Replit) charge per compute-second. Adoption is driven by safety (isolating arbitrary generated code from the agent host) and capability (Python with libraries that the LLM cannot run internally). Several providers compete in this category; none publish margins.

Pattern C — Browser and action execution. Browser-as-a-service (Browserbase, Anchor, Steel) is priced per browser-minute (see vendor pricing above). The engineering surface — stealth, captcha handling, session management, proxy infrastructure — is substantial and hard to replicate, but no public margin data is available to establish how this layer's economics compare with other capability categories.

Pattern D — Document and unstructured-data parsing. OCR, PDF parsing, and table extraction (Reducto, LlamaParse, Unstructured.io) are priced per page. Adoption is volume-driven and concentrated among agents doing financial, legal, and scientific document analysis.

Pattern E — Structured research and curated knowledge. A long-standing category — financial data terminals and research databases predate agents by decades — now being consumed programmatically: pre-synthesised, structured output sold by subscription rather than purely per call. The buyer trades per-call flexibility for output needing less downstream summarisation and verification. Of the five patterns this one has the least public usage data.

Cost-benefit thresholds (illustrative heuristics, not procurement guidance).

Structurally external: capabilities that depend on fresh external data, regulatory-grade citation, or specialised infrastructure (browser fleets, sandboxes) cannot be internalised by fine-tuning.
External at moderate volumes: for standard inference, embeddings, and generic summarisation, the illustrative order-of-magnitude breakeven above (roughly 10⁶–10⁷ calls/month) marks where internalisation begins to be arguable.
Internalisation candidates: narrow, stable, latency-critical, privacy-sensitive tasks at call volumes that amortise the up-front fine-tuning and ongoing inference-hosting cost.
Within a multi-agent hierarchy, [P9]'s model points to outsourcing when monitoring intermediate reports is costly and collusion risk between subagents is non-trivial, and to insourcing when the principal can cheaply verify reports.
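The breakeven logic behind these thresholds can be sketched as a simple calculation: internalisation pays once the per-call saving, multiplied by monthly volume, covers the fixed monthly cost of running a fine-tuned model. The function name and every figure below are hypothetical assumptions for illustration, not measured costs.

```python
from typing import Optional


def breakeven_calls_per_month(fixed_monthly_cost: float,
                              api_price_per_call: float,
                              self_hosted_cost_per_call: float) -> Optional[float]:
    """Monthly call volume above which self-hosting is cheaper than the API.

    fixed_monthly_cost: amortised fine-tuning plus hosting overhead (assumed).
    Returns None when self-hosting saves nothing per call, so no volume
    ever reaches breakeven.
    """
    saving_per_call = api_price_per_call - self_hosted_cost_per_call
    if saving_per_call <= 0:
        return None
    return fixed_monthly_cost / saving_per_call


if __name__ == "__main__":
    # Made-up inputs: 5,000/month fixed cost, 0.002 per API call,
    # 0.0005 marginal cost per self-hosted call.
    volume = breakeven_calls_per_month(5_000.0, 0.002, 0.0005)
    print(f"breakeven at roughly {volume:,.0f} calls/month")
```

These assumed inputs put breakeven at a few million calls per month, inside the illustrative 10⁶–10⁷ range. The result is sensitive to all three inputs, and the calculation does not apply to fresh-data capabilities, which stay external regardless of volume.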

4. Open Questions

What fraction of total agent-cycle cost is non-LLM capability spend? Vendor telemetry is private; no industry benchmark exists.
Does delegation depth (orchestrator → subagent → sub-subagent) compound reliability decay multiplicatively? [P7]'s principal-agent framing suggests yes, but empirical measurement is absent.
Will MCP (Model Context Protocol) and similar standards commoditise the capability layer? If standardised tool interfaces drive substitutability, pricing power shifts to whoever has the strongest brand, SLA, or unique data — not the integration layer.
Cross-fleet collusion risk. [P9] notes that insourcing is prone to agent collusion; the analogue in LLM agent systems — multiple subagents from the same provider producing correlated errors — is unstudied.
Fine-tune economics post-distillation. As distilled small models improve, the volume threshold for internalisation falls. A speculative reading: the order-of-magnitude breakeven heuristic above could compress materially over the next few years.
Regulatory liability allocation. [P7] flags this explicitly; no jurisdiction has yet clarified whether the orchestrator, the subagent provider, or the LLM platform bears liability for downstream harms from delegated tool use.
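The delegation-depth question above can be made concrete with a sketch of what multiplicative decay would look like. This assumes that each hop succeeds independently with the same probability, which is a simplifying assumption introduced here rather than an empirical result; correlated errors between subagents from the same provider would violate it.

```python
def chain_reliability(per_hop_reliability: float, depth: int) -> float:
    """End-to-end success probability for a delegation chain of `depth` hops,
    assuming independent hops with identical reliability (hypothetical model)."""
    if not 0.0 <= per_hop_reliability <= 1.0:
        raise ValueError("per_hop_reliability must be in [0, 1]")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return per_hop_reliability ** depth


if __name__ == "__main__":
    # 0.95 per hop is an arbitrary illustrative value.
    for depth in range(1, 5):
        print(depth, round(chain_reliability(0.95, depth), 4))
```

Under this assumption a hop that succeeds 95% of the time yields roughly 86% end-to-end reliability at depth three. Whether real agent chains behave this way remains unmeasured.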

5. Implications

Instrumentation of the capability layer matters. Per-call measurement of subagent latency, factuality, and schema-conformance is the operational lever the management-practice evidence [P4] points at — and the cheapest one to adopt early.
Outsourcing is the structural default for fresh-data capabilities. Fine-tuning cannot internalise freshness, so the internalisation argument structurally fails for time-sensitive capabilities such as search, news, and prices.
Pricing structure allocates volume risk. Per-call pricing leaves volume risk with the buyer; flat subscriptions shift it to the provider. For autonomous fleets with bursty call patterns, the two structures present a trade-off between budget variance and expected cost rather than a strict ranking.
The breakeven calculation is a moving target. As distilled open-weight models improve and fine-tune costs fall, the volume threshold for internalisation shifts over time rather than holding as a one-time constant.
Monitoring cost is part of the effective price of delegation. In the principal-agent framing of [P7] and [P9], comparing a subagent that returns structured, cited, schema-conformant output with a cheaper but opaque alternative is not a pure price comparison: the orchestrator's downstream verification cost differs between the two, and the models treat that cost as a real component of the delegation decision.
Visible delegation may carry marketing value. Per [P2], delegation is a core consumer-AI interaction mode; a plausible extension is that visibly delegating to named providers affects consumer trust, which would link consumer-facing transparency to demand for branded capability layers.
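The per-call instrumentation described in the first implication above can be sketched as follows. The record fields, function names, and the choice of the nearest-rank percentile method are assumptions made for illustration; they do not correspond to any specific framework's API.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class CallRecord:
    """One logged subagent call (hypothetical schema)."""
    provider: str
    latency_ms: float
    schema_ok: bool                     # output validated against expected schema
    factual_ok: Optional[bool] = None   # None when no factuality check was run


def percentile_nearest_rank(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def summarise(records: Sequence[CallRecord]) -> Dict[str, Optional[float]]:
    """Aggregate p99 latency, schema-conformance rate, and factuality rate."""
    if not records:
        raise ValueError("records must be non-empty")
    checked = [r for r in records if r.factual_ok is not None]
    return {
        "p99_latency_ms": percentile_nearest_rank(
            [r.latency_ms for r in records], 99),
        "schema_conformance_rate": sum(r.schema_ok for r in records) / len(records),
        # Factuality is reported only over calls that were actually checked.
        "factuality_rate": (
            sum(bool(r.factual_ok) for r in checked) / len(checked)
            if checked else None
        ),
    }
```

Grouping the records by `provider` before calling `summarise` would give the per-subagent comparison the text describes. Factuality is kept optional because checking it is usually more expensive than checking latency or schema conformance.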

Across these patterns, delegation tends to dominate internalisation for capabilities that are data-fresh, infrastructure-heavy, or costly to verify in-house — and the principal-agent models in [P7] and [P9] supply theoretical grounding for that observation beyond what is visible in public pricing alone.

6. References

Citation labels are retained from the underlying screened source pool and are therefore non-consecutive; every label used in the text is keyed below.

[P2] Puntoni, S., Reczek, R. W., & Giesler, M. (2020). Consumers and Artificial Intelligence: An Experiential Perspective. https://doi.org/10.1177/0022242920953847
[P4] Bloom, N., & Van Reenen, J. (2010). Why Do Management Practices Differ across Firms and Countries? https://doi.org/10.1257/jep.24.1.203
[P7] Gabison, G. A., & Xian, R. P. (2025). Inherent and emergent liability issues in LLM-based agentic systems: a principal-agent perspective. https://doi.org/10.18653/v1/2025.realm-1.9
[P9] Shin, D., & Strausz, R. (2024). Insourcing versus outsourcing in a vertical structure. https://doi.org/10.1111/jems.12585

Vendor pricing pages cited inline as primary sources for pricing structure: Exa (https://exa.ai/pricing), Tavily (https://tavily.com/#pricing), Perplexity Sonar (https://docs.perplexity.ai/guides/pricing), Browserbase (https://www.browserbase.com/pricing), Firecrawl (https://www.firecrawl.dev/pricing), E2B (https://e2b.dev/pricing), OpenAI (https://openai.com/api/pricing), Anthropic (https://www.anthropic.com/pricing).

Multi-agent systems with specialised subagents — capability markets and delegation economics