
AI Eval & Observability: The First Real Enterprise Wedge of the Agent Era

Braintrust at $36M ARR. LangSmith free inside LangChain. Where can a small team still win?

Updated 2026-05-10

Eval and observability is the closest thing AI has to a guaranteed enterprise check. The 2024 wave of agent rollouts gave every Fortune 500 the same problem: their LLM apps work in demo, fail in prod, and nobody knows why. In 2026, four horizontal winners have emerged (Braintrust, LangSmith, Arize Phoenix, Langfuse), each in the $10M-$50M ARR band with very different distribution. Below them, the open white space is vertical: legal eval (citation correctness), medical eval (clinical safety), agent eval (multi-step trajectory grading). The horizontal layer is now too crowded for a 19th general-purpose tracing tool, but a focused vertical eval business with 50 paying customers at $30K each is a clean $1.5M ARR business that nobody at Sequoia is chasing.

The category split is clear.

(1) Trace + dashboard layer. LangSmith ships inside LangChain (~$10M+ ARR, the default for LangChain users). Langfuse is the open-source winner with self-hosting and SOC 2 (~10K GitHub stars, ~$5M+ ARR through cloud). Arize built Phoenix as the OSS adoption funnel into its enterprise ML observability suite ($70M+ raised, ~$30M ARR).

(2) Eval-first layer. Braintrust crossed ~$36M ARR in 2025 per industry reporting, with OpenAI, Notion, Vercel, Stripe, Brex, and Airtable as named users. Its wedge is offline eval workflows that look like a unit-test runner for prompts.

(3) Specialty. Patronus AI ($17M Series A, Lit Tests for hallucinations), Comet Opik (open-source eval + tracing combo), Weights & Biases Weave (the WandB rebundle, leveraging the MLOps incumbent's install base).

2026 dynamics:
(a) The Apex partnership model is dead: every horizontal tool is racing to add agent-trajectory and tool-use evaluation, and there is no clear winner yet on the agent side.
(b) Pricing is migrating from per-trace to per-eval-run plus per-seat, which favors evals as the wedge.
(c) The vertical TAM is enormous and unclaimed: a hospital system buying clinical-safety eval is not buying LangSmith; it needs a HIPAA-compliant clinical eval bench with domain raters.
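For readers new to the category, the eval-first wedge is easiest to see in code. Below is a minimal sketch of the "unit-test runner for prompts" workflow: a versioned dataset, the task under test, and scorers that return 0..1. Every name here is hypothetical; this shows the shape of the pattern, not any vendor's actual API.

```python
# Minimal sketch of an offline eval run. Every name is hypothetical;
# this illustrates the workflow shape, not any vendor's API.
from dataclasses import dataclass

@dataclass
class Case:
    input: str
    expected: str  # a property every correct output must contain

DATASET = [
    Case("Summarize: the meeting moved to 3pm", "3pm"),
    Case("Summarize: invoice #4521 is overdue", "#4521"),
]

def task(case: Case) -> str:
    # Replace with a call into your LLM app; stubbed so this runs.
    return f"Noted: {case.input}"

def contains_expected(case: Case, output: str) -> float:
    # Scorers return 0..1 so runs are comparable commit to commit.
    return 1.0 if case.expected in output else 0.0

def run_eval(dataset, task, scorers):
    for scorer in scorers:
        mean = sum(scorer(c, task(c)) for c in dataset) / len(dataset)
        print(f"{scorer.__name__}: {mean:.2f} over {len(dataset)} cases")

run_eval(DATASET, task, [contains_expected])
```

The value of the pattern is that the dataset and scorers live in version control, so a prompt change that drops a score surfaces like a failing test in CI.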
Braintrust 2023 · Series A · ~$36M ARR (2025)
OpenAI, Notion, Vercel, Stripe, Brex, Airtable

Started as a 'unit-test runner for prompts'; now ships eval + observability + prompt playground. Founder Ankur Goyal sold Impira to Figma, and that pedigree is the distribution channel into AI-native teams.

LangSmith (LangChain) 2023 · default tool inside LangChain
~$10M+ ARR / 100K+ developers

Lives inside the most-imported LLM framework. Distribution is the entire moat. The 'good enough' default for every team that already imports langchain.

Arize AI / Phoenix 2020 · Series C · $70M+ raised
~$30M ARR (industry est.)

Started as MLOps observability for tabular ML, then pivoted hard to LLMs in 2023. Open-source Phoenix is the adoption hook into the enterprise tier. Strong with the traditional Fortune 500.

Langfuse 2023 · YC W23 · Series A 2025
~10K GitHub stars / ~$5M+ ARR

Self-hostable, open-source LLM observability: where engineering teams that don't trust the cloud go. SOC 2 and ISO 27001 compliant; the default for the EU and regulated industries.

Patronus AI 2023 · Series A · $17M raised
Lit Tests / hallucination detection

Founded by ex-Meta researchers (Anand Kannappan, Rebecca Qian). Wedge: pre-built test suites for hallucination + RAG correctness. Sells into enterprise compliance, not engineering.

Comet Opik 2024 · open source
Eval + tracing combo

Comet (MLOps incumbent) launched Opik in 2024 to defend its MLOps perimeter from the LLMops wave. Free + open, distributed via Comet's existing 50K+ users.

Weights & Biases Weave 2024 · CoreWeave-owned
Free for WandB users

WandB's LLM observability bundle, now part of CoreWeave (acquired for ~$1.7B in 2025). Bundled free with WandB, which is the killer move against standalone tools.

Helicone 2022 · YC W23 · open source
~$2M ARR / 90K+ developers

Proxy-based observability — drop-in one-line change, no SDK. Indie-friendly pricing, the developer-favorite scrappy alternative. Profitable, small team.
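The proxy pattern is worth seeing concretely: the app swaps its API base URL for the proxy's, and every request is logged without installing an SDK. A sketch with the OpenAI Python client; the hostname and auth header shown follow Helicone's commonly documented pattern, but treat them as assumptions and verify against current docs.

```python
# Proxy-based observability: point the client at the proxy instead of
# api.openai.com; the proxy logs the call and forwards it upstream.
# Hostname and header follow Helicone's documented pattern; verify
# against current docs before relying on them.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",  # the one-line change
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
    },
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```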

🟢 Green light · Consider entering
You have a vertical with non-trivial domain experts on call

Legal: ex-attorneys who can grade citation accuracy. Medical: clinicians who can grade triage safety. Sales: SDR managers who can grade outreach quality. The horizontal layer's weakness is that none of them have domain rater pools — that's your wedge.

You ship engineering products and grok agent traces

If you've ever debugged a multi-step LangGraph agent at 2am you know the pain isn't 'log my prompts.' It's 'why did step 4 hallucinate a tool I don't have?' Multi-step trajectory eval is the unsolved problem and the next $50M+ ARR slot is wide open.
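That step-4 failure is mechanically checkable once traces are structured. A minimal sketch of a trajectory assertion that flags tool calls absent from the agent's declared toolset; the trace schema here is hypothetical and would need adapting to each framework's format.

```python
# Sketch: flag steps where the agent called a tool it was never given.
# The trace schema is hypothetical; real traces vary by framework.
ALLOWED_TOOLS = {"search_docs", "create_ticket"}

trace = [
    {"step": 1, "type": "llm", "output": "I should look this up."},
    {"step": 2, "type": "tool_call", "tool": "search_docs"},
    {"step": 4, "type": "tool_call", "tool": "send_refund"},  # never declared
]

def undeclared_tool_calls(trace, allowed):
    return [s for s in trace
            if s["type"] == "tool_call" and s["tool"] not in allowed]

for step in undeclared_tool_calls(trace, ALLOWED_TOOLS):
    print(f"step {step['step']}: called undeclared tool {step['tool']!r}")
```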

You can sell DevTools or have a strong engineering brand

Eval is sold to a technical buyer: engineering leaders evaluate, security signs off, finance pays. PLG works if your bottom-up motion is real. If you're naming yourself the 'XYZ for AI evals,' you're already losing; your wedge must be a missing feature, not a category.

🔴 Red flag · Hold off
You're building the 19th horizontal LLM tracing tool

LangSmith is bundled into LangChain for free. Braintrust has the AI-native logos. Arize has Fortune 500. Langfuse owns OSS. There is no wedge left at horizontal trace-and-dashboard. Stop.

Your differentiation is 'we're cheaper'

Helicone already owns the indie price floor. CoreWeave just bundled Weave for free into WandB. There is no margin in price-only competition in eval; you'll get squeezed from above by enterprise bundles and below by OSS.

You can't get one Fortune 500 design partner in 90 days

Enterprise eval is sold via design partnerships, not cold email. If your network can't get you one named Fortune 500 logo signing a $50K pilot in 90 days, the math on a 2-year runway doesn't work.

Vertical eval bench (legal / medical / agent)

Domain expert + engineer co-founder pair, $1M+ pre-seed

Capital
$1M-3M pre-seed
Time commitment
18-24 months to first $1M ARR
First move
Pick one regulated vertical (radiology triage, clinical scribing, contract redlining, plaintiff demand letters). Spend the first 90 days shadowing 3-5 domain teams. Build a rater workforce of 20+ named domain experts before writing tests. Your moat is the rater network, not the runner.
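One reason the rater network beats the runner: grading quality is measurable, and low inter-rater agreement means the rubric is broken before any model score can be trusted. A minimal sketch of that feedback loop, with a hypothetical schema.

```python
# Sketch: per-case grades from named domain raters, plus a simple
# agreement check. Schema and names are hypothetical.
from collections import Counter

grades = {
    # case_id -> {rater: grade}
    "demand-letter-014": {"ex_attorney_a": "pass", "ex_attorney_b": "pass"},
    "demand-letter-022": {"ex_attorney_a": "pass", "ex_attorney_b": "fail"},
}

def agreement(case_grades: dict) -> float:
    # Fraction of raters who voted with the majority label.
    counts = Counter(case_grades.values())
    return counts.most_common(1)[0][1] / len(case_grades)

for case_id, g in grades.items():
    if agreement(g) < 1.0:
        print(f"{case_id}: raters disagree, send back for rubric review")
```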
Agent trajectory evaluation tool

Engineering-led team with prior agent prod experience

Capital
$500K-2M seed
Time commitment
12-18 months to $500K ARR
First move
Start with one open-source framework (LangGraph, CrewAI, Mastra). Build the eval harness specifically for multi-step agent traces — branch comparison, tool-call grading, retry-tree analysis. Ship as OSS with a hosted SaaS for the heavy compute. Goal: 1K weekly active engineers in 6 months.
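A concrete starting point for tool-call grading: score the trace's tool-call sequence against a reference trajectory rather than only the final answer, as sketched below with a hypothetical trace shape. The unglamorous product work is normalizing each framework's traces into one schema.

```python
# Sketch: grade an agent trace against a reference tool-call sequence,
# scoring which tools ran and in what order. Trace shape hypothetical.
import difflib

def tool_sequence(trace):
    return [s["tool"] for s in trace if s["type"] == "tool_call"]

def trajectory_score(trace, reference):
    # Similarity of tool-call order versus the reference run, in 0..1.
    want, got = tool_sequence(reference), tool_sequence(trace)
    return difflib.SequenceMatcher(a=want, b=got).ratio()

reference = [
    {"type": "tool_call", "tool": "fetch_account"},
    {"type": "tool_call", "tool": "check_policy"},
    {"type": "tool_call", "tool": "create_ticket"},
]
trace = [
    {"type": "tool_call", "tool": "fetch_account"},
    {"type": "tool_call", "tool": "create_ticket"},  # skipped check_policy
]
print(f"trajectory score: {trajectory_score(trace, reference):.2f}")
```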
Compliance + audit-trail wedge

Ex-security/compliance lead + technical co-founder

Capital
$300K-1M bootstrap or seed
Time commitment
9-12 months to first contract
First move
Target regulated industries (banking, healthcare, government). Ship a thin layer that turns existing LangSmith/Braintrust/Langfuse logs into SOC2/HIPAA/EU AI Act audit-ready reports. Sell via Big-4 compliance partnerships, $50K-200K/yr contracts. You compete with consulting hours, not other tools.
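The thin layer is mostly ETL: reshape the trace and eval exports a team already has into the rows an auditor asks for. A sketch over a hypothetical JSON-lines export; LangSmith, Braintrust, and Langfuse each export differently, and that mapping is the actual product.

```python
# Sketch: turn an exported trace/eval log into audit-report counts.
# Input schema is hypothetical; each vendor's export differs.
import json
from collections import Counter

def audit_summary(path: str) -> dict:
    models, total, scored, failed = Counter(), 0, 0, 0
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            total += 1
            models[rec.get("model", "unknown")] += 1
            if "eval_score" in rec:
                scored += 1
                if rec["eval_score"] < 0.5:  # threshold -> findings table
                    failed += 1
    return {
        "total_runs": total,
        "models_in_use": dict(models),
        "runs_with_evals": scored,
        "failed_evals": failed,
    }

print(json.dumps(audit_summary("traces.jsonl"), indent=2))
```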

Adjacent tracks

  • AI Coding & DevTools: Buyers overlap heavily; engineering teams that buy Cursor/Codeium also buy LangSmith/Braintrust. Shared distribution motion.
  • AI Security & Red Team: Same buyer (engineering/security), different angle. Eval = correctness, red team = adversarial. The combined wedge is the strongest play.
  • Legal AI: The single biggest vertical eval opportunity; citation correctness is the #1 unsolved problem in legal AI.
