
AI Eval & Observability: The First Real Enterprise Wedge of the Agent Era

Braintrust at $36M ARR. LangSmith free inside LangChain. Where can a small team still win?

Updated 2026-05-10

Eval and observability is the closest thing AI has to a guaranteed enterprise check. The 2024 wave of agent rollouts gave every Fortune 500 the same problem: their LLM apps work in demo, fail in prod, and nobody knows why. In 2026, four horizontal winners have emerged (Braintrust, LangSmith, Arize Phoenix, Langfuse), each in the $10M-$50M ARR band with very different distribution. Below them, the open white space is vertical: legal eval (citation correctness), medical eval (clinical safety), agent eval (multi-step trajectory grading). The horizontal layer is now too crowded for a 19th general-purpose tracing tool, but a focused vertical eval business with 50 paying customers at $30K each is a clean $1.5M ARR business that nobody at Sequoia is chasing.

The category split is clear.

(1) Trace + dashboard layer. LangSmith ships inside LangChain (~$10M+ ARR, the default for LangChain users). Langfuse is the open-source winner with self-hosting and SOC 2 (~10K GitHub stars, ~$5M+ ARR through cloud). Arize built Phoenix as the OSS adoption funnel into its enterprise ML observability suite ($70M+ raised, ~$30M ARR).

(2) Eval-first layer. Braintrust crossed ~$36M ARR in 2025 per industry reporting, with OpenAI, Notion, Vercel, Stripe, Brex, and Airtable as named users. Its wedge is offline eval workflows that look like a unit-test runner for prompts.

(3) Specialty. Patronus AI ($17M Series A, Lit Tests for hallucinations), Comet Opik (open-source eval + tracing combo), Weights & Biases Weave (the WandB rebundle, leveraging the MLOps incumbent's install base).

2026 dynamics:
(a) The Apex partnership model is dead: every horizontal tool is racing to add agent-trajectory and tool-use evaluation, and there is no clear winner yet on the agent side.
(b) Pricing is migrating from per-trace to per-eval-run plus per-seat, which favors evals as the wedge.
(c) The vertical TAM is enormous and unclaimed: a hospital system buying clinical-safety eval is not buying LangSmith; it needs a HIPAA-compliant clinical eval bench with domain raters.
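For readers new to the category, the eval-first wedge is easiest to see in code. Below is a minimal sketch of the "unit-test runner for prompts" workflow: a versioned dataset, the task under test, and scorers that return 0..1. Every name here is hypothetical; this shows the shape of the pattern, not any vendor's actual API.

```python
# Minimal sketch of an offline eval run. Every name is hypothetical;
# this illustrates the workflow shape, not any vendor's API.
from dataclasses import dataclass

@dataclass
class Case:
    input: str
    expected: str  # a property every correct output must contain

DATASET = [
    Case("Summarize: the meeting moved to 3pm", "3pm"),
    Case("Summarize: invoice #4521 is overdue", "#4521"),
]

def task(case: Case) -> str:
    # Replace with a call into your LLM app; stubbed so this runs.
    return f"Noted: {case.input}"

def contains_expected(case: Case, output: str) -> float:
    # Scorers return 0..1 so runs are comparable commit to commit.
    return 1.0 if case.expected in output else 0.0

def run_eval(dataset, task, scorers):
    for scorer in scorers:
        mean = sum(scorer(c, task(c)) for c in dataset) / len(dataset)
        print(f"{scorer.__name__}: {mean:.2f} over {len(dataset)} cases")

run_eval(DATASET, task, [contains_expected])
```

The value of the pattern is that the dataset and scorers live in version control, so a prompt change that drops a score surfaces like a failing test in CI.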
Braintrust 2023 · Series A · ~$36M ARR (2025)
OpenAI, Notion, Vercel, Stripe, Brex, Airtable

Started as a 'unit-test runner for prompts'; now ships eval + observability + prompt playground. Founder Ankur Goyal sold Impira to Figma, and that pedigree is the distribution channel into AI-native teams.

LangSmith (LangChain) 2023 · default tool inside LangChain
~$10M+ ARR / 100K+ developers

Lives inside the most-imported LLM framework. Distribution is the entire moat. The 'good enough' default for every team that already imports langchain.

Arize AI / Phoenix 2020 · Series C · $70M+ raised
~$30M ARR (industry est.)

Started as MLOps observability for tabular ML, then pivoted hard to LLMs in 2023. Open-source Phoenix is the adoption hook into the enterprise tier. Strong with the traditional Fortune 500.

Langfuse 2023 · YC W23 · Series A 2025
~10K GitHub stars / ~$5M+ ARR

Self-hostable, open-source LLM observability: where engineering teams that don't trust the cloud go. SOC 2 and ISO 27001 compliant; the default for the EU and regulated industries.

Patronus AI 2023 · Series A · $17M raised
Lit Tests / hallucination detection

Founded by ex-Meta researchers (Anand Kannappan, Rebecca Qian). Wedge: pre-built test suites for hallucination + RAG correctness. Sells into enterprise compliance, not engineering.

Comet Opik 2024 · open source
Eval + tracing combo

Comet (MLOps incumbent) launched Opik in 2024 to defend its MLOps perimeter from the LLMops wave. Free + open, distributed via Comet's existing 50K+ users.

Weights & Biases Weave 2024 · CoreWeave-owned
Free for WandB users

WandB's LLM observability bundle, now part of CoreWeave (acquired for ~$1.7B in 2025). Bundled free with WandB, which is the killer move against standalone tools.

Helicone 2022 · YC W23 · open source
~$2M ARR / 90K+ developers

Proxy-based observability — drop-in one-line change, no SDK. Indie-friendly pricing, the developer-favorite scrappy alternative. Profitable, small team.
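The proxy pattern is worth seeing concretely: the app swaps its API base URL for the proxy's, and every request is logged without installing an SDK. A sketch with the OpenAI Python client; the hostname and auth header shown follow Helicone's commonly documented pattern, but treat them as assumptions and verify against current docs.

```python
# Proxy-based observability: point the client at the proxy instead of
# api.openai.com; the proxy logs the call and forwards it upstream.
# Hostname and header follow Helicone's documented pattern; verify
# against current docs before relying on them.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",  # the one-line change
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
    },
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```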

🟢 Green light · Consider entering
You have a vertical with non-trivial domain experts on call

Legal: ex-attorneys who can grade citation accuracy. Medical: clinicians who can grade triage safety. Sales: SDR managers who can grade outreach quality. The horizontal layer's weakness is that none of them have domain rater pools — that's your wedge.

You ship engineering products and grok agent traces

If you've ever debugged a multi-step LangGraph agent at 2am you know the pain isn't 'log my prompts.' It's 'why did step 4 hallucinate a tool I don't have?' Multi-step trajectory eval is the unsolved problem and the next $50M+ ARR slot is wide open.
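That step-4 failure is mechanically checkable once traces are structured. A minimal sketch of a trajectory assertion that flags tool calls absent from the agent's declared toolset; the trace schema here is hypothetical and would need adapting to each framework's format.

```python
# Sketch: flag steps where the agent called a tool it was never given.
# The trace schema is hypothetical; real traces vary by framework.
ALLOWED_TOOLS = {"search_docs", "create_ticket"}

trace = [
    {"step": 1, "type": "llm", "output": "I should look this up."},
    {"step": 2, "type": "tool_call", "tool": "search_docs"},
    {"step": 4, "type": "tool_call", "tool": "send_refund"},  # never declared
]

def undeclared_tool_calls(trace, allowed):
    return [s for s in trace
            if s["type"] == "tool_call" and s["tool"] not in allowed]

for step in undeclared_tool_calls(trace, ALLOWED_TOOLS):
    print(f"step {step['step']}: called undeclared tool {step['tool']!r}")
```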

You can sell DevTools or have a strong engineering brand

Eval is sold to a technical buyer: engineering leaders evaluate, security signs off, finance pays. PLG works if your bottom-up motion is real. If you're naming yourself the 'XYZ for AI evals,' you're already losing; your wedge must be a missing feature, not a category.

🔴 Red flag · Hold off
You're building the 19th horizontal LLM tracing tool

LangSmith is bundled into LangChain for free. Braintrust has the AI-native logos. Arize has Fortune 500. Langfuse owns OSS. There is no wedge left at horizontal trace-and-dashboard. Stop.

Your differentiation is 'we're cheaper'

Helicone already owns the indie price floor. CoreWeave just bundled Weave for free into WandB. There is no margin in price-only competition in eval; you'll get squeezed from above by enterprise bundles and below by OSS.

You can't get one Fortune 500 design partner in 90 days

Enterprise eval is sold via design partnerships, not cold email. If your network can't get you one named Fortune 500 logo signing a $50K pilot in 90 days, the math on a 2-year runway doesn't work.

Vertical eval bench (legal / medical / agent)

Domain expert + engineer co-founder pair, $1M+ pre-seed

Capital
$1M-3M pre-seed
Time commitment
18-24 months to first $1M ARR
First move
Pick one regulated vertical (radiology triage, clinical scribing, contract redlining, plaintiff demand letters). Spend the first 90 days shadowing 3-5 domain teams. Build a rater workforce of 20+ named domain experts before writing tests. Your moat is the rater network, not the runner.
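One reason the rater network beats the runner: grading quality is measurable, and low inter-rater agreement means the rubric is broken before any model score can be trusted. A minimal sketch of that feedback loop, with a hypothetical schema.

```python
# Sketch: per-case grades from named domain raters, plus a simple
# agreement check. Schema and names are hypothetical.
from collections import Counter

grades = {
    # case_id -> {rater: grade}
    "demand-letter-014": {"ex_attorney_a": "pass", "ex_attorney_b": "pass"},
    "demand-letter-022": {"ex_attorney_a": "pass", "ex_attorney_b": "fail"},
}

def agreement(case_grades: dict) -> float:
    # Fraction of raters who voted with the majority label.
    counts = Counter(case_grades.values())
    return counts.most_common(1)[0][1] / len(case_grades)

for case_id, g in grades.items():
    if agreement(g) < 1.0:
        print(f"{case_id}: raters disagree, send back for rubric review")
```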
Agent trajectory evaluation tool

Engineering-led team with prior agent prod experience

Capital
$500K-2M seed
Time commitment
12-18 months to $500K ARR
First move
Start with one open-source framework (LangGraph, CrewAI, Mastra). Build the eval harness specifically for multi-step agent traces — branch comparison, tool-call grading, retry-tree analysis. Ship as OSS with a hosted SaaS for the heavy compute. Goal: 1K weekly active engineers in 6 months.
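A concrete starting point for tool-call grading: score the trace's tool-call sequence against a reference trajectory rather than only the final answer, as sketched below with a hypothetical trace shape. The unglamorous product work is normalizing each framework's traces into one schema.

```python
# Sketch: grade an agent trace against a reference tool-call sequence,
# scoring which tools ran and in what order. Trace shape hypothetical.
import difflib

def tool_sequence(trace):
    return [s["tool"] for s in trace if s["type"] == "tool_call"]

def trajectory_score(trace, reference):
    # Similarity of tool-call order versus the reference run, in 0..1.
    want, got = tool_sequence(reference), tool_sequence(trace)
    return difflib.SequenceMatcher(a=want, b=got).ratio()

reference = [
    {"type": "tool_call", "tool": "fetch_account"},
    {"type": "tool_call", "tool": "check_policy"},
    {"type": "tool_call", "tool": "create_ticket"},
]
trace = [
    {"type": "tool_call", "tool": "fetch_account"},
    {"type": "tool_call", "tool": "create_ticket"},  # skipped check_policy
]
print(f"trajectory score: {trajectory_score(trace, reference):.2f}")
```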
Compliance + audit-trail wedge

Ex-security/compliance lead + technical co-founder

Capital
$300K-1M bootstrap or seed
Time commitment
9-12 months to first contract
First move
Target regulated industries (banking, healthcare, government). Ship a thin layer that turns existing LangSmith/Braintrust/Langfuse logs into SOC2/HIPAA/EU AI Act audit-ready reports. Sell via Big-4 compliance partnerships, $50K-200K/yr contracts. You compete with consulting hours, not other tools.
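The thin layer is mostly ETL: reshape the trace and eval exports a team already has into the rows an auditor asks for. A sketch over a hypothetical JSON-lines export; LangSmith, Braintrust, and Langfuse each export differently, and that mapping is the actual product.

```python
# Sketch: turn an exported trace/eval log into audit-report counts.
# Input schema is hypothetical; each vendor's export differs.
import json
from collections import Counter

def audit_summary(path: str) -> dict:
    models, total, scored, failed = Counter(), 0, 0, 0
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            total += 1
            models[rec.get("model", "unknown")] += 1
            if "eval_score" in rec:
                scored += 1
                if rec["eval_score"] < 0.5:  # threshold -> findings table
                    failed += 1
    return {
        "total_runs": total,
        "models_in_use": dict(models),
        "runs_with_evals": scored,
        "failed_evals": failed,
    }

print(json.dumps(audit_summary("traces.jsonl"), indent=2))
```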

Adjacent tracks

  • AI Coding & DevTools: Buyers overlap heavily; engineering teams that buy Cursor/Codeium also buy LangSmith/Braintrust. Shared distribution motion.
  • AI Security & Red Team: Same buyer (engineering/security), different angle. Eval = correctness, red team = adversarial. The combined wedge is the strongest play.
  • Legal AI: The single biggest vertical eval opportunity; citation correctness is the #1 unsolved problem in legal AI.
