Braintrust at $36M ARR. LangSmith free inside LangChain. Where can a small team still win?
Eval and observability are the closest thing AI has to a guaranteed enterprise check. The 2024 wave of agent rollouts gave every Fortune 500 the same problem: their LLM apps work in demos, fail in prod, and nobody knows why. By 2026, four horizontal winners have emerged (Braintrust, LangSmith, Arize Phoenix, Langfuse), each in the $10-50M ARR band and each with very different distribution. Below them, the open white space is vertical: legal eval (citation correctness), medical eval (clinical safety), agent eval (multi-step trajectory grading). The horizontal layer is now too crowded for a 19th general-purpose tracing tool, but a focused vertical eval business with 50 paying customers at $30K each is a clean $1.5M ARR business that nobody at Sequoia is chasing.
Braintrust: Started as a 'unit-test runner for prompts,' now ships eval + observability + a prompt playground. Founder Ankur Goyal sold Impira to Figma; that pedigree is built-in distribution into AI-native teams.
LangSmith: Lives inside the most-imported LLM framework. Distribution is the entire moat: it's the 'good enough' default for every team that already imports langchain.
Arize Phoenix: Arize was MLOps observability for tabular ML, then pivoted hard to LLMs in 2023. Open-source Phoenix is the adoption hook into the enterprise tier. Strong with the traditional Fortune 500.
Langfuse: Self-hostable, open-source LLM observability; where engineering teams that don't trust the cloud go. SOC 2 and ISO 27001 compliant, the default for the EU and regulated industries.
Patronus AI: Founded by ex-Meta researchers Anand Kannappan and Rebecca Qian. Wedge: pre-built test suites for hallucination and RAG correctness. Sells into enterprise compliance, not engineering.
Comet Opik: Comet (the MLOps incumbent) launched Opik in 2024 to defend its MLOps perimeter from the LLMOps wave. Free and open source, distributed via Comet's existing 50K+ users.
WandB Weave: WandB's LLM observability bundle, now part of CoreWeave (which acquired WandB for ~$1.7B in 2025). Bundled free with WandB, which is the killer move against standalone tools.
Helicone: Proxy-based observability via a drop-in, one-line change, no SDK. Indie-friendly pricing; the developer-favorite scrappy alternative. Profitable, small team.
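To make 'drop-in, no SDK' concrete, here is a minimal sketch of the proxy pattern, assuming a Helicone-style gateway. The gateway URL and auth header below are illustrative placeholders, not any vendor's actual endpoint; check the vendor's docs for real values.

```python
# Sketch: what "proxy-based, no SDK" means in practice. Instead of wrapping
# your app in a vendor SDK, you point the OpenAI client at a logging proxy.
# The gateway URL and "Proxy-Auth" header are hypothetical placeholders.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://oai.example-proxy.com/v1",  # proxy sits in front of api.openai.com
    default_headers={"Proxy-Auth": f"Bearer {os.environ['PROXY_API_KEY']}"},
)

# Every call below is now logged (latency, tokens, cost) with no other changes.
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```

The one-line change is the `base_url` swap; that's the entire integration cost, which is why this pattern wins with indie developers.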
Legal: ex-attorneys who can grade citation accuracy. Medical: clinicians who can grade triage safety. Sales: SDR managers who can grade outreach quality. The horizontal layer's weakness is that none of them have domain rater pools; that's your wedge (a minimal harness is sketched below).
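As a concrete illustration of the wedge, here is a hypothetical sketch of a domain-rater harness. The `Rubric`/`Grade` names and the two-rater minimum are illustrative assumptions, not any existing product's API.

```python
# Sketch of the vertical-eval wedge: domain experts grading model outputs
# against a fixed rubric, with a minimum rater count as the quality gate.
# All names here (Rubric, Grade, aggregate) are hypothetical illustrations.
from dataclasses import dataclass
from statistics import mean

@dataclass
class Rubric:
    name: str            # e.g. "citation_correctness" or "triage_safety"
    criteria: list[str]  # what a PASS requires, written by domain experts

@dataclass
class Grade:
    rater_id: str        # an ex-attorney / clinician, not an LLM judge
    score: float         # 0.0-1.0 against the rubric
    notes: str

def aggregate(grades: list[Grade], min_raters: int = 2) -> float | None:
    """Average the domain raters' scores; refuse to emit a number
    until enough independent experts have graded the case."""
    if len(grades) < min_raters:
        return None
    return mean(g.score for g in grades)

rubric = Rubric(
    name="citation_correctness",
    criteria=["cited case exists", "holding supports the proposition",
              "not overruled or superseded"],
)
grades = [Grade("attorney_01", 1.0, "both citations check out"),
          Grade("attorney_02", 0.5, "second cite is to a vacated opinion")]
print(rubric.name, aggregate(grades))  # citation_correctness 0.75
```

The code is trivial; the moat is the rater pool and the rubric, which is exactly why engineering-only teams can't replicate it.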
If you've ever debugged a multi-step LangGraph agent at 2am, you know the pain isn't 'log my prompts.' It's 'why did step 4 hallucinate a tool I don't have?' Multi-step trajectory eval is the unsolved problem, and the next $50M+ ARR slot is wide open.
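To see why this is a trajectory problem rather than a logging problem, here is the simplest useful trajectory check: an assertion over the whole step sequence. The trace shape is a hypothetical simplification (real LangGraph traces and OpenTelemetry spans carry far more structure), but it shows the category of check no per-prompt logger can express.

```python
# Sketch of one trajectory-eval check: scan an agent's step trace and flag
# any tool call that isn't in the registered toolset -- the "step 4
# hallucinated a tool I don't have" failure. Trace format is hypothetical.
from dataclasses import dataclass

@dataclass
class Step:
    index: int
    tool: str | None  # None for pure-LLM reasoning steps
    args: dict

REGISTERED_TOOLS = {"search_docs", "run_sql", "send_email"}

def hallucinated_tool_calls(trace: list[Step]) -> list[Step]:
    """Trajectory-level assertion: every tool the agent invoked must exist.
    Per-step prompt logging can't catch this; you need the whole path."""
    return [s for s in trace
            if s.tool is not None and s.tool not in REGISTERED_TOOLS]

trace = [Step(1, "search_docs", {"q": "refund policy"}),
         Step(2, None, {}),
         Step(3, "run_sql", {"query": "SELECT ..."}),
         Step(4, "update_crm", {"id": 42})]  # agent invented this tool

for step in hallucinated_tool_calls(trace):
    print(f"step {step.index}: hallucinated tool {step.tool!r}")
```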
Eval is a technical-buyer sale. Engineering leaders evaluate, security signs off, finance pays. PLG works if your bottom-up motion is real. If you're naming yourself the 'XYZ for AI evals,' you're already losing; your wedge must be a missing feature, not a category.
LangSmith is bundled into LangChain for free. Braintrust has the AI-native logos. Arize has the Fortune 500. Langfuse owns OSS. There is no wedge left at horizontal trace-and-dashboard. Stop.
Helicone already owns the indie price floor, and CoreWeave-owned WandB bundles Weave for free. There is no margin in price-only competition in eval; you'll get squeezed from above by enterprise bundles and from below by OSS.
Enterprise eval is sold via design partnerships, not cold email. If your network can't get one named Fortune 500 logo to sign a $50K pilot within 90 days, the math on a two-year runway doesn't work.
Domain expert + engineer co-founder pair, $1M+ pre-seed
Engineering-led team with prior agent prod experience
Ex-security/compliance lead + technical co-founder
Eval is a developer-tooling product at its core. You ship code, you instrument code, you know what bad traces look like at 2am. If you also have a network into one regulated vertical, this is your highest-leverage track in the entire atlas.
Without you, the vertical eval wedge doesn't exist. Engineering-only teams cannot build a clinical safety bench or a legal citation eval. Your rolodex of 20 domain raters is the moat. Pair with one strong engineer.
If you spent 10 years in regulated ops (compliance, clinical, legal review) and now want to ride the AI wave, vertical eval is the cleanest wedge. The horizontal tools are too crowded, but the regulated TAM is unclaimed.