---
title: Devtool Arena Q1 2026 Report
date: 2026-03-24
author: Sapient
description: We evaluated 65 developer APIs across 10 categories for AI-agent readiness. Only one achieved a Grade A.
---

Stripe costs an AI agent $0.13 to integrate — eight tool calls, one minor error, working code. Circle costs $1.65 for fifty-five tool calls, seventeen errors, and code that barely functions.

Both are payment APIs with typed SDKs and MCP servers. Stripe scores 85 on our benchmark, Circle scores 43. That 42-point gap has nothing to do with API quality; Circle handles billions in stablecoin transactions. It comes down to whether an agent can figure out how to use the API without help.

We benchmarked 65 developer APIs across 10 categories. One scored above 90, sixteen below 60, average 64.5. Most APIs are harder for agents to use than their makers think.

The biggest predictor of success turned out to be discoverability — whether the agent can find docs, specs, and SDKs before it writes any code.

---

# How We Tested

We gave Claude Sonnet 4.5 a sandboxed environment and a single instruction: integrate this API and complete a task. No human help, no pre-configured credentials beyond API keys.

Each API received two scores: a discovery score (0-100) based on seven items (llms.txt, OpenAPI spec, typed SDK, MCP server, CLI, Context7 indexing, and agent skills), and an integration score (0-100) based on task completion, errors, tool-call efficiency, and cost. The overall score combines both, weighted toward discovery, because you can't integrate what you can't find.

Grades: A is 90+, B is 75-89, C is 60-74, D is below 60.
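To make the mechanics concrete, here is a minimal sketch of the scoring scheme in Python. The report doesn't publish the exact discovery/integration weighting or per-item weights, so the 60/40 split and equal item weights below are illustrative assumptions; only the grade boundaries are taken directly from above.

```python
# Sketch of the scoring scheme. The 60/40 discovery/integration split and
# equal per-item weights are ASSUMPTIONS for illustration; the grade
# boundaries are the ones published in this report.

DISCOVERY_ITEMS = (
    "llms.txt", "OpenAPI spec", "typed SDK",
    "MCP server", "CLI", "Context7", "agent skills",
)

def discovery_score(items_present: set) -> float:
    """Score 0-100 from the seven checklist items (assumed equal weights)."""
    found = sum(1 for item in DISCOVERY_ITEMS if item in items_present)
    return 100.0 * found / len(DISCOVERY_ITEMS)

def overall_score(discovery: float, integration: float, w: float = 0.6) -> float:
    """Combine both scores, weighted toward discovery (w=0.6 is assumed)."""
    return w * discovery + (1.0 - w) * integration

def grade(score: float) -> str:
    """Map an overall score to the report's letter grades."""
    if score >= 90:
        return "A"
    if score >= 75:
        return "B"
    if score >= 60:
        return "C"
    return "D"
```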

All scores are public on the [leaderboard](/leaderboard). You can check every number in this report against the raw data.

---

# The Big Picture

| Grade | Count | What it means |
|-------|-------|---------------|
| A (90+) | 1 | Agent-ready. Zero human intervention needed. |
| B (75-89) | 24 | Mostly works. Maybe one manual step. |
| C (60-74) | 24 | Agent needs human guidance. |
| D (below 60) | 16 | Agents fail repeatedly. |

Average: 64.5 (SD: 16.5, median: 68). B and C tie at 24 each. The typical API lands right at the boundary between "works with friction" and "needs help."

| Category | Avg (±SD) | Range | N | Best | Worst |
|----------|-----------|-------|---|------|-------|
| Search | 73.3 ±22.7 | 30-95 | 6 | You.com (95) | Brave (30) |
| Durable Workflow | 73.0 ±9.5 | 54-83 | 7 | Akka (83) | Cadence (54) |
| Vector DBs | 70.3 ±8.5 | 58-82 | 6 | Pinecone (82) | Weaviate (58) |
| Inference | 69.9 ±8.7 | 54-80 | 8 | Groq (80) | SambaNova (54) |
| Auth | 69.2 ±7.0 | 60-76 | 6 | Auth0 (76) | Descope (60) |
| Sandboxes | 68.0 ±25.2 | 20-88 | 6 | Cloudflare (88) | Modal (20) |
| Payment | 62.1 ±14.9 | 36-85 | 10 | Stripe (85) | Adyen (36) |
| Voice | 54.2 ±18.9 | 32-76 | 6 | ElevenLabs (76) | Rime (32) |
| Meeting Bot | 49.6 ±10.6 | 40-66 | 5 | Meeting BaaS (66) | Meetstream (40) |
| Stablecoin | 49.0 ±17.2 | 30-75 | 5 | Coinbase (75) | Triple-A (30) |

The top six categories bunch between 68 and 73. Payment sits alone at 62; below it, Voice, Meeting Bot, and Stablecoin all fall under 55. The spread varies a lot by category: Auth (±7.0) is tight, with everyone scoring similarly. Sandboxes (±25.2) is all over the place: Cloudflare at 88, Modal at 20.

---

# Discoverability Determines Almost Everything

We grouped APIs by discovery score and looked at overall performance:

| Discovery Score | Avg Overall (±SD) | Range | Count |
|----------------|-------------------|-------|-------|
| 90-100 | 83.2 ±6.7 | 75-95 | 6 |
| 70-89 | 71.6 ±9.5 | 43-88 | 34 |
| 50-69 | 58.0 ±12.4 | 36-68 | 10 |
| Below 50 | 45.5 ±15.3 | 20-80 | 15 |

The gap between top and bottom tiers is 37.7 points, but the standard deviations matter more than the averages here. APIs with excellent discovery cluster tightly (±6.7), while poorly discovered APIs scatter widely (±15.3). Good discoverability doesn't guarantee a great score, but it sets a floor: every API with a perfect discovery score of 100 landed at 80 or above.

Five APIs hit a perfect 100 on discovery: You.com (95 overall), Stripe (85), Pinecone (82), Firecrawl (82), and Vercel (80).

On the other end, APIs with no discovery items at all land in Grade D territory almost without exception. The agent can't figure out authentication patterns by guessing, and it can't parse unstructured marketing pages into working code.

One fair counter: aren't we just measuring marketing spend? APIs with bigger DevRel teams are more likely to publish llms.txt and build MCP servers. Partly true. But Vercel scored 80 despite 13 errors and 38 tool calls. Discovery got the agent to the API. The 13 errors happened afterward, during actual integration. If discoverability were purely cosmetic, it wouldn't correlate with getting the task done.

---

# The Checklist Isn't Enough

Adyen has an OpenAPI spec, a typed SDK, an MCP server, and Context7 indexing. Four of seven items. It scores 36.

Circle has five: llms.txt, OpenAPI, typed SDK, MCP server, Context7. It scores 43 with 17 errors and 55 tool calls at $1.65.

Having the items gets you found. It doesn't mean the agent can do anything useful once it arrives. Payment APIs have webhooks, idempotency keys, PCI flows. Stablecoin APIs have KYC requirements, blockchain transactions, custodial key management. When domain complexity is high enough, good tooling isn't enough.

Coinbase Payments tells the same story from the passing side. It scored 75 — Grade B — but racked up 15 errors and 47 tool calls at $1.03 getting there.

So the checklist is necessary but has a ceiling. After discovery, what determines success is how much the agent has to fight the actual integration surface. Simple APIs (search, inference) convert discovery into clean integrations. Complex APIs (payment, stablecoin) convert discovery into expensive, error-filled ones.

---

# Stories From the Data

Full results for every API are on the [leaderboard](/leaderboard). Here are the ones worth calling out.

**You.com** is the only Grade A in the dataset. Perfect discovery, zero errors, 7 tool calls, $0.13. The agent found llms.txt, pulled the OpenAPI spec, installed the SDK, and finished the task without hitting a problem. Out of 65 APIs, this is the only one where the agent's experience looked like a competent developer's.

**Clerk** is one of the most popular auth providers among web developers, with polished docs and a well-regarded SDK. It scored 73, but only after 23 errors, 61 tool calls, and $0.82 in token costs. The agent kept running into redirect URI issues and middleware config problems. OAuth flows are multi-step and environment-dependent, which is what agents handle worst. Auth0, with its older but simpler SDK, scored 76 with zero errors in 4 tool calls.

Then there's **SambaNova**: zero errors, 3 tool calls, $0.07. A clean integration by any measure. It still scored 54, Grade D, because agents couldn't find its documentation. The API works fine. It's just invisible.

**Modal** scored 20, the lowest in the dataset. No discovery items, failed integration. Other sandbox tools with CLIs — Cloudflare's wrangler, Daytona, Vercel — scored 80-88.

In inference, **Groq**, **Cerebras**, and **OpenRouter** all landed between 76 and 80 with zero errors, 3 tool calls, and about $0.06 each. They all expose OpenAI-compatible endpoints, so the agent uses the OpenAI SDK with a different base URL. Compatibility with an existing standard works as its own form of discoverability.
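What that looks like in practice: a sketch using the official openai Python SDK pointed at Groq. The base URL and model name follow Groq's public docs as we understand them; verify against the provider, and swap in Cerebras' or OpenRouter's endpoint the same way.

```python
from openai import OpenAI

# Same SDK, same call shape the agent already knows from OpenAI:
# only the base URL and API key change per provider.
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",  # Groq's OpenAI-compatible endpoint
    api_key="YOUR_GROQ_API_KEY",
)

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",  # example model; check the provider's model list
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)
```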

Voice and meeting bot APIs have a structural problem. Voice (average 54) deals with audio files, WebSocket connections, and streaming — binary, stateful, real-time patterns that don't match how agents work. Meeting bots (average 50) need webhook URLs, calendar OAuth, and multi-step configuration. Half of voice APIs and 80% of meeting bot APIs scored Grade D.

Durable Workflow was the most consistent category: average 73, standard deviation of 9.5. Five of seven providers scored 76-83. Despite the conceptual complexity of sagas and compensation patterns, these APIs have well-typed SDKs and clean integration paths.

---

# What Correlates With Success

Standard deviations are large and group sizes are unequal, so treat exact numbers as directional.

| Feature | Adoption | Avg With (±SD) | Avg Without (±SD) | Gap |
|---------|----------|----------------|-------------------|-----|
| Typed SDK | 85% (55/65) | 68.2 ±13.2 | 44.3 ±19.1 | +23.9 |
| Context7 | 82% (53/65) | 68.3 ±13.8 | 47.9 ±18.0 | +20.4 |
| MCP Server | 72% (47/65) | 69.4 ±13.4 | 51.9 ±17.7 | +17.5 |
| CLI | 52% (34/65) | 72.3 ±12.4 | 56.0 ±16.4 | +16.3 |
| Agent Skills | 34% (22/65) | 74.0 ±11.2 | 59.7 ±16.8 | +14.3 |
| llms.txt | 66% (43/65) | 69.0 ±13.9 | 55.8 ±18.0 | +13.2 |
| OpenAPI | 63% (41/65) | 68.5 ±14.3 | 57.8 ±18.1 | +10.7 |

Typed SDKs show the biggest gap, but only 10 APIs lack one. That's a small enough group that a few outliers could shift the average substantially. The safer read: every feature correlates positively, the directions are all consistent, and the specific magnitudes are approximate.

These are correlations, not causes. APIs that invest in typed SDKs probably also invest in documentation and error messages. The SDK isn't adding 24 points by itself.

MCP ranks third despite the industry buzz. At 72% adoption it's approaching table stakes, and having one doesn't fix a messy integration surface. Adyen has an MCP server and scores 36.

Agent Skills has the lowest adoption (34%) but the tightest standard deviation among APIs that have them (±11.2 vs ±16.8 without). Low adoption combined with consistently high scores usually signals opportunity.

---

# What To Do About It

**If you build APIs:** The data points to a rough priority. Typed SDKs and Context7 indexing have the highest adoption and strongest correlation. If you're missing either, start there. llms.txt is the lowest-effort addition. Agent skills have the most room to differentiate. But the checklist only gets you found. If the agent still struggles after finding you, the problem is the integration surface itself.
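To show how low that effort is, here is a minimal llms.txt following the llmstxt.org convention: an H1 name, a blockquote summary, then sections of annotated links. Every name and URL below is a placeholder.

```markdown
# ExampleAPI

> One-paragraph summary: what the API does, who it's for, how auth works.

## Docs

- [Quickstart](https://example.com/docs/quickstart.md): auth setup and first request
- [API reference](https://example.com/docs/api.md): endpoints, parameters, error codes

## Optional

- [Changelog](https://example.com/changelog.md): recent breaking changes
```

Serve it at the site root as /llms.txt and agents that look for it will find it.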

We don't know yet whether adding checklist items causes score improvements or just correlates with broader investment. We'll track score changes for APIs that add items between Q1 and Q2.

**If you choose APIs:** Check the [leaderboard](/leaderboard). Within a category, the spreads are large. Payment spans 49 points (Stripe 85, Adyen 36). Sandboxes span 68 (Cloudflare 88, Modal 20). If you plan to use an AI coding assistant, that gap translates directly into your time.

These scores are from Claude Sonnet 4.5. Other models may produce different results. We plan to add multi-model testing.

**If you run DevRel:** The 37.7-point discovery gap is the largest signal in the data. If your API isn't in Context7, doesn't have llms.txt, and lacks an MCP server, agents can't find you. The cheapest wins take hours, not weeks.

---

# Fine Print

This is not a product quality assessment. Low scores mean agents struggle with the API, not that the API is bad. Adyen processes billions. Circle powers major stablecoin infrastructure. A score of 36 says nothing about reliability or features.

Task difficulty varies across categories. Integrating a search API is simpler for an agent than integrating a stablecoin API with KYC and blockchain transactions. Keep that in mind when comparing across categories.

All evaluations used Claude Sonnet 4.5 in March 2026. Results will differ with other models and will change as APIs update their tooling. When a company has been evaluated more than once, we report the most recent result.

### Methodology Details

The agent received a single prompt in a sandboxed environment: integrate this API and complete a category-specific task. No human assistance beyond API keys. We measured four things: whether the agent could find the docs (discovery), install and authenticate (integration), complete the task (execution), and recover from errors (resilience).
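Each run reduces to a small record. A sketch of the per-run shape behind the leaderboard; the field names are illustrative, not our internal schema.

```python
from dataclasses import dataclass, field

@dataclass
class EvalRun:
    """One agent run against one API (illustrative fields, not our schema)."""
    api: str                 # e.g. "stripe"
    category: str            # e.g. "payment"
    task_completed: bool     # did the agent finish the category task?
    errors: int              # errors hit during integration
    tool_calls: int          # total tool calls used
    cost_usd: float          # token cost of the run
    discovery_items: set = field(default_factory=set)  # checklist items found

# The Stripe run from the opening, in this shape:
stripe_run = EvalRun(api="stripe", category="payment",
                     task_completed=True, errors=1, tool_calls=8, cost_usd=0.13)
```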

The discovery score comes from seven items:

| Item | Why it matters |
|------|---------------|
| llms.txt | Machine-readable description for LLMs |
| OpenAPI Spec | Agents read structure, not prose |
| Typed SDK | Types prevent guessing |
| MCP Server | Native agent integration |
| CLI | Testing without writing code |
| Context7 | Automatic discovery by AI assistants |
| Agent Skills | One-step integration toolkits |

---

# FAQ

**Is a low score the same as a bad API?**
No. It means agents struggle to use it autonomously. That's different from reliability or feature quality.

**Why only one Grade A?**
90+ requires near-perfect discovery and integration. Most APIs lose points on one side. Cloudflare Workers at 88 is close.

**Why does MCP rank third in feature correlations?**
72% adoption means the "without" group is small and skews toward less-invested APIs. MCP helps discovery but doesn't fix integration problems.

**How often do you re-evaluate?**
Quarterly. Scores will change as APIs update their tooling.

**Can I submit my API?**
Yes. Visit the [leaderboard](/leaderboard).

---

*Evaluations conducted March 2026 by Sapient using Claude Sonnet 4.5. All scores available on the [leaderboard](/leaderboard). v1.0.*