The benchmark crossed the human line. Production stayed on the other side.
Stanford's 2026 AI Index, published this month, recorded one of the more consequential numbers of the year. AI agents reached 66% success on OSWorld, up from 12% twelve months earlier, putting them within six percentage points of human performance on a benchmark that asks them to operate a desktop computer to complete real productivity tasks. The headline reads as inevitability. The footnote does not.
The same body of evidence shows that 89% of enterprise AI agents never reach production. Per-pilot investments range from USD 150,000 to USD 800,000, and most of that spend returns nothing. Stanford notes that fewer than 10% of organizations have fully scaled AI in any single business function, despite 88% using AI somewhere in operations. On April 28, an enterprise AI governance brief released through GlobeNewswire added the procurement-side number: USD 665 billion in projected enterprise AI spend in 2026, with only 43% of organizations operating under a formal AI governance policy.
This is not a model-quality problem. It is a measurement problem. Enterprises are procuring agents on accuracy leaderboards and deploying them against operational service-level agreements. The two have almost nothing to do with one another.
The leaderboard is not the contract. Enterprise AI is purchased on a benchmark and deployed against an SLA — and the gap between them is where the budget dies.
From accuracy to CLEAR: the metric stack replaces the leaderboard
The most useful framework to emerge in April is academic. The CLEAR framework (Cost, Latency, Efficacy, Assurance, Reliability) was proposed in a multi-dimensional evaluation paper covering six leading agents on three hundred enterprise tasks. Its findings reset the conversation: optimizing for accuracy alone yields agents that are 4.4 to 10.8x more expensive than cost-aware alternatives with comparable performance, and the framework documented up to 50x cost variation between approaches achieving similar accuracy on the same tasks. Compressed for operating use, the stack collapses into three layers.
Cost-normalized accuracy (CNA) replaces raw success rate. Latency is measured against domain SLA thresholds — three seconds for customer support, thirty seconds for code generation — not against time-to-first-token. This is FinOps for agents: every task carries a price and a stopwatch, and the procurement decision sits between them rather than above them.
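To make the first layer concrete, here is a minimal sketch of how cost-normalized accuracy and the latency gate might be computed over a batch of task runs. The TaskRun structure and the specific CNA formulation used here (successful tasks per dollar spent) are illustrative assumptions, not the paper's published definitions.

```python
from dataclasses import dataclass

@dataclass
class TaskRun:
    succeeded: bool
    cost_usd: float     # full inference + tool-call cost for the task
    latency_s: float    # end-to-end task latency, not time-to-first-token

def cost_normalized_accuracy(runs: list[TaskRun]) -> float:
    """Successful tasks per dollar spent: rewards agents that are both
    accurate and cheap, and penalizes accuracy bought with cost."""
    total_cost = sum(r.cost_usd for r in runs)
    successes = sum(r.succeeded for r in runs)
    return successes / total_cost if total_cost > 0 else 0.0

def p95_within_sla(runs: list[TaskRun], sla_s: float) -> bool:
    """Gate on 95th-percentile latency against the domain SLA
    (e.g. 3 s for customer support), not on the mean."""
    latencies = sorted(r.latency_s for r in runs)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    return p95 <= sla_s
```

Ranking candidate agents on cost_normalized_accuracy instead of raw success rate is what surfaces the 4.4 to 10.8x spread the CLEAR paper reports.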
tau-bench introduced the pass^k metric for a reason. Where pass@1 measures whether an agent solves a task once, pass^k measures whether it does so consistently across k attempts. April 2026 enterprise data shows agents achieving 60% success on single runs and only 25% across eight runs. A success rate that drops by more than half between demo and production is not reliability. It is theater.
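For readers who want the estimator: pass^k for a task is typically computed from n recorded attempts with c successes as C(c,k)/C(n,k), the unbiased estimate of the probability that k independent attempts all succeed. A minimal sketch, with sample numbers chosen purely for illustration:

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that k independent attempts
    at a task all succeed, given c successes observed in n trials."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# An agent that succeeds on 9 of 10 recorded runs looks strong at k = 1...
print(pass_hat_k(n=10, c=9, k=1))  # 0.9
# ...but the estimate of succeeding 8 times in a row collapses to 0.2.
print(pass_hat_k(n=10, c=9, k=8))  # 0.2
```

The same shape of collapse is what separates the 60% single-run figure from the 25% eight-run figure above.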
Policy adherence is the new pass criterion. tau-bench fails an agent that books the right flight while violating a stated change-fee policy. Auditability is the new reporting requirement. Without a decision ledger that captures inputs, model version, tool calls, override events, and outcomes, no enterprise can answer the EU AI Act Article 14 oversight question — and no auditor will accept the phrase "the model decided."
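What a ledger entry needs to capture follows directly from that list. Below is a hypothetical minimal sketch, with hash-chaining added so after-the-fact edits are detectable; the field names are assumptions for illustration, not a standard.

```python
import hashlib, json, time

def ledger_entry(prev_hash: str, inputs: dict, model_version: str,
                 tool_calls: list, overridden_by: str | None,
                 outcome: str) -> dict:
    """Append-only decision record: every agent decision is
    reconstructable after the fact."""
    record = {
        "ts": time.time(),
        "inputs": inputs,
        "model_version": model_version,
        "tool_calls": tool_calls,        # name + args of every call made
        "overridden_by": overridden_by,  # None, or the human who intervened
        "outcome": outcome,
        "prev_hash": prev_hash,          # chain entries so edits are evident
    }
    payload = json.dumps(record, sort_keys=True).encode()
    record["hash"] = hashlib.sha256(payload).hexdigest()
    return record
```

An auditor replaying the chain can verify both what the agent saw and who, if anyone, overrode it.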
So what: If accuracy alone drove a procurement decision, the buyer is now paying between 4.4x and 10.8x more than necessary for the same outcome. Worse, that buyer is paying it on a model that performs in the demo and degrades the moment a policy constraint, an SLA threshold, or a multi-run distribution lands on top of it. The metric stack is the new contract.
Three patterns from the LATAM operating front
Customer service in a CABA fintech. A buy-now-pay-later operator running 2.4 million monthly customer interactions deployed an avatar agent stack, built on the multilingual real-time technology AI Studios released to enterprise on April 29, gated by a pass^5 reliability threshold of 90% and a cost-per-resolved-ticket ceiling of USD 0.32. Tier-1 routine queries route to a fine-tuned 7B model; Tier-2 disputes escalate to a frontier model; Tier-3 defaults sit behind human approval under EU AI Act Article 14-equivalent oversight. Cost per successful interaction fell 62%; override rate held under 6%.
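The routing logic behind that deployment reduces to a small gate. The sketch below mirrors the thresholds reported in the case study; the handler functions and exact tier semantics are hypothetical stand-ins.

```python
PASS5_FLOOR = 0.90    # pass^5 reliability required before autonomous handling
COST_CEILING = 0.32   # USD ceiling per resolved ticket

def small_model(q):    return f"7B model handles: {q}"
def frontier_model(q): return f"frontier model handles: {q}"
def human_review(q):   return f"queued for human approval: {q}"

def route(query: str, tier: int, pass5: float, cost_per_ticket: float) -> str:
    # Reliability and cost gates come first: an agent that misses either
    # threshold loses autonomy regardless of how it scores on accuracy.
    if pass5 < PASS5_FLOOR or cost_per_ticket > COST_CEILING:
        return human_review(query)
    if tier == 1:
        return small_model(query)      # routine queries: fine-tuned 7B
    if tier == 2:
        return frontier_model(query)   # disputes: frontier inference
    return human_review(query)         # tier 3: human-in-the-loop oversight
```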
Logistics scheduling for an Argentine grain exporter. Container booking, ETA recalculation, and dispatch optimization across 14 ports were re-architected around cost-normalized accuracy rather than raw planner accuracy. The model gateway routes 78% of decisions to a domain-tuned small language model and reserves frontier inference for the 22% exception cases the SLM cannot resolve at policy. Per-decision inference cost dropped 83%; on-time-in-full improved 11 points. Every decision is logged into an immutable ledger keyed to the bill-of-lading number, satisfying both internal audit and customs traceability.
Document parsing in a São Paulo bank. ParseBench-class document agents handle 340,000 KYC and credit-file documents per month. The deployment gate is not extraction accuracy — it is policy adherence at 99.2% against LGPD data-residency rules and Banco Central do Brasil credit standards. Failed-policy decisions never auto-execute; they enter an advisory queue with a full audit trail. Mean time to credit decision compressed from 72 hours to 9 hours, with zero policy-adherence regressions over the first quarter in production.
KPIs before APIs, evaluation before deployment
The mechanics are not exotic. They are uncomfortable because they require enterprises to do their own measurement work rather than outsource it to a vendor leaderboard. The evaluation harness is the most underbuilt component in enterprise AI today, and it is the one that determines whether a pilot becomes policy or becomes a write-off.
Three workstreams have to run in parallel: a governance contract that defines the metrics and the override paths, a KPI architecture that translates those metrics into operating numbers, and a roadmap that sequences capability with measurement so neither outpaces the other. Procurement, finance, and audit own seats on this stack — not as reviewers, but as designers.
So what: The metric stack interoperates across procurement, finance, and audit, or it does not scale. A stack that lives only inside the data-science team is a vanity project. KPIs before APIs is not a slogan; it is the only sequence that survives a board review past quarter two.
Governance contract
Decision ledger capturing inputs, model version, tool calls, overrides, and outcomes. Evaluation harness running pass^k, CLEAR scoring, and policy-adherence checks on every model substitution. Model gateway with a substitution registry that maps each capability to two or more provider tiers. No production promotion without harness pass — eval-as-CI.
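Expressed as code, eval-as-CI is just a promotion function that every model substitution must pass. A sketch under assumed thresholds; the candidate schema is illustrative.

```python
def promote(candidate: dict) -> bool:
    """Run every gate; a substitution that fails any one of them
    never reaches production, exactly like a failing build."""
    gates = [
        candidate["pass_hat_5"]        >= 0.90,
        candidate["p95_latency_s"]     <= candidate["sla_s"],
        candidate["policy_adherence"]  >= 0.99,
        candidate["cost_per_task_usd"] <= candidate["cost_ceiling_usd"],
        candidate["audit_coverage"]    == 1.0,  # every decision in the ledger
    ]
    return all(gates)

# Example: a substitution candidate from the model-gateway registry.
candidate = {"pass_hat_5": 0.93, "p95_latency_s": 2.1, "sla_s": 3.0,
             "policy_adherence": 0.995, "cost_per_task_usd": 0.28,
             "cost_ceiling_usd": 0.32, "audit_coverage": 1.0}
assert promote(candidate)
```

Wiring promote() into the deployment pipeline is what turns the governance contract from a document into a gate.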
KPI architecture
Cost per successful task in USD. p95 latency under SLA threshold in seconds. pass^k stability at k = 5 and k = 8. Override rate target under 8%. Policy-adherence rate target at least 99%. Decision-auditability ratio at 100%. Provider concentration ceiling at 60%. Each KPI is owned by a named role, not a committee.
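One way to make named ownership enforceable is to register each KPI with its threshold check and its accountable role, then generate the board report from the register. Thresholds follow the targets above where they are fixed; the cost ceiling, latency SLA, pass^5 floor, and owner titles are illustrative assumptions.

```python
# KPI register: one threshold check and one named owner per metric.
KPI_REGISTER = {
    "cost_per_successful_task_usd": (lambda v: v <= 0.32, "FinOps lead"),
    "p95_latency_s":                (lambda v: v <= 3.0,  "Platform SRE"),
    "pass_hat_5":                   (lambda v: v >= 0.90, "Evaluation engineering"),
    "override_rate":                (lambda v: v < 0.08,  "Operations lead"),
    "policy_adherence_rate":        (lambda v: v >= 0.99, "Compliance officer"),
    "decision_auditability_ratio":  (lambda v: v == 1.0,  "Internal audit"),
    "provider_concentration":       (lambda v: v <= 0.60, "Procurement lead"),
}

def board_report(measured: dict) -> list[tuple]:
    """One row per KPI: metric, measured value, PASS/FAIL, accountable owner."""
    rows = []
    for name, value in measured.items():
        check, owner = KPI_REGISTER[name]
        rows.append((name, value, "PASS" if check(value) else "FAIL", owner))
    return rows
```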
Twelve-month roadmap
Days 0 to 90: inventory deployed agents, tag by risk and reversibility, baseline current cost-per-task and pass^k. Days 90 to 180: stand up the evaluation harness, integrate the decision ledger, route the top three workloads through a model gateway. Days 180 to 360: graduate qualifying agents to SLA-bound autonomous tier; introduce quarterly board metrics; pilot a sovereign-substrate workload via Latam-GPT and the CENIA Tarapacá supercomputer.
The leaderboard sold the model. The metric stack sells the outcome.
Enterprise AI in 2026 is being re-sorted not by which model wins this week's benchmark, but by which evaluation discipline an organization can build and defend. The accuracy curve has crossed a threshold; the production curve has not. That divergence is the most important single signal in enterprise AI right now, and it will reset competitive position more than any frontier model release in the same window. The buyers compounding returns are not the buyers running the most pilots; they are the buyers running the strictest harnesses.
From pilot to policy. The organizations that will compound the next twenty-four months of AI investment are the ones treating CLEAR — or any equivalent multi-dimensional stack — as the procurement contract, the deployment gate, and the board metric in one. In LATAM, where data residency, sovereign substrate, and operating-cost discipline are not optional, this discipline is also the path to scale: cost-normalized accuracy plus pass^k stability plus an auditable decision ledger is the only architecture that survives Article 14 oversight, LGPD, and the next operating budget. Everything else is POC theater paid in production money.
Stop procuring on leaderboards. Start procuring on outcomes.
Socradata helps enterprises in the LATAM industrial, financial, and public-services sectors stand up CLEAR-class evaluation harnesses, decision ledgers, and model-gateway architectures so AI agents move from pilot economics to production accountability — under cost, latency, and audit constraints that hold at board level.