2026-05-21 — views

Anthropic Code with Claude London: agent platform grows up — Dreaming, Outcomes, Finance

Read this because: the theme is a shift from "better model" to "reliable autonomy." Outcomes (a grader loop that scores agent runs) and Dreaming (scheduled memory curation) are the infrastructure for agents you can leave running unattended. Unattended reliability, not model IQ, is the real enterprise blocker.

At Code with Claude London, Anthropic shipped Dreaming, Outcomes, multi-agent orchestration, a 10-agent Claude Finance suite, and Small Business integrations.

Anthropic took its Code with Claude developer event to London (May 20-21) and used it to ship the parts of the agent platform that matter for production — not a new flagship model, but the reliability scaffolding around agents.

The 5 agent features

Feature	What it does
Dreaming (research preview)	A scheduled process that reviews past agent sessions and memory stores, extracts patterns, and curates long-term memory
Outcomes (public beta)	A grader loop that scores an agent’s runs against defined success criteria — closing the “did the agent actually succeed?” gap
Multi-agent orchestration	Coordinating multiple specialized agents on one task
Claude Finance	A suite of 10 finance-specific agents
Add-ins	Extending Claude into existing application surfaces

Alongside these, Claude for Small Business bundles pre-built integrations with QuickBooks, PayPal, HubSpot, Canva, Docusign, Google Workspace, and Microsoft 365, packaging agent capability for non-technical operators.

All of this runs on Claude Opus 4.7, the model that took the coding-benchmark lead earlier this spring (roughly 13% ahead of Opus 4.6 on a 93-task coding suite).

The real theme: autonomy reliability, not model IQ

The under-appreciated shift: Anthropic isn’t selling a smarter model here — it’s selling the infrastructure that makes agents trustworthy enough to leave running unattended.

Outcomes is the answer to “how do I know the agent succeeded?”: a grader loop that turns agent runs from fire-and-hope into measurable, scoreable units. The missing success signal, not raw capability, is what blocks enterprise agent deployment.
Dreaming is the answer to “how does the agent get better over time without me re-prompting?” — scheduled memory curation that compounds learning across sessions.

Together they target the gap between “demo that works once” and “agent you can deploy in production and walk away from.”
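To make the grader-loop idea concrete, here is a minimal sketch in plain Python. It does not use the Outcomes API, whose interface is not described here. The names `Criterion`, `grade`, `run_with_grader`, and `toy_agent` are all hypothetical, and the "agent" is a stub function standing in for a real model call. The sketch shows only the general pattern: run, score against explicit criteria, feed failures back, and stop on success or after a retry budget.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    """One named success check (hypothetical type, for illustration only)."""
    name: str
    check: Callable[[str], bool]  # True when the output meets the criterion


def grade(output: str, criteria: list[Criterion]) -> dict[str, bool]:
    """Score a single run: map each criterion name to pass/fail."""
    return {c.name: c.check(output) for c in criteria}


def run_with_grader(agent, task: str, criteria: list[Criterion], max_attempts: int = 3) -> dict:
    """Run the agent, grade the result, and retry with feedback until it passes."""
    feedback = ""
    output = ""
    for attempt in range(1, max_attempts + 1):
        output = agent(task, feedback)
        failed = [name for name, ok in grade(output, criteria).items() if not ok]
        if not failed:
            return {"output": output, "attempts": attempt, "passed": True}
        # Tell the agent which criteria it missed before the next attempt.
        feedback = "Unmet criteria: " + ", ".join(failed)
    return {"output": output, "attempts": max_attempts, "passed": False}


def toy_agent(task: str, feedback: str) -> str:
    """Stub standing in for a real agent call; it only reacts to one feedback hint."""
    draft = f"Summary of {task}"
    if "has_total" in feedback:
        draft += " Total: 42"
    return draft


criteria = [
    Criterion("mentions_task", lambda out: "invoice" in out),
    Criterion("has_total", lambda out: "Total:" in out),
]

result = run_with_grader(toy_agent, "invoice", criteria)
print(result)
```

In this toy run the first draft misses the `has_total` criterion, the feedback names it, and the second attempt passes. Logging `passed` and `attempts` per run is what turns individual runs into something you can aggregate into a pass rate.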

Why this matters

The competition has moved up the stack. Model quality (Opus 4.7 vs Gemini 3.5 Flash vs GPT) is now table stakes; the differentiation is in the agent operating layer — orchestration, memory, evaluation, integration. This echoes the Gemini 3.5 Flash story: when raw capability commoditizes quarterly, the moat shifts to the surrounding system.
Claude Finance + Small Business = vertical packaging. Anthropic is moving from horizontal API to packaged, vertical agent suites. That’s a bet that the value capture is in the application layer, not just the model API.
Outcomes is the most important release. A built-in grader loop is what lets a company define “success” for an agent and trust the score. That’s the difference between agent pilots and agent production.

Practitioner note

For builders shipping on Claude:

Adopt Outcomes before you scale any agent. If you’re running agents without a grader loop, you’re flying blind on reliability. Define success criteria, wire Outcomes, and you turn “it usually works” into a measured pass rate you can hold to an SLA. This is the single highest-leverage thing from this event.
Dreaming changes the memory architecture. If you’ve been manually managing agent memory/context, scheduled memory curation may replace a chunk of your custom plumbing. Evaluate before building more memory infra yourself.
Claude for Small Business is a distribution signal. The QuickBooks/HubSpot/M365 integrations mean Anthropic is going after non-developer operators directly. If you build agent products for SMBs, you now compete with first-party packaged agents — differentiate on workflow depth, not raw capability.
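For the memory-architecture note above, it can help to see what "custom plumbing" for memory curation tends to look like, so you know what you would be evaluating Dreaming against. The sketch below is an assumption-laden toy, not how Dreaming works: `curate_memory` is a hypothetical helper, sessions are modeled as lists of short note strings, and "pattern extraction" is reduced to promoting any note that recurs in at least `min_sessions` sessions. A real setup would call this from a scheduler (for example a nightly cron job), which is not shown.

```python
from collections import Counter


def curate_memory(sessions: list[list[str]], memory: list[str], min_sessions: int = 2) -> list[str]:
    """Promote notes that recur across sessions into long-term memory.

    Hypothetical helper: simple frequency counting stands in for whatever
    pattern extraction a real curation process would perform.
    """
    counts = Counter()
    for notes in sessions:
        counts.update(set(notes))  # count each note at most once per session
    promoted = {note for note, n in counts.items() if n >= min_sessions}
    # Keep existing memory, add promoted notes, and return a stable ordering.
    return sorted(set(memory) | promoted)


sessions = [
    ["user prefers CSV exports", "retry on 429"],
    ["retry on 429", "invoice IDs start with INV-"],
    ["user prefers CSV exports", "one-off typo fix"],
]
memory = ["timezone is UTC"]

curated = curate_memory(sessions, memory)
print(curated)
```

Here the two notes seen in two sessions are promoted, while the one-off notes are left out of long-term memory. If your own memory code looks like a more elaborate version of this, it is the part a managed curation feature would most plausibly replace.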

The under-considered angle: the agent platform war is being won on reliability tooling, not model benchmarks. Outcomes and Dreaming are unglamorous — graders and memory curation don’t make headlines like a new model does. But they’re exactly what converts agent demos into deployed, unattended production systems. The lab that makes agents boring and reliable first wins the enterprise, regardless of who tops the next benchmark.