2026-05-20 — views

Google Gemini 3.5 Flash beats last quarter's Pro flagship on agentic tasks

Read this because the signal is the price-performance inversion: a budget tier now outruns last quarter's flagship on agentic throughput per dollar. If you sized infra around Pro-tier pricing, your unit economics just improved without a code change.

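Launched at I/O 2026, Gemini 3.5 Flash beats Gemini 3.1 Pro on coding and agent benchmarks at $1.50 input / $9 output per 1M tokens. Terminal-Bench 2.1: 76.2% vs 70.3%. Google claims 4x the speed of comparable frontier models, often at less than half the cost.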

At Google I/O 2026 (May 19), Google launched Gemini 3.5 Flash — and the headline isn’t the model, it’s the price-performance inversion. A Flash-tier (budget) model now beats Gemini 3.1 Pro — last quarter’s flagship — on agentic and coding benchmarks, at a fraction of the cost.

The benchmark numbers

Benchmark	Gemini 3.5 Flash	Gemini 3.1 Pro
Terminal-Bench 2.1 (coding)	76.2%	70.3%
MCP Atlas (tool use)	83.6%	—
Finance Agent v2	57.9%	—
GDPval-AA (real-world agentic)	1656 Elo	—

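Google’s framing: frontier-level performance at 4x the speed of comparable frontier models, “often at less than half the cost.” A dash in the table means this post has no Gemini 3.1 Pro figure for that benchmark, so Terminal-Bench 2.1 is the only direct head-to-head shown here.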

Pricing + availability

$1.50 / 1M input tokens · $9 / 1M output tokens
1M-token context window
Generally available on day one across 6 surfaces (Gemini app, AI Mode in Search, Vertex AI, AI Studio, and more)
Gemini 3.5 Pro teased for “next month”
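
To see what these prices mean per request, here is a rough sketch of the arithmetic. Only the $1.50 and $9 figures come from the list above; the token counts are hypothetical, and the names `Pricing` and `request_cost` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens


# Prices quoted in this post for Gemini 3.5 Flash.
FLASH_3_5 = Pricing(input_per_million=1.50, output_per_million=9.00)


def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request under simple per-token pricing."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


# Hypothetical agent run: 200k tokens read, 20k tokens written.
cost = request_cost(FLASH_3_5, input_tokens=200_000, output_tokens=20_000)
print(f"${cost:.2f}")  # prints $0.48
```

Output tokens cost 6x as much as input tokens at these rates, so an agent that writes a lot (long plans, verbose tool calls) will see its bill driven mainly by the output side.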

Why this matters for builders

The structural shift is that the budget tier crossed the previous flagship’s capability line on agentic workloads — the workloads that actually matter for production AI products (multi-step tool use, coding, long-horizon agents).

If you architected your inference budget around 3.1-Pro-tier pricing, your unit economics just improved without a single code change — swap the model string, keep the behavior, cut the bill. This is the same dynamic we flagged in the Anthropic gross-margin story: the foundation-model layer keeps repricing capability downward, and the savings flow to whoever ships on the newest tier fastest.

Practitioner note

Re-benchmark before you migrate. Terminal-Bench wins don’t guarantee your specific workload improves. Run your last 5 production traces on 3.5 Flash vs your current model before switching.
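Watch the throughput-per-dollar, not the headline price. Speed and cost move different numbers: if the 4x speed claim holds on your workload, your agent loop completes roughly 4x as many tasks per wall-clock minute, and if the half-cost claim holds, roughly 2x as many tasks per dollar. The throughput framing we covered for coding agents applies here too.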
Don’t over-commit to one provider. With Gemini Flash, Claude, and GPT all repricing quarterly, multi-model routing keeps you on the best price-performance tier as it moves.
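
One way to make the first two notes concrete is to replay the same traces on both models and compare successful tasks per minute and per dollar, rather than list price. This is a minimal sketch, assuming you already log pass/fail, wall-clock time, and cost for each replayed trace. The `TraceResult` and `summarize` names are hypothetical, and every number below is an invented placeholder, not a measurement.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class TraceResult:
    passed: bool         # did the replayed trace reach the expected outcome?
    wall_seconds: float  # end-to-end time for the agent loop
    cost_usd: float      # token spend for the run


def summarize(results: Sequence[TraceResult]) -> Dict[str, float]:
    """Aggregate replayed traces into pass rate and throughput metrics."""
    if not results:
        raise ValueError("need at least one trace result")
    passed = sum(1 for r in results if r.passed)
    minutes = sum(r.wall_seconds for r in results) / 60
    dollars = sum(r.cost_usd for r in results)
    if minutes <= 0 or dollars <= 0:
        raise ValueError("total time and total cost must be positive")
    return {
        "pass_rate": passed / len(results),
        "passed_per_minute": passed / minutes,
        "passed_per_dollar": passed / dollars,
    }


# Placeholder data: the same four traces replayed on two models.
current = [TraceResult(True, 120.0, 0.50), TraceResult(True, 120.0, 0.50),
           TraceResult(False, 120.0, 0.50), TraceResult(True, 120.0, 0.50)]
candidate = [TraceResult(True, 30.0, 0.25), TraceResult(True, 30.0, 0.25),
             TraceResult(False, 30.0, 0.25), TraceResult(True, 30.0, 0.25)]

for name, results in (("current", current), ("candidate", candidate)):
    print(name, summarize(results))
```

Counting only passed tasks in the numerator matters: a model that is fast and cheap but fails more often on your traces will not look better on these metrics.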

The under-considered angle: the “Flash beats last-quarter Pro” pattern is now a reliable quarterly cadence across all three labs. That means the rational architecture is provider-agnostic model routing with quarterly re-benchmarking — not a long-term bet on any single model family. The moat is your eval harness, not your model choice.
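
The routing idea can be sketched as a small selection rule: after each re-benchmark, pick the cheapest model that still clears your own quality bar. This is an illustrative sketch, not a production router. The model identifiers, pass rates, and costs are hypothetical placeholders, and `Candidate` and `pick_model` are names invented for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    model_id: str             # whatever string your provider client expects
    eval_pass_rate: float     # measured by your own eval harness
    cost_per_task_usd: float  # measured on the same eval run


def pick_model(candidates: Sequence[Candidate],
               min_pass_rate: float) -> Optional[Candidate]:
    """Return the cheapest candidate meeting the quality bar, or None."""
    eligible = [c for c in candidates if c.eval_pass_rate >= min_pass_rate]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c.cost_per_task_usd)


# Placeholder results from a hypothetical quarterly re-benchmark.
latest_eval = [
    Candidate("provider-a/flagship", eval_pass_rate=0.82, cost_per_task_usd=0.40),
    Candidate("provider-a/budget", eval_pass_rate=0.80, cost_per_task_usd=0.10),
    Candidate("provider-b/budget", eval_pass_rate=0.65, cost_per_task_usd=0.05),
]

choice = pick_model(latest_eval, min_pass_rate=0.75)
print(choice.model_id if choice else "no model meets the bar")
```

The selection logic stays fixed while the eval table changes each quarter, which is why the harness that produces that table matters more than any single entry in it.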