AutomationBench

Tool use

AutomationBench (from Zapier) measures how reliably a model can build and run real-world automations end to end: wiring up triggers, transforming data, and chaining actions across third-party apps with minimal human help. Scores are low across the board, with the best reported result at 17.4%, leaving plenty of headroom.

Model scores

Fable 5: 17.4%
Opus 4.8: 15.5%
GPT-5.5: 12.9%
Opus 4.7: not reported
Gemini 3.1 Pro: 9.6%
Mythos Preview: not reported

Official source: Anthropic — Fable 5 / Mythos 5 announcement
