Humanity's Last Exam

Multidisciplinary reasoning

Humanity's Last Exam (HLE) is a frontier, multi-modal benchmark of expert-authored questions across dozens of academic disciplines, designed to be extremely difficult. Scores are reported with and without tool use.

Model scores

Fable 5: 59.0% (no tools) / 64.5% (with tools)
Opus 4.8: 49.8% (no tools) / 57.9% (with tools)
Sonnet 5: 43.2% (no tools) / 57.4% (with tools)
GPT-5.6 Sol: no score listed
GPT-5.5: 41.4% (no tools, Pro) / 52.2% (with tools, Pro)
Composer 2.5: no score listed
Opus 4.7: 46.9% (no tools) / 54.7% (with tools)
Gemini 3.1 Pro: 44.4% (no tools) / 51.4% (with tools)
Mythos Preview: 56.8% (no tools) / 64.7% (with tools)
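
Because each model is listed with a no-tools and a with-tools score, the gain from tool use can be read off as a simple difference in percentage points. The following is a minimal sketch of that arithmetic, not part of the benchmark itself. The dictionary layout and the function name tool_gain are illustrative assumptions, the model names assume the name/score split shown in the list above, and models with no listed score are left out.

```python
# Scores transcribed from the list above as (no tools %, with tools %).
# Assumption: model names are split from their scores as read here.
# GPT-5.5 figures are the "Pro" configuration noted in the list.
scores = {
    "Fable 5": (59.0, 64.5),
    "Opus 4.8": (49.8, 57.9),
    "Sonnet 5": (43.2, 57.4),
    "GPT-5.5": (41.4, 52.2),
    "Opus 4.7": (46.9, 54.7),
    "Gemini 3.1 Pro": (44.4, 51.4),
    "Mythos Preview": (56.8, 64.7),
}


def tool_gain(scores):
    """Return {model: with_tools - no_tools} in percentage points, rounded to 0.1."""
    return {model: round(with_tools - no_tools, 1)
            for model, (no_tools, with_tools) in scores.items()}


gains = tool_gain(scores)

# Print models from largest to smallest gain from tool use.
for model, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:<16} +{gain:.1f} pts")
```

These differences are percentage points, not relative improvements, and they compare only the figures listed here.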

Official source: Humanity's Last Exam (lastexam.ai)
