Humanity's Last Exam

Multidisciplinary reasoning

Humanity's Last Exam (HLE) is a frontier, multi-modal benchmark of expert-authored questions across dozens of academic disciplines, designed to be extremely difficult. Scores are reported with and without tool use.

Model scores

  • Opus 4.8: 49.8% (no tools) / 57.9% (with tools)
  • Opus 4.7: 46.9% (no tools) / 54.7% (with tools)
  • GPT-5.5: 41.4% (no tools, Pro) / 52.2% (with tools, Pro)
  • Gemini 3.1 Pro: 44.4% (no tools) / 51.4% (with tools)
  • Mythos Preview: 56.8% (no tools) / 64.7% (with tools)

Official source: Humanity's Last Exam (lastexam.ai)

Related reading