OSWorld-Verified

Agentic computer use

OSWorld-Verified is a multimodal benchmark that tests an agent’s ability to complete real tasks in a desktop operating system — navigating GUIs, clicking, typing and using applications the way a person would.

Model scores

  • Opus 4.8: 83.4%
  • Opus 4.7: 82.8%
  • GPT-5.5: 78.7%
  • Gemini 3.1 Pro: 76.2%
  • Mythos Preview: 79.6%

Official source: OSWorld project
