
CyberGym

Cybersecurity vulnerability reproduction

CyberGym tests an agent’s ability to reproduce previously discovered vulnerabilities in real open-source projects, given only a high-level description. It is a targeted vulnerability-reproduction task, scored as pass@1 over 1,507 cases.
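As a rough illustration of the scoring, pass@1 can be read as the fraction of cases solved on a single attempt. The sketch below assumes one boolean outcome per case; the function name `pass_at_1` and the outcome list are hypothetical and are not taken from the CyberGym harness.

```python
def pass_at_1(outcomes: list[bool]) -> float:
    """Return the fraction of cases solved on their single attempt."""
    if not outcomes:
        raise ValueError("at least one case is required")
    return sum(outcomes) / len(outcomes)


TOTAL_CASES = 1507  # case count stated in the benchmark description

# Hypothetical results: 1,200 reproduced vulnerabilities, the rest failed.
# These numbers are made up for illustration and match no listed model.
outcomes = [True] * 1200 + [False] * (TOTAL_CASES - 1200)

score = pass_at_1(outcomes)
print(f"pass@1 = {score:.1%}")
```

Under this reading, each case counts equally and there is no partial credit for a near-miss reproduction.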

Model scores

  • Opus 4.8: 78.8%
  • Opus 4.7: 73.1%
  • GPT-5.5: 81.8%
  • Gemini 3.1 Pro
  • Mythos Preview: 83.1%

Official source: CyberGym (cybergym.io)