CyberGym

Cybersecurity vulnerability reproduction

CyberGym tests an agent's ability to reproduce previously discovered vulnerabilities in real open-source projects from a high-level description. It is a targeted vulnerability-reproduction task, scored as pass@1 over 1,507 cases.
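As a rough illustration of the scoring arithmetic, the sketch below treats pass@1 as the fraction of cases reproduced on a single attempt per case. That reading is an assumption, and the function names `pass_at_1` and `approx_solved` are hypothetical helpers, not part of any CyberGym tooling. The second helper converts a reported percentage into an approximate count of solved cases out of 1,507.

```python
from typing import Sequence


def pass_at_1(outcomes: Sequence[bool]) -> float:
    """Fraction of cases solved, assuming exactly one attempt per case.

    Each entry in `outcomes` is True if that case's vulnerability was
    reproduced on the single attempt (hypothetical data layout).
    """
    if not outcomes:
        raise ValueError("need at least one case")
    return sum(outcomes) / len(outcomes)


def approx_solved(score_percent: float, total_cases: int = 1507) -> int:
    """Approximate number of solved cases implied by a reported percentage.

    The result is rounded, so it is only an estimate of the true count.
    """
    return round(score_percent / 100 * total_cases)


if __name__ == "__main__":
    # Toy run: 3 of 4 cases reproduced on the first attempt.
    print(pass_at_1([True, False, True, True]))
    # Rough case count behind a reported 83.8% score.
    print(approx_solved(83.8))
```

Under this reading, one percentage point on the benchmark corresponds to roughly 15 cases, so small score differences between models reflect a modest number of reproduced vulnerabilities.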

Model scores

Fable 5: 83.8%
Opus 4.8: 78.8%
Sonnet 5: no score listed
GPT-5.6 Sol: 84.5%
GPT-5.5: 81.8%
Composer 2.5: no score listed
Opus 4.7: 73.1%
Gemini 3.1 Pro: no score listed
Mythos Preview: 83.1%

Official source: CyberGym (cybergym.io)

Related reading