About these benchmarks
This page compares frontier AI models across the evaluations labs use to measure real-world ability: agentic coding, terminal and computer use, tool orchestration, web search, long-context and graduate-level reasoning, visual understanding and multilingual knowledge. Choose any two models above to see a head-to-head: the first becomes the baseline, and the second is marked green on each benchmark where it beats the baseline and red where it falls behind.
Figures are drawn from each model’s published system card and from independent leaderboards such as Artificial Analysis. Where a lab does not report a given benchmark, the cell is left blank (—) rather than estimated.
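The head-to-head coloring and blank-cell handling described above can be sketched roughly as follows. This is an illustrative sketch only, not the page's actual implementation: the function names, the "neutral" result for ties, the rule that a cell gets no color when either model lacks a score, and the assumption that a higher score is better on every benchmark are all hypothetical. The benchmark names and numbers are placeholders, not real results.

```python
from typing import Dict, Optional

Scores = Dict[str, Optional[float]]


def compare_cell(baseline: Optional[float], challenger: Optional[float]) -> str:
    """Return the assumed highlight for the second model's cell on one benchmark."""
    # Assumption: if either lab did not report the benchmark, nothing is
    # estimated and the comparison is left blank.
    if baseline is None or challenger is None:
        return "blank"
    # Assumption: higher is better on every benchmark.
    if challenger > baseline:
        return "green"
    if challenger < baseline:
        return "red"
    # Assumption: ties get no win/loss color.
    return "neutral"


def head_to_head(baseline: Scores, challenger: Scores) -> Dict[str, str]:
    """Compare two models benchmark by benchmark; the first is the baseline."""
    benchmarks = sorted(set(baseline) | set(challenger))
    return {
        name: compare_cell(baseline.get(name), challenger.get(name))
        for name in benchmarks
    }


# Placeholder data for illustration only (not real benchmark results).
model_a = {"benchmark_x": 70.0, "benchmark_y": 55.0, "benchmark_z": None}
model_b = {"benchmark_x": 72.5, "benchmark_y": 50.0, "benchmark_z": 61.0}

result = head_to_head(model_a, model_b)
for name, color in result.items():
    print(f"{name}: {color}")
```

Swapping the argument order changes which model is the baseline, so a green cell in one direction becomes red in the other.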
Benchmark data is provided for informational comparison and may be updated as labs publish new results.