Methodology

We aim to make LLM benchmark numbers comparable and honest. Here is how the data on LLM Boss is collected and presented.

Where the numbers come from

Scores are drawn from each model's official system card or from independent leaderboards such as Artificial Analysis and Scale's SEAL evaluations. Every benchmark page links to its official source so you can verify the figure and read the full evaluation details.

Keeping comparisons fair

Where possible we use each model's strongest publicly reported configuration (for example, maximum reasoning effort), and we note when a benchmark distinguishes results with and without tools. When a lab does not report a benchmark, we leave the cell blank ("—") rather than estimate it. See "How to read benchmark scores" and "Benchmark contamination" for the caveats that apply to all leaderboard numbers.

How comparisons are computed

On the comparison table and head-to-head pages, the baseline model is shown in orange and the challenger in cyan. For each benchmark the challenger's score is coloured green when it beats the baseline and red when it trails. Split metrics (such as "no tools / with tools") are compared part by part.
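
As a rough illustration of the colouring rule described above, here is a minimal Python sketch. It is not the site's actual code: the function names `colour` and `compare` are hypothetical, it assumes higher scores are better, and it assumes that ties and unreported scores are left uncoloured ("neutral"), which the rule above does not specify.

```python
from typing import Dict, Optional, Union

Score = Optional[float]
Metric = Union[Score, Dict[str, Score]]


def colour(challenger: Score, baseline: Score) -> str:
    """Return 'green', 'red', or 'neutral' for a single score."""
    if challenger is None or baseline is None:
        # Assumption: an unreported score is never coloured.
        return "neutral"
    if challenger > baseline:
        return "green"  # challenger beats the baseline
    if challenger < baseline:
        return "red"  # challenger trails the baseline
    # Assumption: a tie is left uncoloured.
    return "neutral"


def compare(challenger: Metric, baseline: Metric):
    """Colour a plain score, or a split metric part by part."""
    if isinstance(challenger, dict):
        base_parts = baseline if isinstance(baseline, dict) else {}
        return {
            part: colour(value, base_parts.get(part))
            for part, value in challenger.items()
        }
    if isinstance(baseline, dict):
        # Shapes differ, so there is nothing comparable.
        return "neutral"
    return colour(challenger, baseline)
```

For example, `compare({"no tools": 40.0, "with tools": 55.0}, {"no tools": 42.0, "with tools": 50.0})` colours the "no tools" part red and the "with tools" part green, since each part is judged against its own counterpart rather than against a combined figure.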

Updates

Benchmark data changes as labs publish new results and revise evaluations. We update figures as new system cards land; if you spot an out-of-date number, the official source linked on each benchmark page is the authority.