Everything you need to understand how large language models are measured: benchmark explainers, evaluation methodology, and head-to-head model comparisons.
A practical, end-to-end guide to how large language models are measured: the benchmark categories, what the numbers mean, and how to choose a model.
Public leaderboards are a starting point, not the answer. Here is how to run your own evaluation and pick the right model for your actual workload.
9 min read
Stop guessing which model to use. Map your use case to the benchmarks that predict it, then read the numbers.
9 min read
When every frontier model scores above 90%, a benchmark stops being useful. Score ceilings are driving researchers to harder evaluations. Here is why and what comes next.
8 min read
MMLU-Pro was designed to restore the discriminative power that MMLU lost as frontier models approached human-expert accuracy, by adding more answer choices and filtering for questions that require real reasoning.
8 min read
Humanity's Last Exam is deliberately designed to be unsolvable for years: a gauntlet of 2,500 questions that stumped the experts who wrote them.
8 min read
CharXiv challenges models to reason over real scientific charts from arXiv, testing whether they can perform multi-step visual inference rather than simply reading off a labelled value.
7 min read
GPQA Diamond, Humanity's Last Exam, and AA-LCR expose clear differences in how leading models handle graduate-level reasoning, frontier research, and long-context recall.
9 min read
SWE-bench Pro, SWE-bench Verified, and Terminal-Bench reveal a clear ranking for agentic coding, with one model pulling well ahead of the pack.
9 min read
OSWorld-Verified evaluates AI agents on real desktop OS tasks (clicking, typing, navigating apps) across a curated, reproducible subset with verified ground-truth outcomes.
8 min read
MMLU became the standard knowledge benchmark for LLMs, but frontier models now score above 90%, making MMMLU and harder evals the new reference points for capability comparisons.
7 min read
Gemini 3.1 Pro leads on BrowseComp, MMMLU, and GPQA Diamond. Here is every number in context.
7 min read
Opus 4.8 dominates agentic coding and computer use, but Gemini 3.1 Pro edges ahead on GPQA, BrowseComp, and MMMLU. Here is the full breakdown.
8 min read
Static question-answering cannot measure an LLM that browses the web, runs code and calls APIs. Here is how agentic benchmarks work and what they reveal.
8 min read
Terminal-Bench drops an agent into a real shell and asks it to complete tasks that span many commands, the closest public eval to how coding agents actually operate in production.
7 min read
By continuously sourcing fresh problems from competitive programming contests, LiveCodeBench sidesteps the training-data contamination that makes static benchmarks unreliable over time.
7 min read
AA-LCR probes whether a model can genuinely reason across a large context window, distinguishing deep long-context inference from shallow retrieval over long documents.
8 min read
pass@1, pass@k and majority voting tell different stories about the same model. Here is how to read each metric and when it matters.
7 min read
Opus 4.8 and GPT-5.5 trade wins across every major benchmark category. Here is the full picture with numbers.
8 min read
Gemini 3.1 Pro leads MMMLU at 92.6%, edging Claude Opus 4.7 (91.5%) and pulling ahead of GPT-5.5 (83.2%) by a significant margin.
6 min read
MCP-Atlas measures whether an AI can manage a large catalogue of tools over the Model Context Protocol: selecting, chaining, and recovering from errors across complex workflows.
7 min read
GPQA Diamond presents questions so hard that the domain experts who wrote them average around 65%, yet frontier models are now closing in on that ceiling.
6 min read
GPT-5.5 leads on Terminal-Bench, SWE-bench Verified, and AA-LCR. Here is every number with context.
7 min read
When a model has seen the test questions during training, its score measures memory rather than intelligence. Here is how contamination works and what is being done about it.
7 min read
SWE-bench tasks models with resolving real GitHub issues end-to-end, with no hints and no scaffolding. Here is what the variants mean and why it became the gold standard for coding evals.
7 min read
BrowseComp measures whether AI agents can hunt down obscure, hard-to-find facts across the live web, not just retrieve obvious answers from a single page.
7 min read
A benchmark score is only as trustworthy as the conditions behind it. Here is what to check before comparing two models.
8 min read
A leaderboard number hides a lot of machinery. Here is how Elo ratings, human-preference arenas, and automated benchmarks each produce rankings, and why those rankings keep changing.
9 min read
Opus 4.8 leads among shipping models on MCP-Atlas, OSWorld, and SWE-bench Pro. Here is every number explained.
8 min read
Long-context performance diverges sharply from chat quality. GPT-5.5 leads AA-LCR at 74.3%, with important nuances for different document lengths.
7 min read
HumanEval introduced the pass@k metric and made automated code evaluation mainstream, but near-perfect scores by frontier models eventually forced the community to build harder, more realistic evals.
8 min read
Human evaluation does not scale. Using a model as a judge makes open-ended scoring tractable, but it introduces its own biases and failure modes.
7 min read
MCP-Atlas is the hardest tool-use benchmark available. Claude Opus 4.8 leads at 82.2%, with meaningful gaps that matter in production agentic pipelines.
7 min read
The American Invitational Mathematics Examination pushes LLMs far beyond arithmetic: each problem demands a chain of novel deductions that cannot be pattern-matched from training data.
7 min read
An LLM agent is a model that takes actions in the world (calling tools, writing code, browsing the web) to complete goals that span many steps.
8 min read
Agentic benchmarks reveal a fragmented leaderboard: Opus 4.8 leads on tool use and computer control, while GPT-5.5 edges ahead on shell tasks.
8 min read