What Is MMLU-Pro? A Harder Knowledge Benchmark
MMLU-Pro extends the original MMLU with 10 answer choices and harder reasoning questions to combat saturation. Learn how it differs from MMLU and GPQA, and what scores mean today.
MMLU-Pro was created because the original MMLU had run out of headroom: by 2024, multiple frontier models were scoring above 85%, making it nearly impossible to distinguish the best from the second-best.
What MMLU-Pro is and how it differs from MMLU
MMLU (Massive Multitask Language Understanding) tests knowledge across 57 academic domains using 4-option multiple-choice questions. It was groundbreaking in 2020 but saturated quickly as models improved. MMLU-Pro addresses saturation with two targeted changes:
- 10 answer choices instead of 4. With 4 options, a model that knows the answer with only moderate confidence can often eliminate distractors and guess correctly. Expanding to 10 options lowers the random-guess baseline from 25% to 10% and sharply increases the penalty for partial knowledge, which spreads out the scores of models that would otherwise cluster near the ceiling.
- Harder, reasoning-intensive questions. The MMLU-Pro team filtered out questions answerable by surface-level recall and added problems that require multi-step reasoning, quantitative computation, or applying principles to novel scenarios. Questions were sourced from advanced textbooks, competitive exams, and expert-curated datasets, then validated by human experts.
The resulting benchmark has 12,000+ questions across 14 subject categories, consolidated from MMLU's 57 subjects. Human expert performance sits around 72%, compared to ~90% on original MMLU, which suggests the difficulty increase was genuine rather than cosmetic.
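The effect of moving from 4 to 10 options can be illustrated with a toy guessing model. This is a minimal sketch under stated assumptions, not the MMLU-Pro authors' analysis: the function name `expected_accuracy` and its parameters are hypothetical, and the model assumes that when the answer is not known outright, the model rules out a fixed number of distractors and guesses uniformly among the rest.

```python
def expected_accuracy(p_known: float, n_options: int, n_eliminated: int = 0) -> float:
    """Expected accuracy under a toy "know it or guess" model (illustrative only).

    p_known      -- fraction of questions the model answers from real knowledge
    n_options    -- answer choices per question (4 for MMLU, 10 for MMLU-Pro)
    n_eliminated -- distractors ruled out before guessing on the other questions
    """
    if not 0.0 <= p_known <= 1.0:
        raise ValueError("p_known must be between 0 and 1")
    if not 0 <= n_eliminated <= n_options - 1:
        raise ValueError("n_eliminated must leave at least the correct option")
    remaining = n_options - n_eliminated
    # Known questions are always right; the rest are a uniform guess.
    return p_known + (1.0 - p_known) / remaining


if __name__ == "__main__":
    for n_options in (4, 10):
        pure_guess = expected_accuracy(0.0, n_options)
        partial = expected_accuracy(0.6, n_options, n_eliminated=2)
        print(f"{n_options} options: guess={pure_guess:.2f}, partial knowledge={partial:.2f}")
```

Under these assumptions, a model that truly knows 60% of the answers and can rule out two distractors on the rest would score about 0.80 with 4 options but about 0.65 with 10, because lucky guesses contribute far less.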
Domains and subject coverage
MMLU-Pro covers a similarly broad sweep to MMLU: STEM (mathematics, physics, chemistry, biology, computer science, engineering), social sciences (economics, psychology, law), humanities (history, philosophy), and professional domains (health, business), plus a catch-all "other" category. The STEM sections tend to show the largest difficulty increase relative to original MMLU because those questions most often require multi-step quantitative reasoning. A model that scores 85% on MMLU STEM may score 60–65% on the equivalent MMLU-Pro section.
This subject breadth distinguishes MMLU-Pro from more focused benchmarks. Where GPQA Diamond covers only biology, chemistry, and physics at PhD difficulty, MMLU-Pro measures breadth of competent knowledge across dozens of fields. For a detailed look at GPQA, see the GPQA explainer. The two benchmarks are complementary: GPQA tells you how deep a model goes in science; MMLU-Pro tells you how wide its reliable knowledge is.
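Per-subject comparisons like the STEM example above depend on breaking accuracy out by category rather than reporting a single number. The following is a minimal sketch, assuming each graded question is a dictionary with the hypothetical keys `category`, `predicted`, and `answer`; the sample records are made up for illustration and are not real benchmark results.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return {category: accuracy} for a list of graded question records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["category"]] += 1
        if record["predicted"] == record["answer"]:
            correct[record["category"]] += 1
    return {category: correct[category] / total[category] for category in total}


def overall_accuracy(records):
    """Micro-average: every question counts equally, regardless of category."""
    hits = sum(1 for r in records if r["predicted"] == r["answer"])
    return hits / len(records)


if __name__ == "__main__":
    # Made-up records; letters A-J stand for the ten answer options.
    sample = [
        {"category": "math", "predicted": "C", "answer": "C"},
        {"category": "math", "predicted": "H", "answer": "B"},
        {"category": "math", "predicted": "J", "answer": "J"},
        {"category": "law", "predicted": "A", "answer": "A"},
        {"category": "law", "predicted": "D", "answer": "F"},
    ]
    for category, acc in accuracy_by_category(sample).items():
        print(f"{category}: {acc:.2f}")
    print(f"overall: {overall_accuracy(sample):.2f}")
```

The micro-average weights large categories more heavily, so a headline score can hide a weak subject; reporting the per-category breakdown alongside it makes that visible.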
How MMLU-Pro scores compare across frontier models
As of mid-2026, top frontier models score between 73% and 82% on MMLU-Pro — meaningfully below their original MMLU ceilings but still above or near human-expert accuracy on several subjects. The benchmark continues to separate models that were indistinguishable on original MMLU. You can check current standings for models like Claude Opus 4.8 and GPT-5.5 on the live benchmark comparison.
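Whether a gap between two models is meaningful depends partly on how many questions the score is based on. The sketch below estimates a rough 95% margin of error for an accuracy score using the normal approximation to the binomial. It assumes questions behave like independent draws, which is a simplification; the function name `accuracy_margin` is hypothetical, and the 700-question category size is an illustrative figure rather than an actual MMLU-Pro category count.

```python
import math


def accuracy_margin(accuracy: float, n_questions: int, z: float = 1.96) -> float:
    """Approximate half-width of a confidence interval for an accuracy score.

    Uses the normal approximation to the binomial; z=1.96 corresponds to ~95%.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n_questions)
    return z * standard_error


if __name__ == "__main__":
    full_set = accuracy_margin(0.80, 12_000)    # roughly the full benchmark
    one_category = accuracy_margin(0.80, 700)   # hypothetical single category
    print(f"full benchmark: +/- {full_set * 100:.1f} points")
    print(f"single category: +/- {one_category * 100:.1f} points")
```

Under these assumptions, an 80% score on roughly 12,000 questions carries a margin of about 0.7 percentage points, while the same score on a 700-question category carries about 3 points, so small per-subject differences deserve more caution than differences in the overall score.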
One caveat: because MMLU-Pro questions were collected before most current frontier models were trained, contamination risk exists, and it is hard to quantify. For context on how to interpret scores given contamination risk, see our benchmark contamination explainer and the guide on how to read LLM benchmark scores.
MMLU-Pro vs MMLU: when to use each
Original MMLU is still useful as a broad baseline: it has a long history of reported scores, making it easy to track progress over model generations. MMLU-Pro is the right choice when you want to differentiate current frontier models or understand how a model performs under reasoning pressure rather than just knowledge recall.
For a full comparison of how MMLU fits into the benchmark landscape, see the original MMLU explainer. For the complete picture of knowledge and reasoning benchmarks, the complete guide to LLM benchmarks places MMLU-Pro alongside GPQA, AIME, and other major evals.
Key takeaways
- MMLU-Pro extends MMLU with 10-choice questions and harder reasoning problems to restore discriminative power lost to saturation.
- Human expert accuracy is ~72%, compared to ~90% on original MMLU, which suggests the difficulty increase is genuine rather than cosmetic.
- STEM sections show the largest relative difficulty increase because those questions most often require quantitative multi-step reasoning.
- Use MMLU-Pro alongside GPQA Diamond: MMLU-Pro measures breadth of reliable knowledge; GPQA measures depth in expert science domains.
- Contamination risk exists and is hard to quantify, so treat scores as a signal rather than a definitive ranking.