What Are MMLU and MMMLU? LLM Knowledge Benchmarks Explained

MMLU was one of the first benchmarks to test language models across a broad sweep of academic and professional knowledge: 57 subjects, thousands of questions, and a clear bar in the form of estimated human expert accuracy of roughly 90%. Most frontier models have now reached that bar, so where does that leave us?

What MMLU is and how it works

Massive Multitask Language Understanding (MMLU) was introduced in 2020 by Hendrycks et al. Its test set contains 14,079 multiple-choice questions spanning 57 subjects, from elementary mathematics and US history to professional law, professional medicine, and abstract algebra. Each question has four answer options. Most questions reward recalled knowledge rather than extended reasoning, although the mathematics and science subjects do call for some calculation.
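To make the four-option format concrete, here is a minimal sketch of how a single item might be represented and turned into a prompt. The `MMLUItem` class, the field names, and the prompt layout are assumptions made for illustration, and the sample question is invented rather than taken from the dataset.

```python
from dataclasses import dataclass

LETTERS = "ABCD"

@dataclass
class MMLUItem:
    # Hypothetical container; real dataset releases may name fields differently.
    subject: str
    question: str
    choices: list[str]  # exactly four answer options
    answer: int         # index of the correct option, 0 to 3

def format_prompt(item: MMLUItem) -> str:
    """Render the question, its lettered options, and an answer cue."""
    lines = [item.question]
    for letter, choice in zip(LETTERS, item.choices):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

def is_correct(item: MMLUItem, predicted_letter: str) -> bool:
    """Compare a predicted letter (case and whitespace ignored) to the key."""
    return predicted_letter.strip().upper() == LETTERS[item.answer]

# Invented example question, for illustration only.
example = MMLUItem(
    subject="elementary_mathematics",
    question="What is 7 x 8?",
    choices=["54", "56", "58", "64"],
    answer=1,
)

print(format_prompt(example))
print(is_correct(example, "B"))
```

Scoring then reduces to counting how often the predicted letter matches the key, which is why a model's result depends on how its raw output is mapped to a letter.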

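The benchmark was deliberately broad. No single model was expected to master every domain. The headline score is an average over all 57 subjects, but the averaging method depends on the evaluation harness. A macro-average weights each subject equally, so poor performance in any cluster of topics drags down the total even if the model excels elsewhere. A micro-average pools all questions, so subjects with more questions carry more weight.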
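How the per-subject results are combined changes the headline number. The sketch below contrasts a macro-average (each subject counts equally) with a micro-average (each question counts equally), which some harnesses report instead. The function names and the toy results are hypothetical and chosen only to make the gap visible.

```python
from collections import defaultdict

def micro_average(results):
    """Accuracy over all questions pooled; results is a list of (subject, correct)."""
    return sum(correct for _, correct in results) / len(results)

def macro_average(results):
    """Mean of per-subject accuracies, so each subject has equal weight."""
    per_subject = defaultdict(list)
    for subject, correct in results:
        per_subject[subject].append(correct)
    subject_acc = [sum(v) / len(v) for v in per_subject.values()]
    return sum(subject_acc) / len(subject_acc)

# Toy data (not real scores): strong on a large subject, weak on a small one.
results = (
    [("professional_law", True)] * 90
    + [("professional_law", False)] * 10
    + [("abstract_algebra", True)] * 2
    + [("abstract_algebra", False)] * 8
)

micro = micro_average(results)  # 92 correct out of 110 questions
macro = macro_average(results)  # mean of 0.90 and 0.20
print(f"micro={micro:.3f} macro={macro:.3f}")
```

In this toy case the weak small subject pulls the macro-average far below the micro-average, so it is worth checking which of the two a reported score uses before comparing numbers from different sources.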

MMMLU: taking the benchmark multilingual

MMMLU (Multilingual MMLU), released by OpenAI, is a professionally translated version of the MMLU test set in 14 languages: Arabic, Bengali, Chinese, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese, Spanish, Swahili, and Yoruba. It should not be confused with Global-MMLU (sometimes abbreviated GMMLU), a separately constructed multilingual variant that some evaluation frameworks also support. The benchmark structure is identical to MMLU (same questions, same four-option format), but the model must read the question and the answer options in the target language.

MMMLU reveals capability gaps that the English-only version conceals. For example, a model that scores 85% in English may score 60% in Hindi or 55% in Arabic, reflecting uneven training data coverage. Current multilingual scores are tracked on our MMMLU benchmark page. You can compare models across languages using the live benchmark comparison table.
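A simple way to read multilingual results is to express each language as a gap relative to English. The sketch below uses the illustrative 85 / 60 / 55 figures from the paragraph above; the function name and the language codes are assumptions, and the numbers are not measurements of any real model.

```python
def gaps_vs_english(scores: dict[str, float]) -> dict[str, float]:
    """Percentage-point gap between English and every other language."""
    english = scores["en"]
    return {
        lang: round(english - acc, 1)
        for lang, acc in scores.items()
        if lang != "en"
    }

# Illustrative accuracies in percent, taken from the example in the text.
scores = {"en": 85.0, "hi": 60.0, "ar": 55.0}

gaps = gaps_vs_english(scores)
# Languages ordered from largest to smallest gap.
worst_first = sorted(gaps.items(), key=lambda kv: kv[1], reverse=True)
print(worst_first)
```

Sorting by gap rather than by raw score makes it easier to see where a model's coverage is weakest relative to its own English baseline.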

Why MMLU saturated so quickly

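When MMLU was released, the largest GPT-3 model scored about 44% in the few-shot setting, against a random-guess baseline of 25%, so a score of 50% would have been impressive. By 2023, GPT-4 exceeded 86%. By 2025, leading frontier models were scoring above 90% on the English version. At that level, the benchmark stops separating the top models: a one- or two-point difference is comparable to the variation introduced by prompt format and answer-extraction choices, and the ranking of models becomes unstable across evaluation runs.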
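One component of that noise can be estimated directly. The sketch below computes the binomial standard error of an accuracy score, assuming each question is an independent trial. That is a simplifying assumption, and the 100-question subject size is a hypothetical round number used for illustration.

```python
import math

def accuracy_standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy estimate (independence assumed)."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)

N_TEST = 14_079      # test-set size quoted in this article
N_SUBJECT = 100      # hypothetical size of a single subject

se_full = accuracy_standard_error(0.90, N_TEST)
se_subject = accuracy_standard_error(0.90, N_SUBJECT)

# Approximate 95% interval half-widths, in accuracy units (0 to 1).
ci_full = 1.96 * se_full
ci_subject = 1.96 * se_subject

print(f"full test set: +/- {100 * ci_full:.2f} points")
print(f"one subject:   +/- {100 * ci_subject:.2f} points")
```

Under these assumptions the sampling error on the full test set is around half a percentage point at 90% accuracy, and several points for a single subject of about 100 questions. The formula covers sampling error only. Prompt wording, few-shot example selection, and answer extraction add further variation that it does not capture.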

Saturation happened for two reasons. First, models genuinely improved at knowledge retrieval. Second, MMLU questions likely appear in many pretraining corpora — a contamination problem we explore in benchmark contamination. For tips on reading scores when saturation is a concern, see how to read LLM benchmark scores.

What MMLU does not measure

MMLU tests declarative knowledge — facts and concepts a model can recognise in a four-option list. It does not test:

Reasoning — picking the right answer from a list is easier than deriving it from first principles. For reasoning depth, see GPQA Diamond.
Generation quality — a model can ace MMLU while producing poorly written, incoherent prose.
Coding ability — no programming tasks appear in MMLU. For that, see SWE-bench.
Frontier difficulty — for tasks that remain hard for even the best models, MMLU is no longer the right tool.

How to use MMLU results today

MMLU is still useful as a baseline for general knowledge coverage, particularly for smaller or more specialised models that have not yet reached saturation. For frontier model comparisons, MMMLU is more informative because multilingual scores continue to vary significantly. For the hardest reasoning tasks, the community has moved on to GPQA Diamond and Humanity's Last Exam. The complete guide to LLM benchmarks explains how to combine these evals for a well-rounded picture.

Key takeaways

MMLU covers 57 academic subjects via 14,079 multiple-choice questions; it measures declarative knowledge, not reasoning or generation.
MMMLU translates the same questions into 14 languages, exposing multilingual gaps that English-only scores hide.
Frontier English MMLU scores have saturated above 90%; multilingual versions still differentiate models meaningfully.
Contamination is a known risk: MMLU questions likely appear in many pretraining datasets, inflating scores beyond true capability.
For harder evaluations, GPQA Diamond and Humanity's Last Exam have taken over as the frontier reasoning benchmarks.