
What Is AIME? Measuring LLM Math Reasoning

AIME is a competition-math benchmark that tests multi-step numerical reasoning. Learn how pass@1 vs sampling works, why frontier models still struggle, and what scores actually mean.


AIME — the American Invitational Mathematics Examination — has become one of the most demanding benchmarks for frontier language models, because every problem requires a genuine chain of multi-step deduction rather than recall.

What AIME is and why it matters for LLMs

AIME is a 15-problem, 3-hour competition exam administered annually by the Mathematical Association of America. It is the qualifying round between the easier AMC 10/12 and the elite USAMO. Each answer is an integer between 0 and 999, so there is no partial credit and no multiple-choice guessing — the model must compute the exact value. Problems span combinatorics, number theory, algebra, and geometry, often combining several of those areas in a single question.
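
The exact-match rule above can be sketched as a tiny grader. This is a minimal illustration, not any official scoring tool: the function names (`normalize_answer`, `grade`, `score_exam`) are hypothetical, and it assumes the model's final answer has already been extracted as a string.

```python
from typing import List, Optional


def normalize_answer(raw: str) -> Optional[int]:
    """Parse an extracted answer string; return None if it is not an integer in 0..999."""
    text = raw.strip()
    # Accept plain ASCII digits only, so "073" is fine but "73.5" or "-4" is not.
    if not (text.isascii() and text.isdigit()):
        return None
    value = int(text)
    return value if value <= 999 else None


def grade(raw: str, reference: int) -> bool:
    """Exact match against the reference answer: no partial credit."""
    return normalize_answer(raw) == reference


def score_exam(predictions: List[str], references: List[int]) -> int:
    """Number of problems answered exactly right."""
    return sum(grade(p, r) for p, r in zip(predictions, references))
```

Under these assumptions, `grade("073", 73)` is true because leading zeros do not change the integer, while a near miss such as `"12"` against a reference of 13 scores nothing.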

For language models this structure is valuable precisely because it is hard to shortcut. A model cannot recognize a surface pattern and pick the right letter; it must execute a sequence of algebraic or logical steps, each of which must be correct for the answer to be right. That sensitivity to reasoning chains makes AIME a strong signal for genuine mathematical capability rather than memorization.

pass@1 vs sampling: why the distinction matters

When you see an AIME score reported, the first question to ask is how many attempts the model got. The two common protocols are:

  • pass@1 (greedy or single sample): the model produces one answer and is judged on it. This reflects the experience of a user who runs the model once and trusts the output.
  • Consensus / majority voting (often written maj@k or cons@k): the model generates k independent answers and the most common one is submitted. With k = 64 or k = 256, scores can jump dramatically even without any improvement in underlying reasoning. This is not the same as pass@k, which counts a problem as solved if any one of the k samples is correct. For a full explanation of the math, see our pass@k explainer.

A model scoring 60% with majority voting over 256 samples may score only 35% pass@1. Both numbers are real, but they answer different questions. Scores reported under different protocols are not directly comparable, so always check the evaluation notes before drawing conclusions from the live benchmark comparison.
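
To make the difference between protocols concrete, here is a small sketch that scores the same set of sampled answers three ways. The function names and the sample data are illustrative assumptions, not part of any particular evaluation harness. The `pass_at_k` function uses the commonly cited unbiased estimator (the chance that at least one of k samples drawn from n is correct, given c correct samples), which is a different quantity from majority voting.

```python
from collections import Counter
from math import comb
from typing import List


def pass_at_1(samples: List[int], reference: int) -> float:
    """Expected single-sample accuracy: the fraction of samples that are correct."""
    return sum(s == reference for s in samples) / len(samples)


def majority_vote(samples: List[int]) -> int:
    """Most common answer among the samples (ties go to the answer seen first)."""
    return Counter(samples).most_common(1)[0][0]


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated probability that at least one of k samples is correct,
    given that c of n generated samples were correct."""
    if n - c < k:
        return 1.0  # too few wrong samples to fill all k slots
    return 1.0 - comb(n - c, k) / comb(n, k)


# Made-up answers for one problem whose reference answer is 204.
samples = [204, 204, 117, 204, 33, 117, 204, 90]
single = pass_at_1(samples, 204)            # half of the samples are right
voted_correct = majority_vote(samples) == 204
any_of_two = pass_at_k(n=8, c=4, k=2)
```

In this toy case only half of the individual samples are correct, yet the majority vote lands on the right answer, which is how a voted score can sit well above pass@1 on the same model.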

AIME 2024 and AIME 2025 as contamination checkpoints

Because AIME problems are released publicly each year, older exam years risk appearing in training data. Researchers now commonly report scores on AIME 2024 and AIME 2025 separately to give a sense of temporal robustness. A model that scores substantially higher on 2023 problems than on 2025 problems likely benefited from memorizing solutions that circulated online, although some of the gap can come from year-to-year differences in difficulty. The same contamination concern applies across many benchmarks; for context, see the benchmark contamination explainer.
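
One rough way to look for this pattern is to compare accuracy on an older exam year against a newer one. The sketch below is only an illustration: the helper names are hypothetical and the per-problem results are made-up placeholders, not measurements of any real model.

```python
from typing import Dict, List


def accuracy(results: List[bool]) -> float:
    """Fraction of problems answered correctly."""
    return sum(results) / len(results)


def year_gap(results_by_year: Dict[int, List[bool]], old_year: int, new_year: int) -> float:
    """Accuracy on the older year minus accuracy on the newer year."""
    return accuracy(results_by_year[old_year]) - accuracy(results_by_year[new_year])


# Placeholder per-problem outcomes (True = correct), for illustration only.
results_by_year = {
    2023: [True] * 12 + [False] * 3,
    2025: [True] * 9 + [False] * 6,
}
gap = year_gap(results_by_year, old_year=2023, new_year=2025)
```

A large positive gap is a reason to look closer rather than proof of contamination, since exam difficulty also varies from year to year.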

Frontier reasoning models from Anthropic, OpenAI, and Google all now report AIME 2025 scores as the primary signal. You can compare how Claude Opus 4.8 and GPT-5.5 stack up on reasoning benchmarks using our side-by-side comparison.

How AIME relates to GPQA and other reasoning evals

AIME sits firmly in the mathematical reasoning niche. For scientific reasoning that requires domain knowledge — biology, chemistry, physics — the comparable benchmark is GPQA Diamond, which we cover in depth in the GPQA explainer. AIME and GPQA together form a strong two-dimensional view of frontier reasoning: one tests pure symbolic manipulation, the other tests application of expert knowledge under ambiguity.

For a broader view of how math and reasoning scores fit into the full evaluation picture, start with the complete guide to LLM benchmarks. The best LLM for reasoning roundup ranks current models specifically on tasks like AIME.

What AIME scores do not tell you

High AIME scores are necessary but not sufficient for a model to be useful in applied math contexts. AIME problems are self-contained and have clean integer answers; real engineering or research math involves messy inputs, symbolic output, and the need to write correct code or proofs rather than just state a number. A model that aces AIME may still hallucinate when working through a multipage proof or implementing a numerical algorithm. Use AIME scores as a signal for reasoning depth, not as a guarantee of mathematical correctness in open-ended settings.

Key takeaways

  • AIME tests multi-step competition mathematics with exact integer answers — no partial credit, no guessing.
  • Always check whether a reported score is pass@1 or a consensus over many samples; the gap can be 20+ percentage points.
  • AIME 2025 is the current contamination-resistant reference; scores on older years may be inflated by memorized solutions.
  • Pair AIME with GPQA Diamond for a two-dimensional view of reasoning: symbolic manipulation plus knowledge-grounded inference.
  • High AIME scores signal strong reasoning chains but do not guarantee correctness in open-ended or applied mathematical tasks.
