What Is AA-LCR? Long-Context Reasoning Explained

AA-LCR (Artificial Analysis Long Context Reasoning) tests whether AI models can reason over massive inputs — not just retrieve text, but draw multi-step conclusions across a full context window.

A model advertising a 200,000-token context window sounds impressive, but the real question is what the model can do with all that context. AA-LCR was built specifically to answer that question.

What AA-LCR is

AA-LCR stands for Artificial Analysis Long Context Reasoning, a benchmark published by Artificial Analysis to evaluate whether language models can reason over large inputs rather than merely retrieve information from them. Each task presents the model with a long document or set of documents — often tens of thousands of tokens — and asks a question whose correct answer requires synthesising information from multiple, non-adjacent passages, not just locating a single relevant sentence.

Current model scores are visible on the AA-LCR benchmark page, and you can compare them to other evaluations on the live benchmark comparison table.

Retrieval vs. reasoning over long context

The critical distinction in AA-LCR is between two very different uses of a large context window:

  • Simple retrieval — the answer appears verbatim in one place in the document. A model can score well here by finding the relevant span and repeating it. Many "long-context" benchmarks are dominated by this pattern.
  • Multi-step reasoning — the answer requires combining evidence from several passages, resolving apparent contradictions, performing arithmetic across scattered figures, or tracing a chain of logic spread across the whole document.

AA-LCR tasks are constructed to require the second kind of processing. If a model reads the document in a shallow, retrieval-oriented way, it will produce plausible but wrong answers — a failure mode that standard short-context benchmarks cannot detect.
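The difference between the two modes can be illustrated with a toy example. The sketch below is not an AA-LCR task and not how the benchmark is scored; the passages, the question (total revenue across two quarters), and the helper names `retrieve_best_span` and `sum_revenue` are all hypothetical. It only shows that returning the single best-matching span cannot answer a question whose evidence sits in two separate passages.

```python
import re

# Hypothetical mini-document: the two figures needed sit in non-adjacent passages.
PASSAGES = [
    "The Q1 report lists revenue of 120 million dollars.",
    "Headcount grew steadily through the spring.",
    "A new office opened in Lisbon in May.",
    "The Q2 report lists revenue of 95 million dollars.",
]

def retrieve_best_span(passages, keywords):
    """Retrieval-style answer: return the single passage with the most keyword hits."""
    def hits(passage):
        words = set(re.findall(r"[a-z0-9]+", passage.lower()))
        return sum(k in words for k in keywords)
    # max() keeps the first passage among ties, so only one quarter comes back.
    return max(passages, key=hits)

def sum_revenue(passages):
    """Reasoning-style answer: gather every revenue figure, then add them."""
    figures = [
        int(m.group(1))
        for p in passages
        for m in re.finditer(r"revenue of (\d+) million", p)
    ]
    return sum(figures)

span = retrieve_best_span(PASSAGES, ["revenue", "q1", "q2"])
total = sum_revenue(PASSAGES)
print(span)   # the Q1 sentence only
print(total)  # 215
```

The retrieved span looks relevant and mentions a revenue figure, which is exactly why a shallow reading yields a plausible but wrong answer: the correct total appears nowhere in the text and has to be computed from both passages.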

Why long-context reasoning is hard

Several factors make genuine long-context reasoning difficult for current models:

  • Attention dilution — transformers distribute attention across all tokens; in very long sequences, relevant tokens can receive too little weight relative to surrounding noise.
  • Lost-in-the-middle effect — empirically, models recall information near the beginning and end of a context more reliably than information buried in the middle, even when the context fits within the window.
  • Multi-hop inference — combining fact A from page 3 with fact B from page 47 to produce conclusion C requires holding intermediate results in working memory across a long decoding chain.
  • Hallucination under uncertainty — when genuine evidence is ambiguous or scattered, the pressure to produce a fluent answer pushes models toward confabulation.
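Attention dilution can be made concrete with a little arithmetic. The sketch below is a deliberately simplified model, assuming a single softmax over one relevant token and `context_len - 1` equally scored distractors; real transformers use many heads and learned scores, so the numbers are illustrative only. The function name `relevant_token_weight` and the `logit_gap` value of 5.0 are assumptions chosen for the example.

```python
import math

def relevant_token_weight(context_len, logit_gap):
    """Softmax weight on one relevant token whose score exceeds
    every other token's score by `logit_gap` (all other tokens tied)."""
    boosted = math.exp(logit_gap)
    # The denominator grows linearly with the number of distractor tokens.
    return boosted / (boosted + (context_len - 1))

for n in (1_000, 10_000, 100_000):
    weight = relevant_token_weight(n, logit_gap=5.0)
    print(f"{n:>7} tokens: weight on relevant token = {weight:.4f}")
```

Under these assumptions the weight on the relevant token falls from about 13% at 1,000 tokens to about 0.15% at 100,000 tokens, even though its score advantage over each individual distractor never changes.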

How AA-LCR scores should be interpreted

Scores depend on the evaluation setup. When reading AA-LCR results, check whether they were obtained with or without extended thinking / chain-of-thought prompting, and whether the model was allowed to use tools such as a code interpreter for calculation steps. Either condition can raise scores substantially, so comparisons are fair only between results obtained under matching conditions.
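One way to keep comparisons within matching conditions is to group results by setup before ranking them. The following is a minimal sketch: the `Result` fields, the model names, and the scores are all made up for illustration and do not come from the AA-LCR leaderboard or any Artificial Analysis API.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Result:
    model: str       # placeholder name, not a real model
    score: float     # made-up percentage, for illustration only
    thinking: bool   # extended thinking / chain-of-thought enabled?
    tools: bool      # tools such as a code interpreter allowed?

def rank_within_conditions(results):
    """Group results by (thinking, tools) and sort each group best-first."""
    groups = defaultdict(list)
    for r in results:
        groups[(r.thinking, r.tools)].append(r)
    return {
        condition: sorted(group, key=lambda r: r.score, reverse=True)
        for condition, group in groups.items()
    }

RESULTS = [
    Result("model-a", 61.0, thinking=True, tools=False),
    Result("model-b", 48.0, thinking=False, tools=False),
    Result("model-c", 55.0, thinking=True, tools=False),
    Result("model-d", 52.0, thinking=False, tools=False),
]

ranked = rank_within_conditions(RESULTS)
for (thinking, tools), group in ranked.items():
    print(f"thinking={thinking}, tools={tools}:", [r.model for r in group])
```

In this invented data, model-c outscores model-d, but the two were run under different conditions, so the grouping keeps them in separate rankings instead of implying that one is stronger than the other.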

For a deeper treatment of how to interpret any benchmark number without being misled, see how to read LLM benchmark scores. For the full picture of how long-context evaluation fits into the wider benchmark landscape, see the complete guide to LLM benchmarks. AA-LCR is complementary to agentic benchmarks like BrowseComp, which tests whether a model can find information across the web rather than reason over information already in context.

Key takeaways

  • AA-LCR measures genuine reasoning over large inputs, not just the ability to locate a relevant passage in a long document.
  • The retrieval-vs-reasoning distinction is the benchmark's core insight: a large context window is only useful if the model can think across it.
  • Lost-in-the-middle effects and attention dilution are real failure modes that AA-LCR is specifically designed to expose.
  • Always check whether scores were produced with chain-of-thought or tools enabled before comparing models.
  • Strong AA-LCR performance suggests a model will be useful for document analysis, legal review, code auditing, and other tasks where the relevant information is spread across a long input.
