
What Is CharXiv? Visual and Chart Reasoning Explained

CharXiv tests AI models on understanding and reasoning over real scientific charts from arXiv papers. Learn how it separates shallow chart reading from genuine multi-step visual reasoning.


Researchers, analysts, and engineers spend large portions of their working day reading charts. CharXiv asks whether today's multimodal AI models can do the same — and the answer is more nuanced than leaderboards often suggest.

What CharXiv is

CharXiv is a chart-understanding benchmark built from real figures sourced from arXiv scientific papers. Instead of using synthetic or simplified charts, the benchmark pulls the kinds of complex, dense visualisations that appear in machine learning, physics, biology, and economics research — multi-panel figures, log-scale plots, heatmaps, and annotated bar charts with overlapping series.

Each task pairs a chart image with a question that requires understanding what the chart shows. Questions range from simple value extraction ("what is the peak accuracy on the validation set?") to multi-step reasoning ("across which conditions does method A outperform method B, and by approximately how much on average?").

Current model scores are on the CharXiv benchmark page. Compare them alongside other evaluations in the live benchmark comparison table.

Descriptive vs. reasoning questions

CharXiv splits its questions into two explicit categories, which is one of its most important design choices:

  • Descriptive questions — ask the model to extract a fact directly visible in the chart: a labelled value, a series name, a tick mark. These test whether a model can read a chart accurately at a surface level.
  • Reasoning questions — require the model to compare, compute, or infer across multiple elements of the chart. The answer cannot be read off a single label; it must be derived through a chain of visual and quantitative reasoning steps.

The gap between descriptive and reasoning scores for a given model is itself revealing. A model that scores well on descriptive questions but poorly on reasoning questions can read charts but cannot think with them — a significant limitation for any analytical use case.
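To make the gap concrete, here is a minimal sketch that computes per-category accuracy and the difference between the two. The record layout (a category name plus a correct/incorrect flag) and the sample results are hypothetical, invented for illustration; they are not CharXiv's actual data format or any model's real scores.

```python
from collections import defaultdict


def accuracy_by_category(results):
    """Return {category: fraction correct} for a list of (category, correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, correct in results:
        totals[category] += 1
        hits[category] += int(correct)
    return {cat: hits[cat] / totals[cat] for cat in totals}


# Hypothetical graded answers: (question category, was the answer correct?)
sample_results = [
    ("descriptive", True), ("descriptive", True),
    ("descriptive", True), ("descriptive", False),
    ("reasoning", True), ("reasoning", False),
    ("reasoning", False), ("reasoning", False),
]

scores = accuracy_by_category(sample_results)
# A large positive gap means the model reads charts better than it reasons over them.
gap = scores["descriptive"] - scores["reasoning"]
print(f"descriptive={scores['descriptive']:.2f} "
      f"reasoning={scores['reasoning']:.2f} gap={gap:.2f}")
```

Reporting the two accuracies separately, rather than one blended number, is what keeps this gap visible.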

With-tools vs. without-tools conditions

Some CharXiv evaluations are run in a with-tools condition where the model can invoke a code interpreter or calculation tool alongside the chart. This matters because some reasoning questions require precise arithmetic across values that are hard to estimate visually. Allowing tool use typically raises scores substantially on quantitative reasoning sub-tasks.
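As an illustration of the kind of arithmetic involved, the sketch below shows what a model with a code tool might run after reading values off a chart. The method names, conditions, and numbers are all hypothetical; the point is only that comparing series and averaging the margins is easy to get right in code and easy to get wrong by eye.

```python
# Hypothetical values read off a chart: accuracy (%) per condition for two methods.
method_a = {"low noise": 91.2, "medium noise": 84.5, "high noise": 70.1}
method_b = {"low noise": 89.0, "medium noise": 86.0, "high noise": 64.3}


def compare_methods(a, b):
    """Return the conditions where a beats b and the mean margin over those conditions."""
    margins = {cond: a[cond] - b[cond] for cond in a if a[cond] > b[cond]}
    mean_margin = sum(margins.values()) / len(margins) if margins else 0.0
    return sorted(margins), mean_margin


wins, mean_margin = compare_methods(method_a, method_b)
print(f"A outperforms B on: {wins}")
print(f"Average margin on those conditions: {mean_margin:.1f} points")
```

The tool removes arithmetic error, not reading error: if the values are misread from the chart in the first place, the computed answer is still wrong.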

When comparing CharXiv results across models, always confirm that both were evaluated under the same condition, either both with tools or both without. Setting a with-tools score against a no-tools score is not a fair comparison. For more on how evaluation conditions affect interpretability, see how to read LLM benchmark scores.
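One way to enforce this in your own analysis is to carry the condition with every score and refuse to compare across a mismatch. The sketch below assumes a hypothetical `ReportedScore` record; the model names and scores are placeholders, not real results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedScore:
    """A hypothetical record for one reported benchmark result."""
    model: str
    score: float
    with_tools: bool


def score_difference(a, b):
    """Return a.score - b.score, refusing to compare across tool conditions."""
    if a.with_tools != b.with_tools:
        raise ValueError(
            f"{a.model} and {b.model} were evaluated under different tool conditions"
        )
    return a.score - b.score


# Placeholder entries for illustration only.
model_x = ReportedScore("model-x", 60.0, with_tools=True)
model_y = ReportedScore("model-y", 55.0, with_tools=True)
model_z = ReportedScore("model-z", 50.0, with_tools=False)

print(score_difference(model_x, model_y))  # same condition, so this is allowed

try:
    score_difference(model_x, model_z)
except ValueError as err:
    print(f"Comparison rejected: {err}")
```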

Why real arXiv charts are harder than synthetic ones

Synthetic chart benchmarks tend to use clean data, standard colour schemes, and simple layouts, so scores on them tend to overestimate how well a model handles real-world figures. arXiv charts are harder for several reasons:

  • Visual complexity — dense legends, overlapping lines, small fonts, and non-standard colour palettes all increase the difficulty of accurately parsing the figure.
  • Domain specificity — labels and axis names reference specialised terminology that requires domain knowledge to interpret correctly.
  • Ambiguity — real charts often have imprecise gridlines, partially occluded labels, or logarithmic scales that make exact value extraction genuinely uncertain.
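The logarithmic-scale point can be shown with a small worked example. Suppose a data point sits exactly halfway between the tick marks for 10 and 100. The sketch below, using hypothetical helper functions, compares the value you get by treating the axis as linear with the value you get by interpolating in log space, which is the correct reading on a base-10 log axis.

```python
import math


def value_on_linear_axis(frac, low, high):
    """Interpolate a value at fractional position frac between two linear ticks."""
    return low + frac * (high - low)


def value_on_log_axis(frac, low, high):
    """Interpolate at fractional position frac between two ticks on a base-10 log axis."""
    log_low, log_high = math.log10(low), math.log10(high)
    return 10 ** (log_low + frac * (log_high - log_low))


frac = 0.5  # the point sits halfway between the two ticks
linear_reading = value_on_linear_axis(frac, 10, 100)
log_reading = value_on_log_axis(frac, 10, 100)

print(f"read as linear: {linear_reading:.1f}")  # 55.0
print(f"read as log:    {log_reading:.1f}")     # 31.6
```

A reader, human or model, who misses the log scale is off by a wide margin here, and even a small error in judging the position shifts the answer noticeably.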

Where CharXiv fits in the broader evaluation landscape

CharXiv is primarily a multimodal benchmark, sitting alongside OSWorld-Verified as a test of visual understanding. The key difference is that CharXiv tests comprehension and reasoning over a static image, while OSWorld-Verified tests a model's ability to interact with a live visual interface over multiple steps.

For multi-step reasoning over text rather than images, a complementary benchmark is AA-LCR, which tests long-context reasoning over large documents. Together, CharXiv and AA-LCR give a picture of how well a model reasons across complex visual and textual inputs. For the full context of how these benchmarks relate, see the complete guide to LLM benchmarks.

Key takeaways

  • CharXiv tests chart understanding using real figures from arXiv papers, making it substantially harder than benchmarks built on synthetic charts.
  • The descriptive/reasoning split exposes the difference between a model that can read a chart and one that can actually reason with it.
  • With-tools conditions allow code execution for precise arithmetic, and scores are only comparable within matching conditions.
  • Real arXiv charts introduce visual complexity, domain-specific terminology, and measurement ambiguity that synthetic benchmarks do not capture.
  • Strong CharXiv performance is a useful signal, though not a guarantee, of usefulness in data analysis, scientific literature review, and any task where understanding charts is central.
