<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>LLM Boss — Blog</title>
    <link>https://llm-boss.com/blog</link>
    <atom:link href="https://llm-boss.com/blog/feed.xml" rel="self" type="application/rss+xml" />
    <description>Guides, benchmark explainers and model comparisons from LLM Boss.</description>
    <language>en-us</language>
    <lastBuildDate>Thu, 28 May 2026 20:05:38 GMT</lastBuildDate>
    <item>
      <title>The Complete Guide to LLM Benchmarks</title>
      <link>https://llm-boss.com/blog/llm-benchmarks-complete-guide</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/llm-benchmarks-complete-guide</guid>
      <description>What LLM benchmarks measure, the categories that matter, how to read the scores without being misled, and how to choose a model. The pillar guide to evaluating large language models.</description>
      <pubDate>Thu, 28 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>How to Evaluate an LLM for Your Own Use Case</title>
      <link>https://llm-boss.com/blog/how-to-evaluate-an-llm</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/how-to-evaluate-an-llm</guid>
      <description>A practical guide to evaluating LLMs for your specific task: define requirements, pick matching benchmarks, build a private eval set, and weigh cost against quality.</description>
      <pubDate>Wed, 27 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>How to Choose an LLM: A Benchmark-Driven Framework</title>
      <link>https://llm-boss.com/blog/how-to-choose-an-llm</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/how-to-choose-an-llm</guid>
      <description>A practical decision framework for choosing the right LLM. Maps real-world use cases to the benchmarks that predict them, with scores for Opus 4.8, GPT-5.5, and Gemini 3.1 Pro.</description>
      <pubDate>Tue, 26 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Why LLM Benchmarks Saturate (and What Comes Next)</title>
      <link>https://llm-boss.com/blog/why-benchmarks-saturate</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/why-benchmarks-saturate</guid>
      <description>LLM benchmarks saturate when top models score near the ceiling and the test no longer distinguishes between them. Learn why this happens, what it means, and which harder benchmarks are taking over.</description>
      <pubDate>Mon, 25 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>What Is MMLU-Pro? A Harder Knowledge Benchmark</title>
      <link>https://llm-boss.com/blog/what-is-mmlu-pro</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/what-is-mmlu-pro</guid>
      <description>MMLU-Pro extends the original MMLU with 10 answer choices and harder reasoning questions to combat saturation. Learn how it differs from MMLU and GPQA, and what scores mean today.</description>
      <pubDate>Mon, 25 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>What Is Humanity’s Last Exam? The Frontier Reasoning Benchmark</title>
      <link>https://llm-boss.com/blog/what-is-humanitys-last-exam</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/what-is-humanitys-last-exam</guid>
      <description>Humanity’s Last Exam is a frontier reasoning benchmark.</description>
      <pubDate>Mon, 25 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>What Is CharXiv? Visual and Chart Reasoning Explained</title>
      <link>https://llm-boss.com/blog/what-is-charxiv</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/what-is-charxiv</guid>
      <description>CharXiv tests AI models on understanding and reasoning over real scientific charts from arXiv papers. Learn how it separates shallow chart reading from genuine multi-step visual reasoning.</description>
      <pubDate>Mon, 25 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>The Best LLM for Reasoning in 2026 (Benchmarked)</title>
      <link>https://llm-boss.com/blog/best-llm-for-reasoning</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/best-llm-for-reasoning</guid>
      <description>Which LLM reasons best in 2026? We compare Claude Opus 4.8, GPT-5.5, and Gemini 3.1 Pro on GPQA Diamond and Humanity’s Last Exam.</description>
      <pubDate>Mon, 25 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>The Best LLM for Coding in 2026 (Benchmarked)</title>
      <link>https://llm-boss.com/blog/best-llm-for-coding</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/best-llm-for-coding</guid>
      <description>Which LLM is best for coding in 2026? We compare Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro, and Mythos Preview on SWE-bench Pro, SWE-bench Verified, and Terminal-Bench.</description>
      <pubDate>Sun, 24 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>What Is OSWorld-Verified? Computer-Use Agents Explained</title>
      <link>https://llm-boss.com/blog/what-is-osworld</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/what-is-osworld</guid>
      <description>OSWorld-Verified tests AI agents on completing real desktop tasks — navigating GUIs, using apps, and interpreting screenshots. Learn the Verified methodology and why computer use is uniquely challenging.</description>
      <pubDate>Sat, 23 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>What Is MMLU and MMMLU? LLM Knowledge Benchmarks Explained</title>
      <link>https://llm-boss.com/blog/what-is-mmlu</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/what-is-mmlu</guid>
      <description>MMLU tests LLM knowledge across 57 academic subjects via multiple-choice questions. MMMLU extends this to 14+ languages. Learn what they measure, their limitations, and why saturation matters.</description>
      <pubDate>Sat, 23 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Gemini 3.1 Pro Benchmarks: A Full Breakdown</title>
      <link>https://llm-boss.com/blog/gemini-3-1-pro-benchmarks</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/gemini-3-1-pro-benchmarks</guid>
      <description>A complete breakdown of Gemini 3.1 Pro benchmark scores across web search, multilingual knowledge, reasoning, and coding. See where Google DeepMind’s flagship leads.</description>
      <pubDate>Sat, 23 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Claude Opus 4.8 vs Gemini 3.1 Pro: Which Wins?</title>
      <link>https://llm-boss.com/blog/claude-vs-gemini</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/claude-vs-gemini</guid>
      <description>Benchmark comparison of Claude Opus 4.8 and Gemini 3.1 Pro across coding, reasoning, knowledge, and web browsing. See where each model leads and which to choose.</description>
      <pubDate>Sat, 23 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Agentic Evals: How We Benchmark Tool-Using LLMs</title>
      <link>https://llm-boss.com/blog/agentic-evals-explained</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/agentic-evals-explained</guid>
      <description>Agentic evaluations measure multi-step, tool-using LLMs in live environments — not static QA. Learn how harnesses, environments and scoring differ from traditional benchmarks.</description>
      <pubDate>Sat, 23 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>What Is Terminal-Bench? Benchmarking Agents in the Shell</title>
      <link>https://llm-boss.com/blog/what-is-terminal-bench</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/what-is-terminal-bench</guid>
      <description>Terminal-Bench evaluates AI agents on real command-line tasks inside a live shell environment. Learn what it measures, why it tests long-horizon agentic behavior, and how setup affects scores.</description>
      <pubDate>Fri, 22 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>What Is LiveCodeBench? Contamination-Free Coding Evals</title>
      <link>https://llm-boss.com/blog/what-is-livecodebench</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/what-is-livecodebench</guid>
      <description>LiveCodeBench uses a rolling time window of competitive programming problems to prevent training-data leakage. Learn how it works, why contamination matters, and how it compares to SWE-bench.</description>
      <pubDate>Fri, 22 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>What Is AA-LCR? Long-Context Reasoning Explained</title>
      <link>https://llm-boss.com/blog/what-is-aa-lcr</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/what-is-aa-lcr</guid>
      <description>AA-LCR (Artificial Analysis Long Context Reasoning) tests whether AI models can reason over massive inputs — not just retrieve text, but draw multi-step conclusions across a full context window.</description>
      <pubDate>Fri, 22 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>pass@k, maj@k and Sampling: LLM Eval Metrics Explained</title>
      <link>https://llm-boss.com/blog/pass-at-k-explained</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/pass-at-k-explained</guid>
      <description>Understand pass@1, pass@k, and maj@k: what each metric measures, how temperature and sampling affect them, and what the choice of metric reveals about real-world performance.</description>
      <pubDate>Fri, 22 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Claude Opus 4.8 vs GPT-5.5: A Benchmark Comparison</title>
      <link>https://llm-boss.com/blog/opus-4-8-vs-gpt-5-5</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/opus-4-8-vs-gpt-5-5</guid>
      <description>Head-to-head benchmark comparison of Claude Opus 4.8 and GPT-5.5 across coding, terminal use, reasoning, and long context. Find out which model wins and when.</description>
      <pubDate>Fri, 22 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>The Best LLM for Multilingual Tasks in 2026</title>
      <link>https://llm-boss.com/blog/best-llm-for-multilingual</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/best-llm-for-multilingual</guid>
      <description>Which LLM handles multilingual tasks best in 2026? We compare Gemini 3.1 Pro, Claude Opus 4.7, and GPT-5.5 on MMMLU, the leading multilingual benchmark.</description>
      <pubDate>Fri, 22 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>What Is MCP-Atlas? Scaled Tool Use Explained</title>
      <link>https://llm-boss.com/blog/what-is-mcp-atlas</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/what-is-mcp-atlas</guid>
      <description>MCP-Atlas benchmarks AI models on orchestrating many tools via the Model Context Protocol across long, multi-step workflows. Learn what tool selection, chaining, and error recovery reveal about a model.</description>
      <pubDate>Thu, 21 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>What Is GPQA Diamond? Graduate-Level Reasoning Explained</title>
      <link>https://llm-boss.com/blog/what-is-gpqa</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/what-is-gpqa</guid>
      <description>GPQA Diamond tests LLMs on expert-written graduate-level science questions that even domain PhDs struggle with. Learn what the benchmark measures and why scores are nearing saturation.</description>
      <pubDate>Thu, 21 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>GPT-5.5 Benchmarks: A Full Breakdown</title>
      <link>https://llm-boss.com/blog/gpt-5-5-benchmarks</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/gpt-5-5-benchmarks</guid>
      <description>A complete breakdown of GPT-5.5 benchmark scores across coding, terminal use, long-context retrieval, and reasoning. See where OpenAI’s flagship leads the field.</description>
      <pubDate>Thu, 21 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Benchmark Contamination: Why LLM Scores Can Lie</title>
      <link>https://llm-boss.com/blog/benchmark-contamination</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/benchmark-contamination</guid>
      <description>Benchmark contamination happens when training data includes test questions. Learn how memorization inflates scores, how labs detect it, and why private benchmarks matter.</description>
      <pubDate>Thu, 21 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>What Is SWE-bench? Agentic Coding Benchmarks Explained</title>
      <link>https://llm-boss.com/blog/what-is-swe-bench</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/what-is-swe-bench</guid>
      <description>SWE-bench measures whether AI agents can resolve real GitHub issues. Learn how Verified, Pro, and Multilingual variants work and why SWE-bench is the standard coding eval.</description>
      <pubDate>Wed, 20 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>What Is BrowseComp? Measuring Agentic Web Search</title>
      <link>https://llm-boss.com/blog/what-is-browsecomp</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/what-is-browsecomp</guid>
      <description>BrowseComp tests whether AI agents can find hard-to-locate facts buried deep on the open web using multi-step browsing. Learn how the benchmark works and why it matters.</description>
      <pubDate>Wed, 20 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>How to Read LLM Benchmark Scores Without Being Fooled</title>
      <link>https://llm-boss.com/blog/how-to-read-llm-benchmark-scores</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/how-to-read-llm-benchmark-scores</guid>
      <description>Learn to read LLM benchmark scores critically: check subsets, tool access, trial counts, saturation and error bars before trusting any leaderboard number.</description>
      <pubDate>Wed, 20 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>How LLM Leaderboards Work: Elo, Arenas and Pitfalls</title>
      <link>https://llm-boss.com/blog/how-llm-leaderboards-work</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/how-llm-leaderboards-work</guid>
      <description>LLM leaderboards use Elo ratings, human preference arenas, and automated evals to rank models. Learn how Elo and Bradley-Terry work, why rankings shift, and what pitfalls to watch for.</description>
      <pubDate>Wed, 20 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Claude Opus 4.8 Benchmarks: A Full Breakdown</title>
      <link>https://llm-boss.com/blog/claude-opus-4-8-benchmarks</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/claude-opus-4-8-benchmarks</guid>
      <description>A full breakdown of Claude Opus 4.8 benchmark scores across coding, agentic tasks, reasoning, and tool use. See where Anthropic’s flagship leads and where it trails.</description>
      <pubDate>Wed, 20 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>The Best LLM for Long-Context Tasks in 2026</title>
      <link>https://llm-boss.com/blog/best-llm-for-long-context</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/best-llm-for-long-context</guid>
      <description>Which LLM handles long-context tasks best in 2026? We compare GPT-5.5, Claude Opus 4.8, and Claude Opus 4.7 on AA-LCR, the leading long-context reasoning benchmark.</description>
      <pubDate>Wed, 20 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>What Is HumanEval? The Classic Code-Generation Benchmark</title>
      <link>https://llm-boss.com/blog/what-is-humaneval</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/what-is-humaneval</guid>
      <description>HumanEval is OpenAI’s classic code-generation benchmark.</description>
      <pubDate>Tue, 19 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>LLM-as-Judge: How Models Grade Models</title>
      <link>https://llm-boss.com/blog/llm-as-judge-explained</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/llm-as-judge-explained</guid>
      <description>LLM-as-judge uses a powerful model to score open-ended outputs against a rubric. Learn how it works, where it is reliable, and the biases you must watch for.</description>
      <pubDate>Mon, 18 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>The Best LLM for Tool Use and Function Calling in 2026</title>
      <link>https://llm-boss.com/blog/best-llm-for-tool-use</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/best-llm-for-tool-use</guid>
      <description>Which LLM handles tool use and function calling best in 2026? We rank Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro, and Opus 4.7 on MCP-Atlas, the gold-standard tool-use benchmark.</description>
      <pubDate>Sun, 17 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>What Is AIME? Measuring LLM Math Reasoning</title>
      <link>https://llm-boss.com/blog/what-is-aime</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/what-is-aime</guid>
      <description>AIME is a competition-math benchmark that tests multi-step numerical reasoning. Learn how pass@1 vs sampling works, why frontier models still struggle, and what scores actually mean.</description>
      <pubDate>Sat, 16 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>What Is an LLM Agent? Tools, Planning and Evaluation</title>
      <link>https://llm-boss.com/blog/what-is-an-llm-agent</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/what-is-an-llm-agent</guid>
      <description>LLM agents go beyond chat: they use tools, plan multi-step tasks, and act in real environments. Learn what defines an agent, how planning works, and how agents are evaluated.</description>
      <pubDate>Fri, 15 May 2026 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>The Best LLM for AI Agents in 2026</title>
      <link>https://llm-boss.com/blog/best-llm-for-agents</link>
      <guid isPermaLink="true">https://llm-boss.com/blog/best-llm-for-agents</guid>
      <description>Which LLM performs best for AI agents in 2026? We compare Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro, and Mythos Preview on MCP-Atlas, OSWorld-Verified, and Terminal-Bench.</description>
      <pubDate>Fri, 15 May 2026 00:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>
