Vending-Bench 2

reasoning

Vending-Bench 2 tests long-horizon planning by evaluating how well AI models manage a simulated vending machine business over an extended period. The benchmark measures a model's ability to keep its tool usage and decision-making consistent across a full simulated year of operation, increasing returns without drifting off task.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: agents, reasoning. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Claude Opus 4.6 (self-reported, llm-stats): 8,017.59
  2. GLM-5.1 (self-reported, llm-stats): 5,634.41
  3. Gemini 3 Pro (self-reported, llm-stats): 5,478.16
  4. Gemini 3 Flash (self-reported, llm-stats): 3,635
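
For readers who want to compare the entries, the following is a minimal sketch that parses the leaderboard scores above and expresses each one as a share of the top score. The helper names (parse_score, relative_to_leader) are hypothetical and chosen only for illustration. The score strings are copied verbatim from the leaderboard, and the source does not state their units.

```python
# Scores copied verbatim from the leaderboard above.
# Assumption: all four values share the same (unstated) unit.
RAW_SCORES = {
    "Claude Opus 4.6": "8,017.59",
    "GLM-5.1": "5,634.41",
    "Gemini 3 Pro": "5,478.16",
    "Gemini 3 Flash": "3,635",
}


def parse_score(text: str) -> float:
    """Convert a score string with thousands separators into a float."""
    return float(text.replace(",", ""))


def relative_to_leader(raw: dict[str, str]) -> dict[str, float]:
    """Return each model's score divided by the highest score."""
    scores = {model: parse_score(value) for model, value in raw.items()}
    leader = max(scores.values())
    return {model: value / leader for model, value in scores.items()}


if __name__ == "__main__":
    shares = relative_to_leader(RAW_SCORES)
    # Print models from highest to lowest share of the top score.
    for model, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{model}: {share:.1%} of the top score")
```

Since every listed result is self-reported and not verified by llm-stats, ratios like these are best read as rough comparisons.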