Vending-Bench 2

reasoning

Vending-Bench 2 tests long-horizon planning by evaluating how well AI models manage a simulated vending machine business over an extended period. The benchmark measures a model's ability to sustain consistent tool use and decision-making across a full simulated year of operation, driving higher returns without drifting off task.

Leaderboard

Model             Score
Claude Opus 4.6   8,017.59
GLM-5.1           5,634.41
Gemini 3 Pro      5,478.16
Gemini 3 Flash    3,635