CRUX-O

reasoning

CRUXEval-O is the output-prediction task of the CRUXEval benchmark, which consists of 800 Python functions, each 3 to 13 lines long, designed to evaluate AI models' capabilities in code reasoning, understanding, and execution. Given a function's code and an input, the model must predict the output the function returns. The problems are kept short enough that a good human programmer should be able to solve each one in about a minute.
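To make the task concrete, here is a minimal sketch of how a single output prediction could be checked. The function `f`, the input, and the helper `is_prediction_correct` are hypothetical illustrations written for this page, not an actual benchmark item or the benchmark's own scoring code; the sketch assumes a prediction arrives as a Python literal in string form.

```python
import ast


def f(text):
    # Hypothetical short function in the style of the benchmark (not a real item).
    result = []
    for ch in text:
        if ch.isupper():
            result.append(ch.lower())
        else:
            result.append(ch.upper())
    return "".join(result)


def is_prediction_correct(func, args, predicted_literal):
    """Return True if the predicted literal equals the function's real output.

    Assumption: the model's answer is a Python literal given as a string,
    e.g. "'AbC'". Unparseable answers count as incorrect.
    """
    try:
        predicted = ast.literal_eval(predicted_literal)
    except (ValueError, SyntaxError):
        return False
    return func(*args) == predicted


# The model is shown the code of f and the input "aBc", and must predict the output.
print(is_prediction_correct(f, ("aBc",), "'AbC'"))  # a correct prediction
print(is_prediction_correct(f, ("aBc",), "'ABC'"))  # an incorrect prediction
```

Executing the function and comparing the result with the parsed prediction keeps the check exact: a prediction counts only if it matches the real output value, not merely if it looks similar as text.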

Leaderboard

Qwen3 235B A22B: 79.0%
