CRUX-O
CRUXEval-O (output prediction) is the output-prediction task of the CRUXEval benchmark, which consists of 800 Python functions (3-13 lines each) designed to evaluate AI models' capabilities in code reasoning, understanding, and execution. The task tests a model's ability to predict the correct output of a function given its code and an input, focusing on short problems that a good human programmer should be able to solve in about a minute.
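To make the task concrete, the following is a minimal sketch of what an output-prediction item and its scoring could look like. The function `f`, the helper names `is_correct` and `score`, and the sample predictions are all hypothetical illustrations, not items or code from the benchmark itself. The sketch assumes the model answers with a Python literal and that the score is the percentage of exact matches.

```python
import ast


def f(text):
    # Hypothetical short function in the spirit of the benchmark
    # (not an actual CRUXEval item): keep letters, uppercased.
    result = []
    for ch in text:
        if ch.isalpha():
            result.append(ch.upper())
    return "".join(result)


def is_correct(func, args, predicted_literal):
    """Return True if the predicted output literal equals the real output.

    Assumption: the model's answer is a Python literal given as a string.
    """
    try:
        predicted = ast.literal_eval(predicted_literal)
    except (ValueError, SyntaxError):
        # An answer that is not a valid literal counts as wrong.
        return False
    return func(*args) == predicted


def score(results):
    """Percentage of correct predictions, on a 0-100 scale."""
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)


# Two hypothetical model answers for the input "a1b".
predictions = ["'AB'", "'ab'"]
results = [is_correct(f, ("a1b",), p) for p in predictions]
print(score(results))
```

Here the first prediction matches the real output and the second does not, so half of the answers are counted as correct. Comparing parsed values rather than raw strings means `'AB'` and `"AB"` are treated as the same answer.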
Methodology
Imported from llm-stats public benchmark metadata. Modality: text. Max score: 100. Categories: reasoning. Language: en. Verified by llm-stats: no.