CRUXEval-Output-CoT

CRUXEval-O (output prediction) evaluated with chain-of-thought prompting. It is part of the CRUXEval benchmark, which consists of 800 Python functions (3-13 lines each) designed to evaluate code reasoning, understanding, and execution. In the output prediction task, the model is given a Python function and a specific input and must predict the function's output. In this variant, the model reasons step by step before giving its answer.
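To make the task concrete, here is a minimal sketch of what an output-prediction item and its check could look like. The function `f`, the input `ITEM_INPUT`, and the helpers `is_correct` and `accuracy` are hypothetical illustrations and are not taken from the benchmark or its official harness. The sketch assumes predictions arrive as Python literals and are scored by exact match against the real output.

```python
import ast

# Hypothetical item in the style of the output-prediction task
# (not an actual benchmark problem).
def f(text):
    digits = []
    for ch in text:
        if ch.isdigit():
            digits.append(ch)
    return "".join(digits[::-1])

ITEM_INPUT = "a1b2c3"

def is_correct(predicted_literal: str) -> bool:
    """Return True if the predicted output literal equals f(ITEM_INPUT)."""
    try:
        # Assumption: the model's final answer is a Python literal.
        predicted = ast.literal_eval(predicted_literal)
    except (ValueError, SyntaxError):
        return False  # an unparseable answer counts as wrong
    return predicted == f(ITEM_INPUT)

def accuracy(predictions: list[str]) -> float:
    """Fraction of correct predictions, on a 0-to-1 scale."""
    return sum(is_correct(p) for p in predictions) / len(predictions)

print(is_correct("'321'"))           # correct prediction
print(accuracy(["'321'", "'123'"]))  # one of two predictions is right
```

With chain-of-thought prompting, the model would first trace the loop (collecting the digits, then reversing them) and only then state the final literal that is checked here.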

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: reasoning. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Qwen2.5-Coder 7B Instruct: 56.0% (self-reported, llm-stats)