KernelBench Hard

coding

KernelBench Hard evaluates agentic GPU kernel optimization on the hardest problem set. Each problem is scored as the throughput (in TFLOPS) of the agent's submitted operator relative to the theoretical peak of the hardware it runs on, and the benchmark score is the average of these ratios across all problems.
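As a rough illustration of the scoring rule described above, the following Python sketch computes a per-problem ratio and averages it. The function names, the input values, and the single shared peak figure are assumptions made for the example; the benchmark's actual harness, its handling of failed or incorrect kernels, and how it determines the peak are not specified here.

```python
from statistics import mean


def problem_score(achieved_tflops: float, peak_tflops: float) -> float:
    """Fraction of theoretical peak reached on one problem (hypothetical helper)."""
    if peak_tflops <= 0:
        raise ValueError("peak_tflops must be positive")
    return achieved_tflops / peak_tflops


def benchmark_score(achieved_tflops: list[float], peak_tflops: float) -> float:
    """Unweighted mean of the per-problem ratios (assumes one device for all problems)."""
    if not achieved_tflops:
        raise ValueError("at least one problem result is required")
    return mean([problem_score(a, peak_tflops) for a in achieved_tflops])


# Illustrative numbers only, not real measurements.
score = benchmark_score([250.0, 750.0], peak_tflops=1000.0)
print(f"{score:.1%}")
```

Under this reading, a score near 1 (the stated maximum) would mean every submitted operator runs close to the hardware's theoretical peak.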

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: agents, code, systems. Language: en. Verified by llm-stats: no.

Leaderboard

  1. MiniMax M3 (self-reported, llm-stats): 28.8%