Instruct HumanEval


An instruction-based variant of the HumanEval benchmark. It evaluates the code generation capabilities of large language models by measuring functional correctness on programming problems, reported with the pass@k metric.
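As a rough illustration of how pass@k is commonly computed, the sketch below uses the unbiased estimator popularized with the original HumanEval benchmark: for a problem with n sampled completions, c of which pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k). This page does not say which estimator, sample count, or k the reported scores use, so treat the formula choice as an assumption. The function names pass_at_k and mean_pass_at_k are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: completions sampled, c: completions that passed the tests,
    k: the k in pass@k. Assumes the unbiased estimator
    1 - C(n - c, k) / C(n, k); the source page does not confirm this.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failures: every size-k subset contains a passing sample.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With a single sample per problem (n = 1, k = 1), the estimate reduces to the fraction of problems solved, which is how a single percentage score on a leaderboard is usually read.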

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: general. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Llama 3.1 Nemotron 70B Instruct (self-reported, llm-stats): 73.8%