PostTrainBench

coding

PostTrainBench evaluates a model's ability to post-train base models autonomously. Given base models that have only been pretrained, the agent must complete the full pipeline of data synthesis, training, evaluation, and iteration within a fixed time budget. The resulting models are scored on downstream benchmarks such as AIME2025, BFCL, GPQA Main, GSM8K, and HumanEval.
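As a rough illustration of how per-benchmark results could be combined into a single number on a 0 to 1 scale, the sketch below takes an unweighted mean over the benchmarks named above. The averaging rule, the function names `aggregate_score` and `as_percent`, and the treatment of the benchmark list as complete are all assumptions made for illustration; the source does not state how PostTrainBench actually aggregates scores.

```python
from statistics import mean

# Benchmarks named in the description; the real suite may include others.
BENCHMARKS = ["AIME2025", "BFCL", "GPQA Main", "GSM8K", "HumanEval"]


def aggregate_score(per_benchmark: dict[str, float]) -> float:
    """Unweighted mean of per-benchmark scores, each in [0, 1].

    ASSUMPTION: equal weighting is a hypothetical choice for this sketch,
    not the documented PostTrainBench scoring rule.
    """
    missing = [name for name in BENCHMARKS if name not in per_benchmark]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    for name in BENCHMARKS:
        score = per_benchmark[name]
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} score {score} is outside [0, 1]")
    return mean(per_benchmark[name] for name in BENCHMARKS)


def as_percent(score: float) -> str:
    """Format a 0-1 score the way a leaderboard typically displays it."""
    return f"{score * 100:.1f}%"
```

For example, `as_percent(0.371)` yields `"37.1%"`, which is how a score of 0.371 against a maximum of 1 would appear as a percentage.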

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: agents, code, reasoning, systems. Language: en. Verified by llm-stats: no.

Leaderboard

  1. MiniMax M3: 37.1% (self-reported, llm-stats)