FRAMES

FRAMES (Factuality, Retrieval, And reasoning MEasurement Set) is a unified evaluation dataset of 824 challenging multi-hop questions. It tests retrieval-augmented generation systems on factuality, retrieval accuracy, and reasoning. Each question requires integrating information from 2 to 15 Wikipedia articles.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1. Categories: reasoning, search. Language: en. Verified by llm-stats: no.
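
The metadata gives a max score of 1, while the leaderboard shows percentages. A minimal sketch of how the two scales could relate, assuming each question is graded simply correct or incorrect (the metadata does not state the grading rule). The helper names `accuracy` and `as_percent` are hypothetical, not part of llm-stats or FRAMES:

```python
def accuracy(graded: list[bool]) -> float:
    """Fraction of questions graded correct, on the 0-to-1 scale (max score 1).

    Assumption: one boolean per question, True meaning the answer was judged correct.
    """
    if not graded:
        raise ValueError("no graded answers")
    return sum(graded) / len(graded)


def as_percent(score: float) -> str:
    """Render a 0-to-1 score as a one-decimal percentage string."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie between 0 and 1")
    return f"{score * 100:.1f}%"


if __name__ == "__main__":
    # Toy input for illustration only, not real benchmark results.
    toy_grades = [True, True, False, True]
    print(as_percent(accuracy(toy_grades)))
```

Under this reading, a leaderboard entry of 87.0% corresponds to a score of 0.870 on the 0-to-1 scale.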

Leaderboard

  1. Kimi K2-Thinking-0905: 87.0% (self-reported; source: llm-stats)
  2. DeepSeek-V3: 73.3% (self-reported; source: llm-stats)