Bird-SQL (dev)

BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) is a text-to-SQL benchmark containing 12,751 question-SQL pairs across 95 databases (33.4 GB in total) that span more than 37 professional domains. It evaluates how well large language models convert natural-language questions into executable SQL queries in realistic settings, where database schemas are complex and the stored data is dirty.

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1 (the leaderboard shows scores as percentages of this maximum). Categories: reasoning. Language: en. Verified by llm-stats: no.
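
The metadata above does not say how a score between 0 and 1 is computed. Text-to-SQL benchmarks such as BIRD are commonly scored by execution accuracy: a predicted query counts as correct when it returns the same rows as the reference query on the same database. The following is a minimal sketch under that assumption, not the official evaluation script. The toy `singer` table, the example queries, and the helper names `execution_match` and `execution_accuracy` are hypothetical and chosen only for illustration.

```python
import sqlite3


def execution_match(conn, predicted_sql, gold_sql):
    """Return True if both queries yield the same set of rows."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A predicted query that fails to execute counts as wrong.
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    # Compare as sets, so row order and duplicate rows are ignored.
    return set(predicted_rows) == set(gold_rows)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) pairs whose results match, in [0, 1]."""
    if not pairs:
        return 0.0
    return sum(execution_match(conn, p, g) for p, g in pairs) / len(pairs)


# Toy in-memory database (hypothetical, not part of BIRD).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
INSERT INTO singer VALUES (1, 'Ana', 31), (2, 'Ben', 45), (3, 'Cy', 27);
""")

pairs = [
    # Different SQL text, same result rows: correct.
    ("SELECT name FROM singer WHERE age > 30",
     "SELECT name FROM singer WHERE age >= 31"),
    ("SELECT COUNT(*) FROM singer", "SELECT COUNT(id) FROM singer"),
    # Misspelled table name: fails to execute, so incorrect.
    ("SELECT name FROM singers", "SELECT name FROM singer"),
    # Executes, but returns different rows: incorrect.
    ("SELECT name FROM singer WHERE age < 30",
     "SELECT name FROM singer WHERE age > 40"),
]

score = execution_accuracy(conn, pairs)
print(f"{score:.1%}")  # prints "50.0%"
```

Comparing result sets rather than SQL strings means two differently written queries can both be correct, which is why a score on this kind of benchmark reflects executable behavior and not surface similarity to the reference query.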

Leaderboard

All scores are self-reported and sourced from llm-stats.

  1. Gemini 2.0 Flash-Lite: 57.4%
  2. Gemini 2.0 Flash: 56.9%
  3. Gemma 3 27B: 54.4%
  4. Gemma 3 12B: 47.9%
  5. Nemotron 3 Super (120B A12B): 41.8%
  6. Gemma 3 4B: 36.3%
  7. Gemma 3 1B: 6.4%