WritingBench

A comprehensive benchmark for evaluating the generative writing capabilities of large language models across 6 core writing domains (Academic & Engineering, Finance & Business, Politics & Law, Literature & Art, Education, Advertising & Marketing) and 100 subdomains. It contains 1,239 queries and uses a query-dependent evaluation framework: for each writing task, 5 instance-specific assessment criteria are generated dynamically, and a fine-tuned critic model then scores each response on style, format, and length dimensions.
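The query-dependent scoring described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual implementation: the `Criterion` type, the `score_response` helper, the toy critic, the 1 to 10 rating scale, and the averaging step are all assumptions made for the example. Only the count of 5 criteria per query and the maximum score of 1 come from this page.

```python
from dataclasses import dataclass
from typing import Callable, List

NUM_CRITERIA = 5   # from the description: 5 instance-specific criteria per query
SCALE_MIN = 1      # assumed lower bound of the critic's rating scale (hypothetical)
SCALE_MAX = 10     # assumed upper bound of the critic's rating scale (hypothetical)


@dataclass
class Criterion:
    """One instance-specific assessment criterion (hypothetical structure)."""
    name: str
    description: str


def score_response(
    response: str,
    criteria: List[Criterion],
    critic: Callable[[str, Criterion], float],
) -> float:
    """Rate a response on each criterion and return a score normalized to at most 1.

    Averaging the per-criterion ratings is an assumption of this sketch.
    """
    if len(criteria) != NUM_CRITERIA:
        raise ValueError(f"expected {NUM_CRITERIA} criteria, got {len(criteria)}")
    ratings = [critic(response, criterion) for criterion in criteria]
    for rating in ratings:
        if not SCALE_MIN <= rating <= SCALE_MAX:
            raise ValueError(f"rating {rating} is outside the assumed scale")
    # Divide by the top of the scale so that a perfect response scores 1.
    return sum(ratings) / len(ratings) / SCALE_MAX


# Toy stand-in for the fine-tuned critic model: looks up a fixed rating by name.
TOY_RATINGS = {"tone": 8, "structure": 9, "length": 7, "audience": 10, "clarity": 6}


def toy_critic(response: str, criterion: Criterion) -> float:
    return TOY_RATINGS[criterion.name]


criteria = [Criterion(name, f"placeholder description for {name}") for name in TOY_RATINGS]
normalized = score_response("A sample product announcement.", criteria, toy_critic)
print(f"{normalized:.1%}")
```

In a real setup the critic callable would wrap a call to the fine-tuned critic model, and the criteria would be generated per query rather than hard-coded.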

Methodology

Imported from llm-stats public benchmark metadata. Modality: text. Max score: 1 (the leaderboard displays scores as percentages). Categories: communication, creativity, finance, legal, writing. Language: en. Verified by llm-stats: no.

Leaderboard

All scores are self-reported and sourced from llm-stats.

  1. Qwen3-235B-A22B-Thinking-2507: 88.3%
  2. Qwen3-Next-80B-A3B-Instruct: 87.3%
  3. Qwen3 VL 235B A22B Thinking: 86.7%
  4. Qwen3 VL 32B Thinking: 86.2%
  5. Qwen3 VL 235B A22B Instruct: 85.5%
  6. Qwen3 VL 8B Thinking: 85.5%
  7. Qwen3-235B-A22B-Instruct-2507: 85.2%
  8. Qwen3 VL 30B A3B Thinking: 85.2%
  9. Qwen3-Next-80B-A3B-Thinking: 84.6%
  10. Qwen3 VL 4B Thinking: 84.0%
  11. Qwen3 VL 8B Instruct: 83.1%
  12. Qwen3 VL 32B Instruct: 82.9%
  13. Qwen3 VL 30B A3B Instruct: 82.6%
  14. Qwen3 VL 4B Instruct: 82.5%
  15. Kimi K2-Thinking-0905: 73.8%