MusicCaps

multimodal official site →

MusicCaps is a dataset composed of 5,521 music examples, each labeled with an English aspect list and a free text caption written by musicians. The dataset contains 10-second music clips from AudioSet paired with rich textual descriptions that capture sonic qualities and musical elements like genre, mood, tempo, instrumentation, and rhythm. Created to support research in music-text understanding and generation tasks.

Methodology

Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: audio, multimodal. Language: en. Verified by llm-stats: no.

Leaderboard

  1. Qwen2.5-Omni-7B self-reported llm-stats
    32.8%