MusicCaps
MusicCaps is a dataset of 5,521 music examples, each labeled with an English aspect list and a free-text caption written by musicians. The dataset contains 10-second music clips from AudioSet paired with rich textual descriptions that capture sonic qualities and musical elements such as genre, mood, tempo, instrumentation, and rhythm. It was created to support research in music-text understanding and generation tasks.
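To make the shape of one example concrete, here is a minimal Python sketch of how a single record might be represented. The class name, the field names (`clip_id`, `start_s`, `end_s`, `aspect_list`, `caption`), and all values are assumptions chosen for illustration. They are not confirmed to match the dataset's actual schema, and the sample values are placeholders rather than real dataset entries.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MusicCapsExample:
    """One captioned clip. All field names are illustrative assumptions."""
    clip_id: str            # hypothetical identifier of the source AudioSet clip
    start_s: float          # clip start within the source recording, in seconds
    end_s: float            # clip end within the source recording, in seconds
    aspect_list: List[str]  # short English aspect tags
    caption: str            # free-text description written by a musician

    def duration_s(self) -> float:
        """Return the clip length in seconds."""
        return self.end_s - self.start_s


# Placeholder values invented for illustration; not taken from the dataset.
example = MusicCapsExample(
    clip_id="placeholder-id",
    start_s=30.0,
    end_s=40.0,
    aspect_list=["genre", "mood", "tempo"],
    caption="A placeholder caption describing the clip.",
)
```

Keeping the aspect list and the caption as separate fields mirrors the description above, where each clip carries both a list of aspects and a free-text caption.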
Methodology
Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: audio, multimodal. Language: en. Verified by llm-stats: no.