RefCOCOg
RefCOCOg is a referring expression comprehension benchmark that evaluates spatial grounding in images. Given a natural-language expression describing an object, the model must localize the corresponding region in the image. Performance is reported as accuracy: a prediction counts as correct when its intersection over union (IoU) with the ground-truth region meets the 0.5 threshold. RefCOCOg features longer, more descriptive expressions than RefCOCO and RefCOCO+.
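The accuracy-at-IoU metric described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's official scoring code: the box format `(x_min, y_min, x_max, y_max)`, the function names `iou` and `accuracy_at_iou`, and the choice to treat an IoU exactly equal to the threshold as correct are all assumptions made for the example.

```python
from typing import Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(
    preds: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of predictions whose IoU with the target meets the threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    # Assumption: IoU equal to the threshold counts as a hit (>=).
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)


if __name__ == "__main__":
    predictions = [(0, 0, 2, 2), (0, 0, 2, 2)]
    ground_truth = [(0, 0, 2, 2), (1, 0, 3, 2)]
    print(accuracy_at_iou(predictions, ground_truth))
```

The result is a fraction between 0 and 1, which matches the maximum score of 1 listed in the benchmark metadata. Some evaluation scripts use a strict inequality at the threshold instead, so scores at the exact boundary may differ between implementations.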
Methodology
Imported from llm-stats public benchmark metadata. Modality: multimodal. Max score: 1. Categories: grounding, multimodal, vision. Language: en. Verified by llm-stats: no.