GVR-Bench: Probing Spatial Reasoning Limits in Vision-Language Models
Geometric Reasoning, Visual Grounding, Spatial Transformations
Developed a systematic evaluation framework to probe the spatial reasoning capabilities of Vision-Language Models (VLMs) in deterministic settings. Engineered a suite of programmatic geometric tasks (e.g., precise rotations and spatial translations) and demonstrated a critical dissociation between perceptual fidelity and logical execution: SOTA models achieved only 16.8% pixel-level accuracy despite maintaining high perceptual similarity. Established a formal error taxonomy, classifying failures into geometric imprecision, grounding errors, and hallucinations, and provided empirical evidence that current end-to-end architectures require hybrid spatial computation modules for precise robotic manipulation.
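To make the headline metric concrete, here is a minimal sketch of how a pixel-level accuracy score could be computed for one task instance. This is an illustrative assumption, not the benchmark's actual code: the function `pixel_accuracy` and the exact-match definition are hypothetical, standing in for whatever metric the framework uses.

```python
# Hypothetical sketch: pixel-level accuracy as the fraction of pixels
# that exactly match between a model's output image and the ground-truth
# target rendered by the task generator. (Assumed definition, not the
# benchmark's actual implementation.)
import numpy as np

def pixel_accuracy(pred: np.ndarray, target: np.ndarray) -> float:
    """Fraction of pixel positions whose RGB values match exactly."""
    assert pred.shape == target.shape, "images must have the same shape"
    # np.all over the channel axis: a pixel counts only if all channels agree.
    return float(np.mean(np.all(pred == target, axis=-1)))

# Toy instance: a 2x2 RGB grid with one red pixel; the "model output"
# is a 90-degree counterclockwise rotation of the target.
target = np.zeros((2, 2, 3), dtype=np.uint8)
target[0, 0] = (255, 0, 0)          # red pixel at top-left
pred = np.rot90(target)             # red pixel moves to bottom-left
print(pixel_accuracy(pred, target)) # 2 of 4 pixels match -> 0.5
```

Exact matching is deliberately strict; a real evaluation would likely also report a tolerance-based or perceptual-similarity score alongside it, which is what allows the dissociation described above to surface.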
Paper
Code
Slides