ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language

Dave Zhenyu Chen1      Angel X. Chang2      Matthias Nießner1     

1Technical University of Munich       2Simon Fraser University

European Conference on Computer Vision, 2020.

Submit to our ScanRefer Localization Benchmark here!


We introduce the task of 3D object localization in RGB-D scans using natural language descriptions. As input, we assume a point cloud of a scanned 3D scene along with a free-form description of a specified target object. To address this task, we propose ScanRefer, learning a fused descriptor from 3D object proposals and encoded sentence embeddings. This fused descriptor correlates language expressions with geometric features, enabling regression of the 3D bounding box of a target object. We also introduce the ScanRefer dataset, containing 51,583 descriptions of 11,046 objects from 800 ScanNet scenes. ScanRefer is the first large-scale effort to perform object localization via natural language expression directly in 3D.



The ScanRefer data can be browsed online in your web browser. Learn more at the ScanRefer Data Browser.
(For a better browsing experience, we recommend using Google Chrome.)


European Conference on Computer Vision (ECCV), 2020.
Paper | arXiv | Code

If you find our project useful, please consider citing us:

    title={ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language},
    author={Chen, Dave Zhenyu and Chang, Angel X and Nie{\ss}ner, Matthias},
    journal={16th European Conference on Computer Vision (ECCV)},

Dataset Download

If you would like to access to the ScanRefer dataset, please fill out the ScanRefer Terms of Use Form. Once your request is accepted, you will receive an email with the download link.