D3Net: A Speaker-Listener Architecture for Semi-supervised Dense Captioning and Visual Grounding in RGB-D Scans


Dave Zhenyu Chen1      Qirui Wu2      Matthias Nießner1      Angel X. Chang2     

1Technical University of Munich       2Simon Fraser University



Introduction

Recent studies on dense captioning and visual grounding in 3D have achieved impressive results. Despite developments in both areas, the limited amount of available 3D vision-language data causes overfitting issues for 3D visual grounding and 3D dense captioning methods. Also, how to discriminatively describe objects in complex 3D environments is not fully studied yet. To address these challenges, we present D3Net, an end-to-end neural speaker-listener architecture that can detect, describe and discriminate. Our D3Net unifies dense captioning and visual grounding in 3D in a self-critical manner. This self-critical property of D3Net also introduces discriminability during object caption generation and enables semi-supervised training on ScanNet data with partially annotated descriptions. Our method outperforms SOTA methods in both tasks on the ScanRefer dataset, surpassing the SOTA 3D dense captioning method by a significant margin (23.56% CiDEr@0.5IoU improvement).

Video

Results


Qualitative results in 3D dense captioning task from Scan2Cap and our method. We underline the inaccurate words and mark the spatially discriminative phrases in bold. Our method qualitatively outperforms Scan2Cap in producing better object bounding boxes and more discriminative descriptions.



3D visual grounding results using 3DVG-Transformer and our method. 3DVG-Transformer fails to accurately predict object bounding boxes, while our method produces accurate bounding boxes and correctly distinguishes target objects from distractors.


Publication

Paper | arXiv | Code

If you find our project useful, please consider citing us:

@misc{chen2021d3net,
    title={D3Net: A Speaker-Listener Architecture for Semi-supervised Dense Captioning and Visual Grounding in RGB-D Scans}, 
    author={Dave Zhenyu Chen and Qirui Wu and Matthias Nie├čner and Angel X. Chang},
    year={2021},
    eprint={2112.01551},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}