Abstract and 1 Introduction
Related Works
2.1. Vision-and-Language Navigation
2.2. Semantic Scene Understanding and Instance Segmentation
2.3. 3D Scene Reconstruction
Methodology
3.1. Data Collection
3.2. Open-set Semantic Information from Images
3.3. Creating the Open-set 3D Representation
3.4. Language-Guided Navigation
Experiments
4.1. Quantitative Evaluation
4.2. Qualitative Results
Conclusion and Future Work, Disclosure statement, and References
To demonstrate the effectiveness of the proposed O3D-SIM qualitatively, this section presents visualizations of its output on selected examples. Figure 5 shows the results for two mapping sequences.

Notably, the open-set capability of O3D-SIM enables it to identify objects, such as wheelchairs, that conventional pipelines relying on closed sets or predefined datasets typically cannot detect. The figure compares the number of objects our pipeline identifies and segments, including mannequins and mobile robots, against their actual counts. This comparison highlights situations where traditional methods, constrained by a limited set of recognizable objects, fall short.

Our approach excels at recognizing instance-level semantics: it accurately identifies 5 out of 6 table instances (with one false positive), underscoring its precision on both simulated and real-world data. This demonstrates the robustness of our pipeline, which is further evidenced by the clarity of the semantic map and the ease with which instance-level segmentation results can be visualized. Methods like VLMaps might identify a broader range of objects because of their open-set nature, and SI-Maps may detect multiple instances of the same object, but O3D-SIM does both, offering a comprehensive solution.
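The table-instance counts above can be turned into the usual detection metrics. The sketch below is illustrative only and is not taken from the paper's evaluation code. It assumes one particular reading of the counts, which the text does not spell out: 5 table instances correctly identified, 1 of the 6 real tables missed, and 1 extra detection that is a false positive. The function name `precision_recall` is hypothetical.

```python
from typing import Tuple


def precision_recall(true_positives: int,
                     false_positives: int,
                     false_negatives: int) -> Tuple[float, float]:
    """Return (precision, recall) for instance-level detections of one class."""
    predicted = true_positives + false_positives  # instances the map reports
    actual = true_positives + false_negatives     # instances really in the scene
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# ASSUMED reading of the table example: 5 correct, 1 false positive, 1 missed.
precision, recall = precision_recall(true_positives=5,
                                     false_positives=1,
                                     false_negatives=1)
print(f"precision={precision:.3f}, recall={recall:.3f}")
```

If the false positive is instead counted among the five reported detections, the inputs change accordingly, so the numbers printed here should be read only as an example of the calculation.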
:::info Authors:
(1) Laksh Nanwani, International Institute of Information Technology, Hyderabad, India; this author contributed equally to this work;
(2) Kumaraditya Gupta, International Institute of Information Technology, Hyderabad, India;
(3) Aditya Mathur, International Institute of Information Technology, Hyderabad, India; this author contributed equally to this work;
(4) Swayam Agrawal, International Institute of Information Technology, Hyderabad, India;
(5) A.H. Abdul Hafez, Hasan Kalyoncu University, Sahinbey, Gaziantep, Turkey;
(6) K. Madhava Krishna, International Institute of Information Technology, Hyderabad, India.
:::
:::info This paper is available on arXiv under the CC BY-SA 4.0 Deed (Attribution-ShareAlike 4.0 International) license.
:::

