DGSG-Mind:
Dynamic 3D Gaussian Scene Graphs
for Long-Term Scene Understanding and Grounding
Anonymous Authors
IEEE Robotics and Automation Letters 2026 Submission
Teaser
Compared with previous state-of-the-art systems, DGSG-Mind offers a more complete functional framework for long-term embodied scene understanding in dynamic environments. It jointly maintains a hybrid instance-aware 3D Gaussian representation and a hierarchical scene graph, supporting high-quality reconstruction, semantic mapping, accurate visual relocalization, and instance-level dynamic updates. Moreover, it leverages RoI Gaussian-rendered visual cues together with 3D semantic and spatial relations for multimodal reasoning. These capabilities allow DGSG-Mind to support long-term robotic tasks in dynamic real-world environments.
Method
System Overview of DGSG-Mind. Given a posed RGB-D sequence, DGSG-Mind extracts open-vocabulary instance masks and semantic features, and integrates them into a hybrid 3D Gaussian instance representation. Cross-modal association leverages a sparse probabilistic voxel grid to link 2D observations with persistent 3D Gaussian instances, guide new Gaussian initialization, and support multi-view optimization with photometric, depth, scale, and normal regularization. For dynamic scenes, localized masked refinement updates newly appeared or removed objects without re-optimizing the entire map. The instance-aware Gaussian map is further abstracted into a hierarchical 3D scene graph, enabling the 3D Gaussian Mind to perform zero-shot 3D visual grounding and spatial reasoning from rendered RoI views and structured 3D relations.
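The cross-modal association step can be illustrated with a small sketch. It is not the authors' implementation: the class `SparseVoxelGrid`, the function `associate`, the voxel size, and the `min_support` threshold are all hypothetical names and values chosen for illustration. The sketch assumes that each voxel keeps per-instance hit counts, that a 2D mask has already been back-projected to 3D points, and that a mask is linked to the instance with the highest accumulated label probability.

```python
import math
from collections import Counter, defaultdict


class SparseVoxelGrid:
    """Hypothetical sparse grid: only occupied voxels are stored in a dict."""

    def __init__(self, voxel_size=0.05):
        self.voxel_size = voxel_size
        # voxel index -> Counter mapping instance id to number of hits
        self.counts = defaultdict(Counter)

    def key(self, point):
        # Integer voxel index of a 3D point (x, y, z) in metres.
        return tuple(math.floor(c / self.voxel_size) for c in point)

    def integrate(self, points, instance_id):
        # Record that these 3D points were observed as `instance_id`.
        for p in points:
            self.counts[self.key(p)][instance_id] += 1

    def label_probs(self, point):
        # Normalised instance-label distribution of the voxel holding `point`.
        counter = self.counts.get(self.key(point))
        if not counter:
            return {}
        total = sum(counter.values())
        return {inst: n / total for inst, n in counter.items()}


def associate(grid, mask_points, min_support=0.5):
    """Return the best-matching instance id, or None to request a new instance."""
    votes = Counter()
    for p in mask_points:
        for inst, prob in grid.label_probs(p).items():
            votes[inst] += prob
    if not votes:
        return None
    best, score = votes.most_common(1)[0]
    # Average support per back-projected point must reach the threshold.
    return best if score / len(mask_points) >= min_support else None
```

Under these assumptions, a `None` result would be the cue to initialize new Gaussians for an unseen object, while a returned id attaches the 2D observation to an existing 3D instance.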
Dynamic Scene Update
Dynamic Scene Update: Given the current camera view, we first estimate a coarse camera pose with a fine-tuned ACE model and refine it against the 3D Gaussian map. With the refined pose, visible instances are evaluated by joint geometric-semantic consistency to detect removed objects, while residual detection identifies newly appeared ones. The Gaussian map is then optimized by localized masked refinement, and the scene graph is synchronized with the resulting object additions and removals.
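One way to read the joint geometric-semantic consistency check is sketched below. This is an assumption-laden illustration, not the method described in the paper: the function `is_removed`, the thresholds `depth_tol`, `geo_thresh`, and `sem_thresh`, and the specific rule (an instance counts as removed only when the camera sees past its rendered surface and the observed feature no longer matches the stored one) are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def is_removed(rendered_depth, observed_depth, map_feature, observed_feature,
               depth_tol=0.05, geo_thresh=0.5, sem_thresh=0.5):
    """Hypothetical removal test for one visible instance.

    rendered_depth / observed_depth: per-pixel depths (metres) inside the
    instance's rendered mask; an observed depth of 0 marks an invalid pixel.
    """
    valid = [(r, o) for r, o in zip(rendered_depth, observed_depth) if o > 0]
    if not valid:
        return False  # no usable depth, so no evidence of removal
    # Geometric cue: the sensor sees farther than the mapped surface.
    see_through = sum(1 for r, o in valid if o - r > depth_tol)
    geo_inconsistent = see_through / len(valid) > geo_thresh
    # Semantic cue: the observed feature no longer matches the map feature.
    sem_inconsistent = cosine(map_feature, observed_feature) < sem_thresh
    return geo_inconsistent and sem_inconsistent
```

Requiring both cues is a conservative design choice in this sketch: an occluder in front of the object makes the observed depth shallower rather than deeper, so it would not trigger the geometric cue.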
3D Gaussian Mind
3D Gaussian Mind: By integrating natural language queries, structured 3D scene graphs, and generated annotated Gaussian views (RoI images), this framework leverages a Vision Language Model for joint spatial reasoning and object localization.
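As a rough illustration of how the three inputs might be combined, the sketch below assembles a text prompt from a query, a flat scene-graph excerpt, and identifiers of rendered RoI views. The function `build_grounding_prompt`, the node fields (`id`, `label`, `center`), the relation triples, and the prompt wording are all hypothetical; the actual prompt format and the Vision Language Model interface are not specified in the text, so no model call is shown.

```python
import json


def build_grounding_prompt(query, nodes, relations, roi_image_ids):
    """Compose a hypothetical text prompt for a vision-language model.

    nodes: list of dicts with "id", "label", and "center" (x, y, z).
    relations: list of (subject_id, predicate, object_id) triples.
    roi_image_ids: names of annotated RoI renderings sent with the prompt.
    """
    graph = {
        "objects": [
            {"id": n["id"], "label": n["label"], "center": n["center"]}
            for n in nodes
        ],
        "relations": [
            {"subject": s, "predicate": p, "object": o}
            for s, p, o in relations
        ],
    }
    return "\n".join([
        "You are given a 3D scene graph and annotated views of the scene.",
        "Scene graph (JSON): " + json.dumps(graph, sort_keys=True),
        "Annotated RoI views: " + ", ".join(roi_image_ids),
        "Query: " + query,
        "Answer with the id of the single object that best matches the query.",
    ])


prompt = build_grounding_prompt(
    "the chair next to the table",
    nodes=[
        {"id": 1, "label": "chair", "center": [0.4, 1.2, 0.5]},
        {"id": 2, "label": "table", "center": [1.0, 1.2, 0.4]},
    ],
    relations=[(1, "next to", 2)],
    roi_image_ids=["roi_1.png", "roi_2.png"],
)
```

Serializing the graph as JSON keeps object ids explicit, so a returned id can be mapped back to a 3D instance and its bounding box.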
3D Open-Vocabulary Semantic Segmentation
3D Visual Grounding
Qualitative results of 3D visual grounding (3DVG). DGSG-Mind localizes target objects from free-form language queries on self-reconstructed 3D Gaussian maps. Predicted boxes are compared with ground truth and representative baselines on ScanRefer and Nr3D scenes.