GeoHAT: Geometry-Adaptive Hybrid Action Transformer for Mobile Manipulation

Anonymous Authors

Abstract

Whole-body mobile manipulation requires coordinating a mobile base and a manipulator under shifting viewpoints, which poses challenges in both geometric perception and action generation. Current policies rely either on 2D features or on sparse 3D representations that lack dense spatial structure, and they typically encode arm and base in a single action vector that ignores their distinct control demands. Moreover, existing dense fusion strategies risk corrupting pretrained representations when depth is noisy, and they incur heavy computational overhead. We present GeoHAT, an end-to-end diffusion-based framework built on a simple principle: geometry should be injected only where reliable and attended to only where needed. GeoHAT employs a lightweight Fourier spatial encoder that maps dense per-pixel 3D coordinates into geometric tokens without an additional 3D vision backbone. These tokens are then selectively injected into vision foundation model features through per-token gated fusion modulated by depth validity, preserving the semantic prior while enriching spatial understanding. For action generation, a Hybrid Whole-Body Action Decoder decomposes arm and base actions into distinct subspaces and lets each action modality attend to its task-relevant visual context through sparse cross-attention, while causal temporal modeling captures intra-timestep coordination and inter-timestep dependencies. Experiments on the ManiSkill-HAB simulation benchmark show that GeoHAT achieves a 79.3% mean success rate, surpassing the strongest baseline by 23.7%. Real-world experiments on diverse tasks confirm consistent improvements over all baselines.
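To make the Fourier spatial encoder more concrete, here is a minimal NumPy sketch of how per-pixel 3D coordinates could be pooled to patch resolution, lifted with sin/cos features, and passed through a small MLP. It is an illustration only, not the authors' implementation: the function names, the patch size of 14, the frequency schedule 2^k·π, the layer widths, and the random weights are all assumptions.

```python
import numpy as np

def fourier_encode(points, num_bands=6):
    """Map (..., 3) coordinates to (..., 3 * 2 * num_bands) sin/cos features.

    The frequency schedule 2**k * pi is an assumed choice for illustration.
    """
    points = np.asarray(points, dtype=np.float64)
    freqs = (2.0 ** np.arange(num_bands)) * np.pi            # (num_bands,)
    angles = points[..., None] * freqs                       # (..., 3, num_bands)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(*points.shape[:-1], -1)

def patchify_pointmap(pointmap, valid, patch=14):
    """Average the valid 3D points inside each patch.

    pointmap: (H, W, 3) per-pixel coordinates, valid: (H, W) boolean depth mask.
    Returns (N, 3) patch centroids and an (N,) fraction of valid pixels per patch.
    """
    H, W, _ = pointmap.shape
    gh, gw = H // patch, W // patch
    pm = pointmap[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, 3)
    v = valid[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch).astype(np.float64)
    weight = v.sum(axis=(1, 3))                               # valid pixels per patch
    summed = (pm * v[..., None]).sum(axis=(1, 3))             # (gh, gw, 3)
    centroid = summed / np.maximum(weight, 1.0)[..., None]    # avoid division by zero
    ratio = weight / (patch * patch)
    return centroid.reshape(-1, 3), ratio.reshape(-1)

def mlp(x, w1, b1, w2, b2):
    """Two-layer ReLU MLP standing in for the learned projection."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

# Toy usage with random data (hypothetical sizes).
rng = np.random.default_rng(0)
pointmap = rng.uniform(-1.0, 1.0, size=(28, 42, 3))
valid = rng.random((28, 42)) > 0.2
centroids, ratio = patchify_pointmap(pointmap, valid, patch=14)
feats = fourier_encode(centroids, num_bands=6)                # (6, 36)
w1, b1 = rng.normal(scale=0.1, size=(feats.shape[-1], 64)), np.zeros(64)
w2, b2 = rng.normal(scale=0.1, size=(64, 32)), np.zeros(32)
geo_tokens = mlp(feats, w1, b1, w2, b2)                       # (6, 32)
```

The per-patch valid-pixel fraction returned here is one plausible source for the depth validity signal that the abstract says modulates the fusion gate.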

Overview

GeoHAT Teaser Figure

Overview of GeoHAT. Left: GeoHAT fuses multi-view RGB features with PointMap features through reliability-aware gated fusion, and then predicts coordinated arm-base actions with a hybrid whole-body decoder. This design injects geometry only where reliable and routes visual context to the action subspaces that need it. Middle: Simulation and real-world mobile manipulation tasks used for evaluation. Right: Success rates showing strong performance in both simulation and real-world deployment.
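The reliability-aware gated fusion described in this caption can be sketched as a residual update whose per-token gate is scaled by depth validity. This is a hedged illustration under assumed shapes and names (`gated_fusion`, `w_gate`, `b_gate`, `w_proj` are hypothetical, and the weights are random rather than learned); the exact gating formula used by GeoHAT is not given on this page.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(sem, geo, validity, w_gate, b_gate, w_proj):
    """Fuse geometric tokens into semantic tokens with a per-token gate.

    sem:      (N, D)  semantic features from the vision backbone
    geo:      (N, Dg) geometric tokens
    validity: (N,)    depth validity in [0, 1] for each token
    Returns the fused (N, D) features and the (N, 1) effective gate.
    """
    gate_in = np.concatenate([sem, geo], axis=-1)     # (N, D + Dg)
    gate = sigmoid(gate_in @ w_gate + b_gate)         # (N, 1), values in (0, 1)
    gate = gate * validity[:, None]                   # unreliable depth closes the gate
    fused = sem + gate * (geo @ w_proj)               # residual keeps the semantic prior
    return fused, gate

# Toy usage with random data (hypothetical sizes).
rng = np.random.default_rng(0)
N, D, Dg = 6, 16, 8
sem = rng.normal(size=(N, D))
geo = rng.normal(size=(N, Dg))
validity = np.array([1.0, 1.0, 0.5, 0.0, 0.0, 1.0])
w_gate = rng.normal(scale=0.1, size=(D + Dg, 1))
b_gate = np.zeros(1)
w_proj = rng.normal(scale=0.1, size=(Dg, D))
fused, gate = gated_fusion(sem, geo, validity, w_gate, b_gate, w_proj)
```

In this sketch, tokens with zero validity pass through unchanged, which is one simple way to realize "inject geometry only where reliable" without touching the pretrained features at those locations.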

Method

GeoHAT Method Overview

Architecture of GeoHAT. Patch-aligned PointMap coordinates are encoded via a Fourier MLP and selectively fused into DINOv2 features through per-token gating modulated by depth confidence (left). The hybrid action decoder interleaves arm and base tokens, applies query-level Top-K cross-attention for subspace-specific visual grounding, and uses causal self-attention for temporal coordination (right).
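The decoder side of this caption can be illustrated with a small NumPy sketch: arm and base tokens are interleaved per timestep, a lower-triangular mask gives causal self-attention, and each query keeps only its K highest-scoring visual tokens in cross-attention. All names, the arm-before-base ordering, and the omission of learned query/key/value projections are simplifying assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def interleave(arm, base):
    """arm, base: (T, D) -> (2T, D) ordered arm_0, base_0, arm_1, base_1, ..."""
    T, D = arm.shape
    return np.stack([arm, base], axis=1).reshape(2 * T, D)

def causal_self_attention(x):
    """Single-head attention where token i sees only tokens 0..i."""
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    mask = np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ x

def topk_cross_attention(queries, visual, k):
    """Each query attends only to its k highest-scoring visual tokens."""
    d = queries.shape[-1]
    scores = queries @ visual.T / np.sqrt(d)                      # (Q, N)
    kth = np.partition(scores, -k, axis=-1)[:, -k][:, None]       # k-th largest per query
    masked = np.where(scores >= kth, scores, -np.inf)
    weights = softmax(masked)                                     # zeros outside the top k
    return weights @ visual, weights

# Toy usage with random data (hypothetical sizes).
rng = np.random.default_rng(0)
T, D, N, K = 4, 16, 20, 5
arm = rng.normal(size=(T, D))
base = rng.normal(size=(T, D))
visual = rng.normal(size=(N, D))
tokens = interleave(arm, base)                                    # (8, 16)
grounded, weights = topk_cross_attention(tokens, visual, K)
out = causal_self_attention(tokens + grounded)
```

With this interleaving, the causal mask lets the base token at a timestep attend to the arm token of the same timestep (intra-timestep coordination) as well as to all earlier timesteps (inter-timestep dependencies).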

Real World Performance

Pick and Place: DP, π0, Ours
Drop Rubbish: DP, π0, Ours
Open Drawer: DP, π0, Ours
Close Drawer: DP, π0, Ours

BibTeX

@misc{zhu2026geohatgeometryadaptivehybridaction,
      title={GeoHAT: Geometry-Adaptive Hybrid Action Transformer for Mobile Manipulation},
      author={Xiangyu Zhu and Renjun Wu and Luzhou Ge and Jinyan Liu and Xuesong Li},
      year={2026},
      eprint={2606.13394},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2606.13394},
}