Gripper Keypose and Object Pointflow
as Interfaces for Bimanual Robotic Manipulation

RSS 2025

Yuyin Yang*¹,², Zetao Cai*¹,³, Yang Tian¹,⁴, Jia Zeng¹, Jiangmiao Pang✉¹

¹Shanghai AI Lab, ²Fudan University, ³Zhejiang University, ⁴Peking University, *Equal Contribution, ✉Corresponding Author

Overview

PPI (keyPose and Pointflow Interface) is an end-to-end framework that integrates the prediction of target gripper poses and object pointflow with continuous action estimation.

Keyframe-based policies excel at spatial localization but struggle with movement restrictions (e.g., curved motion and collision-free actions). Continuous-action-based policies accommodate diverse trajectories but lack strong perception. In contrast to both, PPI enables the model to attend effectively to the target manipulation area, while the overall framework still produces diverse, collision-free trajectories.

By combining interface prediction with continuous action estimation, PPI demonstrates superior performance across diverse bimanual manipulation tasks, offering both stronger spatial localization and the flexibility needed to handle movement restrictions.

Method

PPI consists of three components: (a) Perception. PPI first constructs a 3D semantic neural field and samples initial query points for pointflow prediction. (b) Interface. Next, two intermediate interfaces are defined: target gripper poses and object pointflow. (c) Prediction. Finally, a diffusion transformer takes as input robot proprioception tokens, scene tokens, language tokens, pointflow query tokens, and action tokens initialized with Gaussian noise. Using a carefully designed unidirectional attention, the model progressively denoises the action predictions conditioned on the interfaces.
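To make the idea of unidirectional attention concrete, here is a minimal sketch of a block-wise attention mask over the token groups named above. It is an illustration under stated assumptions, not the authors' implementation: the three-level ordering (context, then interface, then action), the `LEVEL` table, the `build_mask` helper, and the token counts are all hypothetical. The rule assumed here is that a token may attend only to tokens at the same or an earlier level, so information flows from scene context to the interfaces to the actions and never backward.

```python
from typing import Dict, List, Tuple

# Assumed ordering of token groups (hypothetical, for illustration only):
# level 0 = context, level 1 = interface queries, level 2 = noisy action tokens.
LEVEL: Dict[str, int] = {
    "proprio": 0,
    "scene": 0,
    "language": 0,
    "pointflow": 1,
    "action": 2,
}


def build_mask(group_sizes: List[Tuple[str, int]]) -> List[List[bool]]:
    """Return mask[q][k], True when query token q may attend to key token k.

    A query attends to keys whose level is less than or equal to its own,
    which makes attention one-directional across levels.
    """
    # Expand ("scene", 2) into ["scene", "scene"], one label per token.
    labels = [name for name, count in group_sizes for _ in range(count)]
    return [[LEVEL[k] <= LEVEL[q] for k in labels] for q in labels]


# Made-up token counts; a real model would use many more tokens per group.
sizes = [("proprio", 1), ("scene", 2), ("language", 1), ("pointflow", 2), ("action", 2)]
mask = build_mask(sizes)

# Print the mask as a grid: '1' = attention allowed, '.' = blocked.
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

With this layout, action tokens see the context and the interface tokens, while context tokens never see the interface predictions or the noisy actions. In a transformer, such a boolean mask would typically be converted to additive form (0 where allowed, negative infinity where blocked) before the softmax.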

Real World Experiments

Four Main Tasks

Carry the Tray

Handover and Insert the Plate

Wipe the Plate

Scan the Bottle

Generalization

Object Interference

Lighting & Background Changes

Object Interference & Background Changes

Unseen Object

Simulation Experiments

Ablation Study on RLBench2

Relying solely on either keyframe actions or continuous actions is insufficient for general manipulation tasks.
Conditioning on either interface alone improves performance, likely due to the local spatial features it provides.
Combining both interfaces yields further gains, highlighting the synergy between keypose and pointflow in enhancing performance on downstream tasks.
Using fewer keyframes reduces the interface's effectiveness and leads to lower success rates.

Interfaces Visualization in Simulation

BibTeX

@article{yang2025gripper,
  title={Gripper Keypose and Object Pointflow as Interfaces for Bimanual Robotic Manipulation},
  author={Yang, Yuyin and Cai, Zetao and Tian, Yang and Zeng, Jia and Pang, Jiangmiao},
  journal={arXiv preprint arXiv:2504.17784},
  year={2025}
}