Gripper Keypose and Object Pointflow
as Interfaces for Bimanual Robotic Manipulation

RSS 2025

1Shanghai AI Lab, 2Fudan University, 3Zhejiang University, 4Peking University, *Equal Contribution, Corresponding Author

Overview



PPI (keyPose and Pointflow Interface) is an end-to-end framework that integrates the prediction of target gripper poses and object pointflow with continuous action estimation.

In contrast to (i) keyframe-based policies, which excel at spatial localization but struggle with movement restrictions (e.g., curved motions and collision-free actions), and (ii) continuous-action-based policies, which accommodate diverse trajectories but lack strong perception, PPI enables the model to attend effectively to the target manipulation area, while the overall framework guides diverse, collision-free trajectories.

By combining interface predictions with continuous action estimation, PPI demonstrates superior performance across diverse bimanual manipulation tasks, providing enhanced spatial localization and the flexibility needed to handle movement restrictions.

Method



PPI consists of three components: (a) Perception. PPI first constructs a 3D semantic neural field and samples initial query points for pointflow prediction. (b) Interface. Next, two intermediate interfaces are defined: target gripper poses and object pointflow. (c) Prediction. Finally, a diffusion transformer takes in robot proprioception tokens, scene tokens, language tokens, pointflow query tokens, and action tokens with Gaussian noise. Using a carefully designed unidirectional attention, the model progressively denoises action predictions conditioned on the interfaces.
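One way to picture the unidirectional attention described above is as a block-wise attention mask over the token groups. The sketch below is illustrative only and is not taken from the PPI implementation: the group sizes, the `CAN_ATTEND` visibility table, and the function name `build_unidirectional_mask` are all assumptions. It assumes that conditioning tokens (proprioception, scene, language) see only each other, that pointflow query tokens additionally see the conditioning tokens, and that action tokens see everything, so information flows toward the actions but never back from them.

```python
from typing import Dict, List

# Hypothetical token groups in sequence order, with illustrative sizes.
GROUP_SIZES: Dict[str, int] = {
    "proprio": 1,
    "scene": 4,
    "language": 2,
    "pointflow_query": 3,
    "action": 4,
}

# Assumed visibility table: for each query group, the key groups it may attend to.
CONDITIONING = ["proprio", "scene", "language"]
CAN_ATTEND: Dict[str, List[str]] = {
    "proprio": CONDITIONING,
    "scene": CONDITIONING,
    "language": CONDITIONING,
    "pointflow_query": CONDITIONING + ["pointflow_query"],
    "action": CONDITIONING + ["pointflow_query", "action"],
}


def build_unidirectional_mask(
    group_sizes: Dict[str, int], can_attend: Dict[str, List[str]]
) -> List[List[bool]]:
    """Return mask[i][j] == True when query position i may attend to key position j."""
    # Label every sequence position with the name of the group it belongs to.
    labels = [name for name, size in group_sizes.items() for _ in range(size)]
    return [
        [labels[j] in can_attend[labels[i]] for j in range(len(labels))]
        for i in range(len(labels))
    ]


mask = build_unidirectional_mask(GROUP_SIZES, CAN_ATTEND)
```

Under these assumptions, an action token can read the pointflow query tokens, while a pointflow query token cannot read the noisy action tokens. A boolean mask of this shape could then be passed to a standard attention layer that accepts one.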

Real World Experiments

Four Main Tasks

Carry the Tray

Handover and Insert the Plate

Wipe the Plate

Scan the Bottle



Generalization

Object Interference

Object Interference

Object Interference

Lighting & Background Changes

Object Interference & Background Changes

Unseen Object

Simulation Experiments

Ablation Study on RLBench2


Interfaces Visualization in Simulation