PPI (keyPose and Pointflow Interface) is an end-to-end framework that integrates the prediction of target gripper poses and object pointflow with continuous action estimation.
In contrast to (i) keyframe-based policies, which excel at spatial localization but struggle with movement constraints (e.g., curved motions and collision avoidance), and (ii) continuous-action policies, which accommodate diverse trajectories but lack strong perception, PPI enables the model to attend effectively to the target manipulation area while the overall framework guides diverse, collision-free trajectories.
By combining interface predictions with continuous action estimation, PPI achieves superior performance on diverse bimanual manipulation tasks, offering enhanced spatial localization and the flexibility to handle movement constraints.
PPI consists of three components: (a) Perception. PPI first constructs a 3D semantic neural field and samples initial query points for pointflow prediction. (b) Interface. Two intermediate interfaces are then defined: target gripper poses and object pointflow. (c) Prediction. Finally, a diffusion transformer combines robot proprioception tokens, scene tokens, language tokens, pointflow query tokens, and action tokens initialized with Gaussian noise. Using a carefully designed unidirectional attention, the model progressively denoises the action predictions conditioned on the interfaces.
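To make the prediction stage concrete, the sketch below shows one plausible token layout and a block-wise unidirectional attention mask in PyTorch, where noisy action tokens sit at the tail of the sequence so they can attend to the interface tokens but not vice versa. All module names, token counts, and dimensions here are illustrative assumptions, not the released PPI implementation.

```python
# Minimal sketch of PPI-style unidirectional denoising, assuming a
# PyTorch diffusion transformer. All names and sizes are illustrative;
# they are not taken from the PPI codebase.
import torch
import torch.nn as nn

def unidirectional_mask(sizes):
    """Block-wise unidirectional mask: each token group attends only to
    itself and earlier groups (proprio -> scene -> language -> pointflow
    queries -> noisy actions), so actions condition on the interfaces."""
    total = sum(sizes)
    mask = torch.full((total, total), float("-inf"))
    start = 0
    for size in sizes:
        end = start + size
        mask[start:end, :end] = 0.0  # attend to preceding groups + own group
        start = end
    return mask

class DenoiserBlock(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, mask):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, attn_mask=mask, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

# One denoising pass: concatenate the token groups, apply masked
# attention, then read the denoised action tokens off the sequence tail.
dim, sizes = 256, [2, 64, 16, 20, 10]  # proprio, scene, language, pointflow, action
proprio, scene, lang, pflow = (torch.randn(1, n, dim) for n in sizes[:-1])
noisy_actions = torch.randn(1, sizes[-1], dim)  # actions start as Gaussian noise

block = DenoiserBlock(dim)
tokens = torch.cat([proprio, scene, lang, pflow, noisy_actions], dim=1)
out = block(tokens, unidirectional_mask(sizes))
denoised_actions = out[:, -sizes[-1]:]  # conditioned on interfaces via the mask
```

The design intuition the mask captures is that interface tokens (pointflow queries, and by extension the predicted keyposes) are resolved from perception alone, while action tokens may read from everything upstream, so the denoised trajectory is explicitly grounded in the predicted interfaces.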
Real-world tasks: Carry the Tray, Handover and Insert the Plate, Wipe the Plate, Scan the Bottle.
Generalization settings: Object Interference, Lighting and Background Changes, Object Interference & Background Changes, Unseen Object.