Abstract
An important aspect of robotic grasping is the ability to detect incipient slip from real-time tactile sensor data. In this paper, we propose using Video Vision Transformers to detect the onset of slip in grasping scenarios. Because slip is inherently dynamic, Video Vision Transformers, which capture temporal correlations across frames, are well suited to the task, even with relatively small datasets. The training data is acquired through two GelSight tactile sensors attached to the generic finger grippers of a Franka Emika Panda robot arm that grasps, lifts, and shakes 30 everyday objects in order to induce slip. We further conduct an ablation study considering 5, 4, 3, and 2 frames prior to slip onset, which reveals consistent prediction accuracy. Our approach can predict slip well before it occurs, as early as the 5th frame before onset. This advance prediction capability may be a valuable tool for taking preemptive corrective actions, such as a more secure gripper closure. We evaluate how well our approach predicts the onset of slip on 10 previously unseen objects and achieve a zero-shot mean prediction accuracy of 99%.