Sequential Decision-Making for Active Object Detection from Hand
A key component of understanding hand-object interactions is the ability to identify the active object, that is, the object being manipulated by the human hand, despite the occlusion induced by the interaction itself. Based on the observation that hand appearance is a strong indicator of the location and size of the active object, we formulate active object detection as a sequential decision-making process conditioned on the location and appearance of the hands. The key innovation of our approach is an active object detection policy that uses an internal representation called the Relational Box Field, in which every pixel regresses an improved location for the active object bounding box, essentially letting every pixel vote for a better box. The policy is trained using a hybrid of imitation learning and reinforcement learning, and at test time it is applied repeatedly to refine the bounding box of the active object. We perform experiments on two large-scale datasets, 100DOH and MECCANO, improving AP50 performance by 8 [...] over the state of the art.
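The test-time loop described above (per-pixel box votes, applied repeatedly) can be illustrated with a minimal sketch. This is not the authors' implementation: the median aggregation, the `vote_mask` of voting pixels, and the stand-in `make_toy_field` predictor are all assumptions made for illustration, since the abstract does not say how votes are combined or how the field is predicted.

```python
import numpy as np

def aggregate_votes(box_field, vote_mask):
    """Combine per-pixel box votes into one box.

    box_field: (H, W, 4) array; each pixel holds a proposed (x1, y1, x2, y2).
    vote_mask: (H, W) boolean array selecting which pixels vote.
    The median is an assumed aggregation rule, not taken from the paper.
    """
    votes = box_field[vote_mask]  # shape (N, 4)
    if votes.shape[0] == 0:
        raise ValueError("no voting pixels selected")
    return np.median(votes, axis=0)

def refine(init_box, predict_field, vote_mask, steps=3):
    """Apply the (hypothetical) policy repeatedly to refine a box."""
    box = np.asarray(init_box, dtype=float)
    for _ in range(steps):
        field = predict_field(box)           # (H, W, 4) field of votes
        box = aggregate_votes(field, vote_mask)
    return box

def make_toy_field(target, shape=(4, 4)):
    """Stand-in for a learned predictor: every pixel votes for a box
    halfway between the current box and a fixed target box."""
    target = np.asarray(target, dtype=float)

    def predict_field(box):
        improved = box + 0.5 * (target - box)
        return np.broadcast_to(improved, shape + (4,)).copy()

    return predict_field

if __name__ == "__main__":
    mask = np.ones((4, 4), dtype=bool)
    field_fn = make_toy_field([10, 20, 50, 60])
    print(refine([0, 0, 40, 40], field_fn, mask, steps=3))
```

In this toy setup each step halves the remaining gap to the target box, so more refinement steps move the estimate closer; a learned field would instead be conditioned on image features and the hand location and appearance.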