Grounding Predicates through Actions
Symbols representing abstract states such as "dish in dishwasher" or "cup on table" allow robots to reason over long horizons by hiding details unnecessary for high-level planning. Current methods for learning to identify symbolic states in visual data require large amounts of labeled training data, but manually annotating such datasets is prohibitively expensive because the number of predicates that may hold in an image grows combinatorially. We propose a novel method for automatically labeling symbolic states in large-scale video activity datasets by exploiting known pre- and post-conditions of actions. This automatic labeling scheme requires only weak supervision in the form of an action label that describes which action is demonstrated in each video. We apply our framework to an existing large-scale human activity dataset. We train predicate classifiers to identify symbolic relationships between objects when prompted with object bounding boxes, and they achieve 0.93 test accuracy. We further demonstrate that these predicate classifiers, trained on human data, can be applied to robot environments in a real-world task planning domain.
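To make the labeling idea concrete, the following is a minimal sketch of how an action label alone could yield predicate labels for video frames. It is an illustration under stated assumptions, not the authors' implementation: the action names, predicate strings, the `ACTIONS` table, the `label_video` function, and the choice to treat the first and last few frames as the pre- and post-condition states are all hypothetical.

```python
# Hypothetical action schemas (not from the paper): each maps a predicate
# to its truth value before ("pre") and after ("post") the action.
ACTIONS = {
    "put_cup_on_table": {
        "pre": {"in_hand(cup)": True, "on(cup, table)": False},
        "post": {"in_hand(cup)": False, "on(cup, table)": True},
    },
    "put_dish_in_dishwasher": {
        "pre": {"in(dish, dishwasher)": False},
        "post": {"in(dish, dishwasher)": True},
    },
}


def label_video(action, num_frames, edge_frames=2):
    """Return {frame_index: {predicate: bool}} for a weakly labeled video.

    Assumption: the first `edge_frames` frames show the pre-condition state
    and the last `edge_frames` frames show the post-condition state.
    Frames in between are left unlabeled because the state is ambiguous.
    """
    schema = ACTIONS[action]
    # Never let the two labeled windows overlap in very short videos.
    k = min(edge_frames, num_frames // 2)
    labels = {}
    for i in range(k):
        labels[i] = dict(schema["pre"])
    for i in range(num_frames - k, num_frames):
        labels[i] = dict(schema["post"])
    return labels


if __name__ == "__main__":
    for frame, predicates in sorted(label_video("put_cup_on_table", 10).items()):
        print(frame, predicates)
```

Under these assumptions, the only human input per video is the action name; the per-frame predicate labels that would train a classifier follow from the schema.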