A self-organizing neural network architecture for learning human-object interactions
The visual recognition of transitive actions, i.e., human-object interactions, is a key component enabling artificial systems to operate in natural environments. This challenging task requires, in addition to the recognition of articulated body actions, the extraction of semantic elements from the scene, such as the identity of the manipulated object. In this paper, we present a self-organizing neural network for the recognition of human-object interactions from RGB-D videos. Our model consists of a hierarchy of Grow When Required (GWR) networks which learn prototypical representations of body motion patterns and objects, and which develop action-object mappings in an unsupervised fashion. To demonstrate this ability, we report experimental results on a dataset of daily activities collected for the purpose of this study, as well as on a publicly available benchmark dataset. In line with neurophysiological studies, our self-organizing architecture exhibits higher neural activation for congruent action-object pairs learned during training than for artificially created incongruent ones. Despite learning without labels, our model achieves good classification accuracy on the benchmark dataset, performing competitively with strictly supervised state-of-the-art approaches.
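To make the building block concrete, the following is a minimal sketch of a single Grow When Required layer in the spirit of Marsland et al. (2002), the learning rule underlying each level of the hierarchy. This is an illustrative re-implementation under stated assumptions, not the authors' code; all hyperparameter names and values (e.g. `act_thresh`, `hab_thresh`, the learning rates) are placeholders chosen for demonstration.

```python
# Illustrative sketch of one Grow When Required (GWR) layer.
# Not the authors' implementation; hyperparameters are assumptions.
import numpy as np

class GWR:
    def __init__(self, dim, act_thresh=0.85, hab_thresh=0.1,
                 eps_b=0.2, eps_n=0.006, tau_b=0.3, max_age=50):
        rng = np.random.default_rng(0)
        self.w = [rng.standard_normal(dim), rng.standard_normal(dim)]  # prototypes
        self.h = [1.0, 1.0]          # habituation (firing) counters
        self.edges = {}              # (i, j) -> age, with i < j
        self.act_thresh = act_thresh # insertion threshold on BMU activity
        self.hab_thresh = hab_thresh # firing threshold for node insertion
        self.eps_b, self.eps_n = eps_b, eps_n
        self.tau_b, self.max_age = tau_b, max_age

    def _two_nearest(self, x):
        d = [np.linalg.norm(x - w) for w in self.w]
        order = np.argsort(d)
        return order[0], order[1], d[order[0]]

    def step(self, x):
        b, s, dist = self._two_nearest(x)
        activity = np.exp(-dist)                 # best-matching unit activation
        self.edges[tuple(sorted((b, s)))] = 0    # create or refresh edge b-s
        if activity < self.act_thresh and self.h[b] < self.hab_thresh:
            # Grow: insert a new node halfway between input and winner,
            # replacing the b-s edge with edges to the new node.
            r = len(self.w)
            self.w.append(0.5 * (self.w[b] + x))
            self.h.append(1.0)
            del self.edges[tuple(sorted((b, s)))]
            self.edges[tuple(sorted((b, r)))] = 0
            self.edges[tuple(sorted((s, r)))] = 0
        else:
            # Adapt: move winner and its topological neighbours toward x,
            # modulated by how habituated each node already is.
            self.w[b] = self.w[b] + self.eps_b * self.h[b] * (x - self.w[b])
            for (i, j) in list(self.edges):
                if b in (i, j):
                    n = j if i == b else i
                    self.w[n] = self.w[n] + self.eps_n * self.h[n] * (x - self.w[n])
        # Habituate the winner (simplified exponential decay toward 0).
        self.h[b] -= self.tau_b * self.h[b]
        # Age edges incident to the winner; prune edges that grow too old.
        for (i, j) in list(self.edges):
            if b in (i, j):
                self.edges[(i, j)] += 1
                if self.edges[(i, j)] > self.max_age:
                    del self.edges[(i, j)]
        return b, activity
```

Roughly in line with the hierarchy described above, one such network per level would be trained on frame-wise body-pose or object feature vectors, with higher levels receiving sequences of lower-level best-matching units; the action-object mapping itself would be learned by a further associative layer, which this sketch does not cover.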