Unsupervised learning of image motion by recomposing sequences
We propose a new method for learning a representation of image motion in an unsupervised fashion. We do so by learning an image sequence embedding that respects associativity and invertibility properties of composed sequences with known temporal order. This procedure makes minimal assumptions about scene content, and the resulting networks learn to exploit rigid and non-rigid motion cues. We show that a deep neural network trained to respect these constraints implicitly identifies the characteristic motion patterns of many different sequence types. Our network architecture consists of a CNN followed by an LSTM and is structured to learn motion representations over sequences of arbitrary length. We demonstrate that a network trained using our unsupervised procedure on real-world sequences of human actions and vehicle motion can capture semantic regions corresponding to the motion in the scene, and not merely image-level differences, without requiring any motion labels. Furthermore, we present results that suggest our method can be used to extract information useful for independent motion tracking, localization, and nearest neighbor identification. Our results suggest that this representation may be useful for motion-related tasks where explicit labels are often very difficult to obtain.
READ FULL TEXT