Video prediction is a longstanding challenge in computer vision. While neural network models have provided significant advancements in the high-dimensional data regime, these advancements typically fail to provide an interpretable representation of the data-generating distribution. Juxtaposed against increased adoption of neural learning systems in safety-critical domains, a growing segment of the community is advocating for the incorporation of physical priors to improve both the interpretability of representations and the generalizability of neural network models [lake2017building, higgins2018towards]. In line with this call, we present a video prediction model that utilizes well-understood physical principles to capture the time evolution of the underlying data-generating process. In video prediction, high-dimensional images represent observations corresponding to low-dimensional states in a dynamical system. This correspondence is captured by the manifold hypothesis, which posits the existence of a low-dimensional manifold for high-dimensional data [bengio2013representation]. By identifying the low-dimensional state-space manifold, well-understood physical principles can be leveraged to model the evolution of the system.
The time evolution of a physical system is determined by its equations of motion. Given the system Lagrangian, which is system kinetic energy minus system potential energy, the equations of motion can be derived using the Euler-Lagrange equations [goldstein2002classical]. Previous works [cranmer2020lagrangian, lutter2019deep] have shown that it is possible to learn the system Lagrangian from low-dimensional measurements. In this work, we present an approach to simultaneously learn a low-dimensional state-space representation and system Lagrangian from high-dimensional image data. In our approach, images are mapped to and from the low-dimensional state-space by an auto-encoding neural network and initial states are integrated forward using the equations of motion determined by the system Lagrangian and Euler-Lagrange equations. Towards interpretability of the representation, the inertia tensor, which determines the kinetic energy, and the potential energy are parameterized by distinct neural networks. A pictorial representation of our model (LagNetViP) is given in Figure 1.
The efficacy of our approach requires that the auto-encoder and system Lagrangian agree on an appropriate low-dimensional representation. To encourage this, we define the training loss as the sum of three terms: (1) the reconstruction loss of the auto-encoded image sequence, (2) the reconstruction loss of the latent trajectory generated by the learned dynamics and (3) a mean absolute difference between the auto-encoded image sequence and the latent trajectory generated by the learned dynamics. We validate the importance of each term through an ablation study and provide a qualitative assessment using image sequences rendered in modified OpenAI gym Pendulum-V0 and Acrobot environments.
Recurrent neural networks (RNNs) have been applied to a broad class of sequence prediction problems including language modeling, machine translation, image processing and audio processing [graves2013generating, cho2014learning, srivastava2015highway, oord2016wavenet]. The utility of RNNs stems from the architectural structure, which enforces parameter sharing and encourages the network to learn statistics that generalize across sequences.
When sequential data can be interpreted as the observations of a dynamical system, the structure of difference or differential equations can be leveraged for prediction. Sequential image data is necessarily discrete although the generating process for image data is continuous. Consequently, there is a disparity in modeling approaches; watter2015embed and karl2016deep model the nature of the data with a (generative) discrete-time dynamical model, while yildiz2019ode model the continuous generating process with a continuous-time dynamical model. The related work of [chen2018neural]
The structure of the output space of a dynamical system can also be leveraged in learning [stewart2017label]. In [greydanus2019hamiltonian, toth2019hamiltonian, bertalan2019learning], the authors model the underlying dynamics of sequential data assuming the Hamiltonian structure. Dynamical state updates are generated using Hamilton’s equations. For high-dimensional data, greydanus2019hamiltonian and toth2019hamiltonian learn a map from the observation space to a low-dimensional phase space where Hamilton’s equations can be applied. A limitation of these approaches is that they are only applicable in the context of conservative systems. In [desmond2019symplectic], the authors surpass this limitation by assuming the Port-Hamiltonian structure, which can be used to model nonconservative systems.
In [lutter2019deep, cranmer2020lagrangian, saemundsson2019variational, zhong2020unsupervised], the authors assume the Lagrangian structure, which naturally allows for the incorporation of nonconservative forces. saemundsson2019variational and zhong2020unsupervised introduce generative models that leverage the variational auto-encoder (VAE) formulation to learn a representation of the low-dimensional state-space from high-dimensional images. saemundsson2019variational use a Gaussian prior on the latent code but set its dimension substantially higher than the number of degrees of freedom inherent to the system. This architectural choice limits interpretability of the learned coordinate representation and may have adverse effects in the control setting. In [zhong2020unsupervised] (work concurrent to ours), the authors select the latent dimension according to the number of degrees of freedom in the system. They find that with this choice the standard Gaussian prior severely inhibits learning and instead choose to apply system-specific priors on the latent code.
In this work we introduce a discriminative model with latent dimension consistent with the number of degrees of freedom inherent to the system but without the need for system specific structural priors. Other discriminative models that leverage the Lagrangian structure are [lutter2019deep] and [cranmer2020lagrangian]. Our model differs from [lutter2019deep] in that we learn and apply the forward model for dynamical state prediction whereas [lutter2019deep] learn the inverse model for generalized force prediction and control. Moreover, we apply our approach to high-dimensional observations which neither [lutter2019deep] nor [cranmer2020lagrangian] attempt.
In this section, we provide a brief introduction to Lagrangian dynamics. For the sake of brevity, the exposition is kept terse; interested readers can find further details in [spong2008robot, goldstein2002classical].
A Lagrangian mechanical system has an associated configuration space $\mathcal{Q}$ which, loosely speaking, includes all the feasible configurations (or poses) $q$ of the mechanical system; e.g., for a simple pendulum, the configuration space can be defined as the space of all possible angles. We denote the rate of change of the configuration by $\dot{q}$. Mathematically, $\mathcal{Q}$ takes on the structure of a manifold with $q$ being a set of coordinates on it and the tuple $(q, \dot{q})$ lies in its tangent bundle $T\mathcal{Q}$. The dimension of $\mathcal{Q}$ corresponds to the degrees of freedom of the dynamical system and will be denoted by $n$ throughout the paper.
The Lagrangian $L : T\mathcal{Q} \to \mathbb{R}$ is a function that maps the tangent space to a scalar. To define the Lagrangian first requires the introduction of the kinetic energy $T(q, \dot{q})$ and the potential energy $V(q)$. Intuitively, the kinetic energy is the energy possessed by a mechanical system by virtue of its motion whereas the potential energy is the energy stored in a mechanical system due to its configuration. We assume the form of the kinetic energy to be quadratic in the velocity, as follows:
$$T(q, \dot{q}) = \frac{1}{2}\, \dot{q}^\top M(q)\, \dot{q}, \tag{1}$$
where $M(q) \in \mathbb{R}^{n \times n}$ is the positive definite inertia matrix. The kinetic energy of any mechanical system with holonomic constraints satisfies the quadratic form (1); see [spong2008robot] for examples. Hence, learning the quadratic form (1) allows us to embed more structure in our architecture without compromising the richness of the class of systems that can be addressed by our approach. Indeed, the quadratic form of the kinetic energy has also been adopted in SymODEN [desmond2019symplectic] and DeLaN [lutter2019deep]. With the kinetic and the potential energy introduced above, we can express the Lagrangian as the difference between them:
$$L(q, \dot{q}) = T(q, \dot{q}) - V(q).$$
The equations of motion for the Lagrangian dynamical system can be conveniently uncovered by the following operation on the Lagrangian (Euler-Lagrange equations):
$$\frac{d}{dt} \frac{\partial L}{\partial \dot{q}} - \frac{\partial L}{\partial q} = \tau,$$
where $\tau$ represents the generalized forces, which encapsulate the effect of exogenous influences on the evolution of the Lagrangian system. For systems for which the total energy is conserved, $\tau = 0$. However, the explicit presence of $\tau$ in our Lagrangian formalism allows us to handle nonconservative influences with ease, e.g., friction in the pendulum system.
Substituting the quadratic form (1) into the Euler-Lagrange equations yields equations of motion of the form
$$M(q)\, \ddot{q} + C(q, \dot{q})\, \dot{q} + G(q) = \tau,$$
where $M(q)$ is the inertia matrix, $C(q, \dot{q})$ is the Coriolis term, $G(q) = \partial V / \partial q$ is the potential force, and $\tau$ are the generalized forces.
Noting that $M(q)$ is positive definite, we can solve for the acceleration $\ddot{q}$:
$$\ddot{q} = M(q)^{-1} \left( \tau - C(q, \dot{q})\, \dot{q} - G(q) \right).$$
The Lagrangian dynamics admit further structure that allows us to express the components of the Coriolis term $C(q, \dot{q})$ as a function of the components of the inertia matrix $M(q)$. For any $i \in \{1, \dots, n\}$, we can express the $i$-th row of $C(q, \dot{q})$, denoted by $C_i$, as $C_i = \dot{q}^\top \Gamma_i(q)$ with $\Gamma_i(q) \in \mathbb{R}^{n \times n}$. Further, the terms $[\Gamma_i]_{jk}$ take the form [spong2008robot]:
$$[\Gamma_i]_{jk} = \frac{1}{2} \left( \frac{\partial M_{ij}}{\partial q_k} + \frac{\partial M_{ik}}{\partial q_j} - \frac{\partial M_{jk}}{\partial q_i} \right).$$
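To make the relationship between the inertia matrix, the Coriolis term and the acceleration concrete, the following is a minimal NumPy sketch rather than the implementation used in this work. The function names, the use of central finite differences for the partial derivatives of the inertia matrix, and the test system (a point mass in polar coordinates with no potential) are all assumptions introduced for illustration.

```python
import numpy as np

def inertia_derivatives(M_fn, q, eps=1e-6):
    """Central finite differences: dM[k, i, j] approximates dM_ij / dq_k."""
    n = q.size
    dM = np.zeros((n, n, n))
    for k in range(n):
        step = np.zeros(n)
        step[k] = eps
        dM[k] = (M_fn(q + step) - M_fn(q - step)) / (2.0 * eps)
    return dM

def coriolis_matrix(M_fn, q, qdot, eps=1e-6):
    """C_ij = sum_k Gamma_ijk * qdot_k, with
    Gamma_ijk = 0.5 * (dM_ij/dq_k + dM_ik/dq_j - dM_jk/dq_i)."""
    dM = inertia_derivatives(M_fn, q, eps)
    gamma = 0.5 * (np.einsum("kij->ijk", dM)
                   + np.einsum("jik->ijk", dM)
                   - dM)
    return np.einsum("ijk,k->ij", gamma, qdot)

def acceleration(M_fn, G_fn, q, qdot, tau=None, eps=1e-6):
    """Solve M(q) qddot = tau - C(q, qdot) qdot - G(q) for qddot."""
    tau = np.zeros_like(q) if tau is None else tau
    C = coriolis_matrix(M_fn, q, qdot, eps)
    return np.linalg.solve(M_fn(q), tau - C @ qdot - G_fn(q))

# Hypothetical test system: unit point mass in polar coordinates q = (r, theta).
def polar_inertia(q, mass=1.0):
    return np.diag([mass, mass * q[0] ** 2])

def zero_force(q):
    return np.zeros_like(q)

q = np.array([2.0, 0.3])
qdot = np.array([0.5, 1.5])
qddot = acceleration(polar_inertia, zero_force, q, qdot)
print(qddot)
```

For this test system the free-particle equations in polar coordinates give $\ddot{r} = r\dot{\theta}^2$ and $\ddot{\theta} = -2\dot{r}\dot{\theta}/r$, which the sketch reproduces up to finite-difference error.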
Lagrangian neural networks
In this section, we review estimation of the system Lagrangian in the context of low-dimensional state-space measurements and introduce a strategy for estimating the system Lagrangian from (high-dimensional) image data.
Learning from state-space measurements
We first outline our approach for estimating the system Lagrangian from low-dimensional position-velocity measurements. The system Lagrangian is modelled as the difference between the kinetic and potential energies where the mass matrix and potential energy function are parameterized by neural networks with parameters $\theta$ and $\phi$, respectively:
$$L_{\theta, \phi}(q, \dot{q}) = \frac{1}{2}\, \dot{q}^\top M_\theta(q)\, \dot{q} - V_\phi(q).$$
This formulation differs from previous work [cranmer2020lagrangian] where neither the kinetic energy nor the potential energy are explicitly modelled and from [lutter2019deep] where the potential force is modelled instead of the potential energy. To ensure invertibility of the mass matrix we use an intermediate matrix of the same shape and compute by
In our experiments
is set to the dimension of the position vector.
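One common way to guarantee an invertible mass matrix from an unconstrained intermediate matrix of the same shape is to form a Gram matrix and add a small multiple of the identity. The sketch below illustrates that general idea only. The formula $M = A A^\top + \epsilon I$, the function name, and the value of `eps` are assumptions and may differ from the construction used in this work.

```python
import numpy as np

def positive_definite_from(A, eps=1e-3):
    """Map an arbitrary square matrix A to a symmetric positive definite one.

    A @ A.T is symmetric positive semi-definite; adding eps * I (eps > 0)
    bounds every eigenvalue below by eps, so the result is invertible.
    """
    A = np.asarray(A, dtype=float)
    return A @ A.T + eps * np.eye(A.shape[0])

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 2))          # stand-in for a network output
M = positive_definite_from(A)
M_degenerate = positive_definite_from(np.zeros((2, 2)))  # worst case: A = 0
print(np.linalg.eigvalsh(M), np.linalg.eigvalsh(M_degenerate))
```

Even when the intermediate matrix is singular, the smallest eigenvalue of the result is `eps`, so the linear solve for the acceleration remains well defined.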
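The parameters of the system Lagrangian, $\theta$ and $\phi$, are estimated by iterative minimization of the mean absolute difference between ground truth position-velocity measurement sequences denoted $x = (x_1, \dots, x_N)$, with $x_t = (q_t, \dot{q}_t)$ and $t = 1, \dots, N$, and predicted sequences $\hat{x} = (\hat{x}_1, \dots, \hat{x}_N)$ defined similarly. The minimization problem is given by
$$\min_{\theta, \phi}\; \frac{1}{N} \sum_{t=1}^{N} \lVert x_t - \hat{x}_t \rVert_1.$$
Predicted sequences are computed recursively from an initial ground truth measurement and the current values of $\theta$ and $\phi$. Concretely, the predicted measurement $\hat{x}_{t+1} = (\hat{q}_{t+1}, \dot{\hat{q}}_{t+1})$ is computed from the preceding measurement $\hat{x}_t$ with $\hat{x}_1 = x_1$ and the dynamical update given by
$$\ddot{\hat{q}}_t = M_\theta(\hat{q}_t)^{-1} \left( \tau - C_\theta(\hat{q}_t, \dot{\hat{q}}_t)\, \dot{\hat{q}}_t - G_\phi(\hat{q}_t) \right).$$
The Coriolis term $C_\theta$ and potential force $G_\phi$ are discussed in the previous section. In our experiments we consider systems without external forces, that is $\tau = 0$, and perform numerical integration with the Euler method. Note that the use of more sophisticated numerical integrators, in particular variational integrators, would likely improve performance.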
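The recursive prediction and the mean absolute difference objective can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the authors' code: the function names, the step size `dt`, and the closed-form pendulum acceleration standing in for the learned dynamics are all hypothetical.

```python
import numpy as np

def euler_rollout(x0, accel_fn, dt, num_steps):
    """Roll a position-velocity state x = (q, qdot) forward with explicit Euler.

    The first element of the returned sequence is the initial measurement x0.
    """
    x0 = np.asarray(x0, dtype=float)
    n = x0.size // 2
    traj = [x0]
    for _ in range(num_steps - 1):
        q, qdot = traj[-1][:n], traj[-1][n:]
        qddot = accel_fn(q, qdot)
        traj.append(np.concatenate([q + dt * qdot, qdot + dt * qddot]))
    return np.stack(traj)

def mean_absolute_difference(x, x_hat):
    """Mean absolute difference between two sequences of the same shape."""
    return float(np.mean(np.abs(np.asarray(x) - np.asarray(x_hat))))

# Stand-in for the learned dynamics: frictionless pendulum, tau = 0.
def pendulum_accel(q, qdot, g_over_l=9.81):
    return -g_over_l * np.sin(q)

x0 = np.array([np.pi / 2, 0.0])
pred = euler_rollout(x0, pendulum_accel, dt=0.05, num_steps=20)
loss = mean_absolute_difference(pred, pred)  # zero against itself
print(pred[:2], loss)
```

In training, `accel_fn` would be replaced by the acceleration computed from the learned mass matrix and potential energy, and the loss would compare the rollout against ground truth measurements.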
Learning from image sequences
In this section we outline our approach for estimating the state-space representation and system Lagrangian from high-dimensional observations. Since we cannot perform velocity prediction from a single image, we use observations, which we define as image tuples. Our prediction pipeline maps high-dimensional observations to a low-dimensional state-space representation whose structure is learned during the training phase. We predict future low-dimensional states using the Euler-Lagrange equations, then map the resulting sequence back to the observation space giving the predicted image sequence.
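To compress and reconstruct observations of system trajectories given as temporal sequences of high-dimensional observations $y = (y_1, \dots, y_N)$, with $y_t$ denoting the observation at time step $t$, we use an auto-encoding neural network $D \circ E$, where $D$ and $E$ are decoding and encoding networks respectively. We denote encoded observation sequences $z = (z_1, \dots, z_N)$, with $z_t = E(y_t)$, and interpret $z_t$ as a position-velocity measurement. Predicted sequences $\hat{z} = (\hat{z}_1, \dots, \hat{z}_N)$ are determined as described in the previous section.
The parameters of the auto-encoding network and of the system Lagrangian are estimated jointly by iterative minimization of the three-component loss function:
$$\mathcal{L} = \mathcal{L}_{ae} + \mathcal{L}_{dyn} + \mathcal{L}_{lat},$$
where $\mathcal{L}_{ae}$ is the auto-encoding reconstruction loss:
$$\mathcal{L}_{ae} = \frac{1}{N} \sum_{t=1}^{N} \lVert y_t - \tilde{y}_t \rVert,$$
with $\tilde{y}_t = D(E(y_t))$; $\mathcal{L}_{dyn}$ is the predicted sequence reconstruction loss:
$$\mathcal{L}_{dyn} = \frac{1}{N} \sum_{t=1}^{N} \lVert y_t - D(\hat{z}_t) \rVert,$$
and $\mathcal{L}_{lat}$ is the distance between encoded and predicted sequences:
$$\mathcal{L}_{lat} = \frac{1}{N} \sum_{t=1}^{N} \lVert z_t - \hat{z}_t \rVert_1.$$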
This formulation differs from previous work [greydanus2019hamiltonian] where a structural prior is imposed on the embedding space. In our experiments we set .
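The three loss components can be sketched in a few lines. The following NumPy sketch is illustrative only and rests on several assumptions: the function and argument names are hypothetical, the three terms are summed with equal weight, and mean absolute error is used for all three terms (the text specifies a mean absolute difference only for the latent term). The toy encoder, decoder and predictor are placeholders for the learned networks and dynamics.

```python
import numpy as np

def three_term_loss(obs, encode, decode, predict_latents):
    """Return the total loss and its (ae, dyn, lat) components.

    obs: array of shape (N, ...), one observation per time step.
    predict_latents(z0, N): latent rollout of length N starting from z0.
    """
    z = np.stack([encode(y) for y in obs])              # encoded sequence
    z_hat = predict_latents(z[0], len(obs))             # predicted latents
    recon = np.stack([decode(z_t) for z_t in z])
    recon_hat = np.stack([decode(z_t) for z_t in z_hat])
    loss_ae = float(np.mean(np.abs(obs - recon)))       # auto-encoder term
    loss_dyn = float(np.mean(np.abs(obs - recon_hat)))  # decoded prediction term
    loss_lat = float(np.mean(np.abs(z - z_hat)))        # latent agreement term
    return loss_ae + loss_dyn + loss_lat, (loss_ae, loss_dyn, loss_lat)

# Toy stand-ins: 4-d "images" whose first two entries are the state.
encode = lambda y: y[:2]
decode = lambda z: np.concatenate([z, np.zeros(2)])
static_predictor = lambda z0, num_steps: np.tile(z0, (num_steps, 1))

obs = np.array([[0.0, 0.0, 0.0, 0.0],
                [1.0, 0.0, 0.0, 0.0]])
total, (l_ae, l_dyn, l_lat) = three_term_loss(obs, encode, decode, static_predictor)
print(total, l_ae, l_dyn, l_lat)
```

In the toy example the auto-encoder is exact, so only the terms involving the predicted latents are nonzero. Dropping one of the three components corresponds to one of the ablated variants evaluated in the experiments.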
In this section we empirically validate the proposed approach. Toward this end, we perform an ablation study in which components of the cost function are removed, and present a qualitative comparison of the results. Specifically, we compare the following models:
LagNetViP: The proposed model that simultaneously learns a low-dimensional state-space representation and system Lagrangian from high-dimensional image data. The training loss is the sum of three terms: (1) the auto-encoder reconstruction loss, (2) the reconstruction loss of the latent trajectory generated by the learned dynamics and (3) a mean absolute difference between the auto-encoded image sequence and the latent trajectory generated by the learned dynamics.
LagNetViP-dyn: The proposed model trained without a reconstruction loss on the latent trajectory generated by the learned dynamics.
LagNetViP-lat: The proposed model trained without a mean absolute difference between the auto-encoded image sequence and the latent trajectory generated by the learned dynamics.
LagNetViP-ae: The proposed model trained without the auto-encoder reconstruction loss.
In each model, we use a symmetric auto-encoding network to map high-dimensional observations to a low-dimensional state-space representation. The encoding network is a four-layer neural network: three convolutional layers followed by a fully-connected layer (see Table 1), where each convolutional layer is followed by a ReLU nonlinear unit. Both the inertia tensor and potential energy functions are parameterized by three-layer fully-connected neural networks with 200 hidden units in each layer and tanh nonlinear units.
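As a rough illustration of the fully-connected networks described above, the following NumPy sketch builds a tanh network with 200 hidden units and evaluates it as a potential energy function. It is a sketch under assumptions: "three layer" is read here as three affine layers (two hidden layers of 200 units and a linear output), and the initialization scheme and function names are hypothetical.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Create (weight, bias) pairs for consecutive layer sizes."""
    return [(rng.normal(scale=1.0 / np.sqrt(m), size=(m, k)), np.zeros(k))
            for m, k in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, x):
    """Apply tanh after every layer except the final linear layer."""
    h = np.asarray(x, dtype=float)
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)
    W, b = params[-1]
    return h @ W + b

rng = np.random.default_rng(0)
n = 1                                             # e.g. one degree of freedom
potential_net = init_mlp([n, 200, 200, 1], rng)   # q -> scalar potential
inertia_net = init_mlp([n, 200, 200, n * n], rng) # q -> n x n intermediate matrix

q = np.array([0.3])
V = mlp_forward(potential_net, q)
A = mlp_forward(inertia_net, q).reshape(n, n)
print(V.shape, A.shape)
```

The output of the inertia network would still need to be mapped to a positive definite matrix before it is used in the equations of motion.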
To demonstrate the efficacy of our approach on video prediction problems, we consider image sequences generated using modified OpenAI gym Pendulum-V0 and Acrobot environments. To generate image sequences we modify the OpenAI gym environments to use RK4 integration instead of Euler integration and increase the width of the pendulum and Acrobot arms in rendering.
The pendulum dataset consists of 10,000 trajectories. Each trajectory is comprised of observations and each observation is constructed by concatenating three sequential images along the channel dimension. We train on 8,000 of the 10,000 trajectories using the Adam optimizer [kingma2014adam] with a learning rate of and weight decay of .
The Acrobot dataset consists of 10,000 trajectories. Each trajectory is comprised of observations and each observation is constructed by concatenating three sequential images along the channel dimension. We train on 8,000 of the 10,000 trajectories using the Adam optimizer [kingma2014adam] with a learning rate of and weight decay of .
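For both datasets, an observation is built from three sequential images concatenated along the channel dimension. The sketch below shows one way to do this in NumPy. It assumes a channels-last layout and an overlapping sliding window with stride one; neither choice is stated in the text, and the function name is hypothetical.

```python
import numpy as np

def stack_observations(frames, k=3):
    """Concatenate every k consecutive frames along the channel axis.

    frames: array of shape (T, H, W, C).
    returns: array of shape (T - k + 1, H, W, k * C).
    """
    frames = np.asarray(frames)
    windows = [np.concatenate(list(frames[t:t + k]), axis=-1)
               for t in range(len(frames) - k + 1)]
    return np.stack(windows)

# Five dummy single-channel 2x2 frames.
frames = np.arange(5 * 2 * 2 * 1, dtype=float).reshape(5, 2, 2, 1)
obs = stack_observations(frames)
print(obs.shape)
```

Each resulting observation carries enough temporal context for the encoder to infer a velocity as well as a position.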
Figures 2 and 3 present qualitative comparisons on randomly selected test set trajectories from the pendulum and Acrobot datasets. On both datasets LagNetViP is able to extrapolate beyond the last observation in the test set trajectory. Removing the reconstruction loss on the latent trajectory generated by the learned dynamics has the most detrimental effect on reconstruction quality: in some cases LagNetViP-dyn is unable to provide reasonable reconstructions even within the range of the test set. LagNetViP-lat performs well within the test set range but is unable to extrapolate farther, and LagNetViP-ae exhibits good performance but fails to outperform LagNetViP.
Consider the trajectory generated by the LagNetViP model in the left-most panel of Figure 2. The pendulum begins pointing up and to the left of the fixed point. In the first 10 frames, the pendulum swings down without quite reaching the downward-pointing position, as in the test set trajectory. LagNetViP is able to extrapolate beyond the test set trajectory, predicting the pendulum's swing through the downward-pointing position and its upward swing on the right side.
Discussion and Conclusion
In this work, we introduce a video prediction model where equations of motion are explicitly constructed from learned representations of underlying physical quantities. The low-dimensional state-space representation and system Lagrangian are learned simultaneously. Images are mapped to and from the low-dimensional state-space by an auto-encoding network and initial states are integrated forward using the equations of motion determined by the system Lagrangian and Euler-Lagrange equations. The approach outperforms the baseline models, with strong reconstruction performance on the pendulum system and promising results on the chaotic Acrobot system.
We thank Shinkyu Park, Desmond Zhong, David Isele and Patricia Posey for their insights. This research has been supported in part by ONR grant #N00014-18-1-2873 and by the School of Engineering and Applied Science at Princeton University through the generosity of William Addy ’82.