Introduction
Video prediction is a longstanding challenge in computer vision. While neural network models have driven significant advances in the high-dimensional data regime, these advances typically fail to provide an interpretable representation of the data-generating distribution. Juxtaposed against the increased adoption of neural learning systems in safety-critical domains, a growing segment of the community is advocating for the incorporation of physical priors to improve both the interpretability of representations and the generalizability of neural network models
[lake2017building, higgins2018towards]. In line with this call, we present a video prediction model that utilizes well-understood physical principles to capture the time evolution of the underlying data-generating process. In video prediction, high-dimensional images represent observations corresponding to low-dimensional states in a dynamical system. This correspondence is captured by the manifold hypothesis, which posits the existence of a low-dimensional manifold for high-dimensional data
[bengio2013representation]. By identifying the low-dimensional state-space manifold, well-understood physical principles can be leveraged to model the evolution of the system.

The time evolution of a physical system is determined by its equations of motion. Given the system Lagrangian, which is the system kinetic energy minus the system potential energy, the equations of motion can be derived using the Euler-Lagrange equations [goldstein2002classical]. Previous works [cranmer2020lagrangian, lutter2019deep] have shown that it is possible to learn the system Lagrangian from low-dimensional measurements. In this work, we present an approach to simultaneously learn a low-dimensional state-space representation and the system Lagrangian from high-dimensional image data. In our approach, images are mapped to and from the low-dimensional state-space by an autoencoding neural network, and initial states are integrated forward using the equations of motion determined by the system Lagrangian and the Euler-Lagrange equations. Towards interpretability of the representation, the inertia tensor, which determines the kinetic energy, and the potential energy are parameterized by distinct neural networks. A pictorial representation of our model (LagNetViP) is given in Figure 1.
The efficacy of our approach requires that the autoencoder and the system Lagrangian agree on an appropriate low-dimensional representation. To encourage this, we define the training loss as the sum of three terms: (1) the reconstruction loss of the autoencoded image sequence, (2) the reconstruction loss of the latent trajectory generated by the learned dynamics, and (3) the mean absolute difference between the autoencoded image sequence and the latent trajectory generated by the learned dynamics. We validate the importance of each term through an ablation study and provide a qualitative assessment using image sequences rendered in modified OpenAI gym Pendulum-v0 and Acrobot environments.
Related work
Recurrent neural networks (RNNs) have been applied to a broad class of sequence prediction problems including language modeling, machine translation, image processing and audio processing [graves2013generating, cho2014learning, srivastava2015highway, oord2016wavenet]. The utility of RNNs stems from the architectural structure, which enforces parameter sharing and encourages the network to learn statistics that generalize across sequences.
When sequential data can be interpreted as the observations of a dynamical system, the structure of difference or differential equations can be leveraged for prediction. Sequential image data is necessarily discrete although the generating process for image data is continuous. Consequently, there is a disparity in modeling approaches: watter2015embed and karl2016deep model the nature of the data with a (generative) discrete-time dynamical model, while yildiz2019ode model the continuous generating process with a continuous-time dynamical model. The related work of [chen2018neural] facilitates estimation of continuous-time dynamics with neural networks by allowing for backpropagation through arbitrary ODE solvers.
The structure of the output space of a dynamical system can also be leveraged in learning [stewart2017label]. In [greydanus2019hamiltonian, toth2019hamiltonian, bertalan2019learning], the authors model the underlying dynamics of sequential data assuming the Hamiltonian structure. Dynamical state updates are generated using Hamilton's equations. For high-dimensional data, greydanus2019hamiltonian and toth2019hamiltonian learn a map from the observation space to a low-dimensional phase space where Hamilton's equations can be applied. A limitation of these approaches is that they are only applicable in the context of conservative systems. In [desmond2019symplectic], the authors surpass this limitation by assuming the Port-Hamiltonian structure, which can be used to model nonconservative systems.
In [lutter2019deep, cranmer2020lagrangian, saemundsson2019variational, zhong2020unsupervised], the authors assume the Lagrangian structure, which naturally allows for the incorporation of nonconservative forces. saemundsson2019variational and zhong2020unsupervised introduce generative models that leverage the variational autoencoder (VAE) formulation to learn a representation of the low-dimensional state-space from high-dimensional images. saemundsson2019variational use a Gaussian prior on the latent code but set its dimension substantially higher than the number of degrees of freedom inherent to the system. This architectural choice limits interpretability of the learned coordinate representation and may have adverse effects in the control setting. In [zhong2020unsupervised] (work concurrent to ours), the authors select the latent dimension according to the number of degrees of freedom in the system. They find that with this choice the standard Gaussian prior severely inhibits learning, and instead choose to apply system-specific priors on the latent code.

In this work we introduce a discriminative model with latent dimension consistent with the number of degrees of freedom inherent to the system, but without the need for system-specific structural priors. Other discriminative models that leverage the Lagrangian structure are [lutter2019deep] and [cranmer2020lagrangian]. Our model differs from [lutter2019deep] in that we learn and apply the forward model for dynamical state prediction, whereas [lutter2019deep] learn the inverse model for generalized force prediction and control. Moreover, we apply our approach to high-dimensional observations, which neither [lutter2019deep] nor [cranmer2020lagrangian] attempt.
Lagrangian dynamics
In this section, we provide a brief introduction to Lagrangian dynamics. For the sake of brevity, the exposition is kept terse; interested readers can find further details in [spong2008robot, goldstein2002classical].
A Lagrangian mechanical system has an associated configuration space which, loosely speaking, includes all the feasible configurations (or poses) of the mechanical system; e.g., for a simple pendulum, the configuration space can be defined as the space of all possible angles. We denote a configuration by $q$ and the rate of change of the configuration by $\dot{q}$. Mathematically, the configuration space takes on the structure of a manifold $\mathcal{Q}$, with $q$ being a set of coordinates on it, and the tuple $(q, \dot{q})$ lies in its tangent bundle $T\mathcal{Q}$. The dimension of $\mathcal{Q}$ corresponds to the degrees of freedom of the dynamical system and will be denoted by $n$ throughout the paper.
The Lagrangian $\mathcal{L} : T\mathcal{Q} \to \mathbb{R}$ is a function that maps the tangent bundle to a scalar. To define the Lagrangian first requires the introduction of the kinetic energy $T(q, \dot{q})$ and the potential energy $V(q)$. Intuitively, the kinetic energy is the energy possessed by a mechanical system by virtue of its motion, whereas the potential energy is the energy stored in a mechanical system due to its configuration. We assume the form of the kinetic energy to be quadratic in the velocity, as follows:
$T(q, \dot{q}) = \frac{1}{2} \dot{q}^\top M(q) \dot{q}$  (1)
where $M(q)$ is the positive-definite inertia matrix. The kinetic energy of any mechanical system with holonomic constraints satisfies the quadratic form (1); see [spong2008robot] for examples. Hence, learning the quadratic form (1) allows us to embed more structure in our architecture without compromising the richness of the class of systems that can be addressed by our approach. Indeed, the quadratic form of the kinetic energy has also been adopted in SymODEN [desmond2019symplectic] and DeLaN [lutter2019deep]. With the kinetic and the potential energy introduced above, we can express the Lagrangian as the difference between them:
$\mathcal{L}(q, \dot{q}) = T(q, \dot{q}) - V(q)$  (2)
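As a quick numeric check of the quadratic kinetic-energy form (1), consider the sketch below; the inertia matrix used is a hypothetical example, not one from the paper:

```python
import numpy as np

def kinetic_energy(M, qdot):
    # T(q, qdot) = 0.5 * qdot^T M(q) qdot, quadratic in the velocity
    return 0.5 * qdot @ M @ qdot

M = np.array([[2.0, 0.0], [0.0, 1.0]])  # hypothetical positive-definite inertia matrix
assert kinetic_energy(M, np.array([1.0, 1.0])) == 1.5
assert kinetic_energy(M, np.array([2.0, 2.0])) == 6.0  # doubling the velocity quadruples T
```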
The equations of motion for the Lagrangian dynamical system can be conveniently uncovered by the following operation on the Lagrangian (the Euler-Lagrange equations):
$\frac{d}{dt} \frac{\partial \mathcal{L}}{\partial \dot{q}} - \frac{\partial \mathcal{L}}{\partial q} = F$  (3)
where $F$ represents the generalized forces, which encapsulate the effect of exogenous influences on the evolution of the Lagrangian system. For systems in which the total energy is conserved, $F = 0$. However, the explicit presence of $F$ in our Lagrangian formalism allows us to handle nonconservative influences with ease, e.g., friction in the pendulum system.
The equations of motion (3) can be expanded further by using (1) and (2) in (3), giving:
$M(q) \ddot{q} + C(q, \dot{q}) \dot{q} + \nabla_q V(q) = F$  (4)
where $M(q)$ is the inertia matrix, $C(q, \dot{q}) \dot{q}$ is the Coriolis term, $\nabla_q V(q)$ is the potential force, and $F$ are the generalized forces.
Noting that $M(q)$ is positive definite, we can solve for the acceleration $\ddot{q}$:
$\ddot{q} = M(q)^{-1} \left( F - C(q, \dot{q}) \dot{q} - \nabla_q V(q) \right)$  (5)
The Lagrangian dynamics admit further structure that allows us to express the components of the Coriolis term as a function of the components of the inertia matrix $M(q)$. For any $k \in \{1, \dots, n\}$, we can express the $k$th row of $C(q, \dot{q})$, denoted by $c_k^\top$, through its entries $c_{kj} = \sum_{i=1}^{n} c_{ijk}(q) \, \dot{q}_i$. Further, the terms $c_{ijk}$ take the form [spong2008robot]:

$c_{ijk}(q) = \frac{1}{2} \left( \frac{\partial m_{kj}}{\partial q_i} + \frac{\partial m_{ki}}{\partial q_j} - \frac{\partial m_{ij}}{\partial q_k} \right)$
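To make (5) concrete, consider a simple pendulum with mass $m$ and length $l$: here $M(q) = m l^2$ is constant, the Coriolis term vanishes, and $V(q) = -mgl\cos q$ gives the potential force $mgl \sin q$. A minimal numeric sketch (all function names are ours, not from the paper):

```python
import numpy as np

# Simple pendulum: q is the angle measured from the downward vertical.
m, l, g = 1.0, 1.0, 9.81

def inertia(q):
    return np.array([[m * l**2]])        # M(q), constant for this system

def coriolis(q, qdot):
    return np.array([[0.0]])             # C(q, qdot) vanishes for one DOF with constant M

def potential_grad(q):
    return np.array([m * g * l * np.sin(q[0])])  # dV/dq for V(q) = -m g l cos(q)

def acceleration(q, qdot, F=np.zeros(1)):
    # Equation (5): qddot = M(q)^{-1} (F - C(q, qdot) qdot - dV/dq)
    return np.linalg.solve(inertia(q), F - coriolis(q, qdot) @ qdot - potential_grad(q))

print(acceleration(np.array([np.pi / 2]), np.zeros(1)))  # approx. [-9.81]
```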
Lagrangian neural networks
In this section, we review estimation of the system Lagrangian in the context of low-dimensional state-space measurements and introduce a strategy for estimating the system Lagrangian from (high-dimensional) image data.
Learning from state-space measurements
We first outline our approach for estimating the system Lagrangian from low-dimensional position-velocity measurements. The system Lagrangian is modelled as the difference between the kinetic and potential energies, where the mass matrix $M(q)$ and the potential energy function $V(q)$ are parameterized by neural networks. This formulation differs from previous work [cranmer2020lagrangian], where neither the kinetic energy nor the potential energy is explicitly modelled, and from [lutter2019deep], where the potential force is modelled instead of the potential energy. To ensure invertibility of the mass matrix, we use an intermediate matrix $A(q)$ of the same shape and compute $M(q)$ by $M(q) = A(q) A(q)^\top$.
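A sketch of this factorization; the small diagonal offset `eps` is our own assumption, added to guarantee strict positive definiteness when the network output is rank-deficient, and the paper's exact construction may differ:

```python
import numpy as np

def make_mass_matrix(A, eps=1e-4):
    """Map an unconstrained square matrix A (e.g. a network output) to a
    symmetric positive-definite inertia matrix M = A A^T + eps * I.
    The eps * I offset is an assumption; it guards against rank-deficient A."""
    n = A.shape[0]
    return A @ A.T + eps * np.eye(n)

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 2))
M = make_mass_matrix(A)
assert np.allclose(M, M.T)                # symmetric
assert np.all(np.linalg.eigvalsh(M) > 0)  # positive definite, hence invertible
```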
In our experiments, $n$ is set to the dimension of the position vector $q$.

The parameters of the system Lagrangian are estimated by iterative minimization of the mean absolute difference between ground-truth position-velocity measurement sequences, denoted $\{(q_t, \dot{q}_t)\}_{t=1}^{T}$ with $q_t \in \mathbb{R}^n$ and $\dot{q}_t \in \mathbb{R}^n$, and predicted sequences $\{(\hat{q}_t, \hat{\dot{q}}_t)\}_{t=1}^{T}$ defined similarly. The minimization problem is given by
$\min_{M, V} \; \frac{1}{T} \sum_{t=1}^{T} \left( \| q_t - \hat{q}_t \|_1 + \| \dot{q}_t - \hat{\dot{q}}_t \|_1 \right)$  (6)
Predicted sequences are computed recursively from an initial ground-truth measurement $(q_1, \dot{q}_1)$ and the current values of $M(q)$ and $V(q)$. Concretely, the predicted measurement $(\hat{q}_{t+1}, \hat{\dot{q}}_{t+1})$ is computed from the preceding measurement $(\hat{q}_t, \hat{\dot{q}}_t)$ with step size $h$ and the dynamical update given by

$\hat{q}_{t+1} = \hat{q}_t + h \, \hat{\dot{q}}_t, \qquad \hat{\dot{q}}_{t+1} = \hat{\dot{q}}_t + h \, M(\hat{q}_t)^{-1} \left( F - C(\hat{q}_t, \hat{\dot{q}}_t) \hat{\dot{q}}_t - \nabla_q V(\hat{q}_t) \right)$
The Coriolis term and the potential force are discussed in the previous section. In our experiments we consider systems without external forces, that is, $F = 0$, and perform numerical integration with the Euler method. Note that the use of more sophisticated numerical integrators, in particular variational integrators, would likely improve performance.
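The Euler recursion described above can be sketched for the unit-length pendulum with $F = 0$; the function and variable names below are ours:

```python
import numpy as np

g_over_l = 9.81  # unit-length pendulum: qddot = -(g/l) sin(q)

def euler_rollout(q0, qd0, h=0.01, steps=200):
    """Roll an initial state forward with the explicit Euler method:
    q_{t+1} = q_t + h * qd_t;  qd_{t+1} = qd_t + h * qddot(q_t)."""
    traj = [(q0, qd0)]
    q, qd = q0, qd0
    for _ in range(steps):
        qdd = -g_over_l * np.sin(q)  # acceleration from (5) with F = 0
        q, qd = q + h * qd, qd + h * qdd
        traj.append((q, qd))
    return traj

traj = euler_rollout(q0=0.5, qd0=0.0)
```

Explicit Euler drifts in energy over long horizons, which is why the text notes that variational integrators would likely improve performance.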
Learning from image sequences
In this section we outline our approach for estimating the state-space representation and system Lagrangian from high-dimensional observations.[1] Our prediction pipeline maps high-dimensional observations to a low-dimensional state-space representation whose structure is learned during the training phase. We predict future low-dimensional states using the Euler-Lagrange equations, then map the resulting sequence back to the observation space, giving the predicted image sequence.

[1] Since we cannot perform velocity prediction from a single image, we use observations which we define as image tuples.
To compress and reconstruct observations of system trajectories, given as temporal sequences of high-dimensional observations $\{o_t\}_{t=1}^{T}$, we use an autoencoding neural network $d \circ e$, where $d$ and $e$ are decoding and encoding networks respectively. We denote encoded observation sequences $\{z_t\}_{t=1}^{T}$ with $z_t = e(o_t)$, and interpret $z_t$ as a position-velocity measurement. Predicted sequences $\{\hat{z}_t\}_{t=1}^{T}$ are determined as described in the previous section.
The parameters of the autoencoding network and of the system Lagrangian are estimated jointly by iterative minimization of the three-component loss function:

$\ell = \ell_{ae} + \ell_{dyn} + \ell_{lat}$

where $\ell_{ae}$ is the autoencoding reconstruction loss:

$\ell_{ae} = \frac{1}{T} \sum_{t=1}^{T} \| o_t - \hat{o}_t \|_1$

with $\hat{o}_t = d(e(o_t))$; $\ell_{dyn}$ is the predicted sequence reconstruction loss:

$\ell_{dyn} = \frac{1}{T} \sum_{t=1}^{T} \| o_t - d(\hat{z}_t) \|_1$

and $\ell_{lat}$ is the distance between encoded and predicted sequences:

$\ell_{lat} = \frac{1}{T} \sum_{t=1}^{T} \| e(o_t) - \hat{z}_t \|_1$
This formulation differs from previous work [greydanus2019hamiltonian] where a structural prior is imposed on the embedding space. In our experiments we set .
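The three-term loss can be sketched with toy linear stand-ins for the encoder, decoder, and dynamics rollout; all names, dimensions, and the equal weighting of the terms are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, latent_dim, T = 16, 2, 5

# Toy stand-ins for the encoder e, decoder d, and dynamics rollout (all hypothetical).
We = rng.standard_normal((latent_dim, obs_dim)) * 0.1  # "encoder" weights
Wd = rng.standard_normal((obs_dim, latent_dim)) * 0.1  # "decoder" weights
e = lambda o: We @ o
d = lambda z: Wd @ z

obs = rng.standard_normal((T, obs_dim))                   # observation sequence o_1..o_T
z_enc = np.array([e(o) for o in obs])                     # encoded sequence
z_pred = z_enc + 0.01 * rng.standard_normal(z_enc.shape)  # stand-in for the dynamics rollout

l_ae  = np.mean([np.abs(o - d(e(o))).mean() for o in obs])               # autoencoder reconstruction
l_dyn = np.mean([np.abs(o - d(z)).mean() for o, z in zip(obs, z_pred)])  # predicted-sequence reconstruction
l_lat = np.abs(z_enc - z_pred).mean()                                    # encoded/predicted latent distance
loss = l_ae + l_dyn + l_lat  # equal weighting is our assumption
```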
Empirical analysis
In this section we empirically validate the proposed approach. Toward this end, we perform an ablation study in which components of the cost function are removed, and a qualitative comparison of the results is presented. Specifically, we compare the following models:

LagNetViP: The proposed model that simultaneously learns a low-dimensional state-space representation and the system Lagrangian from high-dimensional image data. The training loss is the sum of three terms: (1) the autoencoder reconstruction loss, (2) the reconstruction loss of the latent trajectory generated by the learned dynamics, and (3) the mean absolute difference between the autoencoded image sequence and the latent trajectory generated by the learned dynamics.

LagNetViP-dyn: The proposed model trained without the reconstruction loss on the latent trajectory generated by the learned dynamics.

LagNetViP-lat: The proposed model trained without the mean absolute difference between the autoencoded image sequence and the latent trajectory generated by the learned dynamics.

LagNetViP-ae: The proposed model trained without the autoencoder reconstruction loss.
In each model, we use a symmetric autoencoding network to map high-dimensional observations to a low-dimensional state-space representation. The encoding network is a four-layer neural network: three convolutional layers followed by a fully-connected layer (see Table 1), where each convolutional layer is followed by a ReLU nonlinearity. Both the inertia tensor and potential energy functions are parameterized by three-layer fully-connected neural networks with 200 hidden units in each layer and tanh nonlinearities.
Layer | Filter     | Stride | Padding
------+------------+--------+--------
1     | 4x4x3x12   | 2      | 1
2     | 4x4x12x24  | 2      | 1
3     | 4x4x24x12  | 2      | 1
4P    | (4*4*12)x4 | -      | -
4A    | (4*4*12)x6 | -      | -
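The flattened input size (4*4*12) to layers 4P and 4A is consistent with 32x32 input images; the resolution is our assumption, since it is not stated here. Each 4x4 convolution with stride 2 and padding 1 halves the spatial size:

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    # Standard convolution output-size formula
    return (size + 2 * padding - kernel) // stride + 1

size = 32                  # assumed input resolution
for _ in range(3):         # three convolutional layers
    size = conv_out(size)  # 32 -> 16 -> 8 -> 4
assert size == 4
assert size * size * 12 == 4 * 4 * 12  # flattened input to layers 4P / 4A
```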
To demonstrate the efficacy of our approach on video prediction problems, we consider image sequences generated using modified OpenAI gym Pendulum-v0 and Acrobot environments. To generate image sequences, we modify the OpenAI gym environments to use RK4 integration instead of Euler integration and increase the width of the pendulum and Acrobot arms in rendering.
Pendulum Dataset.
The pendulum dataset consists of 10,000 trajectories. Each trajectory is comprised of observations, and each observation is constructed by concatenating three sequential images along the channel dimension. We train on 8,000 of the 10,000 trajectories using the Adam optimizer [kingma2014adam] with a learning rate of and weight decay of .
Acrobot Dataset.
The Acrobot dataset consists of 10,000 trajectories. Each trajectory is comprised of observations, and each observation is constructed by concatenating three sequential images along the channel dimension. We train on 8,000 of the 10,000 trajectories using the Adam optimizer [kingma2014adam] with a learning rate of and weight decay of .
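Observation construction by channel-wise concatenation can be sketched as follows; the 32x32 resolution and single-channel frames are our assumptions (single-channel frames would be consistent with the 3-channel input filter of layer 1 in Table 1):

```python
import numpy as np

H, W = 32, 32  # assumed frame resolution
frames = [np.zeros((H, W, 1)) for _ in range(3)]  # three sequential single-channel frames (assumption)

# An observation stacks consecutive frames along the channel dimension,
# giving the encoder enough temporal context to infer velocity.
obs = np.concatenate(frames, axis=-1)
assert obs.shape == (H, W, 3)
```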
Results
Figures 2 and 3 present qualitative comparisons on randomly selected test-set trajectories from the pendulum and Acrobot datasets. On both datasets, LagNetViP is able to extrapolate beyond the last observation in the test-set trajectory. Removing the reconstruction loss on the latent trajectory generated by the learned dynamics has the most damaging effect on reconstruction quality: LagNetViP-dyn is unable to provide reasonable reconstructions even within the range of the test set in some cases. LagNetViP-lat performs well within the test-set range but is unable to extrapolate farther, and LagNetViP-ae performs well but fails to outperform LagNetViP.
Consider the trajectory generated by the LagNetViP model in the leftmost panel of Figure 2. The position of the pendulum begins pointing up and to the left of the fixed point. In the first 10 frames, the pendulum swings down, not quite reaching the downward-pointing position, as in the test-set trajectory. LagNetViP extrapolates beyond the test-set trajectory, predicting the upward swing of the pendulum on the right side through the downward-pointing position.
Discussion and Conclusion
In this work, we introduce a video prediction model in which the equations of motion are explicitly constructed from learned representations of the underlying physical quantities. The low-dimensional state-space representation and system Lagrangian are learned simultaneously. Images are mapped to and from the low-dimensional state-space by an autoencoding network, and initial states are integrated forward using the equations of motion determined by the system Lagrangian and the Euler-Lagrange equations. The approach outperforms the ablated baselines, with strong reconstruction performance on the pendulum system and promising results on the chaotic Acrobot system.
Acknowledgements
We thank Shinkyu Park, Desmond Zhong, David Isele and Patricia Posey for their insights. This research has been supported in part by ONR grant #N000141812873 and by the School of Engineering and Applied Science at Princeton University through the generosity of William Addy ’82.