Disentangling Physical Dynamics from Unknown Factors for Unsupervised Video Prediction

by Vincent Le Guen, et al.

Leveraging physical knowledge described by partial differential equations (PDEs) is an appealing way to improve unsupervised video prediction methods. Since physics is too restrictive for describing the full visual content of generic videos, we introduce PhyDNet, a two-branch deep architecture, which explicitly disentangles PDE dynamics from unknown complementary information. A second contribution is to propose a new recurrent physical cell (PhyCell), inspired by data assimilation techniques, for performing PDE-constrained prediction in latent space. Extensive experiments conducted on four diverse datasets show the ability of PhyDNet to outperform state-of-the-art methods. Ablation studies also highlight the important gain brought by both disentanglement and PDE-constrained prediction. Finally, we show that PhyDNet presents interesting features for dealing with missing data and long-term forecasting.




Code Repositories


Code for our CVPR 2020 paper "Disentangling Physical Dynamics from Unknown Factors for Unsupervised Video Prediction"


1 Introduction

Video forecasting consists of predicting the future content of a video conditioned on previous frames. This is of crucial importance in various contexts, such as weather forecasting [73], autonomous driving [29], reinforcement learning [43], robotics [16], or action recognition [33]. In this work, we focus on unsupervised video prediction, where the absence of semantic labels to drive predictions exacerbates the challenges of the task. In this context, a key problem is to design video prediction methods able to represent the complex dynamics underlying raw data.

State-of-the-art methods for training such complex dynamical models currently rely on deep learning, with specific architectural choices based on 2D/3D convolutional [40, 62] or recurrent neural networks [66, 64, 67]. To improve predictions, recent methods use adversarial training [40, 62, 29] or stochastic models [7, 41], or constrain predictions by using geometric knowledge [16, 24, 75] or by disentangling factors of variation [60, 58, 12, 21].

Figure 1: PhyDNet is a deep model mapping an input video into a latent space $\mathcal{H}$, from which future frame prediction can be accurately performed. PhyDNet learns $\mathcal{H}$ in an unsupervised manner, such that physical dynamics and unknown factors necessary for prediction, e.g. appearance, details, texture, are disentangled.

Another appealing way to model the video dynamics is to exploit prior physical knowledge, e.g. formalized by partial differential equations (PDEs) [11, 55]. Recently, interesting connections between residual networks and PDEs have been drawn [71, 37, 8], enabling the design of physically-constrained machine learning frameworks [49, 11, 55, 52]. These approaches are very successful for modeling complex natural phenomena, e.g. climate, when the underlying dynamics is well described by the physical equations in the input space [49, 52, 35]. However, such an assumption is rarely fulfilled in the pixel space for predicting generalist videos.

In this work, we introduce PhyDNet, a deep model designed to perform accurate future frame predictions from generalist videos. In such a context, physical laws do not apply in the input pixel space; the goal of PhyDNet is to learn a semantic latent space $\mathcal{H}$ in which they do, and in which they are disentangled from other factors of variation required to perform future prediction. Prediction results of PhyDNet when trained on Moving MNIST [56] are shown in Figure 1. The left branch represents the physical dynamics in $\mathcal{H}$; when decoded in the image space, we can see that the corresponding features encode approximate segmentation masks predicting digit positions on subsequent frames. On the other hand, the right branch extracts residual information required for prediction, here the precise appearance of the two digits. Combining both representations eventually yields an accurate prediction.

Our contributions to the unsupervised video prediction problem with PhyDNet can be summarized as follows:

  • We introduce a global sequence to sequence two-branch deep model (section 3.1) dedicated to jointly learning the latent space $\mathcal{H}$ and disentangling physical dynamics from residual information, the latter being modeled by a data-driven (ConvLSTM [73]) method.

  • Physical dynamics is modeled by a new recurrent physical cell, PhyCell (section 3.2), discretizing a broad class of PDEs in $\mathcal{H}$. PhyCell is based on a prediction-correction paradigm inspired by the data assimilation community [1], enabling robust training with missing data and for long-term forecasting.

  • Experiments (section 4) reveal that PhyDNet outperforms state-of-the-art methods on four generalist datasets: this is, as far as we know, the first physically-constrained model able to show such capabilities. We highlight the importance of both disentanglement and physical prediction for optimal performance.

2 Related work

We review here related multi-step video prediction approaches dedicated to long-term forecasting. We also focus on unsupervised training, i.e. only using input video data and without manual supervision based on semantic labels.

(a) PhyDNet disentangling recurrent block (b) Global seq2seq architecture
Figure 2: Proposed PhyDNet deep model for video forecasting. a) The core of PhyDNet is a recurrent block projecting input images $u_t$ into a latent space $\mathcal{H}$, where two recurrent neural networks disentangle physical dynamics (PhyCell, section 3.2) from residual information (ConvLSTM). Learned physical and residual representations are summed before decoding to predict the future image $\hat{u}_{t+1}$. b) Unfolded in time, PhyDNet forms a sequence to sequence (seq2seq) architecture suited for multi-step video prediction. Dotted arrows mean that predictions are reinjected as next input only for the ConvLSTM branch, and not for PhyCell, as explained in section 3.3.
Deep video prediction

Deep neural networks have recently achieved state-of-the-art performances for data-driven video prediction. Seminal works include the application of sequence to sequence LSTM or Convolutional variants [56, 73], adopted in many studies [16, 36, 74]. Further works explore different architectural designs based on Recurrent Neural Networks (RNNs) [66, 64, 44, 67, 65] and 2D/3D ConvNets [40, 62, 50, 6]. Dedicated loss functions [10, 30] and Generative Adversarial Networks (GANs) have been investigated for sharper predictions [40, 62, 29]. However, the problem of conditioning GANs with prior information, such as physical models, remains an open question.

To constrain the challenging generation of high dimensional images, several methods rather predict geometric transformations between frames [16, 24, 75] or use optical flow [46, 38, 33, 32, 31]. This is very effective for short-term prediction, but degrades quickly when the video content evolves, where more complex models and memory about dynamics are required.

A promising line of work consists in disentangling independent factors of variations in order to apply the prediction model on lower-dimensional representations. A few approaches explicitly model interactions between objects inferred from an observed scene [14, 27, 76]. Relational reasoning, often implemented with graphs [2, 26, 53, 45, 59], can account for basic physical laws, e.g. drift, gravity, spring [70, 72, 42]. However, these methods are object-centric, only evaluate on controlled settings and are not suited for general real-world video forecasting. Other disentangling approaches factorize the video into independent components [60, 58, 12, 21, 19]. Several disentanglement criteria are used, such as content/motion [60] or deterministic/stochastic [12]. In specific contexts, the prediction space can be structured using additional information, e.g. with human pose [61, 63] or key points [41], which imposes a severe overhead on the annotation budget.

Physics and PDEs

Exploiting prior physical knowledge is another appealing way to improve prediction models. Earlier attempts for data-driven PDE discovery include sparse regression of potential differential terms [5, 52, 54] or neural networks approximating the solution and response function of PDEs [48, 49, 55]. Several approaches are dedicated to a specific PDE, e.g. advection-diffusion in [11]. Based on the connection between numerical schemes for solving PDEs (e.g. Euler, Runge-Kutta) and residual neural networks [71, 37, 8, 78], several specific architectures were designed for predicting and identifying dynamical systems [15, 35, 47]. PDE-Net [35, 34] discretizes a broad class of PDEs by approximating partial derivatives with convolutions. Although these works leverage physical knowledge, they either suppose physics behind data to be explicitly known or are limited to a fully visible state, which is rarely the case for general video forecasting.

Deep Kalman filters

To handle unobserved phenomena, state space models, in particular the Kalman filter [25], have been recently integrated with deep learning, by modeling dynamics in learned latent space [28, 69, 20, 17, 3]. The Kalman variational autoencoder separates state estimation in videos from dynamics with a linear Gaussian state space model. The Recurrent Kalman Network [3] uses a factorized high dimensional latent space in which the linear Kalman updates are simplified and do not require computationally heavy covariance matrix inversions. These methods inspired by the data assimilation community [1, 4] have advantages in missing data or long-term forecasting contexts due to their mechanisms decoupling latent dynamics and input assimilation. However, they assume simple (linear) latent dynamics and do not include any physical prior.

3 PhyDNet model for video forecasting

We introduce PhyDNet, a model dedicated to video prediction, which leverages physical knowledge on dynamics, and disentangles it from other unknown factors of variations necessary for accurate forecasting. To achieve this goal, we introduce a disentangling architecture (section 3.1), and a new physically-constrained recurrent cell (section 3.2).
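Problem statement As discussed in the introduction, physical laws do not apply at the pixel level for general video prediction tasks. However, we assume that there exists a conceptual latent space $\mathcal{H}$ in which physical dynamics and residual factors are linearly disentangled. Formally, let us denote as $u = u(t, \mathbf{x})$ the frame of a video sequence at time $t$, for spatial coordinates $\mathbf{x} = (x, y)$. $h(t, \mathbf{x}) \in \mathcal{H}$ is the latent representation of the video up to time $t$, which decomposes as $h = h^p + h^r$, where $h^p$ (resp. $h^r$) represents the physical (resp. residual) component of the disentanglement. The video evolution in the latent space $\mathcal{H}$ is thus governed by the following partial differential equation (PDE):

$$\frac{\partial h(t, \mathbf{x})}{\partial t} = \frac{\partial h^p}{\partial t} + \frac{\partial h^r}{\partial t} := \mathcal{M}_p(h^p, u) + \mathcal{M}_r(h^r, u) \qquad (1)$$

$\mathcal{M}_p(h^p, u)$ and $\mathcal{M}_r(h^r, u)$ represent physical and residual dynamics in the latent space $\mathcal{H}$.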

3.1 PhyDNet disentangling architecture

The main goal of PhyDNet is to learn the mapping from input sequences to a latent space which approximates $\mathcal{H}$, i.e. which satisfies the disentangling properties formalized in Eq (1).

To reach this objective, we introduce a recurrent block which is shown in Figure 2(a). A video frame $u_t$ at time $t$ is mapped by a deep convolutional encoder $E$ into a latent space representing the targeted space $\mathcal{H}$. $E(u_t)$ is then used as input for two parallel recurrent neural networks, incorporating this spatial representation into a dynamical model.

The left branch in Figure 2(a) models the latent representation $h^p$ fulfilling the physical part of the PDE in Eq (1), i.e. $\frac{\partial h^p(t, \mathbf{x})}{\partial t} = \mathcal{M}_p(h^p, u)$. This PDE is modeled by our recurrent physical cell described in section 3.2, PhyCell, which leads to the computation of $h^p_{t+1}$ from $E(u_t)$ and $h^p_t$. From the machine learning perspective, PhyCell leverages physical constraints to limit the number of model parameters, regularize training and improve generalization.

The right branch in Figure 2(a) models the latent representation $h^r$ fulfilling the residual part of the PDE in Eq (1), i.e. $\frac{\partial h^r(t, \mathbf{x})}{\partial t} = \mathcal{M}_r(h^r, u)$. Inspired by wavelet decomposition [39] and recent semi-supervised works [51], this part of the PDE corresponds to unknown phenomena, which do not correspond to any prior model, and is therefore entirely learned from data. We use a generic recurrent neural network for this task, e.g. ConvLSTM [73] for videos, which computes $h^r_{t+1}$ from $E(u_t)$ and $h^r_t$.

$h_{t+1} = h^p_{t+1} + h^r_{t+1}$ is the combined representation processed by a deep decoder $D$ to forecast the image $\hat{u}_{t+1} = D(h_{t+1})$.

Figure 2(b) shows the "unfolded" PhyDNet. An input video $u_{1:T} = (u_1, \dots, u_T) \in \mathbb{R}^{T \times n \times m \times c}$ with spatial size $n \times m$ and $c$ channels is projected into $\mathcal{H}$ by the encoder $E$ and processed by the recurrent block unfolded in time. This forms a Sequence To Sequence architecture [57] suited for multi-step prediction, outputting future frame predictions $\hat{u}_{T+1:T+\tau}$. Encoder, decoder and recurrent block parameters are all trained end-to-end, meaning that PhyDNet itself learns, without supervision, the latent space in which physics and residual factors are disentangled.
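The data flow of the two-branch block can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: frames and latent states are flat lists of floats, and `encoder`, `phys_cell`, `resid_cell` and `decoder` are hypothetical callables standing in for the deep encoder, PhyCell, ConvLSTM and deep decoder. Passing `None` to `phys_cell` is our convention for "no observation", mirroring the fact that predictions are reinjected only into the residual branch.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def phydnet_step(
    u_t: Vector,
    h_p: Vector,
    h_r: Vector,
    encoder: Callable[[Vector], Vector],
    phys_cell: Callable[[Optional[Vector], Vector], Vector],
    resid_cell: Callable[[Vector, Vector], Vector],
    decoder: Callable[[Vector], Vector],
    observed: bool = True,
) -> Tuple[Vector, Vector, Vector]:
    """One step of the block: returns (predicted next frame, h_p, h_r)."""
    e_t = encoder(u_t)                                   # shared embedding E(u_t)
    h_p_next = phys_cell(e_t if observed else None, h_p)  # physical branch
    h_r_next = resid_cell(e_t, h_r)                      # residual branch
    h_next = [p + r for p, r in zip(h_p_next, h_r_next)]  # h = h^p + h^r
    return decoder(h_next), h_p_next, h_r_next


def phydnet_forecast(
    frames: List[Vector],
    horizon: int,
    encoder, phys_cell, resid_cell, decoder,
    latent_dim: int,
) -> List[Vector]:
    """Read the observed frames, then roll out `horizon` predicted frames."""
    h_p = [0.0] * latent_dim
    h_r = [0.0] * latent_dim
    u_hat = list(frames[0])
    for u_t in frames:  # conditioning phase: both branches see real inputs
        u_hat, h_p, h_r = phydnet_step(
            u_t, h_p, h_r, encoder, phys_cell, resid_cell, decoder)
    predictions = [u_hat]
    for _ in range(horizon - 1):  # forecasting: only the residual branch is fed
        u_hat, h_p, h_r = phydnet_step(
            u_hat, h_p, h_r, encoder, phys_cell, resid_cell, decoder,
            observed=False)
        predictions.append(u_hat)
    return predictions
```

Keeping the two recurrent states separate and summing them only before decoding is what lets each branch be inspected or ablated on its own.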

3.2 PhyCell: a deep recurrent physical model

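PhyCell is a new physical cell, whose dynamics is governed by the PDE response function $\mathcal{M}_p(h^p, u)$ (in the sequel, we drop the index $p$ in $h^p$ for the sake of simplicity):

$$\mathcal{M}_p(h, u) := \Phi(h) + \mathcal{C}(h, u) \qquad (2)$$

where $\Phi(h)$ is a physical predictor modeling only the latent dynamics and $\mathcal{C}(h, u)$ is a correction term modeling the interactions between latent state and input data.
Physical predictor: $\Phi(h)$ in Eq (2) is modeled as follows:

$$\Phi(h(t, \mathbf{x})) = \sum_{i, j \,:\, i + j \le q} c_{i,j} \, \frac{\partial^{i+j} h}{\partial x^i \partial y^j}(t, \mathbf{x}) \qquad (3)$$

$\Phi(h(t, \mathbf{x}))$ in Eq (3) combines the spatial derivatives with coefficients $c_{i,j}$ up to a certain differential order $q$. This generic class of linear PDEs subsumes a wide range of classical physical models, e.g. the heat equation, the wave equations, the advection-diffusion equations.
Correction: $\mathcal{C}(h, u)$ in Eq (2) takes the following form:

$$\mathcal{C}(h, u) := K(t, \mathbf{x}) \odot \left[ E(u(t, \mathbf{x})) - \left( h(t, \mathbf{x}) + \Phi(h(t, \mathbf{x})) \right) \right] \qquad (4)$$

Eq (4) computes the difference between the latent state after physical motion $h(t, \mathbf{x}) + \Phi(h(t, \mathbf{x}))$ and the embedded new observed input $E(u(t, \mathbf{x}))$. $K(t, \mathbf{x})$ is a gating factor, where $\odot$ is the Hadamard product.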
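As a worked illustration of how the linear predictor class covers classical models, consider a predictor of the form $\Phi(h) = \sum_{i+j \le q} c_{i,j}\, \partial^{i+j} h / \partial x^i \partial y^j$ with $q = 2$. The diffusivity $D$ and the velocity $(v_x, v_y)$ below are illustrative symbols that do not appear in the paper.

```latex
% Heat equation: only the pure second-order terms are active.
c_{2,0} = c_{0,2} = D, \quad \text{all other } c_{i,j} = 0
\;\;\Longrightarrow\;\;
\Phi(h) = D \left( \frac{\partial^2 h}{\partial x^2} + \frac{\partial^2 h}{\partial y^2} \right)

% Advection-diffusion with constant velocity (v_x, v_y):
c_{1,0} = -v_x, \quad c_{0,1} = -v_y, \quad c_{2,0} = c_{0,2} = D
\;\;\Longrightarrow\;\;
\Phi(h) = -v_x \frac{\partial h}{\partial x} - v_y \frac{\partial h}{\partial y}
          + D \left( \frac{\partial^2 h}{\partial x^2} + \frac{\partial^2 h}{\partial y^2} \right)
```

In both cases the PDE reads $\partial h / \partial t = \Phi(h)$, so choosing a model amounts to choosing which coefficients $c_{i,j}$ are non-zero.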

3.2.1 Discrete PhyCell

Figure 3: PhyCell recurrent cell implements a two-step scheme: physical prediction with convolutions for approximating and combining spatial derivatives (Eq (6) and Eq (3)), and input assimilation as a correction of latent physical dynamics driven by observed data (Eq (7)). During training, the filter moment loss in red (Eq (10)) enforces the convolutional filters to approximate the desired differential operators.

We discretize the continuous time PDE in Eq (2) with the standard forward Euler numerical scheme [37], leading to the discrete time PhyCell (derivation in supplementary 1.1):

$$h_{t+1} = (1 - K_t) \odot \left( h_t + \Phi(h_t) \right) + K_t \odot E(u_t) \qquad (5)$$

Depicted in Figure 3, PhyCell is an atomic recurrent cell for building physically-constrained RNNs. In our experiments, we use one layer of PhyCell but one can also easily stack several PhyCell layers to build more complex models, as done for stacked RNNs [66, 64, 67]. To gain insight into PhyCell in Eq (5), we write the equivalent two-step form:

$$\tilde{h}_{t+1} = h_t + \Phi(h_t) \qquad (6)$$
$$h_{t+1} = \tilde{h}_{t+1} + K_t \odot \left( E(u_t) - \tilde{h}_{t+1} \right) \qquad (7)$$

The prediction step in Eq (6) is a physically-constrained motion in the latent space, computing the intermediate representation $\tilde{h}_{t+1}$. Eq (7) is a correction step incorporating input data. This prediction-correction formulation is reminiscent of the way numerical models are combined with observed data in the data assimilation community [1, 4], e.g. with the Kalman filter [25]. We show in section 3.3 that this decoupling between prediction and correction can be leveraged to robustly train our model in long-term forecasting and missing data contexts. $K_t$ can be interpreted as the Kalman gain controlling the trade-off between both steps.
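The equivalence between the compact update and the prediction-correction form can be checked in a few lines. The derivation below is our own expansion, writing $\tilde{h}_{t+1}$ for the predicted state, $K_t$ for the gain and $E(u_t)$ for the embedded input; it is not copied from the supplementary material.

```latex
\begin{aligned}
\tilde{h}_{t+1} &= h_t + \Phi(h_t)
  && \text{(prediction)} \\
h_{t+1} &= \tilde{h}_{t+1} + K_t \odot \bigl( E(u_t) - \tilde{h}_{t+1} \bigr)
  && \text{(correction)} \\
        &= (1 - K_t) \odot \tilde{h}_{t+1} + K_t \odot E(u_t) \\
        &= (1 - K_t) \odot \bigl( h_t + \Phi(h_t) \bigr) + K_t \odot E(u_t)
  && \text{(compact form)}
\end{aligned}
```

The third line makes the role of the gain explicit: the new state is an element-wise interpolation between the physical prediction and the embedded observation.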

| Method | Moving MNIST MSE | Moving MNIST MAE | Moving MNIST SSIM | Traffic BJ MSE | Traffic BJ MAE | Traffic BJ SSIM | SST MSE | SST MAE | SST SSIM | Human 3.6 MSE / 10 | Human 3.6 MAE | Human 3.6 SSIM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ConvLSTM [73] | 103.3 | 182.9 | 0.707 | | | | | | | | | |
| PredRNN [66] | 56.8 | 126.1 | 0.867 | 46.4 | | | 41.9 | 62.1 | 0.955 | 48.4 | 18.9 | 0.781 |
| Causal LSTM [64] | 46.5 | 106.8 | 0.898 | 44.8 | | | | | | 45.8 | 17.2 | 0.851 |
| MIM [67] | 44.2 | 101.1 | 0.910 | 42.9 | | | | | | 42.9 | 17.8 | 0.790 |
| E3D-LSTM [65] | 41.3 | 86.4 | 0.920 | | | | | | | 46.4 | 16.6 | 0.869 |
| Advection-diffusion [11] | - | - | - | - | - | - | | | | - | - | - |
| DDPAE [21] | 38.9 | | | - | - | - | - | - | - | - | - | - |
| PhyDNet | 24.4 | 70.3 | 0.947 | 41.9 | 16.2 | 0.982 | 31.9 | 53.3 | 0.972 | 36.9 | 16.2 | 0.901 |

Table 1: Quantitative forecasting results of PhyDNet compared to baselines using various datasets. Numbers are copied from original or citing papers. * corresponds to results obtained by running online code from the authors. The first five baselines are general deep models applicable to all datasets, whereas DDPAE [21] (resp. advection-diffusion flow [11]) are specific state-of-the-art models for Moving MNIST (resp. SST). Metrics are scaled to be in a similar range across datasets to ease comparison.

3.2.2 PhyCell implementation

We now specify how the physical predictor $\Phi$ in Eq (6) and the correction Kalman gain $K_t$ in Eq (7) are implemented.
Physical predictor: we implement $\Phi$ using a convolutional neural network (left gray box in Figure 3), based on the connection between convolutions and differentiations [13, 35]. This offers the possibility to learn a class of filters approximating each partial derivative in Eq (3), which are constrained by a kernel moment loss, as detailed in section 3.3. As noted by [35], the flexibility added by this constrained learning strategy gives better results for solving PDEs than handcrafted derivative filters. Finally, we use $1 \times 1$ convolutions to linearly combine these derivatives with coefficients $c_{i,j}$ in Eq (3).
Kalman gain: We approximate $K_t$ in Eq (7) by a gate with learned convolution kernels $W_h$, $W_u$ and bias $b_k$:

$$K_t = \tanh\left( W_h * \tilde{h}_{t+1} + W_u * E(u_t) + b_k \right) \qquad (8)$$

Note that if $K_t = 0$, the input $E(u_t)$ is not accounted for and the dynamics follows the physical predictor; if $K_t = 1$, the latent dynamics is reset and only driven by the input. This is similar to gating mechanisms in LSTMs or GRUs.
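The prediction, gating and correction steps fit in a short pure-Python sketch. Everything below is a simplification chosen for readability and is not the paper's implementation: the latent state is a 1D periodic grid, the learned derivative filters are replaced by fixed finite-difference stencils, and the gate uses scalar weights `w_h`, `w_u`, `b_k` instead of convolution kernels. The `gain` argument is a hypothetical hook for overriding the gate.

```python
import math
from typing import List, Optional

Vector = List[float]


def physical_predictor(h: Vector, c1: float, c2: float) -> Vector:
    """Phi(h) = c1 * dh/dx + c2 * d2h/dx2 with central differences (periodic)."""
    n = len(h)
    out = []
    for i in range(n):
        left, right = h[(i - 1) % n], h[(i + 1) % n]
        d1 = 0.5 * (right - left)         # first-order derivative stencil
        d2 = right - 2.0 * h[i] + left    # second-order derivative stencil
        out.append(c1 * d1 + c2 * d2)
    return out


def phycell_step(
    h: Vector,
    e: Vector,
    c1: float = -1.0,
    c2: float = 0.1,
    w_h: float = 1.0,
    w_u: float = 1.0,
    b_k: float = 0.0,
    gain: Optional[Vector] = None,
) -> Vector:
    """One prediction-correction update of the latent state h given embedding e."""
    phi = physical_predictor(h, c1, c2)
    h_tilde = [a + p for a, p in zip(h, phi)]             # prediction step
    if gain is None:                                      # learned-gate analogue
        gain = [math.tanh(w_h * ht + w_u * ei + b_k)
                for ht, ei in zip(h_tilde, e)]
    # correction step: move the prediction towards the observation
    return [ht + k * (ei - ht) for ht, ei, k in zip(h_tilde, e, gain)]
```

Forcing `gain` to all zeros returns the pure physical prediction, and forcing it to all ones returns the embedded input, which matches the two limiting cases described above.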
Discussion: With specific predictor $\Phi$, gain $K_t$ and encoder $E$, PhyCell recovers recent models from the literature:

| Model | Predictor $\Phi$ | Gain $K_t$ | Encoder $E$ |
|---|---|---|---|
| PDE-Net [34] | Eq (6) | 0 | identity |
| Advection-diffusion flow [11] | advection-diffusion predictor | | |
| RKF [3] | locally linear, no phys. constraint | approx. Kalman gain | deep encoder |
| PhyDNet (ours) | Eq (6) | Eq (8) | deep encoder |

PDE-Net [35] directly works on raw pixel data (identity encoder $E$) and assumes Markovian dynamics (no correction, $K_t = 0$): the model solves the autonomous PDE given in Eq (6) but in pixel space. This prevents modeling time-varying PDEs such as those tackled in this work, e.g. varying advection terms. The flow model in [11] uses the closed-form solution of the advection-diffusion equation as predictor; it is however limited to this PDE, whereas PhyDNet models a much broader class of PDEs. The Recurrent Kalman Filter (RKF) [3] also proposes a prediction-correction scheme in a deep latent space, but their approach does not include any prior physical information, and the prediction step is locally linear, whereas we use deep models. An approximated form of the covariance matrix is used for estimating $K_t$ in [3], which we find experimentally inferior to our gating mechanism in Eq (8).

3.3 Training

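Given a training set of $N$ videos $\mathcal{D} = \{u^{(i)}\}_{i=1:N}$ and PhyDNet parameters $w = (w_p, w_r, w_s)$, where $w_p$ (resp. $w_r$) are parameters of the PhyCell (resp. residual) branch, and $w_s$ are encoder and decoder shared parameters, we minimize the following objective function:

$$\mathcal{L}(\mathcal{D}, w) = \mathcal{L}_{image}(\mathcal{D}, w) + \lambda \, \mathcal{L}_{moment}(w_p) \qquad (9)$$

We use the $\mathcal{L}_2$ loss for the image reconstruction loss $\mathcal{L}_{image}$, as commonly done in the literature [66, 64, 44, 65, 67].

$\mathcal{L}_{moment}$ imposes physical constraints on the $k^2$ learned filters $\{w^k_{p,i,j}\}_{i,j \le k}$, such that each $w^k_{p,i,j}$ of size $k \times k$ approximates $\frac{\partial^{i+j}}{\partial x^i \partial y^j}$. This is achieved by using a loss based on the moment matrix $M(w^k_{p,i,j})$ [34], representing the order of the filter differentiation [13]. $M(w^k_{p,i,j})$ is compared to a target moment matrix $\Delta^k_{i,j}$ (see $M$ and $\Delta$ computations in supplementary 1.2), leading to:

$$\mathcal{L}_{moment}(w_p) = \sum_{i \le k} \sum_{j \le k} \left\| M(w^k_{p,i,j}) - \Delta^k_{i,j} \right\|_F \qquad (10)$$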
Prediction mode An appealing feature of PhyCell is that we can use and train the model in a "prediction-only" mode by setting $K_t = 0$ in Eq (7), i.e. by only relying on the physical predictor $\Phi$ in Eq (6). It is worth mentioning that the "prediction-only" mode is not applicable to standard Seq2Seq RNNs: although the decomposition in Eq (2) still holds, i.e. $\mathcal{M}_r(h, u) = \Phi(h) + \mathcal{C}(h, u)$, the resulting predictor is naive and useless for multi-step prediction (see supplementary 1.3).

Therefore, standard RNNs are not equipped to deal with unreliable input data. We show in section 4.4 that the gain of PhyDNet over those models increases in two important contexts with unreliable inputs: multi-step prediction and dealing with missing data.
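The moment constraint used in the training objective can be made concrete with a small sketch. We assume the moment definition used in the PDE-Net line of work: entry $(i, j)$ of the moment matrix of a $k \times k$ filter $w$ is $\frac{1}{i!\,j!} \sum_{a,b} a^i b^j\, w[a, b]$, with offsets $a, b$ centered on the filter, and the target matrix has a single 1 at the position of the desired derivative order. The first filter axis is taken as $x$; this axis convention and both function names are our own choices.

```python
from math import factorial
from typing import List

Matrix = List[List[float]]


def moment_matrix(w: Matrix) -> Matrix:
    """Moments of a square filter, with offsets centered on the middle tap."""
    k = len(w)
    half = (k - 1) // 2
    moments = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            total = 0.0
            for a in range(k):
                for b in range(k):
                    total += ((a - half) ** i) * ((b - half) ** j) * w[a][b]
            moments[i][j] = total / (factorial(i) * factorial(j))
    return moments


def moment_loss(w: Matrix, i: int, j: int) -> float:
    """Frobenius distance between the moments of w and the target for d^(i+j)/dx^i dy^j."""
    k = len(w)
    moments = moment_matrix(w)
    squared = 0.0
    for a in range(k):
        for b in range(k):
            target = 1.0 if (a, b) == (i, j) else 0.0
            squared += (moments[a][b] - target) ** 2
    return squared ** 0.5


# Central-difference stencil for d/dx, laid out along the first axis.
ddx = [[0.0, -0.5, 0.0],
       [0.0,  0.0, 0.0],
       [0.0,  0.5, 0.0]]
```

With these definitions `moment_loss(ddx, 1, 0)` vanishes, because the only non-zero moment of the stencil is the first-order one in $x$, whereas an averaging filter is penalized. Summing this quantity over all learned filters gives a penalty of the same form as the moment loss.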

4 Experiments

Figure 4: Qualitative results of the predicted frames by PhyDNet for all datasets. First line is the input sequence, second line the target and third line PhyDNet prediction. For Moving MNIST, we add a fourth line with the comparison to DDPAE [21] and for Traffic BJ the difference for better visualization.

4.1 Experimental setup


We evaluate PhyDNet on four datasets from various origins. Moving MNIST [56] is a standard synthetic benchmark in video prediction with two random digits bouncing on the walls. Traffic BJ [77] represents complex real-world traffic flows, which requires modeling transport phenomena and traffic diffusion for prediction. SST (Sea Surface Temperature) [11] consists of meteorological data, whose evolution is governed by the physical laws of fluid dynamics. Finally, Human 3.6 [22] represents general human actions with complex 3D articulated motions. We give details about all datasets in supplementary 2.1.

Network architectures and training

PhyDNet shares a common backbone architecture for all datasets where the physical branch contains 49 PhyCells ($7 \times 7$ kernel filters) and the residual branch is composed of a 3-layer ConvLSTM with 128 filters in each layer. We set the trade-off parameter $\lambda$ between $\mathcal{L}_{image}$ and $\mathcal{L}_{moment}$ to $\lambda = 1$. Detailed architectures and the impact of $\lambda$ are given in supplementary 2.2. Our code is available at https://github.com/vincent-leguen/PhyDNet.

Evaluation metrics

We follow evaluation metrics commonly used in state-of-the-art video prediction methods: the Mean Squared Error (MSE), Mean Absolute Error (MAE) and the Structural Similarity (SSIM) [68] that computes the perceived image quality with respect to a reference. Metrics are averaged for each frame of the output sequence. Lower MSE, MAE and higher SSIM indicate better performances.
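The frame-wise averaging of MSE and MAE can be sketched as follows. This is a minimal illustration, assuming each frame is a flat list of pixel values and that errors are averaged over pixels within a frame; benchmark code may instead sum over pixels or rescale, and SSIM is omitted because it needs a windowed implementation. The function name is hypothetical.

```python
from typing import List, Tuple

Frame = List[float]


def sequence_mse_mae(pred: List[Frame], target: List[Frame]) -> Tuple[float, float]:
    """Per-frame MSE and MAE, averaged over the frames of the output sequence."""
    if not pred or len(pred) != len(target):
        raise ValueError("sequences must be non-empty and of equal length")
    mse_sum = 0.0
    mae_sum = 0.0
    for p, t in zip(pred, target):
        if not p or len(p) != len(t):
            raise ValueError("frames must be non-empty and of equal size")
        n = len(p)
        mse_sum += sum((a - b) ** 2 for a, b in zip(p, t)) / n  # frame MSE
        mae_sum += sum(abs(a - b) for a, b in zip(p, t)) / n    # frame MAE
    return mse_sum / len(pred), mae_sum / len(pred)
```

Because MSE squares each error, a single badly predicted pixel affects it more than it affects MAE, which is one reason both are reported together.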

4.2 State of the art comparison

We evaluate PhyDNet against strong recent baselines, including very competitive data-driven RNN architectures: ConvLSTM [73], PredRNN [66], Causal LSTM [64], Memory in Memory (MIM) [67]. We also compare to methods dedicated to specific datasets: DDPAE [21], a disentangling method specialized and state-of-the-art on Moving MNIST ; and the physically-constrained advection-diffusion flow model [11] that is state-of-the-art for the SST dataset.

| Method | Moving MNIST MSE | Moving MNIST MAE | Moving MNIST SSIM | Traffic BJ MSE ×100 | Traffic BJ MAE | Traffic BJ SSIM | SST MSE ×10 | SST MAE | SST SSIM | Human 3.6 MSE /10 | Human 3.6 MAE /100 | Human 3.6 SSIM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ConvLSTM | 103.3 | 182.9 | 0.707 | | | | | | | | | |
| PhyCell | 50.8 | 129.3 | 0.870 | 48.9 | 17.9 | 0.978 | 38.2 | 60.2 | 0.969 | 42.5 | 18.3 | 0.891 |
| PhyDNet | 24.4 | 70.3 | 0.947 | 41.9 | 16.2 | 0.982 | 31.9 | 53.3 | 0.972 | 36.9 | 16.2 | 0.901 |

Table 2: An ablation study shows the consistent performance gain on all datasets of our physically-constrained PhyCell vs the general purpose ConvLSTM, and the additional gain brought by the disentangling architecture PhyDNet. * corresponds to results obtained by running online code from the authors.

Overall results presented in Table 1 reveal that PhyDNet significantly outperforms all baselines on all four datasets. The performance gain is large with respect to state-of-the-art general RNN models, with a gain of 17 MSE points for Moving MNIST, 6 MSE points for Human 3.6, 3 MSE points for SST and 1 MSE point for Traffic BJ. In addition, PhyDNet also outperforms specialized models: it gains 14 MSE points compared to the disentangling DDPAE model [21] specialized for Moving MNIST, and 2 MSE points compared to the advection-diffusion model [11] dedicated to SST data. PhyDNet also presents large and consistent gains in SSIM, indicating that image quality is greatly improved by the physical regularization. Note that for Human 3.6, a few approaches use specific strategies dedicated to human motion with additional supervision, e.g. human pose in [61]. We perform similarly to [61] using only unsupervised training, as shown in supplementary 2.3. This is, to the best of our knowledge, the first time that physically-constrained deep models reach state-of-the-art performances on generalist video prediction datasets.

In Figure 4, we provide qualitative prediction results for all datasets, showing that PhyDNet properly forecasts future images for the considered horizons: digits are sharply and accurately predicted for Moving MNIST in (a), the absolute traffic flow error is low and approximately spatially independent in (b), the evolving physical SST phenomena are well anticipated in (c) and the future positions of the person are accurately predicted in (d). We add in Figure 4(a) a qualitative comparison to DDPAE [21], which fails to predict the future frames properly. Since the two digits overlap in the input sequence, DDPAE is unable to disentangle them. In contrast, PhyDNet successfully learns the physical dynamics of the two digits in a disentangled latent space, leading to a correct prediction. In supplementary 2.4, we detail this comparison to DDPAE, and provide additional visualizations for all datasets.

4.3 Ablation Study

We perform here an ablation study to analyse the respective contributions of physical modeling and disentanglement. Results are presented in Table 2 for all datasets. We see that a 1-layer PhyCell model (only the left branch of PhyDNet in Figure 2(b)) outperforms a 3-layer ConvLSTM (50 MSE points gained for Moving MNIST, 8 MSE points for Human 3.6, 7 MSE points for SST and equivalent results for Traffic BJ), while PhyCell has far fewer parameters (270,000 vs. 3 million parameters). This confirms that PhyCell is a very effective recurrent cell that successfully incorporates physical prior in deep models. When we further add our disentangling strategy with the two-branch architecture (PhyDNet), we obtain a further performance gain on all datasets (25 MSE points for Moving MNIST, 7 points for Traffic and SST, and 5 points for Human 3.6), which shows that physical modeling is not sufficient by itself to perform general video prediction and that learning unknown factors is necessary.
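The gains attributed to disentanglement can be recomputed from the PhyCell and PhyDNet rows of Table 2. The snippet below is only a bookkeeping sketch: the dictionary and variable names are ours, and the values are the MSE columns as printed (each dataset keeps the scaling used in the table).

```python
# MSE values copied from the PhyCell and PhyDNet rows of Table 2.
table2_mse = {
    "Moving MNIST": {"PhyCell": 50.8, "PhyDNet": 24.4},
    "Traffic BJ": {"PhyCell": 48.9, "PhyDNet": 41.9},
    "SST": {"PhyCell": 38.2, "PhyDNet": 31.9},
    "Human 3.6": {"PhyCell": 42.5, "PhyDNet": 36.9},
}

# Gain of the two-branch model over the physical branch alone.
disentangling_gain = {
    name: round(row["PhyCell"] - row["PhyDNet"], 1)
    for name, row in table2_mse.items()
}

# Gain of PhyCell over ConvLSTM on Moving MNIST (only dataset with a ConvLSTM MSE here).
phycell_gain_mnist = round(103.3 - 50.8, 1)
```

The differences come out at 26.4, 7.0, 6.3 and 5.6 MSE points respectively, and 52.5 for PhyCell over ConvLSTM on Moving MNIST, in line with the rounded figures quoted in the text.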

We qualitatively analyze in Figure 5 partial predictions of PhyDNet for the physical branch $h^p$ and residual branch $h^r$. As noted in Figure 1 for Moving MNIST, $h^p$ captures coarse localisations of objects, while $h^r$ captures fine-grained details that are not useful for the physical model. Additional visualizations for the other datasets and a discussion on the number of parameters are provided in supplementary 2.5.

Figure 5: Qualitative ablation results on Moving MNIST: partial predictions show that PhyCell captures coarse localisation of digits, whereas the ConvLSTM branch models the fine shape details of digits. Every two frames are displayed.
Influence of physical regularization

We conduct in Table 3 a finer ablation on Moving MNIST to study the impact of the physical regularization $\mathcal{L}_{moment}$ on the performance of PhyCell and PhyDNet. When we disable $\mathcal{L}_{moment}$ for training PhyCell, performances improve by 7 points in MSE. This underlines that physical laws alone are too restrictive for learning dynamics in a general context, and that complementary factors should be accounted for. On the other hand, when we disable $\mathcal{L}_{moment}$ for training our disentangled architecture PhyDNet, performances decrease by 5 MSE points (29.0 vs 24.4) compared to the physically-constrained version. This shows that physical constraints are relevant, but should be incorporated carefully in order to make both branches cooperate. This makes it possible to leverage the physical prior, while keeping the remaining information necessary for pixel-level prediction. The same conclusions can be drawn for the other datasets, see supplementary 2.6.

| Method | MSE | MAE | SSIM |
|---|---|---|---|
| PhyCell | 50.8 | 129.3 | 0.870 |
| PhyCell without $\mathcal{L}_{moment}$ | 43.4 | 112.8 | 0.895 |
| PhyDNet | 24.4 | 70.3 | 0.947 |
| PhyDNet without $\mathcal{L}_{moment}$ | 29.0 | 81.2 | 0.934 |

Table 3: Influence of physical regularization for Moving MNIST.

4.4 PhyCell analysis

4.4.1 Physical filter analysis

With the same general backbone architecture, PhyDNet can express different PDE dynamics associated with the underlying phenomena by learning specific coefficients $c_{i,j}$ combining the partial derivatives in Eq (3). In Figure 6, we display the mean amplitude of the learned coefficients $c_{i,j}$ with respect to the order of differentiation. For Moving MNIST, the $0^{th}$ and $1^{st}$ orders are largely dominant, meaning a purely advective behaviour coherent with the piecewise-constant translation dynamics of the dataset. For Traffic BJ and SST, there is also a global decrease in amplitude with respect to order, but we nonetheless notice a few higher order terms that appear to be useful for prediction. For Human 3.6, where the nature of the prior motion is less obvious, these coefficients are more spread across derivative orders.
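The statistic described above, the mean amplitude of the combining coefficients per differentiation order, can be sketched as follows. We assume the coefficients are available as a mapping from derivative indices $(i, j)$ to a scalar value; in a trained model they would first have to be extracted from the combining layer, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple


def mean_amplitude_by_order(coeffs: Dict[Tuple[int, int], float]) -> Dict[int, float]:
    """Average |c_ij| over all (i, j) sharing the same total order i + j."""
    buckets: Dict[int, List[float]] = {}
    for (i, j), value in coeffs.items():
        buckets.setdefault(i + j, []).append(abs(value))
    return {order: sum(vals) / len(vals) for order, vals in sorted(buckets.items())}


# Made-up coefficients for illustration only (not values from the paper).
example = {
    (0, 0): 0.0,
    (1, 0): -1.0, (0, 1): 0.5,
    (2, 0): 0.1, (1, 1): 0.0, (0, 2): -0.2,
}
profile = mean_amplitude_by_order(example)
```

For these made-up values the first-order terms dominate, which is the kind of profile one would associate with mostly advective dynamics.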

Figure 6 (panels: Moving MNIST, Traffic BJ, SST, Human 3.6): Mean amplitude of the combining coefficients $c_{i,j}$ with respect to the order of the differential operators approximated.

4.4.2 Dealing with unreliable inputs

We explore here the robustness of PhyDNet when dealing with unreliable inputs, which can arise in two contexts: long-term forecasting and missing data. As explained in section 3.3, PhyDNet can be used in a prediction mode in this context, limiting the use of unreliable inputs, whereas general RNNs cannot. To validate the relevance of the prediction mode, we compare PhyDNet to DDPAE [21], which is based on a standard RNN (LSTM) as predictor module. Figure 7 presents the results in MSE obtained by PhyDNet and DDPAE on Moving MNIST (see supplementary 2.7 for similar results in SSIM).

For long-term forecasting, we evaluate the performances of both methods far beyond the prediction range seen during training (up to 80 frames), as shown in Figure 7(a). We can see that the performance drop (MSE increase rate) is approximately linear for PhyDNet, whereas it is much more pronounced for DDPAE. For example, PhyDNet for 80-step prediction reaches an MSE similar to that of DDPAE for 20-step prediction. This confirms that PhyDNet can limit error accumulation during forecasting by using a powerful dynamical model.

Finally, we evaluate the robustness of PhyDNet and DDPAE on missing data, by varying the ratio of missing data (from 10 to 50%) in input sequences during training and testing. A missing input image is replaced with a default value (0) image. In this case, PhyCell relies only on its latent dynamics by setting $K_t = 0$, whereas DDPAE takes the null image as input. Figure 7(b) shows that the performance gap between PhyDNet and DDPAE increases with the percentage of missing data.
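The handling of missing frames can be sketched as a rollout in which the correction is skipped whenever no observation is available. This is an illustration under simplifying assumptions rather than the authors' code: `predictor` and `gate` are hypothetical callables standing in for the physical prediction step and the learned gain, and a missing frame is represented by `None` instead of a zero image.

```python
from typing import Callable, List, Optional

Vector = List[float]


def rollout_with_missing(
    embeddings: List[Optional[Vector]],
    predictor: Callable[[Vector], Vector],
    gate: Callable[[Vector, Vector], Vector],
    h0: Vector,
) -> List[Vector]:
    """Run prediction-correction over a sequence; None marks a missing frame."""
    h = list(h0)
    states = []
    for e in embeddings:
        h_tilde = predictor(h)                 # physical prediction
        if e is None:
            h = h_tilde                        # gain set to 0: prediction only
        else:
            k = gate(h_tilde, e)
            h = [ht + ki * (ei - ht)           # correction with observed input
                 for ht, ki, ei in zip(h_tilde, k, e)]
        states.append(h)
    return states


def shift_right(h: Vector) -> Vector:
    """Toy dynamics: circular translation by one cell."""
    return [h[-1]] + h[:-1]


def full_gain(h_tilde: Vector, e: Vector) -> Vector:
    """Toy gate that fully trusts the observation."""
    return [1.0] * len(e)
```

With the toy translation dynamics, a state that is observed once and then left unobserved keeps moving at the same speed, which is the behaviour a model relying on a zero image as input would not get for free.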

(a) Long-term forecasting (b) Missing data

Figure 7: MSE comparison between PhyDNet and DDPAE [21] when dealing with unreliable inputs.

5 Conclusion

We propose PhyDNet, a new model for disentangling prior dynamical knowledge from other factors of variation required for video prediction. PhyDNet makes it possible to apply PDE-constrained prediction beyond fully observed physical phenomena in pixel space, and outperforms state-of-the-art methods on four generalist datasets. The recurrent physical cell we introduce for modeling PDE dynamics generalizes recent models and offers the appealing property of decoupling prediction from correction. Future work includes using more complex numerical schemes, e.g. Runge-Kutta [15], and extending the model to probabilistic forecasts with uncertainty estimation [18, 9], e.g. with stochastic differential equations [23].


  • [1] M. Asch, M. Bocquet, and M. Nodet (2016) Data assimilation: methods, algorithms, and applications. Vol. 11, SIAM. Cited by: 2nd item, §2, §3.2.1.
  • [2] P. Battaglia, R. Pascanu, M. Lai, D. J. Rezende, et al. (2016) Interaction networks for learning about objects, relations and physics. In Advances in neural information processing systems (NeurIPS), pp. 4502–4510. Cited by: §2.
  • [3] P. Becker, H. Pandya, G. Gebhardt, C. Zhao, C. J. Taylor, and G. Neumann (2019) Recurrent Kalman networks: factorized inference in high-dimensional deep feature spaces. In International Conference on Machine Learning (ICML), pp. 544–552. Cited by: §2, §3.2.2, §3.2.2.
  • [4] M. Bocquet, J. Brajard, A. Carrassi, and L. Bertino (2019) Data assimilation as a learning tool to infer ordinary differential equation representations of dynamical models. Nonlinear Processes in Geophysics 26 (3), pp. 143–162. Cited by: §2, §3.2.1.
  • [5] S. L. Brunton, J. L. Proctor, and J. N. Kutz (2016) Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences 113 (15), pp. 3932–3937. Cited by: §2.
  • [6] W. Byeon, Q. Wang, R. Kumar Srivastava, and P. Koumoutsakos (2018) ContextVP: fully context-aware video prediction. In European Conference on Computer Vision (ECCV), pp. 753–769. Cited by: §2.
  • [7] L. Castrejon, N. Ballas, and A. Courville (2019) Improved conditional VRNNs for video prediction. In International Conference on Computer Vision (ICCV), Cited by: §1.
  • [8] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud (2018) Neural ordinary differential equations. In Advances in neural information processing systems (NeurIPS), Cited by: §1, §2.
  • [9] C. Corbière, N. Thome, A. Bar-Hen, M. Cord, and P. Pérez (2019) Addressing failure prediction by learning model confidence. In Advances in Neural Information Processing Systems (NeurIPS), pp. 2902–2913. Cited by: §5.
  • [10] M. Cuturi and M. Blondel (2017) Soft-DTW: a differentiable loss function for time-series. In International Conference on Machine Learning (ICML), pp. 894–903. Cited by: §2.
  • [11] E. de Bezenac, A. Pajot, and P. Gallinari (2018) Deep learning for physical processes: incorporating prior scientific knowledge. International Conference on Learning Representations (ICLR). Cited by: §1, §2, §3.2.2, §3.2.2, Table 1, §4.1, §4.2, §4.2.
  • [12] E. L. Denton et al. (2017) Unsupervised learning of disentangled representations from video. In Advances in neural information processing systems (NeurIPS), pp. 4414–4423. Cited by: §1, §2.
  • [13] B. Dong, Q. Jiang, and Z. Shen (2017) Image restoration: wavelet frame shrinkage, nonlinear evolution PDEs, and beyond. Multiscale Modeling & Simulation 15 (1), pp. 606–660. Cited by: §3.2.2, §3.3.
  • [14] S. A. Eslami, N. Heess, T. Weber, Y. Tassa, D. Szepesvari, G. E. Hinton, et al. (2016) Attend, infer, repeat: fast scene understanding with generative models. In Advances in Neural Information Processing Systems (NeurIPS), pp. 3225–3233. Cited by: §2.
  • [15] R. Fablet, S. Ouala, and C. Herzet (2018) Bilinear residual neural network for the identification and forecasting of geophysical dynamics. In 2018 26th European Signal Processing Conference (EUSIPCO), pp. 1477–1481. Cited by: §2, §5.
  • [16] C. Finn, I. Goodfellow, and S. Levine (2016) Unsupervised learning for physical interaction through video prediction. In Advances in neural information processing systems (NeurIPS), pp. 64–72. Cited by: §1, §1, §2, §2.
  • [17] M. Fraccaro, S. Kamronn, U. Paquet, and O. Winther (2017) A disentangled recognition and nonlinear dynamics model for unsupervised learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 3601–3610. Cited by: §2.
  • [18] Y. Gal and Z. Ghahramani (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning (ICML), pp. 1050–1059. Cited by: §5.
  • [19] H. Gao, H. Xu, Q. Cai, R. Wang, F. Yu, and T. Darrell (2019) Disentangling propagation and generation for video prediction. In International Conference on Computer Vision (ICCV), pp. 9006–9015. Cited by: §2.
  • [20] T. Haarnoja, A. Ajay, S. Levine, and P. Abbeel (2016) Backprop KF: learning discriminative deterministic state estimators. In Advances in Neural Information Processing Systems (NeurIPS), pp. 4376–4384. Cited by: §2.
  • [21] J. Hsieh, B. Liu, D. Huang, L. F. Fei-Fei, and J. C. Niebles (2018) Learning to decompose and disentangle representations for video prediction. In Advances in Neural Information Processing Systems (NeurIPS), pp. 517–526. Cited by: §1, §2, Table 1, Figure 4, Figure 7, §4.2, §4.2, §4.2, §4.4.2.
  • [22] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu (2013) Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (7), pp. 1325–1339. Cited by: §4.1.
  • [23] J. Jia and A. R. Benson (2019) Neural jump stochastic differential equations. In Advances in Neural Information Processing Systems, pp. 9843–9854. Cited by: §5.
  • [24] X. Jia, B. De Brabandere, T. Tuytelaars, and L. V. Gool (2016) Dynamic filter networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 667–675. Cited by: §1, §2.
  • [25] R. Kalman (1960) A new approach to linear filtering and prediction problems. Trans. ASME, D 82, pp. 35–44. Cited by: §2, §3.2.1.
  • [26] T. Kipf, E. Fetaya, K. Wang, M. Welling, and R. Zemel (2018) Neural relational inference for interacting systems. In International Conference on Machine Learning (ICML), pp. 2693–2702. Cited by: §2.
  • [27] A. Kosiorek, H. Kim, Y. W. Teh, and I. Posner (2018) Sequential attend, infer, repeat: generative modelling of moving objects. In Advances in Neural Information Processing Systems (NeurIPS), pp. 8606–8616. Cited by: §2.
  • [28] R. G. Krishnan, U. Shalit, and D. Sontag (2015) Deep Kalman filters. ArXiv abs/1511.05121. Cited by: §2.
  • [29] Y. Kwon and M. Park (2019) Predicting future frames using retrospective cycle GAN. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1811–1820. Cited by: §1, §1, §2.
  • [30] V. Le Guen and N. Thome (2019) Shape and time distortion loss for training deep time series forecasting models. In Advances in Neural Information Processing Systems (NeurIPS), pp. 4191–4203. Cited by: §2.
  • [31] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M. Yang (2018) Flow-grounded spatial-temporal video prediction from still images. In European Conference on Computer Vision (ECCV), pp. 600–615. Cited by: §2.
  • [32] X. Liang, L. Lee, W. Dai, and E. P. Xing (2017) Dual motion GAN for future-flow embedded video prediction. In International Conference on Computer Vision (ICCV), pp. 1744–1752. Cited by: §2.
  • [33] Z. Liu, R. A. Yeh, X. Tang, Y. Liu, and A. Agarwala (2017) Video frame synthesis using deep voxel flow. In International Conference on Computer Vision (ICCV), pp. 4463–4471. Cited by: §1, §2.
  • [34] Z. Long, Y. Lu, and B. Dong (2019) PDE-Net 2.0: learning PDEs from data with a numeric-symbolic hybrid deep network. Journal of Computational Physics, pp. 108925. Cited by: §2, §3.2.2, §3.3.
  • [35] Z. Long, Y. Lu, X. Ma, and B. Dong (2018) PDE-Net: learning PDEs from data. In International Conference on Machine Learning, pp. 3214–3222. Cited by: §1, §2, §3.2.2, §3.2.2.
  • [36] C. Lu, M. Hirsch, and B. Scholkopf (2017) Flexible spatio-temporal networks for video prediction. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6523–6531. Cited by: §2.
  • [37] Y. Lu, A. Zhong, Q. Li, and B. Dong (2018) Beyond finite layer neural networks: bridging deep architectures and numerical differential equations. In International Conference on Machine Learning (ICML), pp. 3282–3291. Cited by: §1, §2, §3.2.1.
  • [38] Z. Luo, B. Peng, D. Huang, A. Alahi, and L. Fei-Fei (2017) Unsupervised learning of long-term motion dynamics for videos. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2203–2212. Cited by: §2.
  • [39] S. Mallat (1999) A wavelet tour of signal processing. Elsevier. Cited by: §3.1.
  • [40] M. Mathieu, C. Couprie, and Y. LeCun (2015) Deep multi-scale video prediction beyond mean square error. In International Conference on Learning Representations (ICLR), Cited by: §1, §1, §2.
  • [41] M. Minderer, C. Sun, R. Villegas, F. Cole, K. Murphy, and H. Lee (2019) Unsupervised learning of object structure and dynamics from videos. In Advances in neural information processing systems (NeurIPS), Cited by: §1, §2.
  • [42] D. Mrowca, C. Zhuang, E. Wang, N. Haber, L. F. Fei-Fei, J. Tenenbaum, and D. L. Yamins (2018) Flexible neural representation for physics prediction. In Advances in Neural Information Processing Systems (NeurIPS), pp. 8799–8810. Cited by: §2.
  • [43] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh (2015) Action-conditional video prediction using deep networks in Atari games. In Advances in neural information processing systems (NeurIPS), pp. 2863–2871. Cited by: §1.
  • [44] M. Oliu, J. Selva, and S. Escalera (2018) Folded recurrent neural networks for future video prediction. In European Conference on Computer Vision (ECCV), pp. 716–731. Cited by: §2, §3.3.
  • [45] R. Palm, U. Paquet, and O. Winther (2018) Recurrent relational networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 3368–3378. Cited by: §2.
  • [46] V. Patraucean, A. Handa, and R. Cipolla (2015) Spatio-temporal video autoencoder with differentiable memory. In ICLR 2016 Workshop Track, Cited by: §2.
  • [47] T. Qin, K. Wu, and D. Xiu (2019) Data driven governing equations approximation using deep neural networks. Journal of Computational Physics. Cited by: §2.
  • [48] M. Raissi, P. Perdikaris, and G. E. Karniadakis (2017) Physics informed deep learning (part ii): data-driven discovery of nonlinear partial differential equations. arXiv preprint arXiv:1711.10566. Cited by: §2.
  • [49] M. Raissi (2018) Deep hidden physics models: deep learning of nonlinear partial differential equations. The Journal of Machine Learning Research 19 (1), pp. 932–955. Cited by: §1, §2.
  • [50] F. A. Reda, G. Liu, K. J. Shih, R. Kirby, J. Barker, D. Tarjan, A. Tao, and B. Catanzaro (2018) SDC-Net: video prediction using spatially-displaced convolution. In European Conference on Computer Vision (ECCV), pp. 718–733. Cited by: §2.
  • [51] T. Robert, N. Thome, and M. Cord (2018) HybridNet: classification and reconstruction cooperation for semi-supervised learning. In European Conference on Computer Vision (ECCV), pp. 153–169. Cited by: §3.1.
  • [52] S. H. Rudy, S. L. Brunton, J. L. Proctor, and J. N. Kutz (2017) Data-driven discovery of partial differential equations. Science Advances 3 (4), pp. e1602614. Cited by: §1, §2.
  • [53] A. Sanchez-Gonzalez, N. Heess, J. T. Springenberg, J. Merel, M. Riedmiller, R. Hadsell, and P. Battaglia (2018) Graph networks as learnable physics engines for inference and control. In International Conference on Machine Learning (ICML), pp. 4467–4476. Cited by: §2.
  • [54] H. Schaeffer (2017) Learning partial differential equations via data discovery and sparse optimization. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 473 (2197), pp. 20160446. Cited by: §2.
  • [55] S. Seo and Y. Liu (2019) Differentiable physics-informed graph networks. arXiv preprint arXiv:1902.02950. Cited by: §1, §2.
  • [56] N. Srivastava, E. Mansimov, and R. Salakhudinov (2015) Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning (ICML), pp. 843–852. Cited by: §1, §2, §4.1.
  • [57] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems (NeurIPS), pp. 3104–3112. Cited by: §3.1.
  • [58] S. Tulyakov, M. Liu, X. Yang, and J. Kautz (2018) MoCoGAN: decomposing motion and content for video generation. In Computer Vision and Pattern Recognition (CVPR), pp. 1526–1535. Cited by: §1, §2.
  • [59] S. van Steenkiste, M. Chang, K. Greff, and J. Schmidhuber (2018) Relational neural expectation maximization: unsupervised discovery of objects and their interactions. In International Conference on Learning Representations (ICLR), Cited by: §2.
  • [60] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee (2017) Decomposing motion and content for natural video sequence prediction. International Conference on Learning Representations (ICLR). Cited by: §1, §2.
  • [61] R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee (2017) Learning to generate long-term future via hierarchical prediction. In International Conference on Machine Learning (ICML), pp. 3560–3569. Cited by: §2, §4.2.
  • [62] C. Vondrick, H. Pirsiavash, and A. Torralba (2016) Generating videos with scene dynamics. In Advances In Neural Information Processing Systems (NeurIPS), pp. 613–621. Cited by: §1, §1, §2.
  • [63] J. Walker, K. Marino, A. Gupta, and M. Hebert (2017) The pose knows: video forecasting by generating pose futures. In International Conference on Computer Vision (ICCV), pp. 3332–3341. Cited by: §2.
  • [64] Y. Wang, Z. Gao, M. Long, J. Wang, and P. S. Yu (2018) PredRNN++: towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. arXiv preprint arXiv:1804.06300. Cited by: §1, §2, §3.2.1, §3.3, Table 1, §4.2.
  • [65] Y. Wang, L. Jiang, M. Yang, L. Li, M. Long, and L. Fei-Fei (2019) Eidetic 3D LSTM: a model for video prediction and beyond. In International Conference on Learning Representations (ICLR), Cited by: §2, §3.3, Table 1.
  • [66] Y. Wang, M. Long, J. Wang, Z. Gao, and P. S. Yu (2017) PredRNN: recurrent neural networks for predictive learning using spatiotemporal LSTMs. In Advances in Neural Information Processing Systems (NeurIPS), pp. 879–888. Cited by: §1, §2, §3.2.1, §3.3, Table 1, §4.2.
  • [67] Y. Wang, J. Zhang, H. Zhu, M. Long, J. Wang, and P. S. Yu (2019) Memory in memory: a predictive neural network for learning higher-order non-stationarity from spatiotemporal dynamics. In Computer Vision and Pattern Recognition (CVPR), pp. 9154–9162. Cited by: §1, §2, §3.2.1, §3.3, Table 1, §4.2.
  • [68] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, et al. (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612. Cited by: §4.1.
  • [69] M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller (2015) Embed to control: a locally linear latent dynamics model for control from raw images. In Advances in neural information processing systems (NeurIPS), pp. 2746–2754. Cited by: §2.
  • [70] N. Watters, D. Zoran, T. Weber, P. Battaglia, R. Pascanu, and A. Tacchetti (2017) Visual interaction networks: learning a physics simulator from video. In Advances in neural information processing systems (NeurIPS), pp. 4539–4547. Cited by: §2.
  • [71] W. E (2017) A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics 5 (1), pp. 1–11. Cited by: §1, §2.
  • [72] J. Wu, E. Lu, P. Kohli, B. Freeman, and J. Tenenbaum (2017) Learning to see physics via visual de-animation. In Advances in Neural Information Processing Systems (NeurIPS), pp. 153–164. Cited by: §2.
  • [73] X. Shi, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo (2015) Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In Advances in neural information processing systems (NeurIPS), pp. 802–810. Cited by: 1st item, §1, §2, §3.1, Table 1, §4.2.
  • [74] J. Xu, B. Ni, Z. Li, S. Cheng, and X. Yang (2018) Structure preserving video prediction. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1460–1469. Cited by: §2.
  • [75] T. Xue, J. Wu, K. Bouman, and B. Freeman (2016) Visual dynamics: probabilistic future frame synthesis via cross convolutional networks. In Advances in neural information processing systems (NeurIPS), pp. 91–99. Cited by: §1, §2.
  • [76] Y. Ye, M. Singh, A. Gupta, and S. Tulsiani (2019) Compositional video prediction. In Computer Vision and Pattern Recognition (CVPR), pp. 10353–10362. Cited by: §2.
  • [77] J. Zhang, Y. Zheng, and D. Qi (2017) Deep spatio-temporal residual networks for citywide crowd flows prediction. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: §4.1.
  • [78] M. Zhu, B. Chang, and C. Fu (2019) Convolutional neural networks combined with Runge-Kutta methods. In International Conference on Learning Representations (ICLR), Cited by: §2.