PhyDNet
Code for our CVPR 2020 paper "Disentangling Physical Dynamics from Unknown Factors for Unsupervised Video Prediction"
Leveraging physical knowledge described by partial differential equations (PDEs) is an appealing way to improve unsupervised video prediction methods. Since physics is too restrictive for describing the full visual content of generic videos, we introduce PhyDNet, a two-branch deep architecture, which explicitly disentangles PDE dynamics from unknown complementary information. A second contribution is to propose a new recurrent physical cell (PhyCell), inspired by data assimilation techniques, for performing PDE-constrained prediction in latent space. Extensive experiments conducted on four diverse datasets show the ability of PhyDNet to outperform state-of-the-art methods. Ablation studies also highlight the important gain brought by both disentanglement and PDE-constrained prediction. Finally, we show that PhyDNet presents interesting features for dealing with missing data and long-term forecasting.
Video forecasting consists of predicting the future content of a video conditioned on previous frames. This is of crucial importance in various contexts, such as weather forecasting [73], autonomous driving [29, 43], robotics [16], or action recognition [33]. In this work, we focus on unsupervised video prediction, where the absence of semantic labels to drive predictions exacerbates the challenges of the task. In this context, a key problem is to design video prediction methods able to represent the complex dynamics underlying raw data.
State-of-the-art methods for training such complex dynamical models currently rely on deep learning, with specific architectural choices based on 2D/3D convolutional [40, 62] or recurrent neural networks [66, 64, 67]. To improve predictions, recent methods use adversarial training [40, 62, 29], stochastic models [7, 41], or constrain predictions by using geometric knowledge [16, 24, 75] or by disentangling factors of variation [60, 58, 12, 21].
Another appealing way to model the video dynamics is to exploit prior physical knowledge, e.g. formalized by partial differential equations (PDEs) [11, 55]. Recently, interesting connections between residual networks and PDEs have been drawn [71, 37, 8], enabling the design of physically-constrained machine learning frameworks [49, 11, 55, 52]. These approaches are very successful for modeling complex natural phenomena, e.g. climate, when the underlying dynamics is well described by the physical equations in the input space [49, 52, 35]. However, such an assumption is rarely fulfilled in the pixel space for predicting generalist videos.
In this work, we introduce PhyDNet, a deep model dedicated to performing accurate future frame predictions from generalist videos. In such a context, physical laws do not apply in the input pixel space; the goal of PhyDNet is to learn a semantic latent space $\mathcal{H}$ in which they do, and are disentangled from other factors of variation required to perform future prediction. Prediction results of PhyDNet when trained on Moving MNIST [56] are shown in Figure 1. The left branch represents the physical dynamics in $\mathcal{H}$; when decoded in the image space, we can see that the corresponding features encode approximate segmentation masks predicting digit positions on subsequent frames. On the other hand, the right branch extracts residual information required for prediction, here the precise appearance of the two digits. Combining both representations eventually makes accurate prediction possible.
Our contributions to the unsupervised video prediction problem with PhyDNet can be summarized as follows:
Physical dynamics is modeled by a new recurrent physical cell, PhyCell (section 3.2), discretizing a broad class of PDEs in the latent space $\mathcal{H}$. PhyCell is based on a prediction-correction paradigm inspired by the data assimilation community [1], enabling robust training with missing data and for long-term forecasting.
Experiments (section 4) reveal that PhyDNet outperforms state-of-the-art methods on four generalist datasets: this is, as far as we know, the first physically-constrained model able to show such capabilities. We highlight the importance of both disentanglement and physical prediction for optimal performance.
We review here related multi-step video prediction approaches dedicated to long-term forecasting. We also focus on unsupervised training, i.e. only using input video data and without manual supervision based on semantic labels.
Figure 2: (a) PhyDNet disentangling recurrent block; (b) global seq2seq architecture.
Deep neural networks have recently achieved state-of-the-art performance for data-driven video prediction. Seminal works include the application of sequence-to-sequence LSTMs or convolutional variants [56, 73], adopted in many studies [16, 36, 74]. Further works explore different architectural designs based on Recurrent Neural Networks (RNNs) [66, 64, 44, 67, 65] and 2D/3D ConvNets [40, 62, 50, 6]. Dedicated loss functions [10, 30] and Generative Adversarial Networks (GANs) have been investigated for sharper predictions [40, 62, 29]. However, the problem of conditioning GANs with prior information, such as physical models, remains an open question.
To constrain the challenging generation of high-dimensional images, several methods instead predict geometric transformations between frames [16, 24, 75] or use optical flow [46, 38, 33, 32, 31]. This is very effective for short-term prediction, but degrades quickly when the video content evolves, where more complex models and memory about dynamics are required.
A promising line of work consists in disentangling independent factors of variations in order to apply the prediction model on lower-dimensional representations. A few approaches explicitly model interactions between objects inferred from an observed scene [14, 27, 76]. Relational reasoning, often implemented with graphs [2, 26, 53, 45, 59], can account for basic physical laws, e.g. drift, gravity, spring [70, 72, 42]. However, these methods are object-centric, only evaluate on controlled settings and are not suited for general real-world video forecasting. Other disentangling approaches factorize the video into independent components [60, 58, 12, 21, 19]. Several disentanglement criteria are used, such as content/motion [60] or deterministic/stochastic [12]. In specific contexts, the prediction space can be structured using additional information, e.g. with human pose [61, 63] or key points [41], which imposes a severe overhead on the annotation budget.
Exploiting prior physical knowledge is another appealing way to improve prediction models. Earlier attempts for data-driven PDE discovery include sparse regression of potential differential terms [5, 52, 54] or neural networks approximating the solution and response function of PDEs [48, 49, 55]. Several approaches are dedicated to a specific PDE, e.g. advection-diffusion in [11]. Based on the connection between numerical schemes for solving PDEs (e.g. Euler, Runge-Kutta) and residual neural networks [71, 37, 8, 78], several specific architectures were designed for predicting and identifying dynamical systems [15, 35, 47]. PDE-Net [35, 34] discretizes a broad class of PDEs by approximating partial derivatives with convolutions. Although these works leverage physical knowledge, they either suppose physics behind data to be explicitly known or are limited to a fully visible state, which is rarely the case for general video forecasting.
To handle unobserved phenomena, state space models, in particular the Kalman filter [25], have recently been integrated with deep learning, by modeling dynamics in a learned latent space [28, 69, 20, 17, 3]. The Kalman variational autoencoder [17] separates state estimation in videos from dynamics with a linear Gaussian state space model. The Recurrent Kalman Network [3] uses a factorized high-dimensional latent space in which the linear Kalman updates are simplified and do not require computationally heavy covariance matrix inversions. These methods, inspired by the data assimilation community [1, 4], have advantages in missing data or long-term forecasting contexts due to their mechanisms decoupling latent dynamics and input assimilation. However, they assume simple (linear) latent dynamics and do not include any physical prior.
We introduce PhyDNet, a model dedicated to video prediction, which leverages physical knowledge on dynamics, and disentangles it from other unknown factors of variation necessary for accurate forecasting. To achieve this goal, we introduce a disentangling architecture (section 3.1), and a new physically-constrained recurrent cell (section 3.2).
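Problem statement. As discussed in the introduction, physical laws do not apply at the pixel level for general video prediction tasks. However, we assume that there exists a conceptual latent space $\mathcal{H}$ in which physical dynamics and residual factors are linearly disentangled.
Formally, let us denote as $u = u(t, \mathbf{x})$ the frame of a video sequence at time $t$, for spatial coordinates $\mathbf{x} = (x, y)$. $h(t, \mathbf{x}) \in \mathcal{H}$ is the latent representation of the video up to time $t$, which decomposes as $h = h^p + h^r$, where $h^p$ (resp. $h^r$) represents the physical (resp. residual) component of the disentanglement. The video evolution in the latent space $\mathcal{H}$ is thus governed by the following partial differential equation (PDE):
$$\frac{\partial h(t, \mathbf{x})}{\partial t} = \frac{\partial h^p}{\partial t} + \frac{\partial h^r}{\partial t} := \mathcal{M}_p(h^p, u) + \mathcal{M}_r(h^r, u) \tag{1}$$
$\mathcal{M}_p(h^p, u)$ and $\mathcal{M}_r(h^r, u)$ represent physical and residual dynamics in the latent space $\mathcal{H}$.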
The main goal of PhyDNet is to learn the mapping from input sequences to a latent space which approximates the disentangling properties formalized in Eq (1).
To reach this objective, we introduce a recurrent block which is shown in Figure 2(a). A video frame $u_t$ at time $t$ is mapped by a deep convolutional encoder $E$ into a latent space representing the targeted space $\mathcal{H}$. $E(u_t)$ is then used as input for two parallel recurrent neural networks, incorporating this spatial representation into a dynamical model.
The left branch in Figure 2(a) models the latent representation $h^p$ fulfilling the physical part of the PDE in Eq (1), i.e. $\frac{\partial h^p}{\partial t} = \mathcal{M}_p(h^p, u)$. This PDE is modeled by our recurrent physical cell described in section 3.2, PhyCell, which leads to the computation of the physical state $h^p_{t+1}$ from $E(u_t)$ and $h^p_t$. From the machine learning perspective, PhyCell leverages physical constraints to limit the number of model parameters, regularize training and improve generalization.
The right branch in Figure 2(a) models the latent representation $h^r$ fulfilling the residual part of the PDE in Eq (1), i.e. $\frac{\partial h^r}{\partial t} = \mathcal{M}_r(h^r, u)$. Inspired by wavelet decomposition [39] and recent semi-supervised works [51], this part of the PDE corresponds to unknown phenomena, which do not correspond to any prior model, and is therefore entirely learned from data. We use a generic recurrent neural network for this task, e.g. ConvLSTM [73] for videos, which computes the residual state $h^r_{t+1}$ from $E(u_t)$ and $h^r_t$.
$h_{t+1} = h^p_{t+1} + h^r_{t+1}$ is the combined representation processed by a deep decoder $D$ to forecast the image $\hat{u}_{t+1}$.
Figure 2(b) shows the "unfolded" PhyDNet. An input video $u_{1:T} = (u_1, \dots, u_T)$ with spatial size $n \times m$ and $c$ channels is projected into $\mathcal{H}$ by the encoder $E$ and processed by the recurrent block unfolded in time. This forms a sequence-to-sequence architecture [57] suited for multi-step prediction, outputting future frame predictions $\hat{u}_{T+1}, \hat{u}_{T+2}, \dots$. Encoder, decoder and recurrent block parameters are all trained end-to-end, meaning that PhyDNet itself learns, without supervision, the latent space in which physics and residual factors are disentangled.
PhyCell is a new physical cell, whose dynamics is governed by the PDE response function $\mathcal{M}_p(h^p, u)$ (in the sequel, we drop the index $p$ in $h^p$ for the sake of simplicity):
$$\mathcal{M}_p(h, u) := \Phi(h) + \mathcal{C}(h, u) \tag{2}$$
where $\Phi(h)$ is a physical predictor modeling only the latent dynamics and $\mathcal{C}(h, u)$ is a correction term modeling the interactions between latent state and input data.
Physical predictor: $\Phi(h)$ in Eq (2) is modeled as follows:
$$\Phi(h)(t, \mathbf{x}) = \sum_{i, j:\, i + j \le q} c_{i,j}\, \frac{\partial^{i+j} h}{\partial x^i \partial y^j}(t, \mathbf{x}) \tag{3}$$
$\Phi(h)$ in Eq (3) combines the spatial derivatives with coefficients $c_{i,j}$ up to a certain differential order $q$. This generic class of linear PDEs subsumes a wide range of classical physical models, e.g. the heat equation, the wave equation, the advection-diffusion equations.
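As a worked illustration that is not taken from the paper, the linear combination of spatial derivatives described above can be specialized to advection-diffusion. The symbols $c_{i,j}$ (coefficients) and $q$ (maximal order) are the assumed notation for Eq (3); the constant velocity $(v_x, v_y)$ and diffusivity $\kappa$ are hypothetical quantities introduced only for this example.

```latex
\Phi(h) = \sum_{i,j:\, i+j \le q} c_{i,j}\,
          \frac{\partial^{i+j} h}{\partial x^i \partial y^j},
\qquad q = 2,\quad
c_{1,0} = -v_x,\; c_{0,1} = -v_y,\; c_{2,0} = c_{0,2} = \kappa,\;
\text{all other } c_{i,j} = 0
\\[4pt]
\Longrightarrow\quad
\frac{\partial h}{\partial t}
  = -v_x \frac{\partial h}{\partial x}
    - v_y \frac{\partial h}{\partial y}
    + \kappa \left( \frac{\partial^2 h}{\partial x^2}
                  + \frac{\partial^2 h}{\partial y^2} \right)
```

Setting $v_x = v_y = 0$ leaves the heat equation, and setting $\kappa = 0$ leaves pure advection, so both are members of the same class with different coefficients.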
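Correction: $\mathcal{C}(h, u)$ in Eq (2) takes the following form:
$$\mathcal{C}(h, u) := K(t, \mathbf{x}) \odot \big[ E(u(t, \mathbf{x})) - \big( h(t, \mathbf{x}) + \Phi(h(t, \mathbf{x})) \big) \big] \tag{4}$$
Eq (4) computes the difference between the latent state after physical motion, $h + \Phi(h)$, and the embedded new observed input $E(u)$. $K$ is a gating factor, where $\odot$ is the Hadamard product.
We discretize the continuous time PDE in Eq (2) with the standard forward Euler numerical scheme [37], leading to the discrete time PhyCell (derivation in supplementary 1.1):
$$h_{t+1} = (1 - K_t) \odot \big( h_t + \Phi(h_t) \big) + K_t \odot E(u_t) \tag{5}$$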
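The passage from the continuous PDE to the discrete cell can be sketched as follows. This is only an outline under stated assumptions: a unit time step $\Delta t = 1$, a correction term that compares the embedded input $E(u_t)$ with the state after physical motion $h_t + \Phi(h_t)$, and a gate written $K_t$. The full derivation is the one referenced in supplementary 1.1.

```latex
\begin{aligned}
h_{t+1} &= h_t + \Delta t \,\big[ \Phi(h_t) + \mathcal{C}(h_t, u_t) \big],
           \qquad \Delta t = 1 \\
        &= h_t + \Phi(h_t)
           + K_t \odot \big[ E(u_t) - \big( h_t + \Phi(h_t) \big) \big] \\
        &= (1 - K_t) \odot \big( h_t + \Phi(h_t) \big) + K_t \odot E(u_t)
\end{aligned}
```

The last line is a gated convex-style blend between the physically advanced state and the embedded observation, which is what makes the two-step reading possible.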
Depicted in Figure 3, PhyCell is an atomic recurrent cell for building physically-constrained RNNs. In our experiments, we use one layer of PhyCell, but one can also easily stack several PhyCell layers to build more complex models, as done for stacked RNNs [66, 64, 67]. To gain insight into PhyCell in Eq (5), we write the equivalent two-step form:
$$\tilde{h}_{t+1} = h_t + \Phi(h_t) \qquad \text{(prediction)} \tag{6}$$
$$h_{t+1} = \tilde{h}_{t+1} + K_t \odot \big( E(u_t) - \tilde{h}_{t+1} \big) \qquad \text{(correction)} \tag{7}$$
The prediction step in Eq (6) is a physically-constrained motion in the latent space, computing the intermediate representation $\tilde{h}_{t+1}$. Eq (7) is a correction step incorporating input data. This prediction-correction formulation is reminiscent of the way numerical models are combined with observed data in the data assimilation community [1, 4], e.g. with the Kalman filter [25]. We show in section 3.3 that this decoupling between prediction and correction can be leveraged to robustly train our model in long-term forecasting and missing data contexts. $K_t$ can be interpreted as the Kalman gain controlling the trade-off between both steps.
We now specify how the physical predictor $\Phi$ in Eq (6) and the correction Kalman gain $K_t$ in Eq (7) are implemented.
Physical predictor: we implement $\Phi$ using a convolutional neural network (left gray box in Figure 3), based on the connection between convolutions and differentiations [13, 35]. This offers the possibility to learn a class of filters approximating each partial derivative in Eq (3), which are constrained by a kernel moment loss, as detailed in section 3.3. As noted by [35], the flexibility added by this constrained learning strategy gives better results for solving PDEs than handcrafted derivative filters. Finally, we use $1 \times 1$ convolutions to linearly combine these derivatives with the coefficients $c_{i,j}$ in Eq (3).
Kalman gain: we approximate $K_t$ in Eq (7) by a gate with learned convolution kernels $W_h$, $W_u$ and bias $b_k$:
$$K_t = \tanh\big( W_h * \tilde{h}_{t+1} + W_u * E(u_t) + b_k \big) \tag{8}$$
Note that if $K_t = 0$, the input $E(u_t)$ is not accounted for and the dynamics follows the physical predictor; if $K_t = 1$, the latent dynamics is reset and only driven by the input. This is similar to gating mechanisms in LSTMs or GRUs.
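The two limiting cases of the gate can be checked on a toy example. The sketch below is not the authors' implementation: it works on a flattened 1-D latent state stored as a Python list, takes the gain as a given vector instead of computing it from learned kernels, and uses a hypothetical predictor `toy_phi` (a one-cell shift on a periodic grid) in place of the learned convolutional predictor.

```python
from typing import Callable, List

Vector = List[float]


def phycell_step(h: Vector, embedded_input: Vector, gain: Vector,
                 phi: Callable[[Vector], Vector]) -> Vector:
    """One prediction-correction update on a flattened latent state."""
    # Prediction: advance the latent state with the physical predictor.
    h_pred = [hi + di for hi, di in zip(h, phi(h))]
    # Correction: move the prediction towards the embedded input, scaled by the gain.
    return [hp + k * (e - hp) for hp, e, k in zip(h_pred, embedded_input, gain)]


def toy_phi(h: Vector) -> Vector:
    """Hypothetical predictor: shifts the state by one cell on a periodic 1-D grid."""
    return [h[i - 1] - h[i] for i in range(len(h))]


h_t = [0.0, 1.0, 0.0, 0.0]   # a single "blob" at position 1
e_t = [0.5, 0.5, 0.5, 0.5]   # embedded observation (arbitrary values)

no_input = phycell_step(h_t, e_t, [0.0] * 4, toy_phi)  # gain 0: physics only
reset = phycell_step(h_t, e_t, [1.0] * 4, toy_phi)     # gain 1: input only
```

With a zero gain the blob simply moves one cell to the right and the observation is ignored; with a unit gain the result equals the embedded observation regardless of the previous state. Intermediate gains interpolate element-wise between the two.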
Discussion: With specific predictor $\Phi$, gain $K_t$ and encoder $E$, PhyCell recovers recent models from the literature:
PDE-Net [35] directly works on raw pixel data (identity encoder $E$) and assumes Markovian dynamics (no correction, $K_t = 0$): the model solves the autonomous PDE given in Eq (6) but in pixel space. This prevents modeling time-varying PDEs such as those tackled in this work, e.g. varying advection terms.
The flow model in [11] uses the closed-form solution of the advection-diffusion equation as predictor $\Phi$; it is however limited to this PDE, whereas PhyDNet models a much broader class of PDEs.
The Recurrent Kalman Network [3] also proposes a prediction-correction scheme in a deep latent space, but their approach does not include any prior physical information, and the prediction step is locally linear, whereas we use deep models. An approximated form of the covariance matrix is used for estimating $K_t$ in [3], which we find experimentally inferior to our gating mechanism in Eq (8).
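Given a training set of videos $\mathcal{D}$ and PhyDNet parameters $\mathbf{w} = (\mathbf{w}_p, \mathbf{w}_r, \mathbf{w}_s)$, where $\mathbf{w}_p$ (resp. $\mathbf{w}_r$) are parameters of the PhyCell (resp. residual) branch, and $\mathbf{w}_s$ are encoder and decoder shared parameters, we minimize the following objective function:
$$\mathcal{L}(\mathcal{D}, \mathbf{w}) = \mathcal{L}_{\text{image}}(\mathcal{D}, \mathbf{w}) + \lambda\, \mathcal{L}_{\text{moment}}(\mathbf{w}_p) \tag{9}$$
We use the $L^2$ loss for the image reconstruction loss $\mathcal{L}_{\text{image}}$, as commonly done in the literature [66, 64, 44, 65, 67].
$\mathcal{L}_{\text{moment}}$ imposes physical constraints on the $k^2$ learned filters $\{\mathbf{w}^k_{p,i,j}\}_{i,j \le k}$, such that each $\mathbf{w}^k_{p,i,j}$ of size $k \times k$ approximates $\frac{\partial^{i+j}}{\partial x^i \partial y^j}$. This is achieved by using a loss based on the moment matrix $\mathbf{M}(\mathbf{w}^k_{p,i,j})$ [34], representing the order of the filter differentiation [13]. $\mathbf{M}(\mathbf{w}^k_{p,i,j})$ is compared to a target moment matrix $\mathbf{\Delta}^k_{i,j}$ (see $\mathbf{M}$ and $\mathbf{\Delta}$ computations in supplementary 1.2), leading to:
$$\mathcal{L}_{\text{moment}}(\mathbf{w}_p) = \sum_{i \le k} \sum_{j \le k} \big\| \mathbf{M}(\mathbf{w}^k_{p,i,j}) - \mathbf{\Delta}^k_{i,j} \big\|_F \tag{10}$$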
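To make the moment constraint concrete, the sketch below computes a moment matrix for a small filter and its Frobenius distance to a target. It rests on assumptions that are not spelled out in this text: the moment matrix is taken in the PDE-Net style, $M_{i,j} = \frac{1}{i!\,j!} \sum_{u,v} u^i v^j\, w[u,v]$ with offsets $u, v$ centred on the kernel, the filter is applied as a cross-correlation, and the target for the derivative of order $(i, j)$ is a matrix with a single 1 at position $(i, j)$. The exact definitions used by the authors are the ones in supplementary 1.2.

```python
from math import factorial
from typing import List

Matrix = List[List[float]]


def moment_matrix(kernel: Matrix) -> Matrix:
    """M[i][j] = 1/(i! j!) * sum over centred offsets (u, v) of u^i v^j kernel[u][v]."""
    k = len(kernel)
    half = (k - 1) // 2
    offsets = range(-half, half + 1)
    moments = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            total = 0.0
            for a, u in enumerate(offsets):
                for b, v in enumerate(offsets):
                    total += (u ** i) * (v ** j) * kernel[a][b]
            moments[i][j] = total / (factorial(i) * factorial(j))
    return moments


def moment_loss(kernel: Matrix, i: int, j: int) -> float:
    """Frobenius distance between the kernel moments and a one-hot target at (i, j)."""
    k = len(kernel)
    m = moment_matrix(kernel)
    target = [[1.0 if (r, c) == (i, j) else 0.0 for c in range(k)] for r in range(k)]
    return sum((m[r][c] - target[r][c]) ** 2
               for r in range(k) for c in range(k)) ** 0.5


# Central-difference filter along the first axis (rows are the u offsets -1, 0, +1).
d_dx = [[0.0, -0.5, 0.0],
        [0.0,  0.0, 0.0],
        [0.0,  0.5, 0.0]]

loss_as_dx = moment_loss(d_dx, 1, 0)   # treated as a first derivative along u
loss_as_dy = moment_loss(d_dx, 0, 1)   # treated as a first derivative along v
```

The central-difference filter has a single non-zero moment at position $(1, 0)$, so its loss against that target vanishes, while the same filter is penalized when it is asked to represent the derivative along the other axis. During training such a penalty would push each learned filter towards the derivative it is assigned to.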
Prediction mode. An appealing feature of PhyCell is that we can use and train the model in a "prediction-only" mode by setting $K_t = 0$ in Eq (7), i.e. by only relying on the physical predictor $\Phi$ in Eq (6). It is worth mentioning that the "prediction-only" mode is not applicable to standard Seq2Seq RNNs: although the decomposition in Eq (2) still holds, the resulting predictor is naive and useless for multi-step prediction (see supplementary 1.3).
Therefore, standard RNNs are not equipped to deal with unreliable input data $u_t$. We show in section 4.4 that the gain of PhyDNet over those models increases in two important contexts with unreliable inputs: multi-step prediction and dealing with missing data.
We evaluate PhyDNet on four datasets from various origins. Moving MNIST [56] is a standard synthetic benchmark in video prediction with two random digits bouncing on the walls. Traffic BJ [77] represents complex real-world traffic flows, which requires modeling transport phenomena and traffic diffusion for prediction. SST (Sea Surface Temperature) [11] consists in meteorological data, whose evolution is governed by the physical laws of fluid dynamics. Finally, Human 3.6 [22] represents general human actions with complex 3D articulated motions. We give details about all datasets in supplementary 2.1.
PhyDNet shares a common backbone architecture for all datasets, where the physical branch contains 49 PhyCells ($7 \times 7$ kernel filters) and the residual branch is composed of a 3-layer ConvLSTM with 128 filters in each layer. We set up the trade-off parameter $\lambda$ between $\mathcal{L}_{\text{image}}$ and $\mathcal{L}_{\text{moment}}$ to . Detailed architectures and the impact of $\lambda$ are given in supplementary 2.2. Our code is available at https://github.com/vincent-leguen/PhyDNet.
We follow evaluation metrics commonly used in state-of-the-art video prediction methods: the Mean Squared Error (MSE), Mean Absolute Error (MAE) and the Structural Similarity (SSIM) [68], which computes the perceived image quality with respect to a reference. Metrics are averaged for each frame of the output sequence. Lower MSE and MAE and higher SSIM indicate better performance.
We evaluate PhyDNet against strong recent baselines, including very competitive data-driven RNN architectures: ConvLSTM [73], PredRNN [66], Causal LSTM [64], Memory in Memory (MIM) [67]. We also compare to methods dedicated to specific datasets: DDPAE [21], a disentangling method specialized and state-of-the-art on Moving MNIST; and the physically-constrained advection-diffusion flow model [11] that is state-of-the-art for the SST dataset.
Overall results presented in Table 1 reveal that PhyDNet outperforms significantly all baselines on all four datasets. The performance gain is large with respect to state-of-the-art general RNN models, with a gain of 17 MSE points for Moving MNIST, 6 MSE points for Human 3.6, 3 MSE points for SST and 1 MSE point for Traffic BJ. In addition, PhyDNet also outperforms specialized models: it gains 14 MSE points compared to the disentangling DDPAE model [21] specialized for Moving MNIST, and 2 MSE points compared to the advection-diffusion model [11] dedicated to SST data. PhyDNet also presents large and consistent gains in SSIM, indicating that image quality is greatly improved by the physical regularization. Note that for Human 3.6, a few approaches use specific strategies dedicated to human motion with additional supervision, e.g. human pose in [61]. We perform similarly to [61] using only unsupervised training, as shown in supplementary 2.3. This is, to the best of our knowledge, the first time that physically-constrained deep models reach state-of-the-art performances on generalist video prediction datasets.
In Figure 4, we provide qualitative prediction results for all datasets, showing that PhyDNet properly forecasts future images for the considered horizons: digits are sharply and accurately predicted for Moving MNIST in (a), the absolute traffic flow error is low and approximately spatially independent in (b), the evolving physical SST phenomena are well anticipated in (c) and the future positions of the person are accurately predicted in (d). We add in Figure 4(a) a qualitative comparison to DDPAE [21], which fails to predict the future frames properly. Since the two digits overlap in the input sequence, DDPAE is unable to disentangle them. In contrast, PhyDNet successfully learns the physical dynamics of the two digits in a disentangled latent space, leading to a correct prediction. In supplementary 2.4, we detail this comparison to DDPAE, and provide additional visualizations for all datasets.
We perform here an ablation study to analyse the respective contributions of physical modeling and disentanglement. Results are presented in Table 2 for all datasets. We see that a 1-layer PhyCell model (only the left branch of PhyDNet in Figure 2(b)) outperforms a 3-layers ConvLSTM (50 MSE points gained for Moving MNIST, 8 MSE points for Human 3.6, 7 MSE points for SST and equivalent results for Traffic BJ), while PhyCell has much fewer parameters (270,000 vs. 3 million parameters). This confirms that PhyCell is a very effective recurrent cell that successfully incorporates physical prior in deep models. When we further add our disentangling strategy with the two-branch architecture (PhyDNet), we have another performance gap on all datasets (25 MSE points for Moving MNIST, 7 points for Traffic and SST, and 5 points for Human 3.6), which proves that physical modeling is not sufficient by itself to perform general video prediction and that learning unknown factors is necessary.
We qualitatively analyze in Figure 5 partial predictions of PhyDNet for the physical branch and the residual branch. As noted in Figure 1 for Moving MNIST, the physical branch captures coarse localisations of objects, while the residual branch captures fine-grained details that are not useful for the physical model. Additional visualizations for the other datasets and a discussion on the number of parameters are provided in supplementary 2.5.
We conduct in Table 3 a finer ablation on Moving MNIST to study the impact of the physical regularization $\mathcal{L}_{\text{moment}}$ on the performance of PhyCell and PhyDNet. When we disable $\mathcal{L}_{\text{moment}}$ for training PhyCell, performance improves by 7 points in MSE. This underlines that physical laws alone are too restrictive for learning dynamics in a general context, and that complementary factors should be accounted for. On the other hand, when we disable $\mathcal{L}_{\text{moment}}$ for training our disentangled architecture PhyDNet, performance decreases by 5 MSE points compared to the physically-constrained version. This shows that physical constraints are relevant, but should be incorporated carefully in order to make both branches cooperate. This makes it possible to leverage the physical prior while keeping the remaining information necessary for pixel-level prediction. The same conclusions can be drawn for the other datasets, see supplementary 2.6.
With the same general backbone architecture, PhyDNet can express different PDE dynamics associated with the underlying phenomena by learning specific coefficients $c_{i,j}$ combining the partial derivatives in Eq (3). In Figure 6, we display the mean amplitude of the learned coefficients with respect to the order of differentiation. For Moving MNIST, the 0th and 1st orders are largely dominant, meaning a purely advective behaviour coherent with the piecewise-constant translation dynamics of the dataset. For Traffic BJ and SST, there is also a global decrease in amplitude with respect to order; we nonetheless notice a few higher-order terms appearing to be useful for prediction. For Human 3.6, where the nature of the prior motion is less obvious, these coefficients are more spread across derivative orders.
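Figure 6: Mean amplitude of the learned coefficients with respect to the order of differentiation, for Moving MNIST, Traffic BJ, SST and Human 3.6.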
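The kind of summary displayed per dataset can be sketched as follows. This is a hedged illustration only: the aggregation rule (averaging the absolute value of the coefficients sharing the same total order $i + j$) is an assumption about how such a plot could be produced, and the coefficient values in `toy_coeffs` are invented for the example, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Tuple


def mean_amplitude_by_order(coeffs: Dict[Tuple[int, int], float]) -> Dict[int, float]:
    """Average |c_ij| over all index pairs (i, j) with the same total order i + j."""
    grouped = defaultdict(list)
    for (i, j), c in coeffs.items():
        grouped[i + j].append(abs(c))
    return {order: sum(vals) / len(vals) for order, vals in sorted(grouped.items())}


# Hypothetical coefficients of an advection-like dynamics (made-up values).
toy_coeffs = {
    (0, 0): 1.0,
    (1, 0): -0.5, (0, 1): 0.25,
    (2, 0): 0.0, (1, 1): 0.0, (0, 2): 0.0,
}
amplitudes = mean_amplitude_by_order(toy_coeffs)
```

For these toy values the amplitude is concentrated on orders 0 and 1 and vanishes at order 2, which is the qualitative pattern the text associates with a purely advective behaviour.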
We explore here the robustness of PhyDNet when dealing with unreliable inputs, that can arise in two contexts: long-term forecasting and missing data. As explained in section 3.3, PhyDNet can be used in a prediction mode in this context, limiting the use of unreliable inputs, whereas general RNNs cannot. To validate the relevance of the prediction mode, we compare PhyDNet to DDPAE [21], based on a standard RNN (LSTM) as predictor module. Figure 7 presents the results in MSE obtained by PhyDNet and DDPAE on Moving MNIST (see supplementary 2.7 for similar results in SSIM).
For long-term forecasting, we evaluate the performance of both methods far beyond the prediction range seen during training (up to 80 frames), as shown in Figure 7(a). We can see that the performance drop (MSE increase rate) is approximately linear for PhyDNet, whereas it is much more pronounced for DDPAE. For example, PhyDNet for 80-step prediction reaches an MSE similar to that of DDPAE for 20-step prediction. This confirms that PhyDNet can limit error accumulation during forecasting by using a powerful dynamical model.
Finally, we evaluate the robustness of PhyDNet and DDPAE to missing data, by varying the ratio of missing data (from 10 to 50%) in input sequences during training and testing. A missing input image is replaced with a default value (0) image. In this case, PhyCell relies only on its latent dynamics by setting $K_t = 0$, whereas DDPAE takes the null image as input. Figure 7(b) shows that the performance gap between PhyDNet and DDPAE increases with the percentage of missing data.
Figure 7: MSE of PhyDNet and DDPAE on Moving MNIST for (a) long-term forecasting and (b) missing data.
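The handling of missing frames described above can be sketched as a rollout in which a missing observation disables the correction step. This is a simplified illustration, not the released code: the gain is a fixed scalar instead of a learned gate, the latent state is a plain list, missing frames are represented by `None`, and `decay` is a hypothetical stand-in for the physical predictor.

```python
from typing import Callable, List, Optional

Vector = List[float]


def rollout(h0: Vector, inputs: List[Optional[Vector]],
            phi: Callable[[Vector], Vector], gain: float) -> List[Vector]:
    """Unroll prediction-correction steps; a missing input (None) forces the gain to 0."""
    states: List[Vector] = []
    h = h0
    for e in inputs:
        # Prediction: advance the state with the physical predictor only.
        h_pred = [hi + di for hi, di in zip(h, phi(h))]
        if e is None:
            # Missing frame: keep the physical prediction, skip the correction.
            h = h_pred
        else:
            # Observed frame: blend the prediction with the embedded input.
            h = [hp + gain * (ei - hp) for hp, ei in zip(h_pred, e)]
        states.append(h)
    return states


def decay(h: Vector) -> Vector:
    """Hypothetical predictor: halves the state at every step."""
    return [-0.5 * x for x in h]


# One observed frame followed by two missing ones.
states = rollout([4.0], [[4.0], None, None], decay, gain=0.5)
```

After the first step the state is pulled halfway towards the observation; for the two missing frames it evolves under the predictor alone. A model that instead fed a null image through its usual update would be pulled towards zero at every missing frame, which is the failure mode the comparison with DDPAE highlights.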
We propose PhyDNet, a new model for disentangling prior dynamical knowledge from other factors of variation required for video prediction. PhyDNet makes it possible to apply PDE-constrained prediction beyond fully observed physical phenomena in pixel space, and to outperform state-of-the-art methods on four generalist datasets. Our recurrent physical cell for modeling PDE dynamics generalizes recent models and offers the appealing property of decoupling prediction from correction. Future work includes using more complex numerical schemes, e.g. Runge-Kutta [15], and extension to probabilistic forecasts with uncertainty estimation [18, 9], e.g. with stochastic differential equations [23].
Recurrent Kalman networks: factorized inference in high-dimensional deep feature spaces. In International Conference on Machine Learning (ICML), pp. 544–552.
Data assimilation as a learning tool to infer ordinary differential equation representations of dynamical models. Nonlinear Processes in Geophysics 26 (3), pp. 143–162.
European Conference on Computer Vision (ECCV), pp. 753–769.
Attend, infer, repeat: fast scene understanding with generative models. In Advances in Neural Information Processing Systems (NeurIPS), pp. 3225–3233.
Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1811–1820.
Hybridnet: classification and reconstruction cooperation for semi-supervised learning. In European Conference on Computer Vision (ECCV), pp. 153–169.
Relational neural expectation maximization: unsupervised discovery of objects and their interactions. In International Conference on Learning Representations (ICLR).
Thirty-First AAAI Conference on Artificial Intelligence.