Robust prediction of dynamical systems remains an open question in machine learning and in engineering more generally. Such capabilities would enable innovations in several fields, including system control, autonomous agents and computer-aided engineering. The use of deep networks for sequence modelling has recently gained significant traction (Girin et al., 2020), aided also by advances in self-supervision (Wei et al., 2018). Accurate long-term prediction, though, can be notoriously difficult, especially for dynamical systems where errors accumulate in finite time (Zhou et al., 2020; Fotiadis et al., 2020; Raissi et al., 2019). One reason why the prediction of dynamical systems is hard is the variability of the solution space. Even simple ODEs, like the swinging pendulum or the $n$-body system, can have multiple continuous parameters that affect their evolution. Capturing the whole range of such parameters in a single training set is unrealistic, and further inductive biases are required for robustness (Fotiadis et al., 2020; Bird and Williams, 2019; Barber et al., 2021; Miladinović et al., 2019).
Many approaches based on neural networks try to jointly learn the dynamics and the physical parameters, which results in convoluted representations and usually leads to overfitting (Bengio et al., 2012). System identification can be used to extract parameters, but requires knowledge of the underlying system to be computationally effective (Ayyad et al., 2020). We leverage advances in Variational Autoencoders (Kingma and Welling, 2014) to learn representations in which the ODE parameters are disentangled from the dynamics. Disentanglement enables distinct latent variables to focus on different factors of variation of the data distribution, and has been successfully applied in the context of image generation (Higgins et al., 2017; Kim and Mnih, 2018). We translate this idea to dynamics modelling by treating ODE parameters as factors of variation. Recent findings (Locatello et al., 2018, 2019) emphasize the vital role of inductive biases from models or data for useful disentanglement. We tap into the wealth of ground-truth values of ODE parameters, which are cheaply collected in simulations. Furthermore, while non-trivial, using simulated data for training real-world models is an increasingly appealing option (Peng et al., 2017). With supervised disentanglement, VAEs can achieve better generalization in parameter spaces they were not exposed to during training.
Contributions By treating the ODE parameters as factors of variation of the data and applying supervised disentanglement, we enforce several inductive biases. First, in addition to prediction, the encoder also performs "soft" system identification, which acts as a regularizer. Second, supervision creates an implicit hierarchy in which some latent variables correspond to sequence-wide ODE parameters and the rest capture instantaneous dynamics. This also renders the latent space more interpretable. Third, the extracted parameters condition the decoder, bringing it closer to numerical integrators where the ODE parameters are known. We assess our method on three dynamical systems and demonstrate that disentangled VAEs capture the variability of dynamical systems better than baseline models. We also assess out-of-distribution (OOD) generalization under increasing degrees of ODE parameter shift, and find that disentanglement provides an important advantage in this case.
2 Related Work
VAEs and disentanglement While supervised disentanglement in generative models is a long-standing idea (Mathieu et al., 2016), information-theoretic properties can be leveraged to allow unsupervised disentanglement in VAEs (Higgins et al., 2017; Kim and Mnih, 2018). The impossibility result of Locatello et al. (2018) demonstrated that disentangled learning is only possible through inductive biases coming either from the model or from the data. Hence, the focus shifted back to semi- or weakly-supervised disentanglement approaches (Locatello et al., 2019, 2020). While most of these methods focus on assessing the disentanglement itself, we assess it directly on the downstream prediction task.
Disentanglement in sequence modelling While disentanglement techniques are mainly tested in a static setting, there is growing interest in applying them to sequence dynamics. Using a bottleneck based on physical knowledge, Iten et al. (2018) learn an interpretable representation that requires conditioning the decoder on time, but it can return physically inconsistent predictions on OOD data (Barber et al., 2021). Deep state-space models (SSMs) have also employed techniques for disentangling content from dynamics (Fraccaro et al., 2017; Li and Mandt, 2018), but they focus mostly on modelling variations in the content, failing to take the dynamics into account. In hierarchical approaches (Karl et al., 2017), different layers of latent variables correspond to different timescales: for example, in speech analysis, for separating voice characteristics from phoneme-level attributes (Hsu et al., 2017). In an approach similar to our work, Miladinović et al. (2019) separate the dynamics from sequence-wide properties in dynamical systems like Lotka-Volterra, but they do so in an unsupervised way, which dismisses a wealth of cheap information, and they assess OOD generalization only in a very limited way.
Feed-forward models for sequence modelling Deep SSMs are difficult to train as they require non-trivial inference schemes and a careful design of the dynamic model (Krishnan et al., 2015; Karl et al., 2017). Feed-forward models, with the necessary inductive biases, have been used for sequence modelling both in language (Bai et al., 2018) and in dynamical systems (Greydanus et al., 2019; Fotiadis et al., 2020). Disentanglement has not been successfully addressed in these models; together with Barber et al. (2021), our work is an attempt in this direction.
3 Supervised disentanglement of ODE parameters in VAEs
Variational autoencoders (VAEs) (Kingma and Welling, 2014) offer a principled approach to latent variable modeling by combining a variational inference model $q_\phi(z \mid x)$ with a generative model $p_\theta(x \mid z)$. As in other approximate inference methods, the goal is to maximize the evidence lower bound (ELBO) over the data:

$$\mathcal{L}_{\phi,\theta}(x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$

The first part of the ELBO is the reconstruction loss (in our case the prediction loss) and the second part is the Kullback-Leibler divergence, which quantifies how close the approximate posterior is to the prior.
Design choices for the model We use an isotropic unit Gaussian prior $p(z) = \mathcal{N}(0, I)$, which helps to disentangle the learned representation (Higgins et al., 2017). The approximate posterior (encoder) distribution is a Gaussian with diagonal covariance, allowing a closed-form KL divergence, while the decoder has a Laplace distribution with constant diagonal covariance, which is tuned empirically. This leads to an $L_1$ loss that provides improved results in some problems (Mathieu et al., 2018) and empirically works better in our case. The encoder mean and variance and the decoder mean are computed via feed-forward neural networks.
Disentanglement of ODE parameters in latent space Apart from the disentanglement that stems from the choice of prior, we explicitly disentangle part of the latent space so that it corresponds to the ODE parameters of each input sequence. We achieve this with a regression loss term between the ground-truth ODE parameters and the output of the corresponding latents. We opted for an $L_1$ loss, corresponding to a Laplacian prior with its mean at the ground-truth value and unitary covariance. Previous methods have reported that binary cross-entropy (BCE) works better than $L_2$ (Locatello et al., 2019), but this does not fit well in a setting like ours. We hypothesize that BCE works better because of its implicit scaling. To address this, we propose applying a function that linearly scales the latent outputs between the min and max values of the corresponding factor of variation. In all cases, the regression term is weighted by a parameter which is tuned empirically. Combining these choices, the loss function is the sum of the prediction loss, the KL divergence and the weighted regression term.
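The three-term loss described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the choice to supervise the first latents of the encoder mean, the weights `weight_kl` and `weight_sup`, and the exact form of the linear scaling (mapping a latent in [0, 1] onto [min, max]) are all hypothetical.

```python
import math

def kl_diag_gaussian(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def l1(a, b):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def linear_scale(z, lo, hi):
    """Assumed scaling: map each latent from [0, 1] onto [lo, hi]."""
    return [l + zi * (h - l) for zi, l, h in zip(z, lo, hi)]

def sd_loss(y_pred, y_true, mu, log_var, params_true,
            weight_kl=1.0, weight_sup=0.1, scale=None):
    """Prediction L1 + weighted KL + weighted L1 on the supervised latents.

    Assumption: the first len(params_true) entries of the encoder mean are
    the latents supervised with the ground-truth ODE parameters.
    scale=None is the identity scaling; scale=(lo, hi) is the linear one.
    """
    z_sup = list(mu[:len(params_true)])
    if scale is not None:
        lo, hi = scale
        z_sup = linear_scale(z_sup, lo, hi)
    return (l1(y_pred, y_true)
            + weight_kl * kl_diag_gaussian(mu, log_var)
            + weight_sup * l1(z_sup, params_true))
```

With `scale=None` the sketch corresponds to identity scaling, and with `scale=(lo, hi)` to linear scaling between the training-set minimum and maximum of each factor.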
Models were compared on three dynamical systems: the swinging pendulum, the Lotka-Volterra system and the 3-body system.
The systems were chosen for their varied complexity in terms of degrees of freedom, number of ODE equations and factors of variation. For the pendulum we consider one factor of variation, its length; Lotka-Volterra has 4 factors of variation and the 3-body system also has 4 factors of variation. Factors are drawn uniformly from a predetermined range, which is the same for the training, validation and test sets. To further assess the OOD prediction accuracy, we create two additional test sets with factor values outside of the original range. We denote these datasets as OOD Test-set Easy and Hard, representing a smaller and a bigger deviation from the original range. The data were additionally corrupted with Gaussian noise. Dataset details can be found in Table 1 of the Appendix.
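As an illustration of how the in-distribution and OOD factor ranges relate, the sketch below draws a single factor (for example the pendulum length) from a training range and from two ranges shifted outside it. All numeric values and helper names are hypothetical placeholders, not the ranges used in the paper, and placing the OOD ranges above the training range is an assumption.

```python
import random

# Hypothetical training range for one factor of variation.
TRAIN_RANGE = (1.0, 1.5)

def shifted_range(train_range, gap, width):
    """Return a range of the given width starting `gap` above the training range."""
    _, hi = train_range
    return (hi + gap, hi + gap + width)

def deviation(value, train_range):
    """Distance of `value` from the training range (0.0 if inside it)."""
    lo, hi = train_range
    return max(lo - value, value - hi, 0.0)

def sample_factor(rng, factor_range):
    """Draw one factor of variation uniformly from its range."""
    return rng.uniform(*factor_range)

# Smaller and bigger deviations from the training range (placeholder values).
OOD_EASY = shifted_range(TRAIN_RANGE, gap=0.05, width=0.10)
OOD_HARD = shifted_range(TRAIN_RANGE, gap=0.20, width=0.20)
```

Training, validation and test factors would all be drawn from `TRAIN_RANGE`, while the two OOD test sets would use `OOD_EASY` and `OOD_HARD` respectively.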
4.2 Models and training
The main goal of this work is to assess whether OOD prediction can be improved by using ODE parameters to disentangle the latent representation in VAEs. We opted to use simple models to allow more experiments and comparisons. Our main baseline is the VAE, upon which we propose two enhancements that leverage supervised disentanglement, using the loss function described in Section 3. The first model, called VAE with Supervised Disentanglement (VAE-SD), uses an identity scaling function. The second, termed VAE-SSD, uses a linear scaling function defined by the minimum and maximum values of the ODE parameters in the training set. Another baseline is a multilayer perceptron (MLP) autoencoder, which allows comparison with a deterministic counterpart of the VAE. We additionally use supervised disentanglement on the latent neurons of the MLP, a model we refer to as MLP-SD. This enables us to assess whether the parameter information can improve other models. Lastly, we include a stacked LSTM model, a popular choice for low-dimensional sequence modelling (Yu et al., 2019), as a representative recurrent method.
Early experiments revealed a significant variance in the performance of the models, depending on hyperparameters. Under these conditions, we took several steps to make model comparisons as fair as possible. Firstly, all models have similar capacity in terms of neuron count. Secondly, we tune various hyperparameter dimensions, some of which are shared and others model-specific, as can be seen in detail in Tables 3, 4 and 5 of the Appendix. Lastly, we conduct a thorough grid search on the hyperparameters to avoid undermining any model. We run the same number of experiments for all models, which amounts to 1440 trained models in total, as summarized in Table 2 of the Appendix.
For each dynamical system we focus on the performance on the three test sets: the in-distribution test set and the two OOD test sets, which represent an increasing shift from the training data. Models are compared on the cumulative Mean Absolute Error (MAE) between prediction and ground truth for 200 predicted time-steps. This is at least 20 times longer than the training supervision. Long predictions are obtained by re-feeding the model outputs back as input. This approach has been shown to work well in systems where the dynamics are locally deterministic (Fotiadis et al., 2020). A summary of the quantitative results can be found in Figure 2. To account for the variability in the results, we present the 10 runs of each model with the lowest MAE.
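The evaluation loop described above, re-feeding outputs as inputs and accumulating the absolute error, can be sketched as follows. This is a minimal sketch for scalar sequences: `rollout` and `cumulative_mae` are hypothetical names, `model` is any callable that maps the last input window to one or more next values, and reading "cumulative MAE" as a running sum of per-step absolute errors is an assumption.

```python
from itertools import accumulate

def rollout(model, history, n_steps):
    """Predict n_steps values by feeding model outputs back as inputs.

    model: callable taking a list with len(history) values and returning
           a list with one or more predicted next values.
    """
    window = list(history)
    size = len(window)
    preds = []
    while len(preds) < n_steps:
        out = list(model(window))
        preds.extend(out)
        # Slide the input window over the model's own predictions.
        window = (window + out)[-size:]
    return preds[:n_steps]

def cumulative_mae(preds, truth):
    """Running sum of absolute errors over the prediction horizon."""
    return list(accumulate(abs(p - t) for p, t in zip(preds, truth)))
```

For example, with a toy model that continues a linear ramp, `rollout(lambda w: [w[-1] + 1.0], [0.0, 1.0, 2.0], 4)` extends the ramp by four steps, and `cumulative_mae` then shows how any per-step error builds up over the horizon.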
As expected, the MAE is positively correlated with the data distribution shift of the test sets for all systems and models. Among the non-disentangled models, the MLP is better than the VAE in most cases, while the LSTM is only comparable on the pendulum dataset for small OOD shifts. Disentangled VAE models offer a substantial and consistent improvement over the VAE. It is also important to note that the improvement is more pronounced for the OOD test sets, where the distribution shift is greater. This holds true across all 3 dynamical systems, a strong sign that disentanglement of ODE parameters is an inductive bias that can lead to better generalization. On the other hand, results for the MLP-SD are mixed, with overfitting observed in some cases, especially OOD. This suggests that probabilistic models are better suited to capture the variation in the data. In any case, the contrast between VAE-SD and MLP-SD illustrates that making use of privileged information is not trivial, and more work is needed to understand what works in practice and why.
Comparing the disentangled VAEs, we see that the scaling in VAE-SSD allows it to better model the data, yielding a lower error in-distribution. This seems to come at a slight overfitting cost, because the VAE-SD provides better OOD extrapolation in most cases. A possible explanation is that the extra scaling depends on the min and max values of the factors in the training set. The extra information allows the model to better capture the training data but sacrifices some generalization capacity. Qualitative predictions can be found in Figure 3. All models produce plausible trajectories; nevertheless, the error in some experiments explodes after a finite number of steps.
Supervised disentanglement of ODE parameters in the latent space of VAEs is a helpful inductive bias that improves OOD generalization in the modelling of dynamical systems. Disentanglement acts as a regularizer for the encoder, enforces an implicit hierarchy in the latent space, makes the model more explainable and conditions the decoder. Disentanglement in MLP autoencoders does not yield equally consistent improvements, indicating that using extra information is not straightforward and requires further exploration. While transferring models trained on simulated data to the real world is far from trivial, simulated data are cheap, and this motivates related fields like sim2real. In that light, supervised disentanglement can provide a pathway to improved robustness in real-world applications where dynamical system prediction is critical. Applying the method to high-dimensional spatiotemporal data from more complicated dynamical systems can further increase its relevance. Sequence-wide parameters could also be exploited through self-supervision.
- Ayyad et al. (2020). Real-time system identification using deep learning for linear processes with application to unmanned aerial vehicles. IEEE Access 8, pp. 122539–122553.
- Bai et al. (2018). Trellis Networks for Sequence Modeling. arXiv.
- Barber et al. (2021). Joint Parameter Discovery and Generative Modeling of Dynamic Systems.
- Bengio et al. (2012). Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8), pp. 1798–1828.
- Bird and Williams (2019). Customizing sequence generation with multitask dynamical systems. arXiv.
- Fotiadis et al. (2020). Comparing recurrent and convolutional neural networks for predicting wave propagation. In ICLR 2020 Workshop on Deep Differential Equations.
- Fraccaro et al. (2017). A disentangled recognition and nonlinear dynamics model for unsupervised learning. Advances in Neural Information Processing Systems, pp. 3602–3611.
- Girin et al. (2020). Dynamical Variational Autoencoders: A Comprehensive Review.
- Greydanus et al. (2019). Hamiltonian Neural Networks. pp. 1–15.
- Higgins et al. (2017). beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework.
- Hsu et al. (2017). Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data. Advances in Neural Information Processing Systems, pp. 1879–1890.
- Iten et al. (2018). Discovering physical concepts with neural networks.
- Karl et al. (2017). Deep variational Bayes filters: Unsupervised learning of state space models from raw data. 5th International Conference on Learning Representations, ICLR 2017, Conference Track Proceedings, pp. 1–13.
- Kim and Mnih (2018). Disentangling by Factorising. 35th International Conference on Machine Learning, ICML 2018, pp. 4153–4171.
- Kingma and Welling (2014). Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Conference Track Proceedings.
- Krishnan et al. (2015). Deep Kalman Filters.
- Li and Mandt (2018). Disentangled Sequential Autoencoder. 35th International Conference on Machine Learning, ICML 2018, pp. 8992–9001.
- Locatello et al. (2018). Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations. 36th International Conference on Machine Learning, ICML 2019, pp. 7247–7283.
- Locatello et al. (2020). Weakly-Supervised Disentanglement Without Compromises. arXiv.
- Locatello et al. (2019). Disentangling Factors of Variation Using Few Labels.
- Mathieu et al. (2018). Disentangling Disentanglement in Variational Autoencoders.
- Mathieu et al. (2016). Disentangling factors of variation in deep representations using adversarial training.
- Miladinović et al. (2019). Disentangled State Space Representations.
- Peng et al. (2017). Sim-to-real transfer of robotic control with dynamics randomization. CoRR abs/1710.06537.
- Raissi et al. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics 378, pp. 686–707.
- Wei et al. (2018). Learning and using the arrow of time. pp. 8052–8060.
- Yu et al. (2019). A Review of Recurrent Neural Networks: LSTM Cells and Network Architectures. Neural Computation 31 (7), pp. 1235–1270.
- Zhou et al. (2020). Informer: Beyond efficient transformer for long sequence time-series forecasting. arXiv.
Appendix A Datasets
For simulations, we use an adaptive Runge-Kutta integrator with a timestep of seconds. Each simulated sequence has a different combination of factors of variation. Simulation of the pendulum uses an initial angle which is randomly between while the angular velocity is 0. For the other two systems the initial conditions are always the same to avoid pathological configurations.
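Table 1: Dataset details.

| | Pendulum | Lotka-Volterra | 3-body |
|---|---|---|---|
| Number of ODEs | 1 | 2 | 6 |
| Independent variables | | (prey), (predator) | |
| Factors of variation | (length) | | |
| OOD Test Set Easy | | | |
| OOD Test Set Hard | | | |
| Number of sequences, OOD Test Set Easy | 1000 |
| Number of sequences, OOD Test Set Hard | 1000 |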
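The pendulum simulation described in this appendix can be sketched as below. Note the swap: the text uses an adaptive Runge-Kutta integrator, whereas this sketch uses the classical fixed-step fourth-order Runge-Kutta scheme to stay dependency-free. The timestep `dt=0.01`, the gravitational constant and all function names are assumptions chosen for illustration.

```python
import math
import random

G = 9.81  # gravitational acceleration in m/s^2 (assumed)

def pendulum_rhs(state, length):
    """Right-hand side of the pendulum ODE: theta'' = -(g / L) * sin(theta)."""
    theta, omega = state
    return (omega, -(G / length) * math.sin(theta))

def rk4_step(state, length, dt):
    """One classical fourth-order Runge-Kutta step."""
    def shift(s, k, h):
        return (s[0] + h * k[0], s[1] + h * k[1])
    k1 = pendulum_rhs(state, length)
    k2 = pendulum_rhs(shift(state, k1, dt / 2), length)
    k3 = pendulum_rhs(shift(state, k2, dt / 2), length)
    k4 = pendulum_rhs(shift(state, k3, dt), length)
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )

def simulate(length, theta0, n_steps, dt=0.01):
    """Simulate from angle theta0 with zero initial angular velocity."""
    state = (theta0, 0.0)
    trajectory = [state]
    for _ in range(n_steps):
        state = rk4_step(state, length, dt)
        trajectory.append(state)
    return trajectory

def sample_sequence(rng, length_range, angle_range, n_steps):
    """Draw the length (factor of variation) and initial angle, then simulate."""
    length = rng.uniform(*length_range)
    theta0 = rng.uniform(*angle_range)
    return length, simulate(length, theta0, n_steps)
```

Each call to `sample_sequence` returns the ground-truth length together with the trajectory, which is the pairing that supervised disentanglement relies on.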
Appendix B Training and Hyperparameters
During training the back-propagation is used after a single forward pass. The input and output of the models are smaller than the sequence size, so to cover the whole sequence we use random starting points per batch, both during training and testing. The validation set is used for early stopping. We used the Adam optimizer with and . A scheduler for the learning rate was applied whose patience and scaling factor are hyperparameters.
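Since the model input and output are shorter than a full sequence, each batch is cut from the sequences at a random starting point. A minimal sketch of one possible reading, in which a single start index is drawn per batch and shared by all sequences in it, is given below; `sample_batch` and its arguments are hypothetical names, and the per-batch (rather than per-sample) start is an assumption.

```python
import random

def sample_batch(sequences, input_size, output_size, batch_size, rng):
    """Cut (input, target) windows from random sequences at a shared random start.

    Requires batch_size <= len(sequences) and every sequence to be at least
    input_size + output_size long.
    """
    span = input_size + output_size
    chosen = rng.sample(sequences, batch_size)
    # One random starting point per batch, valid for every chosen sequence.
    start = rng.randrange(min(len(s) for s in chosen) - span + 1)
    return [
        (s[start:start + input_size], s[start + input_size:start + span])
        for s in chosen
    ]
```

Repeated calls cover different parts of the sequences over the course of training, and the same routine can be used at test time.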
B.1 Number of experiments
| Input Size | 10, 50 |
| Output Size | 1, 10 |
| Hidden Layers | [400, 200] | 50, 100, 200 |
| Latent Size | 4, 8, 16 | - |
| Batch size | 16, 32 | 16 | 16, 32 | 16 | 16, 64 |
| Sched. patience | 20, 30, 40 | 20, 30 | 20 | 20 | 30 |
| Layer norm (latent) | No | No | Yes | Yes | No |
| Supervision | - | 0.1, 0.2, 0.3 | - | 0.01, 0.1, 0.2 | - |
| # of experiments | 72 | 72 | 72 | 72 | 72 |

| Hidden Layers | [400, 200] | 50, 100 |
| Latent Size | 8, 16, 32 | - |
| Batch size | 16, 32, 64 | 16, 32 | 16, 32 | 16 | 10, 64, 128 |
| Sched. patience | 20, 30 | 20, 30 | 20 | 20 | 20, 30 |
| Sched. factor | 0.3, 0.4 | 0.3 | 0.3 | 0.3 | 0.3 |
| Gradient clipping | No | No | 0.1, 1.0 | 0.1, 1.0 | No |
| Layer norm (latent) | No | No | No | No | No |
| Teacher Forcing | - | - | - | - | Partial, No |
| Supervision | - | 0.1, 0.2, 0.3 | - | 0.01, 0.1, 0.2, 0.3 | - |
| # of experiments | 72 | 72 | 72 | 72 | 72 |

| Hidden Layers | [400, 200] | 50, 100 |
| Latent Size | 8, 16, 32 | - |
| Batch size | 16, 32 | 16 | 16 | 16 | 16, 64, 128 |
| Sched. patience | 30, 40, 50, 60 | 30, 40, 50, 60 | 30, 40, 50, 60 | 30, 40, 50, 60 | 20, 30 |
| Sched. factor | 0.3, 0.4 | 0.3 | 0.3, 0.4 | 0.3, 0.4 | 0.3 |
| Layer norm (latent) | No | No | No | No | No |
| Supervision | - | 0.05, 0.1, 0.2, 0.3 | - | 0.1, 0.2 | - |
| # of experiments | 96 | 96 | 96 | 96 | 96 |