1 Introduction
Dynamics modeling of complex systems is one of the key problems in different areas of science. The underlying dependencies rising from the reallife systems are often complex or even unpredictable. Numerical modeling of their solutions is a challenging task of computational science. The main factor complicating the dynamical systems modeling is the sensitivity to initial conditions. If the solutions of the system with arbitrary close initial conditions exponentially diverge, then even small error in predictions leads to complete divergence of the approximation model.
In this paper, we are focusing on the problem of the dynamical systems modeling from observations. We are given a set of trajectories of the dynamical system:
(1) 
and our goal is to train a model reproducing the underlying dynamics. We focus on the specific class of dynamical systems, deterministic chaos, combing sensitivity to initial conditions with the existence of a bounded invariant set. Even though the solutions of such a system are bounded, they exponentially diverge from each other:
(2) 
This complex structure of an attractor is caused by the properties of the underlying equations. Therefore, if we directly approximate function , even a small deviation from the functional dependence results in the trajectories leaving the invariant set. As a result, a straightforward approximation of the function
by a neural network does not allow preserving the properties of the system.
To overcome this issue, we introduce a scalable probabilistic approach to the dynamical systems modeling based on the transformer architecture. The method directly exploits the geometry of the phase space, which allows modeling of the highdimensional systems. The model has parameters, where is the size of the grid and  dimensionality of the dynamical system.
Main contributions of this paper are:

We propose a geometryaware transformerdecoder model for the dynamical systems modeling that incorporates the factorization of both embedding and classification layers.

We introduce an iterative training approach that significantly speeds up the convergence of the model.

We apply the model to the simulated data and demonstrate that it outperforms baselines on the Rossler system and shows competitive performance on the Lorenz96 systems.
2 Proposed Method
For chaotic systems, the accuracy of the longterm predictions is a questionable functional, since even for the original system, small perturbation leads to large errors in the trajectories. Instead, we can approximate the stable quantities of dynamical systems, such as the invariant measure. In order to do it, instead of dealing with continuous spacetime, we discretize them both and convert the dynamical system modeling into discrete sequence modeling.
The goal is now to predict the conditional probability distribution. The main challenge is that the phasespace is highdimensional, and discretization with a uniform grid requires
"words", making this approach impractical fordue to the curse of dimensionality. However, such discretization is perfect for the use of
tensor factorization methods. Standard matrix of embeddings for the given vocabulary would have the shape , where is the size of representations. To avoid the exponential growth of the size of this matrix with respect to , we introduce the tensor coding layer, which factorizes these embeddings into geometryaware components. Similarly, we handle the classification layer, reducing a single classification problem to the separate prediction of the indexes along each of the axis. To get a trajectory in a continuous space we map predicted indexes back to the centers of the corresponding discretization segments.2.1 Factorized probabilistic model
We are given a set of the solutions of a system 1: , such that , where
is the constant time step. In real life, trajectories of the dynamical system are measured with a limited accuracy. This uncertainty can be viewed as an additive noise. So instead of the maximumlikelihood estimation of the next state:
(3) 
we propose to model the conditional distribution:
(4) 
For the discrete , this task is known as language modeling
. Similar to the dynamical systems, for a long time, language modeling tasks were being solved by recurrent neural networks
kalchbrenner2013recurrent ; sutskever2014sequence ; cho2014properties ; bahdanau2014neural, until the transformer model has revolutionized natural language processing
vaswani2017transformer. Selfattentive models remain the stateofart architectures in most of the NLP tasks
vaswani2017transformer ; devlin2018bert ; radford2019gpt ; brown2020language and lately are also adopted to the tasks from other areas of science sun2019contrastive ; carrasquilla2019quantum ; zhang2019self .In this paper, we build upon methodology proposed in shalova2020deep : we reduce the task of dynamical systems modeling to the language modeling using an equidistant grid discretization of the trajectories and apply transformer decoder to model dynamics in the discrete space. Each state of dimensional system is mapped to the multiindex representation , where is the grid size and each index indicates the position along one of the dimensions. Highdimensional probabilities modeling is a very challenging task even in the discrete space case, so instead of the direct modeling of the dimensional , we propose to approximate it with a factorization into the onedimensional distributions corresponding to the different dimensions:
(5) 
Such factorization makes easy both the likelihood computation and the sampling from the predicted distribution.
In other words, the indices are modeled by separate classification heads, that are learned to maximize the KLdivergence between empirical conditional distributions of the training set and predicted factorized distribution (Figure 2). Each of heads has parameters, which gives an parameters for the classification layer size.
2.2 Tensor coding
At the same time, each component of the distribution is conditioned by the whole information about previous states. It requires training of the unique representation for each of the unique states. Similar to the factorization of the probabilities, we propose the usage of the factorized embedding layer that relies on the known relation between the states. As such factorization, we use a tensor coding layer, a geometryaware analog of the tensortrain embeddings khrulkov2019ttembeddings . With a fixed representation size, it gives a estimation of the number of the parameters. In khrulkov2019ttembeddings , it was proposed to train the embeddings matrix in the format of a tensortrain matrix. Instead of matrix , where is the size of vocabulary and is a representation size, there is a dimensional tensor , such that . According to oseledets2011tensor , tensor can be approximated by the tensortrain decomposition:
(6) 
where are the ranks of the decomposition. In case of the embeddings layer in deep neural networks, during training, instead of the approximation of the fullrank matrix , we optimize parameters of the decomposition . In our case, we propose using a single component of the TTdecomposition for every dimension. That additionally allows an intuitive restoration of the chosen embeddings without computation of the whole matrix.
3 Iterative training
We found that direct training of the tensorized transformerdecoder model does not lead to good convergence (Figure 4
). However, uniform grid discretization introduces a natural hierarchy of models, and we can use the idea of a multigrid method from numerical analysis. Similar approaches have been successfully used in machine learning in progressive growing GANs, which were iteratively trained starting from a low resolution
karras2017progressive ; acharya2018towards .Each component of the embeddings factorization can be interpreted as an additional layer in the model. So, for highdimensional systems, we have a deep architecture with a slow convergence. Instead of the direct optimization of the model with a desirable discretization, we propose training model iteratively such that, on each step, the discretization scale is decreased. We begin with training the transformer decoder to reproduce dynamics on the roughest possible discretization with . Then, on each step, we use pretrained embeddings corresponding to segments to initialize embeddings with discretization size and finetune both representations and dynamics model on the obtained finer grid.
Each component of the TTembeddings is a 3dimensional tensor , where is the rank of decomposition and is the discretization size. To initialize the th component of the finergrid model we use linear approximation of the tensor. The even segments are initialized directly with the components of
, and the odd components are the average between the adjacent ones:
(7) 
In section 4.2, we demonstrate that linear approximation is appropriate as the learned representations are smooth along each direction.
On each finetuning iteration the model is trained to maximize the loglikelihood of the discrete trajectories. Easy to notice, that the factorization of the distribution leads to the factorization of the loss function:
(8) 
which corresponds to the standard multihead classification problem.
4 Experiments
In all experiments we use Hugginface implementation^{1}^{1}1https://github.com/huggingface/transformers
of the transformer decoder in PyTorch. Our implementation of the TTembeddings is based on the code
^{2}^{2}2https://github.com/KhrulkovV/ttpytorch for khrulkov2019ttembeddings except that on the forward pass all representations are reconstructed separately without restoration of the whole embeddings matrix. For both systems we use TTembeddings with the rank . Training data is simulated using an odeint integrator from the SciPy library in Python.In all cases, we use a zerotemperature generation, meaning that on each step model selects the mode of the predicted distribution. For the used dynamical systems, the learned distributions were shown to be close to the delta functions, which means that the characteristic size of the distribution was smaller than the size of discretization segments. In such a case, highertemperature generation, or, in other words, random index selection with the scaled predicted probabilities, is noisy because of the specific distribution form.
As baselines we use LSTM hochreiter1997long and Deep Koopman lusch2018deepkoopman models. For both dynamical systems, we use a singlelayer LSTM model with a hidden size . In the case of the Deep Koompan model, both encoder and decoder consist of three linear layers followed by activation; we set the hidden size = .
4.1 Rossler Attractor
is attractor for the Rossler system:
(9) 
We use the system with The training dataset consists of 1000 trajectories corresponding to the time with a time step . Initial states are sampled from distribution with and uniform . For testing we use trajectories with the initial conditions samples from the same distribution. In this experiment, we use discretization size . The transformer model has 12 selfattention layers with heads and size of representations .
Examples of the reconstructed trajectories are presented on Figure 6; shown simulation corresponds to . All models capture general behaviour, still, performance of the transformer model is better, that corresponds to the Root Mean Squared Error (RMSE) that is demonstrated in Figure 5. The graph represents dependence of the predictions error from time:
(10) 
averaged over
initial conditions. For both transformer and LSTM models, 100 steps were used as a conditioning. It is shown the GPT2 model significantly outperforms regression models in the longterm perspective. After about 50 steps, the transformer model compensates inaccuracy of the singlestate prediction and demonstrates much more stable behaviour then the Deep Koopman and LSTM models.
4.2 Lorenz96 System
We test scalability of the model on the example of Lorenz96 system with :
(11) 
where . We use forcing and iteratively increase the grid size from to segments along every axis. Final grid corresponds to size of the vocabulary. There are and trajectories in train and test sets; each trajectory corresponds to the simulation for steps with
. Initial states are chosen as the vector
, where perturbation is added only to the th component of the vector. For Lorenz96 systems we used transformer with 12 layers with attention heads each and representations of size .For this system, the exponential divergence of the real trajectories appears for the scale compared to the one of the discretization grid. As a result, the fact that we lose the microscale information is crucial for the restoration of the given trajectory. One can note that the discretization process is not bijective because the same sequence of multiindexes corresponds to the infinite number of continuous trajectories. If the continuations of these trajectories are already uncorrelated, then the probabilistic model can not be expected to generate a precise approximation of a single solution. Instead, it samples continuation from all possible trajectories. In Figure 5 (right), the divergence of real trajectories is shown to be comparable with the behavior of the ones generated by the tensorized transformer.
We also demonstrate the iterative training algorithm in action. In Figure 8, there are the components of the representation corresponding to the coordinate of the Lorenz96 system that is learned by the tensorized transformer after each iteration of the progressive training. After the first step, there are only two segments, so each component takes one of two possible values. On the next iteration, the axis is divided into three segments, then five and so on. Transitions to the finer grids are smooth, which proves that linear initialization is relevant in this case.
5 Related Work
Recently, deep neural networks have been replacing standard regression methods for dynamical systems modeling. Besides specific approaches like Deep Koopman model lusch2018deepkoopman ; yeung2019deepkoopman and PhysicsInformed neural networks raissi2019physics ; tipireddy2019physicsds ; doan2019physicsreservoir ; pangphysics , various universal architectures including ResNet chashchin2019resnet , ODEnet chen2018neural ; rubanova2019latent ; portwood2019turbulence ; de2019gru and recurrent models bailer1998recurrent ; trischler2016synthesis ; pan2018long ; gers2002lstmlaser ; vlachas2018lstm ; yu2017ttrnn ; vlachas2020backpropagation have been applied to the dynamics modeling of chaotic systems. Still, none of the mentioned methods can preserve the invariant set of the dynamical system without additional modifications. One of the possible solutions was presented in trischler2016synthesis ; it was proposed to predict
that belongs to the unitary cube and then rescale it with a linear transformation. This approach bounds the predictions but does not hold the shape of the invariant set. An alternative solution is the usage of an additional model bounding the predictions if they have low probability according to the training set
vlachas2018lstm .The radical alternative was proposed in shalova2020deep ; the authors suggested a probabilistic approach based on the transformer model. The equidistant grid discretization was used to map continuous states to the discrete representations. Then the model was trained to reproduce the conditional probabilities of the next state conditioned by the states’ history. It was shown that by sacrificing a microscale data, it is possible to obtain a good approximation of the longterm system’s behavior. The discretization of the phase space allowed modeling of the dynamics strictly holding the invariant set. One can note that a similar discrete approach to the conditional probabilities modeling is used in distributional RL bellemare2017distributional . At the same time, in shalova2020deep the geometrical properties of the phase space were not incorporated into the discretespace model. As a result, proposed discretization led to the size of the model.
6 Discussion
Limitations. Although the tensorization allows the handling of larger systems, such an approach has several drawbacks. For high dimensions
, even the iterative training approach does not lead to the efficient training of the full model, and many dynamical systems (such as discretizations of partial differential equations) have much higher dimensionality. Another restriction is a very simple factorized form of the probability distribution. Finally, it is not clear at the moment whether the developed method indeed learns the physics of the problem, and will it be able to predict something really "new”, and how dense should be the trajectories on the invariant set.
Future Work. The factorization approach proposed in this paper does not rely on the architecture of the language model. In this work, we used a powerful transformer decoder model as a backbone; however, any of the NLP models are applicable. Despite the transformer being an undoubtable SOTA in discrete sequence modeling, the discrete probabilistic approach alone may give a significant improvement compared to the continuous space modeling. Another possible line of work is the incorporation of the developed pipeline into more complex models that include dimensionality reduction: the dynamical system is first mapped to a lowdimensional latent feature space, the discretized and modeled there. It would make the proposed approach more applicable for very large .
7 Conclusion
In this paper, we introduced a probabilistic transformerbased model that learns the dynamics in the discrete space. To handle the dimensionality curse, we proposed using a tensor coding layer and factorizing the conditional distribution into the components corresponding to the system’s dimensions. Together these approaches give estimation of the number of parameters. The model with fine discretization is shown to be superior to the LSTM and Deep Koopman models. Moreover, even in the case of the rough discretization scale, the model demonstrates interpretable behavior. Due to the complexity of the tensor coding layer, it requires a specific training method. For this purpose, we introduced a progressive training algorithm, that allows finegrid modeling of dynamical systems.
References
 (1) Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709, 2013.
 (2) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
 (3) Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoderdecoder approaches. arXiv preprint arXiv:1409.1259, 2014.
 (4) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
 (5) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
 (6) Jacob Devlin, MingWei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pretraining of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
 (7) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.
 (8) Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are fewshot learners. arXiv preprint arXiv:2005.14165, 2020.
 (9) Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. Contrastive bidirectional transformer for temporal representation learning. arXiv preprint arXiv:1906.05743, 2019.
 (10) Juan Carrasquilla, Di Luo, Felipe Pérez, Ashley Milsted, Bryan K Clark, Maksims Volkovs, and Leandro Aolita. Probabilistic simulation of quantum circuits with the transformer. arXiv preprint arXiv:1912.11052, 2019.
 (11) Qiang Zhang, Aldo Lipani, Omer Kirnap, and Emine Yilmaz. Selfattentive hawkes processes. arXiv preprint arXiv:1907.07561, 2019.
 (12) Anna Shalova and Ivan Oseledets. Deep representation learning for dynamical systems modeling. arXiv preprint arXiv:2002.05111, 2020.
 (13) Valentin Khrulkov, Oleksii Hrinchuk, Leyla Mirvakhabova, and Ivan Oseledets. Tensorized embedding layers for efficient model compression. arXiv preprint arXiv:1901.10787, 2019.
 (14) Ivan V Oseledets. Tensortrain decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.
 (15) Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
 (16) Dinesh Acharya, Zhiwu Huang, Danda Pani Paudel, and Luc Van Gool. Towards high resolution video generation with progressive growing of sliced Wasserstein GANs. arXiv preprint arXiv:1810.02419, 2018.
 (17) Sepp Hochreiter and Jürgen Schmidhuber. Long shortterm memory. Neural computation, 9(8):1735–1780, 1997.
 (18) Bethany Lusch, J Nathan Kutz, and Steven L Brunton. Deep learning for universal linear embeddings of nonlinear dynamics. Nature communications, 9(1):1–10, 2018.
 (19) Enoch Yeung, Soumya Kundu, and Nathan Hodas. Learning deep neural network representations for koopman operators of nonlinear dynamical systems. In 2019 American Control Conference (ACC), pages 4832–4839. IEEE, 2019.
 (20) Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physicsinformed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.
 (21) Ramakrishna Tipireddy, Paris Perdikaris, Panos Stinis, and Alexandre Tartakovsky. A comparative study of physicsinformed neural network models for learning unknown dynamics and constitutive relations. arXiv preprint arXiv:1904.04058, 2019.
 (22) Nguyen Anh Khoa Doan, Wolfgang Polifke, and Luca Magri. Physicsinformed echo state networks for chaotic systems forecasting. In International Conference on Computational Science, pages 192–198. Springer, 2019.
 (23) Guofei Pang and George Em Karniadakis. Physicsinformed learning machines for pdes: Gaussian processes versus neural networks. 2020.
 (24) Artem Chashchin, Mikhail Botchev, Ivan Oseledets, and George Ovchinnikov. Predicting dynamical system evolution with residual neural networks. arXiv preprint arXiv:1910.05233, 2019.

(25)
Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud.
Neural ordinary differential equations.
In Advances in neural information processing systems, pages 6571–6583, 2018.  (26) Yulia Rubanova, Tian Qi Chen, and David K Duvenaud. Latent ordinary differential equations for irregularlysampled time series. In Advances in Neural Information Processing Systems, pages 5321–5331, 2019.
 (27) Gavin D Portwood, Peetak P Mitra, Mateus Dias Ribeiro, Tan Minh Nguyen, Balasubramanya T Nadiga, Juan A Saenz, Michael Chertkov, Animesh Garg, Anima Anandkumar, Andreas Dengel, et al. Turbulence forecasting via neural ode. arXiv preprint arXiv:1911.05180, 2019.
 (28) Edward De Brouwer, Jaak Simm, Adam Arany, and Yves Moreau. GRUODEBayes: Continuous modeling of sporadicallyobserved time series. In Advances in Neural Information Processing Systems, pages 7377–7388, 2019.
 (29) Coryn AL BailerJones, David JC MacKay, and Philip J Withers. A recurrent neural network for modelling dynamical systems. network: computation in neural systems, 9(4):531–547, 1998.
 (30) Adam P Trischler and Gabriele MT D’Eleuterio. Synthesis of recurrent neural networks for dynamical system simulation. Neural Networks, 80:67–78, 2016.
 (31) Shaowu Pan and Karthik Duraisamy. Longtime predictive modeling of nonlinear dynamical systems using neural networks. Complexity, 2018, 2018.
 (32) Felix A Gers, Douglas Eck, and Jürgen Schmidhuber. Applying LSTM to time series predictable through timewindow approaches. In Neural Nets WIRN Vietri01, pages 193–200. Springer, 2002.
 (33) Pantelis R Vlachas, Wonmin Byeon, Zhong Y Wan, Themistoklis P Sapsis, and Petros Koumoutsakos. Datadriven forecasting of highdimensional chaotic systems with long shortterm memory networks. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 474(2213):20170844, 2018.
 (34) Rose Yu, Stephan Zheng, Anima Anandkumar, and Yisong Yue. Longterm forecasting using higher order tensor RNNs. arXiv preprint arXiv:1711.00073, 2017.
 (35) Pantelis R Vlachas, Jaideep Pathak, Brian R Hunt, Themistoklis P Sapsis, Michelle Girvan, and Edward Ott amd Petros Koumoutsakos. Backpropagation algorithms and reservoir computing in recurrent neural networks for the forecasting of complex spatiotemporal dynamics. Neural Networks, 2020.

(36)
Marc G Bellemare, Will Dabney, and Rémi Munos.
A distributional perspective on reinforcement learning.
In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 449–458. JMLR. org, 2017.
Comments
There are no comments yet.