Tensorized Transformer for Dynamical Systems Modeling

by   Anna Shalova, et al.

The identification of nonlinear dynamics from observations is essential for the alignment of the theoretical ideas and experimental data. The last, in turn, is often corrupted by the side effects and noise of different natures, so probabilistic approaches could give a more general picture of the process. At the same time, high-dimensional probabilities modeling is a challenging and data-intensive task. In this paper, we establish a parallel between the dynamical systems modeling and language modeling tasks. We propose a transformer-based model that incorporates geometrical properties of the data and provide an iterative training algorithm allowing the fine-grid approximation of the conditional probabilities of high-dimensional dynamical systems.



There are no comments yet.


page 2

page 6

page 8


Deep Representation Learning for Dynamical Systems Modeling

Proper states' representations are the key to the successful dynamics mo...

Application of Adaptive Multilevel Splitting to High-Dimensional Dynamical Systems

Stochastic nonlinear dynamical systems can undergo rapid transitions rel...

Learning deep dynamical models from image pixels

Modeling dynamical systems is important in many disciplines, e.g., contr...

NLMEModeling: A Wolfram Mathematica Package for Nonlinear Mixed Effects Modeling of Dynamical Systems

Nonlinear mixed effects modeling is a powerful tool when analyzing data ...

Sparse learning of stochastic dynamic equations

With the rapid increase of available data for complex systems, there is ...

Designing spontaneous behavioral switching via chaotic itinerancy

Chaotic itinerancy is a frequently observed phenomenon in high-dimension...

Efficient computation of extreme excursion probabilities for dynamical systems

We develop a novel computational method for evaluating the extreme excur...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Dynamics modeling of complex systems is one of the key problems in different areas of science. The underlying dependencies rising from the real-life systems are often complex or even unpredictable. Numerical modeling of their solutions is a challenging task of computational science. The main factor complicating the dynamical systems modeling is the sensitivity to initial conditions. If the solutions of the system with arbitrary close initial conditions exponentially diverge, then even small error in predictions leads to complete divergence of the approximation model.

In this paper, we are focusing on the problem of the dynamical systems modeling from observations. We are given a set of trajectories of the dynamical system:


and our goal is to train a model reproducing the underlying dynamics. We focus on the specific class of dynamical systems, deterministic chaos, combing sensitivity to initial conditions with the existence of a bounded invariant set. Even though the solutions of such a system are bounded, they exponentially diverge from each other:


This complex structure of an attractor is caused by the properties of the underlying equations. Therefore, if we directly approximate function , even a small deviation from the functional dependence results in the trajectories leaving the invariant set. As a result, a straightforward approximation of the function

by a neural network does not allow preserving the properties of the system.

To overcome this issue, we introduce a scalable probabilistic approach to the dynamical systems modeling based on the transformer architecture. The method directly exploits the geometry of the phase space, which allows modeling of the high-dimensional systems. The model has parameters, where is the size of the grid and - dimensionality of the dynamical system.

Main contributions of this paper are:

  1. We propose a geometry-aware transformer-decoder model for the dynamical systems modeling that incorporates the factorization of both embedding and classification layers.

  2. We introduce an iterative training approach that significantly speeds up the convergence of the model.

  3. We apply the model to the simulated data and demonstrate that it outperforms baselines on the Rossler system and shows competitive performance on the Lorenz-96 systems.

2 Proposed Method

Figure 1: An example of the proposed method for the 2-dimensional trajectory. The initial states are discretized along each dimension; then transformer decoder is used to predict conditional probabilities along each dimension.

For chaotic systems, the accuracy of the long-term predictions is a questionable functional, since even for the original system, small perturbation leads to large errors in the trajectories. Instead, we can approximate the stable quantities of dynamical systems, such as the invariant measure. In order to do it, instead of dealing with continuous space-time, we discretize them both and convert the dynamical system modeling into discrete sequence modeling.

The goal is now to predict the conditional probability distribution. The main challenge is that the phase-space is high-dimensional, and discretization with a uniform grid requires

"words", making this approach impractical for

due to the curse of dimensionality. However, such discretization is perfect for the use of

tensor factorization methods. Standard matrix of embeddings for the given vocabulary would have the shape , where is the size of representations. To avoid the exponential growth of the size of this matrix with respect to , we introduce the tensor coding layer, which factorizes these embeddings into geometry-aware components. Similarly, we handle the classification layer, reducing a single classification problem to the separate prediction of the indexes along each of the axis. To get a trajectory in a continuous space we map predicted indexes back to the centers of the corresponding discretization segments.

Figure 2: General scheme of the model: firstly, the states are mapped to the discrete space, where the representations are calculated with tensor coding layer; then the dynamics of these representations is modeled with a transformer decoder; and finally, the factorized probabilities of the next state are predicted with a multi-head classification layer.
Figure 3: An example of the tensor coding layer: continuous point is discretized along each dimension (), then the representations are calculated using the corresponding components of the TT-embeddings.

2.1 Factorized probabilistic model

We are given a set of the solutions of a system 1: , such that , where

is the constant time step. In real life, trajectories of the dynamical system are measured with a limited accuracy. This uncertainty can be viewed as an additive noise. So instead of the maximum-likelihood estimation of the next state:


we propose to model the conditional distribution:


For the discrete , this task is known as language modeling

. Similar to the dynamical systems, for a long time, language modeling tasks were being solved by recurrent neural networks

kalchbrenner2013recurrent ; sutskever2014sequence ; cho2014properties ; bahdanau2014neural

, until the transformer model has revolutionized natural language processing


. Self-attentive models remain the state-of-art architectures in most of the NLP tasks

vaswani2017transformer ; devlin2018bert ; radford2019gpt ; brown2020language and lately are also adopted to the tasks from other areas of science sun2019contrastive ; carrasquilla2019quantum ; zhang2019self .

In this paper, we build upon methodology proposed in shalova2020deep : we reduce the task of dynamical systems modeling to the language modeling using an equidistant grid discretization of the trajectories and apply transformer decoder to model dynamics in the discrete space. Each state of -dimensional system is mapped to the multi-index representation , where is the grid size and each index indicates the position along one of the dimensions. High-dimensional probabilities modeling is a very challenging task even in the discrete space case, so instead of the direct modeling of the -dimensional , we propose to approximate it with a factorization into the one-dimensional distributions corresponding to the different dimensions:


Such factorization makes easy both the likelihood computation and the sampling from the predicted distribution.

In other words, the indices are modeled by separate classification heads, that are learned to maximize the KL-divergence between empirical conditional distributions of the training set and predicted factorized distribution (Figure 2). Each of heads has parameters, which gives an parameters for the classification layer size.

2.2 Tensor coding

At the same time, each component of the distribution is conditioned by the whole information about previous states. It requires training of the unique representation for each of the unique states. Similar to the factorization of the probabilities, we propose the usage of the factorized embedding layer that relies on the known relation between the states. As such factorization, we use a tensor coding layer, a geometry-aware analog of the tensor-train embeddings khrulkov2019ttembeddings . With a fixed representation size, it gives a estimation of the number of the parameters. In khrulkov2019ttembeddings , it was proposed to train the embeddings matrix in the format of a tensor-train matrix. Instead of matrix , where is the size of vocabulary and is a representation size, there is a -dimensional tensor , such that . According to oseledets2011tensor , tensor can be approximated by the tensor-train decomposition:


where are the ranks of the decomposition. In case of the embeddings layer in deep neural networks, during training, instead of the approximation of the full-rank matrix , we optimize parameters of the decomposition . In our case, we propose using a single component of the TT-decomposition for every dimension. That additionally allows an intuitive restoration of the chosen embeddings without computation of the whole matrix.

3 Iterative training

We found that direct training of the tensorized transformer-decoder model does not lead to good convergence (Figure 4

). However, uniform grid discretization introduces a natural hierarchy of models, and we can use the idea of a multigrid method from numerical analysis. Similar approaches have been successfully used in machine learning in progressive growing GANs, which were iteratively trained starting from a low resolution

karras2017progressive ; acharya2018towards .

Figure 4: Training loss for the standard procedure (left) and with the iterative training algorithm (right). The peaks corresponds to the transition to the finer grids.

Each component of the embeddings factorization can be interpreted as an additional layer in the model. So, for high-dimensional systems, we have a deep architecture with a slow convergence. Instead of the direct optimization of the model with a desirable discretization, we propose training model iteratively such that, on each step, the discretization scale is decreased. We begin with training the transformer decoder to reproduce dynamics on the roughest possible discretization with . Then, on each step, we use pre-trained embeddings corresponding to segments to initialize embeddings with discretization size and fine-tune both representations and dynamics model on the obtained finer grid.

Each component of the TT-embeddings is a 3-dimensional tensor , where is the rank of decomposition and is the discretization size. To initialize the -th component of the finer-grid model we use linear approximation of the tensor. The even segments are initialized directly with the components of

, and the odd components are the average between the adjacent ones:


In section 4.2, we demonstrate that linear approximation is appropriate as the learned representations are smooth along each direction.

On each fine-tuning iteration the model is trained to maximize the log-likelihood of the discrete trajectories. Easy to notice, that the factorization of the distribution leads to the factorization of the loss function:


which corresponds to the standard multi-head classification problem.

4 Experiments

Figure 5: Average over 100 trajectories error of predictions from time for LSTM, Deep Koopman and Tensorized Transformer (ours) models for: (right) Rossler attractor and (left) Lorenz-96 system with . For Rossler we use grid with segments, for Lorenz-96 - . For the Lorenz-96 system we also show the divergence of the real trajectories.

In all experiments we use Hugginface implementation111https://github.com/huggingface/transformers

of the transformer decoder in PyTorch. Our implementation of the TT-embeddings is based on the code

222https://github.com/KhrulkovV/tt-pytorch for khrulkov2019ttembeddings except that on the forward pass all representations are reconstructed separately without restoration of the whole embeddings matrix. For both systems we use TT-embeddings with the rank . Training data is simulated using an odeint integrator from the SciPy library in Python.

In all cases, we use a zero-temperature generation, meaning that on each step model selects the mode of the predicted distribution. For the used dynamical systems, the learned distributions were shown to be close to the delta functions, which means that the characteristic size of the distribution was smaller than the size of discretization segments. In such a case, higher-temperature generation, or, in other words, random index selection with the scaled predicted probabilities, is noisy because of the specific distribution form.

As baselines we use LSTM hochreiter1997long and Deep Koopman lusch2018deepkoopman models. For both dynamical systems, we use a single-layer LSTM model with a hidden size . In the case of the Deep Koompan model, both encoder and decoder consist of three linear layers followed by activation; we set the hidden size = .

Figure 6: Trajectory of the Rossler attractor restored by LSTM, Deep Koopman and Tensorized Transformer model with (ours). Both LSTM and Transformer are conditioned on 100 states of the real trajectory.
Figure 7: From left to right: real trajectory of the Lorenz-96 model with and trajectories restored by Tensorized Transformer model with , Deep Koopman and LSTM models. Both LSTM and Transformer are conditioned on 100 states of the real trajectory.

4.1 Rossler Attractor

is attractor for the Rossler system:


We use the system with The training dataset consists of 1000 trajectories corresponding to the time with a time step . Initial states are sampled from distribution with and uniform . For testing we use trajectories with the initial conditions samples from the same distribution. In this experiment, we use discretization size . The transformer model has 12 self-attention layers with heads and size of representations .

Examples of the reconstructed trajectories are presented on Figure 6; shown simulation corresponds to . All models capture general behaviour, still, performance of the transformer model is better, that corresponds to the Root Mean Squared Error (RMSE) that is demonstrated in Figure 5. The graph represents dependence of the predictions error from time:


averaged over

initial conditions. For both transformer and LSTM models, 100 steps were used as a conditioning. It is shown the GPT-2 model significantly outperforms regression models in the long-term perspective. After about 50 steps, the transformer model compensates inaccuracy of the single-state prediction and demonstrates much more stable behaviour then the Deep Koopman and LSTM models.

4.2 Lorenz-96 System

We test scalability of the model on the example of Lorenz-96 system with :


where . We use forcing and iteratively increase the grid size from to segments along every axis. Final grid corresponds to size of the vocabulary. There are and trajectories in train and test sets; each trajectory corresponds to the simulation for steps with

. Initial states are chosen as the vector

, where perturbation is added only to the -th component of the vector. For Lorenz-96 systems we used transformer with 12 layers with attention heads each and representations of size .

For this system, the exponential divergence of the real trajectories appears for the scale compared to the one of the discretization grid. As a result, the fact that we lose the micro-scale information is crucial for the restoration of the given trajectory. One can note that the discretization process is not bijective because the same sequence of multi-indexes corresponds to the infinite number of continuous trajectories. If the continuations of these trajectories are already uncorrelated, then the probabilistic model can not be expected to generate a precise approximation of a single solution. Instead, it samples continuation from all possible trajectories. In Figure 5 (right), the divergence of real trajectories is shown to be comparable with the behavior of the ones generated by the tensorized transformer.

We also demonstrate the iterative training algorithm in action. In Figure 8, there are the components of the representation corresponding to the coordinate of the Lorenz-96 system that is learned by the tensorized transformer after each iteration of the progressive training. After the first step, there are only two segments, so each component takes one of two possible values. On the next iteration, the axis is divided into three segments, then five and so on. Transitions to the finer grids are smooth, which proves that linear initialization is relevant in this case.

Figure 8: The first component of TT-embeddings (corresponds to ) after each iteration of fine-tuning; is the number of the discretization segments. Fine grid representations have the same pattern as the ones with a small number of segments. The representations get smoother with the growth of the discretization size .

5 Related Work

Recently, deep neural networks have been replacing standard regression methods for dynamical systems modeling. Besides specific approaches like Deep Koopman model lusch2018deepkoopman ; yeung2019deepkoopman and Physics-Informed neural networks raissi2019physics ; tipireddy2019physicsds ; doan2019physicsreservoir ; pangphysics , various universal architectures including ResNet chashchin2019resnet , ODEnet chen2018neural ; rubanova2019latent ; portwood2019turbulence ; de2019gru and recurrent models bailer1998recurrent ; trischler2016synthesis ; pan2018long ; gers2002lstmlaser ; vlachas2018lstm ; yu2017ttrnn ; vlachas2020backpropagation have been applied to the dynamics modeling of chaotic systems. Still, none of the mentioned methods can preserve the invariant set of the dynamical system without additional modifications. One of the possible solutions was presented in trischler2016synthesis ; it was proposed to predict

that belongs to the unitary cube and then re-scale it with a linear transformation. This approach bounds the predictions but does not hold the shape of the invariant set. An alternative solution is the usage of an additional model bounding the predictions if they have low probability according to the training set

vlachas2018lstm .

The radical alternative was proposed in shalova2020deep ; the authors suggested a probabilistic approach based on the transformer model. The equidistant grid discretization was used to map continuous states to the discrete representations. Then the model was trained to reproduce the conditional probabilities of the next state conditioned by the states’ history. It was shown that by sacrificing a micro-scale data, it is possible to obtain a good approximation of the long-term system’s behavior. The discretization of the phase space allowed modeling of the dynamics strictly holding the invariant set. One can note that a similar discrete approach to the conditional probabilities modeling is used in distributional RL bellemare2017distributional . At the same time, in shalova2020deep the geometrical properties of the phase space were not incorporated into the discrete-space model. As a result, proposed discretization led to the size of the model.

6 Discussion

Limitations. Although the tensorization allows the handling of larger systems, such an approach has several drawbacks. For high dimensions

, even the iterative training approach does not lead to the efficient training of the full model, and many dynamical systems (such as discretizations of partial differential equations) have much higher dimensionality. Another restriction is a very simple factorized form of the probability distribution. Finally, it is not clear at the moment whether the developed method indeed learns the physics of the problem, and will it be able to predict something really "new”, and how dense should be the trajectories on the invariant set.

Future Work. The factorization approach proposed in this paper does not rely on the architecture of the language model. In this work, we used a powerful transformer decoder model as a backbone; however, any of the NLP models are applicable. Despite the transformer being an undoubtable SOTA in discrete sequence modeling, the discrete probabilistic approach alone may give a significant improvement compared to the continuous space modeling. Another possible line of work is the incorporation of the developed pipeline into more complex models that include dimensionality reduction: the dynamical system is first mapped to a low-dimensional latent feature space, the discretized and modeled there. It would make the proposed approach more applicable for very large .

7 Conclusion

In this paper, we introduced a probabilistic transformer-based model that learns the dynamics in the discrete space. To handle the dimensionality curse, we proposed using a tensor coding layer and factorizing the conditional distribution into the components corresponding to the system’s dimensions. Together these approaches give estimation of the number of parameters. The model with fine discretization is shown to be superior to the LSTM and Deep Koopman models. Moreover, even in the case of the rough discretization scale, the model demonstrates interpretable behavior. Due to the complexity of the tensor coding layer, it requires a specific training method. For this purpose, we introduced a progressive training algorithm, that allows fine-grid modeling of dynamical systems.