1 Introduction
The increase in computing capabilities, together with new deep learning models, has led to great advances in several machine learning tasks LeCun2015DeepLearning; Sutskever2014SequenceTS; Silver2017MasteringTG. Deep learning architectures such as Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs), as well as the use of distributed representations in natural language processing, have made it possible to take into account the symmetries and the structure of the problem to be solved.
However, a major criticism of deep learning remains, namely, that it only performs perception, mapping inputs to outputs Marcus2018Deep.
A new direction towards more general and flexible models is differentiable programming, that is, the combination of geometric modules (traditional neural networks) with more algorithmic modules in an end-to-end differentiable model. As a result, differentiable programming is a generalization of deep learning with differentiable modules that provide reasoning, abstraction, memory, etc. To efficiently calculate derivatives, this approach uses automatic differentiation, an algorithmic technique similar to backpropagation and implemented in modern software tools such as PyTorch, Julia, etc.
To keep our exposition concise, this tutorial is aimed at researchers in nonlinear systems with prior knowledge of deep learning; see Goodfellow2016DL for an excellent introduction to the concepts and methods of deep learning. Therefore, this tutorial focuses right away on the limitations of traditional deep learning and the advantages of differentiable programming, with special attention to its application to dynamical systems. By a dynamical system we mean here and hereafter a set of state variables that evolve in time under the influence of internal and possibly external inputs.
Examples of differentiable programming that have been successfully developed in recent years include
(i) attention mechanisms Bahdanau2014Attention, in which the model automatically searches and learns which parts of a source sequence are relevant to predict the target element,
(ii) self-attention,
(iii) end-to-end Memory Networks Suk2015EndToEndMN, and
(iv) Differentiable Neural Computers (DNCs) Graves2016DNC, which are neural networks (controllers) with an external read-write memory.
As expected, in recent years there has been a growing interest in applying deep learning techniques to dynamical systems. In this regard, RNNs and Long Short-Term Memories (LSTMs), specially designed for sequence modelling and temporal dependence, have been successful in various applications to dynamical systems such as model identification and time series prediction WangModelIdLSTM; Yu2017ConceptLSTM; Li2018LSTMTourismFlow. The performance of these models (e.g. encoder-decoder networks), however, degrades rapidly as the length of the input sequence increases and they are not able to capture the dynamic (i.e., time-changing) interdependence between time steps. The combination of neural networks with new differentiable modules could overcome some of those problems and offer new opportunities and applications.
Among the potential applications of differentiable programming to dynamical systems let us mention
(i) attention mechanisms to select the relevant time steps and inputs,
(ii) memory networks to store historical data from dynamical systems and selectively use it for modelling and prediction, and
(iii) the use of differentiable components in scientific computing.
Despite some achievements, more work is still needed to verify the benefits of these models over traditional networks.
Thanks to software libraries that facilitate automatic differentiation, differentiable programming extends deep learning models with new capabilities (reasoning, memory, attention, etc.) and the models can be efficiently coded and implemented.
In the following sections of this tutorial we introduce differentiable programming and explain in detail why it is an extension of deep learning (Section 2). We describe some models based on this new approach such as attention mechanisms (Section 3.1), memory networks and differentiable neural computers (Section 3.2), and continuous learning (Section 3.3). Then we review the use of deep learning in dynamical systems and its limitations (Section 4). Finally, we present the new opportunities that differentiable programming can bring to the modelling, simulation and prediction of dynamical systems (Section 5). The conclusions and outlook are summarized in Section 6.
2 From deep learning to differentiable programming
In recent years, we have seen major advances in the field of machine learning. The combination of deep neural networks with the computational capabilities of Graphics Processing Units (GPUs) Yadan2013MultiGPU has improved the performance on several tasks (image recognition, machine translation, language modelling, time series prediction, game playing, etc.) LeCun2015DeepLearning; Sutskever2014SequenceTS; Silver2017MasteringTG. Interestingly, deep learning models and architectures have evolved to take into account the structure of the problem to be solved.
RNNs are a special class of neural networks where outputs from previous steps are fed as inputs to the current step Graves2009ANC; Sherstinsky2018FundRNN. This recurrence makes them appropriate for modelling dynamic processes and systems.
CNNs are neural networks that alternate convolutional and pooling layers to implement translational invariance Lecun1998CNN. They learn spatial hierarchies of features through backpropagation by using these building layers. CNNs are being applied successfully to computer vision and image processing Yamashita2018CNN.
Especially important is the use of distributed representations as inputs to natural language processing pipelines. With this technique, each word of the vocabulary is mapped to an element of a vector space with a much lower dimensionality Bengio2003NeuralModel; Mikolov2013Word2vec. This word embedding is able to keep, in the learned vector space, some of the syntactic and semantic relationships present in the original data.
Let us recall that, in a feedforward neural network (FNN) composed of multiple layers, the output (without the bias term) at layer $l$, see Figure 1, is defined as
(1) $x^{l} = \sigma(W^{l} x^{l-1})$,
$W^{l}$ being the weight matrix at layer $l$. $\sigma$ is the activation function and $x^{l}$, the output vector at layer $l$ and the input vector at layer $l+1$. The weight matrices for the different layers are the parameters of the model.
Learning is the mechanism by which the parameters of a neural network are adapted to the environment in the training process. This is an optimization problem which has been addressed by using gradient-based methods, in which, given a cost function $f(w)$, the algorithm finds local minima by updating each layer parameter $w$ with the rule $w \leftarrow w - \eta \nabla_{w} f(w)$, $\eta > 0$ being the step size.
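As a minimal sketch of Equation 1 and of the gradient update rule, the following pure-Python example applies one layer and runs gradient descent on a one-parameter cost. The function names (`layer_forward`, `gradient_descent`), the sigmoid activation and the quadratic cost are illustrative assumptions, not taken from the references.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(W, x):
    """Compute sigma(W x) for a weight matrix W (list of rows) and an input vector x."""
    return [sigmoid(sum(w_ij * x_j for w_ij, x_j in zip(row, x))) for row in W]

def gradient_descent(grad, w0, eta=0.1, steps=100):
    """Iterate w <- w - eta * grad(w), starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# Hypothetical cost f(w) = (w - 3)^2, whose gradient is 2 (w - 3).
w_star = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)

# One layer with 2 outputs and 2 inputs (illustrative weights).
out = layer_forward([[0.0, 0.0], [1.0, -1.0]], [0.5, 0.5])
```

In a real network the gradient is taken with respect to every entry of every weight matrix; the scalar case only illustrates the shape of the update.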
Apart from regarding neural networks as universal approximators, there is no sound theoretical explanation for the good performance of deep learning. Several theoretical frameworks have been proposed:

As pointed out in Lin2016Cheaplearning, the class of functions of practical interest can be approximated with exponentially fewer parameters than the generic ones. Symmetry, locality and compositionality properties make it possible to have simpler neural networks.

From the point of view of information theory ShwartzZiv2017ITDL, an explanation has been put forward based on how much information each layer of the neural network retains and how this information varies with the training and testing process.
Although deep learning can implicitly implement logical reasoning Hohenecker2018Onto, one of its limitations is that it only performs perception, representing a mapping between inputs and outputs Marcus2018Deep.
A path to a more general intelligence, as we will see below, is the combination of geometric modules with more algorithmic modules in an end-to-end differentiable model. This approach, called differentiable programming, adds new parametrizable and differentiable components to traditional neural networks.
Differentiable programming can be seen as a continuation of the deep learning end-to-end architectures that have replaced, for example, the traditional linguistic components in natural language processing Deng2018NLP; Goldberg2017NLPBook. To efficiently calculate the derivatives in a gradient descent, this approach uses automatic differentiation, an algorithmic technique similar to, but more general than, backpropagation.
Automatic differentiation, in its reverse mode and in contrast to manual, symbolic and numerical differentiation, computes the derivatives in a two-step process Baydin2018AutomaticDiff; Wang2018DemystifyingDP. In a first step, called the forward pass, the computational graph is built, populating intermediate variables and recording the dependencies. In a second step, called the backward pass, derivatives are calculated using backpropagation. In recent years, deep learning frameworks such as PyTorch have been developed that provide reverse-mode automatic differentiation Paszke2017automatic. The define-by-run philosophy of PyTorch, whose execution dynamically constructs the computational graph, facilitates the development of general differentiable programs.
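The two-step process can be illustrated with a toy reverse-mode implementation. The sketch below is a deliberately small, pure-Python illustration and not the mechanism of any particular framework; the class `Var` and the functions `tanh` and `backward` are hypothetical names, and only scalar addition, multiplication and tanh are supported.

```python
import math

class Var:
    """Scalar node of a computational graph built during the forward pass."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs (parent node, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def tanh(x):
    t = math.tanh(x.value)
    return Var(t, ((x, 1.0 - t * t),))

def backward(output):
    """Backward pass: visit nodes from the output to the inputs and apply the chain rule."""
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad

w, x = Var(0.5), Var(2.0)
y = tanh(w * x) + w * w  # forward pass: values and dependencies are recorded
backward(y)              # backward pass: w.grad and x.grad are filled in
```

The graph is built while the expression is evaluated, which is the define-by-run idea in miniature: ordinary Python control flow around these operations would be differentiated in the same way.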
Differentiable programming is an evolution of classical (traditional) software programming where, as shown in Table 1:

Instead of specifying explicit instructions to the computer, an objective is set and an optimizable architecture is defined which makes it possible to search in a subset of possible programs.

The program is defined by the input-output data and not predefined by the user.

The algorithmic elements of the program have to be differentiable, say, by converting them into neural networks.
RNNs, for example, are an evolution of feedforward networks because they are classical neural networks inside a for-loop (a control flow statement for iteration) which allows the neural network to be executed repeatedly with recurrence. However, this for-loop is a predefined feature of the model. The ideal situation would be to augment the neural network with programming primitives (for-loops, if branches, while statements, external memories, logical modules, etc.) that are not predefined by the user but are parametrizable by the training data.
The trouble is that many of these programming primitives are not differentiable and need to be converted into optimizable modules. For instance, if the condition $c$ of an "if" primitive (e.g., if $c$ is satisfied do $a(x)$, otherwise do $b(x)$) is to be learned, it can be the output $\alpha \in (0,1)$ of a neural network (linear transformation and a sigmoid function) and the conditional primitive will transform into a weighted combination of both branches, $\alpha\, a(x) + (1-\alpha)\, b(x)$. Similarly, in an attention module, different weights that are learned with the model are assigned to give a different influence to each part of the input.
In this way, using differentiable programming we can combine traditional perception modules (CNN, RNN, FNN) with additional algorithmic modules that provide reasoning, abstraction, memory, etc. Yang2017DiffeLogical. In the following section we describe some examples of this approach that have been developed in recent years.
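The relaxation of the "if" primitive described above can be sketched in a few lines. The function `soft_if`, its scalar gate parameters `w` and `b`, and the two branches are hypothetical choices made only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_if(x, w, b, branch_a, branch_b):
    """Differentiable relaxation of `if cond(x): a(x) else: b(x)`.

    The gate alpha = sigmoid(w * x + b) plays the role of the learned condition.
    """
    alpha = sigmoid(w * x + b)
    return alpha * branch_a(x) + (1.0 - alpha) * branch_b(x)

# Hypothetical branches: double the input or negate it.
def double(x):
    return 2.0 * x

def negate(x):
    return -x

balanced = soft_if(1.0, w=0.0, b=0.0, branch_a=double, branch_b=negate)   # gate = 0.5
almost_a = soft_if(1.0, w=0.0, b=20.0, branch_a=double, branch_b=negate)  # gate close to 1
```

Because the output depends smoothly on `w` and `b`, the condition can be trained by gradient descent; a saturated gate recovers the hard branch as a limiting case.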
3 Differentiable learning and reasoning
3.1 Attention mechanisms
RNNs (see Figure 2) are a basic component of modern deep learning architectures, especially of encoder-decoder networks. The following equations define the time evolution of an RNN:
(2) $h_t = f_h(W_{ih} x_t + W_{hh} h_{t-1})$,
(3) $y_t = f_o(W_{ho} h_t)$,
$W_{ih}$, $W_{hh}$ and $W_{ho}$ being weight matrices. $f_h$ and $f_o$ are the hidden and output activation functions while $x_t$, $h_t$ and $y_t$ are the network input, hidden state and output.
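A minimal sketch of the recurrence in Equations 2 and 3 follows, assuming a tanh hidden activation and an identity output activation. The names `rnn_step` and `rnn_run` and the small weight matrices are illustrative assumptions.

```python
import math

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(W_ih, W_hh, W_ho, x_t, h_prev):
    """One time step: new hidden state from input and previous state, then the output."""
    pre = [a + b for a, b in zip(matvec(W_ih, x_t), matvec(W_hh, h_prev))]
    h_t = [math.tanh(p) for p in pre]
    y_t = matvec(W_ho, h_t)
    return h_t, y_t

def rnn_run(W_ih, W_hh, W_ho, xs, h0):
    """Unroll the recurrence over an input sequence xs."""
    h, ys = h0, []
    for x_t in xs:
        h, y = rnn_step(W_ih, W_hh, W_ho, x_t, h)
        ys.append(y)
    return h, ys

W_ih = [[1.0], [0.0]]            # 2 hidden units, 1 input
W_hh = [[0.0, 0.0], [1.0, 0.0]]  # unit 2 is driven by the previous value of unit 1
W_ho = [[1.0, 1.0]]
h_T, ys = rnn_run(W_ih, W_hh, W_ho, xs=[[1.0], [0.0]], h0=[0.0, 0.0])
```

The loop in `rnn_run` is the predefined for-loop discussed in Section 2: the same weights are reused at every time step.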
LSTMs Hochreiter1997LSTM are an evolution of RNNs with gated units. An LSTM is composed of a cell, an input gate, an output gate and a forget gate, and allows gradients to flow unchanged. The memory cell remembers values over arbitrary time intervals and the three gates regulate the flow of information into and out of the cell.
An encoder-decoder network maps an input sequence to a target one, with both sequences of arbitrary length Sutskever2014SequenceTS. Such networks have applications ranging from machine translation to time series prediction.
More specifically, this mechanism uses an RNN (or any of its variants, an LSTM or a GRU, Gated Recurrent Unit) to map the input sequence to a fixed-length vector, and another RNN (or any of its variants) to decode the target sequence from that vector (see Figure 3). Such a seq2seq model normally features an architecture composed of:
An encoder which, given an input sequence $X = (x_1, \ldots, x_T)$ with $x_t \in \mathbb{R}^n$, maps $x_t$ to
(4) $h_t = f_1(h_{t-1}, x_t)$,
where $h_t \in \mathbb{R}^m$ is the hidden state of the encoder at time $t$, $m$ is the size of the hidden state and $f_1$ is an RNN (or any of its variants).

A decoder, whose initial state $s_0$ is initialized with the last hidden state of the encoder, $h_T$. It generates the output sequence $Y = (y_1, \ldots, y_{T'})$, $y_t \in \mathbb{R}^o$ (the dimension $o$ depending on the task), with
(5) $y_t = f_2(s_{t-1}, y_{t-1})$,
where $f_2$ is an RNN (or any of its variants) with an additional softmax layer.
Because the encoder compresses all the information of the input sequence in a fixed-length vector (the final hidden state $h_T$), the decoder possibly does not take into account the first elements of the input sequence. The use of this fixed-length vector thus limits the performance of encoder-decoder networks. Moreover, the performance of encoder-decoder networks degrades rapidly as the length of the input sequence increases cho2014ProblemsEncDecod. This occurs in applications such as machine translation and time series prediction, where it is necessary to model long time dependencies.
The key to solving this problem, inspired by neuroscience and human behaviour, is the attention mechanism. This mechanism allows the brain to focus on one part of the input (image, text, etc.), giving less attention to others. The idea is to implement a similar mechanism in an encoder-decoder network and make it differentiable.
In Bahdanau2014Attention an extension of the basic encoder-decoder architecture was proposed by allowing the model to automatically search and learn which parts of a source sequence are relevant to predict the target element. Instead of encoding the input sequence in a fixed-length vector, it generates a sequence of vectors, choosing the most appropriate subset of these vectors during the decoding process.
With the attention mechanism, the encoder is a bidirectional RNN Graves2013HybridRNN with a forward hidden state $\overrightarrow{h}_t$ and a backward one $\overleftarrow{h}_t$. The encoder state is represented as a simple concatenation of the two states,
(6) $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$
with $1 \le t \le T$. The encoder state includes both the preceding and following elements of the sequence, thus capturing information from neighbouring inputs.
The decoder has an output
(7) $y_t = f_2(s_{t-1}, y_{t-1}, c_t)$
for $1 \le t \le T'$. $f_2$ is an RNN with an additional softmax layer, and the input is a concatenation of $y_{t-1}$ with the context vector $c_t$, which is a sum of hidden states of the input sequence weighted by alignment scores:
(8) $c_t = \sum_{i=1}^{T} \alpha_{ti} h_i$.
The weight $\alpha_{ti}$ of each state $h_i$ is calculated by
(9) $\alpha_{ti} = \dfrac{\exp(\mathrm{score}(s_{t-1}, h_i))}{\sum_{k=1}^{T} \exp(\mathrm{score}(s_{t-1}, h_k))}$.
The score measures how well the input at position $i$ and the output at position $t$ match. $\alpha_{ti}$ are the weights that implement the attention mechanism, defining how much of each input hidden state should be considered when deciding the next state $s_t$ and generating the output $y_t$ (see Figure 4).
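The computation of Equations 8 and 9 for a single decoder step can be sketched as follows, assuming the scores have already been produced by some alignment model. The names `softmax` and `context_vector` and the toy encoder states are illustrative assumptions.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(scores, hidden_states):
    """Return (alpha, c): attention weights (Eq. 9) and the weighted sum of states (Eq. 8)."""
    alpha = softmax(scores)
    dim = len(hidden_states[0])
    c = [sum(a * h[k] for a, h in zip(alpha, hidden_states)) for k in range(dim)]
    return alpha, c

# Three hypothetical encoder states and their scores for one decoder step.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha, c = context_vector([0.0, 0.0, math.log(2.0)], H)
```

Since the weights are positive and sum to one, the context vector is a convex combination of the encoder states, and every operation involved is differentiable with respect to the scores.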
The score function can be parametrized using different alignment models such as
(10) $\mathrm{score}(s_{t-1}, h_i) = v_a^{\top} \tanh(W_a [s_{t-1}; h_i])$,
as proposed in Bahdanau2014Attention, where $v_a$ and $W_a$ are matrices to be jointly learned with the rest of the model. Also, in Graves2014NTM the authors use a similarity measure for content-based attention, namely,
(11) $\mathrm{score}(s_{t-1}, h_i) = \cos(\theta_{ti})$,
where $\theta_{ti}$ denotes the angle between $s_{t-1}$ and $h_i$.
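The two alignment models of Equations 10 and 11 can be sketched as plain functions of a decoder state `s` and an encoder state `h`. The additive form with a single matrix acting on the concatenated vector, as well as the names `additive_score` and `cosine_score` and the example weights, are assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine_score(s, h):
    """Content-based score: cosine of the angle between s and h."""
    return dot(s, h) / (math.sqrt(dot(s, s)) * math.sqrt(dot(h, h)))

def additive_score(v_a, W_a, s, h):
    """Additive score: v_a . tanh(W_a [s; h])."""
    concat = s + h  # list concatenation implements [s; h]
    hidden = [math.tanh(dot(row, concat)) for row in W_a]
    return dot(v_a, hidden)

s, h = [1.0, 0.0], [1.0, 1.0]
cos_sh = cosine_score(s, h)  # cosine of a 45 degree angle

W_a = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]  # hypothetical weights
add_sh = additive_score([1.0, -1.0], W_a, s, h)
```

The cosine score has no trainable parameters, whereas the additive score introduces `v_a` and `W_a`, which are learned jointly with the rest of the model.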
An example of a matrix of alignment scores can be seen in Figure 5. This matrix provides interpretability to the model since it shows which part (time step) of the input is more important to the output.
3.2 Other attention mechanisms and differentiable neural computers
A variant of the attention mechanism is self-attention, in which the attention component relates different positions of a single sequence in order to compute a representation of the sequence. In this way, the mechanism can connect distant elements of the sequence more directly than using RNNs Tang2018SelfAtt.
Another variant of attention is the end-to-end memory network Suk2015EndToEndMN, a neural network with a recurrent attention model over an external memory. The model, trained end-to-end, outputs an answer based on a set of inputs stored in a memory and a query.
Traditional computers are based on the von Neumann architecture which has two basic components: the CPU (Central Processing Unit), which carries out the program instructions, and the memory, which is accessed by the CPU to perform write/read operations. In contrast, neural networks follow a hybrid model where synaptic weights perform both processing and memory tasks.
Neural networks and deep learning models are good at mapping inputs to outputs but are limited in their ability to use facts from previous events and store useful information. Differentiable Neural Computers (DNCs) Graves2016DNC try to overcome these shortcomings by combining neural networks with an external read-write memory.
As described in Graves2016DNC, a DNC is a neural network, called the controller (playing the role of a differentiable CPU), with an external memory, an $N \times W$ matrix. The DNC uses differentiable attention mechanisms to define distributions (weightings) over the $N$ rows and learn the importance each row has in a read or write operation.
To select the most appropriate memory components during read/write operations, a weighted sum is used over the memory locations. The attention mechanism is used in three different ways: content-based access (read or write) based on similarity, time-ordered access (temporal links) to recover the sequences in the order in which they were written, and dynamic memory allocation, where the DNC assigns and releases memory based on usage percentage. At each time step, the DNC gets an input vector and emits an output vector that is a function of the combination of the input vector and the memories selected.
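Of the three access modes, the content-based read is the simplest to sketch: a weighting over the memory rows is obtained from their similarity to a key, and the read vector is the corresponding weighted sum. The function `content_read`, the sharpness parameter `beta` and the toy memory are illustrative assumptions; temporal links and memory allocation are not modelled here.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def content_read(memory, key, beta=1.0):
    """Soft read: softmax over beta * cosine(row, key), then a weighted sum of the rows."""
    scores = [beta * cosine(row, key) for row in memory]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    width = len(memory[0])
    read = [sum(w * row[k] for w, row in zip(weights, memory)) for k in range(width)]
    return weights, read

memory = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # hypothetical memory with 3 rows of width 2
weights, read = content_read(memory, key=[1.0, 0.0], beta=50.0)
```

A large `beta` makes the weighting concentrate on the best-matching row, approximating a discrete lookup while keeping the operation differentiable.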
DNCs, by combining the following characteristics, have very promising applications in complex tasks that require both perception and reasoning:

The classical perception capability of neural networks.

Read and write capabilities based on content similarity and learned by the model.

The use of previous knowledge to plan and reason.

End-to-end differentiability of the model.

Implementation using software packages with automatic differentiation libraries such as PyTorch, TensorFlow or similar.
3.3 Metaplasticity and continuous learning
The combination of geometric modules (classical neural networks) with algorithmic ones adds new learning capabilities to deep learning models. In the previous sections we have seen that one way to improve the learning process is by focusing on certain elements of the input or a memory and making this attention differentiable.
Another natural way to improve the process of learning is to incorporate differentiable primitives that add flexibility and adaptability. A source of inspiration is neuromodulators, which furnish the traditional synaptic transmission with new computational and processing capabilities Hernandez2018MultilayerAN.
Unlike the continuous learning capabilities of animal brains, which allow animals to adapt quickly to experience, in neural networks, once the training is completed, the parameters are fixed and the network stops learning. To address this issue, in Miconi2018DifferentiablePT a differentiable plasticity component is attached to the network that helps previously-trained networks adapt to ongoing experience.
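The activation $y_j$ of neuron $j$ has a conventional fixed weight $w_{ij}$ and a plastic component $\alpha_{ij} H_{ij}(t)$, where $\alpha_{ij}$ is a structural parameter tuned during the training period and $H_{ij}(t)$ a plastic component automatically updated as a function of ongoing inputs and outputs. The equations for the activation of $y_j$ with learning rate $\eta$, as described in Miconi2018DifferentiablePT, are:
(12) $y_j(t) = \sigma\Big(\sum_{i} \big[w_{ij} + \alpha_{ij} H_{ij}(t)\big]\, y_i(t-1)\Big)$,
(13) $H_{ij}(t+1) = \eta\, y_i(t-1)\, y_j(t) + (1-\eta)\, H_{ij}(t)$,
with $\sigma$ a nonlinear activation function. Then, during the initial training period, $w_{ij}$ and $\alpha_{ij}$ are trained using gradient descent and, after this period, the model keeps learning from ongoing experience.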
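A minimal sketch of one plastic layer, following the structure of Equations 12 and 13, is given below. It assumes tanh as the activation function; the function name `plastic_step` and the one-neuron example are hypothetical.

```python
import math

def plastic_step(w, alpha, hebb, y_prev, eta):
    """One step of a plastic layer.

    w, alpha and hebb are matrices indexed [i][j] (presynaptic i, postsynaptic j).
    Returns the new activations and the updated Hebbian trace.
    """
    n_in, n_out = len(w), len(w[0])
    y = [math.tanh(sum((w[i][j] + alpha[i][j] * hebb[i][j]) * y_prev[i]
                       for i in range(n_in)))
         for j in range(n_out)]
    new_hebb = [[eta * y_prev[i] * y[j] + (1.0 - eta) * hebb[i][j]
                 for j in range(n_out)]
                for i in range(n_in)]
    return y, new_hebb

w = [[0.5]]
alpha = [[1.0]]
hebb0 = [[0.0]]
y1, hebb1 = plastic_step(w, alpha, hebb0, y_prev=[1.0], eta=0.1)
y2, hebb2 = plastic_step(w, alpha, hebb1, y_prev=[1.0], eta=0.1)
```

In this example the same input produces a larger response at the second step because the Hebbian trace has grown, even though `w` and `alpha` are unchanged; this is the sense in which the network keeps adapting after training.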
4 Modeling dynamical systems with neural networks
Dynamical systems deal with time-evolutionary processes and their corresponding systems of equations. At any given time, a dynamical system has a state that can be represented by a point in a state space (manifold). The evolutionary process of the dynamical system describes what future states follow from the current state. This process can be deterministic, if its entire future is uniquely determined by its current state, or nondeterministic otherwise Layek2015BookDS (e.g., a random dynamical system Arnold2003RandomDS). Furthermore, it can be a continuous-time process, represented by differential equations or, as in this paper, a discrete-time process, represented by difference equations or maps.
Dynamical systems have important applications in physics, chemistry, economics, engineering, biology and medicine Jackson2015BookDSBio. They are relevant even in day-to-day phenomena with great social impact such as tsunami warning, earth temperature analysis and financial markets prediction.
Dynamical systems that contain a very large number of variables interacting with each other in nontrivial ways are sometimes called complex (dynamical) systems Gross2008CAS. Their behaviour is intrinsically difficult to model due to the dependencies and interactions between their parts, and they have emergent properties arising from these interactions such as adaptation, evolution, learning, etc.
Here we consider discrete-time, deterministic and nonautonomous (i.e., the time evolution depending also on exogenous variables) dynamical systems as well as the more general complex systems. Specifically, the dynamical systems of interest range from systems of difference equations with multiple time delays to systems with a dynamic (i.e., time-changing) interdependence between time steps. Notice that the former may be rewritten as higher dimensional systems with time delay 1.
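The rewriting of a system with several time delays as a higher dimensional system with delay 1 can be illustrated by stacking the delayed values into one augmented state. The helper names and the Fibonacci recursion used as a delay-2 example are illustrative assumptions (the example is autonomous for simplicity).

```python
def step_augmented(state, f):
    """Advance the augmented state (x_t, x_{t-1}, ..., x_{t-d+1}) by one time step."""
    x_next = f(*state)
    return (x_next,) + state[:-1]

def iterate(state, f, n):
    for _ in range(n):
        state = step_augmented(state, f)
    return state

# Hypothetical delay-2 example: the recursion x_{t+1} = x_t + x_{t-1}.
def fib(x_t, x_tm1):
    return x_t + x_tm1

state = iterate((1, 0), fib, 10)  # start from (x_1, x_0) = (1, 0)
```

The map `step_augmented` depends only on the current augmented state, so the original second-order recursion has become a first-order system in two dimensions.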
On the other hand, in recent years deep learning models have been very successful performing various tasks such as image recognition, machine translation, game playing, etc. When the amount of training data is sufficient and the distribution that generates the real data is the same as the distribution of the training data, these models perform extremely well and approximate the input-output relation.
In view of the importance of dynamical systems for modeling physical, biological and social phenomena, there is a growing interest in applying deep learning techniques to dynamical systems. This can be done in different contexts, such as:

Modeling dynamical systems with known structure and equations but non-analytical or complex solutions Pan2018LongTimePM.

Modeling dynamical systems without knowledge of the underlying governing equations Düben2018GlobalWeather; Chakraborty1992ForecastingTB. In this regard, let us mention that commercial initiatives are emerging that combine large amounts of meteorological data with deep learning models to improve weather predictions.

Modeling dynamical systems with partial or noisy data Yeo2019DLNoisyDS.
Therefore, the combination of greater computing resources, large amounts of data and deep learning models can, in principle, revolutionize the modeling of dynamical systems in general, and complex systems in particular, and thus enable a better understanding and prediction of relevant natural and social phenomena.
A key aspect in modelling dynamical systems is temporal dependence. There are two ways to introduce it into a neural network Narendra1990DSystemsNN:

A classical feedforward neural network with time-delayed states in the inputs, at the cost of a possibly unnecessary increase in the number of parameters.

A recurrent neural network, whose recurrence carries the state from one time step to the next and thus builds the temporal dependence into the architecture.
Thus, RNNs, specially designed for sequence modelling chang2018antisymmetricrnn, seem the ideal candidates to model, analyze and predict dynamical systems in the broad sense used in this tutorial. The temporal recurrence of RNNs makes it possible, in theory, to model and identify dynamical systems described by equations with any temporal dependence.
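For the feedforward alternative with time-delayed inputs, the temporal dependence has to be supplied through the data: each training example consists of the last `d` values and the value to be predicted. The function `delay_windows` and the toy series below are illustrative assumptions.

```python
def delay_windows(series, d):
    """Build (input, target) pairs in which d past values predict the next one."""
    pairs = []
    for t in range(d, len(series)):
        pairs.append((series[t - d:t], series[t]))
    return pairs

pairs = delay_windows([0.1, 0.2, 0.3, 0.4, 0.5], d=2)
```

The size of the input layer grows with `d`, which is the source of the parameter increase mentioned above, whereas an RNN processes one time step at a time with a fixed number of weights.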
To learn chaotic dynamics, recurrent radial basis function (RBF) networks Miyosi1995ChaoticRBF and evolutionary algorithms that generate RNNs have been proposed SatoEvolutionaryRNN. "Nonlinear Autoregressive model with exogenous input" (NARX) DiaconescuNARXChaoticSeries and boosted RNNs AssaadChaoticSeriesRNN have been applied to predict chaotic time series.
However, a difficulty with RNNs is the vanishing gradient problem Bengio1994GradientDifficult. RNNs are trained by unfolding them into deep feedforward networks, creating a new layer for each time step of the input sequence. When backpropagation computes the gradient by the chain rule, this gradient tends to vanish as the number of time steps increases. As a result, for long input-output sequences, as depicted in Figure 6, RNNs have trouble modelling long-term dependencies, that is, relationships between elements that are separated by large periods of time.
To overcome this problem, LSTMs were proposed. LSTMs have an advantage over basic RNNs due to their relative insensitivity to temporal delays and, therefore, are appropriate for modeling and making predictions based on time series whenever there exist temporal dependencies of unknown duration. With the appropriate number of hidden units and activation functions Yu2017ConceptLSTM, LSTMs can model and identify any nonlinear dynamical system of the form:
(14) $h_{t+1} = f(h_t, x_t)$,
(15) $y_t = g(h_t)$,
$f$ and $g$ are the state and output functions while $x_t$, $h_t$ and $y_t$ are the system input, state and output.
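A discrete-time nonautonomous system of the kind in Equations 14 and 15 can be simulated by iterating a state function and applying an output function. The sketch below is a hedged illustration: the function `simulate`, the driven logistic map chosen as state function, and the convention of recording the output after each state update are all assumptions.

```python
def simulate(f, g, state0, inputs):
    """Iterate state <- f(state, input) and record the output g(state) after each step."""
    state, outputs = state0, []
    for u in inputs:
        state = f(state, u)
        outputs.append(g(state))
    return state, outputs

# Hypothetical system: a logistic map driven by an additive exogenous input.
def f(s, u):
    return 3.5 * s * (1.0 - s) + u

def g(s):
    return 2.0 * s

final_state, outputs = simulate(f, g, state0=0.5, inputs=[0.0, 0.01, 0.0])
```

Sequences of inputs and outputs generated in this way are the kind of data on which an RNN or LSTM would be trained for system identification.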
LSTMs have succeeded in various applications to dynamical systems such as model identification and time series prediction WangModelIdLSTM; Yu2017ConceptLSTM; Li2018LSTMTourismFlow.
Another remarkable application of the LSTM has been machine translation Sutskever2014SequenceTS; Cho2014EncDec, using the encoder-decoder architecture described in Section 3.1.
However, as we have seen, the decoder possibly does not take into account the first elements of the input sequence because the encoder compresses all the information of the input sequence in a fixed-length vector. As a consequence, the performance of encoder-decoder networks degrades rapidly as the length of the input sequence increases, and this can be a problem in time series analysis, where predictions are based upon a long segment of the series.
Furthermore, as depicted in Figure 7, a complex dynamic may feature interdependencies between time steps that vary with time. In this situation, the equation that defines the temporal evolution may change at each time step $t$. For these dynamical systems, adding an attention module like the one described in Equation 8 can help model such time-changing interdependencies.
In the next section, we will discuss whether the new differentiable techniques in machine learning are a promising approach to solve these issues.
5 Improving dynamical systems with differentiable programming
Deep learning models, together with graphics processors and large amounts of data, have improved the modeling of dynamical systems, but with some limitations such as those mentioned in the previous section. The combination of neural networks with new differentiable algorithmic modules is expected to overcome some of those shortcomings and offer new opportunities and applications.
Among the applications to dynamical systems we highlight the following:

Use of differentiable attention mechanisms to select the relevant time steps and inputs.
As we mentioned previously, in many dynamical systems there are long-term dependencies between time steps. Moreover, these interdependencies can be dynamic, i.e., time-changing. In these cases, attention mechanisms learn to focus on the most relevant parts of the system input or state. This can be very useful in systems modeling and learning or in time series prediction.
In Hollis2018LSTMvsAttention, a comparison is made between LSTMs and attention mechanisms for financial time series forecasting. It is shown there that an LSTM with attention performs better than a standalone LSTM.
A temporal attention layer is used in Vinayavekhin2018AttentonTimeSeries to select relevant information and to provide model interpretability, an essential feature to understand deep learning models. Interpretability is further studied in detail in Serrano2019AttInterpret, concluding that attention weights partially reflect the impact of the input elements on model prediction.
In Quin2017DualAttention, the authors propose an interesting architecture with a dual-stage attention RNN; the first stage extracts the relevant input features and the second selects the relevant time steps. The model outperforms classical models in time series prediction.
Despite the theoretical advantages and some achievements, further studies are needed to verify the benefits of the attention mechanism over traditional networks.

Use of memory networks to store historical data and selectively retrieve it to model and predict.
Memory networks allow long-term dependencies in sequential data to be learned thanks to an external memory component. Instead of taking into account only the most recent states, memory networks consider the entire list of entries or states.
In Chang2018AMB the authors propose a model with a memory component, three encoders and an autoregressive component for multivariate time-series forecasting. Compared to non-memory RNN models, their model is better at modeling and capturing long-term dependencies and, moreover, it is interpretable.
Taking advantage of the highlighted capabilities of Differentiable Neural Computers (DNCs), an enhanced DNC for electroencephalogram (EEG) data analysis is proposed in Ming2018EEGDataDNC. By replacing the LSTM network controller with a recurrent convolutional network, the potential of DNCs in EEG signal processing is convincingly demonstrated.

Scientific simulation. Classical simulation models along with differentiable components allow more flexible models that better explain real data.
Scientific computing traditionally deals with modeling and simulating complex phenomena using numerical models. As we have seen, differentiable programming is an evolution of classical software programming. In differentiable programming, an optimizable architecture is defined and the input-output data are used to search in a subset of possible programs.
For example, one type of application would be the combination of a numerical model (differential or difference equations) with a differentiable component (e.g., a neural network) that has been trained with real data. Another type would be the integration of a differentiable physical model into a deep learning architecture.
In Innes2019Zygote the authors extend the Julia programming language with differentiable programming capabilities. The system makes it possible to use the Julia scientific computing packages in deep learning models. They show how to solve problems that combine scientific computing and machine learning. DiffTaichi, a differentiable programming language for building differentiable physical simulations, is proposed in Hu2019DiffTaichi, integrating a neural network controller with a physical simulation module.
A differentiable physics engine is presented in Belbute2018DiffPhysics. The system simulates rigid body dynamics and can be integrated in an end-to-end differentiable deep learning model for learning the physical parameters.
Therefore, combining scientific computing and differentiable components will open new avenues in the coming years.
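The idea of differentiating through a numerical model can be illustrated on a very small scale: a parameter of a difference equation is calibrated by gradient descent, with the derivative of the simulated state propagated alongside the simulation. The sketch uses a hand-written forward sensitivity instead of an automatic differentiation library, and the logistic map, the function names and the synthetic "observation" are assumptions made for illustration.

```python
def simulate_with_sensitivity(r, x0, n):
    """Run x_{t+1} = r * x_t * (1 - x_t) and propagate dx_t/dr alongside x_t."""
    x, dx = x0, 0.0
    for _ in range(n):
        # Chain rule applied to the update; the right-hand side uses the old x and dx.
        x, dx = r * x * (1.0 - x), x * (1.0 - x) + r * (1.0 - 2.0 * x) * dx
    return x, dx

def fit_parameter(target, x0, n, r=1.0, lr=0.5, steps=500):
    """Gradient descent on the squared error between simulated and observed end state."""
    for _ in range(steps):
        x, dx = simulate_with_sensitivity(r, x0, n)
        r -= lr * 2.0 * (x - target) * dx
    return r

# Synthetic observation generated with r = 2.0 (an assumed ground truth).
observed, _ = simulate_with_sensitivity(2.0, 0.2, 2)
r_hat = fit_parameter(observed, x0=0.2, n=2)
```

With an automatic differentiation framework the sensitivity would not have to be derived by hand, and the scalar parameter could be replaced by a neural network embedded in the simulation step.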
6 Conclusions and future directions
Differentiable programming is the use of new differentiable components beyond classical neural networks. This generalization of deep learning makes it possible to have data-parametrizable architectures instead of predefined ones, as well as new learning capabilities such as reasoning, attention, memory, etc.
The first models created under this new paradigm, such as attention mechanisms, differentiable neural computers and memory networks, are already having great impact on natural language processing.
These new models and differentiable programming are also beginning to improve machine learning applications to dynamical systems. As we have seen, these models improve the capabilities of RNNs and LSTMs in identification, modeling and prediction of dynamical systems. They even add a feature as necessary in machine learning as interpretability.
However, this is an emerging field and further research is needed in several directions, such as:

More comparative studies between attention mechanisms and LSTMs in predicting dynamical systems.

Use of self-attention and its possible applications to dynamical systems.

As with RNNs, a theoretical analysis (e.g., in the framework of dynamical systems) of attention and memory networks.

Clear guidelines so that scientists without advanced knowledge of machine learning can use new differentiable models in computational simulations.
Acknowledgments. This work was financially supported by the Spanish Ministry of Science, Innovation and Universities, grant MTM2016-74921-P (AEI/FEDER, EU).