The increase in computing capabilities together with new deep learning models has led to great advances in several machine learning tasksLeCun2015DeepLearning; Sutskever2014SequenceTS; Silver2017MasteringTG.
Deep learning architectures such as Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs), as well as the use of distributed representations in natural language processing, have allowed to take into account the symmetries and the structure of the problem to be solved.
However, a major criticism of deep learning remains, namely, that it only performs perception, mapping inputs to outputs Marcus2018Deep.
A new direction to more general and flexible models is differentiable programming, that is, the combination of geometric modules (traditional neural networks) with more algorithmic modules in an end-to-end differentiable model. As a result, differentiable programming is a generalization of deep learning with differentiable modules that provide reasoning, abstraction, memory, etc. To efficiently calculate derivatives, this approach uses automatic differentiation, an algorithmic technique similar to backpropagation and implemented in modern software packages such as PyTorch, Julia, etc.
To keep our exposition concise, this tutorial is aimed at researchers in nonlinear systems with prior knowledge of deep learning; see Goodfellow2016DL for an excellent introduction to the concepts and methods of deep learning. Therefore, this tutorial focuses right away on the limitations of traditional deep learning and the advantages of differential programming, with special attention to its application to dynamical systems. By a dynamical system we mean here and hereafter a set of state variables that evolve in time under the influence of internal and possibly external inputs.
Examples of differentiable programming that have been successfully developed in recent years include
(i) attention mechanisms Bahdanau2014Attention, in which the model automatically searchs and learns which parts of a source sequence are relevant to predict the target element,
(iii) end-to-end Memory Networks Suk2015EndToEndMN, and
(iv) Differentiable Neural Computers (DNCs) Graves2016DNC, which are neural networks (controllers) with an external read-write memory.
As expected, in recent years there has been a growing interest in applying deep learning techniques to dynamical systems. In this regard, RNNs and Long Short-Term Memories (LSTMs), specially designed for sequence modelling and temporal dependence, have been successful in various applications to dynamical systems such as model identification and time series predictionWangModelIdLSTM; Yu2017ConceptLSTM; Li2018LSTMTourismFlow.
The performance of theses models (e.g. encoder-decoder networks), however, degrades rapidly as the length of the input sequence increases and they are not able to capture the dynamic (i.e., time-changing) interdependence between time steps. The combination of neural networks with new differentiable modules could overcome some of those problems and offer new opportunities and applications.
Among the potential applications of differentiable programming to dynamical systems let us mention
(i) attention mechanisms to select the relevant time steps and inputs,
(ii) memory networks to store historical data from dynamical systems and selectively use it for modelling and prediction, and
(iii) the use of differentiable components in scientific computing.
Despite some achievements, more work is still needed to verify the benefits of these models over traditional networks.
Thanks to software libraries that facilitate automatic differentiation, differentiable programming extends deep learning models with new capabilities (reasoning, memory, attention, etc.) and the models can be efficiently coded and implemented.
In the following sections of this tutorial we introduce differentiable programming and explain in detail why it is an extension of deep learning (Section 2). We describe some models based on this new approach such as attention mechanisms (Section 3.1), memory networks and differentiable neural computers (Section 3.2), and continuous learning (Section 3.3). Then we review the use of deep learning in dynamical systems and their limitations (Section 4). And, finally, we present the new opportunities that differentiable programming can bring to the modelling, simulation and prediction of dynamical systems (Section 5). The conclusions and outlook are summarized in Section 6.
2 From deep learning to differentiable programming
In recent years, we have seen major advances in the field of machine learning. The combination of deep neural networks with the computational capabilities of Graphics Processing Units (GPUs) Yadan2013MultiGPU has improved the performance of several tasks (image recognition, machine translation, language modelling, time series prediction, game playing, etc.) LeCun2015DeepLearning; Sutskever2014SequenceTS; Silver2017MasteringTG. Interestingly, deep learning models and architectures have evolved to take into account the structure of the problem to be resolved.
RNNs are a special class of neural networks where outputs from previous steps are fed as inputs to the current step Graves2009ANC; Sherstinsky2018FundRNN. This recurrence makes them appropriate for modelling dynamic processes and systems.
CNNs are neural networks that alternate convolutional and pooling layers to implement translational invariance Lecun1998CNN
. They learn spatial hierarchies of features through backpropagation by using these building layers. CNNs are being applied successfully to computer vision and image processingYamashita2018CNN.
Especially important is the use of distributed representations as inputs to natural language processing pipelines. With this technique, the words of the vocabulary are mapped to an element of a vector space with a much lower dimensionalityBengio2003NeuralModel; Mikolov2013Word2vec. This word embedding is able to keep, in the learned vector space, some of the syntactic and semantic relationships presented in the original data.
Let us recall that, in a feed-forward neural network (FNN) composed of multiple layers, the output (without the bias term) at layer, see Figure 1, is defined as
being the weight matrix at layer .
is the activation function and, the output vector at layer and the input vector at layer . The weight matrices for the different layers are the parameters of the model.
Learning is the mechanism by which the parameters of a neural network are adapted to the environment in the training process. This is an optimization problem which has been addressed by using gradient-based methods, in which given a cost function , the algorithm finds local minima updating each layer parameter with the rule , being the step size.
Apart from regarding neural networks as universal approximators, there is no sound theoretical explanation for a good performance of deep learning. Several theoretical frameworks have been proposed:
As pointed out in Lin2016Cheaplearning, the class of functions of practical interest can be approximated with exponentially fewer parameters than the generic ones. Symmetry, locality and compositionality properties make it possible to have simpler neural networks.
From the point of view of information theory ShwartzZiv2017ITDL, an explanation has been put forward based on how much information each layer of the neural network retains and how this information varies with the training and testing process.
Although deep learning can implicitly implement logical reasoning Hohenecker2018Onto, one of its limitations is that it only performs perception, representing a mapping between inputs and outputs Marcus2018Deep.
A path to a more general intelligence, as we will see below, is the combination of geometric modules with more algorithmic modules in an end-to-end differentiable model. This approach, called differentiable programming, adds new parametrizable and differentiable components to traditional neural networks.
Differentiable programming can be seen as a continuation of the deep learning end-to-end architectures that have replaced, for example, the traditional linguistic components in natural language processing Deng2018NLP; Goldberg2017NLPBook. To efficiently calculate the derivatives in a gradient descent, this approach uses automatic differentiation, an algorithmic technique similar but more general than backpropagation.
Automatic differentiation, in its reverse-mode and in contrast to manual, symbolic and numerical differentiation, computes the derivatives in a two-step process Baydin2018AutomaticDiff; Wang2018DemystifyingDP. In a first step, called the forward pass, the computational graph is built populating intermediate variables and recording the dependencies. In a second step, called the backward pass, derivatives are calculated using backpropagation. In the last years, deep learning frameworks such as PyTorch have been developed that provide reverse-mode automatic differentiation Paszke2017automatic. The define-by-run philosophy of PyTorch, whose execution dynamically constructs the computational graph, is facilitating the development of general differentiable programs.
Differentiable programming is an evolution of classical (traditional) software programming where, as shown in Table 1:
Instead of specifying explicit instructions to the computer, an objective is set and an optimizable architecture is defined which allows to search in a subset of possible programs.
The program is defined by the input-output data and not predefined by the user.
The algorithmic elements of the program have to be differentiable, say, by converting them into neural networks.
RNNs, for example, are an evolution of feedforward networks because they are classical neural networks inside a for-loop (a control flow statement for iteration) which allows the neural network to be executed repeatedly with recurrence. However, this for-loop is a predefined feature of the model. The ideal situation would be to augment the neural network with programming primitives (for-loops, if branches, while statements, external memories, logical modules, etc.) that are not predefined by the user but are parametrizable by the training data.
The trouble is that many of these programming primitives are not differentiable and need to be converted into optimizable modules. For instance, if the condition of an ”if” primitive (e.g., if is satisfied do , otherwise do. Similarly, in an attention module, different weights that are learned with the model are assigned to give a different influence to each part of the input.
In this way, using differentiable programming we can combine traditional perception modules (CNN, RNN, FNN) with additional algorithmic modules that provide reasoning, abstraction, memory, etc. Yang2017DiffeLogical. In the following section we describe some examples of this approach that have been developed in recent years.
3 Differentiable learning and reasoning
3.1 Attention mechanisms
RNNs (see Figure 2) are a basic component of modern deep learning architectures, especially of encoder-decoder networks. The following equations define the time evolution of an RNN:
, and being weight matrices. and are the hidden and output activation functions while , and are the network input, hidden state and output.
An evolution of RNNs are LSTMs Hochreiter1997LSTM, an RNN structure with gated units. LSTM are composed of a cell, an input gate, an output gate and a forget gate, and allow gradients to flow unchanged. The memory cell remembers values over arbitrary time intervals and the three gates regulate the flow of information into and out of the cell.
An encoder-decoder network maps an input sequence to a target one with both sequences of arbitrary length Sutskever2014SequenceTS. They have applications ranging from machine translation to time series prediction.
More specifically, this mechanism uses an RNN (or any of its variants, an LSTM or a GRU, Gated Recurrent Unit) to map the input sequence to a fixed-length vector, and another RNN (or any of its variants) to decode the target sequence from that vector (see Figure3). Such a seq2seq model features normally an architecture composed of:
An encoder which, given an input sequence with , maps to
where is the hidden state of the encoder at time , is the size of the hidden state and is an RNN (or any of its variants).
A decoder, whose initial state is initialized with the last hidden state of the encoder . It generates the output sequence , (the dimension depending on the task), with
is an RNN (or any of its variants) with an additional softmax layer.
Because the encoder compresses all the information of the input sequence in a fixed-length vector (the final hidden state ), the decoder possibly does not take into account the first elements of the input sequence. The use of this fixed-length vector is a limitation to improve the performance of the encoder-decoder networks. Moreover, the performance of encoder-decoder networks degrades rapidly as the length of the input sequence increases cho2014ProblemsEncDecod. This occurs in applications such as machine translation and time series predition, where it is necessary to model long time dependencies.
The key to solve this problem, inspired by neuroscience and human behaviour, is the attention mechanism. This mechanism allows the brain to focus on one part of the input (image, text, etc), giving less attention to others. So, implement a similar mechanism in an encoder-decoder network and make it differentiable.
In Bahdanau2014Attention an extension of the basic encoder-decoder arquitecture was proposed by allowing the model to automatically search and learn which parts of a source sequence are relevant to predict the target element. Instead of encoding the input sequence in a fixed-length vector, it generates a sequence of vectors, choosing the most appropriate subset of these vectors during the decoding process.
With the attention mechanism, the encoder is a bidirectional RNN Graves2013HybridRNN with a forward hidden state and a backward one . The encoder state is represented as a simple concatenation of the two states,
with . The encoder state includes both the preceding and following elements of the sequence, thus capturing information from neighbouring inputs.
The decoder has an output
for . is an RNN with an additional softmax layer, and the input is a concatenation of with the context vector , which is a sum of hidden states of the input sequence weighted by alignment scores:
The weight of each state is calculated by
The score measures how well the input at position and the output at position match. are the weights that implement the attention mechanism, defining how much of each input hidden state should be considered when deciding the next state and generating the output (see Figure 4).
The score function can be parametrized using different alignment models such as
as proposed in Bahdanau2014Attention, where and are matrices to be jointly learned with the rest of the model. Also, in Graves2014NTM the authors use a similarity measure for content-based attention, namely,
where denotes the angle between and .
An example of a matrix of alignment scores can be seen in Figure 5. This matrix provides interpretability to the model since it allows to know which part (time-step) of the input is more important to the output.
3.2 Other attention mechanisms and differentiable neural computers
A variant of the attention mechanism is self-attention, in which the attention component relates different positions of a single sequence in order to compute a representation of the sequence. In this way, the mechanism can connect distant elements of the sequence more directly than using RNNs Tang2018SelfAtt.
Another variant of attention are end-to-end memory networks Suk2015EndToEndMN
, which are neural networks with a recurrent attention model over an external memory. The model, trained end-to-end, outputs an answer based on a set of inputsstored in a memory and a query.
Traditional computers are based on the von Neumann architecture which has two basic components: the CPU (Central Processing Unit), which carries out the program instructions, and the memory, which is accessed by the CPU to perform write/read operations. In contrast, neural networks follow a hybrid model where synaptic weights perform both processing and memory tasks.
Neural networks and deep learning models are good at mapping inputs to outputs but are limited in their ability to use facts from previous events and store useful information. Differentiable Neural Computers (DNCs) Graves2016DNC try to overcome these shortcomings by combining neural networks with an external read-write memory.
As described in Graves2016DNC, a DNC is a neural network, called the controller (playing the role of a differentiable CPU), with an external memory, an matrix. The DNC uses differentiable attention mechanisms to define distributions (weightings) over the N rows and learn the importance each row has in a read or write operation.
To select the most appropriate memory components during read/write operations, a weighted sum is used over the memory locations The attention mechanism is used in three different ways: Access content (read or write) based on similarity, time ordered access (temporal links) to recover the sequences in the order in which they were written, and dynamic memory allocation, where the DNC assigns and releases memory based on usage percentage. At each time step, the DNC gets an input vector and emits an output vector that is a function of the combination of the input vector and the memories selected.
DNCs, by combining the following characteristics, have very promising applications in complex tasks that require both perception and reasoning:
The classical perception capability of neural networks.
Read and write capabilities based on content similarity and learned by the model.
The use of previous knowledge to plan and reason.
End-to-end differentiability of the model.
Implementation using software packages with automatic differentiation libraries such as PyTorch, Tensorflow or similar.
3.3 Meta-plasticity and continuous learning
The combination of geometric modules (classical neural networks) with algorithmic ones adds new learning capabilities to deep learning models. In the previous sections we have seen that one way to improve the learning process is by focusing on certain elements of the input or a memory and making this attention differentiable.
Another natural way to improve the process of learning is to incorporate differentiable primitives that add flexibility and adaptability. A source of inspiration is neuromodulators, which furnish the traditional synaptic transmission with new computational and processing capabilities Hernandez2018MultilayerAN.
Unlike the continuous learning capabilities of animal brains, which allow animals to adapt quickly to the experience, in neural networks, once the training is completed, the parameters are fixed and the network stops learning. To solve this issue, in Miconi2018DifferentiablePT a differentiable plasticity component is attached to the network that helps previously-trained networks adapt to ongoing experience.
of neuronhas a conventional fixed weight and a plastic component , where is a structural parameter tuned during the training period and a plastic component automatically updated as a function of ongoing inputs and outputs. The equations for the activation of with learning rate , as described in Miconi2018DifferentiablePT, are:
Then, during the initial training period, and are trained using gradient descent and after this period, the model keeps learning from ongoing experience.
4 Modeling dynamical systems with neural networks
Dynamical systems deal with time-evolutionary processes and their corresponding systems of equations. At any given time, a dynamical system has a state that can be represented by a point in a state space (manifold). The evolutionary process of the dynamical system describes what future states follow from the current state. This process can be deterministic, if its entire future is uniquely determined by its current state, or non-deterministic otherwise Layek2015BookDS (e.g., a random dynamical system Arnold2003RandomDS). Furthermore, it can be a continuous-time process, represented by differential equations or, as in this paper, a discrete-time process, represented by difference equations or maps.
Dynamical systems have important applications in physics, chemistry, economics, engineering, biology and medicine Jackson2015BookDSBio. They are relevant even in day-to-day phenomena with great social impact such as tsunami warning, earth temperature analysis and financial markets prediction.
Dynamical systems that contain a very large number of variables interacting with each other in non-trivial ways are sometimes called complex (dynamical) systems Gross2008CAS. Their behaviour is intrinsically difficult to model due to the dependencies and interactions between their parts and they have emergence properties arising from these interactions such as adaptation, evolution, learning, etc.
Here we consider discrete-time, deterministic and non-autonomous (i.e., the time evolution depending also on exogenous variables) dynamical systems as well as the more general complex systems. Specifically, the dynamical systems of interest range from systems of difference equations with multiple time delays to systems with a dynamic (i.e., time-changing) interdependence between time steps. Notice that the former ones may be rewritten as higher dimensional systems with time delay 1.
On the other hand, in recent years deep learning models have been very successful performing various tasks such as image recognition, machine translation, game playing, etc. When the amount of training data is sufficient and the distribution that generates the real data is the same as the distribution of the training data, these models perform extremely well and approximate the input-output relation.
In view of the importance of dynamical systems for modeling physical, biological and social phenomena, there is a growing interest in applying deep learning techniques to dynamical systems. This can be done in different contexts, such as:
Modeling dynamical systems with known structure and equations but non-analytical or complex solutions Pan2018LongTimePM.
Modeling dynamical systems without knowledge of the underlying governing equations Düben2018GlobalWeather; Chakraborty1992ForecastingTB. In this regard, let us mention that commercial initiatives are emerging that combine large amounts of meteorological data with deep learning models to improve weather predictions.
Modeling dynamical systems with partial or noisy data Yeo2019DLNoisyDS.
Therefore, the combination of greater computing resources, large amounts of data and deep learning models can, in principle, revolutionize the modeling of dynamical systems in general, and complex systems in particular, and thus enable a better understanding and prediction of relevant natural and social phenomena.
A key aspect in modelling dynamical systems is temporal dependence. There are two ways to introduce it into a neural network Narendra1990DSystemsNN:
A classical feedforward neural network with time delayed states in the inputs but perhaps with an unnecessary increase in the number of parameters.
Thus, RNNs, specially designed for sequence modelling chang2018antisymmetricrnn, seem the ideal candidates to model, analyze and predict dynamical systems in the broad sense used in this tutorial. The temporal recurrence of RNNs, theoretically, allows to model and identify dynamical systems described with equations with any temporal dependence.
To learn chaotic dynamics, recurrent radial basis function (RBF) networksMiyosi1995ChaoticRBF
and evolutionary algorithms that generate RNNs have been proposedSatoEvolutionaryRNN
. ”Nonlinear Autoregressive model with exogenous input” (NARX)DiaconescuNARXChaoticSeries and boosted RNNs AssaadChaoticSeriesRNN have been applied to predict chaotic time series.
However, a difficulty with RNNs is the vanishing gradient problemBengio1994GradientDifficult
. RNNs are trained by unfolding them into deep feedforward networks, creating a new layer for each time step of the input sequence. When backpropagation computes the gradient by the chain rule, this gradient vanishes as the number of time-steps increases. As a result, for long input-output sequences, as depicted in Figure6, RNNs have trouble modelling long-term dependencies, that is, relationships between elements that are separated by large periods of time.
To overcome this problem, LSTMs were proposed. LSTMs have an advantage over basic RNNs due to their relative insensitivity to temporal delays and, therefore, are appropriate for modeling and making predictions based on time series whenever there exist temporary dependencies of unknown duration. With the appropriate number of hidden units and activation functions Yu2017ConceptLSTM, LSTMs can model and identify any non-linear dynamical system of the form:
and are the state and output functions while , and are the system input, state and output.
LSTMs have succeeded in various applications to dynamical systems such as model identification and time series prediction WangModelIdLSTM; Yu2017ConceptLSTM; Li2018LSTMTourismFlow.
An also remarkable application of the LSTM has been machine translation Sutskever2014SequenceTS; Cho2014EncDec, using the encoder-decoder architecture described in Section 3.1.
However, as we have seen, the decoder possibly does not take into account the first elements of the input sequence because the encoder compresses all the information of the input sequence in a fixed-length vector. Then, the performance of encoder-decoder networks degrades rapidly as the length of input sequence increases and this can be a problem in time series analysis, where predictions are based upon a long segment of the series.
Furthermore, as depicted in Figure 7, a complex dynamic may feature interdependencies between time steps that vary with time. In this situation, the equation that defines the temporal evolution may change at each . For these dynamical systems, adding an attention module like the one described in Equation 8 can help model such time-changing interdependencies.
In the next section, we will discuss whether the new differentiable techniques in machine learning are a promising approach to solve these issues.
5 Improving dynamical systems with differentiable programming
Deep learning models together with graphic processors and large amounts of data have improved the modeling of dynamical systems but this has some limitations such as those mentioned in the previous section. The combination of neural networks with new differentiable algorithmic modules is expected to overcome some of those shortcomings and offer new opportunities and applications.
Among the applications to dynamical systems we highlight the following:
Use of differentiable attention mechanisms to select the relevant time steps and inputs.
As we mentioned previously, in many dynamical systems there are long term dependencies between time steps. Moreover, these interdependencies can be dynamic, i.e., time-changing. In these cases, attention mechanisms learn to focus on the most relevant parts of the system input or state. This can be very useful in systems modeling and learning or in time series prediction.
In Hollis2018LSTMvsAttention, a comparison is made between LSTMs and attention mechanisms for financial time series forecasting. It is shown there that an LSTM with attention perform better than stand-alone LSTMs.
A temporal attention layer is used in Vinayavekhin2018AttentonTimeSeries to select relevant information and to provide model interpretability, an essential feature to understand deep learning models. Interpretability is further studied in detail in Serrano2019AttInterpret, concluding that attention weights partially reflect the impact of the input elements on model prediction.
In Quin2017DualAttention, the authors propose an interesting architecture with a dual-stage attention RNN; the first stage extracts the relevant input features and the second selects the relevant time steps. The model outperforms classical model in time series prediction.
Despite the theoretical advantages and some achievements, further studies are needed to verify the benefits of the attention mechanism over traditional networks.
Use of memory networks to store historical data and selectively retrieve it to model and predict.
Memory networks allow long-term dependencies in sequential data to be learned thanks to an external memory component. Instead of taking into account only the most recent states, memory networks consider the entire list of entries or states.
In Chang2018AMB the authors propose a model with a memory component, three encoders and an autoregressive component for multivariate time-series forecasting. Compared to non-memory RNN models, their model is better at modeling and capturing long-term dependencies and, moreover, it is interpretable.
Taking advantage of the highlighted capabilities of Differentiable Neural Computers (DNCs), an enhanced DNC for electroencephalogram (EEG) data analysis is proposed in Ming2018EEGDataDNC. By replacing the LSTM network controller with a recurrent convolutional network, the potential of DNCs in EEG signal processing is convincingly demonstrated.
Scientific simulation. Classical simulation models along with differentiable components allow more flexible models that better explain real data.
Scientific computing traditionally deals with modeling and simulating complex phenomena using numerical models. As we have seen, differentiable programming is an evolution of classical software programming. In differentiable programming, an optimizable architecture is defined and the input-output data allows to search in a subset of possible programs.
For example, one type of application would be the combination of a numerical model (differential or difference equations) with a differentiable component (e.g., a neural network) that has been trained with real data. Another type would be the integration of a differentiable physical model into a deep learning architecture.
In Innes2019Zygote the authors extend Julia programming language with differentiable programming capabilities. The system allows to use the Julia scientific computing packages in deep learning models. They show how to solve problems that combine scientific computing and machine learning. DiffTaichi, a differentiable programming for building differentiable physical simulations, is proposed in Hu2019DiffTaichi, integrating a neural network controller with a physical simulation module.
A differentiable physics engine is presented in Belbute2018DiffPhysics. The system simulates rigid body dynamics and can be integrated in an end-to-end differentiable deep learning model for learning the physical parameters.
Therefore, combining scientific computing and differentiable components will open new avenues in the coming years.
6 Conclusions and future directions
Differentiable programming is the use of new differentiable components beyond classical neural networks. This generalization of deep learning allows to have data parametrizable architectures instead of pre-fixed ones and new learning capabilities such as reasoning, attention, memory, etc.
The first models created under this new paradigm, such as attention mechanisms, differentiable neural computers and memory networks, are already having great impact on natural language processing.
These new models and differentiable programming are also beginning to improve machine learning applications to dynamical systems. As we have seen, these models improve the capabilities of RNNs and LSTMs in identification, modeling and prediction of dynamical systems. They even add a feature as necessary in machine learning as interpretability.
However, this is an emerging field and further research is needed in several directions, such as:
More comparative studies between attention mechanisms and LSTMs in predicting dynamical systems.
Use of self-attention and its possible applications to dynamical systems.
As with RNNs, a theoretical analysis (e.g., in the framework of dynamical systems) of attention and memory networks.
Clear guidelines so that scientists without advanced knowledge of machine learning can use new differentiable models in computational simulations.
Acknowledgments. This work was financially supported by the Spanish Ministry of Science, Innovation and Universities, grant MTM2016-74921-P (AEI/FEDER, EU).