Predictive Neural Networks

Recurrent neural networks are a powerful means to cope with time series. We show that already linearly activated recurrent neural networks can approximate any time-dependent function f(t) given by a number of function values. The approximation can effectively be learned by simply solving a linear equation system; no backpropagation or similar methods are needed. Furthermore, the network size can be reduced by taking only the most relevant components of the network. Thus, in contrast to other approaches, our method learns not only the network weights but also the network architecture. The networks have interesting properties: In the stationary case they end up in ellipse trajectories in the long run, and they allow the prediction of further values and compact representations of functions. We demonstrate this by several experiments, among them multiple superimposed oscillators (MSO) and robotic soccer. Predictive neural networks outperform the previous state of the art for the MSO task with a minimal number of units.




1 Introduction

Deep learning in general means a class of machine learning algorithms that use a cascade of multiple layers of nonlinear processing units for feature extraction and transformation (Deng and Yu, 2014). The tremendous success of deep learning in diverse fields such as computer vision and natural language processing seems to depend on a number of ingredients: artificial, possibly recurrent neural networks (RNNs) with nonlinearly activated neurons, convolutional layers, and iterative training methods like backpropagation (Goodfellow et al., 2016). But which of these components are really essential for machine learning tasks such as time-series prediction?

Research in time series analysis, and hence in modeling the dynamics of complex systems, has a long tradition and is still highly active due to its crucial role in many real-world applications (Lipton et al., 2015) like weather forecasting, stock quotations, comprehending trajectories of objects and agents, or solving number puzzles (Ragni and Klein, 2011; Glüge and Wendemuth, 2013). The analysis of time series allows, among other things, data compression, i.e., a compact representation of the time series, e.g., by a function, and the prediction of further values.

A large body of research addresses these topics with RNNs, in particular variants of networks with long short-term memory (LSTM) (cf. Hochreiter and Schmidhuber, 1997). In the following, we consider an alternative, simple, yet very powerful type of RNN which we call predictive neural network (PrNN). It uses only linear activation and attempts to minimize the network size. Thus, in contrast to other approaches, not only the network weights but also the network architecture is learned.

The rest of the paper is structured as follows: First, we briefly review related work (Sect. 2). We then introduce PrNNs as a special and simple kind of RNN, together with their properties, including the general network dynamics and their long-term behavior (Sect. 3). Afterwards, learning PrNNs is explained (Sect. 4). It is a relatively straightforward procedure which allows network size reduction; no backpropagation or gradient descent method is needed. We then discuss results and experiments (Sect. 5), before we end with conclusions (Sect. 6).

2 Related Work

2.1 Recurrent Neural Networks

Simple RNNs were proposed by Elman (1990). By allowing them to accept sequences as inputs and outputs rather than individual observations, RNNs extend standard feedforward multilayer perceptron networks. As many sequence modeling tasks show, data points such as video frames, audio snippets, and sentence segments are usually highly related in time, which makes RNNs indispensable tools for modeling such temporal dependencies. Linear RNNs and some of their properties (like short-term memory) were already investigated by White et al. (1994). Unfortunately, however, it can be a struggle to train RNNs to capture long-term dependencies (see Bengio et al., 1994; Pascanu et al., 2013), because the gradients vanish or explode during backpropagation, which in turn makes gradient-based optimization difficult.

Nowadays, probably the most prominent and dominant type of RNN is the long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997). The expression "long short-term" refers to the fact that LSTM is a model of short-term memory which can last for a long period of time. An LSTM is well-suited to classify, process, and predict time series with time lags of unknown size. LSTMs were developed to deal with the exploding and vanishing gradient problem that arises when training traditional RNNs (see above). A common LSTM unit is composed of a cell, an input gate, an output gate, and a forget gate. Each unit type is activated in a different manner, whereas in this paper we consider completely linearly activated RNNs.

Echo state networks (ESNs) play a significant role in RNN research as they provide an architecture and supervised learning principle for RNNs. They do so by driving a random, large, fixed RNN, called a reservoir in this context, with the input signal, which then induces in each neuron of the reservoir a nonlinear response signal. A desired output signal is combined from a trainable linear combination of all these response signals (Jaeger and Haas, 2004; Jaeger, 2014). Xue et al. (2007) propose a variant of ESNs that works with several independent (decoupled) smaller networks. ESN-style initialization has been shown effective for training RNNs with Hessian-free optimization (Martens and Sutskever, 2011). Tiňo (2018) investigates the effect of weight changes in linear symmetric ESNs on the (Fisher) memory of the network.

Hu and Qi (2017) have proposed a novel state-frequency memory (SFM) RNN, which aims to model the frequency patterns of the temporal sequences. The key idea of the SFM is to decompose the memory states into different frequency states. In doing so, they can explicitly learn the dependencies of both the low and high frequency patterns. As we will see (cf. Sect. 5.1), RNNs in general can easily learn time series that have a constant frequency spectrum which may be obtained also by Fourier analysis.

Ollivier et al. (2015) suggested the "NoBackTrack" algorithm to train the parameters of RNNs. This algorithm works in an online, memoryless setting and therefore requires no backpropagation through time. It is also scalable, thus avoiding the large computational and memory cost of maintaining the full gradient of the current state with respect to the parameters, but it still uses an iterative method (namely gradient descent). In contrast to this and other related works, in this paper we present a method working with linearly activated RNNs that does not require backpropagation or similar procedures in the learning phase.

2.2 Reinforcement Learning and Autoregression

In reinforcement learning (Sutton and Barto, 1998), several opportunities to use RNNs exist: In model-based reinforcement learning, RNNs can be used to learn a model of the environment. Alternatively, a value function can be learned directly from current observations and the state of the RNN as a representation of the recent history of observations. A consequence of an action in reinforcement learning may follow the action with a significant delay; this is also called the temporal credit assignment problem. RNNs used for value functions in reinforcement learning are commonly trained with truncated backpropagation through time, which can be problematic for credit assignment (see, e.g., Pong et al., 2017).

An autoregressive model is a representation of a type of random process (Akaike, 1969). It specifies that the output variable, or a vector thereof, depends linearly on its own previous values and on a stochastic term (white noise). In consequence, the model has the form of a stochastic difference equation, as (physical) dynamic systems do in general (Colonius and Kliemann, 2014). A PrNN is also linearly activated, but its output does not only depend on its own previous values and possibly white noise but on the complete state of the possibly big reservoir, whose dynamics is explicitly dealt with. In addition, the size of the network may be reduced in the further process.

3 Predictive Neural Networks

RNNs often host several types of neurons, each activated in a different manner (Elman, 1990; Hochreiter and Schmidhuber, 1997). In contrast to this, we here simply understand a neural network as an interconnected group of standard neurons which may have arbitrary loops, akin to biological neuronal networks. We adopt a discrete time model, i.e., input and output can be represented by a time series and are processed stepwise by the network.

Definition 1 (time series)

A time series is a series of data points x(1), …, x(n) in d dimensions, where x(t) ∈ R^d and n ≥ 1.

Definition 2 (recurrent neural network)

A recurrent neural network (RNN) is a directed graph consisting of altogether N nodes, called neurons. x_i(t) denotes the activation of the neuron i at (discrete) time t. We may distinguish three groups of neurons (cf. Fig. 2):

  • input neurons (usually without incoming edges) whose activation is given by an external source, e.g., a time series,

  • output neurons (usually without outgoing edges) whose activation represents some output function, and

  • reservoir or hidden neurons (arbitrarily connected) that are used for auxiliary computations.

The edges of the graph represent the network connections. They are usually annotated with weights, which are compiled in the transition matrix W: an entry w_ij in row i and column j denotes the weight of the edge from neuron j to neuron i. If there is no connection, then w_ij = 0. The transition matrix contains the following weight matrices: the input weights W_in (weights from the input and possibly the output to the reservoir), the reservoir weights W_res (a matrix of size N_res × N_res, where N_res is the number of reservoir neurons), and the output weights W_out (all weights to the output and possibly back to the input).

Figure 1: General recurrent neural network (cf. Jaeger and Haas, 2004, Fig. 1). In ESNs, only output weights are trained and the hidden layer is also called reservoir.
Figure 2: PrNN for the function of Example 1. The input/output neuron is marked by a thick border. The initial values of the neurons at the start time are written in the nodes. The weights are annotated at the edges.

Let us now define the network activity in more detail: The initial configuration of the neural network is given by a column vector x(t_0) with N components, called the start vector. It represents the network state at the start time t_0. Because of the discrete time model, we compute the activation of a (non-input) neuron i at time t + Δt (for some time step Δt > 0) from the activations of the neurons j that are connected to i with the weights w_ij at time t as follows:

    x_i(t + Δt) = g( Σ_j w_ij x_j(t) )    (1)

This has to be done simultaneously for all neurons of the network. Here g is a (real-valued) activation function. Usually, a nonlinear, bounded, strictly increasing sigmoidal function is used, e.g., the logistic function, the hyperbolic tangent (tanh), or the softmax function (cf. Goodfellow et al., 2016). In the following, we simply employ the (linear) identity function and can still approximate arbitrary time-dependent functions (cf. Prop. 6).

Definition 3 (predictive neural network)

A predictive neural network (PrNN) is an RNN with the following properties:

  1. The start time t_0 is fixed, and the time step Δt is constant, often Δt = 1.

  2. The initial state of the given time series constitutes the first d components of the start vector x(t_0).

  3. For all neurons we have linear activation, i.e., g is the identity everywhere.

  4. The weights in W_in and W_res are initially taken randomly, independently, and identically distributed from the standard normal distribution, whereas the output weights W_out are learned (see Sect. 4.1).

  5. There is no clear distinction between input and output but only one joint group of input/output neurons. They may be arbitrarily connected, like the reservoir neurons. We thus can imagine the whole network as one big reservoir, because the input/output neurons are not particularly special.

PrNNs can run in one of two modes: either receiving input or generating (i.e., predicting) output, but not both. In output generating mode, Eq. 1 is applied to all neurons including the input/output neurons, whereas in input receiving mode the activation of every input/output neuron is always overwritten with the respective input value at time t, given by the time series.

Example 1

The function shown in Fig. 2 can be realized by a PrNN (in output generating mode) with three neurons, exploiting a simple algebraic identity of the function. The corresponding transition matrix and start vector can be read off Fig. 2.

3.1 Network Dynamics

Clearly, a PrNN runs through a sequence of network states x(t) for t = t_0, t_0 + Δt, t_0 + 2Δt, …. In output generating mode it holds that

    x(t + Δt) = W x(t)

and hence simply x(t_0 + kΔt) = W^k x(t_0).
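This closed form is easy to check numerically. The following sketch (using numpy; the matrix size, scaling, seed, and horizon are arbitrary choices for illustration) simulates the stepwise update and compares it with the matrix power:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
W = rng.standard_normal((N, N)) * 0.3    # transition matrix (scaled down arbitrarily)
x = x0 = rng.standard_normal(N)          # start vector x(t_0)

for _ in range(10):                      # ten stepwise updates x(t + 1) = W x(t)
    x = W @ x

# the iterated state equals W^10 applied to the start vector
assert np.allclose(x, np.linalg.matrix_power(W, 10) @ x0)
```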

Property 1

Let W = S J S^-1 be the Jordan decomposition of the transition matrix W, where J is the direct sum, i.e., a block diagonal matrix, of one or more Jordan blocks J_1, …, J_m, in general with different sizes and eigenvalues λ_1, …, λ_m. Then it holds:

    x(t_0 + kΔt) = S J^k S^-1 x(t_0)

If we decompose S into matrices S_1, …, S_m whose column counts match the block sizes, and the column vector S^-1 x(t_0) into a stack of column vectors v_1, …, v_m corresponding to the Jordan blocks in J, then x(t_0 + kΔt) can be expressed as a sum of vectors S_i J_i^k v_i, where the Jordan block powers J_i^k are upper triangular Toeplitz matrices with binomial coefficients times powers of λ_i as entries (cf. Horn and Johnson, 2013, Sect. 3.2.5).

A Jordan decomposition exists for every square matrix (Horn and Johnson, 2013, Theorem 3.1.11), but it is needed only if not all eigenvalues of the transition matrix are semisimple. If the transition matrix has an eigendecomposition, i.e., there are N distinct eigenvectors, then the network dynamics can be described directly by means of the eigenvalues and eigenvectors of W:


Property 2

Let W = V Λ V^-1 be the eigendecomposition of the transition matrix, with the column eigenvectors v_1, …, v_N in V and the eigenvalues λ_1, …, λ_N on the diagonal of the diagonal matrix Λ, sorted in decreasing order with respect to their absolute values. Like every column vector, we can represent the start vector as a linear combination of the eigenvectors, namely as x(t_0) = c_1 v_1 + … + c_N v_N, where the coefficients are given by V^-1 x(t_0). It follows that x(t_0 + Δt) = W x(t_0). Since W is a linear mapping and for each eigenvector v_i with eigenvalue λ_i it holds that W v_i = λ_i v_i, we have x(t_0 + Δt) = c_1 λ_1 v_1 + … + c_N λ_N v_N. Induction over k immediately yields:

    x(t_0 + kΔt) = c_1 λ_1^k v_1 + … + c_N λ_N^k v_N    (2)
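The spectral representation of the dynamics can be verified numerically. A minimal numpy sketch (matrix size and seed are arbitrary) expands the start vector in the eigenbasis and compares the resulting sum with the direct matrix power:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 4
W = rng.standard_normal((N, N))
x0 = rng.standard_normal(N)

lam, V = np.linalg.eig(W)          # eigenvalues and column eigenvectors
c = np.linalg.solve(V, x0)         # coefficients: x0 = c_1 v_1 + ... + c_N v_N

k = 7
x_spectral = (V * lam ** k) @ c    # sum_i c_i lam_i^k v_i
x_power = np.linalg.matrix_power(W, k) @ x0
assert np.allclose(x_spectral, x_power)
```

Note that the eigenvalues of a real matrix may be complex; the imaginary parts of the reconstruction cancel up to rounding error.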


So far, the input weights and reservoir weights are arbitrary random values. In order to obtain better numerical stability during the computation, they should be adjusted as follows:

  • In the presence of linear activation, the spectral radius of the reservoir weights matrix W_res, i.e., the largest absolute value of its eigenvalues, is set to 1 (cf. Jaeger and Haas, 2004). Otherwise, with increasing k, the values of x(t_0 + kΔt) explode if the spectral radius is greater than 1, or vanish if it is smaller.

  • The norms of the vectors in W_in and W_out should be balanced (Koryakin et al., 2012). To achieve this, we initialize the reservoir neurons such that the reservoir start vector (with N_res components; it is part of the start vector) has unit norm.

  • We usually employ fully connected graphs, i.e., all neurons, especially the reservoir neurons, are connected with each other, because the connectivity has nearly no influence on the best reachable performance (Koryakin et al., 2012).
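The first two adjustments above can be sketched in a few lines of numpy (the reservoir size and seed are arbitrary choices): scale the reservoir matrix to spectral radius 1 and normalize the reservoir start vector to unit length:

```python
import numpy as np

rng = np.random.default_rng(2)
N_res = 20                                    # number of reservoir neurons (arbitrary)
W_res = rng.standard_normal((N_res, N_res))
rho = max(abs(np.linalg.eigvals(W_res)))      # spectral radius
W_res /= rho                                  # scale to spectral radius 1

r0 = rng.standard_normal(N_res)
r0 /= np.linalg.norm(r0)                      # reservoir start vector with unit norm

assert np.isclose(max(abs(np.linalg.eigvals(W_res))), 1.0)
assert np.isclose(np.linalg.norm(r0), 1.0)
```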

Let us remark that, although the parameter k usually is discrete, i.e., an integer number, the values of x(t_0 + kΔt) can also be computed for arbitrary real k, if W can be diagonalized by eigendecomposition (according to Prop. 2). We simply have to take the k-th power of each diagonal element in Λ to obtain Λ^k and hence x(t_0 + kΔt) from Eq. 2. Note, however, that the interpolated values of x(t_0 + kΔt) may be complex, even if W and x(t_0) are completely real-valued.
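The following numpy sketch (an arbitrary small example) illustrates such fractional powers via the eigendecomposition; integer arguments reproduce the discrete dynamics, while non-integer arguments may yield complex interpolated values:

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.standard_normal((3, 3))
x0 = rng.standard_normal(3)
lam, V = np.linalg.eig(W)
c = np.linalg.solve(V, x0)

def x_at(k):
    # W^k x0 = V diag(lam^k) V^-1 x0, also defined for non-integer k
    return (V * lam.astype(complex) ** k) @ c

# integer arguments reproduce the discrete network states
assert np.allclose(x_at(2), W @ W @ x0)
# interpolated values in between are complex in general
print(x_at(0.5))
```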

3.2 Long-Term Behavior

Let us now investigate the long-term behavior of an RNN (run in output generating mode) by understanding it as an (autonomous) dynamic system (Colonius and Kliemann, 2014). We will see (in Prop. 4) that the network dynamics may be reduced to a very small number of neurons in the long run, i.e., for k → ∞. Nevertheless, for smaller k, the use of many neurons is important for computing short-term predictions.

Property 3

In none of its dimensions does x(t_0 + kΔt) grow faster than polynomially times single-exponentially in k.


Let x_j denote the value of the j-th dimension of x, let λ_max be an eigenvalue of the transition matrix W with maximal absolute value, and let m be the maximal multiplicity of an eigenvalue, i.e., the maximal size of a Jordan block. Then, from Prop. 1, we can easily deduce

    |x_j(t_0 + kΔt)| ∈ O(k^(m−1) · |λ_max|^k)

as asymptotic behavior for large k.

Property 4

Consider an RNN whose transition matrix W is completely real-valued, has (according to Prop. 2) an eigendecomposition with spectral radius 1, and whose eigenvalues are all distinct, e.g., a pure random reservoir. Then almost all terms in Eq. 2 vanish for large k, because for all eigenvalues λ with |λ| < 1 we have λ^k → 0. Although a real matrix can have more than two complex eigenvalues on the unit circle, almost always only the eigenvalues λ_1 and possibly λ_2 have absolute value 1. In consequence, we have one of the following cases:

  1. λ_1 = 1. In this case, the network activity contracts to one point, i.e., to a singularity: x(t_0 + kΔt) → c_1 v_1.

  2. λ_1 = −1. For large k it holds that x(t_0 + kΔt) ≈ (−1)^k c_1 v_1. This means we have an oscillation in this case. The dynamic system alternates between two points.

  3. λ_1 and λ_2 are two (properly) complex eigenvalues with absolute value 1. Since W is a real-valued matrix, the two eigenvalues as well as the corresponding eigenvectors v_1 and v_2 are complex conjugates of each other. Thus for large k we have an ellipse trajectory

    x(t_0 + kΔt) ≈ c_1 λ_1^k v_1 + c_2 λ_2^k v_2

    where λ_2 is the complex conjugate of λ_1, v_2 of v_1, and c_2 of c_1.

We now consider the latter case in more detail: Here the relevant part of V consists of the two complex conjugate eigenvectors v and its conjugate. Among all linear combinations e^(iφ) v + e^(−iφ) v̄ of both eigenvectors for angles φ, we can determine the vectors with extremal lengths. Since we only have two real-valued dimensions, because clearly the activity induced by the real-valued matrix W remains in the real-valued space, there are two such vectors.

Let now e^(iφ) = cos φ + i sin φ (Euler's formula), and let a and b be the real and imaginary parts of v, respectively, i.e., v = a + i b and v̄ = a − i b. Then, for the square of the vector length, it holds:

    |e^(iφ) v + e^(−iφ) v̄|² = |2 (a cos φ − b sin φ)|² = 4 (aᵀa cos²φ − 2 aᵀb sin φ cos φ + bᵀb sin²φ)

To find the angles φ with extremal vector length, we investigate the derivative of the latter term with respect to φ and compute its zeros. This yields (bᵀb − aᵀa) sin 2φ = 2 aᵀb cos 2φ and thus:

    φ = 1/2 · arctan( 2 aᵀb / (bᵀb − aᵀa) )

Because of the periodicity of the tangent function, there are two main solutions for φ that are orthogonal to each other: φ and φ + π/2. They represent the main axes of an ellipse. All points the dynamic system runs through in the long run lie on this ellipse. The length ratio of the ellipse axes is the ratio of the two extremal vector lengths. We normalize both vectors to unit length and put them in the matrix E.

We now build a matrix R, similar to W but completely real-valued, which states the ellipse rotation. The rotation speed can be derived from the eigenvalue λ_1: In each step of length Δt, there is a rotation by the angle ωΔt, where ω is the angular frequency, determined from the equation λ_1 = e^(iωΔt). The two-dimensional ellipse trajectory can be stated by two (co)sinusoids with angular frequency ω. Applying the addition theorems of trigonometry, we can read off the desired ellipse rotation matrix R. Finally, we can determine the corresponding two-dimensional start vector y(t_0) by solving the equation E y(t_0) = x(t_0).

In summary, we have x(t_0 + kΔt) ≈ E R^k y(t_0) for large k. Every RNN with many neurons can thus be approximated by a simple network with at most two neurons, defined by the matrix R and the start vector y(t_0). The output values can be computed for all original dimensions by multiplication with the matrix E. They lie on an ellipse in general. Nonetheless, in the beginning, i.e., for small k, the dynamics of the system is not that regular (cf. Fig. 3).
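The collapse onto the dominant eigen-components can be checked numerically. In the sketch below (numpy; network size, seed, and horizon are arbitrary choices), a random transition matrix is scaled to spectral radius 1; after a large number of steps, keeping only the eigenvalues with absolute value (close to) 1 reproduces the full state:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10
W = rng.standard_normal((n, n))
W /= max(abs(np.linalg.eigvals(W)))              # scale to spectral radius 1
lam, V = np.linalg.eig(W)
c = np.linalg.solve(V, rng.standard_normal(n))   # start vector in the eigenbasis

k = 50_000                                       # "large k"
x_full = (V * lam ** k) @ c                      # all eigen-components (Eq. 2)

keep = abs(abs(lam) - 1) < 1e-3                  # typically at most a conjugate pair
x_top = (V[:, keep] * lam[keep] ** k) @ c[keep]

print("surviving components:", keep.sum())
assert np.allclose(x_full, x_top, atol=1e-5)     # the other components have decayed
```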

Figure 3: Dynamic system behavior in two dimensions (qualitative illustration): In the long run, we get an ellipse trajectory (red), though the original data may look like random (black). Projected to one dimension, we have pure sinusoids with one single angular frequency, sampled in large steps (blue).

The long-term behavior of PrNNs is related to that of ESNs. For the latter, usually the activation function tanh and a spectral radius smaller than 1 are used. Then reservoirs with zero input collapse, because the network state tends to zero for k → ∞, but the convergence may be rather slow. This leads to the so-called echo state property (Manjunath and Jaeger, 2013): Any random initial state of a reservoir is forgotten, such that after a washout period the current network state is a function of the driving input. In contrast to ESNs, PrNNs have linear activation, and usually a spectral radius of exactly 1 is taken. But as we have just shown, there is a similar effect in the long run: The network activity reduces to at most two dimensions which are independent of the initial state of the network.

3.3 Real-Valued Transition Matrix Decomposition

For real-valued transition matrices W, it is possible to define a decomposition that, in contrast to the ordinary Jordan decomposition in Prop. 1, solely makes use of real-valued components, adopting the so-called real Jordan canonical form (Horn and Johnson, 2013, Sect. 3.4.1) of the square matrix W. For this completely real-valued decomposition, the Jordan matrix J is transformed as follows:

  1. A Jordan block with real eigenvalue λ remains as it is.

  2. For complex conjugate eigenvalue pairs a + bi and a − bi, the direct sum of the two corresponding Jordan blocks is replaced by a real Jordan block built from 2 × 2 blocks of the form ((a, b), (−b, a)).

This procedure yields the real Jordan matrix. In consequence, we also have to transform S into a completely real-valued form. For a simple complex conjugate eigenvalue pair, the corresponding two eigenvectors in S could be replaced by the real and imaginary parts of one of them (cf. Sect. 3.2). The subsequent theorem shows a more general way: The matrix S from Prop. 1 is transformed into a real-valued matrix and, what is more, the start vector x(t_0) can be replaced by an arbitrary column vector with all non-zero entries.

Property 5

Let W = S J S^-1 be the (real) Jordan decomposition of the transition matrix and x(t_0) the corresponding start vector. Then for every column vector y of size N with all non-zero entries there exists a square matrix T of size N × N such that for all integers k ≥ 0 we have:

    W^k x(t_0) = T J^k y


We first prove the case where the Jordan matrix J only contains ordinary Jordan blocks as in Prop. 1, i.e., possibly with complex eigenvalues on the diagonal. Since J is a direct sum of Jordan blocks, it suffices to consider the case where J is a single Jordan block, because, like the Jordan matrix J, the matrices A and T (see below) can be obtained as direct sums, too.

Let v = S^-1 x(t_0). From a vector of unknowns a = (a_1, …, a_m), where m is the block size, we construct the upper triangular Toeplitz matrix A whose j-th diagonal is constantly a_(j+1). Such a matrix commutes with the Jordan block J (Horn and Johnson, 2013, Sect. 3.2.4), i.e., it holds that (a) A J = J A. We determine A by the equation (b) A y = v, which is a linear equation system in the unknowns a_1, …, a_m. Since its coefficient matrix is (after reordering) triangular with entries of y on the main diagonal, which are non-zero by precondition, there always exists a solution for a (Horn and Johnson, 2013, Sect. 0.9.3). Now T = S A does the job:

    T J^k y = S A J^k y = S J^k A y = S J^k v = S J^k S^-1 x(t_0) = W^k x(t_0)

The generalization to the real Jordan decomposition is straightforward by applying the fact that for a complex conjugate eigenvalue pair the 2 × 2 block from above in a real Jordan block is similar to the diagonal matrix with the pair on its diagonal (Horn and Johnson, 2013, Sect. 3.4.1). The above-mentioned commutation property (a) analogously holds for real Jordan blocks. This completes the proof.

4 Learning Predictive Neural Networks

Functions can be learned and approximated by PrNNs in two steps: First, as for ESNs (Jaeger and Haas, 2004), we only learn the output weights ; all other connections remain unchanged. Second, if possible, we reduce the network size; this often leads to better generalization and avoids overfitting. Thus, in contrast to many other approaches, the network architecture is changed during the learning process.

4.1 Learning the Output Weights

To learn the output weights W_out, we run the input values of the time series x(1), …, x(n) through the network (in input receiving mode), particularly through the reservoir. For this, we build the sequence of corresponding reservoir states r(1), …, r(n), where the reservoir start vector r(1) (cf. Sect. 3) can be chosen arbitrarily but with all non-zero entries (cf. Prop. 5):

    r(t + 1) = W_in x(t) + W_res r(t)

We want to predict the next input value x(t+1), given the current input and reservoir states x(t) and r(t). To achieve this, we comprise all but the last input and reservoir states in one matrix X whose columns are the stacked vectors of x(t) and r(t) for t = 1, …, n−1.

Each output value corresponds to the respective next input value. For this, we compose another matrix Y with the columns x(2), …, x(n), where the first value clearly has to be omitted, because it cannot be predicted. We compute W_out from X and Y by assuming a linear dependency:

    W_out X = Y    (3)

Its solution can easily be determined as W_out = Y/X, where / denotes the operation of solving a linear equation system, possibly applying the least squares method in case of an overdetermined system, as implemented in many scientific programming languages.
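This first learning phase fits into a few lines of numpy. The sketch below (a 1-D sinusoid as example series, with arbitrary reservoir size, length, and seed) drives a linear reservoir in input receiving mode, stacks input and reservoir states, and solves for the output weights by least squares:

```python
import numpy as np

rng = np.random.default_rng(5)
N, T = 50, 200                        # reservoir size and training length (arbitrary)
ts = np.sin(0.3 * np.arange(T + 1))   # example 1-D time series

W_in = rng.standard_normal((N, 1))    # random input weights
W_res = rng.standard_normal((N, N))   # random reservoir weights ...
W_res /= max(abs(np.linalg.eigvals(W_res)))   # ... scaled to spectral radius 1

# input receiving mode: drive the linear reservoir with the series
r = np.ones(N) / np.sqrt(N)           # non-zero reservoir start vector, unit norm
rows = []
for t in range(T):
    rows.append(np.concatenate(([ts[t]], r)))   # stacked input and reservoir state
    r = W_in @ ts[t:t+1] + W_res @ r
X = np.array(rows)                    # one row per time step
Y = ts[1:T+1]                         # targets: the respective next input values

# least-squares solution of the linear system for the output weights
W_out = np.linalg.lstsq(X, Y, rcond=None)[0]
nrmse = np.sqrt(np.mean((X @ W_out - Y) ** 2)) / np.std(Y)
print("training NRMSE:", nrmse)
```

Note that no backpropagation is involved; the single `lstsq` call is the entire training step of this phase.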

Prediction of further values x(n+1), x(n+2), … is now possible by running the learned network in output generating mode.

This first phase of the learning procedure is related to a linear autoregressive model (Akaike, 1969). However, one important difference to an autoregressive model is that for PrNNs the output does not only depend on its own previous values and possibly white noise but on the complete state of the possibly big reservoir whose dynamics is explicitly dealt with in the reservoir matrix .

4.2 An Approximation Theorem

Property 6

From a function f(t) in d dimensions, let a series of n function values f(1), …, f(n) be given. Then there is a PrNN with the following properties:

  1. It runs exactly through all given function values, i.e., it approximates .

  2. It can effectively be learned by the above-stated solution procedure (Sect. 4.1).

The procedure for learning the output weights (cf. Sect. 4.1) uses the reservoir state sequence as part of the coefficient matrix, which however reduces to at most two dimensions in the long run – independent of the number of reservoir neurons (cf. Sect. 3.2). Therefore the rank of the coefficient matrix is not maximal in general, and in consequence the linear equation system from Eq. 3 often has no solution (although we may have an equation system with the same number of equations and unknowns). A simple increase of the number of reservoir neurons does not help much. Therefore we apply the learning procedure in a specific way, learning not only the output weights as in ESNs (Jaeger and Haas, 2004) but the complete transition matrix W, as follows.


First, we take the series of function values f(1), …, f(n) and identify them with the time series x(1), …, x(n). Let then R be a random reservoir state sequence matrix for N_res reservoir neurons, considered as additional input in this context. If all elements of this matrix are taken independently and identically distributed from the standard normal distribution, its rank is almost always maximal. From the input values and reservoir states we build the matrices X and Y as in Sect. 4.1.

We now have to solve the linear matrix equation W X = Y for the complete transition matrix W. If w_i and y_i denote the row vectors of the matrices W and Y, respectively, then this is equivalent to simultaneously solving the equations w_i X = y_i for i = 1, …, d + N_res. We must ensure that almost always at least one solution exists. This holds if and only if the rank of the coefficient matrix X is equal to the rank of the matrix X augmented by the row y_i, for every i, which is guaranteed if X has full column rank n − 1, i.e., if d + N_res ≥ n − 1. From this, it follows that at least n − 1 − d reservoir neurons have to be employed to guarantee at least one exact solution.
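The proof sketch translates directly into code. The following numpy illustration (an arbitrary sample function, a 1-D series, and n − 1 − d reservoir neurons, as assumed above) learns the complete transition matrix from a random reservoir state sequence and reproduces the given values:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 12
t = np.arange(1, n + 1)
y = np.exp(-0.1 * t) * np.sin(t)           # sample values f(1), ..., f(n) (an example)

N_res = n - 2                              # n - 1 - d reservoir neurons for d = 1
R = rng.standard_normal((N_res, n))        # random reservoir state sequence, full rank a.s.
S = np.vstack([y, R])                      # combined states, one column per time step

X = S[:, :-1]                              # states at times 1, ..., n-1
Y = S[:, 1:]                               # the respective next states
W = np.linalg.lstsq(X.T, Y.T, rcond=None)[0].T   # complete transition matrix

# the learned W maps every given state exactly to its successor ...
assert np.allclose(W @ X, Y)
# ... so iterating it (output generating mode) reproduces all given values
x, out = S[:, 0], [S[0, 0]]
for _ in range(n - 1):
    x = W @ x
    out.append(x[0])
print("reproduction error:", max(abs(np.array(out) - y)))
```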

The just sketched proof of Prop. 6 suggests a way to learn the input and reservoir weights. This topic is also investigated by Palangi et al. (2013) for ESNs with a nonlinear activation function in the reservoir. However, for PrNNs, the given input and reservoir weights W_in and W_res together with the learned output weights W_out already provide the best approximation of the function f. There is no need to learn them, because PrNNs are completely linearly activated RNNs (including the reservoir). If one tries to learn W_in and W_res taking not only the output time series but additionally the reservoir state time series into account, then in principle exactly the given input and reservoir weights are learned; only with nonlinear activation would there be a learning effect. Nonetheless, W_in and W_res can be learned as sketched above by our procedure, if they are not given in advance, starting with a random reservoir state sequence. But our experiments indicate that this procedure is numerically less stable than the one with given input and reservoir weights (as described in Sect. 4.1) and a spectral radius of the reservoir normalized to 1 (cf. Sect. 3.1).

Prop. 6 is related to the universal approximation theorem for feedforward neural networks (Hornik, 1991). It states that a (non-recurrent) network with a linear output layer and at least one hidden layer activated by a nonlinear, sigmoidal function can approximate any continuous function from one finite-dimensional space to another on a closed and bounded subset thereof with any desired non-zero amount of error, provided that the network is given enough hidden neurons (Goodfellow et al., 2016, Sect. 6.4.1). Since RNNs are more general than feedforward networks, the universal approximation theorem also holds for them (see also Maass et al., 2002). Any measurable function can be approximated with a (general) recurrent network arbitrarily well in probability (Hammer, 2000).

Because of the completely linear activation, PrNNs cannot compute a nonlinear function of the (possibly multi-dimensional) input. Nevertheless, they can approximate any (possibly nonlinear) function over time t, as Prop. 6 shows. Another important difference between PrNNs and nonlinearly activated feedforward neural networks is that the former can learn the function efficiently: No iterative method like backpropagation is required; we just have to solve a linear equation system. Thus learning is as easy as learning a single-layer perceptron, which, however, is restricted in expressibility, because it can represent only linearly separable functions.

4.3 Network Size Reduction

To approximate a function exactly for sure, we need a large number of reservoir neurons in Prop. 6. It is certainly a good idea to lower this number. One could do this by simply taking a smaller number of reservoir neurons, but then a good approximation cannot be guaranteed. In what follows, the dimensionality of the transition matrix is reduced in a more controlled way – after learning the output weights. Our procedure of dimensionality reduction leads to smaller networks with sparse connectivity. In contrast to other approaches, we do not learn the new network architecture by incremental derivation from the original network, e.g., by removing unimportant neurons or weights, but in only one step, inspecting the eigenvalues of the transition matrix.

For ESNs, dimensionality reduction has been considered, too, namely by means of so-called conceptors (Jaeger, 2014). These are special matrices which restrict the reservoir dynamics to a linear subspace that is characteristic for a specific pattern. However, as in principal component analysis, conceptors reduce only the spatial dimensionality of the point cloud of the given data. In contrast to this, for PrNNs, we reduce the transition matrix W and hence also take the temporal order of the data points in the time series into account. By applying insights from linear algebra, the actual network size can be reduced, and not only the subspace of computation as with conceptors.

Property 7

With Prop. 5, the function can be rewritten by means of the Jordan matrix of the transition matrix, where the start vector can be chosen as a non-zero constant. Furthermore, by Prop. 1, the function can be expressed as a sum of contributions of the single Jordan components, each weighted by a constant stemming from the start vector. Then it follows from Prop. 4 that, for large times, the contribution of a Jordan component vanishes if the absolute value of its eigenvalue is smaller than 1 and/or its weight from the start vector is zero.
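A toy computation (our own, with a diagonal transition matrix for simplicity instead of a general Jordan matrix) makes this visible: after many steps, only components whose eigenvalues have absolute value 1 survive.

```python
import numpy as np

# Illustration: in x(t) = W^t x(0), the contribution of an eigencomponent
# with |lambda| < 1 dies out geometrically, so it can be dropped from the
# network with only a small long-run error.
W = np.diag([1.0, 0.5, -0.3])      # eigenvalues 1, 0.5, -0.3
x = np.array([1.0, 1.0, 1.0])      # all components present at the start

for _ in range(50):
    x = W @ x                      # 50 steps of the linear dynamics

print(np.round(x, 6))              # only the |lambda| = 1 component survives
```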

We can omit all Jordan components causing only small errors, until a given threshold is exceeded. The error of a network component corresponding to a Jordan block can be estimated by the root mean square error between input and predicted output, normalized to the number of all sample components (called NRMSE henceforth). In practice, we omit the network components with the smallest errors as long as their cumulated sum stays below a given threshold for the desired precision, which is defined as a fraction of the sum of all single errors. From the output weight matrix, the Jordan matrix, and the start vector (according to Prop. 5), we successively derive their reduced counterparts as follows:

  • From the output weight matrix, take the rows corresponding to the input/output components and the columns corresponding to the relevant network components (those with the smallest errors).

  • From the Jordan matrix, take the rows and columns corresponding to the relevant network components.

  • From the start vector, take the rows corresponding to the relevant network components.

Note that the dimensionality reduction does not only lead to a smaller number of reservoir neurons, but also to a rather simple network structure: The transition matrix (which comprises the reservoir weights of the reduced network) is a sparse matrix with non-zero elements only on the main and immediately neighboring diagonals. Thus the number of connections is linear in the number of reservoir neurons, not quadratic as in general. Fig. 4 summarizes the overall learning procedure for PrNNs including network size reduction.
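The three selection steps above amount to plain index slicing. The following sketch is our own illustration (all names and shapes are hypothetical, and a diagonal matrix stands in for the Jordan matrix):

```python
import numpy as np

# Hypothetical shapes: W_out (d x n), J (n x n), x0 (n,), where d is the
# input/output dimension and n the reservoir size; 'keep' indexes the
# relevant network components (those with the smallest errors).
d, n = 2, 6
rng = np.random.default_rng(1)
W_out = rng.normal(size=(d, n))
J = np.diag(rng.uniform(-1, 1, size=n))   # stand-in for the Jordan matrix
x0 = np.ones(n)
keep = np.array([0, 2, 5])                # components chosen by the error criterion

W_out_red = W_out[:, keep]        # rows = outputs, columns = kept components
J_red = J[np.ix_(keep, keep)]     # kept rows *and* columns of the transition matrix
x0_red = x0[keep]                 # kept rows of the start vector

print(W_out_red.shape, J_red.shape, x0_red.shape)  # (2, 3) (3, 3) (3,)
```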


4.4 Complexity and Generalization of the Procedure

  • Sample the given multi-dimensional function as a time series.

  • Randomly initialize the reservoir and input weights.

  • Learn the output weights by linear regression.

  • Assemble the transition matrix and compute its decomposition.

  • Network size reduction: restrict the output weights to the rows of the input/output dimensions; then repeatedly omit the Jordan component that causes the smallest error, together with its columns, and stop as soon as the error between input and predicted output exceeds the given threshold.

Figure 4: Pseudocode for learning PrNNs including network size reduction.
Property 8

In both learning steps, it is possible to employ any of the many available fast and constructive algorithms for linear regression and eigendecomposition. Therefore, the time complexity of both output weight learning and dimensionality reduction is that of these standard matrix operations (cf. Demmel et al., 2007). In theory, if we assume that basic numerical operations can be done in constant time, the asymptotic complexity is even a bit better. In practice, however, the complexity depends on the bit length of the numbers in floating-point arithmetic and may hence be worse. The number of connections of the learned network is linear in the number of reservoir neurons (cf. Sect. 4.3).

Note that feedforward networks with only three threshold neurons are already NP-hard to train (cf. Blum and Rivest, 1992). This results from the fact that the universal approximation theorem for feedforward networks differs from Prop. 6: the former holds for multi-dimensional functions and not only for time-dependent input. In this light, the computational complexity of PrNN learning does not look overly expensive. Furthermore, it is the overall time complexity of the whole learning procedure, because it is not embedded in a time-consuming iterative loop (like backpropagation) as in other state-of-the-art methods.

We observe that most of the results presented in this paper still hold if the transition matrix contains complex numbers. This means in particular that complex functions, too, can be learned (from complex-valued time series) and represented by predictive neural networks (Prop. 6). Nonetheless, the long-term behavior of networks with a random complex transition matrix differs from the one described in Sect. 3.2, because then there usually are no pairs of complex conjugate eigenvalues with absolute value 1.

5 Experiments

In this section, we demonstrate evaluation results for PrNNs on several tasks of learning and predicting time series, approximating them by a function represented by an RNN. We consider the following benchmarks: multiple superimposed oscillators, number puzzles, and robot soccer simulation. All experiments are performed with a program written by the authors in Octave (Eaton et al., 2017) that implements the PrNN learning procedure (cf. Sect. 4). Let us start with an example that illustrates the overall method.

Example 2

The graphs of the two functions, a parabola and a sinusoid, look rather similar near the origin, see Fig. 5. Can both functions be learned and distinguished from each other?

To investigate this, we sample both graphs over the interval shown in Fig. 5. After that, we learn the output weights (cf. Sect. 4.1), starting with a sufficiently large reservoir (cf. Prop. 6). Finally, we reduce the size of the overall transition matrix with a precision threshold (cf. Sect. 4.3). Minimal PrNNs consist of three neurons for the parabola (cf. Ex. 1) and two neurons for the sinusoid (cf. Sect. 3.2). These minimal networks are found in about 65% (parabola) or even 100% (sinusoid) of the trials, see also Fig. 6. Learning the parabola is more difficult because the corresponding transition matrix (cf. Ex. 1) has no proper eigendecomposition according to Prop. 2. The NRMSE remains very small in both cases.
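The minimal parabola network can be written down directly. The following is our own worked example (start vector chosen by hand, not taken from the paper): a single 3x3 Jordan block with eigenvalue 1 generates t^2 with purely linear updates.

```python
import numpy as np

# Three linearly activated neurons generate the parabola t^2:
# a Jordan block with eigenvalue 1 and start vector (0, 1, 2).
W = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
x = np.array([0.0, 1.0, 2.0])

out = []
for t in range(6):
    out.append(x[0])   # read the first component as output
    x = W @ x          # purely linear state update

print(out)  # [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]
```

The third state component stays constant at 2, the second grows linearly, and their accumulation in the first component yields the quadratic.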

Figure 5: Graphs for Ex. 2: a parabola and a sinusoid – but the question is which one is which? Both can be learned and distinguished by PrNNs from the visually similar positive parts of the respective graphs. The parabola is shown in blue and the sinusoid in red.
Figure 6: The error bar diagram shows the number of reservoir neurons (a) before versus (b) after dimensionality reduction for Ex. 2 (median of 100 trials). It demonstrates that, by PrNN learning, the number of reservoir neurons can be reduced to only three (parabola, blue) or two (sinusoid, red), respectively. In both cases, the neural networks have minimal size.

5.1 Multiple Superimposed Oscillators

Multiple superimposed oscillators (MSO) count as difficult benchmark problems for RNNs (cf. Koryakin et al., 2012; Schmidhuber et al., 2007). The corresponding time series is generated by summing up several simple sinusoids with given frequencies. Various publications have investigated the MSO problem with different numbers of sinusoids. We concentrate here solely on the case of eight superimposed oscillators, whose graph is shown in Fig. 7, because, in contrast to other approaches, it is still easy to learn for PrNNs.

Applying the PrNN learning procedure with a precision threshold and taking as many time steps as reservoir neurons, we arrive at minimal PrNNs in most cases: since two neurons are required for each frequency (cf. Sect. 3.2), sixteen reservoir neurons suffice for the eight oscillators. Furthermore, if we start with a large enough reservoir, the NRMSE is rather small (see Fig. 8). Thus PrNNs outperform the previous state-of-the-art for the MSO task with a minimal number of units. Koryakin et al. (2012) report an optimal reservoir size for ESNs, but in contrast to our approach, this number is not further reduced. In general, a PrNN with two neurons per sinusoid suffices to represent a signal consisting of several sinusoids, which might be a musical harmony (cf. Stolzenburg, 2017). It can be learned by the PrNN learning procedure with dimension reduction (see also Neitzel, 2018).
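The "two neurons per frequency" claim can be checked constructively. The sketch below is our own construction: each sinusoid is produced by one 2x2 rotation block, and the MSO signal is the sum of the blocks' sine components (the frequencies 0.2 and 0.311 are the commonly used first two MSO frequencies; we take only two of them to keep the example short).

```python
import numpy as np

def rot(phi):
    """2x2 rotation matrix: one linear oscillator block per frequency."""
    return np.array([[np.cos(phi), -np.sin(phi)],
                     [np.sin(phi),  np.cos(phi)]])

freqs = [0.2, 0.311]
W = np.zeros((4, 4))                      # block-diagonal transition matrix
for i, phi in enumerate(freqs):
    W[2*i:2*i+2, 2*i:2*i+2] = rot(phi)
x = np.array([1.0, 0.0, 1.0, 0.0])        # each block starts at (cos, sin) = (1, 0)

signal = []
for t in range(100):
    signal.append(x[1] + x[3])            # sum of the sine components
    x = W @ x                             # linear state update

target = [np.sin(0.2*t) + np.sin(0.311*t) for t in range(100)]
print(np.allclose(signal, target))        # True
```

With eight frequencies the same construction yields the sixteen-neuron network mentioned above.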

Figure 7: The signal of eight multiple superimposed oscillators does not have a simple periodic structure. PrNN learning leads to minimal networks with only sixteen reservoir neurons, i.e., two for each frequency in the signal.
Figure 8: Experimental results for the MSO example. The diagram shows the NRMSE (median of 100 trials) versus the initial number of reservoir neurons. The NRMSE is rather small if we start with enough reservoir neurons. The shaded area indicates the standard deviation of the error, which also becomes small with a large enough initial size.

Figure 9: Ball trajectory of RoboCup 2D soccer simulation game #2 (Gliders 2016 versus HELIOS 2017). The original trajectory of the ball during play is shown for all time steps (black). The game can be replayed by a PrNN with high accuracy (blue). The reduced network still mimics the trajectory with only a small error (red).

5.2 Solving Number Puzzles

Example 3

Number series tests are a popular task in intelligence tests. The function represented by a number series can be learned by artificial neural networks, in particular RNNs. Glüge and Wendemuth (2013) list 20 number puzzles (cf. Ragni and Klein, 2011), which we use as benchmark here.

We apply the PrNN learning procedure to all 20 examples, taking small reservoirs and performing no dimensionality reduction, because the number series are too short for it. This also leads to more general functions, which seems appropriate because number puzzles are usually presented to humans. The first 7 of the 8 elements of each series are given as input. If the output of the learned network reproduces the given input correctly, then the last (8th) element is predicted (cf. Sect. 4.1).
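In the spirit of this setup, the prediction step can be sketched as fitting a linear recurrence to the given elements and extrapolating one step (the Fibonacci-like series and the order-2 recurrence are our own illustrative choices, not one of the 20 benchmark puzzles):

```python
import numpy as np

# Fit a(t) = c1*a(t-1) + c2*a(t-2) to the first seven elements of a
# series and predict the eighth -- one least-squares solve, no iteration.
series = [1.0, 1.0, 2.0, 3.0, 5.0, 8.0, 13.0]   # first 7 elements given

A = np.array([[series[t-1], series[t-2]] for t in range(2, 7)])
b = np.array(series[2:7])
c, *_ = np.linalg.lstsq(A, b, rcond=None)       # recover the recurrence weights

pred = c @ np.array([series[-1], series[-2]])   # predicted 8th element
print(round(float(pred)))  # 21
```

Here the series satisfies the recurrence exactly, so least squares recovers the weights (1, 1) and the prediction 13 + 8 = 21.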

Tab. 1 lists the percentages of correct predictions of the last element. The most frequently predicted last element (simple majority) is the correct one for 75% of the series. Exceptions are the two series whose definitions recur to earlier elements but not to the immediately preceding one. If we always add the previous value of the time series as a clue to the input, then the correctness of the procedure can be increased significantly (to 95% of the series).

series   (1)      (2)      (3)      (4)
#1       37.7%    33.5%     0.7%    22.5%
#2       28.9%    46.2%    29.2%    18.6%
#3       46.7%    52.8%     1.4%    45.4%
#4       23.7%    30.0%    18.0%    40.3%
#5       57.3%    67.6%    43.1%    66.7%
#6       33.2%    54.1%    17.1%    64.4%
#7       59.4%    58.9%    57.5%    51.5%
#8       20.1%    23.4%     1.7%    32.7%
#9      100.0%   100.0%   100.0%   100.0%
#10      50.8%    68.4%    69.8%    59.3%
#11      45.4%    63.0%     4.1%    68.1%
#12      13.7%    24.2%    11.3%    37.5%
#13      65.2%    56.4%    43.1%    37.6%
#14      43.2%    63.1%     2.7%    58.0%
#15       3.4%     8.5%     2.6%     3.6%
#16      36.3%    47.2%     3.2%    48.5%
#17      21.5%    28.5%     7.6%    23.3%
#18      34.7%    31.5%     0.9%    23.2%
#19      47.3%    69.8%    73.0%    57.3%
#20      39.1%    39.6%     0.1%    24.2%
Table 1: Percentages of correct predictions of the last element for 20 number puzzles (Ragni and Klein, 2011; Glüge and Wendemuth, 2013) in 1,000 trials, for four experimental settings (1)–(4). In the last setting, the previous series value is additionally used as a clue.
game   NRMSE (1)   NRMSE (2)   net size
#1     0.00013     0.66976     427
#2     0.00534     0.85794     389
#3     0.00048     0.81227     384
#4     0.00006     0.66855     408
#5     0.00002     0.65348     424
#6     0.00000     0.98644     327
#7     0.00000     0.75411     370
#8     0.00008     0.70957     385
#9     0.00000     0.67534     328
#10    0.00017     0.86802     364
Table 2: For ten RoboCup simulation games, a PrNN is learned from a large initial reservoir. The table shows the NRMSE (1) before and (2) after dimensionality reduction. The network size can be reduced significantly – by about 24% on average (last column).

5.3 Replaying Soccer Games

RoboCup (Kitano et al., 1997) is an international scientific robot competition in which teams of multiple robots compete against each other. Its different leagues provide many sources of robotics data that can be used for further analysis and application of machine learning. A soccer simulation game lasts 10 minutes and is divided into 6000 time steps, where the length of each cycle is 100 ms. Logfiles contain information about the game, in particular about the current positions of all players and the ball, including velocity and orientation, for each cycle. Michael et al. (2019) describe a research dataset using some of the released binaries of the RoboCup 2D soccer simulation league (Chen et al., 2003) from 2016 and 2017 (see also Michael et al., 2018). In our experiments, we evaluated ten games of the top-five teams, considering only the coordinates of the ball and of the altogether 22 players for all time points during the so-called "play-on" mode (see also Steckhan, 2018).

For PrNN learning, we subsample the time steps of each game and start with a large reservoir. We repeat the learning procedure until the NRMSE is small enough; on average, two attempts already suffice for this. This means that, if we replay the game by the learned PrNN (in output generating mode), the predicted positions deviate on average by less than 1 m from the real ones – over the whole length of the game (cf. Fig. 9). Dimensionality reduction leads to a significant reduction of the network size – about 24% if we concentrate on the relevant components for the ball trajectory (cf. Tab. 2). The complete learning procedure runs in seconds on standard hardware.


6 Conclusions

In this paper, we have introduced PrNNs – a simple and yet powerful type of RNNs in which all neurons are linearly activated. The learning procedure employs only standard matrix operations and is thus quite fast. No backpropagation, gradient descent, or other iterative procedure is required. In contrast to ESNs, no washout period is required in the beginning either: any function can be approximated directly from the first step and with an arbitrary starting vector. The major innovation of PrNNs is network size reduction (cf. Sect. 4.3), which means that not only the network weights but also the network architecture is learned, leading to significantly smaller and sparsely connected networks.

Although any time-dependent function can be approximated with arbitrary precision (cf. Prop. 6), not every function can be implemented by RNNs, in particular functions increasing faster than single-exponentially (cf. Prop. 3), like double-exponential functions or the factorial function. Nevertheless, experiments with reasonably large example and network sizes can be performed successfully within seconds on standard hardware, e.g., with the robot soccer dataset (cf. Sect. 5.3). However, if thousands of reservoir neurons are employed, the procedure may become numerically unstable, at least in our Octave implementation, because the likelihood of almost identical eigenvectors and of eigenvalues with absolute values greater than 1 in the learned transition matrix increases.

A particularly interesting application of our approach reducing the network size is in hardware implementations of neural networks, e.g., for neuromorphic or reservoir computing (Mead, 1990; Indiveri et al., 2011; Liao and Li, 2017). Future work will include improving predictive and memory capacity of PrNNs (cf. Marzen, 2017) taking inspiration from convolutional networks (cf. Goodfellow et al., 2016). Last but not least, other machine learning tasks besides prediction shall be addressed, including classification and reinforcement learning.



We would like to thank Chad Clark, Andrew Francis, Rouven Neitzel, Oliver Otto, Kai Steckhan, Flora Stolzenburg, and Ruben Zilibowitz for helpful discussions and comments. The research reported in this paper has been supported by the German Academic Exchange Service (DAAD) by funds of the German Federal Ministry of Education and Research (BMBF) in the Programmes for Project-Related Personal Exchange (PPP) under grant no. 57319564 and Universities Australia (UA) in the Australia-Germany Joint Research Cooperation Scheme within the project Deep Conceptors for Temporal Data Mining (Decorating).




  • Akaike (1969) Akaike H (1969) Fitting autoregressive models for prediction. Annals of the Institute of Statistical Mathematics 21(1):243–247
  • Bengio et al. (1994) Bengio Y, Simard P, Frasconi P (1994) Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5(2):157–166
  • Blum and Rivest (1992) Blum AL, Rivest RL (1992) Training a 3-node neural network is NP-complete. Neural Networks 5(1):117–127
  • Chen et al. (2003) Chen M, Dorer K, Foroughi E, Heintz F, Huang Z, Kapetanakis S, Kostiadis K, Kummeneje J, Murray J, Noda I, Obst O, Riley P, Steffens T, Wang Y, Yin X (2003) Users Manual: RoboCup Soccer Server – for Soccer Server Version 7.07 and Later. The RoboCup Federation
  • Colonius and Kliemann (2014) Colonius F, Kliemann W (2014) Dynamical Systems and Linear Algebra, Graduate Studies in Mathematics, vol 158. American Mathematical Society, Providence, Rhode Island
  • Demmel et al. (2007) Demmel J, Dumitriu I, Holtz O (2007) Fast linear algebra is stable. Numerische Mathematik 108(1):59–91
  • Deng and Yu (2014) Deng L, Yu D (2014) Deep learning: Methods and applications. Foundations and Trends in Signal Processing 7(3-4):198–387
  • Eaton et al. (2017) Eaton JW, Bateman D, Hauberg S, Wehbring R (2017) GNU Octave – A High-Level Interactive Language for Numerical Computations. Edition 4 for Octave version 4.2.1
  • Elman (1990) Elman JL (1990) Finding structure in time. Cognitive Science 14:179–211
  • Glüge and Wendemuth (2013) Glüge S, Wendemuth A (2013) Solving number series with simple recurrent networks. In: Ferrández de Vicente JM, Álvarez Sánchez JR, de la Paz López F, Toledo-Moreo FJ (eds) Natural and Artificial Models in Computation and Biology – 5th International Work-Conference on the Interplay Between Natural and Artificial Computation, IWINAC 2013. Proceedings, Part I, Springer, LNCS 7930, pp 412–420
  • Goodfellow et al. (2016) Goodfellow I, Bengio Y, Courville A (2016) Deep Learning. Adaptive Computation and Machine Learning, MIT Press, Cambridge, MA, London
  • Hammer (2000) Hammer B (2000) On the approximation capability of recurrent neural networks. Neurocomputing 31(1):107–123
  • Hochreiter and Schmidhuber (1997) Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Computation 9(8):1735–1780
  • Horn and Johnson (2013) Horn RA, Johnson CR (2013) Matrix Analysis, 2nd edn. Cambridge University Press, New York, NY
  • Hornik (1991) Hornik K (1991) Approximation capabilities of multilayer feedforward networks. Neural Networks 4(2):251–257
  • Hu and Qi (2017) Hu H, Qi GJ (2017) State-frequency memory recurrent neural networks. In: Precup D, Teh YW (eds) Proceedings of the 34th International Conference on Machine Learning, PMLR, Sydney, Australia, Proceedings of Machine Learning Research, vol 70, pp 1568–1577
  • Indiveri et al. (2011) Indiveri G, Linares-Barranco B, Hamilton T, van Schaik A, Etienne-Cummings R, Delbruck T, Liu SC, Dudek P, Häfliger P, Renaud S, Schemmel J, Cauwenberghs G, Arthur J, Hynna K, Folowosele F, Saïghi S, Serrano-Gotarredona T, Wijekoon J, Wang Y, Boahen K (2011) Neuromorphic silicon neuron circuits. Frontiers in Neuroscience 5:73
  • Jaeger (2014) Jaeger H (2014) Controlling recurrent neural networks by conceptors. CoRR – Computing Research Repository abs/1403.3369, Cornell University Library
  • Jaeger and Haas (2004) Jaeger H, Haas H (2004) Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science 2(304):78–80
  • Kitano et al. (1997) Kitano H, Asada M, Kuniyoshi Y, Noda I, Osawa E, Matsubara H (1997) RoboCup: A challenge problem for AI. AI Magazine 18(1):73–85
  • Koryakin et al. (2012) Koryakin D, Lohmann J, Butz MV (2012) Balanced echo state networks. Neural Networks 36:35–45
  • Liao and Li (2017) Liao Y, Li H (2017) Reservoir computing trend on software and hardware implementation. Global Journal of Researches in Engineering (F) 17(5)
  • Lipton et al. (2015) Lipton ZC, Berkowitz J, Elkan C (2015) A critical review of recurrent neural networks for sequence learning. CoRR – Computing Research Repository abs/1506.00019, Cornell University Library
  • Maass et al. (2002) Maass W, Natschläger T, Markram H (2002) Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation 14(11):2531–2560
  • Manjunath and Jaeger (2013) Manjunath G, Jaeger H (2013) Echo state property linked to an input: Exploring a fundamental characteristic of recurrent neural networks. Neural Computation 25(3):671–696, PMID: 23272918
  • Martens and Sutskever (2011) Martens J, Sutskever I (2011) Learning recurrent neural networks with Hessian-free optimization. In: Proceedings of the 28th International Conference on Machine Learning, pp 1033–1040
  • Marzen (2017) Marzen S (2017) Difference between memory and prediction in linear recurrent networks. Physical Review E 96(3):032308 [1–7]
  • Mead (1990) Mead C (1990) Neuromorphic electronic systems. Proceedings of the IEEE 78(10):1629–1636
  • Michael et al. (2018) Michael O, Obst O, Schmidsberger F, Stolzenburg F (2018) Analysing soccer games with clustering and conceptors. In: Akyama H, Obst O, Sammut C, Tonidandel F (eds) RoboCup 2017: Robot Soccer World Cup XXI. RoboCup International Symposium, Springer Nature Switzerland, Nagoya, Japan, LNAI 11175, pp 120–131
  • Michael et al. (2019) Michael O, Obst O, Schmidsberger F, Stolzenburg F (2019) RoboCupSimData: Software and data for machine learning from RoboCup simulation league. In: Holz D, Genter K, Saad M, von Stryk O (eds) RoboCup 2018: Robot Soccer World Cup XXII. RoboCup International Symposium, Springer, Montréal, Canada, to appear
  • Neitzel (2018) Neitzel R (2018) Prediction of sinusoidal signals by recurrent neural networks. Project thesis, Automation and Computer Sciences Department, Harz University of Applied Sciences, in German
  • Ollivier et al. (2015) Ollivier Y, Tallec C, Charpiat G (2015) Training recurrent networks online without backtracking. CoRR – Computing Research Repository abs/1507.07680, Cornell University Library
  • Palangi et al. (2013) Palangi H, Deng L, Ward RK (2013) Learning input and recurrent weight matrices in echo state networks. CoRR – Computing Research Repository abs/1311.2987, Cornell University Library
  • Pascanu et al. (2013) Pascanu R, Mikolov T, Bengio Y (2013) On the difficulty of training recurrent neural networks. Proceedings of the 30th International Conference on Machine Learning 28(3):1310–1318
  • Pong et al. (2017) Pong V, Gu S, Levine S (2017) Learning long-term dependencies with deep memory states. In: Lifelong Learning: A Reinforcement Learning Approach Workshop, International Conference on Machine Learning
  • Ragni and Klein (2011) Ragni M, Klein A (2011) Predicting numbers: An AI approach to solving number series. In: Bach J, Edelkamp S (eds) KI 2011: Advances in Artificial Intelligence – Proceedings of the 34th Annual German Conference on Artificial Intelligence, Springer, Berlin, LNAI 7006, pp 255–259
  • Schmidhuber et al. (2007) Schmidhuber J, Wierstra D, Gagliolo M, Gomez F (2007) Training recurrent networks by evolino. Neural Computation 19:757–779
  • Steckhan (2018) Steckhan K (2018) Time-series analysis with recurrent neural networks. Project thesis, Automation and Computer Sciences Department, Harz University of Applied Sciences, in German
  • Stolzenburg (2017) Stolzenburg F (2017) Periodicity detection by neural transformation. In: Van Dyck E (ed) ESCOM 2017 – 25th Anniversary Conference of the European Society for the Cognitive Sciences of Music, IPEM, Ghent University, Ghent, Belgium, pp 159–162
  • Sutton and Barto (1998) Sutton RS, Barto AG (1998) Reinforcement Learning: An Introduction. MIT Press
  • Tiňo (2018) Tiňo P (2018) Asymptotic Fisher memory of randomized linear symmetric echo state networks. Neurocomputing 298:4–8
  • White et al. (1994) White OL, Lee DD, Sompolinsky H (1994) Short-term memory in orthogonal neural networks. Physical Review Letters 92(14):148102
  • Xue et al. (2007) Xue Y, Yang L, Haykin S (2007) Decoupled echo state networks with lateral inhibition. Neural Networks 20(3):365–376