I Introduction
The reservoir computing (RC) paradigm, introduced independently by Jaeger Jaeger2001 and Maass Maass2001, has attracted considerable interest in supervised machine learning with dynamical systems. The RC scheme consists of three parts: the input layer, the reservoir, and the output layer. The reservoir can be any dynamical system, e.g., an artificial neural network, but also a laser with self-feedback. The output layer is trained by a linear weighting of all accessible reservoir states, while the reservoir parameters are kept fixed. This simplification overcomes the main issues of the time-expensive training of recurrent neural networks, such as exploding gradients and high power consumption Werbos1990.
Appeltant et al. Appeltant2011 successfully implemented the RC scheme with a single nonlinear node subject to delayed self-feedback. In the input layer, time-multiplexing is used to create temporally separated virtual nodes. The reservoir dynamics are thereby given by a delay differential equation, which is known to exhibit rich high-dimensional dynamics Hale1993; Erneux2009; Erneux2017; Yanchuk2017. For the training, the temporally separated nodes are read out and weighted to solve a given task. The introduction of time-delay reservoir computing enabled simple optical and optoelectronic hardware implementations, which led to improved computation time scales for supervised learning Brunner2018; Larger2017. Delay-based reservoirs have been successfully applied to a wide range of tasks, such as chaotic time-series forecasting and speech recognition.
The success of single-node delay-based reservoir computing has triggered interest in more complex network architectures, such as coupled Stuart-Landau oscillators arranged in a ring topology Rohm2018, single nonlinear nodes with multiple self-feedback loops of various lengths Chen2019, parallel usage of multiple nodes Sugano2020, and multiple nonlinear nodes coupled in a chain topology Penkovsky2019a. Furthermore, Gallicchio et al. Gallicchio2018; Gallicchio2019; Gallicchio2017c recently showed that echo state networks with multiple unidirectionally coupled layers, called deepESN, provide a performance boost in comparison to their shallow counterparts. Penkovsky et al. Penkovsky2019a found that unidirectionally coupled delay systems are superior to bidirectionally coupled ones for certain symmetric parameter choices. In the following, we present a general deep time-delay reservoir computing model with asymmetric layers and discuss its dynamical and computational properties.
The paper is structured as follows: In Sec. II we present the deep time-delay reservoir computer and show its performance at predicting the chaotic Mackey-Glass attractor. Afterward, in Sec. III, we study the dynamical properties of an autonomous two-layer system. The conditional Lyapunov exponent for the nonautonomous system is introduced in Sec. IV and is then related to the linear and nonlinear memory capacity in Sec. V. In Sec. VI the resonances between the clock cycle and the delay times are presented for two- and three-layer systems.
II Deep Time-Delay Reservoir Computing
II.1 Model
A deep time-delay reservoir computer (deepTRC) consists of nonlinear nodes with states . All nodes feature a self-feedback with a delay length .
The nodes are coupled unidirectionally, where the first node is the only one fed by the task-specific input sequence . The coupling between the nodes is instantaneous, i.e., without delay. Because of the unidirectional topology and their high-dimensional dynamics, the nodes with their corresponding feedback loops are referred to as layers in the following. The dynamical evolution of a layer is given by:
(1) 
with , being a nonlinear delay differential equation, and the layer-dependent input:
(2) 
Time-multiplexing is used to transform a layer into a high-dimensional network. The discrete input is given by the sequence , . This sequence is transformed into the time-continuous input as follows
(3) 
where is the clock cycle and the scaling , determines a mask, which is applied periodically with period . Such a preprocessing method generates virtual nodes that are temporally separated by ; see Appeltant2011; Rohm2018 for more details. The deepTRC thus contains layers with virtual nodes each, resulting in a total reservoir size of . The virtual nodes within the layers correspond to the values .
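The masking procedure can be sketched as follows. The node separation of 1 and the 25 virtual nodes per layer match the simulation parameters used later (Table 3), but the binary mask values themselves are an assumption for illustration, since the paper does not specify them here:

```python
import numpy as np

# Sketch of the time-multiplexing preprocessing, assuming a random binary
# mask (a common choice in the TRC literature); mask values are illustrative.
rng = np.random.default_rng(0)

N_v = 25          # virtual nodes per layer
theta = 1.0       # temporal separation of virtual nodes
T = N_v * theta   # clock cycle

mask = rng.choice([-1.0, 1.0], size=N_v)  # mask repeated with period T

def J(t, u):
    """Piecewise-constant input: mask value times the held input sample.

    u is the discrete input sequence u_0, u_1, ...; during the k-th clock
    cycle the sample u_k is held fixed while the mask steps through the
    virtual nodes.
    """
    k = int(t // T)                 # index of the current clock cycle
    j = int((t % T) // theta)       # index of the current virtual node
    return mask[j] * u[k]

u = rng.uniform(0.0, 1.0, size=10)
# Within one clock cycle only the mask varies, not the input sample:
assert J(0.3, u) == mask[0] * u[0]
assert J(1.2, u) == mask[1] * u[0]
```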
For the training of the deepTRC, all virtual nodes of each layer are read out. The virtual nodes are combined into the global state given by
(4) 
In order to train for a given task , the global state is weighted
(5) 
where is a constant bias and the weights are determined via linear regression with an optional Tikhonov regularization.
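The readout training described above can be sketched as a ridge (Tikhonov-regularized) regression of the targets on the global state; function and variable names here are illustrative, not from the paper:

```python
import numpy as np

def train_readout(S, y, reg=1e-8):
    """Ridge regression for the output weights and bias.

    S   : (K, N) matrix of global reservoir states, one row per clock cycle
    y   : (K,) target sequence
    reg : Tikhonov regularization parameter
    Returns (w, b): weight vector and constant bias.
    """
    K = S.shape[0]
    X = np.hstack([S, np.ones((K, 1))])          # append bias column
    A = X.T @ X + reg * np.eye(X.shape[1])       # regularized normal equations
    w_full = np.linalg.solve(A, X.T @ y)
    return w_full[:-1], w_full[-1]

# Usage: recover a known affine map from random states.
rng = np.random.default_rng(1)
S = rng.normal(size=(200, 10))
w_true = rng.normal(size=10)
y = S @ w_true + 0.5
w, b = train_readout(S, y)
assert np.allclose(w, w_true, atol=1e-3)
assert abs(b - 0.5) < 1e-3
```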
In the following, we focus on the analysis of the recently introduced optoelectronic reservoir Penkovsky2019a, which is governed by the equations:
(6)  
where is the input gain and are the coupling strengths between the consecutive layers and . Further, is a damping constant satisfying , is the feedback gain, and is a scalar phase shift of the nonlinearity. Because of its nonlinearity, the system is referred to as an Ikeda time-delay system Penkovsky2019a; Gao2019; Ikeda1979.
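As an illustration of the layer dynamics, the following sketch integrates a single autonomous layer with Heun's method (the scheme the paper mentions for the bifurcation diagrams in Fig. 3). The concrete sin²-type right-hand side and all parameter values below are assumptions standing in for the system (6), not the paper's exact equations:

```python
import numpy as np

# Illustrative integration of one Ikeda-type layer, assumed form:
#     eps * x'(t) = -x(t) + eta * sin(x(t - tau) + phi)**2
# eps, eta, tau, phi are placeholder values, not the optimized ones.
eps, eta, tau, phi = 1.0, 0.8, 30.0, 0.2
dt = 0.05
n_tau = int(round(tau / dt))          # delay expressed in integration steps

def rhs(x_now, x_del):
    return (-x_now + eta * np.sin(x_del + phi) ** 2) / eps

steps = 20000
x = np.zeros(steps + n_tau)           # buffer holding the zero initial history
for i in range(n_tau, steps + n_tau - 1):
    # Heun's method (explicit trapezoidal rule) for the delay equation
    k1 = rhs(x[i], x[i - n_tau])
    x_pred = x[i] + dt * k1
    k2 = rhs(x_pred, x[i + 1 - n_tau])
    x[i + 1] = x[i] + 0.5 * dt * (k1 + k2)

# For this moderate feedback gain the trajectory stays bounded
assert np.all(np.isfinite(x))
assert np.max(np.abs(x[-1000:])) < 10.0
```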
II.2 Chaotic Time-Series Prediction
In Fig. 2, we show the performance of the Ikeda-based deepTRC at predicting the chaotic Mackey-Glass attractor, i.e., , where is the time series of the Mackey-Glass system and determines how far into the future it shall be predicted. The chosen value corresponds to twice the delay time of the Mackey-Glass system. The parameters of the Mackey-Glass system are as in Jaeger et al. Jaeger2004a. We simulated deepTRC with up to layers and varied the number of nodes per layer. Hereby, the separation of the nodes is set to and is kept fixed for all following simulations. A Bayesian optimization approach Snoek2012 was used to optimize the feedback gains , the delays , the coupling gains , and the phase offsets . The Bayesian optimization of the deepTRC becomes much harder for deeper systems because of the increased number of hyperparameters. In general, additional layers improve the performance of the deepTRC as compared to a single-layer TRC. As a remark, the deepTRC enables faster computation at a constant separation of nodes , i.e., a deepTRC with layers is times faster than a single-layer TRC with the same total number of nodes, in line with a 5 times shorter clock cycle.

Table 1: Optimized layer parameters.
Layer   | delay | damping | phase | feedback | coupling
Layer 1 | 230   | 0       | 0.2   | 0.68     | 4.0
Layer 2 | 457   | 0.01    | 0.2   | 0.8      | 1
Layer 3 | 199   | 0.01    | 1.5   | 0.97     | 0.01
Layer 4 | 27    | 0.01    | 1.28  | 0.83     | 0.13
Layer 5 | 40    | 0.01    | 1.9   | 0.2      | 0.01
III Dynamics of the Autonomous deepTRC
The dynamics of a delay-based RC play an essential role for its performance Appeltant2011; Dambre2012. In this section we consider an autonomous deepTRC by setting the input to zero.
The equilibria of Eq. (6) are given as solutions of the following nonlinear system of equations
(7) 
Without further restriction we set and . System (6) can be linearised around this equilibrium, which leads to:
(8) 
where is the linearisation of layer and is the linearisation shifted by the delay of the layer.
The block matrix is given by the submatrices
(9) 
Owing to the unidirectional topology of the Ikeda deepTRC, the matrix becomes a lower triangular block matrix. Further, the block matrix can be calculated as:
(10) 
Therefore, becomes block diagonal. The submatrices of the autonomous Ikeda deepTRC (6) are given by:
(11) 
with and
. The eigenvalues of the linearised system can therefore be calculated by solving the characteristic equation:
(12) 
with . This equation can be further simplified by using the fact that the determinant of a lower triangular block matrix is given by the product of the determinants of its diagonal blocks:
(13) 
The characteristic equation of an L-layer Ikeda deepTRC (6) is therefore given by:
(14) 
For the details of the derivation we refer to Appendix B. One can see that the eigenvalues of the linearised autonomous deepTRC are given by the combined set of the eigenvalues for the single layers.
In the following we study the dynamics of the two-layer deepTRC depending on the feedback gains and . The remaining parameters are kept fixed to the values shown in Table 2.
Table 2: Parameters of the two-layer deepTRC.
Description   | Parameter | Layer 1 | Layer 2
delay time    |           | 30      | 30
phase offset  |           | 0.2     | 0.2
coupling      |           | 0       | 1
damping       |           | 0       | 0.01
initial value |           |         |
In Fig. 3 (a) the equilibrium of the first layer is shown as a function of the feedback gain . The layer exhibits two saddle-node bifurcations at and , respectively; between these two points the first layer possesses two coexisting stable equilibria. Furthermore, it exhibits periodic solutions after the supercritical Hopf bifurcation at , and a period-doubling cascade resulting in chaotic dynamics at . The numerical part of the bifurcation diagram, i.e., the chaotic solutions in Fig. 3 (a), is computed using Heun's method.
In Fig. 3 (b) the bifurcation diagram of the second layer is shown for a constant feedback gain of the first layer . The second layer exhibits a subcritical Hopf bifurcation at . The unstable limit cycle created there becomes stable in a saddle-node bifurcation of limit cycles at . Accordingly, a limit cycle and the stable equilibrium coexist in the range .
While the first layer can be considered separately, the second layer is driven by the first one. As a result, one has to study the whole system (22) to analyze the dynamics of the second layer. In Fig. 3 (c) the bifurcation diagram of the full two-layer system is obtained using DDE-Biftool Sieber2016. The bifurcations of the first layer occur as vertical lines (indicating saddle-node, Hopf, and period-doubling bifurcations in blue, red, and magenta), while the Hopf bifurcation of the second layer appears as a loop in the plane of the two feedback gains (green and orange).
In the following sections, the obtained bifurcation diagrams will be compared with the other characteristics of the reservoir that describe its memory capacity. In particular, the next section investigates conditional Lyapunov exponents and shows how they restrict the parameter set where the RC can properly function.
IV Conditional Lyapunov Exponent of Deep Time-Delay Reservoirs
To use the two-layer deep Ikeda reservoir, we enable the input into the first layer by setting and solve the delay differential equation system (6) with initial history functions . The fading-memory concept Dambre2012 states that the reservoir needs to become independent of these history functions after a certain time. Therefore, two identical reservoirs with different initial conditions need to approach each other asymptotically. From a dynamical perspective, the reservoir has to show generalized synchronization with its input.
In the following, we check for generalized synchronization of two unidirectionally coupled systems by estimating the maximal conditional Lyapunov exponent of the driven system. This is done by the auxiliary system method, where we initialize two identical systems with different initial conditions and drive both with the same input sequence. The conditional Lyapunov exponent then measures the convergence or divergence rate. If its maximal value is below zero, the state sequences approach each other asymptotically, and the system shows generalized synchronization with the input system. If the exponent is positive, the systems diverge.
In order to calculate the conditional Lyapunov exponent, we consider the distance between the solutions of two identical systems with different initial conditions and :
(15) 
where
(16)  
As a remark, gives the state of layer , whereas is a function of the layer's state defined over the interval given in (16). The evolution of the distance is now given by a delay differential equation. For small perturbations, we linearise equation (15):
(17) 
where the linear part of was collected into a constant matrix , while the nonlinearity and the time-varying input are included in . According to Hale1977, the solution of (17) can be estimated as
(18) 
where is the conditional Lyapunov exponent of the nonautonomous reservoir.
For the numerical estimation of the conditional Lyapunov exponent, two identical nonautonomous systems with different initial conditions of the first layer were evaluated. The input sequence was drawn from the uniform distribution . The distances of the state sequences were calculated using the maximum norm over each delay interval :
(19)
An exponential function was fitted according to
(20)
where determines the numerically approximated conditional Lyapunov exponent of layer .
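The auxiliary-system estimate and the exponential fit of Eq. (20) can be sketched with a toy contracting driven map standing in for the reservoir layer; the map, its contraction factor, and all sizes are illustrative assumptions, not the Ikeda system:

```python
import numpy as np

# Auxiliary-system method: two replicas with different initial conditions,
# driven by the same input; the log-distance slope estimates lambda_c.
rng = np.random.default_rng(2)

a = 0.6                                   # contraction factor of the toy layer
def step(x, u):
    return a * np.tanh(x) + u             # driven, contracting toy dynamics

u_seq = rng.uniform(0.0, 1.0, size=200)   # common input for both replicas
x1, x2 = 1.0, -1.0                        # different initial conditions
dist = []
for u in u_seq:
    x1, x2 = step(x1, u), step(x2, u)
    dist.append(abs(x1 - x2))

dist = np.array(dist)
n = np.argmax(dist < 1e-14) or len(dist)  # cut off before numerical underflow
# Linear fit of the log-distance, cf. Eq. (20): log d_k ~ lambda_c * k + const
lam = np.polyfit(np.arange(n), np.log(dist[:n]), 1)[0]

# Negative conditional Lyapunov exponent -> generalized synchronization
assert lam < 0.0
```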
In Fig. 4 the numerical estimation of the conditional Lyapunov exponent is presented for a two-layer deepTRC. The parameters are as in Table 2, except the input gain, which was set to . (The lower input gain was chosen to create low-degree MC, because higher degrees are computationally expensive.) The feedback gains were scanned systematically for both layers.
In Fig. 4 (a) the conditional Lyapunov exponent of the first layer is shown, which depends only on . The conditional Lyapunov exponent first increases with increasing and drops shortly after the second saddle-node bifurcation SN at , shown in Fig. 3 (a). In the region between the two saddle-node bifurcations SN and SN, owing to the bistability of the autonomous system, two Lyapunov exponents were computed, corresponding to the different initial conditions. The dip at , shortly after the annihilation of the stable equilibrium with the unstable one, marks the most negative Lyapunov exponent. After the Hopf bifurcation H of the first layer at , the conditional Lyapunov exponent becomes positive, and no generalized synchronization is possible. In other words, the system violates the fading-memory condition for .
The conditional Lyapunov exponent computed for the second layer reflects the dynamics of both layers. We observe from Fig. 4 that has, in general, a smaller magnitude than . As a result, a perturbation of the first layer's initial condition stays longer in the system when a second layer is added. For larger values of , the borders of the region where the conditional Lyapunov exponent is negative are determined by the Hopf bifurcations. The strongly negative conditional Lyapunov exponent in the range is due to a so-called exceptional point Sweeney2019, where two negative real eigenvalues coalesce. In the parameter range the second layer loses generalized synchronization before reaching the subcritical Hopf bifurcation. We assume that, due to the ongoing drive of the system, the second layer is pushed into the basin of attraction of the coexisting limit cycle shown in Fig. 3 (b). Accordingly, the periodic oscillations shown in Fig. 4 (c) (red line) occur, which lead to a loss of generalized synchronization before the subcritical Hopf bifurcation H is reached.
V Conditional Lyapunov Exponents versus Memory Capacity
In this section, we systematically investigate the relation between the conditional Lyapunov exponent and the MC for the two-layer deepTRC. The MC measures how the reservoir memorizes and transforms previous inputs Dambre2012; Koester2020; Harkhoe2019; for more details about the MC and its definition, we refer to Appendix C.
Table 3: Simulation parameters.
Description             | Parameter | Value
separation of nodes     |           | 1
clock cycle             |           | 25
virtual nodes per layer |           | 25
initial steps           |           |
training steps          |           |
MC threshold ()         |           |
MC threshold ()         |           |
In Fig. 5, we show the linear, quadratic, cubic, and total MC for the parameter values as in Table 3. The maximal total memory capacity is reached in a wide parameter range, as shown in Fig. 5 (d), except for the regions with periodic solutions or a strongly negative conditional Lyapunov exponent. In all four panels, a clear drop of the MC is visible shortly after the supercritical Hopf bifurcation in the first layer, due to the violation of fading memory, i.e., in layer 1 at . The drop of the MC also occurs before the subcritical Hopf bifurcation in layer two is reached, in agreement with the loss of generalized synchronization shown in Fig. 4. More specifically, we observe the following features:

A large memory capacity, which is necessary for a reservoir computer to perform its tasks, is observed in the regions where the conditional Lyapunov exponent is negative and the fading-memory condition is satisfied.

The highest linear MC is achieved close to the bifurcations, where the conditional Lyapunov exponent is negative and small in absolute value. In such a case, the linear information of the input stays longer in the system.

With decreasing conditional Lyapunov exponent, the linear MC decreases and the quadratic MC starts dominating. With a further decrease of the conditional Lyapunov exponent, the third-order MC becomes dominant. We remark that there is always a trade-off between the MCs of different degrees, since the total MC bounds their sum.
In conclusion, different dynamical regimes of a deepTRC can boost different degrees of the MC.
VI Resonances between Delays and Clock Cycle
As recently shown analytically and numerically by Stelzer et al. Stelzer2019, resonances between the delay time and the clock cycle lead to a degradation of the linear memory capacity due to a parallel alignment of the eigenvectors of the underlying network. This effect was later shown for all degrees of the memory capacity by Harkhoe et al. Harkhoe2019 and Köster et al. Koester2020 independently. The loss of total memory capacity at resonances between delay time and clock cycle further results in lower performance at certain tasks. In the following, we analyze this effect for two- and three-layer deepTRC by computing the individual degrees of the memory capacity up to the cubic degree while scanning the delays , and , respectively. The system parameters are as in Table 2 and the simulation parameters are shown in Table 3.
VI.1 Two-Layer deepTRC
In Figs. 6 (a)-(d) the numerically computed memory capacities are shown as functions of the delays of a two-layer deepTRC. Resonances between the delays , , and the clock cycle are present in all shown degrees of the MC. Further, resonances between the two delays appear as diagonals in the plot, with the main diagonal being dominant. In contrast to the off-diagonal resonances, the main-diagonal resonance broadens with larger delays. The total memory capacity exhibits weak degradations at the diagonal delay-delay and the clock-cycle-delay resonances. The comparison between the linear and nonlinear MC reveals the trade-off between both, where the linear MC becomes dominant at and .
In contrast to the reported linear MC degradations at the delay-clock-cycle resonances for single-layer TRC, a new effect is visible in Fig. 6 (a), where the resonance crosses the main diagonal. We observe that for fixed , when increases, the linear MC is degraded for , while it is boosted for .
In the following, we present an explanation of the boosted linear MC of a two-layer deepTRC for . For this, we compute the linear recallability of inputs steps into the past and separate between the two layers Jaeger2004a. Hereby, means the reservoir can fully recover the input from clock cycles ago. As shown in Fig. 7, increasing for a single-layer TRC leads to an improved recallability of inputs farther in the past. For comparison, at , the single-layer TRC is able to recall inputs up to . For , the linear MC splits up into small intervals of high MC alternating with intervals of almost no MC, visible as blue rays in Fig. 7. The length of the intervals of no MC further increases for longer delay , where the frequency can be estimated as .
The addition of a second layer with a delay of , while the delay of the first layer is , increases the length of the recallability up to . Furthermore, for the second layer can fill up the intervals of low MC. In Fig. 7 (a) the recallability of both layers is presented in different colours. Here, the areas of high recallability in layer 1 (blue) are followed by areas of high recallability in layer 2 (red), which results in an overall higher linear MC. Again, resonance effects occur at , resulting in degradations.
VI.2 Three-Layer deepTRC
In Figs. 6 (e)-(h), we show the MC of a three-layer deepTRC with a reservoir size of . The delay of the first layer is fixed at , and the delay plane is scanned. The linear and quadratic MC exhibit strong resonances at , , which increase the linear MC. The resonances with multiples of the clock cycle occur as well but are less prominent with . The resonances between and appear as the main diagonal in the plots. In comparison to the two-layer case, the regions of high linear and cubic MC are not separated by the main diagonal; here they are divided by the resonances with the first layer. The cubic MC becomes maximal when the delays of all three layers are equal. The total MC is almost equal for all and , showing a small general loss with weak degradations at resonances with the clock cycle for .
VI.3 Memory Capacity Distribution of deepTRC
In this section, we illustrate the role of multiple layers of the deepTRC and show how to systematically create specific memory in a deepTRC, using the knowledge obtained in the previous sections. In particular, we present two configurations: the first one uses the deep layers to produce a large higher-order MC; the second one produces an increase of the linear MC with a growing number of layers.
We start with a single-layer deepTRC. Figure 8 (a) shows the MC distribution for an increasing number of virtual nodes . The increase of the node number leads mainly to an increase of the quadratic MC and, starting from , a slow increase of the cubic MC. As a remark, increasing the number of virtual nodes of a single-layer TRC changes the clock cycle, and we therefore adjusted the delay to .
In Fig. 8 (b), we keep fixed and add layers with the same number of nodes to the deepTRC. The delay of all layers is kept fixed at and the clock cycle is . We observe that additional layers lead to an increase of the nonlinear MC of higher orders.
The third configuration is similar to (b), but now the delays are varied in order to boost only the linear MC. Here, the augmentation method described in Sec. VI is extended to layers with the delays . The last layer has a delay of because a single layer shows its highest linear MC before splitting into rays at this value. As shown in Fig. 8 (c), this method produces a large linear MC while suppressing higher-order MC.
Combinations of the presented delay configurations can be a good method for adjusting a deepTRC's memory to specific tasks. The deepTRC configurations in Fig. 8 show small general losses of total MC with an increasing number of layers, which is caused by linear dependence of different nodes.
VII Conclusion
We analyzed a deep time-delay Ikeda system and presented a relation between the dynamics of the autonomous deepTRC and the numerically computed conditional Lyapunov exponent. We showed a correlation between the MC and the conditional Lyapunov exponents. A high linear MC is observed for small negative conditional Lyapunov exponents. With decreasing Lyapunov exponent, different degrees of the MC are sequentially activated. Further, we investigated the clock-cycle-delay resonances in different layers as well as delay-delay resonances. The degradation of the total MC at clock-cycle-delay resonances was shown numerically. We explained the boost of the linear MC at specific delay-clock-cycle resonances by an augmentation effect. Additionally, we used the gained insight to present two general delay configurations, one with an increasing nonlinear MC and one with a boosted linear MC. These configurations provide a variability superior to a single-layer TRC and a potential for building general deepTRCs, which can be oriented at different tasks due to a broad MC spectrum. We could verify that the deepTRC concept is a promising architecture for fast and versatile reservoir computing.
Acknowledgements
The authors would like to thank Florian Stelzer for fruitful discussions. M.G. acknowledges financial support provided by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) through the IRTG 1740. K.L. and F.K. acknowledge support from the Deutsche Forschungsgemeinschaft in the framework of the CRC 910. S.Y. acknowledges financial support by the Deutsche Forschungsgemeinschaft, Project 411803875.
Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Appendix A Normalised Root Mean Squared Error
The performance at a certain task is measured using the normalised root mean squared error (NRMSE), given by:
(21) 
where is the length of the testing sequence, is the reservoir output, and are the task-dependent desired outputs. Further, is the variance of the desired output and is the Euclidean norm.
Appendix B Stability and Hopf Bifurcation Analysis
This section gives a detailed analysis of the autonomous dynamics of the two-layer deep Ikeda time-delay system given in (6), which we repeat here for convenience:
(22)  
By setting the derivatives to zero, the equations for the equilibria are given by:
where the equilibrium of the first layer cannot be given explicitly. In order to check the stability, we linearise the system at its equilibrium, leading to:
This can be rewritten in the lower triangular block matrix form shown in Sec. III, and therefore the resulting characteristic equation is given by:
with and , and λ being the eigenvalues of the autonomous deepTRC.
An equilibrium is asymptotically stable if the real part of every solution of the characteristic equation is negative:
(23)
which we check in the following for layer 1 and layer 2 separately.
Stability of Layer 1
The eigenvalues of the first layer are roots of the equation , and they are given by the Lambert W function:
with being the th branch of the Lambert W function.
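The roots of such a transcendental characteristic equation can also be located numerically, e.g., by Newton iteration. The scalar form and the coefficient values below are generic placeholders for the layer-1 equation, chosen only for illustration:

```python
import numpy as np

# Generic scalar DDE characteristic equation (coefficients are placeholders):
#     lambda + alpha = beta * exp(-lambda * tau)
# For |beta| < alpha the equilibrium of the corresponding linear DDE
# x'(t) = -alpha x(t) + beta x(t - tau) is stable for every tau.
alpha, beta, tau = 1.0, 0.5, 30.0

def leading_eigenvalue(lam0=0.0, tol=1e-12, max_iter=100):
    """Newton iteration for the real leading root of the characteristic equation."""
    lam = lam0
    for _ in range(max_iter):
        h = lam + alpha - beta * np.exp(-lam * tau)
        dh = 1.0 + beta * tau * np.exp(-lam * tau)   # h'(lambda)
        step = h / dh
        lam -= step
        if abs(step) < tol:
            break
    return lam

lam = leading_eigenvalue()
# The root satisfies the characteristic equation and lies in the left half plane
assert abs(lam + alpha - beta * np.exp(-lam * tau)) < 1e-10
assert lam < 0.0
```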
The stability of the equilibrium point changes if eigenvalues cross the imaginary axis. Therefore we set
where . Separating the real and imaginary parts, we can rewrite this as:
Using the absolute values, this leads to:
with being stable and being unstable due to an Andronov-Hopf bifurcation.
Stability of Layer 2
For the second layer, we analyse the second factor of the characteristic equation, given by:
(24) 
By substituting , it is straightforward to find the frequencies at which the eigenvalues can cross the imaginary axis
(25) 
The Hopf bifurcations of the second layer occur in regions where . Denoting
, the delay values for stabilizing and destabilizing Hopf bifurcations are given as:
(26)  
(27) 
Appendix C Memory Capacity
The task-independent memory capacity (MC), introduced by Dambre et al. Dambre2012, determines how a dynamical system memorizes previous inputs and how it transforms them. The total MC is given by the sum over all degrees of the MC.
(28) 
In the following, refers to the linear and to the nonlinear MC (quadratic, cubic, ...). Hereby, the linear MC is given by a simple linear recall of past inputs, whereas the nonlinear MC indicates which computations of past inputs are performed by the system. Further, it was proven that the readout dimension of a system bounds the maximal reachable total MC . Consequently, there is a trade-off between the linear and the nonlinear MC.
The MC can be calculated via the correlation between input and the
reservoir states:
(29)  
(30) 
where is the average value over the time , and and denote the inverse and the transpose of the matrix, respectively.
The calculation of the MCs via the correlation given in (30) is statistically biased, and this bias strongly depends on the length of . Therefore we manually set a threshold, meaning that MCs below this threshold are disregarded.
As suggested by Dambre et al. Dambre2012, the input values were drawn from a uniform distribution .
In order to compute the different degrees of the MC, we used the set of Legendre polynomials , which are orthogonal over the given input range. For example, a target with total degree and three variables is given by:
(31)  
In order to find all contributions, a maximal step of into the past was set.
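The correlation formula (30) can be sketched on a toy delay-line "reservoir" whose states are exact delayed copies of the input, so the linear recall capacities are known in advance. All sizes and names are illustrative, not the paper's simulation setup:

```python
import numpy as np

# Correlation-based memory capacity, cf. Eq. (30), for a toy delay line.
rng = np.random.default_rng(3)
K, N, M = 5000, 10, 12                     # samples, states, max recall depth
u = rng.uniform(-1.0, 1.0, size=K + M)     # input from U[-1, 1], as in Dambre et al.

# Column i holds u(n - i), i = 1..N: a perfect delay line of depth N
S = np.column_stack([u[M - i:K + M - i] for i in range(1, N + 1)])

def capacity(S, z):
    """MC of reconstructing target z from states S via the correlation formula."""
    Sc = S - S.mean(axis=0)
    zc = z - z.mean()
    cov_sz = Sc.T @ zc / len(zc)
    cov_ss = Sc.T @ Sc / len(zc)
    return float(cov_sz @ np.linalg.solve(cov_ss, cov_sz) / zc.var())

def recall_target(k, degree=1):
    """Legendre-polynomial target of the input k steps in the past (P1 or P2)."""
    x = u[M - k:K + M - k]
    return x if degree == 1 else 0.5 * (3.0 * x**2 - 1.0)

assert capacity(S, recall_target(1)) > 0.999            # within memory depth
assert capacity(S, recall_target(N + 1)) < 0.05         # beyond memory depth
assert capacity(S, recall_target(1, degree=2)) < 0.05   # linear states: no quadratic MC
```

The small but nonzero capacities of the last two targets illustrate the statistical bias of the correlation estimate mentioned above, which motivates the manually set threshold.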