Performance boost of time-delay reservoir computing by non-resonant clock cycle

05/07/2019 ∙ by Florian Stelzer, et al. ∙ Berlin Institute of Technology (Technische Universität Berlin)

The time-delay-based reservoir computing setup has seen tremendous success in both experiment and simulation. It allows for the construction of large neuromorphic computing systems with only a few components. However, until now the interplay of the different timescales has not been investigated thoroughly. In this manuscript, we investigate the effects of a mismatch between the time-delay and the clock cycle for a general model. Typically, these two time scales are considered to be equal. Here we show that the case of equal or rationally related time-delay and clock cycle can be actively detrimental and leads to an increase of the approximation error of the reservoir. In particular, we show that non-resonant ratios of these time scales yield maximal memory capacities. We achieve this by translating the periodically driven delay-dynamical system into an equivalent network. Networks that originate from a system with resonant delay-times and clock cycles fail to utilize all of their degrees of freedom, which causes the degradation of their performance.

1 Introduction

Reservoir computing is a machine learning method, which was introduced independently by Jaeger [1] as a mathematical framework and by Maass et al. [2] from a biologically inspired background. It fundamentally differs from many other machine learning concepts and is particularly interesting due to its easy integration into hardware, especially photonics [3, 4]. With the help of the reservoir computing paradigm, the naturally occurring computational power of almost any dynamical or physical system can be exploited. It is particularly valuable for solving the class of time-dependent problems, which is usually neglected by artificial neural network-based approaches. A time-dependent problem requires estimating a target signal which depends non-trivially on an input signal; the set of times may be continuous or discrete. This class of problems contains, in particular, speech recognition and time series prediction [5, 6, 7], and also holds great promise for error correction in optical data transmission [8]. Furthermore, reservoir computing can be used to study fundamental properties of dynamical systems in a completely novel way [9], enabling new ways of characterizing physical systems.

The main idea of reservoir computing is simple, yet powerful: A dynamical system, the reservoir, is driven by an input . The state of the reservoir is described by a variable , which can be high- or even infinite-dimensional. A linear readout mapping provides an output. While the parameters of the reservoir itself remain fixed at all times, the coefficients of the linear mapping are subject to adaptation, i.e. the readout can be trained.

Reservoir computing is a supervised machine learning method. An input and the corresponding target function are given as a training example. Then optimal output weights, i.e. coefficients of the linear mapping , are determined such that the output approximates the target . This is analogous to a conventional artificial neural network where only the last layer is trained. The goal of this procedure is to approximate the mapping such that it not only reproduces the target function for the given training input but also provides meaningful results for other, in a certain sense similar, inputs. From a nonlinear dynamics perspective, the readout mapping is a linear combination of different degrees of freedom of the system.
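The following minimal sketch illustrates how such a linear readout could be trained by regularized least squares. The function names, the ridge parameter, and the use of numpy are assumptions of this illustration and are not taken from the paper:

```python
import numpy as np

def train_readout(states, targets, ridge=1e-6):
    """Fit a linear readout y = states @ W + b by ridge regression.

    states  : array of shape (K, N) -- one reservoir state per time step
    targets : array of shape (K,)   -- desired outputs
    ridge   : small regularization constant (placeholder value)
    """
    # Append a constant column so that the bias b is learned together with W.
    S = np.hstack([states, np.ones((states.shape[0], 1))])
    # Regularized normal equations: (S^T S + ridge * I) w = S^T y.
    A = S.T @ S + ridge * np.eye(S.shape[1])
    w = np.linalg.solve(A, S.T @ targets)
    return w[:-1], w[-1]  # output weight vector W and bias b

def readout(states, W, b):
    """Apply the trained linear readout to (new) reservoir states."""
    return states @ W + b
```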

The reservoir must fulfill several criteria to exhibit good computational properties: First, it must be able to carry information about multiple past input states, i.e. have memory. References [10] and [11] study how a reservoir can store information about past inputs. Second, the system should contain a non-linearity to allow for non-trivial data processing. Jaeger [1] proposed to use a recurrent neural network with random connections as a reservoir. In this case the reservoir has a high-dimensional state space by design, as the dimension equals the number of nodes. Such systems are in general able to store a large amount of information. Moreover, the recurrent structure of the network ensures that information about past input states remains present for a number of time steps and fades slowly. Jaeger compared the presence of past inputs in the state of the reservoir to echoes. For this reason he called his proposed reservoir system an echo state network.
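As a point of reference, a minimal echo state network of this kind might be set up as follows. The tanh nonlinearity, the spectral-radius scaling, and all parameter values are generic choices assumed for this sketch, not the specific construction of [1]:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 100                                             # number of network nodes
W = rng.normal(0.0, 1.0, (N, N))                    # random recurrent connections
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))     # keep the spectral radius below 1
w_in = rng.normal(0.0, 1.0, N)                      # random input weights

def esn_step(x, u):
    """One update of a generic echo state network driven by a scalar input u."""
    return np.tanh(W @ x + w_in * u)

# Drive the network with a random input sequence and collect its states.
x = np.zeros(N)
states = []
for u in rng.normal(0.0, 1.0, 500):
    x = esn_step(x, u)
    states.append(x.copy())
states = np.asarray(states)                         # shape (500, N)
```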

In recent years, the field of reservoir computing has profited from experimental approaches that use a continuous time-delay dynamical system as the reservoir [12]. While hybrid network-delay systems have also been proposed [13], typically only a single dynamical nonlinear system is employed and connected to a long delay loop; i.e., as opposed to a network-based approach, only a single active element is needed for a delay-based reservoir. Here, the complexity is induced by the inherently large phase space dimension of the dynamical system with time-delay [14, 15, 16]. The main advantage of using a delay system over Jaeger's echo state approach [1] is that one can physically implement the delay system with analogue hardware at relatively low cost, either electronically [12] or even optically [17, 18].

Two main time scales exist in such delay-based reservoir systems: the delay-time, given by the physical length of the feedback line, and the clock cycle, given by the input speed. In this paper, we show that the non-trivial cases of mismatched delay-time and clock cycle possess better reservoir computing properties. We explain this by studying the corresponding equivalent network, where we show that non-resonant ratios of delay-time to clock cycle yield maximal memory capacities.

The paper is organized as follows: In section 2 the general model of a reservoir computer based on a single delay-differential system is introduced. We refer to this method as time-delay reservoir computing (TDRC). Section 3 presents numerical simulations and the effect of mismatched clock cycles and delay-times. Section 4 derives a representation of the TDRC system with mismatched clock cycle and delay-time as an equivalent echo state network. Section 5 presents a direct calculation of the memory capacity. Section 6 derives a semi-analytic explanation for the observed decrease of the memory capacity for resonant ratios of delay-time to clock cycle and provides an intuitive interpretation. All results are summarized in section 7.

2 Time-delay reservoir computing

In this section we describe the reservoir computing system based on a delay equation. Its choice is inspired by the publications [12, 17], where it was experimentally implemented using analogue hardware. In comparison to the general reservoir computing scheme described above, an additional ‘preprocessing’ step is added to transform the input in an appropriate way before it is sent to the reservoir. This is particularly necessary when the input is discrete and the reservoir is time-continuous. In the following, we describe the resulting chain of transformations

(1)

in detail.

2.1 Step (I): preprocessing of the input

Since the reservoir is implemented with a physical experiment in mind (e.g. a semiconductor laser), its state variable is time-continuous. However, the input data is discrete in typical applications of TDRC [12, 13, 17]. For this reason the preprocessing function translates the discrete input into a continuous function.

We consider a discrete input sequence, where each element is one-dimensional; however, the method can be extended to multi-dimensional inputs. The important parameters that define the preprocessing are the clock cycle, the number of virtual nodes, and the resulting time per virtual node. Previous results [12] suggest the following ranges for these parameters: the clock cycle should be of the order of the delay, and the time per virtual node of the order of the reservoir timescale.

First, a function is defined as a step function

(2)

with step length . Using the indicator function

(3)

the definition of can be equivalently written as

(4)

Secondly, is multiplied by the -periodic mask

(5)

which is piecewise constant with step length and values . Multiple options for the choice of a mask function are compared in reference [19]. The final preprocessed input signal is

(6)

It is a piecewise constant function with values

(7)

on the intervals . The details of the preprocessing are illustrated in figure 1.

For further analysis, it is convenient to denote the ‘mask’ vector and the input vector as follows:

(8)
Figure 1: Schematic representation of the preprocessing step: the discrete input sequence defines a step function (blue), which is multiplied by a -periodic mask function to obtain the preprocessed input (red). Here the length of the mask vector is . The resulting function enters the reservoir equation (9).
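The preprocessing described in this subsection can be sketched in a few lines of code. In the following illustration the variable names, the binary mask values, and the grid resolution are assumptions made for the sketch, not quantities taken from the paper:

```python
import numpy as np

def preprocess(inputs, mask, T, points_per_node=10):
    """Turn a discrete input sequence into a piecewise constant drive J(t).

    inputs          : 1-D array of input values, one per clock cycle of length T
    mask            : 1-D array of N mask values; each virtual node lasts theta = T / N
    T               : clock cycle length
    points_per_node : time-grid resolution per virtual node (for plotting/integration)
    """
    N = len(mask)
    theta = T / N
    t, J = [], []
    for k, u in enumerate(inputs):
        for n in range(N):
            # During the n-th virtual-node interval of cycle k the drive is u_k * m_n.
            for p in range(points_per_node):
                t.append(k * T + n * theta + p * theta / points_per_node)
                J.append(u * mask[n])
    return np.asarray(t), np.asarray(J)

rng = np.random.default_rng(1)
u = rng.uniform(-1, 1, 5)                 # five input steps
m = rng.choice([-1.0, 1.0], size=20)      # binary mask with N = 20 virtual nodes
t, J = preprocess(u, m, T=1.0)
```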

2.2 Step (II): reservoir

Inspired by the previous works [12], we study the reservoir given by the delay-differential equation

(9)

where is the delay-time, is the input strength, the activation function, and the preprocessed input function. The term in equation (9) induces the ‘fading memory’ property necessary for the reservoir [1].

For a given preprocessed input , the reservoir variable is computed by solving the delay-differential equation (9) with some initial condition. For a properly chosen reservoir, the initial condition does not play an important role. This is related to the phenomenon of generalized synchronization [20, 21], but we do not address this question in detail in this work. In short, generalized synchronization in reservoir computing means that identical reservoirs which are driven by the same input but have different initial states will approximate each other asymptotically. In the literature on reservoir computing this property is often referred to as the echo state property.
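Since the explicit form of equation (9) is not reproduced above, the following sketch assumes a commonly used TDRC form in the spirit of [12], namely x'(t) = -x(t) + f(x(t - tau) + gamma * J(t)), and integrates it with a simple Euler scheme; the parameter values and the tanh activation are placeholders:

```python
import numpy as np

def simulate_reservoir(J, tau, dt, gamma=0.5, f=np.tanh):
    """Integrate a generic delay reservoir x'(t) = -x(t) + f(x(t - tau) + gamma * J(t)).

    J     : array with the preprocessed input J(t) sampled on a grid with step dt
    tau   : delay time (assumed to be an integer multiple of dt in this sketch)
    dt    : integration step of the simple Euler scheme
    gamma : input strength (placeholder value)
    f     : activation function (tanh is a stand-in, not the paper's choice)
    """
    d = int(round(tau / dt))              # number of grid points per delay interval
    x = np.zeros(len(J) + d)              # history buffer initialized to zero
    for i in range(len(J)):
        x_delayed = x[i]                  # x(t - tau) taken from the buffer
        x[i + d] = x[i + d - 1] + dt * (-x[i + d - 1] + f(x_delayed + gamma * J[i]))
    return x[d:]                          # trajectory aligned with the input grid
```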

2.3 Step (III): readout

The continuous reservoir variable needs to be discretized for the output. For this, the dynamical system is read out every time units. Because every small time window is fed with its own input , these time windows are often seen as ‘virtual nodes’, and the whole delay system as a ‘virtual network’ [12, 22]. We discretize the reservoir variable correspondingly:

(10)

In fact, is the vector containing the -point discretization of the variable on the interval . The output of the machine learning system is defined as

(11)

where is an -dimensional row vector and is a scalar bias. The output weight vector and the bias are to be adjusted in the training process and are chosen by linear regression for reservoir computers [1].
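A sketch of this readout step, compatible with the integration and training sketches above (the alignment of the sampling grid with the virtual-node intervals is a simplifying assumption of this illustration):

```python
import numpy as np

def virtual_node_states(x, N, points_per_node):
    """Sample the continuous trajectory x(t) once per virtual node.

    x               : simulated trajectory on the integration grid
    N               : number of virtual nodes per clock cycle
    points_per_node : grid points per virtual-node interval (as used when building J)
    Returns an array of shape (K, N): one state vector per clock cycle.
    """
    stride = points_per_node
    samples = x[stride - 1::stride]            # value at the end of each node interval
    K = len(samples) // N                      # number of complete clock cycles
    return samples[:K * N].reshape(K, N)

# Example: states = virtual_node_states(x, N=20, points_per_node=10)
#          y = readout(states, W, b)   # linear output, cf. the training sketch above
```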

3 Effect of the mismatch between delay and clock cycle times

When TDRC was first introduced, the clock cycle for the preprocessing of the input mask was chosen to be equal to the delay [12, 17]. In this case one can easily find an ‘equivalent network’ which is a discrete approximation of the reservoir system. See the supplementary material of [12] or reference [23] for an example. However, recent numerical observations show that the performance may be improved if one sets . The earliest example of this can arguably be found in reference [24].

Figure 2: This plot shows the NRMSE of a linear TDRC for the NARMA-10 task (top) and the total memory capacity (bottom). Clearly visible are error peaks for values of that are close to integer multiples of and low-order resonances. Vertical lines denote the resonant values and . Other parameters are , , , and . The reservoir is trained on samples and evaluated on samples with a regularization by a noise of added to the output of the reservoir. The results are averaged over different random input masks.

We use the NARMA-10 task [25] and the memory capacity (MC) [10] to measure the performance of a simple TDRC to illustrate the role of the clock cycle . These are typical benchmark tasks and we refer to the appendix for a detailed explanation. Let the activation function in equation (9) be a linear function

(12)

with . The other parameters are set to , , and . We trained this system on data points for the NARMA-10 task with a regularization by a white noise term added to the reservoir output. For our tests we fix an input weight vector with independently -distributed entries. We simulate different masks and train the output weights for each of the masks for input steps, and then evaluate the learned behavior on an independent set of samples.

The top panel of figure 2 shows the results of simulations with these parameters. The normalized root mean square error (NRMSE, see equation (98)) is a quantitative measure of the system's performance. The error can in general depend on many parameters. Figure 2 shows that the linear network exhibits clear peaks of the error for certain values of the clock cycle . These peaks are located close to low-order resonances with the delay-time and fulfill the relation for small . In fact, the peaks are located slightly above the resonant values. The lower panel of figure 2 reveals at least part of the reason for this: the total memory capacity for these resonant clock cycles decreases dramatically.
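For reference, the NARMA-10 target and the NRMSE used above can be sketched as follows. The recurrence coefficients are the standard NARMA-10 definition following [25], while the input range and sequence length are common conventions assumed here rather than taken from the appendix:

```python
import numpy as np

def narma10(u):
    """Standard NARMA-10 target sequence driven by inputs u in [0, 0.5]."""
    y = np.zeros(len(u))
    for t in range(9, len(u) - 1):
        y[t + 1] = (0.3 * y[t]
                    + 0.05 * y[t] * np.sum(y[t - 9:t + 1])
                    + 1.5 * u[t - 9] * u[t]
                    + 0.1)
    return y

def nrmse(y_pred, y_true):
    """Normalized root mean square error between prediction and target."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2) / np.var(y_true))

rng = np.random.default_rng(2)
u = rng.uniform(0.0, 0.5, 10000)
target = narma10(u)
```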

The rest of this paper is devoted to the explanation of this phenomenon. Using a linear activation function is not the optimal choice for the TDRC. However, the resonance effects that we are interested in are general and have been numerically verified to be largely independent of . They do, however, get enhanced by stronger regularization. Conversely, if the system is large enough, i.e. for large , the effect is reduced and can be mostly ignored for simple tasks like NARMA-10.

4 Approximation by a network

In order to explain the degradation of the memory capacity for the resonant clock cycles, we represent the delay-system reservoir as an equivalent network. A similar procedure was carried out in [12] for a TDRC with Mackey-Glass activation function and . (See the supplementary material of [12].) However, our case is more complicated, as we allow for . This changes the coupling matrix of the resulting equivalent network. Since a detailed derivation is technical, we defer it to Appendix A and present the main results in this section.

As follows from Appendix A, the TDRC dynamics can be approximated by the discrete mapping

(13)

where

(14)

and is the discretized vector of the reservoir defined in equation (10). Let us define and explain the further notation used in the mapping (13). The matrix

(15)

is the classical coupling matrix of an equivalent network for TDRC with [12]. Moreover,

(16)

where and denote the floor and the ceiling function, which we need to employ to allow delay-times that are not an integer multiple of . These quantities can be interpreted as follows: is the number of virtual nodes that are needed to cover a -interval, is a measure of the misalignment between and , and is roughly the ratio between the delay-time and the clock cycle . For , the delay is shorter than the clock cycle , and it is similar to or larger than the clock cycle for . The matrices and are shifted versions of the matrix . They are defined as follows:

(17)

is obtained by a downwards shift of by rows and

(18)

is obtained by the upwards shift of by rows. Furthermore

(19)
(20)

and is defined as the input vector (8).
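As a small illustration of such row-shifted matrices (the zero-filling of the vacated rows is an assumption of this sketch; the precise definitions are those of equations (17) and (18)):

```python
import numpy as np

def shift_down(W, r):
    """Shift the rows of W downwards by r, filling the vacated rows with zeros."""
    S = np.zeros_like(W)
    if r < W.shape[0]:
        S[r:, :] = W[:W.shape[0] - r, :]
    return S

def shift_up(W, r):
    """Shift the rows of W upwards by r, filling the vacated rows with zeros."""
    S = np.zeros_like(W)
    if r < W.shape[0]:
        S[:W.shape[0] - r, :] = W[r:, :]
    return S
```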

The mapping (13) generalizes previous results from [12, 22]. If the clock cycle satisfies , the description coincides with the classical case and the approximate equation (13) yields the same mapping

(21)

as presented in [12] because then , , and .

Analytical approaches for the nonlinear system (13) can be difficult. To simplify, we will study the effect of different clock cycles with the help of a linear activation function , where is a scalar. Then equation (13) can be written as

(22)

by plugging in for and by writing for the sake of brevity.

System (13) possesses the following properties: in the case , we have , and hence, equation (13) is in general an implicit map. This is the physical case of a delay shorter than the clock cycle, which means that the feedback from some of the virtual nodes will act on other virtual nodes within the same cycle. However, for the linear activation function in equation (22) and by (14) we obtain the explicit map

(23)

where

(24)

is a matrix that describes the coupling and local dynamics of the virtual network and

(25)

is the generalized input matrix.

Equation (23) is the main result of this section; it allows us to directly calculate some figures of merit in the following. We first use it to explain the drops in the memory capacity in figure 2 for resonant delays. One important aspect to note is that the basic shape of equation (23) does not change with . Rather, a change of the clock cycle leads to a change of the evolution matrices and of the equivalent network. The obtained system (23) can be equivalently considered as a specific echo state network.

5 Direct calculation of memory capacity

One can estimate the memory capacity of a reservoir computing system by solving the system numerically and letting it perform the memory task. However, there are also analytic methods for some cases. In this section we explain how to analytically calculate the memory capacity of the linear echo state network (23).

Memory capacity was originally defined by Jaeger in [10]. In the following, we use a slightly modified formulation. Let the elements of the input sequence be independently -distributed. Jaeger introduced the quantity which indicates how well the output of an ESN may approximate the input value which was fed into the reservoir time steps earlier. The memory capacity for a recall of time steps in the past is defined by

(26)

where denotes the expectation value and we require the initial state of the reservoir to follow the stationary distribution in order to ensure that this definition is consistent, i.e. that the distribution of does not depend on . For the existence of the stationary distribution, we assume . In such a case, the memory capacity (26) with a stationary initial state can be equivalently written as

(27)

Note that means that we can drop the term in (27).

The total memory capacity is defined as the sum of all -step memory capacities

(28)

In the following we denote the optimal output weight vector for (27) by . Let be the covariance matrix of the stationary distribution of the reservoir. Jaeger [10] noted that, if is invertible, one can apply the Wiener-Hopf equation [26] to find

(29)

For details we refer to Appendix B. Using this optimal value , the memory capacity (27) can be calculated as

(30)

where we have used the relations

(31)

and

(32)

Thus, once the covariance matrix of the reservoir is invertible, one can directly calculate the memory capacity. The stationary distribution of system (23) with standard normally distributed input elements is a multivariate normal distribution with mean zero and covariance matrix

(33)

We refer to Appendix B for a derivation. However, this matrix is in general not invertible. In order to obtain an invertible covariance matrix, we need to perturb the stochastic process (23). We choose a small number and let be a sequence of independent multivariate normally distributed random variables. The stochastic process

(34)

has the stationary distribution , where the covariance matrix given by

(35)

is invertible.
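The calculation of this section can be condensed into a short numerical sketch. It assumes the linear equivalent network has the generic form x_{k+1} = M x_k + g u_k with i.i.d. standard normal inputs, where M and g stand in for the matrices of equations (24) and (25); the covariance is accumulated as in equation (35), and the k-step capacities follow equation (30):

```python
import numpy as np

def memory_capacities(M, g, k_max=50, eps=1e-8, n_terms=200):
    """Memory capacities MC_k of a linear network x_{k+1} = M x_k + g u_k.

    M       : (N, N) network matrix (assumed to have spectral radius < 1)
    g       : (N,) effective input vector
    eps     : variance of the small regularizing perturbation, cf. eq. (35)
    n_terms : truncation of the series defining the stationary covariance
    """
    N = len(g)
    # Stationary covariance: sum_j M^j g g^T (M^j)^T + eps * I (truncated series).
    C = eps * np.eye(N)
    v = g.copy()
    for _ in range(n_terms):
        C += np.outer(v, v)
        v = M @ v
    C_inv = np.linalg.inv(C)
    # MC_k = p_k^T C^{-1} p_k with p_k = M^k g = E[x_t * u_{t-k}] for unit-variance input.
    mc = []
    p = g.copy()
    for _ in range(k_max):
        mc.append(float(p @ C_inv @ p))
        p = M @ p
    return np.array(mc)

# Total memory capacity: memory_capacities(M, g).sum()
```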

6 Explanation for memory capacity gaps

Using the expressions (28) and (30) for the memory capacity obtained in section 5, we provide an explanation for the loss of the memory capacity when is close to rational numbers with small denominator. The explanation is based on the structure of the covariance matrix given by equation (35) and the corresponding expression for the memory capacity, which we repeat here for convenience

(36)

where

(37)

Our further strategy is as follows:

  • Firstly, we remark that the norms of the individual terms in the sum (37) converge to zero due to the convergence of the series. Hence, only the first finitely many terms play an important role. For instance, for our previously chosen parameters in figure 2, the terms with do not make a large contribution and can be neglected. In the following we denote the approximate number of significant terms by .

  • We show that the largest eigenvalue of the -th term in (37) can be approximated by , with the corresponding eigenvector .

  • We show that the memory capacity is high, i.e. for , when the eigenvectors corresponding to the first relevant terms in the sum (37) are orthogonal.

  • Using our setup, we show numerically that the lower order resonances , where and is small, lead to the alignment of the eigenvectors , and hence, to the loss of the memory capacity.

  • Finally we give an intuitive explanation of the obtained orthogonality conditions.

Remark (i) is obvious, so we start by showing (ii).

(ii) Estimating the largest eigenvalues and eigenvectors of the -th term in (37)

Consider first the term with : the largest eigenvalue of this matrix is and the corresponding eigenvector is , as can easily be checked by the direct calculation

(38)

For all other eigenvectors , which are orthogonal to due to the symmetry of the matrix, the corresponding eigenvalues are because

(39)

These eigenvalues are by definition small, since is a small perturbation.

We can also find approximations of the eigenvectors and eigenvalues for the higher order terms , . Namely, for the unperturbed matrix , the largest eigenvalue is and the corresponding eigenvector is because

(40)

All other eigenvalues are zero. Since the largest eigenvalue of is geometrically and algebraically simple, it is continuous under the perturbation by . Hence, the largest eigenvalue and the eigenvector of are approximated by and with an error of order . All other eigenvalues are correspondingly small of order .

(iii) The orthogonality of leads to the high memory capacity

Let be the number of terms in (37) that are significant (see (i)), and let us assume that the eigenvectors , are close to being orthogonal, i.e.

(41)

As we will see in (iv), such an assumption is indeed reasonable in our setup. More precisely, one could consider (41) as introducing another small parameter .

In the case when the orthogonality (41) holds, the largest eigenvalues of and their corresponding eigenvectors can be approximated by and , . Indeed

(42)

In this case, the memory capacity can be calculated as follows:

(43)

for . Hence, the orthogonality of the vectors with , guarantees a high memory capacity. We will present an intuitive explanation for this shortly.

Figure 3: The angles between and , are plotted in color, measuring, in particular, the orthogonality of the vectors with different . Panels (a) and (b) correspond to different ratios of : (a) ; (b) . In (b) the vectors point in the same direction for , , , etc., i.e. after two time steps the input values overlap in the state space of and the memory capacity drops. In contrast, in case (a) it takes almost 30 time steps before the input overlaps with past inputs in the network state. This explains the high memory capacity in this case, which is illustrated in figure 2.
(iv) Resonances of and lead to lower memory capacity

The plots in figure 3 show for different ratios . White to light blue off-diagonal squares indicate that assumption (41) is satisfied, i.e. orthogonal or almost orthogonal vectors. Dark blue indicates a strong parallelism of the vectors. As can be seen in panel (a) of figure 3, assumption (41) indeed holds for ratios which yield a good memory performance. Conversely, it is strongly violated for critical ratios with small denominator , e.g. in panel (b) of figure 3.

(v) Intuitive explanations

There is an additional intuitive understanding of the formulas derived above. Recall that the original reservoir system of equation (9) combines the delay term and the input additively. The approximate network formula for an equivalent network translates this into the matrix , which describes the free dynamics of the network, and the driving term defined by . The state of the network is given by an -dimensional system, and thus can at most hold orthogonal dimensions [27]. Each summand of can now be understood as an imprint of the driving term on the system after time steps. For the matrix , and thus the imprint is given by , i.e. the information of the current step is stored in the nodes as given by the weights of the effective input weight vector . In the next step, the system will receive an additional input, but also evolve according to its local dynamics . Thus, after one time step, the imprint has transformed into , i.e. the summand for and . Now in every step, the information that is currently present in the network will be ‘rotated’ in the phase space of the network according to , while a new input will be projected onto the direction of . This holds in general, so that the -th summand of equation (35) describes the linear imprint of the input steps in the past.

The orthogonality condition of equation (41) is then equivalent to demanding that new information from the inputs should not overwrite the information that is already present. If for some , then the information that was stored steps in the past will be partially overwritten by the currently injected step and lost. Hence, ensuring that the orthogonality between is fulfilled as much as possible will maximize the linear memory. For the case of resonant feedback, i.e. , this condition is not fulfilled. This is due to the fact that has a strong diagonal component for the resonant cases, i.e. virtual nodes are most strongly coupled to themselves. This is a simple consequence of the fact that for , virtual nodes return to the single real node at the same time that they are updated. Similarly, for higher resonant cases , will in general have a strong diagonal part and thus the eigenvector will not be orthogonal to , and the information will be overwritten.
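The pairwise angles shown in figure 3 could be computed along the following lines, given the network matrix and the effective input vector of the linear equivalent network (here named M and g as placeholders for the quantities in equations (24) and (25)):

```python
import numpy as np

def imprint_angles(M, g, k_max=30):
    """Pairwise angles (in degrees) between the imprint directions v_k = M^k g."""
    vs = []
    v = g.copy()
    for _ in range(k_max):
        vs.append(v / np.linalg.norm(v))             # normalized imprint direction
        v = M @ v
    vs = np.asarray(vs)
    cos = np.clip(np.abs(vs @ vs.T), 0.0, 1.0)       # |cosine| of the angle between v_j, v_k
    return np.degrees(np.arccos(cos))

# Angles near 90 degrees off the diagonal indicate that the orthogonality condition (41)
# holds; small angles signal that past inputs are overwritten and memory capacity drops.
```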

7 Discussion

In this paper we have presented a generalization of the frequently used time-delay reservoir computing scheme to cases other than . We observed that a sudden increase in the computing error (NARMA-10 NRMSE) and a drop in the linear memory capacity () occur for resonant cases of with , where is small. We derived an equivalent network for the non-resonant cases, which extends the previously studied cases. Assuming a linear activation function , we can analytically solve the resulting implicit equations and obtain an expression for the total memory capacity . We find that the resulting memory capacity is small for cases where and are resonant because the information within the equivalent network is overwritten by new inputs very quickly. Even though our analytics are so far only derived for the linear case, we expect these results to hold in general, as numerical investigations suggest. We have also found that the observed effect is more pronounced in smaller networks and with stronger regularization.

Acknowledgements

S.Y. acknowledges the financial support by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - Project 411803875. A.R. and K.L. acknowledge support from the Deutsche Forschungsgemeinschaft in the framework of the CRC910. F.S. acknowledges financial support provided by the Deutsche Forschungsgemeinschaft through the IRTG 1740.

Appendix

A Derivation of equivalent networks

This section presents a detailed derivation of the ESN representation of TDRC systems. The derivation is structured as follows:

  1. The delay system (9) is discretized such that the state of a virtual node depends on the state of its neighbor node , the input and the state of a second node at the time . In order to do so, we approximate the integral of the continuous TDRC system on a small integration interval of length which covers the point .

  2. The formulas for and are derived.

  3. The TDRC system can be written as a matrix equation. For this we use the same vectorization (10) of as for the readout. An induction argument is employed to obtain the matrix equation.

  4. It follows that the discretized TDRC system can be represented by an ESN if and if the activation function is linear.

  5. For the sake of completeness, we formulate the equivalent ESN for the classical case , which was described in [12].

a.1 The delay reservoir system and discretization

Consider the delay-system (9), which we repeat here for convenience:

(44)

where and .

It follows that

(45)

for . Set and . Then

(46)

where is defined in (7). One option to discretize the system is to approximate the function by a step function with step length which is constant on the integration interval. One can find an appropriate step function by choosing and such that

(47)

and defining for . Then, one can replace by in the integrand in equation (46). This yields

(48)

a.2 The choice of and

The floor and the ceiling function are denoted by and , respectively. One can choose and in the following way:

Figure 4: The time interval over which the function is integrated in equation (46) is highlighted in blue. As stated in equation (47), the point must be chosen such that it lies within this interval. In equation (48) the value of on the integration interval is approximated by the value of . If the endpoints of the interval are grid points, is chosen to be the right endpoint.

First, let be the unique number such that , i.e. . Then

(49)

as illustrated in figure 4. Now, the choice of follows directly from the restriction . It holds that

(50)

From this result it follows that

(51)

Hence, equation (49) implies

(52)

Note that one has as long as . If , then . For one has , etc.

a.3 Vectorization of the state space and a matrix equation for the discretized system

Define

(53)

and . From (48) it follows that

(54)

and induction yields