Memory-Loss is Fundamental for Stability and Distinguishes the Echo State Property Threshold in Reservoir Computing and Beyond

01/03/2020 ∙ by G Manjunath, et al. ∙ University of Pretoria

Reservoir computing, a highly successful neuromorphic computing scheme used to filter, predict, and classify temporal inputs, has entered an era of microchips for several other engineering and biological applications. A basis for reservoir computing is memory-loss, or the echo state property. It is an open problem how the design parameters of the reservoir can be optimized to maximize the reservoir's freedom to map an input robustly and yet have its close-by variants represented differently in the reservoir. We present a framework to analyze stability due to input and parameter perturbations and reach a surprising fundamental conclusion: the echo state property is equivalent to robustness to input in any nonlinear recurrent neural network, whether or not it lies within the ambit of reservoir computing. Further, backed by theoretical conclusions, we define and find the difficult-to-describe input-specific edge-of-criticality, or the echo state property threshold, which defines the boundary between parameter-related stability and instability.




Memory loss or Echo State Property

In dynamical systems theory, memory loss appears for two diametrically opposite reasons. When systems have sensitive dependence on initial conditions, as in chaotic systems, small errors multiply within a short time span, so that it is infeasible to track a specific trajectory (e.g., [5, 15]). On the other hand, the diameter of the state space could asymptotically decrease to zero so that the trajectories tend to coalesce into a single trajectory. In this paper, we consider the latter case of memory loss when a dynamical system is driven exogenously.

A feature of memory loss in driven systems is that a single bounded trajectory emerges as a representative of the drive/input. To illustrate this idea, we consider a continuous-time driven system, in particular a scalar differential equation, , where . The solution of the differential equation with an initial condition can be shown to be , where . For a given , regardless of . Hence, in this case, or in cases similar to this (for a schematic see Fig. 2), for a fixed initial choice , setting the starting time further back in time would get closer to for a given . It follows that different solutions asymptotically forget or lose the memory of their initial condition, and a single bounded attractive solution emerges as a “proxy” for the drive (in the state space). This is the essence of memory loss w.r.t. the input . In general, such memory loss is not observed in all driven systems, and neither is it observed generically. Moreover, such a memory loss phenomenon usually depends on the exogenous input and also on the equations governing the system. In cases where there is no memory loss w.r.t. the input, not a single trajectory like , but a set-valued function comprising multiple trajectories (solutions) would attract nearby solutions.
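As a concrete instance of this kind of memory loss, a worked case may help. The specific equation $\dot{x} = -x + u(t)$ with a bounded continuous drive $u$, and the symbols $x_0$, $t_0$ and $\xi$, are assumptions chosen here purely for illustration; the equation intended in the paragraph above may differ.

```latex
% Assumed illustrative system: \dot{x} = -x + u(t), with x(t_0) = x_0 and u bounded.
\begin{align*}
x(t; t_0, x_0) &= e^{-(t - t_0)}\, x_0 + \int_{t_0}^{t} e^{-(t - s)}\, u(s)\, ds, \\
\xi(t) &:= \int_{-\infty}^{t} e^{-(t - s)}\, u(s)\, ds, \\
x(t; t_0, x_0) - \xi(t) &= e^{-(t - t_0)} \bigl( x_0 - \xi(t_0) \bigr) \;\longrightarrow\; 0
\quad \text{as } t_0 \to -\infty .
\end{align*}
```

Under these assumptions, $\xi$ is bounded by $\sup_s |u(s)|$, it solves the same equation, and every other solution approaches it regardless of $x_0$, so $\xi$ plays the role of the single attractive proxy for the drive.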

Figure 2: Schematic figure to explain memory loss in continuous-time.

The idea of relating memory loss to the presence of a single proxy corresponding to an input can be extended to the case of a discrete-time driven dynamical system (representative work is in [21]). In the RC literature, memory loss or fading memory, popularly called the echo state property (ESP), was treated as a stability property of the reservoir or network per se. In such a case, ESP guarantees that the entire past history of the input determines the state of the system precisely, i.e., there is no possibility of two or more reachable states at any given point in time [13] if the entire past values of the input have influenced the dynamics. The formal definition linking ESP intrinsically to an input is available in [21]. Here, we extend the results of [21] and consider a general setup to analyze the effect of both an input and a design parameter. A design parameter (not necessarily a scalar; it may also be a vector or a matrix) , an input value , a state-variable in a compact state space and a continuous function that takes values in would constitute a parametric driven system (we denote this system by throughout, with the other entities silently understood). The dynamics of a parametric driven system is rendered through a temporal input via the equation . Given an and , any sequence in the state space that satisfies the equation for all integers is an image of in the space (and is also called a solution of ). Having fixed and , suppose we collect the component of all images , and denote it by ; then we have a collection of sets which we call the representation of . It turns out that the components of this representation satisfy (see [21, Proposition 2] and [23] for details). Following [21], we say that the driven system has the echo state property (ESP) with respect to (w.r.t.) an input at if it has exactly one solution or one image in the space or (equivalently) if for each , is a singleton of . The emergence of a single image or proxy does not, in general, mean that it captures the “information” content of the input. For instance, consider the discrete-time system, , where and with . It may be verified that for every input , the map is a contraction on for each , and it easily follows that there is exactly one solution in (by applying Lemma 2 in [21]). However, every solution in converges monotonically to as . Such convergence means that any image of would not reflect the temporal complexity of , nor would there be any good separability of the representation of the inputs in the space . Such a driven system is not useful for applications. On that account, in RC, one employs driven dynamical systems with a higher dimensional state space.

Representation and Encoding maps

Given an and and a driven system , the set , a component of the representation, can equivalently be defined in two other useful ways. To illustrate the first one, given , denote its left-infinite subset . If the entire left-infinite sequence were to be fed to the system, let the set of all reachable states at time be denoted by and call it the encoding of and . The set turns out to be the nested intersection of the sets , and so on (see Fig. 3 for a schematic description). It turns out that is identical to the set . The second interpretation of or is that it is the limit of the sequence of sets (defined using the Hausdorff metric on the space of subsets of ; see [23] for details). Thus, from the point of view of computer simulations, given an input segment of and a large number of initial conditions , suppose the driven system is fed the input on the system states to get new states , and then fed on the states to get the new states and so on; then represents the simulated approximation of , and also an approximation of itself.
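The simulation recipe described above can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the update rule `tanh(a*W*x + V*u)`, the 2-neuron matrices `W` and `V`, the scaling value `a = 0.5`, and the helper names `rnn_step`, `approximate_encoding` and `diameter` are all hypothetical choices made here, not taken from the source. The diameter of the final cloud is used as a crude stand-in for a measure of deviation from a single point-cluster.

```python
import math
import random

def rnn_step(a, W, V, u, x):
    """One update x -> tanh(a*W@x + V*u) for a scalar input u (assumed form)."""
    n = len(x)
    return [math.tanh(a * sum(W[i][j] * x[j] for j in range(n)) + V[i] * u)
            for i in range(n)]

def approximate_encoding(a, W, V, inputs, initial_states):
    """Feed the same input segment to every initial state; return the final states."""
    states = initial_states
    for u in inputs:
        states = [rnn_step(a, W, V, u, x) for x in states]
    return states

def diameter(points):
    """Largest pairwise Euclidean distance in a finite cloud of points."""
    return max(math.dist(p, q) for p in points for q in points)

rng = random.Random(0)
W = [[0.0, 1.0], [-1.0, 0.0]]   # hypothetical reservoir matrix (a rotation)
V = [1.0, 1.0]                  # hypothetical input weights
inputs = [rng.uniform(-1.0, 1.0) for _ in range(300)]
initial_states = [[rng.uniform(-1.0, 1.0) for _ in range(2)] for _ in range(50)]

cloud = approximate_encoding(0.5, W, V, inputs, initial_states)
print("diameter of the simulated cloud:", diameter(cloud))
```

With this particular `W` and `a = 0.5`, each update shrinks Euclidean distances by at least a factor of two (tanh is 1-Lipschitz and the rotation preserves norms), so the cloud collapses to a single point cluster. For larger scalings no such guarantee holds, and the diameter has to be inspected numerically.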

Figure 3: Schematic figure to explain the sets : When is a filled square (boundary in black), is the filled square-like object (boundary in red), and when is the filled triangle (boundary in blue) then is the filled oval (boundary in green).

Our approach is to understand the influence of and separately on the dynamics in the space , and in particular on the encoding . To do this, we vary keeping constant, i.e., study the set-valued function , which we call the input-encoding map (given ). Next, we vary keeping constant, i.e., study , which we call the parameter-encoding map (given ). When we study the entire representations in these two cases, i.e., when we study for a given and for a given , we call them the input-representation and parameter-representation maps respectively.

The analysis and simulations to follow would be much simpler if the single set would determine the ESP rather than inferring it from the representation maps. This is possible when we can restrict to an open-parametric driven system. A parametric driven system is an open-parametric driven system if the mapping does not map any set with a nonempty interior (e.g. [25]) to a singleton of . An example of an open-parametric driven system is a recurrent neural network (RNN) of the form , where is (the nonlinear activation) performed component-wise on , , a real-valued parameter would correspond to the scaling of the reservoir (invertible) matrix of dimension and is a matrix with input connections and (the cartesian product of copies of ). It can be readily verified that in the case of an open-parametric driven system, has ESP w.r.t. at , if and only if , i.e., is a singleton of . Note that the parameter does not always have to be a scalar; it can even represent either of the two matrices or in the RNN.

Neuromorphic computing schemes do not express unbounded dynamics. This boundedness is normally due to a nonlinearity such as a saturation function like that subdues or squashes the effect of large-amplitude inputs or, in other words, produces contractions on subsets of the space when an input value is of large amplitude. In general, we characterize the ability to obtain such contracting dynamics by the term contractibility. Given an open-parametric driven system , and an , if there exists an input such that has ESP w.r.t. at , then we say is contractible at . The RNN is contractible at every value of as long as the matrix does not have rows with only zeroes (formal proof in [21, (ii) of Theorem 2]). This condition on ensures that when the input is of large amplitude, the states of the system are made to coalesce due to the squashing of the activation functions to obtain ESP. Note that contractibility at does not mean that is a contraction mapping; it only asserts the existence of some input that could potentially drive the system into a subdomain where contracting dynamics take place.

Continuity of the encoding and representation maps

In a practical application, heuristically speaking, it is desirable that the reachable system states should change by a small amount if the change in input is small (as in Fig. 1). When this heuristic idea holds, it translates to the mathematical notion of continuity of the input-encoding or representation map that helps us present Theorem 1. We first present both the mathematical and intuitive idea of continuity (Fig. 4) of the encoding maps from the set-valued functions point of view. For a formal treatment, see [23].

Continuity of set-valued functions [3] can be defined with the notion of open neighborhoods, as for a regular function. A set-valued function is continuous at a point if it is both upper-semicontinuous and lower-semicontinuous at that point. A set-valued function is upper-semicontinuous (u.s.c) at if for every open set containing there is a neighborhood of so that is contained in ; is lower-semicontinuous (l.s.c) at if for every open set that intersects there is a neighborhood of so that also has non-empty intersection with (see Fig. 4, and [23] for a formal definition). If is u.s.c but not l.s.c at then it has an explosion at , and if is l.s.c but not u.s.c at then it has an implosion at . The function is called continuous, u.s.c or l.s.c if it is respectively continuous, u.s.c or l.s.c at all points in its domain.
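For reference, the two semicontinuity notions described in words above can be written out in the standard way. The symbols $F$, $X$, $Y$, $x_0$, $U$ and $V$ are notation assumed here for illustration and need not match the notation of [23].

```latex
% F : X -> 2^Y is a set-valued map between topological spaces (assumed notation).
\begin{align*}
\text{u.s.c. at } x_0:&\quad \forall\, V \text{ open with } F(x_0) \subseteq V,\;
  \exists\, U \ni x_0 \text{ open such that } F(x) \subseteq V \;\; \forall\, x \in U, \\
\text{l.s.c. at } x_0:&\quad \forall\, V \text{ open with } F(x_0) \cap V \neq \emptyset,\;
  \exists\, U \ni x_0 \text{ open such that } F(x) \cap V \neq \emptyset \;\; \forall\, x \in U .
\end{align*}
```

In this notation, an explosion at $x_0$ is the case where the first condition holds and the second fails, and an implosion is the reverse.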

A property that we use in this paper is that if is u.s.c at and is single-valued, it is continuous at . In addition to the few cases illustrated in Fig. 4, we point out that the “graphs” of set-valued maps could behave wildly. One such case that we would quote later is a map that is u.s.c but not continuous on any open set. For example, the set-valued function defined by


is upper semicontinuous everywhere but continuous only when is irrational and at .

Figure 4: Schematic of graphs of set-valued maps to explain the notions of upper semicontinuity (u.s.c) and lower semicontinuity (l.s.c). is a singleton in (a) and (c), while it is multivalued in all other figures. For an open set containing there is a neighborhood of that maps into in (b) but not in (c). Also note that there is an explosion of at in (b) while there is an implosion in (c). is upper semicontinuous only (usc-only) if it is u.s.c and not l.s.c at some point . is lower semicontinuous only (lsc-only) if it is l.s.c and not u.s.c at some point .

Theorem 1 (proof in [23]) states that undesirable responses to input can be avoided, or robustness to input can be obtained, if and only if the set of reachable system states is a singleton, i.e., when ESP holds.

Theorem 1.

(Continuity of the input-representation map and the input-encoding map) An open-parametric driven system that is contractible at has ESP w.r.t. at if and only if the input-representation map is continuous at , if and only if the input-encoding map is continuous at .

Details on how the above theorem would not be true if the contractibility hypothesis is removed are given in [23]. Further, when the input-encoding map is discontinuous at , it can be shown that it is u.s.c but not l.s.c at which means that there is an explosion in the set (see [23, Lemma 2]). A mathematically pedantic reader may be interested to note that at a point of discontinuity, the input-encoding map behaves wildly similar to the set-valued function in equation (1) at its point of discontinuity.

It may not be possible to simulate the input-encoding map or the input-representation maps by fixing and varying the inputs since each input lies in an infinite-dimensional space. However, to practically verify robustness of to a given input at a parameter value , one can infer whether is a singleton of (or not robust if it is multivalued) by numerically observing whether is clustering towards a single point. A quantity that measures the deviation from a single point-cluster [23, Eqn. 7] could be used.

It is also undesirable in an application to observe large changes in the system responses for small changes in the parameter. Theorem 2 (proof in [23]), although not as assertive as Theorem 1, states that ESP implies the continuity of the parameter-encoding map. The contractibility condition is not essential, so the theorem applies even to systems that do not satisfy it. The converse of Theorem 2 is not true (see examples in [23]) and is addressed later.

Theorem 2.

(Continuity of the parameter-encoding and the parameter-representation map) Fix an and hence . If an open-parametric driven system has ESP w.r.t.  at then the parameter-encoding map is continuous at . More generally, if has ESP w.r.t.  at then the parameter-representation map is continuous at .

It is possible to simulate the parameter-encoding map when is a scalar at discrete steps to identify ESP. When ESP is satisfied, i.e., when is a singleton of for some , numerically would contain a single highly clustered set of points, whereas this is not so when it has multiple values of . This distinction can be identified even with a small number of initial conditions. Fig. 5 shows two coordinates of the parameter-encoding map evaluated at a discrete set of points for two different RNNs. We consider one RNN with 2 neurons and another with 250 neurons, a randomly generated input of length , an input matrix and a reservoir matrix (chosen randomly, but with unit spectral radius). For the plot on the left in Fig. 5, we use 1000 initial conditions chosen on the boundary of (to exploit the fact that is an open mapping [25] and hence preserves the boundary). For the plot on the right, only 50 samples were chosen in . Clearly, in both plots of Fig. 5, the clustering towards a single point for smaller values of is obvious. Hence, one does not need to be worried about simulating the parameter-encoding map accurately if one’s task is only to identify whether has a single point-cluster. On the other hand, if one’s task is to identify instabilities due to the parameter while there is no ESP, one needs to analyze the parameter-encoding map further.

Figure 5: Two coordinates of the parameter-encoding map against an increasing sequence . For the figure on the left, a reservoir with 2 neurons and 1000 samples in was used to simulate, and on the right a reservoir with 250 neurons and 50 samples in was used.

Instability due to parameter, edge-of-criticality and the ESP threshold

If the parameter-encoding map is multivalued at an (i.e., there is no ESP), but is continuous at , then small fluctuations around would not be a practical concern, although the concern due to a failure of input robustness in view of Theorem 1 remains. If one were to identify the continuities or the discontinuities of the parameter-encoding map through its finite-time approximations , the problem of identifying the parameters that cause instability would be solved. This is because in practice it is only the continuous finite-time approximations that can be computed. Theorem 3 (proof in [23]) helps us with this task using the notion of equicontinuity for a family of functions. In essence, equicontinuity means we could describe the continuity of the entire family at once (formal definition in [25, 23]). To explain our idea, we consider the family of functions defined by , on that is not equicontinuous at ; the limit of this sequence of functions is not continuous at the point (see [23]). Suppose we take equally spaced points in the interval and compute , for large ; we observe that as a function of turns distinctively positive at , and is precisely the point where the family is not equicontinuous in the interval . In general, when one has a monotonic sequence of continuous functions like , continuity of at can be verified numerically by verifying that is close to zero; this works since and evaluated in some neighborhood of would be close by. On the other hand, the discontinuity of at can be verified numerically by verifying the function to be distinctly positive, since differs conspicuously from the values of in some neighborhood of due to the non-equicontinuity of the family at . Now since , they form a monotonically decreasing sequence of sets, and when is distinctly positive at , we could identify to be a point of discontinuity of the parameter-encoding map at . We can confirm that this idea works since the discontinuities in the parameter-encoding map arise only due to the failure of the equicontinuity in , as delineated in Theorem 3. More generally, Theorem 3 addresses the converse of Theorem 2.
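The numerical check for a loss of equicontinuity described above can be tried on a textbook example. The family `f_n(x) = x**n` on `[0, 1]` is assumed here purely for illustration and may differ from the family intended in the text; it is a decreasing sequence of continuous functions that is not equicontinuous at `x = 1`, and its pointwise limit is discontinuous there. The names `f`, `grid` and `jumps` are hypothetical.

```python
def f(n, x):
    """Member n of the assumed family f_n(x) = x**n on [0, 1]."""
    return x ** n

m = 101                                  # number of equally spaced grid points
grid = [i / (m - 1) for i in range(m)]   # 0.00, 0.01, ..., 1.00
n = 200                                  # a "large" index in the family

# Difference of f_n between consecutive grid points, as a function of position.
jumps = [abs(f(n, grid[i + 1]) - f(n, grid[i])) for i in range(m - 1)]

worst = jumps.index(max(jumps))
print("largest jump lies between", grid[worst], "and", grid[worst + 1])
print("size of that jump:", jumps[worst])
```

The consecutive differences stay near zero over most of the interval and become distinctly positive only next to `x = 1`, which is where equicontinuity fails for this family.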

Suppose is a scalar parameter in an interval , given an open-parametric driven system and an input , we define to be the ESP threshold if has ESP w.r.t.  at every while it does not have ESP w.r.t.  at every . In other words, is an ESP threshold if is a singleton of for and has multiple values of for . Note that, without loss of generality, we are sticking to an assumption that the parameter-encoding map loses stability whenever the parameter is above a threshold. The analysis to follow holds even if parameter-encoding map loses stability below a threshold, instead.

Suppose is the ESP threshold in an interval , and the parameter-encoding map is also continuous at , we would call to be a soft-ESP threshold, and call a hard-ESP threshold when parameter-encoding map is discontinuous at (examples are in [23]). If is a soft-ESP threshold, there would be a seamless transition of the parameter-encoding map from being single-valued to being multi-valued (as in (a) of Fig. 4) while crosses the soft-ESP threshold and hence there would be no instabilities (discontinuous changes) in the system response due to fluctuations in . We conjecture that a soft-ESP threshold is observed in very specific cases when the input is either a constant or periodic or in the limit turns to be a constant or periodic (see  [23]). If is a hard-ESP threshold, there would be a discontinuous change, and we call to be the input specific edge-of-criticality, a folklore term used to describe a sudden onset of “chaotic” behavior in complex systems.

Theorem 3.

Consider an open-parametric-driven system and an input . The parameter-encoding map is continuous at if and only if is equicontinuous at . In particular, when is a real-parameter in the interval and is the ESP threshold, then is a soft-ESP threshold when is equicontinuous at and a hard-threshold otherwise.

In the literature, different information-theoretic measures have been used to demonstrate quantitatively that the performance or computation of RNNs, and more generally of many other complex systems, is enhanced when they operate close to the edge-of-criticality (e.g., [4, 18]). Note that there is no clear mathematical definition of the edge-of-criticality in such literature. There are also heuristic methods to approximate the edge-of-criticality, again without a clear definition of it. Here, we have mathematically identified the input-specific edge-of-criticality in an interval of parameters as the smallest value of for which equicontinuity fails. Non-equicontinuity is essential for sensitive dependence on initial conditions, an ingredient of chaos [1].

A step-wise procedure to numerically identify the discontinuous points of the parameter-encoding map and the edge-of-criticality or hard-ESP threshold for a given (when the parameter is a scalar) is presented next (Fig. 6, discussed later, employs this procedure):

  1. Fix equally distant points in an interval ;

  2. With initial conditions of the reservoir and an input of length , simulate; .

  3. Find , where in practice between two finite sets and with identical cardinality can be computed by
    ; here denotes the minimum of each row in a matrix , and the element of is the norm between the th element of and th element of , i.e., (any -norm could be employed in finite dimensions), and is the transpose of ;

  4. Obtain the parameter-stability plot, i.e., against and identify the smallest where the plot of turns conspicuously positive as the edge-of-criticality or the hard-ESP threshold in the interval , and other where the plot is conspicuously positive as the points of discontinuities of the parameter-encoding map. If the plot has only constant behavior close to , then the parameter-encoding map is continuous everywhere.
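The steps above can be sketched as follows. This is a hedged illustration rather than the procedure's reference implementation: it assumes the plotted quantity is the Hausdorff distance between the simulated clouds at consecutive parameter values, it uses a hypothetical tolerance `tol` to decide what counts as "conspicuously positive", and `toy_simulate` is an artificial stand-in for a reservoir simulation so that the example runs on its own. The names `hausdorff`, `parameter_stability` and `first_jump` are likewise hypothetical.

```python
import math

def hausdorff(A, B):
    """Hausdorff distance between two finite point sets (lists of coordinate lists)."""
    D = [[math.dist(p, q) for q in B] for p in A]   # D[i][j] = ||A[i] - B[j]||
    a_to_b = max(min(row) for row in D)             # max of the row minima of D
    b_to_a = max(min(col) for col in zip(*D))       # same on the transpose of D
    return max(a_to_b, b_to_a)

def parameter_stability(params, simulate):
    """Distance between simulated clouds at consecutive parameter values (assumed quantity)."""
    clouds = [simulate(a) for a in params]
    return [hausdorff(clouds[i], clouds[i + 1]) for i in range(len(clouds) - 1)]

def first_jump(params, stability, tol):
    """Return the first pair of grid points whose clouds differ by more than tol, else None."""
    for i, d in enumerate(stability):
        if d > tol:
            return params[i], params[i + 1]
    return None

def toy_simulate(a):
    """Artificial stand-in: one point cluster up to a = 0.5, two separated points beyond."""
    return [[-a], [a]] if a > 0.5 else [[0.0], [0.0]]

params = [k / 10 for k in range(1, 10)]            # step 1: equally spaced grid
stability = parameter_stability(params, toy_simulate)   # steps 2 and 3
print(first_jump(params, stability, tol=0.05))     # step 4: bracket of the first jump
```

In an actual experiment, `toy_simulate` would be replaced by a function that runs the reservoir from a fixed set of initial conditions on a fixed input segment at the given parameter value and returns the final states.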

Numerical Results

Here we use the parameter-stability plot to find the hard-ESP threshold of an RNN. Upper bounds on the attributes of the reservoir matrix (like spectral radius) to ensure ESP for all possible inputs, i.e., ESP w.r.t. the entire space , are available (e.g., [28]). More commonly, the criterion of the unit spectral radius of the matrix is currently being used with the intuition that ESP holds w.r.t. all possible inputs. The disadvantage of these bounds or criteria is that they do not depend on the input, and in particular on the input’s temporal properties. In a practical application, often only a class of inputs with specific templates is employed for a given task for the network. Hence the bounds are suboptimal on two accounts: (i) the temporal relation in the input is ignored, and (ii) the gap between, say, the actual edge-of-criticality and the bound (like the unit spectral radius of the reservoir) is unknown.

Figure 6: Parameter-Stability plot: plotted against with . The smallest value of where the plot turns positive is the edge-of-criticality or the hard-ESP threshold in the interval .

With the data used in generating the plot (on the right) in Fig. 5 and noting that the reservoir matrix has a unit spectral radius, we obtain the parameter-stability plot shown in Fig. 6. The smallest for which the plot turns conspicuously positive can be identified as the hard-ESP threshold. Wherever the parameter-encoding map is discontinuous, it fails to be lower-semicontinuous, and hence it has wildly behaving explosions. Thus, the parameter-stability plot in Fig. 6 is wiggly in addition to being positive for . We remark that the scenario at a point of discontinuity of the parameter-encoding map would be similar to the behavior at a point of discontinuity of the example in (1). Note that, during simulations, a very accurate approximation of turns out not to be essential when one is only interested in determining the continuity and discontinuity points of the parameter-encoding map. Hence, even with 50 initial conditions, we can identify the discontinuous points in the parameter-stability plot. All numerical simulations are computationally inexpensive and quick ( minutes).

Several cross-checks can be made to ensure that the idea of determining the ESP threshold via the parameter-stability plot in Fig. 6 is not heavily influenced by the finite approximations of the parameter-encoding map. These include varying the step-size of (see [23, Fig. S4]) and complementary plots that show a coordinate of the parameter-encoding map and a coefficient that measures the deviation from a single point-cluster along with the parameter-stability plot (see [23, Fig. S5]). Lastly, based on [21, (ii) of Theorem 4.1], we verify that the ESP threshold increases when the input is scaled (see [23, Fig. S6] for details).


We have provided a mathematical framework in which the stability of a parametric driven system to a perturbation in the input or a design parameter can be analyzed, and presented the results in Theorem 1 and Theorem 2. Our results are applicable beyond RC schemes, for instance to fully trained recurrent neural networks [6] and to newer adaptations like deep recurrent neural networks [9], conceptors [14], and data-based evolving network models [22]. As an application to the design of an optimal scalar design parameter, we have shown how the hard-ESP threshold or the edge-of-criticality is identified as a particular discontinuity of the parameter-encoding map (Theorem 3). We remark that when inputs are drawn from a stationary ergodic process, the echo state property threshold determined with a typical input is valid for all inputs in view of the 0-1 probabilistic law of the ESP [21].

Our result to identify instabilities due to a parameter is universal and not limited to a scalar parameter. It can be applied to any parameter of the system; for instance, the input matrix or the reservoir matrix itself could be treated as a parameter. Theorem 2 would guarantee stability with respect to these matrices (parameters) whenever the ESP is satisfied. This explains the empirically observed robustness of echo state networks with respect to the random choice of reservoir or input matrices in applications. Such stability was explained in the context of a class of standard feedforward neural networks [11], but no such result is currently available for recurrent neural networks.

Future enterprising work would involve finding necessary conditions under which any input can be embedded into the reservoir. We caution the reader that the word embedding is used colloquially in some RC literature; here, embedding would mean that the input and its proxy or image in the reservoir subspace are homeomorphic images of each other. Some results on the existence of networks that can give an embedding are available in [8]. However, practically more useful work should be directed towards finding a design of the reservoir such that inputs can be embedded into the reservoir space.

Acknowledgements. The author thanks the National Research Foundation, South Africa, for supporting travel to the University of Pretoria in 2017 during which the work was initiated.