# Universality and approximation bounds for echo state networks with random weights

We study the uniform approximation of echo state networks with randomly generated internal weights. These models, in which only the readout weights are optimized during training, have achieved empirical success in learning dynamical systems. We address the representational capacity of these models by showing that they are universal under weak conditions. Our main result gives a sufficient condition for the activation function and a sampling procedure for the internal weights so that echo state networks can approximate any continuous causal time-invariant operator with high probability. In particular, for ReLU activation, we quantify the approximation error of echo state networks for sufficiently regular operators.

## 1 Introduction

Reservoir computing (Lukoševičius and Jaeger, 2009), such as echo state networks (Jaeger and Haas, 2004) and liquid state machines (Maass et al., 2002), is a paradigm for supervised learning of dynamical systems, which transforms input data into a high-dimensional space by a state-space nonlinear system called a reservoir, and performs the learning task only on the readout. Due to the simplicity of this computational framework, it has been applied to many fields and has achieved remarkable success in tasks such as temporal pattern prediction, classification and generation (Tanaka et al., 2019). Motivated by these empirical successes, researchers have devoted considerable effort to theoretically understanding the properties and performance of reservoir computing models (see, for instance, Jaeger (2001); Buehner and Young (2006); Lukoševičius and Jaeger (2009); Yildiz et al. (2012); Grigoryeva and Ortega (2019); Gonon et al. (2020a, b) and references therein).

In particular, recent studies (Grigoryeva and Ortega, 2018, 2018; Gonon and Ortega, 2020, 2021) showed that echo state networks are universal in the sense that they can approximate sufficiently regular input/output systems (i.e. operators) in various settings. However, these universality results do not account for the important property of echo state networks (and reservoir computing) that the state-space system is randomly generated, which is the major difference between reservoir computing and recurrent neural networks. To be concrete, consider the echo state network

$$
\begin{cases}
s_t = \phi(W_1 s_{t-1} + W_2 u_t + b), \\
y_t = a \cdot s_t,
\end{cases}
\tag{1.1}
$$

where $u_t$, $y_t$ and $s_t$ are the input, output and hidden state at time step $t$, and $\phi$ is a prescribed activation function. In general, the internal weights $W_1, W_2, b$ are randomly generated and only the readout $a$ is trained in a supervised learning manner. But the universality results in the mentioned papers require that all the weights depend on the target system that we want to approximate. Hence, these results cannot completely explain the approximation capacity of echo state networks. To overcome this drawback of current theories, the recent work of Gonon et al. (2020a) studied the approximation of random neural networks in a Hilbert space setting and proposed a sampling procedure for the internal weights $W_1, W_2, b$ so that echo state network (1.1) with ReLU activation is universal in the $L^2$ sense.
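The training scheme described above, with random internal weights and a readout fitted by linear regression, can be sketched as follows. This is a minimal illustration and not the sampling procedure of the paper: the Gaussian weights, the rescaling of $W_1$, the `tanh` activation, the toy target and all names (`run_reservoir`, `train_mse`) are assumptions made only for the example.

```python
import numpy as np

def run_reservoir(W1, W2, b, inputs, phi=np.tanh):
    """Iterate s_t = phi(W1 s_{t-1} + W2 u_t + b) from s = 0 and return all states."""
    s = np.zeros(W1.shape[0])
    states = []
    for u in inputs:
        s = phi(W1 @ s + W2 @ u + b)
        states.append(s)
    return np.array(states)

rng = np.random.default_rng(0)
n, d, T = 50, 1, 400                      # hidden size, input size, number of steps

# Internal weights are drawn at random and never trained (assumed Gaussian here).
W1 = rng.normal(size=(n, n))
W1 *= 0.5 / np.linalg.norm(W1, 2)         # rescale so the spectral norm is 0.5
W2 = rng.normal(size=(n, d))
b = rng.normal(size=n)

inputs = rng.uniform(-1.0, 1.0, size=(T, d))
# Toy target with memory: y_t = u_t * u_{t-1} (first step set to 0).
targets = inputs[:, 0] * np.roll(inputs[:, 0], 1)
targets[0] = 0.0

states = run_reservoir(W1, W2, b, inputs)
# Only the readout a in y_t = a . s_t is fitted, by ordinary least squares.
a, *_ = np.linalg.lstsq(states, targets, rcond=None)
train_mse = np.mean((states @ a - targets) ** 2)
```

The only trained quantity is the vector `a`; everything that determines the hidden states is fixed before the target is seen, which is the situation the universality results below are meant to cover.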

In this paper, we generalize these results and study the universality of echo state networks with randomly generated internal weights. Our main result gives a sufficient condition for the activation function $\phi$ and the sampling procedure of $W_1, W_2, b$ so that the system (1.1) can approximate any continuous time-invariant operator in the uniform ($L^\infty$) sense with high probability. In particular, for ReLU activation, we show that the echo state network (1.1) is universal when the internal weights are built from a random matrix whose entries are sampled independently from a general symmetric distribution, a constant, and fixed matrices defined by (3.5). Compared with Gonon et al. (2020a), which takes advantage of the properties of the ReLU function to construct the reservoir, our construction makes use of the concentration of probability measure, and hence it can be applied to general activation functions.

### 1.1 Notations

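We denote by $\mathbb{N}$ the set of positive integers. Let $\mathbb{Z}$ (respectively, $\mathbb{Z}_+$ and $\mathbb{Z}_-$) be the set of integers (respectively, the non-negative and non-positive integers). The ReLU function is denoted by $\sigma(x) := \max\{x, 0\}$. Throughout the paper, $\mathbb{R}^d$ is equipped with the sup-norm $\|\cdot\|_\infty$, unless explicitly stated otherwise. For any $d \in \mathbb{N}$, we use $(\mathbb{R}^d)^{\mathbb{Z}}$ to denote the set of sequences of the form $u = (\dots, u_{-1}, u_0, u_1, \dots)$, $u_t \in \mathbb{R}^d$, $t \in \mathbb{Z}$. The sets $(\mathbb{R}^d)^{\mathbb{Z}_+}$ and $(\mathbb{R}^d)^{\mathbb{Z}_-}$ consist of right and left infinite sequences of the form $(u_0, u_1, \dots)$ and $(\dots, u_{-1}, u_0)$, respectively. For any $i \le j$ and $u \in (\mathbb{R}^d)^{\mathbb{Z}}$, we denote $u_{i:j} := (u_i, \dots, u_j)$ and use the convention that, when $i = -\infty$ or $j = \infty$, it denotes the left or right infinite sequence. We will often regard $u_{i:j}$ as an element of $\mathbb{R}^{d(j-i+1)}$. The supremum norm of the sequence $u$ is denoted by $\|u\|_\infty := \sup_t \|u_t\|_\infty$. A weighting sequence $\eta = (\eta_t)_{t \in \mathbb{Z}_+}$ is a decreasing sequence with zero limit. The weight norm on $(\mathbb{R}^d)^{\mathbb{Z}_-}$ associated to $\eta$ is denoted by $\|u\|_\eta := \sup_{t \in \mathbb{Z}_-} \eta_{-t} \|u_t\|_\infty$.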

## 2 Continuous causal time-invariant operators

We study the uniform approximation problem for input/output systems, or operators, of signals in the discrete-time setting. We mainly consider operators $F$ that satisfy the following three properties:

1. Causality: For any $u, v \in ([-1,1]^d)^{\mathbb{Z}}$ and $t \in \mathbb{Z}$, $u_{-\infty:t} = v_{-\infty:t}$ implies $F(u)_t = F(v)_t$.

2. Time-invariance: $\tau_s \circ F = F \circ \tau_s$ for any $s \in \mathbb{Z}$, where $\tau_s$ is the time delay operator defined by $\tau_s(u)_t := u_{t-s}$.

3. Continuity (fading memory property): $F$ is uniformly bounded by some constant and is continuous with respect to the product topology.

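Grigoryeva and Ortega (2018) gave a comprehensive study of these operators. We recall that a causal time-invariant operator $F$ is in one-to-one correspondence with the functional $F^*$ on $([-1,1]^d)^{\mathbb{Z}_-}$ defined by $F^*(u_-) := F(u)_0$, where $u$ is any extension of $u_-$ such that $u_{-\infty:0} = u_-$ ($F^*$ is well-defined by causality). We can reconstruct $F$ from $F^*$ by $F(u)_t = F^*(u_{-\infty:t})$, where $u_{-\infty:t}$ is obtained from $u$ by a time shift followed by the natural projection onto $([-1,1]^d)^{\mathbb{Z}_-}$. In particular, for any causal time-invariant operators $F, G$,

$$
\sup_{u \in ([-1,1]^d)^{\mathbb{Z}}} \|F(u) - G(u)\|_\infty = \sup_{u_- \in ([-1,1]^d)^{\mathbb{Z}_-}} |F^*(u_-) - G^*(u_-)|.
$$

Hence, the approximation problem for causal time-invariant operators can be reduced to the approximation of functionals on $([-1,1]^d)^{\mathbb{Z}_-}$.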

We remark that the product topology on $(\mathbb{R}^d)^{\mathbb{Z}_-}$ is different from the uniform topology induced by the sup-norm. However, for the set of uniformly bounded sequences, the product topology coincides with the topology induced by the weight norm $\|\cdot\|_\eta$ of any weighting sequence $\eta$. It is shown by Grigoryeva and Ortega (2018) that the causal time-invariant operator $F$ is continuous (i.e. has the fading memory property) if and only if the corresponding functional $F^*$ is continuous with respect to the product topology, which is equivalent to $F^*$ having the fading memory property with respect to some (and hence any) weighting sequence $\eta$, i.e. $F^*$ is a continuous function on the compact metric space $(([-1,1]^d)^{\mathbb{Z}_-}, \|\cdot\|_\eta)$. In other words, for any $\epsilon > 0$, there exists a $\delta(\epsilon) > 0$ such that for any $u, v \in ([-1,1]^d)^{\mathbb{Z}_-}$,

$$
\|u - v\|_\eta = \sup_{t \in \mathbb{Z}_-} \eta_{-t} \|u_t - v_t\|_\infty < \delta(\epsilon) \quad \text{implies} \quad |F^*(u) - F^*(v)| < \epsilon.
$$

Next, we introduce some notation to quantify the regularity of causal time-invariant operators. For any $m \in \mathbb{N}$, we can associate a function $F^*_m$ on $([-1,1]^d)^m$ to any causal time-invariant operator $F$ by

$$
F^*_m(u_{-m+1}, \dots, u_0) := F^*(\cdots, 0, u_{-m+1}, \dots, u_0). \tag{2.1}
$$

If the functional $F^*$ can be approximated by $F^*_m$ arbitrarily well, we say $F$ has approximately finite memory. The following definition quantifies this approximation.

###### Definition 2.1 (Approximately finite memory).

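For any causal time-invariant operator $F$, let $m \in \mathbb{N}$ and $\epsilon \ge 0$. We denote

$$
\begin{aligned}
E_F(m) &:= \sup_{u \in ([-1,1]^d)^{\mathbb{Z}_-}} |F^*(u) - F^*_m(u_{-m+1:0})|, \\
m_F(\epsilon) &:= \min\{m \in \mathbb{N} : E_F(m) \le \epsilon\}.
\end{aligned}
$$

If $m_F(\epsilon) < \infty$ for all $\epsilon > 0$, we say $F$ has approximately finite memory. If $m_F(0) < \infty$, we say $F$ has finite memory.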

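Note that $m_F(\epsilon)$ is a non-increasing function of $\epsilon$. By the time-invariance of $F$, for any $t \in \mathbb{Z}$,

$$
\sup_{u \in ([-1,1]^d)^{\mathbb{Z}}} |F(u)_t - F^*_m(u_{t-m+1:t})| = \sup_{u \in ([-1,1]^d)^{\mathbb{Z}}} |F^*(u_{-\infty:t}) - F^*_m(u_{t-m+1:t})| = E_F(m). \tag{2.2}
$$

Hence, $E_F(m)$ quantifies how well $F$ can be approximated by functionals with finite memory.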
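As a worked illustration of Definition 2.1, consider the hypothetical scalar functional $F^*(u) = \sum_{k \ge 0} \lambda^k u_{-k}$ with $0 < \lambda < 1$ and inputs in $[-1,1]$ (this example is not from the paper). Truncating to the last $m$ inputs drops the tail $\sum_{k \ge m} \lambda^k u_{-k}$, whose supremum is attained at $u_{-k} = 1$, so $E_F(m) = \lambda^m / (1 - \lambda)$. The sketch below evaluates this closed form and the resulting memory length; the function names are assumptions made for the example.

```python
def memory_error(lam, m):
    """E_F(m) for F*(u) = sum_{k>=0} lam**k * u_{-k} with inputs in [-1, 1]."""
    return lam ** m / (1.0 - lam)

def memory_length(lam, eps):
    """m_F(eps): the smallest positive integer m with E_F(m) <= eps."""
    m = 1
    while memory_error(lam, m) > eps:
        m += 1
    return m

# With lam = 0.5 the truncation error halves with every extra input kept.
errors = [memory_error(0.5, m) for m in (1, 2, 3)]   # [1.0, 0.5, 0.25]
needed = memory_length(0.5, 0.1)                      # 5
```

Since $E_F(m) \to 0$ but never reaches zero, this functional has approximately finite memory without having finite memory.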

###### Definition 2.2 (Modulus of continuity).

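Suppose $F$ has the fading memory property with respect to some weighting sequence $\eta$. We denote the modulus of continuity with respect to $\|\cdot\|_\eta$ by

$$
\omega_{F^*}(\delta; \eta) := \sup\{|F^*(u) - F^*(u')| : u, u' \in ([-1,1]^d)^{\mathbb{Z}_-}, \ \|u - u'\|_\eta \le \delta\},
$$

and the inverse modulus of continuity

$$
\omega^{-1}_{F^*}(\epsilon; \eta) := \sup\{\delta > 0 : \omega_{F^*}(\delta; \eta) \le \epsilon\}.
$$

Similarly, the modulus of continuity (with respect to $\|\cdot\|_\infty$) of a continuous function $f$ on $[-1,1]^m$ and the inverse modulus of continuity are defined by

$$
\begin{aligned}
\omega_f(\delta) &:= \sup\{|f(x) - f(x')| : x, x' \in [-1,1]^m, \ \|x - x'\|_\infty \le \delta\}, \\
\omega^{-1}_f(\epsilon) &:= \sup\{\delta > 0 : \omega_f(\delta) \le \epsilon\}.
\end{aligned}
$$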
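For a scalar function ($m = 1$) the modulus of continuity can be estimated numerically by searching over a grid. The sketch below is only an illustration under that assumption: a grid search gives a lower bound on $\omega_f(\delta)$ in general, and the helper name and grid size are hypothetical. For $f(x) = x^2$ on $[-1,1]$ the exact value is $\omega_f(\delta) = 2\delta - \delta^2$, which the grid reproduces at $\delta = 0.5$.

```python
def modulus_of_continuity(f, delta, steps=200):
    """Grid estimate of omega_f(delta) = sup{|f(x) - f(x')| : |x - x'| <= delta} on [-1, 1]."""
    h = 2.0 / steps                       # grid spacing
    xs = [-1.0 + k * h for k in range(steps + 1)]
    reach = int(delta / h + 1e-9)         # largest index gap with |x - x'| <= delta
    worst = 0.0
    for i in range(len(xs)):
        # Only pairs at most `reach` grid points apart satisfy the constraint.
        for j in range(i, min(i + reach, steps) + 1):
            worst = max(worst, abs(f(xs[i]) - f(xs[j])))
    return worst

estimate = modulus_of_continuity(lambda x: x * x, 0.5)
exact = 2 * 0.5 - 0.5 ** 2                # 0.75, attained at x = 1, x' = 0.5
```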

The next proposition quantifies the continuity of causal time-invariant operators by the approximately finite memory and the modulus of continuity. This proposition is a modification of a similar result in Hanson and Raginsky (2019) to our setting. It shows that a causal time-invariant operator is continuous if and only if it has approximately finite memory and each $F^*_m$ is continuous.

###### Proposition 2.3.

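Let $F$ be a causal time-invariant operator.

1. If $F$ has approximately finite memory and $\omega^{-1}_{F^*_m}(\epsilon) > 0$ for any $m \in \mathbb{N}$ and $\epsilon > 0$, then $F$ has the fading memory property with respect to any weighting sequence $\eta$: for any $\epsilon > 0$, $m = m_F(\epsilon/3)$ and $u, v \in ([-1,1]^d)^{\mathbb{Z}_-}$,

$$
\|u - v\|_\eta \le \eta_{m-1} \, \omega^{-1}_{F^*_m}(\epsilon/3) \quad \text{implies} \quad |F^*(u) - F^*(v)| \le \epsilon.
$$

In other words, $\omega^{-1}_{F^*}(\epsilon; \eta) \ge \eta_{m-1} \, \omega^{-1}_{F^*_m}(\epsilon/3)$ for $m = m_F(\epsilon/3)$.

2. If $F$ has the fading memory property with respect to some weighting sequence $\eta$, then $F$ has approximately finite memory with $E_F(m) \le \omega_{F^*}(\eta_m; \eta)$,

$$
m_F(\epsilon) \le \min\{m \in \mathbb{N} : \eta_m \le \omega^{-1}_{F^*}(\epsilon; \eta)\},
$$

and $\omega_{F^*_m}(\delta) \le \omega_{F^*}(\delta; \eta)$ for any $m \in \mathbb{N}$.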

###### Proof.

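For the first part, by the definition of $m = m_F(\epsilon/3)$,

$$
\sup_{u \in ([-1,1]^d)^{\mathbb{Z}_-}} |F^*(u) - F^*_m(u_{-m+1:0})| \le \epsilon/3.
$$

Since the weighting sequence is decreasing, $\eta_{-t} \ge \eta_{m-1}$ for $-m+1 \le t \le 0$, so

$$
\|u_{-m+1:0} - v_{-m+1:0}\|_\infty \le \eta_{m-1}^{-1} \|u - v\|_\eta \le \omega^{-1}_{F^*_m}(\epsilon/3),
$$

which implies $|F^*_m(u_{-m+1:0}) - F^*_m(v_{-m+1:0})| \le \epsilon/3$. Therefore,

$$
\begin{aligned}
|F^*(u) - F^*(v)| &\le |F^*(u) - F^*_m(u_{-m+1:0})| + |F^*_m(u_{-m+1:0}) - F^*_m(v_{-m+1:0})| + |F^*_m(v_{-m+1:0}) - F^*(v)| \\
&\le \epsilon/3 + \epsilon/3 + \epsilon/3 = \epsilon.
\end{aligned}
$$

For the second part, we observe that, for any $m \in \mathbb{N}$ and $u \in ([-1,1]^d)^{\mathbb{Z}_-}$,

$$
\|u - (\dots, 0, u_{-m+1:0})\|_\eta = \sup_{t \le -m} \eta_{-t} \|u_t\|_\infty \le \eta_m.
$$

Then, by the definition of $\omega_{F^*}$,

$$
E_F(m) = \sup_{u \in ([-1,1]^d)^{\mathbb{Z}_-}} |F^*(u) - F^*(\dots, 0, u_{-m+1:0})| \le \omega_{F^*}(\eta_m; \eta).
$$

If $m$ satisfies $\eta_m \le \omega^{-1}_{F^*}(\epsilon; \eta)$, then $E_F(m) \le \omega_{F^*}(\eta_m; \eta) \le \epsilon$. Hence,

$$
m_F(\epsilon) = \min\{m \in \mathbb{N} : E_F(m) \le \epsilon\} \le \min\{m \in \mathbb{N} : \eta_m \le \omega^{-1}_{F^*}(\epsilon; \eta)\}.
$$

Finally, by the definition of $\omega_{F^*_m}$ and $\omega_{F^*}$, we have

$$
\begin{aligned}
\omega_{F^*_m}(\delta) &= \sup\{|F^*(\cdots, 0, u_{-m+1:0}) - F^*(\cdots, 0, u'_{-m+1:0})| : \|u_{-m+1:0} - u'_{-m+1:0}\|_\infty \le \delta\} \\
&\le \omega_{F^*}(\delta; \eta).
\end{aligned}
$$

∎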

In order to approximate the continuous causal time-invariant operator $F$, we only need to approximate the functional $F^*$, which can be approximated by the continuous function $F^*_m$ if $m$ is chosen sufficiently large. Hence, any approximation theory of continuous functions can be translated to an approximation result for continuous causal time-invariant operators. For instance, if we approximate $F^*_m$ by some function $h$, then we can approximate $F$ by

$$
y_t = h(u_{t-m+1}, \dots, u_t), \quad t \in \mathbb{Z}.
$$

The function $h$ uniquely determines a causal time-invariant operator through this relation. Since this operator has finite memory, it is continuous if and only if $h$ is continuous by Proposition 2.3. When we approximate $F^*_m$ by polynomials, the resulting operator is the Volterra series (Boyd and Chua, 1985). When $h$ is a neural network, the resulting operator is a temporal convolutional neural network, studied by Hanson and Raginsky (2019).
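The finite-memory construction $y_t = h(u_{t-m+1}, \dots, u_t)$ can be sketched as a sliding window over the input. The sketch works on a finite scalar sequence and pads the unseen past with zeros, which is an assumption made here for illustration (the text works with bi-infinite sequences); the name `window_operator` is hypothetical.

```python
def window_operator(h, m):
    """Return the map u -> y with y_t = h(u_{t-m+1}, ..., u_t) on finite sequences."""
    def apply(u):
        # Inputs before the start of the sequence are taken to be zero.
        padded = [0.0] * (m - 1) + list(u)
        return [h(padded[t:t + m]) for t in range(len(u))]
    return apply

# Example: h sums the last two inputs, so y_t = u_{t-1} + u_t.
F = window_operator(sum, 2)
outputs = F([1.0, 2.0, 3.0])          # [1.0, 3.0, 5.0]
```

By construction the output at time $t$ depends only on inputs up to $t$ (causality), and delaying the input delays the output by the same amount (time-invariance).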

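In this paper, we focus on the approximation by echo state networks (ESN), which are special state space models of the form

$$
\begin{cases}
s_t = \phi(W_1 s_{t-1} + W_2 u_t + b), \\
y_t = a \cdot s_t,
\end{cases}
\tag{2.3}
$$

where $s_t, a, b \in \mathbb{R}^n$, $W_1 \in \mathbb{R}^{n \times n}$, $W_2 \in \mathbb{R}^{n \times d}$, and the activation function $\phi$ is applied element-wise.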

###### Definition 2.4 (Existence of solutions and echo state property).

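We say the system (2.3) has the existence of solutions property if for any $u \in ([-1,1]^d)^{\mathbb{Z}}$, there exists $(s_t)_{t \in \mathbb{Z}}$ such that $s_t = \phi(W_1 s_{t-1} + W_2 u_t + b)$ holds for each $t \in \mathbb{Z}$. If the solution is unique, we say the system has the echo state property.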

Grigoryeva and Ortega (2018, Theorem 3.1) gave sufficient conditions for the system (2.3) to have the existence of solutions property and the echo state property. In particular, they showed that, if $\phi$ is a bounded continuous function, then the existence of solutions property holds. If $\phi$ is a bounded Lipschitz continuous function with Lipschitz constant $L_\phi$ and $L_\phi \|W_1\| < 1$, then the system (2.3) has the echo state property, where $\|W_1\|$ is the operator norm of the matrix $W_1$. As a sufficient condition to ensure the echo state property, the hypothesis $L_\phi \|W_1\| < 1$ has been extensively studied in the ESN literature (Jaeger, 2001; Jaeger and Haas, 2004; Buehner and Young, 2006; Yildiz et al., 2012; Gandhi and Jaeger, 2013).
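The contraction condition can be checked numerically: when the activation has Lipschitz constant 1 and the operator norm of $W_1$ is 0.5, two state trajectories driven by the same input forget their initial states at a geometric rate. The sketch below is an illustration only; it assumes `tanh` as the activation, the spectral (Euclidean) operator norm, and Gaussian weights, none of which are prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, T = 20, 2, 20

W1 = rng.normal(size=(n, n))
W1 *= 0.5 / np.linalg.norm(W1, 2)     # spectral norm 0.5; tanh is 1-Lipschitz
W2 = rng.normal(size=(n, d))
b = rng.normal(size=n)
inputs = rng.uniform(-1.0, 1.0, size=(T, d))

# Two different initial states, driven by the same input sequence.
s = rng.uniform(-1.0, 1.0, size=n)
s_alt = rng.uniform(-1.0, 1.0, size=n)
gaps = [np.linalg.norm(s - s_alt)]
for u in inputs:
    s = np.tanh(W1 @ s + W2 @ u + b)
    s_alt = np.tanh(W1 @ s_alt + W2 @ u + b)
    gaps.append(np.linalg.norm(s - s_alt))   # shrinks by at least a factor 0.5 per step
```

The geometric decay of `gaps` is the mechanism behind uniqueness of the solution: the state at time $t$ is determined by the input history alone.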

If the system (2.3) has the existence of solutions property, the axiom of choice allows us to assign a solution $(s_t)_{t \in \mathbb{Z}}$ to each input $u$, and hence define a functional on left infinite inputs by $u_{-\infty:0} \mapsto y_0 = a \cdot s_0$. Thus, we can assign a causal time-invariant operator $F_{ESN}$ to the system such that $F_{ESN}(u)_t = y_t$. When the echo state property holds, this operator is unique. The operator is continuous if and only if the mapping $u_{-\infty:0} \mapsto y_0$ is a continuous function. In the next section, we study the universality of these operators.

## 3 Universal approximation

As mentioned in the introduction, the recent works of Grigoryeva and Ortega (2018); Gonon and Ortega (2021) showed that the echo state networks (2.3) are universal: Assume $\phi$ is a bounded Lipschitz continuous function. Let $F$ be a continuous causal time-invariant operator. Then for any $\epsilon > 0$ and sufficiently large $n$, there exists an ESN (2.3) such that the corresponding causal time-invariant operator $F_{ESN}$ satisfies

$$
\sup_{u \in ([-1,1]^d)^{\mathbb{Z}}} \|F(u) - F_{ESN}(u)\|_\infty \le \epsilon. \tag{3.1}
$$

In this universal approximation theorem, the weights in the network (2.3) depend on the target operator $F$. However, in practice, the parameters $W_1, W_2, b$ are drawn at random from a given distribution and only the readout vector $a$ is trained by linear regression using observed data related to the target operator $F$. Hence, this universal approximation theorem cannot completely explain the empirical performance of echo state networks.

In this section, our goal is to show that, with randomly generated weights, echo state networks are universal: For any $\epsilon > 0$ and sufficiently large $n$, with high probability on $W_1, W_2, b$, which are drawn from a certain distribution, there exists $a$ such that we can associate a causal time-invariant operator $F_{ESN}$ to the ESN (2.3) and the approximation bound (3.1) holds. In the context of standard feed-forward neural networks, one can show that a similar universal approximation theorem holds for random neural networks. This will be the building block of our main theorem on echo state networks.

### 3.1 Universality of random neural networks

It is well-known that feed-forward neural networks with one hidden layer are universal. We recall the universal approximation theorem proved by Leshno et al. (1993).

###### Theorem 3.1.

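If $\phi: \mathbb{R} \to \mathbb{R}$ is continuous and is not a polynomial, then for any compact set $X \subseteq \mathbb{R}^d$, any function $f \in C(X)$ and $\epsilon > 0$, there exist $n \in \mathbb{N}$ and $a_i, b_i \in \mathbb{R}$, $w_i \in \mathbb{R}^d$, $i = 1, \dots, n$, such that

$$
\sup_{x \in X} \left| f(x) - \sum_{i=1}^n a_i \phi(w_i \cdot x + b_i) \right| \le \epsilon. \tag{3.2}
$$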

In Theorem 3.1, the parameters $a_i, w_i, b_i$ depend on the target function $f$. In order to take into account the fact that for ESN the inner weights are randomly chosen and only the $a_i$ are trained from data, we consider random neural networks whose weights $(w_i^\top, b_i)$ are drawn from some probability distribution $\mu$. This motivates the following definition.

###### Definition 3.2.

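Suppose $(w_i^\top, b_i)_{i \in \mathbb{N}}$ is a sequence of i.i.d. random vectors drawn from a probability distribution $\mu$ defined on $\mathbb{R}^{d+1}$. If for any compact set $X \subseteq \mathbb{R}^d$, any function $f \in C(X)$ and $\epsilon, \delta \in (0,1)$, there exists $n \in \mathbb{N}$ such that, with probability at least $1 - \delta$, the inequality (3.2) holds for some $a_1, \dots, a_n \in \mathbb{R}$, then we say the pair $(\phi, \mu)$ is universal.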

###### Remark 3.3.

In Definition 3.2, we only require the existence of neural networks that converge to the target function in probability. This requirement is actually equivalent to almost sure convergence. To see this, convergence in probability implies that there exist a sub-sequence $n_k$ and functions $f_{n_k}$, each a linear combination of $\phi(w_i \cdot x + b_i)$, $i = 1, \dots, n_k$, converging to the target function almost surely as $k \to \infty$. Notice that, for any $n \ge n_k$, $f_{n_k}$ is also a linear combination of $\phi(w_i \cdot x + b_i)$, $i = 1, \dots, n$. Hence, the almost sure convergence holds for all $n$.

The universality of random neural networks was widely studied in the context of extreme learning machines (Huang et al., 2006, 2006, 2012). In particular, Huang et al. (2006) used an incremental construction to establish the random universal approximation theorem in $L^2$-norm for bounded non-constant piecewise continuous activation functions. The recent work of Hart et al. (2020) considered approximation in another norm under an additional assumption on the activation function. They argued that, since there exists a neural network that approximates the target function by universality of neural networks, there will eventually be some randomly generated samples that are close to the weights of this network, and we can discard the other samples by setting the corresponding $a_i = 0$; hence the random universal approximation holds. It is possible to generalize their argument to the approximation of continuous functions. Nevertheless, we will give an alternative approach based on the law of large numbers and show that $(\phi, \mu)$ is universal under very weak conditions. Our analysis will need the following uniform law of large numbers.

###### Lemma 3.4 (Jennrich (1969), Theorem 2).

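Let $w_1, w_2, \dots$ be i.i.d. samples from some probability distribution $\mu$. Suppose

1. $X$ is compact;

2. $f(x, w)$ is continuous in $x$ for almost all $w$, and measurable in $w$ for each $x$;

3. there exists $g$ with $|f(x, w)| \le g(w)$ for all $x \in X$ and $\mathbb{E}_w[g(w)] < \infty$.

Then $\mathbb{E}_w[f(x, w)]$ is continuous on $X$ and

$$
\sup_{x \in X} \left| \frac{1}{n} \sum_{i=1}^n f(x, w_i) - \mathbb{E}_w[f(x, w)] \right| \to 0 \quad \text{almost surely as } n \to \infty.
$$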
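A quick Monte Carlo experiment illustrates the uniform law of large numbers. The example is an assumption chosen because the expectation has a closed form: for $f(x, w) = \cos(wx)$ with $w \sim N(0,1)$, one has $\mathbb{E}_w[\cos(wx)] = e^{-x^2/2}$, and $|f| \le 1$ gives the integrable bound. The supremum over $X = [-1,1]$ is replaced by a maximum over a finite grid, so the sketch only approximates the quantity in the lemma.

```python
import math
import random

def sup_deviation(n, grid, rng):
    """Max over the grid of |(1/n) sum_i cos(w_i x) - exp(-x^2 / 2)| with w_i ~ N(0, 1)."""
    ws = [rng.gauss(0.0, 1.0) for _ in range(n)]
    worst = 0.0
    for x in grid:
        empirical = sum(math.cos(w * x) for w in ws) / n
        worst = max(worst, abs(empirical - math.exp(-x * x / 2.0)))
    return worst

grid = [k / 10.0 for k in range(-10, 11)]   # finite grid standing in for X = [-1, 1]
rng = random.Random(0)
deviations = {n: sup_deviation(n, grid, rng) for n in (10, 100, 5000)}
```

The deviation for the largest sample size should be small, since each empirical mean has standard deviation of order $n^{-1/2}$.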

Now, we give a sufficient condition for the pair $(\phi, \mu)$ to be universal. Our proof is a combination of the uniform law of large numbers (Lemma 3.4) and the universality of neural networks (Theorem 3.1).

###### Theorem 3.5.

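Suppose the continuous function $\phi: \mathbb{R} \to \mathbb{R}$ is not a polynomial and $|\phi(x)| \le C(1 + |x|^\alpha)$ for some $C, \alpha > 0$. If $\mu$ is a probability distribution with full support on $\mathbb{R}^{d+1}$ such that $\int \|(w^\top, b)\|_\infty^\alpha \, d\mu(w^\top, b) < \infty$, then $(\phi, \mu)$ is universal.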

###### Proof.

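By the Hahn-Banach theorem, for any compact set $X \subseteq \mathbb{R}^d$, the linear span of a function class $\mathcal{H} \subseteq C(X)$ is dense in $C(X)$ if and only if

$$
\left\{ \nu \in M(X) : \int_X h(x) \, d\nu(x) = 0, \ \forall h \in \mathcal{H} \right\} = \{0\},
$$

where $M(X)$ is the dual space of $C(X)$, that is, the space of all signed Radon measures with finite total variation (Folland, 1999).

We consider the linear space

$$
\mathcal{H}_\mu := \left\{ h(x) = \int_{\mathbb{R}^{d+1}} g(w, b) \phi(w \cdot x + b) \, d\mu(w^\top, b) : g \in L^\infty(\mu) \right\}.
$$

Since $g \in L^\infty(\mu)$, by assumption and Lemma 3.4, any $h \in \mathcal{H}_\mu$ is continuous and hence $\mathcal{H}_\mu \subseteq C(X)$. Suppose $\nu \in M(X)$ satisfies $\int_X h(x) \, d\nu(x) = 0$ for all $h \in \mathcal{H}_\mu$. Then, by Fubini's theorem,

$$
\begin{aligned}
0 &= \int_X \int_{\mathbb{R}^{d+1}} g(w, b) \phi(w \cdot x + b) \, d\mu(w^\top, b) \, d\nu(x) \\
&= \int_{\mathbb{R}^{d+1}} g(w, b) \int_X \phi(w \cdot x + b) \, d\nu(x) \, d\mu(w^\top, b),
\end{aligned}
$$

for all functions $g \in L^\infty(\mu)$. Therefore,

$$
\int_X \phi(w \cdot x + b) \, d\nu(x) = 0 \quad \mu\text{-almost surely}.
$$

Since this function of $(w^\top, b)$ is continuous and $\mu$ has full support, the equality holds for all $(w^\top, b) \in \mathbb{R}^{d+1}$. By Theorem 3.1, the linear span of $\{\phi(w \cdot x + b) : (w^\top, b) \in \mathbb{R}^{d+1}\}$ is dense in $C(X)$, which implies $\nu = 0$. We conclude that $\mathcal{H}_\mu$ is dense in $C(X)$. In other words, for any $f \in C(X)$ and $\epsilon > 0$, there exists $h \in \mathcal{H}_\mu$ such that $|f(x) - h(x)| \le \epsilon/2$ for all $x \in X$.

By Lemma 3.4, for any $\delta \in (0,1)$, there exists $n \in \mathbb{N}$ such that, with probability at least $1 - \delta$ on the samples $(w_i^\top, b_i)_{i=1}^n$ from $\mu$,

$$
\sup_{x \in X} \left| \frac{1}{n} \sum_{i=1}^n g(w_i, b_i) \phi(w_i \cdot x + b_i) - h(x) \right| \le \frac{\epsilon}{2}.
$$

By the triangle inequality, we conclude that (3.2) holds with $a_i = g(w_i, b_i)/n$ and the pair $(\phi, \mu)$ is universal. ∎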
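The statement can be probed empirically with random ReLU features: the inner weights $(w_i, b_i)$ are sampled once and only the outer weights $a_i$ are fitted. The sketch below is an illustration under assumptions that are not from the paper: standard Gaussian sampling (full support, finite moments), a scalar input, a toy target, and a least-squares fit on a grid in place of the sup-norm bound in (3.2).

```python
import numpy as np

def relu_features(x, w, b):
    """Feature matrix with entries relu(w_i * x_j + b_i) for scalar inputs x_j."""
    return np.maximum(np.outer(x, w) + b, 0.0)

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 400)           # grid on the compact set X = [-1, 1]
f = np.sin(3.0 * x)                       # toy target, chosen only for illustration
w, b = rng.normal(size=200), rng.normal(size=200)   # inner weights, never trained

def fit_error(n):
    """Fit the outer weights on the first n random features; return (RMSE, max error)."""
    Phi = relu_features(x, w[:n], b[:n])
    a, *_ = np.linalg.lstsq(Phi, f, rcond=None)
    resid = Phi @ a - f
    return np.sqrt(np.mean(resid ** 2)), np.max(np.abs(resid))

rmse_small, sup_small = fit_error(10)
rmse_large, sup_large = fit_error(200)
```

Because the first 10 features are a subset of the first 200, the least-squares error cannot increase when more samples are used, which mirrors the nesting argument in Remark 3.3.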

In practice, it is more convenient to sample each weight independently from a certain distribution $\mu$ on $\mathbb{R}$, so that $(w^\top, b)$ is sampled from $\mu^{d+1}$, the $(d+1)$-fold product of $\mu$. The next corollary is a direct application of Theorem 3.5 to this situation.

###### Corollary 3.6.

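Suppose the continuous function $\phi: \mathbb{R} \to \mathbb{R}$ is not a polynomial and $|\phi(x)| \le C(1 + |x|^\alpha)$ for some $C, \alpha > 0$. If $\mu$ is a probability distribution with full support on $\mathbb{R}$ such that $\int |x|^\alpha \, d\mu(x) < \infty$, then for any $d \in \mathbb{N}$, the pair $(\phi, \mu^{d+1})$ is universal.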

###### Proof.

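When $\alpha \le 1$,

$$
\int \|(w^\top, b)\|_\infty^\alpha \, d\mu^{d+1}(w^\top, b) \le \int \|(w^\top, b)\|_1^\alpha \, d\mu^{d+1}(w^\top, b) \le \int \left( |b|^\alpha + \sum_{i=1}^d |w_i|^\alpha \right) d\mu^{d+1}(w^\top, b) < \infty.
$$

When $\alpha > 1$, since all norms on $\mathbb{R}^{d+1}$ are equivalent,

$$
\int \|(w^\top, b)\|_\infty^\alpha \, d\mu^{d+1}(w^\top, b) \le C \int \|(w^\top, b)\|_\alpha^\alpha \, d\mu^{d+1}(w^\top, b) = C \int \left( |b|^\alpha + \sum_{i=1}^d |w_i|^\alpha \right) d\mu^{d+1}(w^\top, b) < \infty.
$$

In either case, the pair $(\phi, \mu^{d+1})$ satisfies the condition in Theorem 3.5, hence it is universal. ∎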

So far, we have assumed that $\mu$ has full support. When the activation is the ReLU function, this assumption can be weakened due to the absolute homogeneity of ReLU.

###### Corollary 3.7.

For the ReLU function $\sigma$, if $\mu$ is a probability distribution on $\mathbb{R}$ whose support contains the interval $[-r, r]$ for some $r > 0$, then $(\sigma, \mu^{d+1})$ is universal for any $d \in \mathbb{N}$.

###### Proof.

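We consider the continuous mapping $T: \mathbb{R}^{d+1} \to [-r, r]^{d+1}$ defined by

$$
T(w^\top, b) =
\begin{cases}
(w^\top, b), & \|(w^\top, b)\|_\infty \le r, \\
\dfrac{r}{\|(w^\top, b)\|_\infty} (w^\top, b), & \|(w^\top, b)\|_\infty > r.
\end{cases}
$$

Let $\tilde{\mu}$ be the push-forward measure of $\mu^{d+1}$ under $T$, defined by $\tilde{\mu}(S) = \mu^{d+1}(T^{-1}(S))$ for any measurable set $S$. Then, the support of $\tilde{\mu}$ is $[-r, r]^{d+1}$ by assumption.

We first show that $(\sigma, \tilde{\mu})$ is universal. As in the proof of Theorem 3.5, if $\nu \in M(X)$ satisfies $\int_X h(x) \, d\nu(x) = 0$ for all $h \in \mathcal{H}_{\tilde{\mu}}$, then by Fubini's theorem,

$$
\int_X \sigma(\tilde{w} \cdot x + \tilde{b}) \, d\nu(x) = 0
$$

holds for all $(\tilde{w}^\top, \tilde{b}) \in [-r, r]^{d+1}$. By the absolute homogeneity of $\sigma$, this equation actually holds for all $(\tilde{w}^\top, \tilde{b}) \in \mathbb{R}^{d+1}$. The argument in the proof of Theorem 3.5 implies that $(\sigma, \tilde{\mu})$ is universal.

Observe that any sample $(w_i^\top, b_i)$ from $\mu^{d+1}$ corresponds to a sample $(\tilde{w}_i^\top, \tilde{b}_i) = T(w_i^\top, b_i)$ from $\tilde{\mu}$, and

$$
\sum_{i=1}^n a_i \sigma(w_i \cdot x + b_i) = \sum_{i=1}^n a_i c_i \sigma(\tilde{w}_i \cdot x + \tilde{b}_i),
$$

where $c_i = 1$ if $\|(w_i^\top, b_i)\|_\infty \le r$ and $c_i = \|(w_i^\top, b_i)\|_\infty / r$ otherwise. We conclude that $(\sigma, \mu^{d+1})$ is universal. ∎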
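The rescaling step in this proof relies on the identity $\sigma(cz) = c\,\sigma(z)$ for $c > 0$. The sketch below checks it on one neuron: parameters outside the sup-norm ball of radius $r$ are projected radially onto it and the lost scale is absorbed into the outer coefficient. The helper names are hypothetical, and the value of the factor $c$ is derived here from homogeneity.

```python
def relu(z):
    return max(z, 0.0)

def clip_to_box(params, r):
    """Return (T(params), c): radial projection onto the sup-norm ball of radius r."""
    norm = max(abs(p) for p in params)
    if norm <= r:
        return list(params), 1.0
    return [r * p / norm for p in params], norm / r

def neuron(params, x):
    """relu(w . x + b) with params = (w_1, ..., w_d, b)."""
    *w, b = params
    return relu(sum(wi * xi for wi, xi in zip(w, x)) + b)

params = [3.0, -4.0, 2.0]                 # (w_1, w_2, b) with sup-norm 4
tilde, c = clip_to_box(params, 1.0)       # ([0.75, -1.0, 0.5], 4.0)
x = [1.0, 0.5]
lhs = neuron(params, x)                   # relu(3 - 2 + 2) = 3
rhs = c * neuron(tilde, x)                # 4 * relu(0.75 - 0.5 + 0.5) = 3
```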

### 3.2 Universality of echo state networks

In this section, we will state and prove the random universal approximation theorem for echo state networks. Our analysis is based on the uniform law of large numbers and the universality of random feed-forward neural networks. For simplicity, we will assume that all internal weights in the network have the same distribution $\mu$ from now on. We make the following assumption on the activation function $\phi$ and the distribution $\mu$.

###### Assumption 3.8.

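For any $m \in \mathbb{N}$, the pair $(\phi, \mu^{m+1})$ is universal and there exists a measurable mapping $\varphi_{\mu,m}: \mathbb{R}^{m+1} \to \mathbb{R}^m$ such that

$$
x = \mathbb{E}_{(w^\top, b) \sim \mu^{m+1}} \left[ \varphi_{\mu,m}(w, b) \phi(w \cdot x + b) \right], \quad x \in [-2, 2]^m, \tag{3.3}
$$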

and for some satisfying .

Corollary 3.6 gives a sufficient condition for the pair $(\phi, \mu^{m+1})$ to be universal. However, in Assumption 3.8, we also require the existence of a function $\varphi_{\mu,m}$ that satisfies equation (3.3), which may be difficult to check for general activation functions. Nevertheless, we will show that the assumption holds true when the activation function is ReLU and $\mu$ is symmetric (see Corollary 3.11).

By Lemma 3.4, Assumption 3.8 ensures that if we have $n$ i.i.d. samples $(w_i^\top, b_i)_{i=1}^n$ from $\mu^{m+1}$, we can approximately reconstruct the input $x$ by

$$
x \approx \frac{1}{n} \sum_{i=1}^n \varphi_{\mu,m}(w_i, b_i) \phi(w_i \cdot x + b_i).
$$

In other words, the features $\phi(w_i \cdot x + b_i)$ contain enough information about the input and we can approximately recover it using $\varphi_{\mu,m}(w_i, b_i)/n$ as coefficients. For echo state networks, this assumption guarantees that the hidden state $s_t$ does not lose too much information about the history of the input, and hence we can view it as a "reservoir" and approximate the desired function at any time step.

###### Theorem 3.9.

Suppose and the distribution satisfy Assumption 3.8. For , let be i.i.d. samples from , and define the ESN

 {st=ϕ