1 Introduction
Reservoir computing (Lukoševičius and Jaeger, 2009), exemplified by echo state networks (Jaeger and Haas, 2004) and liquid state machines (Maass et al., 2002), is a paradigm for the supervised learning of dynamical systems: it transforms input data into a high-dimensional space by a nonlinear state-space system, called the reservoir, and performs the learning task only on the readout. Due to the simplicity of this computational framework, it has been applied in many fields and has achieved remarkable success in tasks such as temporal pattern prediction, classification and generation (Tanaka et al., 2019). Motivated by these empirical successes, researchers have devoted considerable effort to theoretically understanding the properties and performance of reservoir computing models (see, for instance, Jaeger (2001); Buehner and Young (2006); Lukoševičius and Jaeger (2009); Yildiz et al. (2012); Grigoryeva and Ortega (2019); Gonon et al. (2020a, b) and references therein). In particular, recent studies (Grigoryeva and Ortega, 2018; Gonon and Ortega, 2020, 2021) showed that echo state networks are universal in the sense that they can approximate sufficiently regular input/output systems (i.e. operators) in various settings. However, these universality results do not account for the defining feature of echo state networks (and of reservoir computing in general) that the state-space system is randomly generated, which is the major difference between reservoir computing and recurrent neural networks. To be concrete, consider the echo state network
$$x_t = \sigma(A x_{t-1} + C u_t + \zeta), \qquad y_t = W x_t, \tag{1.1}$$
where $u_t \in \mathbb{R}^{d}$, $y_t$ and $x_t \in \mathbb{R}^{N}$ are the input, output and hidden state at time step $t$, and $\sigma$ is a prescribed activation function. In general, the internal weights $A$, $C$ and $\zeta$ are randomly generated and only the readout $W$ is trained in a supervised learning manner. However, the universality results in the papers mentioned above require that all the weights depend on the target system that we want to approximate. Hence, these results cannot completely explain the approximation capacity of echo state networks. To overcome this drawback of the current theory, the recent work of Gonon et al. (2020a)
studied the approximation of random neural networks in a Hilbert space setting and proposed a sampling procedure for the internal weights so that the echo state network (1.1) with ReLU activation is universal in that setting. In this paper, we generalize these results and study the universality of echo state networks with randomly generated internal weights. Our main result gives a sufficient condition on the activation function and on the sampling procedure of the internal weights so that the system (1.1) can approximate any continuous time-invariant operator uniformly with high probability. In particular, for the ReLU activation, we show that the echo state network (1.1) is universal when the internal weights are constructed from a random matrix whose entries are sampled independently from a general symmetric distribution, a scaling constant depending on that distribution, and fixed matrices defined by (3.5). Compared with Gonon et al. (2020a), whose construction takes advantage of specific properties of the ReLU function to build the reservoir, our construction makes use of the concentration of probability measures, and hence it can be applied to general activation functions.

1.1 Notations
We denote by $\mathbb{N}$ the set of positive integers. Let $\mathbb{Z}$ (respectively, $\mathbb{Z}_+$ and $\mathbb{Z}_-$) be the set of integers (respectively, of non-negative and non-positive integers). The ReLU function is denoted by $\mathrm{ReLU}(x) := \max\{x, 0\}$. Throughout the paper, $\mathbb{R}^d$ is equipped with the sup-norm $\|\cdot\|_\infty$, unless explicitly stated otherwise. For integers $i \le j$, we use $u_{i:j}$ to denote a sequence of the form $(u_i, u_{i+1}, \ldots, u_j)$ with entries in $\mathbb{R}^d$, with the convention that, when $i = -\infty$ or $j = \infty$, it denotes the corresponding left or right infinite sequence; the sets of right and left infinite sequences are indexed by $\mathbb{Z}_+$ and $\mathbb{Z}_-$, respectively. We will often regard a finite sequence as an element of the corresponding infinite-sequence space. The supremum norm of a sequence is denoted by $\|u\|_\infty := \sup_{t} \|u_t\|_\infty$. A weighting sequence is a decreasing sequence of positive numbers with zero limit. The weighted norm on left infinite sequences associated to a weighting sequence $(w_t)_{t \ge 0}$ is denoted by $\|u\|_w := \sup_{t \in \mathbb{Z}_-} w_{-t} \|u_t\|_\infty$.
2 Continuous causal time-invariant operators
We study the uniform approximation problem for input/output systems, or operators, acting on signals in the discrete-time setting. We mainly consider operators that satisfy the following three properties:
(1) Causality: if two inputs coincide up to some time $t$, then the corresponding outputs coincide at time $t$.
(2) Time-invariance: the operator commutes with the time delay operator, which shifts every sequence by a fixed number of time steps.
(3) Continuity (fading memory property): the outputs corresponding to uniformly bounded inputs are uniformly bounded, and the operator is continuous with respect to the product topology.
Grigoryeva and Ortega (2018) gave a comprehensive study of these operators. We recall that a causal time-invariant operator is in one-to-one correspondence with the functional on left infinite input sequences obtained by evaluating the operator at time zero on any extension of the input (which is well defined by causality). Conversely, the operator can be reconstructed from this functional by applying it to the shifted input history at each time step. In particular, the uniform distance between any two causal time-invariant operators equals the uniform distance between the corresponding functionals. Hence, the approximation problem for causal time-invariant operators can be reduced to the approximation of functionals on left infinite input sequences.
We remark that the product topology on the space of left infinite sequences is different from the uniform topology induced by the sup-norm. However, on the set of uniformly bounded sequences, the product topology coincides with the topology induced by the weighted norm associated to any weighting sequence. It was shown by Grigoryeva and Ortega (2018) that a causal time-invariant operator is continuous (i.e. has the fading memory property) if and only if the corresponding functional is continuous with respect to the product topology, which is equivalent to the functional having the fading memory property with respect to some (and hence any) weighting sequence, i.e. being a continuous function on the compact metric space of uniformly bounded sequences equipped with the weighted norm. In other words, for any $\epsilon > 0$, there exists $\delta > 0$ such that the values of the functional at any two inputs whose weighted distance is at most $\delta$ differ by at most $\epsilon$.
Next, we introduce several notions to quantify the regularity of causal time-invariant operators. For any memory length $m \in \mathbb{Z}_+$, we can associate to any causal time-invariant operator a function of the inputs over the last $m+1$ time steps via the corresponding functional:
(2.1)
If the functional can be approximated arbitrarily well by these finite-window functions, we say the operator has approximately finite memory. The following definition quantifies this approximation.
Definition 2.1 (Approximately finite memory).
For any causal time-invariant operator, any memory length $m$ and any bound on the inputs, we define the finite-memory error as the worst-case difference, over uniformly bounded inputs, between the functional and the associated finite-window function (2.1). If this error tends to zero as $m \to \infty$ for every bound on the inputs, we say the operator has approximately finite memory. If the error vanishes for some finite $m$, we say the operator has finite memory.
Note that the finite-memory error is a non-increasing function of $m$. By the time-invariance of the operator, the same bound holds uniformly over all time steps:
(2.2)
Hence, the finite-memory error quantifies how well the operator can be approximated by functionals with finite memory.
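To make the finite-memory error concrete, here is a minimal numerical sketch (our own illustration; the operator, the decay rate and all variable names are not from the paper). The exponentially weighted sum below defines a causal time-invariant operator with fading memory, and restricting it to the last $m+1$ inputs incurs an error that decays geometrically in $m$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative fading-memory operator: H(u)_t = sum_{k>=0} lam^k * u_{t-k}.
lam = 0.5

def apply_window(u_window):
    """Evaluate the operator using only the inputs in the given window."""
    m = len(u_window) - 1
    weights = lam ** np.arange(m, -1, -1)   # lam^m, ..., lam^1, lam^0
    return float(weights @ u_window)

T = 200
u = rng.uniform(-1.0, 1.0, size=T)          # inputs bounded by 1

reference = apply_window(u)                 # long window ~ full history
for m in [2, 5, 10, 20]:
    truncated = apply_window(u[-(m + 1):])  # finite memory of length m+1
    # The gap is at most lam^(m+1) / (1 - lam), i.e. it decays geometrically.
    print(m, abs(reference - truncated))
```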
Definition 2.2 (Modulus of continuity).
If the operator has the fading memory property with respect to some weighting sequence, we denote by its modulus of continuity (with respect to that weighting sequence) the smallest bound on the variation of the functional over inputs whose weighted distance does not exceed a given threshold, and by the inverse modulus of continuity the largest threshold that guarantees a given variation. Similarly, the modulus of continuity of a continuous function on a compact set (with respect to the norm on its domain) and its inverse modulus of continuity are defined analogously.
The next proposition quantifies the continuity of causal time-invariant operators in terms of approximately finite memory and the modulus of continuity. It is a modification of a similar result of Hanson and Raginsky (2019) to our setting, and it shows that a causal time-invariant operator is continuous if and only if it has approximately finite memory and each finite-window function (2.1) is continuous.
Proposition 2.3.
Let a causal time-invariant operator be given.
(1) If the operator has approximately finite memory and each finite-window function (2.1) is continuous, then the operator has the fading memory property with respect to any weighting sequence, and its modulus of continuity is controlled by the finite-memory error together with the moduli of continuity of the finite-window functions.
(2) If the operator has the fading memory property with respect to some weighting sequence, then it has approximately finite memory, with the finite-memory error bounded by the modulus of continuity of the operator evaluated at the tail of the weighting sequence, and each finite-window function (2.1) is continuous.
Proof.
For the first part, by the definition of the finite-memory error, the value of the functional at any input is within that error of the value of the associated finite-window function. Since the weighting sequence is decreasing, the sup-norm distance between the last $m+1$ entries of two inputs is controlled by their weighted distance, so the finite-window values are close whenever the inputs are close in the weighted norm. The fading memory property then follows from the triangle inequality. For the second part, we observe that any two uniformly bounded inputs that coincide on the last $m+1$ time steps have weighted distance bounded by a constant multiple of the tail of the weighting sequence. Then, by the definition of the modulus of continuity of the functional, the finite-memory error is bounded by the modulus of continuity evaluated at this tail, which tends to zero as $m \to \infty$. Finally, by the definitions of the weighted norm and the modulus of continuity, the continuity of each finite-window function follows as well. ∎
In order to approximate a continuous causal time-invariant operator, we only need to approximate the corresponding functional, which can in turn be approximated by the continuous finite-window function (2.1) if the memory length $m$ is chosen sufficiently large. Hence, any approximation theory for continuous functions can be translated into an approximation result for continuous causal time-invariant operators. For instance, if we approximate the finite-window function by some continuous function, then we can approximate the operator by applying that function to a sliding window of the last $m+1$ inputs. Such a function uniquely determines a causal time-invariant operator with finite memory. Since this operator has finite memory, it is continuous if and only if the underlying function is continuous, by Proposition 2.3. When the finite-window function is approximated by polynomials, the resulting operator is a Volterra series (Boyd and Chua, 1985). When it is approximated by a neural network, the resulting operator is a temporal convolutional neural network, studied by Hanson and Raginsky (2019). In this paper, we focus on approximation by echo state networks (ESN), which are special state-space models of the form
$$x_t = \sigma(A x_{t-1} + C u_t + \zeta), \qquad y_t = W x_t, \tag{2.3}$$
where $A \in \mathbb{R}^{N \times N}$, $C \in \mathbb{R}^{N \times d}$ and $\zeta \in \mathbb{R}^{N}$ are the internal weights, $W$ is the readout, and the activation function $\sigma$ is applied element-wise.
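For concreteness, the following sketch runs the state recursion in (2.3); the symbols A, C, zeta, the tanh activation and the chosen dimensions follow the notation reconstructed above and are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

N, d = 100, 3                                         # reservoir size, input dimension
A = rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, N))    # recurrent weights
C = rng.normal(0.0, 1.0, size=(N, d))                 # input weights
zeta = rng.normal(0.0, 1.0, size=N)                   # bias

def reservoir_states(inputs, activation=np.tanh):
    """Iterate x_t = activation(A x_{t-1} + C u_t + zeta), starting from x_0 = 0."""
    x = np.zeros(N)
    states = []
    for u_t in inputs:
        x = activation(A @ x + C @ u_t + zeta)
        states.append(x)
    return np.stack(states)                           # shape (T, N)

u_seq = rng.uniform(-1.0, 1.0, size=(50, d))          # a bounded input sequence
X = reservoir_states(u_seq)
print(X.shape)                                        # (50, 100)
```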
Definition 2.4 (Existence of solutions and echo state property).
We say that the system (2.3) has the existence of solutions property if, for any input sequence, there exists a state sequence such that the first equation in (2.3) holds for each time step. If this solution is always unique, we say the system has the echo state property.
Grigoryeva and Ortega (2018, Theorem 3.1) gave sufficient conditions for the system (2.3) to have the existence of solutions property and the echo state property. In particular, they showed that if $\sigma$ is a bounded continuous function, then the existence of solutions property holds; and if $\sigma$ is a bounded Lipschitz continuous function with Lipschitz constant $L_\sigma$ and $L_\sigma \|A\|_{\mathrm{op}} < 1$, where $\|A\|_{\mathrm{op}}$ is the operator norm of the matrix $A$, then the system (2.3) has the echo state property. As a sufficient condition for the echo state property, this type of contraction hypothesis on the recurrent weight matrix has been extensively studied in the ESN literature (Jaeger, 2001; Jaeger and Haas, 2004; Buehner and Young, 2006; Yildiz et al., 2012; Gandhi and Jaeger, 2013).
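The contraction condition above is easy to enforce in practice by rescaling the recurrent matrix via its operator (spectral) norm; the sketch below does exactly that. The rescaling recipe and the chosen target value are our own illustration, not a construction taken from the paper.

```python
import numpy as np

def enforce_contraction(A, lip_const=1.0, target=0.9):
    """Rescale A so that lip_const * ||A||_op <= target < 1.

    For a bounded Lipschitz activation with Lipschitz constant lip_const,
    this is the sufficient condition for the echo state property recalled above.
    """
    op_norm = np.linalg.norm(A, 2)                    # largest singular value of A
    if lip_const * op_norm > target:
        A = A * (target / (lip_const * op_norm))
    return A

rng = np.random.default_rng(0)
A = enforce_contraction(rng.normal(size=(100, 100)))  # tanh is 1-Lipschitz
print(np.linalg.norm(A, 2))                           # <= 0.9
```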
If the system (2.3) has the existence of solutions property, the axiom of choice allows us to assign a solution to each input sequence, and hence to define a functional by evaluating the corresponding output at time zero. Thus, we can associate a causal time-invariant operator with the system. When the echo state property holds, this operator is unique. The operator is continuous if and only if the corresponding functional is a continuous function of the input history. In the next section, we study the universality of these operators.
3 Universal approximation
As mentioned in the introduction, the recent works of Grigoryeva and Ortega (2018) and Gonon and Ortega (2021) showed that echo state networks (2.3) are universal: assume $\sigma$ is a bounded Lipschitz continuous function and let a continuous causal time-invariant operator be given; then, for any $\epsilon > 0$ and sufficiently large reservoir dimension $N$, there exists an ESN (2.3) such that the corresponding causal time-invariant operator satisfies the uniform approximation bound
(3.1)
In this universal approximation theorem, the weights in the network (2.3) depend on the target operator. However, in practice, the internal weights $A$, $C$ and $\zeta$ are drawn at random from a given distribution, and only the readout $W$ is trained by linear regression using observed data related to the target operator. Hence, this universal approximation theorem cannot completely explain the empirical performance of echo state networks. In this section, our goal is to show that echo state networks with randomly generated weights are universal: for any $\epsilon > 0$ and sufficiently large $N$, with high probability over the internal weights drawn from a suitable distribution, there exists a readout $W$ such that we can associate a causal time-invariant operator with the ESN (2.3) and the approximation bound (3.1) holds. In the context of standard feed-forward neural networks, one can show that a similar universal approximation theorem holds for random neural networks; this will be the building block of our main theorem on echo state networks.
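The following sketch illustrates the training step described here: the internal weights stay fixed and only the readout is fitted by linear regression on observed reservoir states. The ridge regularization and the synthetic data are our own choices, made only to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_readout(states, targets, ridge=1e-6):
    """Least-squares (ridge) fit of the readout W in y_t ~ W x_t."""
    X, y = np.asarray(states), np.asarray(targets)
    return np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ y)

# Synthetic stand-ins: in practice `states` comes from running the reservoir
# recursion on observed inputs, and `targets` are the observed outputs.
states = rng.normal(size=(500, 100))                  # (T, N) reservoir states
targets = states @ rng.normal(size=100) + 0.01 * rng.normal(size=500)

W = fit_readout(states, targets)
print(np.max(np.abs(states @ W - targets)))           # small training residual
```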
3.1 Universality of random neural networks
It is well-known that feed-forward neural networks with one hidden layer are universal. We recall the universal approximation theorem proved by Leshno et al. (1993).
Theorem 3.1.
If $\sigma$ is continuous and not a polynomial, then for any compact set $K \subseteq \mathbb{R}^d$, any function $f \in C(K)$ and any $\epsilon > 0$, there exist $n \in \mathbb{N}$ and parameters $v_i \in \mathbb{R}$, $(a_i, b_i) \in \mathbb{R}^d \times \mathbb{R}$, $i = 1, \ldots, n$, such that
$$\sup_{x \in K} \Big| f(x) - \sum_{i=1}^{n} v_i \, \sigma(a_i^\top x + b_i) \Big| \le \epsilon. \tag{3.2}$$
In Theorem 3.1, the parameters $v_i$ and $(a_i, b_i)$ depend on the target function $f$. In order to take into account the fact that, for ESNs, the inner weights are randomly chosen and only the outer coefficients are trained from data, we consider random neural networks whose inner weights $(a_i, b_i)$ are drawn from some probability distribution $\mu$. This motivates the following definition.

Definition 3.2.
Suppose $(a_i, b_i)_{i \ge 1}$ is a sequence of i.i.d. random vectors drawn from a probability distribution $\mu$ defined on $\mathbb{R}^d \times \mathbb{R}$. If for any compact set $K \subseteq \mathbb{R}^d$, any function $f \in C(K)$ and any $\epsilon, \delta \in (0, 1)$, there exists $n \in \mathbb{N}$ such that, with probability at least $1 - \delta$, the inequality (3.2) holds for some coefficients $v_1, \ldots, v_n \in \mathbb{R}$, then we say the pair $(\sigma, \mu)$ is universal.
Remark 3.3.
In Definition 3.2, we only require the existence of neural networks that converge to the target function in probability. This requirement is actually equivalent to almost sure convergence. To see this, note that convergence in probability implies that there exists a subsequence along which some linear combination of the random features $\sigma(a_i^\top x + b_i)$ converges to the target function almost surely. Notice that a network built from the first few random features is also a linear combination of any larger collection of these features. Hence, the almost sure convergence holds along the whole sequence.
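The following self-contained sketch illustrates Definition 3.2 numerically: the inner weights $(a_i, b_i)$ are sampled once and never trained, and only the outer coefficients $v_i$ are fitted by least squares. The Gaussian sampling distribution, the tanh activation and the target function are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target continuous function on the compact set K = [-1, 1].
f = lambda x: np.sin(3 * x) + 0.5 * x**2
x = np.linspace(-1.0, 1.0, 400)

n = 500                                     # number of random neurons
a = rng.normal(0.0, 3.0, size=n)            # inner weights: sampled i.i.d. ...
b = rng.normal(0.0, 3.0, size=n)            # ... and never trained
features = np.tanh(np.outer(x, a) + b)      # sigma(a_i x + b_i), shape (400, n)

# Only the outer coefficients v are fitted (here by least squares).
v, *_ = np.linalg.lstsq(features, f(x), rcond=None)
print(np.max(np.abs(features @ v - f(x))))  # uniform error on the grid
```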
The universality of random neural networks has been widely studied in the context of extreme learning machines (Huang et al., 2006, 2012). In particular, Huang et al. (2006) used an incremental construction to establish a random universal approximation theorem in the $L^2$-norm for bounded non-constant piecewise continuous activation functions. The recent work of Hart et al. (2020) considered this approximation problem in a different norm, under an additional assumption on the activation function. They argued that, since by the universality of neural networks there exists a network approximating the target function, some of the randomly generated samples will eventually be close to the weights of this network, and the remaining samples can be discarded by setting the corresponding outer coefficients to zero; hence, the random universal approximation holds. It is possible to generalize their argument to the approximation of continuous functions. Nevertheless, we will give an alternative approach based on the law of large numbers and show that the pair $(\sigma, \mu)$ is universal under very weak conditions. Our analysis will need the following uniform law of large numbers.

Lemma 3.4 (Jennrich (1969), Theorem 2).
Let $z_1, z_2, \ldots$ be i.i.d. samples from some probability distribution $\pi$ on a measurable space $Z$, and let $g : Z \times \Theta \to \mathbb{R}$. Suppose
(1) $\Theta$ is a compact subset of a Euclidean space;
(2) $g(z, \theta)$ is continuous in $\theta$ for almost all $z$, and measurable in $z$ for each $\theta$;
(3) there exists a $\pi$-integrable function $h$ with $|g(z, \theta)| \le h(z)$ for all $\theta$ and almost all $z$.
Then $\theta \mapsto \mathbb{E}_{\pi}[g(z, \theta)]$ is continuous on $\Theta$ and
$$\sup_{\theta \in \Theta} \Big| \frac{1}{n} \sum_{i=1}^{n} g(z_i, \theta) - \mathbb{E}_{\pi}[g(z, \theta)] \Big| \to 0 \quad \text{almost surely as } n \to \infty.$$
Now we give a sufficient condition for the pair $(\sigma, \mu)$ to be universal. Our proof is a combination of the uniform law of large numbers (Lemma 3.4) and the universality of neural networks (Theorem 3.1).
Theorem 3.5.
Suppose the continuous function $\sigma$ is not a polynomial and grows at most polynomially. If $\mu$ is a probability distribution with full support on $\mathbb{R}^d \times \mathbb{R}$ with a finite moment of the corresponding order, then $(\sigma, \mu)$ is universal.
Proof.
By the Hahn–Banach theorem, for any compact set $K \subseteq \mathbb{R}^d$, the linear span of a function class is dense in $C(K)$ if and only if the only element of the dual space of $C(K)$ that annihilates the class is zero, where the dual space is the space of all signed Radon measures with finite total variation (Folland, 1999). We consider the linear space of functions on $K$ of the form $x \mapsto \int v(a, b)\, \sigma(a^\top x + b)\, d\mu(a, b)$ for suitable coefficient functions $v$. By the growth and moment assumptions together with Lemma 3.4, any such function is continuous and hence belongs to $C(K)$. Suppose a signed Radon measure $\nu$ annihilates this space. Then, by Fubini's theorem, $\int v(a, b) \big( \int_K \sigma(a^\top x + b)\, d\nu(x) \big)\, d\mu(a, b) = 0$ for all admissible coefficient functions $v$. Therefore, $\int_K \sigma(a^\top x + b)\, d\nu(x) = 0$ for $\mu$-almost every $(a, b)$. Since this function of $(a, b)$ is continuous and $\mu$ has full support, the equality holds for all $(a, b) \in \mathbb{R}^d \times \mathbb{R}$. By Theorem 3.1, the linear span of $\{\sigma(a^\top \cdot + b)\}$ is dense in $C(K)$, which implies $\nu = 0$. We conclude that the considered linear space is dense in $C(K)$. In other words, for any $f \in C(K)$ and $\epsilon > 0$, there exists a coefficient function $v$ such that $\big| f(x) - \int v(a, b)\, \sigma(a^\top x + b)\, d\mu(a, b) \big| \le \epsilon$ for all $x \in K$. Finally, applying Lemma 3.4 to the empirical averages $\frac{1}{n} \sum_{i=1}^{n} v(a_i, b_i)\, \sigma(a_i^\top x + b_i)$, with the growth and moment assumptions providing the dominating function, yields uniform convergence to this integral almost surely, which proves that $(\sigma, \mu)$ is universal. ∎
In practice, it is more convenient to sample each weight independently from a given distribution $\mu_0$ on $\mathbb{R}$, so that $(a_i, b_i)$ is sampled from $\mu_0^{d+1}$, the $(d+1)$-fold product of $\mu_0$. The next corollary is a direct application of Theorem 3.5 to this situation.
Corollary 3.6.
Suppose the continuous function $\sigma$ is not a polynomial and grows at most polynomially. If $\mu_0$ is a probability distribution with full support on $\mathbb{R}$ with a finite moment of the corresponding order, then for any $d \in \mathbb{N}$, the pair $(\sigma, \mu_0^{d+1})$ is universal.
Proof.
If the growth of $\sigma$ is bounded, the moment condition is trivially satisfied. Otherwise, since all norms on $\mathbb{R}^{d+1}$ are equivalent, the moment condition for the product measure $\mu_0^{d+1}$ follows from the corresponding moment condition on $\mu_0$. In either case, the pair $(\sigma, \mu_0^{d+1})$ satisfies the conditions of Theorem 3.5, hence it is universal. ∎
So far, we have assumed that $\mu_0$ has full support. When the activation is the ReLU function, this assumption can be weakened due to the absolute homogeneity of ReLU.
Corollary 3.7.
For the ReLU activation, if $\mu_0$ is a probability distribution on $\mathbb{R}$ whose support contains an interval around the origin, then the pair $(\mathrm{ReLU}, \mu_0^{d+1})$ is universal for any $d \in \mathbb{N}$.
Proof.
We consider the mapping that sends each nonzero weight vector $(a, b) \in \mathbb{R}^d \times \mathbb{R}$ to $(a, b)/\|(a, b)\|$ and sends zero to zero. Let $\tilde{\mu}$ be the push-forward measure of $\mu_0^{d+1}$ under this mapping. Then, by the assumption on the support of $\mu_0$, the support of $\tilde{\mu}$ contains the whole unit sphere.

We first show that $(\mathrm{ReLU}, \tilde{\mu})$ is universal. As in the proof of Theorem 3.5, if a signed Radon measure $\nu$ annihilates the associated function class, then by Fubini's theorem the function $(a, b) \mapsto \int_K \mathrm{ReLU}(a^\top x + b)\, d\nu(x)$ vanishes on the support of $\tilde{\mu}$, in particular on the unit sphere. By the absolute homogeneity of ReLU, this equation actually holds for all $(a, b) \in \mathbb{R}^d \times \mathbb{R}$. The argument in the proof of Theorem 3.5 then implies that $(\mathrm{ReLU}, \tilde{\mu})$ is universal.

Observe that every sample from $\tilde{\mu}$ corresponds to a sample from $\mu_0^{d+1}$ rescaled to unit norm. By homogeneity, any linear combination of the normalized features can be realized as a linear combination of the original features by dividing each outer coefficient by the norm of the corresponding raw weight, and setting it to zero when that norm vanishes. We conclude that $(\mathrm{ReLU}, \mu_0^{d+1})$ is universal. ∎
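The rescaling step at the end of this proof rests on the positive homogeneity of ReLU; the snippet below checks the identity $v\,\mathrm{ReLU}(a^\top x + b) = (v r)\,\mathrm{ReLU}((a/r)^\top x + b/r)$ for $r = \|(a, b)\|$ on a randomly drawn weight (our own numerical illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda t: np.maximum(t, 0.0)

a, b, v = rng.normal(size=3), rng.normal(), rng.normal()
x = rng.normal(size=3)

r = np.sqrt(np.sum(a**2) + b**2)            # norm of the raw weight (a, b)
lhs = v * relu(a @ x + b)                   # original feature
rhs = (v * r) * relu((a / r) @ x + b / r)   # normalized weight, rescaled coefficient
print(np.isclose(lhs, rhs))                 # True, by positive homogeneity of ReLU
```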
3.2 Universality of echo state networks
In this section, we state and prove the random universal approximation theorem for echo state networks. Our analysis is based on the uniform law of large numbers and the universality of random feed-forward neural networks. For simplicity, we assume from now on that all internal weights in the network have the same distribution. We make the following assumption on the activation function and on the distribution of the internal weights.
Assumption 3.8.
For any input dimension, the pair formed by the activation function $\sigma$ and the weight distribution $\mu$ is universal, and there exists a measurable mapping $f$ on the weight space such that
$$\mathbb{E}_{w \sim \mu}\big[ f(w)\, \sigma(\langle w, x \rangle) \big] = x \quad \text{for all admissible inputs } x, \tag{3.3}$$
and $f$ satisfies a moment condition of sufficiently high order.
Corollary 3.6 gives a sufficient condition for the pair $(\sigma, \mu)$ to be universal. However, Assumption 3.8 also requires the existence of a function $f$ that satisfies equation (3.3), which may be difficult to check for general activation functions. Nevertheless, we will show that the assumption holds when the activation function is ReLU and the weight distribution is symmetric (see Corollary 3.11).
By Lemma 3.4, Assumption 3.8 ensures that, given i.i.d. samples $w_1, \ldots, w_N$ from $\mu$, we can approximately reconstruct the input by the empirical average $\frac{1}{N} \sum_{i=1}^{N} f(w_i)\, \sigma(\langle w_i, x \rangle) \approx x$, uniformly over admissible inputs. In other words, the random features $\sigma(\langle w_i, x \rangle)$ contain enough information about the input, and we can approximately recover it using $f(w_i)$ as coefficients. For echo state networks, this assumption guarantees that the hidden state does not lose too much information about the history of the input; hence, we can view it as a “reservoir” and approximate the desired functional at any time step.
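As a numerical illustration of this reconstruction property, the sketch below uses the candidate map $f(w) = 2\Sigma^{-1} w$ for the ReLU activation and a symmetric weight distribution with second-moment matrix $\Sigma$; this specific choice of $f$ is our own example of how an identity of the form (3.3) can be satisfied (the paper's Corollary 3.11 gives the precise statement), checked here by Monte Carlo with standard Gaussian weights, for which $\Sigma$ is the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda t: np.maximum(t, 0.0)

d, n = 3, 200_000
x = np.array([0.3, -0.7, 0.5])

# Symmetric weight distribution: standard Gaussian, so Sigma = E[w w^T] = I.
W = rng.normal(size=(n, d))

# With f(w) = 2 Sigma^{-1} w = 2 w and relu(t) = (t + |t|) / 2, symmetry gives
# E[2 w relu(w @ x)] = E[w (w @ x)] + E[w |w @ x|] = Sigma x + 0 = x.
x_hat = (2.0 * W * relu(W @ x)[:, None]).mean(axis=0)
print(x_hat)                                 # close to x up to Monte Carlo error
```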
Theorem 3.9.
Suppose the activation function $\sigma$ and the distribution $\mu$ satisfy Assumption 3.8. For $N \in \mathbb{N}$, let $w_1, \ldots, w_N$ be i.i.d. samples from $\mu$, and define the ESN