# Distributed Sequence Memory of Multidimensional Inputs in Recurrent Networks

Recurrent neural networks (RNNs) have drawn interest from machine learning researchers because of their effectiveness at preserving past inputs for time-varying data processing tasks. To understand the success and limitations of RNNs, it is critical that we advance our analysis of their fundamental memory properties. We focus on echo state networks (ESNs), which are RNNs with simple memoryless nodes and random connectivity. In most existing analyses, the short-term memory (STM) capacity results conclude that the ESN network size must scale linearly with the input size for unstructured inputs. The main contribution of this paper is to provide general results characterizing the STM capacity for linear ESNs with multidimensional input streams when the inputs have common low-dimensional structure: sparsity in a basis or significant statistical dependence between inputs. In both cases, we show that the number of nodes in the network must scale linearly with the information rate and poly-logarithmically with the ambient input dimension. The analysis relies on advanced applications of random matrix theory and results in explicit non-asymptotic bounds on the recovery error. Taken together, this analysis provides a significant step forward in our understanding of the STM properties in RNNs.

Comments

There are no comments yet.

## Authors

• 2 publications
• 15 publications
• 4 publications
• ### Memory and Information Processing in Recurrent Neural Networks

Recurrent neural networks (RNN) are simple dynamical systems whose compu...
04/23/2016 ∙ by Alireza Goudarzi, et al. ∙ 0

read it

• ### Short Term Memory Capacity in Networks via the Restricted Isometry Property

Cortical networks are hypothesized to rely on transient network activity...
07/01/2013 ∙ by Adam S. Charles, et al. ∙ 0

read it

• ### Use of recurrent infomax to improve the memory capability of input-driven recurrent neural networks

The inherent transient dynamics of recurrent neural networks (RNNs) have...
02/14/2018 ∙ by Hisashi Iwade, et al. ∙ 0

read it

• ### Automatic Construction of a Recurrent Neural Network based Classifier for Vehicle Passage Detection

Recurrent Neural Networks (RNNs) are extensively used for time-series mo...
09/26/2016 ∙ by Evgeny Burnaev, et al. ∙ 0

read it

• ### How recurrent networks implement contextual processing in sentiment analysis

Neural networks have a remarkable capacity for contextual processing–usi...
04/17/2020 ∙ by Niru Maheswaranathan, et al. ∙ 7

read it

• ### Recurrent Neural Network-based Model for Accelerated Trajectory Analysis in AIMD Simulations

The presented work demonstrates the training of recurrent neural network...
09/23/2019 ∙ by Mohammad Javad Eslamibidgoli, et al. ∙ 27

read it

• ### Determination of the edge of criticality in echo state networks through Fisher information maximization

It is a widely accepted fact that the computational capability of recurr...
03/11/2016 ∙ by Lorenzo Livi, et al. ∙ 0

read it

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Recurrent neural networks (RNNs) have drawn interest from researchers because of their effectiveness at processing sequences of data (Jaeger, 2001; Lukoševičius, 2012; Hinaut et al., 2014). While deep networks have shown remarkable performance improvements at task such as image classification, RNNs have recently been successfully employed as layers in conventional deep neural networks to expand these tools into tasks with time-varying data (Sukhbaatar et al., ; Gregor et al., 2015; Graves et al., 2013; Bashivan et al., 2016)

. This inclusion is becoming increasingly important as neural networks are being applied to a growing variety of inherently temporal high-dimensional data, such as video

(Donahue et al., 2015), audio (Graves et al., 2013), EEG data (Bashivan et al., 2016), two-photon calcium imaging  (Apthorpe et al., 2016). Despite the growing use of both deep and recurrent networks, theory characterizing the properties of such networks remain relatively unexplored. For deep neural networks, much of the computational power is often attributed to flexibility in learned representations (Mallat, 2016; Vardan et al., 2016; Patel et al., 2015). The power of RNNs, however, is tied to the ability of the recurrent network dynamics to act as a distributed memory substrate, preserving information about past inputs to leverage temporal dependencies for data processing tasks such as classification and prediction. To understand the success and limitations of RNNs, it is critical that we advance our analysis of the fundamental memory properties of these network structures.

There are many types of recurrent network structures that have been employed in machine learning applications, each with varying complexity in the network elements and the training procedures. In this paper we will focus on RNN structures known as echo state networks (ESNs). These networks have discrete time continuous-valued nodes that evolve at time in response to the inputs according to the dynamics:

 x[n+1]=f(Wx[n]+Zs[n]+ϵ[n]), (1)

where is the connectivity matrix defining the recurrent dynamics,

is the weight vector describing how the input drives the network,

is an element-wise nonlinearity evaluated at each node and represents the error due to potential system imperfections (Jaeger, 2001; Wilson and Cowan, 1972; Amari, 1972; Sompolinsky et al., 1988; Maass et al., 2002). In an ESN, the connectivity matrix

is random and untrained, while the simple individual nodes have a single state variable with no memory. This is in contrast to approaches such as long short-term memory units

(Sak et al., 2014; Lipton et al., 2016; Kalchbrenner et al., 2016) which have individual nodes with complex memory properties. As with many other recent papers, we will also focus on linear networks where is the identity function (Jaeger, 2001; Jaeger and Haas, 2004; White et al., 2004; Ganguli et al., 2008; Ganguli and Sompolinsky, 2010; Charles et al., 2014; Wallace et al., 2013).

The memory capacity of these networks has been studied in both the machine learning and computational neuroscience literature. In the approach of interest, the short-term memory (STM) of a network is characterized by quantifying the relationship between the transient network activity and the recent history of the exogenous input stream driving the network (Jaeger and Haas, 2004; Maass et al., 2002; Ganguli and Sompolinsky, 2010; Wallace et al., 2013; Verstraeten et al., 2007; White et al., 2004; Lukoševičius and Jaeger, 2009; Buonomano and Maass, 2009; Charles et al., 2014). Note that this is in contrast to alternative approaches that characterize long-term memory in RNNs through quantifying the number of distinct network attractors that can be used to stably remember input patterns with the asymptotic network state. In the vast majority of the existing theoretical analysis of STM, the results conclude that networks with nodes can only recover inputs of length  (White et al., 2004; Wallace et al., 2013) when the inputs are unstructured.

However, in any machine-learning problem of interest, the input statistical structure is precisely what we intend to exploit to accomplish meaningful tasks. For one example, many signals are well-known to admit a sparse representation in a transform dictionary (Elad et al., 2010; Davies and Daudet, 2006). In fact, some classes of deep neural networks have been designed to induce sparsity at higher layers that may serve as inputs into the recurrent layers (LeCun et al., 2010; Kavukcuoglu et al., 2010). For another example, a collection of time-varying input streams (e.g., pixels or image features in a video stream) are often heavily correlated. In the specific case of single input streams () with inputs that are -sparse in a basis, recent work (Charles et al., 2014) has shown that the STM capacity can scale as favorably as , where is a constant. In other words, the memory capacity can scale linearly with the information rate in the signal and only logarithmically with the signal dimension, resulting in the potential for recovery of inputs of length . Unfortunately, existing analyses (Jaeger, 2001; White et al., 2004; Ganguli and Sompolinsky, 2010; Charles et al., 2014) are generally specific to the restricted case of single time-series inputs () or unstructured inputs (Verstraeten et al., 2010).

Conventional wisdom is that structured inputs should lead to much higher STM capacity, though this has never been addressed with strong analysis in the general case of ESNs with multidimensional input streams. The main contribution of this paper is to provide general results characterizing the STM capacity for linear randomly connected ESNs with multidimensional input streams when the inputs are either sparse in a basis or have significant statistical dependence (with no sparsity assumption). In both cases, we show that the number of nodes in the network must scale linearly with the information rate and poly-logarithmically with the total input dimension. The analysis relies on advanced applications of random matrix theory, and results in non-asymptotic analysis of explicit bounds on the recovery error. Taken together, this analysis provides a significant step forward in our understanding of the STM properties in RNNs. While this paper is primarily focused on network structures in the context of RNNs in machine learning, these results also provide foundation for the theoretical understanding of recurrent network structures in biological neural networks, as well as the memory properties in other network structures with similar dynamics (e.g., opinion dynamics in social networks).

## 2 Background and Related Work

### 2.1 Short Term Memory in Recurrent Networks

Many approaches have been used to analyze the STM of randomly connected networks, including nonlinear networks (Sompolinsky et al., 1988; Massar and Massar, 2013; Faugeras et al., 2009; Rajan et al., 2010; Galtier and Wainrib, 2016; Wainrib, 2015) and linear networks (Jaeger, 2001; Jaeger and Haas, 2004; White et al., 2004; Ganguli et al., 2008; Ganguli and Sompolinsky, 2010; Charles et al., 2014; Wallace et al., 2013)

with both discrete-time and continuous-time dynamics. These methods can be broadly be classified as either correlation-based methods

(White et al., 2004; Ganguli et al., 2008) or uniqueness methods (Jaeger, 2001; Maass et al., 2002; Jaeger and Haas, 2004; Charles et al., 2014; Legenstein and Maass, 2007; Büsing et al., 2010). Correlation methods focus on quantifying the correlation between the network state and recent network inputs. In these studies, the STM is defined as the time of the oldest input where the correlation between the network state and that input remains above a given threshold (White et al., 2004; Ganguli et al., 2008). These methods have mostly been applied to discrete-time systems, and have resulted in bounds on the STM that scale linearly with the number of nodes (i.e. ).

In contrast, uniqueness methods instead aim to show that different network states correspond to unique input sequences (i.e. the network dynamics are bijective).111We note that uniqueness-based methods imply recovery-based methods, modulo a recovery algorithm, as often recovery guarantees are based on some semblance of a bijection. For uniqueness methods, the STM is defined as the longest input length where this input-network state bijection still holds. These methods have been used under the term separability property for continuous-time liquid state machines (Maass et al., 2002; Vapnik and Chervonenkis, 1971; Legenstein and Maass, 2007; Wallace et al., 2013; Büsing et al., 2010) and under the term echo-state property for discrete-time ESNs (Jaeger, 2001; Yildiz et al., 2012; Buehner and Young, 2006; Manjunath and Jaeger, 2013). The echo-state property is the method most related to the approach we take here, and essentially derives the maximum length of the input signal such that the resulting network states remain unique. While this property guarantees a bijection between inputs and network states, it does not take into account input signal structure, does not capture the robustness of the mapping, and does not provide guarantees for stably recovering the input from the network activity.

### 2.2 Compressed Sensing

The compressed sensing literature and its recent extensions include many tools for studying the effects of random matrices applied to low-dimensional signals. Specifically, in the basic compressed sensing problem we desire to recover the signal from measurements222In the case of RNNs, the network node values act as the measurements of our system, prompting the use of as the number of measurements in this section generated from a random linear measurement operator,

 x=A(s)+ϵ, (2)

where represents the potential measurement errors. Typically, is assumed to have low-dimensional structure and recovery is performed via a convex optimization program. The most common example is a sparsity model where can be represented as

 s=Ψa,

where is a transform matrix and is the sparse coefficient representation of with of its entries non-zero. Under this sparsity assumption, the coefficient representation is recoverable if the linear operator satisfies the restricted isometry property (RIP) that guarantees uniqueness of the compressed measurements. Specifically, we say that satisfies the RIP(2,) if for every 2-sparse signal , the following condition is satisfied:

where and is a positive constant. When satisfies the RIP(2,) the sparse coefficients can be recovered by solving an -norm based optimization function

 a=argmina||a||1such that||x−A(Ψa)||2≤||ϵ||2, (3)

up to a reconstruction error given by

 ||s−ˆs||2≤α||ϵ||2+β∣∣∣∣ΨT(s−sK)∣∣∣∣1√K, (4)

where and are constants (Candès et al., 2006). The first term of this recovery error bound depends on the norm of the measurement error , while the second term depends on the difference between the true signal and the best -sparse approximation of the true vector (). This term essentially measures how closely the signal matches the sparsity model. The optimization program in (3) required for recovery can be solved by many efficient algorithms, including neurally plausible architectures (Rozell et al., 2010; Balavoine et al., 2012; Shapero et al., 2014; Charles et al., 2012).

When the data of interest is a matrix , other low-dimensional models have also been explored. For example, as an alternative to a sparsity assumption, the successful low-rank model assumes that there are correlations between rows and columns such that has rank . We can then write the decomposition of the matrix as

 S=QV∗,

where and . There is a rich and growing literature dedicated to establishing guarantees for recovering low-rank matrices from incomplete measurements. Due to the difficulty of establishing a general matrix-RIP property for observations of a matrix (Recht et al., 2010)

, the guarantees in this literature more commonly use the optimality conditions for specific optimization procedures to show that the resulting solution has bounded error with high probability. The most common optimization program used for low-rank matrix recovery is the nuclear norm minimization,

 S=argminS||S||∗such that||x−A(S)||2≤||ϵ||2, (5)

where the nuclear norm

is defined as the sum of the singular values of

(Candès and Tao, 2010; Candès and Plan, 2010; Recht et al., 2010; Chen and Suter, 2004; Fazel, 2002; Singer and Cucuringu, 2010; Toh and Yun, 2010; Liu and Vandenberghe, 2009; Jaggi et al., 2010). This optimization procedure is similar to the -regularized optimization of Equation (3), however the nuclear-norm induces sparsity in the singular values rather than the matrix entries directly.

The solution to Equation (5) can be shown to satisfy performance guarantees via the dual-certificate approach (Candès and Plan, 2010; Ahmed and Romberg, 2015). This technique is a proof by construction and shows that a dual certificate (i.e., a vector whose projections into and out of the space spanned by the singular vectors of are bounded) exists. Showing that such a certificate exists demonstrates that Equation (5) converges to a valid solution and is key to deriving accuracy bounds (Candès and Plan, 2010; Ahmed and Romberg, 2015). Specifically, if the dual certificate exists, then the solution to Equation (5) satisfies the recovery bound

 ∣∣∣∣ˆS−S∣∣∣∣F≤(4√min(N,L)2NL+MM+2)ϵ, (6)

where the Forbenius norm is defined as the sum of the squares of all the matrix entries. This bound demonstrates that perfect recovery is achievable in the case where there is no error (). We note that alternate optimization programs with similar guarantees have been proposed in the literature for inferring low-rank matrices (i.e. Ahmed and Romberg, 2015), but we will focus on nuclear norm optimization approaches due to the extensive literature on nuclear-norm solvers and the connections to sparse vector inference.

### 2.3 STM Capacity via the RIP

The ideas and tools from the compressed sensing literature have recently been used to show that a-priori knowledge of the input sparsity can lead to improvements recovery-based STM capacity results for ESNs. For a single input stream under a sparsity assumption, Ganguli and Sompolinsky (2010) analyzed an annealed version of the network dynamics to show that the network memory capacity can be larger than the network size. Building on this observation, Charles et al. (2014) provided an analysis of the exact network dynamics in an ESN (for the single input case of ), yielding precise bounds on a network’s STM capacity captured in the following theorem: (Theorem 4.1.1, Charles et al., 2014) Suppose , , ,333We use notation to indicate that a variable is a finite constant. and . Let

be any unitary matrix of eigenvectors (containing complex conjugate pairs) of the connectivity matrix

and for

an even integer, denote the eigenvalues of

by . Let the first eigenvalues () be chosen uniformly at random on the complex unit circle (i.e.,

is uniformly distributed over

) and the other eigenvalues as the complex conjugates of these values. Furthermore, let the entries of the input weights

be i.i.d. zero-mean Gaussian random variables with variance

. Given RIP conditioning and failure probability , if

 M≥CKδ2μ2(Ψ)log5(N)log(η−1),

then for a universal constant , with probability the mapping of length- input sequences into network state variables satisfies the RIP(, ).

This theorem proves a rigorous and non-asymptotic bound on the length of the input that can be robustly extracted from the network nodes. By showing the RIP property on the network dynamics, the recovery bound given in Equation (4) establishes the recovery performance for any -length, -sparse signal from the resulting network state at time . In short, the number of required nodes scales linearly with the information rate of the signal (i.e., the sparsity level) and poly-logarithmically with the length of the input. The coherence factor , defined as

 μ(Ψ)=maxn=1,…,Nsupt∈[0,2π]∣∣ ∣∣N−1∑m=0Ψm,ne−jtm∣∣ ∣∣,

expresses the types of sparsity that are efficiently stored in the network. Essentially this coherence factor is large (on the order of ) for inputs sparse in the Fourier basis, and is very low (essentially a small constant) for inputs that are sparse in bases different from the Fourier bases (e.g. wavelet transforms). For the extreme case of Fourier-sparse inputs, the number of nodes must again exceed the number of inputs. When this coherence is low and , this bound is a clear improvement over existing results as it allows for . However, this result is restricted to single input streams with one type of low-dimensional structure. The current paper addresses the much more general problem of multidimensional inputs and other types of low-dimensional structure.

## 3 STM for Multi-Input Networks

In this work we will use the tools of random matrix theory to establish STM capacity results for recurrent networks under the general conditions of multiple simultaneous input streams and a variety of low-dimensional models. The temporal evolution of the linear network with multiple inputs is similar to the previous ESN definition, with the main difference being that the input at each time-step is a length vector that drives the network through a feed-forward matrix rather than a feed-forward vector,

 x[n] = Wx[n−1]+L∑l=1zlsl[n]+˜ϵ[n] (7) = Wx[n−1]+Zs[n]+˜ϵ[n].

We denote the columns of as to separately notate the vectors mapping each input stream.

We can write the current network state as a linear function of the inputs by iterating Equation (7),

 x[N]=N∑k=1WN−kZs[k]+ϵ,

where the error term is the accumulated error, and then rewriting sum as a matrix-vector multiply,

 x[N] = [Z,WZ,⋯,WN−1Z][sT[N],sT[N−1],⋯,sT[1]]T+ϵ.

Depending on the signal statistics in question, we will find it convenient in some cases to express the network dynamics in terms of a linear operator applied to an input matrix, i.e.

 x[N]=A(S)+ϵ,

where . In other cases, we find it more convenient to reorganize the columns into an effective measurement matrix applied to a vector of inputs. By defining the eigen-decomposition of , we can re-write the dynamics process as

 x[N] = U[D0U−1Z,DU−1Z,⋯,DN−1U−1Z][sT[N],sT[N−1],⋯,sT[1]]T+ϵ.

To simplify this expression, we can reorganize the columns of the linear operator (and the rows of the vector of inputs) such that all the inputs corresponding to the input vector create a single block. The row of the block of out matrix is now represented by , which can be written as , where and is the vector of the diagonal elements of raised to the power. We can more concisely by defining the matrix consisting of the eigenvalues of raised to different powers (i.e. ), resulting in the expression

 x[N]=U[˜Z1F,˜Z2F,⋯,˜ZLF][sT1,sT2,⋯,sTL]T+ϵ=A˜s+ϵ. (8)

Since the eigenvalues of here are restricted to reside on the unit circle, we note that is a Vandermonde matrix whose rows are Fourier basis vectors. From Equation (8) we see that the current state is simply the sum of compressed input streams, where the compression for each block essentially performs the same compression as for a single stream, but modulated by the different feed-forward vectors .

### 3.1 Sparse Multiple Inputs

To begin, we consider the direct extension of previous results based on sparsity models to the multi-input setting. In this setting we consider the model where the composite of all input signals is sparse in a basis so that . This means that each signal stream can be written as where is the block of . This signal model captures dependencies between input streams because a given coefficient can influence multiple channels. While in many application the basis is pre-specified (i.e. wavelet decomposition in image processing; Christopoulos et al., 2000), these bases can also be learned from exemplar data via dictionary learning algorithms (Olshausen and Field, 1996; Aharon et al., ). This sparsity model can be a useful model for signals of interest, such as video signals, where similar sparse decompositions have been used for action recognition (Guha and Ward, 2012) and video categorization (Chiang et al., 2013). With this model, we will use a generalized notion of the coherence parameter used in (Charles et al., 2014):

 μS(Ψ)=maxl,k=1,…,Lmaxn=1,…,Nsupt∈[0,2π]∣∣∑N−1m=0Ψl,km,ne−jtm∣∣∥Ψl,km∥2. (9)

In this case, each block must be different from the Fourier basis to achieve high STM capacity. This restriction is reasonable, since if a single sub-block of was coherent with the Fourier basis, then at least one input stream could be sparse in a Fourier-like basis and hence would be unrecoverable. Using this network and signal model, we obtain the following theorem on the stability of the network representation:

Suppose , and . Let be any unitary matrix of eigenvectors (containing complex conjugate pairs) and the entries of be i.i.d. zero-mean Gaussian random variables with variance . For an even integer, denote the eigenvalues of by . Let the first eigenvalues () be chosen uniformly at random on the complex unit circle (i.e., we chose uniformly at random from ) and the other eigenvalues as the complex conjugates of these values. For a given RIP conditioning , failure probability , and coherence as defined as in Equation (9), if

 M≥CKδ2μ2S(Ψ)log5(NL)log(η−1),

then satisfies RIP- with probability exceeding for a universal constant . The proof of Theorem 3.1 is provided in Appendix A.1. Note that when = 1, Theorem 3.1 reduces to Theorem 2.3. In this result we see that that the number of nodes relies only linearly on the underlying dimensionality () and poly-logarithmically on the total size of the input . This means that under favorable coherence and sparsity conditions on the input, the network can again have STM capacities that are higher than the number of nodes in the network. Specifically, showing that satisfies the RIP property, Theorem 3.1 ensures that standard recovery guarantees from the sparse inference literature hold. In particular, any -sparse input is recoverable from the network state at time up to the error bound of Equation (4) by solving the -regularized least-squares optimization of Equation (3).

### 3.2 Low Rank Multiple Inputs

Next we consider the case of a very different type of low-dimensional structure where the input signals are correlated but not necessarily sparse. Specifically, in this setting we assume that the inputs arise from a process where prototypical signals combine linearly to form the various input streams. Such a signal structure could arise, for instance, due to correlations between input streams at spatially neighboring locations. A number of interesting applications display such correlations, including important measurement modalities in neuroscience (e.g. two-photon calcium imaging; Denk et al., 1990; Maruyama et al., 2014 and neural electrophysiological recordings; Berényi et al., 2014; Ahmed and Romberg, 2015), and remote sensing applications (e.g. hyperspectral imagery; Zhang et al., 2014; Veganzones et al., 2016). The applicability of RNNs and machine learning methods to data well described by this low-rank model is also increasingly relevant as there is increasing interest in applying neural network techniques to such data, either for detection (Apthorpe et al., 2016), classification (Chen et al., 2016, 2014), or as samplers via variational auto-encoders (Gao et al., 2016). In this case, we can write out the input matrix in the reduced form , where is the matrix whose rows may represent environmental causes generating the data and represents the mixing matrix that defines the input stream. We will assume both and , meaning that is low-rank. With this model we use a definition of coherence given by:

 μ2L=R−1supω∈[0,2π]||V∗fω||22. (10)

where is the Fourier vector with frequency . This coherence parameter mirrors the coherence used for the sparse-input case. As measured the similarity between the measurement vectors and the sparsity basis , measures the similarity between the measurements and the left singular vectors of the measured matrix. The intuition here is that measurements that align with the left singular vectors are unlikely to measure significant information about the .

To analyze the STM of the network dynamics with respect to low-rank signal statistics, we leverage the dual certificate approach (Candès and Plan, 2010, 2011a; Ahmed and Romberg, 2015) to derive the following theorem, Suppose , , and . Let be i.i.d. zero-mean Gaussian random variables with variance . For an even integer, denote the eigenvalues of by . Let the first eigenvalues () be chosen uniformly at random on the complex unit circle (i.e., we chose uniformly at random from ) and the other eigenvalues as the complex conjugates of these values. For a given coherence as defined as in Equation (10), if

 M≥cR(N+μ2LL)log3(LN),

then, with probability at least , the minimization in Equation (5) recovers the rank- input matrix up to the error bound in Equation (6). The proof of Theorem 3.2 is in Appendix A.2 and follows a golfing scheme to find an inexact dual certificate. In fact, we note that since our architecture is extremely similar mathematically to the architecture in (Ahmed and Romberg, 2015), our proof is also very similar. The main difference is that due to the unbounded nature of our distributions (i.e. the feed-forward vectors are Gaussian random variables) and the fact that our Fourier vectors are on the unit circle (rather than gridded), we can consider our proof as a generalization of the proof in (Ahmed and Romberg, 2015).

Theorem 3.2 is qualitatively similar to Theorem 3.1 in the way the STM capacity scales. In this case, the bound still scales linearly with the information rate as captured by the number of elements in the left and right matrices that compose : . Interestingly, due to the left singular vectors interacting with the measurement operator first, the coherence term only affects the portion of the bound related to the number of elements in . Additionally, as before, the number of total inputs only impacts the bound poly-logarithmically.

## 4 Simulation

To empirically verify that these theoretical STM scaling laws are representative of the empirical behavior, we generated a number of random networks and evaluated the recovery of (sparse or low-rank) input sequences in the presence of noise. For each simulation we generate a random orthogonal connectivity matrix 444Orthogonal connectivity matrices were obtained by running an orthogonalization procedure on a random Gaussian matrix. and a random Gaussian feed-forward matrix . In both cases we fixed the number of inputs to and the number of time-steps to while varying the network size and underlying dimensionality of the input (i.e., the sparsity level or the input matrix rank). For the sparse input simulations, inputs were chosen with a uniformly random support pattern with random Gaussian values on the support. For low-rank simulations, the right singular vectors were chosen to be Gaussian random vectors, and the left singular values were chosen at random from a number of different basis sets.

In Figure 2 we show the relative mean-squared error of the input recovery as a function of the sparsity-to-network size ratio and the network size-to-input ratio . Each pixel value represents the average recovery relative mean-squared error (rMSE), as calculated by

 RMSE=∥ˆs−s∥22∥s∥22,

over 20 randomly generated trials with a noise level of . We show results for recovery of three different types of sparse signals: signals sparse in the canonical basis, signals sparse in a Haar wavelet basis, and signals sparse in a discrete cosine transform (DCT) basis. As our theory predicts, for canonical- and Haar wavelet-sparse signals the network has very favorable STM capacity results. The fact that the capacity achieves is demonstrated by the area left of the point () where the signal is recovered with high accuracy. Likewise, for the DCT-sparse signals we find that the inputs are never recovered well for any . This behavior is also predicted by our theory because of the unfavorable coherence properties of the DCT basis.

For the low-rank trials we see that recovery of low-rank inputs for a range of is possible as predicted by the theoretical results. As with the sparse input case we consider three types of low-rank inputs. Instead of changing the sparsity basis, however, we change the right singular vectors of the low-rank input matrix . We explore the cases where the elements of are chosen from the canonical basis, the haar basis and the DCT basis. These results are shown in Figure 3 with plots similar to those in Figure 2, but with only showing the range to reduce computational time. As our theory predicts, the recovery of inputs with canonical- and Haar right singular vectors is more accurate for a larger range of pairs than the inputs with DCT right singular vectors.

## 5 Conclusions

Determining the fundamental limits of memory in recurrent networks is critical to understanding their behavior in machine learning tasks. In this work we show that randomly connected echo-state networks can exploit the low-dimensional structure in multidimensional input streams to achieve very high short-term memory capacity. Specifically, we show non-asymptotic bounds on recovery error for input sequences that have underlying low-dimensional statistics described by either joint sparsity or low-rank properties (with no sparsity). For multiple sparse inputs, we find that the network size must be linearly proportional to the input sparsity and only logarithmically dependent on the total input size (a combination of the input length and number of inputs). For inputs with low-rank structure, we find a similar dependency where the network size depends linearly on the underlying dimension of the inputs (the product of the input rank with the input length and the number of inputs) and logarithmically on the total input size. Both results continue to demonstrate that ESNs can have STM capacities much larger than the network size.

These results are a significant (conceptual and technical) generalization over previous work that provided theoretical guarantees in the case of a single sparse input (Charles et al., 2014). While the linear ESN structure is a simplified model, rigorous analysis of these networks has remained elusive due to the recurrence itself. These results isolate the properties of the transient dynamics due to the recurrent connections, and may provide one foundation for which to explore the analysis of other complex network attributes such as nonlinearities and spiking properties. We also note that knowledge of how well neural networks compresses structured signals could indicate methods to pick the size of recurrent network layers. Specifically, if a task is thought to require a certain time-frame of an input signal, the overall sparsity (or rank) of the signal in that time-frame can be used in conjunction with the length of that time-frame to give a lower bound for the required number of nodes in the recurrent layer of the network.

While the current paper is restricted to orthogonal connectivity matrices, previous work (Charles et al., 2014) has shown that a number of network structures can satisfy these criteria, including some types of small-world network topologies. Additionally, we explored here low-rank and sparse inputs separately. The methods we have used to prove Theorem 3.2, however, have also been used to analyze recovery signals with other related structures (e.g. matrices that can be decomposed into the sum of a sparse and low-rank matrix) from compressive measurements (Candès and Plan, 2011b). Our bounds presented here therefore could open up avenues for similar analysis of other low-dimensional signal classes.

While the context of this paper is focused on the role of recurrent networks as a tool in machine learning tasks, these results may also lead to a better understanding of the STM properties in other networked systems (e.g., social networks, biological networks, distributed computing, etc.). With respect to the literature relating recurrent ESN and liquid-state machines to working memory in biological systems, the notion of sparsity has much of the same flavor as the concept of chunking (Gobet et al., 2001). Chunking is the concept of humans learning to remember items in highly correlated groups rather than remembering items individually as a way of artificially increasing their working memory. Similarly, the use of sparsity bases allow RNNs to ‘chunk’ items according to the basis elements. Thus, each basis counts only as one item (the true underlying sparsity) and the network needs only store these elements rather than storing every input separately.

### Acknowledgments

The authors are grateful to S. Bahmani, A. Ahmed and J. Romberg for valuable discussions related to this work. This work was supported in part by ONR grant N00014-15-1-2731 and NSF grants CCF-0830456 and CCF-1409422.

## A Appendix

### a.1 RIP for Multiple Gaussian Feed-forward Vectors

In this appendix we prove Theorem 3.1, showing that the matrix representing the network evolution with inputs and i.i.d. Gaussian feed-forward vectors satisfies the RIP. Recall that we have , where is derived in Section 3 and that (the vectorization of ) is sparse with respect to the basis , meaning that there is a -sparse signal such that . Similar to  (Charles et al., 2014), this proof is based on showing conditions on such that satisfies the RIP with respect to , i.e.

holds with high probability for all -sparse . This is equivalent to bounding the following probability of the event

 ∣∣∣∣(AΨ)HAΨ−I∣∣∣∣K≤δ. (11)

where the norm is defined as

 ||A||K:=supy is K−sparseyHAy||y||22.

First we bound the expectation of , and use the result to bound the tail probability of the event (11).

#### a.1.1 Expectation

First we let

 ˆA=[˜Z1F˜Z2F⋯˜ZLF]Ψ,

Since , then for any , . Therefore we only need to prove that

 ∣∣∣∣∣∣ˆAHˆA−I∣∣∣∣∣∣K≤δ,

holds with high probability when is large enough. Let denote the th row of , i.e.,

 VHi=L∑l=1~zi,lFHiΨi.

Let and ; it is easy to check that and are complex conjugates, and that . Therefore , and we only need to show that with high probability when is large enough. We can easily see that when , , and are independent.

First we show that is small with high probability when is large enough. We then show that is concentrated around its mean with high probability when is large enough. By Lemma 6.7 in Rauhut (2010), for a Rademacher sequence , , we have

 E[||B1||K]=E⎡⎣∣∣ ∣∣∣∣ ∣∣M/2∑i=1(ViVHi−1MI)∣∣ ∣∣∣∣ ∣∣K⎤⎦≤2E⎡⎣∣∣ ∣∣∣∣ ∣∣M/2∑i=1ϵiViVHi∣∣ ∣∣∣∣ ∣∣K⎤⎦.

We can now apply Lemma 8.2 from Rauhut (2010), giving us

 E[||B1||K] ≤ 2E⎡⎣∣∣ ∣∣∣∣ ∣∣M/2∑i=1ϵiViVHi∣∣ ∣∣∣∣ ∣∣K⎤⎦ (12) ≤ 2E[E[C0Vmax√Klog(100K)√log(4NL)ln(5M) ⋅ ⎷∣∣ ∣∣∣∣ ∣∣M/2∑i=1ViVHi∣∣ ∣∣∣∣ ∣∣K |Vi,i=1,...,M/2⎤⎥⎦⎤⎥⎦ ≤ √C1Klog4(NL)√E[V2max]E[||B1||K+12],

where the last inequality results due to , , the Cauthy-Schwarz inequality and the triangle inequality, and Note that the th element of can be written

 VHl(p)=L∑i=1~zl,iFHlΨi(p).

and since we know that

 L∑i=1|FHlΨi(p)|2≤L∑i=1||Ψi(p)||22μ2(Ψ)=μ2(Ψ),

and

 Vmax=max1≤l≤M/21≤p≤NL|Vl(p)|

we can now use Corollary A.1.3. Setting , in Corollary A.1.3 yields

 P[V2max>μ2(Ψ)MlogMNL2η]≤η, (13)

and

 E[V2max]≤μ2(Ψ)M(logMNL2+1)≤C2μ2(Ψ)Mlog(NL). (14)

Returning to the inequality in Equation (12), considering (14), we have

 E[||B1||K]≤a√E[||B1||K]+1,

where . Then . When , we get . Let , and we can conclude that when

 M≥C3Kμ2(Ψ)log5(NL)δ′,

then

 E[||B1||K]≤δ′.

#### a.1.2 Tail Bound

Now we study the tail bound of . First we construct a second set of random variables , which are independent of and are identically distributed as . Additionally, we let

 ˜B1=M/2∑i=1(ViVHi−V′iV′Hi),

and then according to Charles et al. (2014), there is

 E[∣∣∣∣˜B1∣∣∣∣K]≤2E[||B1||K], (15)
 P[||B1||K>2E[||B1||K]+u]≤2P[∣∣∣∣˜B1∣∣∣∣K>u]. (16)

Now since we have

 ∣∣∣∣ViVHi−V′iV′Hi∣∣∣∣K≤2max{∣∣∣∣ViVHi∣∣∣∣K,∣∣∣∣V′iV′Hi∣∣∣∣K},

and

 ∣∣∣∣ViVHi∣∣∣∣K≤supyisK−sparse||Vi||2∞||y||21||y||22≤KV2max,

then we know that and , where then by Equation (13), we obtain

 P[max1≤i≤M/2∣∣∣∣ViVHi∣∣∣∣K>Kμ2(Ψ)MlogMNL2η]≤P[V2max>μ2(Ψ)MlogMNL2η]≤η.

Since the probability theorems depend on bounded random variables, we define to denote the following event

 F={max{max1≤i≤M/2∣∣∣∣ViVHi∣∣∣∣K,max1≤i≤M/2∣∣∣∣V′iV′Hi∣∣∣∣K}≤Kμ2(Ψ)MlogMNL2η},

such that . Furthermore, we define as the indicator function of , and let , where , is a Rademacher sequence and independent of . The truncated variable has a symmetric distribution and is bounded by . By Proposition 19 in Tropp et al. (2009), we have

 (17)

for all . Following Tropp et al. (2009), we find that

 P[∣∣∣∣˜B1∣∣∣∣K>v]≤P[∣∣∣∣ˆB1∣∣∣∣K>v]+P[FC], (18)

and

 E[∣∣∣∣ˆB1∣∣∣∣K]≤E[∣∣∣∣˜B1∣∣∣∣K]. (19)

By combining Equations (17), (18), and (19), we get

In Equation (20), let , and . These values yield

 P[∣∣∣∣˜B1∣∣∣∣K>C4(√logη−1E[∣∣∣∣˜B1∣∣∣∣K]+(logη−1)Bmax)]≤4η. (20)

Now we can combine Equations (15), (16), and (20) to get

 P[||B1||K>2E[||B1||K]+2C4√logη−1E[||B1||K]+C4(logη−1)Bmax]≤8η.

By choosing

 M≥C3Kμ2(Ψ)log5(NL)δ′2, (21)

where , we obtain

 P⎡⎣||B1||K>2δ′+2C4δ′√logη−1+C5logη−1δ′2log(12MNLη−1)log5(NL)⎤⎦≤8η.

We observe now that

 log(12MNLη−1)log5(NL)≤2log(NL)+logη−1log5(NL),

indicating that when , i.e., , then

 log(12MNLη−1)log5(NL)≤C6,

, for a constant . This inequality reduces our probability statement to

 P[||B1||K>2δ′+2C4δ′√logη−1+C5C6logη−1δ′2]≤8η.

For an arbitrary with , we let

 δ′=δ2C4C7√logη−1, (22)

resulting in

 2C4δ′√logη−1=δC7,
 2δ′=δC4C7√logη−1≤δC4C7,
 C5C6logη−1δ′2=C5C6δ24C24C27≤C5C68C24C27δ.

Then we can see that if

 C7≥max⎧⎨⎩3,3C4,√3C5C68C24⎫⎬⎭,

then

 P[||B1||K>δ3+δ3+δ3]<8η.

Now by plugging (22) into (21), we know that when , there exists a constant such that when

 M≥CKμ2(Ψ)log5(NL)log(η−1)δ2,

there is

 P(||B1||K>δ)≤8η,

which completes the proof.

#### a.1.3 Lemmas for Theorem 3.1

Suppose we have complex Gaussian random variables, , , where and denote the real and imaginary parts of . Let and i.i.d Gaussian r.v.’s with mean 0 and variance .