1 Introduction
Recurrent neural networks (RNN) effectively learn features which characterize multidimensional sequential data. RNN utilize these features to solve a variety of machine learning problems such as classification, semantic understanding, translation, synthesis
[1]. Incorporation of RNN in systems with sequential inputs and/or outputs indeed enhanced performance in various application fields such as natural language processing, robot control, speech recognition, music composition, and human action modeling or recognition
[2, 3, 4, 5, 6, 7, 8, 9]. Notably, while the nature of the applications and of the sequences types is distinct, the structure of the underlying RNN is based on a single major principle of gatedunits (e.g. LSTM, GRU) and training with gradientdescent backpropagation optimization [10, 11].The narrow choice of network architectures is due to the fact that RNN with no particular structure, called vanilla or randomRNN (RRNN), turn out to be nonrobust and are susceptible to diminishing/exploding gradient problem when trained with gradient descent. While gatedRNN alleviate these problems, they leave an important question yet to be resolved: are there alternative architectures and training approaches naturally applicable to sequences? and if yes, what are they? Furthermore, it is intriguing to find out if these architectures are biologically plausible and could inherit the effectiveness of brain networks in processing multidimensional sequences and tasks.
A fundamental task upon which networks can be tested is the task of targetlearning where the output of the network is a sequence that the network generates. In such a setup, the network does not receive input and is expected to generate the prescribed target after training. Networks aimed at this task are often termed as liquid state machines as they can reorganize their weights and set their output without input to guide target generation. Similarly to other tasks, gradientdescent training of RRNN for the targetlearning task turns out to be inherently unstable. The problem stems from two aspects: (i) RRNN are either quiescent or chaotic
for most parameter settings sampled from the normal distribution. Both states are fundamentally considered "nontrainable" since they require to either control chaos or to prevent dissipation of energy
[12]. For a few parameters, the network might be in a transitional state between these two states (i.e., edgeofchaos). (ii) Training RRNN with gradient descent optimization will typically run into the diminishing/exploding gradients problem.Several methods have been introduced in an attempt to resolve the aforementioned challenges and proposed alternative network architectures and training. The Echo State approach proposed to clamp the network during training to achieve a feedforward propagation [13]. Such architecture is designed to avoid the delay effects and chaotic dynamics of recurrences in RRNN. While such architecture can indeed support variants of targetlearning, it turned out to be sensitive to perturbations and require longer training [14, 15, 16]. To tackle sensitivity in both training and performance, an alternative optimization approach was proposed for RRNN. The approach is named First Order Reduced and Controlled Error (FORCE) and proposed to replace gradientdescent backpropagation with recursive leastsquare training and feedback with fixed random weights [17]
. The recursive leastsquares method computes an inverse correlation matrix and utilizes the feedback to rapidly dampen the error to converge to the desired target as training proceeds. The method was shown to successfully train RRNN on targetlearning tasks with various configurations. In particular, it was shown to succeed in training Rank1RRNN, which are highly challenging as they consist of a single trainable readout unit which receives input from a reservoir of units connected by fixed weights. Such a structure hints at plausible similar network configurations in the brain which have been shown to include readout neurons summing up information from a neural population and to use synaptic plasticity of only particular neurons. Such a setup is perplexing from machine learning perspective as well since it constraints the optimization to a single dimension of the adjacency matrix.
For both Rank1 and RankNRRNN, FORCE method was shown to perform well when the network prior to training is set in the edgeofchaos regime, i.e., when the global scaling parameter of the connectivity, , is in the range of . In this regime, FORCE training achieves low errors and these errors continue to be controlled after training. Interestingly, even in this regime, performance would not be flawless such that in the most optimal regime there would exist about of network initializations for which the generated sequence diverges from the target. In a recent adaptation of FORCE (FullFORCE) the failure rates were reduced to lower percentages of about by incorporation of auxiliary RRNN [18]. Such an approach enhances robustness, however, requires longer training and is unable to fully exclude failures.
Concurrently, methods for stabilization of training for gradientdescent backpropagation besides the direct approach of gradient clipping and normalization
[19, 20, 21] were developed. Such methods impose additional constraints on the connectivity with the aim of generating RNN types which support optimal training and validation performance. A wide class of such approaches is spectral initialization of connectivity matrices according to a particular distribution, e.g., requiring the matrices to be identity, orthogonal or antisymmetric [22, 17, 23, 24, 25, 26, 27, 28, 29, 30]. Indeed, analysis of RNN as a discrete dynamical system suggests that these constraints lead to robust network models. The analysis computes the Jacobian matrix of RNN states step by step as they process the sequence with respect to initial network states. The position of the Jacobian matrix spectrum with respect to the stability region in the complex plane indicates possible stability or instability of the states [31, 32]. For eigenvalues outside of the stability region, their respective eigenvectors represent unstable directions, while for eigenvalues inside the stability region, their respective eigenvectors represent stable directions. It is, therefore, suggested to set the eigenvalues close to the boundary of the stability region to avoid both instabilities and rapid decay. In particular,
[30] proposed the Antisymmetric RNN which sets up the initial connectivity as an antisymmetric matrix such that all eigenvalues are in a narrow band near the imaginary axis. In this case, while the Antisymmetric RNN was shown to be trained using back propagation for classification tasks on MNIST and CIFAR10, this initialization can’t be directly utilized in FORCE training for targetlearning task, as we show in Section 2.In this work we analyze the spectral properties of the Jacobian that allow FORCE training to be a successful procedure for RRNN. We then proceed and summarize these successful spectral properties into four generating principles for the distribution of connectivity weights. The generating principles provide a method for initialization of the connectivity by sampling from this distribution. We call the method Robust FORCE (RFORCE) since it facilitates stable evolution of network states and efficient training in conjunction with FORCE weight optimization. We demonstrate that RFORCE performance on targetlearning task is superior to previously proposed approaches. Furthermore, the performance is robust for a wide range of
values for various RRNN including Rank1RRNN architecture. The robustness is expressed in achieving both a smaller average error over trials and targets and also in a narrower confidence interval, i.e., fewer outliers. This becomes critical in cases where the RRNN is expected to simultaneously model multidimensional targets. As the dimension of the target increases, the outliers are more likely to occur. We demonstrate such a scenario on an example of modeling 3D body joint dynamics where the joints coordinates are represented by 66 sequences.
2 Spectral Analysis of Rank1RNN and FORCE Learning
Rank1RRNN architecture, illustrated in Fig. 1, is the simplest RRNN architecture proposed for FORCE targetlearning [17]. The network is a recurrent neural network with firing rate units representing the activity of units, i.e. , where the activity variables follow the evolution equation
(1) 
The units (reservoir) are connected through the adjacency matrix of connection weights which are scaled by a single parameter . In Rank1RRNN these values are fixed and are initially set to a sample from the normal distribution , where is the number of neurons and indicates the sparsity of . In addition to units, there is a readout unit , which receives input from all activated units, i.e.,
and linearly integrates the inputs using the weight vector
(2) 
In addition, the readout unit has feedback connections to the reservoir ( units) through multiplication of with a vector
. The elements of this vector are fixed and are set according to a sample from the uniform distribution of
. In summary, after the Rank1RRNN network is initialized, only the vector which connects the units with the readout unit is optimized during training. For more details see [17] and references therein.When the connectivity of units is set according to , the spectrum of the adjacency matrix appears to be distributed homogeneously within the unit circle in the complex plane, as discussed in [33]. Accordingly, the eigenvalues of the scaled matrix are distributed within the circle of radius , as shown in the first row of Fig. 2.
When we add additional connections to and from the readout unit we obtain the matrix , of dimension , i.e additional column vector and row vector
Notably, will be modified since is optimized. Over the training period, the spectrum of appears to remain similar to the initial spectrum. This is expected since only a single dimension of the matrix is being updated [34]. We show the spectrum of at the last step of training () in the second row of Fig. 2A. The spectrum of is distinct from the spectrum of by having eigenvalues being placed outside of the circle of radius in the complex plane. In the case of and there are a handful of eigenvalues, and in the case of
there are many such eigenvalues. To better estimate the stability of training, we compute the Jacobian
of , current state at time , with respect to the initial state at [30]. The Jacobian is defined as follows(3) 
and is a matrix of dimension . Since Forward Euler (FE) scheme is used to evolve Eq. 1 in time, we identify the stability region of the scheme with respect to spectrum. The stability region of FE is a circle in the complex plane centered at with radius of . As decreases, its radius grows and the center moves to the left such that in the limit the circle covers the whole negative real quadrant. This indicates that for finite the right boundary of the circle asymptotes to the imaginary axis is a critical stability boundary since it is easier for eigenvalues to cross the right boundary and introduce instability into the network. When the eigenvalues are placed within the stability circle, , it indicates that the evolution/training is contracting, and when there are eigenvalues outside of the stability circle, , it indicates that the evolution could be unstable. For robust learning it is therefore desirable to keep the eigenvalues of the Jacobian within the stability circle with several of them being close to the boundary of the circle such that the evolution does not vanish.
In the third row of Fig. 2A we choose and plot along with a unit circle (solidred line) and the right boundary of FE stability circle (dashedred line). We inspect the spectrum of at the last step of training (), i.e. , to investigate whether indeed exhibits such properties. We observe that only for , has the desired distribution of the spectrum such that eigenvalues appear exclusively inside the unit circle and on its boundary. However, for and , the spectrum has too many eigenvalues either inside or outside the boundary, see the zoomin inserts in the third row of Fig. 2A. Such observations are consistent with FORCE learning being successful exclusively in the edgeofchaos regime of [17].
We therefore further investigate the sensitivity of FORCE learning to variation of by setting the target function as a superposition of four sine waves and following the experiments in [17]. We train the network with uniformly spaced values of in the range with a step of and repeated trials () for each . Our results are shown in Fig. 2B where we plot the Mean Absolute Error (MAE) as a function of , such that defined as . Our results indicate that FORCE learning is indeed sensitive and performs optimally in a narrow range of values, namely interval of . The narrow performance regime appears to be consistent across target functions that we tested. In Fig. 2C we show the performance of Rank1RRNN with three different target functions : (i) superposition of five sine waves (multiperiodic), (ii) rapidly changing superposition (discontinuous) (iii) superposition of four sine waves and Gaussian noise (noisy). During training with below (e.g. ) the readout unit typically follows the target, however during testing, the network dissipates information and eventually ‘looses’ the pattern learned during training. We note that for some signals (e.g., discontinuous target) even under constant training updates the error does not reduce to zero. When the value of is above the interval (e.g. ) the network becomes sensitive during testing such that the error grows rapidly and the generated sequence diverges from the ground truth target. Furthermore, even in the narrow range of (e.g. ), while Rank1RRNN appears to effectively learn various patterns and consistently predict them over time, outliers still occur in some cases (e.g., noisy target).
These experimental results are consistent with the interpretation of spectrum distribution in Fig. 2A. When , most of the eigenvalues are inside the stability circle (dashedred circle). This case corresponds to a lossy system which dissipates information. On the other hand, when there are too many eigenvalues outside the stability circle, the system is expected to be unstable, as shown for the case of . Ideal situation appears to be the situation where there are some eigenvalues hovering around the circle as in the case of , while the majority is located within the circle. Indeed such a condition was proposed as a potential criterion for preserving longterm dependency [30].
Given these results for Rank1RRNN our aim is to explore whether alternative approaches can provide more stable and consistent performance for a wider range of values the scaling parameter . We propose a novel method, Robust FORCE (RFORCE), which initializes to a distribution that makes a full use of FORCE training, i.e., designed to keep in the stability range.
3 RFORCE Initialization
According to our findings, we seek for approaches to initialize network connectivity such that the network will be in an optimal state for FORCE learning. Rank1RRNN Jacobian (Eq. 3) is controlled by four different quantities, namely , , and . The first three quantities are vectors and expected to have lowrank (rank 1) impact on the connectivity. Thereby, they are not expected to significantly impact the distribution of [34]. We also empirically verify this conjecture and confirm that during training and testing, has similar spectrum as . We therefore focus on the initialization of and seek for specific distributions of the initialization which lead to positioning in the stability region, such that are hovering around the stability circle for all times.
One possibility for initialization of is to constrain it to be an antisymmetric matrix as implemented for RNN in related work [30]. Such constraint guarantees that for Rank1RRNN all eigenvalues of will be positioned on the imaginary axis. While this condition is enough for a stable ODE, discretization of Forward Euler can introduce instability. Antisymmetric RNN can be also combined with a diffusion term as proposed in [30], however, as we find, such a condition is insufficient to keep in the stability range for all . This initialization typically results with instabilities that develop with evolution of the sequence. We thereby conclude that additional constraints on the distribution of the elements of are needed.
We call the new initialization of : , and we generate by generating its eigendecomposition. Specifically, we generate the matrix , in which each column is an eigenvector, and a diagonal matrix , in which diagonal entries correspond to eigenvalues
(4) 
Notably,
is assumed to be an orthogonal matrix, i.e.
.Generation of Eigenvectors. We generate a realvalued random matrix from normal distribution . To create that will produce realvalued we have to make sure that includes conjugate pairs and it is an orthogonal matrix [35]. We therefore generate an antisymmetric matrix from , , which is known to have conjugate eigenvectors. Eigendecomposition of provides the desired eigenvectors matrix .
Generation of Eigenvalues. We seek for optimal distribution of eigenvalues to satisfy the constraint that are positioned in the stability range for all . In the search for such distribution we explored various configurations and identified four key generating properties that the distribution should posses. We describe the properties below and visualize their necessity in Fig. 3(BG). Including these four properties provides a generating method for initialization of , namely RFORCE distribution for initialization of . The RFORCE distribution is depicted in Fig. 3A. Effectively, RFORCE is a random distribution and thereby many samples will provide stable configurations for Rank1RRNN. Since the elements of are realvalued, which implies that eigenvalues are conjugate pairs, we only generate half of eigenvalues and the other half is assigned as conjugate pairs. To test the optimality of RFORCE distribution and compare it with other candidate distributions we used Rank1RRNN with neurons and a target function composed of superposition of four sine waves. We alos set the scaling parameter values uniformly spaced in the range .
Circular Distribution.
The magnitude of the eigenvalues plays an important role in determining the stability of the system. To generate with the same magnitude, circular distribution centered at the origin, is convenient since it will generate eigenvalues with identical magnitude and various combinations of oscillatory (imaginary) and dissipative/expanding (real) components. Furthermore, such distribution can be tuned by two parameters: . We fix and sample from the uniform distribution over the closed set . While such distribution is theoretically promising, when we generate a single circle with radius , we obtain similar error curves as in standard initialization (normal distribution) for FORCE learning, see Fig. 3B. Specifically, we obtain a narrow range of values, for which the average error is controlled, with even larger confidence interval in this regime. We therefore keep the convenient circular shape (distribution of values) but extend the distribution to include multiple circles (multiple values).
Sets of Circles. We generate concentric circles, where the radius of the largest circle is placed at and the smallest at . The choice for these values is motivated by the observation that for the single circle initialization the optimal interval of values turned out to be and thus we would like to extend that optimal interval to the full range of . We then generate the remaining additional inner circles spaced uniformly in between the largest and smallest circles. As more circles are introduced ( increases) we observe improvement in performance, however, performance does not improve indefinitely and we find similar performance for distributions with , Fig. 3C. We thereby typically set in further extensions of the distribution. Analysis of the performance indeed indicates significant improvement for low values of (), but such distribution does not lead to same performance for . To explore whether radii of the inner circles could impact the performance, we also implement nonuniform distributed radii of circles. We show one such configuration with in Fig. 3D. In comparison with the uniform distributed case, such distribution correspond to testing error curves more robust for , however, they are no longer optimal in range.
Density.
We vary the density of the eigenvalues in each of the circles to achieve improvement in performance. The density, , is varied inverse proportionally to the radius of the circle, i.e., as increases decreases. The intuition for such a rule is that the eigenvalues with larger magnitude are more prone to instabilities and only a few of these should be included.
We show an example of a typical performance of such distribution in Fig. 3E. Indeed, in this setup the error curve shows improvement in performance for all examined values of both in terms of average error and magnitude of the confidence interval. However, when we focus on the performance for (the best case scenario for standard Rank1RRNN) such distribution does not lead to superior performance.
Segmented Circles.
Another plausible variation for the circular distribution is considering arc segments instead of full circles. This can be controlled by generating in a specific range rather than . Due to the symmetry of the eigenvalues, the arcs will be replicated on the conjugated quadrant. We first equally divide the upper half of the complex plane into four fanshaped regions without overlap, i.e., each region contains one arc with radius . Then eigenvalues are sampled on each arc separately. The radius and the density of each arc follow the same configuration as previously described and depicted in Fig. 3E.
The segmented configuration is shown in Fig. 3F and its computational results indicate that it corresponds to an error curve with optimal values at higher values of , i.e. . According to this observation, we notice when , the rightest arc, i.e. , has radius close to which is consistent with the one circle case in Fig. 3A. Thus we deduce that assigning the right component of spectrum of into the optimal region, i.e. is critical for network’s stability. The idea leads us to come up with the RFORCE distribution in Fig. 3A. This distribution is fulfilled by combining four generating principles described above: multiple circular sets, nonuniform distance, unequal density and segmented sets. Four circles are used in this setup. The radii are labeled as , respectively. The density of each circle is set according to the following formula obtained based on empirical computations
(5) 
Here is the percentage of eigenvalues to be placed on the th circle with the radius . We also implement a threshold of radius of the largest circle to determine and . When the largest radius is greater than this threshold, we only place of eigenvalues on this circle and distribute the other in accordance with . Otherwise, all eigenvalues are assigned in accordance with .
The angle for each circle are set as follows: when , the three nonoverlapping circles with , , are distributed over , and . When , the distribution changes to , and The circle overlaps with for low values of () and otherwise overlaps with .
4 RFORCE Properties
We investigate RFORCE distribution properties by comparing the reservoir connectivity matrix initialized according to RFORCE () against standard initialization () for a fixed value of (). The generated spectrum of is shown in Fig. 4A. In Fig. 4B we depict the histogram of elements. It appears that the majority of the weights are distributed according to a normal distribution between and as in
. To quantify the similarity between the two distributions, we compute the QuantileQuantile plot (qqplot) between
and normal distribution shown in Fig. 4C. Indeed, the upper segment of the blue line represents the majority of entries in and is closely aligned with the red line that represents the normal distribution. However, the lower segment of the blue line appears to depart from the normal distribution (quantiles which correspond to elements ) and indicates that there is a small amount of negative elements in that do not follow the normal distribution. These elements are shown in the zoomin inset of Fig. 4B. Such tail distribution can be attributed to the constraints that we applied to the generation of .We proceed and investigate how spectrum is modified during training and what are the properties that contribute to stability. In Fig. 4D, we follow the same setup used in third row of Fig. 2A where . We plot along with the unit circle(red line) and the right boundary of FE stability circle (dashedred line) for representative time steps and different evolution scenarios: (i) before training (top), (ii) during training (middle) and (iii) during testing (bottom). We also zoomin into the right part of the spectrum for these three scenarios. In (i) exhibits only minor fluctuations during evolution. This is expected since all the weights are fixed and the fluctuations are due to the evolution of term in Eq. 3. The zoomin plots show that the fluctuations are not visually noticeable near the right component of the stability circle. Notably, this spectrum stays near the boundary for all timestep which is the optimal condition for training and implies this network is trainable.
During training, however, the spectrum undergoes significant changes, especially in its right component (scenario (ii)). We observe that the eigenvalues located near the right boundary of the circle oscillate back and forth. This behavior hints at the ability of the network to learn to generate periodic and time evolving sequences through specific oscillations of eigenvalues across the critical circle boundary. We observe such oscillations during testing as well (scenario (iii)). This indicates that the network learns and memorizes this evolution pattern during training and is capable to reproduce it during testing.
5 RFORCE TargetLearning
Rank1RRNN Architecture: We demonstrate RFORCE performance on targetlearning task with Rank1RRNN architecture. RFORCE results are shown in Fig. 5 and we first show , and trained on multiperiodic target with , where notation corresponds to initiation of with RFORCE. Consistently with the construction of RFORCE, we find only a few samples of outside of the stability circle and several eigenvalues positioned near the stability circle. Notably, while it can be observed that the distribution of is of different nontrivial shape for each value of , the rightmost component of the spectrum is similar. This indeed demonstrates the contribution of the right component of spectrum to stability.
Error ()  g = 1.8  All  

FORCE  
RFORCE (Our) 
Error ()  g = 1.8  All  

FORCE  
FullFORCE  
RFORCE (Our) 
In correspondence with Fig. 2B, in Fig. 5B we show the error curves of RFORCE learning of the multiperiodic target in the range of . We then compare RFORCE error curves with the error curves of FORCE for the same target function and values. As can be observed, RFORCE (solidgreen line) targetlearning corresponds to a wider operational regime (average error is controlled for all tested values) while for FORCE (dashedgray line) there are values for which learning does not succeed, i.e., the error is not controlled. Furthermore, when comparing RFORCE error with the lowest error that FORCE can achieve at we observe that RFORCE achieves a consistently lower error for this same value. We summarize the comparison with a quantitative study of computing both MAE and CI (with ) for a range of and show the results in Table 1. Indeed RFORCE consistently performs with significantly lower error in terms of MAE and the upper value of CI for all values that we tested. On average, RFORCE achieves an order of magnitude improvement over FORCE.
In Fig. 5C we visually show the performance of Rank1RRNN for three example target functions (i) multiperiodic (ii) discontinuous and (iii) noisy for three representative values of () as was done for standard FORCE learning. We obtain that RFORCE generates a visually indistinguishable output from the groundtruth target, even when the training is done with noisy target. For comparison, FORCE learning on these targets was able to achieve the same shape of the output only for and training with noise dampened performance.
In Fig. 6 we randomly select three units from the reservoir of Rank1RRNN trained with FORCE (top) and RFORCE (bottom) for and compare their activity evolution before training (red), during training (blue) and after training (testing) (green). It appears that before training, the units in RFORCE Rank1RRNN have similar evolution to units in FORCE Rank1RRNN except for mild differences in the amplitude (RFORCE units amplitudes appear to be normalized). During training, however, there are significant differences between the two networks. RFORCE units appear to approximately follow the shape (or part of it) and the periods of the target while FORCE units are noisy and exhibit a variety of periods and oscillation patterns. Such behavior persists in RFORCE units after training while FORCE units become noisier after training for nonoptimal values.
HigherRank RRNN Architecture.
We continue and further test the performance of RFORCE initialization with higher rank networks, i.e., RRNN where more weights undergo optimization. Specifically, RankNRRNN architecture, depicted in Fig. 7(top) and proposed in [17], supports training of the weights within the reservoir and the readout weights, i.e., the connectivity matrix and weights are being optimized (green). In such a network the feedback loop, weights , are removed since recurrences within the reservoir support bidirectional flow.
We compare RankNRRNN performance on targetlearning task when it is initialized with FORCE and RFORCE. We also compare the performance with FullFORCE which proposed an additional auxiliary network along with highrank RRNN to achieve robustness in targetlearning [18]. For all the methods, we tested the performance on the same multiperiodic target function and with same values in the interval . Fig. 7 (bottom) shows the error curves for the training performance of RFORCE (leftblue), testing performance of RFORCE with individual points (middlegreen) and comparison of the error curves obtained by RFORCE, FullFORCE and FORCE (solidgreen, dottedgray, dashedgray).
These results indicate that RFORCE achieves a bounded and effective testing error for all values that we tested similarly to Rank1RRNN but with even lower error. This is not the case for FORCE method which still achieves a bounded error in a narrow interval of and FullFORCE which achieves a bounded error in the interval of . Furthermore, at the optimal values of FORCE () and FullFORCE (), RFORCE achieves a lower error and narrower CI by a significant margin, see zoomin inset in Fig. 7.
We also show a quantitative comparison between the three methods in Table 2 where we demonstrate that RFORCE with RankNRRNN architecture obtains a twofold improvement in accuracy in comparison with RFORCE with Rank1RRNN, while other methods are unable to harness the higher dimension of optimization parameters for significant accuracy improvement.
6 Modeling and Learning Multidimension Targets with RFORCE
Typical target functions in applications are multidimensional sequences represented as spatiotemporal timeseries. For example, modeling body joints dynamics involves modeling simultaneously all positions of the joints, i.e., their 3D coordinates evolving in time. Also biological networks of neurons are processing multivariate timedependent stimuli and signals. We thereby proceed and examine the performance of RFORCE on the task of targetlearning of human body movements. Consistently with onedimensional target functions experiments we compare RFORCE results with FORCE and FullFORCE.
We use the Physical Rehabilitation Movements Data Set (UIPRMD) which contains positions and angles of body joints for 10 common exercises performed by patients in physical therapy and rehabilitation programs, such as deep squat, hurdle step, etc [33]. We represent each movement as a matrix where the coordinates of joints (such as head, spine, waist, etc) are elements of dimensional column vector. The vector is sampled with frequency of Hz over the course of the movement. To learn to generate the movements we setup a Rank1RRNN with readout units such that each unit represents a single coordinate of a joint. The network also contains a fixed reservoir of or units and each readout unit linearly integrates the activity of the neurons in the reservoir. Each unit also transfers information back to the reservoir through a feedback layer typical to Rank1RRNN, Fig. 8 A. We train the network on each of the movements. To obtain sufficient data for training we concatenate the series such that they repeat for 15 episodes and we normalize the values such that they are in the range of . For testing, we reconstruct the normalized generated sequences back to the original range.
In Fig. 8 ABC we demonstrate the performance of RFORCE (green), FullFORCE (cyan) and FORCE (blue) at optimal value () for each method in generating the sequences for the coordinates representing the deepsquat body movement. In such movement most of the joints exhibit intricate rapidly evolving functions as shown in Fig. 8A. In Fig. 8B we show the error for the three methods being compared, where each coordinate (readout unit) is a dot. Our results indicate that among the compared methods, RFORCE achieves the most optimal accuracy with the tightest CI. The extent of CI and specifically the number of outliers from the CI play a critical role in the robust generation of the desired movement. We demonstrate the generated evolution of the outlying coordinate with the largest error in Fig. 8C. As can be observed, the sequences generated by FORCE (blue) or FullFORCE (cyan) diverge from the groundtruth target (red) after a short period of time (less than a single episode of the movement) while the sequence with the largest error generated by RFORCE (green) generates similar sequence for a long time. This experiment demonstrates that even in the optimal regime where many sequences will be generated approximately well, outliers which are inevitable in FORCE and FullFORCE will impact the overall generation of multidimensional target functions. RFORCE appears to minimize the CI along with the average error and thus exhibit a more robust performance.
We further test the performance of RFORCE on five different movements in the database (examples shown in Fig. 8D), various numbers of neurons in the reservoir ( (Rank1RRNN400) and (Rank1RRNN1000)) and three values of the parameter . We summarize the performance and compare it with the performance of other methods in Table 3. RFORCE consistently achieves the best error interval for all configurations and movements that we have tested. In the cases of nonoptimal choice of parameter, , RFORCE achieves improvement of approximately an order of magnitude than the methods it is compared with. For an optimal choice of the parameter , , RFORCE learning applied to Rank1RRNN1000 outperforms any other tested network architecture and training method. Furthermore, we observe that the application of RFORCE to a smaller network, Rank1RRNN400, outperforms other methods using Rank1RRNN400 on all tested movements. It also performs similarly or better than the performance of these methods with Rank1RRNN1000 architecture. Thereby, RFORCE400 training can be used as a means for speedup of training (e.g. for the deep squat movement, RFORCE400 training time is min, while FORCE1000 and FullFORCE1000 running times are min and min respectively).
Methods  Deep Squat  Right Arm  Right Leg  Hurdle Step  Right Arm 

Scaption  Raise  Raise  
Optimal Regime () Total Error  
FullFORCE400  
FORCE400  
FullFORCE1000  
FORCE1000  
RFORCE400 (Our)  
RFORCE1000 (Our)  
NonOptimal (Quiescent) Regime () Total Error  
FullFORCE1000  
FORCE1000  
RFORCE1000 (Our)  
NonOptimal (Chaotic) Regime ( ) Total Error  
FullFORCE1000  
FORCE1000  
RFORCE1000 (Our) 
7 Discussion
Random RNN (RRNN) are insusceptible for training using standard optimization methods. Here we introduce the RFORCE initialization method for RRNN connectivity prior to training. When the initialization is followed with FORCE learning it facilitates significantly more robust targetlearning performance than current approaches for a wide regime of RRNN intrinsic dynamics and network architectures. We infer the RFORCE distribution of weights with a series of computational experiments and formulate the outcome as four generating principles. These principles constrain the spectrum of the Jacobian of the network to remain in a well balanced regime in which the network is both stable and trainable.
Our method is inspired by related work which proposed that connectivity distribution according to orthogonal or antisymmetric adjacency matrices will initialize the spectrum of the Jacobian near the unit circle and hence will set the network in the desired balanced regime [30]. However, as we show, such placement of eigenvalues is insufficient to achieve connectivity that corresponds to a well balanced regime and to provide long term stability. In particular, further conditions that we infer in this work are required.
Our analysis of RFORCE distribution indicates that it is similar to the normal distribution with several exceptions. RRNN wired according to the normal distribution have been well studied [12, 36]. These networks exhibit intrinsic dynamics that are either quiescent or chaotic. Training of RRNN initiated from such distribution was shown to be robust exclusively in the parameter range of edgeofchaos (transition from quiescent to chaotic dynamics in terms of parameter which amplifies interactions). A close inspection of RFORCE distribution indicates negative weights which deviate from the normal distribution and form a longer tail on the negative axis of the distribution. This observation points out that the negative weights succeed to confine the spectrum of the Jacobian not only during training but also during testing, i.e., when the weights are fixed. Such a property suggests that negative (inhibitory) weights have an important role in balancing the dynamics of the network and could be configured with the initialization that we infer. Indeed, in biological neural networks, balance of excitation and inhibition has been found to be a critical requirement for a proper function of the network [37, 38, 39]. Our work strengthens these findings and provides additional insights into the generating rules of the distribution of connectivity that generic networks need to have to exhibit robust functionality.
Notably, RFORCE is different from approaches which control the robustness of RNN by regularization of the weights while the network is being trained or tested, e.g., [40, 41, 42]. In contrast to these methods, RFORCE is an initialization procedure and aims to place the network apriori in an optimal distribution such that it would be suited for robust training and performance. In particular, we show that RFORCE is even successful in stabilizing training and functionality of Rank1RRNN architecture. For these networks, initialization plays a key role since the majority of weights remains fixed from origination and only a fraction of the weights are optimized during training. Thereby our findings are potentially indicative of how generic networks of neurons are optimally structured to be ‘amenable for learning’ and potentially an interface between computational research of optimal wiring of RRNN and anatomical wiring of brain networks (i.e., connectomics [43, 44, 45]).
We demonstrate that RFORCE performance on targetlearning task is superior to previously proposed approaches. RFORCE achieves both a smaller average error and a narrower confidence interval, i.e., fewer outlying cases in which the sequence generated by the network deviates from the target in shorttime. Outliers are critical in modeling and generating multidimensional targets. As the dimension increases, the probability for an outlier to occur in one of the dimensions of the sequence is effectively multiplied. For example, for multiperiodic sequences that we have tested, we estimate that prior methods probability for outliers is 15% for FORCE and 7% for FullFORCE in the optimal regime of their performance. This indicates that for sequences with higher dimension than 15, it is almost surely guaranteed that at least one of the dimensions(sequences) will not be generated correctly and will be an outlier. For RFORCE, for most architectures and parameter settings that we have examined we did not observe shorttime outliers. We estimate that the probability for outliers in RFORCE is less than 1%. This allows us to safely model sequences with 100 dimensions with RFORCE. We demonstrate the impact of RFORCE learning on modeling five examples of 3D body joint dynamics. Each movement is represented with 66 sequences, i.e. 330 sequences in total, where each sequence is a different target function. Our experiments indicate that RFORCE is capable to generate these sequences with typical total error error of
and not exceeding .In summary, RFORCE initialization facilitates robust performance of the targetlearning task. We have demonstrated the principles upon which RFORCE distribution is generated and provided an algorithmic way to generate the distribution. Our results are generic and applicable for a wide range of networks and tasks since we demonstrate the effectiveness of RFORCE on challenging targets (e.g., noisy and highly dynamic) and network architectures (e.g., lowrank RRNN).
References
 [1] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT press, 2016.
 [2] D. Eck and J. Schmidhuber, “Learning the longterm structure of the blues,” in International Conference on Artificial Neural Networks, pp. 284–289, Springer, 2002.
 [3] H. Mayer, F. Gomez, D. Wierstra, I. Nagy, A. Knoll, and J. Schmidhuber, “A system for robotic heart surgery that learns to tie knots using recurrent neural networks,” Advanced Robotics, vol. 22, no. 1314, pp. 1521–1537, 2008.
 [4] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt, “Sequential deep learning for human action recognition,” in International workshop on human behavior understanding, pp. 29–39, Springer, 2011.

[5]
K. Yao, B. Peng, Y. Zhang, D. Yu, G. Zweig, and Y. Shi, “Spoken language understanding using long shortterm memory neural networks,” in
2014 IEEE Spoken Language Technology Workshop (SLT), pp. 189–194, IEEE, 2014.  [6] A. Graves, A.r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE international conference on acoustics, speech and signal processing, pp. 6645–6649, IEEE, 2013.
 [7] K. Su and E. Shlizerman, “Dimension reduction approach for interpretability of sequence to sequence recurrent neural networks,” arXiv:1905.12176, 2019.

[8]
E. Shlizerman, L. Dery, H. Schoen, and I. KemelmacherShlizerman, “Audio to
body dynamics,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR)
, 2018.  [9] K. Su, X. Liu, and E. Shlizerman, “Predict & cluster: Unsupervised skeleton based action recognition,” arXiv:1911.12409, 2019.
 [10] S. Hochreiter and J. Schmidhuber, “Long shortterm memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
 [11] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoderdecoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
 [12] H. Sompolinsky, A. Crisanti, and H.J. Sommers, “Chaos in random neural networks,” Physical review letters, vol. 61, no. 3, p. 259, 1988.
 [13] H. Jaeger, “The “echo state” approach to analysing and training recurrent neural networkswith an erratum note,” Bonn, Germany: German National Research Center for Information Technology GMD Technical Report, vol. 148, no. 34, p. 13, 2001.
 [14] D. V. Buonomano and M. M. Merzenich, “Temporal information transformed into a spatial code by a neural network with realistic properties,” Science, vol. 267, no. 5200, pp. 1028–1030, 1995.
 [15] W. Maass, T. Natschläger, and H. Markram, “Realtime computing without stable states: A new framework for neural computation based on perturbations,” Neural computation, vol. 14, no. 11, pp. 2531–2560, 2002.
 [16] H. Jaeger, “Adaptive nonlinear system identification with echo state networks,” in Advances in neural information processing systems, pp. 609–616, 2003.
 [17] D. Sussillo and L. F. Abbott, “Generating coherent patterns of activity from chaotic neural networks,” Neuron, vol. 63, no. 4, pp. 544–557, 2009.
 [18] B. DePasquale, C. J. Cueva, K. Rajan, L. Abbott, et al., “fullforce: A targetbased method for training recurrent networks,” PloS one, vol. 13, no. 2, p. e0191527, 2018.
 [19] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in International conference on machine learning, pp. 1310–1318, 2013.
 [20] Z. Chen, V. Badrinarayanan, C.Y. Lee, and A. Rabinovich, “Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks,” arXiv preprint arXiv:1711.02257, 2017.
 [21] A. Graves, “Generating sequences with recurrent neural networks,” arXiv preprint arXiv:1308.0850, 2013.
 [22] Z. Mhammedi, A. Hellicar, A. Rahman, and J. Bailey, “Efficient orthogonal parametrisation of recurrent neural networks using householder reflections,” in Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 2401–2409, JMLR. org, 2017.
 [23] T. Mikolov, A. Joulin, S. Chopra, M. Mathieu, and M. Ranzato, “Learning longer memory in recurrent neural networks,” arXiv preprint arXiv:1412.7753, 2014.
 [24] Q. V. Le, N. Jaitly, and G. E. Hinton, “A simple way to initialize recurrent networks of rectified linear units,” arXiv preprint arXiv:1504.00941, 2015.

[25]
K. D. Maduranga, K. E. Helfrich, and Q. Ye, “Complex unitary recurrent neural
networks using scaled cayley transform,” in
Proceedings of the AAAI Conference on Artificial Intelligence
, vol. 33, pp. 4528–4535, 2019.  [26] S. Wisdom, T. Powers, J. Hershey, J. Le Roux, and L. Atlas, “Fullcapacity unitary recurrent neural networks,” in Advances in neural information processing systems, pp. 4880–4888, 2016.
 [27] M. Arjovsky, A. Shah, and Y. Bengio, “Unitary evolution recurrent neural networks,” in International Conference on Machine Learning, pp. 1120–1128, 2016.
 [28] K. Helfrich, D. Willmott, and Q. Ye, “Orthogonal recurrent neural networks with scaled cayley transform,” arXiv preprint arXiv:1707.09520, 2017.
 [29] E. Vorontsov, C. Trabelsi, S. Kadoury, and C. Pal, “On orthogonality and learning recurrent networks with long term dependencies,” in Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 3570–3578, JMLR. org, 2017.
 [30] B. Chang, M. Chen, E. Haber, and E. H. Chi, “Antisymmetricrnn: A dynamical system view on recurrent neural networks,” arXiv preprint arXiv:1902.09689, 2019.
 [31] E. Haber and L. Ruthotto, “Stable architectures for deep neural networks,” Inverse Problems, vol. 34, no. 1, p. 014004, 2017.
 [32] T. Laurent and J. von Brecht, “A recurrent neural network without chaos,” arXiv preprint arXiv:1612.06212, 2016.
 [33] A. Vakanski, H.p. Jun, D. Paul, and R. Baker, “A data set of human body movements for physical rehabilitation exercises,” Data, vol. 3, no. 1, p. 2, 2018.
 [34] J. R. Bunch, C. P. Nielsen, and D. C. Sorensen, “Rankone modification of the symmetric eigenproblem,” Numerische Mathematik, vol. 31, no. 1, pp. 31–48, 1978.
 [35] C. F. Van Loan and G. H. Golub, Matrix computations. Johns Hopkins University Press, 1983.
 [36] K. Rajan and L. Abbott, “Eigenvalue spectra of random matrices for neural networks,” Physical review letters, vol. 97, no. 18, p. 188104, 2006.
 [37] M. N. Shadlen and W. T. Newsome, “The variable discharge of cortical neurons: implications for connectivity, computation, and information coding,” Journal of neuroscience, vol. 18, no. 10, pp. 3870–3896, 1998.
 [38] B. Haider, A. Duque, A. R. Hasenstaub, and D. A. McCormick, “Neocortical network activity in vivo is generated through a dynamic balance of excitation and inhibition,” Journal of Neuroscience, vol. 26, no. 17, pp. 4535–4545, 2006.
 [39] R. Rubin, L. Abbott, and H. Sompolinsky, “Balanced excitation and inhibition are required for highcapacity, noiserobust neuronal selectivity,” Proceedings of the National Academy of Sciences, vol. 114, no. 44, pp. E9366–E9375, 2017.
 [40] J. Zhang, Q. Lei, and I. S. Dhillon, “Stabilizing gradients for deep neural networks via efficient svd parameterization,” arXiv preprint arXiv:1803.09327, 2018.
 [41] G. Kerg, K. Goyette, M. P. Touzel, G. Gidel, E. Vorontsov, Y. Bengio, and G. Lajoie, “Nonnormal recurrent neural network (nnrnn): learning long time dependencies while improving expressivity with transient dynamics,” in Advances in Neural Information Processing Systems, pp. 13591–13601, 2019.
 [42] I. S. Dhillon, “Stabilizing gradients for deep neural networks,” 2018.
 [43] O. Sporns, Discovering the human connectome. MIT press, 2012.
 [44] S. J. Cook, T. A. Jarrell, C. A. Brittin, Y. Wang, A. E. Bloniarz, M. A. Yakovlev, K. C. Nguyen, L. T.H. Tang, E. A. Bayer, J. S. Duerr, et al., “Wholeanimal connectomes of both caenorhabditis elegans sexes,” Nature, vol. 571, no. 7763, pp. 63–71, 2019.
 [45] A. S. Bates, P. Schlegel, R. J. Roberts, N. Drummond, I. F. Tamimi, R. G. Turnbull, X. Zhao, E. C. Marin, P. D. Popovici, S. Dhawan, et al., “Complete connectomic reconstruction of olfactory projection neurons in the fly brain,” BioRxiv, 2020.
Comments
There are no comments yet.