Input-Output Equivalence of Unitary and Contractive RNNs

10/30/2019 · M. Emami et al.

Unitary recurrent neural networks (URNNs) have been proposed as a method to overcome the vanishing and exploding gradient problem in modeling data with long-term dependencies. A basic question is how restrictive is the unitary constraint on the possible input-output mappings of such a network? This work shows that for any contractive RNN with ReLU activations, there is a URNN with at most twice the number of hidden states and the identical input-output mapping. Hence, with ReLU activations, URNNs are as expressive as general RNNs. In contrast, for certain smooth activations, it is shown that the input-output mapping of an RNN cannot be matched with a URNN, even with an arbitrary number of states. The theoretical results are supported by experiments on modeling of slowly-varying dynamical systems.


1 Introduction

Recurrent neural networks (RNNs) – originally proposed in the late 1980s Rumelhart et al. (1988); Elman (1990) – refer to a widely-used and powerful class of models for time series and sequential data. In recent years, RNNs have become particularly important in speech recognition Graves et al. (2013); Hinton et al. (2012) and natural language processing Collobert et al. (2011); Bahdanau et al. (2014); Sutskever et al. (2014) tasks.

A well-known challenge in training recurrent neural networks is the vanishing and exploding gradient problem Bengio et al. (1993); Pascanu et al. (2013). RNNs have a transition matrix that maps the hidden state at one time to the next time. When the transition matrix has an induced norm greater than one, the RNN may become unstable. In this case, small perturbations of the input at some time can result in a change in the output that grows exponentially over the subsequent time. This instability leads to a so-called exploding gradient. Conversely, when the norm is less than one, perturbations can decay exponentially so inputs at one time have negligible effect in the distant future. As a result, the loss surface associated with RNNs can have steep walls that may be difficult to minimize. Such problems are particularly acute in systems with long-term dependencies, where the output sequence can depend strongly on the input sequence many time steps in the past.

Unitary RNNs (URNNs) Arjovsky et al. (2016) are a simple and commonly-used approach to mitigate the vanishing and exploding gradient problem. The basic idea is to restrict the transition matrix to be unitary (an orthogonal matrix in the real-valued case). The unitary transition matrix is then combined with a non-expansive activation such as a ReLU or sigmoid. As a result, the overall transition mapping cannot amplify the hidden states, thereby eliminating the exploding gradient problem. In addition, since all the singular values of a unitary matrix equal 1, the transition matrix does not attenuate the hidden state, potentially mitigating the vanishing gradient problem as well. (Due to the activation, the hidden state may still be attenuated.) Some early work on URNNs suggested that they could be more effective than other methods, such as long short-term memory (LSTM) architectures and standard RNNs, for certain learning tasks involving long-term dependencies Jing et al. (2017); Arjovsky et al. (2016) – see a short summary below.

Although URNNs may improve the stability of the network for the purpose of optimization, a basic issue with URNNs is that the unitary constraint may potentially reduce the set of input-output mappings that the network can model. This paper seeks to rigorously characterize how restrictive the unitary constraint is on an RNN. We evaluate this restriction by comparing the set of input-output mappings achievable with URNNs against the set of mappings achievable with general RNNs. As described below, we restrict our attention to RNNs that are contractive in order to avoid unstable systems.

We show three key results:

  1. Given any contractive RNN with n hidden states and ReLU activations, there exists a URNN with at most 2n hidden states and the identical input-output mapping.

  2. This result is tight in the sense that, given any n, there exists at least one contractive RNN with n hidden states such that any URNN with the same input-output mapping must have at least 2n states.

  3. The equivalence of URNNs and RNNs depends on the activation. For example, we show that there exists a contractive RNN with sigmoid activations such that there is no URNN with any finite number of states that exactly matches the input-output mapping.

The implication of this result is that, for RNNs with ReLU activations, there is no loss in the expressiveness of the model when imposing the unitary constraint. As we discuss below, the penalty is roughly a two-fold increase in the number of parameters.

Of course, the expressiveness of a class of models is only one factor in their real performance. Based on these results alone, one cannot determine if URNNs will outperform RNNs in any particular task. Earlier works have found examples where URNNs offer some benefits over LSTMs and RNNs Arjovsky et al. (2016); Wisdom et al. (2016). But in the simulations below, concerning the modeling of slowly-varying nonlinear dynamical systems, we see that URNNs with 2n states perform approximately the same as RNNs with n states.

Theoretical results on generalization error are an active subject area in deep neural networks. Some measures of model complexity, such as those in Neyshabur et al. (2017), are related to the spectral norms of the weight matrices. For RNNs with non-contractive transition matrices, these complexity bounds grow exponentially with the number of time steps. In contrast, unitary (or contractive) transition matrices keep such bounds controlled, so this work also has implications for generalization.

Prior work

The vanishing and exploding gradient problem in RNNs has been known almost as long as RNNs themselves Bengio et al. (1993); Pascanu et al. (2013). It is part of the larger problem of training models that can capture long-term dependencies, and several methods have been proposed to address this issue. Most approaches use some form of gate vectors to control the information flow inside the hidden states, the most widely-used being LSTM networks Hochreiter and Schmidhuber (1997). Other gated models include Highway networks Srivastava et al. (2015) and gated recurrent units (GRUs) Cho et al. (2014). L1/L2 penalization of gradient norms and gradient clipping were proposed in Pascanu et al. (2013) to address the exploding gradient problem. With L1/L2 penalization, capturing long-term dependencies remains challenging since the regularization term quickly kills the information in the model. More recently, Pennington et al. (2018) successfully trained very deep networks by carefully adjusting the initial conditions to impose an approximately unitary structure across many layers.

Unitary evolution RNNs (URNNs) are a more recent approach, first proposed in Arjovsky et al. (2016). Orthogonal constraints were also considered earlier in the context of associative memories White et al. (2004). One of the technical difficulties is to efficiently parametrize the set of unitary matrices. The numerical simulations in this work focus on relatively small networks, where the parameterization is not a significant computational issue. Nevertheless, for larger numbers of hidden states, several approaches have been proposed. The model in Arjovsky et al. (2016) parametrizes the transition matrix as a product of reflection, diagonal, permutation, and Fourier transform matrices. This model spans only a subspace of the whole unitary space, thereby limiting the expressive power of the RNN. The work Wisdom et al. (2016) overcomes this issue by optimizing over full-capacity unitary matrices. A key limitation of that approach, however, is that the projection of the weights onto the unitary space is not computationally efficient. A tunable, efficient parametrization of unitary matrices is proposed in Jing et al. (2017). This model has a computational complexity of O(1) per parameter: the unitary matrix is represented as a product of rotation matrices and a diagonal matrix, and by grouping specific rotation matrices, the model provides tunability of the span of the unitary space and enables using different capacities for different tasks. Combining the parametrization of Jing et al. (2017) with the "forget" ability of the GRU structure Cho et al. (2014), Jing et al. (2019) presented an architecture that outperforms conventional models in several long-term dependency tasks. Other methods, such as the orthogonal RNNs proposed by Mhammedi et al. (2017), treat the unitary constraint as a special case of an orthogonality constraint: by representing an orthogonal matrix as a product of Householder reflections, one can span the entire space of orthogonal matrices. Imposing hard orthogonality constraints on the transition matrix limits the expressiveness of the model, and the speed of convergence and performance may degrade Vorontsov et al. (2017).


Figure 1: Recurrent Neural Network (RNN) model.

2 RNNs and Input-Output Equivalence

RNNs.

We consider recurrent neural networks (RNNs) representing sequence-to-sequence mappings of the form

h_t = φ(W h_{t-1} + F x_t + b),   (1a)
y_t = C h_t,   (1b)

parameterized by θ = (W, F, C, b, h_0). The system is shown in Fig. 1. The system maps a sequence of inputs x = (x_1, ..., x_T) to a sequence of outputs y = (y_1, ..., y_T). In equation (1), φ is the activation function (e.g., sigmoid or ReLU); h_t ∈ R^n is an internal or hidden state; W, F, and C are the hidden-to-hidden, input-to-hidden, and hidden-to-output weight matrices, respectively; and b is the bias vector. We have considered the initial condition, h_0, as part of the parameters, although we will often take h_0 = 0. Given a set of parameters θ, we will let

y = G(θ, x)   (2)

denote the resulting sequence-to-sequence mapping. Note that the number of time samples, T, is fixed throughout our discussion.
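To make the notation in (1)–(2) concrete, here is a minimal NumPy sketch of the sequence-to-sequence map G(θ, x); the helper name rnn_forward and the example dimensions are ours, not the paper's.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def rnn_forward(W, F, C, b, h0, xs, phi=relu):
    """Run the RNN (1): h_t = phi(W h_{t-1} + F x_t + b), y_t = C h_t."""
    h = h0.copy()
    ys = []
    for x in xs:
        h = phi(W @ h + F @ x + b)   # state update (1a)
        ys.append(C @ h)             # output map (1b)
    return np.stack(ys)

# Small example: n = 3 hidden states, scalar input and output, T = 5 time steps.
rng = np.random.default_rng(0)
n, n_in, n_out, T = 3, 1, 1, 5
W = 0.9 * np.eye(n)                       # contractive transition matrix
F = rng.standard_normal((n, n_in))
C = rng.standard_normal((n_out, n))
b = np.zeros(n)
xs = rng.standard_normal((T, n_in))
print(rnn_forward(W, F, C, b, np.zeros(n), xs).shape)   # (5, 1)
```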

Recall Strang (1993) that a matrix W is unitary if W*W = WW* = I, where W* denotes the conjugate transpose. When a unitary matrix is real-valued, it is also called orthogonal. In this work, we will restrict our attention to real-valued matrices, but still use the term unitary for consistency with the URNN literature. A Unitary RNN or URNN is simply an RNN (1) with a unitary state-to-state transition matrix W. A key property of unitary matrices is that they are norm-preserving, meaning that ||W h|| = ||h|| for all h. In the context of (1a), the unitary constraint implies that the transition matrix does not amplify the state.

Equivalence of RNNs.

Our goal is to understand the extent to which the unitary constraint in a URNN restricts the set of input-output mappings. To this end, we say that the RNNs for two parameters θ and θ' are input-output equivalent if the sequence-to-sequence mappings are identical,

G(θ, x) = G(θ', x) for all x.   (3)

That is, for all input sequences x, the two systems have the same output sequence. Note that the hidden internal states in the two systems may be different. We will also say that two RNNs are equivalent on a set X of inputs if (3) holds for all x ∈ X.

It is important to recognize that input-output equivalence does not imply that the parameters θ and θ' are identical. For example, consider the case of linear RNNs, where the activation φ in (1) is the identity, φ(z) = z. Then, for any invertible matrix S, the transformation

(W, F, C, b, h_0) ↦ (S W S^{-1}, S F, C S^{-1}, S b, S h_0)   (4)

results in the same input-output mapping. However, the internal states will be mapped to h_t ↦ S h_t. The fact that many parameters can lead to identical input-output mappings will be key to finding equivalent RNNs and URNNs.
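As a quick numerical sanity check of (4), the following sketch (reusing the rnn_forward helper above with an identity activation) applies a random invertible S to a linear RNN and confirms that the input-output mapping is unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_in, n_out, n_steps = 3, 2, 2, 10
W = rng.standard_normal((n, n)); W /= 2 * np.linalg.norm(W, 2)   # keep dynamics tame
F = rng.standard_normal((n, n_in))
C = rng.standard_normal((n_out, n))
b = rng.standard_normal(n)
h0 = rng.standard_normal(n)
xs = rng.standard_normal((n_steps, n_in))

S = rng.standard_normal((n, n))              # invertible with probability 1
Si = np.linalg.inv(S)

identity = lambda z: z                       # linear RNN: phi(z) = z
y1 = rnn_forward(W, F, C, b, h0, xs, phi=identity)
y2 = rnn_forward(S @ W @ Si, S @ F, C @ Si, S @ b, S @ h0, xs, phi=identity)
print(np.allclose(y1, y2))                   # True: identical input-output map
```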

Contractive RNNs.

The spectral norm Strang (1993) of a matrix W, denoted ||W||_2 = max_{h ≠ 0} ||W h|| / ||h||, is the maximum gain of the matrix W. In an RNN (1), the spectral norm measures how much the transition matrix can amplify the hidden state. For URNNs, ||W||_2 = 1. We will say an RNN is contractive if ||W||_2 < 1, expansive if ||W||_2 > 1, and non-expansive if ||W||_2 ≤ 1. In the sequel, we will restrict our attention to contractive and non-expansive RNNs. In general, given an expansive RNN, we cannot expect to find an equivalent URNN. For example, suppose the hidden state h_t is scalar. Then, the transition matrix W = w is also scalar, and the RNN is expansive if and only if |w| > 1. Now suppose the activation is a ReLU, φ(z) = max(0, z). Then, a constant positive input can result in an output that grows exponentially with time, on the order of w^t. Such an exponential increase is not possible with a URNN. We consider only non-expansive RNNs in the remainder of the paper. Some of our results will also need the assumption that the activation function φ in (1) is non-expansive:

|φ(z_1) - φ(z_2)| ≤ |z_1 - z_2| for all z_1, z_2.

This property is satisfied by the two most common activations, sigmoids and ReLUs.
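The following small illustration of these definitions is ours: the spectral norm is the largest singular value, and a scalar expansive ReLU system exhibits the exponential growth described above.

```python
import numpy as np

def spectral_norm(W):
    return np.linalg.norm(W, 2)              # largest singular value

print(spectral_norm(0.9 * np.eye(2)) < 1)    # True: contractive
print(spectral_norm(1.5 * np.eye(2)) > 1)    # True: expansive

# Scalar expansive example: w > 1 with ReLU and a constant positive input.
w, f, x, h = 1.2, 1.0, 1.0, 0.0
for t in range(30):
    h = max(0.0, w * h + f * x)              # grows geometrically, roughly like w**t
print(h)                                     # about (w**30 - 1)/(w - 1) ~ 1.2e3
```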

Equivalence of Linear RNNs.

To get an intuition for equivalence, it is useful to briefly review the concept in the case of linear systems Kailath (1980). Linear systems are RNNs (1) in the special case where the activation function is the identity, φ(z) = z; the initial condition is zero, h_0 = 0; and the bias is zero, b = 0. In this case, it is well-known that two systems are input-output equivalent if and only if they have the same transfer function,

H(z) = C (zI - W)^{-1} F.   (5)

In the case of scalar inputs and outputs, H(z) is a rational function of the complex variable z with numerator and denominator degree at most n, the dimension of the hidden state h_t. Any state-space system (1) that achieves a particular transfer function is called a realization of the transfer function. Hence, two linear systems are equivalent if and only if they are realizations of the same transfer function.

A realization is called minimal if it is not equivalent to some linear system with fewer hidden states. A basic property of realizations of linear systems is that they are minimal if and only if they are controllable and observable. The formal definitions can be found in any linear systems text, e.g. Kailath (1980). Loosely, controllable means that all internal states can be reached with an appropriate input, and observable means that all hidden states can be observed from the output. In the absence of controllability and observability, some hidden states can be removed while maintaining input-output equivalence.
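Equivalence of linear systems can be checked numerically by evaluating the transfer function (5) at a few complex frequencies. The sketch below, our own illustration, verifies that the similarity transform (4) leaves H(z) unchanged.

```python
import numpy as np

def transfer(W, F, C, z):
    """Evaluate H(z) = C (zI - W)^{-1} F for a linear state-space system."""
    n = W.shape[0]
    return C @ np.linalg.solve(z * np.eye(n) - W, F)

rng = np.random.default_rng(2)
n = 4
W = rng.standard_normal((n, n)); W /= 2 * np.linalg.norm(W, 2)   # contractive
F = rng.standard_normal((n, 1))
C = rng.standard_normal((1, n))

S = rng.standard_normal((n, n)); Si = np.linalg.inv(S)
for z in [1.5 + 0.5j, -2.0, 0.3 + 2.0j]:     # test points outside the spectrum of W
    H1 = transfer(W, F, C, z)
    H2 = transfer(S @ W @ Si, S @ F, C @ Si, z)
    print(np.allclose(H1, H2))               # True at every test point
```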

3 Equivalence Results for RNNs with ReLU Activations

Our first results consider contractive RNNs with ReLU activations. For the remainder of the section, we will restrict our attention to the case of zero initial conditions, h_0 = 0, in (1).

Theorem 3.1

Let θ be the parameters of a contractive RNN with ReLU activation and state dimension n. Fix any M > 0 and let X_M be the set of all input sequences x such that ||x_t|| ≤ M for all t. Then there exists a URNN with state dimension 2n and parameters θ' such that G(θ, x) = G(θ', x) for all x ∈ X_M. Hence the input-output mapping is matched for bounded inputs.

Proof: See Appendix A.

Theorem 3.1 shows that for any contractive RNN with ReLU activations, there exists a URNN with at most twice the number of hidden states and the identical input-output mapping. Thus, there is no loss in the set of input-output mappings with URNNs relative to general contractive RNNs on bounded inputs.

The penalty for using URNNs is the two-fold increase in state dimension, which in turn increases the number of parameters to be learned. We can estimate this increase in parameters as follows. The raw number of parameters for an RNN (1) with n hidden states, n_out outputs, and n_in inputs is n^2 + n(n_in + n_out + 1). However, for ReLU activations, the RNNs are equivalent under the transformations (4) with S diagonal and positive. Hence, the number of degrees of freedom of a general RNN is at most n^2 + n(n_in + n_out). We can compare this value to a URNN with 2n hidden states. The set of 2n x 2n orthogonal matrices has n(2n - 1) degrees of freedom Stewart (1980). Hence, the total number of degrees of freedom in a URNN with 2n states is at most n(2n - 1) + 2n(n_in + n_out + 1). We conclude that a URNN with 2n hidden states has slightly fewer than twice the number of parameters of an RNN with n hidden states.
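The counting argument can be reproduced with a few lines of arithmetic; the helper functions below are ours and simply encode the raw counts described above.

```python
def rnn_raw_params(n, n_in, n_out):
    # W: n*n, F: n*n_in, C: n_out*n, b: n
    return n * n + n * n_in + n_out * n + n

def urnn_dof(n, n_in, n_out):
    m = 2 * n                        # the URNN uses 2n hidden states
    orth = m * (m - 1) // 2          # dimension of the orthogonal group O(m)
    return orth + m * n_in + n_out * m + m

for n in [4, 16, 64]:
    ratio = urnn_dof(n, 1, 1) / rnn_raw_params(n, 1, 1)
    print(n, round(ratio, 3))        # 1.857, 1.947, 1.985: just under 2
```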

We note that there are cases where the contractivity assumption is limiting; however, the limitations may not always be prohibitive. We will see in our experiments that imposing a contractivity constraint can improve learning for RNNs when models have sufficiently large numbers of time steps. Some related results, where bounding the singular values helps performance, can be found in Vorontsov et al. (2017).

We next show a converse result.

Theorem 3.2

For every positive integer n, there exists a contractive RNN with ReLU nonlinearity and state dimension n such that every equivalent URNN has at least 2n states.

Proof: See Appendix B.1.

The result shows that the achievability bound in Theorem 3.1 is tight, at least in the worst case. In addition, the RNN constructed in the proof of Theorem 3.2 is not particularly pathological. We will show in our simulations in Section 5 that URNNs typically need twice the number of hidden states to achieve a modeling error comparable to that of an RNN.

4 Equivalence Results for RNNs with Sigmoid Activations

Equivalence between RNNs and URNNs depends on the particular activation. Our next result shows that with sigmoid activations, URNNs are, in general, never exactly equivalent to RNNs, even with an arbitrary number of states.

We need the following technical definition. Consider an RNN (1) with the standard sigmoid activation φ(z) = 1/(1 + e^{-z}). If the RNN is non-expansive, then a simple application of the contraction mapping principle shows that for any constant input x_t = x*, there is a fixed point h* of the hidden-state update (the sigmoid has slope at most 1/4, so the update map is a strict contraction). We will say that the RNN is controllable and observable at x* if the linearization of the RNN around this fixed point is controllable and observable.
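The fixed point and linearization in this definition are straightforward to compute numerically. The sketch below, an illustration of ours rather than code from the paper, finds h* by fixed-point iteration for a constant input and forms the linearized transition matrix diag(φ'(z*)) W.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fixed_point(W, F, b, x_star, iters=200):
    """Iterate h <- sigmoid(W h + F x* + b); converges since the update is a contraction."""
    h = np.zeros(W.shape[0])
    for _ in range(iters):
        h = sigmoid(W @ h + F @ x_star + b)
    return h

def linearized_transition(W, F, b, x_star):
    h_star = fixed_point(W, F, b, x_star)
    z_star = W @ h_star + F @ x_star + b
    d = sigmoid(z_star) * (1.0 - sigmoid(z_star))   # sigma'(z*) elementwise
    return np.diag(d) @ W                           # transition matrix of the linearization

rng = np.random.default_rng(3)
n = 3
W = rng.standard_normal((n, n)); W /= 2 * np.linalg.norm(W, 2)   # contractive
F = rng.standard_normal((n, 1))
b = np.zeros(n)
A = linearized_transition(W, F, b, np.array([0.5]))
print(np.abs(np.linalg.eigvals(A)))   # eigenvalues visible in the local input-output map
```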

Theorem 4.1

There exists a contractive RNN with sigmoid activation with the following property: if a URNN is controllable and observable at some input value x*, then the URNN cannot be equivalent to the RNN for inputs in a neighborhood of x*.

Proof: See Appendix B.2.

The result provides a converse on equivalence: contractive RNNs with sigmoid activations are not in general equivalent to URNNs, even if we allow the URNN to have an arbitrary number of hidden states. Of course, the approximation error between the URNN and the RNN may go to zero as the URNN hidden dimension goes to infinity (e.g., similar to the approximation results in Funahashi and Nakamura (1993)). However, exact equivalence is not possible with sigmoid activations, unlike with ReLU activations. Thus, there is a fundamental difference in equivalence for smooth and non-smooth activations.

We note that the fundamental distinction between Theorem 3.1 and the opposite result in Theorem 4.1 is that the sigmoid activation is smooth with a strictly positive slope. With such activations, one can linearize the system, and the eigenvalues of the transition matrix become visible in the input-output mapping. In contrast, ReLUs can zero out states and suppress these eigenvalues. This is a key insight of the paper and a further contribution to understanding nonlinear systems.

5 Numerical Simulations

In this section, we numerically compare the modeling ability of RNNs and URNNs when the true system is a contractive RNN with long-term dependencies. Specifically, we generate data from multiple instances of a synthetic RNN whose parameters in (1) are randomly generated. For the true system, we use n_in input units, n_out output units, and n hidden units at each time step. The input and output matrices F and C are generated with i.i.d. Gaussian entries. We use a random transition matrix W of the form

(6)

where A is an i.i.d. Gaussian matrix and ε is a small value. The matrix in (6) is contractive, with all singular values slightly below one. By making ε small, the states of the system vary slowly, hence creating long-term dependencies. In analogy with linear systems, the time constant of the system is on the order of 1/ε time steps. We use ReLU activations. To avoid degenerate cases where the outputs are always zero, the biases are adjusted to ensure that each hidden state is active a target fraction of the time, using a procedure similar to that in Fletcher et al. (2018).
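The exact random construction in (6) is not reproduced above, so the sketch below is only one plausible way (an assumption of ours, not necessarily the authors' construction) to draw a random contractive transition matrix whose singular values all lie just below one.

```python
import numpy as np

def slow_contractive_matrix(n, eps, rng):
    """Random transition matrix with all singular values in (1 - 2*eps, 1)."""
    A = rng.standard_normal((n, n))
    U, _, Vt = np.linalg.svd(A)                    # random orthogonal factors
    s = 1.0 - 2.0 * eps * rng.uniform(size=n)      # singular values near 1
    return U @ np.diag(s) @ Vt

rng = np.random.default_rng(4)
W = slow_contractive_matrix(4, eps=0.01, rng=rng)
print(np.linalg.svd(W, compute_uv=False))          # all slightly below 1
```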

The trials have T time steps, corresponding to roughly 10 times the time constant of the system. We add noise to the output of this system such that the signal-to-noise ratio (SNR) is 15 dB or 20 dB. In each trial, we generate 700 training sequences and 300 test sequences from this system.

Given the input and output data of this contractive RNN, we attempt to learn the system with: (i) standard RNNs, (ii) URNNs, and (iii) LSTMs. The number of hidden states in the learned models is varied over a range that includes values both above and below the true number of hidden states. We use the mean-squared error as the loss function. Optimization is performed using the Adam optimizer Kingma and Ba (2014) with a batch size of 10 and a learning rate of 0.01. All models are implemented with the Keras package in TensorFlow. The experiments are repeated over 30 realizations of the original contractive system.
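A minimal Keras/TensorFlow training sketch consistent with this setup is shown below; the dimensions and the random placeholder data are hypothetical stand-ins for the synthetic sequences described above.

```python
import numpy as np
import tensorflow as tf

# Hypothetical shapes for illustration; the paper's exact dimensions are not reproduced here.
T, n_in, n_out, n_hidden = 100, 2, 2, 8
x_train = np.random.randn(700, T, n_in).astype("float32")
y_train = np.random.randn(700, T, n_out).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(T, n_in)),
    tf.keras.layers.SimpleRNN(n_hidden, activation="relu", return_sequences=True),
    tf.keras.layers.Dense(n_out),          # hidden-to-output map applied at every time step
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), loss="mse")
model.fit(x_train, y_train, batch_size=10, epochs=1, verbose=0)
```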

For the URNN learning, of all the proposed algorithms for enforcing the unitary constraint on the transition matrix during training Jing et al. (2017); Wisdom et al. (2016); Arjovsky et al. (2016); Mhammedi et al. (2017), we chose to project the transition matrix onto the full space of unitary matrices after each iteration using the singular value decomposition (SVD). Although the SVD requires O(n^3) computation for each projection, for our choices of hidden state dimensions it was faster than the aforementioned methods.
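The projection step has a standard closed form: if W = U Σ V^T, then U V^T is the nearest orthogonal matrix in Frobenius norm. A minimal sketch consistent with the description above:

```python
import numpy as np

def project_to_orthogonal(W):
    """Nearest orthogonal matrix to W in Frobenius norm (via SVD)."""
    U, _, Vt = np.linalg.svd(W)
    return U @ Vt

rng = np.random.default_rng(5)
W = rng.standard_normal((6, 6))         # e.g., the transition matrix after a gradient step
Q = project_to_orthogonal(W)
print(np.allclose(Q.T @ Q, np.eye(6)))  # True: Q is orthogonal
```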

Since we have training noise and since optimization algorithms can get stuck in local minima, we cannot expect "exact" equivalence between the learned model and the true system as in the theorems. Instead, we look at the test error as a measure of the closeness of the learned model to the true system. The left plot of Figure 2 shows the test error for a Gaussian i.i.d. input with output SNR = 20 dB for RNNs, URNNs, and LSTMs. The red dashed line corresponds to the optimal error achievable at the given noise level.

Note that even though the true system is an RNN, the learned RNN model does not obtain the optimal test error when trained with the same number of hidden states as the true system. This is not due to training noise, since the RNN is able to capture the full dynamics when we over-parametrize the model with additional hidden states. The higher test error of the RNN at lower numbers of hidden states is likely due to the optimization being caught in a local minimum.

What is important for this work, though, is to compare the URNN test error with that of the RNN. We observe that the URNN requires approximately twice the number of hidden states to obtain the same test error as an RNN. To make this clear, the right plot shows the same performance data with the number of states adjusted for the URNN. Since our theory indicates that a URNN with 2n hidden states is as powerful as an RNN with n hidden states, we compare a URNN with 2n hidden units directly with an RNN with n hidden units. We call this the adjusted number of hidden units. We see that the URNN and RNN have similar test error when we scale the number of hidden units as predicted by the theory.

For completeness, the left plot in Figure 2 also shows the test error of an LSTM. It is important to note that the URNN has almost the same performance as the LSTM with a considerably smaller number of parameters.

Figure 3 shows similar results for the same task with SNR = 15 dB. For this task, the input is sparse i.i.d. Gaussian, i.e., each input is Gaussian with some probability p and zero with probability 1 - p. The left plot shows the test error vs. the number of hidden units for RNNs and URNNs, and the right plot shows the same results once the number of hidden units for the URNN is adjusted.

We also compared the modeling ability of URNNs and RNNs on the Pixel-Permuted MNIST task. Each MNIST image is a 28 x 28 grayscale image with a label between 0 and 9. A fixed random permutation is applied to the pixels of every image, the permuted pixels are fed to the network one per time step, and the output is the predicted label for the image Arjovsky et al. (2016); Jing et al. (2017); Vorontsov et al. (2017).
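A sketch of the Pixel-Permuted MNIST pipeline (our own illustration of the task description above):

```python
import numpy as np
import tensorflow as tf

# Flatten each 28x28 image to a length-784 sequence, apply one fixed random pixel
# permutation to every image, and feed one pixel per time step.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784, 1).astype("float32") / 255.0
perm = np.random.RandomState(0).permutation(784)      # the same permutation for all images
x_train = x_train[:, perm, :]
print(x_train.shape, y_train.shape)                   # (60000, 784, 1) (60000,)
```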

We evaluated various models on the Pixel-Permuted MNIST task using validation-based early stopping. Without imposing a contractivity constraint during learning, the RNN is either unstable or requires a slow learning rate. Imposing a contractivity constraint improves the performance, and using a URNN improves the performance further. Thus, contractivity can improve learning for RNNs when models have sufficiently large numbers of time steps.

Figure 2: Test error on synthetic data with Gaussian i.i.d. input and output SNR = 20 dB.
Figure 3: Test error on synthetic data with Gaussian i.i.d. input and output SNR = 15 dB.
Figure 4: Accuracy on the Permuted MNIST task for various models trained with RMSProp, validation-based early termination, and initial learning rate lr. (1) URNN: RNN model with a unitary constraint; (2) ContRNN: RNN with a contractivity constraint; (3 & 4) RNN model with no contractivity or unitary constraint (two learning rates). We see that contractivity improves performance, and unitary constraints improve performance further.

6 Conclusion

Several works have empirically shown that using unitary recurrent neural networks improves the stability and performance of RNNs. In this work, we study how restrictive it is to use URNNs instead of RNNs. We show that URNNs are at least as powerful as contractive RNNs in modeling input-output mappings, provided enough hidden units are used. More specifically, for any contractive RNN we explicitly construct a URNN with twice the number of states and an identical input-output mapping. We also provide converse results on the number of states and the activation functions needed for exact matching. We emphasize that although it has been shown that URNNs outperform standard RNNs and LSTMs in many tasks that involve long-term dependencies, our main goal in this paper is to show that, from an approximation viewpoint, URNNs are as expressive as general contractive RNNs. At the cost of a roughly two-fold increase in the number of parameters, we can enjoy the stability benefits that the unitary constraint brings to the optimization of neural networks.

Acknowledgements

The work of M. Emami, M. Sahraee-Ardakan, A. K. Fletcher was supported in part by the National Science Foundation under Grants 1254204 and 1738286, and the Office of Naval Research under Grant N00014-15-1-2677. S. Rangan was supported in part by the National Science Foundation under Grants 1116589, 1302336, and 1547332, NIST, the industrial affiliates of NYU WIRELESS, and the SRC.

Appendix A Proof of Theorem 3.1

The basic idea is to construct a URNN with 2n states such that the first n states match the states of the RNN and the last n states are always zero. To this end, consider any contractive RNN,

h_t = φ(W h_{t-1} + F x_t + b),   y_t = C h_t,

where ||W||_2 < 1. Since W is contractive, we have ||W||_2 ≤ 1 - δ for some δ > 0. Also, for a ReLU activation, ||φ(z)|| ≤ ||z|| for all pre-activation inputs z. Hence,

||h_t|| ≤ (1 - δ) ||h_{t-1}|| + ||F||_2 ||x_t|| + ||b||.

Therefore, with bounded inputs, ||x_t|| ≤ M, the state is bounded,

||h_t|| ≤ B := (||F||_2 M + ||b||) / δ.   (7)

We construct a URNN as

h'_t = φ(W' h'_{t-1} + F' x_t + b'),   y'_t = C' h'_t,

where the parameters are of the form

W' = [ W  G ; V  H ],   F' = [ F ; 0 ],   C' = [ C  0 ],   b' = [ b ; -B 1 ],   (8)

with the blocks V, G, and H chosen as follows. Since ||W||_2 ≤ 1 - δ, the matrix I - W^T W is positive definite. Therefore, there exists a matrix V such that V^T V = I - W^T W (for example, V = (I - W^T W)^{1/2}). With this choice of V, the first n columns of W', namely [ W ; V ], are orthonormal. Let [ G ; H ] extend these columns to an orthonormal basis of R^{2n}. Then the matrix W' is orthonormal, so the constructed system is a URNN. In (8), B is the state bound defined in (7) and 1 denotes the all-ones vector.

We show by induction that for all t,

h'_t = [ h_t ; 0 ].   (9)

If both systems are initialized at zero, (9) is satisfied at t = 0. Now, suppose it holds up to time t-1. Then, the first n components of the URNN state update are

φ(W h_{t-1} + G·0 + F x_t + b) = φ(W h_{t-1} + F x_t + b) = h_t,

where we have used the induction hypothesis that h'_{t-1} = [ h_{t-1} ; 0 ]. For the last n components, note that

||V h_{t-1}|| ≤ ||h_{t-1}|| ≤ B,   (10)

where the first step uses ||V||_2 ≤ 1 and the last step follows from (7). Since every entry of V h_{t-1} is bounded in magnitude by its norm,

V h_{t-1} - B 1 ≤ 0 elementwise.   (11)

Hence, with the ReLU activation, the last n components φ(V h_{t-1} + H·0 - B 1) are zero, and (9) holds at time t. By induction, (9) holds for all t. Then, with C' = [ C  0 ], the outputs of the URNN and RNN are identical,

y'_t = C' h'_t = C h_t = y_t.

This shows that the systems are equivalent.
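The construction can also be verified numerically. The sketch below, which assumes the rnn_forward helper from Section 2 and uses SciPy for the matrix square root and orthonormal completion, builds the 2n-state URNN of Theorem 3.1 and checks that its outputs match the RNN on bounded inputs.

```python
import numpy as np
from scipy.linalg import sqrtm, null_space

def urnn_from_contractive(W, F, C, b, M):
    """Build the 2n-state URNN of Theorem 3.1 from a contractive RNN and input bound M."""
    n = W.shape[0]
    delta = 1.0 - np.linalg.norm(W, 2)                          # contraction margin
    B = (np.linalg.norm(F, 2) * M + np.linalg.norm(b)) / delta  # state bound (7)
    V = np.real(sqrtm(np.eye(n) - W.T @ W))                     # V^T V = I - W^T W
    first = np.vstack([W, V])                                   # 2n x n, orthonormal columns
    rest = null_space(first.T)                                  # orthonormal completion
    W_u = np.hstack([first, rest])                              # orthogonal 2n x 2n
    F_u = np.vstack([F, np.zeros((n, F.shape[1]))])
    C_u = np.hstack([C, np.zeros((C.shape[0], n))])
    b_u = np.concatenate([b, -B * np.ones(n)])                  # negative bias zeros the extra states
    return W_u, F_u, C_u, b_u

rng = np.random.default_rng(6)
n, n_in, n_out, T, M = 3, 2, 2, 20, 1.0
W = rng.standard_normal((n, n)); W /= 1.25 * np.linalg.norm(W, 2)   # contractive
F = rng.standard_normal((n, n_in)); C = rng.standard_normal((n_out, n)); b = rng.standard_normal(n)
xs = rng.uniform(-M, M, size=(T, n_in)) / np.sqrt(n_in)             # bounded inputs, ||x_t|| <= M
W_u, F_u, C_u, b_u = urnn_from_contractive(W, F, C, b, M)
print(np.allclose(W_u.T @ W_u, np.eye(2 * n)))                      # True: W_u is orthogonal
y_rnn = rnn_forward(W, F, C, b, np.zeros(n), xs)
y_urnn = rnn_forward(W_u, F_u, C_u, b_u, np.zeros(2 * n), xs)
print(np.allclose(y_rnn, y_urnn))                                   # True: identical outputs
```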

References

  • [1] M. Arjovsky, A. Shah, and Y. Bengio (2016) Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pp. 1120–1128.
  • [2] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • [3] Y. Bengio, P. Frasconi, and P. Simard (1993) The problem of learning long-term dependencies in recurrent networks. In IEEE International Conference on Neural Networks, pp. 1183–1188.
  • [4] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio (2014) On the properties of neural machine translation: encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation.
  • [5] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa (2011) Natural language processing (almost) from scratch. Journal of Machine Learning Research 12 (Aug), pp. 2493–2537.
  • [6] J. L. Elman (1990) Finding structure in time. Cognitive Science 14 (2), pp. 179–211.
  • [7] A. K. Fletcher, S. Rangan, and P. Schniter (2018) Inference in deep networks in high dimensions. In Proc. IEEE International Symposium on Information Theory, pp. 1884–1888.
  • [8] K. Funahashi and Y. Nakamura (1993) Approximation of dynamical systems by continuous time recurrent neural networks. Neural Networks 6 (6), pp. 801–806.
  • [9] A. Graves, A. Mohamed, and G. Hinton (2013) Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6645–6649.
  • [10] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury (2012) Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine 29.
  • [11] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  • [12] L. Jing, C. Gulcehre, J. Peurifoy, Y. Shen, M. Tegmark, M. Soljacic, and Y. Bengio (2019) Gated orthogonal recurrent units: on learning to forget. Neural Computation 31 (4), pp. 765–783.
  • [13] L. Jing, Y. Shen, T. Dubcek, J. Peurifoy, S. Skirlo, Y. LeCun, M. Tegmark, and M. Soljačić (2017) Tunable efficient unitary neural networks (EUNN) and their application to RNNs. In Proceedings of the 34th International Conference on Machine Learning, pp. 1733–1741.
  • [14] T. Kailath (1980) Linear systems. Vol. 156, Prentice-Hall, Englewood Cliffs, NJ.
  • [15] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [16] Z. Mhammedi, A. Hellicar, A. Rahman, and J. Bailey (2017) Efficient orthogonal parametrisation of recurrent neural networks using Householder reflections. In Proceedings of the 34th International Conference on Machine Learning, pp. 2401–2409.
  • [17] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro (2017) Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pp. 5947–5956.
  • [18] R. Pascanu, T. Mikolov, and Y. Bengio (2013) On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pp. 1310–1318.
  • [19] J. Pennington, S. S. Schoenholz, and S. Ganguli (2018) The emergence of spectral universality in deep networks. arXiv preprint arXiv:1802.09979.
  • [20] D. E. Rumelhart, G. E. Hinton, and R. J. Williams (1988) Learning representations by back-propagating errors. Cognitive Modeling 5 (3), pp. 1.
  • [21] R. K. Srivastava, K. Greff, and J. Schmidhuber (2015) Highway networks. arXiv preprint arXiv:1505.00387.
  • [22] G. W. Stewart (1980) The efficient generation of random orthogonal matrices with an application to condition estimators. SIAM Journal on Numerical Analysis 17 (3), pp. 403–409.
  • [23] G. Strang (1993) Introduction to linear algebra. Vol. 3, Wellesley-Cambridge Press, Wellesley, MA.
  • [24] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112.
  • [25] M. Vidyasagar (2002) Nonlinear systems analysis. Vol. 42, SIAM.
  • [26] E. Vorontsov, C. Trabelsi, S. Kadoury, and C. Pal (2017) On orthogonality and learning recurrent networks with long term dependencies. In Proceedings of the 34th International Conference on Machine Learning, pp. 3570–3578.
  • [27] O. L. White, D. D. Lee, and H. Sompolinsky (2004) Short-term memory in orthogonal neural networks. Physical Review Letters 92 (14), pp. 148102.
  • [28] S. Wisdom, T. Powers, J. Hershey, J. Le Roux, and L. Atlas (2016) Full-capacity unitary recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 4880–4888.

Appendix B Converse Theorem Proofs

B.1 Proof of Theorem 3.2

First consider the case n = 1 with scalar inputs and outputs. Let θ = (w, f, c, b) be the parameters of a contractive RNN with |w| < 1, f ≠ 0, and c ≠ 0. Hence, the contractive RNN is given by

h_t = φ(w h_{t-1} + f x_t + b),   y_t = c h_t,   (12)

where φ is the ReLU activation. Suppose θ' are the parameters of an equivalent URNN. If the URNN has fewer than 2 states, it must have 1 state. Let the equivalent URNN be

h'_t = φ(w' h'_{t-1} + f' x_t + b'),   y'_t = c' h'_t,   (13)

for some parameters (w', f', c', b'). Since w' is orthogonal, either w' = 1 or w' = -1. Similarly, f' is either positive or negative (it cannot be zero, since the RNN output depends on the input). First, consider the case w' = 1 with f and f' positive. Then, there exists a large enough constant input such that, for all time steps t, both systems operate in the active (linear) phase of the ReLU. Therefore, we have two equivalent linear systems,

h_t = w h_{t-1} + f x_t + b,   y_t = c h_t,   (14)
h'_t = h'_{t-1} + f' x_t + b',   y'_t = c' h'_t.   (15)

In order for these linear systems to have identical input-output mappings over all such inputs, it is required that w = w', which is a contradiction since |w| < 1 while |w'| = 1. The other sign cases can be treated similarly. Therefore, at least 2 states are needed for the URNN to match the contractive RNN with 1 state.

For the case of general n, consider the contractive RNN

h_t = φ(W h_{t-1} + F x_t + b 1),   y_t = C h_t,   (16)

where W = w I_n, F = f I_n, C = c I_n, and 1 is the all-ones vector. This system is separable in the sense that, with zero initial state, the i-th output sequence depends only on the i-th input sequence for each i. A URNN will need 2 states for each scalar subsystem, requiring a total of 2n states.

B.2 Proof of Theorem 4.1

We use the same scalar contractive RNN (12), but with a sigmoid activation φ(z) = 1/(1 + e^{-z}). Let θ' be the parameters of any URNN with scalar inputs and outputs. Suppose the URNN is controllable and observable at an input value x*. Let h* and h'* be, respectively, the fixed points of the hidden states of the contractive RNN and the URNN:

h* = φ(w h* + f x* + b),   (17)
h'* = φ(W' h'* + F' x* + b').   (18)

We take the linearizations [25] of each system around its fixed point and apply a small input perturbation around x*. Therefore, we have two linear systems with identical input-output mappings given by

δh_t = φ'(z*) (w δh_{t-1} + f δx_t),   δy_t = c δh_t,   (19)
δh'_t = diag(φ'(z'*)) (W' δh'_{t-1} + F' δx_t),   δy'_t = C' δh'_t,   (20)

where z* = w h* + f x* + b, z'* = W' h'* + F' x* + b', and φ'(z*) and φ'(z'*) are the derivatives of the activations at the fixed points. Since both systems are controllable and observable, their dimensions must be the same and the eigenvalues of their transition matrices must match. In particular, the URNN must be scalar, so we write W' = w', F' = f', and C' = c' for scalars. For orthogonality, either w' = 1 or w' = -1. We look at the case w' = 1; the case w' = -1 is similar. Since the eigenvalues of the transition matrices must match, we have

w φ'(w h* + f x* + b) = φ'(h'* + f' x* + b'),   (21)

where h* and h'* are the solutions to the fixed point equations:

h* = φ(w h* + f x* + b),   h'* = φ(h'* + f' x* + b').   (22)

Also, since the two systems have the same input-to-output gain,

c f φ'(w h* + f x* + b) = c' f' φ'(h'* + f' x* + b').   (23)

Now, (21) must hold at any input x* where the URNN is controllable and observable. If the URNN is controllable and observable at some x*, it is controllable and observable in a neighborhood of x*. Hence, (21) and (23) hold in some neighborhood of x*. To write this mathematically, define the functions

g_1(x) := w φ'(w h(x) + f x + b) - φ'(h'(x) + f' x + b'),
g_2(x) := c f φ'(w h(x) + f x + b) - c' f' φ'(h'(x) + f' x + b'),   (24)

where, for a given x, h(x) and h'(x) are the solutions to the fixed point equations (22). We must have g_1(x) = g_2(x) = 0 for all x in some neighborhood. Taking derivatives in (24) and using the fact that φ is a sigmoid, one can show that this matching can only occur if |w| ≥ 1. This is a contradiction, since we have assumed that the RNN system is contractive, which requires |w| < 1.