# Path-Normalized Optimization of Recurrent Neural Networks with ReLU Activations

We investigate the parameter-space geometry of recurrent neural networks (RNNs) and develop an adaptation of the path-SGD optimization method, attuned to this geometry, that can learn plain RNNs with ReLU activations. On several datasets that require capturing long-term dependency structure, we show that path-SGD can significantly improve the trainability of ReLU RNNs compared to RNNs trained with SGD, even with various recently suggested initialization schemes.

## 1 Introduction

Recurrent Neural Networks (RNNs) have been found to be successful in a variety of sequence learning problems [4, 3, 9], including those involving long-term dependencies (e.g., [1, 23]). However, most of the empirical success has not been with “plain” RNNs but rather with alternate, more complex structures, such as Long Short-Term Memory (LSTM) networks [7] or Gated Recurrent Units (GRUs) [3]. Much of the motivation for these more complex models is not so much because of their modeling richness, but perhaps more because they seem to be easier to optimize. As we discuss in Section 3, training plain RNNs using gradient-descent variants seems problematic, and the choice of the activation function could cause a problem of vanishing gradients or of exploding gradients.

In this paper our goal is to better understand the geometry of plain RNNs, and to develop better optimization methods, adapted to this geometry, that directly learn plain RNNs with ReLU activations. One motivation for insisting on plain RNNs, as opposed to LSTMs or GRUs, is that they are simpler and might be more appropriate for applications that require low-complexity design, such as mobile computing platforms [22, 5]. In other applications, it might be better to solve optimization issues with better optimization methods rather than by reverting to more complex models. A better understanding of the optimization of plain RNNs can also assist us in designing, optimizing and intelligently using more complex RNN extensions.

Improving the training of RNNs with ReLU activations has been the subject of some recent attention, with most research focusing on different initialization strategies [12, 22]. While initialization can certainly have a strong effect on the success of the method, it generally can at most delay the problem of gradient explosion during optimization. In this paper we take a different approach that can be combined with any initialization choice, and focus on the dynamics of the optimization itself.

Any local search method is inherently tied to some notion of geometry over the search space (e.g. the space of RNNs). For example, gradient descent (including stochastic gradient descent) is tied to the Euclidean geometry and can be viewed as steepest descent with respect to the Euclidean norm. Changing the norm (even to a different quadratic norm, e.g. by representing the weights with respect to a different basis in parameter space) results in different optimization dynamics. We build on prior work on the geometry and optimization in feed-forward networks, which uses the path-norm [16] (defined in Section 4) to determine a geometry leading to the path-SGD optimization method. To do so, we investigate the geometry of RNNs as feedforward networks with shared weights (Section 2) and extend a line of work on Path-Normalized optimization to include networks with shared weights. We show that the resulting algorithm (Section 4) has similar invariance properties on RNNs as those of standard path-SGD on feedforward networks, and can result in better optimization with less sensitivity to the scale of the weights.

## 2 Recurrent Neural Nets as Feedforward Nets with Shared Weights

We view Recurrent Neural Networks (RNNs) as feedforward networks with shared weights.

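A general feedforward network with ReLU activations and shared weights is denoted by $\mathcal{N}(G, \pi, p)$, where $G(V, E)$ is a directed acyclic graph over the set of nodes $V$ that correspond to units in the network, including special subsets of input and output nodes $V_{\text{in}}, V_{\text{out}} \subset V$; $p \in \mathbb{R}^m$ is a parameter vector; and $\pi : E \to \{1, \dots, m\}$ is a mapping from edges in $G$ to parameter indices. For any edge $e \in E$, the weight of the edge is $w_e = p_{\pi(e)}$. We refer to the set of edges that share the $i$th parameter $p_i$ by $E_i = \{e \in E \mid \pi(e) = i\}$. That is, for any $e_1, e_2 \in E_i$, $\pi(e_1) = \pi(e_2)$ and hence $w_{e_1} = w_{e_2} = p_{\pi(e_1)}$.

Such a feedforward network represents a function $f_{\mathcal{N}(G,\pi,p)} : \mathbb{R}^{|V_{\text{in}}|} \to \mathbb{R}^{|V_{\text{out}}|}$ as follows. For any input node $v \in V_{\text{in}}$, its output $h_v$ is the corresponding coordinate of the input vector $x \in \mathbb{R}^{|V_{\text{in}}|}$. For each internal node $v$, the output is defined recursively as $h_v(x) = \left[\sum_{(u \to v) \in E} w_{u \to v}\, h_u(x)\right]_+$, where $[z]_+ = \max(z, 0)$ is the ReLU activation function. (Bias terms can be modeled by having an additional special node $v_{\text{bias}}$ that is connected to all internal and output nodes, where $h_{v_{\text{bias}}} = 1$.) For output nodes $v \in V_{\text{out}}$, no non-linearity is applied, and their output $h_v(x) = \sum_{(u \to v) \in E} w_{u \to v}\, h_u(x)$ determines the corresponding coordinate of the computed function $f_{\mathcal{N}(G,\pi,p)}(x)$. Since we will fix the graph $G$ and the mapping $\pi$ and learn the parameters $p$, we use the shorthand $f_p = f_{\mathcal{N}(G,\pi,p)}$ to refer to the function implemented by parameters $p$. The goal of training is to find parameters $p$ that minimize some error functional $L(f_p)$ that depends on $p$ only through the function $f_p$. For example, in supervised learning $L(f) = \mathbb{E}\left[\text{loss}(f(x), y)\right]$, and this is typically done by minimizing an empirical estimate of this expectation.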
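The recursive definition above can be sketched in a few lines of Python. This is only an illustration under assumed conventions: the function name `forward`, the list-based encoding of the graph, and the toy network are hypothetical and not taken from the paper.

```python
import numpy as np

def forward(nodes, edges, pi, p, inputs, outputs, x):
    """Evaluate a ReLU network given as a DAG with shared parameters.

    nodes   : node ids in topological order
    edges   : list of (u, v) pairs
    pi      : pi[k] is the index of the parameter used by edges[k]
    p       : parameter vector
    inputs  : input node ids, ordered like the coordinates of x
    outputs : output node ids (linear, no ReLU applied)
    """
    h = {v: float(x[k]) for k, v in enumerate(inputs)}
    incoming = {v: [] for v in nodes}
    for k, (u, v) in enumerate(edges):
        incoming[v].append((u, p[pi[k]]))      # edge weight w_e = p[pi(e)]
    for v in nodes:
        if v in h:                             # input nodes are already set
            continue
        z = sum(w * h[u] for u, w in incoming[v])
        h[v] = z if v in outputs else max(z, 0.0)
    return np.array([h[v] for v in outputs])

# Toy graph (hypothetical): edges x0->a and x1->b share parameter p[0].
nodes = ["x0", "x1", "a", "b", "y"]
edges = [("x0", "a"), ("x1", "b"), ("a", "y"), ("b", "y")]
pi = [0, 0, 1, 2]
p = np.array([2.0, 1.0, -3.0])
y = forward(nodes, edges, pi, p, ["x0", "x1"], ["y"], np.array([1.0, 0.5]))
```

Changing `p[0]` changes both shared edges at once, which is the sense in which the learnable object is the parameter vector rather than the individual edge weights.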

If the mapping $\pi$ is one-to-one, then there is no weight sharing and the model corresponds to a standard feedforward network. On the other hand, weight sharing exists if $\pi$ is a many-to-one mapping. Two well-known examples of feedforward networks with shared weights are convolutional and recurrent networks. We mostly use the general notation of feedforward networks with shared weights throughout the paper, as this is more general and simplifies the development and notation. However, when focusing on RNNs, it is helpful to discuss them using a more familiar notation, which we briefly introduce next.

##### Recurrent Neural Networks

Time-unfolded RNNs are feedforward networks with shared weights that map an input sequence to an output sequence. Each input node corresponds to either a coordinate of the input vector at a particular time step or a hidden unit at time $0$. Each output node also corresponds to a coordinate of the output at a specific time step. Finally, each internal node refers to some hidden unit at time $t \geq 1$. When discussing RNNs, it is useful to refer to different layers and the values calculated at different time steps. We use a notation for RNN structures in which the nodes are partitioned into layers and $h^i_t$ denotes the output of nodes in layer $i$ at time step $t$. Let $x = (x_1, \dots, x_T)$ be the input at different time steps, where $T$ is the maximum number of propagations through time; we refer to it as the length of the RNN. For $0 < i < d$, let $W^i_{\text{in}}$ and $W^i_{\text{rec}}$ be the input and recurrent parameter matrices of layer $i$, and let $W_{\text{out}}$ be the output parameter matrix. Table 1 shows forward computations for RNNs. The output of the function implemented by the RNN can then be calculated as $f_{W,t}(x) = h^d_t$. Note that in this notation, the weight matrices $W_{\text{in}}$, $W_{\text{rec}}$ and $W_{\text{out}}$ correspond to “free” parameters of the model that are shared across time steps.
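As a rough sketch of these forward computations, the recurrences used in the proof in Appendix A can be written with NumPy as follows. The function name `rnn_forward`, the list-of-matrices layout, and the zero initial hidden state are assumptions made for this illustration.

```python
import numpy as np

def rnn_forward(W_in, W_rec, W_out, xs):
    """Forward pass of a stacked ReLU RNN.

    W_in, W_rec : lists with one matrix per hidden layer
    W_out       : output matrix applied to the top hidden layer
    xs          : array of shape (T, n_in), one input vector per time step
    Returns an array of shape (T, n_out), one output per time step.
    """
    h = [np.zeros(W.shape[0]) for W in W_rec]   # assumed: hidden state starts at 0
    ys = []
    for x in xs:
        below = x
        for i in range(len(W_rec)):
            # h[i] on the right-hand side still holds the previous time step
            h[i] = np.maximum(W_in[i] @ below + W_rec[i] @ h[i], 0.0)
            below = h[i]
        ys.append(W_out @ below)                # linear output layer
    return np.array(ys)

# One hidden unit with all weights equal to 1 simply counts the inputs.
W_in, W_rec, W_out = [np.array([[1.0]])], [np.array([[1.0]])], np.array([[1.0]])
ys = rnn_forward(W_in, W_rec, W_out, np.ones((3, 1)))
```

The same three matrices are reused at every time step, which is exactly the weight sharing described in the general notation.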

## 3 Non-Saturating Activation Functions

The choice of activation function for neural networks can have a large impact on optimization. We are particularly concerned with the distinction between “saturating” and “non-saturating” activation functions. We consider only monotone activation functions and say that a function is “saturating” if it is bounded; this includes, e.g., the sigmoid, hyperbolic tangent and piecewise-linear ramp activation functions. Boundedness necessarily implies that the function values converge to finite values at negative and positive infinity, and hence asymptote to horizontal lines on both sides. That is, the derivative of the activation converges to zero as the input goes to both $-\infty$ and $+\infty$. Networks with saturating activations therefore have a major shortcoming: the vanishing gradient problem [6]. The problem here is that the gradient disappears when the magnitude of the input to an activation is large (whether the unit is very “active” or very “inactive”), which makes the optimization very challenging.

While sigmoid and hyperbolic tangent have historically been popular choices for fully connected feedforward and convolutional neural networks, more recent works have shown clear advantages of non-saturating activations such as ReLU, which is now the standard choice for fully connected and convolutional networks [15, 10]. Non-saturating activations, including the ReLU, are typically still bounded from below and asymptote to a horizontal line, with a vanishing derivative, at $-\infty$. But they are unbounded from above, enabling their derivative to remain bounded away from zero as the input goes to $+\infty$. Using ReLUs enables gradients to not vanish along activated paths and thus can provide a stronger signal for training.

However, for recurrent neural networks, using ReLU activations is challenging in a different way, as even a small change in the direction of the leading eigenvector of the recurrent weights could get amplified and potentially lead to explosion in forward or backward propagation [1].

To understand this, consider a long path from an input in the first element of the sequence to an output of the last element, which passes through the same RNN edge at each step (i.e. through many edges in some $E_i$ in the shared-parameter representation). The length of this path, and the number of times it passes through edges associated with a single parameter, is proportional to the sequence length, which could easily be a few hundred or more. The effect of this parameter on the path is therefore exponentiated by the sequence length, as are gradient updates for this parameter, which could lead to parameter explosion unless an extremely small step size is used.

Understanding the geometry of RNNs with ReLUs could help us deal with the above issues more effectively. We next investigate some properties of the geometry of RNNs with ReLU activations.
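A scalar toy calculation makes this exponentiation concrete. The sketch below assumes a single path that crosses the same weight `w` at each of `T` steps; the sequence length of 200 and the three weight values are arbitrary illustrative choices.

```python
T = 200  # illustrative sequence length

results = {}
for w in (0.9, 1.0, 1.1):
    path_product = w ** T            # contribution of the shared weight along the path
    sensitivity = T * w ** (T - 1)   # derivative of that product with respect to w
    results[w] = (path_product, sensitivity)
```

A 10% change in the weight around 1 moves the path product by many orders of magnitude in either direction, and the derivative with respect to the weight scales in the same way, which is why a plain gradient step on such a parameter needs a very small step size.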

### Invariances in Feedforward Nets with Shared Weights

Feedforward networks (with or without shared weights) are highly over-parameterized, i.e. there are many parameter settings $p$ that represent the same function $f_p$. Since our true object of interest is the function $f$, and not the identity $p$ of the parameters, it would be beneficial if optimization depended only on $f_p$ and did not get “distracted” by differences in $p$ that do not affect $f_p$. It is therefore helpful to study the transformations of the parameters that do not change the function represented by the network, and to come up with methods whose performance is not affected by such transformations.

###### Definition 1.

We say a network $\mathcal{N}$ is invariant to a transformation $T$ if for any parameter setting $p$, $f_p = f_{T(p)}$. Similarly, we say an update rule $\mathcal{A}$ is invariant to $T$ if for any $p$, $f_{\mathcal{A}(p)} = f_{\mathcal{A}(T(p))}$.

Invariances have also been studied as different mappings from the parameter space to the same function space [19], while we define the transformation as a mapping inside a fixed parameter space. A very important invariance in feedforward networks is node-wise rescaling [17]. For any internal node $v$ and any scalar $\alpha > 0$, we can multiply all incoming weights into $v$ (i.e. $w_{u \to v}$ for any $(u \to v) \in E$) by $\alpha$ and all the outgoing weights (i.e. $w_{v \to u}$ for any $(v \to u) \in E$) by $1/\alpha$ without changing the function computed by the network. Not all node-wise rescaling transformations can be applied in feedforward nets with shared weights. This is because some weights are forced to be equal, and therefore we are only allowed to change them by the same scaling factor.

###### Definition 2.

Given a network $\mathcal{N}$, we say an invariant transformation $\tilde{T}$ that is defined over edge weights (rather than parameters) is feasible for the parameter mapping $\pi$ if the shared weights remain equal after the transformation, i.e. for any $i$ and for any $e, e' \in E_i$, $\tilde{T}(w)_e = \tilde{T}(w)_{e'}$.

It is therefore helpful to understand which node-wise rescalings are feasible for RNNs. In the following theorem, we characterize all feasible node-wise invariances in RNNs.

###### Theorem 1.

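For any $\alpha$ such that $\alpha^i_j > 0$, any Recurrent Neural Network with ReLU activation is invariant to the transformation $T_\alpha([W_{\text{in}}, W_{\text{rec}}, W_{\text{out}}]) = [T_{\text{in},\alpha}(W_{\text{in}}), T_{\text{rec},\alpha}(W_{\text{rec}}), T_{\text{out},\alpha}(W_{\text{out}})]$, where for any $i, j, k$:

$$
T_{\text{in},\alpha}(W_{\text{in}})^i[j,k] =
\begin{cases}
\alpha^i_j\, W^i_{\text{in}}[j,k] & i = 1,\\
\left(\alpha^i_j / \alpha^{i-1}_k\right) W^i_{\text{in}}[j,k] & 1 < i < d,
\end{cases}
\qquad
T_{\text{rec},\alpha}(W_{\text{rec}})^i[j,k] = \left(\alpha^i_j / \alpha^i_k\right) W^i_{\text{rec}}[j,k],
\qquad
T_{\text{out},\alpha}(W_{\text{out}})[j,k] = \left(1 / \alpha^{d-1}_k\right) W_{\text{out}}[j,k]. \tag{1}
$$

Furthermore, any feasible node-wise rescaling transformation can be presented in the above form.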

The proofs of all theorems and lemmas are given in Appendix A. The above theorem shows that there are many transformations under which RNNs represent the same function. An example of such invariances is shown in Fig. 1. Therefore, we would like to have optimization algorithms that are invariant to these transformations and in order to do so, we need to look at measures that are invariant to such mappings.
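The invariance can be checked numerically. The sketch below assumes a single hidden layer, random weights, and the rescaling used in the proof in Appendix A: row $j$ of the input matrix is multiplied by $\alpha_j$, entry $[j,k]$ of the recurrent matrix by $\alpha_j/\alpha_k$, and column $k$ of the output matrix by $1/\alpha_k$. All names and sizes are illustrative.

```python
import numpy as np

def run(W_in, W_rec, W_out, xs):
    """Single-hidden-layer ReLU RNN with zero initial state (assumed)."""
    h = np.zeros(W_rec.shape[0])
    out = []
    for x in xs:
        h = np.maximum(W_in @ x + W_rec @ h, 0.0)
        out.append(W_out @ h)
    return np.array(out)

rng = np.random.default_rng(0)
n_in, n_h, n_out, T = 3, 5, 2, 20
W_in = rng.normal(size=(n_h, n_in))
W_rec = rng.normal(size=(n_h, n_h)) / np.sqrt(n_h)
W_out = rng.normal(size=(n_out, n_h))
xs = rng.normal(size=(T, n_in))

alpha = rng.uniform(0.1, 10.0, size=n_h)              # one positive scale per hidden unit
W_in_s = alpha[:, None] * W_in                        # rows scaled by alpha_j
W_rec_s = (alpha[:, None] / alpha[None, :]) * W_rec   # entries scaled by alpha_j / alpha_k
W_out_s = W_out / alpha[None, :]                      # columns scaled by 1 / alpha_k

same = np.allclose(run(W_in, W_rec, W_out, xs), run(W_in_s, W_rec_s, W_out_s, xs))
```

The two parameter settings can differ by a factor of up to 100 in individual weights, yet they produce the same outputs up to floating-point error, so a Euclidean gradient step would treat them very differently even though they represent the same function.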

## 4 Path-SGD for Networks with Shared Weights

As we discussed, optimization is inherently tied to a choice of geometry, here represented by a choice of complexity measure or “norm”. (The path-norm which we define is a norm on functions, not on weights, but as we prefer not getting into this technical discussion here, we use the term “norm” very loosely to indicate some measure of magnitude [18].) Furthermore, we prefer using an invariant measure, which could then lead to an invariant optimization method. In Section 4.1 we introduce the path-regularizer and in Section 4.2 the derived Path-SGD optimization algorithm for standard feed-forward networks. Then in Section 4.3 we extend these notions to networks with shared weights, including RNNs, and present two invariant optimization algorithms based on them. In Section 4.4 we show how these can be implemented efficiently using forward and backward propagations.

### 4.1 Path-regularizer

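The path-regularizer is the sum over all paths from input nodes to output nodes of the product of squared weights along the path. To define it formally, let $P$ be the set of directed paths from input to output units, so that for any path $\zeta = (\zeta_0, \dots, \zeta_{\text{len}(\zeta)}) \in P$ of length $\text{len}(\zeta)$, we have that $\zeta_0 \in V_{\text{in}}$, $\zeta_{\text{len}(\zeta)} \in V_{\text{out}}$, and for any $0 \leq i \leq \text{len}(\zeta) - 1$, $(\zeta_i \to \zeta_{i+1}) \in E$. We also abuse the notation and write $e \in \zeta$ if for some $i$, $e = (\zeta_i \to \zeta_{i+1})$. Then the path-regularizer can be written as:

$$
\gamma^2_{\text{net}}(w) = \sum_{\zeta \in P} \prod_{i=0}^{\text{len}(\zeta)-1} w^2_{\zeta_i \to \zeta_{i+1}} \tag{2}
$$

Equivalently, the path-regularizer can be defined recursively on the nodes of the network as:

$$
\gamma^2_v(w) = \sum_{(u \to v) \in E} \gamma^2_u(w)\, w^2_{u \to v}, \qquad \gamma^2_{\text{net}}(w) = \sum_{u \in V_{\text{out}}} \gamma^2_u(w) \tag{3}
$$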
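The equivalence of the path-sum and recursive forms can be illustrated on a small layered network without shared weights. The sketch below assumes the base case $\gamma^2_v = 1$ on input nodes, which the text does not state explicitly, and uses hypothetical function names.

```python
import itertools
import numpy as np

def path_reg_recursive(weights):
    """Recursive form: propagate gamma^2 through element-wise squared weights."""
    gamma = np.ones(weights[0].shape[1])     # assumed base case: gamma_v^2 = 1 on inputs
    for W in weights:
        gamma = (W ** 2) @ gamma
    return gamma.sum()

def path_reg_bruteforce(weights):
    """Path-sum form: enumerate every input-to-output path explicitly."""
    sizes = [weights[0].shape[1]] + [W.shape[0] for W in weights]
    total = 0.0
    for path in itertools.product(*[range(n) for n in sizes]):
        prod = 1.0
        for layer, W in enumerate(weights):
            prod *= W[path[layer + 1], path[layer]] ** 2
        total += prod
    return total

rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(3, 4)), rng.normal(size=(2, 3))]
recursive_value = path_reg_recursive(weights)
bruteforce_value = path_reg_bruteforce(weights)
```

The brute-force sum visits a number of paths that grows exponentially with depth, while the recursion costs one matrix-vector product per layer.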

### 4.2 Path-SGD for Feedforward Networks

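Path-SGD is an approximate steepest descent step with respect to the path-norm. More formally, for a network without shared weights, where the parameters are the weights themselves, consider the diagonal quadratic approximation of the path-regularizer about the current iterate $w^{(t)}$:

$$
\hat{\gamma}^2_{\text{net}}(w^{(t)} + \Delta w) = \gamma^2_{\text{net}}(w^{(t)}) + \left\langle \nabla \gamma^2_{\text{net}}(w^{(t)}), \Delta w \right\rangle + \frac{1}{2} \Delta w^\top \operatorname{diag}\!\left(\nabla^2 \gamma^2_{\text{net}}(w^{(t)})\right) \Delta w \tag{4}
$$

Using the quadratic norm induced by the diagonal term of this approximation, we can define an approximate steepest descent step as:

$$
w^{(t+1)} = \arg\min_w \; \eta \left\langle \nabla L(w^{(t)}), w - w^{(t)} \right\rangle + \left\| w - w^{(t)} \right\|^2_{\hat{\gamma}^2_{\text{net}}(w^{(t)} + \Delta w)}. \tag{5}
$$

Solving (5) yields the update:

$$
w^{(t+1)}_e = w^{(t)}_e - \frac{\eta}{\kappa_e(w^{(t)})} \frac{\partial L}{\partial w_e}(w^{(t)}) \qquad \text{where: } \kappa_e(w) = \frac{1}{2} \frac{\partial^2 \gamma^2_{\text{net}}(w)}{\partial w_e^2}. \tag{6}
$$

The stochastic version that uses a subset of training examples to estimate $\frac{\partial L}{\partial w_e}(w^{(t)})$ is called Path-SGD [16]. We now show how Path-SGD can be extended to networks with shared weights.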
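For a layered network without shared weights, the scaling factor $\kappa_e$ in (6) is the total squared-weight product of all paths through edge $e$ with the edge's own weight left out. A minimal sketch, assuming $\gamma^2_v = 1$ on input nodes and using a random array as a stand-in for the loss gradient (all function names are hypothetical):

```python
import numpy as np

def path_reg(weights):
    """Path-regularizer of a layered network (gamma^2 = 1 on inputs, assumed)."""
    gamma = np.ones(weights[0].shape[1])
    for W in weights:
        gamma = (W ** 2) @ gamma
    return gamma.sum()

def kappa(weights):
    """kappa_e for every edge: (paths into the tail) times (paths out of the head)."""
    sq = [W ** 2 for W in weights]
    fwd = [np.ones(sq[0].shape[1])]          # fwd[l]: path mass reaching layer l
    for S in sq:
        fwd.append(S @ fwd[-1])
    bwd = [np.ones(sq[-1].shape[0])]         # built from the output layer backwards
    for S in reversed(sq):
        bwd.append(S.T @ bwd[-1])
    bwd.reverse()                            # bwd[l]: path mass from layer l to outputs
    return [np.outer(bwd[l + 1], fwd[l]) for l in range(len(sq))]

def path_sgd_step(weights, grads, lr):
    """One update of the form (6); kappa is assumed to be strictly positive."""
    return [W - lr * G / K for W, G, K in zip(weights, grads, kappa(weights))]

rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
grads = [rng.normal(size=W.shape) for W in weights]   # stand-in for dL/dw
new_weights = path_sgd_step(weights, grads, lr=0.1)
```

In practice a small constant is often added to the scaling factor to guard against division by zero when some weights vanish; that safeguard is a design choice of this sketch's author and not something stated in the text.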

### 4.3 Extending to Networks with Shared Weights

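When the network has shared weights, the path-regularizer is a function of the parameters $p$, and therefore the quadratic approximation should also be with respect to the iterate $p^{(t)}$ instead of $w^{(t)}$, which results in the following update rule:

$$
p^{(t+1)} = \arg\min_p \; \eta \left\langle \nabla L(p^{(t)}), p - p^{(t)} \right\rangle + \left\| p - p^{(t)} \right\|^2_{\hat{\gamma}^2_{\text{net}}(p^{(t)} + \Delta p)}, \tag{7}
$$

where the norm is the quadratic norm induced by the diagonal approximation of the path-regularizer, now taken with respect to the parameters $p$. Solving (7) gives the following update:

$$
p^{(t+1)}_i = p^{(t)}_i - \frac{\eta}{\kappa_i(p^{(t)})} \frac{\partial L}{\partial p_i}(p^{(t)}) \qquad \text{where: } \kappa_i(p) = \frac{1}{2} \frac{\partial^2 \gamma^2_{\text{net}}(p)}{\partial p_i^2}. \tag{8}
$$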

The second derivative terms are specified in terms of their path structure as follows:

###### Lemma 1.

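$\kappa_i(p) = \kappa^{(1)}_i(p) + \kappa^{(2)}_i(p)$, where

$$
\kappa^{(1)}_i(p) = \sum_{e \in E_i} \sum_{\zeta \in P} \mathbb{1}_{e \in \zeta} \prod_{\substack{j=0 \\ e \neq (\zeta_j \to \zeta_{j+1})}}^{\text{len}(\zeta)-1} p^2_{\pi(\zeta_j \to \zeta_{j+1})} = \sum_{e \in E_i} \kappa_e(w), \tag{9}
$$

$$
\kappa^{(2)}_i(p) = p_i^2 \sum_{\substack{e_1, e_2 \in E_i \\ e_1 \neq e_2}} \sum_{\zeta \in P} \mathbb{1}_{e_1, e_2 \in \zeta} \prod_{\substack{j=0 \\ e_1 \neq (\zeta_j \to \zeta_{j+1}) \\ e_2 \neq (\zeta_j \to \zeta_{j+1})}}^{\text{len}(\zeta)-1} p^2_{\pi(\zeta_j \to \zeta_{j+1})}, \tag{10}
$$

and $\kappa_e(w)$ is defined in (6).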

The second term $\kappa^{(2)}_i(p)$ measures the effect of interactions between edges corresponding to the same parameter (edges from the same $E_i$) on the same path from input to output. In particular, if for any path from an input unit to an output unit, no two edges along the path share the same parameter, then $\kappa^{(2)}_i(p) = 0$. For example, for any feedforward or convolutional neural network, $\kappa^{(2)}_i(p) = 0$. But for RNNs, there certainly are multiple edges sharing a single parameter on the same path, and so we could have $\kappa^{(2)}_i(p) \neq 0$.
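A scalar RNN gives a quick way to see when the interaction term appears. The sketch below assumes one hidden unit with input weight `a`, recurrent weight `p`, output weight `c`, an input at every time step, and an output read only at step `T`. It compares $\frac{1}{2}\,\partial^2 \gamma^2_{\text{net}} / \partial p^2$ from (8) with the first term from (9), without relying on the constants in (10). The setup and the function name are illustrative.

```python
def kappa_terms(a, p, c, T):
    """Return (kappa, kappa1, kappa - kappa1) for the recurrent weight p.

    A path from the input at step t to the output at step T crosses the
    recurrent edge k = T - t times, contributing a^2 * c^2 * p^(2k) to gamma^2.
    """
    total, first = 0.0, 0.0
    for k in range(1, T):
        rest = a ** 2 * c ** 2 * p ** (2 * k - 2)
        total += k * (2 * k - 1) * rest   # half the second derivative of p^(2k)
        first += k * rest                 # k edges, each with its own weight left out
    return total, first, total - first

short = kappa_terms(1.0, 1.0, 1.0, T=2)   # no path crosses the recurrent edge twice
long_ = kappa_terms(1.0, 1.0, 1.0, T=5)   # paths cross it up to four times
```

With `T=2` the remainder is zero, matching the feedforward case, while for `T=5` the remainder is four times the first term in this toy setting. How large it is relative to the first term in realistic networks is an empirical question, which Section 5.1 examines.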

The above lemma gives us a precise update rule for the approximate steepest descent with respect to the path-regularizer. The following theorem confirms that the steepest descent with respect to this regularizer is also invariant to all feasible node-wise rescaling for networks with shared weights.

###### Theorem 2.

For any feedforward network with shared weights, the update (8) is invariant to all feasible node-wise rescalings. Moreover, a simpler update rule that only uses $\kappa^{(1)}_i(p)$ in place of $\kappa_i(p)$ is also invariant to all feasible node-wise rescalings.

Equations (9) and (10) involve a sum over all paths in the network, whose number is exponential in the depth of the network. However, we next show that both of these equations can be calculated efficiently.

### 4.4 Simple and Efficient Computations for RNNs

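We show how to calculate $\kappa^{(1)}_i(p)$ and $\kappa^{(2)}_i(p)$ by considering a network with the same architecture but with squared weights:

###### Theorem 3.

For any network $\mathcal{N}(G, \pi, p)$, consider $\mathcal{N}(G, \pi, \tilde{p})$ where for any $i$, $\tilde{p}_i = p_i^2$. Define the function $g : \mathbb{R}^{|V_{\text{in}}|} \to \mathbb{R}$ to be the sum of outputs of this network: $g(x) = \sum_{i=1}^{|V_{\text{out}}|} f_{\tilde{p}}(x)[i]$. Then $\kappa^{(1)}$ and $\kappa^{(2)}$ can be calculated as follows, where $\mathbf{1}$ is the all-ones input vector:

$$
\kappa^{(1)}(p) = \nabla_{\tilde{p}}\, g(\mathbf{1}), \qquad
\kappa^{(2)}_i(p) = \sum_{\substack{(u \to v), (u' \to v') \in E_i \\ (u \to v) \neq (u' \to v')}} \tilde{p}_i\, \frac{\partial g(\mathbf{1})}{\partial h_{v'}(\tilde{p})}\, \frac{\partial h_{u'}(\tilde{p})}{\partial h_v(\tilde{p})}\, h_u(\tilde{p}). \tag{11}
$$

In the process of calculating the gradient $\nabla_{\tilde{p}}\, g(\mathbf{1})$, we need to calculate $h_u(\tilde{p})$ and $\partial g(\mathbf{1}) / \partial h_v(\tilde{p})$ for any $u, v$. Therefore, the only remaining term to calculate (besides $\nabla_{\tilde{p}}\, g(\mathbf{1})$) is $\partial h_{u'}(\tilde{p}) / \partial h_v(\tilde{p})$.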

Recall that $T$ is the length (maximum number of propagations through time) and $d$ is the number of layers in an RNN. Let $H$ be the number of hidden units in each layer and $B$ be the size of the mini-batch. Then calculating the gradient of the loss at all points in the mini-batch (the standard work required for any mini-batch gradient approach) requires time . In order to calculate $\kappa^{(1)}_i(p)$, we need to calculate the gradient of a similar network at a single input, so the time complexity is just an additional . The second term $\kappa^{(2)}$ can also be calculated for RNNs in . (For an RNN, and because only recurrent weights can be shared multiple times along an input-output path. can be written and calculated in the matrix form: where for any we have . The only terms that require extra computation are powers of which can be done in and the rest of the matrix computations need .) Therefore, the ratio of time complexity of calculating the first term and second term with respect to the gradient over the mini-batch is and respectively. Calculating only $\kappa^{(1)}_i(p)$ is therefore very cheap, with minimal per-mini-batch cost, while calculating $\kappa^{(2)}_i(p)$ might be expensive for large networks. Beyond the low computational cost, calculating $\kappa^{(1)}_i(p)$ is also very easy to implement, as it requires only taking the gradient with respect to a standard feed-forward calculation in a network with slightly modified weights; with most deep learning libraries it can be implemented with only a few lines of code.
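A sketch of the first-term computation for a single-layer RNN follows. Several things are assumed here rather than taken from the paper: the initial hidden state is zero and contributes no paths, outputs are read at every time step, and the gradient is written out by hand with backpropagation through time instead of an autodiff library. Because the squared weights and the all-ones input are nonnegative, the ReLU never clips and is omitted.

```python
import numpy as np

def g_value(S_in, S_rec, S_out, T):
    """Sum of all outputs of the squared-weight network on all-ones inputs."""
    h, total = np.zeros(S_rec.shape[0]), 0.0
    ones = np.ones(S_in.shape[1])
    for _ in range(T):
        h = S_in @ ones + S_rec @ h
        total += (S_out @ h).sum()
    return total

def kappa1(W_in, W_rec, W_out, T):
    """Gradient of g at the squared weights, via backpropagation through time."""
    S_in, S_rec, S_out = W_in ** 2, W_rec ** 2, W_out ** 2
    ones, out_ones = np.ones(S_in.shape[1]), np.ones(S_out.shape[0])
    hs = [np.zeros(S_rec.shape[0])]
    for _ in range(T):
        hs.append(S_in @ ones + S_rec @ hs[-1])
    k_in, k_rec, k_out = (np.zeros_like(S) for S in (S_in, S_rec, S_out))
    delta = np.zeros(S_rec.shape[0])             # dg/dh beyond the last step is 0
    for t in range(T, 0, -1):
        delta = S_out.T @ out_ones + S_rec.T @ delta   # dg/dh_t
        k_out += np.outer(out_ones, hs[t])
        k_in += np.outer(delta, ones)
        k_rec += np.outer(delta, hs[t - 1])      # sums over all time steps sharing W_rec
    return k_in, k_rec, k_out

rng = np.random.default_rng(0)
T = 6
W_in = rng.normal(scale=0.5, size=(4, 3))
W_rec = rng.normal(scale=0.5, size=(4, 4))
W_out = rng.normal(scale=0.5, size=(2, 4))
k_in, k_rec, k_out = kappa1(W_in, W_rec, W_out, T)
```

The cost is one forward and one backward pass on a single input, independent of the mini-batch size. The resulting arrays have the same shapes as the weight matrices and can be used to divide the loss gradient element-wise.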

## 5 Experiments

### 5.1 The Contribution of the Second Term

As we discussed in Section 4.4, the second term $\kappa^{(2)}$ in the update rule can be computationally expensive for large networks. In this section we investigate the significance of the second term and show that, at least in our experiments, its contribution is negligible. To compare the two terms $\kappa^{(1)}$ and $\kappa^{(2)}$, we train a single-layer RNN with hidden units for the task of word-level language modeling on the Penn Treebank (PTB) Corpus [13]. Fig. 2 compares the performance of SGD vs. Path-SGD with and without $\kappa^{(2)}$. We clearly see that both versions of Path-SGD perform very similarly and both of them outperform SGD significantly. The results in Fig. 2 suggest that the first term is more significant and that we can therefore ignore the second term.

To better understand the importance of the two terms, we compared the ratio of the norms of the two terms for different RNN lengths $T$ and numbers of hidden units $H$. The table in Fig. 2 shows that the contribution of the second term is bigger when the network has fewer hidden units and the length of the RNN is larger ($H$ is small and $T$ is large). However, in many cases, it appears that the first term has a much bigger contribution in the update step, and hence the second term can be safely ignored. Therefore, in the rest of our experiments, we calculate the Path-SGD updates using only the first term $\kappa^{(1)}$.

### 5.2 Synthetic Problems with Long-term Dependencies

Training Recurrent Neural Networks is known to be hard for modeling long-term dependencies due to the gradient vanishing/exploding problem [6, 2]. In this section, we consider synthetic problems that are specifically designed to test the ability of a model to capture the long-term dependency structure. Specifically, we consider the addition problem and the sequential MNIST problem.

Addition problem: The addition problem was introduced in [7]. Here, each input consists of two sequences of length , one of which includes numbers sampled from the uniform distribution with range and the other sequence serves as a mask which is filled with zeros except for two entries. These two entries indicate which of the two numbers in the first sequence we need to add and the task is to output the result of this addition.
Sequential MNIST: In sequential MNIST, each digit image is reshaped into a sequence of length , turning the digit classification task into sequence classification with long-term dependencies [12, 1].
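To make the addition task concrete, the following sketch generates one batch. The sampling range `[0, 1)` and the uniformly random choice of the two marked positions are assumptions of this illustration; the text above does not specify them, and the function name is hypothetical.

```python
import numpy as np

def addition_batch(batch_size, T, rng, low=0.0, high=1.0):
    """Return inputs of shape (batch, T, 2) and targets of shape (batch,).

    Channel 0 holds the random numbers, channel 1 the mask with two ones.
    The range [low, high) is an assumed default, not taken from the paper.
    """
    values = rng.uniform(low, high, size=(batch_size, T))
    mask = np.zeros((batch_size, T))
    for b in range(batch_size):
        i, j = rng.choice(T, size=2, replace=False)   # two distinct marked positions
        mask[b, i] = mask[b, j] = 1.0
    inputs = np.stack([values, mask], axis=-1)
    targets = (values * mask).sum(axis=1)             # sum of the two marked numbers
    return inputs, targets

rng = np.random.default_rng(0)
inputs, targets = addition_batch(batch_size=4, T=10, rng=rng)
```

As `T` grows, the two marked entries can lie far from the end of the sequence, which is what forces the model to carry information over many time steps.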

For both tasks, we closely follow the experimental protocol in [12]. We train a single-layer RNN consisting of 100 hidden units with path-SGD, referred to as RNN-Path. We also train an RNN of the same size with identity initialization, as was proposed in [12], using SGD as our baseline model, referred to as IRNN. We performed grid search for the learning rates over for both our model and the baseline. Non-recurrent weights were initialized from the uniform distribution with range . Similar to [1], we found the IRNN to be fairly unstable (with SGD optimization typically diverging). Therefore for IRNN, we ran 10 different initializations and picked the one that did not explode to show its performance.

In our first experiment, we evaluate Path-SGD on the addition problem. The results are shown in Fig. 3 for increasing lengths of the sequence: . We note that this problem becomes much harder as $T$ increases because the dependency between the output (the sum of two numbers) and the corresponding inputs becomes more distant. We also compare RNN-Path with the previously published results, including identity initialized RNN [12] (IRNN), unitary RNN [1] (uRNN), and np-RNN introduced by [22]. (The original paper does not include any result for 750, so we implemented np-RNN for comparison. However, in our implementation the np-RNN is not able to even learn sequences of length 200. Thus we put “>2” for length 750.) Table 3 shows the effectiveness of using Path-SGD. Perhaps more surprisingly, with the help of path-normalization, a simple RNN with the identity initialization is able to achieve a 0% error on the sequences of length 750, whereas all the other methods, including LSTMs, fail. This shows that Path-SGD may help stabilize the training and alleviate the gradient problem, so as to perform well on longer sequences. We next tried to model sequences of length 1000, but we found that for such very long sequences RNNs, even with Path-SGD, fail to learn.

Next, we evaluate Path-SGD on the Sequential MNIST problem. Table 3, right column, reports test error rates achieved by RNN-Path compared to the previously published results. Clearly, using Path-SGD helps RNNs achieve better generalization. In many cases, RNN-Path outperforms other RNN methods (except for LSTMs), even for such a long-term dependency problem.

### 5.3 Language Modeling Tasks

In this section we evaluate Path-SGD on a language modeling task. We consider two datasets, Penn Treebank (PTB-c) and text8.

PTB-c: We performed experiments on a tokenized Penn Treebank Corpus, following the experimental protocol of [11]. The training, validation and test data contain 5017k, 393k and 442k characters respectively. The alphabet size is 50, and each training sequence is of length 50.

text8: The text8 dataset contains 100M characters from Wikipedia with an alphabet size of 27. We follow the data partition of [14], where each training sequence has a length of 180.

Performance is evaluated using the bits-per-character (BPC) metric, which is $\log_2$ of perplexity.
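The relation between BPC, perplexity, and the average negative log-likelihood that most training code reports can be written out as a small sketch. The function names are illustrative, and the loss is assumed to be measured in nats per character.

```python
import math

def bpc_from_nll(nll_nats_per_char):
    """Convert an average per-character negative log-likelihood (nats) to BPC."""
    return nll_nats_per_char / math.log(2)

def perplexity_from_bpc(bpc):
    """Per-character perplexity is 2 raised to the bits-per-character."""
    return 2.0 ** bpc

uniform_27 = bpc_from_nll(math.log(27))   # BPC of a uniform guess over 27 symbols
```

For the 27-symbol text8 alphabet, a uniform guess gives roughly 4.75 bits per character, which is a useful sanity bound when reading reported BPC values.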

Similar to the experiments on the synthetic datasets, for both tasks, we train a single-layer RNN consisting of 2048 hidden units with path-SGD (RNN-Path). Due to the large dimension of hidden space, SGD can take a fairly long time to converge. Instead, we use Adam optimizer [8] to help speed up the training, where we simply use the path-SGD gradient as input to the Adam optimizer.
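One possible reading of "using the path-SGD gradient as input to Adam" is to divide the loss gradient element-wise by the path scaling factor and then apply the standard Adam update to the result. The sketch below follows that reading. It is an assumption about the setup, not the authors' code, and the function name, state layout, and default hyperparameters are illustrative.

```python
import numpy as np

def adam_path_step(w, grad, kappa, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step on the path-normalized gradient grad / kappa (assumed reading)."""
    g = grad / kappa
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * g          # first-moment estimate
    state["v"] = b2 * state["v"] + (1 - b2) * g * g      # second-moment estimate
    m_hat = state["m"] / (1 - b1 ** state["t"])          # bias correction
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

w = np.array([1.0, 2.0])
state = {"t": 0, "m": np.zeros_like(w), "v": np.zeros_like(w)}
w_new = adam_path_step(w, grad=np.array([2.0, -4.0]), kappa=np.array([4.0, 2.0]), state=state)
```

On the very first step Adam's own normalization makes the update depend only on the sign of each coordinate. The path scaling influences later steps through the accumulated moment estimates as the scaling factor changes during training.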

We also train three additional baseline models: a ReLU RNN with 2048 hidden units, a tanh RNN with 2048 hidden units, and an LSTM with 1024 hidden units, all trained using Adam. We performed grid search for learning rate over for all of our models. For ReLU RNNs, we initialize the recurrent matrices from uniform, and uniform for non-recurrent weights. For LSTMs, we use orthogonal initialization [21] for the recurrent matrices and uniform for non-recurrent weights. The results are summarized in Table 3.

We also compare our results to an RNN that uses a hidden activation regularizer [11] (TRec,), Multiplicative RNNs trained by Hessian-Free methods [14] (HF-MRNN), and an RNN with a smooth version of ReLU [20]. Table 3 shows that path-normalization is able to outperform RNN-ReLU and RNN-tanh, while at the same time narrowing the performance gap between plain RNNs and other more complicated models (e.g. the gap to LSTM by 57% on PTB and 54% on text8). This demonstrates the efficacy of path-normalized optimization for training RNNs with ReLU activation.

## 6 Conclusion

We investigated the geometry of RNNs within a broader class of feedforward networks with shared weights and showed how understanding the geometry can lead to significant improvements on different learning tasks. By designing an optimization algorithm with a geometry that is well suited for RNNs, we closed over half of the performance gap between vanilla RNNs and LSTMs. This is particularly useful for applications in which we seek compressed models with fast prediction time that require minimal storage, and it is also a step toward bridging the gap between LSTMs and RNNs.

#### Acknowledgments

This research was supported in part by an NSF RI-AF award and by Intel ICRI-CI. We thank Saizheng Zhang for sharing a base code for RNNs.

## References

• [1] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. arXiv preprint arXiv:1511.06464, 2015.
• [2] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. Neural Networks, IEEE Transactions on, 5(2):157–166, 1994.
• [3] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceeding of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014.
• [4] Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In Proceeding of the International Conference on Machine Learning (ICML), pages 1764–1772, 2014.
• [5] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In Proceeding of the International Conference on Learning Representations, 2016.
• [6] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02), 1998.
• [7] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8), 1997.
• [8] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceeding of the International Conference on Learning Representations, 2015.
• [9] Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. Unifying visual-semantic embeddings with multimodal neural language models. Transactions of the Association for Computational Linguistics, 2015.
• [10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems (NIPS), pages 1097–1105, 2012.
• [11] David Krueger and Roland Memisevic. Regularizing RNNs by stabilizing activations. In Proceeding of the International Conference on Learning Representations, 2016.
• [12] Quoc V Le, Navdeep Jaitly, and Geoffrey E Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.
• [13] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of english: The penn treebank. Computational linguistics, 19(2):313–330, 1993.
• [14] Tomás Mikolov, Ilya Sutskever, Anoop Deoras, Hai-Son Le, Stefan Kombrink, and J Cernocky. Subword language modeling with neural networks. (http://www.fit.vutbr.cz/~imikolov/rnnlm/char.pdf), 2012.
• [15] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the International Conference on Machine Learning (ICML), pages 807–814, 2010.
• [16] Behnam Neyshabur, Ruslan Salakhutdinov, and Nathan Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
• [17] Behnam Neyshabur, Ryota Tomioka, Ruslan Salakhutdinov, and Nathan Srebro. Data-dependent path normalization in neural networks. In the International Conference on Learning Representations, 2016.
• [18] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Proceeding of the 28th Conference on Learning Theory (COLT), 2015.
• [19] Yann Ollivier. Riemannian metrics for neural networks ii: recurrent networks and learning symbolic data sequences. Information and Inference, page iav007, 2015.
• [20] Marius Pachitariu and Maneesh Sahani. Regularization and nonlinearities for neural language models: when are they needed? arXiv preprint arXiv:1301.5650, 2013.
• [21] Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations, 2014.
• [22] Sachin S. Talathi and Aniket Vartak. Improving performance of recurrent neural network with relu nonlinearity. In the International Conference on Learning Representations workshop track, 2014.
• [23] Saizheng Zhang, Yuhuai Wu, Tong Che, Zhouhan Lin, Roland Memisevic, Ruslan Salakhutdinov, and Yoshua Bengio. Architectural complexity measures of recurrent neural networks. arXiv preprint arXiv:1602.08210, 2016.

## Appendix A Proofs

### A.1 Proof of Theorem 1

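We first show that any RNN is invariant to $T_\alpha$ by induction on layers and time-steps. More specifically, we prove that for any $0 \le t \le T$ and $1 \le i < d$, $h^i_t(T_\alpha(W))[j] = \alpha^i_j\, h^i_t(W)[j]$. The statement is clearly true for $t = 0$, because for any $i, j$, $h^i_0(T_\alpha(W))[j] = \alpha^i_j\, h^i_0(W)[j] = 0$.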

Next, we show that for $i = 1$, if we assume that the statement is true for $t = t'$, then it is also true for $t = t' + 1$:

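$$
\begin{aligned}
h^1_{t'+1}(T_\alpha(W))[j] &= \Big[\sum_{j'} T_{\mathrm{in},\alpha}(W_{\mathrm{in}})^1[j,j']\, x_{t'+1}[j'] + T_{\mathrm{rec},\alpha}(W_{\mathrm{rec}})^1[j,j']\, h^1_{t'}(T_\alpha(W))[j'] \Big]_+ \\
&= \Big[\sum_{j'} \alpha^1_j\, W^1_{\mathrm{in}}[j,j']\, x_{t'+1}[j'] + \frac{\alpha^1_j}{\alpha^1_{j'}}\, W^1_{\mathrm{rec}}[j,j']\, \alpha^1_{j'}\, h^1_{t'}(W)[j'] \Big]_+ \\
&= \alpha^1_j\, h^1_{t'+1}(W)[j]
\end{aligned}
$$

The last equality uses $\alpha^1_j > 0$, which lets the common factor $\alpha^1_j$ be pulled out of the ReLU $[\cdot]_+$.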

We now need to prove the statement for $1 < i < d$. Assuming that the statement is true for $t = t'$ and for the layers before $i$, we have:

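$$
\begin{aligned}
h^i_{t'+1}(T_\alpha(W))[j] &= \Big[\sum_{j'} T_{\mathrm{in},\alpha}(W_{\mathrm{in}})^i[j,j']\, h^{i-1}_{t'+1}(T_\alpha(W))[j'] + T_{\mathrm{rec},\alpha}(W_{\mathrm{rec}})^i[j,j']\, h^i_{t'}(T_\alpha(W))[j'] \Big]_+ \\
&= \Big[\sum_{j'} \frac{\alpha^i_j}{\alpha^{i-1}_{j'}}\, W^i_{\mathrm{in}}[j,j']\, \alpha^{i-1}_{j'}\, h^{i-1}_{t'+1}(W)[j'] + \frac{\alpha^i_j}{\alpha^i_{j'}}\, W^i_{\mathrm{rec}}[j,j']\, \alpha^i_{j'}\, h^i_{t'}(W)[j'] \Big]_+ \\
&= \alpha^i_j\, h^i_{t'+1}(W)[j]
\end{aligned}
$$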

Finally, we can show that the output is invariant for any $j$ at any time step $t$:

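$$
\begin{aligned}
f_{T_\alpha(W),t}(x_t)[j] &= \sum_{j'} T_{\mathrm{out},\alpha}(W_{\mathrm{out}})[j,j']\, h^{d-1}_t(T_\alpha(W))[j'] = \sum_{j'} \frac{1}{\alpha^{d-1}_{j'}}\, W_{\mathrm{out}}[j,j']\, \alpha^{d-1}_{j'}\, h^{d-1}_t(W)[j'] \\
&= \sum_{j'} W_{\mathrm{out}}[j,j']\, h^{d-1}_t(W)[j'] = f_{W,t}(x_t)[j]
\end{aligned}
$$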

We now show that any feasible node-wise rescaling can be presented as $T_\alpha$. Recall that node-wise rescaling invariances for a general feedforward network can be written as $T_\beta(w)_{u\to v} = (\beta_v/\beta_u)\, w_{u\to v}$ for some $\beta$, where $\beta_v > 0$ for internal nodes and $\beta_v = 1$ for any input/output nodes. An RNN with $T = 0$ has no weight sharing, and for each node $v$ with index $j$ in layer $i$ we have $\beta_v = \alpha^i_j$. For any $T > 0$, however, there is no invariance that is not already counted. The reason is that once the values of $\beta_v$ are fixed for the nodes in time step $0$, feasibility forces the values of $\beta$ for nodes in other time steps to be tied to the corresponding values in time step $0$. Therefore, all invariances are included and can be presented in the form of $T_\alpha$.
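
The invariance shown in the first part of the proof can also be checked numerically. The following is a minimal sketch, not the authors' code: it assumes a stacked ReLU RNN with two hidden layers, a zero initial hidden state, no biases, and random weights. The helper names (`rnn_outputs`, `rescale`, `rand_matrix`) and all sizes are made up for illustration.

```python
import random

def relu(z):
    return z if z > 0.0 else 0.0

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_outputs(W_in, W_rec, W_out, xs):
    """Run a stacked ReLU RNN from a zero hidden state; return one output per step."""
    n_layers = len(W_in)
    h = [[0.0] * len(W_rec[i]) for i in range(n_layers)]
    outputs = []
    for x in xs:
        below, new_h = x, []
        for i in range(n_layers):
            a = matvec(W_in[i], below)   # input from the layer below (same step)
            b = matvec(W_rec[i], h[i])   # recurrent input (previous step)
            new_h.append([relu(p + q) for p, q in zip(a, b)])
            below = new_h[i]
        h = new_h
        outputs.append(matvec(W_out, h[-1]))
    return outputs

def rescale(W_in, W_rec, W_out, alpha):
    """Apply T_alpha; alpha[i][j] > 0 scales unit j of hidden layer i (0-based)."""
    T_in, T_rec = [], []
    for i in range(len(W_in)):
        if i == 0:
            T_in.append([[alpha[0][j] * w for w in row]
                         for j, row in enumerate(W_in[0])])
        else:
            T_in.append([[alpha[i][j] / alpha[i - 1][k] * w for k, w in enumerate(row)]
                         for j, row in enumerate(W_in[i])])
        T_rec.append([[alpha[i][j] / alpha[i][k] * w for k, w in enumerate(row)]
                      for j, row in enumerate(W_rec[i])])
    T_out = [[w / alpha[-1][k] for k, w in enumerate(row)] for row in W_out]
    return T_in, T_rec, T_out

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_x, n_h, n_y, steps = 3, 4, 2, 6   # illustrative sizes
W_in = [rand_matrix(rng, n_h, n_x), rand_matrix(rng, n_h, n_h)]
W_rec = [rand_matrix(rng, n_h, n_h), rand_matrix(rng, n_h, n_h)]
W_out = rand_matrix(rng, n_y, n_h)
alpha = [[rng.uniform(0.1, 10.0) for _ in range(n_h)] for _ in range(2)]
xs = [[rng.uniform(-1.0, 1.0) for _ in range(n_x)] for _ in range(steps)]

y_ref = rnn_outputs(W_in, W_rec, W_out, xs)
y_new = rnn_outputs(*rescale(W_in, W_rec, W_out, alpha), xs)
max_gap = max(abs(a - b) for ya, yb in zip(y_ref, y_new) for a, b in zip(ya, yb))
scale = max(1.0, max(abs(a) for ya in y_ref for a in ya))
print("largest output difference:", max_gap)
```

The two runs should agree up to floating-point rounding, even though the rescaled weights can differ from the originals by up to two orders of magnitude.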

### A.2 Proof of Lemma 1

We prove the statement simply by calculating the second derivative of the path-regularizer with respect to each parameter:

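$$
\begin{aligned}
\kappa_i(p) &= \frac{1}{2}\frac{\partial^2 \gamma^2_{\mathrm{net}}}{\partial p_i^2} = \frac{1}{2}\frac{\partial}{\partial p_i}\Bigg(\frac{\partial}{\partial p_i} \sum_{\zeta\in P} \prod_{j=0}^{\operatorname{len}(\zeta)-1} w^2_{\zeta_j\to\zeta_{j+1}}\Bigg) \\
&= \frac{1}{2}\frac{\partial}{\partial p_i}\Bigg(\frac{\partial}{\partial p_i} \sum_{\zeta\in P} \prod_{j=0}^{\operatorname{len}(\zeta)-1} p^2_{\pi(\zeta_j\to\zeta_{j+1})}\Bigg) = \frac{1}{2}\sum_{\zeta\in P} \frac{\partial}{\partial p_i}\Bigg(\frac{\partial}{\partial p_i} \prod_{j=0}^{\operatorname{len}(\zeta)-1} p^2_{\pi(\zeta_j\to\zeta_{j+1})}\Bigg)
\end{aligned}
$$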

Taking the second derivative then gives us both terms after a few calculations:

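$$
\begin{aligned}
\kappa_i(p) &= \frac{1}{2}\sum_{\zeta\in P} \frac{\partial}{\partial p_i}\Bigg(\frac{\partial}{\partial p_i} \prod_{j=0}^{\operatorname{len}(\zeta)-1} p^2_{\pi(\zeta_j\to\zeta_{j+1})}\Bigg) = \sum_{\zeta\in P} \frac{\partial}{\partial p_i}\Bigg(p_i \sum_{e\in E_i} \mathbf{1}_{e\in\zeta} \prod_{\substack{j=0 \\ e\neq(\zeta_j\to\zeta_{j+1})}}^{\operatorname{len}(\zeta)-1} p^2_{\pi(\zeta_j\to\zeta_{j+1})}\Bigg) \\
&= \sum_{\zeta\in P} \Bigg[p_i \frac{\partial}{\partial p_i}\Bigg(\sum_{e\in E_i} \mathbf{1}_{e\in\zeta} \prod_{\substack{j=0 \\ e\neq(\zeta_j\to\zeta_{j+1})}}^{\operatorname{len}(\zeta)-1} p^2_{\pi(\zeta_j\to\zeta_{j+1})}\Bigg) + \sum_{e\in E_i} \mathbf{1}_{e\in\zeta} \prod_{\substack{j=0 \\ e\neq(\zeta_j\to\zeta_{j+1})}}^{\operatorname{len}(\zeta)-1} p^2_{\pi(\zeta_j\to\zeta_{j+1})}\Bigg] \\
&= p_i^2 \sum_{\substack{e_1,e_2\in E_i \\ e_1\neq e_2}} \Bigg[\sum_{\zeta\in P} \mathbf{1}_{e_1,e_2\in\zeta} \prod_{\substack{j=0 \\ e_1\neq(\zeta_j\to\zeta_{j+1}) \\ e_2\neq(\zeta_j\to\zeta_{j+1})}}^{\operatorname{len}(\zeta)-1} p^2_{\pi(\zeta_j\to\zeta_{j+1})}\Bigg] + \sum_{e\in E_i} \Bigg[\sum_{\zeta\in P} \mathbf{1}_{e\in\zeta} \prod_{\substack{j=0 \\ e\neq(\zeta_j\to\zeta_{j+1})}}^{\operatorname{len}(\zeta)-1} p^2_{\pi(\zeta_j\to\zeta_{j+1})}\Bigg]
\end{aligned}
$$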
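
As a sanity check on the final expression, consider the special case in which $p_i$ is not shared, so that $E_i = \{e\}$ contains a single edge (an illustration, not part of the original proof). The sum over pairs $e_1 \neq e_2$ is then empty, and only the second term survives:

$$
\kappa_i(p) = \sum_{\zeta\in P} \mathbf{1}_{e\in\zeta} \prod_{\substack{j=0 \\ e\neq(\zeta_j\to\zeta_{j+1})}}^{\operatorname{len}(\zeta)-1} p^2_{\pi(\zeta_j\to\zeta_{j+1})}
$$

This is the sum, over all paths through $e$, of the product of the other squared weights on the path, which is the scaling factor path-SGD uses for feedforward networks without weight sharing [16].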

### A.3 Proof of Theorem 2

Node-wise rescaling invariances for a feedforward network can be written as $T_\beta(w)_{u\to v} = (\beta_v/\beta_u)\, w_{u\to v}$ for some $\beta$, where $\beta_v > 0$ for internal nodes and $\beta_v = 1$ for any input/output nodes. Any feasible invariance for a network with shared weights can also be written in the same form. The only difference is that some of the $\beta_v$ are now tied to each other, so that shared weights have the same value after the transformation. First, note that since the network is invariant to the transformation, the following statement holds by an induction similar to Theorem 1 but in the backward direction:

$$
\frac{\partial L}{\partial h_v}(T_\beta(p)) = \frac{1}{\beta_v}\, \frac{\partial L}{\partial h_v}(p) \tag{12}
$$

for any node $v$. Furthermore, by the proof of Theorem 1, we have that for any node $u$, $h_u(T_\beta(p)) = \beta_u\, h_u(p)$. Therefore,

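$$
\frac{\partial L}{\partial T_\beta(p)_i}(T_\beta(p)) = \sum_{(u\to v)\in E_i} \frac{\partial L}{\partial h_v}(T_\beta(p))\, h_u(T_\beta(p)) = \frac{\beta_{u'}}{\beta_{v'}}\, \frac{\partial L}{\partial p_i}(p) \tag{13}
$$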

where $(u'\to v') \in E_i$. To prove the theorem statement, it is enough to show that for any edge $(u\to v) \in E_i$, $\kappa_i(T_\beta(p)) = (\beta_u/\beta_v)^2\, \kappa_i(p)$, because this property gives us the following update:

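$$
T_\beta(p)_i - \frac{\eta}{\kappa_i(T_\beta(p))}\, \frac{\partial L(T_\beta(p))}{\partial T_\beta(p)_i} = \frac{\beta_v}{\beta_u}\, p_i - \frac{\eta}{(\beta_u/\beta_v)^2\, \kappa_i(p)}\, \frac{\beta_u}{\beta_v}\, \frac{\partial L}{\partial p_i}(p) = T_\beta(p^+)_i
$$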
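
The algebra in this display can be checked for a single parameter. The sketch below uses arbitrary scalar values; the names `p`, `grad`, `kappa`, `eta`, `beta_u` and `beta_v` are illustrative assumptions, not quantities from the paper's experiments. It assumes the gradient and $\kappa_i$ transform as stated in the proof, and contrasts the result with a plain SGD step.

```python
import math

# Illustrative scalar values (assumptions, not taken from the paper).
p, grad, kappa, eta = 0.7, -1.3, 2.5, 0.1
beta_u, beta_v = 3.0, 0.4

def path_sgd_step(param, g, k, lr):
    """One coordinate of the path-normalized update: param - (lr / k) * g."""
    return param - lr / k * g

def rescale(param):
    """T_beta applied to the weight of an edge u -> v."""
    return beta_v / beta_u * param

# Transformed quantities, using the scaling relations from the proof.
p_t = rescale(p)
grad_t = beta_u / beta_v * grad
kappa_t = (beta_u / beta_v) ** 2 * kappa

# Path-normalized step: updating and rescaling commute.
step_then_rescale = rescale(path_sgd_step(p, grad, kappa, eta))
rescale_then_step = path_sgd_step(p_t, grad_t, kappa_t, eta)

# Plain SGD step (no kappa): the two orders give different weights.
sgd_then_rescale = rescale(p - eta * grad)
rescale_then_sgd = p_t - eta * grad_t

print(step_then_rescale, rescale_then_step)
print(sgd_then_rescale, rescale_then_sgd)
```

The first pair of numbers should coincide up to rounding, while the second pair differs, which is why the factor $\kappa_i$ is what makes the update commute with the rescaling.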

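It therefore remains to show that for any edge $(u\to v) \in E_i$, $\kappa_i(T_\beta(p)) = (\beta_u/\beta_v)^2\, \kappa_i(p)$. We show that this is indeed true for both terms $\kappa^{(1)}_i$ and $\kappa^{(2)}_i$ separately.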

We first prove the statement for $\kappa^{(1)}_i$. Consider each path $\zeta \in P$. By an inductive argument along the path, it is easy to see that the product of squared weights along this path is invariant to the transformation:

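$$
\prod_{j=0}^{\operatorname{len}(\zeta)-1} T_\beta(p)^2_{\pi(\zeta_j\to\zeta_{j+1})} = \prod_{j=0}^{\operatorname{len}(\zeta)-1} p^2_{\pi(\zeta_j\to\zeta_{j+1})}
$$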

Therefore, we have that for any edge $e = (u\to v) \in E_i$ and any path $\zeta \in P$ through $e$,

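$$
\prod_{\substack{j=0 \\ e\neq(\zeta_j\to\zeta_{j+1})}}^{\operatorname{len}(\zeta)-1} T_\beta(p)^2_{\pi(\zeta_j\to\zeta_{j+1})} = \left(\frac{\beta_u}{\beta_v}\right)^2 \prod_{\substack{j=0 \\ e\neq(\zeta_j\to\zeta_{j+1})}}^{\operatorname{len}(\zeta)-1} p^2_{\pi(\zeta_j\to\zeta_{j+1})}
$$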
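
Both path identities above follow from the telescoping of the factors $\beta_{\zeta_{j+1}}/\beta_{\zeta_j}$ along a path. The short sketch below illustrates this on one made-up path of five edges with distinct nodes; the weights, the $\beta$ values and the helper `sq_prod` are assumptions chosen only for illustration, and weight sharing is ignored.

```python
import math
import random

rng = random.Random(1)
length = 5  # number of edges on the illustrative path

# Edge weights along the path, with random signs.
w = [rng.uniform(0.5, 2.0) * rng.choice((-1, 1)) for _ in range(length)]
# Node scales: beta = 1 at the input and output node, positive in between.
beta = [1.0] + [rng.uniform(0.1, 10.0) for _ in range(length - 1)] + [1.0]
# Node-wise rescaling: the weight on edge j -> j+1 is multiplied by beta[j+1] / beta[j].
tw = [beta[j + 1] / beta[j] * w[j] for j in range(length)]

def sq_prod(ws, skip=None):
    """Product of squared weights along the path, optionally leaving out one edge."""
    return math.prod(x * x for j, x in enumerate(ws) if j != skip)

# Full path product: invariant under the rescaling.
full_ref, full_new = sq_prod(w), sq_prod(tw)

# Product without edge k = (u -> v): scaled by (beta_u / beta_v) ** 2.
k = 2
ratio = sq_prod(tw, skip=k) / sq_prod(w, skip=k)
expected = (beta[k] / beta[k + 1]) ** 2

print(full_ref, full_new)
print(ratio, expected)
```

Each printed pair should agree up to floating-point rounding.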

Taking the sum over all paths $\zeta \in P$ and all edges $e \in E_i$ completes the proof for $\kappa^{(1)}_i$. Similarly for $\kappa^{(2)}_i$, considering any two edges $e_1, e_2 \in E_i$ and any path $\zeta \in P$, we have that:

 Tβ(p)2ilen(ζ)−1∏j=0e1≠(ζj→ζj+1)e2≠(ζj→ζj+1)Tβ(p)2π