1 Introduction
Recurrent Neural Networks (RNNs) have proven successful in a variety of sequence learning problems [4, 3, 9], including those involving long-term dependencies (e.g., [1, 23]). However, most of this empirical success has come not from "plain" RNNs but from alternate, more complex structures, such as Long Short-Term Memory (LSTM) networks [7] or Gated Recurrent Units (GRUs) [3]. Much of the motivation for these more complex models is not so much their modeling richness, but rather that they seem easier to optimize. As we discuss in Section 3, training plain RNNs using gradient-descent variants is problematic, and the choice of activation function can lead to vanishing or exploding gradients.
In this paper our goal is to better understand the geometry of plain RNNs, and to develop better optimization methods, adapted to this geometry, that directly learn plain RNNs with ReLU activations. One motivation for insisting on plain RNNs, as opposed to LSTMs or GRUs, is that they are simpler and may be more appropriate for applications that require low-complexity design, such as mobile computing platforms [22, 5]. In other applications, it may be better to resolve optimization issues with better optimization methods rather than by reverting to more complex models. A better understanding of plain-RNN optimization can also help us design, optimize, and intelligently use more complex RNN extensions.
Improving the training of RNNs with ReLU activations has received some recent attention, with most research focusing on different initialization strategies [12, 22]. While initialization can certainly have a strong effect on the success of the method, it generally can at most delay the problem of gradient explosion during optimization. In this paper we take a different approach that can be combined with any initialization choice, and focus on the dynamics of the optimization itself.
Any local search method is inherently tied to some notion of geometry over the search space (e.g. the space of RNNs). For example, gradient descent (including stochastic gradient descent) is tied to the Euclidean geometry and can be viewed as steepest descent with respect to the Euclidean norm. Changing the norm (even to a different quadratic norm, e.g. by representing the weights with respect to a different basis in parameter space) results in different optimization dynamics. We build on prior work on the geometry and optimization of feedforward networks, which uses the path-norm [16] (defined in Section 4) to determine a geometry, leading to the Path-SGD optimization method. To do so, we investigate the geometry of RNNs as feedforward networks with shared weights (Section 2) and extend the line of work on path-normalized optimization to networks with shared weights. We show that the resulting algorithm (Section 4) has invariance properties on RNNs similar to those of standard Path-SGD on feedforward networks, and can result in better optimization with less sensitivity to the scale of the weights.

2 Recurrent Neural Nets as Feedforward Nets with Shared Weights
We view Recurrent Neural Networks (RNNs) as feedforward networks with shared weights.
We denote a general feedforward network with ReLU activations and shared weights by $N(G(V,E), \pi, \theta)$, where $G(V,E)$ is a directed acyclic graph over the set of nodes $V$ that correspond to units in the network, including special subsets of input and output nodes $V_{\mathrm{in}}, V_{\mathrm{out}} \subseteq V$; $\theta \in \mathbb{R}^m$ is a parameter vector; and $\pi : E \to \{1, \dots, m\}$ is a mapping from edges in $E$ to parameter indices. For any edge $e \in E$, the weight of the edge is $w_e = \theta_{\pi(e)}$. We refer to the set of edges that share the $i$-th parameter as $E_i = \{e \in E : \pi(e) = i\}$; hence $w_e = \theta_i$ for any $e \in E_i$. Such a feedforward network represents a function as follows: for any input node $v \in V_{\mathrm{in}}$, its output $h_v$ is the corresponding coordinate of the input vector $x$. For each internal node $v$, the output is defined recursively as $h_v = \big[\sum_{(u \to v) \in E} w_{(u \to v)} h_u\big]_+$, where $[z]_+ = \max(0, z)$ is the ReLU activation function (bias terms can be modeled by an additional special node $v_0$ that is connected to all internal and output nodes, with $h_{v_0} = 1$). For output nodes $v \in V_{\mathrm{out}}$, no nonlinearity is applied, and $h_v = \sum_{(u \to v) \in E} w_{(u \to v)} h_u$ determines the corresponding coordinate of the computed function. Since we fix the graph $G$ and the mapping $\pi$ and learn only the parameters $\theta$, we use the shorthand $N(\theta)$ to refer to the function implemented by parameters $\theta$. The goal of training is to find parameters $\theta$ that minimize some error functional $L(N(\theta))$ that depends on $\theta$ only through the function $N(\theta)$; e.g., in supervised learning one minimizes an expected loss over examples, and this is typically done by minimizing an empirical estimate of this expectation.
If the mapping $\pi$ is a one-to-one mapping, there is no weight sharing and the model corresponds to a standard feedforward network. On the other hand, weight sharing exists if $\pi$ is a many-to-one mapping. Two well-known examples of feedforward networks with shared weights are convolutional and recurrent networks. We mostly use the general notation of feedforward networks with shared weights throughout the paper, as this is more general and simplifies the development and notation. When focusing on RNNs, however, it is helpful to use a more familiar notation, which we briefly introduce next.
Recurrent Neural Networks
Time-unfolded RNNs are feedforward networks with shared weights that map an input sequence to an output sequence. Each input node corresponds to either a coordinate of the input vector at a particular time step, or a hidden unit at time $t = 0$. Each output node corresponds to a coordinate of the output at a specific time step, and each internal node refers to some hidden unit at time $t \geq 1$. When discussing RNNs, it is useful to refer to the different layers and the values calculated at different time steps. We therefore use a notation for RNN structures in which the nodes are partitioned into layers, and $h^i_t$ denotes the output of the nodes in layer $i$ at time step $t$. Let $x_1, \dots, x_T$ be the inputs at the different time steps, where $T$ is the maximum number of propagations through time, which we refer to as the length of the RNN. For each layer $i$, let $W^i_{\mathrm{in}}$ and $W^i_{\mathrm{rec}}$ be the input and recurrent parameter matrices of layer $i$, and let $W_{\mathrm{out}}$ be the output parameter matrix. Table 1 shows the forward computations for RNNs. The output of the function implemented by the RNN can then be calculated as $y_t = W_{\mathrm{out}} h^d_t$. Note that in this notation, the weight matrices $W^i_{\mathrm{in}}$, $W^i_{\mathrm{rec}}$ and $W_{\mathrm{out}}$ correspond to the "free" parameters of the model, shared across the different time steps.
Table 1: Forward computations for feedforward nets with shared weights and the corresponding RNN notation.

                      Input nodes        Internal nodes                                                              Output nodes
FF (shared weights)   $h_v = x[v]$       $h_v = \big[\sum_{(u \to v) \in E} w_{(u \to v)} h_u\big]_+$                $h_v = \sum_{(u \to v) \in E} w_{(u \to v)} h_u$
RNN notation          $h^0_t = x_t$      $h^i_t = \big[W^i_{\mathrm{in}} h^{i-1}_t + W^i_{\mathrm{rec}} h^i_{t-1}\big]_+$   $y_t = W_{\mathrm{out}} h^d_t$
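The forward computations above can be sketched in a few lines of NumPy. This is a toy illustration for a single layer (function and variable names are ours, not the paper's), with the initial hidden state taken to be zero: the same three matrices are reused at every time step of the unrolled graph.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def rnn_forward(x_seq, W_in, W_rec, W_out):
    """Single-layer time-unfolded ReLU RNN:
    h_t = [W_in x_t + W_rec h_{t-1}]_+ and y_t = W_out h_t.
    The same (shared) matrices are reused at every time step."""
    h = np.zeros(W_rec.shape[0])  # assumed zero initial hidden state
    ys = []
    for x_t in x_seq:
        h = relu(W_in @ x_t + W_rec @ h)
        ys.append(W_out @ h)
    return ys

# small usage example with arbitrary shapes
rng = np.random.default_rng(0)
W_in, W_rec, W_out = (rng.standard_normal(s) for s in [(4, 3), (4, 4), (2, 4)])
xs = [rng.standard_normal(3) for _ in range(5)]
ys = rnn_forward(xs, W_in, W_rec, W_out)
```

Unrolling this loop over $t$ gives exactly the shared-weight feedforward graph described above: one copy of each matrix per time step, all forced to be equal.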
3 Non-Saturating Activation Functions
The choice of activation function for neural networks can have a large impact on optimization. We are particularly concerned with the distinction between "saturating" and "non-saturating" activation functions. We consider only monotone activation functions and say that a function is "saturating" if it is bounded; this includes, e.g., the sigmoid, the hyperbolic tangent, and the piecewise-linear ramp activation functions. Boundedness necessarily implies that the function values converge to finite values at negative and positive infinity, and hence asymptote to horizontal lines on both sides. That is, the derivative of the activation converges to zero as the input goes to both $-\infty$ and $+\infty$. Networks with saturating activations therefore have a major shortcoming: the vanishing gradient problem [6]. The gradient vanishes when the magnitude of the input to an activation is large (whether the unit is very "active" or very "inactive"), which makes optimization very challenging.

While the sigmoid and hyperbolic tangent have historically been popular choices for fully connected feedforward and convolutional neural networks, more recent work has shown clear advantages of non-saturating activations such as the ReLU, which is now the standard choice for fully connected and convolutional networks [15, 10]. Non-saturating activations, including the ReLU, are typically still bounded from below and asymptote to a horizontal line, with a vanishing derivative, at $-\infty$. But they are unbounded from above, enabling their derivative to remain bounded away from zero as the input goes to $+\infty$. Using ReLUs, gradients do not vanish along activated paths and can thus provide a stronger signal for training.

However, for recurrent neural networks, using ReLU activations is challenging in a different way, as even a small change in the direction of the leading eigenvector of the recurrent weights can get amplified and potentially lead to explosion in the forward or backward propagation [1]. To understand this, consider a long path from an input in the first element of the sequence to an output of the last element, which passes through the same RNN edge at each step (i.e., through many edges that share a single parameter in the shared-parameter representation). The length of this path, and the number of times it passes through edges associated with a single parameter, is proportional to the sequence length, which can easily be a few hundred or more. The effect of this parameter on the path is therefore exponentiated by the sequence length, as are gradient updates for this parameter, which can lead to parameter explosion unless an extremely small step size is used.
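The amplification effect is easy to see in a small toy simulation (our own construction, not from the paper): we iterate a ReLU recurrence whose nonnegative recurrent matrix has a controlled spectral radius $\rho$, so the ReLU never clips anything and the hidden norm scales roughly like $\rho^T$.

```python
import numpy as np

def hidden_norms(rho, T=50, n=8, seed=0):
    """Iterate h <- [A h]_+ with a nonnegative recurrent matrix A whose
    spectral radius is rho.  Since A and h stay nonnegative, the ReLU is
    inactive and repeated reuse of the same shared weight exponentiates
    its effect over T steps."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(0.0, 1.0, (n, n))
    A *= rho / np.max(np.abs(np.linalg.eigvals(A)))  # rescale spectral radius to rho
    h = np.ones(n)
    norms = []
    for _ in range(T):
        h = np.maximum(A @ h, 0.0)
        norms.append(float(np.linalg.norm(h)))
    return norms
```

With $\rho$ slightly above 1 the hidden norm explodes within a few dozen steps, and with $\rho$ slightly below 1 it vanishes, mirroring the forward/backward explosion-vanishing dichotomy discussed above.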
Understanding the geometry of RNNs with ReLUs can help us deal with the above issues more effectively. We next investigate some properties of the geometry of RNNs with ReLU activations.
Invariances in Feedforward Nets with Shared Weights
Feedforward networks (with or without shared weights) are highly overparameterized: many parameter settings $\theta$ represent the same function $N(\theta)$. Since our true object of interest is the function, and not the identity of the parameters, it would be beneficial if optimization depended only on $N(\theta)$ and did not get "distracted" by differences in $\theta$ that do not affect it. It is therefore helpful to study the transformations on the parameters that do not change the function represented by the network, and to develop methods whose performance is not affected by such transformations.
Definition 1.
We say a network $N$ is invariant to a transformation $\mathcal{T}$ if for any parameter setting $\theta$, $N(\mathcal{T}(\theta)) = N(\theta)$. Similarly, we say an update rule $A$ is invariant to $\mathcal{T}$ if for any $\theta$, $N(A(\theta)) = N(A(\mathcal{T}(\theta)))$.
Invariances have also been studied as different mappings from the parameter space to the same function space [19], while we define the transformation as a mapping inside a fixed parameter space. A very important invariance in feedforward networks is node-wise rescaling [17]. For any internal node $v$ and any scalar $\alpha > 0$, we can multiply all incoming weights into $v$ (i.e. $w_{(u \to v)}$ for any $(u \to v) \in E$) by $\alpha$ and all the outgoing weights (i.e. $w_{(v \to u)}$ for any $(v \to u) \in E$) by $1/\alpha$ without changing the function computed by the network. Not all node-wise rescaling transformations can be applied in feedforward nets with shared weights. This is because some weights are forced to be equal, and we are therefore only allowed to change them by the same scaling factor.
Definition 2.
Given a network $N(G(V,E), \pi, \theta)$, we say an invariant transformation $\mathcal{T}$ that is defined over edge weights (rather than parameters) is feasible for the parameter mapping $\pi$ if the shared weights remain equal after the transformation, i.e. for any $i$ and any $e, e' \in E_i$, $\mathcal{T}(w)_e = \mathcal{T}(w)_{e'}$.
It is therefore helpful to understand which node-wise rescalings are feasible for RNNs. In the following theorem, we characterize all feasible node-wise invariances in RNNs.
Theorem 1.
For any positive diagonal matrices $D_1, \dots, D_d$ (one positive scale per hidden unit of each layer, with $D_0 = I$), any Recurrent Neural Network with ReLU activations is invariant to the transformation $\mathcal{T}$ where for any $1 \le i \le d$:

(1)  $\mathcal{T}(W^i_{\mathrm{in}}) = D_i\, W^i_{\mathrm{in}}\, D_{i-1}^{-1}, \qquad \mathcal{T}(W^i_{\mathrm{rec}}) = D_i\, W^i_{\mathrm{rec}}\, D_i^{-1}, \qquad \mathcal{T}(W_{\mathrm{out}}) = W_{\mathrm{out}}\, D_d^{-1}.$

Furthermore, any feasible node-wise rescaling transformation can be presented in the above form.
The proofs of all theorems and lemmas are given in Appendix A. The above theorem shows that there are many transformations under which RNNs represent the same function. An example of such an invariance is shown in Fig. 1. We would therefore like optimization algorithms that are invariant to these transformations, and in order to obtain them, we need to look at measures that are invariant to such mappings.
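Node-wise rescaling invariance is easy to verify numerically. The toy check below (our own construction, on a plain two-layer ReLU network rather than an RNN) scales each hidden unit's incoming weights by an arbitrary positive factor and its outgoing weights by the inverse factor; since $[\,c z\,]_+ = c\,[z]_+$ for $c > 0$, the computed function is unchanged.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def net(x, W1, W2):
    # two-layer network: linear output on top of one hidden ReLU layer
    return W2 @ relu(W1 @ x)

rng = np.random.default_rng(1)
W1 = rng.standard_normal((5, 3))
W2 = rng.standard_normal((2, 5))
x = rng.standard_normal(3)

# node-wise rescaling: multiply unit v's incoming weights by c_v > 0
# and its outgoing weights by 1 / c_v
c = np.array([0.1, 2.0, 5.0, 0.5, 3.0])  # arbitrary positive scales
W1_t = c[:, None] * W1
W2_t = W2 / c[None, :]
```

For an RNN the same check goes through, except that feasibility ties the scale of each hidden unit to be the same at all time steps, which is exactly the restriction characterized in Theorem 1.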
4 Path-SGD for Networks with Shared Weights
As we discussed, optimization is inherently tied to a choice of geometry, here represented by a choice of complexity measure or "norm" (the path-norm we define is in fact a norm on functions, not on weights, but as we prefer not to get into this technical discussion here, we use the term "norm" loosely to indicate some measure of magnitude [18]). Furthermore, we prefer an invariant measure, which can then lead to an invariant optimization method. In Section 4.1 we introduce the path-regularizer, and in Section 4.2 the derived Path-SGD optimization algorithm for standard feedforward networks. In Section 4.3 we extend these notions to networks with shared weights, including RNNs, and present two invariant optimization algorithms based on them. In Section 4.4 we show how these can be implemented efficiently using forward and backward propagations.
4.1 Path-regularizer
The path-regularizer is the sum, over all paths from input nodes to output nodes, of the product of squared weights along the path. To define it formally, let $P$ be the set of directed paths from input to output units, so that for any path $p = (e_1, \dots, e_k) \in P$ of length $k$, the edge $e_1$ starts at an input node, $e_k$ ends at an output node, and for any $1 \le j \le k-1$, the end node of $e_j$ is the start node of $e_{j+1}$. We also abuse notation and write $e \in p$ if $e = e_j$ for some $1 \le j \le k$. The path-regularizer can then be written as:

(2)  $\gamma^2_{\mathrm{net}}(w) = \displaystyle\sum_{p \in P} \prod_{e \in p} w_e^2.$

Equivalently, the path-regularizer can be computed recursively on the nodes of the network as

(3)  $\gamma^2_v(w) = \displaystyle\sum_{(u \to v) \in E} \gamma^2_u(w)\, w_{(u \to v)}^2,$

where $\gamma^2_v(w) = 1$ for input nodes $v \in V_{\mathrm{in}}$ and $\gamma^2_{\mathrm{net}}(w) = \sum_{v \in V_{\mathrm{out}}} \gamma^2_v(w)$.
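The recursive form (3) is what makes the path-regularizer cheap: for a layered network it is a single forward pass with squared weights on the all-ones input. A minimal sketch (names are ours), with a brute-force path-enumeration check on a tiny net:

```python
import numpy as np

def path_reg(weights):
    """Path-regularizer of a layered network (no shared weights): the sum
    over all input-output paths of the product of squared weights, computed
    via the recursion gamma_v = sum_u gamma_u * w_{u->v}^2."""
    gamma = np.ones(weights[0].shape[1])  # gamma = 1 at the input nodes
    for W in weights:                     # each W has shape (fan_out, fan_in)
        gamma = (W ** 2) @ gamma
    return float(gamma.sum())             # sum over output nodes

# brute-force check on a one-hidden-layer net: enumerate all 6 paths
rng = np.random.default_rng(2)
W1 = rng.standard_normal((3, 2))
W2 = rng.standard_normal((1, 3))
brute = sum(W2[0, j] ** 2 * W1[j, i] ** 2 for j in range(3) for i in range(2))
```

The recursion visits each edge once, so the cost is that of a single forward pass, versus the exponentially many paths in the explicit sum (2).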
4.2 Path-SGD for Feedforward Networks
Path-SGD is an approximate steepest descent step with respect to the path-norm. More formally, for a network without shared weights, where the parameters are the weights themselves, consider the diagonal quadratic approximation of the path-regularizer about the current iterate $w^{(t)}$:

(4)  $\hat\gamma^2_{\mathrm{net}}(w) = \gamma^2_{\mathrm{net}}(w^{(t)}) + \left\langle \nabla \gamma^2_{\mathrm{net}}(w^{(t)}),\, w - w^{(t)} \right\rangle + \dfrac{1}{2} \displaystyle\sum_{e \in E} \dfrac{\partial^2 \gamma^2_{\mathrm{net}}}{\partial w_e^2}(w^{(t)})\, \big(w_e - w^{(t)}_e\big)^2.$

Using the corresponding quadratic norm $\|w - w'\|^2_{\hat\gamma} = \sum_{e \in E} \frac{1}{2} \frac{\partial^2 \gamma^2_{\mathrm{net}}}{\partial w_e^2}(w^{(t)})\, (w_e - w'_e)^2$, we can define an approximate steepest descent step as:

(5)  $w^{(t+1)} = \arg\min_w \; \eta \left\langle \nabla L(w^{(t)}),\, w \right\rangle + \dfrac{1}{2} \big\|w - w^{(t)}\big\|^2_{\hat\gamma}.$

Solving (5) yields the update:

(6)  $w^{(t+1)}_e = w^{(t)}_e - \dfrac{\eta}{\kappa_e(w^{(t)})} \dfrac{\partial L}{\partial w_e}(w^{(t)}), \qquad \kappa_e(w) = \dfrac{1}{2} \dfrac{\partial^2 \gamma^2_{\mathrm{net}}}{\partial w_e^2}(w).$
The stochastic version that uses a subset of training examples to estimate the gradient of the loss is called Path-SGD [16]. We now show how Path-SGD can be extended to networks with shared weights.
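For a layered network without shared weights, the scale $\kappa_e$ in update (6) for an edge $(u \to v)$ factors as $\gamma^2_{\mathrm{in}}(u)\cdot\gamma^2_{\mathrm{out}}(v)$ (path-products below $u$ times path-products above $v$), and both factors come from one forward and one backward pass through the squared-weight network. The NumPy sketch below is our own illustration of this factorization, not the paper's implementation:

```python
import numpy as np

def path_sgd_scales(weights):
    """Per-weight scales kappa_e for Path-SGD in a layered net without
    sharing: kappa for edge (u -> v) = gamma_in(u) * gamma_out(v), via one
    forward and one backward pass with squared weights."""
    sq = [W ** 2 for W in weights]
    fwd = [np.ones(sq[0].shape[1])]   # gamma_in at each level of nodes
    for W in sq:
        fwd.append(W @ fwd[-1])
    bwd = [np.ones(sq[-1].shape[0])]  # gamma_out, accumulated from the top
    for W in reversed(sq):
        bwd.append(W.T @ bwd[-1])
    bwd = bwd[::-1]                   # bwd[k] = gamma_out at level k
    return [np.outer(bwd[l + 1], fwd[l]) for l in range(len(weights))]

def path_sgd_step(weights, grads, lr):
    """One step of update (6): w_e <- w_e - lr / kappa_e * dL/dw_e."""
    ks = path_sgd_scales(weights)
    return [W - lr * g / k for W, g, k in zip(weights, grads, ks)]

rng = np.random.default_rng(3)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))
k1, k2 = path_sgd_scales([W1, W2])
```

For the two-layer case, $\kappa$ for a first-layer weight into hidden unit $h$ is the sum of squared outgoing weights of $h$, and vice versa for the second layer, which is what the assertions below check.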
4.3 Extending to Networks with Shared Weights
When the network has shared weights, the path-regularizer is a function of the parameters $\theta$, and the quadratic approximation should therefore also be taken with respect to the iterate $\theta^{(t)}$ instead of $w^{(t)}$, which results in the following update rule:

(7)  $\theta^{(t+1)} = \arg\min_\theta \; \eta \left\langle \nabla L(\theta^{(t)}),\, \theta \right\rangle + \dfrac{1}{2} \big\|\theta - \theta^{(t)}\big\|^2_{\hat\gamma},$

where $\|\theta - \theta'\|^2_{\hat\gamma} = \sum_{i=1}^m \frac{1}{2} \frac{\partial^2 \gamma^2_{\mathrm{net}}}{\partial \theta_i^2}(\theta^{(t)})\, (\theta_i - \theta'_i)^2$. Solving (7) gives the following update:

(8)  $\theta^{(t+1)}_i = \theta^{(t)}_i - \dfrac{\eta}{\kappa_i(\theta^{(t)})} \dfrac{\partial L}{\partial \theta_i}(\theta^{(t)}), \qquad \kappa_i(\theta) = \dfrac{1}{2} \dfrac{\partial^2 \gamma^2_{\mathrm{net}}}{\partial \theta_i^2}(\theta).$
The second-derivative terms decompose according to the path structure as follows (recall that $E_i = \{e : \pi(e) = i\}$ is the set of edges sharing the $i$-th parameter):

Lemma 1.
$\kappa_i(\theta) = \kappa^{(1)}_i(\theta) + \kappa^{(2)}_i(\theta)$, where

(9)  $\kappa^{(1)}_i(\theta) = \displaystyle\sum_{e \in E_i} \;\sum_{p \in P:\, e \in p} \;\prod_{e' \in p,\, e' \neq e} \theta^2_{\pi(e')},$

(10)  $\kappa^{(2)}_i(\theta) = 2\theta_i^2 \displaystyle\sum_{e, e' \in E_i,\, e \neq e'} \;\sum_{p \in P:\, e, e' \in p} \;\prod_{e'' \in p,\, e'' \neq e,\, e'' \neq e'} \theta^2_{\pi(e'')},$

and the outer sum in (10) is over ordered pairs of distinct edges. The second term measures the effect of interactions between edges corresponding to the same parameter (edges from the same $E_i$) on the same path from input to output. In particular, if no two edges along any path from an input unit to an output unit share the same parameter, then $\kappa^{(2)} = 0$. For example, for any feedforward or convolutional neural network, $\kappa^{(2)} = 0$. But for RNNs, there certainly are multiple edges sharing a single parameter on the same path, and so we can have $\kappa^{(2)}_i > 0$.
The above lemma gives us a precise update rule for approximate steepest descent with respect to the path-regularizer. The following theorem confirms that steepest descent with respect to this regularizer is also invariant to all feasible node-wise rescalings for networks with shared weights.
Theorem 2.
For any feedforward network with shared weights, the update (8) is invariant to all feasible node-wise rescalings. Moreover, the simpler update rule that uses only $\kappa^{(1)}_i(\theta)$ in place of $\kappa_i(\theta)$ is also invariant to all feasible node-wise rescalings.
4.4 Simple and Efficient Computations for RNNs
We show how to calculate $\kappa^{(1)}(\theta)$ and $\kappa^{(2)}(\theta)$ by considering a network with the same architecture but with squared weights:

Theorem 3.
For any network $N(G(V,E), \pi, \theta)$, consider $N(G(V,E), \pi, \nu)$ where $\nu_i = \theta_i^2$ for any $1 \le i \le m$. Define the function $\Gamma(\nu)$ to be the sum of outputs of this network on the all-ones input: $\Gamma(\nu) = \sum_{v \in V_{\mathrm{out}}} h_v(\mathbf{1})$. Then $\kappa^{(1)}$ and $\kappa^{(2)}$ can be calculated as follows:

(11)  $\kappa^{(1)}_i(\theta) = \dfrac{\partial \Gamma}{\partial \nu_i}(\nu), \qquad \kappa^{(2)}_i(\theta) = 2\nu_i\, \dfrac{\partial^2 \Gamma}{\partial \nu_i^2}(\nu).$
In the process of calculating the gradient $\nabla \Gamma$, we need to calculate $h_v$ and $\partial \Gamma / \partial h_v$ for every node $v$. Therefore, the only remaining term to calculate (besides $\kappa^{(1)}$) is $\kappa^{(2)}$.

Recall that $T$ is the length (maximum number of propagations through time) and $d$ is the number of layers in an RNN. Let $H$ be the number of hidden units in each layer and $B$ be the size of the mini-batch. Calculating the gradient of the loss at all points in the mini-batch (the standard work required for any mini-batch gradient approach) requires time $O(B \cdot T \cdot d \cdot H^2)$. In order to calculate $\kappa^{(1)}$, we need the gradient of a similar network at a single input, so the time complexity is just an additional $O(T \cdot d \cdot H^2)$. The second term $\kappa^{(2)}$ can also be calculated for RNNs in $O(T \cdot d \cdot H^3)$: for an RNN, $\kappa^{(2)}$ is nonzero only for recurrent weights, because only recurrent weights can be shared multiple times along a single input-output path, and it can be written and calculated in matrix form, where the only terms that require extra computation are powers of the squared recurrent matrices. Therefore, the ratios of the time complexity of calculating the first and second terms to that of the mini-batch gradient are $1/B$ and $H/B$ respectively. Calculating $\kappa^{(1)}$ is thus very cheap, with minimal per-mini-batch cost, while calculating $\kappa^{(2)}$ might be expensive for large networks. Beyond the low computational cost, calculating $\kappa^{(1)}$ is also very easy to implement, as it requires only taking the gradient with respect to a standard feedforward calculation in a network with slightly modified weights; with most deep learning libraries it can be implemented with only a few lines of code.
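To make Theorem 3 concrete, the sketch below computes $\Gamma(\nu)$ for a toy single-layer RNN and recovers $\kappa^{(1)} = \partial\Gamma/\partial\nu$ numerically by central finite differences; in practice a single reverse-mode (backpropagation) pass would replace the finite differences. All names and shapes are our own choices for illustration.

```python
import numpy as np

def gamma_rnn(nu_in, nu_rec, nu_out, T):
    """Gamma(nu): forward pass of the squared-weight single-layer RNN on the
    all-ones input, summed over all outputs.  With nonnegative weights nu the
    ReLU is inactive, so the pass is linear."""
    h = np.zeros(nu_rec.shape[0])
    ones = np.ones(nu_in.shape[1])
    total = 0.0
    for _ in range(T):
        h = nu_in @ ones + nu_rec @ h
        total += float((nu_out @ h).sum())
    return total

def kappa1(th_in, th_rec, th_out, T, eps=1e-6):
    """kappa^(1)_i = dGamma/dnu_i at nu = theta**2, here via central finite
    differences purely for illustration."""
    nus = [th_in ** 2, th_rec ** 2, th_out ** 2]
    out = []
    for k in range(3):
        g = np.zeros_like(nus[k])
        for idx in np.ndindex(g.shape):
            plus = [n.copy() for n in nus]
            minus = [n.copy() for n in nus]
            plus[k][idx] += eps
            minus[k][idx] -= eps
            g[idx] = (gamma_rnn(*plus, T) - gamma_rnn(*minus, T)) / (2 * eps)
        out.append(g)
    return out

rng = np.random.default_rng(4)
th_in = rng.standard_normal((3, 2))
th_rec = rng.standard_normal((3, 3))
th_out = rng.standard_normal((1, 3))
k_in, k_rec, k_out = kappa1(th_in, th_rec, th_out, T=1)
```

A quick sanity check: with length $T = 1$ the recurrent weights never act, so their $\kappa^{(1)}$ is zero, while the output-weight scales reduce to sums of squared input weights.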
5 Experiments
5.1 The Contribution of the Second Term
As we discussed in Section 4.4, the second term $\kappa^{(2)}$ in the update rule can be computationally expensive for large networks. In this section we investigate the significance of the second term and show that, at least in our experiments, its contribution is negligible. To compare the two terms $\kappa^{(1)}$ and $\kappa^{(2)}$, we train a single-layer RNN for the task of word-level language modeling on the Penn Treebank (PTB) Corpus [13]. Fig. 2 compares the performance of SGD vs. Path-SGD with and without $\kappa^{(2)}$. We clearly see that both versions of Path-SGD perform very similarly, and both significantly outperform SGD. The results in Fig. 2 suggest that the first term is dominant and the second term can therefore be ignored.
To better understand the relative importance of the two terms, we compared the ratio of the norms of $\kappa^{(2)}$ and $\kappa^{(1)}$ for different RNN lengths and numbers of hidden units. The table in Fig. 2 shows that the contribution of the second term is larger when the network has fewer hidden units and the length of the RNN is larger ($H$ is small and $T$ is large). In many cases, however, the first term has a much bigger contribution to the update step, and the second term can be safely ignored. Therefore, in the rest of our experiments, we calculate the Path-SGD updates using only the first term $\kappa^{(1)}$.
5.2 Synthetic Problems with Longterm Dependencies
Training Recurrent Neural Networks is known to be hard when modeling long-term dependencies, due to the vanishing/exploding gradient problem [6, 2]. In this section, we consider synthetic problems that are specifically designed to test the ability of a model to capture long-term dependency structure: the addition problem and the sequential MNIST problem.
Addition problem:
The addition problem was introduced in [7]. Here, each input consists of two sequences of length $T$: one contains numbers sampled from a uniform distribution, and the other serves as a mask, filled with zeros except for two entries. These two entries indicate which of the two numbers in the first sequence to add, and the task is to output the result of this addition.
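A toy generator for this dataset might look as follows. We assume the values are drawn uniformly from $[0, 1]$ (the range is not restated above), and all names are our own:

```python
import numpy as np

def addition_batch(batch, T, rng):
    """Generate a batch for the addition problem: each time step carries a
    (value, mask) pair; the target is the sum of the two masked values.
    Assumes values uniform in [0, 1]."""
    values = rng.uniform(0.0, 1.0, (batch, T))
    mask = np.zeros((batch, T))
    for b in range(batch):
        i, j = rng.choice(T, size=2, replace=False)  # two distinct positions
        mask[b, i] = mask[b, j] = 1.0
    x = np.stack([values, mask], axis=-1)  # shape (batch, T, 2)
    y = (values * mask).sum(axis=1)        # shape (batch,)
    return x, y

x, y = addition_batch(4, 10, np.random.default_rng(5))
```

Because the two marked positions can be anywhere in the sequence, the target depends on inputs that may be arbitrarily far apart, which is what makes large $T$ hard.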
Sequential MNIST:
In sequential MNIST, each digit image is reshaped into a long sequence of pixels, turning the digit classification task into a sequence classification task with long-term dependencies [12, 1].
For both tasks, we closely follow the experimental protocol in [12]. We train a single-layer RNN consisting of 100 hidden units with Path-SGD, referred to as RNN-Path. As our baseline model, we also train an RNN of the same size with identity initialization, as proposed in [12], using SGD; we refer to it as IRNN. We performed a grid search over learning rates for both our model and the baseline. Non-recurrent weights were initialized from a uniform distribution. Similar to [1], we found the IRNN to be fairly unstable (with SGD optimization typically diverging). Therefore, for the IRNN we ran 10 different initializations and picked the one that did not explode to report its performance.
In our first experiment, we evaluate Path-SGD on the addition problem. The results for increasing sequence lengths are shown in Fig. 3. We note that this problem becomes much harder as the length increases, because the dependency between the output (the sum of two numbers) and the corresponding inputs becomes more distant. We also compare RNN-Path with previously published results, including the identity-initialized RNN [12] (IRNN), the unitary RNN [1] (uRNN), and the npRNN introduced by [22] (the original paper does not include any result for length 750, so we implemented the npRNN for comparison; however, in our implementation the npRNN was not able to learn even sequences of length 200, so we report ">2" for length 750). Table 3 shows the effectiveness of using Path-SGD. Perhaps more surprisingly, with the help of path-normalization, a simple RNN with identity initialization is able to achieve 0% error on sequences of length 750, whereas all the other methods, including LSTMs, fail. This shows that Path-SGD may help stabilize training and alleviate the gradient problem, so as to perform well on longer sequences. We next tried to model sequences of length 1000, but we found that for such very long sequences RNNs, even with Path-SGD, fail to learn.
Next, we evaluate Path-SGD on the sequential MNIST problem. Table 3, right column, reports test error rates achieved by RNN-Path compared to previously published results. Clearly, using Path-SGD helps RNNs achieve better generalization. In many cases, RNN-Path outperforms the other RNN methods (except for LSTMs), even on this long-term dependency problem.
5.3 Language Modeling Tasks
In this section we evaluate Path-SGD on language modeling tasks. We consider two datasets, the character-level Penn Treebank (PTBc) and text8 (http://mattmahoney.net/dc/textdata). PTBc: We performed experiments on a tokenized Penn Treebank Corpus, following the experimental protocol of [11]. The training, validation and test data contain 5017k, 393k and 442k characters respectively. The alphabet size is 50, and each training sequence is of length 50. text8: The text8 dataset contains 100M characters from Wikipedia with an alphabet size of 27. We follow the data partition of [14], where each training sequence has a length of 180. Performance is evaluated using the bits-per-character (BPC) metric, which is the base-2 logarithm of the perplexity.
Similar to the experiments on the synthetic datasets, for both tasks we train a single-layer RNN consisting of 2048 hidden units with Path-SGD (RNN-Path). Due to the large dimension of the hidden space, SGD can take a fairly long time to converge. Instead, we use the Adam optimizer [8] to help speed up training, where we simply use the Path-SGD-scaled gradient as the input to the Adam optimizer.
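This combination can be sketched as follows: the gradient is divided by the Path-SGD scale $\kappa$ before being fed to an otherwise standard Adam update. This is our own minimal reading of the setup, not the authors' code, and the Adam implementation below is a bare-bones one for a single parameter array.

```python
import numpy as np

class Adam:
    """Minimal Adam [8] for one parameter array.  Path-SGD is combined with
    it by rescaling each gradient by 1/kappa before the moment updates."""
    def __init__(self, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        self.lr, self.b1, self.b2, self.eps = lr, b1, b2, eps
        self.m = None
        self.v = None
        self.t = 0

    def step(self, w, grad, kappa=None):
        if kappa is not None:
            grad = grad / kappa  # path-normalized gradient
        if self.m is None:
            self.m = np.zeros_like(w)
            self.v = np.zeros_like(w)
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * grad
        self.v = self.b2 * self.v + (1 - self.b2) * grad ** 2
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        return w - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

# toy usage: minimize ||w||^2 with a fixed, made-up per-parameter kappa
w = np.full(3, 1.0)
opt = Adam(lr=0.05)
kappa = np.array([1.0, 2.0, 4.0])  # hypothetical path scales
for _ in range(400):
    w = opt.step(w, 2.0 * w, kappa=kappa)
```

Since Adam is itself scale-adaptive per coordinate, feeding it the path-normalized gradient changes the relative weighting of coordinates over time rather than just the raw step size.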
We also train three additional baseline models: a ReLU RNN with 2048 hidden units, a tanh RNN with 2048 hidden units, and an LSTM with 1024 hidden units, all trained using Adam. We performed a grid search over learning rates for all of our models. For ReLU RNNs, we initialize the recurrent matrices and the non-recurrent weights from uniform distributions. For LSTMs, we use orthogonal initialization [21] for the recurrent matrices and a uniform initialization for non-recurrent weights. The results are summarized in Table 3.
We also compare our results to an RNN with a hidden-activation regularizer [11] (TRec), multiplicative RNNs trained by Hessian-free methods [14] (HF-MRNN), and an RNN with a smooth version of the ReLU [20]. Table 3 shows that path-normalization is able to outperform RNN-ReLU and RNN-tanh, while at the same time shortening the performance gap between the plain RNN and other, more complicated models (e.g. closing the gap to the LSTM by 57% on PTB and 54% on text8). This demonstrates the efficacy of path-normalized optimization for training RNNs with ReLU activations.
6 Conclusion
We investigated the geometry of RNNs within the broader class of feedforward networks with shared weights, and showed how understanding this geometry can lead to significant improvements on different learning tasks. By designing an optimization algorithm with a geometry well-suited to RNNs, we closed over half of the performance gap between vanilla RNNs and LSTMs. This is particularly useful for applications that seek compressed models with fast prediction time and minimal storage, and it is also a step toward bridging the gap between LSTMs and plain RNNs.
Acknowledgments
This research was supported in part by an NSF RIAF award and by Intel ICRI-CI. We thank Saizheng Zhang for sharing a base code for RNNs.
References
 [1] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. arXiv preprint arXiv:1511.06464, 2015.
 [2] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning longterm dependencies with gradient descent is difficult. Neural Networks, IEEE Transactions on, 5(2):157–166, 1994.

 [3] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014.
 [4] Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the International Conference on Machine Learning (ICML), pages 1764–1772, 2014.
 [5] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In Proceedings of the International Conference on Learning Representations, 2016.
 [6] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02), 1998.
 [7] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8), 1997.
 [8] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceeding of the International Conference on Learning Representations, 2015.
 [9] Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. Unifying visualsemantic embeddings with multimodal neural language models. Transactions of the Association for Computational Linguistics, 2015.
 [10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems (NIPS), pages 1097–1105, 2012.
 [11] David Krueger and Roland Memisevic. Regularizing RNNs by stabilizing activations. In Proceeding of the International Conference on Learning Representations, 2016.
 [12] Quoc V Le, Navdeep Jaitly, and Geoffrey E Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.
 [13] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
 [14] Tomás Mikolov, Ilya Sutskever, Anoop Deoras, Hai-Son Le, Stefan Kombrink, and J Cernocky. Subword language modeling with neural networks. (http://www.fit.vutbr.cz/~imikolov/rnnlm/char.pdf), 2012.
 [15] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the International Conference on Machine Learning (ICML), pages 807–814, 2010.
 [16] Behnam Neyshabur, Ruslan Salakhutdinov, and Nathan Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
 [17] Behnam Neyshabur, Ryota Tomioka, Ruslan Salakhutdinov, and Nathan Srebro. Datadependent path normalization in neural networks. In the International Conference on Learning Representations, 2016.
 [18] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Proceedings of the 28th Conference on Learning Theory (COLT), 2015.
 [19] Yann Ollivier. Riemannian metrics for neural networks ii: recurrent networks and learning symbolic data sequences. Information and Inference, page iav007, 2015.
 [20] Marius Pachitariu and Maneesh Sahani. Regularization and nonlinearities for neural language models: when are they needed? arXiv preprint arXiv:1301.5650, 2013.
 [21] Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations, 2014.
 [22] Sachin S. Talathi and Aniket Vartak. Improving performance of recurrent neural network with ReLU nonlinearity. In the International Conference on Learning Representations workshop track, 2014.
 [23] Saizheng Zhang, Yuhuai Wu, Tong Che, Zhouhan Lin, Roland Memisevic, Ruslan Salakhutdinov, and Yoshua Bengio. Architectural complexity measures of recurrent neural networks. arXiv preprint arXiv:1602.08210, 2016.
Appendix A Proofs
A.1 Proof of Theorem 1
We first show that any RNN is invariant to the transformation $\mathcal{T}$ by induction on layers and time steps. More specifically, we prove that the hidden values computed under the transformed weights equal the original hidden values rescaled by the corresponding per-unit factors. The statement is clearly true for the input layer, since the inputs (and their scaling factors) are unchanged.
Next, we show that at the first time step, if the statement is true for layer $i - 1$, then it is also true for layer $i$. We then prove the statement for later time steps: assuming it holds for the previous time step and for the earlier layers of the current time step, the pre-activation of layer $i$ is rescaled by exactly the layer's per-unit factors, and positive homogeneity of the ReLU carries the rescaling through the activation. Finally, since the output weights absorb the inverse scaling of the last hidden layer, the output is invariant at every time step.
We now show that any feasible node-wise rescaling can be presented in the form of $\mathcal{T}$. Recall that node-wise rescaling invariances for a general feedforward network can be written as multiplying the incoming weights of each node $v$ by some $\alpha_v > 0$ and its outgoing weights by $1/\alpha_v$, where $\alpha_v = 1$ for input and output nodes. An RNN of length one has no weight sharing among these nodes, so each hidden node can be rescaled independently. For longer RNNs, however, there is no invariance that is not already counted: by fixing the values of $\alpha_v$ for the nodes in the first time step, feasibility forces the values of $\alpha_v$ for the corresponding nodes in all other time steps to be tied to them. Therefore, all feasible invariances are included and can be presented in the form of $\mathcal{T}$.
A.2 Proof of Lemma 1
We prove the statement by directly calculating the second derivative of the path-regularizer with respect to each parameter. Writing $\gamma^2_{\mathrm{net}}(\theta) = \sum_{p \in P} \prod_{e \in p} \theta^2_{\pi(e)}$, the parameter $\theta_i$ appears in each path term raised to twice the number of edges of the path that lie in $E_i$. Taking the second derivative with respect to $\theta_i$ then gives both terms after a few calculations: the contributions of each single shared edge yield $\kappa^{(1)}$, while the cross terms between two distinct edges of $E_i$ on the same path yield $\kappa^{(2)}$.
A.3 Proof of Theorem 2
Node-wise rescaling invariances for a feedforward network can be written as a transformation $\mathcal{T}$ that multiplies each edge weight $w_{(u \to v)}$ by $\alpha_v / \alpha_u$, for some positive scalars $\alpha_v$ with $\alpha_v = 1$ for input and output nodes. Any feasible invariance for a network with shared weights can also be written in the same form; the only difference is that some of the $\alpha_v$ are now tied to each other in such a way that shared weights have the same value after the transformation. First, note that since the network is invariant to the transformation, the following statement holds by an induction similar to that of Theorem 1 but in the backward direction:

(12)  $\dfrac{\partial f_{N(\mathcal{T}(w))}}{\partial \tilde h_v} = \dfrac{1}{\alpha_v} \dfrac{\partial f_{N(w)}}{\partial h_v}$

for any node $v$, where $\tilde h$ denotes values computed under the transformed weights. Furthermore, by the proof of Theorem 1 we have that for any node $u$, $\tilde h_u = \alpha_u h_u$. Therefore, for any edge $e = (u \to v)$,

(13)  $\dfrac{\partial L}{\partial \mathcal{T}(w)_e} = \dfrac{\alpha_u}{\alpha_v} \dfrac{\partial L}{\partial w_e}.$

In order to prove the theorem statement, it is enough to show that for any edge $e = (u \to v)$, $\kappa_e(\mathcal{T}(w)) = (\alpha_u/\alpha_v)^2\, \kappa_e(w)$, because this property gives us the following update:

$\mathcal{T}(w)_e - \dfrac{\eta}{\kappa_e(\mathcal{T}(w))} \dfrac{\partial L}{\partial \mathcal{T}(w)_e} = \dfrac{\alpha_v}{\alpha_u} w_e - \dfrac{\alpha_v^2}{\alpha_u^2} \cdot \dfrac{\eta}{\kappa_e(w)} \cdot \dfrac{\alpha_u}{\alpha_v} \dfrac{\partial L}{\partial w_e} = \dfrac{\alpha_v}{\alpha_u} \left( w_e - \dfrac{\eta}{\kappa_e(w)} \dfrac{\partial L}{\partial w_e} \right),$

i.e. the updated weights are related by the same rescaling, so the updated networks compute the same function. Therefore, it remains to show that $\kappa_e(\mathcal{T}(w)) = (\alpha_u/\alpha_v)^2\, \kappa_e(w)$ for any edge $e$. We show that this is indeed true for both terms $\kappa^{(1)}$ and $\kappa^{(2)}$ separately.
We first prove the statement for $\kappa^{(1)}$. Consider any path $p \in P$. By an inductive argument along the path, the product of squared weights along the whole path is invariant to the transformation, since the factors $\alpha^2$ telescope and the endpoint factors equal one. Therefore, for any edge $e = (u \to v)$ and any path $p \ni e$, the product of squared weights along $p$ excluding $e$ is multiplied by exactly $(\alpha_u/\alpha_v)^2$ under the transformation. Summing over all such paths and all edges completes the proof for $\kappa^{(1)}$. Similarly for $\kappa^{(2)}$: considering any two distinct edges $e, e' \in E_i$ and any path $p$ containing both, the product of squared weights along $p$ excluding $e$ and $e'$, together with the prefactor $2\theta_i^2$, is rescaled by the product of the two edges' factors, which combine (using feasibility, which forces both edges to be rescaled identically) to the required factor $(\alpha_u/\alpha_v)^2$.