Over the past ten years, neural networks with rectified linear hidden units (ReLU) hahnloser2000digital as activation functions have demonstrated their power in many important applications, such as information systems cheng2016wide ; wang2017deep , image classification he2016deep ; huang2017densely , and text understanding vaswani2017attention . These networks are usually trained with Stochastic Gradient Descent (SGD), where the gradient of the loss function with respect to the weights can be efficiently computed via the back-propagation method rumelhart1986learning .
Recent studies neyshabur2015path ; lecun2015deep show that ReLU networks have the positively scale-invariant property, i.e., if the incoming weights of a hidden node with ReLU activation are multiplied by a positive constant $c$ and the outgoing weights are divided by $c$, the neural network with the new weights will generate exactly the same output as the old one for an arbitrary input. Conventional SGD optimizes ReLU neural networks in weight space. However, the weight vector is clearly not positively scale-invariant. This mismatch may lead to problems during the optimization process neyshabur2015path .
Then, a natural question is: can we construct a new vector space that is positively scale-invariant and sufficient to represent ReLU neural networks, so as to better facilitate the optimization process? In this paper, we provide a positive answer to this question.
We investigate the positively scale-invariant space that sufficiently represents ReLU neural networks in the following four steps. Firstly, we define the positive scaling operators and show that they form a transformation group (denoted as $\mathcal{G}$). The transformation group induces an equivalence relation called positive scaling equivalence. Then, we find that the values of the paths are invariant under positive scaling operators. Furthermore, we prove that two weight vectors are positively scale-equivalent if and only if the values of the paths in one neural network equal those in the other neural network, given that the signs of some weights are unchanged. That is to say, the values of all the paths can sufficiently represent the ReLU neural networks. After that, we define a generalized addition as the product operator and a generalized multiplication as the power operator. We show that the path vectors are generalized linearly dependent. (Footnote 1: A path vector is represented by one element in $\{0,1\}^m$, where $m$ is the number of weights. Please check the details in Section 2.2.) We define a maximal group of paths which are generalized linearly independent as basis paths, which corresponds to a basis of the structure matrix constituted by the path vectors. Thus, the values of the basis paths are also positively scale-invariant and can sufficiently represent the ReLU neural networks. We denote the linear span of the values of basis paths as $\mathcal{G}$-space. In addition, we prove that the dimension of $\mathcal{G}$-space is $H$ smaller than that of the weight space, where $H$ is the total number of hidden units in a multi-layer perceptron (MLP) or feature maps in a convolutional neural network (CNN).
To sum up, we find that $\mathcal{G}$-space, constituted by the values of the basis paths, is positively scale-invariant, can sufficiently represent the ReLU neural networks, and has a smaller dimension than the vector space of weights.
Therefore, we propose to optimize ReLU neural networks in their positively scale-invariant space, i.e., $\mathcal{G}$-space. We design a novel stochastic gradient descent algorithm in $\mathcal{G}$-space (abbreviated as $\mathcal{G}$-SGD) to optimize ReLU neural networks utilizing the gradient with respect to the values of the basis paths. First, we design the skeleton method to construct one group of basis paths. Then, we develop the inverse-chain-rule and weight-allocation methods to efficiently compute the gradient w.r.t. the values of the basis paths by leveraging the back-propagation method. Please note that with these techniques, $\mathcal{G}$-SGD incurs very little additional computational overhead in comparison with conventional SGD.
We conduct experiments to show the effectiveness of $\mathcal{G}$-SGD. First, we evaluate $\mathcal{G}$-SGD on training deep convolutional networks on benchmark datasets and demonstrate that $\mathcal{G}$-SGD achieves clearly better performance than baseline optimization algorithms. Second, we empirically test the performance of $\mathcal{G}$-SGD with different degrees of positive scale-invariance. The experimental results show that the higher the positive scale-invariance is, the larger the performance improvement of $\mathcal{G}$-SGD over SGD. This is consistent with the claim that positive scale-invariance in weight space negatively influences the optimization and that our proposed $\mathcal{G}$-SGD algorithm can effectively solve this problem.
2 Background
2.1 Related Works
There have been some prior works that study the positively scale-invariant property of ReLU networks and design algorithms that are positively scale-invariant. For example, badrinarayanan2015symmetry notice the positive scale-invariance in ReLU networks and, inspired by this, design algorithms that normalize gradients by the layer-wise weight norm. du2018algorithmic study the gradient flow in MLP or CNN models with linear, ReLU or Leaky ReLU activation, and prove that the differences between the squared norms of the weights of different layers remain invariant under gradient descent with infinitesimal step size, so that the layers are automatically balanced. In our work, we do not care whether the models are balanced or not. Besides, many other optimization algorithms also have the positively scale-invariant property, such as Newton's method and natural gradient descent. The most related work is Path-SGD neyshabur2015path , which also considers the geometry inspired by the path norm. This work is different from ours: 1) they regularize the gradient in weight space by the path norm, while we optimize the loss function directly in a positively scale-invariant space; 2) they do not consider the dependency between paths, and it is hard for them to compute the exact path-regularized gradients. Different from the previous works, we propose to directly optimize ReLU networks in their positively scale-invariant space, instead of optimizing in the weight space, which is not positively scale-invariant. To the best of our knowledge, this is the first time that this mismatch is solved by a theoretical analysis together with an effective and efficient algorithm.
2.2 ReLU Neural Networks
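Let $N_w: \mathcal{X} \to \mathcal{Y}$ denote an $L$-layer multi-layer perceptron (MLP) with weight $w \in \mathcal{W} \subset \mathbb{R}^m$, the input space $\mathcal{X} \subset \mathbb{R}^d$ and the output space $\mathcal{Y} \subset \mathbb{R}^K$. In the $l$-th layer ($l = 0, \dots, L$), there are $h_l$ nodes. It is clear that $h_0 = d$ and $h_L = K$. We denote the $i_l$-th node and its value as $O^l_{i_l}$ and $o^l_{i_l}$, respectively. We use $w^l$ to denote the weight matrix between layer $l-1$ and layer $l$, and use $w^l(i_{l-1}, i_l)$ to denote the weight connecting nodes $O^{l-1}_{i_{l-1}}$ and $O^l_{i_l}$. The values of the nodes are propagated as $o^l = \sigma((w^l)^T o^{l-1})$, where $\sigma(\cdot) = \max(\cdot, 0)$ is the ReLU activation function. We use $(i_0, \dots, i_L)$ to denote the path starting from input feature node $O^0_{i_0}$ to output node $O^L_{i_L}$ passing through hidden nodes $O^1_{i_1}, \dots, O^{L-1}_{i_{L-1}}$. We can calculate the $k$-th output by paths (balduzzi2017shattered ; balduzzi2015kickback ), i.e.,
$$N^k_w(x) = \sum_{(i_0, \dots, i_L = k)} \prod_{l=1}^{L} w^l(i_{l-1}, i_l) \cdot \prod_{l=1}^{L-1} \mathbb{I}\big(o^l_{i_l}(w; x) > 0\big) \cdot x_{i_0}, \quad (1)$$
where $w^l(i_{l-1}, i_l)$ is the weight connecting nodes $O^{l-1}_{i_{l-1}}$ and $O^l_{i_l}$. (Footnote 2: The paths across the bias node can also be described in the same way. For simplicity, we omit the bias term.) We denote the value of path $(i_0, \dots, i_L)$ as $v_{(i_0, \dots, i_L)}(w) := \prod_{l=1}^{L} w^l(i_{l-1}, i_l)$ and the activation status as $a_{(i_0, \dots, i_L)}(w; x) := \prod_{l=1}^{L-1} \mathbb{I}(o^l_{i_l}(w; x) > 0)$.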
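The path-wise expression of the output can be checked numerically on a small network. The following is a minimal NumPy sketch, not the authors' implementation: it assumes a bias-free MLP with ReLU on the hidden layers and a linear output layer, and the helper names `forward_with_activations` and `output_by_paths` are introduced here purely for illustration.

```python
import itertools
import numpy as np

def forward_with_activations(weights, x):
    """Forward pass of a bias-free ReLU MLP; also returns hidden activation masks."""
    o, acts = x, []
    for W in weights[:-1]:
        o = np.maximum(W.T @ o, 0.0)
        acts.append(o > 0)
    return weights[-1].T @ o, acts

def output_by_paths(weights, x, k):
    """k-th output as a sum over paths of (path value) * (activation status) * x_{i_0}."""
    _, acts = forward_with_activations(weights, x)
    sizes = [weights[0].shape[0]] + [W.shape[1] for W in weights]
    total = 0.0
    # Enumerate (i_0, ..., i_{L-1}); the last index is fixed to the output node k.
    for idx in itertools.product(*[range(h) for h in sizes[:-1]]):
        path = idx + (k,)
        value = np.prod([W[path[l], path[l + 1]] for l, W in enumerate(weights)])
        active = all(acts[l][path[l + 1]] for l in range(len(weights) - 1))
        total += value * active * x[path[0]]
    return total

rng = np.random.default_rng(0)
weights = [rng.normal(size=s) for s in [(3, 4), (4, 3), (3, 2)]]
x = rng.normal(size=3)
direct, _ = forward_with_activations(weights, x)
by_paths = np.array([output_by_paths(weights, x, k) for k in range(2)])
print(np.allclose(direct, by_paths))
```

Enumerating all paths is exponential in depth, so this is only meant as a sanity check of the formula on toy sizes.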
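We can also regard the network structure as a directed graph $(\mathcal{O}, E)$, where $\mathcal{O} = \{O_1, \dots, O_{H+d+K}\}$ is the set of nodes, $H$ denotes the number of hidden nodes, and $E = \{e_{ij}\}$ denotes the set of edges in the network, where $e_{ij}$ denotes the edge pointing to $O_j$ from node $O_i$. We use $w(e)$ to denote the weight on edge $e$. If $|E| = m$, the weights compose a vector $w = (w_1, \dots, w_m)^T$. We define a path as a vector $p = (p_1, \dots, p_m)^T$: if the edge $e_i$ is contained in path $p$, $p_i = 1$; otherwise $p_i = 0$. Because a path crosses $L$ edges for an $L$-layer MLP, there are $L$ elements with value $1$ and the other elements have value $0$. Using these notations, the value of path $p$ can be calculated as $v_p(w) = \prod_{i=1}^{m} w_i^{p_i}$ and the activation status of path $p$ can be calculated as $a_p(x; w) = \prod_{j: O_j \in p} \mathbb{I}(o_j(x; w) > 0)$, where the product runs over the hidden nodes on $p$. We denote the set composed by all paths as $\mathcal{P}$ and the set composed by paths which contain an edge connecting the $k$-th output node as $\mathcal{P}_k$. Thus, the output can be computed as follows:
$$N^k_w(x) = \sum_{p \in \mathcal{P}_k} v_p(w) \cdot a_p(x; w) \cdot x_p, \quad (2)$$
where $x_p$ denotes the input feature at which path $p$ starts.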
3 Positively Scale-Invariant Space of ReLU Networks
In this section, we first define positive scaling transformation group and the equivalence class induced by this group. Then we study the invariant space under positive scaling transformation group of ReLU networks and study its dimension.
3.1 Positive Scaling Transformation Group
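We formally define the positive scaling operator. We first define a node positive scaling operator $g_{c,O}$ with constant $c > 0$ and one hidden node $O$ as
$$g_{c,O}(w) = \tilde{w},$$
where $\tilde{w}(e) = c \cdot w(e)$ for every incoming edge $e$ of $O$; $\tilde{w}(e) = \frac{1}{c} \cdot w(e)$ for every outgoing edge $e$ of $O$; and the values of the other elements of $\tilde{w}$ are the same as those of $w$.
Definition 3.1 (positive scaling operator) Suppose that $\{O_1, \dots, O_H\}$ is the set of all the hidden nodes in the network, where $H$ denotes the number of hidden nodes. A positive scaling operator $g_{c_1, \dots, c_H}: \mathcal{W} \to \mathcal{W}$ with $c_1, \dots, c_H \in \mathbb{R}^+$ is defined as
$$g_{c_1, \dots, c_H}(w) := g_{c_1, O_1} \circ g_{c_2, O_2} \circ \cdots \circ g_{c_H, O_H}(w),$$
where $\circ$ denotes function composition.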
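We then collect all the $g_{c_1, \dots, c_H}$ together to form a set $\mathcal{G} := \{g_{c_1, \dots, c_H}: c_1, \dots, c_H \in \mathbb{R}^+\}$. It is easy to check that $\mathcal{G}$ together with the operation "$\circ$" is a group, which is called the positive scaling transformation group, and we call the group action of $\mathcal{G}$ on $\mathcal{W}$ the $\mathcal{G}$-action. (Please refer to Section 8 in the Appendix.)
Clearly, if there exists an operator $g \in \mathcal{G}$ such that $w' = g(w)$, ReLU networks $N_w$ and $N_{w'}$ will generate the same output for any fixed input $x$. We define the positive scaling equivalence induced by the $\mathcal{G}$-action.
Definition 3.2 Consider two ReLU networks with weights $w, w' \in \mathcal{W}$ and the positive scaling transformation group $\mathcal{G}$. We say $w$ and $w'$ are positively scale-equivalent if $\exists g \in \mathcal{G}$ such that $w' = g(w)$, denoted as $w \sim_g w'$.
Given the $\mathcal{G}$-action on $\mathcal{W}$, the equivalence relation "$\sim_g$" partitions $\mathcal{W}$ into $\mathcal{G}$-equivalent classes. The following theorem shows that the sufficient and necessary condition for ReLU networks to be in the same equivalent class is that they have the same values and activation statuses of paths.
Theorem 3.3 Consider two ReLU neural networks with weights $w, w' \in \mathcal{W}$. We have that $w \sim_g w'$ iff for every path $p$ and any fixed input $x \in \mathcal{X}$, we have $v_p(w) = v_p(w')$ and $a_p(x; w) = a_p(x; w')$.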
Invariant variables for a group action are important and widely studied in group theory and geometry. We say a function $f: \mathcal{W} \to \mathbb{R}$ is an invariant variable of the $\mathcal{G}$-action if $f(g(w)) = f(w)$ for all $w \in \mathcal{W}$ and $g \in \mathcal{G}$. Based on Theorem 3.3, a direct corollary is that the values and activation statuses of paths are invariant variables under the $\mathcal{G}$-action. Considering that 1) the values of paths are $\mathcal{G}$-invariant variables while the weights are not; 2) the values of paths together with the activation statuses determine a positively scale-equivalent class and are sufficient to determine the loss, we propose to optimize the values of paths instead of the weights.
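The invariance discussed in this subsection can be illustrated with a few lines of NumPy. This is a minimal sketch under stated assumptions (bias-free MLP, ReLU on hidden layers, linear output layer); the helpers `forward`, `scale_node` and `path_value` are hypothetical names introduced for this illustration only.

```python
import numpy as np

def forward(weights, x):
    """Bias-free MLP: ReLU on hidden layers, linear output layer."""
    o = x
    for W in weights[:-1]:
        o = np.maximum(W.T @ o, 0.0)
    return weights[-1].T @ o

def scale_node(weights, layer, node, c):
    """Node positive scaling: incoming weights times c, outgoing weights divided by c.

    The hidden node is column `node` of weights[layer] and row `node` of weights[layer + 1].
    """
    assert c > 0
    scaled = [W.copy() for W in weights]
    scaled[layer][:, node] *= c
    scaled[layer + 1][node, :] /= c
    return scaled

def path_value(weights, path):
    """Product of the weights along path = (i_0, ..., i_L)."""
    return float(np.prod([W[path[l], path[l + 1]] for l, W in enumerate(weights)]))

rng = np.random.default_rng(0)
weights = [rng.normal(size=s) for s in [(3, 4), (4, 4), (4, 2)]]
x = rng.normal(size=3)
scaled = scale_node(weights, layer=1, node=2, c=7.5)
path = (0, 1, 2, 0)  # a path passing through the rescaled hidden node

same_output = np.allclose(forward(weights, x), forward(scaled, x))
same_path_value = np.isclose(path_value(weights, path), path_value(scaled, path))
weights_changed = not np.allclose(weights[1], scaled[1])
print(same_output, same_path_value, weights_changed)
```

The weights differ after rescaling while the network output and the path value do not, which is the mismatch between weight space and the invariant variables that motivates optimizing path values.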
3.2 Positively Scale-Invariant Space and Its Dimension
Although Theorem 3.3 shows that the values of paths are invariant variables under the $\mathcal{G}$-action, we find that the paths have inner dependency and therefore their values and activation statuses are not independent (see Figure 1). In order to describe this dependency clearly, we introduce a new representation of paths.
First, we define the addition operation "" and scalar-multiplication operation "" in space as: , where . We call the space equipped with addition operation "" and scalar-multiplication operation "" is a vector space that we call it generalized linear space.
Next we will consider representation of path and operations on path in the generalized linear space. As described in Section 2.2, each path can be represented as a -dimensional vector that each element equals or . In the generalized linear space, considering that and are the additive identity and multiplicative identity in the field equipped with "" and "", we assign to if and assign to if . Then becomes a vector that each element equals or . Thus the value of path for can be calculated by the inner product of the weight vector and the path vector, i.e., , where .
Suppose that $\mathcal{P}$ is the set composed by all paths. We denote the matrix composed by all path vectors as $A$ and call it the structure matrix of ReLU networks. The size of $A$ is $m \times n$, where $n$ is the number of paths. We observe that the paths in matrix $A$ are not linearly independent, i.e., some paths can be linearly represented by other columns. For example, the paths in Figure 1 are linearly dependent, and the corresponding values of paths are related multiplicatively. Thus we need to study the rank of matrix $A$ and find a maximal linearly independent group of paths.
Theorem 3.4 If $A$ is the structure matrix for a ReLU network, then we have $\mathrm{rank}(A) = m - H$, where $m$ is the dimension of the weight vector $w$ and $H$ is the total number of hidden nodes for MLP models (or feature maps for CNN models) with ReLU.
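The rank statement can be checked numerically for a small fully connected MLP. The sketch below is only an illustration under our own conventions: weights are flattened layer by layer in row-major order, the layer sizes are arbitrary toy values, and `structure_matrix` is a hypothetical helper name.

```python
import itertools
import numpy as np

def structure_matrix(sizes):
    """Columns are the 0/1 path vectors of a fully connected MLP with the given layer sizes."""
    counts = [a * b for a, b in zip(sizes[:-1], sizes[1:])]
    # offsets[l] is the index of the first weight of layer l in the flattened weight vector.
    offsets = np.concatenate([[0], np.cumsum(counts)[:-1]])
    paths = list(itertools.product(*[range(h) for h in sizes]))
    A = np.zeros((sum(counts), len(paths)), dtype=int)
    for col, path in enumerate(paths):
        for l in range(len(sizes) - 1):
            A[offsets[l] + path[l] * sizes[l + 1] + path[l + 1], col] = 1
    return A

sizes = [3, 4, 4, 2]          # d = 3 inputs, two hidden layers of 4 nodes, K = 2 outputs
A = structure_matrix(sizes)
m = A.shape[0]                # number of weights
H = sum(sizes[1:-1])          # number of hidden nodes
rank = np.linalg.matrix_rank(A)
print(m, H, rank)
```

One way to see why the rank drops by exactly $H$: for each hidden node, the vector with $+1$ on its incoming edges and $-1$ on its outgoing edges is orthogonal to every path vector, and these $H$ vectors are linearly independent.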
Definition 3.5 (basis path) A set of paths $\{p^1, \dots, p^{m-H}\}$ which is a subset of $\mathcal{P}$ is called a set of basis paths if $p^1, \dots, p^{m-H}$ compose a maximal linearly independent group of column vectors in the structure matrix $A$.
We design an algorithm called the skeleton method to identify basis paths efficiently, which will be introduced in Section 4. For given values of basis paths and a given structure matrix, the values of $w$ cannot be determined unless the values of the free variables are fixed (lay1997linear ). Assuming that $w_1, \dots, w_H$ are selected to be the free variables, which are called free skeleton weights, we prove that the activation status can be uniquely determined by the values of paths if the signs of the free skeleton weights are fixed. Thus, we have the following theorem, which is a modification of Theorem 3.3.
Theorem 3.6 Consider two ReLU neural networks with weights $w, w' \in \mathcal{W}$ with the same signs of free skeleton weights. We have that $w \sim_g w'$ iff for every basis path $p$, we have $v_p(w) = v_p(w')$.
In the following context, we always suppose that $w$ and $w'$ have the same signs of free skeleton weights. According to Theorem 3.6 and the linear dependency between the values of paths, the loss function can be calculated using the values of basis paths if the signs of the free skeleton weights are fixed. We denote the loss at training instance $(x, y)$ as $l(v; x, y)$, where $v$ is the vector of values of basis paths, and propose to optimize the values of basis paths. Considering that the values of basis paths are obtained through the structure matrix $A$, the dimension of the space composed by the values of basis paths should be equal to $\mathrm{rank}(A) = m - H$. Then we define the following space.
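Definition 3.7 ($\mathcal{G}$-space) The $\mathcal{G}$-space is defined as $\mathcal{V} := \{v = (v_{p^1}(w), \dots, v_{p^{m-H}}(w)): w \in \mathcal{W}\}$.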
We call the space composed by the values of basis paths $\mathcal{G}$-space, which is invariant under the transformation group $\mathcal{G}$, i.e., it is composed by invariant variables under the $\mathcal{G}$-action. Besides, we measure the reduction of the dimension for the positively scale-invariant space using the invariant ratio $H/m$, so that we can empirically test how severely this equivalence influences the optimization in weight space.
4 Algorithm: $\mathcal{G}$-SGD
In this section, we introduce $\mathcal{G}$-SGD, which optimizes ReLU neural network models in the $\mathcal{G}$-space. This novel algorithm makes use of three methods, named Skeleton Method, Inverse-Chain-Rule (ICR) and Weight-Allocation (WA), respectively, to calculate the gradients w.r.t. the basis path vector and project the updates back to the weights efficiently (with little extra computation in comparison with standard SGD).
4.1 Skeleton Method
Before the calculation of gradients in $\mathcal{G}$-space, we first design an algorithm called the skeleton method to construct skeleton weights and basis paths. Due to space limitations, we only show the MLP case with the same number of hidden nodes $h$ in each hidden layer, and put the skeleton method for the general case in the Appendix.
1. Construct skeleton weights: for weight matrix $w^l$ ($l = 2, \dots, L-1$), we select the diagonal elements to be the skeleton weights. For weight matrix $w^1$, we select the element $w^1(i_1 \bmod d, i_1)$ for column $i_1$ with $i_1 = 1, \dots, h$ to be the skeleton weights. For weight matrix $w^L$, we select the element $w^L(i_{L-1}, i_{L-1} \bmod K)$ for row $i_{L-1}$ with $i_{L-1} = 1, \dots, h$ to be the skeleton weights. We call the remaining weights non-skeleton weights. Figure 2 gives an illustration of skeleton weights in an MLP network.
2. Construct basis paths: A path which contains at most one non-skeleton weight is a basis path. The proof of this statement can be found in the Appendix. For example, in Figure 2, the paths in red and the paths with only one black weight are basis paths. All other paths are non-basis paths.
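The two steps above can be sketched in code for the equal-width MLP case. This is an illustrative sketch, not the authors' Algorithm: the modulo rule used for the first and last weight matrices (0-based indices) is our assumption about how the skeleton weights are chosen there, and `is_skeleton`, `basis_paths` and `path_vector` are hypothetical helper names.

```python
import itertools
import numpy as np

def is_skeleton(l, i, j, sizes):
    """Assumed skeleton rule for the weight from node i (layer l) to node j (layer l + 1)."""
    last = len(sizes) - 2
    if l == 0:                      # first matrix: one incoming skeleton weight per hidden node
        return i == j % sizes[0]
    if l == last:                   # last matrix: one outgoing skeleton weight per hidden node
        return j == i % sizes[-1]
    return i == j                   # middle matrices: diagonal elements

def basis_paths(sizes):
    """Paths containing at most one non-skeleton weight."""
    n_layers = len(sizes) - 1
    selected = []
    for path in itertools.product(*[range(h) for h in sizes]):
        non_skeleton = sum(not is_skeleton(l, path[l], path[l + 1], sizes)
                           for l in range(n_layers))
        if non_skeleton <= 1:
            selected.append(path)
    return selected

def path_vector(path, sizes):
    """0/1 indicator of the edges on `path`, weights flattened layer by layer."""
    counts = [a * b for a, b in zip(sizes[:-1], sizes[1:])]
    offsets = [0] + list(np.cumsum(counts)[:-1])
    p = np.zeros(sum(counts), dtype=int)
    for l in range(len(sizes) - 1):
        p[offsets[l] + path[l] * sizes[l + 1] + path[l + 1]] = 1
    return p

sizes = [3, 4, 4, 2]                # d = 3, two hidden layers with h = 4, K = 2
basis = basis_paths(sizes)
m = sum(a * b for a, b in zip(sizes[:-1], sizes[1:]))
H = sum(sizes[1:-1])
B = np.stack([path_vector(p, sizes) for p in basis], axis=1)
print(len(basis), m - H, np.linalg.matrix_rank(B))
```

In this toy setting the number of selected paths equals $m - H$ and their path vectors have full column rank, which is what the definition of basis paths requires.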
Once we have basis paths, we can calculate the gradients w.r.t. the basis path vector $v$, and iteratively update the model by
$$v^{t+1} = v^{t} - \eta_t \nabla_{v}\, l(v^{t}; x^{t}, y^{t}), \quad (3)$$
where $(x^t, y^t)$ is the mini-batch training data in iteration $t$ and $\eta_t$ is the learning rate. For the calculation of the gradients w.r.t. the basis path vector, we introduce the inverse-chain-rule method in the next section.
4.2 Inverse-Chain-Rule (ICR) Method
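The basic idea of the Inverse-Chain-Rule method is to connect the gradients w.r.t. the weight vector and those w.r.t. the basis path vector by exploring the chain rule in both directions. That is, we have
$$\left(\frac{\partial l}{\partial w_1}, \dots, \frac{\partial l}{\partial w_m}\right) = \left(\frac{\partial l}{\partial v_{p^1}}, \dots, \frac{\partial l}{\partial v_{p^{m-H}}}\right) \cdot \begin{pmatrix} \frac{\partial v_{p^1}}{\partial w_1} & \cdots & \frac{\partial v_{p^1}}{\partial w_m} \\ \vdots & \ddots & \vdots \\ \frac{\partial v_{p^{m-H}}}{\partial w_1} & \cdots & \frac{\partial v_{p^{m-H}}}{\partial w_m} \end{pmatrix}. \quad (4)$$
We first compute the gradients w.r.t. the weights, i.e., $\frac{\partial l}{\partial w_i}$ for $i = 1, \dots, m$, using standard back-propagation. Then we solve Eqn.(4) to obtain the gradients w.r.t. the basis paths, i.e., $\frac{\partial l}{\partial v_{p^j}}$ for $j = 1, \dots, m-H$. We denote the matrix on the right side of Eqn.(4) as $G$. Given the following facts: (1) $\frac{\partial v_{p^j}}{\partial w_i} = v_{p^j} / w_i$ if the edge $e_i$ is contained in path $p^j$, and $0$ otherwise; (2) according to the skeleton method, each non-skeleton weight is contained in only one basis path, which means there is only one non-zero element in each column of $G$ corresponding to a non-skeleton weight, $G$ is sparse and thus the solution of Eqn.(4) is easy to obtain.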
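The relation between the two gradients can be illustrated on a toy network. The sketch below makes several simplifying assumptions that are ours, not the paper's: a single hidden layer and a single output (so that every path is a basis path), a squared loss, and a dense least-squares solve via `np.linalg.lstsq` in place of the sparse solution procedure described above.

```python
import numpy as np

rng = np.random.default_rng(1)
d, h = 3, 2
W1 = rng.normal(size=(d, h))        # input -> hidden weights
W2 = rng.normal(size=(h, 1))        # hidden -> single output weights
x, y = rng.normal(size=d), 0.5

# Forward pass and squared loss l = 0.5 * (out - y)^2.
pre = W1.T @ x
o = np.maximum(pre, 0.0)
out = float(W2[:, 0] @ o)
delta = out - y

# Standard back-propagation: gradient w.r.t. the flattened weight vector.
gW1 = np.outer(x, delta * W2[:, 0] * (pre > 0))
gW2 = (delta * o).reshape(h, 1)
g_w = np.concatenate([gW1.ravel(), gW2.ravel()])
w = np.concatenate([W1.ravel(), W2.ravel()])

# Jacobian of path values: J[p, e] = v_p / w_e if edge e lies on path p, else 0.
paths = [(i0, i1) for i0 in range(d) for i1 in range(h)]
J = np.zeros((len(paths), w.size))
g_v_direct = np.zeros(len(paths))
for r, (i0, i1) in enumerate(paths):
    v = W1[i0, i1] * W2[i1, 0]
    e1, e2 = i0 * h + i1, d * h + i1
    J[r, e1] = v / w[e1]
    J[r, e2] = v / w[e2]
    # Direct derivative of the loss w.r.t. v_p, from out = sum_p v_p * a_p * x_{i0}.
    g_v_direct[r] = delta * (pre[i1] > 0) * x[i0]

# Inverse chain rule: recover g_v from g_w by solving g_w = g_v @ J.
g_v, *_ = np.linalg.lstsq(J.T, g_w, rcond=None)
print(np.allclose(g_v, g_v_direct))
```

Here the gradient w.r.t. path values recovered from back-propagated weight gradients agrees with the one derived directly from the path-wise form of the output.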
4.3 Weight-Allocation (WA) Method
After the values of basis paths are updated by SGD, in the next iteration we employ ICR again by leveraging BP with the new weights. Thus, we need to project the updates on the basis paths back to updates of the weights.
We define the path-ratio of $p^j$ at iteration $t$ as $R^t(p^j) := v^{t}_{p^j} / v^{t-1}_{p^j}$ and the weight-ratio of $w_i$ at iteration $t$ as $r^t(w_i) := w^{t}_i / w^{t-1}_i$. Assume that we have already obtained the path-ratios for all the basis paths according to the ICR method and the SGD update rule. Then we want to project the path-ratios onto the weight-ratios. Because we have $v_{p^j} = \prod_{i=1}^{m} w_i^{p^j_i}$, the weight-ratios obtained after the projection should satisfy the following relationship:
$$R^t(p^j) = \prod_{i=1}^{m} r^t(w_i)^{A'_{ij}}, \quad j = 1, \dots, m-H,$$
where the matrix $A' = (p^1, \dots, p^{m-H})$. According to this relationship, we design the Weight-Allocation method to project the path-ratios to the weight-ratios as described below. Suppose that $w_1, \dots, w_H$ are the free skeleton weights. We first add $H$ elements with value $1$ at the beginning of the vector $(R^t(p^1), \dots, R^t(p^{m-H}))$ to get a new $m$-dimensional vector $\tilde{R}^t$. Then we append $H$ columns to matrix $A'$ to get a new matrix $\tilde{A} = [B, A']$ with $B = [I, 0]^T$, where $I$ is an $H \times H$ identity matrix and $0$ is an $H \times (m-H)$ zero matrix. Then it is easy to prove that $\mathrm{rank}(\tilde{A}) = m$, and we can calculate the inverse matrix of $\tilde{A}$ and get the weight-ratios as
$$r^t(w_i) = \prod_{j=1}^{m} \big(\tilde{R}^t_j\big)^{(\tilde{A}^{-1})_{ji}}, \quad i = 1, \dots, m.$$
After the projection, we can see that the weight-ratios of the free skeleton weights equal $1$, which means that the free skeleton weights are not changed during the training process. According to the skeleton method again, $\tilde{A}$ is a sparse matrix and it is easy to calculate its inverse.
Please note that by combining the ICR and WA methods, we can obtain the explicit update rule for $\mathcal{G}$-SGD, which is summarized in Algorithm 1. In this way, we obtain the correct gradients. The extra computational complexity of the ICR and WA methods is far lower than that of forward and backward propagation, and can therefore be neglected in practice.
5 Experiments
In this section, we first evaluate the performance of $\mathcal{G}$-SGD on training deep convolutional networks and verify whether our proposed algorithm outperforms other baseline methods. Then we investigate the influence of positive scaling invariance on the optimization in weight space, and examine whether optimization in $\mathcal{G}$-space brings a performance gain. Finally, we compare $\mathcal{G}$-SGD with Path-SGD neyshabur2015path and show the necessity of considering the dependency between paths. All experiments are averaged over 5 independent trials unless explicitly noted otherwise.
5.1 Deep Convolutional Network
In this section, we apply $\mathcal{G}$-SGD to image classification tasks and conduct experiments on CIFAR-10 and CIFAR-100 krizhevsky2009learning . In our experiments, we employ the original ResNet architecture described in he2016deep . Specifically, since there is no positive scaling invariance across residual blocks (the residual connections break down the structure matrix described in Section 3.2), we target the invariance within each residual block. For better comparison, we also conduct our studies on a stacked deep CNN described in he2016deep (referred to as PlainNet), and target the positive scaling invariance across all layers. We train 34-layer ResNet and PlainNet models on the datasets following the training strategies in the original paper, and compare the performance of $\mathcal{G}$-SGD and the vanilla SGD algorithm. (Footnote 3: Batch normalization is widely used in modern CNN models. Please refer to the Appendix for the combination of $\mathcal{G}$-SGD and batch normalization.) The detailed training strategies can be found in the Appendix. In this section, we focus on the performance of different optimization algorithms, and discuss the combination of $\mathcal{G}$-SGD and regularization in the Appendix.
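Table 1: Test error (%) on CIFAR-10 and CIFAR-100, with standard deviations in parentheses.
|Network||Algorithm||CIFAR-10||CIFAR-100|
|PlainNet-34||$\mathcal{G}$-SGD||7.00 (0.10)||30.74 (0.29)|
|ResNet-34||$\mathcal{G}$-SGD||6.66 (0.13)||27.74 (0.06)|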
As shown in Figure 3 and Table 1, $\mathcal{G}$-SGD clearly outperforms SGD on each network and each dataset. To be specific, 1) both the lowest training loss and the best test accuracy are achieved by ResNet-34 with $\mathcal{G}$-SGD on both datasets, which indicates that $\mathcal{G}$-SGD indeed helps the optimization of the ResNet model; 2) since $\mathcal{G}$-SGD can eliminate the influence of positive scaling invariance across all layers of PlainNet, we observe that the performance gain on PlainNet is larger than that on ResNet. For the PlainNet model, $\mathcal{G}$-SGD improves the accuracy by 0.8 and 5.7 points on CIFAR-10 and CIFAR-100, respectively, which verifies both the improper influence of positive scaling invariance on optimization in weight space and the benefit of optimization in $\mathcal{G}$-space. Moreover, Plain-34 trained by $\mathcal{G}$-SGD achieves even better accuracy than ResNet-34 trained by SGD on CIFAR-10, which shows the influence of invariance on optimization in weight space as well.
5.2 The Influence of Invariance
In this section, we study the influence of invariance on the optimization of ReLU networks. As proved in Section 3, the dimension of the weight space is larger than that of $\mathcal{G}$-space by $H$, where $H$ is the total number of hidden nodes in an MLP or feature maps in a CNN. We define the invariant ratio as $H/m$. We train several 2-hidden-layer MLP models on Fashion-MNIST xiao2017online with different numbers of hidden nodes in each layer, and analyze the performance gap between the models optimized by $\mathcal{G}$-SGD and SGD. The detailed training strategies and network structures can be found in the Appendix.
From Figure 4, we can see that 1) for each number of hidden nodes, $\mathcal{G}$-SGD clearly outperforms SGD on both training loss and test error, which verifies our claim that optimizing the loss function in $\mathcal{G}$-space is a better choice; 2) as the number of hidden nodes increases, the invariant ratio decreases and the performance gap between $\mathcal{G}$-SGD and SGD gradually decreases as well, which provides evidence that the positive scaling invariance in weight space indeed improperly influences the optimization.
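The dependence of the invariant ratio on the layer width can be made concrete with a short calculation. The sketch below assumes the ratio is $H/m$ for a bias-free MLP with 784 inputs (28 x 28 Fashion-MNIST images) and 10 outputs; the hidden widths listed are hypothetical values chosen for illustration, not the ones used in the experiments.

```python
def invariant_ratio(sizes):
    """H / m for a fully connected, bias-free MLP with the given layer sizes."""
    m = sum(a * b for a, b in zip(sizes[:-1], sizes[1:]))   # number of weights
    H = sum(sizes[1:-1])                                    # number of hidden nodes
    return H / m

# Two hidden layers of equal width h: H = 2h and m = 784h + h^2 + 10h,
# so the ratio simplifies to 2 / (794 + h).
widths = (64, 128, 256, 512)
ratios = {h: invariant_ratio([784, h, h, 10]) for h in widths}
for h in widths:
    print(h, ratios[h])
```

Under these assumptions the ratio is strictly decreasing in the width, in line with the trend described for Figure 4.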
5.3 Comparison with Path-SGD
In this section, we compare the performance of Path-SGD with that of $\mathcal{G}$-SGD. As described in Section 2.1, Path-SGD also considers the positive scaling invariance, but 1) instead of optimizing the loss function in $\mathcal{G}$-space, Path-SGD regularizes the optimization by the path norm; 2) Path-SGD ignores the dependency among the paths. We extend the experiments in neyshabur2015path to $\mathcal{G}$-SGD without unbalanced initialization, and conduct our studies on the MNIST and CIFAR-10 datasets. The detailed training strategies and a description of the network structure can be found in the Appendix.
As shown in Figure 5, while Path-SGD achieves better or equally good test accuracy and training loss compared with SGD on both the MNIST and CIFAR-10 datasets, $\mathcal{G}$-SGD achieves even better performance than Path-SGD, which is consistent with our theoretical analysis that considering the dependency between the paths and optimizing in $\mathcal{G}$-space bring benefits.
6 Conclusion
In this paper, we study the $\mathcal{G}$-space for ReLU neural networks and propose a novel optimization algorithm called $\mathcal{G}$-SGD. We study the positive scaling operators, which form a transformation group, and prove that the value vector of all the paths is sufficient to represent the neural networks. Then we show that one can identify basis paths and prove that the linear span of their value vectors (denoted as $\mathcal{G}$-space) is an invariant space with lower dimension under the positive scaling group. We design the $\mathcal{G}$-SGD algorithm in $\mathcal{G}$-space by leveraging back-propagation. We conduct extensive experiments to verify the empirical effectiveness of our proposed approach. In the future, we will examine the performance of $\mathcal{G}$-SGD on more large-scale tasks.
- (1) V. Badrinarayanan, B. Mishra, and R. Cipolla. Symmetry-invariant optimization in deep networks. arXiv preprint arXiv:1511.01754, 2015.
- (2) D. Balduzzi, M. Frean, L. Leary, J. Lewis, K. W.-D. Ma, and B. McWilliams. The shattered gradients problem: If resnets are the answer, then what is the question? arXiv preprint arXiv:1702.08591, 2017.
- (3) D. Balduzzi, H. Vanchinathan, and J. M. Buhmann. Kickback cuts backprop’s red-tape: Biologically plausible credit assignment in neural networks. In AAAI, pages 485–491, 2015.
- (4) H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, et al. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pages 7–10. ACM, 2016.
- (5) S. S. Du, W. Hu, and J. D. Lee. Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. Advances in Neural Information Processing Systems, 2018.
- (6) R. H. Hahnloser, R. Sarpeshkar, M. A. Mahowald, R. J. Douglas, and H. S. Seung. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405(6789):947, 2000.
- (7) K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
- (8) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- (9) G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
- (10) G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017.
- (11) A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. CIFAR, 2009.
- (12) D. C. Lay. Linear Algebra and its applications, 1997. Addison Wesley Longman, Inc. ISBN 0-201-76717-1, 1997.
- (13) Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
- (14) B. Neyshabur, R. R. Salakhutdinov, and N. Srebro. Path-sgd: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pages 2422–2430, 2015.
- (15) D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. nature, 323(6088):533, 1986.
- (16) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
- (17) R. Wang, B. Fu, G. Fu, and M. Wang. Deep & cross network for ad click predictions. In Proceedings of the ADKDD’17, page 12. ACM, 2017.
- (18) H. Xiao, K. Rasul, and R. Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. 2017.
- (19) S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
- (20) S. Zheng, Q. Meng, H. Zhang, W. Chen, N. Yu, and T.-Y. Liu. Capacity control of relu neural networks by basis-path norm. arXiv preprint arXiv:1809.07122, 2018.
Appendix: $\mathcal{G}$-SGD: Optimizing ReLU Neural Networks in its Positively Scale-Invariant Space
The Appendix is composed of examples of skeleton weights and basis paths for different MLP structures, proofs of the propositions, lemmas and theorems, and additional information about the experiments in the paper Optimization of ReLU Neural Networks using $\mathcal{G}$-Stochastic Gradient Descent.
|dimension of weight space|
|total number of hidden nodes or feature maps|
|total number of paths|
|total number of basis paths and dimension of $\mathcal{G}$-space|
|weight vector space|
|weight vector with for MLP|
|weight matrix at layer with size for MLP|
|weight element in matrix at position|
|the -th hidden node at layer|
|the set of edges in neural network model|
|index of layer|
|index of hidden nodes at -layer|
|explicit index of path|
|free skeleton weight|
|the number of|
8 Some Concepts in Abstract Algebra
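(Transformation group) Suppose that $G$ is a set of transformations, and $\circ$ is an operation defined between the elements of $G$. If $G$ satisfies the following conditions: 1) (operational closure) for any two elements $g_1, g_2 \in G$, it has $g_1 \circ g_2 \in G$; 2) (associativity) for any three elements $g_1, g_2, g_3 \in G$, it has $(g_1 \circ g_2) \circ g_3 = g_1 \circ (g_2 \circ g_3)$; 3) (unit element) there exists a unit element $e \in G$, so that for any element $g \in G$, there is $g \circ e = e \circ g = g$; 4) (inverse element) for any element $g \in G$, there exists an inverse element $g^{-1} \in G$ of $g$ such that $g \circ g^{-1} = g^{-1} \circ g = e$. Then, $G$ together with the operation "$\circ$" is called a transformation group.
(Group action) If $G$ is a group and $S$ is a set, then a (left) group action of $G$ on $S$ is a function $\varphi: G \times S \to S$ that satisfies the following two axioms (where we denote $\varphi(g, s)$ as $g \cdot s$): 1) (identity) $e \cdot s = s$ for all $s \in S$; 2) (compatibility) $(g \circ h) \cdot s = g \cdot (h \cdot s)$ for all $g, h \in G$ and all $s \in S$.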
9 General skeleton methods and Examples of MLP Models
In this section, we will introduce the skeleton method in Algorithm 2 in a recursive way for general case. The algorithm will take round. When it takes the -th round, it regards the network as a -layer network with layer to layer and regards the node as the output nodes. When it takes the iteration, it regards nodes as hidden nodes and identify the skeleton incoming weight and skeleton outgoing weight. Thus at each round, the number of index for the basis paths will be added by 1. The logic for the construction of basis paths is that: 1) select one incoming skeleton weight (or incoming basis path only composed by skeleton weights) and one outgoing skeleton weight for each hidden node. 2) the paths which contain no more than one non-skeleton weight is basis paths.
Next, we use some examples to explain the types of skeleton weights, basis paths constructed by skeleton method in the main paper. We call the basis path which only contains skeleton weights all-basis paths and basis path which contains one non-skeleton weight skip-basis paths.
First, we consider MLP models with the same number of hidden nodes of each layer hinton2012improving . Fig.6 shows an example of MLP model. It only displays the skeleton weights. We can see that the number of basis paths is three which equals the number of hidden nodes of one layer. If the number of hidden nodes is , by using skeleton method in Algorithm 1, the number of skeleton weights is and the number of all-basis paths equals the number of hidden nodes of one layer, which is . Thus we have
Second, we consider MLP models with decreasing number of hidden nodes, i.e., . Fig.6 shows an example. We can see that the number of basis paths is five which equals the largest number of hidden nodes,i.e., . We can see that because the , some nodes (e.g., ) will have multiple incoming weights (e.g., ) which are skeleton outgoing weight for the front hidden nodes. For these nodes, they select one of skeleton weights to be the skeleton incoming weights, which is displayed using red full line (e.g.,). Others are displayed using red dotted line (e.g., ). Thus the path which contains only skeleton weights and less than one dotted skeleton weight is a all-basis path. We have
Third, we consider MLP models with unbalanced number of hidden nodes of each layer. If and for one layer with , there exists hidden nodes whose the skeleton incoming weight and skeleton outgoing weight are both dotted,e.g., in Fig.6. In this case, the number of all-basis paths is . It is because that each node must be passed by at least one all-basis paths. Thus for the example showed in Fig.6, although both the incoming skeleton weight and the outgoing skeleton weight of are dotted, the path passed is all-basis path. In this case, the number of all-basis paths is .
10 Proofs of Theoretical Results in Section 3
In this section, we will provide proofs of the lemma and theorems in Section 3 of the main paper.
10.1 Proof of Theorem 3.3
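Theorem 3.3: Consider two ReLU neural networks with weights $w, w' \in \mathcal{W}$. We have that $w \sim_g w'$ iff for every path $p$ and any input $x \in \mathcal{X}$, we have $v_p(w) = v_p(w')$ and $a_p(x; w) = a_p(x; w')$.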
Proof: The sufficiency is trivial according to the representation of , which is shown in Eqn(1) in the main paper.
For the necessity, if , then there exist a positive scaling operator to make . We use to denote the node index of nodes in layer- and . Then we have for , because each weight may be modified by the operators of its connected two nodes and . Thus is satisfied because
Next we need to prove that is also satisfied. Because the value of the activation is determined by the sign of , we just need to prove that
where are positive numbers. We prove it by induction.
(1) For of a -layer MLP (): Suppose that is a ReLU activation function. For the -th hidden node, we have
(2) For of the -layer MLP (): Suppose that
Then we have
Thus we finish the proof.
10.2 Proof of Theorem 3.4 and Theorem 3.6
In order to prove Theorem 3.4 and Theorem 3.6, we need to prove that there exist a group of paths which are independent and can represent other paths, and the activation status can be calculated using values of basis paths and signs of free skeleton weights. In order to simplify the proof, we leverage the basis paths constructed by skeleton method. We only show the proof of the following lemma, from which we can easily get Theorem 3.4 and Theorem 3.6.
The paths selected by skeleton method are basis paths and can be calculated using signs of free skeleton weights and the values of basis paths in a recursive way.
Proof sketch: Let us consider the matrix <