Introduction
Deep neural networks have pushed the frontiers of a wide variety of AI tasks in recent years, such as speech recognition [Xiong et al.2016, Chan et al.2016], computer vision [Ioffe and Szegedy2015, Ren et al.2015], and natural language processing [Bahdanau, Cho, and Bengio2014, Gehring et al.2017]. More surprisingly, deep neural networks generalize well even when the number of parameters is significantly larger than the amount of training data [Zhang et al.2017]. To explain the generalization ability of neural networks, researchers have commonly used different norms of the network parameters to measure capacity [Bartlett, Foster, and Telgarsky2017, Neyshabur, Tomioka, and Srebro2014, Neyshabur, Tomioka, and Srebro2016]. Among different types of deep neural networks, ReLU networks (i.e., neural networks with ReLU activations [Glorot, Bordes, and Bengio2011]) have demonstrated outstanding performance in many fields such as image classification [He et al.2016, Huang et al.2017], information systems [Cheng et al.2016, He et al.2017], and text understanding [Vaswani et al.2017]. It is well known that ReLU neural networks are positively scale invariant [Neyshabur, Salakhutdinov, and Srebro2015, Neyshabur et al.2016]: for a hidden node with ReLU activation, if all of its incoming weights are multiplied by a positive constant and its outgoing weights are divided by the same constant, the network with the new weights generates exactly the same output as the old one for any input. [Neyshabur, Salakhutdinov, and Srebro2015] considered the product of weights along each path from the input to the output units, which yields the path norm, a quantity invariant to this rescaling of weights, and proposed Path-SGD, which adds the path norm as a regularization term to the loss function.
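This positive scale invariance is easy to check numerically. The sketch below (our own illustration; `forward` and the shapes are arbitrary) rescales the weights around one hidden node of a small ReLU network and confirms the output is unchanged:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(W1, W2, x):
    # one-hidden-layer ReLU network, no biases
    return W2 @ relu(W1 @ x)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # input -> hidden
W2 = rng.normal(size=(2, 4))   # hidden -> output
x = rng.normal(size=3)

c = 3.7                         # any positive constant
W1s, W2s = W1.copy(), W2.copy()
W1s[1, :] *= c                  # scale incoming weights of hidden node 1
W2s[:, 1] /= c                  # inversely scale its outgoing weights

# relu(c * z) = c * relu(z) for c > 0, so the output is unchanged
assert np.allclose(forward(W1, W2, x), forward(W1s, W2s, x))
```

Note that the products of weights along paths (and hence the path norm) are also unchanged by this rescaling, which is what makes path-based measures attractive.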
In fact, each path in a ReLU network can be represented by a small group of generalized linearly independent paths (we call them basis-paths in the sequel) through multiplication and division operations, as shown in Figure 1. Thus, there is dependency among different paths: the smaller the percentage of basis paths, the higher the dependency. As the network is determined only by the basis paths, the generalization property of the network, as well as the relevant regularization methods, should depend only on the basis paths. In addition, Path-SGD controls capacity by solving the argmin of the regularized loss function; the solution of this argmin problem is approximate because the dependency among the values of different paths is not taken into account. This motivates us to establish a capacity bound based only on the basis paths instead of all the paths, in contrast to the generalization bound based on the path norm, which counts the values of all paths and ignores the dependency among them. To tackle these problems, we define a new norm based on the values of the basis paths, called the Basis-path Norm. In previous work, [Meng et al.2018] constructed the basis paths by the skeleton method and proved that the values of all other paths can be calculated from the values of the basis paths using multiplication and division operations. In this work, we take one step further and categorize the basis paths into positive and negative basis paths according to the sign of their exponents in the calculation of non-basis paths.
In order to control the generalization error, we need to keep the hypothesis space small. Since the loss function can be computed via paths, we keep the values of all paths small. To keep the values of non-basis paths (represented by positive and negative basis paths) small, we prevent the positive basis paths from being too large and the negative basis paths from being too small. In addition, to keep the values of the basis paths themselves small, we prevent the negative basis paths from being too large as well. With these considerations, we define the new Basis-path norm. We prove a generalization error bound for ReLU networks in terms of the basis-path norm. We then study the relationship between this basis-path norm bound and the empirical generalization gap, i.e., the absolute difference between test error and training error. The experiments cover ReLU networks with different depths, widths, and levels of label randomness. For comparison, we also compute the generalization error bounds induced by other capacity measures proposed in the literature. Our experiments show that the generalization bound based on the basis-path norm is much more consistent with the empirical generalization gap than those based on other norms. In particular, when the network size is small, the ordinary path norm bound fits the empirical generalization gap well. However, as the width and depth increase, the percentage of non-basis paths and the dependency among paths both increase, and we observe that the path norm bound degenerates in reflecting the empirical generalization gap. In contrast, our basis-path norm bound fits the empirical generalization gap consistently as the network size changes. This validates the efficacy of the BP norm as a capacity measure.
Finally, we propose a novel regularization method, called Basis-path regularization (BP regularization), in which we penalize the loss function by the BP norm. Empirically, we first conduct experiments on a recommendation task with the MovieLens-1M dataset to compare the generalization of a multilayer perceptron (MLP) model under BP regularization and baseline norm-based regularization; we then verify the effectiveness of BP regularization on an image classification task with ResNet and PlainNet on the CIFAR-10 dataset. The results of all experiments show that, with our method, optimization algorithms (i.e., SGD, Adam, Quotient SGD) attain better test accuracy than with other regularization methods.
Related Work
Generalization of deep neural networks has attracted a great deal of attention in the community [Zhang et al.2017, Neyshabur et al.2017, Kawaguchi, Kaelbling, and Bengio2017]. Norm- and margin-based measures have been widely studied and are commonly used in neural network optimization with capacity control [Bartlett and Mendelson2002, Evgeniou, Pontil, and Poggio2000, Neyshabur, Tomioka, and Srebro2016]. For example, in [Bartlett, Foster, and Telgarsky2017], the authors proposed a margin-based generalization bound for networks that scales with their margin-normalized spectral complexity. An analysis of generalization bounds based on PAC-Bayes was proposed in [Dziugaite and Roy2017].
Among these measures, the generalization bound based on the path norm is theoretically tighter [Neyshabur, Tomioka, and Srebro2016]. Empirically, the path norm has been shown to describe the tendency of the generalization error more accurately [Neyshabur et al.2017]. Thus, we are interested in capacity measures related to the path norm. In [Neyshabur, Tomioka, and Srebro2016], the authors first proposed the group norm and the path norm, and showed that the path norm is equivalent to a kind of group norm. In [Neyshabur, Salakhutdinov, and Srebro2015, Neyshabur et al.2016], the authors proposed to use the path norm as a regularization term for ReLU multilayer perceptron (MLP) networks and recurrent networks, and designed the Path-SGD algorithm. In [Neyshabur et al.2017], the authors empirically compared different capacity measures, including the path norm, for deep neural network generalization. However, none of these norms considered the dependency among paths in the networks.
Preliminaries
In this section, we introduce ReLU neural networks and generalization error.
First of all, we briefly introduce the structure of rectifier neural network models. Suppose $f_w$ is an $L$-layer neural network with weight $w$, where the input space is $\mathcal{X} \subseteq \mathbb{R}^{h^0}$ and the output space is $\mathcal{Y} \subseteq \mathbb{R}^{h^L}$. In the $l$-th layer ($l = 0, \dots, L$), there are $h^l$ nodes. We denote the nodes and their values as $O^l = (o^l_1, \dots, o^l_{h^l})$. It is clear that $O^0 = x$. The layer mapping is given as $O^l = \sigma(W^l O^{l-1})$, where $W^l$ is the adjacency matrix of the $l$-th layer and the rectifier activation function $\sigma(z) = \max(z, 0)$ is applied element-wise. We can also calculate the $i_L$-th output by paths, i.e.,

$o^L_{i_L}(x) = \sum_{i_0, \dots, i_{L-1}} w^1_{i_0 i_1} \cdots w^L_{i_{L-1} i_L} \prod_{l=1}^{L-1} \mathbb{1}[o^l_{i_l}(x) > 0] \cdot x_{i_0},$   (1)

where $(i_0, i_1, \dots, i_L)$ is the path starting from input feature node $i_0$ to output node $i_L$ via hidden nodes $i_1, \dots, i_{L-1}$, and $w^l_{i_{l-1} i_l}$ is the weight of the edge connecting nodes $i_{l-1}$ and $i_l$. (The paths across the bias nodes can be described in the same way; for simplicity, we omit the bias term.)
We denote the value of path $k$ as $v_k = \prod_{l=1}^{L} w^l_{i_{l-1} i_l}$ and its activation status as $a_k(x) = \prod_{l=1}^{L-1} \mathbb{1}[o^l_{i_l}(x) > 0]$. The output can then be represented using paths as $o^L_{i_L}(x) = \sum_k v_k \, a_k(x) \, x_{i_0(k)}$. For ease of reference, we omit the explicit node indices and use $k$ as the index of a path. We use $v = (v_1, \dots, v_m)$ to denote the path vector. The path norm used in Path-SGD [Neyshabur, Salakhutdinov, and Srebro2015] is defined as $\|v\|_2 = \big(\sum_k v_k^2\big)^{1/2}$.

Given a training set $S = \{(x_1, y_1), \dots, (x_n, y_n)\}$ i.i.d. sampled from the underlying distribution $\mathcal{D}$, machine learning algorithms learn a model $f$ from the hypothesis space $\mathcal{F}$ by minimizing the empirical loss function $\hat{L}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i)$. The uniform generalization error of empirical risk minimization in hypothesis space $\mathcal{F}$ is defined as $\sup_{f \in \mathcal{F}} \big( \mathbb{E}_{(x,y) \sim \mathcal{D}} \, \ell(f(x), y) - \hat{L}(f) \big)$. The generalization error measures how well a model learned from the training data can fit an unknown test sample. Empirically, we consider the empirical generalization error, defined as the difference in empirical loss between the training set and the test set at the trained model.
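The path representation of Eqn. (1) can be verified on a tiny bias-free network by brute-force enumeration of paths; the following sketch (ours, with illustrative shapes) checks that the path sum matches the layer-by-layer forward pass:

```python
import numpy as np
from itertools import product

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(1)
h0, h1, h2 = 3, 4, 2            # input, hidden, output widths
W1 = rng.normal(size=(h1, h0))
W2 = rng.normal(size=(h2, h1))
x = rng.normal(size=h0)

out_layers = W2 @ relu(W1 @ x)  # standard forward pass

# Path sum: over (j, i1, i0), weight product * activation indicator * x[i0]
out_paths = np.zeros(h2)
pre = W1 @ x                    # hidden pre-activations
for j, i1, i0 in product(range(h2), range(h1), range(h0)):
    active = 1.0 if pre[i1] > 0 else 0.0
    out_paths[j] += W2[j, i1] * W1[i1, i0] * active * x[i0]

assert np.allclose(out_layers, out_paths)
```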
Basis-path Norm
In this section, we define the Basis-path Norm (abbreviated as BP norm) on ReLU networks. Using the BP norm, we define a capacity measure called the BP measure, and we prove that the generalization error can be upper bounded using this measure.
The Definition of Basis-path Norm
As shown in [Meng et al.2018], a group of basis paths can be constructed by the skeleton method, and the values of all non-basis paths can be calculated from the values of the basis paths. In this calculation, some basis paths always have positive exponents and hence appear in the numerator, while others always have negative exponents and hence appear in the denominator. We use $\tilde{p}$ to denote a non-basis path and $p_1, \dots, p_m$ to denote basis paths. We have the following proposition.
Proposition 1
For any non-basis path $\tilde{p}$, $\tilde{p} = \prod_{i=1}^{m} p_i^{\alpha_i}$, where $\alpha_i \in \mathbb{Z}$ and the sign of each $\alpha_i$ is determined by the basis path $p_i$ alone, not by $\tilde{p}$.
Limited by space, we put the detailed proof in the supplementary materials.
The proposition shows that some basis paths always have a negative exponent in the calculation, while others always have a positive exponent. We call a basis path with negative exponent a Negative Basis Path and denote the vector of negative basis path values as $p^-$. We call a basis path with positive exponent a Positive Basis Path and denote the corresponding vector as $p^+$.
In order to control the generalization error, we need to keep the hypothesis space small. Thus we want all paths to have small values. For a non-basis path represented by $p^+$ and $p^-$, we prevent $p^-$ from being too small because its exponent is negative, and $p^+$ from being too large because its exponent is positive. We also prevent $p^-$ from being too large, to keep the values of the basis paths themselves small. With these considerations, we define the basis-path norm as follows.
Definition 1
The basis-path norm on a ReLU network $f_w$ is

$\|f_w\|_{BP} = \max\Big( \max_j \big|\ln|p_j^-|\big|, \ \sum_i |p_i^+| \Big).$   (2)
We next present a property of $\|\cdot\|_{BP}$.
Theorem 1
$\|\cdot\|_{BP}$ is a norm on the space of pairs $(p^+, p^-)$, where $p^+$ is a vector in Euclidean space and $p^-$ is a vector in a generalized linear space under the generalized addition and generalized scalar multiplication operations $a \oplus b = a \cdot b$ and $c \odot a = a^c$, applied element-wise for positive vectors $a, b$ and $c \in \mathbb{R}$.
Proof: The definition of $\|\cdot\|_{BP}$ is equivalent to

$\|f_w\|_{BP} = \max\big( N(p^-), \ \|p^+\|_1 \big), \quad N(p^-) = \max_j \big|\ln|p_j^-|\big|,$   (3)

where $\|p^+\|_1 = \sum_i |p_i^+|$. Obviously, $\|p^+\|_1$ is a norm in Euclidean space. Thus, it only remains to prove that $N$ is a kind of norm. Next, we prove that $N$ is a norm in the generalized linear space.
In the generalized linear space, the zero vector is $\mathbf{1}$, where $\mathbf{1}$ denotes a vector with all elements equal to $1$. Based on the generalized linear operators, we verify that $N$ is positive definite and absolutely homogeneous and satisfies the triangle inequality, as follows:

(1) (Positive definite) $N(p^-) \ge 0$, and $N(p^-) = 0$ if and only if $p^- = \mathbf{1}$.

(2) (Absolutely homogeneous) For arbitrary $c \in \mathbb{R}$, we have $N(c \odot p^-) = \max_j \big|\ln |p_j^-|^c\big| = |c| \, N(p^-)$.

(3) (Triangle inequality) $N(p^- \oplus q^-) = \max_j \big|\ln|p_j^-| + \ln|q_j^-|\big| \le N(p^-) + N(q^-)$.

Since $N$ and $\|\cdot\|_1$ are both norms, taking the maximum of them is still a norm. Thus $\|\cdot\|_{BP}$ satisfies the definition of a norm.
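As a sanity check, the following sketch (ours, not from the paper) verifies numerically that the map $N(p) = \max_j |\ln p_j|$ on positive vectors behaves like a norm under the generalized operations $a \oplus b = ab$ and $c \odot a = a^c$:

```python
import numpy as np

def N(p):
    # log-based norm on positive vectors in the generalized linear space
    return np.max(np.abs(np.log(p)))

rng = np.random.default_rng(2)
p = np.exp(rng.normal(size=5))   # positive vector
q = np.exp(rng.normal(size=5))
c = -2.5

assert N(np.ones(5)) == 0.0                   # zero vector is all-ones
assert np.isclose(N(p ** c), abs(c) * N(p))   # absolute homogeneity (c ⊙ p)
assert N(p * q) <= N(p) + N(q) + 1e-12        # triangle inequality (p ⊕ q)
```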
Generalization Error Bound by Basis-path Norm
We want to use the basis-path norm to define a capacity measure that yields an upper bound on the generalization error. Suppose the binary classifier is given as $g(x) = v^\top f_w(x)$, where $v$ represents a linear operator on the output of the deep network with input vector $x$. We consider the hypothesis space composed of such linear operators and $L$-layer fully connected neural networks with width $h$ and input dimension $h^0$.

Theorem 2
Given a training set of size $n$ and the above hypothesis space containing MLPs with depth $L$ and width $h$, for arbitrary $\delta \in (0, 1)$, with probability at least $1 - \delta$, the generalization error of every hypothesis in the space can be upper bounded in terms of the following quantity:
(4) 
We call this quantity the Basis-path measure. Therefore, the generalization error can be upper bounded by a function of the Basis-path measure.
The proof depends on estimating the values of different types of paths and counting the number of paths of each type. We give a proof sketch of Theorem 2.

Proof sketch of Theorem 2:
Step 1: Denoting the 0-1 loss of the binary classifier as above, the generalization error of a binary classification problem can be bounded in terms of the Rademacher complexity of the hypothesis space [Wolf2018]. Following the results of Theorem 1 and Theorem 5 in [Neyshabur, Tomioka, and Srebro2016], the Rademacher complexity can in turn be bounded using the maximal path norm over the hypothesis space.
Step 2 (estimating path values): We upper bound path values using the basis-path norm. By the definition of $\|\cdot\|_{BP}$, the values of the positive basis paths and the logarithms of the absolute values of the negative basis paths are both bounded by $\|f_w\|_{BP}$. Then, using Proposition 1, the value of every non-basis path can be bounded in terms of these quantities.

As shown in the skeleton method of [Meng et al.2018] (also reviewed in the supplementary materials), basis paths are constructed according to skeleton weights. Here, we classify the non-basis paths according to the number of non-skeleton weights they contain, and denote a non-basis path containing $k$ non-skeleton weights as $\tilde{p}_k$. The proof of Proposition 1 shows how the exponents in the decomposition of $\tilde{p}_k$ depend on $k$, which yields a bound on the value of $\tilde{p}_k$.
Step 3 (counting the number of each type of path): Based on the construction of basis paths (see the skeleton method in the supplementary materials), in each hidden layer there are $h$ skeleton weights and $h^2 - h$ non-skeleton weights. From this, the number of non-basis paths $\tilde{p}_k$ in an $L$-layer MLP with width $h$ can be counted for each $k$.
Step 4: We have:
(7) 
Counting the number of negative basis paths and combining it with the path-value bounds above, we obtain
(8) 
where Ineq. (8) follows from the path counting above.
Based on the above theorem, we discuss how the capacity measure changes with width $h$ and depth $L$. (1) For fixed width, the measure increases exponentially as the depth becomes large. (2) The measure increases as the basis-path norm increases; if the norm diminishes to zero, the features flow directly into the output. (3) Otherwise, the measure grows linearly as the width and depth increase and decreases as the sample size grows.
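To illustrate how quickly non-basis paths come to dominate, the following counting sketch (our own illustration, assuming the skeleton weights sit on the diagonals of square weight matrices) enumerates the paths of a width-$h$, $L$-layer MLP and counts those with at most one off-diagonal weight:

```python
from itertools import product

def basis_fraction(L, h):
    """Count basis paths (<= 1 non-skeleton weight) vs. all paths."""
    total, basis = 0, 0
    for path in product(range(h), repeat=L + 1):   # one node index per layer
        # an edge (a -> b) is non-skeleton (off-diagonal) when a != b
        nonskeleton = sum(1 for a, b in zip(path, path[1:]) if a != b)
        total += 1
        basis += nonskeleton <= 1
    return basis, total

# h + L*(h*h - h) basis paths out of h**(L+1) total paths
assert basis_fraction(L=2, h=2) == (6, 8)
assert basis_fraction(L=4, h=4) == (4 + 4 * 12, 4 ** 5)   # 52 of 1024
```

The basis fraction shrinks rapidly with width and depth, which is consistent with the observation that the dependency among paths grows with network size.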
Empirical Verification
In the previous section, we derived a BP norm induced generalization error bound for ReLU networks. In this section, we study the relationship between this bound and the empirical generalization gap (the absolute difference between test error and training error) in real-data experiments, in comparison with the generalization error bounds given by other capacity measures, including the weight norm [Evgeniou, Pontil, and Poggio2000], path norm [Neyshabur, Tomioka, and Srebro2016], and spectral norm [Bartlett, Foster, and Telgarsky2017]. We follow the experimental settings in [Neyshabur et al.2017] and extend them with our BP norm bound. As shown in the previous section, the capacity measure induced by the BP norm is proportional to the quantity in Eqn. (2).
We conduct experiments with ReLU multilayer perceptrons (MLPs) of different depths and widths, converging to different global minima, on the MNIST classification task, optimized by stochastic gradient descent. More details of the training strategies can be found in the supplementary materials. All experiments are averaged over 5 trials unless explicitly noted.
First, we train several MLP models and force them to converge to different global minima by intentionally replacing different numbers of training labels with random ones, and then calculate the capacity measures on these models. The training set consists of 10000 randomly selected samples with true labels plus up to 5000 intentionally mislabeled samples, which are gradually added to the training set. The error rate is evaluated on a fixed validation set of 10000 samples. Figure 2 (a) shows that every network is able to fit the entire training set regardless of the amount of mislabeled data, while the test error of the learned networks increases with the size of the mislabeled portion. As shown in Figure 2 (b), the BP norm measure is consistent with the generalization behavior on the data and is indeed a good predictor of the generalization error, as are the weight norm, path norm, and spectral norm.
We further investigate the relationship between generalization error and network size with different widths. We train a set of MLPs with 2 hidden layers, varying the number of hidden units per layer from 16 to 8192. The experiment is conducted on the whole training set of 60000 images. As shown in Figure 2(c), the networks can fit the whole training set when the number of hidden units is at least 32; the minimal test error is achieved with 512 hidden units, and slight overfitting appears beyond 1024 hidden units. Figure 2(d) shows that the BP norm measure behaves similarly to the generalization error, decreasing at first and then slightly increasing, and also achieves its minimal value at 512 hidden units. The weight norm and spectral norm keep increasing as the network size grows, while the generalization error behaves differently. The path norm explains generalization well when the number of hidden units is small, but keeps decreasing as the network size increases in this experiment. One possible reason is that the proportion of basis paths among all paths is decreasing, so the vast majority of (dependent) paths improperly affects the capacity measure when the dependency in the network becomes large. In contrast, the BP norm explains the generalization behavior well regardless of the network size.
A similar empirical observation holds when we train networks with different numbers of hidden layers. Each network has 32 hidden units per layer and can fit the whole training set in this experiment. As shown in Figure 2(e,f), the minimal test error is achieved with 3 hidden layers, and overfitting appears as the number of layers increases further. The weight norm keeps increasing with the network size, as discussed above, and the spectral norm becomes quite large as the number of layers increases. The path norm can partially explain the decreasing generalization error up to 4 hidden layers, but it indicates that the networks with 4, 5, and 6 hidden layers have small generalization error, which does not match our observations. The number of non-basis paths grows exponentially with the number of layers, so the path norm cannot measure capacity accurately by counting the values of all paths. In contrast, the BP norm nearly matches the generalization error. These observations verify that the BP norm bound is tighter with respect to the generalization error and can be a better predictor of generalization.
Basis-path Regularization for ReLU Networks
In this section, we propose Basis-path regularization, in which we penalize the loss function by the BP norm. According to the definition of the BP norm in Eqn. (2), to make it small, we need to restrict the values of the negative basis paths to be moderate (neither too large nor too small) and minimize the values of the positive basis paths. To this end, our proposed method penalizes the empirical loss by the distance between the values of the negative basis paths and 1, as well as by the sum of the values of all positive basis paths.
The max-type constraint on the negative basis paths is equivalent to element-wise constraints, since the largest element in a vector is smaller than a threshold if and only if every element is smaller than that threshold. We choose to optimize the squares of these terms because of their smoothness. Using the Lagrangian dual method, we add the two constraints to the loss function and then optimize the regularized empirical risk function:
(9) 
We use $g_w$ to denote the gradient of the loss with respect to a weight $w$. For a non-skeleton weight $w$, since it is contained in only one positive basis path, we can calculate the gradient of the regularization term with respect to $w$ as
(10) 
For a skeleton weight $w$, it is contained in only one negative basis path (if the neural network has an equal number of hidden nodes per layer) and in some of the positive basis paths. Thus its gradient can be calculated as follows:
(11) 
where the sum ranges over all positive basis paths containing $w$.
Combining these together, we obtain the gradient of the regularized loss function with respect to the weights. For example, if we use stochastic gradient descent as the optimizer, the update rule is as follows:
(12) 
Please note that the computational overhead of the negative-basis-path term is high; moreover, we observed that the values of the negative basis paths are relatively stable during optimization. We therefore set its coefficient to zero for ease of computation. In addition, basis-path regularization can easily be combined with optimization algorithms that work in the quotient space.
The flow of SGD with basis-path regularization is shown in Algorithm 1; it is straightforward to extend basis-path regularization to other stochastic optimization algorithms. Compared to weight decay, basis-path regularization incurs little additional computational overhead: the additional computations in Eqn. (10) only introduce lightweight element-wise matrix operations, which are cheap compared with the forward and backward passes.
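A minimal sketch of the regularization term is given below. It is our own illustration, not the paper's implementation: following the simplification above, the negative-basis-path term is dropped, the positive basis path values are penalized through their squares, and the layout (a width-`h` two-layer block with diagonal skeleton weights) and the names `lam`, `penalty`, and `penalty_grad_W1` are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
h, lam = 3, 0.01
W1 = rng.normal(size=(h, h))
W2 = rng.normal(size=(h, h))

def penalty(W1, W2):
    # Each positive basis path contains exactly one off-diagonal weight:
    # off-diagonal W1[i, j] pairs with skeleton W2[i, i], and
    # off-diagonal W2[i, j] pairs with skeleton W1[j, j].
    r = 0.0
    for i in range(h):
        for j in range(h):
            if i != j:
                r += (W1[i, j] * W2[i, i]) ** 2
                r += (W2[i, j] * W1[j, j]) ** 2
    return lam * r

def penalty_grad_W1(W1, W2):
    # Closed-form gradient of the penalty with respect to W1.
    G = np.zeros_like(W1)
    for i in range(h):
        for j in range(h):
            if i != j:
                # off-diagonal weight: one positive basis path
                G[i, j] = 2 * lam * W1[i, j] * W2[i, i] ** 2
        # diagonal (skeleton) weight: appears in h-1 positive basis paths
        G[i, i] = sum(2 * lam * W1[i, i] * W2[k, i] ** 2
                      for k in range(h) if k != i)
    return G

# Finite-difference check of one analytic gradient entry
eps = 1e-6
E = np.zeros_like(W1); E[0, 1] = eps
num = (penalty(W1 + E, W2) - penalty(W1 - E, W2)) / (2 * eps)
assert np.isclose(num, penalty_grad_W1(W1, W2)[0, 1], atol=1e-6)
```

In practice this gradient would simply be added to the loss gradient in each SGD step, as in Algorithm 1.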
Experimental Results
In this section, we evaluate Basis-path Regularization on deep ReLU neural networks, with the aim of verifying whether our proposed BP regularization outperforms other baseline regularization methods and whether it improves generalization on benchmark datasets. For the sake of fairness, we report the mean of 5 independent runs with random initialization.
Recommendation System
We first apply our basis-path regularization method to a recommendation task with MLP networks and conduct experimental studies on a public dataset, MovieLens (https://grouplens.org/datasets/movielens/1m/). The characteristics of the MovieLens dataset are summarized in Table 1. We use the version containing one million ratings, where each user has at least 20 ratings. We train an NCF framework with an MLP network similar to the one proposed in [He et al.2017], and follow their training strategies with the Adam optimizer but without any pre-training. We test predictive factors of [8, 16, 32, 64] and set the number of hidden units in each hidden layer to the embedding size. The performance of a ranked list is judged by Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG) [He et al.2015]; we calculate both metrics for each test user and report the average score. For each method, we perform a wide-range grid search over the regularization hyper-parameter and report the experimental results based on the best performance on the validation set.
Dataset  Interaction#  Item#  User#  Sparsity 
MovieLens  1,000,209  3,706  6,040  95.53% 
Figure 3 (a) and (b) show the performance of HR@10 and NDCG@10 w.r.t. the number of predictive factors. From this figure, it is clear that basis-path regularization achieves better generalization performance than all baseline methods. Figure 3 (c) and (d) show the performance of Top-K recommended lists, where the ranking position K ranges from 1 to 10. As can be seen, basis-path regularization demonstrates consistent improvement over the other methods across positions, which is consistent with our analysis of the generalization error bound in the previous section.
Image Classification
In this section, we apply our basis-path regularization to image classification and conduct experimental studies on CIFAR-10 [Krizhevsky and Hinton2009], which contains 10 classes of images. We employ a popular deep convolutional ReLU model, ResNet [He et al.2016], since it has achieved great success in many image-related tasks. In addition, we conduct our studies on a stacked deep CNN described in [He et al.2016] (referred to as PlainNet), which suffers from strong dependency among paths. We train 34-layer ResNet and PlainNet networks on this dataset and use SGD with the widely used weight decay regularization (WD) as our baseline. In addition, we implement QSGD, which was proposed in [Meng et al.2018] and optimizes the networks on basis paths. We investigate the combination of SGD/QSGD with basis-path regularization (BPR). As in the previous task, we perform a wide-range grid search over the regularization hyper-parameter. More training details can be found in the supplementary materials.
Algorithm 
PlainNet-34  ResNet-34  

Train  Test  Train  Test  
SGD 
0.06  7.76  7.70  0.01  7.13  7.12 
SGD + WD  0.06  6.34  6.27  0.01  5.71  5.70 
SGD + BPR  0.06  5.99  5.92  0.01  5.62  5.61 
QSGD  0.03  7.00  6.97  0.01  6.66  6.65 
QSGD + BPR  0.05  5.73  5.68  0.03  5.36  5.33 
Table 2 shows the training and test results of each algorithm. From the table, we can see that our basis-path regularization improves the test accuracy of PlainNet-34 and ResNet-34 by nearly 1.8% and 1.5%, respectively. Moreover, the training behaviors of SGD with weight decay and with basis-path regularization are quite similar, but basis-path regularization consistently finds better-generalizing points during optimization, which is consistent with our theoretical analysis in the previous section. We further investigate the combination of QSGD and basis-path regularization: QSGD with basis-path regularization achieves the best test accuracy on both the PlainNet and ResNet models, which indicates that taking the BP norm as a regularization term in the loss function is helpful for optimization algorithms.
Conclusion
In this paper, we define the Basis-path norm on the group of basis paths and prove that the generalization error of ReLU neural networks can be upper bounded by a function of the BP norm. We then design the Basis-path regularization method, which shows clear performance gains in generalization ability. For future work, we plan to test basis-path regularization on larger networks and datasets. Furthermore, we are also interested in applying basis-path regularization to networks with different architectures.
References
 [Bahdanau, Cho, and Bengio2014] Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
 [Bartlett and Mendelson2002] Bartlett, P. L., and Mendelson, S. 2002. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research 3(Nov):463–482.
 [Bartlett, Foster, and Telgarsky2017] Bartlett, P. L.; Foster, D. J.; and Telgarsky, M. J. 2017. Spectrallynormalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, 6241–6250.
 [Bayer et al.2017] Bayer, I.; He, X.; Kanagal, B.; and Rendle, S. 2017. A generic coordinate descent framework for learning from implicit feedback. In Proceedings of the 26th International Conference on World Wide Web, 1341–1350. International World Wide Web Conferences Steering Committee.
 [Chan et al.2016] Chan, W.; Jaitly, N.; Le, Q.; and Vinyals, O. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, 4960–4964. IEEE.

[Cheng et al.2016] Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10. ACM.
 [Dziugaite and Roy2017] Dziugaite, G. K., and Roy, D. M. 2017. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008.

[Evgeniou, Pontil, and Poggio2000] Evgeniou, T.; Pontil, M.; and Poggio, T. 2000. Regularization networks and support vector machines. Advances in Computational Mathematics 13(1):1.
 [Gehring et al.2017] Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; and Dauphin, Y. N. 2017. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122.

[Glorot, Bordes, and Bengio2011] Glorot, X.; Bordes, A.; and Bengio, Y. 2011. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 315–323.
 [He et al.2015] He, X.; Chen, T.; Kan, M.-Y.; and Chen, X. 2015. Trirank: Review-aware explainable recommendation by modeling aspects. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, 1661–1670. ACM.
 [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
 [He et al.2017] He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.S. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee.
 [Hinton et al.2012] Hinton, G. E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. R. 2012. Improving neural networks by preventing coadaptation of feature detectors. arXiv preprint arXiv:1207.0580.

[Huang et al.2017] Huang, G.; Liu, Z.; Weinberger, K. Q.; and van der Maaten, L. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
 [Ioffe and Szegedy2015] Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
 [Kawaguchi, Kaelbling, and Bengio2017] Kawaguchi, K.; Kaelbling, L. P.; and Bengio, Y. 2017. Generalization in deep learning. arXiv preprint arXiv:1710.05468.
 [Krizhevsky and Hinton2009] Krizhevsky, A., and Hinton, G. 2009. Learning multiple layers of features from tiny images.
 [Meng et al.2018] Meng, Q.; Zheng, S.; Ye, Q.; Chen, W.; and Liu, T.Y. 2018. Optimization of relu neural networks using quotient stochastic gradient descent. arXiv preprint arXiv:1802.03713.

[Neyshabur et al.2016] Neyshabur, B.; Wu, Y.; Salakhutdinov, R. R.; and Srebro, N. 2016. Path-normalized optimization of recurrent neural networks with relu activations. In Advances in Neural Information Processing Systems, 3477–3485.
 [Neyshabur et al.2017] Neyshabur, B.; Bhojanapalli, S.; McAllester, D.; and Srebro, N. 2017. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, 5949–5958.
 [Neyshabur, Salakhutdinov, and Srebro2015] Neyshabur, B.; Salakhutdinov, R. R.; and Srebro, N. 2015. Pathsgd: Pathnormalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, 2422–2430.
 [Neyshabur, Tomioka, and Srebro2014] Neyshabur, B.; Tomioka, R.; and Srebro, N. 2014. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614.
 [Neyshabur, Tomioka, and Srebro2016] Neyshabur, B.; Tomioka, R.; and Srebro, N. 2016. Normbased capacity control in neural networks. In Conference on Learning Theory, 1376–1401.
 [Ren et al.2015] Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster rcnn: Towards realtime object detection with region proposal networks. In Advances in neural information processing systems, 91–99.
 [Ruder2016] Ruder, S. 2016. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.
 [Vaswani et al.2017] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 6000–6010.

[Wolf2018] Wolf, M. M. 2018. Mathematical foundations of supervised learning. Lecture notes.
 [Xiong et al.2016] Xiong, W.; Droppo, J.; Huang, X.; Seide, F.; Seltzer, M.; Stolcke, A.; Yu, D.; and Zweig, G. 2016. Achieving human parity in conversational speech recognition. arXiv preprint arXiv:1610.05256.
 [Zhang et al.2017] Zhang, C.; Bengio, S.; Hardt, M.; Recht, B.; and Vinyals, O. 2017. Understanding deep learning requires rethinking generalization. ICLR.
Supplementary: Capacity Control of ReLU Neural Networks by Basis-path Norm
This document contains supplementary theoretical material and additional experimental details for the paper "Capacity Control of ReLU Neural Networks by Basis-path Norm".
Skeleton Method to Construct Basis-paths
We briefly review the construction of basis-paths. For an L-layer feedforward neural network with width H, the skeleton method [Meng et al.2018] constructs basis-paths in two steps:
(1) Select skeleton weights: the skeleton weights are the diagonal elements w_{ii} of each weight matrix W^l, l = 1, ..., L (for a non-square weight matrix, the diagonal of its leading square block is used). All selected weights are called skeleton weights; the others are called non-skeleton weights.
(2) Construct basis-paths: the paths that contain no more than one non-skeleton weight are basis-paths.
In [Meng et al.2018], the authors also proved the following property of basis-paths: each non-skeleton weight appears in exactly one basis-path.
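As an illustration (our own toy example, not from the paper), the two construction steps can be sketched for a 2-layer fully connected network of width 3:

```python
from itertools import product

H = 3  # width of the toy network: input, hidden, and output dimension all equal H

# A path in a 2-layer network is (i, j, k): input i -> hidden j -> output k.
# The weight from node a to node b is a skeleton weight iff a == b (diagonal).
def num_nonskeleton_weights(path):
    i, j, k = path
    return int(i != j) + int(j != k)

all_paths = list(product(range(H), repeat=3))
basis_paths = [p for p in all_paths if num_nonskeleton_weights(p) <= 1]

print(len(all_paths))    # 27
print(len(basis_paths))  # 15 = (#weights) - (#hidden nodes) = 18 - 3
```

The count 15 matches the number of free parameters left after quotienting out one positive rescaling per hidden node.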
Proof of Proposition 1
Proposition 1 (restated)
For any non-basis path p, v_p = ∏_i v_{b_i}^{a_i}, where a_i ∈ {−1, 0, 1} and b_i ranges over the basis-paths.
Proof: We prove this proposition by induction on the number of layers. For ease of reference, we use v_p to denote the value of path p, i.e., v_p = ∏_{l=1}^{L} w^l_p, where w^l_p is the weight of path p at layer l.
(1) If L = 2, hidden node j has an incoming skeleton weight w^1_{jj} and an outgoing skeleton weight w^2_{jj}. Then a non-basis path p = (w^1_{ij}, w^2_{jk}), where w^1_{ij} and w^2_{jk} are non-skeleton weights (i ≠ j, k ≠ j), can be calculated as
v_p = w^1_{ij} · w^2_{jk}  (13)
= (w^1_{ij} · w^2_{jk} · w^1_{jj} · w^2_{jj}) / (w^1_{jj} · w^2_{jj})  (14)
= ((w^1_{ij} · w^2_{jj}) · (w^1_{jj} · w^2_{jk})) / (w^1_{jj} · w^2_{jj})  (15)
= (v_{b_1} · v_{b_2}) / v_s.  (16)
According to the skeleton method, b_1 = (w^1_{ij}, w^2_{jj}), b_2 = (w^1_{jj}, w^2_{jk}), and s = (w^1_{jj}, w^2_{jj}) are all basis-paths. We can see that the basis-path s, which contains no non-skeleton weights, is in the denominator, while the basis-paths b_1 and b_2, each of which contains one non-skeleton weight, are in the numerator. An example is shown in Figure 1 in the main paper.
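The base-case identity can be checked numerically with arbitrary positive weights (the values below are made up for illustration):

```python
# Arbitrary positive weights for the 2-layer base case (illustrative values).
w1_ij, w1_jj = 0.7, 1.3   # layer 1: non-skeleton w1_ij, skeleton w1_jj
w2_jk, w2_jj = 2.1, 0.4   # layer 2: non-skeleton w2_jk, skeleton w2_jj

nonbasis = w1_ij * w2_jk  # value of the non-basis path
b1 = w1_ij * w2_jj        # basis-path with one non-skeleton weight
b2 = w1_jj * w2_jk        # basis-path with one non-skeleton weight
s = w1_jj * w2_jj         # basis-path with only skeleton weights

# The non-basis path value is recovered from basis-path values
# by multiplication and division.
assert abs(nonbasis - b1 * b2 / s) < 1e-12
print("identity holds")
```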
(2) Suppose the proposition holds for an L-layer FNN, i.e., for a non-basis path p containing m non-skeleton weights w_1, ..., w_m,
v_p = (∏_{i=1}^{m} v_{b_i}) / (∏_{j=1}^{m−1} v_{s_j}),
where b_i is the basis-path that contains the non-skeleton weight w_i (provided w_i is a non-skeleton weight; otherwise it does not appear in the numerator) and s_j denotes a basis-path that contains only skeleton weights.
Then for an (L+1)-layer FNN, a non-basis path p = (p′, w^{L+1}), where p′ is its sub-path through the first L layers, can be calculated as
v_p = v_{p′} · w^{L+1} = (∏_{i=1}^{m} (v_{b_i} · w̃_i) · (v_{s̄} · w^{L+1})) / (∏_{j=1}^{m−1} (v_{s_j} · w̃′_j) · (v_{s̄} · w̃_n)),
where w̃_i is the skeleton weight at layer L+1 that connects the basis-path containing w_i, w̃′_j and w̃_n denote the skeleton weights at layer L+1 that connect s_j and the weight w^{L+1}, respectively, and s̄ is the skeleton-only path reaching the node that w^{L+1} leaves. Because w̃′_j and w̃_n are all skeleton weights, each v_{s_j} · w̃′_j (and v_{s̄} · w̃_n) is the value of a basis-path of the (L+1)-layer FNN that contains only skeleton weights. Establishing the above equality also uses the fact that (s̄, w^{L+1}) is a basis-path of the (L+1)-layer FNN, because it contains only one non-skeleton weight, w^{L+1}. Combining these together, every factor in the denominator is a basis-path containing only skeleton weights.
Therefore, we have proved that, in the calculation of non-basis paths, a basis-path containing one non-skeleton weight appears only in the numerator, and a basis-path containing only skeleton weights appears only in the denominator.
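The general pattern can also be verified numerically for a 3-layer network: a non-basis path with three non-skeleton weights equals the product of three basis-paths (one non-skeleton weight each) divided by two skeleton-only basis-paths. The weights below are made up for illustration:

```python
import random

random.seed(0)
# Illustrative 3-layer network of width 3; w[l][a][b] connects node a at
# layer l to node b at layer l+1.  Diagonal entries are skeleton weights.
H = 3
w = [[[random.uniform(0.5, 1.5) for _ in range(H)] for _ in range(H)]
     for _ in range(3)]

i, j, k, l = 0, 1, 2, 0  # a path i -> j -> k -> l whose weights are all non-skeleton
nonbasis = w[0][i][j] * w[1][j][k] * w[2][k][l]

# Basis-paths: each contains exactly one non-skeleton weight.
b1 = w[0][i][j] * w[1][j][j] * w[2][j][j]
b2 = w[0][j][j] * w[1][j][k] * w[2][k][k]
b3 = w[0][k][k] * w[1][k][k] * w[2][k][l]
# Skeleton-only basis-paths appearing in the denominator.
s1 = w[0][j][j] * w[1][j][j] * w[2][j][j]
s2 = w[0][k][k] * w[1][k][k] * w[2][k][k]

assert abs(nonbasis - b1 * b2 * b3 / (s1 * s2)) < 1e-9
print("3-layer identity holds")
```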
Experiments: Additional Material
All experiments were conducted with PyTorch (commit 2b47480). The ResNet implementation can be found at https://github.com/pytorch/vision/. Unless otherwise noted, weights were initialized by sampling from a uniform distribution over (−1/√d, 1/√d), where d is the dimension of the previous layer; this is the default initialization method for linear and convolutional layers in PyTorch.
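A minimal sketch of this default initialization (our own re-implementation for illustration, not the library code):

```python
import math
import random

def default_init(fan_in, fan_out, rng=random):
    """Sample a fan_out x fan_in weight matrix from U(-1/sqrt(d), 1/sqrt(d)),
    where d = fan_in is the dimension of the previous layer."""
    bound = 1.0 / math.sqrt(fan_in)
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(fan_out)]

random.seed(0)
W = default_init(fan_in=784, fan_out=256)
# Every sampled weight lies inside the stated range.
assert all(abs(v) <= 1.0 / math.sqrt(784) for row in W for v in row)
print("init within bounds")
```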
Experiment settings
Empirical Verification
Multi-layer perceptrons with ReLU activations were trained by SGD in these experiments. Training ran for at most 500 epochs. The initial learning rate was set to 0.05 for each model. Exponential decay is a widely used technique when training neural network models [Hinton et al.2012, Ruder2016]; it was applied to the learning rate in these experiments with power 0.01. A mini-batch size of 128 was used.
Recommendation System
For the recommendation system experiments, to evaluate item recommendation performance we employed leave-one-out evaluation, which has been widely used in the literature [Bayer et al.2017, He et al.2017]. For each user, we held out the latest interaction as the test set and used the remaining data for training. We followed the common strategy of randomly sampling 100 items that the user has not interacted with, and ranking the test item among these 100 items.
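The leave-one-out protocol can be sketched as follows (function and variable names are ours, not from the paper's code):

```python
import random

def leave_one_out_split(user_interactions):
    """Hold out each user's latest interaction as the test item; the rest is
    training data.  `user_interactions` maps a user id to a chronologically
    ordered list of item ids."""
    train, test = {}, {}
    for user, items in user_interactions.items():
        train[user] = items[:-1]
        test[user] = items[-1]
    return train, test

def sample_candidates(test_item, interacted, num_items, n_neg=100, rng=random):
    """Rank the held-out item among 100 randomly sampled non-interacted items."""
    negatives = set()
    while len(negatives) < n_neg:
        item = rng.randrange(num_items)
        if item not in interacted and item != test_item:
            negatives.add(item)
    return [test_item] + sorted(negatives)

random.seed(0)
interactions = {0: [5, 2, 9], 1: [1, 7]}  # toy data
train, test = leave_one_out_split(interactions)
print(train[0], test[0])                  # [5, 2] 9
cands = sample_candidates(test[0], set(interactions[0]), num_items=1000)
print(len(cands))                         # 101
```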
Image Classification
For the image classification experiment, we use the original RGB images of the CIFAR-10 dataset with size 3 × 32 × 32. As before, we rescale each pixel value to the interval [0, 1]. We then extract random crops (and their horizontal flips) of size 3 × 28 × 28 pixels and present these to the network in mini-batches of size 128. The training and test loss and the test error are computed only from the center patch (3 × 28 × 28).
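A minimal sketch of this crop-and-flip augmentation in plain Python (the actual experiments presumably used torchvision transforms; this is an illustrative re-implementation):

```python
import random

def random_crop_flip(img, crop=28, rng=random):
    """img: a C x H x W nested list with pixel values already in [0, 1].
    Returns a random crop of size C x crop x crop, horizontally flipped
    with probability 0.5 -- the training-time patch."""
    h, w = len(img[0]), len(img[0][0])
    top = rng.randrange(h - crop + 1)
    left = rng.randrange(w - crop + 1)
    patch = [[row[left:left + crop] for row in ch[top:top + crop]] for ch in img]
    if rng.random() < 0.5:
        patch = [[row[::-1] for row in ch] for ch in patch]  # horizontal flip
    return patch

def center_crop(img, crop=28):
    """The center patch used for the reported loss and error."""
    h, w = len(img[0]), len(img[0][0])
    top, left = (h - crop) // 2, (w - crop) // 2
    return [[row[left:left + crop] for row in ch[top:top + crop]] for ch in img]

random.seed(0)
img = [[[random.random() for _ in range(32)] for _ in range(32)] for _ in range(3)]
patch = random_crop_flip(img)
print(len(patch), len(patch[0]), len(patch[0][0]))  # 3 28 28
```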
We trained 34-layer ResNet and PlainNet models (referred to as resnet34 and plain34 in the original paper, respectively) on this dataset. We trained for 64k iterations with a mini-batch size of 128 and an initial learning rate of 1.0, which was divided by 10 at 32k and 48k iterations, following the practice in the original paper.
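The learning-rate schedule just described (initial rate divided by 10 at the 32k and 48k milestones) can be sketched as:

```python
def learning_rate(iteration, base_lr=1.0, milestones=(32000, 48000), gamma=0.1):
    """Piecewise-constant schedule: multiply the rate by gamma at each
    milestone the iteration counter has passed."""
    lr = base_lr
    for m in milestones:
        if iteration >= m:
            lr *= gamma
    return lr

print(learning_rate(0))       # 1.0
print(learning_rate(40000))   # 0.1
print(learning_rate(60000))   # ~0.01
```

This mirrors what a MultiStepLR-style scheduler does, without depending on the framework.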
Experimental Results on the Influence of the Regularization Coefficient
The regularization coefficient should be selected carefully for each model and norm. As described in the image classification section of the main paper, we performed a wide-ranging grid search for each norm and reported the best performance based on the validation set. In this section, we show how the coefficient affects our basis-path regularization. The results are given in Figure 4. Each result is reported over 5 independent runs with random initialization. Note that too large a coefficient leads to divergence in this setting, while too small a coefficient makes the influence of the regularization nearly disappear. A properly chosen coefficient leads to significantly better accuracy.
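The selection procedure can be sketched as a simple grid search (the grid values and the validate function below are placeholders, not the paper's actual settings):

```python
def grid_search(validate, grid):
    """Pick the regularization coefficient with the best validation accuracy.
    `validate` maps a coefficient to validation accuracy; `grid` is the
    candidate list."""
    best_coef, best_acc = None, float("-inf")
    for coef in grid:
        acc = validate(coef)
        if acc > best_acc:
            best_coef, best_acc = coef, acc
    return best_coef, best_acc

# Illustrative: accuracy peaks at a moderate coefficient and degrades at the
# extremes (too large diverges, too small has no regularization effect).
toy_accuracy = {1e-1: 0.10, 1e-2: 0.91, 1e-3: 0.93, 1e-4: 0.90, 1e-5: 0.89}
coef, acc = grid_search(toy_accuracy.get, [1e-1, 1e-2, 1e-3, 1e-4, 1e-5])
print(coef, acc)  # 0.001 0.93
```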