1 Introduction
Deep neural networks (DNNs) have become the method of choice owing to their great success in a plethora of machine learning tasks, such as image classification and segmentation (Krizhevsky et al., 2012), speech recognition (Hinton et al., 2012), natural language processing (Collobert et al., 2011), reinforcement learning (Mnih et al., 2015) and various other tasks (Schmidhuber, 2015; LeCun et al., 2015). It is known that the depth and width of a network play a key role in its learning abilities. Although multiple architectures of DNNs exist, such as recurrent neural networks (RNNs)
(Hochreiter and Schmidhuber, 1997) and recursive nets (Socher et al., 2011), for the discussion in this paper we focus on the feedforward architectures of DNNs. In the works of (Hornik, 1991; Cybenko, 1989) it was shown that a single-hidden-layer network, or a shallow architecture, can approximate any measurable Borel function given a sufficient number of neurons in the hidden layer. However, it was recently shown by (Montufar et al., 2014) that a deep network can divide the space into an exponential number of sets, a feat which cannot be achieved by a shallow architecture with the same number of parameters. Similarly, it was shown by (Telgarsky, 2016) that for a given depth and number of parameters there exists a DNN that can only be approximated by a shallow network whose number of parameters is exponential in the number of layers. (Cohen et al., 2016) conclude that functions that can be implemented by DNNs are exponentially more expressive than functions implemented by a shallow network. These theoretical results showcasing the expressiveness of DNNs have been backed up empirically, with deep architectures being the current state-of-the-art in multiple applications across various domains. Many researchers have studied the effect of depth and width on the performance of deep architectures. It is known that increasing the depth or width increases the number of parameters in the network, and often this number can be much larger than the number of samples used to train the network itself. These networks are currently trained using stochastic gradient descent (SGD). With such a huge number of parameters, the obvious question to ask is: why do these machines learn effectively? Researchers have tried to answer this question by proving statistical guarantees on the learning capacities of these networks. Multiple complexity measures have been proposed in the literature, namely the Vapnik-Chervonenkis (VC) dimension
(Vapnik and Vapnik, 1998) and its related extensions like the pseudo-dimension, the fat-shattering dimension (Anthony and Bartlett, 2009) and radius-margin bounds (Burges, 1998; Smola and Schölkopf, 1998), Rademacher complexity (Bartlett and Mendelson, 2002) and covering numbers (Zhou, 2002), to name a few. All these measures define a number that characterizes the complexity of the hypothesis class, which in this case is the neural network. The most popular among these is the VC dimension, which is the size of the largest set that can be shattered by the given hypothesis class. (Bartlett et al., 1999) provided VC dimension bounds for piecewise-linear neural networks. (Karpinski and Macintyre, 1995; Sontag, 1998; Baum and Haussler, 1989) gave VC bounds for general feedforward neural networks with sigmoidal nonlinear units. These bounds are defined with respect to the number of parameters and are in general quite large.
Shalev-Shwartz and Ben-David (2014) presented bounds which are linear in the number of trainable parameters. These bounds grow with the width and depth of the network and fail to explain the unreasonable effectiveness of depth in neural networks. (Bartlett, 1998) showed that the VC dimension of a network can be bounded by the norm of the parameters, or weights, rather than their number. The norm of the weights can be much smaller than the number of weights; thus this bound offers a rationale for minimizing the norm of the weights. (Neyshabur et al., 2015) presented Rademacher-complexity-based bounds for deep networks in terms of the norm of the weights and the number of neurons per layer. (Sun et al., 2016) also presented Rademacher average bounds for multiclass convolutional neural networks with pooling operations in terms of the norm of the weights and the size of the pooling regions.
(Xie et al., 2015) showed that a mutual angular regularizer (MAR) can greatly improve the performance of a neural network. They showed that increasing the diversity of hidden units in a neural network reduces the estimation error and increases the approximation error. The authors also presented generalization bounds in terms of Rademacher complexity. However, as mentioned in (Kawaguchi et al., 2017), the dependency on the depth of the network is exponential. (Sokolic et al., 2017) presented generalization bounds in terms of the Jacobian matrix of the network and showed better performance of networks when presented with a smaller number of samples. They provide theoretical justification for the contractive penalty used in (An et al., 2015; Rifai et al., 2011) by explaining the effect of Jacobian regularization on the input margin. Currently, neural networks are regularized using Dropout (Srivastava et al., 2014) or Dropconnect (Wan et al., 2013)
in conjunction with weight regularization. Dropout randomly drops neurons to prevent their co-adaptation, while Dropconnect randomly drops connections to achieve a similar objective. Both methods can be thought of as ensemble averaging of multiple neural networks, done through the simple technique of using Bernoulli gating random variables to remove certain neurons or weights, respectively. The properties of Dropout are studied in (Baldi and Sadowski, 2014), while a Rademacher complexity analysis of Dropconnect appears in (Wan et al., 2013). There have also been performance analyses of various architectures of feedforward neural networks. One such architecture is the residual network (resnet) (He et al., 2016a), whose analysis is presented in (He et al., 2016b). It uses a direct, or identity, connection from the previous layer to the next and allows very deep architectures to be trained effectively with a minimal vanishing-gradient problem (Bengio et al., 1994). Several variants of residual networks have been proposed, namely wide residual networks (Zagoruyko and Komodakis, 2016), inception residual networks (Szegedy et al., 2017) and the generalization of residual networks known as highway networks (Srivastava et al., 2015).
Our contributions in this work are as follows. First, we present radius-margin bounds on feedforward neural networks. Then, we show the bounds for Dropout and Dropconnect, and show that these regularizers bring down the expected sample complexity for deep architectures. Next, we present the margin bounds for residual architectures. Furthermore, we compute the radius-margin bound for an input noise-robust algorithm and then show that the Jacobian regularizer along with the hinge loss approximates the input noise-robust hinge loss. Finally, we hint at the fact that enlarging the input margin of a neural network, via minimization of the Jacobian regularizer, is required to obtain an input noise-robust loss function. To our knowledge this is one of the first efforts to show the effectiveness of various regularizers in bringing down the sample complexity of neural networks using radius-margin bounds for a margin-based loss function. In this paper we make use of the binary-class support vector machine (SVM) (Cortes and Vapnik, 1995) loss function at the output layer. This can also be generalized to any margin-based loss function in both binary and multiclass settings.

2 Preliminaries
Given a binary-class problem, we denote by $\mathcal{X} \subseteq \mathbb{R}^d$ the input set and by $\mathcal{Y} = \{-1, +1\}$ our label set. Here $d$ is the dimensionality of the input pattern. The training set is defined as $S = ((x_1, y_1), \ldots, (x_m, y_m))$, which is a finite sequence of pairs in $\mathcal{X} \times \mathcal{Y}$. Let $\mathcal{D}$ denote the probability distribution over $\mathcal{X} \times \mathcal{Y}$. The training set is sampled i.i.d. from the distribution $\mathcal{D}$. Let $\mathcal{H}$ denote the hypothesis class. The goal of the learning problem, or of a learning algorithm, is to find a function $f \in \mathcal{H}$. Let $\gamma$ denote the margin of the classifier, which is given by:
Now, consider a feedforward neural network with one input layer, $L$ hidden layers and one output neuron. Let the number of neurons in layer $l$ be $n_l$, where $n_0 = d$ denotes the dimension of the input sample and $n_{L+1}$ denotes the number of neurons in the output layer; the number of units in the output layer is one for binary classification. Let $W^l$ denote the weights going from layer $l-1$ to layer $l$, such that $W^l_j$ denotes the weights going from layer $l-1$ to neuron $j$ in layer $l$. Let $\sigma$ denote the activation function, which is a Rectified Linear Unit (ReLU), tanh or any Lipschitz-continuous activation function passing through the origin, applied to each of the neurons in the hidden layers, with a linear activation function in the output layer. We keep the norm of the inputs bounded by $R$, i.e., $\|x\| \le R$, and the norm of the weights going from layer $l-1$ to layer $l$ bounded by $B_l$, i.e., $\max_j \|W^l_j\| \le B_l$. The function computed by the network in layer $l$ is given by $f^l(x) = \sigma(W^l f^{l-1}(x))$, where $f^0(x) = x$ and $\sigma$ is applied elementwise. Thus, the hypothesis class of feedforward neural networks with $L$ hidden layers and the norm of the weights bounded by $B_l$ is given by:
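As a concrete illustration, the layer-wise composition just described can be sketched in a few lines of numpy; the layer sizes and random weights below are illustrative assumptions, not the paper's notation:

```python
import numpy as np

def forward(x, weights, act=lambda z: np.maximum(z, 0.0)):
    """Forward pass of a feedforward network of the kind described
    above: hidden layers use an origin-passing activation (ReLU here),
    and the single output neuron is linear."""
    h = x
    for W in weights[:-1]:           # hidden layers
        h = act(W @ h)
    return float(weights[-1] @ h)    # linear output layer

# A hypothetical 2-16-16-1 network with randomly drawn weights.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((16, 2)),
           rng.standard_normal((16, 16)),
           rng.standard_normal((1, 16))]

# With no biases and an origin-passing activation, the network itself
# passes through the origin: forward(0) == 0.
assert forward(np.zeros(2), weights) == 0.0
score = forward(np.array([0.5, -1.0]), weights)
```

The absence of bias terms matches the origin-passing assumption used throughout the bounds below.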
Lipschitz property: A function $\sigma : \mathbb{R} \to \mathbb{R}$ is $\mu$-Lipschitz over $\mathbb{R}$ if for every $z_1, z_2 \in \mathbb{R}$ we have:

$|\sigma(z_1) - \sigma(z_2)| \le \mu\,|z_1 - z_2|$ (1)
ReLU and Leaky ReLU are Lipschitz-continuous functions with Lipschitz constant $1$. Likewise, sigmoid and hyperbolic tangent are Lipschitz-continuous functions with Lipschitz constants $1/4$ and $1$ respectively. We will focus mostly on ReLU and on activation functions which pass through the origin, like Leaky ReLU and tanh.
For activation functions passing through the origin, setting $z_2 = 0$ in eq. 1 shows that for all $z \in \mathbb{R}$,

$|\sigma(z)| \le \mu\,|z|$ (2)
$0$-$1$ loss: The $0$-$1$ loss for a labeled sample $(x, y)$ and predictor $f$ is given by:

$\ell_{0\text{-}1}(f; (x, y)) = \mathbb{1}[y f(x) \le 0]$

True risk of the $0$-$1$ loss: The true risk of the prediction rule $f$ is defined as:

$L_{\mathcal{D}}(f) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\ell_{0\text{-}1}(f; (x, y))\right]$

Empirical risk of the $0$-$1$ loss: The empirical risk of the prediction rule $f$ is defined as:

$L_S(f) = \frac{1}{m} \sum_{i=1}^{m} \ell_{0\text{-}1}(f; (x_i, y_i))$

Hinge loss: The empirical $0$-$1$ risk is difficult to optimize owing to its non-convex nature. The hinge loss satisfies the requirements of a convex surrogate for the $0$-$1$ loss. The hinge loss is defined as:

$\ell_{\text{hinge}}(f; (x, y)) = \max(0,\, 1 - y f(x))$

Clearly, $\ell_{0\text{-}1}(f; (x, y)) \le \ell_{\text{hinge}}(f; (x, y))$.

Empirical risk of the hinge loss: The empirical risk of $\ell_{\text{hinge}}$ is defined as:

$\hat{L}_S(f) = \frac{1}{m} \sum_{i=1}^{m} \max(0,\, 1 - y_i f(x_i))$
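A minimal sketch of these definitions, checking on a toy sample (the labels and scores below are made-up numbers) that the empirical hinge risk upper-bounds the empirical $0$-$1$ risk:

```python
import numpy as np

def zero_one_loss(y, score):
    # 0-1 loss: 1 when the sign of the prediction disagrees with y.
    return float(y * score <= 0)

def hinge_loss(y, score):
    # Convex surrogate: max(0, 1 - y * f(x)).
    return max(0.0, 1.0 - y * score)

ys = [1, 1, -1, -1]                # labels
scores = [2.0, 0.3, -0.5, 0.2]     # hypothetical outputs f(x_i)

emp_01 = np.mean([zero_one_loss(y, s) for y, s in zip(ys, scores)])
emp_hinge = np.mean([hinge_loss(y, s) for y, s in zip(ys, scores)])

# The hinge loss dominates the 0-1 loss pointwise, so the same holds
# for the empirical risks.
assert emp_hinge >= emp_01
```

Here only the last sample is misclassified, yet three of the four samples contribute to the hinge risk because they fall inside the margin.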
3 Radius-margin bounds
In this section we provide radius-margin bounds for feedforward neural networks, including networks with regularizers like Dropout and Dropconnect. The reader should refer to the Appendix for the proofs of the theorems mentioned in the main text. The upper bound on the VC dimension for a training set which is fully shattered with output margin $\gamma$ by a function from the hypothesis class of neural networks with $L$ hidden layers, $n_l$ neurons in layer $l$, a $\mu$-Lipschitz activation function passing through the origin and the norm of the weights constrained by $B_l$ for all layers $l$ is given by:
(3) 
The bound given in eq. 3 captures the dependence of the VC dimension on the radius of the data and on the product of the max-norm terms with the number of neurons per layer. There is always a dependence on the depth of the network through the number of product terms included in eq. 3. The bound has several implications:

The bound is independent of the dimensionality of the input data; it depends only on the radius of the data.

The bound is independent of the number of weights; it instead depends on the max-norm of the weights.

The bound depicts the key role of depth for deep networks. Increasing the depth of the network does not always increase the VC dimension: if the per-layer product term in eq. 3 is less than one for every layer, then the capacity of the network decreases as the depth increases; on the other hand, if this product is greater than one for every layer, then the network capacity increases with depth. Thus, by changing the number of neurons and the max-norm constraints on the weights, one can alter the capacity of the network to a desired value.

Keeping the number of neurons in the hidden layers fixed and using the ReLU activation function, for which $\mu = 1$, we get a VC bound similar to Theorem 1 of (Neyshabur et al., 2015):

(4)

The bound presented in eq. 4 shows that, keeping the number of neurons fixed in each layer, the VC dimension of the hypothesis class of neural networks can be controlled by changing the max-norm constraint on the weights of the network. However, the exponential dependency on the depth cannot be avoided.
Effect of Dropout: We now show the effect of Dropout on the same network, where we multiply each neuron in layer $l$ by a Bernoulli selector random variable for each sample. Every selector random variable takes the value $1$ with probability $1 - p_l$ and $0$ with dropout probability $p_l$ for each layer $l$, independently of the others. The dropout mask for each layer and each sample is the vector of these Bernoulli variables. The new hypothesis class of the neural network is given as:

Here, $\odot$ represents element-wise multiplication. For the same network as mentioned in Theorem 3, with dropout added to each layer $l$ with dropout probability $p_l$, the VC dimension is bounded by:
(5) 
Effect of Dropconnect: We now show the effect of Dropconnect on the same network as mentioned in Theorem 3, where we multiply the individual elements of the weight matrix by the elements of a matrix of i.i.d.-drawn Bernoulli selector random variables, for every layer and every sample. Each Bernoulli variable takes the value $1$ with probability $1 - p_l$ and $0$ with probability $p_l$. The hypothesis class of feedforward neural networks with the Dropconnect regularizer is given by:

Here, $\odot$ represents element-wise multiplication.

For the same network as mentioned in Theorem 3, with Dropconnect added to each layer $l$ with Dropconnect probability $p_l$, the VC dimension is bounded by:
(6) 
Implications of Dropout and Dropconnect: The two bounds presented in eq. 5 and eq. 6 are equivalent. Thus the two techniques bring down the capacity of the network, preventing problems like overfitting. The reason these two methods outperform other kinds of regularizers is that they act like an ensemble of networks and allow representations to be learned from a smaller number of neurons or weights at each iteration. Details of this interpretation are given in Srivastava et al. (2014) and Wan et al. (2013).
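The two mask constructions can be contrasted in a few lines of numpy; the layer shapes and drop probability below are illustrative assumptions. Dropout draws one Bernoulli variable per neuron, while Dropconnect draws one per weight:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 3))   # weights from layer l-1 to layer l
h = rng.standard_normal(3)        # activations of layer l-1
p = 0.5                           # drop probability (illustrative)

# Dropout: one Bernoulli(1 - p) variable per neuron of layer l-1;
# dropping a neuron zeroes a whole column of the effective weights.
neuron_mask = rng.binomial(1, 1 - p, size=h.shape)
dropout_out = W @ (neuron_mask * h)

# Dropconnect: an i.i.d. Bernoulli(1 - p) mask over every weight entry.
weight_mask = rng.binomial(1, 1 - p, size=W.shape)
dropconnect_out = (weight_mask * W) @ h

# Both produce a valid pre-activation for layer l; in expectation each
# scales the clean pre-activation W @ h by (1 - p).
assert dropout_out.shape == dropconnect_out.shape == (4,)
```

The shared $(1-p)$ scaling in expectation is what makes the two capacity bounds in eqs. 5 and 6 coincide.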
Bounds for a resnet architecture: Consider a generic resnet with residual blocks, each having a fixed number of residual units per block. Each residual unit consists of an activation function followed by a convolution (cv) layer, followed by dropout, followed by another cv layer. The final output of the cv layers is added to the output of the previous layer. We use a cv layer after the input to increase the number of filters. After each resnet block, we have a cv layer, a max-pool layer or an average-pool layer for dimensionality reduction. For our discussion, we use a cv unit for dimensionality reduction rather than max-pool or average-pool. After the residual blocks we have fully connected layers with dropout. Lastly, we have the classifier layer, and the hinge loss is applied to the classifier layer.
Consider the input data $x$. Let the number of filters in each cv layer of a block, and the size and strides of the filters for the cv layers in those blocks as well as in the dimensionality-reduction blocks, be fixed. The convolution function takes the filter size, number of filters, strides and padding as parameters alongside the input; these are not shown for brevity. The output of a residual unit in a block is given by the sum of the unit's two-cv-layer transformation and the unit's input. The bound for a residual network as described above is given by:
(7) 
Implications of the VC bound for the resnet architecture: The bound given in eq. 7 depends on the max-norm of the weights, the size of the filters in each block, the dropout probability, the number of blocks, the number of residual units per block and the Lipschitz constant of the activation function. It shows that the bound increases exponentially in the number of residual units per block, which is expected, since increasing the number of residual units increases the capacity of the network.
3.1 Robustness to input noise
Robustness measures the variation of the loss function w.r.t. the input. (Xu and Mannor, 2012) presented generalization bounds for robust algorithms in terms of Rademacher averages. The idea that a large margin implies robustness was applied to deep networks in (Sokolic et al., 2017). Here, we present the idea of the robustness of an algorithm in terms of the VC dimension by incorporating the notion of noise added to a sample such that its label remains unchanged. Theorem 3.1 shows that for a robust algorithm the VC dimension is larger than for a non-robust algorithm.

Consider the set of noise vectors that can be added to the samples such that the label of each perturbed sample remains unchanged. The VC bound for the resulting hypothesis class is given by:
Gradient regularization: Consider the input noise-robust loss function,

We now use the first-order Taylor approximation of the network around the input to get,
(8) 
Using eq. 8 the objective function can be written as:
(9) 
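An objective of the form in eq. 9, hinge loss plus a penalty on the norm of the input Jacobian, can be sketched numerically. The tiny network, its weights and the penalty coefficient lam below are hypothetical, and the Jacobian is estimated by central finite differences rather than backpropagation:

```python
import numpy as np

def f(x, w):
    # Hypothetical one-hidden-layer ReLU network with a linear output.
    return float(w["v"] @ np.maximum(w["W"] @ x, 0.0))

def input_jacobian(x, w, eps=1e-5):
    # Central finite-difference estimate of df/dx (a row vector,
    # since the output is a scalar).
    return np.array([(f(x + eps * e, w) - f(x - eps * e, w)) / (2 * eps)
                     for e in np.eye(len(x))])

def regularized_loss(x, y, w, lam=0.1):
    # Hinge loss plus the Jacobian-norm penalty.
    hinge = max(0.0, 1.0 - y * f(x, w))
    return hinge + lam * np.linalg.norm(input_jacobian(x, w))

# With identity-like weights both hidden units are active at x, so the
# Jacobian is v @ W = [1, 1] and the hinge term is max(0, 1 - 3) = 0.
w = {"W": np.eye(2), "v": np.ones(2)}
x = np.array([1.0, 2.0])
```

In practice the Jacobian would be computed by automatic differentiation; the finite-difference form is used here only to keep the sketch dependency-free.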
This term is the norm of the Jacobian matrix of the deep neural network (DNN) with respect to its input. We now show that minimizing this term is equivalent to maximizing the input margin of the DNN. Input and output margin: The input margin of a sample is defined as:
(10) 
whereas, the output margin of the sample is given as:
(11) 
Using Theorem 3 and Corollary 2 of Sokolic et al. (2017) and the Lebesgue differentiation theorem, we get,
(12) 
Assume that the point lies on the decision boundary; then the network output at that point equals zero. Using this fact, one can write:
(13) 
Using eqs. 13, 10 and 11 in eq. 12 we get,
(14) 
Let the input domain be a convex set; then from eq. 12 and eq. 14 we can write,
(15) 
From eq. 15 we see that minimizing the norm of the Jacobian matrix amounts to increasing the lower bound on the input margin, whereas eq. 9 shows that minimizing the norm of the Jacobian along with the hinge loss approximates the input noise-robust hinge loss function. Together, these hint at the fact that maximizing the input margin is required to obtain an input noise-robust deep architecture.
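The linear case makes this relation concrete: for $f(x) = w \cdot x$ the input Jacobian is $w$ itself, and with the output margin held fixed the input margin is inversely proportional to the Jacobian norm. The weight vectors below are made-up numbers for illustration:

```python
import numpy as np

def input_margin(w, fx):
    # Distance of x to the decision boundary of the linear classifier
    # f(x) = w . x, given the output value fx = f(x): |f(x)| / ||w||.
    return abs(fx) / np.linalg.norm(w)

# Two classifiers achieving the same output margin |f(x)| = 1:
m_large_jac = input_margin(np.array([3.0, 4.0]), 1.0)  # ||w|| = 5
m_small_jac = input_margin(np.array([0.3, 0.4]), 1.0)  # ||w|| = 0.5

# The smaller Jacobian norm yields the larger input margin.
assert m_small_jac > m_large_jac
```

This is the linear special case of the lower bound in eq. 15: shrinking the Jacobian norm while keeping the output margin fixed enlarges the region of input perturbations the classifier tolerates.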
4 Conclusion
This paper studies radius-margin bounds for deep architectures, both fully connected and residual convolutional networks, in the presence of a hinge loss at the output. We show that the capacity of a deep architecture can be bounded by the number of neurons, the filter size for each layer, the Dropout or Dropconnect probability and the max-norm of the weights. We also hint at the equivalence between minimizing the norm of the Jacobian matrix of the network and robustness to input perturbations. We show that minimizing the norm of the Jacobian leads to a network with a large input margin, which in turn makes the network robust to perturbations in the input space. In the future, we would like to study the effect of weight quantization on the VC dimension bounds of deep architectures.
We would like to acknowledge support for this project from Safran Group.
Appendix
Proof of Theorem 3
Proof: Since the set is fully shattered by the hypothesis class, for every labeling $y \in \{-1, +1\}^m$ there exists a function in the class such that,
(16) 
Summing up these inequalities yields,
Since the inequality holds for every labeling $y$, it also holds in expectation over $y$ drawn i.i.d. according to a uniform distribution over $\{-1, +1\}^m$. Since the distribution is uniform, we have $\mathbb{E}[y_i y_j] = 1$ if $i = j$ and $\mathbb{E}[y_i y_j] = 0$ otherwise. This gives, after applying Jensen's inequality,
(17) 
Now, we prove the bound on the remaining term.
Let $j$ denote the index of the entry with maximum absolute value in the vector.
Using eq. 2 we get,
Applying this recursively down to layer 1, we get,
(18)  
(19) 
Using eq. 19 in eq. 17, we get,
Proof of Theorem 3
Proof: Since the dropout mask is a vector of random variables for each sample and each layer, we must take expectations over every random variable present in order to determine the expected VC dimension of the network. Following eq. 16 we get,
(20) 
Since the inequality holds for every labeling and every dropout mask, it also holds in expectation over both. The distribution over each mask entry is Bernoulli, so its expectation equals the keep probability. This gives,
Applying Jensen’s inequality,
Using this fact, we get,
(21) 
Now, we prove the bound on