1 Introduction
A large number of metrics [largemetric1; largemetric2; largemetric3] have been proposed over the years which aim to gauge the magnitude of a vector. Most of these approaches operate in Euclidean space and have been grouped together as norms. Though each of these norm functions is formulated in a way most suitable for its intended objective, they have been used in varied forms in the deep learning community, often serving as losses and, at times, as regularizers. Hence, the motivation behind each regularization is quite different. Regularizers which directly alter the parameters of the function estimator utilize these functions heavily. In the last few decades, many additional constraints have been incorporated into learning algorithms to enforce smoothness
[reg_smoothness] and sparsity. For instance, the $\ell_1$ regularizer enforces sparsity, while the $\ell_2$ regularizer bounds the magnitude of the parameters and introduces continuity in the parameter space. Another important dimension which has been explored is controlling the sparsity of the network. This approach follows the Bayesian paradigm, in which priors that provide more information about the parameter space than the traditional ones are used. In our case, the prior accentuates the part of the parameter space which is significant and masks the part which is insignificant. [group_sparse_reg] analyzed the effect of group-level sparsity on deep neural networks and proposed a generalized approach for jointly optimizing network weights, network architecture, and feature selection.
[benifits_group_sparsity] discussed the advantages of strong group sparsity using the theory of the group lasso and also provided theoretical justification for using a group-sparse regularizer. [bayesian_compression] discussed a method to optimize the model by enforcing constraints on the sparsity of the network, achieving state-of-the-art results in terms of model compression. In Bayesian statistical inference, a prior (represented in the form of a probability distribution) is a device used to express one's belief about a quantity before any evidence is considered. Using Bayes' theorem, we can calculate the posterior probability distribution, which takes the evidence (data) into account. When definite information about a variable is known (e.g., its empirical expectation and variance from previous experiments), it can be expressed using an informative prior. On the other hand, if only a little is known about the range of the variable, this can be expressed in the form of a weakly informative prior. The third class of priors, called uninformative priors, are used to lay equal bets over all plausible outcomes, for instance, when the quantity under consideration is only known to be positive or within a limit. In this work, we propose a new form of regularizer based on informative priors. We make use of a prior which dynamically adapts to the weights during training. The prior is constantly updated as the probability distribution of the parameters keeps changing; the parameters, in turn, are updated by sampling this probability distribution. To compensate for abrupt shifts in the distribution, we dampen its effect by introducing a momentum term which restricts sudden changes in the parameters by combining their previous and current instances in a convex manner. We evaluate our results on the MNIST handwritten digit dataset and the CIFAR10 natural image dataset. Further sections of this paper describe our methodology and results in greater detail.
2 Methodology
2.1 MAP Estimate of an Informative Prior
We can investigate the formulation of the regularizer by assuming our estimation function to be a linear regressor, parameterized by a weight vector $W$.
The probability distribution of the parameter space characterized by $W$ can be formulated in the Bayesian paradigm as shown in Eqn. 1. Using this formulation, it is possible to localize a point in the parameter space by maximizing the overall probability on the left-hand side.
$P(W \mid X, Y) = \dfrac{P(Y \mid X, W)\,P(W)}{P(Y \mid X)}$ (1)
$W_{\mathrm{MAP}} = \arg\max_{W}\; P(Y \mid X, W)\,P(W)$ (2)
Taking the negative log-likelihood on both sides splits the right-hand side of Eqn. 2 into two terms. One term is responsible for learning, and the other is responsible for enforcing our beliefs about the parameters, which, in effect, improves the generalization ability of the model. The resulting form is the maximum a posteriori estimate of the parameters.
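The decomposition above can be sketched numerically. The snippet below is an illustrative toy (names such as `neg_log_posterior` are ours, not the paper's): the negative log-posterior of a linear regressor with a Gaussian likelihood splits into a data-fit term and a prior (regularization) term.

```python
import numpy as np

def neg_log_posterior(W, X, y, neg_log_prior, sigma=1.0):
    """-log P(W | X, y) up to a constant: data term plus prior term."""
    residual = y - X @ W
    data_term = (residual @ residual) / (2.0 * sigma ** 2)  # from -log P(y | X, W)
    prior_term = neg_log_prior(W)                           # from -log P(W)
    return data_term + prior_term

# A Gaussian prior on W recovers the familiar L2 (weight-decay) penalty:
l2_prior = lambda W, lam=0.1: lam * np.sum(W ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
W_true = np.array([1.0, -2.0, 0.5])
y = X @ W_true                                  # noiseless data: residual is zero
print(neg_log_posterior(W_true, X, y, l2_prior))  # only the prior term remains
```

Swapping `l2_prior` for another negative log-prior changes only the second term, which is exactly the slot the proposed informative prior occupies.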
We assume a prior in which the current instance of the parameters is sampled from a probability distribution provided by the previous state of the parameter space. The probability distribution is a mixture of a softmax and a uniform distribution which, due to the threshold factor ($\tau$), results in a multiple-spike-and-slab kind of distribution that is strictly controlled by the dynamically changing parameters. The prior depends on the distribution of the weights themselves, which makes it an informative prior and explicit in comparison to Gaussian or Laplacian priors.
(3)
In all the above equations, $\sigma_y$ and $\sigma_W$ refer to the standard deviation of the prediction and of the parameters, respectively. $p_i$ denotes the probability of weight $w_i$ being sampled in an experiment, $E$ is the set of experiments over which the summation is taken, and $\tau$ is a hyperparameter used to threshold the probability distribution, such that in each experiment the weights with $p_i > \tau$ are selected in the projected space. Following the schema of maximum a posteriori estimation, we arrive at the final formulation of the loss function, which is passed to the optimization algorithm and optimized over the parameter space $W$.
(4)
As the shift in the probability distribution of the parameter space is dynamic, we dampen its effect by introducing the notion of momentum, parameterized by a variable $\gamma$. The introduction of momentum reduces the inter-batch variance, which results in a smoothing of the loss manifold.
$\tilde{p}^{(t)} = \gamma\,\tilde{p}^{(t-1)} + (1-\gamma)\,p^{(t)}$ (5)
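The damping step can be sketched as follows. This is an illustrative sketch, not the paper's exact update rule: the current sampling distribution is blended with its previous instance through a convex combination controlled by the momentum term `gamma`.

```python
import numpy as np

def damped_distribution(p_prev, p_new, gamma=0.9):
    """Convex combination of previous and current sampling distributions."""
    p = gamma * p_prev + (1.0 - gamma) * p_new  # momentum-damped update
    return p / p.sum()                          # renormalize to a distribution

p_prev = np.array([0.25, 0.25, 0.25, 0.25])     # previous distribution
p_new  = np.array([0.70, 0.10, 0.10, 0.10])     # abrupt shift after an update
p = damped_distribution(p_prev, p_new)
print(p)  # stays close to p_prev and still sums to 1
```

A large `gamma` keeps the distribution close to its previous state, which is what restricts sudden changes in the sampled parameters.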
2.2 Convexity and Projection Intuition
The proposed metric can be treated as a special case of group sparsity, where the groups are projections of the parameter vector into a lower-dimensional space.
The overall heuristic of the process is as follows: we reduce the dimension of the parameter vector by projecting it onto a lower-dimensional space. This reduction is done by sampling the axes, controlled by the sampling sparsity $s$ (1 − sparsity, or the number of ones in a sampled mask vector), and it reduces the dimensionality drastically. The choice of axes is based on a prior probability distribution and, in effect, favors parameter entries with high magnitude to be present in the sampled vector, as expressed in Eqn. 4. Now consider the combined effect of sparse sampling with a sparse parent vector. The resulting sampled vectors are predominantly of two types: vectors with high-magnitude entries, which have a high norm, and vectors with low or zero-magnitude entries, which have a small or negligible norm, as indicated in Fig. 4. When the norms of all the sampled vectors are summed, the significant entries of the parent vector end up over-represented.
Based on the Bayesian framework of Section 2.1, the projection of the parameter vector into a lower dimension is done through thresholded sampling. In the extreme case of sampling where we project the parameter vector onto the individual basis vectors composing the space, we end up with the $\ell_1$ norm of the vector. Numerically, our proposed metric can be compared to a fractional norm, thus proving that the resulting manifold is convex if the parameter space is convex. While projecting the vector into a lower-dimensional space, the probability of picking a single parameter is higher if the value of that parameter is high. This exercise reduces the effective parameter space to a smaller set that only takes into account significant parameter values. This smaller parameter space is more significant and leads to an informative and robust prior.
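The projection heuristic described above can be sketched as follows. This is a hedged toy (the mask size `k` and the magnitude-based sampling probabilities are our illustrative choices): axes are sampled with probability biased toward high-magnitude weights, and the L2 norms of the projected sub-vectors are averaged, so large entries of the sparse parent vector dominate the penalty.

```python
import numpy as np

rng = np.random.default_rng(1)
W = np.array([5.0, 0.0, 0.01, 4.0, 0.0, 0.02])  # sparse parent vector
n, k, n_experiments = len(W), 2, 100            # k = number of sampled axes

p = np.abs(W) / np.abs(W).sum()                 # magnitude-biased prior over axes
total = 0.0
for _ in range(n_experiments):
    axes = rng.choice(n, size=k, replace=False, p=p)
    mask = np.zeros(n)
    mask[axes] = 1.0                            # projection onto the sampled axes
    total += np.linalg.norm(mask * W)           # L2 norm of the projected sub-vector

avg = total / n_experiments
print(avg)  # dominated by the two large entries (5.0 and 4.0)
```

Because the two large entries are sampled far more often than the near-zero ones, the averaged norm over-represents them, exactly the behavior described in the text.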
2.3 Comparison of Bounds with $\ell_1$ and $\ell_2$
In this section, we provide a quantitative comparison of different kinds of penalties: the proposed and the traditional penalties. Figure 5 shows the magnitude of the different penalties as the sparsity of the vector on which they act is varied. We are particularly interested in the region where the density (1 − sparsity) of the vector is around 1%. In this region, the manifold generated by the proposed metric is consistently below the traditional penalties; hence, it becomes possible to obtain a better optimum. This would theoretically force the optimization of the parameters to converge on the loss manifold at a different optimum.
The square root in Eqn. 6 is the norm of the vector projected onto a hyperplane in a lower-dimensional space. We sample from an experimental space $E$. In this section, we take a simplified version of our proposed metric by using a thresholded uniform prior for the sampling. Here we do not include the momentum factor, making it feasible to study and compare the bounds. We also keep a count of how many times each dimension of the parent vector (the vector from which the sampling is done, in this case $W$) has been sampled; the sampling probabilities are inversely proportional to the norm of this index counter, as obtained from the probability distribution.
$\Omega(W) = \sum_{e \in E} \sqrt{\sum_{i=1}^{n} m^{(e)}_i\, w_i^2}, \quad m^{(e)} \in \{0,1\}^n$ (6)
Taking the expectation over all the experiments and then using the Cauchy–Schwarz or Jensen's inequality ($\mathbb{E}[\sqrt{X}] \le \sqrt{\mathbb{E}[X]}$), we arrive at the following inequality.
We now find an expression for this expectation over the experiment space with the sparsity constraint $s$ on the weights. Here $|E|$ indicates the number of experiments in the expectation, and $n$ indicates the size of the weight vector $W$.
(7) 
The above Eqn. 7 shows that for any vector size $n$ and adequately sampled projections, the proposed penalty is upper bounded by the $\ell_1$ norm of the same vector. This condition holds within a significant range of the sparsity $s$ and for a carefully selected experimental space $E$ and threshold $\tau$.
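The bound can be spot-checked numerically under the simplifications used in this section (uniform sampling, no momentum; the vector size and density below are our illustrative choices). Each projected sub-vector satisfies $\|v\|_2 \le \|v\|_1 \le \|W\|_1$, so the averaged penalty stays below the $\ell_1$ norm of the parent vector.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, E = 1000, 10, 500                 # vector size, axes per sample, experiments
W = np.zeros(n)
W[rng.choice(n, size=10, replace=False)] = rng.normal(size=10)  # ~1% density

# Averaged L2 norm of uniformly sampled k-dimensional projections of W:
penalty = np.mean([
    np.linalg.norm(W[rng.choice(n, size=k, replace=False)])
    for _ in range(E)
])

l1 = np.abs(W).sum()
print(penalty <= l1)  # True: the sampled penalty is bounded by the L1 norm
```

At around 1% density most sampled projections miss the non-zero entries entirely, which is why the averaged penalty sits far below the $\ell_1$ curve in this regime.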
2.4 Data
We illustrate the performance on multiple datasets with various hyperparameter settings. We make use of the MNIST handwritten digit dataset [mnist_dataset] (MNIST is a benchmark dataset of segmented handwritten digit images of 28x28 pixels; it includes 50,000 training examples and 10,000 testing examples) and the CIFAR10 and CIFAR100 datasets [cifar_dataset] (CIFAR10 and CIFAR100 consist of natural color images of 32x32 pixels with 10 and 100 classes, respectively. CIFAR10 contains 60,000 images, 6,000 per class: 50,000 for training and 10,000 for testing. CIFAR100 contains 60,000 images, 600 per class: 50,000 for training and 10,000 for testing).
2.5 Experiments
We conduct numerous experiments to test the effectiveness of our proposed method. We train popular networks, with varied sizes of the parameter vector, on the MNIST and CIFAR10 datasets while varying the regularization and sparsity. We also perform a class of experiments in which we test the proposed metric as a loss criterion and not just as a regularizer. Table 1 lists the experiments done to illustrate the validity of our algorithm, and Algorithm 1 provides the mathematical description of our method.
Dataset | Regularization | Network | Sampling Sparsity $s$ | Loss Type
MNIST | {$\ell_1$, $\ell_2$, proposed} | 2-layered CNN | {0.01, 0.05, 0.1} | CE
CIFAR10 | {$\ell_1$, $\ell_2$, proposed} | VGG11, ResNet18, SENet18 | {0.01, 0.05, 0.1} | CE
CIFAR100 | – | VGG11, ResNet18 | – | {CE, Projected CE}
2.5.1 Training
Our method was tested with custom convolutional neural networks (CNNs) with 2 hidden layers, along with standard deep networks such as VGG11 [vgg], ResNet18 [resnet], and DenseNet121 [densenet]. In all cases, training was performed using the Adam optimizer [adam] with a learning rate of 0.001 and a batch size of 32.
2.5.2 Projected Cross Entropy
Using the function defined in Eqn. 6 for the calculation of the proposed regularizer, we use it to revise the cross-entropy loss. The thresholded probability distribution governs the selection of softmax outputs: activations of the final layer replace the role of the weights in the definition of the probability distribution (Eqn. 4). The performance of a machine learning task with sparse output is greatly affected by the true positives. A projected cross-entropy improves the performance of a neural network trained on sparse outputs, as it drives the true positives through its focused attention on the axes with the higher activations.
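A minimal sketch of this idea is shown below, assuming a thresholded softmax as the projection step (the threshold `tau` and the rule of always keeping the true class are our illustrative choices; the paper's exact formulation may differ):

```python
import numpy as np

def projected_cross_entropy(logits, target, tau=0.05, eps=1e-12):
    """Cross-entropy restricted to high-activation output axes."""
    z = np.exp(logits - logits.max())
    probs = z / z.sum()            # softmax over the output layer
    keep = probs > tau             # project onto the high-activation axes
    keep[target] = True            # never drop the true class
    proj = probs * keep
    proj = proj / proj.sum()       # renormalize on the surviving axes
    return -np.log(proj[target] + eps)

logits = np.array([4.0, 0.5, 0.2, 3.5, 0.1])
loss = projected_cross_entropy(logits, target=0)
print(loss)  # loss computed only over the axes that survive the threshold
```

Because the low-activation axes are dropped before the loss is computed, the gradient flows back only through the few nodes that should be active, matching the motivation above.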
2.5.3 Evaluation
Evaluation of the proposed method was done by comparing the sparsity of the parameters (not to be confused with the sampling sparsity $s$ used when sampling from the parameter vector), the magnitude of the parameters, the overall accuracy, and the loss obtained at each training iteration. Each plot provides a comparison between the different types of regularizers for a particular metric, where the metrics are loss, accuracy, magnitude, and sparsity.
Weight Magnitude and Sparsity: We choose sparsity and magnitude as evaluation metrics for the following reasons. First, regularizers have conventionally been linked to the sparsity of the parameters: the $\ell_1$ norm induces direct sparsity, while the $\ell_2$ norm induces continuity and boundedness on the parameters; both, in turn, have the effect of reducing the magnitude of the parameters. Second, sparsity plays a major role in the interpretability and pruning of a network. While we make no claims in this paper about these two ideas, we do report how these metrics change under the proposed regularizer in comparison with the standard norms.
Accuracy and Loss: Both these metrics are of the utmost importance when proposing changes to any part of the deep learning regime, be it a new loss or a new architecture. A lower testing loss indicates that the optimization algorithm has found a better optimum in the loss manifold. By extension, a higher testing accuracy shows that the point in parameter space corresponding to this optimum leads to a network configuration which is superior at generalizing, a highly desirable quality in a network of any class.
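The evaluation quantities above can be computed directly from a weight vector. This is a sketch under one assumption we make explicit: the tolerance below which a weight counts as zero (`zero_tol`) is our illustrative choice, not a value from the paper.

```python
import numpy as np

def weight_density(W, zero_tol=1e-3):
    """Fraction of effectively non-zero parameters (density = 1 - sparsity)."""
    W = np.asarray(W).ravel()
    return np.mean(np.abs(W) > zero_tol)

def weight_magnitude(W):
    """Overall L2 magnitude of the parameters."""
    return np.linalg.norm(np.asarray(W).ravel())

W = np.array([0.0, 0.9, 0.0001, -1.2, 0.0, 0.0])
print(weight_density(W))    # 2 of the 6 entries exceed the tolerance
print(weight_magnitude(W))  # L2 magnitude of the parameter vector
```

Tracking these two numbers per iteration, alongside accuracy and loss, produces exactly the four curves compared in the plots.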
3 Results and Discussion
Dataset | Network | Loss | Reg | Wt. Density | Accuracy
MNIST10 | 2-layered CNN | CE | Proposed | 1.2 x | 0.992
 | | | | 1.6 x | 0.990
CIFAR10 | 2-layered CNN | CE | Proposed | 2.4 x | 0.422
 | | | | 3.2 x | 0.38
 | | | | 1.8 x | 0.16
 | VGG11 | CE | Proposed | 5.5 x | 0.883
 | | | | 6.3 x | 0.875
 | | | | 4.8 x | 0.64
 | ResNet18 | CE | Proposed | 8.3 x | 0.910
 | | | | 9.1 x | 0.908
 | | | | 8.6 x | 0.862
 | SENet18 | CE | Proposed | 4.2 x | 0.913
 | | | | 3.9 x | 0.910
 | | | | 4.0 x | 0.843
CIFAR100 | VGG11 | Projected CE | – | 4.2 x | 0.59
 | | CE | – | 5.5 x | 0.57
 | ResNet18 | Projected CE | – | 5.3 x | 0.60
 | | CE | – | 8.5 x | 0.59
3.1 Proposed Regularization Results
In each plot illustrated in this section, different colors signify experiments with different values of the sampling sparsity $s$. With respect to loss and accuracy, the proposed method clearly outperforms the traditional norm penalties (Fig. 10 and Fig. 15). The idea is that adding the proposed regularization penalty shifts the manifold of the testing loss and makes the parameters converge at a point with a higher testing accuracy, thus pushing the model toward a better generalization domain.
For the induced sparsity and magnitude of the parameters, one configuration of the proposed method stands apart from the rest. The combined magnitude of the parameters differs for the proposed metric; this change leads to a parameter configuration which is both lower in magnitude and sparser in density, which might enable a better analysis of the interpretability of the model in future work. We have used the notion of momentum in the probability distribution in all the experiments, as described by Eqn. 5. Table 2 shows the effect of the proposed regularization on the MNIST and CIFAR10 data with various models. Figures 10 and 15 show the plots of our method with multiple $s$ values against the standard regularizer on the MNIST and CIFAR datasets, respectively.
3.2 Proposed Loss Results
Table 2 shows the effect of the projected loss on the CIFAR100 data, and Figure 20 shows the corresponding plots of our method (projected loss) on the CIFAR100 dataset. The training regime of a deep neural network in a traditional setting follows the following arc: during the initial iterations, the test loss falls without kinks and the test accuracy increases in a similar manner (around 200 iterations, or 20 epochs in this case). As training proceeds further, since the loss is calculated over the entire output layer, the gradient received at each weight is comparatively stronger than required, which leads to erratic or oscillatory behavior. When a projected version of the same loss is used, only the nodes whose magnitude is significantly higher than the rest are taken into consideration, leading to an appropriate back-flow of error. This may also be the reason for the much smoother accuracy and loss curves. Classification problems with many classes are clear targets for the projected loss because, for a single data point, only a few nodes should be activated, and a piecewise projected loss ensures such behavior.
4 Future Work
Though we find better results in this initial study, a lot of experimentation and analysis remains to be done. One direction is to explore datasets which can benefit from these changes; to pursue this, we would like to test the effect of the proposed method on sparse NLP datasets. An additional line of study would be to explore the resulting sparse networks through the lens of compression and to conduct tests of the interpretability of the obtained networks.
5 Conclusion
In this work, we propose different regularizers which work on projected spaces of the original vector. Our approach makes use of the current state of the parameters of the network, which results in an informative and dynamically varying prior for regularization. Our experiments on MNIST, CIFAR10, and CIFAR100 with multiple custom and pre-existing networks such as VGG, ResNet, and SENet illustrate better results in terms of accuracy, loss, magnitude, and sparsity for the proposed regularizer as compared to standard regularizers.