Maxout networks for MatConvNet
This paper presents an efficient and robust approach for reducing the size of deep neural networks by pruning entire neurons. It exploits maxout units for combining neurons into more complex convex functions and it makes use of a local relevance measurement that ranks neurons according to their activation on the training set for pruning them. Additionally, a parameter reduction comparison between neuron and weight pruning is shown. It will be empirically shown that the proposed neuron pruning reduces the number of parameters dramatically. The evaluation is performed on two tasks, the MNIST handwritten digit recognition and the LFW face verification, using a LeNet-5 and a VGG16 network architecture. The network size is reduced by up to 74% and 61%, respectively, without affecting the network's performance. The main advantage of neuron pruning is its direct influence on the size of the network architecture. Furthermore, it will be shown that neuron pruning can be combined with subsequent weight pruning, reducing the size of the LeNet-5 and VGG16 up to 92% and 80% respectively.
With the availability of many large-scale datasets and powerful GPUs, deep neural networks have become the state of the art in many computer vision and speech recognition tasks [1, 6, 10]. They achieve high performance in many applications, e.g., scene and object recognition, object detection, scene parsing, face recognition, and medical imaging. However, they require substantial computational resources and memory. For example, AlexNet and DeepFace have around 60M and 120M parameters, respectively. Furthermore, they consume significant energy, making their application on embedded devices difficult. Containing a huge number of parameters, deep neural networks may also be over-parametrized. Thus, redundancies may exist, and the networks may generalize poorly. In general, networks with a small number of parameters generalize better than over-parametrized ones, because they extract the important information from the data. Nevertheless, smaller networks are harder to train, since they are sensitive to initialization.
Designing a network, i.e., setting the number of layers, neurons per layer, and parameters, is typically still a "trial and error" process that depends mostly on experience. Moreover, training does not affect the structure of the network. Several approaches have been developed for reducing the effect of the huge number of parameters, e.g., dropout, creating an optimal-sized network by adding regularizers, or pruning the network parameters [1, 12, 17]. The latter attempt to remove either edges or complete neurons from a network. However, most of these approaches require an expensive comparison of all neurons in the network or additional, expensive post-processing, e.g., the computation of the network's Hessian.
We propose an efficient and robust method for neuron pruning based on a local decision for reducing the number of parameters in a deep neural network. We will empirically show that pruning neurons rather than weights is essential for reducing the size of a neural network at runtime. The proposed approach for pruning neurons is based on the good performance of maxout units, which were developed for boosting the impact of dropout in training, and on their capacity to combine neurons for approximating more complex functions. Since redundancies are assumed to exist in a neural network in general, they are also assumed to exist in a maxout unit, which combines the outputs of multiple neurons. Thus, pruning can be performed locally, based on a single maxout unit.
The remainder of the paper is structured as follows: section 2 discusses the related work in the field of parameter pruning for deep networks. In sections 3 and 4, the maxout approach is reviewed and an approach for pruning neurons is introduced. Experiments on two datasets, the MNIST digits dataset and the Labeled Faces in the Wild (LFW) dataset, are shown in section 5. Face recognition has been chosen as an application that is of special interest for embedded devices such as smartphones. The last section presents a short conclusion.
Though deep neural networks are very powerful, they are known to be over-parametrized, possessing millions of parameters. This over-parametrization may cause performance deficits, e.g., poor generalization, overfitting, slow testing time, and enormous energy and memory consumption [3, 7, 8, 12]. Therefore, reducing the size of the network by removing unimportant parameters, or designing optimal-sized networks, becomes imperative. Different approaches have been developed for those purposes. They can be grouped into constructive and destructive methods.
In constructive methods, neurons or layers are added to a trained shallow neural network. For example, a very deep convolutional neural network (CNN) can be trained by continuously adding convolutional layers to an initial, shallower CNN in order to obtain a better performance. However, the initial shallow network must be properly trained, as the network can otherwise get stuck in a local optimum. Moreover, as the idea is to improve the network's performance by adding layers and neurons, the network's size increases, and redundancies could be introduced into the network.
In destructive methods, non-relevant neurons (neuron pruning) and/or parameters (weight pruning) of an initial deep neural network are removed, while maintaining its behaviour. Early work introduced the concept of pruning neural networks using a relevance measure. In one approach, the parameters with the smallest relevance in the network, which is computed by using the Hessian of the loss function, are deleted. In another, complete neurons are deleted by using a relevance measurement based on the difference of the network's performance with and without the neuron. Nevertheless, especially with today's very deep neural networks with millions of parameters, computing the relevance of each neuron or parameter demands very high computational resources.
Different from the relevance measure methods, the authors in [7, 8] prune weights by thresholding them. Afterwards, the network is re-trained for compensating the lost connections. One of the most prominent destructive methods is Deep Compression, whose authors reduced the storage required for deep CNNs by large factors. They used a combination of three steps: weight pruning, weight quantization, and Huffman coding. First, weight pruning is applied by thresholding the weights and thus setting them to zero. The remaining weights are then quantized, which reduces the number of bits for representing weights. Finally, Huffman coding is applied. However, the networks are currently de-compressed for inference.
Recently, the best number of parameters has been determined by using a regularizer while training the network. The regularizer forces all the weights of single neurons to be zero. For testing, these dead neurons are removed from the network. Nevertheless, additional hyper-parameters must be determined for the regularization. In another approach, the number of parameters to be learned is reduced by factorizing the weight matrices as a low-rank product of two matrices: a static and a dynamic matrix. First, the static matrices are trained as a general dictionary, obtaining prior knowledge of the smoothness structures that are expected to be seen. Second, the dynamic matrix, which holds the weights to be learned, is fine-tuned.
Comparing the approaches, the destructive methods are more popular than the constructive ones, as the networks are often easier to train. Good compression results are achieved by the Deep Compression approach. However, pruning weights has the disadvantage of rarely removing neurons from the network architecture. To make this clear, we show in Fig. 3 the relation between the proportion of remaining neurons in the network and the proportion of pruned weights, using the LeNet-5 and the VGG16 as examples. It clearly shows that neurons only get pruned at a very high compression ratio, which typically influences the network's performance. Thus, thresholding parameters allows for compressing a network but does not influence the size of the architecture. At runtime, a sparse matrix library that efficiently evaluates the compressed network would be required. Otherwise, the network must be de-compressed for inference: the zeroed weights are again stored in memory, which wastes both memory and computational power.
The proposed approach builds on a maxout architecture for pruning neural networks in a destructive manner. A maxout layer can be considered as a cross-channel pooling operation, performing a max operation between adjacent neurons. These layers were designed to boost the model-averaging ability of dropout, which is intended to prevent overfitting, and to improve the optimization. Given an input layer with $n$ neurons $x_1, \dots, x_n$, a maxout layer computes:
$$h_i(x) = \max_{j \in \{1, \dots, k\}} x_{(i-1)k + j}, \qquad i = 1, \dots, n/k,$$
where $k$ is the number of neurons that are combined into a single maxout unit.
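The cross-channel pooling described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `maxout`, the symbol `k` for the number of neurons per unit, and the layout (adjacent columns belong to the same unit) are assumptions made for the example.

```python
import numpy as np

def maxout(x, k):
    """Cross-channel max pooling over groups of k adjacent neurons.

    x: array of shape (batch, n), with n divisible by k (assumed layout).
    Returns an array of shape (batch, n // k).
    """
    batch, n = x.shape
    if n % k != 0:
        raise ValueError("number of neurons must be divisible by k")
    # Group adjacent neurons into units of size k, then take the max per unit.
    return x.reshape(batch, n // k, k).max(axis=2)

# Eight neurons combined into two maxout units of four neurons each.
x = np.array([[1.0, -2.0, 0.5, 3.0, 4.0, -1.0, 0.0, 2.0]])
h = maxout(x, k=4)
```

The layer that follows only sees `n // k` inputs, which is where the reduction in connecting weights comes from.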
As the authors of the maxout architecture show, a maxout unit is a universal approximator. It combines several single neurons into a piecewise linear function that can approximate arbitrary convex functions. So, in theory, the maximum of several neurons is able to approximate a more complex neuron. Moreover, the maxout unit becomes a sort of activation function, replacing other activation functions while reducing the number of outputs passed on to the next layer. For example, two linear functions can implement a ReLU function, and five different linear functions can implement an approximation of a quadratic one.
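The two examples in this paragraph can be checked numerically. The sketch below is only an illustration under stated assumptions: the helper `maxout_1d` and the choice of tangent points `t` are hypothetical, and the tangent lines are just one convenient way to pick five linear pieces.

```python
import numpy as np

def maxout_1d(x, slopes, intercepts):
    """Max over linear pieces a_j * x + b_j for a 1-D array of inputs x."""
    x = np.asarray(x, dtype=float)
    pieces = np.outer(x, slopes) + np.asarray(intercepts)  # (len(x), k)
    return pieces.max(axis=1)

x = np.linspace(-2.0, 2.0, 9)

# Two linear pieces (0 and x) reproduce ReLU exactly.
relu = maxout_1d(x, slopes=[0.0, 1.0], intercepts=[0.0, 0.0])

# Five tangent lines of x**2 at hypothetical points t give a convex,
# piecewise linear approximation from below: tangent at t is 2*t*x - t**2.
t = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
quad = maxout_1d(x, slopes=2.0 * t, intercepts=-t**2)
```

With these tangent points the approximation is exact at the five points in `t` and lies below the parabola everywhere else on the interval.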
The idea of the proposed approach is to use the maxout units and their model selection abilities for pruning entire neurons from an architecture without expensive processing, thus reducing the size and the memory consumption of a deep network. In some cases, the performance of the network may even increase as redundancies are reduced or eliminated.
Following the assumption that redundancies exist in a deep neural network, it is assumed that if a network contains a maxout layer, redundancies will also exist in the maxout units. This is a valid assumption, since dropout and other regularization approaches cause the learning process to create different paths through the deep network that yield similar outputs. Using this premise, the number of neurons in a maxout layer can be reduced without an expensive relevance measurement.
For reducing the size of a CNN using maxout units, an iterative process is followed. First, a CNN with a maxout layer is trained. This maxout layer performs a max function among k adjacent neurons, where k is the number of neurons per maxout unit, reducing the number of weights connecting with the next layer by a factor of k. It is therefore advisable to place this maxout layer after the layer with the highest number of weights. Second, a forward pass over the training dataset is computed, counting the number of times each neuron yields the maximal value in its maxout unit; the least active neuron of each maxout unit is then removed from the network, since its effect is negligible with respect to the other neurons. Third, the remaining neurons of the CNN are re-trained. After re-training, the process is repeated; in this case, the maxout layer performs a max function among k - 1 neurons, and so on. Fig. 4 shows an example.
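The counting and removal step can be sketched as follows. This is a simplified illustration for a fully connected layer, assuming that the rows of `weights` are neurons and that adjacent rows form a maxout unit; the function names and the random data are hypothetical, and re-training is left out.

```python
import numpy as np

def count_wins(activations, k):
    """Count how often each neuron is the maximum within its maxout unit.

    activations: (samples, n) pre-maxout outputs over the training set.
    Returns an integer array of shape (n // k, k).
    """
    samples, n = activations.shape
    units = activations.reshape(samples, n // k, k)
    winners = units.argmax(axis=2)  # (samples, n // k)
    wins = np.zeros((n // k, k), dtype=int)
    for j in range(k):
        wins[:, j] = (winners == j).sum(axis=0)
    return wins

def prune_least_active(weights, bias, wins):
    """Drop the least active neuron of every maxout unit.

    weights: (n, n_in), one row per neuron; bias: (n,).
    The remaining neurons of a unit stay adjacent, so the next round
    can use maxout units of size k - 1.
    """
    n_units, k = wins.shape
    drop = wins.argmin(axis=1) + np.arange(n_units) * k  # flat neuron indices
    keep = np.setdiff1d(np.arange(n_units * k), drop)
    return weights[keep], bias[keep]

# Hypothetical layer: 5 inputs, 8 neurons, maxout units of size k = 4.
rng = np.random.default_rng(0)
W, b = rng.normal(size=(8, 5)), rng.normal(size=8)
X = rng.normal(size=(100, 5))
wins = count_wins(X @ W.T + b, k=4)
W_pruned, b_pruned = prune_least_active(W, b, wins)
```

Because the decision only compares neurons inside one unit, no network-wide relevance ranking is needed.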
In comparison with relevance-based neuron pruning, the pruning takes place locally, since the relevance values are not computed for each single neuron depending on the network's output. Pruning in maxout architectures is therefore more feasible for very large networks with millions of parameters.
Having reduced the number of parameters in the network by pruning neurons from the maxout units, further compression operations can be performed. Following the approach in [7, 8], connections (weights) can be pruned in an additional processing step. Based on thresholding, edges with a value lower than a threshold are set to zero; in this way, the network learns which connections are important and the unimportant ones are deleted. Through this weight pruning, the network becomes sparse. For pruning weights, a three-step procedure is followed, as proposed in [7, 8]. First, given the network that has been compressed by the proposed neuron pruning, the important connections are learned based on a global threshold. The threshold can be set such that as many connections as possible are removed without deteriorating the performance on a validation set. Second, weights below this threshold are deleted; that is, they are set to zero. Third, the network is re-trained, learning the final weights.
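The thresholding step can be sketched as magnitude pruning with one global threshold. This is a minimal sketch under assumptions: the threshold is chosen here as a quantile of all weight magnitudes (the hypothetical parameter `fraction`), whereas in the procedure above it would be tuned against validation performance, and the re-training step is omitted.

```python
import numpy as np

def prune_weights(layers, fraction):
    """Zero the `fraction` of weights with the smallest magnitude, network-wide.

    layers: list of weight arrays. Returns (pruned copies, boolean masks).
    """
    all_mags = np.concatenate([np.abs(w).ravel() for w in layers])
    threshold = np.quantile(all_mags, fraction)  # one global threshold
    masks = [np.abs(w) > threshold for w in layers]
    pruned = [w * m for w, m in zip(layers, masks)]
    return pruned, masks

# Two hypothetical weight matrices; half of all weights are removed.
layers = [np.array([[0.1, -0.5], [0.9, -0.05]]), np.array([[0.3, 0.7]])]
pruned, masks = prune_weights(layers, fraction=0.5)
```

The masks can be reapplied after every update during re-training so that deleted connections stay at zero.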
An evaluation of both neuron and weight pruning is carried out for two different tasks: handwritten digit recognition on the MNIST dataset, and face verification on the LFW dataset. The latter is of special interest for embedded domains, e.g., mobile phones. In general, the performance of the networks is evaluated with a varying percentage of pruned weights: after applying maxout, when pruning several neurons from the maxout units, and finally after applying additional weight pruning. While a very small LeNet-5 architecture is compressed in the first task, a large VGG16 architecture is compressed in the second.
For the experiments, we chose maxout units that combine four neurons each, as this size allows for a fairly good compression and does not reduce the descriptiveness of the network compared to a network without maxout units. The neurons are then iteratively pruned from the maxout units and the network is re-trained after each pruning step.
For the digit-recognition task, two networks were trained using the LeNet-5 architecture with two convolutional layers, a fully connected layer, and a softmax layer as a classifier. One network contains a maxout layer after the fully connected layer (LeNet-MFC), while the other has a maxout layer after the last convolutional layer (LeNet-MC). An iterative training following the steps in section 4.1 is executed using the MNIST dataset. This dataset consists of 60,000 handwritten-digit images (of size 28 x 28) for training and 10,000 images for testing. We used stochastic gradient descent (SGD) with a momentum of 0.9, weight decay, and a base learning rate that is iteratively reduced following an inverse decay policy.
Table 1 shows the classification accuracy for both networks with different fully connected layer sizes, with and without maxout (after the fully connected layer or the last convolutional layer). Pruning of one up to three neurons is evaluated. The table also shows the percentage of pruned weights, i.e., the weights that do not remain in the network's architecture. In general, the accuracy is maintained for both networks when using maxout and pruning neurons. The slight deviations of the accuracies of both networks with respect to the original networks are not significant based on a randomization test. Moreover, the number of weights is considerably reduced for both LeNet-MFC and LeNet-MC (by up to 74% for the LeNet-5 architecture). However, this reduction changes with respect to the position of the maxout layer. In LeNet-MFC, each neuron pruning step reduces the number of weights substantially, because the neurons are pruned from the fully connected layer, which has the largest number of weights in the network. The maxout layer itself does not provide a considerable reduction, since it reduces the size of the softmax layer, which has fewer weights than the other layers. In contrast, the weight reduction in LeNet-MC due to neuron pruning is small per step, and the reduction comes mostly from the maxout layer. In this case, the maxout layer reduces the fully connected layer instead, and the neurons are pruned from the last convolutional layer, in which the number of weights is negligible.
Table 1 (header row): Network | No maxout | No pruning | 1 neuron pruned | 2 neurons pruned | 3 neurons pruned
Table 2 (header row): Network | Proportion of pruned weights
Following the neuron pruning, additional weight pruning, as discussed in section 4.2, can be applied. Neurons could also be pruned from the network if all their input weights are zero; that is, the neuron can be considered dead. So, the number of neurons, and thus the number of weights, could be considerably reduced if a proper threshold is used. Nevertheless, analyzing the proportion of dead neurons versus the proportion of pruned weights shows that neurons do not become dead before a very large proportion of the weights has been pruned, see Fig. 3(b). Consequently, weight pruning rarely prunes neurons. Zeroed weights thus remain in the network as part of the neurons, and the network's architecture does not change, so that a sparse representation would be required at runtime. However, assuming the usage of such a representation and for reducing the storage size of the network, additional weight pruning is applied to both networks. As a basis, we use the networks after pruning three out of four neurons in the maxout units. The results in Tab. 2 show that the accuracy does not drop as long as the proportion of thresholded weights stays below a certain limit. So, a high total compression rate, counting both pruned and zeroed weights, can be reached for LeNet-MFC and LeNet-MC (up to 92% for the LeNet-5 architecture).
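The relation between pruned weights and dead neurons can be explored with a small experiment. The sketch below uses a random weight matrix, which is an assumption made purely for illustration and not the trained LeNet-5 or VGG16 weights behind Fig. 3; the function name and layer size are hypothetical.

```python
import numpy as np

def dead_neuron_fraction(weights, fraction):
    """Fraction of neurons whose input weights are all zero after
    thresholding the smallest `fraction` of weights in one layer.

    weights: (n_out, n_in), one row per neuron.
    """
    threshold = np.quantile(np.abs(weights), fraction)
    mask = np.abs(weights) > threshold  # surviving connections
    dead = ~mask.any(axis=1)            # neurons with no surviving input
    return dead.mean()

rng = np.random.default_rng(0)
w = rng.normal(size=(100, 500))  # hypothetical layer with random weights
for fraction in (0.5, 0.9, 0.99):
    print(fraction, dead_neuron_fraction(w, fraction))
```

A neuron only dies when every one of its incoming weights falls below the threshold, which is why whole neurons tend to disappear only at very high pruning rates.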
The neuron pruning was also carried out for a larger network for the purpose of face verification. The task is to verify whether two face images portray the same person or not. For that purpose, the VGG16 network was utilized, using the Visual Geometry Group Face Dataset (VGG face dataset) as the training dataset. This dataset is a large collection of face images, containing 2.6 million face images from 2,622 identities. It does not contain overlapping identities with standard benchmark datasets (LFW, YTF), so it is suitable for training. The VGG16 network, configuration D, is a deep CNN with 16 layers: 13 convolutional layers, two fully connected layers, and a softmax layer. Analogous to the previous LeNet configurations, two configurations of VGG16 are used, in which a maxout layer with four neurons per unit is added either after the first fully connected layer, called VGG16-MFC, or after the last convolutional layer, called VGG16-MC, see Fig. 5. The maxout layers are placed after the layers with the largest number of weights: the connections between the last convolutional layer and the first fully connected layer hold the largest share of the network's weights, and the connections between the first and the second fully connected layer hold the second largest share. The last three fully connected layers were fine-tuned for both networks. In the case of VGG16-MC, the last convolutional layer was also fine-tuned. We used SGD with momentum, weight decay, and three learning rates.
The network was tested using the restricted configuration of the Labeled Faces in the Wild (LFW) dataset. The LFW dataset is a standard benchmark dataset for face verification. It contains 13,233 face images from 5,749 identities collected from the Internet. Faces in the images were detected using the Viola-Jones face detector. Faces are roughly centered and contain less noise, but have larger bounding boxes than the ones in the VGG dataset. In addition, the Bray-Curtis (BC) distance was used instead of the Euclidean distance, since the BC distance works better for high-dimensional vectors than the Euclidean and the L1 distances. The BC distance was measured between the descriptors of two face images from a set of matched and non-matched pairs of images. Different from previous work, the feature vectors from the crops of the image's corners were not utilized for computing the final descriptor, but only the crops from the image's center. If the BC distance is smaller than a threshold, the two images are considered to portray the same identity. The Equal Error Rate (EER) [4, 16] was used as the metric, which is defined as the value where the False Acceptance Rate (FAR) and the False Rejection Rate (FRR) are equal.
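The verification metric described above can be sketched as follows. The Bray-Curtis distance uses its standard definition; the EER routine is a simple approximation that scans the observed distances as candidate thresholds, which is an assumption of this sketch and not necessarily the evaluation code used for the reported results.

```python
import numpy as np

def bray_curtis(u, v):
    """Bray-Curtis distance: sum|u - v| / sum|u + v|."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return np.abs(u - v).sum() / np.abs(u + v).sum()

def equal_error_rate(distances, same):
    """Approximate EER by scanning thresholds over the observed distances.

    distances: 1-D array of pair distances; same: boolean array, True for
    matched pairs. A pair is accepted when distance <= threshold.
    """
    distances = np.asarray(distances, dtype=float)
    same = np.asarray(same, dtype=bool)
    best_gap, eer = np.inf, None
    for t in np.sort(distances):
        far = np.mean(distances[~same] <= t)  # non-matched pairs accepted
        frr = np.mean(distances[same] > t)    # matched pairs rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer

# Hypothetical distances for two matched and two non-matched pairs.
eer = equal_error_rate([0.1, 0.3, 0.2, 0.4], [True, True, False, False])
```

With a finite set of pairs, FAR and FRR rarely coincide exactly, so the sketch reports their mean at the threshold where they are closest.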
Table 3 shows the EER for networks without a maxout layer and with a maxout layer with four neurons per unit, as well as the results for pruning from one up to three neurons from each maxout unit. Similar to the previous results, neurons are pruned from the networks, and consequently weights are reduced, without affecting their performance negatively. In fact, the EER decreases for both the VGG16-MFC and the VGG16-MC. It is assumed that this improvement is produced by the elimination of redundancies in the maxout units. Based on a randomization test, the improvement in the VGG16-MFC is highly significant.
The weight reduction changes depending on the position of the maxout layer and on the layer from which neurons are pruned. In VGG16-MFC, neurons are pruned from the largest layer in the network, so each neuron pruning step reduces the number of weights considerably. Moreover, the maxout layer directly reduces the size of the second largest layer. In VGG16-MC, on the contrary, neuron pruning does not considerably affect the size of the network, since it reduces a layer that is small compared to the first fully connected layer. The weight reduction comes instead from the maxout layer, which reduces the size of the first fully connected layer. The weight reduction differs between the two networks, since the size of the second largest layer is never changed in VGG16-MC.
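The effect of the maxout position on the weight count is plain arithmetic and can be sketched for two stacked fully connected layers. The layer sizes below are assumptions chosen in the spirit of the VGG16 fully connected layers, biases are ignored, and the function names are hypothetical; the sketch does not reproduce the figures reported in the tables.

```python
def fc_weights(n_in, n_out):
    """Number of weights (without biases) of a fully connected layer."""
    return n_in * n_out

def weights_maxout_after_first(n_in, n1, n2, k, pruned=0):
    """Weights of two stacked dense layers with maxout after the first.

    The n1 neurons are grouped into n1 // k units; `pruned` neurons have
    been removed from every unit. The second layer always sees n1 // k inputs.
    """
    units = n1 // k
    remaining = units * (k - pruned)
    return fc_weights(n_in, remaining) + fc_weights(units, n2)

# Hypothetical sizes: 25088 inputs, two layers of 4096 neurons, k = 4.
n_in, n1, n2, k = 25088, 4096, 4096, 4
baseline = fc_weights(n_in, n1) + fc_weights(n1, n2)
for pruned in range(k):
    total = weights_maxout_after_first(n_in, n1, n2, k, pruned)
    print(pruned, total, round(1 - total / baseline, 3))
```

In this configuration the maxout layer shrinks the second layer once, while every pruning step removes a further share of the first, larger layer.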
Table 3 (header row): Network | No maxout | No pruning | 1 neuron pruned | 2 neurons pruned | 3 neurons pruned
In addition to neuron pruning, weights from both networks were thresholded after pruning three out of four neurons in the maxout units; the results are shown in Table 4. The network's performance is strongly affected once the proportion of pruned weights exceeds a limit that differs between the VGG16-MFC and the VGG16-MC. Nevertheless, a high total compression rate can be reached for both networks without performance deterioration (up to 80% for the VGG16 architecture).
Table 4 (header row): Network | Proportion of pruned weights
We have presented an efficient approach for reducing the size of deep neural networks. This approach prunes entire neurons and thus reduces the number of weights in neural networks. It uses maxout units for combining single neurons into complex ones. A maxout layer reduces the number of weights between two adjacent layers by a factor equal to the number of neurons per maxout unit. By using these maxout units, the network's performance is not negatively affected, since they boost the benefits of dropout, reducing redundancies in the network. Within these maxout units, neurons are pruned based on a local and inexpensive relevance measure. This relevance measure depends on the number of times each of the adjacent input neurons of a maxout unit yields the maximum. It differs from previous relevance measures, because it does not depend on the overall network's performance with and without individual neurons. In general, this approach does not require expensive post-processing, only a single re-training after pruning. The performance of this reduction approach depends strongly on the position of the maxout layer in the network. As the inputs of the maxout units are the neurons to be pruned, it is advisable to place the maxout units after the largest layer in the network, because neurons in this layer have large numbers of weights compared with neurons in other layers, so pruning them is most favorable.
Comparing the proposed approach with weight pruning in terms of pruned neurons and network performance shows that weight pruning does not delete entire neurons, but rather sets weights to zero. Therefore, the architecture's size is only implicitly reduced, and the memory footprint remains the same without a sparse representation. The proposed approach reduces a network's size by up to 74% (LeNet-5) and 61% (VGG16) on an architectural level without affecting the network's performance. Assuming a sparse representation, combining the proposed neuron pruning with additional weight pruning reduces the size of a network by up to 92% (LeNet-5) and 80% (VGG16).
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems. pp. 1097–1105 (2012)
Liu, B., Wang, M., Foroosh, H., Tappen, M., Pensky, M.: Sparse convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 806–814 (2015)