Neural Networks (NNs) have become widely-used models in many real-world tasks, such as image classification, translation, and speech recognition. Meanwhile, the increasing size and complexity of advanced NN models (unless otherwise stated, the term "model" in this paper refers to a Neural Network model) have raised a critical challenge (Wang et al., 2018) in applying them to many real application tasks, which can accept a moderate performance drop but have extremely limited tolerance for high model complexity. Running NN models on mobile devices and embedded systems is an emerging example: such settings make every effort to avoid expensive computation and storage costs but can endure slightly reduced model accuracy.
Consequently, many research studies have paid attention to producing compact and fast NN models while maintaining acceptable performance. In particular, one of the most active directions investigates model compression by pruning (LeCun et al., 1990; Hassibi and Stork, 1993; Han et al., 2015b; Li et al., 2016; Frankle and Carbin, 2019) or quantizing (Courbariaux et al., 2015; Rastegari et al., 2016; Wu et al., 2016; Zhu et al., 2016; Hubara et al., 2017; Mellempudi et al., 2017) trained large NN models into squeezed ones with trimmed redundancy but preserved accuracy. More recently, increasing efforts have explored Knowledge Distillation (Hinton et al., 2015) to obtain compact NN models by training them under the supervision of well-trained larger NN models (Polino et al., 2018; Wang et al., 2018; Mishra and Marr, 2018; Luo et al., 2016; Sau and Balasubramanian, 2016). Compared with directly training a compressed model from scratch merely on the ground truth, supervision in the form of soft distributed representations on the output layer of the large teacher model can significantly enhance the effectiveness of the resulting compact student model. In practice, nevertheless, it is quite difficult to produce a compressed student model that yields effectiveness similar to the complex teacher model, essentially due to the limited expressiveness ability of the compressed one under its strictly restricted parameter size.
Intuitively, to enhance the power of the compressed model, it is necessary to increase its expressiveness ability. However, traditional approaches that introduce more layers or hidden units can easily violate the strict restrictions on model size. Fortunately, besides the model scale, the nonlinear transformation within the NN model, i.e., the activation, plays an equally important role in its expressiveness ability. As pointed out by Montufar et al. (2014), an NN model that uses multi-layer ReLU or another piecewise linear function as its activation is essentially a complex piecewise linear function. Moreover, the number of linear regions produced by an NN model depends not only on its scale but also on the number of regions in its activation. Given the limited cost of adding more regions to the activation, it is more efficient to improve the expressiveness ability of NN models via multi-segment activations than by arbitrarily increasing the model scale.
Thus, in this paper, we introduce a novel, highly efficient piecewise linear activation to improve the expressiveness ability of compressed models at little cost. Specifically, as shown in Fig. 1, we leverage a generic knowledge distillation framework in which the compact student model is equipped with a multi-segment piecewise linear activation, named the Light Multi-segment Activation (LMA). With LMA, we first cut the input range into multiple segments using batch statistics, ensuring that it adapts to any input range lightly and efficiently. Then, we assign each input a customized slope and bias according to the segment it belongs to, which leads to NN models with higher expressiveness ability due to the stronger non-linearity of the multi-segment activation. Owing to this design, an LMA-equipped compact student model offers two advantages: 1) it has much higher expressiveness ability than one merely endowed with the vanilla ReLU; 2) its resource cost is much smaller, and more controllable, than those of other multi-segment piecewise linear activations.
Extensive experiments with NN architectures of multiple scales on various real tasks, including image classification and machine translation, have demonstrated both the effectiveness and the efficiency of LMA, implying that LMA improves the expressiveness ability, and thus the performance, of the student model. Additional experiments further illustrate that our method can also improve the expressiveness ability of models compressed by other popular techniques, especially quantization, such that jointly leveraging them achieves even better compression results.
The main contributions of this paper are multi-fold:
It proposes a novel multi-segment piecewise linear activation function, which improves the expressiveness ability of the compressed student model within the knowledge distillation framework. To the best of our knowledge, this is the first work that leverages a multi-segment piecewise linear activation in model compression.
By using well-designed statistical information from each batch, the proposed activation can efficiently improve the performance of compressed models while preserving low resource cost.
The proposed method is well compatible with other popular compression techniques, so it is easy to combine them and further enhance compression effectiveness.
On various challenging real tasks, experimental results with models of different scales show that our method performs well; the effectiveness of jointly using our method with others is also demonstrated in the experiments.
2 Related work
Our work is mainly related to two research areas: model compression and piecewise linear activation. Model pruning, quantization, and distillation are representative methods for the former, while the latter typically studies the respective effects of ReLU, Maxout, and APLU on NN performance.
Model Compression In this area, LeCun et al. (1990) and Hassibi and Stork (1993) first explored pruning based on second derivatives. More recently, Han et al. (2015b, 2016); Jin et al. (2016); Hu et al. (2016); Yang et al. (2017) pruned the weights of Neural Networks with different strategies and made notable progress. Most recently, Frankle and Carbin (2019) showed that a dense Neural Network contains a sparse trainable subnetwork that can match the performance of the original network, known as the lottery ticket hypothesis. On the other hand, Gupta et al. (2015)
have done a comprehensive study on the effect of low-precision fixed-point computation for deep learning. Quantization is thus also an active research area, where various methods have been proposed (Mellempudi et al., 2017; Hubara et al., 2017; Rastegari et al., 2016; Wu et al., 2016; Zhu et al., 2016).
Besides, using distillation for size reduction was suggested by Hinton et al. (2015), giving a new direction for training compact student models. The weighted average of the soft distributed representation from the teacher's output and the ground truth is very useful when training a model, so several practices (Wang et al., 2018; Luo et al., 2016; Sau and Balasubramanian, 2016) have applied it to training compressed compact models. Moreover, recent works also proposed to combine quantization with distillation, producing better compression results. Among these, Mishra and Marr (2018) used knowledge distillation for low-precision models, showing that distillation can also help train quantized models. Polino et al. (2018) proposed a deeper combination of the two methods, named Quantized Distillation. Besides, some works (Han et al., 2015a; Iandola et al., 2016; Wen et al., 2016; Gysel et al., 2016; Mishra et al., 2017) further reduced the model size by combining multiple compression techniques such as quantization, weight sharing, and weight coding. The combination of our method with others is also shown in this paper.
Piecewise Linear Activation
A piecewise linear function is composed of multiple linear segments. Some piecewise functions are continuous, when the boundary values computed by the functions of two adjacent intervals coincide, whereas others are not. Benefiting from its simplicity and its ability to fit any function given enough segments, the piecewise linear function is widely used in machine learning models (Landwehr et al., 2005; Malash and El-Khaiary, 2010), especially as activations in Neural Networks (LeCun et al., 2015). Theoretically, Montufar et al. (2014); Pascanu et al. (2013)
studied the number of linear regions in Neural Networks produced by piecewise linear activation functions (PLA), which can be used to measure the expressiveness ability of the networks.
Specifically, as a two-segment PLA, the Rectified Linear Unit (ReLU) (Nair and Hinton, 2010) and its parametric variants can be generally defined as $y = \max(\alpha x, x)$, where $x$ is the input, $\alpha$ is a linear slope, and $y$ is the activated output. The original ReLU fixes $\alpha$ to zero, so the formula degenerates to $y = \max(0, x)$; Parametric ReLU (PReLU) (He et al., 2015) makes $\alpha$ learnable and initializes it to 0.25. Besides, there are also some PLAs with multiple segments derived from ReLU. For example, Maxout (Goodfellow et al., 2013) is a typical multi-segment PLA, defined as $h(x) = \max_{j \in [1, k]} z_j$ with $z_j = x^{\top} W_j + b_j$ for all $j$, where $k$ can be treated as its segment number; it transforms the input into the maximum of $k$-fold linearly transformed candidates. Adaptive Piecewise Linear Units (APLU) (Agostinelli et al., 2014) are also multi-segment, defined as a sum of hinge-shaped functions,

$$h(x) = \max(0, x) + \sum_{s=1}^{S} a^{s} \max(0, -x + b^{s}), \qquad (1)$$

where $S$ is a hyper-parameter set in advance, while the variables $a^{s}$, $b^{s}$ for $s = 1, \dots, S$ are learnable. The $a^{s}$ control the slopes of the linear segments, while the $b^{s}$ determine the locations of the hinges, which play a role similar to segment cut points.
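To make the memory behaviour concrete, here is a small NumPy sketch of APLU (the function and variable names are ours, not from the original implementation); note that each hinge term materializes an intermediate array the size of the input, which is the source of the memory cost discussed later.

```python
import numpy as np

def aplu(x, a, b):
    """APLU: ReLU plus a sum of S hinge-shaped terms (cf. Eqn. 1).

    a, b: length-S arrays of learnable hinge slopes and locations."""
    out = np.maximum(0.0, x)
    for a_s, b_s in zip(a, b):
        # each hinge term is an intermediate array as large as x
        out = out + a_s * np.maximum(0.0, -x + b_s)
    return out

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
y = aplu(x, a=np.array([0.1, 0.2]), b=np.array([0.0, 1.0]))
# with all a_s = 0, the function reduces to the vanilla ReLU
```

With two hinges the output already mixes three slopes on the negative side, illustrating how the hinge sum builds a multi-segment function out of ReLU.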
In this paper, after studying the connection between the above two areas, we are the first to leverage the properties of PLAs for model compression, improving the expressiveness ability of the compact model via a multi-segment activation and thereby its performance.
We start by studying the connection between PLA and the expressiveness ability of Neural Networks, followed by introducing the Light Multi-segment Activation (LMA) that is used to further improve the performance of the compact model in model compression.
Expressiveness Ability Study Practically, increasing the complexity of Neural Networks, in terms of either width (Zagoruyko and Komodakis, 2016) or depth (He et al., 2016), can improve performance, essentially due to the higher expressiveness ability of the NN. However, when applying the NN in resource-constrained environments, its scale cannot be inflated without limit. Fortunately, the nonlinear transformation within the NN, i.e., the activation, provides another vital channel for enhancing expressiveness. Yet the widely-used ReLU is just a simple PLA with only two segments, where the slope of the positive segment is fixed to one and the other to zero. Therefore, other than enlarging the model scale, another effective way to enhance the expressiveness ability of an NN model is to leverage more powerful activation functions. In this paper, we propose to increase the segment number of the activation function to enhance expressiveness and thereby empower the compact NN to yield good performance.
Theoretically, there is also related analysis (Montufar et al., 2014) that justifies our motivation. As pointed out there, the capacity, i.e., the expressiveness ability, of a PLA-activated Neural Network can be measured by the number of linear regions of the model. For a deep Neural Network, the number $\mathcal{N}_{l}^{R}$ of separate input-space neighborhoods that the $l$-th hidden layer maps to a common neighborhood $R$ can be computed recursively as

$$\mathcal{N}_{l}^{R} = \sum_{R' \in P_{l}^{R}} \mathcal{N}_{l-1}^{R'}, \qquad \mathcal{N}_{0}^{R} = 1, \qquad (2)$$

where $S_{l}$ denotes the set of (vector-valued) activations reachable by the $l$-th layer for all possible inputs, and $P_{l}^{R}$ denotes the set of subsets $R' \subseteq S_{l-1}$ that are mapped by the $l$-th layer's activation onto $R$. Based on the above result, the following lemma (see Montufar et al. (2014), Lemma 2) is given.
The maximal number of linear regions of the functions computed by an $L$-layer Neural Network with piecewise linear activations is at least $\sum_{R \in P_{L}} \mathcal{N}_{L-1}^{R}$, where $\mathcal{N}_{L-1}^{R}$ is defined by Eqn. (2), and $P_{L}$ is a set of neighborhoods in distinct linear regions of the function computed by the last hidden layer.
Given the above lemma, the number of linear regions of a Neural Network is in effect influenced by the layer number, the hidden unit size, and the region number of the PLA. From ReLU to Maxout, the significant improvement lies in the region number of the activation, which enters the bound in the lemma; this is also the basis of our approach. Taking Maxout as an example for detailed analysis, it leads to an important corollary: a Maxout network with $L$ layers of width $n$ and rank $k$ can compute functions with at least $k^{L-1} k^{n}$ linear regions (see Montufar et al. (2014), Theorem 8). Meanwhile, ReLU can be treated as a special rank-2 case of Maxout, whose bound is obtained similarly by Pascanu et al. (2013). Obviously, the number of linear regions can be improved by increasing any of $L$, $n$, or $k$. However, in a compressed model, neither the layers nor the hidden units can be increased much. Thus, we propose to construct a highly efficient multi-segment activation function that enlarges the number of linear regions.
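As a quick numeric illustration of this trade-off (the numbers below are our own toy choices, not from the paper), the Maxout lower bound $k^{L-1}k^{n}$ grows with the rank $k$ without touching the layer count or width:

```python
def maxout_regions_lower_bound(L, n, k):
    """Lower bound on the number of linear regions of a rank-k Maxout
    network with L layers of width n (Montufar et al., 2014, Thm. 8)."""
    return k ** (L - 1) * k ** n

base = maxout_regions_lower_bound(L=3, n=4, k=2)           # ReLU-like rank-2 case
more_segments = maxout_regions_lower_bound(L=3, n=4, k=4)  # raise the rank only
wider = maxout_regions_lower_bound(L=3, n=8, k=2)          # double the width instead
# raising k multiplies the bound while adding no units or layers
```

Here doubling the rank (k: 2 to 4) lifts the bound far above doubling the width (n: 4 to 8), while a wider layer also inflates every weight matrix that touches it.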
Analysis on Existing Multi-segment PLAs As mentioned in Related Work, some previous studies have already proposed multi-segment PLAs. In the rest of this subsection, we analyze whether they are suitable for model compression. Considering Maxout first, its regions are produced by $k$-fold weights, and only the maximum of its $k$-fold outputs is picked to feed forward, which obviously causes redundancy within Maxout. On the contrary, to construct a PLA with multiple segments while ensuring a limited parameter increment, a more intuitive inspiration, from the definition of a piecewise linear function, is to first cut the input range into multiple segments and then transform the input linearly with individual coefficients (i.e., slopes and biases) on the different segments. In this way, the parameter number of a network based on this scheme can be controlled at roughly $|W| + O(k)$ per layer, compared with $k \cdot |W|$ in the above assumed Maxout NN, where $|W|$ denotes the weight number of the corresponding ReLU network. In fact, APLU is a hinge-based implementation of this scheme with few additive parameters: in Eqn. (1), the $b^{s}$ are the cut points of the input range, and the $a^{s}$ can be grouped accumulatively into slope coefficients. However, APLU increases the memory cost due to its accumulation: it requires $S$ times intermediate variables to compute the hinge terms in parallel and then accumulates them all at once. Although we could instead accumulate them recursively to avoid this, that would be $S$ times slower, which is unacceptable. Besides, as $S$ becomes larger, the memory cost grows linearly.
In a word, neither Maxout nor APLU can be directly employed for model compression, in that Maxout produces many more parameters and APLU is memory-consuming. In the following subsection, we introduce a new activation that is both effective and efficient for model compression.
3.2 Light Multi-segment Activation
Method LMA mainly contains two steps. The first is batch segmentation, which finds the segment cut points based on batch statistics. Then the inputs are transformed with the corresponding linear slopes and biases according to their segments.
Firstly, to construct a multi-segment piecewise activation, we need to cut the continuous inputs into multiple segments. There are two straightforward solutions: 1) pre-defined cut points, as in the vanilla ReLU; 2) trained cut points, as in APLU. For the former, as the input ranges of hidden layers change dramatically during training, it is hard to define appropriate cut points in advance. For the latter, the cut points are unstable due to random initialization and stochastic updates by back-propagation. As these naive solutions do not work well, inspired by the success of Batch Normalization (Ioffe and Szegedy, 2015), we propose Batch Segmentation, which determines the segment boundaries by statistical information.
There are two statistical schemes (Dougherty et al., 1995) to find appropriate segments: one based on frequency, and the other based on numerical values. Concretely, with the frequency-based method, each segment contains the same number of inputs, while with the value-based one, the numerical width of each segment is equal. The frequency-based method is more robust since it is not sensitive to numerical values; however, it is not efficient, especially when running on GPUs and applied to model compression. Thus, the value-based solution is used in LMA for efficiency. Specifically, we assume the input follows a normal distribution and cut the segments with equal value width, so each segment cut point is defined as

$$c_i = (\mu - 3\sigma) + \frac{6\sigma}{k} \cdot i, \qquad i = 1, \dots, k - 1, \qquad (3)$$

where $k$ is the segment number, a hyper-parameter, and $\mu$, $\sigma$ are the mean and standard deviation of the batch input $X$, respectively. To reduce the effect of outliers and make use of the property of the normal distribution (nearly all values lie within three standard deviations of the mean), we assume $[\mu - 3\sigma, \mu + 3\sigma]$ are the range endpoints and assign cut points accordingly. Like Batch Normalization, the moving averages of $\mu$ and $\sigma$ are used in the test phase. To further improve efficiency, as well as to obtain more stable statistics, $\mu$ and $\sigma$ can be calculated once and shared within the same layer.
After determining the segment boundaries, LMA assigns a coefficient pair, i.e., a slope and a bias, to each input according to the segment it belongs to. To avoid the memory-consumption problem of APLU, LMA uses independent slopes and biases. Formally, the activation process can be defined as

$$y = a_{s(x)} \cdot x + b_{s(x)}, \qquad (4)$$

where $a$ denotes the slope coefficients, $b$ denotes the biases, and $s(x)$ denotes the segment index of input $x$. Especially, considering there may still be a few extreme inputs violating the normal-distribution assumption, the first and last segments are set to $(-\infty, c_1]$ and $(c_{k-1}, +\infty)$ respectively, instead of being determined by $\mu$ and $\sigma$. Finally, after the above steps, the linearly transformed values are fed forward to the next layer.
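The two steps above can be sketched in a few lines of NumPy (a simplified batch version under our own naming; the actual implementation is in PyTorch and additionally keeps moving averages of the statistics for test time):

```python
import numpy as np

def lma_forward(x, slopes, biases, k):
    """Light Multi-segment Activation, batch version (a sketch).

    Cut points come from batch statistics, assuming the inputs roughly
    span [mu - 3*sigma, mu + 3*sigma]."""
    mu, sigma = x.mean(), x.std() + 1e-8
    # k - 1 interior cut points, equally spaced in value
    cuts = mu - 3.0 * sigma + (6.0 * sigma / k) * np.arange(1, k)
    seg = np.searchsorted(cuts, x)   # per-input segment index in [0, k-1]
    return slopes[seg] * x + biases[seg]

k = 8
# ReLU-like initialization: left-half slopes 0, right-half slopes 1, biases 0
slopes = np.concatenate([np.zeros(k // 2), np.ones(k // 2)])
biases = np.zeros(k)
y = lma_forward(np.random.randn(32, 16), slopes, biases, k)
```

With this ReLU-like initialization and roughly centered inputs, the middle cut point lands near the batch mean, so the initial behaviour closely matches the vanilla ReLU before the slopes and biases are trained.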
Table 1: Parameter size and running memory cost at activation in one layer ($k$: segment number / rank, $u$: hidden unit number, $|W|$: weight number of the corresponding ReLU layer).

| Activation | Param. Size | Mem. Cost |
|---|---|---|
| Maxout | $k \cdot \|W\|$ | $k$ activation candidates |
| APLU | $2ku$ | $k$ intermediate hinge terms |
| LMA | $2k$ | segment indices only |
Analysis and Discussion In the following, we discuss LMA in more detail from the perspectives of complexity and initialization. Obviously, in LMA there are only two additional trainable variables, $a$ and $b$, for each layer, whose total size is $2 \times k \times u$, where $k$ is the segment number and $u$ is the hidden unit number. Furthermore, to reduce the parameter size as much as possible, $a$ and $b$ are shared at the layer level, which means that all the units or feature maps in one specific layer are activated by the same LMA. Therefore, the number of parameters brought by LMA in one layer is only $2k$, reduced by a factor of $u$ compared with APLU with the same number of segments. Moreover, regarding the running memory cost in the inference phase, LMA only produces the segment indices of the inputs, while APLU needs $k$ intermediate hinge terms and Maxout needs $k$ activation candidates, each as large as the layer output. To conclude, the cost comparisons among the multi-segment PLAs are shown in Table 1, which lists the parameter size and the running space cost at activation in one layer. It shows that LMA is more suitable for model compression because of its lower storage and running space costs.
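Under this accounting, the per-layer parameter overhead can be compared directly (a sketch with our own function name; only LMA and APLU are shown, since Maxout's overhead scales with the weight matrix itself):

```python
def extra_params_per_layer(scheme, k, units):
    """Additional trainable parameters one activation adds to a layer of
    `units` hidden units, with k segments/hinges (the paper's accounting)."""
    if scheme == "lma":
        return 2 * k            # layer-shared slopes and biases
    if scheme == "aplu":
        return 2 * k * units    # per-unit hinge slopes and locations
    raise ValueError(f"unknown scheme: {scheme}")

ratio = (extra_params_per_layer("aplu", 8, 256)
         // extra_params_per_layer("lma", 8, 256))
# LMA's overhead is smaller by exactly the unit count (here 256)
```

For a typical 256-unit layer with 8 segments, LMA adds 16 parameters against APLU's 4096, matching the factor-of-$u$ reduction stated above.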
Besides, the slopes and biases on all segments need to be initialized in LMA. Initialization methods can generally be categorized into two classes: 1) random initialization, like the other parameters in Neural Networks; 2) initialization as a known activation, such as vanilla ReLU or PReLU. Though random initialization does not impose any assumptions and may achieve better performance (Mishkin and Matas, 2015), it usually introduces uncertainty and leads to unstable training. With this in mind, we choose the second method for LMA. Specifically, we initialize LMA to be the vanilla ReLU: all biases are initialized to zero, the slopes of the left half of the segments are initialized to zero, and the remaining slopes are initialized to one.
Model Compression As an effective way to improve the expressiveness ability of the compressed model, LMA can be applied together with distillation and other compression techniques. Under the distillation framework, we first train a state-of-the-art model to reach as good performance as possible. With it as the teacher, a more compact architecture is employed as the student to learn knowledge from the teacher. Because of the parameter reduction, the student usually performs far below the teacher despite knowledge distillation. Here, we replace all the original ReLUs in the student model with LMA, improving its expressiveness ability and, in turn, its performance considerably. The replacement is very convenient: it only requires changing one line of code in the implementation. After that, following (Hinton et al., 2015; Polino et al., 2018), the distillation loss for training the student is the usual weighted average of the loss from the ground truth and the loss from the teacher's output, formally defined as

$$\mathcal{L} = (1 - \alpha) \cdot \mathcal{L}_{gt}(z_s, y) + \alpha \cdot \mathcal{L}_{KL}(z_s, z_t), \qquad (5)$$

where $\alpha$ is a hyper-parametric factor, always set to 0.7 in our experiments, to adjust the weight of the two losses; $z_s$ is the student's output logits; the first loss $\mathcal{L}_{gt}$ is a Cross-Entropy (CE) Loss or Negative Log-Likelihood (NLL) Loss with the ground-truth labels $y$, depending on the task (CE for image classification and NLL for machine translation in our experiments); the latter loss $\mathcal{L}_{KL}$ is a Kullback-Leibler Divergence Loss with the teacher's output logits $z_t$. Additionally, when calculating $\mathcal{L}_{KL}$, we use a temperature factor to soften $z_s$ and $z_t$, whose specific settings are given in the experiments.
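For concreteness, a plain-NumPy sketch of such a weighted distillation loss (helper names are ours; the paper's code uses PyTorch's CE/NLL and KL-divergence losses, and the exact weighting convention is an assumption on our part):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax over the last axis."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(z_s, z_t, labels, alpha=0.7, T=4.0):
    """alpha weights the KL term against the teacher's softened output,
    (1 - alpha) the cross-entropy with the hard labels (our reading)."""
    p_s, p_t = softmax(z_s, T), softmax(z_t, T)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)
    ce = -np.log(softmax(z_s)[np.arange(len(labels)), labels])
    return float(np.mean(alpha * kl + (1.0 - alpha) * ce))
```

When the student's logits match the teacher's, the KL term vanishes and only the hard-label loss remains, which is the sanity check one would expect of a distillation objective.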
Besides, LMA is well compatible with other compression techniques, since it is convenient to replace ReLU activations with LMA. For example, based on a recent representative method, Quantized Distillation (Polino et al., 2018), after replacing ReLU with LMA in the student model, even though the model is quantized to low precision during training, our method still empowers it to achieve higher performance than the original, as shown in the experiments.
In this section, we conduct thorough evaluations of the effectiveness of LMA for model compression in two popular scenarios: image classification and machine translation. Besides, we compare the performance of LMA with several widely-used baseline activations. (We anonymously released the source code at: https://github.com/LMA-NeurIPS19/LMA.) In particular, we start with our experimental setup, including the data and models employed. After that, we analyze the performance of our method applied singly or jointly with others to demonstrate its effectiveness and advantages for model compression.
General Settings To ensure credible results, all experiments are run 5 times with different random seeds, and we report their average and standard deviation. In addition, to ensure fair comparisons, the basic parameters, including learning rate, batch size, hyper-parameters in the distillation loss, etc., are all set to the same values as the baselines. Note that the settings for the parametric baseline activations (PReLU (He et al., 2015), APLU (Agostinelli, 2015; Agostinelli et al., 2014), and Swish (Ramachandran et al., 2017)) are all consistent with the original authors' recommendations. For the multi-segment activations (APLU and LMA), the segment numbers are set equal, namely 8 in our main experiments. Moreover, to measure the resource cost of the models, we report their parameter size and inference memory cost (Mem.), the latter recorded when predicting the testing samples one by one. The model size does not change significantly after replacing the activation function, since the additional parameters in all these activations are relatively few. More details about specific parameter settings, model specifications, and convergence curves can be found in the reproducibility supplementary materials.
Table 2: Accuracy (%, mean ± std) and inference memory cost (MB) of student models with different activations. Teacher on CIFAR-10: 21.4 MB, Acc. 92.83, Mem. 29.28; teacher on CIFAR-100: 68.7 MB, Acc. 77.56, Mem. 140.2.

| Dataset | Student (size) | Metric | ReLU | PReLU | Swish | APLU | LMA |
|---|---|---|---|---|---|---|---|
| CIFAR-10 | Student 1 (4.04 MB) | Acc. | 88.74 ± 0.25 | 89.31 ± 0.35 | 89.03 ± 0.11 | 89.92 ± 0.21 | 90.57 ± 0.20 |
| | | Mem. | 14.24 | 15.95 (+1.7) | 15.10 (+0.9) | 25.80 (+11.6) | 16.81 (+2.6) |
| | Student 2 (1.28 MB) | Acc. | 82.67 ± 0.46 | 84.35 ± 0.37 | 84.06 ± 0.36 | 85.31 ± 0.60 | 85.66 ± 0.34 |
| | | Mem. | 3.40 | 4.72 (+1.3) | 4.06 (+0.7) | 11.69 (+8.3) | 5.57 (+2.2) |
| | Student 3 (0.44 MB) | Acc. | 73.33 ± 0.79 | 75.30 ± 0.17 | 75.45 ± 0.34 | 77.54 ± 0.97 | 77.66 ± 0.47 |
| | | Mem. | 1.45 | 2.07 (+0.6) | 1.76 (+0.3) | 5.15 (+3.7) | 2.55 (+1.1) |
| CIFAR-100 | Student 1 (4.88 MB) | Acc. | 69.11 ± 0.80 | 70.03 ± 0.21 | 69.67 ± 0.40 | 70.99 ± 0.42 | 70.92 ± 0.42 |
| | | Mem. | 16.27 | 16.40 (+0.13) | 16.33 (+0.06) | 17.03 (+0.76) | 16.46 (+0.19) |
| | Student 2 (1.28 MB) | Acc. | 63.12 ± 1.00 | 64.52 ± 0.67 | 63.82 ± 0.78 | 66.28 ± 0.49 | 66.31 ± 0.68 |
| | | Mem. | 6.37 | 6.44 (+0.07) | 6.41 (+0.04) | 6.91 (+0.54) | 6.47 (+0.10) |
4.1 Image Classification
Settings Following (Polino et al., 2018) and its code (https://github.com/antspy/quantized_distillation), we first evaluate our method on CIFAR-10 and CIFAR-100, both well-known image classification datasets. For the experiments on CIFAR-10, relatively small CNN architectures are employed, including one teacher model and three student models of different scales. Wide Residual Networks (WRN) (Zagoruyko and Komodakis, 2016) are employed for the experiments on CIFAR-100, where WRN-16 is used as the teacher model while two WRN-10 networks are used as students. In the first phase, we train the teacher models and save them for the subsequent distilled training. Then, we compare the performance of the student models with different activations under the supervision of both the teacher models and the ground truth. Accuracy (Acc.) is used as the evaluation metric on this task.
Result Table 2 summarizes the image classification results of the various methods. From this table, we find that the multi-segment activations (APLU and LMA) outperform the other baselines on both datasets across all model scales, with LMA outperforming ReLU by 2% to 6% in accuracy. Meanwhile, smaller compact models show more obvious improvements from the multi-segment activations: on CIFAR-10, LMA outperforms ReLU by 2% on Student 1 but by 6% on Student 3. Besides, comparing APLU with LMA, although their accuracies are sometimes close, the additional inference memory cost of APLU is much larger than that of LMA, by about 3 to 4 times.
Table 3: Perplexity (Ppl., mean ± std), BLEU, and inference memory cost (MB) on machine translation. Teacher (443.4 MB): Ope — BLEU 14.92, Ppl. 29.71, Mem. 1014.8; WMT13 — BLEU 28.56, Ppl. 5.31, Mem. 1040.8. "OOM" denotes out of memory.

| Dataset | Student (size) | Metric | ReLU | PReLU | Swish | APLU | LMA |
|---|---|---|---|---|---|---|---|
| Ope | Student 1 (177.6 MB) | Ppl. | 31.84 ± 0.31 | 31.89 ± 0.64 | 30.91 ± 0.43 | 30.80 ± 0.00 | 30.21 ± 0.25 |
| | | BLEU | 13.73 ± 0.19 | 13.67 ± 0.27 | 13.89 ± 0.26 | 13.98 ± 0.21 | 14.11 ± 0.12 |
| | | Mem. | 407.39 | 458.98 (+52) | 430.77 (+23) | 719.73 (+312) | 487.23 (+80) |
| | Student 2 (87.2 MB) | Ppl. | 44.51 ± 0.52 | 44.23 ± 0.56 | 43.44 ± 0.39 | 42.97 ± 0.62 | 41.21 ± 0.35 |
| | | BLEU | 10.46 ± 0.18 | 10.51 ± 0.24 | 10.78 ± 0.23 | 10.87 ± 0.30 | 10.94 ± 0.18 |
| | | Mem. | 282.05 | 335.34 (+53) | 305.43 (+23) | 596.10 (+314) | 363.60 (+82) |
| | Student 3 (43.3 MB) | Ppl. | 71.69 ± 0.51 | 72.56 ± 1.03 | 70.45 ± 0.69 | 70.31 ± 0.61 | 67.62 ± 0.31 |
| | | BLEU | 6.12 ± 0.12 | 6.06 ± 0.15 | 6.26 ± 0.25 | 6.40 ± 0.29 | 6.64 ± 0.04 |
| | | Mem. | 220.49 | 274.63 (+54) | 243.87 (+23) | 535.39 (+315) | 302.89 (+82) |
| WMT13 | Student 1 (177.6 MB) | Ppl. | 6.44 ± 0.02 | 6.47 ± 0.03 | 6.34 ± 0.03 | OOM | 6.29 ± 0.04 |
| | | BLEU | 26.89 ± 0.05 | 26.81 ± 0.06 | 26.98 ± 0.08 | OOM | 27.12 ± 0.07 |
| | | Mem. | 419.40 | 470.99 (+52) | 442.78 (+23) | OOM | 499.24 (+81) |
| | Student 2 (43.3 MB) | Ppl. | 12.61 ± 0.05 | 12.72 ± 0.04 | 12.51 ± 0.03 | 12.35 ± 0.06 | 12.25 ± 0.05 |
| | | BLEU | 20.39 ± 0.09 | 19.96 ± 0.07 | 20.82 ± 0.08 | 21.02 ± 0.10 | 21.19 ± 0.08 |
| | | Mem. | 230.83 | 284.97 (+54) | 254.21 (+23) | 545.73 (+315) | 313.23 (+82) |
4.2 Machine Translation
Settings To further evaluate the effectiveness of our method, we conduct experiments on machine translation using the OpenNMT integration-test dataset (Ope), consisting of 200K training sentences and 10K test sentences, and the WMT13 dataset (Koehn, 2005), for a German-English translation task. The translation models we employ are based on the seq2seq models from OpenNMT (https://github.com/OpenNMT/OpenNMT-py), where the encoder and decoder are both Transformers (Vaswani et al., 2017) instead of the LSTMs used in (Polino et al., 2018). We do not use LSTMs to evaluate our method because their activations are usually Sigmoid and Tanh, both of which are saturating and quite different from PLAs. Besides one teacher model, we employ three student models of different scales on Ope, and two student models on WMT13. We use the perplexity (Ppl., lower is better) and the BLEU score, computed with the Moses toolkit, as the two evaluation metrics.
Result Table 3 shows the results on machine translation. From this table, we find that our method outperforms all the baseline activations. Specifically, the BLEU scores of LMA increase by 3% to 8% over ReLU on Ope and by 1% to 4% on the larger WMT13. Moreover, we observe the same advantages of LMA in multi-segment effectiveness and memory cost as in the image classification tasks. It is worth noting that using APLU may cause memory overflow (Out of Memory, OOM) due to its huge cost, as shown by the APLU-equipped Student 1 on WMT13.
4.3 Additional Experiment
Segment Study To verify that the expressiveness ability can be enhanced by increasing the segment number, we conduct additional experiments on CIFAR-10 to study the effect of the segment number in LMA. As shown in Fig. 2, as the segment number increases from 4 to 8, both APLU and LMA yield improving performance. Despite a slight decline beyond 10 segments, LMA remains much better than ReLU. Besides, the memory cost of APLU grows linearly with the segment number, while that of LMA remains stable and much lower.
Joint Use To show the effectiveness of jointly using our method with other compression techniques, we conduct a further experiment combining Quantized Distillation (Polino et al., 2018) with our method on CIFAR-10. From Table 4, we find that the accuracy of the LMA-equipped model is much higher than that of the ReLU-equipped one, again by about 2% to 6%, under all settings of the number of bits in the quantized model.
Table 4: Accuracy (%, mean ± std) of Quantized Distillation students on CIFAR-10 with ReLU vs. LMA.

| Method | Student 1 (ReLU) | Student 1 (LMA) | Student 2 (ReLU) | Student 2 (LMA) | Student 3 (ReLU) | Student 3 (LMA) |
|---|---|---|---|---|---|---|
| 4 bits | 85.74 ± 0.15 | 86.31 ± 0.41 | 77.04 ± 0.51 | 79.48 ± 0.79 | 65.33 ± 0.17 | 68.85 ± 0.99 |
| 8 bits | 87.02 ± 0.23 | 88.56 ± 0.52 | 80.53 ± 0.75 | 83.37 ± 0.51 | 70.23 ± 0.98 | 74.47 ± 0.74 |
Overall, all the experiments above imply that multi-segment activations, including APLU and LMA, achieve better performance than two-segment ones, and that the improvement brought by the multi-segment design becomes increasingly obvious as the model scale shrinks. Therefore, it is quite effective to leverage the segment number of a PLA to improve the performance of the compact model in model compression. Furthermore, LMA outperforms APLU in most cases while maintaining far more efficient memory usage, even in the single case where LMA does not beat APLU. This indicates that the high efficiency of LMA makes it well suited to resource-constrained environments. Beyond this, LMA can also be used conveniently and effectively together with other techniques. To conclude, LMA can play a critical role in model compression due to its highly competitive effectiveness, efficiency, and compatibility.
5 Conclusion and Outlook
In model compression, especially knowledge distillation, to fill the expressiveness-ability gap between compact NNs and complex NNs, we propose in this paper a novel, highly efficient Light Multi-segment Activation (LMA), which empowers the compact NN to yield performance comparable with the complex one. Specifically, to produce more segments while preserving low resource cost, LMA uses statistical information of the batch input to determine multiple segment cut points and then transforms the inputs linearly over the different segments. Experimental results on real-world tasks with NNs of multiple scales have demonstrated the effectiveness and efficiency of LMA. Besides, LMA is well compatible with other techniques such as quantization, further improving their compression results.
To the best of our knowledge, this is the first work that leverages multi-segment piecewise linear activations for model compression, and it provides good insight into designing efficient and powerful compact models. In the future, on the one hand, we will further reduce the time and space costs of computing LMA as much as possible, e.g., through hardware-level or specialized computation. On the other hand, improving the capacity of the activation is also a novel and significant direction for simplifying complex architectures and applying Neural Networks more efficiently.
Appendix A Reproducibility Details
We anonymously released the source code at https://github.com/LMA-NeurIPS19/LMA, which contains all of the experimental code and method implementations; it is mainly built on the codebase of Polino et al. (2018) (https://github.com/antspy/quantized_distillation). Furthermore, we use this supplementary material to provide some important details about specific settings, along with some intuitive results in an example figure. On the one hand, because the state-of-the-art complex models applied in general scenarios already have sufficient expressiveness ability, the gain from equipping them with LMA is sometimes marginal. On the other hand, LMA is designed particularly for the compact model in model compression. The experiments in this paper therefore mainly focus on evaluating the performance of LMA-based compact models in the compression scenario.
Generally, our implementation is based on PyTorch, and all experiments run on an NVIDIA Tesla P40 with 24 GB of memory. Besides, the inference memory cost is recorded by the PyTorch API torch.cuda.max_memory_allocated() while feeding the streaming test samples one-by-one, i.e., the test batch size is set to one.
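Concretely, that measurement loop looks roughly like the sketch below. The model and sample names are placeholders, and calling torch.cuda.reset_peak_memory_stats() before the loop to clear previous peaks is an assumption of this sketch rather than a detail stated above.

```python
import torch

def peak_inference_memory(model, samples, device="cuda"):
    """Feed test samples one-by-one (batch size one) and return the peak
    CUDA memory allocated during inference, in bytes; returns None on CPU."""
    model = model.to(device).eval()
    if device == "cuda":
        torch.cuda.reset_peak_memory_stats()   # clear peaks from earlier runs
    with torch.no_grad():
        for x in samples:
            model(x.unsqueeze(0).to(device))   # streaming, batch size = 1
    return torch.cuda.max_memory_allocated() if device == "cuda" else None
```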
A.1 Baseline Details
We compare the performance of LMA with several widely-used and state-of-the-art activations:
Parametric ReLU (PReLU) He et al. (2015), with the initial slope set to 0.25, as the authors suggested.
Swish Ramachandran et al. (2017), whose equation is f(x) = x · σ(βx), where σ is the Sigmoid function and β is either a constant or a trainable parameter. It is a state-of-the-art activation but not essentially a PLA; nevertheless, we can treat it as an approximation of the two-segment ReLU. We use the better-performing version in Ramachandran et al. (2017), in which β is trainable and initialized to one.
Adaptive Piecewise Linear Units (APLU) Agostinelli et al. (2014). The segment number of APLU is set the same as LMA's. APLU is initialized following the authors' code Agostinelli (2015): the slopes are initialized uniformly and the cut points normally. Besides, it is worth noting that the S in the equation of APLU (see Eqn. 1 in the paper) is not the segment number we refer to; it denotes the number of cut points. Therefore, an APLU with S cut points has S + 1 segments in total: S dynamic segments plus the extra one contributed by the added ReLU. For a fair comparison, whenever we mention APLU-K, we actually set K − 1 cut points in it.
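For reference, the APLU form of Agostinelli et al. (2014), whose hinge count S the counting above refers to, is

```latex
h(x) = \max(0, x) + \sum_{s=1}^{S} a^{s} \, \max\left(0,\; -x + b^{s}\right)
```

where each hinge s contributes one learnable slope a^s and one learnable cut point b^s, so the S hinge terms together with the base ReLU determine the total segment count.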
We do not take Maxout (Goodfellow et al., 2013) as a baseline because its parameter size is obviously much larger than the others'. Besides, in LMA, the moving-average factor for updating the segment cut points used in the inference phase is set to 0.99. Except for the segment-study experiments, the segment number is set to 8, the same as in APLU.
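As an illustration only (not the released implementation), the cut-point bookkeeping just described can be sketched as follows. Here batch quantiles stand in for the batch statistics that determine the cut points, and the identity initialization of the per-segment linear transforms is likewise an assumption of this sketch.

```python
import numpy as np

class LMASketch:
    """Hypothetical sketch of Light Multi-segment Activation (LMA):
    cut points come from batch statistics (quantiles here), tracked by a
    moving average (factor 0.99) for use at inference time; each segment
    then applies its own linear transform (slope, intercept)."""

    def __init__(self, n_segments=8, momentum=0.99):
        self.k = n_segments
        self.momentum = momentum
        self.running_cuts = None               # moving average of cut points
        # One (slope, intercept) per segment; identity init for this sketch.
        self.slopes = np.ones(n_segments)
        self.intercepts = np.zeros(n_segments)

    def _batch_cuts(self, x):
        # k segments need k - 1 interior cut points; use batch quantiles.
        qs = np.linspace(0.0, 1.0, self.k + 1)[1:-1]
        return np.quantile(x, qs)

    def forward(self, x, training=True):
        if training:
            cuts = self._batch_cuts(x)
            if self.running_cuts is None:
                self.running_cuts = cuts
            else:
                self.running_cuts = (self.momentum * self.running_cuts
                                     + (1.0 - self.momentum) * cuts)
        else:
            cuts = self.running_cuts           # frozen statistics at inference
        seg = np.searchsorted(cuts, x)         # segment index per element
        return self.slopes[seg] * x + self.intercepts[seg]
```

Note that, unlike APLU's per-hinge parameters, only the slopes and intercepts here are learned; the cut points are determined by the data, which is what keeps the parameter cost low.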
A.2 Dataset Details
For image classification, CIFAR-10 and CIFAR-100 Krizhevsky and Hinton (2009) are used. Both are well-known image-classification benchmark datasets, each containing a 50K training set and a 10K testing set of 32×32-pixel images. The difference between them is that CIFAR-10 has 10 class labels while CIFAR-100 has 100.
For machine translation, we evaluate our method on the OpenNMT integration-test dataset (Ope), which consists of 200K training sentences and 10K test sentences (obtained as in https://github.com/antspy/quantized_distillation/blob/master/datasets/translation_datasets.py#L211), and on the well-known WMT13 Koehn (2005) dataset for a German-English translation task. The WMT13 subset used contains 1.7M training sentences and 190K testing sentences.
A.3 Model Details
On CIFAR-10, the model specifications are listed in Table 5, where c denotes a convolutional layer, mp a max-pooling layer, dp a dropout layer and fc a fully-connected layer.
|Model||Architecture||Parameter Number|
|Teacher Model||76c-mp-dp-126c-mp-dp-148c-mp-dp-1200fc-dp-1200fc||5.34 M|
|Student Model 1||75c-mp-dp-50c-mp-dp-25c-mp-dp-500fc-dp||1.01 M|
|Student Model 2||50c-mp-dp-25c-mp-dp-10c-mp-dp-400fc-dp||0.32 M|
|Student Model 3||25c-mp-dp-10c-mp-dp-5c-mp-dp-300fc-dp||0.11 M|
On CIFAR-100, the parameter settings for the Wide Residual Network structure (implementation follows https://github.com/meliketoy/wide-resnet.pytorch) are listed in Table 6. For the detailed architecture of Wide Residual Networks (WRN) and the meaning of the listed parameters, refer to Zagoruyko and Komodakis (2016). Additionally, we employ the relatively shallow WRN-16 as the teacher model, because as the depth increases further, the performance of WRN no longer improves noticeably (the Error metric improves from 21.59% only to 20.75% when the depth grows from 16 to 22, as reported in Zagoruyko and Komodakis (2016)).
|Model||Widen Factor||Depth||Parameter Number|
|Teacher Model||10||16||17.2 M|
|Student Model 1||6||10||1.22 M|
|Student Model 2||4||10||0.32 M|
For the machine-translation models, we employ multi-layer Transformers Vaswani et al. (2017) as the encoder and decoder in a seq2seq framework, to evaluate the effectiveness of our method. The implementation is based on OpenNMT-py (https://github.com/OpenNMT/OpenNMT-py) and PyTorch, and the model specifications are listed in Table 7. On Ope, all four models are evaluated, while only the teacher model and the first and last student models are evaluated on WMT13, due to the space limitation and huge time cost. Additionally, the BLEU score is computed by the multi-bleu.perl script from the Moses project (http://www.statmt.org/moses/?n=moses.baseline).
|Model||Embedding Size||Hidden Units||Encoder Layers||Decoder Layers||Parameter Number|
|Teacher Model||512||512||6||6||116 M|
|Student Model 1||256||256||3||3||47 M|
|Student Model 2||128||128||3||3||23 M|
|Student Model 3||64||64||3||3||11 M|
A.4 Hyper-parameter Setting
As the distributions of the datasets used in the experiments differ from each other, we use different hyper-parameters on different datasets. However, the settings differ only between datasets, not between models: the hyper-parameter setting is strictly the same for all models on any given dataset. Besides, all hyper-parameters are basically the same as in Polino et al. (2018)'s settings, without deliberate adjustments.
On CIFAR-10, the hyper-parameters are listed in Table 8, where the adjustment of the learning rate depends on the trend of the validation accuracy: if the validation accuracy stops increasing, then after waiting a fixed number of epochs, the learning rate is halved. On CIFAR-100, the hyper-parameters are listed in Table 9, and the learning-rate adjustment follows the setting in Zagoruyko and Komodakis (2016).
The hyper-parameters for machine translation are listed in Table 10. The model and hyper-parameter settings are the same on both Ope and WMT13, and are mainly the official defaults recommended by OpenNMT. More details can be found in our code, specifically the standard_options.py file in the onmt directory.
Table 8: Hyper-parameters on CIFAR-10.
|Group||Hyper-parameter||Value|
|Learning Rate||Initial LR||0.01|
|Epochs to Wait Before Decaying||10|
|Epochs to Wait After Decaying||8|
|Maximum Decay Times||11|
|Optimizer||Method||Stochastic Gradient Descent|
|Distillation||Distillation Loss Weight||0.7|
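The plateau-halving schedule described for CIFAR-10 can be sketched in a few lines. The class name and the interpretation of the two waiting periods (patience before the first decay versus after a decay) are assumptions for this sketch; torch.optim.lr_scheduler.ReduceLROnPlateau implements a closely related policy.

```python
class PlateauHalver:
    """Sketch of the assumed schedule: if validation accuracy has not
    improved for the applicable number of waiting epochs, halve the
    learning rate, up to `max_decays` times."""

    def __init__(self, lr=0.01, wait_before=10, wait_after=8, max_decays=11):
        self.lr = lr
        self.wait_before = wait_before   # patience before the first decay
        self.wait_after = wait_after     # patience after any decay
        self.max_decays = max_decays
        self.best = -float("inf")
        self.stale = 0                   # epochs since last improvement
        self.decays = 0

    def step(self, val_acc):
        """Call once per epoch with the validation accuracy; returns the LR."""
        if val_acc > self.best:
            self.best, self.stale = val_acc, 0
        else:
            self.stale += 1
            patience = self.wait_before if self.decays == 0 else self.wait_after
            if self.stale >= patience and self.decays < self.max_decays:
                self.lr /= 2
                self.decays += 1
                self.stale = 0
        return self.lr
```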
Table 9: Hyper-parameters on CIFAR-100.
|Group||Hyper-parameter||Value|
|Learning Rate||LR in Epoch 1-60||0.1|
|LR in Epoch 61-120||0.02|
|LR in Epoch 121-160||4e-3|
|LR in Epoch 161-200||8e-4|
|Optimizer||Method||Stochastic Gradient Descent|
|Distillation||Distillation Loss Weight||0.7|
Table 10: Hyper-parameters for machine translation.
|Group||Hyper-parameter||Value|
|Attention Mechanism||Head Count||8|
|Batch Setting||Batch Type||Tokens|
|Learning Rate||Initial LR||2.0|
|Start Decay at Epoch||8|
|Distillation||Distillation Loss Weight||0.7|
|Beam Computing||Beam Size||5|
A.5 Accuracy-Epoch Curves on CIFAR-100
To show the effectiveness of LMA more intuitively, we provide one more example figure here, which shows the testing-accuracy curves of the student models based on ReLU and LMA-8 respectively, when training on CIFAR-100. From the curves in Fig. 3, it is easy to see the high effectiveness of LMA from its large improvement over ReLU.
- Agostinelli  Forest Agostinelli. Learned activation functions source. https://github.com/ForestAgostinelli/Learned-Activation-Functions-Source/tree/master, 2015.
- Agostinelli et al.  Forest Agostinelli, Matthew Hoffman, Peter Sadowski, and Pierre Baldi. Learning activation functions to improve deep neural networks. arXiv:1412.6830, 2014.
- Courbariaux et al.  Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pages 3123–3131, 2015.
- Dougherty et al.  James Dougherty, Ron Kohavi, and Mehran Sahami. Supervised and unsupervised discretization of continuous features. In Machine Learning Proceedings 1995, pages 194–202. Elsevier, 1995.
- Frankle and Carbin  Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJl-b3RcF7.
- Goodfellow et al.  Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. arXiv preprint arXiv:1302.4389, 2013.
- Gupta et al.  Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning, pages 1737–1746, 2015.
- Gysel et al.  Philipp Gysel, Mohammad Motamedi, and Soheil Ghiasi. Hardware-oriented approximation of convolutional neural networks. arXiv preprint arXiv:1604.03168, 2016.
- Han et al. [2015a] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015a.
- Han et al. [2015b] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143, 2015b.
- Han et al.  Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Shijian Tang, Erich Elsen, Bryan Catanzaro, John Tran, and William J Dally. Dsd: regularizing deep neural networks with dense-sparse-dense training flow. arXiv preprint arXiv:1607.04381, 3(6), 2016.
- Hassibi and Stork  Babak Hassibi and David G Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in neural information processing systems, pages 164–171, 1993.
- He et al. [2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
- He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- Hinton et al.  Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- Hu et al.  Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.
- Hubara et al.  Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1):6869–6898, 2017.
- Iandola et al.  Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.
- Ioffe and Szegedy  Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.
- Jin et al.  Xiaojie Jin, Xiaotong Yuan, Jiashi Feng, and Shuicheng Yan. Training skinny deep neural networks with iterative hard thresholding methods. arXiv preprint arXiv:1607.05423, 2016.
- Koehn  Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In MT summit, volume 5, pages 79–86, 2005.
- Krizhevsky and Hinton  Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
- Landwehr et al.  Niels Landwehr, Mark Hall, and Eibe Frank. Logistic model trees. Machine learning, 59(1-2):161–205, 2005.
- LeCun et al.  Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in neural information processing systems, pages 598–605, 1990.
- LeCun et al.  Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
- Li et al.  Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
- Luo et al.  Ping Luo, Zhenyao Zhu, Ziwei Liu, Xiaogang Wang, and Xiaoou Tang. Face model compression by distilling knowledge from neurons. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
- Malash and El-Khaiary  Gihan F Malash and Mohammad I El-Khaiary. Piecewise linear regression: A statistical method for the analysis of experimental adsorption data by the intraparticle-diffusion models. Chemical Engineering Journal, 163(3):256–263, 2010.
- Mellempudi et al.  Naveen Mellempudi, Abhisek Kundu, Dheevatsa Mudigere, Dipankar Das, Bharat Kaul, and Pradeep Dubey. Ternary neural networks with fine-grained quantization. arXiv preprint arXiv:1705.01462, 2017.
- Mishkin and Matas  Dmytro Mishkin and Jiri Matas. All you need is a good init. arXiv:1511.06422, 2015.
- Mishra and Marr  Asit Mishra and Debbie Marr. Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1ae1lZRb.
- Mishra et al.  Asit Mishra, Eriko Nurvitadhi, Jeffrey J Cook, and Debbie Marr. Wrpn: wide reduced-precision networks. arXiv preprint arXiv:1709.01134, 2017.
- Montufar et al.  Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in neural information processing systems, pages 2924–2932, 2014.
- Nair and Hinton  Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of international conference on machine learning, pages 807–814, 2010.
- Pascanu et al.  Razvan Pascanu, Guido Montufar, and Yoshua Bengio. On the number of response regions of deep feed forward networks with piece-wise linear activations. arXiv preprint arXiv:1312.6098, 2013.
- Polino et al.  Antonio Polino, Razvan Pascanu, and Dan Alistarh. Model compression via distillation and quantization. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=S1XolQbRW.
- Ramachandran et al.  Prajit Ramachandran, Barret Zoph, and Quoc V Le. Swish: a self-gated activation function. arXiv:1710.05941, 2017.
- Rastegari et al.  Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
- Sau and Balasubramanian  Bharat Bhusan Sau and Vineeth N Balasubramanian. Deep model compression: Distilling knowledge from noisy teachers. arXiv preprint arXiv:1610.09650, 2016.
- Vaswani et al.  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
- Wang et al.  Ji Wang, Weidong Bao, Lichao Sun, Xiaomin Zhu, Bokai Cao, and Philip S. Yu. Private model compression via knowledge distillation. 2018.
- Wen et al.  Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in neural information processing systems, pages 2074–2082, 2016.
- Wu et al.  Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4820–4828, 2016.
- Yang et al.  Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5687–5695, 2017.
- Zagoruyko and Komodakis  Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
- Zhu et al.  Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.