Light Multi-segment Activation for Model Compression

07/16/2019 · Zhenhui Xu, et al. · Microsoft, Peking University

Model compression has become necessary when applying neural networks (NN) to many real application tasks that can accept slightly reduced model accuracy but impose strict limits on model complexity. Recently, Knowledge Distillation, which distills the knowledge from a well-trained and highly complex teacher model into a compact student model, has been widely used for model compression. However, under strict requirements on the resource cost, it is quite challenging to achieve performance comparable with the teacher model, essentially due to the drastically reduced expressiveness ability of the compact student model. Inspired by the nature of the expressiveness ability in Neural Networks, we propose to use a multi-segment activation, which can significantly improve the expressiveness ability at very little cost, in the compact student model. Specifically, we propose a highly efficient multi-segment activation, called Light Multi-segment Activation (LMA), which can rapidly produce multiple linear regions with very few parameters by leveraging statistical information. Using LMA, the compact student model achieves much better performance, both effectively and efficiently, than a ReLU-equipped one of the same model scale. Furthermore, the proposed method is compatible with other model compression techniques, such as quantization, which means they can be used jointly for better compression performance. Experiments with state-of-the-art NN architectures on real-world tasks demonstrate the effectiveness and extensibility of LMA.


1 Introduction

Neural Networks (NN) have become widely-used models in many real-world tasks, such as image classification, translation, and speech recognition. In the meantime, the increasing size and complexity of advanced NN models (unless otherwise stated, the term "model" in this paper refers to the Neural Network model) have raised a critical challenge (Wang et al., 2018) in applying them to many real application tasks, which can accept an appropriate performance drop but have extremely limited tolerance to high model complexity. Running NN models on mobile devices and embedded systems is an emerging example that makes every effort to avoid expensive computation and storage costs but can endure slightly reduced model accuracy.

Consequently, many research studies have paid attention to producing compact and fast NN models while maintaining acceptable model performance. In particular, one of the most active directions investigated model compression through pruning (LeCun et al., 1990; Hassibi and Stork, 1993; Han et al., 2015b; Li et al., 2016; Frankle and Carbin, 2019) or quantizing (Courbariaux et al., 2015; Rastegari et al., 2016; Wu et al., 2016; Zhu et al., 2016; Hubara et al., 2017; Mellempudi et al., 2017) trained large NN models into squeezed ones with trimmed redundancy but preserved accuracy. More recently, increasing efforts have explored Knowledge Distillation (Hinton et al., 2015) to obtain compact NN models by training them with supervision from well-trained larger NN models (Polino et al., 2018; Wang et al., 2018; Mishra and Marr, 2018; Luo et al., 2016; Sau and Balasubramanian, 2016). Compared with directly training a compressed model from scratch merely using the ground truth, the supervision in terms of soft distributed representations on the output layer of the large teacher model can significantly enhance the effectiveness of the resulting compact student model. In practice, nevertheless, it is quite difficult to produce a compressed student model that yields effectiveness similar to the complex teacher model, essentially due to the limited expressiveness ability of the compressed one under the strictly restricted parameter size.

Figure 1: Depiction of distillation with LMA.

Intuitively, to enhance the power of the compressed model, it is necessary to increase its expressiveness ability. However, traditional approaches that introduce more layers or hidden units into the model can easily violate the strict restrictions on model size. Fortunately, besides the model scale, the nonlinear transformation within the NN model, in terms of the activation, plays an equally important role in determining the expressiveness ability. As pointed out by (Montufar et al., 2014), an NN model that uses multi-layer ReLU or another piecewise linear function as its activation is essentially a complex piecewise linear function. Moreover, the number of linear regions produced by an NN model depends not only on its model scale but also on the number of regions in its activation. Inspired by the limited cost of adding more regions to the activation, it is more efficient to improve the expressiveness ability of NN models via a multi-segment activation than by arbitrarily increasing the model scale.

Thus, in this paper, we introduce a novel, highly efficient piecewise linear activation, in order to improve the expressiveness ability of compressed models at little cost. Specifically, as shown in Fig. 1, we leverage a generic knowledge distillation framework in which the compact student model is equipped with a multi-segment piecewise linear function as its activation, named the Light Multi-segment Activation (LMA). In LMA, we first cut the input range into multiple segments using batch statistical information, ensuring that it can adapt to any input range lightly and efficiently. Then, we assign the inputs customized slopes and biases according to the segments they belong to, which leads to NN models with higher expressiveness ability due to the stronger non-linearity of the multi-segment activation. Owing to the above design, an LMA-equipped compact student model has two advantages: 1) it has much higher expressiveness ability than one merely endowed with vanilla ReLU; 2) its resource cost is much smaller and even more controllable than that of other types of multi-segment piecewise linear activation.

Extensive experiments with NN architectures of multiple scales on various real tasks, including image classification and machine translation, demonstrate both the effectiveness and the efficiency of LMA, indicating that LMA improves the expressiveness ability, and thus the performance, of the student model. Additional experiments further illustrate that our method can improve the expressiveness ability of models compressed via other popular techniques, especially quantization, such that jointly leveraging them achieves even better compression results.

The main contributions of this paper are multi-fold:

  • It proposes a novel multi-segment piecewise linear function for activation, which improves the expressiveness ability of the compressed student model within the knowledge distillation framework. To the best of our knowledge, this is the first work to leverage a multi-segment piecewise linear activation in model compression.

  • By using well-designed statistical information from each batch, the proposed activation can efficiently improve the performance of compressed models while preserving low resource cost.

  • The proposed method is well compatible with other popular compression techniques, so that it is easy to combine them and further enhance compression effectiveness.

  • On various challenging real tasks, experimental results with multiple models of different scales show that our method performs well; the effectiveness of jointly using our method with others is also demonstrated in the experiments.

2 Related work

Our work is mainly related to two research areas, model compression and piecewise linear activation. Model pruning, quantization and distillation are representative methods for the former, while the latter typically studies the respective effects of ReLU, Maxout and APLU on NN performance.

Model Compression    In this area, LeCun et al. (1990) and Hassibi and Stork (1993) first explored pruning based on second derivatives. More recently, Han et al. (2015b, 2016); Jin et al. (2016); Hu et al. (2016); Yang et al. (2017) pruned the weights of Neural Networks with different strategies and made notable progress. Most recently, Frankle and Carbin (2019) showed that a dense Neural Network contains a sparse trainable subnetwork that can match the performance of the original network, known as the lottery ticket hypothesis. On the other hand, Gupta et al. (2015) conducted a comprehensive study on the effect of low-precision fixed-point computation for deep learning. Quantization is therefore also an active research area, where various methods have been proposed (Mellempudi et al., 2017; Hubara et al., 2017; Rastegari et al., 2016; Wu et al., 2016; Zhu et al., 2016).

Besides, using distillation for size reduction was mentioned by Hinton et al. (2015), which gave a new direction for training compact student models. The weighted average of the soft distributed representation from the teacher's output and the ground truth is very useful when training a model, so several practices (Wang et al., 2018; Luo et al., 2016; Sau and Balasubramanian, 2016) have adopted it for training compressed compact models. Moreover, recent works also proposed to combine quantization with distillation, producing better compression results. Among these, Mishra and Marr (2018) used knowledge distillation for low-precision models, showing that distillation can also help train quantized models. Polino et al. (2018) proposed a more in-depth combination of these two methods, named Quantized Distillation. Besides, some works (Han et al., 2015a; Iandola et al., 2016; Wen et al., 2016; Gysel et al., 2016; Mishra et al., 2017) further reduced the model size by combining multiple compression techniques such as quantization, weight sharing and weight coding. The combination of our method with the others is also shown in this paper.

Piecewise Linear Activation    A piecewise linear function is composed of multiple linear segments. Some piecewise functions are continuous, when the boundary values computed by the functions of two adjacent intervals coincide, whereas others are not. Benefiting from its simplicity and its ability to fit any function given enough segments, the piecewise linear function is widely used in machine learning models (Landwehr et al., 2005; Malash and El-Khaiary, 2010), especially as activations in Neural Networks (LeCun et al., 2015). Theoretically, Montufar et al. (2014) and Pascanu et al. (2013) studied the number of linear regions in Neural Networks produced by piecewise linear activation functions (PLA), which can be used to measure the expressiveness ability of the networks.

Specifically, as a two-segment PLA, the Rectified Linear Unit (ReLU) (Nair and Hinton, 2010) and its parametric variants can be generally defined as $y = \max(x, \alpha x)$, where $x$ is the input, $\alpha$ is a linear slope, and $y$ is the activated output. The original ReLU fixes $\alpha$ to zero, so the formula degenerates to $y = \max(x, 0)$; Parametric ReLU (PReLU) (He et al., 2015) makes $\alpha$ learnable and initializes it to 0.25. Besides, there are also some PLAs with multiple segments derived from ReLU. For example, Maxout (Goodfellow et al., 2013) is a typical multi-segment PLA, defined as $h(x) = \max_{j \in [1, K]} z_j$, where $K$ can be treated as its segment number and the input is transformed into the maximum of $K$-fold linearly transformed candidates $z_j = x^{\top} W_j + c_j$. Adaptive Piecewise Linear Units (APLU) (Agostinelli et al., 2014) are also multi-segment, defined as a sum of hinge-shaped functions,

$$h(x) = \max(0, x) + \sum_{s=1}^{S} a^{s} \max(0, -x + b^{s}), \qquad (1)$$

where $S$ is a hyper-parameter set in advance, while the variables $a^{s}$, $b^{s}$ for $s = 1, \ldots, S$ are learnable. The $a^{s}$ control the slopes of the linear segments, while the $b^{s}$ determine the locations of the hinges, which play a role similar to segment cut points.
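To make the hinge-sum form of Eqn. (1) concrete, the following is a minimal PyTorch sketch of an APLU-style activation; the per-channel parameterization, tensor shapes and initialization are our assumptions for illustration, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class APLUSketch(nn.Module):
    """Hinge-sum piecewise linear activation in the spirit of Eqn. (1):
    y = max(0, x) + sum_s a_s * max(0, -x + b_s), with learnable a_s, b_s."""

    def __init__(self, num_channels: int, num_hinges: int = 7):
        super().__init__()
        # Slopes start near zero, hinge locations are randomly placed (assumed init).
        self.a = nn.Parameter(0.01 * torch.randn(num_hinges, num_channels))
        self.b = nn.Parameter(torch.randn(num_hinges, num_channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels); broadcast over the hinge dimension S.
        hinges = torch.relu(-x.unsqueeze(0) + self.b.unsqueeze(1))  # (S, batch, C)
        return torch.relu(x) + (self.a.unsqueeze(1) * hinges).sum(dim=0)
```

Note how the forward pass materializes an (S, batch, C) intermediate tensor: this S-fold blow-up is exactly the memory issue the paper raises against APLU in Section 3.1.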

In this paper, after studying the connection between the above two areas, we are the first to leverage the properties of PLAs for model compression, improving the expressiveness ability of a compact model via a multi-segment activation and thereby improving its performance.

3 Methodology

We start by studying the connection between PLA and the expressiveness ability of Neural Networks, followed by introducing the Light Multi-segment Activation (LMA) that is used to further improve the performance of the compact model in model compression.

3.1 Preliminaries

Expressiveness Ability Study    Practically, increasing the complexity of a neural network, in terms of either width (Zagoruyko and Komodakis, 2016) or depth (He et al., 2016), can result in improved performance, essentially due to the higher expressiveness ability of the NN. However, when applying a NN in resource-constrained environments, its scale cannot be inflated without limit. Fortunately, the nonlinear transformation within the NN, in terms of the activation, provides another vital channel to enhance the expressiveness ability. Yet, the widely-used ReLU is just a simple PLA with only two segments, where the slope on the positive segment is fixed to one while the other is zero. Therefore, other than enlarging the scale of the NN model, another effective way to enhance its expressiveness ability is to leverage more powerful activation functions. In this paper, we propose to increase the segment number of the activation function to enhance its expressiveness ability, and thus empower the compact NN to yield good performance.

Theoretically, there are also related analyses (Montufar et al., 2014) that justify our motivation. As pointed out there, the capacity, i.e., the expressiveness ability, of a PLA-activated Neural Network can be measured by the number of linear regions of the model. For a deep Neural Network, in the $k$-th hidden layer with $n_k$ units, the number of separate input-space neighborhoods $\mathcal{N}_k^{R}$ that are mapped to a common neighborhood $R$ can be computed recursively as

$$\mathcal{N}_k^{R} = \sum_{R' \in \mathcal{P}_k^{R}} \mathcal{N}_{k-1}^{R'}, \qquad \mathcal{N}_0^{R} = 1, \qquad (2)$$

where $S_k$ denotes the set of (vector-valued) activations reachable by the $k$-th layer for all possible inputs, and $\mathcal{P}_k^{R}$ denotes the set of subsets $R' \subseteq S_{k-1}$ that are mapped by the activation onto $R$. Based on the above result, the following lemma (see Montufar et al. (2014), Lemma 2) is given.

Lemma 1

The maximal number of linear regions of the functions computed by an $L$-layer Neural Network with piecewise linear activations is at least $N \cdot |\mathcal{P}^{L}|$, where $N$ is defined by Eqn. (2), and $\mathcal{P}^{L}$ is a set of neighborhoods in distinct linear regions of the function computed by the last hidden layer.

Given the above lemma, the number of linear regions of a Neural Network is in effect influenced by the layer number, the hidden unit size, and the region number of the PLA. From ReLU to Maxout, the significant improvement lies in the number of linear regions each activation contributes to the bound in the lemma, which is also the basis of our approach. Taking Maxout as an example for detailed analysis, this leads to an important corollary: a Maxout network with $L$ layers of width $n_0$ and rank $K$ can compute functions with at least $K^{L-1} K^{n_0}$ linear regions (see Montufar et al. (2014), Theorem 8). Meanwhile, ReLU can be treated as a special rank-2 case of Maxout, whose bound is obtained similarly by Pascanu et al. (2013). Obviously, the number of linear regions can be improved by increasing $L$, $n_0$, or $K$. However, in a compressed model, neither the layers nor the hidden units can be increased much. Thus, we propose to construct a highly efficient multi-segment activation function whose number of linear regions is larger.

Analysis on Existing Multi-segment PLAs    As mentioned in Related Work, previous studies have already proposed some multi-segment PLAs. In the rest of this subsection, we analyze whether they are suitable for model compression. Considering Maxout first, its regions are produced by $K$-fold weights, and only the maximum of its $K$-fold outputs is picked to feed forward, which obviously causes redundancy within Maxout. On the contrary, to construct a PLA with multiple segments while ensuring a limited parameter increment, a more intuitive inspiration, from the definition of a piecewise linear function, is to first cut the input range into multiple segments and then transform the input linearly with individual coefficients (i.e., slopes and biases) on the different segments. In this way, the number of parameters added by this scheme can be kept on the order of the segment number $K$ per layer, compared with the $K$-fold replication of the layer weights in the Maxout NN assumed above. In fact, APLU is a hinge-based implementation of this scheme, with few additional parameters. Specifically, in Eqn. (1), the $b^{s}$ are the cut points of the input range, and the $a^{s}$ can be grouped accumulatively into segment coefficients. However, APLU increases the memory cost due to its accumulation operation. In detail, APLU requires $S$ times the intermediate variables to compute the hinge terms in parallel and then accumulates them all at once. Although we could instead accumulate them recursively to avoid this, doing so would be $S$ times slower, which is unacceptable. Besides, as $S$ becomes larger, the memory cost grows linearly.

In a word, neither Maxout nor APLU can be directly employed for model compression, in that Maxout produces many more parameters and APLU is memory-consuming. In the following subsection, we introduce a new activation that is both effective and efficient for model compression.

3.2 Light Multi-segment Activation

Method    LMA mainly contains two steps. The first is batch segmentation, which is proposed to find the segment cut points based on the batch statistical information. Then the inputs are transformed with the corresponding linear slopes and biases according to their own segments.

Firstly, to construct a multi-segment piecewise activation, we need to cut the continuous inputs into multiple segments. There are two straightforward solutions: 1) pre-defined cut points, as in vanilla ReLU; 2) trained cut points, as in APLU. For the former, as the input ranges of the hidden layers change dramatically during training, it is hard to define appropriate cut points in advance. For the latter, the cut points are unstable due to random initialization and stochastic updates by back-propagation. Since these naive solutions do not work well, inspired by the success of Batch Normalization (Ioffe and Szegedy, 2015), we propose Batch Segmentation, which determines the segment boundaries from statistical information.

There are two statistical schemes (Dougherty et al., 1995) for finding appropriate segments: one based on frequency, and the other based on numerical values. Concretely, with the frequency-based method, each segment contains the same number of inputs, while with the numerical-value-based one, the numerical width of each segment is equal. The frequency-based method is more robust since it is not sensitive to numerical values; however, it is not efficient, especially when running on a GPU and applied to model compression. Thus, the numerical-value-based solution is used in LMA for efficiency. Specifically, we assume the input follows a normal distribution and cut the segments with equal value width, so each interior segment cut point is defined as

$$c_i = \mu - 3\sigma + \frac{6\sigma}{K} \cdot i, \qquad i = 1, \ldots, K - 1, \qquad (3)$$

where $K$ is the segment number (a hyper-parameter), and $\mu$ and $\sigma$ are the mean and standard deviation of the batch input $x$, respectively. To reduce the effect of outliers and make use of the properties of the normal distribution, we take $\mu \pm 3\sigma$ as the range endpoints and assign the cut points according to this assumption. As in Batch Normalization, moving averages of $\mu$ and $\sigma$ are used in the test phase. To further improve efficiency, as well as to obtain more stable statistical information, $\mu$ and $\sigma$ can be calculated once and shared within the same layer.
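As a minimal illustration of the batch segmentation step, the sketch below computes the $K-1$ interior cut points of Eqn. (3) from the batch mean and standard deviation, assuming the $\mu \pm 3\sigma$ endpoints described above; the function and variable names are ours, not from the released code.

```python
import torch

def segment_cut_points(mu: torch.Tensor, sigma: torch.Tensor, num_segments: int) -> torch.Tensor:
    """Eqn. (3): K - 1 equal-width interior cut points over [mu - 3*sigma, mu + 3*sigma]."""
    i = torch.arange(1, num_segments, device=mu.device, dtype=mu.dtype)
    return mu - 3.0 * sigma + (6.0 * sigma / num_segments) * i

# Example with batch statistics:
# x = torch.randn(256, 64)
# cuts = segment_cut_points(x.mean(), x.std(), 8)  # 7 interior cut points
```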

After determining the segment boundaries, we need to assign a coefficient, i.e., a slope and a bias, to each input according to the segment it belongs to. To avoid the memory-consumption problem of APLU, we use independent slopes and biases in LMA. Formally, the activation can be defined as

$$y = a_{k} \cdot x + b_{k}, \qquad k = \operatorname{seg}(x), \qquad (4)$$

where $a_k$ denotes the slope coefficient, $b_k$ denotes the bias, and $k = \operatorname{seg}(x)$ denotes the index of the segment that $x$ falls into. In particular, considering that there may still be a few extreme inputs violating the normal distribution assumption, the first and last segments are set to $(-\infty, c_1]$ and $(c_{K-1}, +\infty)$ respectively, instead of being bounded by $\mu - 3\sigma$ and $\mu + 3\sigma$. Finally, after the above steps, the linearly transformed values are fed forward to the next layer.
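Putting batch segmentation and the per-segment linear transform together, here is a hedged PyTorch sketch of an LMA layer. It shares one slope/bias pair per segment across the whole layer, keeps a moving average of $\mu$ and $\sigma$ for the test phase, and is initialized to behave like ReLU, as described later in this section; the exact module interface is our assumption and not the authors' released code.

```python
import torch
import torch.nn as nn

class LMASketch(nn.Module):
    """Light Multi-segment Activation (sketch). One (slope, bias) pair per
    segment, shared across the whole layer, so only 2*K extra parameters."""

    def __init__(self, num_segments: int = 8, momentum: float = 0.99):
        super().__init__()
        self.k = num_segments
        self.momentum = momentum
        # Initialize as ReLU: zero biases, left-half slopes 0, right-half slopes 1.
        slopes = torch.zeros(num_segments)
        slopes[num_segments // 2:] = 1.0
        self.slopes = nn.Parameter(slopes)
        self.biases = nn.Parameter(torch.zeros(num_segments))
        self.register_buffer("running_mu", torch.zeros(1))
        self.register_buffer("running_sigma", torch.ones(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            mu, sigma = x.mean(), x.std()
            self.running_mu.mul_(self.momentum).add_((1 - self.momentum) * mu.detach())
            self.running_sigma.mul_(self.momentum).add_((1 - self.momentum) * sigma.detach())
        else:
            mu, sigma = self.running_mu, self.running_sigma
        # Eqn. (3): K - 1 interior cut points over [mu - 3*sigma, mu + 3*sigma].
        i = torch.arange(1, self.k, device=x.device, dtype=x.dtype)
        cuts = (mu - 3.0 * sigma + (6.0 * sigma / self.k) * i).reshape(-1)
        # Eqn. (4): look up each input's segment index, then apply its slope and bias.
        idx = torch.bucketize(x.detach(), cuts.detach())  # integer indices in [0, K-1]
        return self.slopes[idx] * x + self.biases[idx]
```

Unlike the APLU sketch above, no $K$-fold intermediate tensors are materialized; only an integer segment-index tensor of the same shape as the input is needed, matching the $O(N)$ running space claimed in Table 1.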

          Param. Size (per layer)    Mem. Cost (at activation)
Maxout    K x W                      O(K x N)
APLU      2K x H                     O(K x N)
LMA       2K                         O(N)
Table 1: Cost comparison between multi-segment activation functions, where K is the segment number (or Maxout rank), W the weight count of the layer, H the hidden unit number, and N the activation size.

Analysis and Discussion    In the following, we discuss LMA in more detail from the perspectives of complexity analysis and initialization. In LMA, there are only two additional trainable variables, the slopes $a$ and the biases $b$, for each layer, whose total size is $2K \times H$, where $K$ is the segment number and $H$ is the hidden unit number. Furthermore, to reduce the parameter size to the extreme, $a$ and $b$ are shared at the layer level, which means that all the units or feature maps in one specific layer are activated by the same LMA. Therefore, the number of parameters brought by LMA in one layer is only $2K$, reduced by a factor of $H$ compared with APLU. Moreover, regarding the running memory cost in the inference phase, LMA only produces the segment index for each input, whose space cost is $O(N)$, while APLU needs $K$ hinge terms and Maxout needs $K$ activation candidates. The cost comparison among the multi-segment PLAs is summarized in Table 1, which lists the parameter size and the running space cost of the activation in one layer. It shows that LMA is more suitable for model compression because of its lower storage and running space costs.

Besides, the slopes and biases on all segments need to be initialized in LMA. Initialization methods can generally be categorized into two classes: 1) random initialization, as for the other parameters in Neural Networks; 2) initialization to a known activation, such as vanilla ReLU or PReLU. Though random initialization does not impose any assumptions and may achieve better performance (Mishkin and Matas, 2015), it usually introduces uncertainty and leads to unstable training. With this in mind, we choose the second initialization method for LMA. Specifically, we initialize LMA to be the vanilla ReLU, which means that all biases are initialized to zero, the slopes of the left half of the segments are initialized to zero, and the remaining slopes are initialized to one.

Model Compression     As an effective method to improve the expressiveness ability of the compressed model, LMA can be applied together with distillation and other compression techniques. Under the distillation framework, we first train a state-of-the-art model to obtain performance as good as possible. Then, given it as the teacher model, a more compact architecture is employed as the student to learn the knowledge from the teacher. Because of the parameter reduction in the student, it usually performs considerably worse than the teacher despite using knowledge distillation. Here, we replace all the original ReLUs in the student model with our LMA, improving its expressiveness ability and thereby further improving its performance. The replacement is very convenient: it only requires changing one line of code in the implementation. After that, following (Hinton et al., 2015; Polino et al., 2018), the distillation loss for training the student is a standard weighted average of the loss from the ground truth and the loss from the teacher's output, formally defined as

$$\mathcal{L}(z_s) = (1 - \alpha) \cdot \mathcal{L}_{\mathrm{GT}}(z_s, y) + \alpha \cdot \mathcal{L}_{\mathrm{KD}}(z_s, z_t), \qquad (5)$$

where $\alpha$ is a hyper-parametric factor, set to 0.7 throughout, that adjusts the weights of the two losses; $z_s$ denotes the student's output logits; the first loss $\mathcal{L}_{\mathrm{GT}}$ is a Cross Entropy Loss or a Negative Log Likelihood Loss with the ground truth labels $y$, depending on the task (CE for image classification and NLL for machine translation in our experiments); the latter loss $\mathcal{L}_{\mathrm{KD}}$ is a Kullback–Leibler Divergence Loss with the teacher's output logits $z_t$. Additionally, when calculating $\mathcal{L}_{\mathrm{KD}}$, we also use a temperature factor to soften $z_s$ and $z_t$, whose specific settings are given in the experiments.
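For concreteness, here is a hedged sketch of the objective of Eqn. (5) for the classification case, with $\alpha = 0.7$ and a softening temperature as in the hyper-parameter tables; the function name and exact scaling conventions are our assumptions.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.7, temperature=2.0):
    """Eqn. (5): (1 - alpha) * CE(ground truth) + alpha * KL(teacher), with
    temperature-softened distributions for the KL term."""
    ce = F.cross_entropy(student_logits, labels)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean")
    # Some implementations rescale the KD term by temperature**2; omitted in this sketch.
    return (1.0 - alpha) * ce + alpha * kd
```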

Besides, LMA is well compatible with other compression techniques, since it is convenient to replace the ReLU activations with LMA. For example, based on a recent representative method, Quantized Distillation (Polino et al., 2018), after replacing ReLU with LMA in the student model, our method still empowers the student to achieve higher performance than the original one, even though it is quantized to a low-precision model during training, as will be shown in the experiments.
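As noted above, swapping the activation is essentially a one-line change per layer; the following illustrative helper replaces every nn.ReLU in an existing student network with the LMASketch module sketched in Section 3.2 (the helper name and recursion strategy are our assumptions).

```python
import torch.nn as nn

def replace_relu_with_lma(model: nn.Module, num_segments: int = 8) -> nn.Module:
    """Recursively swap nn.ReLU modules for LMASketch instances (see sketch above)."""
    for name, child in model.named_children():
        if isinstance(child, nn.ReLU):
            setattr(model, name, LMASketch(num_segments))
        else:
            replace_relu_with_lma(child, num_segments)
    return model
```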

4 Experiment

In this section, we conduct thorough evaluations of the effectiveness of LMA for model compression under two popular scenarios, image classification and machine translation. Besides, we compare the performance of LMA with several widely-used baseline activations. (We anonymously released the source code at: https://github.com/LMA-NeurIPS19/LMA.) In particular, we start with our experimental setup, including the data and models employed in the experiments. After that, we analyze the performance of our method applied singly or jointly with other techniques to demonstrate its effectiveness and advantages for model compression.

General Settings    To ensure credible results, all experiments are run 5 times with different random seeds, and we report the average and standard deviation. In addition, to ensure fair comparisons, the basic parameters, including learning rate, batch size, hyper-parameters in the distillation loss, etc., are all set to the same values as in the baselines. Note that the settings for the parametric baseline activations (PReLU (He et al., 2016), APLU (Agostinelli, 2015; Agostinelli et al., 2014) and Swish (Ramachandran et al., 2017)) are all consistent with the original authors' recommendations. For the multi-segment activations (APLU and LMA), the segment numbers are set equal to each other, which is 8 in our main experiments. Moreover, to measure the resource cost of the models, we report their parameter size and inference memory cost (Mem.), recorded while predicting the testing samples one by one. The model size does not change significantly after replacing the activation function, since the additional parameters in all these activations are relatively few. More details about the specific parameter settings, model specifications and convergence curves can be found in the reproducibility supplementary materials.

Method                          ReLU           PReLU           Swish           APLU-8           LMA-8
CIFAR-10 (Teacher: 21.4 MB, Acc. 92.83, Mem. 29.28)
  Student 1 (4.04 MB)   Acc.    88.74 ± 0.25   89.31 ± 0.35    89.03 ± 0.11    89.92 ± 0.21     90.57 ± 0.20
                        Mem.    14.24          15.95 (+1.7)    15.10 (+0.9)    25.80 (+11.6)    16.81 (+2.6)
  Student 2 (1.28 MB)   Acc.    82.67 ± 0.46   84.35 ± 0.37    84.06 ± 0.36    85.31 ± 0.60     85.66 ± 0.34
                        Mem.    3.40           4.72 (+1.3)     4.06 (+0.7)     11.69 (+8.3)     5.57 (+2.2)
  Student 3 (0.44 MB)   Acc.    73.33 ± 0.79   75.30 ± 0.17    75.45 ± 0.34    77.54 ± 0.97     77.66 ± 0.47
                        Mem.    1.45           2.07 (+0.6)     1.76 (+0.3)     5.15 (+3.7)      2.55 (+1.1)
CIFAR-100 (Teacher: 68.7 MB, Acc. 77.56, Mem. 140.2)
  Student 1 (4.88 MB)   Acc.    69.11 ± 0.80   70.03 ± 0.21    69.67 ± 0.40    70.99 ± 0.42     70.92 ± 0.42
                        Mem.    16.27          16.40 (+0.13)   16.33 (+0.06)   17.03 (+0.76)    16.46 (+0.19)
  Student 2 (1.28 MB)   Acc.    63.12 ± 1.00   64.52 ± 0.67    63.82 ± 0.78    66.28 ± 0.49     66.31 ± 0.68
                        Mem.    6.37           6.44 (+0.07)    6.41 (+0.04)    6.91 (+0.54)     6.47 (+0.10)
Table 2: Image Classification Results. The metrics of the Teacher model on each dataset are shown in parentheses after the dataset name. Accuracy (%) is shown in "mean ± std" pattern and the inference memory cost (in MB) is shown in "A (+D)" pattern, where A denotes the absolute memory cost and D the additional part compared with the ReLU-equipped model.
Figure 2: Segment Study for APLU and LMA on CIFAR-10. The bars and left axis show accuracy (%) while the lines and right axis show memory cost (MB).

4.1 Image Classification

Settings     Following (Polino et al., 2018) and its code (https://github.com/antspy/quantized_distillation), we first evaluate our method on CIFAR-10 and CIFAR-100, both of which are well-known image classification datasets. For the experiments on CIFAR-10, relatively small CNN architectures are employed, including one teacher model and three student models with different scales. Wide Residual Networks (WRN) (Zagoruyko and Komodakis, 2016) are employed for the experiments on CIFAR-100, where WRN-16 is used as the teacher model while two WRN-10 models are used as students. In the first phase, we train the teacher models and save them for the subsequent distilled training. Then, we compare the performance of the student models with different activations under supervision from both the teacher models and the ground truth. Accuracy (Acc.) is used as the evaluation metric for this task.

Result    Table 2 summarizes the image classification results of the various methods. From this table, we find that the multi-segment activations (APLU and LMA) outperform the other baselines on both datasets for all model scales, with LMA outperforming ReLU by 2% to 6% in accuracy. Meanwhile, smaller compact models show more obvious improvements from the multi-segment activations: on CIFAR-10, LMA outperforms ReLU by 2% on Student 1 and by 6% on Student 3. Besides, comparing APLU with LMA, although their accuracies are sometimes close, the additional inference memory cost brought by APLU is much larger than that of LMA, by about 3 to 4 times.

Method                          ReLU           PReLU           Swish           APLU-8           LMA-8
Ope (Teacher: 443.4 MB, BLEU 14.92, Ppl. 29.71, Mem. 1014.8)
  Student 1 (177.6 MB)  Ppl.    31.84 ± 0.31   31.89 ± 0.64    30.91 ± 0.43    30.80 ± 0.00     30.21 ± 0.25
                        BLEU    13.73 ± 0.19   13.67 ± 0.27    13.89 ± 0.26    13.98 ± 0.21     14.11 ± 0.12
                        Mem.    407.39         458.98 (+52)    430.77 (+23)    719.73 (+312)    487.23 (+80)
  Student 2 (87.2 MB)   Ppl.    44.51 ± 0.52   44.23 ± 0.56    43.44 ± 0.39    42.97 ± 0.62     41.21 ± 0.35
                        BLEU    10.46 ± 0.18   10.51 ± 0.24    10.78 ± 0.23    10.87 ± 0.30     10.94 ± 0.18
                        Mem.    282.05         335.34 (+53)    305.43 (+23)    596.10 (+314)    363.60 (+82)
  Student 3 (43.3 MB)   Ppl.    71.69 ± 0.51   72.56 ± 1.03    70.45 ± 0.69    70.31 ± 0.61     67.62 ± 0.31
                        BLEU     6.12 ± 0.12    6.06 ± 0.15     6.26 ± 0.25     6.40 ± 0.29      6.64 ± 0.04
                        Mem.    220.49         274.63 (+54)    243.87 (+23)    535.39 (+315)    302.89 (+82)
WMT13 (Teacher: 443.4 MB, BLEU 28.56, Ppl. 5.31, Mem. 1040.8)
  Student 1 (177.6 MB)  Ppl.     6.44 ± 0.02    6.47 ± 0.03     6.34 ± 0.03    OOM               6.29 ± 0.04
                        BLEU    26.89 ± 0.05   26.81 ± 0.06    26.98 ± 0.08    OOM              27.12 ± 0.07
                        Mem.    419.40         470.99 (+52)    442.78 (+23)    OOM              499.24 (+81)
  Student 2 (43.3 MB)   Ppl.    12.61 ± 0.05   12.72 ± 0.04    12.51 ± 0.03    12.35 ± 0.06     12.25 ± 0.05
                        BLEU    20.39 ± 0.09   19.96 ± 0.07    20.82 ± 0.08    21.02 ± 0.10     21.19 ± 0.08
                        Mem.    230.83         284.97 (+54)    254.21 (+23)    545.73 (+315)    313.23 (+82)
Table 3: Machine Translation Results (Mem. in MB). The metrics of the Teacher models are shown in parentheses after the dataset names. Note that on WMT13, the memory needed for training the APLU-equipped Student 1 exceeds the maximum memory of our GPU (24 GB), thus there is no result for APLU (OOM).

4.2 Machine Translation

Setting    To further evaluate the effectiveness of our method, we conduct experiments on machine translation using the OpenNMT integration test dataset (Ope), consisting of 200K training sentences and 10K test sentences, and the WMT13 (Koehn, 2005) dataset for a German-English translation task. The translation models we employ are based on the seq2seq models from OpenNMT (https://github.com/OpenNMT/OpenNMT-py), where the encoder and decoder are both Transformers (Vaswani et al., 2017) instead of the LSTM used in (Polino et al., 2018). We do not use LSTM to evaluate our method because its activations are usually Sigmoid and Tanh, both of which are saturating and quite different from PLAs. Besides one teacher model, we employ three student models with different scales on Ope, and two student models on WMT13. We use perplexity (Ppl., lower is better) and the BLEU score (BLEU), computed with the moses project (mos), as the two evaluation metrics.

Result    Table 3 shows the results on machine translation. From this table, we find that our method outperforms all the baseline activations. Specifically, the BLEU scores of LMA increase by 3% to 8% over ReLU on Ope and by 1% to 4% on the larger WMT13. Moreover, we observe the same advantages of LMA in terms of multi-segment effectiveness and memory cost as in the image classification tasks. It is worth noting that using APLU may cause memory overflow due to its huge cost (Out of Memory, OOM), as shown by the APLU-equipped Student 1 on WMT13.

4.3 Additional Experiment

Segment Study    To verify whether the expressiveness ability can be enhanced by increasing the segment number, we conduct additional experiments on CIFAR-10 to study the effect of the segment number in LMA. As shown in Fig. 2, as the segment number increases from 4 to 8, both APLU and LMA yield rising performance. Despite a slight decline beyond 10, LMA is still much better than ReLU. Besides, the memory cost of APLU grows linearly with the segment number, while that of LMA remains stable and much lower.

Joint Use    To show the effectiveness of jointly using our method with other compression techniques, we conduct a further experiment combining Quantized Distillation (Polino et al., 2018) with our method on CIFAR-10. From Table 4, we find that the accuracy of the LMA-equipped model is much higher than that of the ReLU-equipped one, again by about 2% to 6%, under all settings of the number of bits in the quantized model.

Method     Student 1                      Student 2                      Student 3
           ReLU           LMA-8           ReLU           LMA-8           ReLU           LMA-8
4 bits     85.74 ± 0.15   86.31 ± 0.41    77.04 ± 0.51   79.48 ± 0.79    65.33 ± 0.17   68.85 ± 0.99
8 bits     87.02 ± 0.23   88.56 ± 0.52    80.53 ± 0.75   83.37 ± 0.51    70.23 ± 0.98   74.47 ± 0.74
Table 4: Joint Use Results with Quantized Distillation on CIFAR-10 (accuracy, %). The Teacher model employed is the same as the one in the above experiments on CIFAR-10.

Overall, all the experiments above imply that multi-segment activations, including APLU and LMA, can achieve better performance than two-segment ones, and that the improvement brought by the multi-segment design becomes increasingly obvious as the model scale shrinks. Therefore, it is quite effective to leverage the segment number of the PLA to improve the performance of the compact model in model compression. Furthermore, LMA outperforms APLU in most cases and maintains more efficient memory usage even in the single case where LMA does not beat APLU. This indicates that the high efficiency of LMA makes it well suited to resource-constrained environments. Beyond this, LMA can also be used conveniently and effectively together with other techniques. To conclude, LMA can play a critical role in model compression due to its highly competitive effectiveness, efficiency and compatibility.

5 Conclusion and Outlook

In model compression, especially knowledge distillation, to fill the expressiveness ability gap between the compact NN and the complex NN, we propose a novel, highly efficient Light Multi-segment Activation (LMA), which empowers the compact NN to yield performance comparable with the complex one. Specifically, to produce more segments while preserving low resource cost, LMA uses statistical information of the batch input to determine multiple segment cut points, and then transforms the inputs linearly over the different segments. Experimental results on real-world tasks with NNs of multiple scales demonstrate the effectiveness and efficiency of LMA. Besides, LMA is well compatible with other techniques such as quantization, also helping improve the performance of those approaches.

To the best of our knowledge, this is the first work that leverages multi-segment piecewise linear activation for model compression, which provides a good insight into designing efficient and powerful compact models. In the future, on the one hand, we will further reduce the time and space costs of computing LMA at a lower level, through hardware-level or specialized computation. On the other hand, improving the capacity of the activation is also a novel and significant direction for simplifying complex architectures and applying Neural Networks more efficiently.

Appendix A Reproducibility Details

We anonymously released the source code at: https://github.com/LMA-NeurIPS19/LMA, which contains all of the experimental code and method implementations and is mainly built from the codebase of (Polino et al., 2018) (https://github.com/antspy/quantized_distillation). Furthermore, we use this supplementary material to provide some important details about specific settings, and some intuitive results in an example figure. On the one hand, because the state-of-the-art complex models applied in general scenarios already have sufficient expressiveness ability, the gain from equipping them with LMA is sometimes marginal. On the other hand, LMA is designed particularly for the compact model in model compression. Therefore, the experiments in this paper mainly focus on evaluating the performance of LMA-based compact models in the compression scenario.

Generally, in our experiments, the implementation is based on PyTorch, and all experiments are run on an NVIDIA Tesla P40 with 24 GB of memory. Besides, the inference memory cost is recorded with the PyTorch API torch.cuda.max_memory_allocated(), while feeding the streaming test samples one by one, i.e., with the test batch size set to one.
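A minimal sketch of how such a per-sample peak-memory measurement can be taken with the PyTorch API mentioned above; the function name and the model/sample arguments are placeholders of ours.

```python
import torch

@torch.no_grad()
def peak_inference_memory_mb(model, test_samples, device="cuda"):
    """Peak GPU memory (in MB) while predicting test samples one by one."""
    model.eval().to(device)
    torch.cuda.reset_peak_memory_stats(device)
    for x in test_samples:          # batch size of one, as in the paper
        model(x.unsqueeze(0).to(device))
    return torch.cuda.max_memory_allocated(device) / (1024 ** 2)
```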

a.1 Baseline Details

We compare the performance of LMA with several widely-used and state-of-the-art activations:

  • Rectified Linear Unit (ReLU) (Nair and Hinton, 2010), with no hyper-parameter settings.

  • Parametric ReLU (PReLU) (He et al., 2016) with initial slope 0.25, as the authors suggested.

  • Swish (Ramachandran et al., 2017), whose equation is $f(x) = x \cdot \mathrm{sigmoid}(\beta x)$, where $\beta$ is either a constant or a trainable parameter. It is a state-of-the-art activation and not essentially a PLA, but we can treat it as an approximation of the two-segment ReLU. We use the best-performing version in Ramachandran et al. (2017), in which the parameter $\beta$ is trainable and initialized to one.

  • Adaptive Piecewise Linear Units (APLU) (Agostinelli et al., 2014). The segment number of APLU is set to the same as LMA. The initialization of APLU follows the author's code (Agostinelli, 2015): the slopes are initialized uniformly and the cut points normally. Besides, it is worth mentioning that $S$ in the equation of APLU (see Eqn. (1) in the paper) is not consistent with the segment number we claim, as it denotes the number of cut points. Therefore, an APLU with $S$ boundaries has $S + 1$ segments in total, with $S$ dynamic segments and one extra from the added ReLU. For a fair comparison, when we mention an APLU-K, we actually set $K - 1$ cut points in it.

We do not take Maxout as a baseline because its parameter size is obviously much larger than the others'. Besides, in LMA, the moving average factor, used for updating the segment cut points in the inference phase, is set to 0.99. Except for the segment study experiments, the segment number is set to 8, the same as that of APLU.

a.2 Dataset Details

For image classification, CIFAR-10 and CIFAR-100 (Krizhevsky and Hinton, 2009) are used. Both are well-known image classification benchmark datasets, containing a 50K training set and a 10K testing set, where each image has 32x32 pixels. The difference between them is that CIFAR-10 has 10 class labels while CIFAR-100 has 100.

For machine translation, we evaluate our method on the OpenNMT integration test dataset (Ope), consisting of 200K training sentences and 10K test sentences (obtained as in https://github.com/antspy/quantized_distillation/blob/master/datasets/translation_datasets.py#L211), and the well-known WMT13 (Koehn, 2005) dataset for a German-English translation task. The WMT13 data used contains 1.7M training sentences and 190K testing sentences.

a.3 Model Details

On CIFAR-10, the model specifications are listed in Table 5, where c denotes a convolutional layer, mp denotes a max pooling layer, dp denotes a dropout layer and fc denotes a fully-connected layer.

Model              Architecture                                          Parameter Number
Teacher Model      76c-mp-dp-126c-mp-dp-148c-mp-dp-1200fc-dp-1200fc      5.34 M
Student Model 1    75c-mp-dp-50c-mp-dp-25c-mp-dp-500fc-dp                1.01 M
Student Model 2    50c-mp-dp-25c-mp-dp-10c-mp-dp-400fc-dp                0.32 M
Student Model 3    25c-mp-dp-10c-mp-dp-5c-mp-dp-300fc-dp                 0.11 M
Table 5: Model specifications on CIFAR-10.

On CIFAR-100, the parameter settings for the structure of the Wide Residual Networks (implementation following https://github.com/meliketoy/wide-resnet.pytorch) are listed in Table 6. The detailed architecture of Wide Residual Networks (WRN) can be found in Zagoruyko and Komodakis (2016), where the meaning of the listed parameters is also given. Additionally, we employ a relatively shallow WRN-16 as the teacher model, because as the depth increases further, the performance of WRN no longer improves obviously (the Error metric improves only from 21.59% to 20.75% as the depth goes from 16 to 22, as reported in Zagoruyko and Komodakis (2016)).

Model Widen Factor Depth Parameter Number
Teacher Model 10 16 17.2 M
Student Model 1 6 10 1.22 M
Student Model 2 4 10 0.32 M
Table 6: Model specifications on CIFAR-100.

For the machine translation models, we employ multi-layer Transformers (Vaswani et al., 2017) as the encoder and decoder in a seq2seq framework to evaluate the effectiveness of our method. The implementation is based on OpenNMT-py (https://github.com/OpenNMT/OpenNMT-py) and PyTorch, and the model specifications are listed in Table 7. On Ope, all four models are run, while the teacher model and the first and last student models are selected for evaluation on WMT13, due to space limitations and the huge time cost. Additionally, the BLEU score is computed with the multi-bleu.perl script from the moses project (mos) (http://www.statmt.org/moses/?n=moses.baseline).

Model Embedding Size Hidden Units Encoder Layers Decoder Layers Parameter Number
Teacher Model 512 512 6 6 116 M
Student Model 1 256 256 3 3 47 M
Student Model 2 128 128 3 3 23 M
Student Model 3 64 64 3 3 11 M
Table 7: Model specifications for machine translation.

a.4 Hyper-parameter Setting

As the distributions of the datasets used in the experiments differ from each other, we use different hyper-parameters on different datasets. However, the settings only differ between datasets, not between models, which means that the parameter setting is strictly the same for all models on one specific dataset. Besides, all the hyper-parameters are basically set to be the same as the settings of Polino et al. (2018), without deliberate adjustment.

We list the hyper-parameters on CIFAR-10 in Table 8. The learning rate decay strategy follows the implementation in Polino et al. (2018), where the adjustment of the learning rate depends on the trend of the validation accuracy: if the validation accuracy no longer increases, after waiting for a fixed number of epochs, the learning rate is halved. On CIFAR-100, the hyper-parameters are listed in Table 9, and the learning rate adjustment is the same as the setting in Zagoruyko and Komodakis (2016).

The hyper-parameters for machine translation are listed in Table 10. The model and hyper-parameter settings are the same on both Ope and WMT13, and mainly follow the official default settings recommended by OpenNMT. More details can be found in our code, specifically the standard_options.py file in the onmt directory.

Batch Size 64
Maximum Epoch 200
Batch Normalization True
Weight Decay 2.2e-4
Learning Rate Initial LR 0.01
Decay Factor 0.5
Epoch to Wait Before Decaying 10
Epoch to Wait After Decaying 8
Maximum Decay Times 11
Optimizer Method Stochastic Gradient Descent
Momentum 0.9
Distillation Distillation Loss Weight 0.7
Soften Temperature 2
Table 8: Hyper-parameters on CIFAR-10.
Batch Size 128
Epoch 200
Batch Normalization True
Weight Decay 5e-4
Dropout Rate 0.3
Learning Rate LR in Epoch 1-60 0.1
LR in Epoch 61-120 0.02
LR in Epoch 121-160 4e-3
LR in Epoch 161-200 8e-4
Optimizer Method Stochastic Gradient Descent
Momentum 0.9
Distillation Distillation Loss Weight 0.7
Soften Temperature 2
Table 9: Hyper-parameters on CIFAR-100.
Epoch 15
Batch Normalization True
Dropout Rate 0.1
Attention Mechanism Head Count 8
Batch Setting Batch Type Tokens
Batch Size 3192
Learning Rate Initial LR 2.0
Decay Factor 0.5
Start Decay at Epoch 8
Decay Method noam
Optimizer Method Adam
beta 0.998
Distillation Distillation Loss Weight 0.7
Soften Temperature 1
Beam Computing Beam Size 5
Table 10: Hyper-parameters for machine translation.

a.5 Accuracy-Epoch Curves on CIFAR-100

To show the effectiveness of LMA more intuitively, we provide one more example figure here, which shows the Testing Accuracy curves of the student models based on ReLU and LMA-8, respectively, when training on CIFAR-100. From the curves in Fig. 3, it is easy to see the high effectiveness of LMA from its large improvement over ReLU.

Figure 3: Testing Accuracy-Epoch Curves on CIFAR-100. The vertical axis is accuracy (%) and the horizontal one is the epoch. "S1" and "S2" denote Student-1 and Student-2, respectively.

References