Network with Sub-Networks

08/02/2019, by Ninnart Fuengfusin, et al.

We introduce network with sub-networks, a neural network whose weight layers can be removed to obtain sub-neural networks on demand during inference. This method provides selectivity in the number of weight layers. To develop parameters that can be used in both the base-model and the sub-models, the weights and biases are first copied from the sub-models to the base-model. Each model is forwarded separately. Gradients from both networks are averaged and used to update both networks. In our empirical experiments, the base-model achieves test accuracy comparable to regularly trained models while retaining the ability to remove its weight layers.

1 Introduction

Deep neural networks (DNNs) have attracted considerable attention in recent years owing to their state-of-the-art performance across varied applications. Deploying DNNs to mobile devices, which differ widely in specification, raises a question: how do we effectively design DNNs given the specification of a mobile phone? To answer this question, two main factors can be optimized.

The first factor is the performance of the DNN. In general, there is an assumption that stacking more weight layers improves the performance of the model. One widely cited example is the growing number of weight layers among winners of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). AlexNet [krizhevsky2012imagenet], the model that won ILSVRC-2012, consists of 8 weight layers. ResNet [he2016deep], the winner of ILSVRC-2015, contains 152 weight layers. ResNet reduces the top-5 test error from AlexNet's 15.3% to 3.57%. Although growth in the number of weight layers may reduce the test error rate of the model, it comes with a trade-off in the second factor, latency. More layers of a DNN mean a higher number of variables to compute. This also increases the memory footprint, which is crucial for mobile devices.

One solution is to select the model that achieves real-time performance on a given mobile phone. However, if the user instead prefers performance over latency, this choice does not satisfy the user's demand. An alternative is to let the user select his or her preference and then match that preference to the most suitable model. However, the memory footprint of keeping several models on a mobile device is excessively large. To satisfy the user's preference for selectivity in both performance and latency, we propose network with sub-networks (NSNs): DNNs whose weight layers can be removed without a dramatic decrease in performance.

Generally, if one of the weight layers is removed at inference time, the performance of the model drops dramatically. We speculate the cause as follows: a widely used way to explain how DNNs operate is to view them as feature extractors, in which the first weight layers extract low-level features and the last layers extract high-level features. This creates a dependency between the weight layers.

To solve this problem, we propose a training method that allows NSNs to adapt to the removal of weight layers. We call the two components of this method copying learn-able parameters and sharing gradient. Both are designed to optimize the learn-able parameters for both kinds of model: the model with and without the removed weight layer.

2 Related Works

2.1 BranchyNet

BranchyNet [teerapittayanon2016branchynet] is a neural network that can reduce the number of floating-point operations (FLOPs) during inference depending on the complexity of the input data. The idea of BranchyNet is that, between certain intermediate layers of the network, the output features may be connected to a branch. The branch consists of weight layers and a classifier. If the prediction of the branch classifier on the input has higher confidence than a pre-defined threshold, the output of the model exits early from that branch. Otherwise, if the confidence is low, the features are processed further by the deeper layers.
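As an illustration of the early-exit mechanism, the following sketch shows one possible way to implement a confidence-gated branch in PyTorch. The entropy-based confidence measure, the module names, and the threshold value are our assumptions for illustration, not BranchyNet's exact formulation.

```python
import torch
import torch.nn.functional as F

def early_exit_forward(x, trunk_blocks, branch_heads, final_head, threshold=0.5):
    """Run trunk blocks in order; after each one, try the attached branch
    classifier and exit early if its prediction is confident enough.

    trunk_blocks : list of nn.Module, the shared backbone segments
    branch_heads : list of nn.Module, one classifier per intermediate branch
    final_head   : nn.Module, the classifier at the end of the full network
    threshold    : entropy threshold below which we accept the branch output
    """
    h = x
    for block, head in zip(trunk_blocks, branch_heads):
        h = block(h)
        logits = head(h)
        probs = F.softmax(logits, dim=1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1).mean()
        if entropy.item() < threshold:   # confident prediction: take the early exit
            return logits
    return final_head(h)                 # otherwise use the deepest classifier
```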

2.2 Slimmable Neural Networks

Slimmable Neural Networks (SNNs) [yu2018slimmable] are the main inspiration for this research. Whereas our proposed method adds or removes weights in the depth-wise direction, SNNs append or detach weights in the width-wise direction. The range of possible network widths must be pre-defined as switches. The main research problem is that the mean and variance of features produced by weight layers of different widths are generally diverse. SNNs therefore propose switchable batch normalization, which applies one batch-normalization layer per switch to correct the mean and variance of the features.
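A minimal sketch of the switchable batch-normalization idea in PyTorch: one BatchNorm instance is kept per width switch and selected at forward time. The layer sizes, width ratios, and slicing scheme below are illustrative assumptions rather than the SNN paper's exact configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

class SwitchableBNLinear(nn.Module):
    """A linear layer whose output width can be switched at run time,
    with a separate BatchNorm1d tracking statistics for every switch."""

    def __init__(self, in_features, out_features, width_ratios=(0.25, 0.5, 1.0)):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.width_ratios = width_ratios
        # One BN per switch, because narrower slices produce different
        # feature means/variances than the full-width layer.
        self.bns = nn.ModuleList(
            nn.BatchNorm1d(int(out_features * r)) for r in width_ratios
        )

    def forward(self, x, switch):
        r = self.width_ratios[switch]
        out_dim = int(self.linear.out_features * r)
        # Slice the weight matrix and bias to the active width.
        w = self.linear.weight[:out_dim, :]
        b = self.linear.bias[:out_dim]
        y = F.linear(x, w, b)
        return self.bns[switch](y)
```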

3 Network with Sub-Networks

There are two types of models in NSNs: the base-model and the sub-models. We define the base-model as a DNN with $n$ hidden layers, where $n$ is a positive integer. We can then create $n$ sub-models, each mapped to a different number of hidden layers, from $n-1$ down to zero. Under this construction, the biggest sub-model takes all of the weight layers of the base-model except the input layer. The second biggest sub-model takes all of the weight layers of the biggest sub-model except that sub-model's input layer. This is repeated until we obtain the sub-model that has no hidden layer at all.
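As a concrete sketch of this construction (our own helper code, not from the paper): note that removing the input layer only yields a valid sub-model if every hidden layer has the same width as the input, 784 units for MNIST, which is consistent with the parameter counts reported in Table 1.

```python
import torch.nn as nn

def build_base_model(layer_sizes):
    """Base-model: an MLP whose hidden layers use ReLU, e.g. [784, 784, 784, 10]."""
    layers = []
    for i in range(len(layer_sizes) - 1):
        layers.append(nn.Linear(layer_sizes[i], layer_sizes[i + 1]))
        if i < len(layer_sizes) - 2:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

def sub_models(base):
    """Each sub-model drops the current input layer (Linear + ReLU) of the
    previous model, down to a softmax regression with no hidden layer."""
    models = []
    current = list(base)
    while len(current) > 1:
        current = current[2:]          # remove the input Linear and its ReLU
        models.append(nn.Sequential(*current))
    return models
```

Because the sub-models here reuse the base-model's layer objects, their parameters are literally shared; the paper instead keeps separate models and synchronizes them with copying learn-able parameters (Section 3.1).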

Having defined the base-model and sub-models, in the following sections we describe the two processes of our proposed method: copying learn-able parameters and sharing gradient. These processes are designed to be applied repeatedly at every mini-batch during training.

3.1 Copying Learn-able Parameters

The goal of copying learn-able parameters is to combine each sub-model into the base-model. To enforce similarity between the weight and bias parameters of each model, the weights and biases are copied from the smaller sub-model to the next bigger sub-model, and this is repeated until reaching the base-model. The process is shown in Eq. (1) and Fig. 1. In Eq. (1), $W$ is a weight variable, $b$ is a bias variable, $l$ is an integer indicating the order of the weight layer, and $m$ is an integer indicating the model number.

$$W^{m}_{l+1} \leftarrow W^{m-1}_{l}, \qquad b^{m}_{l+1} \leftarrow b^{m-1}_{l} \qquad (1)$$
Figure 1: Illustration of network with sub-networks and the copying learn-able parameters process, where the base-model is a DNN with two hidden layers and the sub-models are a DNN with one hidden layer and a softmax regression. Each weight variable is named and followed by the size of its weight array. Bias terms are excluded from this figure. Copying learn-able parameters makes the corresponding weight arrays of the models exactly the same.

After we apply this process, if we remove the input weight layer of the base-model together with its non-linear activation function, the base-model becomes the sub-model.
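A minimal sketch of the copying step, assuming each model is an nn.Sequential of Linear layers with its own parameter tensors; the helper name and the alignment bookkeeping are ours.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def copy_parameters(models):
    """Copy weights and biases from the smaller models upward, so that
    model m's layers 2..L equal model (m-1)'s layers 1..(L-1), as in Eq. (1).

    models : list of nn.Sequential, ordered [model0, model1, ..., base_model],
             each containing only nn.Linear and activation modules.
    """
    for smaller, bigger in zip(models[:-1], models[1:]):
        small_linears = [m for m in smaller if isinstance(m, nn.Linear)]
        big_linears = [m for m in bigger if isinstance(m, nn.Linear)]
        # Skip the bigger model's input layer; align the rest one-to-one.
        for src, dst in zip(small_linears, big_linears[1:]):
            dst.weight.copy_(src.weight)
            dst.bias.copy_(src.bias)
```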

3.2 Sharing Gradient

The goal of sharing gradient is to make the learn-able variables perform effectively in two or more networks. Sharing gradient starts by forward-propagating all of the models. During back-propagation, we collect the gradients of each model separately. The models are then paired: the sub-model without a hidden layer with the sub-model with one hidden layer, and so on, up to the sub-model with $n-1$ hidden layers paired with the base-model. The gradients of each pair are averaged and used to tune both models' weights and biases. An overview of the sharing gradient process is shown in Eq. (2) and Fig. 2, where $\eta$ is the learning rate and $\mathcal{L}$ is the loss function.

$$W^{m}_{l+1} \leftarrow W^{m}_{l+1} - \eta \cdot \frac{1}{2}\left(\frac{\partial \mathcal{L}_{m}}{\partial W^{m}_{l+1}} + \frac{\partial \mathcal{L}_{m-1}}{\partial W^{m-1}_{l}}\right) \qquad (2)$$

The reason for sharing gradients only within a pair of models is that, when sharing across more than a pair, the optimization process becomes more complicated; in that case the performance of NSNs hardly reaches a satisfactory point. The input layer of the base-model, which has no pair, is updated with regular back-propagation.

Figure 2: The sharing gradient process in the model0-1-2 experiment. The gradients are shared from the sub-models to the base-model, pair by pair. Only the input weight layer of the base-model is updated regularly, without sharing. $\mathcal{L}_{m}$ is the loss function of model $m$.
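The sketch below illustrates one sharing-gradient step for the simplest case, model0-1, using plain SGD in place of the momentum optimizer of Section 4 for brevity. The pairing of model1's output layer with model0's single layer follows our reading of Eq. (2) and Fig. 2; the function and variable names are ours.

```python
import torch
import torch.nn.functional as F

def sharing_gradient_step(model1, model0, x, y, lr=0.1):
    """One sharing-gradient step for the two-model case (model0-1).

    model1 : nn.Sequential(Linear(784, 784), ReLU(), Linear(784, 10))  -- base
    model0 : nn.Sequential(Linear(784, 10))                            -- sub

    After copy_parameters, model1's output layer and model0's single layer
    hold equal values; their gradients are averaged and the average updates
    both. model1's input layer, which has no partner, keeps its own gradient.
    """
    for m in (model0, model1):
        m.zero_grad()
        F.cross_entropy(m(x), y).backward()

    w_in, w_out = model1[0], model1[2]   # input and output Linear of model1
    w_sub = model0[0]                    # the single Linear of model0

    with torch.no_grad():
        # Averaged gradients for the paired layers.
        gw = 0.5 * (w_out.weight.grad + w_sub.weight.grad)
        gb = 0.5 * (w_out.bias.grad + w_sub.bias.grad)
        for lin in (w_out, w_sub):
            lin.weight -= lr * gw
            lin.bias -= lr * gb
        # Unpaired input layer: plain SGD step with its own gradient.
        w_in.weight -= lr * w_in.weight.grad
        w_in.bias -= lr * w_in.bias.grad
```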

4 Experiments

The experiments were conducted on the hand-written digit image dataset MNIST [lecun2010mnist], which consists of 60,000 training images and 10,000 test images. Each image in the dataset is gray-scale and composed of 28x28 pixels. Each pixel of every MNIST image was pre-processed into the range [0, 1] by dividing its value by 255.

A multi-layer perceptron (MLP) was used with the rectified linear unit (ReLU) as the non-linear activation function. The last layer used log-softmax with cross-entropy as the cost function. Dropout [srivastava2014dropout] was applied to the input layer of the MLP, and dropout was also applied to the hidden layers at a separate rate. In the case of softmax regression, we did not apply dropout to the model since it was already under-fitting. The base-models were further regularized using an L2 weight penalty.
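A sketch of the base-model architecture as described above, written in PyTorch. The dropout rates below are placeholders, since the actual values did not survive extraction of the paper, and the hidden width of 784 is implied by the parameter counts in Table 1; the L2 weight penalty would be applied through the optimizer's weight_decay argument, whose value is likewise not shown here.

```python
import torch.nn as nn

def make_base_model(p_in=0.2, p_hidden=0.5):
    """Two-hidden-layer MLP used as the base-model (model2).
    The dropout rates here are placeholders, not the paper's values."""
    return nn.Sequential(
        nn.Flatten(),
        nn.Dropout(p_in),          # dropout on the input layer
        nn.Linear(784, 784),
        nn.ReLU(),
        nn.Dropout(p_hidden),      # dropout on the hidden layers
        nn.Linear(784, 784),
        nn.ReLU(),
        nn.Dropout(p_hidden),
        nn.Linear(784, 10),        # raw logits; cross-entropy loss applies log-softmax
    )
```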

We applied stochastic gradient descent (SGD) with momentum, although in a slightly different format. The regular format of SGD with momentum [momentum2019], as implemented in the TensorFlow neural-network framework [abadi2016tensorflow], is shown in Eq. (3), where $a_t$ is the gradient accumulation term, $t$ is the batch-wise iteration step, $g_t$ is the gradient at step $t$, and $\gamma$ is the momentum coefficient. Our format of SGD with momentum is shown in Eq. (4). After $a_t$ is found, both formats use the same Eq. (5) to update the weight $W_t$.

$$a_{t} = \gamma \, a_{t-1} + g_{t} \qquad (3)$$
$$a_{t} = \gamma \, a_{t-1} + (1 - \gamma) \, g_{t} \qquad (4)$$
$$W_{t+1} = W_{t} - \eta \, a_{t} \qquad (5)$$

NSNs performed better with our format of SGD with momentum than with the regular format. We speculate that NSNs require a higher proportion of the gradient accumulation, compared with the current gradient, in order to converge. On the other hand, for regularly trained DNNs, our format of SGD with momentum performed slightly worse in terms of test accuracy. Hence, to perform a fair comparison between both types of models, the regularly trained models were trained with Eq. (3), while the models trained with our proposed method used Eq. (4).
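Both update rules can be written as plain Python functions operating on array-like weights (NumPy arrays or torch tensors). Eq. (3) and Eq. (5) follow the standard momentum formulation; the scaled form of Eq. (4) and the default momentum value below are our reading of the description, since the exact symbols did not survive extraction.

```python
def momentum_step_regular(w, a, grad, lr, gamma=0.9):
    """Regular SGD with momentum (Eq. 3 and 5): the current gradient enters
    the accumulator at full strength."""
    a = gamma * a + grad                  # Eq. (3)
    w = w - lr * a                        # Eq. (5)
    return w, a

def momentum_step_scaled(w, a, grad, lr, gamma=0.9):
    """The paper's variant (Eq. 4 and 5): the current gradient is scaled down,
    so the accumulated history dominates the update. The exact form of Eq. (4)
    and the value of gamma are assumptions here."""
    a = gamma * a + (1.0 - gamma) * grad  # Eq. (4), assumed form
    w = w - lr * a                        # Eq. (5)
    return w, a
```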

We set the training batch size to 128. Each model was trained for 600 epochs; however, we report the best test accuracy that occurred during training. The learning rate was stepped down by one third every 200 epochs.
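A small sketch of this step schedule, assuming "stepped down by one third" means multiplying the learning rate by 1/3; the initial learning rate below is a placeholder, as the paper's value was not recoverable from the extracted text.

```python
def learning_rate(epoch, initial_lr=0.1, drop_every=200, factor=1.0 / 3.0):
    """Step schedule: multiply the learning rate by `factor` every
    `drop_every` epochs."""
    return initial_lr * factor ** (epoch // drop_every)
```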

The experimental results consist of two sections. The first section is model0-1: the base-model is an MLP with one hidden layer, model1, and the sub-model is a softmax regression, model0. The second section is model0-1-2: the base-model is a two-hidden-layer MLP, model2, and the sub-models are an MLP with one hidden layer, model1, and a softmax regression, model0. The structure of model0-1-2 is shown in Fig. 1. The baseline models, which are regularly trained, are referred to as ref-model followed by the number of hidden layers; for example, ref-model1 is the baseline MLP with one hidden layer. The results of the baseline models are shown in Table 1.

Model        Test Accuracy   Number of Parameters   Regularization Parameter
ref-model2   0.9886          1.24M
ref-model1   0.9882          0.62M
ref-model0   0.9241          7.85k                  0

Table 1: Results of MNIST classification of baseline models

4.1 model0-1

An MLP with one hidden layer was used as the base-model; the sub-model was a softmax regression. In all of the following experiments, we prioritize the base-model's performance; hence we report the test accuracy of all models at the epoch that contains the best test accuracy of the base-model. The model0-1 results are displayed in Table 2. An L2 regularization parameter was applied to the base-model.

Model     Test Accuracy   Number of Parameters
model1    0.9857          0.62M
model0    0.9253          7.85k

Table 2: Results of MNIST classification of model0-1

Comparing ref-model1 with model1, the test accuracy of model1 dropped to some extent. This indicates that our proposed method trades a portion of the model's performance for the ability to remove weight layers.

4.2 model0-1-2

An MLP with two hidden layers was used as the base-model; the sub-models were an MLP with one hidden layer and a softmax regression. The model0-1-2 results are displayed in Table 3. An L2 regularization parameter was applied to the base-model.

Model     Test Accuracy   Number of Parameters
model2    0.989           1.24M
model1    0.9843          0.62M
model0    0.926           7.85k

Table 3: Results of MNIST classification of model0-1-2

The difference in test accuracy between model1 and model2 indicates a bias of our proposed method towards the base-model. We speculate that this bias comes from the sharing gradient process: all of the gradients received by the sub-models are averaged over multiple models, whereas the base-model has an input layer that is updated from its own gradient alone, as shown in Fig. 2.

In contrast to model1 in model0-1, which under-performed ref-model1, model2 in model0-1-2 out-performed ref-model2 by a tiny margin. We hypothesize that the constraints of our proposed method impose a kind of regularization on the models. In the case of model1 in model0-1, this regularization effect might be excessively strong and hurt performance; in the case of model2 in model0-1-2, the regularization effect seems adequate and improves accuracy.

5 Conclusion

We propose network with sub-networks, DNNs whose weight layers can be removed on the fly. A network with sub-networks consists of a base-model with sub-models inside it. To combine the sub-models into the base-model, copying learn-able parameters is introduced; sharing gradient is applied so that the learn-able parameters can be used in two or more models. Our proposed method was evaluated only in small-scale experiments, with DNNs of a few hidden layers and only on the MNIST dataset. Larger-scale models and datasets will be the focus of future work. We are also interested in applying the method to convolutional neural networks (CNNs), which have proved to give better performance on computer vision tasks.

Acknowledgment

This research was supported by JSPS KAKENHI Grant Number 17K20010.

References