1 Introduction
Deep neural networks (DNNs) have attracted considerable attention in recent years thanks to their ability to deliver state-of-the-art performance across varied applications. Deploying DNNs on mobile devices, which are diverse in their specifications, raises a question: how can DNNs be designed effectively for a given mobile phone's specification? To answer this question, two main factors can be optimized.
The first factor is the performance of DNNs. A common assumption is that the more weight layers a DNN stacks, the better its performance will be. One widely cited example is the growing number of weight layers among winners of the ImageNet Large Scale Visual Recognition Competition (ILSVRC). AlexNet [krizhevsky2012imagenet], the model that won ILSVRC 2012, consists of 8 weight layers; ResNet [he2016deep], the winner of ILSVRC 2015, contains 152. From AlexNet to ResNet, the top-5 test error fell from 15.3 to 3.57. Although growth in the number of weight layers may reduce a model's test-error rate, it trades off against the second factor, latency. More layers in a DNN mean more variables to compute, and also a larger memory footprint, which is crucial on a mobile device.
One solution is to select the model that achieves real-time performance on the given phone. However, if the user instead prefers performance over latency, this method does not satisfy the user's demand. Another is to let the user state his or her preference and then match that preference to the most suitable model; however, the memory footprint of keeping several models on the mobile device is exceedingly large. To satisfy the user's preference over both performance and latency, we propose the network with sub-networks (NSN), a DNN whose weight layers can be removed without a dramatic decrease in performance.
Generally, if one of the weight layers is removed at inference time, the performance of the model drops dramatically. A widely used way to explain how DNNs operate, and the basis for what we speculate here, is to view them as feature-extraction models: the first weight layer extracts low-level features, while the last layers extract high-level features. This creates a dependency between successive weight layers.
To solve this problem, we propose a training method that allows NSNs to adapt dynamically to the removal of weight layers. The method has two parts, copying learnable parameters and sharing gradient, both designed to optimize the learnable parameters for both variants of the model: with and without the removed weight layer.
2 Related Works
2.1 BranchyNet
BranchyNet [teerapittayanon2016branchynet] is a neural network that can reduce the number of floating-point operations (FLOPs) during inference depending on the complexity of the input data. The idea is to attach a branch, consisting of weight layers and a classifier, to the output features of certain intermediate layers. If the branch classifier's prediction for an input has higher confidence than a predefined threshold, the model's output comes from that branch as an early exit. Otherwise, the features continue on to be processed by the deeper layers.
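The early-exit rule can be sketched as follows. This is our own illustration, not BranchyNet's actual architecture: `branches`, `final_head`, and the 0.9 threshold are hypothetical stand-ins.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def branchy_forward(x, branches, final_head, threshold=0.9):
    """Early-exit inference sketch.

    `branches` is a list of (trunk_fn, branch_classifier_fn) pairs:
    each trunk_fn transforms the features, each branch classifier maps
    them to logits.  If a branch's top softmax probability exceeds
    `threshold`, we exit early with that branch's prediction.
    """
    h = x
    for trunk, classifier in branches:
        h = trunk(h)
        probs = softmax(classifier(h))
        if probs.max() > threshold:          # confident enough: early exit
            return int(np.argmax(probs)), True
    # low confidence at every branch: fall through to the final classifier
    probs = softmax(final_head(h))
    return int(np.argmax(probs)), False
```

Confident inputs thus pay only for the shallow trunk layers, while harder inputs traverse the full network.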
2.2 Slimmable Neural Networks
Slimmable Neural Networks (SNNs) [yu2018slimmable] are the main inspiration for this research. Where our proposed method adds or removes weights in the depth-wise direction, SNNs append or detach weights in the width-wise direction. The range of possible network widths must be predefined as switches. The main research problem is that the mean and variance of the features produced by weight layers of different widths are generally diverse. SNNs therefore propose switchable batch normalization, with one batch-normalization layer applied per switch, to correct the mean and variance of the features.
3 Network with Sub-Networks
There are two types of models in an NSN: the base-model and the sub-models. We define the base-model as a DNN with n hidden layers, where n is a positive integer. From it, we can create n sub-models, each mapped to a smaller number of hidden layers. Under this construction, the biggest sub-model takes all of the weight layers of the base-model except the input layer; the second biggest sub-model takes all of the weight layers of the biggest sub-model except that model's input layer; and so on, repeatedly, until we reach the sub-model that has no hidden layer at all.
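As a concrete sketch (our own illustration, not code from the paper): representing each model as a list of dense layers, every sub-model is a suffix of the base-model's layer list. We set the hidden width equal to the input width, which keeps the hidden weight matrices the same shape and is consistent with the 0.62M and 1.24M parameter counts reported in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_layer(n_in, n_out):
    """One fully connected layer as a weight matrix plus a bias vector."""
    return {"W": rng.standard_normal((n_in, n_out)) * 0.01,
            "b": np.zeros(n_out)}

D, C = 784, 10   # MNIST input size and number of classes

# Base-model with n = 2 hidden layers.  Giving every hidden layer the
# input width D keeps the hidden weight matrices the same shape, which
# is what makes the parameter copying of Sec. 3.1 shape-compatible.
base_model = [make_layer(D, D), make_layer(D, D), make_layer(D, C)]

# Sub-model k is the base-model with its first k input layers removed;
# the smallest sub-model is plain softmax regression.
sub_models = [base_model[k:] for k in range(1, len(base_model))]
```

Note that the list slices here merely share layer objects for brevity; the paper instead keeps separate models and synchronizes them explicitly, as described next.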
Having defined the base-model and sub-models, the next sections describe the two processes of our proposed method: copying learnable parameters and sharing gradient. Both processes are designed to be applied repeatedly at every mini-batch during training.
3.1 Copying Learnable Parameters
The goal of copying learnable parameters is to combine each sub-model into the base-model. To enforce similarity between the weight and bias parameters of each model, the weights and biases are copied from the lesser sub-model to the bigger sub-model, repeating until the base-model is reached. The process is shown in Eq. (1) and Fig. 1, where W is a weight variable, i is an integer indicating the order of the weight layer, and j is an integer indicating the model number. Since model j+1 has exactly one extra input layer, layer i of model j lines up with layer i+1 of model j+1:

(1)  W_{i+1}^{(j+1)} ← W_{i}^{(j)}
After applying this process, removing the input weight layer of the base-model, together with its non-linear activation function, yields the biggest sub-model.
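A minimal sketch of the copy step, assuming all hidden layers share one width so the shapes line up; the layer representation (a list of weight/bias dicts per model) is our own:

```python
import numpy as np

def copy_parameters(models):
    """Copy weights and biases from each lesser model into the next
    bigger one (`models` is ordered smallest to biggest).  Model j+1
    has one extra input layer, so layer i of model j lines up with
    layer i+1 of model j+1; equal hidden widths make the shapes match.
    """
    for small, big in zip(models, models[1:]):
        for i, layer in enumerate(small):
            big[i + 1]["W"][...] = layer["W"]   # overwrite in place
            big[i + 1]["b"][...] = layer["b"]
```

Each bigger model's input layer is deliberately left untouched, since it has no counterpart in the smaller model.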
3.2 Sharing Gradient
The goal of sharing gradient is to let the learnable variables perform effectively in two or more networks. Sharing gradient starts by forward-propagating all of the models. During back-propagation, we collect the gradients from each model separately. The models are paired: the sub-model without a hidden layer with the sub-model with one hidden layer, and so on, up to the biggest sub-model with the base-model. The gradients from each pair of models are averaged and used to tune the shared weights and biases. An overview of the sharing gradient process is shown in Eq. (2) and Fig. 2, where η is the learning rate, L is the loss function, and j indexes the models in a pair:

(2)  W ← W − η · (1/2) (∂L_j/∂W + ∂L_{j+1}/∂W)
We share gradients within only a pair of models because, when sharing across more than a pair, the optimization process becomes more complicated; in that case, the performance of the NSN hardly reaches a satisfactory point. The input layer of the base-model, which has no pair, is updated with regular back-propagation.
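In code, the pair-averaged update might look like the following sketch (our own; the forward passes and gradient computation are assumed to happen elsewhere):

```python
import numpy as np

def shared_update(weights, grads_small, grads_big, lr=0.1):
    """One SGD step on the layers shared by a pair of models: each
    shared weight moves along the average of the gradient it received
    in the smaller model and the gradient it received in the bigger
    model."""
    for W, g_small, g_big in zip(weights, grads_small, grads_big):
        W -= lr * 0.5 * (g_small + g_big)   # pair-averaged gradient
```

The unpaired input layer of the base-model would instead take a plain step along its own gradient.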
4 Experiments
The experiments were conducted on a handwritten-digit image dataset, MNIST [lecun2010mnist], which consists of 60,000 training images and 10,000 test images. Each image in the dataset is grayscale and composed of 28×28 pixels. Every pixel of each MNIST image was preprocessed into the range [0, 1] by dividing its value by 255.
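The preprocessing amounts to a single rescale; flattening each image into a 784-dimensional vector for the MLP is our addition:

```python
import numpy as np

def preprocess(raw_images):
    """Scale uint8 MNIST pixels from [0, 255] into [0, 1] and flatten
    each 28x28 image into a 784-dimensional vector for the MLP."""
    x = raw_images.astype(np.float32) / 255.0
    return x.reshape(len(raw_images), -1)
```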
A multi-layer perceptron (MLP) was used, with rectified linear units (ReLU) as the non-linear activation function. The last layer applied log-softmax, with cross-entropy as the cost function. Dropout [srivastava2014dropout] was applied to the input layer of the MLP, and a dropout rate was also set on the hidden layers. In the case of softmax regression, we did not apply dropout, since that model was already underfitting. The base-models were further regularized with an L2 weight penalty.
We trained with stochastic gradient descent (SGD) with momentum, although in a slightly different format from the usual one. The regular format of SGD with momentum [momentum2019], as implemented in the TensorFlow framework [abadi2016tensorflow], is shown in Eq. (3), where a_t is the gradient accumulation term, t is the batch-wise iteration step, m is the momentum coefficient, and g_t is the gradient at step t. Our format is shown in Eq. (4). After computing a_t, both formats use the same Eq. (5) to update the weight W.

(3)  a_t = m · a_{t−1} + g_t
(4)  a_t = m · a_{t−1} + (1 − m) · g_t
(5)  W_t = W_{t−1} − η · a_t
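The two accumulation rules differ only in whether the incoming gradient is scaled. In this sketch, the (1 − m) scaling under `scale_grad=True` is our reading of the variant, inferred from the convergence remark that follows:

```python
def sgd_momentum_step(w, g, acc, lr=0.01, m=0.9, scale_grad=False):
    """One SGD-with-momentum step on a scalar weight.

    scale_grad=False: regular TensorFlow-style rule, acc = m*acc + g.
    scale_grad=True:  the variant, acc = m*acc + (1 - m)*g, which
                      weights the accumulation more heavily relative
                      to the current gradient.
    Both formats then update w = w - lr * acc.
    """
    acc = m * acc + ((1.0 - m) * g if scale_grad else g)
    return w - lr * acc, acc
```

With m = 0.9, the scaled rule lets a single gradient contribute only a tenth as much per step, so the accumulated history dominates the update direction.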
NSNs performed better with our format of SGD with momentum than with the regular format. We speculate that NSNs require a higher proportion of the gradient accumulation, relative to the current gradient, to converge. On the other hand, for regularly trained DNNs, our format performed slightly worse in terms of test accuracy. Hence, for a fair comparison between both types of models, the regularly trained models were trained with Eq. (3) and our proposed-method models with Eq. (4).
We set the training batch size to 128. Each model was trained for 600 epochs; however, we report the best test accuracy that occurred during training. The learning rate starts from its initial value and steps down by one third every 200 epochs.
The experimental results consist of two sections. The first is model-01: the base-model is an MLP with one hidden layer, model-1, with the softmax regression as a sub-model, model-0. The second is model-012: the base-model is a two-hidden-layer MLP, model-2, and the sub-models are an MLP with one hidden layer, model-1, and the softmax regression, model-0. The graphical model of model-012 is shown in Fig. 1. The regularly trained baseline models are referred to as ref-model followed by the number of hidden layers; for example, ref-model-1 is the baseline MLP with one hidden layer. The baseline results are shown in Table 1.
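The learning-rate schedule above can be sketched as a step function. The initial rate is a free parameter here, since its value is elided in our source, and we read "step down by one third" as multiplying by 1/3:

```python
def learning_rate(epoch, initial_lr):
    """Step-decay schedule: multiply the learning rate by 1/3 every
    200 epochs (epochs counted from 0)."""
    return initial_lr * (1.0 / 3.0) ** (epoch // 200)
```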
Table 1: Test accuracy and number of parameters of the regularly trained baseline models.

  Model        Test accuracy   Parameters
  ref-model-2  0.9886          1.24M
  ref-model-1  0.9882          0.62M
  ref-model-0  0.9241          7.85k
4.1 model-01
An MLP with one hidden layer was used as the base-model; the sub-model was the softmax regression. In all of the following experiments we prioritized base-model performance, so we report the test accuracy of every model at the epoch containing the base-model's best test accuracy. The model-01 results are displayed in Table 2. We applied L2 regularization to the base-model.
Table 2: model-01 results.

  Model    Test accuracy   Parameters
  model-1  0.9857          0.62M
  model-0  0.9253          7.85k
Compared with ref-model-1, the test accuracy of model-1 dropped to some extent. This indicates that our proposed method trades away some of the model's performance in exchange for the ability to remove weight layers.
4.2 model-012
An MLP with two hidden layers was used as the base-model; the sub-models were an MLP with one hidden layer and the softmax regression. The model-012 results are displayed in Table 3. We applied L2 regularization to the base-model.
Table 3: model-012 results.

  Model    Test accuracy   Parameters
  model-2  0.989           1.24M
  model-1  0.9843          0.62M
  model-0  0.926           7.85k
The difference in test accuracy between model-1 and model-2 indicates that our proposed method is biased toward the base-model. We speculate that this bias comes from the sharing gradient process: all of the gradients the sub-models receive are averaged across models, whereas the base-model has an input layer updated with the gradient from the base-model itself, as shown in Fig. 2.
In contrast to model-1 in model-01, which fell short of ref-model-1, model-2 in model-012 outperformed ref-model-2 by a tiny margin. We hypothesize that the constraints of our proposed method impose a kind of regularization on the models. For model-1 in model-01, this regularization effect may have been excessively strong and hurt performance; for model-2 in model-012, it seems to have been adequate and improved the accuracy.
5 Conclusion
We propose the network with sub-networks, a DNN whose weight layers can be removed on the fly. A network with sub-networks consists of a base-model with sub-models inside it. To combine the sub-models into the base-model, copying learnable parameters is introduced; sharing gradient is applied so that the learnable parameters can serve two or more models. Our proposed method was evaluated only in small-scale experiments, with DNNs of a few hidden layers and only on the MNIST dataset. Bigger-scale models and datasets will be the focus of future work. We are also interested in applying the method to convolutional neural networks (CNNs), which have been proven to perform better on computer-vision tasks.
Acknowledgment
This research was supported by JSPS KAKENHI Grant Numbers 17K20010.