1 Introduction
Multi-task learning (MTL) [5] learns multiple tasks simultaneously to improve generalization performance across the tasks. Most recent MTL approaches [22, 24, 23, 29] are based on deep neural networks (DNNs), which have outstanding performance compared to traditional machine learning methods in computer vision problems such as image classification [10, 37], object detection [21], and pose estimation [26], to name a few. Since MTL methods using a DNN are believed to require a huge number of parameters and computing resources, a compact network with a small number of parameters and low computational complexity is highly desirable for many practical applications, such as mobile and embedded platforms [13]. To address this, there have been studies on designing compact DNNs, such as network pruning [9, 34], knowledge distillation [12, 28], network architecture search [39], and adaptive model compression [3, 20, 35]. However, these prior works have been applied to single-task problems, and multiple tasks have rarely been considered in a single framework.
The MTL problem has a potential issue that the required number of parameters may increase with the number of tasks [5]. On the other hand, a single shared model for multiple tasks may cause performance degradation when the associated tasks are less relevant [29]. To avoid this issue, recent approaches [15, 16] proposed network architectures which contain several submodels that can be assigned to multiple tasks. Despite these attempts at MTL, they require human effort to construct submodels from the network architecture and to assign a model to each task. For more flexible and adaptive model assignment across multiple tasks, it is desirable to realize a model selection approach which automatically determines a proper submodel for a given instance.
In this work, we aim to develop an instance-aware dynamic model selection approach for a single network that learns multiple tasks. To that end, we present an efficient learning framework that exploits a compact but high-performing model in a backbone network, depending on each instance across all tasks. The proposed framework consists of two main components with different roles, termed an estimator and a selector (see Figure 1). The estimator is based on a backbone (baseline) network, such as VGG [30] or ResNet [10]. It is structured hierarchically with modularized blocks, each consisting of several convolution layers in the backbone network, and it can produce multiple network models of different configurations and scales in a hierarchy. The selector is a relatively small network compared to the estimator and outputs a probability distribution over candidate network models for a given instance. The model with the highest probability is chosen by the selector from the pool of candidate models to perform the task. Note that the approach learns to choose a model corresponding to each instance throughout all tasks, which makes it possible to share common models or features across tasks [7, 15]. We design the objective function to achieve not only competitive performance but also the resource efficiency (i.e., compactness) required for each instance. Inspired by [31], we introduce a sampling-based learning strategy to approximate the gradient for the selector, which is hard to derive exactly. Both the estimator and the selector are trained in a unified learning framework optimizing the associated objective function, which does not require the additional efforts (e.g., fine-tuning) performed in existing works [35, 39].
We perform a number of experiments to demonstrate the competitiveness of the proposed method, including model selection and model compression problems when a single task or multiple tasks are given. For the experiments, we use an extensive set of benchmark datasets: CIFAR10 and CIFAR100 [18], TinyImageNet (https://tinyimagenet.herokuapp.com/), STL10 [6], and ImageNet [19]. The experimental results on different learning scenarios show that the proposed method outperforms existing state-of-the-art approaches. Notably, our approach addresses both model selection and multi-task learning simultaneously in a single framework without introducing additional resources, making it highly efficient.

2 Related Work
Model selection.
In order to reduce the burden on experts of designing a compact network, architecture search methods [39] were proposed to explore the space of potential models automatically.
To shrink the daunting search space, which usually requires time-consuming exploration, methods based on a well-developed backbone structure find an efficient model architecture by compressing a given backbone network [3, 2].
Furthermore, recent studies realizing such a strategy [20, 35, 33] determine a different network model for each instance to reduce additional redundancy.
However, they usually achieve lower performance than their backbone network [20, 33] or require an additional fine-tuning process [35].
In contrast to them, we propose an efficient learning framework which can achieve better performance than the backbone network owing to dynamic model search, and which does not include an additional fine-tuning stage.
Besides, our approach can be applied to learn multiple tasks simultaneously in a single framework, while the aforementioned methods are limited to a single task.
Multi-task learning. The purpose of multi-task learning (MTL) is to develop a learning framework that jointly learns multiple tasks [5]. Note that we focus on an MTL method that learns a single DNN architecture for memory efficiency. There are several recent studies [11, 23, 24] that propose a network structure in which parameters can be efficiently shared across tasks. Other approaches [22, 15, 16] suggest a single architecture which includes multiple internal networks (or models), so that different models can be assigned to multiple tasks without increasing the number of parameters. However, they use a fixed model structure for each task, and expert effort is required to assign a model to each task. In contrast, we propose a dynamic model selection approach for MTL which automatically determines a proper model for a given instance. Although a recent MTL method [29] attempts model selection by a routing mechanism, it does not consider an optimized network structure with respect to the number of parameters or FLOPs.
3 Approach
3.1 Overall framework
The goal of the proposed method is to develop a dynamic model selection framework for an input instance drawn from one of the target tasks. The proposed framework consists of two different components: an "estimator", which is a network of the same size as the target backbone network and contains multiple models of different network configurations, and a "selector", which reveals the model with the highest probability in the estimator. Both the estimator and the selector are constructed on a CNN-based architecture, and the selector is designed to be much smaller than the estimator (see Section 4). The proposed approach explores a model search space and identifies an efficient network model to perform the given task in an instance-wise manner. The overall framework of the proposed approach is illustrated in Figure 1.
Note that the estimator produces a vast number of candidate models, which makes it difficult for the selector to explore the extensive search space. To simplify this daunting task, we use a block notion to shrink the search space over candidate models. A block is defined as a disjoint collection of multiple convolution (or fully connected) layers. Each block is constructed as a hierarchical structure such that a lower level of hierarchy refers to fewer channels of the hidden layers in the block and a higher level refers to more channels, while maintaining the input and output dimensions of the block. Moreover, the lowest level of hierarchy can be constructed without any channels when the block is equivalent to a residual module [10]; this is similar to the layer skipping method in [35]. The hierarchical structure of a block is illustrated in Figure 2.
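To make the hierarchy concrete, the following pure-Python sketch (our illustration, not the authors' code) treats a block as two dense layers and exposes only a prefix of the hidden channels at each level; level 0 degenerates to the identity (residual) path:

```python
# Hypothetical stand-in for a hierarchical block: each level uses only a
# prefix of the hidden channels, so lower levels are strict sub-networks
# of higher ones, while input/output dimensions stay fixed.

def block_forward(x, w_in, w_out, level, n_levels):
    """Apply a two-layer block using a level-dependent fraction of channels.

    x        : input vector (list of floats)
    w_in     : hidden_dim x in_dim weight matrix (list of rows)
    w_out    : out_dim x hidden_dim weight matrix (out_dim == in_dim)
    level    : selected level of hierarchy, 0 .. n_levels (0 = skip block)
    n_levels : number of non-trivial levels in the hierarchy
    """
    if level == 0:
        # Lowest level: no channels are used, so the block reduces to the
        # identity (residual) connection, as with a skipped residual module.
        return x
    hidden_dim = len(w_in)
    k = hidden_dim * level // n_levels   # channels visible at this level
    hidden = [sum(w * xi for w, xi in zip(row, x)) for row in w_in[:k]]
    out = [sum(w * h for w, h in zip(row[:k], hidden)) for row in w_out]
    # Residual connection keeps the block's input/output dimensions fixed.
    return [o + xi for o, xi in zip(out, x)]
```

A higher level strictly extends a lower one, which is what lets candidate models of different scales share parameters.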
We determine a model structure by selecting a level of hierarchy in each block as $\mathbf{z} = [z_1, \dots, z_B]$, where $B$ is the number of blocks in the estimator and $z_b$ denotes the selected level in the $b$-th block. Namely, a network model is collected from the estimator when the model structure $\mathbf{z}$ is given. The inference of the determined network model is represented as follows:

$$\hat{y} = f(x; \theta_e, \mathbf{z}), \quad f(\cdot\,; \theta_e, \mathbf{z}): \mathcal{X}_t \rightarrow \mathcal{Y}_t, \qquad (1)$$

where $\theta_e$ is a set of parameters in the estimator, and $\mathcal{X}_t$ and $\mathcal{Y}_t$ denote input and output domains for task $t$, respectively. To address different input or output dimensions, we assume that the task ID is given beforehand.
The goal of the selector is to find an appropriate network model for a given instance from a task by inferring the probability distribution over candidate models in the estimator. As mentioned earlier, we design the selector to produce a set of probability distributions over the modularized blocks (with their levels of hierarchy) as follows:

$$g(\cdot\,; \theta_s): \mathcal{X}_t \rightarrow [0, 1]^{L \times B}, \qquad (2)$$

where $\theta_s$ is a set of parameters of the selector and $L$ is the number of levels of hierarchy in each block. We define the output of the selector as $\mathbf{p} = g(x; \theta_s)$, and each column $\mathbf{p}_b$ of $\mathbf{p}$ reveals the probabilities of selecting levels in the corresponding block (i.e., $\sum_{l=1}^{L} p_{l,b} = 1$). Then, the probability of a candidate model $\mathbf{z}$ for an instance $x$ can be calculated as

$$P(\mathbf{z} \mid x) = \prod_{b=1}^{B} p_{z_b, b}, \qquad (3)$$

where $p_{z_b, b}$ denotes the $z_b$-th element of the $b$-th column of $\mathbf{p}$, which is the probability that the $z_b$-th level is selected in the $b$-th block for an input $x$. Thus, we can represent up to $L^B$ different candidate models, and one of them is selected to produce its corresponding model to perform the task. The overall framework is shown in Figure 2.
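As a sanity check of the factorized model probability, the per-model probability is the product of the selected levels' probabilities, and the probabilities of all candidates sum to one. A minimal sketch (variable names are ours, not the paper's):

```python
from itertools import product

def model_probability(p, z):
    """Probability of a candidate model: product over blocks of the
    probability assigned to the chosen level. p[b] is block b's
    distribution over levels; z[b] is the chosen level index."""
    prob = 1.0
    for p_b, z_b in zip(p, z):
        prob *= p_b[z_b]
    return prob

def total_mass(p):
    """Sum the probability of every one of the L^B candidate models;
    a valid selector output makes this equal to 1."""
    levels = range(len(p[0]))
    return sum(model_probability(p, z) for z in product(levels, repeat=len(p)))
```

Because the per-block distributions are independent, the selector never has to enumerate the $L^B$ candidates explicitly; it only outputs $L \times B$ numbers.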
3.2 Optimization
The proposed approach is optimized to perform multi-task learning in an instance-wise manner within a single framework. We denote a set of datasets as $\mathcal{D} = \{\mathcal{D}_t\}_{t=1}^{T}$, where $\mathcal{D}_t = \{(x_i, y_i)\}_i$ is the dataset for task $t$, and $x_i$ and $y_i$ are an image and a label, respectively. The proposed model selection problem is to minimize the loss functions for instances of all tasks while keeping the model size compact:

$$\min_{\theta_e, \theta_s} \sum_{t=1}^{T} \sum_{(x, y) \in \mathcal{D}_t} \mathbb{E}_{\mathbf{z} \sim P(\cdot \mid x)} \left[ \mathcal{L}(f(x; \theta_e, \mathbf{z}), y) + \mathcal{R}(\mathbf{z}) \right], \qquad (4)$$

where $\mathcal{L}$ denotes a classification loss function (e.g., cross-entropy) and $\mathcal{R}(\mathbf{z})$ is a sparse regularization term on the model structure $\mathbf{z}$, which is defined as:

$$\mathcal{R}(\mathbf{z}) = \lambda \left( \frac{1}{B} \sum_{b=1}^{B} r_b(z_b) \right)^2, \qquad (5)$$

where $r_b(z_b)$ gives the ratio of the number of parameters determined by $z_b$ to the total number of parameters in the $b$-th block, and $\lambda$ is a weighting factor. The square function in (5) helps enforce a high sparsity ratio, and we have empirically found that it performs better than alternative regularization functions.
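One plausible reading of the regularizer in (5) can be sketched as follows (the averaging over blocks is our assumption; per-level parameter counts are illustrative inputs):

```python
def sparse_regularizer(z, block_params, lam):
    """Sketch of R(z) ~ lam * (mean_b r_b(z_b))^2, where r_b(z_b) is the
    fraction of block b's parameters used at level z_b.

    z            : chosen level per block, e.g. (1, 2)
    block_params : block_params[b][l] = parameter count of block b at
                   level l (monotone in l; last entry = full block)
    lam          : weighting factor lambda
    """
    ratios = [block_params[b][z_b] / block_params[b][-1]
              for b, z_b in enumerate(z)]
    return lam * (sum(ratios) / len(ratios)) ** 2
```

Squaring the averaged ratio penalizes near-full models more steeply than a linear penalty, which is consistent with the stated preference for the square function.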
The proposed approach involves alternating optimization steps for the two sets of parameters, $\theta_e$ and $\theta_s$ (we call each alternating step a stage). While $\theta_e$ can be updated by a stochastic gradient descent (SGD) optimizer [4], the gradient of (4) with respect to $\theta_s$ is difficult to calculate without the exact expected value in (4). For this reason, we introduce a sampling-based approach to approximate the gradient. To describe the approximation, we introduce $\tilde{\mathcal{L}}(\mathbf{z})$, which is equivalent to the loss function for a given $\mathbf{z}$:

$$\tilde{\mathcal{L}}(\mathbf{z}) = \mathcal{L}(f(x; \theta_e, \mathbf{z}), y) + \mathcal{R}(\mathbf{z}). \qquad (6)$$

Then, we can approximate the gradient with sampled model structures, following the strategy in [31]:

$$\nabla_{\theta_s} \mathbb{E}_{\mathbf{z} \sim P(\cdot \mid x; \theta_s)} \big[ \tilde{\mathcal{L}}(\mathbf{z}) \big] = \mathbb{E}_{\mathbf{z} \sim P(\cdot \mid x; \theta_s)} \big[ \tilde{\mathcal{L}}(\mathbf{z}) \, \nabla_{\theta_s} \log P(\mathbf{z} \mid x; \theta_s) \big] \approx \frac{1}{|\mathcal{Z}|} \sum_{\mathbf{z}' \in \mathcal{Z}} \tilde{\mathcal{L}}(\mathbf{z}') \, \nabla_{\theta_s} \log P(\mathbf{z}' \mid x; \theta_s), \qquad (7)$$

where the last expression approximates the expectation as the average over randomly chosen samples $\mathbf{z}'$ collected from the probability distribution $P(\cdot \mid x; \theta_s)$ for the given $x$. Here, $\mathcal{Z}$ is the set of collected samples and $|\mathcal{Z}|$ denotes the number of samples in $\mathcal{Z}$.
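The score-function approximation in (7) can be checked on a toy one-parameter case. For z ~ Bernoulli(theta), the exact gradient of E[L(z)] is L(1) − L(0); the sampled estimate below (our illustration, not the paper's implementation) converges to it:

```python
import random

def reinforce_grad(theta, loss, n_samples, seed=0):
    """Score-function estimate of d/dtheta E_{z~Bernoulli(theta)}[loss(z)]:
    average loss(z) * d log P(z)/d theta over sampled z, mirroring the
    sampling-based approximation used for the selector gradient."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = 1 if rng.random() < theta else 0
        # d/dtheta log P(z): 1/theta if z = 1, -1/(1-theta) if z = 0.
        dlogp = 1.0 / theta if z == 1 else -1.0 / (1.0 - theta)
        total += loss(z) * dlogp
    return total / n_samples
```

For loss values L(1) = 3 and L(0) = 1 the exact gradient is 2; the estimate approaches this as the sample count grows, at the cost of variance, which is why multiple samples per instance are drawn in practice.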
Note that the sampling scheme follows the common strategy in the reinforcement learning literature [25]. However, it can often lead to a worse network structure when the selected model is poor [36]. As a remedy, we apply the $\epsilon$-greedy method [32] to allow more dynamic exploration in the earliest training stage. In addition, we would like to note that the performance of the selected model may be sensitive to the initial distribution of the selector. For this reason, we use the following predetermined distribution over network models in the initial stage:

$$\hat{P}(\mathbf{z}) = \epsilon \cdot \mathbb{1}[\mathbf{z} = \mathbf{z}_{\text{full}}] + (1 - \epsilon) \cdot P(\mathbf{z} \mid x), \qquad (8)$$

where $\epsilon$ is a weighting factor, $\hat{P}(\mathbf{z})$ is the probability that the model structure $\mathbf{z}$ is selected, and $\mathbf{z}_{\text{full}}$ denotes the full model structure which includes all parameters in the estimator. In this work, we set $\epsilon$ to a fixed value in all conducted experiments. We increase the probability that the full model structure is selected more often in the initial stage, and this shows better performance compared to other initial distributions, such as a uniform distribution.
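A small sketch of how such full-model-biased sampling could look (one possible reading of the initial-stage distribution; names are illustrative):

```python
import random

def sample_structure(p, full_level, eps, rng):
    """With probability eps return the full model (every block at its
    highest level); otherwise sample each block's level from the
    selector's per-block distribution p[b]."""
    if rng.random() < eps:
        return [full_level] * len(p)
    levels = []
    for p_b in p:
        # Inverse-CDF sampling over the block's level distribution.
        u, acc = rng.random(), 0.0
        for lvl, pr in enumerate(p_b):
            acc += pr
            if u < acc:
                levels.append(lvl)
                break
        else:
            levels.append(len(p_b) - 1)
    return levels
```

Biasing early samples toward the full model lets the estimator's weights receive gradient signal everywhere before the selector starts pruning the search.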
The overall training procedure of the proposed method, named deep elastic network (DEN), is summarized in Algorithm 1, where $S$ denotes the number of stages. We optimize the two sets of parameters, $\theta_e$ and $\theta_s$, over several stages of the training process. At each stage, one of the two parameter sets is trained until it reaches a local optimum.
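The stage structure just described (alternate between the two parameter sets, each trained for a while with the other held fixed) can be sketched as follows; `update_estimator` and `update_selector` are hypothetical callbacks standing in for full optimization steps:

```python
def train_den(update_estimator, update_selector, n_stages, steps_per_stage):
    """Schematic of stage-wise alternating optimization: even stages
    update the estimator's parameters, odd stages the selector's,
    while the other set is held fixed."""
    history = []
    for stage in range(n_stages):
        if stage % 2 == 0:
            step, name = update_estimator, 'estimator'
        else:
            step, name = update_selector, 'selector'
        for _ in range(steps_per_stage):
            step()  # one optimization step for the active parameter set
        history.append(name)
    return history
```

In the paper's setting each stage runs until convergence rather than for a fixed step count; the fixed `steps_per_stage` here is only for illustration.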
4 Experiments
4.1 Experimental setup
Datasets. We evaluated the proposed framework on several classification datasets as listed in Table 1. For the CIFAR10, CIFAR100, TinyImageNet, and STL10 datasets, we used the original image size. MiniImageNet is a subset of ImageNet [19] which has 50 class labels, each with 800 training instances. We resized each image in the MiniImageNet dataset and center-cropped it to 224×224. As preprocessing, we performed random horizontal flips for all datasets and added zero padding of four pixels before cropping for the CIFAR, TinyImageNet, and STL10 datasets. The CIFAR100 dataset includes two types of class categories for each image: 20 coarse and 100 fine classes. We used both of them for hierarchical classification; otherwise, we used the fine classes for the rest of the experiments.
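The augmentation described above (four-pixel zero padding, random crop, random horizontal flip) can be sketched for a single-channel image as follows; this is our toy illustration, not the authors' pipeline:

```python
import random

def preprocess(img, pad, crop, rng):
    """Toy single-channel augmentation: zero-pad by `pad` pixels on every
    side, take a random crop of size `crop`, and flip horizontally with
    probability 0.5."""
    h, w = len(img), len(img[0])
    padded = [[0.0] * (w + 2 * pad) for _ in range(h + 2 * pad)]
    for i in range(h):
        for j in range(w):
            padded[i + pad][j + pad] = img[i][j]
    # Random crop position within the padded image.
    top = rng.randrange(h + 2 * pad - crop + 1)
    left = rng.randrange(w + 2 * pad - crop + 1)
    out = [row[left:left + crop] for row in padded[top:top + crop]]
    if rng.random() < 0.5:
        out = [row[::-1] for row in out]  # horizontal flip
    return out
```

With `pad=4` and `crop` equal to the original size (e.g., 32 for CIFAR), this matches the standard CIFAR-style augmentation recipe.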
Scenarios.
We evaluated three scenarios for multi-task learning (MTL) and one scenario for network compression.
For MTL, we organized two scenarios (M1, M2) using multiple datasets and one scenario (M3) using a single dataset with hierarchical class categories.
For the first scenario, M1, we used three datasets of different image scales: CIFAR100 (32×32), TinyImageNet (64×64), and STL10 (96×96).
For M2, 50 labels are randomly chosen from the 1000 class labels in the ImageNet dataset and the chosen labels are separated into 10 disjoint subsets (tasks) each of which has 5 labels.
M3 is a special case of MTL (we call it hierarchical classification), which aims to predict two different labels (coarse and fine classes) simultaneously for each image.
CIFAR100 was used for the scenario M3.
We also conducted the network compression scenario (C1) as a single-task learning problem on the CIFAR10 and CIFAR100 datasets.
Dataset  Size  # train  # test  # classes 

CIFAR10 [18]  32  50,000  10,000  10 
CIFAR100 [18]  32  50,000  10,000  100 
TinyImageNet  64  100,000  10,000  200 
STL10 [6]  96  5,000  8,000  10 
MiniImageNet [29]  224  40,000  2,500  50 
Dataset  Baseline [10]  NestedNet [15]  PackNet* [22]  DEN ()  DEN () 

CIFAR100 (32×32)  75.05  74.53  72.22  74.30  75.11 
TinyImageNet (64×64)  57.22  56.71  56.49  56.74  60.21 
STL10 (96×96)  76.25  82.54  80.78  83.90  87.58 
Average  69.51  71.26  69.83  71.65  74.30 
FLOPs  2.91G  1.70G  1.70G  1.35G  1.61G 
No. parameters  89.4M (×3)  29.8M (×1)  29.8M (×1)  29.8M (×1)  29.8M (×1) 
(a) ResNet42  
Dataset  Baseline [38]  NestedNet [15]  PackNet* [22]  DEN ()  DEN () 
CIFAR100 (32×32)  75.01  74.09  73.56  75.43  75.65 
TinyImageNet (64×64)  58.89  57.87  57.17  58.17  58.25 
STL10 (96×96)  79.88  83.78  84.15  87.54  87.56 
Average  71.26  71.91  71.63  73.71  73.82 
FLOPs  2.13G  1.24G  1.24G  1.13G  1.14G 
No. parameters  22.0M (×3)  7.35M (×1)  7.35M (×1)  7.35M (×1)  7.35M (×1) 
(b) WRN324 
Implementation details. We used ResNet-$N$ [10] and WRN-$N$-$k$ [38] as backbone networks in the MTL scenarios, where $N$ is the number of layers and $k$ is the scale factor on the number of convolutional channels (e.g., ResNet42 and WRN324). We borrowed the residual network architecture designed for ImageNet [19] to handle large-scale images and the WRN architecture designed for CIFAR [18] to handle small-scale images. We also used SimpleConvNet, introduced in [29, 27], as the backbone network for MiniImageNet; it consists of four 3×3 convolutional layers (32 filters) and three fully connected layers (128 dimensions for hidden units). In the network compression scenario, we used ResNeXt-$N$ ($C$×$d$) [37] and VGG [30] to apply our method to various backbones, where $C$ and $d$ are the number of individual convolution blocks and the unit depth of the convolution blocks in each layer, respectively [37]. The backbone networks are used as baseline methods performing an individual task in each scenario. To build the structure of the estimator, we defined a block as a residual module [10], or as two consecutive convolution layers for VGG networks. We then split each block into multiple convolution groups along the channel dimension (2 or 3 groups in our experiments) to construct a hierarchical structure. Note that the lowest level of hierarchy does not have any convolution groups for ResNet, WRN, and ResNeXt, but has one group for VGG and SimpleConvNet. The selector was designed as a network smaller than the estimator; its size is stated in each experiment.
For the proposed method, named deep elastic network (DEN), the estimator was trained by the SGD optimizer with Nesterov momentum of 0.9 and a batch size of 256 for the large-scale dataset (ImageNet) and 128 for the other datasets. The ADAM optimizer [17] was used to learn the selector with the same batch size. The initial learning rates were 0.1 for the estimator and 0.00001 for the selector, and we decayed the learning rate by a factor of 10 upon convergence (three or four decays occurred in all experiments for both the estimator and the selector). All experiments were conducted in the TensorFlow environment [1].

Compared methods. We compared with four state-of-the-art algorithms considering resource efficiency for multi-task learning: PackNet*, NestedNet [15], Routing [29], and Cross-stitch [24]. PackNet* is a variant of PackNet [22] which considers group-wise compression along the channel dimension, in order to achieve practical speedup like ours. Both PackNet* and NestedNet divide convolutional channels into multiple disjoint groups and construct a hierarchical structure such that the $n$-th level of hierarchy includes $n$ divided groups (the number of levels of hierarchy corresponds to the number of tasks). When updating the $n$-th level of hierarchy, NestedNet considers the parameters in the $n$-th level, but PackNet* considers the parameters except those in the $(n-1)$-th level. For Routing and Cross-stitch, we used the results in [29] obtained under the same circumstances. We also compared with BlockDrop [35], N2N [3], Pruning (as we term it) [14], and NestedNet [15] for the network compression problem. Note that we report the FLOPs and the number of parameters of the proposed method for the estimator in all experiments.
4.2 Multi-task learning
For the first scenario, M1 (on three tasks), we used ResNet42 and WRN324 as backbone networks. The three tasks, TinyImageNet, CIFAR100, and STL10, are assigned to the levels of hierarchy of PackNet* and NestedNet from the lowest to the highest level, respectively. The number of parameters and FLOPs of the selector are 1.49M and 0.15G for the ResNet42 backbone and 0.37M and 0.11G for the WRN324 backbone, respectively. The baseline method requires three separate networks, each trained independently. Table 2 shows the results with respect to accuracy, FLOPs, and the number of parameters of the compared methods. Here, FLOPs denotes the average FLOPs over the multiple tasks, and the number of parameters is measured over all networks required to perform the tasks. Overall, our approach outperforms the other methods, including the baseline. In addition, we provide results obtained by varying the weighting factor $\lambda$ of the sparse regularizer in (5). As shown in the table, performance is better when $\lambda$ is lower, and a more compact model is selected when $\lambda$ is higher.
For the scenario M2, SimpleConvNet was used as the backbone network. Since this scenario contains a larger number of tasks than the previous one, PackNet* and NestedNet, which divide the model by human design, cannot be applied. We divided the convolution part, which takes most of the FLOPs in the network, into two levels such that the lowest level of hierarchy contains half the parameters of the highest level. The number of parameters of the selector is 0.4M, whereas the number of parameters of the estimator is 0.8M. In this scenario, the selector is not much smaller than the estimator because the estimator is already sufficiently small; in the other scenarios, the number of parameters of the selector is negligible compared to that of the estimator. The accuracy, FLOPs, and the number of parameters of the compared methods are reported in Table 3; the results of the compared methods are taken from [29]. For a fair comparison, we experimented with our algorithm on two input scales. The proposed method shows a significant performance improvement over the other methods, even though it uses lower average FLOPs for evaluation, and it maintains similar FLOPs to the compared methods even when dealing with large-scale inputs while achieving outstanding performance. Note that since the number of parameters and FLOPs are not precisely reported in [29], we provide lower bounds.
4.3 Hierarchical classification
For the scenario M3, we dealt with CIFAR100, which has coarse and fine class categories for each image as described in Section 4.1. WRN324 was used as the backbone network for this scenario. We compared with PackNet* and NestedNet, where the lowest and highest levels of hierarchy were allocated to the coarse and fine classification tasks, respectively. The structure of the selector in our method is the same as in the scenario M1.
Method  Accuracy  FLOPs  No. params 

Baseline  51.03  27.4M  0.14M 
Cross-stitch [24]  56.03  27.4M  0.14M 
Routing [29]  58.97  27.4M  0.14M 
DEN ()  60.78  18.7M  0.14M 
DEN ()  62.62  19.4M  0.14M 
DEN ()  63.20  33.3M  0.85M 
DEN ()  65.23  39.1M  0.85M 
Table 4 summarizes the results of the compared methods for the coarse and fine classification problems. Our approach shows the best accuracy with the lowest FLOPs among the compared methods except the baseline, and it achieves higher performance than the baseline method on average. Since each image is associated with two tasks (coarse and fine class categories), the selector exploits the same model structure for both and thus gives almost the same FLOPs.
4.4 Network compression
The goal of the network compression problem is to design a compact network model from a given backbone network while minimizing performance degradation. We applied the proposed method to the network compression problem, which is a single-task learning problem. We compared with BlockDrop [35] and NestedNet [15] on two backbone networks, ResNeXt [37] and VGG [30]; since BlockDrop is developed for residual networks, we compared with it using ResNeXt. The CIFAR10 and CIFAR100 [18] datasets were used for this scenario. To verify the efficiency of the proposed method, we constructed our method with four levels of hierarchy for ResNeXt29 (8×64d) and three levels for ResNeXt29 (4×64d). The numbers of parameters of the selector are 3.9M and 3.6M for the VGG and ResNeXt backbone networks, respectively.
Table 5 summarizes the classification accuracy of the compared approaches for the backbone networks. Overall, the proposed method shows the highest accuracy among the compression approaches. Our results with different $\lambda$ show that $\lambda$ provides a trade-off between the network size and the corresponding accuracy. We also tested the proposed method (estimator) with a random selector, which reveals a model structure randomly among the candidate models in the estimator, to compare it with our model selection method. The accuracy of the random selector is lower than that of the proposed selector, which shows that the selector indeed learns to explore desirable models. Moreover, we compared with the state-of-the-art network compression methods N2N [3] and Pruning [14], whose results were taken from their papers [3, 14]. Our approach achieves 94.47% classification accuracy using 5.8M parameters, while the Pruning method achieves 94.15% accuracy using 6.4M parameters on the CIFAR10 dataset. The proposed method also shows better performance than the N2N and Pruning methods on the CIFAR100 dataset.
Method  Accuracy  FLOPs  No. params 

Baseline [38]  83.53  2.91G  14.7M 
NestedNet [15]  84.55  1.46G  7.35M 
PackNet* [22]  84.53  1.46G  7.35M 
DEN ()  84.87  1.37G  7.35M 
(a) Coarse classification (20)  
Method  Accuracy  FLOPs  No. params 
Baseline [38]  76.32  2.91G  14.7M 
NestedNet [15]  75.84  2.91G  7.35M 
PackNet* [22]  75.65  2.91G  7.35M 
DEN ()  75.93  1.37G  7.35M 
(b) Fine classification (100) 
Dataset  CIFAR10  CIFAR100  

Backbone  Method  Acc (%)  No. params  FLOPs reduction (×)  Acc (%)  No. params  FLOPs reduction (×) 
VGG16  Baseline [30]  92.52  38.9M  1.0  69.64  38.9M  1.0 
NestedNet [15], L  91.29  19.4M  2.0  68.10  19.4M  2.0  
NestedNet [15], H  92.40  38.9M  1.0  69.01  38.9M  1.0  
DEN ()  92.31  18.5M  2.4  68.87  18.9M  1.7  
ResNet18  N2N [3]  91.97  2.12M  69.64  4.76M  
ResNet34  93.54  3.87M  70.11  4.25M  
ResNet50  Pruning [14]  94.15  6.44M  74.10  9.24M  
DEN ()  94.50  4.25M  77.98  4.67M  
ResNeXt29 (8×64d)  Baseline [37]  94.61  22.4M  1.0  78.73  22.4M  1.0 
NestedNet [15], L  93.56  5.6M  4.0  74.83  5.6M  4.0  
NestedNet [15], M  93.64  11.2M  2.0  74.98  11.2M  2.0  
NestedNet [15], H  94.13  22.4M  1.0  76.16  22.4M  1.0  
BlockDrop [35]  93.56  16.9M  1.2  78.35  15.5M  1.4  
DEN (rand sel)  90.55  9.8M  2.3  69.67  9.8M  2.3  
DEN ()  91.45  4.1M  5.5  78.27  7.3M  3.0  
DEN ()  94.61  8.7M  2.7  78.68  13.5M  1.9  
ResNeXt29 (4×64d)  Baseline [37]  94.37  11.2M  1.0  77.95  11.2M  1.0 
NestedNet [15], L  93.59  5.6M  2.0  75.70  5.6M  2.0  
NestedNet [15], H  94.11  11.2M  1.0  76.36  11.2M  1.0  
BlockDrop [35]  93.07  6.53M  1.7  77.23  7.47M  1.5  
DEN (rand sel)  87.33  5.6M  2.0  65.44  5.6M  2.0  
DEN ()  93.38  5.4M  2.1  76.71  5.6M  2.0  
DEN ()  94.47  5.8M  1.9  77.58  6.3M  1.8  
ResNeXt29 (2×64d)  Baseline [37]  93.60  5.6M  76.54  5.6M 
4.5 Qualitative results
The proposed instance-wise model selection for multi-task learning can associate similar features with similar images, which means that similar model structures can be selected for similar images. To verify this, we chose one input image (query) for each task and derived its output model distribution from the selector, measuring the similarity between distributions with a distance measure. We then collected four samples from each task whose corresponding outputs have model distributions similar to that of the query image. To do so, we constructed the proposed method based on the WRN324 backbone architecture for the three tasks (datasets): CIFAR100, TinyImageNet, and STL10. We set the input images of all datasets to the same size to examine the similarity under the same image scale. Figure 3 shows some selected images from all tasks for each query image. The results show that instance-wise model selection can be a promising strategy for multi-task learning, as it can reveal common knowledge across tasks. We provide model distributions for instances from the test set in the supplementary material, along with an ablation study using different numbers of levels.
5 Conclusion
In this work, we have proposed an efficient learning approach that performs resource-aware dynamic model selection for multi-task learning. The proposed method contains two main components with different roles: an estimator, which produces multiple candidate models, and a selector, which exploits a compact and competitive model among the candidate models to perform the designated task. We have also introduced a sampling-based optimization strategy to address the discrete action space of the potential candidate models. The proposed approach is learned in a single framework without introducing many additional parameters or much additional training effort. It has been evaluated on several problems, including multi-task learning and network compression, and the results have shown the outstanding performance of the proposed method compared to other competitors.
Acknowledgements: This research was supported in part by Institute of Information & Communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) (No. 2019001190, [SW Star Lab] Robot Learning: Efficient, Safe, and Socially-Acceptable Machine Learning, and No. 2019001371, Development of Brain-Inspired AI with Human-Like Intelligence), and by the AIR Lab (AI Research Lab) of Hyundai Motor Company through the HMC-SNU AI Consortium Fund.
References
 [1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for largescale machine learning. In 12th Symposium on Operating Systems Design and Implementation (OSDI 16), 2016.
 [2] Karim Ahmed and Lorenzo Torresani. MaskConnect: Connectivity learning by gradient descent. In European Conference on Computer Vision (ECCV). Springer, 2018.
 [3] Anubhav Ashok, Nicholas Rhinehart, Fares Beainy, and Kris M Kitani. N2N learning: network to network compression via policy gradient reinforcement learning. arXiv preprint arXiv:1709.06030, 2017.
 [4] Léon Bottou. Largescale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
 [5] Rich Caruana. Multitask learning. Machine learning, 28(1):41–75, 1997.

 [6] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of International Conference on Artificial Intelligence and Statistics, 2011.
 [7] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning (ICML), pages 647–655, 2014.
 [8] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of International Conference on Artificial Intelligence and Statistics, 2010.
 [9] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems (NIPS), 2015.

 [10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2016.
 [11] Xiaoxi He, Zimu Zhou, and Lothar Thiele. Multi-task zipping via layer-wise neuron sharing. In Advances in Neural Information Processing Systems (NIPS), 2018.
 [12] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 [13] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 [14] Yiming Hu, Siyang Sun, Jianquan Li, Xingang Wang, and Qingyi Gu. A novel channel pruning method for deep neural network compression. arXiv preprint arXiv:1805.11394, 2018.
 [15] Eunwoo Kim, Chanho Ahn, and Songhwai Oh. NestedNet: Learning nested sparse structures in deep neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2018.
 [16] Eunwoo Kim, Chanho Ahn, Philip HS Torr, and Songhwai Oh. Deep virtual networks for memory efficient inference of multiple tasks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2019.
 [17] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[18] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 and CIFAR-100 datasets. URL: https://www.cs.toronto.edu/kriz/cifar.html (visited on Mar. 1, 2016), 2009.

[19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.
[20] Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. Runtime neural pruning. In Advances in Neural Information Processing Systems (NIPS), 2017.
[21] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2018.
 [22] Arun Mallya and Svetlana Lazebnik. PackNet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2018.
 [23] Elliot Meyerson and Risto Miikkulainen. Beyond shared hierarchies: Deep multitask learning through soft layer ordering. arXiv preprint arXiv:1711.00108, 2017.
[24] Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3994–4003, 2016.
 [25] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 [26] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision (ECCV). Springer, 2016.
[27] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR), 2017.
 [28] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
[29] Clemens Rosenbaum, Tim Klinger, and Matthew Riemer. Routing networks: Adaptive selection of non-linear functions for multi-task learning. arXiv preprint arXiv:1711.01239, 2017.
[30] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [31] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS), 2000.
 [32] Michel Tokic. Adaptive greedy exploration in reinforcement learning based on value differences. In Annual Conference on Artificial Intelligence, pages 203–210. Springer, 2010.
 [33] Andreas Veit and Serge Belongie. Convolutional networks with adaptive inference graphs. In European Conference on Computer Vision (ECCV). Springer, 2018.
 [34] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems (NIPS), 2016.
 [35] Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S Davis, Kristen Grauman, and Rogerio Feris. BlockDrop: Dynamic inference paths in residual networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2018.
 [36] Jeremy Wyatt. Exploration and inference in learning from reinforcement. 1998.
 [37] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 [38] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
 [39] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
Appendix A
A.1 Details of Hierarchical Structure
The estimator of the proposed method can produce multiple network models of different sizes based on the hierarchical structure in a block. To control the actual inference speedup, each hierarchy level accesses a different number of channels in each convolution layer, and the ratio of channels assigned to each level can be adjusted. As shown in Figure 4, the lowest level of the hierarchy accesses only a few channels, while the highest level contains all channels. If the block is based on a residual block [10], the lowest level does not include any channels, so only the skip connection remains.
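As an illustration, the nested channel counts per level might be computed as follows. The even split and the function name are our own assumptions, not the paper's exact ratios; only the residual-block special case (zero channels at the lowest level) is taken from the text above.

```python
def level_channels(total, num_levels=4, residual=False):
    """Channels accessible at each hierarchy level (illustrative sketch).

    Levels access nested subsets of the channels: level 0 (lowest) sees
    the fewest, the top level sees all of them. The even split below is
    an assumption, not the paper's exact ratios. With a residual block,
    the lowest level uses zero channels, keeping only the skip connection.
    """
    if residual:
        ratios = [level / (num_levels - 1) for level in range(num_levels)]
    else:
        ratios = [(level + 1) / num_levels for level in range(num_levels)]
    return [round(total * r) for r in ratios]
```

Each level's channel set contains the previous level's, so the models form a nested hierarchy rather than disjoint subnetworks.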
A.2 Ablation Studies
We evaluated the performance depending on the number of levels in each block or depending on the initial model distribution. We used WRN-32-4 [38] as a backbone network and the CIFAR-100 dataset [18].
First, we tested the performance for varying numbers of levels. The number of candidate models grows rapidly as the number of levels increases, while the size of the selector is held fixed (the number of candidate models is L^B, where L and B are the number of levels and blocks, respectively). Figure 5 shows that the larger the number of levels, the smaller the network that can be found, as a larger model space is explored. The performance also improved incrementally until the number of levels reached four. However, with five levels the performance degraded, since the selector fails to deal with such a large number of candidate models.
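This counting reflects our reading of the setup: the selector makes an independent L-way choice for each of the B blocks, so the model space has L^B members. A minimal sketch, with the 15-block, 4-level configuration from Section A.3 used purely as an illustrative input:

```python
def num_candidate_models(levels, blocks):
    # Each of the `blocks` blocks independently picks one of `levels`
    # hierarchy levels, so the model space has levels**blocks members.
    return levels ** blocks

# Example: 15 blocks with 4 levels each already yields over a billion
# candidate models, which illustrates the rapid growth described above.
space_size = num_candidate_models(4, 15)
```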
Second, we verified the effect of the initial model distribution. We compared the proposed model distribution, described in Section 3.2 of the main paper, with two alternatives: a uniform distribution (Uniform) and a random distribution obtained from the untrained initial selector (Random). The initial model distribution was used for training the estimator in the initial stage. As observed in Figure 5, learning the network with the proposed initial distribution shows the best performance. With the other distributions, the accuracy in the initial stage converged to a value 2 to 3% lower than that of our method. Our approach achieves high performance in the initial stage, and this affects the overall performance in Figure 5(b).
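A hedged sketch of how an initial model distribution might be sampled from, representing a candidate model as one level choice per block. The function name and the per-block factorization are our assumptions; the paper's proposed initial distribution (Section 3.2 of the main paper) is not reproduced here, only the Uniform baseline and a generic weighted alternative.

```python
import random

def sample_initial_model(num_blocks, num_levels, dist=None):
    """Draw one candidate model as a per-block level choice (a sketch).

    `dist`, if given, holds one probability vector over levels per
    block; None corresponds to the Uniform baseline in the ablation.
    """
    if dist is None:
        # Uniform baseline: every level equally likely in every block.
        return [random.randrange(num_levels) for _ in range(num_blocks)]
    levels = list(range(num_levels))
    return [random.choices(levels, weights=w)[0] for w in dist]
```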
A.3 Model Distribution for Test Set
We describe the model distribution over the test set to verify that diverse models are selected depending on the given input instances. The proposed framework was trained on three datasets, CIFAR-100 [18], Tiny-ImageNet, and STL-10 [6], with WRN-32-4 [38] as the backbone network. We designed the estimator to have 15 blocks, each of which contains four levels of hierarchy. Figure 6 shows the histogram of the different models used for instances in the test sets. The selected models vary, and their distribution is neither deterministic nor uniform. We also calculated the average probability that each level is selected over the test set. As shown in Figure 6(b), high values indicate that the corresponding levels of the hierarchy are frequently selected over the test set and that there are common filters used for most instances.
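The two statistics above can be sketched as follows, given the selector's choices recorded as one per-block level tuple per test instance; the function name and data layout are our assumptions for illustration.

```python
from collections import Counter

def selection_statistics(selections, num_levels):
    """Summarize selector choices over a test set (a sketch).

    `selections` holds one per-block level tuple per test instance.
    Returns the histogram of distinct selected models and, for each
    block, the fraction of instances choosing each level, i.e. the
    per-level average selection probability.
    """
    histogram = Counter(tuple(s) for s in selections)
    num_blocks = len(selections[0])
    freq = [[0.0] * num_levels for _ in range(num_blocks)]
    for s in selections:
        for block, level in enumerate(s):
            freq[block][level] += 1.0 / len(selections)
    return histogram, freq
```

A non-uniform, non-deterministic histogram together with a few levels of consistently high frequency would match the observation that some filters are shared by most instances while the overall model choice still varies.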
From the experiment, we found that different models are selected for different groups of images. Examples of selected models and corresponding input images are shown in Figure 7. Three example models are shown in the figure: Models A, B, and C. Model A is selected for images of children, and Model B is selected for images of people doing different activities. Note that Models A and B share the same network architecture. Model C is selected for images of vessels. Similar groups are selected for Models A and B, while the groups selected for Model C differ from those of Models A and B. We can see that each group is learned for specific features, and the proposed selector explores appropriate groups for efficient inference.