1 Introduction
Increasing network depth has been a dominant trend [11]
in the design of deep neural networks for computer vision. However, increased network depth comes at the expense of computational overhead and increased training time. To reduce the computational cost of machine translation models, Shazeer et al.
[22] recently explored the design of “outrageously” wide sparsely-gated mixture of experts models, which employ a combination of simple networks, called experts, to determine the output of the overall network. They demonstrated that these relatively shallow models can reduce computational costs and improve prediction accuracy. However, their resulting models needed to be many times larger than existing translation models to recover state-of-the-art translation accuracy. Expanding on this, preliminary work by Eigen et al. [9] demonstrated the advantages of stacking two layers of mixture of experts models for MNIST digit classification. With these results, a natural question arises: can we stack and train many layers of mixture of experts models to improve accuracy and reduce prediction cost without radically increasing the network width?

In this paper, we explore the design of deep mixture of experts models (DeepMoEs) that compose hundreds of mixture of experts layers. DeepMoEs combine the improved accuracy of deep models with the computational efficiency of sparsely-gated mixture of experts models. However, constructing and training DeepMoEs poses several key challenges. First, mixture decisions interact across layers in the network, requiring joint reasoning and optimization. Second, the discrete expert selection process is non-differentiable, complicating gradient-based training. Finally, the composition of multiple mixture of experts models increases the chance of degenerate (i.e., singular) combinations of experts at each layer.
To address these challenges we propose a general DeepMoE architecture that combines a deep convolutional network with a shallow embedding network and a multi-headed sparse gating network (see Fig. 1). The shallow embedding network terminates in a softmax output layer that computes latent mixture weights over a fixed set of latent experts. These latent mixture weights are then fed into the multi-headed sparse gating network (with ReLU outputs) to select and reweight the channels in each layer of the base convolutional network. We then jointly train the base model, shallow embedding network, and multi-headed gating network with an auxiliary classification loss on the shallow embedding network and sparse regularization on the gating network outputs, encouraging diversity in the latent mixture weights and sparsity in the layer-wise channel selection. This helps balance expert utilization and keep computation costs low.
Recent work [5] proves that the expressive power of a deep neural network increases super-exponentially with its depth, relative to its width. By stacking multiple mixture of experts layers and dynamically generating sparse channel weights, we show in Sec. 4 that DeepMoEs preserve the expressive power of the unsparsified deep networks.
Based on this theoretical analysis, we further propose two variants, wideDeepMoE and narrowDeepMoE, to improve prediction accuracy while reducing the computational cost relative to standard convolutional networks. For wideDeepMoEs, we first double the number of channels in the standard convolutional network and then replace the widened convolutional layers with MoE layers. Our experiments show that when only half of the channels in the widened layers are selected at inference time, wideDeepMoE achieves higher prediction accuracy, due to the increase in model capacity, while maintaining the same computational cost as the unwidened network. For narrowDeepMoEs, we directly replace the convolutional layers in the standard convolutional network with MoE layers, which generalizes the existing work [18] on dynamic channel pruning and produces models that are more accurate and more efficient than those in the channel pruning literature.
We empirically evaluate the DeepMoE architecture on both image classification and semantic segmentation tasks using four benchmark datasets (CIFAR10, CIFAR100, ImageNet 2012, and CityScapes) and conduct extensive ablation studies on the gating behavior and network design in Sec. 5. We find that DeepMoEs achieve the goal of improving prediction accuracy at reduced computational cost across these benchmarks.
Our contributions can be summarized as: (1) we propose a novel DeepMoE design that allows the network to dynamically select and execute part of the network at inference; (2) we show theoretically that the proposed DeepMoE design preserves the expressive power of a standard convolutional network at reduced computational cost; (3) we introduce two DeepMoE variants that are more accurate and efficient than prior methods on several benchmarks.
2 Related Work
Mixture of experts. Jacobs et al. [14] introduced the original formulation of mixture of experts (MoE) models. In this early work, they describe a learning procedure for systems composed of many separate neural networks each devoted to subsets of the training data. Later work [6, 7, 15]
applied the MoE idea to classic machine learning algorithms such as support vector machines. More recently, several works [22, 10, 1] have proposed MoE variants for deep learning in the language modeling and image recognition domains. These more recent efforts to combine deep learning and mixtures of experts have focused on mixtures of deep subnetworks rather than stacking many mixture of experts models. While preliminary work by Eigen et al.
[9] explored stacked MoE models, they only successfully demonstrated networks up to depth two and only evaluated their design on MNIST digits. In contrast, we construct deep models with hundreds of MoE layers based on a shared shallow embedding rather than the layer outputs [9], which makes DeepMoE better suited to parallel hardware with batch parallelism since the gate decisions are predetermined. We also address several of the key challenges in the design and training of multi-layer MoE models. More recently, the mixture of experts design has been applied in other applications, e.g., video captioning [26] and multitask learning [20].

Conditional computation. Related to mixture of experts, recent work by Bengio et al. [2, 3, 4]
explored conditional computation in the context of neural networks, selectively executing parts of the network based on the input. They use reinforcement learning (RL) for the discrete selection decisions, which is delicate to train, whereas our sparsely-gated DeepMoE design can be embedded into standard convolutional networks and optimized with stochastic gradient descent.
Dynamic channel pruning. To reduce storage and computation overhead, many works [17, 12, 19] have explored channel-level pruning, which removes entire channels at each layer in the network and thus leads to structured sparsity. However, permanently dropping channels limits the network capacity. Bridging conditional computation and channel pruning, recent works [18, 27, 28] have explored dynamic pruning, which uses per-layer gating networks to dynamically drop individual channels or entire layers based on the outputs of previous layers. The channels to be dropped therefore depend on the input, resulting in a more expressive network than one obtained with static pruning techniques. Like the work on conditional computation [25], dynamic pruning relies on sample-inefficient reinforcement learning techniques to train the many convolutional gates. In this work, we generalize the earlier work on dynamic channel pruning by introducing a more efficient shared convolutional embedding and simple ReLU-based gates, enabling both sparsification and feature recalibration while allowing end-to-end training with stochastic gradient descent.
3 Deep Mixture of Experts
In this section, we first describe the DeepMoE formulation and then introduce the detailed architecture design and loss function formulation.
3.1 Mixture of Experts
The original mixture of experts [14] formulation combines a set of n experts (classifiers) f_1(x), \ldots, f_n(x) using a mixture (gating) function g that returns a distribution over the experts given the input x:

F(x) = \sum_{i=1}^{n} g_i(x) f_i(x).    (1)

Here g_i(x) is the weight assigned to the i-th expert f_i(x). Later work [7] generalized this mixture of experts formulation to a non-probabilistic setting where the gating function
outputs arbitrary weights for the experts instead of probabilities. We adopt this nonprobabilistic view since it provides increased flexibility in rescaling and composing the expert outputs.
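As a concrete illustration, a single-layer mixture of experts in the probabilistic setting of Eq. (1) can be sketched as follows. The linear experts and linear gate below are placeholder parameterizations for the sketch, not the ones used in [14]:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax, producing a distribution over experts
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
n, dim, classes = 4, 5, 3
experts = [rng.normal(size=(classes, dim)) for _ in range(n)]  # experts f_i
W_gate = rng.normal(size=(n, dim))                             # gating parameters

def moe(x):
    g = softmax(W_gate @ x)                    # g(x): distribution over experts
    return sum(g[i] * (experts[i] @ x) for i in range(n))  # Eq. (1)

x = rng.normal(size=dim)
out = moe(x)   # weighted combination of the n expert outputs
```

In the non-probabilistic generalization adopted in this paper, the softmax would simply be replaced by a gate emitting arbitrary (e.g., ReLU) weights.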
3.2 DeepMoE Formulation
In this work, we propose the DeepMoE architecture, which extends the standard single-layer MoE model to multiple layers within a single convolutional network. While traditional MoE frameworks focus on model-level combinations of experts, DeepMoE operates within a single model and treats each channel as an expert. The experts in each MoE layer consist of the output channels of the previous convolution operation. In this section, we derive the equivalence between gated channels in a convolution layer and the classic mixture of experts formulation.
A convolution layer with tensor input X \in \mathbb{R}^{h \times w \times c}, having spatial resolution h \times w and c input channels, and convolutional kernel W \in \mathbb{R}^{k \times k \times c \times d} can be written as:

Y_{:,:,j} = \sum_{i=1}^{c} X_{:,:,i} * W_{:,:,i,j},    (2)

where Y is the output tensor with d output channels. To construct an MoE convolutional layer we scale the input channels by the gate values g \in \mathbb{R}^{c} for that layer and rearrange terms:

Y_{:,:,j} = \sum_{i=1}^{c} (g_i X_{:,:,i}) * W_{:,:,i,j}    (3)
          = \sum_{i=1}^{c} g_i (X_{:,:,i} * W_{:,:,i,j}).    (4)

Defining the per-channel convolution operator E_i(X) = X_{:,:,i} * W_{:,:,i,:}, we can drop the output-channel subscript in (4) to obtain:

Y = \sum_{i=1}^{c} g_i E_i(X).    (5)
Thus, we have shown that gating the input channels to a convolutional network is equivalent to constructing a mixture of experts for each output channel. In the following sections, we describe how the gate values are obtained for each layer and then present how individual mixture of experts layers can be efficiently composed and trained in the DeepMoE architecture.
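The equivalence in Eqs. (2)-(5) can be checked numerically. The following NumPy sketch (shapes, seed, and the naive loop-based convolution are purely illustrative) verifies that gating the input channels before convolution equals the gated mixture of per-channel "expert" outputs:

```python
import numpy as np

def conv_channel(x_c, w_c):
    # 'valid' 2-D cross-correlation of one input channel with one k x k kernel
    k = w_c.shape[0]
    h, w = x_c.shape
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x_c[i:i + k, j:j + k] * w_c)
    return out

rng = np.random.default_rng(0)
c, d, k = 3, 4, 3                         # input channels (experts), output channels, kernel
x = rng.normal(size=(8, 8, c))            # H x W x C input
W = rng.normal(size=(k, k, c, d))         # convolution kernel
g = np.maximum(rng.normal(size=c), 0.0)   # sparse non-negative gate values

# Eq. (3): gate the input channels, then convolve
y_gated_input = np.stack(
    [sum(conv_channel(g[i] * x[:, :, i], W[:, :, i, j]) for i in range(c))
     for j in range(d)], axis=-1)

# Eq. (4)/(5): convolve each channel "expert" first, then take the gated mixture
y_moe = np.stack(
    [sum(g[i] * conv_channel(x[:, :, i], W[:, :, i, j]) for i in range(c))
     for j in range(d)], axis=-1)

assert np.allclose(y_gated_input, y_moe)
```

The equality follows from the linearity of convolution, which is exactly the rearrangement performed between Eqs. (3) and (4).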
3.3 DeepMoE Architecture
DeepMoE is composed of three components: a base convolutional network, a shallow embedding network, and a multi-headed sparse gating network.
The base convolutional network is a deep network where each convolution layer is replaced with an MoE convolution layer as described in the previous section. In our experiments we use ResNet [11] and VGG [23] as the base convolutional networks.
The shallow embedding network
maps the raw input image to latent mixture weights that are fed into the multi-headed sparse gating network. To reduce the computational overhead of the embedding network, we use a 4-layer (for CIFAR) or 5-layer (for ImageNet) convolutional network with 3×3 filters of stride 2 (roughly 2% of the computation of the base models).
The multi-headed sparse gating network transforms the latent mixture weights produced by the shallow embedding network into sparse mixture weights for each layer in the convolutional network. The gate for layer l is defined as:

g^l(x) = \mathrm{ReLU}(W_g^l e),    (6)

where e = M(x) is the output of the shared embedding network M, and W_g^l are the learned parameters which, through the ReLU operation, project the latent mixture weights into sparse layer-specific gates.
We refer to this gating design as on-demand gating. The number of experts chosen at each layer is data-dependent, and the expert selection across different layers can be optimized jointly. Unlike the “noisy Top-K” design in [22], it is not necessary to fix the number of experts at each layer; indeed, each layer can learn to use a different number of experts.
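The on-demand gating of Eq. (6) can be sketched as below; the embedding dimension and per-layer widths are illustrative, and the random parameters stand in for the learned W_g^l:

```python
import numpy as np

rng = np.random.default_rng(1)
embed_dim = 16
widths = [8, 16, 32]                       # channels per MoE layer (illustrative)
e = rng.normal(size=embed_dim)             # latent mixture weights e = M(x)
heads = [rng.normal(size=(c, embed_dim)) for c in widths]  # per-layer W_g^l

# Eq. (6): one sparse, non-negative gate vector per layer
gates = [np.maximum(W_l @ e, 0.0) for W_l in heads]

# On-demand gating: each layer ends up with a data-dependent number of
# active experts, rather than a fixed top-k chosen in advance.
active = [int((g_l > 0).sum()) for g_l in gates]
```

Because ReLU zeroes out negative pre-activations, sparsity emerges from training (driven by the L1 regularization described in Sec. 3.4) rather than from an explicit top-k selection.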
3.4 DeepMoE Training
As with standard convolutional neural networks, DeepMoE models can be trained end-to-end using gradient-based methods. The overall goals of DeepMoE are threefold: (1) achieve high prediction accuracy, (2) lower computational cost, and (3) keep the network highly expressive. Thus, DeepMoE must learn a gating policy that selects a diverse, low-cost mixture of experts for each input. To this end, given the input x and the target y, we define the learning objective as:

L(x, y) = L_b(x, y) + \lambda L_g(x) + L_e(x, y).    (7)

L_b(x, y) is the cross-entropy loss for the base convolutional model, which encourages high prediction accuracy. The term

L_g(x) = \sum_{l=1}^{D} \lVert g^l(x) \rVert_1    (8)

is used to control the computational cost (via the parameter \lambda) by encouraging sparsity in the gating network. Finally, we introduce an additional embedding classification loss L_e(x, y), the cross-entropy classification loss on the embedding network's auxiliary output. This encourages the embedding, or some transformation of the embedding, to be predictive of the class label, preventing the gating networks from converging to an imbalanced utilization of experts [22]. The intuition behind this loss construction is that examples from the same class should have similar embeddings, and thus similar subsequent gate decisions, while examples from different classes should have divergent embeddings, which in turn discourages the network from overusing a particular subset of channels.
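Putting the three terms together, the training objective can be sketched as follows. The symbols (base loss, gate L1 term from Eq. (8), embedding loss) follow the description above, but the weight `lam` and the unit weight on the embedding loss are illustrative assumptions, not the paper's tuned values:

```python
import numpy as np

def cross_entropy(logits, label):
    # numerically stable cross-entropy for a single example
    z = logits - logits.max()
    return float(np.log(np.exp(z).sum()) - z[label])

def deepmoe_loss(base_logits, embed_logits, gates, label, lam=0.01):
    l_base = cross_entropy(base_logits, label)            # accuracy of the base model
    l_gate = sum(float(np.abs(g).sum()) for g in gates)   # Eq. (8): L1 over all layer gates
    l_embed = cross_entropy(embed_logits, label)          # auxiliary embedding classifier
    return l_base + lam * l_gate + l_embed                # Eq. (7)
```

Raising `lam` trades accuracy for sparser gates and hence lower inference-time computation.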
4 Expressive Power
The expressive power of a deep neural network depends on both the width and the depth of the network. Intuitively, the wider the network, the more expressive power it has. Cohen et al. [5] prove that the expressive power of a deep neural network increases super-exponentially with the network depth, relative to the network width. In this section, we show that due to its dynamic execution and multi-layer stacking design, DeepMoE preserves the expressive power of a standard unsparsified neural network at reduced runtime computational cost.
We define the expressive power of a convolutional neural network as its ability to construct labelings that differentiate input values. Following Cohen et al. [5], we view a neural network as a mapping from a particular example to a cost function (e.g., negative log probability) over labels. The mapping can be represented by a tensor \mathcal{A}^y operating on a combination of representation functions.

More concretely, the rank of \mathcal{A}^y, which scales as c^{\Omega(D)} with measure 1 over the space of all possible network parameters (where c is the number of channels of a convolutional layer, a.k.a. the network width, and D is the network depth), is a measure of the expressive power of a neural network, as established in [5]. In static channel pruning, if only c' < c channels are kept, the expressive power of the pruned network becomes c'^{\Omega(D)}, which is a strict subspace of the original since c' < c.

What makes our DeepMoE prevail is that the sparsity pattern of our mapping depends on the data. We prove in Appendix A.1 that DeepMoE retains an expressive power of c^{\Omega(D)} with probability 1 when stacking multiple MoE layers, indicating that DeepMoE preserves the expressive power of the unsparsified network.
Motivated by this theoretical analysis, we propose two variants of DeepMoE: wideDeepMoE and narrowDeepMoE. In the former, we first increase the number of channels in the convolutional network to increase its expressive power and then replace the widened layers with MoE layers. By controlling the number of channels selected at runtime, we can improve prediction accuracy at the same computational cost as the unwidened network. This design could be applied to real-world deployment on new hardware architectures supporting dynamic routing, e.g., TPUs, where we can place a wide network on the device and execute only part of it at runtime, instead of placing a static thin network with the same amount of computation. NarrowDeepMoE is closer to the dynamic channel pruning setting in [18] and comparable to traditional static channel pruning.
5 Experiments
In this section, we first evaluate the performance of both wideDeepMoE and narrowDeepMoE on image classification (Sec. 5.1 and 5.2) and semantic segmentation (Sec. 5.4). We observe that DeepMoEs can achieve a lower prediction error rate at reduced computational cost. We also analyze the behavior of our gating network, DeepMoE's regularization effect, and other strategies for widening the network in Sec. 5.3.
Datasets. For the image classification task, we use the CIFAR10 [16], CIFAR100 [16], and ImageNet 2012 [21] datasets. For the semantic segmentation task, we use the CityScapes [8] dataset, which provides pixel-level annotations for images with a resolution of 2048×1024. We apply standard data augmentation using basic mirroring and shifting [24] for the CIFAR datasets, and scale and aspect-ratio augmentation with color perturbation [29] for ImageNet. We follow [30] in using random cropping with basic mirroring and shifting for the CityScapes dataset.
Models. We examine DeepMoE with VGG [23] and ResNet [11] designs as the base convolutional network (a.k.a. the backbone network). VGG is a typical feed-forward network without skip connections or feature aggregation, while ResNet, composed of many residual blocks, has more complicated connectivity. To construct DeepMoE, we add a gating head after each convolutional layer in VGG and modify the residual blocks in ResNet (Fig. 2).
In wideDeepMoE, we increase the number of channels in each convolutional layer by a factor of two unless stated otherwise. In narrowDeepMoE, we retain the same channel configuration as the original base convolutional model.
Training. To train DeepMoE we follow common training practices [11, 31]. For the CIFAR datasets, we start training with learning rate 0.1 for ResNet and 0.01 for VGG16, decayed at epochs 150 and 250; the baselines are trained for 350 epochs in total, while DeepMoE is trained with 270 epochs of joint optimization followed by another 80 epochs of fine-tuning with fixed gating networks.
For ImageNet, we train the network with an initial learning rate of 0.1 for 100 epochs, reducing it by a factor of 10 every 30 epochs. We do not further fine-tune the base convolutional network on ImageNet, as we find the improvement from fine-tuning is marginal compared to that on the CIFAR datasets.
We set the computational cost parameter in the DeepMoE loss function (Eq. 7) between 0.001 and 8 (larger values reduce computation); for the CIFAR datasets, the weight of the embedding classification loss is chosen to match the scale of the cross-entropy loss on the base model, while for ImageNet it is chosen to improve base model feature extraction. The training schedule for semantic segmentation is detailed in Sec. 5.4.

5.1 WideDeepMoE
In this section, we evaluate the performance of wideDeepMoE as well as its memory usage.
5.1.1 Improved Accuracy with Reduced Computation
To conduct the evaluation, we first increase the number of channels in the residual networks by a factor of 2 and then control the sparsification so that, on average, half of the convolutional channels are selected at inference time. Through our evaluations we find that wideDeepMoE has a lower prediction error rate than the standard ResNets on ImageNet (Tab. 1), CIFAR10, and CIFAR100 (Tab. 2).
We evaluate ResNet56 and ResNet110 on the CIFAR10 and CIFAR100 datasets, and ResNet18, ResNet34, and ResNet50 on ImageNet, using the basic block (Fig. 1(a)) for ResNet18 and 34 and BottleneckA (Fig. 1(b)) for ResNet50. The more memory-efficient BottleneckB (Fig. 1(c)) is also adopted on ImageNet.
As shown in Tab. 1, wideDeepMoE reduces the error rate on the ImageNet benchmark without increasing the computational cost (measured in FLOPs) for networks of different depths. In particular, wideDeepMoE with ResNet18 and ResNet34 reduces Top-1 error on ImageNet by 1%, where previous work [10] failed to show any improvement. Similar results can be observed on the CIFAR datasets in Tab. 2, where wideDeepMoE improves the prediction accuracy of the baseline ResNet by 3-4% on CIFAR100 and 0.5% on CIFAR10.
5.1.2 Memory Usage
Another aspect to consider is DeepMoE's memory footprint (proportional to the number of parameters). We examine the memory usage of wideDeepMoE with widened ResNet50 as the backbone network and compare it to the standard ResNet101, which has similar prediction accuracy. We find that wideDeepMoE using BottleneckA (Fig. 1(b)) achieves a 22.88% Top-1 error rate; compared to ResNet101's 22.63%, this is only 0.25% worse in error while requiring 20% less computation. Moreover, wideDeepMoE using the more memory-efficient BottleneckB (Fig. 1(c)) achieves 22.84% Top-1 error with 6% fewer parameters and 18% fewer FLOPs than the standard ResNet101, indicating that wideDeepMoE is competitive in memory usage.
5.2 NarrowDeepMoE
In this section we compare DeepMoE to current static and dynamic channel pruning techniques. We show that DeepMoE is able to outperform both dynamic and static channel pruning techniques in prediction accuracy while maintaining or reducing computational costs.
5.2.1 NarrowDeepMoE vs Dynamic Channel Pruning
DeepMoE generalizes existing channel pruning work since it both dynamically prunes and rescales channels to reduce the computational cost and improve accuracy. In previous dynamic channel pruning work [18], channels are pruned based on the outputs of previous layers. In contrast, the gate decisions in DeepMoEs are determined in advance based on the shared embedding (latent mixture weights) which enables improved batch parallelism at inference.
We compare DeepMoE to the recent dynamic channel pruning work RNP [18], with VGG16 as the base model on CIFAR100 (our baseline accuracy is higher than that reported for RNP since we use a VGG16 variant with batch normalization). As Fig. 2(a) shows, without fine-tuning, the prediction error versus computation trade-off curve of DeepMoE (dotted blue line) is much flatter than that of RNP (dotted red line), indicating that DeepMoE achieves a greater reduction in computation without loss of accuracy. Moreover, when fine-tuning DeepMoE for only 10 epochs (dotted green line in Fig. 2(a)), DeepMoE improves the prediction accuracy by a large margin of 4%, a 1-3% improvement over the baseline VGG model, due to the regularization effect of DeepMoE (Sec. 5.3.2).

5.2.2 NarrowDeepMoE vs Static Channel Pruning
Similarly, DeepMoE outperforms the state-of-the-art static channel pruning results [19, 11, 17, 13] on both ImageNet (Tab. 3) and CIFAR10 (Fig. 2(b) and 2(c)). DeepMoE with ResNet50 removes 56.8% of the computation of the standard ResNet50 while achieving a top-1 error rate approximately 2% better than that of He et al. [12], currently the best accuracy for an equivalent amount of computation among previous work on ImageNet. Fig. 2(b) and 2(c) show that DeepMoE achieves higher accuracy with less computation than current techniques.
5.3 Analysis
In this section, we first analyze the effectiveness of the gating behavior in generating embeddings that are predictive of the class label, and thus its ability to balance expert utilization. We then study the regularization effect of DeepMoE's sparsification of the channel outputs. Lastly, we explore the effects of widening particular combinations of layers in a network, as opposed to widening all convolutional layers as we do in DeepMoE.
5.3.1 Gating Behavior Analysis
To analyze the gating behavior of DeepMoE, we evaluate the trained DeepMoE with VGG16 as follows: for a given fine-grained class (e.g., dolphin), we replace the gate embedding of each input in that class with a randomly chosen gate embedding from other classes, either within the same coarse category (referred to as in-group shuffling) or from different categories (referred to as out-of-group shuffling).
In Fig. 4, we plot the test accuracy for the class dolphin, belonging to the coarse category aquatic mammals, with randomly selected gate embeddings (repeated 20 times per input) from 5 classes in the same coarse category (in red) and 5 classes from other coarse categories (in blue). Fig. 4 shows that the test accuracy with in-group embeddings is 20-60% higher than with out-of-group shuffling. In particular, when applying gate embeddings from the tulip class, the test accuracy drops to 1%, while the accuracy with in-group shuffling remains mostly above 50%. This indicates that the latent mixture weights are similar for semantically related image categories; since DeepMoE is never given this coarse class structure, these results are significant.
5.3.2 Regularization Effect of DeepMoE
Since DeepMoE sparsifies the channel outputs during training and testing, we study the regularization effect of such sparsification. We increase the number of channels of a modified ResNet18 with BottleneckB (Fig. 1(c)) by factors of 2 to 8 on CIFAR100. In Fig. 5, we plot the accuracy and computation in FLOPs of the baseline widened ResNet18 models (in blue) and wideDeepMoE (in orange).
Fig. 5 suggests that DeepMoE has a lower computational cost and higher accuracy than the baseline widened ResNet, and that the advantages of DeepMoE increase with the width of the base convolutional network. This indicates a potential regularization effect of the DeepMoE design.
5.3.3 DeepMoE vs SingleLayer MoE
So far in our experiments, we have widened the network by increasing the number of experts/channels in all convolutional layers. Here, we study other strategies for widening the network. We widen the VGG16 model at four different sets of layers: the top layer (W1High), the middle layer (W1Mid), the lower 4 layers (W4Low), and all 13 convolutional layers (W13All), as used in all other experiments (details in Sec. A.2).
Control          Model    Params   FLOPs (×)   Acc. (%)
Params           W1High   24.15M   3.51        71.96
                 W1Mid    24.16M   9.18        72.02
                 W4Low    24.18M   43.16       72.51
                 W13All   24.18M   8.15        73.91
Params & FLOPs   W1High   24.15M   2.98        73.28
                 W1Mid    24.16M   2.74        72.68
                 W4Low    24.18M   2.45        73.33
                 W13All   24.18M   2.29        73.39
As shown in Tab. 4, the prediction accuracy of W13All is strictly better than that of a single-layer MoE, even though they have the same number of parameters. Adding MoE to the bottom or top layers is more effective than adding it to the middle layer. Alternatively, if we control both the number of parameters and the computation in FLOPs, the accuracy differences between the strategies shrink, but W13All is still preferable to the other widening strategies.
Table 5: Per-class IoU scores, mIoU, and computational cost (FLOPs) on the CityScapes dataset.

Model            Road  Sidewalk  Building  Wall  Fence  Pole  Light  Sign  Vegetation  Terrain  Sky   Person  Rider  Car   Truck  Bus   Train  Motorcycle  Bicycle  mIoU  FLOPs (×)
DRNA50           96.9  77.4      90.3      35.8  42.8   59.0  66.8   74.5  91.6        57.0     93.4  78.7    55.3   92.1  43.2   59.5  36.2   52.0        75.2     67.3   703
wideDeepMoE50A   97.2  78.9      90.3      45.6  48.4   56.2  61.6   72.9  91.6        60.7     94.2  77.4    50.6   92.5  48.7   68.7  44.1   52.7        74.2     68.8   804
wideDRNA50       97.4  80.6      90.6      38.5  49.0   58.7  65.1   73.4  91.8        59.5     93.9  78.2    51.1   92.9  49.1   68.7  51.3   52.2        74.5     69.3  2173
wideDeepMoE50B   97.5  80.4      91.0      48.9  50.6   58.5  65.7   75.3  92.0        60.1     94.7  79.2    54.7   93.2  53.8   73.2  53.2   54.8        75.6     71.2  1738
5.4 Semantic Segmentation
Semantic image segmentation requires predictions for each pixel, instead of one label for the whole image as in classification. We evaluate DeepMoE on the segmentation task to assess its generalizability. Specifically, we apply DeepMoE to DRNA [30], which adopts a ResNet backbone, and evaluate on the popular CityScapes segmentation dataset [8]. We follow the same training procedure as Yu et al. [30] for a fair comparison: SGD with momentum 0.9 and crop size 832, with a starting learning rate of 5e-4 divided by 10 after 200 epochs. The intersection-over-union (IoU) scores and computational costs in FLOPs of DeepMoE are presented in Tab. 5.
The computational cost hyperparameter in the DeepMoE loss adjusts the trade-off between computational efficiency and prediction accuracy. Our efficient model wideDeepMoE50A beats the baseline by 1.5% mIoU with only a slight increase in FLOPs, while our accurate model wideDeepMoE50B outperforms the wide baseline by almost 2% mIoU with fewer FLOPs. These results indicate DeepMoE is effective for pixel-level prediction tasks such as semantic segmentation as well as image classification.
6 Conclusion
In this work we introduced a design for deep mixture of experts models that produces more accurate and computationally inexpensive models for computer vision applications. Our DeepMoE architecture leverages a shallow embedding network to construct latent mixture weights, which are then used by sparse multi-headed gating networks to select and reweight individual channels at each layer of a deep convolutional network. This design, in conjunction with a novel sparsifying and diversifying loss, enables joint differentiable training, addressing key limitations of existing mixture of experts approaches in deep learning. We provided a theoretical analysis of the expressive power of DeepMoE and proposed two design variants. Extensive experimental evaluation showed that DeepMoE can reduce computation while surpassing the accuracy of baseline convolutional networks, improving upon the residual network result on the challenging ImageNet benchmark by a full 1%. Our analysis also showed that the embedding and gating networks are able to recover coarse-grained class structure in the underlying problem. DeepMoE shows promising results on semantic segmentation and may prove useful for a range of other problems.
References
 [1] K. Ahmed, M. H. Baig, and L. Torresani. Network of experts for largescale image categorization. In European Conference on Computer Vision, pages 516–532. Springer, 2016.
 [2] E. Bengio, P.L. Bacon, J. Pineau, and D. Precup. Conditional computation in neural networks for faster models. arXiv preprint arXiv:1511.06297, 2015.
 [3] Y. Bengio, N. Léonard, and A. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
 [4] K. Cho and Y. Bengio. Exponentially increasing the capacitytocomputation ratio for conditional computation in deep learning. arXiv preprint arXiv:1406.7362, 2014.
 [5] N. Cohen, O. Sharir, and A. Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory, pages 698–728, 2016.
 [6] R. Collobert, S. Bengio, and Y. Bengio. A parallel mixture of svms for very large scale problems. In Advances in Neural Information Processing Systems, pages 633–640, 2002.

 [7] R. Collobert, Y. Bengio, and S. Bengio. Scaling large learning problems with hard parallel mixtures. International Journal of Pattern Recognition and Artificial Intelligence, 17(03):349–365, 2003.
 [8] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
 [9] D. Eigen, M. Ranzato, and I. Sutskever. Learning factored representations in a deep mixture of experts. International Conference on Learning Representations Workshop, 2014.
 [10] S. Gross, M. Ranzato, and A. Szlam. Hard mixtures of experts for large scale weakly supervised vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6865–6873, 2017.
 [11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [12] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. In International Conference on Computer Vision (ICCV), volume 2, page 6, 2017.
 [13] Z. Huang and N. Wang. Datadriven sparse structure selection for deep neural networks. arXiv preprint arXiv:1707.01213, 2017.
 [14] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991.
 [15] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural computation, 6(2):181–214, 1994.
 [16] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
 [17] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. International Conference on Learning Representations, 2017.
 [18] J. Lin, Y. Rao, J. Lu, and J. Zhou. Runtime neural pruning. In Advances in Neural Information Processing Systems, pages 2178–2188, 2017.
 [19] J.-H. Luo, J. Wu, and W. Lin. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5058–5066, 2017.
 [20] J. Ma, Z. Zhao, X. Yi, J. Chen, L. Hong, and E. H. Chi. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1930–1939. ACM, 2018.
 [21] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 [22] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. International Conference on Learning Representations, 2017.
 [23] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [24] L. Wan, M. Zeiler, S. Zhang, Y. Le Cun, and R. Fergus. Regularization of neural networks using dropconnect. In International Conference on Machine Learning, pages 1058–1066, 2013.
 [25] X. Wang, Y. Luo, D. Crankshaw, A. Tumanov, F. Yu, and J. E. Gonzalez. Idk cascades: Fast deep learning by learning not to overthink. Conference on Uncertainty in Artificial Intelligence (UAI), 2018.
 [26] X. Wang, J. Wu, D. Zhang, Y. Su, and W. Y. Wang. Learning to compose topic-aware mixture of experts for zero-shot video captioning. arXiv preprint arXiv:1811.02765, 2018.
 [27] X. Wang, F. Yu, Z.-Y. Dou, T. Darrell, and J. E. Gonzalez. SkipNet: Learning dynamic routing in convolutional networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 409–424, 2018.
 [28] Z. Wu, T. Nagarajan, A. Kumar, S. Rennie, L. S. Davis, K. Grauman, and R. Feris. Blockdrop: Dynamic inference paths in residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8817–8826, 2018.
 [29] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017.
 [30] F. Yu, V. Koltun, and T. Funkhouser. Dilated residual networks. In Computer Vision and Pattern Recognition, volume 1, 2017.
 [31] F. Yu, D. Wang, and T. Darrell. Deep layer aggregation. Proceedings of the IEEE conference on computer vision and pattern recognition, 2018.
Appendix
A.1 Expressive Power of DeepMoE
To characterize the expressive power of DeepMoE, we follow the tensor analysis approach of Cohen et al. [5]. We first represent an instance of data as a collection of vectors $X = (x_1, \ldots, x_N)$, where $x_i \in \mathbb{R}^s$. For image data, the collection corresponds to vector arrangements of possibly overlapping patches around pixels. We represent different features in the data using (positive) representation functions:
$f_{\theta_d} : \mathbb{R}^s \to \mathbb{R}_{>0}$ (9)
so that the convolution operations over data become multiplications over representation functions. For the representation functions, index $d \in [M]$, where $M$ is the number of different features in the data that we wish to distinguish, which can be combinatorially large with respect to the number of pixels.
For classification tasks, we view a neural network as a mapping from a particular instance to a cost function (e.g., the log probability) over labels for that instance. With the new representation of data instances following Eq. (9), the mapping can be represented by a tensor $\mathcal{A}$ operating on the combination of the representation functions:
$h(X) = \sum_{d_1, \ldots, d_N = 1}^{M} \mathcal{A}_{d_1, \ldots, d_N} \prod_{i=1}^{N} f_{\theta_{d_i}}(x_i)$ (10)
To be able to distinguish data instances $X$ from $X'$, we need $h(X) - h(X')$ to be nonzero. For a fixed mapping $\mathcal{A}$, this requirement is equivalent to:
$\mathcal{A}\big(\Phi(X) - \Phi(X')\big) \neq 0$ for $X \neq X'$, where $\Phi(X)$ denotes the tensor of products of representation functions appearing in Eq. (10). It can directly be seen that the inequality is satisfied when the difference $\Phi(X) - \Phi(X')$ is not in the null space of $\mathcal{A}$. Therefore, the expressive power is equivalent to the rank of the tensor $\mathcal{A}$. This approach, taken by [5], establishes that for a certain type of network, the rank of $\mathcal{A}$ scales as $r^{2^{L-1}}$ with measure 1 over the space of all possible network parameters, where $r$ is the number of channels between network layers (width) and $L$ is the network depth.
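As a concrete illustration of Eq. (10), the score reduces to a bilinear form when there are only $N = 2$ patches: the tensor $\mathcal{A}$ becomes a matrix contracted against the two feature vectors. The sketch below is illustrative only; the variable names and the exponential choice of representation functions are our assumptions, not part of the paper.

```python
import numpy as np

# Minimal sketch of Eq. (10) for N = 2 patches:
#   h(X) = sum_{d1,d2} A[d1, d2] * f_{d1}(x1) * f_{d2}(x2),
# with M positive representation functions f_d evaluated on each patch.

rng = np.random.default_rng(0)
M, s = 4, 3                         # M features, patches of dimension s

# One common positive choice: f_d(x) = exp(theta_d . x)  (assumption).
theta = rng.normal(size=(M, s))
def represent(x):
    return np.exp(theta @ x)        # shape (M,), strictly positive

A = rng.normal(size=(M, M))         # coefficient tensor (a matrix when N = 2)
x1, x2 = rng.normal(size=s), rng.normal(size=s)

# Contract A with the rank-one tensor f(x1) (outer product) f(x2).
h = represent(x1) @ A @ represent(x2)
print(h)
```

Positivity of `represent` mirrors the requirement on the representation functions, which the argument below relies on when showing that certain differences cannot vanish.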
If we directly apply the theorem to a wider network (width $\hat{r} = kr$ for some $k > 1$), then the rank of $\mathcal{A}$ will scale as $(kr)^{2^{L-1}}$, which is $k^{2^{L-1}}$ times better. However, when the channels are gated with static sparse weights, the set of $\mathcal{A}$ with this restriction has measure 0 in the overall space of network parameters. In fact, if the number of nonzero weights over the channels is $r$, then the rank of $\mathcal{A}$ still scales as $r^{2^{L-1}}$.
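The rank restriction imposed by static sparse gating can be seen in the simplest linear case: composing two maps through a fixed diagonal gate with only $r$ nonzero entries caps the rank of the composition at $r$, no matter how wide the gated layer is. The sketch below uses plain linear layers as a stand-in for convolutions; all names are illustrative.

```python
import numpy as np

# Static sparse gating between two linear layers: if only r of the kr
# channels carry nonzero gate weights, rank(W2 @ diag(g) @ W1) <= r,
# regardless of the widened layer's width kr.

rng = np.random.default_rng(1)
r, k = 4, 8
width = k * r                        # widened layer: kr channels

W1 = rng.normal(size=(width, 16))
W2 = rng.normal(size=(16, width))

g = np.zeros(width)                  # static gate: r nonzero weights
g[rng.choice(width, size=r, replace=False)] = rng.normal(size=r)

composed = W2 @ np.diag(g) @ W1
print(np.linalg.matrix_rank(composed))   # bounded by r
```

Generically the rank is exactly $r$, matching the claim that static sparsity forfeits the $k^{2^{L-1}}$-fold gain of the wider network.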
What makes our DeepMoE prevail is that (the sparsity pattern of) our mapping depends on the data: writing $g(X)$ for the gating decisions on input $X$, the induced mapping is $h(X) = \mathcal{A}_{g(X)}(\Phi(X))$. We hereby compare an $L$-layer DeepMoE with width equal to $kr$ and number of nonzero weights over the channels equal to $r$ against an $L$-layer fixed, non-sparse neural network with width equal to $r$. For the latter, we know that it will be able to distinguish between features in a subspace of dimension $r^{2^{L-1}}$. For the former, if $h(X) \neq h(X')$ for the same choices of features (in the $r^{2^{L-1}}$-dimensional subspace), then we know that it will have expressive power of at least $r^{2^{L-1}}$:
$h(X) - h(X') = \mathcal{A}_{g(X)}\big(\Phi(X)\big) - \mathcal{A}_{g(X')}\big(\Phi(X')\big)$ (11)
$= \big[\mathcal{A}_{g(X)}\big(\Phi(X)\big) - \mathcal{A}_{g(X)}\big(\Phi(X')\big)\big]$ (12)
$\quad + \big[\mathcal{A}_{g(X)}\big(\Phi(X')\big) - \mathcal{A}_{g(X')}\big(\Phi(X')\big)\big]$ (13)
Since the gating network is independent from the convolutional neural network, the event that Line (12) is exactly equal to the negative of Line (13), when they are both nonzero, has zero measure over the space of network parameters (even with the sparsity constraint). We simply need to focus on the cases where Line (12) is zero for the pair $X$ and $X'$, and discuss whether Line (13) is also zero. In those cases, we assume that the sparsity pattern of the weights over the gated channels is i.i.d. with respect to each channel. With this assumption, the probability $p_0$ of choosing exactly the same channels for different data, $g(X) = g(X')$, is vanishingly small. When they are not equal, the difference in Line (13) can be represented as a combination of linearly independent basis functions, and positivity of the representation functions ensures that Line (13) is not zero with probability 1. Therefore, $h(X) \neq h(X')$ holds with probability $1 - p_0$. In other words, there is a probability $1 - p_0$ that the expressive power of our DeepMoE equals or exceeds $r^{2^{L-1}}$.
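The collision probability in this argument, the chance that two different inputs are routed through exactly the same channels, can be estimated empirically. The Monte Carlo sketch below models the gate as a uniform choice of $r$ out of $kr$ channels, which is an illustrative stand-in for the learned gating network, not the paper's model.

```python
import numpy as np

# Monte Carlo estimate of the probability that two independent sparse gate
# patterns (r active out of kr channels) coincide exactly. Uniform subset
# selection is an assumed simplification of the learned gates.

rng = np.random.default_rng(2)
r, trials = 4, 20_000

est = {}
for k in (2, 4, 8):
    width = k * r
    hits = 0
    for _ in range(trials):
        a = frozenset(rng.choice(width, size=r, replace=False))
        b = frozenset(rng.choice(width, size=r, replace=False))
        hits += (a == b)
    est[width] = hits / trials
    print(width, est[width])
```

Under uniform selection the exact collision probability is $1/\binom{kr}{r}$ per layer, so widening the gated layer (larger $k$) makes collisions, and hence the failure mode of the argument, rapidly less likely.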
A.2 Network Configurations of Wide VGG
In Sec. 5.3.3, we conduct experiments to investigate different widening strategies. We use four strategies to widen the VGG-16 network, which contains 13 convolutional layers in total: W1-High widens the top layer only, W1-Mid widens the middle layer only, W4-Low widens the lowest 4 layers, and finally W13-All widens all 13 convolutional layers. The detailed configurations are listed in Tab. 6.
Table 6: Configurations of the widened VGG-16 networks (output channels per convolutional layer).
Layers  W1-High  W1-Mid  W4-Low  W13-All 

Conv1  64  64  512  128 
Conv2  64  64  512  128 
Max Pooling         
Conv3  128  128  615  256 
Conv4  128  128  615  256 
Max Pooling         
Conv5  256  2990  256  405 
Conv6  256  256  256  405 
Conv7  256  256  256  405 
Max Pooling         
Conv8  512  512  512  615 
Conv9  512  512  512  615 
Conv10  512  512  512  615 
Max Pooling         
Conv11  1536  512  512  615 
Conv12  512  512  512  615 
Conv13  512  512  512  615 
Max Pooling         
Softmax         
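The widening strategies in Tab. 6 can be compared by the convolutional parameter counts they imply. The sketch below (our own illustration, not from the paper) counts only the 3x3 kernel weights, ignoring biases, batch normalization, and the classifier head; the channel lists are read directly from the table's columns.

```python
# Compare convolutional parameter counts implied by the Tab. 6 configurations.
# Only 3x3 kernel weights are counted (no biases / BN / classifier head).

CONFIGS = {
    "W1-High": [64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 1536, 512, 512],
    "W1-Mid":  [64, 64, 128, 128, 2990, 256, 256, 512, 512, 512, 512, 512, 512],
    "W4-Low":  [512, 512, 615, 615, 256, 256, 256, 512, 512, 512, 512, 512, 512],
    "W13-All": [128, 128, 256, 256, 405, 405, 405, 615, 615, 615, 615, 615, 615],
}

def conv_params(widths, in_channels=3, kernel=3):
    """Total weight count for a chain of kernel x kernel conv layers."""
    total, prev = 0, in_channels
    for w in widths:
        total += kernel * kernel * prev * w
        prev = w
    return total

for name, widths in CONFIGS.items():
    print(f"{name}: {conv_params(widths) / 1e6:.1f}M conv parameters")
```

This makes it easy to check how closely the four strategies are matched in total capacity before comparing their accuracy.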