1 Introduction
Deep convolutional neural networks (CNNs) have achieved state-of-the-art performance on many tasks in computer vision (LeCun et al., 1990; Krizhevsky et al., 2012). One trend in designing neural networks with better performance has been increasing scale: bigger models with more capacity (more parameters) trained on large datasets (Mahajan et al., 2018; Huang et al., 2018; Real et al., 2019). However, large models can be hard to use in practice because of inference latency constraints. In this work, we take inspiration from conditional computation to build networks that have a large number of parameters but retain good performance characteristics at inference. In particular, we aim to reduce the computation cost of inference for single-image batches, which are used in many applications.

Most prior work on conditional computation proposes learning functions that route individual input examples through a subset of experts in a larger network (Eigen et al., 2013; Bengio et al., 2015; Shazeer et al., 2017). We term this hard routing, since a given example is evaluated by only a subset of experts while the other experts are ignored. In this case, the inference cost scales directly with the number of evaluated experts; model performance usually improves when more experts are evaluated. However, the hard-routing approach can be difficult to implement in practice, because learning good routing functions for discrete decisions is challenging. Previous approaches require reinforcement learning (Bengio et al., 2015), evolution-based methods (Fernando et al., 2017), or gradient descent with additional learning objectives for the routing function (Eigen et al., 2013; Shazeer et al., 2017; Teja Mullapudi et al., 2018).

In this work, we propose a soft conditional computation approach that enables easy optimization and utilization of all experts at low inference cost. Consider a mixture-of-experts model in which we wish to compute a linear combination of n experts, α_1 (W_1 * x) + ... + α_n (W_n * x), where * denotes convolution and the routing weights α_i are functions of the input learned through gradient descent. In a ConvNet, the above formulation requires n convolutions, which is expensive when n is large. The hard-routing approach truncates the computation to k convolutions, where k < n; however, hard routing, as mentioned above, is difficult to train. Our main observation is that we can reorder the computation as (α_1 W_1 + ... + α_n W_n) * x. This requires only a single convolution, which significantly reduces computation, since a weighted sum of kernels is much cheaper than a convolution. In other words, we softly combine the weights of all experts before performing the expensive computation of applying the combined expert to the input once. We call this approach soft conditional computation, in contrast to the traditional hard-routing approach. When applied to ConvNets, we name our method CondConv.
CondConv can be used as a drop-in replacement for existing convolutional layers to scale them with soft conditional computation. We demonstrate that replacing convolutional layers with CondConv layers is an effective way to increase model capacity in the MobileNetV1, MobileNetV2, and ResNet-50 architectures, and significantly improves model performance on ImageNet classification and COCO object detection across a spectrum of model sizes.
2 Related Work
Conditional computation. Conditional computation aims to increase model capacity without a proportional increase in computation cost by activating only a portion of the entire network for each example (Bengio et al., 2013; Davis & Arel, 2013; Cho & Bengio, 2014; Bengio et al., 2015). Hard-routing models have shown promising results in recent work, but the key challenges are how to route individual examples and how to train the network effectively. One approach is to use reinforcement learning to learn discrete routing functions (Rosenbaum et al., 2017; McGill & Perona, 2017; Liu & Deng, 2018). BlockDrop (Wu et al.) and SkipNet (Wang et al., 2018) use reinforcement learning to learn the subset of blocks needed to process a given input. Another approach is to use evolutionary methods to evolve pathways through a larger supernetwork (Fernando et al., 2017). Unlike these methods, CondConv can be trained end-to-end with gradient descent on the original loss.
HydraNets (Mullapudi et al., 2018) is a hard-routing approach for vision that can also be trained with gradient descent, but it requires an unsupervised clustering-based method to partition examples in order to perform well. Increasing the number of experts evaluated at inference time improves the performance of HydraNets, but results in increased inference cost. CondConv does not require auxiliary loss functions to learn the routing functions, and allows the use of all experts at a small inference cost.
Finally, Shazeer et al. (2017) propose the sparsely-gated mixture-of-experts layer, which enables hard routing to be trained with gradient descent and achieves significant success on large language modeling and machine translation tasks. Their proposed routing technique is noisy top-k gating, but this approach introduces noise and discontinuities into the loss function. Ramachandran & Le (2019) apply noisy top-k gating to vision tasks, and find that it is difficult to optimize with increasing routing depth and requires a careful choice of k for each layer to balance between weight-sharing and specialization of experts. In contrast, CondConv layers are easy and reliable to optimize with gradient descent, without special regularization. Moreover, CondConv layers automatically learn to trade off between specialization and weight-sharing at each layer. This comes at the cost of architectural diversity between experts, but the ease of optimization enables more complex and deeper routing schemes.
Weights per example. Ha et al. (2017) propose the use of a small network to generate weights for a larger network. For vision, these weights are not conditioned on the input, so the resulting models have a lower parameter count but higher computation cost and reduced performance. Attention-based methods scale previous-layer inputs based on learned attention weights. Hard attention (Luong et al., 2015) uses only a subset of the inputs and has parallels to hard conditional routing. Soft attention (Bahdanau et al., 2015; Vaswani et al., 2017) uses all parts of the inputs but upweights some, which is more in line with our proposed technique.
Example-dependent activation scaling. Finally, some recent work proposes to adapt the activations of neural networks conditionally on the input. Squeeze-and-Excitation networks (Hu et al., 2018) learn to scale the activations of every layer's output. GaterNet (Chen et al., 2018) proposes using a separate network to select a binary mask over the filters of a larger backbone network. Scaling activations has similar motivations to soft conditional computation, but is restricted to modulating activations in the base network.
Continuous neural decision graphs. In recent work, SplineNets (Keskin & Izadi, 2018) propose the use of B-splines to model learnable weights for continuous neural decision graphs. Compared to SplineNets, CondConv explores more complex weight-generating functions.
3 Soft Conditional Computation with Parameter Routing
In a regular convolutional layer, the same convolutional kernel is used for all input examples. In a CondConv layer, the convolutional kernel is computed as a function of the input example (Fig. 1(a)). Specifically, we parameterize the CondConv layer as:

Output(x) = σ((α_1 W_1 + ... + α_n W_n) * x),

where each α_i = r_i(x) is an example-dependent routing weight computed using a routing function with learned parameters, n is the number of experts, and σ represents the non-linear normalization and activation functions.
We observe that the CondConv layer is equivalent to the more expensive mixture-of-experts formulation, where each expert corresponds to a convolution (Fig. 1(b)):

Output(x) = σ(α_1 (W_1 * x) + ... + α_n (W_n * x)).

Hence, a CondConv layer can be viewed equivalently as a weighted sum of the outputs of multiple regular convolutions, or as a single convolution whose kernel is input-dependent.
In a CondConv layer, the cost of the convolution operation is independent of the number of experts, since the size of the combined kernel is fixed. Thus, adding an expert increases the inference cost by approximately one multiply-add per additional parameter (the cost of the kernel combination). In contrast, a regular convolution requires many multiply-adds for each additional kernel parameter, since every parameter is applied at every output position. Thus, CondConv enables us to efficiently scale capacity with respect to inference cost.
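This cost asymmetry can be made concrete with a back-of-envelope model; the layer sizes below are hypothetical and chosen only for illustration.

```python
# Back-of-envelope cost model for one convolutional layer (hypothetical sizes).
def conv_madds(h, w, k, c_in, c_out):
    # Regular convolution: every kernel parameter is applied
    # at every one of the h*w output positions.
    return h * w * k * k * c_in * c_out

def condconv_extra_madds(k, c_in, c_out, n_experts):
    # Combining n expert kernels costs ~1 multiply-add per expert parameter,
    # independent of the spatial resolution.
    return n_experts * k * k * c_in * c_out

h = w = 14
k = 3
c_in = c_out = 256
base = conv_madds(h, w, k, c_in, c_out)
extra = condconv_extra_madds(k, c_in, c_out, n_experts=8)

# Eight experts multiply this layer's parameter count by 8x,
# but add only ~4% to its multiply-adds (8 / (14*14)).
print(extra / base)
```

The ratio depends only on the number of experts and the spatial resolution, which is why CondConv layers are cheapest to add at layers with large spatial extent relative to expert count.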
To compute the example-dependent routing weights α_i, we perform global average pooling on the input, followed by a fully-connected layer, followed by the Sigmoid activation function: r(x) = Sigmoid(GlobalAveragePool(x) R), where R is a learned routing matrix. The global average pooling of the input provides global context to the routing function. Our parameterization enables CondConv to recover hard-routing models with the same architecture: a hard-routing model with k active experts corresponds to soft parameter routing with routing weights of zero for all but k experts. However, our model is able to take advantage of all experts for every example, making it more expressive.
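A minimal NumPy sketch of this routing function for a single example; the tensor shapes and variable names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def routing_weights(x, R):
    """x: (h, w, c_in) input feature map; R: (c_in, n) learned routing matrix."""
    context = x.mean(axis=(0, 1))   # global average pooling -> (c_in,)
    return sigmoid(context @ R)     # (n,) routing weights, each in (0, 1)

rng = np.random.default_rng(0)
x = rng.standard_normal((14, 14, 64))   # example feature map
R = rng.standard_normal((64, 8))        # routing matrix for n = 8 experts
alpha = routing_weights(x, R)
assert alpha.shape == (8,)
```

Because each weight is an independent Sigmoid output rather than a Softmax entry, any number of experts can be strongly activated for a given example.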
The design of the CondConv layer allows it to be used in place of any regular convolutional layer in a network. Although we illustrate the approach with ordinary convolutions, it extends easily to other linear functions, such as those in depthwise convolutions and fully-connected layers.
4 Experiments
We evaluate CondConv by scaling up the MobileNetV1 (Howard et al., 2017), MobileNetV2 (Sandler et al., 2018), and ResNet-50 (He et al., 2016) architectures on image classification and object detection tasks.
As an implementation note: to train models with CondConv layers naively, we would need to perform convolutions with a batch size of one, since each example has a different convolutional kernel. However, current accelerators are optimized for large batch sizes. Thus, with small numbers of experts, we found it more efficient to train CondConv layers with the mixture-of-experts formulation (Fig. 1(b)), which supports large convolutional batch sizes, and then use the efficient CondConv formulation (Fig. 1(a)) at inference.
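The two formulations used at training and inference time can be illustrated with 1x1 convolutions, which reduce to matrix products; all shapes below are toy values, not those of our models.

```python
import numpy as np

rng = np.random.default_rng(0)
b, hw, c_in, c_out, n = 4, 49, 8, 16, 3
x = rng.standard_normal((b, hw, c_in))      # batch of flattened feature maps
W = rng.standard_normal((n, c_in, c_out))   # expert kernels (1x1 convolutions)
alpha = rng.random((b, n))                  # per-example routing weights

# Training-friendly mixture form: run every expert on the whole batch,
# then take a per-example weighted sum of the n branch outputs.
branch = np.einsum("nio,bhi->bnho", W, x)            # (b, n, hw, c_out)
moe_out = np.einsum("bn,bnho->bho", alpha, branch)

# Inference form: build one combined kernel per example, apply it once.
kernels = np.einsum("bn,nio->bio", alpha, W)         # (b, c_in, c_out)
cc_out = np.einsum("bio,bhi->bho", kernels, x)

assert np.allclose(moe_out, cc_out)
```

The mixture form keeps every convolution at full batch size (good for accelerators during training), while the combined-kernel form avoids the n branch evaluations at inference.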
Table 1: ImageNet validation accuracy and inference cost when scaling MobileNetV1 (0.5x) with CondConv.

Model | Num Experts | Params (M) | Inference MADDs (M) | Train CE | Valid Top-1 (%)
CC-MobileNetV1 (0.5x) | 1 | 1.3 | 151 | 1.80 | 63.6
CC-MobileNetV1 (0.5x) | 2 | 2.6 | 152 | 1.61 | 66.0
CC-MobileNetV1 (0.5x) | 4 | 5.2 | 154 | 1.41 | 67.6
CC-MobileNetV1 (0.5x) | 8 | 10.4 | 160 | 1.20 | 68.4
CC-MobileNetV1 (0.5x) | 16 | 20.7 | 170 | 0.91 | 69.9
CC-MobileNetV1 (0.5x) | 32 | 41.3 | 190 | 0.67 | 71.6
MobileNetV1 (0.5x)^1 | - | 1.3 | 150 | 1.83 | 63.8
MobileNetV1 (1.0x)^1 | - | 4.2 | 569 | 1.20 | 71.9

^1 Our implementation. Our training hyperparameters improve upon the reported top-1 accuracy of 70.6 for MobileNetV1 (1.0x) and 63.7 for MobileNetV1 (0.5x) in Howard et al. (2017).
4.1 ImageNet Classification
We evaluate our approach on the image classification task using the ImageNet 2012 classification dataset (Russakovsky et al., 2015), which consists of 1.28 million training images and 50K validation images from 1000 classes. We train all models on the entire training set and compare single-crop top-1 validation accuracy. We use the same training hyperparameters for all models on ImageNet, following Kornblith et al. (2018), except that we use a BatchNorm momentum of 0.9 and disable the exponential moving average on the weights.
We introduce two additional regularization techniques for models with large capacity. First, we use Dropout (Srivastava et al., 2014) on the input to the fully-connected layer. Second, we add data augmentation using the AutoAugment (Cubuk et al., 2018) ImageNet policy and Mixup (Zhang et al., 2017). We search over these regularization techniques for all models for fair comparison.
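Mixup forms convex combinations of pairs of examples and their labels; a minimal sketch follows (the Beta parameter value shown is arbitrary for illustration, and `mixup_batch` is a hypothetical helper name).

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha, rng):
    """Mixup (Zhang et al., 2017): mix each example with a random partner.
    alpha parameterizes the Beta distribution for the mixing coefficient."""
    lam = rng.beta(alpha, alpha)            # mixing coefficient in [0, 1]
    perm = rng.permutation(len(x))          # random partner for each example
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 32, 32, 3))          # toy image batch
y = np.eye(10)[rng.integers(0, 10, size=8)]      # one-hot labels
x_mix, y_mix = mixup_batch(x, y, alpha=0.2, rng=rng)
```

The mixed labels remain valid probability distributions, so the usual cross-entropy loss applies unchanged.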
4.1.1 MobileNetV1
We evaluate our approach against the base MobileNetV1 architecture at many different computational costs. Concretely, we replace the depthwise and pointwise convolutional layers in the final 8 separable convolutional blocks with CondConv layers. Within each separable convolutional block, we use a single routing function, computed from the block's input, to produce routing weights for both the depthwise and pointwise CondConv layers. We also replace the final classification layer with a 1x1 CondConv block. We analyze these architectural choices in Section 4.3. We refer to our models as CC-MobileNetV1.
We evaluate our CC-MobileNetV1 models and the base MobileNetV1 model across four width multipliers {0.25, 0.50, 0.75, 1.0} in Figure 2. In a MobileNetV1 model, the width multiplier w scales the number of output channels at each layer by the fraction w; as a result, the computational cost and parameter count scale roughly proportional to w^2. For CC-MobileNetV1, at each width multiplier, we use a constant input size of 224x224 and vary the number of experts per layer by powers of 2: from 1 to 64 for the 0.25x model, and from 1 to 32 for the 0.50x, 0.75x, and 1.0x models. We compare CC-MobileNetV1 with baseline MobileNetV1 results from the publicly available Slim checkpoints (Silberman & Guadarrama, 2016). For the baseline MobileNetV1 models, at each width multiplier, the input size varies over {128x128, 160x160, 192x192, 224x224}. This gives a performance frontier for the accuracy/inference-cost trade-off of MobileNetV1, against which we compare the CC-MobileNetV1 models.
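The quadratic dependence of cost on the width multiplier can be seen from a single layer's parameter count; the channel sizes below are illustrative.

```python
# Width-multiplier scaling for one convolutional layer (illustrative sizes).
def conv_params(w_mult, k, c_in, c_out):
    # Both input and output channel counts are scaled by the multiplier,
    # so parameters (and multiply-adds) scale with its square.
    return k * k * round(w_mult * c_in) * round(w_mult * c_out)

full = conv_params(1.0, 3, 256, 256)
half = conv_params(0.5, 3, 256, 256)
print(half / full)  # 0.25: halving the width quarters the cost
```

This is why, at a fixed accuracy target, adding CondConv experts (roughly linear in parameters, nearly free in multiply-adds) can be a cheaper scaling axis than increasing width.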
Across all width multipliers, CondConv layers significantly improve ImageNet validation accuracy relative to inference cost compared to the MobileNetV1 base models. We provide a detailed breakdown of parameter count and inference cost for scaling the MobileNetV1 (0.5x) model in Table 1. Increasing the number of experts significantly improves the capacity of the model, which we measure using the training cross-entropy loss when training as in Sec. 4.1 with no additional regularization or data augmentation. With additional regularization, our models with more experts per layer also attain better generalization accuracy. We note that regularization with Dropout (Srivastava et al., 2014) and additional data augmentation actually reduces the accuracy of the baseline MobileNetV1 models.
4.1.2 MobileNetV2
We further evaluate our approach against the MobileNetV2 (Sandler et al., 2018) architecture, which improves accuracy with respect to inference cost over MobileNetV1 using a deeper architecture with inverted residual blocks. Each inverted residual block consists of 2 pointwise convolutional layers and 1 depthwise convolutional layer. In the last 6 inverted residual blocks, we replace all three layers with CondConv layers; we also replace the final 1x1 convolution and the classification layer with 1x1 CondConv layers. We compute the routing weights from the input to the inverted residual block and share them across all three layers within a block. We refer to our models as CC-MobileNetV2.
Even on top of the inverted bottleneck structure, which already adds parameters at low inference cost, our approach significantly improves the capacity and accuracy of the MobileNetV2 architecture with respect to inference cost. Table 2 shows the performance of our models relative to the base MobileNetV2 models. With additional regularization, our CC-MobileNetV2 (1.0x) model with 8 experts achieves performance comparable to the MobileNetV2 (1.4x) model at 56% of the inference cost. As with MobileNetV1, additional regularization with Dropout (Srivastava et al., 2014) and additional data augmentation reduces the accuracy of the baseline MobileNetV2 models.
Table 2: ImageNet performance of CC-MobileNetV2 relative to the baseline MobileNetV2 models. [N] denotes the number of experts per layer.

Model | Params (M) | MADDs (M) | Train CE | Valid Top-1 (%)
MV2 (0.5x)^2 | 2.0 | 97 | 1.9 | 62.9
CC-MV2 (0.5x) [8] | 15.5 | 113 | 1.1 | 68.4
MV2 (1.0x)^2 | 3.5 | 301 | 1.3 | 71.6
CC-MV2 (1.0x) [8] | 27.5 | 329 | 0.7 | 74.6
MV2 (1.4x)^2 | 6.1 | 583 | 1.0 | 74.5

^2 Our implementation. Kornblith et al. (2018) report 71.6 top-1 accuracy for MobileNetV2 (1.0x) and 74.7 for MobileNetV2 (1.4x) with similar hyperparameter settings. Sandler et al. (2018) use different hyperparameters and report 72.0 top-1 accuracy for MobileNetV2 (1.0x) and 74.7 for MobileNetV2 (1.4x). Silberman & Guadarrama (2016) use different hyperparameters and report 65.4 top-1 accuracy for MobileNetV2 (0.5x).
4.1.3 ResNet-50
Table 3: ImageNet performance of CC-ResNet-50. [N] denotes the number of experts per layer.

Model | Params (M) | MADDs (M) | Train CE | Valid Top-1 (%)
RN50^3 | 25.6 | 4093 | 0.50 | 77.7
CC-RN50 [2] | 42.6 | 4127 | 0.35 | 78.4
CC-RN50 [8] | 130.2 | 4213 | 0.19 | 77.7

^3 Our implementation. Our hyperparameter setting with additional data augmentation improves upon the top-1 accuracy of 76.4 for large-batch ResNet-50 training in Goyal et al. (2017).
Finally, we evaluate our approach on the ResNet-50 (He et al., 2016) architecture, a much larger architecture that uses ordinary 3x3 convolutions and residual bottleneck blocks. We replace the convolutional layers in the final three residual blocks with CondConv layers, as well as the final classification layer with a 1x1 CondConv layer. We refer to our model as CC-ResNet-50.
The CC-ResNet-50 model with 2 experts per layer achieves 0.7% higher ImageNet top-1 accuracy than the ResNet-50 baseline. Both the ResNet-50 baseline and our CC-ResNet-50 models are trained with additional data augmentation using Mixup (Zhang et al., 2017) and AutoAugment (Cubuk et al., 2018). With larger numbers of experts per layer, the CC-ResNet-50 models demonstrate even higher capacity, as measured by the training cross-entropy loss when training as in Sec. 4.1 with no additional regularization or data augmentation. We believe the generalization accuracy could be further improved with more complex regularization techniques and larger datasets.
4.2 COCO Object Detection
Table 4: COCO object detection with SSD300 using CC-MobileNetV1 feature extractors. [N] denotes the number of experts per layer.

Model | Params (M) | MADDs (M) | mAP
MV1 (0.25x)^4 | 0.7 | 119 | 9.3
CC-MV1 (0.25x) [8] | 3.1 | 122 | 12.1
MV1 (0.5x)^4 | 2.0 | 352 | 14.4
CC-MV1 (0.5x) [8] | 11.4 | 363 | 18.0
MV1 (0.75x)^4 | 4.1 | 730 | 18.2
CC-MV1 (0.75x) [8] | 25.0 | 755 | 21.0
MV1 (1.0x)^4 | 6.8 | 1230 | 20.3
CC-MV1 (1.0x) [8] | 43.8 | 1280 | 22.4

^4 Our implementation. Our hyperparameter setting improves upon the reported mAP of 19.3 for MobileNetV1 (1.0x) with SSD300 in Howard et al. (2017). Other width multipliers were not reported.
We further evaluate the effectiveness of CondConv for object detection with the Single Shot Detector (Liu et al., 2016) framework at 300x300 input image resolution (SSD300). We use our CC-MobileNetV1 models with width multipliers {0.25, 0.50, 0.75, 1.0} as feature extractors. We also replace the remaining convolutional feature-extractor layers in SSD with CondConv layers, which we term CC-SSD.
The COCO dataset (Lin et al., 2014) consists of 80,000 training and 40,000 validation images. We train on the combined COCO training and validation sets, excluding the 8,000 minival images on which we evaluate our networks. We train our models with a batch size of 1024 for 20,000 steps. For the learning rate, we use linear warmup from 0.3 to 0.9 over the first 1,000 steps, followed by cosine decay from 0.9. We use the data augmentation scheme proposed by Liu et al. (2016), and the same convolutional feature-layer dimensions, SSD hyperparameters, and training hyperparameters across all models.
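The learning-rate schedule above can be sketched as follows; decaying the cosine to zero by the final training step is our assumption for illustration, as the decay horizon is not stated.

```python
import math

TOTAL_STEPS = 20_000
WARMUP_STEPS = 1_000
LR_INIT, LR_PEAK = 0.3, 0.9

def learning_rate(step):
    """Linear warmup from 0.3 to 0.9 over 1,000 steps, then cosine decay
    from 0.9 (assumed to reach zero at the final step)."""
    if step < WARMUP_STEPS:
        return LR_INIT + (LR_PEAK - LR_INIT) * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * LR_PEAK * (1.0 + math.cos(math.pi * progress))

# Spot-check the endpoints of the schedule.
print(learning_rate(0), learning_rate(WARMUP_STEPS), learning_rate(TOTAL_STEPS))
```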
In Figure 4, we find that the learned features and additional capacity from CondConv layers with 8 experts significantly improve object detection results at a variety of model sizes. In particular, our CC-MobileNetV1 (0.75x) CC-SSD model exceeds the mAP of the MobileNetV1 (1.0x) SSD baseline by 0.7 mAP at 60% of the inference cost. Moreover, our CC-MobileNetV1 (1.0x) CC-SSD model improves upon the MobileNetV1 (1.0x) SSD baseline by 2.1 mAP at similar inference cost.
4.3 Ablation studies
Table 5: Ablation of applying CondConv to the pointwise vs. depthwise convolutions.

Layer | Params (M) | MADDs (M) | Train CE | Valid Top-1 (%)
CC-MV1_0.25_32 (Depth + Point) | 14.6 | 55.7 | 1.4 | 62.0
Pointwise | 14.3 | 55.3 | 1.5 | 62.3
Depthwise | 8.8 | 49.7 | 1.8 | 58.0
None (MV1_0.25) | 0.47 | 41.2 | 2.6 | 50.0
Table 6: Ablation of routing-function design.

Routing Fn | Params (M) | MADDs (M) | Train CE | Valid Top-1 (%)
CC-MV1_0.25_32 (Baseline) | 14.6 | 55.7 | 1.4 | 62.0
Single | 14.6 | 55.5 | 1.8 | 56.5
Partially-shared | 14.6 | 55.6 | 1.4 | 62.5
Hidden (small) | 14.6 | 55.6 | 1.7 | 57.7
Hidden (medium) | 14.8 | 55.9 | 1.4 | 62.2
Hidden (large) | 16.8 | 57.8 | 1.4 | 54.1
Hierarchical | 14.6 | 55.7 | 1.4 | 60.3
Softmax | 14.6 | 55.7 | 1.7 | 60.5
Table 7: Effect of introducing CondConv layers starting at different depths (the baseline starts at layer 7).

CondConv Begin | Params (M) | MADDs (M) | Train CE | Valid Top-1 (%)
CC-MV1_0.25_32 (7) | 14.6 | 55.7 | 1.4 | 62.0
1 | 14.9 | 56.3 | 1.4 | 62.5
3 | 14.9 | 56.2 | 1.4 | 62.1
5 | 14.8 | 56.0 | 1.4 | 62.0
13 | 11.6 | 52.5 | 1.6 | 59.5
15 (FC) | 8.42 | 49.3 | 2.0 | 54.2
16 (MV1_0.25) | 0.47 | 41.2 | 2.6 | 50.0
Table 8: Importance of CondConv in the final fully-connected (FC) layer vs. the feature extractor (FE).

Cond FC | Params (M) | MADDs (M) | Train CE | Valid Top-1 (%)
CC-MV1_0.25_32 (FC + FE) | 14.6 | 55.7 | 1.4 | 62.0
FE only | 6.65 | 47.6 | 1.6 | 60.2
FC only | 8.42 | 49.3 | 2.0 | 54.2
None (MV1_0.25) | 0.47 | 41.2 | 2.6 | 50.0
Figure 5: Mean routing weights in the final CondConv layer of our CC-MobileNetV1 (0.5x) model with 32 experts per layer. Error bars indicate one standard deviation. Some experts are class-specific, with high weights and low variance across all examples in the class. Other experts are example-specific within the class and show high variance.
We perform ablation experiments to highlight important architectural decisions in designing soft conditional computation models with the CondConv block. In all experiments, we compare against the same baseline, CC-MobileNetV1 (0.25x) with 32 experts per layer, trained as described in Sec. 4.1 with no additional regularization, which achieves 61.98% ImageNet top-1 validation accuracy with 55.7M multiply-adds. In this section, we refer to this baseline model as CC-MV1_0.25_32. It significantly outperforms the base MobileNetV1 (0.25x) architecture, which achieves 50.0% top-1 accuracy with 41.2M multiply-adds under the same training setup.^5 We choose this model for ablation because of the large effect of introducing CondConv layers, and because it does not require additional data augmentation to perform well.

^5 Our implementation. Howard et al. (2017) report a top-1 accuracy of 50.0% with different hyperparameters.
4.3.1 Per-channel routing
We explore learning different routing weights for each output channel of each expert, rather than one routing weight per expert. In theory, this increases the expressiveness of the model by allowing each channel to make a different routing decision.^6 The drawback of this approach is that the complexity of the routing function for each layer increases significantly, since it must output one weight per expert per output channel rather than one weight per expert. To keep the computation cost of the routing function small, we use a 2-layer routing function with a bottleneck hidden layer for per-channel routing. We replace the routing weights at every CondConv layer in the baseline CC-MobileNetV1 (0.25x) model, then vary the number of experts in powers of 2 from 1 to 64, trained as described in Sec. 4.1 with no additional regularization. We plot the results in Fig. 3.

^6 Note that with an expert count of one, per-filter routing is similar to Squeeze-and-Excitation (Hu et al., 2018), computed using input features rather than output features.
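The difference between the two routing granularities amounts to the shape of the routing output; a toy sketch with 1x1 kernels follows (shapes are hypothetical, and the bottleneck routing network itself is omitted).

```python
import numpy as np

rng = np.random.default_rng(0)
n, c_in, c_out = 4, 8, 16
W = rng.standard_normal((n, c_in, c_out))   # expert kernels (1x1 case)

# Per-expert routing: the routing function outputs n weights,
# one scalar per expert, applied uniformly to the whole kernel.
a_expert = rng.random(n)
k_expert = np.einsum("n,nio->io", a_expert, W)

# Per-channel routing: the routing function outputs n * c_out weights,
# one per expert per output channel.
a_channel = rng.random((n, c_out))
k_channel = np.einsum("no,nio->io", a_channel, W)

# Both produce a kernel of the same shape; per-channel routing just
# mixes the experts differently for each output channel.
assert k_expert.shape == k_channel.shape == (c_in, c_out)
```

The n * c_out routing outputs are what drive the cost of the larger routing network discussed above.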
At small expert counts, perchannel routing outperforms perexpert routing with the same number of experts. However, perchannel routing requires more computation. With large expert counts, perexpert routing outperforms perchannel routing, even on training crossentropy loss. We hypothesize that although perchannel routing models are more expressive, they are difficult to optimize with many experts.
4.3.2 Pointwise vs. Depthwise Convolutions
We study the effect of applying CondConv to the pointwise vs. the depthwise convolutions in Table 5. Compared to the depthwise convolutions, the pointwise convolutions account for most of the parameters and most of the inference cost of the network. Introducing CondConv on just the pointwise convolutions yields nearly as much capacity as applying it to both, and slightly higher generalization accuracy. Introducing CondConv on the depthwise convolutions only gives less capacity, but requires fewer parameters and has lower inference cost, while still significantly improving over the base MobileNetV1 (0.25x) accuracy of 50.6%.
4.3.3 Routing function
We investigate different choices of routing function in Table 6. We first investigate sharing the routing weights between layers. The baseline model computes new routing weights at each layer. Single computes the routing weights only once, at CondConv layer 7 (the 7th convolutional or separable convolutional layer), and uses the same routing weights in all subsequent layers; this significantly reduces the capacity and accuracy of the network. Partially-shared shares the routing weights between every other layer, which we find slightly improves performance. We hypothesize that sharing the routing function among nearby layers can improve the learned routing weights.
We then experiment with more complex routing functions by introducing a hidden layer with ReLU activation after the global average pooling step, varying the size of the hidden layer across the Hidden (small), Hidden (medium), and Hidden (large) variants. We find that adding a nonlinear hidden layer can slightly improve capacity; however, very large hidden layers exhibit overfitting, while hidden layers that are too small reduce the capacity of the network.

Next, we experiment with Hierarchical routing functions, which take the routing weights of the previous layer as additional inputs to the routing function. This increases network capacity but reduces generalization accuracy.
Finally, we experiment with using the Softmax activation function to compute routing weights, rather than Sigmoid. Using the Sigmoid activation function in the routing function improves both the capacity and the generalization accuracy of the network. In Section 5, we find that multiple experts are often useful for a single example: the Sigmoid activation allows many experts to be used simultaneously, while the Softmax activation forces the experts to compete. With the Sigmoid routing function, CondConv lets the model choose the number of active experts for each example.
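The contrast between the two activations can be seen on toy routing logits: Sigmoid scores each expert independently, while Softmax normalizes the weights to sum to one and so suppresses all but the strongest expert.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

# Toy logits where two experts are strongly favored for this example.
logits = np.array([4.0, 3.5, -2.0, -3.0])

sig = sigmoid(logits)   # both favored experts get weights near 1
soft = softmax(logits)  # weights sum to 1, so they compete

print(np.round(sig, 3), np.round(soft, 3))
```

With Sigmoid, both of the first two experts receive weights above 0.9; with Softmax, the second-best expert is pushed well below 0.5 despite being nearly as relevant.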
4.3.4 CondConv Layer Depth
We also analyze the performance of CondConv layers at different depths in the architecture. In Table 7, we introduce CondConv layers starting at different depths of the base MobileNetV1 network and in all subsequent layers. CondConv layers improve performance at every depth. Even with no additional data augmentation, replacing all layers with CondConv layers achieves the highest top-1 validation accuracy. CondConv layers at greater depth require more inference cost, but also improve accuracy more significantly. We hypothesize that later layers are more class-specific and benefit from larger parameter counts and better input features for routing.
We then investigate the importance of using CondConv in the final fully-connected layer in Table 8. The final fully-connected layer accounts for a significant fraction of the parameters and computation cost of the base MobileNetV1 (0.25x) network. Even with a standard fully-connected layer, using CondConv layers in the final 8 separable convolutional blocks significantly improves accuracy over the MobileNetV1 (0.25x) baseline, which suggests that the learned features themselves are significantly better in CC-MobileNetV1 (0.25x) with 32 experts. Introducing CondConv at the final classification layer further improves performance.
5 Analysis
In this section, we aim to better understand the experts and routing functions learned by the CC-MobileNetV1 architectures. We study the CC-MobileNetV1 (0.5x) architecture with 32 experts per layer, trained on ImageNet with Mixup and AutoAugment, which achieves 71.6% top-1 validation accuracy.
We first study inter-class and intra-class variation in the routing weights at different layers of the network. We evaluate the CC-MobileNetV1 (0.5x) model on the 50,000 ImageNet validation examples and compute the mean and standard deviation of the routing weights for each class. In Figure 4, we visualize the average routing weights for four classes: cliff, pug, goldfish, and plane, chosen following Hu et al. (2018) for their semantic and appearance diversity. As a comparison, we also visualize the average routing weights for each expert across all classes.
The distribution of the routing weights is very similar across classes at early layers of the network; at later layers, the routing weights become increasingly class-specific. This shows that features are largely shared between classes at earlier layers, with more differentiation at later layers. Moreover, it may be easier to perform good routing with deeper representations. This agrees with the empirical results in Table 7, which show that CondConv layers at greater depth in the network improve final accuracy more than those at shallower depths.
We then visualize the variation in routing weights among examples of the same class for experts in the final fully-connected layer in Figure 5. For both the goldfish class and the cliff class, some experts are activated with high weight and small variance across examples, suggesting that those experts are specialized for the class. Other experts are activated for some examples in the class but not others, with high variance, which suggests that their features are useful for discriminating between specific examples within the same class.
Figure 6 illustrates the ten classes with the highest mean routing weight for four different experts in the final fully-connected layer of the network. For visualization, we also show the exemplar image with the highest routing weight within each class. Although our routing function is trained end-to-end using only the final classification loss, the experts still learn to specialize in semantically and visually meaningful ways.
Finally, we analyze the distribution of the routing weights in the final fully-connected layer in Figure 7. The routing weights follow a bimodal distribution, with most experts receiving a routing weight close to 0 or 1. This shows that the experts are sparsely activated even without regularization, and further suggests specialization of the experts.
6 Discussion
In this paper, we introduced soft conditional computation, a parameter routing approach that enables easy optimization and efficient utilization of all experts. Rather than using hard routing to choose a subset of experts to evaluate, CondConv applies soft routing to the weights of all experts first, then applies the expensive convolution of the base network once. This introduces a new paradigm for designing high-capacity models with efficient inference: first generate an expert for each input example, then perform the expensive computation of applying that expert to the input once. The capacity of the network increases with the size and complexity of the expert-generating function, while the inference cost increases over the base network only by the cost of the expert-generating function. By designing expert-generating functions with large capacity, we can expand the expressiveness of expensive operations in the base network at small inference cost. This paradigm for conditional computation supports more complex information sharing across experts than previous routing mechanisms, which we believe is a promising direction for future conditional computation models. We hope to further explore the design space and limitations of this paradigm with larger datasets, more complex expert-generating functions, and architecture search for better base architectures.
References
 Bahdanau et al. (2015) Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015.
 Bengio et al. (2015) Bengio, E., Bacon, P.-L., Pineau, J., and Precup, D. Conditional computation in neural networks for faster models. arXiv preprint arXiv:1511.06297, 2015.
 Bengio et al. (2013) Bengio, Y., Léonard, N., and Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
 Chen et al. (2018) Chen, Z., Li, Y., Bengio, S., and Si, S. Gaternet: Dynamic filter selection in convolutional neural network via a dedicated global gating network. arXiv preprint arXiv:1811.11205, 2018.
 Cho & Bengio (2014) Cho, K. and Bengio, Y. Exponentially increasing the capacity-to-computation ratio for conditional computation in deep learning. arXiv preprint arXiv:1406.7362, 2014.
 Cubuk et al. (2018) Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., and Le, Q. V. AutoAugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.
 Davis & Arel (2013) Davis, A. and Arel, I. Low-rank approximations for conditional feedforward computation in deep neural networks. arXiv preprint arXiv:1312.4461, 2013.
 Eigen et al. (2013) Eigen, D., Ranzato, M., and Sutskever, I. Learning factored representations in a deep mixture of experts. arXiv preprint arXiv:1312.4314, 2013.
 Fernando et al. (2017) Fernando, C., Banarse, D., Blundell, C., Zwols, Y., Ha, D., Rusu, A. A., Pritzel, A., and Wierstra, D. Pathnet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.
 Goyal et al. (2017) Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
 Ha et al. (2017) Ha, D., Dai, A., and Le, Q. V. Hypernetworks. In International Conference on Learning Representations, 2017.
 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
 Howard et al. (2017) Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 Hu et al. (2018) Hu, J., Shen, L., and Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141, 2018.
 Huang et al. (2018) Huang, Y., Cheng, Y., Chen, D., Lee, H., Ngiam, J., Le, Q. V., and Chen, Z. Gpipe: Efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1808.07233, 2018.
 Keskin & Izadi (2018) Keskin, C. and Izadi, S. Splinenets: Continuous neural decision graphs. In Advances in Neural Information Processing Systems, 2018.
 Kornblith et al. (2018) Kornblith, S., Shlens, J., and Le, Q. V. Do better imagenet models transfer better? arXiv preprint arXiv:1805.08974, 2018.
 Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In NIPS, pp. 1097–1105, 2012.
 LeCun et al. (1990) LeCun, Y., Boser, B. E., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W. E., and Jackel, L. D. Handwritten digit recognition with a backpropagation network. In Advances in neural information processing systems, pp. 396–404, 1990.
 Lin et al. (2014) Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pp. 740–755. Springer, 2014.
 Liu & Deng (2018) Liu, L. and Deng, J. Dynamic deep neural networks: Optimizing accuracy-efficiency tradeoffs by selective execution. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 Liu et al. (2016) Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., and Berg, A. C. SSD: Single shot multibox detector. In European Conference on Computer Vision, pp. 21–37. Springer, 2016.
 Luong et al. (2015) Luong, M.-T., Pham, H., and Manning, C. D. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
 Mahajan et al. (2018) Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri, M., Li, Y., Bharambe, A., and van der Maaten, L. Exploring the limits of weakly supervised pretraining. arXiv preprint arXiv:1805.00932, 2018.
 McGill & Perona (2017) McGill, M. and Perona, P. Deciding how to decide: Dynamic routing in artificial neural networks. arXiv preprint arXiv:1703.06217, 2017.
 Ramachandran & Le (2019) Ramachandran, P. and Le, Q. V. Diversity and depth in per-example routing models. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BkxWJnC9tX.
 Real et al. (2019) Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search. In AAAI, 2019.
 Rosenbaum et al. (2017) Rosenbaum, C., Klinger, T., and Riemer, M. Routing networks: Adaptive selection of nonlinear functions for multitask learning. arXiv preprint arXiv:1711.01239, 2017.
 Russakovsky et al. (2015) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 Sandler et al. (2018) Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018.
 Shazeer et al. (2017) Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations, 2017.
 Silberman & Guadarrama (2016) Silberman, N. and Guadarrama, S. TensorFlow-Slim image classification model library, 2016. URL https://github.com/tensorflow/models/tree/master/research/slim.
 Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014. URL http://jmlr.org/papers/v15/srivastava14a.html.
 Teja Mullapudi et al. (2018) Teja Mullapudi, R., Mark, W. R., Shazeer, N., and Fatahalian, K. Hydranets: Specialized dynamic architectures for efficient inference. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
 Wang et al. (2018) Wang, X., Yu, F., Dou, Z.Y., Darrell, T., and Gonzalez, J. E. Skipnet: Learning dynamic routing in convolutional networks. In European Conference on Computer Vision, pp. 420–436. Springer, 2018.
 Wu et al. Wu, Z., Nagarajan, T., Kumar, A., Rennie, S., Davis, L. S., Grauman, K., and Feris, R. Blockdrop: Dynamic inference paths in residual networks.
 Zhang et al. (2017) Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.