Soft Conditional Computation

04/10/2019 ∙ by Brandon Yang, et al.

Conditional computation aims to increase the size and accuracy of a network at only a small increase in inference cost. Previous hard-routing models explicitly route each input to a subset of experts. We propose soft conditional computation, which, in contrast, utilizes all experts while still permitting efficient inference through parameter routing. Concretely, for a given convolutional layer, we wish to compute a linear combination of n experts α_1 · (W_1 * x) + ... + α_n · (W_n * x), where α_1, ..., α_n are functions of the input learned through gradient descent. A straightforward evaluation requires n convolutions. We propose an equivalent form of the above computation, (α_1 W_1 + ... + α_n W_n) * x, which requires only a single convolution. We demonstrate the efficacy of our method, named CondConv, by scaling up the MobileNetV1, MobileNetV2, and ResNet-50 model architectures to achieve higher accuracy while retaining efficient inference. On the ImageNet classification dataset, CondConv improves the top-1 validation accuracy of the MobileNetV1(0.5x) model from 63.8% to 71.6%. On COCO object detection, CondConv improves the minival mAP of a MobileNetV1(1.0x) SSD model from 20.3 to 22.4 with just a 4% increase in inference cost.


1 Introduction

(a) CondConv
(b) Mixture of Experts
Figure 1: Our CondConv layer architecture with n experts vs. a mixture-of-experts approach. We use a weighted sum to combine the experts by their routing weights α_1, ..., α_n, which are computed as functions of the input. CondConv is mathematically equivalent to the mixture-of-experts approach, but requires only one convolution, while the mixture of experts requires n convolutions.

Deep convolutional neural networks (CNNs) have achieved state-of-the-art performance on many tasks in computer vision (LeCun et al., 1990; Krizhevsky et al., 2012). One trend in designing neural networks with better performance has been increasing scale: bigger models with more capacity (more parameters) trained on large datasets (Mahajan et al., 2018; Huang et al., 2018; Real et al., 2019). However, large models can be hard to use in practice because of inference latency constraints. In this work, we take inspiration from conditional computation to build networks that have a large number of parameters, but retain good performance characteristics at inference. In particular, we aim to reduce the computation cost of inference for single-image batches, which are used in many applications.

Most prior work on conditional computation proposes learning functions that route individual input examples through a subset of experts in a larger network (Eigen et al., 2013; Bengio et al., 2015; Shazeer et al., 2017). We term this hard routing, since a given example is evaluated by only a subset of experts, while the remaining experts are ignored. In this case, the inference cost scales directly with the number of evaluated experts; model performance usually improves when more experts are evaluated. However, the hard-routing approach can be difficult to implement in practice, because learning good routing functions for discrete decisions is challenging. Previous approaches require reinforcement learning (Bengio et al., 2015), evolution-based methods (Fernando et al., 2017), or gradient descent with additional learning objectives for the routing function (Eigen et al., 2013; Shazeer et al., 2017; Teja Mullapudi et al., 2018).

In this work, we propose a soft conditional computation approach that enables easy optimization and utilization of all experts at low inference cost. Consider a mixture-of-experts model, where we wish to compute a linear combination of n experts α_1 · (W_1 * x) + ... + α_n · (W_n * x), where the routing weights α_1, ..., α_n are functions of the input learned through gradient descent. In a ConvNet, this formulation requires n convolutions, which is expensive when n is large. The hard-routing approach truncates the computation to k convolutions, where k < n. However, hard routing, as mentioned above, is difficult to train. Our main observation is that we can reorder the computation as (α_1 W_1 + ... + α_n W_n) * x. This requires only a single convolution, which significantly reduces computation, since a weighted sum of kernels is much cheaper than a convolution. In other words, we softly combine the weights of all experts before performing the expensive computation of applying the combined expert to the input once. We call this approach soft conditional computation, in contrast to the traditional hard-routing approach. When applied to ConvNets, we name our method CondConv.
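To make the savings concrete, the following back-of-the-envelope sketch (our own illustration, ignoring bias terms, stride, and the cost of computing the routing weights) counts multiply-adds for the two formulations of a single layer:

def madds_mixture_of_experts(n, h, w, k, c_in, c_out):
    # n separate convolutions, plus a weighted sum of the n output feature maps.
    return n * h * w * k * k * c_in * c_out + n * h * w * c_out

def madds_condconv(n, h, w, k, c_in, c_out):
    # Combine the n kernels first (about one multiply-add per expert per kernel
    # parameter), then apply a single convolution with the combined kernel.
    return n * k * k * c_in * c_out + h * w * k * k * c_in * c_out

# Hypothetical layer: 3x3 kernel, 14x14 feature map, 256 channels, 8 experts.
print(madds_mixture_of_experts(8, 14, 14, 3, 256, 256))  # ~9.3e8
print(madds_condconv(8, 14, 14, 3, 256, 256))            # ~1.2e8

The spatial factor h · w multiplies every expert in the mixture-of-experts form, but only the single combined convolution in the CondConv form, which is why adding experts is cheap.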

CondConv can be used as a drop-in replacement for existing convolution layers to scale them with soft conditional computation. We demonstrate that replacing convolutional layers with CondConv is an effective way to increase model capacity on the MobileNetV1, MobileNetV2, and ResNet-50 architectures, and is able to significantly improve the model performance on ImageNet classification and COCO object detection across a spectrum of model sizes.

2 Related Work

Conditional computation. Conditional computation aims to increase model capacity without a proportional increase in computation cost by activating only a portion of the entire network for each example (Bengio et al., 2013; Davis & Arel, 2013; Cho & Bengio, 2014; Bengio et al., 2015). Hard-routing models have shown promising results in recent work, but the key challenges are how to route individual examples and how to train the network effectively. One approach is to use reinforcement learning to learn discrete routing functions (Rosenbaum et al., 2017; McGill & Perona, 2017; Liu & Deng, 2018). BlockDrop (Wu et al.) and SkipNet (Wang et al., 2018) use reinforcement learning to learn the subset of blocks needed to process a given input. Another approach is to use evolutionary methods to evolve pathways through a larger supernetwork (Fernando et al., 2017). Unlike these methods, CondConv can be trained end-to-end with gradient descent on the original loss.

Hydranets (Mullapudi et al., 2018) is a hard-routing approach for vision that can also be trained with gradient descent, but it requires an unsupervised clustering-based method to partition examples in order to perform well. Increasing the number of experts evaluated at inference time improves the performance of Hydranets, but also increases inference cost. CondConv does not require auxiliary loss functions to learn the routing functions, and allows all experts to be used at a small inference cost.

Finally, Shazeer et al. (2017) propose the sparsely-gated mixture-of-experts layer, which enables hard routing to be trained with gradient descent and achieves significant success on large language modeling and machine translation tasks. Their proposed routing technique is noisy top-k gating, but this approach introduces noise and discontinuities into the loss function. Ramachandran & Le (2019) apply noisy top-k gating to vision tasks, and find that it is difficult to optimize with increasing routing depth and requires a careful choice of k for each layer to balance between weight-sharing and specialization of experts. In contrast, CondConv layers are easy to optimize reliably with gradient descent, without special regularization. Moreover, CondConv layers automatically learn to trade off between specialization and weight-sharing at each layer. This comes at the cost of architectural diversity between experts, but the ease of optimization enables more complex and deeper routing schemes.

Weights per example. Ha et al. (2017) propose using a small network to generate the weights of a larger network. For vision tasks, these generated weights are not conditioned on the input, and the resulting models have a lower parameter count but higher computation cost and reduced performance. Attention-based methods scale previous-layer inputs based on learned attention weights. Hard attention (Luong et al., 2015) uses only a subset of the inputs and has parallels to hard conditional routing. Soft attention (Bahdanau et al., 2015; Vaswani et al., 2017) uses all parts of the inputs but upweights some of them, which is more in line with our proposed technique.

Example-dependent activation scaling. Finally, some recent work proposes to adapt the activations of neural networks conditionally on the input. Squeeze-and-Excitation networks (Hu et al., 2018) learn to scale the activations of every layer output. GaterNet (Chen et al., 2018) uses a separate network to select a binary mask over the filters of a larger backbone network. Scaling activations has similar motivations to soft conditional computation, but is restricted to modulating activations in the base network.

Continuous neural decision graphs. In recent work, SplineNets (Keskin & Izadi, 2018) propose the use of B-splines to model learnable weights for continuous neural decision graphs. Compared to SplineNets, CondConv explores more complex weight-generating functions.

3 Soft Conditional Computation with Parameter Routing

In a regular convolutional layer, the same convolutional kernel is used for all input examples. In a CondConv layer, the convolutional kernel is computed as a function of the input example (Fig. 1(a)). Specifically, we parameterize a CondConv layer by:

Output(x) = σ((α_1 · W_1 + ... + α_n · W_n) * x),

where each α_i is an example-dependent routing weight computed using a routing function with learned parameters, n is the number of experts, and σ represents the non-linear normalization and activation functions.

We observe that, because convolution is linear in the kernel, the CondConv layer is mathematically equivalent to the more expensive mixture-of-experts formulation, where each expert corresponds to a separate convolution (Fig. 1(b)):

Output(x) = σ(α_1 · (W_1 * x) + ... + α_n · (W_n * x)).

Hence, a CondConv layer can be viewed equivalently as a weighted sum of the outputs of multiple regular convolutions, or as a single convolution whose kernel is computed as an input-dependent weighted sum of the expert kernels.

In a CondConv layer, the cost of the convolution operation is independent of the number of experts, since the size of the kernel is fixed. Thus, adding an expert increases the inference cost by approximately one multiply-add per additional parameter. In contrast, a normal convolution requires many multiply-adds for each additional parameter in the kernel. Thus, CondConv enables us to efficiently scale the capacity with respect to inference cost.

To compute the example-dependent routing weights α_1, ..., α_n, we perform global average pooling on the input, followed by a fully-connected layer that outputs one routing weight per expert, followed by a Sigmoid activation function. The global average pooling of the input provides global context to the routing function. Our parameterization enables CondConv to recover hard-routing models with the same architecture: a hard-routing model with k active experts corresponds to soft parameter routing with routing weights of zero for all but k experts. Our model can instead take advantage of all experts for every example, making it more expressive.

The design of the CondConv layer allows it to be used in place of any regular convolution layer in a network. Although we illustrate the approach with ordinary convolutions, the same approach can easily be extended to other linear functions like those in depth-wise convolutions and fully-connected layers.
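To make the layer concrete, here is a minimal PyTorch sketch of a CondConv-style layer for inference. It is our own illustration rather than the authors' reference implementation: the module name CondConv2d, the initialization scale, and the omission of normalization and activation (the σ above) are all our choices.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CondConv2d(nn.Module):
    # Per-example kernels built as a routed weighted sum of n expert kernels,
    # then applied with a single convolution.
    def __init__(self, in_ch, out_ch, kernel_size, num_experts=8, stride=1, padding=0):
        super().__init__()
        self.stride, self.padding = stride, padding
        # Expert kernels, shape (n, out_ch, in_ch, k, k).
        self.experts = nn.Parameter(
            0.01 * torch.randn(num_experts, out_ch, in_ch, kernel_size, kernel_size))
        # Routing function: global average pool -> fully connected -> Sigmoid,
        # producing one routing weight per expert.
        self.routing = nn.Linear(in_ch, num_experts)

    def forward(self, x):
        # x: (batch, in_ch, H, W). For clarity we loop over examples, since each
        # one gets its own combined kernel (batch size is one at inference).
        outputs = []
        for xi in x.split(1, dim=0):
            alpha = torch.sigmoid(self.routing(xi.mean(dim=(2, 3))))       # (1, n)
            kernel = torch.einsum('n,noihw->oihw', alpha.squeeze(0), self.experts)
            outputs.append(F.conv2d(xi, kernel, stride=self.stride, padding=self.padding))
        return torch.cat(outputs, dim=0)

# Example usage:
# y = CondConv2d(16, 32, 3, num_experts=4, padding=1)(torch.randn(1, 16, 56, 56))

The only per-expert cost at inference is the kernel combination in the einsum, roughly one multiply-add per expert parameter.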

4 Experiments

We evaluate CondConv by scaling up the MobileNetV1  (Howard et al., 2017), MobileNetV2  (Sandler et al., 2018), and ResNet-50  (He et al., 2016) architectures on image classification and object detection tasks.

As an implementation note, to train models with CondConv layers directly in the combined-kernel form, we would need to perform convolutions with a batch size of one, since each example has a different convolutional kernel. However, current accelerators are optimized to train on large batch sizes. Thus, with small numbers of experts, we found it more efficient to train CondConv layers with the mixture-of-experts formulation (Fig. 1(b)), which supports large convolutional batch sizes, and then to use the efficient CondConv formulation (Fig. 1(a)) at inference.
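Since the two formulations differ only in the order of linear operations, their equivalence is easy to verify numerically. The snippet below is a toy check with random tensors (plain functional PyTorch, independent of any particular layer implementation):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, in_ch, out_ch, k = 4, 8, 16, 3
x = torch.randn(1, in_ch, 32, 32)
experts = torch.randn(n, out_ch, in_ch, k, k)
alpha = torch.rand(n)  # routing weights for this example

# Mixture-of-experts form: n convolutions, then a weighted sum of the outputs.
moe = sum(a * F.conv2d(x, w, padding=1) for a, w in zip(alpha, experts))

# CondConv form: combine the kernels first, then a single convolution.
combined = torch.einsum('n,noihw->oihw', alpha, experts)
cond = F.conv2d(x, combined, padding=1)

print(torch.allclose(moe, cond, atol=1e-4))  # True, up to floating-point error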

Figure 2: Performance of our CC-MobileNetV1 models vs the base MobileNetV1 architecture on the ImageNet classification dataset. Increasing the number of experts per layer of CC-MobileNetV1 significantly improves accuracy with respect to inference cost compared to the MobileNetV1 frontier reported in Silberman & Guadarrama (2016).
Model                  Num Experts  Params (M)  Inference MADDs (M)  Train CE  Valid Top-1 (%)
CC-MobileNetV1(0.5x)   1            1.3         151                  1.80      63.6
CC-MobileNetV1(0.5x)   2            2.6         152                  1.61      66.0
CC-MobileNetV1(0.5x)   4            5.2         154                  1.41      67.6
CC-MobileNetV1(0.5x)   8            10.4        160                  1.20      68.4
CC-MobileNetV1(0.5x)   16           20.7        170                  0.91      69.9
CC-MobileNetV1(0.5x)   32           41.3        190                  0.67      71.6
MobileNetV1(0.5x)*     -            1.3         150                  1.83      63.8
MobileNetV1(1.0x)*     -            4.2         569                  1.20      71.9

* Our implementation. Our training hyperparameters improve upon the reported top-1 accuracy of 70.6 for MobileNetV1 (1.0x) and 63.7 for MobileNetV1 (0.5x) in Howard et al. (2017).

Table 1: Scaling up our CC-MobileNetV1(0.5x) models at increasing numbers of experts per layer. CondConv significantly increases model capacity while maintaining efficient inference. Model capacity is measured by training cross-entropy loss (Train CE) following Sec. 4.1 with no additional regularization. With additional regularization, using more experts leads to better generalization accuracy.

4.1 ImageNet Classification

We evaluate our approach on the image classification task using the ImageNet 2012 classification dataset  (Russakovsky et al., 2015). The ImageNet dataset consists of 1.28 million training images and 50K validation images from 1000 classes. We train all models on the entire training set and compare the single-crop top-1 validation set accuracy. We use the same training hyperparameters for all models on ImageNet, following  Kornblith et al. (2018), except we use a BatchNorm momentum of 0.9 and disable exponential moving average on the weights.

We introduce two additional regularization techniques for models with large capacity. First, we use Dropout (Srivastava et al., 2014) on the input to the fully-connected layer. We also add data augmentation using the AutoAugment (Cubuk et al., 2018) ImageNet policy and Mixup (Zhang et al., 2017). We search over these regularization techniques for all models for a fair comparison.

4.1.1 MobileNetV1

We evaluate our approach against the base MobileNetV1 architecture at many different computational costs. Concretely, we replace the depth-wise and point-wise convolutional layers in the final 8 separable convolutional blocks with CondConv layers. Within each separable convolutional block, we use a single routing function, computed from the input to the block, to produce the routing weights for both the depth-wise and point-wise CondConv layers (see the sketch below). We also replace the final classification layer with a 1x1 CondConv layer. We analyze these architectural choices in Section 4.3. We refer to our models as CC-MobileNetV1.
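The block-level wiring can be sketched as follows (a simplified illustration under our own naming; batch normalization is omitted and a batch size of one is assumed, as in the inference setting above):

import torch
import torch.nn as nn
import torch.nn.functional as F

class CondSeparableBlock(nn.Module):
    # One routing function, computed from the block input, supplies routing
    # weights for both the depth-wise and point-wise CondConv kernels.
    def __init__(self, in_ch, out_ch, num_experts=8):
        super().__init__()
        self.in_ch = in_ch
        self.routing = nn.Linear(in_ch, num_experts)
        self.dw_experts = nn.Parameter(0.01 * torch.randn(num_experts, in_ch, 1, 3, 3))
        self.pw_experts = nn.Parameter(0.01 * torch.randn(num_experts, out_ch, in_ch, 1, 1))

    def forward(self, x):  # x: (1, in_ch, H, W)
        alpha = torch.sigmoid(self.routing(x.mean(dim=(2, 3)))).squeeze(0)  # (n,)
        dw_kernel = torch.einsum('n,ncihw->cihw', alpha, self.dw_experts)
        pw_kernel = torch.einsum('n,noihw->oihw', alpha, self.pw_experts)
        x = F.relu(F.conv2d(x, dw_kernel, padding=1, groups=self.in_ch))  # depth-wise
        x = F.relu(F.conv2d(x, pw_kernel))                                # point-wise
        return x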

We evaluate our CC-MobileNetV1 models and the base MobileNetV1 model across four different width multipliers {0.25, 0.50, 0.75, 1.0} in Figure 2. In a MobileNetV1 model, the width multiplier scales the number of output channels at each layer by a fraction w; as a result, the computational cost and parameter count scale roughly with w^2. For CC-MobileNetV1, at each width multiplier, we use a constant input size of 224x224 and vary the number of experts per layer by powers of 2, from 1 to 64 for 0.25x and from 1 to 32 for 0.50x, 0.75x, and 1.0x. We compare CC-MobileNetV1 with the baseline MobileNetV1 results from the publicly available Slim checkpoints (Silberman & Guadarrama, 2016). For the MobileNetV1 models, at each width multiplier, the input size varies over {128x128, 160x160, 192x192, 224x224}. This gives a performance frontier for the accuracy and inference cost trade-off of MobileNetV1, against which we compare our CC-MobileNetV1 models.

Across all width multipliers, CondConv layers significantly improve the ImageNet validation accuracy relative to inference cost compared to the MobileNetV1 base models. We provide a detailed table of parameter count and inference cost for scaling the MobileNetV1 (0.5x) model in Table 1. Increasing the number of experts significantly improves the capacity of the model, which we compare using the training cross-entropy loss following Sec. 4.1 with no additional regularization or data augmentation. With additional regularization, our models with more experts per layer also attain better generalization accuracy. We note that regularization with Dropout  (Srivastava et al., 2014) and additional data augmentation actually reduces the accuracy of the baseline MobileNetV1 models.

4.1.2 MobileNetV2

We further evaluate our approach against the MobileNetV2 (Sandler et al., 2018) architecture, which improves the accuracy with respect to inference cost over MobileNetV1 using a deeper architecture with inverted residual blocks. We replace the convolutional layers in the last 6 inverted residual blocks with CondConv layers, and replace the final 1x1 convolution and classification layer with 1x1 CondConv layers. Inverted residual blocks consist of 2 point-wise convolutional layers and 1 depth-wise convolutional layer, and within each block, we replace all three with CondConv layers. We compute the routing weights from the input to the inverted residual block, and share them across all three layers within a block. We refer to our models as CC-MobileNetV2.

Even on top of the inverted bottleneck structure, which already introduces additional parameters at lower inference cost, our approach significantly improves the capacity and accuracy of the MobileNetV2 architecture with respect to inference cost. Table 2 shows the performance of our models relative to the base MobileNetV2 models. With additional regularization, our CC-MobileNetV2 (1.0x) model with 8 experts achieves comparable performance to the MobileNetV2 (1.4x) model at 56% of the inference cost. As with MobileNetV1, additional regularization with Dropout (Srivastava et al., 2014) and additional data augmentation reduces the accuracy of the baseline MobileNetV2 models.

Model              Params (M)  MADDs (M)  Train CE  Valid Top-1 (%)
MV2 (0.5x)*        2.0         97         1.9       62.9
CC-MV2 (0.5x) [8]  15.5        113        1.1       68.4
MV2 (1.0x)*        3.5         301        1.3       71.6
CC-MV2 (1.0x) [8]  27.5        329        0.7       74.6
MV2 (1.4x)*        6.1         583        1.0       74.5

* Our implementation. Kornblith et al. (2018) report 71.6 top-1 accuracy for MobileNetV2 (1.0x) and 74.7 for MobileNetV2 (1.4x) with similar hyperparameter settings. Sandler et al. (2018) use different hyperparameters and report 72.0 top-1 accuracy for MobileNetV2 (1.0x) and 74.7 for MobileNetV2 (1.4x). Silberman & Guadarrama (2016) use different hyperparameters and report 65.4 top-1 accuracy for MobileNetV2 (0.5x).

Table 2: Performance of our CC-MobileNetV2 (CC-MV2) models with 8 experts per layer at the 0.5x and 1.0x width multipliers compared to the baseline MobileNetV2 (MV2) models. Our CC-MobileNetV2 (1.0x) model with 8 experts achieves comparable accuracy to the MobileNetV2 (1.4x) model with ~56% of the inference cost.

4.1.3 ResNet-50

Model          Params (M)  MADDs (M)  Train CE  Valid Top-1 (%)
RN-50*         25.6        4093       0.50      77.7
CC-RN-50 [2]   42.6        4127       0.35      78.4
CC-RN-50 [8]   130.2       4213       0.19      77.7

* Our implementation. Our hyperparameter setting with additional data augmentation improves upon the top-1 accuracy of 76.4 for large-batch ResNet-50 training in Goyal et al. (2017).

Table 3: Performance of our CC-ResNet-50 (CC-RN-50) models with 2 and 8 experts per layer against the ResNet-50 (RN-50) baseline. Increasing the number of experts per layer significantly improves model capacity. Generalization accuracy improves with two experts per layer.

Finally, we evaluate our approach on the ResNet-50  (He et al., 2016) architecture, a much larger architecture which uses ordinary 3x3 convolutions and residual bottleneck building blocks. We replace the convolutional layers in the final three residual blocks with CondConv layers, as well as the final classification layer with a 1x1 CondConv layer. We refer to our model as CC-ResNet-50.

The CC-ResNet-50 model with 2 experts per layer achieves 0.7 higher ImageNet Top-1 Accuracy compared to the ResNet-50 baseline. Both the ResNet-50 baseline and our CC-ResNet-50 models are trained with additional data augmentation with Mixup  (Zhang et al., 2017) and AutoAugment  (Cubuk et al., 2018). With larger numbers of experts per layer, the CC-ResNet-50 models demonstrate even higher capacity measured by the training cross-entropy loss following Sec.  4.1 with no additional regularization or data augmentation. We believe the generalization accuracy could be further improved with more complex regularization techniques and larger datasets.

4.2 COCO Object Detection

Model               Params (M)  MADDs (M)  mAP
MV1 (0.25x)*        0.7         119        9.3
CC-MV1 (0.25x) [8]  3.1         122        12.1
MV1 (0.5x)*         2.0         352        14.4
CC-MV1 (0.5x) [8]   11.4        363        18.0
MV1 (0.75x)*        4.1         730        18.2
CC-MV1 (0.75x) [8]  25.0        755        21.0
MV1 (1.0x)*         6.8         1230       20.3
CC-MV1 (1.0x) [8]   43.8        1280       22.4

* Our implementation. Our hyperparameter setting improves upon the reported mAP of 19.3 for MobileNetV1 (1.0x) with SSD300 in Howard et al. (2017). Other width multipliers are not reported.

Table 4: COCO object detection comparison of our CC-MobileNetV1 CC-SSD300 models (CC-MV1) with 8 experts per layer against the baseline MobileNetV1 SSD300 models (MV1) at several model sizes. mAP is reported with the COCO primary challenge metric (AP at IoU=0.50:0.05:0.95).

We further evaluate the effectiveness of CondConv for object detection using the Single Shot Detector (Liu et al., 2016) framework with 300x300 input image resolution (SSD300). We use our CC-MobileNetV1 models with width multipliers {0.25, 0.50, 0.75, 1.0} as the feature extractors for object detection. We further replace the remaining convolutional feature extractor layers in SSD with CondConv layers, and term the result CC-SSD.

The COCO dataset (Lin et al., 2014) consists of 80,000 training and 40,000 validation images. We train on the combined COCO training and validation sets excluding 8,000 minival images, on which we evaluate our networks. We train our models with a batch size of 1024 for 20,000 steps. For the learning rate, we use a linear warmup from 0.3 to 0.9 over 1,000 steps, followed by cosine decay from 0.9. We use the data augmentation scheme proposed by Liu et al. (2016). We use the same convolutional feature layer dimensions, SSD hyperparameters, and training hyperparameters across all models.
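Read literally, that learning-rate schedule can be written as the following small helper (our own sketch; we assume the cosine decays toward zero over the remaining steps, which the text does not state explicitly):

import math

def learning_rate(step, warmup_steps=1000, total_steps=20000,
                  warmup_start=0.3, peak=0.9):
    # Linear warmup from warmup_start to peak, then cosine decay from peak.
    if step < warmup_steps:
        return warmup_start + (peak - warmup_start) * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))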

We find that the learned features and additional capacity from CondConv layers with 8 experts significantly improve object detection results across a variety of model sizes (Table 4). In particular, our CC-MobileNetV1(0.75x) CC-SSD model exceeds the mAP of the MobileNetV1(1.0x) SSD baseline by 0.7 mAP at about 60% of the inference cost. Moreover, our CC-MobileNetV1(1.0x) CC-SSD model improves upon the MobileNetV1(1.0x) SSD baseline by 2.1 mAP at similar inference cost.

4.3 Ablation studies

Figure 3: Validation Top-1 Accuracy (left) and training cross-entropy (right) of per-expert vs per-channel routing with the CC-MobileNetV1 (0.25x) architecture. Per-channel routing performs better with fewer experts, while per-expert routing performs better with more experts.
CondConv layers                 Params (M)  MADDs (M)  Train CE  Valid Top-1 (%)
Depth + Point (CCMV1_0.25_32)   14.6        55.7       1.4       62.0
Pointwise only                  14.3        55.3       1.5       62.3
Depthwise only                  8.8         49.7       1.8       58.0
None (MV1_0.25)                 0.47        41.2       2.6       50.0

Table 5: CondConv for point-wise vs. depth-wise convolutions in our CC-MobileNetV1 (0.25x) model with 32 experts per layer.
Routing Fn                 Params (M)  MADDs (M)  Train CE  Valid Top-1 (%)
Baseline (CCMV1_0.25_32)   14.6        55.7       1.4       62.0
Single                     14.6        55.5       1.8       56.5
Partially-shared           14.6        55.6       1.4       62.5
Hidden (small)             14.6        55.6       1.7       57.7
Hidden (medium)            14.8        55.9       1.4       62.2
Hidden (large)             16.8        57.8       1.4       54.1
Hierarchical               14.6        55.7       1.4       60.3
Softmax                    14.6        55.7       1.7       60.5

Table 6: Different routing function architectures in our CC-MobileNetV1(0.25x) model with 32 experts per layer. The baseline approach applies a single fully-connected layer with Sigmoid activation at every block.
CondConv begin       Params (M)  MADDs (M)  Train CE  Valid Top-1 (%)
7 (CCMV1_0.25_32)    14.6        55.7       1.4       62.0
1                    14.9        56.3       1.4       62.5
3                    14.9        56.2       1.4       62.1
5                    14.8        56.0       1.4       62.0
13                   11.6        52.5       1.6       59.5
15 (FC)              8.42        49.3       2.0       54.2
16 (MV1_0.25)        0.47        41.2       2.6       50.0

Table 7: Introducing CondConv layers at different depths in our CC-MobileNetV1(0.25x) model with 32 experts per layer. "CondConv begin at i" means using CondConv layers beginning at the i-th convolutional or separable convolutional block. CondConv layers are more beneficial later in the network, but are still helpful at earlier layers.
CondConv layers           Params (M)  MADDs (M)  Train CE  Valid Top-1 (%)
FC + FE (CCMV1_0.25_32)   14.6        55.7       1.4       62.0
FE only                   6.65        47.6       1.6       60.2
FC only                   8.42        49.3       2.0       54.2
None (MV1_0.25)           0.47        41.2       2.6       50.0

Table 8: Using CondConv in the final fully-connected (FC) layer vs. the feature extractor (FE) layers in our CC-MobileNetV1(0.25x) model with 32 experts per layer.
(a) CondConv 7
(b) CondConv 13
(c) Fully Connected (FC)
Figure 4: Mean routing weights learned for four distinct classes at different depths in our CC-MobileNetV1 (0.5x) network with 32 experts per layer. "CondConv i" refers to the routing weights of the i-th convolutional or separable convolutional block in the network. Routing weights become more class-specific at greater depths in the network.
Figure 5: Mean routing weights in the final CondConv layer of our CC-MobileNetV1 (0.5x) model with 32 experts per layer. Error bars indicate one standard deviation. Some experts are class-specific, with high weights and low variance across all examples in the class. Other experts are example-specific within the class and show high variance.

Figure 6: Visualization of top 10-classes by mean routing weight for 4 different experts in the final CondConv layer in our CC-MobileNetV1 (0.5x) model with 32 experts per layer. Expert 1 is most activated for wheeled vehicles; Expert 2 is most activated for rectangular structures; Expert 3 is most activated for cylindrical household objects; Expert 4 is most activated for brown and black dog breeds.
Figure 7: Distribution of routing weights in the final CondConv layer of our CC-MobileNetV1 (0.5x) model with 32 experts per layer across ImageNet validation set images. Routing weights follow a bimodal distribution.

We perform ablation experiments to highlight important architectural decisions in designing soft conditional computation models with the CondConv block. In all experiments, we compare against the same baseline CC-MobileNetV1(0.25x) with 32 experts per layer, trained as described in Sec. 4.1 with no additional regularization, which achieves 61.98% ImageNet top-1 validation accuracy with 55.7M multiply-adds. In this section, we refer to this baseline model as CCMV1_0.25_32. It significantly outperforms the base MobileNetV1(0.25x) architecture, which achieves 50.0% top-1 accuracy with 41.2M multiply-adds under the same training setup (our implementation; Howard et al. (2017) report a top-1 accuracy of 50.0% with different hyperparameters). We choose this model for ablation because of the large effect of introducing CondConv layers, and because it does not require additional data augmentation to perform well.

4.3.1 Per-channel routing

We explore learning a separate routing weight for each output channel of each expert, rather than using one routing weight per expert. In principle, this increases the expressiveness of the model by allowing each channel to make a different routing decision. (Note that with an expert count of one, per-channel routing is similar to Squeeze-and-Excitation (Hu et al., 2018), computed using input features rather than output features.) The drawback of this approach is that the routing function for each layer becomes significantly more complex, since it must output one weight per expert per output channel rather than one weight per expert. To keep the computation cost of the routing function small, we use a 2-layer routing function with a bottleneck hidden layer for per-channel routing. We replace the routing weights at every CondConv layer in the baseline CC-MobileNetV1 (0.25x) model and vary the number of experts in powers of 2 from 1 to 64, trained as described in Sec. 4.1 with no additional regularization. We plot the results in Fig. 3.
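A per-channel routing function might look like the following sketch (illustrative only; the class name and the bottleneck hidden size of 32 are hypothetical choices, since the original value is not preserved in the text above):

import torch
import torch.nn as nn

class PerChannelRouting(nn.Module):
    # Two-layer bottleneck MLP that outputs one routing weight per expert per
    # output channel, instead of one weight per expert.
    def __init__(self, in_ch, out_ch, num_experts, hidden=32):
        super().__init__()
        self.num_experts, self.out_ch = num_experts, out_ch
        self.mlp = nn.Sequential(
            nn.Linear(in_ch, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_experts * out_ch),
        )

    def forward(self, x):  # x: (batch, in_ch, H, W)
        pooled = x.mean(dim=(2, 3))
        alpha = torch.sigmoid(self.mlp(pooled))
        return alpha.view(-1, self.num_experts, self.out_ch)  # (batch, n, out_ch)

The combined kernel is then formed channel by channel, e.g. kernel[o] = sum_i alpha[i, o] · W_i[o] for each output channel o.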

At small expert counts, per-channel routing outperforms per-expert routing with the same number of experts. However, per-channel routing requires more computation. With large expert counts, per-expert routing outperforms per-channel routing, even on training cross-entropy loss. We hypothesize that although per-channel routing models are more expressive, they are difficult to optimize with many experts.

4.3.2 Pointwise vs. Depthwise Convolutions

We study the effect of using CondConv for the point-wise vs. depth-wise convolutions in Table 5. The point-wise convolutions account for most of the parameters and most of the inference cost of the network, compared to the depth-wise convolutions. Introducing CondConv on just the point-wise convolutions yields nearly as much capacity as introducing it on both the point-wise and depth-wise convolutions, and slightly higher generalization accuracy. Introducing CondConv on the depth-wise convolutions only yields less capacity than on both, but requires fewer parameters and has lower inference cost, while still significantly improving performance over the base MobileNetV1(0.25x) model accuracy of 50.6%.

4.3.3 Routing function

We investigate different choices for the routing function in Table  6. We first investigate sharing the routing weights between layers. The baseline model computes new routing weights for each layer. Single computes the routing weights only once at CondConv 7 (the 7th convolutional or separable convolutional layer) and uses the same routing weights in all subsequent layers. This significantly reduces the capacity and accuracy of the network. Partially-shared shares the routing weights between every other layer, and we find this slightly improves performance. We hypothesize that sharing the routing function among nearby layers can improve the learned routing weights.

We then experiment with more complex routing functions by introducing a hidden layer with ReLU activation after the global average pooling step. We vary the capacity of the routing function by changing the size of the hidden layer for the Hidden (small), Hidden (medium), and Hidden (large) variants. We find that adding a non-linear hidden layer can slightly improve capacity. However, very large hidden layers exhibit over-fitting, while hidden layers that are too small reduce the capacity of the network.

Next, we experiment with Hierarchical routing functions, which take the routing weights of the previous layer as additional inputs to the routing function. This increases network capacity but also reduces generalization accuracy.

Finally, we experiment with the use of the Softmax activation function to compute routing weights, rather than the Sigmoid activation function. Using the Sigmoid activation function in the routing function improves the capacity and generalization accuracy of the network. In Section 5, we find that multiple experts are often useful for a single example. The Sigmoid activation function allows many experts to be used, while the Softmax activation function forces the experts to compete. With the Sigmoid routing function, CondConv allows the number of active experts to be chosen by the model for each example.

4.3.4 CondConv Layer Depth

We also analyze the performance of CondConv layers at different depths in the architecture. Table 7 introduces CondConv layers starting at different depths of the base MobileNetV1 network and in all subsequent layers. CondConv layers improve performance at every depth. Even with no additional data augmentation, replacing all layers with CondConv layers achieves the highest top-1 validation accuracy. CondConv layers at greater depth require more inference cost, but also improve accuracy more significantly. We hypothesize that later layers are more class-specific, and benefit from larger parameter counts and better input features for routing.

We then investigate the importance of using CondConv in the final fully-connected layer in Table 8. The final fully-connected layer accounts for a significant fraction of the parameters and computation cost of the base MobileNetV1(0.25x) network. Even with a standard fully-connected layer, using CondConv layers in the final 8 separable convolutional blocks significantly improves accuracy over the MobileNetV1(0.25x) baseline. This suggests that the learned features themselves are significantly better in CC-MobileNetV1(0.25x) with 32 experts. Introducing CondConv at the final classification layer further improves performance.

5 Analysis

In this section, we aim to gain a better understanding of the experts and routing functions learned by the CC-MobileNetV1 architectures. We study the CC-MobileNetV1(0.5x) architecture with 32 experts per layer, trained on ImageNet with Mixup and AutoAugment, which achieves 71.6% top-1 validation accuracy.

We first study inter-class and intra-class variation of the routing weights at different layers in the network. We evaluate the CC-MobileNetV1(0.5x) model on the 50,000 ImageNet validation examples, and compute the mean and standard deviation of the routing weights for each class. In Figure 4, we visualize the average routing weights for four classes: cliff, pug, goldfish, and plane, chosen following Hu et al. (2018) for their semantic and appearance diversity. As a comparison, we visualize the average routing weights for each expert across all classes.

The distribution of the routing weights is very similar across classes at early layers in the network. At later layers, the routing weights become increasingly class-specific. This shows that features are largely shared between classes at earlier layers of the network, with more differentiation at later layers. Moreover, it may be easier to perform good routing with deeper representations. This agrees with the empirical results in Table 7, which show that CondConv layers at greater depth in the network improve final accuracy more than those at shallower depth.

We then visualize the variation of routing weights within examples of the same class for experts in the final fully-connected layer in Figure 5. For both the goldfish class and the cliff class, some experts are activated with high weight and small variance across examples, suggesting that those experts are specialized for the class. Other experts are activated for some examples in the class but not others, with high variance, which suggests that their features are useful for discriminating between specific examples within the same class.

Figure 6 illustrates the top-10 classes with highest mean routing weight for four different experts in the final fully-connected layer of the network. To visualize, we also show the exemplar image with highest routing weight within each class. Our routing function is trained end-to-end using the final classification loss only, but the experts still learn to specialize in semantically and visually meaningful ways.

Finally, we analyze the distribution of the routing weights of the final fully-connected layer in Figure 7. The routing weights follow a bimodal distribution, with experts receiving routing weights close to 0 or 1. This shows that the experts are sparsely activated, even without regularization, and further suggests specialization of the experts.

6 Discussion

In this paper, we introduced soft conditional computation, a parameter routing approach that enables easy optimization and efficient utilization of all experts. Rather than using hard-routing to choose a subset of experts to evaluate, CondConv applies soft-routing to the weights of all experts first, then applies the expensive convolution of the base network once. This introduces a new paradigm for designing high-capacity models with efficient inference: first, generate an expert for each input example, then perform the expensive computation of applying the expert to the inputs once. The capacity of the network increases with the size and complexity of the expert-generating function, while the inference cost increases over the base network by the cost of the expert-generating function. By designing expert-generating functions with large capacity, we can expand the expressiveness of expensive operations in the base network at small inference cost. This paradigm for conditional computation supports more complex information sharing across experts than previous routing mechanisms, which we believe is a promising direction for future conditional computation models. We hope to further explore the design space and limitations of this paradigm with larger datasets, more complex expert-generating functions, and architecture search to design better base architectures.

References