1 Introduction
It is widely recognized in neuroscience that distinct parts of the brain are highly specialized for different types of tasks [20]. This specialization results not only in high efficiency in handling a response but also in the surprising effectiveness of the brain in learning new events. In machine learning, conditional computation [3] has been proposed to bring a similar mechanism to deep learning models. The basic idea of conditional computation is to involve only a small portion of the model in the prediction for each specific sample. It also means that only a small fraction of parameters needs to be updated at each backpropagation step, which is desirable for training a large model.
One line of work that is closely related to conditional computation is Mixture of Experts (MoE) [15], where multiple subnetworks are combined via an ensemble using weights determined by a gating module. In particular, several recent works [25, 29] propose to ensemble a small subset of dynamically selected experts in the model for each input. By doing so, these models are able to reduce computation cost while achieving similar or even better results than baseline models. Note that both the expert architectures and the number of experts in these works are predefined and fixed. Another line of work that resembles conditional computation focuses on dynamic network configuration [30, 1, 7, 2, 4]. There are no explicitly defined experts in these methods. Rather, they dynamically select the units, layers, or other components of the main model for each input. In these works, one small submodule is usually added to each position to be configured in the model. That is, each submodule makes decisions locally, specific to the components it is configuring.
In this paper, we propose a novel framework called GaterNet for input-dependent dynamic filter selection in convolutional neural networks (CNNs), as shown in Figure 1. We introduce a dedicated subnetwork called the gater, which extracts features from the input and, based on these features, generates all the binary gates needed for controlling filters at once. The gating vector is then used to select the filters in the backbone network (the main model in our framework; the term backbone is also used in the literature of object detection and in TSE-Net [5]), and only the selected filters in the backbone network participate in the prediction and learning. We use a discretization technique called Improved SemHash [17] to enable differentiable training of input-dependent binary gates, such that the backbone and the gater network can be trained jointly via backpropagation.

Compared to previous works on dynamic network configuration, we use a dedicated subnetwork (the gater) for making global decisions on which filters in the backbone network should be used. The decision on each gate (for each filter) is made based on a shared global view of the current input. We argue that such a global gating unit can make more holistic decisions about how to optimally use the filters in the network than the local configuration employed by previous work. Note that in [29], a module of the network is used to generate all the gates, which at a glance is similar to our gater. However, there are two important differences. Firstly, [29] is not an end-to-end approach. It requires a preprocessing step to cluster classes of samples and assign each cluster to a sub-branch of the network to handle. The assignment provides explicit supervision for training the gating module. Secondly, as mentioned above, the sub-branch architectures and the number of branches are both manually defined and fixed throughout training in [29]. In contrast, in our framework, each sample uses a dynamically determined sub-branch depending on the filters being selected. As a result, our method potentially allows a combinatorial number of choices of sub-branches or experts given the number of filters to be controlled, which is more amenable to capturing the complex distributions manifested in the data.
Our experiments on the CIFAR [21] and ImageNet [24] classification datasets show that the gater in GaterNet is able to learn effective gating strategies for selecting proper filters. It consistently improves the original model by a significant margin. On CIFAR-10, our method gives better classification results than state-of-the-art models with only 1.2% additional parameters. Our contributions are summarized as follows:

We propose a new framework for dynamic filter selection in CNNs. The core of the idea is to introduce a dedicated gater network that takes a glimpse of the input and then generates input-dependent binary gates to select filters in the backbone network for processing the input. By using Improved SemHash, the gater network can be trained jointly with the backbone in an end-to-end fashion through backpropagation.

We conduct extensive experiments on GaterNet, which show that it consistently improves the generalization performance of deep CNNs without significantly increasing model complexity. In particular, our models achieve better results than several state-of-the-art models on the CIFAR-10 dataset while introducing only a small fraction of additional parameters.

We perform an in-depth analysis of the model behavior of GaterNet, which reveals that GaterNet learns effective gating strategies by being relatively deterministic in its choice of filters in shallow layers while using more input-dependent filters in deep layers.
2 Related Work
The concept of conditional computation is first discussed by Bengio in [3]. Early works on conditional computation focus on how to select model components on the fly. Bengio et al. have studied four approaches for learning stochastic neurons in fully-connected neural networks for conditional selection in [4]. On the other hand, Davis and Arel have used low-rank approximations to predict the sparse activations of neurons at each layer [6]. Bengio et al. have also tested reinforcement learning to optimize conditional computation policies [2].

More recently, Shazeer et al. have investigated the combination of conditional computation with Mixture of Experts on language modeling and machine translation tasks [25]. At each time step in the sequence model, they dynamically select a small subset of experts to process the input. Their models significantly outperformed state-of-the-art models at a low computation cost. In the same vein, Mullapudi et al. have proposed HydraNets, which use multiple branches of networks for extracting features [29]. In this work, a gating module is introduced to generate decisions on selecting branches for each specific input. As discussed in the introduction, this method requires a preprocessing step of clustering the ground-truth classes to force each branch to learn features for a specific cluster of classes.
Dynamic network configuration is another type of conditional computation that has been studied previously. In this line of work, no parallel experts are explicitly defined. Instead, these methods dynamically configure a single network by selectively activating model components such as units and layers for each input. Adaptive Dropout is proposed by Ba and Frey to dynamically learn a dropout rate for each unit and each input [1]. Denoyer and Gallinari have proposed a tree-structured neural network called Deep Sequential Neural Network [7]. A path from the root to a leaf node in the tree represents a computation sequence for the input, which is also dynamically determined for each input. Recently, Veit and Belongie [30] have proposed to skip layers in ResNet [10] in an input-dependent manner. The resulting model performs better and is more robust to adversarial attacks than the original ResNet, while also reducing computation cost.
Previous works have also investigated methods that dynamically rescale or calibrate the different components in a model. The fundamental difference between these methods and dynamic network configuration is that they generate a real-valued vector for each input, instead of a binary gate vector for selecting network components. SENet, proposed by Hu et al. [12], rescales the channels in feature maps on the fly and achieves state-of-the-art results on the ImageNet classification dataset. Stollenga et al. [27] have also proposed to pass through the main model multiple times. The features resulting from each pass (except the last) are used to generate a real-valued vector for rescaling the channels in the next pass. In contrast to these works, our gater network generates binary decisions to dynamically turn filters on or off depending on each input.
3 GaterNet
Our model contains two convolutional neural subnetworks, namely the backbone network and the gater network, as illustrated in Figure 1. Given an input, the gater network decides which filters in the backbone network to use, while the backbone network makes the actual prediction. The two subnetworks are trained in an end-to-end manner via backpropagation.
3.1 Backbone
The backbone network is the main module of our model; it extracts features from the input and makes the final prediction. Any existing CNN architecture, such as ResNet [10], Inception [28] and DenseNet [13], can be readily used as the backbone network in GaterNet.
Let us first consider a standalone backbone CNN without the gater network. Given an input image $x$, the output of the $l$-th convolutional layer is a 3D feature map $v^l$. In a conventional CNN, the $c$-th channel $v^l_c$ is computed as:

$v^l_c = \phi(W^l_c * u^l)$   (1)

where $v^l_c$ is the $c$-th channel of feature map $v^l$, $W^l_c$ is the $c$-th 3D filter, $u^l$ is the 3D input feature map to the $l$-th layer, $\phi$ denotes the element-wise nonlinear activation function, and $*$ denotes convolution. In general cases without the gater network, all the filters in the current layer are applied to $u^l$, resulting in a dense feature map $v^l$. The loss for training such a CNN for classification is $\mathcal{L}(x, y; \theta)$ for a single input image, where $y$ is the ground-truth label and $\theta$ denotes the model parameters.

3.2 Gater
In contrast to the backbone, the gater network is an assistant to the backbone and does not learn any features directly used in the prediction. Instead, the gater network processes the input to generate an input-dependent gating mask, a binary vector. The vector is then used to dynamically select a particular subset of filters in the backbone network for the current input. Specifically, the gater network learns a function as below:
$G = g(f(x))$   (2)

Here, $f$ is an image feature extractor defined as $f: \mathbb{R}^{h \times w \times c} \to \mathbb{R}^{d}$, with $h$, $w$ and $c$ being the height, width and channel number of an input image respectively, and $d$ being the number of features extracted. $g$ is a function defined as $g: \mathbb{R}^{d} \to \{0, 1\}^{n}$, where $n$ is the total number of filters in the backbone network. More details about the functions $f$ and $g$ will be discussed in Section 3.2.1 and Section 3.2.2 respectively.

From the above definition we can see that the gater network learns a function which maps the input $x$ to a binary gating vector $G$. With the help of $G$, we reformulate the computation of the feature map in Equation (1) as below:
$v^l_c = \begin{cases} \phi(W^l_c * u^l) & \text{if } G^l_c = 1 \\ \mathbf{0} & \text{if } G^l_c = 0 \end{cases}$   (3)

Here $G^l_c$ is the entry in $G$ corresponding to the $c$-th filter at layer $l$, and $\mathbf{0}$ is a 2D feature map with all its elements being 0. That is, the $c$-th filter will be applied to $u^l$ to extract features only when $G^l_c = 1$. If $G^l_c = 0$, the $c$-th filter is skipped and $\mathbf{0}$ is used as the output instead. When $G$ is a sparse binary vector, a large subset of filters will be skipped, resulting in a sparse feature map. In this paper, we implement the computation in Equation (3) by masking the output channels using the binary gates:

$v^l_c = G^l_c \cdot \phi(W^l_c * u^l)$   (4)
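To make Equation (4) concrete, below is a minimal PyTorch-style sketch of channel masking; the module name, the use of ReLU as $\phi$, and the omission of batch normalization and residual connections are our simplifications, not details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConv2d(nn.Module):
    """Conv layer whose output channels are masked by per-input binary gates (Eq. 4)."""

    def __init__(self, in_channels, out_channels, kernel_size, **kwargs):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, **kwargs)

    def forward(self, u, gates):
        # u:     input feature map, shape (batch, in_channels, H, W)
        # gates: binary gate vector for this layer, shape (batch, out_channels)
        v = F.relu(self.conv(u))                       # phi(W * u) for every filter
        return v * gates.unsqueeze(-1).unsqueeze(-1)   # zero out channels whose gate is 0

# Usage: gates normally come from the gater network; here they are faked with random 0/1 values.
layer = GatedConv2d(16, 32, kernel_size=3, padding=1)
u = torch.randn(8, 16, 32, 32)
gates = torch.randint(0, 2, (8, 32)).float()
v = layer(u, gates)   # channels with gate 0 are all-zero feature maps
```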
In the following subsections, we will introduce how we design the functions $f$ and $g$ in Equation (2) and how we enable end-to-end training through the binary gates.
3.2.1 Feature Extractor
Essentially, the function $f$ in Equation (2) is a feature extractor which takes an image as input and outputs a feature vector $f(x) \in \mathbb{R}^{d}$. Similar to the backbone network, any existing CNN architecture can be used here to learn the function $f$. There are two main differences compared with the backbone network: (1) The output layer of the CNN architecture is removed so that it outputs features for use in the next step. (2) The gater CNN does not necessarily need to be as complicated as the one used for the backbone. One reason is that the gater CNN is only supposed to obtain a brief view of the input; an overly complicated gater network may incur high computation cost and optimization difficulties. Another reason is to avoid the gater network accidentally taking over the task that is intended for the backbone network.
3.2.2 Features to Binary Gates
Fully-Connected Layers with Bottleneck
As defined in Equation (2), the function $g$ needs to map the vector $f(x)$ of size $d$ to a binary vector of size $n$. We first consider using fully-connected layers to map $f(x)$ to a real-valued vector $r$ of size $n$. If we use a single layer for this projection, the projection matrix would be of size $d \times n$. This can be very large when $d$ is in the thousands and $n$ is in the tens of thousands. To reduce the number of parameters in this projection, we use two fully-connected layers. The first layer projects $f(x)$ to a bottleneck of size $b$, followed by the second layer mapping the bottleneck to size $n$. In this way, the total number of parameters becomes $d \times b + b \times n$, which can be significantly smaller than $d \times n$ when $b$ is much smaller than $d$ and $n$. We ignore bias parameters here for simplicity.
In summary, the real-valued vector $r$ is computed as:

$r = W_2 \, \mathrm{ReLU}(\mathrm{BatchNorm}(W_1 f(x)))$

where $W_1 \in \mathbb{R}^{b \times d}$ and $W_2 \in \mathbb{R}^{n \times b}$ denote the two linear projections, ReLU denotes the nonlinear activation function in [23], and BatchNorm denotes batch normalization [14].
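Below is a minimal sketch of this bottleneck projection, assuming the BatchNorm-then-ReLU ordering in the formula above; the class name, the example sizes, and the default bottleneck value are illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn

class BottleneckGateHead(nn.Module):
    """Maps the gater's feature vector f(x) (size d) to real-valued gate logits r (size n)
    through a small bottleneck of size b, so the parameter count is d*b + b*n instead of d*n."""

    def __init__(self, d, n, b=8):
        super().__init__()
        self.fc1 = nn.Linear(d, b, bias=False)   # first projection W1: d -> b
        self.bn = nn.BatchNorm1d(b)
        self.fc2 = nn.Linear(b, n, bias=False)   # second projection W2: b -> n

    def forward(self, features):
        # features: (batch, d) output of the gater CNN with its classifier removed
        return self.fc2(torch.relu(self.bn(self.fc1(features))))

head = BottleneckGateHead(d=512, n=7200, b=8)
r = head(torch.randn(4, 512))   # real-valued vector r, to be binarized by Improved SemHash
```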
Improved SemHash
So far, one important question remains unanswered: how do we generate binary gates from $r$ such that we can backpropagate the error through the discrete gates to the gater? In this paper, we adopt a method called Improved SemHash [17, 18].
During training, we first draw a noise vector $\epsilon$ from an $n$-dimensional Gaussian distribution with mean 0 and standard deviation 1. The noise $\epsilon$ is added to $r$ to get a noisy version of the vector: $r' = r + \epsilon$. Two vectors are then computed from $r'$:

$g_\alpha = \sigma'(r'), \qquad g_\beta = \mathbb{1}(r' > 0)$

where $\sigma'$ is the saturating sigmoid function [19, 16]:

$\sigma'(x) = \max(0, \min(1, 1.2\,\sigma(x) - 0.1))$

with $\sigma$ being the sigmoid function. Here, $g_\alpha$ is a real-valued gate vector with all its entries falling in the interval $[0, 1]$, while $g_\beta$ is a binary vector. We can see that $g_\beta$ has the desirable binary property that we want in our model, but the gradient of $g_\beta$ w.r.t. $r'$ is not defined. On the other hand, the gradient of $g_\alpha$ w.r.t. $r'$ is well defined, but $g_\alpha$ is not a binary vector. In forward propagation, we randomly use $g_\alpha$ for half of the training samples and use $g_\beta$ for the rest of the samples. When $g_\beta$ is used, we follow the solution in [17, 18] and, in the backward propagation, define the gradient of $g_\beta$ w.r.t. $r'$ to be the same as the gradient of $g_\alpha$ w.r.t. $r'$.
The above procedure is designed for the sake of easy training. Evaluation and inference differ from the training phase in two aspects. Firstly, we skip the step of drawing noise and always set $\epsilon = 0$. Secondly, we always use the discrete gates $g_\beta$ in forward propagation. That is, the gate vector is always binarized in the evaluation and inference phases. The interested reader is referred to [17, 18] for more intuition behind Improved SemHash.
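A sketch of how this binarization can be implemented is shown below. The straight-through gradient is realized with the standard detach trick, and the exact thresholding and per-sample mixing follow our reading of the description above and of [17, 18]; this is not code from the authors.

```python
import torch

def saturating_sigmoid(x):
    # sigma'(x) = max(0, min(1, 1.2 * sigmoid(x) - 0.1))
    return torch.clamp(1.2 * torch.sigmoid(x) - 0.1, 0.0, 1.0)

def improved_semhash(r, training=True):
    """Turn real-valued gate logits r (batch, n) into (approximately) binary gates.

    During training, Gaussian noise is added and, for a random half of the samples,
    the hard 0/1 gates are used with a straight-through gradient borrowed from the
    saturating sigmoid; the other half uses the real-valued gates directly.
    At evaluation time, noise is dropped and the hard gates are always used.
    """
    noise = torch.randn_like(r) if training else torch.zeros_like(r)
    rn = r + noise
    g_soft = saturating_sigmoid(rn)      # real-valued gates in [0, 1]
    g_hard = (rn > 0).float()            # discrete 0/1 gates
    if not training:
        return g_hard
    # straight-through: forward uses g_hard, backward uses the gradient of g_soft
    g_st = g_soft + (g_hard - g_soft).detach()
    # randomly pick, per sample, whether to use the soft or the hard (straight-through) gates
    use_hard = (torch.rand(r.shape[0], 1, device=r.device) < 0.5).float()
    return use_hard * g_st + (1.0 - use_hard) * g_soft
```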
Sparse Gates
To encourage the gates to be sparse, we introduce a regularization term into the training loss:

$L = \mathcal{L}(x, y; \theta) + \frac{\lambda}{n} \sum_{i=1}^{n} G_i$

where $\lambda$ is the weight of the regularization term and $n$ is the size of $G$. Note that the backbone network receives no gradients from the second term, while the gater network receives gradients from both terms.
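A minimal sketch of the resulting training objective is given below; the function name and the value of the regularization weight are placeholders, and the normalization by the gate count follows our reconstruction of the formula above.

```python
import torch.nn.functional as F

def gaternet_loss(logits, labels, gates, lambda_=0.1):
    """Classification loss plus a term that pushes the gates toward zero (sparsity).

    `gates` are the (soft or hard) gate values produced by the gater, shape (batch, n).
    The regularizer is their mean, so only the gater receives gradients from it.
    lambda_ is a weight chosen for illustration, not a value reported in the paper.
    """
    cls_loss = F.cross_entropy(logits, labels)
    sparsity = gates.mean()   # (1/n) * sum of gate values, averaged over the batch
    return cls_loss + lambda_ * sparsity
```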
3.3 Pretraining
While our model architecture is straightforward, there are several empirical challenges to training it well. First, it is difficult to learn the gates, which are discrete latent representations. Although Improved SemHash has been shown to work well in several previous works, it is unclear whether the gradient approximation mentioned above is a good solution in our model. Second, introducing the gater network into the model essentially changes the optimization landscape, so common parameter initialization and optimization techniques may not be suitable for our model. We leave the exploration of better binarization, initialization and optimization techniques to future work. In this paper, we always initialize the backbone network and the gater network from networks pretrained on the same task, and empirically find that this works well for a range of models.
4 Experiments
We first conduct preliminary experiments on CIFAR [21] with ResNet [10, 11], which give us a good understanding of the performance improvements our method can achieve and of the gating strategies that our gater learns. Then we apply our method to state-of-the-art models on CIFAR-10 and show that we consistently outperform these models. Lastly, we move on to a large-scale classification dataset, ImageNet 2012 [24], and show that our method significantly improves the performance of large models, such as ResNet and Inception-v4 [28], as well.
4.1 Datasets
CIFAR-10 and CIFAR-100 contain natural images belonging to 10 and 100 classes respectively. There are 50,000 training and 10,000 test images. We randomly hold out 5,000 training images as a validation set. All the final results reported on the test images are obtained with models trained on the complete training set. The raw images are 32×32 pixels and we normalize them using the channel means and standard deviations. Standard data augmentation by random cropping and mirroring is applied to the training set. The ImageNet 2012 classification dataset contains 1.28 million training images and 50,000 validation images of 1,000 classes. We use the same data augmentation method as the original papers of the baseline models in Table 3. The images are of size 224×224 in ResNet and 299×299 in Inception-v4.
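For reference, a typical torchvision preprocessing pipeline matching this description might look as follows; the channel statistics are the commonly used CIFAR-10 values and are our assumption, since the paper does not list its exact numbers.

```python
import torchvision.transforms as T

# Commonly used CIFAR-10 channel statistics; the paper does not report its exact values.
CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD = (0.2470, 0.2435, 0.2616)

train_transform = T.Compose([
    T.RandomCrop(32, padding=4),     # standard random cropping with 4-pixel padding
    T.RandomHorizontalFlip(),        # mirroring
    T.ToTensor(),
    T.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])

test_transform = T.Compose([
    T.ToTensor(),
    T.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])
```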
4.2 CIFAR-10 and CIFAR-100
4.2.1 Preliminary Experiments with ResNet
We first validate the effectiveness of our method using ResNet as the backbone network on the CIFAR-10 and CIFAR-100 datasets. We consider a shallow version, ResNet-20, and two deep versions, ResNet-56 and ResNet-164 (our ResNet-164 is slightly different from the one in [11]; the numbers of filters in the first group of residual units are 16, 4, 16 respectively), to gain a better understanding of how our gating strategy can help models with varying capacities. All our gated models employ ResNet-20 as the gater network. Table 1 shows the comparison with baseline models on the test set. ResNet-Wider is the ResNet with additional filters at each layer such that it contains roughly the same number of parameters as our model. ResNet-SE is the ResNet with squeeze-and-excitation blocks [12]. The Gated Filters column shows the number of filters under consideration in our models.
Classification Results
From the table we can see that our model consistently outperforms the original ResNet by a significant margin. On CIFAR-100, the error rate of ResNet-164 is reduced by 1.83%.
It is also evident that our model performs better than ResNet-SE in all cases. Note that our gater network generates binary gates for the backbone network channels, while ResNet-SE rescales the channels. It is interesting that, although our method causes more information loss in the forward pass of the backbone network due to the sparse discrete gates, our model still achieves better generalization performance than ResNet and ResNet-SE. This to some extent validates our assumption that only a subset of filters is needed for the backbone to process an input sample.
In all cases, ResNet-Wider is better than the original ResNet as well. ResNet-20-Wider is even the best among all the shallow models. We hypothesize that ResNet-20 suffers from underfitting due to its small number of filters, and hence adding filters significantly improves the model. On the other hand, although ResNet-20-Gated has a similar number of parameters to ResNet-20-Wider, a significant portion (about half) of its parameters belongs to the gater network rather than directly participating in prediction, and ResNet-20-Gated still performs on par with ResNet-20-Wider.
The backbone network in ResNet-20-Gated thus suffers from underfitting due to the lack of effective filters. The comparison among the deep models validates our hypothesis: ResNet-56 and ResNet-164 contain many more filters than ResNet-20, and adding filters to them yields only a minor improvement (see ResNet-56-Wider and ResNet-164-Wider). In these cases, our models show a significant improvement over the wider models and are the best among all the deep models on both datasets. The comparison with ResNet-Wider shows that the effectiveness of our model is not solely due to the increased number of parameters, but mainly due to our new gating mechanism.
Model | Gated Filters | CIFAR-10 Param | CIFAR-10 Error Rate % | CIFAR-100 Param | CIFAR-100 Error Rate %
ResNet-20 [10] | - | 0.27M | 8.06 | 0.28M | 32.39
ResNet-20-Wider | - | 0.56M | 6.85 | 0.57M | 30.08
ResNet-20-SE [12] | - | 0.28M | 7.81 | 0.29M | 31.22
ResNet-20-Gated (Ours) | 336 | 0.55M | 6.88 (1.18) | 0.60M | 30.79 (1.60)
ResNet-56 [10] | - | 0.86M | 6.74 | 0.86M | 28.87
ResNet-56-Wider | - | 1.08M | 6.72 | 1.09M | 28.39
ResNet-56-SE [12] | - | 0.88M | 6.27 | 0.89M | 28.00
ResNet-56-Gated (Ours) | 1,008 | 1.14M | 5.72 (1.02) | 1.14M | 27.71 (1.16)
ResNet-164 [11] | - | 1.62M | 5.61 | 1.64M | 25.39
ResNet-164-Wider | - | 2.04M | 5.57 | 2.07M | 24.80
ResNet-164-SE [12] | - | 2.00M | 5.51 | 2.02M | 23.83
ResNet-164-Gated (Ours) | 7,200 | 1.96M | 4.80 (0.81) | 1.98M | 23.56 (1.83)
Complexity
At first glance, needing a comprehensive gater network to assist the backbone network may appear to be an issue, as it could greatly increase the number of parameters. However, our experiments show that the gater network does not need to be complex; as a matter of fact, it can be much smaller than the backbone network (see Table 1). Although the number of filters (in the backbone network) under consideration varies from 336 to 7,200, the results show that a simple gater network such as ResNet-20 is powerful enough to learn input-dependent gates for the three backbones, which span a wide range of model capacities. As such, when the backbone network is large (where our method shows more significant improvements over baselines), the parameter overhead introduced by the gater network becomes small. For example, ResNet-164-Gated has only 20% more parameters than ResNet-164. For other, more complex backbone networks such as DenseNet and Shake-Shake, this overhead is reduced to as little as 1.2%, as shown in Table 2. Consequently, the complexity and the number of additional parameters that our method brings to an existing model are relatively small, especially for large models.
Gate Distribution
A natural question is what the distribution of the learned gates looks like. Firstly, it is possible that the gater network is just randomly pruning the backbone network, introducing regularization effects similar to dropout. It is also possible that the gates are always the same for different samples. Secondly, the generated gates may give us good insights into the importance of filters at different layers.
To answer these questions, we analyze the gates generated by the gater network in ResNet-164-Gated. We first conduct forward propagation of the gater network on the CIFAR-10 test set and collect the gates for all the test samples. As expected, three types of gates emerge: gates that are always on for all samples, gates that are always off, and gates that can be on or off conditioned on the input, i.e., input-dependent gates. We show the percentage of the three types of gates at different depths in Figure 2. We can see that a large subset (up to 68.75%) of the gates are always off in the shallow residual blocks. As the backbone network goes deeper, the proportion of always-on and input-dependent gates increases gradually. In the last two residual blocks, input-dependent gates become the largest subset of gates, with percentages of around 45%. This phenomenon is consistent with the common belief that shallow layers usually extract low-level features which are essential for all kinds of samples, while deep layers extract high-level features which are very sample-specific.
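The three gate types can be counted with a simple routine like the sketch below; the array layout, names, and placeholder data are ours and serve only to illustrate the analysis.

```python
import numpy as np

def categorize_gates(gates):
    """Classify each gate as always-on, always-off, or input-dependent.

    gates: array of shape (num_samples, num_gates) holding the 0/1 gate values
    collected by running the gater over the test set.
    """
    on_rate = gates.mean(axis=0)          # fraction of samples for which each gate is 1
    always_on = int(np.sum(on_rate == 1.0))
    always_off = int(np.sum(on_rate == 0.0))
    input_dependent = gates.shape[1] - always_on - always_off
    return always_on, always_off, input_dependent

# Example with fake gates for 10,000 samples and 7,200 filters.
fake = (np.random.rand(10000, 7200) > 0.5).astype(np.float32)
print(categorize_gates(fake))
```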
Although the above figures show that the gater network is learning input-dependent gates, they do not show how often these gates are on or off. For example, a gate that is on for only one test sample but off for the rest would also appear input-dependent. To investigate this further, we collect all the input-dependent gates and plot the distribution of the number of times that they are on in Figure 3. In total, there are 1,567 input-dependent gates out of the 7,200 gates of the backbone network. While many of these gates remain in one state, either on or off, most of the time, there are 1,124 gates that switch on and off more frequently: they are activated for between 100 and 9,900 samples out of the 10,000 test samples.
We also examine how many gates are fired when processing each test example. The minimum and maximum numbers of fired gates per sample are 5,380 and 5,506 respectively, and the average is around 5,453. The number of gates used each time appears to follow a normal distribution (see Figure 4).
Lastly, we want to investigate what gating strategy has been learned by the gater network. To do so, we represent the filter usage of each test sample as a 7,200-dimensional binary vector, where each element indicates whether the corresponding gate is on (1) or off (0). We collect the filter usage vector of each sample, reduce the dimension of these vectors from 7,200 to 400 using Principal Component Analysis (PCA), and then project them onto a 2-dimensional space via t-SNE [22] (see Figure 5).
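This analysis can be reproduced with standard scikit-learn tools, as in the sketch below; the placeholder data and any parameter choices other than the 400 PCA components and 2 t-SNE dimensions are ours.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# usage: (num_samples, 7200) binary filter-usage vectors collected on the test set
usage = (np.random.rand(2000, 7200) > 0.5).astype(np.float32)   # placeholder data

reduced = PCA(n_components=400).fit_transform(usage)    # 7200 -> 400 dimensions
embedded = TSNE(n_components=2).fit_transform(reduced)  # 400 -> 2 dimensions for plotting
# `embedded` can then be scatter-plotted with one color per ground-truth class.
```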
Interestingly, we find that samples of the same class tend to use similar gates. In the figure, each color of dots represents a ground-truth label. This shows that the gater network has learned to turn on similar gates for samples from the same class, so that similar parts of the backbone network are used to process samples from that class. On the other hand, the clusters in Figure 5 are still far from perfectly separating samples of different labels. This is in fact good evidence that the gater network does not accidentally take over the prediction task that the backbone network is intended to do, which is what we want to avoid: we want the gater network to focus on making good decisions about which filters in the backbone network should be used. From this analysis, we can see that the experiments turn out as expected and the backbone network still does the essential part of the prediction needed to achieve high accuracy.

We can draw the following conclusions from the above observations and analyses:

The gater network is capable of learning effective gates for different samples. It tends to generate similar gates for samples from the same class (label).

The residual blocks at shallow layers are more redundant than those at deep layers.

Input-dependent features are needed more at deep layers than at shallow layers.
4.2.2 State-of-the-Art on CIFAR-10
Next we test the performance of our method with the state-of-the-art models Shake-Shake [9] and DenseNet [13] on the CIFAR-10 dataset. We use Shake-Shake and DenseNet as the backbone network and again ResNet-20 as the gater network to form our models. Table 2 summarizes the comparison of our models with the original models. The gater network in our method consistently improves the state-of-the-art backbone networks without significantly increasing the number of parameters. One of our models, Shake-Shake-Gated 26 2x96d, has only 1.2% more parameters than the corresponding baseline model. Another interesting finding is that, with the assistance of the gater network, the smaller gated DenseNet-BC (15.62M parameters) even performs better than both the largest DenseNet-BC and its gated counterpart, despite having much fewer parameters.
Note that [8] shows that when Shake-Shake 26 2x96d is combined with a data preprocessing technique called cutout, it can achieve a 2.56% error rate on the CIFAR-10 test set. This technique is orthogonal to our method and can be combined with it to give better results.
Model | Gated Filters | Param | Error Rate %
DenseNet-BC [13] | - | 0.77M | 4.48
DenseNet-BC-Gated () | 540 | 1.05M | 4.03
DenseNet-BC [13] | - | 15.32M | 3.61
DenseNet-BC-Gated () | 2,880 | 15.62M | 3.31
DenseNet-BC [13] | - | 25.62M | 3.52
DenseNet-BC-Gated () | 3,600 | 25.93M | 3.39
Shake-Shake 26 2x64d [9] | - | 11.71M | 3.05
Shake-Shake-Gated 26 2x64d (Ours) | 3,584 | 12.01M | 2.89
Shake-Shake 26 2x96d [9] | - | 26.33M | 2.82
Shake-Shake-Gated 26 2x96d (Ours) | 5,376 | 26.65M | 2.64
4.3 ImageNet
Classification Results
To test the performance of our method on a large dataset, we apply it to models for ImageNet 2012. We use ResNet [10] and Inception-v4 [28] as the backbone networks and ResNet-18 [10] as the gater network to form our models. Table 3 shows the classification results on the ImageNet validation set, with baselines similar to the settings in Table 1. We can see that our method improves all the models by 0.52% to 1.85% in terms of top-1 error rate, and by 0.14% to 0.78% in terms of top-5 error rate. Note that [30] proposes to dynamically skip layers in ResNet-101, and the top-1 and top-5 error rates of their model are 22.63% and 6.26% respectively. Our ResNet-101-Gated achieves 21.51% and 5.78% on the same task, which is clearly better than their model. In addition, there are two other interesting findings:

The performance of ResNet-101 is significantly boosted with the help of the gater network. ResNet-101-Gated even performs better than ResNet-152 while using much fewer layers.

Similar to the results on the CIFAR datasets, ResNet-Wider performs well when the original model is shallow and small, but is outperformed by our models when the original model contains enough filters.
Model | Gated Filters | Parameters | Top-1 Error % | Top-5 Error %
ResNet-34 [10] | - | 21.80M | 26.56 | 8.48
ResNet-34-Wider | - | 33.89M | 25.36 | 7.91
ResNet-34-SE [12] | - | 21.96M | 26.08 | 8.30
ResNet-34-Gated (Ours) | 3,776 | 34.08M | 26.04 (0.52) | 8.34 (0.14)
ResNet-101 [10] | - | 44.55M | 23.36 | 6.56
ResNet-101-Wider | - | 59.17M | 21.89 | 6.05
ResNet-101-SE [12] | - | 49.33M | 22.38 | 6.07
ResNet-101-Gated (Ours) | 32,512 | 64.21M | 21.51 (1.85) | 5.78 (0.78)
ResNet-152 [10] | - | 60.19M | 22.34 | 6.22
ResNet-152-Wider | - | 81.37M | 21.50 | 5.67
ResNet-152-SE [12] | - | 66.82M | 21.57 | 5.73
ResNet-152-Gated (Ours) | 47,872 | 83.80M | 21.19 (1.15) | 5.45 (0.77)
Inception-v4 [28] | - | 44.50M | 20.33 | 4.99
Inception-v4-Gated (Ours) | 16,608 | 61.67M | 19.64 (0.69) | 4.80 (0.19)
4.4 Implementation Details
We train the baseline model for each architecture by following the training scheme proposed in the original paper. We pretrain the backbone and the gater network on the target task separately to properly initialize the weights. The training scheme here includes configurations such as the number of training epochs, the learning rate, the batch size, the weight decay and so on.
After pretraining, we train the backbone and the gater network jointly as a single model. In addition to following the original training scheme for each backbone architecture, we introduce a few minor modifications. Firstly, we increase the number of training epochs for DenseNet-Gated and Shake-Shake-Gated by 20 and 30 respectively, as they seem to converge slowly at the end of training. Secondly, we set the initial learning rate for DenseNet-Gated and Shake-Shake-Gated to a smaller value, 0.05, since a large learning rate seems to result in poor final performance.
Note that not all the filters in a backbone network are subject to gating in our experiments. When ResNet is used as the backbone, we apply filter selection to the last convolutional layer in each residual unit, which is similar to the SE block in [12]. For DenseNet, we apply filter selection to all the convolutional layers except the first one in each dense block. In Shake-Shake, there are multiple residual branches in each residual block, and we apply filter selection to the last convolutional layer in each branch. In Inception-v4, there are many channel concatenation operations, and we apply filter selection to all the feature maps after the channel concatenation operations.
For all our models on CIFAR, we set the size of the bottleneck layer to 8. ResNet-34-Gated, ResNet-101-Gated and Inception-v4-Gated use a bottleneck size of 256, while ResNet-152-Gated uses 1024.
5 Conclusions
In this paper, we have proposed GaterNet, a novel architecture for input-dependent dynamic filter selection in CNNs. It consists of two distinct components: a backbone network that makes the actual prediction and a gater network that decides which part of the backbone network should be used for processing each input. Extensive experiments on CIFAR and ImageNet show that our models consistently outperform the original models by a large margin. On CIFAR-10, our model improves upon state-of-the-art results. We have also performed an in-depth analysis of the model behavior, which reveals an intuitive gating strategy learned by the gater network.
References
 [1] J. Ba and B. Frey. Adaptive dropout for training deep neural networks. In NIPS, pages 3084–3092, 2013.
 [2] E. Bengio, P.-L. Bacon, J. Pineau, and D. Precup. Conditional computation in neural networks for faster models. arXiv preprint arXiv:1511.06297, 2015.
 [3] Y. Bengio. Deep learning of representations: Looking forward. In Statistical Language and Speech Processing, pages 1–37, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.
 [4] Y. Bengio, N. Léonard, and A. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
 [5] Z. Chen, X. Li, and N. L. Zhang. Learning sparse deep feedforward networks via tree skeleton expansion. CoRR, abs/1803.06120, 2018.
 [6] A. Davis and I. Arel. Low-rank approximations for conditional feedforward computation in deep neural networks. arXiv preprint arXiv:1312.4461, 2013.
 [7] L. Denoyer and P. Gallinari. Deep sequential neural network. In Deep Learning and Representation Learning Workshop, NIPS 2014, 2014.
 [8] T. Devries and G. W. Taylor. Improved regularization of convolutional neural networks with cutout. CoRR, abs/1708.04552, 2017.
 [9] X. Gastaldi. Shake-shake regularization. In ICLR Workshop, 2017.
 [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [11] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV. Springer, 2016.
 [12] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In CVPR, 2018.
 [13] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. CVPR, pages 2261–2269, 2017.
 [14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
 [15] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991.
 [16] Ł. Kaiser and S. Bengio. Can active memory replace attention? In NIPS, pages 3781–3789, 2016.
 [17] Ł. Kaiser and S. Bengio. Discrete autoencoders for sequence models. arXiv preprint arXiv:1801.09797, 2018.
 [18] L. Kaiser, S. Bengio, A. Roy, A. Vaswani, N. Parmar, J. Uszkoreit, and N. Shazeer. Fast decoding in sequence models using discrete latent variables. In ICML, 2018.
 [19] Ł. Kaiser and I. Sutskever. Neural GPUs learn algorithms. arXiv preprint arXiv:1511.08228, 2015.
 [20] E. Kandel, J. Schwartz, and T. Jessell. Principles of neural science. International edition. Elsevier, 1991.
 [21] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
 [22] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
 [23] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, pages 807–814, 2010.

 [24] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
 [25] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In ICLR, 2017.
 [26] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
 [27] M. F. Stollenga, J. Masci, F. Gomez, and J. Schmidhuber. Deep networks with internal selective attention through feedback connections. In NIPS, pages 3545–3553, 2014.

 [28] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, 2017.
 [29] R. Teja Mullapudi, W. R. Mark, N. Shazeer, and K. Fatahalian. HydraNets: Specialized dynamic architectures for efficient inference. In CVPR, June 2018.
 [30] A. Veit and S. J. Belongie. Convolutional networks with adaptive inference graphs. In ECCV, volume 11205 of Lecture Notes in Computer Science, pages 3–18. Springer, 2018.
 [31] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In CVPR, June 2018.
Appendix
Appendix A Distribution of Gates
The numbers of filters in different residual units of ResNet-164 are different. To give a complete view of the gates that our gater is learning, we plot the same figure as Figure 2 but with the number of gates on the y-axis, in Figure 6 below. A gate is an entry in the binary gate vector $G$; it corresponds to a filter in the backbone network ResNet-164. A gate is always off if it is 0 for all the samples in the test set, always on if it is 1 for all the test samples, and input-dependent if it is 1 for some of the test samples and 0 for the others.
Appendix B Scheduled Dropout
In the experiments on DenseNet-Gated, Shake-Shake-Gated and Inception-v4-Gated, scheduled dropout [26], similar to ScheduledDropPath in [31], is applied to the gate vector $G$. We start from a dropout rate of 0.0 and increase it gradually during training. The dropout rate reaches 0.05 at the end of training.
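A sketch of this schedule is shown below, assuming a linear increase of the dropout rate over epochs; the paper only specifies the start (0.0) and end (0.05) values, so the linear shape and the function signature are our assumptions.

```python
import torch

def scheduled_gate_dropout(gates, epoch, total_epochs, final_rate=0.05, training=True):
    """Apply dropout to the gate vector with a rate that grows linearly from 0.0 to
    `final_rate` over training. The linear schedule is an assumption; the paper only
    states that the rate starts at 0.0 and reaches 0.05 at the end of training."""
    if not training:
        return gates
    rate = final_rate * epoch / max(1, total_epochs - 1)
    keep = (torch.rand_like(gates) > rate).float()   # randomly zero a `rate` fraction of gates
    return gates * keep
```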