Our main goal is to optimize the convolutions performed by the Convolutional Neural Network (CNN) in a conditional computation setting. We use off-the-shelf neural architectures for both the VQA v2 and CLEVR datasets and replace only the CNN with a modularized version of the ResNeXt CNN, as described below. The details of the convolutional architectures used in the VQA v2 and CLEVR models are given in Table 3 and Table 4 respectively in Appendix D.
2 Bottleneck convolutional block
Figure 1 (a-d) shows the transformation of the modularized ResNeXt-101 residual block into its grouped convolution form. The same technique applies to the convolutional block used for the CLEVR dataset. We introduce a gating mechanism to assign weights to each of the paths (32 in the example shown). We treat each path as a convolutional module that can potentially specialize in a specific function. The gate values are normalized to sum to unity and are conditioned on the LSTM-based feature representation of the question. The working of the gate controller is detailed in Section 2.1. See Figure 2 (Appendix E) for the original ResNeXt-101 residual block.
In order to optimize the computation of a ResNeXt residual block, we execute just the top-$k$ (out of 32) paths and zero out the contribution of the others. This is based on the hypothesis that the gate controller determines the most important modules (aka paths) to execute by assigning them higher weights. In our efficient implementation, we avoid executing the groups which do not fall in the top-$k$. More technically, we aggregate the non-contiguous groups of the input feature map which fall in the top-$k$ into a new, smaller feature map. We apply the same trick to the corresponding convolutional and batch-norm weights and biases.
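The group-gathering trick described above can be sketched in PyTorch. This is an illustrative sketch, not the paper's code: the function name, tensor shapes, and cardinality default are our own assumptions.

```python
import torch

def gather_topk_groups(x, weight, gate, k, groups=32):
    """Keep only the k highest-gated groups of a grouped convolution.

    x:      (N, C_in, H, W) input feature map
    weight: (C_out, C_in // groups, kH, kW) grouped-conv weight
    gate:   (groups,) gate values summing to 1
    Returns the reduced input, reduced weight, and surviving gate values.
    """
    topk = torch.topk(gate, k).indices            # indices of the k best groups
    cin_g = x.shape[1] // groups                  # input channels per group
    cout_g = weight.shape[0] // groups            # output channels per group

    # Aggregate the (possibly non-contiguous) selected groups into new,
    # dense feature-map and weight tensors.
    x_idx = torch.cat([torch.arange(g * cin_g, (g + 1) * cin_g) for g in topk])
    w_idx = torch.cat([torch.arange(g * cout_g, (g + 1) * cout_g) for g in topk])
    return x[:, x_idx], weight[w_idx], gate[topk]
```

The reduced tensors can then be fed to an ordinary grouped convolution with `groups=k`, so only the selected paths are ever computed.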
Computational complexity of the ResNeXt convolutional block (in terms of floating point operations)***No. of FLOPS of a convolutional block (no grouping) $= C_{in} \cdot C_{out} \cdot K^2 \cdot H_{out} \cdot W_{out}$; no. of FLOPS of a convolutional block (with grouped convolution) $= \frac{C_{in} \cdot C_{out} \cdot K^2 \cdot H_{out} \cdot W_{out}}{G}$, where $C_{in}$, $C_{out}$, $K$, $G$, $H_{out}$, $W_{out}$ denote the number of input channels, number of output channels, kernel size, number of groups, and output feature map height and width respectively.
Notation: conv-reduce, conv-conv and conv-expand denote the $1\times1$, $3\times3$ and $1\times1$ convolutional layers in a ResNeXt convolutional block (in that order).
The implementation of the modularized ResNeXt block is more efficient than the regular implementation when $k$ is smaller than the cardinality (32 here). The comparison of FLOPS for varying values of the hyper-parameter $k$ is shown in Table 1 for the VQA v2 model and Table 2 for the CLEVR model.
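The footnoted FLOPS formula can be checked with a small helper; this is a sketch with an illustrative name, counting multiply-accumulate operations only.

```python
def conv_flops(c_in, c_out, k, h_out, w_out, groups=1):
    """Multiply-accumulate count of a k x k convolution producing an
    h_out x w_out output map; grouping divides the cost by `groups`."""
    return c_in * c_out * k * k * h_out * w_out // groups
```

For example, a $3\times3$ convolution with 64 input and output channels on a $56\times56$ output map costs 32 times less with 32 groups than without grouping.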
2.1 Working of gate controller
The function of the gate controller is to choose the set of experts which are the most important. The gate controller predicts a soft attention 
over the image grid by combining visual and question features. The weighted visual feature vector is then summed with the question feature representation to get a query vector in the multimodal space. This new query vector is fed to an MLP which predicts attention weights for the set of experts. The experts whose weights fall in the top-$k$ are selected for execution, and their outputs are weighted by the assigned gate values. See Appendix B for more details on the attention mechanism.
3.1 VQA v2 dataset
We use the Bottom-up attention model for the VQA v2 dataset as proposed by Teney et al. as our base model and replace the CNN sub-network with our custom CNN. A schematic diagram illustrating the working of this model is given in Figure 3 (Appendix E). The results (see Table 1) show that the loss in accuracy is minimal across the lower sparsity settings. However, at the highest sparsity setting‡‡‡Here, $x\%$ sparsity means that $x\%$ of the modules/paths in the ResNeXt convolutional block are turned off., there is a marked loss in overall accuracy.
3.2 CLEVR dataset
We use the Relational Networks model because it is one of the few models which is fully supervised and trains the CNN in the main model pipeline. We replace the vanilla CNN used in their model with our modularized CNN and report the results on the CLEVR dataset. A diagram illustrating the working of this model is shown in Figure 4 (Appendix E). The CNN used for this model has four layers with one residual ResNeXt block each, followed by a convolutional layer. The results (see Table 2) show that, with only a slight dip in performance, the model with sparsity in the convolutional ResNeXt block performs comparably to the one without it.
|Architecture for CNN|
|CNN model description | FLOPS (CNN)∥∥∥The FLOPS calculation assumes a fixed input image size. | Val. Acc. (%)|
We presented a general framework for utilizing conditional computation to sparsely execute a subset of modules in a convolutional block of the ResNeXt model. The amount of sparsity is a user-controlled hyper-parameter which can be used to turn off the less important modules conditioned on the question representation, thereby increasing computational efficiency. Future work may include studying the utility of this technique in other multimodal machine learning applications which support the use of conditional computation.
-  Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. Conditional computation in neural networks for faster models. arXiv preprint arXiv:1511.06297, 2015.
-  Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
-  Tolga Bolukbasi, Joseph Wang, Ofer Dekel, and Venkatesh Saligrama. Adaptive neural networks for efficient inference. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 527–536. JMLR. org, 2017.
-  Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, and Ramakant Nevatia. Abc-cnn: An attention based convolutional neural network for visual question answering. CoRR, abs/1511.05960, 2015.
-  Andrew Davis and Itamar Arel. Low-rank approximations for conditional feedforward computation in deep neural networks. arXiv preprint arXiv:1312.4461, 2013.
-  Reza Ebrahimpour, Ehsanollah Kabir, Hossein Esteky, and Mohammad Reza Yousefi. View-independent face recognition with mixture of experts. Neurocomputing, 71(4-6):1103–1107, 2008.
-  Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
-  Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 7, 2017.
-  Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens van der Maaten, and Kilian Q Weinberger. Multi-scale dense networks for resource efficient image classification. arXiv preprint arXiv:1703.09844, 2017.
-  Gao Huang, Shichen Liu, Laurens Van der Maaten, and Kilian Q Weinberger. Condensenet: An efficient densenet using learned group convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2752–2761, 2018.
-  Robert A Jacobs, Michael I Jordan, Steven J Nowlan, Geoffrey E Hinton, et al. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991.
-  Jason Jo, Vikas Verma, and Yoshua Bengio. Modularity matters: Learning invariant relational reasoning tasks. arXiv preprint arXiv:1806.06765, 2018.
-  Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 1988–1997. IEEE, 2017.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
-  Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
-  Minh Ha Nguyen. Cooperative coevolutionary mixture of experts: a neuro ensemble approach for automatic decomposition of classification problems. University of New South Wales, Australian Defence Force Academy, School of …, 2006.
-  Hyeonwoo Noh, Paul Hongsuck Seo, and Bohyung Han. Image question answering using convolutional neural network with dynamic parameter prediction. arXiv preprint arXiv:1511.05756, 2015.
-  Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
-  Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In Advances in neural information processing systems, pages 4967–4976, 2017.
-  Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
-  Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
-  Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
-  Damien Teney, Peter Anderson, Xiaodong He, and Anton van den Hengel. Tips and tricks for visual question answering: Learnings from the 2017 challenge. arXiv preprint arXiv:1708.02711, 2017.
-  Steven Richard Waterhouse. Classification and regression using mixtures of experts. PhD thesis, Citeseer, 1998.
-  Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017.
-  Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 21–29, 2016.
Appendix A Related Work
Mixture of Experts Mixture of Experts is a formulation in machine learning which employs the divide-and-conquer principle to solve a complex problem by dividing the neural network into several expert networks. In the Mixture of MLP Experts (MME) method [28, 18, 6], a gating network is used to assign weights to the outputs of the corresponding expert networks. The gating network is an MLP followed by a softmax operator. The final output of the ensemble is a weighted sum of the outputs of the individual expert networks. Jo et al. use a mixture of experts for visual reasoning tasks in which each expert is a stack of residual blocks.
Conditional Computation Conditional Computation is the technique of activating only a sub-portion of the neural network depending on the inputs. For instance, if a visual question answering system has to count instances of a specified object versus tell the color of an object, the specific features needed to give the correct answer differ between the two cases. Hence, there is potential to reduce the amount of computation the network has to perform in each case, which can be especially useful for training large deep networks efficiently. The use of stochastic neurons with binary outputs to selectively turn off experts in a neural network has been explored by Bengio et al.
Davis and Arel use a low-rank approximation of the weight matrix of an MLP to compute the sign of pre-nonlinearity activations. In the case of the ReLU activation function, this sign estimate is then used to optimize the matrix multiplication of the MLP layer.
Bengio et al. use policy gradients to sparsely activate units in feed-forward neural networks by relying on conditional computation. Shazeer et al. propose the Sparsely-Gated Mixture-of-Experts (MoE) layer, which uses conditional computation to train huge-capacity models on a low computational budget for language modeling and machine translation tasks. It makes use of a noisy top-$k$ mechanism in which random noise is added to the gating weights before the top-$k$ weights are selected. Another line of work makes use of conditional computation in the VQA setting. DPPNet makes use of a dynamic parameter layer (fully-connected) conditioned on the question representation for VQA. ABC-CNN predicts convolutional kernel weights using an MLP which takes the question representation as input. Here, the advantage is that the question-conditioned convolutional kernels can filter out unrelated image regions in the visual processing pipeline itself.
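The noisy top-$k$ mechanism can be sketched as follows. This is a simplified version: Shazeer et al. additionally learn a per-expert noise scale, which we replace here with a fixed `noise_std` for illustration.

```python
import torch

def noisy_topk_gates(logits, noise_std, k):
    """Perturb gate logits with Gaussian noise, keep the k largest,
    and renormalize with a softmax over the survivors (others get 0)."""
    noisy = logits + torch.randn_like(logits) * noise_std
    topv, topi = torch.topk(noisy, k)
    gates = torch.zeros_like(logits)
    gates[topi] = torch.softmax(topv, dim=0)  # sparse gate vector summing to 1
    return gates
```

The noise encourages exploration across experts during training; at `noise_std=0` the mechanism reduces to plain top-$k$ selection.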
Computationally efficient CNNs Huang et al. propose Multi-Scale Dense Convolutional Networks (MSDNet) to address two key settings: (i) budgeted classification, where the computation budget is distributed unevenly across easy and hard examples, and (ii) anytime prediction, where the network can output a prediction at any layer depending on the computation budget without significant loss in accuracy. The optimization of the computational complexity of CNNs at inference time has been studied by Bolukbasi et al., who learn an adaptive early-exit strategy that bypasses some of the network's layers in order to save computation. MobileNet uses depth-wise separable convolutions to build lightweight CNNs for deployment on mobile devices.
CNN architectures The invention of Convolutional Neural Networks (CNNs) has led to a remarkable improvement in performance for many computer vision tasks [16, 17, 20]. In recent years, there has been a spate of different CNN architectures with changes to depth, topology [8, 25], etc. The use of a split-transform-merge strategy for designing convolutional blocks (which can be stacked to form the complete network) has shown promise for achieving top performance at lower computational complexity [25, 24, 26, 29]. The ResNeXt CNN model proposes cardinality (the size of the set of transformations in a convolutional block) as another dimension, apart from depth and width, to investigate for improving the performance of convolutional neural networks. Squeeze-and-Excitation Networks propose channel-wise attention in a convolutional block and helped improve the state of the art in the ILSVRC 2017 classification competition.
Grouped Convolution In grouped convolution, each filter convolves only with the input feature maps in its group. Grouped convolutions were first used in AlexNet to train a large network across 2 GPUs. A recently proposed CNN architecture named CondenseNet makes use of learned grouped convolutions to minimise superfluous feature re-use and achieve computational efficiency.
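The cost saving of grouping is easy to verify with the `groups` argument of PyTorch's `nn.Conv2d` (the channel counts below are illustrative):

```python
import torch.nn as nn

# A standard 3x3 convolution versus its 32-group counterpart:
dense   = nn.Conv2d(128, 128, kernel_size=3, padding=1, bias=False)
grouped = nn.Conv2d(128, 128, kernel_size=3, padding=1, groups=32, bias=False)

n_dense   = sum(p.numel() for p in dense.parameters())    # 128 * 128 * 9
n_grouped = sum(p.numel() for p in grouped.parameters())  # 128 * (128/32) * 9
assert n_dense == 32 * n_grouped  # grouping divides parameters (and FLOPS) by G
```

Each of the 32 filters in a group sees only its own 4 input channels, which is exactly the path structure exploited by the modularized ResNeXt block.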
Appendix B Gate controller
The gate controller takes as input the LSTM-based representation of the question and the intermediate convolutional map output by the previous block. Given the image features and the question features, we fuse them, then apply a linear layer followed by a softmax to generate attention over the pixels of the image feature input.
This pixel-wise attention is then used to modulate the image features, and the resulting feature vector is summed with the question feature vector to obtain the combined query vector in multimodal space. The gating weights are obtained by an MLP followed by a ReLU (Rectified Linear Unit) activation on the query vector and subsequent L1 normalization.
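A minimal sketch of this gate controller is given below. The layer sizes, fusion operator (concatenation plus tanh), and class name are our assumptions, not the paper's exact configuration; only the overall flow (pixel attention, multimodal query, MLP, ReLU, L1 normalization) follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GateController(nn.Module):
    def __init__(self, d_img, d_q, d_joint, n_experts):
        super().__init__()
        self.fuse = nn.Linear(d_img + d_q, d_joint)   # fusion of image + question
        self.att = nn.Linear(d_joint, 1)              # pixel-wise attention logits
        self.proj_v = nn.Linear(d_img, d_q)           # visual vector -> question space
        self.mlp = nn.Linear(d_q, n_experts)          # expert weight predictor

    def forward(self, feat_map, q):
        # feat_map: (N, d_img, H, W); q: (N, d_q)
        N, C, H, W = feat_map.shape
        v = feat_map.flatten(2).transpose(1, 2)                  # (N, H*W, d_img)
        fused = torch.tanh(self.fuse(torch.cat(
            [v, q.unsqueeze(1).expand(-1, H * W, -1)], dim=-1)))
        alpha = F.softmax(self.att(fused), dim=1)                # attention over pixels
        v_att = (alpha * v).sum(dim=1)                           # weighted visual vector
        query = self.proj_v(v_att) + q                           # multimodal query
        g = F.relu(self.mlp(query))                              # non-negative gates
        return g / g.sum(dim=-1, keepdim=True).clamp_min(1e-8)   # L1 normalization
```

The returned vector is the (sparse-able) gate distribution over the experts of one convolutional block.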
Appendix C Additional Training details
We add an additional loss term which equals the square of coefficient of variation (CV) of gate values for each convolutional block.
This helps balance out the variation in gate values; otherwise, the weights of the modules which happen to be activated early in training grow in magnitude, and this behavior reinforces itself as training progresses.
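This auxiliary loss can be sketched as follows (the function name is ours; `eps` guards against division by zero):

```python
import torch

def load_balancing_loss(gates, eps=1e-8):
    """Squared coefficient of variation (CV) of the gate values:
    zero when all modules receive equal weight, large when a few
    modules dominate."""
    mean = gates.mean()
    std = gates.std(unbiased=False)
    return (std / (mean + eps)) ** 2
```

Adding this term to the task loss penalizes skewed gate distributions, discouraging the rich-get-richer dynamic described above.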
Appendix D CNN layouts
7×7, 64, stride 2
3×3 max pool, stride 2
|conv1| 64×64 |3×3, 64, stride 2|
|conv2| 32×32 |3×3 max pool, stride 2|