Learning Sparse Mixture of Experts for Visual Question Answering

09/19/2019 ∙ by Vardaan Pahuja, et al. ∙ 0

There has been rapid progress on the task of Visual Question Answering (VQA), driven by improved model architectures. Unfortunately, these models are usually computationally intensive due to their sheer size, which poses a serious challenge for deployment. We aim to tackle this issue for the specific task of VQA. A Convolutional Neural Network (CNN) is an integral part of the visual processing pipeline of a VQA model (assuming the CNN is trained along with the entire VQA model). In this project, we propose an efficient and modular neural architecture for the VQA task with a focus on the CNN module. Our experiments demonstrate that a sparsely activated CNN-based VQA model achieves performance comparable to a standard CNN-based VQA model architecture.




1 Introduction

Our main goal is to optimize the convolutions performed by the Convolutional Neural Network (CNN) in a conditional computation setting. We use an off-the-shelf neural architecture for both the VQA v2 [7] and CLEVR [15] datasets and simply replace the CNN with a modularized version of the ResNeXt [29] CNN, as described below. The details of the convolutional architectures used in the VQA v2 model and the CLEVR model are given in Table 3 and Table 4, respectively, in Appendix D.

2 Bottleneck convolutional block

Figure 1 (a-d) shows the transformation of the modularized ResNeXt-101 residual block into its grouped-convolution form. The same technique applies to the convolutional block used for the CLEVR dataset. We introduce a gating mechanism to assign weights to each of the paths (32 in the example shown). We treat each path as a convolutional module that can potentially specialize in a specific function. The gate values are normalized to sum to unity and are conditioned on the LSTM-based feature representation of the question. The working of the gate controller is detailed in Section 2.1. See Figure 2 (Appendix E) for the original ResNeXt-101 residual block.

Figure 1: Architecture of the residual block of the conditional gated ResNeXt-101, assuming we choose to turn ON the k paths with the highest gating weights. The indices shown denote the groups that fall in the top-k.

In order to optimize the computation of a ResNeXt residual block, we execute just the top-k (out of 32) paths and zero out the contribution of the others. This is based on the hypothesis that the gate controller determines the most important modules (i.e., paths) to execute by assigning them higher weights. In our efficient implementation, we avoid executing the groups which do not fall in the top-k. More concretely, we aggregate the non-contiguous groups of the input feature map which fall in the top-k into a new feature map, and apply the same gathering to the corresponding convolutional and batch-norm weights and biases.
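As an illustrative sketch of this gather-and-execute trick (a NumPy toy for a 1×1 grouped convolution, not the paper's actual PyTorch implementation; all names here are hypothetical):

```python
import numpy as np

def sparse_grouped_conv1x1(x, weight, gates, k):
    """Apply a grouped 1x1 convolution on only the top-k groups.

    x:      input feature map, shape (g * c_in, H, W)
    weight: per-group 1x1 conv weights, shape (g, c_out, c_in)
    gates:  gate values, shape (g,), assumed normalized to sum to 1
    k:      number of groups (paths) to execute

    Groups outside the top-k are zeroed out, mirroring the efficient
    implementation that never executes them at all.
    """
    g, c_out, c_in = weight.shape
    H, W = x.shape[1:]
    out = np.zeros((g * c_out, H, W))
    topk = np.argsort(gates)[-k:]          # indices of the k largest gates
    for i in topk:
        xi = x[i * c_in:(i + 1) * c_in]    # gather this group's channels
        # a 1x1 conv is a matrix multiply over the channel dimension
        yi = np.tensordot(weight[i], xi, axes=([1], [0]))
        out[i * c_out:(i + 1) * c_out] = gates[i] * yi
    return out
```

The real implementation additionally gathers the selected groups into one contiguous feature map so a single grouped-convolution kernel can run over them.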

Computational complexity of the ResNeXt convolutional block (in terms of floating-point operations): the number of FLOPs of a convolutional layer without grouping is C_in · C_out · K² · H_out · W_out, and with grouped convolution it is (C_in · C_out · K² · H_out · W_out) / g, where C_in, C_out, K, g, H_out, and W_out denote the number of input channels, number of output channels, kernel size, number of groups, and output feature map height and width, respectively.
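The formula above can be checked with a few lines of Python (a sketch; the layer sizes below are made up for illustration):

```python
def conv_flops(c_in, c_out, kernel, h_out, w_out, groups=1):
    """Multiply-accumulate count of a (possibly grouped) convolution:
    each of the h_out * w_out output positions of each of the c_out
    output channels sees c_in / groups input channels through a
    kernel x kernel window."""
    return (c_in // groups) * c_out * kernel * kernel * h_out * w_out

# Grouping with g groups cuts the FLOPs of the layer by a factor of g.
dense = conv_flops(256, 256, 3, 56, 56)
grouped = conv_flops(256, 256, 3, 56, 56, groups=32)
```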

Notation: conv-reduce, conv-conv, and conv-expand denote the 1×1, 3×3, and 1×1 convolutional layers in a ResNeXt convolutional block (in that order).

The implementation of the modularized ResNeXt block is more efficient than the regular implementation when k is sufficiently small relative to the total number of paths. The comparison of FLOPs for varying values of the hyper-parameter k is shown in Table 1 for the VQA v2 model and Table 2 for the CLEVR model.

2.1 Working of gate controller

The function of the gate controller is to choose the set of experts which are the most important. The gate controller predicts a soft attention [30] over the image grid by combining visual and question features. The attention-weighted visual feature vector is then summed with the question feature representation to obtain a query vector in the multimodal space. This query vector is fed to an MLP which predicts attention weights for the set of experts. The experts whose weights fall in the top-k are selected for execution, and their outputs are weighted by the assigned gate values. See Appendix B for more details on the attention mechanism.
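A minimal NumPy sketch of this control flow (the dimensions, the weight names w_att and W_mlp, and the elementwise fusion are illustrative assumptions, not the paper's exact parameterization):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def gate_controller(img_feat, q_feat, w_att, W_mlp, k, eps=1e-8):
    """img_feat: (d, N) image features over N grid positions,
    q_feat: (d,) question feature, w_att: (d,), W_mlp: (g, d)."""
    fused = img_feat * q_feat[:, None]      # fuse visual and question features
    att = softmax(w_att @ fused)            # soft attention over the N pixels
    v = img_feat @ att                      # attention-weighted visual vector
    query = v + q_feat                      # multimodal query vector
    gates = np.maximum(W_mlp @ query, 0.0)  # MLP + ReLU
    gates = gates / (gates.sum() + eps)     # L1 normalization
    topk = np.argsort(gates)[-k:]           # experts selected for execution
    return gates, topk
```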

3 Experiments

3.1 VQA v2 dataset

We use the Bottom-up attention model for the VQA v2 dataset proposed in [27] as our base model and replace its CNN sub-network with our custom CNN. A schematic diagram illustrating this model is given in Figure 3 (Appendix E). The results (see Table 1) show a very minimal loss in accuracy when going from 0% to 50% sparsity (here, x% sparsity means that x% of the modules/paths in the ResNeXt convolutional block are turned off). However, with 75% sparsity, there is a marked loss in overall accuracy.

3.2 CLEVR dataset

We use the Relational Networks model [21] because it is one of the few models which is fully supervised and trains the CNN in the main model pipeline. We replace the vanilla CNN used in their model with our modularized CNN and report results on the CLEVR dataset. A diagram illustrating this model is shown in Figure 4 (Appendix E). The CNN used for this model has four layers with one residual ResNeXt block each, followed by a convolutional layer. The results (see Table 2) show that, with only a slight dip in performance, the model which uses 50% sparsity is comparable to the one which does not have sparsity in the convolutional ResNeXt block.

| Architecture for CNN | FLOPS (CNN)* | Acc. (%)** |
| --- | --- | --- |
| ResNeXt-32 (101 x 32d) | 156.04E+09 | 54.51 |
| Modular ResNeXt-32 (101 x 32d), k = 32 (0% sparsity) | 181.39E+09 | 54.90 |
| Modular ResNeXt-32 (101 x 32d), k = 16 (50% sparsity) | 77.72E+09 | 54.47 |
| Modular ResNeXt-32 (101 x 32d), k = 8 (75% sparsity) | 45.94E+09 | 51.28 |

*The FLOPS calculation assumes a fixed input image size.
**The baseline model doesn't use R-CNN based features, so the accuracy is not directly comparable with state-of-the-art approaches.

Table 1: Results on VQA v2 validation set
| CNN model description | FLOPS (CNN)* | Val. Acc. (%) |
| --- | --- | --- |
| Modular CNN, k = 12 (0% sparsity) | 5.37E+07 | 94.05 |
| Modular CNN, k = 6 (50% sparsity) | 3.21E+07 | 92.23 |

*The FLOPS calculation assumes a fixed input image size.

Table 2: Results on CLEVR v1.0 validation set (overall accuracy)

4 Conclusion

We presented a general framework for utilizing conditional computation to sparsely execute a subset of modules in a convolutional block of the ResNeXt model. The amount of sparsity is a user-controlled hyper-parameter which turns off the less important modules conditioned on the question representation, thereby increasing computational efficiency. Future work may include studying the utility of this technique in other multimodal machine learning applications which lend themselves to conditional computation.


Appendix A Related Work

Mixture of Experts Mixture of Experts [13] is a machine learning formulation which employs the divide-and-conquer principle to solve a complex problem by dividing the neural network into several expert networks. In the Mixture of MLP Experts (MME) method [28, 18, 6], a gating network is used to assign weights to the outputs of the corresponding expert networks. The gating network is an MLP followed by a softmax operator. The final output of the ensemble is a weighted sum of the outputs of the individual expert networks. [14] uses a mixture of experts for visual reasoning tasks in which each expert is a stack of residual blocks.
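As a toy example of this weighted-sum formulation (with trivially simple stand-in "experts"):

```python
import numpy as np

def moe_forward(x, experts, gate_logits):
    """Soft mixture-of-experts: a softmax over the gating network's
    logits weights the experts' outputs into a single prediction."""
    e = np.exp(gate_logits - gate_logits.max())
    weights = e / e.sum()
    return sum(w * expert(x) for w, expert in zip(weights, experts))

# Two toy experts; equal logits give each a weight of 0.5.
experts = [lambda x: x, lambda x: 2 * x]
y = moe_forward(2.0, experts, np.array([0.0, 0.0]))  # 0.5*2 + 0.5*4 = 3.0
```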

Conditional Computation Conditional computation is the technique of activating only a sub-portion of the neural network depending on the input. For instance, if a visual question answering system has to count the instances of a specified object vs. tell the color of an object, the specific features needed to give the correct answer differ between the two cases. Hence, there is potential to reduce the amount of computation the network has to perform in each case, which can be especially useful for training large deep networks efficiently. The use of stochastic neurons with binary outputs to selectively turn off experts in a neural network has been explored in [2]. [5] uses a low-rank approximation of the weight matrix of an MLP to compute the sign of the pre-nonlinearity activations; in the case of the ReLU activation function, this sign is then used to optimize the matrix multiplication for the MLP layer.


Another work uses policy gradients to sparsely activate units in feed-forward neural networks by relying on conditional computation. [22] proposes the Sparsely-Gated Mixture-of-Experts (MoE) layer, which uses conditional computation to train huge-capacity models on a low computational budget for language modeling and machine translation tasks. It makes use of a noisy top-k mechanism in which random noise is added to the gating weights and then the top-k weights are selected. Another line of work makes use of conditional computation in the VQA setting. DPPNet [19] makes use of a dynamic parameter layer (fully connected) conditioned on the question representation for VQA. ABC-CNN [4] predicts convolutional kernel weights using an MLP which takes the question representation as input. Here, the advantage is that the question-conditioned convolutional kernels can filter out unrelated image regions in the visual processing pipeline itself.
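In the spirit of the noisy top-k mechanism of [22], a simplified sketch (the original additionally learns a per-expert noise scale, which is omitted here):

```python
import numpy as np

def noisy_top_k(logits, k, noise_std=1.0, rng=None):
    """Noisy top-k gating: add Gaussian noise to the gate logits, keep
    the k largest, and softmax over only those; all other experts get
    weight 0, so their forward passes can be skipped entirely."""
    rng = rng or np.random.default_rng()
    noisy = logits + noise_std * rng.standard_normal(logits.shape)
    topk = np.argsort(noisy)[-k:]
    gates = np.zeros_like(logits)
    e = np.exp(noisy[topk] - noisy[topk].max())
    gates[topk] = e / e.sum()
    return gates
```

The added noise encourages exploration across experts during training, while the hard top-k cutoff keeps the computation sparse.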

Computationally efficient CNNs [11] proposes Multi-Scale Dense Convolutional Networks (MSDNet) to address two key issues: (i) budgeted classification, distributing the computation budget unevenly across easy and hard examples, and (ii) anytime prediction, where the network can output a prediction at any layer, depending on the computation budget, without significant loss in accuracy. The optimization of the computational complexity of CNNs at inference time has been studied in [3], in which an adaptive early-exit strategy is learned to bypass some of the network's layers in order to save computation. MobileNet [9] uses depth-wise separable convolutions to build lightweight CNNs for deployment on mobile devices.

CNN architectures The invention of Convolutional Neural Networks (CNNs) has led to a remarkable improvement in performance for many computer vision tasks [16, 17, 20]. In recent years, there has been a spate of different CNN architectures with changes to depth [23], topology [8, 25], etc. The use of a split-transform-merge strategy for designing convolutional blocks (which can be stacked to form the complete network) has shown promise for achieving top performance with lower computational complexity [25, 24, 26, 29]. The ResNeXt [29] CNN model proposes cardinality (the size of the set of transformations in a convolutional block) as another dimension to investigate, apart from depth and width, for improving the performance of convolutional neural networks. Squeeze-and-Excitation Networks [10] propose channel-wise attention in a convolutional block and helped improve the state of the art in the ILSVRC 2017 classification competition.

Grouped Convolution In grouped convolution, each filter convolves only with the input feature maps in its group. Grouped convolutions were first used in AlexNet [16] to train a large network on 2 GPUs. A recently proposed CNN architecture named CondenseNet [12] makes use of learned grouped convolutions to minimise superfluous feature reuse and achieve computational efficiency.

Appendix B Gate controller

The gate controller takes as input the LSTM-based representation of the question and the intermediate convolutional map output by the previous block. Given the image features and the question features, we fuse the two, then apply a linear layer and a softmax to generate attention over the pixels of the image feature input.

This pixel-wise attention is then used to modulate the image features, and the resulting feature vector is summed with the question feature vector to obtain the combined query vector in the multimodal space. The gating weights are obtained by applying an MLP followed by a ReLU (Rectified Linear Unit) activation to the query vector, with subsequent L1 normalization.


Appendix C Additional Training details

We add an additional loss term which equals the square of the coefficient of variation (CV) of the gate values for each convolutional block. This helps to balance out the variation in gate values [22]; otherwise, the weights corresponding to the modules which get activated early in training increase in magnitude, and this behavior reinforces itself as training progresses.
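A sketch of that auxiliary loss term (a plain-Python version; the actual model would compute it on the tensor of gate values per block):

```python
def cv_squared(gate_values, eps=1e-10):
    """Square of the coefficient of variation (std / mean) of the gate
    values. It is zero when all experts receive equal gate values and
    grows as usage becomes unbalanced, so adding it to the training loss
    discourages a few experts from dominating."""
    n = len(gate_values)
    mean = sum(gate_values) / n
    var = sum((g - mean) ** 2 for g in gate_values) / n
    return var / (mean ** 2 + eps)
```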

Appendix D CNN layouts

| stage | output | description |
| --- | --- | --- |
| conv1 | 112×112 | 7×7, 64, stride 2 |
| conv2 | 56×56 | 3×3 max pool, stride 2 |
| conv3 | 28×28 | ×4 residual blocks |
| conv4 | 14×14 | ×23 residual blocks |
| conv5 | 7×7 | ×3 residual blocks |

Table 3: Modular CNN for the VQA v2 model
| stage | output | description |
| --- | --- | --- |
| conv1 | 64×64 | 3×3, 64, stride 2 |
| conv2 | 32×32 | 3×3 max pool, stride 2 |
| conv3 | 16×16 | ×1 residual block |
| conv4 | 8×8 | ×1 residual block |
| conv5 | 8×8 | conv. layer with 24 output channels |

Table 4: Modular CNN for the Relational Networks model

Appendix E VQA Model architectures

Figure 2: Architecture of a sample residual block of ResNeXt-101
Figure 3: Model architecture for VQA v2 dataset (adapted from [27])
Figure 4: Model architecture for Relational Networks (adapted from [21])