Multimodal Compact Bilinear Pooling for Multimodal Neural Machine Translation

by   Jean-Benoit Delbrouck, et al.

In state-of-the-art Neural Machine Translation, an attention mechanism is used during decoding to enhance the translation. At every step, the decoder uses this mechanism to focus on different parts of the source sentence to gather the most useful information before outputting its target word. Recently, the effectiveness of the attention mechanism has also been explored for multimodal tasks, where it becomes possible to focus both on sentence parts and image regions. Approaches to pool two modalities usually include element-wise product, sum or concatenation. In this paper, we evaluate the more advanced Multimodal Compact Bilinear pooling method, which takes the outer product of two vectors to combine the attention features for the two modalities. This has been previously investigated for visual question answering. We try out this approach for multimodal image caption translation and show improvements compared to basic combination methods.








1 Introduction

In machine translation, neural networks have attracted a lot of research attention. Recently, the attention-based encoder-decoder framework (Sutskever et al., 2014; Bahdanau et al., 2014) has been widely adopted. In this approach, Recurrent Neural Networks (RNNs) map source sequences of words to target sequences. The attention mechanism is learned to focus on different parts of the input sentence while decoding. Attention mechanisms have been shown to work with other modalities too, like images, where they are able to learn to attend to salient parts of an image, for instance when generating text captions (Xu et al., 2015). For such applications, Convolutional Neural Networks (CNNs) have been shown to work best to represent images (He et al., 2016).

Multimodal models of texts and images enable applications such as visual question answering or multimodal caption translation. Also, the grounding of multiple modalities against each other may enable the model to have a better understanding of each modality individually, such as in natural language understanding applications.

The efficient integration of multimodal information nevertheless remains a challenging task. Both Huang et al. (2016) and Caglayan et al. (2016) made a first attempt at multimodal neural machine translation. Recently, Calixto et al. (2017) showed an improved architecture that significantly surpassed the monomodal baseline. Multimodal tasks require combining diverse modality vector representations with each other. Bilinear pooling models (Tenenbaum & Freeman, 1997), which compute the outer product of two vectors (such as the visual and textual representations), may be more expressive than basic combination methods such as element-wise sum or product. Because of its high and intractable dimensionality ($n^2$ for two vectors of size $n$), Gao et al. (2016) proposed a method that relies on Multimodal Compact Bilinear pooling (MCB) to efficiently compute a joint and expressive representation combining both modalities, in a visual question answering task. This approach has not been investigated previously for multimodal caption translation, which is what we focus on in this paper.


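Algorithm 1 Multimodal CBP
input: $v_1 \in \mathbb{R}^{n_1}$, $v_2 \in \mathbb{R}^{n_2}$
output: $\Phi(v_1, v_2) \in \mathbb{R}^{d}$
for $k \leftarrow 1 \dots 2$ do
  for $i \leftarrow 1 \dots n_k$ do
    sample $h_k[i]$ from $\{1, \dots, d\}$
    sample $s_k[i]$ from $\{-1, 1\}$
  $v'_k = \Psi(v_k, h_k, s_k, n_k)$
return $\Phi = \mathrm{FFT}^{-1}(\mathrm{FFT}(v'_1) \odot \mathrm{FFT}(v'_2))$
procedure $\Psi(v, h, s, n)$
  $y = [0, \dots, 0]$
  for $i \leftarrow 1 \dots n$ do
    $y[h[i]] = y[h[i]] + s[i] \cdot v[i]$
  return $y$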
Figure 1: Left: Tensor Sketch algorithm. Right: Compact Bilinear Pooling for two modality vectors (top) and the "MM pre-attention" model (bottom). Note that the textual representation vector is tiled (copied) to match the dimension of the image feature maps.

2 Model

We detail our model, built from the attention-based encoder-decoder neural network described by Sutskever et al. (2014) and Bahdanau et al. (2014) and implemented in TensorFlow (Abadi et al., 2016).

Textual encoder  Given an input sentence $X = (x_1, x_2, \dots, x_T)$, $x_i \in \mathbb{R}^{E}$, where $T$ is the sentence length and $E$ is the dimension of the word embedding space, a bi-directional LSTM encoder of layer size $L$ produces a set of textual annotations $A^t = \{h^t_1, h^t_2, \dots, h^t_T\}$, where $h^t_i$ is obtained by concatenating the forward and backward hidden states of the encoder: $h^t_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$, $h^t_i \in \mathbb{R}^{2L}$.

Visual encoder  An image associated with this sentence is fed to a deep residual network, computing convolutional feature maps with 1024 channels over a grid of $N$ spatial locations. We obtain a set of visual annotations $A^v = \{h^v_1, h^v_2, \dots, h^v_N\}$ where $h^v_i \in \mathbb{R}^{1024}$.

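Decoder  The decoder produces an output sentence $Y = (y_1, y_2, \dots, y_{T'})$ and its initial state $s_0$ is computed from the textual encoder's last state. The next decoder states are obtained as follows:

$$s_t, o_t = \mathrm{LSTM}(s_{t-1}, [y_{t-1}; c_{t-1}])$$

During training, $y_{t-1}$ is the ground truth symbol in the sentence whilst $c_{t-1}$ is the previous attention vector computed by the attention model. The current attention vector $c_t$, concatenated with the LSTM output $o_t$, is used to compute a new vector $\tilde{o}_t = W_{proj}[o_t; c_t] + b_{proj}$. The probability distribution over the target vocabulary is computed by the equation:

$$p(y_t \mid y_{t-1}, s_{t-1}, c_{t-1}) = \mathrm{softmax}(W_{out}\,\tilde{o}_t + b_{out})$$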


Attention  At every time-step, the attention mechanism computes two modality-specific context vectors $\{c^t_t, c^v_t\}$ given the current decoder state $s_t$ and the two annotation sets $\{A^t, A^v\}$ (textual and visual). We use the same attention model for both modalities, described by Vinyals et al. (2015). For each modality $mod \in \{t, v\}$, we first compute modality-specific attention weights $\alpha^{mod}_t = \mathrm{softmax}(v^\top \tanh(W_1 A^{mod} + W_2 s_t))$. The context vector is then obtained with the following weighted sum, where $h^{mod}_i$ is the $i$-th annotation of $A^{mod}$:

$$c^{mod}_t = \sum_i \alpha^{mod}_{t,i}\, h^{mod}_i$$

Both $v$ and $W_1$ are considered modality-dependent and thus are not shared by both modalities. The projection layer $W_2$ is applied to the decoder state $s_t$ and is thus shared (Caglayan et al., 2016). Vectors $\{c^t_t, c^v_t\}$ are then combined to produce $c_t$ with an element-wise (e-w) sum / product or concatenation layer.
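
As a rough illustration of the attention step and the basic merge operations, the NumPy sketch below scores each annotation against the decoder state, normalizes the scores with a softmax, and forms the context vector as a weighted sum. The names `attend`, `W1`, `W2` and `v`, the absence of bias terms, and all sizes are assumptions made for the example; the weights are random rather than trained.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # shift for numerical stability
    return e / e.sum()

def attend(A, s_t, W1, W2, v):
    """Return attention weights over the rows of A and the context vector.

    A: (N, D) annotations, s_t: (L,) decoder state,
    W1: (K, D) and v: (K,) are modality-specific, W2: (K, L) is shared.
    """
    scores = np.tanh(A @ W1.T + W2 @ s_t) @ v  # one score per annotation, shape (N,)
    alpha = softmax(scores)
    context = alpha @ A                        # weighted sum of annotations, shape (D,)
    return alpha, context

rng = np.random.default_rng(0)
D, L, K = 8, 4, 6                      # toy sizes (hypothetical)
A_text = rng.normal(size=(5, D))       # 5 source words
A_vis = rng.normal(size=(9, D))        # 9 image regions
s_t = rng.normal(size=L)

W2 = rng.normal(size=(K, L))           # shared projection of the decoder state
W1_text, v_text = rng.normal(size=(K, D)), rng.normal(size=K)
W1_vis, v_vis = rng.normal(size=(K, D)), rng.normal(size=K)

alpha_text, c_text = attend(A_text, s_t, W1_text, W2, v_text)
alpha_vis, c_vis = attend(A_vis, s_t, W1_vis, W2, v_vis)

# The three basic ways of merging the two context vectors.
merged_sum = c_text + c_vis
merged_prod = c_text * c_vis
merged_concat = np.concatenate([c_text, c_vis])
```

Sum and product keep the context size unchanged, while concatenation doubles it, which matters for any layer that consumes the merged vector.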

Multimodal Compact Bilinear (MCB) pooling  Bilinear models (Tenenbaum & Freeman, 1997) can be applied to combine vectors. We take the outer product of our two context vectors $c^t_t$ and $c^v_t$ and then learn a linear model $W$, i.e. $z = W[c^t_t \otimes c^v_t]$, where $\otimes$ denotes the outer product and $[\,\cdot\,]$ denotes linearizing the matrix into a vector. Bilinear pooling allows all elements of both vectors to interact with each other in a multiplicative way but leads to a high-dimensional representation and an infeasible number of parameters to learn in $W$. For two modality context vectors of size 1024 and an attention size of 512 ($z \in \mathbb{R}^{512}$), $W$ would have 537 million parameters ($1024 \times 1024 \times 512$). We use the compact method proposed by Gao et al. (2016), based on the tensor sketch algorithm (see Algorithm 1), to make bilinear models feasible. This model, referred to as "MM attention" in the results section, is illustrated in Figure 1 (top right).
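
The tensor sketch idea can be illustrated in a few lines of NumPy: each vector is projected with a count sketch (a random bucket index and a random sign per input element), and the two sketches are combined by circular convolution, computed as an element-wise product in the Fourier domain. The function names, the seed and the toy sizes below are assumptions for illustration and are not the authors' TensorFlow code. The loop at the end checks the result against a count sketch of the explicit outer product.

```python
import numpy as np

def sample_sketch(n, d, rng):
    """Draw the fixed random hash h (bucket per element) and sign s for one modality."""
    h = rng.integers(0, d, size=n)                  # h[i] in {0, ..., d-1}
    s = rng.choice(np.array([-1.0, 1.0]), size=n)   # s[i] in {-1, +1}
    return h, s

def count_sketch(v, h, s, d):
    """Project v (size n) to size d: y[h[i]] += s[i] * v[i]."""
    y = np.zeros(d)
    np.add.at(y, h, s * v)   # unbuffered add, so repeated buckets accumulate
    return y

def mcb(v1, v2, sketch1, sketch2, d):
    """Compact bilinear pooling of v1 and v2 into a d-dimensional vector."""
    y1 = count_sketch(v1, *sketch1, d)
    y2 = count_sketch(v2, *sketch2, d)
    # Circular convolution of the two sketches via FFT.
    return np.fft.irfft(np.fft.rfft(y1) * np.fft.rfft(y2), n=d)

rng = np.random.default_rng(0)
n, d = 16, 8                                   # toy sizes (hypothetical)
c_text, c_vis = rng.normal(size=n), rng.normal(size=n)
sk_text, sk_vis = sample_sketch(n, d, rng), sample_sketch(n, d, rng)
z = mcb(c_text, c_vis, sk_text, sk_vis, d)

# Reference: count sketch of the explicit n x n outer product, using the
# combined hash (h1[i] + h2[j]) mod d and the combined sign s1[i] * s2[j].
(h1, s1), (h2, s2) = sk_text, sk_vis
reference = np.zeros(d)
for i in range(n):
    for j in range(n):
        reference[(h1[i] + h2[j]) % d] += s1[i] * s2[j] * c_text[i] * c_vis[j]

# Size of W in the full bilinear model for the sizes quoted in the text.
full_bilinear_params = 1024 * 1024 * 512
```

The compact version never materializes the outer product: its cost is dominated by the FFTs of size `d`, while the full bilinear model would need the `full_bilinear_params` weights counted on the last line.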

We try a second model inspired by the work of Fukui et al. (2016). For each spatial grid location in the visual representation, we use MCB pooling to merge the slice of the visual feature with the language representation. As shown at the bottom right of Figure 1, after the pooling we use two convolutional layers to predict attention weights for each grid location. We then apply softmax to produce a new normalized soft attention map. This method can be seen as the removal of unnecessary information in the feature maps according to the source sentence. Note that we still use the "MM attention" during decoding. We refer to this model as "MM pre-attention".
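
A minimal NumPy sketch of this pre-attention step follows. Several details are assumptions made for the example rather than facts from the paper: the two convolutional layers are modeled as 1x1 convolutions (per-location linear maps) with a ReLU in between, the language representation is a single vector `q`, the attention map is applied by rescaling each location of the feature maps, and all names, sizes and random weights are hypothetical.

```python
import numpy as np

def count_sketch_rows(X, h, s, d):
    """Count-sketch every row of X (N, n) into d buckets; returns (N, d)."""
    Y = np.zeros((X.shape[0], d))
    for i in range(X.shape[1]):
        Y[:, h[i]] += s[i] * X[:, i]
    return Y

def mcb_rows(V, Q, sketch_v, sketch_q, d):
    """Row-wise compact bilinear pooling of V (N, n_v) and Q (N, n_q)."""
    Fv = np.fft.rfft(count_sketch_rows(V, *sketch_v, d), axis=1)
    Fq = np.fft.rfft(count_sketch_rows(Q, *sketch_q, d), axis=1)
    return np.fft.irfft(Fv * Fq, n=d, axis=1)

def pre_attention(V, q, sketch_v, sketch_q, W_a, W_b, d):
    """Return a soft attention map over the N locations and the reweighted features."""
    Q = np.tile(q, (V.shape[0], 1))          # copy the text vector to every location
    pooled = mcb_rows(V, Q, sketch_v, sketch_q, d)
    hidden = np.maximum(pooled @ W_a, 0.0)   # first 1x1 conv + ReLU (assumed)
    scores = (hidden @ W_b).ravel()          # second 1x1 conv: one score per location
    weights = np.exp(scores - scores.max())
    alpha = weights / weights.sum()          # softmax over locations
    return alpha, V * alpha[:, None]         # rescale each location (assumed)

rng = np.random.default_rng(0)
N, n_vis, n_txt, d, hidden_size = 9, 12, 10, 16, 5   # toy sizes (hypothetical)
signs = np.array([-1.0, 1.0])
V = rng.normal(size=(N, n_vis))              # flattened grid of visual features
q = rng.normal(size=n_txt)                   # language representation
sketch_v = (rng.integers(0, d, size=n_vis), rng.choice(signs, size=n_vis))
sketch_q = (rng.integers(0, d, size=n_txt), rng.choice(signs, size=n_txt))
W_a = 0.1 * rng.normal(size=(d, hidden_size))
W_b = 0.1 * rng.normal(size=(hidden_size, 1))

alpha, V_att = pre_attention(V, q, sketch_v, sketch_q, W_a, W_b, d)
```

Because the attention weights are computed from the pooled vector of size `d` directly, this path is not squeezed through a small projection before it is used, which is the property the results section relies on.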

3 Settings

We use the Adam optimizer (Kingma & Ba, 2014) with L2 regularization. The layer size and the word embedding size are both 512. Embeddings are trained along with the model. We use a mini-batch size of 32 and Xavier weight initialization (Glorot & Bengio, 2010). For these experiments, we used the Multi30K dataset (Elliott et al., 2016), which is an extended version of Flickr30K Entities. For each image, one of the English descriptions was selected and manually translated into German by a professional translator (Task 1). As training and development data, 29,000 and 1,014 triples are used respectively. A test set of 1,000 triples is used for BLEU and METEOR evaluation. Vocabulary sizes are 11,180 (en) and 19,154 (de). We lowercase and tokenize all the text data with the Moses tokenizer. We extract feature maps from the images at a ResNet-50 layer. We use early stopping if no improvement is observed after 10,000 steps.
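
The early-stopping rule can be read as a patience measured in training steps. The helper below is a minimal sketch under that reading; the function name, the `(step, score)` history format and the assumption that a higher validation score is better are illustrative choices, since the monitored metric is not stated here.

```python
def should_stop(history, patience_steps=10_000):
    """Decide whether to stop, given (step, validation_score) pairs in step order.

    Stops once at least `patience_steps` steps have passed since the best
    score was first reached. Higher scores are assumed to be better.
    """
    if not history:
        return False
    # max() keeps the earliest entry on ties, so a later equal score
    # does not count as an improvement.
    best_step, _ = max(history, key=lambda pair: pair[1])
    last_step, _ = history[-1]
    return last_step - best_step >= patience_steps

# Hypothetical evaluation logs: the best score is reached at step 2000.
still_training = should_stop([(1000, 28.0), (2000, 29.0), (11000, 28.5)])
stopped = should_stop([(1000, 28.0), (2000, 29.0), (12000, 28.5)])
```

In the two example logs, the first run is 9,000 steps past its best score and continues, while the second has gone 10,000 steps without improvement and stops.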

4 Results

Method                    Validation Scores
                          BLEU     METEOR
Monomodal Text            29.24    48.32
MM attention
  Concatenation           26.12    44.14
  Element-wise Sum        27.48    45.79
  Element-wise Product    28.62    47.99
  MCB 1024                28.48    47.57
MM pre-attention*
  Element-wise Sum        28.57    46.40
  Element-wise Product    29.14    46.71
  MCB 4096                29.75    48.80
*with Prod as MM att.

Compact Bilinear (output size)    BLEU
Multimodal attention
  512                             27.78
  1024                            28.48
  2048                            28.12
Multimodal pre-attention
  1024                            28.71
  2048                            29.19
  4096                            29.75
  8192                            29.39
  16000                           27.98
Table 1: The BLEU and METEOR results on the test split containing 1000 triples. All scores are the average of two runs.

To our knowledge, there is currently no multimodal translation architecture that convincingly surpasses a monomodal NMT baseline. Our work nevertheless shows a small but encouraging improvement. In the "MM attention" model, where both attention context vectors are merged, we notice no improvement using MCB over an element-wise product. We suppose the reason is that the merged attention vector has to be concatenated with the cell output and then gets linearly transformed by the proj layer to a vector of size 512. This heavy dimensionality reduction undergone by the vector may have led to a considerable loss of information, hence the poor results. This motivated us to implement the second attention mechanism, "MM pre-attention". Here, the attention model can make full use of the combined vector's dimension, varying from 1024 to 16000. We show here an improvement of +0.62 BLEU over e-w multiplication and +1.18 BLEU over e-w sum. We believe a further step could be to investigate different experimental settings or layer architectures, as we feel MCB could perform much better, as seen in similar previous work (Fukui et al., 2016).

5 Acknowledgements

This work was partly supported by the Chist-Era project IGLU with contribution from the Belgian Fonds de la Recherche Scientifique (FNRS), contract no. R.50.11.15.F, and by the FSO project VCYCLE with contribution from the Belgian Walloon Region, contract no. 1510501.