Probing Contextualized Sentence Representations with Visual Awareness

11/07/2019 ∙ by Zhuosheng Zhang, et al. ∙ Shanghai Jiao Tong University ∙ National Institute of Information and Communications Technology

We present a universal framework for modeling contextualized sentence representations with visual awareness, motivated by the shortage of manually annotated multimodal parallel data. For each sentence, we first retrieve a diverse set of images from a shared cross-modal embedding space, which is pre-trained on a large-scale text-image corpus. The texts and images are then encoded by a Transformer encoder and a convolutional neural network, respectively. The two sequences of representations are further fused by a simple and effective attention layer. The architecture can be easily applied to text-only natural language processing tasks without manually annotating multimodal parallel corpora. We apply the proposed method to three tasks, namely neural machine translation, natural language inference, and sequence labeling, and the experimental results verify its effectiveness.




1 Introduction

Learning vector representations of sentence meaning is a long-standing objective in natural language processing (NLP) Wang et al. (2018b); Chen et al. (2019); Zhang et al. (2019b). Text representation learning has evolved from word-level distributed representations Mikolov et al. (2013); Pennington et al. (2014) to contextualized language modeling (LM) Peters et al. (2018); Radford et al. (2018); Devlin et al. (2018); Yang et al. (2019). Despite the success of LMs, NLP models remain impoverished compared to humans because they learn solely from textual features, without grounding in the outside world such as visual perception. A line of research has therefore emerged that incorporates non-linguistic modalities into language representations Bruni et al. (2014); Calixto et al. (2017); Zhang et al. (2018); Ive et al. (2019); Shi et al. (2019).

Previous work mainly integrates visual guidance into word or character representations Kiela and Bottou (2014); Silberer and Lapata (2014); Zablocki et al. (2018); Wu et al. (2019), which requires an alignment between words and images. Besides, word meaning may vary across sentences depending on the context, so a single aligned image would not be optimal. Recently, there has been a trend of pre-training visual-linguistic (VL) representations for vision-and-language tasks Su et al. (2019); Lu et al. (2019); Tan and Bansal (2019); Li et al. (2019); Zhou et al. (2019); Sun et al. (2019). However, these VL studies rely on text-image annotations as paired input and are thus restricted to VL tasks such as image captioning and visual question answering, whereas for NLP tasks most texts have no such annotations. Therefore, it is essential to probe a general method for applying visual information to a wider range of mono-modal, text-only tasks.

Recent studies have verified that the representations of images and texts can be jointly leveraged to build visual-semantic embeddings in a shared representation space Frome et al. (2013); Karpathy and Fei-Fei (2015); Ren et al. (2016); Mukherjee and Hospedales (2016). To this end, a popular approach is to connect the mono-modal text and image encoding paths with fully connected layers Wang et al. (2018a); Engilberge et al. (2018). The shared deep embedding can then be used for cross-modal retrieval, associating sentences with related images. Inspired by this line of research, we incorporate visual awareness into sentence modeling by retrieving a group of images for a given sentence.

The Distributional Hypothesis Harris (1954) states that words that occur in similar contexts tend to have similar meanings. We attempt to extend it to the visual modality: sentences with similar meanings are likely to pair with similar or identical images in the shared embedding space. In this paper, we propose an approach to model contextualized sentence representations with visual awareness. For each sentence, we retrieve a diverse set of images from a shared text-visual embedding space that is pre-trained on a large-scale text-image corpus connecting the mono-modal paths of text and image embeddings. The texts and images are encoded by a Transformer LM and a pre-trained convolutional neural network (CNN), respectively. A simple and effective attention layer is then designed to fuse the two sequences of representations. In particular, the proposed approach can be easily applied to text-only tasks without manually annotating multimodal parallel corpora. The proposed method was evaluated on three tasks: neural machine translation (NMT), natural language inference (NLI), and sequence labeling (SL). Experiments and analysis show its effectiveness. In summary, our contributions are primarily three-fold:

  1. We present a universal visual representation method that overcomes the shortage of manually annotated multimodal parallel data.

  2. We propose a multimodal context-driven model to jointly learn sentence-level representations from textual and visual modalities.

  3. Experiments on different tasks verified the effectiveness and generality of the proposed approach.

Figure 1: Overview of our model architecture.

2 Approach

Figure 1 illustrates the architecture of our proposed method. Given a sentence, we first fetch a group of matched images from the cross-modal retrieval model. The text and images are encoded by the text feature extractor and the image feature extractor, respectively. The two sequences of representations are then integrated by multi-head attention to form a joint representation, which is passed to downstream task-specific layers. Before introducing our visual-aware model, we briefly describe the cross-modal retrieval model, which is used for image retrieval given the sentence text.
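The pipeline above can be sketched in a few lines. This is a hedged, high-level sketch only: `retrieve`, `text_encoder`, `image_encoder`, and `fuse` are hypothetical stand-ins for the components described in the following subsections, not the authors' implementation.

```python
def encode_with_visual_awareness(sentence, retrieve, text_encoder,
                                 image_encoder, fuse, k=8):
    """Retrieve top-k images, encode both modalities, fuse into one vector."""
    images = retrieve(sentence, k)               # from the shared embedding space
    H = text_encoder(sentence)                   # Transformer text features
    M = [image_encoder(img) for img in images]   # CNN image features
    return fuse(H, M)                            # attention-based joint representation

# Toy stand-ins so the sketch runs end to end:
joint = encode_with_visual_awareness(
    "a dog runs on the beach",
    retrieve=lambda s, k: ["img_%d" % i for i in range(k)],
    text_encoder=lambda s: [1.0, 1.0, 1.0, 1.0],
    image_encoder=lambda img: [0.5, 0.5, 0.5, 0.5],
    fuse=lambda H, M: [h + sum(m[i] for m in M) / len(M)
                       for i, h in enumerate(H)],
)
```

The joint representation is then consumed by whatever task-specific layer sits on top (decoder for NMT, classifier for NLI and SL).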

2.1 Cross-modal Retrieval Model

Figure 2: Details of the proposed semantic-visual embedding model.

Given an input sentence, our aim is to associate it with a number of images from a candidate corpus. Following Engilberge et al. (2018), we train a semantic-visual embedding on a text-image corpus, which is then used for image retrieval. The semantic-visual embedding architecture comprises two paths that encode the texts and images into vectors, respectively. Based on our preliminary experiments, we choose the simple recurrent unit (SRU) architecture as our text encoder and the fully convolutional ResNet-152 Xie et al. (2017) with Weldon pooling Durand et al. (2016) as our image encoder.

Both pipelines are learned simultaneously: each image is paired with 1) a positive text that describes the image and 2) a hard negative, selected as the caption that has the highest similarity to the image while not being associated with it. The architecture of the model is shown in Figure 2. During training, a triplet loss Wang et al. (2014); Schroff et al. (2015); Gordo et al. (2017) is used, as shown in Equation 1:

loss(x, y, z) = max(0, α − cos(x, y) + cos(x, z))    (1)

where x, y, and z are respectively the embeddings of the image, its positive caption, and the hard-negative caption, and α is the minimum margin between the similarity of the correct caption and the unrelated caption. The loss function ensures that the positive caption y is closer to the corresponding image than the unrelated one z.
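As a concrete illustration, the triplet loss above can be written in a few lines of plain Python. This is a minimal sketch; the margin value 0.2 is illustrative, not the paper's setting.

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_loss(image, positive, negative, margin=0.2):
    """max(0, margin - cos(image, positive) + cos(image, negative)):
    the loss is zero once the positive caption is at least `margin`
    closer to the image than the hard negative."""
    return max(0.0, margin - cos_sim(image, positive)
                    + cos_sim(image, negative))
```

When the negative is orthogonal to the image and the positive matches it exactly, the loss vanishes; when positive and negative are equally similar, the loss equals the margin.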

At prediction time, the relevance between texts and images is computed by cosine similarity.
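Retrieval at prediction time thus reduces to ranking candidate image embeddings by cosine similarity to the sentence embedding. A minimal pure-Python sketch (the embeddings are placeholders; the default cap of 8 images follows the setting reported later in the paper):

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def retrieve_top_k(sentence_vec, image_vecs, k=8):
    """Return the indices of the k candidate images most similar
    to the sentence in the shared embedding space."""
    ranked = sorted(range(len(image_vecs)),
                    key=lambda i: cos_sim(sentence_vec, image_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

In a real system the candidate embeddings would be pre-computed once for the whole image corpus, so retrieval is a single similarity scan per sentence.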

2.2 Visual-aware Model

2.3 Encoding Layer

For each sentence X, we pair it with the top-m matched images according to the retrieval method above. The sentence X = {x1, ..., xn} is fed into a multi-layer Transformer encoder to learn the text representation H. Meanwhile, the images E = {e1, ..., em} are encoded by a pre-trained ResNet (He et al., 2016) followed by a feed-forward layer to learn the image representation M, with the same dimension as H.
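To make the shapes concrete, here is a sketch of the image path. The 2048-d pooled feature size is standard for ResNet-152; the 512-d model dimension and the random weights are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d_img, d_model = 8, 2048, 512      # retrieved images, ResNet dim, text dim

resnet_feats = rng.normal(size=(m, d_img))          # one pooled vector per image
W = rng.normal(scale=0.02, size=(d_img, d_model))   # feed-forward projection
b = np.zeros(d_model)

M = resnet_feats @ W + b   # image representation M, matching H's dimension
```

The projection is what lets the attention layer in the next subsection treat text and image vectors in one common space.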

2.4 Multi-modal Integration Layer

Then, we apply a one-layer multi-head attention mechanism to append the image representation to the text representation:

H′ = MultiHead(H, K_M, V_M)    (2)

where {K_M, V_M} are packed from the learned image representation M.

Following the same practice as in the Transformer block, we fuse H and H′ with layer normalization to learn the joint representation:

H̃ = LayerNorm(W1 H + W2 H′)    (3)

where W1 and W2 are trainable parameters.
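The attention-plus-layer-norm fusion described above can be sketched in NumPy. This is a simplification under stated assumptions: a single attention head instead of several, no separate key/value projections, no layer-norm gain/bias, and identity W1/W2 for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse(H, M, W1, W2, eps=1e-6):
    """H' = Attention(Q=H, K=M, V=M); return LayerNorm(H @ W1 + H' @ W2)."""
    d = H.shape[-1]
    weights = softmax(H @ M.T / np.sqrt(d))   # (n, m): each token attends to images
    H_img = weights @ M                       # (n, d): image-aware context per token
    Z = H @ W1 + H_img @ W2                   # weighted combination of both views
    mu = Z.mean(axis=-1, keepdims=True)
    sd = Z.std(axis=-1, keepdims=True)
    return (Z - mu) / (sd + eps)              # layer normalization

n, m, d = 4, 8, 16                            # tokens, retrieved images, hidden dim
rng = np.random.default_rng(0)
H = rng.normal(size=(n, d))                   # text representation
M = rng.normal(size=(m, d))                   # image representation
joint = fuse(H, M, np.eye(d), np.eye(d))      # identity weights for illustration
```

Because the query comes from the text and the keys/values from the images, each token pulls in a weighted mixture of the retrieved images, which the layer norm then rescales per token.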

2.5 Task-specific Layer

In this section, we show how the joint representation is used for downstream tasks, taking NMT, NLI, and SL as examples. For NMT, the joint representation is fed to the decoder to learn a time-dependent context vector for predicting the target translation. For NLI and SL, it is directly fed to a feed-forward layer to make the prediction.

3 Task Settings

We evaluate our model on three different NLP tasks, namely neural machine translation, natural language inference and sequence labeling. We present these evaluation benchmarks in what follows.

3.1 Neural Machine Translation

We use two translation datasets: WMT'16 English-to-Romanian (EN-RO) and WMT'14 English-to-German (EN-DE), which are standard corpora for NMT evaluation.

1) For the EN-RO task, we experimented with the officially provided parallel corpora, Europarl v7 and SETIMES2 from WMT'16, with 0.6M sentence pairs. We used newsdev2016 as the dev set and newstest2016 as the test set.

2) For the EN-DE translation task, 4.43M bilingual sentence pairs of the WMT14 dataset were used as training data, including Common Crawl, News Commentary, and Europarl v7. The newstest2013 and newstest2014 datasets were used as the dev set and test set, respectively.

3.2 Natural Language Inference

Natural language inference involves reading a pair of sentences and judging the relationship between their meanings, such as entailment, neutral, or contradiction. In this task, we use the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015), which provides approximately 570k hypothesis/premise pairs.

3.3 Sequence Labeling

We use the CoNLL-2003 named entity recognition dataset Sang and Meulder (2003) for the sequence labeling task, which includes four kinds of named entities: Person, Location, Organization, and MISC.

4 Model Implementation

Now, we describe the implementation details of our method. All experiments were run on 8 NVIDIA Tesla V100 GPUs.

4.1 Cross-modal Retrieval Model

The cross-modal retrieval model is trained on the MS-COCO dataset Lin et al. (2014), which contains 123,287 images with 5 English captions per image. The official split contains 82,783 training, 5,000 validation, and 5,000 test images; we used the Karpathy split Karpathy and Fei-Fei (2015), which forms 113,287 training, 5,000 validation, and 5,000 test images. The model is implemented following the same settings as Engilberge et al. (2018), achieving state-of-the-art results (94.0% R@10) in cross-modal retrieval. The maximum number of retrieved images per sentence is set to 8 according to our preliminary experimental results.

Model EN-RO EN-DE
Public Systems
Trans. Vaswani et al. (2017) - 27.3
Trans. Lee et al. (2018) 32.40 -
Our implementation
Trans. 32.66 27.31
+ VA 34.63 27.83
Table 1: BLEU scores on EN-RO and EN-DE for the NMT tasks. Trans. is short for Transformer.

4.2 Baseline

To incorporate our visual-aware model (+VA), we only modify the encoder of the baselines by introducing the image encoding layer and the multi-modal integration layer.

For the NMT tasks, the baseline was the Transformer (Vaswani et al., 2017) implemented in fairseq (Ott et al., 2019). We used six layers for both the encoder and the decoder. The dimension of all input and output layers was set to 512, and that of the inner feed-forward layer to 2048. The number of heads in all multi-head modules was set to eight in both encoder and decoder layers. The byte pair encoding algorithm was adopted, with a vocabulary size of 40,000. Each training batch contained approximately 4,096 source tokens and 4,096 target tokens. During training, the label smoothing value was set to 0.1, and both attention dropout and residual dropout were p = 0.1. The Adam optimizer (Kingma and Ba, 2014) was used to tune the parameters of the model, with the learning rate varied under a warm-up strategy with 8,000 steps.

For the NLI and SL tasks, the baseline was BERT (Base). We used the pre-trained weights of BERT and followed the same fine-tuning procedure as BERT without any modification. The initial learning rate was selected from {8e-6, 1e-5, 2e-5, 3e-5} with a warm-up rate of 0.1 and L2 weight decay of 0.01. The batch size was selected from {16, 24, 32}, and the maximum number of epochs from [2, 5]. Texts were tokenized into WordPieces with a maximum length of 128.
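The search space above amounts to a small grid. A sketch of its enumeration, interpreting the epoch range [2, 5] as the integers 2 through 5 (an assumption):

```python
from itertools import product

learning_rates = [8e-6, 1e-5, 2e-5, 3e-5]
batch_sizes = [16, 24, 32]
epoch_choices = [2, 3, 4, 5]

# Every candidate fine-tuning configuration for the NLI/SL baselines;
# each would be scored on the dev set and the best kept.
configs = [
    {"lr": lr, "batch_size": bs, "epochs": ep}
    for lr, bs, ep in product(learning_rates, batch_sizes, epoch_choices)
]
```

With 4 learning rates, 3 batch sizes, and 4 epoch settings, this yields 48 candidate runs per task.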

4.3 Results

Table 1 shows the translation results for the WMT'14 EN-DE and WMT'16 EN-RO translation tasks. Our method significantly outperformed the baseline Transformer, demonstrating the effectiveness of modeling visual information for NMT.

Model Acc
Public Systems
GPT Radford et al. (2018) 89.9
DRCN Kim et al. (2018) 90.1
MT-DNN Liu et al. (2019) 91.6
SemBERT Zhang et al. (2019a) 91.6
BERT (Base) Liu et al. (2019) 90.8
Our implementation
BERT (Base) 90.7
+ VA 91.2
Table 2: Accuracy on SNLI dataset.
Model F1 score
Public Systems
LSTM-CRF Lample et al. (2016) 90.94
BERT (Base) Pires et al. (2019) 91.07
Our implementation
BERT (Base) 91.21
+VA 91.46
Table 3: Results (%) of CoNLL-2003 NER dataset.
Figure 3: Examples of the retrieved images for sentences.

Tables 2-3 show the results for the NLI and SL tasks, which also verify the effectiveness of our method. The results show that our method is useful not only for the fundamental tagging task but also for the more advanced translation and inference tasks.

5 Analysis

5.1 Concept Localization

An important advantage of the shared embedding space is that it enables the localization of arbitrary concepts within the image. For an input text, we compute the image localization heatmap derived from the activation map of the last convolutional layer, following Engilberge et al. (2018). Figure 4 shows an example, which indicates that the shared space can not only perform image retrieval but also match language concepts in the image for any text query.

Figure 4: Concept activation maps with different input words. The orange region indicates the highest peak in the heatmap.

5.2 Examples of Image Retrieval

Although the image retrieval method achieves strong results on the COCO dataset, we are interested in its qualitative behavior on our task-specific datasets. We randomly select some examples to interpret the image retrieval process intuitively, as shown in Figure 3. Besides the "good" examples, in which the contents of the text and image match well, we also observe some "negative" examples in which the contents are not conceptually related but show some potential connections. To some extent, the alignment of text and image concepts might not be the only effective factor for multimodal modeling, since such alignments are defined by human knowledge and their meanings may vary across people and over time. In contrast, the consistent mapping between modalities in a shared embedding space is potentially more beneficial: similar images tend to be retrieved for similar sentences, which can serve as topical hints for sentence modeling.

6 Conclusion

In this work, we present a universal method to incorporate visual information into sentence modeling by conducting image retrieval from a pre-trained shared cross-modal embedding space, overcoming the shortage of manually annotated multimodal parallel data. The text and image representations are encoded by a Transformer encoder and a convolutional neural network, respectively, and then integrated in a multi-head attention layer. Empirical studies on a range of NLP tasks, including NMT, NLI, and SL, verify the effectiveness. Our method is general and fundamental and can be easily applied to any existing deep learning NLP system. We hope this work will facilitate future multimodal research across vision and language.


  • E. Bruni, N. Tran, and M. Baroni (2014) Multimodal distributional semantics. Journal of Artificial Intelligence Research 49, pp. 1–47. Cited by: §1.
  • I. Calixto, Q. Liu, and N. Campbell (2017) Doubly-attentive decoder for multi-modal neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1913–1924. Cited by: §1.
  • K. Chen, R. Wang, M. Utiyama, E. Sumita, and T. Zhao (2019) Neural machine translation with sentence-level topic context. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (12), pp. 1970–1984. Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1.
  • T. Durand, N. Thome, and M. Cord (2016) WELDON: weakly supervised learning of deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4743–4752. Cited by: §2.1.
  • M. Engilberge, L. Chevallier, P. Pérez, and M. Cord (2018) Finding beans in burgers: deep semantic-visual embedding with localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3984–3993. Cited by: §1, §2.1, §4.1, §5.1.
  • A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov (2013) Devise: a deep visual-semantic embedding model. In Advances in neural information processing systems, pp. 2121–2129. Cited by: §1.
  • A. Gordo, J. Almazan, J. Revaud, and D. Larlus (2017) End-to-end learning of deep visual representations for image retrieval. International Journal of Computer Vision 124 (2), pp. 237–254. Cited by: §2.1.
  • Z. S. Harris (1954) Distributional structure. Word 10 (2-3), pp. 146–162. Cited by: §1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §2.3.
  • J. Ive, P. Madhyastha, and L. Specia (2019) Distilling translations with visual awareness. arXiv preprint arXiv:1906.07701. Cited by: §1.
  • A. Karpathy and L. Fei-Fei (2015) Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3128–3137. Cited by: §1, §4.1.
  • D. Kiela and L. Bottou (2014) Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 36–45. Cited by: §1.
  • S. Kim, J. Hong, I. Kang, and N. Kwak (2018) Semantic sentence matching with densely-connected recurrent and co-attentive information. arXiv preprint arXiv:1805.11360. Cited by: Table 2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.
  • G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer (2016) Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 260–270. Cited by: Table 3.
  • J. Lee, E. Mansimov, and K. Cho (2018) Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1173–1182. Cited by: Table 1.
  • G. Li, N. Duan, Y. Fang, D. Jiang, and M. Zhou (2019) Unicoder-vl: a universal encoder for vision and language by cross-modal pre-training. arXiv preprint arXiv:1908.06066. Cited by: §1.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §4.1.
  • X. Liu, P. He, W. Chen, and J. Gao (2019) Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504. Cited by: Table 2.
  • J. Lu, D. Batra, D. Parikh, and S. Lee (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265. Cited by: §1.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In NIPS, Cited by: §1.
  • T. Mukherjee and T. Hospedales (2016) Gaussian visual-linguistic embedding for zero-shot recognition. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 912–918. Cited by: §1.
  • M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli (2019) Fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pp. 48–53. Cited by: §4.2.
  • J. Pennington, R. Socher, and C. Manning (2014) Glove: global vectors for word representation. In EMNLP, Cited by: §1.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In NAACL-HLT, Cited by: §1.
  • T. Pires, E. Schlinger, and D. Garrette (2019) How multilingual is multilingual bert?. arXiv preprint arXiv:1906.01502. Cited by: Table 3.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Technical report. Cited by: §1, Table 2.
  • Z. Ren, H. Jin, Z. Lin, C. Fang, and A. Yuille (2016) Joint image-text representation by gaussian visual-semantic embedding. In Proceedings of the 24th ACM international conference on Multimedia, pp. 207–211. Cited by: §1.
  • E. F. T. K. Sang and F. D. Meulder (2003) Introduction to the conll-2003 shared task: language-independent named entity recognition. In HLT-NAACL 2003, pp. 1–6. Cited by: §3.3.
  • F. Schroff, D. Kalenichenko, and J. Philbin (2015) FaceNet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: §2.1.
  • H. Shi, J. Mao, K. Gimpel, and K. Livescu (2019) Visually grounded neural syntax acquisition. arXiv preprint arXiv:1906.02890. Cited by: §1.
  • C. Silberer and M. Lapata (2014) Learning grounded meaning representations with autoencoders. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 721–732. Cited by: §1.
  • W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai (2019) Vl-bert: pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530. Cited by: §1.
  • C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid (2019) Videobert: a joint model for video and language representation learning. arXiv preprint arXiv:1904.01766. Cited by: §1.
  • H. Tan and M. Bansal (2019) Lxmert: learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490. Cited by: §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008. Cited by: §4.2, Table 1.
  • J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu (2014) Learning fine-grained image similarity with deep ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1386–1393. Cited by: §2.1.
  • L. Wang, Y. Li, J. Huang, and S. Lazebnik (2018a) Learning two-branch neural networks for image-text matching tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2), pp. 394–407. Cited by: §1.
  • R. Wang, M. Utiyama, A. Finch, L. Liu, K. Chen, and E. Sumita (2018b) Sentence selection and weighting for neural machine translation domain adaptation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (10), pp. 1727–1741. Cited by: §1.
  • W. Wu, Y. Meng, Q. Han, M. Li, X. Li, J. Mei, P. Nie, X. Sun, and J. Li (2019) Glyce: glyph-vectors for chinese character representations. arXiv preprint arXiv:1901.10125. Cited by: §1.
  • S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500. Cited by: §2.1.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237. Cited by: §1.
  • E. Zablocki, B. Piwowarski, L. Soulier, and P. Gallinari (2018) Learning multi-modal word representation grounded in visual context. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1.
  • K. Zhang, G. Lv, L. Wu, E. Chen, Q. Liu, H. Wu, and F. Wu (2018) Image-enhanced multi-level sentence representation net for natural language inference. In 2018 IEEE International Conference on Data Mining (ICDM), pp. 747–756. Cited by: §1.
  • Z. Zhang, Y. Wu, H. Zhao, Z. Li, S. Zhang, X. Zhou, and X. Zhou (2019a) Semantics-aware bert for language understanding. arXiv preprint arXiv:1909.02209. Cited by: Table 2.
  • Z. Zhang, Y. Wu, J. Zhou, S. Duan, H. Zhao, and R. Wang (2019b) SG-Net: syntax-guided machine reading comprehension. arXiv preprint arXiv:1908.05147. Cited by: §1.
  • L. Zhou, H. Palangi, L. Zhang, H. Hu, J. J. Corso, and J. Gao (2019) Unified vision-language pre-training for image captioning and VQA. arXiv preprint arXiv:1909.11059. Cited by: §1.