Utilizing Large Scale Vision and Text Datasets for Image Segmentation from Referring Expressions

08/30/2016 ∙ by Ronghang Hu, et al.

Image segmentation from referring expressions is a joint vision and language modeling task, where the input is an image and a textual expression describing a particular region in the image, and the goal is to localize and segment the specific image region based on the given expression. One major difficulty in training such language-based image segmentation systems is the lack of datasets with joint vision and text annotations. Although existing vision datasets such as MS COCO provide image captions, there are few datasets with region-level textual annotations for images, and these are often smaller in scale. In this paper, we explore how existing large scale vision-only and text-only datasets can be utilized to train models for image segmentation from referring expressions. We propose a method to address this problem, and show in experiments that our method can leverage vision-only and text-only data to help this joint vision and language modeling task, outperforming previous results.


1 Introduction

Figure 1: Previous methods for image segmentation from referring expressions require joint vision and text annotations (shown in a), but such datasets are expensive to collect. We explore how datasets with vision-only (shown in b) and text-only (shown in c) annotations can be utilized in this task.

Semantic image segmentation [Shotton et al.2009, Long et al.2015, Chen et al.2015, Zheng et al.2015] is an important problem in computer vision. Given an input image and a pre-defined set of visual semantic categories, such as "sky", "dog", and "bus", the task of semantic image segmentation is to localize all image pixels that belong to a particular category.

Instead of operating over a fixed set of visual categories, recent works such as image captioning [Mao et al.2015, Donahue et al.2015, Devlin et al.2015] and visual question answering [Xu and Saenko2015, Yang et al.2016, Andreas et al.2016] have extended visual comprehension from a set of classes to broader semantic labels represented by natural language expressions. [Hu et al.2016a] approach the task of image segmentation from referring expressions, where the goal is to ground a given query expression in an input image and output a pixelwise segmentation of the visual entity described by the expression. For example, given an image and the expression "the girl with red tie" (Fig. 1a), the model is asked to output a pixelwise segmentation mask for the girl on the right.

[Hu et al.2016a] propose a model that encodes the given expression into a real-valued vector using a Long Short-Term Memory (LSTM) network [Hochreiter and Schmidhuber1997] and extracts a spatial feature map from the image using a Convolutional Neural Network (CNN) [Krizhevsky et al.2012]. It then performs pixelwise classification based on the encoded expression and the feature map to output an image mask covering the visual entity described by the expression. However, this LSTM-CNN model requires referring expression annotations at the image region level as training data (Fig. 1a). Such annotations are expensive compared with visual class labels, and existing image segmentation datasets with referring expression annotations [Kazemzadeh et al.2014, Mao et al.2016] are an order of magnitude smaller than those with only visual category annotations such as MS COCO [Lin et al.2014].
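For illustration, a minimal sketch of such an LSTM-CNN pipeline is shown below: the expression encoding is tiled over the spatial CNN feature map and a 1x1 convolution predicts a per-pixel foreground score. The PyTorch framing, layer sizes, and class name are assumptions for this sketch, not the exact configuration of [Hu et al.2016a].

```python
import torch
import torch.nn as nn

class LSTMCNNSegmenter(nn.Module):
    """Minimal sketch of an LSTM-CNN referring-expression segmenter.

    The expression is encoded with an LSTM, the encoding is tiled over the
    spatial CNN feature map, and a 1x1 convolution predicts a per-pixel
    foreground score. Sizes are illustrative, not those of Hu et al. (2016a).
    """

    def __init__(self, vocab_size, embed_dim=300, lstm_dim=512, feat_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, lstm_dim, batch_first=True)
        # Assumed visual backbone output: a feat_dim x H x W feature map.
        self.classifier = nn.Conv2d(feat_dim + lstm_dim, 1, kernel_size=1)

    def forward(self, word_ids, feat_map):
        # word_ids: (B, T) token indices; feat_map: (B, feat_dim, H, W)
        _, (h_n, _) = self.lstm(self.embed(word_ids))
        text = h_n[-1]                                # (B, lstm_dim)
        B, _, H, W = feat_map.shape
        text = text[:, :, None, None].expand(B, -1, H, W)
        fused = torch.cat([feat_map, text], dim=1)    # concat along channels
        return self.classifier(fused)                 # (B, 1, H, W) logits
```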

Therefore, in this paper, we propose a method that utilizes existing large scale vision-only datasets containing image regions annotated with visual class labels but no text (Fig. 1b), together with text-only corpora (Fig. 1c), to help image segmentation from referring expressions. We show that performance on this task can be improved by pretraining word embeddings on a text corpus and by synthesizing textual phrases from the names of the visual classes as additional training data. We also incorporate traditional category-based semantic image segmentation datasets and models by mapping the textual expression to visual categories and matching it against category-based image segmentation results.

Our work is related to bounding box or pixelwise image region localization from query expressions [Hu et al.2016b, Mao et al.2016, Rohrbach et al.2015] and image captioning with vision-only and text-only data [Hendricks et al.2016].

2 Our Method

Our method extends the LSTM-CNN model [Hu et al.2016a] with word embeddings (Sec. 2.1) and synthesized expressions (Sec. 2.2), and exploits category-based image segmentation datasets and models (Sec. 2.3). In Sec. 2.4 we describe how we jointly train the full model.

2.1 Word Embeddings

Figure 2: We augment the previous LSTM-CNN model with a word embedding pretrained on text-only corpora, and with textual expressions synthesized from vision-only semantic image segmentation datasets as additional training data.

To utilize existing text-only datasets, we train a word embedding matrix on a large text corpus using GloVe [Pennington et al.2014], which is effective in embedding novel and rare words. The trained GloVe vectors are used as the word embedding matrix transforming the inputs to the LSTM network of [Hu et al.2016a] as shown in Fig. 2. This matrix is kept fixed during image segmentation training.

Compared with learning a word-to-vector mapping from scratch on the limited joint vision and language data as in [Hu et al.2016a], the GloVe vectors trained on large corpora are more effective in projecting the words to a semantic space, and in handling rare and novel words not seen in the training set.
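As a rough sketch of this setup, the snippet below copies pretrained GloVe vectors into an embedding matrix and freezes it for segmentation training; the file path, vocabulary format and helper name are hypothetical.

```python
import numpy as np
import torch
import torch.nn as nn

def load_glove_embedding(glove_path, vocab, dim=300):
    """Build a frozen embedding layer from pretrained GloVe vectors.

    vocab: dict mapping word -> row index in the embedding matrix.
    Words missing from the GloVe file keep a small random vector.
    """
    matrix = np.random.uniform(-0.1, 0.1, (len(vocab), dim)).astype(np.float32)
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, vec = parts[0], parts[1:]
            if word in vocab and len(vec) == dim:
                matrix[vocab[word]] = np.asarray(vec, dtype=np.float32)
    # freeze=True keeps the matrix fixed during image segmentation training.
    return nn.Embedding.from_pretrained(torch.from_numpy(matrix), freeze=True)
```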

2.2 Synthesized Expressions

The large scale vision-only datasets for semantic image segmentation, such as MS COCO [Lin et al.2014], only have image regions annotated over a pre-defined set of visual categories. To utilize such vision-only datasets, we synthesize textual expressions from the category annotations. In our implementation, we take a simple approach and directly use the visual class name as the textual expression for an image region, as shown in Fig. 2. For example, an image region of the visual class person is labeled with the expression "person". These synthesized expressions and the corresponding image regions are used as additional training data for image segmentation from referring expressions. This approach, while straightforward, can also benefit semantically related words: since the GloVe vectors map semantically related words to nearby points in the embedding space, an image region of the person class can also benefit expressions containing related words such as "man", "girl" and "child".
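The synthesis step can be illustrated with the following sketch, which pairs each category-labeled region with its class name as a pseudo-expression; the region dictionary format and function name are simplified assumptions rather than the actual MS COCO API.

```python
def synthesize_expressions(regions, category_names):
    """Turn category-labeled regions into (mask, expression) training pairs.

    regions: list of dicts like {"mask": ..., "category_id": 1}
    category_names: dict mapping category_id -> class name, e.g. {1: "person"}
    The class name itself is used directly as the referring expression.
    """
    pairs = []
    for region in regions:
        expression = category_names[region["category_id"]]  # e.g. "person"
        pairs.append((region["mask"], expression))
    return pairs

# Example: a region of the "person" class yields the expression "person",
# which GloVe places near related words such as "man", "girl" and "child".
```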

2.3 Category-based Image Segmentation

Figure 3: To utilize existing category-based semantic image segmentation models, we map the input expression to visual classes with an LSTM classifier, and match the mapped classes to category-based image segmentation results to obtain a coarse image segmentation output.

In the computer vision community, the traditional category-based semantic image segmentation problem has been extensively studied, and a number of state-of-the-art models for this task have been proposed [Long et al.2015, Chen et al.2015, Zheng et al.2015]. In category-based image segmentation, a classification model is trained on vision-only datasets (Fig. 1b) to label every pixel with a visual class from a pre-defined set of $C$ visual classes. For each pixel $i$ in the image, a probability vector $\mathbf{q}(i)$ over the $C$ classes is output.

To take advantage of these well-studied category-based image segmentation models, we also associate the input textual expression with the visual categories by classifying the expression into the $C$ classes, as shown in Fig. 3. For example, the expression "a kitty sitting at the table" should be associated with the cat class. We treat the expression as a word sequence and feed it into an LSTM network. After scanning through the sequence, a two-layer neural network takes the final LSTM hidden state and outputs a $C$-class probability vector $\mathbf{r}$.
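A minimal sketch of such an expression-to-category classifier is given below, following the description above (an LSTM over the word sequence followed by a two-layer network on its final hidden state); the specific layer sizes and class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ExpressionClassifier(nn.Module):
    """Map a referring expression to a distribution over C visual classes."""

    def __init__(self, vocab_size, num_classes, embed_dim=300,
                 lstm_dim=512, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, lstm_dim, batch_first=True)
        # Two-layer network on top of the final LSTM hidden state.
        self.head = nn.Sequential(
            nn.Linear(lstm_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, word_ids):
        # word_ids: (B, T) token indices for the expression.
        _, (h_n, _) = self.lstm(self.embed(word_ids))
        logits = self.head(h_n[-1])
        return torch.softmax(logits, dim=-1)  # (B, C) class probabilities
```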

Then, the category probability distribution at each image pixel is matched with the probability distribution predicted from the textual expression. If both an image pixel and the textual expression have high probability of being class cat, then it is likely that this pixel belongs to the image region described by the expression (assuming only one cat is present in the image). We define the foreground probability of pixel $i$ belonging to the input expression as the dot product of the two probability vectors:

$p_{\mathrm{cat}}(i) = \mathbf{q}(i)^{\top} \mathbf{r}$ (1)

If both $\mathbf{q}$ and $\mathbf{r}$ are accurate, this approach should get the visual category correct in the segmentation output $p_{\mathrm{cat}}$. For example, in Fig. 1a, given the input expression "the girl with red tie", the method above should output an image mask covering the two persons. Although this method cannot separate individual object instances of the same class, the result can serve as a coarse segmentation for the expression and be integrated into the LSTM-CNN model [Hu et al.2016a] to reduce category errors.

In our implementation, we use the Fully Convolutional Network (FCN) model [Long et al.2015] as the category-based image segmentation model.
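To make Eq. (1) concrete, the following sketch computes the coarse foreground map as a per-pixel dot product between the class probabilities of the segmentation model and those predicted from the expression; the function name is hypothetical.

```python
import numpy as np

def foreground_from_categories(pixel_probs, expr_probs):
    """Per-pixel foreground probability as a dot product over classes (Eq. 1).

    pixel_probs: (C, H, W) class probabilities from the category-based
                 segmentation model (e.g. a softmaxed FCN output).
    expr_probs:  (C,) class probabilities predicted from the expression.
    Returns a (H, W) coarse foreground map.
    """
    return np.einsum("chw,c->hw", pixel_probs, expr_probs)
```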

2.4 Full Model

We combine the approaches described above by taking a weighted average of the per-pixel foreground probability outputs $p_{\mathrm{lstm}}(i)$ from the LSTM-CNN model (augmented with word embeddings and synthesized expressions as in Sec. 2.1 and 2.2) and $p_{\mathrm{cat}}(i)$ from the category-based image segmentation model in Sec. 2.3, as follows:

$p(i) = \alpha \, p_{\mathrm{lstm}}(i) + (1 - \alpha) \, p_{\mathrm{cat}}(i)$ (2)

where $\alpha$ determines the relative weight of the two outputs. $\alpha$ is trained jointly with the whole system, end-to-end with back-propagation.
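A minimal sketch of this combination is shown below, assuming the mixing weight is a single learnable scalar squashed to [0, 1] with a sigmoid; this parameterization is an assumption for illustration, not necessarily the exact form used in the full system.

```python
import torch
import torch.nn as nn

class FullModelCombiner(nn.Module):
    """Weighted average of the two per-pixel foreground probability maps (Eq. 2)."""

    def __init__(self):
        super().__init__()
        # Unconstrained parameter; a sigmoid keeps the mixing weight in [0, 1].
        self.logit_w = nn.Parameter(torch.zeros(1))

    def forward(self, p_lstm, p_cat):
        # p_lstm, p_cat: (B, H, W) foreground probabilities from the
        # LSTM-CNN model and the category-based model, respectively.
        w = torch.sigmoid(self.logit_w)
        return w * p_lstm + (1.0 - w) * p_cat  # trained end-to-end with the rest
```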

3 Experiments

Method prec@0.5 prec@0.6 prec@0.7 prec@0.8 prec@0.9 overall IoU
baseline LSTM-CNN [Hu et al.2016a] 15.25% 8.37% 3.75% 1.29% 0.06% 28.14%
ours (with word embedding) 16.44% 9.25% 4.35% 1.39% 0.04% 30.72%
ours (with word embedding, synthesized expressions) 17.38% 10.40% 4.72% 1.48% 0.07% 31.52%
ours (with category-based image segmentation) 20.89% 13.07% 6.65% 2.73% 0.36% 33.53%
ours (full model) 21.08% 13.34% 7.47% 2.98% 0.44% 34.06%
Table 1: G-Ref dataset: The precision and overall IoU of our methods and baseline approach (higher is better).
Method prec@0.5 prec@0.6 prec@0.7 prec@0.8 prec@0.9 overall IoU
baseline LSTM-CNN [Hu et al.2016a] 26.82% 16.04% 7.58% 1.83% 0.06% 35.34%
ours (with word embedding) 26.80% 16.26% 7.52% 1.91% 0.05% 35.24%
ours (with word embedding, synthesized expressions) 26.92% 16.57% 8.14% 2.04% 0.01% 35.53%
ours (with category-based image segmentation) 10.38% 5.27% 2.21% 0.72% 0.05% 26.10%
ours (full model) 27.56% 17.06% 8.18% 2.23% 0.13% 36.05%
Table 2: UNC-Ref dataset: The precision and overall IoU of our methods and baseline approach (higher is better).
Method prec@0.5 prec@0.6 prec@0.7 prec@0.8 prec@0.9 overall IoU
baseline LSTM-CNN [Hu et al.2016a] 34.02% 26.71% 19.32% 11.63% 3.92% 48.03%
ours (with word embedding) 35.86% 28.37% 20.49% 12.47% 4.48% 49.91%
Table 3: ReferIt dataset: The precision and overall IoU of our methods and baseline approach (higher is better).

We evaluate how additional vision-only and text-only datasets, used with our method, help improve the performance of image segmentation from referring expressions. Existing image segmentation datasets with joint vision and language annotations include G-Ref, UNC-Ref [Mao et al.2016] and ReferIt [Kazemzadeh et al.2014], containing 25799, 19994 and 19997 images, respectively. MS COCO [Lin et al.2014] and the Gigaword corpus are used as the additional vision-only and text-only datasets, respectively. MS COCO contains 123287 images with class label annotations over image regions (Fig. 1b). Both G-Ref and UNC-Ref are built upon MS COCO and have image regions annotated with both visual class labels and textual expressions, while ReferIt contains image regions with textual expressions but no class labels.

Baseline. On G-Ref (we report performance on its validation set, since the test set has not been released), UNC-Ref and ReferIt, we train and evaluate the baseline LSTM-CNN model under the precision and overall IoU metrics [Hu et al.2016a], as shown in Tables 1, 2 and 3.
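For reference, the sketch below computes the two metrics in the way they are commonly defined for this task: prec@X is the fraction of expressions whose per-sample IoU exceeds the threshold X, and overall IoU is the cumulative intersection divided by the cumulative union over the whole set (our reading of the metrics in [Hu et al.2016a]).

```python
import numpy as np

def evaluate(pred_masks, gt_masks, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Precision@X and overall IoU over lists of binary (H, W) masks."""
    inter_total, union_total = 0.0, 0.0
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        ious.append(inter / union if union > 0 else 0.0)
        inter_total += inter
        union_total += union
    # prec@X: fraction of samples with IoU above each threshold.
    precisions = {t: float(np.mean([iou > t for iou in ious])) for t in thresholds}
    # overall IoU: total intersection over total union across the dataset.
    overall_iou = inter_total / union_total if union_total > 0 else 0.0
    return precisions, overall_iou
```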

Word embedding and synthesized expressions. As described in Sec. 2.1, we use GloVe vectors pretrained on Gigaword as our word embedding matrix, and add image regions with synthesized expressions as additional training data (Sec. 2.2). Tables 1, 2 and 3 show that the pretrained word embedding and the additional synthesized data improve image segmentation performance over the baseline.

Category-based image segmentation and full model. As described in Sec. 2.3, we train visual category classifiers for the textual expressions on G-Ref and UNC-Ref, train an FCN model on MS COCO for category-based image segmentation, and obtain the full model as in Sec. 2.4. Fig. 4 shows some example results from our full model, and Tables 1 and 2 show the performance of category-based image segmentation and the full model on G-Ref and UNC-Ref. Our full model achieves the highest performance, outperforming previous results.

Figure 5 visualizes image segmentation predictions using different components of our method, where the second to last columns correspond to the results in Tables 1 and 2. Figures 6, 7 and 8 contain more results on the G-Ref, UNC-Ref and ReferIt datasets.

Figure 4: Example results on G-Ref (a, b) and UNC-Ref (c, d) with our full model (columns: input image, heatmap, final output, ground truth). Input expressions: (a) "bird on right side of windowsill"; (b) "a cell phone cover which was opened to repair"; (c) "screen in middle facing you"; (d) "left kid".

4 Conclusion

In this paper, we propose a method to utilize additional large scale vision and text data to improve the performance of image segmentation from referring expressions. A word embedding is pretrained on large text corpora, and textual expressions are synthesized from visual class labels as additional training data. We also integrate well-studied traditional category-based semantic image segmentation models into language-based image segmentation. Experimental results show that our method improves performance over previous work.

As future work, we would like to extend our method to incorporate recent datasets containing entities and bounding box annotations [Krishna et al.2016, Plummer et al.2015].

References

  • [Andreas et al.2016] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Learning to compose neural networks for question answering. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
  • [Chen et al.2015] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. 2015. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In Proceedings of the International Conference on Learning Representations (ICLR).
  • [Devlin et al.2015] Jacob Devlin, Hao Cheng, Hao Fang, Saurabh Gupta, Li Deng, Xiaodong He, Geoffrey Zweig, and Margaret Mitchell. 2015. Language models for image captioning: The quirks and what works. arXiv preprint arXiv:1505.01809.
  • [Donahue et al.2015] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Hendricks et al.2016] Lisa Anne Hendricks, Subhashini Venugopalan, Marcus Rohrbach, Raymond Mooney, Kate Saenko, and Trevor Darrell. 2016. Deep compositional captioning: Describing novel object categories without paired training data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
  • [Hu et al.2016a] Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. 2016a. Segmentation from natural language expressions. arXiv preprint arXiv:1603.06180.
  • [Hu et al.2016b] Ronghang Hu, Huazhe Xu, Marcus Rohrbach, Jiashi Feng, Kate Saenko, and Trevor Darrell. 2016b. Natural language object retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Kazemzadeh et al.2014] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara L Berg. 2014. ReferItGame: Referring to objects in photographs of natural scenes. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • [Krishna et al.2016] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. 2016. Visual genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332.
  • [Krizhevsky et al.2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS).
  • [Lin et al.2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV).
  • [Long et al.2015] Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Mao et al.2015] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. 2015. Deep captioning with multimodal recurrent neural networks (m-RNN). In Proceedings of the International Conference on Learning Representations (ICLR).
  • [Mao et al.2016] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan Yuille, and Kevin Murphy. 2016. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • [Plummer et al.2015] Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • [Rohrbach et al.2015] Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, and Bernt Schiele. 2015. Grounding of textual phrases in images by reconstruction. arXiv preprint arXiv:1511.03745.
  • [Shotton et al.2009] Jamie Shotton, John Winn, Carsten Rother, and Antonio Criminisi. 2009. Textonboost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. International Journal of Computer Vision, 81(1):2–23.
  • [Xu and Saenko2015] Huijuan Xu and Kate Saenko. 2015. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. arXiv preprint arXiv:1511.05234.
  • [Yang et al.2016] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. 2016. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Zheng et al.2015] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr. 2015. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).