Referring Expression Object Segmentation with Caption-Aware Consistency

10/10/2019 ∙ by Yi-Wen Chen, et al. ∙ 4

Referring expressions are natural language descriptions that identify a particular object within a scene and are widely used in our daily conversations. In this work, we focus on segmenting the object in an image specified by a referring expression. To this end, we propose an end-to-end trainable comprehension network that consists of the language and visual encoders to extract feature representations from both domains. We introduce the spatial-aware dynamic filters to transfer knowledge from text to image, and effectively capture the spatial information of the specified object. To better communicate between the language and visual modules, we employ a caption generation network that takes features shared across both domains as input, and improves both representations via a consistency that enforces the generated sentence to be similar to the given referring expression. We evaluate the proposed framework on two referring expression datasets and show that our method performs favorably against the state-of-the-art algorithms.



There are no comments yet.


page 1

page 2

page 4

page 9

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Object segmentation aims to separate foreground objects from the background. In this work, we focus on object segmentation from referring expressions, in which the segmentation is guided by a natural language description that identifies a particular object instance in a scene, e.g, the man in a blue jacket or the laptop on the left.

Figure 1: Illustration of the proposed algorithm. Given an image and a referring expression, we use a comprehension network to segment the specified object. With the features containing both language and visual information as input, the generation network produces a sentence identifying the target object. By enforcing a caption-aware consistency loss between the query and output sentence, we further improve the performance of comprehension.

Transferring knowledge between language and visual domains is an important but challenging task. Two relevant tasks are: 1) referring expression comprehension for localizing or segmenting an object according to a natural language description, and 2) referring expression generation for producing a sentence that identifies a particular object in an image. Existing methods [Mao_CVPR_2016, Yu_ECCV_2016] address both tasks by constructing a generation model and inferring the region which maximizes the expression posterior in the comprehension task. However, such joint information is usually exploited to only enhance the generation performance.

In this paper, we focus on referring expression object segmentation. Unlike existing methods, our model jointly considers both tasks to benefit the comprehension task. Intuitively, when one signal, e.g, a sentence, is transferred from the language domain to the visual domain, and then transferred back to the language domain, the transferred-back signal is supposed to be similar to the original one. By exploiting this property, we develop a network that jointly considers referring expression comprehension and generation, and enforces a caption-aware consistency between the visual and language domains.

To this end, we first design a comprehension network that contains the language and visual encoders to extract the feature representations of respective domains. To connect these two domains, we further propose to use spatial-aware dynamic filters to bridge the language and visual encoders. Meanwhile, these filters provide visual representations with the localization ability from the input referring expression. Based on the proposed baseline model, we then employ a caption generation model that takes feature representations from the comprehension network as inputs. The generated referring expression should be similar to the original sentence, and we leverage this property as an additional consistency cue to enhance the language and visual representations. The main steps of the proposed model are illustrated in Figure 1.

To evaluate the proposed method, we conduct extensive experiments on the RefCOCO [Yu_ECCV_2016] and RefCOCOg [Mao_CVPR_2016, Nagaraja_ECCV_2016] datasets. Experimental results show that our model performs favorably against the state-of-the-art methods. In addition, we provide the ablation study to demonstrate the effectiveness of each component in the proposed framework, including the spatial-aware dynamic filters and caption-aware consistency. The main contributions of this work are summarized as follows: 1) We integrate referring expression generation into referring expression comprehension so that the two complementary tasks can benefit each other via enforcing the caption-aware consistency. 2) We develop the spatial-aware dynamic filters that bridge the visual and language domains and facilitate the feature learning process. 3) We design an end-to-end trainable network for referring expression comprehension, achieving the state-of-the-art performance.

2 Related Work

Referring Expression Comprehension.

The task of referring expression comprehension aims to localize or segment an object given a natural language description. Existing methods [Hu_CVPR_2016_2, Luo_CVPR_2017, Mao_CVPR_2016]

mainly rely on recurrent caption generation models, and select the object with the maximum posterior probability of the expression among all object proposals. By exploring the relationship between the object and its context 

[Nagaraja_ECCV_2016, Yu_ECCV_2016, Zhang_CVPR_2018], the target object can be better localized. Recent approaches adopt various learning strategies, such as embedding images and sentences into a common feature space [Wang_CVPR_2016, Rohrbach_ECCV_2016], or learning attributes [Liu_ICCV_2017_2] to help differentiate objects of the same category. In addition, Hu et al [Hu_CVPR_2017] analyze the inter-object relationships by parsing the sentence into subject, relationship and object parts. To jointly consider the associated factors such as attributes and relationships between objects, Yu et al [Yu_CVPR_2018] propose a modular attention network to decompose the expression into subject appearances, locations, and relationships to other objects.

While the aforementioned methods mainly localize an object by a bounding box, algorithms that focus on segmentation [Hu_ECCV_2016, Li_CVPR_2018, Liu_ICCV_2017, Shi_ECCV_2018, Margffoy-Tuay_ECCV_2018] usually encode the referring expression through the LSTM network and use a fully convolutional network for foreground/background segmentation by using both the language and visual features. Different from these approaches, our proposal-based model first localizes objects and performs segmentation via learning better feature representations through a referring generation network that considers the caption consistency. We note that the approach in [Rohrbach_ECCV_2016] also considers the consistency between the generated sentence and input sentence but does not target at segmenting objects. Furthermore, this approach uses pre-defined and fixed region proposals, in which the visual representations are not updated through the proposals. In contrast, our unified framework is end-to-end trainable while bridging features across visual and language domains.

Referring Expression Generation.

The generation task of referring expressions is a special case of image captioning. Rather than describing the whole image, the generated sentence uniquely identifies an object within the image. A referring expression is considered good if one can localize the corresponding object by comprehending this referring expression. Therefore, referring expression comprehension is often employed in the generation task 

[Mao_CVPR_2016, Liu_ICCV_2017_2, Luo_CVPR_2017] to improve the performance.

CNN-LSTM based models are widely used for image captioning [Karpathy_CVPR_2015, Rennie_CVPR_2017, Vinyals_CVPR_2015, Xu_ICML_2015]. While a CNN model extracts visual features, an LSTM module produces captions. To address referring expression generation, Mao et al [Mao_CVPR_2016] combine the extracted visual features with the location and size of the target object. Furthermore, this method uses a CNN-LSTM model for the comprehension task and jointly trains the generation and comprehension modules. Yu et al [Yu_CVPR_2017] further propose a joint speaker-listener-reinforcer model where a reward function is introduced to guide the expression sampling. Their approach jointly trains the generation and comprehension networks, but does not specifically consider the caption consistency as our framework. While the aforementioned methods mainly utilize the comprehension model to generate high-quality sentences, in this work, we focus on the comprehension task and demonstrate that the generation model also facilitates the comprehension performance by enforcing the proposed caption-aware consistency between the visual and language domains.

Figure 2: Architecture of the proposed framework. The proposed network is composed of a language encoder , a visual encoder , a Mask R-CNN head , and a caption generator . Features of the referring expression and image are extracted by and respectively. Knowledge is transferred from language domain to visual domain via the spatial-aware dynamic filters , by which a response map is generated to produce a location-aware feature . Based on this feature that carries information from both domains, we generate object bounding box and mask by the Mask R-CNN head . The caption generator takes features and as inputs, and produces a sentence identifying the target object. We apply a caption-aware consistency loss between the input query and the generated sentence to further improve the language and visual feature representations.

3 Proposed Framework

In this work, we focus on referring expression object segmentation. The overview of the proposed framework is illustrated in Figure 2. Given an image and a natural language description , we aim to segment the object in specified by . To this end, we propose an end-to-end trainable network that contains a language encoder , a visual encoder , a Mask R-CNN head , and a caption generator . The encoders and extract language and visual features, respectively. Motivated by the dynamic filter network [Brabandere_NIPS_2016], we enhance the ability of specific object localization via introducing the spatial-aware dynamic filters to transfer knowledge from text to image. The yielded cross-modal information allows the Mask R-CNN head to produce more accurate segmentation results. To further improve our model, we employ the caption generation network and a consistency loss to jointly train the comprehension and generation networks. We describe each component of the proposed network below.

3.1 Segmentation from Referring Expression

In this subsection, we introduce how the proposed network generates the object segment given the query referring expression. To this end, the language encoder , visual encoder , and spatial-aware dynamic filters are elaborated.

Language Encoder.

Similar to [Yu_CVPR_2018], we use a bi-directional LSTM model to extract features of a referring expression. Given a referring expression of words with each word

represented by a one-hot vector

, the bi-directional LSTM is applied to encode the whole sentence in both forward and backward directions:


where and are the forward and backward hidden states at time step , respectively. We concatenate the final hidden states in both directions to yield the feature representation of the referring expression.

Visual Encoder.

Given an input image , we aim at pixel-wise segmentation. Different from the approaches based on the fully convolutional network (FCN) that does not generate instance-aware results, we adopt the proposal-based Mask R-CNN [He_ICCV_2017] framework to generate an object mask based on each detected object bounding box. We use the ResNet-101 [He_CVPR_2016] model as the backbone network and extract features over the entire image. The feature from the final convolutional layer of the fourth block, denoted by , serves as the representation of image .

Spatial-aware Dynamic Filters.

Motivated by the recent work [Li_CVPR_2017] on tracking with natural language, we utilize dynamic convolutional filters as a bridge to connect the language and visual domains. Unlike conventional convolutional filters that apply the same weights to all input images, dynamic convolutional filters are generated depending on the input sentence. Given the feature representation of a sentence , a single fully connected layer parameterized by the weights and the bias is adopted to generate a set of dynamic filters:


where is the hyperbolic tangent function, and is a set of convolutional filters with the same number of channels as the visual representation . We then convolve the visual representation with the generated dynamic filters to obtain a response map :


With this formulation, knowledge is transferred from the language domain through learning the dynamic filters, with which the response map reflects the information inferred from the referring expression.

However, such filters consider the entire image and thus may only be able to catch the global structure but ignore spatially distributed objects. As such, we propose to utilize spatial-aware dynamic convolutional filters that consider local regions of the image, including up, down, left, right, horizontal and vertical middle regions, and each region covers a half area of the entire image. We thereby apply six additional fully connected layers to generate spatial-aware dynamic filters corresponding to each region via (2). The six dynamic filters are then convolved with the visual feature , where the values outside the defined regions are set to 0. Then we obtain six spatial-aware response maps similar to (3), denoted by , in which each map focuses on its defined region.

To combine these spatial response maps and the one from (3), we adopt another set of dynamic filters with 7 channels, which are also generated from the sentence representation , to account for the importance of each region depending on the input sentence. We convolve with the concatenation of the response maps and obtain a final response map with one channel, i.e,



is the sigmoid function with output range

. Ideally, represents a map of the object specified by the input referring expression. Thus, we apply a binary cross-entropy loss to supervise the response map with respect to the ground-truth object mask.

Baseline Objective.

Based on the response map in (4), we take the element-wise multiplication of and to be the caption-aware feature representation , which carries the information from both the language and visual domains. To obtain the final segmentation result, we then feed into the Mask R-CNN [He_ICCV_2017] RoI head , which includes the bounding box and the binary segmentation branches. The overall objective can be written as:


where includes the classification loss, bounding box loss and mask loss, the same as those defined in Mask R-CNN. With this formulation, we construct an end-to-end trainable network that produces the referring expression object segmentation. Unlike the state-of-the-art methods, such as MAttNet [Yu_CVPR_2018], that require multiple training stages and pre-processing steps, our model can be efficiently learned, through the help of spatial-aware dynamic filters which provide the spatial information from the input sentence.

3.2 A Joint Framework

In light of the cycle consistency work [Zhu_ICCV_2017] that solves the domain transfer problem in cross-directions, we integrate both the referring expression comprehension and generation tasks into a joint framework, where their feature representations are shared and can be jointly optimized through back-propagation.

Referring Expression from Segmentation.

To generate a sentence describing a particular object within an image, we adopt the attention-based image captioning model [Xu_ICML_2015]. To train the caption generation model , we input the feature representation extracted from Mask R-CNN and concatenate it with which contains the spatial information about the object. As a result, during training the caption generation model, gradients can be back-propagated through to update the Mask R-CNN feature extractor, as well as through to optimize dynamic filters and the language encoder.

Caption-aware Consistency.

Given the ground-truth sentence , which is the input to the language encoder, the objective for caption generation is to minimize the cross-entropy loss :


where is the probability of predicting a particular word from the caption generation network parameterized by

. Here, this loss function in our framework enforces that the predicted sentence

generated by the feature , i.e, , should be consistent with the input query that generates the same feature, i.e, , where is a mixed operation involving the visual encoder and dynamic filters in the proposed method. Hence, our caption-aware consistency actually enforces .

Overall Objective.

To exploit the caption-aware consistency, we jointly train the comprehension model, including the language encoder , visual encoder , Mask R-CNN head in Section 3.1 and the caption generation model in Section 3.2. The total loss function is extended from (5) to:


where is the coefficient of the consistency loss. We note that adding enables the joint optimization between the language and the visual domains. That is, the intermediate feature would be updated by the guidance from the first two loss functions in (7), which are supervised by the comprehension task, and in the meanwhile updates the feature based on the caption generation task.

Model Info. val testA testB val* val test
Nagaraja et al [Nagaraja_ECCV_2016] C 57.30 58.60 56.40 - - 49.50
Luo et al [Luo_CVPR_2017] J - 67.94 55.18 49.07 - -
Liu et al [Liu_ICCV_2017_2] Attr, J - 72.08 57.29 52.35 - -
Yu et al [Yu_CVPR_2017] J - 73.78 63.83 59.84 - -
MAttNet [Yu_CVPR_2018] Attr, Attn, L, R 76.65 81.14 69.99 - 66.58 67.27
VC [Zhang_CVPR_2018] C - 73.33 67.44 62.30 - -
baseline - 72.65 76.65 65.75 54.18 58.09 58.32
+ spatial coords [Hu_ECCV_2016] L 75.89 78.57 68.54 61.37 64.10 64.21
+ spatial-aware filters L 76.98 79.30 69.75 61.65 65.18 65.28
+ caption-aware consistency J 76.05 78.84 69.36 60.69 64.71 63.79
full model L, J 77.08 80.34 70.62 62.34 65.83 65.44
Table 1: Localization results of our method and the competing methods on two datasets. We summarize the major information used in each method, including context (C), attribute prediction (Attr), attention module (Attn), location (L), relationships between objects (R), and joint training with referring expression generation (J).

3.3 Model Training and Implementation Details

To train the joint network model, we adopt a sequential training strategy to optimize the objective in (7). First, we only update the comprehension network by optimizing (5). Then, we pre-train the caption generation network by optimizing (6) as a warm-up. Finally, we update the entire framework with the objective in (7). With the trained model, we choose the detected object with the largest score during testing.

We implement our model with PyTorch using the SGD optimizer. For the language encoder, the dimension of the LSTM hidden states is set to

. By concatenating the forward and backward hidden states, the feature is a -dimensional vector. In the visual encoder, the visual feature is of dimension . Thus, we also generate the dynamic filters of dimension . For the caption generation model, the input spatial features are resized to and have the same number of channels as that of the concatenation of and . When training the full model, the loss weight in (7) is set to for all experiments. The codes and models are available at:

4 Experimental Results

We evaluate the proposed framework on two referring expression datasets: RefCOCO [Yu_ECCV_2016] and RefCOCOg (with two splits111The first split [Mao_CVPR_2016] randomly partitions objects into training and validation sets. We denote the validation set as “val*” in this paper. The second split [Nagaraja_ECCV_2016] randomly partitions images into training, validation and testing sets, where we denote the validation and testing ones as “val” and “test”, respectively.[Mao_CVPR_2016, Nagaraja_ECCV_2016]. The two datasets are collected from the Microsoft COCO images [MSCOCO], with different properties of expressions. We show both detection and segmentation results with comparisons against the state-of-the-art algorithms. In addition, we present an ablation study to demonstrate the importance of each component in the proposed framework. More results are provided in the supplementary material.

For evaluating the detection performance, the predicted bounding box is considered correct if the intersection-over-union (IoU) of the prediction and the ground truth is above 0.5. As for the segmentation quality, we use Intersection-over-Union (IoU) as metric.

4.1 Localization Results

In Table 1, we show comparisons with existing state-of-the-art algorithms [Liu_ICCV_2017_2, Luo_CVPR_2017, Nagaraja_ECCV_2016, Yu_CVPR_2018, Yu_CVPR_2017, Zhang_CVPR_2018]. Since each method adopts diverse information to help the comprehension task, we further summarize the major cues that each approach relies on, such as context information [Nagaraja_ECCV_2016, Zhang_CVPR_2018], attribute prediction [Liu_ICCV_2017_2, Yu_CVPR_2018], and joint training with referring expression generation [Liu_ICCV_2017_2, Luo_CVPR_2017, Yu_CVPR_2017].

Table 1 shows that the proposed method performs favorably against most methods by significant margins, and competitively with MAttNet [Yu_CVPR_2018]. We note that, the MAttNet method utilizes various cues, including attention module, attribute prediction, location information, and relations between objects to achieve good performance, while our model only focuses on the location cue and joint training with referring expression generation. It is also worth mentioning that our model is a unified framework that is end-to-end trainable, while MAttNet requires multiple separate training stages to obtain the final model. The runtime speed of our method is 0.17 seconds per image, which is much faster than MAttNet with 0.67 seconds per image on an Intel Xeon 2.5 GHz machine and an NVIDIA GTX 1080 Ti GPU with 11 GB memory.

Model Backbone Net val testA testB val* val test
D+RMI+DCRF [Liu_ICCV_2017] Deeplab101 45.18 45.69 45.57 - - -
RRN+LSTM+DCRF [Li_CVPR_2018] Deeplab101 55.33 57.26 53.95 36.45 - -
MAttNet [Yu_CVPR_2018] Res101 56.51 62.37 51.70 - 47.64 48.61
KWAN [Shi_ECCV_2018] Deeplab101 - - - 36.92 - -
DMN [Margffoy-Tuay_ECCV_2018] DPN92 49.78 54.83 45.13 36.76 - -
Ours Res101 58.90 61.77 53.81 44.32 46.37 46.95
Table 2: Segmentation results of our method and the competing methods on two datasets.
“person holding tray” “girl in red coat” “darkest colored horse” “lighter brown horse with head down”
“a small giraffe” “giraffe to the far left” “the slice of cake on the left” “chocolate dessert cake on a plate”
Figure 3: Sample results of objects referred by various query expressions.
“elephant in back”
Input Image Ground Truth Baseline Spatial-aware Filters Full Model
Figure 4: Sample results from different variants of the proposed model on RefCOCO.

4.2 Segmentation Results

We present the experimental results with comparisons to the state-of-the-art algorithms including D+RMI+DCRF [Liu_ICCV_2017], RRN+LSTM+DCRF [Li_CVPR_2018], MAttNet [Yu_CVPR_2018], KWAN [Shi_ECCV_2018] and DMN [Margffoy-Tuay_ECCV_2018] on the two datasets in Table 2. Overall, our method consistently and significantly outperforms other segmentation-based approaches that use a similar backbone network (i.e, Deeplab [CP2016Deeplab] with ResNet-101) as ours. Different from the DMN [Margffoy-Tuay_ECCV_2018] scheme that utilizes dynamic filters in a sequential manner for capturing the information of each word in a sentence, our model generates the dynamic filters in a spatial-aware manner, where each set of filters produces a response map to certain region of the image. We note that the proposed method achieves better performance. Similar to the localization results, MAttNet [Yu_CVPR_2018] that fuses multiple cues performs competitively with our model. We present qualitative examples of referring expression object segmentation in Figure 3. The proposed model can segment different objects according to various query expressions, such as the location, color, or action information, and further demonstrates the effectiveness of the proposed caption-aware consistency framework.

4.3 Ablation Study

We present the results of an ablation study in Table 1. We first show that using the proposed spatial-aware dynamic filters improves the baseline with only a single dynamic filter or the spatial-aware mechanism [Hu_ECCV_2016] that concatenates spatial coordinates and feature maps. Second, the referring expression generation network with caption-aware consistency performs favorably against the baseline model. In the full model with both spatial-aware filters and caption-aware consistency, higher performance gains are achieved over other baselines.

We present sample segmentation results predicted by different variants of our model in Figure 4. Compared with the baseline and the model with spatial-aware filters, the proposed full model can localize objects accurately while the baseline model predicts the wrong object. In addition to improving the localizing ability, our full model enhances feature representations around the object. For instance, the elephant in back is well segmented by our model even if it is surrounded by complex background and similar instances.

5 Concluding Remarks

In this paper, we propose an end-to-end trainable framework for referring expression segmentation. We design a comprehension model that consists of language and visual encoders to extract feature representations in the respective domains. By introducing the spatial-aware dynamic filters, knowledge can be transferred from language domain to visual domain, while capturing the useful location cue. In addition to the proposed baseline model, we employ a caption generation network to connect referring expression comprehension and generation. Considering the consistency that the generated sentence is supposed to be similar to the given referring expression, we enforce a caption-aware consistency loss and further enhance the language and visual representations. Extensive experiments and an ablation study on two referring expression datasets show that the proposed algorithm achieves favorable performance against the state-of-the-art methods.


This work was supported in part by Ministry of Science and Technology (MOST) under grants 107-2628-E-001-005-MY3 and 108-2634-F-007-009.