In this work, we present a state-of-the-art model, BiLingUNet, for image segmentation from referring expressions. Given an image and a natural language description, BiLingUNet returns a segmentation mask that marks the object(s) described. The language input may contain various visual attributes (e.g., color, shape), spatial information (e.g., “on the right”, “in front of”), actions (e.g., “running”, “sitting”) and interactions/relations between different objects (e.g., “arm of the chair that the cat is sitting in”). We show that BiLingUNet achieves state-of-the-art performance on four referring expression datasets.
Our model is based on the U-Net image segmentation architecture. A bottom-up branch starts from low-level visual features and applies a sequence of contracting filters that produce successively higher-level feature maps with lower spatial resolution. A top-down branch takes the final low-resolution feature map and applies a sequence of expanding filters that eventually produce a segmentation mask at the original image resolution. The defining feature of the U-Net architecture is the set of skip connections between the contracting and the expanding filters.
To make the segmentation conditional on language, we follow the LingUNet model and add language-conditional filters at each level of the architecture. LingUNet only uses 1x1 language-conditional filters and only applies them on the expanding branch. 1x1 filters can highlight locations with specific properties but cannot represent relational concepts that involve multiple locations. Modulating only the top-down/expanding branch with language means the high-level features extracted by the bottom-up/contracting branch cannot be language-conditional. Our model expands on LingUNet by modulating both branches with wider language-conditional filters and shows significant improvements on the referring expression segmentation task.
Our experiments support the following conclusions:
- Using language to customize visual filters works better than concatenating a linguistic representation to the visual input.
- For best results, language should modulate both the top-down and the bottom-up visual processing.
- The filters that implement language modulation need a spatial extent larger than 1x1 to represent relational concepts.
2 Related Work
In this section, we review related work in three areas: Semantic segmentation classifies the object category of each pixel in an image without language input. Referring expression comprehension locates a bounding box for the object(s) described in the language input. Image segmentation from referring expressions generates a segmentation mask for the object(s) described in the language input.
2.1 Semantic Segmentation
which are trained on the large ImageNet image classification dataset. FCN-based models simply transform the fully connected layers of these ImageNet models into convolutional layers. A deconvolutional network generates the segmentation mask by fusing visual representations of different resolutions.
Recently, DeepLab  has improved the state-of-the-art in semantic segmentation by replacing regular convolutions with atrous (dilated) convolutions in the last residual block of ResNets. Atrous convolutional layers apply the convolution operation after they transform dense filters into sparse filters by inserting zeros between consecutive filter values. In addition to atrous convolution, DeepLab implements a pooling operation called Atrous Spatial Pyramid Pooling (ASPP) to augment multi-scale information by combining outputs of 4 different parallel atrous convolution operations.
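The zero-insertion view of atrous convolution described above can be illustrated with a minimal sketch in plain Python. The function name `dilate_kernel` and the example values are hypothetical and are not taken from DeepLab; the sketch only shows how a dense filter becomes a sparse one for a given dilation rate.

```python
def dilate_kernel(kernel, rate):
    """Insert (rate - 1) zeros between consecutive taps of a 2D kernel."""
    if rate < 1:
        raise ValueError("rate must be >= 1")
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (k_h - 1) * rate + 1
    out_w = (k_w - 1) * rate + 1
    dilated = [[0.0] * out_w for _ in range(out_h)]
    for i in range(k_h):
        for j in range(k_w):
            # Original taps land on a grid with spacing `rate`.
            dilated[i * rate][j * rate] = kernel[i][j]
    return dilated


# Hypothetical dense 3x3 filter.
dense = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0],
         [7.0, 8.0, 9.0]]

# With rate 2, the 3x3 filter covers a 5x5 window using the same 9 weights.
sparse = dilate_kernel(dense, rate=2)
for row in sparse:
    print(row)
```

Applying the sparse filter with an ordinary convolution gives the same result as an atrous convolution with the dense filter, which is why the receptive field grows without adding parameters.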
The U-Net architecture improves over the standard FCN by connecting the contracting (bottom-up) and expanding (top-down) paths at the same resolution: the output of the encoder layer at each level is passed to the decoder at the same level. LingUNet is a multimodal extension of the U-Net architecture that makes the expanding path language-conditional; it is used for following natural language instructions in virtual environments.
Our proposed architecture uses the DeepLab network as an image encoder and a variant of LingUNet for segmentation. We find that for image segmentation, making both the contracting and expanding paths language conditional and using wider language filters (as opposed to 1x1 filters used by LingUNet) gives the best results.
2.2 Referring Expression Comprehension
In this task, an image and an object description are given as inputs, and a model tries to find the referred object by locating it with a bounding box. Early models for referring expression comprehension were typically built using a hybrid LSTM-CNN architecture [23, 35], which is also common in other visually grounded language tasks such as image captioning [46, 25, 10] and visual question answering. Using an external region proposal framework [52, 12, 16] to detect objects in the input image was also found to be useful for referring expression comprehension. Among the region proposal approaches, the most notable are the Region-based CNN (R-CNN) methods [15, 39, 17]. Newer referring expression comprehension models use one of these R-CNN models as a sub-component [21, 51, 50, 47]. Recent models also take advantage of advanced concepts like attention and neural module networks [3, 48, 1]. The most notable ones are MAttNet and NMTree. Both try to predict object locations, but they also segment objects after finding the bounding boxes. MAttNet improves CMN by integrating a word-level attention mechanism to capture relations between objects. NMTree generates a dependency parse tree of the given textual input, assigns modules to its nodes, and then, starting from the bottom, recurrently predicts the most relevant bounding box for each node. Its final prediction is the one for the root node. Both models process the image in a bottom-up fashion, whereas the proposed BiLingUNet performs image segmentation by modulating both top-down and bottom-up visual processing.
2.3 Image Segmentation from Referring Expressions
In this section, we review the task of image segmentation from referring expressions and compare our model with previous work. The first proposed solution is a simple hybrid CNN-LSTM model. A CNN, specifically VGG-16, extracts spatial feature maps as the visual representation of the input image. An LSTM network encodes the language input, and its final hidden state becomes the language representation. Spatial location information and the language representation are appended to the visual representation at each spatial location. As the last step, an upsampling network takes this multi-modal representation and generates the segmentation mask. The Recurrent Multimodal Interaction (RMI) architecture takes this LSTM-CNN baseline a step further by transforming it into a sequential process. Instead of concatenating multi-modal inputs, our model modulates the visual paths with the language input at multiple resolutions within a simpler architecture. The Key-Word Aware Network (KWAN) generates attention scores for pairs of visual representations and LSTM hidden states for each spatial location and time step. Weighted visual and language representations are obtained using these attention scores. The visual representation, the weighted visual representation, and the weighted language representation vectors are concatenated to construct the multi-modal representation for KWAN. The Cross-Modal Self-Attention model (CMSA) integrates a different attention mechanism, self-attention, into its architecture. Unlike other methods, CMSA simply uses the concatenation of the word embedding vectors to represent the referring expression. This concatenation is appended to each spatial position of the visual representation. A self-attention layer takes this combined representation as input and generates a multi-modal representation for the image and language pair. This operation is repeated at different scales, and the results are fused together. Using multiple scales and a multi-modal representation for each word makes CMSA expensive in terms of memory usage.
In the literature, the models closest to our work are Recurrent Refinement Networks (RRN), the Dynamic Multimodal Network (DMN), Convolutional RNN with See-through-Text Embedding Pixelwise heatmaps (Step-ConvRNN or ConvRNN-STEM), and LingUNet. RRN, which has a structure similar to U-Net, is built on top of a Convolutional LSTM (ConvLSTM) network. The concatenated multi-modal representation is fed to the ConvLSTM at the initial time step. At each time step, the corresponding spatial feature map of the encoder is used as input. Unlike in our model, the ConvLSTM filters are not generated from the language representation, and the multi-modal representation is used only in the initial time step. DMN generates 1x1 language-conditional filters from the language representation of each word. It convolves the visual representation with these language-conditional filters to generate a multi-modal representation for each word. As in RMI, the word-level multi-modal representations are fed to a multi-modal recurrent network to obtain a multi-modal representation for the image/language pair. Step-ConvRNN starts with a visual-textual co-embedding and uses a ConvRNN to iteratively refine a heatmap for image segmentation. Step-ConvRNN uses a bottom-up and top-down approach similar to our model; however, BiLingUNet uses spatial language-generated kernels within a simpler architecture.
LingUNet, which is based on U-Net, solves a different problem: navigational instruction following. As an intermediate step, LingUNet generates segmentation maps to predict the goal location by modulating top-down/expanding visual processing with 1x1 language-conditional filters. BiLingUNet extends it by also modulating the bottom-up process with the language input and by using wider text kernels.
Figure 1 shows an overview of our proposed architecture. For a given referring expression S and an input image, the task is to predict a segmentation mask M that covers the object(s) referred to. Our model is based on the LingUNet architecture. First, the model extracts a tensor of visual features using a backbone convolutional neural network, and it encodes the referring expression S to a vector representation using a long short-term memory (LSTM) network. Starting with the visual feature tensor, the model generates feature maps in a contracting and an expanding path, where the final map represents the segmentation mask, similar to U-Net. 3x3 convolutional filters generated from the text representation (text kernels) are used to modulate both the contracting and the expanding paths. The original LingUNet design only modulates the expanding path with language and uses 1x1 text-generated filters. Our experiments show that modulating both paths and using wider filters each contribute significantly to our segmentation accuracy.
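To illustrate why the filter width matters, the following minimal sketch compares a 1x1 filter with a 3x3 filter on a single-channel map in plain Python. The kernels are hand-written for illustration (in the model they would be generated from text), and `conv2d_same` is a hypothetical helper, not part of any library.

```python
def conv2d_same(feature, kernel):
    """Single-channel 2D cross-correlation with zero padding ('same' size)."""
    h, w = len(feature), len(feature[0])
    k = len(kernel)  # square kernel with odd side length
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in range(k):
                for dx in range(k):
                    yy, xx = y + dy - pad, x + dx - pad
                    if 0 <= yy < h and 0 <= xx < w:
                        total += feature[yy][xx] * kernel[dy][dx]
            out[y][x] = total
    return out


# A 5x5 map with one active location at row 2, column 2.
attention = [[0.0] * 5 for _ in range(5)]
attention[2][2] = 1.0

# A 1x1 kernel can only rescale what is already active at each location.
scaled = conv2d_same(attention, [[2.0]])

# A 3x3 kernel whose only non-zero tap is left of centre copies each value
# from its left neighbour, moving the activation one step to the right.
shift_right = [[0.0, 0.0, 0.0],
               [1.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]
shifted = conv2d_same(attention, shift_right)
```

In this toy setting, `scaled` is still active only at (2, 2), while `shifted` is active at (2, 3). A relation such as "to the right of" needs the second kind of behaviour, which a 1x1 kernel cannot express.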
3.1 Image Features
Given an input image, we extract visual features from the fourth layer of the DeepLab ResNet101-v2 network pre-trained on the Pascal VOC dataset. (Footnote 1: This pretrained model has been used in previous work to generate a visual representation [31, 49, 5]; however, in some cases the outputs of all five convolutional layers were used [49, 5]. We use the output of the fourth layer mainly due to GPU memory limitations.) We set as the image size for our experiments. Thus, the output of the fourth convolutional layer of DeepLab ResNet101-v2 produces a feature map with the size of and 1024 channels for this setup. We concatenate 8-D location features to this feature map following previous work [23, 31, 49, 5]. The first three dimensions of the location features represent normalized horizontal positions, the following three dimensions stand for vertical positions, and the last two dimensions represent the normalized width and height information of the image. The final representation, , has 1032 channels, and the spatial dimensions are .
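As a rough sketch of the 8-D location features, the code below builds one feature vector per grid cell in plain Python. The exact encoding is an assumption: we take the minimum, centre, and maximum coordinates of each cell scaled to [-1, 1] for the first six dimensions, and the reciprocals of the grid width and height for the last two. The function name and the 4x4 grid are hypothetical.

```python
def location_features(height, width):
    """Return a height x width grid of 8-D location vectors (assumed encoding)."""
    grid = []
    for y in range(height):
        row = []
        for x in range(width):
            # Horizontal extent of the cell, scaled to [-1, 1].
            x_min = 2.0 * x / width - 1.0
            x_max = 2.0 * (x + 1) / width - 1.0
            # Vertical extent of the cell, scaled to [-1, 1].
            y_min = 2.0 * y / height - 1.0
            y_max = 2.0 * (y + 1) / height - 1.0
            row.append([
                x_min, (x_min + x_max) / 2.0, x_max,
                y_min, (y_min + y_max) / 2.0, y_max,
                1.0 / width, 1.0 / height,
            ])
        grid.append(row)
    return grid


# The final representation has 1032 channels, 8 of which are location features.
VISUAL_CHANNELS = 1032 - 8

grid = location_features(4, 4)  # toy 4x4 grid for illustration only
total_channels = VISUAL_CHANNELS + len(grid[0][0])
```

Concatenating these vectors to the visual features along the channel axis gives each spatial position explicit access to where it lies in the image.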
3.2 Language Representation
Consider a referring expression where represents the ’th word. In this work, each word is represented with a -dimensional GloVe embedding, i.e. . We map the referring expression to hidden states using a long short-term memory network  as . We use the final hidden state of the LSTM as the textual representation, . We set the size of hidden states to , i.e. .
After generating the image and language representations, BiLingUNet generates a segmentation mask M. The architecture is similar to LingUNet, which extends the U-Net image segmentation model by making its expanding branch conditioned on the language. BiLingUNet extends the LingUNet model one step further by conditioning both the contracting and expanding branches on language and using wider text kernels.
BiLingUNet applies convolutional modules to the image representation. Each module takes the concatenation of the previously generated feature map and its convolved version with a text kernel and produces an output feature map. Each module has a 2D convolution layer followed by batch normalization and ReLU activation [34]. The convolution layers have filters with and halving the spatial resolution, and they all have the same number of output channels.
Following , we split the textual representation to equal parts () to generate language conditional convolutional filters (text kernels). In the original LingUNet architecture, each part is used to generate a kernel. Although kernels are sufficient to pick out locations with a certain property (e.g. expressed by nouns and adjectives), they are incapable of representing relations between multiple locations (e.g. expressed by prepositions like on, under, next to). If we see language conditional convolutions over feature maps as an attention control mechanism, kernels only decide which part of the input should be activated. On the other hand, wider kernels can both highlight a region with specific properties and shift a previously activated region in a specified direction [20, 8, 4]. Therefore, we use each to generate a kernel ():
Each is an affine transformation followed by normalizing and reshaping to convert the output to a convolutional filter. DROPOUT is the dropout regularization . After obtaining the kernel, we convolve it over the feature map obtained from the previous module to relate expressions to image features:
Then, the concatenation of the resulting text-modulated features () and the previously generated features () is fed into module for the next step.
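The text-kernel generation step can be sketched as follows in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the normalization is taken to be L2 normalization over the whole kernel, dropout is omitted (as at inference time), and all sizes, weights, and function names are hypothetical.

```python
import math
import random


def split_equal(vec, parts):
    """Split a vector into `parts` equal, consecutive chunks."""
    if len(vec) % parts != 0:
        raise ValueError("vector length must be divisible by the number of parts")
    size = len(vec) // parts
    return [vec[i * size:(i + 1) * size] for i in range(parts)]


def make_text_kernel(part, weight, bias, c_out, c_in, k):
    """Affine map, L2 normalization (assumed), then reshape to [c_out][c_in][k][k]."""
    flat = [sum(w * v for w, v in zip(row, part)) + b
            for row, b in zip(weight, bias)]
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0
    values = iter(v / norm for v in flat)
    return [[[[next(values) for _ in range(k)]
              for _ in range(k)]
             for _ in range(c_in)]
            for _ in range(c_out)]


rng = random.Random(0)
text_repr = [rng.uniform(-1.0, 1.0) for _ in range(12)]  # toy text vector
parts = split_equal(text_repr, 4)                        # one part per level

C_OUT, C_IN, K = 2, 2, 3                                 # toy kernel shape
n_out = C_OUT * C_IN * K * K
weight = [[rng.uniform(-0.1, 0.1) for _ in range(len(parts[0]))]
          for _ in range(n_out)]
bias = [0.0] * n_out

kernel = make_text_kernel(parts[0], weight, bias, C_OUT, C_IN, K)
```

Each level of the network would use its own part of the text representation and its own affine parameters, so that different levels can respond to different aspects of the expression.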
In the expanding branch, we generate feature maps starting from the final output of the contracting branch as follows:
Similar to the bottom-up phase, is the modulated feature maps with text kernels generated as follows:
where is again an affine transformation followed by normalizing and reshaping. Here, we convolve the kernel () over the feature maps from the contracting branch (). Each upsampling module gets the concatenation () of the text related features and the feature map () generated from the previous module. Only the first module operates on just convolved features. Each consists of a 2D deconvolution layer followed by a batch normalization and ReLU activation function. The deconvolution layers have filters with and doubling the spatial resolution, and they all have the same number of output channels.
After generating the final feature map, we apply a stack of layers to map it to the exact image size. Similar to the upsampling modules, each layer is a 2D deconvolution followed by batch normalization and ReLU activation. The deconvolutional layer has filters with and to double the spatial sizes of the input. Each layer preserves the number of channels except for the last one, which maps the features to a single channel for the mask prediction. The final module has no batch normalization or ReLU activation; instead, we apply a sigmoid function to turn the final features into probabilities.
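The resolution bookkeeping of the three stages (contracting modules that halve, expanding modules that double, and the final upsampling stack) can be traced with a small sketch. The sizes used below (a 64-pixel feature map, a 512-pixel image, depth 4) are hypothetical values chosen for illustration, and `trace_spatial_sizes` is not a function from the paper.

```python
def trace_spatial_sizes(feature_side, image_side, depth):
    """Return the side lengths produced by each stage of the network."""
    if depth < 1 or feature_side % (2 ** depth) != 0:
        raise ValueError("feature_side must be divisible by 2 ** depth")
    # Contracting branch: every module halves the resolution.
    contracting = [feature_side // 2 ** i for i in range(1, depth + 1)]
    # Expanding branch: every module doubles the resolution again.
    expanding = [contracting[-1] * 2 ** i for i in range(1, depth + 1)]
    # Upsampling stack: keep doubling until the image size is reached.
    upsampling = []
    side = expanding[-1]
    while side < image_side:
        side *= 2
        upsampling.append(side)
    if side != image_side:
        raise ValueError("image_side is not reachable by repeated doubling")
    return contracting, expanding, upsampling


contracting, expanding, upsampling = trace_spatial_sizes(64, 512, 4)
print(contracting)  # [32, 16, 8, 4]
print(expanding)    # [8, 16, 32, 64]
print(upsampling)   # [128, 256, 512]
```

Under these assumed sizes, the expanding branch returns to the resolution of the input feature map, and the upsampling stack needs three doubling layers to reach the image resolution.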
Given the probabilities () for each pixel belonging to the target object(s), and the ground-truth mask , the main training objective is the pixel-wise Binary-Cross-Entropy (BCE) loss:
If we pad the input image to reach a fixed size, we ignore the padded pixels during the loss calculation.
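For reference, a masked pixel-wise BCE loss can be written as below. The notation is assumed for illustration: $P_{ij}$ is the predicted probability at pixel $(i, j)$, $M_{ij} \in \{0, 1\}$ is the ground-truth mask value, and $\Omega$ is the set of non-padded pixels. Averaging over $|\Omega|$ is also an assumption; a plain sum differs only by a constant factor per image.

```latex
\mathcal{L}_{\mathrm{BCE}}
  = -\frac{1}{|\Omega|} \sum_{(i,j) \in \Omega}
    \Big[ M_{ij} \log P_{ij} + (1 - M_{ij}) \log (1 - P_{ij}) \Big]
```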
Multi-Scale Loss: Similar to [28, 6, 11], we use auxiliary pixel-wise BCE loss calculations for our final objective. Instead of using the loss value obtained from only the final layer, we also calculate auxiliary loss values after the output of each module in the expanding branch of the network. To this end, similar to the final layer, we use a deconvolution layer that maps the feature maps to probabilities. For each auxiliary loss calculation, we scale the ground-truth mask to match the size of the output generated from the corresponding layer. Our final objective is the sum of each loss term scaled by their resolution ratio.
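A minimal sketch of the multi-scale objective in plain Python is given below. Several details are assumptions made for illustration: the ground-truth mask is downscaled with nearest-neighbour sampling, each loss term is weighted by the ratio of its side length to the full-resolution side length, and BCE is averaged over valid pixels. The function names are hypothetical.

```python
import math


def bce(probs, targets, valid=None, eps=1e-7):
    """Mean binary cross-entropy over pixels; `valid` marks non-padded pixels."""
    total, count = 0.0, 0
    for y, row in enumerate(probs):
        for x, p in enumerate(row):
            if valid is not None and not valid[y][x]:
                continue  # padded pixels do not contribute to the loss
            p = min(max(p, eps), 1.0 - eps)
            t = targets[y][x]
            total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
            count += 1
    return total / count if count else 0.0


def downscale_nearest(mask, side):
    """Nearest-neighbour downscaling of a square mask (assumed scaling method)."""
    step = len(mask) // side
    return [[mask[y * step][x * step] for x in range(side)] for y in range(side)]


def multi_scale_loss(predictions, full_mask):
    """Sum of BCE terms, each weighted by its resolution ratio (assumed weighting)."""
    full_side = len(full_mask)
    total = 0.0
    for probs in predictions:
        side = len(probs)
        target = downscale_nearest(full_mask, side)
        total += (side / full_side) * bce(probs, target)
    return total


full_mask = [[1, 1, 0, 0],
             [1, 1, 0, 0],
             [0, 0, 0, 0],
             [0, 0, 0, 0]]
coarse = [[0.5, 0.5], [0.5, 0.5]]        # auxiliary output at half resolution
fine = [[0.5] * 4 for _ in range(4)]     # final output at full resolution
loss = multi_scale_loss([coarse, fine], full_mask)
```

With uninformative predictions of 0.5 everywhere, each BCE term equals log 2, so the toy loss is (0.5 + 1.0) times log 2 under the assumed weighting.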
In this section we first give the details of the datasets and our experimental configurations (Section 4.1). Then we present our main results and compare our model with the state-of-the-art (Section 4.2). A detailed analysis of the contribution of the different parts of the architecture is given in Section 4.3. Finally, Section 4.4 shows some qualitative results.
4.1 Datasets and Experiment Setup
We evaluate our model on four standard referring expression segmentation datasets using two evaluation metrics.
UNC: This dataset contains 142,209 referring expressions with 19,994 images obtained from the MS COCO dataset. Examples are collected interactively using a two-player game for 50k objects. Images can contain two or more instances of the same object class. There is no restriction on the expressions.
UNC+: Similar to the UNC dataset, UNC+ is also collected from the MS COCO dataset. There are 141,564 referring expressions for 49,856 objects. The total number of images is 49,856. The difference between UNC and UNC+ is that location-specific expressions are not allowed in UNC+. Specifically, annotators are instructed to refer to an object by describing its appearance.
Google-Ref (G-Ref): This dataset is also constructed from 26,711 images of MS COCO. There are 104,560 referring expressions for 54,822 objects. Expressions are collected from Amazon Mechanical Turk instead of a two-player game. This resulted in longer expressions (avg. 8 words) than the ones in UNC and UNC+ (avg. 4 words).
ReferItGame (ReferIt): The ReferIt dataset contains 130,525 expressions for 96,654 objects in 19,894 images. The images are from the IAPR TC-12 dataset. In this dataset, some of the referred objects are from nature, such as water, sky, and ground. Due to the collection process, the expressions are shorter (avg. 3.4 words) in this dataset.
Evaluation Metrics: Following previous work [31, 36, 49, 5], we use overall intersection-over-union (IoU) and prec@X as evaluation metrics. Given the predicted segmentation mask and the ground truth, the IoU metric is the ratio between the intersection and the union of the two. The overall IoU is the total intersection over the total union, accumulated over all test examples. The second metric, prec@X, is the percentage of test examples that have an IoU score higher than the threshold X. In experiments, .
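The two metrics can be sketched in plain Python as follows. Masks are represented as nested lists of 0/1 values, the function names are hypothetical, and the toy masks and the thresholds 0.5 and 0.7 are chosen only for illustration.

```python
def intersection_and_union(pred, gt):
    """Count pixels in the intersection and in the union of two binary masks."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter, union


def overall_iou(preds, gts):
    """Total intersection over total union, accumulated over all examples."""
    total_inter = total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = intersection_and_union(pred, gt)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 0.0


def precision_at(preds, gts, threshold):
    """Fraction of examples whose IoU is strictly higher than the threshold."""
    hits = 0
    for pred, gt in zip(preds, gts):
        inter, union = intersection_and_union(pred, gt)
        iou = inter / union if union else 0.0
        hits += 1 if iou > threshold else 0
    return hits / len(preds)


# Two toy examples with per-example IoU of 1/2 and 3/4.
preds = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
gts = [[[1, 0], [0, 0]], [[1, 1], [1, 0]]]

print(overall_iou(preds, gts))        # 4 / 6
print(precision_at(preds, gts, 0.5))  # 0.5
print(precision_at(preds, gts, 0.7))  # 0.5
```

Note that overall IoU weights examples by object size, because large objects contribute more pixels to both totals, whereas prec@X counts every example equally.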
In all convolutional layers, we set the filter size, stride, and number of filters () as , , and , respectively. The depth is in the UNet part of the network. We set the dropout probability to throughout the network. We use Adam with default parameters for the optimization. We do not train the DeepLab ResNet101-v2 weights. There are examples in each minibatch. We train and test the model separately for each dataset. We train our model for epochs and present the scores for the model that achieves the best performance on the validation split.
4.2 Quantitative Results
Table 1 shows the comparison of our model with the previous work. Our model outperforms all previous models on all datasets. When we compare our model with the previous state-of-the-art model, Step-ConvRNN, the most significant improvement is on the G-Ref dataset. Since this dataset includes longer and richer expressions, this improvement demonstrates the ability of our approach to model the long-range dependencies between visual and linguistic modalities.
We also compare our model with MAttNet, which is a referring expression comprehension model. Since MAttNet presents segmentation results after predicting bounding boxes for objects, it is comparable with our work. Our model is significantly better than MAttNet, which depends on an explicit object proposal network that is trained on more COCO images. This result shows the ability of our model to detect object regions and relate them to expressions.
Table 2 presents the comparison of our model with the state-of-the-art in terms of prec@X scores. Our model achieves better scores for higher thresholds. Although models can relate expressions to the correct objects, the low scores at the higher thresholds suggest that there is still room for improvement in whole-part object segmentation.
4.3 Ablation Results
We also performed ablation studies to better understand the contributions of the different parts of our model. Table 3 shows the performances of the different architectures on the UNC validation split with prec@X and overall IoU metrics.
| 1x1 Text Kernels | 67.58 | 59.42 | 49.79 | 36.16 | 12.57 | 57.57 |
| No Multi-Scale Loss | 71.37 | 64.20 | 55.77 | 42.30 | 15.64 | 60.09 |
LingUNet vs. BiLingUNet: The first row shows the performance of our baseline model, LingUNet (). This architecture has language conditioning only on the expanding branch, and each text filter has size 1x1. When we compare this architecture with our final model, we see a dramatic performance gain, which shows the effectiveness of bidirectional modulation of visual processing with expressions and of the use of spatial kernels.
1x1 vs. 3x3 Text Kernels: When we compare the first and second rows, we see that the use of spatial kernels brings additional improvement over the base model. Similarly, if we use 1x1 kernels in our model, the performance decreases significantly. This shows the benefit of using wider convolutional kernels to represent language.
Dropout & Multi-Scale Loss: Both turning off the dropout across the model and using only the predictions from final layer for the loss calculation decrease the performance slightly. We also observed that using multi-scale loss provides faster convergence on some datasets. For example, the model gets the best performance after only the fifth epoch on G-Ref dataset.
Shuffling the Training Data: Since we use batch normalization heavily in our architecture, we shuffle the training data after each epoch to train our model effectively, as suggested in the batch normalization study. Otherwise, we observed a performance drop (2 mIoU), which suggests that the model tends to memorize the minibatch statistics.
4.4 Qualitative Results
In this section, we visualize some of the segmentation predictions of our model to gain better insights about the trained model.
Figure 2 shows some of the cases that our model segments correctly. These examples demonstrate that our model can learn a variety of language and visual reasoning patterns. For example, the first two examples of the first row show that our model learns to relate comparative adjectives (e.g., longer, shorter) to visual comparison. Examples that include spatial prepositions (e.g., on right, on left, next to, behind, over, bottom) demonstrate the spatial reasoning ability of the model. We also see that the model can learn a domain-specific nomenclature (catcher, batter) that is present in the dataset. Lastly, we can see that the model can distinguish different actions (e.g., standing, squatting, sitting).
Figure 3 shows some of the incorrect segmentation predictions from our model on the UNC validation dataset. In the figure, each group shows one of the error patterns observed in the examples. One pattern (a) is that our model tends to combine similar objects or their parts when they are hard to distinguish. Another source of errors (b) is that some of the expressions are ambiguous: more than one object could correspond to the expression, and the model segments both candidates. Some of the examples (d) are hard to segment completely due to the lack of light or to objects that occlude the referred objects. Finally, some of the annotations (c) contain incorrect or incomplete ground-truth masks.
We have proposed BiLingUNet, a state-of-the-art model for image segmentation from referring expressions. BiLingUNet uses expressions to generate text kernels for the modulation of visual processing. We showed with detailed ablation studies that using language to modulate both bottom-up and top-down visual processing works better than just making the top-down processing language-conditional. Our model achieved state-of-the-art performance on four referring expression datasets. Our future work focuses on adapting BiLingUNet for different language-vision problems such as image captioning and visual question answering or using it as a sub-component to solve a far more challenging task like mapping natural language instructions to sequences of actions.
-  (2016) Neural module networks. In , pp. 39–48. Cited by: §2.2.
-  (2015-12) VQA: visual question answering. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.2.
-  (2014) Neural machine translation by jointly learning to align and translate. External Links: Cited by: §2.2.
-  (2019) Learning from implicit information in natural language instructions for robotic manipulations. In Proceedings of the Combined Workshop on Spatial Language Understanding (SpLU) and Grounded Communication for Robotics (RoboNLP), pp. 29–39. Cited by: §3.3.
-  (2019) See-through-text grouping for referring image segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7454–7463. Cited by: §2.3, §3.1, §4.1, §4.1, Table 1, Table 2, footnote 1.
-  Deep contextual networks for neuronal structure segmentation. In Thirtieth AAAI conference on artificial intelligence, Cited by: §3.4.
-  (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §2.1, §3.1.
-  (2018) Using syntax to ground referring expressions in natural images. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §3.3.
-  (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §2.1.
-  (2017-04) Long-term recurrent convolutional networks for visual recognition and description. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (4), pp. 677–691. External Links: Cited by: §2.2.
-  (2017) 3D deeply supervised network for automated segmentation of volumetric medical images.. Medical image analysis 41, pp. 40. Cited by: §3.4.
-  (2014-06) Scalable object detection using deep neural networks. 2014 IEEE Conference on Computer Vision and Pattern Recognition. External Links: Cited by: §2.2.
-  (2010) The segmented and annotated iapr tc-12 benchmark. Computer vision and image understanding 114 (4), pp. 419–428. Cited by: §4.1.
-  (2010) The pascal visual object classes (voc) challenge. International journal of computer vision 88 (2), pp. 303–338. Cited by: §3.1.
-  (2014-06) Rich feature hierarchies for accurate object detection and semantic segmentation. 2014 IEEE Conference on Computer Vision and Pattern Recognition. External Links: Cited by: §2.2.
-  (2015-12) Fast r-cnn. 2015 IEEE International Conference on Computer Vision (ICCV). External Links: Cited by: §2.2.
-  (2017-10) Mask r-cnn. 2017 IEEE International Conference on Computer Vision (ICCV). External Links: Cited by: §2.2.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §2.1.
-  (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §2.3, §3.2, §3.
-  (2017) Learning to reason: end-to-end module networks for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pp. 804–813. Cited by: §3.3.
-  (2017-07) Modeling relationships in referential expressions with compositional modular networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). External Links: Cited by: §2.2.
-  (2016) Segmentation from natural language expressions. Lecture Notes in Computer Science, pp. 108–124. External Links: Cited by: §2.3, Table 1.
-  (2016-06) Natural language object retrieval. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). External Links: Cited by: §2.2, §3.1.
-  (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §3.3, §4.3.
-  (2017-04) Deep visual-semantic alignments for generating image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (4), pp. 664–676. External Links: Cited by: §2.2.
-  Referitgame: referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 787–798. Cited by: §4.1, §4.1.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
-  (2015) Deeply-supervised nets. In Artificial intelligence and statistics, pp. 562–570. Cited by: §3.4.
-  (2018) Referring image segmentation via recurrent refinement networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5745–5753. Cited by: §2.3, Table 1, Table 2.
-  (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §4.1.
-  (2017) Recurrent multimodal interaction for referring image segmentation. 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1280–1289. Cited by: §2.3, §3.1, §4.1, §4.1, §4.1, Table 1, Table 2, footnote 1.
-  (2019-10) Learning to assemble neural module tree networks for visual grounding. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.2.
-  (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §2.1.
-  (2013) Rectifier nonlinearities improve neural network acoustic models. In Proc. icml, Vol. 30, pp. 3. Cited by: §3.3.
-  (2016-06) Generation and comprehension of unambiguous object descriptions. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). External Links: Cited by: §2.2, §4.1.
-  (2018) Dynamic multimodal instance segmentation guided by natural language queries. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 630–645. Cited by: §2.3, §4.1, §4.1, Table 1, Table 2.
-  (2018) Mapping instructions to actions in 3d environments with visual goal prediction. arXiv preprint arXiv:1809.00786. Cited by: §1, §2.1, §2.3, §3.3, §3.
-  (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §3.2.
-  (2017-06) Faster r-cnn: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6), pp. 1137–1149. External Links: Cited by: §2.2.
-  (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §1, §2.1, §2.3, §3.3, §3.
-  (2018-09) Key-word-aware network for referring expression image segmentation. In The European Conference on Computer Vision (ECCV), Cited by: §2.3, Table 1.
-  Convolutional lstm network: a machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 802–810. External Links: Cited by: §2.3.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §2.1, §2.3.
-  (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §3.3.
-  (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §2.3.
-  (2015-06) Show and tell: a neural image caption generator. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). External Links: Cited by: §2.2.
-  (2019-06) Neighbourhood watch: referring expression comprehension via language-guided graph attention networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). External Links: Cited by: §2.2.
-  (2015) Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057. Cited by: §2.2.
-  (2019-06) Cross-modal self-attention network for referring image segmentation. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). External Links: Cited by: §2.3, §3.1, §4.1, §4.1, Table 1, Table 2, footnote 1.
-  (2018-06) MAttNet: modular attention network for referring expression comprehension. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. External Links: Cited by: §2.2, Table 1.
-  (2016) Modeling context in referring expressions. Lecture Notes in Computer Science, pp. 69–85. Cited by: §2.2, §4.1, §4.1.
-  (2014) Edge boxes: locating object proposals from edges. In European conference on computer vision, pp. 391–405. Cited by: §2.2.