Discriminative Bimodal Networks for Visual Localization and Detection with Natural Language Queries

04/12/2017 ∙ by Yuting Zhang, et al. ∙ University of Michigan 0

Associating image regions with text queries has been recently explored as a new way to bridge visual and linguistic representations. A few pioneering approaches have been proposed based on recurrent neural language models trained generatively (e.g., generating captions), but achieving somewhat limited localization accuracy. To better address natural-language-based visual entity localization, we propose a discriminative approach. We formulate a discriminative bimodal neural network (DBNet), which can be trained by a classifier with extensive use of negative samples. Our training objective encourages better localization on single images, incorporates text phrases in a broad range, and properly pairs image regions with text phrases into positive and negative examples. Experiments on the Visual Genome dataset demonstrate the proposed DBNet significantly outperforms previous state-of-the-art methods both for localization on single images and for detection on multiple images. We we also establish an evaluation protocol for natural-language visual detection.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 17

page 18

page 20

page 22

page 25

page 30

page 31

page 33

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Object localization and detection in computer vision are traditionally limited to a small number of predefined categories (e.g., car, dog, and person), and category-specific image region classifiers

[7, 11, 14] serve as object detectors. However, in the real world, the visual entities of interest are much more diverse, including groups of objects (involved in certain relationships), object parts, and objects with particular attributes and/or in particular context. For scalable annotation, these entities need to be labeled in a more flexible way, such as using text phrases.

Deep learning has been demonstrated as a unified learning framework for both text and image representations. Significant progress has been made in many related tasks, such as image captioning [55, 56, 25, 37, 5, 9, 23, 18, 38], visual question answering [3, 36, 57, 41, 2], text-based fine-grained image classification [44], natural-language object retrieval [21, 38], and text-to-image generation [45].

A few pioneering works [21, 38] use recurrent neural language models [15, 39, 50] and deep image representations [31, 49] for localizing the object referred to by a text phrase given a single image (i.e., “object referring" task [26]). Global spatial context, such as “a man on the left (of the image)”, has been commonly used to pick up the particular object. In contrast, Johnson et al. [23] takes descriptions without global context111Only a very small portion of text phrases on the Visual Genome refer to the global context. as queries for localizing more general visual entities on the Visual Genome dataset [30].

Figure 1: Comparison between (a) image captioning model and (b) our discriminative architecture for visual localization.

All above existing work performs localization by maximizing the likelihood to generate the query text given image regions using an image captioning model (Figure 1

a), whose output probability density needs to be modeled on the virtually infinite space of the natural language. Since it is hard to train a classifier on such a huge structured output space, current captioning models are constrained to be trained in generative

[21, 23] or partially discriminative [38] ways. However, as discriminative tasks, localization and detection usually favor models that are trained with a more discriminative objective to better utilize negative samples. In this paper, we propose a new deep architecture for natural-language-based visual entity localization, which we call a discriminative bimodal network (DBNet). Our architecture uses a binary output space to allow extensive discriminative training, where any negative training sample can be potentially utilized. The key idea is to take the text query as a condition rather than an output and to let the model directly predict if the text query and image region are compatible (Figure 1b). In particular, the two pathways of the deep architecture respectively extract the visual and linguistic representations. A discriminative pathway is built upon the two pathways to fuse the bimodal representations for binary classification of the inter-modality compatibility.

Compared to the estimated probability density in the huge space of the natural language, the score given by a binary classifier is more likely to be

calibrated. In particular, better calibrated scores should be more comparable across different images and text queries. This property makes it possible to learn decision thresholds to determine the existence of visual entities on multiple images and text queries, making the localization model generalizable for detection tasks. While a few examples of natural-language visual detection are showcased in [23], we perform more comprehensive quantitive and ablative evaluations.

In our proposed architecture, we use convolutional neural networks (CNNs) for both visual and textual representations. Inspired by fast R-CNN

[13]

, we use the RoI-pooling architecture induced from large-scale image classification networks for efficient feature extraction and model learning on image regions. For textual representations, we develop a character-level CNN

[60] for extracting phrase features. A network on top of the image and language pathways dynamically forms classifiers for image region features depending on the text features, and it outputs the classifier responses on all regions of interest.

Our main contributions are as follows:

  1. We develop a bimodal deep architecture with a binary output space to enable fully discriminative training for natural-language visual localization and detection.

  2. We propose a training objective that extensively pairs text phrases and bounding boxes, where 1) the discriminative objective is defined over all possible region-text pairs in the entire training set, and 2) the non-mutually exclusive nature of text phrases is taken into account to avoid ambiguous training samples.

  3. Experimental results on Visual Genome demonstrate that the proposed DBNet significantly outperforms existing methods based on recurrent neural language models for visual entity localization on single images.

  4. We also establish evaluation methods for natural-language visual detection on multiple images and show state-of-the-art results.

2 Related work

Object detection.

Recent success of deep learning on visual object recognition [31, 59, 49, 51, 53, 17] constitutes the backbone of the state-of-the-art for object detection [14, 48, 52, 61, 42, 43, 13, 46, 17, 6]. Natural-language visual detection can adapt the deep visual representations and single forward-pass computing framework (e.g., RoI pooling [13], SPP [16], R-FCN [6]) used in existing work of traditional object detection. However, natural-language visual detection needs a huge structured label space to represent the natural language, and finding a proper mapping to the huge space from visual representations is difficult.

Image captioning and caption grounding.

The recurrent neural network (RNN) 

[19] based language model [15, 39, 50] has become the dominant method for captioning images with text [55]. Despite differences in details of network architectures, most RNN language models learn the likelihood of picking up a word from a predefined vocabulary given the visual appearance features and previous words (Figure 1a). Xu et al. [56] introduced an attention mechanism to encourage RNNs to focus on relevant image regions when generating particular words. Karpathy and Fei-Fei [25] used strong supervision of text-region alignment for well-grounded captioning.

Object localization by natural language.

Recent work used the conditional likelihood of captioning an image region with given text for localizing associated objects. Hu et al. [21] proposed the spatial-context recurrent ConvNet (SCRC), which conditioned on both local visual features and global contexts for evaluating given captions. Johnson et al. [23] combined captioning and object proposal in an end-to-end neural network, which can densely caption (DenseCap) image regions and localize objects. Mao et al. [38] trained the captioning model by maximizing the posterior of localizing an object given the text phrase, which reduced the ambiguity of generated captions. However, the training objective was limited to figuring out single objects on single images. Lu et al. [34] simplified and limited text queries to subject-relationship-object (SVO) triplets. Rohrbach et al. [47] improved localization accuracy with an extra text reconstruction task. Hu et al. [20] extended bounding box localization to instance segmentation using natural language queries. Yu et al. [58] and Nagaraja et al. [40] explicitly modeled context for referral expressions.

Text representation.

Neural networks can also embed text into a fixed-dimensional feature space. Most RNN-based methods (e.g., skip-thought vectors 

[29]) and CNN-based methods [24, 27]

use word-level one-hot encoding as the input. Recently, character-level CNN has also been demonstrated an effective way for paragraph categorization 

[60] and zero-shot image classification [44].

3 Discriminative visual-linguistic network

The best-performing object detection framework [7, 11, 14] in terms of accuracy generally verifies if a candidate image region belongs to a particular category of interest. Though recent deep architectures [52, 46, 23] can propose regions with confidence scores at the same time, a verification model, taking as input the image features from the exact proposed regions, still serves as a key to boost the accuracy.

In this section, we develop a verification model for natural-language visual localization and detection. Unlike the classifiers for a small number of predefined categories in traditional object detection, our model is dynamically adaptable to different text phrases.

3.1 Model framework

Let be an image, be the coordinates of a region, and be a text phrase. The verification model outputs the confidence of ’s being matched with . Suppose that is the binary label indicating if is a positive or negative region-text pair on . Our verification model learns to fit the probability for and being compatible (a positive pair), i.e., . See Section B in the supplementary materials for a formalized comparison with conditional captioning models.

To this end, we develop a bimodal deep neural network for our model. In particular, is composed of two single-modality pathways followed by a discriminative pathway. The image pathway extracts the -dim visual representation on the image region on . The language pathway extracts the -dim textual representation for the phrase . The discriminative pathway with parameters dynamically generates a classifier for visual representation according to the textual representation, and predicts if and are matched on . The full model is specified by .

3.2 Visual and linguistic pathways

RoI-pooling image network.

We suppose the regions of interest are given by an existing region proposal method (e.g., EdgeBox [62], RPN [46]). We calculate visual representations for all image regions in one pass using the fast R-CNN RoI-pooling pipeline. State-of-the-art image classification networks, including the 16-layer VGGNet [49] and ResNet-101 [17], are used as backbone architectures.

Character-level textual network.

For an English text phrase , we encode each of its characters into a 74-dim one-hot vector, where the alphabet is composed of 74 printable characters including punctuations and the space. Thus, the is encoded as a 74-channel sequence by stacking all character encodings. We use a character-level deep CNN [60] to obtain the high-level textual representation of

. In particular, our network has 6 convolutional layers interleaving with 3 max-pooling layers and followed by 2 fully connected layers (see Section 

A in the supplementary materials for more details). It takes a sequence of a fixed length as the input and produces textual representations of a fixed dimension. The input length is set to be long enough (here, 256 characters) to cover possible text phrases.222The Visual Genome dataset has more than 2.8M unique phrases, whose median length in character is 29. Less than 500 phrases has more than 100 characters. To avoid empty tailing characters in the input, we replicate the text phrase until reaching the input length limit.

We empirically found that the very sparse input can easily lead to over-sparse intermediate activations, which can create a large portion of “dead” ReLUs and finally result in a degenerate solution. To avoid this problem, we adopt the Leaky ReLU (LReLU)

[35] to keep all hidden units active in the character-level CNN.

Other text embedding methods [29, 24, 27] also can be used in the DBNet framework. We use the character-level CNN because of its simplicity and flexibility. Compared to word-based models, it uses lower-dimensional input vectors and has no constraint on the word vocabulary size. Compared to RNNs, it easily allows deeper architectures.

3.3 Discriminative pathway

The discriminative pathway first forms a linear classifier using the textual representation of the phrase . Its linear combination weights and bias are

(1)
(2)

where , , and . This classifier is applied to the visual representation of the image region on , obtaining the verification confidence predicted by our model:

(3)

Compared to the basic form of the bilinear function , our discriminative pathway includes an additional linear term as the text-dependent bias for the visual representation classifier.

As a natural way for modeling the cross-modality correlation, multiplication is also a source of instability for training. To improve the training stability, we introduce a regularization term for the dynamic classifier, besides the network weight decay for .

4 Model learning

In DBNet, we drive the training of the proposed two-pathway bimodal CNN with a binary classification objective. We pair image regions and text phrases as training samples. We define the ground truth binary label for each training region-text pair (Section 4.1

), and propose a weighted training loss function (Section 

4.2).

Training samples.

Given training images , let be the set of ground truth annotations for , where is the number of annotations, is the coordinate of the th region, and is the text phrase corresponding to . When one region is paired with multiple phrases, we take each pair as a separate entry in .

We denote the set of all regions considered on by , which includes both annotated regions and regions given by proposal methods [54, 62, 46]. We write for the set of annotated text phrases on , and for all training text phrases.

4.1 Ground truth labels

Figure 2: Ground truth labels for region-text pairs (given an arbitrary image region). Phrases are categorized into positive, ambiguous, and negative sets based on the given region’s overlap with ground truth boxes (measured by IoU and displayed as the numbers in front of the text phrases). Ambiguous phrases augmented by text similarity is not shown here (see the video in the supplementary materials for an illustration). For visual clarity, and , which are different from the rest of the paper.

Labeling criterion.

We assign each possible training region-text pair with a ground truth label for binary classification. For a region on the image and a text phrase , we take the largest overlap between and ’s ground truth regions as evidence to determine ’s label. Let denote the intersection over union. The largest overlap is defined as

(4)

In object detection on a limited number of categories (i.e., consists of category labels), is usually reliable enough for assigning binary training labels, given the (almost) complete ground truth annotations for all categories.

In contrast, text phrase annotations are inevitably incomplete in the training set. One image region can have an intractable number of valid textual descriptions, including different points of focus and paraphrases of the same description, so annotating all of them is infeasible. Consequently, cannot always reflect the consistency between an image region and a text phrase. To obtain reliable training labels, we define positive labels in a conservative manner; and then, we combine text similarity together with spatial IoU to establish the ambiguous text phrase set that reflects potential “false negative” labels. We provide detailed definitions below.

Positive phrases.

For a region on , its positive text phrases (i.e., phrases assigned with positive labels) constitute the set

(5)

where is a high enough IoU threshold () to determine positive labels. Some positive phrases may be missing due to incomplete annotations. However, we do not try to recover them (e.g., using text similarity), as “false positive” training labels may be introduced by doing so.

Ambiguous phrases.

Still for the region , we collect the text phrases whose ground truth regions have moderate (neither too large nor too small) overlap with into a set

(6)

where is the IoU lower bound (). When ’s largest IoU with the ground truths of a phrase lies in , it is uncertain whether is positive or negative. In other words, is ambiguous with respect to the region .

Note that only contains phrases from . To cover all possible ambiguous phrases from the full set , we use a text similarity measurement to augment to the finalized ambiguous phrase set

(7)

where we use the METEOR [4] similarity for and set the text similarity threshold .333If the METEOR similarity of two phrases is greater than 0.3, they are usually very similar. In Visual Genome, 0.25% of all possible pairs formed by the text phrases that occur 20 times can pass this threshold.

Labels for region-text pairs.

For any image region on and any phrase , the ground truth label of is

(8)

where the pairs of a region and its ambiguous text phrases are assigned with the “uncertain” label to avoid false negative labels. Figure 2 illustrates the region-text label for an arbitrary training image region.

4.2 Weighted training loss

Effective training sets.

On the image , the effective set of training region-text pairs is

(9)

where, as previously defined, consists of annotated and proposed regions, and consists of all phrases from the training set. We exclude samples of uncertain labels.

We partition into three subsets according to the value of and the origin of the phrase : for , for , and for all negative region-text pairs containing phrases from the rest of the training set (i.e., not from ).

Per-image training loss

Let for notation convenience; and, let

be a binary classification loss, in particular, the cross-entropy loss of logistic regression. We define the training loss on

as the summation of three parts:

(10)
(11)
(12)
(13)

where is ’s frequency of occurrences in the training set. We normalize and re-weight the loss for each of the three subsets of separately. In particular, we set to balance the positive and negative training loss. The values of and are implicitly determined by the numbers of text phrases that we choose inside and outside during stochastic optimization.

The training loss functions in most existing work on natural-language visual localization [21, 23] use only positive samples for training, which is similar to solely using . The method in [38] also considers the negative case (similar to ), but it is less flexible and not extensible to the case of . The recurrent neural language model can encourage a certain amount of discriminativeness on word selection, but not on entire text phrases as ours.

Region Visual Localization Recall / % for IoU@ Median Mean
proposal network model 0.1 0.2 0.3 0.4 0.5 0.6 0.7 IoU IoU
DC-RPN 500 16-layer VGGNet DenseCap 52.5 38.9 27.0 17.1 09.5 04.3 01.5 0.117 0.184
DBNet 57.4 46.9 37.8 29.4 21.3 13.6 07.0 0.168 0.250
EdgeBox 500 16-layer VGGNet DenseCap 48.8 36.2 25.7 16.9 10.1 05.4 02.4 0.092 0.178
SCRC 52.0 39.1 27.8 18.4 11.0 05.8 02.5 0.115 0.189
DBNet w/o bias term 52.3 43.8 36.3 29.3 22.4 15.7 09.4 0.124 0.246
DBNet w/o VOC pretraining 54.3 45.0 36.6 28.8 21.3 14.4 08.2 0.144 0.245
DBNet 54.8 45.9 38.3 30.9 23.7 16.6 09.9 0.152 0.258
ResNet-101 DBNet 59.6 50.5 42.3 34.3 26.4 18.6 11.2 0.205 0.284
Table 1: Single-image object localization accuracy on the Visual Genome dataset. Any text phrase annotated on a test image is taken as a query for that image. “IoU@” denotes the overlapping threshold for determining the recall of ground truth boxes. DC-RPN is the region proposal network from DenseCap.

Full training objective.

Summing up the training loss for all images together with weight decay for the whole neural network and the regularization for the text-specific dynamic classifier (Section 3.3), the full training objective is:

(14)

where we set and . Model optimization is in Section C of the supplementary materials.

5 Experiments

Dataset.

We evaluated the proposed DBNet on the Visual Genome dataset [30]. It contains 108,077 images, where 5M regions are annotated with text phrases in order to densely cover a wide range of visual entities.

We split the Visual Genome datasets in the same way as in [23]: 77,398 images for training, 5,000 for validation (tuning model parameters), and 5000 for testing; the remaining 20,679 images were not included (following [23]).

The text phrases were annotated from crowd sourcing and included a significant portion of misspelled words. We corrected misspelled words using the Enchant spell checker [1] from AbiWord. After that, there were 2,113,688 unique phrases in the training set and 180,363 unique phrases in the testing set. In the test set, about one third (61,048) of the phrases appeared in the training set, and the remaining two thirds (119,315) were unseen. About 43 unique phrases were annotated with ground truth regions per image. All experimental results are reported on this dataset.

Models.

We constructed the fast R-CNN [13]-style visual pathway of DBNet based on either the 16-layer VGGNet (Model-D in [49]) or ResNet-101 [17]. In most experiments, we used VGGNet for fair comparison with existing works (which also use VGGNet) and less evaluation time. ResNet-101 was used to further improve the accuracy.

We compared DBNet with two image captioning based localization models: DenseCap [23] and SCRC [21]. In DBNet, the visual pathway was pretrained for object detection using the faster R-CNN [46] on the PASCAL VOC 2007+2012 trainval set [10]

. The linguistic pathway was randomly initialized. Pretrained VGGNet on ImageNet ILSVRC classification dataset 

[8]

was used to initialize DenseCap, and the model was trained to match the dense captioning accuracy reported by

Johnson et al. [23]. We found that the faster R-CNN pretraining did not benefit DenseCap (see Section E.1 of the supplementary materials). The SCRC model was additionally pretrained for image captioning on MS COCO [33] in the same way as Hu et al. [21] did.

We trained all models using the training set on Visual Genome and evaluated them for both localization on single images and detection on multiple images. We also assessed the usefulness of the major components of our DBNet.

5.1 Single image localization

In the localization task, we took all ground truth text phrases annotated on an image as queries to localize the associated objects by maximizing the network response over proposed image regions.

Evaluation metrics.

We used the same region proposal method to propose bounding boxes for all models, and we used the non-maximum suppression (NMS) with the IoU threshold to localize a few boxes. The performance was evaluated by the recall of ground truth regions of the query phrase (see Section D of the supplementary materials for a discussion on recall and precision for localization tasks). If one of the proposed bounding boxes with the top- network responses had a large enough overlap (determined by an IoU threshold) with the ground truth bounding box, we took it as a successful localization. If multiple ground truth boxes were on the same image, we only required the localized boxes to match one of them. The final recall was averaged over all test cases, i.e., per image and text phrase. Median and mean overlap (IoU) between the top-1 localized box and the ground truth were also considered.

DenseCap Recall / % for IoU@ Median
performance 0.1 0.3 0.5 IoU
Small test set in [23] 56.0 34.5 15.3 0.137
Test set in this paper 50.5 24.7 08.1 0.103
Table 2: Localization accuracy of DenseCap on the small test set (1000 images and 100 test queries) used in [23] and the full test set (5000 images and 0.2M queries) used in this paper. boxes (at most) per image are proposed using the DenseCap RPN.

DBNet outperforms captioning models.

We summarize the top-1 localization performance of different methods in Table 1, where bounding boxes were proposed for testing. DBNet outperforms DenseCap and SCRC under all metrics. In particular, DBNet’s recall was more than twice as high as the other two methods for the IoU threshold at (commonly used for object detection [10, 33]) and about times higher for IoU at (for high-precision localization [12, 61]).

Johnson et al. [23] reported DenseCap’s localization accuracy on a much smaller test set (1000 images and 100 test queries in total), which is not comparable to our exhaustive test settings (Table 2 for comparison). We also note that different region proposal methods (EdgeBox and DenseCap RPN) did not make a big difference on the localization performance. We used EdgeBox for the rest of our evaluation.

Figure 5 shows the top- recall () in curves. SCRC is slightly better than DenseCap, possibly due to the global context features used in SCRC. DBNet outperforms both consistently with a significant margin, thanks to the effectiveness of discriminative training.

(a) IoU@0.5
(b) IoU@0.7
Figure 5: Top- localization recall under two overlapping thresholds. VGGNet and EdgeBox 500 are used in all methods.
Figure 6: Qualitative comparison between DBNet and DenseCap on localization task. Green boxes: ground truth; Red boxes: DenseCap; Yellow boxes: DBNet.

Dynamic bias term improves performance.

The text-dependent bias term introduced in (2) and (3) makes our method for fusing visual and linguistic representations different from the basic bilinear functions (e.g., used in [44]) and more similar to a visual feature classifier. As in Table 1, this dynamic bias term led to relative improvement on median IoU and ( absolute) relative improvement on recall at all IoU thresholds.

Transferring knowledge benefits localization accuracy.

Pretraining the visual pathway of DBNet for object detection on PASCAL VOC showed minor benefit on recall at lower IoU thresholds, but it brought and relative improvement to the recall for the IoU threshold at and , respectively. See Section E.1 in the supplementary materials for more results, where we showed that DenseCap did not get benefit from the same technique.

Qualitative results.

We visually compared the localization results of DBNet and DenseCap in Figure 6. In many cases, DBNet localized the queried entities at more reasonable locations. More examples are provided in Section F of the supplementary materials.

More quantitative results.

In the supplementary materials, we studied the performance improvement of the learned models over random guessing and the upper bound performance due to the limitation of region proposal methods (Section E.2). We also evaluated DBNet using queries in a constrained form (Section E.3), where the high query complexity was demonstrated as a significant source of failures for natural language visual localization.

5.2 Detection on multiple images

In the detection task, the model needs to verify the existence and quantity of queried visual entities in addition to localizing them, if any. Text phrases not associated with any image regions can exist in the query set of an image, and evaluation metrics can be defined by extending those used in traditional object detection.

Query sets.

Due to the huge total number of possible query phrases, it is practical to test only a subset of phrases on a test image. We developed query sets in three difficulty levels (). For a text phrase, a test image is positive if at least one ground truth region exists for the phrase; otherwise, the image is negative.

  • Level-0: The query set was the same as in the localization task, so every text phrase was tested only on its positive images (43 phrases per image).

  • Level-1: For each text phrase, we randomly chose the same number of negative images and the positive images (92 phrases per image).

  • Level-2: The number of negative images was either 5 times the number of positive images or 20 (whichever was larger) for each test phrase (775 phrases per image). This set included relatively more negative images (compared to positive images) for infrequent phrases.

As the level went up, it became more challenging for a detector to maintain its precision, as more negative test cases are included. In the level-1 and level-2 sets, text phrases depicting obvious non-object “stuff”, such as sky, were removed to better fit the detection task. Then, 176,794 phrases (59,303 seen and 117,491 unseen) remained.

Average IoU@0.3 IoU@0.5 IoU@0.7
precision / % mAP gAP mAP gAP mAP gAP
DenseCap 36.2 01.8 15.7 00.5 03.4 00.0
SCRC 38.5 02.2 16.5 00.5 03.4 00.0
DBNet 48.1 23.1 30.0 10.8 11.6 02.1
DBNet w/ Res 51.1 24.2 32.6 11.5 12.9 02.2
(a) Level-0: Only positive images per text phrase.
Average IoU@0.3 IoU@0.5 IoU@0.7
precision / % mAP gAP mAP gAP mAP gAP
DenseCap 22.9 01.0 10.0 00.3 02.1 00.0
SCRC 37.5 01.7 16.3 00.4 03.4 00.0
DBNet 45.5 21.0 28.8 09.9 11.4 02.0
DBNet w/ Res 48.3 22.2 31.2 10.7 12.6 02.1
(b) Level-1: The ratio between the positive and negative images is 1:1 per text phrase.
Average IoU@0.3 IoU@0.5 IoU@0.7
precision / % mAP gAP mAP gAP mAP gAP
DenseCap 04.1 00.1 01.7 00.0 00.3 00.0
DBNet 26.7 08.0 17.7 03.9 07.6 00.9
DBNet w/ Res 29.7 09.0 19.8 04.3 08.5 00.9
(c) Level-2: The ratio between the positive and negative images is at least 1:5 (minimum 20 negative images and 1:5 otherwise) per text phrase.
Table 6: Detection average precision using query set of three levels of difficulties. mAP: mean AP over all text phrases. gAP: AP over all test cases. VGGNet is the default visual CNN for all methods. “DBNet w/ Res” denotes our DBNet with ResNet-101.

Evaluation metrics.

We measured the detection performance by average precision (AP). In particular, we computed AP independently for each query phrase (comparable to a category in traditional object detection [10]) over its test images, and reported the mean AP (mAP) over all query phrases. Like traditional object detection, the score threshold for a detected region is category/phrase-specific.

For more practical natural-language visual detection, where the query text may not be known in advance, we also directly computed AP over all test cases. We term it global AP (gAP), which implies a universal decision threshold for any query phrase. Table 6 summarizes mAPs and gAPs under different overlapping thresholds for all models.

DBNet shows higher per-phrase performance.

DBNet achieved consistently stronger performance than DenseCap and SCRC in terms of mAP, indicating that DBNet produced more accurate detection per given phrase. Even for the challenging IoU threshold of 0.7, DBNet still showed reasonable performance. The mAP results suggest the effectiveness of discriminative training.

Figure 7: Qualitative detection results of DBNet with ResNet-101. We show detection results of six different text phrases on each image. For each image, the colors of bounding boxes correspond to the colors of text tags on the right. The semi-transparent boxes with dashed boundaries are ground truth regions, and the boxes with solid boundaries are detection results.
Prune Phrases Finetune Localization Detection (Level-1)
ambiguous from other visual Recall / % for IoU@ Median Mean mAP / % for IoU@ gAP / % for IoU@
phrases images pathway 0.3 0.5 0.7 IoU IoU 0.3 0.5 0.7 0.3 0.5 0.7
No No No 30.6 17.5 07.8 0.066 0.211 35.5 22.0 08.6 08.3 03.1 00.4
Yes No No 34.5 21.2 09.0 0.113 0.237 39.0 24.6 09.7 15.5 07.4 01.6
Yes Yes No 34.7 21.1 08.8 0.119 0.238 41.3 25.6 10.0 17.2 07.9 01.6
Yes Yes Yes 38.3 23.7 09.9 0.152 0.258 45.5 28.8 11.4 21.0 09.9 02.0
Table 7: Ablation study of DBNet’s major components. The visual pathway is based on the 16-layer VGGNet.

DBNet scores are better “calibrated”.

Achieving good performance in gAP is challenging as it assumes a phrase-agnostic, universal decision threshold. For IoU at 0.3 and 0.5, DenseCap and SCRC showed very low performance in terms of gAP, and DBNet dramatically () outperformed them. For IoU at 0.7, DenseCap and SCRC were unsuccessful, while DBNet could produce a certain degree of positive results. The gAP results suggest that the responses of DBNet are much better calibrated among different text phrases than captioning models, supporting our hypothesis that distributions on a binary decision space are easier to model than those on the huge natural language space.

Robustness to negative and rare cases.

The performance of all models dropped as the query set became more difficult. SCRC appeared to be more robust than DenseCap for negative test cases (level-1 performance). DBNet showed superior performance in all difficulty levels. Particularly for the level-2 query set, DenseCap’s performance dropped significantly compared to the level-1 case, which suggests that it probably failed at handling rare phrases (note that relatively more negative images are included in the level-2 set for rare phrases). For IoU at 0.5 and 0.7, DBNet’s level-2 performance was even better than the level-0 performance of DenseCap and SCRC. We did not test SCRC on the level-2 query set because of its high time consumption.444For level-2 query set, DBNet and DenseCap cost 0.5 min to process one image (775 queries) when using the VGGNet and a Titan X card. SCRC takes nearly minutes with the same setting. In addition, DBNet took 2–3 seconds to process one image when using level-0 query set.

Qualitative results.

We showed qualitative results of DBNet detection on selected examples in Figure 7. More comprehensive (random and failed) examples are provided in Section G of the supplementary materials. Our DBNet could detect diverse visual entities, including objects with attributes (e.g., “a bright colored snow board”), objects in context (e.g., “little boy sitting up in bed”), object parts (e.g., “front wheel of a bicycle”), and groups of objects (e.g.,“bikers riding in a bicycle lane”).

5.3 Ablation study on training strategy

We did ablation studies for three components of our DBNet training strategy: 1) pruning ambiguous phrases ( defined in Eq. (7)), 2) training with negative phrases from other images (), and 3) finetuning the visual pathway.

As shown in Table 7, the performance of the most basic training strategy is better than DenseCap and SCRC, due to the effectiveness of discriminative training. Ambiguous phrase pruning led to significant performance gain, by improving the correctness of training labels, where no “pruning ambiguous phrases” means setting . More quantitative analysis on tuning the text similarity threshold are provided in Section E.4 of the supplementary materials. Inter-image negative phrases did not benefit localization performance, since localization is a single-image task. However, this mechanism improved the detection performance by making the model more robust to diverse negative cases. As expected in most vision tasks, finetuning pretrained classification network boosted the performance of our models. In addition, upgrading the VGGNet-based visual pathway to ResNet-101 led to another clear gain in DBNet’s performance (Table 1 and 6).

6 Conclusion

We demonstrated the importance of discriminative learning for natural-language visual localization. We proposed the discriminative bimodal neural network (DBNet) to allow flexible discriminative training objectives. We further developed a comprehensive training strategy to extensively and properly leverage negative observations on training data. DBNet significantly outperformed the previous state-of-the-art based on caption generation models. We also proposed quantitative measurement protocols for natural-language visual detection. DBNet showed more robustness against rare queries compared to existing methods and produced detection scores with better calibration over various text queries. Our method can be potentially improved by combining its discriminative objective with a generative objective, such as image captioning.

Acknowledgements

This work was funded by Software R&D Center, Samsung Electronics Co., Ltd, as well as ONR N00014-13-1-0762, NSF CAREER IIS-1453651, and Sloan Research Fellowship. We thank NVIDIA for donating K40c and TITAN X GPUs. We also thank Kibok Lee, Binghao Deng, Jimei Yang, and Ruben Villegas for helpful discussions.

1

References

  • [1] AbiWord. Enchant spell checker. http://www.abisource.com/projects/enchant/.
  • Andreas et al. [2016] J. Andreas, M. Rohrbach, T. Darrell, and D. Kleina. Neural module networks. In CVPR, 2016.
  • Antol et al. [2015] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. VQA: Visual question answering. In CVPR, 2015.
  • Banerjee and Lavie [2005] S. Banerjee and A. Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 2005.
  • Chen and Zitnick [2015] X. Chen and C. L. Zitnick. Mind’s eye: A recurrent visual representation for image caption generation. In CVPR, 2015.
  • Dai et al. [2016] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In NIPS, 2016.
  • Dalal and Triggs [2005] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
  • Deng et al. [2009] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
  • Donahue et al. [2017] J. Donahue, L. A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):677–691, April 2017.
  • Everingham et al. [2010] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
  • Felzenszwalb et al. [2010] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.
  • Geiger et al. [2012] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the KITTI vision benchmark suite. In CVPR, 2012.
  • Girshick [2015] R. Girshick. Fast R-CNN. In ICCV, 2015.
  • Girshick et al. [2016] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Region-based convolutional networks for accurate object detection and segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1):142–158, Jan 2016.
  • Graves [2013] A. Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.
  • He et al. [2014] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
  • He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • Hendricks et al. [2016] L. A. Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, and T. Darrell. Deep compositional captioning: Describing novel object categories without paired training data. In CVPR, 2016.
  • Hochreiter and Schmidhuber [1997] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Hu et al. [2016a] R. Hu, M. Rohrbach, and T. Darrell. Segmentation from natural language expressions. In ECCV, 2016a.
  • Hu et al. [2016b] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell. Natural language object retrieval. In CVPR, 2016b.
  • Jia et al. [2014] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.
  • Johnson et al. [2016] J. Johnson, A. Karpathy, and L. Fei-Fei. DenseCap: Fully convolutional localization networks for dense captioning. In CVPR, 2016.
  • Kalchbrenner et al. [2014] N. Kalchbrenner, E. Grefenstette, and P. Blunsom. A convolutional neural network for modelling sentences. In ACL, 2014.
  • Karpathy and Fei-Fei [2015] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
  • Kazemzadeh et al. [2014] S. Kazemzadeh, V. Ordonez, M. Matten, and T. L. Berg. Referit game: Referring to objects in photographs of natural scenes. In EMNLP, 2014.
  • Kim [2014] Y. Kim. Convolutional neural networks for sentence classification. EMNLP, 2014.
  • Kingma and Ba [2015] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • Kiros et al. [2015] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler. Skip-thought vectors. In NIPS, 2015.
  • Krishna et al. [2017] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 2017.
  • Krizhevsky et al. [2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
  • LeCun et al. [1989] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.
  • [33] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV.
  • Lu et al. [2016] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei. Visual relationship detections. In ECCV, 2016.
  • Maas et al. [2013] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML, 2013.
  • Malinowski et al. [2015] M. Malinowski, M. Rohrbach, and M. Fritz.

    Ask your neurons: A neural-based approach to answering questions about images.

    In CVPR, 2015.
  • Mao et al. [2015] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. L. Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). In ICLR, 2015.
  • Mao et al. [2016] J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, 2016.
  • Mikolov et al. [2010] T. Mikolov, M. Karafiát, L. Burget, J. Cernockỳ, and S. Khudanpur. Recurrent neural network based language model. In INTERSPEECH, 2010.
  • Nagaraja et al. [2016] V. Nagaraja, V. Morariu, and L. Davis. Modeling context between objects for referring expression understanding. In ECCV, 2016.
  • Noh et al. [2016] H. Noh, P. H. Seo, and B. Han. Image question answering using convolutional neural network with dynamic parameter prediction. In CVPR, 2016.
  • Ouyang et al. [2016] W. Ouyang, X. Zeng, X. Wang, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, H. Li, C. C. Loy, K. Wang, J. Yan, and X. Tang. DeepID-Net: Deformable deep convolutional neural networks for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
  • Redmon et al. [2016] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
  • Reed et al. [2016a] S. Reed, Z. Akata, B. Schiele, and H. Lee. Learning deep representations of fine-grained visual descriptions. In

    IEEE Computer Vision and Pattern Recognition

    , 2016a.
  • Reed et al. [2016b] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text-to-image synthesis. In ICML, 2016b.
  • Ren et al. [2015] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • Rohrbach et al. [2016] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele. Grounding of textual phrases in images by reconstruction. In ECCV, 2016.
  • Sermanet et al. [2014] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
  • Simonyan and Zisserman [2015] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • Sutskever et al. [2011] I. Sutskever, J. Martens, and G. E. Hinton. Generating text with recurrent neural networks. In ICML, 2011.
  • Szegedy et al. [2015a] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015a.
  • Szegedy et al. [2015b] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015b.
  • Szegedy et al. [2016] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. 2016.
  • Uijlings et al. [2013] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
  • Vinyals et al. [2015] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
  • Xu et al. [2015] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
  • Yang et al. [2016] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In CVPR, 2016.
  • Yu et al. [2016] L. Yu, P. Poirson, S. Yang, A. Berg, and T. Berg. Modeling context in referring expressions. In ECCV, 2016.
  • Zeiler and Fergus [2014] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
  • Zhang et al. [2015a] X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks for text classification. In NIPS, 2015a.
  • Zhang et al. [2015b] Y. Zhang, K. Sohn, R. Villegas, G. Pan, and H. Lee. Improving object detection with deep convolutional networks via bayesian optimization and structured prediction. In CVPR, 2015b.
  • Zitnick and Dollár [2014] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, 2014.

Appendix A CNN architecture for the linguistic pathway

We summarize the CNN architecture used for the linguistic pathway in Table 8.

Layer ID Type Kernel size Output channels Pooling size Output length Activation
0 input n/a 74 none 256 none
1 convolution 7 256 2 128 LReLU (leakage )
2 convolution 7 256 none 128 LReLU (leakage )
3 convolution 3 256 none 128 LReLU (leakage )
4 convolution 3 256 2 64 LReLU (leakage )
5 convolution 3 512 none 64 LReLU (leakage )
6 convolution 3 512 2 32 LReLU (leakage )
7 inner-product n/a 2048 n/a n/a LReLU (leakage )
8 inner-product n/a 2048 n/a n/a LReLU (leakage )
Table 8: CNN architecture for the linguistic pathway.

Appendix B Formalized comparison with conditional generative models

In contrast to our discriminative framework, which fits , existing methods on natural-language visual localization [21, 23, 38] use the conditional caption generation model, where resembles . In [21, 23], the models are trained by maximizing . In [38], the model is trained instead by maximizing . However, it still resembles , and

is calculated via Bayes’ theorem.

Since the space of the natural language is intractable, accurately modeling is extremely difficult. Even considering only the plausible text phrases for on , the modes of are still hard to be properly lifted and balanced due to the lack of enough training samples to cover all valid descriptions. The generative modeling for text phrases may fundamentally limit the discriminative power of the existing model.

In contrast, our model takes both and as conditional variables. The conditional distribution on

is much easier to model due to the small binary label space, and it also naturally admits discriminative training. The power of deep distributed representations can also be leveraged for generalizing textual representations to less frequent phrases.

Appendix C Model optimization

The training objective is optimized by back-propagation [32]

using the mini-batch stochastic gradient descent (SGD) with momentum

. We use the basic SGD for the visual pathway and Adam [28] for the rest of the network.

We use EdgeBox [62] to propose 1000 boxes per image (in addition to the boxes annotated with text phrases) during training. For each image per iteration, we always include the top 50 proposed boxes in the SGD, and randomly sample another 50 out of the remaining 950 box proposals for diversity and efficiency.

To calculate exactly, we need to extract features from all text phrases (2.8M in Visual Genome) in the training set and combine them with almost every image regions in the mini-batch, which is impractical. Following the stochastic optimization framework, we randomly sample a few text phrases according to their frequencies of occurrence in the training set. This stochastic optimization procedure is consistent with (13).

In each iteration, we sample images when using the 16-layer VGGNet and image when using ResNet-101 on a single Titan X. The representations for each unique phrase and each unique image region is computed once per iteration. We partition a DBNet into sub-networks for the visual and textual pathways, and for the discriminative pathway. The batch size for those sub-networks are different and determined by inputs, e.g., the numbers of text phrases, bounding boxes, and effective region-text pairs. When using 2 images per iteration, the batch size for the discriminative pathway is 10K, where we feed all effective region-text pairs, as defined in (9

) , to the discriminative pathway. The large batch size is needed for efficient and stable optimization. Our Caffe

[22] and MATLAB based implementation supports dynamic and arbitrarily large batch sizes for sub-networks. The initial learning rates when using different visual pathways are summarized in Table 9.

Sub-networks \ Models 16-layer VGGNet ResNet-101
Visual Before RoI-pooling
pathway After RoI-pooling
Remainder
Table 9: Learning rates for DBNet training

We trained the VGG-based DBNet for approximately 10 days (3–4 days without finetuning the visual network, 4–5 days for the whole network, and 1–2 days with the decreased learning rate). DenseCap could get converged in 4 days, but further training did not improve the results. Given DBNet’s much higher accuracy, the extra training time was worthwhile.

Appendix D Discussion on recall and precision for localization

Table 1, 2, and 7 report the recall for the localization tasks, where each text phrase is localized with the bounding box of the highest score. Given an IoU threshold, the localized bounding box is either correct or not. As no decision threshold exists in this setting, we can calculate only the accuracy, but not a precision-recall curve. Following the convention in DenseCap and SCRC, we call this accuracy the “(rank-1) recall”, since it reflects if any ground-truth region can be recalled by the top-scored box. In Figure 5, assuming one ground-truth region per image (i.e., ordinary localization settings), we have . Note that rank-1 precision is the same as rank-1 recall.

Appendix E More quantitative results

We provide more quantitative analysis in this section, including the impact of pretraining on other datasets, random and upper-bound localization performance, localization with controlled queries, and an ablative study on the text similarity threshold for determining the ambiguous text phrase set.

e.1 Pretraining on different datasets

We trained DBNet and DenseCap using various pretrained visual networks. In particular, we used the 16-layer VGGNet in two settings: 1) pretrained on ImageNet ILSVRC 2012 for image classification (VGGNet-CLS) [8] and 2) further pretrained on the PASCAL VOC [10] for object detection using faster R-CNN [46]. We compared DBNet and DenseCap trained with these two pretrained networks and tested them with two different region proposal methods (i.e., DenseCap RPN and EdgeBox). As shown in Table 10, VOC pretraining was beneficial for DBNet, but it was not beneficial for DenseCap. Thus, we used the ImageNet pretrained VGGNet for DenseCap in the main paper.

Region Localization Accuracy / % for IoU@ Median Mean
proposal model 0.1 0.2 0.3 0.4 0.5 0.6 0.7 IoU IoU
DC-RPN 500 DenseCap (VGGNet-CLS) 52.5 38.9 27.0 17.1 09.5 04.3 01.5 0.117 0.184
DenseCap (VGGNet-DET) 49.4 36.9 26.0 16.7 09.3 04.3 01.5 0.096 0.176
DBNet (VGGNet-CLS) 57.7 46.9 37.0 27.9 19.5 11.7 05.6 0.169 0.242
DBNet (VGGNet-DET) 57.4 46.9 37.8 29.4 21.3 13.6 07.0 0.168 0.250
EdgeBox 500 DenseCap (VGGNet-CLS) 48.8 36.2 25.7 16.9 10.1 05.4 02.4 0.092 0.178
DenseCap (VGGNet-DET) 46.6 34.8 24.9 16.6 10.0 05.2 02.2 0.076 0.171
DBNet (VGGNet-CLS) 54.3 45.0 36.6 28.8 21.3 14.4 08.2 0.144 0.245
DBNet (VGGNet-DET) 54.8 45.9 38.3 30.9 23.7 16.6 09.9 0.152 0.258
Table 10: Localization performance for DBNet and DenseCap with different pretrained models on Visual Genome. VGGNet-CLS: the 16-layer VGGNet pretrained on ImageNet ILSVRC 2012 dataset. VGGNet-DET: the 16-layer VGGNet further pretrained on PASCAL VOC07+12 trainval set.

e.2 Random and oracle localization performance

Given proposed image regions, we performed localization for text phrases with random guessing and the oracle detector. For random guessing, we randomly chose a proposed region and took it as the localization results. For more accurate evaluation, we averaged the results over all possible cases (i.e., enumerating over all proposed boxes). For the oracle detector, it always picked up the proposed region that had the largest overlap with a ground truth region, providing the performance upper bound due to the limitation of the region proposal method, as in [61].

As shown in Table 11, the trained models (DBNet, SCRC, DenseCap) significantly outperformed random guessing, which suggests that promising models can be developed using deep neural networks. However, the the performance of DBNet had a large gap with the oracle detector, which indicates that more advanced methods need to be developed in the further to better address the natural language visual localization problem.

Model Recall / % for IoU@ Median Mean
0.1 0.2 0.3 0.4 0.5 0.6 0.7 IoU IoU
Random 19.0 10.0 5.2 2.6 1.2 00.5 00.2 0.041 0.056
DenseCap 48.8 36.2 25.7 16.9 10.1 05.4 02.4 0.092 0.178
SCRC 52.0 39.1 27.8 18.4 11.0 05.8 02.5 0.115 0.189
DBNet 54.8 45.9 38.3 30.9 23.7 16.6 09.9 0.152 0.258
Oracle 94.0 87.3 80.4 73.1 65.1 055.8 042.4 0.650 0.572
Table 11: Single-image object localization accuracy on the Visual Genome dataset for random guess, oracle detector, and trained models. EdgeBox is used to propose 500 regions per image. Random: a proposed region is randomly chosen as the localization for a text phrase and the performance is averaged over all possibilities; Oracle: the proposed region that has the largest overlap with the ground box(es) is taken as the localization for a text phrase.

e.3 Localization using constrained queries

Pairwise relationships describe a particular type of visual entities, i.e., two objects interacting with each other in a certain way. As the basic building block of more complicated parsing structures, the pairwise relationship is worth evaluating as a special case. The Visual Genome dataset has pairwise object relationship annotations, independent from the text phrase annotations. To fit “object-relationship-object” (Obj-Rel-Obj) triplets into our model, we represented a triplet in a SVO (subject-verb-object) text phrase, and took the bounding box enclosing the two objects as the ground truth region for the SVO phrase. During the training time, we used both the original text phrase annotations and the SVO phrases derived from the relationship annotations to keep sufficient diversity of the text descriptions. During the testing time, we used only the SVO phrases to focus on the localization of pairwise relationships. The training and testing sets of images were the same as in the other experiments.

As reported in Table 12, groups of two objects were easier to localize than general visual entities, since they were more clearly defined and largely context-free. In particular, DBNet’s recall and median/mean IoU for Obj-Rel-Obj queries were approximately twice as high as for general text phrases. These results demonstrate the effectiveness of DBNet for localizing object relationships, and they also suggest that the complexity of the text queries (e.g., all human-annotated phrases vs. Obj-Rel-Obj pairs) is a significant source of failures.

Region Visual Localization Recall / % for IoU@ Median Mean
proposal network model 0.1 0.2 0.3 0.4 0.5 0.6 0.7 IoU IoU
EdgeBox 500 16-layer VGGNet DBNet (all phrases) 54.8 45.9 38.3 30.9 23.7 16.6 09.9 0.152 0.258
DBNet (Obj-Rel-Obj) 81.8 75.1 67.3 57.8 46.8 35.4 23.1 0.471 0.448
Table 12: Single-image localization accuracy on the Visual Genome dataset for DBNet evaluated with all annotated text phrases versus with Obj-Rel-Obj (SVO) queries. For “all phrases”, any text phrase annotated on a test image is taken as a query for that image; for “Obj-Rel-Obj”, only SVO phrases derived from the relationship annotations are used as queries. “IoU@” denotes the overlap threshold for determining the recall of ground truth boxes.

E.4 Ablative study on the text similarity threshold

As discussed in Section 5.3, removing ambiguous training samples is important. The ambiguous-sample pruning depends on 1) the overlap between proposed regions and ground truth regions, and 2) text similarity. While image region overlaps are commonly considered in traditional object detection, the text similarity criterion is specific to natural-language visual localization and detection.
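To make the two pruning criteria concrete, the sketch below builds the set of usable negative (region, phrase) pairs for one training image. It is only an illustration of the idea, not the exact rule of Eq. (7): `text_sim` stands in for the paper's phrase similarity, the default cut-offs are illustrative, and the `iou` helper is the one from the sketch in Section E.2.

```python
def usable_negative_pairs(proposals, gt_regions, text_sim,
                          iou_overlap=0.3, sim_threshold=0.3):
    """Sketch of ambiguous-sample pruning. gt_regions is a list of
    (gt_box, gt_phrase) pairs annotated on the image. A candidate
    (proposal, phrase) pair is discarded as ambiguous when the proposal
    overlaps some ground truth region whose phrase is too similar to the
    candidate phrase; otherwise it is kept as a negative example."""
    phrases = [p for _, p in gt_regions]
    negatives = []
    for box in proposals:
        for phrase in phrases:
            ambiguous = any(
                iou(box, gt_box) > iou_overlap and
                text_sim(phrase, gt_phrase) > sim_threshold
                for gt_box, gt_phrase in gt_regions)
            if not ambiguous:
                negatives.append((box, phrase))
    return negatives
```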

In Table 13, we reported the localization performance of DBNet under different values of the text similarity threshold (defined in Eq. (7)), using a controlled setting with neither text phrases from other images nor finetuning of the visual pathway. DBNet achieved the best performance with the default threshold value (0.3 in Table 13); suboptimal values caused a small decrease in localization recall and in median/mean IoU.

Phrases from Finetuning Similarity Recall / % for IoU@ Median Mean
other images visual pathway threshold 0.3 0.5 0.7 IoU IoU
No No 0.1 33.6 20.6 08.6 0.101 0.231
No No 0.2 33.0 20.2 08.5 0.094 0.227
No No 0.3 34.5 21.2 09.0 0.113 0.237
No No 0.4 33.0 20.2 08.4 0.093 0.227
No No 0.5 32.8 20.2 08.4 0.091 0.226
Table 13: Ablative study on the text similarity threshold in Eq. (7).

Since the above controlled setting excluded text phrases from the rest of the training set, the localization performance was not very sensitive to the value of the threshold, owing to the limited number of phrases involved. When the text phrases from the whole training set are included in the training loss on a single image, the choice of threshold can have a more obvious impact; an extreme threshold value can, for example, disable the inclusion of text phrases from other images altogether.

Appendix F More qualitative comparison for localization

More qualitative localization results are shown in this section. We compared DBNet with DenseCap (Figure 8 in Section F.1) and SCRC (Figure 9 in Section F.2), respectively. For each test example, we cropped the image so that the figure focuses on the localized region. We used a green box for the ground truth region, a red box for DenseCap/SCRC, and a yellow box for our DBNet.

In the examples shown, at least one of the two compared methods (DBNet and DenseCap/SCRC) localizes the text query to an image region that overlaps the ground truth region. Apart from this constraint, all examples were chosen randomly. While DenseCap and SCRC outperform DBNet in a few cases, DBNet significantly outperforms the other two methods most of the time.
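The cropped visualizations can be reproduced with a few lines of matplotlib. This is only a sketch of the plotting convention described above (green for ground truth, red for the baseline, yellow for DBNet), not the script used to generate the figures; the box format and margin are assumptions.

```python
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image

def show_localization(image_path, query, gt_box, dbnet_box, baseline_box, margin=30):
    """Draw the three boxes (x1, y1, x2, y2) and crop the view around them."""
    fig, ax = plt.subplots()
    ax.imshow(Image.open(image_path))
    for box, color in ((gt_box, "green"), (baseline_box, "red"), (dbnet_box, "yellow")):
        x1, y1, x2, y2 = box
        ax.add_patch(patches.Rectangle((x1, y1), x2 - x1, y2 - y1,
                                       fill=False, edgecolor=color, linewidth=2))
    xs = [b[i] for b in (gt_box, dbnet_box, baseline_box) for i in (0, 2)]
    ys = [b[i] for b in (gt_box, dbnet_box, baseline_box) for i in (1, 3)]
    ax.set_xlim(min(xs) - margin, max(xs) + margin)
    ax.set_ylim(max(ys) + margin, min(ys) - margin)  # keep image orientation
    ax.set_title(query)
    ax.axis("off")
    plt.show()
```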

See results on the next page.

F.1 More qualitative comparison with DenseCap

[Figure 8 panels: each example pairs a query phrase (e.g., “a small chalk board in the window”, “blue and white ship on the water”, “the clock says 11/08”, “a fat seagull standing”) with the cropped localization results of DBNet and DenseCap.]
Figure 8: Qualitative comparison between DBNet and DenseCap on the localization task. Examples are randomly sampled. Green boxes: ground truth; red boxes: DenseCap; yellow boxes: DBNet. The numbers are IoU values with the ground truth boxes.

F.2 More qualitative comparison with SCRC

[Figure 9 panels: each example pairs a query phrase (e.g., “left armrest of the bench”, “gray hour hand on clock”, “yellow directional sign on street”, “paper note shaped like autumn leaf”) with the cropped localization results of DBNet and SCRC.]
Figure 9: Qualitative comparison between DBNet and SCRC on the localization task. Examples are randomly sampled. Green boxes: ground truth; red boxes: SCRC; yellow boxes: DBNet. The numbers are IoU values with the ground truth boxes.

Appendix G Qualitative Comparison for Detection

In this section, we showed more qualitative results for visual entity detection with various phrases. Unlike the localization task, detection requires a decision threshold to decide whether the queried visual entity exists in an image. We determined this threshold either using prior knowledge of the ground truth regions (Section G.1) or based on the precision of the detector (Sections G.2 and G.3).

In Section G.1, we showed the same number of detected regions as ground truth regions for all methods. We visualized randomly chosen test images and phrases under the constraint that at least one of DBNet, DenseCap, and SCRC obtained a sufficiently accurate detection result (IoU with a ground truth above a fixed threshold).

In Section G.2, we found a decision threshold for each text phrase so that the detection precision (at a fixed IoU threshold) reached a target value. If no threshold could achieve this precision, we excluded that phrase from visualization. We then randomly chose test images and phrases to visualize.
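A minimal sketch of how such a phrase-dependent threshold can be computed from the ranked detections of one phrase; `is_true_positive` marks detections matched to a ground truth at the chosen IoU threshold, and the target precision value is an assumption for illustration (the exact value used for the figures is not restated here).

```python
import numpy as np

def threshold_for_precision(scores, is_true_positive, target_precision=0.5):
    """Return the lowest score threshold whose retained detections reach the
    target precision, or None if the target is never reached (the phrase is
    then excluded from visualization)."""
    order = np.argsort(-np.asarray(scores))
    tp = np.cumsum(np.asarray(is_true_positive, dtype=float)[order])
    precision = tp / (np.arange(len(order)) + 1)
    feasible = np.where(precision >= target_precision)[0]
    if feasible.size == 0:
        return None
    k = feasible[-1]  # largest cut-off rank that still meets the target precision
    return float(np.asarray(scores)[order][k])
```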

In Section G.3, we used the same decision thresholds as in Section G.2, but focused on visualizing failed detections. In particular, we randomly chose test images and phrases under the constraint that at least one of DBNet, DenseCap, and SCRC gave significantly wrong detection results (IoU with any ground truth less than 0.2). The failure types are also displayed in the figures.

See results on the next page.

G.1 Random detection results with known number of ground truths

In Figure 10, the number of ground truth entities in each image was assumed to be known in advance. All three methods (DBNet, DenseCap, and SCRC) performed similarly in detecting the queried visual entities under a loose standard for localization accuracy (e.g., counting a detected box as a true positive even if it only slightly overlaps the ground truth box), but DBNet’s localizations were usually more accurate.
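A sketch of the selection rule assumed in this subsection: when the number of ground truth entities is known, each method simply keeps its top-scoring detections, one per ground truth.

```python
def detect_top_k(boxes, scores, num_ground_truth):
    """Keep the top-scoring detections when the number of ground truth
    entities is assumed known; duplicate detections are assumed to have
    been suppressed already (e.g., by non-maximum suppression)."""
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return [boxes[i] for i in ranked[:num_ground_truth]]
```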

Figure 10: Qualitative detection results of DBNet, DenseCap, and SCRC when the number of ground truths is known. Detection results of six different text phrases are shown for each image. The colors of the bounding boxes correspond to the colors of the text phrases on the left. The semi-transparent boxes with dashed boundaries are ground truth regions, and the boxes with solid boundaries are the detection results of the three models.
Figures 11 and 12 (continued from Figure 10): additional qualitative detection results of DBNet, DenseCap, and SCRC when the number of ground truths is known, with the same layout and legend as Figure 10.

G.2 Random detection results with phrase-dependent thresholds

In Figure 13, we used phrase-dependent decision thresholds to determine how many regions were detected in each image. Each threshold was set so that the detection precision (at a fixed IoU threshold) reached the target value whenever that was achievable. DBNet significantly outperformed DenseCap and SCRC, which produced many false alarms and missed detections. Note that DBNet could usually reach the target precision at a reasonable recall level, whereas DenseCap and SCRC often either failed to reach the target precision at all or reached it only at a very low recall.

Figure 13: Qualitative detection results of DBNet, DenseCap, and SCRC using phrase-dependent detection thresholds. Detection results of four different text phrases are shown for each image. The colors of the bounding boxes correspond to the colors of the text phrases on the left. The semi-transparent boxes with dashed boundaries are ground truth regions, and the boxes with solid boundaries are the detection results of the three models.
Figures 14–21 (continued from Figure 13): additional qualitative detection results of DBNet, DenseCap, and SCRC using phrase-dependent detection thresholds, with the same layout and legend as Figure 13.

G.3 Failure cases for detection with phrase-dependent thresholds

In this section, we used phrase-dependent decision thresholds in the same way as in Section G.2, but focused on showing failure cases. We visualized randomly chosen test images and phrases under the constraint that at least one of DBNet, DenseCap, and SCRC significantly failed in detection (i.e., IoU with ground truth less than 0.2). In Figure 22, we categorize the failure cases into three types: 1) false alarm (the detected box has no overlap with any ground truth); 2) inaccurate localization (the IoU with the ground truth is below the success threshold); 3) missing detection (no detected box overlaps a ground truth region). For each image, we show only one phrase for visual clarity and display the failure types for completeness. DBNet has significantly fewer failure cases than DenseCap and SCRC.
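The three failure types can be assigned mechanically from the detections and ground truth boxes. The sketch below is one way to do it (reusing the `iou` helper from Section E.2); the success IoU cut-off of 0.5 is an illustrative assumption, not necessarily the value used in the figures.

```python
def categorize_detections(det_boxes, gt_boxes, success_iou=0.5):
    """Label each detection as 'success', 'inaccurate localization' (overlaps
    a ground truth but with too low IoU), or 'false alarm' (no overlap with
    any ground truth); ground truths left unmatched are 'missing'."""
    det_labels, matched_gt = [], set()
    for det in det_boxes:
        ious = [iou(det, gt) for gt in gt_boxes]
        best = max(ious) if ious else 0.0
        if best >= success_iou:
            det_labels.append("success")
            matched_gt.add(ious.index(best))
        elif best > 0.0:
            det_labels.append("inaccurate localization")
        else:
            det_labels.append("false alarm")
    missing = [i for i in range(len(gt_boxes)) if i not in matched_gt]
    return det_labels, missing
```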

[Figure 22 examples include the query phrases “a man with dark hair eating outside”, “a group of swimmers in the ocean”, and “a multi colored towel in the cabinet”; the failure type of each method is marked in the figure.]
Figure 22: Random failure examples. Green boxes with solid boundaries: successful detections; green boxes with dashed boundaries: ground truth regions with matched detections; red boxes: false alarms; yellow boxes with dashed boundaries: missed ground truth regions (without matched detections); blue boxes: inaccurately localized detections.
Figures 23 and 24 (continued from Figure 22): more random failure examples, covering query phrases such as “a black and white cat”, “a buckle is on the collar”, “a black shirt”, “a baseball tee”, “airplane parked on tarmac”, and “a 2 toned blue winter jacket”, with the same legend as Figure 22.

Appendix H Precision-recall curves

We show the precision-recall curves used to calculate global average precision (gAP) (Section H.1) and mean average precision (mAP) (Section H.2).

H.1 Phrase-independent precision-recall curves

We reported precision-recall curves for different query sets under different IoU thresholds, computed from the detection results over all test cases, in Figure 25. gAP was computed based on these precision-recall curves.
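For reference, gAP pools the detections of all phrases and all test images into one ranked list and computes a single average precision, whereas mAP computes the same quantity per phrase and averages over phrases. The sketch below uses the standard ranking-based definition; it is not the paper's evaluation script, and the un-interpolated area computation is an assumption.

```python
import numpy as np

def precision_recall(scores, is_true_positive, num_ground_truth):
    """Precision and recall at every rank of the score-sorted detections
    (pooled over all phrases for gAP, or restricted to one phrase for mAP)."""
    order = np.argsort(-np.asarray(scores))
    tp = np.cumsum(np.asarray(is_true_positive, dtype=float)[order])
    precision = tp / (np.arange(len(order)) + 1)
    recall = tp / max(num_ground_truth, 1)
    return precision, recall

def average_precision(precision, recall):
    """Area under the precision-recall curve, summed over recall increments."""
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap
```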

[Figure 25 panels: query-set Levels 0–2 (rows) × IoU thresholds 0.3, 0.5, and 0.7 (columns).]
Figure 25: Phrase-independent precision-recall curves for calculating gAP.

H.2 Phrase-dependent precision-recall curves

We calculated precision-recall curves for various query sets under different IoU thresholds, independently for each text phrase over the entire test set. mAP was computed based on these per-phrase precision-recall curves. Precision-recall curves for a few selected text phrases are shown in Figures 26–30.

Figure 26: Precision-recall curves for the text phrase “head of a person” (same panel layout as Figure 25).
Figure 27: Precision-recall curves for the text phrase “a window on the building”.
Figure 28: Precision-recall curves for the text phrase “the water is calm”.
Figure 29: Precision-recall curves for the text phrase “man wearing blue jeans”.
Figure 30: Precision-recall curves for the text phrase “small ripples in the water”.