OptiBox: Breaking the Limits of Proposals for Visual Grounding

11/29/2019 ∙ by Zicong Fan, et al. ∙ The University of British Columbia

The problem of language grounding has attracted much attention in recent years due to its pivotal role in more general high-level image-language reasoning tasks (e.g., image captioning, VQA). Despite the tremendous progress in visual grounding, the performance of most approaches has been hindered by the quality of bounding box proposals obtained in the early stages of all recent pipelines. To address this limitation, we propose a general progressive query-guided bounding box refinement architecture (OptiBox) that leverages global image encoding for added context. We apply this architecture in the context of the GroundeR model, first introduced in 2016, which has a number of unique and appealing properties, such as the ability to learn in the semi-supervised setting by leveraging cyclic language-reconstruction. Using GroundeR + OptiBox and a simple semantic language reconstruction loss that we propose, we achieve state-of-the-art grounding performance in the supervised setting on the Flickr30k Entities dataset. More importantly, we are able to surpass many recent fully supervised models using only 50% of the annotations, and remain competitive with as little as 3.12%.




1 Introduction

Visual grounding is the task of associating textual input with corresponding regions in a given image. The problem has attracted much attention in recent years as it plays a vital role in applications such as image captioning [13] and visual question answering (VQA) [34]. Most methods in this field follow a two-stage process [11, 27, 3, 21, 18] consisting of an object proposal stage that suggests potential bounding boxes a query phrase could ground to, and a decision stage that assigns one or more proposed boxes to a query. Despite various efforts to improve visual grounding systems, their performance is bounded by the quality of the bounding box proposals in the first stage. We say that a query is correctly grounded if it is assigned a proposal box that is close enough in size and location to the ground-truth annotation box. When all proposals have low overlap with the ground-truth (left of Figure 1, for instance), the ability of a grounding model becomes extremely limited.

Figure 1: Left: the white box represents the raw prediction of a grounding model (GroundeR [27] in this case) for the phrase A girl. Right: our model (OptiBox) makes appropriate adjustments using query-guided regression, resulting in a much more precise box. The blue box represents the ground-truth.

Although this is an obvious drawback, few methods in visual grounding attempt to improve the quality of bounding box proposals. Chen et al. [2] introduce a bounding box regression network that leverages reinforcement learning techniques to guide the training. However, their model does not take global visual cues into account, nor is it expressive enough for bounding box refinement. Yang et al. [33] propose to incorporate query information in the region proposal network to predict bounding boxes that are more strongly correlated with the query; however, they do not refine the predicted boxes. To overcome these limitations, we introduce a bounding box optimization network, which we call OptiBox, that uses global visual cues to refine the predicted proposal boxes progressively and ensures that the box is tight and accurate around the object of interest (Figure 1, right).

As a proof of concept, we apply OptiBox to the GroundeR model introduced by Rohrbach et al. [27]. Unlike other grounding methods [11, 3, 21, 18, 33] that have only been shown effective in the fully supervised setting, GroundeR is able to leverage partially labeled or even unlabeled data. When labeled data are available, we suggest a simple semantic reconstruction loss that performs well in both fully-supervised and semi-supervised settings. In particular, with labels for only 3.12% of our data, our model outperforms the original fully-supervised GroundeR model (100% of annotations) by a large margin; with 50% of annotations, it surpasses most recent fully supervised models; with 100% of annotations, it achieves state-of-the-art performance. Our ablation studies demonstrate the efficacy of OptiBox, and we expect its application will benefit most other visual grounding frameworks.

Contributions: Our contributions are three-fold: (1) we propose a general progressive query-guided bounding box refinement architecture that leverages global image encoding for added context; (2) we apply this architecture in the context of the GroundeR model [27], illustrating state-of-the-art performance; (3) we propose a simple semantic linguistic reconstruction loss which, when coupled with GroundeR, further improves performance in both supervised and, more importantly, semi-supervised settings. The resulting model can, uniquely, produce close to state-of-the-art (supervised) grounding performance with as little as 3.12% of the data.

2 Related Work

Visual grounding: In recent years, there has been considerable progress in visual grounding of phrases. Most authors adopt a two-stage pipeline [27, 2, 18, 1, 11, 31, 30, 21]: an object proposal stage followed by a grounding decision stage. The object proposal stage generates bounding box proposals from the input image using an object detection model. The grounding decision stage uses the query's linguistic features and each proposal's visual features to score the correspondence between the query and the proposal; the proposal with the highest score is selected as the predicted grounding result. Rohrbach et al. [27] propose the GroundeR model, which allows multiple levels of supervision using a reconstruction loss. Hu et al. [11] use a similar approach based on a caption generation framework, but for the supervised setting only. Wang et al. [31] apply a deep structure-preserving embedding framework to grounding, which they formulate as a ranking problem; this work is further extended with a similarity network [30] and a concept weight branch [21]. Chen et al. [2] introduce a query-guided regression framework based on reinforcement learning that refines the proposed boxes using query heuristics in order to break the bottleneck of bounding box proposals. Furthermore, Chen et al. [1] take the contextual information of the phrase into account by penalizing a joint loss over all phrases in a sentence, whereas Dogan et al. [3] encode the queries and proposal features using two independent LSTM modules. Bajaj et al. [18] construct a visual graph and a phrase graph to model the pairwise relationships between entities via graph convolutional operations; the convolved visual and linguistic features are finally merged and refined by a fusion graph. There have also been techniques focusing on single-stage grounding. Xiao et al. [32] perform weakly-supervised pixel-level grounding with a spatial attention mask generated from the hierarchical structure of the query phrase's parse tree. Very recent works propose to embed query information in the region proposal stage [33, 28]. In contrast, our proposed method shows that, with a simple architecture, we can learn from features in multiple modalities to obtain a more refined, accurate result in various learning settings.

Bounding box regression: Bounding box regression has commonly been used in the final stage of an object detector to improve its accuracy. The R-CNN line of work [5, 6, 24, 7] applies a linear regression layer to refine bounding boxes selected from the list of proposals. The model in [4] iteratively refines and merges proposals using a deep CNN regression model. Lin et al. [17] use a class-agnostic, convolutional bounding box regressor to correct the offset between an anchor and its closest ground-truth. Additionally, Jiang et al. [12] learn to predict the IoU between the proposals and their ground-truth (IoU-Net). Instead of using a fixed architecture to perform bounding box regression, Rajaram et al. [23] implement an iterative refinement algorithm (RefineNet) that can be trained in a similar fashion to the Faster R-CNN [24] network. Roh and Lee [26] add a refinement layer for both the object classification and bounding box regression losses. Recently, Rezatofighi et al. [25] introduce a generalized IoU metric that is more robust to non-overlapping objects, while He et al. [9] use the KL divergence between the predicted and ground-truth distributions as the penalty. In the task of object segmentation, Pinheiro et al. [20] propose SharpMask, which augments a traditional feedforward net with a refinement module to produce more accurate segmentations. The unique challenge of incorporating these bounding box regression methods into visual grounding is the multimodality of the task: we need to ensure the correction is guided by the query, rather than treating it as an object detection refinement procedure.

3 Approach

Our grounding model consists of two parts: a grounding module and a box refinement module. Given an image and a query, the grounding module returns the bounding box b with the highest confidence value. The box refinement module then predicts an offset t that suggests an adjustment to b. Finally, applying the predicted offset t to b, we obtain the refined bounding box.

3.1 Grounding

We adopt the GroundeR model introduced in [27] as our grounding module. An input image first goes through an object detector to obtain N bounding box proposals of potential objects (N = 50 in our experiments). For each proposal b_i, we extract its visual features v_i from the detection backbone. For the query phrase, we first encode each word using pre-trained word embeddings, which are then passed into an LSTM [10] to obtain the last hidden state as the linguistic features q. To combine the two modalities, we first project v_i and q onto a space of common dimensionality using two separate fully-connected layers with ReLU activations. The projected query feature is then added to each of the projected proposal features to produce a feature vector f_i for each proposal. At this point, f_i should contain information from both the query and the object proposal b_i. At inference time, we score the correspondence between the pairs by projecting each f_i to a scalar s_i and assign the query to the proposal with the maximum s_i. At training time, when labeled data are available, a cross-entropy loss is applied against the target box, which is the proposal with the highest IoU among all proposals with an IoU of at least 0.5 with the ground-truth.
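To make the fusion-and-scoring step concrete, here is a minimal NumPy sketch (not the paper's implementation). The dimensions (50 proposals, 2048-d visual features, 512-d query features, a 128-d common space) follow the implementation section; the random weight matrices `W_v`, `W_q`, `w_s` are illustrative stand-ins for learned layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

N, d_vis, d_txt, d = 50, 2048, 512, 128  # 50 proposals; dims from Section 4.1

# Toy inputs: per-proposal visual features and one query feature.
v = rng.normal(size=(N, d_vis))   # visual features from the detection backbone
q = rng.normal(size=(d_txt,))     # last LSTM hidden state for the query

# Two separate fully-connected projections into the common 128-d space.
W_v = rng.normal(size=(d_vis, d)) * 0.01
W_q = rng.normal(size=(d_txt, d)) * 0.01
f = relu(v @ W_v) + relu(q @ W_q)  # broadcast-add the query to every proposal

# Project each fused vector to a scalar score; ground to the argmax proposal.
w_s = rng.normal(size=(d,))
s = f @ w_s
best = int(np.argmax(s))
```

At training time, `s` would instead be passed through a softmax and penalized with cross-entropy against the target proposal.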

Figure 2: Our global attention module. The model first extracts the feature map from the input image using the detection backbone. It then concatenates the box b, the visual feature v of b, and the LSTM-encoded query feature q. The concatenated vector is projected to a lower-dimensional vector. Each spatial grid cell of the feature map is then concatenated with a copy of this vector, which results in an attention feature map after a convolution. After softmax normalization, the attention map entries are used as weights to average across all spatial grid locations of the feature map to produce a context vector g. Best viewed in color.

Semantic Reconstruction: In the semi-supervised setting, the original GroundeR [27] model exploits unlabeled data through a phrase reconstruction process: if the model is highly confident about a subset of the proposals, one should be able to reconstruct the phrase solely from the visual features of those boxes. In particular, it normalizes the proposal scores s_i with a softmax function to obtain attention weights α_i that sum to one. It then takes a weighted average over the visual features of all proposals,

v̄ = Σ_i α_i v_i,

and learns an LSTM decoder to unroll the original phrase query. A cross-entropy loss is applied to the phrase reconstruction at each time step of the LSTM. Although this seems the most natural approach to take, there is an inherent flaw: even if the reconstructed phrase is semantically equivalent to the input phrase, it is penalized if it does not match the input phrase word for word. For example, "a little boy" and "a young boy" could refer to the same bounding box, and this minor deviation should not be penalized. In other words, the model should strive to reconstruct the semantic meaning of the phrase from the visual features, rather than trying to match it exactly. To this end, we propose to learn a function φ that projects the visual feature into the original query semantic space,

q̂ = φ(v̄),

and penalize with the semantic loss

L_sem = ||q̂ − q||²

in the linguistic latent space. To balance the classification loss L_cls and the semantic reconstruction loss, we use a hyper-parameter λ and minimize the sum of the two losses:

L = L_cls + λ L_sem.
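The semantic reconstruction loss above can be sketched in a few lines of NumPy. This is an illustrative toy, not the trained model: `phi` is a linear stand-in for the learned projection into the query space, the classification loss is a placeholder scalar, and λ = 100 mirrors one of the grid-searched values in Table 2.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

N, d_vis, d_txt = 50, 2048, 512

s = rng.normal(size=(N,))          # proposal scores from the grounding module
v = rng.normal(size=(N, d_vis))    # proposal visual features
q = rng.normal(size=(d_txt,))      # query embedding in the linguistic space

alpha = softmax(s)                 # attention weights, sum to one
v_bar = alpha @ v                  # attention-weighted visual feature

# phi: a linear stand-in for the learned projection into the query space.
W_phi = rng.normal(size=(d_vis, d_txt)) * 0.01
q_hat = v_bar @ W_phi

L_sem = float(np.sum((q_hat - q) ** 2))  # squared distance in the latent space

lam = 100.0                        # lambda, per Table 2 (100% annotations)
L_cls = 0.7                        # placeholder classification loss
L_total = L_cls + lam * L_sem      # the combined objective
```

Because the loss compares latent vectors rather than tokens, "a little boy" and "a young boy" with nearby embeddings incur little penalty, which is the point of the reformulation.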

Figure 3: OptiBox: our proposed bounding box refinement model. Given a predicted box b from a grounding network, we concatenate the visual features v corresponding to the box, the bounding box b itself, the query's linguistic features q, and the global context features g, and project the result using feedforward layers. The final projection yields a 4-dimensional bounding box offset. The numbers on the layers indicate the corresponding layer output dimensions. Best viewed in color.

3.2 OptiBox: A Bounding Box Refinement Model

In two-staged visual grounding, the first stage is often crucial, as the quality of the proposals fundamentally limits the later stage. For instance, if all generated proposals are significantly offset from the objects in the image, the grounding module cannot learn and predict well due to the lack of suitable candidates. Even when the query is grounded correctly, the selected bounding box may deviate noticeably from the actual object indicated by the query, since the grounding module is not optimized to adjust its predictions further. For these reasons, we propose to use a bounding box regression module to refine the proposals. Note that our method is essentially metric-agnostic: it will yield tighter and more accurate bounding boxes for visual grounding regardless of the evaluation metric used.

For convenience, we drop the subscript for the predicted bounding box. OptiBox takes advantage of several available components of our method: the bounding box b, the visual feature v, the original query feature q, and the global feature map of the entire image. Figure 3 depicts the simple refinement architecture of OptiBox. We first encode the original query into a language feature vector q using a separate LSTM. We then apply global attention to pool context information from the image into a global context vector g (discussed in detail shortly). We concatenate [v; b; q; g] into a single vector and project it to a lower-dimensional space for refinement. The refinement process consists of 5 weight-sharing fully-connected layers of the same size. Empirically, using additional layers yields only marginal returns, and weight sharing helps avoid overfitting. Finally, we project the output of the final refinement layer to a box offset vector t; all fully-connected layers are followed by the ReLU activation.
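The weight-sharing refinement stack described above can be sketched as applying one fully-connected layer five times. This is a toy NumPy sketch under assumed dimensions (the concatenated input size and the 512-d refinement width follow the implementation section); the random matrices stand in for learned weights.

```python
import numpy as np

rng = np.random.default_rng(2)

def relu(x):
    return np.maximum(x, 0.0)

# Assumed sizes of [v; b; q; g]: 2048-d visual, 4-d box, 512-d query,
# plus a global context vector (2048-d here as an assumption).
d_in, d_hid = 2048 + 4 + 512 + 2048, 512

x = rng.normal(size=(d_in,))               # concatenated [v; b; q; g]

W_proj = rng.normal(size=(d_in, d_hid)) * 0.01
h = relu(x @ W_proj)                       # project into the refinement space

# One weight matrix applied 5 times: the weight-sharing refinement layers.
W_shared = rng.normal(size=(d_hid, d_hid)) * 0.01
for _ in range(5):
    h = relu(h @ W_shared)

W_out = rng.normal(size=(d_hid, 4)) * 0.01
t = h @ W_out                              # 4-d box offset (tx, ty, tw, th)
```

Sharing `W_shared` across the five applications keeps the parameter count of the refinement stack equal to a single layer's, which is the stated motivation for avoiding overfitting.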

Figure 2 illustrates the structure of the aforementioned global attention module. For the input image in question, we extract its feature map from the detection backbone and perform adaptive average pooling to obtain a fixed-size feature map. The visual feature v, bounding box b, and query feature q are then concatenated and projected to a lower-dimensional vector. Each spatial grid cell of the feature map receives one copy of this projected vector, yielding a more informative feature map. We then apply a convolution to this feature map to obtain an attention map. Finally, we normalize the attention map using softmax and take a weighted average across all spatial locations of the original feature map, giving us the global attention vector g.
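A minimal NumPy sketch of this attention pooling follows. The pooled map size (7×7), channel count (1024), and the 512-d projected local vector are assumptions for illustration; the 1×1 scoring "convolution" is implemented as a per-cell dot product, and the random vectors stand in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax2d(a):
    e = np.exp(a - a.max())
    return e / e.sum()   # normalize over ALL spatial locations jointly

H, W, C, d = 7, 7, 1024, 512   # assumed pooled feature-map size and dims

fmap = rng.normal(size=(H, W, C))      # adaptively pooled backbone feature map
local = rng.normal(size=(d,))          # projection of the concatenated [v; b; q]

# Broadcast the local vector to every spatial cell, then score each cell.
tiled = np.concatenate([fmap, np.broadcast_to(local, (H, W, d))], axis=-1)
w_att = rng.normal(size=(C + d,)) * 0.01
scores = tiled @ w_att                 # (H, W) attention logits

att = softmax2d(scores)                # attention weights over spatial cells
g = np.tensordot(att, fmap, axes=([0, 1], [0, 1]))  # weighted average -> (C,)
```

The result `g` is a single context vector summarizing the whole image, weighted toward the regions the query and box make relevant.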

To be consistent with widely-used methods in bounding box regression [5], we adopt the following bounding box offset representation: given a box to regress b = (b_x, b_y, b_w, b_h) and the ground-truth box g = (g_x, g_y, g_w, g_h), the targets of our bounding box regression model are defined by

t_x = (g_x − b_x) / b_w,  t_y = (g_y − b_y) / b_h,  t_w = log(g_w / b_w),  t_h = log(g_h / b_h).

During inference, given the chosen proposal box b and the predictions t̂ = (t̂_x, t̂_y, t̂_w, t̂_h) from the box regression network, the regressed box b̂ is obtained by

b̂_x = b_w t̂_x + b_x,  b̂_y = b_h t̂_y + b_y,  b̂_w = b_w exp(t̂_w),  b̂_h = b_h exp(t̂_h).
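The offset encoding and its inverse can be written directly from the R-CNN-style parameterization [5]. The sketch below uses hypothetical box values; boxes are in (center-x, center-y, width, height) format.

```python
import numpy as np

def encode(p, g):
    """R-CNN-style offsets from proposal p to ground-truth g.

    Boxes are (x, y, w, h) with (x, y) the box center.
    """
    px, py, pw, ph = p
    gx, gy, gw, gh = g
    return np.array([(gx - px) / pw, (gy - py) / ph,
                     np.log(gw / pw), np.log(gh / ph)])

def decode(p, t):
    """Apply predicted offsets t to proposal p (the inference-time inverse)."""
    px, py, pw, ph = p
    tx, ty, tw, th = t
    return np.array([pw * tx + px, ph * ty + py,
                     pw * np.exp(tw), ph * np.exp(th)])

p = np.array([50.0, 60.0, 100.0, 80.0])   # hypothetical proposal box
g = np.array([55.0, 58.0, 120.0, 90.0])   # hypothetical ground-truth box

t = encode(p, g)
assert np.allclose(decode(p, t), g)       # decode exactly inverts encode
```

Normalizing the translation by the proposal's width/height and taking logs of the scale ratios keeps the targets roughly scale-invariant, which is why this parameterization is standard for box regression.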
4 Experiments and Results

4.1 Implementation Setup

Approach Visual Features Fine-tune Dataset # Proposals Acc. (%)
SCRC [11] VGG16 ImageNet 100 27.80
DSPE [31] VGG19 Pascal 100 43.89
GroundeR [27] VGG16 Pascal 100 47.81
CCA [22] VGG19 Pascal 200 50.89
MCB Reg Spatial [1] VGG16 Pascal 100 51.01
Similarity Net [30] VGG19 Pascal 200 51.05
MNN Reg Spatial [1] VGG16 Pascal 100 55.99
SeqGROUND [3] VGG16 N/A N/A 61.60
CITE-Resnet [21] ResNet101 COCO 200 61.33
CITE-Pascal[21] VGG16 Pascal 500 59.27
CITE-Flickr30K[21] VGG16 Flickr30K 500 61.89
QRC Net [2] VGG16 Flickr30K 100 65.14
GraphGround - PhraseGraph [18] VGG16 Visual Genome 50 60.80
GraphGround [18] VGG16 Visual Genome 50 63.87
GraphGround++ [18] VGG16 Visual Genome 50 66.93
One-Stage-Bert [33] Darknet53-FPN COCO (Flickr30K) 4032 68.69
Ours: Fully-supervised
GroundeR ResNet101 Visual Genome 50 67.04
Ours: Semi-supervised
Annotation %: 3.12% ResNet101 Visual Genome 50 58.55
Annotation %:    50% ResNet101 Visual Genome 50 65.85
Table 1: Test accuracy comparison of the state-of-the-art models on the Flickr30k Entities dataset. For our approaches, we use GroundeR with OptiBox in all cases with the semantic loss.

GroundeR: To encode the query phrase, we use 200-dimensional GloVe [19] embeddings pre-trained on the Twitter corpus. The word embeddings are passed into a single-layer, uni-directional LSTM with hidden size 512. The hidden states are batch-normalized and projected to 128 dimensions. We adopt a ResNet101 network [8] pre-trained on Visual Genome [16] with the top 200 most frequent object class labels. To be consistent with most existing approaches, we do not fine-tune the detection backbone on Flickr30k [22], which is our training and evaluation dataset for grounding. We let the region proposal network generate 50 proposals for each image. We extract the C4 layer of ResNet101 as our global feature map and perform global average pooling after the ROI head to obtain a 2048-dimensional feature vector for each proposed region. The bounding box visual features are also batch-normalized and then projected to 128 dimensions. The resulting vector is summed with the projected hidden states of the query phrase. The aggregated features then go through a fully-connected layer that forms the attention over the proposed bounding boxes, and the attention values are penalized against the target using cross-entropy. We train our grounding model for 25 epochs with the weight decay chosen via grid search (Table 2). We use the Adam optimizer [14] with batch size 128 and a scheduled learning rate (decayed from 0.001 at epochs 15 and 25), and we select the model maximizing validation accuracy for reporting on the test set.

Hyperparam. \ Annot. 3.12% 50% 100%
Weight decay 0.01 0.0005 0.01
Semantic loss reg. (λ) 10 100 100
Table 2: Grid search results for key hyperparameters.

In the semi-supervised setting, we optimize the classification loss and the semantic reconstruction loss jointly (see Eq. 4). The semantic reconstruction loss uses the ℓ2-distance, chosen for its empirical performance. Note that the value of the weight decay and the choice of λ can significantly impact accuracy; we therefore select them on the validation set via a grid search. The selected values are in Table 2.

OptiBox: To be consistent with the grounding model, we also encode each word of the original query with the 200-dimensional GloVe embeddings from the Twitter corpus and feed the word embeddings sequentially through a separate, uni-directional LSTM with hidden dimension 512. The query feature q is extracted from the final hidden state of the LSTM. As mentioned, our box regression model makes use of the 512-dimensional query feature q, the 2048-dimensional visual feature v, the 4-dimensional box b, and the global feature map. The shared refinement layers in Figure 3 are 512-dimensional; the fully-connected layer that projects the concatenated features before broadcasting onto each spatial cell of the feature map also outputs 512 dimensions. The ReLU activation is applied between the fully-connected layers of Figure 3, and also before broadcasting the local features onto the feature map. Since the performance of the bounding box regression is conditioned on the outputs of the grounding model, we begin training the regression model once the grounding model converges. To train the regression model, we use Adam [14] with a batch size of 128 until convergence. Again, the ℓ2-loss works best in our validation; thus, we apply the ℓ2-loss between the predicted bounding box offset and the target offset for supervision (see Equations 5 and 6).

In order to obtain highly informative linguistic features, we pre-train an LSTM autoencoder on the queries in the training set, optimizing with Adam [14] under a decayed learning-rate schedule until convergence. Once the autoencoder converges, we use it to initialize the weights of the LSTM encoders in GroundeR and OptiBox. Previous methods [3, 18, 33, 31] have shown that learning an image projection and a query projection with a ranking loss [15], which ensures the projected vectors lie in a common space, helps. Therefore, with the LSTM encoder initialized, we freeze its weights and pre-train the two linear layers mentioned in Section 3.1 to project the image features and the query features into the common space using the ranking loss, again optimizing with Adam under a decayed learning-rate schedule.

4.2 Dataset and Evaluation

Approach Prop. UB (%) Acc. (%)
GroundeR [27] 77.90 47.81
RPN+QRN [2] 71.25 53.48
SS+QRN [2] 77.90 55.99
PGN+QRN [2] 89.61 60.21
One-Stage-Bert [33] 95.48 68.69
GroundeR 84.00 62.15
GroundeR + OptiBox 84.00 65.20
GroundeR + SL 84.00 62.25
GroundeR + SL + OptiBox 84.00 67.04
Table 3: Comparison of proposal upper bounds of various state-of-the-art models with ours (last four). SL indicates using semantic loss. In the QRN network, RPN, SS and PGN are different proposal methods.
Approach Feature Acc. (%)
GroundeR [27] VGG16 CLS 41.56
GroundeR [27] VGG16 DET 47.81
GroundeR ResNet101 DET 62.15
GroundeR + OptiBox ResNet101 DET 65.20
Table 4: Comparison of the original GroundeR model and our models without the semantic loss. CLS and DET denote that the backbone is optimized for the classification and detection task, respectively. The median IoU of model predictions for GroundeR with ResNet101 features is 0.6008, and 0.6617 when OptiBox is added.

We evaluate the test-time performance of our model on the Flickr30k Entities dataset [22]. The dataset contains 31,783 annotated images in total. Each image has five sentences containing phrases that describe entities in the image, and each phrase is annotated with ground-truth bounding boxes within the image. We use the data split released by the dataset authors: 1,000 images for validation, 1,000 images for testing, and the rest for training. At evaluation time, a predicted bounding box with an IoU higher than 0.5 with the ground-truth bounding box is considered correctly grounded, and we report accuracy on the test set, i.e., the percentage of test-set phrases that are correctly grounded.
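The evaluation protocol above reduces to an IoU computation and a thresholded accuracy. The sketch below uses hypothetical boxes in corner format to illustrate it; the helper names are ours, not the benchmark's.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) corner format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def grounding_accuracy(preds, gts, thresh=0.5):
    """Fraction of phrases whose predicted box exceeds the IoU threshold."""
    hits = [iou(p, g) > thresh for p, g in zip(preds, gts)]
    return sum(hits) / len(hits)

# Hypothetical predictions vs. ground-truths for three phrases.
preds = [(0, 0, 10, 10), (5, 5, 15, 15), (0, 0, 4, 4)]
gts   = [(1, 1, 11, 11), (6, 4, 16, 14), (8, 8, 12, 12)]
acc = grounding_accuracy(preds, gts)  # first two exceed 0.5 IoU, last does not
```

With these toy boxes, two of the three phrases clear the 0.5 threshold, so the accuracy is 2/3.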

Table 1 shows the accuracy of various state-of-the-art models compared to ours, along with the visual features used, the dataset on which the detection backbone is fine-tuned, and the number of proposed bounding boxes used. With a simple replacement of the detection backbone in the GroundeR model, we already achieve competitive performance against the most recent models. Among these methods, only the QRC Net performs bounding box regression, but it does not use a recurrent, shared-weight component. Table 3 shows that their model also uses more proposals with a higher proposal upper bound (possibly due to fine-tuning on Flickr30k [22]), yet we exceed their accuracy using only half the number of proposals. Compared to [18], although that model makes grounding decisions jointly by considering all phrases within a sentence, our accuracy is slightly higher than all of its variations. We recently became aware of a concurrent work, the One-Stage-Bert model [33]. Although it has higher accuracy, its detection backbone was again fine-tuned on Flickr30k [22], and it implicitly uses more proposals through the YOLO framework to achieve a high proposal upper bound. Although we could also have fine-tuned our detection backbone to obtain better results, we avoided doing so to be consistent with the visual grounding literature.

Fully Supervised: Note that the original GroundeR [27] uses VGG16 [29] detection features while our version uses ResNet101 detection features. Results from the CITE model [21] in Table 1 show that the use of ResNet101 also has some positive impact on grounding performance. Table 4 shows the performance of our model compared to the original GroundeR model in the supervised setting. Indeed, with our enhanced version of GroundeR (before adding OptiBox), we achieve an over 14-point improvement in test accuracy (47.81% to 62.15%), possibly due to the higher proposal upper bound and better proposal quality. The QRN variants in Table 3 show an example where this can happen. Recall that the proposal upper bound is defined as the proportion of queries with at least one proposal having at least 0.5 IoU with its ground-truth. Interestingly, Table 3 also shows that, when using all annotations, the model with the semantic reconstruction loss performs slightly better than the model without it. We posit that the semantic reconstruction loss acts as a regularizer when annotation is abundant.

Models \Annot. (%) 3.12% 50% 100%
GroundeR [27] 28.94 46.65 48.38
GroundeR + SL 55.11 60.87 62.25
GroundeR + OptiBox + SL 58.55 65.85 67.04
Table 5: Comparison against the original GroundeR model in fully-supervised and semi-supervised settings with the semantic loss (SL) with varying proportions of annotations.

Semi-Supervised: Table 5 also contains results for the semi-supervised learning case. To the best of our knowledge, we are the first to achieve this level of visual grounding accuracy with such little annotated data. Compared to the original semi-supervised GroundeR model, given 3.12% of the bounding box annotations, our semi-supervised model almost doubles the test accuracy (28.94% to 58.55%). Moreover, even without OptiBox, our semi-supervised model at 3.12% annotations (55.11%) outperforms the original GroundeR with full annotations (48.38%). With 50% annotations, there is about a 14-point increase in test accuracy over the original GroundeR with the same amount of annotated data (60.87% vs. 46.65%).

With OptiBox, the accuracy at 50% annotations increases further to 65.85%, which surpasses most fully-supervised state-of-the-art models (see Table 1) except the concurrent works: GraphGround++ [18] and One-Stage-Bert [33].

In view of the performance of OptiBox in this semi-supervised setting, when annotated data is scarce (3.12%), the box regression process improves test accuracy by about 3.4 points (55.11% to 58.55%). The regressor gives its best improvement, about 5 points (60.87% to 65.85%), when there is 50% annotated data.

4.3 Ablation Studies: OptiBox

Figure 4: IoU distributions before and after OptiBox regression. From (a)-(c) we show distribution changes when regressing on different initial qualities of the proposals, in terms of the IoU. Best viewed in color.

In this section, we provide ablation studies to investigate the effect of our bounding box regression model and evaluate how it could be applied to benefit visual grounding models in general.

In Table 4, we see that the median IoU of model predictions for GroundeR with ResNet101 features is 0.6008, rising to 0.6617 when OptiBox is applied, a gain of about 0.06 in median IoU. Note that when the grounding module makes a wrong prediction, the IoU between the chosen box and the ground-truth box is very likely close to zero, so the median IoUs are heavily affected by these zero IoUs. In practice, one would expect the given source bounding box to have at least some overlap with the ground-truth box.

Visual Box Query Global Median IoU
Y Y Y Y 0.200
N Y Y Y 0.106
Y N Y Y 0.197
Y Y N Y 0.105
Y Y Y N 0.198
Table 6: Comparison of the effectiveness of the different features for OptiBox. The four feature columns refer to the visual features, the bounding box coordinates, the query features, and the global context features, respectively, as defined in Section 3.1. Median IoU here denotes the median of the IoU differences between the bounding boxes after and before regression.

Since the distribution of IoUs of the predicted boxes is heavily conditioned on the architecture of the grounding model, we perform an additional independent experiment to assess the regression model's standalone performance. Specifically, given a set of Flickr30k images, we select the proposal boxes whose IoU with the ground-truth exceeds a minimum threshold and perform regression on them; the threshold ensures that the proposal boxes are at least near the ground-truth boxes. In short, the dataset consists of source and target pairs of boxes. We use the proposal boxes from the previous Flickr30k experiment as our source boxes and associate each Flickr30k ground-truth box with the proposal boxes that meet the IoU threshold. Using the same data split as the previous experiment, we train our box regression network for 40 epochs using the Adam optimizer [14]; the learning rate decays by a factor of 0.1 at epochs 3, 10, 20, and 30.

Figure 4 shows the distributions of IoUs before and after the bounding box regression in our independent experiment. Throughout the figure, the grey background denotes the region with IoU below 0.5 and the white background the region with IoU above 0.5. Figure 4 (a) shows the cases where the given bounding boxes have some but insufficient overlap with the ground truth. After regression, most of the distribution mass moves above 0.5 IoU, and only a small portion of boxes end up with a worse IoU; most source bounding boxes improve and fall into the white region. Figure 4 (b) shows the cases where the source boxes are relatively good. Similar to (a), the distribution shifts towards a high-IoU peak, with only a tiny portion falling into the grey region. Finally, Figure 4 (c) shows the change in the IoU distribution for exceptionally good source boxes. Although some source boxes end up below their original minimum IoU, most bounding boxes retain an IoU above 0.5, which is the threshold we care about. Figure 5 shows some qualitative examples of our grounding network and our bounding box regression network. For example, in Figure 5 (b) the query is "A carefully balanced male". The bounding box chosen by the grounding network is shown in green, the white box shows the result after regression, and the blue box is the ground-truth; the IoU increases from 0.34 to 0.53 after regression. Interestingly, our regression network appears to handle multiple instances flexibly: the source bounding box in Figure 5 (e) covers only one child, but the regression network adjusts it to ground both children according to the query. Figure 5 (c) shows a negative example of bounding box regression. Also, deciding whether a person is "a man" is often challenging when only their back is visible in the image.

Figure 5: Qualitative samples for OptiBox. Blue boxes are the ground-truths, green boxes are the proposals selected prior to adjustment, and white boxes are the after-adjustment results. Numbers denote the IoU before and after applying OptiBox. Indeed, OptiBox yields more sensible results in most cases. Best viewed in color.

Finally, Table 6 ablates the contribution of the four input features to our regression network. When all four features are included, the model achieves a median IoU gain of 0.2 in our independent experiment. The most impactful features are the visual features and the query features. This makes sense: we would expect a box regression model to perform poorly if it cannot see (i.e., lacks the visual features) or is not told what to focus on (i.e., lacks the query). Dropping either the bounding box coordinates or the global attention features causes a slight decrease in improvement. This is also reasonable: given only a patch of the image, one can often still predict a decent bounding box. However, global information should matter more for boxes with little overlap with the ground truth, where refining the box is harder without awareness of the entire image. The bounding box coordinates likewise have a small but positive effect, as the location and size of a patch provide spatial information about where attention should be placed.
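As context for the coordinate features, box regressors of this kind commonly adjust a source box via the R-CNN-style delta parameterization of Girshick et al. [5]: the network predicts center shifts relative to the box size and log-space scale factors. A minimal numpy sketch (variable names are ours, not taken from our implementation):

```python
import numpy as np

def apply_deltas(box, deltas):
    """Apply R-CNN-style regression deltas (tx, ty, tw, th) to a (x1, y1, x2, y2) box."""
    w, h = box[2] - box[0], box[3] - box[1]
    cx, cy = box[0] + 0.5 * w, box[1] + 0.5 * h
    # Shift the center proportionally to the box size, scale width/height in log-space.
    ncx = cx + deltas[0] * w
    ncy = cy + deltas[1] * h
    nw = w * np.exp(deltas[2])
    nh = h * np.exp(deltas[3])
    return np.array([ncx - 0.5 * nw, ncy - 0.5 * nh,
                     ncx + 0.5 * nw, ncy + 0.5 * nh])
```

Zero deltas leave the box unchanged, so a regressor initialized near zero output starts as an identity refinement; the log-space scaling keeps predicted widths and heights positive.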

5 Conclusion

In this work, we propose a query-guided box refinement network, OptiBox, which corrects suboptimal bounding boxes predicted by a visual grounding model. We demonstrate its effectiveness with the GroundeR model [27] in both supervised and semi-supervised settings. We also introduce a semantic reconstruction loss, which we show provides a significant improvement to the overall grounding system. Evaluating on the Flickr30k Entities dataset [22], we outperform the original fully supervised GroundeR model using only a small fraction of the annotations, and with a larger fraction we are competitive with recently proposed models. In the fully supervised setting, we achieve state-of-the-art performance.


The authors would like to thank the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Canadian Institute for Advanced Research (CIFAR) for supporting this research project. The authors also thank Shih-Han Chou, Bicheng Xu, and Mir Rayat Imtiaz Hossain for helpful feedback and discussions.


  • [1] K. Chen, R. Kovvuri, J. Gao, and R. Nevatia (2017) MSRC: multimodal spatial regression with semantic context for phrase grounding. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, Cited by: §2, Table 1.
  • [2] K. Chen, R. Kovvuri, and R. Nevatia (2017) Query-guided regression network with context policy for phrase grounding. In ICCV, Cited by: §1, §2, Table 1, Table 3.
  • [3] P. Dogan, L. Sigal, and M. Gross (2019) Neural sequential phrase grounding (seqground). CVPR. Cited by: §1, §1, §2, §4.1, Table 1.
  • [4] S. Gidaris and N. Komodakis (2015) Object detection via a multi-region and semantic segmentation-aware CNN model. In ICCV, Cited by: §2.
  • [5] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, Cited by: §2, §3.2.
  • [6] R. Girshick (2015) Fast r-cnn. In ICCV, Cited by: §2.
  • [7] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In ICCV, Cited by: §2.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §4.1.
  • [9] Y. He, C. Zhu, J. Wang, M. Savvides, and X. Zhang (2019) Bounding box regression with uncertainty for accurate object detection. In CVPR, Cited by: §2.
  • [10] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation. Cited by: §3.1.
  • [11] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell (2016) Natural language object retrieval. In CVPR, Cited by: §1, §1, §2, Table 1.
  • [12] B. Jiang, R. Luo, J. Mao, T. Xiao, and Y. Jiang (2018) Acquisition of localization confidence for accurate object detection. In ECCV, Cited by: §2.
  • [13] A. Karpathy and L. Fei-Fei (2015) Deep visual-semantic alignments for generating image descriptions. In CVPR, Cited by: §1.
  • [14] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. ICLR. Cited by: §4.1, §4.1, §4.1, §4.3.
  • [15] R. Kiros, R. Salakhutdinov, and R. S. Zemel (2014) Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539. Cited by: §4.1.
  • [16] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, et al. (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. IJCV. Cited by: §4.1.
  • [17] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In ICCV, Cited by: §2.
  • [18] M. Bajaj, L. Wang, and L. Sigal (2019) GraphGround: graph-based language grounding. In ICCV, Cited by: §1, §1, §2, §4.1, §4.2, §4.2, Table 1.
  • [19] J. Pennington, R. Socher, and C. Manning (2014) Glove: global vectors for word representation. In EMNLP, Cited by: §4.1.
  • [20] P. O. Pinheiro, T. Lin, R. Collobert, and P. Dollár (2016) Learning to refine object segments. In ECCV, Cited by: §2.
  • [21] B. A. Plummer, P. Kordas, M. Hadi Kiapour, S. Zheng, R. Piramuthu, and S. Lazebnik (2018) Conditional image-text embedding networks. In ECCV, Cited by: §1, §1, §2, §4.2, Table 1.
  • [22] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik (2015) Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, Cited by: OptiBox: Breaking the Limits of Proposals for Visual Grounding, §4.1, §4.2, §4.2, Table 1, §5.
  • [23] R. N. Rajaram, E. Ohn-Bar, and M. M. Trivedi (2016) RefineNet: iterative refinement for accurate object localization. In ITSC, Cited by: §2.
  • [24] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS, Cited by: §2.
  • [25] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese (2019) Generalized intersection over union: a metric and a loss for bounding box regression. In CVPR, Cited by: §2.
  • [26] M. Roh and J. Lee (2017) Refining faster-rcnn for accurate object detection. In MVA, Cited by: §2.
  • [27] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele (2016) Grounding of textual phrases in images by reconstruction. In ECCV, Cited by: OptiBox: Breaking the Limits of Proposals for Visual Grounding, Figure 1, §1, §1, §1, §2, §3.1, §3.1, §4.2, Table 1, Table 3, Table 4, Table 5, §5.
  • [28] A. Sadhu, K. Chen, and R. Nevatia (2019) Zero-shot grounding of objects from natural language queries. In ICCV, Cited by: §2.
  • [29] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. ICLR. Cited by: §4.2.
  • [30] L. Wang, Y. Li, J. Huang, and S. Lazebnik (2018) Learning two-branch neural networks for image-text matching tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2, Table 1.
  • [31] L. Wang, Y. Li, and S. Lazebnik (2016) Learning deep structure-preserving image-text embeddings. In CVPR, Cited by: §2, §4.1, Table 1.
  • [32] F. Xiao, L. Sigal, and Y. Jae Lee (2017) Weakly-supervised visual grounding of phrases with linguistic structures. In CVPR, Cited by: §2.
  • [33] Z. Yang, B. Gong, L. Wang, W. Huang, D. Yu, and J. Luo (2019) A fast and accurate one-stage approach to visual grounding. In ICCV, Cited by: §1, §1, §2, §4.1, §4.2, §4.2, Table 1, Table 3.
  • [34] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei (2016) Visual7w: grounded question answering in images. In CVPR, Cited by: §1.