Dense captioning with joint inference and visual context
Dense captioning is a newly emerging computer vision topic for understanding images with dense language descriptions. The goal is to densely detect visual concepts (e.g., objects, object parts, and interactions between them) from images, labeling each with a short descriptive phrase. We identify two key challenges of dense captioning that need to be properly addressed when tackling the problem. First, dense visual concept annotations in each image are associated with highly overlapping target regions, making accurate localization of each visual concept challenging. Second, the large amount of visual concepts makes it hard to recognize each of them by appearance alone. We propose a new model pipeline based on two novel ideas, joint inference and context fusion, to alleviate these two challenges. We design our model architecture in a methodical manner and thoroughly evaluate the variations in architecture. Our final model, compact and efficient, achieves state-of-the-art accuracy on Visual Genome for dense captioning with a relative gain of 73% compared to the previous best algorithm. Qualitative experiments also reveal the semantic capabilities of our model in dense captioning.READ FULL TEXT VIEW PDF
We introduce the dense captioning task, which requires a computer vision...
In this paper, we address the task of learning novel visual concepts, an...
This notebook paper presents our system in the ActivityNet Dense Caption...
Dense captioning aims at simultaneously localizing semantic regions and
The task of associating images and videos with a natural language descri...
We address the problem of jointly learning vision and language to unders...
Important high-level vision tasks such as human-object interaction, imag...
Dense captioning with joint inference and visual context
The computer vision community has recently witnessed the success of deep neural networks for image captioning, in which a sentence is generated to describe a given image. Challenging as it seems, a list of pioneering approaches    have achieved remarkable success on datasets such as Flicker30k  and MS COCO 35], Meteor , and CIDEr . However, the holistic image descriptions from these datasets are either limited to the salient objects of the images, or tend to broadly depict the entire visual scene. A picture is worth a thousand words, and these holistic image descriptions are far from a complete visual understanding of the image. Furthermore, giving one description for an entire image can sometimes be quite subjective, making the evaluation of captioning often ambiguous.
Recently, Johnson et al.  propose to use a dense description of image regions as a better interpretation of the visual content, known as dense captioning. Human annotators are required to exhaustively label bounding boxes over different levels of visual concepts (e.g., objects, object parts, and interactions between them). Compared to global image descriptions, dense local descriptions are more objective and less affected by annotator preference. The local descriptions provide a rich and dense semantic labeling of the visual elements, which can benefit other tasks such as semantic segmentation  and visual question answering  . For convenience, we refer to image regions associated with annotated visual concepts as regions of interest in the remaining text.
. For each region proposal, the network produces three elements separately: a region-of-interest probability (similar to the detection score in object detection), a phrase to describe the content, and a bounding box offset. The major difference dense captioning has from traditional object detection is that it has an open set of targets (not limited to valid objects), and includes parts of objects and multi-object interactions. Because of this, two types of challenges emerge when predicting region captions.
First, the target bounding boxes become much denser than object detection with limited categories (e.g. 20 categories for PASCAL VOC ). Take the Visual Genome dataset as an example. The statistics of the maximum Intersection-over-Union (IoU) between ground truth bounding boxes can be seen in Fig. 2(a), from which we see more than half of the bounding boxes have maximum IoU larger than 0.3111Note that because a large portion of overlapping bounding boxes refer to the same object and have high IoU ratios, we have merged the bounding boxes with IoU larger than 0.7 together into one.. Fig. 2(b) shows an image overlaid with all ground truth bounding boxes. Here, we can visually see that any region proposal can easily have multiple overlapping regions of interest. Therefore, it is necessary to localize a target region with the guidance of the description.
Second, since there are a huge number of visual concepts being described, some of the target regions are visually ambiguous without information about their context. In Visual Genome, the number of different object categories is , which includes a long list of visually similar object pairs such as “street light” and “pole”, “horse” and “donkey”, and “chair” and “bench”.
Thus, we believe that tackling these two challenges can greatly benefit the task of dense captioning. We carefully design our dense captioning model to address the above two problems by introducing two key components. The first component is joint inference, where pooled features from regions of interest are fed into a recurrent neural network to predict region descriptions, and the localization bounding boxes are jointly predicted from the the pooled features with recurrent inputs from the predicted descriptions. Fig.1(a) shows an example of a step-by-step localization process with joint inference, where the localization bounding box gradually adapts to the correct position using the predicted descriptions. The second component is context fusion, where pooled features from regions of interest are combined with context features to predict better region descriptions. An example is shown in Fig. 1(b), where the object in the red bounding box is described as a desktop without visual cues from the surrounding context. We design several different network structures to implement the two key components respectively, and conduct extensive experiments to explore the benefits and characteristics of each. Our unified model achieves a mean average precision (mAP) accuracy of 9.31% on Visual Genome V1.0, a relative gain of 73% over the previous state-of-the-art approach by . An example image with the most confident region descriptions from our model is shown in Fig. 1(c).
To reiterate, the contributions of this work are two-fold:
We design network structures that incorporate two novel ideas, joint inference and context fusion, to address the challenges we identified in dense captioning.
We conduct an extensive set of experiments to explore the capabilities of the different model structures, and analyze the underlying mechanisms for each. With this, we are able to obtain a compact and effective model with state-of-the-art performance.
Recent image captioning models often utilize a convolutional neural network (CNN) as an image encoder and a recurrent neural network (RNN)  as a decoder for predicting a sentence   . RNNs have been widely used in language modeling    . Some image captioning approaches, though targeted at a global description, also build relationships with local visual elements. Karpathy et al.   learn an embedding with a latent alignment between image regions and word phrases. Fang et al.  first detect words from images using multiple instance learning, then incorporate the words in a maximum entropy language model. A soft-attention mechanism is also proposed to cast attention over different image regions when predicting each word  .
Recent object detection algorithms based on deep learning often show a two-stage paradigm: region proposal and detection  . Faster R-CNN  is the most related to our work, as it utilizes a Region Proposal Network (RPN) to generate region proposals and a detection network to predict object categories and bounding box offsets. The two networks can share convolutional features and can be trained with an approximate fast joint training algorithm. A recent improvement to faster R-CNN is the incorporation of context information using a four-direction RNN on the convolutional feature map . Visual context can greatly help tasks such as object detection    and semantic segmentation . Another direction is to remove the RPN and directly produce detection results   to further speed up the algorithm.
The task of dense captioning was first proposed in 
, in which a spatial transformer network is used to facilitate joint training of the whole network. A related application is also proposed to detect an arbitrary phrase in images using the dense captioning model. The experiments are conducted on the Visual Genome dataset , which provides not only region descriptions but also objects, attributes, question answering pairs, etc. Also closely related are other recent topics such as localizing a phrase in a specific image   , generating an unambiguous description for a specific region in an image  , or detecting visual relationships in images  .
Dense captioning is similar to object detection in that it also needs to localize the regions of interest in an image, but differs in that it replaces the fixed number of object categories with a much larger set of visual concepts described by phrases. Therefore, we can borrow successful recipes from the object detection literature in designing our dense captioning algorithm. In this work, our dense captioning model pipeline is inspired by the efficient faster R-CNN framework . Fig. 3
illustrates our dense captioning framework, which includes a region detection network adopted from faster R-CNN and a localization and captioning network. In this section, we will design different localization and captioning network architectures step by step in searching for the right formula. Our baseline model directly combines the faster R-CNN framework for region detection and long short-term memory (LSTM) for captioning.
Faster R-CNN 
uses a two-stage neural network to detect objects based on the image feature maps, which are generated by a fully convolutional neural network. In the first stage, the network uses a RPN to generate region proposals that are highly likely to be the regions of interest, then it generates fixed-length feature vectors for each region proposal using Region-Of-Interest (ROI) pooling layers. In the second stage, the feature vectors are fed into another network to predict object categories as well as the bounding box offsets. Since the gradients cannot be propagated through the proposal coordinates, exact joint training is not viable for faster R-CNN. Instead, it can be trained by alternatively updating parameters with gradients from the RPN and the final prediction network, or by approximate joint training which updates the parameters with gradients from the two parts jointly.
Our baseline model for dense captioning directly uses the proposal detection network from faster R-CNN in the first stage. For the second stage of localization and captioning, we use the model structure in Fig. 5(a), where the region features are used to produce detection scores and bounding box offsets, as well as fed into an LSTM to generate region descriptions. The LSTM predicts a word at each time step and uses this prediction to predict the next word. Fig. 4 shows an example of using such a recurrent process to generate descriptions. We use the structure of VGG-16  for the convolutional layers, which generates feature maps smaller than the input image. Following faster R-CNN 
, pretrained weights from the ImageNet Classification challenge are used. Also following previous work    , the region feature is only fed into the LSTM at the first time step, followed by a special start-of-sentence token, and then by the embedded feature vectors of the predicted words one by one. This model is similar to the model in 
except that their model replaces the ROI pooling layer with a bilinear interpolation module so that gradients can be propagated through bounding box coordinates. In contrast, our baseline model uses approximate joint training that is proven to be effective for object detection and instance-level semantic segmentation  . In our experiments, we observe that the baseline model with approximate joint training is very effective and already outperforms the previous state-of-the-art method . A potential reason is that although bilinear interpolation allows for exact end-to-end training, the model may be harder to train due to the transformation properties of the gradients.
In this section, we explore the model design for joint inference of bounding box localization and region description for a given region proposal. Due to the large number of open phrases and dense bounding boxes, we find it is necessary to combine the two in order to improve both localization and captioning. We fix the first stage of the proposal detection network in Fig. 3 to be the same as our baseline model, and focus on designing a joint localization and captioning network for the second stage.
To make the predictor of the bounding box offset aware of the semantic information in the associated region, we make the bounding box offset an output of an LSTM encoded with region descriptions. Several designs are shown in Fig. 5. Shared-LSTM (S-LSTM) (Fig. 5(b)) directly uses the existing LSTM to predict the bounding box offset at the last time step of the caption. This model embeds the captioning model and the location information in the same hidden space. Shared-Concatenation-LSTM (SC-LSTM) (Fig. 5(c)) concatenates the output of the LSTM and region features to predict the offset, so the prediction of the offset is directly guided by the region features. Twin-LSTM (T-LSTM) (Fig. 5(d)) uses two LSTMs to predict the bounding box offset and description separately. This model separates the embedded hidden spaces of the captioning model and the location predictor. The two LSTMs are denoted as location-LSTM and caption-LSTM, and both receive the embedded representation of the last predicted word as input. In all three models, the bounding box offset is predicted at the last time step of the description, when the “next word” is an end-of-sentence token and the description is finished. Thus the network obtains information about the whole description at the time of predicting the bounding box offset.
Visual context is important for understanding a local region in an image, where it has already shown to benefit tasks such as object detection and semantic segmentation   . Despite the exploration of context features in these tasks, there is limited work on the integration of context features into sequential prediction tasks such as image captioning. We concentrate on finding the optimal way to combine context features and local features in the sequential prediction task of dense captioning, rather than investigating better representations of context information. Thus, we resort to a simple but effective implementation of context features, which utilizes a global ROI pooling feature vector as the visual context. Since the bounding box offset is not directly connected to the context feature, we only use the context feature to assist in caption prediction, which in turn will influence localization through joint inference as discussed in the previous section.
In this work, we experiment with two variants of combining local features and context features, which are shown in Fig. 6 and termed as early-fusion and late-fusion. Early-fusion (Fig. 6(a)) directly combines the region feature and context feature together before feeding into the LSTM, while late-fusion (Fig.6(b)) uses an extra LSTM to generate a recurrent representation of the context feature, and then combines it with the local feature. The context feature representation is combined with the region feature representation via a fusion operator for both variants. We experimented with concatenation, summation, and multiplication. After each word is selected, its embedded representation is fed back into the caption-LSTM to guide the generation of the next word. Such fusion designs can be easily integrated with any of the models in Fig. 5.
The aforementioned model structures of joint inference and context fusion can be easily plugged together to produce an integrated model. For example, the integration of T-LSTM and the late-fusion context model can be viewed in Fig. 7. Note that a single word is predicted at each time step and the bounding box offset is predicted at the last time step of the caption, after all words have been encoded into the location-LSTM. Different integrated models are different instantiations of the model pipeline we show in Fig. 3.
Finally, training our dense captioning model boils down to minimizing the following loss function,
where, , and denote caption prediction loss, detection loss and bounding box regression loss, respectively, with and the weighting coefficients. is a cross-entropy term for word prediction at each time step of the sequential model, is a two-class cross-entropy loss for foreground / background regions, while is a smoothed-L1 loss . and are computed both in the region proposal network and the final prediction. For those models using an LSTM for predicting bounding box offset, the second is calculated at the last time-step of the LSTM output.
We use the Visual Genome dataset  as the evaluation benchmark. Visual Genome has two versions: V1.0 and V1.2. V1.2 is a cleaner version of V1.0, while V1.0 is used by . For comparison purposes, we conduct experiments mainly on V1.0, and report additional results on V1.2. We use the same train/val/test splits as in  for both V1.0 and V1.2, i.e., 77398 images for training and 5000 images each for validation and test. We use the same evaluation metric of mean Average Precision (mAP) as , which measures localization and description accuracy jointly. Average precision is computed for different IoU thresholds for localization accuracy, and different Meteor  score thresholds for language similarity, then averaged to produce the mAP score. For localization, IoU thresholds .3, .4, .5, .6, .7 are used. For language similarity, Meteor score thresholds 0, .05, .1, .15, .2, .25 are used. A comparison of our final model using the structure in Fig. 7 with the previous best result can be seen in Tab. 1, which shows that we achieve a relative gain compared to the previous best method. In the following sections, we first introduce the training and evaluation details, then evaluate and compare the joint inference models and integrated models under different structure designs. The influence of hyper-parameters in evaluation is also explored.
|Visual Genome V1.0||V1.2|
|Model||Johnson et al. ||Ours||Gain||Ours|
|model||Johnson et al. ||baseline||S-LSTM||SC-LSTM||T-LSTM|
In training, we use approximate joint training for all models. We use stochastic gradient descent with a mini-batch size of 1 to train the whole network. The input image is re-sized to have a longer side ofpixels. Initial learning rate is set to and halved every K iterations, and momentum is set to . Weight decay is not used in training. We begin fine-tuning the CNN layers after K iterations (
3 epochs) and finish training afterK iterations (9 epochs). The first seven convolutional layers are fixed for efficiency, with the other convolutional layers fine-tuned. We found that training models with context fusion from scratch tends not to converge well, so we fine-tune these models from their non-context counterparts, with a total of K training iterations. We only use descriptions with no more than 10 words for efficiency. We use the most frequent words as the vocabulary and replace other words with an UNK tag. For sequential modeling, We use an LSTM with hidden nodes. For the RPN, we use anchor boxes for generating the anchor positions in each cell of the feature map, and boxes are sampled in each forward pass of training. For the loss function, we fix values of , in Eq. (1) to and , respectively.
In evaluation, we follow the settings of  for fair comparison. First, boxes with the highest predicted confidence after non-maximum suppression (NMS) with IoU ratio are generated. Then, the corresponding region features are fed into the second stage of the network, which produces detection scores, bounding boxes, and region descriptions. We use efficient beam-1 search to produce region descriptions, where the word with the highest probability is selected at each time step. With another round of NMS with IoU ratio , the remaining regions and their descriptions are used as the final results.
We evaluate the baseline and three joint inference models in this section. All models are trained end-to-end with the convolutional layers and the RPN. To further clarify the effect of different model designs, we also conduct experiments to evaluate the performance of the models based on the same region proposals and image features. Towards this end, we fix the weights of the CNN to those of VGG16 and use a hold-out region proposal network also trained on Visual Genome based on the fixed CNN weights. The results of the end-to-end trained models and the fixed-CNN&RPN models are shown in Tab. 2.
T-LSTM performs best for joint inference. Among the three different structures for joint inference, T-LSTM has the best performance for both end-to-end training (mAP 8.03), and fixed-CNN&RPN training (mAP 5.64). The end-to-end model of T-LSTM outperforms the baseline model by more than 1% in mAP, while the others are even worse than the baseline model. By using a shared LSTM to predict both the caption and bounding box offset, S-LSTM unifies the language representation and the target location information into a single hidden space, which is quite challenging since they are from completely different domains. Even assisted by the original region feature, the shared LSTM solution does not show much improvement, only on par with the baseline (mAP 6.83). By separating the hidden space, i.e. using two LSTMs targeted at the two tasks respectively, the T-LSTM model yields much better performance (mAP 8.03 vs 6.47). Compared with the baseline model, T-LSTM is better at both localization and captioning. Fig. 8 shows several example predictions of bounding box and captions from one region proposal for the baseline model and the T-LSTM model. Fig. 8(a) (b) shows that T-LSTM improves on localization thanks to the guidance of the encoded caption information, while Fig. 8(c) (d) shows that T-LSTM is also better at predicting the descriptions, which reveals that location information helps to improve captioning. Although bounding box prediction does not feed information to the captioning process in the forward pass, it does influence captioning through back-propagation in the training stage. Considering all these factors, we believe that separating the hidden space using T-LSTM is most suitable for the joint inference of caption and location.
We evaluate the integrated models with different designs for both joint inference and context fusion in this section. For joint inference models, we evaluate three variants: S-LSTM, SC-LSTM, and T-LSTM. For context fusion, we compare the different settings proposed in Section 3.3, where we evaluate early-fusion and late-fusion with different fusion operators: concatenation, summation, and multiplication. For early-fusion with concatenation, we plug in a fully-connected layer after the concatenated feature to reduce it to the same input dimension as the LSTM. The mAP results of different variants of models are shown in Tab. 3.
Effectiveness of context fusion. In all models, context information helps to improve mAP ranging from 0.07 (S-LSTM, early-fusion, summation) to 1.10 (S-LSTM, late-fusion, multiplication). The three types of fusion methods all yield improvements in mAP for different models. Generally, concatenation and multiplication are more effective than summation, but the margin is subtle. With T-LSTM and late-fusion with multiplication, we obtain the best mAP performance 8.60 in this set of experiments. We refer to this model as T-LSTM-mult for brevity in the remaining text. Fig. 9 shows example predictions for comparison of T-LSTM without context fusion and T-LSTM-mult. In Fig. 9(a) (b) (c), T-LSTM-mult gives a better caption than the model without context. Without context, these objects are very hard to recognize even by humans. We can also observe from these examples that the context information employed by the model is not limited to the surrounding part of the region proposal, but from the whole image. In Fig. 9(d), the context model interestingly gives an incorrect but reasonable prediction since it is misled by the context which is full of sheep.
Late-fusion is better than early-fusion. Comparing early-fusion and late-fusion of context information, we find that late-fusion is better than early-fusion for all pairs of corresponding models. Also, early fusion only outperforms its no-context counterparts by a small margin. One disadvantage of early-fusion is that it directly combines the local and context features that have quite differing visual elements, making it unlikely able to decorrelate the visual element into the local region or the context region in the later stages of the model.
Intermediate location predictions. Since we only add the regression target to the last time step of the location-LSTM, it is not clear what the bounding box predictions from the previous time steps will be like. We test the bounding box predictions from these time steps, and find them to be fairly good. Fig. 10 shows examples of the predicted bounding box location at different time steps for the T-LSTM-mult model. Generally, the bounding box prediction at the first time step is already close to the region of interest. As words are fed into the location-LSTM, it gradually adjusts the bounding box to a tight localization of the object being described.
We also conduct experiments on Visual Genome V1.2 using the same train/val/test split as V1.0. The mAP performances are shown in Tab. 4. Here, we see similar results as on V1.0, which further verifies the advantage of T-LSTM over S-LSTM (mAP 8.16 vs 6.44 for no-context), and that context fusion greatly improves performance for both models. For context fusion, we can see that the T-LSTM model with late concatenation achieves the best result with mAP . We refer to this model as T-LSTM-concat. Overall, the accuracies are higher than those on Visual Genome V1.0, likely due to the cleaner ground truth labels.
The evaluation pipeline for dense captioning, a two-stage process of target prediction (region proposal and region description along with location refinement), involves several hyper-parameters that can influence the accuracy. These parameters include the number of proposals given by the RPN and the IoU ratio used by NMS both in the RPN and the final prediction. For these parameters, we use the same settings as  for all evaluations above. However, we are also interested in the impact of these parameters on our results. Parameters such as number of proposals is worth investigating because it can be used to find a trade-off between speed and performance. Also, the NMS thresholds used by  seem to overly suppress the predicted bounding box, especially since the ground truth regions are very dense (Fig. 2).
We use T-LSTM-mult for Visual Genome V1.0 and T-LSTM-concat for V1.2 as prototypes and find the best hyper-parameters for each by validating on the validation set. For the number of proposals given by the RPN, we validate between 100 and 300 proposals. We also validate to find the optimal IoU ratios used in the NMS thresholds for RPN and for final prediction, denoted as NMS_r1 and NMS_r2, respectively. NMS_r1 is chosen from the range , and NMS_r2 is chosen from the range . The results and corresponding optimal hyper-parameter settings are shown in Tab. 5.
With the validated hyper-parameters, we achieve even better mAP performance with on Visual Genome V1.0 and on Visual Genome V1.2 using proposals, which sets the new state-of-the-art. With only proposals, our model achieves mAP on Visual Genome V1.0 and on Visual Genome V1.2. Comparing the running times, we find that a image takes ms and ms for and proposals on a GTX TITAN GPU, respectively. The LSTM computations take around
of the total time consumption. Our implementation is developed using Caffe.
In this work, we have proposed a novel model structure which incorporates two ideas, joint inference and context fusion, to address specific challenges in dense captioning. To find an exact model realization incorporating these two approaches, we design our model step by step and propose different variants for each component. We evaluate the different models extensively, and gain intuitions on the effectiveness of each component and its variants. Finally, we find a model which utilizes the two approaches effectively and achieves state-of-the-art performance on the Visual Genome dataset. The feature representation learned by these models can potentially benefit other computer vision tasks requiring dense visual understanding such as object detection, semantic segmentation, and caption localization. The extensive comparison of different model structures we conducted can hopefully help guide model design in other tasks involving sequential modeling.
Ask your neurons: A neural-based approach to answering questions about images.In ICCV, 2015.
Generalization of backpropagation with application to a recurrent gas market model.Neural Networks, 1(4):339–356, 1988.