Decoupled Box Proposal and Featurization with Ultrafine-Grained Semantic Labels Improve Image Captioning and Visual Question Answering

09/04/2019, by Soravit Changpinyo, et al. (Google)

Object detection plays an important role in current solutions to vision and language tasks like image captioning and visual question answering. However, popular models like Faster R-CNN rely on a costly process of annotating ground-truths for both the bounding boxes and their corresponding semantic labels, making it less amenable as a primitive task for transfer learning. In this paper, we examine the effect of decoupling box proposal and featurization for down-stream tasks. The key insight is that this allows us to leverage a large amount of labeled annotations that were previously unavailable for standard object detection benchmarks. Empirically, we demonstrate that this leads to effective transfer learning and improved image captioning and visual question answering models, as measured on publicly available benchmarks.


1 Introduction

Object detection has been employed extensively as a primitive task for vision and language tasks such as image captioning and visual question answering (VQA); see Anderson et al. (2018) and the work that follows it. One motivation is that the ability to recognize salient regions and objects may be too difficult to learn from weakly-supervised top-down signals, in the form of captions and question-answer pairs. Indeed, bottom-up signals provided by object detection often correspond to semantic units of language such as words or phrases, making them suitable for text generation and image-text alignment.

However, object detection itself can be broken down into multiple subtasks Liu et al. (2018). A family of “two-stage” object detectors first proposes category-agnostic bounding box candidates and then featurizes and classifies the cropped regions into one of the available semantic labels. Even “one-stage” object detection approaches, where these boxes become category-specific, can be formulated in a bottom-up manner as detecting and grouping extreme and center points Zhou et al. (2019). Can we take advantage of this observation to learn to transfer more effectively? In this work, we take a step in this direction by examining the effect of decoupling box proposal and featurization on downstream vision and language tasks. In particular, we consider a two-stage object detector and set a goal of pushing the “featurization” aspect of the task further than before.

Figure 1: Ultrafine-grained semantic labels (at “instance level”) provide transfer learning power to downstream tasks like visual question answering.

Our choice to break free from “featurization by object detection models” has at least two advantages. First, there is a larger amount of labeled data that can be leveraged to train a better featurization module, even if such data do not support learning box proposals. To put it another way, the quality of features directly provided by object detectors is limited by the fact that annotating ground-truths for both the bounding boxes and their corresponding semantic labels is costly and scales poorly. By separating them, we reintroduce the freedom to annotate for object-agnostic box segmentation, without the burden of baking in annotation decisions related to the granularity level of the semantic labels (i.e., do we use as semantic labels “money”, “euro”, or “20 euro”?). As illustrated in Figure 1, the granularity level of the semantic labels plays a crucial role for downstream tasks such as VQA.

Second, this approach is better suited to downstream tasks whose domains are different from the one the object detector is trained on. In other words, it allows us to benefit from transfer learning, which is a great advantage given the relatively modest amount of available supervised data for these downstream vision and language tasks.

We empirically demonstrate the above-mentioned advantages through a focused study of the effect of improved featurization on image captioning and VQA in transfer learning settings. In particular, we (i) leverage ultrafine-grained semantic labels (e.g., “golden gate bridge” vs. “bridge”) for featurization Juan et al. (2019); and (ii) focus on scenarios in which object detection modules trained on Visual Genome (VG) Krishna et al. (2017) are applied to out-of-domain images: image captioning on the Conceptual Captions dataset Sharma et al. (2018), and VQA on the VizWiz dataset Gurari et al. (2018). Our results indicate that there are ways to incorporate low-level pre-training tasks that benefit vision and language models via higher-quality bottom-up signals.

2 Related Work

Attention-based deep models are popular in image captioning and VQA. Early work used fixed partitions of images as candidate regions Xu et al. (2015). However, variable-sized regions that are better correlated with object boundaries have gained momentum Fu et al. (2017); Pedersoli et al. (2017); Anderson et al. (2018). Indeed, Anderson et al. (2018) established new state-of-the-art performance on both the image captioning and VQA tasks, on the MSCOCO and VQA2 benchmarks respectively, using a Faster R-CNN detector trained on Visual Genome. As both Visual Genome and VQA2 were built on images from MSCOCO, the object detector was applied largely to in-domain images. In contrast, our work focuses on more realistic settings in which the domains of different tasks may not be perfectly aligned Chen et al. (2018).

We leverage image representations extracted from a network pre-trained over large amounts of labeled data. Prior work demonstrated the power of pre-training with image classification at scale Sun et al. (2017); Mahajan et al. (2018); Wu et al. (2019). However, we consider downstream vision and language tasks (image captioning and visual question answering), in contrast to the less complex vision-only tasks explored in such work: object detection and, in some cases, semantic segmentation and human pose estimation. Furthermore, our transfer learning technique is based on decoupled region proposal and ultrafine-grained featurization, not on fine-tuning the pre-trained network.

Another set of closely related work utilized additional data for scaling up either vision tasks Hoffman et al. (2016); Tang et al. (2017); Redmon and Farhadi (2017) or vision and language tasks Venugopalan et al. (2017); Lu et al. (2018); Noh et al. (2019). For instance, YOLO9000 Redmon and Farhadi (2017) built a “WordTree” hierarchy based on the WordNet synsets Miller et al. (1990), mapped categories in both the COCO object detection and ImageNet classification datasets into the hierarchy, and proposed a joint detection and classification training framework. Our approach to transfer learning with ultrafine-grained featurization can similarly address the long-tail nature of the target vocabulary (see Figure 2) while being simpler (e.g., not requiring the careful merging of different sets of vocabulary as in YOLO9000). The number of classes we consider is also several orders of magnitude larger.

Incorporating object detection signals appropriately into downstream tasks is non-trivial and an active subject of research Santoro et al. (2017); Zhang et al. (2018). In this work, we ask the orthogonal question of whether it is necessary to accept the object detector’s output as-is.

3 Features and Experimental Setup

Our starting point is a two-stage object detector, which consists of two core modules. One is responsible for category-agnostic box proposal, and the other for featurizing each cropped region for semantic label prediction. In this paper, we select Faster R-CNN Ren et al. (2015b), a widely-used object detector in image captioning and VQA.

Faster R-CNN Model

We reimplement the Faster R-CNN model, training it to predict both 1,600 object and 400 attribute labels in Visual Genome Krishna et al. (2017), following the standard setting from Anderson et al. (2018). ResNet-101 He et al. (2016) pre-trained on ImageNet Russakovsky et al. (2015) is used as the core featurization network (see further details in the supplementary material). We achieve an mAP@50 of 10.96 for object detection and 1.5 for attribute detection. Given an image, Faster R-CNN proposes n bounding box regions, each of which comes with a d-dimensional feature vector as well as object/attribute class predictions (along with their scores); n is set to 100 and d to 2048 in our experiments. Using the output features on the task of VQA and our model described in Section 5, we obtain an accuracy of 66.9% on the validation set of the VQA2 dataset Goyal et al. (2017). For comparison, this number already surpasses all validation accuracy numbers in Table 2 of Peng et al. (2018), a strong model, suggesting that our Faster R-CNN features are of high quality.

Decoupled Box Proposal and Featurization with Ultrafine-Grained Semantic Labels

In the standard use of object detectors following Anderson et al. (2018), downstream tasks receive “knowledge” merely about a few thousand classes and four hundred attributes. Here, we exploit the fact that box proposal and featurization can be decoupled, and work on improving the object representation (featurization).

More concretely, we conduct a study toward understanding the utility of improved featurization on downstream tasks. To this end, we exploit a graph-based, semi-supervised representation learning approach called Graph-Regularized Image Semantic Embedding (Graph-RISE) Juan et al. (2019). Specifically, Graph-RISE is based on ResNet-101, where the 10x10x2K feature map is first average-pooled to 4x4x2K, and then flattened and projected to a 64-dimensional embedding before the softmax layer. Learned from approximately 260M web images and 40M (noisy) semantic labels, these compact 64-dimensional feature vectors are trained to capture a whole spectrum of semantic similarity, ranging from coarse-grained / category-level (e.g., “bridge”), to fine-grained level (e.g., “steel red bridge”), to ultrafine-grained / instance-level (e.g., “golden gate bridge”).
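To make the featurization head concrete, the sketch below reproduces the pooling-and-projection step described above in tf.keras; the exact pool size/stride and the plain Dense layer are our assumptions rather than the released Graph-RISE code.

    import tensorflow as tf

    def graph_rise_head(feature_map):
        """Sketch: feature_map is a [batch, 10, 10, 2048] ResNet-101 activation map;
        returns a [batch, 64] embedding used before the softmax layer."""
        # Average-pool 10x10 -> 4x4 (pool_size=4, stride=2 is one way to obtain this shape).
        x = tf.keras.layers.AveragePooling2D(pool_size=4, strides=2, padding="valid")(feature_map)
        x = tf.keras.layers.Flatten()(x)      # [batch, 4*4*2048]
        return tf.keras.layers.Dense(64)(x)   # compact 64-dimensional embedding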

Our Objective

The main goal is to compare two approaches to using bottom-up signals: 1) FRCNN: use the default visual features from the Faster R-CNN detector; 2) Ultra: use the bounding boxes from the Faster R-CNN detector, then featurize them using the much more compact representation from Graph-RISE, which potentially reflects the differentiation of ultrafine-grained semantic labels. We evaluate this setup on the downstream tasks of image captioning and visual question answering.
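As a rough illustration of the Ultra path, the sketch below wires category-agnostic boxes from the detector into an external featurizer. Here, detector and graph_rise_embed are hypothetical stand-ins for the two pre-trained modules; only the data flow is meant to be accurate.

    import tensorflow as tf

    def featurize_regions_ultra(image, boxes, graph_rise_embed):
        """image: [H, W, 3] float tensor; boxes: [n, 4] normalized (y1, x1, y2, x2) proposals
        from the Faster R-CNN detector; graph_rise_embed maps image crops to 64-D vectors."""
        crops = tf.image.crop_and_resize(
            image[tf.newaxis],                                    # add a batch dimension
            boxes,
            box_indices=tf.zeros([tf.shape(boxes)[0]], tf.int32),
            crop_size=(224, 224))                                 # each region resized to 224x224
        return graph_rise_embed(crops)                            # [n, 64] Ultra features

    def featurize_regions_frcnn(image, detector):
        """Baseline path: boxes, 2048-D features, and scores come straight from the detector."""
        boxes, features, scores = detector(image)
        return boxes, features, scores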

4 Image Captioning

Dataset

We use the Conceptual Captions (CC) dataset Sharma et al. (2018), consisting of 3.3 million training and 15,000 validation image/caption pairs. Another 12,000 image/caption pairs comprise the hidden test set. Official scores on the test set are obtained by submitting models to the CC Challenge server (ai.google.com/research/ConceptualCaptions). Unlike other image captioning datasets, images from CC are pulled from across the web and thus exhibit a wide variety of both image and image-caption styles. Most notably, the domain of the images can be very different from Visual Genome, unlike in popular benchmarks such as MSCOCO Lin et al. (2014).

Model

We adopt the encoder-decoder model from Sharma et al. (2018), whose basic building block is a Transformer network Vaswani et al. (2017). To convert multi-modal inputs to a sequence of encoder feature vectors, we use up to three types of image features:

  • G: Global features by Graph-RISE, a dense 64D vector extracted from the whole image;
  • B: Box-region features by Faster R-CNN (FRCNN, sparse 2048D) or Graph-RISE (Ultra, dense 64D), extracted from each cropped image region resized to 224x224 (cf. Sec. 3);
  • L: Label embeddings, obtained by embedding predicted object semantic labels from the Google Cloud Vision APIs (cloud.google.com/vision) into a 512D feature vector. These semantic labels are mapped to embeddings using an embedding layer pre-trained to predict label co-occurrences in web documents using a word2vec model Mikolov et al. (2013).

For both B and L, we select the inputs with the highest scores and order the sequence inputs by these scores from high to low. Additionally, for B, we remove box regions whose scores are lower than 0.001. We use beam search with width 5 for the decoder in all of our experiments (see further details in the supplementary material).
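The selection and ordering step above amounts to a simple sort-and-threshold; a minimal sketch (with an illustrative threshold argument) follows.

    import numpy as np

    def select_sequence_inputs(features, scores, score_threshold=0.001):
        """features: [n, d] per-box (or per-label) features; scores: [n] prediction scores.
        Returns features sorted by score from high to low, dropping low-confidence entries."""
        order = np.argsort(-scores)               # highest-scoring inputs first
        keep = scores[order] >= score_threshold   # e.g., drop box regions scored below 0.001
        return features[order][keep]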

Metrics

We adopt the standard automatic metrics for image captioning: CIDEr Vedantam et al. (2015), ROUGE-L Lin and Och (2004), and SPICE Anderson et al. (2016), as implemented in the COCO-caption evaluation toolkit (https://github.com/tylin/coco-caption).
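For reference, the toolkit can be driven roughly as follows; the import path and dictionary format follow the tylin/coco-caption repository, and the captions shown are made up. Scores on such a tiny toy corpus are not meaningful, since CIDEr uses corpus-level statistics.

    from pycocoevalcap.cider.cider import Cider

    # Reference and generated captions, keyed by image id (toy example).
    gts = {"img_0": ["a dog runs across a grassy field"]}
    res = {"img_0": ["a dog running on the grass"]}

    cider_score, per_image_scores = Cider().compute_score(gts, res)
    print("CIDEr:", cider_score)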

Figure 2: Qualitative results from our image captioning models using B-FRCNN vs. B-Ultra (see text for details), along with ground-truth captions. Ultra is more capable than FRCNN of dealing with images containing unfamiliar objects, i.e., those that do not fall neatly into the domain on which the Faster R-CNN object detector is trained.

                      dev CIDEr | test CIDEr  ROUGE-L  SPICE
    Transf-Baseline       -     |   0.772      0.244   0.172
    TTI-BIC (single)      -     |   0.980      0.266   0.186
    G (Base)            0.868   |     -          -       -
    B-FRCNN             0.667   |     -          -       -
    B-Ultra             0.873   |     -          -       -
    L                   0.606   |     -          -       -
    G + B-FRCNN         0.871   |     -          -       -
    G + B-Ultra         0.912   |     -          -       -
    G + L               0.888   |     -          -       -
    G + B-FRCNN + L     0.892   |   0.944      0.261   0.190
    G + B-Ultra + L     0.937   |   0.984      0.265   0.195

Table 1: Automatic metric scores for the image captioning task on Conceptual Captions. The “dev” column reports CIDEr on the dev set; the “test” columns report CIDEr, ROUGE-L, and SPICE on the hidden test set. Ablation results are reported for our model using different sets of visual features. The top two baselines are from the Conceptual Captions Leaderboard as of August 30, 2019.

Results

We report results on both the dev and test sets of Conceptual Captions in Table 1. “Base” uses the G feature only. We first compare the Base G against each of the other feature types (B-FRCNN, B-Ultra, and L) on its own. We then perform ablations in which B features (FRCNN or Ultra) are added to the Base G model or to the stronger G + L model.

According to the dev CIDEr scores, the global and box Graph-RISE features (G and B-Ultra) are individually clearly stronger than the Faster R-CNN box features (B-FRCNN) or the label embeddings (L). Nevertheless, these features are considerably complementary. Specifically, the box features B-Ultra complement the Base G, pushing the score from 0.868 to 0.912. It is also worth noting that, despite their low individual scores, B-FRCNN and L each improve every model they are added to.

Our models with Ultra features clearly outperform the ones with FRCNN. This is demonstrated in three conditions: when they are used on their own, when they are added to the simple G model, and when they are added to the stronger G + L model. Manual inspection of the models’ predictions further supports this; a qualitative comparison of B-Ultra vs. B-FRCNN in Figure 2 suggests that ultrafine-grained featurization leads to an improved correspondence between visual inputs and caption tokens for unfamiliar objects (such as “monks” and “staircase”).

To obtain test scores, we submit our best model using FRCNN and our best model using Ultra (based on dev CIDEr) to the CC Challenge server. Test scores for the other models were not obtained due to the limited number of submissions allowed per time period. As of August 30, 2019, the G + B-Ultra + L model outperforms all other single-model baselines (ai.google.com/research/ConceptualCaptions/leaderboard) on both CIDEr and SPICE (and ties on ROUGE-L).

5 Visual Question Answering

Dataset

We use the recently proposed VizWiz dataset Gurari et al. (2018), in which both images and questions originate from visually impaired or blind people. It consists of 20,000/3,173 (image, question, answers) triplets in the train/val splits, and an additional 8,000 triplets in the test split. Each question is independently annotated with 10 answers. We choose the VizWiz benchmark specifically because it is more suitable for measuring transfer learning effects. Other VQA datasets, including VQA1.0 Antol et al. (2015), VQA2.0 Goyal et al. (2017), Visual7W Zhu et al. (2016), COCOQA Ren et al. (2015a), and GQA Hudson and Manning (2019), are completely or partly based on MSCOCO or Visual Genome. As such, they may not provide unbiased grounds for measuring the impact of object-detection features based on Visual Genome versus alternative featurization techniques.

Model

We follow the setting described in Pythia v0.1 Jiang et al. (2018), the winning entry to the VQA Challenge 2018. In particular, the architecture is a simplified “up-down” model from Anderson et al. (2018) (see further details in the supplementary material). The featurization of the bounding boxes follows the description from Section 4. For the base condition, we use the box features based on Faster R-CNN (B-FRCNN), following the majority of previous work. For the test condition, we replace them with the Ultra-based features (B-Ultra).

Metrics

As commonly done in previous work Antol et al. (2015), we use as our accuracy metric the average score over all size-9 subsets of the 10 ground-truth answers, where each score is computed by the formula min(# humans that provided that answer / 3, 1). Accuracy on the test-dev and test-standard splits is obtained by submitting the models to the VizWiz Challenge server (evalai.cloudcv.org/web/challenges/challenge-page/102/overview).
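Concretely, the per-question score can be computed as in the following sketch, which mirrors the standard VQA evaluation; answers are assumed to be already normalized strings.

    def vqa_accuracy(prediction, gt_answers):
        """prediction: a predicted answer string; gt_answers: the 10 human answers for one question."""
        subset_scores = []
        for i in range(len(gt_answers)):
            others = gt_answers[:i] + gt_answers[i + 1:]           # leave one annotator out (9 answers)
            matches = sum(answer == prediction for answer in others)
            subset_scores.append(min(matches / 3.0, 1.0))          # min(# humans giving the answer / 3, 1)
        return sum(subset_scores) / len(subset_scores)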

                    all    y/n    num   unans  other
    VizWiz          46.9   59.6   21.0   80.5   27.3
    BAN             51.6   68.1   17.9   85.3   31.5
    Ours (FRCNN)    51.9   66.7   24.3   85.0   32.1
    Ours (Ultra)    53.7   68.1   28.8   84.0   35.4

Table 2: Accuracy (%) on the test-standard split for the VQA task on the VizWiz dataset. Additionally, we provide accuracy per answer type: yes/no (y/n), number (num), unanswerable (unans), and the rest (other). The baselines include VizWiz Gurari et al. (2018) and BAN Kim et al. (2018).

Results

We report results on the VizWiz benchmark in Table 2. Our model with FRCNN features provides a strong baseline, slightly outperforming the previous best model, BAN Kim et al. (2018), a different architecture that also uses FRCNN-based features for object bounding boxes. The model using Ultra features improves upon this further; at 53.7%, it outperforms the one using FRCNN by a significant margin (+1.8% accuracy on “all” question types). Moreover, this 1.8% improvement is a weighted average across answer types; the per-answer-type numbers indicate that our approach achieves even larger gains on two of the more difficult answer types, “number” (+4.5%) and “other” (+3.3%). These improvements are illustrated by the examples provided in Figure 1.

This illustrates the effectiveness of decoupling bounding box proposal and featurization, and quantifies the impact of transfer learning via large amounts of training data and ultrafine-grained semantic labels for object representations.

6 Conclusion

In this work, we propose to (re)decouple box proposal and featurization. We show that this allows us to leverage additional signals and annotations, leading to more effective transfer learning for downstream vision and language tasks: image captioning and visual question answering. This result suggests that large-scale datasets with fine-grained image-level semantic labels, even when they do not dissect complex visual scenes, can benefit current state-of-the-art models – especially when applied to benchmarks where images are from diverse domains.

References

  • M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Note: Software available from tensorflow.org External Links: Link Cited by: Appendix A, Appendix B, Appendix C.
  • P. Anderson, B. Fernando, M. Johnson, and S. Gould (2016) SPICE: semantic propositional image caption evaluation. In ECCV, Cited by: §4.
  • P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of CVPR, Cited by: Appendix C, §1, §2, §5.
  • S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh (2015) VQA: visual question answering. In Proceedings of ICCV, Cited by: §5, §5.
  • Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool (2018) Domain adaptive Faster R-CNN for object detection in the wild. In Proceedings of CVPR, Cited by: §2.
  • K. Fu, J. Jin, R. Cui, F. Sha, and C. Zhang (2017) Aligning where to see and what to tell: image captioning with region-based attention and scene-specific contexts. TPAMI 39 (12), pp. 2321–2334. Cited by: §2.
  • G. Ghiasi, T. Lin, and Q. V. Le (2018) DropBlock: a regularization method for convolutional networks. In Proceedings of NeurIPS, Cited by: Appendix A.
  • Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh (2017) Making the V in VQA matter: elevating the role of image understanding in visual question answering. In Proceedings of CVPR, Cited by: §3, §5.
  • D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, and J. P. Bigham (2018) VizWiz Grand Challenge: answering visual questions from blind people. In Proceedings of CVPR, Cited by: Table 3, §1, §5, Table 2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of CVPR, Cited by: §3.
  • J. Hoffman, D. Pathak, E. Tzeng, J. Long, S. Guadarrama, T. Darrell, and K. Saenko (2016) Large scale visual recognition through adaptation using joint representation and multiple instance learning. JMLR 17 (1), pp. 4954–4984. Cited by: §2.
  • D. A. Hudson and C. D. Manning (2019) GQA: a new dataset for compositional question answering over real-world images. In Proceedings of CVPR, Cited by: §5.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of ICML, Cited by: Appendix A.
  • Y. Jiang, V. Natarajan, X. Chen, M. Rohrbach, D. Batra, and D. Parikh (2018) Pythia v0.1: the winning entry to the VQA Challenge 2018. arXiv preprint arXiv:1807.09956. Cited by: Appendix C, §5.
  • D. Juan, C. Lu, Z. Li, F. Peng, A. Timofeev, Y. Chen, Y. Gao, T. Duerig, A. Tomkins, and S. Ravi (2019) Graph-RISE: graph-regularized image semantic embedding. arXiv preprint arXiv:1902.10814. Cited by: §1, §3.
  • J. Kim, Y. Choi, S. Hong, J. Jun, and B. Zhang (2018) Bilinear attention networks for VizWiz challenge. In Proceedings of the ECCV Workshop on VizWiz Grand challenge, Cited by: Table 3, §5, Table 2.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In Proceedings of ICLR, Cited by: Appendix B, Appendix C.
  • R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei (2017) Visual Genome: connecting language and vision using crowdsourced dense image annotations. IJCV 123 (1), pp. 32–73. Cited by: §1, §3.
  • C. Lin and F. J. Och (2004) Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of ACL, Cited by: §4.
  • T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In Proceedings of ECCV, Cited by: §4.
  • L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietikäinen (2018) Deep learning for generic object detection: a survey. arXiv preprint arXiv:1809.02165. Cited by: §1.
  • J. Lu, J. Yang, D. Batra, and D. Parikh (2018) Neural baby talk. In Proceedings of CVPR, Cited by: §2.
  • D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten (2018) Exploring the limits of weakly supervised pretraining. In Proceedings of ECCV, Cited by: §2.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Proceedings of NeurIPS, Cited by: item L.
  • G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller (1990) Introduction to WordNet: an on-line lexical database. International Journal of Lexicography 3 (4), pp. 235–244. Cited by: §2.
  • H. Noh, T. Kim, J. Mun, and B. Han (2019) Transfer learning via unsupervised task discovery for visual question answering. In Proceedings of CVPR, Cited by: §2.
  • M. Pedersoli, T. Lucas, C. Schmid, and J. Verbeek (2017) Areas of attention for image captioning. In Proceedings of ICCV, Cited by: §2.
  • J. Redmon and A. Farhadi (2017) YOLO9000: better, faster, stronger. In Proceedings of CVPR, Cited by: §2.
  • M. Ren, R. Kiros, and R. Zemel (2015a) Exploring models and data for image question answering. In Proceedings of NeurIPS, Cited by: §5.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015b) Faster R-CNN: towards real-time object detection with region proposal networks. In Proceedings of NeurIPS, Cited by: §3.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet large scale visual recognition challenge. IJCV 115 (3), pp. 211–252. Cited by: §3.
  • T. Salimans and D. P. Kingma (2016) Weight normalization: a simple reparameterization to accelerate training of deep neural networks. In Proceedings of NeurIPS, Cited by: Appendix C.
  • A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap (2017) A simple neural network module for relational reasoning. In Proceedings of NeurIPS, Cited by: §2.
  • P. Sharma, N. Ding, S. Goodman, and R. Soricut (2018) Conceptual Captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of ACL, Cited by: §1, §4, §4.
  • C. Sun, A. Shrivastava, S. Singh, and A. Gupta (2017) Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of ICCV, Cited by: §2.
  • Y. Tang, J. Wang, X. Wang, B. Gao, E. Dellandréa, R. Gaizauskas, and L. Chen (2017) Visual and semantic knowledge transfer for large scale semi-supervised object detection. TPAMI 40 (12), pp. 3045–3058. Cited by: §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proceedings of NeurIPS, Cited by: §4.
  • R. Vedantam, C. Lawrence Zitnick, and D. Parikh (2015) CIDEr: consensus-based image description evaluation. In Proceedings of CVPR, Cited by: §4.
  • S. Venugopalan, L. Anne Hendricks, M. Rohrbach, R. Mooney, T. Darrell, and K. Saenko (2017) Captioning images with diverse objects. In Proceedings of CVPR, Cited by: §2.
  • B. Wu, W. Chen, Y. Fan, Y. Zhang, J. Hou, J. Huang, W. Liu, and T. Zhang (2019) Tencent ML-Images: a large-scale multi-label image database for visual representation learning. arXiv preprint arXiv:1901.01703. Cited by: §2.
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In Proceedings of ICML, Cited by: §2.
  • Y. Zhang, J. Hare, and A. Prügel-Bennett (2018) Learning to count objects in natural images for visual question answering. In Proceedings of ICLR, Cited by: §2.
  • X. Zhou, J. Zhuo, and P. Krähenbühl (2019) Bottom-up object detection by grouping extreme and center points. In Proceedings of CVPR, Cited by: §1.
  • Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei (2016) Visual7W: grounded question answering in images. In Proceedings of CVPR, Cited by: §5.

Appendix A Details on Faster R-CNN

Our model is implemented in TensorFlow Abadi et al. (2015). We follow Anderson et al. (2018) in terms of model architecture, data splits, and processing steps. We describe the major components and differences below. In particular, we use the latest version of Visual Genome (v1.4), with 1,600 object and 400 attribute categories. We also have a “background” class for objects and a “no attribute” class for attributes. We limit the number of attributes per object to 16. We resize each image so that the maximum of its height and width is 896 pixels. We train our model with a batch size of 64 for 50K steps, using SGD with momentum on an 8-core Google Cloud TPU (cloud.google.com/tpu). We clip the gradient if its norm is greater than 10. We use a cosine learning rate schedule with 1K warm-up steps, increasing the learning rate from 0.003 to 0.04 and reducing it to 0.01 at step 20K and to 0.005 at step 40K. We apply random crops to images and use batch normalization Ioffe and Szegedy (2015) as well as DropBlock Ghiasi et al. (2018) on blocks 3 and 4 of the ResNet-101 during training. Our features come from fc6 after ReLU.
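Read literally, this schedule combines a linear warm-up with stepwise reductions; one possible reading, ignoring any cosine shaping between the plateaus, is sketched below (an interpretation, not the exact training code).

    def frcnn_learning_rate(step):
        """One interpretation of the learning rate schedule described above."""
        if step < 1000:                                     # 1K warm-up steps: 0.003 -> 0.04
            return 0.003 + (0.04 - 0.003) * step / 1000.0
        if step < 20000:
            return 0.04
        if step < 40000:
            return 0.01                                     # reduced at step 20K
        return 0.005                                        # reduced again at step 40K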

Appendix B Details on Image Captioning

Figure 3: Pipeline for converting an image to a sequence of image features in our highest performing image captioning model on the Conceptual Captions benchmark, used as input to the Transformer-based model.

Our model is implemented in TensorFlow Abadi et al. (2015). Our Transformer-based architecture has a stack of 6 layers for both the encoder and the decoder. The number of attention heads is set to 8. We do not use positional encoding. We have an additional dense projection layer for each type of input feature (see Figure 3 for examples). Moreover, for Faster R-CNN features, we observe the best performance when first transforming the 2048D input feature vector into a 64D one (as in Ultra) using another projection layer, and thus report accuracy numbers in this setting. We also use such projection layers in our VQA architecture when using Ultra features (see the next section). We use the Adam optimizer Kingma and Ba (2015) with a warm-up style learning rate schedule, linearly increasing the learning rate over the first 20 epochs until it reaches 0.000032, and then applying a decay rate of 0.95 every 25 epochs. We tuned the initial learning rate over {0.000016, 0.000032, 0.000064}. We train our model with a batch size of 4096 on a 32-core Google Cloud TPU for a total of 2 million steps. Each training run takes approximately 4 days.
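A minimal sketch of this warm-up-then-decay schedule, assuming the 0.95 decay is applied in discrete 25-epoch steps after the warm-up:

    def captioning_learning_rate(epoch, peak_lr=0.000032, warmup_epochs=20,
                                 decay_rate=0.95, decay_every_epochs=25):
        """Linear warm-up to peak_lr over the first 20 epochs, then 0.95 decay every 25 epochs."""
        if epoch < warmup_epochs:
            return peak_lr * (epoch + 1) / warmup_epochs
        return peak_lr * decay_rate ** ((epoch - warmup_epochs) // decay_every_epochs)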

In Figure 3, we show how we convert an image (pixels) to an input sequence of image features to the Transformer-based model described in the main text.

Appendix C Details on VQA

                     val  |           test-dev            |         test-standard
                     all  | all   y/n   num  unans  other | all   y/n   num  unans  other
    VizWiz            -   |  -     -     -     -      -   | 46.9  59.6  21.0  80.5   27.3
    BAN               -   |  -     -     -     -      -   | 51.6  68.1  17.9  85.3   31.5
    Ours (FRCNN)    55.2  | 53.6  72.7  22.7  85.9   33.3 | 51.9  66.7  24.3  85.0   32.1
    Ours (Ultra)    56.8  | 55.1  71.7  31.6  84.4   36.7 | 53.7  68.1  28.8  84.0   35.4

Table 3: Accuracy (%) for the VQA task on the VizWiz dataset, with baselines VizWiz Gurari et al. (2018) and BAN Kim et al. (2018). Additionally, we provide accuracy per answer type on the test-dev and test-standard splits: yes/no (y/n), number (num), unanswerable (unans), and the rest (other).

Our model is implemented in TensorFlow Abadi et al. (2015). As mentioned in the main text, the architecture is a simplified “up-down” model of Anderson et al. (2018), with two major differences. First, it uses weight normalization Salimans and Kingma (2016) followed by ReLU instead of the more expensive gated hyperbolic tangent activation. Second, it combines the two modalities by element-wise multiplication instead of by feature concatenation.
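To make these two modifications concrete, here is a minimal sketch of the fusion path under our assumptions; plain Dense+ReLU layers stand in for the weight-normalized fc layers, and the dimensions and layer choices are illustrative.

    import tensorflow as tf

    def up_down_fusion(box_feats, question_vec, num_answers=3135, hidden=2048):
        """box_feats: [B, n, d] region features; question_vec: [B, q] encoded question.
        Returns answer logits of shape [B, num_answers]."""
        q = tf.keras.layers.Dense(hidden, activation="relu")(question_vec)   # question projection
        v = tf.keras.layers.Dense(hidden, activation="relu")(box_feats)      # per-box projection
        att_logits = tf.keras.layers.Dense(1)(v * q[:, tf.newaxis, :])       # question-guided attention scores
        att = tf.nn.softmax(att_logits, axis=1)                              # attention over the n boxes
        image_vec = tf.reduce_sum(att * v, axis=1)                           # attended image feature
        fused = image_vec * q                                                # element-wise multiplication
        return tf.keras.layers.Dense(num_answers)(fused)                     # single classifier layer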

The only minor differences from Pythia v0.1 are that we use Adam Kingma and Ba (2015) rather than its variant AdaMax, and that we use a single classifier layer instead of two vision and language layers in all of our experiments. We use Pythia v0.1 Jiang et al. (2018) to preprocess the VizWiz dataset and retain the top 3,135 answers. Analogous to what we observe in our image captioning model, for Ultra features we see the best performance when scaling and expanding the 64D input feature vector to a 2048D one (as in FRCNN) using another projection layer followed by ReLU; we thus report accuracy numbers in this setting. We use a warm-up style learning rate schedule, linearly increasing the learning rate over the first 10 epochs until it reaches the initial learning rate, and then applying a decay rate of 0.5 every 20 epochs. We tune the initial learning rate over {0.00005, 0.0001, 0.0003, 0.0005, 0.001, 0.003}. We train our model with a batch size of 192 on an 8-core Google Cloud TPU for a total of 70K steps. Each training run takes approximately 2 hours.

Appendix D Full results on the VizWiz benchmark

Table 3 reports accuracy on additional splits of VizWiz, complementing the one in the main text.