Cross-modal reasoning is challenging because entities and objects must be grounded across different modalities. Representative tasks include visual question answering (VQA) and image captioning, which leverage grounded features between text and images to make predictions. While recent advances in these tasks achieve impressive results, the quality of the correspondence between textual entities and visual objects is not necessarily convincing or interpretable (Liu et al., 2017). This is likely because the grounding from one modality to the other is trained implicitly, and the intermediate results are seldom evaluated as explicitly as in object detection. To address this issue, Plummer et al. (2015) created the Flickr30K Entities dataset with precise annotations of the correspondence between language phrases and image regions to ease the evaluation of visual grounding. In Figure 1, two men are referred to in the caption as separate entities. To uniquely ground the two men in the image, the grounding algorithm must take their respective context and attributes into consideration when learning the correspondence.
Over the years, many deep learning based approaches have been proposed to tackle this localization challenge. The basic idea is to derive representative features for each entity and each object, and then score their correspondence. On the caption side, individual token representations usually start with word embeddings followed by a recurrent neural network (RNN), typically a Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) network, to capture the contextual meaning of the text entity in a sentence. On the other hand, the visual objects in image regions of interest (RoI) are extracted through object detection. Each detected object typically captures limited context through the receptive fields of 2D convolutions. Advanced techniques such as the feature pyramid network (FPN) (Lin et al., 2017) enhance the representations by combining features at different semantic levels w.r.t. the object size. Even so, those conventional approaches are limited in their ability to extract relevant long range context in both text and images. In view of this limitation, non-local attention techniques were proposed to address the long range dependencies in natural language processing (NLP) and computer vision (CV) tasks (Vaswani et al., 2017; Wang et al., 2018). Inspired by this advancement, we introduce the contextual grounding approach, which improves the representations through extensive intra- and inter-modal interaction to infer the contextual correspondence between text entities and visual objects.
On the methodology of feature interaction, the Transformer architecture (Vaswani et al., 2017) for machine translation demonstrates a systematic approach to efficiently computing the interaction between language elements. Around the same time, non-local networks (Wang et al., 2018) generalized the transformer to the CV domain, supporting feature interaction at different levels of granularity from feature maps to pooled objects. More recently, the image transformer (Parmar et al., 2018) adapted the original transformer architecture to image generation by encoding spatial information in pixel positions, whereas we deal with image input at the RoI level for grounding. Devlin et al. (2019) then proposed BERT, a transformer encoder pre-trained on large-scale masked language modeling, which facilitates training downstream tasks to achieve state-of-the-art (SOTA) results. Our work extends BERT to the cross-modal grounding task by jointly learning contextual representations of language entities and visual objects. Concurrently, another line of work named VisualBERT (Li et al., 2019) also integrates BERT to deal with grounding in a single transformer architecture. However, their model requires both task-agnostic and task-specific pre-training on cross-modal datasets to achieve competitive results. Ours, in contrast, achieves SOTA results without additional pre-training and allows the architecture of each branch to be tailored to its modality.
2 Contextual Grounding
The main approach of previous work is to use an RNN/LSTM to extract high level phrase representations and then apply different attention mechanisms to rank the correspondence to visual regions. While the hidden representations of the entity phrases take the language context into consideration, the image context around visual objects is in contrast limited to what object detection captures through 2D receptive fields. Moreover, unlike tokens in text, objects in an image have no positional ordering by which they could be fed through an RNN to capture potentially far apart contextual dependencies. In view of the recent advances in NLP, the transformer architecture proposed by Vaswani et al. (2017) addresses the long range dependency through pure attention techniques. Without an RNN being incorporated, the transformer enables text tokens to efficiently interact with each other pairwise regardless of the range. The ordering information is injected through additional positional encoding. Inspired by this breakthrough, the corresponding contextual representations of image RoIs may be derived through intra-modal interaction with encoded spatial information. We hypothesize that the grounding objective would guide the attention to the corresponding context in both the text and image with improved accuracy. Consequently, we propose the contextual grounding architecture as shown in Figure 2. The model is composed of two transformer encoder branches, one for text and one for image input, which generate their respective contextual representations for the grounding head to decide the correspondence. The text branch is pre-trained from the BERT base model (Devlin et al., 2019), which trains a positional embedding different from that of the original transformer (Vaswani et al., 2017). On the other hand, the image branch takes RoI features as input objects from an object detector.
Correspondingly, we train a two layer MLP to generate the spatial embedding given the absolute spatial information of the RoI location and size normalized to the entire image. Both branches add the positional and spatial embedding to the tokens and RoIs respectively as input to the first interaction layer. At each layer, every hidden representation attends to all the others through self-attention to generate a new hidden representation as the layer output. The self-attention may be multi-headed to enhance the representativeness as described in (Vaswani et al., 2017). At the end of each branch, the final hidden states are fed into the grounding head to perform the cross-modal attention with text entity hidden states as the queries and image object hidden states as the keys. The attention responses serve as the matching correspondences. The mean binary cross entropy loss per entity between the predicted correspondences and the ground truth is back-propagated to guide the interaction across the branches. We evaluate the grounding recall on the Flickr30K Entities dataset and compare the results with SOTA work in the next section.
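The grounding head and its loss can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the scaled dot product, the sigmoid that turns each attention response into an independent match probability, and the names `grounding_scores` and `mean_bce_per_entity` are all hypothetical choices, and the hidden states are toy two-dimensional vectors.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def grounding_scores(entity_states, object_states):
    """Score every (entity, object) pair.

    entity_states: query vectors, one per text entity.
    object_states: key vectors, one per image RoI.
    Assumption: scaled dot product followed by a sigmoid, so that each
    pair receives an independent match probability in (0, 1).
    """
    scale = math.sqrt(len(object_states[0]))
    return [[sigmoid(dot(q, k) / scale) for k in object_states]
            for q in entity_states]

def mean_bce_per_entity(scores, targets, eps=1e-12):
    """Binary cross entropy averaged over RoIs, then over entities."""
    per_entity = []
    for row, target_row in zip(scores, targets):
        losses = [-(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
                  for p, t in zip(row, target_row)]
        per_entity.append(sum(losses) / len(losses))
    return sum(per_entity) / len(per_entity)

# Toy final hidden states: two text entities and three RoIs.
entity_states = [[2.0, 0.0], [0.0, 2.0]]
object_states = [[3.0, -3.0], [-3.0, 3.0], [0.0, 0.0]]
targets = [[1, 0, 0], [0, 1, 0]]  # ground-truth matches per entity

scores = grounding_scores(entity_states, object_states)
loss = mean_bce_per_entity(scores, targets)
print(scores, loss)
```

Because every pair is scored independently, an entity whose ground truth covers several RoIs simply has several ones in its target row.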
Table 1: Phrase grounding recall (%) on Flickr30K Entities. In our model names, L and H give the number of layers and attention heads of the image branch, and abs marks the added spatial embedding.
|Approach|Detector|R@1|R@5|R@10|Upper Bound|
|---|---|---|---|---|---|
|Plummer et al. (2015)|Fast RCNN|50.89|71.09|75.73|85.12|
|Yeh et al. (2017)|YOLOv2|53.97|-|-|-|
|Hinami and Satoh (2017)|Query-Adaptive RCNN|65.21|-|-|-|
|BAN (Kim et al., 2018)|Bottom-Up (Anderson et al., 2018)|69.69|84.22|86.35|87.45|
|Ours L1-H2-abs|Bottom-Up (Anderson et al., 2018)|71.36|84.76|86.49|87.45|
|Ours L1-H1-abs|Bottom-Up (Anderson et al., 2018)|71.21|84.84|86.51|87.45|
|Ours L1-H1|Bottom-Up (Anderson et al., 2018)|70.75|84.75|86.39|87.45|
|Ours L3-H2-abs|Bottom-Up (Anderson et al., 2018)|70.82|84.59|86.49|87.45|
|Ours L3-H2|Bottom-Up (Anderson et al., 2018)|70.39|84.68|86.35|87.45|
|Ours L6-H4-abs|Bottom-Up (Anderson et al., 2018)|69.71|84.10|86.33|87.45|
3 Evaluation
Our contextual grounding approach uses the transformer encoder to capture the context in both text entities and image objects. While the text branch is pre-trained from BERT (Devlin et al., 2019), the image branch is trained from scratch. Given the complexity of the transformer, previous work (Girdhar et al., 2019) has shown that the performance varies with the number of interaction layers and attention heads. Also, the intra-modal object interaction does not necessarily consider spatial relationships unless some positional or spatial encoding is applied. In our evaluation, we vary the number of layers and heads and optionally add the spatial encoding to explore the performance variations summarized in Table 1. We achieve SOTA results in top 1, 5 and 10 recall with the same object detector as used by the previous SOTA, BAN (Kim et al., 2018). The breakdown of recall per entity type is given in Table 2. Six out of the eight entity types benefit from our contextual grounding. Interestingly, the recall of the instrument type suffers. This may be due to the relatively small number of instrument instances in the dataset, which prevents the model from learning the context well. On the other hand, compared with the text branch, which as BERT base consists of 12 layers and 12 heads with a hidden size of 768 dimensions, the best performance is achieved with an image branch having only 1 layer and 2 attention heads (L1-H2-abs in Table 1). Moreover, adding the spatial embedding consistently improves the top 1 recall by about 0.4 to 0.5 points (71.21 vs. 70.75 for L1-H1 and 70.82 vs. 70.39 for L3-H2). A shallow image branch likely suffices because image objects, unlike word embeddings that require context to produce hidden states representative of their meaning, may already capture some neighborhood information through receptive fields.
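For readers unfamiliar with the metric, top-k recall for phrase grounding can be sketched as follows. The sketch rests on assumptions that the text does not spell out: a prediction counts as correct when its intersection over union (IoU) with the ground-truth box is at least 0.5, each entity has a single ground-truth box, and the names `iou` and `recall_at_k` as well as the boxes are purely illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(ranked_boxes_per_entity, gt_box_per_entity, k, iou_threshold=0.5):
    """Fraction of entities with a correct box among their top-k ranked RoIs."""
    hits = 0
    for ranked, gt in zip(ranked_boxes_per_entity, gt_box_per_entity):
        if any(iou(box, gt) >= iou_threshold for box in ranked[:k]):
            hits += 1
    return hits / len(gt_box_per_entity)

# Hypothetical detections for two entities, sorted by grounding score.
ranked = [
    [(0, 0, 10, 10), (20, 20, 30, 30)],   # correct box is ranked second
    [(0, 0, 5, 5), (50, 50, 60, 60)],     # correct box is ranked first
]
gt = [(21, 21, 30, 30), (0, 0, 5, 5)]

print(recall_at_k(ranked, gt, k=1), recall_at_k(ranked, gt, k=2))
```

Setting k to the number of detected boxes gives the ceiling that the detector's proposals impose on any grounding model built on top of them.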
Finally, in Table 3 we compare the results with the recent work in progress, VisualBERT (Li et al., 2019), which also achieves improved grounding results, based on a single transformer architecture that learns the representations by fusing text and image inputs from the beginning. Ours performs marginally better in top 1 recall. Note that VisualBERT requires task-agnostic and task-specific pre-training on COCO captioning (Chen et al., 2015) and the target dataset, whereas our approach needs no such pre-training to deliver competitive results. Moreover, our two-branch architecture can be adapted to each input modality separately.
4 Conclusion
This paper introduces contextual grounding, a higher-order interaction technique that captures the corresponding context between text entities and visual objects. The evaluation shows a SOTA accuracy of 71.36% for phrase localization on Flickr30K Entities. In the future, it would be worth investigating the benefits of grounding guided visual representations in other related tasks, including spatio-temporal ones.
References
- Anderson et al. (2018). Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In IEEE Conference on Computer Vision and Pattern Recognition.
- Chen et al. (2015). Microsoft COCO Captions: Data Collection and Evaluation Server. arXiv.
- Devlin et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Conference of the North American Chapter of the Association for Computational Linguistics.
- Girdhar et al. (2019). Video Action Transformer Network. In IEEE Conference on Computer Vision and Pattern Recognition.
- Hinami and Satoh (2017). Discriminative Learning of Open-Vocabulary Object Retrieval and Localization by Negative Phrase Augmentation. In Conference on Empirical Methods on Natural Language Processing.
- Kim et al. (2018). Bilinear Attention Networks. In Neural Information Processing Systems.
- Li et al. (2019). VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv.
- Lin et al. (2017). Feature Pyramid Networks for Object Detection. In IEEE Conference on Computer Vision and Pattern Recognition.
- Liu et al. (2017). Attention Correctness in Neural Image Captioning. In AAAI Spring Symposium.
- Parmar et al. (2018). Image Transformer. In International Conference on Machine Learning.
- Plummer et al. (2015). Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models. International Journal of Computer Vision.
- Vaswani et al. (2017). Attention is All you Need. In Neural Information Processing Systems.
- Wang et al. (2018). Non-Local Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803.
- Yeh et al. (2017). Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts. In Neural Information Processing Systems, pp. 1912–1922.
Implementation
on the Flickr30K Entities dataset for fair comparison where the text entity representation is taken from the last word or subword in a phrase. It arguably makes little sense for VisualBERT (Li et al., 2019) to rank the correspondences with the cross entropy instead of the binary cross entropy, because one text entity such as a group of people can actually correspond to multiple person objects in the ground truth annotations. Apart from using different numbers of transformer layers and attention heads than the base BERT model, our image transformer branch uses gradient clipping of 0.25 and a dropout probability of 0.4 for the best performance. During training, the learning rate is set to 5e-5 and the batch size is set to 256 with 2 steps of gradient accumulation before each parameter update. The model is trained for at most 10 epochs with early stopping.
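The argument for the binary cross entropy can be illustrated with a small numerical sketch. The logits below are made-up values for a single entity, and `softmax` and `sigmoid` are plain helper functions written for this illustration rather than code from either model.

```python
import math

def softmax(logits):
    """Normalize logits into probabilities that compete and sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Map one logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical logits for the entity "a group of people" over four RoIs.
# The first three RoIs are annotated matches; the last one is background.
logits = [4.0, 4.0, 4.0, -4.0]

softmax_probs = softmax(logits)               # matches share the probability mass
sigmoid_probs = [sigmoid(z) for z in logits]  # each RoI is judged on its own

print([round(p, 3) for p in softmax_probs])
print([round(p, 3) for p in sigmoid_probs])
```

Under the softmax, the three matching RoIs split the probability mass and each stays below 1/3 however confident the model is, so a cross entropy over this distribution cannot reward all of them at once. With independent sigmoids, each matching RoI receives a probability above 0.98, which is what a per-RoI binary cross entropy encourages.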