TransVG: End-to-End Visual Grounding with Transformers

04/17/2021 ∙ by Jiajun Deng, et al. ∙ USTC

In this paper, we present a neat yet effective transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to the corresponding region of an image. State-of-the-art methods, whether two-stage or one-stage, rely on a complex module with manually designed mechanisms to perform query reasoning and multi-modal fusion. However, the involvement of certain mechanisms in fusion module design, such as query decomposition and image scene graphs, makes the models easily overfit to datasets with specific scenarios, and limits the rich interaction between visual and linguistic contexts. To avoid this pitfall, we propose to establish the multi-modal correspondence by leveraging transformers, and empirically show that the complex fusion modules (e.g., modular attention network, dynamic graph, and multi-modal tree) can be replaced by a simple stack of transformer encoder layers with higher performance. Moreover, we re-formulate visual grounding as a direct coordinate regression problem and avoid making predictions out of a set of candidates (i.e., region proposals or anchor boxes). Extensive experiments are conducted on five widely used datasets, and a series of state-of-the-art records are set by our TransVG. We build the benchmark of transformer-based visual grounding frameworks and will make our code available to the public.




1 Introduction

Visual grounding (also known as referring expression comprehension [29, 55], phrase localization [22, 34], and natural language object retrieval [20, 23]) aims to predict the location of the image region referred to by a language expression. Progress on this task has great potential to provide an intelligent interface between human natural language expressions and the visual components of the physical world. Existing methods addressing this task can be broadly grouped into the two-stage and one-stage pipelines shown in Figure 1. Specifically, the two-stage approaches [29, 32, 43, 55] first generate a set of sparse region proposals and then exploit region-expression matching to find the best one. The one-stage approaches [8, 25, 51] perform visual-linguistic fusion at intermediate layers of an object detector and output the box with the maximal score over pre-defined dense anchors.

Figure 1: A comparison of (a) two-stage visual grounding pipeline, (b) one-stage visual grounding pipeline, and (c) our proposed TransVG framework. TransVG performs intra-modality and inter-modality relation reasoning with a stack of transformer layers in a homogeneous way, and grounds the object by directly regressing the box coordinates.

Multi-modal fusion and reasoning are the core problems in visual grounding. In general, the early two-stage and one-stage methods address multi-modal fusion in a simple way. Concretely, the two-stage Similarity Net [43] measures the similarity between region and expression embeddings with an MLP, and the one-stage FAOA [51] fuses the language vector into the visual features by direct concatenation. These simple designs are efficient but lead to sub-optimal results, especially on long and complex language expressions. Subsequent studies have proposed diverse architectures to improve the performance. Among two-stage methods, the modular attention network [54], various graphs [45, 48, 49], and the multi-modal tree [26] are designed to better model the multi-modal relationships. The one-stage method [50] has also explored better query modeling by proposing a multi-round fusion module.

Despite their effectiveness, these complicated fusion modules are built on certain pre-defined structures of language queries or image scenes, inspired by human priors. Typically, the involvement of manually designed mechanisms in the fusion module makes the models overfit to specific scenarios, such as certain query lengths and query relationships, and limits the rich interaction between visual and linguistic contexts. Moreover, even though the ultimate goal of visual grounding is to localize the referred object, most previous methods ground the queried object in an indirect fashion. They generally define surrogate problems of language-guided candidate prediction, selection, and refinement. Typically, the candidates are sparse region proposals [55, 29, 43] or dense anchors [51], from which the best region is selected and refined to get the final grounding box. Since these methods' predictions are made out of candidates, the performance is easily influenced by the prior knowledge used to generate proposals (or pre-define anchors) and by the heuristics used to assign targets to candidates.

In this study, we explore an alternative approach to avoid the aforementioned problems. Formally, we introduce a neat and novel transformer-based framework, namely TransVG, to effectively address the task of visual grounding. We empirically show that the structured fusion modules can be replaced by a simple stack of transformer encoder layers. In particular, the core component of transformers (i.e., the attention layer) is able to establish intra-modality and inter-modality correspondence across visual and linguistic inputs, even though we do not pre-define any specific fusion mechanism. Besides, we find that directly regressing the box coordinates works better than grounding the queried object indirectly, as previous methods do. Our TransVG directly outputs 4-dim coordinates to ground the object instead of making predictions based on a set of candidate boxes.

The pipeline of our proposed TransVG is illustrated in Figure 1(c). We first feed the RGB image and language expression into two sibling branches. The visual transformer and linguistic transformer are applied in these two branches to model the global cues in the vision and language domains, respectively. Then, the abstracted visual tokens and linguistic tokens are fused, and the visual-linguistic transformer is exploited to perform cross-modal relation reasoning. Finally, the box coordinates of the referred object are directly regressed to ground it. We benchmark our framework on five prevalent visual grounding datasets, namely ReferItGame [22], Flickr30K Entities [34], RefCOCO [55], RefCOCO+ [55], and RefCOCOg [29], and our method sets a series of state-of-the-art records. Remarkably, our proposed TransVG achieves 70.73%, 79.10% and 78.35% on the test sets of ReferItGame, Flickr30K Entities and RefCOCO (testB), with 6.13%, 5.80% and 6.05% absolute improvements over the strongest competitors.

In summary, we make three-fold contributions:

  • We propose the first transformer-based framework for visual grounding, which has a neater architecture yet achieves better performance than the prevalent one-stage and two-stage frameworks.

  • We present an elegant view of capturing intra- and inter-modality context homogeneously by transformers, and formulating visual grounding as a direct coordinates regression problem.

  • We conduct extensive experiments to validate the merits of our method, and show significantly improved results on several prevalent benchmarks.

2 Related Work

2.1 Visual Grounding

Recent advances in visual grounding can be broadly categorized into two directions, i.e., two-stage methods [18, 19, 26, 43, 45, 48, 54, 58, 62] and one-stage methods [8, 25, 39, 50, 51]. We briefly review them in the following.

Two-stage Methods. Two-stage approaches are characterized by generating region proposals in the first stage and then leveraging the language expression to select the best matching region in the second stage. Generally, the region proposals are generated using either unsupervised methods [35, 43] or a pre-trained object detector [54, 58]. A training loss of either binary classification [43, 59] or maximum-margin ranking [29, 32, 44] is applied in the second stage to maximize the similarity of the positive object-query pairs. Pioneering studies [29, 44, 55] obtain good results with the two-stage framework. The early work MAttNet [54] introduces the modular design and improves the grounding accuracy by better modeling the subject, location, and relation-related language description. Some recent studies further improve the two-stage methods by better modeling the object relationships [26, 45, 48], enforcing correspondence learning [27], or making use of phrase co-occurrences [2, 6, 12].

One-stage Methods. One-stage approaches get rid of the computation-intensive object proposal generation and region feature extraction in the two-stage paradigm. Instead, the linguistic context is densely fused with the visual features, and the language-attended feature maps are further leveraged to perform bounding box prediction in a sliding-window manner. The pioneering work FAOA [51] encodes the text expression into a language vector, and fuses the language vector into the YOLOv3 detector [37] to ground the referred instance. RCCF [25] formulates the visual grounding problem as a correlation filtering process [3, 16], and picks the peak value of the correlation heatmap as the center of target objects. The recent work ReSC [50] devises a recursive sub-query construction module to address the limitations of FAOA [51] on grounding complex queries.

Figure 2: An overview of our proposed TransVG framework. It consists of four main components: (1) a visual branch, (2) a linguistic branch, (3) a visual-linguistic fusion module, and (4) a prediction head to regress the box coordinates.

2.2 Transformer

The transformer was first proposed in [42] to tackle neural machine translation (NMT). The primary component of a transformer layer is the attention module, which scans through the input sequence in parallel and aggregates the information of the whole sequence with adaptive weights. Compared to the recurrent units in RNNs [17, 30, 41], the attention mechanism exhibits better performance in processing long sequences. This superiority has attracted a surge of research interest in applications of transformers in NLP tasks [10, 11, 36, 60] and speech recognition [31, 46].

Transformer in Vision Tasks. Inspired by the great success of transformers in neural machine translation, a series of transformers [4, 5, 7, 13, 21, 47, 57, 61] applied to vision tasks have been proposed. The influential work DETR [4] formulates object detection as a set prediction problem. It introduces a small set of learnable object queries, reasons about global context and object relations with the attention mechanism, and outputs the final set of predictions in parallel. ViT [13] shows that a pure transformer can achieve excellent performance on image classification tasks. More recently, a pre-trained image processing transformer (IPT) is introduced in [5] to address low-level vision problems, e.g., denoising, super-resolution and deraining.

Transformer in Vision-Language Tasks. Motivated by the powerful pre-trained model of BERT [11], some researchers have started to investigate visual-linguistic pre-training (VLP) [9, 24, 28, 40, 52] to jointly represent images and texts. In general, these models take the object proposals and text as inputs, and devise several transformer encoder layers for joint representation learning. Plenty of pre-training tasks are introduced, including image-text matching (ITM), word-region alignment (WRA), masked language modeling (MLM), masked region modeling (MRM), etc.

Despite sharing similar base units (transformer encoder layers), the goal of VLP is to learn a generalizable vision-language representation with large-scale data to facilitate downstream tasks. In contrast, we focus on developing a novel transformer-based visual grounding framework, and learning to perform homogeneous multi-modal fusion and reasoning with a small amount of visual grounding data.

3 Transformers for Visual Grounding

In this work, we present Transformers for Visual Grounding (TransVG), a novel framework for the visual grounding task based on a stack of transformer encoders with direct box coordinates prediction. As shown in Figure 2, given an image and a language expression as inputs, we first feed them into two sibling branches, i.e., a visual branch and a linguistic branch, to generate visual and linguistic feature embeddings. Then, we put the multi-modal feature embeddings together and append a learnable token (named [REG] token) to construct the inputs of the visual-linguistic fusion module. The visual-linguistic transformer homogeneously embeds the input tokens from different modalities into a common semantic space by modeling intra-modality and inter-modality context with the self-attention mechanism. Finally, the output state of the [REG] token is leveraged to directly predict the 4-dim coordinates of the referred object in the prediction head.

In the following subsections, we first review the preliminaries of the transformer and then elaborate on our designs of transformers for visual grounding.

3.1 Preliminary

Before detailing the architecture of TransVG, we briefly review the conventional transformer proposed in [42] for machine translation. The core component in a transformer is the attention mechanism. Given the query embedding $f_q$, key embedding $f_k$ and value embedding $f_v$, the output of a single-head attention layer is computed as:

$$\mathrm{Attn}(f_q, f_k, f_v) = \mathrm{softmax}\!\left(\frac{f_q f_k^{\top}}{\sqrt{d_k}}\right) f_v, \tag{1}$$

where $d_k$ is the channel dimension of $f_k$. Similar to classic neural sequence transduction models, the conventional transformer has an encoder-decoder structure. However, in our approach, we only use transformer encoder layers.
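As a minimal illustration of the single-head attention reviewed above, the following pure-Python sketch computes softmax-weighted sums of value vectors from scaled query-key dot products. The function names and the list-of-rows data layout are assumptions made for this example; it is not the implementation used in the paper.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def single_head_attention(queries, keys, values):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(keys[0])  # channel dimension of the key embedding
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs
```

When all keys are identical, the attention weights are uniform and each output is simply the mean of the value vectors, which gives a quick sanity check.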

Concretely, each transformer encoder layer has two sub-layers, i.e., a multi-head self-attention layer and a simple feed forward network (FFN). The multi-head attention is a variant of single-head attention (as in Equation (1)), and self-attention indicates the query, key and value are from the same embedding set. FFN is an MLP composed of fully connected layers and ReLU activation layers.

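In the transformer encoder layer, each sub-layer is put into a residual structure, where layer normalization [1] is performed after the residual connection. Let us denote the input as $x_n$; the procedure in a transformer encoder layer is:

$$x'_n = \mathrm{LN}\big(x_n + \mathcal{F}_{\mathrm{MSA}}(x_n)\big), \tag{2}$$

$$x_{n+1} = \mathrm{LN}\big(x'_n + \mathcal{F}_{\mathrm{FFN}}(x'_n)\big), \tag{3}$$

where $\mathrm{LN}(\cdot)$ indicates layer normalization, $\mathcal{F}_{\mathrm{MSA}}(\cdot)$ is the multi-head self-attention layer, and $\mathcal{F}_{\mathrm{FFN}}(\cdot)$ represents the feed forward network.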
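The residual structure with normalization after each residual connection can be sketched as follows. This is an illustrative sketch under simplifying assumptions: `layer_norm` omits the learnable gain and bias, and `self_attention` and `ffn` are hypothetical callables supplied by the caller rather than the layers used in the paper.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def encoder_layer(tokens, self_attention, ffn):
    """Post-norm encoder layer: LN(x + MSA(x)), then LN(x' + FFN(x')).

    self_attention maps the whole token list to a list of the same shape;
    ffn maps a single token vector to a vector of the same length.
    """
    attended = self_attention(tokens)
    hidden = [
        layer_norm([a + b for a, b in zip(token, update)])
        for token, update in zip(tokens, attended)
    ]
    return [
        layer_norm([a + b for a, b in zip(h, ffn(h))])
        for h in hidden
    ]
```

Passing sub-layers that return zeros reduces the layer to two successive normalizations, which makes the residual path easy to inspect in isolation.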

3.2 TransVG Architecture

As depicted in Figure 2, there are four main components in TransVG: (1) a visual branch, (2) a linguistic branch, (3) a visual-linguistic fusion module, and (4) a prediction head.

Visual Branch. The visual branch starts with a convolutional backbone network, followed by the visual transformer. We exploit the commonly used ResNet [15] as the backbone network. The visual transformer is composed of a stack of 6 transformer encoder layers. Each transformer encoder layer includes a multi-head self-attention layer and an FFN. There are 8 heads in the multi-head attention layer, and 2 FC layers followed by ReLU activation layers in the FFN. The output channel dimensions of these 2 FC layers are 2048 and 256, respectively.

Given an image $z_0 \in \mathbb{R}^{3 \times H_0 \times W_0}$ as the input of this branch, we exploit the backbone network to generate a 2D feature map $z \in \mathbb{R}^{C \times H \times W}$. Typically, the channel dimension $C$ is 2048, and the width and height of the 2D feature map are $1/32$ of the original image size ($H = H_0/32$, $W = W_0/32$). Then, we leverage a $1 \times 1$ convolutional layer to reduce the channel dimension of $z$ to $C_v = 256$ and obtain $z' \in \mathbb{R}^{C_v \times H \times W}$. Since the input of a transformer encoder layer is expected to be a sequence of 1D vectors, we further flatten $z'$ into $z_v \in \mathbb{R}^{C_v \times N_v}$, where $N_v = H \times W$ is the number of input tokens. To make the visual transformer sensitive to the original 2D positions of input tokens, we follow [4, 33] to utilize sine spatial position encodings as a supplement to the visual features. Concretely, the position encodings are added to the query and key embeddings at each transformer encoder layer. The visual transformer conducts global vision context reasoning in parallel, and outputs the advanced visual embedding $f_v$, which shares the same shape as $z_v$.
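To make the flattening step concrete, the sketch below turns a channels-first feature map into a sequence of tokens and computes the size of the token grid. The helper names are hypothetical, nested lists stand in for tensors, and the default stride of 32 is an assumption based on the usual downsampling factor of a ResNet's final stage.

```python
def visual_token_grid(height, width, stride=32):
    """Spatial size of the backbone feature map for a given input size (assumed stride)."""
    return height // stride, width // stride

def flatten_feature_map(feature_map):
    """Turn a C x H x W nested list into H*W tokens with C channels each (row-major)."""
    channels = len(feature_map)
    rows = len(feature_map[0])
    cols = len(feature_map[0][0])
    return [
        [feature_map[c][i][j] for c in range(channels)]
        for i in range(rows)
        for j in range(cols)
    ]

grid_h, grid_w = visual_token_grid(640, 640)
num_visual_tokens = grid_h * grid_w  # 20 * 20 = 400 under the assumed stride
```

Row-major flattening keeps neighbouring columns adjacent in the sequence; the 2D position encodings are what let the transformer recover the original layout.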

Linguistic Branch. The linguistic branch is a sibling to the visual branch. Our linguistic branch includes a token embedding layer and a linguistic transformer. To make the best of the pre-trained BERT model [11], the architecture of this branch follows the design of the basic model of the BERT series. Typically, there are 12 transformer encoder layers in the linguistic transformer. The output channel dimension of the linguistic transformer is $C_l = 768$.

Given a language expression as the input of this branch, we first convert each word ID into a one-hot vector. Then, in the token embedding layer, we map each one-hot vector to a language token by looking up the token table. We follow the common practice in machine translation [10, 11, 36, 42] to add a [CLS] token and a [SEP] token at the beginning and end positions of the tokenized language expression, respectively. After that, we take the language tokens as inputs of the linguistic transformer, and generate the advanced language embedding $f_l \in \mathbb{R}^{C_l \times N_l}$, where $N_l$ is the number of language tokens.

Visual-linguistic Fusion Module. As the core component in our model to fuse the multi-modal context, the architecture of the visual-linguistic fusion module (abbreviated as V-L module) is simple and elegant. Specifically, the V-L module includes two linear projection layers (one for each modality) and a visual-linguistic transformer (with a stack of 6 transformer encoder layers).

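Given the advanced visual tokens $f_v \in \mathbb{R}^{C_v \times N_v}$ out of the visual branch and the advanced linguistic tokens $f_l \in \mathbb{R}^{C_l \times N_l}$ out of the linguistic branch, we apply a linear projection layer to each of them to project them into embeddings with the same channel dimension. We denote the projected visual embedding and linguistic embedding as $p_v \in \mathbb{R}^{C_p \times N_v}$ and $p_l \in \mathbb{R}^{C_p \times N_l}$, where $C_p = 256$. Then, we pre-append a learnable embedding (namely a [REG] token) to $p_v$ and $p_l$, and formulate the joint input tokens of the visual-linguistic transformer as:

$$x_0 = [\,p_r,\; p_v^1, p_v^2, \ldots, p_v^{N_v},\; p_l^1, p_l^2, \ldots, p_l^{N_l}\,], \tag{4}$$

where $p_r$ represents the [REG] token. The [REG] token is randomly initialized at the beginning of the training stage and optimized with the whole model.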

After obtaining the input $x_0 \in \mathbb{R}^{C_p \times (1 + N_v + N_l)}$ in the joint embedding space as described above, we apply the visual-linguistic transformer to embed $x_0$ into a common semantic space by performing intra- and inter-modality relation reasoning in a homogeneous way. To retain the positional and modal information, we add learnable position encodings to the input of each transformer encoder layer.

Thanks to the attention mechanism, the correspondence can be freely established between each pair of tokens from the joint entities, regardless of their modality. For example, a visual token can attend to a visual token, and it can also freely attend to a linguistic token. Typically, the output state of the [REG] token develops a consolidated representation enriched by both visual and linguistic context, and is further leveraged for box coordinates prediction.
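The construction of the joint token sequence might be sketched as below. The ordering ([REG] first, then visual tokens, then linguistic tokens), the helper names, the toy channel dimension, and the token counts (400 visual and 40 linguistic tokens) are assumptions chosen for illustration; plain lists stand in for tensors.

```python
def build_joint_tokens(reg_token, visual_tokens, linguistic_tokens):
    """Concatenate the [REG] token, visual tokens, and linguistic tokens into one sequence."""
    return [reg_token] + list(visual_tokens) + list(linguistic_tokens)

def build_joint_mask(visual_mask, linguistic_mask):
    """1 marks a real token and 0 marks padding; the [REG] token is always real."""
    return [1] + list(visual_mask) + list(linguistic_mask)

dim = 4  # toy channel dimension for the example
reg = [0.0] * dim
visual = [[1.0] * dim for _ in range(400)]
linguistic = [[2.0] * dim for _ in range(40)]

joint = build_joint_tokens(reg, visual, linguistic)
REG_INDEX = 0  # position whose output state would feed the prediction head
```

Because self-attention treats every position alike, the single sequence lets any token attend to any other token regardless of modality, which is the homogeneous fusion described above.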

Prediction Head. We leverage the output state of [REG] token from the V-L module as the input of our prediction head. To perform box coordinates prediction, we attach a regression block to the [REG] token. The regression block is implemented by an MLP with two ReLU activated 256-dim hidden layers and a linear output layer. The output of the prediction head is the 4-dim box coordinates.
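A regression block of this shape (two ReLU-activated 256-dim hidden layers and a linear 4-dim output) can be sketched in pure Python as follows. The random uniform weight initialization, zero biases, and helper names are assumptions for illustration only; no output activation or coordinate convention is implied.

```python
import random

def make_linear(n_in, n_out, rng):
    """Random weights and zero bias for an n_in -> n_out linear layer (illustrative init)."""
    scale = 1.0 / n_in ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def linear(x, layer):
    """Apply a linear layer to one vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def regression_head(reg_state, layers):
    """Hidden layers with ReLU, followed by a final linear layer."""
    hidden = reg_state
    for layer in layers[:-1]:
        hidden = [max(0.0, v) for v in linear(hidden, layer)]
    return linear(hidden, layers[-1])

rng = random.Random(0)
layers = [make_linear(256, 256, rng), make_linear(256, 256, rng), make_linear(256, 4, rng)]
reg_state = [rng.random() for _ in range(256)]  # stand-in for the [REG] output state
box = regression_head(reg_state, layers)  # four raw coordinate values
```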

3.3 Training Objective

Unlike many previous methods that ground referred objects based on a set of candidates (i.e., region proposals in two-stage methods and anchor boxes in one-stage methods), TransVG directly infers a 4-dim vector as the coordinates of the box to be grounded. This simplifies the process of target assignment and positive/negative example mining at the training stage, but it also introduces a scale problem. Specifically, the widely used smooth L1 loss tends to be large when we try to predict a large box, while it tends to be small when we try to predict a small one, even if the predictions have similar relative errors.

To address this problem, we normalize the coordinates of the ground-truth box by the scale of the image, and involve the generalized IoU loss [38] (GIoU loss), which is not affected by the scales.

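Let us denote the prediction as $b = (x, y, w, h)$, and the normalized ground-truth box as $\hat{b} = (\hat{x}, \hat{y}, \hat{w}, \hat{h})$. The training objective of our TransVG is:

$$\mathcal{L} = \mathcal{L}_{\text{smooth-l1}}(b, \hat{b}) + \lambda \cdot \mathcal{L}_{\text{giou}}(b, \hat{b}), \tag{5}$$

where $\mathcal{L}_{\text{smooth-l1}}(\cdot)$ and $\mathcal{L}_{\text{giou}}(\cdot)$ are the smooth L1 loss and GIoU loss, respectively. $\lambda$ is the weight coefficient of the GIoU loss to balance these two losses.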
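The combination of a smooth L1 term and a GIoU term can be sketched as follows. Several details are assumptions made for this example rather than facts from the paper: boxes are taken to be normalized (center x, center y, width, height) tuples with positive size, the smooth L1 threshold `beta` is 1, the smooth L1 term is averaged over the four coordinates, and the GIoU loss is taken as one minus the GIoU value.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean smooth L1 distance over the box coordinates."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total / len(pred)

def to_corners(box):
    """Convert (cx, cy, w, h) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

def giou(box_a, box_b):
    """Generalized IoU of two (cx, cy, w, h) boxes with positive area."""
    ax1, ay1, ax2, ay2 = to_corners(box_a)
    bx1, by1, bx2, by2 = to_corners(box_b)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Smallest axis-aligned box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (enclose - union) / enclose

def grounding_loss(pred, target, giou_weight=1.0):
    """Smooth L1 term plus weighted GIoU term."""
    return smooth_l1(pred, target) + giou_weight * (1.0 - giou(pred, target))
```

Since GIoU depends only on ratios of areas, scaling both boxes by the same factor leaves the second term unchanged, which is why it complements the scale-sensitive smooth L1 term.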

4 Experiments

4.1 Datasets

ReferItGame. ReferItGame [22] includes 20,000 images collected from the SAIAPR-12 dataset [14], and each image has one or a few regions with corresponding referring expressions. We follow the common practice to divide this dataset into three subsets, i.e., a train set with 54,127 referring expressions, a validation set with 5,842 referring expressions and a test set with 60,103 referring expressions. We use the validation set to conduct experimental analysis and compare our method with others on the test set.

Flickr30K Entities. Flickr30K Entities [34] augments the original Flickr30K [53] with short region phrase correspondence annotations. It contains 31,783 images with 427K referred entities. We follow the previous works [34, 35, 43, 50] to separate these images into 29,783 for training, 1,000 for validation, and 1,000 for testing.

RefCOCO/ RefCOCO+/ RefCOCOg. RefCOCO [55] includes 19,994 images with 50,000 referred objects. Each object has more than one referring expression, and there are 142,210 referring expressions in this dataset. The samples in RefCOCO are officially split into a train set with 120,624 expressions, a validation set with 10,834 expressions, a testA set with 5,657 expressions and a testB set with 5,095 expressions. Similarly, RefCOCO+ [55] contains 19,992 images with 49,856 referred objects and 141,564 referring expressions. It is also officially split into a train set with 120,191 expressions, a validation set with 10,758 expressions, a testA set with 5,726 expressions and a testB set with 4,889 expressions. RefCOCOg [29] has 25,799 images with 49,856 referred objects and expressions. There are two commonly used split protocols for this dataset. One is RefCOCOg-google [29], and the other is RefCOCOg-umd [32]. We report our performance on both RefCOCOg-google (val-g) and RefCOCOg-umd (val-u and test-u) to make comprehensive comparisons.

4.2 Implementation Details

Inputs. We set the input image size as $640 \times 640$ and the max expression length as 40. When performing image resizing, we keep the original aspect ratio of each image. The longer edge of an image is resized to 640, while the shorter one is padded to 640 with the mean value of RGB channels. Meanwhile, we cut off the language query if its length is longer than 38 (leaving one position for the [CLS] token and one position for the [SEP] token). Otherwise, we pad empty tokens after the [SEP] token to make the input length equal to 40. For both the input image and language expression, the padded pixel/word is recorded with a mask and will not be involved in the computation of transformers.
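The input preparation described above can be sketched with two small helpers. This is a simplified illustration: the query is treated as a list of whole words rather than the sub-word tokens a BERT tokenizer would produce, `[PAD]` is used as a placeholder name for the empty token, and rounding the resized edge to the nearest integer is an assumption.

```python
def resize_and_pad_dims(height, width, target=640):
    """Return (new_h, new_w, pad_h, pad_w) for an aspect-preserving resize plus padding."""
    scale = target / max(height, width)
    new_h, new_w = round(height * scale), round(width * scale)
    return new_h, new_w, target - new_h, target - new_w

def prepare_query(words, max_len=40):
    """Truncate to max_len - 2 words, wrap with [CLS]/[SEP], pad, and build a mask."""
    tokens = ["[CLS]"] + list(words[:max_len - 2]) + ["[SEP]"]
    mask = [1] * len(tokens) + [0] * (max_len - len(tokens))  # 0 marks padding
    tokens = tokens + ["[PAD]"] * (max_len - len(tokens))
    return tokens, mask

tokens, mask = prepare_query(["red", "car", "on", "the", "left"])
```

The mask is what allows padded positions to be excluded from the attention computation later on.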

Training Details. The whole architecture of our TransVG is optimized end-to-end with the AdamW optimizer. We set the initial learning rate of the V-L module and prediction head to $10^{-4}$, that of the visual branch and linguistic branch to $10^{-5}$, and the weight decay to $10^{-4}$. Our visual branch is initialized with the backbone and encoder of the DETR model [4], and our linguistic branch is initialized with the basic BERT model [11]. For the other components, the parameters are randomly initialized with Xavier initialization. On all the datasets except Flickr30K Entities, our model is trained for 90 epochs with the learning rate dropped by a factor of 10 after 60 epochs. On Flickr30K Entities, our model is trained for 60 epochs, with the learning rate dropped after 40 epochs. We set the batch size to 64. The weight coefficient $\lambda$ is set to 1. To avoid overfitting, we apply dropout after the multi-head self-attention layer and the FFN of each transformer encoder layer. The dropout ratio is set to 0.1 by default. We follow the common practice in [25, 50, 51] to perform data augmentation at the training stage.
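The step schedule described above (a single drop by a factor of 10) can be written as a one-line rule. In this sketch the base rate of 1e-4 is only an illustrative value, epochs are assumed to be counted from zero, and "after 60 epochs" is interpreted as epoch index 60 onward.

```python
def learning_rate(base_lr, epoch, drop_epoch=60, factor=10.0):
    """Step schedule: divide the base rate by `factor` from `drop_epoch` onward."""
    return base_lr / factor if epoch >= drop_epoch else base_lr

# 90-epoch schedule with a drop at epoch 60.
schedule = [learning_rate(1e-4, epoch) for epoch in range(90)]

# 60-epoch schedule with a drop at epoch 40, as used for the shorter training run.
short_schedule = [learning_rate(1e-4, epoch, drop_epoch=40) for epoch in range(60)]
```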

Inference. Since our TransVG directly outputs the box coordinates, there is no extra operation at the inference stage.

4.3 Comparisons with State-of-the-art Methods

To validate the merits of our proposed TransVG, we report our performance and compare it with other state-of-the-art methods on five visual grounding benchmarks, including ReferItGame [22], Flickr30K Entities [34], RefCOCO [55], RefCOCO+ [55], and RefCOCOg [29]. We follow the standard protocol to report the performance in terms of top-1 accuracy (%). Specifically, once the Jaccard overlap between the predicted region and the ground-truth box is above 0.5, the prediction is regarded as a correct one.
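The evaluation protocol can be sketched as follows: a prediction counts as correct when its Jaccard overlap (IoU) with the ground-truth box exceeds 0.5, and accuracy is the percentage of correct predictions. Boxes are assumed here to be (x1, y1, x2, y2) corner tuples, and the function names are hypothetical.

```python
def box_iou(box_a, box_b):
    """Jaccard overlap (IoU) of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def top1_accuracy(predictions, ground_truths, threshold=0.5):
    """Percentage of predictions whose IoU with the ground truth exceeds the threshold."""
    hits = sum(1 for p, g in zip(predictions, ground_truths) if box_iou(p, g) > threshold)
    return 100.0 * hits / len(predictions)
```

For instance, two 2 × 2 boxes shifted by one unit overlap with IoU 1/3, so that prediction would be counted as incorrect.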

ReferItGame. Table 1 shows the result comparison between state-of-the-art methods on the ReferItGame test set. We group the methods into two-stage methods, one-stage methods, and transformer-based methods. Among all the methods, TransVG achieves the best performance as the first transformer-based approach. With ResNet-50 backbone, TransVG achieves 69.76% top-1 accuracy and outperforms ZSGNet [39] with the same backbone network by 11.13%. By replacing ResNet-50 with a stronger ResNet-101, the performance boosts to 70.73%, which is 6.13% higher than the strongest competitor ReSC-Large for one-stage methods and 7.73% higher than the strongest competitor DDPN for two-stage methods, respectively.

In particular, we find the recurrent architecture in ReSC shares the same spirit with our stacking architecture in the visual-linguistic transformer that fuses the multi-modal context in multiple rounds. However, in ReSC, recurrent learning is only performed to construct the language sub-query, and this procedure is isolated from the sub-query attended visual feature modulation. In contrast, our TransVG embeds the visual and linguistic embedding into a common semantic space by homogeneously performing intra- and inter-modality context reasoning. The superiority of our performance empirically demonstrates the effectiveness of our unified visual-linguistic encoder and fusion module designs. It also validates that the complicated multi-modality fusion module can be replaced by a simple stack of transformer encoder layers.

Models                 Backbone      ReferItGame test   Flickr30K test
Two-stage methods:
CMN [19]               VGG16         28.33              -
VC [58]                VGG16         31.13              -
MAttNet [54]           ResNet-101    29.04              -
Similarity Net [43]    ResNet-101    34.54              60.89
CITE [35]              ResNet-101    35.07              61.33
DDPN [56]              ResNet-101    63.00              73.30
One-stage methods:
SSG [8]                DarkNet-53    54.24              -
ZSGNet [39]            ResNet-50     58.63              58.63
FAOA [51]              DarkNet-53    60.67              68.71
RCCF [25]              DLA-34        63.79              -
ReSC-Large [50]        DarkNet-53    64.60              69.28
Transformer-based methods:
TransVG (ours)         ResNet-50     69.76              78.47
TransVG (ours)         ResNet-101    70.73              79.10

Table 1: Comparisons with state-of-the-art methods on the test set of ReferItGame [22] and Flickr30K Entities [34] in terms of top-1 accuracy (%). The previous methods follow the two-stage or one-stage directions, while ours is transformer-based. We highlight the best and second best performance in the red and blue colors.

Flickr30K Entities. Table 1 also reports the performance of our TransVG on the Flickr30K Entities test set. On this dataset, our TransVG achieves 79.10% top-1 accuracy with a ResNet-101 backbone network, surpassing the recently proposed Similarity Net [43], CITE [35], DDPN [56], ZSGNet [39], FAOA [51], and ReSC-Large [50] by a remarkable margin (i.e., 5.80% absolute improvement over the previous state-of-the-art record).

                                       RefCOCO              RefCOCO+             RefCOCOg
Models           Venue     Backbone    val    testA  testB  val    testA  testB  val-g  val-u  test-u
Two-stage methods:
CMN [19]         CVPR’17   VGG16       -      71.03  65.77  -      54.32  47.76  57.47  -      -
VC [58]          CVPR’18   VGG16       -      73.33  67.44  -      58.40  53.18  62.30  -      -
ParalAttn [62]   CVPR’18   VGG16       -      75.31  65.52  -      61.34  50.86  58.03  -      -
MAttNet [54]     CVPR’18   ResNet-101  76.65  81.14  69.99  65.33  71.62  56.02  -      66.58  67.27
LGRANs [45]      CVPR’19   VGG16       -      76.60  66.40  -      64.00  53.40  61.78  -      -
DGA [48]         ICCV’19   VGG16       -      78.42  65.53  -      69.07  51.99  -      -      63.28
RvG-Tree [18]    TPAMI’19  ResNet-101  75.06  78.61  69.85  63.51  67.45  56.66  -      66.95  66.51
NMTree [26]      ICCV’19   ResNet-101  76.41  81.21  70.09  66.46  72.02  57.52  64.62  65.87  66.44
One-stage methods:
SSG [8]          arXiv’18  DarkNet-53  -      76.51  67.50  -      62.14  49.27  47.47  58.80  -
FAOA [51]        ICCV’19   DarkNet-53  72.54  74.35  68.50  56.81  60.23  49.60  56.12  61.33  60.36
RCCF [25]        CVPR’20   DLA-34      -      81.06  71.85  -      70.35  56.32  -      -      65.73
ReSC-Large [50]  ECCV’20   DarkNet-53  77.63  80.45  72.30  63.59  68.36  56.81  63.12  67.30  67.20
Transformer-based methods:
TransVG (ours)   -         ResNet-50   80.32  82.67  78.12  63.50  68.15  55.63  66.56  67.66  67.44
TransVG (ours)   -         ResNet-101  81.02  82.72  78.35  64.82  70.70  56.94  67.02  68.67  67.73

Table 2: Comparisons with state-of-the-art methods on RefCOCO [55], RefCOCO+ [55] and RefCOCOg [29] in terms of top-1 accuracy (%). We highlight the best and second best performance in the red and blue colors.

RefCOCO/RefCOCO+/RefCOCOg. To further validate the effectiveness of our proposed TransVG, we also conduct experiments on the RefCOCO, RefCOCO+ and RefCOCOg datasets. The top-1 accuracy (%) of our method, together with that of other state-of-the-art methods, is reported in Table 2. Our TransVG consistently achieves the best performance on RefCOCO and RefCOCOg for all the subsets and splits. Remarkably, we achieve 78.35% on the RefCOCO testB set, a 6.05% absolute improvement over the previous state-of-the-art result. When performing grounding on longer expressions (on the RefCOCOg dataset), our method also works well, which further validates the effectiveness of our neat architecture in processing complicated queries. On RefCOCO+, TransVG also achieves performance comparable to the best records. We study the failure cases and find some extreme examples whose expressions are not suitable for generating embeddings with transformers. For example, a query that consists only of the number “32” in the annotation degenerates our linguistic transformer to an MLP.

Among the competitors, MAttNet [54] is the most representative method that devises multi-modal fusion modules with pre-defined structures (i.e., modular attention networks to separately model subject, location and relationship). When we compare our model with MAttNet in Table 1 and Table 2, we find that MAttNet shows results comparable to our TransVG on RefCOCO/RefCOCO+/RefCOCOg, but lags behind our TransVG on ReferItGame. The reason is that the pre-defined relationship in multi-modal fusion modules makes it easy to overfit to specific datasets (e.g., with specific scenarios, query lengths, and relationships). Our TransVG avoids this problem by design, establishing intra-modality and inter-modality correspondence with the flexible and adaptive attention mechanism.

4.4 Ablation Study

In this section, we conduct ablative experiments to verify the effectiveness of each component in our TransVG framework. We exploit ResNet-50 as the backbone network of the visual branch and evaluate the models on the ReferItGame validation set. All of the models are trained for 90 epochs as described in the implementation details.

Design of the [REG] Token. We perform the study on the design of the [REG] token in our framework, and report the results in Table 3. Specifically, there are several choices to construct the initial state of the [REG] token (i.e., the embedding appended to advanced visual embedding and advanced linguistic embedding as in Equation (4)). We detail these designs and analysis as follows:

  • Average pooled visual tokens. We perform average pooling over the visual tokens and exploit the average-pooled embedding as the initial state of [REG] token.

  • Max pooled visual tokens. We take the max-pooled visual token embedding as the initial [REG] token.

  • Average pooled linguistic tokens. Similar to the first choice, but using the linguistic tokens.

  • Max pooled linguistic tokens. Similar to the second choice, but using the linguistic tokens.

  • Sharing with [CLS] token. We use the [CLS] token of the linguistic embedding to play the role of the [REG] token. Concretely, the [CLS] token output by the V-L module is fed into the prediction head.

  • Learnable embedding*. This is our default setting: the [REG] token embedding is randomly initialized at the beginning of the training stage, and its parameters are optimized together with the whole model.

Our proposed design, which uses a learnable embedding, achieves 72.50% and 69.76% on the validation and test sets of ReferItGame, respectively, which is the best performance among all the designs. In the other designs, the initial [REG] token is generated from either visual or linguistic tokens, which biases it toward the specific prior context of the corresponding modality. In contrast, the learnable embedding tends to be more equitable and flexible when performing relation reasoning in the visual-linguistic transformer.
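The pooling-based and learnable initial states listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the toy token values, the helper names, and the Gaussian initialization scale of 0.02 are assumptions, and the "sharing with [CLS]" variant is omitted because it depends on the linguistic branch itself.

```python
import random


def average_pool(tokens):
    """Element-wise mean over a list of equally sized token embeddings."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def max_pool(tokens):
    """Element-wise maximum over a list of equally sized token embeddings."""
    return [max(col) for col in zip(*tokens)]


def learnable_init(dim, seed=0, scale=0.02):
    """Random starting value of a learnable embedding (scale is an assumption).

    In a real model this vector would be a trainable parameter that is
    updated together with the rest of the network.
    """
    rng = random.Random(seed)
    return [rng.gauss(0.0, scale) for _ in range(dim)]


# Toy 2-dimensional embeddings (hypothetical values).
visual_tokens = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
linguistic_tokens = [[0.0, 1.0], [2.0, 3.0]]

reg_candidates = {
    "avg_visual": average_pool(visual_tokens),
    "max_visual": max_pool(visual_tokens),
    "avg_linguistic": average_pool(linguistic_tokens),
    "max_linguistic": max_pool(linguistic_tokens),
    "learnable": learnable_init(dim=2),
}
```

The four pooled candidates are functions of a single modality's tokens, whereas the learnable candidate does not depend on either input, which is the property the analysis above attributes its flexibility to.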

| Initial State of [REG] Token | Acc@val | Acc@test |
|---|---|---|
| Average pooled visual tokens | 71.37 | 69.27 |
| Max pooled visual tokens | 70.91 | 69.11 |
| Average pooled linguistic tokens | 69.96 | 68.15 |
| Max pooled linguistic tokens | 70.37 | 68.46 |
| Sharing with [CLS] token | 70.84 | 69.01 |
| Learnable embedding* | 72.50 | 69.76 |

Table 3: Ablative experiments of the [REG] token design in our framework. The evaluation metric is the top-1 accuracy (%) on the ReferItGame validation and test sets. The initial state of the [REG] token is either obtained from the visual/linguistic tokens output by the corresponding branch or by pre-appending a learnable token embedding.

Transformers in Visual and Linguistic Branches. We study the role of the transformers in the visual branch and the linguistic branch (i.e., the visual transformer and the linguistic transformer). Table 4 summarizes the results of several models with or without the visual transformer and the linguistic transformer. The baseline model without either the visual transformer or the linguistic transformer reports an accuracy of 64.24%. When we attach only the visual transformer or only the linguistic transformer, the accuracy improves to 68.48% and 66.78%, respectively. With the complete architecture, the performance is further boosted to 69.76% on the ReferItGame test set. This result demonstrates the importance of transformers in the visual and linguistic branches for capturing intra-modality global context before performing multi-modal fusion.
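To make the comparison concrete, the absolute gains over the baseline can be worked out directly from the four accuracies quoted in this paragraph. The snippet below is only arithmetic on those reported numbers; the dictionary layout and variable names are illustrative.

```python
# Top-1 accuracy (%) on the ReferItGame test set, keyed by
# (visual branch, linguistic branch); values are taken from the text above.
accuracy = {
    ("w/o Tr.", "w/o Tr."): 64.24,  # baseline without either transformer
    ("w/ Tr.", "w/o Tr."): 68.48,   # visual transformer only
    ("w/o Tr.", "w/ Tr."): 66.78,   # linguistic transformer only
    ("w/ Tr.", "w/ Tr."): 69.76,    # complete architecture
}

baseline = accuracy[("w/o Tr.", "w/o Tr.")]

# Absolute improvement of each configuration over the baseline.
gains = {config: round(acc - baseline, 2) for config, acc in accuracy.items()}

for config, gain in gains.items():
    print(config, f"+{gain:.2f}")
```

The visual transformer alone contributes 4.24 points and the linguistic transformer alone 2.54 points, while using both yields 5.52 points, less than the sum of the individual gains.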

Figure 3: Qualitative results of our TransVG on the RefCOCOg test set. We show the [REG] token’s attention to visual tokens in the top row. Blue and red boxes are the predicted regions and the ground truths, respectively.

4.5 Qualitative Results

We showcase the qualitative results of four examples from the RefCOCOg [29] test set in Figure 3. We observe that our approach can successfully model queries with complicated relationships, e.g., “orange between other oranges and a banana” in Figure 3 (c). The first row of Figure 3 visualizes the [REG] token’s attention to the visual tokens in the visual-linguistic transformer. TransVG generates interpretable attention on the referred object that corresponds to the overall object shape and location.

Motivated by the correspondence between visual attention and predicted regions, we visualize the [REG] token’s attention score on the visual tokens in the visual-linguistic transformer’s intermediate layers to better understand TransVG. Figure 4 shows the [REG] token’s attention score on the visual tokens from the second, fourth and sixth transformer encoder layers. In the early layer (layer 2), we observe that the [REG] token captures the global context by attending to multiple regions across the whole image. In the middle layer (layer 4), the [REG] token tends to attend to discriminative regions that are closely related to the referred object (e.g., the bus behind the man in the first example, which indicates the scene is on the road). In the final layer (layer 6), TransVG attends to the referred object and generates a more accurate attention prediction for the object’s shape, which enables the model to regress the target’s coordinates correctly.
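A visualization of this kind can be produced from one layer's attention matrix roughly as follows. This is a sketch under stated assumptions rather than the authors' code: it assumes a single (head-averaged) attention matrix per layer, a token order of [REG], then visual tokens, then linguistic tokens, and a row-major layout of the visual tokens. The function name reg_attention_map and the toy weights are hypothetical.

```python
def reg_attention_map(attn, num_visual, height, width, reg_index=0):
    """Reshape the [REG] token's attention over visual tokens into a 2D map.

    attn[i][j] is the attention weight from token i to token j.
    Assumed token order: [REG], visual tokens, then linguistic tokens.
    """
    if height * width != num_visual:
        raise ValueError("height * width must equal num_visual")
    reg_row = attn[reg_index]
    start = reg_index + 1
    visual = reg_row[start:start + num_visual]
    # Renormalize so the map sums to 1 over the visual tokens only.
    total = sum(visual)
    if total > 0:
        visual = [v / total for v in visual]
    # Row-major reshape into a height x width grid.
    return [visual[r * width:(r + 1) * width] for r in range(height)]


# Toy [REG] row: 1 [REG] + 4 visual (2 x 2 grid) + 2 linguistic tokens.
attn = [[0.10, 0.05, 0.45, 0.05, 0.05, 0.20, 0.10]]
attention_map = reg_attention_map(attn, num_visual=4, height=2, width=2)
```

Applying the same function to the attention matrices of several layers and upsampling each map to the image size would give per-layer heat maps comparable to those described above.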

| Visual Branch | Linguistic Branch | Accuracy (%) | Runtime (ms) |
|---|---|---|---|
| w/o Tr. | w/o Tr. | 64.24 | 33.67 |
Table 4: Ablative experiments of the visual transformer and linguistic transformer in our framework. The performance is evaluated on the test set of ReferItGame [22] in terms of top-1 accuracy (%). “Tr.” represents transformer.
Figure 4: Visualization of the [REG] token’s attention score on visual tokens from the second (layer 2), fourth (layer 4) and sixth (layer 6) encoder layers of the visual-linguistic transformer.

5 Conclusion

In this paper, we present TransVG, a transformer-based framework for visual grounding. Instead of leveraging complex manually designed fusion modules, TransVG uses a simple stack of transformer encoders to perform the multi-modal fusion and reasoning for the visual grounding task. Extensive experiments indicate that TransVG’s multi-modal transformer layers effectively perform step-by-step fusion and reasoning, which enables TransVG to set a series of new state-of-the-art records on multiple datasets. TransVG serves as a new framework and exhibits great potential for future investigation.


  • [1] J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §3.1.
  • [2] M. Bajaj, L. Wang, and L. Sigal (2019) G3raphGround: graph-based language grounding. In ICCV, pp. 4281–4290. Cited by: §2.1.
  • [3] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui (2010) Visual object tracking using adaptive correlation filters. In CVPR, pp. 2544–2550. Cited by: §2.1.
  • [4] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. In ECCV, pp. 213–229. Cited by: §2.2, §3.2, §4.2.
  • [5] H. Chen, Y. Wang, T. Guo, C. Xu, Y. Deng, Z. Liu, S. Ma, C. Xu, C. Xu, and W. Gao (2020) Pre-trained image processing transformer. arXiv preprint arXiv:2012.00364. Cited by: §2.2.
  • [6] K. Chen, R. Kovvuri, and R. Nevatia (2017) Query-guided regression network with context policy for phrase grounding. In ICCV, pp. 824–832. Cited by: §2.1.
  • [7] M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, and I. Sutskever (2020) Generative pretraining from pixels. In ICML, pp. 1691–1703. Cited by: §2.2.
  • [8] X. Chen, L. Ma, J. Chen, Z. Jie, W. Liu, and J. Luo (2018) Real-time referring expression comprehension by single-stage grounding network. In arXiv, Cited by: §1, §2.1, Table 1, Table 2.
  • [9] Y. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu (2020) Uniter: universal image-text representation learning. In ECCV, pp. 104–120. Cited by: §2.2.
  • [10] M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and L. Kaiser (2018) Universal transformers. In ICLR, Cited by: §2.2, §3.2.
  • [11] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2.2, §2.2, §3.2, §3.2, §4.2.
  • [12] P. Dogan, L. Sigal, and M. Gross (2019) Neural sequential phrase grounding (seqground). In CVPR, pp. 4175–4184. Cited by: §2.1.
  • [13] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2021) An image is worth 16x16 words: transformers for image recognition at scale. In ICLR, Cited by: §2.2.
  • [14] H. J. Escalante, C. A. Hernández, J. A. Gonzalez, A. López-López, M. Montes, E. F. Morales, L. E. Sucar, L. Villaseñor, and M. Grubinger (2010) The segmented and annotated iapr tc-12 benchmark. CVIU 114, pp. 419–428. Cited by: §4.1.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §3.2.
  • [16] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista (2014) High-speed tracking with kernelized correlation filters. TPAMI 37, pp. 583–596. Cited by: §2.1.
  • [17] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9, pp. 1735–1780. Cited by: §2.2.
  • [18] R. Hong, D. Liu, X. Mo, X. He, and H. Zhang (2019) Learning to compose and reason with language tree structures for visual grounding. TPAMI. Cited by: §2.1, Table 2.
  • [19] R. Hu, M. Rohrbach, J. Andreas, T. Darrell, and K. Saenko (2017) Modeling relationships in referential expressions with compositional modular networks. In CVPR, pp. 1115–1124. Cited by: §2.1, Table 1, Table 2.
  • [20] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell (2016) Natural language object retrieval. In CVPR, pp. 4555–4564. Cited by: §1.
  • [21] L. Huang, J. Tan, J. Liu, and J. Yuan (2020) Hand-transformer: non-autoregressive structured modeling for 3d hand pose estimation. In ECCV, pp. 17–33. Cited by: §2.2.
  • [22] S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg (2014) Referitgame: referring to objects in photographs of natural scenes. In EMNLP, pp. 787–798. Cited by: §1, §1, §4.1, §4.3, Table 1, Table 4.
  • [23] J. Li, Y. Wei, X. Liang, F. Zhao, J. Li, T. Xu, and J. Feng (2017) Deep attribute-preserving metric learning for natural language object retrieval. In MM, pp. 181–189. Cited by: §1.
  • [24] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, et al. (2020) Oscar: object-semantics aligned pre-training for vision-language tasks. In ECCV, pp. 121–137. Cited by: §2.2.
  • [25] Y. Liao, S. Liu, G. Li, F. Wang, Y. Chen, C. Qian, and B. Li (2020) A real-time cross-modality correlation filtering method for referring expression comprehension. In CVPR, pp. 10880–10889. Cited by: §1, §2.1, §2.1, §4.2, Table 1, Table 2.
  • [26] D. Liu, H. Zhang, F. Wu, and Z. Zha (2019) Learning to assemble neural module tree networks for visual grounding. In ICCV, pp. 4673–4682. Cited by: §1, §2.1, §2.1, Table 2.
  • [27] X. Liu, Z. Wang, J. Shao, X. Wang, and H. Li (2019) Improving referring expression grounding with cross-modal attention-guided erasing. In CVPR, pp. 1950–1959. Cited by: §2.1.
  • [28] J. Lu, D. Batra, D. Parikh, and S. Lee (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, Cited by: §2.2.
  • [29] J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy (2016) Generation and comprehension of unambiguous object descriptions. In CVPR, pp. 11–20. Cited by: §1, §1, §1, §2.1, §4.1, §4.3, §4.5, Table 2.
  • [30] T. Mikolov, M. Karafiát, L. Burget, J. Černockỳ, and S. Khudanpur (2010) Recurrent neural network based language model. In InterSpeech, Cited by: §2.2.
  • [31] N. Moritz, T. Hori, and J. Le (2020) Streaming automatic speech recognition with the transformer model. In ICASSP, pp. 6074–6078. Cited by: §2.2.
  • [32] V. K. Nagaraja, V. I. Morariu, and L. S. Davis (2016) Modeling context between objects for referring expression understanding. In ECCV, pp. 792–807. Cited by: §1, §2.1, §4.1.
  • [33] N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, N. Shazeer, A. Ku, and D. Tran (2018) Image transformer. In ICML, pp. 4055–4064. Cited by: §3.2.
  • [34] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik (2017) Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. IJCV 123 (1), pp. 74. Cited by: §1, §1, §4.1, §4.3, Table 1.
  • [35] B. A. Plummer, P. Kordas, M. H. Kiapour, S. Zheng, R. Piramuthu, and S. Lazebnik (2018) Conditional image-text embedding networks. In ECCV, pp. 249–264. Cited by: §2.1, §4.1, §4.3, Table 1.
  • [36] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683. Cited by: §2.2, §3.2.
  • [37] J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §2.1.
  • [38] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese (2019) Generalized intersection over union: a metric and a loss for bounding box regression. In CVPR, pp. 658–666. Cited by: §3.3.
  • [39] A. Sadhu, K. Chen, and R. Nevatia (2019) Zero-shot grounding of objects from natural language queries. In ICCV, pp. 4694–4703. Cited by: §2.1, §4.3, §4.3, Table 1.
  • [40] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai (2020) Vl-bert: pre-training of generic visual-linguistic representations. Cited by: §2.2.
  • [41] K. S. Tai, R. Socher, and C. D. Manning (2015) Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075. Cited by: §2.2.
  • [42] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, Cited by: §2.2, §3.1, §3.2.
  • [43] L. Wang, Y. Li, J. Huang, and S. Lazebnik (2018) Learning two-branch neural networks for image-text matching tasks. TPAMI 41, pp. 394–407. Cited by: §1, §1, §1, §2.1, §2.1, §4.1, §4.3, Table 1.
  • [44] L. Wang, Y. Li, and S. Lazebnik (2016) Learning deep structure-preserving image-text embeddings. In CVPR, pp. 5005–5013. Cited by: §2.1.
  • [45] P. Wang, Q. Wu, J. Cao, C. Shen, L. Gao, and A. v. d. Hengel (2019) Neighbourhood watch: referring expression comprehension via language-guided graph attention networks. In CVPR, pp. 1960–1968. Cited by: §1, §2.1, §2.1, Table 2.
  • [46] Y. Wang, A. Mohamed, D. Le, C. Liu, A. Xiao, J. Mahadeokar, H. Huang, A. Tjandra, X. Zhang, F. Zhang, et al. (2020) Transformer-based acoustic modeling for hybrid speech recognition. In ICASSP, pp. 6874–6878. Cited by: §2.2.
  • [47] F. Yang, H. Yang, J. Fu, H. Lu, and B. Guo (2020) Learning texture transformer network for image super-resolution. In CVPR, pp. 5791–5800. Cited by: §2.2.
  • [48] S. Yang, G. Li, and Y. Yu (2019) Dynamic graph attention for referring expression comprehension. In ICCV, pp. 4644–4653. Cited by: §1, §2.1, §2.1, Table 2.
  • [49] S. Yang, G. Li, and Y. Yu (2020) Graph-structured referring expression reasoning in the wild. In CVPR, pp. 9952–9961. Cited by: §1.
  • [50] Z. Yang, T. Chen, L. Wang, and J. Luo (2020) Improving one-stage visual grounding by recursive sub-query construction. In ECCV, Cited by: §1, §2.1, §2.1, §4.1, §4.2, §4.3, Table 1, Table 2.
  • [51] Z. Yang, B. Gong, L. Wang, W. Huang, D. Yu, and J. Luo (2019) A fast and accurate one-stage approach to visual grounding. In ICCV, pp. 4683–4693. Cited by: §1, §1, §1, §2.1, §2.1, §4.2, §4.3, Table 1, Table 2.
  • [52] Z. Yang, Y. Lu, J. Wang, X. Yin, D. Florencio, L. Wang, C. Zhang, L. Zhang, and J. Luo (2020) TAP: text-aware pre-training for text-vqa and text-caption. arXiv preprint arXiv:2012.04638. Cited by: §2.2.
  • [53] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier (2014) From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. ACL 2, pp. 67–78. Cited by: §4.1.
  • [54] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg (2018) Mattnet: modular attention network for referring expression comprehension. In CVPR, pp. 1307–1315. Cited by: §1, §2.1, §2.1, §4.3, Table 1, Table 2.
  • [55] L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg (2016) Modeling context in referring expressions. In ECCV, pp. 69–85. Cited by: §1, §1, §1, §2.1, §4.1, §4.3, Table 2.
  • [56] Z. Yu, J. Yu, C. Xiang, Z. Zhao, Q. Tian, and D. Tao (2018) Rethinking diversified and discriminative proposal generation for visual grounding. In IJCAI, Cited by: §4.3, Table 1.
  • [57] Y. Zeng, J. Fu, and H. Chao (2020) Learning joint spatial-temporal transformations for video inpainting. In ECCV, pp. 528–543. Cited by: §2.2.
  • [58] H. Zhang, Y. Niu, and S. Chang (2018) Grounding referring expressions in images by variational context. In CVPR, pp. 4158–4166. Cited by: §2.1, §2.1, Table 1, Table 2.
  • [59] Y. Zhang, L. Yuan, Y. Guo, Z. He, I. Huang, and H. Lee (2017) Discriminative bimodal networks for visual localization and detection with natural language queries. In CVPR, pp. 557–566. Cited by: §2.1.
  • [60] J. Zhu, Y. Xia, L. Wu, D. He, T. Qin, W. Zhou, H. Li, and T. Liu (2020) Incorporating bert into neural machine translation. In ICLR, Cited by: §2.2.
  • [61] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai (2021) Deformable detr: deformable transformers for end-to-end object detection. In ICLR, Cited by: §2.2.
  • [62] B. Zhuang, Q. Wu, C. Shen, I. Reid, and A. van den Hengel (2018) Parallel attention: a unified framework for visual object discovery through dialogs and queries. In CVPR, pp. 4252–4261. Cited by: §2.1, Table 2.