Recently, DETR proposed a novel detection paradigm based on the transformer architecture. This kind of detection head predicts results by performing several multi-head attention operations, named cross-attention (Eq. 1), between object queries and feature maps, and it shows good performance on detection tasks. Many other works [7, 6] have therefore further studied this transformer-based detection paradigm and extended it to other vision tasks.
DETR and its derived models all update the object queries after each cross-attention; however, they do not renew the position encoding of the object queries, named the query position. They adopt an identical query position for every cross-attention layer, so the position information that the query position carries is never updated. In cross-attention, the query position, in addition to the object queries, interacts with the feature map to strengthen attention on the regions that the query position expresses. Without the latest location information of the object queries embedded in the query position, the model therefore needs extra time to learn the newest regions that the query position should express and focus on.
To fix this issue, we propose the GQPos method (Sec. 3.1): the query position is updated whenever the object queries are updated, which adds no extra learning burden. To embed the newest location information of the object queries into the query position, the positions predicted from the latest object queries are encoded and used as the query position of the next cross-attention. Since the query position carries the newest location information, the model is freed from this extra learning. Cosine positional encoding is chosen to encode locations so that the procedure brings only a little overhead. Visualization on DETR (Figure 4) shows that after applying GQPos, the regions that DETR's decoder focuses on gradually move from the center to the boundary of the object.
Transformer-based detection heads also try to use multi-scale feature maps to improve detection of objects at all scales. On classical CNN models, this can be achieved by fusing multi-scale features. However, it is difficult for the transformer to perform attention on a high-resolution feature map, because the complexity of attention is quadratic in the number of pixels, so feature fusion alone has limited effect on transformer-based detection heads. We therefore propose the Similar Attention (SiA) fusion scheme (Sec. 3.2). Considering that the distributions of attention maps at different scales should be similar, we propose that, besides the feature maps, the multi-scale attention weight maps can also be fused. Specifically, the low-resolution attention weight map, which is easier to learn, is interpolated and used as the prior of the high-resolution attention map. As a result, the learning of the high-resolution attention map is accelerated.
The effectiveness of GQPos and SiA is verified on two tasks, object detection and human-object interaction. For object detection, DETR with GQPos in the single-scale setting surpasses the original counterpart trained with a 10x training schedule. Furthermore, when combined with the proposed SiA method in the multi-scale setting, the model achieves state-of-the-art mAP with a ResNet backbone. Experiments on SMCA and YoloS also show consistent improvements, which further verifies the effectiveness of GQPos. For human-object interaction, the performance of single-scale and multi-scale HoiTransformer is also improved by applying GQPos and SiA, respectively.
We hope our research will bring new thoughts to the community about the query position and attention weight fusion of transformer-based detection heads. Our work provides the following three contributions:
- We analyze the mechanism of the query position in transformer-based detection heads and find that the query position is not fully optimized to carry the latest location information. We propose GQPos (Sec. 3.1) to iteratively embed the latest location information of the object queries into the query position.
- Feature fusion has limited effect on transformer-based detection heads because of the high complexity of attention, which hinders them from improving detection performance at all scales. We therefore propose SiA (Sec. 3.2), which fuses multi-scale attention weight maps to accelerate the learning of the high-resolution attention map by interpolating the well-learned low-resolution attention map.
- The proposed GQPos and SiA methods show consistent improvements on object detection and human-object interaction tasks. In particular, with GQPos applied, DETR surpasses the performance of the original version trained with a 10x training schedule. Furthermore, when combined with SiA, DETR achieves state-of-the-art mAP in the multi-scale setting.
2 Related Work
2.1 Object Detection
In recent years, deep learning has been successfully applied to object detection. Deep-learning-based object detection frameworks can be divided into several categories: two-stage and one-stage, or anchor-free and anchor-based. Recently, end-to-end object detection has become a hot research topic.
Two-stage detectors extract candidate feature regions in the first stage, and object classification and location estimation are then performed in the second stage. One-stage detectors, such as Yolo and SSD, perform object classification and location prediction directly on dense anchors. At present, the most popular two-stage and one-stage detectors contain complex hand-crafted components, such as NMS and anchors, and the design of these components has a great impact on the final detection results.
Recently, end-to-end object detectors have become more and more popular. They remove complex post-processing such as NMS and achieve one-to-one matching between targets and candidates with the Hungarian algorithm. DETR and POTO are two successful examples. Moreover, DETR has successfully introduced the transformer into the field of object detection, but it is limited by slow convergence.
To address the slow convergence of DETR, Deformable DETR argues that the high computational complexity of attention on images is the cause, especially since the complexity of the encoder's self-attention is quadratic in the number of image pixels, and it therefore uses sparsely sampled attention to accelerate convergence. UP-DETR uses self-supervised pre-training to improve DETR, while TSP-RCNN/FCOS argues that the existing Hungarian matching algorithm and the cross-attention of the decoder cause the slow convergence, and that the decoder hurts performance on small objects; it therefore only adds the encoder to the original FCOS and Faster RCNN. Adaptive clustering transformer (ACT) proposes clustering to reduce the computational complexity of the encoder's self-attention. Recently, SMCA improved the sparse attention of Deformable DETR by generating a Gaussian map centered on the reference point to enhance the attention near that point; compared with deformable attention, it uses more global information. DETR's detection paradigm is also used for tasks such as human-object interaction, for example in HoiTransformer, which predicts people, objects, and their interactions in images. This paradigm has also been combined with ViT, as in YoloS, which adds 100 object queries to the original patches of the Vision Transformer to perform multi-head attention and predicts objects from these added object queries.
2.2 Transformer In Vision
The transformer structure is based on self-attention and cross-attention mechanisms that learn the relationships between the elements of the input sequence. Unlike RNNs and other recurrent networks, which can only process the input sequence recursively and focus on the short-term context, the transformer can learn long-term relationships in the input sequence and has a higher degree of parallelism. One characteristic of the transformer structure is its scalability to high-complexity models and large-scale datasets, and, compared with convolutional neural networks and recurrent networks, the transformer does not need prior assumptions or knowledge about the problem. Therefore, the transformer can be pre-trained well on large-scale unlabeled datasets and then used in downstream tasks after fine-tuning to get the expected results.
The great success of the transformer in natural language processing (NLP) has aroused the interest of the computer vision community. The transformer and its variants have been successfully used in image classification [5, 27], segmentation, image super-resolution, object detection, video understanding [23, 8], image generation, text-image synthesis, and visual question answering [25, 22]. DETR is a typical application of the transformer in object detection. In DETR, the feature information extracted by the CNN is encoded by the encoder of the transformer. Then, in the decoder, a series of object queries interact with these encoded features through cross-attention to predict the locations and classes of objects in the image.
In this section, we introduce two novel methods, Guided Query Position (GQPos) and Similar Attention (SiA). The overall architecture of DETR with GQPos and SiA is shown in Figure 1. The extracted features are first enhanced by the encoder's self-attention, and the enhanced features are then fed to the decoder layers to perform cross-attention with the object queries, which are embedded with the query position. The outputs of the decoder layers are used for predictions. SiA is used in each cross-attention layer, and GQPos iteratively guides the query position.
3.1 Guided Query Position
When cross-attention is performed between the object queries and the input feature map, the information of the feature map is embedded into the object queries to make predictions. In a typical cross-attention (Eq. 1), the output object queries of a decoder layer are computed from the object queries of the previous layer, the query position, and the enhanced feature map output by the encoder, with the attention normalized by the feature dimension. As the encoding of the object queries, the query position indicates the location information of the predicted objects. The term that pairs the object queries with the feature map enables the feature map's information to be embedded into the object queries, and the term that pairs the query position with the feature map strengthens attention on the regions that the query position expresses. Previous models do not update the query position term, so the model takes extra time to learn the latest regions that the query position should express.
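The two terms discussed above can be made concrete with a small NumPy sketch. It is only an illustration under stated assumptions: single-head attention, no learned projections, and hypothetical names (`attention_logit_terms`, `cross_attention`, `feat_pos`) that do not come from the paper. It shows that the attention logits split into a content term (object queries against the feature map) and a position term (query position against the feature map).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def attention_logit_terms(queries, query_pos, keys):
    """Split the attention logits into a content term and a position term."""
    scale = np.sqrt(queries.shape[-1])
    content_term = queries @ keys.T / scale      # object queries vs. feature map
    position_term = query_pos @ keys.T / scale   # query position vs. feature map
    return content_term, position_term

def cross_attention(queries, query_pos, feats, feat_pos):
    """Single-head cross-attention without learned projections (a simplification)."""
    keys = feats + feat_pos                      # feature map plus its spatial encoding
    content_term, position_term = attention_logit_terms(queries, query_pos, keys)
    weights = softmax(content_term + position_term, axis=-1)  # (num_queries, num_pixels)
    return weights @ feats, weights

# Toy sizes and random inputs, chosen only for illustration.
rng = np.random.default_rng(0)
num_queries, num_pixels, dim = 3, 20, 8
queries = rng.normal(size=(num_queries, dim))
query_pos = rng.normal(size=(num_queries, dim))
feats = rng.normal(size=(num_pixels, dim))
feat_pos = rng.normal(size=(num_pixels, dim))

updated_queries, weights = cross_attention(queries, query_pos, feats, feat_pos)
print(updated_queries.shape, weights.shape)
```

If `query_pos` is held fixed across layers, the position term keeps pointing at the same regions even though `queries` change, which is the situation the paragraph above describes.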
Different from previous methods, the proposed GQPos updates the query position across the cross-attention layers. The motivation of GQPos is that the update of the query position should be guided so that it embeds the latest position information of the object queries. We therefore first calculate the locations of the objects predicted by the output object queries of the former layer, namely the center, height, and width of each predicted object. We then encode these predicted locations with the cosine positional encoding, in which each dimension of the encoding corresponds to a sinusoid. In this way, the location information of the latest object queries is embedded into the query position. Finally, we add a linear projection on the encoded query position, and the cross-attention of the next layer uses this updated query position instead of a fixed one.
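As a rough illustration of this encoding step, the sketch below encodes normalized boxes with a DETR-style sinusoidal scheme. The details are assumptions rather than the paper's exact formula: the function name `sine_encode_boxes`, the `temperature` of 10000, the split of the embedding into four equal chunks (one per box coordinate), and the interleaving of sine and cosine channels.

```python
import numpy as np

def sine_encode_boxes(boxes, dim=256, temperature=10000.0):
    """Encode (N, 4) boxes with coordinates in [0, 1] into (N, dim) sinusoidal features.

    Each of the four coordinates (center x, center y, height, width) gets
    dim // 4 channels; even channels use sin, odd channels use cos.
    """
    boxes = np.asarray(boxes, dtype=np.float64)
    if dim % 8 != 0:
        raise ValueError("dim must be divisible by 8 (4 coordinates, sin/cos pairs)")
    per_coord = dim // 4
    idx = np.arange(per_coord)
    # Channels 2k and 2k+1 share one frequency, forming a sin/cos pair.
    freq = temperature ** (2 * (idx // 2) / per_coord)
    angles = boxes[:, :, None] * (2 * np.pi) / freq   # (N, 4, per_coord)
    enc = np.empty_like(angles)
    enc[..., 0::2] = np.sin(angles[..., 0::2])
    enc[..., 1::2] = np.cos(angles[..., 1::2])
    return enc.reshape(boxes.shape[0], dim)

# Two hypothetical predicted boxes (center x, center y, height, width).
boxes = np.array([[0.5, 0.5, 0.2, 0.3],
                  [0.1, 0.8, 0.4, 0.1]])
query_pos = sine_encode_boxes(boxes, dim=16)
print(query_pos.shape)  # (2, 16)
```

Because the encoding has no learnable parameters, it adds very little overhead; only the linear projection applied afterwards is learned.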
Here the cross-attention still operates on the enhanced feature map. The structure of GQPos is illustrated in Figure 2.
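The iterative update can be sketched as a loop over decoder layers. This is a simplified illustration, not the authors' implementation: it omits self-attention, feed-forward layers, normalization, and multi-head projections, it shares one hypothetical box head (`box_head`) and one projection (`pos_proj`) across layers, and it also predicts boxes from the initial queries before the first layer.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sine_encode(boxes, dim, temperature=10000.0):
    """Compact sinusoidal encoding of (N, 4) boxes into (N, dim) (assumed scheme)."""
    per_coord = dim // 4
    idx = np.arange(per_coord)
    freq = temperature ** (2 * (idx // 2) / per_coord)
    angles = boxes[:, :, None] * (2 * np.pi) / freq
    enc = np.where(idx % 2 == 0, np.sin(angles), np.cos(angles))
    return enc.reshape(boxes.shape[0], dim)

def gqpos_decoder(queries, feats, feat_pos, box_head, pos_proj, num_layers=3):
    """Run cross-attention layers whose query position follows the latest box predictions."""
    dim = queries.shape[-1]
    keys = feats + feat_pos
    boxes_per_layer = []
    for _ in range(num_layers):
        # 1. Predict boxes (center x, center y, height, width) from the latest queries.
        boxes = sigmoid(queries @ box_head)
        # 2. Encode the boxes and project them to obtain the guided query position.
        query_pos = sine_encode(boxes, dim) @ pos_proj
        # 3. Cross-attention with the updated query position.
        weights = softmax((queries + query_pos) @ keys.T / np.sqrt(dim))
        queries = queries + weights @ feats   # residual update of the object queries
        boxes_per_layer.append(boxes)
    return queries, boxes_per_layer

# Toy sizes and random parameters, chosen only for illustration.
rng = np.random.default_rng(0)
num_queries, num_pixels, dim = 4, 30, 16
queries0 = rng.normal(size=(num_queries, dim))
feats = rng.normal(size=(num_pixels, dim))
feat_pos = rng.normal(size=(num_pixels, dim))
box_head = rng.normal(size=(dim, 4)) * 0.1
pos_proj = rng.normal(size=(dim, dim)) * 0.1

final_queries, boxes_per_layer = gqpos_decoder(queries0, feats, feat_pos, box_head, pos_proj)
print(final_queries.shape, len(boxes_per_layer))
```

The point of the loop is step 2: the query position is recomputed from the boxes predicted by the most recent queries, so each layer attends with up-to-date location information rather than a fixed embedding.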
3.2 Similar Attention
For a transformer with multi-scale feature maps, the attention weight map is computed over the concatenation of the high-resolution and low-resolution feature maps, which yields a concatenated attention weight map. This attention weight map contains a hard-to-learn high-resolution part and an easy-to-learn low-resolution part.
Since the locations of objects on feature maps of different scales are similar, the distributions of the attention weight maps at different scales should also be similar. SiA therefore interpolates the low-resolution attention weight map, computed on the low-resolution feature map, to obtain a prior for the high-resolution attention map. Finally, SiA combines this prior with the original high-resolution part of the concatenated attention weight map, using a learnable coefficient. In our implementation, we add an extra low-resolution feature map obtained by downsampling. The structure of Similar Attention is shown in Figure 3.
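One possible reading of this fusion is sketched below. Several choices are assumptions made for illustration and are not confirmed by the text: the prior is built from pre-softmax logits, nearest-neighbor upsampling stands in for the unspecified interpolation, the prior is added to the high-resolution logits scaled by a scalar `alpha` (standing in for the learnable coefficient), and all function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def upsample_nearest(maps, lo_shape, hi_shape):
    """Nearest-neighbor upsampling of (N, Hl*Wl) maps to (N, Hh*Wh)."""
    (hl, wl), (hh, wh) = lo_shape, hi_shape
    if hh % hl or wh % wl:
        raise ValueError("high-res size must be an integer multiple of low-res size")
    grid = maps.reshape(-1, hl, wl)
    grid = grid.repeat(hh // hl, axis=1).repeat(wh // wl, axis=2)
    return grid.reshape(-1, hh * wh)

def similar_attention(queries, feat_hi, feat_lo, hi_shape, lo_shape, alpha):
    """Fuse a low-resolution attention prior into the high-resolution attention logits."""
    scale = np.sqrt(queries.shape[-1])
    logits_hi = queries @ feat_hi.T / scale          # (N, Hh*Wh), hard to learn
    logits_lo = queries @ feat_lo.T / scale          # (N, Hl*Wl), easier to learn
    prior = upsample_nearest(logits_lo, lo_shape, hi_shape)
    fused_hi = logits_hi + alpha * prior             # prior guides the high-res part
    # Attention is taken over the concatenation of both scales.
    weights = softmax(np.concatenate([fused_hi, logits_lo], axis=-1))
    return weights, prior

# Toy sizes and random inputs, chosen only for illustration.
rng = np.random.default_rng(0)
num_queries, dim = 3, 8
hi_shape, lo_shape = (4, 4), (2, 2)
queries = rng.normal(size=(num_queries, dim))
feat_hi = rng.normal(size=(hi_shape[0] * hi_shape[1], dim))
feat_lo = rng.normal(size=(lo_shape[0] * lo_shape[1], dim))

weights, prior = similar_attention(queries, feat_hi, feat_lo, hi_shape, lo_shape, alpha=0.5)
print(weights.shape, prior.shape)
```

With `alpha = 0` the sketch reduces to plain attention over the concatenated multi-scale features, so the coefficient controls how strongly the low-resolution map shapes the high-resolution one.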
4.1 Datasets and metrics
We evaluate object detection performance on the COCO dataset. Specifically, we train on the COCO 2017 training set and validate on the validation set. We use mean average precision (mAP) to measure the performance of our method.
HICO-DET. We evaluate the human-object interaction task on the HICO-DET dataset. An HOI detection is considered a true positive only when the human and the object are both localized accurately and the interaction is predicted correctly. HICO-DET annotates human-object pairs with HOI categories defined over interactions and objects, and the HOI categories are split into Rare and Non-Rare based on the number of training instances. The dataset is divided into a training set and a testing set. mAP is again used to measure model performance.
4.2 Implementation details
All single-scale models are trained on RTX Ti GPUs, and all multi-scale models are trained on V GPUs.
DETR and SMCA. For DETR and SMCA, we follow the experimental settings of prior work. The backbone and the transformer use separate learning rates, and the learning rate is dropped once during training. The classification loss is changed from cross-entropy loss to focal loss, the number of queries is changed, and reference points are used in DETR's location regression to make fair comparisons with [33, 7]. The standard deviation used to initialize multi-head attention is decreased to stabilize the training process. The whole model is optimized with the AdamW optimizer, and the backbone is a pretrained ResNet. Random cropping is used for data augmentation, with the largest width and height bounded.
YoloS. For YoloS, we choose the YoloS-Ti model in our experiments. We follow the general settings of the original work, but make some modifications to the YoloS-Ti architecture: more transformer layers are added, since the original YoloS has no detection head apart from a ViT backbone, and auxiliary losses are added to the extra layers to improve the expressiveness of intermediate object queries. The model is pretrained with DeiT-Ti and fine-tuned on COCO.
HoiTransformer. For HoiTransformer, the experimental settings are basically the same as in the original work, and for convenience of verification we train HoiTransformer using a pretrained ResNet. Single-scale and multi-scale HoiTransformer are trained with separate schedules, and in both cases the learning rate is dropped once during training.
|Method|Epochs|mAP|AP50|AP75|APS|APM|APL|GFLOPs|Params (M)|
|Faster RCNN-FPN-R50|36|40.2|61.0|43.8|24.2|43.5|52.0|180|42|
|Deformable DETR-R50|50|39.4|59.6|42.2|20.6|43.0|55.5|78|34|
|SMCA-R50 (2021)|50|41.0|*|*|21.9|44.3|59.1|86|42|
|MS Deformable DETR-R50|50|43.8|62.6|47.7|26.4|47.1|58.0|173|40|
|MS SMCA-R50|50|43.7|63.6|47.2|24.2|47.0|60.4|152|40|
|Faster RCNN-FPN-R50|109|42.0|62.1|45.5|26.6|45.4|53.4|180|42|
4.3 Object Detection
DETR. The results are shown in Table 1. With the help of GQPos, DETR surpasses previous related models on both short and long training schedules. Specifically, on the short schedule, DETR with GQPos exceeds local-attention models such as Deformable DETR and SMCA, which have a lower complexity for attention calculation, while on a longer schedule, DETR-GQPos surpasses both its original version and the self-pretrained UP-DETR. In addition, the training schedule of the original DETR (500 epochs) is shortened to a great extent. These results show that guiding and updating the query position for each cross-attention indeed helps transformer-based detection heads converge faster.
Furthermore, with the help of SiA, the performance of DETR-GQPos is further improved. In particular, the detection performance on small objects improves, indicating that the learning of the high-resolution attention weight map is accelerated. Compared with local-attention-based models, DETR-GQPos-SiA also has an advantage in large-object detection, especially over MS Deformable DETR.
|Methods|Epoch|Multi-scale inputs|Feature maps fusion|Attention weight fusion|mAP|

|Methods|Epoch|mAP full|mAP rare|mAP nonrare|mAP inter|

|Methods|Epoch|mAP|
|w/o iterative updating|50|40.6 (-1.4)|
|w/o pos encoding|50|39.3 (-2.7)|
|w/o FC in|50|41.3 (-0.7)|
SMCA. Results of GQPos on SMCA are shown in Table 2. Combined with GQPos, SMCA improves further in both the single-scale and multi-scale settings, suggesting that GQPos does not conflict with local-attention-based models such as SMCA and generalizes well.
YoloS. We first make some modifications to the original YoloS-Ti model, as described in the experimental settings, and then apply GQPos to the modified model. The results are shown in Table 3. The long training schedule of YoloS is shortened to some extent. Based on the results, we conclude that: (i) the original YoloS has no detection head apart from a ViT backbone, which hinders it from transferring to the object detection task; adding more transformer layers with a larger learning rate therefore gives YoloS an approximate detection head. (ii) Inspired by DETR, adding auxiliary losses for the last 6 layers improves the expressiveness of intermediate object queries. (iii) These two modifications are critical to the YoloS architecture and bring an mAP gain over the original YoloS-Ti. (iv) GQPos further improves the performance, indicating that the lack of guidance and updating of the query position is a widespread problem.
We then present our ablation studies of GQPos and SiA. We use the DETR model with a ResNet backbone to perform object detection on the challenging COCO dataset for the ablations.
Guided Query Position. The ablations include implementing GQPos in a non-iterative way, removing the positional encoding part, and removing the fully connected layers of the query position and spatial position. Results are shown in Table 6. They demonstrate that: (i) the iterative approach outperforms the parallel one by 1.4% mAP, since it enables the query position to carry the latest location information of the object queries. (ii) Embedding the location information of the object queries into the query position is the most important component (removing it causes a 2.7% mAP drop). In other words, the update of the query position needs to be guided. (iii) Adding the linear projection to the query position makes the update more dynamic to some extent (a 0.7% mAP improvement).
Similar Attention. Results of the SiA ablations are shown in Table 4. We observe that the effects of adding multi-scale features or of fusing multi-scale features are limited, which demonstrates that the high complexity of the high-resolution feature map must be addressed. When combined with the attention weight fusion of SiA, the performance obtains a further mAP gain, showing that the prior from the low-resolution attention map is critical for learning the high-resolution attention map.
4.4 Human Object Interaction
The results of applying GQPos and SiA to single-scale and multi-scale HoiTransformer, respectively, are shown in Table 5. Both methods improve HoiTransformer's performance: GQPos yields an mAP gain on the full set for single-scale HoiTransformer, and SiA yields an mAP gain on the full set for multi-scale HoiTransformer. In particular, SiA brings a significant improvement in the detection of rare objects, suggesting that in human-object interaction, rare-object detection can be enhanced with well-learned multi-scale information. Overall, GQPos and SiA generalize well to other vision tasks such as human-object interaction.
Figure 4: Attention maps of the decoder layers, with the high-probability attention regions highlighted. DETR tries to find object boundaries from the first decoder layer and attends to similar locations over different layers. With GQPos, our model locates objects more reasonably by first focusing on the object center and then gradually moving the attention to the boundary region.
To explore the inherent mechanism of GQPos, we visualize the decoder attention maps of different layers. As Figure 4(a) shows, the decoder of the original DETR attends to object boundaries in all layers, so the attention maps remain highly consistent across layers, which means that the different decoder layers bring little improvement to the attention representation.
Interestingly, GQPos locates objects in a more human-like way: it first finds the object center and then locates the boundaries. Figure 4(b) shows the decoder attention maps of GQPos. Unlike the original DETR, the decoder with GQPos attends to the central part of the object at the beginning, such as the human body in Figure 4(b). Compared with locating the object boundary, locating the object center is easier, and thus the initial query position can be better optimized. After the objects are located by their central part, the focus of the attention maps gradually moves to the object boundaries in subsequent layers, which shows the refinement ability of GQPos.
We propose that when transformer-based detection heads perform cross-attention, the interaction between the query position and the feature maps needs to be updated as the object queries are renewed, and we introduce a simple and effective method, GQPos, to do so. GQPos generalizes well and is compatible with DETR, SMCA, YoloS, and HoiTransformer. Furthermore, for transformers with multi-scale feature maps, we observe that feature map fusion is not enough and propose the SiA method to additionally fuse the multi-scale attention weight maps, which accelerates the learning of high-resolution attention weight maps. SiA shows improvements on multi-scale transformer-based detection heads such as DETR and HoiTransformer.
References
- [1] (2020) End-to-end object detection with transformers. In European Conference on Computer Vision, pp. 213–229.
- [2] (2015) Hico: a benchmark for recognizing human-object interactions in images. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1017–1025.
- [3] (2020) Pre-trained image processing transformer. arXiv preprint arXiv:2012.00364.
- [4] (2020) UP-detr: unsupervised pre-training for object detection with transformers. arXiv preprint arXiv:2011.09094.
- [5] (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- [6] (2021) You only look at one sequence: rethinking transformer in vision through object detection. arXiv preprint arXiv:2106.00666.
- [7] (2021) Fast convergence of detr with spatially modulated co-attention. arXiv preprint arXiv:2101.07448.
- [8] Video action transformer network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 244–253.
- [9] (2015) Region-based convolutional networks for accurate object detection and segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (1), pp. 142–158.
- [10] (2015) Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448.
- [11] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- [12] (2012) Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25, pp. 1097–1105.
- [13] (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125.
- [14] (2017) Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988.
- [15] (2014) Microsoft coco: common objects in context. In European Conference on Computer Vision, pp. 740–755.
- [16] (2016) Ssd: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37.
- [17] (2018) Fixing weight decay regularization in adam.
- [18] (2010) Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.
- [19] (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788.
- [20] (2015) Faster r-cnn: towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497.
- [21] (2016) End-to-end people detection in crowded scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2325–2333.
- [22] (2019) Vl-bert: pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530.
- [23] (2019) Videobert: a joint model for video and language representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7464–7473.
- [24] (2020) Rethinking transformer-based set prediction for object detection. arXiv preprint arXiv:2011.10881.
- [25] (2019) Lxmert: learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490.
- [26] (2019) Fcos: fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9627–9636.
- [27] (2020) Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877.
- [28] (2017) Attention is all you need. arXiv preprint arXiv:1706.03762.
- [29] (2020) End-to-end object detection with fully convolutional network. arXiv preprint arXiv:2012.03544.
- [30] (2020) Learning texture transformer network for image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5791–5800.
- [31] (2019) Cross-modal self-attention network for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511.
- [32] (2020) End-to-end object detection with adaptive clustering transformer. arXiv preprint arXiv:2011.09315.
- [33] (2020) Deformable detr: deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159.
- [34] (2021) End-to-end human object interaction detection with hoi transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11825–11834.