Visual Composite Set Detection Using Part-and-Sum Transformers

by Qi Dong, et al.

Computer vision applications such as visual relationship detection and human-object interaction can be formulated as a composite (structured) set detection problem in which both the parts (subject, object, and predicate) and the sum (the triplet as a whole) are to be detected in a hierarchical fashion. In this paper, we present a new approach, denoted Part-and-Sum detection Transformer (PST), to perform end-to-end composite set detection. Different from existing Transformers, in which queries are at a single level, we simultaneously model the joint part and sum hypotheses/interactions with composite queries and attention modules. We explicitly incorporate sum queries to enable better modeling of the part-and-sum relations that are absent in standard Transformers. Our approach also uses novel tensor-based part queries and vector-based sum queries, and models their joint interaction. We report experiments on two vision tasks, visual relationship detection and human-object interaction, and demonstrate that PST achieves state-of-the-art results among single-stage models, while nearly matching the results of custom-designed two-stage models.





1 Introduction

In this paper, we study problems such as visual relationship detection (VRD) [29, 21] and human-object interaction (HOI) [12, 37, 4], where a composite set with a two-level (part-and-sum) hierarchy is to be detected and localized in an image. In both VRD and HOI, the output consists of a set of entities. Each entity, referred to as a "sum", represents a triplet structure composed of parts: the parts are (subject, object, predicate) in VRD and (human, interaction, object) in HOI. The sum-and-parts structure naturally forms a two-level hierarchical output, with the sum at the root level and the parts at the leaf level. In the general setting for composite set detection, the hierarchy consists of two levels, but the number of parts can be arbitrary.

(a) Pipeline overview.
(b) Part-and-Sum Transformer Decoder.
Figure 1: Overview of visual composite set detection by Part-and-Sum Transformer (PST).

Many existing approaches in VRD [29, 17, 30, 54] and HOI [36, 4, 24, 15, 10] are based on two-stage processes in which some parts (e.g., the subject and the object in VRD) are detected first, followed by detection of the association (sum) and the additional part (the predicate). Single-stage approaches [50, 49, 16, 19] with end-to-end learning for VRD and HOI also exist. In practice, two-stage approaches produce better performance, while single-stage methods are easier to train and use.

The task for an object detector is to detect and localize all valid objects in an input image, making the output a set. Though object detectors such as FasterRCNN [33] are considered end-to-end trainable, they perform instance level predictions, and require post-processing using non-maximum suppression to recover the entire set of objects in an image. Recent developments in Transformers [44] and their extensions to object detection [3] enable set-level end-to-end learning by eliminating anchor proposals and non-maximum suppression.

Figure 2: Part-and-Sum Transformer Decoder with composite queries. We compare Transformer decoders with various query designs. Note that CA: cross-attention layer; SA: self-attention layer.

In this paper, we formulate visual relationship detection (VRD) [29, 21] and human-object interaction (HOI) [12, 37, 4] as composite set (two-level hierarchy) detection problems and propose a new approach, the Part-and-Sum Transformer (PST), to solve them. PST is different from existing detection transformers, where each object is represented by a vector-based query in a one-level set [3]. We show the importance of establishing an explicit "sum" representation for the triplet as a whole that is simultaneously modeled/engaged with the part queries (e.g., subject, object, and predicate). Both global and part features were also modeled in the discriminatively trained part-based model (DPM) algorithm [8], though the global and part interactions there are limited to relative spatial locations. To summarize, we develop a new approach, the Part-and-Sum Transformer (PST), to solve the composite set detection problem by creating composite queries and a composite attention mechanism to account for both the sum (vector-based query) and part (tensor-based query) representations. Through systematic experiments, we study the roles of the sum and part queries and of intra- and inter-token attention in Transformers. The effectiveness of the proposed PST algorithm is demonstrated on the VRD and HOI tasks.

2 Related Work

The standard object detection task [27] is a set prediction problem in which each element refers to a bounding box. If the prediction output is a hierarchy of multiple levels, standard sliding-window-based approaches [33] that rely on features extracted from the entire window no longer suffice. Algorithms performing inference hierarchically exist [57, 35, 39], but they are not trained end-to-end. Here we study problems that require prediction of a two-level structured set, including visual relationship detection (VRD) [29, 21] and human-object interaction (HOI) [12, 37, 4]. We aim to develop a general-purpose algorithm that is trained end-to-end for composite set detection with a new Transformer design, which is different from the previous two-stage [29, 17, 30, 54, 36, 4, 24, 15, 10] and single-stage approaches [50, 49, 16, 19]. The concept of sum-and-max [35] is related to our part-and-sum approach, but the two differ in many aspects. In terms of structural modeling, the problem of structured prediction (or semantic labeling) has long been studied in machine learning [22, 41] and computer vision [38, 42].

Transformers [44] have recently been applied to many computer vision tasks [7, 40, 48]. Prominently, the transformer-based object detector DETR [3] has shown results comparable to well-established, not fully differentiable models based on CNNs and NMS, such as Faster R-CNN [33]. Deformable DETR [58] has matched the performance of the previous state of the art while preserving the end-to-end differentiability of DETR. A recent attempt [60], parallel to ours, also applies DETR to the HOI task. Our proposed Part-and-Sum Transformer (PST) is different in algorithm design, with the development of composite queries and composite attention to simultaneously model global and local information as well as their interactions.

3 Composite Set Detection by Transformer

In this section, we describe the PST formulation for visual composite set detection. We use VRD as the example; the formulation extends straightforwardly to HOI by assuming that the subject is always human and the predicate is the interaction with the object.

Given an image $I$, the goal of VRD is to detect a set of visual relations $\mathcal{R} = \{R_i\}$. Each visual relation $R_i$, a sum, has three parts: subject, object, and predicate, i.e., $R_i = (s_i, p_i, o_i)$. For each $R_i$, the subject and object have class labels $c_i^{s}$ and $c_i^{o}$ and bounding boxes $b_i^{s}$ and $b_i^{o}$; the predicate has a class label $c_i^{p}$. VRD is therefore a composite set detection task, where each instance in the set is a composite entity consisting of three parts.

3.1 Overview

An overview of the proposed PST is shown in Figure 1 (a). Given an input image, we first obtain image feature maps from a CNN backbone. The image features, with learnable position embeddings, are further encoded/tokenized by a standard [3] or deformable [58] transformer encoder. The tokenized image features and a set of learnable queries are fed into a transformer decoder to infer the classes and positions of each composite instance. Unlike standard object detection, composite set detection not only detects all object entities but also the entity-wise structure/relationships. To model composite data accurately, we propose a composite-query-based Part-and-Sum transformer decoder that learns each composite/structured instance at both the entity and relationship levels, as shown in Figure 1 (b). In the following sections, we detail the PST model and the corresponding training and inference processes.

3.2 Part-and-Sum Transformer (PST)

To construct the PST model, we first describe the vector based query, tensor based query, and composite queries that are used for composite set prediction. We then formulate the composite transformer decoder layer based on the composite queries.

3.2.1 Query Design for Composite Data

Vector based query A standard decoder used in DETR takes a set of vector based queries as input, as shown in Figure 2 (a). When applying this formulation to the relationship detection task, one can use feedforward networks (FFNs) to directly predict a subject-predicate-object triplet from the output of each query. This straightforward extension of DETR, while serving as a reasonable baseline, is sub-optimal as each query mixes the parts and their interaction altogether inside a vector. This makes the parts and their interaction implicitly modeled, limiting the expressiveness and representation capability of the visual relationship model.

Tensor based query To explicitly model the parts and their relationships (e.g., Subject, Predicate, and Object), we propose a tensor-based query representation using disjoint sub-vectors as sub-queries. Specifically for VRD, in a tensor-based query representation, three sub-queries represent Subject, Predicate, and Object. All queries together form an $N \times M \times d$ tensor, where $N$ is the number of queries, $M$ is the number of entities in a relationship ($M = 3$), and $d$ is the feature dimension of the sub-queries. This formulation enables part-wise decoding in the transformer decoder, as shown in Figure 2 (b). Technically, the vector-based query represents each relationship as a whole/sum, whereas the tensor-based query models the parts disjointly. The difference in query design leads to a difference in the learned contexts: self-attention layers among vector-based queries mine inter-relation context, while self-attention layers among part queries mine inter-part context.

Composite query On the one hand, a vector-based query captures the relationship as a sum/whole, but there exists an intrinsic ambiguity about the parts. On the other hand, a tensor-based query models each part explicitly, but it lacks knowledge of the relationship as a sum, which is important for the subject-object association. Based on the observations above, we propose a composite query representation. Formally, each composite query is composed of part queries (a tensor query) as well as a sum/whole query (a vector-based query). In VRD, each composite query $Q_i$ is composed of three part sub-queries representing Subject, Predicate, and Object, and one sum query representing the relationship: $Q_i = \{q_i^{s}, q_i^{p}, q_i^{o}, q_i^{sum}\}$, where $q_i^{s}$, $q_i^{p}$, and $q_i^{o}$ denote the subject, predicate, and object sub-queries. Assuming $N$ composite queries in the decoder, the overall query is an $N \times (M+1) \times d$ tensor, where $d$ is the dimension of the sub-queries.
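The composite query layout described above can be sketched in numpy as follows; the shapes, variable names, and random placeholder values are illustrative assumptions for the sketch, not the authors' implementation (in the real model the queries are learnable parameters):

```python
import numpy as np

# Hypothetical sizes: N composite queries, M = 3 part sub-queries
# (subject, predicate, object) plus one sum query, each of dimension d.
N, M, d = 400, 3, 256

rng = np.random.default_rng(0)
part_queries = rng.standard_normal((N, M, d))  # tensor-based part queries
sum_queries = rng.standard_normal((N, d))      # vector-based sum queries

# The overall composite query can be viewed as an N x (M + 1) x d tensor.
composite = np.concatenate([part_queries, sum_queries[:, None, :]], axis=1)

# Recover the subject / predicate / object sub-queries of one relation.
subj, pred, obj = composite[0, :M]
```

The split into `part_queries` and `sum_queries` is what lets the decoder run two separate attention streams over the same set of relations.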

3.2.2 Part-and-Sum Transformer Decoder

As the composite query includes both part and sum queries, we decode the part queries $\{q_i^{s}, q_i^{p}, q_i^{o}\}$ and the sum query $q_i^{sum}$ separately. To enable mutual benefits between part and sum learning, we also set up a part-sum interaction. Additionally, we propose a factorized self-attention layer to further enhance part-level learning. The architecture of the Part-and-Sum Transformer decoder is illustrated in Figure 2 (c).

Part-and-Sum separate decoding. The PST decoder has a two-stream architecture, for part and sum query decoding, respectively. Each decoding stream contains self-attention layers (SA), cross-attention layers (CA), and feed-forward networks (FFN). Let $\mathrm{SA}_p$ and $\mathrm{CA}_p$ denote the SA and CA layers for the part queries. Decoding the part queries $Q^{part}$ is written as:

$$\hat{Q}^{part} = \mathrm{FFN}_p\big(\mathrm{CA}_p\big(\mathrm{SA}_p(Q^{part}),\, E\big)\big), \quad (1)$$

where $E$ denotes the tokenized image features from the Transformer encoder. Similarly, decoding the sum queries $Q^{sum}$ can be written as:

$$\hat{Q}^{sum} = \mathrm{FFN}_s\big(\mathrm{CA}_s\big(\mathrm{SA}_s(Q^{sum}),\, E\big)\big). \quad (2)$$

Each entity thus has both part and sum embeddings, obtained via two separate sequential modules. The self-attention exploits the context among all queries. Part-and-sum separate decoding effectively models two different types of context: the self-attention among part queries explores the inter-component context (for example, when one part query predicts "person", it reinforces related predicates such as "eat" and "hold"), while the self-attention among sum queries exploits the inter-relationship context (for example, a sum query that predicts "person read book" is a clue for inferring a "person sit" relationship). These contexts provide the interactions needed for accurate inference of the structured output.

Factorized self-attention layer. To make the interactions within the group-wise part queries more structured, we design a factorized self-attention layer, as shown in Figure 2 (b). Instead of performing self-attention among all part queries as in Eq. 1, a factorized self-attention layer first conducts intra-relation self-attention and then inter-relation self-attention. The intra-relation self-attention layer leverages the contexts of the parts to benefit relationship prediction; for example, a subject query "person" and an object query "horse" help predict the predicate "Ride". The inter-relation self-attention layer leverages the inter-relation context to enhance holistic relation prediction per image, which is particularly important for detecting multiple interactions involving the same subject entity. More details are in the supplementary materials.
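The two-step factorization can be sketched with a toy single-head attention in numpy. `toy_self_attention` and the tiny sizes are illustrative assumptions, not the multi-head modules used in PST; the point is only the order of the two attention passes:

```python
import numpy as np

def toy_self_attention(x):
    """Minimal single-head self-attention over rows of x (illustration only)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

N, M, d = 5, 3, 8                             # tiny sizes for illustration
rng = np.random.default_rng(0)
parts = rng.standard_normal((N, M, d))        # N relations x 3 part queries

# Step 1: intra-relation self-attention: the 3 parts of each relation
# attend only to one another (e.g. "person" + "horse" supports "ride").
intra = np.stack([toy_self_attention(parts[i]) for i in range(N)])

# Step 2: inter-relation self-attention: corresponding parts attend
# across relations (e.g. all subject queries exchange context).
inter = np.stack([toy_self_attention(intra[:, k]) for k in range(M)], axis=1)
```

Factorizing the full (N * M)-token attention into these two smaller passes is what restricts the context flow to the two structured patterns the text describes.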

Part-Sum interaction. Part query decoding embeds more accurate component information, while the sum embedding contains more accurate component association. Both aspects are important for structured output detection, and they are mutually beneficial [9]. Thus, we design an interaction between the two decoding streams, enabling part-sum conditioning. Specifically, after the FFNs in the decoder, each part embedding $p_i^{k}$ ($k \in \{s, p, o\}$) is combined with the sum embedding $q_i$ of the same relation, while each sum embedding $q_i$ is fused with all three part embeddings. The part-sum interaction is formulated as:

$$\tilde{p}_i^{k} = \mathrm{LN}\big(p_i^{k} + q_i\big), \qquad \tilde{q}_i = \mathrm{LN}\Big(q_i + \sum_{k \in \{s,p,o\}} p_i^{k}\Big), \quad (3)$$

where $\mathrm{LN}$ is layer normalisation [1].
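The bidirectional summation with layer normalisation described above can be sketched as follows; this is a minimal numpy version with hypothetical names, not the training code:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize over the last (feature) axis, as in layer normalisation."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

d = 8
rng = np.random.default_rng(0)
parts = rng.standard_normal((3, d))  # subject / predicate / object embeddings
s = rng.standard_normal(d)           # sum embedding of the same relation

# Each part embedding absorbs the sum embedding of its relation...
parts_out = layer_norm(parts + s)
# ...and the sum embedding absorbs all three part embeddings.
sum_out = layer_norm(s + parts.sum(axis=0))
```

The grouping is fixed by construction, so no matching between parts and sums is needed at this step.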

3.3 Model Training and Inference

Composite prediction. For each composite query $Q_i$, we predict the classes of the subject, object, and predicate, and the bounding boxes of the subject and object. Specifically, for each part query, we predict the corresponding class with a single linear layer and the box with a shallow MLP head. Besides, we can also construct a global representation from the group-wise part queries by concatenating them, denoted $\bar{p}_i = [p_i^{s}; p_i^{p}; p_i^{o}]$. The part query prediction is:

$$\hat{c}_i^{s} = f^{s}_{cls}(p_i^{s}), \quad \hat{c}_i^{o} = f^{o}_{cls}(p_i^{o}), \quad \hat{c}_i^{p} = f^{p}_{cls}(p_i^{p}), \quad \hat{b}_i^{s} = f^{s}_{box}(p_i^{s}), \quad \hat{b}_i^{o} = f^{o}_{box}(p_i^{o}), \quad \hat{c}_i^{r} = f_{rel}(\bar{p}_i), \quad (4)$$

where $f^{s}_{cls}$, $f^{o}_{cls}$, and $f^{p}_{cls}$ are the FFNs for subject, object, and predicate classification; $f^{s}_{box}$ and $f^{o}_{box}$ are the FFNs for predicting the boxes of the subject and object; and $f_{rel}$ is an FFN for predicting the relation triplet.

For sum query prediction, we predict the classes and boxes of all parts from a sum query $q_i$:

$$\hat{c}_i^{s\prime} = g^{s}_{cls}(q_i), \quad \hat{c}_i^{o\prime} = g^{o}_{cls}(q_i), \quad \hat{c}_i^{p\prime} = g^{p}_{cls}(q_i), \quad \hat{b}_i^{s\prime} = g^{s}_{box}(q_i), \quad \hat{b}_i^{o\prime} = g^{o}_{box}(q_i), \quad (5)$$

where $g^{s}_{cls}$, $g^{o}_{cls}$, and $g^{p}_{cls}$ are the FFNs for subject, object, and predicate classification at the sum level, and $g^{s}_{box}$ and $g^{o}_{box}$ are the FFNs for predicting the boxes of the subject and object at the sum level. Note that the last layer of each classification head is a Softmax layer, while the last layer of each box head is a Sigmoid layer.

Composite bipartite matching. We conduct composite group-wise bipartite matching, i.e., all components belonging to a relation are considered jointly when computing the set-to-set similarity. Specifically, each relation has three part queries (subject, object, and predicate) and a triplet embedding. The bipartite matching algorithm finds a permutation of the predictions such that the total matching cost is minimized:

$$\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_N} \sum_{i=1}^{N} \mathcal{C}_{match}\big(y_i, \hat{y}_{\sigma(i)}\big), \quad (6)$$

where $\mathfrak{S}_N$ is the set of all permutations of $N$ elements, and $\sigma(i)$ is the $i$th element of the permutation $\sigma$.

We define the following matching cost between the $i$th ground truth $y_i$ and the corresponding prediction $\hat{y}_{\sigma(i)}$ determined by the permutation $\sigma$:

$$\mathcal{C}_{match}\big(y_i, \hat{y}_{\sigma(i)}\big) = \sum_{k \in \{s,\, o,\, p\}} \Big[ -\hat{p}^{\,k}_{\sigma(i)}\big(c_i^{k}\big) + \mathcal{L}_{box}\big(b_i^{k}, \hat{b}^{\,k}_{\sigma(i)}\big) \Big], \quad (7)$$

where $\hat{p}^{\,k}_{\sigma(i)}(c_i^{k})$ is the probability of classifying part $k$ as the ground-truth class $c_i^{k}$, computed by Eqs. 4 and 5, and $\hat{b}^{\,k}_{\sigma(i)}$ is a predicted bounding box (the subject and object predictions from the sum query branch are included as well). $\mathcal{L}_{box}$ includes GIoU and L1 losses, the same as [3]. Here we use the union box of the subject and object in a relation as the target box of the corresponding predicate.
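The permutation search in the composite matching can be illustrated with a brute-force toy version. Real implementations use the Hungarian algorithm (e.g. `scipy.optimize.linear_sum_assignment`) rather than enumerating permutations, and the cost values below are made up for the sketch:

```python
from itertools import permutations

# Toy composite matching cost: cost[i][j] is the (made-up) cost of
# assigning ground-truth relation i to predicted relation j.
cost = [
    [0.2, 0.9, 0.6],
    [0.8, 0.1, 0.7],
    [0.5, 0.6, 0.3],
]

def best_permutation(cost):
    """Exhaustively minimize the total assignment cost (feasible only for tiny N)."""
    n = len(cost)
    return min(permutations(range(n)),
               key=lambda sigma: sum(cost[i][sigma[i]] for i in range(n)))

sigma_hat = best_permutation(cost)  # (0, 1, 2) for this toy cost matrix
```

Enumerating permutations is O(N!), which is why the Hungarian algorithm (O(N^3)) is used in practice for the 400 queries mentioned later.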

Training loss. Given the two-level part and sum outputs, we compute classification losses and box regression losses at both levels. Once we have obtained the best permutation $\hat{\sigma}$ that minimizes the overall matching cost between $y$ and $\hat{y}$, we compute the total loss as:

$$\mathcal{L} = \sum_{i=1}^{N} \sum_{k \in \{s,\, o,\, p\}} \Big[ -\log \hat{p}^{\,k}_{\hat{\sigma}(i)}\big(c_i^{k}\big) + \mathcal{L}_{box}\big(b_i^{k}, \hat{b}^{\,k}_{\hat{\sigma}(i)}\big) \Big], \quad (8)$$

summed over both the part and sum decoding streams. Note that Eq. 8 is very similar to Eq. 7, except that a negative log-likelihood loss is used to train the classifiers, for more effective learning.
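As a rough illustration of one per-part loss term (negative log-likelihood plus weighted L1 and GIoU box losses), here is a simplified sketch with hypothetical names; the weights of 5 for L1 and 2 for GIoU follow the implementation details reported later, but this is not the training code:

```python
import math

def iou_and_giou(box_a, box_b):
    """IoU and GIoU for axis-aligned boxes (x1, y1, x2, y2); simplified."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    iou = inter / union
    # Smallest enclosing box, used by the GIoU penalty term.
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    hull = (cx2 - cx1) * (cy2 - cy1)
    return iou, iou - (hull - union) / hull

def part_loss(prob_true_class, box_pred, box_gt, w_l1=5.0, w_giou=2.0):
    """NLL classification term plus weighted L1 and GIoU box terms."""
    nll = -math.log(prob_true_class)
    l1 = sum(abs(p - g) for p, g in zip(box_pred, box_gt))
    _, giou = iou_and_giou(box_pred, box_gt)
    return nll + w_l1 * l1 + w_giou * (1.0 - giou)
```

A perfect prediction (probability 1.0 on the true class, identical boxes) yields a loss of zero, and each term grows as the prediction degrades.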

No.  Query type   Decoder design  |  Phrase Detection              |  Relationship Detection
                                  |  R@50   R@100   R@50   R@100   |  R@50   R@100   R@50   R@100
(a)  Vanilla      Vanilla         |  26.17  29.43   27.66  32.71   |  17.88  19.41   19.97  23.08
(b)  Tensor       Vanilla         |  26.69  31.46   28.67  34.35   |  19.36  22.63   21.89  25.89
(c)  Tensor       Part            |  30.40  34.86   32.29  37.68   |  23.28  26.30   25.46  29.65
(d)  Composite    Vanilla         |  25.70  29.66   28.01  34.11   |  17.75  20.20   20.17  24.53
(e)  Composite    Part-and-Sum    |  30.63  33.82   32.55  40.63   |  23.57  27.63   26.48  31.83
Table 1: Phrase and relationship detection result comparison (%) under various Transformer architectures on the VRD dataset. For a fair comparison, all transformer decoders are based on the same CNN backbone and Transformer encoder, using the same number of queries.

4 Experiments

We evaluate our method on two composite set detection applications: Visual Relationship Detection (VRD), and Human Object Interaction detection (HOI).

Datasets. (1) For the VRD task, we evaluate the proposed PST on the VRD dataset [29], whose images are annotated with entity categories and predicate categories. Relationships are labeled as a set of (subject, predicate, object) triplets, and all subject and object entities in relationships are annotated with an entity category and a bounding box. We follow the data split of [29] for training/validation/test. Relation types that appear only in the test set are used for zero-shot relationship detection. (2) For the HOI task, we conduct an evaluation on the HICO-DET dataset [4], including 38,118 training images and 9,658 testing images. This dataset has the same 80 object categories as MS-COCO [27] and 117 verb categories; the objects and verbs form 600 classes of HOI triplets. In this dataset, one person can interact with multiple objects in various ways at the same time.

Task Settings. For the VRD task, we test PST on Phrase Detection and Relationship Detection [29, 54]. In Phrase Detection, the model detects one bounding box for each relationship and recognizes the categories of the subject, object, and predicate in the relationship. In Relationship Detection, the model detects two individual bounding boxes for the subject and object entities in the relationship, and classifies the subject, predicate, and object. In both tasks, we consider two settings, single and multiple predicates between a pair of subject and object entities, with $k$ denoting the number of predicates between a pair.

Performance Metrics. (1) In VRD, we use relationship detection recall@K as the evaluation metric, as the ground-truth relationship annotations are incomplete. Following the evaluation in [29], for each detected relationship, we compute the joint probability of the subject, predicate, and object category predictions as the score for that relationship, and then rank all detected relationships to compute the recall metric. For a relationship to be detected correctly, all three elements must be correctly classified, and the IoU between each predicted bounding box and its ground-truth bounding box must be greater than 0.5. (2) In HOI, we use mean average precision (mAP) [12] as the evaluation metric. An HOI detection is considered correct only when the action and the object class are both correctly recognized and the corresponding human and object bounding boxes have IoU higher than 0.5 with the ground-truth boxes.
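A simplified sketch of the recall@K computation described above, matching only the (subject, predicate, object) labels; the IoU test on boxes is omitted and all names are hypothetical:

```python
def relationship_recall_at_k(ranked_predictions, ground_truth, k):
    """Fraction of ground-truth triplets covered by the top-k ranked predictions.

    Predictions are assumed already sorted by the joint score of the
    subject, predicate, and object classifications; box IoU checks are
    omitted in this simplified sketch.
    """
    topk = set(ranked_predictions[:k])
    hits = sum(1 for gt in ground_truth if gt in topk)
    return hits / len(ground_truth)

preds = [("person", "ride", "horse"),
         ("person", "wear", "hat"),
         ("dog", "on", "grass")]
gts = [("person", "ride", "horse"),
       ("person", "next to", "dog")]

recall = relationship_recall_at_k(preds, gts, k=2)  # 1 of 2 GT triplets hit
```

Recall rather than precision is used because missing annotations would unfairly penalize correct detections that happen to be unlabeled.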

Implementation Details. PST shares the same configuration on the VRD and HOI tasks. Specifically, PST uses the standard ResNet-50 network as the backbone, followed by a Transformer encoder with six encoder layers, the same as Deformable DETR [58]. The proposed Part-and-Sum transformer decoder contains six of the proposed two-stream Part-and-Sum decoder layers. All feed-forward networks are shallow two-linear-layer networks. We set up auxiliary losses after each decoder layer and use 400 composite queries, with three part queries representing the subject, object, and predicate in a visual relation, or the human, object, and interaction in an HOI, respectively. Note that, in our experiments, we use the vanilla multi-head self-attention module [44] as the self-attention layer and a deformable multi-head cross-attention module as the cross-attention layer.

During training, we train PST end-to-end, with separate learning rates for the backbone and for the rest of the network. The training batch size is set to 2, and the model is trained for 300 epochs. On both the part and sum decoding streams, the loss weights of the subject, object, and predicate classification and of triplet reasoning are set to 1, the weights of all bounding box losses are 5 and of the GIoU losses [34] are 2, and the part decoding loss and sum decoding loss are weighted equally. We initialize the backbone and Transformer encoder of PST with a Deformable DETR model [58] pretrained on MS-COCO [27]. We use common data augmentation in training, including image flipping, rescaling, and cropping. (Note that we do not apply extra relationship-level data augmentation. If an entity disappears due to data augmentation, its related relationships are excluded from relationship learning.)

4.1 Part-and-Sum Transformer Analysis

Part-and-Sum Transformer Decoder We first compare and analyze different Transformer designs, with different query types, on VRD. The various Transformer designs are compared in Figure 2. The vector-based query is the most straightforward way to detect structured outputs using a Transformer: an individual structured entity is formulated as a vector query, and the queries are fed into a vanilla Transformer decoder [58] to learn an embedding for each relation. Then, three single-linear-layer heads are used to predict the classes of subjects, objects, and predicates, and two three-layer MLP heads regress the boxes. The comparison results are shown in Table 1.

We can see that: (1) With the vanilla transformer decoder, the tensor-based query outperforms the vector-based query ((b) vs. (a)), with margins of 1.01/1.64% and 2.03/2.81% at R@50/100 on Phrase and Relationship detection. This is because the vector query models the structured entity as a whole and embeds multiple parts in one query, which increases the difficulty of Hungarian matching. (2) For the tensor-based query, the Part Transformer outperforms the vanilla Transformer by a clear margin ((c) vs. (b)). This benefit mainly comes from the factorized design of the self-attention layer and the relation-level constraint (Eq. 8). The former enhances the intra-relation context, reducing the ambiguity of entity recognition; for example, the subject "person" and object "horse" are important clues for inferring the predicate "Ride". The latter learns the relation as a whole to reduce entity instance confusion [54]. (3) Despite the part-and-sum information inside the composite query, the vanilla transformer with composite queries degrades ((d) vs. (a)) compared with the vanilla query. This shows that directly mixing part and sum queries does not benefit structured output learning, because each sum query contains multiple parts, and some relations may share the same entity instance, which can confuse the similarity computation in the self-attention modules. To leverage the two-level information and context effectively, PST decodes part and sum queries separately, with group-wise part-sum interaction. Comparing (e) vs. (d), this design outperforms the vanilla transformer with margins of 4.54/6.52% and 6.31/7.30% at R@50/100 on Phrase and Relationship detection.

Factorized self-attention layer We evaluate the effectiveness of the factorized self-attention layer in the part query decoding stream by comparing PST with the factorized self-attention layer against PST with a vanilla self-attention layer on VRD; the results are shown in Table 2. The factorized self-attention design leads to a 1.18/2.66% improvement at R@50/100 on Relationship detection.

Part-Sum interaction We compare two part-sum interaction schemes: vanilla self-attention vs. a summation operation. The results are shown in Table 2. Part-sum bidirectional summation works better than self-attention interaction, mainly due to the fixed grouping configuration between part and sum queries in PST, as shown in Figure 2 (c). (In PST, the component order in a composite query is fixed; on VRD, for instance, the first part query is the Subject, the second the Object, and the third the Predicate. The grouping between part queries and the sum query is also fixed by design; for instance, the first sum query and the first group of part queries represent the same relation instance.)

                                      Relationship Detection
                                      R@50   R@100   R@50   R@100
Factorized SA        without          22.14  26.48   25.30  29.17
                     with             23.57  27.63   26.48  31.83
Part-Sum             Self-Attention   22.04  25.42   23.89  28.87
Interaction          Summation        23.57  27.63   26.48  31.83
Table 2: Ablation study on Part-Sum Transformer designs. We report Relationship detection result comparison (%) on VRD.

Composite prediction Given the two-stream design of part-and-sum decoding, we obtain predictions at both the part and sum levels. We therefore study several inference schemes: predicting the structured data from the part query branch, from the sum query branch, or by combining the two branches. To combine the predictions from part and sum queries, for classification we average the prediction probabilities of the group-wise part and sum queries; for box prediction, we average the predicted positions of the top-left and bottom-right points. The comparison results are shown in Table 3: part-only inference slightly outperforms sum-only inference, and combining the predictions from the two levels brings a minor further improvement for relationship detection. (For clarity, we report the results of part-query-only inference in both the VRD and HOI experiments.)

                 Relationship Detection
                 R@50   R@100   R@50   R@100
Part only        23.57  27.63   26.48  31.83
Sum only         22.06  25.43   25.76  30.45
Part-Sum         24.34  27.01   27.03  31.90
Table 3: Result comparison of various composite set inference.
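The part-sum combination scheme described above (averaging class probabilities and box corner coordinates) can be sketched as follows; the names and example values are hypothetical:

```python
import numpy as np

def combine_predictions(part_probs, sum_probs, part_box, sum_box):
    """Average the class probabilities and (x1, y1, x2, y2) box corners
    of the part-branch and sum-branch predictions for one relation."""
    probs = (part_probs + sum_probs) / 2
    box = (part_box + sum_box) / 2
    return probs, box

# Toy predictions for one subject entity from the two branches.
part_probs = np.array([0.7, 0.2, 0.1])
sum_probs = np.array([0.5, 0.4, 0.1])
part_box = np.array([10.0, 10.0, 50.0, 60.0])
sum_box = np.array([12.0, 8.0, 52.0, 62.0])

probs, box = combine_predictions(part_probs, sum_probs, part_box, sum_box)
```

Because part and sum queries of the same relation are grouped by construction, this fusion needs no matching step at inference time.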

4.2 Visual Relationship Detection

We compare PST with a wide range of visual relationship detection solutions on the VRD dataset [29].

Competitors. Existing visual relationship detection solutions can be classified into two categories. (I) Stage-wise methods: these approaches first detect objects using a pre-trained detector, and then use the outputs of the object detector (bounding boxes, visual features, or spatial information) as fixed inputs to a relationship detection module. Specifically, we compare with: (1) VRD-Full [29], which combines the visual appearance and language features of candidate boxes to learn relationships; (2) NMP [17], which builds a relationship graph and optimizes it with node-to-edge and edge-to-node message passing; (3) CDDN [5], which proposes a context-guided visual-semantic feature fusion scheme for predicate detection; (4) LSVR [53], which learns a better representation by aligning features at both the entity and relationship levels; (5) RelDN [54], which uses contrastive loss functions to learn fine-grained visual features that decrease the ambiguity of relationship detection; (6) BC-VRD [18], which proposes a new box-wise fusion method to better combine visual, semantic, and spatial features; and (7) HGAT [30], which uses object-level and triplet-level reasoning to improve relationship detection. (II) End-to-end methods: these approaches detect objects and relationships jointly. Specifically, we compare with: (1) CAI [59], which leverages the subject-object context to detect relationships; (2) KL distillation [50], which uses a well-trained linguistic model to regularize the visual model and obtain more generalized embeddings for relationships; (3) DR-Net [6], which designs a fully connected network to mine object-pair relationships; (4) Zoom-Net [49], which leverages multi-scale contexts for fine-grained relationship detection; and (5) VTransE [52], which learns a more generalized relationship representation by mapping visual features to the relationship space.

Method                 |       Phrase Detection         |     Relationship Detection
                       |  R@50   R@100   R@50   R@100   |  R@50   R@100   R@50   R@100
Stage-wise methods:
VRD-Full [29]          |  16.17  17.03   20.04  24.90   |  13.86  14.70   17.35  21.51
LSVR [53]              |  18.32  19.78   21.39  25.65   |  16.08  17.07   18.89  22.35
BC-VRD [18]*           |  19.72  20.95   24.47  28.38   |  15.87  16.63   19.91  22.86
MLA-VRD [55]           |  23.36  28.12   -      -       |  20.54  24.91   -      -
NMP [17]               |  -      -       -      -       |  20.19  23.98   21.50  27.50
HGAT [30]              |  -      -       -      -       |  22.52  24.63   22.90  27.73
RelDN-IMG [54]         |  26.37  31.42   28.24  35.44   |  19.82  22.96   21.52  26.38
MF-URLN [51]           |  31.50  36.10   -      -       |  23.90  26.80   -      -
RelDN [54]             |  31.34  36.42   34.45  42.12   |  25.29  28.62   28.15  33.91
End-to-end methods:
DR-Net [6]             |  -      -       19.93  23.45   |  -      -       17.73  20.88
VTransE [52]           |  19.42  22.42   -      -       |  14.07  15.20   -      -
CAI [59]               |  17.60  19.24   -      -       |  15.63  17.39   -      -
ViP [23]               |  22.80  27.90   -      -       |  17.30  20.00   -      -
KL distillation [50]   |  23.14  24.03   26.32  29.43   |  19.17  21.34   22.68  31.89
Zoom-Net [49]          |  24.82  28.09   29.05  37.34   |  18.92  21.41   21.37  27.30
PST (ours)             |  30.63  33.82   32.55  40.63   |  23.57  27.63   26.48  31.83
Table 4: Phrase and relationship detection result comparison (%) on the VRD dataset. "-" denotes results not reported in the original paper. k is the number of predicates associated with each subject-object pair; on the VRD dataset, the maximum value of k equals the number of predicate categories. The first block lists the stage-wise detection methods, and the second block the end-to-end detection methods; our method belongs to the latter. *: the reported results of BC-VRD are based on Faster R-CNN for a fair comparison.

Results. The visual relationship detection comparisons on the VRD dataset are shown in Table 4. For clarity, the stage-wise methods are grouped in the first block and the end-to-end methods in the second. The proposed PST belongs to the second block; in particular, it is the first holistically end-to-end VRD solution, directly outputting all predicted relationships without any post-processing. PST outperforms the existing end-to-end methods on both the Phrase and Relationship Detection tasks, e.g., surpassing the second-best end-to-end method, Zoom-Net [49], by margins of 5.81%/5.73% in Phrase Detection and 4.65%/6.22% in Relationship Detection at R@50/100. This shows that PST is able to learn the relationships between all entities effectively.

Default Known Object
Method Backbone Detector Pose Language Full Rare NonRare Full Rare NonRare
Two-stage methods
Shen et al.[36] VGG19 COCO 6.46 4.24 7.12 - - -
HO-RCNN[4] CaffeNet COCO 7.81 5.37 8.54 10.41 8.94 10.85
InteractNet[13] ResNet-50-FPN COCO 9.94 7.16 10.77 - - -
GPNN[32] ResNet-101 COCO 13.11 9.34 14.23 - - -
iCAN[11] ResNet-50 COCO 14.84 10.45 16.15 16.26 11.33 17.73
PMFNet-Base[45] ResNet-50-FPN COCO 14.92 11.42 15.96 18.83 15.30 19.89
PMFNet[45] ResNet-50-FPN COCO 17.46 15.65 18.00 20.34 17.47 21.20
No-Frills[14] ResNet-152 COCO 17.18 12.17 18.68 - - -
TIN[25] ResNet-50 COCO 17.22 13.51 18.32 19.38 15.38 20.57
CHG[46] ResNet-50 COCO 17.57 16.85 17.78 21.00 20.74 21.08
Peyre et al.[31] ResNet-50-FPN COCO 19.40 14.63 20.87 - - -
IPNet [47] Hourglass COCO 19.56 12.79 21.58 22.05 15.77 23.92
VSGNet[43] ResNet-152 COCO 19.80 16.05 20.91 - - -
FCMNet[28] ResNet-50 COCO 20.41 17.34 21.56 22.04 18.97 23.12
ACP[20] ResNet-152 COCO 20.59 15.92 21.98 - - -
Bansal et al.[2] ResNet-50-FPN HICO-DET 21.96 16.43 23.62 - - -
PD-Net[56] ResNet-152 COCO 20.81 15.90 22.28 24.78 18.88 26.54
PastaNet[24] ResNet-50 COCO 22.65 21.17 23.09 24.53 23.00 24.99
VCL[15] ResNet-101 HICO-DET 23.63 17.21 25.55 25.98 19.12 28.03
DRG[10] ResNet-50-FPN HICO-DET 24.53 19.47 26.04 27.98 23.11 29.43
One-stage methods
UnionDet [19] ResNet-50-FPN HICO-DET 17.58 11.52 19.33 19.76 14.68 21.27
PPDM [26] Hourglass HICO-DET 21.73 13.78 24.10 24.58 16.65 26.84
HoiT [60] ResNet-50 - 23.46 16.91 25.41 26.15 19.24 28.22
PST (Ours) ResNet-50 - 23.93 14.98 26.60 26.42 17.61 29.05
Table 5: Comparison with state-of-the-art HOI methods on the HICO-DET dataset. For “Detector”, “COCO” means the method uses an off-the-shelf object detector trained on COCO, while “HICO-DET” means the COCO-pretrained object detector is further finetuned on HICO-DET. “Pose” indicates the method uses human pose as additional features; “Language” indicates language priors are applied.

Compared with the stage-wise VRD methods, PST outperforms the second-best method HGAT [30] by a margin of 2.8% at R@50 in relationship detection, but lags behind the best method RelDN [54] by margins of 0.71% and 1.32% on phrase and relationship detection at R@50. We note that RelDN is a sophisticated two-stage method that (1) leverages two CNNs for entity and predicate visual feature learning; (2) tunes the thresholds of three margins in its metric-learning-based loss functions; and (3) combines multi-modality information (visual, semantic, and spatial) for relationship prediction. By contrast, PST predicts relations based on visual features alone and detects relationships end-to-end and holistically, without any post-processing. PST is simple, with no hand-designed components to represent prior knowledge.

4.3 Human Object Interaction Detection

Figure 3: Visualization of the attention maps of decoder layers in PST. (a) is the input image; (b) shows a detected relationship from one query; and (c) visualizes this relationship’s attention maps in various decoder layers for the subject (I), object (II), and predicate (III) queries, respectively. Due to space limits, we show attention maps of only four decoder layers.

Competitors. We compare our model with two types of state-of-the-art HOI methods: the two-stage methods [36, 13, 32, 11, 45, 14, 25, 46, 31, 43, 28, 20, 2, 56, 24, 15, 10, 4] and the single-stage methods [19, 26, 60]. The two-stage methods detect individual objects in the first stage, then associate the detected objects and infer the HOI predictions in the second stage. They rely on good object detections from the first stage and mostly focus on the second stage, where language priors [14, 31, 28, 20, 2, 56, 24, 10] and human pose features [45, 25, 28, 20, 24] may be leveraged to facilitate the inference of HOI predictions from detected objects. The single-stage methods aim to bypass the object detection step and directly output HOI predictions in one step. Previous single-stage methods [19, 26] are not end-to-end solutions: they employ multiple branches, each outputting complementary HOI-related predictions, and rely on post-processing to decode the final HOI predictions. Most related to our approach is HoiT [60], which is an end-to-end single-stage solution. HoiT employs a DETR-like structure and treats HOI as a set prediction problem to achieve end-to-end learning. It can be regarded as our sum-branch transformer, in which each query directly outputs an HOI triplet.
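As a rough illustration of the sum-query output format that a HoiT-style decoder (and our sum branch) produces, the sketch below maps a single decoded query embedding to a full HOI triplet with simple linear heads. The head names and linear maps are hypothetical stand-ins (real models use small trained MLPs); only the class counts are taken from HICO-DET (80 COCO object classes, 117 verbs).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_obj, n_verb = 256, 80, 117   # embedding dim; HICO-DET: 80 objects, 117 verbs

# illustrative (untrained) linear heads; real models use small MLPs
W_hbox = rng.normal(size=(d, 4))
W_obox = rng.normal(size=(d, 4))
W_ocls = rng.normal(size=(d, n_obj))
W_vcls = rng.normal(size=(d, n_verb))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_hoi(e):
    """Map one decoded query embedding e (shape (d,)) to an HOI triplet:
    human box, object box, object class scores, verb class scores."""
    return {
        "human_box": sigmoid(e @ W_hbox),    # normalized (cx, cy, w, h)
        "object_box": sigmoid(e @ W_obox),
        "object_logits": e @ W_ocls,
        "verb_logits": e @ W_vcls,
    }

out = decode_hoi(rng.normal(size=d))
assert out["human_box"].shape == (4,) and out["verb_logits"].shape == (117,)
```

Because every query emits a complete triplet, no post-processing is needed to associate humans, objects, and verbs, which is exactly what makes this formulation end-to-end.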

Results. Table 5 shows the results of our method and other state-of-the-art HOI methods on the HICO-DET dataset. Most two-stage models achieve an mAP of around 20 (Default, Full) on the HICO-DET test set. The best two-stage model is DRG [10], which achieves 24.5 mAP; however, it is a complex model that requires three-stage training. In comparison, as end-to-end single-stage models, our model and the contemporary HoiT [60] easily exceed 20 mAP without using a dedicated object detector or extra pose or language information. Our model with composite queries achieves 23.9 mAP, the state of the art for single-stage HOI.

4.4 Ablation Study

We report further component analysis of our proposed part-and-sum transformer model (PST) here.

Single-branch vs. dual-branch decoder. Our part-and-sum transformer (PST) uses a dual-branch decoder for part queries and sum queries, i.e., part and sum queries are fed into separate self-attention layers, cross-attention layers, and FFNs, and are decoded independently. We compare this design with a single-branch decoder, in which part and sum queries are decoded by the same layers. The results in Table 6 show that the dual-branch design outperforms the single-branch design on both relationship detection and phrase detection. We hypothesize that part queries and sum queries represent different aspects of a relationship, so it is better to decode the two kinds of queries independently.
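The routing difference can be sketched as follows, with each decoder layer stack reduced to a single nonlinear map. This is a toy stand-in for illustration, not the actual PST layers: in the dual-branch design, part and sum queries pass through separately parameterized layers, while the single-branch baseline shares one set of parameters for both.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 256, 4   # embedding dim per (sub-)query; number of composite queries

def make_layer(seed):
    """Stand-in for one decoder layer stack (SA + CA + FFN), reduced to a
    single nonlinear map; real layers are attention-based and trained."""
    W = np.random.default_rng(seed).normal(size=(d, d)) / np.sqrt(d)
    return lambda x: np.tanh(x @ W)

part_layer, sum_layer = make_layer(1), make_layer(2)  # dual-branch: separate params
shared_layer = make_layer(3)                          # single-branch baseline

part_q = rng.normal(size=(N, 3, d))  # subject/predicate/object sub-queries
sum_q = rng.normal(size=(N, d))      # one sum query per relationship

# dual-branch: parts and sums are decoded by different layers
part_out = part_layer(part_q)
sum_out = sum_layer(sum_q)

# single-branch: both kinds of query pass through the same layer
single_part_out = shared_layer(part_q)
single_sum_out = shared_layer(sum_q)
```

The dual-branch variant thus dedicates capacity to each query type, at the cost of roughly twice the decoder parameters.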

Part and Sum decoding Phrase Detection Relationship Detection
R@50 R@100 R@50 R@100 R@50 R@100 R@50 R@100
Single-branch 27.32 32.71 31.59 37.82 20.04 23.10 24.89 29.87
Dual-branch 30.63 33.82 32.55 40.63 23.57 27.63 26.48 31.83
Table 6: Comparison (%) of single-branch vs. dual-branch designs in the Part-and-Sum decoder. Single-branch: part queries and sum queries share the same SA and CA layers; Dual-branch: part and sum queries are independently decoded by different SA and CA layers (SA: self-attention layer; CA: cross-attention layer).

Varying the number and dimension of the queries. We further compare the tensor-based query strategy with the vector-based query strategy by varying their numbers and dimensions. Specifically, PST uses 500 tensor queries, each containing three sub-vector queries of dimension 256. For a fair comparison, we also employ 500 vector queries of dimension 256×3. Note that enlarging the dimension of the vector-based queries to 256×3 requires the image memory dimension to grow by the same factor; to this end, we repeat the encoder features three times, so that each query in both models has an equal embedding dimension to represent a relationship. The results are shown in Table 7. We see that (1) increasing the number of vector queries from 500 to 1500 does not bring a clear benefit; (2) increasing the dimension of each vector query does not provide more information; (3) the tensor-based query outperforms the vector-based variants, with margins of 4.65%/5.16% on R@50/100 in relationship detection; and (4) compared to the tensor-based query, composite queries further improve performance thanks to part-and-sum two-level learning.

Query style Query Number Query Dimension Phrase Detection Relationship Detection
R@50 R@100 R@50 R@100 R@50 R@100 R@50 R@100
Vector 1500 256 26.39 29.68 29.41 32.68 18.63 21.14 20.26 23.64
Vector 500 256 26.17 29.43 27.66 32.71 17.88 19.41 19.97 23.08
Vector 500 256×3 25.13 30.17 27.88 33.34 18.65 20.94 21.20 25.92
Tensor 500 256 30.40 34.86 32.29 37.68 23.28 26.30 25.46 29.65
Composite 500 256 30.63 33.82 32.55 40.63 23.57 27.63 26.48 31.83
Table 7: Comparison (%) of different query designs in the PST decoder.

Part-and-Sum design in the HOI task. We further compare the vector-based transformer (PST-Sum), the tensor-based transformer (PST-Part), and the part-and-sum transformer (PST) on the HOI task in Table 8. PST outperforms both PST-Part and PST-Sum, consistent with the comparison on the VRD task.

Default Known Object
Method Backbone Detector Pose Language Full Rare NonRare Full Rare NonRare
PST-Sum ResNet-50 - - - 21.37 13.85 23.62 23.28 15.24 25.69
PST-Part ResNet-50 - - - 22.24 14.15 24.65 24.15 15.61 26.70
PST ResNet-50 - - - 23.93 14.98 26.60 26.42 17.61 29.05
Table 8: Results of different Part-and-Sum decoder variants on the HICO-DET dataset. “Pose” indicates the method uses human pose as additional features; “Language” indicates language priors are applied.

4.5 Composite Attention Visualisation

To better understand how the model works and what input information it uses for relationship detection, we visualize the cross-attention maps of the decoder layers of PST, since cross-attention measures the correlation between query embeddings and image feature tokens. Given a test image, we extract cross-attention maps from all decoder layers for the subject, object, and predicate query embeddings individually. We visualize the attention maps of one query in Figure 3 (c) and include results for more queries in the supplementary material. In Figure 3, the relationship query embedding is decoded to “person-wear-shoes”, and the attention maps show that (1) the transformer decoder incrementally focuses on the “person” and “shoes” areas of the image to infer the subject and object entities; and (2) the predicate (“wear”) is mostly predicted from the union area of the subject and object, which suggests that the attention scheme is capable of automatically modeling the subject-object context for predicate detection.

5 Conclusion

In this work, we have presented a Transformer-based detector, Part-and-Sum Transformers (PST), for visual composite set detection. PST emphasizes the importance of maintaining separate representations for the sum and parts while enhancing their interactions.


  • [1] J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §3.2.2.
  • [2] A. Bansal, S. S. Rambhatla, A. Shrivastava, and R. Chellappa (2019) Detecting human-object interactions via functional generalization. CoRR abs/1904.03181. External Links: Link, 1904.03181 Cited by: §4.3, Table 5.
  • [3] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. ECCV. Cited by: Appendix A, §1, §1, §2, §3.1, §3.3.
  • [4] Y. Chao, Y. Liu, X. Liu, H. Zeng, and J. Deng (2018) Learning to detect human-object interactions. In WACV, Cited by: §1, §1, §1, §2, §4.3, Table 5, §4.
  • [5] Z. Cui, C. Xu, W. Zheng, and J. Yang (2018) Context-dependent diffusion network for visual relationship detection. In Proceedings of the 26th ACM international conference on Multimedia, pp. 1475–1482. Cited by: §4.2.
  • [6] B. Dai, Y. Zhang, and D. Lin (2017) Detecting visual relationships with deep relational networks. In CVPR, Cited by: §4.2, Table 4.
  • [7] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021) An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations, External Links: Link Cited by: §2.
  • [8] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan (2009) Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence 32 (9), pp. 1627–1645. Cited by: §1.
  • [9] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester (2010) Cascade object detection with deformable part models. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2241–2248. Cited by: §3.2.2.
  • [10] C. Gao, J. Xu, Y. Zou, and J. Huang (2020) DRG: dual relation graph for human-object interaction detection. External Links: 2008.11714 Cited by: §1, §2, §4.3, §4.3, Table 5.
  • [11] C. Gao, Y. Zou, and J. Huang (2018) ICAN: instance-centric attention network for human-object interaction detection. CoRR abs/1808.10437. External Links: Link, 1808.10437 Cited by: §4.3, Table 5.
  • [12] C. Gao, Y. Zou, and J. Huang (2018) Ican: instance-centric attention network for human-object interaction detection. In BMVC, Cited by: §1, §1, §2, §4.
  • [13] G. Gkioxari, R. B. Girshick, P. Dollár, and K. He (2017) Detecting and recognizing human-object interactions. CoRR abs/1704.07333. External Links: Link, 1704.07333 Cited by: §4.3, Table 5.
  • [14] T. Gupta, A. G. Schwing, and D. Hoiem (2018) No-frills human-object interaction detection: factorization, appearance and layout encodings, and training techniques. CoRR abs/1811.05967. External Links: Link, 1811.05967 Cited by: §4.3, Table 5.
  • [15] Z. Hou, X. Peng, Y. Qiao, and D. Tao (2020) Visual compositional learning for human-object interaction detection. External Links: 2007.12407 Cited by: §1, §2, §4.3, Table 5.
  • [16] J. Hu, W. Zheng, J. Lai, S. Gong, and T. Xiang (2013) Recognising human-object interaction via exemplar based modelling. In Proceedings of the IEEE international conference on computer vision, pp. 3144–3151. Cited by: §1, §2.
  • [17] Y. Hu, S. Chen, X. Chen, Y. Zhang, and X. Gu (2019) Neural message passing for visual relationship detection. In ICML Workshop on Learning and Reasoning with Graph-Structured Representations, Long Beach, CA, Cited by: §1, §2, §4.2, Table 4.
  • [18] S. Inayoshi, K. Otani, A. Tejero-de-Pablos, and T. Harada (2020) Bounding-box channels for visual relationship detection. In ECCV, Cited by: §4.2, Table 4.
  • [19] B. Kim, T. Choi, J. Kang, and H. J. Kim (2020) UnionDet: union-level detector towards real-time human-object interaction detection. In European Conference on Computer Vision, pp. 498–514. Cited by: §1, §2, §4.3, Table 5.
  • [20] D. Kim, X. Sun, J. Choi, S. Lin, and I. S. Kweon (2020) Detecting human-object interactions with action co-occurrence priors. External Links: 2007.08728 Cited by: §4.3, Table 5.
  • [21] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, et al. (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. International journal of computer vision 123 (1), pp. 32–73. Cited by: §1, §1, §2.
  • [22] J. Lafferty, A. McCallum, and F. C. Pereira (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In ICML, Cited by: §2.
  • [23] Y. Li, W. Ouyang, X. Wang, and X. Tang (2017) ViP-CNN: visual phrase guided convolutional neural network. In CVPR, Cited by: Table 4.
  • [24] Y. Li, L. Xu, X. Liu, X. Huang, Y. Xu, S. Wang, H. Fang, Z. Ma, M. Chen, and C. Lu (2020) PaStaNet: toward human activity knowledge engine. External Links: 2004.00945 Cited by: §1, §2, §4.3, Table 5.
  • [25] Y. Li, S. Zhou, X. Huang, L. Xu, Z. Ma, H. Fang, Y. Wang, and C. Lu (2018) Transferable interactiveness prior for human-object interaction detection. CoRR abs/1811.08264. External Links: Link, 1811.08264 Cited by: §4.3, Table 5.
  • [26] Y. Liao, S. Liu, F. Wang, Y. Chen, C. Qian, and J. Feng (2020) Ppdm: parallel point detection and matching for real-time human-object interaction detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 482–490. Cited by: §4.3, Table 5.
  • [27] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In ECCV, pp. 740–755. Cited by: §2, §4, §4.
  • [28] Y. Liu, Q. Chen, and A. Zisserman (2020) Amplifying key cues for human-object-interaction detection. In European Conference on Computer Vision, Cited by: §4.3, Table 5.
  • [29] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei (2016) Visual relationship detection with language priors. In ECCV, Cited by: §1, §1, §1, §2, §4.2, §4.2, Table 4, §4, §4, §4.
  • [30] L. Mi and Z. Chen (2020) Hierarchical graph attention network for visual relationship detection. In CVPR, Cited by: §1, §2, §4.2, §4.2, Table 4.
  • [31] J. Peyre, I. Laptev, C. Schmid, and J. Sivic (2018) Detecting rare visual relations using analogies. CoRR abs/1812.05736. External Links: Link, 1812.05736 Cited by: §4.3, Table 5.
  • [32] S. Qi, W. Wang, B. Jia, J. Shen, and S. Zhu (2018) Learning human-object interactions by graph parsing neural networks. CoRR abs/1808.07962. External Links: Link, 1808.07962 Cited by: §4.3, Table 5.
  • [33] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1, §2, §2.
  • [34] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese (2019) Generalized intersection over union: a metric and a loss for bounding box regression. In CVPR, Cited by: §4.
  • [35] T. Serre, A. Oliva, and T. Poggio (2007) A feedforward architecture accounts for rapid categorization. Proceedings of the national academy of sciences 104 (15), pp. 6424–6429. Cited by: §2.
  • [36] L. Shen, S. Yeung, J. Hoffman, G. Mori, and L. Fei-Fei (2018) Scaling human-object interaction recognition through zero-shot learning. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Vol. , pp. 1568–1576. External Links: Document Cited by: §1, §2, §4.3, Table 5.
  • [37] L. Shen, S. Yeung, J. Hoffman, G. Mori, and L. Fei-Fei (2018) Scaling human-object interaction recognition through zero-shot learning. In WACV, Cited by: §1, §1, §2.
  • [38] J. Shotton, J. Winn, C. Rother, and A. Criminisi (2006) Textonboost: joint appearance, shape and context modeling for multi-class object recognition and segmentation. In CVPR, Cited by: §2.
  • [39] Z. Si and S. Zhu (2013) Learning and-or templates for object recognition and detection. IEEE transactions on pattern analysis and machine intelligence 35 (9), pp. 2189–2205. Cited by: §2.
  • [40] H. Tan and M. Bansal (2019) LXMERT: learning cross-modality encoder representations from transformers. In EMLP, Cited by: §2.
  • [41] I. Tsochantaridis, T. Joachims, T. Hofmann, Y. Altun, and Y. Singer (2005) Large margin methods for structured and interdependent output variables.. Journal of machine learning research 6 (9). Cited by: §2.
  • [42] Z. Tu (2008) Auto-context and its application to high-level vision tasks. In CVPR, Cited by: §2.
  • [43] O. Ulutan, A. S. M. Iftekhar, and B. S. Manjunath (2020) VSGNet: spatial attention network for detecting human object interactions using graph convolutions. External Links: 2003.05541 Cited by: §4.3, Table 5.
  • [44] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §A.1, §1, §2, §4.
  • [45] B. Wan, D. Zhou, Y. Liu, R. Li, and X. He (2019) Pose-aware multi-level feature network for human object interaction detection. CoRR abs/1909.08453. External Links: Link, 1909.08453 Cited by: §4.3, Table 5.
  • [46] H. Wang, W. Zheng, and L. Yingbiao (2020) Contextual heterogeneous graph network for human-object interaction detection. External Links: 2010.10001 Cited by: §4.3, Table 5.
  • [47] T. Wang, T. Yang, M. Danelljan, F. S. Khan, X. Zhang, and J. Sun (2020) Learning human-object interaction detection using interaction points. External Links: 2003.14023 Cited by: Table 5.
  • [48] Y. Xu, W. Xu, D. Cheung, and Z. Tu (2021) Line segment detection using transformers without edges. In CVPR, Cited by: §2.
  • [49] G. Yin, L. Sheng, B. Liu, N. Yu, X. Wang, J. Shao, and C. Change Loy (2018) Zoom-Net: mining deep feature interactions for visual relationship recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 322–338. Cited by: §1, §2, §4.2, Table 4.
  • [50] R. Yu, A. Li, V. I. Morariu, and L. S. Davis (2017) Visual relationship detection with internal and external linguistic knowledge distillation. In ICCV, Cited by: §1, §2, §4.2, Table 4.
  • [51] Y. Zhan, J. Yu, T. Yu, and D. Tao (2019) On exploring undetermined relationships for visual relationship detection. In CVPR, Cited by: §4.2, Table 4.
  • [52] H. Zhang, Z. Kyaw, S. Chang, and T. Chua (2017) Visual translation embedding network for visual relation detection. In CVPR, Cited by: §4.2, Table 4.
  • [53] J. Zhang, Y. Kalantidis, M. Rohrbach, M. Paluri, A. Elgammal, and M. Elhoseiny (2019) Large-scale visual relationship understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, Cited by: §4.2, Table 4.
  • [54] J. Zhang, K. J. Shih, A. Elgammal, A. Tao, and B. Catanzaro (2020) Graphical contrastive losses for scene graph parsing. In CVPR, Cited by: §1, §2, §4.1, §4.2, §4.2, Table 4, §4.
  • [55] S. Zheng, S. Chen, and Q. Jin (2019) Visual relation detection with multi-level attention. In Proceedings of the 27th ACM International Conference on Multimedia, pp. 121–129. Cited by: Table 4.
  • [56] X. Zhong, C. Ding, X. Qu, and D. Tao (2020) Polysemy deciphering network for robust human-object interaction detection. External Links: 2008.02918 Cited by: §4.3, Table 5.
  • [57] S. Zhu, D. Mumford, et al. (2007) A stochastic grammar of images. Foundations and Trends® in Computer Graphics and Vision 2 (4), pp. 259–362. Cited by: §2.
  • [58] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai (2020) Deformable DETR: deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159. Cited by: Appendix A, §2, §3.1, §4.1, §4, §4.
  • [59] B. Zhuang, L. Liu, C. Shen, and I. Reid (2017) Towards context-aware interaction recognition for visual relationship detection. In ICCV, Cited by: §4.2, Table 4.
  • [60] C. Zou, B. Wang, Y. Hu, J. Liu, Q. Wu, Y. Zhao, B. Li, C. Zhang, C. Zhang, Y. Wei, and J. Sun (2021) End-to-end human object interaction detection with hoi transformer. In CVPR, Cited by: §2, §4.3, §4.3, Table 5.

Appendix A Part-and-Sum Transformers for Composite Set Detection

In this work, we focus on end-to-end structured data detection with Part-and-Sum Transformers for tasks such as visual relationship detection and human-object interaction detection. We provide a more detailed discussion of the designs of the vanilla decoder, the tensor-based decoder, and the composite (part-and-sum) decoder. These three alternatives differ in the form of the queries and in how attention is implemented.

Figure 1(a) in the main submission gives an illustration: for an input image, we use a convolutional neural network (CNN) to extract image features, which are fed into a standard [3] or deformable [58] transformer encoder. The encoder is composed of multiple self-attention layers that tokenize the visual features. A transformer decoder then takes the visual tokens together with a set of learnable queries as input to detect a composite set (visual relationships or human-object interactions). We denote the tokenized features of the transformer encoder by F, the learnable queries by Q, and the output embeddings of the decoder by E. From the embeddings E, the structured prediction is inferred by a prediction module P.

A.1 Vanilla decoder with vector-based query

Our vanilla Transformer decoder contains N query embeddings, and each query is a vector representing a relationship:

Q = {q_1, q_2, ..., q_N},

where each q_i is a vector of size d. The queries are fed into multiple decoder layers of the same design. Specifically, each decoder layer contains a multi-head self-attention layer [44], which learns cross-relationship context; a multi-head cross-attention layer, which learns representations by attending to various image positions; and a feed-forward network (FFN), which further embeds each query. All query embeddings pass through these three components in turn, and the outputs are fed into the following decoder block. The decoding process in each decoder block is written as:

E = FFN(f_CA(f_SA(Q), F)),

where f_SA denotes the self-attention layer (SA) and f_CA the cross-attention layer (CA). Note that in the vanilla Transformer, each query represents an entire relationship, i.e., it contains multiple components.
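This per-block computation can be sketched in a minimal single-head form (residual connections included; the learned projections, multi-head splitting, and layer normalization [1] of the real decoder are omitted, and the FFN is replaced by a simple nonlinearity):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention (no learned projections)."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def decoder_block(Q, F):
    """One vanilla decoder block: self-attention over the queries (SA),
    cross-attention against the encoder memory F (CA), then a stand-in
    for the FFN; residual connections throughout."""
    Q = Q + attention(Q, Q, Q)   # SA: mine cross-relationship context
    Q = Q + attention(Q, F, F)   # CA: attend to image feature tokens
    return Q + np.tanh(Q)        # FFN stand-in

rng = np.random.default_rng(0)
Q = rng.normal(size=(100, 256))  # 100 vector queries, d = 256
F = rng.normal(size=(50, 256))   # 50 encoder memory tokens
E = decoder_block(Q, F)
assert E.shape == (100, 256)
```

Stacking several such blocks and reading E out with prediction heads yields the vector-query decoder described above.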

A.2 Tensor-based decoder with tensor-based query

Unlike a vector-based query, a tensor-based query represents a relationship by a tensor that contains multiple sub-queries, one for each part: the Subject, Predicate, and Object. The tensor-based query can be written as:

Q = {q_1, ..., q_N},  q_i = (q_i^s, q_i^p, q_i^o),

where each query q_i includes three sub-queries representing the subject, predicate, and object, respectively. By doing so, all parts are learned individually, reducing the ambiguity in the similarity computation of the attention schemes; this is important for learning relationships that share the same subject or object entity. In decoding, the attention layers handle all sub-queries:

E = FFN(f_CA(f_SA(Q), F)), applied over all 3N sub-queries.

Vector queries and tensor-based queries are conceptually different: the former learns each two-level structured instance as a whole (the sum), while the latter learns it through its parts. Furthermore, the self-attention layers are functionally different in the two designs: self-attention among sum queries mines inter-relationship context, while self-attention among part queries mines entity context, which indirectly benefits relationship learning.

A.3 Composite (part-and-sum) decoder with composite query

The composite query models each relationship in a structured manner and learns it at both the part and the sum level. Each composite query contains three part sub-queries, for the Subject, Predicate, and Object entities, and one sum sub-query for the relationship as a whole:

q_i = (q_i^s, q_i^p, q_i^o, q_i^sum),

where q_i^s, q_i^p, and q_i^o are the part queries and q_i^sum is the sum query of relationship i. In decoding, the part queries and sum queries are separately decoded by different self-attention layers, f_SA^part and f_SA^sum, and cross-attention layers, f_CA^part and f_CA^sum:

E^part = FFN(f_CA^part(f_SA^part(Q^part), F)),    E^sum = FFN(f_CA^sum(f_SA^sum(Q^sum), F)).

Figure 4: Qualitative results on (a) VRD and (b) HOI by PST. Each sub-image shows one predicted relationship/interaction. “R” refers to predicate; “S” refers to subject; and “O” refers to object.

Factorized self-attention. To enhance part-based relationship learning, we design a factorized self-attention layer, which first conducts intra-relationship self-attention and then inter-relationship self-attention. The intra-relationship self-attention leverages part context to benefit relationship prediction; for example, a subject query “person” and an object query “horse” help predict the predicate “Ride”. The inter-relationship self-attention leverages inter-relationship context to enhance holistic relationship prediction for the whole image; for example, the existence of “Person read book” helps infer the relationship “Person sit” rather than “Person run”, which is particularly important when detecting multiple interactions of the same person entity. The factorized self-attention is written as:

f_SA(Q) = f_SA^inter(f_SA^intra(Q)),

where the intra-relationship self-attention f_SA^intra attends among the sub-queries within each composite query q_i, and the inter-relationship self-attention f_SA^inter attends among corresponding sub-queries across the N composite queries.
Note that the factorized self-attention design can also be used with tensor-based queries to enhance inter-part-query learning.
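The factorized scheme can be sketched by treating the decoder state as an N×P×d array of sub-queries (here P = 4 for subject, predicate, object, and sum): intra-relationship attention mixes the P sub-queries of one relationship, and inter-relationship attention then mixes corresponding sub-queries across the N relationships. Single-head attention without learned projections, for illustration only:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attn(X):
    """Single-head self-attention over the second-to-last axis of X,
    batched over all leading axes."""
    d = X.shape[-1]
    A = softmax(np.einsum('...id,...jd->...ij', X, X) / np.sqrt(d))
    return np.einsum('...ij,...jd->...id', A, X)

def factorized_sa(Q):
    """Q has shape (N, P, d): N relationships, each with P sub-queries.
    Intra-relationship SA attends over the P sub-queries of one
    relationship; inter-relationship SA attends over the N relationships
    for each sub-query slot."""
    Q = Q + self_attn(Q)              # intra: attend over P, per relationship
    Qt = np.swapaxes(Q, 0, 1)         # (P, N, d)
    Qt = Qt + self_attn(Qt)           # inter: attend over N, per slot
    return np.swapaxes(Qt, 0, 1)      # back to (N, P, d)

rng = np.random.default_rng(0)
Q = rng.normal(size=(500, 4, 256))    # 500 composite queries
assert factorized_sa(Q).shape == (500, 4, 256)
```

Factorizing this way keeps the attention cost at O(N·P² + P·N²) rather than O((N·P)²) for full self-attention over all sub-queries.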

Figure 5: Visualization of small-entity relationship detection. Example relationship predictions are (a) Person-hit-ball; (b) Person-wear-jacket; (c) Tower-has-clock; and (d) Person-has-bag. PST is able to detect small subjects or objects in relationships and thus detect the overall relationships properly. “R” refers to predicate; “S” to subject; and “O” to object.
Figure 6: Visualization of relationship detection with instance ambiguity, where multiple spatially close relationship instances of the same type exist: for example, “person-has-phone” in (a), (b), and (f); “person-use-laptop” in (c); “Person-wear-helmet” in (d); and “Motorcycle-carry-person” in (e). “R” refers to predicate; “S” to subject; and “O” to object.

Appendix B Visualisations of VRD and HOI detection

PST directly predicts all relationships as a set. Figure 4 shows example relationship detection results and human-object interaction detection results from PST in (a) and (b), respectively. Each sub-image visualizes one predicted relationship. Multiple relationships can exist between one entity pair; for example, in Figure 4 (a), PST detects both “Tower-has-clock” and “Clock-on-tower”, as well as “Road-under-tower” and “Tower-on-road”.

Appendix C More Qualitative Illustrations

In addition to the results shown in the main paper, we provide more qualitative results and analysis for visual relationship detection and human-object interaction detection by the proposed PST.

C.1 Small-entity relationship detection

In the VRD task, relationships come in multiple types, some of which pose particular challenges, such as small-entity and spatial relationships. Figure 5 shows examples of small-entity relationship detection. PST detects small subjects and objects well, such as “ball” in (a), “person” and “jacket” in (b), “clock” in (c), and “bag” in (d). This is because the part queries mine the subject-predicate-object context and the sum queries leverage the inter-relationship context; together, these two types of contextual information are effective for detecting small entities in a relationship.

C.2 Instance ambiguity in relationship detection

Instance ambiguity in relationship detection causes failures in which the predicted relationships associate the wrong subject and object instances, even though the relationship categories are predicted correctly. For instance, in Figure 6 (a) and (b), there are two visually close relationships of the same type, “Person-has-phone”, which makes it hard to associate each “phone” instance with the correct surrounding “Person” instance. The ambiguity arises because multiple relationship instances of the same type are very close and the visual cues for associating the subject-object instances are subtle. We examine PST in these challenging cases in Figure 6: PST associates subject-object instances correctly in this hard situation, thanks to its effective intra-relation and inter-relation attention for context modeling.

C.3 Failure cases of PST

Figure 7: Visualization of a few failure cases of PST. There are four main error types: (a) inaccurate entity detection caused by crowded instances; (b) wrong object detection; (c) wrong association between subject and object instances; and (d) wrong predicate classification. “R” refers to predicate; “S” to subject; and “O” to object.

We visualize typical relationship detection errors of PST in Figure 7. There are four typical errors: (1) PST localizes entities inaccurately when the entity instances are crowded; for instance, in Figure 7 (a), there are multiple horses close to each other, and PST localizes several of them as the single object entity of the relationship “Person-on-horse”. (2) Object detection mistakes cause failures in relationship detection, such as the wrong entity “camera” detected in Figure 7 (b). (3) Relationship instance ambiguity challenges PST; for instance, in Figure 7 (c), the watch is associated with a wrong person instance that is very close to the correct one. (4) The predicate is wrongly predicted; for instance, PST classifies the relationship between “Person” and “bottle” as “hold”.