In this paper, we study problems such as visual relationship detection (VRD) [29, 21] and human object interaction (HOI) [12, 37, 4] detection, where a composite set with a two-level (part-and-sum) hierarchy is to be detected and localized in an image. In both VRD and HOI, the output consists of a set of entities. Each entity, referred to as a "sum", represents a triplet structure composed of parts: the parts are (subject, object, predicate) in VRD and (human, interaction, object) in HOI. The sum-and-parts structure naturally forms a two-level hierarchical output, with the sum at the root level and the parts at the leaf level. In the general setting for composite set detection, the hierarchy consists of two levels, but the number of parts can be arbitrary.
Figure 1: (a) Pipeline overview. (b) Part-and-Sum Transformer Decoder.
Many existing approaches to VRD [29, 17, 30, 54] and HOI [36, 4, 24, 15, 10] are based on two-stage processes in which some parts (e.g., the subject and the object in VRD) are detected first, followed by detection of the association (sum) and the remaining part (the predicate). Single-stage approaches [50, 49, 16, 19] with end-to-end learning for VRD and HOI also exist. In practice, two-stage approaches produce better performance, while single-stage methods are easier to train and use.
The task of an object detector is to detect and localize all valid objects in an input image, making the output a set. Though object detectors such as Faster R-CNN  are considered end-to-end trainable, they perform instance-level predictions and require post-processing with non-maximum suppression to recover the full set of objects in an image. Recent developments in Transformers  and their extensions to object detection  enable set-level end-to-end learning by eliminating anchor proposals and non-maximum suppression.
In this paper, we formulate visual relationship detection (VRD) [29, 21] and human object interaction (HOI) [12, 37, 4] as composite set (two-level hierarchy) detection problems and propose a new approach, Part-and-Sum Transformers (PST), to solve them. PST differs from existing detection Transformers, where each object is represented by a vector-based query in a one-level set . We show the importance of establishing an explicit "sum" representation for the triplet as a whole, modeled jointly with the part queries (e.g., subject, object, and predicate). Both global and part features were also modeled in the discriminatively trained part-based model (DPM) algorithm , though the global-part interactions there were limited to relative spatial locations. To summarize, we develop a new approach, Part-and-Sum Transformers (PST), to solve the composite set detection problem by creating composite queries and a composite attention mechanism that account for both the sum (vector-based query) and part (tensor-based query) representations. Through systematic experiments, we study the roles of the sum and part queries and of intra- and inter-token attention in Transformers. The effectiveness of the proposed PST algorithm is demonstrated on the VRD and HOI tasks.
2 Related Work
The standard object detection task  is a set prediction problem in which each element refers to a bounding box. If the prediction output is a hierarchy of multiple layers, e.g. , standard sliding-window approaches  that rely on features extracted from the entire window no longer suffice. Algorithms performing inference hierarchically exist [57, 35, 39], but they are not trained end-to-end. Here we study problems that require prediction of a two-level structured set, including visual relationship detection (VRD) [29, 21] and human object interaction (HOI) [12, 37, 4]. We aim to develop a general-purpose, end-to-end trained algorithm for composite set detection with a new Transformer design, which differs from previous two-stage [29, 17, 30, 54, 36, 4, 24, 15, 10] and single-stage approaches [50, 49, 16, 19]. The concept of sum-and-max  is related to our part-and-sum approach, but the two differ substantially in many aspects. In terms of structural modeling, structured prediction (or semantic labeling) has long been studied in machine learning [22, 41] and computer vision [38, 42].
Transformers  have recently been applied to many computer vision tasks [7, 40, 48]. Prominently, the Transformer-based object detector DETR  has shown results comparable to well-established, not fully differentiable models based on CNNs and NMS, such as Faster R-CNN . Deformable DETR  has matched the performance of the previous state of the art while preserving the end-to-end differentiability of DETR. A recent attempt  parallel to ours also applies DETR to the HOI task. Our proposed Part-and-Sum Transformer (PST) differs in algorithm design, developing composite queries and composite attention to simultaneously model global and local information as well as their interactions.
3 Composite Set Detection by Transformer
In this section, we describe the PST formulation for visual composite set detection. We use VRD as the running example; the formulation extends straightforwardly to HOI by taking the subject to always be human and the predicate to be the interaction with the object.
Given an image $I$, the goal of VRD is to detect a set of visual relations $R = \{r_1, \dots, r_K\}$. Each visual relation $r_k$, a sum, has three parts: subject, object, and predicate, i.e., $r_k = (s_k, p_k, o_k)$. For each $r_k$, the subject and object have class labels $c_k^s$ and $c_k^o$ and bounding boxes $b_k^s$ and $b_k^o$. The predicate has a class label $c_k^p$. VRD is therefore a composite set detection task, where each instance in the set is a composite entity consisting of three parts.
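As a concrete (hypothetical) illustration of this output structure, one VRD "sum" with its three parts can be written as a small record, and the detection target is a set of such records; the names and values below are our own, not from the paper:

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), normalized coordinates

@dataclass
class Relation:
    """One 'sum' with three 'parts': subject, predicate, object."""
    subject_cls: str
    subject_box: Box
    predicate_cls: str
    object_cls: str
    object_box: Box

# A composite set is simply a collection of such triplets for one image.
detections = [
    Relation("person", (0.1, 0.1, 0.4, 0.9), "ride", "horse", (0.2, 0.4, 0.8, 0.95)),
    Relation("person", (0.1, 0.1, 0.4, 0.9), "wear", "hat", (0.15, 0.05, 0.3, 0.2)),
]
```

Note that the same entity (here, the person) may participate in several relations, which is what makes composite set detection harder than plain object detection.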
An overview of the proposed PST is shown in Figure 1 (a). Given an input image, we first obtain image feature maps from a CNN backbone. The image features, with learnable position embeddings, are further encoded/tokenized by a standard  or deformable  Transformer encoder. The tokenized image features and a set of learnable queries are fed into a Transformer decoder to infer the classes and positions of every composite entity. Unlike standard object detection, composite set detection detects not only all object entities but also the entity-wise structure/relationships. To model composite data accurately, we propose a composite-query-based Part-and-Sum Transformer decoder that learns each composite structure at both the entity and relationship levels, as shown in Figure 1 (b). In the following sections, we detail the PST model and the corresponding training and inference procedures.
3.2 Part-and-Sum Transformer (PST)
To construct the PST model, we first describe the vector-based, tensor-based, and composite queries used for composite set prediction. We then formulate the composite Transformer decoder layer based on the composite queries.
3.2.1 Query Design for Composite Data
Vector-based query. A standard decoder, as used in DETR, takes a set of vector-based queries as input, as shown in Figure 2 (a). When applying this formulation to relationship detection, one can use feed-forward networks (FFNs) to directly predict a subject-predicate-object triplet from the output of each query. This straightforward extension of DETR, while a reasonable baseline, is sub-optimal: each query mixes the parts and their interaction inside a single vector, leaving them only implicitly modeled and limiting the expressiveness and representational capability of the visual relationship model.
Tensor-based query. To explicitly model the parts and their relationships (e.g., Subject, Predicate, and Object), we propose a tensor-based query representation using disjoint sub-vectors as sub-queries. Specifically, for VRD, three sub-queries represent the Subject, Predicate, and Object. All queries together form an $N \times M \times d$ tensor, where $N$ is the number of queries, $M$ is the number of entities in a relationship ($M = 3$), and $d$ is the feature dimension of the sub-queries. This formulation enables part-wise decoding in the Transformer decoder, as shown in Figure 2 (b). Technically, the vector-based query represents each relationship as a whole (Sum), whereas the tensor-based query models the parts disjointly. The difference in query design leads to a difference in the learned contexts: self-attention among vector-based queries mines inter-relation context, while self-attention among part queries mines inter-part context.
Composite query. On the one hand, the vector-based query captures the relationship as a sum/whole, but carries an intrinsic ambiguity among the parts. On the other hand, the tensor-based query models each part explicitly, but lacks knowledge of the relationship as a sum, which is important for the subject-object association. Based on this observation, we propose a composite query representation. Formally, each composite query is composed of part queries (a tensor query) as well as a sum/whole query (a vector-based query). In VRD, each composite query contains three sub-queries representing the Subject, Predicate, and Object, plus one sum query representing the relationship as a whole. Assuming $N$ composite queries in the decoder, the overall query is an $N \times (M+1) \times d$ tensor, where $d$ is the dimension of the sub-queries.
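For concreteness, the three query designs can be distinguished purely by their array shapes. The sketch below is our own illustration (the sizes are arbitrary placeholders, not the paper's settings):

```python
import numpy as np

N, M, d = 4, 3, 8  # relations, parts per relation (subj/pred/obj), sub-query dim

# (a) vector-based: one vector per relation, parts mixed inside it
vector_queries = np.zeros((N, d))

# (b) tensor-based: three disjoint part sub-queries per relation
tensor_queries = np.zeros((N, M, d))

# (c) composite: part sub-queries plus one sum query per relation
sum_queries = np.zeros((N, 1, d))
composite_queries = np.concatenate([tensor_queries, sum_queries], axis=1)  # (N, M+1, d)
```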
3.2.2 Part-and-Sum Transformer Decoder
As each composite query includes both part queries $Q_p$ and a sum query $Q_s$, we decode the part and sum queries separately. To enable mutual benefit between part and sum learning, we also introduce a part-sum interaction. Additionally, we propose a factorized self-attention layer to further enhance part-level learning. The architecture of the Part-and-Sum Transformer decoder is illustrated in Figure 2 (c).
Part-and-Sum separate decoding. The PST decoder has a two-stream architecture, for part-query and sum-query decoding, respectively. Each decoding stream contains self-attention (SA) layers, cross-attention (CA) layers, and feed-forward networks (FFN). Let $\mathrm{SA}_p$ and $\mathrm{CA}_p$ denote the SA and CA layers for the part queries $Q_p$. Decoding the part queries is written as:

$$Q_p' = \mathrm{FFN}_p\big(\mathrm{CA}_p(\mathrm{SA}_p(Q_p), E)\big),$$
where $E$ denotes the tokenized image features from the Transformer encoder. Similarly, with $\mathrm{SA}_s$ and $\mathrm{CA}_s$ for the sum queries $Q_s$, decoding the sum queries can be written as:

$$Q_s' = \mathrm{FFN}_s\big(\mathrm{CA}_s(\mathrm{SA}_s(Q_s), E)\big).$$
Each entity thus has both part and global embeddings via the two separate decoding streams. The self-attention exploits the context among all queries, and Part-and-Sum separate decoding models two different types of context: self-attention among part queries explores the inter-component context (for example, a part query predicting "person" reinforces related predicates such as "eat" and "hold"), while self-attention among sum queries exploits the inter-relationship context (for example, a sum query predicting "person read book" is a clue for inferring a "person sit" relationship). These contexts provide the interactions needed for accurate inference of the structured output.
Factorized self-attention layer. To make the interactions within a group of part queries more structured, we design a factorized self-attention layer, as shown in Figure 2 (b). Instead of performing self-attention among all part queries as in Eq. 1, the factorized layer first conducts intra-relation self-attention and then inter-relation self-attention. The intra-relation self-attention leverages the context of the parts to benefit relationship prediction; for example, a subject query "person" and an object query "horse" help predict the predicate "ride". The inter-relation self-attention leverages the inter-relation context to enhance holistic relation prediction per image, which is particularly important for detecting multiple interactions involving the same subject entity. More details are in the supplementary materials.
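A minimal, single-head sketch of the factorized scheme (our own illustration; learned projections, residuals, and multi-head structure are omitted, and the grouping of the inter-relation step over part slots is our assumption): intra-relation attention mixes the $M$ parts within each relation, then inter-relation attention mixes corresponding tokens across the $N$ relations.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attn(q):
    """Plain scaled dot-product self-attention over rows of q: (tokens, d)."""
    d = q.shape[-1]
    w = softmax(q @ q.T / np.sqrt(d))
    return w @ q

def factorized_sa(parts):
    """parts: (N, M, d) part queries for N relations with M parts each."""
    N, M, _ = parts.shape
    # 1) intra-relation: attend among the M parts of each relation
    intra = np.stack([self_attn(parts[i]) for i in range(N)])
    # 2) inter-relation: attend among the N relations, per part slot
    inter = np.stack([self_attn(intra[:, j]) for j in range(M)], axis=1)
    return inter

out = factorized_sa(np.random.default_rng(0).normal(size=(4, 3, 8)))
```

Factorizing this way keeps the attention cost at O(N·M² + M·N²) interactions instead of O((N·M)²) for full self-attention over all part tokens.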
Part-Sum interaction. Part query decoding embeds more accurate component information, while the global embedding captures more accurate component association. Both aspects are important for structured output detection and are mutually beneficial . We therefore design an interaction between the two decoding streams, conditioning parts and sums on each other. Specifically, after the FFNs in the decoder, each part embedding $q_i$ is combined with its sum query embedding $q^{sum}$, while each sum query is fused with its three part query embeddings. The part-sum interaction is formulated as:

$$q_i' = \mathrm{LN}\big(q_i + q^{sum}\big), \qquad q^{sum\prime} = \mathrm{LN}\Big(q^{sum} + \sum_i q_i\Big),$$
where $\mathrm{LN}$ denotes layer normalization .
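The bidirectional summation-plus-normalization fusion can be sketched as follows (our paraphrase of the scheme; `layer_norm` here is a plain, parameter-free layer normalization):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize over the feature (last) dimension."""
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def part_sum_interaction(parts, sums):
    """parts: (N, M, d) part embeddings; sums: (N, d) sum embeddings."""
    new_parts = layer_norm(parts + sums[:, None, :])  # each part sees its sum
    new_sums = layer_norm(sums + parts.sum(axis=1))   # each sum sees its parts
    return new_parts, new_sums

p, s = part_sum_interaction(np.ones((4, 3, 8)), np.ones((4, 8)))
```

The grouping is fixed by design: the i-th sum embedding only interacts with the i-th group of part embeddings, so no learned matching between the streams is needed.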
3.3 Model Training and Inference
Composite prediction. For each composite query, we predict the classes of the subject, object, and predicate, and the bounding boxes of the subject and object. Specifically, for each part query we predict the corresponding class with a single linear layer and the box with a shallow MLP head. Besides, we construct a global representation from the group-wise part queries by concatenating them into a triplet embedding. The part query prediction is:
where separate FFNs perform subject, object, and predicate classification; two further FFNs predict the boxes of the subject and object; and one FFN predicts the relation triplet.
For Sum query prediction, we predict the classes and boxes of all parts from a single Sum query: separate FFNs perform subject, object, and predicate classification, and two FFNs predict the boxes of the subject and object at the global level. Note that the last layer of each classification FFN is a Softmax layer, while the last layer of each box-regression FFN is a Sigmoid layer.
Composite bipartite matching. We conduct composite group-wise bipartite matching, i.e., all components belonging to a relation are considered jointly when computing the set-to-set similarity. Specifically, each relation has three Part queries (subject, object, and predicate) and a triplet embedding. The bipartite matching finds a permutation of the predictions that minimizes the total matching cost:

$$\hat{\sigma} = \arg\min_{\sigma \in \Sigma_N} \sum_{i=1}^{N} \mathcal{C}\big(y_i, \hat{y}_{\sigma(i)}\big),$$
where $\Sigma_N$ is the set of all permutations of $N$ elements, and $\sigma(i)$ is the $i$-th element of the permutation $\sigma$.
We define the matching cost $\mathcal{C}$ between the $i$-th ground truth and the corresponding $\sigma(i)$-th prediction determined by the permutation $\sigma$ as follows:
where the class probabilities are computed by Eq. 4 and 5, and the predicted bounding boxes include those of the subject and object from the Sum query branch. The box cost includes GIoU and L1 losses, the same as in the part branch. Here we use the union box of the subject and object in a relation as the target box of the corresponding predicate.
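To make the assignment step concrete, here is a brute-force version over a toy cost matrix (illustrative only; practical implementations use the Hungarian algorithm, e.g. SciPy's `linear_sum_assignment`, and the costs below are made up):

```python
from itertools import permutations

def best_permutation(cost):
    """cost[i][j]: matching cost between ground truth i and prediction j.
    Returns the permutation sigma minimizing sum_i cost[i][sigma[i]]."""
    n = len(cost)
    return min(permutations(range(n)),
               key=lambda sigma: sum(cost[i][sigma[i]] for i in range(n)))

# Toy example: 3 ground-truth relations vs 3 predicted composite queries.
cost = [
    [0.2, 0.9, 0.8],
    [0.7, 0.1, 0.9],
    [0.6, 0.8, 0.3],
]
sigma = best_permutation(cost)
```

Brute force is O(n!) and only viable for tiny examples like this; with hundreds of composite queries, the Hungarian algorithm's O(n³) matching is required.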
Training loss. Given the two-level Part and Sum outputs, we compute classification and box regression losses at both levels. Once we obtain the best permutation $\hat{\sigma}$ minimizing the overall matching cost between predictions and ground truths, we compute the total loss by summing the matched classification and box regression losses over both the part and sum levels.
Table 1 header: Query Type | Transformer Decoder Design | Phrase Detection | Relationship Detection.
4 Experiments

We evaluate our method on two composite set detection applications: Visual Relationship Detection (VRD) and Human Object Interaction (HOI) detection.
Datasets. (1) For the VRD task, we evaluate the proposed PST on the VRD dataset , containing images, entity categories, and predicate categories. Relationships are labeled as a set of (subject, predicate, object) triplets, and all subject and object entities in relationships are annotated with an entity category and a bounding box. We follow the data split of  for training/validation/test. The visual relationship instances belong to a set of triplet types, and the relation types that appear only in the test set are used for zero-shot relationship detection. (2) For the HOI task, we conduct an evaluation on the HICO-DET dataset , including 38,118 training images and 9,658 testing images. This dataset has the same 80 object categories as MS-COCO and 117 verb categories, which combine into 600 classes of HOI triplets. One person can interact with multiple objects in various ways at the same time in this dataset.
Task Settings. For the VRD task, we test PST on Phrase Detection and Relationship Detection [29, 54]. In Phrase Detection, the model detects one bounding box for each relationship and recognizes the categories of the subject, object, and predicate in the relationship. In Relationship Detection, the model detects two individual bounding boxes for the subject and object entities and classifies the subject, predicate, and object in the relationship. In both tasks, we consider two settings: a single predicate or multiple predicates between a pair of subject and object entities.
Evaluation Metrics. (1) In VRD, we use relationship detection recall@K as the evaluation metric, since the true relationship annotations are incomplete. Following the standard evaluation protocol, for each detected relationship we compute the joint probability of the subject, predicate, and object category predictions as its score, and then rank all detected relationships to compute the recall metric. A relationship is detected correctly only if all three elements are correctly classified and the IoUs between the predicted and ground-truth bounding boxes are greater than 0.5. (2) In HOI, we use mean average precision (mAP)  as the evaluation metric. An HOI detection is considered correct only when the action and object classes are both correctly recognized and the corresponding human and object bounding boxes have IoU higher than 0.5 with the ground-truth boxes.
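As an illustration of the recall@K computation (not the paper's code), with the classification and IoU checks reduced to a precomputed ground-truth match index per detection, and all scores hypothetical:

```python
def recall_at_k(pred_scores, pred_matches, num_gt, k):
    """pred_scores: score per detected relation (e.g. the joint
    subject * predicate * object probability).
    pred_matches: index of the ground-truth relation matched by each
    detection, or None if it fails the class/IoU checks."""
    order = sorted(range(len(pred_scores)), key=lambda i: -pred_scores[i])
    hit = set()
    for i in order[:k]:
        if pred_matches[i] is not None:
            hit.add(pred_matches[i])
    return len(hit) / num_gt

scores = [0.9, 0.8, 0.6, 0.4]
matches = [0, None, 1, 2]  # detection 1 is a false positive
r_at_3 = recall_at_k(scores, matches, num_gt=3, k=3)
```

With this toy input, the top-3 detections recover ground truths 0 and 1 but miss 2, so the recall is 2/3.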
Implementation Details. PST shares the same configuration on both the VRD and HOI tasks. Specifically, PST uses the standard ResNet-50 network as the backbone, followed by a Transformer encoder with six encoder layers, the same as Deformable DETR . The decoder contains six of the proposed two-stream Part-and-Sum decoder layers. All feed-forward networks are shallow two-linear-layer networks. We set up auxiliary losses after each decoder layer and use 400 composite queries, with three part queries representing the subject, object, and predicate in a visual relation, or the human, object, and interaction in an HOI, respectively. Note that in our experiments we use the vanilla multi-head self-attention module  as the self-attention layer and a deformable multi-head cross-attention module as the cross-attention layer.
During training, we train PST in an end-to-end manner, with separate learning rates for the backbone and the rest of the network. The training batch size is 2, and training runs for 300 epochs. On both the part and sum decoding streams, the loss weights of the subject, object, and predicate classification and triplet reasoning losses are set to 1, the bounding box L1 losses to 5, and the GIoU losses to 2; the part and sum decoding losses are weighted equally. We initialize the backbone and Transformer encoder of PST with a Deformable DETR model  pretrained on MS-COCO . We use common data augmentation in training, including image flipping, rescaling, and cropping. (Note that we do not apply extra relationship-level data augmentation; if an entity disappears due to data augmentation, its related relationships are excluded from relationship learning.)
4.1 Part-and-Sum Transformer Analysis
Part-and-Sum Transformer Decoder. We first compare and analyze different Transformer designs with different query types on VRD; the designs are compared in Figure 2. The vector-based query is the most straightforward way to detect structured outputs with a Transformer: each structured entity is formulated as a vector query, and the queries are fed into a vanilla Transformer decoder  to learn an embedding for each relation. Three single-linear-layer heads then predict the classes of subjects, objects, and predicates, and two three-layer MLP heads regress the boxes. The comparison results are shown in Table 1.
We can see that: (1) With the vanilla Transformer decoder, the tensor-based query outperforms the vector-based query ((a) vs (b)) by margins of 1.01%/1.64% and 2.03%/2.81% at R@50/100 on Phrase and Relationship detection. This is because the vector query models the structured entity as a whole and embeds multiple parts in one query, which increases the difficulty of Hungarian matching. (2) For the tensor-based query, the Part Transformer outperforms the vanilla Transformer by a clear margin ((b) vs (c)). This benefit mainly comes from the factorized design of the self-attention layer and the relation-level constraint (Eq. 8). The former enhances the intra-relation context, reducing the ambiguity of entity recognition; for example, subject "person" and object "horse" are important clues for inferring the predicate "ride". The latter learns the relation as a whole, reducing entity instance confusion . (3) Despite the part-and-sum benefit inside the composite query, the vanilla Transformer with composite queries degrades compared with the vector query ((d) vs (a)). This shows that directly mixing Part and Sum queries does not benefit structured output learning: each Sum query contains multiple parts, and some relations may share the same entity instance, which can confuse the similarity computation in the self-attention modules. To leverage the two-level information and context effectively, PST decodes Part and Sum queries separately, with group-wise part-sum interaction. Comparing (d) vs (e), this design outperforms the vanilla Transformer by margins of 4.54%/6.52% and 6.31%/7.30% at R@50/100 on Phrase and Relationship detection.
Factorized self-attention layer. We examine the effectiveness of the factorized self-attention layer in the Part query decoding stream by comparing PST with the factorized self-attention layer against PST with the vanilla self-attention layer on VRD; the results are shown in Table 2. The factorized design leads to a 1.18%/2.66% improvement at R@50/100 on Relationship detection.
Part-Sum interaction. We compare two Part-Sum interaction schemes: vanilla self-attention vs a summation operation, with results in Table 2. Part-Sum bidirectional summation works better than self-attention interaction, mainly due to the fixed grouping between part and sum queries in PST, as shown in Figure 2 (c). (In PST, the component order in a composite query is fixed. On VRD, for instance, the first part query is for the Subject, the second for the Object, and the third for the Predicate. The grouping between part queries and the sum query is also fixed by design; for instance, the first sum query and the first group of part queries represent the same relation instance.)
Composite prediction. Given the two-stream design of Part-and-Sum decoding, we obtain predictions at both the part and sum levels. We therefore study three inference schemes: predicting the structured data from the part query branch, from the sum query branch, or by combining the two branches. To combine the predictions, we average the classification probabilities of the group-wise part and sum queries, and average the predicted positions of the top-left and bottom-right box corners. The comparison in Table 3 shows that Part-only inference slightly outperforms Sum-only inference, and combining the two levels brings a minor further improvement for relationship detection. (For clarity, we report results with part-query-only inference in both the VRD and HOI experiments.)
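The combination scheme amounts to simple averaging; a sketch with hypothetical numbers (our own illustration):

```python
import numpy as np

def combine(part_probs, sum_probs, part_box, sum_box):
    """Average class distributions and box corners from the two branches.
    part_box/sum_box are (x1, y1, x2, y2) predictions."""
    probs = (np.asarray(part_probs) + np.asarray(sum_probs)) / 2.0
    box = (np.asarray(part_box) + np.asarray(sum_box)) / 2.0
    return probs, box

probs, box = combine([0.7, 0.2, 0.1], [0.5, 0.4, 0.1],
                     (0.10, 0.10, 0.40, 0.90), (0.12, 0.08, 0.42, 0.88))
```

Because the part-sum grouping is fixed, the i-th sum prediction and the i-th group of part predictions refer to the same relation instance, so no matching step is needed before averaging.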
4.2 Visual Relationship Detection
We compare PST with a wide range of visual relationship detection methods on the VRD dataset .
Competitors. Existing visual relationship detection methods fall into two categories. (I) Stage-wise methods: these first detect objects using a pre-trained detector, then use its outputs (bounding boxes, visual features, or spatial information) as fixed inputs to a relationship detection module. Specifically, we compare with: (1) VRD-Full , which combines the visual appearance and language features of candidate boxes to learn relationships. (2) NMP , which builds a relationship graph and optimizes it with node-to-edge and edge-to-node message passing. (3) CDDN , which proposes a context-guided visual-semantic feature fusion scheme for predicate detection. (4) LSVR , which learns a better representation by aligning features at both the entity and relationship levels. (5) RelDN , which uses contrastive loss functions to learn fine-grained visual features that reduce the ambiguity of relationship detection. (6) BCVRD , which proposes a new box-wise fusion method to better combine visual, semantic, and spatial features. (7) HGAT , which uses object-level and triplet-level reasoning to improve relationship detection. (II) End-to-end methods: these detect objects and relationships jointly. Specifically, we compare with: (1) CAI , which leverages the subject-object context to detect relationships. (2) KL distillation , which uses a well-trained linguistic model to regularize visual model learning toward more generalized relationship embeddings. (3) DR-Net , which designs a fully connected network to mine object-pair relationships. (4) Zoom-Net , which leverages multi-scale contexts for fine-grained relationship detection. (5) VTransE , which learns a more generalized relationship representation by mapping visual features to the relationship space.
Table 4 header: Method | Phrase Detection | Relationship Detection.
Results. The visual relationship detection comparisons on the VRD dataset are shown in Table 4. For clarity, the stage-wise methods are grouped in the first block and the end-to-end methods in the second. The proposed PST belongs to the second block; in particular, it is the first holistic end-to-end VRD solution, directly outputting all predicted relationships without any post-processing. PST outperforms the existing end-to-end methods on both the Phrase and Relationship Detection tasks, e.g., surpassing the second-best end-to-end method, Zoom-Net , by margins of 5.81%/5.73% in Phrase Detection and 4.65%/6.22% in Relationship Detection at R@50/100. This shows that PST learns the relationships between entities effectively.
Table 5 (excerpt):
|Shen et al.|VGG19|COCO||6.46|4.24|7.12|-|-|-|
|Peyre et al.|ResNet-50-FPN|COCO|✓|19.40|14.63|20.87|-|-|-|
|Bansal et al.|ResNet-50-FPN|HICO-DET|✓|21.96|16.43|23.62|-|-|-|
Compared with the stage-wise VRD methods, PST outperforms the second-best method HGAT  by a margin of 2.8% at R@50 on Relationship detection, but lags behind the best method RelDN  by margins of 0.71% and 1.32% at R@50 on Phrase and Relationship detection. We note that RelDN is a sophisticated two-stage method that: (1) leverages two CNNs for entity and predicate visual feature learning; (2) tunes the thresholds of three margins in its metric-learning-based loss functions; and (3) combines multi-modal information (visual, semantic, and spatial) for relationship prediction. By contrast, PST predicts relations from visual features alone and detects relationships end-to-end and holistically, without any post-processing. PST is simple, with no hand-designed components encoding prior knowledge.
4.3 Human Object Interaction Detection
Competitors. We compare our model with two types of state-of-the-art HOI methods: two-stage methods [36, 13, 32, 11, 45, 14, 25, 46, 31, 43, 28, 20, 2, 56, 24, 15, 10, 4] and single-stage methods [19, 26, 60]. Two-stage methods detect individual objects in the first stage, then associate the detected objects and infer the HOI predictions in the second stage. They rely on good object detections from the first stage and mostly focus on the second stage, where language priors [14, 31, 28, 20, 2, 56, 24, 10] and human pose features [45, 25, 28, 20, 24] may be leveraged to facilitate inferring HOI predictions from the detected objects. Single-stage methods bypass the separate object detection step and output HOI predictions directly. Previous single-stage methods [19, 26] are not end-to-end: they employ multiple branches, each outputting complementary HOI-related predictions, and rely on post-processing to decode the final HOI predictions. Most related to our approach is HoiT , an end-to-end single-stage solution that employs a DETR-like structure and treats HOI as a set prediction problem to achieve end-to-end learning. It can be regarded as our sum-branch Transformer, where each query directly outputs an HOI triplet.
Results. Table 5 shows the results of our method and other state-of-the-art HOI methods on the HICO-DET dataset. Most two-stage models reach an mAP of around 20 (default, full) on the HICO-DET test set. The best two-stage model is DRG , which achieves 24.5 mAP; however, it is a complex model and requires three-stage training. In comparison, as end-to-end single-stage models, our model and the contemporary HoiT  model reach over 20 mAP without a dedicated object detector or extra pose or language information. Our model with composite queries achieves 23.9 mAP, the state of the art among single-stage HOI methods.
4.4 Ablation Study
We report further analysis of the components of our proposed part-and-sum Transformer (PST) here.
Single-branch vs dual-branch decoder. Our part-and-sum Transformer (PST) uses a dual-branch decoder: part queries and sum queries are fed into separate self-attention layers, cross-attention layers, and FFNs, and are decoded independently. We compare this design with a single-branch decoder in which part and sum queries are decoded by the same layers. Table 6 shows that the dual-stream design outperforms the single-branch design on both relationship detection and phrase detection. We hypothesize that part queries and sum queries represent different aspects of a relationship and are better decoded independently.
Table 6 header: Part and Sum decoding | Phrase Detection | Relationship Detection.
Varying the number and dimension of the queries. We further compare the tensor-based and vector-based query strategies by varying their numbers and dimensions. Specifically, we use 500 tensor queries in PST, each containing three 256-dimensional sub-vector queries. For a fair comparison, we also employ 500 vector queries of dimension 256×3. Note that increasing the dimension of the vector-based queries threefold also requires a threefold larger image-memory dimension; to this end, we repeat the encoder features three times. In this way, each query in both models has an equal embedding dimension to represent a relationship. Table 7 shows that: (1) increasing the number of vector queries from 500 to 1500 brings no clear benefit; (2) increasing the dimension of each vector query does not provide more information; (3) the tensor-based query outperforms the vector-based query variants, with margins of 4.65%/5.16% at R@50/100 on Relationship detection; (4) compared with the tensor-based query, composite queries further improve performance thanks to part-and-sum two-level learning.
Table 7 header: Query style | Query Number | Query Dimension | Phrase Detection | Relationship Detection.
Part-and-Sum design in the HOI task. We also compare the vector-based Transformer (PST-Sum), tensor-based Transformer (PST-Part), and part-and-sum Transformer (PST) on the HOI task, with results in Table 8. PST outperforms both PST-Part and PST-Sum, consistent with the comparison on the VRD task.
4.5 Composite Attention Visualisation
To better understand how the model works and what input information it uses to perform relationship detection, we visualize the cross-attention maps of the decoder layers of PST, since cross-attention measures the correlation between the query embeddings and the image feature tokens. Given a test image, we extract cross-attention maps from all decoder layers for the subject, object, and predicate query embeddings individually. We visualize the attention maps of one query in Figure 3 (c) and include results for more queries in the supplementary material. In Figure 3, the relationship query embedding is decoded to "person-wear-shoes"; from the attention maps we observe that (1) the Transformer decoder incrementally focuses on the "person" and "shoes" areas of the image to infer the subject and object entities, and (2) the predicate ("wear") is mostly predicted from the union area of the subject and object, which suggests that the attention scheme automatically models the subject-object context for predicate detection.
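The attention maps described above are simply the cross-attention weights of one sub-query over the flattened image tokens, reshaped back to the feature-map grid. A minimal NumPy sketch of this extraction step (illustrative shapes; in practice the weights would be read from the trained decoder rather than recomputed):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
H, W, d = 20, 30, 256
memory = rng.normal(size=(H * W, d))   # flattened encoder feature tokens
query = rng.normal(size=(d,))          # one decoded sub-query (e.g. subject)

# cross-attention weights of this query over all image locations
weights = softmax(query @ memory.T / np.sqrt(d))
attn_map = weights.reshape(H, W)       # reshape back to the feature-map grid

print(attn_map.shape)  # (20, 30); the map sums to 1 over all locations
```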
In this work, we have presented a Transformer-based detector, Part-and-Sum Transformers (PST), for visual composite set detection. PST emphasizes the importance of maintaining separate representations for the sum and parts while enhancing their interactions.
- (2016) Layer normalization. arXiv preprint arXiv:1607.06450.
- (2019) Detecting human-object interactions via functional generalization. CoRR abs/1904.03181.
- (2020) End-to-end object detection with transformers. In ECCV.
- (2018) Learning to detect human-object interactions. In WACV.
- (2018) Context-dependent diffusion network for visual relationship detection. In Proceedings of the 26th ACM International Conference on Multimedia, pp. 1475–1482.
- (2017) Detecting visual relationships with deep relational networks. In CVPR.
- (2021) An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations.
- (2009) Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (9), pp. 1627–1645.
- (2010) Cascade object detection with deformable part models. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2241–2248.
- (2020) DRG: dual relation graph for human-object interaction detection.
- (2018) iCAN: instance-centric attention network for human-object interaction detection. CoRR abs/1808.10437.
- (2018) iCAN: instance-centric attention network for human-object interaction detection. In BMVC.
- (2017) Detecting and recognizing human-object interactions. CoRR abs/1704.07333.
- (2018) No-frills human-object interaction detection: factorization, appearance and layout encodings, and training techniques. CoRR abs/1811.05967.
- (2020) Visual compositional learning for human-object interaction detection.
- (2013) Recognising human-object interaction via exemplar based modelling. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3144–3151.
- (2019) Neural message passing for visual relationship detection. In ICML Workshop on Learning and Reasoning with Graph-Structured Representations, Long Beach, CA.
- (2020) Bounding-box channels for visual relationship detection. In ECCV.
- (2020) UnionDet: union-level detector towards real-time human-object interaction detection. In European Conference on Computer Vision, pp. 498–514.
- (2020) Detecting human-object interactions with action co-occurrence priors.
- (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123 (1), pp. 32–73.
- (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In ICML.
- ViP-CNN: visual phrase guided convolutional neural network. In CVPR.
- PaStaNet: toward human activity knowledge engine.
- (2018) Transferable interactiveness prior for human-object interaction detection. CoRR abs/1811.08264.
- (2020) PPDM: parallel point detection and matching for real-time human-object interaction detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 482–490.
- (2014) Microsoft COCO: common objects in context. In ECCV, pp. 740–755.
- (2020) Amplifying key cues for human-object-interaction detection. In European Conference on Computer Vision.
- (2016) Visual relationship detection with language priors. In ECCV.
- (2020) Hierarchical graph attention network for visual relationship detection. In CVPR.
- (2018) Detecting rare visual relations using analogies. CoRR abs/1812.05736.
- Learning human-object interactions by graph parsing neural networks. CoRR abs/1808.07962.
- (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
- (2019) Generalized intersection over union: a metric and a loss for bounding box regression. In CVPR.
- (2007) A feedforward architecture accounts for rapid categorization. Proceedings of the National Academy of Sciences 104 (15), pp. 6424–6429.
- (2018) Scaling human-object interaction recognition through zero-shot learning. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1568–1576.
- (2018) Scaling human-object interaction recognition through zero-shot learning. In WACV.
- (2006) TextonBoost: joint appearance, shape and context modeling for multi-class object recognition and segmentation. In CVPR.
- (2013) Learning and-or templates for object recognition and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (9), pp. 2189–2205.
- (2019) LXMERT: learning cross-modality encoder representations from transformers. In EMNLP.
- (2005) Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research 6 (9).
- (2008) Auto-context and its application to high-level vision tasks. In CVPR.
- (2020) VSGNet: spatial attention network for detecting human object interactions using graph convolutions.
- (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- (2019) Pose-aware multi-level feature network for human object interaction detection. CoRR abs/1909.08453.
- (2020) Contextual heterogeneous graph network for human-object interaction detection.
- (2020) Learning human-object interaction detection using interaction points.
- (2021) Line segment detection using transformers without edges. In CVPR.
- Zoom-Net: mining deep feature interactions for visual relationship recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 322–338.
- (2017) Visual relationship detection with internal and external linguistic knowledge distillation. In ICCV.
- (2019) On exploring undetermined relationships for visual relationship detection. In CVPR.
- (2017) Visual translation embedding network for visual relation detection. In CVPR.
- Large-scale visual relationship understanding. In Proceedings of the AAAI Conference on Artificial Intelligence.
- (2020) Graphical contrastive losses for scene graph parsing. In CVPR.
- (2019) Visual relation detection with multi-level attention. In Proceedings of the 27th ACM International Conference on Multimedia, pp. 121–129.
- (2020) Polysemy deciphering network for robust human-object interaction detection.
- (2007) A stochastic grammar of images. Foundations and Trends in Computer Graphics and Vision 2 (4), pp. 259–362.
- (2020) Deformable DETR: deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159.
- (2017) Towards context-aware interaction recognition for visual relationship detection. In ICCV.
- (2021) End-to-end human object interaction detection with HOI transformer. In CVPR.
Appendix A Part-and-Sum Transformers for Composite Set Detection
In this work, we focus on end-to-end structured-data detection with Part-and-Sum Transformers for tasks such as visual relationship detection and human-object interaction detection. Here we provide a more detailed discussion of the designs of the vanilla decoder, the tensor-based decoder, and the composite (part-and-sum) decoder. These three alternatives differ in the form of their queries and in how attention is implemented.
Figure 1(a) in the main submission gives an illustration. For an input image, a convolutional neural network (CNN) extracts image features, which are fed into a standard or deformable Transformer encoder. The encoder is composed of multiple self-attention layers that tokenize the visual features. A Transformer decoder then takes the visual tokens together with a set of learnable queries as input to detect a composite set (visual relationships or human-object interactions). We denote the tokenized features of the Transformer encoder as $M$, the learnable queries as $Q$, and the output embeddings of the decoder as $E$. From the embeddings $E$, the structured prediction is inferred by a prediction module, denoted as $P$.
A.1 Vanilla decoder with vector-based query
Our vanilla Transformer decoder contains $N$ query embeddings, and each query is a vector representing one relationship:
$$Q = \{q_1, q_2, \dots, q_N\},$$
where $q_i \in \mathbb{R}^{d}$ is a vector of size $d$ and $Q$ is the overall query set. The queries are fed into multiple decoder layers of the same design. Specifically, each decoder layer contains a multi-head self-attention layer, which learns the cross-relationship context; a multi-head cross-attention layer, which learns representations by attending to different image positions; and a feed-forward network (FFN), which further embeds each query. All query embeddings pass through these three components in sequence, and the outputs are fed into the following decoder block. The decoding in each decoder block is written as:
$$Q' = \mathrm{SA}(Q), \qquad Q'' = \mathrm{CA}(Q', M), \qquad E = \mathrm{FFN}(Q''),$$
where $\mathrm{SA}$ is the self-attention layer, $\mathrm{CA}$ is the cross-attention layer, and $M$ is the encoder memory. Note that in the vanilla Transformer decoder each query represents an entire relationship, i.e., it contains multiple components.
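As a minimal NumPy sketch of one such decoder block (single-head attention, a single linear layer in place of the FFN, illustrative shapes rather than the paper's configuration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # single-head scaled dot-product attention
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def decoder_block(Q, M):
    Q = Q + attention(Q, Q, Q)   # self-attention among relationship queries
    Q = Q + attention(Q, M, M)   # cross-attention to encoder memory M
    W = np.full((Q.shape[-1], Q.shape[-1]), 0.01)
    return Q + Q @ W             # feed-forward (one linear layer, for brevity)

rng = np.random.default_rng(0)
Q = rng.normal(size=(300, 256))  # 300 vector queries of dimension 256
M = rng.normal(size=(600, 256))  # 600 encoder tokens
E = decoder_block(Q, M)
print(E.shape)  # (300, 256)
```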
A.2 Tensor-based decoder with tensor-based query
Unlike the vector-based query, a tensor-based query represents a relationship by a tensor containing multiple sub-queries, one for each part (Subject, Predicate, and Object). The tensor-based query can be written as:
$$q_i = (q_i^{s},\, q_i^{p},\, q_i^{o}), \qquad i = 1, \dots, N,$$
where each query $q_i$ includes three sub-queries representing the subject, predicate, and object, respectively. In this way, all parts are learned individually, reducing the ambiguity in the similarity computation of the attention layers; this is important for learning relationships that share the same subject or object entity. In decoding, the attention layers operate jointly on all $3N$ sub-queries:
$$Q' = \mathrm{SA}(Q), \qquad Q'' = \mathrm{CA}(Q', M), \qquad E = \mathrm{FFN}(Q'').$$
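The key operational difference from the vector-based decoder is that the relationship tensor is flattened so that every sub-query becomes its own token, letting a subject sub-query attend to, say, the object sub-query of another relationship. A NumPy sketch with illustrative shapes:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attn(q, k, v):
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

rng = np.random.default_rng(1)
N, d = 5, 16
Q = rng.normal(size=(N, 3, d))   # [subject, predicate, object] per relationship

flat = Q.reshape(N * 3, d)       # all 3N sub-queries attend jointly
out = attn(flat, flat, flat).reshape(N, 3, d)
print(out.shape)  # (5, 3, 16)
```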
The vector-based query and the tensor-based query are conceptually different: the former learns each two-level structure as a whole (sum), while the latter learns it part by part. Furthermore, the self-attention layers are functionally different in the two designs: self-attention among sum queries mines inter-relationship context, while self-attention among part queries mines entity-level context, which indirectly benefits relationship learning.
A.3 Composite (part-and-sum) decoder with composite query
A composite query models each relationship in a structured manner, learning it at both the part level and the sum level. Each composite query contains three part sub-queries for the Subject, Predicate, and Object entities, and one sum sub-query for the relationship as a whole:
$$q_i = (q_i^{s},\, q_i^{p},\, q_i^{o},\, q_i^{sum}), \qquad i = 1, \dots, N,$$
where $q_i^{s}, q_i^{p}, q_i^{o}$ are part queries and $q_i^{sum}$ is the sum query of relationship $i$. In decoding, the part queries and sum queries are decoded separately by different self-attention layers $\mathrm{SA}_{part}$ and $\mathrm{SA}_{sum}$ and cross-attention layers $\mathrm{CA}_{part}$ and $\mathrm{CA}_{sum}$:
$$Q'_{part} = \mathrm{SA}_{part}(Q_{part}), \qquad Q'_{sum} = \mathrm{SA}_{sum}(Q_{sum}),$$
$$E_{part} = \mathrm{CA}_{part}(Q'_{part}, M), \qquad E_{sum} = \mathrm{CA}_{sum}(Q'_{sum}, M).$$
Factorized self-attention. To enhance part-based relationship learning, we design a factorized self-attention layer, which first performs intra-relationship self-attention and then inter-relationship self-attention. The intra-relationship self-attention leverages the context among the parts of a single relationship to benefit its prediction; for example, a subject query "person" and an object query "horse" help predict the predicate "ride". The inter-relationship self-attention leverages the context across relationships to enhance the holistic relationship prediction per image; for example, the existence of "person read book" suggests the relationship "person sit" rather than "person run", which is particularly important when detecting multiple interactions of the same person entity. The factorized self-attention is written as:
$$Q' = \mathrm{SA}_{inter}\big(\mathrm{SA}_{intra}(Q)\big),$$
where the intra-relationship self-attention $\mathrm{SA}_{intra}$ attends among the sub-queries within each relationship, and the inter-relationship self-attention $\mathrm{SA}_{inter}$ attends across relationships.
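The two factorized stages described above amount to batching the same attention operation over different axes of the query tensor: first over relationships (parts within a relationship attend to each other), then over part roles (the same role attends across relationships). A NumPy sketch with illustrative shapes:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attn(q, k, v):
    # batched single-head scaled dot-product attention over the last two axes
    return softmax(q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])) @ v

rng = np.random.default_rng(0)
N, P, d = 6, 3, 16                 # N relationships, P parts per relationship
Q = rng.normal(size=(N, P, d))

# 1) intra-relationship self-attention: parts of the same relationship
#    attend to each other (batched over the N relationships)
Q = attn(Q, Q, Q)

# 2) inter-relationship self-attention: the same part role attends
#    across relationships (batched over the P roles)
Qt = np.swapaxes(Q, 0, 1)          # (P, N, d)
Qt = attn(Qt, Qt, Qt)
Q = np.swapaxes(Qt, 0, 1)          # back to (N, P, d)

print(Q.shape)  # (6, 3, 16)
```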
Note that the factorized self-attention design can also be used with the tensor-based query to enhance inter-part-query learning.
Appendix B Visualisations of VRD and HOI detection
PST directly predicts all relationships as a set. Figure 4 shows example relationship detection results (a) and human-object interaction detection results (b) produced by PST; each sub-image visualizes one predicted relationship. Multiple relationships can exist between a single entity pair: for example, in Figure 4 (a), PST detects both "Tower-has-clock" and "Clock-on-tower", as well as "Road-under-tower" and "Tower-on-road".
Appendix C More Qualitative Illustrations
In addition to the results shown in the main paper, we provide more qualitative results and analysis for visual relationship detection and human-object interaction detection by the proposed PST.
C.1 Small-entity relationship detection
In the VRD task, relationships come in multiple types, some of which pose particular challenges, such as small-entity and spatial relationships. Figure 5 shows examples of small-entity relationship detection. PST detects small subjects and objects well, such as the "ball" in (a), the "person" and "jacket" in (b), the "clock" in (c), and the "bag" in (d). This is because the part queries mine the subject-predicate-object context while the sum queries leverage the inter-relationship context; together, these two types of contextual information are effective for detecting small entities in a relationship.
C.2 Instance ambiguity in relationship detection
Instance ambiguity causes detection failures in which a predicted relationship associates the wrong subject and object instances, even though the relationship categories are predicted correctly. For instance, in Figure 6 (a) and (b), there are two relationships of the same type, "Person-has-phone", that are visually close, making it hard to associate each "phone" instance with the correct surrounding "Person" instance. This ambiguity arises when multiple relationship instances of the same type lie close together and the visual cues for associating the subject-object instances are subtle. We examine PST in these challenging cases and show examples in Figure 6: PST associates subject-object instances correctly in these hard situations, thanks to its effective intra-relationship and inter-relationship attention for context modeling.
C.3 Failure cases of PST
We visualize typical relationship detection errors of PST in Figure 7. There are four typical error modes: (1) PST localizes entities inaccurately when instances are crowded; for instance, in Figure 7 (a), multiple horses stand close to each other, and PST localizes several horses as the single object entity of the relationship "Person-on-horse". (2) Object detection mistakes cause relationship detection failures, such as the wrongly detected entity "camera" in Figure 7 (b). (3) Relationship instance ambiguity challenges PST; in Figure 7 (c), the watch is associated with a wrong person instance that is very close to the correct one. (4) The predicate is wrongly predicted; for instance, PST classifies the relationship between "Person" and "bottle" as "hold".