Mining the Benefits of Two-stage and One-stage HOI Detection

by Aixi Zhang, et al.
Beihang University

Two-stage methods have dominated Human-Object Interaction (HOI) detection for several years. Recently, one-stage HOI detection methods have become popular. In this paper, we aim to explore the essential pros and cons of two-stage and one-stage methods. With this as the goal, we find that conventional two-stage methods mainly struggle to locate positive interactive human-object pairs, while one-stage methods find it hard to make an appropriate trade-off in multi-task learning, i.e., object detection and interaction classification. Therefore, a core problem is how to take the essence and discard the dregs from the two conventional types of methods. To this end, we propose a novel one-stage framework that disentangles human-object detection and interaction classification in a cascade manner. In detail, we first design a human-object pair generator based on a state-of-the-art one-stage HOI detector by removing the interaction classification module or head, and then design a relatively isolated interaction classifier to classify each human-object pair. Each of the two cascade decoders in our proposed framework can focus on one specific task, detection or interaction classification. In terms of the specific implementation, we adopt a transformer-based HOI detector as our base model. The newly introduced disentangling paradigm outperforms existing methods by a large margin, with a significant relative mAP gain of 9.32




1 Introduction

The goal of Human-Object Interaction (HOI) detection chao2018learning ; liao2020ppdm ; gao2018ican ; li2018transferable ; gkioxari2018detecting ; gupta2015visual ; li2020hoi ; chen_2021_asnet is to make a machine understand human activities in detail from a static image. Human activities in this task are abstracted as a set of <human, object, action> HOI triplets. Thus, an HOI detector is required to locate human-object pairs and classify their corresponding actions simultaneously. Based on this definition, we can summarize conventional HOI detection methods into two paradigms, i.e., two-stage methods and one-stage methods. These two paradigms have made significant progress with the development of deep learning, but both still have shortcomings due to their structural design. This paper aims to present a detailed analysis of methods under these two paradigms and propose a solution to mine the benefits of two-stage and one-stage methods.

We first take a closer look at the conventional two-stage and one-stage HOI detectors. Conventional two-stage methods gao2018ican ; chao2018learning ; li2018transferable ; Gao-ECCV-DRG mostly adopt a serial architecture. As shown in Figure 1(a), two-stage methods detect humans and objects first and then feed the human-object pairs, which are generated by matching humans and objects one by one, into an interaction classifier. The serial architecture struggles to locate the interactive human-object pairs among a large number of interfering negative pairs when relying only on local region features. In addition, the efficiency of two-stage methods is limited by the serial architecture. To alleviate these problems, one-stage methods liao2020ppdm ; Kim2020_unidet ; zou2021_hoitrans ; chen_2021_asnet ; tamura2021qpic ; Kim_2021_CVPR are proposed to detect the HOI triplets directly and to formulate HOI detection as multi-task learning, i.e., human-object detection and interaction classification, as shown in Figure 1(b). Therefore, one-stage methods can easily focus on the interactive human-object pairs and effectively extract the corresponding features in an end-to-end manner. However, it is difficult for a single model to make a good trade-off in multi-task learning since human-object detection and interaction classification are two very different tasks, which require the model to focus on different visual features. As shown in Figure 1(c), though some previous methods liao2020ppdm ; chen_2021_asnet design two parallel branches to detect instances and predict interactions respectively, the interaction classification branch still needs to regress additional offsets to associate humans and objects. Thus the interaction branch is also required to make a trade-off between interaction classification and human and object positioning.

Therefore, the intuitive idea is to take the essence and discard the dregs from the two paradigms. To attain this, we propose a novel end-to-end one-stage framework that disentangles human-object detection and interaction classification in a cascade manner, namely Cascade Disentangling Network (CDN). The original intention of our framework is to keep the advantage of conventional one-stage methods, i.e., directly and accurately locating the interactive human-object pairs, and to bring the advantage of two-stage methods, i.e., disentangling human-object detection and interaction classification, into our one-stage framework. As shown in Figure 1(d), we design a human-object pair decoder, namely HO-PD, based on the one-stage paradigm by removing the interaction classification function, and follow it with an isolated interaction classifier. To instantiate our idea in an end-to-end manner, we design the HO-PD based on the previous state-of-the-art one-stage transformer-based HOI detectors, HOI-Trans zou2021_hoitrans and QPIC tamura2021qpic , where we remove the interaction classification head for each query and make it focus on human-object pair detection. In addition, we design an independent HOI decoder for interaction classification so that interaction classification is unaffected by human-object detection. This raises a core problem, i.e., how to link each human-object pair with its corresponding action class. To address this problem, we initialize the query embedding of the HOI decoder with the output of the last layer of the HO-PD. In this case, the HOI decoder is able to learn the corresponding action category under the guidance of the query embedding while being freed from the human-object detection task. Moreover, we design a decoupling dynamic re-weighting strategy to handle the long-tailed problem in HOI detection.

Figure 1: (a) Two-stage framework, (b) one-stage end-to-end framework, (c) one-stage framework with parallel architecture, and (d) our one-stage framework with a cascade disentangling head.

Our contributions are threefold: (1) We conduct a detailed analysis of two conventional HOI detection paradigms, i.e., two-stage and one-stage. (2) We propose a novel one-stage framework with a cascade disentangling decoder to combine the advantages of two-stage and one-stage methods. (3) Our method outperforms previous state-of-the-art methods by a large margin on the HOI detection task, with an especially notable performance gain on the rare classes of HICO-Det.

2 Analysis of Two-stage and One-stage HOI detectors

We first introduce a unified formulation for the HOI detection problem. Given a human-centric image, the model is required to predict a set of HOI triplets, each consisting of a human bounding box, an object bounding box, and their corresponding action category.

Two-stage HOI detector. Two-stage detectors can be regarded as instance-driven: they detect instances first and then predict interactions based on the detected instances. The two-stage detector is divided into two stages, i.e., detection and interaction classification. In the first stage, the detector produces a set of human bounding boxes and a set of object bounding boxes and generates human-object pairs by matching every human with every object. In general, the number of true-positive interactive human-object pairs is much smaller than the number of generated pairs. However, in the second stage, the interaction classifier needs to scan all pairs one by one and predict an action category with its corresponding confidence score. In this case, the classifier must run inference once per generated pair to find the few interactive pairs among all of them. We argue that this causes three problems. Firstly, these models incur additional computational cost, which grows with the number of generated pairs. Secondly, the imbalance between positive and negative samples makes the model easily overfit to negative samples. Thus the model tends to assign a ‘no-interaction’ class to human-object pairs with very high confidence, suppressing the true-positive samples. Thirdly, the accuracy of interaction classification is limited by the non-end-to-end pipeline: interaction classification is mostly based on the region features extracted by the detector, while the core of the detector is to regress bounding boxes, so its extracted features pay more attention to the edges of regions. Such features are not good options for interaction classification, which needs more context. Nevertheless, two-stage methods have an excellent property: disentangling detection and interaction classification lets each stage focus on its own task and produce good results.
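To make the cost argument for exhaustive pairing concrete, the following minimal sketch counts how many pairs a second-stage classifier would have to scan. The identifiers, the toy detections, and the `interactive` ground-truth set are hypothetical illustrations, not taken from any published implementation.

```python
from itertools import product


def enumerate_pairs(humans, objects):
    """Match every detected human with every detected object (exhaustive pairing)."""
    return list(product(humans, objects))


def count_classifier_calls(humans, objects, interactive):
    """Return (number of classifier calls, number of truly interactive pairs).

    `interactive` is a hypothetical set of ground-truth (human, object) ids.
    """
    pairs = enumerate_pairs(humans, objects)
    positives = [p for p in pairs if p in interactive]
    return len(pairs), len(positives)


# Toy detections: 3 humans and 4 objects, of which only 2 pairs interact.
humans = ["h0", "h1", "h2"]
objects = ["o0", "o1", "o2", "o3"]
interactive = {("h0", "o1"), ("h2", "o3")}

calls, positives = count_classifier_calls(humans, objects, interactive)
print(calls, positives)  # 12 2
```

In this toy case the classifier runs 12 times to recover 2 positives, and the remaining 10 pairs are the negative samples that dominate training.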

One-stage HOI detector. One-stage methods detect all HOI triplets directly and simultaneously with an end-to-end framework. This paradigm greatly relieves the three problems of two-stage methods, especially in efficiency, where the time complexity is reduced. Most one-stage methods are interaction-driven: they directly locate the interaction point liao2020ppdm or the interactive human-object pairs zou2021_hoitrans , thereby reducing negative sample interference. However, coupling human-object detection and interaction classification limits their performance because it is hard to generate a unified feature representation for two very different tasks. Though the parallel one-stage methods break HOI detection into two parallel branches, their interaction branch still suffers from multi-task learning. Specifically, the optimization target of the interaction branch includes, besides the action category, associative embeddings, e.g., offsets, that match the interaction with the human and the object, respectively. Therefore, even if detection is organized as an independent branch, the interaction branch must still position humans and objects for the association. The set-based detectors couple detection and interaction completely, optimizing the human box, the object box, and the action category jointly.

Next, we introduce a simple one-stage framework that disentangles human-object detection and interaction classification, namely CDN, to mine the benefits of two-stage and one-stage HOI detectors. Our CDN disentangles the original set-based one-stage optimization function into two cascade decoders. Firstly, we predict the human-object pairs. Secondly, we apply an isolated decoder to predict the action category of each pair. More details are given in the following section.

Figure 2: The framework of our CDN. It comprises three components: Visual Feature Extractor, Human-Object Pair Decoder (HO-PD), and Interaction Decoder. We first apply a CNN-transformer combined architecture to extract sequenced visual features. Then, we divide HOI detection into two cascade transformer-based decoders. Firstly, HO-PD regresses the human-object bounding-box pairs based on the visual features and a set of randomly initialized queries. Secondly, we predict one or more action categories for each predicted human-object pair, where we take the output of HO-PD to initialize the interaction queries and aggregate information from the visual features. Finally, the HOI triplets are formed by the output of the cascade decoders.

3 Method

In this section, we will present a detailed introduction to the pipeline of our proposed CDN. In section 3.1, we present an overview of our framework and briefly introduce the pipeline. In section 3.2, we introduce the visual feature extractor. The cascade disentangling HOI decoder is introduced in section 3.3. Section 3.4 introduces a novel dynamic re-weighting mechanism that mitigates the long-tailed problem. The detailed training process and post-processing are discussed in section 3.5.

3.1 Overview

The architecture of our proposed CDN is illustrated in Figure 2. Our CDN is organized in a cascade manner with a visual feature extractor. Given an image, we first follow transformer-based detection methods carion2020endtoend ; zou2021_hoitrans to apply a CNN followed by a transformer encoder architecture to extract visual features into a sequence. Then we detect HOI triplets in two cascade decoders. Firstly, we apply the Human-Object Pair Decoder (HO-PD) to predict a set of human-object bounding-boxes pairs based on a set of learnable queries. Next, taking the output of the last layer of HO-PD as queries, an isolated interaction decoder is utilized to predict the action category for each query. Finally, the HOI triplets are formed by the output of the above two cascade decoders.

3.2 Visual Feature Extractor

We define the visual feature extractor by combining a CNN and a transformer encoder. Given an input image, the CNN generates a feature map, whose channel dimension is then reduced by a projection convolution layer. Next, a flatten operator generates the flattened feature by collapsing the spatial dimensions into one dimension. This flattened feature is then fed into a transformer encoder together with the position encoding, which distinguishes the relative positions in the sequence. Thanks to the multi-head self-attention mechanism, the transformer encoder produces a feature map with richer contextual information by summarizing global information. The output of the encoder is denoted as the global memory.
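The flatten step can be illustrated with a minimal sketch that uses nested Python lists in place of tensors. The channel-first layout, the row-major token order, and the function name are assumptions made for illustration only.

```python
def flatten_feature_map(feature_map):
    """Collapse a C x H x W nested list into a sequence of H*W tokens of size C."""
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    tokens = []
    for y in range(height):          # row-major order over spatial positions
        for x in range(width):
            tokens.append([feature_map[c][y][x] for c in range(channels)])
    return tokens


# Toy 2-channel, 2 x 3 feature map; each value encodes (channel, row, column).
fmap = [[[100 * c + 10 * y + x for x in range(3)] for y in range(2)] for c in range(2)]
seq = flatten_feature_map(fmap)
print(len(seq), len(seq[0]))  # 6 2
```

Each token keeps all channels of one spatial location, which is why a position encoding is needed afterwards: the sequence itself no longer carries the 2D layout.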

3.3 Cascade Disentangling HOI Decoder

The cascade disentangling HOI decoder consists of two decoders: Human-Object Pair Decoder (HO-PD) and interaction decoder. Both decoders share the same architecture, a transformer-based decoder, with independent weights. In this subsection, we first introduce the general architecture of the decoder and then elaborate on the two decoders in detail, respectively.

Transformer-based decoder. We follow the transformer-based object detector DETR carion2020endtoend to design the basic architecture of our cascade disentangling HOI decoder. We stack transformer decoder layers for each decoder and equip each decoder layer with several FFN heads for intermediate supervision. Specifically, each decoder layer comprises a self-attention module and a multi-head co-attention module. During the forward pass, given a set of learnable queries, each decoder layer first applies the self-attention module to all queries, then conducts a multi-head co-attention operation between the queries and the sequenced visual features, and outputs a set of updated queries. Each FFN head comprises one or several MLP branches, and each branch serves a specific task, e.g., regression or classification. All queries share the same FFN heads. Therefore, each decoder can be viewed as a mapping from a set of input queries and the sequenced visual features to the FFN predictions for each query. Besides, the number of queries is determined by the number of positive samples of an image.

HO-PD. Firstly, we design the HO-PD to predict a set of human-object pairs from the sequenced visual features. To this end, we first randomly initialize a set of learnable queries as HO queries. Then we apply a transformer-based decoder, which takes the HO queries and the sequenced visual features as input and applies three FFN heads to each query to predict the human bounding box, the object bounding box, and the object class, which together form a human-object pair. An additional interactive score head determines whether the human-object pair is an interactive pair or not. In this case, the decoder output consists of a set of human-object pairs. In addition, we keep the output queries of the last layer of HO-PD for the following step.

Interaction Decoder. Secondly, we propose the interaction decoder to classify the human-object queries and assign one or several action categories to each human-object query. To classify each human-object query one-to-one, we initialize the interaction queries with the output queries of HO-PD. In this way, the HO-PD output provides prior knowledge that guides the interaction queries to learn the corresponding action categories for each human-object query. The other components and inputs are the same as in HO-PD: the decoder conducts self-attention among the queries and co-attention with the sequenced visual features and the position encoding. The final output is a set of action categories aligned one-to-one with the human-object queries.
In our proposed cascade disentangling HOI decoder, the task of HOI detection is disentangled into two relatively independent steps: human-object pairs detection and interaction classification. Therefore, each step can aggregate more related features to concentrate on its corresponding task.
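The hand-off between the two decoders can be sketched with a toy, weight-free stand-in for a decoder. This is only an illustration of the query flow under stated assumptions: `cross_attend` is a hypothetical single-head dot-product attention with a residual connection, whereas the real decoders are multi-layer transformers with independent learned weights and FFN heads.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, memory):
    """Toy decoder: each query attends over the memory tokens and adds the context."""
    updated = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, m)) for m in memory])
        ctx = [sum(w * m[d] for w, m in zip(weights, memory)) for d in range(len(q))]
        updated.append([a + c for a, c in zip(q, ctx)])
    return updated


def cascade(ho_queries, memory):
    # Stage 1: the pair decoder refines the HO queries
    # (box and object-class heads would read these outputs).
    pair_queries = cross_attend(ho_queries, memory)
    # Stage 2: the interaction decoder starts from the stage-1 outputs, so
    # index i in both outputs refers to the same human-object pair.
    interaction_queries = cross_attend(pair_queries, memory)
    return pair_queries, interaction_queries


memory = [[1.0, 0.0], [0.0, 1.0]]                    # two visual tokens
ho_queries = [[0.5, 0.0], [0.0, 0.5], [0.1, 0.1]]    # three HO queries
pair_q, inter_q = cascade(ho_queries, memory)
print(len(pair_q), len(inter_q))  # 3 3
```

Because the second stage consumes the first stage's outputs in order, no explicit association step (such as offset regression) is needed to link an action prediction to its human-object pair.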

3.4 Decoupling Dynamic Re-weighting

The HOI datasets usually have long-tail class distribution for both object class and action class. To alleviate the long-tail problem, we design a dynamic re-weighting mechanism for further improvements with a decoupling training strategy. In detail, we first train the whole model with regular losses. Then, we freeze the parameters of the visual feature extractor and only train the cascade disentangling decoders with a relatively small learning rate and the designed dynamic re-weighted losses.

During decoupling training, at each iteration, we apply a queue of fixed length to accumulate the number of objects of each object class, and another queue of fixed length to accumulate the number of interactions of each action category. The dynamic re-weighting coefficients are computed from the number of accumulated positive samples of each category in the queues, the number of accumulated background samples, the number of categories, and an exponent p, a hyper-parameter that adapts the magnitude of mitigation. Specifically, the weight of the background class is designed to balance the positives and negatives. For the stability of the dynamic re-weighted training, the weight coefficients are initialized with those calculated in the same way from the static numbers of instances of the object and action categories. The final dynamic weights are obtained through a smooth factor that transitions gradually between its two limits.
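The queue-based accumulation can be sketched as follows. The exact weighting formula is not reproduced here; the inverse-frequency form `(mean_count / class_count) ** p`, the class name, the fallback weight of 1.0 for unseen classes, and the default exponent are all assumptions chosen for illustration. The background weight and the smooth factor are omitted.

```python
from collections import Counter, deque


class DynamicReweighter:
    """Keeps the most recent labels in a fixed-length queue and derives class weights."""

    def __init__(self, num_classes, queue_len, p=0.5):
        self.num_classes = num_classes
        self.queue = deque(maxlen=queue_len)  # oldest labels drop out automatically
        self.p = p                            # exponent controlling the re-weighting strength

    def update(self, labels):
        """Push the positive labels seen in the current iteration."""
        self.queue.extend(labels)

    def weights(self):
        counts = Counter(self.queue)
        total = sum(counts.values())
        mean = total / self.num_classes if total else 0.0
        # Assumed form: rarer classes get larger weights; unseen classes fall back to 1.0.
        return [
            (mean / counts[k]) ** self.p if counts[k] else 1.0
            for k in range(self.num_classes)
        ]


rw = DynamicReweighter(num_classes=3, queue_len=6, p=1.0)
rw.update([0, 0, 0, 0, 1, 2])
first = rw.weights()
rw.update([1, 1])          # the two oldest labels (both class 0) are evicted
second = rw.weights()
print(first)
```

With `p` closer to 0 the weights flatten toward 1, and with larger `p` the rare classes are emphasized more strongly, which matches the role the text gives to the exponent.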

3.5 Training and Post-processing

In this section, we introduce the training and inference processes in detail. Especially, we will introduce a novel Pair-wise Non-Maximal Suppression (PNMS) strategy in the inference process.

Training. Following the set-based training process of HOI-Trans zou2021_hoitrans and QPIC tamura2021qpic , we first match each ground truth with its best-matching prediction by bipartite matching with the Hungarian algorithm. Then the loss is computed between the matched predictions and the corresponding ground truths for back-propagation. During matching, we consider the predictions of the two cascade decoders together. The loss of CDN is composed of five parts: the box regression loss, the intersection-over-union loss rezatofighi2019generalized , the interactive score loss, the object class loss, and the action category loss. The target loss is the weighted sum of these parts, where the weights are hyper-parameters that adjust the contribution of each loss.
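The matching and loss aggregation can be illustrated with a small sketch. Note the swap: an exhaustive search over permutations stands in for the Hungarian algorithm (it finds the same minimum-cost assignment but only scales to tiny inputs). The cost values, loss names, and weight values are hypothetical placeholders.

```python
from itertools import permutations


def match(cost):
    """Minimum-cost assignment of ground truths (rows) to predictions (columns).

    Brute-force stand-in for the Hungarian algorithm; requires at least as
    many predictions as ground truths.
    """
    n_gt, n_pred = len(cost), len(cost[0])
    best_total, best_cols = float("inf"), None
    for cols in permutations(range(n_pred), n_gt):
        total = sum(cost[r][c] for r, c in enumerate(cols))
        if total < best_total:
            best_total, best_cols = total, cols
    return list(enumerate(best_cols)), best_total


def total_loss(parts, weights):
    """Weighted sum of named loss terms; terms without a weight default to 1.0."""
    return sum(weights.get(name, 1.0) * value for name, value in parts.items())


# Two ground truths, three predicted queries (hypothetical matching costs).
cost = [[0.9, 0.1, 0.5],
        [0.4, 0.8, 0.2]]
pairs, matched_cost = match(cost)
print(pairs)  # [(0, 1), (1, 2)]

parts = {"box": 1.0, "iou": 2.0, "interactive": 0.5, "object": 0.25, "action": 0.25}
loss = total_loss(parts, {"box": 2.0, "iou": 1.0})
print(loss)  # 5.0
```

In the set-based setting, unmatched predictions are treated as background, so the number of queries can exceed the number of ground-truth triplets.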

Inference. The inference process composes the outputs of the instance-related FFNs and the interaction-related FFN to form HOI triplets. With our cascade disentangling decoder architecture, the instance queries and the interaction queries correspond one-to-one; therefore, the five components <human bounding box, object bounding box, object class, interactive score, action class> at the same index of each FFN head output belong to the same prediction. Formally, each output prediction is formed from the human bounding box, the object bounding box, and the class selected by argmax for the same query. The HOI triplet score combines the scores of interaction and object classification with the interactive score from the interactive FFN head, which indicates how likely the query is to be a human-object pair.

PNMS. After sorting the HOI triplets by score in descending order and keeping the top ones, we design a pair-wise non-maximal suppression (PNMS) method to further filter out human-object pairs from the perspective of pair-wise bounding-box overlap. For two HOI triplets, the pair-wise overlap PIoU is calculated from the intersection and union areas between their two human boxes and between their two object boxes, with two balancing parameters that trade off humans against objects.
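A possible reading of PNMS is sketched below. The specific PIoU form (human IoU raised to `alpha` times object IoU raised to `beta`), the restriction of suppression to triplets sharing the same `label`, and the default threshold and exponents are assumptions for illustration, not values or formulas confirmed by the text.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def pair_iou(t1, t2, alpha=1.0, beta=1.0):
    """Assumed pair-wise overlap: human IoU and object IoU combined multiplicatively."""
    return (box_iou(t1["human"], t2["human"]) ** alpha
            * box_iou(t1["object"], t2["object"]) ** beta)


def pnms(triplets, thresh=0.7, alpha=1.0, beta=1.0):
    """Greedy suppression: keep a triplet unless a higher-scored kept one overlaps it."""
    kept = []
    for t in sorted(triplets, key=lambda t: t["score"], reverse=True):
        duplicate = any(
            k["label"] == t["label"] and pair_iou(t, k, alpha, beta) > thresh
            for k in kept
        )
        if not duplicate:
            kept.append(t)
    return kept


triplets = [
    {"human": (0, 0, 10, 10), "object": (20, 20, 30, 30), "label": "ride bicycle", "score": 0.9},
    {"human": (0, 0, 10, 9), "object": (20, 20, 30, 30), "label": "ride bicycle", "score": 0.8},
    {"human": (50, 50, 60, 60), "object": (20, 20, 30, 30), "label": "ride bicycle", "score": 0.7},
]
kept = pnms(triplets)
print([t["score"] for t in kept])  # [0.9, 0.7]
```

The second triplet is suppressed because both its human and object boxes nearly coincide with the first, while the third survives since its human box does not overlap at all.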

4 Experiments

In this section, we conduct comprehensive experiments to demonstrate the superiority of our designed CDN. In section 4.1, we briefly introduce the experimental benchmarks. Section 4.2 presents implementation details. Next, section 4.3 provides a detailed experimental comparison and analysis of two-stage and one-stage methods. In section 4.4, we compare our method with the previous state-of-the-art methods. The ablation studies and component analysis are included in section 4.5.

4.1 Datasets and Evaluation Metrics

Datasets. We carry out experiments on two widely-used HOI detection benchmarks: HICO-Det chao2018learning and V-COCO gupta2015visual . We follow the standard evaluation scheme. HICO-Det consists of 47,776 Creative Commons images from Flickr (38,118 for training and 9,658 for test) with more than 150K human-object pairs. It contains the same 80 object categories as MS-COCO lin2014microsoft and 117 action categories. The objects and actions form 600 classes of HOI triplets. V-COCO is derived from the MS-COCO dataset and consists of 5,400 images in the trainval subset and 4,946 images in the test subset. It has 29 action categories (25 HOIs and 4 body motions) and 80 object categories. For both datasets, one person can interact with multiple objects in different ways at the same time.

Evaluation Metrics. Following the standard evaluation chao2018learning , we use the mean average precision (mAP) as the evaluation metric. A predicted HOI triplet counts as a positive only if it contains accurate human and object locations (the box IoU with the ground-truth box is greater than 0.5) and correct object and action categories. Specifically, for HICO-Det, besides the full set of 600 HOI classes, we also report the mAP over a rare set of 138 HOI classes that have fewer than 10 training instances and a non-rare set of the other 462 HOI classes. For V-COCO, we report the role mAP for two scenarios: scenario 1 includes the cases even without any objects (for the four action categories of body motions), and scenario 2 ignores these cases.
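The positive-match criterion can be written as a short check. This is a minimal sketch: the dictionary keys and function names are hypothetical, and the IoU threshold of 0.5 is the value commonly used for these benchmarks, passed here as a parameter.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, iou_thresh=0.5):
    """A prediction matches only if both boxes overlap enough and both classes agree."""
    return (
        pred["object_class"] == gt["object_class"]
        and pred["action"] == gt["action"]
        and box_iou(pred["human"], gt["human"]) > iou_thresh
        and box_iou(pred["object"], gt["object"]) > iou_thresh
    )


gt = {"human": (0, 0, 10, 10), "object": (20, 20, 30, 30),
      "object_class": "bicycle", "action": "ride"}
good = {"human": (1, 0, 10, 10), "object": (20, 20, 30, 30),
        "object_class": "bicycle", "action": "ride"}
shifted = dict(good, human=(6, 0, 16, 10))   # human box overlaps too little
wrong_action = dict(good, action="hold")

print(is_true_positive(good, gt), is_true_positive(shifted, gt),
      is_true_positive(wrong_action, gt))  # True False False
```

Since all four conditions must hold at once, a detector with accurate boxes but a wrong action label gains nothing under this metric, which is why interaction classification quality matters so much for mAP.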

4.2 Implementation Details

We implement three variant architectures of CDN: CDN-S, CDN-B, and CDN-L, where ‘S’, ‘B’, and ‘L’ denote small, base, and large, respectively. For CDN-S and CDN-B, we adopt ResNet-50 with a -layer transformer encoder as the visual feature extractor. For the cascade decoders, CDN-S is equipped with both -layer transformers, while CDN-B has a -layer transformer for each decoder. CDN-L only replaces the ResNet-50 with ResNet-101 in CDN-B. The reduced dimension size is set to . The number of queries is set to for HICO-Det and for V-COCO since the average number of positives for variant human-object pairs per image of HICO-Det is smaller than V-COCO. The human and object box FFNs have linear layers with ReLU, while the object and action category FFNs have one linear layer.

During training, we initialize the network with the parameters of DETR carion2020endtoend trained with the MS-COCO dataset. We set the weight coefficients , , , and to , , , and , respectively. We optimize the network by AdamW loshchilov2018decoupled with the weight decay . We first train the whole model for epochs with a learning rate of decreased by times at the th epoch. Then, during the decoupling training process, we fine-tune the cascade disentangling decoders together with the box, object, and action FFNs for epochs with a learning rate of . We use both object and action dynamic re-weighting for HICO-Det and only action dynamic re-weighting for V-COCO. The re-weighting parameter p is set to for both object and action. The length of training sample queue for both object and action is set to , where is the sample number of the training set. All experiments are conducted on the Tesla V100 GPUs and CUDA10.2, with a batch size of .

For validation, we select 100 detection results with the highest scores and then adopt PNMS to further filter results. The threshold, , and of PNMS are set to , , and , respectively.

4.3 Experiment Analysis of Two-stage and One-stage Methods

In this part, we introduce a detailed experimental analysis of conventional two-stage and one-stage methods and our proposed CDN from the following three aspects.

Human-object Pair Generation. We first compare the quality of human-object pair generation between two-stage and one-stage methods. To this end, we conduct a detailed experiment based on a representative two-stage method, iCAN gao2018ican . We first implement a PyTorch version of iCAN as the baseline model, which only uses human and object appearance features with a COCO-pretrained Faster-RCNN detector ren2015faster . For a fair comparison, we then fine-tune DETR on HICO-Det, starting from COCO-pretrained weights and using only the instance detection annotations, and combine the detected human and object bounding boxes whose confidences are greater than a threshold one by one to generate human-object pairs; both variants are reported as iCAN in Table 1. We train our CDN with only the HO-PD and take the human-object pairs directly from its output. Then, we graft these human-object pairs onto the baseline model to extract box features and use the same interaction classifier as in the second stage of iCAN. In this way, the number of pairs passed to the classifier is reduced, and the time complexity is reduced accordingly. Above all, HO-PD significantly improves the mAP to 24.05, as shown in Table 1. This indicates that one-stage methods are much superior in human-object pair generation.

Strategy | Full | Rare | Non-Rare
iCAN | 14.16 | 12.26 | 14.73
iCAN | 15.37 | 13.23 | 16.01
HO-PD+iCAN | 24.05 | 18.32 | 25.76
QPIC tamura2021qpic | 29.07 | 21.85 | 31.23
CDN-S base | 30.96 | 27.02 | 32.14
Table 1: Analysis of Two-stage and One-stage Methods. The first iCAN row is our implemented PyTorch version of the iCAN gao2018ican baseline model. The second iCAN row replaces the instance detection boxes with those given by a HICO-Det fine-tuned DETR detector to extract box features. ‘HO-PD+iCAN’ denotes replacing the original one-by-one generated human-object pairs with those generated by our HO-PD. ‘CDN-S base’ denotes CDN-S w/o re-weighting and PNMS strategies.
Figure 3: Visualization of Feature Maps for Queries. Visual attended features for query with top-1 score extracted from the last layer of the decoder of (a) QPIC, (b) HO-PD in CDN, and (c) interaction decoder in CDN. Zoom in for details.

Interaction Classification. We next study interaction classification in conventional multi-task one-stage methods and in our disentangled one-stage detector. QPIC tamura2021qpic can be regarded as a multi-task version of our CDN. Table 1 shows that our CDN-S improves the full mAP over QPIC from 29.07 to 30.96. In particular, our CDN significantly outperforms QPIC on rare classes (27.02 vs. 21.85). The performance on rare classes can partly reflect the accuracy of interaction classification.

Feature Learning. This part discusses the differences in feature learning between the conventional one-stage method, QPIC, and our CDN from a qualitative view. As shown in Figure 3, we visualize the feature maps extracted from the last layer of the decoder of QPIC, the HO-PD, and the interaction decoder in our CDN. We can see that HO-PD and QPIC attend to very similar regions, e.g., the boundaries of humans and objects and the human-object contact areas, which are beneficial for locating the interactive human-object pairs. In contrast, the interaction decoder concentrates on the human pose and the regions that contribute to understanding human actions.

Method | Detector | Backbone | Extra | Default Full | Default Rare | Default Non-Rare | Known Object Full | Known Object Rare | Known Object Non-Rare
Two-stage Method:
InteractNet gkioxari2018detecting | COCO | ResNet-50-FPN | | 9.94 | 7.16 | 10.77 | - | - | -
GPNN qi2018learning | COCO | Res-DCN-152 | | 13.11 | 9.34 | 14.23 | - | - | -
iCAN gao2018ican | COCO | ResNet-50 | | 14.84 | 10.45 | 16.15 | 16.26 | 11.33 | 17.73
No-Frills Gupta_2019_ICCV | COCO | ResNet-152 | P | 17.18 | 12.17 | 18.68 | - | - | -
PMFNet Wan_2019_ICCV | COCO | ResNet-50-FPN | P | 17.46 | 15.65 | 18.00 | 20.34 | 17.47 | 21.20
CHGNet wang2020contextual | COCO | ResNet-50 | | 17.57 | 16.85 | 17.78 | 21.00 | 20.74 | 21.08
DRG Gao-ECCV-DRG | COCO | ResNet-50-FPN | T | 19.26 | 17.74 | 19.71 | 23.40 | 21.75 | 23.89
VCL hou2020visual | COCO | ResNet-50 | | 19.43 | 16.55 | 20.29 | 22.00 | 19.09 | 22.87
IP-Net wang2020learning | COCO | Hourglass-104 | | 19.56 | 12.79 | 21.58 | 22.05 | 15.77 | 23.92
VSGNet Ulutan_2020_CVPR | COCO | ResNet-152 | | 19.80 | 16.05 | 20.91 | - | - | -
FCMNet Liu20a | COCO | ResNet-50 | | 20.41 | 17.34 | 21.56 | 22.04 | 18.97 | 23.12
ACP kim2020detecting | COCO | ResNet-152 | T | 20.59 | 15.92 | 21.98 | - | - | -
PD-Net zhong2020polysemy | COCO | ResNet-152-FPN | T | 20.81 | 15.90 | 22.28 | 24.78 | 18.88 | 26.54
DJ-RN li2020detailed | COCO | ResNet-50 | P | 21.34 | 18.53 | 22.18 | 23.69 | 20.64 | 24.60
IDN li2020hoi | COCO | ResNet-50 | | 23.36 | 22.47 | 23.63 | 26.43 | 25.01 | 26.85
One-stage Method:
UnionDet Kim2020_unidet | COCO | ResNet-50-FPN | | 17.58 | 11.72 | 19.33 | 19.76 | 14.68 | 21.27
DIRV fang2020dirv | COCO | EfficientDet-d3 | | 21.78 | 16.38 | 23.39 | 25.52 | 20.84 | 26.92
PPDM-Hourglass liao2020ppdm | HICO-DET | Hourglass-104 | | 21.94 | 13.97 | 24.32 | 24.81 | 17.09 | 27.12
HOI-Trans zou2021_hoitrans | HICO-DET | ResNet-50 | | 23.46 | 16.91 | 25.41 | 26.15 | 19.24 | 28.22
GG-Net zhong2021glance | HICO-DET | Hourglass-104 | | 23.47 | 16.48 | 25.60 | 27.36 | 20.23 | 29.48
ATL hou2021atl | HICO-DET | ResNet-50 | | 23.81 | 17.43 | 25.72 | 27.38 | 22.09 | 28.96
HOTR Kim_2021_CVPR | HICO-DET | ResNet-50 | | 25.10 | 17.34 | 27.42 | - | - | -
AS-Net chen_2021_asnet | HICO-DET | ResNet-50 | | 28.87 | 24.25 | 30.25 | 31.74 | 27.07 | 33.14
QPIC tamura2021qpic | HICO-DET | ResNet-50 | | 29.07 | 21.85 | 31.23 | 31.68 | 24.14 | 33.93
CDN-S | HICO-DET | ResNet-50 | | 31.44 | 27.39 | 32.64 | 34.09 | 29.63 | 35.42
CDN-B | HICO-DET | ResNet-50 | | 31.78 | 27.55 | 33.05 | 34.53 | 29.73 | 35.96
CDN-L | HICO-DET | ResNet-101 | | 32.07 | 27.19 | 33.53 | 34.79 | 29.48 | 36.38
Table 2: Performance comparison on the HICO-Det test set. ‘P’ and ‘T’ denote human pose information and language features, respectively.

4.4 Comparison to State-of-the-Art

We conduct experiments on the HICO-Det and V-COCO benchmarks to verify the effectiveness of our proposed methods. For the HICO-Det dataset, as shown in Table 2, compared to the previous state-of-the-art two-stage method FCMNet Liu20a with ResNet-50 as backbone, our CDN-B significantly promotes mAP from 20.41 to 31.78, a relative gain of 55.71%. Even compared with PD-Net zhong2021polysemy , which adopts extra language features, and DJ-RN li2020detailed , which utilizes extra human pose features, CDN-B achieves 52.72% and 48.92% relative mAP gains, respectively. Compared to the one-stage methods AS-Net chen_2021_asnet and QPIC tamura2021qpic , which also adopt a transformer-based detector architecture, CDN-B attains 10.08% and 9.32% relative mAP gains, respectively. Table 4 shows the comparisons on the V-COCO dataset. CDN-B achieves 62.29 on Scenario 1 and 64.42 on Scenario 2, which significantly outperforms the previous state-of-the-art method QPIC with relative gains of 5.94% and 5.61%, respectively. As for efficiency, CDN-S has almost the same number of parameters and FLOPs as QPIC, but CDN-S achieves 31.44 mAP on HICO-Det, higher than QPIC's 29.07.
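The relative gains discussed here can be checked directly against the Full mAP column of Table 2. The helper below is a small illustrative sketch; its name is hypothetical, and the input values are the CDN-B, QPIC, and AS-Net entries from that table.

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


cdn_b, qpic, as_net = 31.78, 29.07, 28.87  # Full mAP values from Table 2

gain_qpic = round(relative_gain(cdn_b, qpic), 2)
gain_asnet = round(relative_gain(cdn_b, as_net), 2)
print(gain_qpic)  # 9.32
```

The gain over QPIC reproduces the 9.32 figure quoted in the abstract, which confirms that the reported gains are relative rather than absolute mAP differences.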

4.5 Ablation Study

In this subsection, we analyse the effectiveness of the proposed strategies and components in detail. All experiments are conducted on the HICO-Det dataset.

Strategies. The performance of each strategy is evaluated in Table 4(a). Our pure model without any additional post-processing operation, namely the base model, achieves 31.06 mAP, an improvement of 1.99 over QPIC tamura2021qpic. This indicates the superiority of an architecture that disentangles human-object detection and interaction classification. Re-weighted training further raises mAP to 31.38, a gain of 0.32 that lies mainly in the rare classes (26.68 to 27.36). Finally, PNMS further improves mAP to 31.78.

Dynamic re-weighting. In this part, we evaluate the components of the dynamic re-weighted training strategy on top of the base model, as shown in Table 4(b). If we only decouple training without re-weighting, the model achieves 30.90 mAP, which is lower than the base model (31.06). This shows that the performance gain does not come from a longer training process. Adding static weights to the losses raises mAP to 31.25. Dynamic re-weighting improves on static re-weighting because it captures the real-time weight of each class for each sample during training, so it can extract sufficient information from every single sample to achieve the best overall performance. Our method obtains its best result, 31.38 mAP, with coefficient p = 0.7 and the best-performing queue length.
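To make the roles of the queue length and the coefficient p concrete, the following is a rough sketch of one way a sliding-window class re-weighting could work. It is an assumption for illustration, not the paper's exact formulation: the class `DynamicClassWeights`, the inverse-frequency rule, and the fallback weight of 1.0 for unseen classes are all hypothetical.

```python
from collections import Counter, deque


class DynamicClassWeights:
    """Sliding-window inverse-frequency class weights (illustrative only)."""

    def __init__(self, num_classes: int, queue_length: int, p: float):
        if queue_length < 1:
            raise ValueError("queue_length must be positive")
        self.num_classes = num_classes
        self.p = p
        # Assumed design: a bounded queue holding the most recent labels.
        self.recent = deque(maxlen=queue_length)
        self.counts = Counter()

    def update(self, labels):
        """Push the labels seen in the current batch into the queue."""
        for label in labels:
            if len(self.recent) == self.recent.maxlen:
                # The oldest label is about to be evicted by append().
                self.counts[self.recent[0]] -= 1
            self.recent.append(label)
            self.counts[label] += 1

    def weights(self):
        """Per-class weight (total / count) ** p; 1.0 for unseen classes."""
        total = len(self.recent)
        return [
            (total / self.counts[c]) ** self.p if self.counts[c] > 0 else 1.0
            for c in range(self.num_classes)
        ]


tracker = DynamicClassWeights(num_classes=3, queue_length=4, p=0.5)
tracker.update([0, 0, 0, 1])  # class 0 is frequent, class 1 is rare
print(tracker.weights())
```

Under this reading, a static scheme would compute the weights once from the whole training set, while the dynamic scheme recomputes them from the most recent samples; p controls how strongly rare classes are up-weighted.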

Method                             Extra  Scenario 1  Scenario 2
Two-stage Method:
InteractNet gkioxari2018detecting         40.0        -
GPNN qi2018learning                       44.0        -
iCAN gao2018ican                          45.3        52.4
TIN li2018transferable                    47.8        54.2
VCL hou2020visual                         48.3        -
DRG Gao-ECCV-DRG                   T      51.0        -
IP-Net wang2020learning                   51.0        -
VSGNet Ulutan_2020_CVPR                   51.8        57.0
PMFNet Wan_2019_ICCV               P      52.0        -
PD-Net zhong2020polysemy           T      52.6        -
CHGNet wang2020contextual                 52.7        -
FCMNet Liu20a                             53.1        -
ACP kim2020detecting               T      53.23       -
IDN li2020hoi                             53.3        60.3
One-stage Method:
UnionDet Kim2020_unidet                   47.5        56.2
HOI-Trans zou2021_hoitrans                52.9        -
AS-Net chen_2021_asnet                    53.9        -
GG-Net zhong2021glance                    54.7        -
HOTR Kim_2021_CVPR                        55.2        64.4
DIRV fang2020dirv                         56.1        -
QPIC tamura2021qpic                       58.8        61.0
CDN-S                                     61.68       63.77
CDN-B                                     62.29       64.42
CDN-L                                     63.91       65.89
Table 3: Performance comparison on the V-COCO test set. The ‘P’, ‘T’ represent the human pose information and the language feature, respectively.
Strategy             Full   Rare   Non-Rare
QPIC tamura2021qpic  29.07  21.85  31.23
base                 31.06  26.68  32.36
+ re-weighting       31.38  27.36  32.58
+ PNMS               31.78  27.55  33.05
(a) Strategies: Analysis of improvements by various training strategies.

Strategy  queue length  p     Full   Rare   Non-Rare
base      -             -     31.06  26.68  32.36
decouple  -             -     30.90  26.09  32.33
static    -             0.7   31.25  27.12  32.49
dynamic                 0.8   31.33  27.45  32.49
dynamic                 0.7   31.34  27.48  32.49
dynamic                 0.7   31.38  27.36  32.58
(b) Dynamic re-weighting: Analysis of decoupled training with dynamic re-weighted losses, i.e., different queue lengths, coefficient p, and dynamic or static weights.

human factor  object factor  thres  Full   Rare   Non-Rare
-             -              -      31.38  27.36  32.58
1             1              0.8    31.66  27.46  32.91
1             1              0.7    31.75  27.50  33.03
1             0.7            0.7    31.77  27.54  33.03
1             0.5            0.8    31.75  27.51  33.02
1             0.5            0.7    31.78  27.55  33.05
(c) PNMS: The effects of different settings of the PNMS coefficients, i.e., the human box balance factor, the object box balance factor, and the PIoU threshold thres.
Table 4: Ablation studies of our proposed method on the HICO-Det test set. We carry out all experiments based on the base model (CDN-B).


On the basis of the model after re-weighted training, we compare different parameter settings of the PNMS strategy, as shown in Table 4(c). We fix the human box balance factor to 1. Then we tune the object box balance factor and the PIoU threshold used to filter pair-wise boxes. We achieve the best performance, 31.78 mAP, when the object box balance factor is 0.5 and thres = 0.7. The fact that the object box balance factor is smaller than the human box balance factor indicates that the overall performance is more sensitive to human boxes than to object boxes in our framework.
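The interplay of the two balance factors and the threshold can be illustrated with a small greedy pair-wise NMS. This is a sketch under stated assumptions rather than the released implementation: it assumes PIoU is the human-box IoU and the object-box IoU, each raised to its balance factor and multiplied together, and that all candidate pairs belong to the same interaction class. The function names and the dictionary layout are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def pair_iou(p, q, human_factor=1.0, object_factor=0.5):
    """Assumed PIoU: IoU(humans) ** human_factor * IoU(objects) ** object_factor."""
    return (iou(p["human"], q["human"]) ** human_factor
            * iou(p["object"], q["object"]) ** object_factor)


def pairwise_nms(pairs, thres=0.7, human_factor=1.0, object_factor=0.5):
    """Greedy NMS over human-object pairs; returns indices of kept pairs."""
    order = sorted(range(len(pairs)), key=lambda i: pairs[i]["score"], reverse=True)
    keep = []
    for i in order:
        # Keep a pair only if it does not overlap too much with any kept pair.
        if all(pair_iou(pairs[i], pairs[k], human_factor, object_factor) <= thres
               for k in keep):
            keep.append(i)
    return keep


pairs = [
    {"human": (0, 0, 10, 10), "object": (20, 20, 30, 30), "score": 0.9},
    {"human": (0, 0, 10, 10), "object": (20, 20, 30, 29), "score": 0.8},  # near-duplicate
    {"human": (50, 50, 60, 60), "object": (20, 20, 30, 30), "score": 0.7},  # other person
]
print(pairwise_nms(pairs))  # prints [0, 2]
```

Because IoU values lie in [0, 1], an exponent below 1 pushes the object term towards 1, so the product is dominated by the human-box overlap. That is consistent with the observation that performance is more sensitive to human boxes.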

5 Related Work

Two-stage Methods. Most previous HOI detectors follow a two-stage paradigm gao2018ican ; chao2018learning ; li2018transferable ; Gao-ECCV-DRG . First, a fine-tuned detector ren2015faster ; he2017mask is applied to detect the instances. Second, human-object pairs are generated by matching the detected humans and objects one by one, and each pair is then fed into an interaction classifier. To improve the interaction classification, some extra features have been applied, such as human pose shen2018scaling ; Li_2019_CVPR ; Gupta_2019_ICCV , human parts Zhou_2019_ICCV ; Wan_2019_ICCV ; li2020detailed , and language features Xu_2019_CVPR ; Gao-ECCV-DRG ; Liu20a ; kim2020detecting . Besides, some two-stage methods qi2018learning ; Ulutan_2020_CVPR ; wang2020contextual ; YangZ20_IJCAI_In-GraphNet ; Zhou_2019_ICCV applied graph neural networks to model the interactions.
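The one-by-one matching step of the two-stage paradigm can be sketched as follows. This is a generic illustration, not any specific cited method: the detection format and the function name `enumerate_pairs` are assumptions made here. It shows how the number of candidate pairs grows with the number of detections, which is why most pairs passed to the interaction classifier are non-interactive.

```python
from itertools import product


def enumerate_pairs(detections):
    """Pair every detected person with every other detected instance."""
    humans = [d for d in detections if d["label"] == "person"]
    # A person may also act as the object of another person's interaction.
    return [(h, o) for h, o in product(humans, detections) if h is not o]


detections = [
    {"label": "person", "box": (0, 0, 10, 20)},
    {"label": "person", "box": (30, 0, 40, 20)},
    {"label": "cup", "box": (12, 5, 15, 9)},
    {"label": "chair", "box": (28, 10, 42, 25)},
]
candidate_pairs = enumerate_pairs(detections)
print(len(candidate_pairs))  # prints 6
```

With M detected people among N instances, this yields M × (N − 1) candidates, each of which a two-stage method must classify.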

One-stage Methods. One-stage methods detect HOI triplets directly liao2020ppdm ; wang2020learning ; Kim2020_unidet ; zou2021_hoitrans ; chen_2021_asnet ; tamura2021qpic ; Kim_2021_CVPR . In detail, liao2020ppdm ; wang2020learning proposed point-based interaction detection methods, which perform inference at each interaction key point. Kim2020_unidet proposed an anchor-based method to predict the interactions for each human-object union box. Recently, set-based detection approaches have been proposed to handle HOI detection as a set prediction problem zou2021_hoitrans ; chen_2021_asnet ; tamura2021qpic . Specifically, zou2021_hoitrans ; tamura2021qpic designed transformer encoder-decoder architectures to directly predict HOI detection results in an end-to-end manner, while chen_2021_asnet utilized parallel instance and interaction decoder branches to adaptively aggregate the HOI triplets.

6 Conclusion

In this paper, we explore the essential pros and cons of two-stage and one-stage HOI detection in detail. We propose a novel one-stage framework that disentangles human-object detection and interaction classification in a cascade manner. Our CDN keeps the advantage of one-stage methods, directly and accurately locating the interactive human-object pairs, and brings in the benefit of two-stage methods, disentangling detection and interaction classification. Our novel paradigm outperforms previous methods by a large margin. However, we only implement one specific version to mine the benefits of two-stage and one-stage methods. In the future, we plan to apply our idea to more general one-stage methods and introduce more advantages of two-stage methods into the one-stage framework.


  • [1] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
  • [2] Yu-Wei Chao, Yunfan Liu, Xieyang Liu, Huayi Zeng, and Jia Deng. Learning to detect human-object interactions. In WACV, 2018.
  • [3] Mingfei Chen, Yue Liao, Si Liu, Zhiyuan Chen, Fei Wang, and Chen Qian. Reformulating hoi detection as adaptive set prediction. In CVPR, 2021.
  • [4] Hao-Shu Fang, Yichen Xie, Dian Shao, and Cewu Lu. Dirv: Dense interaction region voting for end-to-end human-object interaction detection. In AAAI, 2021.
  • [5] Chen Gao, Jiarui Xu, Yuliang Zou, and Jia-Bin Huang. Drg: Dual relation graph for human-object interaction detection. In ECCV, 2020.
  • [6] Chen Gao, Yuliang Zou, and Jia-Bin Huang. ican: Instance-centric attention network for human-object interaction detection. In BMVC, 2018.
  • [7] Georgia Gkioxari, Ross Girshick, Piotr Dollár, and Kaiming He. Detecting and recognizing human-object interactions. In CVPR, 2018.
  • [8] Saurabh Gupta and Jitendra Malik. Visual semantic role labeling. arXiv preprint arXiv:1505.04474, 2015.
  • [9] Tanmay Gupta, Alexander Schwing, and Derek Hoiem. No-frills human-object interaction detection: Factorization, layout encodings, and training techniques. In ICCV, 2019.
  • [10] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, pages 2961–2969, 2017.
  • [11] Zhi Hou, Yu Baosheng, Yu Qiao, Xiaojiang Peng, and Dacheng Tao. Affordance transfer learning for human-object interaction detection. In CVPR, 2021.
  • [12] Zhi Hou, Xiaojiang Peng, Yu Qiao, and Dacheng Tao. Visual compositional learning for human-object interaction detection. In ECCV, 2020.
  • [13] Bumsoo Kim, Taeho Choi, Jaewoo Kang, and Hyunwoo J. Kim. Uniondet: Union-level detector towards real-time human-object interaction detection. In ECCV, 2020.
  • [14] Bumsoo Kim, Junhyun Lee, Jaewoo Kang, Eun-Sol Kim, and Hyunwoo J. Kim. Hotr: End-to-end human-object interaction detection with transformers. In CVPR, 2021.
  • [15] Dong-Jin Kim, Xiao Sun, Jinsoo Choi, Stephen Lin, and In So Kweon. Detecting human-object interactions with action co-occurrence priors. In ECCV, 2020.
  • [16] Yong-Lu Li, Xinpeng Liu, Han Lu, Shiyi Wang, Junqi Liu, Jiefeng Li, and Cewu Lu. Detailed 2d-3d joint representation for human-object interaction. In CVPR, 2020.
  • [17] Yong-Lu Li, Xinpeng Liu, Xiaoqian Wu, Yizhuo Li, and Cewu Lu. Hoi analysis: Integrating and decomposing human-object interaction. Advances in Neural Information Processing Systems, 33, 2020.
  • [18] Yong-Lu Li, Siyuan Zhou, Xijie Huang, Liang Xu, Ze Ma, Hao-Shu Fang, Yan-Feng Wang, and Cewu Lu. Transferable interactiveness prior for human-object interaction detection. In CVPR, 2019.
  • [19] Yong-Lu Li, Siyuan Zhou, Xijie Huang, Liang Xu, Ze Ma, Hao-Shu Fang, Yanfeng Wang, and Cewu Lu. Transferable interactiveness knowledge for human-object interaction detection. In CVPR, 2019.
  • [20] Yue Liao, Si Liu, Fei Wang, Yanjie Chen, Chen Qian, and Jiashi Feng. Ppdm: Parallel point detection and matching for real-time human-object interaction detection. In CVPR, 2020.
  • [21] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
  • [22] Yang Liu, Qingchao Chen, and Andrew Zisserman. Amplifying key cues for human-object-interaction detection. In ECCV, 2020.
  • [23] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2018.
  • [24] Siyuan Qi, Wenguan Wang, Baoxiong Jia, Jianbing Shen, and Song-Chun Zhu. Learning human-object interactions by graph parsing neural networks. In ECCV, 2018.
  • [25] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • [26] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In CVPR, 2019.
  • [27] Liyue Shen, Serena Yeung, Judy Hoffman, Greg Mori, and Li Fei-Fei. Scaling human-object interaction recognition through zero-shot learning. In WACV, 2018.
  • [28] Masato Tamura, Hiroki Ohashi, and Tomoaki Yoshinaga. Qpic: Query-based pairwise human-object interaction detection with image-wide contextual information. In CVPR, 2021.
  • [29] Oytun Ulutan, A S M Iftekhar, and B. S. Manjunath. Vsgnet: Spatial attention network for detecting human object interactions using graph convolutions. In CVPR, 2020.
  • [30] Bo Wan, Desen Zhou, Yongfei Liu, Rongjie Li, and Xuming He. Pose-aware multi-level feature network for human object interaction detection. In ICCV, 2019.
  • [31] Hai Wang, Wei-shi Zheng, and Ling Yingbiao. Contextual heterogeneous graph network for human-object interaction detection. In ECCV, 2020.
  • [32] Tiancai Wang, Tong Yang, Martin Danelljan, Fahad Shahbaz Khan, Xiangyu Zhang, and Jian Sun. Learning human-object interaction detection using interaction points. In CVPR, 2020.
  • [33] Bingjie Xu, Yongkang Wong, Junnan Li, Qi Zhao, and Mohan S. Kankanhalli. Learning to detect human-object interactions with knowledge. In CVPR, 2019.
  • [34] Dongming Yang and Yuexian Zou. A graph-based interactive reasoning for human-object interaction detection. In IJCAI, 2020.
  • [35] Xubin Zhong, Changxing Ding, Xian Qu, and Dacheng Tao. Polysemy deciphering network for human-object interaction detection. In ECCV, 2020.
  • [36] Xubin Zhong, Changxing Ding, Xian Qu, and Dacheng Tao. Polysemy deciphering network for robust human–object interaction detection. IJCV, 2021.
  • [37] Xubin Zhong, Xian Qu, Changxing Ding, and Dacheng Tao. Glance and gaze: Inferring action-aware points for one-stage human-object interaction detection. In CVPR, 2021.
  • [38] Penghao Zhou and Mingmin Chi. Relation parsing neural network for human-object interaction detection. In ICCV, 2019.
  • [39] Cheng Zou, Bohan Wang, Yue Hu, Junqi Liu, Qian Wu, Yu Zhao, Boxun Li, Chenguang Zhang, Chi Zhang, Yichen Wei, and Jian Sun. End-to-end human object interaction detection with hoi transformer. In CVPR, 2021.