PPDM: Parallel Point Detection and Matching for Real-time Human-Object Interaction Detection

12/30/2019 ∙ by Yue Liao, et al. ∙ 4

We propose a single-stage Human-Object Interaction (HOI) detection method that outperforms all existing methods on the HICO-DET dataset while running at 37 fps on a single Titan XP GPU. It is the first real-time HOI detection method. Conventional HOI detection methods are composed of two stages, i.e., human-object proposal generation and proposal classification. Their effectiveness and efficiency are limited by this sequential and separate architecture. In this paper, we propose a Parallel Point Detection and Matching (PPDM) HOI detection framework. In PPDM, an HOI is defined as a point triplet <human point, interaction point, object point>. Human and object points are the centers of the detection boxes, and the interaction point is the midpoint of the human and object points. PPDM contains two parallel branches, namely a point detection branch and a point matching branch. The point detection branch predicts the three points. Simultaneously, the point matching branch predicts two displacements from the interaction point to its corresponding human and object points. The human point and the object point that originate from the same interaction point are considered a matched pair. In our novel parallel architecture, the interaction points implicitly provide context and regularization for human and object detection. Isolated detection boxes that are unlikely to form meaningful HOI triplets are suppressed, which increases the precision of HOI detection. Moreover, the matching between human and object detection boxes is applied only around a limited number of filtered candidate interaction points, which saves much computational cost. Additionally, we build a new application-oriented database named HOI-A, which serves as a good supplement to the existing datasets. The source code and the dataset will be made publicly available to facilitate the development of HOI detection.




1 Introduction

Human-Object Interaction (HOI) detection [28, 10, 9, 8, 11, 15, 21] has received increasing attention recently. Given an image, HOI detection aims to detect the triplet <human, interaction, object>. Unlike general visual relationship detection [18, 27, 19, 12, 30], the subject of the triplet is fixed as a human and the interaction is an action. HOI detection is an important step toward the high-level semantic understanding of human-centric scenes. It has many applications, such as activity analysis, human-machine interaction and intelligent monitoring.

The conventional HOI detection methods [2, 21, 11, 15, 24] mostly consist of two stages. The first stage is human-object proposal generation. A pre-trained detector [7, 22] is used to localize both the humans and the objects. Human-object proposals are then generated by combining the filtered human boxes and object boxes pairwise. The second stage is proposal classification, which predicts the interactions for each human-object proposal. The effectiveness and efficiency of two-stage methods are limited mainly because the two stages are sequential and separated. The proposal generation stage is based entirely on object detection confidences. Each human/object proposal is generated independently. The possibility of combining two proposals to form a meaningful HOI triplet in the second stage is not taken into account. Therefore, the generated human-object proposals may have relatively low quality. Moreover, in the second stage, all human-object proposals need to be linearly scanned, while only a few of them are valid, so the extra computational cost is large. We therefore argue that a non-sequential and highly coupled framework is needed.

We propose a parallel HOI detection framework and reformulate HOI detection as a point detection and matching problem. As shown in Figure 2, we represent a box as a center point and the corresponding size (width and height). Moreover, we define an interaction point as the midpoint of the human and object center points. To match each interaction point with the human point and the object point, we design two displacements from the interaction point to the corresponding human and object points. Based on this reformulation, we design a novel single-stage framework, Parallel Point Detection and Matching (PPDM), which breaks the complex task of HOI detection into two simpler parallel tasks. PPDM is composed of two parallel branches. The first branch is point detection, which estimates the three center points (interaction, human and object points), the corresponding sizes (width and height) and two local offsets (for the human and object points). The interaction point can be considered as providing contextual information for both human and object detection. In other words, estimating the interaction point implicitly enhances the detection of humans and objects. The second branch is point matching. Two displacements from the interaction point to the human and object points are estimated. The human and object points that originate from the same interaction point are considered matched. In this parallel architecture, the point detection branch estimates the interaction points, which implicitly provide context and regularization for human and object detection. Isolated detection boxes that are unlikely to form meaningful HOI triplets are suppressed, while the more likely detection boxes are enhanced. This differs from the human-object proposal generation stage in two-stage methods, where all detected human/object boxes indiscriminately form human-object proposals that are fed into the second stage. Moreover, in the point matching branch, the matching is applied only around a limited number of filtered candidate interaction points, which saves a lot of computational cost. In contrast, in the proposal classification stage of two-stage methods, all human-object proposals need to be classified. Experimental results on the public benchmark HICO-Det [2] and our newly collected HOI-A dataset show that PPDM outperforms state-of-the-art methods in terms of accuracy and speed.

Figure 2: PPDM contains two parallel branches. In the point detection branch, the human/object boxes, denoted by their center points, widths and heights, are detected. Moreover, an interaction point, i.e., the midpoint of the human and object points, is also localized. Simultaneously, in the point matching branch, two displacements from each interaction point to the human/object points are estimated. The human and object linked by the same interaction point are matched and considered to belong to the same HOI triplet.
Figure 3: Overview of the proposed PPDM framework. We first apply a key-point heatmap prediction network, e.g., Hourglass-104 or DLA-34, to extract appearance features from an image. a) Point Detection Branch: based on the extracted visual features, we use three convolutional modules to predict the heatmaps of the interaction points, human center points and object center points. Additionally, to generate the final boxes, we regress the 2-D size and the local offset. b) Point Matching Branch: the first step of this branch is to regress the displacements from the interaction point to the human point and the object point, respectively. Based on the predicted points and displacements, the second step is to match each interaction point with a human point and an object point to generate a set of point triplets.

The existing datasets such as HICO-Det [2] and V-COCO [10] have greatly boosted the development of related research. These datasets are very general. However, in practical applications, a limited number of frequent HOI categories need special attention. To this end, we collect a new Human-Object Interaction for Applications dataset (HOI-A) with the following features: 1) specially selected HOI categories with wide application value, such as smoke and ride; 2) huge intra-class variations, including various illuminations and different human poses for each category. HOI-A is more application-driven, serving as a good supplement to the existing datasets.

Our contributions are summarized as follows: 1) We reformulate the HOI detection task as a point detection and matching problem and propose a novel one-stage PPDM solution. 2) PPDM is the first HOI detection method to achieve real-time speed, and it outperforms state-of-the-art methods on both the HICO-Det and HOI-A benchmarks. 3) A large-scale, application-oriented HOI detection dataset is collected to supplement existing datasets. Both the source code and the dataset are to be released to facilitate related research.

2 Related Work

HOI Detection Methods. The existing HOI detection methods mostly consist of two stages: in the first stage, an object detector [22] is applied to localize the humans and objects; in the second stage, the detected humans and objects are paired, and their features are fed into a classification network to predict the interaction between each human and object. Current works pay more attention to improving the second stage. The most recent works aim to understand HOI by capturing context information [6, 25] or human structural information [24, 5, 4, 31]. Some works [21, 26, 31] formulate the second stage as a graph reasoning problem and use a graph convolutional network to predict the HOI.

The above methods are all proposal-based, so their performance is limited by the quality of the proposals. Additionally, the existing methods incur a high computational cost in proposal generation and feature extraction. To address these drawbacks, we propose a novel one-stage and proposal-free framework to detect HOI.

HOI Detection Datasets. There are mainly two commonly used HOI detection benchmarks, V-COCO [10] and HICO-Det [2], and a human-centric relationship detection dataset, HCVRD [33]. V-COCO is a relatively small dataset. It is a subset of MS-COCO [17] and includes 10,346 images annotated with 26 actions based on the COCO annotations. HICO-Det is a large-scale and generic HOI detection dataset, including 47,776 images with 117 verbs and 80 object categories (the same as COCO). HCVRD is collected from the general visual relationship detection dataset Visual Genome [13]. It has 52,855 images, 927 predicate categories and 1,824 kinds of objects. Compared with the former two HOI detection datasets, which focus only on human actions, HCVRD is concerned with more general human-centric relationships, e.g., spatial relationships and possessive relationships.

The previous HOI detection datasets mostly concentrate on common and general actions. From a practical point of view, we build a new HOI-A dataset of about 38K images, annotated only with a limited set of typical actions that have practical significance.

3 Parallel Point Detection and Matching

3.1 Overview

HOI detection aims to estimate the HOI triplet <human, interaction, object>, which is composed of the subject box and class, the human action class, and the object box and class. We break the complex task of HOI detection into two simpler parallel tasks that can be assembled to form the final results. The framework of the proposed Parallel Point Detection and Matching (PPDM) method is shown in Figure 3. The first branch of PPDM is point detection. It estimates the center points, corresponding sizes (width and height) and local offsets of both humans and objects. The center, size and offset together represent the box candidates. Moreover, the interaction point, which is defined as the midpoint of a corresponding pair of human center point and object center point, is also estimated. The second branch of PPDM is point matching. The displacements from the interaction point to the corresponding human and object points are estimated. The human and object points linked by the same interaction point are considered matched.

3.2 Point Detection

The point detection branch estimates the human box, object box and interaction point. A human box is represented by its center point $(x^h, y^h) \in \mathbb{R}^2$, the corresponding size (width and height) $(w^h, h^h) \in \mathbb{R}^2$, and the local point offset $(\delta_x^h, \delta_y^h) \in \mathbb{R}^2$ that recovers the discretization error caused by the output stride. The object box is represented similarly. Moreover, we define the interaction point $(x^a, y^a) \in \mathbb{R}^2$ as the midpoint of the paired human point and object point. Since the receptive field of the interaction point is large enough to contain both the human and the object, the human action $a$ can be estimated from the feature at $(x^a, y^a)$. In practice, when there are $M$ humans in the dataset, each human box is represented as $(x_i^h, y_i^h)$, $i \in [1, M]$. For convenience of description, we omit the subscript $i$ when no confusion is caused. Similar omissions also apply to $(x^o, y^o)$ and $(x^a, y^a)$.

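In Figure 3, the input image $I \in \mathbb{R}^{H \times W \times 3}$ is fed into the feature extractor to produce the feature map $V$ of spatial size $\frac{H}{d} \times \frac{W}{d}$, where $W$ and $H$ are the width and height of the input image and $d$ is the output stride. The point heatmaps are low-resolution, so we also calculate the low-resolution center points. Given a ground-truth human point $(x^h, y^h)$, the corresponding low-resolution point is $(\tilde{x}^h, \tilde{y}^h) = (\lfloor x^h / d \rfloor, \lfloor y^h / d \rfloor)$. The low-resolution ground-truth object point $(\tilde{x}^o, \tilde{y}^o)$ is computed in the same way. Based on the low-resolution human and object points, the ground-truth interaction point is defined as $(\tilde{x}^a, \tilde{y}^a) = \left(\frac{\tilde{x}^h + \tilde{x}^o}{2}, \frac{\tilde{y}^h + \tilde{y}^o}{2}\right)$.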
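The mapping from image-space box centers to heatmap coordinates and the midpoint definition of the interaction point can be illustrated with a few lines of Python. This is only a sketch: the function names, the example coordinates and the stride of 4 are assumptions made for illustration, not values taken from the paper.

```python
def to_low_res(point, stride):
    """Map an image-space point to heatmap coordinates by flooring x/stride, y/stride."""
    x, y = point
    return (int(x // stride), int(y // stride))


def interaction_point(human_lr, object_lr):
    """Midpoint of the low-resolution human and object center points."""
    return ((human_lr[0] + object_lr[0]) / 2, (human_lr[1] + object_lr[1]) / 2)


human_center = (203.0, 118.0)    # hypothetical image-space center of a human box
object_center = (331.0, 246.0)   # hypothetical image-space center of an object box
stride = 4                       # assumed output stride

human_lr = to_low_res(human_center, stride)      # (50, 29)
object_lr = to_low_res(object_center, stride)    # (82, 61)
print(interaction_point(human_lr, object_lr))    # (66.0, 45.0)
```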

Point location loss. Directly detecting a point is difficult, so we follow key-point estimation methods [20] and splat each point into a heatmap with a Gaussian kernel. Point detection is thereby transformed into a heatmap estimation task. The three ground-truth low-resolution points $(\tilde{x}^h, \tilde{y}^h)$, $(\tilde{x}^o, \tilde{y}^o)$ and $(\tilde{x}^a, \tilde{y}^a)$ are splatted into three Gaussian heatmaps: the human point heatmap $C^h \in [0,1]^{\frac{H}{d} \times \frac{W}{d}}$, the object point heatmap $C^o \in [0,1]^{T \times \frac{H}{d} \times \frac{W}{d}}$ and the interaction point heatmap $C^a \in [0,1]^{K \times \frac{H}{d} \times \frac{W}{d}}$, where $T$ is the number of object categories and $K$ is the number of interaction classes. Note that in $C^o$ and $C^a$, only the channel corresponding to the specific object class and human action is non-zero. The three heatmaps are produced by adding three respective convolutional blocks on top of the feature map $V$, each of which is composed of a $3 \times 3$ convolutional layer with ReLU, followed by a $1 \times 1$ convolutional layer and a Sigmoid.
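To make the splatting step concrete, the following sketch writes a Gaussian peak for one ground-truth point into a single heatmap channel. It is an illustration under stated assumptions: the fixed `sigma`, the element-wise maximum used when peaks overlap, and the tiny heatmap size are all choices made here, not details given in the text.

```python
import math


def splat_gaussian(heatmap, center, sigma):
    """Write an unnormalized Gaussian peak into `heatmap` (list of rows), in place.

    Overlapping peaks are merged with an element-wise maximum (an assumption).
    """
    cx, cy = center
    for y, row in enumerate(heatmap):
        for x in range(len(row)):
            value = math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
            row[x] = max(row[x], value)
    return heatmap


# One class channel of a hypothetical 8 x 6 (width x height) heatmap.
heatmap = [[0.0] * 8 for _ in range(6)]
splat_gaussian(heatmap, center=(3, 2), sigma=1.0)
print(heatmap[2][3])   # peak value at the ground-truth location
```

The peak value is exactly 1 at the ground-truth location and decays smoothly around it, which is what allows the loss to penalize near-misses less than distant false positives.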

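For all three heatmaps, we apply an element-wise focal loss [16]. For example, given an estimated interaction point heatmap $\hat{C}^a$ and the corresponding ground-truth heatmap $C^a$, the loss function is:

$$L_a = -\frac{1}{N} \sum_{xyk} \begin{cases} \left(1 - \hat{C}^a_{xyk}\right)^{\alpha} \log\left(\hat{C}^a_{xyk}\right) & \text{if } C^a_{xyk} = 1 \\ \left(1 - C^a_{xyk}\right)^{\beta} \left(\hat{C}^a_{xyk}\right)^{\alpha} \log\left(1 - \hat{C}^a_{xyk}\right) & \text{otherwise} \end{cases} \tag{1}$$

where $N$ is equal to the number of interaction points (HOI triplets) in the image, and $\hat{C}^a_{xyk}$ is the score at location $(x, y)$ for class $k$ in the predicted heatmap $\hat{C}^a$. We set $\alpha$ to 2 and $\beta$ to 4, following the default setting in [14, 32]. The losses $L_h$ and $L_o$ for the human points and the object points are computed similarly.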
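A minimal sketch of such an element-wise focal loss is given below, written in the penalty-reduced form used by the keypoint-based detectors the text cites for its default settings. The exponents 2 and 4 come from the text; the function name, the flattened list inputs and the `eps` guard are assumptions for illustration.

```python
import math


def point_focal_loss(pred, target, alpha=2, beta=4, eps=1e-12):
    """Penalty-reduced focal loss over flattened heatmaps (lists of floats in [0, 1]).

    Cells where target == 1 are positives; all other cells are negatives whose
    penalty is reduced by (1 - target) ** beta near a ground-truth peak.
    """
    total, num_points = 0.0, 0
    for p, t in zip(pred, target):
        if t == 1.0:
            total += (1 - p) ** alpha * math.log(p + eps)
            num_points += 1
        else:
            total += (1 - t) ** beta * p ** alpha * math.log(1 - p + eps)
    # Normalize by the number of ground-truth points (at least 1 to avoid division by zero).
    return -total / max(num_points, 1)


target = [1.0, 0.0]
print(point_focal_loss([0.9, 0.1], target))   # small loss for a good prediction
print(point_focal_loss([0.5, 0.5], target))   # larger loss for an uncertain one
```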

Size and offset loss. Besides the center points, the box size and the local offset of the center points are needed to form the human/object box. Four convolutional blocks are added to the feature map to estimate the 2-D size and the local offset of the human and object boxes, respectively. Each block contains a $3 \times 3$ convolutional layer with ReLU and a $1 \times 1$ convolutional layer.

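During training, we compute the loss only at the locations of the ground-truth human and object points and ignore all other locations. We take the loss function for the local offset as an example; the size regression loss $L_{wh}$ is defined similarly. The ground-truth local offset for the human point located at $(\tilde{x}^h, \tilde{y}^h)$ is defined as $(\delta_x^h, \delta_y^h) = \left(\frac{x^h}{d} - \tilde{x}^h, \frac{y^h}{d} - \tilde{y}^h\right)$. The loss function $L_{off}$ is then the sum of the human box loss $L^h_{off}$ and the object box loss $L^o_{off}$:

$$L_{off} = \frac{1}{M + D} \left( L^h_{off} + L^o_{off} \right) \tag{2}$$

$$L^h_{off} = \sum_{(\tilde{x}^h, \tilde{y}^h) \in \tilde{S}^h} \left( \left| \delta_x^h - \hat{\delta}_x^h \right| + \left| \delta_y^h - \hat{\delta}_y^h \right| \right) \tag{3}$$

where $\tilde{S}^h$ and $\tilde{S}^o$ denote the ground-truth human and object point sets in the training set, $(\hat{\delta}_x^h, \hat{\delta}_y^h)$ is the predicted offset, and $M = |\tilde{S}^h|$ and $D = |\tilde{S}^o|$ are the numbers of human points and object points. Note that $M$ is not necessarily equal to $D$. For example, a human may correspond to multiple actions and objects. $L^o_{off}$ is defined similarly to Equation 3.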
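The local offset target is simply the fractional part that is lost when a center point is floored onto the heatmap grid. The sketch below computes that target and an absolute-error (L1) loss evaluated only at ground-truth points; the choice of L1, the function names and the example values are assumptions made for illustration.

```python
def offset_target(point, stride):
    """Fractional remainder lost when an image-space point is floored to the heatmap grid."""
    x, y = point
    return (x / stride - x // stride, y / stride - y // stride)


def l1_offset_loss(predicted, targets):
    """Sum of absolute offset errors over the ground-truth points only."""
    return sum(abs(px - tx) + abs(py - ty)
               for (px, py), (tx, ty) in zip(predicted, targets))


human_center = (203.0, 118.0)               # hypothetical image-space center
target = offset_target(human_center, 4)     # (0.75, 0.5) for an assumed stride of 4
print(target)
print(l1_offset_loss([(0.5, 0.5)], [target]))   # 0.25
```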

3.3 Point Matching

The point matching branch pairs the human box with its corresponding object box, using the interaction point as the bridge. More specifically, the interaction point is treated as the anchor. Two displacements $d^{ah}$ and $d^{ao}$, i.e., the displacements from the interaction point to the human point and to the object point, are estimated. The coarse human point and object point are then $(x^a, y^a)$ plus $d^{ah}$ and $d^{ao}$, respectively.

Our proposed displacement branch is composed of two convolutional modules. Each module consists of a $3 \times 3$ convolutional layer with ReLU and a $1 \times 1$ convolutional layer. The size of both the human and object displacement maps is $2 \times \frac{H}{d} \times \frac{W}{d}$.

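Displacement loss. To train the displacement branch, we apply an $L_1$ loss at each interaction point. The ground-truth displacement from the interaction point located at $(\tilde{x}^a, \tilde{y}^a)$ to the corresponding human point is computed as $(d^{hx}, d^{hy}) = (\tilde{x}^h - \tilde{x}^a, \tilde{y}^h - \tilde{y}^a)$. The predicted displacement at location $(\tilde{x}^a, \tilde{y}^a)$ of the displacement map is $(\hat{d}^{hx}, \hat{d}^{hy})$. The displacement loss is defined as:

$$L_{ah} = \frac{1}{N} \sum_{(\tilde{x}^a, \tilde{y}^a) \in \tilde{S}^a} \left( \left| \hat{d}^{hx} - d^{hx} \right| + \left| \hat{d}^{hy} - d^{hy} \right| \right) \tag{4}$$

where $\tilde{S}^a$ denotes the ground-truth interaction point set in the training set and $N = |\tilde{S}^a|$ is the number of interaction points. The loss function $L_{ao}$ for the displacement from the interaction point to the object point has the same form.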
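As a small worked example, the displacement targets for one interaction point can be computed as the vectors pointing from it to its human and object points. The sketch assumes the "interaction point plus displacement" convention used in the text; the coordinates and function names are hypothetical.

```python
def displacement_target(interaction_lr, point_lr):
    """Vector from the interaction point to a human or object point (heatmap coordinates)."""
    return (point_lr[0] - interaction_lr[0], point_lr[1] - interaction_lr[1])


def l1_displacement_loss(predicted, targets):
    """Mean absolute displacement error over the ground-truth interaction points."""
    total = sum(abs(px - tx) + abs(py - ty)
                for (px, py), (tx, ty) in zip(predicted, targets))
    return total / max(len(targets), 1)


interaction = (66.0, 45.0)   # hypothetical low-resolution interaction point
human = (50, 29)             # its human center point
obj = (82, 61)               # its object center point

to_human = displacement_target(interaction, human)    # (-16.0, -16.0)
to_object = displacement_target(interaction, obj)     # (16.0, 16.0)
print(to_human, to_object)
print(l1_displacement_loss([(-15.0, -16.0)], [to_human]))   # 1.0
```

Because the interaction point is the midpoint of the pair, the two ground-truth displacements are exact opposites of each other in this example.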

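Triplet matching. Two aspects are considered when judging whether a human/object point can be matched with an interaction point. The human/object point needs to: 1) be close to the coarse human/object point generated by the interaction point plus the displacement, and 2) have a high confidence score. On this basis, for a detected interaction point $(\hat{x}^a, \hat{y}^a)$ with predicted displacement $(\hat{d}^{hx}, \hat{d}^{hy})$, we rank the points in the detected human point set $\hat{S}^h$ by Equation 5 and select the optimal one:

$$(\hat{x}^h_{opt}, \hat{y}^h_{opt}) = \arg\min_{(\hat{x}^h, \hat{y}^h) \in \hat{S}^h} \frac{1}{\hat{C}^h_{(\hat{x}^h, \hat{y}^h)}} \left| (\hat{x}^a, \hat{y}^a) + (\hat{d}^{hx}, \hat{d}^{hy}) - (\hat{x}^h, \hat{y}^h) \right| \tag{5}$$

where $\hat{C}^h_{(\hat{x}^h, \hat{y}^h)}$ denotes the confidence score of the human point $(\hat{x}^h, \hat{y}^h)$. The optimal object box is computed similarly.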
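The two matching criteria, closeness to the coarse point and high confidence, can be combined into a single ranking cost. The sketch below divides the distance by the confidence score and picks the candidate with the smallest cost. The Euclidean distance, the function name and the candidate values are assumptions for illustration; confidence scores are assumed to be strictly positive.

```python
import math


def match_point(interaction, displacement, candidates):
    """Pick the detected point that best matches an interaction point.

    `candidates` is a list of ((x, y), score) pairs with score > 0.
    The cost is the distance to the coarse point divided by the confidence.
    """
    coarse = (interaction[0] + displacement[0], interaction[1] + displacement[1])

    def cost(candidate):
        (x, y), score = candidate
        return math.hypot(x - coarse[0], y - coarse[1]) / score

    return min(candidates, key=cost)


interaction = (66.0, 45.0)       # hypothetical detected interaction point
to_human = (-15.5, -16.2)        # hypothetical predicted displacement
humans = [((50, 29), 0.9),       # close and confident
          ((52, 30), 0.3),       # close but low confidence
          ((10, 10), 0.95)]      # confident but far away

print(match_point(interaction, to_human, humans))   # ((50, 29), 0.9)
```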

3.4 Loss and Inference

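The final loss is obtained as a weighted sum of the above losses:

$$L = L_a + L_h + L_o + \lambda \left( L_{ah} + L_{ao} + L_{wh} \right) + L_{off} \tag{6}$$

where we set $\lambda$ to 0.1 following [14, 32]. $L_a$, $L_h$ and $L_o$ are the point location losses, $L_{ah}$ and $L_{ao}$ are the displacement losses, and $L_{wh}$ and $L_{off}$ are the size and offset losses.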

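During inference, we first apply a max-pooling operation with stride 1 to the predicted human, object and interaction point heatmaps, which plays a role similar to NMS. Second, we select the top-scoring human points $\hat{S}^h$, object center points $\hat{S}^o$ and interaction points $\hat{S}^a$ according to the corresponding confidence scores $\hat{C}^h$, $\hat{C}^o$ and $\hat{C}^a$ across all categories. Then, we find the subject point and object point for each selected interaction point by Equation 5. For each matched human point $(\hat{x}^h_{opt}, \hat{y}^h_{opt})$, we get the final box as:

$$\left( \hat{x}^h_{ref} - \frac{\hat{w}^h}{2},\; \hat{y}^h_{ref} - \frac{\hat{h}^h}{2},\; \hat{x}^h_{ref} + \frac{\hat{w}^h}{2},\; \hat{y}^h_{ref} + \frac{\hat{h}^h}{2} \right) \tag{7}$$

where $\hat{x}^h_{ref} = \hat{x}^h_{opt} + \hat{\delta}^h_x$ and $\hat{y}^h_{ref} = \hat{y}^h_{opt} + \hat{\delta}^h_y$ are the refined location of the human center point, and $(\hat{w}^h, \hat{h}^h)$ is the box size predicted at the corresponding position. The final HOI detection results are a set of triplets, and the confidence score of each triplet is the product $\hat{C}^h \hat{C}^a \hat{C}^o$ of the scores of its human, interaction and object points.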
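The inference steps (peak extraction that stands in for NMS, top-scoring point selection and box decoding) can be sketched in plain Python as follows. The 3 x 3 pooling window, the function names and the toy heatmap are assumptions for illustration; the text only states that the max-pooling uses stride 1.

```python
def local_peaks(heatmap):
    """Keep a cell only if it equals the maximum of its 3x3 neighborhood.

    Equivalent to comparing the heatmap with its stride-1 max-pooled version.
    """
    height, width = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(height):
        for x in range(width):
            window = [heatmap[j][i]
                      for j in range(max(0, y - 1), min(height, y + 2))
                      for i in range(max(0, x - 1), min(width, x + 2))]
            if heatmap[y][x] > 0 and heatmap[y][x] == max(window):
                peaks.append(((x, y), heatmap[y][x]))
    return peaks


def top_scoring(peaks, count):
    """Return the `count` peaks with the highest confidence scores."""
    return sorted(peaks, key=lambda peak: peak[1], reverse=True)[:count]


def decode_box(center, offset, size):
    """Refine a center point by its local offset and expand it to (x1, y1, x2, y2)."""
    cx, cy = center[0] + offset[0], center[1] + offset[1]
    w, h = size
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


heatmap = [[0.1, 0.2, 0.1, 0.0],
           [0.2, 0.9, 0.3, 0.0],
           [0.1, 0.3, 0.2, 0.6],
           [0.0, 0.0, 0.1, 0.2]]

peaks = local_peaks(heatmap)
print(peaks)                                     # [((1, 1), 0.9), ((3, 2), 0.6)]
print(top_scoring(peaks, 1))                     # [((1, 1), 0.9)]
print(decode_box((1, 1), (0.25, 0.5), (2.0, 1.0)))   # (0.25, 1.0, 2.25, 2.0)
```

The decoded box is in heatmap coordinates; multiplying by the output stride would map it back to the input image.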

4 HOI-A Dataset

The existing datasets such as HICO-Det [2] and V-COCO [10] have greatly boosted the development of related research. However, in practical applications, there is a limited number of frequent HOI categories that need special attention and that are not emphasized in previous datasets. We therefore introduce a new dataset called Human-Object Interaction for Application (HOI-A).

Figure 4: Example images from our HOI-A dataset. We take <human, smoke, cigarette> as an example. Panels (a)-(d) show the huge intra-class variation of <human, smoke, cigarette> in the wild. Panels (e)-(f) show two kinds of negative samples.

As shown in Table 1, we select the verb categories according to practical applications. Each verb in the HOI-A dataset has a corresponding application scenario. For example, ‘talk on’ can be applied to dangerous action detection: if a person is talking on the phone in a car, it can be considered a dangerous driving action.

Verbs | Objects | # Instances
smoking | cigarette | 8624
talk on | mobile phone | 18763
play (mobile phone) | mobile phone | 6728
eat | food | 831
drink | drink | 6898
ride | bike, motorbike, horse | 7111
- | cigarette, mobile phone, food, drink, document, computer | -
kick | sports ball | 365
read | document | 869
play (computer) | computer | 1402
Table 1: The verbs, their corresponding objects and the number of instances of each in the HOI-A dataset.

4.1 HOI-A Construction

We describe the image collection and annotation process for constructing the HOI-A dataset. The first step is collecting candidate images, which can be divided into two parts, namely positive and negative images collections.

Positive Images Collection. We collect positive images in two ways, i.e., camera shooting and crawling. Camera shooting is an important way to enlarge the intra-class variation of the data. We employ 50 performers and ask them to perform all predefined actions in different scenes and illuminations and with various poses, and we photograph them with both an RGB camera and an IR camera. For crawling data from the Internet, we generate a series of keywords based on the HOI triplet <person, action name, object name>, the action pair <action name, object name> and the action name alone, and retrieve images from the Internet.

Negative Images Collection. There are two kinds of negative samples for a predefined <human, interaction, object> triplet. 1) The concerned object appears in the image, but the concerned action does not happen. For example, in Figure 4(f), although a cigarette appears in the image, it is not being smoked by a human. Therefore, the image is still a negative sample. 2) An action similar to the concerned action happens, but the concerned object is missing. For example, in Figure 4(e), the man appears to be smoking at first glance, but a closer look shows that there is no cigarette in the image. We collect this kind of negative sample in an ‘attack’ manner. We first train a multi-label action classifier on the annotated positive images. The classifier takes an image as input and outputs the action classification probabilities. Then, we let actors perform arbitrarily, without any interacted objects, to attack the classifier. If the attack is successful, we record the image as a hard negative sample.

Annotation. The annotation process contains two steps: box annotation and interaction annotation. First, all objects in the predefined categories are annotated with a box and the corresponding category. Second, we visualize the boxes in the images with their IDs and annotate whether a person has one of the defined interactions with an object. The annotator records the triplet <person ID, action ID, object ID>. For more accurate annotation, each image is annotated by 3 annotators. The annotation of an image is regarded as qualified if at least 2 annotators share the same annotation.
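The agreement rule above can be expressed as a simple vote over the annotators' triplet sets. This is a sketch only: it assumes that "the same annotation" means an identical set of <person ID, action ID, object ID> triplets for the whole image, and the function name and example IDs are hypothetical.

```python
from collections import Counter


def qualified_annotation(annotations, min_agreement=2):
    """Return the triplet set shared by at least `min_agreement` annotators, else None.

    Each element of `annotations` is one annotator's set of
    (person_id, action_id, object_id) triplets for the same image.
    """
    counts = Counter(frozenset(annotation) for annotation in annotations)
    label, votes = counts.most_common(1)[0]
    return set(label) if votes >= min_agreement else None


annotator_a = {(1, "smoke", 2)}
annotator_b = {(1, "smoke", 2)}
annotator_c = {(1, "hold", 2)}

print(qualified_annotation([annotator_a, annotator_b, annotator_c]))   # {(1, 'smoke', 2)}
```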

4.2 Dataset Properties

Scale. Our HOI-A dataset consists of 38,668 annotated images, kinds of objects and action categories. In detail, it contains human instances, object instances and interaction instances. There are on average interactions performed per person. Table 1 lists the number of instances for each verb which occurs at least times. verbs appear more than times. To our knowledge, this is already the largest HOI dataset, in terms of the number of images per interaction category.

Intra-Class Variation. To enlarge the intra-class variation of the data, each type of verb in our HOI-A dataset is captured in three general scenes (indoor, outdoor and in-car), under three lighting conditions (dark, natural and intense), and with various human poses and different angles. Additionally, we shoot the images with two kinds of cameras: RGB and IR.

5 Experiments

Method | Feature | Full (mAP %) | Rare (mAP %) | Non-Rare (mAP %) | Inference Time (ms) | FPS
Shen et al. [23] | A + P | 6.46 | 4.24 | 7.12 | - | -
HO-RCNN [2] | A + S | 7.81 | 5.37 | 8.54 | - | -
VSRL [10] | A | 9.09 | 7.02 | 9.71 | - | -
InteractNet [8] | A | 9.94 | 7.16 | 10.77 | 145 | 6.90
GPNN [21] | A | 13.11 | 9.34 | 14.23 | 197 + 48 = 245 | 4.08
Xu et al. [26] | A + L | 14.70 | 13.26 | 15.13 | - | -
iCAN [6] | A + S | 14.84 | 10.45 | 16.15 | 92 + 112 = 204 | 4.90
PMFNet-Base [24] | A + S | 14.92 | 11.42 | 15.96 | - | -
Wang et al. [25] | A | 16.24 | 11.16 | 17.75 | - | -
No-Frills [11] | A + S + P | 17.18 | 12.17 | 18.68 | 197 + 230 + 67 = 494 | 2.02
TIN [15] | A + S + P | 17.22 | 13.51 | 18.32 | 92 + 98 + 323 = 513 | 1.95
RPNN [31] | A + P | 17.35 | 12.78 | 18.71 | - | -
PMFNet [24] | A + S + P | 17.46 | 15.65 | 18.00 | 92 + 98 + 63 = 253 | 3.95
PPDM-DLA | A | 19.02 | 12.65 | 20.92 | 27 | 37.03
PPDM-Hourglass | A | 21.10 | 14.46 | 23.09 | 71 | 14.08

Table 2: Performance comparison on the HICO-DET test set. The ‘A’, ‘P’, ‘S’, ‘L’ represent the appearance feature, human pose information, the spatial feature, and the language feature, respectively.

5.1 Experimental Setting

Datasets. To verify the effectiveness of PPDM, we conduct experiments not only on our HOI-A dataset but also on the general HOI detection dataset HICO-Det [2]. HICO-Det is a large-scale dataset for common HOI detection. It has 47,776 images (38,118 for training and 9,658 for testing), annotated with 117 verbs, including ‘no-interaction’, and 80 object categories. The verbs and objects form 600 kinds of HOI triplets, of which the 138 HOI types that appear fewer than 10 times are considered the rare set, and the remaining 462 kinds of HOIs form the non-rare set.

Metric. Following the standard setting in the HOI detection task, we use mean average precision (mAP) as the metric. For a predicted triplet to be considered a true positive, it needs to match a ground-truth triplet. Specifically, the two must have the same HOI class, and their human boxes and object boxes must each overlap with an IoU larger than 0.5. There is a slight difference when computing AP on the two datasets: we compute AP per HOI class on HICO-Det and AP per verb class on the HOI-A dataset.
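The matching criterion behind the metric can be sketched as follows. The IoU threshold of 0.5 is assumed here as the usual convention in HOI detection, and the dictionary keys and example boxes are hypothetical. A full evaluation would additionally ensure that each ground-truth triplet is matched at most once, which this sketch omits.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, threshold=0.5):
    """Same HOI class, and both the human and the object box overlap enough."""
    return (pred["hoi_class"] == gt["hoi_class"]
            and iou(pred["human_box"], gt["human_box"]) > threshold
            and iou(pred["object_box"], gt["object_box"]) > threshold)


pred = {"hoi_class": "ride bike", "human_box": (0, 0, 10, 10), "object_box": (20, 20, 30, 30)}
gt = {"hoi_class": "ride bike", "human_box": (1, 0, 10, 10), "object_box": (20, 20, 30, 28)}
print(is_true_positive(pred, gt))   # True: the IoUs are 0.9 and 0.8
```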

Implementation Details. We use two common heatmap prediction networks as our feature extractor, Hourglass-104 [20, 14] and DLA-34 [29, 32]. Hourglass-104 is a general heatmap prediction network commonly used in keypoint detection and object detection. In PPDM, we use the modified version of Hourglass-104 proposed in [14]. DLA-34 is a lightweight backbone network, and we apply the refined version proposed in [32]. The receptive field of the network needs to be large enough to cover both the subject and the object. Hourglass-104 has a sufficiently large receptive field, while that of DLA-34 cannot cover the region including the human and the object, due to its relatively shallow architecture. Thus, for the DLA-based model, we concatenate the features of the last three levels and apply a graph-based global reasoning module [3] to enlarge the receptive field for the interaction point and displacement prediction. In the global reasoning module, we set the channels of the node and the reduced feature as and respectively. For Hourglass-104, we use only the last-level feature for all the following modules. We initialize the feature extractor with weights pre-trained on COCO [17]. Our experiments are all conducted on a Titan Xp GPU with CUDA 9.0.

During training and inference, the input resolution is $512 \times 512$ and the output resolution is $128 \times 128$. PPDM is trained with Adam on 8 GPUs. We set the hyper-parameters following [32], which we find to be robust for our framework. We train the DLA-34-based model with a mini-batch size of 128 for 110 epochs, with a learning rate of 5e-4 that is decreased to 5e-5 at the 90th epoch. We train the Hourglass-104-based model with a batch size of 32 for 110 epochs, with a learning rate of 3.2e-4 that is decreased by a factor of 10 at the 90th epoch. We follow [14, 32] in applying data augmentation, i.e., random scaling and random shifting, during training; no augmentation is used during inference. We set the number of selected predictions to 100.

5.2 Comparison to State-of-the-art

We compare PPDM with state-of-the-art methods on two datasets. The quantitative results are given in Table 2 and Table 4, and the qualitative results are presented in Figure 5. More results can be found in the supplementary materials. The compared methods mainly use a pre-trained Faster R-CNN [22] to generate a set of human-object proposals, which are then fed into a pairwise classification network. As shown in Table 2, many methods use additional human pose features or language features to classify the HOI more accurately.

Figure 5: Visualization results compared with iCAN on HICO-Det. The first row is the prediction of iCAN and the second row by PPDM. Purple denotes the subject and red is the object. If a subject has interaction with an object, they will be linked by a green line. We show results with top-3 confidence per image: 1-blue, 2-yellow, 3-pink. The ‘no’ denotes ‘no-interaction’.

5.2.1 Quantitative Analysis

HICO-Det. See Table 2. Our PPDM-DLA and PPDM-Hourglass both outperform all previous state-of-the-art methods. Specifically, PPDM-Hourglass achieves a significant gain of 3.64 mAP points on the full set (21.10 vs. 17.46) compared to the previous best method, PMFNet [24]. The previous methods with mAP greater than 17% all use human pose as an additional feature, while PPDM uses only the appearance feature. The performance of PPDM is slightly lower than that of PMFNet on the rare subset. However, the baseline model of PMFNet, which does not use human pose information, achieves only 11.42% mAP on the rare set. The performance gain on the rare set may therefore come mainly from the additional human pose feature. Human structural information plays an important role in understanding human actions, so we regard how to utilize human context in our framework as significant future work.

Method | Top 5 | Top 10 | Top 100
iCAN [6] | 11.70 | 13.43 | 14.84
TIN [15] | 13.70 | 15.24 | 17.11
PPDM-Hourglass | 18.92 | 20.35 | 21.10

Table 3: Top-K results (mAP %) on the HICO-Det test set.

Method | mAP (%) | Time (ms)
Faster Interaction Net [1] | 56.03 | -
GMVM [1] | 60.26 | -
URNet [1] | 66.04 | -
iCAN [6] | 44.23 | 194
TIN [15] | 48.64 | 501
PPDM-DLA | 67.03 | 27
PPDM-Hourglass | 71.45 | 71
Table 4: Performance comparison on the HOI-A test set.

To verify the claim that PPDM is more sensitive to interactive human-object pairs than previous two-stage methods, we select two representative open-source two-stage methods, iCAN [6] and TIN [15], for comparison on the HICO-Det dataset. We compare the mAP in the ‘top 5’, ‘top 10’ and ‘top 100’ settings; for example, in the ‘top 5’ setting, only the five predictions with the highest confidence scores in each image are used to compute the mAP. As shown in Table 3, the performance of the two-stage methods drops by about 20% in relative terms when only the top-5 results are kept (iCAN from 14.84 to 11.70, TIN from 17.11 to 13.70), while PPDM drops by only about 10% (from 21.10 to 18.92).
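The size of these drops can be checked directly from the Top 5 and Top 100 columns of Table 3. The short calculation below computes the relative drop for each method; the dictionary layout and function name are illustrative only.

```python
def relative_drop(top5_map, top100_map):
    """Relative mAP drop (in percent) when keeping only the top-5 predictions per image."""
    return (top100_map - top5_map) / top100_map * 100.0


# (Top 5, Top 100) mAP values as listed in Table 3.
top_k_results = {
    "iCAN": (11.70, 14.84),
    "TIN": (13.70, 17.11),
    "PPDM-Hourglass": (18.92, 21.10),
}

for method, (top5, top100) in top_k_results.items():
    print(f"{method}: {relative_drop(top5, top100):.1f}% relative drop")
```

The two-stage methods lose roughly a fifth of their mAP, while PPDM loses roughly a tenth.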

HOI-A. The compared methods on the HOI-A dataset fall into two groups. First, we select the top-3 methods from the leaderboard of the ICCV 2019 PIC challenge HOI detection track [1]. Our methods still outperform the top-1 method, URNet [1], which uses a very strong detector. Second, we choose two open-source state-of-the-art methods, iCAN [6] and TIN [15], as baselines on our HOI-A dataset. We first pre-train Faster R-CNN with FPN and ResNet-50 on the HOI-A dataset, and then follow the original settings of these methods to train the HOI classifier. The results show that PPDM outperforms both methods by a significant margin. Additionally, for our selected interaction types with practical significance, PPDM achieves very high performance, which makes it practically applicable.

5.2.2 Qualitative Analysis

We visualize the HOI predictions with the top-3 confidence scores on the HICO-Det dataset based on PPDM-DLA, and compare our results with the typical two-stage method iCAN [6]. As shown in Figure 5, we select some representative failure cases of the two-stage method. We can see that iCAN tends to focus on humans/objects that have a high detection score but no interaction. In Figure 5(b) and Figure 5(c), due to the huge imbalance between positive and negative samples, iCAN easily produces high confidence for the ‘no-interaction’ type. In Figure 5(d), the person sitting on the airplane is so small that it cannot be detected. In contrast, PPDM accurately predicts the HOI triplets with high confidence in these cases, because it does not depend on proposals and concentrates on understanding the HOI triplets.

# | Method | Full (mAP %) | Rare (mAP %) | Non-Rare (mAP %) | Time (ms)
1 | Basic Model | 18.46 | 11.97 | 20.40 | 24
2 | + Feature Fusion | 18.66 | 11.86 | 20.69 | 26
3 | + Global Reasoning | 18.63 | 12.61 | 20.42 | 26
4 | Union Center | 18.07 | 11.53 | 20.02 | 27
5 | PPDM-DLA | 19.02 | 12.65 | 20.92 | 27

Table 5: Component analysis on HICO-Det Test Set.

5.2.3 Efficiency Analysis

We compare the inference speed on a single Titan Xp GPU with the methods that have released code or reported their speed. As shown in Table 2, PPDM with DLA and with Hourglass are both faster than the other methods by a large margin. PPDM-DLA is the only real-time method, taking only 27 ms per inference. Concretely, the inference time of two-stage HOI detection methods can be divided into proposal generation time and HOI classifier time. In addition, the pose-based methods take extra time to estimate human key-points. It can be seen that PPDM-DLA as a whole is faster than any single stage of the compared methods.
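The FPS column of Table 2 follows from the per-image latency as 1000 divided by the inference time in milliseconds, with the stage times of multi-stage methods summed first. The snippet below reproduces a few entries; the function name and the selection of methods are illustrative.

```python
def fps_from_latency(milliseconds):
    """Frames per second for a given per-image inference time in milliseconds."""
    return 1000.0 / milliseconds


# Per-image latencies in ms as listed in Table 2; stage times are summed first.
latency_ms = {
    "GPNN": 197 + 48,
    "iCAN": 92 + 112,
    "PPDM-DLA": 27,
    "PPDM-Hourglass": 71,
}

for method, ms in latency_ms.items():
    print(f"{method}: {ms} ms -> {fps_from_latency(ms):.2f} FPS")
```

For 27 ms this gives about 37.04 FPS when rounded to two decimals; Table 2 lists 37.03, which may reflect truncation or a latency measured with more digits than reported.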

5.3 Component Analysis

We analyze the proposed components in PPDM from quantitative and qualitative views.

Figure 6: Visualization of interaction point heatmaps and displacements. Red and purple lines represent the displacements from the interaction point (green) to the human and the object.

Feature Extractor. We analyze the effectiveness of the additional modules in the DLA backbone, i.e., feature fusion and global reasoning. The first row in Table 5 represents the basic framework with DLA, where we predict the interaction based only on the last-level feature. Even this basic model outperforms all existing methods, which demonstrates the effectiveness of our framework design. The second row shows the result of the basic model with feature fusion, a boost of 0.20 points on Full (18.46 to 18.66). If we instead add a global reasoning module to the basic framework, the third row of Table 5 shows that performance improves by 0.17 mAP (18.46 to 18.63). We conclude that a larger receptive field and global context are helpful for interaction prediction.
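The gains attributed to each module can be recomputed directly from the Full column of Table 5. The snippet below is only a bookkeeping sketch: the numbers are copied from the table, and the helper name `gain` is hypothetical.

```python
# Full-set mAP values copied from Table 5 (HICO-Det test set).
full_map = {
    "Basic Model": 18.46,
    "+ Feature Fusion": 18.66,
    "+ Global Reasoning": 18.63,
    "PPDM-DLA": 19.02,
}


def gain(variant: str, reference: str = "Basic Model") -> float:
    """mAP difference (variant minus reference), rounded to two decimals."""
    return round(full_map[variant] - full_map[reference], 2)


fusion_gain = gain("+ Feature Fusion")
reasoning_gain = gain("+ Global Reasoning")
full_model_gain = gain("PPDM-DLA")

for name, value in [
    ("feature fusion", fusion_gain),
    ("global reasoning", reasoning_gain),
    ("both modules (PPDM-DLA)", full_model_gain),
]:
    print(f"{name}: +{value:.2f} mAP over the basic model")
```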

Point Detection. To verify whether the midpoint of the two center points is the best location for predicting the interaction, we run an experiment that instead places the interaction point at the center of the union of the human and object boxes, which is another plausible location. As the 4th row of Table 5 shows, mAP drops by about 1 point compared with PPDM-DLA (18.07 vs. 19.02 on Full). It is common for two objects to interact with the same person while both lying inside the human box, in which case the centers of their union boxes coincide and become indistinguishable. Additionally, we analyze our interaction point qualitatively. As shown in Figure 6, the predicted interaction point lies almost exactly at the midpoint of the human and object points, even when the human is far from the object or inside it.
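The ambiguity of the union-center variant can be illustrated with a toy example. The boxes below are made-up coordinates in (x1, y1, x2, y2) format, and the function names are illustrative only; the point is that two objects lying inside the same human box share one union center but keep distinct midpoints.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Point = Tuple[float, float]


def center(box: Box) -> Point:
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def midpoint_interaction_point(human: Box, obj: Box) -> Point:
    """Midpoint of the human and object box centers (the PPDM definition)."""
    (hx, hy), (ox, oy) = center(human), center(obj)
    return ((hx + ox) / 2, (hy + oy) / 2)


def union_center_interaction_point(human: Box, obj: Box) -> Point:
    """Center of the smallest box enclosing both boxes (the ablated variant)."""
    union = (
        min(human[0], obj[0]),
        min(human[1], obj[1]),
        max(human[2], obj[2]),
        max(human[3], obj[3]),
    )
    return center(union)


# Hypothetical boxes: both objects lie entirely inside the human box.
human = (100.0, 50.0, 300.0, 450.0)
cup = (120.0, 200.0, 160.0, 240.0)
phone = (240.0, 100.0, 270.0, 150.0)

for name, obj in [("cup", cup), ("phone", phone)]:
    print(
        name,
        "midpoint:", midpoint_interaction_point(human, obj),
        "union center:", union_center_interaction_point(human, obj),
    )
```

In this example both union centers fall on the human center, so a heatmap peak there cannot tell the two interactions apart, whereas the two midpoints remain separate.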

Point Matching. To further understand the displacements, we visualize them in Figure 6. The interaction point plus the corresponding displacement lands very close to the center point of the human/object box, even when the human/object is hard to detect.
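To make the role of the displacements concrete, the following is a simplified sketch of how an interaction point and its two predicted displacements could be decoded into a matched human-object pair. It uses plain nearest-center matching, which is an assumption made for illustration; the actual matching rule may also weigh detection confidence or box size. All coordinates and names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def match_center(
    interaction_pt: Point, displacement: Point, centers: List[Point]
) -> Tuple[int, float]:
    """Return (index, distance) of the detected center closest to
    interaction_pt + displacement."""
    px = interaction_pt[0] + displacement[0]
    py = interaction_pt[1] + displacement[1]
    distances = [math.hypot(cx - px, cy - py) for cx, cy in centers]
    best = min(range(len(centers)), key=distances.__getitem__)
    return best, distances[best]


# Hypothetical detections (box centers) and one interaction point.
human_centers = [(200.0, 250.0), (420.0, 260.0)]
object_centers = [(255.0, 125.0), (140.0, 220.0)]
interaction_pt = (170.0, 235.0)

# Hypothetical predicted displacements from the interaction point.
disp_to_human = (31.0, 14.0)
disp_to_object = (-29.0, -16.0)

h_idx, h_dist = match_center(interaction_pt, disp_to_human, human_centers)
o_idx, o_dist = match_center(interaction_pt, disp_to_object, object_centers)
print(f"matched human #{h_idx} (error {h_dist:.2f} px), "
      f"object #{o_idx} (error {o_dist:.2f} px)")
```

Because matching is only attempted around the interaction points that survive filtering, the cost grows with the number of candidate interaction points rather than with all possible human-object pairs.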

6 Conclusion and Future Work

In this paper, we propose a novel one-stage framework and a new dataset for HOI detection. Our method outperforms existing methods while also running significantly faster. It breaks the limits of traditional two-stage methods by directly predicting HOI triplets with a parallel framework. Our HOI-A dataset is geared toward practical applications of HOI detection. For future work, we plan to explore how to utilize human context in our framework and to enrich the action categories of the HOI-A dataset.


  • [1] http://www.picdataset.com/challenge/leaderboard/hoi2019.
  • [2] Yu-Wei Chao, Yunfan Liu, Xieyang Liu, Huayi Zeng, and Jia Deng. Learning to detect human-object interactions. In WACV, 2018.
  • [3] Yunpeng Chen, Marcus Rohrbach, Zhicheng Yan, Yan Shuicheng, Jiashi Feng, and Yannis Kalantidis. Graph-based global reasoning networks. In CVPR, 2019.
  • [4] Hao-Shu Fang, Jinkun Cao, Yu-Wing Tai, and Cewu Lu. Pairwise body-part attention for recognizing human-object interactions. In ECCV, 2018.
  • [5] Wei Feng, Wentao Liu, Tong Li, Jing Peng, Chen Qian, and Xiaolin Hu. Turbo learning framework for human-object interactions recognition and human pose estimation.
  • [6] Chen Gao, Yuliang Zou, and Jia-Bin Huang. iCAN: Instance-centric attention network for human-object interaction detection. In BMVC, 2018.
  • [7] Ross Girshick. Fast R-CNN. In CVPR, 2015.
  • [8] Georgia Gkioxari, Ross Girshick, Piotr Dollár, and Kaiming He. Detecting and recognizing human-object interactions. In CVPR, 2018.
  • [9] Abhinav Gupta, Aniruddha Kembhavi, and Larry S Davis. Observing human-object interactions: Using spatial and functional compatibility for recognition. TPAMI, 2009.
  • [10] Saurabh Gupta and Jitendra Malik. Visual semantic role labeling. arXiv preprint arXiv:1505.04474, 2015.
  • [11] Tanmay Gupta, Alexander Schwing, and Derek Hoiem. No-frills human-object interaction detection: Factorization, appearance and layout encodings, and training techniques. In ICCV, 2019.
  • [12] Roei Herzig, Moshiko Raboh, Gal Chechik, Jonathan Berant, and Amir Globerson. Mapping images to scene graphs with permutation-invariant structured prediction. In NIPS, 2018.
  • [13] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
  • [14] Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In ECCV, 2018.
  • [15] Yong-Lu Li, Siyuan Zhou, Xijie Huang, Liang Xu, Ze Ma, Hao-Shu Fang, Yan-Feng Wang, and Cewu Lu. Transferable interactiveness prior for human-object interaction detection. In CVPR, 2019.
  • [16] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In CVPR, 2017.
  • [17] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • [18] Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. Visual relationship detection with language priors. In ECCV, 2016.
  • [19] Alejandro Newell and Jia Deng. Pixels to graphs by associative embedding. In NIPS, 2017.
  • [20] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
  • [21] Siyuan Qi, Wenguan Wang, Baoxiong Jia, Jianbing Shen, and Song-Chun Zhu. Learning human-object interactions by graph parsing neural networks. In ECCV, 2018.
  • [22] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • [23] Liyue Shen, Serena Yeung, Judy Hoffman, Greg Mori, and Li Fei-Fei. Scaling human-object interaction recognition through zero-shot learning. In WACV, 2018.
  • [24] Bo Wan, Desen Zhou, Yongfei Liu, Rongjie Li, and Xuming He. Pose-aware multi-level feature network for human object interaction detection. In ICCV, 2019.
  • [25] Tiancai Wang, Rao Muhammad Anwer, Muhammad Haris Khan, Fahad Shahbaz Khan, Yanwei Pang, Ling Shao, and Jorma Laaksonen. Deep contextual attention for human-object interaction detection. In ICCV, 2019.
  • [26] Bingjie Xu, Yongkang Wong, Junnan Li, Qi Zhao, and Mohan S. Kankanhalli. Learning to detect human-object interactions with knowledge. In CVPR, 2019.
  • [27] Danfei Xu, Yuke Zhu, Christopher B Choy, and Li Fei-Fei. Scene graph generation by iterative message passing. In CVPR, 2017.
  • [28] Bangpeng Yao and Li Fei-Fei. Recognizing human-object interactions in still images by modeling the mutual context of objects and human poses. TPAMI, 2012.
  • [29] Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. Deep layer aggregation. In CVPR, 2018.
  • [30] Ji Zhang, Kevin J Shih, Ahmed Elgammal, Andrew Tao, and Bryan Catanzaro. Graphical contrastive losses for scene graph parsing. In CVPR, 2019.
  • [31] Penghao Zhou and Mingmin Chi. Relation parsing neural network for human-object interaction detection. In ICCV, 2019.
  • [32] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
  • [33] Bohan Zhuang, Qi Wu, Chunhua Shen, Ian Reid, and Anton van den Hengel. Care about you: towards large-scale human-centric visual relationship detection. arXiv preprint arXiv:1705.09892, 2017.