Human-Object Interaction (HOI) detection [28, 10, 9, 8, 11, 15, 21] has received increasing attention recently. Given an image, HOI detection aims to detect the triplet ⟨human, interaction, object⟩. Unlike general visual relationship detection [18, 27, 19, 12, 30], the subject of the triplet is fixed as a human and the interaction is an action. HOI detection is an important step toward high-level semantic understanding of human-centric scenes. It has many applications, such as activity analysis, human-machine interaction and intelligent monitoring.
The conventional HOI detection methods [2, 21, 11, 15, 24] mostly consist of two stages. The first stage is human-object proposal generation. A pre-trained detector [7, 22] is used to localize both the humans and objects. Human-object proposals are then generated by pairing every filtered human box with every filtered object box. The second stage is proposal classification, which predicts the interactions for each human-object proposal. The two-stage methods are limited in both effectiveness and efficiency mainly because their two stages are sequential and separated. The proposal generation stage is based entirely on object detection confidences. Each human/object proposal is generated independently, and the likelihood that two proposals combine into a meaningful HOI triplet in the second stage is not taken into account. Therefore, the generated human-object proposals may have relatively low quality. Moreover, in the second stage, all human-object proposals need to be linearly scanned, while only a few of them are valid, which incurs a large extra computational cost. We therefore argue that a non-sequential and highly coupled framework is needed.
We propose a parallel HOI detection framework and reformulate HOI detection as a point detection and matching problem. As shown in Figure 2, we represent a box as a center point and the corresponding size (width and height). Moreover, we define an interaction point as the midpoint of the human and object center points. To match each interaction point with the human point and the object point, we design two displacements from the interaction point to the corresponding human and object points. Based on this reformulation, we design a novel single-stage framework, Parallel Point Detection and Matching (PPDM), which breaks up the complex task of HOI detection into two simpler parallel tasks. PPDM is composed of two parallel branches. The first branch is point detection, which estimates the three center points (interaction, human and object points), the corresponding sizes (width and height) and two local offsets (for the human and object points). The interaction point can be considered as providing contextual information for both human and object detection. In other words, estimating the interaction point implicitly enhances the detection of humans and objects. The second branch is point matching. Two displacements from the interaction point to the human and object points are estimated. The human and object points originating from the same interaction point are considered matched.
In this parallel architecture, the point detection branch estimates the interaction points, which implicitly provide context and regularization for human and object detection. Isolated detection boxes that are unlikely to form meaningful HOI triplets are suppressed, while the more likely detection boxes are enhanced. This differs from the human-object proposal generation stage in two-stage methods, where all detected human/object boxes indiscriminately form human-object proposals to be fed into the second stage. Moreover, in the point matching branch, matching is only applied around a limited number of filtered candidate interaction points, which saves a lot of computational cost. In contrast, in the proposal classification stage of two-stage methods, all human-object proposals need to be classified. Experimental results on the public benchmark HICO-Det and our newly collected HOI-A dataset show that PPDM outperforms state-of-the-art methods in terms of accuracy and speed.
The existing datasets such as HICO-Det and V-COCO have greatly boosted the development of related research. These datasets are very general. However, in practical applications, a limited number of frequent HOI categories need special attention. To this end, we collect a new Human-Object Interaction for Applications dataset (HOI-A) with the following features: 1) specially selected HOI categories with wide application value, such as smoke and ride; 2) huge intra-class variations, including various illuminations and different human poses for each category. HOI-A is more application-driven, serving as a good supplement to the existing datasets.
Our contributions are summarized as follows: 1) We reformulate the HOI detection task as a point detection and matching problem and propose a novel one-stage PPDM solution. 2) PPDM is the first HOI detection method to achieve real-time speed, and it outperforms state-of-the-art methods on both the HICO-Det and HOI-A benchmarks. 3) A large-scale and application-oriented HOI detection dataset is collected to supplement existing datasets. Both the source code and the dataset are to be released to facilitate related research.
2 Related Work
HOI Detection Methods. The existing HOI detection methods can mostly be divided into two stages: in the first stage, an object detector is applied to localize the humans and objects; in the second stage, the detected humans and objects are paired, and their features are fed into a classification network to predict the interaction between each human and object. Current works pay more attention to improving the second stage. The most recent works aim to understand HOI by capturing context information [6, 25] or human structural information [24, 5, 4, 31]. Some works [21, 26, 31] formulate the second stage as a graph reasoning problem and use graph convolutional networks to predict the HOI.
The above methods are all proposal-based, so their performance is limited by the quality of the proposals. Additionally, the existing methods incur a large computational cost in proposal generation and feature extraction. To address these drawbacks, we propose a novel one-stage and proposal-free framework to detect HOI.
HOI Detection Datasets. There are mainly two commonly used HOI detection benchmarks, V-COCO and HICO-Det, and a human-centric relationship detection dataset, HCVRD. V-COCO is a relatively small dataset, which is a subset of MSCOCO including images annotated with actions based on the COCO annotations. HICO-Det is a large-scale and generic HOI detection dataset, including images, which has verbs and object categories (same as COCO). HCVRD is collected from the general visual relationship detection dataset Visual Genome. It has images, predicate categories and kinds of objects. Compared with the former two HOI detection datasets, which only focus on human actions, HCVRD is concerned with more general human-centric relationships, e.g., spatial relationships and possessive relationships.
The previous HOI detection datasets mostly concentrate on common and general actions. From a practical point of view, we build a new HOI-A dataset, which has about 38K images annotated only with a limited set of typical actions of practical significance.
3 Parallel Point Detection and Matching
HOI detection aims to estimate the HOI triplet ⟨human, interaction, object⟩, which is composed of the subject box and class, the human action class and the object box and class. We break up the complex task of HOI detection into two simpler parallel tasks that can be assembled to form the final results. The framework of the proposed Parallel Point Detection and Matching (PPDM) method is shown in Figure 3. The first branch of PPDM is point detection. It estimates the center points, corresponding sizes (width and height) and local offsets of both humans and objects. The center, size and offset collaboratively represent the box candidates. Moreover, the interaction point, which is defined as the midpoint of a corresponding ⟨human center point, object center point⟩ pair, is also estimated. The second branch of PPDM is point matching. The displacements between the interaction point and the corresponding human and object points are estimated. The human and object points linked by the same interaction point are considered matched.
3.2 Point Detection
The point detection branch estimates the human box, object box and interaction point. A human box is represented as its center point , the corresponding size (width and height) as well as the local point offset to recover the discretization error caused by the output stride. The object box is represented similarly. Moreover, we define the interaction point as the midpoint of the paired human point and object point. Considering that the receptive field of the interaction point is large enough to contain both the human and the object, the human action can be estimated based on the feature of . Actually, when there are humans in the dataset, each human box is represented as . For convenience of description, we omit the subscript when no confusion is caused. Similar omissions also apply to and .
In Figure 3, the input image is fed into the feature extractor to produce the feature , where and are the width and height of the input image and is the output stride. The point heatmaps are of low-resolution, thus we also calculate the low-resolution center points. Given a ground-truth human point , the corresponding low-resolution point is . The low-resolution ground-truth object point can be computed in the same way. Based on the low-resolution human and object points, the ground-truth interaction point can be defined as .
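The point definitions above can be illustrated with a minimal sketch. The function names, the floor rounding and the output stride of 4 are assumptions made for illustration, since the exact formulas did not survive in this text.

```python
import math

def to_low_res(point, stride):
    """Map a full-resolution (x, y) point to the output grid (floor rounding is assumed)."""
    x, y = point
    return (math.floor(x / stride), math.floor(y / stride))

def interaction_point(human_pt, object_pt):
    """Midpoint of the low-resolution human and object points, floored to a grid cell."""
    return (math.floor((human_pt[0] + object_pt[0]) / 2),
            math.floor((human_pt[1] + object_pt[1]) / 2))

stride = 4  # assumed output stride
human = to_low_res((203, 118), stride)   # (50, 29)
obj = to_low_res((411, 260), stride)     # (102, 65)
inter = interaction_point(human, obj)    # (76, 47)
```

Because the interaction point is derived from the two low-resolution centers, it needs no annotation of its own beyond the human-object pairing.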
Point location loss. Directly detecting a point is difficult, thus we follow the key-point estimation methods to splat a point into a heatmap with a Gaussian kernel. Point detection is thereby transformed into a heatmap estimation task. The three ground-truth low-resolution points , and are splatted into three Gaussian heatmaps, including the human point heatmap , the object point heatmap and the interaction point heatmap , where is the number of object categories and is the number of interaction classes. Note that in and , only the channel corresponding to the specific object class and human action is non-zero. The three heatmaps are produced by adding three respective convolutional blocks upon the feature map , each of which is composed of a convolutional layer with ReLU, followed by a convolutional layer and a Sigmoid.
For all three heatmaps, we apply an element-wise focal loss . For example, given an estimated interaction point heatmap and the corresponding ground-truth heatmap , the loss function is:
where is equal to the number of interaction points (HOI triplet) in the image, and is the score at location for class in the predicted heatmaps . We set as 2 and as 4 following the default setting in [14, 32]. The losses and for the human points and the object points can be computed similarly.
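As a rough sketch of the heatmap supervision described above, the snippet below splats one point with a Gaussian kernel and evaluates a penalty-reduced focal loss on a single channel. It assumes the loss takes the form used in [14, 32], with the exponents 2 and 4 quoted in the text; the function names, the fixed `sigma` and the toy heatmap size are illustrative.

```python
import math

def splat_gaussian(height, width, center, sigma):
    """Ground-truth heatmap for one point: 1 at the center, Gaussian decay around it."""
    cx, cy = center
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(width)] for y in range(height)]

def focal_loss(pred, gt, alpha=2, beta=4, eps=1e-12):
    """Element-wise focal loss on one channel, normalized by the number of points."""
    loss, num_pos = 0.0, 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if g == 1.0:  # positive location (a point center)
                loss -= (1 - p) ** alpha * math.log(p + eps)
                num_pos += 1
            else:  # negatives close to a center are down-weighted by (1 - g)^beta
                loss -= (1 - g) ** beta * p ** alpha * math.log(1 - p + eps)
    return loss / max(num_pos, 1)

gt = splat_gaussian(5, 5, center=(2, 2), sigma=1.0)
good = [[0.99 if g == 1.0 else 0.01 for g in row] for row in gt]  # confident, correct
bad = [[0.5 for _ in row] for row in gt]                          # uninformative
```

A prediction that is confident at the center and near zero elsewhere yields a much smaller loss than a flat prediction, which is the behaviour the heatmap branch is trained toward.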
Size and offset loss. Besides the center points, the box size and the local offset for the center points are needed to form the human/object box. Four convolutional blocks are added to the feature map to estimate the 2-D size and the local offset of human and object boxes respectively. Each block contains a convolutional layer with ReLU and a convolutional layer.
During training, we only compute the loss at each location of the ground truth human point and object point and ignore all other locations. We take the loss function for the local offset as an example, while the size regression loss is defined similarly. The ground truth local offset for the human point localized at is defined as . Thus the loss function is the summation of the human box loss and object box loss .
where and denote the ground-truth human and object points sets in the training set. and are the number of human points and object points. Note that is not necessarily equal to . For example a human may correspond to multiple actions and objects. is defined similarly with Equation 3.
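The offset regression can be sketched as follows. This assumes the ground-truth offset is the fractional part lost when a point is snapped to the output grid and that the penalty is an L1 distance, as in [32]; both are assumptions, and `pred_offsets` is a hypothetical dictionary standing in for the predicted offset map.

```python
import math

def local_offset(point, stride):
    """Fractional part lost when a full-resolution point is snapped to the output grid."""
    x, y = point
    return (x / stride - math.floor(x / stride), y / stride - math.floor(y / stride))

def l1_offset_loss(pred_offsets, gt_points, stride):
    """Mean L1 error, evaluated only at the ground-truth point locations.

    pred_offsets maps a low-resolution grid cell (x, y) to a predicted (dx, dy).
    """
    total = 0.0
    for point in gt_points:
        cell = (math.floor(point[0] / stride), math.floor(point[1] / stride))
        gt_dx, gt_dy = local_offset(point, stride)
        pred_dx, pred_dy = pred_offsets[cell]
        total += abs(pred_dx - gt_dx) + abs(pred_dy - gt_dy)
    return total / max(len(gt_points), 1)

gt_points = [(203, 118), (411, 260)]
pred = {(50, 29): (0.75, 0.5), (102, 65): (0.5, 0.0)}
loss = l1_offset_loss(pred, gt_points, stride=4)  # (0 + 0.25) / 2 = 0.125
```

In the setting described above, the same function would be applied once to the human points and once to the object points, each normalized by its own count, and the two results summed.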
3.3 Point Matching
The point matching branch pairs each human box with its corresponding object box by using the interaction point as a bridge. More specifically, the interaction point is treated as the anchor. Two displacements and , i.e., the displacements from the interaction point to the human point and to the object point, are estimated. The coarse human point and object point are obtained by adding the respective displacement to the interaction point.
Our proposed displacement branch is composed of two convolutional modules. Each module consists of a convolutional layer with ReLU and a convolutional layer. The size of both subject and object displacement maps are .
Displacement loss. To train the displacement branch, we apply loss for each interaction point. The ground-truth displacement from the interaction point located at to the corresponding human point can be computed by . The predicted displacement at location of is . The displacement loss is defined as:
where denotes the ground-truth interaction point sets in the training set. is the number of interaction points. The loss function for displacement from the interaction point to the object point has the same form.
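A minimal sketch of the displacement target and its loss is given below. The sign convention (target point minus interaction point) follows the statement that the coarse point is the interaction point plus the displacement; the L1 penalty and the function names are assumptions.

```python
def gt_displacement(interaction_pt, target_pt):
    """Ground-truth displacement from the interaction point to a human or object point."""
    return (target_pt[0] - interaction_pt[0], target_pt[1] - interaction_pt[1])

def displacement_loss(pred_disps, interaction_pts, target_pts):
    """Mean L1 error over the ground-truth interaction points (L1 is an assumption)."""
    total = 0.0
    for pred, inter, target in zip(pred_disps, interaction_pts, target_pts):
        gt = gt_displacement(inter, target)
        total += abs(pred[0] - gt[0]) + abs(pred[1] - gt[1])
    return total / max(len(interaction_pts), 1)

inter, human, obj = (76, 47), (50, 29), (102, 65)
d_human = gt_displacement(inter, human)  # (-26, -18)
d_obj = gt_displacement(inter, obj)      # (26, 18)
loss = displacement_loss([(-25.0, -18.5)], [inter], [human])  # 1.0 + 0.5 = 1.5
```

When the interaction point is the exact midpoint, the two ground-truth displacements are mirror images of each other, as in this example.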
Triplet matching. Two aspects are considered when judging whether a human/object point can be matched with an interaction point. The human/object point needs to: 1) be close to the coarse human/object point generated by the interaction point plus the displacement and 2) have a high confidence score. On this basis, for the detected interaction point , we rank the points in the detected human point set by Equation 5 and select the optimal one.
where denotes the confidence score for human point . The optimal object box can be calculated similarly.
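The two matching criteria can be combined in many ways; the sketch below uses distance divided by confidence as the ranking cost. That particular cost, the function name and the candidate format are assumptions for illustration and are not taken from Equation 5, which did not survive in this text.

```python
import math

def match_point(interaction_pt, displacement, candidates):
    """Pick the detected point that best fits one interaction point.

    candidates is a list of ((x, y), score) pairs. The cost, distance to the
    coarse point divided by confidence, is an assumed combination of the two criteria.
    """
    coarse = (interaction_pt[0] + displacement[0], interaction_pt[1] + displacement[1])

    def cost(candidate):
        (x, y), score = candidate
        return math.hypot(x - coarse[0], y - coarse[1]) / max(score, 1e-6)

    return min(candidates, key=cost)

humans = [((50, 29), 0.9), ((52, 30), 0.3), ((10, 80), 0.95)]
best_exact = match_point((76, 47), (-26, -18), humans)  # coarse point (50, 29)
best_near = match_point((76, 47), (-25, -17), humans)   # coarse point (51, 30)
```

In the second call the low-confidence candidate at (52, 30) is slightly closer to the coarse point, yet the confident candidate at (50, 29) still wins, which shows how the score tempers pure distance.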
3.4 Loss and Inference
The final loss can be obtained by weighted summing the above losses:
During inference, we first apply a max-pooling operation with stride 1 to the predicted human, object and interaction point heatmaps, which plays a role similar to NMS. Second, we select the top human points , object center points and interaction points according to the corresponding confidence scores , and across all categories. Then, we find the subject point and object point for each selected interaction point by Equation 5. For each matched human point , we get the final box as:
where and are the refined location of the human center point. is the size of box in the corresponding position. The final HOI detection results are a set of triplets, and the confidence score for the triplet is .
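The inference steps above can be sketched in plain Python. The 3x3 window, the function names and the toy heatmap are assumptions; the box is decoded in low-resolution coordinates, so a real pipeline would presumably scale it by the output stride afterwards.

```python
def find_peaks(heatmap, top_k, kernel=3):
    """Keep cells that equal the max of their kernel x kernel window, then take the top_k."""
    h, w = len(heatmap), len(heatmap[0])
    r = kernel // 2
    peaks = []
    for y in range(h):
        for x in range(w):
            window = [heatmap[j][i]
                      for j in range(max(0, y - r), min(h, y + r + 1))
                      for i in range(max(0, x - r), min(w, x + r + 1))]
            if heatmap[y][x] == max(window):
                peaks.append((heatmap[y][x], (x, y)))
    peaks.sort(reverse=True)
    return peaks[:top_k]

def decode_box(center, offset, size):
    """Box (x1, y1, x2, y2) from a grid center, its local offset and the predicted size."""
    cx, cy = center[0] + offset[0], center[1] + offset[1]
    return (cx - size[0] / 2, cy - size[1] / 2, cx + size[0] / 2, cy + size[1] / 2)

heat = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.0],
    [0.1, 0.3, 0.2, 0.6],
    [0.0, 0.0, 0.1, 0.2],
]
peaks = find_peaks(heat, top_k=2)                 # [(0.9, (1, 1)), (0.6, (3, 2))]
box = decode_box((50, 29), (0.75, 0.5), (20, 40))  # (40.75, 9.5, 60.75, 49.5)
```

Comparing each cell with the maximum of its neighbourhood is equivalent to comparing the heatmap with its stride-1 max-pooled copy, which is why the operation can stand in for NMS.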
4 HOI-A Dataset
The existing datasets such as HICO-Det and V-COCO have greatly boosted the development of related research. However, in practical applications, a limited number of frequent HOI categories need special attention, and these are not emphasized in previous datasets. We therefore introduce a new dataset called Human-Object Interaction for Application (HOI-A).
As shown in Table 1, we select the verb categories according to practical applications. Each verb in the HOI-A dataset has a corresponding application scenario. For example, ‘talk on’ can be applied to dangerous action detection: if a person is talking on a phone while in a car, it can be considered a dangerous driving action.
|Verb||Object||# Instances|
|talk on||mobile phone||18763|
|play (mobile phone)||mobile phone||6728|
|ride||bike, motorbike, horse||7111|
4.1 HOI-A Construction
We describe the image collection and annotation process for constructing the HOI-A dataset. The first step is collecting candidate images, which can be divided into two parts, namely positive and negative images collections.
Positive Images Collection. We collect positive images in two ways, i.e., camera shooting and crawling. Camera shooting is an important way to enlarge the intra-class variations of the data. We employ 50 performers and require them to perform all predefined actions in different scenes and illuminations and with various poses, and we photograph them with both an RGB camera and an IR camera. For data crawling from the Internet, we generate a series of keywords based on the HOI triplet ⟨person, action name, object name⟩, the action pair ⟨action name, object name⟩ and the action name, and retrieve images from the Internet.
Negative Images Collection. There are two kinds of negative samples for a predefined ⟨human, interaction, object⟩ triplet. 1) The concerned object appears in the image, but the concerned action does not happen. For example, in Figure 4(f), although a cigarette appears in the image, it is not being smoked by a human. Therefore, the image is still a negative sample. 2) An action similar to the concerned action happens, but the concerned object is missing. For example, in Figure 4(e), the man appears at first glance to be smoking, but a closer look shows there is no cigarette in the image. We collect this kind of negative sample in an ‘attack’ manner. We first train a multi-label action classifier on the annotated positive images. The classifier takes an image as input and outputs the probability of each action. Then, we let actors perform arbitrary actions without any interacted objects to attack the classifier. If the attack is successful, we record the image as a hard negative sample.
Annotation. The annotation process contains two steps: box annotation and interaction annotation. First, all objects in the pre-defined categories are annotated with a box and the corresponding category. Second, we visualize the boxes in the images with their IDs and annotate whether a person has one of the defined interactions with an object. The annotator records the triplet ⟨person ID, action ID, object ID⟩. For more accurate annotation, each image is annotated by 3 annotators. The annotation of an image is regarded as qualified if at least 2 annotators share the same annotation.
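The agreement rule can be sketched as below, under the assumption that "the same annotation" means an identical set of triplets for the whole image. The function name and the integer IDs are hypothetical.

```python
from collections import Counter

def consensus_annotation(annotations, min_agree=2):
    """Return the triplet set shared by at least min_agree annotators, else None.

    annotations holds one collection of (person_id, action_id, object_id)
    triplets per annotator.
    """
    counts = Counter(frozenset(a) for a in annotations)
    best, votes = counts.most_common(1)[0]
    return set(best) if votes >= min_agree else None

agree = consensus_annotation([
    [(1, 3, 2)],
    [(1, 3, 2)],
    [(1, 3, 2), (4, 3, 5)],
])
disagree = consensus_annotation([[(1, 3, 2)], [(1, 4, 2)], [(4, 3, 5)]])
```

Here `agree` is `{(1, 3, 2)}` because two annotators produced exactly that set, while `disagree` is `None` and the image would need to be re-annotated or discarded.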
4.2 Dataset Properties
Scale. Our HOI-A dataset consists of 38,668 annotated images, kinds of objects and action categories. In detail, it contains human instances, object instances and interaction instances. There are on average interactions performed per person. Table 1 lists the number of instances for each verb which occurs at least times. verbs appear more than times. To our knowledge, this is already the largest HOI dataset, in terms of the number of images per interaction category.
Intra-Class Variation. To enlarge the intra-class variation of the data, each verb in our HOI-A dataset is captured in three general scenes (indoor, outdoor and in-car), under three lighting conditions (dark, natural and intense), and with various human poses and different angles. Additionally, we shoot the images with two kinds of cameras: RGB and IR.
|Method||Feature||Full (mAP %)||Rare (mAP %)||Non-Rare (mAP %)||Inference Time (ms)||FPS|
|Shen et al.||A + P||6.46||4.24||7.12||-||-|
|HO-RCNN||A + S||7.81||5.37||8.54||-||-|
|GPNN||A||13.11||9.34||14.23||197 + 48 = 245||4.08|
|Xu et al.||A + L||14.70||13.26||15.13||-||-|
|iCAN||A + S||14.84||10.45||16.15||92 + 112 = 204||4.90|
|PMFNet-Base||A + S||14.92||11.42||15.96||-||-|
|Wang et al.||A||16.24||11.16||17.75||-||-|
|No-Frills||A + S + P||17.18||12.17||18.68||197 + 230 + 67 = 494||2.02|
|TIN||A + S + P||17.22||13.51||18.32||92 + 98 + 323 = 513||1.95|
|RPNN||A + P||17.35||12.78||18.71||-||-|
|PMFNet||A + S + P||17.46||15.65||18.00||92 + 98 + 63 = 253||3.95|
5.1 Experimental Setting
Datasets. To verify the effectiveness of our PPDM, we conduct experiments not only on our HOI-A dataset but also on the general HOI detection dataset HICO-Det . HICO-Det is a large-scale dataset for common HOI detection. It has images ( for training and for test), annotated with verbs including ‘no-interaction’ and object categories. The verbs and objects form kinds of HOI triplets, where types of HOIs which appear times are considered as the rare set, and the rest kinds of HOIs form the non-rare set.
Metric. Following the standard setting in the HOI detection task, we use mean average precision (mAP) as the metric. For a predicted triplet to be considered a true positive, it needs to match a ground-truth triplet. Specifically, they must have the same HOI class, and their human and object boxes must overlap with IoUs larger than . There is a slight difference when computing AP on the two datasets. We compute AP per HOI class on HICO-Det and AP per verb class on the HOI-A dataset.
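The true-positive test can be sketched as follows. The IoU threshold is missing from this text, so the default of 0.5 below is an assumption (it is the value commonly used for this kind of matching); the dictionary keys are also hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred, gt, thresh=0.5):  # thresh is an assumed value
    """Same HOI class, and both the human and the object box overlap enough."""
    return (pred["hoi"] == gt["hoi"]
            and iou(pred["human"], gt["human"]) > thresh
            and iou(pred["object"], gt["object"]) > thresh)

gt = {"hoi": "ride bike", "human": (0, 0, 10, 10), "object": (20, 0, 30, 10)}
hit = {"hoi": "ride bike", "human": (1, 0, 10, 10), "object": (20, 0, 30, 9)}
miss = {"hoi": "ride bike", "human": (5, 0, 15, 10), "object": (20, 0, 30, 10)}
```

The `miss` prediction has a perfect object box, but its human box only reaches an IoU of one third, so the whole triplet counts as a false positive.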
Implementation Details. We use two common heatmap prediction networks as our feature extractor, Hourglass-104 [20, 14] and DLA-34 [29, 32]. Hourglass-104 is a general heatmap prediction network commonly used in keypoint detection and object detection. In PPDM, we use the modified version of Hourglass-104 proposed in . DLA-34 is a lightweight backbone network, and we apply a refined version proposed in . The receptive field of the network needs to be large enough to cover both the subject and the object. Hourglass-104 has a sufficiently large receptive field, while that of DLA-34 cannot cover the region including the human and the object, due to its relatively shallow architecture. Thus, for the DLA-based model, we concatenate the features of the last three levels and apply a graph-based global reasoning module to enlarge the receptive field for the interaction point and displacement prediction. In the global reasoning module, we set the channels of the node and the reduced feature as and respectively. For Hourglass-104, we only use the last-level feature for all the following modules. We initialize the feature extractor with weights pre-trained on COCO. All our experiments are conducted on Titan Xp GPUs with CUDA 9.0.
During training and inference, the input resolution is and the output is . PPDM is trained with Adam on 8 GPUs. We set the hyper-parameter following , which is robust to our framework. We train the DLA-34 based model with a mini-batch size of 128 for 110 epochs, with a learning rate of 5e-4 decreased to 5e-5 at the 90th epoch. For the Hourglass-104 based model, we train with a batch size of 32 for 110 epochs, with a learning rate of 3.2e-4 decreased by a factor of 10 at the 90th epoch. Following [14, 32], we apply data augmentation, i.e., random scaling and random shifting, during training, and no augmentation during inference. We set the number of selected predictions to 100.
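The learning-rate schedule described above amounts to a single step decay. The sketch below assumes 1-based epoch numbering and that the drop takes effect from epoch 90 onward; the function name is illustrative.

```python
def learning_rate(epoch, base_lr=5e-4, drop_epoch=90, factor=0.1):
    """Step schedule: base_lr before drop_epoch, base_lr * factor from drop_epoch on."""
    return base_lr * factor if epoch >= drop_epoch else base_lr

dla_early = learning_rate(1)                       # 5e-4
dla_late = learning_rate(90)                       # 5e-5
hourglass_late = learning_rate(100, base_lr=3.2e-4)  # 3.2e-5
```

The same function covers both backbones, since only the base learning rate differs between the DLA-34 and Hourglass-104 settings.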
5.2 Comparison to State-of-the-art
We compare PPDM with state-of-the-art methods on two datasets. The quantitative results can be seen in Table 2 and Table 4, and the qualitative results are presented in Figure 5. Additionally, more results can be found in supplementary materials. The compared methods mainly use a pre-trained Faster R-CNN  to generate a set of human-object proposals, which are then fed into a pairwise classification network. As shown in Table 2, to more accurately classify the HOI, many methods use additional human pose feature or language feature.
5.2.1 Quantitative Analysis
HICO-Det. See Table 2. Our PPDM-DLA and PPDM-Hourglass both outperform all previous state-of-the-art methods. Specifically, our PPDM-Hourglass achieves a significant performance gain () compared with the previous best method PMFNet. We can see the previous methods with mAP greater than all use the human pose as an additional feature, while our PPDM only uses the appearance feature. The performance of PPDM is slightly lower than that of PMFNet on the rare subset. However, the baseline model in PMFNet without human pose information only achieves mAP on the rare set, so the gain of PMFNet on the rare set may mainly come from the additional human pose feature. Human structural information plays an important role in understanding human actions, thus we regard how to utilize human context in our framework as significant future work.
|Method||Top 5||Top 10||Top 100|
|Method||mAP (%)||Time (ms)|
|Faster Interaction Net ||56.03||-|
To verify the claim that our PPDM is more sensitive to interactive human-object pairs than the previous two-stage methods, we select two representative open-source two-stage methods, i.e., iCAN and TIN, for comparison on the HICO-Det dataset. For this purpose, we compare the mAP in the ‘top 5’, ‘top 10’ and ‘top 100’ settings, e.g., the ‘top 5’ setting means that we only keep the predictions with the top-5 confidence scores for each image when computing the mAP. As shown in Table 5, the performance of the two-stage methods drops by about when only the top-5 results are kept, while that of our PPDM only drops by .
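The ‘top k’ setting is a per-image filter applied before the mAP computation. A minimal sketch, with a hypothetical prediction format of dictionaries holding an `image` key and a `score` key:

```python
def keep_top_k(predictions, k):
    """Keep the k highest-scoring predictions of every image."""
    by_image = {}
    for pred in predictions:
        by_image.setdefault(pred["image"], []).append(pred)
    kept = []
    for preds in by_image.values():
        kept.extend(sorted(preds, key=lambda p: p["score"], reverse=True)[:k])
    return kept

preds = [
    {"image": "a", "score": 0.9}, {"image": "a", "score": 0.2},
    {"image": "a", "score": 0.6}, {"image": "b", "score": 0.4},
]
top2 = keep_top_k(preds, k=2)
```

A method whose highest-scoring outputs are mostly non-interactive pairs loses many true positives under a small k, which is what this comparison is designed to expose.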
HOI-A. The compared methods on the HOI-A dataset fall into two groups. First, we select the top-3 methods from the leaderboard of the ICCV 2019 PIC challenge HOI detection track. Our method still outperforms the top-1 method, URNet, which uses a very strong detector. Second, we choose two open-source state-of-the-art methods, iCAN and TIN, as baselines on our HOI-A dataset. We first pre-train Faster R-CNN with FPN and ResNet-50 on the HOI-A dataset, and then follow their original settings to train the HOI classifier. The results show that our PPDM outperforms the two methods by a significant margin. Additionally, for our selected interaction types with practical significance, PPDM achieves very high performance, which makes it practically applicable.
5.2.2 Qualitative Analysis
We visualize the HOI predictions with the top-3 confidence scores on the HICO-Det dataset based on PPDM-DLA, and compare our results with the typical two-stage method iCAN. As shown in Figure 5, we select some representative failure cases of the two-stage methods. We can see that iCAN tends to focus on humans/objects with a high detection score but without interaction. In Figure 5(b) and Figure 5(c), due to the huge imbalance between positive and negative samples, iCAN easily produces high confidence for the ‘no-interaction’ type. In Figure 5(d), the person sitting on the airplane is so small that it cannot be detected. However, our PPDM can accurately predict the HOI triplets with high confidence in these cases, because it does not depend on proposals and concentrates on understanding the HOI triplets.
|2||+ Feature Fusion||18.66||11.86||20.69||26|
|3||+ Global Reasoning||18.63||12.61||20.42||26|
5.2.3 Efficiency Analysis
We compare the inference speed on a single Titan Xp GPU with the methods which have released code or reported the speed. As shown in Table 2, PPDM with DLA and Hourglass are both faster than other methods by a large margin. PPDM-DLA is the only real-time method, which only takes ms for inference. Concretely, the inference time of two-stage HOI detection methods can be divided into proposal generation time and HOI classifier time. Besides, the pose based methods take extra time to estimate human key-points. It can be seen that the speed of PPDM-DLA is faster than any stage of the compared methods.
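The FPS column of Table 2 follows directly from the summed per-stage inference times. The short check below reproduces it from the millisecond values listed in the table; the helper name is illustrative.

```python
def fps(stage_times_ms):
    """Frames per second from the per-stage inference times in milliseconds."""
    return round(1000.0 / sum(stage_times_ms), 2)

timings = {
    "GPNN": [197, 48],
    "iCAN": [92, 112],
    "No-Frills": [197, 230, 67],
    "TIN": [92, 98, 323],
    "PMFNet": [92, 98, 63],
}
rates = {name: fps(times) for name, times in timings.items()}
```

The resulting rates match the table (4.08, 4.90, 2.02, 1.95 and 3.95 FPS), and they make clear that each two-stage method is bounded by the sum of its sequential stages.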
5.3 Component Analysis
We analyze the proposed components in PPDM from quantitative and qualitative views.
Feature Extractor. We analyze the effectiveness of the additional modules in the DLA backbone, i.e., feature fusion and global reasoning. The first row in Table 5 represents the basic framework with DLA, where we predict the interaction based only on the last-level feature. It shows that the basic model can still outperform all existing methods, which demonstrates the effectiveness of our designed framework. The second row shows the result of the basic model with feature fusion, with a performance boost of points. If we add a global reasoning module to the basic framework, it can be seen in the third row of Table 5 that the performance improves by mAP. We conclude that a larger receptive field and global context are helpful for interaction prediction.
Point Detection. To verify whether the midpoint of the two center points is the best choice for predicting the interaction, we perform an experiment in which the interaction point is placed at the center of the union of the human and object boxes, which is another plausible location for predicting the interaction. See the 4th row in Table 5. The mAP drops by 1 point compared with PPDM-DLA. It is common for two objects to interact with the same person and to lie inside the human box, in which case the centers of their union boxes coincide. Additionally, we analyze our interaction point qualitatively. As shown in Figure 6, the predicted interaction point lies almost exactly at the midpoint of the human and object points, whether the human is far from the object or inside the object.
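The ambiguity of the union-box center can be seen in a small worked example. The boxes below are made up for illustration: two objects lie inside the same human box, so both unions equal the human box and share one center, while the midpoints of the center pairs remain distinct.

```python
def center(box):
    """Center (x, y) of a box given as (x1, y1, x2, y2)."""
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)

def union(a, b):
    """Smallest box enclosing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def midpoint(p, q):
    """Midpoint of two (x, y) points."""
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)

human = (0, 0, 100, 200)
phone = (60, 20, 80, 40)    # inside the human box
cup = (10, 120, 30, 150)    # also inside the human box

union_centers = [center(union(human, phone)), center(union(human, cup))]
midpoints = [midpoint(center(human), center(phone)),
             midpoint(center(human), center(cup))]
```

Both union centers are (50.0, 100.0), so the two interactions would collide on the heatmap, whereas the midpoints (60.0, 65.0) and (35.0, 117.5) keep them apart.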
Point Matching. To further understand the displacements, we visualize them in Figure 6. We can see that the interaction point plus the corresponding displacement is very close to the center point of the human/object box, even when the human/object is hard to detect.
6 Conclusion and Future Work
In this paper, we propose a novel one-stage framework and a new dataset for HOI detection. Our proposed method outperforms the existing methods by a clear margin while running significantly faster. It breaks the limits of the traditional two-stage methods and directly predicts the HOI triplets with a parallel framework. Our proposed HOI-A dataset is oriented toward practical applications of HOI detection. For future work, we plan to explore how to utilize human context in our framework. Additionally, we plan to enrich the action categories of the HOI-A dataset.
-  http://www.picdataset.com/challenge/leaderboard/hoi2019.
-  Yu-Wei Chao, Yunfan Liu, Xieyang Liu, Huayi Zeng, and Jia Deng. Learning to detect human-object interactions. In WACV, 2018.
-  Yunpeng Chen, Marcus Rohrbach, Zhicheng Yan, Yan Shuicheng, Jiashi Feng, and Yannis Kalantidis. Graph-based global reasoning networks. In CVPR, 2019.
-  Hao-Shu Fang, Jinkun Cao, Yu-Wing Tai, and Cewu Lu. Pairwise body-part attention for recognizing human-object interactions. In ECCV, 2018.
-  Wei Feng, Wentao Liu, Tong Li, Jing Peng, Chen Qian, and Xiaolin Hu. Turbo learning framework for human-object interactions recognition and human pose estimation. 2019.
-  Chen Gao, Yuliang Zou, and Jia-Bin Huang. ican: Instance-centric attention network for human-object interaction detection. In BMVC, 2018.
-  Ross Girshick. Fast r-cnn. In ICCV, 2015.
-  Georgia Gkioxari, Ross Girshick, Piotr Dollár, and Kaiming He. Detecting and recognizing human-object interactions. In CVPR, 2018.
-  Abhinav Gupta, Aniruddha Kembhavi, and Larry S Davis. Observing human-object interactions: Using spatial and functional compatibility for recognition. TPAMI, 2009.
-  Saurabh Gupta and Jitendra Malik. Visual semantic role labeling. arXiv preprint arXiv:1505.04474, 2015.
-  Tanmay Gupta, Alexander Schwing, and Derek Hoiem. No-frills human-object interaction detection: Factorization, appearance and layout encodings, and training techniques. In ICCV, 2019.
-  Roei Herzig, Moshiko Raboh, Gal Chechik, Jonathan Berant, and Amir Globerson. Mapping images to scene graphs with permutation-invariant structured prediction. In NIPS, 2018.
-  Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
-  Hei Law and Jia Deng. Cornernet: Detecting objects as paired keypoints. In ECCV, 2018.
-  Yong-Lu Li, Siyuan Zhou, Xijie Huang, Liang Xu, Ze Ma, Hao-Shu Fang, Yan-Feng Wang, and Cewu Lu. Transferable interactiveness prior for human-object interaction detection. In CVPR, 2019.
-  Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, 2017.
-  Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
-  Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. Visual relationship detection with language priors. In ECCV, 2016.
-  Alejandro Newell and Jia Deng. Pixels to graphs by associative embedding. In NIPS, 2017.
-  Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
-  Siyuan Qi, Wenguan Wang, Baoxiong Jia, Jianbing Shen, and Song-Chun Zhu. Learning human-object interactions by graph parsing neural networks. In ECCV, 2018.
-  Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
-  Liyue Shen, Serena Yeung, Judy Hoffman, Greg Mori, and Li Fei-Fei. Scaling human-object interaction recognition through zero-shot learning. In WACV, 2018.
-  Bo Wan, Desen Zhou, Yongfei Liu, Rongjie Li, and Xuming He. Pose-aware multi-level feature network for human object interaction detection. In ICCV, 2019.
-  Tiancai Wang, Rao Muhammad Anwer, Muhammad Haris Khan, Fahad Shahbaz Khan, Yanwei Pang, Ling Shao, and Jorma Laaksonen. Deep contextual attention for human-object interaction detection. In ICCV, 2019.
-  Bingjie Xu, Yongkang Wong, Junnan Li, Qi Zhao, and Mohan S. Kankanhalli. Learning to detect human-object interactions with knowledge. In CVPR, 2019.
-  Danfei Xu, Yuke Zhu, Christopher B Choy, and Li Fei-Fei. Scene graph generation by iterative message passing. In CVPR, 2017.
-  Bangpeng Yao and Li Fei-Fei. Recognizing human-object interactions in still images by modeling the mutual context of objects and human poses. TPAMI, 2012.
-  Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. Deep layer aggregation. In CVPR, 2018.
-  Ji Zhang, Kevin J Shih, Ahmed Elgammal, Andrew Tao, and Bryan Catanzaro. Graphical contrastive losses for scene graph parsing. In CVPR, 2019.
-  Penghao Zhou and Mingmin Chi. Relation parsing neural network for human-object interaction detection. In ICCV, 2019.
-  Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
-  Bohan Zhuang, Qi Wu, Chunhua Shen, Ian Reid, and Anton van den Hengel. Care about you: towards large-scale human-centric visual relationship detection. arXiv preprint arXiv:1705.09892, 2017.