The task of Human-Object Interaction (HOI) detection aims to localize and classify triplets of ⟨human, verb, object⟩ from a still image. Beyond detecting and comprehending instances, e.g., object detection [15, 8], segmentation and human pose estimation, detecting HOIs requires a deeper understanding of visual semantics to depict complex relationships between human-object pairs. HOI detection is related to action recognition but presents different challenges, e.g., an individual can simultaneously engage in multiple interactions with surrounding objects. Besides, associating ever-changing roles with various objects leads to finer-grained and more diverse samples of interactions.
Most existing HOI detection approaches infer HOIs by directly employing the appearance features of a person and an object extracted from Convolutional Neural Networks (CNNs), which may neglect the detailed and high-level interactive semantics implied between the related targets. To remedy this limitation, a number of algorithms exploit additional contextual cues from the image, such as human intention and attention boxes. More recently, several works have taken advantage of additional annotations, e.g., human pose [20, 13] and body parts [23, 18]. Although incorporating contextual cues and annotations generally benefits feature expression, it brings several drawbacks.
First, stacks of convolutions and contextual cues are deficient in modelling HOIs, since recognizing HOIs requires reasoning beyond feature extraction. However, dominant methods are limited by treating each visual component separately without considering crucial semantic dependencies among related targets (i.e., scene, human and object). Second, employing additional human pose and body parts in HOI detection algorithms brings a large computational burden.
To address these issues, we propose a novel graph-based model called Interactive Graph (abbr. in-Graph) to infer HOIs by reasoning and integrating strong interactive semantics among scene, human and object. As illustrated in Figure 1, our model goes beyond current approaches, which lack the capability to reason about interactive semantics. In particular, the in-Graph model contains three core procedures, i.e., a project function, a message passing process and an update function. Here, the project function generates a unified space to make two related targets syncretic and interoperable. The message passing process further integrates semantic information by propagating messages among nodes. Finally, the update function transforms the reasoned nodes to convolution space, providing an enhanced representation for HOI-specific modeling.
Based on the proposed in-Graph model, we then offer a general framework referred to as in-GraphNet to implicitly parse scene-wide interactive semantics and instance-wide interactive semantics for inferring HOIs rather than treating each visual target separately. Concretely, the proposed in-GraphNet is a multi-stream network assembling two-level in-Graph models (i.e., scene-wide in-Graph and instance-wide in-Graph). The final HOI predictions are made by combining all exploited semantics. Moreover, our framework is free from additional annotations such as human pose.
We perform extensive experiments on two public benchmarks, i.e., the V-COCO and HICO-DET datasets. Our method yields a clear performance gain compared with the baseline and outperforms the state-of-the-art methods (both pose-free and pose-based) by a sizable margin. We also provide detailed ablation studies of our method to facilitate future research.
2 Related Work
2.1 Contextual Cues in HOI Detection
Early human activity recognition was confined to scenes containing a single human-centric action and ignored the spatial localization of the person and the related object. Therefore, Gupta et al. introduced visual semantic role labeling to learn interactions between human and object. HO-RCNN introduced a three-branch architecture with one branch each for a human candidate, an object candidate, and an interaction pattern encoding the spatial position of the human and object. Recently, several works have taken advantage of contextual cues and detailed annotations to improve HOI detection. Auxiliary boxes were employed to encode context regions from the human bounding boxes. InteractNet extended the object detector Faster R-CNN with an additional branch and estimated an action-specific density map to identify the locations of interacted objects. iHOI utilized human gaze to guide the attended contextual regions in a weakly-supervised setting for learning HOIs.
Very recently, human pose has been widely adopted as an additional cue to tackle HOI detection. Pair-wise human parts correlation was exploited to learn HOIs. TIN combined human pose and spatial configuration to encode pose configuration maps. PMFNet developed a multi-branch network to learn a pose-augmented relation representation to incorporate interaction context, object features and detailed human parts. RPNN introduced an object-bodypart graph and a human-bodypart graph to capture relationships between body parts and surrounding instances (i.e., human and object).
Although extracting contextual evidence benefits feature expression, it is not favored since additional annotations and computation are indispensable. For example, pose-based approaches are inseparable from pre-trained human pose estimators like Mask R-CNN and AlphaPose, which bring a large workload and computational burden.
2.2 Semantic Reasoning
The attention mechanism in action recognition helps to suppress irrelevant global information and highlight informative regions. Inspired by action recognition methods, iCAN exploited an instance-centric attention mechanism to enhance the information from regions and facilitate HOI classification. Furthermore, Contextual Attention proposed a deep contextual attention framework for HOI detection, in which context-aware appearance features for human and object were captured. PMFNet focused on pose-aware attention for HOI detection by employing human parts. Overall, methods employing attention mechanisms learn informative regions but treat each visual target (i.e., scene, human and object) separately, which is still insufficient to exploit interactive semantics for inferring HOIs.
Graph Parsing Neural Network (GPNN) introduced a learnable graph-based structure, in which HOIs were represented with a graph structure and parsed in an end-to-end manner. The above structure was a generalization of the Message Passing Neural Network, using a message function and a Gated Recurrent Unit (GRU) to iteratively update states. GPNN was innovative but showed some limitations. First, it reasoned interactive features at the coarse instance level (i.e., each instance was encoded as an indivisible node), which made it struggle with complex interactions. Second, it required iterative message passing and updating. Third, it excluded semantic information from the scene in inferring HOIs. Lately, RPNN introduced a complicated structure with two graph-based modules incorporated together to infer HOIs, but it required fine-grained human pose as prior information.
In this paper, we aim to develop a novel graph-based model to provide interactive semantic reasoning between visual targets. Instead of coarse and iterative message passing between instances, our model captures pixel-level interactive semantics between targets all at once. Furthermore, our model is free from costly annotations like human pose.
3 Proposed Method
3.1 Overview of in-Graph
The detailed design of the proposed in-Graph model is provided in Figure 2. Scene, human and object are three semantic elements considered as three visual targets in our model, referred to as , , and , respectively. The proposed in-Graph takes two targets at a time to conduct pixel-level interactive reasoning. Each of the targets takes a convolutional feature according to the corresponding boxes (i.e., the whole image, human candidate boxes and object candidate boxes) as input. Here H×W denotes the number of locations and D denotes the feature dimension.
We first propose a project function to map two feature tensors into a graph-based semantic space named interactive space, where a fully-connected graph structure can be built. Based on the graph structure, a message passing process is then adopted to model the interaction among all nodes by propagating and aggregating interactive semantics. Finally, the update function provides a reverse projection over the interactive space and outputs a feature Y, enabling us to utilize the reasoned semantics in convolution space. We then describe its architecture in detail and explain how we apply it to the HOI detection task.
3.2 Project Function
The project function aims to provide a pattern to fuse two targets together, after which the message passing process can be efficiently computed. The calculation process of can be divided into three parts: a feature conversion denoted as , a weights inference denoted as and a linear combination, where and are learnable parameters. Finally, the function outputs a matrix , where and are input feature tensors, denotes the number of nodes in interactive space and refers to dimension.
Feature conversion. Given feature tensors of two targets , we first employ 1×1 convolutions to reduce the dimensions of and to , so that the computation of the block is effectively reduced. The obtained tensors are then reshaped from to planar , obtaining , where the two-dimensional pixel locations of are converted to a one-dimensional vector. After that, a concatenation operation is adopted to integrate and along dimension , obtaining .
Weights inference. Here, we infer learnable projection weights so that semantic information from the original features can be aggregated in a weighted manner. Instead of designing complicated calculations, we simply use convolution layers to generate the dynamic weights. In this step, and are fed into 1×1 convolutions to obtain weight tensors with N channels. The obtained tensors are then reshaped as planar . Finally, the integrated projection weights are obtained by a concatenation operation.
Linear combination. Since the project function involves two targets, linear combination is a necessary step to aggregate the semantic information and transform the targets to the unified interactive space. In particular, node in interactive space is calculated as follows. Here , .
The proposed project function is simple and fast since all parameters are end-to-end learnable and come from 1×1 convolutions. Such a function achieves semantic fusion between two targets and maps them into an interactive space effectively.
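As a rough illustration of the three steps above, the following plain-Python sketch runs a project function on toy inputs. It assumes that the linear combination takes the form V = BᵀX, so that each node is a weighted sum over all 2HW concatenated feature vectors, which is consistent with an N × D′ output. The helper names (`conv1x1`, `project`, `rand_matrix`), the toy sizes and the random weights are hypothetical, and a 1×1 convolution is written as a per-location linear map on a flattened feature map.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def conv1x1(feat, weight):
    """1x1 convolution on a flattened (HW x C_in) map: a per-location linear map."""
    return matmul(feat, weight)  # (HW x C_in) @ (C_in x C_out)

def project(x_a, x_b, w_red_a, w_red_b, w_proj_a, w_proj_b):
    """Map two flattened targets (HW x D each) to N nodes of dimension D'."""
    # Feature conversion: reduce D -> D' and concatenate along locations (2HW x D').
    x = conv1x1(x_a, w_red_a) + conv1x1(x_b, w_red_b)
    # Weights inference: N projection weights per location (2HW x N).
    b = conv1x1(x_a, w_proj_a) + conv1x1(x_b, w_proj_b)
    # Linear combination (assumed form): V = B^T X, shape N x D'.
    return matmul(transpose(b), x), b

def rand_matrix(rows, cols, rng):
    """Random matrix standing in for learned weights or input features."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
HW, D, D_RED, N = 6, 8, 4, 3  # toy sizes, not the paper's settings
x_a, x_b = rand_matrix(HW, D, rng), rand_matrix(HW, D, rng)
nodes, weights = project(
    x_a, x_b,
    rand_matrix(D, D_RED, rng), rand_matrix(D, D_RED, rng),
    rand_matrix(D, N, rng), rand_matrix(D, N, rng),
)
print(len(nodes), len(nodes[0]))      # rows = N, columns = D'
print(len(weights), len(weights[0]))  # rows = 2HW, columns = N
```

The projection weights are returned alongside the nodes because the reverse projection back to convolution space reuses them.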
3.3 Message Passing and Update Function
3.3.1 Message Passing
After projecting targets from convolution space to interactive space, we have a structured representation of a fully-connected graph , where each node contains a feature tensor as its state and all nodes are considered as fully-connected with each other. Based on the graph structure, message passing process is adopted to broadcast and integrate semantic information from all nodes over the graph.
GPNN applies an iterative process with a GRU to enable nodes to communicate with each other, but it needs to run for several iterations to converge. We reason interactive semantics over the graph structure by adopting a single-layer convolution to efficiently build communication among nodes. In our model, the message passing function is computed by:
Here denotes the adjacency matrix among nodes learned by gradient descent during training, reflecting the weights for edge . denotes the state update of nodes. In our implementation, the operation of is a channel-wise 1D convolution layer that performs Laplacian smoothing and propagates semantic information among all connected nodes. After information diffusion, the implements point-wise addition, which updates the hidden node states according to the incoming messages.
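The single-pass message passing described above can be sketched in plain Python. Since the equation itself is not reproduced here, the sketch assumes one plausible reading of the description: messages are diffused through a learned adjacency matrix A, added point-wise to the current node states V, and then passed through a state-update matrix W, i.e. (A·V + V)·W. The function name and the toy values are hypothetical; in the model both matrices are learned.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def message_passing(nodes, adjacency, state_update):
    """One non-iterative round of message passing over a fully-connected graph.

    nodes:        N x D' node states
    adjacency:    N x N edge weights (learned during training in the model)
    state_update: D' x D' state-update weights (learned in the model)
    """
    # Information diffusion: every node gathers weighted states from all nodes.
    messages = matmul(adjacency, nodes)
    # Point-wise addition of incoming messages to the current node states.
    merged = [[m + v for m, v in zip(m_row, v_row)]
              for m_row, v_row in zip(messages, nodes)]
    # State update applied to every node.
    return matmul(merged, state_update)

# Toy graph with N = 3 nodes of dimension D' = 2.
nodes = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
adjacency = [[0.0, 0.5, 0.5],
             [0.5, 0.0, 0.5],
             [0.5, 0.5, 0.0]]
state_update = [[1.0, 0.0], [0.0, 1.0]]  # identity, so only diffusion and addition act

updated = message_passing(nodes, adjacency, state_update)
print(updated)
```

Unlike a GRU-based scheme, nothing is iterated: one matrix product per step yields the reasoned node states.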
3.3.2 Update Function
To apply the above reasoning results in a convolutional network, an update function provides a reverse projection for the reasoned nodes from interactive space to convolution space, which is output as a new feature tensor. Given the reasoned nodes , the update function first adopts a linear combination as follows:
where the projection weights are transposed and reused; here , .
After the linear combination, we reshape the obtained tensor from planar to three-dimensional . Finally, a convolution is attached to expand the feature dimensions from to to match the inputs. In this way, the updated features in convolution space can be used in the subsequent stages of the network.
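A minimal sketch of the update function follows, under stated assumptions: the reverse projection is taken to be the product of the (2HW × N) projection weights with the (N × D′) reasoned nodes, and the dimension-expanding convolution is written as a per-location linear map. The reshape to a three-dimensional tensor is only a change of indexing and is left out. The function name, shapes and toy values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def update(reasoned_nodes, proj_weights, w_expand):
    """Map reasoned nodes from interactive space back to convolution space.

    reasoned_nodes: N x D'   node states after message passing
    proj_weights:   2HW x N  projection weights reused from the project function
    w_expand:       D' x D   weights of the dimension-expanding convolution
    """
    # Reverse projection (assumed form): each location gathers the nodes it contributed to.
    planar = matmul(proj_weights, reasoned_nodes)  # 2HW x D'
    # Expand D' -> D so the output matches the input feature dimension.
    return matmul(planar, w_expand)                # 2HW x D

# Toy sizes: 2HW = 4 locations, N = 2 nodes, D' = 2, D = 3.
proj_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]
reasoned_nodes = [[2.0, 4.0], [6.0, 8.0]]
w_expand = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]

features = update(reasoned_nodes, proj_weights, w_expand)
print(features)
```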
3.4.1 Assembling in-Graph Model
Our in-Graph model improves the ability to model HOIs by employing interactive semantic reasoning beyond stacks of convolutions. It is noted that the human visual system is able to progressively capture interactive semantics from the scene and related instances to recognize an HOI. Taking the HOI triplet ⟨human, surf, surfboard⟩ as an example, the scene-wide interactive semantics connecting the scene (e.g., sea) and instances (e.g., human, surfboard) can be captured as prior knowledge, and instance-wide interactive semantics between the person and the surfboard are learned to further recognize the verb (i.e., surf) and rule out other candidates (e.g., carry). Inspired by this human perception, we assign in-Graphs at two levels to build in-GraphNet: the scene-wide level and the instance-wide level. The scene-wide in-Graphs contain a human-scene in-Graph and an object-scene in-Graph, while the instance-wide in-Graph refers to the human-object in-Graph.
An overview of the proposed in-GraphNet is shown in Figure 3. Since in-Graph models are light-weight, they can be easily incorporated into existing CNN architectures. ResNet-50 is employed as the backbone network in our implementation. We denote the candidate bounding boxes of the human and object by and , respectively. We also express the box of the whole scene as . After the shared ResNet-50 C1-C4 layers and RoI pooling according to the respective candidate boxes, ResNet-50 C5 generates input features for targets , , and , termed , and . These three targets are then assigned to three in-Graph models in pairs. While input features for the in-Graph model have dimension , we set reduced dimension , number of nodes .
3.4.2 Three Branches
The in-GraphNet contains a human-centric branch, an object-centric branch and a spatial configuration branch, where , and the spatial configuration based on the two-channel interaction pattern are adopted as basic feature representations, respectively. In scene-wide interactive reasoning, semantic features obtained from the human-scene in-Graph and the object-scene in-Graph are concatenated to and respectively to enrich the representations in the human-centric and object-centric branches. In instance-wide interactive reasoning, the semantic feature output from the human-object in-Graph is concatenated into the object-centric branch only, because the appearance of an object is usually constant across different interactions and has little effect on the human-centric representation. Finally, the enhanced features from the three branches are fed into fully connected layers for classification. In this way, the entire framework is fully differentiable and end-to-end trainable using gradient-based optimization. With the formulation above, rich interactive relations between visual targets can be explicitly utilized to infer HOIs.
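Table 1: Results on V-COCO.
|Method|Backbone|mAP (role)|
|---|---|---|
|Gupta et al.|ResNet-50-FPN|31.8|
|GPNN|Deformable CNN|44.0|
|Contextual Att|ResNet-50|47.3|

Table 2: Results on HICO-DET.
|Method|Backbone|Full|Rare|Non-Rare|
|---|---|---|---|---|
|Shen et al.|VGG-19|6.46|4.24|7.12|
|GPNN|Deformable CNN|13.11|9.34|14.23|
|Contextual Att|ResNet-50|16.24|11.16|17.75|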
Since HOI detection is a multi-label classification problem where more than one HOI label might be assigned to a ⟨human, object⟩ candidate, our model is trained in a supervised fashion using the multi-label binary cross-entropy loss. All three branches are trained jointly, where the overall loss for each interaction category is the sum of the three losses from the three branches.
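The joint objective can be sketched as follows. This is a minimal illustration rather than the training code: it uses the standard numerically stable form of binary cross-entropy on logits, and the averaging over categories, the function names and the toy inputs are assumptions made for the example.

```python
import math

def bce_with_logits(logits, labels):
    """Multi-label binary cross-entropy on raw logits, averaged over categories.

    Uses the stable form max(z, 0) - z*y + log(1 + exp(-|z|)), which equals
    -[y*log(sigmoid(z)) + (1 - y)*log(1 - sigmoid(z))].
    """
    total = 0.0
    for z, y in zip(logits, labels):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(labels)

def joint_loss(human_logits, object_logits, spatial_logits, labels):
    """Sum of the per-branch losses; all branches share the same HOI labels."""
    return sum(bce_with_logits(branch, labels)
               for branch in (human_logits, object_logits, spatial_logits))

# Toy candidate with 3 interaction categories, two of which are positive.
labels = [1, 0, 1]
confident = joint_loss([4.0, -4.0, 4.0], [3.0, -3.0, 3.0], [2.0, -2.0, 2.0], labels)
uninformed = joint_loss([0.0] * 3, [0.0] * 3, [0.0] * 3, labels)
print(confident, uninformed)
```

Because several labels can be positive for the same candidate, each category is scored by its own sigmoid rather than by a softmax over categories.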
For each image, pairwise candidate boxes () for each HOI category are assigned binary labels based on the prediction. Similar to general HOI detection frameworks [6, 4], we use a fusion of the scores output from each branch to predict a final score for each HOI.
Here , and denote the scores output from the binary sigmoid classifiers in the human-centric branch, object-centric branch and spatial configuration branch, respectively. and are the detection scores output from the object detector for the candidate boxes. is the final score for each HOI.
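The fusion formula is not reproduced in the text above, so the sketch below assumes a late-fusion form that is common in multi-branch HOI frameworks: the two detection scores multiply the sum of the human-centric and object-centric action scores, which is then gated by the spatial score. The function and argument names are hypothetical, and the exact combination used by in-GraphNet may differ.

```python
def fuse_scores(det_human, det_object, act_human, act_object, act_spatial):
    """Assumed late fusion: S = s_h * s_o * (s_h^a + s_o^a) * s_sp^a.

    det_human, det_object: detector confidences for the two candidate boxes
    act_human, act_object, act_spatial: per-branch sigmoid scores for one HOI category
    """
    return det_human * det_object * (act_human + act_object) * act_spatial

# One candidate pair scored for a single HOI category.
score = fuse_scores(det_human=0.9, det_object=0.8,
                    act_human=0.6, act_object=0.4, act_spatial=0.5)
print(round(score, 4))
```

Under this form a low detection confidence or a low spatial score suppresses the final HOI score regardless of the appearance branches.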
4 Experiments and Evaluations
4.1 Experimental Setup
Datasets and evaluation metrics. We evaluate our model and compare it with the state of the art on two large-scale benchmarks, the V-COCO and HICO-DET datasets. V-COCO, a subset of the MS COCO dataset, includes 10,346 images. It contains 16,199 human instances in total and provides 26 common HOI annotations. HICO-DET contains about 48k images and 600 HOI categories over 80 object categories, and provides more than 150K annotated ⟨human, object⟩ pairs. We use role mean average precision (role mAP) on both benchmarks.
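For readers unfamiliar with the metric, the sketch below shows the matching test that usually underlies role mAP. The criterion is stated here as an assumption because the text does not spell it out: a predicted triplet counts as a true positive when the interaction label is correct and both the human box and the object box overlap their ground-truth boxes with an IoU of at least 0.5. Boxes are (x1, y1, x2, y2) tuples, the dictionary keys are hypothetical, and the handling of duplicate detections is omitted.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred, gt, iou_threshold=0.5):
    """Check one predicted triplet against one ground-truth triplet."""
    return (pred["verb"] == gt["verb"]
            and iou(pred["human"], gt["human"]) >= iou_threshold
            and iou(pred["object"], gt["object"]) >= iou_threshold)

gt = {"verb": "surf", "human": (10, 10, 110, 210), "object": (20, 180, 140, 230)}
good = {"verb": "surf", "human": (12, 8, 108, 205), "object": (25, 182, 138, 232)}
wrong_verb = dict(good, verb="carry")
print(is_true_positive(good, gt), is_true_positive(wrong_verb, gt))
```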
Human boxes with scores higher than 0.8 and object boxes with scores higher than 0.4 are kept for detecting HOIs. We train our model with Stochastic Gradient Descent (SGD), using a learning rate of 1e-4, a weight decay of 1e-4, and a momentum of 0.9. The interactiveness knowledge training strategy is adopted in our training, and the model is trained for 300K and 1800K iterations on V-COCO and HICO-DET, respectively. All our experiments are conducted with TensorFlow on a GeForce GTX TITAN X GPU.
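The candidate filtering step can be sketched as below. Only the two thresholds come from the text; the detection format, the `person` label name and the choice to treat every non-person detection as an object candidate are simplifying assumptions for illustration.

```python
HUMAN_SCORE_MIN = 0.8   # threshold stated in the text
OBJECT_SCORE_MIN = 0.4  # threshold stated in the text

def candidate_pairs(detections):
    """Pair every kept human box with every kept object box.

    detections: list of dicts with hypothetical keys "label", "score" and "box".
    """
    humans = [d for d in detections
              if d["label"] == "person" and d["score"] > HUMAN_SCORE_MIN]
    objects = [d for d in detections
               if d["label"] != "person" and d["score"] > OBJECT_SCORE_MIN]
    return [(h, o) for h in humans for o in objects]

detections = [
    {"label": "person", "score": 0.95, "box": (10, 10, 100, 200)},
    {"label": "person", "score": 0.60, "box": (150, 20, 220, 210)},   # dropped: below 0.8
    {"label": "surfboard", "score": 0.55, "box": (20, 180, 140, 230)},
    {"label": "bottle", "score": 0.30, "box": (200, 50, 215, 90)},    # dropped: below 0.4
]
pairs = candidate_pairs(detections)
print(len(pairs))
```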
4.2 Overall Performance
We compare our method with several state-of-the-art methods in this subsection. The compared methods fall into two categories, i.e., methods that are free from human pose (top of the tables) and methods relying on additional pose estimators (middle of the tables). Meanwhile, we strip all modules related to in-Graphs from our proposed framework to obtain the baseline. Comparison results on V-COCO and HICO-DET in terms of role mAP are shown in Table 1 and Table 2, respectively.
First, our in-GraphNet obtains 48.9 mAP on V-COCO and 17.72 mAP on HICO-DET (Default Full mode), achieving absolute gains of 4.1 points and 2.3 points over the baseline, which are relative improvements of 9.4% and 15%. Besides, although pose-based methods usually perform better than pose-free methods, our pose-free method outperforms all the others, validating its efficacy for the HOI detection task.
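The relation between the absolute and relative gains can be checked with a few lines of arithmetic. The baseline scores are not stated in this paragraph, so the sketch derives them from the reported scores and gains, which is an assumption; with these rounded inputs the results agree with the quoted percentages only approximately.

```python
def implied_baseline(score, absolute_gain):
    """Baseline score implied by a reported score and its absolute gain."""
    return score - absolute_gain

def relative_gain_percent(score, absolute_gain):
    """Relative improvement over the implied baseline, in percent."""
    return 100.0 * absolute_gain / implied_baseline(score, absolute_gain)

# Reported (score, absolute gain) pairs from the paragraph above.
for name, score, gain in [("V-COCO", 48.9, 4.1), ("HICO-DET", 17.72, 2.3)]:
    base = implied_baseline(score, gain)
    rel = relative_gain_percent(score, gain)
    print(f"{name}: baseline {base:.2f}, relative gain {rel:.1f}%")
```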
4.3 Ablation Studies
|number of nodes (N)||128||256||512||1024|
We conduct several ablation studies in this subsection. V-COCO serves as the primary testbed on which we further analyze the individual effect of the components in our method.
Comprehending in-Graph. We show two examples in Figure 4 to visualize the effects of the three adopted in-Graphs. The brightness of a pixel indicates how strongly the feature is attended to. Intuitively, the three in-Graphs learn different interactive semantics from pairwise reasoning. The human-scene in-Graph and object-scene in-Graph exploit interactive semantics between the scene and instances. The human-object in-Graph, on the other hand, mostly focuses on the regions that roughly correspond to the ongoing action between human and object. In addition, to further dissect the in-Graph model, empirical tests have been conducted to study the effect of the number of nodes in interactive space. As summarized in Table 3, we get the best result when N is set to 512.
Effects of adopting different in-Graph models. As shown in Table 4, when directly concatenating in the human-centric branch and in the object-centric branch, we can see that simply concatenating different targets without interactive reasoning barely improves the result. When we only adopt the scene-wide in-Graphs or the instance-wide in-Graph, the mAPs are 48.3 and 47.7 respectively, indicating the respective effects of these two parts. Specifically, we can conclude from the detailed class-wise results that scene-wide in-Graphs are more adept at modelling interactions closely related to the environment, while the instance-wide in-Graph performs better in depicting interactions closely related to human pose.
4.4 Qualitative Examples
For visualization, several detection examples are given in Figure 5. We first compare our results with the baseline to demonstrate our improvements. We can see from the first two rows that our method is capable of detecting various HOIs with higher scores. In addition, the third row shows that our method can adapt to complex environments to detect multiple people engaged in different interactions with diverse objects.
In this paper, we propose a graph-based model to address the lack of interactive reasoning in existing HOI detection methods. Beyond convolutions, our proposed in-Graph model efficiently reasons interactive semantics among visual targets by three procedures, i.e., a project function, a message passing process and an update function. We further construct an in-GraphNet assembling two-level in-Graph models in a multi-stream framework to parse scene-wide interactive semantics and instance-wide interactive semantics for inferring HOIs. The in-GraphNet is free from costly human pose and is end-to-end trainable. Extensive experiments have been conducted to evaluate our method on two public benchmarks, V-COCO and HICO-DET. Our method outperforms both existing human pose-free and human pose-based methods, validating its efficacy in detecting HOIs.
This paper was partially supported by National Engineering Laboratory for Video Technology - Shenzhen Division, and Shenzhen Municipal Development and Reform Commission (Disciplinary Development Program for Data Science and Intelligent Computing). Special acknowledgements are given to AOTO-PKUSZ Joint Lab for its support.
- Learning to detect human-object interactions. Workshop on Applications of Computer Vision, pp. 381–389.
- (2018) Pairwise body-part attention for recognizing human-object interactions. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 51–67.
- RMPE: regional multi-person pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2334–2343.
- (2018) iCAN: instance-centric attention network for human-object interaction detection. British Machine Vision Conference, pp. 41.
- Neural message passing for quantum chemistry. Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1263–1272.
- Detecting and recognizing human-object interactions. Computer Vision and Pattern Recognition, pp. 8359–8367.
- (2015) Contextual action recognition with R*CNN. International Conference on Computer Vision, pp. 1080–1088.
- (2017) Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969.
- (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- (2017) The THUMOS challenge on action recognition for videos "in the wild". Computer Vision and Image Understanding 155, pp. 1–23.
- (2018) Detecting visual relationships using box attention. arXiv: Computer Vision and Pattern Recognition.
- Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
- (2019) Transferable interactiveness knowledge for human-object interaction detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3585–3594.
- (2018) Learning human-object interactions by graph parsing neural networks. European Conference on Computer Vision, pp. 407–423.
- (2017) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6), pp. 1137–1149.
- (2015) Action recognition using visual attention. International Conference on Learning Representations.
- (2018) Scaling human-object interaction recognition through zero-shot learning. Workshop on Applications of Computer Vision, pp. 1568–1576.
- (2019) Pose-aware multi-level feature network for human object interaction detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9469–9478.
- (2019) Deep contextual attention for human-object interaction detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5694–5702.
- (2019) Interact as you intend: intention-driven human-object interaction detection. IEEE Transactions on Multimedia.
- (2018) DenseASPP for semantic segmentation in street scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3684–3692.
- (2016) Situation recognition: visual semantic role labeling for image understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5534–5542.
- (2019) Relation parsing neural network for human-object interaction detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 843–851.