Code for paper: Visual-Semantic Graph Attention Networks for Human-Object Interaction Detection. Project page: http://www.juanrojas.net/vsgat/
In scene understanding, machines benefit from not only detecting individual scene instances but also from learning their possible interactions. Human-Object Interaction (HOI) Detection tries to infer the predicate on a <subject, predicate, object> triplet. Contextual information has been found critical in inferring interactions. However, most works use features from single object instances that have a direct relation with the subject. Few works have studied the disambiguating contribution of subsidiary relations in addition to how attention might leverage them for inference. We contribute a dual-graph attention network that aggregates contextual visual, spatial, and semantic information dynamically for primary subject-object relations as well as subsidiary relations. Graph attention networks dynamically leverage node neighborhood information. Our network uses attention to first leverage visual-spatial and semantic cues from primary and subsidiary relations independently and then combines them before a final readout step. Our network learns to use primary and subsidiary relations to improve inference: encouraging the right interpretations and discouraging incorrect ones. We call our model: Visual-Semantic Graph Attention Networks (VS-GATs). We surpass state-of-the-art HOI detection mAPs on the challenging HICO-DET dataset, including in long-tail cases that are harder to interpret. Code, video, and supplementary information are available at http://www.juanrojas.net/VSGAT.
Human-Object Interaction (HOI) detection has recently gained important traction. The goal is to infer an interaction predicate for the <subject, predicate, object> triplet. Related visual tasks have advanced considerably, including human pose estimation [Dabral2018LearningMotion, pavllo20193d], scene segmentation [he2017mask], and action recognition [Yan2018SpatialRecognition, Li2019Actional-StructuralRecognition]; the harder problem of HOI has progressed less, since it requires a better understanding of contextual information for inference. HOI starts with instance detection (localizing and identifying the subject and objects) and continues with interaction inference, as illustrated in Fig. 1a.
HOI research typically infers interactions based on the local features of a subject-object pair [liu2016ssd, chao2018learning, gkioxari2018detecting, gao2018ican]. Over time, additional contextual cues have been leveraged to improve inference [Li_2019_CVPR, xu2019learning, 2019PeyreDetecting, bansal2019detecting, wan2019pose, gupta2018nofrills]. However, to date, there are still many interactions that confuse our systems. Consider Fig. 1(b) for example. On the left, a girl holds and licks a knife. Primary visual and spatial information can infer so. But what is often not considered is that the subsidiary relations
In this paper, we study the disambiguating power of subsidiary scene relations via a dual Graph Attention Network that aggregates visual-spatial and semantic information in parallel. The network uses attention to leverage primary and subsidiary contextual cues to gain additional disambiguating power. Our work is the first to use dual attention graphs. We call our system: Visual-Semantic Graph Attention Networks (VS-GATs).
We begin the HOI task by using instance detection (Sec. 3.2). Then a pair of Graph Attention Networks is created. The first graph's nodes are instantiated from bounding-box visual features, while the edges are instantiated from the corresponding spatial features (Sec. 3.3.2). The second graph's nodes are instantiated from word embedding features associated with the corresponding visual nodes (see Sec. 3.3.3). An attention mechanism then updates the node features of each graph and learns from primary and subsidiary contextual relations. A combined graph is created by concatenating both graphs' node features. Then inference is done through a readout step on box-paired subject-object nodes. Fig. 2 offers an overview of our system. Notice the existing edges between subsidiary instances and the human as well as in-between objects. These relations, along with the attention mechanism, aid in discerning scenes better. The inference module is trained and tested on the challenging HICO-DET dataset [chao2018learning] and surpasses the state-of-the-art (SOTA) for the Full and Rare categories with mAPs of 19.66 and 15.79 respectively. The results show that the additional attention mechanism from subsidiary contextual cues improves inference even for samples with few training examples. We expect improved attention mechanisms to learn better contextual cues and continue to improve interaction inference.
Learning to better understand context is critical to scene understanding. Recently, researchers have exploited visual, spatial, semantic, interactiveness, human pose, and functional approximation as contextual cues. Simultaneously a host of architectures including deep nets, graph nets, and graph attention nets have also been used. In this section we present the works by keying in on the architecture type and then describing the various contexts used therein.
A primary way to do HOI has been to extract visual features from instance detectors along with spatial information to instantiate multi-stream DNNs. Each stream may contain information about the detected human(s), objects, and perhaps some representation of the interaction. A final fusion step is undertaken in which the individual inference scores are multiplied to yield a final one [chao2018learning, gkioxari2018detecting, gao2018ican]. Lu et al. [lu2016visual] considered semantic information under the multi-stream DNN setting, stating that interaction relationships are also semantically related to each other. Gupta et al. [gupta2018nofrills] and Wan et al. [wan2019pose] emphasized a fine-grained layout of the human pose and leveraged relation elimination or interactiveness modules to improve inference. Li et al. [Li2019Actional-StructuralRecognition] include an interactiveness network that, like Gupta et al., eliminates non-interactive edges. Visual, spatial, and pose features are concatenated and input into the interactiveness discriminator, which finally outputs a detection classification. Peyre et al. [2019PeyreDetecting] and Bansal et al. [bansal2019detecting] go a step further and consider semantic functional generalization. Bansal et al. consider how humans interact with functionally similar objects in similar manners. They leverage word2vec embeddings [mikolov2013distributed] to train a fixed human-action with other similar object embeddings. Peyre et al. [2019PeyreDetecting] use the similar concept of visual analogies. They instantiate another stream using a visual-semantic embedding of the triplet, resulting in a trigram. Analogies, like functional approximation, rely on similarity of function but in this case at a visual level.
Graph Neural Networks (GNNs) [Wu2019a] were first conceptualized as recurrent graph neural networks (RecGNNs) [gori2005new, scarselli2008computational]. RecGNNs learned a target node through neighbor information aggregation until convergence. Afterwards, Convolutional Graph Neural Networks (ConvGNNs) were devised under two main streams: spectral-based [bruna2013spectral] and spatial-based approaches [niepert2016learning]. These graphs generalize the convolution operation from a grid to a graph. A node’s representation is created by aggregating self and neighboring features. ConvGNNs stack multiple graph layers to generate high-level node representations. Spatio-temporal Graph Neural Networks simultaneously consider both spatial and temporal dependencies and often leverage ConvGNNs for spatial encoding and RecGNNs for temporal encoding.
GNNs have been used to model scene relations and knowledge structures. Kato et al. [kato2018compositional] use an architecture that consists of one stream of convolutional features and another stream composed of a semantic graph. In the latter, verbs and nouns form nodes which use word embeddings as their features. New connected action nodes are inferred. Learning propagates features which are finally merged with visual features from the convolutional network. Note that this work did not classify HOI detections; instead they inferred a single interaction from the global scene. Xu et al. [xu2019learning] similarly use a visual stream with convolutional features for human and object instances and a parallel knowledge graph that yields candidate verb features. The authors lean on the concept of semantic regularities to assert that visual-semantic feature pairs contain semantic structure that needs to be retained. They finish by conducting a multi-modal joint embedding, where the objective maximizes the similarity of positive pairs and minimizes similarity across all non-matching pairs.
More recently, Veličković et al. [velivckovic2017graph] introduced Graph Attention Networks (GATs). GATs operate on graph-structured data and leverage masked self-attentional layers. Nodes attend to their neighbors' features and dynamically learn edge-weight proportions with neighbors according to their usefulness.
Yang et al. [yang2018graph] proposed a Graph R-CNN network to learn scene relations visually (not HOI detection). Their system extracts visual features through a region proposal network (RPN) and builds a graph. They learn to prune irrelevant edges and use an attention mechanism to propagate higher-order context throughout the graph. They predict per-node edge attentions and learn to modulate information flow across unreliable or unlikely edges. Sun et al. [sun2019relational] do multi-person action forecasting in video. They use a RecGNN based on visual and spatio-temporal features to create and update the graph. Lastly, Qi et al. [qi2018learning] are the first to explicitly use a GAT for HOI in images and video. They propose a graph parsing neural net (GPNN) that takes node and edge visual features as input, from which a graph is formed. The graph structure is set by an adjacency matrix. Message updates leverage attention mechanisms via a weighted sum of the messages of the other nodes. Finally, a readout function is used for interaction inference.
To date, only Qi et al. [qi2018learning] have used GAT architectures that consider subsidiary relations. We further improve the SOTA by integrating additional contextual cues (spatial and semantic ones) into the attention mechanism through a parallel dual-attention graph architecture, and show its disambiguating power.
In this section, we first introduce the notion of a graph, then describe the visual and semantic graphs; their attention mechanisms; the fusion step; inference; training, and finally implementation details.
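A graph $\mathcal{G}$ is defined as $\mathcal{G} = (V, E)$, consisting of a set of nodes $V$ and a set of edges $E$. Node features and edge features are denoted by $h_v$ and $h_e$ respectively. Let $v_i \in V$ be the $i$th node and $e_{i,j} = (v_i, v_j) \in E$ be the directed edge from $v_i$ to $v_j$.
A graph with $n$ nodes and $m$ edges has a node feature matrix $X_v \in \mathbb{R}^{n \times d}$ and an edge feature matrix $X_e \in \mathbb{R}^{m \times c}$, where $h_{v_i} \in \mathbb{R}^{d}$ is the feature vector of node $i$ and $h_{e_{i,j}} \in \mathbb{R}^{c}$ is the feature vector of edge $(i, j)$. Fully connected edges imply $e_{i,j} \neq e_{j,i}$.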
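More formally, given an input image $I$ and interaction class labels $y$, the goal is to infer the joint probability of the visual graph $\mathcal{G}_v$, the semantic graph $\mathcal{G}_s$, the combined graph $\mathcal{G}_c$, and the labels, which can be factorized as:
$$P(\mathcal{G}_v, \mathcal{G}_s, \mathcal{G}_c, y \mid I) = P(\mathcal{G}_v, \mathcal{G}_s \mid I)\; P(\mathcal{G}_c \mid \mathcal{G}_v, \mathcal{G}_s, I)\; P(y \mid \mathcal{G}_c, \mathcal{G}_v, \mathcal{G}_s, I) \tag{1}$$
This factorization serves to design our system. First, $P(\mathcal{G}_v, \mathcal{G}_s \mid I)$ is accomplished through the visual-spatial and semantic-GAT graphs of Sec. 3.3.2 and 3.3.3. $P(\mathcal{G}_c \mid \mathcal{G}_v, \mathcal{G}_s, I)$ is accomplished by combining both graphs as described in Sec. 3.4 and illustrated in Fig. 2. The last factor is the inference step, accomplished through box-pairing and readout as described in Sec. 3.5. Training and implementation details are also offered in Sec. 3.5.1 and Sec. 3.6 respectively.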
In this section we describe how contextual features are extracted and later describe how they are used to instantiate nodes and edges in Sec. 3.3.
Visual features are extracted from subject and object proposals generated by a two-stage Faster-RCNN (ResNet-50-FPN) [renNIPS15fasterrcnn, he2016deep, lin2017feature]. First, the RPN generates hundreds of subject and object proposals. Then, for an image $I$, the $i$th human bounding-box and the $j$th object bounding-box are used to extract latent features from Faster-RCNN's last fully-connected layer (after the ROI pooling layer) to instantiate the visual graph nodes as described in Sec. 3.3.2.
Note that we use an empirically-derived score threshold to limit the number of subject and object proposals (see Sec. 3.6 for specific numbers). By eliminating non-useful proposals, the system better focuses its resources on the more important relations. Several other works have more sophisticated proposals [gupta2018nofrills, Li_2019_CVPR, yang2018graph]; we leave this as future work.
Spatial features such as bounding box locations and relative locations are informative about the relationship that proposals have with each other [Zhuang_2017_ICCV, hu2017modeling, plummer2017phrase, zhang2017visual]. Spatial features are also useful to encode the predicate. Consider the "ride" predicate: from it we can deduce that the subject is above the object.
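Given a pair of bounding boxes, their paired coordinates are given by $(x_1, y_1, x_2, y_2)$ and $(x'_1, y'_1, x'_2, y'_2)$, with respective areas $A$ and $A'$, in an image of size $W \times H$ and area $A_I$.
Spatial features can be grouped into (i) relative scale and (ii) relative position features. Bounding-box relative scale features are defined as:
$$f_{rs} = \Big[\frac{x_1}{W},\; \frac{y_1}{H},\; \frac{x_2}{W},\; \frac{y_2}{H},\; \frac{A}{A_I}\Big] \tag{2}$$
Relative position bounding-box features are defined as:
$$f_{rp} = \Big[\frac{x_1 - x'_1}{x'_2 - x'_1},\; \frac{y_1 - y'_1}{y'_2 - y'_1},\; \log\frac{x_2 - x_1}{x'_2 - x'_1},\; \log\frac{y_2 - y_1}{y'_2 - y'_1},\; \frac{x_c - x'_c}{W},\; \frac{y_c - y'_c}{H}\Big] \tag{3}$$
where $(x_c, y_c)$ and $(x'_c, y'_c)$ are the centers of the two boxes.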
The relative position expression is similar to the bounding box regression coefficients proposed in [renNIPS15fasterrcnn], but two center scales are added to yield more obvious cues.
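As an illustration of the two feature groups, the following is a minimal sketch in plain Python. The exact feature layout (five relative-scale terms followed by six relative-position terms), the function name `spatial_features`, and the `(x1, y1, x2, y2)` box convention are assumptions made for this example rather than the released implementation.

```python
import math


def spatial_features(box_a, box_b, img_w, img_h):
    """Relative scale and relative position features for an ordered box pair.

    Boxes are (x1, y1, x2, y2) with (x1, y1) the top-left corner.
    The feature layout is an assumption made for illustration.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    aw, ah = ax2 - ax1, ay2 - ay1
    bw, bh = bx2 - bx1, by2 - by1
    img_area = img_w * img_h

    # (i) Relative scale: corners normalised by image size, area by image area.
    rel_scale = [ax1 / img_w, ay1 / img_h, ax2 / img_w, ay2 / img_h,
                 (aw * ah) / img_area]

    # (ii) Relative position: regression-style offsets between the two boxes...
    rel_pos = [(ax1 - bx1) / bw, (ay1 - by1) / bh,
               math.log(aw / bw), math.log(ah / bh)]
    # ...plus two centre offsets normalised by the image size.
    acx, acy = (ax1 + ax2) / 2, (ay1 + ay2) / 2
    bcx, bcy = (bx1 + bx2) / 2, (by1 + by2) / 2
    rel_pos += [(acx - bcx) / img_w, (acy - bcy) / img_h]

    return rel_scale + rel_pos


# Hypothetical human box above a hypothetical object box in a 400x400 image.
feats = spatial_features((100, 50, 200, 250), (150, 200, 250, 300), 400, 400)
```

With image coordinates growing downward, a negative vertical centre offset (the last entry) indicates that the first box sits above the second, which is the kind of cue the "ride" example relies on.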
In this work, we use word2vec embeddings as semantic features [mikolov2013distributed]. word2vec takes a text corpus as input and produces latent word vectors as outputs. The latent representation retains semantic and syntactic similarity. Similar context is made evident through spatial proximity; indicating that words have mutual dependencies.
We use the publicly available word2vec vectors pre-trained on the Google News dataset (about 100 billion words) [Google2013GoogleHosting.]. The model yields a 300-dimensional vector for 3 million words and phrases and is trained according to [mikolov2013distributed]. All existing object classes in the HICO-DET dataset are used to obtain the word2vec latent vector representations offline.
In this section, we first introduce the general concept of Graph Neural Networks before proceeding to define Visual-GAT and Semantic-GAT.
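GNNs use the graph structure and node features to dynamically update the node vector representation [xu2018how]. An anchor node's features are updated through aggregation, that is, by using neighboring features to update the anchor node. If time or multiple layers are involved, then after $k$ aggregation iterations, a given node's representation encodes updated structural information. The node aggregation operation on node $v_i$ is generically defined as:
$$a_{v_i}^{(k)} = \mathrm{AGGREGATE}^{(k)}\big(\{h_{v_j}^{(k-1)} : v_j \in \mathcal{N}_i\}\big), \qquad h_{v_i}^{(k)} = \mathrm{COMBINE}^{(k)}\big(h_{v_i}^{(k-1)},\, a_{v_i}^{(k)}\big) \tag{4}$$
Initially, $h_{v_i}^{(0)} = x_{v_i}$, the input feature of node $v_i$, and $\mathcal{N}_i$ is the set of nodes adjacent to $v_i$.
The AGGREGATE and COMBINE functions in GNNs are crucial. Many have been proposed [xu2019learning]; averaging is a common aggregation method, as in Eqtn. 5. We now drop $k$ since our problem consists of a single layer.
$$h'_{v_i} = \mathrm{COMBINE}\Big(\Big[h_{v_i} \,\Big\|\, \frac{1}{|\mathcal{N}_i|}\sum_{v_j \in \mathcal{N}_i} h_{v_j}\Big]\Big) \tag{5}$$
where $[\cdot \,\|\, \cdot]$ is the concatenation operation.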
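The averaging aggregation followed by concatenation can be sketched in a few lines of plain Python. This is only an illustration: the dictionary-based graph, the function names, and the toy feature values are assumptions, and the learned layer that would normally follow the concatenation is omitted.

```python
def mean_aggregate(node_feats, neighbors, i):
    """Average the feature vectors of node i's neighbours."""
    nbrs = neighbors[i]
    dim = len(node_feats[i])
    return [sum(node_feats[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]


def combine_by_concat(node_feats, neighbors, i):
    """Concatenate node i's own features with the averaged neighbour features."""
    return node_feats[i] + mean_aggregate(node_feats, neighbors, i)


# Toy fully connected graph with three nodes and 2-d features (made-up values).
feats = {0: [1.0, 0.0], 1: [0.0, 2.0], 2: [4.0, 2.0]}
nbrs = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
combined = combine_by_concat(feats, nbrs, 0)  # [1.0, 0.0, 2.0, 2.0]
```

Every neighbour contributes equally here, which is why a large number of uninformative proposals can dilute the aggregated feature.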
If we consider that in the first step of HOI detection the RPN yields hundreds of proposals, then an averaging method for node features will introduce significant noise. Instead, as proposed in [sun2019relational], a weighted sum is better suited to mitigate the noise. In this work, we follow the proposal in [sun2019relational] closely. Consider a virtual node $z_{v_i}$ for a node $v_i$ that is computed as a weighted sum over all neighbors:
$$z_{v_i} = \sum_{v_j \in \mathcal{N}_i} \alpha_{ij}\, h_{v_j} \tag{6}$$
Weights are given by:
$$\alpha_{ij} = \frac{\exp\big(f_{attn}(h_{v_i}, h_{v_j})\big)}{\sum_{v_k \in \mathcal{N}_i} \exp\big(f_{attn}(h_{v_i}, h_{v_k})\big)} \tag{7}$$
where $f_{attn}$ is a computationally efficient attention function that weighs the importance of node $v_j$ to node $v_i$ and can be implemented through the self-attention neural-net mechanism of [Vaswani2017AttentionNeed, velivckovic2017graph]. Its parameters are jointly learned with the target task during back propagation without additional supervision.
Once $z_{v_i}$ is computed, an update mechanism is used to produce the output feature $h'_{v_i}$. Specifics about the update will be given for the vision and semantic graphs separately in the subsequent sections.
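The attention-weighted sum over neighbours can be illustrated with a small sketch. The dot-product `dot_score` below is a stand-in assumption for the learned attention function, and the names and toy values are hypothetical; only the softmax normalisation and the weighted sum follow the description above.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_score(h_i, h_j):
    # Stand-in for the learned attention function (an assumption for this sketch).
    return sum(a * b for a, b in zip(h_i, h_j))


def attend(node_feats, neighbors, i, score_fn=dot_score):
    """Return the attention-weighted neighbour summary for node i and its weights."""
    nbrs = neighbors[i]
    alphas = softmax([score_fn(node_feats[i], node_feats[j]) for j in nbrs])
    dim = len(node_feats[i])
    z = [sum(a * node_feats[j][d] for a, j in zip(alphas, nbrs)) for d in range(dim)]
    return z, alphas


# Node 0 is more similar to node 1 than to node 2, so node 1 gets more weight.
feats = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
nbrs = {0: [1, 2]}
z, alphas = attend(feats, nbrs, 0)
```

Unlike plain averaging, neighbours that score low under the attention function contribute little to the summary, which is the noise-mitigation effect the weighted sum is meant to provide.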
The visual graph instantiates a node $v_i$ from the latent features $h_{v_i}$ of each of the detected objects. Then, each edge $e_{ij}$ is instantiated from the spatial features $h_{e_{ij}}$ of Sec. 3.2. We use an edge function $f_{edge}$ to integrate the features on the edge along with those of its two connected nodes according to:
$$\hat{h}_{e_{ij}} = f_{edge}\big([h_{v_i} \,\|\, h_{e_{ij}} \,\|\, h_{v_j}]\big) \tag{8}$$
where $\hat{h}_{e_{ij}}$ are the derived latent features for edge $e_{ij}$.
We then apply the attention mechanism of Eqtn. 7 to $\hat{h}_{e_{ij}}$ to calculate the distribution of soft weights $\alpha_{ij}$ on each edge, after which we apply a custom weighted sum:
$$z_{v_i} = \sum_{v_j \in \mathcal{N}_i} \alpha_{ij}\, \big(h_{v_j} \oplus \hat{h}_{e_{ij}}\big) \tag{9}$$
where $\oplus$ denotes the element-wise summation operation. Note that the latent feature $\hat{h}_{e_{ij}}$ includes contextual spatial information.
After that, we leverage a node feature update function $f_{update}$ to update each node's features:
$$h'_{v_i} = f_{update}\big([h_{v_i} \,\|\, z_{v_i}]\big) \tag{10}$$
At this point, we obtain an "updated visual graph" with new features as illustrated in Fig. 2. The different edge thicknesses represent the soft weight distributions. In our method, we implement $f_{edge}$, $f_{attn}$, and $f_{update}$ each as a single fully-connected layer network with hidden node dimensions of 1024, 1, and 1024 respectively.
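In the semantic graph, word2vec latent representations of the class labels of detected objects are used to instantiate the graph's nodes. In this graph, we do not assign any features to the edges. We denote $w_i$ as the word embedding for node $i$.
As with the visual graph, we use an $f'_{edge}$ function and an $f'_{attn}$ function to compute the distribution of soft weights on each edge:
$$\alpha'_{ij} = \frac{\exp\big(f'_{attn}(f'_{edge}([w_i \,\|\, w_j]))\big)}{\sum_{k \in \mathcal{N}_i} \exp\big(f'_{attn}(f'_{edge}([w_i \,\|\, w_k]))\big)} \tag{11}$$
Then, the global semantic features for each node are computed through the linear weighted sum:
$$z_{w_i} = \sum_{j \in \mathcal{N}_i} \alpha'_{ij}\, w_j \tag{12}$$
After that, we update the node's features:
$$w'_i = f'_{update}\big([w_i \,\|\, z_{w_i}]\big) \tag{13}$$
As with the visual graph, here too we output an "updated semantic graph" with new features as shown in Fig. 2. Similarly, $f'_{edge}$, $f'_{attn}$, and $f'_{update}$ are designed as single fully-connected layer networks with hidden sizes 1024, 1, and 1024 respectively.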
To jointly leverage the dynamic information of both the visual and the semantic GATs, it is necessary to fuse them as illustrated in the "Combined Graph" of Fig. 2. The fusion operation can be straightforward: in this work, we concatenate the features of each pair of corresponding updated nodes to produce new nodes. We also initialize the edges with the original spatial features described in Sec. 3.2. We denote the combined node features for node $i$ as $h_{c_i}$, the concatenation of the updated visual and semantic features of that node.
The last step is to infer the interaction label for a predicate as part of our original triplet <subject, predicate, object>. Note that a person can concurrently perform different actions with each of the available target objects (it is also possible to have more than one human in an image; we can test only for humans based on their class label). That is, the subject can 'hold' or 'lick' the knife. In effect, HOI detection is a multi-label classification problem [gao2018ican], where each interaction class is independent and not mutually exclusive.
Thus, to simplify inference, we box-pair specific subject-object bounding boxes for all object nodes directly linked to the $i$th human node [gao2018ican, gkioxari2018detecting]. Box-pairing is illustrated in the inference section of Fig. 2.
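Box-pairing can be sketched as enumerating, for every detected human, all other detected instances as candidate objects. The snippet below is a minimal illustration under stated assumptions: the human class label is taken to be `"person"`, other humans are allowed as candidate objects, and the function name `box_pairs` is hypothetical.

```python
def box_pairs(labels, human_label="person"):
    """Return (human_index, object_index) pairs for every detected human.

    Humans are selected by class label; every other detection, including
    other humans, is treated as a candidate object (an assumption here).
    """
    pairs = []
    for h, label in enumerate(labels):
        if label != human_label:
            continue
        for o in range(len(labels)):
            if o != h:
                pairs.append((h, o))
    return pairs


# Two humans and two objects give 2 * 3 = 6 ordered subject-object pairs.
pairs = box_pairs(["person", "knife", "person", "cake"])
```

Non-human detections never appear in the subject position, so the number of pairs grows with the number of humans times the number of remaining detections rather than quadratically in all detections.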
After box-pairing, we use the final human node representation $h_{c_h}$, the final object node representation $h_{c_o}$, and the mutual spatial edge features $h_{e_{ho}}$ to form an action-specific representation for prediction.
First, an action category score $s_a \in \mathbb{R}^{n}$, where $n$ denotes the total number of possible actions, is computed. The computation requires a readout step, whose function $f_{readout}$ is implemented as a multi-layer perceptron (with 2 hidden layers of dimensions 1024 and 117) and applied to each action category. The output is then run through a binary sigmoid classifier, which produces the action score as shown in Eqtn. 14.
$$s_a = \mathrm{sigmoid}\big(f_{readout}([h_{c_h} \,\|\, h_{c_o} \,\|\, h_{e_{ho}}])\big) \tag{14}$$
The final score $S_R$ of a triplet's predicate can be computed through the chain multiplication of the action score $s_a$, the detected human score $s_h$ from object detection, as well as the detected object score $s_o$, as seen in Eqtn. 15:
$$S_R = s_h \cdot s_o \cdot s_a \tag{15}$$
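The chain multiplication is simple enough to show directly. In this sketch the readout logits are passed in as plain numbers (the readout network itself is not modelled), and the function names and input values are illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def triplet_scores(action_logits, human_score, object_score):
    """Multiply each per-action sigmoid score by the two detection scores."""
    return [human_score * object_score * sigmoid(z) for z in action_logits]


# One human-object pair, three hypothetical action logits from the readout step.
scores = triplet_scores([0.0, 2.0, -2.0], human_score=0.9, object_score=0.5)
```

Because each action is scored by its own sigmoid, several actions can score highly for the same pair, and a weak human or object detection lowers all of that pair's triplet scores.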
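The overall framework is jointly optimized end-to-end, with a cross-entropy loss that is minimized between the action scores and the ground truth action labels for each action category. Since the action categories are not mutually exclusive, the loss is a binary cross-entropy (BCE) averaged over the $N$ box-paired samples and the $n$ action categories:
$$\mathcal{L} = \frac{1}{N \times n} \sum_{i=1}^{N} \sum_{j=1}^{n} \mathrm{BCE}\big(s_{ij},\, y_{ij}\big) \tag{16}$$
where $s_{ij}$ is the predicted score and $y_{ij}$ the ground truth label of action category $j$ for sample $i$.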
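A per-category cross-entropy averaged over all pairs and action categories can be sketched as follows. Treating each category with a binary cross-entropy is an assumption that follows from the multi-label setting described earlier; the function names and the clipping constant `eps` are illustrative.

```python
import math


def bce(p, y, eps=1e-12):
    """Binary cross-entropy for one predicted probability p and label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def multilabel_loss(scores, labels):
    """Mean binary cross-entropy over all pairs and all action categories."""
    total, count = 0.0, 0
    for score_row, label_row in zip(scores, labels):
        for s, y in zip(score_row, label_row):
            total += bce(s, y)
            count += 1
    return total / count


# One pair with two action categories: uninformative vs. confident predictions.
uninformative = multilabel_loss([[0.5, 0.5]], [[1, 0]])
confident = multilabel_loss([[0.9, 0.1]], [[1, 0]])
```

In a PyTorch training loop this role would typically be played by a built-in binary cross-entropy loss; the hand-written version is only meant to make the averaging explicit.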
We trained our model on 80% of the HICO-DET [chao2018learning] training set, validated on the other 20%, and tested on the full set of images in its test set (see Sec. 4.1 for HICO-DET details).
Our architecture was built on PyTorch and the DGL library. For object detection we used PyTorch's re-implemented Faster-RCNN API [renNIPS15fasterrcnn]. Faster-RCNN used a ResNet-50-FPN backbone [he2016deep, lin2017feature] trained on the COCO dataset [Lin2014MicrosoftContext]. The object detector and word2vec vectors are frozen during training. We keep the human bounding-boxes whose detection score exceeds 0.8, while for objects we use a 0.3 score threshold.
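The two score thresholds amount to a simple class-dependent filter over the detector output. The sketch below assumes detections are dictionaries with `label` and `score` keys and that humans carry the label `"person"`; both are assumptions for illustration, not the detector's actual output format.

```python
HUMAN_THRESHOLD = 0.8   # humans are kept only when the score exceeds this
OBJECT_THRESHOLD = 0.3  # all other classes use the lower threshold


def filter_detections(detections, human_label="person"):
    """Keep detections whose score exceeds the class-dependent threshold."""
    kept = []
    for det in detections:
        threshold = HUMAN_THRESHOLD if det["label"] == human_label else OBJECT_THRESHOLD
        if det["score"] > threshold:
            kept.append(det)
    return kept


# Made-up detections: one confident and one weak human, one kept and one dropped object.
detections = [
    {"label": "person", "score": 0.95},
    {"label": "person", "score": 0.60},
    {"label": "knife", "score": 0.35},
    {"label": "cake", "score": 0.20},
]
kept = filter_detections(detections)
```

Filtering before graph construction keeps the number of nodes, and therefore the number of edges in the fully connected graph, manageable.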
All neural network layers in our model are constructed as MLPs, as mentioned in previous sections. For training on HICO-DET, we use a batch size of 32 and a dropout rate of 0.3. We use an Adam optimizer with a learning rate of 1e-5 and train for 300 epochs. As for the activation function, we use a LeakyReLU in all attention network layers and a ReLU elsewhere. All experiments are conducted on a single NVIDIA TITAN RTX GPU.
We evaluate the performance of VS-GATs on the HICO-DET dataset [chao2018learning] and compare with the SOTA (Table 1). Ablation studies are conducted to study the impact of the proposed techniques (Table 2). We also visualize the performance distribution of our model across objects for a given interaction (Fig. 3).
Datasets. In this work, we use the HICO-DET dataset [chao2018learning], which builds on top of the HICO dataset [Chao2015HICO:Images]. The latter only consists of image-level annotations. HICO-DET, on the other hand, includes bounding box annotations specifically for the HOI detection task. HICO-DET consists of 38,118 training images and 9,658 testing images. The 117 interaction classes and 80 object categories in HICO-DET yield 600 HOI categories in total. The dataset also has 150K annotated human-object pair instances. Results are reported on three different HOI category sets: (i) Full: all 600 categories; (ii) Rare: 138 HOI categories that have fewer than 10 training instances; and (iii) Non-Rare: 462 HOI categories with 10 or more training instances.
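The Rare and Non-Rare subsets are defined purely by the number of training instances per HOI category, which a short sketch makes concrete. The category names and counts below are made up for illustration, and the function name `split_categories` is hypothetical.

```python
def split_categories(train_counts, rare_below=10):
    """Split HOI categories into Rare (< rare_below instances) and Non-Rare."""
    rare = [c for c, n in train_counts.items() if n < rare_below]
    non_rare = [c for c, n in train_counts.items() if n >= rare_below]
    return rare, non_rare


# Illustrative counts only; they are not taken from the dataset.
counts = {"ride horse": 120, "lick knife": 4, "hold cup": 10}
rare, non_rare = split_categories(counts)
```

Every category falls into exactly one of the two subsets, and the Full set is their union.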
Evaluation Metrics. We use the standard mean average precision (mAP) metric to evaluate the model's detection performance. mAP is calculated from recall and precision and is commonly used for detection tasks. In this case, a detected result of the form <subject, predicate, object> is considered positive when the predicted verb is correct and both the detected human and object bounding boxes have an intersection-over-union (IoU) exceeding 0.5 with respect to the corresponding ground truth boxes.
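The positive-match criterion can be written down compactly. In this sketch boxes are `(x1, y1, x2, y2)` tuples and predictions and ground truths are dictionaries with `verb`, `human`, and `object` keys; this data layout and the function names are assumptions for illustration, and object-class matching is left out for brevity.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, threshold=0.5):
    """A detection counts when the verb matches and both boxes overlap enough."""
    return (pred["verb"] == gt["verb"]
            and iou(pred["human"], gt["human"]) > threshold
            and iou(pred["object"], gt["object"]) > threshold)


gt = {"verb": "hold", "human": (1, 0, 10, 10), "object": (20, 20, 30, 32)}
pred = {"verb": "hold", "human": (0, 0, 10, 10), "object": (20, 20, 30, 30)}
match = is_true_positive(pred, gt)
```

Requiring both boxes to pass the threshold means a correct verb with a poorly localised object still counts as a miss.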
Table 1: mAP comparison with the SOTA on HICO-DET.
|Method||Object Detector||Full||Rare||Non-Rare|
|InteractNet [gkioxari2018detecting]||Faster R-CNN with ResNet-50-FPN||9.94||7.16||10.77|
|GPNN [qi2018learning]||Deformable ConvNets [Dai2017DeformableNetworks]||13.11||9.34||14.23|
|iCAN [gao2018ican]||Faster R-CNN with ResNet-50-FPN||14.84||10.45||16.15|
|Xu et al. [xu2019learning]||Faster R-CNN with ResNet-50-FPN||14.70||13.26||15.13|
|Gupta et al. [gupta2018nofrills]||Faster R-CNN with ResNet-152||17.18||12.17||18.68|
|Li et al. [Li_2019_CVPR]||Faster R-CNN with ResNet-50-FPN||17.22||13.51||18.32|
|PMFNet [wan2019pose]||Faster R-CNN with ResNet-50-FPN||17.46||15.65||18.00|
|Peyre et al. [2019PeyreDetecting]||Faster R-CNN with ResNet-50-FPN||19.40||14.60||20.90|
|Ours (VS-GATs)||Faster R-CNN with ResNet-50-FPN||19.66||15.79||20.81|
Our experiments show that our model achieves the best mAP among SOTA methods in the Full and Rare HICO-DET categories and the second best in the Non-Rare category. We achieve gains of +0.26 and +0.14 over the previous best results in the Full and Rare categories respectively.
Our multi-cue graph attention mechanism surpasses the performance of works like Peyre et al. [2019PeyreDetecting], which exploit functional approximation using visual similarity, thus enabling the system to disambiguate between closely related same-action different-object scenarios. We also outperform works that leveraged human pose, like Gupta et al. [gupta2018nofrills] and Wan et al. [wan2019pose]. Pose would certainly be a useful additional cue to incorporate in our work. However, our system still shows better disambiguating ability even without pose cues. We should also note the work of Bansal et al., which at this time is unpublished but available as a pre-print. They achieve an mAP of 21.96 for the Full category and 16.43 for the Rare category. We did not include this work in our list given that they chose to pre-train their Faster-RCNN implementation directly on the HICO-DET dataset instead of on COCO as the rest of the works have done. We think this gives their system an advantage compared to systems trained on the more general COCO dataset. By training in this way, the system is able to refine instance proposals and reduce uninformative instances and noise. In our work, we chose not to re-train Faster-RCNN directly on HICO-DET, to keep the comparison fair.
We hold that the gains from our work are due to the multi-modal cues leveraged by the dual attention graphs. The graph structure enables the human node to leverage contextual cues from a wide-spread set of (primary and subsidiary) object instances that are dynamically updated by the independent attention mechanisms. Attention learns how, and which, contextual relations help to disambiguate the inference. This ability is useful for the Rare category, which consists of long-tail distribution samples. In conclusion, attending to multi-modal cues is a powerful disambiguator. More details are presented in Sec. 4.3.
In Fig. 3, we also visualize the performance distribution of our model across objects for a given interaction. As mentioned in [gupta2018nofrills], it still holds that interactions that occur with just a single object (e.g. 'kick ball' or 'flip skateboard') are easier to detect than predicates that interact with various objects. The median APs of the interactions 'cut' and 'clean' shown in Fig. 3 outperform those in [gupta2018nofrills] by a considerable margin, because our model uses not only single relation features but subsidiary ones as well.
In our ablation studies we conduct six different tests to understand the performance of each of the different elements of our model. We first describe each test and then analyze the results.
01 Visual Graph Only. In this test we remove the Semantic-GAT and keep the Visual-GAT, attention, and inference the same. This study will show the importance of aggregating visual and spatial cues.
02 Semantic Graph Only. In this test we remove the Visual-GAT and keep the Semantic-GAT, attention, and inference the same. This study will show the importance of working with semantic cues only.
03 Without Attention. In this test, we use the averaging mechanism of Eqtn. 5 instead of the attention-weighted sum. We still combine the graphs and infer in the same way.
04 Without Spatial Features in the Combined Graph. In this test, we remove spatial features from the edges of the combined graph to study the role that spatial features can play after the aggregation of features across nodes.
05 Message Passing in the Combined Graph. In this test, we leverage an additional graph attention network to process the combined graph, similar to what we do with the original visual-spatial graph. We examine whether there would be a gain from an additional message passing step on the combined graph, using the combined features from the visual and semantic graphs.
06 Unified V-S Graph. In this test, we choose to start with a single graph in which visual and semantic features are concatenated in the nodes from the start. Spatial features are still used to instantiate edges. This test examines if there would be a gain from using combined visual-semantic features from the start instead of through separate streams.
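Table 2: Ablation study mAP results on HICO-DET.
|Test||Full||Rare||Non-Rare|
|03 w/o attention||19.01||14.12||20.47|
|04 w/o spatial features in the combined graph||18.52||14.28||19.78|
|05 Message passing in the combined graph||19.23||14.31||20.70|
|06 Unified V-S graph||19.39||14.84||20.75|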
We now report on the ablation test results. For the Full category, study 01 yields an mAP of 18.81, which is a large portion of our final mAP and suggests that visual and spatial features play a primary role in inferring HOI. When only using the semantic graph in 02, the effect is less marked, though still considerable for this single contextual semantic cue. When combining the three contextual cues in a graph but not using the attention mechanism in test 03, we get a gain that brings the mAP to 19.01. This suggests that edge relations with multi-contextual cues are helpful even without attention. Inserting attention but removing spatial features at the end, as in test 04, hurts performance. This indicates that spatial features are helpful even after the aggregation stage. By inserting spatial features in the combined graph we are basically using a dilated step in neural networks, which has also been shown to help classification. In test 05, we learn that additional attention in the combined graph does not confer additional benefits. Rather, attention mechanisms for the independent visual-spatial and semantic features are more informative. Similarly, in test 06, a combined V-S graph is still not as effective as separating the cues early on, suggesting that visual cues and semantic cues may have some degree of orthogonality even though they are related to each other.
In this paper we presented a novel HOI detection architecture that studied and leveraged the role of not only primary subject-object contextual cues in interaction, but also the role of subsidiary relations. We showed that multi-modal contextual cues can be graphically represented through Graph Attention Networks to leverage primary and subsidiary contextual relations to disambiguate confusing HOI scenes. Our results support this position by not only exceeding SOTA performance but also excelling in classifying Rare categories in HICO-DET.