No-Frills Human-Object Interaction Detection: Factorization, Appearance and Layout Encodings, and Training Techniques

We show that with an appropriate factorization, and encodings of layout and appearance constructed from outputs of pretrained object detectors, a relatively simple model outperforms more sophisticated approaches on human-object interaction detection. Our model includes factors for detection scores, human and object appearance, and coarse (box-pair configuration) and optionally fine-grained layout (human pose). We also develop training techniques that improve learning efficiency by: (i) eliminating train-inference mismatch; (ii) rejecting easy negatives during mini-batch training; and (iii) using a ratio of negatives to positives that is two orders of magnitude larger than existing approaches while constructing training mini-batches. We conduct a thorough ablation study to understand the importance of different factors and training techniques using the challenging HICO-Det dataset.



1 Introduction

Human-object interaction (HOI) detection is the task of localizing and recognizing all instances of a predetermined set of human-object interactions. For instance, detecting the HOI “human-row-boat” refers to localizing a “human”, a “boat”, and predicting the interaction “row” for this human-object pair. Note that an image may contain multiple people rowing boats (or even the same boat), and the same person could simultaneously be interacting with the same or a different object. For example, a person can simultaneously “sit on” and “row” a boat while “wearing” a backpack.

Figure 1: Outputs of pretrained object and human-pose detectors provide strong cues for predicting interactions. Top: human and object boxes, object label, and human pose predicted by Faster-RCNN [20] and OpenPose [2] respectively. We encode appearance and layout using these predictions (and Faster-RCNN features) and use a factored model to detect human-object interactions. Bottom: boxes and pose overlaid on the input image.

In recent years, increasingly sophisticated models and techniques have been proposed for HOI detection. For instance, Chao et al. [4] encode the configuration of human-object box pairs using a CNN operating on a two-channel binary image called the interaction pattern, whereas Gkioxari et al. [10] predict a distribution over target object locations based on human appearance using a mixture density network [1]. Similarly, approaches to encoding appearance range from multitask training of a human-centric branch [10] along with an object classification branch in Faster-RCNN [20] to using an attention mechanism to gather relevant contextual information from the image [9].

In this work, we propose a no-frills, factored model for HOI detection which predicts conditional probabilities of HOI categories given a pair of candidate regions in an image and their appearance and layout encodings. Appearance and layout encodings are constructed using detections and ROI pooled features from a pretrained object detector (and optionally human pose from a pretrained pose detector). Our model consists of factors corresponding to human and object detection scores and an interaction term which is further composed of human and object appearance, and layout factors.

Figure 2: Fixing training-inference mismatch and rejecting easy negatives. The figure illustrates training and inference on a single HOI class (“human-ride-horse”) for simplicity. As shown in (a), existing models [10, 9] often train human/object and interaction branches using object and interaction classification losses respectively. The scores produced by these branches are combined during test inference to produce HOI scores. Hence, training does not accurately reflect the test inference objective. Our model, shown in (b), fixes this mismatch by optimizing the combined scores using a multi-label HOI classification loss. Our model also uses indicator functions based on the sets of detections selected for the human and object categories to reject easy negative box-pairs, i.e., pairs containing boxes from “non-human” or “non-horse” categories, during training and inference. While existing approaches also select detection candidates, the models are typically trained using mini-batches containing candidates for different HOI/object categories.

In comparison to existing approaches [10, 9], our layout encoding is more direct and does not require learning beyond training the object detector. We explore both a coarse layout encoding using position features computed from the human-object box-pair, and a fine-grained encoding of human pose using detected keypoints. Our appearance encoding of a region is the ROI pooled representation from the object detector and does not require multi-task training or learned attention mechanisms. Appearance and layout factors in the interaction term are implemented using multi-layer perceptrons (MLPs) with up to two hidden layers.

We also develop the following training techniques for improving learning efficiency of our factored model:

  • One approach [10, 9] is to learn object detection terms and interaction terms using object and interaction classification losses respectively. This creates a mismatch between training and inference, which then requires the scores of all factors to be fused heuristically to get HOI class probabilities. We eliminate this mismatch by directly optimizing the HOI class probabilities using a multi-label HOI classification loss (Fig. 2; Tab. 2 compares the interaction loss with the HOI loss).

  • Most related works begin by selecting detection candidates from an object detector, and inference is performed on the selected pairs of human and object candidates. This is ideal since a pair of boxes $(b_1, b_2)$ need not be considered for HOI class “human-ride-horse” if either $b_1$ is not a “human” candidate or $b_2$ is not a “horse” candidate. During training, however, mini-batches are constructed from candidates from different object classes. Thus, in addition to learning that a candidate pair is “human-ride-horse” and not “human-walk-horse” (same object, different interaction), the model is forced to dedicate parameters to also learn that the candidate is not “human-drive-car” (different object and corresponding interactions). We introduce indicator terms in our object detection factors that reject such easy negatives (Fig. 2; Tab. 2 compares training without and with indicators).

  • We construct training mini-batches by sampling two orders of magnitude more negative box-pairs per positive pair than related work. This is partly to support our training strategy (Sec. 3.3), which considers box-pairs as candidates for a specific HOI class (the same pair for a different interaction with the same object category is treated as a different sample). Higher ratios are also expected since the number of negative pairs is quadratic in the number of region proposals, as opposed to linear for object detectors. (Negative to positive ratio 10: 13.40 mAP; ratio 1000: 16.96 mAP; see Tab. 2.)

In summary, our key contributions are: (i) a simple but competitive factored model for HOI detection that takes advantage of appearance and layout encodings constructed using a pretrained object detector (and optionally pose detector); (ii) comparison of coarse and fine-grained layout encodings; (iii) isolating the contribution of object appearance in addition to human appearance; and (iv) techniques for enhancing learning efficiency of our factored model (and similar models in literature [10, 9]).

2 Related Work

Assessing interactions between humans and objects in images is a challenging problem which has received a considerable amount of attention from the machine learning, computer vision and robotics communities in the last decade [25, 24, 8, 7, 6, 17].

Human activity recognition is among the early efforts to analyze human actions in images or videos. Benchmarks such as UCF101 [22] and THUMOS [12] focused on classifying a video sequence into one of 101 action categories. While UCF101 only dealt with carefully trimmed videos, an artificial setting, the THUMOS challenge additionally introduced the task of temporal localization of activities in untrimmed videos. Image action recognition benchmarks such as Stanford 40 Actions [26] and PASCAL VOC 2010 [17] have also been used in the literature. While similar in intent, these action recognition challenges differ from human-object interaction detection in three ways: (1) the tasks are limited to images or videos containing a single human-centric action, such as bowling, diving, fencing, etc.; (2) the action classes are disjoint and often involve interaction with an object unique to the activity (allowing models to cheat by simply recognizing the object); and (3) spatial localization of neither the person nor the object being interacted with is required.

Moving from actions to interactions, Chao et al. [5, 4] introduce the HICO and HICO-DET datasets to address the above limitations. The HICO dataset consists of a large collection of images annotated with 600 human-object interactions, covering a diverse set of 117 interactions with 80 COCO [15] object categories. Unlike previous tasks, HOI classification is multi-label in nature since each image may contain multiple humans interacting with the same or different objects. Recently, Chao et al. extended the HICO dataset with exhaustive bounding box annotations for each of the HOI classes to create HICO-DET. Due to the human-centric nature of the annotation task and the predefined set of objects and interactions, HICO-DET does not suffer from the missing annotation problem (at least to the same extent) that plagues datasets such as Visual Genome [14] and VRD [16] that are used for the general visual relationship (object-object interaction) detection task.

In a similar effort, Gupta et al. [11] augment the COCO dataset [15] by annotating people (agents) with action labels along with the locations and labels of objects fulfilling various semantic roles for the action. In another visual equivalent of the semantic role labelling (SRL) task studied in NLP, Yatskar et al. [27] create an image dataset for situation recognition, which is defined to subsume recognition of activity, participating objects and their roles.

In this work, we choose HICO-DET as a test bed for experimentation due to its large, diverse, and exhaustively annotated set of human-object interactions which allows for an accurate and meaningful evaluation. The task is also a natural extension of classical object detection to detection of human-object pairs with interaction labels.

Existing models for HOI detection. In [4], Chao et al. propose HO-RCNN, a 3-stream architecture with one stream each for a human candidate, an object candidate, and a geometric encoding of the pair of boxes using the proposed interaction pattern. Each stream produces scores for every possible object-interaction category (600 for HICO-DET). The sets of scores are combined using late fusion to make the final prediction. Note that this approach treats “ride bicycle” and “ride horse” as independent visual entities and does not use the knowledge of “ride” being a common component. In contrast, our approach exploits this compositionality to learn shared visual appearance and geometric representations (e.g., “ride” typically involves a human box above an object box). In other words, weight sharing between different HOI classes in our factored model makes it more data efficient than HO-RCNN [4], which predicts scores for 600 HOI categories using independent weights in the last 600-way fully connected layer in each of the 3 streams.

Gkioxari et al. [10] propose InteractNet, which takes a multitask learning [3] perspective on the problem. The idea is to augment the Faster-RCNN [20] object detection framework with a human-centric branch and an interaction branch that are trained jointly alongside the original object recognition branch. To incorporate geometric cues, a Mixture Density Network (MDN) [1] is used to produce parameters of the object location distribution given the human appearance. This distribution is used to score candidate objects for a given human box. The model is trained using an object classification loss for the object branch, interaction classification losses for the human-centric action classification branch and the optional interaction branch, and a smooth L1 loss between the ground truth box-pair encoding and the mean predicted by the localization MDN. During inference, predictions from these branches are fused heuristically. In addition to differences in the details of factorization, and appearance and layout encodings used in our model, we introduce training techniques for enhancing learning efficiency of similar factored models for this task. In particular, we optimize the final HOI score obtained after fusing the individual factor scores. We also more directly encode box-pair layout using absolute and relative bounding box features which are then scored using a dedicated factor.

Gao et al. [9] follow an approach similar to [10] but introduce an attention mechanism that augments human and object appearance with contextual information from the image. The attention map is computed using keys derived from the human/object appearance encoding, and the context is computed as an attention-weighted average of convolution features. The model is trained using an interaction classification loss. The only sources of contextual information in our model are the ROI pooled region features from the object detector; adding a similar attention mechanism may further improve performance.

3 Method

In the following, we first present an overview of the proposed factor model, followed by details of different factors and our training strategy.

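Input:  Image $x$,
        Set of object ($\mathcal{O}$), interaction ($\mathcal{I}$), and
        HOI ($\mathcal{H}$) classes of interest,
        Pretrained object (Faster-RCNN) and
        human-pose (OpenPose) detectors
// Stage 1: Create a set of box candidates for each object (including human)
1   Run Faster-RCNN on $x$ to get region proposals ($\mathcal{R}$)
2        with ROI appearance features and detection probabilities
3   foreach $o \in \mathcal{O}$ do
4        Construct $\mathcal{B}_o$ = {$b \in \mathcal{R}$ such that $b$ survives class-$o$ non-maximum suppression and thresholding on class probability}
7        Update $\mathcal{B}_o$ to keep at most 10 highest ranking detections.
8   end foreach
9   Run OpenPose on $x$ to get skeletal keypoints $k(b)$ for each $b \in \mathcal{B}_h$
// Stage 2: Score candidate pairs using the proposed factored model
10  foreach $(h, o, i) \in \mathcal{H}$ do
11       foreach $(b_1, b_2) \in \mathcal{B}_h \times \mathcal{B}_o$ do
12            Compute box configuration features for $(b_1, b_2)$
13            Compute pose features for $(b_1, k(b_1), b_2)$
14            Compute $P(y_{(h,o,i)} = 1 \mid b_1, b_2, x)$
15                 using equations 1, 2, and 3
16       end foreach
17       Output: Ranked list of $(b_1, b_2) \in \mathcal{B}_h \times \mathcal{B}_o$ as detections for class $(h, o, i)$ with probabilities. For any $(b_1', b_2') \notin \mathcal{B}_h \times \mathcal{B}_o$, the probability of belonging to class $(h, o, i)$ is predicted as 0.
18  end foreach
// Steps 10-17 are implemented with a single forward pass on a mini-batch of precomputed features
Algorithm 1: Inference on a single image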

3.1 Overview

Given an image and a set of object-interaction categories of interest, human-object interaction (HOI) detection is the task of localizing all human-object pairs participating in one of the said interactions. The combinatorial search over human and object bounding-box locations and scales, as well as object labels, $o \in \mathcal{O}$, and interaction labels, $i \in \mathcal{I}$, makes both learning and inference challenging. To deal with this complexity, we decompose inference into two stages. In the first stage, object category specific bounding box candidates $\mathcal{B}_o$ are selected for each $o \in \mathcal{O}$ using a pre-trained object detector such as Faster-RCNN. For each HOI category, i.e., for each triplet $(h, o, i) \in \mathcal{H}$, a set of candidate human-object box-pairs $\mathcal{B}_h \times \mathcal{B}_o$ is constructed by pairing every human box candidate $b_1 \in \mathcal{B}_h$ with every object box candidate $b_2 \in \mathcal{B}_o$. In the second stage, a factored model is used to score and rank candidate box-pairs for each HOI category. Our factor graph consists of human and object appearance, box-pair configuration (coarse layout) and human-pose (fine-grained layout) factors that operate on appearance and layout encodings constructed from outputs of pretrained object and human-pose detectors. The model is parameterized to share representations and computation across different object and interaction categories to efficiently score candidate box-pairs for all HOI categories of interest in a single forward pass. See Algorithm 1 for a detailed description of the inference procedure.
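The pairing step of the second stage can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function name `candidate_pairs`, the dictionary layout, and the toy boxes are hypothetical and not taken from any released code.

```python
from itertools import product


def candidate_pairs(candidates, hoi_classes):
    """Enumerate (hoi_class, human_box, object_box) for every candidate pair.

    candidates:  dict mapping an object category name to its list of candidate boxes
                 (assumed to come from the first stage).
    hoi_classes: iterable of (human, object, interaction) triplets.
    """
    pairs = []
    for (h, o, i) in hoi_classes:
        # Every human candidate is paired with every candidate of object category o.
        for b1, b2 in product(candidates.get(h, []), candidates.get(o, [])):
            pairs.append(((h, o, i), b1, b2))
    return pairs


# Toy example with made-up boxes in (x1, y1, x2, y2) format.
candidates = {
    "human": [(10, 10, 50, 120), (200, 20, 260, 140)],
    "horse": [(0, 60, 90, 160)],
    "boat": [],
}
hoi_classes = [
    ("human", "horse", "ride"),
    ("human", "horse", "walk"),
    ("human", "boat", "row"),
]
pairs = candidate_pairs(candidates, hoi_classes)
print(len(pairs))  # 2 humans x 1 horse x 2 interactions, and no boat candidates
```

Note that the same box pair appears once per HOI class for which it is a candidate, which matches the treatment of such pairs as separate samples during training.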

3.2 Factored Model

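For an image $x$, given a human-object candidate box pair $(b_1, b_2)$, human pose keypoints $k(b_1)$ detected inside $b_1$ (if any), and the set of box candidates $\mathcal{B}_o$ for each object category $o$ (with $\mathcal{B}_h$ denoting the human candidates), the factored model computes the probability of occurrence of human-object interaction $(h, o, i)$ in $(b_1, b_2)$ as follows:

$$P(y_{(h,o,i)} = 1 \mid b_1, b_2, k(b_1), \mathcal{B}_h, \mathcal{B}_o, x) = P(y_h = 1 \mid b_1, \mathcal{B}_h, x) \cdot P(y_o = 1 \mid b_2, \mathcal{B}_o, x) \cdot P(y_i = 1 \mid b_1, b_2, k(b_1), o, x) \qquad (1)$$

Here, $y_h \in \{0, 1\}$ is a random variable denoting if $b_1$ is labeled as a human, $y_o \in \{0, 1\}$ denotes if $b_2$ is labeled as object category $o$, and $y_i \in \{0, 1\}$ denotes if the interaction assigned to the box-pair is $i$. The above factorization assumes that human and object class labels depend on the individual boxes and the image, while the interaction label depends on the box-pair, pose, object label under consideration, and the image. The conditioning on $\mathcal{B}_h$ and $\mathcal{B}_o$ is to ensure that for a box-pair $(b_1, b_2)$ where $b_1 \notin \mathcal{B}_h$ or $b_2 \notin \mathcal{B}_o$, the probability for $(h, o, i)$ is predicted as $0$. This is implemented using indicator functions in the first two terms. For brevity, we will refer to the left hand side of the above equation as $P(y_{(h,o,i)} = 1 \mid b_1, b_2, x)$. We now describe how the 3 terms are modelled.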

3.2.1 Detector Terms

The first two terms in Eq. 1 are modelled using the set of candidate bounding boxes for each object class and classification probabilities produced by a pretrained object detector. For any object category $o \in \mathcal{O}$ (including $h$), the detector term $P(y_o = 1 \mid b, \mathcal{B}_o, x)$ can be computed as

$$P(y_o = 1 \mid b, \mathcal{B}_o, x) = \mathbb{1}(b \in \mathcal{B}_o) \cdot P_{det}(l_{det} = o \mid b, x) \qquad (2)$$

where the $P_{det}$ term corresponds to the probability of assigning object class $o$ to region $b$ in image $x$ by the object detector. The indicator term checks if $b$ belongs to $\mathcal{B}_o$, the set of candidate bounding boxes for $o$ selected from the set of all region proposals using non-maximum suppression and thresholding on class probabilities.
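A minimal sketch of the detector term follows, assuming non-maximum suppression has already been applied upstream. The function names and the default `score_threshold` are placeholders of ours (the threshold value is not given in the text above); the cap of 10 candidates per category follows Algorithm 1.

```python
def select_candidates(detections, score_threshold=0.01, max_candidates=10):
    """Return the set of candidate box ids for one object category.

    detections: list of (box_id, class_probability) pairs for this category,
                assumed to have survived non-maximum suppression already.
    score_threshold is a hypothetical value chosen for illustration.
    """
    kept = [(box_id, prob) for box_id, prob in detections if prob > score_threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return {box_id for box_id, _ in kept[:max_candidates]}


def detector_term(box_id, class_probability, candidate_set):
    """Indicator-gated detector score: zero for boxes outside the candidate set."""
    return class_probability if box_id in candidate_set else 0.0


horse_detections = [("r1", 0.90), ("r2", 0.005), ("r3", 0.40)]
horse_candidates = select_candidates(horse_detections)
print(detector_term("r1", 0.90, horse_candidates))   # candidate: score passes through
print(detector_term("r2", 0.005, horse_candidates))  # rejected: score is zeroed
```

Because the term is exactly zero for non-candidates, any box pair containing such a box receives zero probability for the corresponding HOI class, regardless of the interaction term.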

3.2.2 Interaction Term

The interaction term refers to the probability of entities in $b_1$ and $b_2$ engaging in interaction $i$. Note that the interaction term is conditioned on the object label $o$. This allows the model to learn that only certain interactions are feasible for a given object. For example, it is possible to “clean” or “eat at” a “dining table” but not to “drive” or “greet” it. In practice, we found conditioning on $o$ did not affect results significantly. To utilize appearance and layout information, the interaction term is further factorized as follows:

$$P(y_i = 1 \mid b_1, b_2, k(b_1), o, x) = \sigma\big(\phi_{human}(i \mid b_1, x) + \phi_{object}(i \mid b_2, x) + \phi_{boxes}(i \mid b_1, b_2, o) + \phi_{pose}(i \mid b_1, b_2, k(b_1), o)\big) \qquad (3)$$

Here, $\sigma$ is the Sigmoid function and each $\phi$ is a learnable deep net factor. We now describe each of these factors along with the network architecture and the appearance and layout encodings the factor operates on:

Appearance. The human and object appearance factors predict the interaction that the human and the object are engaged in, based on visual appearance alone. The appearance of a box in an image is encoded using Faster-RCNN [20] (ResNet-152 backbone) average-pooled fc7 features extracted from the RoI. By design, this representation captures context in addition to content within the box. The fc7 features are fed into a multi-layer perceptron (MLP) with a single hidden layer with Batch Normalization and ReLU [18]. The output layer has 117 neurons, one per interaction category.

Box Configuration. The object label and the absolute and relative positions and scales of the human and object boxes are often indicative of the interaction, even without the appearance (e.g., a human box above and overlapping with a ‘horse’ box strongly suggests a ‘riding’ interaction). The box configuration factor captures this intuition by predicting a score for each interaction given an encoding of the bounding boxes and the object label. The object label is encoded as an 80-dimensional one-hot vector. The bounding boxes are represented using a 21-dimensional feature vector. We encode the absolute position and scale of both the human and object boxes using box width, height, center position, aspect ratio, and area. We also encode the relative configuration of the human and object boxes using the relative position of their centers, the ratio of box areas, and their intersection over union. These 21-dimensional features are concatenated with their log absolute values and the object label encoding and passed through an MLP with 2 hidden layers, 122 (2 × 21 + 80) dimensional each (same as the input feature dimension), with Batch Normalization and ReLU.
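The box-pair encoding can be illustrated with a small sketch. It covers only the quantities named above (width, height, center, aspect ratio, and area per box, plus relative center offset, area ratio, and IoU), which yields 16 values rather than the full 21-dimensional vector, whose exact composition is not spelled out here. The function names, the `eps` constant inside the logarithm, and the use of raw pixel coordinates are our assumptions.

```python
import math


def box_stats(box):
    """Width, height, center x/y, aspect ratio, and area of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1  # assumed strictly positive
    return [w, h, x1 + w / 2, y1 + h / 2, w / h, w * h]


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def box_pair_features(human, obj, eps=1e-6):
    """Absolute and relative box features followed by their log absolute values."""
    hs, os_ = box_stats(human), box_stats(obj)
    relative = [
        os_[2] - hs[2],   # center offset in x
        os_[3] - hs[3],   # center offset in y
        os_[5] / hs[5],   # object-to-human area ratio
        iou(human, obj),
    ]
    feats = hs + os_ + relative
    logs = [math.log(abs(f) + eps) for f in feats]  # eps avoids log(0)
    return feats + logs


features = box_pair_features((0, 0, 10, 20), (0, 10, 10, 30))
print(len(features))  # 16 raw features plus 16 log features
```

In the full model these values would be concatenated with the one-hot object label before being passed to the MLP.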

Figure 3: Interaction confusions. Each heatmap element visualizes the predicted probability of one interaction (column) for a box-pair, averaged across all box pairs whose ground truth interaction is that of the row. Each row is independently normalized and exponentiated to highlight the interactions most confused with the row's ground truth interaction. Only 30 of the 117 classes with the highest median AP across objects (see Fig. 5) are shown for clarity.

Human Pose. We supplement the coarse layout encoded by bounding boxes with more fine-grained layout information provided by human pose keypoints. We use OpenPose [2, 23, 21] to detect 18 keypoints for each person in the image. A human candidate box is assigned a keypoint skeleton if a sufficiently large fraction of the smallest bounding box around the keypoints lies inside the human box. Similar to box features, we encode both absolute human pose and the relative location with respect to the object candidate box. The absolute pose features (18 × 3 = 54) consist of keypoint coordinates normalized to the human bounding box frame of reference and the confidence of each keypoint predicted by OpenPose. The relative pose features (18 × 5 = 90) consist of offsets of the top left and bottom right corners of the object box relative to each keypoint, and keypoint confidences. The absolute and relative pose features and their log values are concatenated along with the one-hot object label encoding before being fed into the pose factor, which is also an MLP with 2 hidden layers. Both hidden layers are equipped with Batch Normalization and ReLU. The output layer, like the other factors, has 117 neurons.
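The skeleton-to-box assignment and the absolute pose features can be sketched as follows. The default `min_fraction` is a placeholder of ours, and choosing the skeleton with the largest overlap when several qualify is our assumption; undetected keypoints are not treated specially in this simplified version.

```python
def keypoint_box(keypoints):
    """Smallest (x1, y1, x2, y2) box around a list of (x, y, confidence) keypoints."""
    xs = [x for x, _, _ in keypoints]
    ys = [y for _, y, _ in keypoints]
    return (min(xs), min(ys), max(xs), max(ys))


def fraction_inside(inner, outer):
    """Fraction of the area of `inner` that lies inside `outer`."""
    iw = max(0.0, min(inner[2], outer[2]) - max(inner[0], outer[0]))
    ih = max(0.0, min(inner[3], outer[3]) - max(inner[1], outer[1]))
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return (iw * ih) / area if area > 0 else 0.0


def assign_pose(human_box, skeletons, min_fraction=0.7):
    """Pick the skeleton whose keypoint box overlaps the human box the most.

    min_fraction is a hypothetical threshold chosen for illustration.
    Returns None when no skeleton reaches the threshold.
    """
    best, best_fraction = None, min_fraction
    for skeleton in skeletons:
        fraction = fraction_inside(keypoint_box(skeleton), human_box)
        if fraction >= best_fraction:
            best, best_fraction = skeleton, fraction
    return best


def absolute_pose_features(human_box, keypoints):
    """Keypoint coordinates normalized to the human box frame, plus confidences."""
    x1, y1, x2, y2 = human_box
    w, h = x2 - x1, y2 - y1
    feats = []
    for x, y, conf in keypoints:
        feats += [(x - x1) / w, (y - y1) / h, conf]
    return feats


human_box = (0, 0, 100, 200)
inside = [(10, 20, 0.9), (90, 180, 0.8)]      # two keypoints only, for brevity
outside = [(150, 20, 0.9), (250, 180, 0.8)]
assigned = assign_pose(human_box, [outside, inside])
pose_feats = absolute_pose_features(human_box, assigned)
```

With all 18 keypoints, `absolute_pose_features` would return 18 × 3 = 54 values, matching the dimension stated above.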

Figure 4: Qualitative results with top ranking true and false positives with predicted probability. The blue and red boxes correspond to human and objects detected by pretrained Faster-RCNN detector respectively. Pose skeleton consists of 18 keypoints predicted by pretrained OpenPose detector and assigned to the human box.

3.3 Training

Since more than one HOI label might be assigned to a pair of boxes, the model is trained in a fully supervised fashion using the multi-label binary cross-entropy loss. For each image in the training set, candidate box-pairs for each HOI category ($\mathcal{B}_h \times \mathcal{B}_o$ for class $(h, o, i)$) are assigned binary labels based on whether both the human and object boxes in the pair have an intersection-over-union (IoU) greater than 0.5 with a ground truth box-pair of the corresponding HOI category. During training, the $j^{th}$ sample in a mini-batch consists of a box pair $(b_1^j, b_2^j)$, the HOI category $l_j$ for which the box pair is a candidate ($(b_1, b_2) \in \mathcal{B}_h \times \mathcal{B}_o$ are considered candidates for HOI class $(h, o, i)$), a binary label $y_j$ to indicate match (or not) with a ground truth box pair of class $l_j$, detection scores for the human and object category corresponding to class $l_j$, and input features for each factor $\phi$. Pairs of boxes which are candidates for more than one HOI category are treated as multiple samples during training. Since the number of candidate pairs per image is 3 orders of magnitude larger than the number of positive samples, random sampling would leave most mini-batches with no positives. We therefore select all positive samples per image and then randomly sample 1000 negatives per positive. Given a mini-batch of size $N$ constructed from a single image $x$, the loss is computed as

$$\mathcal{L}_{mini\text{-}batch} = \frac{1}{N |\mathcal{H}|} \sum_{j=1}^{N} \sum_{l \in \mathcal{H}} \mathbb{1}(l = l_j) \cdot \mathrm{BCE}(y_j, p_j^l) \qquad (4)$$

where $\mathrm{BCE}(y, p)$ is the binary cross entropy loss and $p_j^l$ is the probability of HOI class $l$ computed for the $j^{th}$ sample in the mini-batch using Eq. 1. In our experiments, we only learn parameters of the interaction term (the MLPs used to compute the appearance, box configuration, and pose factors).
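The mini-batch construction and the loss on the combined HOI probability can be sketched as below. Several details are our assumptions rather than facts from the text: the factor scores are summed before the Sigmoid, the loss is averaged over the mini-batch only (other normalizations differ by a constant factor), and the sample dictionaries with keys `label`, `det_human`, `det_object`, and `factors` are a hypothetical layout.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hoi_probability(det_human, det_object, factor_scores):
    """Detector terms times the interaction term (assumed: Sigmoid of summed factors)."""
    return det_human * det_object * sigmoid(sum(factor_scores))


def bce(label, p, eps=1e-7):
    """Binary cross entropy with clamping for numerical safety."""
    p = min(max(p, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def sample_minibatch(samples, neg_per_pos=1000, rng=random):
    """Keep all positives and draw up to neg_per_pos negatives per positive."""
    positives = [s for s in samples if s["label"] == 1]
    negatives = [s for s in samples if s["label"] == 0]
    k = min(len(negatives), neg_per_pos * len(positives))
    return positives + rng.sample(negatives, k)


def minibatch_loss(batch):
    """Mean BCE on the combined HOI probability, so training matches inference."""
    if not batch:
        return 0.0
    losses = [
        bce(s["label"], hoi_probability(s["det_human"], s["det_object"], s["factors"]))
        for s in batch
    ]
    return sum(losses) / len(losses)


# Toy data: one positive and five negatives with made-up scores.
random.seed(0)
samples = [{"label": 1, "det_human": 0.9, "det_object": 0.8, "factors": [1.0, 0.5]}]
samples += [
    {"label": 0, "det_human": 0.9, "det_object": 0.3, "factors": [-1.0, 0.2 * k]}
    for k in range(5)
]
batch = sample_minibatch(samples, neg_per_pos=3)
loss = minibatch_loss(batch)
```

Because the loss is applied to the product of detector and interaction terms, the interaction MLPs can calibrate themselves relative to the detection scores, which is the point of optimizing the HOI loss rather than an interaction-only loss.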

                                                       Number of training instances per HOI class
Model                           Full   Rare  Non-Rare    0-9  10-49  50-99  100-499  500-999  1000+
HO-RCNN [4]                     7.81   5.37      8.54      -      -      -        -        -      -
VSRL [11] (impl. by [10])       9.09   7.02      9.71      -      -      -        -        -      -
InteractNet [10]                9.94   7.16     10.77      -      -      -        -        -      -
GPNN [19]                      13.11   9.34     14.23      -      -      -        -        -      -
iCAN [9]                       14.84  10.45     16.15      -      -      -        -        -      -
Det                             8.32   6.84      8.76   6.84   4.85   6.05    10.18    14.40  21.46
Det + Box                      12.54  10.40     13.18  10.40   7.46   9.99    14.62    20.12  35.98
Det + Human App                11.12   8.82     11.80   8.82   7.73   9.19    13.41    15.85  26.42
Det + Object App               11.05   7.41     12.13   7.41   7.68   9.72    14.61    15.58  23.27
Det + App                      15.74  11.35     17.05  11.35  10.58  13.96    20.11    22.76  34.75
Det + Human App + Box          15.63  12.45     16.58  12.45   9.94  12.69    19.05    23.60  39.63
Det + Object App + Box         15.68  10.47     17.24  10.47   9.97  12.84    20.48    23.88  40.87
Det + App + Box                16.96  11.95     18.46  11.95  11.02  14.00    22.02    25.01  41.13
Det + Pose                     11.09   8.04     12.00   8.04   7.26   8.47    13.08    18.81  32.66
Det + Box + Pose               14.49  11.86     15.27  11.86   9.73  12.21    16.51    21.72  38.81
Det + App + Pose               15.50  10.14     17.10  10.14  10.40  13.11    20.40    23.45  36.08
Det + App + Box + Pose         17.18  12.17     18.68  12.17  11.28  14.49    22.08    25.27  41.47
Table 1: Results (mAP) on HICO-Det test set. Det, Box, App, and Pose correspond to object detector terms, box configuration, appearance, and pose factors respectively. Each row was both trained and evaluated with the specified factors. Best and second best numbers are highlighted in color.

4 Experiments

HICO-Det [4] and V-COCO [11] datasets are commonly used for evaluating HOI detection models. V-COCO is primarily used for legacy reasons since, at the time, the HICO [5] dataset only had image-level annotations. HICO-Det was created to extend HICO with bounding box annotations specifically for the HOI detection task. HICO-Det is both larger and more diverse than V-COCO: it consists of images annotated with 117 interactions with 80 objects, resulting in a total of 600 HOI categories, whereas V-COCO has fewer interactions and a smaller training set. Exhaustive annotations for each HOI category also make it more suitable for AP-based evaluation than VRD [16], which suffers from missing annotations. VRD also contains “human” as only one among many subjects, which makes evaluation of the impact of fine-grained human pose less reliable due to a small sample size. Hence, HICO-Det is appropriate for evaluating our contributions and makes V-COCO evaluation redundant.

In addition to comparing to the current state-of-the-art, our experiments evaluate the contribution of different factors in our model (Tab. 1) and the impact of the proposed training techniques (Tab. 2). Our analysis also includes visualization of the distribution of performance across object and interaction categories (Fig. 5), interaction confusions (Fig. 3), and examples of top ranking detections and failure cases (Fig. 4).

The HICO-Det dataset contains 38,118 training and 9,658 test images annotated with 600 HOI categories. We further split the training images to generate our actual training and validation sets. For all experiments we train on this smaller training set and use the validation set for model selection. HOI categories consist of 80 object categories (same as COCO classes) and 117 interactions. Each image contains several HOI detections on average.

4.1 Comparison to State-of-the-art

Tab. 1 shows the mAP of our final models, Det+App+Box and Det+App+Box+Pose (and ablations), in comparison to existing models in the literature on various sets of HOI categories: Full is mAP across all classes, Rare is mAP on classes with fewer than 10 training instances, and Non-Rare is mAP on the rest. To present a clearer picture, in addition to this Rare/Non-Rare split specified in [4], we show results for a more fine-grained grouping of classes based on the number of training instances.
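The grouping of classes by training-set frequency can be illustrated with a short sketch. The function name `grouped_map`, the dictionary layout, and the toy AP values are ours and purely illustrative; only the bin boundaries follow the columns of Tab. 1.

```python
import math


def grouped_map(ap_by_class, train_count_by_class, bins):
    """Mean AP over the classes whose training-instance count falls in each bin.

    bins: dict mapping a group name to an inclusive (low, high) count range.
    Groups without any class get NaN.
    """
    result = {}
    for name, (low, high) in bins.items():
        aps = [
            ap for cls, ap in ap_by_class.items()
            if low <= train_count_by_class[cls] <= high
        ]
        result[name] = sum(aps) / len(aps) if aps else float("nan")
    return result


bins = {
    "Full": (0, math.inf),
    "Rare": (0, 9),
    "Non-Rare": (10, math.inf),
    "10-49": (10, 49),
    "1000+": (1000, math.inf),
}

# Made-up per-class AP values and training counts, for illustration only.
ap_by_class = {"ride-horse": 0.5, "wash-bicycle": 0.1, "hug-cat": 0.3}
train_counts = {"ride-horse": 2000, "wash-bicycle": 3, "hug-cat": 20}
summary = grouped_map(ap_by_class, train_counts, bins)
```

Since Rare is defined as fewer than 10 training instances, it coincides with the 0-9 bin, which is why those two columns of Tab. 1 are identical.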

The model most similar to ours is InteractNet [10], which extends Faster-RCNN with a human-centric branch that produces interaction scores based on human (and optionally object) appearance and a distribution over target object location. There are 4 factors contributing to the improved performance of our model over InteractNet: (i) use of a significantly larger ratio of negative to positive box-pairs during mini-batch training (our model uses 1000 negatives per positive, whereas InteractNet [10] uses a much lower ratio for the detection branch and no negatives for the interaction branch); (ii) the box configuration term in our model directly scores box-pair features, a formulation easier to learn than predicting a distribution over target object locations using human appearance features alone; (iii) fixing training-inference mismatch (Fig. 2); (iv) easy negative rejection, which allows our model to focus on learning to rank only hard candidate pairs for a particular HOI category, namely all combinations of human and object detections of the relevant category. The effects of factors (i), (iii), and (iv) on our model’s performance are further investigated in Tab. 2 and Sec. 4.2. iCAN [9] follows an approach and training procedure similar to InteractNet but augments region appearance features with contextual features computed using an attention mechanism, a contribution complementary to ours. Our ablation study over factors (Tab. 1) is also more exhaustive than those of [10, 9]. For instance, we show that object appearance provides useful information complementary to human appearance for HOI detection (Det + Human App: 11.12, Det + Object App: 11.05, Det + App: 15.74).

HO-RCNN [4] takes human appearance, object appearance, and box configuration encoded as an interaction pattern as inputs and processes them with 3 separate branches, each of which produces a score for each HOI category. The scores are combined along with object detection scores to produce HOI probabilities and the model is trained using a multi-label binary classification loss. Our model improves over HO-RCNN in two ways: (i) weight sharing in our factored model (also in InteractNet and iCAN) makes it more data efficient than HO-RCNN [4], which predicts scores for 600 HOI categories using independent weights in the last 600-way fully connected layer; and (ii) we explicitly encode spatial layout as opposed to [4], which has to learn such a representation via a CNN.

GPNN [19] adopts a completely different approach based on message passing over an inferred graph. While in theory such an approach jointly infers all HOI detections in an image (as opposed to making predictions for one candidate box-pair at a time), its advantages over carefully designed but simpler fixed-graph approaches like our factor model and iCAN (which also enjoys the benefit of context in spite of pairwise inference) remain to be demonstrated.

4.2 Training Techniques

Training the model using an interaction classification loss on the probabilities predicted by the interaction term, as done in [10, 9], is suboptimal in comparison to training using the HOI classification loss (Tab. 2), even though the same set of parameters is optimized by both losses. This is because the latter provides an opportunity for the interaction term to calibrate itself relative to the detection terms. This approach is also used in [4] but without the strong weight sharing assumptions made by our factor model.

A distinguishing feature of our factor model is the use of indicator functions in the detection score factors (Eq. 2) and the loss function (Eq. 4). The indicators ensure that a box-pair $(b_1, b_2)$ predicts zero probability for an HOI category $(h, o, i)$ for which $b_1 \notin \mathcal{B}_h$ or $b_2 \notin \mathcal{B}_o$, and that the model learns to rank only relevant box-pairs (those in set $\mathcal{B}_h \times \mathcal{B}_o$ for class $(h, o, i)$). Tab. 2 shows that even while using the indicators during inference, not using them during training causes a drop in mAP.

Finally, as shown in Tab. 2, increasing the ratio of negative box-pairs sampled per positive in a mini-batch during training leads to a dramatic increase in performance (from 13.40 mAP at a ratio of 10 to 17.06 mAP at a ratio of 500). This is in contrast to the low ratios used for training object detectors and in related work [4, 10]. The higher ratio partly supports our training strategy (Sec. 3.3), which considers box-pairs as candidates for a specific HOI class (the same pair for a different interaction with the same object category is treated as a different sample). In addition, the number of negative pairs is quadratic in the number of region proposals, as opposed to linear for object detectors, so higher ratios than those used for object detectors are to be expected when learning to reject false positives.
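A minimal sketch of ratio-controlled mini-batch construction is given below. The sample representation, the function names, and the uniform random sampling are assumptions made for illustration; the paper's own sampling procedure is described in Sec. 3.3.

```python
import random

def sample_minibatch(positives, negatives, neg_per_pos, rng):
    """Return all positives plus up to neg_per_pos negatives per positive."""
    num_neg = min(len(negatives), neg_per_pos * len(positives))
    return list(positives), rng.sample(negatives, num_neg)

def num_candidate_pairs(num_human_boxes, num_object_boxes):
    """Every human box can be paired with every object box."""
    return num_human_boxes * num_object_boxes

rng = random.Random(0)
# Hypothetical samples; each stands for one (box-pair, HOI class) candidate.
positives = [("pos", i) for i in range(2)]
negatives = [("neg", i) for i in range(5000)]
pos, neg = sample_minibatch(positives, negatives, neg_per_pos=1000, rng=rng)
print(len(pos), len(neg))  # 2 2000
print(num_candidate_pairs(10, 10), num_candidate_pairs(20, 20))  # 100 400
```

Doubling the number of boxes of each kind quadruples the number of candidate pairs, which is why the pool of negatives grows much faster here than in object detection.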

Neg./Pos. Indicators HOI Loss Interaction Loss mAP
10 13.40
50 15.51
100 16.30
500 17.06
1000 16.96
1500 16.62
1000 15.93
1000 15.89
Table 2: Training techniques evaluated using Det + App + Box model. The results highlight the importance of: (i) large negative to positive ratio in mini-batches; (ii) using indicators during training to only learn to rank candidates selected specifically for a given HOI category instead of all detection pairs; (iii) directly optimizing the HOI classification loss instead of training with an interaction classification loss and then combining with object detector scores heuristically. Best and second best numbers are highlighted in color.

4.3 Factor Ablation Study

To identify the role of different sources of appearance and spatial information in our model we train models with subsets of available factors.

The role of individual factors can be assessed by comparing Det, Det+Box, Det+App, and Det+Pose. Note that the appearance terms lead to the largest gains over Det, followed by Box and Pose. We further analyse the contribution of human and object appearance towards predicting interactions. Interestingly, while Det+Human App and Det+Object App perform comparably, the combination outperforms either of them, showing that human and object appearance provide some complementary information. Note that a combined mAP equal to or below the better of the two individual mAPs would indicate completely redundant or noisy signals. A similar sense of complementary information can be assessed from Table 1 for the App-Box, App-Pose, and Box-Pose pairs.

While Det+Box+Pose improves over Det+Box, Det+App+Pose and Det+App perform comparably. Similarly, Det+App+Box+Pose only slightly improves the performance of Det+App+Box. This suggests that while it is useful to encode fine-grained layout in addition to coarse layout, human appearance encoded using object detector features may already be capturing human pose information to some extent.

Another way of understanding the role of factors is to consider the drop in performance when a particular factor is removed from the final model, i.e., to compare Det+App+Box+Pose against Det+Box+Pose, Det+App+Pose, and Det+App+Box for the App, Box, and Pose factors respectively.

Figure 5: Spread of performance (range and quartiles) across interactions with the same object (top) and across objects for a given interaction (bottom). The horizontal axis is sorted by median AP.

4.4 How is the performance distributed across objects and interactions?

Fig. 5 visualizes the spread of performance of our final model across interactions with a given object and across objects for a given interaction. The figure shows that for most objects certain interactions are much easier to detect than others (with the caveat that AP computation for any class is sensitive to the number of positives for that class in the test set). A similar observation holds for different objects given an interaction. In addition, we observe that interactions which can occur with only a specific object category (as indicated by the absence of a box), such as “kick-ball” and “flip-skateboard”, are easier to detect than those that tend to occur with more than one object, such as “cut” and “clean”, which can have drastically different visual and spatial appearance depending on the object.

Heatmaps in Fig. 3 show the interactions that are confused by different models. Comparing heatmap b with a shows the role of the appearance factor in reducing confusion between interactions. For instance, without App, “eat” is confused with “brush with” and “drink with”, but not in the final model. Similarly, c and d can be compared with a for the effects of the Box and Pose factors respectively.
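The range and quartile statistics summarized in Fig. 5 can be computed from per-HOI AP values along the following lines. This is a sketch under stated assumptions: the AP values are hypothetical placeholders, the (interaction, object) key layout and the function name are illustrative, and the quartile convention (inclusive interpolation) is a choice the paper does not specify.

```python
from collections import defaultdict
import statistics

def ap_spread(ap_by_hoi, group_index):
    """Summarize AP per group as (name, min, q1, median, q3, max), sorted by median.

    ap_by_hoi maps (interaction, object) -> AP.
    group_index is 0 to group by interaction and 1 to group by object.
    """
    groups = defaultdict(list)
    for key, ap in ap_by_hoi.items():
        groups[key[group_index]].append(ap)
    rows = []
    for name, aps in groups.items():
        if len(aps) > 1:
            q1, med, q3 = statistics.quantiles(aps, n=4, method="inclusive")
        else:
            q1 = med = q3 = aps[0]  # a single HOI class has no spread
        rows.append((name, min(aps), q1, med, q3, max(aps)))
    return sorted(rows, key=lambda row: row[3])

# Hypothetical AP values, for illustration only.
ap_by_hoi = {
    ("row", "boat"): 0.5,
    ("sit on", "boat"): 0.3,
    ("inspect", "boat"): 0.1,
    ("kick", "ball"): 0.6,
}
for row in ap_spread(ap_by_hoi, group_index=1):
    print(row)
```

Groups with a single entry, such as an interaction that occurs with only one object category, collapse to a single point, which corresponds to the absence of a box in the figure.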

4.5 Qualitative Results

Qualitative results (Fig. 4) demonstrate the advantages of building HOI detectors on the strong foundation of object detectors. False positives are more commonly due to an incorrect interaction than an incorrect object. Interaction errors are often due to fine-grained differences between classes, e.g., “carry” vs. “wield” for “baseball bat” and “inspect” vs. “repair” for “boat.” Notice that in some examples, like “inspect airplane” and “watch bird,” the cues for preventing false positives are as subtle as gaze direction.

4.6 Conclusion

We propose a no-frills approach to HOI detection which is competitive with existing literature without many of their complexities. This is achieved through appropriate factorization of the HOI class probabilities, encodings of appearance and layout constructed from the outputs of pretrained object and pose detectors, and improved training techniques. Our ablation study shows the importance of human and object appearance, coarse layout, and fine-grained layout for the HOI detection task. We also evaluate the significance of the proposed training techniques which can easily be incorporated into other factored models in the literature as well.


  • [1] C. M. Bishop. Mixture density networks. 1994.
  • [2] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.
  • [3] R. Caruana. Multitask learning. In Learning to learn, pages 95–133. Springer, 1998.
  • [4] Y.-W. Chao, Y. Liu, X. Liu, H. Zeng, and J. Deng. Learning to detect human-object interactions. arXiv preprint arXiv:1702.05448, 2017.
  • [5] Y.-W. Chao, Z. Wang, Y. He, J. Wang, and J. Deng. Hico: A benchmark for recognizing human-object interactions in images. In Proceedings of the IEEE International Conference on Computer Vision, pages 1017–1025, 2015.
  • [6] V. Delaitre, J. Sivic, and I. Laptev. Learning person-object interactions for action recognition in still images. In Advances in neural information processing systems, pages 1503–1511, 2011.
  • [7] C. Desai and D. Ramanan. Detecting actions, poses, and objects with relational phraselets. In European Conference on Computer Vision, pages 158–172. Springer, 2012.
  • [8] C. Desai, D. Ramanan, and C. Fowlkes. Discriminative models for static human-object interactions. In Computer vision and pattern recognition workshops (CVPRW), 2010 IEEE computer society conference on, pages 9–16. IEEE, 2010.
  • [9] C. Gao, Y. Zou, and J.-B. Huang. ican: Instance-centric attention network for human-object interaction detection. In BMVC, 2018.
  • [10] G. Gkioxari, R. Girshick, P. Dollár, and K. He. Detecting and recognizing human-object interactions. arXiv preprint arXiv:1704.07333, 2017.
  • [11] S. Gupta and J. Malik. Visual semantic role labeling. arXiv preprint arXiv:1505.04474, 2015.
  • [12] H. Idrees, A. R. Zamir, Y.-G. Jiang, A. Gorban, I. Laptev, R. Sukthankar, and M. Shah. The thumos challenge on action recognition for videos “in the wild”. Computer Vision and Image Understanding, 155:1–23, 2017.
  • [13] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • [14] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
  • [15] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [16] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei. Visual relationship detection with language priors. In European Conference on Computer Vision, pages 852–869. Springer, 2016.
  • [17] S. Maji, L. Bourdev, and J. Malik. Action recognition from a distributed representation of pose and appearance. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 3177–3184. IEEE, 2011.
  • [18] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010.
  • [19] S. Qi, W. Wang, B. Jia, J. Shen, and S.-C. Zhu. Learning human-object interactions by graph parsing neural networks.
  • [20] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • [21] T. Simon, H. Joo, I. Matthews, and Y. Sheikh. Hand keypoint detection in single images using multiview bootstrapping. In CVPR, 2017.
  • [22] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  • [23] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.
  • [24] B. Yao and L. Fei-Fei. Grouplet: A structured image representation for recognizing human and object interactions. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 9–16. IEEE, 2010.
  • [25] B. Yao and L. Fei-Fei. Modeling mutual context of object and human pose in human-object interaction activities. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 17–24. IEEE, 2010.
  • [26] B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. Guibas, and L. Fei-Fei. Human action recognition by learning bases of action attributes and parts. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1331–1338. IEEE, 2011.
  • [27] M. Yatskar, L. Zettlemoyer, and A. Farhadi. Situation recognition: Visual semantic role labeling for image understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5534–5542, 2016.