A strong HOI Detection model without Frills!
We show that with an appropriate factorization, and encodings of layout and appearance constructed from outputs of pretrained object detectors, a relatively simple model outperforms more sophisticated approaches on human-object interaction detection. Our model includes factors for detection scores, human and object appearance, and coarse (box-pair configuration) and optionally fine-grained layout (human pose). We also develop training techniques that improve learning efficiency by: (i) eliminating train-inference mismatch; (ii) rejecting easy negatives during mini-batch training; and (iii) using a ratio of negatives to positives that is two orders of magnitude larger than existing approaches while constructing training mini-batches. We conduct a thorough ablation study to understand the importance of different factors and training techniques using the challenging HICO-Det dataset.
Human-object interaction (HOI) detection is the task of localizing and recognizing all instances of a predetermined set of human-object interactions. For instance, detecting the HOI “human-row-boat” means localizing a “human”, a “boat”, and predicting the interaction “row” for this human-object pair. Note that an image may contain multiple people rowing boats (or even the same boat), and the same person may simultaneously participate in multiple interactions with the same or different objects. For example, a person can simultaneously “sit on” and “row” a boat while “wearing” a backpack.
In recent years, increasingly sophisticated models and techniques have been proposed for HOI detection. For instance, Chao et al. encode the configuration of human-object box pairs using a CNN operating on a two-channel binary image called the interaction pattern, whereas Gkioxari et al. predict a distribution over target object locations based on human appearance using a mixture density network. Similarly, when encoding appearance, approaches range from multitask training of a human-centric branch along with an object classification branch in Faster-RCNN, to using an attention mechanism to gather relevant contextual information from the image.
In this work, we propose a no-frills, factored model for HOI detection which predicts conditional probabilities of HOI categories given a pair of candidate regions in an image and their appearance and layout encodings. Appearance and layout encodings are constructed using detections and ROI pooled features from a pretrained object detector (and optionally human pose from a pretrained pose detector). Our model consists of factors corresponding to human and object detection scores and an interaction term which is further composed of human and object appearance, and layout factors.
In contrast to these approaches, our layout encoding is more direct and requires no learning beyond training the object detector. We explore both a coarse layout encoding using position features computed from the human-object box-pair, and a fine-grained encoding of human pose using detected keypoints. Our appearance encoding of a region is the ROI-pooled representation from the object detector and does not require multi-task training or learned attention mechanisms. Appearance and layout factors in the interaction term are implemented using multi-layer perceptrons (MLPs) with up to two hidden layers.
We also develop the following training techniques for improving learning efficiency of our factored model:
Eliminating train-inference mismatch. Training the individual factors with separate losses creates a mismatch between training and inference, which would then require the scores of all factors to be fused heuristically to get HOI class probabilities. We eliminate this mismatch by directly optimizing the HOI class probabilities using a multi-label HOI classification loss (Fig. 2); see Tab. 2 for the mAP comparison of the Interaction Loss and the HOI Loss.
Rejecting easy negatives. Most related work begins by selecting detection candidates from an object detector, and inference is performed on the selected pairs of human and object candidates. This is sensible, since a pair of boxes need not be considered for the HOI class “human-ride-horse” if the first is not a “human” candidate or the second is not a “horse” candidate. During training, however, mini-batches are constructed from candidates from different object classes. Thus, in addition to learning that a candidate pair is “human-ride-horse” and not “human-walk-horse” (same object, different interaction), the model is forced to dedicate parameters to also learn that the candidate is not “human-drive-car” (different object and corresponding interactions). We introduce indicator terms in our object detection factors that reject such easy negatives (Fig. 2); see Tab. 2 for mAP with and without indicators.
Training with a large negative-to-positive ratio. We construct training mini-batches by sampling a number of negative box-pairs per positive that is two orders of magnitude larger than in related work. This partly supports our training strategy (Sec. 3.3), which considers box-pairs as candidates for a specific HOI class (the same pair for a different interaction with the same object category is treated as a different sample). Higher ratios are also expected because the number of negative pairs is quadratic in the number of region proposals, as opposed to linear for object detectors; see Tab. 2 for mAP at different negative-to-positive ratios.
In summary, our key contributions are: (i) a simple but competitive factored model for HOI detection that takes advantage of appearance and layout encodings constructed using a pretrained object detector (and optionally a pose detector); (ii) a comparison of coarse and fine-grained layout encodings; (iii) isolating the contribution of object appearance in addition to human appearance; and (iv) techniques for enhancing the learning efficiency of our factored model (and similar models in the literature [10, 9]).
Assessing interactions between humans and objects in images is a challenging problem which has received a considerable amount of attention from the machine learning, computer vision, and robotics communities in the last decade [25, 24, 8, 7, 6, 17].
Video benchmarks such as UCF101 focused on classifying a video sequence into one of a fixed set of action categories. While UCF101 only dealt with carefully trimmed videos, an artificial setting, the THUMOS challenge additionally introduced the task of temporal localization of activities in untrimmed videos. Image action recognition benchmarks such as Stanford 40 Actions and PASCAL VOC 2010 have also been used in the literature. While similar in intent, these action recognition challenges differ from human-object interaction detection in three ways: (1) the tasks are limited to images or videos containing a single human-centric action, such as bowling, diving, or fencing; (2) the action classes are disjoint and often involve interaction with an object unique to the activity (allowing models to cheat by simply recognizing the object); and (3) spatial localization of neither the person nor the object being interacted with is required.
Moving from actions to interactions, Chao et al. [5, 4] introduce the HICO and HICO-DET datasets to address the above limitations. The HICO dataset consists of a large collection of images annotated with a diverse set of human-object interactions over COCO object categories. Unlike previous tasks, HOI classification is multi-label in nature, since each image may contain multiple humans interacting with the same or different objects. More recently, Chao et al. extended the HICO dataset with exhaustive bounding box annotations for each of the HOI classes to create HICO-DET. Due to the human-centric nature of the annotation task and the predefined set of objects and interactions, HICO-DET does not suffer from the missing annotation problem (at least to the same extent) that plagues datasets such as Visual Genome and VRD, which are used for the general visual relationship (object-object interaction) detection task.
In a similar effort, Gupta et al. augment the COCO dataset by annotating people (agents) with action labels along with the location and labels of objects fulfilling various semantic roles for each action. In another visual equivalent of the semantic role labelling (SRL) task studied in NLP, Yatskar et al. create an image dataset for situation recognition, which is defined to subsume recognition of the activity, participating objects, and their roles.
In this work, we choose HICO-DET as a test bed for experimentation due to its large, diverse, and exhaustively annotated set of human-object interactions which allows for an accurate and meaningful evaluation. The task is also a natural extension of classical object detection to detection of human-object pairs with interaction labels.
Existing models for HOI detection. Chao et al. propose HO-RCNN, a 3-stream architecture with one stream each for a human candidate, an object candidate, and a geometric encoding of the pair of boxes using the proposed interaction pattern. Each stream produces scores for every possible object-interaction category (600 for HICO-DET). The sets of scores are combined using late fusion to make the final prediction. Note that this approach treats “ride bicycle” and “ride horse” as independent visual entities and does not use the knowledge of “ride” being a common component. In contrast, our approach exploits this compositionality to learn shared visual appearance and geometric representations (e.g., “ride” typically involves a human box above an object box). In other words, weight sharing between different HOI classes in our factored model makes it more data efficient than HO-RCNN, which predicts scores for the 600 HOI categories using independent weights in the last 600-way fully connected layer in each of the 3 streams.
Gkioxari et al. propose InteractNet, which takes a multitask learning perspective on the problem. The idea is to augment the Faster-RCNN object detection framework with a human-centric branch and an interaction branch that are trained jointly alongside the original object recognition branch. To incorporate geometric cues, a Mixture Density Network (MDN) is used to produce parameters of the object location distribution given the human appearance. This distribution is used to score candidate objects for a given human box. The model is trained using an object classification loss for the object branch, interaction classification losses for the human-centric action classification branch and the optional interaction branch, and a smooth L1 loss between the ground truth box-pair encoding and the mean predicted by the localization MDN. During inference, predictions from these branches are fused heuristically. In addition to differing in the details of the factorization and the appearance and layout encodings, our model directly optimizes the final HOI score obtained after fusing the individual factor scores, and we introduce training techniques for enhancing the learning efficiency of similar factored models for this task. We also more directly encode box-pair layout using absolute and relative bounding box features, which are then scored using a dedicated factor.
Gao et al. follow an approach similar to those above but introduce an attention mechanism that augments human and object appearance with contextual information from the image. The attention map is computed using keys derived from the human/object appearance encoding, and the context is computed as an attention-weighted average of convolutional features. The model is trained using an interaction classification loss. The only sources of contextual information in our model are the ROI-pooled region features from the object detector, and adding a similar attention mechanism may further improve performance.
In the following, we first present an overview of the proposed factor model, followed by details of different factors and our training strategy.
Given an image and a set of object-interaction categories of interest, human-object interaction (HOI) detection is the task of localizing all human-object pairs participating in one of the said interactions. The combinatorial search over human and object bounding-box locations and scales, as well as object and interaction labels, makes both learning and inference challenging. To deal with this complexity, we decompose inference into two stages. In the first stage, object-category-specific bounding box candidates are selected using a pretrained object detector such as Faster-RCNN. Then, for each HOI category (a human-object-interaction triplet), a set of candidate human-object box-pairs is constructed by pairing every human box candidate with every box candidate for the category's object. In the second stage, a factored model is used to score and rank the candidate box-pairs for each HOI category. Our factor graph consists of human and object appearance, box-pair configuration (coarse layout), and human-pose (fine-grained layout) factors that operate on appearance and layout encodings constructed from the outputs of pretrained object and human-pose detectors. The model is parameterized to share representations and computation across different object and interaction categories, so as to efficiently score candidate box-pairs for all HOI categories of interest in a single forward pass. See the inference algorithm listing for a detailed description of the inference procedure.
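The two-stage inference described above can be sketched as follows. This is an illustrative outline, not the paper's implementation: `detector`, `model.interaction_prob`, and the candidate/score structures are hypothetical placeholders.

```python
# Sketch of two-stage HOI inference (hypothetical helper names).
def detect_hois(image, hoi_categories, detector, model):
    # Stage 1: class-specific box candidates from a pretrained detector.
    # `detector` maps an image to {object class -> list of (box, score)}.
    candidates = detector(image)

    detections = []
    # Stage 2: score every human-object candidate pair for each HOI class.
    for (obj_cls, interaction) in hoi_categories:
        for human_box, human_score in candidates.get("human", []):
            for obj_box, obj_score in candidates.get(obj_cls, []):
                p = human_score * obj_score * model.interaction_prob(
                    human_box, obj_box, obj_cls, interaction, image)
                detections.append(
                    (human_box, obj_box, obj_cls, interaction, p))
    # Rank candidate pairs by predicted HOI probability.
    detections.sort(key=lambda d: d[-1], reverse=True)
    return detections
```

Note that all HOI categories reuse the same detector outputs, so the quadratic pairing, not detection, dominates the candidate count.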
For an image, given a human-object candidate box pair, the human pose keypoints detected inside the human box (if any), and the set of box candidates for each object category, the factored model computes the probability of occurrence of a human-object interaction as follows:
Here, one random variable denotes whether the first box is labeled as a human, another denotes whether the second box is labeled as the object category under consideration, and a third denotes whether the interaction assigned to the box-pair is the interaction under consideration. The above factorization assumes that human and object class labels depend on the individual boxes and the image, while the interaction label depends on the box-pair, the pose, the object label under consideration, and the image. The conditioning on the human and object labels ensures that for a box-pair in which either box fails its detection test, the probability of the HOI is predicted as zero. This is implemented using indicator functions in the first two terms. For brevity, we refer to the left-hand side of the above equation as the HOI probability. We now describe how the three terms are modelled.
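Written out, the factorization described above takes the following form (a reconstruction from the surrounding text; the symbol names $b_h, b_o$ for boxes, $p$ for pose, $I$ for the image, and $y$ for the binary label variables are ours):

```latex
P\left(y_{hov}=1 \mid b_h, b_o, p, I\right) \;=\;
\underbrace{P\!\left(y_h=1 \mid b_h, I\right)}_{\text{human detection}}
\cdot
\underbrace{P\!\left(y_o=1 \mid b_o, I\right)}_{\text{object detection}}
\cdot
\underbrace{P\!\left(y_v=1 \mid y_h=1,\, y_o=1,\, b_h, b_o, p, o, I\right)}_{\text{interaction term}}
```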
The first two terms in Eq. 1 are modelled using the set of candidate bounding boxes for each object class and the classification probabilities produced by a pretrained object detector. For any object category (including “human”), the detector term can be computed as
where the probability term corresponds to the probability of assigning the object class to the region by the object detector, and the indicator term checks whether the box belongs to the set of candidate bounding boxes for that class, selected from the set of all region proposals using non-maximum suppression and thresholding on class probabilities.
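In symbols, with $B_o(I)$ denoting the post-NMS candidate set for class $o$ and $s_o(b, I)$ the detector's class probability for box $b$, the detector term reads (notation ours, reconstructed from the description above):

```latex
P\!\left(y_o = 1 \mid b, I\right) \;=\; \mathbb{1}\!\left[\, b \in B_o(I) \,\right] \cdot s_o(b, I)
```

The indicator is what zeroes out candidate pairs drawn from the wrong object classes during training, as discussed in the easy-negative-rejection technique.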
The interaction term refers to the probability of the entities in the two boxes engaging in a given interaction. Note that the interaction term is conditioned on the object label. This allows the model to learn that only certain interactions are feasible for a given object. For example, it is possible to “clean” or “eat at” a “dining table” but not to “drive” or “greet” it. In practice, we found that this conditioning did not affect results significantly. To utilize appearance and layout information, the interaction term is further factorized as follows:
Here σ is the sigmoid function and each factor is a learnable deep network. We now describe each of these factors along with its network architecture and the appearance or layout encoding it operates on:
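One consistent reading of this factorization, with σ applied to a sum of per-factor scores (the extraction does not preserve the exact equation, and the symbols are ours), is:

```latex
P\!\left(y_v = 1 \mid y_h=1,\, y_o=1,\, b_h, b_o, p, o, I\right) \;=\;
\sigma\!\Big( f^{h}_{\mathrm{app}}(x_h) + f^{o}_{\mathrm{app}}(x_o)
  + f_{\mathrm{box}}(b_h, b_o, o) + f_{\mathrm{pose}}(p, b_o, o) \Big)_{v}
```

where $x_h, x_o$ are RoI-pooled appearance features, each $f$ outputs a vector of per-interaction scores, and the subscript $v$ selects the output neuron for interaction $v$.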
Appearance. The human and object appearance factors predict the interaction that the human and the object are engaged in based on visual appearance alone. The appearance of a box in an image is encoded using Faster-RCNN (ResNet-152 backbone) average-pooled fc7 features extracted from the RoI. By design, this representation captures context in addition to the content within the box. The fc7 features are fed into a multi-layer perceptron (MLP) with a single hidden layer with Batch Normalization and ReLU. The output layer has one neuron per interaction category.
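A minimal numpy sketch of such an appearance factor follows. The dimensions are assumptions: a 2048-d fc7 input (typical for a ResNet backbone; the exact width is lost from this copy), an arbitrarily chosen hidden width, and 117 interaction outputs as stated for HICO-Det. Batch Normalization is omitted for brevity.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class AppearanceFactor:
    """MLP with one hidden layer mapping RoI features to per-interaction scores.

    Assumed dimensions: 2048-d fc7 input, 512-d hidden layer (our choice),
    117 interaction outputs. Batch Normalization omitted for brevity.
    """

    def __init__(self, in_dim=2048, hidden_dim=512, num_interactions=117, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.01, (in_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.W2 = rng.normal(0.0, 0.01, (hidden_dim, num_interactions))
        self.b2 = np.zeros(num_interactions)

    def __call__(self, fc7):
        h = relu(fc7 @ self.W1 + self.b1)     # hidden layer with ReLU
        return h @ self.W2 + self.b2          # one score per interaction

factor = AppearanceFactor()
scores = factor(np.zeros((4, 2048)))          # batch of 4 RoI feature vectors
```

The human and object appearance factors would be two separate instances of such a network, and their outputs are combined with the layout factors before the sigmoid.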
Box Configuration. The object label and the absolute and relative positions and scales of the human and object boxes are often indicative of the interaction, even without the appearance (e.g., a human box above and overlapping with a “horse” box strongly suggests a “riding” interaction). The box factor captures this intuition by predicting a score for each interaction given an encoding of the bounding boxes and the object label. The object label is encoded as a one-hot vector. The bounding boxes are represented using a 21-dimensional feature vector: we encode the absolute position and scale of both the human and object boxes using box width, height, center position, aspect ratio, and area, and we encode the relative configuration of the human and object boxes using the relative position of their centers, the ratio of box areas, and their intersection over union. These 21-dimensional features are concatenated with their log absolute values and the object label encoding, and passed through an MLP with 2 hidden layers, 122-dimensional each (same as the input feature dimension), with Batch Normalization and ReLU.
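The layout quantities named above can be computed as in the sketch below. This is an illustrative encoding of the listed quantities (absolute width/height/center/aspect/area per box, plus relative center offset, area ratio, and IoU); the paper's exact 21-dimensional layout is not recoverable from this copy, so the feature order here is our own.

```python
import numpy as np

def box_features(human, obj):
    """Coarse layout features for a (human, object) box pair.

    Boxes are (x1, y1, x2, y2). Feature order is illustrative, not the
    paper's exact 21-d encoding.
    """
    def absolute(b):
        x1, y1, x2, y2 = b
        w, h = x2 - x1, y2 - y1
        # width, height, center x, center y, aspect ratio, area
        return [w, h, (x1 + x2) / 2, (y1 + y2) / 2, w / h, w * h]

    ah, ao = absolute(human), absolute(obj)

    # Relative configuration: center offset, area ratio, IoU.
    dx, dy = ao[2] - ah[2], ao[3] - ah[3]
    area_ratio = ao[5] / ah[5]

    ix1, iy1 = max(human[0], obj[0]), max(human[1], obj[1])
    ix2, iy2 = min(human[2], obj[2]), min(human[3], obj[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    iou = inter / (ah[5] + ao[5] - inter)

    return np.array(ah + ao + [dx, dy, area_ratio, iou])
```

As described in the text, these features would then be concatenated with their log absolute values and the one-hot object label before entering the MLP.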
Human Pose. We supplement the coarse layout encoded by bounding boxes with the more fine-grained layout information provided by human pose keypoints. We use OpenPose [2, 23, 21] to detect 18 keypoints for each person in the image. A human candidate box is assigned a keypoint skeleton if the smallest bounding box around the keypoints has a sufficiently large fraction of its area inside the human box. As with the box features, we encode both the absolute human pose and its location relative to the object candidate box. The absolute pose features consist of keypoint coordinates normalized to the human bounding box frame of reference, together with the confidence of each keypoint predicted by OpenPose. The relative pose features consist of the offsets of the top-left and bottom-right corners of the object box relative to each keypoint, along with the keypoint confidences. The absolute and relative pose features and their log values are concatenated with the one-hot object label encoding before being fed into the pose factor, which is also an MLP with 2 hidden layers with Batch Normalization and ReLU. The output layer, like those of the other factors, has 117 neurons.
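The relative half of the pose encoding can be sketched as follows; normalization, the absolute-pose half, and the log/one-hot concatenation are omitted, and the feature order is our own rather than the paper's.

```python
import numpy as np

def relative_pose_features(keypoints, conf, obj_box):
    """Offsets of the object box corners relative to each keypoint.

    keypoints: (18, 2) array of (x, y) OpenPose detections;
    conf: (18,) keypoint confidences; obj_box: (x1, y1, x2, y2).
    Returns 18*2 + 18*2 + 18 = 90 values (order is illustrative).
    """
    kp = np.asarray(keypoints, dtype=float)
    x1, y1, x2, y2 = obj_box
    top_left = np.array([x1, y1]) - kp       # (18, 2) offsets to corner 1
    bottom_right = np.array([x2, y2]) - kp   # (18, 2) offsets to corner 2
    return np.concatenate(
        [top_left.ravel(), bottom_right.ravel(), np.asarray(conf, float)])
```

Keeping the per-keypoint confidences in the feature vector lets the MLP learn to discount offsets computed from unreliable keypoints.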
Since more than one HOI label might be assigned to a pair of boxes, the model is trained in a fully supervised fashion using a multi-label binary cross-entropy loss. For each image in the training set, candidate box-pairs for each HOI category are assigned binary labels based on whether both the human and object boxes in the pair have an intersection-over-union (IoU) greater than 0.5 with a ground truth box-pair of the corresponding HOI category. During training, each sample in a mini-batch consists of a box pair, the HOI category for which the pair is a candidate (a pair is a candidate for an HOI class if the first box is a human candidate and the second is a candidate for the class's object category), a binary label indicating a match (or not) with a ground truth box-pair of that class, detection scores for the human and the object category corresponding to the class, and the input features for each factor. Pairs of boxes which are candidates for more than one HOI category are treated as multiple samples during training. Since the number of candidate pairs per image is typically three orders of magnitude larger than the number of positive samples, random sampling would leave most mini-batches with no positives. We therefore select all positive samples per image and then randomly sample 1000 negatives per positive. Given a mini-batch constructed from a single image, the loss is computed as
where the per-sample loss is the binary cross-entropy applied to the probability of the HOI class computed for that sample using Eq. 1. In our experiments, we only learn the parameters of the interaction term (the MLPs used to compute the appearance, box, and pose factors).
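The sampling-plus-loss recipe above can be sketched as follows; the helper and argument names are ours, and the 1000:1 ratio is the one stated in the text.

```python
import numpy as np

def sample_minibatch(labels, neg_per_pos=1000, rng=None):
    """Select all positives and a fixed number of random negatives per positive.

    labels: binary array over one image's candidate (box-pair, HOI) samples.
    Returns indices into that array; helper/argument names are ours.
    """
    rng = rng or np.random.default_rng(0)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_neg = min(len(neg), neg_per_pos * max(len(pos), 1))
    neg = rng.choice(neg, size=n_neg, replace=False)
    return np.concatenate([pos, neg])

def bce_loss(probs, labels, eps=1e-7):
    """Multi-label binary cross-entropy over the sampled mini-batch."""
    p = np.clip(probs, eps, 1 - eps)
    return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
```

The `probs` fed to the loss would be the fused HOI probabilities from Eq. 1, so the gradient calibrates the interaction term relative to the fixed detection terms.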
Table 1: mAP on HICO-Det. The first three columns are the Full, Rare, and Non-Rare splits; columns (1)-(6) bin HOI classes by number of training instances per class, from fewest to most.

| Model | Full | Rare | Non-Rare | (1) | (2) | (3) | (4) | (5) | (6) |
|---|---|---|---|---|---|---|---|---|---|
| VSRL (impl. by ) | 9.09 | 7.02 | 9.71 | - | - | - | - | - | - |
| Det + Box | 12.54 | 10.40 | 13.18 | 10.40 | 7.46 | 9.99 | 14.62 | 20.12 | 35.98 |
| Det + Human App | 11.12 | 8.82 | 11.80 | 8.82 | 7.73 | 9.19 | 13.41 | 15.85 | 26.42 |
| Det + Object App | 11.05 | 7.41 | 12.13 | 7.41 | 7.68 | 9.72 | 14.61 | 15.58 | 23.27 |
| Det + App | 15.74 | 11.35 | 17.05 | 11.35 | 10.58 | 13.96 | 20.11 | 22.76 | 34.75 |
| Det + Human App + Box | 15.63 | 12.45 | 16.58 | 12.45 | 9.94 | 12.69 | 19.05 | 23.60 | 39.63 |
| Det + Object App + Box | 15.68 | 10.47 | 17.24 | 10.47 | 9.97 | 12.84 | 20.48 | 23.88 | 40.87 |
| Det + App + Box | 16.96 | 11.95 | 18.46 | 11.95 | 11.02 | 14.00 | 22.02 | 25.01 | 41.13 |
| Det + Pose | 11.09 | 8.04 | 12.00 | 8.04 | 7.26 | 8.47 | 13.08 | 18.81 | 32.66 |
| Det + Box + Pose | 14.49 | 11.86 | 15.27 | 11.86 | 9.73 | 12.21 | 16.51 | 21.72 | 38.81 |
| Det + App + Pose | 15.50 | 10.14 | 17.10 | 10.14 | 10.40 | 13.11 | 20.40 | 23.45 | 36.08 |
| Det + App + Box + Pose | 17.18 | 12.17 | 18.68 | 12.17 | 11.28 | 14.49 | 22.08 | 25.27 | 41.47 |
The HICO-Det and V-COCO datasets are commonly used for evaluating HOI detection models. V-COCO is primarily used for legacy reasons, since at the time the HICO dataset only had image-level annotations. HICO-Det was created to extend HICO with bounding box annotations specifically for the HOI detection task. HICO-Det is both larger and more diverse than V-COCO: HICO-Det is annotated with 117 interactions over 80 object categories, resulting in a total of 600 HOI categories, whereas V-COCO only has 26 interactions and a training set a fraction of the size of HICO-Det's. Exhaustive annotation for each HOI category also makes HICO-Det more suitable for AP-based evaluation than VRD, which suffers from missing annotations. VRD also contains “human” as only one among many subjects, which makes evaluating the impact of fine-grained human pose less reliable due to the small sample size. Hence, HICO-Det is appropriate for evaluating our contributions and makes V-COCO evaluation redundant.
In addition to comparing to the current state-of-the-art, our experiments evaluate the contribution of different factors in our model (Tab. 1) and the impact of the proposed training techniques (Tab. 2). Our analysis also includes visualization of the distribution of performance across object and interaction categories (Fig. 5), interaction confusions (Fig. 3), and examples of top-ranking detections and failure cases (Fig. 4).
The HICO-Det dataset contains 38,118 training and 9,658 test images annotated with 600 HOI categories. We further split the training images to generate our actual training and validation sets. For all experiments we train on this smaller training set and use the validation set for model selection. The HOI categories are composed of 80 object categories (the same as the COCO classes) and 117 interactions. Each image contains only a few HOI detections on average.
Tab. 1 shows mAP of our final models, Det+App+Box and Det+App+Box+Pose (and ablations), in comparison to existing models in the literature on various sets of HOI categories: Full is mAP across all classes, Rare over classes with fewer than 10 training instances, and Non-Rare over the rest. To present a clearer picture, in addition to this Rare/Non-Rare split, we show results for a more fine-grained grouping of classes based on the number of training instances.
The model most similar to ours is InteractNet, which extends Faster-RCNN with a human-centric branch that produces interaction scores based on human (and optionally object) appearance and a distribution over the target object location. There are 4 factors contributing to the improved performance of our model over InteractNet: (i) the use of a significantly larger ratio of negative to positive box-pairs during mini-batch training (our model uses 1000 negatives per positive, whereas InteractNet uses a much smaller ratio for the detection branch and no negatives for the interaction branch); (ii) the box configuration term in our model directly scores box-pair features, a formulation easier to learn than predicting a distribution over target object locations from human appearance features alone; (iii) fixing the training-inference mismatch (Fig. 2); and (iv) easy negative rejection, which allows our model to focus on learning to rank only the hard candidate pairs for a particular HOI category, namely all combinations of human and object detections of the relevant category. The effects of factors (i), (iii), and (iv) on our model's performance are further investigated in Tab. 2 and Sec. 4.2. iCAN follows an approach and training procedure similar to InteractNet but augments region appearance features with contextual features computed using an attention mechanism, a contribution complementary to ours. Our ablation study over factors (Tab. 1) is also more exhaustive than [10, 9]. For instance, we conclusively show that object appearance provides useful information complementary to human appearance for HOI detection (Det + Human App: 11.12, Det + Object App: 11.05, Det + App: 15.74 mAP).
HO-RCNN takes human appearance, object appearance, and box configuration encoded as an interaction pattern as inputs and processes them with 3 separate branches, each of which produces a score for each HOI category. The scores are combined along with the object detection scores to produce HOI probabilities, and the model is trained using a multi-label binary classification loss. Our model improves over HO-RCNN in two ways: (i) weight sharing in our factored model (as in InteractNet and iCAN) makes it more data efficient than HO-RCNN, which predicts scores for 600 HOI categories using independent weights in the last 600-way fully connected layer; and (ii) we explicitly encode spatial layout, as opposed to HO-RCNN, which has to learn such a representation via a CNN.
GPNN adopts a completely different approach based on message passing over an inferred graph. While in theory such an approach jointly infers all HOI detections in an image (as opposed to making predictions for one candidate box-pair at a time), its advantages over carefully designed but simpler fixed-graph approaches like our factored model (which also enjoys the benefit of context, in spite of pairwise inference) remain to be demonstrated.
Training the model using an interaction classification loss on the probabilities predicted by the interaction term, as done in [10, 9], is suboptimal in comparison to training using the HOI classification loss (see Tab. 2), even though the same set of parameters is optimized by both losses. This is because the latter provides an opportunity for the interaction term to calibrate itself relative to the detection terms. This approach is also used in prior work, but without the strong weight sharing assumptions made by our factored model.
A distinguishing feature of our factored model is the use of indicator functions in the detection score factors (Eq. 3) and the loss function (Eq. 4). The indicators ensure that a box-pair predicts zero probability for an HOI category whose human or object detection test fails, and that the model learns to rank only the relevant box-pairs (those in the candidate set for the class). Tab. 2 shows that, even when using the indicators during inference, not using them during training causes a drop in mAP.
Finally, as shown in Tab. 2, increasing the ratio of negative box-pairs sampled per positive in a mini-batch during training leads to a dramatic increase in performance. This is in contrast to the low ratios typically used for training object detectors and in related work [4, 10]. The high ratio partly supports our training strategy (Sec. 3.3), which considers box-pairs as candidates for a specific HOI class (the same pair for a different interaction with the same object category is treated as a different sample). Moreover, since the number of negative pairs is quadratic in the number of region proposals, as opposed to linear for object detectors, higher ratios are to be expected for learning to reject false positives.
Table 2: effect of the proposed training techniques.

| Neg./Pos. | Indicators | HOI Loss | Interaction Loss | mAP |
To identify the role of different sources of appearance and spatial information in our model, we train models with subsets of the available factors.
The role of individual factors can be assessed by comparing Det, Det+Box, Det+App, and Det+Pose. Note that the appearance terms lead to the largest gains over Det, followed by Box and Pose. We further analyse the contribution of human and object appearance towards predicting interactions. Interestingly, while Det+Human App and Det+Object App perform comparably (11.12 and 11.05 mAP), their combination outperforms either of them with an mAP of 15.74, showing that human and object appearance provide some complementary information; an mAP of 11.12 or less for the combination would have indicated completely redundant or noisy signals. A similar sense of complementarity can be assessed from Tab. 1 for the App-Box, App-Pose, and Box-Pose pairs.
While Det+Box+Pose improves over Det+Box, Det+App+Pose and Det+App perform comparably. Similarly, Det+App+Box+Pose only slightly improves over Det+App+Box. This suggests that while it is useful to encode fine-grained layout in addition to coarse layout, human appearance encoded using object detector features may already capture human pose information to some extent.
Another way of understanding the role of the factors is to consider the drop in performance when a particular factor is removed from the final model. Relative to Det+App+Box+Pose, the drops are 2.69, 1.68, and 0.22 mAP when removing the App, Box, and Pose factors respectively (Tab. 1).
Fig. 5 visualizes the spread of performance of our final model across interactions with a given object and across objects for a given interaction. The figure shows that for most objects certain interactions are much easier to detect than others (with the caveat that AP computation for any class is sensitive to the number of positives for that class in the test set). A similar observation holds for different objects given an interaction. In addition, we observe that interactions which can occur with only a specific object category (indicated by the absence of a box in the plot), such as “kick-ball” and “flip-skateboard”, are easier to detect than those that tend to occur with more than one object, such as “cut” and “clean”, which can have drastically different visual and spatial appearance depending on the object. The heatmaps in Fig. 3 show the interactions that are confused by different models. Comparing heatmap (b) with (a) shows the role of the appearance factor in reducing confusion between interactions. For instance, without App, “eat” is confused with “brush with” and “drink with”, but not in the final model. Similarly, (c) and (d) can be compared with (a) for the effects of the Box and Pose factors respectively.
Qualitative results (Fig. 4) demonstrate the advantages of building HOI detectors on the strong foundation of object detectors. False positives are more commonly due to an incorrect interaction than an incorrect object. Interaction errors are often due to fine-grained differences between classes: e.g., “carry” vs. “wield” a “baseball bat”, and “inspect” vs. “repair” a “boat”. Notice that in some examples, like “inspect airplane” and “watch bird”, the cues for preventing false positives are as subtle as gaze direction.
We propose a no-frills approach to HOI detection which is competitive with the existing literature without many of its complexities. This is achieved through an appropriate factorization of the HOI class probabilities, encodings of appearance and layout constructed from the outputs of pretrained object and pose detectors, and improved training techniques. Our ablation study shows the importance of human and object appearance, coarse layout, and fine-grained layout for the HOI detection task. We also evaluate the significance of the proposed training techniques, which can easily be incorporated into other factored models in the literature.
Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, 2017.
In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pages 9–16. IEEE, 2010.
Action recognition from a distributed representation of pose and appearance. In CVPR, 2011, pages 3177–3184. IEEE.
Learning human-object interactions by graph parsing neural networks. 2018.