Human-Object Interaction (HOI) detection, which aims to localize and infer relationships between humans and objects in images/videos, is an essential step towards deeper scene and action understanding [chao2018learning, gao2018ican]. In real-world scenarios, long-tailed distributions are common for the data perceived by the human vision system, e.g., actions/verbs and objects [liu2019large]. The combinatorial nature of HOI further highlights the issue of long-tailed distributions in HOI detection, while humans can efficiently learn to recognize seen and even unseen HOIs from limited samples. An intuitive example of open long-tailed HOI detection is shown in Figure 1, in which one can easily recognize the unseen action "ride bear", even though such an interaction may never have been observed. However, existing HOI detection approaches usually focus on either the head [gao2018ican, liao2019ppdm, wang2018low], the tail [xu2019learning], or unseen categories [shen2018scaling, preye2019detecting], leaving the problem of open long-tailed HOI detection poorly investigated.
Open long-tailed HOI detection falls into the category of the long-tailed zero-shot learning problem, which is usually treated as several isolated problems, including long-tailed learning [japkowicz2002class, he2009learning], few-shot learning [fei2006one, vinyals2016matching], and zero-shot learning [lampert2009learning]. To address the problem of imbalanced training data, existing methods mainly focus on three strategies: 1) re-sampling [drummond2003c4, han2005borderline, kang2019decoupling]; 2) re-weighted loss functions [cui2019class, cao2019learning, hayat2019gaussian]; and 3) knowledge transfer [wang2017learning, liu2019large, fei2006one, lampert2009learning, schonfeld2019generalized, frome2013devise]. Specifically, re-sampling and re-weighted loss functions are usually designed for the imbalance problem, while knowledge transfer is introduced to relieve the long-tailed [wang2017learning], few-shot [snell2017prototypical], and zero-shot problems [xian2018feature, frome2013devise]. Recently, two popular knowledge transfer strategies have received increasing attention from the community: data generation [wang2017learning, wang2018low, xian2018feature, liu2019large, fei2006one, lampert2009learning, schonfeld2019generalized, keshari2020generalized] (transferring knowledge from head/base classes to tail/unseen classes) and visual-semantic embedding [frome2013devise] (transferring knowledge from language). Following the first line of work, we address the problem of open long-tailed HOI detection from the perspective of HOI generation.
Unlike the samples in typical long-tailed zero-shot learning for visual recognition, each HOI sample is composed of a verb and an object, and different HOIs may share the same verb or object (e.g., "ride bike" and "ride horse"). In cognitive science, humans perceive concepts as compositions of shareable components [biederman1987recognition, hoffman1983parts] (e.g., the verb and object in HOI), which indicates that humans can conceive a new concept through a composition of existing components. Inspired by this, several zero- and few-shot HOI detection approaches have been proposed to enforce the factored primitive (verb and object) representations of the same primitive class to be similar among different HOIs, such as the factorized model [shen2018scaling, bansal2020detecting] and the factored visual-language model [xu2019learning, preye2019detecting, bansal2020detecting]. However, regularizing factor representations, i.e., enforcing the same verb/object representation to be similar among different HOIs, is only sub-optimal for HOI detection. Recently, Hou et al. [hou2020visual] propose to compose novel HOI samples by combining decomposed verbs and objects between pair-wise images and within a single image. Nevertheless, it remains a great challenge to compose massive HOI samples in each mini-batch due to the limited number of HOIs in each image, especially when the distribution of objects/verbs is also long-tailed. We demonstrate the distribution of the number of objects in Figure 2.
The long-tailed distribution of objects/verbs makes it difficult to compose new HOIs from each mini-batch, significantly degrading the performance of compositional learning-based methods for rare and zero-shot HOI detection [hou2020visual]. Inspired by the recent success of visual object representation generation [xian2018feature, hariharan2017low, wang2018low], we thus apply fabricated object representations, instead of fabricated verb representations, to compose more balanced HOIs. We refer to the proposed compositional learning framework with fabricated object representations as Fabricated Compositional Learning, or FCL. Specifically, we first extract verb representations from input images, and then design a simple yet efficient object fabricator to generate object representations. Next, the generated visual object features are combined with the verb features to compose new HOI samples. With the proposed object fabricator, we are able to generate balanced objects for each verb within each mini-batch of training data and thus compose massive balanced HOI training samples.
The main contributions of this paper can be summarized as follows: 1) we propose to compose HOI samples for open long-tailed HOI detection; 2) we design an object fabricator to generate objects for HOI composition; and 3) we significantly outperform recent state-of-the-art methods on the HICO-DET dataset on rare and unseen categories.
2 Related Works
HOI Detection. HOI detection is essential for deeper scene and action understanding [chao2018learning]. Recent HOI detection approaches usually focus on representation learning [gao2018ican, zhou2019relation, ulutan2020vsgnet, wang2019deep, wan2019pose], zero/few-shot generalization [shen2018scaling, xu2019learning, preye2019detecting, bansal2020detecting, hou2020visual], and one-stage HOI detection [liao2019ppdm, wang2020learning]. Specifically, existing methods improve HOI representation learning by exploring the relationships among different features [qi2018learning, zhou2019relation, ulutan2020vsgnet], including pose information [li2018transferable, wan2019pose, li2020detailed], context [gao2018ican, wang2019deep], and human parts [zhou2019relation]. Generalization methods for HOI detection mainly include the visual-language model [preye2019detecting, xu2019learning], the factorized model [shen2018scaling, gupta2019no, ulutan2020vsgnet, bansal2020detecting], and HOI composition [hou2020visual]. Recently, Liao et al. [liao2019ppdm] and Wang et al. [wang2020learning] propose to detect the interaction point for HOI via heatmap-based localization [newell2016stacked]. Wang et al. [wang2020discovering] try to detect HOIs with novel objects by leveraging human visual clues to localize the interacting objects. However, existing HOI approaches usually fail to investigate the imbalance issue and zero-shot detection in a unified way. Inspired by the factorized model [shen2018scaling], we propose to compose visual verbs and fabricated objects to address the open long-tailed issue in HOI detection. Furthermore, according to whether the objects are detected with a separate detector, existing HOI detection approaches can be divided into two categories: 1) one-stage [shen2018scaling, liao2019ppdm, wang2020learning, gkioxari2018detecting] and 2) two-stage [gao2018ican, li2018transferable, zhou2019relation, ulutan2020vsgnet, wang2019deep, xu2019learning, bansal2020detecting]. Two-stage methods usually achieve better performance, and our method falls into this category.
Compositional Learning. Irving Biederman illustrates that human representations of concepts are decomposable [biederman1987recognition]. Meanwhile, Lake et al. [lake2017building] argue that compositionality is one of the key building blocks of a human-like learning system. Tokmakov et al. [tokmakov2019learning] apply compositional deep representations to few-shot learning. An external knowledge graph and graph convolutional networks are used in [kato2018compositional] to compose verb-object pairs for HOI recognition. Recently, Hou et al. [hou2020visual] propose a visual compositional learning framework to compose HOIs from image pairs for HOI detection, which however does not address the open and long-tailed issues. Therefore, we further compose verb and fake object representations for HOI detection.
Generalized Zero/Few-Shot Learning. Different from typical zero/few-shot learning [fei2006one, lampert2009learning, vinyals2016matching], generalized zero/few-shot learning [xian2018zero] is a more realistic variant, since the performance is evaluated on both seen and unseen classes [schonfeld2019generalized, chao2016empirical]. The distribution of HOIs is naturally long-tailed [chao2018learning], i.e., most classes have only a few training examples. Moreover, open long-tailed HOI detection aims to handle the long-tailed, low-shot, and zero-shot issues in a unified way. The long-tailed data distribution [japkowicz2002class, he2009learning, huang2016learning] is one of the challenging problems in visual recognition. Currently, re-sampling [gupta2019lvis, kang2019decoupling], specific losses [lin2017focal, cui2019class, cao2019learning, hayat2019gaussian], knowledge transfer [wang2017learning, liu2019large], and data generation [wang2018low, kumar2018generalized, xian2018feature, alfassy2019laso] are the major strategies for imbalanced learning [japkowicz2002class, he2009learning, huang2016learning]. To make full use of the compositional nature of HOI, we aim to compose HOI samples via visual feature generation to relieve the open long-tailed issue in HOI detection. Recent feature generation methods [kumar2018generalized, xian2018feature] mainly depend on Variational Autoencoders [kingma2013auto] or Generative Adversarial Networks [goodfellow2014generative], which usually suffer from the problem of mode collapse [salimans2016improved]. Wang et al. [wang2018low] present a method for low-shot learning that directly learns to hallucinate examples that are useful for classification. Similar to [wang2018low], we compose HOI samples with an object fabricator in an end-to-end optimization without using an adversarial loss.
3 Method

In this section, we first describe the multi-branch compositional learning framework for HOI detection. We then introduce the proposed fabricated compositional learning for open long-tailed HOI detection.
3.1 Multi-branch HOI Detection
HOI detection aims to find the interactions between humans and different objects in a given image/video. Existing HOI detection methods [gao2018ican, li2018transferable, bansal2020detecting] usually contain two separate stages: 1) human and object detection; and 2) interaction detection. Specifically, we first use a common object detector, e.g., Faster R-CNN [ren2015faster], to localize the positions and extract the features of both humans and objects. According to the union of the human and object bounding boxes, we then extract the verb feature from the feature map of the backbone network via the ROI-Pooling operation. Similar to [gao2018ican, gupta2019no, li2018transferable], an additional stream for the spatial pattern, i.e., the spatial stream, is defined as the concatenation of human and object masks, i.e., the value within the human/object bounding box region is 1 and 0 elsewhere. As a result, we obtain several input streams from the first stage, i.e., the human stream, object stream, verb stream, and spatial stream.
The input streams from the first stage are then used to construct different branches in the second stage: 1) the spatial HOI branch, which concatenates the spatial and human streams to construct the spatial HOI feature for HOI recognition; 2) the HOI branch, which concatenates the verb and object streams; and 3) the fabricated compositional branch, which is based on a new stream, the fabricator stream, to generate fake object features for composing new HOIs. Specifically, the fabricated compositional branch generates novel HOIs by combining visual verb features with generated object features. The overall multi-branch HOI detection framework is shown in Figure 3, and we leave the details of the fabricated compositional branch to the next section.
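The branch construction above can be sketched with plain NumPy. This is a minimal illustration of how the second-stage branch inputs are assembled from the first-stage streams; the 1024-d stream size is an assumption for illustration, not a value from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
human = rng.standard_normal(1024)    # human appearance stream
obj = rng.standard_normal(1024)      # object appearance stream
verb = rng.standard_normal(1024)     # verb stream (union-box feature)
spatial = rng.standard_normal(1024)  # spatial stream (mask-based pattern)

# Spatial HOI branch input: concatenation of spatial and human streams.
spatial_hoi_in = np.concatenate([spatial, human])
# HOI branch input: concatenation of verb and object streams.
hoi_in = np.concatenate([verb, obj])

assert spatial_hoi_in.shape == (2048,)
assert hoi_in.shape == (2048,)
```

The fabricated compositional branch would reuse the same verb stream, pairing it with generated rather than real object features.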
3.2 Fabricated Compositional Learning
The motivation of compositional learning is to decompose a model/concept into several sub-models/concepts, in which each sub-model/concept focuses on a specific task, and then all responses are coordinated and aggregated to make the final prediction [biederman1987recognition]. Recent compositional learning methods for HOI detection consider each HOI as the combination of a verb and an object, and compose new HOIs from objects and verbs within the mini-batch of training samples [kato2018compositional, hou2020visual]. However, existing compositional learning methods fail to address the problem of the long-tailed distribution of objects.
To address the open long-tailed issue, we propose to generate balanced objects for each decoupled visual verb as follows. Formally, we denote y_v as the label of a verb representation x_v, y_o as the label of an object representation x_o, and y_h as the HOI label of the pair <x_v, x_o>. Given another verb representation x'_v (sharing the same label y_v with x_v) and another object representation x'_o (sharing the same label y_o with x_o), regardless of the sources of the verb and object representations, an effective composition of verb and object should satisfy

f(x'_v, x'_o) = f(x_v, x_o) = y_h,    (1)

where f indicates the HOI classification network. By doing this, we can compose a new verb-object pair <x'_v, x'_o>, which has the same semantic type as the real pair <x_v, x_o>, to relieve the scarcity of rare and unseen HOI categories. To generate effective verb-object pairs, we regularize the verb representation and the object representation such that the same verbs/objects have similar feature representations.
Similar to previous approaches, such as factored visual-language joint embedding [xu2019learning, preye2019detecting] and the factorized model [shen2018scaling, gupta2019no], when x'_v is similar to x_v and x'_o is similar to x_o, Equation (1) can be generalized to HOI detection via the compositional branch. We refer to the proposed compositional learning framework with fabricated object representations as Fabricated Compositional Learning, or FCL. We train the proposed method with composed HOI samples in an end-to-end manner, and the overall loss function is defined as follows:

L = L_sp + λ1 L_hoi + λ2 L_comp + λ3 L_reg,    (2)

where L_reg aims to regularize verb and object features, L_comp indicates a typical compositional learning loss for the classification network with composite HOI samples as input, L_sp is the loss for the spatial HOI branch, L_hoi is the classification loss of the HOI branch, and λ1, λ2, λ3 are the hyper-parameters balancing the different loss functions. Specifically, object features extracted from a pre-trained object detector backbone (Faster R-CNN [ren2015faster]) are usually already discriminative. Thus, we only regularize the verb representation.
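The overall objective described above can be sketched as a simple weighted sum. This is a minimal sketch under the assumption that the loss is a weighted combination of the four branch losses; the symbol names (l_sp, l_hoi, l_comp, l_reg) and the default weights (taken from the implementation details in Section 4.2) are our own notational choices.

```python
def total_loss(l_sp, l_hoi, l_comp, l_reg, lam1=2.0, lam2=0.5, lam3=0.3):
    """Weighted sum of branch losses: spatial HOI branch (coefficient 1),
    HOI branch, compositional branch, and verb regularization."""
    return l_sp + lam1 * l_hoi + lam2 * l_comp + lam3 * l_reg

# With all branch losses equal to 1.0: 1 + 2.0 + 0.5 + 0.3 = 3.8
assert abs(total_loss(1.0, 1.0, 1.0, 1.0) - 3.8) < 1e-9
```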
3.2.1 Object Generation
An HOI is composed of a verb and an object, in which the verb is usually a very abstract notion compared to the object, making it difficult to directly generate verb features. Recent visual feature generation methods have demonstrated the effectiveness of feature generation for visual object recognition [wang2018low, xian2018feature]. Therefore, we devise an object fabricator to generate object feature representations for composing novel HOI samples.
The overall framework of object generation is shown in Figure 4. Specifically, we maintain a pool of object identity embeddings {e_1, ..., e_N}, where N is the number of different object classes (we describe three kinds of embeddings in the supplementary material). In each HOI, the pose of the object is usually influenced by the human who is interacting with the object [zhang2020phosa], and the person interacting with the object is closely related to the verb feature representation. Thus, for each extracted verb feature x_v and the j-th object (j ∈ {1, ..., N}), we concatenate the object identity embedding e_j, the verb feature x_v, and a noise vector z as the input of the object fabricator, i.e.,

x̂_o = F(e_j, x_v, z),    (3)

where x̂_o is the fake object feature and F indicates the object fabricator network. Here, the noise z is used to increase the diversity of the generated objects. We then combine the fake object feature x̂_o and the verb x_v to compose a new HOI sample <x_v, x̂_o>. During training, both real HOIs and composite HOIs share the same HOI classification network f.
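The fabricator described above can be sketched as a small two-layer MLP in NumPy. This is a minimal sketch under stated assumptions: the dimensions, the ReLU hidden layer, and the helper name `fabricate_object` are illustrative choices, not details confirmed by the paper (which only states the fabricator is a two-layer MLP).

```python
import numpy as np

def fabricate_object(embed, verb_feat, noise, W1, b1, W2, b2):
    """Two-layer MLP: concat(identity embedding, verb feature, noise) -> fake object feature."""
    x = np.concatenate([embed, verb_feat, noise])
    h = np.maximum(W1 @ x + b1, 0.0)  # hidden layer with ReLU
    return W2 @ h + b2

rng = np.random.default_rng(0)
d_emb, d_verb, d_noise, d_hid, d_obj = 64, 2048, 64, 1024, 2048
W1 = rng.standard_normal((d_hid, d_emb + d_verb + d_noise)) * 0.01
b1 = np.zeros(d_hid)
W2 = rng.standard_normal((d_obj, d_hid)) * 0.01
b2 = np.zeros(d_obj)

verb = rng.standard_normal(d_verb)
# One fake object per identity embedding: balanced objects for this verb.
fakes = [fabricate_object(rng.standard_normal(d_emb), verb,
                          rng.standard_normal(d_noise), W1, b1, W2, b2)
         for _ in range(5)]
assert all(f.shape == (d_obj,) for f in fakes)
```

Each fake object feature would then be concatenated with the verb feature and fed to the shared HOI classifier.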
3.2.2 Efficient HOI Composition
To compose new HOIs from verb and object representations, we need to remove infeasible composite HOIs (e.g., "ride vase"), as illustrated in Figure 4. To avoid frequently checking each verb-object pair, we use an efficient HOI composition similar to [hou2020visual]. Specifically, the HOI label space is decoupled into verb and object label spaces, i.e., the co-occurrence matrices A_v ∈ {0,1}^(C×N_v) and A_o ∈ {0,1}^(C×N_o), where N_v, N_o, and C indicate the number of verbs, objects, and HOI categories, respectively. Given a one-hot HOI label vector y_h, we then have the verb label vector,

y_v = y_h A_v,    (4)

where y_v can be a multi-hot vector with multiple verbs. Similarly, combining the verb with all objects, we have the matrix Y_o as the labels of all fake objects. Let Y_v denote the verb labels corresponding to the fake object features; the new interaction labels can then be evaluated as follows,

Y_h = (Y_v A_v^T) & (Y_o A_o^T),    (5)

where & indicates the logical operation "and". Finally, the logical operation automatically filters out the infeasible HOIs, since the labels of those infeasible HOIs are all-zero vectors in the label space.
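The label composition and filtering above can be sketched on a toy label space with NumPy. The matrix contents and class counts below are illustrative only (3 verbs, 2 objects, 4 valid HOI categories), not taken from HICO-DET; the point is that the logical "and" of the verb-consistent and object-consistent HOI sets yields all-zero labels for infeasible pairs.

```python
import numpy as np

# A_v[c, v] = 1 if HOI category c uses verb v; A_o[c, o] = 1 if it uses object o.
A_v = np.array([[1, 0, 0],   # HOI 0: verb 0 + object 0
                [1, 0, 0],   # HOI 1: verb 0 + object 1
                [0, 1, 0],   # HOI 2: verb 1 + object 0
                [0, 0, 1]])  # HOI 3: verb 2 + object 1
A_o = np.array([[1, 0],
                [0, 1],
                [1, 0],
                [0, 1]])

def compose_label(y_v, y_o):
    # Logical "and" of HOIs consistent with the verb and with the object.
    return (y_v @ A_v.T).astype(bool) & (y_o @ A_o.T).astype(bool)

y_feasible = compose_label(np.array([1, 0, 0]), np.array([1, 0]))    # verb 0 + object 0
y_infeasible = compose_label(np.array([0, 0, 1]), np.array([1, 0]))  # verb 2 + object 0

assert y_feasible.astype(int).tolist() == [1, 0, 0, 0]  # maps to HOI 0
assert not y_infeasible.any()  # infeasible pair filtered to an all-zero label
```

Stacking the verb and object label vectors into matrices gives the batched form of Equation (5) without any per-pair feasibility lookup.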
Training. The verb feature contains the pose information of the object, making it difficult to jointly train the network with the object fabricator from scratch. Therefore, we introduce a step-wise training strategy for long-tailed HOI detection. Firstly, we pre-train the network with L_sp, L_hoi, and L_reg, without the fabricator branch. Then, we fix the pre-trained model and train the randomly initialized object fabricator via the loss for the fabricator branch, L_comp. Lastly, we jointly fine-tune all branches with the overall loss L in an end-to-end manner. To avoid the bias towards seen data introduced in the first step, we optimize the network in one step for zero-shot HOI detection (see the analysis in Section 4.4).
Inference. The fabricated branch is only used in the training stage, i.e., we remove it during inference. Similar to previous multi-branch methods [gao2018ican, li2018transferable, hou2020visual], for each human-object bounding box pair (b_h, b_o), the final HOI prediction S^c for each category c can be evaluated as follows,

S^c = s_h · s_o · S^c_sp · S^c_hoi,    (6)

where s_h and s_o indicate the object detection scores for the human and the object, respectively, and S^c_sp and S^c_hoi are the scores from the spatial branch and the HOI branch, respectively.
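The score fusion at inference can be sketched as follows. This assumes a multiplicative fusion of the detection scores and the two branch scores, which is a common choice in two-stage HOI detectors; the exact fusion in the paper's equation may differ.

```python
def hoi_score(s_h, s_o, s_sp, s_hoi):
    """Fuse human/object detection scores with per-category branch scores."""
    return s_h * s_o * s_sp * s_hoi

# A confident detection with moderate branch scores: 0.9 * 0.8 * 0.5 * 0.5 = 0.18
assert abs(hoi_score(0.9, 0.8, 0.5, 0.5) - 0.18) < 1e-9
```

Multiplicative fusion suppresses a pair whenever either the detector or an interaction branch is uncertain, which is why low-quality detections sharply limit two-stage HOI performance.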
4 Experiments

In this section, we first introduce the datasets and metrics, and then provide the implementation details of our method. Next, we present our experimental results in comparison with state-of-the-art approaches. Finally, we conduct ablation studies to validate the components of our work.
4.1 Datasets and Metrics
We adopt the largest HOI dataset, HICO-DET [chao2018learning], which contains 47,776 images, including 38,118 images for training and 9,658 images for testing. All 600 HOI categories are constructed from 80 object categories and 117 verb categories. HICO-DET provides more than 150k annotated human-object pairs. In addition, V-COCO [gupta2015visual] is another, smaller HOI dataset with 29 categories. Considering that V-COCO mainly focuses on verb recognition and does not exhibit a severe long-tailed issue, we mainly evaluate the proposed method on HICO-DET. We also report results on visual relation detection [lu2016visual, zhan2019exploring], which requires detecting the triplet (subject, predicate, object), in the supplementary materials. We follow the evaluation settings in [chao2018learning]: an HOI prediction is a true positive if 1) both the human and object bounding boxes have IoUs larger than 0.5 with the reference ground truth bounding boxes; and 2) the HOI prediction is accurate.
4.2 Implementation Details
Similar to [bansal2020detecting, hou2020visual], our HOI detection model contains two separate stages: 1) we fine-tune the Faster R-CNN detector pre-trained on COCO [lin2014microsoft] using HICO-DET to detect the humans and objects (we use the Faster R-CNN detector implemented in Detectron2 [wu2019detectron2]); and 2) we use the proposed FCL model for HOI classification. Specifically, all branches are two-layer MLP sigmoid classifiers with 2048-d input and 1024-d hidden units. The fabricator is a two-layer MLP. L_reg is implemented as a sigmoid classifier on the verb representation. L_hoi, L_sp, and L_comp are binary cross-entropy losses. The co-occurrence matrices A_v and A_o are set according to the HOI dataset, and can also be set by prior knowledge to detect more types of unseen HOIs. Besides, to prevent the fabricated HOIs from dominating the model optimization, we randomly sample fabricated HOIs in each mini-batch so that the number of fabricated HOIs is at most three times the number of non-fabricated HOIs. We train our network for one million iterations with the SGD optimizer on the HICO-DET dataset, with an initial learning rate of 0.01, a weight decay of 0.0005, and a momentum of 0.9. We set λ1 as 2.0, λ2 as 0.5, and λ3 as 0.3, while the coefficient of L_sp is set to 1. The hyper-parameters are ablated in the supplementary materials. We jointly fine-tune the model with the object fabricator for 500k iterations, and decay the initial learning rate of 0.01 with a cosine annealing schedule. All our experiments on HICO-DET are conducted using TensorFlow [abadi2016tensorflow] on a single Nvidia GeForce RTX 2080Ti GPU. We evaluate V-COCO based on PMFNet [wan2019pose] with two GPUs. We do not use the auxiliary verb loss since there are only two kinds of objects on V-COCO. We set λ1 as 1 and λ2 as 0.25 on V-COCO.
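The mini-batch cap on fabricated HOIs described above can be sketched as follows. The helper name `sample_fabricated` and the seeding are hypothetical; only the cap of three times the number of non-fabricated HOIs comes from the text.

```python
import random

def sample_fabricated(fabricated, n_real, cap=3, seed=0):
    """Randomly keep at most cap * n_real fabricated HOIs in a mini-batch."""
    limit = cap * n_real
    if len(fabricated) <= limit:
        return list(fabricated)
    rng = random.Random(seed)
    return rng.sample(list(fabricated), limit)

# 100 candidate fabricated HOIs, 10 real HOIs -> keep 30 fabricated at most.
assert len(sample_fabricated(range(100), n_real=10)) == 30
# Fewer candidates than the cap -> keep them all.
assert len(sample_fabricated(range(5), n_real=10)) == 5
```

Without such a cap, a single verb paired with all N object embeddings could flood the batch with synthetic samples and skew the optimization.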
4.3 Comparison to Recent State-of-the-Arts
Our method aims to relieve open long-tailed HOI detection, whereas current approaches usually focus on full, rare, or unseen categories separately. To compare with state-of-the-art methods, we evaluate our method on long-tailed detection and generalized zero-shot detection separately. HOI detection results are evaluated with mean average precision (mAP, %).
4.3.1 Effectiveness for Zero-Shot HOI Detection
There are different settings [bansal2020detecting] for zero-shot HOI detection: 1) unseen composition; and 2) unseen object. Specifically, in the unseen composition setting, the training data contains all factors (i.e., verbs and objects) but misses some verb-object pairs; the unseen object setting requires detecting unseen HOIs whose objects do not appear in the training data. For unseen composition HOI detection, similar to [hou2020visual], we select two groups of 120 unseen HOIs, from the tail preferentially (rare first) and from the head preferentially (non-rare first), which roughly compare the lowest and highest performances. As a result, we report our results in the following settings: Unseen (120 HOIs), Seen (480 HOIs), and Full (600 HOIs) in the "Default" mode on the HICO-DET dataset. For a better comparison, we implement the factorized model [shen2018scaling] under our framework for unseen composition zero-shot HOI detection. For unseen object HOI detection, we use the same HOI categories for unseen data as [bansal2020detecting] (randomly selecting 12 of the 80 objects and picking all HOIs containing these objects as unseen HOIs). We then report our results in the settings: Unseen (100 HOIs), Seen (500 HOIs), and Full (600 HOIs). To compare with the contemporary work [hou2020visual], we use the same object detection results released by [hou2020visual]. Here, our baseline method is the model without the object fabricator, i.e., without the fabricated compositional branch.
| Method | Setting | Unseen | Seen | Full |
|---|---|---|---|---|
| VCL [hou2020visual] (rare first) | UC | 10.06 | 24.28 | 21.43 |
| Baseline (rare first) | UC | 8.94 | 24.18 | 21.13 |
| Factorized (rare first) | UC | 7.35 | 22.19 | 19.22 |
| FCL (rare first) | UC | 13.16 | 24.23 | 22.01 |
| VCL [hou2020visual] (non-rare first) | UC | 16.22 | 18.52 | 18.06 |
| Baseline (non-rare first) | UC | 13.47 | 19.22 | 18.07 |
| Factorized (non-rare first) | UC | 15.72 | 16.95 | 16.71 |
| FCL (non-rare first) | UC | 18.66 | 19.55 | 19.37 |
| Method | Default Full | Default Rare | Default Non-Rare | Known Object Full | Known Object Rare | Known Object Non-Rare |
|---|---|---|---|---|---|---|
| FCL + VCL | 25.27 | 20.57 | 26.67 | 27.71 | 22.34 | 28.93 |
| (FCL + VCL) | 30.11 | 24.46 | 31.80 | 32.17 | 26.00 | 34.02 |
| (FCL + VCL) | 45.25 | 36.27 | 47.94 | - | - | - |
Unseen composition. Table 1 shows that FCL achieves a large improvement on the Unseen category, by 4.22% and 5.19% over the baseline, and by 3.10% and 2.44% over previous works [bansal2020detecting, hou2020visual], under the two selection strategies, respectively. Meanwhile, both selection strategies show a consistent improvement with FCL on nearly all categories, which indicates that composing novel HOI samples helps overcome the scarcity of HOI samples. In the rare first selection, FCL has a similar result to the baseline and VCL [hou2020visual] on the Seen category, but step-wise optimization can improve the results on the Seen and Full categories (see Table 6). In addition, the factorized model performs very poorly on the head classes compared to our baseline. Noticeably, the factorized model achieves better performance than the baseline on the Unseen category in the non-rare first selection, while it performs worse on the Unseen category in the rare first selection; FCL shows a consistent improvement in both evaluation settings. In the remaining data, the unseen HOIs of the rare first zero-shot setting involve more rare verbs (fewer than 10 instances) than those of the non-rare first setting.
Unseen object. We further evaluate FCL on novel object zero-shot HOI detection, which requires detecting HOIs involving novel objects. Table 1 shows that FCL effectively improves the baseline by 2.68% on the Unseen category, even though there are no real objects of the unseen HOIs in the training set. This illustrates the ability of FCL to detect unseen HOIs with novel objects. Here, the same as [bansal2020detecting], we also use a generic detector to enable unseen object detection.
4.3.2 Effectiveness for Long-Tailed HOI Detection
We compare FCL with recent state-of-the-art HOI detection approaches [wang2020learning, liao2019ppdm, bansal2020detecting, hou2020visual, gao2020drg] using an object detector fine-tuned on HICO-DET to validate its effectiveness on long-tailed HOI detection. For a fair comparison, we use the same fine-tuned object detector provided by [hou2020visual]. For evaluation, we follow the settings in [chao2018learning]: Full (600 HOIs), Rare (138 HOIs), and Non-Rare (462 HOIs) in the "Default" and "Known Object" modes on HICO-DET.
In Table 2, we find that the proposed method achieves a new state-of-the-art performance of 24.68% and 26.80% mAP in the "Default" and "Known Object" modes. Meanwhile, we achieve a significant improvement of 2.82% over the contemporary best rare-performance model [hou2020visual] under the same object detector, which indicates the effectiveness of the proposed compositional learning for long-tailed HOI detection. Furthermore, with the same object detection results as [gao2020drg], our result surprisingly increases to 29.12% in the "Default" mode. Here, we merely replace the detection results provided by [hou2020visual] with those provided by [gao2020drg] during inference. Particularly, we find our method is complementary to composing HOIs between images [hou2020visual]: by simply fusing our results with those provided by [hou2020visual], we can further improve the results by a large margin under different object detectors.
| Method | Full | Rare | Non-Rare | Unseen |
|---|---|---|---|---|
| FCL w/o noise | 19.45 | 17.69 | 21.22 | 15.74 |
| FCL w/o verb | 19.20 | 18.02 | 21.04 | 14.71 |
| FCL + verb fabricator | 19.47 | 16.93 | 21.43 | 15.89 |
4.3.3 Effectiveness on V-COCO
We also evaluate FCL on V-COCO. Although the data on V-COCO is balanced, FCL still improves the baseline (a reproduced PMFNet [wan2019pose]), as shown in Table 5.
4.4 Ablation Studies
For a robust validation of the proposed method on rare and unseen categories simultaneously, we select 24 rare and 96 non-rare categories as unseen categories for zero-shot learning (leaving 30,662 training instances). This setting falls roughly between the non-rare first and rare first selections in Table 1. See the supplementary material for details of the unseen types and an ablation study of long-tailed HOI detection based on Table 2. We conduct ablation studies on FCL, the verb regularization loss, the verb fabricator, step-wise optimization, and the effect of the object detector.
Fabricated Compositional Learning. In Table 3, we find that the proposed compositional method with the fabricator steadily improves the performance, and that it is orthogonal to verb feature regularization (the verb regularization loss).
Verb Feature Regularization. We use a simple auxiliary verb loss to regularize verb features. Although the verb regularization loss can slightly improve the rare and unseen category performance (see rows 1 and 3 in Table 3), FCL achieves further gains on top of it. This indicates that regularizing factor features alone is suboptimal compared to the proposed method. Semantic verb regularization as in [xu2019learning] yields a similar result (see the supplementary materials).
Verb and Noise for Fabricator. Table 4 shows that performance drops without the verb representation or the noise. This indicates that verb representations provide useful information for generating objects, and that noise effectively improves the performance by increasing feature diversity. Meanwhile, comparing Table 3 and Table 4, we find the fabricator still effectively improves the baseline even without the verb or noise input, which indicates the effectiveness of FCL.
Verb Fabricator. The result of fabricating verb features (from a verb identity embedding, object features, and noise) is even worse, as shown in Table 4. This verifies that it is difficult to directly generate useful verb or HOI samples due to their complexity and abstraction. The supplementary materials provide more visual analysis of verb and object features.
| Method | Full | Rare | Non-Rare | Unseen |
|---|---|---|---|---|
| one step (long-tailed) | 24.03 | 18.42 | 25.70 | - |
| one step (ZS) | 19.69 | 18.22 | 20.82 | 17.64 |
| one step (rare first ZS) | 22.01 | 15.55 | 24.56 | 13.16 |
| step-wise (rare first ZS) | 22.45 | 17.19 | 25.34 | 12.12 |
| one step (non-rare ZS) | 19.37 | 15.39 | 20.56 | 18.66 |
| step-wise (non-rare ZS) | 19.11 | 17.12 | 21.02 | 15.97 |
Step-wise Optimization. Table 6 shows that step-wise training performs better on rare and non-rare categories but worse on unseen categories. We think this might be because the model trained step-wise is biased towards seen categories in the first step, since there is no training data for unseen categories.
Object Detector. The quality of the detected objects has an important effect on two-stage HOI detection methods [hou2020visual]. Table 7 shows that the improvement of FCL over the baseline is larger with the detector fine-tuned on HOI data. The COCO detector without fine-tuning on HICO-DET produces a large number of false positive and false negative boxes on HICO-DET due to domain shift, which makes it less useful for evaluating the effectiveness of modeling human interactions for HOI detection. If the boxes detected during inference are wrong, the features extracted from them are also unrealistic and exhibit a large shift from the fabricated objects used during training. As a result, the fabricated objects are less useful for inferring HOIs during inference. Besides, ground-truth boxes provide a strong object label prior for verb recognition.
5 Qualitative Analysis
Illustration of improvement among categories. In Figure 5, we find that the rarer a category is, the more the proposed method improves it. This illustrates the benefit of FCL for the long-tailed issue in HOI detection.
Visual analysis of fabricated and real object features. Figure 6 shows that the cosine similarity between fabricated and real object features gradually decreases and stabilizes during step-wise training. This suggests that the end-to-end optimization with a shared HOI classifier helps the fabricator generate effective object features during the optimization process. More analysis of the generated object representations via t-SNE is provided in the supplementary materials.
6 Conclusion

In this paper, we introduce a Fabricated Compositional Learning approach to compose samples for open long-tailed HOI detection. Specifically, we design an object fabricator to fabricate object features, and then stitch the fake object features together with real verb features to compose HOI samples. Meanwhile, we utilize an auxiliary verb regularization loss to regularize the verb features and improve HOI generalization. Extensive experiments illustrate the effectiveness of FCL on the largest HOI detection benchmark, particularly for low-shot and zero-shot detection.
Acknowledgements This work was supported in part by Australian Research Council Projects FL-170100117, DP-180103424, IH-180100002, and IC-190100031.
Appendix A Additional Details of the Proposed Method
A.1 More Examples of Open Long-tailed HOI Detection
Figure 7 provides a clearer illustration of open long-tailed HOI detection, which aims to detect head, tail and unseen classes in one integrated way from long-tailed HOI examples.
A.2 Factorized Model
We implement the factorized model under our framework. In detail, we replace the HOI branch in Figure 3 in the paper with a verb stream and an object stream, which predict the verb and the object respectively. During inference, we merge the verb and object scores to obtain the HOI score as follows,

$$s_{hoi} = (M_v^{T} s_v) \circ (M_o^{T} s_o),$$

where $M_v$ ($M_o$) is the co-occurrence matrix between verbs (objects) and HOIs, $s_o$ is the score from the object stream, $s_v$ is the score from the verb stream, and $\circ$ denotes element-wise multiplication.
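For concreteness, the score merging above can be sketched in NumPy on a toy vocabulary; the matrix names `M_v`/`M_o` and the toy classes are ours, and the real model operates over the full HICO-DET label set.

```python
import numpy as np

# Toy setting: 2 verbs, 2 objects, 3 HOI classes
# ("ride horse", "feed horse", "ride bike").
# M_v[i, j] = 1 if HOI j uses verb i; M_o[i, j] = 1 if HOI j uses object i.
M_v = np.array([[1, 0, 1],   # ride -> HOIs 0, 2
                [0, 1, 0]])  # feed -> HOI 1
M_o = np.array([[1, 1, 0],   # horse -> HOIs 0, 1
                [0, 0, 1]])  # bike  -> HOI 2

s_v = np.array([0.9, 0.1])   # verb-stream scores
s_o = np.array([0.8, 0.2])   # object-stream scores

# HOI score: element-wise product of verb and object scores
# broadcast through the co-occurrence matrices.
s_hoi = (M_v.T @ s_v) * (M_o.T @ s_o)  # "ride horse" scores highest
```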
A.3 The Effect of Objects on HOI Detection
In nature, object categories follow a long-tailed distribution, and the actions that people perform on those objects are therefore inevitably long-tailed as well. As a result, the HOIs that we observe are long-tailed. This motivates us to fabricate balanced objects for composing HOI samples with visual verbs. We demonstrate the long-tailed distribution of objects in Figure 2 in the paper and the effect of different object detectors on HOI detection in Table 6 in the paper. Figure 8 further illustrates that HOI detection has roughly similar performance to object detection on most object categories, which also highlights the importance of the object detector for HOI detection. Meanwhile, it is necessary to balance the distribution of objects.
A.4 The Number of Primitives in the Two Zero-Shot Settings
We count the number of unseen HOI primitives (verbs and objects) in the remaining data for the two zero-shot settings. Unseen HOIs in the rare-first zero-shot setting involve 40 verbs, 5 of which have fewer than 10 instances in the remaining data, while unseen HOIs in the non-rare-first setting involve only 30 verbs, all with more than 10 instances. We think this partly explains why the factorized method achieves worse results on unseen categories in the rare-first setting: when the primitives of an unseen HOI are scarce in the training data, the factorized method tends to perform worse on that unseen category.
A.5 Fusion of HOI Prediction and the Generic Object Detector
In our experiment, we directly predict the 600 HOI classes in HICO-DET. The HOI (verb-object pair) predictions also contain object information. We think the object information in the HOI prediction and the generic object detector might be complementary. Thus, we convert the HOI scores to object scores $s_o^{hoi}$ and fuse them with the detector object scores $s_o$ as follows,

$$s_o' = \alpha s_o + \beta s_o^{hoi},$$

where $\alpha$ and $\beta$ are 0.3 and 0.7 respectively, and $\alpha + \beta = 1$. Then, we use the new object score $s_o'$ in Equation 6. Meanwhile, we can also update the object category according to $s_o'$. Table 8 shows that we can slightly improve the result under the VCL detector, which provides scores for every object category. Noticeably, our baseline under the VCL detector also uses this strategy, and we do not use it in the zero-shot settings or with the DRG object detector. To some extent, this suggests that HOI prediction and object detection can be mutually promoted, and it provides some insight for future work, although the strategy itself is not very effective.
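A minimal sketch of this fusion follows; the max-pooling conversion from HOI scores to object scores and all names are our own assumptions, and only the weights 0.3/0.7 come from the text.

```python
import numpy as np

ALPHA, BETA = 0.3, 0.7  # fusion weights from the text (ALPHA + BETA = 1)

def fuse_object_scores(s_obj_det, s_hoi, hoi_to_obj):
    """s_obj_det: (num_obj,) detector object scores;
    s_hoi: (num_hoi,) HOI scores;
    hoi_to_obj: (num_hoi, num_obj) 0/1 co-occurrence between HOIs and objects."""
    # Convert HOI scores to object scores: for each object class, take the
    # maximum score over the HOIs that contain it (our assumption).
    s_obj_from_hoi = (s_hoi[:, None] * hoi_to_obj).max(axis=0)
    return ALPHA * s_obj_det + BETA * s_obj_from_hoi
```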
Appendix B Additional Quantitative analysis
B.1 Object Identity
In Table 9, we compare three kinds of object identity. The object variables are learned when we fine-tune the fabricator in the first step; meanwhile, in end-to-end optimization, the object variables can maintain object semantic information. We find that word embeddings [pennington2014glove] and object variables achieve similar performance (24.78% vs. 24.68%), while the one-hot representation performs slightly worse. Particularly, the HOI model is initialized with a pre-trained object detector, so one-step optimization can also optimize the fabricator according to the pre-trained backbone. In the main paper, the long-tailed HOI detection result is from the model using word embeddings as the identity embedding. For simplicity, we use randomly initialized variables as the object identity embedding for the other models.
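To make the role of the identity embedding concrete, here is a minimal sketch of how it enters the fabricator; the dimensions, the single linear layer, and the random stand-in embeddings are all our own simplifications of the actual network.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_OBJ, ID_DIM, VERB_DIM, NOISE_DIM, OBJ_DIM = 80, 64, 128, 64, 128

# Object identity embedding: a word vector, a one-hot code, or a learnable
# variable can all be plugged in here (random stand-ins for illustration).
identity = rng.normal(size=(NUM_OBJ, ID_DIM))

# One-layer stand-in for the fabricator network.
W = rng.normal(scale=0.01, size=(ID_DIM + VERB_DIM + NOISE_DIM, OBJ_DIM))

def fabricate(obj_class, verb_feat):
    """Fabricate a fake object feature for `obj_class`,
    conditioned on a verb feature and noise."""
    noise = rng.normal(size=NOISE_DIM)
    x = np.concatenate([identity[obj_class], verb_feat, noise])
    return np.maximum(x @ W, 0.0)  # ReLU output: fake object feature
```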
B.2 Visual Relation Detection
We also demonstrate the effectiveness of FCL on predicate detection in Visual Relation Detection [lu2016visual] in Table 10. Here, we combine subject, predicate and fabricated object features to generate novel relation samples [zhan2019exploring]. Table 10 shows a notable improvement with FCL on zero-shot predicate detection compared to the state-of-the-art approach.
B.3 Semantic Verb Regularization
We also experiment with semantic verb regularization similar to [xu2019learning], using a Graph Convolutional Network over a verb word-embedding graph. In detail, we use a cosine distance loss to regularize the visual verb representation to be similar to the corresponding word embedding. Here, similar to [xu2019learning], we treat verbs of the same category as identical across different HOIs. Table 11 illustrates that FCL is orthogonal to semantic regularization. Meanwhile, the auxiliary verb loss achieves performance similar to semantic verb regularization [xu2019learning]. When we incorporate both, the improvement is limited, which means the verb regularization loss in the paper and semantic verb regularization have a similar effect on the model.
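The cosine-distance regularization can be sketched as follows; the function name and batch layout are our own, and a real implementation would use the training framework's tensors rather than NumPy.

```python
import numpy as np

def verb_semantic_loss(verb_feats, word_embs):
    """Cosine-distance loss pulling visual verb features toward their class
    word embeddings, in the spirit of [xu2019learning].
    verb_feats, word_embs: (batch, dim) arrays, row i paired with row i."""
    v = verb_feats / np.linalg.norm(verb_feats, axis=1, keepdims=True)
    w = word_embs / np.linalg.norm(word_embs, axis=1, keepdims=True)
    # 1 - cosine similarity, averaged over the batch.
    return float((1.0 - (v * w).sum(axis=1)).mean())
```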
B.4 Object Feature Regularization
Visual object feature regularization. Object features are usually more discriminative. Meanwhile, we initialize our backbone with a Faster R-CNN pre-trained on the COCO dataset, which largely helps us obtain discriminative object features. Thus, it is unnecessary to use an auxiliary object loss to regularize object features (see Table 12). Meanwhile, the t-SNE plot in Figure 12 also shows that the object features are discriminative.
Table 12 compares FCL without the object loss to FCL with an auxiliary object loss (24.54 / 19.93 / 25.92 mAP on Full / Rare / Non-Rare).
B.5 The Effect of the Union Box on FCL
We extract the verb representation from the union box of the human and the object. In Table 13, we show that with verb features extracted from the human box, FCL still effectively improves the baseline, which indicates the proposed method is orthogonal to the choice of verb representation. Noticeably, although the union box contains the object, the HOI model mainly learns the verb representation via compositional learning and largely ignores the identity of the object; thus, the object in the union box does not have much effect on the fabricator. By comparing the human box and the union box for verb representation in Table 2 in the paper and Table 13, we find that the verb representation from the union box largely improves the performance, since it provides more context information.
Table 13 (excerpt): FCL with human-box verb features achieves 23.83 / 18.62 / 25.39 mAP (Full / Rare / Non-Rare).
B.6 Verb Analysis
The same verb might have different meanings in different HOIs. However, verbs in HOI datasets (e.g., HICO-DET) mainly represent actions, so they are usually not ambiguous. Meanwhile, deep convolutional networks (e.g., ResNet) are able to fit ambiguous and even random data [2017Understanding]. Therefore, we can use the factorized method [xu2019learning] for HOI detection, and ambiguous verbs do not affect compositional learning on HICO-DET [hou2020visual], even if some ambiguous verbs (e.g., "hold") can relate to multiple objects.
Besides, we further demonstrate the improvement of FCL across different categories of verbs in Figure 9. We find that ambiguity does not in fact affect the performance of those verbs. For example, although the verb "hold" relates to 61 kinds of objects in HICO-DET, the corresponding HOIs of "hold" still achieve considerable improvement.
Inspired by the observation that people interact with similar objects in a similar manner, we also design an approach to select composite HOIs according to the similarity between object classes: we only keep those composite HOIs whose object is among the top-$k$ neighbors of the verb's original object. The original object of the verb is the visual object paired with the verb in the HOI annotation. This helps us filter out ambiguous composite HOIs. Specifically, we calculate the similarity between different classes of objects using their word embeddings [pennington2014glove], and then obtain the top-$k$ neighbors for each class of objects. Table 14 shows that with more similar objects, the performance steadily improves. Particularly, there is only one verb relating to more than 40 HOIs, and 4 verbs relating to more than 20 HOIs in HICO-DET. When $k = 1$, we only keep composite HOIs whose objects have the same label as the original object.
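The neighbor-based filter described above can be sketched as follows; the function name is ours and the toy embeddings stand in for real GloVe vectors.

```python
import numpy as np

def keep_composite(orig_obj, new_obj, obj_embs, k):
    """Keep a composite HOI only if `new_obj` is among the top-k neighbors
    (by word-embedding cosine similarity) of the verb's original object.
    obj_embs: (num_obj, dim) array of object-class word embeddings."""
    embs = obj_embs / np.linalg.norm(obj_embs, axis=1, keepdims=True)
    sims = embs @ embs[orig_obj]    # cosine similarity to the original object
    topk = np.argsort(-sims)[:k]    # indices of the k most similar classes
    return bool(new_obj in topk)
```

Note that with k = 1 the most similar class is the original object itself, so only same-label composites survive, matching the behavior described in the text.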
B.7 Complementarity and Orthogonality to Previous Methods
Orthogonal to spatial pattern. Table 19 illustrates that the spatial pattern strategy [gao2018ican, li2018transferable, wan2019pose] largely improves the performance, and that the proposed compositional learning is orthogonal to the spatial pattern.
Orthogonal to re-weighting. In our baseline, we utilize the re-weighting strategy used in [li2018transferable, hou2020visual] to compare directly with [hou2020visual]. We demonstrate that FCL is orthogonal to re-weighting in Table 16: without the re-weighting strategy, FCL still achieves a similar improvement over the baseline.
B.8 Complementary Analysis of the Fabricator
In this section, we analyze the fabricator on HOI detection without unseen data (i.e., full long-tailed HOI detection). We observe a trend similar to the ablation study in the paper.
Table 18 (excerpt): FCL w/o noise achieves 24.22 / 19.23 / 25.72 mAP and FCL w/o verb achieves 24.29 / 18.98 / 25.87 mAP (Full / Rare / Non-Rare).
Verb and noise for fabricating objects. Table 18 demonstrates the effectiveness of the verb and noise inputs. Particularly, the performance drop in full HOI detection is larger than in the zero-shot study in the paper. We think this is because the improvement on unseen categories is large, while there are no unseen categories in full HOI detection.
Verb fabricator. Table 18 also illustrates that if we instead fabricate verb features to augment HOI samples, the performance clearly decreases to 23.93% in long-tailed HOI detection. This again shows that verb features are more complex, and that it is difficult to generate effective verb features to facilitate HOI detection.
B.9 Additional Ablation Study
Step-wise optimization. We also provide a comparison between step-wise optimization and one-step optimization for unseen-object HOI detection in Table 15.
Hyper-parameters. We follow [hou2020visual] for the first two hyper-parameters. For the remaining one, we provide an ablation study in Table 20 around the value 0.5, since we consider it less important than the other two.
Appendix C Additional Qualitative Analysis
C.1 Object Representations
We analyze the real and fabricated object features in detail in Figures 10 and 11 by selecting the top 10 frequent classes in HICO-DET. 1) In Figure 10 (a) and Figure 11 (a), we find that fake object features of the same class are close to each other, while features from different classes are separable even though they might share the same verb. 2) Figure 10 (b) and Figure 11 (c) show that features of different verbs slightly cluster together within each object class; the outliers in some object classes arise because those outliers have different verbs. 3) For unseen-object zero-shot learning, Figure 11 shows that all fake object features of the same class are also close to each other. Particularly, the unseen objects (red edge in row b) are also separable from the others. 4) Column 3 in Figures 10 and 11 illustrates that fake object features are still separable from real objects of the same class, although some fabricated features are closer to their corresponding real features (the dark blue class in Figure 10 and the jade-green class in Figure 11). We think Column 3 in the two figures also suggests a future direction for fabricating objects: generating more realistic objects.
C.2 Primitive Features
Figure 12 illustrates that verb features are apparently more difficult to distinguish: the verb representation is abstract and complicated. By contrast, object representations extracted from a modern object detector are more discriminative. Comparing Figure 12 with the figures in VCL [hou2020visual], we find that the objects of FCL are more discriminative.
C.3 Qualitative Comparison
In Figure 13, we compare our baseline with the proposed method. Apparently, the proposed method effectively detects rare categories, while the corresponding baseline cannot. In fact, all the HOIs detected by our method in Figure 13 have fewer than five samples in the training set, which is well below the rare-category threshold (fewer than 10 samples).
C.4 Failure Case Analysis
We provide some false positive results on rare categories in Figure 14. The failure cases can be separated into four groups: blurry images, wrong verbs, wrong objects, and wrong matches. If the image is blurry or partially occluded, it is hard to detect the interaction correctly. Besides, the verb is usually hard to classify. Meanwhile, small objects can also cause the network to detect the wrong object (e.g., the carrot in Figure 14). Lastly, even when the network recognizes the action and object correctly, it may still mismatch the interaction; for example, in Figure 14, the woman does not interact with the banana on the corner of the table.