Tell Me What They're Holding: Weakly-supervised Object Detection with Transferable Knowledge from Human-object Interaction

by   Daesik Kim, et al.
Seoul National University

In this work, we introduce a novel weakly supervised object detection (WSOD) paradigm that detects objects belonging to rare classes with few examples by using transferable knowledge from human-object interactions (HOI). While WSOD shows lower performance than full supervision, we focus on HOI as the main context that can strongly supervise complex semantics in images. To this end, we propose a novel module called RRPN (relational region proposal network), which outputs an object-localizing attention map using only human poses and action verbs. In the source domain, we train an object detector and the RRPN with full supervision of HOI. With the knowledge about localization maps transferred from the trained RRPN, a new object detector can learn unseen objects in the target domain with weak verbal supervision of HOI and without bounding box annotations. Because the RRPN is designed as an add-on, it can be applied not only to object detection but also to other domains such as semantic segmentation. The experimental results on the HICO-DET dataset suggest that the proposed method can be a cheap alternative to the current supervised object detection paradigm. Moreover, qualitative results demonstrate that our model can properly localize unseen objects on the HICO-DET and V-COCO datasets.



*All authors contributed equally to this work. Corresponding author.

Over the past decade, object detection has become one of the most successful fields in computer vision, with various applications [Ren et al.2015, Dai et al.2016, Redmon et al.2016, Liu et al.2016]. Most of the successful models emerged after the release of large-scale datasets with bounding box annotations (e.g. PASCAL VOC, MS-COCO [Everingham et al.2010, Lin et al.2014]). Given an input image, conventional object detection models localize boxes with the corresponding class scores. Thus, they normally require manually annotated bounding boxes containing accurate coordinate values and object labels for training.

Figure 1: 1) Two different types of description of an object. A is the human way of identifying an object, while B is the way used for machines. 2) Manual annotation time for three tasks. Bearman et al. [Bearman et al.2016] estimated annotation times for image-level, bounding-box, and pixel-level labels. At the rightmost, the annotation time of a relation sentence can be similar to that of an image-level label, since only the action verb “hold” is added.

However, annotating bounding boxes is time-consuming and labor-intensive. It can also be difficult to expand a dataset by adding more object classes or more images. Therefore, research on reducing those costs in various ways has drawn attention recently.

Weakly supervised object detection (WSOD) has been proposed to tackle the aforementioned problems [Zhu et al.2017, Shi, Caesar, and Ferrari2017, Jie et al.2017]. It aims to detect objects within images using weak supervision such as image-level labels. In exchange for the lower annotation cost, WSOD performs worse than full supervision.

To overcome this limitation of weak supervision, some approaches [Shi, Caesar, and Ferrari2017] rely on another type of full supervision with transfer learning. Knowledge transferred from a source domain can support weak supervision in a target domain. However, annotating other types of labels such as segmentation masks is also expensive.

Our main intuition is that the supervision given to machines is totally different from the way humans learn. For example, Fig. 1(1) shows how differently humans and machines identify objects. While we must provide accurate coordinate values of object boxes for machines, humans usually recognize new objects from contexts. Contexts can also reinforce supervision without much additional effort.

In particular, how objects are related to human actions can be practical and advantageous, since information about a human can be proper evidence for recognizing contexts in an image. Moreover, humans can easily express contexts with sentences as shown in Fig. 1(1), so linguistic labels can be a key to reducing the annotation cost for humans as shown in Fig. 1(2). Compared to other annotation costs [Bearman et al.2016], the cost of annotating a relation sentence such as “person, hold, bottle” can be almost the same as that of image-level annotation. Thus, we propose a novel paradigm to learn unseen objects based on human-object interaction (HOI).

Our key idea is to exploit transferable knowledge from HOI contexts annotated by language, as in [Chao et al.2018]. Specifically, we propose a novel module that predicts object locations from HOI. Since the actual coordinate values cannot be specified, we use an attention map as the localization result and connect it to a bounding box. Moreover, in order to train a full object detector (e.g. Faster-RCNN [Ren et al.2015]) in an end-to-end fashion, we design the new module as an add-on.

The objective of this paper is to make our model learn additional rare classes with weak verbal supervision that humans can annotate easily. During the first stage, strong supervision on non-rare classes teaches our model to localize a proper location from a human pose and an action verb. In the next stage, only weak supervision with the transferred knowledge is used to train an object detector for unseen rare classes.

Our main contributions can be summarized as follows:

  • We define a new weakly-supervised object detection scheme which mainly relies on interactions between a human and objects without box annotations.

  • We propose a novel module called RRPN (Relational Region Proposal Network) to localize boxes by using the location of a human and verb embeddings.

  • The proposed RRPN is designed as a universal add-on type that can be easily adapted into existing models such as Faster-RCNN.

Our experiments validate that our model outperforms baselines on the HICO-DET [Chao et al.2018] dataset and can effectively transfer the knowledge to another dataset such as V-COCO [Gupta and Malik2015].

Related works

Weakly Supervised Object Localization and Detection Most weakly supervised object localization and detection methods have been proposed based on image-level supervision. With cheaper but weaker annotations, studies [Bilen and Vedaldi2016, Diba et al.2017, Kantorov et al.2016, Oquab et al.2015, Tang et al.2017, Jie et al.2017] mainly tried to enhance performance by multiple instance learning (MIL). In MIL, a bag is defined as a collection of regions in an image. It is labeled as positive if at least one region in it is positive, and as negative if all of its regions are negative.
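The MIL bag-labeling rule described above can be written in a few lines. This is only an illustrative sketch: the function name `bag_label` and the boolean per-region labels are assumptions made for the example, not part of any cited method.

```python
from typing import Sequence


def bag_label(region_labels: Sequence[bool]) -> bool:
    """Label a bag (all candidate regions of one image).

    The bag is positive if at least one region is positive and
    negative only when every region is negative.
    """
    return any(region_labels)
```

In practice the per-region labels are unknown and only the bag label (the image-level label) is given, which is what makes the localization problem weakly supervised.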

Tang et al. [Tang et al.2018] proposed a proposal cluster learning algorithm to learn refined instance classifiers. Yang et al. [Yang et al.2019] proposed activity-driven WSOD, which also exploits action classes as contextual information to localize objects without box annotations. However, those works generate box proposals using Selective Search [Uijlings et al.2013], which is a rule-based algorithm. Since the box proposal method cannot be trained, there is a fundamental limitation: proper proposals hardly exist for novel data. Uijlings et al. (2018) presented a WSOD framework that revisits knowledge transfer for training object detectors on target classes. Since that work optimizes a box proposal network for target classes by MIL, box generators can be insufficiently trained on rare classes due to a lack of contextual information. Our method resolves the aforementioned issues with a novel box proposal module that can transfer knowledge using an HOI dataset.

Human-object interaction Visual recognition of HOI is crucial for comprehending a scene in an image. Early work studied the mutual context of human pose and objects [Yao and Fei-Fei2010] and Bayesian models [Gupta and Davis2007, Gupta, Kembhavi, and Davis2009] with handcrafted features. More recently, with the success of deep learning, Chao et al. [Chao et al.2015] introduced a new large-scale benchmark, “Humans Interacting with Common Objects” (HICO), for HOI recognition, which was expanded to detection problems in HICO-DET [Chao et al.2018]. Various approaches have been proposed to solve HICO-DET. In [Chao et al.2018], combined features from human proposals and object regions were used to solve HOI detection. Gkioxari et al. (2018) proposed a human and object detector-based approach that estimates a density map on top of the Faster-RCNN architecture. A recent approach [Qi et al.2018] generates an HOI graph and propagates messages between nodes to infer relationships in a parsing graph. In this paper, rather than directly solving HOI problems, we exploit contextual information in HICO-DET to construct a weakly supervised object detector.

Algorithm Overview

Figure 2: Overview of our algorithm. 1) During the training phase for source classes, RRPN is also trained to predict attention map from human-object interaction. 2) In the target-class training phase, an object detector is trained using the ground truth class label, and the box label provided by the trained RRPN. In other words, our problem focuses solely on solving the weakly supervised object detection problem on the target classes. 3) As a result, the trained object detector for target classes can infer box coordinates and object classes with only an image input.

An overview of our algorithm is illustrated in Fig. 2. Let $\mathcal{D} = \{(I_i, y_i)\}_{i=1}^{N}$ be the dataset, where $y_i$ is the label of the image $I_i$ and $N$ is the number of images. The image label is organized as a set of tuples as shown below:

$$y_i = \{(b_{i,j},\, c_{i,j},\, v_{i,j})\}_{j=1}^{M_i} \tag{1}$$

where $b_{i,j}$ and $c_{i,j}$ are the bounding box and the class of an object in the image, $v_{i,j}$ is the action verb corresponding to the object, and $M_i$ is the number of tuples in the image $I_i$. To evaluate the proposed method, we divided $\mathcal{D}$ into two subsets based on the number of objects in a class: non-rare (source classes) $\mathcal{D}_S$ and rare (target classes) $\mathcal{D}_T$. Note that no object class is shared between the two subsets, whereas all action verbs appear in both.
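As a rough illustration of the label structure and the frequency-based split described above, the following sketch represents each label tuple as a small record and assigns the rarest object classes to the target set. The names `HoiTuple` and `split_by_frequency`, the box format, and the tie-breaking rule are assumptions for the example; the exact split criterion used in the experiments is not specified beyond class frequency.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class HoiTuple:
    """One (box, class, verb) tuple of an image label (hypothetical record)."""
    box: Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2)
    obj_class: str
    verb: str


def split_by_frequency(
    labels: List[List[HoiTuple]], num_target: int
) -> Tuple[Set[str], Set[str]]:
    """Return (source_classes, target_classes).

    Object classes are ranked by how many tuples mention them; the
    `num_target` rarest classes become target classes, the rest source.
    Ties are broken alphabetically so the split is deterministic.
    """
    counts = Counter(t.obj_class for image in labels for t in image)
    ranked = sorted(counts, key=lambda c: (counts[c], c))  # rarest first
    target = set(ranked[:num_target])
    source = set(ranked[num_target:])
    return source, target
```

By construction the two class sets are disjoint, while nothing prevents the same verb from appearing with classes on both sides of the split.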

For the source classes $\mathcal{D}_S$, we train the first object detector (blue circle in Fig. 2) in the usual way, with full supervision using ($b$, $c$). Alongside the object detector, we also train an object localizer (red circle) called RRPN with newly defined inputs. Since the RRPN should learn how to localize an object using only information about a human and an action verb, we use the image $I$, the verb $v$, and the pose $p$ of the human as inputs. The verb $v$ comes directly from the label $y$, whereas the human pose $p$ is extracted from the image $I$ with an existing human pose estimation method. As a result, the RRPN predicts an attention map $\hat{A}_i$ of an object location in the $i$-th image from the human’s action and appearance. We optimize the losses on the object class and the location using $c$ and $b$ for the object detector, but create a Gaussian map $A$ from $b$ and use it as the ground truth when training the attention map of the RRPN. In this phase, since the ground truth bounding box location is available, the RRPN can learn common knowledge between objects and human actions.

For the target classes $\mathcal{D}_T$, we assume that only the object class $c$ and the action verb $v$ are available but the bounding box $b$ is not. To compensate for the absence of $b$, we exploit the knowledge learned by the RRPN, with the same kinds of inputs as in the training phase for the source classes. Since the output of the RRPN is an attention map, we extract coordinates by thresholding it and generate a pseudo bounding box $\hat{b}$. Then, we train the second object detector (green rectangle in Fig. 2) for $\mathcal{D}_T$ in the usual way. Since all action verbs have already been used to train the RRPN in the previous phase and the same parameters are transferred to the training phase for the target classes, the RRPN can infer an object location from a human pose and an action verb. In Fig. 2, because the RRPN has already learned, in the training phase on $\mathcal{D}_S$, how the verb “EAT” and a grabbing pose relate to an object location, it can infer a proper location for the unseen object “Apple” as a pseudo ground truth $\hat{b}$. In conclusion, we use weak supervision by human actions to train a full object detector.

Eventually, the object detector trained in the second phase can predict objects in $\mathcal{D}_T$ from an input image alone, as shown in Fig. 2. Although the model has never been shown the real location of an “Apple”, it can predict the class score and the coordinates of an “Apple” object.

Figure 3: Overall network architecture of the proposed algorithm. Relational Region Proposal Network (RRPN) at the top is mounted on a basic Faster-RCNN model at the bottom. In RRPN, a combined feature from verb, pose and image produces an attention map through a network which has four blocks. With source classes, the RRPN is trained with a Gaussian mask from the ground truth bounding box. However, with target classes, the RRPN generates a pseudo ground truth bounding box so that Faster-RCNN can be optimized.

In this scheme, we can additionally train a new object detector for unseen rare classes without bounding box annotations. Moreover, since the RRPN has already been trained with strong supervision, we need a smaller amount of data for the target classes than other WSOD algorithms. Our experiments validate that our method successfully trains on a target domain containing extremely rare object classes.


Fig. 3 depicts the overall architecture of the proposed algorithm. The proposed algorithm consists of two modules: the RRPN and the object detector. More precisely, the RRPN can be combined with a conventional architecture such as the Faster-RCNN. The RRPN is a multi-stage encoder-decoder network, which is responsible for predicting an object-location-centric attention map from a multi-domain integrated feature map. The other module is a conventional object detector, which is trained for a given input image $I$ using the ground truth label $c$ and the bounding box $b$.

To exploit the knowledge of interaction, we train on the target classes $\mathcal{D}_T$ after training on the source classes $\mathcal{D}_S$ is done. However, since we do not consider continual learning, the object detectors for $\mathcal{D}_S$ and $\mathcal{D}_T$ do not share parameters. While the object detector is trained on $\mathcal{D}_S$ with full supervision, the RRPN is trained at the same time to learn the knowledge from interactions between a human and an object through action verbs. Then, the object detector is trained on $\mathcal{D}_T$ without object bounding boxes, i.e. in a weakly supervised way, using the knowledge transferred from $\mathcal{D}_S$.

Training on the Source classes

The source classes $\mathcal{D}_S$ are object classes for which data can be easily acquired. Training on $\mathcal{D}_S$ is a standard supervised object detection procedure using ground truth class and box labels for all the objects. The main purpose of training on $\mathcal{D}_S$ is to learn to predict an object location from a human-object interaction. Therefore, the RRPN is trained at the same time as the object detector. The training procedure for Faster-RCNN follows the original paper. The training procedure of the RRPN is as follows.

Relational Region Proposal Network (RRPN)

RRPN is designed as an add-on that is universally applicable to models for various tasks, including other object detectors, and it can share the backbone network that extracts image features with the host model to improve memory efficiency.

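As mentioned above, RRPN predicts an attention map $\hat{A}$ for a given image $I$ using a multi-domain feature map $F_M \in \mathbb{R}^{d \times w \times h}$ as an input, where $d$, $w$, $h$ are the depth, width and height of the feature map. $F_M$ is obtained by

$$F_M = F_I \oplus F_P \oplus F_V, \qquad F_I = f_I(I; \theta_I), \quad F_P = f_P(I; \theta_P), \quad F_V = f_V(v; \theta_V) \tag{2}$$

where $F_I$, $F_P$, $F_V$ are the image feature, pose feature, and verb feature obtained by their corresponding models $f_I$, $f_P$, $f_V$, and $\oplus$ is the matrix concatenation. Here, $\theta$ denotes the corresponding model parameters. The convolution operation on $F_M$ computes the object existence probability for a combination of $F_I$, $F_P$, and $F_V$ at a specific spatial location.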

As in (2), we use three feature maps to exploit contexts from various domains in a given dataset, and each feature map makes its own contribution. The pose feature is responsible for the visual context of the human’s location and action, and the verb (word) feature for the distinguishable linguistic context of the human’s action. The image feature is responsible for representing the whole scene as well as the object of interest. The details of each feature map are as follows:

Pose feature We use the well-known human pose estimation model, OPENPOSE [Cao et al.2017], to extract pose features. OPENPOSE predicts the location of human body joints using image or video as an input. The output consists of channels corresponding to each joint and a channel representing background information. In this paper, we used a pose estimation model with 19 channels including 18 joints and 1 background. In order to feed distinct information of human pose to the RRPN, we exploit the 18 channels except for the background channel as the pose feature.

Verb feature The widely used GloVe-twitter-27B-25d model [J. Pennington and Manning2014] is applied as the word embedding model for the verb. Since a word is embedded into a vector, it must be converted into a tensor form for integration with the other features. While the image feature $F_I$ and the pose feature $F_P$ may have different spatial activations depending on the image $I$, the verb feature $F_V$ must have the same value regardless of position. We take this into account in designing $F_V$. In order to match the spatial dimensions of the other features, the verb feature is copied to every spatial position, so that the dimension of $F_V$ is converted from $\mathbb{R}^{25}$ to $\mathbb{R}^{25 \times w \times h}$. By stacking the word vector depth-wise at all spatial positions, we can conduct a convolution operation using the same verbal information at every position of the feature map. Note that, in the HICO-DET dataset, tuples with ‘No interaction’ verb labels were excluded from the training and validation phases for accurate evaluation of the proposed algorithm.
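The tiling of the verb vector and its channel-wise concatenation with the other features can be sketched with plain nested lists. This is a minimal illustration, not the authors' implementation: the helper names `tile_verb` and `concat_channels`, the `[channel][row][column]` layout, and the toy channel counts are all assumptions.

```python
from typing import List

Grid = List[List[float]]   # one channel, indexed [row][column]
FeatureMap = List[Grid]    # indexed [channel][row][column]


def tile_verb(verb_vec: List[float], height: int, width: int) -> FeatureMap:
    """Copy each embedding value to every spatial position.

    A d-dimensional vector becomes a d x height x width feature map in
    which every position of channel k holds verb_vec[k].
    """
    return [[[value] * width for _ in range(height)] for value in verb_vec]


def concat_channels(*maps: FeatureMap) -> FeatureMap:
    """Concatenate feature maps along the channel axis.

    All inputs must share the same spatial size.
    """
    height, width = len(maps[0][0]), len(maps[0][0][0])
    out: FeatureMap = []
    for fmap in maps:
        for channel in fmap:
            if len(channel) != height or any(len(row) != width for row in channel):
                raise ValueError("spatial sizes must match")
            out.append(channel)
    return out
```

With a 25-dimensional verb embedding and 18 pose channels, the concatenated map would have 43 channels plus however many the backbone produces; a convolution over it then sees the same verb values at every location.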

Image feature The proposed algorithm makes use of the representative two-stage object detection model, Faster-RCNN. It consists of a feature extractor, a back-bone network, and a region proposal network (RPN). The output feature map of the back-bone network of the Faster-RCNN is used as the image feature for the RRPN. When training on the target classes, the parameters of the backbone network are reused, but the parameters of RPN are reset.

The multi-domain feature map $F_M$ is then fed into the network to predict the attention map $\hat{A}$. In order to robustly detect objects of various sizes, we designed a network architecture with four blocks as in [Wang and Shen2018]: an encoder block $\mathrm{Enc}$, two decoder blocks $\mathrm{Dec}_1$ and $\mathrm{Dec}_2$, and an attention block $\mathrm{Att}$. $\mathrm{Enc}$ takes $F_M$ as an input and outputs two feature maps $E_1$, $E_2$ with different spatial dimensions. These output feature maps are then fed into $\mathrm{Dec}_1$ and $\mathrm{Dec}_2$, respectively. The output feature maps of $\mathrm{Dec}_1$ and $\mathrm{Dec}_2$, which have the same spatial dimension, are concatenated and fed to the attention block, resulting in an attention map $\hat{A}$ as

$$(E_1, E_2) = \mathrm{Enc}(F_M), \qquad \hat{A} = \mathrm{Att}\big(\mathrm{Dec}_1(E_1) \oplus \mathrm{Dec}_2(E_2)\big) \tag{3}$$

The output of RRPN is an attention map $\hat{A}$ which emphasizes the location where the object is likely to be. To train attention maps, we create a Gaussian map $A$, as a ground truth attention map, using the ground truth box $b$. RRPN is trained using $A$ as the label. We use the pixel-wise binary cross entropy (BCE) loss between $A$ and $\hat{A}$.
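To make the training target concrete, the sketch below builds a Gaussian map from a box and evaluates a pixel-wise BCE loss against it. The text does not say how the Gaussian is parameterized, so centering it on the box center and setting the standard deviations to half the box width and height (`scale=0.5`) are assumptions, as are the function names and the `(x1, y1, x2, y2)` box format.

```python
import math
from typing import List, Tuple

Grid = List[List[float]]  # indexed [row][column]


def gaussian_map(box: Tuple[float, float, float, float],
                 width: int, height: int, scale: float = 0.5) -> Grid:
    """Ground-truth attention map peaking at the box center.

    Assumption: sigma_x and sigma_y are `scale` times the box width
    and height; the source does not give the exact parameterization.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    sx = max((x2 - x1) * scale, 1e-6)
    sy = max((y2 - y1) * scale, 1e-6)
    return [[math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))
             for x in range(width)]
            for y in range(height)]


def bce_loss(pred: Grid, target: Grid, eps: float = 1e-7) -> float:
    """Mean pixel-wise binary cross entropy between two maps in [0, 1]."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
            count += 1
    return total / count
```

Because the target is a soft map rather than a binary mask, the loss is minimized (but not zero) when the prediction equals the target.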

The total loss for training on the source classes, including the RRPN and the object detector, is shown below:

$$L_{total} = L_{det} + \lambda\, L_{RRPN} \tag{4}$$

where $\lambda$ is a hyper-parameter balancing the two losses, $L_{RRPN}$ is the BCE loss of the RRPN, and $L_{det}$ is the loss for the Faster-RCNN. From the object detector's point of view, the proposed algorithm is trained on the source classes in the same way as conventional supervised object detection algorithms.

Training On the Target classes

The object detector for the target classes $\mathcal{D}_T$ should be trained without the ground truth box $b$. Therefore, we define this problem as a weakly supervised object detection (WSOD) problem. We use a pseudo box $\hat{b}$ as an alternative to the missing $b$, utilizing the RRPN learned in the source-class training phase. The trained RRPN is expected to predict locations of unseen objects, i.e. objects of the target classes, since it is trained to predict the object location using a human pose, an action (verb) and an image feature. The training process on $\mathcal{D}_T$ using the trained RRPN is as follows:

The image, pose, and verb features ($F_I$, $F_P$, and $F_V$) are fed into the trained RRPN. We apply a threshold to obtain a pseudo bounding box from the output attention map $\hat{A}$ as

$$\hat{A}_{\tau}(x, y) = \begin{cases} 1 & \text{if } \hat{A}(x, y) \ge \tau \\ 0 & \text{otherwise} \end{cases} \tag{5}$$

where $\tau$ is a pre-defined threshold. The largest bounding box containing a valid value in $\hat{A}_{\tau}$ is called $\hat{b}$. The pseudo ground truth bounding boxes obtained from the attention maps of all tuples in the image are collected together and used as bounding box labels for training an object detector. In this step, a different type of object detector from the one trained in the source-class training phase can be used. The object detector is trained to minimize the detection loss using $c$ and $\hat{b}$.
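The thresholding step can be sketched as follows. The interpretation used here, that the pseudo box is the axis-aligned box enclosing every position whose attention value reaches the threshold, is an assumption about the description above, and `pseudo_box` and its `tau` argument are hypothetical names.

```python
from typing import List, Optional, Tuple

Grid = List[List[float]]  # attention map, indexed [row][column]


def pseudo_box(attention: Grid, tau: float) -> Optional[Tuple[int, int, int, int]]:
    """Return (x_min, y_min, x_max, y_max) enclosing all positions whose
    attention value is at least `tau`, or None if no position qualifies."""
    xs, ys = [], []
    for y, row in enumerate(attention):
        for x, value in enumerate(row):
            if value >= tau:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


# A toy 4 x 5 attention map with one peak.
toy = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.2, 0.6, 0.0, 0.0],
    [0.0, 0.3, 0.9, 0.15, 0.0],
    [0.0, 0.0, 0.05, 0.0, 0.0],
]
```

On the toy map, a low threshold yields a wide box, a higher one shrinks it toward the peak, and a threshold above the maximum value yields no box at all.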


In this section, we evaluate the performance of the proposed WSOD algorithm. To the best of our knowledge, no previous studies have been conducted on the relationship between object detection and HOI. Nevertheless, we conducted the performance comparison with prior works on HICO-DET.

Dataset and Pre-processing

The HICO-DET dataset consists of 47,776 images (38,118 training and 9,658 testing) classified into 117 actions (verbs) and 80 object classes; the object classes are the same as those of the MS-COCO dataset. The ground truth labels consist of tuples of $b$, $c$, $v$ as in (1). Note that the RRPN is trained on tuples, so images containing multiple tuples are fed multiple times. The total number of tuples is 151,276 (117,871 training and 33,405 testing), and we use 131,560 tuples (102,450 training and 29,110 testing), excluding the tuples corresponding to the action label ‘no-interaction’.

In order to construct the problem environment, the whole dataset is divided into source and target datasets according to the frequency of the object class. Our basic experiment is set up with 116 verbs, excluding ‘no interaction’. The number of object classes is 70 for the source classes and 10 for the target classes. In order to show the effectiveness of the RRPN more clearly, in the experiment for qualitative results we use 5 verbs with 10 source classes and 70 target classes. Other hyperparameters remain the same as in the basic setup.

We also verified the proposed algorithm on the V-COCO dataset for qualitative analysis. The purpose of the evaluation on V-COCO is to show that the knowledge can be transferred from one dataset to another. Details on both datasets are described in the supplementary material.


We use mean Average Precision (mAP) and Recall as evaluation metrics. Because the RRPN produces one bounding box for one tuple (action), Recall is used to measure how accurately the object corresponding to an action is located. In other words, Recall evaluates the objectness of the pseudo box $\hat{b}$ predicted by the RRPN, and is calculated as the ratio of tuples for which $\mathrm{IoU}(b, \hat{b}) \ge 0.5$. On the other hand, the object detector detects all the objects in an image at once. Therefore, we use mAP, the standard metric for object detectors, to measure the performance of Faster-RCNN.
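The Recall metric for one-box-per-tuple localization can be sketched as below. The helper names, the `(x1, y1, x2, y2)` box format, the use of a greater-or-equal comparison at the 0.5 cutoff, and counting a missing prediction as a miss are assumptions made for this illustration.

```python
from typing import List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_at(pred: Sequence[Optional[Box]], gt: Sequence[Box],
              thresh: float = 0.5) -> float:
    """Fraction of tuples whose predicted box overlaps the ground truth
    with IoU >= thresh. `pred[k]` is the single box predicted for tuple k,
    or None when no box could be produced."""
    if not gt:
        return 0.0
    hits = sum(1 for p, g in zip(pred, gt)
               if p is not None and iou(p, g) >= thresh)
    return hits / len(gt)
```

Unlike mAP, this measure ignores class scores entirely; it only asks whether the one box produced for an action lands on the right object.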

The source and target columns in Recall report the performance of the RRPN after training on the source classes. When training on the target classes, the RRPN is fixed and not trained. Note that Recall is measured on the test set for the source classes and on both the training and test sets for the target classes.

Comparison with prior works

We conduct experiments to compare with prior works on HICO-DET, as shown in Table 1. The first two columns report the overall results of the original algorithms, AD [Yang et al.2019] and PCL [Tang et al.2018]. However, both results only show performance over all object classes on the entire dataset. Since our method is designed for transfer learning, we also evaluate PCL on the source and target domains separately. Our best model, with image, pose and verb features, achieves 17.19% mAP, which is about 4 times higher than the result of PCL on the target classes. Moreover, our model with only the image feature also outperforms PCL on the target classes. Although a direct apples-to-apples comparison is difficult, our method performs substantially better than the compared methods.

Methods | AD   | PCL  | PCL* | Ours (I) | Ours (I+P+V)
Source  | -    | -    |      | -        | -
Target  | -    | -    |      | 9.57     | 17.19
Total   | 5.39 | 3.62 | -    | -        | -

Table 1: Comparison of the mAP with other WSOD algorithms on HICO-DET. (PCL* is tested by ourselves; PCL is trained on the entire dataset and PCL* is trained on the source and target classes separately. I: image feature $F_I$, P: pose feature $F_P$, V: verb feature $F_V$, W: weakly supervised object detection, $\lambda$ = 10, $\tau$ = 0.1)

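I | P | V | Recall@.5 source | Recall@.5 target | mAP@.5 source | mAP@.5 target (W) | mAP@.5 target (S)
✓ | ✓ | ✓ | 47.69 | 28.64 | 30.34 | 17.19 | 29.37
✓ |   |   | 42.00 | 22.75 | 23.57 | 9.57  | 22.07
✓ | ✓ |   | 41.42 | 22.13 | 24.17 | 10.07 | 25.15
✓ |   | ✓ | 46.34 | 23.84 | 29.97 | 16.34 | 30.28

Table 2: Performance comparison of different feature combinations. Recall@.5 is measured for the RRPN and mAP@.5 for Faster-RCNN; S denotes full supervision. (Notations I, P, V, and W are the same as in Table 1)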

Comparison with different feature combination

We conduct experiments to verify the performance of different feature combinations. In each experiment, we train and test the RRPN using the same types of features for both the source and target classes. Table 2 shows the performance of the RRPN and Faster-RCNN as Recall and mAP, respectively, for different combinations of features. In mAP, the target (W) column is the result of our WSOD, and the source and target (S) columns are the results of full supervision.

Our full model combining all three features, in the top row of Table 2, shows the highest Recall and the highest weakly supervised mAP among all combinations. Its weakly supervised mAP on the target classes is 17.19%, which is 7.62 percentage points better than the image-only model, and its Recall on the target classes is 28.64%, which is about 5 percentage points higher than the other combinations. Moreover, the mAP of our full model is only 4.88 percentage points lower than that of the image-only fully supervised model on the target classes. Compared with the fully supervised models, we believe this mAP meaningfully shows that the detector can be trained despite weak supervision of rare classes.
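The margins quoted above can be re-derived from the Table 2 numbers. The snippet below is only an arithmetic check; the mapping of table rows to feature combinations (I+P+V, I, I+P, I+V, top to bottom) and the column order are read off the surrounding discussion and should be treated as an assumption.

```python
# Table 2 values as reported, per feature combination:
# (Recall source, Recall target, mAP source, mAP target weak, mAP target full)
TABLE2 = {
    "I+P+V": (47.69, 28.64, 30.34, 17.19, 29.37),
    "I":     (42.00, 22.75, 23.57, 9.57, 22.07),
    "I+P":   (41.42, 22.13, 24.17, 10.07, 25.15),
    "I+V":   (46.34, 23.84, 29.97, 16.34, 30.28),
}


def gap(a: float, b: float) -> float:
    """Difference in percentage points, rounded to two decimals."""
    return round(a - b, 2)


full = TABLE2["I+P+V"]
# Weakly supervised mAP gain of the full model over the image-only model.
map_gain_over_image_only = gap(full[3], TABLE2["I"][3])
# Gap between the image-only fully supervised mAP and the full weak model.
gap_to_full_supervision = gap(TABLE2["I"][4], full[3])
# Target Recall margin over the best of the other three combinations.
recall_margin = gap(full[1], max(TABLE2[k][1] for k in ("I", "I+P", "I+V")))
```

The three quantities correspond to the 7.62, 4.88, and roughly 5 point differences discussed in the text.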

In the second row, using the image feature $F_I$ alone, Recall for the target classes is 22.75% and mAP (W) is much lower than mAP (S). This means that the RRPN could not be trained by $F_I$ alone. The two bottom rows show the performance for combined features. When the pose feature $F_P$ is combined with $F_I$, Recall degrades and mAP increases slightly. This suggests that $F_P$, which is extracted from the image, is redundant unless it interacts with a verb. When the verb feature $F_V$ is combined with $F_I$, however, Recall and mAP increase significantly and mAP (S) also increases. Interestingly, combining features from another domain appears to be effective not only for the RRPN but also for Faster-RCNN.

$\lambda$ | $\tau$ | Recall@.5 source | Recall@.5 target | mAP@.5 source | mAP@.5 target (W) | mAP@.5 target (S)
0  | 0.1  | 11.64 | 11.19 | 31.06 | 1.61  | 25.57
1  | 0.1  | 41.51 | 24.46 | 23.87 | 9.27  | 25.43
5  | 0.1  | 46.37 | 23.37 | 30.10 | 14.38 | 26.11
10 | 0.1  | 47.69 | 28.64 | 30.34 | 17.19 | 29.37
15 | 0.1  | 46.65 | 23.07 | 23.32 | 15.85 | 25.45
20 | 0.1  | 43.17 | 26.00 | 22.68 | 15.75 | 22.92
10 | 0.05 | 48.22 | 29.41 | 30.27 | 9.41  | 25.04
10 | 0.10 | 47.69 | 28.64 | 30.34 | 17.19 | 29.37
10 | 0.15 | 39.67 | 17.96 | 30.40 | 14.01 | 26.66
10 | 0.20 | 34.14 | 16.72 | 30.22 | 13.24 | 30.39

Table 3: Comparison of quantitative results for different $\lambda$ and $\tau$. Recall@.5 is measured for the RRPN and mAP@.5 for Faster-RCNN. (Notations are the same as in Table 1)

Figure 4: (Left) Input image with pose, (middle) ground truth Gaussian attention mask ($A$) in yellow, and (right) predicted attention map ($\hat{A}$). The red box is the pseudo object box $\hat{b}$, the blue box is the ground truth $b$, and the white box indicates the human in action. (V) The last row is the result on the V-COCO dataset. Note that the white box is used solely for visually representing an acting human in an image and is not used in training on the target classes.

Comparison with different $\lambda$ and $\tau$

In the experiments in Table 3, we focus on verifying the effect of parameters such as the loss weight $\lambda$ in (4) and the threshold $\tau$ in (5).

In the top part of Table 3, $\lambda$ in (4) determines the weight of the RRPN loss. When $\lambda$ is zero, the RRPN is untrained, so Recall and mAP for the target classes have the lowest scores while mAP for the source classes has the highest score. On the other hand, Recall and mAP on the target classes are the highest at $\lambda = 10$, with improvements of 17.45 and 15.58 percentage points over $\lambda = 0$, respectively. In contrast, at some values of $\lambda$ the performance degrades not only for the target classes but also for the source classes.

This can be understood as an effect of parameter sharing for image feature extractor between RRPN and an object detector. As mentioned earlier, RRPN is a universal add-on type module which can be adapted to various computer vision tasks. To effectively utilize these advantages, we share the backbone network of RRPN and the object detector in consideration of memory efficiency. Therefore, the RRPN and the object detector affect each other through the backbone network during training.

In the bottom part of Table 3, $\tau$ in (5) determines the size of the pseudo bounding box. A small $\tau$ makes the boxes larger, while a large $\tau$ makes the box small or makes it disappear. As $\tau$ increases, only partial information about the object is used for training. For example, in the case of an apple, only the central part of the apple is trained with a high $\tau$, which causes many false positives. On the other hand, when the threshold is lowered, Faster-RCNN is trained not only with the object but also with the background. Interestingly, $\tau$ affects the two metrics differently: Recall gets higher as $\tau$ gets smaller, but mAP is highest when $\tau = 0.1$. We believe that the RRPN can easily learn objectness with a larger box due to a small $\tau$, but classification of objects could be more difficult due to inaccurate localization. Therefore, a $\tau$ that is too small or too large degrades mAP, and we found the suitable value, $\tau = 0.1$, by selecting the value with the highest performance on the target classes.

Figure 5: Comparison of predicted attention maps trained only by the image feature and by various integrated features with [glove, hold]. The predicted attention maps show different activations depending on the role of each feature map.

Qualitative results

Fig. 4 shows the qualitative results of the proposed algorithm on the target classes. The first column shows input images and the last column shows output attention maps inferred from the corresponding actions. We can see that the RRPN predicts an accurate attention map on unseen object classes in the target set. Furthermore, the pattern of the predicted attention map differs depending on the verb. For example, while ‘hold’ shows a strong activation near the human hand, ‘ride’ tends to activate below the person. Based on this, we can confirm that the object location can be estimated from the interaction between the verb and the pose. The role of the pose can be seen in the example of [Truck, Ride]. Although two trucks exist in the image, the activation on the truck that the human is riding is stronger than on the other. This can be seen as a contribution of the pose feature $F_P$ to the object localization. We also verified the performance of the RRPN on objects from a different dataset, V-COCO. The RRPN is trained using the source classes of HICO-DET and predicts the attention map for target classes of V-COCO. The bottom row shows the predicted attention map on V-COCO. We can see that the proposed algorithm can also predict the object location accurately on images from another dataset.

Fig. 5 compares the predicted attention maps of different feature combinations on [glove, hold]. As described in section 5, the pattern of the resulting attention map changes with the combination of features. Since “Glove” is an unseen class, the backbone has no information from which to extract a reliable feature, so the RRPN cannot predict the location of the object accurately using only the image feature $F_I$. However, if the RRPN is trained using two or more features including $F_I$, it can infer the location of an object based either on the pose feature $F_P$ or on the verb feature $F_V$. Specifically, $F_V$ yields a more distinguishable attention map for an object than $F_P$. Since $F_I$ and $F_P$ are extracted from the same image, some of the information can be redundant between the two features. On the other hand, $F_V$ is able to provide useful information to $F_I$ because it is extracted from a different domain, language. Consequently, the location of an object can be predicted precisely when we use all three features. In contrast, if the RRPN is trained using only $F_P$ and $F_V$ without $F_I$, the output attention map only activates around the human. Thus, $F_I$ can be understood as providing supplementary information to $F_P$ and $F_V$ about the object of interest.


In this paper, we proposed a novel weakly-supervised scheme for object detection. We introduced the RRPN, which can universally localize objects in an image using information on human poses and action verbs. Using transferable knowledge from the RRPN, we can continue to train any object detector on unseen objects with weak verbal supervision describing HOI. We validated our method on the HICO-DET dataset, and the results show the possibility of our method as a new WSOD training scheme. Our work shows sufficient potential to overcome the inefficiency of the supervised training scheme in recent deep learning. Our method can also be extended toward continual learning, since it already provides a way to transfer common knowledge for localizing objects with HOI.


This work was supported by the Next-Generation Information Computing Development Program through the NRF of Korea (2017M3C4A7077582) and the Promising-Pioneering Researcher Program through Seoul National University (SNU) in 2015.


  • [Bearman et al.2016] Bearman, A.; Russakovsky, O.; Ferrari, V.; and Fei-Fei, L. 2016. What’s the point: Semantic segmentation with point supervision. In European conference on computer vision, 549–565. Springer.
  • [Bilen and Vedaldi2016] Bilen, H., and Vedaldi, A. 2016. Weakly supervised deep detection networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • [Cao et al.2017] Cao, Z.; Simon, T.; Wei, S.; and Sheikh, Y. 2017. Realtime multi-person 2d pose estimation using part affinity fields. CVPR.
  • [Chao et al.2015] Chao, Y.-W.; Wang, Z.; He, Y.; Wang, J.; and Deng, J. 2015. Hico: A benchmark for recognizing human-object interactions in images. In Proceedings of the IEEE International Conference on Computer Vision, 1017–1025.
  • [Chao et al.2018] Chao, Y.-W.; Liu, Y.; Liu, X.; Zeng, H.; and Deng, J. 2018. Learning to detect human-object interactions. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 381–389. IEEE.
  • [Dai et al.2016] Dai, J.; Li, Y.; He, K.; and Sun, J. 2016. R-fcn: Object detection via region-based fully convolutional networks. In Advances in neural information processing systems, 379–387.
  • [Diba et al.2017] Diba, A.; Sharma, V.; Pazandeh, A.; Pirsiavash, H.; and Van Gool, L. 2017. Weakly supervised cascaded convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 914–922.
  • [Everingham et al.2010] Everingham, M.; Van Gool, L.; Williams, C. K.; Winn, J.; and Zisserman, A. 2010. The pascal visual object classes (voc) challenge. International journal of computer vision 88(2):303–338.
  • [Gkioxari et al.2018] Gkioxari, G.; Girshick, R.; Dollár, P.; and He, K. 2018. Detecting and recognizing human-object interactions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8359–8367.
  • [Gupta and Davis2007] Gupta, A., and Davis, L. S. 2007. Objects in action: An approach for combining action understanding and object perception. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, 1–8. IEEE.
  • [Gupta and Malik2015] Gupta, S., and Malik, J. 2015. Visual semantic role labeling. arXiv preprint arXiv:1505.04474.
  • [Gupta, Kembhavi, and Davis2009] Gupta, A.; Kembhavi, A.; and Davis, L. S. 2009. Observing human-object interactions: Using spatial and functional compatibility for recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(10):1775–1789.
  • [J. Pennington and Manning2014] Pennington, J.; Socher, R.; and Manning, C. 2014. Glove: Global vectors for word representation. In Conference on Empirical Methods in Natural Language Processing.
  • [Jie et al.2017] Jie, Z.; Wei, Y.; Jin, X.; Feng, J.; and Liu, W. 2017. Deep self-taught learning for weakly supervised object localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1377–1385.
  • [Kantorov et al.2016] Kantorov, V.; Oquab, M.; Cho, M.; and Laptev, I. 2016. Contextlocnet: Context-aware deep network models for weakly supervised localization. In European Conference on Computer Vision, 350–365. Springer.
  • [Lin et al.2014] Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft coco: Common objects in context. In European conference on computer vision, 740–755. Springer.
  • [Liu et al.2016] Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; and Berg, A. C. 2016. Ssd: Single shot multibox detector. In European conference on computer vision, 21–37. Springer.
  • [Oquab et al.2015] Oquab, M.; Bottou, L.; Laptev, I.; and Sivic, J. 2015. Is object localization for free?-weakly-supervised learning with convolutional neural networks. In CVPR, 685–694.
  • [Qi et al.2018] Qi, S.; Wang, W.; Jia, B.; Shen, J.; and Zhu, S.-C. 2018. Learning human-object interactions by graph parsing neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), 401–417.
  • [Redmon et al.2016] Redmon, J.; Divvala, S.; Girshick, R.; and Farhadi, A. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, 779–788.
  • [Ren et al.2015] Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. NIPS.
  • [Shi, Caesar, and Ferrari2017] Shi, M.; Caesar, H.; and Ferrari, V. 2017. Weakly supervised object localization using things and stuff transfer. In Proceedings of the IEEE International Conference on Computer Vision, 3381–3390.
  • [Tang et al.2017] Tang, P.; Wang, X.; Bai, X.; and Liu, W. 2017. Multiple instance detection network with online instance classifier refinement. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2843–2851.
  • [Tang et al.2018] Tang, P.; Wang, X.; Bai, S.; Shen, W.; Bai, X.; Liu, W.; and Yuille, A. L. 2018. Pcl: Proposal cluster learning for weakly supervised object detection. TPAMI.
  • [Uijlings et al.2013] Uijlings, J.; van de Sande, K.; Gevers, T.; and Smeulders, A. 2013. Selective search for object recognition. International Journal of Computer Vision 104(2):154–171.
  • [Uijlings, Popov, and Ferrari2018] Uijlings, J.; Popov, S.; and Ferrari, V. 2018. Revisiting knowledge transfer for training object class detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1101–1110.
  • [Wang and Shen2018] Wang, W., and Shen, J. 2018. Deep visual attention prediction. IEEE Transactions on Image Processing 27(5):2368–2378.
  • [Yang et al.2019] Yang, Z.; Mahajan, D.; Ghadiyaram, D.; Nevatia, R.; and Ramanathan, V. 2019. Activity driven weakly supervised object detection. CVPR.
  • [Yao and Fei-Fei2010] Yao, B., and Fei-Fei, L. 2010. Modeling mutual context of object and human pose in human-object interaction activities. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 17–24. IEEE.
  • [Zhu et al.2017] Zhu, Y.; Zhou, Y.; Ye, Q.; Qiu, Q.; and Jiao, J. 2017. Soft proposal networks for weakly supervised object localization. In Proceedings of the IEEE International Conference on Computer Vision, 1841–1850.

Appendix A Dataset

We experiment with the HICO-DET [Chao et al.2018] and V-COCO [Gupta and Malik2015] datasets in consideration of HOI. However, since these datasets are not designed for the problem we defined, we need to consider some limitations. Most of them are related to ground truth annotations, such as the class or the bounding box, and can affect performance evaluation. We designed all experiments with these limitations in mind. Details of each dataset regarding the limitations are described in the following subsections.

Figure 6: The number of tuples per object and action in HICO-DET. (a) shows the number of tuples for each object and (b) shows the number of tuples for each action (verb).

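| Type | Non-rare class | Tuples | Rare class | Tuples |
|---|---|---|---|---|
| Obj | boat | 12,311 | traffic light | 108 |
| Obj | bicycle | 11,422 | refrigerator | 94 |
| Obj | motorcycle | 8,719 | fire hydrant | 82 |
| Obj | skateboard | 7,015 | parking meter | 79 |
| Obj | dining table | 6,880 | vase | 55 |
| Obj | horse | 6,435 | potted plant | 49 |
| Obj | umbrella | 4,797 | stop sign | 47 |
| Obj | bus | 4,343 | sink | 40 |
| Obj | skis | 3,821 | microwave | 30 |
| Obj | snowboard | 3,324 | toaster | 29 |
| Verb | hold | 22,969 | move | 10 |
| Verb | ride | 20,289 | zip | 6 |
| Verb | sit on | 15,318 | tag | 6 |
| Verb | carry | 6,911 | flush | 6 |
| Verb | straddle | 6,434 | wave | 5 |

Table 4: Comparison of the number of tuples between non-rare classes (top 10 objects / top 5 verbs) and rare classes (bottom 10 objects / bottom 5 verbs) in HICO-DET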


Data imbalance. Table 4 and Fig. 6 show the frequency of each verb and object in the tuples of HICO-DET. There is a large gap between the numbers of frequently and rarely appearing objects. Due to this imbalance in the number of tuples across object classes, some HOIs, such as [refrigerator, move], are rarely seen by RRPN during training. Thus the effectiveness of RRPN on such tuples may not be verified accurately.

For tuples such as [refrigerator, move], RRPN could predict inaccurate , so that detection performance on could also be degraded.
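To make the size of this gap concrete, the following minimal sketch computes the ratio between the most and least frequent object classes. The counts are copied from Table 4 and only a few entries are included; the helper name `imbalance_ratio` is hypothetical and is not part of the proposed method.

```python
# Tuple counts copied from Table 4 (HICO-DET); only a few entries are listed.
object_tuples = {
    "boat": 12311,
    "bicycle": 11422,
    "microwave": 30,
    "toaster": 29,
}


def imbalance_ratio(counts):
    """Ratio between the most frequent and the least frequent class."""
    return max(counts.values()) / min(counts.values())


ratio = imbalance_ratio(object_tuples)
print(f"most/least frequent object ratio: {ratio:.1f}")  # roughly 424.5
```

With these counts, the most frequent object appears in over four hundred times as many tuples as the rarest one.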

For the qualitative results, we use 5 verbs, 10 source classes, and 70 target classes; the other hyperparameters are the same as in the basic setup. We shrink the dataset for the qualitative results in order to show the effectiveness of RRPN more clearly. In other words, due to the inherent limitations of HICO-DET, it is difficult to obtain clear visual results that show the performance of RRPN. We therefore used non-rare verbs and increased the ratio of target classes to see the results for various objects.

Figure 7: Examples of ground truth bounding box annotations in HICO-DET. The examples show (a) different bounding boxes for the same object (Top: apple, Bottom: kite), (b) overlapping bounding box annotations (Top: banana, Bottom: baseball bat), and (c) missing ground truth bounding boxes in an image (Top: apple, Bottom: dog). These insufficient bounding box annotations could induce biased results.

Annotations. HICO-DET is an HOI dataset that was extended for the object detection problem, and each image label is organized as a tuple as shown in Eq. (1). Thus, multiple ground truth bounding boxes can be annotated for a single object, as shown in Fig. 7(a). Conversely, as depicted in Fig. 7(b), the bounding boxes of multiple objects can overlap each other. Besides, because HICO-DET is not designed only for object detection, some object bounding box annotations are missing, as shown in Fig. 7(c). This differs from the setup of standard object detection problems, where exactly one box is annotated per object, so validation results such as mAP can be biased. One way to address this problem is to integrate multiple bounding boxes of an object into one. However, this approach can introduce another bias into the results. As shown in Fig. 7(a), the ground truth bounding boxes of a single object not only follow different patterns depending on the verb but are also not identical even for the same verb. Furthermore, if we integrate boxes based on their IoU scores, the bounding box labels of some objects can be lost, as shown in Fig. 7(b).
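The last point can be illustrated with a small sketch of IoU-based box integration. This is not a procedure used in our experiments; the greedy merging strategy, the 0.5 threshold, and the function names `iou` and `merge_by_iou` are assumptions chosen only for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_by_iou(boxes, thr=0.5):
    """Greedily merge each box into the first kept box with IoU above thr."""
    merged = []
    for box in boxes:
        for i, kept in enumerate(merged):
            if iou(box, kept) > thr:
                # Replace the kept box by the union rectangle of the two boxes.
                merged[i] = (min(kept[0], box[0]), min(kept[1], box[1]),
                             max(kept[2], box[2]), max(kept[3], box[3]))
                break
        else:
            merged.append(tuple(box))
    return merged


# Two annotations of one object (the case of Fig. 7(a)).
same_object = [(10, 10, 50, 50), (12, 11, 52, 49)]
# Two distinct but heavily overlapping objects (the case of Fig. 7(b)).
two_objects = [(10, 10, 50, 50), (14, 10, 54, 50)]

print(merge_by_iou(same_object))  # one box remains, as intended
print(merge_by_iou(two_objects))  # one box remains, so one object label is lost
```

In both cases a single box remains, which is the desired outcome in the first case but removes the label of a real object in the second.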

Considering these conditions of HICO-DET, the experiments in this paper are conducted using the original labels of HICO-DET, i.e. without any intentional manipulation of the labels. Obviously, the experimental results can be biased by the effect of multiple ground truth bounding boxes compared to the case of using only one. Nevertheless, we conducted all experiments under exactly the same conditions and did not compare our method with other WSOD algorithms. Therefore, we believe that the experimental results presented in the paper are sufficient to demonstrate the effectiveness of the proposed method.


The V-COCO dataset is a subset of the MS-COCO dataset [Lin et al.2014] and has 26 actions and 2 object classes (direct object and instrument). This raises the problem of how to separate the source and the target. The object classes, which are binary, depend strongly on the verb classes. Thus, the verbs in the test set do not exist in the training set. On the other hand, when we transform the two classes into 80 classes, there are about 10 empty classes because V-COCO is not designed for object detection. In addition, some classes in the test set do not exist in the training set. Therefore, we present the activation of tuples with the same verbs as HICO-DET.

Appendix B Implementation details

The overall structure of the proposed WSOD method consists of RRPN and Faster-RCNN. We used Faster-RCNN with an ImageNet pre-trained ResNet-101 model. RRPN consists of three feature models for the image, pose, and verb features, which are built on the backbone of Faster-RCNN, OPENPOSE [Cao et al.2017], and GloVe [J. Pennington and Manning2014], respectively, as described previously. We used the pre-trained OPENPOSE and GloVe models only to extract each feature, without further training.

The spatial dimension of the integrated feature map is equal to the output feature map of the Faster-RCNN backbone network, i.e. . , and

consist of 3, 5, and 6 convolutional layers, respectively. Max pooling and appropriate strides were used to match the corresponding spatial dimension.


convolution layer followed by a sigmoid layer. The spatial dimension of the output feature map is the same as the input spatial dimension of the object detector. The hyperparameters for training RRPN are as follows: learning rate: 1e-3, optimizer: stochastic gradient descent, weight decay: 1e-4, momentum: 0.9, batch size: 4, epochs: 15 (source classes), 30 (target classes). The hyperparameters for Faster-RCNN are set the same as in the original paper.
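For reference, the RRPN training hyperparameters listed above can be collected in a small configuration object. This is only a sketch: the class name `RRPNTrainConfig`, its field names, and the helper `iterations` are hypothetical and do not correspond to released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RRPNTrainConfig:
    """RRPN training hyperparameters as listed in the text (names are hypothetical)."""
    learning_rate: float = 1e-3
    optimizer: str = "sgd"       # stochastic gradient descent
    weight_decay: float = 1e-4
    momentum: float = 0.9
    batch_size: int = 4
    epochs_source: int = 15      # training on the source classes
    epochs_target: int = 30      # training on the target classes


def iterations(num_images, cfg, stage):
    """Total number of optimizer steps for a stage ('source' or 'target')."""
    epochs = cfg.epochs_source if stage == "source" else cfg.epochs_target
    steps_per_epoch = -(-num_images // cfg.batch_size)  # ceiling division
    return epochs * steps_per_epoch


cfg = RRPNTrainConfig()
print(iterations(1000, cfg, "source"))  # 15 epochs of 250 steps each
```

The helper simply converts the epoch counts into a number of iterations for a given dataset size, which depends on how many images each stage uses.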

Appendix C Result

In this section, we present the performance of our algorithm by further analyzing experimental results for various situations.

The role of features

We additionally analyze the role of features by conducting the following experiments: 1) I fix, 2) P only, 3) V only. I fix is the case where we train on the target classes while the parameters of the shared backbone network, which is trained on the source classes, are fixed. P only and V only refer to the cases where we train RRPN using only P and V, respectively.

Table 5 presents the results of each experiment. In the I fix case, training on the target classes is conducted using the fixed image feature, I, that is already trained on the source classes. The backbone is therefore unable to learn to extract proper image features for the target classes. Consequently, mAP decreases not only for weakly supervised learning (W) but also for the supervised learning approach (S).

| I | P | V | Recall@.5 (RRPN), Source | Recall@.5 (RRPN), Target | mAP@.5 (Faster-RCNN), Source | mAP@.5 (Faster-RCNN), Target (W) | mAP@.5 (Faster-RCNN), Target (S) |
|---|---|---|---|---|---|---|---|
|  |  |  | 47.69 | 28.64 | 30.34 | 10.58 | 22.90 |
|  |  |  | 22.05 | 17.34 | 32.14 | 5.24 | 28.20 |
|  |  |  | 16.78 | 10.06 | 31.24 | 3.54 | 27.20 |
|  |  |  | 20.28 | 11.00 | 31.60 | 4.33 | 28.50 |

Table 5: Quantitative results of the feature combinations I fix, P only, and V only. In I fix, the model for the corresponding feature is fixed during training on the target classes. (I: Image feature, P: Pose feature, V: Verb feature, W: Weakly supervised object detection, S: Supervised object detection, attention loss balance = 10, box threshold = 0.1)

The results of P only and V only regarding (W) are significantly lower than those of the other feature combinations that use I. This implies that using a single modality, P or V only, could be insufficient to predict an accurate object location. In other words, to accurately predict an object location from HOI, one needs to provide not only the human pose and the corresponding action (verb) but also information on the object of interest contained in the image feature. Meanwhile, since V is copied to every spatial position and thus has the same value regardless of position, the results using only V show the lowest performance among the feature combinations. It is unexpected that the performance is lower when both P and V are used simultaneously than when only P is used.
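The remark that the verb feature is copied to every spatial position can be sketched as follows. The sketch uses plain nested lists in channel-first layout; the channel-wise concatenation, the toy sizes, and the function names `tile_verb_embedding` and `concat_channels` are assumptions for illustration and are not taken from the implementation.

```python
def tile_verb_embedding(verb_vec, height, width):
    """Copy each of the D verb values to every cell of an H x W grid."""
    return [[[v] * width for _ in range(height)] for v in verb_vec]


def concat_channels(*feature_maps):
    """Concatenate channel-first feature maps (lists of H x W grids)."""
    fused = []
    for fmap in feature_maps:
        fused.extend(fmap)
    return fused


height, width = 2, 3
# A toy one-channel image feature whose values vary with position.
image_feat = [[[0.1 * (y * width + x) for x in range(width)]
               for y in range(height)]]
# A toy two-dimensional verb embedding (real word vectors are much longer).
verb_vec = [0.5, -1.0]

verb_map = tile_verb_embedding(verb_vec, height, width)
fused = concat_channels(image_feat, verb_map)

print(len(fused))  # 3 channels: 1 image channel + 2 verb channels
print(fused[1])    # every cell of a verb channel holds the same value
```

Because each verb channel is spatially constant, it carries no location cue on its own, which is consistent with the low scores of the V only setting.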

Effect of the ratio between source and target classes

We verify the performance under different ratios of source classes to target classes. We repeatedly reduce the number of source classes by 10, while the number of verbs is fixed to 116, excluding ’no interaction’.
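The sequence of class splits used in this experiment can be sketched as below. The 80 object classes are represented by placeholder names, and taking the first classes as the source set is an assumption made only for illustration; the actual assignment of classes to source and target is not specified here.

```python
def split_classes(classes, num_source):
    """Split an ordered class list into (source, target) subsets."""
    return classes[:num_source], classes[num_source:]


# Placeholder names for the 80 object classes of HICO-DET.
all_classes = [f"class_{i:02d}" for i in range(80)]

# Reduce the number of source classes by 10 at each step: 70, 60, ..., 30.
splits = {n: split_classes(all_classes, n) for n in range(70, 20, -10)}

for num_source, (source, target) in splits.items():
    print(num_source, len(source), len(target))
```

Each step moves ten classes from the source set to the target set, giving the class counts reported in Table 6.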

Our result for the basic ratio at the top of Table 6 shows the highest performance in both Recall and mAP on the target test. In contrast, the result for the lowest ratio of source classes at the bottom of Table 6 shows the highest performance in both Recall and mAP on the source test. As the ratio of source classes decreases, the task on the source becomes relatively easier and performance improves because there are fewer classes. On the other hand, the task on the target becomes relatively more difficult and performance decreases because there are more classes. Another reason may be the reduced number of training examples as the number of source classes decreases.

| # of source classes | # of target classes | Recall@.5 (RRPN), Source | Recall@.5 (RRPN), Target | mAP@.5 (Faster-RCNN), Source | mAP@.5 (Faster-RCNN), Target (W) | mAP@.5 (Faster-RCNN), Target (S) |
|---|---|---|---|---|---|---|
| 70 | 10 | 47.69 | 28.64 | 30.34 | 17.19 | 29.37 |
| 60 | 20 | 47.26 | 25.35 | 31.51 | 11.52 | 27.15 |
| 50 | 30 | 46.47 | 27.34 | 33.69 | 12.76 | 24.36 |
| 40 | 40 | 45.73 | 23.45 | 34.55 | 9.63 | 25.16 |
| 30 | 50 | 48.53 | 20.14 | 40.29 | 9.46 | 21.97 |

Table 6: Comparison of quantitative results for different ratios of source classes to target classes (W: Weakly supervised object detection, S: Supervised object detection, attention loss balance = 10, box threshold = 0.1, the number of actions (verbs) = 116)

More results

Note that the experiments for the qualitative results in the main paper and the supplementary material are conducted on a smaller dataset than the basic setup used for the quantitative results. For the qualitative results, we use 5 verbs, 10 source classes, and 70 target classes; the other hyperparameters are the same as in the basic setup. We shrink the dataset for the qualitative results in order to show the effectiveness of RRPN more clearly. In other words, due to the inherent limitations of HICO-DET (see section A), it is difficult to obtain clear visual results that show the performance of RRPN. We therefore used non-rare verbs and increased the ratio of target classes to see the results for various objects. The quantitative results on the smaller dataset are presented in Tables 7 and 8. Obviously, the performance on the target classes decreases. However, since the quantitative results on the smaller dataset show a tendency analogous to that of the basic setup, we believe the qualitative results on the smaller dataset can be used to represent the performance of RRPN.

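| I | P | V | Recall@.5 (RRPN), Source | Recall@.5 (RRPN), Target | mAP@.5 (Faster-RCNN), Source | mAP@.5 (Faster-RCNN), Target (W) | mAP@.5 (Faster-RCNN), Target (S) |
|---|---|---|---|---|---|---|---|
|  |  |  | 48.62 | 19.99 | 46.38 | 10.06 | 27.17 |
|  |  |  | 47.69 | 17.66 | 45.75 | 7.23 | 26.33 |
|  |  |  | 29.54 | 16.23 | 47.18 | 4.72 | 28.03 |
|  |  |  | 22.72 | 11.31 | 47.73 | 2.48 | 27.03 |
|  |  |  | 48.34 | 16.84 | 46.54 | 7.07 | 25.42 |
|  |  |  | 48.91 | 18.25 | 47.19 | 8.12 | 25.74 |
|  |  |  | 12.66 | 13.16 | 47.69 | 2.51 | 26.14 |

Table 7: Comparison of quantitative results of different feature combinations. (I: Image feature, P: Pose feature, V: Verb feature, W: Weakly supervised object detection, S: Supervised object detection, number of source classes = 10, number of target classes = 70, attention loss balance = 10, box threshold = 0.1)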

Fig. 8 shows qualitative results on the target classes, especially for carry and sit on. The pattern of the predicted attention maps differs depending on the verb: while carry focuses near the human hand, sit on activates below the human. These results are analogous to those of hold and ride in the main paper.

On the contrary, the predicted attention maps for rarely appearing verbs are inaccurate, as shown in Fig. 9. This is because HOIs related to rarely appearing verbs could not be sufficiently trained.

Fig. 10 also shows some examples of unsuccessful results. The verb and object class labels are annotated as [hold, scissors] and [hold, umbrella], respectively. The predicted attention maps focus on the box instead of the scissors, and on both the umbrella and the suitcase rather than only the umbrella. Since RRPN predicts the location of an object according to the human and the corresponding action, inaccurate attention maps can be generated when multiple objects are involved in the same action.

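| Attention loss balance | Box threshold | Recall@.5 (RRPN), Source | Recall@.5 (RRPN), Target | mAP@.5 (Faster-RCNN), Source | mAP@.5 (Faster-RCNN), Target (W) | mAP@.5 (Faster-RCNN), Target (S) |
|---|---|---|---|---|---|---|
| 0 | 0.1 | 12.66 | 13.16 | 47.71 | 2.53 | 27.62 |
| 1 | 0.1 | 47.85 | 17.07 | 45.00 | 6.80 | 27.73 |
| 5 | 0.1 | 47.96 | 18.82 | 46.15 | 7.88 | 27.25 |
| 10 | 0.1 | 48.62 | 19.99 | 46.38 | 10.06 | 27.17 |
| 15 | 0.1 | 47.52 | 17.94 | 44.12 | 8.95 | 26.99 |
| 20 | 0.1 | 44.31 | 15.30 | 43.03 | 7.52 | 27.01 |
| 10 | 0.05 | 50.76 | 21.83 | 46.98 | 9.14 | 27.27 |
| 10 | 0.10 | 48.62 | 19.99 | 46.38 | 10.06 | 27.17 |
| 10 | 0.15 | 42.76 | 11.98 | 46.03 | 8.50 | 26.68 |
| 10 | 0.20 | 23.68 | 7.06 | 43.94 | 6.09 | 27.71 |
| 10 | 0.30 | 8.11 | 4.49 | 45.82 | 2.85 | 26.69 |

Table 8: Comparison of quantitative results for different attention loss balances and box thresholds (Source: 10 classes, Target: 70 classes, W: Weakly supervised object detection, S: Supervised object detection)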
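One plausible reading of the box threshold in Table 8 is a cutoff applied to the attention map before a pseudo box is extracted. The sketch below follows that reading: it keeps the cells above the threshold and returns their enclosing rectangle. This is an assumption for illustration, since the exact extraction procedure is not described in this section, and the function name `pseudo_box` is hypothetical.

```python
def pseudo_box(attention, threshold=0.1):
    """Smallest box (x1, y1, x2, y2) enclosing all cells above threshold.

    attention is a 2D list of values in [0, 1]; returns None when no cell
    exceeds the threshold. (Assumed procedure, for illustration only.)
    """
    coords = [(x, y)
              for y, row in enumerate(attention)
              for x, value in enumerate(row)
              if value > threshold]
    if not coords:
        return None
    xs, ys = zip(*coords)
    return (min(xs), min(ys), max(xs), max(ys))


attention = [
    [0.0, 0.05, 0.0, 0.0],
    [0.0, 0.4, 0.9, 0.0],
    [0.0, 0.2, 0.3, 0.05],
    [0.0, 0.0, 0.0, 0.0],
]

print(pseudo_box(attention, threshold=0.1))  # (1, 1, 2, 2)
print(pseudo_box(attention, threshold=0.5))  # (2, 1, 2, 1)
```

Under this reading, a higher threshold shrinks the box toward the peak of the attention map, which would be in line with the drop in Recall@.5 for larger thresholds in Table 8.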
Figure 8: Qualitative results on carry and sit on for three different objects each. (Left) Input image with pose, (Middle) ground truth Gaussian attention mask, and (Right) predicted attention map. The red box is the pseudo object box, the blue box is the ground truth, and the white box indicates the acting human. Note that the white box is used solely to represent the acting human in an image and is not used in training on the target classes.
Figure 9: Unsuccessful results. Box color notation is the same as in Fig. 8. Incorrect attention maps are predicted due to rarely appearing tuples, which are insufficiently trained.
Figure 10: Unsuccessful results. Box color notation is the same as in Fig. 8. Inaccurate object localization occurs when (top) two objects overlap or (bottom) the human performs the same action on multiple objects.
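The ground truth Gaussian attention mask mentioned in the caption of Fig. 8 can be sketched as a 2D Gaussian centered on the ground truth box. The choice of standard deviation (half of the box width and height, controlled by the hypothetical parameter `sigma_scale`) and the function name `gaussian_mask` are assumptions, as the exact parameterization is not given in this section.

```python
import math


def gaussian_mask(height, width, box, sigma_scale=0.5):
    """2D Gaussian mask peaking at the center of box = (x1, y1, x2, y2).

    The standard deviations are sigma_scale times the box width and height
    (an assumed parameterization, for illustration only).
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    sx = max((x2 - x1) * sigma_scale, 1e-6)
    sy = max((y2 - y1) * sigma_scale, 1e-6)
    return [[math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))
             for x in range(width)]
            for y in range(height)]


mask = gaussian_mask(8, 8, box=(2, 2, 6, 6))
print(mask[4][4])  # 1.0 at the box center
print(mask[0][0])  # a much smaller value far from the box
```

The mask equals 1 at the box center and decays smoothly with distance, so it can serve as a soft localization target for the predicted attention map.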