Hand-Object Interaction and Precise Localization in Transitive Action Recognition

11/12/2015 ∙ by Amir Rosenfeld, et al. ∙ Weizmann Institute of Science

Action recognition in still images has seen major improvement in recent years due to advances in human pose estimation, object recognition and stronger feature representations produced by deep neural networks. However, there are still many cases in which performance remains far from that of humans. A major difficulty arises in distinguishing between transitive actions in which the overall actor pose is similar, so that recognition depends on details of the grasp and the object, which may be largely occluded. In this paper we demonstrate how recognition is improved by obtaining precise localization of the action-object and consequently extracting details of the object shape together with the actor-object interaction. To obtain exact localization of the action object and its interaction with the actor, we employ a coarse-to-fine approach which combines semantic segmentation and contextual features in successive stages. We focus on (but are not limited to) face-related actions, a set of actions that includes several currently challenging categories. We present an average relative improvement of 35% over the state of the art and validate through experimentation the effectiveness of our approach.




1 Introduction

Figure 1: Actions where objects are barely visible. Accurately localizing the action-object is important for succeeding in classifying such images.

Recognizing actions in still images has been an active field of research in recent years, using multiple benchmarks [3, 5, 15, 20]. It is a challenging problem since it combines different aspects of recognition, including object recognition, human pose estimation, and human-object interaction. Action recognition schemes can be divided into transitive vs. intransitive. Our focus is on the recognition of transitive actions from still images. Such actions are typically based on an interaction between a body part and an action-object (the object involved in the action). In a cognitive study of 101 frequent actions and 15 body parts, Hidaka, Shohei and Smith [11] have shown that hands and faces (in particular mouth and eyes) were by far the most frequent body parts involved in actions. It is interesting to note that in the human brain, actions related to hands and faces appear to play a special role: neurons have been identified in both physiological and fMRI studies [2, 7] which appear to represent 'hand space' and 'face space', the regions of space surrounding the hands and face respectively. In this work we focus on, but are not limited to, such actions, in particular interactions between the mouth and small objects, including drinking, smoking, brushing teeth and others. We refer to them as face-related actions. For such actions, the ability to detect the action-object involved has a crucial effect on the attainable classification accuracy. A simple demonstration is summarized in Table 1 (Section 2 below). The action objects involved in such classes are often barely visible, making them hard to detect (see Figure 1). Furthermore, the spatial configuration of the action object with respect to the relevant body parts must be considered, to avoid misclassification due to the detection of an unrelated action object in the image. In other words, the right action-object should be detected within a particular surrounding context.

To achieve these goals, our algorithm includes a method for the detection of the relevant action object and related body parts as a means to aid the classification process. We seek the relevant body parts and the action object jointly using a cascade of two fully convolutional networks: the first operates at the entire image level, highlighting regions of high probability for the location of the action object. The second network, guided by the result of the first, refines the probability estimates within these regions. The results of both networks are combined to generate a set of action-object candidates which are pruned using contextual features. In the final stage, each candidate, together with features extracted from the entire image and from the two networks, is used to score the image as depicting a target action class. We validate our approach on a subset of the Stanford-40 Actions [20] dataset, achieving a 35% relative improvement over the state of the art.

1.1 Related Work

Recent work on action recognition in still images can be categorized into several approaches. One group of methods attempts to produce accurate human pose estimation and then utilizes it to extract features in relevant locations, such as near the head or hands, or below the torso [20, 4].

Others attempt to find relevant image regions in a semi-supervised manner, given the image label. Prest et al. [14] find candidate regions for action-objects and optimize a cost function which seeks agreement between the appearance of the action objects for each class as well as their location relative to the person. In [16] the objectness [1] measure is applied to detect many candidate regions in each image, after which multiple-instance learning is utilized to give more weight to the informative ones. Their method does not explicitly find regions containing action objects, but any region which is informative with respect to the target action. In [21] a random forest is trained by choosing the most discriminative rectangular image region (from a large set of randomly generated candidates) at each node, where the images are aligned so the face is in a known location. This has the advantage of spatially interpretable results.

Some methods seek the action objects more explicitly: [20] apply object detectors from Object-Bank [8] and use their output among other cues to classify the images. In [9], outputs of stronger object detectors are combined with a pose estimation of the upper body in a neural-network setting, showing improved results, where the object detectors are the main cause of the improvement in performance. Very recently, [6] introduced R*CNN, a semi-supervised approach to find the most informative image region constrained to overlap with the person bounding box. Also relevant to our work are some recent attention-based methods: [19] uses a recurrent neural network which restricts its attention to a relevant subset of features from the image at a given timestep. We do not use an RNN, but two different networks. Instead of restricting the view of the second network to a subset of the original features, we allow the network to extract new ones, allowing it to focus on more subtle details.

Unlike [14, 16, 21, 6], our approach is fully supervised. Unlike [9], we seek the relevant body parts and action objects jointly instead of using separate stages. In contrast with [6], we employ a coarse-to-fine approach which improves on the ability of a single stage to localize the relevant image region accurately.

The rest of the paper is organized as follows: in the next section, we demonstrate our claim that exact detection of the action object can be crucial for the task of classification. In Section 3, we describe our approach in detail. In Section 4 we report on experiments used to validate our approach on a subset of the Stanford-40 Actions [20] dataset. Section 5 contains a discussion & concluding remarks.

2 The Importance of Action Objects in Action Classification

Figure 2: Distribution of baseline performance for various action classes. We note the typical size of the action-objects for the transitive actions (those involving objects) and classify them as small (red) or large (green), leaving the rest of the classes undecided (orange). Mean performance is significantly worse on actions involving small objects (37.5%) than on those involving large objects (87%).

In this section, we demonstrate our claim that exact localization of action objects plays a critical role in classifying actions, in particular when they are hard to detect, e.g., due to their size or occlusion. We begin with a simple baseline experiment on the Stanford-40 Actions [20] dataset. It contains 9532 images of 40 action classes, split into 4000 for training and the rest for testing. We train a linear SVM to discriminate between each of the 40 action classes in a one-vs-all manner, using features from the fully connected fc6 layer of vgg-16 [17], and report performance in terms of average precision. Figure 2 shows the resulting per-class performance. We have marked classes where the action object tends to be very small or very large relative to the image. The difference in mean performance is striking: for the small objects the mean AP is 37.5%, whereas for the large ones it is 87%. The mean performance on the entire dataset is 67%.

Hence, we further restrict our analysis to a subset of Stanford-40 Actions [20] containing 5 classes: drinking, smoking, blowing bubbles, brushing teeth and phoning, where the action objects involved tend to be quite small. We name this the FRA dataset, standing for Face Related Actions. We augment the annotation of the images in FRA by adding the exact delineation of the action objects and faces and the bounding boxes of the hands. Next, we test the ability of a classifier to tell apart these 5 action classes when it is given various cues. We do this by extracting features for each image from the following regions:

  • Entire image (Global appearance)

  • Extended face bounding box scaled to 2x the original size (Face appearance)

  • Bounding box of action object (Object appearance)

  • Exact region of action object ( MO : Masked Object appearance)

We extract fc6 features from each of these regions. For MO, the masked object region, the image is cropped around the region bounding box and all pixels not belonging to the object mask are set to the value of the mean image learned by vgg-16. A linear SVM is trained on each feature representation alone and on concatenations of various combinations. We allow access to both face bounding boxes and object bounding boxes/masks at test time, as if having access to an Oracle which reveals them. The performance of the classifier with different feature combinations is summarized in Table 1. Clearly, while the global (G) and face (Face) features hold some information, they are significantly outperformed by the classifier when it is given the bounding box of the action object (O). This is further enhanced by masking the surroundings of the action object (MO), which performs best; the mask conveys information about the shape of the object in addition to its appearance. Combining all features provides a further boost, owing to the combination of local and contextual cues.
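The masked-object (MO) input described above can be illustrated with a minimal sketch. It assumes the image and the binary mask are given as nested lists of equal size and that the mean image is approximated by a single fill value; the function name `masked_object_crop` and these simplifications are hypothetical and not taken from the original implementation.

```python
def masked_object_crop(image, mask, fill_value):
    """Crop `image` to the bounding box of `mask` and replace every
    off-mask pixel inside the crop by `fill_value` (an assumed stand-in
    for the mean-image value)."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        raise ValueError("mask is empty")
    top, bottom, left, right = rows[0], rows[-1], cols[0], cols[-1]
    return [
        [image[r][c] if mask[r][c] else fill_value
         for c in range(left, right + 1)]
        for r in range(top, bottom + 1)
    ]


# Toy example: a 3x4 "image" and an L-shaped object mask.
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]
mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 0]]
crop = masked_object_crop(image, mask, fill_value=-1)
```

Because the off-mask pixels are flattened to a constant, the crop carries the outline of the object as well as its appearance, which is the property the text credits for the gain of MO over O.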

In the next section, we show our approach to detecting the action object automatically.

Region      Mean AP  Drinking  Smoking  Blowing bubbles  Brushing teeth  Phoning
G           .58      .54       .58      .74              .57             .45
Face        .69      .61       .57      .82              .72             .76
O           .76      .87       .80      .79              .55             .81
MO          .82      .91       .78      .85              .66             .88
G+Face+O    .88      .90       .92      .91              .80             .90
All         .91      .94       .93      .93              .85             .90
Table 1: Classification performance improves drastically as the classifier is given access to the action objects' exact locations. G: Global, Face: extended face bounding box, O: action Object bounding box, MO: Masked Object. This demonstrates the need for accurately detecting the action-object.

3 Approach

Figure 3: Flow of proposed method. A fully convolutional network is applied to predict body parts and action objects. This guides a secondary network to refine the predictions on a few selected image regions. Using contextual features, the regions are ranked & pruned. The remaining candidates are scored using both global and local features to produce a final classification, along with a visualization of the detected action object.

Our goal is to classify the action being performed in an image. To that end, we perform the following stages:

  • Produce a probability estimate for the locations of hands, head and action objects in the entire image

  • Refine the probability estimates at a set of selected locations

  • Using the above probabilities, generate a set of candidate action objects

  • Rank and prune the set of candidate objects by using contextual features

  • Extract features from the probabilistic estimates and appearances of the image and predicted action object locations to produce a final classification.

We now elaborate on each of these stages.

3.1 Coarse to Fine Semantic Segmentation

We begin by producing a semantic segmentation of the entire image, then refining it: we train a fully convolutional network as in [10] to predict a pixel-wise segmentation of the image. The network is trained to jointly predict the location of faces, hands, and different object categories relating to the different action classes. Using the framework of [18], we do so by fine-tuning the vgg-16 network of [17]. We denote the set of learned labels as $L$:

$$L = \{\text{face}, \text{hand}\} \cup O$$

where $O$ is the set of objects relating to action classes. We name this first network the coarse network, $N_c$.

For a network $N$ and image $I$, we denote by $P^N(I)$ the probability maps resulting from applying $N$ to $I$:

$$P^N(I) = N(I)$$

where the superscript indicates the network which operated on the image. In other words, $P^N(I)$ assigns to each pixel in $I$ a probability for each of the classes in $L$. For brevity, we may drop the parameters where appropriate, writing only $P$. For clarity, we may also write $P^{c}$ instead of $P^{N_c}$.

The predictions of the coarse network, albeit quite informative, can often miss the action object or misclassify it as belonging to another category (see Section 4). To improve the localization and accuracy of its estimates, we train a secondary network whose purpose is to refine the predictions of the first; it uses the same base network (vgg-16) but operates on a different spatial extent. We name this second network the fine network. It is trained on sub-windows of the original images, which are upscaled during both training and testing. Hence it is trained on data of a different scale than the coarse network. Moreover, since it operates on enlarged sub-windows of each image, its output is eventually mapped back to a finer scale in the original image's coordinate system. We note that in our experiments, the 32-pixel stride version of Long et al. [10] performed significantly worse than the full 8-pixel stride version, which uses intermediate features from the net's hierarchy to refine the final segmentation. Both networks are trained using the full 8-pixel stride method.

For each training image, we define a sub-window of the entire image whose extent offers a good tradeoff: it should be small enough to capture fine details, yet large enough to contain the desired objects or body parts. The details of how this sub-window is selected depend on the nature of the dataset and are described in Section 3.4. The training set for the fine network is obtained by cropping each training image to its sub-window and cropping the ground-truth label map accordingly.

The fine network is used to refine the output of the coarse network as follows: after applying the coarse network to an image, we seek the top-scoring local maxima of its probability maps, subject to a minimum distance in pixels between them. A window of fixed size relative to the original image is cropped around each local maximum, and the fine network is applied to that sub-window after proper resizing. The probabilities predicted for overlapping sub-windows are averaged. The number of maxima and their minimum separation were chosen by validating on a small subset of the training set. The result is the refined (fine) probability map. See Figure 4 for an example of the coarse and fine probability maps and the resulting predictions.
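The refinement step can be sketched as two small routines: selecting well-separated local maxima of a coarse probability map, and averaging the predictions of overlapping sub-windows. This is a minimal illustration under stated assumptions, not the authors' code: the parameters `k` and `min_dist` stand in for the values chosen by validation, maxima are found in a 3x3 neighbourhood, zero-valued pixels are ignored, and pixels covered by no window are returned as `None`.

```python
def top_local_maxima(prob, k, min_dist):
    """Return up to k (row, col) local maxima of a 2-D map, strongest
    first, such that any two are at least min_dist pixels apart."""
    h, w = len(prob), len(prob[0])
    peaks = []
    for r in range(h):
        for c in range(w):
            v = prob[r][c]
            neighbours = [
                prob[rr][cc]
                for rr in range(max(0, r - 1), min(h, r + 2))
                for cc in range(max(0, c - 1), min(w, c + 2))
                if (rr, cc) != (r, c)
            ]
            if v > 0 and all(v >= n for n in neighbours):
                peaks.append((v, r, c))
    peaks.sort(reverse=True)
    chosen = []
    for _, r, c in peaks:
        if all((r - pr) ** 2 + (c - pc) ** 2 >= min_dist ** 2
               for pr, pc in chosen):
            chosen.append((r, c))
        if len(chosen) == k:
            break
    return chosen


def average_windows(shape, windows):
    """Average per-pixel predictions of possibly overlapping windows.
    `windows` holds (top, left, patch) with patch a 2-D list already
    resized to the window's size in image coordinates."""
    h, w = shape
    total = [[0.0] * w for _ in range(h)]
    count = [[0] * w for _ in range(h)]
    for top, left, patch in windows:
        for i, row in enumerate(patch):
            for j, v in enumerate(row):
                total[top + i][left + j] += v
                count[top + i][left + j] += 1
    return [[total[r][c] / count[r][c] if count[r][c] else None
             for c in range(w)] for r in range(h)]


coarse = [[0.1, 0.2, 0.1, 0.0],
          [0.2, 0.9, 0.2, 0.0],
          [0.1, 0.2, 0.1, 0.6],
          [0.0, 0.0, 0.5, 0.7]]
peaks = top_local_maxima(coarse, k=2, min_dist=2)
merged = average_windows((2, 3), [(0, 0, [[1.0, 1.0], [1.0, 1.0]]),
                                  (0, 1, [[3.0, 3.0], [3.0, 3.0]])])
```

In a full pipeline each returned peak would define a crop that is resized, passed through the fine network, and resized back before being handed to `average_windows`.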

(a) Coarse prediction

(b) Fine prediction
Figure 4: Coarse-to-fine predictions. (a) The coarse network is applied to the entire image, producing (left) a pixelwise per-class probability map and (center-top) a prediction; (b) guided by local maxima of the coarse probability map, the fine network is applied to a selected set of sub-windows. A bubble wand missed by the coarse stage is detected by the refinement process, as are other fine details. Probabilities of false classes (e.g., brushing) are suppressed; see the differences in the predicted probability maps. Best viewed in color online.

3.2 Contextual Features

We now describe how we use the outputs of both networks to generate candidate regions for action objects. Identifying the region corresponding to the object depends on its own appearance as well as on the context, e.g., of the face and hand. The coarse and fine networks often produce good candidate regions, along with many false ones: typically tens of regions per image. In addition, the location of a candidate may be correct while its predicted identity (i.e., the class of maximal probability) is wrong. Therefore we produce candidate regions from the resultant probability maps in two complementary ways. Let $P^N$ denote the probability maps produced by a network $N$ (Section 3.1), with $P^N_c$ the channel of class $c$. Define

$$S^N(p) = \arg\max_{c} P^N_c(p)$$

as the pixelwise prediction of net $N$ at pixel $p$. We denote by $R_1$ the set of connected components in $S^N$. In addition, let $M$ be the set of local maxima, computed separately for each channel of $P^N$. For each channel we apply the adaptive thresholding technique of Otsu [13]. We denote by $R_2$ the union of the regions obtained by the thresholding process. We remove from $R_2$ regions which do not contain any element of $M$. Finally, we denote

$$R = R_1 \cup R_2$$

as the set of candidate regions for the image. $R_1$ and $R_2$ are complementary since the region of maximal prediction of the network does not necessarily contain a local maximum in any of the probability channels.
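The two complementary ways of producing candidates can be sketched in plain Python. The sketch makes several assumptions that are not stated in the text: channel 0 is treated as a background class, regions are 4-connected, Otsu's threshold is computed on a 256-bin histogram of probabilities in [0, 1], and the per-channel local maxima are supplied by the caller. All function names are hypothetical.

```python
from collections import deque


def otsu_threshold(values, bins=256):
    """Otsu's method: pick the cut maximising between-class variance."""
    hist = [0] * bins
    for v in values:
        hist[min(int(v * bins), bins - 1)] += 1
    total = len(values)
    sum_all = sum(i * n for i, n in enumerate(hist))
    best_t, best_var, w0, sum0 = 0, -1.0, 0, 0
    for t in range(bins):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        var = w0 * w1 * (sum0 / w0 - (sum_all - sum0) / w1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return (best_t + 1) / bins


def connected_components(label_map, background=0):
    """4-connected components of equal, non-background labels."""
    h, w = len(label_map), len(label_map[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for r in range(h):
        for c in range(w):
            lab = label_map[r][c]
            if lab == background or seen[r][c]:
                continue
            queue, pixels = deque([(r, c)]), set()
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                pixels.add((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w and not seen[ny][nx]
                            and label_map[ny][nx] == lab):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append((lab, pixels))
    return regions


def candidate_regions(prob_maps, maxima):
    """Union of (1) components of the argmax map and (2) per-channel
    Otsu regions that contain a local maximum of that channel."""
    h, w = len(prob_maps[0]), len(prob_maps[0][0])
    argmax = [[max(range(len(prob_maps)), key=lambda k: prob_maps[k][r][c])
               for c in range(w)] for r in range(h)]
    regions = [pixels for _, pixels in connected_components(argmax)]
    for k in range(1, len(prob_maps)):  # skip the assumed background channel
        thr = otsu_threshold([v for row in prob_maps[k] for v in row])
        binary = [[1 if v >= thr else 0 for v in row] for row in prob_maps[k]]
        for _, pixels in connected_components(binary):
            if any(p in pixels for p in maxima[k]) and pixels not in regions:
                regions.append(pixels)
    return regions


background = [[0.9, 0.9, 0.6, 0.9], [0.9, 0.2, 0.6, 0.9], [0.9, 0.9, 0.9, 0.9]]
obj = [[0.1, 0.1, 0.4, 0.1], [0.1, 0.8, 0.4, 0.1], [0.1, 0.1, 0.1, 0.1]]
regions = candidate_regions([background, obj], maxima=[[], [(1, 1)]])
```

In the toy example the argmax map yields only the single strongest pixel, while the thresholded channel recovers a larger region around it, which illustrates why the two sources are worth combining.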

We train a regressor to score each candidate region. Let $r$ be a candidate region and $B_r$ its bounding box. We extract short- and long-range contextual features for $r$. For the short-range features, we scale $B_r$ by a factor of 3 while retaining the box center. Next, we split the enlarged bounding box into a grid, where we assign to each grid cell the mean value of each channel of the probability maps inside that cell. Formally, let $W$ be the window defined by the enlarged $B_r$ and $W_{i,j}$ the sub-window for the $i$-th row and $j$-th column of the grid. We define

$$f_{i,j,c} = \frac{1}{A} \sum_{p \in W_{i,j}} P_c(p)$$

where $P_c$ is channel $c$ of the probability maps and $A$ is the area of each grid cell in pixels. This yields a feature vector with one entry per grid cell and channel, representing the values of the network's predictions both inside the predicted region and in its immediate surroundings. Similarly, we define a second bounding box, $B'_r$, to be 1/3 the size of the image with the same center as $B_r$. We define a grid inside $B'_r$ and use it to extract long-range contextual features in the same manner as for the short-range ones.

We compute these features and concatenate them for all candidate regions from the training set (including the ground truth ones). We train a regressor whose input is these features and the target output is the overlap with the bounding boxes of ground-truth regions.
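The grid-pooled context features can be sketched as follows. The grid resolution is left as a parameter `grid`, boxes are `(top, left, bottom, right)` with exclusive bottom and right edges, and pixels of the enlarged box that fall outside the image are assumed to contribute zero; these conventions and the function names are illustrative assumptions rather than details from the text.

```python
import math


def scale_box(box, factor):
    """Scale a (top, left, bottom, right) box about its centre."""
    top, left, bottom, right = box
    cy, cx = (top + bottom) / 2.0, (left + right) / 2.0
    half_h = (bottom - top) * factor / 2.0
    half_w = (right - left) * factor / 2.0
    return (cy - half_h, cx - half_w, cy + half_h, cx + half_w)


def grid_mean_features(prob_maps, box, grid):
    """Mean of every probability channel inside each cell of a
    grid x grid partition of `box` (channel-major ordering)."""
    height, width = len(prob_maps[0]), len(prob_maps[0][0])
    top, left, bottom, right = box
    cell_h, cell_w = (bottom - top) / grid, (right - left) / grid
    features = []
    for channel in prob_maps:
        for i in range(grid):
            r0 = math.floor(top + i * cell_h)
            r1 = math.floor(top + (i + 1) * cell_h)
            for j in range(grid):
                c0 = math.floor(left + j * cell_w)
                c1 = math.floor(left + (j + 1) * cell_w)
                area = max((r1 - r0) * (c1 - c0), 1)
                # Only in-image pixels are summed; the rest count as zero.
                total = sum(
                    channel[r][c]
                    for r in range(max(r0, 0), min(r1, height))
                    for c in range(max(c0, 0), min(c1, width))
                )
                features.append(total / area)
    return features


channel = [[1, 1, 0, 0],
           [1, 1, 0, 0],
           [0, 0, 2, 2],
           [0, 0, 2, 2]]
# Short-range style features: enlarge the candidate box, then pool.
short_range = grid_mean_features([channel], scale_box((1, 1, 3, 3), 2), grid=2)
# A box near the image corner, enlarged by 3, partly leaves the image.
corner = grid_mean_features([channel], scale_box((0, 0, 2, 2), 3), grid=3)
```

The long-range features would reuse `grid_mean_features` with a box one third the size of the image centred on the candidate, and the concatenated vectors would then be fed to the regressor.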

It is worth noting that we have attempted to incorporate these contextual features as another layer in the network, as they effectively serve as another pooling layer over the probability maps. Perhaps due to the small training set we have used, this network did not converge to results as good as those obtained by our context features. At least for a dataset of this size, constraining the context features to be of this form performed better than letting the network learn how to perform the pooling on its own. It remains an open question if a larger dataset would facilitate end-to-end learning, effectively learning the contextual features as an integral part of the network.

3.3 Classification

We now show how all the features are combined into a final classification. We extract the features as follows: for a network $N$ we denote

$$m^N = \left[ \max_{p} P^N_c(p) \right]_{c}$$

where $N$ is either the coarse or the fine network and $P^N_c$ is channel $c$ of its probability maps. $m^N$ is a concatenation of the maximal values of each channel, including those predicting the body parts; it serves as a global representation of the image. Note that while the features extracted this way from the fine network are more discriminative than those from the coarse network, we found that the networks are complementary; combining the features from both networks works better than using either on its own. Next we add features from the action object and face regions. The face is detected using the face detector of [12]. We enlarge each image by a factor of 2 before running the face detector and retain the single highest-scoring face per image. Using the regressor learned in Section 3.2, we rank the candidate regions and retain the top-ranked regions per image. For each region, we extract appearance features in the form of fc6 features using the vgg-16 [18] network. In training, we use the ground-truth regions. We thus obtain feature representations for the face and for each of the top-ranked candidate objects, as well as the fc6 representation of the entire image. We extract features from the candidate regions by using both bounding boxes and masked versions as in Section 2, with the candidate regions as masks. The final score of an image for a given class is computed by applying a set of weights, learned by an SVM classifier in a one-versus-all manner per action class, to the concatenation of these features. Please refer to Section 4 for experiments validating our approach.
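A minimal sketch of the final scoring is given below. It assumes that the per-network global descriptor is the per-channel maximum of the probability maps, that each class has a linear scorer given as a `(weights, bias)` pair, and that the image score for a class is the maximum over candidate regions; the last choice mirrors how Section 4 describes the Obj score and is an assumption for the combined score. The names and toy weights are hypothetical.

```python
def max_per_channel(prob_maps):
    """Global descriptor: the maximal value of every probability channel."""
    return [max(max(row) for row in channel) for channel in prob_maps]


def linear_score(weights, bias, features):
    """Score of a linear classifier, e.g. one learned by an SVM."""
    return sum(w * f for w, f in zip(weights, features)) + bias


def score_image(class_weights, global_feats, candidate_feats):
    """Per-class score: best linear score over all candidate regions,
    each candidate being appended to the shared global features."""
    return {
        name: max(linear_score(w, b, global_feats + cand)
                  for cand in candidate_feats)
        for name, (w, b) in class_weights.items()
    }


coarse_maps = [[[0.1, 0.7], [0.2, 0.3]], [[0.5, 0.4], [0.0, 0.1]]]
fine_maps = [[[0.9, 0.1]], [[0.2, 0.3]]]
global_feats = max_per_channel(coarse_maps) + max_per_channel(fine_maps)

# Toy one-dimensional appearance features for two candidate regions.
candidates = [[1.0], [2.0]]
class_weights = {
    "drinking": ([1, 0, 0, 0, 1], 0.0),
    "smoking": ([0, 0, 0, 1, -1], 0.5),
}
scores = score_image(class_weights, global_feats, candidates)
predicted = max(scores, key=scores.get)
```

In practice the candidate features would be fc6 vectors of the boxed and masked regions, and the face and whole-image fc6 features would be appended to `global_feats` in the same way.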

3.4 Training

To train the coarse network, we construct a fully convolutional neural net with the DAG architecture [10] to produce a prediction of 8-pixel stride by fine-tuning the vgg-16 network. We use a learning rate of .0001 for 100 epochs, as our training set contains only a few hundred images. A similar procedure is followed for the fine network, where the training samples are selected as described in Section 3.1. For the FRA dataset, the sub-windows are selected to be the ground-truth bounding box of each face scaled by a factor of 2; this includes most of the action objects. Note that for FRA we did not use the provided bounding boxes at test time. The regressor for predicting the locations of action objects is a support-vector regressor trained on the context features around each candidate region (see Section 3.2). All classifiers are trained using an SVM with a fixed regularization constant.

4 Experiments

Features used (of G, Face, C, F, Obj)  Mean AP  Drinking  Smoking  Blowing bubbles  Brushing teeth  Phoning
+ + + +                                0.845    0.840     0.889    0.913            0.743           0.839
+ + + +                                0.851    0.836     0.907    0.909            0.751           0.851
+ + + +                                0.856    0.842     0.898    0.907            0.771           0.865
+ + + +                                0.835    0.818     0.841    0.905            0.753           0.835
+ + + +                                0.860    0.801     0.895    0.905            0.780           0.860
+ + + + +                              0.865    0.845     0.910    0.914            0.786           0.868

Table 2: Ablation study of various feature combinations (G: Global, Face, C/F: Coarse/Fine segmentation, Obj: action Object features). Removing the fine phase has the worst effect on performance.

Feature  Mean AP  Drinking  Smoking  Blowing bubbles  Brushing teeth  Phoning
C        0.636    0.593     0.692    0.720            0.551           0.623
G        0.642    0.601     0.623    0.789            0.660           0.537
F        0.668    0.565     0.816    0.697            0.513           0.749
Face     0.699    0.658     0.578    0.829            0.686           0.742
Obj      0.743    0.776     0.744    0.851            0.606           0.738
Table 3: Single-feature performance for classifying actions (G: Global, Face, C/F: Coarse/Fine segmentation, Obj: action Object features). On average, features extracted from the automatically detected action-object outperform the others.

We now show some experimental results. We begin with FRA, a subset of the Stanford-40 Actions dataset containing 5 action classes: drinking, smoking, blowing bubbles, brushing teeth and phoning. To show the contribution of each feature, we perform the following ablation study. First, we test the performance of each feature on its own. In Table 3, we can see that while the coarse-level information extracted by the coarse network provides moderate results, it is outperformed by the other feature types, namely the global image representation and that of the face. The fine representation performs on average slightly worse than the extended face area, with the biggest exception being the smoking category, where it improves the AP from .578 to .816 (in many cases the cigarettes are held far away from the face). As can be seen in the images of Figure 5, the semantic segmentation captures the cigarettes in the image well. The Obj score was obtained by assigning to each image the score of its highest-scoring candidate region. We can see that its performance nears that of the “Oracle” classifier, which is given the bounding box at test time. See Figure 5 for examples of top-ranked detected action objects, weighted by the predicted object probabilities.

(a) drinking
(b) smoking
(c) blowing bubbles
(d) brushing teeth
(e) phoning
Figure 5: Highly scoring action-objects detected by our method, weighted by the action-object per pixel probability for the respective class.

Next, we performed an ablation study showing how performance changes when we use all of the sources of information except one. Table 2 summarizes this. We can see that performance is worst when excluding the predictions of the fine network. Also, as expected from our motivation of the problem in Section 2, the best performance is obtained when including the predicted object locations. Overall, mean average precision increases from 0.642, obtained with the baseline global features produced by the vgg-16 network, to 0.865 with all features combined, a 35% relative increase. Note that this is also quite near the results of the “Oracle” classifier (Section 2).
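For reference, the relative increase quoted above follows directly from the two mean AP values reported in Tables 2 and 3; the subscripts below are only labels for the global-feature baseline and the full feature set.

```latex
\frac{\mathrm{mAP}_{\text{all}} - \mathrm{mAP}_{\text{G}}}{\mathrm{mAP}_{\text{G}}}
  = \frac{0.865 - 0.642}{0.642}
  = \frac{0.223}{0.642}
  \approx 0.347 \approx 35\%
```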

4.1 Joint training for Pose and Objects

It is noteworthy that the joint training of the networks to detect the hands and faces along with the action objects performed dramatically better than attempting to train a network to predict the location of the action objects only; we have attempted to train a network when the ground-truth masks contained only the locations and identities of action objects without body parts. This worked very poorly; we conjecture that while the action objects may be difficult to detect on their own, contextual cues are implicitly learned by the networks. Such cues likely include proximity to the face or hand.

5 Conclusions & Future Work

We have demonstrated that exact localization of action objects and their relation to the body parts is important for action classes where the action-objects are small and often barely visible. We have done so using two main elements. First, a coarse-to-fine approach which focuses in a second stage on relevant image regions. Second, contextual cues which aid the detection of the small objects. Context is used both during the networks' operation, as they seek the objects jointly with the relevant body parts, and when pruning false object candidates generated by the networks, by considering their context explicitly. The coarse-to-fine approach, whose networks are based on the full 8-pixel stride model of [10], utilizes features from intermediate levels of the network and not only from the top level. It outperforms a purely feed-forward method such as the 32-pixel stride version. Together, these elements aid in good localization of action objects, leading to a significant improvement over baseline methods. Our method uses two networks and a specific form of contextual features. Our comparisons showed that the results are better than those obtained by incorporating the entire process in an end-to-end pipeline; it remains an open question whether this is due to the relatively small size of the training set. A current drawback of the method is that it requires annotation of both body parts and action objects in the dataset; in the future we intend to alleviate this constraint by combining information from existing datasets, which typically contain annotations of objects or poses, but not both.