Actions where objects are barely visible. Accurately localizing the action-object is important for classifying such images correctly.
Recognizing actions in still images has been an active field of research in recent years, with multiple benchmarks [3, 5, 15, 20]. It is a challenging problem, since it combines different aspects of recognition, including object recognition, human pose estimation, and human-object interaction. Action recognition schemes can be divided into transitive vs. intransitive. Our focus is on the recognition of transitive actions from still images. Such actions are typically based on an interaction between a body part and an action-object (the object involved in the action). In a cognitive study of 101 frequent actions and 15 body parts, Maouene, Hidaka, and Smith have shown that hands and faces (in particular the mouth and eyes) were by far the most frequent body parts involved in actions. It is interesting to note that in the human brain, actions related to hands and faces appear to play a special role: neurons have been identified in both physiological and fMRI studies [2, 7] which appear to represent ’hand space’ and ’face space’, the regions of space surrounding the hands and face respectively. In this work we focus on (but are not limited to) such actions, in particular interactions between the mouth and small objects, including drinking, smoking, brushing teeth and others. We refer to them as face-related actions. For such actions, the ability to detect the action-object involved has a crucial effect on the attainable classification accuracy. A simple demonstration is summarized in Table 1 (Section 2 below). The action objects involved in such classes are often barely visible, making them hard to detect (see Figure 1). Furthermore, the spatial configuration of the action object with respect to the relevant body parts must be considered, to avoid misclassification due to the detection of an unrelated action object in the image. In other words, the right action-object should be detected in a particular surrounding context.
To achieve these goals, our algorithm includes a method for the detection of the relevant action object and related body parts as a means to aid the classification process. We seek the relevant body parts and the action object jointly using a cascade of two fully convolutional networks: the first operates at the entire image level, highlighting regions of high probability for the location of the action object. The second network, guided by the result of the first, refines the probability estimates within these regions. The results of both networks are combined to generate a set of action-object candidates, which are pruned using contextual features. In the final stage, each candidate, together with features extracted from the entire image and from the two networks, is used to score the image as depicting a target action class. We validate our approach on a subset of the Stanford-40 Actions dataset, achieving a 35% relative improvement over the state of the art.
1.1 Related Work
Recent work on action recognition in still images can be categorized into several approaches. One group of methods attempts to produce an accurate human pose estimation and then utilize it to extract features at relevant locations, such as near the head or hands, or below the torso [20, 4].
Others attempt to find relevant image regions in a semi-supervised manner, given the image label. Prest et al. find candidate regions for action-objects and optimize a cost function which seeks agreement between the appearance of the action objects for each class as well as their location relative to the person. In another approach, the objectness measure is applied to detect many candidate regions in each image, after which multiple-instance learning is utilized to give more weight to the informative ones. This method does not explicitly find regions containing action objects, but any region which is informative with respect to the target action. In other work, a random forest is trained by choosing the most discriminative rectangular image region (from a large set of randomly generated candidates) at each node, where the images are aligned so the face is in a known location. This has the advantage of spatially interpretable results.
Some methods seek the action objects more explicitly: some apply object detectors from Object-Bank and use their output among other cues to classify the images. Elsewhere, outputs of stronger object detectors are combined with a pose estimation of the upper body in a neural-net setting, showing improved results, with the object detectors being the main cause of the improvement in performance. Very recently, R*CNN was introduced, a semi-supervised approach to find the most informative image region, constrained to overlap with the person bounding box. Also relevant to our work are some recent attention-based methods: one such method uses a recurrent neural network which restricts its attention to a relevant subset of features from the image at a given timestep. We do not use an RNN, but two different networks. Instead of restricting the view of the second network to a subset of the original features, we allow the network to extract new ones, allowing it to focus on more subtle details.
Unlike [14, 16, 21, 6], our approach is fully supervised. Unlike these methods, we seek the relevant body parts and action objects jointly instead of using separate stages. In contrast with previous work, we employ a coarse-to-fine approach which improves on the ability of a single stage to localize the relevant image region accurately.
The rest of the paper is organized as follows: in the next section, we demonstrate our claim that exact detection of the action object can be crucial for the task of classification. In Section 3, we describe our approach in detail. In Section 4, we report on experiments used to validate our approach on a subset of the Stanford-40 Actions dataset. Section 5 contains a discussion and concluding remarks.
2 The Importance of Action Objects in Action Classification
In this section, we demonstrate our claim that exact localization of action objects plays a critical role in classifying actions, in particular when they are hard to detect, e.g., due to their size or occlusion. We begin with a simple baseline experiment on the Stanford-40 Actions dataset. It contains 9532 images of 40 action classes, split into 4000 for training and the rest for testing. We train a linear SVM to discriminate between each of the 40 action classes in a one-vs-all manner, using features from the penultimate layer of vgg-16, i.e., fc6, and report performance in terms of average precision. Figure 2 shows the resulting per-class performance. We have marked classes where the action object tends to be very small or very large relative to the image. The difference in mean performance is striking: for the small objects the mean AP is 37.5%, while for the large ones it is 87%. The mean performance on the entire dataset is 67%. Hence, we further restrict our analysis to a subset of Stanford-40 Actions containing 5 classes: drinking, smoking, blowing bubbles, brushing teeth and phoning, where the action objects involved tend to be quite small. We name this the FRA (Face-Related Actions) dataset. We augment the annotation of the images in FRA by adding the exact delineation of the action objects and faces and the bounding boxes of the hands. Next, we test the ability of a classifier to tell apart these 5 action classes when it is given various cues. We do this by extracting features for each image from the following regions:
• Entire image (Global appearance)
• Extended face bounding box, scaled to 2x the original size (Face appearance)
• Bounding box of the action object (Object appearance)
• Exact region of the action object (MO: Masked Object appearance)
We extract fc6 features from each of these regions. For MO, the masked object region, the image is cropped around the region bounding box and all pixels not belonging to the object mask are set to the value of the mean image learned by vgg-16. A linear SVM is trained on each feature representation alone and on concatenations of various combinations. We allow access to both face bounding boxes and object bounding boxes/masks at test time, as if having access to an Oracle which reveals them. The performance of the classifier with different feature combinations is summarized in Table 1. Clearly, while the global (G) and face (F) features hold some information, they are significantly outperformed by the classifier when it is given the bounding box of the action object (O). This is further enhanced by masking the surroundings of the action object (MO), which performs best; the masking conveys some information about the shape of the object in addition to its appearance. Combining all features provides a further boost, owing to the combination of local and contextual cues.
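As a concrete illustration, the MO masking step above can be sketched with numpy. This is a minimal sketch, assuming images are stored as H×W×3 arrays with a binary object mask; `mean_pixel` stands in for the vgg-16 mean image, and the toy values below are purely illustrative.

```python
import numpy as np

def mask_object_region(image, bbox, mask, mean_pixel):
    """Crop `image` around `bbox` and replace every pixel outside the
    object `mask` with the dataset mean pixel (the MO representation)."""
    x0, y0, x1, y1 = bbox
    crop = image[y0:y1, x0:x1].copy()
    crop_mask = mask[y0:y1, x0:x1]
    crop[~crop_mask] = mean_pixel  # suppress the background around the object
    return crop

# Toy example: a 6x6 "image" with a 2x2 object in its center.
img = np.arange(6 * 6 * 3, dtype=np.float32).reshape(6, 6, 3)
obj_mask = np.zeros((6, 6), dtype=bool)
obj_mask[2:4, 2:4] = True
out = mask_object_region(img, (1, 1, 5, 5), obj_mask, mean_pixel=np.zeros(3))
```

The crop keeps the surrounding box (so the object's scale and position are preserved) while the masking removes background appearance, leaving mainly the object's shape and texture.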
In the next section, we show our approach to detecting the action object automatically.
Our goal is to classify the action being performed in an image. To that end, we perform the following stages:
• Produce a probability estimate for the locations of hands, head and action objects in the entire image
• Refine the probability estimates at a set of selected locations
• Using the above probabilities, generate a set of candidate action objects
• Rank and prune the set of candidate objects using contextual features
• Extract features from the probabilistic estimates and appearances of the image and predicted action-object locations to produce a final classification
We now elaborate on each of these stages.
3.1 Coarse-to-Fine Semantic Segmentation
We begin by producing a semantic segmentation of the entire image, then refining it: we train a fully convolutional network to predict a pixel-wise segmentation of the image. The network is trained to predict jointly the location of faces, hands, and the different object categories relating to the different action classes. We do so by fine-tuning the vgg-16 network. We denote the set of learned labels as $L = \{face, hand\} \cup O$, where $O$ is the set of objects relating to the action classes. We name this first network $N_1$.
For a network $N$ and image $I$, we denote by $P^N(I)$ the probability maps resulting from applying $N$ to $I$, where the superscript indicates the network which operated on the image. In other words, $P^N(I)$ assigns to each pixel in $I$ a probability for each of the classes in $L$. For brevity, we may drop the parameters where appropriate, writing only $P^N$; for clarity, we may also write $P^1$ instead of $P^{N_1}$.
The predictions of this network, albeit quite informative, can often miss the action object or misclassify it as belonging to another category (see Section 4). To improve the localization and accuracy of the estimates it produces, we train a second network, $N_2$, with the same base architecture (vgg-16) but operating at a different extent, whose purpose is to refine the predictions of the first network. It is trained on sub-windows of the original images, which are upscaled during the training and testing process. Hence it is trained on data of a different scale than $N_1$. Moreover, since it operates on enlarged sub-windows of each image, its output is eventually transformed to a finer scale in the original image’s coordinate system. We note that in our experiments, attempting to train the 32-pixel-stride version of Long et al. worked significantly less well than the full 8-pixel-stride version, which uses intermediate features from the net’s hierarchy to refine the final segmentation. Both networks are trained using the full 8-pixel-stride method.
For each training image $I$, we define $W(I)$ to be a sub-window of the entire image whose extent allows a good tradeoff: the inspected image region should be small, to capture fine details, yet contain the desired objects or body parts. The details of how this bounding box is selected depend on the nature of the dataset and are described in Section 3.4. The training set for $N_2$ is obtained by cropping each training image around $W(I)$ and cropping the ground-truth label-map accordingly.
$N_2$ is used to refine the output of $N_1$ as follows: after applying $N_1$ to an image, we seek the top $k$ local maxima of $P^1$ which are at least $d$ pixels apart. A window is cropped around each local maximum, with a fixed size relative to the size of the original image, and $N_2$ is applied to that sub-window after proper resizing. The probabilities predicted for overlapping sub-windows are averaged. We chose $k$ and $d$ by validating on a small subset of the training set. Denote the resulting refined probability map by $P^2$. See Figure 4 for an example of the resultant coarse and fine probability maps and the resulting predictions.
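The selection of refinement centers described above can be sketched as a greedy non-maximum suppression over a single probability channel. This is an illustrative sketch only: `k` and `min_dist` are placeholder values, not the validated choices, and the greedy top-value selection approximates the "top local maxima, mutually distant" criterion.

```python
import numpy as np

def top_local_maxima(prob_map, k=3, min_dist=8):
    """Greedily pick up to k highest-probability locations that are at
    least `min_dist` pixels apart (Chebyshev distance); these serve as
    centers for the sub-windows passed to the refinement network."""
    flat = np.argsort(prob_map, axis=None)[::-1]  # indices, descending value
    peaks = []
    for idx in flat:
        y, x = np.unravel_index(idx, prob_map.shape)
        if all(max(abs(y - py), abs(x - px)) >= min_dist for py, px in peaks):
            peaks.append((y, x))
        if len(peaks) == k:
            break
    return peaks

# Toy map with two well-separated peaks and one near-duplicate.
prob = np.zeros((20, 20))
prob[5, 5], prob[5, 6], prob[15, 15] = 1.0, 0.9, 0.8
peaks = top_local_maxima(prob, k=2, min_dist=8)
```

A window of fixed relative size would then be cropped around each returned center, upscaled, and fed to the second network, with overlapping predictions averaged.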
3.2 Contextual Features
We now describe how we use the outputs of both networks to generate candidate regions for action objects. Identifying the region corresponding to the object depends on its own appearance as well as on its context, e.g., that of the face and hand. The outputs of the coarse and fine networks often produce good candidate regions, along with many false ones: typically tens of regions per image. In addition, the location of a candidate may be correct but its predicted identity (i.e., the class of maximal probability) wrong. Therefore we produce candidate regions from the resultant probability maps in two complementary ways. Define $M^N$ to be the pixelwise prediction of net $N$, i.e., the class of maximal probability at each pixel. We denote by $CC(M^N)$ the set of connected components in $M^N$. In addition, let $X$ be the set of local maxima, computed separately for each channel of $P^N$. For each channel, we apply the adaptive thresholding technique of Otsu. We denote by $T$ the union of the regions obtained by the thresholding process, and remove from $T$ regions which do not contain any element of $X$. Finally, we denote $R(I) = CC(M^N) \cup T$ as the set of candidate regions for the image. $CC(M^N)$ and $T$ are complementary, since a connected component of the maximal prediction may not necessarily contain a local maximum in any of the probability channels.
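The two complementary candidate-generation routes can be sketched as follows. This is a simplified sketch, assuming a (channels, H, W) probability volume; the filtering of thresholded regions by local maxima is omitted for brevity, and the small Otsu routine is a generic re-implementation rather than the exact one used in the paper.

```python
import numpy as np
from scipy import ndimage

def otsu_threshold(values, bins=64):
    """Minimal Otsu threshold: pick the bin center maximizing the
    between-class variance of the histogram of `values`."""
    hist, edges = np.histogram(values, bins=bins)
    hist = hist.astype(float) / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    best_t, best_var = centers[0], -1.0
    for i in range(1, bins):
        w0, w1 = hist[:i].sum(), hist[i:].sum()
        if w0 == 0 or w1 == 0:
            continue
        m0 = (hist[:i] * centers[:i]).sum() / w0
        m1 = (hist[i:] * centers[i:]).sum() / w1
        var = w0 * w1 * (m0 - m1) ** 2  # between-class variance
        if var > best_var:
            best_var, best_t = var, centers[i]
    return best_t

def candidate_regions(prob):
    """Candidate object regions from a (C, H, W) probability volume:
    connected components of the argmax prediction, plus per-channel
    Otsu-thresholded blobs (the two complementary routes)."""
    argmax_map = prob.argmax(axis=0)
    regions = []
    for c in range(prob.shape[0]):
        labels, n = ndimage.label(argmax_map == c)
        regions += [labels == i for i in range(1, n + 1)]
        t = otsu_threshold(prob[c].ravel())
        labels, n = ndimage.label(prob[c] > t)
        regions += [labels == i for i in range(1, n + 1)]
    return regions

# Toy volume: a weak background channel and one strong 3x3 blob.
prob = np.zeros((2, 10, 10))
prob[0] = 0.1
prob[1, 2:5, 2:5] = 0.9
regions = candidate_regions(prob)
```

In the full method, each thresholded blob would additionally be discarded unless it contains a local maximum of some probability channel, pruning many spurious candidates.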
We train a regressor to score each candidate region. Let $r$ be a candidate region and $b(r)$ its bounding box. We extract short- and long-range contextual features for $r$: for the short-range features, we scale $b(r)$ by a factor of 3 while retaining the box center. Next, we split the enlarged bounding box into a grid, where we assign to each grid cell the mean value of each channel of the probability maps inside that cell. Formally, let $W_{ij}$ be the sub-window of the enlarged box at row $i$ and column $j$. We define the feature of that cell as the per-channel mean $\frac{1}{|W_{ij}|}\sum_{p \in W_{ij}} P(p)$, where $|W_{ij}|$ is the area of the grid cell in pixels. This yields a feature vector whose length is the number of grid cells times the number of channels, representing the values of the network’s predictions both inside the predicted region and in its immediate surroundings. Similarly, we define a second bounding box, 1/3 the size of the image, with the same center as $b(r)$; we define a grid inside it and use it to extract long-range contextual features in the same manner as for the short-range ones.
We compute these features and concatenate them for all candidate regions from the training set (including the ground-truth ones). We train a regressor whose input is these features and whose target output is the overlap with the bounding boxes of the ground-truth regions.
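The short-range grid pooling might look like the sketch below. The 3x scaling follows the text; the 3x3 grid size is an assumption, since the grid dimensions are not specified here, and cells falling outside the image are zero-padded.

```python
import numpy as np

def grid_context_features(prob, bbox, grid=3, scale=3.0):
    """Short-range context for a candidate: scale `bbox` by `scale`
    about its center, split the enlarged box into a grid x grid layout,
    and record the mean of every probability channel in each cell."""
    C, H, W = prob.shape
    x0, y0, x1, y1 = bbox
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    w, h = (x1 - x0) * scale, (y1 - y0) * scale
    feats = []
    for i in range(grid):
        for j in range(grid):
            gy0 = int(np.clip(cy - h / 2 + i * h / grid, 0, H))
            gy1 = int(np.clip(cy - h / 2 + (i + 1) * h / grid, 0, H))
            gx0 = int(np.clip(cx - w / 2 + j * w / grid, 0, W))
            gx1 = int(np.clip(cx - w / 2 + (j + 1) * w / grid, 0, W))
            cell = prob[:, gy0:gy1, gx0:gx1]
            feats.append(cell.mean(axis=(1, 2)) if cell.size else np.zeros(C))
    return np.concatenate(feats)  # length = grid * grid * channels

# Toy (2, 30, 30) probability volume and a 6x6 candidate box.
prob = np.full((2, 30, 30), 0.5)
f = grid_context_features(prob, (12, 12, 18, 18))
```

The long-range variant would reuse the same routine with a box one third the image size centered on the candidate, and the concatenated short- and long-range vectors feed the overlap regressor.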
It is worth noting that we attempted to incorporate these contextual features as another layer in the network, as they effectively serve as another pooling layer over the probability maps. Perhaps due to the small training set we used, this network did not converge to results as good as those obtained by our contextual features. At least for a dataset of this size, constraining the context features to be of this form performed better than letting the network learn how to perform the pooling on its own. It remains an open question whether a larger dataset would facilitate end-to-end learning, effectively learning the contextual features as an integral part of the network.
We now show how all the features are combined into a final classification. For a network $N$, we denote by $F_N$ the concatenation of the maximal values of each channel of $P^N$, including those predicting the body parts; it serves as a global representation of the image. We found that the networks are complementary in this respect: combining $F_{N_1}$ and $F_{N_2}$ works better than either on its own. Next, we add features from the action-object and face regions. The face is detected using the face detector of Mathias et al.; we increase the size of each image by 2 before running the face detector and retain the single highest-scoring face per image. Using the regressor learned in Section 3.2, we rank the candidate regions and retain the top-ranked regions per image. For each region, we extract appearance features in the form of fc6 features using the vgg-16 network. In training, we use the ground-truth regions. Let $F_f$ and $F_{o_i}$ be the feature representations of the face and candidate object regions respectively, where $i$ ranges over the indices of the top-ranked regions, and let $F_g$ be the fc6 representation of the entire image. We extract features from the candidate regions using both bounding boxes and masked versions as in Section 2, with candidate regions as masks. The final scoring of image $I$ for class $c$ is defined as:
$S_c(I) = \langle w_c, F(I) \rangle$, where $F(I)$ is the concatenation of all the features above (dropping the argument $I$ for brevity) and $w_c$ is the set of weights learned by an SVM classifier in a one-versus-all manner per action class. Please refer to Section 4 for experiments validating our approach.
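A minimal sketch of the final scoring step, assuming the per-channel maxima and the fc6-style features are already available as numpy arrays; the helper names and toy shapes are hypothetical.

```python
import numpy as np

def channel_max_feature(prob_maps):
    """Global image descriptor: the maximum value of every probability
    channel, concatenated over the given networks' output maps."""
    return np.concatenate([p.max(axis=(1, 2)) for p in prob_maps])

def score_image(class_weights, feature_blocks):
    """Linear one-vs-all scoring: concatenate all feature blocks
    (global fc6, face, candidate-object, and channel-max features)
    and apply the per-class SVM weight matrix."""
    f = np.concatenate(feature_blocks)
    return class_weights @ f

# Toy (2, 4, 4) probability volume; its channel maxima are 0.7 and 0.3.
coarse = np.zeros((2, 4, 4))
coarse[0, 1, 1], coarse[1, 2, 2] = 0.7, 0.3
cm = channel_max_feature([coarse])
scores = score_image(np.ones((1, 4)), [np.array([1.0, 2.0]), cm])
```

The per-class score is simply an inner product of the learned weights with the concatenated feature vector, so each feature block contributes additively to the final decision.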
To train $N_1$, we construct a fully convolutional neural net with the DAG architecture to produce a prediction of 8-pixel stride by fine-tuning the vgg-16 network. We use a learning rate of 0.0001 for 100 epochs, as our training set contains only a few hundred images. A similar procedure is followed for $N_2$, where the training samples are selected as described in Section 3.1. For the FRA dataset, the sub-windows are selected to be the ground-truth bounding box of each face, scaled by a factor of 2; this includes most of the action objects. Note that for FRA we did not use the provided bounding boxes at test time. The regressor for predicting the locations of action objects is a support-vector regression trained on the context features around each candidate region (see Section 3.2). All classifiers are trained using an SVM with a fixed regularization constant.
We now show some experimental results. We begin with FRA, the subset of the Stanford-40 Actions dataset containing 5 action classes: drinking, smoking, blowing bubbles, brushing teeth and phoning. To show the contribution of each feature, we perform the following ablation study. First, we test the performance of each feature on its own. In Table 3, we can see that while the coarse-level information extracted by $N_1$ provides moderate results, it is outperformed by the other feature types, namely the global image representation and that of the face. The fine representation performs on average slightly less well than the extended face area, with the biggest exception being the smoking category, where it improves the AP from .578 to .816 (in many cases the cigarettes are held far away from the face). As can be seen in the images of Figure 5, the semantic segmentation of $N_2$ is able to capture the cigarettes in the image well. The Obj score was obtained by assigning to each image the score of its highest-scoring candidate region. We can see that its performance nears that of the “Oracle” classifier, which is given the ground-truth bounding box at test time. See Figure 5 for an example of top-ranked detected action objects, weighted by the predicted object probabilities.
Next, we performed an ablation study showing how performance changes when we use all of the sources of information except one. Table 2 summarizes this. We can see that performance is worst when excluding the network predictions. Also, as expected from our motivation of the problem in Section 2, we see that the best performance is gained when including the predicted object locations. Overall, the mean average precision increases from 0.642, obtained via the baseline global features produced by the vgg-16 network, to 0.865, a 35% relative increase. Note that this is also quite near the results of the “Oracle” classifier (Section 2).
4.1 Joint training for Pose and Objects
It is noteworthy that the joint training of the networks to detect the hands and faces along with the action objects performed dramatically better than attempting to train a network to predict the location of the action objects alone: we attempted to train a network whose ground-truth masks contained only the locations and identities of action objects, without body parts. This worked very poorly; we conjecture that while the action objects may be difficult to detect on their own, contextual cues are implicitly learned by the networks. Such cues likely include proximity to the face or hand.
5 Conclusions & Future Work
We have demonstrated that exact localization of action objects and their relation to the body parts is important for action classes where the action-objects are small and often barely visible. We have done so using two main elements. First, a coarse-to-fine approach which, in a second stage, focuses on relevant image regions. Second, contextual cues which aid the detection of the small objects. This happens both during the networks’ operation, as they seek the objects jointly with the relevant body parts, and when pruning false object candidates generated by the networks, by considering their context explicitly. The coarse-to-fine approach, whose networks are based on the full 8-pixel-stride model, utilizes features from intermediate levels of the network and not only from the top level. It outperforms a purely feed-forward method such as that obtained from the 32-pixel-stride version. Together, these elements aid in good localization of action objects, leading to a significant improvement over baseline methods. Our method uses two networks and a specific form of contextual features. Our comparisons showed that the results are better than those of incorporating the entire process in an end-to-end pipeline; it remains an open question whether this is due to the relatively small size of the training set. A current drawback of the method is that it requires annotation of both body parts and action objects in the dataset; in the future, we intend to alleviate this constraint by combining information from existing datasets, which typically contain annotations of objects or poses, but not both.
-  Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari. What is an object? In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 73–80. IEEE, 2010.
-  Claudio Brozzoli. Peripersonal space: a multisensory interface for body-objects interactions. PhD thesis, Université Claude Bernard-Lyon I, 2009.
-  Vincent Delaitre, Ivan Laptev, and Josef Sivic. Recognizing human actions in still images: a study of bag-of-features and part-based representations. In BMVC, volume 2, page 7, 2010.
-  Chaitanya Desai and Deva Ramanan. Detecting actions, poses, and objects with relational phraselets. In Computer Vision–ECCV 2012, pages 158–172. Springer, 2012.
-  M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
-  Georgia Gkioxari, Ross Girshick, and Jitendra Malik. Contextual action recognition with r* cnn. arXiv preprint arXiv:1505.01197, 2015.
-  Elisabetta Làdavas, Gabriele Zeloni, and Alessandro Farnè. Visual peripersonal space centred on the face in humans. Brain, 121(12):2317–2326, 1998.
-  Li-Jia Li, Hao Su, Li Fei-Fei, and Eric P Xing. Object bank: A high-level image representation for scene classification & semantic feature sparsification. In Advances in neural information processing systems, pages 1378–1386, 2010.
-  Zhujin Liang, Xiaolong Wang, Rui Huang, and Liang Lin. An expressive deep model for human action parsing from a single image. In IEEE International Conference on Multimedia and Expo, ICME 2014, Chengdu, China, July 14-18, 2014, pages 1–6, 2014.
-  Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. arXiv preprint arXiv:1411.4038, 2014.
-  Josita Maouene, Shohei Hidaka, and Linda B Smith. Body parts and early-learned verbs. Cognitive science, 32(7):1200–1216, 2008.
-  Markus Mathias, Rodrigo Benenson, Marco Pedersoli, and Luc Van Gool. Face detection without bells and whistles. In Computer Vision–ECCV 2014, pages 720–735. Springer, 2014.
-  Nobuyuki Otsu. A threshold selection method from gray-level histograms. Automatica, 11(285-296):23–27, 1975.
-  A. Prest, C. Schmid, and V. Ferrari. Weakly supervised learning of interactions between humans and objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3):601–614, March 2012.
-  Mohammad Amin Sadeghi and Ali Farhadi. Recognition using visual phrases. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1745–1752. IEEE, 2011.
-  Fadime Sener, Cagdas Bas, and Nazli Ikizler-Cinbis. On recognizing actions in still images via multiple features. In Computer Vision–ECCV 2012. Workshops and Demonstrations, pages 263–272. Springer, 2012.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  A. Vedaldi and K. Lenc. Matconvnet – convolutional neural networks for matlab.
-  Kelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.
-  Bangpeng Yao, Xiaoye Jiang, Aditya Khosla, Andy Lai Lin, Leonidas J. Guibas, and Li Fei-Fei. Action recognition by learning bases of action attributes and parts. In International Conference on Computer Vision (ICCV), Barcelona, Spain, November 2011.
-  Bangpeng Yao, Aditya Khosla, and Li Fei-Fei. Combining randomization and discrimination for fine-grained image categorization. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1577–1584. IEEE, 2011.