In many tasks in pattern recognition, and specifically in computer vision, target classes follow a long-tail distribution. In the domain of action recognition this is particularly true, since the product of actions and objects is much bigger than each alone, and some examples may not be observed at all. This has been observed in several studies [20, 27, 23], and it is becoming increasingly popular to overcome this problem by building ever larger datasets [14, 21]. However, the distribution of these datasets will inevitably be long-tailed as well. One way to tackle this problem is to borrow information from external data sources. For instance, it has become popular to combine language and vision using joint embedded spaces [17, 13], which allow recognizing unseen classes more reliably than using a purely visual approach.
In this work, we propose to use an annotated concept dataset to learn a mapping from images to concepts with a visual meaning (e.g., objects and object attributes). This mapping is then used as a feature representation for classifying a target dataset, instead of describing the images with visual features extracted directly. This allows us to describe an image or scene with high-level concepts rather than with low-level visual features. We show that training image classifiers this way, specifically in action recognition, is as effective as training on the visual features. Moreover, we show that the concepts learned to be relevant to each category carry semantic meaning, which enables us to gain further insights into the classifiers' success and failure modes. Our concept dataset is the recently introduced Visual Genome dataset, in which we leverage the rich region annotations.
2 Previous Work
We list several lines of work related to ours. An early related work is ObjectBank , where the outputs of detectors for 200 common objects are aggregated via a spatial-pyramid to serve as feature representations. In the same spirit, ActionBank  learns detectors for various action types in videos and uses them to represent others as weighted combinations of actions. The work of  learns object attributes to describe objects in a zero-shot learning setting, so that new classes (animals) can be correctly classified by matching them to human generated lists of attributes. Recently,  learned various types of relations between actions (e.g., part of / type of / mutually exclusive) via visual and linguistic cues and leveraged those to be able to retrieve images of actions from a very large variety (27k) of action descriptions.
Other works leverage information from natural language: in  an image is mapped to a semantic embedding space by a convex combination of word embeddings according to a pre-trained classifier on ImageNet, allowing unseen classes to be described as combinations of known ones.  makes this more robust by considering the output of the classifier along with the WordNet  hierarchy, generating image tags more reliably. The work of  mines a large image-sentence corpus for actor-verb-object triplets and clusters them into groups of semantically related actions. Recently,  used detected or provided person bounding boxes in a multiple-instance learning framework, fusing global image context and person appearance.  uses spatial-pyramid pooling on activations from a convolutional layer in a network and encodes them using Fisher Vectors, with impressive results.
In our work, we do not require a concept dataset restricted to very common objects or one tailored to a specific target task (such as action recognition or animal classification). Instead, we automatically learn how to associate the learned concepts with target classes, either directly or via a language-based model.
We begin by describing our approach at a high level, and elaborate on the details in subsequent sub-sections.
Our goal is to learn a classifier for a given set of images and target labels. Assume we are given access to two datasets: (1) a target dataset and (2) a concept dataset . The target dataset contains training pairs of images and target labels . The concept dataset is an additional annotated dataset containing many images labeled with a broad range of common concepts. The general idea is to learn high-level concepts from the dataset and use those concepts to describe the images in . More formally: let be a set of concepts appearing in . We learn a set of concept classifiers , one for each . Once we have the concept classifiers, we use them to describe each image : we apply each classifier to the image obtaining a set of concept scores:
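One way to write the concept-score vector referred to later as Eqn. 1 is sketched below. The symbols are assumed for illustration and are not taken from the original notation: $I$ is an image, $f_j$ is the classifier of the $j$-th concept, and $N$ is the number of concepts.

```latex
% Assumed notation: I = image, f_j = j-th concept classifier, N = number of concepts
F(I) = \big[\, f_1(I),\; f_2(I),\; \dots,\; f_N(I) \,\big] \in \mathbb{R}^{N}
```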
For brevity, we’ll use the notation . defines a mapping from the samples to a concept-space. We use as a new feature-map, allowing us to learn a classifier in terms of concepts, instead of features extracted directly from each image. We note that the dataset from which we learn concepts should be rich enough to enable learning of a broad range of concepts, so that each image in the target dataset can be described well enough to facilitate classification into the target labels.
Next, we describe the source of our concept space, and how we learn various concepts.
To learn a broad enough range of concepts, we use the recently introduced Visual Genome (VG) dataset . It contains 108,249 images, all richly annotated with bounding boxes of various regions within each image. Each region spans an entire object or an object part, and is annotated with a noun, an attribute, and a natural language description, all collected using crowd-sourcing (see  for full details). The average number of objects per image is 21.26, for a total of 2.1 million object instances. In addition, it contains object pair relationships for a subset of the object pairs of each image. Its richness and diversity make it an ideal candidate from which to learn various concepts.
3.1 Concepts as Classifiers
|Class||Top assigned keywords|
|brushing_teeth||toothbrush, sink, bathroom, rug, brush|
|cutting_trees||bark, limb, tree_branch, branches, branch|
|fishing||shore, mast, ripple, water, dock|
|holding_an_umbrella||umbrella, rain, handbag, parasol, raincoat|
|phoning||cellphone, day, structure, bathroom, square|
|pushing_a_cart||cart, crate, boxes, luggage, trolley|
|rowing_a_boat||paddle, oar, raft, canoe, motor|
|taking_photos||camera, cellphone, phone, lens, picture|
|walking_the_dog||dog, leash, tongue, paw, collar|
|writing_on_a_board||writing, racket, poster, letter, mask|
We explore the use of three sources of information from the VG dataset, namely (1) object annotations, (2) attribute annotations (grouping all objects with a given attribute into a single entity), and (3) object-attributes (a specific object with a specific attribute). We assign each image a binary concept-vector where indicates if the ’th concept is present or not in the respective image. This is done separately for each of the above sources of information. Although the objects are annotated with bounding boxes, we train image-level predictors for each concept rather than detectors. This is both simpler and more robust, as various benchmarks ([5, 21] and many others) show that detecting the presence of objects in images currently works more accurately than correctly localizing them. Moreover, weakly-supervised localization methods are becoming increasingly effective [2, 4, 18], further justifying the use of image-level labels. Given labellings for different concepts (where may vary depending on the type of concept), we train a one-versus-all SVM for each one separately, using features extracted from a CNN (see experiments for details). Denote these classifiers as .
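The binary concept-vectors described above can be illustrated with a minimal sketch. The function name, the toy annotations and the five-word vocabulary are hypothetical; the actual vocabulary is derived from the VG annotations.

```python
def binary_concept_vectors(image_concepts, vocabulary):
    """Return one 0/1 vector per image; entry j is 1 if concept j is annotated in it."""
    index = {concept: j for j, concept in enumerate(vocabulary)}
    vectors = {}
    for image_id, concepts in image_concepts.items():
        row = [0] * len(vocabulary)
        for concept in concepts:
            j = index.get(concept)
            if j is not None:  # concepts outside the vocabulary are ignored
                row[j] = 1
        vectors[image_id] = row
    return vectors


# Hypothetical toy annotations (image id -> annotated object names).
annotations = {
    "img_1": {"dog", "leash", "grass"},
    "img_2": {"umbrella", "rain"},
}
vocab = ["dog", "grass", "leash", "rain", "umbrella"]
vectors = binary_concept_vectors(annotations, vocab)
```

Each column of the resulting matrix then provides the positive and negative labels for one image-level concept classifier.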
This process results in a scoring of each image (our target dataset) with a concept-feature as in Eqn. 1. For each concept score , we have
where are features extracted from image and the weight vector of the ’th classifier (we drop the bias term for brevity). is a matrix whose rows are the .
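In assumed notation (not the original symbols), with $x_I$ the CNN feature vector of image $I$ and $w_j$ the weight vector of the $j$-th concept classifier, the linear concept scores described above can be written as:

```latex
% Assumed notation: x_I = CNN features of image I, w_j = weights of the j-th concept SVM
f_j(I) = w_j^{\top} x_I, \qquad
F(I) = W x_I, \qquad
W = \begin{bmatrix} w_1^{\top} \\ \vdots \\ w_N^{\top} \end{bmatrix}
```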
We then train a classifier to map from to its target label, by using an additional SVM for each target class; denote by each classifier, where is the number of target labels of images in target dataset . The final score assigned to an image for a target class is denoted by
where is the learned weight vector for the classifier . Before we train , we apply PCA to the collection of training vectors and project them onto the first dimensions according to the strongest singular values (was chosen by validation in early experiments). We found that this reduces the runtime and improves the classification results.
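The two-stage scoring can be sketched as follows: linear concept scores are computed from image features, and linear class scores are computed from the concept scores. This is a simplified illustration with hypothetical toy weights; the PCA projection and the bias terms are omitted, and the function names are not from the original work.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def concept_scores(W, x):
    """Stage 1: one linear score per concept (rows of W are concept weight vectors)."""
    return [dot(w, x) for w in W]


def class_scores(V, concept_vec):
    """Stage 2: one linear score per target class, computed on the concept scores."""
    return [dot(v, concept_vec) for v in V]


# Hypothetical toy weights: 3 concepts over 2-D features, 2 target classes.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]
x = [2.0, 1.0]

F = concept_scores(W, x)   # concept-space representation of the image
S = class_scores(V, F)     # final per-class scores
predicted = max(range(len(S)), key=S.__getitem__)
```

Because both stages are linear here, each class weight can be read directly as the importance of one concept, which is what Section 4.2 exploits.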
We also experimented with training a neural net to predict the class scores; this brought no performance gain over the SVM , despite trying various architectures and learning rates.
We next describe how we deal with training a large number of concept classifiers using the information from the VG dataset.
3.2 Refinement via Language
The object nouns and attributes themselves have very long-tail distributions in the VG dataset, and contain many redundancies. Training classifiers for all of the concepts would be impractical for several reasons: first, because of the long tail, most concepts have too few examples, which would cause most of the classifiers to perform poorly; second, there are many overlapping concepts, which cannot be regarded as mutually exclusive classes; third, the number of parameters required to train such a number of classifiers would become prohibitively large.
To reduce the number of concepts to be learned, we remove redundancies by standardizing the words describing the various concepts and retain only concepts with at least 10 positive examples (more details in Section 4.1).
Many of the concepts overlap, such as “person” and “man”, or are closely related, such as A being a sub-type of B (“cat” vs “animal”). Hence it would harm the classifiers to treat them as mutually exclusive. To overcome this, we represent each concept by its representation produced by the GloVe method of . This 300-D representation has been shown to correlate well with semantic meaning, embedding semantically related words closely together in vector space, as well as other interesting properties.
We perform K-means clustering on the embedded word-vectors for a total of 100 clusters. As expected, these produce semantically meaningful clusters, forming groups such as (sign, street-sign, traffic-sign, …), (collar, strap, belt, …), (flower, leaf, rose, …), (beer, coffee, juice), etc.
We assign to each concept a cluster . Denote by the set of concepts in an image according to the ground-truth in the VG dataset. We define the positive and negative training sets as follows:
In words, is the set of all images containing the target concept and is the set of all images which do not contain any concept in the same cluster as . We sample a random subset from to avoid highly-imbalanced training sets for concepts which have few positive examples. In addition, sampling lowers the chance to encounter images in which or members of were not labeled. In practice, we limit the number of positive samples of each concept to , as using more samples increased the run-times significantly with no apparent benefits in performance.
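The construction of the training sets can be sketched as below: positives are images containing the target concept, and negatives are images containing no concept from the target's cluster, with optional random subsampling of both. The function name, the parameter names `max_positives` and `max_negatives`, and the toy data are assumptions made for illustration.

```python
import random


def training_sets(target, image_concepts, cluster_of,
                  max_positives=None, max_negatives=None, seed=0):
    """Positives: images containing `target`.
    Negatives: images containing no concept from the cluster of `target`."""
    cluster = cluster_of[target]
    positives = sorted(i for i, cs in image_concepts.items() if target in cs)
    negatives = sorted(
        i for i, cs in image_concepts.items()
        if all(cluster_of.get(c) != cluster for c in cs)
    )
    rng = random.Random(seed)
    if max_positives is not None and len(positives) > max_positives:
        positives = rng.sample(positives, max_positives)
    if max_negatives is not None and len(negatives) > max_negatives:
        negatives = rng.sample(negatives, max_negatives)
    return positives, negatives


# Hypothetical toy data: "man" and "person" share a cluster.
images = {
    "a": {"man", "umbrella"},
    "b": {"person", "tree"},
    "c": {"tree", "sky"},
    "d": {"dog"},
}
clusters = {"man": 0, "person": 0, "umbrella": 1, "tree": 2, "sky": 3, "dog": 4}
pos, neg = training_sets("man", images, clusters, max_negatives=1)
```

Image "b" is excluded from the negatives of "man" because it contains "person", a member of the same cluster.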
The classifiers trained on the set of concepts in this way serve as the concept classifiers of Section 3.1. We next proceed to describe experiments validating our approach and comparing it to standard approaches.
To validate our approach, we have tested it on the Stanford-40 Action dataset . It contains 9532 images with a diverse set of 40 different action classes, 4000 images for training and the rest for testing. In addition we test our method on the recently introduced HICO  dataset, with 47774 images and 600 (not all mutually exclusive) action classes. Following are some technical details, followed by experiments checking various aspects of the method.
As a baseline for purely-visual categorization, we train and test several baseline approaches using feature combinations from various sources: (1) the global average pooling of the VGG-GAP network , (2) the output of the fully connected layer two before the last (termed fc6) from VGG-16 , and (3) the pool-5 features from the penultimate layer of ResNet-151 . The feature dimensions are 1024, 4096 and 2048, respectively. In all cases, we train a linear SVM  in a one-versus-all manner on normalized features, or on the normalized concatenation of features when several feature types are used. To assign GloVe  vectors to object names or attributes, we use the model pre-trained on the Common-Crawl (42B) corpus, which contains a vocabulary of 1.9M words. We break up phrases into their words and assign to them their mean GloVe vector. We discard a few words which are not found in the corpus at all (e.g., “ossicones”).
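Assigning a phrase the mean of its word vectors can be sketched as follows. The 3-D toy embeddings stand in for the 300-D GloVe vectors, and the function name is hypothetical.

```python
def phrase_vector(phrase, embeddings):
    """Mean of the word vectors of a phrase; None if no word is in the vocabulary."""
    vectors = [embeddings[w] for w in phrase.lower().split() if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# Hypothetical 3-D toy embeddings standing in for 300-D GloVe vectors.
toy = {"baseball": [1.0, 0.0, 2.0], "bat": [3.0, 2.0, 0.0]}
mean_vec = phrase_vector("baseball bat", toy)
missing = phrase_vector("ossicones", toy)
```

A phrase whose words are all out of vocabulary yields `None`, mirroring the discarding of words not found in the corpus.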
4.1 Visual vs Concept features
Training and testing by using the visual features is straightforward. For the concept features, we train concept detectors on both the objects and object attributes of the VG dataset. Directly using these in their raw form is infeasible, as explained in Sec. 3.2. To reduce the number of classes, we normalize each object name beforehand. The object name can be either a single word, such as “dog”, or a phrase, such as “a baseball bat”. To normalize the names, we remove stop-words, such as “the”, “her”, “is”, “a”, as well as punctuation marks. We turn plural words into their singular form. We avoid lemmatizing words since we found that this slightly hinders performance; for example, the word “building” usually refers to the structure and takes on a different meaning if turned into “build” by lemmatization. Following this process we are left with 66,390 unique object names. They are distributed unevenly, the most common being “man”, “sky”, “ground”, “tree”, etc. We remove all objects with fewer than 10 images containing them in the dataset, leaving us with 6,063 total object names. We do the same for object attributes: we treat attributes regardless of the object type (e.g., “small dog” and “small tree” are both mapped to “small”), as the number of common object-attribute pairs is much smaller than the number of attributes. Note that the stop-word list for attributes is slightly different, as “white”, a common word in the dataset, is not a noun, but is a proper attribute. We are left with 1740 attributes appearing at least 10 times in the dataset. See Fig. 1 (a) for a visualization of the distribution of the object and attribute frequency. Specifically, one can see that for object-attribute pairs the long-tail distribution is much more accentuated, making them a poor candidate for concepts to learn as features for our target task.
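The name normalization and frequency filtering can be sketched roughly as below. The stop-word set is only the illustrative subset quoted above, and the plural handling is a deliberately naive assumption; the exact stop-word list and singularization procedure are not specified here.

```python
import string
from collections import Counter

STOP_WORDS = {"the", "her", "is", "a"}  # illustrative subset only


def singularize(word):
    """Very naive plural stripping; a real system would use a proper inflection tool."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def normalize_name(name):
    """Lowercase, drop punctuation and stop-words, singularize; no lemmatization."""
    cleaned = name.lower().translate(str.maketrans("", "", string.punctuation))
    words = [singularize(w) for w in cleaned.split() if w not in STOP_WORDS]
    return " ".join(words)


def frequent_concepts(image_names, min_images=10):
    """Keep normalized names that occur in at least `min_images` distinct images."""
    counts = Counter()
    for names in image_names.values():
        counts.update({normalize_name(n) for n in names} - {""})
    return {name for name, c in counts.items() if c >= min_images}


# Hypothetical toy annotations: "dog" appears in 4 images, "sky" in 3.
toy_images = {0: ["dogs", "the sky"], 1: ["dogs", "the sky"],
              2: ["dogs", "the sky"], 3: ["a dog"]}
kept = frequent_concepts(toy_images, min_images=4)
```

Since no lemmatization is applied, "building" is left unchanged rather than being reduced to "build".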
We train classifiers using the detected objects, attributes and object-attribute pairs, as described in Sections 3.1 and 3.2. Please refer to Table 2 for a comparison of the direct-visual results to our method. Except for the attribute-object concepts, we see that the concept-based classification does nearly as well as the direct visual-based method, where the addition of ResNet-151  clearly improves results. Combining the direct and concept-based (object) predictions and using the ResNet features along with the other representations achieves a new state of the art of 83.12% on Stanford-40 . On the recent HICO  dataset we obtain a mean average precision of 31.54%.  obtain higher results (36.1%) by fusing detected person bounding-boxes with global image context and using a weighted loss.
|Method \ Features||G||G+V||G+V+R|
|Other Works||80.81 |
|29.4 / 36.1|
4.2 Describing Actions using Concepts
For each target class, the learned classifier assigns a weight for each of the concepts. Examining this weight vector reveals the concepts deemed most relevant by the classifier. Recall that the weight vector of each learned classifier for class is ( is the number of target concepts). We checked whether the highest-weighted concepts carry semantic meaning with respect to the target classes as follows: for each target class ( being the set of target classes in ) we sort the values of the learned weight-vector in descending order, and list the concepts corresponding to the obtained ranking. Table 1 shows ten arbitrarily chosen classes from the Stanford-40 Actions  dataset, with the top 5 ranked object-concepts according to the respective weight-vector. In most cases, the top-ranked concepts are semantically meaningful. However, in some classes we see unexpected concepts, such as “handbag” for holding_an_umbrella. This points to a likely bias in the Stanford-40 dataset, such as that many of the subjects holding umbrellas in the training images also carry handbags, which was indeed found to be the case by examining the training images for this class.
An interesting comparison is between the concepts that differentiate related classes. For example, the top 5 ranked keywords for the class “feeding a horse” are (“mane”, “left_ear”, “hay”, “nostril”, “horse”) whereas for “riding a horse” they are (“saddle”, “horse”, “rider”, “hoof”, “jockey”). While “horse” is predictably common to both, other words are indeed strongly related to one of the classes but not the other, for example, “hay” for feeding vs “jockey”, “saddle” for riding.
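Listing the top-ranked concepts of a class amounts to sorting its weight vector in descending order. A minimal sketch, with hypothetical weights and a hypothetical function name:

```python
def top_concepts(weights, concept_names, k=5):
    """Names of the k concepts with the largest weights for one target class."""
    order = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    return [concept_names[j] for j in order[:k]]


names = ["saddle", "hay", "horse", "sky", "jockey"]
w_riding = [0.9, -0.2, 0.7, 0.0, 0.5]  # hypothetical weights for "riding a horse"
top3 = top_concepts(w_riding, names, k=3)
```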
4.3 Concept Visualization
To test which features contribute most to the classification result, we use the Class-Activation-Map (CAM) approach of . This method allows us to visualize which image regions contributed the most to the score of each class. We can do this for the version of our method which uses only the VGG-GAP features, as the method relies on the structure specific to this network to re-project the classification scores to the image (see  for details). We visualize the average CAMs of the top-5 ranked keywords for different classes (as in Section 4.2). We do this for two target classes for each image, one correct class and the other incorrect, to explain what image regions drive the method to decide on the image class. See Fig. 2. When the method is “forced” to explain the riding image as “feeding a horse”, we see negative weights on the rider and strong positive weights on the lower part of the horse, whereas examining the regions contributing to “riding a horse” gives a high weight to the region containing the jockey.
4.4 Distribution of Weights
We have also examined the overall statistics of the learned weight vectors. For a single weight vector , we define:
i.e., is the mean of for all classifiers of the target classes. Fig. 1 (b) displays these mean absolute weights assigned to object concepts, ordered by their frequency in VG. Not surprisingly, the first few tens of concepts have low-magnitude weights, as they are too common to be discriminative. The next few hundreds of concepts exhibit higher weights, and finally, weights become lower with diminished frequency. This can be explained by such concepts having weaker, less reliable classifiers, as they have fewer positive examples. A similar trend was observed when examining attributes.
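The mean absolute weight per concept can be sketched as follows, assuming the class weight vectors are stacked as rows with one column per concept. The function name and toy values are hypothetical.

```python
def mean_abs_weights(weight_vectors):
    """Average |w| per concept over the classifiers of all target classes."""
    n_classes = len(weight_vectors)
    n_concepts = len(weight_vectors[0])
    return [
        sum(abs(w[j]) for w in weight_vectors) / n_classes
        for j in range(n_concepts)
    ]


# Two hypothetical class weight vectors over two concepts.
averaged = mean_abs_weights([[1.0, -2.0], [-3.0, 0.0]])
```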
4.5 Feature Selection by Semantic Relatedness
Section 4.2 provided a qualitative measure of the keywords found by the proposed method. Here, we take a different approach: we select concepts by a relatedness measure to the target classes and measure how well training on these concepts alone compares with choosing the top-k most common concepts. To do so, we measure their mean “importance”. As described in Section 3.2, we assign to each concept a GloVe  representation . Similarly, we assign a vector to each target class according to its name; for instance, “riding a horse” is assigned the mean of the vectors of the words “ride” and “horse”. Then, for each class we rank the vectors according to their Euclidean distance from in increasing order. This induces a per-class order , which is a permutation of , such that is the ranking of in the ordering induced by . We use this to define the new mean rank of each concept:
Now, we test the predictive ability of concepts chosen from according to two orderings. The first is the frequency of , in ascending order, and the second is the sorted values (descending) of as defined in Eqn. 10. We select the first concepts for the first features ( for chance performance). For a small number of features, e.g., , the concepts chosen according to outperform those chosen according to frequency by a large margin, i.e., 42.2 vs 34.2 respectively.
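The mean-rank computation can be sketched as below: each class ranks all concepts by increasing Euclidean distance between embedding vectors, and the ranks are averaged per concept over the classes. Starting ranks at 0, the 2-D toy vectors and the function name are assumptions for illustration.

```python
import math


def mean_ranks(concept_vecs, class_vecs):
    """For each concept, average its rank (0 = closest) over all target classes,
    where each class ranks concepts by increasing Euclidean distance."""
    names = list(concept_vecs)
    totals = {name: 0 for name in names}
    for class_vec in class_vecs.values():
        ordered = sorted(names, key=lambda n: math.dist(concept_vecs[n], class_vec))
        for rank, name in enumerate(ordered):
            totals[name] += rank
    return {name: totals[name] / len(class_vecs) for name in names}


# Hypothetical 2-D toy embeddings.
concepts = {"saddle": [1.0, 0.0], "hay": [0.0, 1.0], "sky": [5.0, 5.0]}
classes = {"riding a horse": [1.0, 0.2], "feeding a horse": [0.2, 1.0]}
ranks = mean_ranks(concepts, classes)
```

With this convention a lower mean rank indicates a concept that is, on average, semantically closer to the target classes.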
5 Conclusions & Future Work
We have presented a method which learns to recognize actions in images by describing them as a weighted sum of detected concepts (objects and object attributes). The method utilizes the annotations in the VG dataset to learn a broad range of concepts, which are then used to recognize actions in still images. Together with the visual features, we are able to obtain state-of-the-art classification performance, as well as provide a visual and semantic explanation of the classifier’s decisions. In the future we intend to broaden our work to capture object relationships, which are very important to action classification as well .
- [1] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.
- [2] Hakan Bilen and Andrea Vedaldi. Weakly Supervised Deep Detection Networks. arXiv preprint arXiv:1511.02853, 2015.
- [3] Yu-Wei Chao, Zhan Wang, Yugeng He, Jiaxuan Wang, and Jia Deng. HICO: A Benchmark for Recognizing Human-Object Interactions in Images. In Proceedings of the IEEE International Conference on Computer Vision, pages 1017–1025, 2015.
- [4] Ramazan Gokberk Cinbis, Jakob Verbeek, and Cordelia Schmid. Weakly Supervised Object Localization with Multi-fold Multiple Instance Learning. arXiv preprint arXiv:1503.00949, 2015.
- [5] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
- [6] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
- [7] Bin-Bin Gao, Xiu-Shen Wei, Jianxin Wu, and Weiyao Lin. Deep spatial pyramid: The devil is once again in the details. arXiv preprint arXiv:1504.05277, 2015.
- [8] Jiyang Gao, Chen Sun, and Ram Nevatia. ACD: Action Concept Discovery from Image-Sentence Corpora. arXiv preprint arXiv:1604.04784, 2016.
- [9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. arXiv preprint arXiv:1512.03385, 2015.
- [10] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Fei-Fei Li. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. CoRR, abs/1602.07332, 2016.
- [11] Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 951–958. IEEE, 2009.
- [12] Li-Jia Li, Hao Su, Li Fei-Fei, and Eric P Xing. Object bank: A high-level image representation for scene classification & semantic feature sparsification. In Advances in neural information processing systems, pages 1378–1386, 2010.
- [13] Xirong Li, Shuai Liao, Weiyu Lan, Xiaoyong Du, and Gang Yang. Zero-shot image tagging by hierarchical semantic embedding. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 879–882. ACM, 2015.
- [14] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014, pages 740–755. Springer, 2014.
- [15] Arun Mallya and Svetlana Lazebnik. Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering. arXiv preprint arXiv:1604.04808, 2016.
- [16] George A Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
- [17] Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg S Corrado, and Jeffrey Dean. Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650, 2013.
- [18] Deepak Pathak, Philipp Krahenbuhl, and Trevor Darrell. Constrained convolutional neural networks for weakly supervised segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1796–1804, 2015.
- [19] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global Vectors for Word Representation. In EMNLP, volume 14, pages 1532–1543, 2014.
- [20] Vignesh Ramanathan, Congcong Li, Jia Deng, Wei Han, Zhen Li, Kunlong Gu, Yang Song, Samy Bengio, Chuck Rossenberg, and Li Fei-Fei. Learning semantic relationships for better action retrieval in images. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 1100–1109. IEEE, 2015.
- [21] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
- [22] Sreemanananth Sadanand and Jason J Corso. Action bank: A high-level representation of activity in video. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1234–1241. IEEE, 2012.
- [23] Ruslan Salakhutdinov, Antonio Torralba, and Josh Tenenbaum. Learning to share visual appearance for multiclass object detection. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1481–1488. IEEE, 2011.
- [24] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- [25] Bangpeng Yao, Xiaoye Jiang, Aditya Khosla, Andy Lai Lin, Leonidas Guibas, and Li Fei-Fei. Human action recognition by learning bases of action attributes and parts. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1331–1338. IEEE, 2011.
- [26] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning Deep Features for Discriminative Localization. arXiv preprint arXiv:1512.04150, 2015.
- [27] Xiangxin Zhu, Dragomir Anguelov, and Deva Ramanan. Capturing long-tail distributions of object subcategories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 915–922, 2014.