This paper describes the approach we developed to address the Yahoo! Large-scale Flickr-tag Image Classification Grand Challenge. The challenge is formulated as follows:
Given a training set of images together with their metadata, and a class label corresponding to each training image, build a ranking model that, for each class label, ranks the images in a test set (without metadata) as accurately as possible.
We emphasize three particular characteristics of this challenge. First, the ground truth for both the training and test sets consists of user-generated tags. As with any other tagging problem, these tags are expected to be noisy, incomplete, and subjective. The ranking models should therefore be able to learn useful information from such difficult labels. Second, for this particular task, the 10 target tags are selected from the tags most frequently used by Flickr users. The images from these classes (which we refer to as ‘top-level classes’) show broad variation in their visual representations. Thus, the ranking model should be based on a representation that can model classes with high intra-class variation. Third, the level of visual diversity varies between the 10 target tags. For example, images related to the tag “sky” or “beach” are expected to be more visually consistent than images related to “2012” or “nature”. In view of this, the ranking model should be able to deal with different abstraction levels adaptively.
In view of the above, we propose in this paper a new method for content-based image classification/tagging based on a subclass representation. As mentioned above, the image classes in this challenge are user-generated tags. Since social images are usually annotated with more than one tag, the co-occurrence of different tags serves as a valuable information source that can be exploited for elaborating tags (classes) which are by themselves highly visually varied. As shown in the example in Fig. 1, the image class “nature” can cover an extremely large range of visual representations. However, its subclasses, such as “flower”, “bird”, and “forest”, are much more homogeneous in terms of visual information. For this reason, we first discover the subclasses, which are tags frequently and exclusively co-occurring with the top-level class labels, for representing each image. Then, we train a binary classifier for each selected subclass. For a given image, the confidence scores of all subclass classifiers are concatenated to produce a high-level representation. This representation is then used to learn models of the top-level classes, as illustrated in Fig. 2. The final results are predicted by the ranking model based on the learned subclass representation.
The contribution of this paper lies in the following aspects:
The method uses a co-occurrence-based strategy to discover subclasses. Compared with semantic-ontology-based subclass generation methods, or with methods using predefined concepts as subclasses, the proposed subclass representation is expected to be more discriminative in terms of predicting the target classes.
The proposed method uses the confidence scores, rather than the binary decisions, of the subclass classifiers as high-level features. This strategy can yield useful representations even when individual subclass classifiers are not reliable.
The remainder of the paper is organized as follows: in Section 2, we present the details of the proposed method. The experimental framework and results are presented in Section 3 and Section 4. Then, in Section 5 we discuss previous research contributions that are related to our approach proposed in this paper. Finally, Section 6 summarizes our contributions and discusses future work.
2 Learning Subclass Representation
2.1 Mining Subclasses
As discussed above, subclasses of a target class are expected to be strongly connected with that class and, moreover, to be more stably reflected in visual features than the class itself. To define such subclasses, we exploit the tags annotating the images. We first generate a co-occurrence matrix between photos’ tags and their top-level classes, and measure each tag’s connection to a class by its distinctive score, defined as:

d_{ij} = \frac{n_{ij}}{\sum_{c=1}^{C} n_{ic}},   (1)

where n_{ij} is the number of co-occurrences of the i-th tag and the j-th top-level class, and C is the number of top-level classes. Note that in the setting of the Yahoo! Challenge, the predefined top-level classes are themselves chosen from the user-contributed tags.
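As a small sketch of the distinctive score above, the following computes it from a toy co-occurrence matrix (the matrix values are hypothetical, for illustration only; the score is read as the fraction of a tag’s co-occurrences that fall into each class):

```python
import numpy as np

# Hypothetical toy co-occurrence matrix: rows = tags, columns = top-level classes.
# n[i, j] = number of images carrying both the i-th tag and the j-th class label.
n = np.array([
    [90.0,  5.0,  5.0],   # tag 0: co-occurs almost exclusively with class 0
    [30.0, 30.0, 40.0],   # tag 1: spread across all classes
])

# Distinctive score d[i, j]: normalize each tag's counts over all classes.
d = n / n.sum(axis=1, keepdims=True)

print(d[0, 0])  # tag 0 is highly distinctive for class 0
print(d[1, 2])
```

A tag like tag 0, concentrated in one class, gets a score near 1 for that class; an evenly spread tag stays near 1/C everywhere.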
The selected subclasses for class j can then be defined as the tags whose distinctive scores exceed a predefined threshold:

S_j = \{ t_i \in T \mid d_{ij} > \theta \},   (2)

where t_i denotes the i-th tag and T denotes the set of all tags. Note that some tags may be assigned to only a very small number of images in a class. For those tags, the limited number of training images prevents them from becoming effective subclasses. Taking this into account, we further rank all selected tags t_i in S_j by the number of photos in class j that are tagged with t_i.
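The selection-and-ranking step can be sketched as follows (threshold, score matrix, and photo counts are all hypothetical toy values, since the paper’s actual threshold is not given here):

```python
import numpy as np

theta = 0.6  # assumed threshold; the value used in the paper is not stated here

# Toy distinctive scores d[i, j] and per-class photo counts for 3 tags, 3 classes.
d = np.array([[0.9, 0.05, 0.05],
              [0.3, 0.30, 0.40],
              [0.1, 0.80, 0.10]])
photo_counts = np.array([[900,  50,  50],
                         [300, 300, 400],
                         [100, 800, 100]])

def subclasses_for(j):
    # Keep tags whose distinctive score for class j exceeds the threshold,
    # then rank them by how many photos of class j carry the tag.
    candidates = [i for i in range(d.shape[0]) if d[i, j] > theta]
    return sorted(candidates, key=lambda i: -photo_counts[i, j])

print(subclasses_for(0))  # [0]
print(subclasses_for(1))  # [2]
```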
2.2 Subclass Representation
To generate a subclass-based representation for an image, we first use the images tagged with the subclasses to train models, i.e., Support Vector Machines (SVMs), for classifying the subclasses, and then use the confidence scores for predicting each subclass as the new representation of the image. In this sense, as illustrated in Fig. 2, the image features can be treated as the first-level representation of an image, while the confidence scores of the subclasses form the high-level representation. Based on the subclass representations, we further learn models that characterize the connection between these representations and the top-level classes.
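A minimal sketch of this two-level scheme, using scikit-learn’s SVC (which wraps LibSVM) in place of the paper’s training pipeline; the features are random stand-ins for real image descriptors, and the three subclasses are hypothetical:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-in image features: 3 hypothetical subclasses, 60 images each, drawn
# from shifted Gaussians so the subclass SVMs have separable structure.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(60, 8)) for c in range(3)])
y = np.repeat([0, 1, 2], 60)

# One binary SVM per subclass (its images vs. the rest); probability=True
# enables Platt-scaled confidence scores.
subclass_models = []
for c in range(3):
    clf = SVC(kernel="rbf", probability=True, random_state=0)
    clf.fit(X, (y == c).astype(int))
    subclass_models.append(clf)

def subclass_representation(x):
    # Concatenate each subclass model's confidence that x belongs to it;
    # this vector is the high-level representation fed to the top-level model.
    return np.array([m.predict_proba(x[None, :])[0, 1] for m in subclass_models])

rep = subclass_representation(X[0])
print(rep.shape)  # (3,) — one confidence score per subclass
```

In the full system, a second SVM would then be trained on these concatenated scores to predict the top-level class.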
3 Experimental Framework
3.1 Dataset

To verify the performance of the proposed approach, we carry out our experiments on the dataset of photos released by the Multimedia 2013 Yahoo! Large-scale Flickr-tag Image Classification Grand Challenge. The dataset contains 2 million Flickr photos with 10 classes, i.e., 150K training and 50K test images per class. The class labels are among the top tags assigned by Flickr users. Since the release does not include the tags associated with the photos, we re-crawled the photos’ tags using the photo IDs provided in the metadata.
To develop our system, the training dataset is randomly divided into three parts with the ratio 4:3:1. The first part is for training the models that predict subclasses from image features, the second is for training the models for the target classes based on the confidence scores of the learned subclass models, and the last is for validation and parameter selection. The test data from the grand challenge is used to evaluate the proposed system.
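The 4:3:1 split described above can be sketched as follows (the seed and helper name are illustrative, not from the paper):

```python
import random

def split_431(indices, seed=0):
    # Shuffle and split into three disjoint parts with ratio 4:3:1:
    # (subclass training, top-level training, validation).
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    n = len(idx)
    a = n * 4 // 8
    b = n * 7 // 8
    return idx[:a], idx[a:b], idx[b:]

p1, p2, p3 = split_431(range(80))
print(len(p1), len(p2), len(p3))  # 40 30 10
```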
3.2 Multi-class Classification
To model the connection between image features and subclasses, and between the subclass representation and the top-level classes, we choose an SVM-based approach. For multi-class classification, we apply the one-against-one training strategy, which is reported to have better training-time efficiency and prediction accuracy than other multi-class support vector machines, e.g., one-against-all. To generate probability estimates from the SVM model, we apply the pairwise coupling algorithm proposed by Wu et al. and modify its original implementation in LibSVM to make it suitable for distributed computing on a Hadoop-based distributed server.
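This setup maps directly onto scikit-learn’s SVC, which wraps LibSVM: its multi-class training is one-against-one, and `probability=True` produces pairwise-coupled probability estimates. A toy sketch (features and classes are synthetic):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Synthetic 3-class problem standing in for the image-feature data.
X = np.vstack([rng.normal(loc=c, size=(40, 5)) for c in range(3)])
y = np.repeat([0, 1, 2], 40)

# SVC trains one-against-one binary SVMs internally; probability=True enables
# pairwise-coupling probability estimates over the three classes.
clf = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)

proba = clf.predict_proba(X[:1])
print(proba.shape)  # (1, 3): one probability per class, summing to 1
```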
We compare our subclass-representation-based approach with two closely related baseline approaches:
- directly modeling the 10 top-level classes based on image features;
- projecting image features onto the 10-dimensional space of top-level classes and using the resulting scores as the representation.
4 Results

4.1 Learning Subclass Models
To learn the subclasses using the co-occurrence matrix between photos’ tags and their classes, the threshold in Eq. (2) is set to a fixed value, and we then manually select the subclasses among the top tags of each class, which results in a total of 54 subclasses. As some classes contain few distinctive tags compared with other classes, they may have few subclasses; e.g., class “travel” contains only 1 subclass. In contrast, some classes contain more distinctive tags, e.g., 14 subclasses for class “nature”. To train the subclass models for projecting images onto the subclass space, we use a maximum of 10k images per subclass as training data. Note that some subclasses have fewer than 10k images. The performance of these subclass models on the validation set, measured by Average Precision (AP), is illustrated in Fig. 3.
4.2 Classification Results
Fig. 4 illustrates the performance in terms of mean average precision (MAP), across all 10 top-level classes, of the proposed approach and the baselines with respect to different training data scales. Overall, the proposed approach performs better than both baselines at every training data scale. In addition, it already achieves good performance compared with the baselines for small training data scales, e.g., 1k images per class. As the amount of training data increases, its performance levels out. This is because many subclasses do not contain enough training photos; a substantial fraction of them have fewer than 10k photos.
Fig. 5 further breaks down the performance in terms of AP over the different classes with respect to different training data scales. As can be seen, the proposed approach gains more for the classes “food”, “people”, “sky”, and “nature” than for the other classes. This is because these classes own more subclasses than the others, i.e., there are more distinctive tags in these classes. Also, some classes, such as “sky” and “people”, contain visually highly consistent subclasses, as illustrated in Fig. 3, which gives strong support to the top-level classes. This is especially obvious for class “sky”, which has a subclass with very strong visual consistency, providing good support for the top-level class. Interestingly, the class “food” owns many subclasses, each of which has a relatively low AP, yet the class still achieves reliable performance. We conjecture that these subclass classifiers provide useful discriminative information through the probabilities they yield with respect to non-relevant subclasses.
5 Related Work
Our work is closely related to methods for predicting tags with high intra-class variation; in particular, sub-category-based methods and methods based on learning high-level representations are often explored. We discuss these two directions in turn.
Generating sub-categories has been considered an effective way to deal with classification problems where intra-class variation is high. ImageNet organizes an image dataset with labels corresponding to a semantic hierarchy. This method can build a comprehensive ontology for a large-scale dataset; however, for a particular dataset, sub-categories generated by data-driven strategies are expected to be more discriminative. Yang and Toderici exploit co-watch information to learn latent sub-tags for video tag prediction. Li et al. propose to discover an image hierarchy by using both visual and tag information. Our method generates category-specific subclasses by exploring image/tag co-occurrence and trains a classifier for each subclass tag. These subclass-based models are expected to be discriminative in terms of estimating the target tags, which correspond to the top-level classes.
Learning high-level representations is the second related direction. In supervised representation-learning methods, such as Object Bank and classemes, a predefined set of models is trained on image features, and the output of these models is used as a high-level representation for predicting image categories. Recently, deep neural networks have been used for unsupervised learning of image representations on large-scale image datasets [6, 5]. Such representations have achieved promising results on different classification and tagging tasks. Our method uses the output of the trained subclass classifiers as high-level features. This structure can easily be extended to deeper levels by finding discriminative tags for the subclasses.
6 Conclusion

We have presented a subclass-representation approach to the task of retrieving/ranking large-scale social images for a particular class based solely on visual content. The main contribution of the approach is that, by projecting the image feature representation onto a subclass space generated by exploiting the co-occurrence information of user-contributed tags, it makes use not only of the content of the photos themselves, but also of information concerning the co-occurrence of the photos’ tags with the tags corresponding to the top-level classes.
-  C.-C. Chang and C.-J. Lin. Libsvm: a library for support vector machines. ACM Trans. Intelligent Systems and Technology, 2(3):27, 2011.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proc. CVPR ’09, 2009.
-  C.-W. Hsu and C.-J. Lin. A comparison of methods for multiclass support vector machines. IEEE Trans. Neural Networks, 13(2):415–425, 2002.
-  A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, 2012.
-  Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, and A. Ng. Building high-level features using large scale unsupervised learning. In ICML, 2012.
-  H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.
-  L.-J. Li, C. Wang, Y. Lim, D. Blei, and L. Fei-Fei. Building and using a semantivisual image hierarchy. In CVPR, 2010.
-  L.-J. Li, H. Su, E. P. Xing, and L. Fei-Fei. Object bank: A high-level image representation for scene classification and semantic feature sparsification. In NIPS, 2010.
-  L. Torresani, M. Szummer, and A. Fitzgibbon. Efficient object category recognition using classemes. ECCV’10, pages 776–789, 2010.
-  T.-F. Wu, C.-J. Lin, and R. C. Weng. Probability estimates for multi-class classification by pairwise coupling. JMLR, 5:975–1005, 2004.
-  W. Yang and G. Toderici. Discriminative tag learning on youtube videos with latent sub-tags. In CVPR, 2011.