is a crucial problem in computer vision, where the goal is to assign multiple labels to one image based on its content. Compared with single-label image classification, multilabel image classification is more general, but it is also more challenging because of the rich semantic information and complex dependency of an image and its labels. For example, image labels may have overlapping meanings.Dog and puppy have similar meanings and are often interchangeable (Figure 1 (left)). Moreover, labels may be semantically different, capturing one (Figure 1 (middle)) or multiple (Figure 1 (right)) objects in the image. Such labels may exhibit strong co-occurrence dependencies; for example, sky and clouds are semantically different, but they often appear together in one image.
The current state-of-the-art approach to image classification is a deep convolutional neural network (CNN) trained with a softmax output layer (i.e. multinomial logistic regression) that has as many units as the number of classes. A common approach to extending CNN to multilabel classification is to transform it into multiple single-label classification problems, which can be trained with the ranking loss  or the cross-entropy loss . However, as the number of classes grows, the distinction between classes is obscured, and it becomes increasingly difficult to obtain sufficient numbers of training images for rare concepts. As the number of labels continues to grow, these models are often limited in their scalability to large numbers of object categories (introducing many model parameters, making it difficult to obtain sufficient numbers of training images). Furthermore, these methods fail to model the dependency between labels.
Alternatively, visual-semantic embedding models address these shortcomings by training a visual recognition model with both labeled images and a large corpus of unannotated text [7, 23]. Textual data are leveraged to learn semantic relationships between labels, with semantically similar labels being close to each other in the continuous embedding space. An image is transformed into that space and is close to its associated labels. Although the advantages of these image embedding methods over traditional
-way classifiers have been highlighted, handling images with multiple labels still remains problematic, because an image may contain more than one semantic concept, as depicted in Figure2.
The characteristic of varying and unordered labels one image may have in multilabel image classification hinders the direct employment of CNN that requires a fixed output size. Wei et al.  tackled the problem by creating an arbitrary number of object segment hypotheses as the inputs to a shared CNN. However, the classification performance depends largely on the quality of the extracted hypotheses and an ideal way to extract them remains unclear. Wang et al. 
proposed CNN-RNN, which utilized recurrent neural networks (RNNs)
to address this problem. Although the recurrent neurons neatly model the label co-occurrence dependencies, this approach needs to determine the orders of the labels from an unordered label set. Both methods significantly increase the model complexity (e.g. computing hypotheses or integrating RNNs) to extend CNN from single-label to multilabel image classification.
In this study, we explored and extended visual-semantic embedding models for multilabel image classification. One key observation is that, despite the complex relationship among labels in the semantic space, one image is considered as a conduit for constructing the relationship of its labels. Specifically, an image divides all words into two sets according to the image-label relevance . Therefore, we developed an image-dependent ranking model, which returns a ranked list of labels according to its relevance to the input image. The idea was implemented using a simple CNN framework, as shown in Figure 3. In contrast to conventional CNN models that learn an image representation (i.e. the image embedding vector), the developed model learns a mapping (i.e. a transformation matrix) from an image in an attempt to differentiate between its relevant and irrelevant labels. During prediction, the image transformation matrix is used to map words from the input word space into a new space, where the words can satisfactorily be ranked according to their relevance to the input image. The proposed framework has the advantage of applying visual-semantic embedding that effectively addresses the semantic redundancy among labels; it also models the label co-occurrence without introducing additional subnets that are to be integrated in the CNN framework.
Compared with state-of-the-art multilabel image classification methods, the proposed CNN model has the following characteristics:
The model takes an image as the input and provides multilabel predictions without the computation of object segments or local image regions or the requirement of ground-truth bounding box information for training.
The model addresses the semantic redundancy and the co-occurrence dependency problems in multilabel classification, and it can be trained efficiently in an end-to-end manner.
The model learns from an image a transformation, rather than a representation. The output transformation can be readily used to solve the multilabel classification problem.
The model is conceptually simple and compact, yet it is more powerful than many existing deep learning-based models for multilabel classification.
We evaluated the proposed framework with experiments conducted on the public multilabel benchmark dataset NUS-WIDE . Experimental results demonstrated that the proposed method uses less training data but achieves superior performance, compared with current state-of-the-art multilabel classification methods. We further explain the superior performance of the model and empirically interpret the behavior of the model in the Discussion section. The remainder of the paper is organized as follows: We briefly review work related to multilabel classification in Section 2. Section 3 presents the details of the processes involved in using a single CNN model to extend visual-semantic embedding for multilabel image classification. The experimental results and conclusions are provided in Sections 4 and 5, respectively.
2 Related Works
In this section, we briefly review prior work on multilabel image classification. We start with a few hand-crafted features based methods and then describe deep learning based methods.
Pioneering work for addressing the classification problem focused on learning statistical models using hand-crafted features. For example, Makadia et al. 
utilized global low-level image features and a simple combination of basic distance measures to find nearest neighbors of a given image. The keywords were then assigned using a greedy label transfer mechanism. A bag-of-words model that aggregates local features extracted densely from an image was applied in[11, 24, 3, 5]. In particular, Weston et al. 
proposed a loss function to embed images with the associated labels together based on bag-of-words. Although these works have shown some successes, hand-crafted features are not always optimal for this particular task.
In contrast to hand-crafted features, learned features with deep learning have shown great potential for various vision recognition tasks. Specifically, CNN  has demonstrated an extraordinary ability for single-label image classification [13, 16, 17, 14, 19]. To extend CNN to multilabel classification, Sermanet et al.  and Razavian et al. 
used a CNN feature-SVM pipeline, in which multilabel images were directly fed into a CNN—pretrained on ImageNet—to get CNN activations as the of-the-shelf features for classification. Beyond using CNN as a feature extractor, Gong et al.  compared several popular multilabel losses to train the network. Using a top- ranking objective achieved state-of-the-art performance. Li et al.  improved that objective using the log-sum-exp pairwise function. Hypotheses-CNN-Pooling 
employed max pooling to aggregate the predictions from multiple hypothesis region proposals. These methods treated each label independently and ignored the semantic relationships between labels.
Visual-semantic embedding models [7, 23] effectively exploit the label semantic redundancy by leveraging the textual data. Instead of manually designing the semantic label space, Frome et al.  and Norouzi et al.  used semantic information gleaned from unannotated text to learn visual-semantic embedding where semantic relationship between labels was preserved. To extend visual-semantic embedding models to multilabel classification, Wang et al.  utilized RNNs to exploit the label dependencies in an image. The recurrent neurons model the label co-occurrence dependencies by sequentially linking the label embeddings in the joint embedding space. Similar to  in which an image is represented by a number of regions of interests, the multi-instance visual-semantic embedding model (MIE)  mapped the image subregions to their corresponding labels. These methods introduced significant complexity into the CNN architecture and may not be suitable for tasks that do not have powerful computing resources.
The proposed method uses the identical infrastructure to DeViSE 
, involving only a single CNN to operate visual information. Our key motivation is to design a simple method with a new modeling-paradigm, which extends DeViSE to processing multilabel images. The proposed method is fast in training and offers instant prediction during testing (only a linear transformation is required).
The objective of this study is to extend visual-semantic embedding models for multilabel image classification. We use a CNN and a word embedding model to achieve the goal. We start with formulating a binary classification problem for multilabel image classification. Next, we describe in details the model architecture and training.
In this study, we consider the task of multiclass image classification as a single binary classification problem. Figure 4 illustrates a simplified example. Suppose we have a label set containing six words: person, wedding, dancing, sun, bridge and tree. Figure 4 (left) separates the words into two classes, where person, wedding and dancing are considered positive because these labels suitably describe the image. Similarly, Figure 4 (right) shows a different partition of words.
Given an image, we aim at partitioning labels into two disjoint sets according to the image-label relevance. The partition involves analyzing the relationship between an image and two sets of words. Based on this observation, we propose to learn an image-dependent classifier, which is able to separate the relevant and irrelevant labels of an input image.
The image-dependent classifier has a form of a linear transformation matrix, which is implemented using a CNN framework (shown in Figure 5). The model architecture is similar to the network described in  except for the layer fc8, where the dimension is set to the size of the transformation matrix. Namely, the output of the last layer is a vector of length , which can also be viewed as a matrix. The image-dependent transformation matrix is used to map labels from the -dim word vector space into a -dim Euclidean space, where the relevant and irrelevant labels can satisfactorily be separated. Table 1 specifies the number of parameters used in each layer.
In contrast to previous works [1, 7, 34, 36], the linear transformation is not used to map an image to the word vector space. Instead, the transformation learned from an image seeks for linear combinations of variables in the word vectors that characterize two sets of labels.
|Layer||Input||Kernel||Stride||Output||No. of parameters|
3.3 Loss function
The objective of the deep transformation is to separate the relevant and irrelevant labels of the given image. Specifically, we wish to derive a transformation matrix by giving an image to the CNN model :
The matrix maps labels from a -dim word vector space into a -dim Euclidean space, , where the relevant labels aggregate around a canonical point (i.e. the origin) and the irrelevant labels scatter far from it.
In multilabel image classification we have a training dataset of pairs , where each training image has several positive labels . We randomly choose other labels (40 in the experiments) as negatives. The labels are represented by the -dim word vectors (detailed in Section 3.4). The goal is to learn a transformation matrix from so that the distance between the transformed positive word vectors and the origin is smaller than that of negative ones:
Based on this intuition we define a hinge rank loss (similar to ) for a training tuple as
where is a margin that is enforced between transformed positive and negative word vectors. Note the equation 3 is a sum of individual losses for negative labels . For each negative label, the loss is zero if is greater by a margin than the average norm . Conversely, if the margin between the norm of negative label and the average norm of the positive labels is violated, the loss is proportional to the amount of violation. This is visualized in Figure 6.
Instead of selecting the closest positive, we use the average norm to eliminate the situation where mislabeled and poorly samples would dominate the loss. Note that the above loss is related to the commonly used triplet loss [28, 29, 35, 2], but it is adapted to multilabel image classification using the formulation given in Section 3.1.
It is worth mentioning that in a special case where we have only a single-label image dataset for training and the transformed dimension () is set to 1, our model maps the 4,096-dim representation at the top of the visual model into the -dim representation of the word model. This is identical to the behavior of the DeViSE model , except that DeViSE applies the dot-product similarity to produce the ranking.
3.4 Word embeddings
Vector space models (VSMs) represent words in a continuous vector space where semantically similar words are embedded nearby each other. In particular, the skip-gram model introduced by Mikolov et al. [21, 22] has been shown to efficiently learn semantically meaningful vector representations of words from a large text corpus. The model learns to represent each word as a fixed length embedding vector by predicting source context words from the target words. Because synonyms tend to appear in similar contexts, the objective function described in  drives the model to learn similar vectors for semantically related words.
In the implementation, we used word2vec  with the skip-gram model of 300-dim embeddings (i.e. = 300) and trained the word embedding model on the Google News dataset (containing about 100 billion words). Figure 7 displays the visualization of the learned vectors of the 81 concepts defined in the NUS-WIDE dataset . The vectors capture some general, and in fact quite useful, semantic information about words and their relationships to one another. For example, the labels of animals (e.g., dog, cat, tiger, cow, horses, bear, zebra, fox, elk and etc.) are gathered around the third quadrant of the figure.
The label prediction of a test image using the proposed image transformation model is trivial. Let denotes the vector representations of the label set. The CNN model takes an input image and returns a transformation matrix . For each , we calculate and rank the labels according to the L2-norm , i.e. the distance between and the origin in the new space. Labels with a small distance are retrieved.
One nice thing about the model is that the label set does not necessarily contain the labels used in training. The model has the potential to perform zero-shot classification over the unseen labels, because of utilizing word embeddings where unseen and seen labels are in the same vector space. However, zero-shot learning is not the focus of this study. We leave this point for further investigation.
3.6 Training details
The CNN model was pre-trained on a large-scale single-label image dataset—ImageNet . We further trained the network on the target multilabel dataset (e.g., NUS-WIDE ) with the loss function described in Section 3.3.
and was gradually decreased. Training time for a single epoch was around 3 seconds, and training the model roughly took 10 hours. The runtime was reported running on a machine with an Intel Core i7-7700 3.6-GHz CPU, NVIDIA’s GeForce GTX 1080 and 32 GB of RAM. The transformed dimensionwas set to 100 in the experiments.
This section presents the experimental results. We compare our approach with several state-of-the-art methods on the large-scale NUS-WIDE dataset . We also examine how the transformed dimension affects the classification performance and interpret the behavior of the model.
4.1 Experimental settings
Dataset. We evaluated the proposed method on the NUS-WIDE dataset . It contains 269,648 images collected from Flickr in the original release. We were able to retrieve only 171,144 images of this dataset because some images were either corrupted or removed from Flickr. We followed the separation guideline from NUS-WIDE and split the dataset into a training set with 102,405 images and a test set with 68,739 images. In each set, the average number of labels per image is 2.43.
NUS-WIDE releases three sets of tags associated with the images. The most widely used set contains 81 concepts, which were carefully chosen to be representative of the Flickr tags and were manually annotated. Therefore, the 81-concepts annotations are much less noisy than those directly corrected from the web. This 81-concepts set is usually used as the ground-truth for benchmarking different multilabel image classification methods.
Evaluation protocols.ranked labels are compared to the ground truth labels. The precision is the number of correct labels divided by the number of machine-generated labels. The recall is the number of correct labels divided by the number of ground truth labels.
Following previous researches [9, 26, 32], we computed the per-class precision (C-P), overall precision (O-P), per-class recall (C-R) and overall recall (O-R). The average is taken over all classes for computing C-P and C-R, and is taken over all testing examples for computing O-P and O-R. We also reported the F1 score, which is the geometrical average of the precision and the recall.
|word model||RNN||region proposals||CNN architecture|
We compared the proposed method with recent CNN-based competitive methods.
Table 2 summarizes the methods. Note that our training set contains 102,405 images, which occupies only 68.27% of the set used in these competing methods (150,000 images).
We reported the experimental results with 3 and 5 predicted labels for each image in Table 3 and Table 4, respectively. The proposed method consistently outperformed WARP in terms of all measurements. Since both methods enforced positive labels to be top ranked, the performance gain (C-F1: 6.2%, O-F1: 3.4% in top-5 prediction) may be obtained by using a word model that provided a priori knowledge about the labels.
In comparison with RNN-CNN, both methods modeled image-label and label-label dependencies. The proposed method performed slightly better than RNN-CNN (C-F1: 4.2%, O-F1: 2% in top-3 prediction) despite a much simpler network architecture was used. Finally, the proposed method had a comparable performance with MIE. However, the design principles of these methods were completely different. MIE used GoogleNet which was deeper and more complex than the AlexNet used in our model. MIE required additional computations to extract semantically meaningful sub-regions from one image, while the proposed method took a global approach. MIE modeled the region-to-label correspondence and ours modeled that between an image and its label set. The proposed method was much simpler than MIE.
4.3 Empirical analysis on the transformed dimension
Recall that we learn a transformation matrix from an image that maps the labels from the word space to a -dim Euclidean space. Now we examine the design choice in determining the dimension (). In this experiment we trained 10 models by uniformly sampling from 10 to 100. Each model was trained with 1000 iterations.
Figure 8 shows the classification performance of these models in top-3 and top-5 predictions. The classification performance is rather stable no matter which value is used. The determination of the value has little effect on the overall classification performance.
4.4 Model interpretation
|Voting (top 1)||23.2%||31.9%||47.9%||59.0%|
|Voting (top 3)||29.1%||36.3%||50.6%||62.2%|
|Voting (top 5)||31.8%||37.4%||51.1%||62.9%|
The proposed model can be viewed as a combination of classifiers with shared CNN features to produce a powerful “committee.” Recall that we obtain a transformation matrix from an image via CNN. Each -dim row vector in this matrix can be interpreted as a principal direction in the original word vector space, along which the labels are ranked. Simultaneously training CNNs is costly and may cause overfitting, and we avoid these problems using a shared CNN—all classifiers share the same image features. We use a single CNN to implement an assembly of classifiers.
This strategy leads to powerful and complementary classifiers. To illustrate this point, we individually inspected the outputs of each classifier and retained only top labels of a classifier. Next, we used a simple voting scheme to aggregate the results of all classifiers. For example, we obtained labels (may be repetitive) when was set to 1, from which we retrieved frequent labels as the final output.
Table 5 shows the classification performances in top-3 prediction with various values. By using a small (i.e. = 3) the combination of the classifiers outperformed WARP  and CNN-RNN . The classification performance was further boosted when all results are used jointly.
Next, we investigated the similarities among the outputs of the individual classifiers. In this empirical analysis, we obtained the top 5 predicted labels from each classifier. The Jaccard coefficient—defined as the size of the intersection divided by the size of the union of two sets—was used to compute the pair-wise similarity of two label sets. For each test image, we reported the average Jaccord coefficient of the pairs. Figure 9
shows the histogram of the average Jaccord coefficients computed using the test set. The mean value is 0.1001 and the standard deviation is 0.1288, indicating that the outputs of the classifiers are very different. As shown in Table3, combining these classifiers led to a powerful committee.
This interpretation relates the proposed method to Fast0Tag , which aims to learn a mapping function between the visual space and the word vector space. This approach can be viewed as a special case of our method by setting to 1.
5 Conclusion and future work
We have extended single-label visual-semantic embedding models for multilabel image classification. The complex image-to-label and label-to-label dependencies are modeled via a simple infrastructure involving only a single CNN as the visual model. In particular, a new learning paradigm is developed: we learn a transformation—rather than a representation—from an image, with the objective of optimizing the separation of the image’s relevant and irrelevant labels. Fast and accurate prediction of labels can be achieved by simply performing a linear transformation on the word vectors.
One future research direction we are pursuing is to extend the method for zero-shot prediction, in which test images are assigned with unseen labels from an open vocabulary. This would take full advantage of the word model—unseen labels are in the same vector space as the seen labels for training. Another direction is to explore the learning of nonlinear transformation, which may better exploit higher order dependencies among labels.
Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid.
Label-embedding for attribute-based classification.
IEEE International Conference on Computer Vision and Pattern Recognition, 2013.
-  R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. Netvlad: Cnn architecture for weakly supervised place recognition. In IEEE International Conference on Computer Vision and Pattern Recognition, 2016.
-  Q. Chen, Z. Song, J. Dong, Z. Huang, Y. Hua, and S. Yan. Contextualizing object detection and classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(1):13–27, 2015.
-  T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y.-T. Zheng. Nus-wide: A real-world web image database from national university of singapore. In ACM International Conference on Image and Video Retrieval, 2009.
-  J. Dong, W. Xia, Q. Chen, J. Feng, Z. Huang, and S. Yan. Subcategory-aware object classification. In IEEE International Conference on Computer Vision and Pattern Recognition, 2013.
-  M. Everingham, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
-  A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov. Devise: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, 2013.
-  R. Girshick. Fast r-cnn. In IEEE International Conference on Computer Vision, 2015.
-  Y. Gong, Y. Jia, T. Leung, A. Toshev, and S. Ioffe. Deep convolutional ranking for multilabel image annotation. arXiv preprint arXiv:1312.4894, 2013.
-  M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. Tagprop: Discriminative metric learning in nearest neighbor models for image auto-annotation. In IEEE International Conference on Computer Vision, 2009.
-  H. Harzallah, F. Jurie, and C. Schmid. Combining efficient object localization and image classification. In IEEE International Conference on Computer Vision and Pattern Recognition, 2009.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
-  K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In IEEE International Conference on Computer Vision, 2009.
-  A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
-  Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, 1990.
-  Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In IEEE International Conference on Computer Vision and Pattern Recognition, 2004.
H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng.
International Conference on Machine Learning, 2009.
-  Y. Li, Y. Song, and J. Luo. Improving pairwise ranking for multi-label image classification. In IEEE International Conference on Computer Vision and Pattern Recognition, 2017.
-  M. Lin, Q. Chen, and S. Yan. Network in network. In International Conference on Learning Representations, 2014.
-  A. Makadia, V. Pavlovic, and S. Kumar. Baselines for image annotation. International Journal of Computer Vision, 90(1):88–105, 2010.
-  T. Mikolov, K. Chen, G. S. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In International Conference on Learning Representations, 2013.
-  T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 2013.
-  M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. In International Conference on Learning Representations, 2014.
-  F. Perronnin, J. Sánchez, and T. Mensink. Improving the fisher kernel for large-scale image classification. In European Conference on Computer Vision, 2010.
-  A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. Cnn features off-the-shelf: an astounding baseline for recognition. arXiv:1403.6382v3, 2014.
-  Z. Ren, H. Jin, Z. Lin, C. Fang, and A. Yuille. Multi-instance visual-semantic embedding. arXiv preprint arXiv:1512.06963, 2015.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
-  M. Schultz and T. Joachims. Learning a distance metric from relative comparisons. In Advances in Neural Information Processing Systems, 2004.
-  P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv:1312.6229v4, 2014.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556v6, 2015.
-  L. van der Maaten and G. Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9:2579–2605, 2008.
-  J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang, and W. Xu. Cnn-rnn: A unified framework for multi-label image classification. In IEEE International Conference on Computer Vision and Pattern Recognition, 2016.
-  Y. Wei, W. Xia, M. Lin, J. Huang, B. Ni, J. Dong, Y. Zhao, and S. Yan. Hcp: A flexible cnn framework for multi-label image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(9):1901–1907, 2016.
J. Weston, S. Bengio, and N. Usunier.
Wsabie: Scaling up to large vocabulary image annotation.
International Joint Conference on Artificial Intelligence, 2011.
-  M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, 2014.
-  Y. Zhang, B. Gong, and M. Shah. Fast zero-shot image tagging. In IEEE International Conference on Computer Vision and Pattern Recognition, 2016.