Semantic Comparison of State-of-the-Art Deep Learning Methods for Image Multi-Label Classification

by   Adam Kubany, et al.
Ben-Gurion University of the Negev

Image understanding relies heavily on accurate multi-label classification. In recent years, deep learning (DL) algorithms have become very successful tools for multi-label classification of image objects. With this set of tools, various implementations of DL algorithms for multi-label classification have been published for public use in the form of application programming interfaces (APIs). In this study, we evaluate and compare 10 of the most prominent publicly available APIs in a best-of-breed challenge. The evaluation of the various APIs is performed on the Visual Genome labeling benchmark dataset using 12 well-recognized similarity metrics. Additionally, for the first time in this kind of comparison, we use a semantic similarity metric to evaluate semantic performance. In this evaluation, the Microsoft Computer Vision, IBM Visual Recognition, and Imagga APIs show better performance than the other APIs.




1 Introduction

Accurate semantic identification of objects, concepts, and labels from images is one of the preliminary challenges in the quest for image understanding. It is only natural that machine learning and natural language researchers have been highly motivated to address these challenges. The race to achieve good label classification has been fierce and became even more so as a result of public competitions such as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [1]. The obvious next step in this quest lies in the expansion of the challenge from single- to multi-label classification. With this challenge in mind, different learning approaches for multi-label classification have been suggested. Tsoumakas and Katakis [2, 3] divided these approaches into two main categories: 1) problem transformation methods, which transform the problem into one or more single-label classification problems and then transform the results back into a multi-label representation; and 2) algorithm adaptation methods, which try to solve the multi-label prediction problem as a whole, directly from the data. In 2012, Madjarov et al. [4] introduced a third category, referred to as ensemble methods; this category consists of methods that combine classifiers to solve the multi-label classification problem. In this approach, each of the base classifiers in the ensemble can belong to either the problem transformation or algorithm adaptation category.

As the research field of multi-label classification advances, more effective approaches have been developed [4, 5]. In recent years, deep learning methods, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), and their variations, have demonstrated excellent performance in visual and multi-label classification [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]. Some of the more successful methods have been published as APIs for public use. The more salient approaches were published by research groups from Imagga [17], Watson IBM [18], Clarifai [19], Microsoft [20], Wolfram Alpha [21], Google [22], Caffe [23], DeepDetect [24], OverFeat [25], and TensorFlow [26]. With these recent publications, the need for a best-of-breed performance comparison has arisen. While some comparisons between multi-label classification methods have been performed in the past [4, 5], none of them have included the latest deep learning APIs. In this study, we address this need and evaluate the performance of 10 state-of-the-art deep learning approaches. A benchmark comparison is best accomplished with a state-of-the-art dataset; for that purpose, we chose the Visual Genome dataset [27], which includes rich metadata and semantic annotations for multi-domain everyday images. We evaluate and compare these 10 approaches with well-established multi-label evaluation metrics [3, 28]. These metrics evaluate the multi-label classification performance based on whether the predicted labels exist in the image's ground truth label list; however, they do not take the semantic similarity between the predicted and ground truth labels into consideration. Therefore, in order to evaluate the semantic similarity between the predicted and ground truth multi-label classifications, we applied a variation of the word mover's distance (WMD) [29] metric. To the best of our knowledge, this work provides the most thorough evaluation of state-of-the-art deep learning multi-label image classification APIs and is the only work to include a semantic evaluation metric.

The novel contributions of this work are: 1) a comparison of the predictive performance of 10 of the most prominent publicly available APIs for multi-label image classification, and 2) an evaluation of the APIs’ performance using a semantic similarity metric in addition to well-known metrics.

2 Multi-Label Image Classification APIs

As visual analysis and, more specifically, multi-label classification research advances, deep learning methods such as CNNs have shown superior performance [8, 9, 10, 11, 13, 15, 16, 7, 30]. Some of the most prominent approaches have evolved from theoretical algorithms to online services provided by various companies, such as Imagga [17], Watson IBM (Visual Recognition API) [18], Clarifai [19], Microsoft (Computer Vision API) [20], Wolfram Alpha (Image Identification API) [21], and Google (Cloud Vision API) [22]. Except for Microsoft's API, these commercial services don't reveal much about their proprietary algorithm structure and training scheme beyond mentioning that they are based on deep neural networks and can classify multiple objects. Microsoft's Computer Vision (CV) is based on a deep residual learning framework [31] with 86 category concepts and 2,000 recognizable objects. There are also several top open-source frameworks with the capability of multiple image classification, such as Caffe [32, 23], DeepDetect [24], OverFeat [33, 25], and TensorFlow [34, 26]. In contrast to the commercial APIs, these open-source frameworks divulge more information regarding their operation. The Caffe framework is published by the Berkeley Vision and Learning Center (BVLC); here we use their CaffeNet [32] reference model, which is an AlexNet [30] variation. The model was trained with the ImageNet dataset as part of the ILSVRC2012 competition. The DeepDetect framework is based on a deep neural network pretrained on a subset of ImageNet (ILSVRC2012); in our study, we evaluate the model provided by the Caffe development team, which was trained with the GoogLeNet architecture. The OverFeat framework is based on a convolutional network [33] for image feature extraction and classification; it was trained using Torch [35] on the ILSVRC2012 dataset. All of the open-source frameworks aim to efficiently identify the main classes of the image rather than predicting each object's class.

With the publication of these APIs and services, the obvious question arises: which of these services is the best for multi-label image classification? For this best-of-breed challenge, we queried each service using the same dataset; the queries were made via the online API as stated on the service's Web page or using a self-installed framework. In order to provide a fair comparison, we queried all of the services using the vanilla versions of their pretrained algorithms. Evaluating the multi-label prediction performance of each API and comparing them to identify the best of the breed requires standardized measures and metrics. Various metrics have been proposed in the past for such evaluation [3, 28]; these metrics can be divided into bipartition and ranking metrics [3]. As none of the evaluated APIs provide a ranking for all of the labels in the ground truth dataset, we focus only on the bipartition metrics. For the metrics' definitions, let us denote $Y_i \in \{0,1\}^q$ as the multi-label binary encoding of the label set of image $i$ from an $n$-image dataset, where $L$ is the label set dictionary of size $q$, and $Z_i \in \{0,1\}^q$ as the multi-label binary encoding of the label set of image $i$ as predicted by the multi-label classifier $h$; hence, $Z_i = h(x_i)$, where $x_i$ is defined as the feature vector of image $i$.


2.1 Bipartition Metrics

There are two types of bipartition evaluation metrics. Example-based bipartition evaluation metrics refer to various average differences of the predicted label set from the ground truth label set for all of the examples in the dataset, whereas label-based evaluation metrics first evaluate each label separately and then obtain the average of all of the labels.

2.1.1 Example-Based

The Hamming Loss metric [36] calculates how many times, on average, an incorrect prediction was made by $h$. For that purpose, it utilizes the cardinality of the symmetric difference ($\Delta$) between the actual and predicted label sets. The Hamming Loss metric is defined as follows:

$$\text{Hamming Loss} = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i \,\Delta\, Z_i|}{q}$$

A Hamming Loss of 0 reflects a perfect alignment between prediction and ground truth, and a Hamming Loss of 1 reflects a prediction entirely different from the ground truth.
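As a concrete illustration (not the authors' code), the Hamming Loss can be sketched in a few lines of Python, with each image's label set represented as a Python set and `q` the size of the label dictionary:

```python
def hamming_loss(y_true_sets, y_pred_sets, q):
    """Average size of the symmetric difference between the ground
    truth and predicted label sets, normalized by the label
    dictionary size q and averaged over all images."""
    total = 0.0
    for y, z in zip(y_true_sets, y_pred_sets):
        total += len(set(y) ^ set(z)) / q  # ^ is symmetric difference
    return total / len(y_true_sets)
```

For example, with ground truth `{"cat", "dog"}`, prediction `{"cat", "bird"}`, and `q = 4`, the symmetric difference is `{"dog", "bird"}`, giving a loss of 0.5.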
The following Accuracy, Precision, Recall, and $F_1$ metrics are standard metrics adapted for multi-label classification [4, 37]. Accuracy is defined as the Jaccard similarity between the predicted label set $Z_i$ and the ground truth label set $Y_i$, averaged over all $n$ images:

$$\text{Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|}$$

Precision and Recall are defined as the average proportion between the number of correctly predicted labels ($|Y_i \cap Z_i|$) and either the number of predicted labels $|Z_i|$ or the number of ground truth labels $|Y_i|$:

$$\text{Precision} = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i \cap Z_i|}{|Z_i|}, \qquad \text{Recall} = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i \cap Z_i|}{|Y_i|}$$

$F_1$ is the harmonic mean of Precision and Recall:

$$F_1 = \frac{1}{n} \sum_{i=1}^{n} \frac{2\,|Y_i \cap Z_i|}{|Y_i| + |Z_i|}$$
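The example-based Accuracy, Precision, Recall, and F1 metrics can be sketched as follows (an illustrative implementation, not the authors' code); each label set is a Python set, and empty sets are guarded to avoid division by zero:

```python
def example_based_metrics(y_true_sets, y_pred_sets):
    """Example-based Accuracy (Jaccard), Precision, Recall, and F1
    for multi-label predictions, averaged over all images."""
    n = len(y_true_sets)
    acc = prec = rec = f1 = 0.0
    for y, z in zip(y_true_sets, y_pred_sets):
        y, z = set(y), set(z)
        inter = len(y & z)                              # true positives
        acc += inter / len(y | z) if (y | z) else 0.0   # Jaccard similarity
        prec += inter / len(z) if z else 0.0            # over predicted labels
        rec += inter / len(y) if y else 0.0             # over ground truth labels
        f1 += 2 * inter / (len(y) + len(z)) if (y or z) else 0.0
    return acc / n, prec / n, rec / n, f1 / n
```

With ground truth `{"bicycle", "child", "tree"}` and prediction `{"bike", "child"}`, only "child" matches exactly, giving Accuracy 0.25, Precision 0.5, Recall 1/3, and F1 0.4.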


2.1.2 Label-Based

Label-based metrics evaluate the performance of a classifier by first evaluating each label and then averaging over all of the labels. Such averaging can be achieved by one of two conventional averaging operations, namely macro- and micro-averaging [38]. For that purpose, any binary evaluation metric can be applied, but usually Precision, Recall, and their harmonic mean are applied in information retrieval tasks [3].
For each label $l \in L$, the sums of true positives ($tp_l$), true negatives ($tn_l$), false positives ($fp_l$), and false negatives ($fn_l$) are calculated according to the classifier applied. Then, a binary performance evaluation metric $B$ can be calculated with either the macro- or micro-averaging operation:

$$B_{macro} = \frac{1}{q} \sum_{l=1}^{q} B(tp_l, tn_l, fp_l, fn_l), \qquad B_{micro} = B\!\left(\sum_{l=1}^{q} tp_l, \sum_{l=1}^{q} tn_l, \sum_{l=1}^{q} fp_l, \sum_{l=1}^{q} fn_l\right)$$

Therefore, the definitions of Precision (P), Recall (R), and $F_1$ are easily derived as [4]:

$$P = \frac{tp}{tp + fp}, \qquad R = \frac{tp}{tp + fn}, \qquad F_1 = \frac{2PR}{P + R}$$

Macro $F_1$ is the harmonic mean of Precision and Recall obtained by first evaluating each label and then averaging over all labels, whereas Micro $F_1$ is the harmonic mean of the Micro Precision and Micro Recall defined above. Except for the Hamming Loss metric, all of the metrics produce a score on a scale of zero to one, where a higher score implies better alignment between the predicted label set and the ground truth set; for the Hamming Loss, a lower score is better.
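A minimal sketch of the macro- and micro-averaged Precision, Recall, and F1 (illustrative only, not the authors' code): the per-label tp/fp/fn counts are accumulated first, then macro-averaging averages the per-label scores while micro-averaging pools the counts before computing a single score.

```python
from collections import Counter

def label_based_f1(y_true_sets, y_pred_sets, labels):
    """Macro- and micro-averaged (Precision, Recall, F1) tuples."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for y, z in zip(y_true_sets, y_pred_sets):
        y, z = set(y), set(z)
        for l in labels:
            if l in z and l in y:
                tp[l] += 1          # correctly predicted label
            elif l in z:
                fp[l] += 1          # predicted but not in ground truth
            elif l in y:
                fn[l] += 1          # in ground truth but missed

    def prf(t, f_pos, f_neg):
        p = t / (t + f_pos) if t + f_pos else 0.0
        r = t / (t + f_neg) if t + f_neg else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    # Macro: score each label, then average the scores.
    per_label = [prf(tp[l], fp[l], fn[l]) for l in labels]
    macro = tuple(sum(v) / len(labels) for v in zip(*per_label))
    # Micro: pool the counts over all labels, then score once.
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro
```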

2.2 Semantic Similarity

Although the aforementioned metrics are well-known and legitimate for evaluating the similarity between the ground truth and predictions of multi-label classification, they share a significant drawback: they overlook the semantic similarity between individual labels. For example, assume the ground truth multi-label set is {"bicycle," "child," "helmet," "road," "tree"} and the predicted set is {"bike," "boy," "trail," "tree," "grass," "flower"}. Evaluating the similarity between the two label sets with the aforementioned metrics will consider only the label "tree" as a true positive and overlook the close semantic similarity between the label pairs {"child," "boy"}, {"bicycle," "bike"}, and {"road," "trail"}. This example demonstrates that these metrics might misrepresent the similarity between two multi-label sets and that an additional, semantic kind of similarity metric is required.

The word mover's distance (WMD) [29] is a method based on the earth mover's distance [39, 40] and aimed at evaluating the semantic distance between two documents. Let us denote $R_i$ as the ground truth label set of image $i$, and $P_i = h(x_i)$ as the label set of image $i$ predicted by the multi-label classifier $h$. Note that $R_i$ and $P_i$ contain the explicit labels (e.g., {"bike," "boy," "trail," "tree," "grass," "flower"}) and don't have to be of the same size. Treating the two label sets as two bags-of-words (BOW) allows us to apply the WMD method to evaluate their semantic distance. The WMD algorithm requires that each BOW be represented as a normalized BOW (nBOW) vector $d \in \mathbb{R}^n$, where $d_l = c_l / \sum_{j=1}^{n} c_j$ and $c_l$ is the number of times that word $l$ of the $n$-word vocabulary appears in the BOW. Let $d$ be the nBOW representation of $R_i$ and $d'$ that of $P_i$. The second requirement of the WMD is a semantic distance between every two labels, where $c(l,k)$ is referred to as the cost of "traveling" from word $l$ to word $k$. In the WMD method, this distance is obtained by using a word embedding implementation, such as word2vec [41] or GloVe [42], which are unsupervised learning methods that represent a word as a multidimensional vector. Originally [29], the WMD was applied with word2vec; however, we implemented GloVe word embeddings, as GloVe has become an increasingly popular tool in various applications [43, 44, 45]. Let $X \in \mathbb{R}^{dim \times n}$ be the GloVe embedding matrix, where $x_k \in \mathbb{R}^{dim}$ is the $dim$-dimensional embedding representation of word $k$ from the vocabulary of $n$ words. Hence, the "traveling cost" from word $l$ to word $k$ is defined as their Euclidean distance, $c(l,k) = \lVert x_l - x_k \rVert_2$. Next, let us define a sparse flow matrix $T \in \mathbb{R}^{n \times n}_{\geq 0}$, where $T_{lk}$ represents the ratio of participation of word $l$ from $d$ traveling to word $k$ from $d'$. A word can participate in traveling only as much as its nBOW ratio; therefore, the participation ratio restrictions $\sum_k T_{lk} = d_l$ and $\sum_l T_{lk} = d'_k$ are applied. Finally, the distance between the two BOW is defined as the minimum sum of the weighted traveling costs from $d$ to $d'$:

$$\mathrm{WMD}(d, d') = \min_{T \geq 0} \sum_{l,k=1}^{n} T_{lk}\, c(l,k)$$

subject to the participation ratio restrictions. For our purposes, we average the WMDs over all of the $n$ images in the tested dataset for every API:

$$\overline{\mathrm{WMD}} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{WMD}(d_i, d'_i)$$
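The exact WMD is the solution of an optimal-transport linear program. As a self-contained illustration, the sketch below instead computes the relaxed WMD lower bound from Kusner et al. [29], in which each word's mass travels entirely to its nearest counterpart in the other set; the toy two-dimensional vectors standing in for GloVe embeddings and the uniform nBOW weights (valid here because label sets contain unique labels) are both illustrative assumptions, not real embedding values.

```python
import numpy as np

# Toy vectors standing in for GloVe embeddings (hypothetical values);
# in practice each word maps to its pretrained GloVe vector.
EMB = {
    "bike":    np.array([1.0, 0.1]),
    "bicycle": np.array([0.9, 0.2]),
    "boy":     np.array([0.1, 1.0]),
    "child":   np.array([0.2, 0.9]),
    "tree":    np.array([5.0, 5.0]),
}

def relaxed_wmd(truth, pred):
    """Relaxed word mover's distance (a lower bound on the exact WMD):
    each word's full nBOW mass travels to its cheapest counterpart in
    the other label set; the bound is taken symmetrically."""
    def one_way(src, dst):
        w = 1.0 / len(src)  # uniform nBOW weight per unique label
        return sum(
            w * min(np.linalg.norm(EMB[a] - EMB[b]) for b in dst)
            for a in src)
    return max(one_way(truth, pred), one_way(pred, truth))
```

Under these toy embeddings, {"bicycle", "child"} is far closer to {"bike", "boy"} than to {"tree"}, which is exactly the kind of near-synonym credit the bipartition metrics cannot give.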
3 Dataset

In order to achieve the goal of image understanding, high-performance multi-label classification methods must be utilized. To ensure an evaluation as close to real life as possible, an adequate dataset must be used. Such a dataset should include images from multiple domains, as well as semantic annotations of objects, concepts, or labels. Image understanding is further enhanced if the dataset also includes object orientation, relations between the objects, and textual descriptions of the images.
The Visual Genome project and dataset [27] (we used version 1.0 of the dataset) is an attempt to provide a comprehensive benchmark dataset for image understanding tasks. The dataset consists of 108,249 everyday multi-domain images, each annotated with objects, relationships, descriptions, etc. For the purposes of multi-label classification, each image is associated with an average of 21 objects (out of 17,000 possibilities) with their bounding boxes, where each object label is mapped to a WordNet [46] synset within its hierarchical relations.

4 Results and Discussion

4.1 Experiment Setup

Some of the examined APIs limit the number of image requests for multi-label classification, both over a period of time and in total. Given these limitations, we evaluated the APIs' performance on the first 1,000 images of the Visual Genome dataset (sorted in ascending order by name; we selected the first 1,000 images that have ground truth objects, as some of the dataset's images lack them), which, to our understanding, are sufficient for an adequate performance evaluation of the examined APIs. Within this subset there are 3,728 possible objects, with an average of 14 distinct objects per image. Prior to the evaluation, we perform a preprocessing step on the labels from both the dataset and the API results, in which we lowercase the labels and strip out whitespace and grammatical characters, such as colons, periods, underscores, apostrophes, and exclamation marks. In addition, to allow a fair comparison, the label sets include only unique labels; for example, {"cat," "car," "car," "dog," "dog"} is evaluated as {"cat," "car," "dog"}. Since the various API services predicted different numbers of labels (see Table 1), the evaluation was performed at four label levels: all of the predicted labels, and the top five, three, and one label(s) according to their confidence level for each image (see Figures 1-4).
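The preprocessing step described above might be sketched as follows (an illustrative helper, not the authors' code; the exact set of stripped characters is an assumption based on the examples given):

```python
import re

def normalize_labels(labels):
    """Lowercase each label, strip grammatical characters (colons,
    periods, underscores, apostrophes, exclamation marks, ...) and
    surrounding whitespace, then keep only unique labels,
    preserving their original order."""
    seen, out = set(), []
    for label in labels:
        clean = re.sub(r"[:._'!,;?]", "", label.lower()).strip()
        if clean and clean not in seen:
            seen.add(clean)
            out.append(clean)
    return out
```

For instance, `["Cat", "car.", "CAR", "dog!", "dog"]` normalizes to `["cat", "car", "dog"]`, matching the deduplication example in the text.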

4.2 Example-Based Metric Results

One of the first observations is that, in general, the examined APIs have relatively low scores. This can be explained by the fact that each of the APIs was trained with a different dataset, and we compared them with their vanilla settings. These out-of-the-box configurations and settings are necessary if we wish to choose the best service for evaluating an image without any prior knowledge of its origin and features.
Another observation that attracted our attention is that three APIs stand out with high scores: the Microsoft CV, IBM, and Imagga APIs consistently hold first, second, and third place. The Microsoft CV API almost always outperforms all of the others on the Accuracy, Precision, Recall, and F1 metrics at the all, top five, and top three predicted label levels, except for the top label prediction, where it competes with the Imagga API for second place, behind the top performer, the IBM API.
A high Precision score means that the labels predicted by the Microsoft CV API are largely relevant, with only a few false positives. The Microsoft CV API also has high Recall scores, indicating that it correctly predicted relatively more of the ground truth labels (with only a few false negatives). Considering that the Microsoft CV API predicted an average of 6.77 labels per image and the IBM API predicted 3.16 labels per image (see Table 1), while the dataset has an average of 14 distinct labels per image, suggests that although the Microsoft CV and IBM APIs might not predict all of the ground truth labels, they correctly predict the ones they do; this is also reflected in their relatively high Accuracy scores. To further demonstrate their dominance, we note that Imagga's high Recall and Precision scores at the all-predicted-labels level can be easily explained by its large number of predicted labels (47.53 per image); however, in the more challenging task of considering only a few labels (the top three or top one), it drops from second to third place, while the Microsoft CV API consistently holds first and second place. The dominance of the Microsoft CV API is also reflected in its top scores on the F1 metric, since F1 is the harmonic mean of its high Precision and Recall scores. The Hamming Loss metric indicates very little difference between the APIs, as the number of true positive predicted labels is much lower than the 3,728 labels in the domain.
The relatively poor results of the open-source frameworks (Caffe, DeepDetect, OverFeat, and TensorFlow) at the all, top five, and top three predicted label levels can be explained by the fact that they only provide a predicted ranking for the top label or image classification and are not designed to perform multi-label classification. Nevertheless, this doesn't explain their relatively low scores in the top label prediction comparison. We were also puzzled by the consistently low scores of the Wolfram Alpha Image Identification API, as it was designed for multi-label classification of images.

From the example-based metrics results, we can conclude that if one is looking for as many labels as possible, including several which might not be relevant (false positives), the Imagga API should be considered, whereas if only the top predicted label is needed, the IBM API is the way to go. For an API with all around top performance, the Microsoft CV API is the obvious tool for the job.

API Average Labels per Image Query Settings
Imagga 47.53 vanilla
IBM 3.16 vanilla, no threshold
Clarifai 20 vanilla with fixed number of labels
Microsoft CV 6.77 vanilla
Wolfram 5 vanilla with fixed number of labels
Google Vision 6.79 vanilla with max 10 labels per image
Caffe 5 vanilla with fixed number of labels
DeepDetect 5 vanilla with fixed number of labels
OverFeat 5 vanilla with fixed number of labels
TensorFlow 5 vanilla with fixed number of labels
Table 1: APIs’ query metadata and settings.

4.3 Label-Based Metric Results

For this type of metric, we evaluate the performance of the various APIs from the label perspective (see Figures 1-4). As with the example-based metrics, the Microsoft CV, IBM, and Imagga APIs stand out, but here the Google Vision and TensorFlow APIs also occasionally show good results. In the Macro family of metrics, we evaluate the performance of predicting each label separately and then average over all labels, whereas in the Micro measures, we evaluate the performance of all of the labels' predictions together.
When considering the APIs' performance on the Macro metrics, we can see that the performance values are relatively low; this is again due to the large number of labels. Nevertheless, for the overall Macro metrics, the Imagga, Microsoft CV, and Google Vision APIs take first, second, and third place, respectively. We can also see that the Microsoft CV and Imagga APIs are neck and neck for first place, with a slight advantage for Microsoft's API on the Macro Precision metric at all four prediction levels; as we have already seen, this indicates that most of their predicted labels are true positives. The Macro Recall metric, which measures how many of the ground truth labels were predicted correctly, introduces the TensorFlow API as a new high-performing player, with the Imagga and Google Vision APIs in second and third place, respectively; on this metric, the Microsoft CV API performs less well, coming in sixth place. This indicates that the Microsoft CV API might not be able to correctly predict some specific labels. Their harmonic mean, the Macro F1, indicates that when considering all of the prediction levels, Imagga leads overall, while the Microsoft CV and IBM APIs come in second and third place, respectively.
In the Micro family of metrics, we observe that the Microsoft CV API leads, with the IBM and Imagga APIs tied for second place. Micro Precision measures the overall ratio of correctly predicted labels out of the total number of predicted labels. Here, the IBM API comes in first place, and the Microsoft CV and Imagga APIs come in second and third place. Micro Recall measures the same ratio but out of the total number of ground truth labels, i.e., how many of the ground truth labels the model predicted. In this case, the Microsoft CV API takes the lead, with the Imagga and IBM APIs in second and third place. When taking into consideration their harmonic mean, the Micro F1, the ranking stays the same, with the Microsoft CV API at the top and the IBM and Imagga APIs tied for second place.

Considering the insights obtained from both the example-based and label-based metrics, we can summarize that the Microsoft CV API wins on points over the Imagga and IBM APIs. In the end, the choice between them will depend on whether one needs many labels, some of which will be irrelevant, or fewer labels, most of which are predicted correctly.

4.4 Semantic Similarity Metric Results

One of the main contributions of this work is our use of a semantic similarity metric to evaluate similarity for image multi-label classification. We utilize the WMD metric, which uses Euclidean distances between GloVe word embedding vectors as building blocks, to calculate the semantic distance between two bags-of-words (BOW); therefore, perfect semantic similarity is represented by a value of zero, which means that there is no semantic distance between the BOWs. The semantic similarity metric allows us to gain insights regarding the relevance of the predicted labels to the ground truth labels without restricting the predicted labels to exactly the ground truth labels. Exact label prediction is, of course, preferable; however, with this metric, predicted labels that are merely close in meaning to the ground truth labels are also considered, whereas they are discarded by the classical metrics.
With the semantic similarity metric, the TensorFlow API takes the overall lead, the Microsoft CV and DeepDetect APIs are tied for second place, and the Imagga and Wolfram APIs are tied for third place. This shows that the TensorFlow API, which received low scores on the example-based metrics and mediocre to high scores on the label-based metrics, can be used to predict labels that are semantically close to the ground truth labels. While the Microsoft CV API shows superior performance when considering more than three labels, coming in first place at the all and top five label levels, the TensorFlow API leads when considering a low number of labels, coming in first place at the top three and top one predicted label levels. These results demonstrate the capabilities of the semantic metric: the TensorFlow API was not considered a top performer by the conventional metrics, but the semantic metric revealed deeper insights and showed the TensorFlow API's superiority in making semantically meaningful predictions.

Based on all of our findings, we can conclude that the overall top performer is the Microsoft CV API, which can predict many labels; in most cases these labels are the same as the ground truth, and the ones that are not are most likely semantically similar to the ground truth labels. In cases in which many predicted labels are required and some of them can be irrelevant, the Imagga API is preferred. On the other hand, if exact labels are not required and labels that are similar in meaning suffice, the TensorFlow API does the best job.

5 Conclusions

With the appearance of new deep learning technologies, significant advances have been made in the field of multi-label classification. As these technologies increase in popularity, more and more implementations are developed and published, some of which are made available to the public as API services. In this study, we compared the performance of some of the most prominent deep learning multi-label classification APIs. These services were examined using a subset of the Visual Genome benchmark dataset and evaluated with various well-known example-based and label-based evaluation metrics. These metrics evaluate the prediction performance of the APIs based on whether a predicted label exists in the ground truth label set. In addition, for the first time, a semantic similarity metric was used for image multi-label classification performance evaluation. This type of semantic metric allowed us to obtain deeper insights regarding the relevance of the predicted labels, even if they do not exist in the ground truth label set.
When evaluating the APIs' label predictions with the well-known metrics, three were shown to outperform the others: the Microsoft CV, IBM, and Imagga APIs. On the example-based metrics, the Microsoft CV API beat the IBM and Imagga APIs, while on the label-based metrics, the Microsoft CV and Imagga APIs were neck and neck, with a slight advantage for the Microsoft CV API. When we evaluated the APIs' performance in predicting semantically similar labels, a new top performer was revealed: the TensorFlow API showed superior performance, followed by the Microsoft CV and DeepDetect APIs. Here we can observe the importance and added value of the semantic similarity metric; it allowed us to obtain deeper insights regarding the semantic relevance of the predicted labels, insights which are unavailable with the conventional metrics.
As the field of multi-label classification advances, we believe that comparisons such as ours can be beneficial for the users of such services, as well as for researchers, who may be encouraged to develop improved algorithms for multi-label classification that outperform the existing best of the breed. Furthermore, the proposed semantic similarity metric provides an additional, insightful measure against which these algorithms can be evaluated.

Figure 1: Results for each of the metrics for the evaluated APIs, with all of the predicted labels per image.
Figure 2: Results for each of the metrics for the evaluated APIs, with the top five predicted labels per image.
Figure 3: Results for each of the metrics for the evaluated APIs, with the top three predicted labels per image.
Figure 4: Results for each of the metrics for the evaluated APIs, with the top predicted label per image.


This work was supported by grants from the MAGNET program of the Israeli Innovation Authority and the MAFAT program of the Israeli Ministry of Defense.


  • [1] “Large Scale Visual Recognition Challenge (ILSVRC),” 2018. [Online]. Available:
  • [2] G. Tsoumakas and I. Katakis, “Multi-label classification: An overview,” Dept. of Informatics, Aristotle University of Thessaloniki, Greece, 2006.
  • [3] G. Tsoumakas, I. Katakis, and I. Vlahavas, “Mining multi-label data,” Data mining and knowledge discovery handbook, pp. 667–685, 2009.
  • [4] G. Madjarov, D. Kocev, D. Gjorgjevikj, and S. Džeroski, “An extensive experimental comparison of methods for multi-label learning,” Pattern Recognition, vol. 45, no. 9, pp. 3084–3104, 2012.
  • [5] G. Nasierding and A. Z. Kouzani, “Comparative evaluation of multi-label classification methods,” in Fuzzy Systems and Knowledge Discovery (FSKD), 2012 9th International Conference on.    IEEE, 2012, pp. 679–683.
  • [6] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
  • [7] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779–788.
  • [8] D. Ciresan, U. Meier, and J. Schmidhuber, “Multi-column deep neural networks for image classification,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on.    IEEE, 2012, pp. 3642–3649.
  • [9] J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. Mooney, “Integrating language and vision to generate natural language descriptions of videos in the wild,” in Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, 2014, pp. 1218–1227.
  • [10] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3156–3164.
  • [11] K. Tran, X. He, L. Zhang, J. Sun, C. Carapcea, C. Thrasher, C. Buehler, and C. Sienkiewicz, “Rich image captioning in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, pp. 49–56.
  • [12] C. Szegedy, W. Liu, Y. Jia, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
  • [14] Y. Huang, W. Wang, L. Wang, and T. Tan, “Multi-task deep neural network for multi-label learning,” in Image Processing (ICIP), 2013 20th IEEE International Conference on. IEEE, 2013, pp. 2897–2900.
  • [15] C.-K. Yeh, W.-C. Wu, W.-J. Ko, and Y.-C. F. Wang, “Learning deep latent space for multi-label classification,” in AAAI, 2017, pp. 2838–2844.
  • [16] J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang, and W. Xu, “CNN-RNN: A unified framework for multi-label image classification,” in Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on. IEEE, 2016, pp. 2285–2294.
  • [17] Imagga, “Imagga website,” 2018. [Online]. Available:
  • [18] IBM, “Visual Recognition,” 2018. [Online]. Available:
  • [19] Clarifai, “Clarifai website,” 2018. [Online]. Available:
  • [20] Microsoft, “Computer-vision API website,” 2018. [Online]. Available:
  • [21] Wolfram, “Wolfram Alpha: Image Identification Project,” 2018. [Online]. Available:
  • [22] Google, “Google Cloud Vision,” 2018. [Online]. Available:
  • [23] Berkeley AI Research, “Caffe,” 2018. [Online]. Available:
  • [24] DeepDetect, “DeepDetect,” 2018. [Online]. Available:
  • [25] CILVR Lab @ NYU, “OverFeat: Object Recognizer, Feature Extractor,” 2018. [Online]. Available:
  • [26] Google, “TensorFlow: ImageNet classifier,” 2018. [Online]. Available:
  • [27] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. S. Bernstein, and L. Fei-Fei, “Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations,” International Journal of Computer Vision, vol. 123, no. 1, pp. 32–73, 5 2017. [Online]. Available:
  • [28] M. S. Sorower, “A literature survey on algorithms for multi-label learning,” Oregon State University, Corvallis, 2010.
  • [29] M. J. Kusner, Y. Sun, N. I. Kolkin, and K. Q. Weinberger, “From word embeddings to document distances,” in ICML, vol. 15, 2015, pp. 957–966.
  • [30] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [31] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [32] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 2014, pp. 675–678.
  • [33] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” arXiv preprint arXiv:1312.6229, 2013.
  • [34] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” arXiv preprint arXiv:1512.00567, 2015.
  • [35] Torch, “Torch.” [Online]. Available:
  • [36] R. E. Schapire and Y. Singer, “Improved boosting algorithms using confidence-rated predictions,” Machine Learning, vol. 37, no. 3, pp. 297–336, 1999.
  • [37] S. Godbole and S. Sarawagi, “Discriminative methods for multi-labeled classification,” Advances in Knowledge Discovery and Data Mining, pp. 22–30, 2004.
  • [38] Y. Yang, “An evaluation of statistical approaches to text categorization,” Information Retrieval, vol. 1, no. 1-2, pp. 69–90, 1999.
  • [39] O. Pele and M. Werman, “A linear time histogram metric for improved sift matching,” in European Conference on Computer Vision. Springer, 2008, pp. 495–508.
  • [40] ——, “Fast and robust earth mover’s distances,” in Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009, pp. 460–467.
  • [41] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
  • [42] J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global vectors for word representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), vol. 14, 2014, pp. 1532–1543. [Online]. Available:
  • [43] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3128–3137.
  • [44] Y. Shen, P.-S. Huang, J. Gao, and W. Chen, “Reasonet: Learning to stop reading in machine comprehension,” arXiv preprint arXiv:1609.05284, 2016.
  • [45] H. Amiri, P. Resnik, J. Boyd-Graber, and H. Daumé III, “Learning text pair similarity with context-sensitive autoencoders,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, vol. 1, 2016, pp. 1882–1892.
  • [46] G. A. Miller, “WordNet: a lexical database for English,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.