In a smart city [5, 28], multi-label scenes are very common, and accurately recognizing multiple labels is important. For example, by recognizing every traffic route and analyzing flows through monitors, a smart city is able to ease traffic jams. Recently, studies on multi-label image classification in smart cities have drawn the attention of researchers [25, 1].
Multi-label image classification seeks to recognize all possible objects/labels in a given image. Because of the dramatic development of deep learning and the availability of large-scale datasets such as ImageNet, there exist many studies on single-label image classification [27, 13]. However, real-world scenes usually contain multiple objects/labels. Unfortunately, multi-label image classification is more difficult than the single-label case because of its complicated structure and the internal label dependencies.
Recently, methods based on deep neural networks have become popular. On the one hand, due to the success of Convolutional Neural Networks (CNNs) in single-label image classification, a large number of methods directly apply CNNs to multi-label tasks [8, 17, 26, 30]. On the other hand, some researchers additionally leverage Recurrent Neural Networks (RNNs) to model the dependencies among labels [29, 16, 19]. However, all the aforementioned works analyze the whole image indiscriminately when building a multi-label image classification model, so useless and redundant information is taken into account equally. For example, blank or blurred backgrounds behind the key objects in an image are used in the learning process just as much as the objects themselves.
In this paper, we propose a global/local attention method for multi-label image classification that classifies images from coarse to fine. The model imitates how human beings observe a scene: they first observe the image with a global attention to find the areas that may contain objects, and then focus on these areas to decide which object is inside each one. The process is illustrated in Fig. 1. The global attention, which is generated from the final convolutional layer of the CNN, denotes a general attentive area, i.e., an overview of the image. Then, we generate a local attention at every step of the RNN, which denotes the specific attentive area for each predicted label. Additionally, we propose a joint max-margin objective function to separate the positive and negative predictions in the time domain, which effectively improves the performance. We evaluate our method on two popular multi-label image datasets, and the experimental results show that it outperforms other state-of-the-art methods.
II Related Work
II-A Multi-label image classification
Multi-label classification has wide applications in many areas, especially image classification, and much effort has been devoted to this task. Traditional methods fall into two categories, i.e., problem transformation [3, 23, 24] and algorithm adaptation [35, 4]. Recently, CNN-based methods have become popular in single-label image classification because of their strong capability in learning discriminative features. Some researchers have attempted to apply CNNs directly to multi-label image classification. Gong et al. [8] built a CNN architecture similar to [17] to tackle this problem and trained a CNN model with top-k ranking objectives. Wei et al. [30] fine-tuned a network pre-trained on ImageNet with the squared loss for multi-label image classification (I-FT). Some works employed an object detection framework to strengthen the performance of CNNs. For example, Wei et al. [30] provided a regional solution that allows labels to be predicted independently at the regional level (H-FT). Some approaches use RNNs to model label dependencies. Wang et al. [29] utilized CNNs to extract image features and RNNs to model correlations among labels. Another approach combined the image embedding with the output of the Long Short-Term Memory (LSTM) at every step and passed the combined vector to the final fully connected layer to predict the current label. Liu et al. [19] regularized the CNN with ground-truth semantic concepts and used its predictions to set the initial states of the LSTM. Although CNNs and RNNs have significantly improved the performance of multi-label classification, these methods always extract features from the whole image. As a result, much redundant information is considered equally when training the multi-label classification model. In fact, the relevant objects may occupy only small parts of an image. Some researchers have therefore started to leverage the attention mechanism to guide multi-label classification.
II-B Attention mechanism
The attention mechanism forces a learning model to focus on the relevant parts of the original data. Bahdanau et al. [2] proposed a model that searches a set of possible positions while generating the target word in neural machine translation. This mechanism was then applied to research that combines vision and language. Xu et al. [31] used hard and soft attention-based methods to generate image descriptions. You et al. [33] ran a set of attribute detectors to obtain a list of visual attributes and fused them into the RNN hidden state. Lu et al. [20] proposed a co-attention model that combines language information and image information for visual question answering. With the attention mechanism, the model learns by itself where to attend, which intuitively guides how it observes the data. However, few works have applied this mechanism to multi-label image classification. Zhu et al. [38] proposed to learn semantic and spatial relations jointly and to generate attention maps for all labels. Although their work computes attention for all labels, it may also introduce a large number of additional parameters.
In this paper, we argue that attention can also be learned from coarse to fine. Almost all existing attention-based methods analyze the whole image directly and cursorily, whereas we think attention should follow a progression. When coming across a complicated scene, we first look around in general and then search for specific objects one by one. Therefore, we propose a global/local attention method for multi-label image classification. The details of the proposed method are explained in Section III.
III-A Problem definition
Multi-label classification aims to predict all possible labels for an image. Given a set of images and their corresponding labels, our task is to learn a hypothesis that maps an input image to its output labels. For the i-th image, we denote the corresponding labels as a vector with one entry per possible label, where each entry indicates whether the image is labeled with the corresponding label or not.
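As a small illustration of the label representation described above, the following sketch encodes the labels of one image as a vector with one entry per possible label. The 1/0 encoding, the function name `encode_labels`, and the toy vocabulary are assumptions made for illustration only.

```python
def encode_labels(present, num_labels):
    """Return a multi-hot vector: 1 where a label is present, 0 elsewhere.

    The 1/0 convention is an assumption, not taken from the text.
    """
    vec = [0] * num_labels
    for j in present:
        vec[j] = 1
    return vec


# Hypothetical toy vocabulary of four labels.
vocab = ["person", "dog", "car", "bicycle"]
y = encode_labels([vocab.index("person"), vocab.index("car")], len(vocab))
```

Here `y` marks "person" and "car" as present and the other two labels as absent.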
III-B The framework of our model
Our overall model follows the encoder-decoder design pattern, which transforms data from one representation to another. In the proposed model, the encoder is a VGG-16 model [27], which has been shown to extract image features effectively. From the VGG-16 model, we extract two types of features for each image. The first type comes from the final convolutional layer; it presents the structural information of the image as a set of region features, one for each region in the feature map. The other type comes from the last fully-connected layer and carries higher-level information about the image. The decoder is an RNN model. In this paper, we use Long Short-Term Memory (LSTM) [14]. LSTM adds three extra gates to the vanilla RNN, i.e., the input gate, the forget gate and the output gate. Following [29], the multiple labels in multi-label classification can be regarded as a sequence, and the RNN decoder is used to recognize each specific object one by one.
III-C Visual attention mechanism
In this section, we describe our visual attention mechanism. Our model leverages two types of attention, i.e., the global attention and the local attention. The global attention highlights a more general attentive area, while the local attention highlights a more fine-grained one.
III-C1 Global attention
Our global attention is computed from the region features of the final convolutional layer. Each region is assigned a positive weight, a scalar representing the importance of that region. With these weights, we compute the expected aligned global context; the process is shown in Fig. 2(a).
Note that we use the sum of all weighted region features to compute the expected aligned context. This attentive context shows how the weights influence the feature maps. Thus, unlike traditional sequence learning, which zero-initializes the LSTM, our architecture initializes it with an average of the features.
The initialization of parameters is quite important. Yang et al. [32] argued that the attention mechanism lacks global modeling abilities in common sequential learning. Initializing the memory cell and the hidden state in this way helps the LSTM learn from the whole non-attentive feature maps and take a glance at the original image. Moreover, because the global context is attentive, our model can first indicate a general area that may contain meaningful objects.
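The weighted sum described above can be sketched as follows. This is a minimal illustration, not the exact formulation of the model: the scoring function (a dot product with a hypothetical vector `w`) and the softmax normalization are assumptions, chosen only because they yield positive weights that sum to one.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_attention(regions, w):
    """Weight each region feature and sum them into one context vector.

    regions: list of region feature vectors (lists of floats).
    w: hypothetical scoring vector (an assumption for this sketch).
    """
    scores = [sum(wi * ai for wi, ai in zip(w, a)) for a in regions]
    alpha = softmax(scores)  # positive weights, one per region
    dim = len(regions[0])
    context = [sum(al * a[d] for al, a in zip(alpha, regions)) for d in range(dim)]
    return alpha, context


# Three toy regions with two-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha, context = global_attention(regions, w=[2.0, 0.0])
```

In this toy run the first and third regions receive equal, larger weights than the second, so the context vector is dominated by their features.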
III-C2 Local attention
After the proposed model has observed an image in a general way, we expect it to focus on each specific object, as human beings do. Therefore, as shown in Fig. 2(b), our model computes a local attention at each step of the LSTM.
For all regions at step t, similar to the global attention, we use positive weights to decide which location is the right place to attend to for the next label. The weight of each region is computed from the region feature and the previous hidden state by a simple Multi-Layer Perceptron, which reflects the importance of the feature as well as the hidden state and influences the next state of the LSTM. Therefore, the LSTM is forced to pay more attention to the regions with larger weights. With these weights, we then compute the dynamic context at each step.
We treat the dynamic contexts as another special kind of feature and feed them into the LSTM as the next input. That means that at every step of the LSTM's recurrence, the model must take the possible area into account and overlook unimportant information. Consequently, following [34], the forward pass at step t takes the input label of step t, the last hidden state and the dynamic context as inputs, and all weights and biases of the LSTM are trainable.
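One step of the local attention described above might look like the following sketch. It covers only the attention weights and the dynamic context, not the LSTM update itself. The one-hidden-layer form of the Multi-Layer Perceptron, the softmax normalization, and all parameter names (`W`, `v`) are assumptions for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_score(a, h_prev, W, v):
    """Hypothetical one-hidden-layer MLP: v . tanh(W [a; h_prev])."""
    x = a + h_prev  # concatenate region feature and previous hidden state
    hidden = [math.tanh(sum(wk * xk for wk, xk in zip(row, x))) for row in W]
    return sum(vk * hk for vk, hk in zip(v, hidden))


def local_attention(regions, h_prev, W, v):
    """Return per-region weights and the dynamic context for one step."""
    scores = [mlp_score(a, h_prev, W, v) for a in regions]
    beta = softmax(scores)
    dim = len(regions[0])
    z = [sum(b * a[d] for b, a in zip(beta, regions)) for d in range(dim)]
    return beta, z


# Two toy regions, a one-dimensional hidden state, and made-up parameters.
regions = [[1.0, 0.0], [0.0, 1.0]]
beta, z = local_attention(regions, h_prev=[1.0], W=[[1.0, -1.0, 0.5]], v=[1.0])
```

Because the weights depend on the previous hidden state, calling `local_attention` with a different `h_prev` at each step gives a different attentive area for each predicted label.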
III-D Objective function
III-D1 Horizontal max-margin objective
We obtain a prediction at every step for the i-th image and therefore a set of predictions at the end of the sequence. The prediction at the t-th step is a vector whose length is the number of classes. We obtain the final prediction by max-pooling over the steps for each class: for the j-th class, the final prediction is the maximum of the step-wise predictions for that class.
To separate the positive and negative predictions, we assume that a max margin lies between the minimum positive prediction and the maximum negative prediction. The joint max-margin is pre-defined before training. As a result, we obtain a constraint on the prediction: the minimum positive prediction should exceed the maximum negative prediction by at least the margin.
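The max-pooling and the horizontal margin described above can be illustrated with a short sketch. The hinge form of the penalty, the 1/0 label encoding, and the function names are assumptions; only the max-pooling over steps and the use of the minimum positive and maximum negative predictions come from the text.

```python
def max_pool_over_steps(step_preds):
    """Pool step-wise predictions into one score per class.

    step_preds: list of per-step score lists, all of the same length.
    """
    return [max(column) for column in zip(*step_preds)]


def horizontal_margin_penalty(pooled, labels, margin):
    """Hinge-style penalty (an assumed form) on the pooled prediction.

    It is zero when the minimum positive score exceeds the maximum
    negative score by at least `margin`.
    """
    pos = [p for p, y in zip(pooled, labels) if y == 1]
    neg = [p for p, y in zip(pooled, labels) if y == 0]
    if not pos or not neg:
        return 0.0
    return max(0.0, margin - (min(pos) - max(neg)))


# Two steps, three classes; classes 0 and 2 are the positive labels.
step_preds = [[0.9, 0.1, 0.2], [0.3, 0.2, 0.7]]
labels = [1, 0, 1]
pooled = max_pool_over_steps(step_preds)
violated = horizontal_margin_penalty(pooled, labels, margin=0.6)
satisfied = horizontal_margin_penalty(pooled, labels, margin=0.4)
```

With these toy numbers the gap between the weakest positive and the strongest negative is about 0.5, so a margin of 0.6 is violated while a margin of 0.4 is satisfied.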
III-D2 Vertical max-margin objective
With only the horizontal max-margin objective, the distance between positive and negative labels becomes larger. However, at each step we only expect to predict one label, so the margin still exists even if the label is not predicted. Therefore, we propose another, vertical max-margin objective. The prediction list can be regarded as a matrix in which the t-th row presents the prediction at step t and the j-th column presents the j-th class. Thus, for each class, there is also a max margin between the minimum positive and the maximum negative prediction, both taken over the steps for that class. This gives the constraint in the vertical direction.
III-D3 Final objective
Although we make a prediction at every step of the RNN, we define the final prediction as the max-pooling of the predictions over all steps. Formally, given a training sample, we expect the model to produce this final prediction.
We construct the final objective function with the two max-margin terms, whose importance is controlled by two regularization parameters.
IV-A Datasets and Experimental Settings
We used two popular multi-label image datasets, i.e., the PASCAL Visual Object Classes Challenge (Pascal VOC) [7] and Microsoft COCO (MS-COCO) [18], to evaluate our method. Pascal VOC 2007 has 5,011 training examples and 4,952 testing examples of 20 classes. The MS-COCO dataset has 123,287 images (82,783 training and 40,504 validation examples) of 80 different classes.
For the proposed method, we used VGG-16 [27] as the backbone of the encoder CNN. The region features are extracted from the last convolutional layer, conv5_3, and the higher-level features are extracted from the last fully-connected layer, fc_7. The parameters of VGG-16 are pre-trained on ImageNet. Two regularization parameters determine the importance of the max-margin regularization terms. In our experimental results, “L” and “G” denote the model with local attention and global attention, respectively, and “MM” denotes the model with the joint max-margin objective. We used several common metrics (Precision [P], Recall [R], F1-Score [F], Hamming loss [H], Accuracy [A], One error [1-err], Coverage [C], Rank loss [rloss], Mean average precision [mAP]) to evaluate our method and the comparison methods. X@k denotes metric X computed on the top-k predictions. ↓ means that the lower the metric, the better the performance, while ↑ means the opposite.
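To make the top-k notation concrete, the sketch below computes precision, recall and F1-score on the top-k predictions of a single image. It is only an illustration: how the metrics are averaged over images or classes in the experiments is not specified here, and the function names and the 1/0 label encoding are assumptions.

```python
def top_k_labels(scores, k):
    """Indices of the k highest-scoring labels."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(order[:k])


def precision_recall_f1_at_k(scores, labels, k):
    """Per-image precision, recall and F1 on the top-k predictions."""
    predicted = top_k_labels(scores, k)
    truth = {j for j, y in enumerate(labels) if y == 1}
    hits = len(predicted & truth)
    precision = hits / k
    recall = hits / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Four labels; labels 0 and 3 are the ground truth.
scores = [0.9, 0.1, 0.8, 0.3]
labels = [1, 0, 0, 1]
p, r, f = precision_recall_f1_at_k(scores, labels, k=2)
```

In this toy case the top-2 predictions contain one of the two true labels, so precision, recall and F1 are all 0.5.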
IV-B Performance on Pascal VOC
We first evaluated our method on Pascal VOC 2007. The comparison with state-of-the-art methods is shown in Table I. The comparison methods include the following:
INRIA [12] combines object localization and image classification efficiently and improves both.
I-FT trains an AlexNet with a softmax loss.
HCP-1000C/2000C [30], based on I-FT, fine-tunes the network with multiple hypotheses and is augmented with 1000/2000 additional classes.
CNN-RNN [29] uses a CNN as an encoder and an RNN as a decoder, and predicts labels sequentially.
From Table I, we can see that our method outperforms these state-of-the-art methods. First, our VGG+LSTM+L, which leverages the local attention, matches the performance of HCP-2000C (85.2% mAP), although the latter additionally trains the model with 2000 extra classes. Second, with the global attention, our VGG+LSTM+L/G is better than VGG+LSTM+L and reaches 85.4% mAP. Finally, with the joint max-margin objective, our VGG+LSTM+L/G+MM achieves the best performance (85.6%), which shows that the joint max-margin objective effectively improves the classification.
IV-C Performance on MS-COCO
We then evaluated our method on MS-COCO; the experimental results are shown in Table II. First, our VGG+LSTM+L/G+MM is better than all other methods on most metrics; in terms of mAP, it reaches 64.64%, outperforming VGG+LSTM+L/G (64.07%). Second, Table II shows that VGG+LSTM+L/G+MM outperforms the other methods on most metrics in both top-k settings. Finally, the performance of VGG and VGG+LSTM+L is close, probably because MS-COCO is a large dataset in which the correlations among labels are not obvious. For example, the label “person” has a much higher frequency than other labels. When the current prediction is “person”, it is difficult to determine which label to predict in the next step. Further evidence is that VGG+LSTM performs worse than both VGG and VGG+LSTM+L.
IV-D Visualization of attention
We visualized the attentive areas for images from PASCAL VOC 2007 by up-sampling the attention weights and applying a Gaussian filter. The predictions and the corresponding attentive areas are shown in Fig. 3 and Fig. 4. Fig. 3 presents visualized results of the global and local attention, and Fig. 4 shows how the attention evolves every 10 epochs. From Fig. 3 and Fig. 4, we can see that when predicting the labels of an image, the model first observes the image in general (the attentive areas cover most of the image). Then, at each step of the RNN, the model focuses on smaller areas that may contain specific target objects. This is very similar to human perception: when people observe an image, they first glance at the whole image, then consider the relationships inside it, and finally focus their attention on specific objects.
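The visualization procedure (up-sampling followed by Gaussian smoothing) can be sketched in plain Python as below. The nearest-neighbour up-sampling, the border handling, and the values of `factor` and `sigma` are assumptions for illustration and are not the settings used for the figures.

```python
import math


def upsample(grid, factor):
    """Nearest-neighbour up-sampling of a 2-D list by an integer factor."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def gaussian_kernel(sigma, radius):
    """Normalized 1-D Gaussian kernel with 2 * radius + 1 taps."""
    weights = [math.exp(-(x * x) / (2.0 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(grid, kernel):
    """Convolve each row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(grid[0])
    return [
        [sum(k * row[min(max(x + d - radius, 0), width - 1)]
             for d, k in enumerate(kernel))
         for x in range(width)]
        for row in grid
    ]


def transpose(grid):
    return [list(col) for col in zip(*grid)]


def smooth_attention(weights, factor, sigma):
    """Up-sample an attention map, then blur it along rows and columns."""
    big = upsample(weights, factor)
    kernel = gaussian_kernel(sigma, radius=max(1, int(3 * sigma)))
    blurred = blur_rows(big, kernel)
    return transpose(blur_rows(transpose(blurred), kernel))


# A made-up 2x2 attention map with most weight in the top-left region.
attention = [[0.7, 0.1], [0.1, 0.1]]
heatmap = smooth_attention(attention, factor=4, sigma=1.0)
```

The resulting `heatmap` is an 8x8 grid that can be overlaid on the image; the sharp block boundaries of the up-sampled map are replaced by a smooth transition.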
In this paper, we proposed a novel model that uses a global/local attention mechanism for multi-label image classification. The model first focuses on a coarser area of an image, i.e., a global attention on the image. Then, with the guidance of the global attention, the model predicts the labels one by one with the local attention, which helps it focus on specific objects. Additionally, we proposed a joint max-margin objective that defines two max-margins, in the vertical and horizontal directions, respectively. Finally, we evaluated our method on two popular multi-label image datasets, i.e., Pascal VOC 2007 and MS-COCO. The experimental results showed the superiority of the proposed method.
[1] (2016) Activity recognition in multi-user environments using techniques of multi-label classification. In Proceedings of the 6th International Conference on the Internet of Things, pp. 15–23.
[2] (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
[3] (2004) Learning multi-label scene classification. Pattern Recognition 37 (9), pp. 1757–1771.
[4] (2001) Knowledge discovery in multi-label phenotype data. In European Conference on Principles of Data Mining and Knowledge Discovery, pp. 42–53.
[5] (2014) Smart and digital city: a systematic literature review. In Smart City, pp. 13–43.
[6] (2009) ImageNet: a large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255.
[7] (2010) The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision 88 (2), pp. 303–338.
[8] (2013) Deep convolutional ranking for multilabel image annotation. arXiv preprint arXiv:1312.4894.
[9] (2015) Incremental support vector learning for ordinal regression. IEEE Transactions on Neural Networks and Learning Systems 26 (7), pp. 1403–1416.
[10] (2017) A solution path algorithm for a general parametric quadratic programming problem. IEEE Transactions on Neural Networks and Learning Systems 28 (5), pp. 1241–1248.
[11] (2017) Structural minimax probability machine. IEEE Transactions on Neural Networks and Learning Systems 28 (7), pp. 1646–1656.
[12] (2009) Combining efficient object localization and image classification. In Computer Vision, 2009 IEEE 12th International Conference on, pp. 237–244.
[13] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[14] (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
[15] (1999) Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems, pp. 487–493.
[16] (2016) Annotation order matters: recurrent image annotator for arbitrary length image tagging. arXiv preprint arXiv:1604.05225.
[17] (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), pp. 1097–1105.
[18] (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
[19] (2016) Semantic regularisation for recurrent image annotation. arXiv preprint arXiv:1611.05490.
[20] (2016) Hierarchical question-image co-attention for visual question answering. In Advances in Neural Information Processing Systems, pp. 289–297.
[21] (2016) Fast reference frame selection based on content similarity for low complexity HEVC encoder. Journal of Visual Communication and Image Representation 40 (part B), pp. 516–524.
[22] (2010) Improving the Fisher kernel for large-scale image classification. Computer Vision–ECCV 2010, pp. 143–156.
[23] (2009) Classifier chains for multi-label classification. Machine Learning and Knowledge Discovery in Databases, pp. 254–269.
[24] (2011) Classifier chains for multi-label classification. Machine Learning 85 (3), pp. 333–359.
[25] (2017) Automatic multi-label image annotation for smart cities. In IEEE Region 10 Symposium (TENSYMP), 2017, pp. 1–4.
[26] (2014) CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 806–813.
[27] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[28] (2011) Smart city and the applications. In Electronics, Communications and Control (ICECC), 2011 International Conference on, pp. 1028–1031.
[29] (2016) CNN-RNN: a unified framework for multi-label image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2285–2294.
[30] (2014) CNN: single-label to multi-label. arXiv preprint arXiv:1406.5726.
[31] (2015) Show, attend and tell: neural image caption generation with visual attention. In International Conference on Machine Learning, pp. 2048–2057.
[32] (2016) Review networks for caption generation. In Advances in Neural Information Processing Systems, pp. 2361–2369.
[33] (2016) Image captioning with semantic attention. pp. 4651–4659.
[34] (2014) Learning to execute. arXiv preprint arXiv:1410.4615.
[35] (2007) ML-kNN: a lazy learning approach to multi-label learning. Pattern Recognition 40 (7), pp. 2038–2048.
[36] (2014) A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering 26 (8), pp. 1819–1837.
[37] (2017) Effective and efficient global context verification for image copy detection. IEEE Transactions on Information Forensics and Security 12 (1), pp. 48–63.
[38] (2017) Learning spatial regularization with image-level supervisions for multi-label image classification. arXiv preprint arXiv:1702.05891.