As an important problem in the computer vision community, multi-label image recognition has attracted considerable attention due to its wide applications, such as music emotion categorization, fashion attribute recognition, and human attribute recognition. Unlike conventional multi-class classification, which predicts only one class label for each image, multi-label image recognition needs to assign multiple labels to a single image. Its challenges come from the rich and diverse semantic information in images.
Conventional methods address the multi-label classification problem by i) transforming it into multiple binary classification tasks, ii) transforming it into a multivariate regression problem, or iii) adapting single-label classification algorithms. With the great success of deep Convolutional Neural Networks (CNNs) on single-label multi-class image classification, recent multi-label image classification methods are mainly based on CNNs with certain adaptations [37, 38, 33, 36, 46, 10, 4, 42, 6].
A popular direction in modern CNN-based multi-label classification is to model label dependencies, since objects usually co-occur in the physical world. For instance, ‘baseball’, ‘bat’ and ‘person’ often appear in the same image, but ‘baseball’ and ‘ocean’ rarely appear together. Wang et al. propose a CNN-RNN framework, which learns a joint image-label embedding to characterize the semantic label dependency. It shows that recurrent neural networks (RNNs) can capture higher-order label dependencies in a sequential fashion. However, this method ignores the explicit associations between semantic labels and image regions. Consequently, some works combine the attention mechanism with the CNN-RNN framework to explore the associations between labels and image regions [36, 46, 10, 4]. For example, Zhu et al. propose a Spatial Regularization Network, which generates class-related attention maps and captures both spatial and semantic label dependencies via learnable convolutions. These methods essentially learn local correlations from the attention regions of an image, which introduces limited complementary information. Chen et al. propose a multi-label GCN (ML-GCN) framework, which leverages Graph Convolutional Networks to capture global correlations between labels with extra knowledge from label statistical information. One drawback of ML-GCN is that the label correlation graph is manually designed and needs careful adaptation. This hand-crafted correlation graph makes ML-GCN inflexible and may be sub-optimal for multi-label classification.
In this paper, we propose a unified multi-label GCN framework, termed A-GCN, to address the inflexible correlation graph problem in ML-GCN. The key of A-GCN is that it learns an Adaptive label correlation graph to model label dependencies in an end-to-end manner. Specifically, we introduce a plug-and-play adaptive Label Graph (LG) module to learn label correlations from word embeddings, then utilize a traditional GCN to map this graph into label-dependent object classifiers, and finally apply these classifiers to image features. By default, we implement the LG module with two convolutional layers and use a dot product to generate label graphs. As label co-occurrence is sparse in current popular multi-label datasets, we also introduce a sparse correlation constraint to enhance the LG module by using an L1-norm loss between the learned correlation graph and an identity matrix. Furthermore, we explore three alternative architectures to evaluate the LG module. We validate our method on two diverse multi-label datasets: MS-COCO and Fashion550K. Experimental results show that our A-GCN significantly improves baseline methods and achieves performance superior or comparable to the state of the art.
Our work is mainly related to multi-label image recognition and graph neural networks. In this section, we first present related work on multi-label image recognition, and then on graph neural networks.
Multi-label Image Recognition
Remarkable developments in image recognition have been observed over the past few years due to the availability of large-scale hand-labeled datasets like ImageNet and MS-COCO. Recent progress on single-label image classification is based on deep convolutional neural networks (CNNs) [18, 26, 15] that learn powerful visual representations by stacking multiple nonlinear transformations. A simple approach is to adapt these single-label classification networks to multi-label image recognition, which has achieved good results [24, 33, 36, 5].
Early works on multi-label image recognition utilize hand-crafted image features and linear models to solve this problem [30, 27, 1, 3]. A well-known example is to decompose the multi-label image recognition problem into multiple binary classification tasks, i.e. to train a set of independent linear classifiers, one for each label. Zhang et al. propose a multi-label lazy learning approach named ML-KNN, which uses k-nearest neighbors to predict labels for unseen data from training data. Tai et al. design a Principal Label Space Transformation (PLST) algorithm, which seeks important correlations between labels before learning. Chen et al. introduce a hierarchical matching framework with so-called side information for image classification based on the bag-of-words model. Although these methods may perform well on simple benchmarks, they cannot generalize as well as deep learning-based methods on input images with complex scenes and multiple objects.
CNN-based approaches continue to attract attention in multi-label image recognition [3, 33, 36]. Among the earliest applications of deep learning to multi-label classification is the work of Gong et al., who propose to combine convolutional architectures with an approximate top-k ranking objective function for annotating multi-label images. Instead of extracting off-the-shelf deep features, Chatfield et al. fine-tune the network on the target multi-label dataset to learn more domain-specific features and boost classification performance. Wu et al. propose an approach named weakly semi-supervised deep learning for multi-label image annotation, which uses a triplet loss function to draw images with similar label sets closer together. To better model the correlations between labels instead of treating each label independently, various approaches have been explored in recent works. One popular trend uses graph models to build the label co-occurrence dependency, such as Conditional Random Field, Dependency Network, and Co-occurrence Matrix. To exploit the label co-occurrence dependency together with CNN models, another group of researchers applies low-dimensional recurrent neurons in RNN models to efficiently abstract high-order label correlations. For example, Wang et al. combine RNNs with a CNN to learn a joint image-label embedding that characterizes the semantic label dependency as well as the image-label relevance. Wang et al. introduce a spatial transformer layer and long short-term memory (LSTM) units to capture the label correlation. Lee et al. propose a framework that incorporates knowledge graphs to describe the relationships between multiple labels, and learn representations of this graph to enhance image features for multi-label recognition.
Graph Convolutional Neural Networks
Generalization of graph convolutional neural networks (GCNNs) has drawn great attention in recent years. There are two typical types of GCNNs: spatial and spectral. The first type applies feed-forward neural networks to every node. For example, Marino et al. successfully apply GCNNs to multi-label image classification to exploit explicit semantic relations in the form of structured knowledge graphs. Wang et al. propose to represent videos as space-time region graphs, which capture similarity relationships and spatial-temporal relationships. Wang et al. propose a spatial-based GCN to solve the link prediction problem. The second type provides well-defined localization operators on graphs via convolutions in the Fourier domain. In recent years, an important branch of spectral GCNNs has been proposed to tackle graph-structured data. The outputs of spectral GCNNs are updated features for each object node, leading to strong performance on tasks related to graph-based information processing. More specifically, Kipf et al. apply GCNNs to semi-supervised classification. Hamilton et al. leverage GCNs to learn feature representations. Chen et al. propose a GCN-based model (ML-GCN) to learn the label correlations for multi-label image recognition. It utilizes the GCN to learn object classifiers by mining label co-occurrence patterns within the dataset. Motivated by ML-GCN, our work leverages the graph structure to capture and explore an adaptive label correlation graph. With the proposed A-GCN, we overcome the limitation caused by the manually designed graph and automatically learn the label correlation with an LG module. We also demonstrate that our A-GCN is an effective model for label dependency and can be trained in an end-to-end manner.
To efficiently exploit label dependencies and make the GCN flexible, we propose A-GCN to learn label correlations for GCN-based multi-label image classification. In this section, we first present some notations to define the problem, then introduce the basic GCN-based multi-label classification, and finally present our A-GCN and several alternative label graph architectures.
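Notations. Let $\mathcal{D} = \{(I_i, \mathbf{y}_i)\}_{i=1}^{N}$ be the training data, where $I_i$ is the $i$-th image and $\mathbf{y}_i \in \{0, 1\}^{C}$ is the corresponding multi-hot label vector. Zeros or ones in the label vector denote the absence or presence of the corresponding category in the image. Let $\mathbf{x}_i = f(I_i; \theta) \in \mathbb{R}^{d}$ denote the CNN feature of $I_i$, with $f$ a CNN model with parameters $\theta$. Assume we have $C$ object classifiers $W = [\mathbf{w}_1, \dots, \mathbf{w}_C]^{\top} \in \mathbb{R}^{C \times d}$; then the predicted logit scores of feature $\mathbf{x}_i$ can be defined as,
$$\hat{\mathbf{y}}_i = W \mathbf{x}_i.$$
The CNN model and classifiers can be optimized by the following multi-label classification loss,
$$\mathcal{L}_{cls} = -\sum_{c=1}^{C} \left[ y_i^{c} \log \sigma(\hat{y}_i^{c}) + (1 - y_i^{c}) \log\left(1 - \sigma(\hat{y}_i^{c})\right) \right],$$
where $\sigma(\cdot)$ is the sigmoid function.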
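As a minimal sketch of the setup described above, the following pure-Python snippet computes one logit per label from a set of linear classifiers and evaluates a sigmoid cross-entropy loss for a single image. The function names, the toy values, and the choice to sum (rather than average) over labels are illustrative assumptions, not details confirmed by the paper.

```python
import math


def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_logits(classifiers, feature):
    """One logit per label: dot product of each classifier with the image feature."""
    return [sum(w * x for w, x in zip(w_c, feature)) for w_c in classifiers]


def multilabel_bce(logits, labels):
    """Binary cross-entropy summed over labels (summation is an assumption)."""
    loss = 0.0
    for s, y in zip(logits, labels):
        p = sigmoid(s)
        loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return loss


# Toy example: 3 labels, 2-dimensional feature (hypothetical values).
classifiers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
feature = [2.0, -1.0]
labels = [1, 0, 1]

logits = predict_logits(classifiers, feature)
loss = multilabel_bce(logits, labels)
```

Each label contributes an independent binary term, which is why the loss by itself carries no information about label co-occurrence.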
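Multi-label classification with GCN. We revisit the ML-GCN pipeline for multi-label classification in the following. It performs GCN on the word embeddings of the $C$ labels, and learns inter-dependent object classifiers to improve performance. The purpose of GCN is to learn a function on a graph $\mathcal{G}$, which takes the previous feature descriptions $H^{l} \in \mathbb{R}^{C \times d_l}$ and the correlation matrix $A \in \mathbb{R}^{C \times C}$, and outputs learned node features $H^{l+1} \in \mathbb{R}^{C \times d_{l+1}}$. One GCN layer can be formulated as,
$$H^{l+1} = h\left(\hat{A} H^{l} W^{l}\right),$$
where $W^{l} \in \mathbb{R}^{d_l \times d_{l+1}}$ is a transformation matrix to be learned, $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ is the normalized version of $A$ with $\tilde{A} = A + I_C$ and $\tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}$, $I_C$ is an identity matrix, and $h(\cdot)$ is an activation function, which is set as LeakyReLU following ML-GCN. The input of the first layer is the label word embeddings $E \in \mathbb{R}^{C \times d_e}$ and the output of the last layer is $W \in \mathbb{R}^{C \times d}$, i.e. the inter-dependent classifiers.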
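To make the layer update concrete, the snippet below sketches one GCN layer in pure Python. It assumes the symmetric normalization with self-loops popularized by Kipf et al. and a LeakyReLU negative slope of 0.2; both the normalization details and the slope value are assumptions here, and all helper names are hypothetical.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def normalize_adjacency(adj):
    """Compute D^{-1/2} (A + I) D^{-1/2}, where D holds the row sums of A + I."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def leaky_relu(x, slope=0.2):
    """LeakyReLU; the slope of 0.2 is an assumed value."""
    return x if x > 0 else slope * x


def gcn_layer(adj_hat, h, weight, slope=0.2):
    """One layer: activation(A_hat @ H @ W)."""
    out = matmul(matmul(adj_hat, h), weight)
    return [[leaky_relu(v, slope) for v in row] for row in out]


# Toy graph with 2 labels that are correlated with each other.
adj = [[0.0, 1.0], [1.0, 0.0]]
a_hat = normalize_adjacency(adj)           # every entry becomes 0.5
h = [[1.0, -3.0], [3.0, 1.0]]              # one row of node features per label
weight = [[1.0, 0.0], [0.0, 1.0]]          # identity, to isolate the aggregation step
h_next = gcn_layer(a_hat, h, weight)
```

With the identity weight, each output row is simply the average of the two input rows passed through the activation, which illustrates how node features are mixed along graph edges.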
Following the pipeline of ML-GCN, we propose A-GCN to address the generation of the label correlation matrix $A$. Figure 1 depicts the framework of A-GCN. It mainly consists of two branches. The upper branch is a traditional CNN for image feature learning, and the bottom branch is a GCN model that generates inter-dependent classifiers.
The key difference between our A-GCN and ML-GCN is how the correlation matrix $A$ is constructed. We argue that building the correlation matrix by counting the occurrences of label pairs and thresholding is inflexible and may be sub-optimal for multi-label classification. To address this problem, we propose an adaptive label graph (i.e. correlation matrix) module to learn label correlations in an end-to-end manner.
Adaptive label graph (LG) module. As shown in Figure 1, the adaptive LG module is comprised of two convolutional layers and a dot product operation. The LG module takes as input the word embeddings of labels $E$ and outputs a learned label correlation matrix $A$. Formally, the learned $A$ can be written as,
$$A = (W_{1} * E)(W_{2} * E)^{\top},$$
where $W_{1}$ and $W_{2}$ are the convolutional kernels to be learned, and $*$ denotes the convolutional operation.
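One possible reading of the LG module can be sketched as follows: two learned projections are applied to the label embeddings, and the correlation between labels i and j is the dot product of the two projected vectors. The sketch assumes kernel-size-1 convolutions (so each convolution reduces to the same linear map applied to every label embedding) and omits any normalization or activation; these choices and the function names are assumptions rather than details stated in the text.

```python
def project(embeddings, kernel):
    """Apply the same linear map to every label embedding (a kernel-size-1 convolution)."""
    return [[sum(e[k] * kernel[k][j] for k in range(len(e))) for j in range(len(kernel[0]))]
            for e in embeddings]


def label_graph(embeddings, kernel_a, kernel_b):
    """Correlation matrix with A[i][j] = <proj_a(e_i), proj_b(e_j)>."""
    proj_a = project(embeddings, kernel_a)
    proj_b = project(embeddings, kernel_b)
    return [[sum(x * y for x, y in zip(u, v)) for v in proj_b] for u in proj_a]


# Toy example: 3 labels with 2-dimensional embeddings (hypothetical values).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kernel_a = [[1.0, 0.0], [0.0, 1.0]]
kernel_b = [[2.0, 0.0], [0.0, 1.0]]
graph = label_graph(embeddings, kernel_a, kernel_b)
```

Because the two projections are learned separately, the resulting matrix is not constrained to be symmetric in general, unlike a co-occurrence count matrix.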
Sparse correlation constraint. For each node of a graph, GCN gradually aggregates information from its own feature and the adjacent nodes' features. The features can become indistinguishable through over-smoothing if the learned $A$ becomes uniform. A uniform $A$ denotes dense correlations among different labels. To avoid this issue, we enforce a sparse correlation constraint on $A$ by an L1-norm loss between $A$ and an identity matrix $I_C$ as follows,
$$\mathcal{L}_{A} = \lVert A - I_C \rVert_{1}.$$
This constraint encourages high self-correlation weights to avoid over-smoothed features in GCN. Our total loss is $\mathcal{L} = \mathcal{L}_{cls} + \lambda \mathcal{L}_{A}$, where $\mathcal{L}_{cls}$ is the multi-label classification loss and $\lambda$ is a trade-off weight, set to 1.0 by default in our experiments.
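The constraint and the combined objective can be illustrated with a short sketch. It assumes the L1 loss is the plain sum of absolute differences between the learned graph and the identity matrix (whether the sum is averaged is not stated), and the function names are hypothetical.

```python
def sparse_constraint(adj):
    """Sum of absolute differences between the learned graph and the identity matrix."""
    n = len(adj)
    return sum(abs(adj[i][j] - (1.0 if i == j else 0.0))
               for i in range(n) for j in range(n))


def total_loss(cls_loss, adj, trade_off=1.0):
    """Classification loss plus the weighted sparse correlation constraint."""
    return cls_loss + trade_off * sparse_constraint(adj)


# Toy 2-label graph with strong self-correlation and weak cross-correlation.
adj = [[0.9, 0.2], [0.1, 1.0]]
penalty = sparse_constraint(adj)          # 0.1 + 0.2 + 0.1 + 0.0
combined = total_loss(0.5, adj)           # default trade-off weight of 1.0
```

The penalty is zero only when the graph equals the identity, so off-diagonal correlations survive only if they reduce the classification loss by more than they cost.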
Alternative LG architectures. As illustrated in Figure 2, we propose three alternative LG architectures, namely i) pair-wise cosine similarity (abbreviated as Cos-A), ii) linear transformation of the label embeddings $E$ by a fully-connected layer (FC-A), and iii) linear transformation of $E$ with a dot product (Dot-A).
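Cos-A simply computes the cosine similarities between label embeddings, which generates a symmetric correlation matrix. Each element in $A$ is defined by,
$$A_{ij} = \frac{\mathbf{e}_i^{\top} \mathbf{e}_j}{\lVert \mathbf{e}_i \rVert \, \lVert \mathbf{e}_j \rVert},$$
where $\mathbf{e}_i$ is the word embedding of the $i$-th label.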
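FC-A directly utilizes a linear layer to generate the correlation matrix as,
$$A = E \, W_{fc},$$
where $W_{fc}$ is the weight of the fully-connected layer.
Dot-A first applies a convolutional layer to $E$, and then computes the self-correlation matrix with a dot product as,
$$A = (W_{d} * E)(W_{d} * E)^{\top},$$
where $W_{d}$ is the convolutional kernel to be learned.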
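Of the three alternatives, Cos-A has no learnable parameters and is the simplest to illustrate. The sketch below computes pair-wise cosine similarities between label embeddings in pure Python; the function name and toy embeddings are hypothetical.

```python
import math


def cosine_graph(embeddings):
    """Symmetric correlation matrix of pair-wise cosine similarities."""
    norms = [math.sqrt(sum(v * v for v in e)) for e in embeddings]
    return [[sum(a * b for a, b in zip(e_i, e_j)) / (n_i * n_j)
             for e_j, n_j in zip(embeddings, norms)]
            for e_i, n_i in zip(embeddings, norms)]


# Toy example: 3 labels with 2-dimensional embeddings.
embeddings = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
cos_graph = cosine_graph(embeddings)
```

The diagonal is always 1 and the matrix is symmetric by construction, which is the property noted for Cos-A above.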
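| Method | mAP | CP | CR | CF1 | OP | OR | OF1 | CP (top-3) | CR (top-3) | CF1 (top-3) | OP (top-3) | OR (top-3) | OF1 (top-3) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Our baseline (ResNet101) | 80.3 | 77.8 | 72.8 | 75.2 | 81.5 | 75.1 | 78.2 | 82.5 | 64.6 | 72.4 | 87.3 | 65.7 | 75.0 |
| A-GCN (w/o sparse constraint) | 82.78 | 83.04 | 72.87 | 77.63 | 84.45 | 75.75 | 79.87 | 87.48 | 64.73 | 74.4 | 89.55 | 66.54 | 76.35 |
| Cos-A (w/ sparse constraint) | 82.77 | 84.89 | 71.67 | 77.72 | 85.77 | 74.83 | 79.93 | 88.92 | 64.03 | 74.45 | 90.24 | 66.2 | 76.37 |
| FC-A (w/ sparse constraint) | 82.85 | 83.65 | 72.45 | 77.65 | 84.99 | 75.56 | 80.0 | 88.29 | 64.23 | 74.37 | 89.95 | 66.3 | 76.34 |
| Dot-A (w/ sparse constraint) | 82.22 | 84.64 | 70.93 | 77.18 | 85.86 | 74.65 | 79.87 | 88.74 | 63.19 | 73.82 | 90.37 | 65.93 | 76.24 |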
Training. We illustrate the training process of A-GCN in Algorithm 1. We train A-GCN in an end-to-end manner with two branches. Branch 1 extracts image features and updates the image-level CNN parameters. Branch 2 learns the adaptive label correlation graph and the GCN model to generate label-dependent classifiers. The total loss is the combination of the sparse correlation constraint and the multi-label classification loss.
In this section, we evaluate the proposed A-GCN and compare it to state-of-the-art methods on two public multi-label benchmark datasets: MS-COCO and Fashion550K. We first present the implementation details and metrics, then extensively explore our A-GCN on MS-COCO, and finally apply A-GCN to Fashion550K.
Implementations and evaluation metrics
We implement our method with PyTorch. For data augmentation, we resize images to 512×512 on MS-COCO (256×256 on Fashion550K), and randomly crop regions of 448×448 (224×224 on Fashion550K) with random flipping. For testing, we resize images to 448×448 (224×224). For fair comparison, we use ResNet-101 on MS-COCO and ResNet-50 on Fashion550K, both pre-trained on ImageNet. We use the SGD method for optimization with a momentum of 0.9 and a weight decay of . We set the minibatch size as 32, the initial learning rate () as . We divide the learning rate by 10 after every 30 epochs, and stop training after 65 epochs. The word embedding method and other hyper-parameters of GCN are kept consistent with ML-GCN.
For the MS-COCO dataset, we use the same evaluation metrics as ML-GCN, i.e. the mean average precision (mAP), overall precision (OP), recall (OR), F1 (OF1), and average per-class precision (CP), recall (CR), F1 (CF1). For each image, a label is predicted as positive if its confidence is greater than 0.5. Among all these metrics, mAP is generally regarded as the most important one. For fair comparisons, we also report the results of top-3 labels. On Fashion550K, we also use mAP and the class-agnostic average precision ($AP_{all}$) to evaluate the performance, for consistency with Inoue et al.
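The thresholded metrics can be sketched as follows. The snippet assumes the usual definitions: per-class precision and recall are averaged over classes (CP, CR), overall precision and recall pool the counts over all classes (OP, OR), and each F1 is the harmonic mean of its precision and recall. Treating classes with no predictions or no positives as zero is an assumption, and the function names are hypothetical.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 2.0 * precision * recall / total if total > 0 else 0.0


def multilabel_metrics(scores, targets, threshold=0.5):
    """Return CP, CR, CF1, OP, OR, OF1 for predictions thresholded at `threshold`."""
    num_classes = len(targets[0])
    correct = [0] * num_classes    # true positives per class
    predicted = [0] * num_classes  # predicted positives per class
    actual = [0] * num_classes     # ground-truth positives per class
    for score_row, target_row in zip(scores, targets):
        for c in range(num_classes):
            is_pos = score_row[c] > threshold
            predicted[c] += int(is_pos)
            actual[c] += target_row[c]
            if is_pos and target_row[c] == 1:
                correct[c] += 1
    # Per-class metrics: average the per-class ratios (empty classes count as 0).
    cp = sum(correct[c] / predicted[c] if predicted[c] else 0.0
             for c in range(num_classes)) / num_classes
    cr = sum(correct[c] / actual[c] if actual[c] else 0.0
             for c in range(num_classes)) / num_classes
    # Overall metrics: pool counts over all classes before dividing.
    op = sum(correct) / sum(predicted) if sum(predicted) else 0.0
    overall_recall = sum(correct) / sum(actual) if sum(actual) else 0.0
    return {"CP": cp, "CR": cr, "CF1": f1(cp, cr),
            "OP": op, "OR": overall_recall, "OF1": f1(op, overall_recall)}


# Toy example: 3 images, 3 labels (hypothetical confidences and ground truth).
scores = [[0.9, 0.2, 0.7], [0.6, 0.8, 0.1], [0.3, 0.9, 0.4]]
targets = [[1, 0, 1], [0, 1, 0], [0, 1, 1]]
metrics = multilabel_metrics(scores, targets)
```

As a sanity check on these definitions, `f1(77.8, 72.8)` rounds to 75.2, which agrees with the CP, CR, and CF1 values in the baseline row of Table 1.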
Exploration on MS-COCO
MS-COCO is one of the most popular multi-label image datasets, and consists of 80 categories with 82,081 training images and 40,137 test images. We compare our A-GCN to several state-of-the-art methods including CNN-RNN, RNN-Attention, Order-Free RNN, ML-ZSL, SRN, Multi-Evidence, and ML-GCN. The results are presented in Table 1. Our A-GCN significantly improves the baseline (ResNet101) in most of the metrics. Specifically, A-GCN improves the mAP of the baseline from 80.3% to 83.1%. In addition, our A-GCN slightly outperforms the most related method, ML-GCN, in mAP. Compared to ML-GCN, our A-GCN, with a small extra LG module, is more flexible since it does not require an elaborately designed correlation matrix.
Evaluation of the sparse correlation constraint and LG architectures. We evaluate the effect of the sparse correlation constraint and different label graph architectures in the last four rows of Table 1. Several observations can be made. First, without the sparse correlation constraint we obtain slightly worse results than the default A-GCN, which indicates the effectiveness of the sparse constraint. Second, all alternative LG architectures clearly improve over the baseline, which suggests that all of them learn label correlation information effectively. Third, FC-A, which only differs from the default A-GCN by replacing convolution with one FC layer, shows the best results among the alternatives. Compared to the default A-GCN, Dot-A shows an obvious degradation.
Evaluation of $\lambda$. The trade-off weight $\lambda$ indicates the contribution of the sparse correlation constraint to the whole loss value. Intuitively, this regularization should not have a large weight. Figure 3 shows the evaluation of $\lambda$ on MS-COCO. Increasing $\lambda$ from 0 to 1 slightly boosts performance, while larger $\lambda$ leads to degradation and even divergence ( in our test).
Visualization. To further investigate the effect of our A-GCN, we show the per-class improvement (or degradation) from A-GCN on MS-COCO and Fashion550K in Figure 4. It shows that objects (mainly daily necessities) whose presence usually depends on the objects or scenes they co-occur with are likely to have large gains, e.g. spoon, backpack, book, and toothbrush in image (a), or glasses, sneakers, and sweatshirts in image (c). It suggests that our A-GCN leverages the graph module to automatically learn object co-occurrence relations, which can effectively improve multi-label recognition performance.
Performance on Fashion550K
Fashion550K is a multi-label fashion dataset which contains 66 unique weakly-annotated tags with 407,772 images in total. Among all the images, 3,000 images are manually verified for training (i.e. the clean set), 300 images for validation, and 2,000 images for testing. The remaining images are used as noisy-labeled data, i.e. the noisy set. We report performance on the test set following the common setting.
We compare our default A-GCN to several well-known state-of-the-art methods on Fashion550K, including StyleNet, the baseline and the method proposed by Inoue et al., the method proposed by Veit et al., and our re-implementation of ML-GCN (Re-weighted). For fair comparison, we also use two training configurations, namely i) training on the noisy set and ii) further fine-tuning on the clean set (i.e. noisy+clean). The comparison is presented in Table 2. Our A-GCN improves our baseline by 2.76% and 3.4% in mAP under the two training settings, respectively. This also demonstrates that label correlation information is helpful for multi-label fashion image classification.
| Method | Training data | $AP_{all}$ | mAP |
|---|---|---|---|
| Veit et al. | noisy+clean | 78.92 | 63.08 |
| Inoue et al. | noisy+clean | 79.87 | 64.62 |
In this paper, we proposed a simple and flexible A-GCN framework for multi-label image recognition. The A-GCN leverages a plug-and-play label graph module to automatically construct the label correlation matrix for GCN on the label embeddings. We designed a sparse correlation constraint on the learned correlation matrix to avoid over-smoothing on the features. We also explored several alternative label graph modules to demonstrate the effectiveness of our A-GCN. Extensive experiments on MS-COCO and Fashion550K show that our A-GCN achieves superior performance to several state-of-the-art methods.
- (2014) Matrix completion for weakly-supervised multi-label image classification. IEEE transactions on pattern analysis and machine intelligence 37 (1), pp. 121–135.
- (2014) Return of the devil in the details: delving deep into convolutional nets. arXiv preprint arXiv:1405.3531.
- (2012) Hierarchical matching with side information for image classification. In CVPR, pp. 3426–3433.
- (2018) Order-free rnn with visual attention for multi-label classification. In Thirty-Second AAAI Conference on Artificial Intelligence.
- (2018) Recurrent attentional reinforcement learning for multi-label image recognition. In Thirty-Second AAAI Conference on Artificial Intelligence.
- (2019) Multi-label image recognition with graph convolutional networks. In CVPR, pp. 5177–5186.
- (2009) Combining instance-based learning and logistic regression for multilabel classification. Machine Learning 76 (2-3), pp. 211–225.
- (2001) Knowledge discovery in multi-label phenotype data. In European Conference on Principles of Data Mining and Knowledge Discovery, pp. 42–53.
- (2009) Imagenet: a large-scale hierarchical image database. In CVPR, pp. 248–255.
- (2018) Multi-evidence filtering and fusion for multi-label classification, object detection and semantic segmentation based on weakly supervised learning. In CVPR, pp. 1277–1286.
- (2005) Collective multi-label classification. In Proceedings of the 14th ACM international conference on Information and knowledge management, pp. 195–200.
- (2013) Deep convolutional ranking for multilabel image annotation. arXiv preprint arXiv:1312.4894.
- (2011) Multi-label classification using conditional dependency networks. In Twenty-Second International Joint Conference on Artificial Intelligence.
- (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034.
- (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
- (2017) Multi-label fashion image classification with minimal human supervision. In CVPR, pp. 2261–2267.
- (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
- (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105.
- (2018) Multi-label zero-shot learning with structured knowledge graphs. In CVPR, pp. 1576–1585.
- (2016) Human attribute recognition by deep hierarchical contexts. In ECCV, pp. 684–700.
- (2014) Microsoft coco: common objects in context. In ECCV, pp. 740–755.
- (2016) The more you know: using knowledge graphs for image classification. arXiv preprint arXiv:1612.04844.
- (2008) The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80.
- (2014) CNN features off-the-shelf: an astounding baseline for recognition. In CVPR workshops, pp. 806–813.
- (2016) Fashion style in 128 floats: joint ranking and classification using weak data for feature extraction. In CVPR, pp. 298–307.
- (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- (2012) Multilabel classification with principal label space transformation. Neural Computation 24 (9), pp. 2508–2542.
- (2017) Modeling label dependence for multi-label classification using the choquistic regression. Pattern Recognition Letters 92, pp. 75–80.
- (2008) Multi-label classification of music into emotions. In ISMIR, Vol. 8, pp. 325–330.
- (2009) Mining multi-label data. In Data mining and knowledge discovery handbook, pp. 667–685.
- (2007) Multi-label classification: an overview. IJDWM 3 (3), pp. 1–13.
- (2017) Learning from noisy large-scale datasets with minimal supervision. In CVPR, pp. 839–847.
- (2016) Cnn-rnn: a unified framework for multi-label image classification. In CVPR, pp. 2285–2294.
- (2018) Videos as space-time region graphs. In ECCV, pp. 399–417.
- (2019) Linkage based face clustering via graph convolution network. In CVPR, pp. 1117–1125.
- (2017) Multi-label image recognition by recurrently discovering attentional regions. In CVPR, pp. 464–472.
- (2014) Cnn: single-label to multi-label. arXiv preprint arXiv:1406.5726.
- (2015) HCP: a flexible cnn framework for multi-label image classification. IEEE transactions on pattern analysis and machine intelligence 38 (9), pp. 1901–1907.
- (2015) Weakly semi-supervised deep learning for multi-label image annotation. IEEE Transactions on Big Data 1 (3), pp. 109–122.
- (2015) Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057.
- (2011) Correlative multi-label multi-instance image annotation. In ICCV, pp. 651–658.
- (2019) DELTA: a deep dual-stream network for multi-label image classification. Pattern Recognition 91, pp. 322–331.
- (2007) ML-knn: a lazy learning approach to multi-label learning. Pattern recognition 40 (7), pp. 2038–2048.
- (2013) A review on multi-label learning algorithms. IEEE transactions on knowledge and data engineering 26 (8), pp. 1819–1837.
- (2012) Multi-instance multi-label learning. Artificial Intelligence 176 (1), pp. 2291–2320.
- (2017) Learning spatial regularization with image-level supervisions for multi-label image classification. In CVPR, pp. 5513–5522.