PyTorch implementation of Multi-Label Image Recognition with Graph Convolutional Networks, CVPR 2019.
The task of multi-label image recognition is to predict a set of object labels that are present in an image. As objects normally co-occur in an image, it is desirable to model the label dependencies to improve the recognition performance. To capture and explore such important dependencies, we propose a multi-label classification model based on Graph Convolutional Network (GCN). The model builds a directed graph over the object labels, where each node (label) is represented by word embeddings of a label, and GCN is learned to map this label graph into a set of inter-dependent object classifiers. These classifiers are applied to the image descriptors extracted by another sub-net, enabling the whole network to be end-to-end trainable. Furthermore, we propose a novel re-weighted scheme to create an effective label correlation matrix to guide information propagation among the nodes in GCN. Experiments on two multi-label image recognition datasets show that our approach clearly outperforms other existing state-of-the-art methods. In addition, visualization analyses reveal that the classifiers learned by our model maintain meaningful semantic topology.
Multi-label image recognition is a fundamental and practical task in Computer Vision, where the aim is to predict a set of objects present in an image. It can be applied to many fields such as medical diagnosis recognition, human attribute recognition and retail checkout recognition [8, 30]. Compared with multi-class image classification, the multi-label task is more challenging due to the combinatorial nature of the output space. As objects normally co-occur in the physical world, a key to multi-label image recognition is to model the label dependencies, as shown in Fig. 1.
A naïve way to address the multi-label recognition problem is to treat the objects in isolation and convert the multi-label problem into a set of binary classification problems that predict whether each object of interest is present or not. Benefiting from the great success of single-label image classification achieved by deep Convolutional Neural Networks (CNNs) [10, 26, 27, 12], the performance of the binary solutions has been greatly improved. However, these methods are essentially limited by ignoring the complex topology structure between objects. This has stimulated research into approaches that capture and explore the label correlations in various ways. Some approaches, based on probabilistic graph models [18, 17] or Recurrent Neural Networks (RNNs), are proposed to explicitly model label dependencies. While the former formulate the multi-label classification problem as a structural inference problem, which may suffer from a scalability issue due to high computational complexity, the latter predict the labels in a sequential fashion, based on some order that is either pre-defined or learned. Another line of work implicitly models the label correlations via attention mechanisms [36, 29]. These methods consider the relations between attended regions of an image, which can be viewed as local correlations, but still ignore the global correlations between labels, which must be inferred from knowledge beyond a single image.
In this paper, we propose a novel GCN based model (aka ML-GCN) to capture the label correlations for multi-label image recognition, which offers scalability and flexibility that are impossible for competing approaches. Instead of treating object classifiers as a set of independent parameter vectors to be learned, we propose to learn inter-dependent object classifiers from prior label representations, e.g., word embeddings, via a GCN based mapping function. The generated classifiers are then applied to image representations generated by another sub-net to enable end-to-end training. As the embedding-to-classifier mapping parameters are shared across all classes (i.e., image labels), the gradients from all classifiers impact the GCN based classifier generation function. This implicitly models the label correlations. Furthermore, to explicitly model the label dependencies for classifier learning, we design an effective label correlation matrix to guide the information propagation among nodes in GCN. Specifically, we propose a re-weighted scheme to balance the weights between a node and its neighborhood for node feature update, which effectively alleviates overfitting and over-smoothing. Experiments on two multi-label image recognition datasets show that our approach clearly outperforms existing state-of-the-art methods. In addition, visualization analyses reveal that the classifiers learned by our model maintain meaningful semantic structures.
The main contributions of this paper are as follows:
We propose a novel end-to-end trainable multi-label image recognition framework, which employs GCN to map label representations, e.g., word embeddings, to inter-dependent object classifiers.
We conduct in-depth studies on the design of correlation matrix for GCN and propose an effective re-weighted scheme to simultaneously alleviate the over-fitting and over-smoothing problems.
We evaluate our method on two benchmark multi-label image recognition datasets, and our proposed method consistently achieves superior performance over previous competing approaches.
The performance of image classification has recently witnessed rapid progress due to the establishment of large-scale hand-labeled datasets such as ImageNet, MS-COCO and PASCAL VOC, and the fast development of deep convolutional networks [10, 11, 35, 3, 32]. Many efforts have been dedicated to extending deep convolutional networks for multi-label image recognition.
A straightforward way for multi-label recognition is to train independent binary classifiers for each class/label. However, this method does not consider the relationship among labels, and the number of possible label combinations grows exponentially as the number of categories increases. For instance, if a dataset contains 20 labels, then the number of possible label combinations could be more than 1 million (i.e., $2^{20}$). Besides, this baseline method is essentially limited by ignoring the topology structure among objects, which can be an important regularizer for the co-occurrence patterns of objects. For example, some combinations of labels are almost impossible in the physical world.
In order to regularize the prediction space, many researchers have attempted to capture label dependencies. Gong et al. used a ranking-based learning strategy to train deep convolutional neural networks for multi-label image recognition and found that the weighted approximated-ranking loss worked best. Additionally, Wang et al. utilized recurrent neural networks (RNNs) to transform labels into embedded label vectors, so that the correlation between labels can be employed. Furthermore, attention mechanisms have also been widely applied to discover the label correlation in the multi-label recognition task. Zhu et al. proposed a spatial regularization network to capture both semantic and spatial relations of these multiple labels based on weighted attention maps. Wang et al. introduced a spatial transformer layer and long short-term memory (LSTM) units to capture the label correlation.
Compared with the aforementioned structure learning methods, graphs have proven to be more effective in modeling label correlation. Li et al. created a tree-structured graph in the label space by using the maximum spanning tree algorithm. Li et al. produced image-dependent conditional label structures based on the graphical Lasso framework. Lee et al. incorporated knowledge graphs for describing the relationships between multiple labels. In this paper, we leverage the graph structure to capture and explore the label correlation dependency. Specifically, based on the graph, we utilize GCN to propagate information between multiple labels and consequently learn inter-dependent classifiers for each of the image labels. These classifiers absorb information from the label graph and are then applied to the global image representation for the final multi-label prediction. This is a more explicit way of evaluating label co-occurrence. Experimental results validate that our proposed approach is effective and that our model can be trained in an end-to-end manner.
In this part, we elaborate on our ML-GCN model for multi-label image recognition. Firstly, we introduce the motivation for our method. Then, we introduce some preliminary knowledge of GCN, which is followed by the detailed illustration of the proposed ML-GCN model and the re-weighted scheme for correlation matrix construction.
How to effectively capture the correlations between object labels and how to explore these label correlations to improve the classification performance are both important for multi-label image recognition. In this paper, we use a graph to model the inter-dependencies between labels, which is a flexible way to capture the topological structure in the label space. Specifically, we represent each node (label) of the graph as word embeddings of the label, and propose to use GCN to directly map these label embeddings into a set of inter-dependent classifiers, which can be directly applied to an image feature for classification. Two factors motivated the design of our GCN based model. Firstly, as the embedding-to-classifier mapping parameters are shared across all classes, the learned classifiers can retain the weak semantic structures in the word embedding space, where semantically related concepts are close to each other. Meanwhile, the gradients of all classifiers can impact the classifier generation function, which implicitly models the label dependencies. Secondly, we design a novel label correlation matrix based on label co-occurrence patterns to explicitly model the label dependencies by GCN, with which the update of node features will absorb information from correlated nodes (labels).
Graph Convolutional Network (GCN) was introduced to perform semi-supervised classification. The essential idea is to update the node representations by propagating information between nodes.
Unlike standard convolutions that operate on local Euclidean structures in an image, the goal of GCN is to learn a function $f(\cdot, \cdot)$ on a graph $\mathcal{G}$, which takes feature descriptions $H^{l} \in \mathbb{R}^{n \times d}$ and the corresponding correlation matrix $A \in \mathbb{R}^{n \times n}$ as inputs (where $n$ denotes the number of nodes and $d$ indicates the dimensionality of node features), and updates the node features as $H^{l+1} \in \mathbb{R}^{n \times d'}$. Every GCN layer can be written as a non-linear function by

$$H^{l+1} = f(H^{l}, A). \qquad (1)$$

After employing the graph convolutional operation, $f(\cdot, \cdot)$ can be represented as

$$H^{l+1} = h(\hat{A} H^{l} W^{l}), \qquad (2)$$

where $W^{l} \in \mathbb{R}^{d \times d'}$ is a transformation matrix to be learned, $\hat{A} \in \mathbb{R}^{n \times n}$ is the normalized version of the correlation matrix $A$, and $h(\cdot)$ denotes a non-linear operation, for which we use LeakyReLU in our experiments. Thus, we can learn and model the complex inter-relationships of the nodes by stacking multiple GCN layers. For more details, we refer interested readers to the original GCN work.
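The layer-wise update described above (normalized correlation matrix times node features times a learned weight matrix, followed by LeakyReLU) can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the authors' implementation: the helper names (`matmul`, `normalize`, `leaky_relu`, `gcn_layer`) are hypothetical, and the symmetric normalization $D^{-1/2} A D^{-1/2}$ is an assumption borrowed from common GCN practice, since the text only says that the correlation matrix is normalized.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize(a):
    """Assumed normalization: A_hat = D^{-1/2} A D^{-1/2}, with D the row-sum degrees."""
    deg = [sum(row) for row in a]
    size = len(a)
    return [
        [a[i][j] / (deg[i] * deg[j]) ** 0.5 if deg[i] > 0 and deg[j] > 0 else 0.0
         for j in range(size)]
        for i in range(size)
    ]


def leaky_relu(m, slope=0.2):
    """Element-wise LeakyReLU with the given negative slope."""
    return [[v if v > 0 else slope * v for v in row] for row in m]


def gcn_layer(h, a_hat, w, slope=0.2):
    """One GCN layer: H_next = LeakyReLU(A_hat @ H @ W)."""
    return leaky_relu(matmul(matmul(a_hat, h), w), slope)


# Toy graph: 2 fully connected nodes (self-loops included), 2-dim features.
a_hat = normalize([[1.0, 1.0], [1.0, 1.0]])
h_next = gcn_layer([[1.0, 0.0], [0.0, 1.0]], a_hat, [[1.0, -1.0], [1.0, -1.0]])
print(h_next)  # each row is approximately [1.0, -0.2]
```

Because both nodes are fully connected, they end up with identical features after one layer, which is a small-scale view of the smoothing effect that propagation has on node features. A practical implementation would express the same products with PyTorch tensors and a learnable weight parameter.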
Our ML-GCN is built upon GCN. GCN was proposed for semi-supervised classification, where the node-level output is the prediction score of each node. Different from that, we design the final output of each GCN node to be the classifier of the corresponding label in our task. In addition, the graph structure (i.e., the correlation matrix) is normally pre-defined in other tasks, which, however, is not provided in the multi-label image recognition task. Thus, we need to construct the correlation matrix from scratch. The overall framework of our approach is shown in Fig. 2, which is composed of two main modules, i.e., the image representation learning and GCN based classifier learning modules.
Any CNN base model can be used to learn the features of an image. Following [36, 1, 15, 6], we use ResNet-101 as the base model in our experiments. Thus, for an input image $I$ with $448 \times 448$ resolution, we obtain $2048 \times 14 \times 14$ feature maps from the "conv5_x" layer. Then, we employ global max-pooling to obtain the image-level feature $x$:

$$x = f_{\mathrm{GMP}}(f_{\mathrm{cnn}}(I; \theta_{\mathrm{cnn}})) \in \mathbb{R}^{D}, \qquad (3)$$

where $\theta_{\mathrm{cnn}}$ indicates model parameters and $D = 2048$.
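Global max-pooling simply keeps, for every channel of the feature volume, the largest activation over all spatial positions. The sketch below illustrates this on a tiny nested-list "feature map"; the function name `global_max_pool` and the channel-first layout are assumptions made for illustration only.

```python
def global_max_pool(feature_maps):
    """Collapse a (channels x height x width) volume into one value per channel."""
    return [max(max(row) for row in channel) for channel in feature_maps]


# Two channels of size 2 x 2.
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[-1.0, -5.0], [-2.0, -3.0]],
]
print(global_max_pool(features))  # [4.0, -1.0]
```

The output length equals the number of channels, so the pooled vector has the same dimensionality as each generated classifier, which is what allows the two to be combined by a dot product.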
We learn inter-dependent object classifiers, i.e., $W = \{w_i\}_{i=1}^{C}$, from label representations via a GCN based mapping function, where $C$ denotes the number of categories. We use stacked GCNs where each GCN layer $l$ takes the node representations from the previous layer ($H^{l}$) as inputs and outputs new node representations, i.e., $H^{l+1}$. For the first layer, the input is the matrix $Z \in \mathbb{R}^{C \times d}$, where $d$ is the dimensionality of the label-level word embedding. For the last layer, the output is $W \in \mathbb{R}^{C \times D}$, with $D$ denoting the dimensionality of the image representation. By applying the learned classifiers to the image representation $x$, we obtain the predicted scores as

$$\hat{y} = W x. \qquad (4)$$
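We assume that the ground truth label of an image is $y \in \mathbb{R}^{C}$, where $y^{i} \in \{0, 1\}$ denotes whether label $i$ appears in the image or not. The whole network is trained using the traditional multi-label classification loss

$$\mathcal{L} = -\sum_{c=1}^{C} \left[ y^{c} \log(\sigma(\hat{y}^{c})) + (1 - y^{c}) \log(1 - \sigma(\hat{y}^{c})) \right], \qquad (5)$$

where $\hat{y}^{c}$ is the predicted score for label $c$ and $\sigma(\cdot)$ is the sigmoid function.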
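To make the scoring and training objective concrete, the following sketch computes per-label scores as dot products between the generated classifiers and an image feature, and then evaluates a summed sigmoid cross-entropy over the labels. It is an illustration under assumptions: the function names are hypothetical, and the loss is written in a numerically stable form that is algebraically equal to the usual binary cross-entropy on sigmoid outputs.

```python
import math


def predict_scores(classifiers, x):
    """classifiers: C rows of length D; x: D-dim image feature. Returns C scores."""
    return [sum(w_k * x_k for w_k, x_k in zip(w, x)) for w in classifiers]


def multilabel_loss(scores, targets):
    """Summed binary cross-entropy on sigmoid(scores), with targets in {0, 1}.

    Uses max(s, 0) - s * y + log(1 + exp(-|s|)), which equals
    -[y * log(sigmoid(s)) + (1 - y) * log(1 - sigmoid(s))] without overflow.
    """
    return sum(
        max(s, 0.0) - s * y + math.log1p(math.exp(-abs(s)))
        for s, y in zip(scores, targets)
    )


# Two labels, 2-dim features: label 0 fires on the first feature dimension.
scores = predict_scores([[1.0, 0.0], [0.0, 1.0]], [3.0, -3.0])
print(scores)                           # [3.0, -3.0]
print(multilabel_loss(scores, [1, 0]))  # small loss: both labels are scored correctly
print(multilabel_loss(scores, [0, 1]))  # large loss: both labels are scored wrongly
```

In a PyTorch implementation the same quantity would normally come from a built-in multi-label loss applied to the raw scores, rather than from hand-written arithmetic.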
GCN works by propagating information between nodes based on the correlation matrix. Thus, how to build the correlation matrix is a crucial problem for GCN. In most applications, the correlation matrix is pre-defined, which, however, is not provided in any standard multi-label image recognition datasets. In this paper, we build this correlation matrix through a data-driven way. That is, we define the correlation between labels via mining their co-occurrence patterns within the dataset.
We model the label correlation dependency in the form of conditional probability, i.e., $P(L_j \mid L_i)$, which denotes the probability of occurrence of label $L_j$ when label $L_i$ appears. As shown in Fig. 3, $P(L_j \mid L_i)$ is not equal to $P(L_i \mid L_j)$. Thus, the correlation matrix is asymmetrical.
To construct the correlation matrix, we first count the occurrences of label pairs in the training set and get the matrix $M \in \mathbb{R}^{C \times C}$. Concretely, $C$ is the number of categories, and $M_{ij}$ denotes the number of co-occurrences of $L_i$ and $L_j$. Then, by using this label co-occurrence matrix, we can get the conditional probability matrix by

$$P_{i} = M_{i} / N_{i}, \qquad (6)$$

where $M_i$ is the $i$-th row of $M$, $N_{i}$ denotes the number of occurrences of $L_i$ in the training set, and $P_{ij} = P(L_j \mid L_i)$ is the probability of label $L_j$ when label $L_i$ appears.
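The counting procedure above can be illustrated on a toy training set. The sketch below is a minimal example under assumptions: labels are integer indices, each image is given as a list of its labels, and the function names are hypothetical. It counts pairwise co-occurrences and per-label occurrences, then divides each row of the co-occurrence matrix by the occurrence count of that row's label.

```python
def cooccurrence(label_sets, num_labels):
    """Return (M, N): M[i][j] counts images containing both i and j (i != j),
    N[i] counts images containing label i."""
    m = [[0] * num_labels for _ in range(num_labels)]
    n = [0] * num_labels
    for labels in label_sets:
        present = sorted(set(labels))
        for i in present:
            n[i] += 1
            for j in present:
                if i != j:
                    m[i][j] += 1
    return m, n


def conditional_probability(m, n):
    """P[i][j] = M[i][j] / N[i], i.e. the probability of label j given label i.
    Rows of labels that never occur are left at zero."""
    return [[m_ij / n_i if n_i else 0.0 for m_ij in row] for row, n_i in zip(m, n)]


# Four toy images over three labels (0, 1, 2).
images = [[0, 1], [0], [0, 1, 2], [2]]
m, n = cooccurrence(images, num_labels=3)
p = conditional_probability(m, n)
print(n)                 # [3, 2, 2]
print(p[0][1], p[1][0])  # about 0.667 versus 1.0: the matrix is asymmetric
```

Label 1 never appears without label 0 in this toy data, while label 0 appears once on its own, which is exactly the kind of asymmetry the conditional formulation is meant to capture.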
However, the simple correlation above may suffer from two drawbacks. Firstly, the co-occurrence patterns between a label and the other labels may exhibit a long-tail distribution, where some rare co-occurrences may be noise. Secondly, the absolute numbers of co-occurrences in the training and test sets may not be completely consistent. A correlation matrix overfitted to the training set can hurt the generalization capacity. Thus, we propose to binarize the correlation. Specifically, we use the threshold $\tau$ to filter noisy edges, and the operation can be written as

$$A_{ij} = \begin{cases} 0, & \text{if } P_{ij} < \tau \\ 1, & \text{if } P_{ij} \geq \tau \end{cases} \qquad (7)$$

where $A$ is the binary correlation matrix.
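Thresholding the conditional probabilities is a one-line operation; the sketch below shows it on a small matrix. The function name `binarize` is hypothetical, and the default threshold of 0.4 mirrors the default setting reported later in the implementation details.

```python
def binarize(p, tau=0.4):
    """Keep an edge (1) when the conditional probability reaches tau, else drop it (0)."""
    return [[1 if p_ij >= tau else 0 for p_ij in row] for row in p]


p = [
    [0.0, 0.39, 0.4],
    [1.0, 0.0, 0.1],
]
print(binarize(p))  # [[0, 0, 1], [1, 0, 0]]
```

Entries just below the threshold are dropped together with the rare, possibly noisy co-occurrences, which is the intended filtering effect.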
From Eq. (2), we can conclude that after GCN, the feature of a node will be the weighted sum of its own feature and the adjacent nodes' features. A direct problem for the binary correlation matrix is then that it can result in over-smoothing. That is, the node features may be over-smoothed such that nodes from different clusters (e.g., kitchen related vs. living room related) may become indistinguishable. To alleviate this problem, we propose the following re-weighted scheme,

$$A'_{ij} = \begin{cases} \left( p \Big/ \sum_{j=1,\, j \neq i}^{C} A_{ij} \right) \times A_{ij}, & \text{if } i \neq j \\ 1 - p, & \text{if } i = j \end{cases} \qquad (8)$$

where $A'$ is the re-weighted correlation matrix, and $p$ determines the weights assigned to a node itself and other correlated nodes. By doing this, when updating the node feature, we will have a fixed weight for the node itself, and the weights for correlated nodes will be determined by the neighborhood distribution. When $p \to 1$, the feature of a node itself will not be considered. On the other hand, when $p \to 0$, neighboring information tends to be ignored.
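The re-weighting idea, a fixed weight of $1 - p$ for the node itself and a total weight of $p$ shared among its correlated neighbors, can be sketched as follows. This is a reading of the scheme under assumptions rather than the reference implementation: the function name `reweight` is hypothetical, the input is taken to be a binary correlation matrix, and nodes without any neighbor simply keep only their self weight.

```python
def reweight(a, p=0.2):
    """Re-weight a binary correlation matrix.

    Off-diagonal: p * A[i][j] / (number of neighbors of i); diagonal: 1 - p.
    """
    size = len(a)
    out = []
    for i in range(size):
        neighbors = sum(a[i][k] for k in range(size) if k != i)
        row = []
        for j in range(size):
            if i == j:
                row.append(1.0 - p)
            elif neighbors > 0:
                row.append(p * a[i][j] / neighbors)
            else:
                row.append(0.0)  # isolated node: no neighbor weight to distribute
        out.append(row)
    return out


a = [
    [0, 1, 1],
    [1, 0, 0],
    [0, 0, 0],
]
for row in reweight(a, p=0.2):
    print(row)
# node 0 keeps 0.8 for itself and gives 0.1 to each of its two neighbors
```

With this construction every row of a node that has at least one neighbor sums to 1, so `p` directly controls how much of the updated feature comes from the neighborhood.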
In this section, we first describe the evaluation metrics and implementation details. Then, we report the empirical results on two benchmark multi-label image recognition datasets, i.e., MS-COCO and VOC 2007. Finally, visualization analyses are presented.
Following conventional settings [28, 6, 36], we report the average per-class precision (CP), recall (CR), F1 (CF1) and the average overall precision (OP), recall (OR), F1 (OF1) for performance evaluation. For each image, a label is predicted as positive if its confidence is greater than 0.5. For fair comparisons, we also report the results of top-3 labels, cf. [36, 6]. In addition, we compute and report the mean average precision (mAP). Generally, average overall F1 (OF1), average per-class F1 (CF1) and mAP are relatively more important for performance evaluation.
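The per-class and overall metrics can be computed from three counts per label: correctly predicted positives, predicted positives, and ground-truth positives. The sketch below follows the definitions commonly used in prior multi-label work, which is an assumption since the formulas are not spelled out here; the function name and the dictionary keys are likewise illustrative.

```python
def multilabel_metrics(confidences, targets, threshold=0.5):
    """confidences, targets: one list of C values per image (targets in {0, 1})."""
    num_labels = len(targets[0])
    correct = [0] * num_labels
    predicted = [0] * num_labels
    ground = [0] * num_labels
    for conf_row, target_row in zip(confidences, targets):
        for c, (conf, target) in enumerate(zip(conf_row, target_row)):
            is_positive = conf > threshold
            predicted[c] += int(is_positive)
            ground[c] += int(target == 1)
            correct[c] += int(is_positive and target == 1)

    def ratio(num, den):
        return num / den if den else 0.0

    def f1(precision, recall):
        return ratio(2 * precision * recall, precision + recall)

    # Per-class: average the per-label ratios. Overall: pool the counts first.
    cp = sum(ratio(c, p) for c, p in zip(correct, predicted)) / num_labels
    cr = sum(ratio(c, g) for c, g in zip(correct, ground)) / num_labels
    op = ratio(sum(correct), sum(predicted))
    overall_recall = ratio(sum(correct), sum(ground))
    return {"CP": cp, "CR": cr, "CF1": f1(cp, cr),
            "OP": op, "OR": overall_recall, "OF1": f1(op, overall_recall)}


confidences = [[0.9, 0.2], [0.8, 0.7], [0.6, 0.1]]
targets = [[1, 0], [1, 1], [0, 1]]
print(multilabel_metrics(confidences, targets))
```

In this toy case label 0 is over-predicted and label 1 is under-predicted, so the per-class precision (about 0.83) differs from the overall precision (0.75) even though the pooled counts are the same data.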
Unless otherwise stated, our ML-GCN consists of two GCN layers with output dimensionality of 1024 and 2048, respectively. For label representations, we adopt 300-dim GloVe trained on the Wikipedia dataset. For the categories whose names contain multiple words, we obtain the label representation as the average of the embeddings of all words. For the correlation matrix, unless otherwise stated, we set $\tau$ in Eq. (7) to be 0.4 and $p$ in Eq. (8) to be 0.2. In the image representation learning branch, we adopt LeakyReLU with the negative slope of 0.2 as the non-linear activation function, which leads to faster convergence in experiments. We adopt ResNet-101 as the feature extraction backbone, which is pre-trained on ImageNet. During training, the input images are randomly cropped and resized to $448 \times 448$, with random horizontal flips for data augmentation. For network optimization, SGD is used as the optimizer. The momentum is set to 0.9. Weight decay is
In this part, we first present our comparisons with state-of-the-arts on MS-COCO and VOC 2007, respectively. Then, we conduct ablation studies to evaluate the key aspects of the proposed approach.
Microsoft COCO is a widely used benchmark for multi-label image recognition. It contains 82,081 images as the training set and 40,504 images as the validation set. The objects are categorized into 80 classes with about 2.9 object labels per image. Since the ground-truth labels of the test set are not available, we evaluate the performance of all the methods on the validation set. The number of labels per image also varies considerably, which makes MS-COCO more challenging.
Quantitative results are reported in Table 1. We compare with state-of-the-art methods, including CNN-RNN, RNN-Attention, Order-Free RNN, ML-ZSL, SRN, Multi-Evidence, etc. For the proposed ML-GCN, we report the results based on the binary correlation matrix ("ML-GCN (Binary)") and the re-weighted correlation matrix ("ML-GCN (Re-weighted)"), respectively. Our ML-GCN method based on the binary correlation matrix obtains worse classification performance, which may be largely due to the over-smoothing problem discussed in Sec. 3.4. The proposed re-weighted scheme alleviates the over-smoothing issue and consequently obtains superior performance. Compared with state-of-the-art methods, our approach with the proposed re-weighted scheme consistently performs better under almost all metrics, which shows the effectiveness of our proposed ML-GCN as well as its corresponding re-weighted scheme.
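|Method|mAP|CP|CR|CF1|OP|OR|OF1|CP (top-3)|CR (top-3)|CF1 (top-3)|OP (top-3)|OR (top-3)|OF1 (top-3)|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|Order-Free RNN|–|–|–|–|–|–|–|71.6|54.8|62.1|74.2|62.2|67.7|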
PASCAL Visual Object Classes Challenge (VOC 2007) is another popular dataset for multi-label recognition. It contains 9,963 images from 20 object categories, which are divided into train, val and test sets. Following [2, 29], we use the trainval set to train our model, and evaluate the recognition performance on the test set. In order to compare with other state-of-the-art methods, we report the results of average precision (AP) and mean average precision (mAP).
The results on VOC 2007 are presented in Table 2. Because the results of many previous works on VOC 2007 are based on the VGG model, for fair comparisons we also report the results using VGG models as the base model. Our proposed method improves upon the previous methods. Concretely, the proposed ML-GCN with our re-weighted scheme outperforms the previous state of the art in mAP. Even with a VGG model as the base model, we still achieve better results. Also, consistent with the results on MS-COCO, the re-weighted scheme achieves better performance than the binary correlation matrix on VOC as well.
In this section, we perform ablation studies from four different aspects, including the sensitivity of ML-GCN to different types of word embeddings, the effects of $\tau$ in correlation matrix binarization, the effects of $p$ for correlation matrix re-weighting, and the depth of GCN.
By default, we use GloVe as label representations, which serve as the inputs of the stacked GCNs for learning the object classifiers. In this part, we evaluate the performance of ML-GCN under other popular types of word representations. Specifically, we investigate four different word embedding methods, including GloVe, GoogleNews, FastText and the simple one-hot word embedding. Fig. 4 shows the results using different word embeddings on MS-COCO and VOC 2007. As shown, when using different word embeddings as GCN's inputs, the multi-label recognition accuracy is not affected significantly. In addition, the observations (especially the results of one-hot) show that the accuracy improvements achieved by our method do not come solely from the semantic meanings derived from word embeddings. Furthermore, using powerful word embeddings could lead to better performance. One possible reason may be that the word embeddings [25, 24, 13] learned from a large text corpus maintain some semantic topology. That is, for semantically related concepts, their embeddings are close in the embedding space. Our model can employ these implicit dependencies, and further benefit multi-label image recognition.
We vary the values of the threshold $\tau$ in Eq. (7) for correlation matrix binarization, and show the results in Fig. 5. Note that, if we do not filter any edges, the model will not converge. Thus, there is no result for $\tau = 0$ in that figure. As shown, when filtering out the edges of small probabilities (i.e., noisy edges), the multi-label recognition accuracy is boosted. However, when too many edges are filtered out, the accuracy drops since correlated neighbors will be ignored as well. The optimal value of $\tau$ is 0.4 for both MS-COCO and VOC 2007.
To explore the effects of different values of $p$ in Eq. (8) on multi-label classification accuracy, we vary $p$ over the set $\{0, 0.1, 0.2, \ldots, 0.9, 1\}$, as depicted in Fig. 6. Generally, this figure shows the importance of balancing the weights between a node itself and the neighborhood when updating the node feature in GCN. In experiments, we choose the optimal value of $p$ by cross-validation. We can see that $p = 0.2$ achieves the best performance on both MS-COCO and VOC 2007. If $p$ is too small, nodes (labels) of the graph cannot get sufficient information from correlated nodes (labels). If $p$ is too large, it will lead to over-smoothing.
Another interesting observation is that, when $p = 0$, the mAPs obtained on MS-COCO and VOC 2007 still outperform existing methods. Note that when $p = 0$, we essentially do not explicitly incorporate the label correlations. The improvement comes from the fact that our ML-GCN model learns the object classifiers from the prior label representations through a shared GCN based mapping function, which implicitly models label dependencies as discussed in Sec. 3.1.
We show the performance results with different numbers of GCN layers for our model in Table 3. For the three-layer model, the output dimensionalities are 1024, 1024 and 2048 for the sequential layers, respectively. For the four-layer model, the dimensionalities are 1024, 1024, 1024 and 2048. As shown, when the number of graph convolution layers increases, multi-label recognition performance drops on both datasets. A possible reason for the performance drop is that when using more GCN layers, the propagation between nodes accumulates, which can result in over-smoothing.
The effectiveness of our approach has been quantitatively evaluated through comparisons to existing methods and detailed ablation studies. In this section, we visualize the learned inter-dependent classifiers to show if meaningful semantic topology can be maintained.
In Fig. 8, we adopt t-SNE to visualize the classifiers learned by our proposed ML-GCN, as well as the classifiers learned through vanilla ResNet (i.e., parameters of the last fully-connected layer). The classifiers learned by our method clearly maintain meaningful semantic topology. Specifically, the learned classifiers exhibit cluster patterns. Classifiers (of "car" and "truck") within one super concept ("transportation") tend to be close in the classifier space. This is consistent with common sense, which indicates that the classifiers learned by our approach may not be limited to the dataset where the classifiers are learned, but may enjoy generalization capacities. On the contrary, the classifiers learned through vanilla ResNet are uniformly distributed in the space and do not show any meaningful topology. This visualization further shows the effectiveness of our approach in modeling label dependencies.
Apart from analyzing the learned classifiers, we further evaluate whether our model can learn better image representations. We conduct an image retrieval experiment to verify this. Specifically, we use the k-NN algorithm to perform content-based image retrieval to validate the discriminative ability of image representations learned by our model. Again, we choose the features from vanilla ResNet as the baseline. We show the top-5 images returned by k-NN. The retrieval results are presented in Fig. 7. For each query image, the corresponding returned images are sorted in ascending order according to their distance to the query image. Our retrieval results are clearly better than those of the vanilla ResNet baseline. For example, in Fig. 7 (c), the labels of the images returned by our approach almost exactly match the labels of the query image. This demonstrates that our ML-GCN can not only effectively capture label dependencies to learn better classifiers, but can also benefit image representation learning in multi-label recognition.
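The retrieval step can be pictured as a nearest-neighbor search over image feature vectors. The sketch below is a minimal illustration under assumptions: Euclidean distance is assumed as the metric (the text does not name one), features are plain lists of floats, and the function name `knn_retrieve` is hypothetical.

```python
import math


def knn_retrieve(query, gallery, k=5):
    """Return indices of the k gallery features closest to the query,
    sorted in ascending order of (assumed) Euclidean distance."""
    distances = [(math.dist(query, feature), index)
                 for index, feature in enumerate(gallery)]
    return [index for _, index in sorted(distances)[:k]]


gallery = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.1, 0.0]]
print(knn_retrieve([0.0, 0.0], gallery, k=2))  # [0, 3]
```

In the experiment described above, the gallery would hold the pooled image-level features of the dataset and `k` would be 5.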
Capturing label dependencies is one crucial issue for multi-label image recognition. In order to model and explore this important information, we proposed a GCN based model to learn inter-dependent object classifiers from prior label representations, e.g., word embeddings. To explicitly model the label dependencies, we designed a novel re-weighted scheme to construct the correlation matrix for GCN by balancing the weights between a node and its neighborhood for node feature update. This scheme can effectively alleviate over-fitting and over-smoothing, which are two key factors hampering the performance of GCN. Both quantitative and qualitative results validated the advantages of our ML-GCN.
Recurrent attentional reinforcement learning for multi-label image recognition. In AAAI, pages 6730–6737, 2018.
Xception: Deep learning with depthwise separable convolutions. In CVPR, pages 1251–1258, 2017.
Multi-evidence filtering and fusion for multi-label classification, object detection and semantic segmentation based on weakly supervised learning. In CVPR, pages 1277–1286, 2018.
Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, pages 3538–3545, 2018.
Efficient estimation of word representations in vector space. In ICLR, pages 1–12, 2013.