Although zero-shot learning for single-label classification has been well studied [xie2019attentive, mishra2017cvae-zsl, xian2018fganzsl, verma2018segzsl, GDAN_2019_CVPR, schonfeld2019generalized], multi-label zero-shot learning (ML-ZSL) [mensink2014costa, fu2015transductive, zhang2016fast, lee2018multi] is a less explored area. An image is associated with potentially multiple seen and unseen classes, but only labels of seen classes are provided during training. The multi-label zero-shot learning problem is more difficult than its single-label counterpart for two reasons. First, since there is no constraint on the number of labels assigned to an image and only labels of seen classes are provided during training, the model can easily bias towards the seen classes and ignore the unseen ones. Second, the datasets used for multi-label zero-shot learning are generally much more challenging than those used for traditional zero-shot learning. While the datasets used for zero-shot learning (ZSL) usually consist of closely related classes such as different kinds of birds (e.g., Baird Sparrow and Chipping Sparrow in CUB [wah2011cub_bird]), the datasets for multi-label classification contain high-level concepts that are less related to each other (e.g., truck and sheep in MS-COCO [lin2014mscoco]). The semantic gap between seen and unseen classes in multi-label datasets makes it difficult for the model to generalize knowledge learned from seen classes to predict unseen classes.
Some studies [lee2018multi, zhang2016fast, fu2015transductive, gaure2017probabilistic, marino2016more, ren2017multiple, mensink2014costa] try to solve the multi-label zero-shot learning problem by leveraging word embeddings [zhang2016fast] or constructing structured knowledge graphs [lee2018multi]. Although these methods empirically work well, none of them utilizes external knowledge to close the gap between semantically distinct seen and unseen classes. By contrast, humans learn new concepts by connecting them with those that they have already learned, and thus can transfer knowledge from previous experience to solve new tasks. This ability is called transfer learning. In computer vision, a common form of transfer learning is to use a feature extractor (e.g., ResNet [he2016resnet]) pretrained on ImageNet [ILSVRC15ImageNet]. However, this naive approach does not make use of the semantic information of the 1K ImageNet classes, which can be helpful in recognizing unseen classes in zero-shot learning, especially when none of the seen classes is closely related to the unseen ones.
In this paper, we propose to build a knowledge graph with not only the classes from the target multi-label dataset (e.g., MS-COCO [lin2014mscoco]) but also classes from the external single-label dataset (e.g., ImageNet [ILSVRC15ImageNet]). Then classification can be performed via graph reasoning methods such as graph convolutional networks (GCNs) [kipf2016gcn]. Our intuition is that the knowledge of which ImageNet classes are more related to the image can help close the semantic gap between the seen and unseen classes. For example, as illustrated in Figure 1, the unseen class Sheep in the image is hard to predict since it is semantically very different from all seen classes in the dataset. But if we include the 1K ImageNet classes in the knowledge graph, there will be a class Bighorn Sheep which is semantically related to Sheep. Given the predicted probability distribution on the ImageNet classes, if the Bighorn Sheep class has a high probability, we can infer that the image contains the Sheep class.
After incorporating the ImageNet classes as nodes in the knowledge graph of the target dataset, it is still nontrivial to infer their states. A simple solution is to use the predicted probability distribution on ImageNet classes as pseudo labels, and then we can treat the ImageNet classes the same as those in the target multi-label dataset. However, this simple approach is problematic. Since the ImageNet classes are not labeled in the multi-label dataset, treating them the same as the real target labels may confuse the network and lead to inferior learning. To address this issue, we propose a novel PosVAE module based on a conditional variational auto-encoder [sohn2015cvae] to model the posterior probabilities of the ImageNet classes. Then the states of ImageNet nodes are obtained via inference through the PosVAE network. The states of nodes from the target dataset are obtained by another semantically conditioned encoder which takes as input the concatenation of the image features and the class embedding of each node. The states of all nodes are passed into a relational graph convolutional network (RGCN), which propagates information among classes and generates predictions for each class in the target dataset.
Our contributions are summarized as follows:
To the best of our knowledge, this is the first attempt to transfer the semantic knowledge from the 1K ImageNet classes to help solve the multi-label zero-shot learning problem. This is achieved by adding the ImageNet classes as additional nodes to extend the knowledge graph constructed by labels in the target multi-label dataset. By contrast, previous methods only utilize pretrained ImageNet classifiers as feature extractors.
We design a PosVAE module, which is based on a conditional variational auto-encoder [sohn2015cvae], to infer the states of ImageNet nodes in the extended knowledge graph, and a relational graph convolutional network (RGCN) to generate predictions for seen and unseen classes.
We conduct extensive experiments and demonstrate that the proposed approach can effectively transfer knowledge from the ImageNet classes and improve the multi-label zero-shot classification performance. We also show that it outperforms previous methods.
2 Related Work
Multi-label Classification. Recently, several works [wang2016cnnrnn, wang2017multi, zhu2017spatialreg, chen2019mlgcn, chen2019ssgrl, marino2016more] have proposed different ways to learn the correlations among labels. Marino et al. [marino2016more] first detect one label and then sequentially predict more related labels of the input image. Wang et al. [wang2016cnnrnn] use a recurrent neural network (RNN) to capture the dependencies between labels, while Zhu et al. [zhu2017spatialreg] utilize spatial attention to better capture spatial and semantic correlations of labels. Chen et al. [chen2019mlgcn] construct a knowledge graph from labels, use the co-occurrence probabilities of labels as weights of the edges in the graph, and apply a graph convolutional network (GCN) [kipf2016gcn] to perform multi-label classification. Chen et al. [chen2019ssgrl] use the pretrained word embeddings of labels as semantic queries on the image feature map and apply spatial attention to obtain semantically attended features for each label. Although these methods consider label dependencies and work well on traditional multi-label classification, they are not directly applicable in the multi-label zero-shot setting because the label co-occurrence statistics are not available for unseen classes, and only utilizing the statistics for seen labels can cause severe bias towards seen classes. Thus we need a different way to learn the relations between labels in the multi-label zero-shot setting.
Multi-label Zero-shot Learning is a less explored problem compared to single-label zero-shot learning [mishra2017cvae-zsl, xian2018fganzsl, verma2018segzsl, GDAN_2019_CVPR, schonfeld2019generalized]. Fu et al. [fu2015transductive] enumerate all possible combinations of labels and thus transform the problem into single-label zero-shot learning. However, as the number of labels increases, the number of label combinations increases exponentially, which makes this method impractical. Mensink et al. [mensink2014costa] learn classifiers for unseen classes as weighted combinations of seen classifiers learned with co-occurrence statistics of seen labels. Zhang et al. [zhang2016fast] consider the projected image features as a projection vector which ranks relevant labels higher than irrelevant ones by calculating the inner product between the projected image features and label word embeddings. Gaure et al. [gaure2017probabilistic] further assume that the co-occurrence statistics of both seen and unseen labels are available and design a probabilistic model to solve the problem. Ren et al. [ren2017multiple] cast the problem into a multi-instance learning [zha2008joint] framework and learn a joint latent space for both image features and label embeddings. Lee et al. [lee2018multi] construct a knowledge graph for both seen and unseen labels, and learn a gated graph neural network (GGNN) [li2015ggnn] to propagate predictions among labels. To our knowledge, none of the previous work considers the large semantic gap between seen and unseen labels, and we are the first to exploit the label space of a single-label image classification dataset as external knowledge and transfer it for zero-shot multi-label image classification.
3.1 Problem Definition and Notations
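The multi-label zero-shot classification problem is defined as follows. Given a multi-label classification dataset, the label space $\mathcal{Y}$ is split into seen labels $\mathcal{Y}_S$ and unseen labels $\mathcal{Y}_U$ such that $\mathcal{Y} = \mathcal{Y}_S \cup \mathcal{Y}_U$ and $\mathcal{Y}_S \cap \mathcal{Y}_U = \emptyset$. During training, a set of image-label pairs $\{(x, y)\}$ is available, where the labels of an image come from only the seen classes: $y \subseteq \mathcal{Y}_S$. During testing, the images may have labels from both seen and unseen classes: $y \subseteq \mathcal{Y}$. The goal is to learn a model $f: x \mapsto y \subseteq \mathcal{Y}$, given only seen classes during training. In this work, we propose to use an external set of labels $\mathcal{Y}_I$ from ImageNet [ILSVRC15ImageNet] where $\mathcal{Y}_I \cap \mathcal{Y} = \emptyset$. For an arbitrary class $c \in \mathcal{Y} \cup \mathcal{Y}_I$, its pretrained GloVe [pennington2014glove] embedding is denoted by $e_c \in \mathbb{R}^{d_e}$, and all class embeddings are aggregated into a single matrix $E \in \mathbb{R}^{|\mathcal{Y} \cup \mathcal{Y}_I| \times d_e}$, where $d_e$ is the dimension of embeddings.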
3.2 Overview of the Proposed Method
The overall pipeline of the proposed model is illustrated in Figure 2. An input image $x$ is first fed into a pretrained ImageNet classifier to extract its image feature $v$ and a probability distribution $p$ over the ImageNet classes $\mathcal{Y}_I$. We first construct a graph whose nodes are classes from the target dataset. We calculate the pair-wise Wu-Palmer (WUP) [wu1994wup] similarities on WordNet [miller1995wordnet] and then use a threshold to determine whether an edge exists between a pair of nodes. Self-connections are also included. During training, only the seen classes $\mathcal{Y}_S$ are included in the training graph, while during testing both seen and unseen classes ($\mathcal{Y} = \mathcal{Y}_S \cup \mathcal{Y}_U$) are included in the testing graph. The classes from ImageNet [ILSVRC15ImageNet] $\mathcal{Y}_I$ are also added to the training and testing graphs, and edges are added in the same way as described above. The overlapping classes between ImageNet and the target dataset have been removed from $\mathcal{Y}_I$. All aggregated edges form an adjacency matrix $A \in \{0,1\}^{|\mathcal{Y}_S \cup \mathcal{Y}_I| \times |\mathcal{Y}_S \cup \mathcal{Y}_I|}$ for training, and an adjacency matrix $A \in \{0,1\}^{|\mathcal{Y} \cup \mathcal{Y}_I| \times |\mathcal{Y} \cup \mathcal{Y}_I|}$ for testing (with a slight abuse of notation).
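Since the target dataset does not have ground-truth labels of the ImageNet classes $\mathcal{Y}_I$, we need to treat the target classes $\mathcal{Y}$ and the ImageNet classes $\mathcal{Y}_I$ differently when inferring their states in the graph. For nodes in $\mathcal{Y}$, we infer their initial states by concatenating their corresponding word embeddings and the extracted image features. Then a multi-layer perceptron (MLP) network is applied to obtain the node features $H_{\mathcal{Y}} \in \mathbb{R}^{|\mathcal{Y}_S| \times d_h}$ for training and $H_{\mathcal{Y}} \in \mathbb{R}^{|\mathcal{Y}| \times d_h}$ for testing, where $d_h$ is the dimension of node features. Meanwhile, for nodes corresponding to $\mathcal{Y}_I$, we design a PosVAE module to estimate the posterior $p(h_c \mid p_c, e_c)$ for each $c \in \mathcal{Y}_I$, where $h_c$ is the node state of class $c$, $p_c$ is the predicted probability of class $c$, and $e_c$ is its word embedding. Then we sample from this posterior distribution to infer the initial node states for all classes in $\mathcal{Y}_I$, denoted as $H_I \in \mathbb{R}^{|\mathcal{Y}_I| \times d_h}$. We will discuss the detailed design of PosVAE in Section 3.3. Then we can obtain the initial node states for all the nodes in the graph by concatenation: $H^{(0)} = [H_{\mathcal{Y}}; H_I] \in \mathbb{R}^{|\mathcal{Y}_S \cup \mathcal{Y}_I| \times d_h}$ for training and $H^{(0)} = [H_{\mathcal{Y}}; H_I] \in \mathbb{R}^{|\mathcal{Y} \cup \mathcal{Y}_I| \times d_h}$ for testing.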
With the initial node states $H^{(0)}$, the adjacency matrix $A$, and the embedding matrix $E$, we design a Relational Graph Convolutional Network (RGCN) to propagate information among the nodes and refine their features, which will be discussed in Section 3.4. The output features of RGCN are fed into another MLP network followed by Sigmoid functions to predict the existence of each label in the target dataset ($\hat{y} \in [0,1]^{|\mathcal{Y}_S|}$ for training and $\hat{y} \in [0,1]^{|\mathcal{Y}|}$ for testing).
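The initialization of the target-dataset nodes can be illustrated with a minimal sketch. It assumes a single fully connected layer with ReLU in place of the MLP, and the function names, toy dimensions, and weights below are hypothetical choices for illustration rather than the configuration used in the paper.

```python
def linear_relu(x, weight, bias):
    """One fully connected layer followed by ReLU (stand-in for the MLP)."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weight, bias)]


def init_target_states(class_embeddings, image_feature, weight, bias):
    """Concatenate each class embedding with the shared image feature,
    then map the result to a node state."""
    return [linear_relu(emb + image_feature, weight, bias)
            for emb in class_embeddings]


# Toy example: two classes with 1-d embeddings and a 2-d image feature.
class_embeddings = [[1.0], [-1.0]]
image_feature = [1.0, 2.0]
weight = [[1, 1, 0], [0, 0, -1]]  # hypothetical weights, shape (d_h=2, d_e+d_v=3)
bias = [0.0, 3.0]
states = init_target_states(class_embeddings, image_feature, weight, bias)
```

Every class sees the same image feature, so the node states differ only through the class embedding, which is what makes the encoder semantically conditioned.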
3.3 Variational Inference for Node States of ImageNet Classes
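For each ImageNet class $c \in \mathcal{Y}_I$, we aim to estimate its corresponding initial node state $h_c$ in the graph from its probability $p_c$ predicted by the pretrained ImageNet classifier and its embedding $e_c$. The objective is to estimate the posterior $p(h_c \mid p_c, e_c)$. Since we do not have the exact form of the posterior, we resort to variational inference and use another learned distribution $q(h_c \mid p_c, e_c)$ to approximate the true posterior $p(h_c \mid p_c, e_c)$. In order to ensure a good approximation, we aim to minimize the KL-divergence between $q(h_c \mid p_c, e_c)$ and $p(h_c \mid p_c, e_c)$:
$$\min_{q} \; \mathrm{KL}\big(q(h_c \mid p_c, e_c) \,\|\, p(h_c \mid p_c, e_c)\big).$$
Similar to the conditional variational auto-encoder [sohn2015cvae], the above objective can be further derived as:
$$\mathcal{L}_{PosVAE} = \mathrm{KL}\big(q(h_c \mid p_c, e_c) \,\|\, p(h_c \mid e_c)\big) - \mathbb{E}_{q(h_c \mid p_c, e_c)}\big[\log p(p_c \mid h_c, e_c)\big],$$
where the term $\log p(p_c \mid e_c)$ is neglected since it does not depend on $q$ during optimization.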
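The step from the KL objective to the optimized objective can be made explicit. The following is a sketch of the standard conditional-VAE argument in assumed notation ($h_c$ for the node state, $p_c$ for the classifier probability, $e_c$ for the class embedding, $q$ for the variational distribution). It relies only on Bayes' rule, $p(h_c \mid p_c, e_c) = p(p_c \mid h_c, e_c)\, p(h_c \mid e_c) / p(p_c \mid e_c)$, and is not taken verbatim from the paper.

```latex
\begin{align*}
\mathrm{KL}\big(q(h_c \mid p_c, e_c) \,\|\, p(h_c \mid p_c, e_c)\big)
&= \mathbb{E}_{q}\big[\log q(h_c \mid p_c, e_c) - \log p(h_c \mid p_c, e_c)\big] \\
&= \mathbb{E}_{q}\big[\log q(h_c \mid p_c, e_c) - \log p(h_c \mid e_c)
   - \log p(p_c \mid h_c, e_c)\big] + \log p(p_c \mid e_c) \\
&= \mathrm{KL}\big(q(h_c \mid p_c, e_c) \,\|\, p(h_c \mid e_c)\big)
   - \mathbb{E}_{q}\big[\log p(p_c \mid h_c, e_c)\big] + \log p(p_c \mid e_c).
\end{align*}
```

The last term is constant with respect to $q$, so minimizing the first two terms is equivalent to minimizing the original KL-divergence.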
The prior $p(h_c \mid e_c)$ follows an isotropic Gaussian distribution $\mathcal{N}(0, I)$ and $q(h_c \mid p_c, e_c) = \mathcal{N}(\mu_c, \mathrm{diag}(\sigma_c^2))$. By using the reparameterization trick [kingma2013vae], the initial node states can be derived as:
$$[\mu_c, \log \sigma_c] = \mathrm{Encoder}(p_c, e_c), \qquad h_c = \mu_c + \sigma_c \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I),$$
where the Encoder network is a stack of MLP layers and $\odot$ is element-wise multiplication. The Encoder first projects $p_c$ into the same dimension space as $e_c$. Then the two vectors are concatenated and fed into subsequent MLP layers to generate the mean $\mu_c$ and log standard deviation $\log \sigma_c$. In order to calculate the log-likelihood term $\log p(p_c \mid h_c, e_c)$, we assume a Gaussian distribution on $p_c$, and use the Decoder network to project the concatenation of $h_c$ and $e_c$ back to $p_c$. Then maximizing the log-likelihood term is approximately equal to minimizing the mean-square error (MSE) between the reconstructed probability and the real one.
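The sampling step and the two loss terms of this section can be sketched as follows. The sketch assumes that the Encoder has already produced a mean and a log standard deviation and that the Decoder has already produced a reconstructed probability. Both networks are omitted, and all function names and toy values are hypothetical.

```python
import math
import random


def reparameterize(mu, log_sigma, rng):
    """Sample h = mu + sigma * eps element-wise, with eps ~ N(0, I)."""
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mu, log_sigma)]


def kl_to_standard_normal(mu, log_sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(math.exp(2 * s) + m * m - 1.0 - 2 * s
                     for m, s in zip(mu, log_sigma))


def posvae_loss(p_true, p_recon, mu, log_sigma):
    """KL term plus squared error between the reconstructed and the
    classifier-predicted probability (the Gaussian likelihood assumption)."""
    return kl_to_standard_normal(mu, log_sigma) + (p_true - p_recon) ** 2


rng = random.Random(0)
mu, log_sigma = [0.3, -0.2], [0.0, 0.0]  # hypothetical Encoder outputs
h = reparameterize(mu, log_sigma, rng)   # sampled initial node state
loss = posvae_loss(p_true=0.8, p_recon=0.7, mu=mu, log_sigma=log_sigma)
```

Sampling through `mu + sigma * eps` keeps the node state differentiable with respect to the Encoder outputs, which is the purpose of the reparameterization trick.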
3.4 Relational Graph Convolutional Network
The Graph Convolutional Network (GCN) [kipf2016gcn] was first introduced as a method to perform semi-supervised classification. Its core idea is to design a message passing mechanism for graph data so that the node features can be updated recursively and become more suitable to the desired task. Different from the conventional convolution that operates on Euclidean data such as images, a GCN works on graph-structured data that contain nodes and edges but no fixed geometric layout. Given the states $H^{(l)}$ of the current layer and the adjacency matrix $A$, the states of the $(l+1)$-th layer can be calculated as:
$$H^{(l+1)} = \sigma\big(\hat{A} H^{(l)} W^{(l)}\big),$$
where $W^{(l)}$ is the learnable weight matrix of the $l$-th layer, $\hat{A}$ is the normalized version of $A$ [kipf2016gcn], and $\sigma(\cdot)$ is a non-linear activation function.
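A single GCN propagation step can be sketched as below. The sketch assumes the symmetric normalization $D^{-1/2} A D^{-1/2}$ of [kipf2016gcn] applied to an adjacency matrix that already contains self-connections, and ReLU as the activation. The helper names and the toy graph are hypothetical.

```python
import math


def matmul(x, y):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2}.
    adj is assumed to already include self-connections."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(adj_hat, h, w):
    """One propagation step: ReLU(adj_hat @ h @ w)."""
    z = matmul(matmul(adj_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]


# Toy graph: two connected nodes with self-connections, 1-d input states.
adj = [[1, 1], [1, 1]]
h0 = [[2.0], [4.0]]
w0 = [[1.0, -1.0]]  # hypothetical weights mapping 1-d states to 2-d states
h1 = gcn_layer(normalize_adjacency(adj), h0, w0)
```

In this toy graph both nodes end up with the same state because each one averages over the same neighborhood.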
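In traditional multi-label classification models, the matrix $A$ is often represented by the co-occurrence probabilities between labels [chen2019ssgrl, chen2019mlgcn]. However, in the zero-shot learning setting, we do not have this information. Thus we set $A$ as a binary adjacency matrix and learn a new matrix $R \in \mathbb{R}^{|\mathcal{Y}_S \cup \mathcal{Y}_I| \times |\mathcal{Y}_S \cup \mathcal{Y}_I|}$ to indicate the weight of each edge ($R \in \mathbb{R}^{|\mathcal{Y} \cup \mathcal{Y}_I| \times |\mathcal{Y} \cup \mathcal{Y}_I|}$ for testing). In order to learn the weight matrix, we propose the relational graph convolutional network (RGCN) as follows. Specifically, the weight of the edge between a pair of classes is calculated by first concatenating their word embeddings together and feeding them into an MLP network to generate a scalar which represents the weight. In this way, we are able to calculate the weight between seen and unseen classes. An entry of the weight matrix is derived by:
$$r_{ij} = \mathrm{MLP}\big([e_i ; e_j]\big),$$
where $e_i$ and $e_j$ are the word embeddings of classes $i$ and $j$, and $[\cdot\,;\cdot]$ denotes concatenation. Elements on the diagonal of $R$ are set to 1 to include self-connections. After obtaining the weight matrix $R$, we calculate $\tilde{A} = A \odot R$, where $\odot$ is element-wise multiplication. With the new matrix $\tilde{A}$, the rest of RGCN is the same as the traditional GCN [kipf2016gcn]. After obtaining the node states of each node corresponding to the seen classes $\mathcal{Y}_S$, we apply a final MLP layer to the node states to calculate the probability of each label, i.e., $\hat{y}_c = \mathrm{Sigmoid}\big(\mathrm{MLP}(h_c^{(L)})\big)$, where $L$ is the last layer of RGCN. We train the relational graph convolutional network (RGCN) with class-wise binary cross-entropy loss:
$$\mathcal{L}_{BCE} = -\sum_{c \in \mathcal{Y}_S} \Big[\mathbb{1}(c \in y) \log \hat{y}_c + \big(1 - \mathbb{1}(c \in y)\big) \log\big(1 - \hat{y}_c\big)\Big],$$
where $\mathbb{1}(c \in y)$ is an indicator function that outputs 1 if the input image contains label $c$, otherwise 0. The whole model is trained by minimizing the joint loss function:
$$\mathcal{L} = \mathcal{L}_{BCE} + \lambda \mathcal{L}_{PosVAE},$$
where $\mathcal{L}_{PosVAE}$ is the variational objective of Section 3.3 and $\lambda$ is a hyper-parameter to balance the two terms.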
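The edge-weighting step and the classification loss of this section can be sketched as follows. The sketch assumes a two-layer MLP with ReLU as the relation network and a binary cross-entropy summed over classes. The function names, the toy embeddings, and the fixed weights are hypothetical, and in the actual model the weights would be learned.

```python
import math


def mlp_score(pair, w1, w2):
    """Two-layer MLP mapping a concatenated embedding pair to a scalar."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, pair))) for row in w1]
    return sum(w * h for w, h in zip(w2, hidden))


def weighted_adjacency(adj, emb, w1, w2):
    """Element-wise product of the binary adjacency and the relation weights,
    where r_ij = MLP([e_i ; e_j]) and the diagonal weights are fixed to 1."""
    n = len(adj)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            r = 1.0 if i == j else mlp_score(emb[i] + emb[j], w1, w2)
            out[i][j] = adj[i][j] * r
    return out


def bce_loss(probs, targets):
    """Class-wise binary cross-entropy, summed over the classes."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for p, t in zip(probs, targets))


emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 2-d class embeddings
adj = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]     # binary graph with self-connections
w1 = [[1, 0, 0, 1], [0, 1, 1, 0]]           # hypothetical relation-network weights
w2 = [0.5, 0.5]
a_tilde = weighted_adjacency(adj, emb, w1, w2)
loss = bce_loss([0.5, 0.5], [1, 0])
```

Because the weights depend only on word embeddings, the same relation network can score edges that touch unseen classes at test time, even though no unseen label is observed during training.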
4.1 Datasets and Setting
Datasets. In our experiments, we use two multi-label datasets, i.e., MS-COCO [lin2014mscoco] and NUS-WIDE [chua2009nuswide]. MS-COCO [lin2014mscoco] is a large-scale dataset commonly used for multi-label image classification, object detection, instance segmentation and image captioning. We adopt the 2014 challenge, which contains 79,465 training images and 40,137 testing images (after removing images without labels). Among its 80 classes, we randomly select 16 unseen classes and make sure that the unseen classes do not overlap with the ImageNet classes [ILSVRC15ImageNet]. The 16 unseen classes are ('bicycle', 'boat', 'stop sign', 'bird', 'backpack', 'frisbee', 'snowboard', 'surfboard', 'cup', 'fork', 'spoon', 'broccoli', 'chair', 'keyboard', 'microwave', 'vase'). NUS-WIDE [chua2009nuswide] is a web-crawled dataset with 54,334 training images and 42,486 testing images (after removing images without labels). Among the 81 classes of NUS-WIDE [chua2009nuswide], we randomly select 16 unseen classes and make sure that the unseen classes do not overlap with the ImageNet classes [ILSVRC15ImageNet]. The 16 unseen classes are ('airport', 'cars', 'food', 'fox', 'frost', 'garden', 'mountain', 'police', 'protest', 'rainbow', 'sun', 'tattoo', 'train', 'water', 'waterfall', 'window').
Graph construction. We calculate the WUP [wu1994wup] similarity between each pair of classes, and an edge is added to the graph if the corresponding WUP similarity is greater than a specified threshold (0.5 in all of our experiments). When adding the ImageNet [ILSVRC15ImageNet] classes into either the MS-COCO [lin2014mscoco] or NUS-WIDE [chua2009nuswide] graph, we exclude any class existing in the target dataset from ImageNet to ensure the additional nodes are distinct from the nodes in the two target datasets.
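The thresholding rule above can be sketched in a few lines. The similarity function is passed in as an argument, and the values in `toy_sim` are made up for illustration and are not real WUP scores. In practice the similarities could come from a WordNet toolkit such as NLTK's `Synset.wup_similarity`, which is an assumption about tooling that the paper does not state.

```python
from itertools import combinations


def build_adjacency(classes, similarity, threshold=0.5):
    """Binary adjacency with self-connections; an edge is added when the
    pairwise similarity is greater than the threshold."""
    index = {c: i for i, c in enumerate(classes)}
    n = len(classes)
    adj = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for a, b in combinations(classes, 2):
        if similarity(a, b) > threshold:
            adj[index[a]][index[b]] = 1
            adj[index[b]][index[a]] = 1
    return adj


# Hypothetical similarity values, for illustration only.
toy_sim = {
    frozenset(("sheep", "bighorn sheep")): 0.9,
    frozenset(("sheep", "truck")): 0.3,
    frozenset(("bighorn sheep", "truck")): 0.3,
}


def lookup(a, b):
    return toy_sim[frozenset((a, b))]


classes = ["sheep", "bighorn sheep", "truck"]
adj = build_adjacency(classes, lookup, threshold=0.5)
```

With these toy values, only the pair of sheep classes is connected, while the truck node keeps just its self-connection.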
Evaluation metrics. We adopt two evaluation metrics, i.e., mean average precision (mAP), which evaluates the model performance on the class level, and mean image average precision (miAP), which evaluates the model performance on the image level. We calculate the mAP and miAP for two sets of labels, i.e., seen classes $\mathcal{Y}_S$ and unseen classes $\mathcal{Y}_U$.
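The difference between the two metrics can be illustrated with a small sketch. It assumes the standard non-interpolated average precision and skips rows without any positive label. The exact AP variant used in the paper is not stated, and the function names and toy scores are hypothetical.

```python
def average_precision(scores, labels):
    """Non-interpolated AP: mean of precision@k over the ranks of the positives.
    Returns None when there is no positive label."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else None


def mean_ap(score_matrix, label_matrix, level):
    """level='image': rank classes within each image (miAP).
    level='class': rank images within each class (mAP).
    Rows of the input matrices are images, columns are classes."""
    if level == "class":
        score_matrix = list(zip(*score_matrix))
        label_matrix = list(zip(*label_matrix))
    aps = [average_precision(s, l) for s, l in zip(score_matrix, label_matrix)]
    aps = [ap for ap in aps if ap is not None]
    return sum(aps) / len(aps)


# Toy predictions for 2 images and 3 classes.
scores = [[0.9, 0.2, 0.8],
          [0.1, 0.7, 0.6]]
labels = [[1, 0, 0],
          [0, 1, 1]]
miap = mean_ap(scores, labels, level="image")
map_ = mean_ap(scores, labels, level="class")
```

In this toy case every image ranks its own labels first, so miAP is perfect, while the third class ranks a negative image above its positive one, which lowers mAP. The two metrics can therefore disagree on the same predictions.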
Baselines. We use three models from previous work and some variants of our proposed method as baselines. Visual-semantic [akata2013ALE] was proposed for single-label zero-shot classification; it projects the visual features of images to the semantic space of classes and uses the class embeddings as classifiers. Here we change the loss function from multi-class cross-entropy to binary cross-entropy applied to each class individually. Fast0Tag [zhang2016fast] uses class embeddings as projection vectors to rank the closeness between images and classes, and applies a triplet-based ranking loss to train the model. SKG [lee2018multi] constructs a structural knowledge graph from labels and applies GGNN [li2015ggnn] to infer the existence of each label. It also uses WUP similarities on WordNet [miller1995wordnet] to construct the graph as we do. RGCN is a basic version of our proposed model which does not include external nodes from ImageNet [ILSVRC15ImageNet]. RGCN-XL is another variant of our proposed model, where the ImageNet [ILSVRC15ImageNet] nodes are also included in the graph but treated the same as the other nodes in the target dataset. For ImageNet nodes, we predict their probabilities in the same way as the other classes in the target dataset, but train them with MSE to reconstruct the probability distribution of the pretrained ImageNet classifier. RGCN-PosVAE is our full model which uses PosVAE to infer the hidden states of ImageNet classes and applies RGCN to propagate information among different classes.
Implementation details. For SKG [lee2018multi] we use the code provided by the authors and modify it to fit our pipeline, and we implement all other baselines. We use ResNet-101 [he2016resnet] pretrained on ImageNet [ILSVRC15ImageNet] as the feature extractor. We use the 300-dimensional pretrained GloVe [pennington2014glove] vectors to represent class embeddings. We implement the encoder and decoder of PosVAE as MLP networks with a hidden layer of size 256, while the relation network that computes the edge weights $r_{ij}$ is implemented as a two-layer MLP network with a hidden size of 256. We use two layers of RGCN with a feature dimension of 256. We use an Adam [kingma2014adam] optimizer and set the initial learning rate as and decay it by 0.1 when the loss plateaus. The loss-balancing hyper-parameter $\lambda$ is set to 1 for our model. The random seed for all experiments is fixed to 42.
The main results are shown in Table 1. As we can see from the last row of the table, our proposed method achieves the highest miAP on unseen classes in both datasets. We achieve the highest unseen mAP on MS-COCO [lin2014mscoco] and the second highest unseen mAP on NUS-WIDE [chua2009nuswide]. Although our model has 1.64% lower unseen mAP than visual-semantic [akata2013ALE] on NUS-WIDE [chua2009nuswide], our unseen miAP is 7.77% higher than visual-semantic [akata2013ALE]. For the MS-COCO [lin2014mscoco] dataset, our model performs better than SKG [lee2018multi] by a clear margin of about 3.5% unseen miAP and 2.5% unseen mAP. On NUS-WIDE [chua2009nuswide], our RGCN-PosVAE outperforms the second best Fast0Tag [zhang2016fast] by around 2% unseen miAP and 0.6% unseen mAP.
Fast0Tag [zhang2016fast] achieves higher seen and unseen miAP on both datasets than visual-semantic [akata2013ALE], while visual-semantic [akata2013ALE] performs better on seen mAP than Fast0Tag [zhang2016fast] on both datasets. SKG [lee2018multi] performs better than the previous two baselines on MS-COCO [lin2014mscoco], but is slightly worse than Fast0Tag [zhang2016fast] on NUS-WIDE [chua2009nuswide]. The RGCN baseline, which is the same as our full model but without incorporating ImageNet nodes into the graph, still outperforms Fast0Tag [zhang2016fast] on MS-COCO, and has comparable results with SKG [lee2018multi] on NUS-WIDE. By comparing the seen and unseen class performance of each model, we can see that other baselines generally achieve better performance on seen classes than our proposed model, but have lower performance on unseen classes, which indicates that our proposed model is less likely to overfit seen classes and better able to generalize to unseen classes than others.
The basic RGCN model achieves higher unseen mAP than SKG [lee2018multi] on MS-COCO and slightly lower unseen miAP on NUS-WIDE [chua2009nuswide]. But their results are still comparable, which shows that our proposed RGCN itself is already a very strong baseline. RGCN-XL, which is an extended RGCN with ImageNet nodes, achieves higher unseen miAP than RGCN, which shows the potential of utilizing the semantic information of ImageNet classes, although in a simple fashion. It can also be noted that RGCN-XL has much lower performance on seen classes on MS-COCO. The reason may be that the brute-force way of incorporating ImageNet classes will make the predictions on seen classes harder, since the input images do not actually have those ImageNet classes while the model is trying to treat them the same as the classes in the target dataset when initializing their node states. Meanwhile, the proposed RGCN-PosVAE model, which infers the initial states of ImageNet classes using variational inference, achieves higher scores on both seen and unseen classes than RGCN-XL, which demonstrates the advantages of using the proposed method to initialize the hidden states of ImageNet classes.
4.3 Ablation Study
Effect of the number of GCN layers. We first investigate how the number of RGCN layers affects our model’s performance on the MS-COCO dataset [lin2014mscoco]. We try different numbers of layers from 1 up to 6, and plot the seen and unseen classes miAP in Figure 3. As we can see, the miAP performance on seen classes does not vary a lot with different numbers of RGCN layers. On the other hand, two or three layers of RGCN have similar performance on unseen classes, while the version with only one layer has miAP about 2% lower than the highest. After increasing the number of layers to 4, 5 and 6, the model’s performance on unseen classes drops, which indicates that the model starts to become more overfitted to seen classes.
Effect of the WUP similarity threshold. We also investigate how the threshold of WUP similarity can affect our model’s performance. We tune the threshold from 0.1 to 0.8, and show the result as well as the number of edges generated by each threshold in Figure 3. As we can see, although setting a low threshold will include some noisy edges that may be misleading, our model still achieves a good performance on unseen miAP, with less than 1% decrease from the best one. As the threshold becomes larger than 0.5, we can see that there is an obvious drop of performance on unseen miAP. This is because the number of edges is much smaller when the threshold is too large, and thus the model cannot make good reference to seen classes when predicting unseen classes. Overall, our model’s performance on seen and unseen classes is still relatively robust with respect to the WUP [wu1994wup] similarity threshold when the threshold is less than 0.6.
In this paper, we tackle the multi-label zero-shot learning (ML-ZSL) problem by first pointing out the potential semantic gap between seen and unseen classes, and propose to incorporate external ImageNet classes to help predict the seen and unseen classes. Specifically, we construct a knowledge graph consisting of classes from both ImageNet and the target dataset (e.g., MS-COCO [lin2014mscoco] and NUS-WIDE [chua2009nuswide]), design a PosVAE network to infer the node states of ImageNet classes, and learn a relational graph convolutional network (RGCN) which calculates the propagation weights between each pair of classes given their GloVe [pennington2014glove] embeddings. Experiments show that our proposed method has a clear advantage in predicting unseen classes.