Transductive Zero-Shot Learning with Visual Structure Constraint

01/06/2019 ∙ by Ziyu Wan, et al. ∙ City University of Hong Kong ∙ Association for Computing Machinery

Zero-shot learning (ZSL) aims to recognize objects of unseen classes, whose instances have not been seen during training. Seen and unseen classes are associated through a common semantic space, and each data instance is represented by its visual features. Most existing methods first learn a compatible projection function between the semantic space and the visual space based on the data of the source seen classes, then directly apply it to the target unseen classes. However, in real scenarios, the data distributions of the source and target domain might not match well, causing the well-known domain shift problem. Based on the observation that the visual features of test instances can be separated into different clusters, we propose a visual structure constraint on class centers for transductive ZSL to improve the generality of the projection function (i.e., to alleviate the above domain shift problem). Specifically, two different strategies (symmetric Chamfer distance and bipartite matching) are adopted to align the projected unseen semantic centers with the visual cluster centers of the test instances. Experiments on three widely used datasets demonstrate that the proposed visual structure constraint brings substantial performance gains consistently and achieves state-of-the-art results.


1 Introduction

With the development of deep learning and the emergence of large-scale training datasets like ImageNet [7], significant progress has been made on visual recognition tasks in the past several years [34, 35, 16]. However, labeling such training datasets is difficult and labor intensive. Moreover, it is unrealistic to cover all object categories and find enough images for each category with modern search engines. Taking ImageNet as an example, it consists of 21,814 classes in total with 14M images, of which 21K object classes contain only a handful of images. For the classes with limited training samples, existing visual recognition models struggle to make correct predictions, and for new classes unseen in the training dataset, these models cannot work at all.

By contrast, zero-shot learning (ZSL) [1, 3, 2, 39, 14, 48, 6, 40, 38, 4] only requires labelled images for seen classes (source domain) and requires no images for unseen classes (target domain). These two domains often share a common semantic space, which defines how unseen classes are semantically related to seen classes. The most popular semantic space used by existing works is based on semantic attributes, where each seen or unseen class is represented by an attribute vector. Besides the semantic space, image contents of the source and target domain are also related and represented in a visual feature space. Thanks to the powerful representation ability of deep neural networks, most state-of-the-art methods use a pretrained CNN to extract high-level features as the visual representation.

To associate the semantic space and the visual space, existing methods often rely on data from the source domain to learn a compatible projection function that maps one space to the other, or two compatible projection functions that map both spaces into a common intermediate space. At test time, to recognize an image in the target domain, the semantic vectors of all unseen classes and the visual feature of this image are projected into the embedding space using the learned function, and then nearest neighbor (NN) search is performed to find the best matching class.

However, due to the distribution difference between the source and target domain in most real scenarios, the learned projection function often suffers from the well-known domain shift problem [12]. To compensate for this domain gap, transductive zero-shot learning [13, 46] assumes that the semantic information (e.g. attributes) of unseen classes and the visual features of all test images are known in advance. Different ways to better leverage this extra information have been investigated in [13, 17, 31, 46], such as domain adaptation [17] and label propagation [47].

In this paper, we propose a visual structure constraint for transductive ZSL to improve the generality of the learned projection function. Our key insight is that, though the class annotations of test images are unknown in ZSL, pretrained CNN features can still separate these images well. Taking the t-SNE visualization in Figure 1 as an example, the test (target domain) images of the AwA2 dataset can be well separated into different clusters. The relationships between these cluster centers, also called visual centers, define a structure in the visual space. We believe that if this visual structure can be preserved during the projection, the learned projection function will generalize better to the target domain, thus alleviating the domain shift problem.

Considering the prior that all these images come from the unseen classes of the target domain, there should exist correspondences between these visual centers and the unseen semantic classes. Here, we adopt the visual space as the embedding space and project the semantic space into it, which has also been shown to help alleviate the hubness problem [29] in [43]. To learn the projection function, we not only use the projection constraint on source domain data as in [43] but also impose the aforementioned visual structure constraint. Specifically, we first project all the unseen semantic classes into the visual space, then consider two different strategies to align the projected unseen semantic centers with the aforementioned visual centers. However, due to the lack of labels for test instances in the ZSL setting, we approximate these real visual centers with an unsupervised clustering algorithm (e.g. K-means).
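As a rough illustration of this step, the sketch below clusters the unlabeled target-domain features with K-Means and takes the resulting cluster centers as approximations of the real visual centers. It uses scikit-learn; the names test_features and num_unseen_classes are illustrative, not taken from the paper's code.

```python
# A minimal sketch (not the authors' code) of approximating the real visual
# centers of the unseen classes by clustering CNN features of the unlabeled
# test images. `test_features` is an (N, 2048) array of pretrained-CNN
# features; `num_unseen_classes` is the number of target classes.
import numpy as np
from sklearn.cluster import KMeans

def approximate_visual_centers(test_features: np.ndarray,
                               num_unseen_classes: int) -> np.ndarray:
    """Cluster unlabeled target-domain features and return the cluster
    centers as approximations of the (unknown) real class centers."""
    kmeans = KMeans(n_clusters=num_unseen_classes, n_init=10, random_state=0)
    kmeans.fit(test_features)
    return kmeans.cluster_centers_  # shape: (num_unseen_classes, 2048)
```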

Inspired by previous 3D point cloud alignment work [9], the first strategy adopts the symmetric Chamfer distance to measure the discrepancy between these two sets, where each center of one set finds the nearest center in the other set. However, we will show that many-to-one matching may occur in this strategy. Given the prior that these two sets should conform to a strict one-to-one matching principle in ZSL, the second strategy uses a bipartite matching algorithm to obtain a global minimum distance between the two sets. Both types of visual structure constraint can be modeled as an additional loss term, so they can potentially be incorporated into other end-to-end ZSL methods easily. Note that we also replace the instance-based projection distance in [43] (from the visual feature of each instance to the projected semantic centers) with a center-based projection distance (from the visual feature centers of instances to the projected semantic centers), which is more computationally efficient.

To demonstrate the effectiveness of the proposed visual structure constraint, we evaluate it on three widely used datasets: AwA2 [21], CUB [36] and SUN [28]. Experiments demonstrate that the visual structure constraint consistently brings substantial performance gains and achieves state-of-the-art results. We also provide in-depth analysis and visualization, which may offer researchers in this field further insights.

To summarize, our contributions are three-fold as below:

  • We have proposed two different types of visual structure constraint for transductive ZSL, which help to alleviate the domain shift problem.

  • Experiments demonstrate that the proposed visual structure constraint can bring substantial performance gain consistently and achieve state-of-the-art results.

  • Compared to previous instance-based optimization objective, our center-based optimization objective is much more computationally efficient.

2 Related Work

In the past several years, deep supervised learning has achieved enormous success on the image recognition task [34, 35, 16]. However, it relies on large-scale human annotations and cannot generalize to new classes. Zero-shot learning bridges the gap between training seen classes and test unseen classes via different kinds of semantic spaces. Among them, the most popular and effective one is the attribute-based semantic space [10, 1, 20]. The attributes are often designed by experts and are reliable and effective. To incorporate more attributes for fine-grained recognition tasks, text-description-based semantic spaces have been proposed [30, 43, 8, 48], which provide a more natural language interface. Compared to these two labor-intensive types of semantic spaces, word-vector-based methods [11, 25, 27, 38] can learn the semantic space from large text corpora automatically and save much human labor. However, they often suffer from the visual-semantic discrepancy problem and achieve inferior performance. Note that, although the effectiveness of the proposed structure constraint is demonstrated only with the attribute semantic space by default, it should generalize to all these spaces.

To relate the visual features of test images and the semantic attributes of unseen classes, three different embedding spaces are used by existing zero-shot learning methods: the original semantic space, the original visual space, and a newly learned common intermediate embedding space. Specifically, they learn either a projection function from the visual space to the semantic space [21, 30, 11, 4] or from the semantic space to the visual space [17, 33, 43] in the first two cases, or two projection functions from the semantic and visual spaces to the common embedding space [5, 24, 45] respectively, which can be modeled as a regression or ranking problem solved by conventional methods or deep neural networks. Our method also uses the visual space as the embedding space, because it has been shown to help alleviate the hubness problem [29] in [43]. More importantly, our structure constraint is based on the separability of the visual features of unseen classes.

Recently, to solve the projection domain shift problem [13], transductive approaches [13, 17, 31, 46, 22] have been proposed to leverage the structure of the test-time unseen-class data in the learning procedure and obtain performance improvements. Unsupervised domain adaptation is used in [17] by incorporating target domain class projections as regularization in a sparse coding framework. In [13], transductive multi-class and multi-label zero-shot learning are proposed. [46] proposes a structured prediction approach by assuming that the unseen data can be visually clustered, whose underlying idea is closely related to our structure constraint. However, we use it as a loss function in the training stage to help learn a better projection function, whereas [46] uses it as a post-processing step in the test stage by modeling it as a MAP (maximum a posteriori) estimation problem.

3 Method

Problem Definition

In the ZSL setting, we have $N_s$ labeled source samples $\mathcal{D}_s = \{(x_i^s, y_i^s)\}_{i=1}^{N_s}$, where $x_i^s$ is an image and $y_i^s \in \mathcal{Y}_s$ is the corresponding label within $C_s$ total source classes. We are also given $N_t$ unlabeled target samples $\mathcal{D}_t = \{x_i^t\}_{i=1}^{N_t}$ that are from $C_t$ target classes $\mathcal{Y}_t$. According to the definition of ZSL, there is no overlap between the source seen classes and the target unseen classes, i.e. $\mathcal{Y}_s \cap \mathcal{Y}_t = \emptyset$. But they are associated in a common semantic space, which can be viewed as the knowledge bridge between the source and target domain. As explained before, we adopt the semantic attribute space in this paper, where each class $c$ is represented with a pre-defined auxiliary attribute vector $a_c$. The goal of ZSL is to build a recognition model that can predict the label $y^t$ given an image $x^t$, with no labeled training data for the target classes.

Besides the semantic representations, the images of the source and target domain are also related in a common visual space, where each image is represented by its corresponding visual feature. Because of the powerful representation ability of deep networks, most existing ZSL methods use a pretrained CNN to extract deep features. Whatever kind of algorithm is designed in these methods, their underlying working principle is the same, i.e. images that are similar in the visual space should belong to similar classes in the semantic space. To relate these two spaces, a joint embedding space is often first defined, and then two projection functions are learned to project the two spaces into this common space. Following [43], we use the original visual space as the embedding space in this paper, in which case only one projection function is needed. The key then becomes how to learn a better and more general projection function.

Figure 1: Visualization of the CNN feature distribution of 10 target unseen classes in the AwA2 dataset using t-SNE; the features can be clearly clustered around several real centers (stars). Squares (VCL) are synthetic centers projected by the projection function learned only from source domain data, which suffers from the domain shift problem. By incorporating our visual structure constraint, our method (BMVSc) helps to learn a better projection function, and the generated synthetic semantic centers are much closer to the real visual centers. Better viewed in colour.

Motivation

The intuition of our method stems from the observation that the unseen data instances of each class form a tight cluster in the visual space when the features are extracted from pretrained CNN models, as shown in Figure 1. Specifically, the intra-class distances are small and the inter-class distances are large. In other words, the separated clusters of unseen classes obtained from pretrained CNN models are already discriminative, and it should be very easy to classify unseen classes with these discriminative clusters. By regarding the mean of all feature vectors belonging to the same class as the class center, the classification result for a test instance can be easily determined by selecting the nearest class center.

Although the visual features of unseen instances are discriminative, the performance of existing ZSL methods on unseen classes is still far from satisfactory [41, 39]. What causes this contradiction? The reason is the domain shift phenomenon. In ZSL, the labels of unseen instances are inaccessible, so the class centers of unseen classes cannot be directly calculated. Previous ZSL methods take an alternative approach to address this problem: they first learn a projection function from attribute vectors to class centers on the source seen classes, then assume that the same projection function can be applied to map the given attribute vectors of unseen classes to synthetic class centers. However, this assumption may be invalid in ZSL due to the distribution difference between the source and target domain. An example is shown in Figure 1: the real centers are calculated by averaging the corresponding feature vectors in the target domain, while the synthetic centers (VCL) are obtained by first learning the projection function only with source domain data and then applying it to the target unseen classes. It can be seen that the synthetic centers deviate from the real class centers, which finally leads to inferior classification performance when searching for the nearest of these deviated centers.

Based on the above analysis, besides source domain data, our proposed method attempts to take advantage of the existing discriminative structure of the target unseen class clusters during the learning of the projection function, thus alleviating the domain shift problem. In this way, we want the learned projection function to align the structure of the synthetic centers to that of the real centers in the target domain.

3.1 Visual Center Learning (VCL)

In this section, we first introduce a baseline method which learns the projection function only with source domain data and directly applies it to target unseen classes to obtain synthetic class centers. Specifically, given an input image $x_i^s$ in the source domain, we first use a CNN feature extractor $\phi(\cdot)$ to convert each image into a $d$-dimensional feature representation $\phi(x_i^s) \in \mathbb{R}^d$. According to the aforementioned analysis, each class $i$ of the source domain should have a real visual center $c_i^s$, which is defined as the mean of all feature vectors in the corresponding class. For the projection function, a two-layer embedding network is utilized to transfer the source semantic attribute $a_i^s$ into the corresponding synthetic center $\tilde{c}_i^s$:

$\tilde{c}_i^s = W_2\,\sigma(W_1 a_i^s), \qquad c_i^s = \frac{1}{N_i}\sum_{j=1}^{N_s}\mathbb{1}[y_j^s = i]\,\phi(x_j^s),$   (1)

where $\mathbb{1}[\cdot]$ is the indicator function, $N_i$ is the number of source samples of class $i$, $\sigma(\cdot)$ denotes a non-linear operation (Leaky ReLU with negative slope of 0.2 by default), and $W_1$ and $W_2$ are the weights of the two fully connected layers to be learned.

Since the correspondence relationship is given in the source domain, we adopt the simple mean squared error (following [43]) as the loss function to minimize the discrepancy between the synthetic centers and the real centers in the visual feature space for all the source seen classes:

$\mathcal{L}_{MSE} = \frac{1}{C_s}\sum_{i=1}^{C_s}\left\|\tilde{c}_i^s - c_i^s\right\|_2^2 + \lambda\,\|W\|_2^2,$   (2)

where $\|W\|_2^2$ is the $\ell_2$-norm parameter regularizer decreasing the model complexity and $\lambda$ controls the tightness of this constraint; we set it empirically. Note that, different from [43], which trains with a large number of individual instances of each class, we choose to utilize a single cluster center to represent each object class and train the model with just a few center points. This choice is based on the observation that the instances of the same category form compact clusters, and it makes our method much more computationally efficient than [43].

When performing ZSL prediction, we first project the semantic attribute $a_u$ of each unseen class $u$ to its corresponding synthetic visual center $\tilde{c}_u$ using the learned embedding network as in Equation (1). Then, for a test image $x^t$, its classification result can be obtained by selecting the nearest synthetic center in the visual space. Formally,

$y^t = \arg\min_{u \in \mathcal{Y}_t}\left\|\phi(x^t) - \tilde{c}_u\right\|_2.$   (3)
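To make the baseline concrete, below is a minimal PyTorch sketch of our reading of Equations (1)-(3): a two-layer embedding network maps class attributes to synthetic visual centers, a center-based MSE loss compares them to the real source centers, and prediction picks the nearest synthetic center. Layer sizes and names are assumptions, not the authors' released code.

```python
# A hedged sketch of VCL, assuming 2048-dim visual features (ResNet-101) and
# attribute vectors of dimension `attr_dim`.
import torch
import torch.nn as nn

class AttributeEmbedding(nn.Module):
    """Two-layer embedding network from semantic attributes to visual centers."""
    def __init__(self, attr_dim: int, feat_dim: int = 2048):
        super().__init__()
        self.fc1 = nn.Linear(attr_dim, feat_dim)
        self.fc2 = nn.Linear(feat_dim, feat_dim)
        self.act = nn.LeakyReLU(0.2)  # negative slope 0.2, as stated in the text

    def forward(self, attributes: torch.Tensor) -> torch.Tensor:
        # attributes: (num_classes, attr_dim) -> synthetic centers: (num_classes, feat_dim)
        return self.fc2(self.act(self.fc1(attributes)))

def vcl_loss(synthetic_centers: torch.Tensor, real_centers: torch.Tensor) -> torch.Tensor:
    # Center-based MSE between synthetic and real source-class centers (Eq. 2);
    # the L2 regularizer can be realized via the optimizer's weight decay.
    return ((synthetic_centers - real_centers) ** 2).sum(dim=1).mean()

def predict(test_feature: torch.Tensor, unseen_synthetic_centers: torch.Tensor) -> int:
    # Eq. 3: the label is the index of the nearest synthetic center in visual space.
    dists = torch.cdist(test_feature.unsqueeze(0), unseen_synthetic_centers)
    return int(dists.argmin(dim=1).item())
```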

3.2 Chamfer-Distance-based Visual Structure Constraint(CDVSc)

As discussed earlier, because of the domain shift problem, there exists a discrepancy between the synthetic centers and the real centers for target unseen classes, which finally yields poor performance in the target domain. To alleviate this problem, we aim to leverage the underlying discriminative structure of the real centers to adjust the undesired structure of the synthetic centers in the target domain. In ZSL, although the real centers for unseen classes cannot be directly calculated, the fact that unseen instances form compact clusters provides a possible way to approximate the real centers using unsupervised clustering methods. In our method, we utilize standard K-means clustering to divide the unseen instances into $C_t$ clusters, and the cluster centers are regarded as the approximated real centers.

After obtaining the cluster centers, aligning the structure of the cluster centers to that of the synthetic centers can be formulated as reducing the distance between two unordered high-dimensional point sets. Inspired by work on 3D point clouds [9], a symmetric Chamfer-distance constraint is proposed to solve this structure matching problem:

$\mathcal{L}_{CD} = \sum_{u \in U}\min_{v \in V}\|u - v\|_2^2 + \sum_{v \in V}\min_{u \in U}\|u - v\|_2^2,$   (4)

where $V$ indicates the set of cluster centers of unseen classes obtained by the K-means algorithm, and $U$ represents the set of synthetic target centers obtained with the learned projection. Combining the above constraint, the final loss function to train the embedding network is defined as:

$\mathcal{L}_{CDVSc} = \mathcal{L}_{MSE} + \gamma\,\mathcal{L}_{CD}.$   (5)
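A minimal sketch of the symmetric Chamfer-distance term, under the reading of Equation (4) above: each synthetic center is pulled toward its nearest cluster center and vice versa. Tensor names are illustrative.

```python
import torch

def chamfer_loss(synthetic_centers: torch.Tensor,
                 cluster_centers: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between two sets of centers, both (C_t, d)."""
    # Pairwise squared Euclidean distances between the two center sets.
    dists = torch.cdist(synthetic_centers, cluster_centers) ** 2  # (C_t, C_t)
    # Nearest cluster center for each synthetic center, and the reverse direction.
    return dists.min(dim=1).values.mean() + dists.min(dim=0).values.mean()
```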

3.3 Bipartite-Matching-based Visual Structure Constraint(BMVSc)

Figure 2: Illustration of possible many-to-one matching problem in Chamfer-distance based visual structure constraint, which can be avoided in the bipartite matching based visual structure constraint.

CDVSc helps to preserve the structural similarity of the two sets, but many-to-one matching may sometimes occur with the Chamfer-distance constraint, as shown in Figure 2. This conflicts with the important prior in ZSL that the matching between synthetic and real centers should conform to a strict one-to-one principle. When undesirable many-to-one matching arises, the synthetic centers are pulled toward incorrect real centers, which finally results in inferior performance. To address this issue, we improve CDVSc into the bipartite-matching-based visual structure constraint (BMVSc), which finds a global minimum distance between the two sets while satisfying the strict one-to-one matching principle in ZSL.

We first consider a bipartite graph with two partitions $U$ and $V$, where $U$ is the set of all synthetic centers and $V$ contains all cluster centers of the target classes. Let $D_{ij}$ denote the Euclidean distance between $u_i \in U$ and $v_j \in V$, and let element $X_{ij}$ of the assignment matrix $X$ define the matching relationship between $u_i$ and $v_j$. To find a one-to-one minimum matching between real and synthetic centers, we formulate it as a min-weight perfect matching problem and optimize as follows:

$\min_{X}\ \sum_{i=1}^{C_t}\sum_{j=1}^{C_t} D_{ij}\,X_{ij} \quad \text{s.t.}\ \sum_{j} X_{ij} = 1,\ \ \sum_{i} X_{ij} = 1,\ \ X_{ij} \in \{0, 1\}.$   (6)

In this formulation, the assignment matrix $X$ strictly conforms to the one-to-one principle. To solve this linear programming problem, we employ the Kuhn-Munkres algorithm [19], whose time complexity is $O(n^3)$.

After getting the desired assignment matrix $X$, we calculate the matching distance between the two sets:

$\mathcal{L}_{BM} = \mathbf{1}^{\top}(D \odot X)\,\mathbf{1},$   (7)

where $\odot$ denotes the Hadamard product and $\mathbf{1}$ is an auxiliary vector filled with ones, used to compute the sum of the matrix entries.

Combining the MSE loss and this bipartite matching loss in the final loss function not only helps to learn a reliable projection function from semantic attributes to visual centers but also aligns the synthetic centers with the real centers while satisfying the one-to-one matching prior:

$\mathcal{L}_{BMVSc} = \mathcal{L}_{MSE} + \gamma\,\mathcal{L}_{BM}.$   (8)
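The bipartite matching term can be sketched as below, using the Kuhn-Munkres solver from SciPy (scipy.optimize.linear_sum_assignment). The assignment itself is non-differentiable, so it is computed on a detached distance matrix and gradients flow only through the distances of the matched pairs; this is our illustrative reading of Equations (6)-(8), not the authors' code.

```python
import torch
from scipy.optimize import linear_sum_assignment

def bipartite_matching_loss(synthetic_centers: torch.Tensor,
                            cluster_centers: torch.Tensor) -> torch.Tensor:
    """One-to-one matching distance between two (C_t, d) center sets."""
    # Pairwise Euclidean distance matrix D between the two sets.
    dists = torch.cdist(synthetic_centers, cluster_centers)  # (C_t, C_t)
    # Min-weight perfect matching (Kuhn-Munkres) on a detached copy of D.
    row_idx, col_idx = linear_sum_assignment(dists.detach().cpu().numpy())
    # Average distance over the one-to-one matched pairs (Eq. 7).
    return dists[torch.as_tensor(row_idx), torch.as_tensor(col_idx)].mean()
```

In training, this term would simply be added to the center-based MSE loss with a weight, mirroring Equation (8).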

4 Experiments

4.1 Experimental Settings

Datasets

To demonstrate the effectiveness of the proposed visual structure constraint, we have conducted extensive experiments on three widely used ZSL benchmark datasets, i.e., Animals with Attributes2 (AwA2) [21], Caltech-UCSD Birds 200-2011 (CUB) [36] and Scene UNderstanding (SUN) [28]. The statistics of these datasets are briefly introduced below:

  • Animals with Attributes2 (AwA2) [21] contains 37,322 images from 50 animal categories, where 40 of the 50 classes are used for training and the remaining 10 for testing. We adopt the class-level continuous 85-dim attributes as the semantic representations.

  • Caltech-UCSD Birds 200-2011 (CUB) [36] is a fine-grained bird dataset with 200 species of birds and 11,788 images. Each image is associated with a 312-dim continuous attribute vector. Following [39], we use the class-level attribute vector and the 150/50 split.

  • SUN-Attribute (SUN) [28] includes 14,340 images coming from 717 fine-grained scenes. Each sample is paired with a binary 102-dim semantic vector. We compute class-level continuous attributes as our semantic representations by averaging the image-level attributes for each class. Following [39], we use 645 classes of SUN for training and 72 classes for testing.

Data Splits:

1) Standard Splits (SS): The standard seen/unseen class split was first proposed in [20] and has been widely used in subsequent ZSL works [5, 23, 42, 46]. However, most recent ZSL methods extract visual features using CNN models pretrained on the ImageNet 1K classes, and the unseen classes in the standard splits overlap with these 1K classes, which violates the zero-shot setting that the test classes should be unseen during ZSL training. [41, 39] verify that the classification accuracies of existing ZSL methods significantly decrease on test classes that do not overlap with the ImageNet 1K classes. 2) Proposed Splits (PS): Based on this consideration, [41, 39] introduce the Proposed Splits (PS), in which the overlapping ImageNet classes are removed from the test set of unseen classes, so evaluation under these splits is more reasonable. In this paper, we report results on both the standard splits and the proposed splits for fair comparison.

Evaluation Metrics

Following previous work [41], the average per-class multi-way classification accuracy is adopted as our evaluation metric:

$acc = \frac{1}{|\mathcal{Y}|}\sum_{c=1}^{|\mathcal{Y}|}\frac{\#\,\text{correct predictions in class } c}{\#\,\text{samples in class } c}.$   (9)
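For reference, a short sketch of this per-class averaged accuracy (our reading of the metric in [41]) is given below; class accuracies are computed independently and then averaged so that large classes do not dominate.

```python
import numpy as np

def per_class_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Average of per-class top-1 accuracies over all classes present in y_true."""
    accs = []
    for c in np.unique(y_true):
        mask = (y_true == c)
        accs.append((y_pred[mask] == c).mean())
    return float(np.mean(accs))
```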

Implementation Details

By default, we adopt the widely used ResNet-101, pretrained on the ImageNet 1K classes, to extract visual features. The dimension of the extracted features is 2048, and all images are resized to the network's input resolution without any other data augmentation. The embedding network consists of two stacked fully connected layers (2048-2048) to project semantic attributes to visual centers. Both visual features and semantic attributes are L2-normalized. Our method is trained for 1000 epochs with the Adam optimizer. The learning rate is set to 0.0001 and the weight decay to 0.0005 for all datasets. The constraint weight $\gamma$ in Equations (5) and (8) is selected by cross-validation on the training set.
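The feature extraction step described above can be sketched as follows with torchvision; the exact preprocessing (resize size, normalization constants) is an assumption rather than a detail taken from the paper.

```python
# A minimal sketch: ImageNet-1K pretrained ResNet-101 with the classification
# head removed yields 2048-dim features, which are then L2-normalized.
import torch
import torch.nn.functional as F
from torchvision import models

resnet = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
backbone = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop the fc layer
backbone.eval()

@torch.no_grad()
def extract_features(images: torch.Tensor) -> torch.Tensor:
    # images: (B, 3, H, W) preprocessed batch -> (B, 2048) L2-normalized features
    feats = backbone(images).flatten(1)
    return F.normalize(feats, p=2, dim=1)
```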

4.2 Conventional ZSL Results

To demonstrate the effectiveness of the proposed visual structure constraint, we first compare our method with existing state-of-the-art methods in the conventional setting, where all test instances are assumed to belong to unseen classes. For our method, we report results with three different final loss functions: our visual center learning baseline (ours(VCL)), and our baseline plus the two types of visual structure constraint (ours(CDVSc), ours(BMVSc)). Based on whether the test data are used during training, the compared methods can be split into two categories: 1) Inductive methods: DAP [20], IAP [20], CONSE [27], SSE [44], DEVISE [11], SJE [3], ESZSL [32], SYNC [5], SCoRe [26], LDF [23], PSR-ZSL [4]; 2) Transductive methods: UDA [17], TMV [12], SMS [15], VZSL [37]. For convenience, all baseline results are cited from previous papers.

As shown in Table 1, with the two different types of visual structure constraint, our method obtains substantial performance gains consistently on all three datasets, whether using the standard splits (SS) or the proposed splits (PS), and outperforms previous state-of-the-art methods by a large margin. To further verify whether the proposed visual structure constraint improves the generality of the learned projection function, we also plot the projected synthetic centers in Figure 1. They clearly align with the real visual centers better than those of VCL, which learns the projection function only with source domain data, i.e., the domain shift problem is alleviated. BMVSc generally performs better than CDVSc, except on the CUB dataset.

AwA2 CUB SUN
Method Trans SS PS SS PS SS PS
DAP [20] N 58.7 46.1 37.5 40.0 38.9 39.9
IAP [20] N 46.9 35.9 27.1 24.0 17.4 19.4
CONSE [27] N 67.9 44.5 36.7 34.3 44.2 38.8
SSE [44] N 67.5 61.0 43.7 43.9 25.4 54.5
DEVISE [11] N 68.6 59.7 53.2 52.0 57.5 56.5
SJE [3] N 69.5 61.9 55.3 53.9 57.1 53.7
ESZSL [32] N 75.6 58.6 55.1 53.9 57.3 54.5
SYNC [5] N 71.2 46.6 54.1 55.6 59.1 56.3
SCoRe [26] N 69.5 59.5 61.0 51.7
LDF [23] N 70.3 69.2
PSR-ZSL [4] N 63.8 56.0 61.4
UDA [17] Y 39.5
TMV [12] Y 51.2 61.4
SMS [15] Y 59.2 60.5
VZSL [37] Y 66.5
VCL(ours) N 82.5 61.5 60.1 59.6 63.8 59.4
CDVSc(ours) Y 93.9 78.2 74.2 71.7 64.5 61.2
BMVSc(ours) Y 96.8 81.7 73.6 71.0 66.2 62.2
Table 1: Quantitative comparisons of multi-way classification accuracy (%) under standard splits (SS) and proposed splits (PS) in conventional ZSL setting, which have demonstrated the superior performance of our methods.

4.3 Generalized ZSL Results

AwA2 CUB SUN
Method u s H u s H u s H
DAP [20] 0.0 84.7 0.0 1.7 67.9 3.3 4.2 25.1 7.2
IAP [20] 0.9 87.6 1.8 0.2 72.8 0.4 1.0 37.8 1.8
CONSE [27] 0.5 90.6 1.0 1.6 72.2 3.1 6.8 39.9 11.6
SSE [44] 8.1 82.5 14.8 8.5 46.9 14.4 2.1 36.4 4.0
DEVISE [11] 17.1 74.7 27.8 23.8 53.0 32.8 16.9 27.4 20.9
SJE [3] 8.0 73.9 14.4 23.5 59.2 33.6 14.7 30.5 19.8
ESZSL [32] 5.9 77.8 11.0 12.6 63.8 21.0 11.0 27.9 15.8
SAE [18] 1.1 82.2 2.2 7.8 54.0 13.6 8.8 18.0 11.8
SYNC [5] 10.0 90.5 18.0 11.5 70.9 19.8 7.9 43.3 13.4
ALE [2] 14.0 81.8 23.9 23.7 62.8 34.4 21.8 33.1 26.3
VCL(ours) 21.4 89.6 34.6 15.6 86.3 26.5 10.4 63.4 17.9
CDVSc(ours) 66.9 88.1 76.0 37.0 84.6 51.4 27.8 63.2 38.6
BMVSc(ours) 71.9 88.2 79.2 33.1 86.1 47.9 29.9 62.9 40.6
Table 2: Quantitative comparisons of multi-way classification accuracy (%) of unseen (u) and seen (s) classes, and their harmonic mean (H), in the generalized ZSL setting. By imposing the visual structure constraint, our method obtains substantial performance gain over our VCL baseline and outperforms previous state-of-the-art methods by a large margin.

Though most existing ZSL methods assume that the test instances belong only to the unseen target classes, real user scenarios often require recognizing instances from both the source and the target classes. To further demonstrate the effectiveness of the proposed methods, we also apply our method to the more challenging generalized zero-shot learning (gZSL) task, where the test set contains samples from both the seen and unseen classes.

Following [41], we adopt the same data splits and denote $acc_s$ and $acc_u$ as the accuracy of images from the seen and unseen classes respectively. Moreover, the harmonic mean $H$, which weights $acc_s$ and $acc_u$ equally, is computed to measure ZSL performance in the generalized setting:

$H = \frac{2 \cdot acc_s \cdot acc_u}{acc_s + acc_u}.$   (10)
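A one-line helper makes the metric explicit; acc_seen and acc_unseen stand for the per-class accuracies on seen and unseen test classes.

```python
def harmonic_mean(acc_seen: float, acc_unseen: float) -> float:
    """Harmonic mean H of seen/unseen per-class accuracies (Eq. 10)."""
    if acc_seen + acc_unseen == 0:
        return 0.0
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)
```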

In Table 2, we compare our method with ten different generalized ZSL methods. Although almost all methods cannot maintain the same level of accuracy for both seen ($acc_s$) and unseen ($acc_u$) classes, our method with the visual structure constraint still significantly outperforms the other methods by a large margin on all three datasets (AwA2, CUB and SUN). More specifically, take CONSE [27] as an example: due to the domain shift problem, it achieves the best results on the source seen classes but totally fails on the target unseen classes. By contrast, since the two proposed structure constraints help to align the structure of the synthetic centers to that of the real unseen centers, our method achieves acceptable ZSL performance on the target unseen classes.

4.4 Qualitative Results

Figure 3: Qualitative results of BMVSc on 6 categories of the AwA2 and CUB datasets. We list the top-6 images classified to each category. Misclassified images are marked with red bounding boxes, and the correct category name is shown below the corresponding image.

In Figure 3, we show some qualitative results of the proposed BMVSc on the AwA2 and CUB datasets. Although the test images of each class have quite different overall appearances, the projection function learned by our method can still capture important discriminative semantic information from their visual characteristics, which corresponds to their semantic attributes. For example, the predicted sheep images in AwA2 all share the furry, bulbous and hooves attributes. However, we also observe some misclassified images, such as the walrus in row 6 of AwA2. After careful analysis, we find two possible reasons: 1) The discriminative ability of the pretrained CNN is not sufficient to separate the visual appearances of overly similar categories. In fact, the visual appearances of seal and walrus are so close that even people could hardly distinguish them without expert knowledge. This problem can only be solved by more powerful visual features. 2) Some attribute annotations are not accurate enough. For example, seals actually have spots while walruses do not, yet both categories are annotated with this attribute. Such incorrect supervision misleads the learning of the projection function.

4.5 More Analysis

AwA2 CUB SUN
K-Means 75.0 67.4 57.6
VCL 61.5 59.6 59.4
CDVSc 78.2 71.7 61.2
BMVSc 81.7 71.0 62.2

Table 3: Analysis demonstrating the importance of unsupervised cluster centers and semantic attributes. By combining these two types of information during training, our CDVSc and BMVSc achieve better results than both the K-Means upper bound and VCL.

Importance of unsupervised cluster centers and semantic attributes.

In our method, two different types of knowledge are leveraged to recognize target domain images: unsupervised cluster centers and semantic attributes. To study the importance of these two components, we design a simple voting algorithm to calculate the upper bound of unsupervised clustering for ZSL recognition. Specifically, we assume the ground-truth label of each unseen instance is accessible. Then, for each cluster obtained by K-means, we predict its category through a voting process, i.e., its category is the one to which most images in the cluster belong. Finally, the classification result for each test instance is directly set to the label of its corresponding cluster. Because the ground-truth information has already been used, this can be viewed as the upper bound of the K-means clustering algorithm. As shown in Table 3, its performance is even better than our baseline VCL, which demonstrates that the unsupervised clustering information is very useful. By combining the semantic attributes and this unsupervised cluster information during the learning process, our CDVSc and BMVSc both exceed the upper bound of K-Means as well as VCL.
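A minimal sketch of this voting-based upper bound is given below; ground-truth labels are used only for this analysis, and the names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_voting_upper_bound(test_features: np.ndarray, y_true: np.ndarray,
                              num_unseen_classes: int) -> float:
    """Label each K-Means cluster with its majority ground-truth class and
    report the resulting accuracy, i.e. the clustering upper bound."""
    assignments = KMeans(n_clusters=num_unseen_classes, n_init=10,
                         random_state=0).fit_predict(test_features)
    y_pred = np.empty_like(y_true)
    for k in range(num_unseen_classes):
        members = (assignments == k)
        if not members.any():
            continue
        values, counts = np.unique(y_true[members], return_counts=True)
        y_pred[members] = values[counts.argmax()]  # majority vote within the cluster
    return float((y_pred == y_true).mean())
```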

Figure 4: Matching matrices between the projected semantic centers and visual cluster centers for CDVSc (left) and BMVSc (right) on the AwA2 dataset. BMVSc guarantees strict one-to-one matching, while CDVSc may produce many-to-one matching.

Possible many-to-one problem in CDVSc.

To verify that the many-to-one matching phenomenon may occur during the training stage of CDVSc, we randomly select the output of the embedding network at one epoch and visualize the matching results of CDVSc on the AwA2 dataset in Figure 4. It can be seen that one projected semantic center can be matched by multiple visual cluster centers, and vice versa. By contrast, BMVSc guarantees strict one-to-one matching, which satisfies the prior of ZSL recognition in the target domain. This may explain the better results of BMVSc on this dataset shown in Table 1 and Table 2.

Progressive improvement of center matching in BMVSc.

During the training of BMVSc, we match the projected semantic centers with the visual cluster centers from K-Means and minimize their matching distance at each epoch. However, the final ZSL recognition performance relies on the matching between the projected semantic centers and the real visual centers, and on their distance. A natural question is therefore whether we can achieve this final objective by training with the cluster centers from K-means. To answer this question, we calculate the number of correctly matched centers and the distance between the projected semantic centers and the real visual centers at different training epochs of BMVSc. We plot these two metrics on the SUN test set in Figure 5. Clearly, BMVSc steadily improves the matching between the projected semantic centers and the real visual centers while only using the cluster centers from K-means in the proposed visual structure constraint.

Figure 5: The number of correctly matched centers and the distance between the projected semantic centers and the real visual centers during the training of BMVSc on the SUN dataset.

5 Conclusion

Due to the lack of labelled images in the target domain, most existing ZSL methods learn the projection function that associates the semantic space and the visual space only from source domain data. However, this often causes the domain shift problem and hurts recognition performance in the target domain. To alleviate this problem, we propose two different types of visual structure constraint, which constrain the learned projection function to align the projected semantic centers with visual cluster centers generated by unsupervised clustering algorithms. Experiments demonstrate that the proposed visual structure constraints consistently bring substantial performance gains on three different benchmark datasets and outperform previous state-of-the-art methods by a large margin.

References

  • [1] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for attribute-based classification. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 819–826. IEEE, 2013.
  • [2] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 38(7):1425–1438, 2016.
  • [3] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In CVPR, pages 2927–2936, 2015.
  • [4] Y. Annadani and S. Biswas. Preserving semantic relations for zero-shot learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [5] S. Changpinyo, W.-L. Chao, B. Gong, and F. Sha. Synthesized classifiers for zero-shot learning. In CVPR, pages 5327–5336, 2016.
  • [6] L. Chen, H. Zhang, J. Xiao, W. Liu, and S.-F. Chang. Zero-shot visual recognition using semantics-preserving adversarial embedding networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. Ieee, 2009.
  • [8] M. Elhoseiny, B. Saleh, and A. Elgammal. Write a classifier: Zero-shot learning using purely textual descriptions. In Proceedings of the IEEE International Conference on Computer Vision, pages 2584–2591, 2013.
  • [9] H. Fan, H. Su, and L. Guibas. A point set generation network for 3d object reconstruction from a single image. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2463–2471, July 2017.
  • [10] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1778–1785. IEEE, 2009.
  • [11] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. Devise: A deep visual-semantic embedding model. In NIPS, pages 2121–2129, 2013.
  • [12] Y. Fu, T. M. Hospedales, T. Xiang, Z. Fu, and S. Gong. Transductive multi-view embedding for zero-shot recognition and annotation. In ECCV, pages 584–599, 2014.
  • [13] Y. Fu, T. M. Hospedales, T. Xiang, and S. Gong. Transductive multi-view zero-shot learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 37(11):2332–2345, 2015.
  • [14] Y. Fu and L. Sigal. Semi-supervised vocabulary-informed learning. In CVPR, pages 5337–5346, 2016.
  • [15] Y. Guo, G. Ding, X. Jin, and J. Wang. Transductive zero-shot recognition via shared model space learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pages 3494–3500. AAAI Press, 2016.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [17] E. Kodirov, T. Xiang, Z. Fu, and S. Gong. Unsupervised domain adaptation for zero-shot learning. In ICCV, pages 2452–2460, 2015.
  • [18] E. Kodirov, T. Xiang, and S. Gong. Semantic autoencoder for zero-shot learning. In CVPR, 2017.
  • [19] H. W. Kuhn. The hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83–97, 1955.
  • [20] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, pages 951–958, 2009.
  • [21] C. H. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 36(3):453–465, 2014.
  • [22] Y. Li, D. Wang, H. Hu, Y. Lin, and Y. Zhuang. Zero-shot recognition using dual visual-semantic mapping paths. In CVPR, 2017.
  • [23] Y. Li, J. Zhang, J. Zhang, and K. Huang. Discriminative learning of latent features for zero-shot recognition. In CVPR, pages 7463–7471, 03 2018.
  • [24] Y. Lu. Unsupervised learning on neural network outputs: with application in zero-shot learning. arXiv preprint arXiv:1506.00990, 2015.
  • [25] G. A. Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995.
  • [26] P. Morgado and N. Vasconcelos. Semantically consistent regularization for zero-shot recognition. In CVPR, 2017.
  • [27] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650, 2013.
  • [28] G. Patterson, C. Xu, H. Su, and J. Hays. The sun attribute database: Beyond categories for deeper scene understanding. International Journal of Computer Vision (IJCV), 108(1-2):59–81, 2014.
  • [29] M. Radovanović, A. Nanopoulos, and M. Ivanović. Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research, 11(Sep):2487–2531, 2010.
  • [30] S. Reed, Z. Akata, H. Lee, and B. Schiele. Learning deep representations of fine-grained visual descriptions. In CVPR, pages 49–58, 2016.
  • [31] M. Rohrbach, S. Ebert, and B. Schiele. Transfer learning in a transductive setting. In Advances in neural information processing systems, pages 46–54, 2013.
  • [32] B. Romera-Paredes and P. Torr. An embarrassingly simple approach to zero-shot learning. In ICML, pages 2152–2161, 2015.
  • [33] Y. Shigeto, I. Suzuki, K. Hara, M. Shimbo, and Y. Matsumoto. Ridge regression, hubness, and zero-shot learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 135–151. Springer, 2015.
  • [34] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [35] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015.
  • [36] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
  • [37] W. Wang, Y. Pu, V. K. Verma, K. Fan, Y. Zhang, C. Chen, P. Rai, and L. Carin. Zero-shot learning via class-conditioned deep generative models. CoRR, abs/1711.05820, 2018.
  • [38] X. Wang, Y. Ye, and A. Gupta. Zero-shot recognition via semantic embeddings and knowledge graphs. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [39] Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata. Zero-shot learning-a comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2018.
  • [40] Y. Xian, T. Lorenz, B. Schiele, and Z. Akata. Feature generating networks for zero-shot learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [41] Y. Xian, B. Schiele, and Z. Akata. Zero-shot learning-the good, the bad and the ugly. In CVPR, 2017.
  • [42] L. Zhang, T. Xiang, and S. Gong. Learning a deep embedding model for zero-shot learning. In CVPR, pages 2021–2030, 2017.
  • [43] L. Zhang, T. Xiang, and S. Gong. Learning a deep embedding model for zero-shot learning. In CVPR, 2017.
  • [44] Z. Zhang and V. Saligrama. Zero-shot learning via semantic similarity embedding. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 4166–4174, Dec 2015.
  • [45] Z. Zhang and V. Saligrama. Zero-shot learning via joint latent similarity embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6034–6042, 2016.
  • [46] Z. Zhang and V. Saligrama. Zero-shot recognition via structured prediction. In ECCV, volume 9911, pages 533–548, 2016.
  • [47] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.
  • [48] Y. Zhu, M. Elhoseiny, B. Liu, X. Peng, and A. Elgammal. A generative adversarial approach for zero-shot learning from noisy texts. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.