Transductive Zero-Shot Learning with a Self-Training Dictionary Approach

03/27/2017
by   Yunlong Yu, et al.

As an important and challenging problem in computer vision, zero-shot learning (ZSL) aims at automatically recognizing the instances from unseen object classes without training data. To address this problem, ZSL is usually carried out in the following two aspects: 1) capturing the domain distribution connections between seen classes data and unseen classes data; and 2) modeling the semantic interactions between the image feature space and the label embedding space. Motivated by these observations, we propose a bidirectional mapping based semantic relationship modeling scheme that seeks for crossmodal knowledge transfer by simultaneously projecting the image features and label embeddings into a common latent space. Namely, we have a bidirectional connection relationship that takes place from the image feature space to the latent space as well as from the label embedding space to the latent space. To deal with the domain shift problem, we further present a transductive learning approach that formulates the class prediction problem in an iterative refining process, where the object classification capacity is progressively reinforced through bootstrapping-based model updating over highly reliable instances. Experimental results on three benchmark datasets (AwA, CUB and SUN) demonstrate the effectiveness of the proposed approach against the state-of-the-art approaches.

I Introduction

Zero-shot learning (ZSL) [1, 2, 3, 4, 5, 6, 7] endows the computer vision system with the capability to recognize instances of a new class that it has never seen before. A common framework to address this problem is to transfer knowledge from the seen classes to the unseen ones by resorting to a label embedding space in which the semantic relatedness between different classes is measured. Commonly used semantic label embeddings include visual attributes [1, 2, 5, 12, 17, 38] and word vectors [2, 3, 21].

In order to achieve this knowledge transfer, existing approaches fall into two main categories. The first poses the seen classes as mediators that connect the test instance and the unseen classes. It relies on learning a classification model for the seen classes with the labeled instances, which is then used to compute the visual similarities between the test instance and the seen classes. The prediction is implemented by matching these visual similarities with the semantic relatedness between the seen and unseen classes, which is obtained from their label embeddings. In contrast, the approaches in the second category focus on modeling the semantic interactions between different modalities by directly learning a projection function either from the image feature space to the label embedding space [20, 34] or in the reverse direction [13, 18], and then predicting the unseen instances in the label embedding space or the image feature space.

A common characteristic of existing ZSL approaches from both categories is that they all critically rely on pre-defined label embeddings to compute the semantic relatedness between the seen and unseen classes. However, the noise and uncertainty in the label embeddings make it hard to characterize the semantic information explicitly, and this imprecise information is blindly forced onto the unseen data during the knowledge transfer. Besides, we only have a single sparse label semantic vector for each unseen class, which is insufficient to fully represent the data distribution of the class. Thus, the distribution connections between the seen domain and the unseen domain are difficult to capture. Motivated by these observations, we propose a bidirectional mapping based semantic relationship modeling scheme that seeks cross-modal knowledge transfer by simultaneously projecting the image features and label embeddings into a common latent space. Specifically, the bidirectional connection relationship is formulated in a general dictionary framework, in which a common latent space is learned for preserving the semantic relatedness between different modalities. By projecting the label embeddings into the latent space, where the embedded semantics are more suitably aligned, the influence of the semantic gap across different modalities is alleviated.

As the seen classes and unseen classes are different and potentially unrelated, the projection function learned from the seen domain is usually biased when applied to the unseen domain. To address this domain shift issue, many approaches focus on learning a more general projection function to bridge the semantic relationships between the image feature space and the label embedding space under a transductive setting [18, 29, 35, 45]. The transductive setting means that the unlabeled unseen instances are used to improve the generalization accuracy. However, existing transductive approaches treat all unlabeled data equally and make the prediction in one pass, which makes it difficult for the learned models to relate the seen domain to the unseen domain. Based on this motivation, we further present a transductive learning approach that treats the unlabeled unseen instances at different levels by assessing their reliability and discriminability. Specifically, it formulates the class prediction problem as an iterative refining process, in which each iteration alternates between two paradigms, learning-to-predict and predicting-to-learn. In the learning-to-predict paradigm, prediction is conducted on the unseen data with the currently learned model to select reliable instances for the subsequent learning process; in the predicting-to-learn paradigm, the model is retrained with these reliable instances for the next prediction. In this way, the object classification capacity is progressively reinforced through bootstrapping-based model updating over highly reliable instances.

The flowchart of the proposed transductive ZSL approach is illustrated in Fig. 1. In conclusion, the main contributions of this paper are twofold:

  1. To achieve knowledge transfer from the seen classes to the unseen classes, we propose a general dictionary model that simultaneously projects the image features and label embeddings into a common latent space, where the class semantic relatedness between different modalities is effectively preserved.

  2. A novel transductive framework is developed for alleviating the domain shift problem in ZSL by formulating the class prediction step in an iterative refining process, in which the domain shift is gradually adapted by retraining a powerful classification model with highly reliable unseen instances. Experimental results show that the proposed transductive strategy can significantly improve the inductive classification model and outperform the state-of-the-art related approaches.

Fig. 1: The illustration of our proposed model with attributes. In the training stage, the visual image features and the class attribute features are jointly embedded in the latent space, where the compatibility scores between different domains are obtained. In the testing stage, initial predictions are obtained with the dictionary matrix learned from the seen domain and the compatibility matrix, and the predicted results are updated in an iterative refining process. At each iteration, the prototypes of the unseen classes are kept fixed in the latent space, and the unseen instances with high compatibility scores are selected for retraining a more powerful dictionary model for the unseen classes.

The remaining sections are organized as follows. Section II describes the related work. Section III presents the proposed general dictionary model for achieving the cross-modal knowledge transfer and the transductive framework for addressing the domain shift problem in ZSL. Section IV provides extensive experiments and evaluations, followed by the conclusion in Section V.

II Related Work

II-A Knowledge Transfer for ZSL

The key idea of ZSL is to transfer knowledge from the seen domain to the unseen one. It relies on constructing a label semantic embedding space where each class can be represented as a vector and the semantic relationships among all classes can be precisely characterized. The most common label embeddings include visual attributes [1], [5], [41], word vectors [14], [21], [33], and knowledge mined from the Web [19], [42]. Visual attributes are a list of manually specified properties for categories, such as color, shape and the presence or absence of a certain body part, which are shared across both the seen and unseen classes. In contrast, semantic word vectors are obtained from a large text corpus in an unsupervised way. With a language model, such as word2vec [40] or GloVe [43], each class name is embedded into the word vector space, where the class semantic information is defined. Given such label semantic embeddings, existing ZSL approaches focus on bridging the class semantic relationships between the instances and the categories. One of the pioneering studies is [1], in which two probabilistic paradigms are proposed, i.e., direct attribute prediction (DAP) and indirect attribute prediction (IAP). DAP takes advantage of the class attributes as the middle layer between the input images and the output class labels. In the IAP model, by contrast, the seen classes are taken as the middle layer to connect the visual instances and the unseen classes, where the semantic relationships between the seen and unseen classes are defined by their corresponding attributes.

Considering that the visual instances and label embeddings are embedded in different spaces, recent work addresses ZSL by exploring the semantic relationships between the visual instances and the label embeddings, which has been widely pursued in two ways: (1) learning a direct projection function by regressing from the image feature space to the label embedding space with regressors [8], [38] or neural networks [20]; (2) projecting the visual features and label embeddings into a latent space, for example with CCA [17]. Instead of learning two different mapping functions for the image feature space and the label embedding space, SJE [2] and DeViSE [21] combine the visual features and label embeddings into a bilinear model that represents the compatibility scores of different modalities and employ a ranking objective to enforce the correct class label to be ranked higher than any of the other class labels. In order to improve the compatibility learning framework, [3] introduced a list of latent variables to learn a collection of mappings, with the latent variable selected to match the current image-class pair. Taking the class labels into consideration, [7] proposed a simpler but more efficient method that associates the visual feature, label embedding and class label in a joint model. As an extension of [7], Qiao et al. [14] proposed a norm-based objective function which can simultaneously suppress the noisy signal in the textual representation and learn a function to match the textual semantic vectors and visual features.

Instead of projecting the visual features into the label embedding space, [15] showed that mapping label semantic vectors into the image feature space is desirable to suppress the emergence of hubs in the subsequent nearest neighbor search step. Analogously, [13] employed a dictionary learning scheme in which class attributes are considered to be coding coefficients which are used to reconstruct the visual instances. Based on the dictionary learning, Zhang et al. [16] proposed a latent probabilistic model to simultaneously project both the visual features and label embeddings into different latent spaces, and then learn a cross-domain similarity matrix for matching different modalities.

II-B Adaptation for the Domain Shift Problem

The domain shift problem is a common issue in situations where there is abundant training data in one domain but little to none in another. Traditional domain adaptation approaches have been developed both with [30], [39] and without [31] label information from the target domain. Since the label information of the unseen domain is not available in ZSL, supervised domain adaptation approaches are not applicable here. Besides, different from the traditional domain shift problem [26], [27], the domain shift issue in ZSL is mainly due to the disjointness of the seen classes and unseen classes rather than a feature distribution shift. Recently, several works have been proposed for mitigating the domain shift problem in ZSL, with methods ranging from subspace aligning [17], data augmentation [9], [37], and self-training [32] to hubness correction [10]. Transductive zero-shot learning was first considered by Fu et al. [36], in which the unseen data attribute distribution is exploited by averaging the label prototype's k-nearest neighbours. In [45], the domain shift problem was addressed by transductive multi-view hypergraph label propagation (TMV-HLP), in which the manifold structure of the unseen data is exploited to compensate for the impoverished supervision available from the sparse semantic vector. By using graph-based label propagation to exploit the manifold structure of the unseen data, Rohrbach et al. [11] proposed a more elaborate transductive strategy for the domain shift problem in ZSL. Different from these approaches, Xu et al. [34] proposed a data augmentation strategy that merges any available auxiliary dataset with the labeled seen data to train a general model for the unseen data. Self-training adaptation [34, 35] is a post-processing technique based on adjusting the latent embeddings of the unseen classes according to the distribution of all the test instance projections in the latent subspace.

III The Proposed Model

In this section, we focus on learning a specific classification model for recognizing the unlabeled unseen data. It consists of two parts: i) a general dictionary model is learned with the labeled seen data for initially predicting the unseen data, in which the semantic relatedness between different modalities is preserved by projecting the image features and label embeddings into a common latent space; ii) a transductive framework is presented for mitigating the domain shift problem in ZSL by formulating the prediction step as an iterative refining process, where the classification capacity is progressively reinforced through bootstrapping-based model updating over highly reliable instances.

III-A Notations

Suppose that we collect labeled instances from the seen classes for training, and each class is associated with a vector embedded in the label embedding space. We denote the instance matrix available at the training stage, the corresponding ground-truth label matrix, and the label embedding matrix of the seen data accordingly; each column of the label embedding matrix represents a label embedding vector. At the testing stage, we are given instances from the unseen classes, which are disjoint from the seen classes. Each unseen class is also associated with a label embedding vector. TABLE I lists the main notations used hereinafter.

Notation Description
number of the seen classes
number of the unseen classes
number of the seen instances
number of the unseen instances
dimensionality of the image feature space
dimensionality of the label embedding space
dimensionality of the latent space
seen instance matrix
label semantic matrix of seen classes
ground truth label matrix of seen classes
unseen instance matrix
label semantic matrix of unseen classes
shared compatibility matrix
dictionary model for seen data
dictionary model for unseen data
hyper-parameters
self-labeled rate
number of instances predicted into each unseen class
TABLE I: The notations used in our approach.

III-B The Joint Embedding Dictionary Model (JEDM)

For the labeled seen data, conventional dictionary learning models [22, 23, 24, 25] aim at learning an effective data representation from the input data for classification tasks by exploiting the discriminative class-label information of the labeled data. Most existing dictionary learning approaches can be formulated under the following framework:

(1)

where a weight parameter balances the terms; the class label matrix corresponds to the instances in the input data matrix; the dictionary matrix is to be learned together with the representative coding matrix of the input data over the dictionary; the classification matrix for the seen classes stacks one classifier per class; a norm regularizer is imposed on the classification matrix; the reconstruction error term, measured with the matrix Frobenius norm, ensures the representative ability of the dictionary; and a discriminative function ensures the discriminative ability of the coding matrix.
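Since the symbols of Eq. (1) were lost in this copy, the following display is only a hedged reconstruction of a typical objective in this family, with assumed notation: X for the instance matrix, D for the dictionary, A for the coding matrix, W for the seen-class classifier matrix and Y for the label matrix:

\[
\min_{\mathbf{D},\,\mathbf{A},\,\mathbf{W}} \; \|\mathbf{X}-\mathbf{D}\mathbf{A}\|_F^2 \;+\; f(\mathbf{Y},\mathbf{W},\mathbf{A}) \;+\; \lambda\,\|\mathbf{W}\|^2 ,
\]

where the discriminative function f may, for instance, take the linear-classifier form \(\|\mathbf{Y}-\mathbf{W}\mathbf{A}\|_F^2\), in the spirit of label-consistent dictionary learning [22].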

With Eq. (1), the shared dictionary matrix and the classification matrix can be trained on the labeled seen classes. However, no labeled data are available for the unseen classes, so their classification parameters cannot be obtained directly. We thereby need to transfer the knowledge exploited from the labeled seen domain to the unseen domain. As previous work has indicated, the properties of a class can be well characterized by its corresponding label embedding, so it is reasonable to assume that the classifier of a class can be derived from its label embedding. Thus, we replace each per-class classifier with the product of a compatibility matrix and the class label embedding, where the compatibility matrix is shared by both the seen and unseen classes. Intuitively, the compatibility matrix aligns the semantic consistency between the visual instances and the label embeddings. Once the compatibility matrix is obtained, the classification parameters of an unseen class can be derived from its label embedding in the same way. To this end, the remaining problem is to learn the compatibility matrix with the labeled seen data. Based on this idea, we propose to learn such a compatibility matrix together with the seen dictionary matrix. Formally, we obtain the Joint Embedding Dictionary Model (JEDM) for ZSL,

(2)

where two hyper-parameters trade off the different terms and can be determined via cross-validation.

The first term of Eq. (2) is the reconstruction error, which compresses the visual features into a more representative latent space, and the second term incorporates the latent features, label embeddings and class labels into a joint framework for preserving the semantic relatedness across different modalities. By enforcing the visual latent features to be close to the corresponding label embeddings while staying far away from those of the other classes, this term exploits the discriminative semantic information across different modalities. The last term is a regularizer.
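As with Eq. (1), the exact symbols of Eq. (2) are not recoverable from this copy. Under the assumed notation above, with S the seen-class label embedding matrix so that the classifier of class c becomes W s_c, one instantiation consistent with the description is

\[
\min_{\mathbf{D},\,\mathbf{A},\,\mathbf{W}} \; \|\mathbf{X}-\mathbf{D}\mathbf{A}\|_F^2 \;+\; \alpha\,\|\mathbf{Y}-(\mathbf{W}\mathbf{S})^{\top}\mathbf{A}\|_F^2 \;+\; \beta\,\|\mathbf{W}\|_F^2 ,
\]

where the second term pushes each latent code toward the projected embedding of its own class and away from those of the other classes, and α and β are the two trade-off parameters.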

We next introduce the optimization process for solving the objective function in Eq. (2). Eq. (2) is not jointly convex in all of its variables, but it is convex in each of them individually. Therefore, the optimization can be carried out by alternating between the following two steps.

1) Fix two of the variables and solve for the remaining one:

(3)

This sub-problem is a standard least-squares problem; we take the derivative of Eq. (3) with respect to the free variable and set it to zero, which gives the following closed-form solution:

(4)

2) Fix the variable just obtained and solve for the other two. Since the other two are independent of each other once it is fixed, they can be solved separately,

(5)

The closed-form solution of the first of them can be obtained as:

(6)

The optimal solution of the second can be obtained by introducing an auxiliary variable:

(7)

The solution of Eq. (7) can then be obtained with the alternating direction method of multipliers (ADMM) algorithm.

In each iteration, two of the variables are obtained with closed-form solutions and the remaining one is obtained with the ADMM algorithm, which converges rapidly. The iterations stop when the change between two adjacent iterations falls below a threshold.
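To make the alternating scheme concrete, the sketch below implements the assumed instantiation of Eq. (2) given above in NumPy. The function name jedm_fit, the hyper-parameter values and the plain least-squares dictionary step (standing in for the ADMM update of Eq. (7), which may additionally enforce constraints on the atoms) are all illustrative assumptions, not the authors' reference code.

```python
import numpy as np

def jedm_fit(X, S, Y, k=256, alpha=0.1, beta=0.1, n_iters=20, eps=1e-6):
    """Alternating optimization of the assumed JEDM-style objective
        min_{D,A,W} ||X - D A||_F^2 + alpha ||Y - (W S)^T A||_F^2 + beta ||W||_F^2
    X: d x n seen features, S: m x c seen label embeddings, Y: c x n one-hot labels.
    Returns the dictionary D (d x k) and the compatibility matrix W (k x m)."""
    d, n = X.shape
    m = S.shape[0]
    rng = np.random.default_rng(0)
    D = rng.standard_normal((d, k)) / np.sqrt(d)
    W = np.zeros((k, m))
    for _ in range(n_iters):
        # Step 1: fix D and W, update the codes A in closed form.
        WS = W @ S                                         # projected class prototypes, k x c
        A = np.linalg.solve(D.T @ D + alpha * (WS @ WS.T) + eps * np.eye(k),
                            D.T @ X + alpha * WS @ Y)
        # Step 2a: fix A, update W in closed form. The stationarity condition
        #   alpha * (A A^T) W (S S^T) + beta * W = alpha * A Y^T S^T
        # is solved via eigendecompositions of the two symmetric factors.
        lam, U = np.linalg.eigh(A @ A.T)
        sig, V = np.linalg.eigh(S @ S.T)
        C = A @ Y.T @ S.T
        W = U @ ((U.T @ C @ V) / (np.outer(lam, sig) + beta / alpha)) @ V.T
        # Step 2b: fix A, update D; a regularized least-squares step stands in
        # for the ADMM-based update of Eq. (7).
        D = X @ A.T @ np.linalg.inv(A @ A.T + eps * np.eye(k))
    return D, W
```

The small eps term simply stabilizes the matrix inversions when the code matrix is rank deficient.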

Once the dictionary and compatibility matrices are obtained, the compatibility score of a test instance over an unseen class is estimated in the common latent space:

(8)

where the score is computed between the approximate latent embedding of the visual instance and the latent prototype of the unseen class, i.e., the latent embedding of its label embedding. Thus, ZSL is achieved by taking the unseen class with the largest compatibility score:

(9)
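A corresponding prediction routine, under the same assumptions (ridge-regression codes against the learned dictionary, and compatibility measured as the inner product between a code and the latent prototype of each unseen class), might look like the following sketch, which reuses the imports from the previous block:

```python
def zsl_predict(Xu, D, W, Su, lam=1e-3):
    """Score unseen classes in the latent space (assumed form of Eq. (8)-(9)).
    Xu: d x n_test unseen features, Su: m x c_u unseen label embeddings.
    Returns predicted class indices and the n_test x c_u score matrix."""
    k = D.shape[1]
    A = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ Xu)   # approximate latent codes
    prototypes = W @ Su                                        # latent prototypes of unseen classes
    scores = A.T @ prototypes                                  # compatibility scores
    return scores.argmax(axis=1), scores
```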

III-C Self-Labeled Strategy

Like most inductive ZSL approaches, a classification model learned only with the labeled seen data generalizes poorly on the unseen data because the class distribution of the seen domain differs from that of the unseen domain. To address this domain shift problem, we formulate the prediction step of ZSL as an iterative refining process, in which each iteration alternates between two paradigms, learning-to-predict and predicting-to-learn. With the model learned from the labeled seen data, the labels of the unseen data are first predicted; this is the first learning-to-predict step. Considering that instances with higher compatibility scores are more likely to be correctly predicted, it is reasonable to annotate these reliable instances as labeled data for the unseen classes. With these reliable instances fed back, the unseen-specific model is retrained for the subsequent prediction step; this is a predicting-to-learn step. Repeating this process, the domain shift is progressively adapted in a confident way. The remaining problem is how to select reliable instances from the unseen data. In this paper, we introduce a simple strategy to select instances from the unseen data as labeled data, sketched in code after this paragraph. Specifically, for each unseen class, the test instances predicted into that class are ranked according to their compatibility scores over the class. We then set a self-labeled rate to annotate the most reliable instances as labeled data. For example, if a number of instances are predicted into a given unseen class, the top-ranked fraction of them, rounded to an integer, is selected according to their scores for that class. Clearly, the self-labeled strategy operates under a transductive setting.
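The selection rule can be sketched as follows; select_reliable is a hypothetical helper that keeps, for each predicted class, the top fraction of instances ranked by their score for that class:

```python
def select_reliable(scores, preds, rate):
    """Return indices of the self-labeled set: for each predicted class, keep the
    top `rate` fraction of instances ranked by compatibility score (rounded)."""
    selected = []
    for c in np.unique(preds):
        idx = np.where(preds == c)[0]
        n_keep = int(round(rate * len(idx)))        # rounding, as described in the text
        ranked = idx[np.argsort(-scores[idx, c])]   # descending score order
        selected.extend(ranked[:n_keep].tolist())
    return np.array(sorted(selected), dtype=int)
```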

It should be noted that the self-labeled strategy can be seamlessly integrated into various existing ZSL approaches. As shown in Fig. 2, the seen data are used to learn a preliminary classification model for initially predicting the unseen data, and an iterative strategy is then used to refine the learned model. At each iteration, only reliable instances from the unseen data are selected for refining the classification model. As more instances are selected, a more powerful class-specific model is learned for the unseen classes.

III-D Transductive Self-Training Dictionary (TSTD) Model

By integrating the self-labeled strategy into the previously proposed JEDM, we obtain the final Transductive Self-Training Dictionary (TSTD) model. In the first learning-to-predict step, the class labels of the unseen data are initially predicted with the proposed JEDM, and the classification model is then retrained with the unseen data themselves. In each predicting-to-learn step, two constraints are introduced to ensure that the refined model is more suitable for the unseen classes. The first is that the currently learned dictionary model stays close to the previously optimal one: since the previously learned model is used to align the different spaces, the current model should refine it with small steps rather than adjust it over a large range. The second is that the learned model keeps the latent embeddings of the self-labeled instances close to their predicted label prototypes in the latent space. Thus the objective function is defined as follows:

Fig. 2: The workflow of our proposed self-training strategy. In the training stage, a ZSL model is trained on the seen data and used to initialize the unseen model. In the testing stage, the labels of the unseen data are predicted by the unseen model. Then the instances whose labels are reliably predicted are selected as self-labeled data for refining the unseen model with a ZSL method. The self-labeled process stops when all unseen data have been selected.
(10)

where the collected set contains the selected self-labeled instances, the previously learned compatibility matrix is shared by both the seen and unseen domains, the dictionary matrix currently being learned is specific to the unseen classes, and the latent embeddings of the self-labeled instances are matched to the label embedding matrix of their predicted classes. Since each unseen class is associated with a label semantic vector, this matrix is easily inferred from the predicted class labels. Two trade-off parameters balance the terms. In our model, the latent embeddings of the input unseen data are enforced to be close to the label latent embeddings of their corresponding predicted classes in the common latent space.
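The symbols of Eq. (10) are likewise missing from this copy. Under the assumed notation, with X̂_u the self-labeled instances, Ŷ their predicted one-hot labels, S_u the unseen label embeddings, D_s the previously learned dictionary, and γ and τ the trade-off parameters, an objective consistent with the two constraints reads

\[
\min_{\mathbf{D}_u,\,\mathbf{A}} \; \|\hat{\mathbf{X}}_u-\mathbf{D}_u\mathbf{A}\|_F^2 \;+\; \gamma\,\|\mathbf{A}-\mathbf{W}\mathbf{S}_u\hat{\mathbf{Y}}\|_F^2 \;+\; \tau\,\|\mathbf{D}_u-\mathbf{D}_s\|_F^2 ,
\]

where the second term pulls the latent codes toward the fixed prototypes of their predicted classes and the third term keeps the refined dictionary close to the previous one.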

In the following, we design an alternating optimization method to solve Eq. (10). When one of its two variables is fixed, the optimization problem over the other becomes:

(11)

which leads to a closed-form solution:

(12)

With that solution fixed in turn, the other variable can be easily solved by:

(13)

This is a standard least-squares problem with the following closed-form solution:

(14)

With the newly learned dictionary matrix, the unseen data are revisited with Eq. (9). With the latest predicted results, we enlarge the self-labeled rate to incorporate more reliable instances for training. This refining process is repeated until all the unseen data are selected. Specifically, the values of the self-labeled rate are successively enlarged over a fixed schedule in our experiments. The TSTD process is summarized in Algorithm 1.
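Putting the pieces together, the outer transductive loop could be organized as below, reusing the sketches above. The rate schedule is purely illustrative (the actual values are not reproduced in this copy), and refit_unseen_dictionary alternates the two closed-form updates of the assumed Eq. (10) form.

```python
def refit_unseen_dictionary(X_sel, P_sel, D_prev, gamma=0.1, tau=0.1, n_iters=10):
    """Predicting-to-learn step for the assumed Eq. (10): alternate closed-form
    updates of the codes A and the unseen dictionary D_u, keeping D_u close to
    the previous dictionary D_prev. P_sel holds the fixed latent prototypes of
    the predicted classes of the self-labeled instances."""
    k = D_prev.shape[1]
    D_u = D_prev.copy()
    for _ in range(n_iters):
        A = np.linalg.solve(D_u.T @ D_u + gamma * np.eye(k),
                            D_u.T @ X_sel + gamma * P_sel)
        D_u = (X_sel @ A.T + tau * D_prev) @ np.linalg.inv(A @ A.T + tau * np.eye(k))
    return D_u

def tstd(Xu, Su, D_seen, W, rates=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Transductive self-training loop: W and the unseen prototypes stay fixed,
    the unseen dictionary is re-estimated from a growing self-labeled set."""
    D_u = D_seen.copy()
    for rate in rates:                                   # self-labeled rate gradually enlarged
        preds, scores = zsl_predict(Xu, D_u, W, Su)      # learning-to-predict
        keep = select_reliable(scores, preds, rate)
        P_sel = W @ Su[:, preds[keep]]                   # latent prototypes of predicted classes
        D_u = refit_unseen_dictionary(Xu[:, keep], P_sel, D_u)  # predicting-to-learn
    return zsl_predict(Xu, D_u, W, Su)[0]
```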

III-E Further Analysis

With the dictionary matrix and the compatibility matrix learned from the seen data, the unseen instances and the label embeddings of the unseen classes can be embedded into the latent space together. We visualize them with the t-SNE approach, as illustrated in Fig. 3. We can observe that the projections of most visual instances from the same class are distributed around the corresponding class prototypes in the latent space. It is easy to conclude that the instances that are close to their corresponding class semantic prototypes tend to be classified correctly. In contrast, the instances that are farther away from the corresponding class prototypes tend to be classified into the wrong classes. Thus it is natural to annotate the instances that are close to the corresponding class prototypes as labeled data, which eliminates the issue that no training samples are available for the unseen classes. Fixing the prototypes of the unseen classes, the embeddings of the unseen instances are gradually adjusted by retraining the embedding function with the reliable instances, and thus the domain shift issue in ZSL is alleviated. In a word, the mechanism of the proposed transductive strategy is to borrow knowledge from the seen classes to teach the unseen data, and then to learn a specific model from the unseen data by themselves.

Fig. 3: t-SNE visualization of AwA unseen data and the corresponding class attributes embedded in the latent space, where the blue circles denote the embedding prototypes of AwA test classes.
Input:
  1: The seen domain:
     Instance matrix ,
     Ground truth label matrix ,
     Seen label embedding matrix ,
     Hyper-parameters .
  2: The unseen domain:
     Instance matrix ,
     Unseen label embedding matrix ,
     Hyper-parameters ,
     Self-labeled rate
Output: The predicted class labels of the unseen data.
Training:
  3: repeat
  4:   Update according to Eq. (4);
  5:   Update according to Eq. (6);
  6:   Update according to Eq. (7);
  7: until there is no change to the learned variables.
  8: return the learned dictionary and compatibility matrices.
  9:  Fix the compatibility matrix and initialize the
      unseen dictionary model from the seen one;
  10: repeat
  11:  Predict the unseen data with Eq. (9);
  12:  for each unseen class do
  13:    Rank the instances that are predicted to the
          unseen class based on the compatibility scores;
  14:    Select the top-ranked reliable instances from the
          class and add them to the self-labeled set;
  15:  end
  16:    Infer the label embedding matrix according
          to the predicted labels of the selected instances;
  17:    Update according to Eq. (12);
  18:    Update according to Eq. (14);
  19:    Enlarge the self-labeled rate.
  20: until all the unseen instances are selected.
Algorithm 1 The process of TSTD

III-F Complexity Analysis

In this section, we analyze the computational complexity of TSTD and the convergence of the proposed JEDM separately.

Computational Complexity. In the training phase of JEDM, the three variables are updated alternately. In each iteration, the updates in Eq. (4) and Eq. (6) are obtained in closed form, and the cost of the ADMM-based update in Eq. (7) scales with its inner iteration number; we have experimentally found that the ADMM algorithm converges in fewer than 20 iterations. In the domain adaptation phase of TSTD, the two variables are also updated alternately, each with a closed-form cost per iteration. Given that the involved dimensionalities are of the same order of magnitude and our algorithm converges within a few iterations, the overall time cost remains modest. It is worth noting that the dominant operation of our algorithm is matrix multiplication, so the training process can be greatly accelerated.

Convergence. We conduct an empirical study of the convergence property on Animals with Attributes (AwA), with attributes as the label semantic vectors. We set both hyper-parameters to 0.1. The train/test split provided by the dataset is used accordingly. As Fig. 4 shows, the cost function of JEDM descends dramatically and converges within only 10 iterations, which clearly indicates the efficiency of the proposed JEDM.

Fig. 4: The convergence curve of JEDM on the AwA dataset with attributes as label embeddings.

IV Experiments

In this section, we conduct a set of experiments to demonstrate the superiority of the proposed approaches. We first detail the datasets and settings, and then compare the proposed JEDM with state-of-the-art inductive ZSL approaches. Next, the effectiveness of the proposed self-training strategy is evaluated, followed by a comparison between TSTD and the state-of-the-art transductive ZSL approaches.

IV-A Datasets and Settings

Datasets. To evaluate the effectiveness of the proposed approaches, we conduct extensive experiments on three benchmark datasets. (a) Animals with Attributes (AwA) [1] consists of 30,475 animal images from 50 different classes, and each class is associated with an 85-dimensional attribute vector. (b) Caltech-UCSD Birds-200-2011 (CUB) [47] is a fine-grained dataset which contains 11,788 images from 200 bird subspecies, and a 312-dimensional attribute vector is provided for each class. (c) SUN Attribute [48] contains 717 scene categories annotated with 102 attributes, and each class has 20 images. For the seen/unseen class split, we use the standard 40/10 split for the AwA dataset [1]. For the CUB dataset, we follow the same 150/50 split as in [2]. For the SUN dataset, we use 707 classes as the seen domain and 10 classes as the unseen domain, the same as in [16]. The statistics of the three datasets are shown in TABLE II.

Dataset  Instances  Attribute dim.  Seen/unseen classes
AwA 30,475 85 40/10
CUB 11,788 312 150/50
SUN 14,340 102 707/10
TABLE II: The statistics of three benchmark datasets

Visual representation. In our experiments, we use the VGG-verydeep-19 (denoted as VGG for short) features provided with these datasets to represent the visual instances.

Label semantic embedding. In this paper, we explore visual attributes and word vectors as the label embedding spaces for the AwA and CUB datasets. Meanwhile, only visual attributes are used for the SUN dataset to be comparable with existing practice in the literature.

Besides, there are four hyper-parameters in our proposed TSTD: two in the JEDM and two in the refining model. We select their best values with a 5-fold cross-validation (CV) strategy, where 20% of the seen classes are held out for validation and the remaining classes are used for training. Once the parameters are fixed, all seen classes are trained together for the final model. All the parameters are selected from a fixed candidate set. In all the experiments, classification performance is evaluated with the average per-class top-1 accuracy. The average running time of our Matlab implementation is about 0.01 ms per image on a desktop with an Intel Core i7-4790K processor and 32 GB of RAM.
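For reference, a class-wise split of this kind can be sketched as follows; the helper name and fold mechanics are illustrative, and only the 20% class hold-out is taken from the text:

```python
import numpy as np

def class_split_cv(seen_classes, n_folds=5, seed=0):
    """Yield (train_classes, val_classes) pairs, holding out roughly 20% of the
    seen classes per fold, as in the class-wise cross-validation described above."""
    rng = np.random.default_rng(seed)
    classes = rng.permutation(np.asarray(seen_classes))
    folds = np.array_split(classes, n_folds)
    for i in range(n_folds):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train, val
```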

IV-B Comparative Results of JEDM

In order to evaluate the effectiveness of the proposed JEDM, we conduct two experiments, one for each type of label embedding space.

We first take attributes as the semantic vectors of the classes. In this experiment, six state-of-the-art approaches are selected for comparison. For descriptive convenience, they are respectively referred to as DAP (Direct Attribute Prediction [38]), SJE (Structured Joint Embedding [2]), LatEm (Latent Embeddings [3]), ESZSL (Embarrassingly Simple Zero-Shot Learning [7]), SC (Synthesized Classifiers [12]) and JLSE (Joint Latent Similarity Embedding [16]). These selected competing methods are all inductive approaches.

The results of the comparative methods are all from the original papers except [7], which is obtained with the published codes under the same setting as ours. The results are summarized in TABLE III, where ‘-’ indicates that these methods were not tested on the datasets in their original work.

From TABLE III, we can observe that JEDM is comparable with the state-of-the-art approaches. More specifically, on the AwA dataset, JEDM achieves an improvement of 19.0% over the baseline method DAP [38] and beats the other competitors except for JLSE [16], which projects both modalities into different latent spaces and is a more complicated model. For the CUB dataset, our approach works better than the others except for [12] and [2]. [12] tackles ZSL by exploiting the manifold structure to align the semantic space, which is robust for fine-grained datasets, while [2] takes a more powerful visual feature as input, which accounts for the fact that it works better than JEDM. Since the SUN dataset is less popular than the other two, only three recent approaches are selected for comparison. From the results, we find that the proposed JEDM outperforms the previously published approaches by a large margin. Specifically, it outperforms DAP [1], ESZSL [7] and [16] by 14%, 4% and 3.2%, respectively. Besides, JEDM outperforms ESZSL, which is the approach most similar to it. The main difference between our method and ESZSL is that JEDM projects the visual features into a more discriminative latent space with a dictionary framework, while ESZSL uses the visual features directly as input and designs an elaborate regularizer. The comparative results demonstrate the effectiveness of the dictionary representation.

Method F AwA CUB SUN
DAP [38] V 57.5 - 72.0
SJE [2] G 66.7 50.1 -
LatEm [3] G 71.9 45.5 -
ESZSL [7] V 75.3 46.8 82.0
Changpinyo et al.[12] G 72.9 54.7 -
JLSE [16] V 80.5 42.1 82.8
JEDM V 76.5 47.6 86.0
TABLE III: Comparison results of different approaches on different datasets with attributes (in %). Notations: 'F': visual features; 'V': VGG feature; 'G': GoogleNet feature. For the method whose classification performance was obtained by ourselves, we report the best performance after tuning the parameters of its model.

In the second experiment, the word vector space is taken as the label embedding space. Thanks to recent advances in unsupervised neural language modeling [40], [43], each word in a text corpus can be effectively embedded in a textual semantic space, where each word is represented as a multi-dimensional semantic vector. Specifically, we use the word2vec model [40] to train a skip-gram language model on the latest Wikipedia corpus and extract a 1000-dimensional word vector for each class of the AwA and CUB datasets. Several word-vector-based approaches are selected for comparison, as shown in TABLE IV. From the results, we find that JEDM achieves an impressive improvement on the AwA dataset. Specifically, JEDM outperforms CCA [44], SJE [2], LatEm [3] and ESZSL [7] by 5.9%, 20.3%, 10.4% and 4.1%, respectively. Meanwhile, it also achieves a competitive result on the CUB dataset, which is only 0.9% lower than the previous best reported result of LatEm [3].

Method F AwA CUB
CCA [44] V 65.6 30.4
SJE [2] G 51.2 28.4
LatEm [3] G 61.1 31.8
ESZSL [7] V 67.4 30.4
JEDM V 71.5 30.9
TABLE IV: Comparison results on AwA and CUB datasets with word vectors (in %). Notations: 'V': VGG feature; 'G': GoogleNet feature. The classification performances of some methods were obtained by ourselves.

IV-C Evaluation of the Self-Training Strategy

In this section, we conduct a set of experiments on the AwA and CUB datasets to demonstrate the generality and effectiveness of the proposed self-training strategy. Specifically, two typical ZSL approaches are selected to be integrated with the self-training strategy: CCA and ESZSL, both of which have a closed-form solution. For descriptive convenience, we add the postfix -ST to the name of an approach to denote the corresponding approach with the self-training strategy. The approach obtained by integrating JEDM with the self-training strategy is called TSTD in this paper. In implementation, the constraints introduced in TSTD are also applied to CCA-ST and ESZSL-ST. The comparative results are provided in TABLE V.

From the results, we can observe that the proposed transductive self-training strategy not only improves the performance of the proposed JEDM by a large margin, but also boosts the other approaches substantially on different datasets with different semantic vectors. Specifically, on the AwA dataset, the transductive self-training strategy helps JEDM improve by 13.8% and 19.7% with attributes and word vectors as the label embedding space, respectively. It should be noted that TSTD achieves 91.2% classification accuracy on the AwA dataset with word vectors as the semantic space, which is even better than the attribute-based approaches. In contrast to the AwA dataset, the improvement brought by the transductive self-training strategy is smaller on the CUB dataset. The reason is that CUB is a fine-grained dataset and the classification performance of JEDM on it is much lower than on AwA, so the self-labeled set contains many falsely labeled instances that may spoil the classification model. Even so, the proposed transductive self-training strategy helps JEDM improve by 10.6 and 3.0 absolute percentage points with visual attributes and word vectors as the semantic space, respectively.

Method  AwA (A)  AwA (W)  CUB (A)  CUB (W)
CCA [44] 72.5 65.6 46.2 30.4
ESZSL [7] 75.3 67.4 46.8 30.4
JEDM 76.5 71.5 47.6 30.9
CCA-ST 85.6 79.9 53.8 32.6
ESZSL-ST 87.7 83.8 56.4 32.4
TSTD 90.3 91.2 58.2 33.9
TABLE V: Evaluation of the self-training strategy on AwA and CUB datasets (in %). 'M-ST' denotes method 'M' with the self-training strategy; A and W are short for visual attribute and word vector, respectively.

IV-D Comparison Results of TSTD

We also compare our TSTD approach with the state-of-the-art transductive ZSL approaches. TABLE VI shows the comparison results. We can observe that the proposed TSTD has an overwhelming superiority over the competitors. Specifically, the proposed domain adaptation strategy on the JEDM model achieves 90.3% classification accuracy on the AwA dataset with visual attributes, which outperforms [45], [29], [13] and [35] by 9.8%, 11.8%, 14.7% and 2.4%, respectively. On the CUB dataset, it achieves 58.2% classification accuracy, improving over [45], [13] and [35] by 10.3%, 17.6% and 4.7%, respectively. TMV-HLP and SMS are two transductive methods that integrate the seen and unseen data together to train a general model for all classes, and [35] explores the label information of the unseen data with an unsupervised cluster-based approach. In contrast, [13] and our self-training strategy focus on re-training a model suited to the unseen classes. The main difference between these two strategies is that [13] uses an unsupervised model to exploit the structure information of the unseen domain, while ours relies on bootstrapping-based model updating over highly reliable instances to progressively reinforce the classification capacity.

Method AwA CUB
TMV-HLP [45] 80.5 47.9
SMS [29] 78.5 -
Kodirov et al. [13] 75.6 40.6
Wang et al. [35] 87.9 53.5
TSTD 90.3 58.2
TABLE VI: Comparison results on different transductive approaches (in %), where visual attribute is adopted as the semantic space.

IV-E Evaluation of the Self-Labeled Rate

We next conduct a set of experiments to evaluate the influence of the self-labeled rate on the maturity of the learned model. As illustrated in Fig. 5, the performances on the AwA dataset increase steadily as the self-labeled rate grows, with both types of label semantic embeddings. This indicates that, as the rate increases, more correctly self-labeled instances are selected for refining the classification model, so the classification capacity is progressively reinforced. On the contrary, on the CUB dataset, the performances reach their peaks at earlier rates with attributes and word vectors, respectively, and then decrease as the rate increases further. This is because the classification performance on the CUB unseen data with the learned model is poor (47.6% and 30.9% with attributes and word vectors, respectively), so that, as the rate increases, more falsely labeled instances are selected as self-labeled data, which may spoil the learned model. The curves of the AwA dataset in Fig. 5 (a) also support this explanation.

(a) AwA
(b) CUB
Fig. 5: The classification performances with different rates for TSTD on AwA and CUB datasets, respectively.

V Conclusions

In this paper, we proposed a bidirectional mapping based scheme to address ZSL. It formulates the semantic interactions between the image feature space and the label embedding space in a general dictionary model by simultaneously projecting the image features and label embeddings into a common latent space. The experimental results demonstrated that the proposed approach achieves state-of-the-art performance on three benchmark datasets. To alleviate the domain shift problem in ZSL, we further proposed a transductive learning framework that formulates ZSL in two paradigms, where the labeled seen data are used to transfer knowledge to the unseen data, and the unlabeled unseen data are used to gradually learn a more powerful model by themselves. In this way, the classification capacity is progressively reinforced through bootstrapping-based model updating over highly reliable unseen instances. The experimental results demonstrated that the proposed transductive strategy improves the classification performance of existing inductive methods by a large margin. Compared with the state-of-the-art methods, our transductive approach outperforms the runner-up method on the AwA and CUB datasets by 2.4% and 4.7%, respectively.

References

  • [1] C. H. Lampert, H. Nickisch and S. Harmeling, “Learning to detect unseen object classes by between-class attribute transfer,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Miami, USA, June 2009, pp. 951-958.
  • [2] Z. Akata, S. Reed, D. Walter, et al., “Evaluation of output embeddings for fine-grained image classification,” in Proc. Comput. Vis. Pattern Recognit., Boston, USA, June 2015, pp. 2927-2936.
  • [3] Y. Q. Xian, Z. Akata, G. Sharma, et al., “Latent embeddings for zero-shot classification,” in Proc. IEEE Conf. on Comput. Vis. Pattern Recognit., Las Vegas, USA, June 2016, pp. 69-77.
  • [4] Z. Fu, T. Xiang, E. Kodirov, et al., “Zero-shot object recognition by semantic manifold distance,” in Proc. IEEE Conf. on Comput. Vis. Pattern Recognit., Boston, USA, June 2015, pp. 2635-2644.
  • [5] Z. Akata, F. Perronnin, Z. Harchaoui and C. Schmid, “Label embedding for attribute-based classification,” in Proc. IEEE Conf. on Comput. Vis. Pattern Recognit., Portland, USA, June 2013, pp. 819-826.
  • [6] M. Norouzi, T. Mikolov, S. Bengio,et al., “Zero-Shot Learning by Convex Combination of Semantic Embeddings,” Int. Conf. on Learn. Repr., Banff, Canada, Apr. 2014, pp. 1-9.
  • [7] B. Romera-Paredes and P. H. S. Torr, “An embarrassingly simple approach to zero-shot learning,” in Proc. Int. Conf. Mach. Learn., Lille, France, July 2015, pp. 2152-2161.
  • [8] A. Farhadi, I. Endres, D. Hoiem and D. Forsyth, “Describing objects by their attributes,” in Proc. IEEE Conf. on Comput. Vis. Pattern Recognit., Miami, USA, June 2009, pp.  1778-1785.
  • [9] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345-1359, 2010.
  • [10] G. Dinu, A. Lazaridou, M. Baroni, “Improving zero-shot learning by mitigating the hubness problem,” Comput. Sci., pp.135-151, 2014.
  • [11] M. Rohrbach, S. Ebert, and B. Schiele,“Transfer learning in a transductive setting,” Advances in Neural Infor. Proc. Sys., Nevada, US, Dec. 2013, pp. 46-54.
  • [12] S. Changpinyo, W. L. Chao, B. Gong, et al., “Synthesized Classifiers for Zero-Shot Learning,” in Proc. IEEE Conf. on Comput. Vis. Pattern Recognit., Las Vegas, USA, June 2016, pp. 5327-5336.
  • [13] E. Kodirov, T. Xiang, Z. Fu, et al., “Unsupervised domain adaptation for zero-shot learning,” in Proc. IEEE Conf. on Comput. Vis. Pattern Recognit., Santiago, Chile, Dec. 2015, pp. 2452-2460.
  • [14] R. Qiao, L. Liu, C. Shen, et al., “Less is more: zero-shot learning from online textual documents with noise suppression,” in Proc. IEEE Conf. on Comput. Vis. Pattern Recognit., Las Vegas, USA, June 2016, pp. 2249-2257.
  • [15] Y. Shigeto, I. Suzuki, K. Hara, et al., “Ridge regression, hubness, and zero-shot learning,” in Eur. Conf. Mach. Learn., Porto, Portugal, Sep. 2015, pp. 135-151.
  • [16] Z. Zhang and V. Saligrama, “Zero-shot learning via joint latent similarity embedding,” in Proc. IEEE Conf. on Comput. Vis. Pattern Recognit., Las Vegas, USA, June 2016, pp. 6034-6042.
  • [17] Y. Fu, T. M. Hospedales, T. Xiang, et al., “Transductive multi-view embedding for zero-shot recognition and annotation,” in Proc. Eur. Conf. on Comput. Vis., Zurich, Sep., 2014, pp. 584-599.
  • [18] S. M. Shojaee, M. S. Baghshah, “Semi-supervised Zero-Shot Learning by a Clustering-based Approach,” arXiv:1605.09016, 2016.
  • [19] M. Rohrbach, M. Stark, G. Szarvas, I. Gurevych, and B. Schiele, “What helps where - and why? Semantic relatedness for knowledge transfer,” in Proc. Eur. Conf. on Comput. Vis., Crete, Greece, Sep. 2010, pp. 910-917.
  • [20] R. Socher, M. Ganjoo, C. D. Manning, et al., “Zero-shot learning through cross-modal transfer,” Advances in Neural Inf. Process Syst., Nevada, US, Dec. 2013, pp. 935-943.
  • [21] A. Frome, G. S. Corrado, J. Shlens, et al., “DeViSE: A deep visual-semantic embedding model”, Advances in Neural Inf. Process Syst., Nevada, US, Dec. 2013, pp. 2121-2129.
  • [22] Z. Jiang, Z. Li, L. Davis, “Label consistent k-svd: learning a discriminative dictionary for recognition,” IEEE Trans. on Pattern Anal. Mach. Intell., vol. 30, no. 11, pp. 2651-2664, 2013.
  • [23] Z. Wang, R. Hu, C. Liang, et al., “Zero-shot person re-identification via cross-view consistency,” IEEE Trans. on Multimedia, vol. 18, no. 2, pp. 260-272, 2016.
  • [24] S. Gu, L. Zhang, W. Zuo, et al., “Projective dictionary pair learning for pattern classification,” Advances in Neural Inf. Process Syst., Montral, Canada, Dec. 2014, pp. 793-801.
  • [25] X. Song, Z. H. Feng, G. Hu, X. J. Wu, “Half-Face Dictionary Integration for Representation-Based Classification,” IEEE Trans. Cybern., vol. 47, no. 1, pp.142-152, 2016.
  • [26] M. Uzair, A. Mian, “Blind Domain Adaptation With Augmented Extreme Learning Machine Features,” IEEE Trans. Cybern., Sep. 2016.
  • [27] L. Duan, D. Xu, I. W. Tsang, “Domain adaptation from multiple sources: A domain-dependent regularization approach.” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 3, pp.  504-518, 2012.
  • [28] M. Rohrbach, S. Ebert, and B. Schiele, “Transfer learning in a transductive setting,” Advances in Neural Inf. Process Syst., Nevada, US, Dec. 2013, pp. 46-54.
  • [29] Y. Guo, G. Ding, X. Jin, et al., “Transductive Zero-Shot Recognition via Shared Model Space Learning,” Thirtieth AAAI Conf. Art. Intell., Phoenix, USA, Feb. 2016.
  • [30] L. Duan, I. W. Tsang, D. Xu, and S. J. Maybank, “Domain transfer svm for video concept detection,” in Proc. IEEE Conf. on Comput. Vis. Pattern Recognit., Miami, USA, June 2009, pp. 1375-1381.
  • [31] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars, “Unsupervised visual domain adaptation using subspace alignment,” in Proc. IEEE Inter. Conf. Comput. Vis., Sydney, Australia, Dec. 2013, pp. 2960-2967.
  • [32] X. Xu, T. Hospedales and S. Gong, “Zero-shot action recognition by word-vector embedding,” arXiv:1511.04458, 2015.
  • [33] Z. Al-Halah, M. Tapaswi, R. Stiefelhagen, “Recovering the Missing Link: Predicting Class-Attribute Associations for Unsupervised Zero-Shot Learning,” in Proc. IEEE Conf. on Comput. Vis. Pattern Recognit., Las Vegas, USA, June 2016.
  • [34] X. Xu, T. Hospedales, S. Gong, “Semantic embedding space for zero-shot action recognition,” in Proc IEEE Int. Conf. Image Proc., Quebec city, Canada, Sep. 2015, pp.63-67.
  • [35] Q. Wang, K. Chen, “Zero-Shot Visual Recognition via Bidirectional Latent Embedding,” arXiv:1607.02104, 2016.
  • [36] Y. Fu, T. Hospedales, T. Xiang, and S. Gong, “Attribute learning for understanding unstructured social activity,” in Proc. Eur. Conf. Comput. Vis., Firenze, Italy, Oct., 2012, pp. 530-543.
  • [37] L. Shao, F. Zhu, X. Li, “Transfer learning for visual categorization: A survey,” IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 5, pp. 1019-1034, 2015.
  • [38] C. H. Lampert, H. Nickisch and S. Harmeling, “Attribute-based classification for zero-shot visual object categorization,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 3, pp. 453-465, 2014.
  • [39] X. Glorot, A. Bordes, and Y. Bengio, “Domain adaptation for large-scale sentiment classification: A deep learning approach,” in Proc. Inter. Conf. Mach. Learn., Washington, USA, July 2011, pp. 513-520.
  • [40] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado and J. Dean, “Distributed representations of words and phrases and their compositionality,” Advances in Neural Inf. Process Syst., Nevada, US, Dec. 2013, pp. 3111-3119.
  • [41] F. X. Yu, L. Cao, R. S. Feris, J. R. Smith, and S. F. Chang, “Designing category-level attributes for discriminative visual recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Portland, USA, June 2013, pp. 771-778.
  • [42] T. Mensink, E. Gavves, and C. G. Snoek, “Costa: Co-occurrence statistics for zero-shot classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Columbus, USA, June, 2014, pp. 2441-2448.
  • [43] J. Pennington, R. Socher, C. D. Manning, “Glove: Global vectors for word representation,” in Proc. Conf. Empi. Meth. Natural Lan. Proc., Doha, Qatar, Oct. 2014, pp. 1532-1543.
  • [44] A. Lazaridou, E. Bruni, M. Baroni, “Is this a wampimuk? Cross-modal mapping between distributional semantics and the visual world,” Proc. ACL., 2014, pp. 1403-1414.
  • [45] Y. Fu, T. M. Hospedales, T. Xiang et al., “Transductive multi-view zero-shot learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 11, pp. 2332-2345, 2015.
  • [46] A. Margolis, “A literature review of domain adaptation with unlabeled data,” Tec. Report, pp. 1-42, 2011.
  • [47] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The Caltech-UCSD Birds-200-2011 Dataset”, Technical report, 2011.
  • [48] G. Patterson, C. Xu, H. Su, and J. Hays, “The SUN attribute database: Beyond categories for deeper scene understanding,” Int. Journal of Comput. Vis., vol. 108, no. 1, pp. 59-81, 2014.