Introduction
Most classification algorithms require a large pool of manually labeled data to learn the optimal parameters of a classifier. The recent exponential growth of visual data, the growing need for fine-grained multi-label annotations, and the consistent emergence of new classes (e.g. new products), however, have rendered manual labeling of data practically infeasible. Transfer learning has been proposed as a remedy for this issue [Lampert, Nickisch, and Harmeling2009]. The idea is to learn on a limited number of classes and then, through knowledge transfer, learn how to classify images from new classes either using only a few labeled data points, i.e. few- and one-shot learning [FeiFei, Fergus, and Perona2006], or in the extreme case without any labeled data, i.e. zero-shot learning (ZSL) [Lampert, Nickisch, and Harmeling2009]. These transfer learning approaches address the challenge of annotated data unavailability and open the door towards lifelong learning machines.
To learn target classes with no labeled data, one needs to be able to generalize the relationship between the source data and its labels to the target classes. To address this challenge in ZSL, an intermediate shared space (i.e. the space of semantic attributes) is exploited, which allows for knowledge transfer from labeled classes to unlabeled classes. The overarching idea in ZSL is that the source and the target classes share common attributes. The semantic attributes (e.g. can fly, is green) are often provided as accessible side information (e.g. a verbal description of a class) that uniquely describes each class of data. To achieve ZSL, the relationship between seen data and its corresponding attributes is first learned in the training phase. In the testing stage, this allows for parsing a target image from an unseen class into its semantic attributes to predict the corresponding label.
To clarify the core idea of ZSL and the steps required to perform it, consider the following sentence: ‘Tardigrades (also known as water bears or moss piglets) are water-dwelling, eight-legged, segmented micro-animals’ (source: Wikipedia). Given this textual description, one can easily identify the creature in Figure 1 (left) as a tardigrade even though one may never have seen one before. Performing this task requires three capabilities: 1) parsing the textual information into semantic features, so we can describe the class Tardigrade as ‘bear-like’, ‘piglet-like’, ‘water-dwelling’, ‘eight-legged’, ‘segmented’, and ‘microscopic animal’, 2) parsing the image into its visual attributes (see Figure 1), and 3) matching the parsed visual features to the parsed textual information, which often requires extensive prior knowledge. Recent textual features extracted from large unlabeled text corpora, including word2vec [Mikolov et al.2013] and GloVe [Pennington, Socher, and Manning2014], enable a learner to efficiently parse textual information. Deep convolutional neural networks (CNNs) [Krizhevsky, Sutskever, and Hinton2012, Simonyan and Zisserman2014, He et al.2016, Huang et al.2017] have revolutionized the field of computer vision and enable a learner to extract rich visual features from images. An extensive body of work in the field of ZSL concentrates on modeling the relationship between visual features and semantic attributes [Palatucci et al.2009, Akata et al.2013, Socher et al.2013, Norouzi et al.2014, Lampert, Nickisch, and Harmeling2009, Zhang and Saligrama2015, Ding, Shao, and Fu2017].
In this paper, we provide a novel approach to model the relationship between the visual features and the textual information. Our specific contributions are:

New formulation of ZSL via joint dictionary learning

Extending the classic joint dictionary learning formulation to an attribute aware formulation that addresses the domain shift/adaptation problem [Kodirov et al.2015]

Demonstrating the benefit of a transductive learning scheme to reduce the hubness phenomenon [Dinu, Lazaridou, and Baroni2014, Shigeto et al.2015]
Related Work
ZSL methods often focus on learning the relationship between the visual space and the semantic attribute space. Palatucci et al. [Palatucci et al.2009] proposed to learn a linear compatibility between the visual space and the semantic attribute space. Lampert et al. [Lampert, Nickisch, and Harmeling2009] posed the problem as an attribute classification problem: they learned individual binary attribute classifiers in the training stage and used the ensemble of classifiers to map visual features to their semantic attributes. Yu and Aloimonos [Yu and Aloimonos2010] approached the problem from a probabilistic point of view and proposed to use generative models to learn prior distributions for image features with respect to each attribute. More recently, various authors have proposed to embed image features and semantic attributes in a shared metric space (i.e. a latent embedding) [Akata et al.2013, RomeraParedes and Torr2015, Zhang and Saligrama2015] while forcing the embedded representations of image features and their corresponding semantic attributes to be similar. Akata et al. [Akata et al.2013], for instance, proposed a model that embeds the image features and the semantic attributes in a common space where the compatibility between them is measured via a bilinear function. Similarly, Romera-Paredes and Torr [RomeraParedes and Torr2015] utilized a principled choice of regularizers that enabled them to derive a simple closed-form solution for a linear mapping that embeds the image features and the semantic attributes in a low-dimensional shared linear subspace. Others have identified the major challenges in ZSL to be the domain shift problem [Kodirov et al.2015] and the hubness phenomenon [Dinu, Lazaridou, and Baroni2014, Shigeto et al.2015].
In short, the domain shift problem arises from the fact that the distribution of features corresponding to the same attribute could be very different for seen and unseen images (e.g. stripes of tigers versus zebras). The hubness problem, on the other hand, states that there will often be attributes that are similar (have small distance) to vastly different visual features in the embedding space. Various transductive approaches have been proposed to overcome the hubness problem [Fu et al.2015, Yu et al.2017].
The use of sparse dictionaries to model the space of visual features and semantic attributes as a union of linear subspaces has been shown to be an effective modeling scheme in recent ZSL papers [Yu et al.2017, Isele, Rostami, and Eaton2016, Kodirov et al.2015, Zhang and Saligrama2015]. Zhang and Saligrama [Zhang and Saligrama2015] showed that modeling the test image features as a sparse linear combination of training image features is beneficial and formulated a ZSL method based on this principle. Using similar ideas, Isele et al. [Isele, Rostami, and Eaton2016] used joint dictionary learning to learn a dynamical control system using high-level task descriptors in an online lifelong zero-shot reinforcement learning setting. Our JDZSL method builds on similar ideas as in [Yu et al.2017, Isele, Rostami, and Eaton2016, Kodirov et al.2015] and introduces a novel ZSL method based on learning joint sparse dictionaries for the image features and the semantic attributes. At its core, JDZSL is equipped with a novel entropy minimization regularizer [Grandvalet and Bengio2004], which facilitates the solution to the ZSL problem by reducing the domain shift effect. We further show that a transductive approach applied to our attribute-aware JDZSL formulation provides state-of-the-art or close to state-of-the-art performance on various benchmark datasets. Finally, it should be noted that the idea of using joint dictionaries to map data from a given metric space to a second related space was pioneered by Yang et al. [Yang et al.2010] in super-resolution applications.
Figure 1 captures the gist of our idea. Visual features are extracted via CNNs (left subfigure), and the semantic attributes are provided via textual feature extractors like word2vec or via human annotations (right subfigure). Both the visual features and the semantic attributes are assumed to be sparsely representable in a shared union of linear subspaces. The idea is then to enforce the sparse representation vectors for both domains to be equal and thus effectively couple the learned dictionaries for the visual and the attribute spaces. The intuition from a co-view perspective [Yu et al.2014] is that both the visual and the attribute features provide information about the same class, and so each can augment the learning of the other. Each underlying class is common to both views, and one can find task embeddings that are consistent for both the visual features and their corresponding attributes. Having learned the coupled dictionaries, zero-shot classification can be performed by mapping images of unseen classes into the attribute space, where classification can be done simply via nearest neighbor or via a more elaborate scheme like label propagation. Given the coupled nature of the learned dictionaries, an image can be mapped to its semantic attributes by first finding its sparse representation with respect to the visual dictionary; the semantic attribute dictionary is then used to recover the attribute vector from the joint sparse representation, which can in turn be used for classification.
Problem Statement and Technical Rationale
Consider a visual feature metric space, an attribute metric space, and a class label set which ranges over a finite alphabet (images can potentially have multiple memberships to the classes). As an example, the visual features can be extracted from a deep CNN, and the attributes can be a binary code that identifies the presence/absence of various characteristics in an object [Lampert, Nickisch, and Harmeling2009]. We are given a labeled dataset of features of seen images and their corresponding semantic attributes. We are also given the unlabeled attributes of unseen classes (i.e. we have access to textual information for a wide variety of objects but do not have access to the corresponding visual information). In ZSL the sets of seen and unseen classes are disjoint and it is assumed that the semantic attributes are class specific. The goal is then to use the seen data and the unseen attributes to learn the relationship between the visual space and the attribute space, so that when an unseen image (an image from an unseen class) is fed to the system, its corresponding attributes and consequently its label can be predicted. Finally, we assume that the mapping between the attribute space and the label space is a known linear mapping.
To further clarify the problem, consider an instance of ZSL in which features extracted from images of horses and tigers are included in the seen visual features, but the seen set does not contain features from images containing zebras. On the other hand, the semantic attributes contain information on all seen and unseen classes, including the zebras. Intuitively, by learning the relationship between the image features and the attributes “has hooves”, “has mane”, and “has stripes” from the seen images, we must be able to assign an image of a zebra to its corresponding attribute, even though we have never seen a zebra before. More formally, we want to learn the mapping which relates the visual space and the attribute space. Having learned this mapping, for an unseen image one can recover the corresponding attribute vector from the image features and then classify the image by composing this mapping with the known attribute-to-label mapping.
Technical Rationale
For the rest of our discussion we assume that the visual feature, attribute, and label spaces are real vector spaces. The simplest ZSL approach is to assume that the mapping from visual features to attributes is linear and then minimize the regression error to learn it. Despite the existence of a closed-form solution, the solution contains the inverse of the covariance matrix of the visual features, which requires a large number of data points for accurate estimation. To overcome this problem, various regularizations of the linear mapping are considered. Decomposing the linear mapping into a product of three factors can also be helpful. Intuitively, the first factor is a right linear operator that projects the visual features into a shared low-dimensional subspace, the second is a left linear operator that projects the attributes into the same shared subspace, and the third provides a bilinear similarity measure in the shared subspace. The regression problem can then be transformed into maximizing a weighted correlation between the embedded visual features and the embedded attributes. This is the essence of many ZSL techniques, including Akata et al. [Akata et al.2013] and Romera-Paredes and Torr [RomeraParedes and Torr2015]. This technique can be extended to nonlinear mappings using kernel methods. However, the choice of kernels remains a challenge.
On the other side of the spectrum, the mapping can be chosen to be highly nonlinear, as in deep neural networks (DNNs). Let the mapping be a DNN whose parameters are its synaptic weights and biases. ZSL can then be addressed by minimizing the regression error with respect to these parameters. Alternatively, one can nonlinearly embed the visual features and the attributes in a shared metric space via two deep nets and maximize their similarity measure in the embedded space, as in [Lei Ba et al.2015]. Nonlinear methods are computationally expensive, require a large training dataset, and can easily overfit to the training data. On the other hand, linear ZSL algorithms are efficient, easy to train, and generalizable, but they are often outperformed by nonlinear methods. As a compromise, we model nonlinearities in data distributions as a union of linear subspaces with coupled dictionaries. By jointly learning the visual and attribute dictionaries, we effectively model the relationship between the metric spaces. This allows a nonlinear scheme with a computational complexity comparable to linear techniques.
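To make the regularized linear baseline discussed in this section concrete, here is a short worked form of its closed-form solution. The notation is an assumption introduced only for this illustration and is not taken from the paper: the columns of $X \in \mathbb{R}^{p \times N}$ and $Z \in \mathbb{R}^{q \times N}$ hold the seen visual features and their attributes, $W \in \mathbb{R}^{p \times q}$ is the linear map, and $\eta > 0$ is a ridge weight (Frobenius-norm regularization is just one of the possible regularizers).

```latex
\begin{aligned}
W^{*} &= \arg\min_{W}\; \|W^{\top}X - Z\|_F^2 + \eta\,\|W\|_F^2 \\
      &= \left(XX^{\top} + \eta I_p\right)^{-1} X Z^{\top}
\end{aligned}
```

Setting the gradient $2X(X^{\top}W - Z^{\top}) + 2\eta W$ to zero gives the second line. With $\eta = 0$ the $p \times p$ matrix $XX^{\top}$ has rank at most $N$, so it is singular whenever $N < p$, which is one way to see why the unregularized solution needs many data points.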
Zero Shot Learning using Joint Dictionaries
Joint dictionary learning has been proposed to couple related features from two metric spaces [Yang et al.2010, Shekhar et al.2014]. Yang et al. [Yang et al.2010] proposed the approach to tackle the problem of image super-resolution and Shekhar et al. [Shekhar et al.2014] used joint dictionary learning for multimodal biometrics recognition. Following a similar framework, the gist of our approach is to learn the mapping through two dictionaries, $D_x$ and $D_z$, that model the visual features and the attributes, respectively. The goal is to find a shared sparse representation (i.e. sparse code) for a visual feature and its attributes, such that each is well approximated by the product of its own dictionary with the shared code. Below we describe the training and testing phases of our proposed method.
Training phase
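Standard dictionary learning is based on minimizing the empirical average estimation error on a given training set $X = [x_1, \dots, x_N]$, where $\ell_1$ regularization on the sparse codes $A$ enforces sparsity:
$$D^{*}, A^{*} = \arg\min_{D, A} \; \frac{1}{N}\|X - DA\|_F^2 + \lambda\|A\|_1 \quad \text{s.t.} \quad \|D^{[i]}\|_2^2 \le 1 \qquad (1)$$
Here $\lambda$ is the regularization parameter, which controls the sparsity of $A$, and $D^{[i]}$ is the i’th column of $D$. Alternatively, following the Lagrange multiplier technique, the Frobenius norm of $D$ could be used as a regularizer in place of the constraint.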
In our joint dictionary learning framework, we aim to learn the visual dictionary $D_x$ and the attribute dictionary $D_z$ such that they share the sparse coefficients used to represent the seen visual features and their corresponding attributes, respectively. An important twist here is that the attribute dictionary, $D_z$, is also required to sparsify the semantic attributes of other (unseen) classes. To obtain such coupled dictionaries we propose the following optimization,
(2) 
The above formulation combines the dictionary learning problems for the visual features and the attributes by coupling them via the shared sparse codes, and also enforces the attribute dictionary to be a sparsifying dictionary (i.e. a good model) for the unseen attributes. The optimization in Eq. (2), while convex in each individual term, is highly nonconvex in all variables. Following the approach proposed in [Yang et al.2012], we use an Expectation Maximization (EM) like alternation to update the visual and attribute dictionaries. To do so, we rewrite the optimization problem into the following two steps:
For a fixed update via the following optimization:
(3) is found using a Lasso optimization problem, and FISTA [Beck and Teboulle2009] is used to update and .

For a fixed update via:
(4) which involves a Lasso optimization together with a simple regression with a closed-form solution.
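The alternation described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the seen features and attributes are stacked row-wise so that they share one code matrix, it uses plain proximal-gradient steps in place of FISTA, it constrains the columns of the stacked dictionary rather than each dictionary separately, and it omits the extra term that asks the attribute dictionary to sparsify the unseen attributes. All function names, `lam`, `step`, and `n_atoms` are hypothetical.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def residual(Y, D, A):
    """Return D A - Y entrywise."""
    return [[r - y for r, y in zip(r_row, y_row)]
            for r_row, y_row in zip(matmul(D, A), Y)]

def squared_error(Y, D, A):
    return sum(v * v for row in residual(Y, D, A) for v in row)

def soft(v, t):
    """Soft-thresholding, the proximal operator of t * |v|."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def clip_columns(D):
    """Rescale any column whose Euclidean norm exceeds 1 back onto the unit ball."""
    norms = [math.sqrt(sum(v * v for v in col)) for col in zip(*D)]
    return [[v / max(1.0, n) for v, n in zip(row, norms)] for row in D]

def learn_joint_dictionaries(X, Z, n_atoms, lam=0.05, step=0.01, n_iter=300, seed=0):
    """Alternate steps on 0.5*||Y - D A||_F^2 + lam*||A||_1 with Y = [X; Z]."""
    rng = random.Random(seed)
    p, n = len(X), len(X[0])
    Y = X + Z  # stacking rows forces X and Z to share the code matrix A
    D = clip_columns([[rng.gauss(0.0, 0.1) for _ in range(n_atoms)] for _ in Y])
    A = [[0.0] * n for _ in range(n_atoms)]
    for _ in range(n_iter):
        # Sparse-code update (D fixed): one ISTA step.
        G = matmul(transpose(D), residual(Y, D, A))
        A = [[soft(a - step * g, step * lam) for a, g in zip(a_row, g_row)]
             for a_row, g_row in zip(A, G)]
        # Dictionary update (A fixed): one projected gradient step.
        G = matmul(residual(Y, D, A), transpose(A))
        D = clip_columns([[d - step * g for d, g in zip(d_row, g_row)]
                          for d_row, g_row in zip(D, G)])
    # The top p rows form the visual dictionary, the rest the attribute dictionary.
    return D[:p], D[p:], A
```

The fixed step size is only safe for small, well-scaled data such as a toy example; a practical version would use a line search or a Lipschitz estimate for each sub-problem.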
ZeroShot Prediction of Unseen Attributes
In the testing phase we are only given the extracted features from unseen images, and the goal is to predict their corresponding semantic attributes. Here we introduce a progression of methods, which clarifies the logic behind our method, and enables us to efficiently predict the semantic attributes of the unseen images based on the learned dictionaries in the training phase.
Attribute Agnostic Prediction
The attribute-agnostic (AAg) formulation is the naive way of predicting semantic attributes from an unseen image. In the AAg formulation, we first find the sparse representation $\alpha$ of the unseen image feature $x$ with respect to the learned visual dictionary $D_x$ by solving the following Lasso problem,
$$\alpha = \arg\min_{a} \; \|x - D_x a\|_2^2 + \lambda\|a\|_1 \qquad (5)$$
Here, one could simply use $\alpha$ and compare it to the sparse codes of the unseen attributes. In our experiments, however, we found that this approach is not suitable in our JDZSL setting, as the dictionaries could have redundant atoms that cause two similar image features or attributes to have different sparse codes. Instead, we do the comparison in the attribute space and predict the corresponding attribute via $\hat{z} = D_z \alpha$, where $D_z$ is the learned attribute dictionary. In the attribute-agnostic formulation, the sparse coefficients are calculated without any information from the attribute space. Not using the information from the attribute space leads to the domain shift problem, in the sense that there is no guarantee that $\alpha$ reconstructs a meaningful attribute vector $\hat{z}$. In other words, $\hat{z}$ could be far from the unseen attributes and therefore could not be assigned to any known attribute with high confidence. To alleviate this problem we progress to an extended solution, which we denote as the Attribute Aware (AAw) prediction.
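The attribute-agnostic prediction can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's code: it uses plain ISTA (the paper mentions FISTA for its Lasso problems), it puts a factor of 0.5 on the data term (equivalent to rescaling the sparsity weight), and it assumes the caller supplies a step size no larger than the reciprocal of the largest eigenvalue of the Gram matrix of the visual dictionary. The names `ista_lasso` and `predict_attribute_agnostic` are hypothetical.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def soft_threshold(v, t):
    """Elementwise proximal operator of t * ||.||_1."""
    return [math.copysign(max(abs(x) - t, 0.0), x) for x in v]

def ista_lasso(D, x, lam, step, n_iter=500):
    """Minimize 0.5 * ||x - D a||^2 + lam * ||a||_1 with plain ISTA."""
    Dt = transpose(D)
    a = [0.0] * len(D[0])
    for _ in range(n_iter):
        residual = [r - xi for r, xi in zip(matvec(D, a), x)]
        grad = matvec(Dt, residual)
        a = soft_threshold([ai - step * g for ai, g in zip(a, grad)], step * lam)
    return a

def predict_attribute_agnostic(x, D_x, D_z, prototypes, lam, step):
    """Code x on the visual dictionary, decode with the attribute dictionary,
    and return the predicted attribute with the index of the nearest prototype."""
    alpha = ista_lasso(D_x, x, lam, step)
    z_hat = matvec(D_z, alpha)
    dists = [sum((a - b) ** 2 for a, b in zip(z_hat, z)) for z in prototypes]
    return z_hat, dists.index(min(dists))
```

The nearest-prototype step at the end corresponds to the inductive label assignment; the sparse code itself never sees the prototypes, which is exactly the weakness the attribute-aware formulation targets.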
Attribute Aware Prediction
In the attribute-aware (AAw) formulation we would like to find the sparse representation $\alpha$ to not only approximate the input visual feature, $x \approx D_x \alpha$, but also provide an attribute prediction, $\hat{z} = D_z \alpha$, that is well resolved in the attribute space and does not suffer from the domain shift problem. Note that, ideally, $\hat{z} = z'_m$ for some $m$, where $z'_1, \dots, z'_M$ denote the attributes of the unseen classes. To achieve this we define the soft assignment of $\hat{z}$ to $z'_m$, denoted by $p_m$, using the Student’s t-distribution as a kernel to measure similarity between $\hat{z} = D_z \alpha$ and $z'_m$,
$$p_m(\alpha) = \frac{\left(1 + \|D_z\alpha - z'_m\|_2^2/\rho\right)^{-\frac{\rho+1}{2}}}{\sum_{k}\left(1 + \|D_z\alpha - z'_k\|_2^2/\rho\right)^{-\frac{\rho+1}{2}}} \qquad (6)$$
where $\rho$ is the kernel parameter. The choice of t-distribution is due to its long tail and low sensitivity to the choice of kernel parameter, $\rho$. Ideally, $p_m = 1$ for some $m$ and $p_k = 0$ for $k \neq m$. The ideal soft assignment would then be one-sparse and therefore would have minimum entropy. This motivates our attribute-aware formulation, which regularizes the AAg formulation in Equation 5 with the entropy of $p$,
$$\alpha = \arg\min_{a} \; \|x - D_x a\|_2^2 - \gamma\sum_{m} p_m(a)\log p_m(a) + \lambda\|a\|_1 \qquad (7)$$
where $\gamma$ is the regularization parameter for the entropy of the soft-assignment probability vector $p$. Such an entropy minimization scheme has been successfully used in several works [Grandvalet and Bengio2004, Huang, Tran, and Tran2016], whether as a sparsifying regularization or to boost the confidence of classifiers. We note that the entropy regularization enforces the prediction to be close to one of the unseen attributes, but it can potentially backfire in that a low-entropy solution (aligned to a prototype) doesn’t necessarily have to be the correct solution. In our experiments, we consistently observed higher performance for the AAw formulation.
The entropy regularization turns the optimization in Eq. (7) into a nonconvex problem. In [Huang, Tran, and Tran2016], the authors use a generalized gradient descent approach similar to FISTA to optimize this nonconvex problem. We use a similar scheme to optimize the objective function in Eq. (7). In short, we relax the entropy term using its quadratic approximation around the previous estimate of the sparse code, and update the sparse code as the solution of the following problem
(8) 
Equation (8) is a LASSO problem and can be solved efficiently using FISTA. It only remains to compute :
(9) 
where:
Due to the nonconvex nature of the objective function, a good initialization is needed to achieve a sensible solution. Therefore we initialize the sparse code from the solution of the AAg formulation. Finally, the corresponding attributes are estimated by applying the attribute dictionary to the resulting sparse codes of the unseen images.
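The soft assignment and the entropy term that the attribute-aware formulation penalizes can be sketched as follows. This is a minimal illustration, assuming the standard Student's t kernel with parameter `rho` and squared Euclidean distances; the function names are hypothetical, and the optimization over the sparse code is not shown.

```python
import math

def soft_assignment(z_hat, prototypes, rho=1.0):
    """Student's t kernel similarity of z_hat to each prototype, normalized to sum to 1."""
    power = -(rho + 1.0) / 2.0
    weights = []
    for z in prototypes:
        sq_dist = sum((a - b) ** 2 for a, b in zip(z_hat, z))
        weights.append((1.0 + sq_dist / rho) ** power)
    total = sum(weights)
    return [w / total for w in weights]

def entropy(p):
    """Shannon entropy of a probability vector (natural logarithm)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)
```

A predicted attribute that sits exactly on one prototype yields a peaked assignment with low entropy, while one that is equidistant from two prototypes yields the maximum entropy for two classes, which is the behavior the regularizer exploits.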
From Predicted Attributes to Labels
In order to predict the image labels, one needs to assign the predicted attributes to the attributes of the unseen classes (i.e. prototypes). This task can be performed in two ways, namely the inductive approach and the transductive approach. In the inductive scheme the inference could be performed using a nearest neighbor (NN) approach, in which the label of each individual predicted attribute is assigned to be the label of its nearest prototype. In such an approach the structure of the predicted attributes is not taken into account, and the hubness problem could easily degrade the performance of the ZSL algorithm. Looking at the t-SNE embedding visualization [Maaten and Hinton2008] of the predicted attributes and the prototypes in Figure 2 (details are explained later), it can be seen that NN does not provide an optimal label assignment.
In the transductive setting, on the other hand, the attributes for all test (i.e. unseen) images are first predicted. Next, a graph is formed on the predicted attributes together with the prototypes, where the labels of the prototypes are known and the task is to infer the labels of the predicted attributes. This problem can be formulated as graph-based semi-supervised label propagation [Belkin, Matveeva, and Niyogi2004, Zhou et al.2003]. We follow the work of Zhou et al. [Zhou et al.2003] and spread the labels of the prototypes to the predicted attributes. More precisely, we first reduce the dimension of the attributes via t-SNE [Maaten and Hinton2008], then form a graph in the lower dimension and perform label propagation on this graph. Figure 2 reconfirms that label propagation in a transductive setting can significantly improve the performance of ZSL and mitigate the hubness and domain shift issues, as also demonstrated in [Fu et al.2015, Yu et al.2017].
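The transductive label assignment can be sketched with the iterative form of the label spreading scheme of Zhou et al., in which class scores are repeatedly diffused over a normalized affinity graph and pulled back toward the known labels. The sketch below makes several assumptions that are not from the paper: a dense Gaussian affinity with bandwidth `sigma`, a spreading weight `alpha`, no t-SNE step, and a graph in which every point has at least one neighbor with nonzero affinity. The function name is hypothetical.

```python
import math

def label_propagation(points, labels, n_classes, sigma=1.0, alpha=0.9, n_iter=200):
    """labels[i] is a class index for labeled points (prototypes) and None otherwise."""
    n = len(points)
    # Gaussian affinity matrix with a zero diagonal.
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                W[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    # Symmetric normalization S = Deg^{-1/2} W Deg^{-1/2}.
    deg = [sum(row) for row in W]
    S = [[W[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    # One-hot rows for labeled points, zero rows for unlabeled points.
    Y = [[1.0 if labels[i] == c else 0.0 for c in range(n_classes)] for i in range(n)]
    F = [row[:] for row in Y]
    for _ in range(n_iter):
        # F <- alpha * S F + (1 - alpha) * Y
        F = [[alpha * sum(S[i][k] * F[k][c] for k in range(n)) + (1.0 - alpha) * Y[i][c]
              for c in range(n_classes)] for i in range(n)]
    return [max(range(n_classes), key=lambda c: F[i][c]) for i in range(n)]
```

In the ZSL setting the labeled points would be the unseen-class prototypes and the unlabeled points the predicted attributes, so each prediction is labeled by the cluster structure around it rather than by its single nearest prototype.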
Theoretical Discussion
The core step for ZSL in our scheme is to compute the joint sparse representation for an unseen image. Note that in the testing phase, the sparse representation is estimated using (5), while the dictionaries are learned for optimal sparse representations as in (2). More specifically, we need to demonstrate that the following two problems lead to close approximations:
(10) 
in order to conclude that we can solve for the sparse code in the ZSL regime (i.e. predicting attributes for unseen images) and estimate the attributes with good accuracy. Note that the major challenge in the testing phase is that we are using only the visual dictionary to find the shared sparse parameters, instead of the joint dictionary. To study the effect of this change, we first point out that Eq. 1 can be interpreted as the result of a maximum a posteriori (MAP) inference from a Bayesian perspective. This means that, from a probabilistic perspective, the sparse codes are drawn from a Laplacian distribution and the dictionary is a Gaussian matrix with elements drawn i.i.d. Given a drawn dataset, we thus learn the MAP estimate of a Gaussian matrix. In order to analyze the effect, we rely on the following theorem about LASSO with Gaussian matrices [Negahban et al.2009]:
Theorem 1 [Negahban et al.2009]: Let be the unique sparse solution of the linear system with and . If is the LASSO solution for the system from noisy observations, then with high probability: , where
is a constant which depends on the loss function which measures the data fidelity, here the Euclidean distance.
Lemma 1: Attribute prediction error in ZSL setting is upperbounded proportional to .
Proof: note that if is a solution of , trivially it is also a solution for as well. Now using Theorem 1:
(11) 
Note that we have used the triangle inequality first and then the theorem in the above deduction, and that the matrix norm denotes the spectral norm. This result accords with intuition. First, it suggests that a sparser code decreases the error, which means that a good sparsifying dictionary leads to less ZSL error. Second, the error is inversely related to both the visual and the attribute dimensions, meaning that rich visual and attribute descriptions lead to smaller ZSL error. This suggests that for our approach to work, the existence of a good sparsifying dictionary as well as rich visual and attribute data is essential. Finally, although increasing the number of dictionary columns can intuitively improve sparsity, this result shows that it can potentially increase the ZSL error, so the number of columns should be tuned for optimal performance.
Method                             SUN     CUB     AwA
[RomeraParedes and Torr2015]       82.10   -       75.32
[Zhang and Saligrama2015]          82.5    30.41   76.33
[Zhang and Saligrama2016]          82.83   42.11   80.46
[Bucher, Herbin, and Jurie2016]    84.41   43.29   77.32
[Xu et al.2017]                    83.5    53.6    84.5
[Li et al.2017]                    -       61.79   87.22
[Ye and Guo2017]                   85.40   57.14   85.66
[Ding, Shao, and Fu2017]           86.0    45.2    82.8
[Wang and Chen2017]                -       42.7    79.8
[Kodirov, Xiang, and Gong2017]     91.0    61.4    84.7
Ours: AAg (5)                      82.05   35.81   77.73
Ours: AAw (6)                      83.22   38.36   83.33
Ours: Transductive AAw (TAAw)      85.90   47.12   88.23
Ours: TAAw hit@3                   94.52   58.19   91.73
Ours: TAAw hit@5                   98.15   69.67   97.13
Table 1: Zero-shot classification results for three benchmark datasets. All methods use VGG19 features trained on the ImageNet dataset and the original continuous (or binned) attributes provided by the datasets. Results for the compared methods are either extracted directly from the corresponding paper or reimplemented with VGG19 features; - indicates that a result is not reported.
Experiments
We carried out experiments on three benchmark ZSL datasets and empirically evaluated the resulting performance against recent ZSL algorithms.
Datasets: We conducted our experiments on three benchmark datasets, namely the Animals with Attributes (AwA1) [Lampert, Nickisch, and Harmeling2014], the SUN attribute [Patterson and Hays2012], and the Caltech-UCSD Birds 200-2011 (CUB) [Wah et al.2011] datasets. The AwA1 dataset is a coarse-grained dataset containing 30475 images of 50 types of animals with 85 corresponding attributes for these classes. Semantic attributes for this dataset are obtained via human annotations. The images for the AwA1 dataset are not publicly available; therefore we use the publicly available features extracted from a VGG19 convolutional neural network, which was pretrained on the ImageNet dataset. Following the conventional usage of this dataset, 40 classes are used as source classes to learn the model and the remaining 10 classes are used as target (unseen) classes to test the performance of zero-shot classification. The SUN dataset is a fine-grained dataset and contains 717 classes of different scene categories with 20 images per category (14340 images in total). Each image is annotated with 102 attributes that describe the corresponding scene. Following [Lampert, Nickisch, and Harmeling2014], 707 classes are used to learn the dictionaries and the remaining 10 classes are used for testing. The CUB dataset is a fine-grained dataset containing 200 classes of different types of birds with 11788 images, with 312 attributes and boundary segmentation for each image. The attributes are obtained via human annotation. The dataset is divided into four almost equal folds, where three folds are used to learn the model and the fourth fold is used for testing. For both the SUN and CUB datasets we used features from VGG19 trained on the ImageNet dataset. We note that our results using ResNet50 and DenseNet [Huang et al.2017] features will be published in an extended version of this paper.
Tuning parameters: The regularization parameters of the optimization problems, as well as the number of dictionary atoms, need to be tuned for maximal performance. We used standard cross-validation to search for the optimal parameters for each dataset. After splitting the datasets accordingly into training, validation, and testing sets, we used performance on the validation set for tuning the parameters in a brute-force search. We used the common evaluation metric in ZSL, flat hit@K classification accuracy, to measure the performance. This means that a test image is said to be classified correctly if its true label is among the top K predicted labels. We report the hit@1 rate to measure ZSL image classification performance and hit@3 and hit@5 for image retrieval performance. Each experiment is performed ten times and the mean is reported in Table 1.
Results: Figure 2 demonstrates the 2D t-SNE embedding for predicted attributes and actual class attributes of the AwA dataset. The actual attributes are depicted by colored circles with black edges. The first column of Figure 2 demonstrates the attribute prediction for the AAg and AAw formulations. It can be clearly seen that the entropy regularization in the AAw formulation improves the clustering quality, decreases data overlap, and reduces the domain shift problem. The nearest neighbor label assignment is shown in the second column, which demonstrates the domain shift and hubness problems with NN label assignment in the attribute space. The third column of Figure 2 shows the transductive approach, in which label propagation is performed on the graph of the predicted attributes. Note that label propagation addresses the domain shift and hubness problems and, when used with the AAw formulation, provides significantly better zero-shot classification accuracy.
Performance comparison results are summarized in Table 1. As pointed out by Xian et al. [Xian et al.2017], the variety of image features used (e.g. various DNNs and various combinations of these features), the variation in the attributes used (e.g. word2vec, human annotation), and different data splits make direct comparison with the ZSL methods in the literature very challenging. In Table 1 we provide a fair comparison of our JDZSL performance to recent methods in the literature. All compared methods use the same visual features (i.e. VGG19) and the same attributes (i.e. the continuous or binned attributes) provided in the dataset. The caption of Table 1 explains how the shown results were obtained. Note that our method achieves state-of-the-art or close to state-of-the-art performance.
We report the hit@1 accuracy on unseen classes in the first nine rows of the table to measure image classification performance. For the sake of transparency and to provide the complete picture to the reader, we included results for the AAg formulation using nearest neighbor, the AAw formulation using nearest neighbor, and the AAw formulation using the transductive approach, denoted as the transductive attribute-aware (TAAw) formulation. As can be seen, the AAw formulation significantly improves on the AAg formulation, and adding the transductive approach (i.e. label propagation on predicted attributes) to the AAw formulation further boosts the classification accuracy, as also shown in Figure 2. In addition, our approach leads to better or comparable performance on all three datasets, which include zero-shot scene and object recognition tasks. More importantly, while the other methods can perform well on a specific dataset, our algorithm leads to competitive performance on all three datasets.
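The flat hit@K accuracy used for these results can be computed as in the sketch below. It assumes each test sample comes with one score per class, where a higher score means a more likely class; the function name and the score layout are hypothetical.

```python
def flat_hit_at_k(scores, true_labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores[i][c] is the score of class c for test sample i (higher is better).
    """
    hits = 0
    for row, truth in zip(scores, true_labels):
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += truth in top_k
    return hits / len(true_labels)
```

With K = 1 this reduces to ordinary classification accuracy, and the value can only grow as K increases, which matches the ordering of the hit@1, hit@3, and hit@5 rows in Table 1.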
Conclusions
We developed a ZSL formulation that models the relationship between visual features and semantic attributes via joint sparse dictionaries. We showed that while a classic joint dictionary learning approach suffers from the domain shift problem, an entropy regularization scheme can mitigate this phenomenon and provide superior performance. In addition, we demonstrated that a transductive approach to assigning labels to the predicted attributes can boost the performance considerably and lead to state-of-the-art zero-shot classification. Finally, we compared our method to recent approaches in the literature and demonstrated its competitiveness on benchmark datasets.
References

 [Akata et al.2013] Akata, Z.; Perronnin, F.; Harchaoui, Z.; and Schmid, C. 2013. Label-embedding for attribute-based classification. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 819–826.
 [Beck and Teboulle2009] Beck, A., and Teboulle, M. 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences 2(1):183–202.
 [Belkin, Matveeva, and Niyogi2004] Belkin, M.; Matveeva, I.; and Niyogi, P. 2004. Regularization and semi-supervised learning on large graphs. In Conference on Learning Theory, 624–638. Springer.
 [Bucher, Herbin, and Jurie2016] Bucher, M.; Herbin, S.; and Jurie, F. 2016. Improving semantic embedding consistency by metric learning for zero-shot classiffication. In European Conference on Computer Vision, 730–746. Springer.
 [Ding, Shao, and Fu2017] Ding, Z.; Shao, M.; and Fu, Y. 2017. Low-rank embedded ensemble semantic dictionary for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2050–2058.
 [Dinu, Lazaridou, and Baroni2014] Dinu, G.; Lazaridou, A.; and Baroni, M. 2014. Improving zero-shot learning by mitigating the hubness problem. arXiv preprint arXiv:1412.6568.
 [FeiFei, Fergus, and Perona2006] Fei-Fei, L.; Fergus, R.; and Perona, P. 2006. One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence 28(4):594–611.
 [Fu et al.2015] Fu, Y.; Hospedales, T. M.; Xiang, T.; and Gong, S. 2015. Transductive multi-view zero-shot learning. IEEE transactions on pattern analysis and machine intelligence 37(11):2332–2345.
 [Grandvalet and Bengio2004] Grandvalet, Y., and Bengio, Y. 2004. Semi-supervised learning by entropy minimization. In Advances in neural information processing systems, volume 17, 529–536.
 [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
 [Huang et al.2017] Huang, G.; Liu, Z.; Weinberger, K. Q.; and van der Maaten, L. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4700–4708.
 [Huang, Tran, and Tran2016] Huang, S.; Tran, D. N.; and Tran, T. D. 2016. Sparse signal recovery based on nonconvex entropy minimization. In IEEE International Conference on Image Processing, 3867–3871. IEEE.
 [Isele, Rostami, and Eaton2016] Isele, D.; Rostami, M.; and Eaton, E. 2016. Using task features for zero-shot knowledge transfer in lifelong learning. In Proc. of International Joint Conference on Artificial Intelligence, 1620–1626.
 [Kodirov et al.2015] Kodirov, E.; Xiang, T.; Fu, Z.; and Gong, S. 2015. Unsupervised domain adaptation for zero-shot learning. In Proceedings of the IEEE International Conference on Computer Vision, 2452–2460.
 [Kodirov, Xiang, and Gong2017] Kodirov, E.; Xiang, T.; and Gong, S. 2017. Semantic autoencoder for zero-shot learning. 3174–3183.
 [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, 1097–1105.
 [Lampert, Nickisch, and Harmeling2009] Lampert, C. H.; Nickisch, H.; and Harmeling, S. 2009. Learning to detect unseen object classes by between-class attribute transfer. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 951–958.
 [Lampert, Nickisch, and Harmeling2014] Lampert, C. H.; Nickisch, H.; and Harmeling, S. 2014. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(3):453–465.
 [Lei Ba et al.2015] Lei Ba, J.; Swersky, K.; Fidler, S.; et al. 2015. Predicting deep zero-shot convolutional neural networks using textual descriptions. In International Conference on Computer Vision, 4247–4255.
 [Li et al.2017] Li, Y.; Wang, D.; Hu, H.; Lin, Y.; and Zhuang, Y. 2017. Zero-shot recognition using dual visual-semantic mapping paths. 3279–3287.
 [Maaten and Hinton2008] Maaten, L., and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov):2579–2605.
 [Mikolov et al.2013] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, 3111–3119.
 [Negahban et al.2009] Negahban, S.; Yu, B.; Wainwright, M.; and Ravikumar, P. 2009. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Advances in neural information processing systems, 1348–1356.
 [Norouzi et al.2014] Norouzi, M.; Mikolov, T.; Bengio, S.; Singer, Y.; Shlens, J.; Frome, A.; Corrado, G. S.; and Dean, J. 2014. Zero-shot learning by convex combination of semantic embeddings. International Conference on Learning Representations.
 [Palatucci et al.2009] Palatucci, M.; Pomerleau, D.; Hinton, G. E.; and Mitchell, T. M. 2009. Zero-shot learning with semantic output codes. In Advances in neural information processing systems, 1410–1418.
 [Patterson and Hays2012] Patterson, G., and Hays, J. 2012. Sun attribute database: Discovering, annotating, and recognizing scene attributes. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2751–2758. IEEE.
 [Pennington, Socher, and Manning2014] Pennington, J.; Socher, R.; and Manning, C. D. 2014. Glove: Global vectors for word representation. In Empirical Methods on Natural Language Processing, volume 14, 1532–43.
 [RomeraParedes and Torr2015] Romera-Paredes, B., and Torr, P. 2015. An embarrassingly simple approach to zero-shot learning. In International Conference on Machine Learning, 2152–2161.
 [Shekhar et al.2014] Shekhar, S.; Patel, V. M.; Nasrabadi, N. M.; and Chellappa, R. 2014. Joint sparse representation for robust multimodal biometrics recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(1):113–126.
 [Shigeto et al.2015] Shigeto, Y.; Suzuki, I.; Hara, K.; Shimbo, M.; and Matsumoto, Y. 2015. Ridge regression, hubness, and zero-shot learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 135–151. Springer.
 [Simonyan and Zisserman2014] Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
 [Socher et al.2013] Socher, R.; Ganjoo, M.; Manning, C. D.; and Ng, A. 2013. Zero-shot learning through cross-modal transfer. In Advances in neural information processing systems, 935–943.
 [Wah et al.2011] Wah, C.; Branson, S.; Welinder, P.; Perona, P.; and Belongie, S. 2011. The Caltech-UCSD Birds-200-2011 dataset.
 [Wang and Chen2017] Wang, Q., and Chen, K. 2017. Zero-shot visual recognition via bidirectional latent embedding. International Journal of Computer Vision 124(3):356–383.
 [Xian et al.2017] Xian, Y.; Lampert, C. H.; Schiele, B.; and Akata, Z. 2017. Zero-shot learning - a comprehensive evaluation of the good, the bad and the ugly. arXiv preprint arXiv:1707.00600.
 [Xu et al.2017] Xu, X.; Shen, F.; Yang, Y.; Zhang, D.; Shen, H. T.; and Song, J. 2017. Matrix tri-factorization with manifold regularizations for zero-shot learning. In Proceedings of the IEEE International Conference on Computer Vision, 3798–3807.
 [Yang et al.2010] Yang, J.; Wright, J.; Huang, T. S.; and Ma, Y. 2010. Image super-resolution via sparse representation. IEEE transactions on image processing 19(11):2861–2873.
 [Yang et al.2012] Yang, J.; Wang, Z.; Lin, Z.; Cohen, S.; and Huang, T. 2012. Coupled dictionary training for image super-resolution. IEEE Transactions on Image Processing 21(8):3467–3478.
 [Ye and Guo2017] Ye, M., and Guo, Y. 2017. Zero-shot classification with discriminative semantic representation learning. 17140–17148.
 [Yu and Aloimonos2010] Yu, X., and Aloimonos, Y. 2010. Attribute-based transfer learning for object categorization with zero/one training example. European Conference on Computer Vision 127–140.
 [Yu et al.2014] Yu, Z.; Wu, F.; Yang, Y.; Tian, Q.; Luo, J.; and Zhuang, Y. 2014. Discriminative coupled dictionary hashing for fast cross-media retrieval. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, 395–404. ACM.
 [Yu et al.2017] Yu, Y.; Ji, Z.; Li, X.; Guo, J.; Zhang, Z.; Ling, H.; and Wu, F. 2017. Transductive zero-shot learning with a self-training dictionary approach. arXiv preprint arXiv:1703.08893.
 [Zhang and Saligrama2015] Zhang, Z., and Saligrama, V. 2015. Zero-shot learning via semantic similarity embedding. In International Conference on Computer Vision, 4166–4174.
 [Zhang and Saligrama2016] Zhang, Z., and Saligrama, V. 2016. Zero-shot learning via joint latent similarity embedding. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 6034–6042.
 [Zhou et al.2003] Zhou, D.; Bousquet, O.; Lal, T. N.; Weston, J.; and Schölkopf, B. 2003. Learning with local and global consistency. In Advances in neural information processing systems, volume 16, 321–328.