. This has been studied before in different tasks involving data of diverse modalities. One of the most important applications is content-based search and semantic indexing for text documents and images. Text documents are much easier to label than the associated images on a webpage. Also, since classifiers naturally work better with features that have semantic interpretability, text features are inherently friendlier to the classification process than visual representations of images. The classification problem is therefore easier to interpret and solve in the text modality, while there is a tremendous semantic gap between visual features and the concepts depicted in images. In addition, the challenges of image classification are particularly evident when the amount of available training data is limited. In such cases, image classification is further hampered by the paucity of labels.
In the case of images, it is desirable to obtain a feature representation that relates more directly to semantic concepts, which improves the quality of classification. Furthermore, this often has to be achieved with only a limited amount of labeled image data. This naturally motivates an approach that utilizes the labeled data in the text modality in order to improve image classification. Hence, we implemented an intermodal label transfer process in which a transfer function is built to reveal the alignment between modalities, so that labels can be transferred across different modalities. We showed that the transfer of the rich label information from texts to images provides much more effective learning algorithms.
Although intermodal label transfer has shown promising results, we have observed that it may fail when the labels cannot be well aligned between modalities. For example, the text label "building" may refer to a large variety of building architectures in different documents, while a test image of a "building" often has a particular style of appearance. It is risky to blindly transfer text labels when the visual appearance does not match any text description from the source corpus. This causes the "negative transfer" problem, which refers to the transfer of irrelevant information between different modalities. To prevent negative transfer, we present an intramodal label transfer process to complement the intermodal label transfer; it takes over the annotation of a test image by transferring image labels in the absence of relevant labeled text documents. This yields a joint Intermodal and Intramodal Label Transfer (I2LT) algorithm, which combines the advantages of both image labels and text labels in the context of a label transfer task.
Formally, we seek to develop a label transfer algorithm for jointly sharing labels across and within different modalities. Specifically, it is applied to the image classification problem in order to leverage the labels in text corpora to annotate image corpora with scarce labels. Such algorithms typically transfer labeling information between heterogeneous feature spaces instead of homogeneous feature spaces.
Heterogeneous transfer learning is usually much more challenging due to the unknown alignment across the distinct feature spaces. In order to bridge two distinct feature spaces, the key ingredient is a "transfer function" which can explain the alignment between the text and image feature spaces through a feature transformation. This transformation is used for the purpose of effective image classification and semantic indexing. As discussed earlier, it is learned from co-occurrence data that is often available in many practical settings. For example, in many real web and social media applications, it is possible to obtain many co-occurrence pairs between text and images; in web pages, images are surrounded by text descriptions on the same page. Similarly, there is a tremendous amount of linkage between text and images on the web, through comments on image sharing sites, posts in social networks, and other linked text and image corpora. It is reasonable to assume that the content of the text and the images is highly correlated in both scenarios. This information provides a semantic bridge, which can be exploited in order to learn the alignment and label transfer between the different modalities.
In contrast to previous work, the label transfer proposed in this paper can establish the alignment between texts and images even if the new test images do not have any surrounding text description, or if the co-occurrence data is independent of the labeled source texts. This increases the flexibility of the algorithm and makes it more widely applicable in practice. Specifically, in order to perform the label transfer process, we create a new topic space into which both the text and the images are mapped. Both the co-occurrence set and the training set are used to learn the transfer function, which aligns the heterogeneous text and image spaces. We also follow the principle of parsimony and, for regularization, encode as few topics as possible to align text and images. This principle prefers the least complex model, as long as the text and image alignment can be well explained by the learned transfer function. After the transfer function is learned, the labels can be propagated from any labeled text corpus to any new image by intermodal label propagation. While labels from the images are also used to improve accuracy, one characteristic of our transfer function is that it is particularly robust when only a very small number of training examples is available.
The remainder of this paper is organized as follows. In Section 2, we briefly review the related work. We then propose an intermodal label transfer process in Section 3 and show how the labels of a text corpus can be propagated to an image corpus. In Section 4, a joint Intermodal and Intramodal Label Transfer (I2LT) process is proposed, along with a transfer function in Section 5 that instantiates the joint model. In Section 6, we present the objective problem along with a proximal gradient based algorithm for solving the optimization problem. We also present a zero-shot learning extension of the proposed algorithm to classify images of unseen classes in Section 7. The experimental results are presented in Section 8. The conclusion and summary are presented in Section 9.
2 Related Work
A variety of transfer learning methods have been proposed in prior pioneering works, e.g., domain adaptation [25, 15, 26, 47, 40, 39], cross-category information sharing, and heterogeneous transfer learning [35, 50, 14, 18]. In this paper, we concentrate on learning cross-modal correspondence and sharing semantic information across different modalities.
Learning semantic correspondence from text to images can be seen as a transfer learning problem that involves heterogeneous data points across different feature spaces. For example, heterogeneous transfer learning has been proposed, which uses both user tags and related document text as auxiliary information to extract a new latent feature representation for each image. However, it does not utilize the text labels to enrich the semantic labels of images, which may restrict its performance when image labels are very scarce. On the other hand, translated learning attempts to label the target instances through a Markovian chain. A translator is assumed to be available between the source and target data for correspondence. However, given an arbitrary new image, such a correspondence is not always directly available between text and image instances. In this case, a generative model is used in the Markovian chain to construct feature-feature co-occurrence. This model is not reliable when the co-occurrence data is noisy and sparse. In contrast, we explicitly learn a semantic transfer function, which directly propagates semantic labels from text to images even if the semantic correspondence is not available beforehand for a new image. It avoids overfitting to noisy and sparse co-occurrence data by imposing the prior of fewest topics on the semantic translation.
It is also worth noting that learning label transfer across heterogeneous modalities is different from conventional heterogeneous learning, such as multi-kernel learning and co-training. In heterogeneous learning, each instance must contain different views. In contrast, when translating text to images, an image is not required to have an associated text view. This makes the problem much more challenging. The correspondence between text and images is established by the learned transfer function, and a single image view of an input instance is enough to predict its label by a label transfer process.
We also distinguish the proposed label transfer model from other latent models. Previous latent methods, such as Latent Semantic Analysis, Probabilistic Latent Semantic Analysis, Latent Dirichlet Allocation and Multimodal Latent Attributes, are restricted to latent factor discovery from co-occurrence observations. In contrast, the goal in this paper is to establish a semantic bridge so that discriminative labeling information can be propagated between the source and target spaces. To the best of our knowledge, it is one of the first algorithms to address this heterogeneous label transfer problem via a parsimonious latent topic space. It is worth noting that even when the correspondence to source instances is unknown, it can still label a new instance by predicting its correspondence with the learned transfer function.
. Usually Recurrent Neural Networks (RNNs)45]. The translation problem has been recognized as a very challenging task, since it requires the machine not only to read the content of images and videos accurately, but also to translate the visual elements into sentences in the correct order with a satisfactory level of grammatical correctness. In this paper, we do not aim to solve this challenging problem. Instead, we consider label transfer from texts to images, where we do not need to compose sentences. Our goal is not to describe the visual content in sentences, but to utilize the abundant labeled text documents to improve accuracy in image classification tasks.
Although we focus on label transfer from texts to images, the model developed in this paper is equally applicable to other label transfer tasks between different modalities. For example, previous work has demonstrated an application where the labels of English documents are transferred to annotate Chinese documents. Similarly, speech segments can be aligned by learning a transfer function, by which labels can be transferred across different languages to annotate the speech. The label transfer model can also be applied to audio-video recognition tasks [29, 11]. Similar to the scenario considered in this paper, a test audio clip has no parallel video and must be aligned to the existing corpus of videos to enable intermodal label transfer. Prior work in this area explores a slightly different idea: instead of aligning the test sample with the video corpus, it attempts to reconstruct the parallel video through multimodal deep networks. This approach is indirect for label transfer, and an independent classifier must be trained for audio-video recognition tasks.
In an earlier work, Flickr images with tags were used to learn several CCA variants for the cross-modal retrieval task. That work incorporated a third view of supervised semantic information or unsupervised word clusters to bridge the cross-modal gap, along with the visual and text views. In contrast, an important byproduct of the proposed algorithm is an intermodal transfer function, which can measure the cross-modal relevance directly. It is also learned with the supervised labeled image/text pairs. In this spirit, our approach also involves a "third view" of the labeled concepts. However, our approach is motivated by annotating the labels of semantic concepts, rather than by learning the cross-modal relevance directly. This places the proposed approach in a technical line complementary to those CCA variants.
A more recent work proposed to use privileged information to augment Support Vector Machines (SVMs). Additional training bags were collected from textual descriptions of images, where positive bags contain the returned images with relevant tags, while negative bags do not contain any images with relevant tags. The problem with the training bags was then formulated as a multi-instance learning problem, in which positive bags provide privileged information to augment the training of the classifiers with more positive image instances. Our approach differs from this method in directly transferring the text labels to reconstruct image labels, rather than training an image classifier. However, neither method assumes the availability of text information for test images, which makes both applicable to labeling new images without text descriptions.
3 Intermodal Label Transfer
In this section, we introduce the notation and problem definitions for the label transfer process. We consider a source feature space and a target feature space, each with its own dimensionality. For the purpose of this paper, the source space corresponds to the text modality, and the target space corresponds to the image modality. In the source (text) space, we have a set of text documents, each represented by a feature vector. This text corpus has already been annotated with class labels, with one binary label for each document. The binary assumption is made to avoid notational clutter, and it can be straightforwardly extended to encode multiple classes.
The images are represented by feature vectors in the target space. The task is to relate the feature structure of the source (text) space to the target (image) space, so that the labeling information can be shared between the two spaces. The goal of the transformation process is to provide a classifier for the target (image) domain when labeled data for that domain is scarce.
In order to perform the label propagation from the text to the image domain, we need a bridge which relates the text and image information. A key component providing such bridging information about the relationship between the text space and the image feature space is a set of co-occurrence pairs. Such co-occurrence information is abundant in the context of web and social network data. In fact, it may often be the case that co-occurrence information between text and images can be obtained more readily than class labels in the target (image) domain. For example, in many web collections, images co-occur with the surrounding text on the same web page. Similarly, in web and social networks, it is common to have implicit and explicit links between text and images. Such links can be viewed more generally as co-occurrence data. This co-occurrence set provides the semantic bridge needed for transfer learning.
Besides the co-occurrence set, we also have a small set of labeled images. This is an auxiliary set of training examples, and its size is usually much smaller than that of the set of labeled source examples. In other words, the number of labeled images is much smaller than the number of labeled text documents. As we will see, the auxiliary set is used in order to enhance the accuracy of the transfer learning process.
One of the key intermediate steps during this process is the design of a transfer function between text and images. This transfer function serves as a conduit to measure the alignment between text and image features. We will show that such a conduit can be used directly in order to propagate the class labels from text to images. The transfer function is defined jointly on the text space and the image space. It assigns a real value to a pair of text and image instances to weigh their alignment. This value can be either positive or negative, representing a positive or a negative match. Given a new image, its label is determined by an intermodal discriminant function, which is a linear combination of the class labels in the labeled text corpus weighted by the corresponding transfer function values.
The sign of this discriminant function then decides the class label of the image.
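The rule just described can be written compactly. The following is a sketch under assumed notation (the symbols are illustrative choices, not necessarily those of the original formulation): $x^{(s)}_i$ and $y^{(s)}_i \in \{+1,-1\}$ denote the $i$-th labeled text document and its label, $n^{(s)}$ is the number of labeled documents, $x^{(t)}$ is a new image, and $T(\cdot,\cdot)$ is the transfer function.

```latex
f_T\bigl(x^{(t)}\bigr) \;=\; \sum_{i=1}^{n^{(s)}} y^{(s)}_i \, T\bigl(x^{(s)}_i,\, x^{(t)}\bigr),
\qquad
\hat{y}^{(t)} \;=\; \operatorname{sign}\bigl(f_T(x^{(t)})\bigr).
```

A text document that aligns strongly with the image (large positive $T$) pushes the score toward its own label, while a document with a negative match pushes it toward the opposite label.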
4 Joint Intermodal and Intramodal Label Transfers
In addition to the above inter-modal label transfer model, we can transfer the image labels from the training set directly to annotate a test image. Formally, we can define a discriminant function for the intra-modal label transfer as a weighted combination of kernel evaluations between the test image and the labeled training images.
The weights are real-valued coefficients for intra-modal label transfer, and the kernel function between two images satisfies Mercer's condition (e.g., a Gaussian kernel). This label transfer has a form similar to the discriminant function of a kernelized support vector machine, with each nonzero coefficient corresponding to a support vector.
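As an illustration of the intra-modal rule, the sketch below scores a test image by a kernel-weighted vote of labeled training images. It assumes the SVM-like form in which each coefficient multiplies a training label and a Gaussian kernel value; the function names, the bandwidth `sigma`, and the toy data are hypothetical.

```python
import math

def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian (RBF) kernel between two feature vectors; sigma is an assumed bandwidth."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def intramodal_score(test_image, train_images, train_labels, alphas, sigma=1.0):
    """Sum of coefficient * label * kernel value over the labeled training images."""
    return sum(
        a * y * gaussian_kernel(x, test_image, sigma)
        for a, y, x in zip(alphas, train_labels, train_images)
    )

# Toy data (hypothetical): two labeled images in a 2-D feature space.
train_images = [[0.0, 0.0], [3.0, 3.0]]
train_labels = [+1, -1]
alphas = [1.0, 1.0]  # a zero coefficient would remove that image from the vote

score = intramodal_score([0.1, -0.1], train_images, train_labels, alphas)
predicted = 1 if score >= 0 else -1
```

The test image lies next to the positive training image, so the positive kernel term dominates the score.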
It is worth noting that usually no surrounding text document comes with the test image. However, we can always apply the transfer function to align the test image with the text documents from the source corpus. This solves the out-of-sample problem, so the text labels can be transferred to annotate any new image.
This extends the inter-modal label transfer paradigm. We expect the intermodal and intramodal label transfers to annotate the test images collaboratively, aggregating the label information from both texts and images. This can mitigate the negative transfer problem when the text documents in the source corpus cannot properly specify the visual aspect of a test image. In this case, we expect the image labels to take over and annotate the image based on its visual appearance. In this spirit, the intramodal label transfer component plays the role of a "watchdog" that oversees and complements the intermodal label transfer. As will be shown in the experiments, it successfully improves the intermodal label transfer model and outperforms the compared algorithms over all the categories for an image classification task.
The learning problem of establishing joint label transfers boils down to learning the intramodal coefficients, along with the transfer function that properly explains the alignment between the text and image spaces. This overall process is illustrated intuitively in Figure 1. Since the key to an effective transfer learning process is to learn the transfer function, we need to formulate an optimization problem which maximizes the classification accuracy obtained from this transfer process. We will first set up the optimization problem more generally, without assuming any canonical form for the transfer function. Later, we will set up a canonical form for the transfer function in terms of matrices which represent topic spaces. The parameters of this canonical form will be optimized in order to learn the transfer function. We propose to optimize the following problem to jointly learn the parameters of the intermodal and intramodal functions:
where (1) the first term is the loss function of the training errors on the labeled image set; (2) the second term is the loss function that measures the misalignment between the co-occurrence text-image pairs, and minimizing this loss maximizes the value of the transfer function over the co-occurrence pairs; and (3) the last term regularizes the learning of the transfer function in order to improve the generalization performance. In the following sections, we will present the detailed forms of these loss functions and the regularizer.
In addition, two positive balancing parameters define the relative importance of the training data and the co-occurrence pairs in the objective function, and the bound constraint on the intramodal coefficients follows the conventional regularization constraint on the coefficients in support vector machines, which is expected to yield better generalization performance.
Remark on the three data sets: It is worth noting that the labeled text set is used to propagate its labels to annotate the target images. We do not need to set the text part of the co-occurrence set to be the same as the labeled text set, since the modeling of co-occurrence and the label propagation are different. The labeled image set can likewise differ from the images in the co-occurrence set. These labeled images are used to minimize the classification errors involved in the first term of the objective function (4), which is different from the maximization of co-occurrence consistency in the second term.
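The structure of the objective described in this section can be sketched as follows. All symbols are assumptions introduced for illustration: $\ell$ is the training loss on the labeled images $(x^{(t)}_j, y^{(t)}_j)$, $\chi$ is the misalignment loss on the co-occurrence pairs $(\tilde{x}^{(s)}_k, \tilde{x}^{(t)}_k)$, $\Omega$ is the regularizer, $\gamma, \lambda > 0$ are the balancing parameters, $C$ is the bound on the intramodal coefficients $\alpha_j$, and $f$ is the joint discriminant function, assumed here to be the sum of the intermodal and intramodal discriminant functions.

```latex
\min_{T,\;\alpha}\;\;
\gamma \sum_{j} \ell\bigl(y^{(t)}_j\, f(x^{(t)}_j)\bigr)
\;+\; \lambda \sum_{k} \chi\bigl(T(\tilde{x}^{(s)}_k,\, \tilde{x}^{(t)}_k)\bigr)
\;+\; \Omega(T)
\qquad \text{s.t.}\quad 0 \le \alpha_j \le C \;\; \text{for all } j .
```

Written this way, the three data sets enter through separate terms: the labeled text set only through $f$, the labeled images through the first sum, and the co-occurrence pairs through the second.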
5 Intermodal Transfer Function
In this section, we design the canonical form of the transfer function in terms of underlying topic spaces. This provides a closed form for our transfer function, which can be optimized effectively. Topic spaces provide a natural intermediate representation which can semantically link the information between text and images. One of the challenges is that text and images use inherently different structures to describe their content. For example, text is described in the form of a vector space of sparse words, whereas images are typically described by feature vectors that encode visual appearance such as color, texture and spatial layout. To establish their connection, one must discover a common structure which can be used to link them.
A text document usually contains several topics which describe different aspects of the underlying concepts at a higher level. For example, in a web page depicting a bird, topics such as the head, body and tail may be described in its textual part, while an accompanying bird image illustrates them. By mapping the original text and image feature vectors into a space with several unspecified topics, they can be semantically linked together by investigating their co-occurrence data.
Following this idea, we construct two transformation matrices to map text and images into a common (hypothetical) latent topic space, as in previous work, which makes them directly comparable. The dimensionality of this space is essentially equal to the number of topics. We note that it is not necessary to know the exact semantics of the latent topics. We only attempt to model the semantic correspondence between the unknown topics of text and images. The learning of effective transformation matrices (or, as we will see later, an appropriate function of them) is the key to the success of the semantic translation process. These matrices are defined as follows.
The transfer function is defined as a function of the source and target instances by computing the inner product of their projections in our hypothetical topic space, followed by a nonlinear hyperbolic tangent activation.
Clearly, the choice of the transformation matrices (or rather of their product matrix) impacts the transfer function directly. We therefore use a single symbol to denote this product matrix. It suffices to learn the product matrix rather than the two transformation matrices separately. With this definition, the inter-modal label transfer function can be rewritten in terms of the product matrix alone.
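The equivalence between the topic-space form and the product-matrix form of the transfer function can be checked numerically. The sketch below uses assumed names (`W_s` and `W_t` for the text and image transformation matrices, `S` for their product) and small hypothetical matrices; it illustrates the construction and is not the learned model.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def transfer_via_topics(W_s, W_t, x_s, x_t):
    """tanh of the inner product of the two topic-space projections."""
    return math.tanh(dot(matvec(W_s, x_s), matvec(W_t, x_t)))

def product_matrix(W_s, W_t):
    """S = W_s^T W_t, with shape (text dimension) x (image dimension)."""
    topics, text_dim, image_dim = len(W_s), len(W_s[0]), len(W_t[0])
    return [
        [sum(W_s[r][i] * W_t[r][j] for r in range(topics)) for j in range(image_dim)]
        for i in range(text_dim)
    ]

def transfer_via_product(S, x_s, x_t):
    """The same value computed as tanh(x_s^T S x_t)."""
    return math.tanh(dot(x_s, matvec(S, x_t)))

# Hypothetical setup: 2 topics, 3-D text features, 2-D image features.
W_s = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]
W_t = [[1.0, 2.0], [0.0, 1.0]]
x_s, x_t = [1.0, 0.0, 2.0], [0.5, -1.0]

t_topics = transfer_via_topics(W_s, W_t, x_s, x_t)
t_product = transfer_via_product(product_matrix(W_s, W_t), x_s, x_t)
```

Both routes give the same value, which is why only the product matrix needs to be learned; the tanh keeps the result strictly between -1 and 1.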
6 Objective Problem
Putting together the intermodal and intramodal label transfer formulas in (8) and (2), we define the joint discriminant function, which can be substituted into the objective function of the optimization problem (4) for learning the transfer function. In addition, we use the conventional squared norm to regularize the transfer function on each of the two transformation matrices:
Here, the norm used is the Frobenius norm. We can then use the aforementioned substitutions in order to rewrite the objective function of Eq. (4) as follows:
The goal is to determine the product matrix which optimizes the objective function in Eq. (9). We note that this objective function is not jointly convex in the two transformation matrices. This implies that the optimal product matrix may be hard to find with straightforward gradient descent techniques, which can easily get stuck in local minima. Fortunately, it is possible to learn the product matrix directly from Eq. (9) by using the trace norm. It is defined as follows:
The trace norm is a convex function of the product matrix, and can be computed as the sum of its singular values. The trace norm differs from the conventional squared norm for regularization purposes: it is a surrogate of the matrix rank, so minimizing it can limit the dimension of the topic space. In other words, minimizing the trace norm results in the fewest topics needed to explain the correspondence between text and images. This implies that a concise semantic transfer with fewer topics is more effective than a tedious translation for cross-domain correspondence between text and images, as long as the learned transfer function complies with the observations (i.e., the co-occurrence and auxiliary data). This is consistent with the parsimony principle, which states a preference for the least complex translation model. A parsimonious choice is also helpful in avoiding overfitting, which may arise in scenarios where the number of auxiliary training examples is small.
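The link between the squared-norm regularizer on the two transformation matrices and the trace norm of their product is a standard identity, sketched here under assumed notation: $S$ is the product matrix, $\sigma_i(S)$ are its singular values, and $W^{(s)}$, $W^{(t)}$ are the text and image transformation matrices.

```latex
\|S\|_{*} \;=\; \sum_{i} \sigma_i(S)
\;=\; \min_{W^{(s)},\, W^{(t)} \,:\; S \,=\, W^{(s)\top} W^{(t)}}
\;\tfrac{1}{2}\Bigl( \|W^{(s)}\|_F^2 + \|W^{(t)}\|_F^2 \Bigr).
```

Because the number of nonzero singular values equals the rank of $S$, penalizing their sum encourages a low-rank $S$, that is, a topic space with few effective dimensions.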
The objective function in Eq. (9) can be rewritten as follows with the use of the trace norm:
We note that this objective function has a number of properties which can be leveraged for optimization purposes. In the next section, we discuss the methodology for optimizing this objective function.
6.1 Joint Optimization Algorithm
In order to optimize the objective function above, we first need to decide which functions are used for the two losses in Eq. (11).
Recall that these functions measure the margin of the discriminant function on the auxiliary data set and the compliance with the observed co-occurrence. We use the hinge loss as the loss function over the training set, which keeps only the positive component of its argument.
We choose the hinge loss here because it has been shown to be more robust to noisy outliers among the training examples. Clearly, minimizing the hinge loss tends to maximize the margin.
On the other hand, in keeping with the hyperbolic tangent activation in Eq. (7), the co-occurrence loss in the objective function (11) is chosen so that it essentially amounts to the logistic loss, which measures the misalignment made by the transfer function on a co-occurrence pair of a text document and an image. Minimizing this logistic loss tends to maximize the values of the transfer function over the co-occurrence pairs.
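The connection between the tanh activation and the logistic loss can be made concrete. The sketch below assumes one loss that fits the description, the negative log of (T + 1) / 2 with T = tanh(z), and checks that it equals log(1 + exp(-2z)); this specific form is an assumption and may differ from the exact definition used in the objective. The hinge loss is included for comparison, and all function names are hypothetical.

```python
import math

def hinge_loss(margin):
    """Hinge loss (1 - margin)_+, where margin = label * discriminant value."""
    return max(0.0, 1.0 - margin)

def logistic_alignment_loss(z):
    """log(1 + exp(-2z)), evaluated in a numerically stable way."""
    if z >= 0:
        return math.log1p(math.exp(-2.0 * z))
    return -2.0 * z + math.log1p(math.exp(2.0 * z))

def tanh_alignment_loss(z):
    """-log((tanh(z) + 1) / 2): the same loss written through the tanh transfer value."""
    return -math.log(0.5 * (math.tanh(z) + 1.0))

# Compare the two forms at a few pre-activation values.
gaps = [abs(logistic_alignment_loss(z) - tanh_alignment_loss(z)) for z in (-2.0, 0.0, 1.5)]
```

The loss decreases as z grows, so driving it down pushes the transfer function toward 1 on the co-occurrence pairs, in line with the behaviour described above.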
The aforementioned substitutions instantiate the objective function (11), which is nonlinear in the product matrix and the intramodal coefficients. One possibility for optimizing an objective function of the form in Eq. (11) is to use the method of Srebro et al. That work showed that the dual problem can be optimized by semi-definite programming (SDP) techniques. Although many off-the-shelf SDP solvers use interior point methods and return a pair of primal and dual optimal solutions, they do not scale well with the size of the problem. Another line of work proposes a gradient based method which replaces the non-differentiable trace norm with a smooth proxy. However, the smoothed approximation to the trace norm may not guarantee that the obtained minima still correspond to the fewest topics for label transfer.
Alternatively, a proximal gradient method has been proposed to minimize such nonlinear objective functions with a trace norm regularizer. We use this approach to optimize over the product matrix and the intramodal coefficients in an alternating fashion. In order to represent the objective function of Eq. (11) more succinctly, we first consider the optimization over the product matrix, and define an auxiliary function as follows.
The objective function of Eq. (11) can then be rewritten in terms of this function. In order to optimize it, the proximal gradient method approximates it quadratically by a Taylor expansion at the current value of the product matrix with a proper coefficient, as follows:
Here the first-order term uses a subgradient of the auxiliary function at the current iterate. We use a subgradient because the hinge loss is not differentiable at its kink. We can further introduce additional notation in order to organize the above expression:
The subgradient can be computed as follows:
It is assembled from the subdifferential of the hinge loss, a gradient term, and the derivative of the logistic loss as derived before. The product matrix can then be updated iteratively by minimizing the quadratic approximation with the intramodal coefficients fixed. This can be solved by singular value thresholding, which has a closed-form solution (see Line 4 in Algorithm 1).
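The closed-form update by singular value thresholding can be sketched as follows. The notation is assumed for illustration: $S_\tau$ is the current product matrix, $\nabla F(S_\tau)$ a subgradient of the loss part of the objective at $S_\tau$, $\mu > 0$ the coefficient of the quadratic term $\tfrac{\mu}{2}\|S - S_\tau\|_F^2$, and the trace norm is assumed to enter the objective with unit weight.

```latex
G \;=\; S_{\tau} - \tfrac{1}{\mu}\,\nabla F(S_{\tau}),
\qquad
G \;=\; U \,\mathrm{diag}(\sigma_1, \dots, \sigma_r)\, V^{\top},
\qquad
S_{\tau+1} \;=\; U \,\mathrm{diag}\bigl( (\sigma_i - \tfrac{1}{\mu})_{+} \bigr)\, V^{\top}.
```

Each singular value of the gradient-step matrix $G$ is shrunk by $1/\mu$ and clipped at zero, so small singular values vanish and the rank of the iterate, and hence the number of topics, is reduced.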
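On the other hand, the optimization over the intramodal coefficients can be performed by using the gradient projection method. With the product matrix fixed at each iteration, each coefficient is updated by taking a step with a positive step size along the negative direction of a subgradient of the objective with respect to that coefficient, and then projecting the result back onto the feasible interval given by the bound constraint.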
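A single gradient projection step on the intramodal coefficients can be sketched as below. It assumes that the feasible set is the interval from 0 to an upper bound for every coefficient, as in the usual SVM box constraint; the function names, the step size, and the subgradient values are hypothetical placeholders.

```python
def project_to_box(value, upper):
    """Project a scalar onto the interval [0, upper]."""
    return min(max(value, 0.0), upper)

def projected_gradient_step(alphas, subgradients, step_size, upper):
    """Move each coefficient against its subgradient, then project back onto the box."""
    return [
        project_to_box(a - step_size * g, upper)
        for a, g in zip(alphas, subgradients)
    ]

# Hypothetical coefficients and subgradients of the objective with respect to them.
alphas = [0.2, 0.9, 0.05]
subgradients = [-1.0, -3.0, 2.0]
updated = projected_gradient_step(alphas, subgradients, step_size=0.1, upper=1.0)
```

In this toy step the first coefficient moves freely, the second is clipped at the upper bound, and the third is clipped at zero, which removes the corresponding image from the intramodal vote.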
Algorithm 1 summarizes the proximal gradient based method for optimizing the expression in Eq. (11). Note that the intermodal discriminant function is not convex as a function of the product matrix, and hence the objective function is not convex either. However, as long as the step size is properly set, the objective function (11) tends to decrease in each iteration, usually converging to a stationary point (which may not be a global optimum). This is different from our previous work, where we adopted a linear transfer function yielding a convex objective problem. The nonlinear function has shown better performance in learning the alignment between multiple modalities in the literature.
7 Zero-Shot Label Transfer for Unseen Classes
The goal of zero-shot learning [16, 34, 23] is to build classifiers that label unseen classes without any training image examples. However, some positive examples may be available in the text modality. In this section, we show that our cross-modal label transfer model can also be used in this setting.
Specifically, suppose that we have seen classes with labeled training images, and our goal is to annotate the images of unseen classes. In addition to the images, we have labeled text examples for both seen and unseen classes. Zero-shot label transfer then aims to transfer the text labels to annotate the images of unseen classes. In principle, the same inter-modal label transfer function in Eq. (1) is applicable to labeling the images of unseen classes, since the text labels, whether of seen or unseen classes, can be transferred to label those images. However, in this case, the intra-modal label transfer term is no longer used, since we do not have access to image labels for the unseen classes.
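A minimal sketch of this zero-shot rule is given below. It scores an image for each unseen class using only the inter-modal term and, as an added assumption for the multi-class case, returns the class with the highest score. The product matrix `S`, the toy text features, and the class names are hypothetical.

```python
import math

def transfer(S, x_s, x_t):
    """tanh(x_s^T S x_t) for a text vector x_s and an image vector x_t."""
    z = sum(
        x_s[i] * S[i][j] * x_t[j]
        for i in range(len(x_s))
        for j in range(len(x_t))
    )
    return math.tanh(z)

def zero_shot_predict(S, text_docs, text_classes, unseen_classes, image):
    """Score each unseen class with text labels only; no image labels are used."""
    scores = {}
    for c in unseen_classes:
        # A document counts as +1 for its own class and -1 for every other class.
        scores[c] = sum(
            (1 if doc_class == c else -1) * transfer(S, doc, image)
            for doc, doc_class in zip(text_docs, text_classes)
        )
    return max(scores, key=scores.get), scores

# Hypothetical example: identity product matrix and three labeled text documents.
S = [[1.0, 0.0], [0.0, 1.0]]
text_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.2]]
text_classes = ["zebra", "owl", "zebra"]

label, scores = zero_shot_predict(S, text_docs, text_classes, ["zebra", "owl"], [2.0, 0.0])
```

The image aligns with the two documents labeled with the first class, so that class receives the highest score even though no labeled image of it was ever used.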
The learning of the inter-modal transfer function does not need to be changed to adapt to the zero-shot learning problem. However, for the sake of a fair zero-shot learning scenario, we should exclude the co-occurrence text-image pairs belonging to the unseen classes from the training set. Only co-occurrence pairs of seen classes are used to model the correlation between the text and image modalities via the inter-modal transfer function. This idea of involving only pairs of seen classes has been adopted in the literature [22, 16] to learn the inter-modal correlations, which play the critical role in bridging the gap across modalities.
On the other hand, we note that the labeled image examples of seen classes can still be used in training the model, except that they should be treated as negative examples for the unseen classes. These seen classes provide useful auxiliary information to exclude the regions of the feature space where the unseen classes are unlikely to be present. (We assume that different classes are mutually exclusive, i.e., we consider a multi-class problem rather than a multi-label problem. This assumption holds for many image classification problems, such as object and face recognition.) This prior has been explored in prior work to improve the classification accuracy for the unseen classes.
We will present the experimental results in the zero-shot learning scenario in Section 8.4.
In this section, we compare the proposed label transfer paradigm with a pure image classification algorithm, namely an SVM classifier trained on image features only, along with other existing transfer learning methods. We will show that our approach achieves superior results over the other methods with a limited amount of training data.
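Table I: Number of occurrence pairs per category (columns: Category, Occurrence pairs).
Table II: Number of positive and negative examples per category (columns: Category, Positive examples, Negative examples).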
We compare the accuracy and sensitivity of our label transfer approach with a number of algorithms below:
SVM: As the baseline, we directly train SVM classifiers on the visual features extracted from images. This method does not use any of the additional information available in the corresponding text to improve the effectiveness of target domain classification. The method is also susceptible to the case when we have a small number of test instances.
TLRisk (Translated Learning by minimizing Risk): This is another transfer learning algorithm, which performs the translation by minimizing risk. The algorithm transfers the text labels to image labels via a Markov chain. It learns a probabilistic model to translate the text labels to image labels by exploring the occurrence relation between text documents and images. We note, however, that such an approach does not use the topic-space methodology, which is more useful in connecting heterogeneous feature spaces.
HTL (Heterogeneous Transfer Learning): This algorithm is the best fit for our scenario with heterogeneous spaces, compared to other transfer learning algorithms designed for a homogeneous space. This method has also been reported to achieve superior effectiveness. It maps each image into a latent vector space where an implicit distance function is formulated. To do so, it makes use of the occurrence information between images and text documents as well as between images and visual words. To adapt this method to our scenario, user tags in Flickr are extracted to construct the relational matrix between images and tags as well as that between tags and documents. Images are represented in a new feature space, in which they can be classified by applying the k-nearest neighbor classifier (with a fixed k) based on the distances in the new space. We refer to this method as HTL.
Translator from Text to Images (TTI): This is our previous label transfer algorithm, which uses only intermodal label transfer without considering intramodal label transfer. This model fails to outperform the other compared algorithms on some categories. As mentioned above, this might be caused by the misalignment between text documents and test images.
Joint Intermodal and Intramodal Label Transfer (I2LT): This is the approach proposed in this paper.
In the experiments, a small number of training images are randomly selected from each category as labeled instances for training the classifiers. The remaining images in each category are used for testing the performance of the classification task. Only a small number of training examples are used, making the problem very challenging from the training perspective. This process is repeated five times. The error rate and the standard deviation for each category are reported in order to evaluate the effectiveness of the compared classifiers. We also use varying numbers of co-occurring text-image pairs to construct the classifier, and compare the corresponding results with related algorithms.
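The repeated random-split protocol can be sketched as below. This is only an illustration of the evaluation loop: `sample_split`, `repeated_error`, and the `fit_predict` callback are hypothetical names, and the trivial classifier in the usage example stands in for any of the compared methods.

```python
import numpy as np

def sample_split(labels, n_train_per_class, rng):
    """Draw n_train_per_class random training indices per class; the rest is test."""
    labels = np.asarray(labels)
    train_idx = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        train_idx.extend(rng.choice(idx, size=n_train_per_class, replace=False))
    train_idx = np.sort(np.array(train_idx))
    test_idx = np.setdiff1d(np.arange(len(labels)), train_idx)
    return train_idx, test_idx

def repeated_error(labels, fit_predict, n_train_per_class=2, n_repeats=5, seed=0):
    """Mean and standard deviation of the test error over repeated random splits.

    fit_predict(train_idx, test_idx) must return predicted labels for test_idx.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    errors = []
    for _ in range(n_repeats):
        train_idx, test_idx = sample_split(labels, n_train_per_class, rng)
        pred = fit_predict(train_idx, test_idx)
        errors.append(np.mean(pred != labels[test_idx]))
    return float(np.mean(errors)), float(np.std(errors))

# Usage with a placeholder classifier that always predicts the positive class.
labels = np.array([1] * 10 + [-1] * 10)

def always_positive(train_idx, test_idx):
    return np.ones(len(test_idx), dtype=int)

mean_err, std_err = repeated_error(labels, always_positive)
```

Because the split is stratified by class, the placeholder classifier misclassifies exactly the held-out negatives in every repetition, which makes the loop easy to sanity-check before plugging in a real model.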
In the experiments, the parameters , (used to decide the importance of auxiliary data and co-occurrence data from the objective function in (11)) and (used to regularize the intramodal label transfer) are selected from , and , respectively. All the parameters are tuned based on a twofold cross-validation procedure on the selected training set, and the parameters with the best performance are selected to train the models.
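A twofold cross-validated grid search of this kind can be sketched as follows. The parameter names `lam` and `gamma`, the candidate grids, and the `evaluate` callback are all hypothetical placeholders, since the actual symbols and value ranges are not recoverable from the text above.

```python
import itertools
import numpy as np

def two_fold_grid_search(n_samples, grid, evaluate, seed=0):
    """Select the parameter combination with the lowest twofold CV error.

    grid maps a parameter name to its list of candidate values.
    evaluate(params, train_idx, val_idx) must return a validation error.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    half = n_samples // 2
    folds = [perm[:half], perm[half:]]
    names = sorted(grid)
    best_params, best_err = None, np.inf
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        # Each half serves once as the validation set and once as the training set.
        err = np.mean([evaluate(params, folds[1 - k], folds[k]) for k in range(2)])
        if err < best_err:
            best_params, best_err = params, float(err)
    return best_params, best_err

# Usage with a synthetic error surface (hypothetical, for illustration only).
def toy_error(params, train_idx, val_idx):
    return (params["lam"] - 1.0) ** 2 + (params["gamma"] - 0.1) ** 2

grid = {"lam": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}
best_params, best_err = two_fold_grid_search(20, grid, toy_error)
```

In a real experiment, `evaluate` would train the model on `train_idx` with the given parameters and return its error on `val_idx`.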
8.2 Result on Flickr-Wiki Dataset
Table: Per-category values with two and ten training examples (columns: Category, Two examples, Ten examples).
The first dataset is the Flickr-Wiki dataset, which consists of a collection of Flickr and Wikipedia web pages containing rich media content with images and their text descriptions. We use ten categories to evaluate the effectiveness on the image classification task. To collect the text and image collections for the experiments, the names of these categories are used as query keywords to retrieve the relevant web pages from Flickr and Wikipedia. Both web sites return many web pages in response to the submitted queries. Figure 2 illustrates some examples of retrieved images, Table I shows the number of occurrence pairs crawled from Flickr using different query words, and Table III shows the number of Wiki articles retrieved from the subcategories of each topmost category. For example, these subcategories contain the breeds of animals (e.g., bird, horse, dog, and cat), and the lists of buildings, mountains and waterfalls.
Flickr is an image sharing web site that stores many user-shared images together with their textual descriptions in the form of tags and comments. For Wikipedia, we have also retrieved the relevant web pages in the subcategories. In each crawled web page, the images and the surrounding text documents are used to learn the alignment between text and images. It is worth noting that the co-occurrence pairs used to align the image and text modalities do not contain any labeled images from the training set. In other words, these pairs are unlabeled; our algorithm does not need their labels to learn the label transfer, and uses them only to model the correlation between the two modalities.
For images, visual features are extracted to describe them. For the sake of fair comparison, we use the same vocabulary of visual words to represent images as that used by the compared algorithms in previous work. These are bag-of-visual-words (BOVW) features based on SIFT descriptors. For the text documents, we normalize the textual words by removing stop words and stemming, and use the word frequencies as textual features. For each category, the images are manually annotated to collect the ground truth labels for training and evaluation, as shown in Table II. Nearly the same number of background images are collected as the negative examples. These background images do not contain the objects of the categories. It is worth noting that these image categories are not exclusive, which means that one image can be annotated with more than one category.
First, in Figures 3 and 4, we report the performance of different algorithms with varying numbers of training images. For each category, the same number of images from the background images are used as negative examples. The average error rate is then used to evaluate the performance on the image classification tasks. To learn the transfer function, co-occurrence pairs are collected to learn the alignment between texts and images for label transfer. Since each image can be assigned more than one label, the error rate is computed in a binary-wise fashion.
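For readers unfamiliar with the bag-of-visual-words representation mentioned above, the quantization step can be sketched as follows. This is a generic illustration, not the feature pipeline used in the experiments: the function name `bovw_histogram`, the tiny two-word codebook, and the 2-D "descriptors" are hypothetical stand-ins for a learned vocabulary and real SIFT descriptors.

```python
import numpy as np

def bovw_histogram(descriptors, codebook):
    """Quantize local descriptors to their nearest visual word and count them.

    descriptors: (n, d) array of local descriptors from one image.
    codebook:    (k, d) array of visual words (e.g., cluster centers).
    Returns an L1-normalized histogram of length k.
    """
    descriptors = np.asarray(descriptors, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    # Squared Euclidean distance between every descriptor and every visual word.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)

# Toy example: two visual words, three descriptors.
codebook = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0]]
hist = bovw_histogram(descriptors, codebook)
```

Two of the three toy descriptors fall closest to the first visual word, so the histogram puts two thirds of its mass there.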
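One plausible reading of the binary-wise error rate is sketched below: each category is treated as an independent yes/no decision, and the per-category error rates are then averaged. The function name and the toy label matrices are hypothetical, and the exact evaluation script used for the figures may differ.

```python
import numpy as np

def binary_wise_error(y_true, y_pred):
    """Per-category and average error for multi-label predictions.

    y_true, y_pred: (n_images, n_categories) arrays with entries in {0, 1}.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_category = np.mean(y_true != y_pred, axis=0)
    return per_category, float(per_category.mean())

# Toy example with four images and two categories.
y_true = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
y_pred = np.array([[1, 0], [0, 1], [0, 1], [0, 1]])
per_category, average = binary_wise_error(y_true, y_pred)
```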
We note that, as shown in Figure 3, a small number of training images is the most interesting case for our algorithm, because it handles the challenging cases in which an image category does not have much past labeling information for the classification process. To validate this point, in Figure 3 we compare the average error rates over all categories with varying numbers of auxiliary training examples. It demonstrates the advantages of our method when there is an extremely small number of training images. This confirms our earlier assertion that our approach can work even with a paucity of auxiliary training examples, by exploring the correspondence between text and images. In Tables V(a) and V(b), we compare the error rates of different algorithms for each category with two and ten auxiliary training images, respectively. We note that Table V(a) shows the results with an extremely small number of training images, where the proposed scheme outperforms the compared algorithms on all the categories. If we continue to increase the number of training examples to a large enough level, as shown in Figure 4, the advantage achieved by the label transfer algorithm gradually diminishes. This is expected, since with sufficiently many training examples there is no need to leverage the cross-modal labels to enhance the classification accuracy.
Also, Table V lists the number of topics (i.e., the rank of the transfer matrix) used for learning the transfer function in the topic space from co-occurrence pairs with two and ten training examples. It shows that, for most categories, the learned label transfer model works very well with only a small number of topics. This also provides evidence of the advantages of the parsimony principle in semantic translation. However, this criterion is not absolute or unconditional; it holds on the premise that the observed training examples and co-occurrence pairs can be fit by the learned model. Complex categories with many aspects often need more topics to establish the correspondence between the heterogeneous domains. For example, as the appearances of “buildings” vary widely, more topics are needed to explain the correspondence between these variants than for categories with relatively uniform appearances. But as long as the training data can be explained, models with fewer topics are preferred for their improved generalization performance.
8.3 Result on NUS-WIDE Dataset
The second dataset we use to evaluate the algorithm is NUS-WIDE, a real-world image dataset that contains images downloaded from Flickr. Each image has a number of textual tags and is labeled with one or more of 81 image concepts. The image-text pairs belonging to the largest concepts are selected as co-occurrence pairs. Similar to the above Flickr-Wiki dataset, the images are represented by Convolutional Neural Network (CNN) features extracted by AlexNet, and the image tags are represented by word occurrence feature vectors.
Figure 5 plots the comparison of Average Precision (AP) for the 81 concepts on the NUS-WIDE dataset. Average Precision measures how well an algorithm ranks the positive examples above the negative examples. It is a widely used metric for comparing classification algorithms, especially with imbalanced sets of positive and negative examples. From the results, we find that on 67 out of 81 concepts, the proposed I2LT outperforms the other compared algorithms. Table VI compares the Mean Average Precision (MAP) over the 81 concepts on the NUS-WIDE dataset.
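For concreteness, a common non-interpolated form of Average Precision and its mean over concepts can be computed as in the sketch below. This is an assumption about the exact AP variant, which the text does not specify; the function names and toy scores are illustrative only.

```python
import numpy as np

def average_precision(scores, labels):
    """Non-interpolated AP: mean of the precision at the rank of each positive.

    scores: predicted relevance scores, higher means more likely positive.
    labels: binary ground truth in {0, 1}.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(-scores, kind="stable")
    ranked = labels[order]
    n_pos = ranked.sum()
    if n_pos == 0:
        return 0.0
    hits = np.cumsum(ranked)
    precision_at_rank = hits / np.arange(1, len(ranked) + 1)
    return float(np.sum(precision_at_rank * ranked) / n_pos)

def mean_average_precision(score_matrix, label_matrix):
    """MAP over the columns (concepts) of (n_images, n_concepts) matrices."""
    score_matrix = np.asarray(score_matrix, dtype=float)
    label_matrix = np.asarray(label_matrix)
    aps = [average_precision(score_matrix[:, c], label_matrix[:, c])
           for c in range(score_matrix.shape[1])]
    return float(np.mean(aps))

# Toy example: positives are ranked first and third among four images.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

In the toy example, the precision is 1 at the first positive and 2/3 at the second, so the AP is their mean. Because AP depends only on the ranking, it is insensitive to the decision threshold, which is why it suits concepts with few positives.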
It is worth noting that the learned intermodal label transfer function measures the cross-modal relevance. It can be used to retrieve the relevant images given a text query, and vice versa. Thus, we test cross-modal retrieval with the learned transfer function. Specifically, following the experimental setup in prior work, we consider two scenarios. (1) The Image to Image (I2I) search, i.e., an image is used as a query to search for the relevant images with the same label; (2) The Text to Image (T2I) search, i.e., the input query is a text description and the output is a list of relevant images. We can also perform an Image to Text (I2T) search, where an image is used as the input query to search for the relevant text descriptions. However, the I2T result on NUS-WIDE was not reported in that work. For the sake of a direct comparison, we skip the I2T search in this paper too.
We follow the same evaluation protocol as the prior work: from the test set, a subset of samples is randomly drawn and used as the queries, another subset is used as the validation set, and the remaining ones form the database to be retrieved. A retrieved output is considered relevant to an input query if they have the same label. The experiments are repeated five times, and we report the top-20 precision averaged over the five random database/query splits.
Table VII compares the retrieval performance of I2LT and three CCA variants. Among them, CCA(V+T) refers to the two-view baseline model based on both visual and text features; CCA(V+T+K) refers to the three-view CCA model with visual, text and supervised semantic information; and CCA(V+T+C) refers to the three-view model with an unsupervised third view based on automatically generated word clusters. More details about these three models can be found in the work that introduced them. In testing the I2I search, image features are projected into the CCA space and the learned I2LT space (cf. Eq. (6)) respectively, and then we use them to retrieve the most relevant images from the dataset. For a fair comparison with these CCA variants trained with the original BOVW features, we report the retrieval accuracies of I2LT with both BOVW features and CNN features in the table.
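The top-k precision used here can be sketched as follows. For simplicity, this illustration assumes a single label per item and a precomputed query-by-database similarity matrix; both are assumptions made for the example, as are the function name and the toy data.

```python
import numpy as np

def precision_at_k(query_labels, db_labels, similarity, k=20):
    """Mean fraction of the top-k retrieved items that share the query's label.

    similarity: (n_queries, n_db) matrix, higher means more relevant.
    """
    query_labels = np.asarray(query_labels)
    db_labels = np.asarray(db_labels)
    similarity = np.asarray(similarity, dtype=float)
    # Indices of the k most similar database items for every query.
    top = np.argsort(-similarity, axis=1, kind="stable")[:, :k]
    relevant = db_labels[top] == query_labels[:, None]
    return float(relevant.mean())

# Toy example with two queries, four database items, and k = 2.
similarity = [[0.9, 0.1, 0.8, 0.2],
              [0.1, 0.2, 0.9, 0.8]]
p_at_2 = precision_at_k([0, 1], [0, 0, 1, 1], similarity, k=2)
```

In the protocol above, this quantity would be computed with k = 20 for each of the five random database/query splits and then averaged.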
8.4 Zero-Shot Label Transfer on CUB200 and Oxford Flower-102 Datasets
We use two datasets to test the algorithm for zero-shot label transfer. The first is the CUB200 Birds dataset, which consists of images of 200 species of birds. The corresponding Wikipedia articles are collected by using the names of these birds as query keywords, yielding the articles used as text descriptions. The second dataset is Flower102, with images of 102 classes of flowers. Unlike CUB200, the text articles, one for each flower class, are collected not only from Wikipedia, but also from Plant Database, Plant Encyclopedia, and BBC articles.
For both datasets, Classeme features are extracted as an intermediate semantic representation of the input images. For the text modality, TF-IDF (Term Frequency-Inverse Document Frequency) features are extracted from each article, and their dimensionality is then reduced with the Cluster Latent Semantic Indexing (CLSI) algorithm. The resulting dataset with text descriptions is publicly available (Footnote 2: https://sites.google.com/site/mhelhoseiny/computer-vision-projects/Write_a_Classifier).
Five-fold cross-validation over the classes was adopted to test the algorithm, where 4/5 of the classes were used as seen classes and the other 1/5 as unseen ones. The datasets are then split into training and test sets according to the seen and unseen classes, where the images and the corresponding articles of the seen classes constitute the co-occurrence pairs. Five-fold cross-validation over the seen classes is used to decide the hyper-parameters. Following prior work, we report the average AUC (Area Under the ROC Curve) over the five folds to evaluate the performance.
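The class-level split can be sketched as below. Note that the folds partition class ids rather than individual images, so every image and article of an unseen class is held out together. The function name and the random seed are hypothetical; the class count of 200 matches CUB200.

```python
import numpy as np

def class_level_folds(class_ids, n_folds=5, seed=0):
    """Partition classes into folds; each fold's classes serve once as unseen.

    Returns a list of (seen_classes, unseen_classes) pairs of sorted arrays.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.asarray(class_ids))
    folds = []
    for unseen in np.array_split(shuffled, n_folds):
        seen = np.setdiff1d(shuffled, unseen)
        folds.append((seen, np.sort(unseen)))
    return folds

# 200 classes: each fold has 160 seen and 40 unseen classes.
folds = class_level_folds(np.arange(200))
```

The same function could be applied again to the seen classes of each fold to obtain the inner cross-validation used for hyper-parameter selection.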
We considered four state-of-the-art zero-shot learning algorithms as baselines, namely (1) Gaussian Process Regressor (GPR) , (2) Twin Gaussian Process (TGP) , (3) Nonlinear Asymmetric Domain Adaptation (DA) , as well as (4) WAC (Write A Classifier) .
We report the comparative results on the two datasets in Table VIII. We can see that the proposed algorithm outperforms the others in terms of average AUC. The performance improvement is partly attributed to the fact that the proposed approach prefers a concise label transfer model by imposing the trace norm regularizer. This preference plays an important role considering that only very few positive examples are available for unseen classes in the text and image modalities (in fact, no image examples exist at all in zero-shot learning). With extremely rare examples, adopting a concise cross-modal transfer model can effectively reduce the risk of overfitting, as both modalities have feature representations of very high dimensionality. In fact, the resulting label transfer matrices have low rank on both the CUB200 and Flower102 datasets over the five-fold cross-validation.
We also compare the classification accuracies with various types of output embedding models on the extended CUB dataset in Table IX. This dataset is an extended version of the CUB200 dataset. For a fair comparison, the same zero-shot split as in the compared work is used, where 150 classes are used for training and validation, and the remaining disjoint classes are used for testing. The average per-class accuracy is reported on the test set for each compared algorithm. From the comparison, we find that the proposed algorithm outperforms the compared types of embedding algorithms. This can be attributed to the fact that the proposed algorithm not only uses input and output embeddings to learn the transfer function, but also applies the learned transfer function to combine multiple labels of source texts to annotate the target images. In contrast, the existing embedding models only output the compatibility between an input-output pair, without exploring the joint use of multiple source labels to predict on a target image.
Here, we make an additional note on the training of the proposed label transfer model in the zero-shot scenario. In the experiment, we enforce that the label transfer matrix is shared across seen and unseen classes. This is possible because the transfer matrix is class-independent: it aims to capture the inter-modal correlation between images and their corresponding articles, no matter which classes they belong to. In this sense, in the training phase, each seen class of images in the training set can also be labeled by transferring the corresponding text labels with the shared transfer matrix, which is learned by minimizing such label transfer errors as in Eq. (4). In this way, we fully exploit the image labels of seen classes in the training set to learn the shared transfer matrix. This does not violate the zero-shot assumption that no image labels of unseen classes should be involved in the training algorithm. The experimental results also show that, without this training strategy, i.e., when learning a separate transfer matrix for each unseen class, the proposed approach achieved a lower average AUC on both CUB200 and Flower102, which justifies the transfer matrix sharing strategy.
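To illustrate why a trace norm regularizer favors low-rank transfer matrices, the sketch below implements singular value thresholding, the standard proximal operator of the trace norm. It is given purely as an illustration of the regularizer's effect; the optimization procedure actually used to learn the transfer matrix is not shown in this section, and the matrix sizes, noise level, and threshold `tau` are arbitrary choices for the example.

```python
import numpy as np

def singular_value_threshold(matrix, tau):
    """Soft-threshold the singular values of a matrix by tau.

    This is the proximal operator of tau * (trace norm). Singular values
    below tau are set to zero, which reduces the rank of the result.
    Returns the thresholded matrix and its rank.
    """
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    rank = int(np.count_nonzero(s_shrunk))
    return (u * s_shrunk) @ vt, rank

# Toy example: a rank-2 matrix corrupted by small dense noise.
rng = np.random.default_rng(0)
low_rank = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 20))
noisy = low_rank + 0.01 * rng.standard_normal((30, 20))
denoised, rank = singular_value_threshold(noisy, tau=1.0)
```

The small singular values introduced by the noise fall below the threshold and are removed, so the output has a much lower rank than the full-rank noisy input. A larger `tau` corresponds to a stronger preference for concise models.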
8.5 Impact of the Size of Co-occurrence Pairs
The above results are obtained by using a fixed number of co-occurring text-image pairs. The number of co-occurrence text-image pairs plays an important role in aligning the heterogeneous modalities. Therefore, it is instructive to examine the effect of increasing the number of pairs. In Figure 6, we compare the error rates of different algorithms with varying numbers of text-image pairs. The number of pairs is shown on the horizontal axis, and the error rate on the vertical axis. As we can see, the error rate of the proposed I2LT algorithm decreases with an increasing number of pairs, because more information is exploited to align the text and image domains. We also note that its improvement is more significant than that of the other algorithms when more text-image pairs are involved. This shows that I2LT is more robust to noisy co-occurrence pairs of texts and images, because it jointly models the relevance of training labels between the texts and the images used for label transfer. It also demonstrates the advantage of I2LT over the other algorithms.
8.6 Computational Cost
Finally, we compare the computational costs incurred by the different algorithms. All the algorithms are run on the same cluster server, equipped with an Intel Xeon 2.5 GHz 12-core CPU. Table X shows the computing time to train and test the different models. SVM is the fastest model to train, since it does not involve any labeled text corpus. TLRisk is the second fastest model to train, and the other three models are trained in comparable time, since all of them spend most of their time on constructing the intermediate representation used to transfer the labels. At test time, I2LT takes the longest, because it has to transfer labels both across modalities and within the image modality. However, the longer time is compensated by the more accurate test results shown above.
In this paper, we presented a method to jointly transfer labels within and across modalities for an effective image classification model. This method is designed to alleviate the dual issues of scarce labels and large semantic gaps that are inherent to images. The label transfer process is built on a transfer function, which can convert the labels from text to images effectively. We show that the transfer function can be learned from the co-occurrence pairs of texts and images as well as a small number of training images. We follow the parsimony principle to develop a common representation that aligns texts and images with as few topics as possible in the label transfer process. For prediction, we do not assume that a test image comes with any text description; the labels of the text corpus can be propagated to annotate the test image by the learned transfer function. We show superior results of the proposed algorithm on the image classification task as compared with state-of-the-art heterogeneous transfer learning algorithms.
The first author was partly supported by NSF grant 16406218. We would also like to thank the anonymous reviewers for bringing the zero-shot learning problem to our attention, which inspired us to study the applicability of the proposed approach to this problem.
-  Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for attribute-based classification. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 819–826. IEEE, 2013.
-  Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2927–2936, 2015.
-  Y. Amit, M. Fink, N. Srebro, and S. Ullman. Uncovering shared structures in multiclass classification. In Proceedings of International Conference on Machine Learning, 2007.
-  F. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the smo algorithm. In Proceedings of International Conference on Machine Learning, 2004.
-  D. P. Bertsekas. Nonlinear programming. Athena scientific Belmont, 1999.
-  D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, (3):993–1022, January 2003.
-  A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, 1998.
-  L. Bo and C. Sminchisescu. Twin gaussian processes for structured prediction. International Journal of Computer Vision, 87(1-2):28–52, 2010.
-  S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
-  J.-F. Cai, E. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion, September 2008.
-  Y. Chen, T. V. Nguyen, M. Kankanhalli, J. Yuan, S. Yan, and M. Wang. Audio matters in visual attention. IEEE Transactions on Circuits and Systems for Video Technology, 24(11):1992–2003, 2014.
-  T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y.-T. Zheng. Nus-wide: A real-world web image database from national university of singapore. In Proc. of ACM Conf. on Image and Video Retrieval (CIVR’09), Santorini, Greece., July 8-10, 2009.
-  N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and other kernel-based learning methods. Cambridge University Press, 2000.
-  W. Dai, Y. Chen, G.-R. Xue, Q. Yang, and Y. Yu. Translated learning: Transfer learning across different feature spaces. In Proceedings of Advances in Neural Information Processing Systems, 2008.
-  L. Duan, I. W. Tsang, D. Xu, and S. J. Maybank. Domain transfer svm for video concept detection. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1375–1381. IEEE, 2009.
-  M. Elhoseiny, B. Saleh, and A. Elgammal. Write a classifier: Zero-shot learning using purely textual descriptions. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 2584–2591. IEEE, 2013.
-  Y. Fu, T. M. Hospedales, T. Xiang, and S. Gong. Learning multimodal latent attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(2):303–316, 2014.
-  Y. Gong, Q. Ke, M. Isard, and S. Lazebnik. A multi-view embedding space for modeling internet images, tags, and their semantics. International Journal of Computer Vision, 106(2):210–233, 2013.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
-  T. Hofmann. Probabilistic latent semantic analysis. In Uncertainty in Artificial Intelligence, 1999.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Proc. NIPS, pages 1097–1105.
-  B. Kulis, K. Saenko, and T. Darrell. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1785–1792. IEEE, 2011.
-  C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 951–958. IEEE, 2009.
-  T. K. Landauer, P. W. Foltz, and D. Laham. An introduction to latent semantic analysis. Discourse Processes, 25:259–284, 1998.
-  W. Li, L. Duan, D. Xu, and I. W. Tsang. Learning with augmented features for supervised and semi-supervised heterogeneous domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1134–1148, June 2014.
-  W. Li, L. Niu, and D. Xu. Exploiting privileged information from web data for image categorization. In European Conference on Computer Vision, September 2014.
-  D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
-  J. Masci, M. M. Bronstein, A. M. Bronstein, and J. Schmidhuber. Multimodal similarity-preserving hashing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(4):824–830, 2014.
-  S. Moon, S. Kim, and H. Wang. Multimodal transfer deep learning for audio visual recognition. arXiv preprint arXiv:1412.3121, 2014.
-  J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 689–696, 2011.
-  M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Computer Vision, Graphics & Image Processing, 2008. ICVGIP’08. Sixth Indian Conference on, pages 722–729. IEEE, 2008.
-  L. Niu, W. Li, and D. Xu. Exploiting privileged information from web data for action and event recognition. International Journal of Computer Vision, pages 1–21, November 2015.
-  V. Ordonez, G. Kulkarni, and T. L. Berg. Im2text: Describing images using 1 million captioned photographs. In Advances in Neural Information Processing Systems, pages 1143–1151, 2011.
-  M. Palatucci, D. Pomerleau, G. E. Hinton, and T. M. Mitchell. Zero-shot learning with semantic output codes. In Advances in neural information processing systems, pages 1410–1418, 2009.
-  G.-J. Qi, C. Aggarwal, and T. Huang. Towards semantic knowledge propagation from text corpus to web images. In Proc. of International World Wide Web conference, 2011.
-  G.-J. Qi, C. Aggarwal, Q. Tian, H. Ji, and T. S. Huang. Exploring context and content links in social media: A latent space method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(5):850–862, 2012.
-  G.-J. Qi, X.-S. Hua, and H.-J. Zhang. Learning semantic distance from community-tagged media collection. In Proc. of International ACM Conference on Multimedia, 2009.
-  G.-J. Qi, Q. Tian, C. Aggarwal, and T. Huang. Towards cross-category knowledge propagation for learning visual concepts. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.
-  R. Raina, A. Battle, H. Lee, B. Packer, and A. Ng. Self-taught learning: Transfer learning from unlabeled data. In Proceedings of International Conference on Machine Learning, 2007.
-  R. Raina, A. Ng, and D. Koller. Constructing informative priors using transfer learning. In Proceedings of International Conference on Machine Learning, 2006.
-  C. E. Rasmussen. Gaussian processes for machine learning. 2006.
-  M. T. Rosenstein, Z. Marx, L. P. Kaelbling, and T. G. Dietterich. To transfer or not to transfer. In NIPS 2005 Workshop on Inductive Transfer: 10 Years Later, volume 2, page 7, 2005.
-  N. Srebro, J. Rennie, and T. Jaakkola. Maximum margin matrix factorization. In Proceedings of Advances in Neural Information Processing Systems, 2005.
-  K. C. Toh and S. Yun. An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Preprint on Optimization Online, April 2009.
-  S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729, 2014.
-  P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
-  P. Wu and T. Dietterich. Improving svm accuracy by training on auxiliary data sources. In Proceedings of International Conference on Machine Learning, 2004.
-  Q. Yang, Y. Chen, G. R. Xue, W. Dai, and Y. Yu. Heterogeneous transfer learning for image clustering via the social web. In Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pages 1–9, Singapore, August 2009.
-  M. Zhu. Recall, Precision and Average Precision. 2004.
-  Y. Zhu, Y. Chen, Z. Lu, S. J. Pan, G.-R. Xue, Y. Yu, and Q. Yang. Heterogeneous transfer learning for image classification. In Proceedings of The Twenty-Fourth AAAI Conference on Artificial Intelligence, 2011.