Object classification from natural images is useful in content-based image retrieval, video surveillance, robot localization and image understanding. According to Lampert et al., humans are able to distinguish between at least 30,000 relevant classes. However, training conventional object detectors for all these classes would require millions of well-labeled training images and is likely out of reach for years to come.
As such, the zero-shot learning paradigm [1, 2, 3, 4, 5, 6, 7] is motivated by the human ability to learn and abstract from examples, and by the capability to describe completely unseen classes (i.e. classes for which no training images are available when the object detector is trained) in terms of existing (known) classes. For instance, [1, 2, 3, 4] recognize a set of unseen objects using a list of high-level attributes that serve as an intermediate layer in the classifier cascade. The attributes enable those systems to recognize the object classes, even without a single training example. Others like [5, 6] use semantic relationships from different reference classes to predict the unseen classes. Though promising results were obtained, all the aforementioned approaches require either extensive human supervision to build the attributes, or a tight semantic relationship between the unseen classes and the training classes.
In this paper, we propose 1) a topic model to replace the attributes [1, 2, 3, 4] so that extensive human supervision is no longer required, and 2) the Hierarchical Class (HiC) concept to relate the unseen classes to the existing seen classes. The HiC concept relies on a looser image class hierarchy than [5, 6]. Our framework starts by building a Bag-of-Words (BoW) model using the image features from the (small number of) seen (available) classes. Herein, the HiC concept is utilized to build the codebook. A topic model (here we employ probabilistic Latent Semantic Analysis (pLSA)) is learned using the generated BoW model. Based on the learned pLSA model and the HiC concept, signature topics for both the seen and unseen classes are deduced (i.e. we cluster object classes that share visual similarity). Finally, object classification is performed using the deduced signature topic representation. Experimental results on four publicly available datasets, namely PubFig, Cifar-100, Caltech-256 and AwA, show the effectiveness of the proposed method.
II Related Work
Palatucci et al. showed that the attribute description of an instance or category is useful as a semantically meaningful intermediate representation to bridge the gap between low-level features and high-level classes. By expressing classes in terms of well-known attributes, the attributes facilitate transfer and zero-shot learning, and so alleviate the lack of labeled training data. Lampert et al. [1, 13] then extended this work to animal categorization by introducing Direct Attribute Prediction (DAP) and Indirect Attribute Prediction (IAP).
Unlike [17, 1, 13], Parikh and Grauman introduced relative attributes to perform zero-shot learning. This approach captures the relationships between images and objects in terms of human-nameable visual properties. For example, the models capture that one animal is ‘taller’ than another, or that one subject is ‘happier’ than another. This allows a richer language of supervision and description than the commonly used categorical (binary) attributes. Though relative attributes seem efficient for zero-shot learning, the dataset needs to be intra-class (i.e. the images in the dataset must belong to a set of object classes that are visually similar). Also, a binary or relative relationship between all classes needs to be defined beforehand. Such a process requires extensive human supervision, and the decisions are always subjective.
In our proposed strategy, we replace the attributes with a topic model in order to reduce the human supervision needed. Others who use topic models in zero-shot learning are [15, 14]. They propose a hybrid attribute-topic model to deal with group social activities. Specifically, they define three types of attributes: user-defined, latent class-conditional, and latent generalized free attributes. These attributes are learned jointly in a semi-latent attribute space using the multi-modal latent attribute topic model (M2LATM). Their motivation is to reduce the annotation effort by introducing latent attributes into the framework. In contrast, our focus in this paper is on object recognition: we learn the topic model directly from the BoW representations, and infer the unseen classes using the proposed HiC concept. We eliminate the time-consuming human annotation process by replacing the attributes with topic models. Instead of learning the topic models on top of user-defined and latent attributes [15, 14], we choose pLSA as our topic model because, unlike the Latent Dirichlet Allocation (LDA) model, it does not require a prior. We further extend the topic model representation into a mapping algorithm to object classes, so that zero-shot learning becomes possible.
Figure 1 shows conventional solutions that associate each image with a class label [8, 9, 10, 11], or further describe the image content with attributes [2, 3, 12, 1, 13, 14, 15] or image tags. These are insufficient in zero-shot learning because attributes and tags can become redundant and unhelpful when too many of them are introduced. Moreover, there is no established method for evaluating what makes an attribute or tag effective. Therefore, we introduce a new codebook learning method, i.e. the HiC concept, which utilizes the hierarchical class characteristics during the codebook learning stage. This concept is inspired by [18, 5, 6], where a set of common objects is clustered into different classes in order to deduce the relationships among them. Specifically, we integrate two different levels of image class labels, namely the Coarse Class and the Fine Class. This class hierarchy is then learned in the topic model to identify the significant differences among the classes and improve the model's prediction capability. Such an approach is better than attribute-based classification [1, 13], which is commonly applicable to inter-class problems only. The HiC concept handles both inter-class and intra-class problems.
Similar work that employs a hierarchical class strategy in the zero-shot learning paradigm includes Rohrbach et al. and Frome et al. Both study how to incorporate semantic information from a language model to assist zero-shot learning. Rohrbach et al. employed WordNet and Wikipedia as the language model, and learned a similarity measure to represent the hierarchy, attributes, or objectness measure between the object classes. Frome et al. extended the idea to learn the class relationship directly from unannotated data (i.e. visual-semantic relationships between object classes from millions of documents in Wikipedia) using the Deep Visual-Semantic Embedding Model (DeViSE).
In another approach, Mensink et al. used a different concept, in which a distance metric is learned from a set of seen classes (e.g. 800 seen classes) and the error is evaluated on both seen and unseen classes (e.g. 800 seen classes and 200 unseen classes, resulting in 1000-way classification). A Nearest Class Mean (NCM) classifier is employed to classify the object classes. This approach does not require a semantic relationship, and generalizes to unseen classes at near-zero computational cost. Although our proposed framework is similar to the hierarchy-based knowledge transfer of Rohrbach et al., we do not need a language model to build the hierarchy. Instead, the HiC concept relates the unseen classes to the seen classes. Also, we use a learned topic model to perform zero-shot learning, which differs from the attribute-based or direct similarity-based knowledge transfer of Rohrbach et al., which uses attributes or an objectness measure, and from Mensink et al., which uses metric learning.
In this section, we first discuss the prerequisites of the proposed framework: BoW model and topic model. Secondly, we explain the HiC concept and detail how to perform zero-shot learning in pLSA with the HiC concept. Finally, we show the inference method for image classification purposes.
III-A Codebook Representation
To build the BoW model, we employ the Random Forest (RF) algorithm [11, 19], where each random decision tree is constructed using a random subset of the training data drawn with replacement. The labeled training images at a particular node are recursively split into left-node and right-node subsets according to a threshold and a split function (Eq. 1). The split function operates on the feature vectors of the training images, each of which has an associated class label. At each split node, random subsets of features are generated and compared against the threshold, and the split that maximizes the expected information gain (Eq. 2) is selected. The information gain is computed from the Shannon entropy of the class probability histogram at the node and at its two child nodes. The leaf nodes of all trees in the RF then form a codebook. The codebook is used to quantize an image into a BoW representation by passing its features down each tree and counting the occurrences of each leaf node.
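The entropy-based split selection described above can be sketched in a few lines. This is a minimal illustration of the generic information-gain criterion, not the exact implementation behind Eq. 1-2; the function names (`shannon_entropy`, `information_gain`, `best_threshold`), the scalar feature responses and the toy labels are all assumptions made for the example.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of the class histogram of `labels`."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(labels, go_left):
    """Entropy reduction when non-empty `labels` are split by the mask `go_left`."""
    left = [c for c, flag in zip(labels, go_left) if flag]
    right = [c for c, flag in zip(labels, go_left) if not flag]
    total = len(labels)
    weighted = ((len(left) / total) * shannon_entropy(left)
                + (len(right) / total) * shannon_entropy(right))
    return shannon_entropy(labels) - weighted


def best_threshold(values, labels, candidates):
    """Pick the candidate threshold with the highest information gain."""
    return max(candidates,
               key=lambda t: information_gain(labels, [v < t for v in values]))


# Toy node (hypothetical data): one scalar feature response per image.
values = [0.1, 0.2, 0.8, 0.9]
labels = ["cat", "cat", "dog", "dog"]
print(best_threshold(values, labels, candidates=[0.15, 0.5, 0.85]))
```

In this toy node the middle threshold separates the two classes perfectly, so it yields the largest gain; a real forest would repeat this choice over random feature subsets at every split node.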
III-B Topic Model
Our model is based on a latent topic model, in particular the pLSA model. We briefly introduce it using the terminology of our context. Suppose we are given a collection of images (documents). Each image is represented by a collection of word counts, which record how frequently a particular word is used in that image. A word is the basic item from a codebook, identified by its index. A joint probability model over images and words can be defined as (Eq. 3):
where the topic is a latent variable. We can further derive the document-specific word distribution as (Eq. 4):
However, in their current form, Eq. 3-4 cannot infer the unseen classes, because the algorithm needs prior knowledge about which topic belongs to which class, or a set of labeled training images, to learn the model. In the zero-shot paradigm, such information is simply not available. To handle this issue, we propose the HiC concept (discussed next), so that we can infer the unseen classes and perform zero-shot learning using the pLSA model.
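For reference, the textbook form of the pLSA factorization is sketched below. The symbols are assumed notation introduced only for this sketch (image or document $d_i$, codebook word $w_j$, latent topic $z_k$, number of topics $K$) and may differ from the symbols used in Eq. 3-4.

```latex
% Standard pLSA factorization; d_i, w_j, z_k and K are assumed notation.
\begin{align}
  P(d_i, w_j)     &= P(d_i)\, P(w_j \mid d_i), \\
  P(w_j \mid d_i) &= \sum_{k=1}^{K} P(w_j \mid z_k)\, P(z_k \mid d_i).
\end{align}
```

The second line makes the limitation discussed above visible: the topics $z_k$ are unlabeled mixture components, so nothing in the factorization itself ties a topic to an object class.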
III-C Hierarchical Class (HiC) Concept
We introduce the HiC concept, a nested class concept illustrated in Figure 1c, where one image carries two semantically related class labels. One label has a broader visual concept, namely the Coarse Class, while the other has a narrower visual concept, namely the Fine Class. Table I shows some examples of the HiC concept.
Coarse Class: a large concept class (parent) whose Fine Classes share a conceptual similarity, either physical or biological;
Fine Class: a specific object class (child) that is a subset of one of the Coarse Classes.
|Electrical Devices||Building||Water Spot|
|- Television||- House||- Coast|
|- Refrigerator||- Apartment||- Beach|
|- Washing Machine||- Tall Building||- Underwater|
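The parent-child structure of Table I can be captured with a plain mapping. The sketch below is only an illustration of the data structure; the names `HIC`, `COARSE_OF`, `hic_labels` and `siblings` are hypothetical and not taken from the paper.

```python
# Coarse Class -> Fine Classes, copied from Table I.
HIC = {
    "Electrical Devices": ["Television", "Refrigerator", "Washing Machine"],
    "Building": ["House", "Apartment", "Tall Building"],
    "Water Spot": ["Coast", "Beach", "Underwater"],
}

# Inverse lookup: Fine Class -> Coarse Class.
COARSE_OF = {fine: coarse for coarse, fines in HIC.items() for fine in fines}


def hic_labels(fine_class):
    """Return the (Coarse Class, Fine Class) label pair of an image."""
    return COARSE_OF[fine_class], fine_class


def siblings(fine_class):
    """Fine Classes that share the Coarse Class of `fine_class`."""
    coarse = COARSE_OF[fine_class]
    return [f for f in HIC[coarse] if f != fine_class]


print(hic_labels("Beach"))  # ('Water Spot', 'Beach')
print(siblings("Beach"))    # ['Coast', 'Underwater']
```

The `siblings` lookup is the kind of query the zero-shot step relies on: the candidate seen classes for an unseen Fine Class are the other Fine Classes under the same Coarse Class.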
III-C1 Codebook Representation in HiC concept
Using the HiC concept, we have three new codebook representations: 1) Coarse or Fine, 2) Joint Coarse-Fine (J-CoFi), and 3) CoarseFine (CoFi). We next explain their properties.
Property III.1 (Coarse or Fine). The Coarse and Fine codebooks are learned as in the initial RF learning described in Section III-A, except that the class label in the Shannon entropy is replaced with the Coarse Class label or the Fine Class label, respectively. Figure 2 illustrates that using only the Fine codebook is not optimal, because the codebook representations of the Fine Classes vary drastically even though they belong to the same Coarse Class. Therefore, we build a variant, namely J-CoFi.
Property III.2 (Joint Coarse-Fine (J-CoFi)). The J-CoFi codebook strategy uses both Coarse Class and Fine Class information during the RF learning. Specifically, out of the total number of trees, one portion is trained to capture the similarity between Fine Classes that share the same Coarse Class, and the remaining portion is trained to distinguish the Fine Classes within their associated Coarse Class. This results in a BoW model that has a similar histogram shape in the codebook bins created by the Coarse Class trees. Hence, it eliminates the limitations of Property III.1.
|Codebook Type||J-CoFi||Shannon Entropy ()|
Table II summarizes the difference between Property III.1 and Property III.2. Both properties remain limited when Eq. 2 is employed to compute the information gain: at any one time, only the Coarse Class or the Fine Class can be optimized during the RF tree node splitting. We therefore introduce the CoFi codebook (Property III.3) to handle this limitation.
Property III.3 (CoarseFine (CoFi)). CoFi learns the trees such that both the Coarse Class and the Fine Class labels are used simultaneously in the RF tree node splitting. Specifically, we modify Eq. 2 so that, for each CoFi tree, the information gains with respect to the Coarse Class and the Fine Class are considered together at each split node, and the split that maximizes this combined information gain is selected.
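One way to read the CoFi criterion is as a split score that adds the information gain measured on the Coarse Class labels to the gain measured on the Fine Class labels. The sketch below follows that reading; the additive combination, the function names (`entropy`, `gain`, `cofi_gain`) and the toy labels are assumptions for illustration, and the paper's modified Eq. 2 may combine the two terms differently.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the class histogram of `labels`."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain(labels, go_left):
    """Information gain of splitting non-empty `labels` by the mask `go_left`."""
    left = [c for c, flag in zip(labels, go_left) if flag]
    right = [c for c, flag in zip(labels, go_left) if not flag]
    total = len(labels)
    children = ((len(left) / total) * entropy(left)
                + (len(right) / total) * entropy(right))
    return entropy(labels) - children


def cofi_gain(coarse, fine, go_left):
    """Assumed CoFi score: Coarse Class gain plus Fine Class gain."""
    return gain(coarse, go_left) + gain(fine, go_left)


# Toy node (hypothetical data) with both label levels for four images.
coarse = ["Building", "Building", "Water Spot", "Water Spot"]
fine = ["House", "Apartment", "Coast", "Beach"]
split_a = [True, True, False, False]   # separates the Coarse Classes
split_b = [True, False, True, False]   # mixes the Coarse Classes
print(cofi_gain(coarse, fine, split_a), cofi_gain(coarse, fine, split_b))
```

Both candidate splits reduce the Fine Class entropy by the same amount, so a Fine-only tree could not tell them apart; the combined score prefers the split that also respects the Coarse Class grouping.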
III-D Zero-shot learning in pLSA with HiC concept
To perform zero-shot learning using the HiC concept, we distinguish the seen classes from the unseen classes. For each unseen class, we collect a pair of seen classes that belong to the same Coarse Class as the unseen class:
where the pairing indicates conceptual similarity between the classes (i.e. as described in Definition III.1). In the pLSA model, we introduce a novel mapping algorithm, namely topic sets, which index the topics. Each topic set is associated with a specific class, which creates a relationship between topics and classes. Our idea is that an unseen class that can be related to a pair of seen classes will have a topic set highly similar to theirs. Therefore, we can relate the unseen class to its seen pair by defining a topic set that satisfies the conditions imposed by both seen classes. We denote the signature topic set for a seen class as:
where the size of is , and is a class-specific topic distribution that is used to determine for every :
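where the signature topic set of the unseen class is inferred as the union of the signature topic sets of its seen pair to achieve zero-shot learning. Taking three topics as an example, the possible signature topic vectors include [0 0 1], [0 1 0], [1 0 0], [0 1 1], [1 0 1], [1 1 0] and [1 1 1], where 1 marks a signature topic and 0 marks a non-signature topic. Ideally, if one seen class of the pair has [0 0 1] and the other has [1 0 0], then the unseen class has [1 0 1].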
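The union step in the example above amounts to an element-wise OR over binary topic vectors. The following sketch shows only that operation; the helper names (`all_topic_vectors`, `union`) and the variables `seen_1`, `seen_2` are hypothetical, and how the signature topics of each seen class are obtained from the pLSA model is not covered here.

```python
from itertools import product


def all_topic_vectors(num_topics):
    """Every binary vector of length `num_topics` (1 = signature topic)."""
    return [list(bits) for bits in product([0, 1], repeat=num_topics)]


def union(sig_a, sig_b):
    """Element-wise OR of two binary signature topic vectors."""
    return [a | b for a, b in zip(sig_a, sig_b)]


# Signature topics of the two seen classes paired with one unseen class.
seen_1 = [0, 0, 1]
seen_2 = [1, 0, 0]
print(union(seen_1, seen_2))      # [1, 0, 1]
print(len(all_topic_vectors(3)))  # 8, counting the all-zero vector
```

Because the union of two vectors can coincide for different seen pairs, a small number of topics leaves few distinct signatures to assign.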
Finally, given a test class , it can be predicted by evaluating:
Algorithm 1 summarizes the proposed framework.
In the experiments, we employed four public datasets: PubFig, Cifar-100, Caltech-256 and Animals with Attributes (AwA). These datasets pose different visual challenges in terms of illumination effects, scales, and viewpoints, and together cover more than 120,000 objects.
Implementation details: In order to evaluate , 1-vs-all classification is performed. Unless specified, the PubFig, Cifar-100 and Caltech-256 dataset features are extracted using the Pyramid Histogram of Gradient (PHOG) with pyramid levels, 180 angle and bins. Specifically, we use the PHOG from [10, 22]. However, we did not concatenate all the PHOG descriptors found. Instead, we put all these features in a codebook learning mechanism using the RF algorithm [11, 19]. Therefore, we can obtain a set of HOG descriptors that quantize shape information locally and globally, by the nature of the PHOG. The RF codebook can learn image shapes as a whole, as well as the local patch characteristic. For the RF codebook, it is learned using trees and leafnodes.
The PubFig or Public Figures Face Database has a total of 58797 images of 200 celebrities faces. We used identical subsets as in  where random identities are extracted with each class of images. The pLSA model is built using , similar to the number of attributes in . In addition to the PHOG features, we also re-implement our framework using features identical to , which is a combination of GIST features and color histograms. We employ the class relationship as in  to find the . However, the optimum nearest seen classes pair between the unseen classes are chosen, and we assume the () relationship in  is similar to our () relationship.
Table III shows that our proposed method has better accuracy (PHOG: ; GIST + color histogram: ), compared to Lampert et al.  that uses the binary attributes, and Parikh and Grauman  that uses the relative attributes. Our results are achieved without the annotation required in [1, 2]. When the number of unseen classes is increased, there is a consistent drop in the system accuracy from to for PHOG features, and from to for GIST + color histogram features. This is expected as when the number of unseen classes increases, the system accuracy decreases due to the tradeoffs between computational complexity and system accuracy.
|Features||Our Proposed Method||Binary Attributes||Relative attributes|
|Number of Unseen Class,|||||
We performed a consistency test where we tested the accuracy of our proposed method and [2, 1] across different . Figure 3 shows that the proposed method has a better consistency (PHOG: 7; GIST + color histogram: 10) in comparison with (17) and (23). Also,  performed the worst in terms of accuracy while  performed the worst in terms of consistency. Such results have shown the effectiveness and consistency of our proposed algorithm to handle the intra-class variation problem as opposed to the extensive attributes annotation in [2, 1].
|Our Proposed Method||Sparse Coding||Beyond|
|Number of Unseen Class,||||Spatial Pyramid|
|without||with HiC concept||1||2|
The Cifar-100  dataset has 100 classes and each class contains 600 images at a resolution of 32 × 32 pixels. The 100 classes are further grouped into 20 Coarse Classes. Each Coarse Class has 5 Fine Classes, where  of them is(are) unseen. Thus we have a total . We picked training images randomly, and the rest are used for testing. In this dataset we use , as major semantic topics exist in the , i.e. mammals, size, trees, vehicles, food, household, insects, reptiles, people, and flowers. The dataset is challenging due to its limited resolution and so we only use pyramid levels for PHOG features, and codewords per tree in codebook learning.
Table IV shows that our proposed method with or without the HiC concept performed much better as compared to [23, 24]. Our approach also outperformed [23, 24] when . When , there is a total of 40 unseen when training the classifier. However, our approach was still able to achieve accuracy in comparison to  and  where in both approaches, (no unseen classes). In addition, the computational cost of our proposed method is lower, as we only employed a small number of training images.
Similar to the PubFig dataset, we also observed that when using fewer seen classes in the learning process, the accuracy drops. But, the accuracy differences between and only differ by a fraction of even when the difference number of is large (the total unseen class here is ). This indicates that our proposed method is robust as it is capable to handle the Cifar-100 dataset with very tiny () images that causes the collected features vector to be very similar. Besides, in comparison with the three different codebook learning strategies, the CoFi codebook method performs the best as it utilized both and , simultaneously in the RF tree node splitting.
|Number of Unseen Class,|
|without||with HiC concept||1||2||3||4||5|
|Coarse Class||Fine Class|
|household electrical devices||binoculars, boom-box, bread maker, calculator, cd, computer keyboard, computer monitor, computer mouse, floppy-disk, head-phones, iPod, joystick, laptop, light bulb, megaphone, microwave, palm-pilot, paper-shredder, PCI-card, photocopier, refrigerator, rotary-phone, toasters, treadmill, tripod, VCR, video-projector, washing machine|
|household furniture||bathtub, chandelier, chess-board, desk-globe, doorknob, ewer, flashlight, hammock, hot-tub, hourglass, mailbox, mattress, menorah, picnic table|
|large man-made outdoor things||Buddha, Eiffel-tower, golden-gate-bridge, light-house, minaret, pyramid, skyscraper, smokestack, teepee, tower-Pisa, windmill|
|medium mammals||dog, duck, elk, goat, goose, llama, minotaur, penguin, porcupine, raccoon, skunk, swan, unicorn, zebra, greyhound|
|vehicles||blimp, bulldozer, cannon, canoe, car-tire, covered-wagon, fighter-jet, fire-truck, helicopter, hot-air-balloon, kayak, ketch, license-plate, motorbikes, mountain-bike, pram, school-bus, segway, self-propelled-lawn-mower, snowmobile, speedboat, steering-wheel, touring-bike, tricycles, wheelbarrow, airplanes, car-side|
|household daily items||beer-mug, chopsticks, coffee-mug, knife, spoon, stained-glass, paperclip, paper-shredder, coins, dice, drinking-straw, dumb-bell, fire-extinguisher, frying-pan, ladder, pez-dispenser, playing-card, roulette-wheel, screwdriver, Swiss-army-knife, tweezer, umbrella|
|sports||baseball-bat, baseball-glove, basketball-hoop, billiards, bowling-ball, bowling-pin, boxing-glove, football-helmet, Frisbee, golf-ball, skateboard, soccer-ball, tennis-ball, tennis-court, tennis-racket, yo-yo|
|wears||cowboy-hat, diamond-ring, eyeglasses, football-helmet, necktie, sneaker, socks, top-hat, t-shirt, human-wear, welding-mask, yarmulke, tennis-shoes, saddle, stirrups|
|musical instruments||electric-guitar, French-horn, grand-piano, guitar-pick, harmonica, harp, harpsichord, mandolin, sheet-music, tambourine, tuning-fork, xylophone|
The Caltech-256 dataset consists of images grouped into 256 object classes and a background class. Unfortunately, it does not provide any Coarse Class concepts. Therefore, we manually group the classes into Coarse Classes similar to those of Cifar-100, except for some specific classes for which we introduce new Coarse Classes. Table VI shows the distribution of the selected Caltech-256 classes, with 5 existing Coarse Classes as in Cifar-100 and 4 newly introduced Coarse Classes. Only 158 of the total Caltech-256 classes are grouped because some object categories belong to a Coarse Class that had very few members. For this dataset, the total are .
Table V shows minor fluctuations compared to the PubFig and Cifar-100 results when different values are employed. For classification settings that have , interestingly, the proposed method performs better without applying the HiC concept. We found that this may be due to 1) in some are semantically related but have low visual similarity (i.e. in this context, the visual similarity is referring to the visual appearance of the object class), e.g. ‘computer keyboard’, ‘computer monitor’ and ‘computer mouse’, which belong to the = ‘household electrical devices’; 2) introducing tree to the codebook did not help in boosting the codebook discriminating power, which might be due to the low visual similarity among in some as well; and 3) the complexity of the objects in Caltech-256. However, the zero-shot learning still provides reasonable results. For this dataset we did not perform any comparisons as there are only 158 classes extracted.
IV-D Animals with Attributes (AwA)
AwA is an object dataset of animal classes with corresponding attributes attached to each class. There are a total of 50 animal classes and 85 attributes in the dataset. We use similar experimental settings as in [1, 13], where same features and partitions of seen (i.e. 40) and unseen (i.e. 10) classes were employed. To build the and relationships, we adopt the attributes relationships and pick the attributes that have , and have the lowest number of possible. As a result, we grouped 8 and each has 6 to 12 , as shown in Table VII.
|Coarse Class||Fine Class|
|hooves||antelope, horse, moose, ox, sheep, rhinoceros, giraffe, buffalo, zebra, deer, pig, cow|
|weak||Siamese cat, Persian cat, skunk, mole, sheep, hamster, rabbit, bat, chihuahua, mouse|
|grazer||antelope, horse, moose, spider monkey, elephant, ox, sheep, hamster, rhinoceros, rabbit, giraffe, buffalo, zebra, giant panda, deer, mouse, cow|
|stalker||grizzly bear, German shepherd, Siamese cat, tiger, leopard, fox, wolf, bobcat, lion, polar bear|
|flippers||killer whale, blue whale, humpback whale, seal, otter, walrus, dolphin|
|strainteeth||killer whale, beaver, blue whale, hippopotamus, humpback whale, walrus|
|hibernate||grizzly bear, beaver, skunk, mole, fox, hamster, squirrel, bat, rat, bobcat, mouse, polar bear, raccoon|
|bipedal||grizzly bear, spider monkey, gorilla, chimpanzee, squirrel, bat, giant panda, polar bear|
Herein, our proposed method achieved an accuracy of . This is better as compared to the DAP and IAP , which achieved and respectively; to M2LATM  that obtain ; and attribute/hierarchical label embedding (AHLE)  that achieved . We also show the confusion matrix of the 10 test classes in Figure 4. We observe that our proposed method has better average classification results compared to DAP and IAP . Though our proposed method does not predict the ‘humpback whale’ class as well as DAP and IAP, but we achieve better accuracy in the ‘giant panda’ class, which leads to better overall accuracy. These results benefit from the HiC concept defined for AwA, where ‘giant panda’ class is the only in . Note that the ‘humpback whale’ class resides in , and both contains more than one . Therefore we observe the accuracy drops in the ‘humpback whale’ class. The same situation applies to the ‘rat’ class and ‘raccoon’ class as they share the same .
In this paper, we evaluated our proposed method on four public datasets and achieved better performance than state-of-the-art methods for zero-shot learning. Even in the conventional classification problem, where training images for all object classes are available, we still obtain state-of-the-art accuracy on the PubFig and Cifar-100 datasets.
In the conducted experiments, there are some cases where the predicted is redundant. That is, if a lower number of is chosen, the numbers of possible will be reduced as well, and hence there is a possibility to obtain similar for different , which is a redundant representation. In order to handle this issue, we employ a large number of in the experiments to minimize the probability of to be redundant.
Based on the experiments on the Caltech-256 dataset, we are aware that the classification accuracy fluctuates with the quality of the Fine Class collection under each Coarse Class. Although the Fine Classes within a Coarse Class are grouped based on their semantic relationship, they might be visually dissimilar. This limitation is likely to be solved by introducing a middle-level class group that gathers the Fine Classes within a Coarse Class into groups of high visual similarity; e.g. the Fine Classes ‘head-phones’, ‘rotary-phone’ and ‘megaphone’ in the Coarse Class ‘household electrical devices’ can be grouped under a middle-level class ‘phones’. When seen classes are then picked to model an unseen class in this group, ‘head-phones’ and ‘rotary-phone’ will have priority as the related seen classes. Our future work therefore includes introducing a tighter relationship between the Fine Classes in the same Coarse Class so that better performance can be achieved.
-  C.H. Lampert, H. Nickisch, and S. Harmeling, “Learning to detect unseen object classes by between-class attribute transfer,” in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 951–958.
-  D. Parikh and K. Grauman, “Relative attributes,” in IEEE International Conference on Computer Vision, 2011, pp. 503–510.
-  N. Kumar, A.C. Berg, P.N. Belhumeur, and S.K. Nayar, “Attribute and simile classifiers for face verification,” in IEEE International Conference on Computer Vision, 2009, pp. 365–372.
-  M. Rohrbach, M. Stark, and B. Schiele, “Evaluating knowledge transfer and zero-shot learning in a large-scale setting,” in IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 1641–1648.
-  A. Frome, G.S. Corrado, J. Shlens, S. Bengio, J. Dean, M.A. Ranzato, and T. Mikolov, “DeViSE: A deep visual-semantic embedding model,” in Advances in Neural Information Processing Systems, 2013, pp. 2121–2129.
-  T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka, “Metric learning for large scale image classification: Generalizing to new classes at near-zero cost,” in European Conference on Computer Vision, pp. 488–501. Springer, 2012.
-  W.L. Hoo and C.S. Chan, “Plsa-based zero-shot learning,” in 20th IEEE International Conference on Image Processing, 2013, pp. 4297–4301.
-  L. Fei Fei and P. Perona, “A bayesian hierarchical model for learning natural scene categories,” in IEEE Conference on Computer Vision and Pattern Recognition, 2005, vol. 2, pp. 524–531.
-  G. Griffin, A. Holub, and P. Perona, “Caltech-256 object category dataset,” Caltech Technical Report, No. CNS-TR-2007-001., 2007.
-  A. Bosch, A. Zisserman, and X. Munoz, “Image classification using random forests and ferns,” in IEEE International Conference on Computer Vision, 2007, pp. 1–8.
-  F. Moosmann, E. Nowak, and F. Jurie, “Randomized clustering forests for image classification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, pp. 1632–1646, 2008.
-  V. Ferrari and A. Zisserman, “Learning visual attributes,” in Advances in Neural Information Processing Systems, 2007, pp. 433–440.
-  C.H. Lampert, H. Nickisch, and S. Harmeling, “Attribute-based classification for zero-shot visual object categorization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 3, pp. 453–465, 2014.
-  Y. Fu, T.M. Hospedales, T. Xiang, and S. Gong, “Attribute learning for understanding unstructured social activity,” in European Conference on Computer Vision, pp. 530–543. Springer, 2012.
-  Y. Fu, T.M. Hospedales, T. Xiang, and S. Gong, “Learning multimodal latent attributes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 2, pp. 303–316, 2014.
-  M. Guillaumin, J. Verbeek, and C. Schmid, “Multimodal semi-supervised learning for image classification,” in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 902–909.
-  M. Palatucci, D. Pomerleau, G. Hinton, and T. Mitchell, “Zero-shot learning with semantic output codes,” in Advances in Neural Information Processing Systems, 2009, pp. 1410–1418.
-  C. Silberer, V. Ferrari, and M. Lapata, “Models of semantic representation with visual attributes,” in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, 2013, pp. 572–582.
-  W.L. Hoo, T-K Kim, Y. Pei, and C.S. Chan, “Enhanced random forest with image/patch-level learning for image understanding,” in Proceedings of the 22nd International Conference on Pattern Recognition, 2014, pp. 3434–3439.
-  J. Sivic, B. Russell, A Efros, A. Zisserman, and W. Freeman, “Discovering objects and their location in images,” in IEEE Conference on Computer Vision and Pattern Recognition, 2005, pp. 370–377.
-  A. Krizhevsky, “Learning multiple layers of features from tiny images,” Tech. Rep., 2009.
-  A. Bosch, A. Zisserman, and X. Munoz, “Representing shape with a spatial pyramid kernel,” in ACM International Conference on Image and Video Retrieval, 2007, pp. 401–408.
-  I. Goodfellow, A. Courville, and Y. Bengio, “Large-scale feature learning with spike-and-slab sparse coding,” in International Conference on Machine Learning, 2012, pp. 1439–1446.
-  Y. Jia, C. Huang, and T. Darrell, “Beyond spatial pyramids: Receptive field learning for pooled image features,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 3370–3377.
-  Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid, “Label-embedding for attribute-based classification,” in IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 819–826.