1 Introduction
Sketch-based image retrieval (SBIR) has drawn increasing attention in the past decade, especially with the prevalence of touchscreens. There exist many annotated datasets Yu et al. (????); Eitz et al. (2010); Hu and Collomosse (2013); Ouyang et al. (2014); Li et al. (????); Sangkloy et al. (2016) and methods tackling all aspects of the problem Li et al. (????); Yu et al. (????); Li et al. (????); Yu et al. (2016); Xu et al. (????); Song et al. (????). The vibrancy of the SBIR area has also promoted the development of related research problems, such as sketch recognition Zhang et al. (2015); Li et al. (2015), sketch synthesis Gao et al. (2008); Xiao et al. (2010); Li et al. (2016), sketch-based 3D retrieval Su et al. (????), and sketch segmentation Qi et al. (????). From a technical perspective, SBIR is traditionally cast as a classification task, with most prior work evaluating retrieval performance at category level Li et al. (2015); Cao et al. (????a, ????b); Qi et al. (2015); Wang et al. (????). More recently, fine-grained variants of SBIR Li et al. (????); Yu et al. (????) require retrieval to be conducted within single object categories. Under this more constrained ranking setting, researchers no longer carry out similarity matching based only on low-level, hand-designed visual features Eitz et al. (2010); Hu and Collomosse (2013); Shrivastava et al. (2011), but begin to delve into high-level and partial information for sketch-photo matching, e.g., local stroke ordering Yu et al. (2016, ????) and part-level attributes Ouyang et al. (2014); Li et al. (????); Xu et al. (????).
Despite great strides made, most prior work ignores the cross-modal gap that inherently exists between sketch and photo, treating images as edge-maps (semi-sketches) Li et al. (????); Qi et al. (????); Yu et al. (????); Song et al. (????). This assumption works well when the retrieval system is presented with good-quality sketches that are close to contour tracings of the intended objects, but not with free-hand sketches, which are much more abstract and do not closely resemble natural objects. However, effectively bridging the sketch-photo cross-modal gap is non-trivial: (i) A sketch can only capture limited shape and contour information; it uses coarse lines to describe the key features of an object at an abstract, semantic level. As shown in Fig. 1, a pyramid can be denoted as a triangle in sketch form. (ii) Different people have different observations, past experiences, drawing styles, and drawing skills Li et al. (????); Sangkloy et al. (2016). Fig. 1 shows that sketches drawn by different people for the same cat or shoe can be highly diverse. This naturally motivates us to apply cross-modal matching methods to the SBIR problem. However, to the best of our knowledge, all previous cross-modal works Rasiwasia et al. (????); Sharma et al. (????); Zhen and Yeung (????); Masci et al. (2014); Xu et al. (2016); Yao et al. (2016) were designed to address the image-text modal gap (e.g., the Wikipedia image-text dataset Rasiwasia et al. (????), the Pascal VOC dataset Hwang and Grauman (2012), NUS-WIDE Chua et al. (????), and LabelMe Oliva and Torralba (2001)). Therefore, their general applicability to SBIR remains unclear. The main approaches behind existing image-text cross-modal research can be roughly categorized into pairwise modeling Zhen and Yeung (????); Quadrianto and Lampert (????); Zhai et al. (????); Wang et al. (????), ranking Grangier and Bengio (2008); Weston et al. (????); Huang et al. (2016), mapping Hardoon et al. (2004); Wang et al. (????a, ????b), and graph embedding Sharma et al. (????); Wang et al. (????b); Li et al. (2016); Wang et al. (2016).
In particular, probabilistic models Putthividhy et al. (????); Jia et al. (????), metric learning approaches Wu et al. (2010); Mignon and Jurie (????); Zhu et al. (????); Liu et al. (2013), and subspace learning methods Hardoon et al. (2004); Kim et al. (2007) have proven effective across a number of datasets. Probabilistic approaches learn the multimodal correlation by modeling the joint multimodal data distribution Putthividhy et al. (????). Metric learning methods learn and compute appropriate distance metrics between different modalities Wu et al. (2010). Subspace learning constructs a common subspace and maps multimodal data into it to conduct cross-modal matching Wang et al. (????a). Among these techniques, cross-modal subspace learning methods have achieved state-of-the-art results in recent years Rasiwasia et al. (????); Wang et al. (????a, 2016); Costa Pereira et al. (2014); Sharma and Jacobs (????), borrowing much inspiration from conventional subspace approaches Liu et al. (????); Liu and Yan (????); Liu et al. (2013); Zhu et al. (????); Chang et al. (????a, ????b, 2016). For a comprehensive survey, please refer to Wang et al. (2016); Liu et al. (2010).
In this paper, we focus on analyzing the interaction and relationship between sketch and photo in the cross-modal setting. The main contributions of this paper are twofold:

We conduct a detailed comparative analysis of the general applicability of cross-modal techniques to matching sketches and photos.
The remaining parts of this paper are organized as follows. Section 2 briefly presents some state-of-the-art cross-modal subspace learning methods and their corresponding characteristics. In Section 3, we report and analyze their experimental performance on the SBIR task. Potential future research insights for SBIR are discussed in Section 4. Finally, conclusions are drawn in Section 5.
A preliminary version of this work has been presented in Xu et al. (????). The main extensions are:

Extensive experiments are performed on one extra recently released fine-grained SBIR dataset (i.e., the chair SBIR dataset).

We simultaneously evaluate the performance of these methods on SBIR tasks at both subcategory level and instance level.
2 Cross-modal Subspace Learning
In this section, we briefly survey some state-of-the-art cross-modal subspace learning methods designed for image and text. All these methods share the same notation. Suppose that sample matrices X = [x_1, ..., x_n] and Y = [y_1, ..., y_n] are extracted from modality M_X and modality M_Y, respectively. These multimodal samples can be categorized into c classes. Samples and the corresponding class labels are denoted as (x_i, y_i) and l_i, where each pair (x_i, y_i) represents the same object or content belonging to the same class. L denotes the class label matrix for the multimodal data. The transform for the i-th sample of modality M_X is denoted as W_X^T x_i. Similarly, the transform for the i-th sample of modality M_Y is denoted as W_Y^T y_i. Throughout this paper, vectors and matrices are denoted as straight bold lowercase and uppercase letters, respectively.

These cross-modal subspace learning methods share a common workflow: in the training phase, they learn a projection matrix for each modality that projects the data from the different modalities into a common, comparable subspace. In the test phase, the test samples from one modality are taken as the query set to retrieve matched samples from the other modality. In this paper, W_X and W_Y denote the projection matrices for modality M_X and modality M_Y, respectively.
2.1 Canonical Correlation Analysis (CCA)
CCA Rasiwasia et al. (????); Hardoon et al. (2004); Kim et al. (2007)
is an effective multivariate statistical analysis approach, which is analogous to principal component analysis (PCA)
Jolliffe (2002). It was originally designed for data correlation modeling and dimension reduction. Recently, CCA has been widely applied in multimodal data fusion and cross-media retrieval Rasiwasia et al. (????); Costa Pereira et al. (2014); Gong et al. (2014); Ranjan et al. (????), and it has become one of the most popular unsupervised cross-modal subspace learning methods due to its generalization capability.

CCA learns a set of canonical component pairs for X and Y, i.e., directions w_X and w_Y along which the multimodal data are maximally correlated Rasiwasia et al. (????):
\max_{\mathbf{w}_X, \mathbf{w}_Y} \frac{\mathbf{w}_X^{\top} \Sigma_{XY} \mathbf{w}_Y}{\sqrt{\mathbf{w}_X^{\top} \Sigma_{XX} \mathbf{w}_X} \sqrt{\mathbf{w}_Y^{\top} \Sigma_{YY} \mathbf{w}_Y}}    (1)
where Σ_XX and Σ_YY denote the empirical covariance matrices for modality M_X and modality M_Y, respectively, and Σ_XY represents the cross-covariance matrix between the two modalities. By repeatedly solving (1), we can obtain a series of canonical component pairs. We can choose the first d canonical component pairs for projecting X and Y into two d-dimensional subspaces. Here, d is a hyperparameter. The optimization objective of (1) can be solved as a generalized eigenvalue problem (GEV) Ramsay (2006).

2.2 Partial Least Squares (PLS)
PLS Rosipal and Krämer (2006)
can linearly map multimodal data into a linear subspace that preserves the data correlation, and can be adopted to solve cross-modal matching in many multimodal scenarios. PLS has been effectively applied in face recognition and multimedia retrieval with different motivations Sharma and Jacobs (????); Baek and Kim (2004); Dhanjal et al. (2009); Li et al. (????); Štruc and Pavešić (2009); Schwartz et al. (????); Chen et al. (????).

PLS models X and Y such that Sharma and Jacobs (????)
X^{\top} = T P^{\top} + E, \qquad Y^{\top} = U Q^{\top} + F, \qquad U = T D + H    (2)
where T and U contain the extracted PLS scores or latent projections, P and Q are the matrices of loadings, and E, F, and H are the residual matrices. D is a diagonal matrix relating the latent scores of X and Y.
PLS learns the basis vectors w_X and w_Y such that the covariance between the score vectors t and u (columns of T and U) is maximized:
[\mathbf{w}_X, \mathbf{w}_Y] = \arg\max_{\|\mathbf{w}_X\| = \|\mathbf{w}_Y\| = 1} \operatorname{cov}(X^{\top}\mathbf{w}_X, Y^{\top}\mathbf{w}_Y)    (3)
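With samples stored as columns, the first PLS basis pair can be read off from the leading singular vectors of the cross-covariance matrix. The numpy sketch below uses synthetic data with one shared latent factor so that the two modalities actually correlate; the latent-factor construction and all names are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, dx, dy = 200, 20, 15

# Paired, centered data sharing one latent factor z so the modalities correlate.
z = rng.standard_normal(n)
X = np.outer(rng.standard_normal(dx), z) + 0.1 * rng.standard_normal((dx, n))
Y = np.outer(rng.standard_normal(dy), z) + 0.1 * rng.standard_normal((dy, n))
X -= X.mean(axis=1, keepdims=True)
Y -= Y.mean(axis=1, keepdims=True)

# The leading singular vectors of the cross-covariance maximize
# cov(X^T w_x, Y^T w_y) under unit-norm constraints, as in Eq. (3).
U, S, Vt = np.linalg.svd(X @ Y.T / (n - 1))
w_x, w_y = U[:, 0], Vt[0, :]

scores_x, scores_y = X.T @ w_x, Y.T @ w_y
corr = np.corrcoef(scores_x, scores_y)[0, 1]
print(round(abs(corr), 2))
```

Subsequent basis pairs are obtained from the remaining singular vectors (or by deflating the data and repeating).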
2.3 Generalized Multiview Analysis (GMA)
GMA Sharma et al. (????)
is a special multiview framework, which can be solved efficiently as a generalized eigenvalue problem. As we will show in this section, many popular supervised and unsupervised feature extraction techniques can be derived based on GMA.
The constrained objective is
\max_{\mathbf{w}_X, \mathbf{w}_Y} \; \mathbf{w}_X^{\top} A_X \mathbf{w}_X + \mu\, \mathbf{w}_Y^{\top} A_Y \mathbf{w}_Y + 2\alpha\, \mathbf{w}_X^{\top} X Y^{\top} \mathbf{w}_Y \quad \text{s.t.} \quad \mathbf{w}_X^{\top} B_X \mathbf{w}_X + \gamma\, \mathbf{w}_Y^{\top} B_Y \mathbf{w}_Y = 1    (4)
where w_X and w_Y denote the projection directions, and the positive terms μ, α, and γ are hyperparameters controlling the balance among the objectives.
A_X (resp. A_Y) is the between-class variance matrix, while B_X (resp. B_Y) is the within-class covariance matrix. Sharma et al. (????) have shown that by substituting particular expressions for these matrices in (4), we obtain the corresponding objective functions of different methods.

2.3.1 Bilinear Model (BLM)
In (4), with suitable choices of the variance matrices and hyperparameters, we obtain BLM under the GMA framework Sharma et al. (????).
2.3.2 Generalized Multiview Linear Discriminant Analysis (GMLDA)
We can set A_X and A_Y to the between-class scatter matrices and B_X and B_Y to the within-class scatter matrices. Here, each data matrix is substituted by its class mean matrix.
2.3.3 Generalized Multiview Margin Fisher Analysis (GMMFA)
Under the GMA framework, the expression for the multiview version of MFA is more complex: it utilizes graph construction to constrain the projected data. More details can be found in Sharma et al. (????).
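GMA (like CCA) ultimately reduces to a generalized eigenvalue problem A v = λ B v. The following numpy-only solver, via the Cholesky factor of B, is a standard textbook reduction shown here only to make the GEV step concrete; the matrices are random toy inputs:

```python
import numpy as np

def generalized_eigh(A, B):
    """Solve A v = lambda B v for symmetric A and SPD B via Cholesky."""
    L = np.linalg.cholesky(B)          # B = L L^T
    Linv = np.linalg.inv(L)
    C = Linv @ A @ Linv.T              # reduced standard symmetric problem
    w, V = np.linalg.eigh(C)           # eigenvalues ascending
    return w, Linv.T @ V               # map eigenvectors back: v = L^{-T} u

rng = np.random.default_rng(4)
M = rng.standard_normal((6, 6))
A = M @ M.T                            # symmetric positive semidefinite
B = 2.0 * np.eye(6)                    # symmetric positive definite
w, V = generalized_eigh(A, B)

# Check the defining relation on the top eigenpair.
resid = np.linalg.norm(A @ V[:, -1] - w[-1] * B @ V[:, -1])
print(resid < 1e-8)
```

In practice one would use a library GEV routine; this sketch just makes explicit what "solved as a generalized eigenvalue problem" means.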
2.4 Common Discriminant Feature Extraction (CDFE)
Lin and Tang (????) proposed the CDFE method for subspace learning based on empirical separability and local consistency. The empirical separability ensures intra-class compactness and inter-class dispersion, which are measured respectively as follows Lin and Tang (????):
J_{\text{intra}} = \frac{1}{N_I} \sum_{l_i = l_j} \left\| W_X^{\top}\mathbf{x}_i - W_Y^{\top}\mathbf{y}_j \right\|^2    (5)
J_{\text{inter}} = \frac{1}{N_E} \sum_{l_i \neq l_j} \left\| W_X^{\top}\mathbf{x}_i - W_Y^{\top}\mathbf{y}_j \right\|^2    (6)
where N_I and N_E are the numbers of sample pairs from the same class and from different classes, respectively.
As shown in Fig. 2, the empirical separability can be defined as:
J_{\text{sep}} = J_{\text{intra}} - \beta\, J_{\text{inter}}    (7)
where β is a trade-off hyperparameter. To prevent overfitting, local consistency can be used to regularize the empirical separability. The objective function of CDFE can be formulated as follows:
\min_{W_X, W_Y} \; J_{\text{sep}} + \eta\, J_{\text{loc}}    (8)
where η is a hyperparameter adjusting the trade-off between the two objectives, and J_loc represents the local consistency objective. More details can be found in Lin and Tang (????).
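The two separability terms are easy to compute directly. The toy sketch below (random projected features and labels; all names are our own) evaluates the mean same-class and different-class pair distances and combines them with a trade-off weight as in (7):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy projected samples from the two modalities, plus shared class labels.
Px = rng.standard_normal((5, 30))      # W_X^T x_i for 30 samples
Py = rng.standard_normal((5, 30))      # W_Y^T y_j for 30 samples
labels = rng.integers(0, 3, size=30)

# All cross-modal pairwise squared distances, shape (30, 30).
d2 = ((Px[:, :, None] - Py[:, None, :]) ** 2).sum(axis=0)

same = labels[:, None] == labels[None, :]
J_intra = d2[same].mean()              # same-class pairs: to be minimized
J_inter = d2[~same].mean()             # different-class pairs: to be maximized

beta = 0.5                             # trade-off weight (illustrative value)
J_sep = J_intra - beta * J_inter       # empirical separability, Eq. (7) style
print(np.isfinite(J_sep))
```

CDFE then minimizes this quantity (plus the local consistency regularizer) over the projection matrices themselves.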
2.5 Three-view Canonical Correlation Analysis (CCA3V)
The objective function of CCA3V has three pairwise correlation terms Gong et al. (2014):
\max_{W_1, W_2, W_3} \; \sum_{i<j} \operatorname{tr}\!\left( W_i^{\top} \Sigma_{ij} W_j \right) \quad \text{s.t.} \quad W_i^{\top} \Sigma_{ii} W_i = I, \; i = 1, 2, 3    (9)
Optimizing this function captures the latent correlation among three views or modalities. Moreover, for cross-modal matching, high-level semantic information can be utilized as the third view Gong et al. (2014); if we put the ground-truth labels into the third view, CCA3V becomes a supervised method. As shown in Fig. 3, compared with conventional CCA, CCA3V constructs a semantic embedding subspace to improve performance. CCA3V aligns the corresponding multimodal sample pairs not only by referring to the data distribution but also by following the guidance of the high-level semantics: multimodal samples belonging to the same semantic cluster are forced to be close to each other.
2.6 Learning Coupled Feature Spaces for Cross-modal Matching (LCFS)
Many earlier studies have demonstrated two properties: (i) the l21 norm performs well in feature selection Gu et al. (????); He et al. (????); Nie et al. (????), and (ii) the trace norm can capture the correlation of projected data in the common subspace. Integrating the properties of the l21 norm and the trace norm, Wang et al. (????a) proposed a model of the following form:
\min_{U_1, U_2} \; \left\| U_1^{\top} X - L \right\|_F^2 + \left\| U_2^{\top} Y - L \right\|_F^2 + \lambda_1 \left( \|U_1\|_{21} + \|U_2\|_{21} \right) + \lambda_2 \left\| \left[ U_1^{\top} X ;\; U_2^{\top} Y \right] \right\|_{*}    (10)
where U_1 and U_2 are the projection matrices for the coupled modality M_X and modality M_Y, respectively. The first term is a coupled linear regression used to learn two projection matrices that map the multimodal data into a common subspace defined by the label information. The second term, containing l21 norms, conducts feature selection on the two feature spaces simultaneously. The trace norm can enhance the relevance of the projected data with connections inside the subspace.

2.7 Joint Feature Selection and Subspace Learning for Cross-modal Retrieval (JFSSL)
JFSSL Wang et al. (2016) is an extension of LCFS Wang et al. (????a). Its objective function is a generic minimization problem over data objects from different modalities:
\min_{U_1, \dots, U_m} \; \sum_{p=1}^{m} \left\| U_p^{\top} X_p - L \right\|_F^2 + \lambda_1 \sum_{p=1}^{m} \| U_p \|_{21} + \lambda_2\, \Omega(U_1, \dots, U_m)    (11)
where U_p denotes the projection matrix for the p-th modality. The roles of the first and second terms are the same as in LCFS. The third term, Ω(·), is a multimodal graph regularization reinforcing intra-modality and inter-modality similarity. Similar to the empirical separability term of the CDFE objective, this multimodal graph regularization preserves intra-modality compactness and inter-modality dispersion.
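The l21 norm shared by LCFS and JFSSL is simply the sum of row-wise l2 norms, so shrinking it zeroes out whole rows of a projection matrix, i.e., deselects input features. A quick illustration on toy matrices:

```python
import numpy as np

def l21_norm(U):
    """||U||_{2,1}: sum of the l2 norms of the rows of U."""
    return np.linalg.norm(U, axis=1).sum()

# A row of U weights one input feature across all subspace dimensions, so a
# zeroed row means that feature is discarded (feature selection).
dense = np.ones((4, 3))
sparse = dense.copy()
sparse[2] = 0.0                       # "deselect" feature 2

print(l21_norm(dense), l21_norm(sparse))
```

Because the penalty drops by a whole row norm whenever a row is zeroed (here from 4√3 to 3√3), minimizing it encourages exactly this row-sparse, feature-selecting structure.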
3 Experimental Results and Discussions
3.1 Datasets
In this section, we apply the aforementioned cross-modal subspace learning methods to two recently released fine-grained sketch-based image retrieval datasets Yu et al. (????); Li et al. (????). Each photo has a corresponding free-hand sketch; that is, each sketch sample has a ground-truth photo counterpart, as shown in Fig. 4.
The shoe dataset consists of photo-sketch pairs that can be categorized into three subclasses; all sample pairs are single-labeled. The chair dataset contains photo-sketch pairs that can be divided into six subclasses. The sketches were drawn by non-experts using their fingers on touch screens; as a result, they are abstract enough to depart considerably from the photo modality.
3.2 Experimental Settings
These cross-modal subspace learning methods (i.e., CCA, PLS, BLM, GMLDA, GMMFA, CDFE, CCA3V, LCFS, JFSSL) were applied to the shoe and chair datasets for two SBIR retrieval tasks: (1) photos querying sketches and (2) sketches querying photos. The experimental results contain randomness due to the limited numbers of samples in the two datasets. To remove the effect of randomness, we repeated each model on each setting 50 times. On the shoe dataset, for each evaluation we randomly chose sample pairs as the training set and treated the remaining pairs as the test set. On the chair dataset, the ratio of training to test data was kept fixed.
In the training phase, we input photo and sketch features into these cross-modal subspace learning methods to learn a projection matrix for each modality. After training, we used the projection matrices to map the photo and sketch test samples into a common subspace. The cosine distance was adopted to measure the similarity between the projected photos and sketches. Given a photo (or sketch) query, the goal of each SBIR task is to find the nearest neighbors (NN) in the sketch (or photo) database.
In all the following experiments, we used Histogram of Oriented Gradients (HOG) features to describe the photos and sketches. In order to evaluate the performance of these methods at different scales, two kinds of metrics were adopted. The mean average precision (MAP) Rasiwasia et al. (????) was used to evaluate performance at the semi-fine-grained level: a retrieval is judged correct by MAP as long as the retrieved sample and the query sample share the same subclass label. A second metric, top-K accuracy Yu et al. (????); Li et al. (????), was used for fine-grained evaluation at the instance level; it is the percentage of queries whose corresponding photos or sketches are ranked in the top K results.
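Both metrics can be implemented in a few lines; the versions below are our own minimal reference implementations with toy inputs, not code from the cited papers:

```python
import numpy as np

def average_precision(ranked_labels, query_label):
    """AP for one query, given class labels of the ranked results."""
    hits = np.asarray(ranked_labels) == query_label
    if not hits.any():
        return 0.0
    precision_at = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    return precision_at[hits].mean()   # average precision over hit positions

def top_k_accuracy(ranked_ids, true_ids, k):
    """Fraction of queries whose true counterpart appears in the top k."""
    return float(np.mean([t in r[:k] for r, t in zip(ranked_ids, true_ids)]))

# Hits at ranks 1, 3, 4 -> AP = (1/1 + 2/3 + 3/4) / 3 ≈ 0.8056
ap = average_precision([1, 0, 1, 1], query_label=1)
print(round(ap, 4))  # 0.8056

# Query 1 misses its true item (id 2) in the top 2; query 2 finds it -> 0.5
acc = top_k_accuracy([[3, 1, 2], [0, 2, 1]], true_ids=[2, 2], k=2)
print(acc)  # 0.5
```

MAP is then the mean of `average_precision` over all queries.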
3.3 Results on Shoe Dataset
3.3.1 Evaluation by MAP
The MAP scores of the different cross-modal subspace learning methods on the shoe dataset are reported in Table 1, together with the minimum (min), maximum (max), mean, variance (var), and standard deviation (std) for each method.
Wang et al. (????a, 2016) have shown that CCA, PLS, BLM, GMLDA, GMMFA, CDFE, and CCA3V are incapable of feature selection, so we utilize Principal Component Analysis (PCA) to remove redundancy in the input features for these seven methods. To validate the feature selection abilities of LCFS and JFSSL, their results without PCA preprocessing are also reported in Table 1. It can be observed that LCFS and JFSSL outperform the remaining methods for both photo querying sketch and sketch querying photo on the shoe dataset. This is because LCFS and JFSSL can select discriminative and effective features from the different modalities while simultaneously learning the common subspace.
In terms of performance, GMLDA and GMMFA are close to LCFS and JFSSL; the gaps between them are not as obvious as those observed for image-text matching, owing to the inherent data difference between sketch and text. GMLDA and GMMFA are superior to CDFE and CCA3V. Among these methods, CCA performs the worst, while its supervised enhanced version CCA3V achieves good performance. PLS and BLM are slightly better than CCA for both photo querying sketch and sketch querying photo.
The overall trend in Table 1 can be summarized as supervised methods outperforming unsupervised ones. This trend can also be explained by Fig. 9, which shows the differences between these methods. CCA, PLS, and BLM use only pairwise information to build the common subspace, as shown in Fig. 9(c). Fig. 9(d) illustrates that GMLDA, GMMFA, and CCA3V can take advantage of class label information as well as the pairwise relationship to construct preferable inter-class separation in the common subspace. CDFE mainly attempts to keep the intra-class and inter-class structures in a subspace. LCFS and JFSSL are devoted to minimizing subcategory-based residuals; however, their graph embedding technologies can only improve intra-class compactness and inter-class dispersion. CDFE, LCFS, and JFSSL cannot thoroughly capture the pairwise relationship. In contrast to Fig. 9(e), the sample pairs in Fig. 9(d) are connected by dashed lines to visualize the pairwise connections.
These phenomena are consistent with the results presented in Wang et al. (2016), where it was discussed and verified that JFSSL can utilize the multimodal graph embedding constraint to obtain performance improvements over LCFS. However, in our experiments on the shoe dataset, their performances are almost the same, because the graph embedding constraint of JFSSL cannot fully play its role on this sketch dataset.
All the experimental results in Table 1 are also visualized as boxplots in Fig. 5. The box range of each method shows its performance stability on the SBIR tasks. We can conclude that these cross-modal subspace learning methods have similar stability for SBIR on the shoe dataset.
All the samples extracted from the same dataset follow the same underlying data distribution. Each method has its own unique principle and can be regarded as a system. Theoretically, the experimental results of a method will also follow a certain latent distribution when it is repeated on the same dataset. Thus we can judge the performances of the aforementioned methods on the shoe dataset to be fundamentally different only when their experimental result distributions do not belong to the same distribution.
According to Table 1 and Fig. 5, we draw a preliminary conclusion that LCFS is the best of these methods on the shoe dataset for both photo querying sketch and sketch querying photo. To verify whether LCFS is fundamentally superior to the other methods, we conducted Student's t-tests between LCFS and each other method, as shown in Table 3. The null hypothesis is that the two results have similar means with unknown variance. We observe that LCFS and JFSSL have the same output MAP distribution for the photo-querying-sketch task regardless of whether their input features are preprocessed by PCA, whereas their output MAP distributions for the sketch-querying-photo task differ. In all other cases, LCFS is statistically different from the others. Based on these observations, we conclude that the performance of LCFS for the subcategory-level SBIR tasks on the shoe dataset is essentially different from that of CCA, PLS, BLM, GMLDA, GMMFA, CDFE, and CCA3V.
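The t-statistic itself needs only numpy. The sketch below simulates two sets of 50 repeated MAP scores (the means and spread are made up for illustration) and computes the equal-variance two-sample statistic; |t| above roughly 1.98 rejects the equal-means null at the 5% level for 98 degrees of freedom:

```python
import numpy as np

def two_sample_t(a, b):
    """Student's t statistic for the equal-variance two-sample test."""
    na, nb = len(a), len(b)
    # Pooled variance estimate under the equal-variance assumption.
    sp2 = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(sp2 * (1 / na + 1 / nb))

rng = np.random.default_rng(3)
map_a = rng.normal(0.78, 0.025, size=50)   # 50 repeated MAP scores, method A
map_b = rng.normal(0.76, 0.025, size=50)   # 50 repeated MAP scores, method B

t = two_sample_t(map_a, map_b)
print(round(t, 3))
```

In practice a statistics library routine would also return the p-value; this sketch only shows where the statistic in Tables 3 and 4 comes from.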
Table 1: MAP scores on the shoe dataset (left five columns: photo queries sketch; right five columns: sketch queries photo).

Method  min  max  mean  var  std  min  max  mean  var  std
PCA+CCA  0.5442  0.6442  0.5836  0.0007  0.0264  0.5474  0.6415  0.5868  0.0006  0.0255 
PCA+PLS  0.5712  0.6687  0.6169  0.0005  0.0218  0.5795  0.6649  0.6187  0.0004  0.0205 
PCA+BLM  0.5805  0.6762  0.6272  0.0005  0.0217  0.5900  0.6755  0.6294  0.0004  0.0206 
PCA+GMLDA  0.6943  0.8103  0.7542  0.0007  0.0267  0.7244  0.8213  0.7712  0.0006  0.0253 
PCA+GMMFA  0.7000  0.8111  0.7577  0.0006  0.0248  0.7317  0.8199  0.7733  0.0005  0.0227 
PCA+CDFE  0.6696  0.8024  0.7302  0.0008  0.0277  0.6755  0.8268  0.7559  0.0007  0.0271 
PCA+CCA3V  0.6339  0.7284  0.6837  0.0005  0.0219  0.6494  0.7261  0.6930  0.0004  0.0191 
PCA+LCFS  0.7229  0.8365  0.7705  0.0007  0.0255  0.7079  0.8473  0.7745  0.0006  0.0244 
LCFS  0.7236  0.8480  0.7798  0.0007  0.0258  0.7475  0.8518  0.8014  0.0006  0.0237 
PCA+JFSSL  0.7211  0.8355  0.7700  0.0006  0.0254  0.7067  0.8457  0.7748  0.0006  0.0242 
JFSSL  0.7080  0.8222  0.7619  0.0006  0.0249  0.7016  0.8253  0.7632  0.0006  0.0244 
Table 2: MAP scores on the chair dataset (left five columns: photo queries sketch; right five columns: sketch queries photo).

Method  min  max  mean  var  std  min  max  mean  var  std
PCA+CCA  0.4973  0.6407  0.5558  0.0012  0.0347  0.4998  0.6400  0.5588  0.0011  0.0334 
PCA+PLS  0.5477  0.6557  0.5948  0.0008  0.0279  0.5557  0.6585  0.5998  0.0007  0.0273 
PCA+BLM  0.5435  0.6549  0.5942  0.0007  0.0260  0.5541  0.6507  0.5987  0.0006  0.0241 
PCA+GMLDA  0.6469  0.7798  0.7110  0.0010  0.0311  0.6426  0.7826  0.7077  0.0010  0.0321 
PCA+GMMFA  0.6473  0.7852  0.7094  0.0011  0.0331  0.6341  0.7892  0.7053  0.0011  0.0336 
PCA+CDFE  0.5638  0.7393  0.6585  0.0011  0.0328  0.5603  0.7400  0.6637  0.0009  0.0293 
PCA+CCA3V  0.5457  0.6692  0.6040  0.0007  0.0265  0.5585  0.6765  0.6127  0.0006  0.0254 
PCA+LCFS  0.6491  0.7993  0.7139  0.0013  0.0357  0.6292  0.8043  0.7046  0.0013  0.0360 
LCFS  0.6302  0.7804  0.7120  0.0012  0.0351  0.6227  0.7819  0.7049  0.0013  0.0363 
PCA+JFSSL  0.6504  0.8004  0.7137  0.0013  0.0356  0.6284  0.8036  0.7043  0.0013  0.0359 
JFSSL  0.6339  0.7949  0.7119  0.0013  0.0357  0.6283  0.7833  0.7045  0.0013  0.0365 
Table 3: p-values of Student's t-tests between LCFS and the other methods on the shoe dataset.

Method  PCA+CCA  PCA+PLS  PCA+BLM  PCA+GMLDA  PCA+GMMFA  PCA+CDFE  PCA+CCA3V  PCA+LCFS  PCA+JFSSL  JFSSL
Photo Query  5.1e-60  3.9e-56  1.2e-53  4.0e-06  3.2e-05  4.8e-15  1.7e-36  0.0728  0.0578  0.0006
Sketch Query  5.2e-66  9.4e-64  3.0e-61  1.5e-08  2.8e-08  2.4e-14  1.3e-44  2.0e-07  2.4e-07  3.5e-12
Average Value  2.5e-60  1.9e-56  6.1e-54  2.0e-06  1.6e-05  1.4e-14  8.9e-37  0.0364  0.0289  0.0003
Table 4: p-values of Student's t-tests between LCFS and the other methods on the chair dataset.

Method  PCA+CCA  PCA+PLS  PCA+BLM  PCA+GMLDA  PCA+GMMFA  PCA+CDFE  PCA+CCA3V  PCA+LCFS  PCA+JFSSL  JFSSL
Photo Query  2.7e-40  1.0e-33  9.9e-35  0.8698  0.6971  4.4e-12  1.0e-31  0.7965  0.8117  0.9825
Sketch Query  5.9e-38  8.8e-30  2.1e-31  0.6830  0.9536  1.0e-08  1.4e-26  0.9633  0.9309  0.9516
Average Value  2.9e-38  4.4e-30  1.0e-31  0.7764  0.8254  5.3e-09  7.2e-27  0.8799  0.8713  0.9670
Fig. 5: Boxplots of MAP scores achieved by different cross-modal subspace learning methods on the shoe dataset. The inputs of all methods are preprocessed by PCA except for those marked with an asterisk. The top and bottom edges of each box are the 75th and 25th percentiles, respectively, and outliers are marked individually as red crosses.
3.3.2 Evaluation by Top-K Accuracy
The shoe and chair datasets are fine-grained SBIR datasets in which each sketch sample has a photo sample as its instance-level counterpart. Hence we can also evaluate these cross-modal subspace learning methods by counting the percentage of the corresponding photos or sketches ranked in the top K results. Please note that the parameters of all methods were readjusted for this top-K evaluation.
As in the previous section, PCA is applied for all the methods. In addition, to verify the feature selection capabilities of LCFS and JFSSL, we also input features without PCA preprocessing for these two methods.
The Cumulative Match Characteristic (CMC) curves are plotted in Fig. 6. In terms of the relative distribution and trends of the curves, Fig. 6(a) is consistent with Fig. 6(b), and the performance of these cross-modal subspace learning methods differs from that under the subcategory-level evaluation (MAP). CCA3V achieves the highest instance-level accuracy for both photo-sketch and sketch-photo queries on the shoe dataset. The curves of CCA, GMMFA, PLS, and BLM are slightly lower than CCA3V's. GMLDA obtains better results than LCFS and JFSSL, while LCFS and JFSSL are slightly better than CDFE and still clearly exhibit their feature selection ability for instance-level SBIR on the shoe dataset. CDFE performs the worst.
These supervised cross-modal subspace learning methods do not show a distinct advantage over unsupervised methods for the instance-level SBIR tasks on the shoe dataset. We can conclude that learning pairwise information is more effective than learning subcategory-level relationships for instance-level SBIR. CCA, PLS, and BLM achieve good results because they learn the pairwise relationships of multimodal samples. CCA3V, GMLDA, and GMMFA can utilize sample labels to learn some subcategory separation in the common subspace while at the same time capturing the sample-pair-based correlation across modalities. CCA3V focuses more on modeling the association between sample pairs, while GMMFA and GMLDA also learn some structured information in the common subspace. LCFS and JFSSL cannot obtain the desired results. The objective function of LCFS concentrates on optimizing the subcategory-based residuals and feature selection for each modality; the trace norm constraint in Eq. (10) can enforce the relevance of projected sample data with connections, but its weighting coefficient is often too small to learn enough pairwise information. The optimization of JFSSL also mainly minimizes the subcategory-based residuals, and its graph embedding term in Eq. (11) only preserves the inter-modality and intra-modality similarity. Hence, LCFS and JFSSL are not good at learning the instance-level or pairwise relationships of sample data pairs.
3.4 Results on Chair Dataset
3.4.1 Evaluation by MAP
The MAP scores of the different cross-modal subspace learning methods on the chair dataset are reported in Table 2. Each experiment in each setting is again repeated 50 times, and as before, PCA is utilized to remove redundancy in the input features for CCA, PLS, BLM, GMLDA, GMMFA, CDFE, and CCA3V. The results are analogous to those on the shoe dataset: considering both photo-sketch and sketch-photo queries comprehensively, LCFS and JFSSL perform best, with GMLDA and GMMFA very close behind.
The corresponding boxplots for Table 2 are visualized in Fig. 7. The stabilities of these methods for subcategory-level SBIR on the chair dataset do not differ much. We also conducted Student's t-tests on the 50 repeated results between LCFS and the other methods, as shown in Table 4. We observe that GMLDA, GMMFA, LCFS, and JFSSL have the same output MAP distribution for both the photo-querying-sketch and sketch-querying-photo tasks.
As described above, the shoe dataset contains three subcategories and the chair dataset six. Intuitively, the three-class problem should be easier than the six-class one under MAP evaluation, so the MAP score of a given method on the shoe dataset should be significantly higher than on the chair dataset. However, comparing Table 1 and Table 2, the MAP scores in Table 1 are not obviously higher than their counterparts in Table 2. Moreover, Fig. 5 has many outliers (marked red) while Fig. 7 shows almost none. These phenomena suggest that the shoe sketches are not drawn very well: the shoe dataset is mixed with considerable noise because its sketch samples were drawn too roughly.
3.4.2 Evaluation by Top-K Accuracy
We readjust the parameters for all the methods and compare their performance by the percentage of corresponding photos or sketches ranked in the top K results. The CMC curves are plotted in Fig. 8. For both the photo-querying-sketch and sketch-querying-photo tasks, CCA3V outperforms the other methods, and the curves of GMMFA, GMLDA, BLM, and PLS in Fig. 8(a) and Fig. 8(b) almost overlap. In Fig. 8(a), the 'LCFS' curve is slightly lower than the 'PCA+LCFS' curve, and the 'JFSSL' curve sits well above the 'PCA+JFSSL' curve. In Fig. 8(b), the 'LCFS' curve is significantly lower than the 'PCA+LCFS' curve, while the 'JFSSL' and 'PCA+JFSSL' curves overlap. This illustrates that the feature selection abilities of LCFS and JFSSL do not work well for instance-level SBIR on the chair dataset. In the objective functions of LCFS and JFSSL, the feature selection terms are optimized together with the subcategory-based regression residuals; thus their feature selection reduces subcategory-based errors rather than instance-level matching errors.
3.5 Feature Selection and Graph Embedding
In the experiments of this paper, the performances of LCFS and JFSSL are almost the same on the shoe and chair datasets. However, JFSSL is the improved version of LCFS and has theoretical advantages: it combines a feature selection constraint and a graph embedding constraint, both classical techniques for cross-modal matching. Hence, it is worth exploring the synergy between the feature selection and graph embedding terms for SBIR tasks. In the objective function Eq. (11), λ1 and λ2 are the weighting parameters for the feature selection and graph embedding terms, respectively. We tune λ1 and λ2 over a range of values while fixing the remaining parameters; this adjusting process is illustrated in Fig. 10. We observe a smooth and symmetric correlation variation between λ1 and λ2, which shows that these two techniques can work together harmoniously for SBIR. When λ1 is fixed, the MAP value changes only slightly with variations of λ2, whereas MAP varies notably with λ1 when λ2 is set to a certain value. This indicates that the performance of JFSSL is largely determined by its feature selection technology: the importance of the feature selection and graph embedding technologies is not equal in the optimization of JFSSL for subcategory-level SBIR. This inspires us to further explore these two techniques in our future research on sketch.
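Such a two-parameter sweep is a plain grid search. In the sketch below, `evaluate` is a hypothetical stand-in for "train JFSSL with (λ1, λ2) and return validation MAP"; its peak is placed at (0.1, 0.01) purely so the example has a deterministic answer:

```python
import itertools

# Hypothetical stand-in for training + validation scoring; replace with the
# real routine. The toy surface peaks at l1 = 0.1, l2 = 0.01 by construction.
def evaluate(l1, l2):
    return 1.0 / (1.0 + abs(l1 - 0.1) + abs(l2 - 0.01))

grid = [1e-3, 1e-2, 1e-1, 1.0, 10.0]
results = {(l1, l2): evaluate(l1, l2) for l1, l2 in itertools.product(grid, grid)}
best_pair = max(results, key=results.get)
print(best_pair)   # (0.1, 0.01) on this toy surface
```

Plotting `results` over the grid yields exactly the kind of two-dimensional sensitivity map shown in Fig. 10.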
3.6 Complexity Analysis
In this section, the computational complexity of each compared cross-modal subspace learning method is discussed briefly. The asymptotic time complexity of CCA is analyzed in Rasiwasia et al. (????). PLS is a fitted regression model whose complexity is characterized in terms of its Degrees of Freedom Nicole and Sugiyama (2011). GMA can be formulated as a standard generalized eigenvalue problem and solved by any eigenvalue solver Sharma et al. (????). CDFE can be solved with an alternating optimization strategy whose main step is a convex quadratic program with linear constraints Lin and Tang (????). Approximate kernel maps can be adopted to solve CCA3V Gong et al. (2014), reducing the problem size to the dimensionalities of the respective explicit mappings. The complexities of LCFS and JFSSL are analyzed in Wang et al. (????a) and Wang et al. (2016), respectively. For a rigorous comparison, we also measure the running time for learning the projection matrices of each cross-modal subspace learning method. Each method on each setting is repeated 50 times, and the average running times are reported in Table 5, which reveals that the feature selection processing is time-consuming. All MATLAB code is run on a 2.40 GHz server with 64 GB RAM.
Table 5: Average running time for learning the projection matrices.

       PCA+CCA  PCA+PLS  PCA+BLM  PCA+GMLDA  PCA+GMMFA  PCA+CDFE  PCA+CCA3V  PCA+LCFS  LCFS     PCA+JFSSL  JFSSL
shoe   0.793    4.217    5.609    5.497      10.287     5.558     4.035      5.124     139.665  5.357      160.894
chair  0.691    7.189    5.979    7.454      7.416      5.200     2.494      2.785     122.104  2.938      137.910
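As noted above, several of the compared methods (CCA, GMA) reduce to generalized eigenvalue problems. The following is a compact numpy-only sketch of linear CCA solved this way; the ridge term `reg` and the function itself are illustrative assumptions, not the implementations timed in Table 5:

```python
import numpy as np

def cca(X, Y, k=1, reg=1e-6):
    """Linear CCA as the generalized eigenvalue problem
    (Cxy Cyy^-1 Cyx) w = rho^2 Cxx w, whitened via a Cholesky factor
    so that numpy's symmetric solver np.linalg.eigh can be used."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])   # ridge for numerical stability
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    A = Cxy @ np.linalg.solve(Cyy, Cxy.T)          # Cxy Cyy^-1 Cyx
    L = np.linalg.cholesky(Cxx)                    # whiten: Cxx = L L^T
    Linv = np.linalg.inv(L)
    vals, V = np.linalg.eigh(Linv @ A @ Linv.T)    # standard symmetric problem
    V = Linv.T @ V                                 # undo the whitening
    Wx = V[:, ::-1][:, :k]                         # top-k X-side directions
    Wy = np.linalg.solve(Cyy, Cxy.T @ Wx)          # paired Y-side directions
    return Wx, Wy

# two strongly correlated views: correlation of the first pair is close to 1
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
Y = X @ rng.normal(size=(5, 4)) + 0.01 * rng.normal(size=(300, 4))
Wx, Wy = cca(X, Y, k=1)
r = np.corrcoef(X @ Wx[:, 0], Y @ Wy[:, 0])[0, 1]
```

The dense decompositions above also make the cost pattern in Table 5 plausible: eigenvalue-based methods scale with the feature dimensionality, while the iterative feature selection steps of LCFS and JFSSL dominate their running time.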
3.7 Discussion
Our experimental results demonstrate that cross-modal subspace learning methods designed for image and text can be applied to subcategory-level and instance-level SBIR tasks. The main advantage of cross-modal subspace learning for SBIR is its clear physical meaning. The methods' performance rankings on subcategory-level SBIR are almost consistent with those on image-text cross-modal retrieval. For subcategory-level SBIR, class label information is useful, and supervised methods are usually superior to unsupervised ones. Feature selection and graph embedding are also effective for subcategory-level SBIR, and they work well together. The performance rankings for instance-level SBIR, however, differ from those for subcategory-level retrieval: learning pairwise information is more effective than learning subcategory-level relationships, and supervised learning holds no significant advantage over unsupervised methods. On the shoe and chair datasets, LCFS outperforms the other methods for subcategory-level SBIR, while CCA3V achieves the highest accuracy for instance-level SBIR. This leads us to conclude that subcategory-level information can also benefit instance-level SBIR.
4 Discussion and Future work
We have demonstrated the feasibility of using cross-modal subspace learning methods to tackle the domain gap between sketches and photos. In the future, better SBIR solutions may be obtained by combining the advantages of cross-modal subspace learning techniques, e.g., pairwise modeling, subcategory-based residuals, joint feature selection, and graph embedding. In particular, many researchers use deep Convolutional Neural Networks (CNNs) for cross-modal matching Wang et al. (????); Xiong et al. (????); Yan et al. (????) or SBIR, which essentially amounts to learning feature subspaces in which to match multi-modal data. Moreover, convolutional sparse coding can also learn subspaces satisfying certain properties Shafiee et al. (????); Gwon et al. (2016); Zhu and Lucey (2015), which shows that the convolutional idea and subspace learning can be reasonably combined. It is therefore natural to utilize cross-modal subspace learning concepts to improve CNNs for SBIR, potentially incorporating saliency information Wang et al. (2016); Lei et al. (2016) to improve part-level examination within the same network.

If we assume that sketch sits between photo and text in terms of expressive power (photo is the most expressive, capturing a like-for-like depiction of the visual world; sketches are highly abstract yet still visual; text can be vague and, more importantly, is no longer in the visual domain), this raises the question of whether modeling sketch together with text and photo could better bridge the gap between text and photo, e.g., for text-based image retrieval. The fact that CCA3V achieved the best performance in the fine-grained case is a good indicator of the promise such three-way modeling offers. However, currently available SBIR datasets do not provide detailed and adequate semantic textual information; hence, new datasets that capture all three domains are required.
5 Conclusion
In this paper, we discussed and evaluated a series of state-of-the-art cross-modal subspace learning methods. We described each method and applied these approaches to two recently released fine-grained SBIR datasets. We provided detailed comparisons and analysis of the experimental results and discussed future research opportunities for SBIR.
6 Acknowledgement
This work was partly supported by National Natural Science Foundation of China (NSFC) grant No. , NSFC-RS joint funding grant Nos. and IE, Beijing Natural Science Foundation (BNSF) grant No. , Beijing Nova Program Grant 2017045, the Open Projects Program of the National Laboratory of Pattern Recognition grant No. , and the Chinese program of Advanced Intelligence and Network Service under grant No. B. This work was also partly supported by the BUPT-SICE Excellent Graduate Students Innovation Fund.
References
 Yu et al. (????) Q. Yu, F. Liu, Y. Song, T. Xiang, T. Hospedales, C. C. Loy, Sketch me that shoe, in: CVPR, 2016.
 Eitz et al. (2010) M. Eitz, K. Hildebrand, T. Boubekeur, M. Alexa, An evaluation of descriptors for large-scale image retrieval from sketched feature lines, Computers & Graphics 34 (2010) 482–498.
 Hu and Collomosse (2013) R. Hu, J. Collomosse, A performance evaluation of gradient field hog descriptor for sketch based image retrieval, CVIU 117 (2013) 790–806.
 Ouyang et al. (2014) S. Ouyang, T. Hospedales, Y.Z. Song, X. Li, Cross-modal face matching: beyond viewed sketches, in: ACCV, 2014.
 Li et al. (????) K. Li, K. Pang, Y. Song, T. Hospedales, H. Zhang, Y. Hu, Fine-grained sketch-based image retrieval: The role of part-aware attributes, in: WACV, 2016.
 Sangkloy et al. (2016) P. Sangkloy, N. Burnell, C. Ham, J. Hays, The sketchy database: learning to retrieve badly drawn bunnies, ACM Trans. Graph. 35 (2016) 119.
 Li et al. (????) Y. Li, T. Hospedales, Y.Z. Song, S. Gong, Intra-category sketch-based image retrieval by matching deformable part models, in: BMVC, 2014.
 Yu et al. (2016) Q. Yu, Y. Yang, F. Liu, Y.Z. Song, T. Xiang, T. M. Hospedales, Sketch-a-net: A deep neural network that beats humans, IJCV (2016).
 Xu et al. (????) P. Xu, Q. Yin, Y. Qi, Y.Z. Song, Z. Ma, L. Wang, J. Guo, Instance-level coupled subspace learning for fine-grained sketch-based image retrieval, in: ECCV workshop, 2016.
 Song et al. (????) J. Song, Y.Z. Song, T. Xiang, T. Hospedales, X. Ruan, Deep multi-task attribute-based ranking for fine-grained sketch-based image retrieval, in: BMVC, 2016.
 Zhang et al. (2015) M. Zhang, J. Li, N. Wang, X. Gao, Recognition of facial sketch styles, Neurocomputing 149 (2015) 1188–1197.
 Li et al. (2015) Y. Li, T. M. Hospedales, Y.Z. Song, S. Gong, Freehand sketch recognition by multikernel feature learning, CVIU 137 (2015) 1–11.
 Gao et al. (2008) X. Gao, J. Zhong, D. Tao, X. Li, Local face sketch synthesis learning, Neurocomputing 71 (2008) 1921–1930.
 Xiao et al. (2010) B. Xiao, X. Gao, D. Tao, Y. Yuan, J. Li, Photo-sketch synthesis and recognition based on subspace learning, Neurocomputing 73 (2010) 840–852.
 Li et al. (2016) Y. Li, Y.Z. Song, T. M. Hospedales, S. Gong, Freehand sketch synthesis with deformable stroke models, IJCV (2016).
 Su et al. (????) H. Su, S. Maji, E. Kalogerakis, E. Learned-Miller, Multi-view convolutional neural networks for 3d shape recognition, in: ICCV, 2015.
 Qi et al. (????) Y. Qi, Y.Z. Song, T. Xiang, H. Zhang, T. Hospedales, Y. Li, J. Guo, Making better use of edges via perceptual grouping, in: CVPR, 2015.
 Cao et al. (????a) Y. Cao, H. Wang, C. Wang, Z. Li, L. Zhang, L. Zhang, Mindfinder: interactive sketch-based image search on millions of images, in: ACM MM, 2010.
 Cao et al. (????b) Y. Cao, C. Wang, L. Zhang, L. Zhang, Edgel index for large-scale sketch-based image search, in: CVPR, 2011.
 Qi et al. (2015) Y. Qi, J. Guo, Y.Z. Song, T. Xiang, H. Zhang, Z.H. Tan, Im2sketch: Sketch generation by unconflicted perceptual grouping, Neurocomputing 165 (2015) 338–349.
 Wang et al. (????) F. Wang, L. Kang, Y. Li, Sketch-based 3d shape retrieval using convolutional neural networks, in: CVPR, 2015.
 Shrivastava et al. (2011) A. Shrivastava, T. Malisiewicz, A. Gupta, A. A. Efros, Data-driven visual similarity for cross-domain image matching, ACM Trans. Graph. 30 (2011) 154.
 Yu et al. (????) Q. Yu, Y. Yang, Y.Z. Song, T. Xiang, T. M. Hospedales, Sketch-a-net that beats humans, in: BMVC, 2015.
 Rasiwasia et al. (????) N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. R. Lanckriet, R. Levy, N. Vasconcelos, A new approach to cross-modal multimedia retrieval, in: ACM MM, 2010.
 Sharma et al. (????) A. Sharma, A. Kumar, H. Daume III, D. W. Jacobs, Generalized multiview analysis: A discriminative latent space, in: CVPR, 2012.
 Zhen and Yeung (????) Y. Zhen, D.Y. Yeung, Co-regularized hashing for multimodal data, in: NIPS, 2010.
 Masci et al. (2014) J. Masci, M. M. Bronstein, A. M. Bronstein, J. Schmidhuber, Multimodal similarity-preserving hashing, TPAMI 36 (2014) 824–830.
 Xu et al. (2016) X. Xu, L. He, A. Shimada, R.-i. Taniguchi, H. Lu, Learning unified binary codes for cross-modal retrieval via latent semantic hashing, Neurocomputing 213 (2016) 191–203.
 Yao et al. (2016) T. Yao, X. Kong, H. Fu, Q. Tian, Semantic consistency hashing for cross-modal retrieval, Neurocomputing 193 (2016) 250–259.
 Hwang and Grauman (2012) S. J. Hwang, K. Grauman, Reading between the lines: Object localization using implicit cues from image tags, TPAMI 34 (2012) 1145–1158.
 Chua et al. (????) T.S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, Y.T. Zheng, NUS-WIDE: A real-world web image database from National University of Singapore, in: CIVR, 2009.
 Oliva and Torralba (2001) A. Oliva, A. Torralba, Modeling the shape of the scene: A holistic representation of the spatial envelope, IJCV 42 (2001) 145–175.
 Quadrianto and Lampert (????) N. Quadrianto, C. H. Lampert, Learning multiview neighborhood preserving projections, in: ICML, 2011.
 Zhai et al. (????) X. Zhai, Y. Peng, J. Xiao, Heterogeneous metric learning with joint graph regularization for cross-media retrieval, in: AAAI, 2013.
 Wang et al. (????) J. Wang, Y. He, C. Kang, S. Xiang, C. Pan, Image-text cross-modal retrieval via modality-specific feature learning, in: ICMR, 2015.
 Grangier and Bengio (2008) D. Grangier, S. Bengio, A discriminative kernel-based approach to retrieve images from text queries, TPAMI 30 (2008) 1371–1384.
 Weston et al. (????) J. Weston, S. Bengio, N. Usunier, Wsabie: Scaling up to large vocabulary image annotation, in: IJCAI, 2011.
 Huang et al. (2016) W. Huang, S. Zeng, M. Wan, G. Chen, Medical media analytics via ranking and big learning: A multimodality imagebased disease severity prediction study, Neurocomputing 204 (2016) 125 – 134.
 Hardoon et al. (2004) D. R. Hardoon, S. Szedmak, J. ShaweTaylor, Canonical correlation analysis: An overview with application to learning methods, Neural computation 16 (2004) 2639–2664.
 Wang et al. (????a) K. Wang, R. He, W. Wang, L. Wang, T. Tan, Learning coupled feature spaces for cross-modal matching, in: ICCV, 2013.
 Wang et al. (????b) K. Wang, W. Wang, L. Wang, R. He, A two-step approach to cross-modal hashing, in: ICMR, 2015.
 Li et al. (2016) J. Li, Y. Wu, J. Zhao, K. Lu, Multi-manifold sparse graph embedding for multi-modal image classification, Neurocomputing 173, Part 3 (2016) 501–510.
 Wang et al. (2016) K. Wang, R. He, L. Wang, W. Wang, T. Tan, Joint feature selection and subspace learning for cross-modal retrieval, TPAMI 38 (2016) 2010–2023.
 Putthividhy et al. (????) D. Putthividhy, H. T. Attias, S. S. Nagarajan, Topic regression multimodal latent dirichlet allocation for image annotation, in: CVPR, 2010.
 Jia et al. (????) Y. Jia, M. Salzmann, T. Darrell, Learning cross-modality similarity for multinomial data, in: ICCV, 2011.
 Wu et al. (2010) W. Wu, J. Xu, H. Li, Learning similarity function between objects in heterogeneous spaces, Microsoft Research Technical Report (2010).
 Mignon and Jurie (????) A. Mignon, F. Jurie, CMML: a new metric learning approach for cross modal matching, in: ACCV, 2012.
 Zhu et al. (????) X. Zhu, Z. Huang, H. T. Shen, X. Zhao, Linear cross-modal hashing for efficient multimedia search, in: ACM MM, 2013.
 Liu et al. (2013) Y. Liu, J. Shao, J. Xiao, F. Wu, Y. Zhuang, Hypergraph spectral hashing for image retrieval with heterogeneous social contexts, Neurocomputing 119 (2013) 49–58.
 Kim et al. (2007) T.K. Kim, J. Kittler, R. Cipolla, Discriminative learning and recognition of image set classes using canonical correlations, TPAMI 29 (2007) 1005–1018.
 Costa Pereira et al. (2014) J. Costa Pereira, E. Coviello, G. Doyle, N. Rasiwasia, G. R. Lanckriet, R. Levy, N. Vasconcelos, On the role of correlation and abstraction in cross-modal multimedia retrieval, TPAMI 36 (2014) 521–535.
 Sharma and Jacobs (????) A. Sharma, D. W. Jacobs, Bypassing synthesis: PLS for face recognition with pose, low-resolution and sketch, in: CVPR, 2011.
 Liu et al. (????) G. Liu, Z. Lin, Y. Yu, Robust subspace segmentation by low-rank representation, in: ICML, 2010.
 Liu and Yan (????) G. Liu, S. Yan, Latent low-rank representation for subspace segmentation and feature extraction, in: ICCV, 2011.
 Liu et al. (2013) G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, Y. Ma, Robust recovery of subspace structures by low-rank representation, TPAMI 35 (2013) 171–184.
 Zhu et al. (????) Y. Zhu, D. Huang, F. D. L. Torre, S. Lucey, Complex non-rigid motion 3d reconstruction by union of subspaces, in: CVPR, 2014.
 Chang et al. (????a) X. Chang, F. Nie, Y. Yang, H. Huang, A convex formulation for semi-supervised multi-label feature selection, in: AAAI, 2014.
 Chang et al. (????b) X. Chang, F. Nie, Z. Ma, Y. Yang, X. Zhou, A convex formulation for spectral shrunk clustering, in: AAAI, 2015.
 Chang et al. (2016) X. Chang, F. Nie, S. Wang, Y. Yang, X. Zhou, C. Zhang, Compound rank projections for bilinear analysis, TNNLS 27 (2016) 1502–1513.
 Wang et al. (2016) K. Wang, Q. Yin, W. Wang, S. Wu, L. Wang, A comprehensive survey on cross-modal retrieval, arXiv preprint arXiv:1607.06215 (2016).
 Liu et al. (2010) J. Liu, C. Xu, H. Lu, Cross-media retrieval: state-of-the-art and open issues, International Journal of Multimedia Intelligence and Security 1 (2010) 33–52.
 Xu et al. (????) P. Xu, K. Li, Z. Ma, Y.Z. Song, L. Wang, J. Guo, Cross-modal subspace learning for sketch-based image retrieval: A comparative study, in: ICNIDC, 2016.
 Lin and Tang (????) D. Lin, X. Tang, Inter-modality face recognition, in: ECCV, 2006.
 Gong et al. (2014) Y. Gong, Q. Ke, M. Isard, S. Lazebnik, A multiview embedding space for modeling internet images, tags, and their semantics, IJCV 106 (2014) 210–233.
 Jolliffe (2002) I. Jolliffe, Principal component analysis, Springer verlag, 2002.
 Ranjan et al. (????) V. Ranjan, N. Rasiwasia, C. V. Jawahar, Multi-label cross-modal retrieval, in: ICCV, 2015.
 Ramsay (2006) J. O. Ramsay, Functional data analysis, Wiley Online Library, 2006.
 Rosipal and Krämer (2006) R. Rosipal, N. Krämer, Overview and recent advances in partial least squares, in: Subspace, latent structure and feature selection, 2006, pp. 34–51.
 Baek and Kim (2004) J. Baek, M. Kim, Face recognition using partial least squares components, PR 37 (2004) 1303–1306.
 Dhanjal et al. (2009) C. Dhanjal, S. R. Gunn, J. ShaweTaylor, Efficient sparse kernel feature extraction based on partial least squares, TPAMI 31 (2009) 1347–1361.
 Li et al. (????) A. Li, S. Shan, X. Chen, W. Gao, Maximizing intra-individual correlations for face recognition across pose differences, in: CVPR, 2009.
 Štruc and Pavešić (2009) V. Štruc, N. Pavešić, Gabor-based kernel partial-least-squares discrimination features for face recognition, Informatica 20 (2009) 115–138.
 Schwartz et al. (????) W. R. Schwartz, H. Guo, L. S. Davis, A robust and scalable approach to face identification, in: ECCV, 2010.
 Chen et al. (????) Y. Chen, L. Wang, W. Wang, Z. Zhang, Continuum regression for cross-modal multimedia retrieval, in: ICIP, 2012.
 Gu et al. (????) Q. Gu, Z. Li, J. Han, Joint feature selection and subspace learning, in: IJCAI, 2011.
 He et al. (????) R. He, T. Tan, L. Wang, W.S. Zheng, ℓ2,1 regularized correntropy for robust feature selection, in: CVPR, 2012.
 Nie et al. (????) F. Nie, H. Huang, X. Cai, C. H. Ding, Efficient and robust feature selection via joint ℓ2,1-norms minimization, in: NIPS, 2010.
 Angst et al. (????) R. Angst, C. Zach, M. Pollefeys, The generalized trace-norm and its application to structure-from-motion problems, in: ICCV, 2011.
 Fornasier et al. (2011) M. Fornasier, H. Rauhut, R. Ward, Low-rank matrix recovery via iteratively reweighted least squares minimization, SIAM Journal on Optimization 21 (2011) 1614–1640.
 Grave et al. (????) E. Grave, G. R. Obozinski, F. R. Bach, Trace lasso: a trace norm regularization for correlated designs, in: NIPS, 2011.
 Harchaoui et al. (????) Z. Harchaoui, M. Douze, M. Paulin, M. Dudik, J. Malick, Large-scale image classification with trace-norm regularization, in: CVPR, 2012.
 Rasiwasia et al. (????) N. Rasiwasia, D. Mahajan, V. Mahadevan, G. Aggarwal, Cluster canonical correlation analysis, in: AISTATS, 2014.
 Nicole and Sugiyama (2011) K. Nicole, M. Sugiyama, The degrees of freedom of partial least squares regression, Journal of the American Statistical Association 106 (2011) 697–705.
 Wang et al. (????) A. Wang, J. Cai, J. Lu, T.J. Cham, MMSS: Multi-modal sharable and specific feature learning for RGB-D object recognition, in: ICCV, 2015.
 Xiong et al. (????) C. Xiong, X. Zhao, D. Tang, K. Jayashree, S. Yan, T.K. Kim, Conditional convolutional neural network for modality-aware face recognition, in: ICCV, 2015.
 Yan et al. (????) K. Yan, Y. Wang, D. Liang, T. Huang, Y. Tian, CNN vs. SIFT for image retrieval: Alternative or complementary?, in: ACM MM, 2016.

 Shafiee et al. (????) S. Shafiee, F. Kamangar, V. Athitsos, A multimodal sparse coding classifier using dictionaries with different number of atoms, in: WACV, 2015.
 Gwon et al. (2016) Y. Gwon, W. Campbell, K. Brady, D. Sturim, M. Cha, H. Kung, Multimodal sparse coding for event detection, arXiv preprint arXiv:1605.05212 (2016).
 Zhu and Lucey (2015) Y. Zhu, S. Lucey, Convolutional sparse coding for trajectory reconstruction, TPAMI 37 (2015) 529–540.
 Wang et al. (2016) Q. Wang, J. Lin, Y. Yuan, Salient band selection for hyperspectral image classification via manifold ranking, IEEE Transactions on Neural Networks and Learning Systems 27 (2016) 1279–1289.
 Lei et al. (2016) J. Lei, B. Wang, Y. Fang, W. Lin, P. L. Callet, N. Ling, C. Hou, A universal framework for salient object detection, TMM 18 (2016) 1783–1795.