Sketch-based image retrieval (SBIR) has drawn increasing attention in the past decade, especially with the prevalence of touchscreens. There exist many annotated datasets Yu et al. (????); Eitz et al. (2010); Hu and Collomosse (2013); Ouyang et al. (2014); Li et al. (????); Sangkloy et al. (2016) and methods tackling all aspects of the problem Li et al. (????); Yu et al. (????); Li et al. (????); Yu et al. (2016); Xu et al. (????); Song et al. (????). The vibrancy of the SBIR area has also promoted the development of related research problems, such as sketch recognition Zhang et al. (2015); Li et al. (2015), sketch synthesis Gao et al. (2008); Xiao et al. (2010); Li et al. (2016), sketch-based 3D retrieval Su et al. (????), and sketch segmentation Qi et al. (????). From a technical perspective, SBIR is traditionally cast as a classification task, with most prior work evaluating retrieval performance at the category level Li et al. (2015); Cao et al. (????a, ????b); Qi et al. (2015); Wang et al. (????). More recently, fine-grained variants of SBIR Li et al. (????); Yu et al. (????) require retrieval to be conducted within single object categories. With this more constrained ranking setting, researchers no longer carry out similarity matching based only on low-level, hand-designed visual features Eitz et al. (2010); Hu and Collomosse (2013); Shrivastava et al. (2011), but begin to delve into high-level and partial information for sketch and photo matching, e.g., local stroke ordering Yu et al. (2016, ????) and part-level attributes Ouyang et al. (2014); Li et al. (????); Xu et al. (????).
Despite the great strides made, most prior work ignores the cross-modal gap that inherently exists between sketch and photo, treating images as edgemaps (semi-sketches) Li et al. (????); Qi et al. (????); Yu et al. (????); Song et al. (????). This assumption works well when the retrieval system is presented with good-quality sketches that are close to contour tracings of the intended objects, but would not work well with free-hand sketches, which are much more abstract and do not closely resemble natural objects. However, effectively bridging the sketch-photo cross-modal gap is non-trivial: (i) A sketch can only capture limited shape and contour information. It uses coarse lines to describe key features of an object at an abstract and semantic level. As shown in Fig. 1, a pyramid can be denoted as a triangle in sketch form. (ii) Different people have different observations, past experiences, drawing styles, and drawing skills Li et al. (????); Sangkloy et al. (2016). Fig. 1 shows that sketches drawn by different persons for the same cat or shoe may be highly diverse. This naturally motivates us to apply cross-modal matching methods to the SBIR problem. However, to the best of our knowledge, all previous cross-modal work Rasiwasia et al. (????); Sharma et al. (????); Zhen and Yeung (????); Masci et al. (2014); Xu et al. (2016); Yao et al. (2016) is designed to address the image-text modal gap (e.g., Wikipedia image-text dataset Rasiwasia et al. (????), Pascal VOC dataset Hwang and Grauman (2012), NUS-WIDE Chua et al. (????), LabelMe Oliva and Torralba (2001)). Therefore, its general applicability to SBIR remains unclear.
The main approaches behind existing image-text cross-modal research can be roughly categorized into pair-wise modeling Zhen and Yeung (????); Quadrianto and Lampert (????); Zhai et al. (????); Wang et al. (????), ranking Grangier and Bengio (2008); Weston et al. (????); Huang et al. (2016), mapping Hardoon et al. (2004); Wang et al. (????a, ????b), and graph embedding Sharma et al. (????); Wang et al. (????b); Li et al. (2016); Wang et al. (2016). In particular, probabilistic models Putthividhy et al. (????); Jia et al. (????), metric learning approaches Wu et al. (2010); Mignon and Jurie (????); Zhu et al. (????); Liu et al. (2013), and subspace learning methods Hardoon et al. (2004); Kim et al. (2007) have been proven effective across a number of datasets. Probabilistic approaches learn the multi-modal correlation by modeling the joint multi-modal data distribution Putthividhy et al. (????). Metric learning methods learn and compute appropriate distance metrics between different modalities Wu et al. (2010). Subspace learning constructs a common subspace and maps multi-modal data into it to conduct cross-modal matching Wang et al. (????a). Among these cross-modal techniques, cross-modal subspace learning methods have achieved state-of-the-art results in recent years Rasiwasia et al. (????); Wang et al. (????a, 2016); Costa Pereira et al. (2014); Sharma and Jacobs (????), and they have borrowed much inspiration from the conventional subspace approaches Liu et al. (????); Liu and Yan (????); Liu et al. (2013); Zhu et al. (????); Chang et al. (????a, ????b, 2016). For a comprehensive survey, please refer to Wang et al. (2016); Liu et al. (2010).
In this paper, we focus on analyzing the interaction and relationship between sketch and photo in the cross-modal setting. The main contributions of this paper are two-fold:
We conduct a detailed comparative analysis of the general applicability of cross-modal techniques to matching sketches and photos.
The remaining parts of this paper are organized as follows. Section 2 briefly presents some state-of-the-art cross-modal subspace learning methods and their characteristics. In Section 3, we report and analyze their experimental performance on the SBIR task. Potential future research directions for SBIR are discussed in Section 4. Finally, conclusions are drawn in the last section.
A preliminary version of this work has been presented in Xu et al. (????). The main extensions are:
Extensive experiments are performed on one extra recently released fine-grained SBIR dataset (i.e., the chair SBIR dataset).
We simultaneously evaluate the performances of these methods for SBIR tasks on both subcategory-level and instance-level.
2 Cross-modal Subspace Learning
In this section, we will briefly survey some state-of-the-art cross-modal subspace learning methods designed for image and text. All these methods will share the same notation. Suppose that sample matrices and are extracted from modality and modality , respectively. These multi-modal samples can be categorized into classes. Samples and the corresponding class labels are denoted as and , where each pair represents the same object or content belonging to the same class. denotes the class label matrix for the multi-modal data. The transform for the -th sample of modality is denoted as . Similarly, the transform for the -th sample of modality is denoted as . Throughout this paper, vectors and matrices are denoted as straight bold lower-case and upper-case , respectively.
These cross-modal subspace learning methods have the common workflow of learning a projection matrix for each modality to project the data from different modalities into a common comparable subspace in the training phase. In the test phase, the test data samples from one modality will be taken as the query set to retrieve matched samples from the other modality. In this paper, and denote the projection matrices for modality and modality , respectively.
2.1 Canonical Correlation Analysis (CCA)
CCA is an effective multivariate statistical analysis approach, which is analogous to principal component analysis (PCA) Jolliffe (2002). It was originally designed for data correlation modeling and dimension reduction. Recently, CCA has been applied widely in multi-modal data fusion and cross-media retrieval Rasiwasia et al. (????); Costa Pereira et al. (2014); Gong et al. (2014); Ranjan et al. (????). CCA has become one of the most popular unsupervised cross-modal subspace learning methods due to its generalization capability.
CCA learns a set of canonical component pairs for and , i.e., directions and along which the multi-modal data is maximally correlated Rasiwasia et al. (????) as
where and denote the empirical covariance matrices for modality and modality, respectively. represents the cross-covariance matrix between different modalities. By repeatedly solving (1), we can obtain a series of canonical component pairs. We can choose the first canonical component pairs for projecting and into two dimensional subspaces. Here, is a hyper-parameter. The optimization objective of (1) can be solved as a generalized eigenvalue problem (GEV) Ramsay (2006).
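To make the CCA step concrete, the following is a minimal NumPy sketch rather than the implementation used in the experiments. It assumes samples are stored as rows, uses the whitening-plus-SVD route (which yields the same canonical pairs as the generalized eigenvalue formulation), and adds a small ridge term `reg` for numerical stability. The function name `cca`, the variable names, and the synthetic data are all hypothetical.

```python
import numpy as np

def cca(Xa, Xb, k, reg=1e-6):
    """Return projections (Wa, Wb) and the top-k canonical correlations.

    Xa: (n, da) samples of one modality; Xb: (n, db) paired samples of the other.
    reg: small ridge term (an assumption of this sketch) keeping covariances invertible.
    """
    Xa = Xa - Xa.mean(axis=0)
    Xb = Xb - Xb.mean(axis=0)
    n = Xa.shape[0]
    Saa = Xa.T @ Xa / (n - 1) + reg * np.eye(Xa.shape[1])  # covariance, modality a
    Sbb = Xb.T @ Xb / (n - 1) + reg * np.eye(Xb.shape[1])  # covariance, modality b
    Sab = Xa.T @ Xb / (n - 1)                              # cross-covariance

    def inv_sqrt(S):
        # Inverse matrix square root of a symmetric positive-definite matrix.
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    Ka, Kb = inv_sqrt(Saa), inv_sqrt(Sbb)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    U, s, Vt = np.linalg.svd(Ka @ Sab @ Kb)
    Wa = Ka @ U[:, :k]
    Wb = Kb @ Vt.T[:, :k]
    return Wa, Wb, s[:k]

# Synthetic paired data sharing a two-dimensional latent signal (illustrative only).
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 2))
Xa = z @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))
Xb = z @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(200, 4))
Wa, Wb, corr = cca(Xa, Xb, k=2)
```

The number of retained pairs `k` plays the role of the subspace dimensionality hyper-parameter mentioned above.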
2.2 Partial Least Squares (PLS)
PLS Rosipal and Krämer (2006) can linearly map multi-modal data into a linear subspace that preserves the data correlation. It can be adopted to solve the cross-modal matching in many multi-modal scenarios. PLS has been effectively applied in face recognition and multi-media retrieval with different motivations Sharma and Jacobs (????); Baek and Kim (2004); Dhanjal et al. (2009); Li et al. (????); Štruc and Pavešić (2009); Schwartz et al. (????); Chen et al. (????).
PLS models and such that Sharma and Jacobs (????)
and contain the extracted PLS scores or latent projections. and are the matrices of loadings and , , and are the residual matrices. is a diagonal matrix describing the latent scores of and .
PLS learns the basis vectors and such that the covariance between the score vectors and (rows of and ) is maximized as
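As an illustration of the covariance-maximization idea, the sketch below computes only the first pair of PLS directions. It assumes unit-norm direction vectors, in which case the pair that maximizes the covariance between the two score vectors is given by the leading singular vectors of the cross-covariance matrix. Later components, which are usually obtained by deflation, are omitted. The function and variable names are hypothetical.

```python
import numpy as np

def pls_first_directions(Xa, Xb):
    """Return unit vectors (wa, wb) maximizing cov(Xa @ wa, Xb @ wb), and that covariance."""
    Xa = Xa - Xa.mean(axis=0)
    Xb = Xb - Xb.mean(axis=0)
    C = Xa.T @ Xb / (Xa.shape[0] - 1)   # cross-covariance between the two modalities
    U, s, Vt = np.linalg.svd(C)
    return U[:, 0], Vt[0], s[0]

# Small synthetic example with paired samples as rows (illustrative only).
rng = np.random.default_rng(1)
z = rng.normal(size=(100, 1))
Xa = z @ rng.normal(size=(1, 6)) + 0.2 * rng.normal(size=(100, 6))
Xb = z @ rng.normal(size=(1, 3)) + 0.2 * rng.normal(size=(100, 3))
wa, wb, max_cov = pls_first_directions(Xa, Xb)
```

Unlike CCA, this objective does not normalize by the within-modality variances, so directions with large variance are favoured.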
2.3 Generalized Multi-view Analysis (GMA)
GMA Sharma et al. (????) is a special multi-view framework, which can be solved efficiently as a generalized eigenvalue problem. As we will show in this section, many popular supervised and unsupervised feature extraction techniques can be derived based on GMA.
The constrained objective is
where, and denote the projection directions. The positive terms , , and are hyper-parameters controlling the balance among the objectives.
is the between-class variance matrix while is the within-class covariance matrix. Sharma et al. Sharma et al. (????) have illustrated that if we substitute in (4) with particular expressions, we obtain the corresponding objective functions of different methods.
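For orientation, the two-view GMA problem is commonly written in the form below. The symbols are assumptions chosen for this sketch and may differ from the paper's numbered equation: $\mathbf{w}_a, \mathbf{w}_b$ are the projection directions, $\mathbf{A}_a, \mathbf{A}_b$ and $\mathbf{B}_a, \mathbf{B}_b$ are the per-modality matrices that are swapped out to obtain each specific method, $\mathbf{Z}_a, \mathbf{Z}_b$ are the paired data (or class-mean) matrices, and $\mu, \alpha, \gamma > 0$ are the balancing hyper-parameters.

```latex
\begin{aligned}
\max_{\mathbf{w}_a,\,\mathbf{w}_b}\quad
  & \mathbf{w}_a^{\top}\mathbf{A}_a\mathbf{w}_a
    + \mu\,\mathbf{w}_b^{\top}\mathbf{A}_b\mathbf{w}_b
    + 2\alpha\,\mathbf{w}_a^{\top}\mathbf{Z}_a\mathbf{Z}_b^{\top}\mathbf{w}_b \\
\text{s.t.}\quad
  & \mathbf{w}_a^{\top}\mathbf{B}_a\mathbf{w}_a
    + \gamma\,\mathbf{w}_b^{\top}\mathbf{B}_b\mathbf{w}_b = 1
\end{aligned}
```

Stacking $\mathbf{w}_a$ and $\mathbf{w}_b$ into one vector turns this into a single generalized eigenvalue problem, which is why the framework can be solved efficiently.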
2.3.1 Bilinear Model (BLM)
In (4), setting , , and we obtain BLM under the proposed GMA framework.
2.3.2 Generalized Multi-view Linear Discriminant Analysis (GMLDA)
We can set , , where are the within/between-class scatter matrices. Here, is substituted by its class mean matrix.
2.3.3 Generalized Multi-view Margin Fisher Analysis (GMMFA)
Under the GMA framework, the expression for the multi-view version of MFA is complex. It uses graph construction to constrain the projected data. More details can be found in Sharma et al. (????).
2.4 Common Discriminant Feature Extraction (CDFE)
Lin and Tang Lin and Tang (????) used the empirical separability and the local consistency to propose the CDFE method for subspace learning. The empirical separability ensures the intra-class compactness and the inter-class dispersion, which are measured respectively as follows Lin and Tang (????)
where and are the quantities of sample pairs from the same class and the different classes, respectively.
As shown in Fig. 2, the empirical separability can be defined as:
where is a hyper-parameter for trade-off. To prevent overfitting, local consistency can be used to regularize the empirical separability. The objective function of CDFE can be formulated as follows:
where is a hyper-parameter to adjust the trade-off between these two objectives. Here, represents the local consistency objective. More details can be found in Lin and Tang (????).
2.5 Three-view Canonical Correlation Analysis (CCA-3V)
The objective function of CCA-3V has three terms Gong et al. (2014):
The latent correlation among three views or three modalities can be captured by optimizing this function. Moreover, for cross-modal matching, some high-level semantic information can be utilized as the third view Gong et al. (2014). If we put the ground-truth labels into the third view, it becomes a supervised method. As shown in Fig. 3, compared with conventional CCA, CCA-3V constructs a semantic embedding subspace to improve the performance. CCA-3V aligns the corresponding multi-modal sample pairs by not only referring to the data distribution but also following the guidance of the high-level semantics. Multi-modal samples belonging to the same semantic cluster are forced to be close to each other.
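A three-term objective of this kind is often written as a sum of pairwise alignment errors between the projected views. The form below is a sketch under assumed notation, not necessarily the exact equation of Gong et al. (2014): $\mathbf{X}_1, \mathbf{X}_2$ are the two data modalities, $\mathbf{X}_3$ is the semantic (label) view, $\varphi_i$ is an optional explicit feature map, $\mathbf{W}_i$ is the projection of view $i$, and $\boldsymbol{\Sigma}_{ii}$ is the covariance of view $i$.

```latex
\begin{aligned}
\min_{\mathbf{W}_1,\mathbf{W}_2,\mathbf{W}_3}\quad
  & \lVert \varphi_1(\mathbf{X}_1)\mathbf{W}_1 - \varphi_2(\mathbf{X}_2)\mathbf{W}_2 \rVert_F^2
  + \lVert \varphi_1(\mathbf{X}_1)\mathbf{W}_1 - \varphi_3(\mathbf{X}_3)\mathbf{W}_3 \rVert_F^2
  + \lVert \varphi_2(\mathbf{X}_2)\mathbf{W}_2 - \varphi_3(\mathbf{X}_3)\mathbf{W}_3 \rVert_F^2 \\
\text{s.t.}\quad
  & \mathbf{W}_i^{\top}\boldsymbol{\Sigma}_{ii}\mathbf{W}_i = \mathbf{I}, \qquad i = 1, 2, 3
\end{aligned}
```

The first term is the usual two-view CCA alignment; the other two pull each modality towards the semantic view, which is what places samples of the same semantic cluster close together.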
2.6 Learning Coupled Feature Spaces for Cross-modal Matching (LCFS)
Many earlier studies have demonstrated two properties:
Integrating the properties of the -norm and the trace norm, Wang et al. Wang et al. (????a) proposed a model of the following form
where and are the projection matrices for the coupled modality and modality, respectively. The first term is a coupled linear regression, which is used to learn two projection matrices for mapping multi-modal data into a common subspace defined by label information. The second term containing -norms conducts feature selection on two feature spaces and simultaneously. The trace norm can enhance the relevance of projected data with connections inside the subspace.
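The three ingredients described above (coupled regression onto the labels, a row-sparsity norm for feature selection, and a trace norm on the projected data) can be evaluated as in the following sketch. It assumes samples as rows, a label matrix `Y`, and weights `lam1` and `lam2`; the exact scaling of each term in Wang et al. may differ, and all function names are hypothetical.

```python
import numpy as np

def l21_norm(W):
    """Sum of the Euclidean norms of the rows of W (drives whole rows, i.e. features, to zero)."""
    return float(np.linalg.norm(W, axis=1).sum())

def trace_norm(M):
    """Sum of the singular values of M (also called the nuclear norm)."""
    return float(np.linalg.svd(M, compute_uv=False).sum())

def lcfs_style_objective(Xa, Xb, Y, Ua, Ub, lam1, lam2):
    """Value of a three-term objective: regression fit + row sparsity + low-rank coupling."""
    fit = 0.5 * (np.linalg.norm(Xa @ Ua - Y) ** 2 + np.linalg.norm(Xb @ Ub - Y) ** 2)
    sparsity = lam1 * (l21_norm(Ua) + l21_norm(Ub))
    low_rank = lam2 * trace_norm(np.hstack([Xa @ Ua, Xb @ Ub]))
    return fit + sparsity + low_rank

# Tiny illustrative check with two samples, two classes, and zero projections.
Y = np.eye(2)
Xa = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]])
Xb = np.array([[0.5, 1.0], [1.0, 0.5]])
value_at_zero = lcfs_style_objective(Xa, Xb, Y, np.zeros((3, 2)), np.zeros((2, 2)), 0.1, 0.1)
```

A row of a projection matrix with zero norm means the corresponding input feature is ignored, which is how the row-sparsity term performs feature selection.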
2.7 Joint Feature Selection and Subspace Learning for Cross-modal Retrieval (JFSSL)
where denotes the projection matrix for the -th modality. The roles of its first term and the second term are the same as those in LCFS. The third term is a multi-modal graph regularization reinforcing the intra-modality and inter-modality similarity. Similar to the empirical separability term of CDFE objective, this multi-modal graph regularization preserves the intra-modality compactness and the inter-modality dispersion.
3 Experimental Results and Discussions
In this section, we will apply the aforementioned cross-modal subspace learning methods to two recently released fine-grained sketch-based image retrieval datasets Yu et al. (????); Li et al. (????). Each photo has a corresponding free-hand sketch. That is, each sketch sample has a ground-truth photo counterpart, as shown in Fig. 4.
The shoe dataset has photo-sketch pairs, which can be categorized into three subclasses. All the sample pairs are single-labeled. The chair dataset contains photo-sketch pairs. These chairs can be divided into six subclasses. The sketches are drawn by non-experts using their fingers on touch screens; therefore, they are abstract enough to fall outside the photo modality space.
3.2 Experimental Settings
These cross-modal subspace learning methods (i.e., CCA, PLS, BLM, GMLDA, GMMFA, CDFE, CCA-3V, LCFS, JFSSL) were applied to the shoe dataset and the chair dataset for two SBIR retrieval tasks: (1) photos query sketches and (2) sketches query photos. The experimental results contain randomness because the numbers of samples in the shoe dataset and the chair dataset are limited. To reduce the effect of randomness, we repeated each model on each setting 50 times. On the shoe dataset, in each evaluation we randomly chose sample pairs as the training set, and treated the remaining sample pairs as the test set. On the chair dataset, the ratio of training and test data sets was kept as to .
In the training phase, we input photo and sketch features into these cross-modal subspace learning methods to learn a projection matrix for each modality. After model training, we used the projection matrices to map the photo and sketch test samples into a common subspace. The cosine distance was adopted to measure the similarity between the projected photos and sketches. Given a photo (or sketch) query, the goal of each SBIR task is to find the nearest neighbors (NN) from the sketch (or photo) database.
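The test-phase procedure can be sketched as follows. This is a minimal illustration, assuming samples are stored as rows and that the two projection matrices (here called `Wq` and `Wg`, hypothetical names) have already been learned by one of the methods above. Rows with zero norm after projection are not handled.

```python
import numpy as np

def rank_gallery(queries, gallery, Wq, Wg):
    """Rank gallery samples for each query by cosine distance in the common subspace.

    queries: (nq, dq) features of the query modality; gallery: (ng, dg) features of the other.
    Wq, Wg: learned projection matrices mapping each modality to the shared subspace.
    Returns an (nq, ng) array of gallery indices, nearest first.
    """
    q = queries @ Wq
    g = gallery @ Wg
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    g = g / np.linalg.norm(g, axis=1, keepdims=True)
    dist = 1.0 - q @ g.T                      # cosine distance
    return np.argsort(dist, axis=1, kind="stable")

# Toy example with identity projections (illustrative only).
gallery = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
queries = np.array([[2.0, 0.0], [0.0, 3.0]])
ranks = rank_gallery(queries, gallery, np.eye(2), np.eye(2))
```

Swapping the roles of the two modalities gives the second retrieval task (sketches query photos instead of photos query sketches).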
In all the following experiments, we used Histogram of Oriented Gradient (HOG) features to describe the photos and sketches. To evaluate the performance of these methods at different granularities, two metrics were adopted. The mean average precision (MAP) Rasiwasia et al. (????) was applied to evaluate performance at the semi-fine-grained (subcategory) level. A retrieval was judged as correct by MAP as long as the retrieved sample and the query sample have the same subclass label. Another metric “” Yu et al. (????); Li et al. (????) was used to carry out fine-grained evaluation at the instance level; it is the percentage of the corresponding photos or sketches ranked in the top K results.
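The two metrics can be computed from a ranking matrix as in the sketch below. It assumes `ranking[i]` lists gallery indices for query `i` from nearest to farthest, that subclass labels are integers, and that `true_index[i]` is the gallery index of the instance-level counterpart of query `i`. The function names are hypothetical.

```python
import numpy as np

def mean_average_precision(ranking, query_labels, gallery_labels):
    """Subcategory-level MAP: a retrieved item is correct if it shares the query's subclass label."""
    gallery_labels = np.asarray(gallery_labels)
    aps = []
    for order, label in zip(ranking, query_labels):
        hits = (gallery_labels[np.asarray(order)] == label).astype(float)
        if hits.sum() == 0:
            continue                      # no relevant item for this query
        precision_at_rank = np.cumsum(hits) / (np.arange(len(hits)) + 1)
        aps.append((precision_at_rank * hits).sum() / hits.sum())
    return float(np.mean(aps))

def accuracy_at_k(ranking, true_index, k):
    """Instance-level accuracy: fraction of queries whose true counterpart is in the top k."""
    ranking = np.asarray(ranking)
    true_index = np.asarray(true_index)
    hits = (ranking[:, :k] == true_index[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: two queries, three gallery items.
ranking = [[1, 0, 2], [1, 2, 0]]
map_score = mean_average_precision(ranking, query_labels=[0, 1], gallery_labels=[0, 1, 0])
acc1 = accuracy_at_k(ranking, true_index=[0, 1], k=1)
acc2 = accuracy_at_k(ranking, true_index=[0, 1], k=2)
```

Plotting the instance-level accuracy against increasing k gives a cumulative match curve of the kind used in the instance-level evaluation.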
3.3 Results on Shoe Dataset
3.3.1 Evaluation by MAP
The MAP scores of different cross-modal subspace learning methods on the shoe dataset are reported in Table 1. The minimum (min), maximum (max), mean value (mean), variance (var), and standard deviation (std) for each method are also presented.
Wang et al. Wang et al. (????a, 2016) have illustrated that CCA, PLS, BLM, GMLDA, GMMFA, CDFE, and CCA-3V are incapable of feature selection. Hence we utilize Principal Component Analysis (PCA) to remove the redundancy in the input features for these seven kinds of methods. In order to validate the feature selection abilities of LCFS and JFSSL, their results without performing PCA on the input features are also reported in Table 1. It can be observed that LCFS and JFSSL outperform the remaining methods for photo querying sketch and sketch querying photo on shoe dataset. This is because LCFS and JFSSL can simultaneously select discriminative and effective features from different modalities while learning the common subspace.
In terms of performance, GMLDA and GMMFA are close to LCFS and JFSSL. The performance gaps between GMLDA and GMMFA on the one hand and LCFS and JFSSL on the other are not as obvious as those for image and text matching. This is due to the inherent data difference between sketch and text. GMLDA and GMMFA are superior to CDFE and CCA-3V. Among these cross-modal subspace learning methods, CCA performs the worst, while its supervised enhanced version CCA-3V achieves good performance. PLS and BLM are a little better than CCA for photo querying sketch and sketch querying photo.
The overall trend of Table 1 can be summarized as supervised methods outperforming unsupervised methods. This trend can also be explained by Fig. 9, which shows the differences between these methods. CCA, PLS, and BLM use only pair-wise information to build the common subspace, as shown in Fig. 9(c). Fig. 9(d) illustrates that GMLDA, GMMFA, and CCA-3V can take advantage of class label information and pair-wise relationships to construct preferable inter-class separation in the common subspace. CDFE mainly attempts to keep the intra-class and inter-class structures in a subspace. LCFS and JFSSL are devoted to minimizing the subcategory-based residuals. However, their graph embedding techniques can only improve intra-class compactness and inter-class dispersion. CDFE, LCFS, and JFSSL cannot thoroughly capture the pair-wise relationship. In contrast to Fig. 9(e), the sample pairs of Fig. 9(d) are connected by dashed lines to visualize the pair-wise connections.
These phenomena are consistent with the results presented in Wang et al. (2016). In Wang et al. (2016), it was discussed and verified that JFSSL can utilize the multi-modal graph embedding constraint to obtain performance improvements over LCFS. However, in the experiments on the shoe dataset, their performances are almost the same. This is because the graph embedding constraint of JFSSL cannot fully play its role on this sketch dataset.
All the experimental results in Table 1 are also visualized as box-plots in Fig. 5. The box range of a method shows the stability of its performance on the SBIR tasks. We can conclude that these cross-modal subspace learning methods have similar stabilities for SBIR on the shoe dataset.
All the samples extracted from the same dataset follow the same underlying data distribution. Each method has its own unique principle and can be regarded as a system. Theoretically, the experimental results of a method will also follow a certain latent data distribution when it is repeated on the same dataset. Thus we can judge that the performances of these aforementioned methods for the SBIR tasks on shoe dataset are fundamentally different, only when their experimental result distributions do not belong to the same distribution.
From the results above, we get a preliminary conclusion that LCFS is the best among these methods on the shoe dataset for photo querying sketch and sketch querying photo. To verify whether LCFS is fundamentally superior to the other methods, we conducted Student's t-test between LCFS and each of the other methods, as shown in Table 3. The null hypothesis is that the two results have similar means with unknown variance. We can observe that LCFS and JFSSL have the same output MAP distribution for the photo querying sketch task regardless of whether their input features are preprocessed by PCA. However, their output MAP distributions for the sketch querying photo task are different. In all other cases, LCFS is statistically different from the others. Based on the above observations, we conclude that the performance of LCFS for the subcategory-level SBIR tasks on the shoe dataset is essentially different from the performances of CCA, PLS, BLM, GMLDA, GMMFA, CDFE, and CCA-3V.
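A two-sample comparison of this kind can be sketched as below. The text does not state whether a pooled-variance or unequal-variance test was used, so this sketch assumes Welch's unequal-variance form. The inputs stand for the repeated MAP scores of two methods and are hypothetical toy values, not results from the paper.

```python
import numpy as np

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    va = a.var(ddof=1) / a.size          # squared standard error of mean(a)
    vb = b.var(ddof=1) / b.size          # squared standard error of mean(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    return float(t), float(dof)

# Toy inputs standing in for two lists of repeated scores.
t_stat, dof = welch_t([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 3.0, 4.0, 5.0, 6.0])
```

The p-value follows from the t distribution with `dof` degrees of freedom; if SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns the same statistic together with the p-value.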
|Method||Photo queries sketch||Sketch queries photo|
Fig. 5: Box-plots of MAP scores achieved by different cross-modal subspace learning methods on the shoe dataset. The inputs of all the methods are preprocessed by PCA, except for the methods marked with an asterisk. The top and bottom edges of each box are the 75th and 25th percentiles, respectively. The outliers are marked individually as red crosses.
3.3.2 Evaluation by “”
The shoe dataset and the chair dataset are fine-grained SBIR datasets in which each sketch sample has a photo sample as its instance-level counterpart. Hence we can also evaluate the performances of these cross-modal subspace learning methods by counting the percentage of the corresponding photos or sketches ranked in the top K results. Please note that the parameters of all the methods are readjusted when we evaluated them by “”.
As in the previous section, PCA is conducted for all the methods. In addition, to verify the feature selection capabilities of LCFS and JFSSL, we also input features without PCA preprocessing for these two methods.
The Cumulative Match Characteristic (CMC) curves are plotted in Fig. 6. In terms of the relative positions and trends of the curves, Fig. 6(a) is consistent with Fig. 6(b). However, the performances of these cross-modal subspace learning methods differ from those in the subcategory-level evaluation (MAP). CCA-3V achieves the highest instance-level accuracy for photo-sketch query and sketch-photo query on the shoe dataset. The curves of CCA, GMMFA, PLS, and BLM are slightly lower than that of CCA-3V. GMLDA obtains better results than LCFS and JFSSL, which in turn are a little better than CDFE. LCFS and JFSSL still clearly show their feature selection ability for instance-level SBIR on the shoe dataset. CDFE gives the worst result.
These supervised cross-modal subspace learning methods do not show a distinct advantage over unsupervised methods for the instance-level SBIR tasks on the shoe dataset. We can conclude that learning pair-wise information is more effective than learning subcategory-level relationships for instance-level SBIR. CCA, PLS, and BLM achieve good results because they learn the pair-wise relationships of multi-modal samples. CCA-3V, GMLDA, and GMMFA can utilize sample labels to learn some subcategory separation in the common subspace while at the same time capturing the sample pair-based correlation across modalities. CCA-3V is more focused on modeling the association between the pairs of samples, while GMMFA and GMLDA also learn some structured information in the common subspace. LCFS and JFSSL cannot obtain the desired results. For LCFS, the objective function optimizes the subcategory-based residuals and feature selection for each modality. The trace norm constraint in Eq. (10) can enforce the relevance of projected sample data with connections, but its weighting coefficient is often too small to learn enough pair-wise information. For JFSSL, the optimization also mainly minimizes the subcategory-based residuals. Its graph embedding term in Eq. (11) only preserves the inter-modality and intra-modality similarity. Hence, LCFS and JFSSL are not good at learning the instance-level or pair-wise relationship of sample data pairs.
3.4 Results on Chair Dataset
3.4.1 Evaluation by MAP
The MAP scores of different cross-modal subspace learning methods on the chair dataset are reported in Table 2. Each experiment on each setting is also repeated 50 times. As in the previous section, PCA is utilized to remove the redundancy in the input features for CCA, PLS, BLM, GMLDA, GMMFA, CDFE, and CCA-3V. We can observe that the experimental results are analogous to those on the shoe dataset. Considering the photo-sketch query and sketch-photo query tasks together, LCFS and JFSSL perform best. The performances of GMLDA and GMMFA are very close to those of LCFS and JFSSL.
The corresponding box-plots for Table 2 are visualized in Fig. 7. We observe that the stabilities of these methods for subcategory-level SBIR on the chair dataset do not differ much. We also conducted Student's t-test on the results of the 50 repeated experiments between LCFS and the other methods, as shown in Table 4. We observe that GMLDA, GMMFA, LCFS, and JFSSL have the same output MAP distribution for the photo querying sketch and sketch querying photo tasks.
As described above, the shoe dataset contains three subcategories and the chair dataset has six. Intuitively, the three-class problem should be easier than the six-class one when we evaluate the experimental results by MAP. Thus the MAP score of a method on the shoe dataset should be significantly higher than its score on the chair dataset. However, comparing Table 1 and Table 2, the MAP scores in Table 1 are not obviously higher than their counterparts in Table 2. Moreover, Fig. 5 has many outliers (marked red) while Fig. 7 shows almost no outliers. A plausible interpretation is that the shoe sketches are not drawn very well: the shoe dataset contains too much noise because the shoe sketch samples were drawn too roughly.
3.4.2 Evaluation by “”
We readjust the parameters for all the methods and compare their performances by counting the percentage of the corresponding photos or sketches ranked in the top K results. The CMC curves are plotted in Fig. 8. For both the photo querying sketch task and the sketch querying photo task, CCA-3V outperforms the other methods. The curves of GMMFA, GMLDA, BLM, and PLS almost overlap in both Fig. 8(a) and Fig. 8(b). We can observe that in Fig. 8(a), the ‘LCFS’ curve is slightly lower than the ‘PCA+LCFS’ curve and the ‘JFSSL’ curve lies well above the ‘PCA+JFSSL’ curve. In Fig. 8(b), the ‘LCFS’ curve is significantly lower than the ‘PCA+LCFS’ curve, and the ‘JFSSL’ and ‘PCA+JFSSL’ curves overlap. This illustrates that the feature selection abilities of LCFS and JFSSL do not work well for instance-level SBIR on the chair dataset. In the objective functions of LCFS and JFSSL, the constraint terms for feature selection are optimized simultaneously with the subcategory-based regression residuals. Thus the effect of their feature selection is to reduce the subcategory-based errors rather than the instance-level matching errors.
3.5 Feature Selection and Graph Embedding
In the experiments of this paper, the performances of LCFS and JFSSL are almost the same on the shoe dataset and the chair dataset. However, JFSSL is the improved version of LCFS and has theoretical advantages. JFSSL has a feature selection constraint and a graph embedding constraint, both of which are classical techniques for cross-modal matching. Hence, it is worth exploring the synergy between the feature selection and graph embedding terms for SBIR tasks. In its objective function Eq. (11), and are the weighting parameters for feature selection and graph embedding terms, respectively. We tune and in the range of fixing the remaining parameters. This adjusting process is illustrated in Fig. 10. We can observe a smooth and symmetric correlation variation between and . This shows us that these two techniques can work together harmoniously for SBIR. When is fixed, MAP value slightly changes with the variations of . MAP varies with while is set to a certain value. This suggests that the performance of JFSSL is largely determined by its feature selection term. The feature selection and graph embedding terms are not equally important in the optimization process of JFSSL for subcategory-level SBIR. This inspires us to further explore these two techniques in our future research on sketch.
3.6 Complexity Analysis
In this section, the computational complexity of each compared cross-modal subspace learning method is discussed briefly. The asymptotic time complexity of CCA is Rasiwasia et al. (????) where . PLS is a fitting model embedding a regression technique, and its complexity is defined in terms of its Degrees of Freedom Nicole and Sugiyama (2011). GMA can be formulated as a standard generalized eigenvalue problem and solved by any eigenvalue solving technique Sharma et al. (????). CDFE can be solved using an alternating optimization strategy whose main step is a convex quadratic program with linear constraints Lin and Tang (????). Approximate kernel maps can be adopted to solve CCA-3V Gong et al. (2014), reducing the size of this problem to , where are the dimensionalities of the respective explicit mappings. The complexity of LCFS is Wang et al. (????a) where . The complexity of JFSSL can be denoted as Wang et al. (2016) where .
For a rigorous comparison, the running time for learning the projection matrices is compared among these cross-modal subspace learning methods. Each method on each setting is repeated 50 times. The average running times are reported in Table 5, which reveals that the feature selection processing is time-consuming. All the MATLAB codes are run on a 2.40GHz server with 64G RAM.
Our experimental results demonstrate that the cross-modal subspace learning methods designed for image and text can be applied to subcategory-level and instance-level SBIR tasks. The main advantage of cross-modal subspace learning for SBIR is its clear physical significance. The performance rankings of these methods for subcategory-level SBIR tasks are almost consistent with those in cross-modal retrieval for image and text. For subcategory-level SBIR, the class label information is useful and supervised methods are usually superior to unsupervised methods. Feature selection and graph embedding techniques are also effective for subcategory-level SBIR, and they work well together. The performance rankings for instance-level SBIR tasks are not the same as those for subcategory-level retrieval. Learning pair-wise information is more effective than learning subcategory-level relationships for instance-level SBIR. Supervised learning has no significant advantage over unsupervised methods for the instance-level SBIR task. On the shoe dataset and the chair dataset, LCFS outperforms the other methods for subcategory-level SBIR and CCA-3V achieves the highest accuracy for instance-level SBIR. This leads us to conclude that subcategory-level information can also be beneficial to instance-level SBIR.
4 Discussion and Future work
We have demonstrated the feasibility of utilizing cross-modal subspace learning methods to tackle the domain gap between sketches and photos. In the future, we may obtain better solutions for SBIR by combining the advantages of the cross-modal subspace learning techniques, e.g., pair-wise modeling, subcategory-based residuals, joint feature selection, and graph embedding. In particular, many researchers use deep Convolutional Neural Networks (CNNs) to conduct cross-modal matching Wang et al. (????); Xiong et al. (????); Yan et al. (????) or SBIR, which essentially amounts to learning feature subspaces in which to match multi-modal data. Moreover, convolutional sparse coding can also learn subspaces satisfying certain qualities Shafiee et al. (????); Gwon et al. (2016); Zhu and Lucey (2015), which illustrates that the convolutional idea and subspace learning can be reasonably combined. Therefore, it is natural to also utilize cross-modal subspace learning concepts to improve CNNs for SBIR, and potentially to incorporate saliency information Wang et al. (2016); Lei et al. (2016) to improve part-level examination in the same network.
If we assume that sketch sits between photo and text in terms of their expressive power, i.e., photo is the most expressive for it can capture a like-for-like depiction of the visual world, sketches are unlikely to do so since they are highly abstract yet still visual, text on the other hand can be vague and more importantly not in the visual domain anymore. This bears the question that if modeling sketch together with text and photo could be worthwhile to better bridge the gap between text and photo, e.g., for text-based image retrieval. The fact that CCA-3V achieved the best performance for the fine-grained case is a good indicator of the promise that such three-way modelling offers. However, currently available SBIR datasets cannot provide detailed and adequate semantic textual information. Hence new datasets that capture all three domains are required.
In this paper, we discussed and evaluated a series of state-of-the-art cross-modal subspace learning methods. We described each method and applied them to two recently released fine-grained SBIR datasets. We provided detailed comparisons and analysis of the experimental results and discussed future research opportunities for SBIR.
This work was partly supported by National Natural Science Foundation of China (NSFC) grant No. , NSFC-RS joint funding grant Nos. and IE, Beijing Natural Science Foundation (BNSF) grant No. , Beijing Nova Program Grant 2017045, the Open Projects Program of National Laboratory of Pattern Recognition grant No. , and Chinese program of Advanced Intelligence and Network Service under grant No. B. This work is partly supported by BUPT-SICE Excellent Graduate Students Innovation Fund .
- Yu et al. (????) Q. Yu, F. Liu, Y. Song, T. Xiang, T. Hospedales, C. C. Loy, Sketch me that shoe, in: CVPR, 2016.
- Eitz et al. (2010) M. Eitz, K. Hildebrand, T. Boubekeur, M. Alexa, An evaluation of descriptors for large-scale image retrieval from sketched feature lines, Computers & Graphics 34 (2010) 482–498.
- Hu and Collomosse (2013) R. Hu, J. Collomosse, A performance evaluation of gradient field hog descriptor for sketch based image retrieval, CVIU 117 (2013) 790–806.
- Ouyang et al. (2014) S. Ouyang, T. Hospedales, Y.-Z. Song, X. Li, Cross-modal face matching: beyond viewed sketches, in: ACCV, 2014.
- Li et al. (????) K. Li, K. Pang, Y. Song, T. Hospedales, H. Zhang, Y. Hu, Fine-grained sketch-based image retrieval: The role of part-aware attributes, in: WACV, 2016.
- Sangkloy et al. (2016) P. Sangkloy, N. Burnell, C. Ham, J. Hays, The sketchy database: learning to retrieve badly drawn bunnies, ACM Trans. Graph. 35 (2016) 119.
- Li et al. (????) Y. Li, T. Hospedales, Y.-Z. Song, S. Gong, Intra-category sketch-based image retrieval by matching deformable part models, in: BMVC, 2014.
- Yu et al. (2016) Q. Yu, Y. Yang, F. Liu, Y.-Z. Song, T. Xiang, T. M. Hospedales, Sketch-a-net: A deep neural network that beats humans, IJCV (2016).
- Xu et al. (????) P. Xu, Q. Yin, Y. Qi, Y.-Z. Song, Z. Ma, L. Wang, J. Guo, Instance-level coupled subspace learning for fine-grained sketch-based image retrieval, in: ECCV workshop, 2016.
- Song et al. (????) J. Song, Y.-Z. Song, T. Xiang, T. Hospedales, X. Ruan, Deep multi-task attribute-based ranking for fine-grained sketch-based image retrieval, in: BMVC, 2016.
- Zhang et al. (2015) M. Zhang, J. Li, N. Wang, X. Gao, Recognition of facial sketch styles, Neurocomputing 149 (2015) 1188–1197.
- Li et al. (2015) Y. Li, T. M. Hospedales, Y.-Z. Song, S. Gong, Free-hand sketch recognition by multi-kernel feature learning, CVIU 137 (2015) 1–11.
- Gao et al. (2008) X. Gao, J. Zhong, D. Tao, X. Li, Local face sketch synthesis learning, Neurocomputing 71 (2008) 1921–1930.
- Xiao et al. (2010) B. Xiao, X. Gao, D. Tao, Y. Yuan, J. Li, Photo-sketch synthesis and recognition based on subspace learning, Neurocomputing 73 (2010) 840–852.
- Li et al. (2016) Y. Li, Y.-Z. Song, T. M. Hospedales, S. Gong, Free-hand sketch synthesis with deformable stroke models, IJCV (2016).
- Su et al. (????) H. Su, S. Maji, E. Kalogerakis, E. Learned-Miller, Multi-view convolutional neural networks for 3d shape recognition, in: ICCV, 2015.
- Qi et al. (????) Y. Qi, Y.-Z. Song, T. Xiang, H. Zhang, T. Hospedales, Y. Li, J. Guo, Making better use of edges via perceptual grouping, in: CVPR, 2015.
- Cao et al. (????a) Y. Cao, H. Wang, C. Wang, Z. Li, L. Zhang, L. Zhang, Mindfinder: interactive sketch-based image search on millions of images, in: ACM MM, 2010.
- Cao et al. (????b) Y. Cao, C. Wang, L. Zhang, L. Zhang, Edgel index for large-scale sketch-based image search, in: CVPR, 2011.
- Qi et al. (2015) Y. Qi, J. Guo, Y.-Z. Song, T. Xiang, H. Zhang, Z.-H. Tan, Im2sketch: Sketch generation by unconflicted perceptual grouping, Neurocomputing 165 (2015) 338–349.
- Wang et al. (????) F. Wang, L. Kang, Y. Li, Sketch-based 3d shape retrieval using convolutional neural networks, in: CVPR, 2015.
- Shrivastava et al. (2011) A. Shrivastava, T. Malisiewicz, A. Gupta, A. A. Efros, Data-driven visual similarity for cross-domain image matching, ACM Trans. Graph. 30 (2011) 154.
- Yu et al. (????) Q. Yu, Y. Yang, Y.-Z. Song, T. Xiang, T. M. Hospedales, Sketch-a-net that beats humans, in: BMVC, 2015.
- Rasiwasia et al. (????) N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. R. Lanckriet, R. Levy, N. Vasconcelos, A new approach to cross-modal multimedia retrieval, in: ACM MM, 2010.
- Sharma et al. (????) A. Sharma, A. Kumar, H. Daume III, D. W. Jacobs, Generalized multiview analysis: A discriminative latent space, in: CVPR, 2012.
- Zhen and Yeung (????) Y. Zhen, D.-Y. Yeung, Co-regularized hashing for multimodal data, in: NIPS, 2010.
- Masci et al. (2014) J. Masci, M. M. Bronstein, A. M. Bronstein, J. Schmidhuber, Multimodal similarity-preserving hashing, TPAMI 36 (2014) 824–830.
- Xu et al. (2016) X. Xu, L. He, A. Shimada, R.-i. Taniguchi, H. Lu, Learning unified binary codes for cross-modal retrieval via latent semantic hashing, Neurocomputing 213 (2016) 191–203.
- Yao et al. (2016) T. Yao, X. Kong, H. Fu, Q. Tian, Semantic consistency hashing for cross-modal retrieval, Neurocomputing 193 (2016) 250–259.
- Hwang and Grauman (2012) S. J. Hwang, K. Grauman, Reading between the lines: Object localization using implicit cues from image tags, TPAMI 34 (2012) 1145–1158.
- Chua et al. (????) T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, Y.-T. Zheng, Nus-wide: A real-world web image database from national university of singapore, in: CIVR, 2009.
- Oliva and Torralba (2001) A. Oliva, A. Torralba, Modeling the shape of the scene: A holistic representation of the spatial envelope, IJCV 42 (2001) 145–175.
- Quadrianto and Lampert (????) N. Quadrianto, C. H. Lampert, Learning multi-view neighborhood preserving projections, in: ICML, 2011.
- Zhai et al. (????) X. Zhai, Y. Peng, J. Xiao, Heterogeneous metric learning with joint graph regularization for cross-media retrieval, in: AAAI, 2013.
- Wang et al. (????) J. Wang, Y. He, C. Kang, S. Xiang, C. Pan, Image-text cross-modal retrieval via modality-specific feature learning, in: ICMR, 2015.
- Grangier and Bengio (2008) D. Grangier, S. Bengio, A discriminative kernel-based approach to retrieval images from text queries, TPAMI 30 (2008) 1371–1384.
- Weston et al. (????) J. Weston, S. Bengio, N. Usunier, Wsabie: Scaling up to large vocabulary image annotation, in: IJCAI, 2011.
- Huang et al. (2016) W. Huang, S. Zeng, M. Wan, G. Chen, Medical media analytics via ranking and big learning: A multi-modality image-based disease severity prediction study, Neurocomputing 204 (2016) 125–134.
- Hardoon et al. (2004) D. R. Hardoon, S. Szedmak, J. Shawe-Taylor, Canonical correlation analysis: An overview with application to learning methods, Neural computation 16 (2004) 2639–2664.
- Wang et al. (????a) K. Wang, R. He, W. Wang, L. Wang, T. Tan, Learning coupled feature spaces for cross-modal matching, in: ICCV, 2013.
- Wang et al. (????b) K. Wang, W. Wang, L. Wang, R. He, A two-step approach to cross-modal hashing, in: ICMR, 2015.
- Li et al. (2016) J. Li, Y. Wu, J. Zhao, K. Lu, Multi-manifold sparse graph embedding for multi-modal image classification, Neurocomputing 173, Part 3 (2016) 501–510.
- Wang et al. (2016) K. Wang, R. He, L. Wang, W. Wang, T. Tan, Joint feature selection and subspace learning for cross-modal retrieval, TPAMI 38 (2016) 2010–2023.
- Putthividhy et al. (????) D. Putthividhy, H. T. Attias, S. S. Nagarajan, Topic regression multi-modal latent dirichlet allocation for image annotation, in: CVPR, 2010.
- Jia et al. (????) Y. Jia, M. Salzmann, T. Darrell, Learning cross-modality similarity for multinomial data, in: ICCV, 2011.
- Wu et al. (2010) W. Wu, J. Xu, H. Li, Learning similarity function between objects in heterogeneous spaces, Microsoft Research Technique Report (2010).
- Mignon and Jurie (????) A. Mignon, F. Jurie, Cmml: a new metric learning approach for cross modal matching, in: ACCV, 2012.
- Zhu et al. (????) X. Zhu, Z. Huang, H. T. Shen, X. Zhao, Linear cross-modal hashing for efficient multimedia search, in: ACM MM, 2013.
- Liu et al. (2013) Y. Liu, J. Shao, J. Xiao, F. Wu, Y. Zhuang, Hypergraph spectral hashing for image retrieval with heterogeneous social contexts, Neurocomputing 119 (2013) 49–58.
- Kim et al. (2007) T.-K. Kim, J. Kittler, R. Cipolla, Discriminative learning and recognition of image set classes using canonical correlations, TPAMI 29 (2007) 1005–1018.
- Costa Pereira et al. (2014) J. Costa Pereira, E. Coviello, G. Doyle, N. Rasiwasia, G. R. Lanckriet, R. Levy, N. Vasconcelos, On the role of correlation and abstraction in cross-modal multimedia retrieval, TPAMI 36 (2014) 521–535.
- Sharma and Jacobs (????) A. Sharma, D. W. Jacobs, Bypassing synthesis: PLS for face recognition with pose, low-resolution and sketch, in: CVPR, 2011.
- Liu et al. (????) G. Liu, Z. Lin, Y. Yu, Robust subspace segmentation by low-rank representation, in: ICML, 2010.
- Liu and Yan (????) G. Liu, S. Yan, Latent low-rank representation for subspace segmentation and feature extraction, in: ICCV, 2011.
- Liu et al. (2013) G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, Y. Ma, Robust recovery of subspace structures by low-rank representation, TPAMI 35 (2013) 171–184.
- Zhu et al. (????) Y. Zhu, D. Huang, F. D. L. Torre, S. Lucey, Complex non-rigid motion 3d reconstruction by union of subspaces, in: CVPR, 2014.
- Chang et al. (????a) X. Chang, F. Nie, Y. Yang, H. Huang, A convex formulation for semi-supervised multi-label feature selection, in: AAAI, 2014.
- Chang et al. (????b) X. Chang, F. Nie, Z. Ma, Y. Yang, X. Zhou, A convex formulation for spectral shrunk clustering, in: AAAI, 2015.
- Chang et al. (2016) X. Chang, F. Nie, S. Wang, Y. Yang, X. Zhou, C. Zhang, Compound rank-k projections for bilinear analysis, TNNLS 27 (2016) 1502–1513.
- Wang et al. (2016) K. Wang, Q. Yin, W. Wang, S. Wu, L. Wang, A comprehensive survey on cross-modal retrieval, arXiv preprint arXiv:1607.06215 (2016).
- Liu et al. (2010) J. Liu, C. Xu, H. Lu, Cross-media retrieval: state-of-the-art and open issues, International Journal of Multimedia Intelligence and Security 1 (2010) 33–52.
- Xu et al. (????) P. Xu, K. Li, Z. Ma, Y.-Z. Song, L. Wang, J. Guo, Cross-modal subspace learning for sketch-based image retrieval: A comparative study, in: IC-NIDC, 2016.
- Lin and Tang (????) D. Lin, X. Tang, Inter-modality face recognition, in: ECCV, 2006.
- Gong et al. (2014) Y. Gong, Q. Ke, M. Isard, S. Lazebnik, A multi-view embedding space for modeling internet images, tags, and their semantics, IJCV 106 (2014) 210–233.
- Jolliffe (2002) I. Jolliffe, Principal component analysis, Springer verlag, 2002.
- Ranjan et al. (????) V. Ranjan, N. Rasiwasia, C. V. Jawahar, Multi-label cross-modal retrieval, in: ICCV, 2015.
- Ramsay (2006) J. O. Ramsay, Functional data analysis, Wiley Online Library, 2006.
- Rosipal and Krämer (2006) R. Rosipal, N. Krämer, Overview and recent advances in partial least squares, in: Subspace, latent structure and feature selection, 2006, pp. 34–51.
- Baek and Kim (2004) J. Baek, M. Kim, Face recognition using partial least squares components, PR 37 (2004) 1303–1306.
- Dhanjal et al. (2009) C. Dhanjal, S. R. Gunn, J. Shawe-Taylor, Efficient sparse kernel feature extraction based on partial least squares, TPAMI 31 (2009) 1347–1361.
- Li et al. (????) A. Li, S. Shan, X. Chen, W. Gao, Maximizing intra-individual correlations for face recognition across pose differences, in: CVPR, 2009.
- Štruc and Pavešić (2009) V. Štruc, N. Pavešić, Gabor-based kernel partial-least-squares discrimination features for face recognition, Informatica 20 (2009) 115–138.
- Schwartz et al. (????) W. R. Schwartz, H. Guo, L. S. Davis, A robust and scalable approach to face identification, in: ECCV, 2010.
- Chen et al. (????) Y. Chen, L. Wang, W. Wang, Z. Zhang, Continuum regression for cross-modal multimedia retrieval, in: ICIP, 2012.
- Gu et al. (????) Q. Gu, Z. Li, J. Han, Joint feature selection and subspace learning, in: IJCAI, 2011.
- He et al. (????) R. He, T. Tan, L. Wang, W.-S. Zheng, ℓ2,1 regularized correntropy for robust feature selection, in: CVPR, 2012.
- Nie et al. (????) F. Nie, H. Huang, X. Cai, C. H. Ding, Efficient and robust feature selection via joint ℓ2,1-norms minimization, in: NIPS, 2010.
- Angst et al. (????) R. Angst, C. Zach, M. Pollefeys, The generalized trace-norm and its application to structure-from-motion problems, in: ICCV, 2011.
- Fornasier et al. (2011) M. Fornasier, H. Rauhut, R. Ward, Low-rank matrix recovery via iteratively reweighted least squares minimization, SIAM Journal on Optimization 21 (2011) 1614–1640.
- Grave et al. (????) E. Grave, G. R. Obozinski, F. R. Bach, Trace lasso: a trace norm regularization for correlated designs, in: NIPS, 2011.
- Harchaoui et al. (????) Z. Harchaoui, M. Douze, M. Paulin, M. Dudik, J. Malick, Large-scale image classification with trace-norm regularization, in: CVPR, 2012.
- Rasiwasia et al. (????) N. Rasiwasia, D. Mahajan, V. Mahadevan, G. Aggarwal, Cluster canonical correlation analysis, in: AISTATS, 2014.
- Nicole and Sugiyama (2011) K. Nicole, M. Sugiyama, The degrees of freedom of partial least squares regression, Journal of the American Statistical Association 106 (2011) 697–705.
- Wang et al. (????) A. Wang, J. Cai, J. Lu, T.-J. Cham, Mmss: Multi-modal sharable and specific feature learning for rgb-d object recognition, in: ICCV, 2015.
- Xiong et al. (????) C. Xiong, X. Zhao, D. Tang, K. Jayashree, S. Yan, T.-K. Kim, Conditional convolutional neural network for modality-aware face recognition, in: ICCV, 2015.
- Yan et al. (????) K. Yan, Y. Wang, D. Liang, T. Huang, Y. Tian, Cnn vs. sift for image retrieval: Alternative or complementary?, in: ACM MM, 2016.
- Shafiee et al. (????) S. Shafiee, F. Kamangar, A multi-modal sparse coding classifier using dictionaries with different number of atoms, in: WACV, 2015.
- Gwon et al. (2016) Y. Gwon, W. Campbell, K. Brady, D. Sturim, M. Cha, H. Kung, Multimodal sparse coding for event detection, arXiv preprint arXiv:1605.05212 (2016).
- Zhu and Lucey (2015) Y. Zhu, S. Lucey, Convolutional sparse coding for trajectory reconstruction, TPAMI 37 (2015) 529–540.
- Wang et al. (2016) Q. Wang, J. Lin, Y. Yuan, Salient band selection for hyperspectral image classification via manifold ranking, TNNLS 27 (2016) 1279–1289.
- Lei et al. (2016) J. Lei, B. Wang, Y. Fang, W. Lin, P. L. Callet, N. Ling, C. Hou, A universal framework for salient object detection, TMM 18 (2016) 1783–1795.