I Introduction
In the last few years, there has been explosive growth of 3D shape data, driven by increasing demands from real industrial applications such as virtual reality and LiDAR-based autonomous vehicles. 3D shape related techniques have thus emerged as hot research topics. Retrieving a certain category of 3D shapes from a given database is one of the fundamental problems for 3D shape based applications. A lot of effort has been devoted to 3D shape retrieval queried by 3D models [23, 25], which is intuitively straightforward, but such query models are difficult to acquire. Alternatively, the freehand sketch is a more convenient way for humans to interact with data collection and processing systems, especially with the sharply increased use of touchpad devices such as smart phones and tablet computers. As a consequence, sketch-based 3D shape retrieval, i.e., searching 3D shapes queried by sketches, has attracted more and more attention [2, 22, 13, 24].

Despite their succinctness and convenience of acquisition, freehand sketches suffer from two disadvantages in the application of 3D shape retrieval, making sketch-based 3D shape retrieval an extremely challenging task. Firstly, sketches are usually drawn subjectively in uncontrolled environments, resulting in severe intra-class variations as shown in Fig. 3. Secondly, sketches and 3D shapes have heterogeneous data structures, which leads to large cross-modality divergences.
A variety of models have been proposed to address the aforementioned two issues, which can be roughly divided into two categories, i.e., representation based methods and matching based methods. The first category aims to extract robust features for both sketches and 3D shapes [10, 13, 2, 3, 22, 30, 29, 24]. However, due to the heterogeneity of sketches and 3D shapes, it is quite difficult to achieve modality-invariant discriminative representations. On the other hand, matching based methods focus on developing effective models for calculating similarities or distances between sketches and 3D shapes, among which deep metric learning based models [22, 1, 24] have achieved the state-of-the-art performance. Nevertheless, these methods fail to explore the varying importance of different training samples. Besides, they can merely enhance local cross-modality correlations, by selecting data pairs or triplets across modalities, without taking into account the holistic data distributions. As a consequence, the learned deep metrics might be less discriminative, and lack generalization to unseen test data.
To overcome the drawbacks of existing works, we propose a novel model, namely Deep Cross-modality Adaptation (DCA), for sketch-based 3D shape retrieval. Fig. 1 shows the framework of our proposed model. We first construct two separate deep convolutional neural networks (CNNs) and metric networks, one for sketches and the other for 3D shapes, to learn discriminative modality-specific features for each modality via importance-aware metric learning (IAML). Through mining the hardest samples in each mini-batch for training, IAML can exploit the importance of training data, and therefore learn discriminative representations more efficiently. Furthermore, in order to reduce the large cross-modality divergence between learned features of sketches and 3D shapes, we explicitly introduce a cross-modality transformation network to transfer features of sketches into the feature space of 3D shapes. An adversarial learning method with class-aware cross-modality mean discrepancy minimization (CMD-AL) is developed to train the transformation network, which acts as a generator. Since CMD-AL is able to enhance correlations between the distribution of transferred sketch data and that of 3D shape data, our model can compensate for the cross-modality discrepancy in a holistic way. IAML is also applied to the transformed data, in order to further preserve semantic structures of sketch data after adaptation. The main contributions of this paper are threefold:
1) We propose a novel deep cross-modality adaptation model via semantics preserving adversarial learning. To the best of our knowledge, this is the first work that incorporates adversarial learning into sketch-based 3D shape retrieval.
2) We develop a new adversarial learning based method for training the deep cross-modality adaptation network, which simultaneously reduces the holistic cross-modality discrepancy of data distributions and enhances semantic correlations of local data batches across modalities.
3) We significantly boost the performance over existing state-of-the-art sketch-based 3D shape retrieval methods on two large benchmark datasets.
II Related Work
In the literature, most existing works on sketch-based 3D shape retrieval concentrate on building modality-invariant representations for sketches and 3D shapes, and on developing discriminative matching models. Various hand-crafted features have been employed, such as Zernike moments, the contour-based Fourier descriptor, eccentricity and circularity features [14], the chordal axis transform based shape descriptor [26], HOG-SIFT features [27], the local improved Pyramid of Histograms of Orientation Gradients (iPHOG) [12], the sparse coding spatial pyramid matching feature (ScSPM), and the local depth scale-invariant feature transform (LD-SIFT) [30]. Besides, many learning-based features have been developed, including bag-of-features (BoF) with Gabor local line based features (GALIF) [11] and dense SIFT with BoF [3]. Meanwhile, numerous matching approaches have also been developed, such as manifold ranking [3], dynamic time warping [26], sparse coding based matching [27] and adaptive view clustering [10, 12].

Recently, various deep models have been developed for both feature extraction and matching, which are closely related to our proposed method. In [22], two Siamese CNNs were employed to learn discriminative features of sketches and 3D shapes by minimizing within-modality and cross-modality losses. In [30], pyramid cross-domain neural networks were utilized to compensate for cross-domain divergences. In [1] and [25], Siamese metric networks were employed to minimize both within-modality and cross-modality intra-class distances whilst maximizing inter-class distances. In [25], Wasserstein barycenters were additionally employed to aggregate multi-view deep features of images rendered from 3D models. However, these methods only reduce the local cross-modality divergence, and do not consider removing the data distribution shift across modalities. In contrast, our proposed model employs an adversarial learning based method to mitigate the discrepancy between the distributions of the two modalities in a holistic way, whilst addressing the local divergence issue by introducing a class-aware mean discrepancy term. Moreover, we apply IAML to mine the importance of training samples, which has also been ignored by current works.
Another branch of works related to ours is supervised discriminative adversarial learning for domain adaptation. In [4, 21, 15], a variety of adversarial discriminative models were developed for domain adaptation. The basic idea of these methods is to remove the domain shift between the source and target domains by employing a domain discriminator and an adversarial loss. However, these works concentrate on scenarios where few labeled data are available in the target domain (despite abundant labeled data in the source domain), and are unable to jointly explore local discriminative semantic structures of both domains, making them unsuitable for our task. In [28], the authors also explicitly adopted a transformation network to transfer data from the source domain to the target domain, where the cross-domain divergence is mitigated by an adversarial loss. However, they used hand-crafted features, while our model employs deep CNNs to learn discriminative modality-specific features, and integrates them with the transformation network as a whole. Moreover, we introduce a class-aware cross-modality mean discrepancy term into the original adversarial loss. This term can enhance semantic correlations of data distributions across modalities as well as remove the domain shift, which is largely neglected by existing works.
III Deep Cross-modality Adaptation
As illustrated in Fig. 1, our proposed framework mainly consists of five components: the CNN networks for 2D sketches (denoted by $F_s$) and for 3D shapes (denoted by $F_h$), the fully connected metric networks for 2D sketches (denoted by $M_s$) and for 3D shapes (denoted by $M_h$), together with the cross-modality transformation network $T$, whose parameters are $\theta_{F_s}$, $\theta_{F_h}$, $\theta_{M_s}$, $\theta_{M_h}$ and $\theta_T$, respectively.
Similar to most existing deep learning methods, we train our model with mini-batches. In order to describe our method more conveniently, we build image batches from the whole training data in a slightly different way from random sampling. Specifically, for 2D sketches, we first select $C$ classes randomly, and then collect $K$ images for each class. The selected images comprise a mini-batch $\mathcal{X}^s = \{x^s_i\}_{i=1}^{N}$ of size $N = CK$, whose corresponding class labels are denoted by $\{y^s_i\}_{i=1}^{N}$. Following the same procedure, a batch of 3D shapes $\mathcal{X}^h = \{x^h_i\}_{i=1}^{N}$ is constructed, together with labels $\{y^h_i\}_{i=1}^{N}$. To characterize a 3D shape, we utilize the widely used multi-view representation as in [18, 1, 24], i.e., projecting a 3D shape to grayscale images from rendered views that are evenly distributed around the 3D shape. Thereafter, each $x^h_i$ consists of $N_v$ ($N_v = 12$ is used in our paper) 2D rendered images of the corresponding 3D shape.

As demonstrated in Fig. 1, we train the CNN and metric networks for sketches, i.e., $F_s$ and $M_s$, jointly by adopting an importance-aware metric learning method, which exploits the hardest training samples within a mini-batch. The CNN and metric networks for 3D shapes, i.e., $F_h$ and $M_h$, are trained in the same way. The cross-modality transformation network $T$ is learned by preserving semantic structures of transformed features, and employing an adversarial learning based training strategy with class-aware cross-modality mean discrepancy minimization.
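This balanced batch construction can be sketched in a few lines of NumPy; the function name and the with-replacement fallback for under-populated classes are illustrative assumptions, not part of the paper:

```python
import numpy as np

def sample_balanced_batch(labels, n_classes=16, n_per_class=4, seed=None):
    """Build one C x K mini-batch: draw `n_classes` classes at random,
    then `n_per_class` examples from each class (with replacement only
    if a class has fewer examples than requested)."""
    rng = np.random.default_rng(seed)
    classes = rng.choice(np.unique(labels), size=n_classes, replace=False)
    idx = []
    for c in classes:
        pool = np.flatnonzero(labels == c)
        idx.extend(rng.choice(pool, size=n_per_class,
                              replace=len(pool) < n_per_class))
    return np.asarray(idx)

labels = np.repeat(np.arange(90), 80)   # e.g. 90 classes, 80 sketches each
batch = sample_balanced_batch(labels, n_classes=16, n_per_class=4, seed=0)
```

With the settings reported later in the paper (16 classes per batch, 4 images per class), this yields batches of size 64.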
In the rest of this paper, we will elaborate on the details of the proposed method, including the importance-aware metric learning, the semantics preserving adversarial learning, and the optimization algorithm. Without loss of generality, all loss functions are formulated based on the image batches $\mathcal{X}^s$ and $\mathcal{X}^h$ throughout this paper, and can be easily extended to the whole training data.

III-A Importance-Aware Feature Learning
Given a mini-batch $\mathcal{X} = \{x_i\}_{i=1}^{N}$ from either modality, after successively passing through the CNN network $F$ and the metric network $M$, we can obtain a set of feature vectors
$$\mathcal{F} = \{f_1, f_2, \dots, f_N\},$$
where $f_i = M(F(x_i))$ for $i = 1, \dots, N$.
Ideally, in order to learn discriminative features for each modality (i.e., the 2D sketches or the 3D shapes), the inter-class distances within the batch need to be larger than the intra-class distances. To achieve this, inspired by [7], we adopt the following loss function for importance-aware metric learning:

$$\mathcal{L}_{IAML} = \frac{1}{N} \sum_{i=1}^{N} \Big[\, m + \| f_i - f_{p(i)} \|_2 - \| f_i - f_{n(i)} \|_2 \Big]_+, \quad (1)$$

where

$$f_{n(i)} = \arg\min_{f_j \,:\, y_j \neq y_i} \| f_i - f_j \|_2, \quad (2)$$

$$f_{p(i)} = \arg\max_{f_j \,:\, y_j = y_i} \| f_i - f_j \|_2, \quad (3)$$

$[\cdot]_+ = \max(\cdot, 0)$, and $m > 0$ is a constant margin.
As can be seen from Eq. (2), for a certain anchor point $f_i$, $f_{n(i)}$ is the sample that has the minimal Euclidean distance to $f_i$ among those samples from different classes. And from Eq. (3), we can see that $f_{p(i)}$ is the sample that has the maximal Euclidean distance to $f_i$, among samples belonging to the same class as $f_i$. In other words, $\| f_i - f_{n(i)} \|_2$ and $\| f_i - f_{p(i)} \|_2$ indicate the minimal inter-class Euclidean distance and the maximal intra-class Euclidean distance with respect to $f_i$ within the batch, respectively. Therefore, $f_{p(i)}$ and $f_{n(i)}$ are the batch-wise "hardest positive" and "hardest negative" samples w.r.t. $f_i$, and should be given higher importance during training. Existing deep metric learning based models [1, 25] treat all training samples equally. In contrast, our IAML first explores the hardest positive and negative training samples within a mini-batch, and enforces them to be consistent with semantics, making it more efficient to learn discriminative features.
By minimizing $\mathcal{L}_{IAML}$ in Eq. (1), $\| f_i - f_{n(i)} \|_2$ is forced to be greater than $\| f_i - f_{p(i)} \|_2 + m$. That is to say, the minimal inter-class distance is compelled to be larger than the maximal intra-class distance in the feature space, whilst keeping a certain margin $m$. Consequently, we can learn CNN and metric networks that extract discriminative features for each modality (i.e., 2D sketches or 3D shapes).
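As a concrete illustration, the batch-hard selection described above can be written in a few lines of NumPy; the function name and the margin value are illustrative, not from the paper:

```python
import numpy as np

def iaml_loss(features, labels, margin=0.5):
    """Importance-aware (batch-hard) triplet loss: for each anchor,
    pick the farthest same-class sample (hardest positive) and the
    nearest different-class sample (hardest negative), then apply a
    hinge with the given margin."""
    # Pairwise Euclidean distances within the batch.
    diff = features[:, None, :] - features[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    same = labels[:, None] == labels[None, :]
    hardest_pos = np.where(same, dist, -np.inf).max(axis=1)  # max intra-class
    hardest_neg = np.where(same, np.inf, dist).min(axis=1)   # min inter-class
    return np.maximum(margin + hardest_pos - hardest_neg, 0.0).mean()
```

For two well-separated clusters the hinge is inactive and the loss vanishes; overlapping clusters incur a positive penalty.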
III-B Cross-modality Transformation based on Adversarial Learning
By applying the importance-aware metric learning, i.e., minimizing the losses $\mathcal{L}_{IAML}^s$ and $\mathcal{L}_{IAML}^h$, we can learn discriminative features for sketches and shapes, denoted by $\mathcal{F}^s = \{f^s_i\}_{i=1}^{N}$ and $\mathcal{F}^h = \{f^h_i\}_{i=1}^{N}$, respectively. However, due to the large discrepancy between the data distributions of the two modalities, directly using $\mathcal{F}^s$ and $\mathcal{F}^h$ for cross-modality retrieval results in extremely poor performance.
To address this problem, we propose a cross-modality transformation network $T$, in order to adapt the learnt features of 2D sketches to the feature space of 3D shapes while removing cross-modality discrepancies.
Suppose $\tilde{\mathcal{F}}^s = \{\tilde{f}^s_i\}_{i=1}^{N}$ is the set of transformed features of sketches with class labels $\{y^s_i\}_{i=1}^{N}$, where $\tilde{f}^s_i = T(f^s_i)$ for $i = 1, \dots, N$. Ideally, the transformed features are expected to have the following properties, in order to guarantee good performance on the cross-modality retrieval task:
1) $\tilde{\mathcal{F}}^s$ should be semantics preserving, i.e., maintaining small intra-class distances and large inter-class distances.
2) $\tilde{\mathcal{F}}^s$ should have a data distribution correlated with that of $\mathcal{F}^h$, i.e., the learnt features of 3D shapes.
The first property aims to compel the transformed features to preserve semantics, whilst the second attempts to remove the cross-modality discrepancy by strengthening correlations between the data distributions of the two modalities.
As shown in Fig. 2, we introduce a semantics preserving term, by reusing the importance-aware metric learning, to accomplish 1). In order to achieve 2), we employ a cross-modality correlation enhancement term based on adversarial learning with class-aware cross-modality mean discrepancy minimization. We provide details of these two terms in the rest of this section.
Semantics Preserving Term. In order to preserve semantic structures, i.e., keeping small (large) intra-class (inter-class) distances, we apply the loss of importance-aware metric learning introduced above to the transformed data:

$$\mathcal{L}_{SeP} = \frac{1}{N} \sum_{i=1}^{N} \Big[\, m' + \| \tilde{f}^s_i - \tilde{f}^s_{p(i)} \|_2 - \| \tilde{f}^s_i - \tilde{f}^s_{n(i)} \|_2 \Big]_+, \quad (4)$$

where

$$\tilde{f}^s_{n(i)} = \arg\min_{\tilde{f}^s_j \,:\, y^s_j \neq y^s_i} \| \tilde{f}^s_i - \tilde{f}^s_j \|_2, \quad (5)$$

$$\tilde{f}^s_{p(i)} = \arg\max_{\tilde{f}^s_j \,:\, y^s_j = y^s_i} \| \tilde{f}^s_i - \tilde{f}^s_j \|_2, \quad (6)$$

and $m' > 0$ is a constant margin.
Cross-modality Correlation Enhancement Term. Generative adversarial networks (GANs) have recently emerged as an effective method to generate synthetic data [5]. The basic idea is to train two competing networks, a generator $G$ and a discriminator $D$, based on game theory. The generator $G$ is trained to map a noise vector $v$ to samples from the data distribution $p_{data}$. The discriminator $D$ is trained to distinguish synthetic data generated by $G$ from real data sampled from $p_{data}$. The problem of training GANs is formulated as follows:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}} [\log D(x)] + \mathbb{E}_{v \sim p_v} [\log (1 - D(G(v)))], \quad (7)$$

where $p_v$ is a prior distribution over $v$. It has been pointed out in [5] that the global equilibrium of the two-player game in Eq. (7) is achieved if and only if $p_g = p_{data}$, where $p_g$ is the distribution of the generated data.
In our model, we treat the transformation network $T$ as the generator $G$. Suppose $p_s$, $p_h$ and $p_{\tilde{s}}$ are the distributions of the learnt features of sketches, of 3D shapes, and of the transformed data (denoted by $\mathcal{F}^s$, $\mathcal{F}^h$ and $\tilde{\mathcal{F}}^s$), respectively. By solving the following problem

$$\min_T \max_D \; \mathbb{E}_{f^h \sim p_h} [\log D(f^h)] + \mathbb{E}_{f^s \sim p_s} [\log (1 - D(T(f^s)))], \quad (8)$$

we can expect that $p_{\tilde{s}} = p_h$, i.e., the transformed data has the same distribution as that of 3D shapes, if problem (8) reaches the global equilibrium. Consequently, the cross-modality discrepancy can be reduced.
Conventionally, problem (8) is solved by alternately optimizing $D$ and $T$ through minimizing the following two loss functions:

$$\mathcal{L}_D = - \mathbb{E}_{f^h \sim p_h} [\log D(f^h)] - \mathbb{E}_{f^s \sim p_s} [\log (1 - D(T(f^s)))], \quad (9)$$

$$\mathcal{L}_{adv} = - \mathbb{E}_{f^s \sim p_s} [\log D(T(f^s))]. \quad (10)$$
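Numerically, the alternating objectives amount to binary cross-entropy terms on the discriminator's scores: shape features labeled real, transformed sketch features labeled fake, and the generator (here $T$) trained to flip that decision. A minimal NumPy sketch, where the logit-based helpers are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discriminator_loss(real_logits, fake_logits):
    """L_D: score 3D-shape features as real and transformed sketch
    features T(f^s) as fake."""
    eps = 1e-12
    return (-np.log(sigmoid(real_logits) + eps).mean()
            - np.log(1.0 - sigmoid(fake_logits) + eps).mean())

def adversarial_loss(fake_logits):
    """L_adv: train T so that D scores T(f^s) as real
    (non-saturating generator loss)."""
    return -np.log(sigmoid(fake_logits) + 1e-12).mean()
```

A confident discriminator (large positive logits on real features, large negative ones on fake features) drives the discriminator loss toward zero while the adversarial loss grows, and vice versa.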
So far, we have trained a transformation network $T$ such that $p_{\tilde{s}} \approx p_h$ by minimizing $\mathcal{L}_{adv}$ and $\mathcal{L}_D$. Although the divergence between the distribution $p_{\tilde{s}}$ of transformed sketch features and the distribution $p_h$ of 3D model features can be diminished by adversarial learning, the cross-modality semantic structures are not taken into account. To address this problem, we further introduce the following term, namely the class-aware cross-modality mean discrepancy,
$$\mathcal{L}_{CMD} = \sum_{c=1}^{C} \Big\| \mathbb{E}_{y^s_i = c} [\tilde{f}^s_i] - \mathbb{E}_{y^h_j = c} [f^h_j] \Big\|_2^2, \quad (11)$$

into adversarial learning, where $c$ is the class label. By minimizing $\mathcal{L}_{CMD}$, the mean feature vector of class $c$ from the sketch modality is compelled to be close to the mean feature vector of the same class from the 3D shape modality.
In practice, provided a mini-batch, each expectation in Eq. (11) can be approximated by the batch-wise mean feature vector, i.e., $\mathbb{E}_{y_i = c}[f_i] \approx \frac{1}{N_c} \sum_{i: y_i = c} f_i$, where $N_c$ is the number of samples of class $c$ within the batch.
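The batch-wise approximation can be sketched as follows, summing over the classes present in both batches (the function name and interface are our own):

```python
import numpy as np

def cmd_loss(f_sketch, y_sketch, f_shape, y_shape):
    """Class-aware cross-modality mean discrepancy: squared Euclidean
    distance between the per-class batch means of transformed sketch
    features and 3D-shape features."""
    total = 0.0
    for c in np.intersect1d(y_sketch, y_shape):
        mu_s = f_sketch[y_sketch == c].mean(axis=0)  # sketch class mean
        mu_h = f_shape[y_shape == c].mean(axis=0)    # shape class mean
        total += np.sum((mu_s - mu_h) ** 2)
    return total
```

The loss is zero exactly when the per-class means of the two modalities coincide within the batch.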
Through minimizing $\mathcal{L}_{adv}$ together with $\mathcal{L}_{CMD}$, we obtain the adversarial learning method with class-aware cross-modality mean discrepancy minimization (CMD-AL), which enhances the semantic correlations across modalities.
By combining the semantics preserving loss $\mathcal{L}_{SeP}$ and the cross-modality correlation enhancing losses $\mathcal{L}_{adv}$ and $\mathcal{L}_{CMD}$, we finally obtain the loss function for training $T$:

$$\mathcal{L}_T = \mathcal{L}_{SeP} + \lambda_1 \mathcal{L}_{adv} + \lambda_2 \mathcal{L}_{CMD}, \quad (12)$$

where $\lambda_1$ and $\lambda_2$ are trade-off parameters.
III-C Optimization
In Eq. (1), we defined the loss function $\mathcal{L}_{IAML}^s$ for jointly training $F_s$ and $M_s$ of sketches, and the analogous loss function $\mathcal{L}_{IAML}^h$ for training $F_h$ and $M_h$ of 3D shapes. We also developed the loss function $\mathcal{L}_T$ for training the cross-modality transformation network $T$ in Eq. (12).

To learn the parameters of the proposed deep cross-modality adaptation model, we optimize the different networks in an alternating iterative way. Algorithm 1 summarizes the outline of the training process. Specifically, we first pre-train the CNN and metric networks of sketches and 3D shapes based on the loss in Eq. (1), and pre-train the cross-modality transformation network $T$ adversarially together with the discriminator $D$. After initialization, we then alternately update $\{F_s, M_s\}$, $\{F_h, M_h\}$, $T$, and the adversarial discriminator $D$, by minimizing $\mathcal{L}_{IAML}^s$, $\mathcal{L}_{IAML}^h$, $\mathcal{L}_T$, and $\mathcal{L}_D$, respectively. Throughout the whole training process, we use the Adam stochastic gradient method [8] as the optimizer.
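The alternating schedule can be sketched as follows; the update strings are stand-ins for one Adam step on the corresponding loss, and the phase lengths are illustrative, not taken from Algorithm 1:

```python
def train_schedule(n_pretrain_steps=2, n_joint_steps=3):
    """Return the order of parameter updates: pre-train the two
    modality-specific branches and the transformation network, then
    cycle through the four alternating updates."""
    schedule = []
    for _ in range(n_pretrain_steps):          # initialization phase
        schedule += ["F_s,M_s <- L_IAML^s", "F_h,M_h <- L_IAML^h",
                     "T <- L_adv", "D <- L_D"]
    for _ in range(n_joint_steps):             # alternating phase
        schedule += ["F_s,M_s <- L_IAML^s",    # sketch branch
                     "F_h,M_h <- L_IAML^h",    # shape branch
                     "T <- L_T (Eq. 12)",      # transformation network
                     "D <- L_D"]               # discriminator
    return schedule

schedule = train_schedule()
```

Each network is updated while the others are held fixed, which is the standard way of handling the competing objectives of $T$ and $D$.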
IV Experimental Results and Analysis
To evaluate the performance of our method, we conduct experiments on two widely used benchmark datasets for sketch-based 3D shape retrieval, i.e., SHREC 2013 and SHREC 2014.
SHREC 2013 [10, 11] is a large-scale dataset for sketch-based 3D shape retrieval. It consists of 7,200 sketches and 1,258 shapes from 90 classes, built by collecting human-drawn sketches [2] and 3D shapes from the Princeton Shape Benchmark (PSB) [16] that share common categories. For each class, there are 80 sketches in total, where 50 images are used for training and 30 images for testing. The number of 3D shapes differs across classes, with about 14 per class on average.
SHREC 2014 [14, 13] is a sketch track benchmark larger than SHREC 2013. It contains 13,680 sketches and 8,987 3D shapes in total, grouped into 171 classes. The 3D shapes are collected from various datasets, including SHREC 2012 [9] and the Toyohashi Shape Benchmark (TSB) [20]. Similar to SHREC 2013, there are 80 sketches per class, and about 53 3D shapes per class on average. The sketches are further split into 8,550 training data and 5,130 test data, where for each class, 50 images are used for training and the remaining 30 images for testing.
Fig. 3 shows some samples from the two datasets. As illustrated, retrieving 3D shapes by sketches is quite challenging, due to large intra-class variations and cross-modality discrepancies between sketches and 3D shapes.
IV-A Implementation Details
In this subsection, we will provide implementation details about our proposed method, including network structures and parameter settings.
Network Structures. For the CNN networks of both sketches and shapes, i.e., $F_s$ and $F_h$, we utilize the ResNet-50 network [6]. Specifically, we use the layers of ResNet-50 up to and including the "pooling5" layer. As for the metric networks of sketches and 3D shapes, i.e., $M_s$ and $M_h$, both consist of four fully connected layers with sizes 2048-1024-512-256-128. We utilize the "relu" activation function and batch normalization for all layers in the metric networks, except that the last layer uses the "tanh" activation function. As for the cross-modality transformation network $T$, we adopt a network with four fully connected layers with sizes 128-64-32-64-128, where the first three layers use the "relu" activation function and the last layer uses the "tanh" activation function. The discriminator $D$ is a fully connected network with sizes 128-64-1.

Parameter Settings. We set the maximal number of iterative steps to 30,000. The learning rate decays exponentially after 10,000 steps. To generate the data batches $\mathcal{X}^s$ and $\mathcal{X}^h$, the number of classes per batch and the number of images per class are set to 16 and 4, respectively.
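The fully connected parts of these architectures can be sketched with NumPy as below; batch normalization is omitted, and the random initialization and helper names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(sizes):
    """Random weights/biases for a fully connected net with the given
    layer sizes."""
    return [(rng.normal(scale=0.05, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x, last_act=np.tanh):
    """ReLU on hidden layers; tanh (default) on the last layer."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        x = last_act(x) if i == len(params) - 1 else np.maximum(x, 0.0)
    return x

metric_net = make_mlp([2048, 1024, 512, 256, 128])  # M_s / M_h
transform_net = make_mlp([128, 64, 32, 64, 128])    # T
discriminator = make_mlp([128, 64, 1])              # D, read as raw logits

feats = forward(metric_net, rng.normal(size=(4, 2048)))
logits = forward(discriminator, forward(transform_net, feats),
                 last_act=lambda z: z)
```

The tanh output layer keeps the 128-dimensional features bounded in [-1, 1], which also constrains the input range of the transformation network.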
IV-B Evaluation Metrics
We adopt the most widely used metrics for sketch-based 3D shape retrieval: nearest neighbor (NN), first tier (FT), second tier (ST), E-measure (E), discounted cumulated gain (DCG) and mean average precision (mAP) [11, 1, 24]. We also report the precision-recall curve, a common metric for visually evaluating retrieval performance.
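For reference, three of these metrics can be computed per query as sketched below; the helper name and interface are our own, not from the cited benchmarks:

```python
import numpy as np

def retrieval_metrics(ranked_labels, query_label, n_relevant):
    """NN, first tier (FT) and average precision (AP) for one query,
    given database labels sorted by decreasing predicted similarity."""
    rel = (np.asarray(ranked_labels) == query_label)
    nn = float(rel[0])                           # top-1 hit
    ft = rel[:n_relevant].sum() / n_relevant     # recall within first tier
    hit_ranks = np.flatnonzero(rel) + 1          # 1-based ranks of hits
    ap = (np.arange(1, len(hit_ranks) + 1) / hit_ranks).mean() \
        if len(hit_ranks) else 0.0
    return nn, ft, ap
```

mAP is then the mean AP over all queries, and ST is computed like FT but over the top 2 * n_relevant results.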
IV-C Evaluation of the Proposed Method
In this section, we evaluate the effect of the proposed adversarial learning with class-aware cross-modality mean discrepancy minimization (CMD-AL), together with the semantics preserving (SeP) term.
As a baseline, we apply the importance-aware metric learning (IAML) to separately train the CNN and metric networks for 2D sketches and for 3D shapes, where the learnt feature vectors are directly used for retrieval. This baseline, denoted by sep-IAML, merely learns discriminative features, without considering the cross-modality issues. On top of sep-IAML, we employ the cross-modality transformation network $T$, trained by minimizing the adversarial losses with class-aware cross-modality mean discrepancy minimization. We denote this method by DCA (CMD-AL). By further adding the semantics preserving term, i.e., training $T$ with the full loss in Eq. (12), we obtain the complete model, denoted by DCA (CMD-AL+SeP). By comparing the performance of sep-IAML, DCA (CMD-AL) and DCA (CMD-AL+SeP), we can evaluate the effects of the proposed adversarial learning method and the semantics preserving term.
The results are summarized in Tables I and II. As can be seen, the baseline method sep-IAML yields rather poor performance, due to its weakness in dealing with cross-modality discrepancies. By introducing the adversarial learning method, DCA (CMD-AL) significantly boosts the performance of the baseline, implying that the adversarial learning can largely enhance the correlation between the data distributions of different modalities. Moreover, we see consistent improvements of DCA (CMD-AL+SeP) over DCA (CMD-AL) on both benchmarks. This indicates that the semantics preserving term helps to learn a more discriminative cross-modality transformation network.
IV-D Comparison with the State-of-the-art Methods
Retrieval Performance on SHREC 2013. Here we report experimental results of the proposed method on SHREC 2013, comparing with the state-of-the-art methods, including the cross-domain manifold ranking method (CDMR) [3], sketch-based retrieval with view clustering (SBR-VC) [10], the spatial proximity method (SP) [17], Fourier descriptors on 3D model silhouettes (FDC) [10], the edge-based Fourier spectra descriptor (EFSD) [10], the Siamese network (Siamese) [22], the chordal axis transform with dynamic time warping (CAT-DTW) [26], deep correlated metric learning (DCML) [1], and the learned Wasserstein barycentric representation method (LWBR) [24].
TABLE I: Retrieval performance on SHREC 2013.

Methods | NN | FT | ST | E | DCG | mAP
CDMR [3] | 0.279 | 0.203 | 0.296 | 0.166 | 0.458 | 0.250
SBR-VC [10] | 0.164 | 0.097 | 0.149 | 0.085 | 0.348 | 0.114
SP [17] | 0.017 | 0.016 | 0.031 | 0.018 | 0.240 | 0.026
FDC [10] | 0.110 | 0.069 | 0.107 | 0.061 | 0.307 | 0.086
Siamese [22] | 0.405 | 0.403 | 0.548 | 0.287 | 0.607 | 0.469
CAT-DTW [26] | 0.235 | 0.135 | 0.198 | 0.109 | 0.392 | 0.141
KECNN [19] | 0.320 | 0.319 | 0.397 | 0.236 | 0.489 | NA
DCML [1] | 0.650 | 0.634 | 0.719 | 0.348 | 0.766 | 0.674
LWBR [24] | 0.712 | 0.725 | 0.785 | 0.369 | 0.814 | 0.752
Baseline (sep-IAML) | 0.011 | 0.015 | 0.028 | 0.016 | 0.234 | 0.037
DCA (CMD-AL) | 0.762 | 0.776 | 0.812 | 0.370 | 0.842 | 0.795
DCA (CMD-AL+SeP) | 0.783 | 0.796 | 0.829 | 0.376 | 0.856 | 0.813
Fig. 4 shows the precision-recall curves of the proposed method and the compared approaches. As illustrated, the precision rate of our method is significantly higher than those of the compared models when the recall rate is smaller than 0.8. Considering that the top retrieved results matter most in practice, our method therefore performs significantly better than the state-of-the-art approaches.
We also report the NN, FT, ST, E, DCG and mAP of various methods, including CDMR, SBR-VC, SP, FDC, Siamese, DCML, LWBR and the proposed method. As summarized in Table I, our approach yields the best retrieval performance w.r.t. all evaluation metrics. Among all compared approaches, Siamese, DCML and LWBR are deep metric learning based models. They directly map data from different modalities into a common embedding subspace, where both the single-modality and cross-modality intra-class Euclidean distances are decreased, and the inter-class distances are simultaneously enlarged. However, they treat each training sample equally, and fail to explore the varying importance of distinct samples. Besides, they only reduce the local cross-modality divergences between data pairs or triplets, without considering the correlation between data distributions in a holistic way. In contrast, our method learns features by mining the batch-wise hardest positive and hardest negative samples. Through automatically selecting the most important training samples, we can learn discriminative features more efficiently. Moreover, we explicitly introduce a cross-modality transformation network, in order to transfer features from the sketch modality into the feature space of 3D shapes. By leveraging the semantics preserving adversarial learning, we simultaneously reduce holistic divergences between the data distributions of the two modalities, and enhance the semantic correlations. As a consequence, our method achieves better retrieval performance. For instance, the mAP of our method reaches 0.813, which is 34.4%, 13.9% and 6.1% higher than Siamese, DCML and LWBR, respectively.

Retrieval Performance on SHREC 2014. On this dataset, we compare our proposed model with the following state-of-the-art methods: the BoF with Gabor local line based feature (BF-fGALIF) [2], CDMR [3], SBR-VC [10], the depth-buffered vector of locally aggregated tensors (DB-VLAT) [20], SCMR-OPHOG [14], BoF junction-based extended shape context (BOF-JESC) [14], Siamese [22], DCML [1], and LWBR [24].

TABLE II: Retrieval performance on SHREC 2014.

Methods | NN | FT | ST | E | DCG | mAP
CDMR [3] | 0.109 | 0.057 | 0.089 | 0.041 | 0.328 | 0.054
SBR-VC [10] | 0.095 | 0.050 | 0.081 | 0.037 | 0.319 | 0.050
DB-VLAT [20] | 0.160 | 0.115 | 0.170 | 0.079 | 0.376 | 0.131
CAT-DTW [26] | 0.137 | 0.068 | 0.102 | 0.050 | 0.338 | 0.060
Siamese [22] | 0.239 | 0.212 | 0.316 | 0.140 | 0.496 | 0.228
DCML [1] | 0.272 | 0.275 | 0.345 | 0.171 | 0.498 | 0.286
LWBR [24] | 0.403 | 0.378 | 0.455 | 0.236 | 0.581 | 0.401
Baseline (sep-IAML) | 0.016 | 0.016 | 0.023 | 0.005 | 0.263 | 0.028
DCA (CMD-AL) | 0.745 | 0.766 | 0.808 | 0.392 | 0.845 | 0.782
DCA (CMD-AL+SeP) | 0.770 | 0.789 | 0.823 | 0.398 | 0.859 | 0.803
Fig. 5 provides precision-recall curves for BF-fGALIF, CDMR, SBR-VC, SCMR-OPHOG, OPHOG, DCML, LWBR and the proposed model. As shown, the precision rate of our proposed method is remarkably higher than those of the compared approaches when the recall rate is less than 0.8.
Besides the precision-recall curves, we additionally report the NN, FT, ST, E, DCG and mAP of CDMR, SBR-VC, DB-VLAT, Siamese, DCML and LWBR in Table II. As can be seen, the performance of existing deep metric learning based methods, including Siamese, DCML and LWBR, drops sharply on SHREC 2014. For example, the mAP of LWBR on SHREC 2014 is 0.401, around 35% lower than the mAP it achieved on SHREC 2013. The reason might lie in that SHREC 2014 has many more categories (90 classes in SHREC 2013 versus 171 classes in SHREC 2014) and far more 3D shapes (1,258 3D shapes in SHREC 2013 versus 8,987 3D shapes in SHREC 2014), with more severe intra-class and cross-modality variations, making SHREC 2014 more challenging than SHREC 2013. As a comparison, the mAP of our proposed model merely drops about 1%, and reaches 0.803 on SHREC 2014. This result is 40.2%, 51.7% and 57.5% higher than that of LWBR, DCML and Siamese, respectively, indicating that our method is much more scalable than existing deep models.
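The quoted gaps are absolute mAP differences expressed in percentage points; they can be verified directly from the mAP columns of Tables I and II:

```python
# mAP values taken from Tables I and II.
map_2013 = {"Siamese": 0.469, "DCML": 0.674, "LWBR": 0.752, "DCA": 0.813}
map_2014 = {"Siamese": 0.228, "DCML": 0.286, "LWBR": 0.401, "DCA": 0.803}

# Gaps of DCA (CMD-AL+SeP) over prior deep models, in percentage points.
gaps_2013 = {k: round((map_2013["DCA"] - v) * 100, 1)
             for k, v in map_2013.items() if k != "DCA"}
gaps_2014 = {k: round((map_2014["DCA"] - v) * 100, 1)
             for k, v in map_2014.items() if k != "DCA"}

# Performance drop from SHREC 2013 to SHREC 2014, in percentage points.
drop_lwbr = round((map_2013["LWBR"] - map_2014["LWBR"]) * 100, 1)
drop_dca = round((map_2013["DCA"] - map_2014["DCA"]) * 100, 1)
```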
V Conclusions
In this paper, we proposed a novel cross-modality adaptation model for sketch-based 3D shape retrieval. We first learned modality-specific discriminative features for 2D sketches and 3D shapes, by employing importance-aware metric learning through mining the batch-wise hardest samples. To remove the cross-modality discrepancy, we proposed a transformation network, aiming to transfer the features of sketches into the feature space of 3D shapes. We developed an adversarial learning based method for training this network, enhancing correlations between holistic data distributions while preserving local semantic structures across modalities. As a consequence, we obtained discriminative transformed features of sketches that were also highly correlated with the data of 3D shapes. Extensive experimental results on two benchmark datasets demonstrated the superiority of the proposed method compared to the state-of-the-art approaches.
References
 [1] G. Dai, J. Xie, F. Zhu, and Y. Fang. Deep correlated metric learning for sketch-based 3d shape retrieval. In AAAI, pages 4002–4008, 2017.
 [2] M. Eitz, R. Richter, T. Boubekeur, K. Hildebrand, and M. Alexa. Sketch-based shape retrieval. ACM Trans. Graph., 31(4):31–1, 2012.
 [3] T. Furuya and R. Ohbuchi. Ranking on cross-domain manifold for sketch-based 3d model retrieval. In Cyberworlds (CW), 2013 International Conference on, pages 274–281. IEEE, 2013.

 [4] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
 [5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 [6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
 [7] A. Hermans, L. Beyer, and B. Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
 [8] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
 [9] B. Li, A. Godil, M. Aono, X. Bai, T. Furuya, L. Li, R. J. López-Sastre, H. Johan, R. Ohbuchi, C. Redondo-Cabrera, et al. Shrec’12 track: Generic 3d shape retrieval. 3DOR, 6, 2012.
 [10] B. Li, Y. Lu, A. Godil, T. Schreck, M. Aono, H. Johan, J. M. Saavedra, and S. Tashiro. SHREC’13 track: Large scale sketch-based 3D shape retrieval. 2013.
 [11] B. Li, Y. Lu, A. Godil, T. Schreck, B. Bustos, A. Ferreira, T. Furuya, M. J. Fonseca, H. Johan, T. Matsuda, et al. A comparison of methods for sketch-based 3d shape retrieval. Computer Vision and Image Understanding, 119:57–80, 2014.
 [12] B. Li, Y. Lu, H. Johan, and R. Fares. Sketch-based 3d model retrieval utilizing adaptive view clustering and semantic information. Multimedia Tools and Applications, 76(24):26603–26631, 2017.
 [13] B. Li, Y. Lu, C. Li, A. Godil, T. Schreck, M. Aono, M. Burtscher, Q. Chen, N. K. Chowdhury, B. Fang, et al. A comparison of 3d shape retrieval methods based on a large-scale benchmark supporting multimodal queries. Computer Vision and Image Understanding, 131:1–27, 2015.
 [14] B. Li, Y. Lu, C. Li, A. Godil, T. Schreck, M. Aono, M. Burtscher, H. Fu, T. Furuya, H. Johan, et al. SHREC’14 track: Extended large scale sketch-based 3d shape retrieval. In Eurographics Workshop on 3D Object Retrieval, volume 2014, 2014.
 [15] S. Motiian, Q. Jones, S. Iranmanesh, and G. Doretto. Few-shot adversarial domain adaptation. In Advances in Neural Information Processing Systems, pages 6673–6683, 2017.
 [16] P. Shilane, P. Min, M. Kazhdan, and T. Funkhouser. The princeton shape benchmark. In Shape modeling applications, 2004. Proceedings, pages 167–178. IEEE, 2004.
 [17] P. Sousa and M. J. Fonseca. Sketch-based retrieval of drawings using spatial proximity. Journal of Visual Languages & Computing, 21(2):69–80, 2010.
 [18] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 945–953, 2015.
 [19] H. Tabia and H. Laga. Learning shape retrieval from different modalities. Neurocomputing, 253:24–33, 2017.
 [20] A. Tatsuma, H. Koyanagi, and M. Aono. A large-scale shape benchmark for 3d object retrieval: Toyohashi shape benchmark. In Signal & Information Processing Association Annual Summit and Conference (APSIPA ASC), 2012 Asia-Pacific, pages 1–10. IEEE, 2012.

 [21] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), volume 1, page 4, 2017.
 [22] F. Wang, L. Kang, and Y. Li. Sketch-based 3d shape retrieval using convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 1875–1883. IEEE, 2015.
 [23] P.-S. Wang, Y. Liu, Y.-X. Guo, C.-Y. Sun, and X. Tong. O-cnn: Octree-based convolutional neural networks for 3d shape analysis. ACM Transactions on Graphics (TOG), 36(4):72, 2017.
 [24] J. Xie, G. Dai, F. Zhu, and Y. Fang. Learning barycentric representations of 3d shapes for sketch-based 3d shape retrieval. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3615–3623, July 2017.
 [25] J. Xie, G. Dai, F. Zhu, E. K. Wong, and Y. Fang. Deepshape: Deep-learned shape descriptor for 3d shape retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(7):1335–1345, 2017.
 [26] Z. Yasseen, A. Verroust-Blondet, and A. Nasri. View selection for sketch-based 3d model retrieval using visual part shape description. The Visual Computer, 33(5):565–583, 2017.
 [27] G.-J. Yoon and S. M. Yoon. Sketch-based 3d object recognition from locally optimized sparse features. Neurocomputing, 267:556–563, 2017.
 [28] Y. Zhang, R. Barzilay, and T. Jaakkola. Aspect-augmented adversarial networks for domain adaptation. arXiv preprint arXiv:1701.00188, 2017.

 [29] F. Zhu, J. Xie, and Y. Fang. Heat diffusion long-short term memory learning for 3d shape analysis. In European Conference on Computer Vision, pages 305–321. Springer, 2016.
 [30] F. Zhu, J. Xie, and Y. Fang. Learning cross-domain neural networks for sketch-based 3d shape retrieval. In AAAI, pages 3683–3689, 2016.