SketchDesc: Learning Local Sketch Descriptors for Multi-view Correspondence

In this paper, we study the problem of multi-view sketch correspondence, where we take as input multiple freehand sketches with different views of the same object and predict semantic correspondence among the sketches. This problem is challenging, since visual features of corresponding points at different views can be very different. To this end, we take a deep learning approach and learn a novel local sketch descriptor from data. We contribute a training dataset by generating the pixel-level correspondence for the multi-view line drawings synthesized from 3D shapes. To handle the sparsity and ambiguity of sketches, we design a novel multi-branch neural network that integrates a patch-based representation and a multi-scale strategy to learn the correspondence among multi-view sketches. We demonstrate the effectiveness of our proposed approach with extensive experiments on hand-drawn sketches, and multi-view line drawings rendered from multiple 3D shape datasets.



1 Introduction

Sketching as a universal form of communication provides arguably the most natural and direct way for humans to render and interpret the visual world. While it is not challenging for human viewers to interpret missing 3D information from single-view sketches, multi-view inputs are often needed for computer algorithms to recover the underlying 3D geometry due to the inherent ambiguity in single-view sketches. A key problem for interpreting multi-view sketches of the same object is to establish semantic correspondence among them. This problem has been mainly studied for 3D geometry reconstruction from careful engineering drawings in orthographic views [15]. The problem of establishing semantic correspondence from rough sketches in arbitrary views (e.g., those in Figure 1 (Bottom)) is challenging due to shape abstraction and distortions in both shape and view, and is largely unexplored. Addressing this problem can benefit various applications, e.g., designing a novel interface for users with little training in drawing to create 3D shapes using multi-view sketches.

In this work we take a first step toward learning to establish semantic correspondence between freehand sketches depicting the same object from different views. This demands a proper shape descriptor. However, traditional descriptors like Shape Context [5] and recent learning-based patch descriptors [47, 34, 48] are often designed to be invariant to 2D transformations (rotation, translation, and limited distortion) and cannot handle large view changes (e.g., with view disparity greater than 30 degrees). Such descriptors, especially those used for stereo matching [19, 49, 7, 32, 55] and image-based 3D modeling [50, 29], heavily exploit the texture and shading features of images to infer similarities among points or image patches, and thus are not directly applicable to our problem, because sketches contain only binary lines and points and exhibit inherent sparsity and ambiguity.

Figure 1: Multi-view sketch correspondence results by our SketchDesc on line drawings synthesized from 3D shapes (Top) and freehand sketches (Bottom).

We observe that human viewers can easily identify corresponding points from sketches with very different views. This is largely because human viewers have the knowledge of sketched objects in different views. Our key idea is thus to adapt deep neural networks previously used for learning patch-based descriptors but get the networks exposed to corresponding patches from multi-view sketches.

Training deep neural networks requires a large-scale dataset of sketch images with ground-truth semantic correspondence. Unfortunately, such training datasets are not available, and manually collecting multi-view hand-drawn sketches and labeling the ground-truth correspondence would be a demanding task. Following previous deep learning solutions for 3D interpretation of sketches [37, 12, 26], we synthesize the multi-view line drawings from 3D shapes (e.g., from ShapeNet [56]) by using non-photorealistic rendering. Given the synthesized dataset of multi-view line drawings as sketches, we project the vertices of the 3D models to the multi-view sketches to get ground-truth correspondences (Figure 2). This data generation pipeline emphasizes the correspondences induced by the 3D shapes and forces deep networks to learn these 3D correspondences among 2D multi-view sketches.

We formulate correspondence learning as a metric learning problem and build upon the latest techniques for metric learning and descriptor learning [47]. To find an effective feature descriptor for multi-view sketches, we combine a patch-based representation and a multi-scale strategy (Figure 3) to address the abstraction problem of sketches (akin to Sketch-A-Net [57]). We further design a multi-branch network (Figure 4) with shared weights to consume the patches at different scales. The patch-based representation helps the network capture the local features of a point on a sketch image by embedding the information of its neighboring pixels. Our multi-scale strategy feeds the network with both local and global perspectives so that it can learn the distinctive information at different scales.

The multi-scale patch representation allows the use of ground-truth correspondences that are away from sketch lines as the additional training data. This not only improves the correspondence accuracy but also enables correspondences inside the regions of sketched objects (e.g., the cup in Figure 1), potentially benefiting applications like sketch-based 3D shape synthesis. We evaluate our method by performing multi-view sketch correspondence pixel-wise retrieval tasks on a large-scale dataset of synthesized multi-view sketches based on three shape repositories: ShapeNet, Princeton Segmentation Benchmark (PSB) [6] and Structure Recovery [42] to show the effectiveness of our proposed framework. We also test our trained network on hand-drawn sketches (Figure 1) and show its robustness against shape and view distortions.

Our contributions are summarized as follows:

  • To the best of our knowledge, we are among the first to study the problem of multi-view sketch correspondence.

  • We introduce a multi-branch network to learn a patch-based descriptor that can be used to measure semantic similarity between patches from multi-view sketches.

  • We present a large-scale dataset consisting of 6,852 multi-view sketches with ground-truth correspondences for 18 categories, and will release the dataset to the research community.

2 Related Work

We review the literature that is closely related to our work, namely, the approaches for correspondence establishment in images, and the approaches for sketch analysis using deep learning technologies.

2.1 Correspondence Establishment in Image Domain

Image-based Modeling and Stereo Matching.

Image-based modeling often takes as input multiple images of an object [29, 1] or scene [50, 45] from different views, and aims to reconstruct the underlying 3D geometry. A typical approach to this problem is to first detect a sparse set of key points, then adopt a feature descriptor (e.g., SIFT [30]) to describe the patches centered at the key points, and finally conduct feature matching to build the correspondence among multi-view images. Stereo matching [19, 49] takes two images from different but often close viewpoints, and aims to establish dense pixel-level correspondence across images. Our problem is different from these tasks in the following ways. First, the view disparity in our input sketches is often much larger. Second, unlike natural photos, which have rich textures, our sketches have more limited information due to their line-based representation.

Local Image Descriptors.

Local image descriptors are typically derived from image patches centered at points of interest and designed to be invariant to certain factors, such as rotation, scale, or intensity, for robustness. Existing local image descriptors can be broadly categorized as hand-crafted descriptors and learning-based descriptors. A full review is beyond the scope of this work. We refer the interested readers to [9] for an insightful survey.

Classical descriptors include, to name a few, SIFT [30], SURF [4], Shape Context [5], and HOG [10]. The conventional local descriptors are mostly built upon low-level image properties and constructed using hand-crafted rules. Recently, learning-based local descriptors produced by deep convolutional neural networks (CNNs) [47, 34, 48] have shown superior performance over the hand-crafted descriptors, owing largely to the availability of large-scale image correspondence datasets [52, 2] obtained from 3D reconstructions. To learn robust 2D local descriptors, extensive research has been dedicated to the development of CNN designs [18, 58, 47, 59, 51], loss functions [23, 38, 3, 34, 21, 35], and training strategies [44, 8, 33]. The above methods, however, are not specially designed for learning multi-view sketch correspondence. GeoDesc [33] is closest in spirit to our work and employs geometry constraints from 3D reconstruction by Structure-from-Motion (SfM) to refine the training data. However, SfM heavily depends on the textures and shadings in the image domain, and is not suitable for sketches.

Multi-scale Strategy for Descriptors

Shilane and Funkhouser [43] computed a 128-D descriptor at four scales in spherical regions to describe distinctive regions on 3D shape surfaces. Recently, Huang et al. [20] proposed to learn 3D point descriptors from multi-view projections with progressively zoomed-out viewpoints. Inspired by these methods, we design a multi-scale strategy to gather local and global context to locate corresponding points in sketches across views (Figure 3). Different from [20], which focuses on rendered patches of 3D shapes, our work considers patches of sparse line drawings with limited textures as input. To reduce the network size and computation, we adopt a smaller input scale of 32 × 32 [47] rather than the 224 × 224 used in [20]. In addition, Huang et al. [20] used three viewpoints for 3D point descriptors, while our work learns descriptors of points in sketches drawn under a specific viewpoint for correspondence establishment across a larger range of views.

Figure 2: Illustration of how to synthesize multi-view sketches (of size 480 × 480), with ground-truth semantic correspondences (in green) mapped to the same points (in red) of a 3D shape.

Figure 3: An illustration of our multi-scale patch-based representation. Given a 480 × 480 sketch input and the location of a pixel of interest, we utilize patches at four scales (32 × 32, 64 × 64, 128 × 128, and 256 × 256) to perceive the pixels on sketches. A triplet example involving positive, anchor and negative patches is also given.

2.2 Deep Learning in Sketch Analysis

With the recent advances in deep learning techniques, a variety of deep learning based methods have been proposed for sketch analysis tasks such as fine-grained sketch-based image retrieval [40], sketch synthesis [17], sketch segmentation [28], sketch retrieval [54], and sketch recognition [57]. However, none of these existing solutions can be directly applied to our task. To make deep learning possible in sketch analysis, there exist multiple large-scale datasets of sketches including the TU-Berlin [13], QuickDraw [17], and Sketchy [40] datasets. However, they do not contain multi-view sketches and thus cannot be used to train our network. A sketch is generally represented as either a rasterized binary-pixel image [40] or vector sequences [17, 28] or both [54]. Since it is difficult to render line drawings from 3D shapes as a sequence of well-defined strokes, we use rasterized binary-pixel images to represent our training data and input sketches.

Multi-view Sketch Analysis

Multi-view sketches are often used in sketch-based 3D modeling [39, 37, 31, 12, 26]. Early sketch-based modeling methods (e.g., [39]) require precise engineering drawings as input or demand considerable mental effort, requiring users to first decompose a desired 3D shape into parts and then construct each part through careful engineering drawings [15]. To alleviate this issue, several recent methods [37, 31, 12, 26] leverage learning-based frameworks (e.g., GAN [14]) to obtain the priors from training data and then infer 3D shapes from novel input multi-view sketches (usually in orthographic views, namely, the front, back, and side views). However, these methods often process individual multi-view sketches in separate branches and do not explicitly consider semantic correspondence between input sketches.

SketchZooms by Navarro et al. [36] is a concurrent work and studies the sketch correspondence problem with a similar deep learning based solution. Our work is different from SketchZooms as follows. To generate the training data for cross-object correspondence, the used 3D shapes need to be semantically registered together in advance in SketchZooms, while our work explores training data generation from individual 3D shapes for cross-view correspondence. SketchZooms follows [20, 46] and adapts AlexNet [22] (40M parameters) with the final layer replaced with a view pooling layer. In contrast, our designed framework has a smaller size (1.4M parameters), and its high performance and efficiency shown in our experiments pave a path for our method to be more easily integrated into mobile/touch devices.

3 Methodology

Figure 4: The architecture of SketchDesc-Net. Our input is a four-scale patch pyramid (32 × 32, 64 × 64, 128 × 128, 256 × 256) centered at a pixel of interest on a sketch, with each scale rescaled to 32 × 32. Given the multi-scale patches, we design a multi-branch framework with shared weights to accept these rescaled patch inputs. The dashed lines represent the data flow from an input patch to an output descriptor. For the kernel size and stride in our network, we adopt the same settings as [47]. Finally, the output 128-D descriptor embeds the features from all four scales through concatenation and a fully connected layer.

Terminology. To standardize different expressions for points in the following sections, we refer to the points on the surface of a 3D shape as vertices; the points on a 2D sketch image as pixels; and the points exactly on the sketch lines as points.

For input sketches represented as rasterized images, a key problem in semantic correspondence learning is to measure the difference between a pair of pixels in a semantic way. This is essentially a metric learning problem in which a pair of corresponding pixels has a shorter distance than a pair of non-corresponding pixels in the metric space. Formally, it can be expressed as follows:

$$D(p_a, p_p) < D(p_a, p_n), \tag{1}$$

where $D(\cdot, \cdot)$ is a metric function, and $p_a$ and $p_p$ represent corresponding pixels from different sketches (e.g., a pair of positive and anchor in Figure 3) while $p_a$ and $p_n$ are a pair of non-corresponding pixels (e.g., a pair of anchor and negative).

We follow [47] to build the metric function by learning a sketch descriptor (SketchDesc) with a triplet loss function. Since sketch images are rather sparse, we adopt a multi-scale patch-based representation, with a multi-scale patch centered at a pixel of interest in a sketch. We resort to deep CNNs, which are highly capable of learning discriminative feature descriptors from sufficient training data. As illustrated in Figure 4, our network (Section 3.2) takes a multi-scale patch as input and outputs a 128-D descriptor that is both semantically meaningful and distinctive.

To train the network, we first synthesize line drawings of a 3D shape from different viewpoints as multi-view sketches. We then generate the ground-truth correspondences by first uniformly sampling points on the 3D shape and then projecting them to the corresponding multiple views. We will discuss the detailed process of data preparation in the following section.

3.1 Data Preparation

Multi-view Sketches with Ground-truth Correspondences.

We follow a strategy similar to that in [27, 53, 46] to synthesize sketches from 3D shapes. Specifically, we first render a 3D shape with aligned upright orientation to a normal map under a specific viewpoint, and then extract an edge map from the normal map using Canny edge detection. We adopt this approach instead of the commonly used suggestive contours [11], since the latter is designed for high-quality 3D meshes and cannot generate satisfactory contours from poorly triangulated meshes (e.g., airplanes and rifles in the Structure Recovery dataset [42]). Hidden lines are removed from the edge detection results. In our implementation, each sketch is resized to a 480 × 480 image. As mentioned in [13], most humans are not faithful artists and create sketches in a casual and random way, so only three clustered views [20] or a limited set of canonical views [36] may not be enough to cover the ways in which people depict corresponding points. To obtain more general multi-view sketches, we uniformly sample 12 viewpoints on the upper unit viewing hemisphere (with elevation angles of 15 to 45 degrees) for each 3D model.
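The viewpoint sampling can be illustrated with a minimal sketch. The exact sampling scheme is not specified above, so this block assumes one plausible reading: azimuths evenly spaced around the object at a single elevation chosen from the 15 to 45 degree band, with a z-up coordinate frame. The function name `sample_viewpoints` and its parameters are hypothetical.

```python
import math

def sample_viewpoints(num_views=12, elevation_deg=30.0):
    """Return unit camera positions (x, y, z) on the upper hemisphere.

    Assumption: evenly spaced azimuths at one fixed elevation, z axis up.
    """
    elevation = math.radians(elevation_deg)
    views = []
    for k in range(num_views):
        azimuth = 2.0 * math.pi * k / num_views
        views.append((
            math.cos(elevation) * math.cos(azimuth),
            math.cos(elevation) * math.sin(azimuth),
            math.sin(elevation),
        ))
    return views

viewpoints = sample_viewpoints()
```

With 12 views this reading gives neighboring cameras 30 degrees apart in azimuth; drawing the elevation per view from the same band would be an equally valid variant.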

By projecting each vertex to the corresponding views, we naturally construct ground-truth correspondences (with the projections from the same vertices) among synthesized multi-view sketches, as illustrated in Figure 2. We do not consider hidden vertex projections (invisible under the depth test). If the projections of a vertex are visible in fewer than 2 views, this vertex is not included in the ground-truth correspondences.
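The visibility rule above can be sketched as a small filter. This is an illustration under stated assumptions, not the authors' pipeline: projection and depth testing are taken as already done, and the input layout (a dictionary from vertex id to per-view pixel coordinates, with `None` marking a hidden projection) and the name `build_correspondences` are hypothetical.

```python
def build_correspondences(projections, min_views=2):
    """Keep vertices whose projection is visible in at least `min_views` views.

    projections: {vertex_id: {view_id: (x, y) or None}}, where None marks a
    projection that failed the depth test (assumed input format).
    """
    correspondences = {}
    for vertex_id, per_view in projections.items():
        visible = {view: xy for view, xy in per_view.items() if xy is not None}
        if len(visible) >= min_views:
            correspondences[vertex_id] = visible
    return correspondences

example = {
    "v0": {0: (10, 20), 1: (12, 25), 2: None},   # visible in two views: kept
    "v1": {0: (40, 40), 1: None, 2: None},       # visible in one view: dropped
    "v2": {0: None, 1: None, 2: None},           # never visible: dropped
}
kept = build_correspondences(example)
```

Each kept vertex yields one group of pixels, one per visible view, that are mutually corresponding.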

Figure 5: An illustration of two sampling mechanisms (Left: OR-sampling; Right: AND-sampling).

Multi-scale Patch Representation.

We represent each visible projection of a 3D vertex on sketch images with a patch-based representation centered at the corresponding pixel to capture the distinctive neighboring structures (Figure 3). To further handle the issues due to the sparsity of sketches and the lack of texture information, we adopt a multi-scale strategy. Given a sketch image, we employ a four-scale representation (i.e., 32 × 32, 64 × 64, 128 × 128, and 256 × 256) for a pixel, as illustrated in Figure 3.

The multi-scale patch-based representation allows us to sample ground-truth correspondences inside sketched objects, not necessarily on sketch lines. This significantly increases the number of ground-truth correspondences. However, we do require non-zero information in a multi-scale patch. We have tried two sampling mechanisms: 1) OR-sampling: a multi-scale patch is valid if the patch is non-empty at any of the scales (Figure 5 (Left)); 2) AND-sampling: a multi-scale patch is valid if the patch is non-empty at every scale (Figure 5 (Right)). The former often makes a multi-scale patch valid at almost every visible vertex projection, since the patch at the 256 × 256 scale is often non-empty given its relatively large size. In contrast, AND-sampling leads to the use of patches near the sketch lines only. We will compare the performance of these two sampling mechanisms in Section 4.
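The two validity tests can be sketched as follows. This is a minimal illustration assuming a binary sketch stored as a list of rows, with stroke pixels set to 1; treating pixels outside the image as empty is an assumption, since border handling is not described. The names `patch_is_non_empty` and `is_valid_sample` are hypothetical.

```python
SCALES = (32, 64, 128, 256)

def patch_is_non_empty(image, cx, cy, size):
    """True if any stroke pixel lies in the size x size window centered at (cx, cy)."""
    half = size // 2
    height, width = len(image), len(image[0])
    # Clip the window to the image; outside pixels count as empty (assumption).
    y0, y1 = max(cy - half, 0), min(cy + half, height)
    x0, x1 = max(cx - half, 0), min(cx + half, width)
    return any(image[y][x] for y in range(y0, y1) for x in range(x0, x1))

def is_valid_sample(image, cx, cy, mode="or", scales=SCALES):
    """OR-sampling: non-empty at any scale. AND-sampling: non-empty at every scale."""
    flags = [patch_is_non_empty(image, cx, cy, s) for s in scales]
    if mode == "or":
        return any(flags)
    if mode == "and":
        return all(flags)
    raise ValueError("mode must be 'or' or 'and'")

# A 480 x 480 sketch with a single stroke pixel at x=300, y=240.
sketch = [[0] * 480 for _ in range(480)]
sketch[240][300] = 1
inside_or = is_valid_sample(sketch, 240, 240, "or")    # stroke falls in the 128 window
inside_and = is_valid_sample(sketch, 240, 240, "and")  # but not in the 32 window
```

Because the windows are concentric, the OR test reduces to checking the largest patch and the AND test to checking the smallest one.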

To feed the patches into our multi-branch network, we rescale all the patches to 32 × 32 (i.e., the smallest scale) by bilinear interpolation (Figure 4). Below we describe our network architecture in detail.
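The rescaling step can be sketched in plain Python. This is a minimal bilinear resampler for square patches stored as lists of rows; the pixel-center alignment convention and the name `resize_bilinear` are assumptions, and a practical implementation would more likely call an image library.

```python
def resize_bilinear(patch, out_size=32):
    """Resize a square patch (list of rows) to out_size x out_size bilinearly."""
    in_size = len(patch)
    scale = in_size / out_size

    def source_coord(index):
        # Map an output pixel center back to input coordinates, clamped to the patch.
        pos = (index + 0.5) * scale - 0.5
        pos = min(max(pos, 0.0), in_size - 1)
        low = int(pos)
        high = min(low + 1, in_size - 1)
        return low, high, pos - low

    out = []
    for i in range(out_size):
        y0, y1, wy = source_coord(i)
        row = []
        for j in range(out_size):
            x0, x1, wx = source_coord(j)
            top = patch[y0][x0] * (1 - wx) + patch[y0][x1] * wx
            bottom = patch[y1][x0] * (1 - wx) + patch[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out

ones_64 = [[1.0] * 64 for _ in range(64)]
rescaled = resize_bilinear(ones_64, 32)
```

Note that plain bilinear sampling reads only a 2 × 2 neighborhood per output pixel, so at the 256-to-32 factor a one-pixel-wide line can be skipped entirely; whether any pre-filtering is applied is not stated in the text.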

3.2 Network Architecture

We design a network architecture to learn a descriptor for measuring semantic distance between a pair of points in multi-view sketches. As illustrated in Figure 4, our network has four branches to process the four-scale input patches. The four branches share the same architecture and weights throughout the learning process. Each branch receives a 32 × 32 patch and outputs a 128 × 1 (i.e., 128-D) feature vector; the four feature vectors are then fused by the concatenation layer and the final fully connected layer. Note that due to the shared weights among the branches, the multi-branch structure does not increase the number of parameters. Our network produces a 128-D descriptor as output, which will be later used for sketch correspondence and pixel-wise retrieval in Section 4.
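The effect of weight sharing on model size can be made concrete with a short parameter count. The layer list below is an assumption: it follows the L2-Net-style layout of six 3 × 3 convolutions and one 8 × 8 convolution, without bias terms, which is only one reading of "the same settings as [47]". The fusion layer is assumed to be a single fully connected layer from the 4 × 128 concatenated features to 128 outputs.

```python
# (in_channels, out_channels, kernel_size) per convolution in one branch.
# Assumed L2-Net-style layout; not confirmed by the text.
BRANCH_CONVS = [
    (1, 32, 3), (32, 32, 3), (32, 64, 3), (64, 64, 3),
    (64, 128, 3), (128, 128, 3), (128, 128, 8),
]
NUM_BRANCHES = 4
DESCRIPTOR_DIM = 128

def conv_params(layers):
    # Bias-free convolutions (assumption): in * out * k * k weights each.
    return sum(c_in * c_out * k * k for c_in, c_out, k in layers)

def fc_params(in_features, out_features):
    # Fully connected layer with weights and bias.
    return in_features * out_features + out_features

branch = conv_params(BRANCH_CONVS)
fusion = fc_params(NUM_BRANCHES * DESCRIPTOR_DIM, DESCRIPTOR_DIM)
shared_total = branch + fusion                   # one set of branch weights
unshared_total = NUM_BRANCHES * branch + fusion  # four independent branches

print(f"shared: {shared_total:,}  unshared: {unshared_total:,}")
```

Under these assumptions the shared design has about 1.4M parameters, in line with the figure quoted in Section 2.2, while four unshared branches would need roughly four times as many. This is a plausibility check only, not a statement of the exact architecture.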

3.2.1 Design Choices

To accommodate the sparse nature of the sketch patches, in the initial design stage we experimented with several network schemes that are common in the image domain (e.g., attention and the shortcut connections of ResNet), but we did not observe any significant improvement over our current network architecture. Below we briefly discuss our attempts in the hope that they inspire readers to explore alternative solutions.

Mask Convolution.

As reported in [16], the active regions in sparse inputs like line drawings become blurrier after passing through several convolution layers, due to their diffusion to empty regions. To obtain the features tightly along the edges in sketch patches, one option is to utilize the input sketch patch itself as a mask to filter the diffused regions in the feature maps after every convolution layer. However, we find that this operation reduces the number of active regions in the feature maps, leading to degraded results.

Skip Connection.

A possible way to address the limited context in sparse input sketches is to gather more information from previous layers to the current layer. However, this strategy achieves a comparable performance to our current network framework.

3.3 Objective Function

To train our network, we organize the training data as triplets of patches (Figure 3) and feed them to our network. We define the output features of a triplet as $(f_a, f_p, f_n)$, where $f_a$ is the feature vector of an anchor pixel in one sketch, and $f_p$ and $f_n$ represent the features of the corresponding pixel (corresponding to the same vertex in a 3D shape as the anchor pixel) and a non-corresponding pixel, respectively. The non-corresponding pixel can be selected from either the same sketch or the other views. With the feature triplets, we adopt the triplet loss [41] to train the network. The triplet loss given in Equation 2 aims to reduce the distance between the feature descriptors of a pair of corresponding points $(f_a, f_p)$ and enlarge the distance between a pair of non-corresponding points $(f_a, f_n)$ in the metric space:

$$L_{triplet} = \frac{1}{N} \sum_{i=1}^{N} \max\left(0,\; d(f_a^i, f_p^i) - d(f_a^i, f_n^i) + m\right), \tag{2}$$

where $N$ is the number of triplets in a training batch, $(f_a^i, f_p^i, f_n^i)$ denotes the $i$-th triplet, $d(\cdot, \cdot)$ measures the Euclidean distance between two features, and $m$ is the margin, which is a fixed constant in our experiments.
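The triplet objective described above can be sketched in a few lines. This is a minimal illustration on plain Python lists rather than a training-ready implementation; averaging over the batch is an assumption, and the margin of 1.0 used in the example is a placeholder, not the value used in the experiments.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchors, positives, negatives, margin):
    """Mean hinge loss over a batch of (anchor, positive, negative) features."""
    losses = [
        max(0.0, euclidean(a, p) - euclidean(a, n) + margin)
        for a, p, n in zip(anchors, positives, negatives)
    ]
    return sum(losses) / len(losses)

anchors = [[0.0, 0.0], [0.0, 0.0]]
positives = [[3.0, 4.0], [0.0, 0.0]]   # distances to anchor: 5 and 0
negatives = [[0.0, 1.0], [3.0, 4.0]]   # distances to anchor: 1 and 5
loss = triplet_loss(anchors, positives, negatives, margin=1.0)  # placeholder margin
```

The first triplet violates the margin and contributes 5 - 1 + 1 = 5, while the second already satisfies it and contributes 0, so only hard or unresolved triplets produce a gradient.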

Structure-Recovery PSB ShapeNet
Shapes 54 12 27 16 16 25 20 20 20 20 20 20 20 20 20 38 35 35 39 23 28 34 25
Views 11 12 12 12 12 12 11 12 12 11 12 11 12 11 11 11 12 12 12 12 12 12 12
Table 1: The statistics of shape repositories used in our evaluation, and the number of multi-view sketches per shape.

4 Experiments

We conduct extensive experiments on a multi-view sketch dataset synthesized from three existing 3D shape repositories: the Structure-Recovery database [42], Princeton Segmentation Benchmark (PSB) [6] and ShapeNet [56]. Table 1 shows the detailed information of the number of 3D shapes, the names of selected categories, as well as the number of views used in each dataset.
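As a quick consistency check on Table 1, multiplying the number of shapes by the number of views in each column and summing reproduces the dataset size of 6,852 multi-view sketches stated in the contributions. The two lists below are copied from the table rows.

```python
shapes = [54, 12, 27, 16, 16, 25, 20, 20, 20, 20, 20, 20,
          20, 20, 20, 38, 35, 35, 39, 23, 28, 34, 25]
views = [11, 12, 12, 12, 12, 12, 11, 12, 12, 11, 12, 11,
         12, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12]

# One sketch per (shape, view) pair in every column.
total_sketches = sum(s * v for s, v in zip(shapes, views))
total_shapes = sum(shapes)
print(total_sketches, total_shapes)
```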

We also evaluated our method on a dataset of hand-drawn sketches. Our current evaluation focuses on the chair category, since chairs have rather complicated geometry and structure. Several participants were invited to create test sketches. They were asked to draw on a touchscreen from memory, after observing the model for a fixed amount of time. Each participant was asked to sketch 3 salient views that ordinary users are familiar with. Figure 1 (Bottom) shows some representative results of sketch correspondence.

4.1 Implementation Details

We train our network category-wise, i.e., separately for each object category, with a data splitting ratio of 8 : 1 : 1 (training : validation : testing). All multi-view sketches are rendered at a size of 480 × 480. We sample the training triplets in a batch randomly. Our network is trained on an RTX-series Ti graphics card and optimized with the Adam optimizer, using a fixed batch size, learning rate, and number of training epochs.

4.2 Performance Evaluation

To verify the effectiveness of our proposed network, we design two evaluation tasks: multi-view sketch correspondence and pixel-wise retrieval. We compare our approach with existing learning-based descriptors, including LeNet [24], L2-Net [47], HardNet [34], SOSNet [48] and AlexNet-based view pooling [20, 36] (AlexNet-VP for short). Note that we reimplemented LeNet, L2-Net, HardNet, and SOSNet following their original configurations. We also reimplemented AlexNet-VP and provide it with the same multi-scale patches as our network. The training inputs to all other networks are single-scale 32 × 32 patches. For a fair comparison, we only consider pixels that pass the AND-sampling criterion, i.e., the 32 × 32 patch of a multi-scale patch should be non-empty.

Figure 6: For a given pixel inside a sketched object under View 1, we find a corresponding point in the other sketch under View 2 by computing a distance map through our learned descriptor.

4.2.1 Sketch Correspondence

In this task we validate the performance of the learned descriptors in finding corresponding pixels in pairs of multi-view sketches in the test set. Given a pair of test sketches, for each pixel in one sketch, we compute its distances to all pixels in the other sketch (see distance visualization in Figure 6). We consider it as a successful matching if the pixel with the shortest distance is no further than 16 pixels (half of the smallest patch scale) away from the ground-truth pixel. Note that not all pixels in a sketch image have ground-truth correspondences (only among those projected). We report the averaged success rate as the correspondence accuracy.
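The success criterion above can be sketched as a nearest-neighbor search in descriptor space followed by a 16-pixel distance check in image space. The data layout (lists of coordinate and descriptor pairs) and the function names are hypothetical, and the descriptors in the example are toy 2-D vectors rather than learned 128-D ones.

```python
import math

def match_pixel(query_descriptor, candidates):
    """Return the (x, y) of the candidate whose descriptor is nearest to the query.

    candidates: list of ((x, y), descriptor) pairs from the other sketch.
    """
    best = min(candidates, key=lambda c: math.dist(query_descriptor, c[1]))
    return best[0]

def correspondence_accuracy(queries, candidates, threshold=16.0):
    """Fraction of queries whose best match lies within `threshold` pixels of
    the ground-truth location. queries: list of (descriptor, ground_truth_xy)."""
    hits = 0
    for descriptor, ground_truth in queries:
        predicted = match_pixel(descriptor, candidates)
        if math.dist(predicted, ground_truth) <= threshold:
            hits += 1
    return hits / len(queries)

candidates = [((10, 10), [0.0, 0.0]), ((100, 100), [1.0, 1.0])]
queries = [
    ([0.1, 0.0], (12, 20)),   # matched to (10, 10), about 10.2 px from ground truth
    ([0.9, 1.0], (10, 10)),   # matched to (100, 100), far from ground truth
]
accuracy = correspondence_accuracy(queries, candidates)
```

The per-candidate distances computed inside `match_pixel` are the same values that a distance map such as the one in Figure 6 would display.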

Figure 7: Sketch correspondence results by computing the distance maps with different approaches. Correct and wrong matching results are marked as green and red boxes, respectively.

The performance of our approach and other competing methods on several representative categories is reported in Table 2. The detailed experimental results can be found in Table 7 in the appendix. Our network outperforms its competitors. Due to the lack of texture in sketch patches, the frameworks designed for image patches (i.e., L2-Net, HardNet and SOSNet) cannot learn effective descriptors to match the corresponding pixels in multi-view sketches. Our descriptor also surpasses those based on LeNet and AlexNet-VP. Figure 7 gives some qualitative comparisons between different approaches. Since 32 × 32 patches are required to be non-empty, test pixels are near sketch lines. We observe that the descriptors learned by LeNet, HardNet, L2-Net and SOSNet are less discriminative, and are thus not effective in finding corresponding pixels among multi-view sketches. AlexNet-VP uses a large input patch size of 224 × 224, making it difficult to focus on local details in small regions. In addition, the max-pooling layer in AlexNet-VP may further decrease the performance of the learned descriptors [34]. As a consequence, AlexNet-VP produces more ambiguous distance maps (e.g., see the chair example in Figure 7) compared with our method. More correspondence results of multi-view sketches can be found in Figure 10.

Structure-Recovery PSB ShapeNet
Methods     Human  Rifle  Fish   Chair  Airplane  Cap    avg_acc
LeNet       0.280  0.426  0.296  0.285  0.394     0.209  0.315
L2-Net      0.379  0.521  0.359  0.267  0.436     0.252  0.369
HardNet     0.402  0.530  0.345  0.243  0.398     0.262  0.363
SOSNet      0.351  0.503  0.314  0.261  0.352     0.216  0.340
AlexNet-VP  0.635  0.670  0.633  0.443  0.674     0.586  0.607
SketchDesc  0.726  0.762  0.794  0.602  0.686     0.752  0.720
Table 2: Sketch correspondence accuracy (i.e., the averaged success rate) on representative categories with different methods. We highlight the best results in each category in boldface.

Structure-Recovery PSB ShapeNet
Methods     Chair  Human  Plier  Hand   Cap    Pistol  avg_map
LeNet       0.385  0.204  0.179  0.082  0.157  0.288   0.216
L2-Net      0.446  0.436  0.357  0.373  0.238  0.345   0.369
HardNet     0.423  0.398  0.214  0.373  0.188  0.353   0.325
SOSNet      0.418  0.265  0.179  0.279  0.169  0.360   0.278
AlexNet-VP  0.704  0.744  0.786  0.519  0.594  0.691   0.673
SketchDesc  0.817  0.919  0.857  0.635  0.713  0.820   0.794
Table 3: Pixel-wise retrieval on representative categories with different methods.

4.2.2 Multi-view Pixel-wise Retrieval

We further design a multi-view corresponding pixel retrieval task. Given multi-view sketches synthesized from multiple shapes, we uniformly sample a set of pixels (1000 to 1600 pixels) on one sketch and search in the other sketches for the corresponding pixels which are from the same shape (in different views). We use the descriptors computed from the compared networks as queries and adopt the Mean Average Precision (MAP) to measure the retrieval performance. Experiments are performed category-wise in Structure-Recovery, PSB, and ShapeNet. Table 3 shows the results given by the compared methods. More detailed comparisons can be found in Table 8. In Tables 3 and 8, SketchDesc achieves the best performance among all the descriptors learned by state-of-the-art learning-based methods on this task. Our learned descriptor surpasses the image-based descriptors of L2-Net, HardNet, and SOSNet by a large margin. AlexNet-VP achieves a closer but still lower performance compared with our method.
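For reference, the MAP metric can be sketched as below. This uses the standard definition of average precision over a ranked list; the exact variant used in the experiments is not spelled out in the text, so the details (for instance, normalizing by the number of retrieved relevant items) are an assumption.

```python
def average_precision(ranked_relevance):
    """Average precision for one query.

    ranked_relevance: booleans ordered by increasing descriptor distance,
    True where the retrieved pixel is a correct correspondence.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(all_ranked_relevance):
    """Mean of the per-query average precisions."""
    scores = [average_precision(r) for r in all_ranked_relevance]
    return sum(scores) / len(scores)

# Two toy queries: correct results at ranks 1 and 3, then at rank 2 only.
map_score = mean_average_precision([[True, False, True], [False, True]])
```

The first query scores (1/1 + 2/3) / 2 and the second scores 1/2, so a descriptor is rewarded for ranking true correspondences early rather than merely retrieving them.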

4.2.3 View Disparity

To further show the robustness of different learned descriptors against the degree of view disparity, given the same input test pixels in one sketch we visualize how the quality of correspondence inference changes with the increasing view disparity. As shown in Figure 8, SketchDesc shows a more stable performance of correspondence inference than the competitors. Please note that the ground-truth corresponding pixels might become invisible in certain views, while all the learned descriptors could not distinguish the visibility of corresponding pixels. Nevertheless, SketchDesc still produces the most reasonable results.

Figure 8: Performance of different methods with increasing view disparity (30, 60, 180, 240 and 300 degrees, respectively). Given some anchor pixels in the sketched object at the top-left, we show the corresponding pixels computed by different methods. The ground-truth correspondences are labeled with green boxes.

4.3 Ablation Study

In this subsection we validate the effectiveness of the key components in our method by conducting an ablation study.

Multi-scale Strategy

The designed multi-scale patch-based representation (32 × 32, 64 × 64, 128 × 128, and 256 × 256) plays an essential role in our method. We first show how different scales can influence the performance of the learned descriptors. We feed our network with increasingly more scales and evaluate the performance on the pixel-wise retrieval task. We employ the average MAP (Mean Average Precision) metric over the whole dataset. Quantitative results are shown in Table 4. A representative visual comparison is shown in Figure 9. As larger scales are involved, the ambiguous regions (bright yellow regions) on the feet, legs and back of the camel are gradually rejected. In other words, by being fed with the multi-scale patches, our network gains not only a more precise local perception but also a global perspective.

Figure 9: Visual results of different multi-scale choices. The distance maps show the distances from the highlighted point in the left sketch to all the pixels in the other sketch.

| Different scales | Structure-Recovery | PSB | ShapeNet |
| --- | --- | --- | --- |
| First scale | 0.472 | 0.314 | 0.202 |
| First 2 scales | 0.695 | 0.570 | 0.484 |
| First 3 scales | 0.795 | 0.701 | 0.607 |
| All 4 scales | 0.819 | 0.727 | 0.660 |

Table 4: The performance of using increasingly larger scales in the pixel-wise retrieval task.

Shared Weights

In our method, the multi-scale patches are processed by branches that share their weights. To verify the effectiveness of this design, we compare it on the pixel-wise retrieval task with a network structure whose branches do not share weights. The results are reported in Table 5. The shared-weight structure achieves a higher accuracy on all three datasets, and the improvement is most pronounced on PSB (0.727 versus 0.646).
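The difference between the two structures can be sketched with a toy example. Here each branch is a plain linear map and the per-scale features are concatenated; both choices, as well as the dimensions, are simplifying assumptions for illustration and do not reproduce our network.

```python
import random

def make_encoder(in_dim, out_dim, rng):
    """A toy linear encoder: out_dim rows of in_dim weights."""
    return [[rng.uniform(-1, 1) for _ in range(in_dim)]
            for _ in range(out_dim)]

def encode(weights, patch_vec):
    """Apply the linear encoder to one flattened patch."""
    return [sum(w * x for w, x in zip(row, patch_vec)) for row in weights]

def descriptor(encoders, patch_vecs):
    """Concatenate per-scale features; encoders[i] processes patch_vecs[i]."""
    feat = []
    for enc, vec in zip(encoders, patch_vecs):
        feat.extend(encode(enc, vec))
    return feat

def count_params(encoders):
    """Count weights, counting an encoder reused by several branches once."""
    unique = {id(e): e for e in encoders}
    return sum(len(e) * len(e[0]) for e in unique.values())

rng = random.Random(0)
n_scales, in_dim, out_dim = 4, 16, 8  # hypothetical sizes

# Shared weights: every scale is processed by the same encoder object.
shared = [make_encoder(in_dim, out_dim, rng)] * n_scales
# Unshared weights: each scale has its own independently initialized encoder.
unshared = [make_encoder(in_dim, out_dim, rng) for _ in range(n_scales)]
```

In this toy setting the shared variant has a quarter of the parameters of the unshared one, and every scale is mapped by the same function into the same feature space.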

| Network structure | Structure-Recovery | PSB | ShapeNet |
| --- | --- | --- | --- |
| Unshared weights | 0.801 | 0.646 | 0.628 |
| Shared weights | 0.819 | 0.727 | 0.660 |

Table 5: The comparison with and without using shared weights.

Training Data Generation

We evaluate two patch sampling mechanisms: OR-sampling and AND-sampling (Section 3.1). Table 6 shows that OR-sampling performs better on all three datasets (e.g., 0.660 versus 0.591 on ShapeNet). This is possibly because OR-sampling yields a significantly larger dataset for training our network.
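The effect of the two mechanisms on the dataset size can be illustrated with a toy sketch. The definitions in Section 3.1 are not repeated here, so the code assumes one plausible reading: AND-sampling keeps a surface point only if it is visible in every view, whereas OR-sampling keeps it if it is visible in at least `min_views` views. Both this reading and the `min_views` parameter are assumptions made for illustration.

```python
def and_sampling(visibility):
    """Keep points that are visible in every view (assumed reading)."""
    return [p for p, flags in visibility.items() if all(flags)]

def or_sampling(visibility, min_views=1):
    """Keep points visible in at least `min_views` views (assumed reading)."""
    return [p for p, flags in visibility.items() if sum(flags) >= min_views]

# Hypothetical per-view visibility flags for four surface points in 3 views.
visibility = {
    "p0": [True, True, True],
    "p1": [True, False, True],
    "p2": [False, False, True],
    "p3": [False, False, False],
}
```

Under this reading, every point kept by AND-sampling is also kept by OR-sampling, so OR-sampling can only produce an equal or larger set of training samples.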

| Sampling mechanism | Structure-Recovery | PSB | ShapeNet |
| --- | --- | --- | --- |
| AND-sampling | 0.789 | 0.679 | 0.591 |
| OR-sampling | 0.819 | 0.727 | 0.660 |

Table 6: The performance of two sampling mechanisms for data preparation on the task of corresponding pixel retrieval.

4.4 Limitations and Discussions

Our method has several limitations. First, although our method generalizes well to unseen sketches and even to hand-drawn sketches of the same objects, it can fail when the viewpoint differs drastically from the examples in the training set. This is a common generalization problem for learning-based methods. Increasing the training data could help, but at the cost of an additional training burden. We use a rather simple method to sample viewpoints when preparing the training data; a more careful view selection might be made by adopting best-view selection methods [25]. Additionally, our method is currently designed for multi-view correspondences of rigid objects. If the object undergoes articulation or non-rigid deformation (e.g., people dancing), our method may not perform well. We consider this an intriguing direction for future work.

5 Conclusions

In this paper, we have introduced a deep learning based method for correspondence learning among multiple sketches of an object in different views. We have proposed a multi-branch network which encodes contexts from multi-scale patches with global and local perspectives to produce a novel descriptor for semantically measuring the distance of pixels in multi-view sketch images. The multi-branch and shared-weights designs help the network capture more feature information from all scales of sketch patches. Our data preparation method provides the ground truth effectively for training our multi-branch network. We believe the generated data can benefit other applications. Both qualitative and quantitative experiments show that our learned descriptor is more effective than the existing learning-based descriptors. In the future, it would be interesting to exploit more neighboring information and learn the per-point features in a joint manner.


  • [1] S. Agarwal, Y. Furukawa, N. Snavely, I. Simon, B. Curless, S. M. Seitz, and R. Szeliski (2011) Building rome in a day. Communications of the ACM 54 (10), pp. 105–112. Cited by: §2.1.
  • [2] V. Balntas, K. Lenc, A. Vedaldi, and K. Mikolajczyk (2017) HPatches: a benchmark and evaluation of handcrafted and learned local descriptors. In Proc. IEEE CVPR, Cited by: §2.1.
  • [3] V. Balntas, E. Riba, D. Ponsa, and K. Mikolajczyk (2016) Learning local feature descriptors with triplets and shallow convolutional neural networks. In Proc. BMVC, pp. 119.1–119.11. External Links: Document, ISBN 1-901725-59-6 Cited by: §2.1.
  • [4] H. Bay, T. Tuytelaars, and L. Van Gool (2006) Surf: speeded up robust features. In European Conference on Computer Vision, pp. 404–417. Cited by: §2.1.
  • [5] S. Belongie, J. Malik, and J. Puzicha (2001) Shape context: a new descriptor for shape matching and object recognition. In NIPS, pp. 831–837. Cited by: §1, §2.1.
  • [6] X. Chen, A. Golovinskiy, and T. Funkhouser (2009) A benchmark for 3d mesh segmentation. In ACM Transactions on Graphics, Vol. 28, pp. 73. Cited by: §1, §4.
  • [7] Z. Chen, X. Sun, L. Wang, Y. Yu, and C. Huang (2015) A deep visual correspondence embedding model for stereo matching costs. In Proceedings of the IEEE International Conference on Computer Vision, pp. 972–980. Cited by: §1.
  • [8] C. B. Choy, J. Gwak, S. Savarese, and M. Chandraker (2016) Universal correspondence network. In NIPS, Cited by: §2.1.
  • [9] G. Csurka and M. Humenberger (2018) From handcrafted to deep local invariant features. CoRR abs/1807.10254. Cited by: §2.1.
  • [10] N. Dalal and B. Triggs (2005) Histograms of oriented gradients for human detection. In CVPR ’05, Cited by: §2.1.
  • [11] D. DeCarlo, A. Finkelstein, S. Rusinkiewicz, and A. Santella (2003) Suggestive contours for conveying shape. In ACM Transactions on Graphics (TOG), Vol. 22, pp. 848–855. Cited by: §3.1.
  • [12] J. Delanoy, M. Aubry, P. Isola, A. A. Efros, and A. Bousseau (2018) 3d sketching using multi-view deep volumetric prediction. Proceedings of the ACM on Computer Graphics and Interactive Techniques 1 (1), pp. 21. Cited by: §1, §2.2.
  • [13] M. Eitz, J. Hays, and M. Alexa (2012) How do humans sketch objects?. ACM Trans. Graph. 31 (4), pp. 44–1. Cited by: §2.2, §3.1.
  • [14] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, X. Bing, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In International Conference on Neural Information Processing Systems, Cited by: §2.2.
  • [15] L. Governi, R. Furferi, M. Palai, and Y. Volpe (2013) 3D geometry reconstruction from orthographic views: a method based on 3d image processing and data fitting. Computers in Industry 64 (9), pp. 1290–1300. Cited by: §1, §2.2.
  • [16] B. Graham, M. Engelcke, and L. van der Maaten (2018) 3d semantic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9224–9232. Cited by: §3.2.1.
  • [17] D. Ha and D. Eck (2017) A neural representation of sketch drawings. arXiv preprint arXiv:1704.03477. Cited by: §2.2.
  • [18] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg (2015) MatchNet: unifying feature and metric learning for patch-based matching. In Proc. IEEE CVPR, pp. 3279–3286. External Links: ISSN 1063-6919 Cited by: §2.1.
  • [19] H. Hirschmuller and D. Scharstein (2008) Evaluation of stereo matching costs on images with radiometric differences. IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (9), pp. 1582–1599. Cited by: §1, §2.1.
  • [20] H. Huang, E. Kalogerakis, S. Chaudhuri, D. Ceylan, V. G. Kim, and E. Yumer (2018) Learning local shape descriptors from part correspondences with multiview convolutional networks. ACM Transactions on Graphics (TOG) 37 (1), pp. 6. Cited by: §2.1, §2.2, §3.1, §4.2.
  • [21] M. Keller, Z. Chen, F. Maffra, P. Schmuck, and M. Chli (2018) Learning deep descriptors with scale-aware triplet networks. In CVPR ’18, Cited by: §2.1.
  • [22] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In NIPS, pp. 1097–1105. Cited by: §2.2.
  • [23] V. Kumar B G, G. Carneiro, and I. Reid (2016) Learning local image descriptors with deep siamese and triplet convolutional networks by minimising global loss functions. In Proc. IEEE CVPR, Cited by: §2.1.
  • [24] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §4.2.
  • [25] C. H. Lee, A. Varshney, and D. W. Jacobs (2005) Mesh saliency. ACM Transactions on Graphics (TOG) 24 (3), pp. 659–666. Cited by: §4.4.
  • [26] C. Li, H. Pan, Y. Liu, X. Tong, A. Sheffer, and W. Wang (2018) Robust flow-guided neural prediction for sketch-based freeform surface modeling. In ACM Transactions on Graphics, pp. 238. Cited by: §1, §2.2.
  • [27] L. Li, H. Fu, and C. Tai (2018) Fast sketch segmentation and labeling with deep learning. IEEE computer graphics and applications 39 (2), pp. 38–51. Cited by: §3.1.
  • [28] L. Li, C. Zou, Y. Zheng, Q. Su, H. Fu, and C. Tai (2018) Sketch-r2cnn: an attentive network for vector sketch recognition. arXiv preprint arXiv:1811.08170. Cited by: §2.2.
  • [29] S. Liu and D. B. Cooper (2010) Ray markov random fields for image-based 3d modeling: model and efficient inference. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1530–1537. Cited by: §1, §2.1.
  • [30] D. G. Lowe (2004) Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60 (2), pp. 91–110. Cited by: §2.1, §2.1.
  • [31] Z. Lun, M. Gadelha, E. Kalogerakis, S. Maji, and R. Wang (2017) 3d shape reconstruction from sketches via multi-view convolutional networks. In 2017 International Conference on 3D Vision (3DV), pp. 67–77. Cited by: §2.2.
  • [32] W. Luo, A. G. Schwing, and R. Urtasun (2016) Efficient deep learning for stereo matching. In CVPR ’16, pp. 5695–5703. Cited by: §1.
  • [33] Z. Luo, T. Shen, L. Zhou, S. Zhu, R. Zhang, Y. Yao, T. Fang, and L. Quan (2018) Geodesc: learning local descriptors by integrating geometry constraints. In ECCV ’18, pp. 168–183. Cited by: §2.1.
  • [34] A. Mishchuk, D. Mishkin, F. Radenovic, and J. Matas (2017) Working hard to know your neighbor’s margins: local descriptor learning loss. In Advances in Neural Information Processing Systems, pp. 4826–4837. Cited by: §1, §2.1, §4.2.1, §4.2.
  • [35] D. Mishkin, F. Radenovic, and J. Matas (2018) Repeatability is not enough: learning affine regions via discriminability. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 284–300. Cited by: §2.1.
  • [36] P. Navarro, J. I. Orlando, C. Delrieux, and E. Iarussi (2019) SketchZooms: deep multi-view descriptors for matching line drawings. arXiv preprint arXiv:1912.05019. Cited by: §2.2, §3.1, §4.2.
  • [37] G. Nishida, I. Garcia-Dorado, D. G. Aliaga, B. Benes, and A. Bousseau (2016) Interactive sketching of urban procedural models. ACM Transactions on Graphics (TOG) 35 (4), pp. 130. Cited by: §1, §2.2.
  • [38] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese (2016) Deep metric learning via lifted structured feature embedding. In Proc. IEEE CVPR, Cited by: §2.1.
  • [39] A. Rivers, F. Durand, and T. Igarashi (2010-07) 3D modeling with silhouettes. ACM Trans. Graph. 29 (4). External Links: ISSN 0730-0301 Cited by: §2.2.
  • [40] P. Sangkloy, N. Burnell, C. Ham, and J. Hays (2016) The sketchy database: learning to retrieve badly drawn bunnies. ACM Transactions on Graphics (TOG) 35 (4), pp. 119. Cited by: §2.2.
  • [41] F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: §3.3.
  • [42] C. Shen, H. Fu, K. Chen, and S. Hu (2012) Structure recovery by part assembly. ACM Transactions on Graphics (TOG) 31 (6), pp. 180. Cited by: §1, §3.1, §4.
  • [43] P. Shilane and T. Funkhouser (2007) Distinctive regions of 3d surfaces. ACM Transactions on Graphics (TOG) 26 (2), pp. 7. Cited by: §2.1.
  • [44] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer (2015) Discriminative learning of deep convolutional feature point descriptors. In ICCV ’15, pp. 118–126. External Links: ISSN 2380-7504 Cited by: §2.1.
  • [45] N. Snavely, S. M. Seitz, and R. Szeliski (2006) Photo tourism: exploring photo collections in 3d. In ACM Transactions on Graphics (TOG), Vol. 25, pp. 835–846. Cited by: §2.1.
  • [46] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller (2015) Multi-view convolutional neural networks for 3d shape recognition. In ICCV ’15, pp. 945–953. Cited by: §2.2, §3.1.
  • [47] Y. Tian, B. Fan, and F. Wu (2017) L2-Net: deep learning of discriminative patch descriptor in euclidean space. In CVPR ’17, pp. 661–669. Cited by: §1, §1, §2.1, §2.1, Figure 4, §3, §4.2.
  • [48] Y. Tian, X. Yu, B. Fan, F. Wu, H. Heijnen, and V. Balntas (2019) SOSNet: second order similarity regularization for local descriptor learning. In CVPR ’19, pp. 11016–11025. Cited by: §1, §2.1, §4.2.
  • [49] F. Tombari, S. Mattoccia, L. Di Stefano, and E. Addimanda (2008) Classification and evaluation of cost aggregation methods for stereo correspondence. In CVPR ’08, pp. 1–8. Cited by: §1, §2.1.
  • [50] T. Tung, S. Nobuhara, and T. Matsuyama (2009) Complete multi-view reconstruction of dynamic scenes from probabilistic fusion of narrow and wide baseline stereo. In ICCV ’09, pp. 1709–1716. Cited by: §1, §2.1.
  • [51] X. Wei, Y. Zhang, Y. Gong, and N. Zheng (2018) Kernelized subspace pooling for deep local descriptors. In CVPR ’18, Cited by: §2.1.
  • [52] S. Winder and M. Brown (2007-06) Learning local image descriptors. In Proc. IEEE CVPR, Cited by: §2.1.
  • [53] K. Xu, K. Chen, H. Fu, W. Sun, and S. Hu (2013) Sketch2Scene: sketch-based co-retrieval and co-placement of 3d models. ACM Transactions on Graphics (TOG) 32 (4), pp. 123. Cited by: §3.1.
  • [54] P. Xu, Y. Huang, T. Yuan, K. Pang, Y. Song, T. Xiang, T. M. Hospedales, Z. Ma, and J. Guo (2018) Sketchmate: deep hashing for million-scale human sketch retrieval. In CVPR ’18, pp. 8090–8098. Cited by: §2.2.
  • [55] G. Yang, J. Manela, M. Happold, and D. Ramanan (2019) Hierarchical deep stereo matching on high-resolution images. In CVPR ’19, pp. 5515–5524. Cited by: §1.
  • [56] L. Yi, V. G. Kim, D. Ceylan, I. Shen, M. Yan, H. Su, C. Lu, Q. Huang, A. Sheffer, L. Guibas, et al. (2016) A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics (TOG) 35 (6), pp. 210. Cited by: §1, §4.
  • [57] Q. Yu, Y. Yang, F. Liu, Y. Song, T. Xiang, and T. M. Hospedales (2017) Sketch-a-net: a deep neural network that beats humans. International Journal of Computer Vision 122 (3), pp. 411–425. Cited by: §1, §2.2.
  • [58] S. Zagoruyko and N. Komodakis (2015-06) Learning to compare image patches via convolutional neural networks. In CVPR ’15, Cited by: §2.1.
  • [59] X. Zhang, F. X. Yu, S. Kumar, and S. Chang (2017) Learning spread-out local feature descriptors. In ICCV ’17, Cited by: §2.1.

6 Appendix

| Dataset | Category | LeNet | L2-Net | HardNet | SOSNet | AlexNet-VP | SketchDesc |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Structure-Recovery | Airplane | 0.184 | 0.236 | 0.207 | 0.189 | 0.360 | 0.544 |
| | Bicycle | 0.376 | 0.388 | 0.422 | 0.399 | 0.663 | 0.710 |
| | Chair | 0.339 | 0.399 | 0.385 | 0.390 | 0.508 | 0.557 |
| | Fourleg | 0.292 | 0.336 | 0.316 | 0.277 | 0.519 | 0.662 |
| | Human | 0.280 | 0.379 | 0.402 | 0.351 | 0.635 | 0.726 |
| | Rifle | 0.426 | 0.521 | 0.530 | 0.503 | 0.670 | 0.762 |
| | avg_acc | 0.316 | 0.377 | 0.377 | 0.352 | 0.559 | 0.660 |
| PSB | Airplane | 0.182 | 0.233 | 0.132 | 0.097 | 0.521 | 0.716 |
| | Bust | 0.330 | 0.220 | 0.226 | 0.196 | 0.314 | 0.454 |
| | Chair | 0.285 | 0.267 | 0.243 | 0.261 | 0.443 | 0.602 |
| | Cup | 0.221 | 0.251 | 0.191 | 0.221 | 0.354 | 0.569 |
| | Fish | 0.296 | 0.359 | 0.345 | 0.314 | 0.633 | 0.794 |
| | Human | 0.347 | 0.352 | 0.381 | 0.341 | 0.537 | 0.793 |
| | Octopus | 0.098 | 0.108 | 0.091 | 0.065 | 0.239 | 0.284 |
| | Plier | 0.098 | 0.087 | 0.063 | 0.136 | 0.569 | 0.728 |
| | avg_acc | 0.232 | 0.235 | 0.209 | 0.204 | 0.451 | 0.618 |
| ShapeNet | Airplane | 0.394 | 0.436 | 0.398 | 0.352 | 0.674 | 0.686 |
| | Bag | 0.167 | 0.137 | 0.146 | 0.136 | 0.313 | 0.384 |
| | Cap | 0.209 | 0.252 | 0.262 | 0.216 | 0.552 | 0.752 |
| | Car | 0.171 | 0.206 | 0.209 | 0.210 | 0.546 | 0.733 |
| | Chair | 0.189 | 0.196 | 0.180 | 0.192 | 0.423 | 0.472 |
| | Earphone | 0.151 | 0.184 | 0.159 | 0.141 | 0.252 | 0.415 |
| | Mug | 0.186 | 0.198 | 0.170 | 0.169 | 0.317 | 0.392 |
| | Pistol | 0.300 | 0.339 | 0.333 | 0.337 | 0.663 | 0.707 |
| | avg_acc | 0.221 | 0.244 | 0.232 | 0.219 | 0.469 | 0.568 |

Table 7: Detailed comparison of different approaches on sketch correspondence task.

| Dataset | Category | LeNet | L2-Net | HardNet | SOSNet | AlexNet-VP | SketchDesc |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Structure-Recovery | Airplane | 0.333 | 0.417 | 0.450 | 0.400 | 0.450 | 0.683 |
| | Bicycle | 0.365 | 0.399 | 0.439 | 0.392 | 0.601 | 0.838 |
| | Chair | 0.385 | 0.446 | 0.423 | 0.418 | 0.704 | 0.817 |
| | Fourleg | 0.328 | 0.454 | 0.412 | 0.303 | 0.748 | 0.824 |
| | Human | 0.204 | 0.436 | 0.398 | 0.265 | 0.744 | 0.919 |
| | Rifle | 0.560 | 0.560 | 0.571 | 0.536 | 0.762 | 0.833 |
| | avg_map | 0.363 | 0.452 | 0.449 | 0.386 | 0.668 | 0.819 |
| PSB | Airplane | 0.263 | 0.263 | 0.105 | 0.105 | 0.421 | 0.684 |
| | Bust | 0.012 | 0.294 | 0.292 | 0.258 | 0.513 | 0.636 |
| | Chair | 0.333 | 0.479 | 0.418 | 0.455 | 0.751 | 0.864 |
| | Cup | 0.050 | 0.083 | 0.035 | 0.062 | 0.222 | 0.499 |
| | Fish | 0.187 | 0.277 | 0.232 | 0.208 | 0.581 | 0.734 |
| | Human | 0.126 | 0.526 | 0.523 | 0.461 | 0.786 | 0.930 |
| | Hand | 0.082 | 0.373 | 0.373 | 0.279 | 0.519 | 0.635 |
| | Octopus | 0.375 | 0.292 | 0.292 | 0.458 | 0.667 | 0.708 |
| | Plier | 0.179 | 0.357 | 0.214 | 0.179 | 0.786 | 0.857 |
| | avg_map | 0.179 | 0.327 | 0.276 | 0.274 | 0.583 | 0.727 |
| ShapeNet | Airplane | 0.143 | 0.215 | 0.152 | 0.090 | 0.367 | 0.552 |
| | Bag | 0.172 | 0.199 | 0.149 | 0.135 | 0.534 | 0.686 |
| | Cap | 0.157 | 0.238 | 0.188 | 0.169 | 0.594 | 0.713 |
| | Car | 0.195 | 0.191 | 0.224 | 0.221 | 0.559 | 0.669 |
| | Chair | 0.114 | 0.150 | 0.140 | 0.137 | 0.505 | 0.666 |
| | Earphone | 0.254 | 0.301 | 0.254 | 0.197 | 0.491 | 0.832 |
| | Mug | 0.045 | 0.062 | 0.057 | 0.076 | 0.280 | 0.338 |
| | Pistol | 0.288 | 0.345 | 0.353 | 0.360 | 0.691 | 0.820 |
| | avg_map | 0.171 | 0.213 | 0.190 | 0.173 | 0.503 | 0.660 |

Table 8: Detailed comparison of different approaches on corresponding pixel retrieval task.

Figure 10: Sketch alignment for multi-view sketches. The red lines indicate the failed matching and the green lines show the matching correspondences among multi-view sketches.