Hand-drawn line sketches have long been employed to communicate ideas in a minimal yet understandable manner. In this paper, we explore the problem of parsing sketched objects, i.e. given a freehand line sketch of an object, determining its salient attributes (e.g. category, semantic parts, pose). The ability to understand sketches in terms of local attributes (e.g. parts) and global attributes (e.g. pose) can drive novel applications such as sketch captioning, storyboard animation and automatic drawing assessment apps for art teachers. The onset of the deep network era has resulted in architectures which can impressively recognize object sketches at a coarse (category) level [36, 39, 51]. Paralleling the advances in parsing of photographic objects [13, 21, 46] and scenes [4, 29, 8], the time is ripe for understanding sketches too at a fine-grained level [50, 20].
A number of unique challenges need to be addressed for semantic sketch parsing. Unlike richly detailed color photos, line sketches are binary (black and white) and sparsely detailed. Sketches exhibit a large amount of appearance variation induced by the range of drawing skills among the general public. The resulting distortions in object depiction pose a challenge to parsing approaches. In many instances, the sketch is not drawn with a ‘closed’ object boundary, complicating annotation, part-segmentation and pose estimation. Given all these challenges, it is no surprise that only a handful of works exist for sketch parsing [16, 38]. However, even these approaches have their own share of drawbacks (Section 2).
To address these issues, we propose a novel architecture called SketchParse for fully automatic sketch object parsing. In our approach, we make three major design decisions:
Design Decision #1 (Data): To bypass burdensome part-level sketch annotation, we leverage photo datasets containing part-level annotations of objects. Suppose I is an object image and P is the corresponding part-level annotation. We subject I to a sketchification procedure (Section 3.2) and obtain a sketchified image S. Thus, our training data consists of sketchified image and part-level annotation pairs (S, P) for each category (Figure 1).
Design Decision #2 (Model): Many structurally similar object categories tend to have common parts. For instance, ‘wings’ and ‘tail’ are common to both birds and airplanes. To exploit such shared semantic parts, we design our model as a two-level network of disjoint experts (see Figure 2). The first level contains shared layers common to all object categories. The second level contains a number of experts (sub-networks). Each expert is configured for parsing sketches from a super-category comprising categories with structurally similar parts (for example, the categories cat, dog and sheep comprise the super-category Small Animals). Instead of training from scratch, we instantiate our model using two disjoint groups of pre-trained layers from a scene parsing net (Section 4.1). We perform training using the sketchified data mentioned above (Section 5.1). At test time, the input sketch is first processed by the shared layers to obtain intermediate features. In parallel, the sketch is also provided to a super-category sketch classifier. The label output of the classifier is used to automatically route the intermediate features to the appropriate super-category expert for the final output, i.e. part-level segmentation (Section 5.2).
Design Decision #3 (Auxiliary Tasks): A popular paradigm to improve performance of the main task is to have additional yet related auxiliary targets in a multi-task setting [32, 1, 18, 26]. Motivated by this observation, we configure each expert network for the novel auxiliary task of 2-D pose estimation.
At first glance, our approach seems infeasible. After all, sketchified training images resemble actual sketches only in terms of stroke density (see Figure 1). They seem to lack the fluidity and unstructured feel of hand-drawn sketches. Moreover, SketchParse’s base model, originally designed for photo scene segmentation, seems an unlikely candidate for enabling transfer-learning based sketch object segmentation. Yet, as we shall see, our design choices result in an architecture which successfully accomplishes sketch parsing across multiple categories and sketch datasets.
We propose SketchParse – the first deep hierarchical network for fully automatic parsing of hand-drawn object sketches (Section 4). Our architecture includes object pose estimation as a novel auxiliary task.
We provide the largest dataset of part-annotated object sketches across multiple categories and multiple sketch datasets. We also provide 2-D pose annotations for these sketches.
We outline how SketchParse’s output can form the basis for novel applications such as automatic sketch description (Section 6.3).
Please visit https://github.com/val-iisc/sketch-parse for pre-trained models, code and resources related to the work presented in this paper.
2 Related Work
Semantic Parsing (Photos):
Existing deep-learning approaches for semantic parsing of photos can be categorized into two groups. The first group consists of approaches for scene-level semantic parsing (i.e. output an object label for each pixel in the scene) [5, 29, 8]. The second group of approaches attempts semantic parsing of objects (i.e. output a part label for each object pixel) [13, 21, 46]. Compared to our choice of a scene parsing net, this latter group of object-parsing approaches might seem better candidates for the base architecture. However, they pose specific difficulties for adoption. For instance, the hypercolumns approach of Hariharan et al. requires training separate part classifiers for each class. Also, their evaluation is confined to a small number of classes (animals and human beings). The approach of Liang et al. consists of a complex multi-stage hybrid CNN-RNN architecture evaluated on only two animal categories (horse and cow). The approach of Xia et al. fuses object part score evidence from scene, object and part levels to obtain impressive results for two categories (humans, large animals). However, it is not clear how their method can be adapted for our purpose.
Semantic Object Parsing (Sketches): Only a handful of works exist for sketch parsing [16, 38]. Existing approaches require a part-annotated dataset of sketch objects, and obtaining such part-level annotations is laborious and cumbersome. Moreover, one-third of a dataset evaluated by these approaches consists of sketches drawn by professional artists. Given the artists’ relatively better drawing skills, incorporating such sketches artificially reduces the complexity of the problem. On the architectural front, these approaches involve tweaking multiple parameters in a hand-crafted segmentation pipeline. Existing approaches label individual sketch strokes as parts. This requires strokes within part interiors to be labelled as well, which can result in peculiar segmentation errors. Our method, in contrast, labels object regions as parts. In many instances, the object region boundary consists of non-stroke pixels. Therefore, it is not possible to directly compare with existing approaches. Unlike our category-scalable and fully automatic approach, these methods assume the object category is known and train a separate model per category (e.g. dog and cat require separate models). Existing implementations of these approaches also have prohibitive inference times: parsing a single object sketch takes on the order of minutes, rendering them unsuitable for interactive sketch-based applications. In contrast, our model’s inference time is a fraction of a second. Also, our scale of evaluation is significantly larger; for example, Schneider et al.’s method is evaluated on only 5 test sketches per category, whereas we evaluate on many more sketches per category. Finally, none of the previous approaches exploit the hierarchical category-level groupings which arise naturally from structural similarities. This renders them prone to a drop in performance as additional categories (and their parts) are added.
Sketch Recognition: The initial performance of handcrafted feature-based approaches  for sketch recognition has been surpassed in recent times by deep CNN architectures [51, 39]. The sketch router classifier in our architecture is a modified version of Yang et al.’s Sketch-a-Net . While the works mentioned above use sketches, Zhang et al.  use sketchified photos for training the CNN classifier which outputs class labels. We too use sketchified photos for training. However, the task in our case is parsing and not classification.
Class-hierarchical CNNs: Our idea of having an initial coarse-category net which routes the input to finer-category experts can be found in some recent works as well [2, 48], albeit for object classification. In these works, the coarse-category net is intimately tied to the main task (viz. classification). In our case, the coarse-category CNN classifier serves a secondary role, helping to route the output of a parallel, shared sub-network to the finer-category parsing experts. Also, unlike the above works, the task of our secondary net (classification) differs from the task of the experts (segmentation).
Domain Adaptation/Transfer Learning: Our approach can be viewed as belonging to the category of domain adaptation techniques [34, 30, 3]. These techniques have proven to be successful for various problems, including image parsing [33, 15, 41]. However, unlike most approaches wherein image modality does not change, our domain-adaptation scenario is characterized by extreme modality-level variation between source (image) and target (freehand sketch). This drastically reduces the quantity and quality of data available for transfer learning, making our task more challenging.
Multi-task networks: The effectiveness of addressing auxiliary tasks in tandem with the main task has been shown for several challenging problems in vision [32, 1, 18, 26]. In particular, object classification , detection , geometric context , saliency  and adversarial loss  have been utilized as auxiliary tasks in deep network-based approaches for semantic parsing. The auxiliary task we employ – object viewpoint estimation – has been used in a multi-task setting but for object classification [55, 11]. For instance, Wong et al.  project the output of semantic scene segmentation onto a depth map and use a category-aware 3D model approach to enable 3-D pose-based grasping in robots. Zhao and Itti  utilize 3-D pose information from toy models of categories as a target auxiliary task to improve object classification. Elhoseiny et al.  also introduce pose-estimation as an auxiliary task along with classification. To the best of our knowledge, we are the first ones to design a custom pose estimation architecture to assist semantic parsing.
3 Data Preparation
We first summarize salient details of two semantic object-part photo datasets.
3.1 Object photo datasets
PASCAL-Parts: This image dataset provides semantic part segmentations of objects for the 20 object categories of the PASCAL VOC2010 dataset. We select 11 categories (aeroplane, bicycle, bus, car, cat, cow, dog, flying bird, horse, motorcycle, sheep) for our experiments. To obtain cropped object images, we use the object bounding box annotations provided with PASCAL-Parts.
CORE: The Cross-category Object REcognition (CORE) dataset contains segmentation and attribute information for objects distributed across a range of vehicle and animal categories. We select categories from the CORE dataset based on their semantic similarity with the PASCAL-Parts categories (e.g. the CORE category crow is selected since it is semantically similar to the PASCAL-Parts category bird).
To enable estimation of object pose as an auxiliary objective, we annotated all the images from the PASCAL-Parts and CORE datasets with 2-D pose information based on the object’s orientation with respect to the viewing plane. Specifically, each image is labeled with one of the four cardinal (‘North’, ‘East’, ‘West’, ‘South’) or four intercardinal (‘NE’, ‘NW’, ‘SE’, ‘SW’) directions. We plan to release these pose annotations publicly for the benefit of the multimedia community.
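As a concrete illustration of this 8-way labeling scheme, the snippet below quantizes a viewing-plane orientation angle into a cardinal or intercardinal label. The angle convention (0 = North, increasing clockwise) and the helper name are our assumptions for illustration, not details from the paper.

```python
# Hypothetical helper illustrating the 8-way pose labeling scheme:
# a viewing-plane orientation (degrees, 0 = North, clockwise) is
# quantized into one of 4 cardinal + 4 intercardinal labels.
DIRECTIONS = ['North', 'NE', 'East', 'SE', 'South', 'SW', 'West', 'NW']

def pose_label(angle_deg):
    """Map an orientation angle to one of the 8 direction labels.
    Each bin spans 45 degrees, centered on its direction."""
    return DIRECTIONS[int(((angle_deg % 360) + 22.5) // 45) % 8]
```

Each bin is centered on its direction, so, for example, angles within 22.5 degrees of 0 map to ‘North’.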
Next, we shall describe the procedure for obtaining sketchified versions of photos sourced from these datasets.
3.2 Obtaining sketchified images
Suppose I is an object image. As the first step, we use a Canny edge detector tuned to produce only the most prominent object edges. Visually, we found the resulting image to contain edge structures which perceptually resemble human sketch strokes more closely than those produced by alternatives such as Sketch Tokens and SCG. We augment this edge image with part contours and object contours from the part-annotation data and perform morphological dilation to thicken edge segments (using a square structuring element), obtaining the sketchified image S (see Figure 1). To augment the data available for training the segmentation model, we apply a series of rotations and mirroring about the vertical axis to S, yielding several augmented images per original sketchified image. To ensure good coverage of parts and to eliminate inconsistent labelings, we manually curated object images from the PASCAL-Parts and CORE datasets. Given the varying semantic granularity of part labels across these datasets, we also manually curated the parts considered for each category. Finally, we obtain the training dataset consisting of paired sketchified images and corresponding part-level annotations, distributed across the object categories.
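The pipeline above (edge extraction, dilation, rotation/mirror augmentation) can be sketched in a few lines. This is a minimal stand-in, not the paper's implementation: a gradient-magnitude threshold replaces the tuned Canny detector so the example stays self-contained, and the rotation angles are placeholders.

```python
import numpy as np
from scipy import ndimage

def sketchify(gray, edge_thresh=0.2, dilate_size=3):
    """Toy sketchification: extract prominent edges, then thicken them
    with a square structuring element (morphological dilation).
    A simple gradient threshold stands in for the tuned Canny detector."""
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)
    edges = mag > edge_thresh * mag.max()
    return ndimage.binary_dilation(edges,
                                   structure=np.ones((dilate_size, dilate_size)))

def augment(sketch, angles=(-10, 10)):
    """Rotations plus a mirror about the vertical axis (the angle
    values here are placeholders, not the paper's settings)."""
    variants = [ndimage.rotate(sketch.astype(float), a, reshape=False) > 0.5
                for a in angles]
    variants.append(sketch[:, ::-1])  # horizontal flip
    return variants
```

In practice the thickened edges would additionally be merged with the part and object contours from the annotation data before augmentation.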
We evaluate our model’s performance on freehand line sketches from two large-scale datasets. We describe these datasets and associated data preparation procedures next.
3.3 Sketch datasets and augmentation
TU-Berlin: The TU-Berlin sketch database contains 20,000 hand-drawn sketches spanning 250 common object categories, with 80 sketches per category. For this dataset, only the category name was provided to the sketchers during the drawing phase.
Sketchy: The Sketchy database contains 75,471 sketches spanning 125 object categories. To collect this dataset, photo images of objects were initially shown to human subjects. After a short interval, the image was replaced by a gray screen and subjects were asked to sketch the object from memory. Compared to the draw-from-category-name-only approach employed for the TU-Berlin dataset, this memory-based approach provides a larger variety in terms of viewpoints and object detail. On average, each object photo is associated with several different sketches.
From both datasets, we use sketches only from those object categories which overlap with the categories from PASCAL-Parts mentioned in Section 3.1. For augmentation, we first apply morphological dilation (using a square structuring element) to each sketch. This operation helps minimize the loss of stroke continuity when the sketch is processed by deeper layers of the network. Subsequently, we apply a series of rotations and mirroring about the vertical axis to the dilated sketch, producing several augmented variants per original sketch for use during training.
4 Our model (SketchParse)
4.1 Instantiating SketchParse levels
We design a two-level deep network architecture for SketchParse (see Figure 2). The first level is intended to capture category-agnostic low-level information and contains shared layers common to all object categories. We instantiate the first-level layers with the shallower, initial layers of a scene parsing net. Specifically, we use the multi-scale ResNet-101 variant of DeepLab, a well-known architecture designed for semantic scene parsing. In this context, we wish to emphasize that our design is general and can accommodate any fully convolutional scene parsing network.
The categories are grouped into smaller, disjoint super-category subsets using meronym (i.e. part-of relation)-based similarities between objects. To obtain super-categories, we start with the given categories and cluster them based on the fraction of common meronyms (part-names). For example, ‘flying bird’ and ‘airplane’ share the largest common set of part-names between themselves compared to any other categories and therefore form a single cluster. This procedure can also be used when a new category is introduced: we assign it to the super-category with which it shares the largest common set of part-names.
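The assignment rule for a new category can be sketched as below. The part vocabularies and the Jaccard overlap measure are simplified illustrations, not the paper's exact label sets or similarity function.

```python
# Illustrative meronym (part-name) overlap. Part lists are simplified
# examples, not the paper's actual label sets.
PARTS = {
    'cat':      {'head', 'body', 'leg', 'tail'},
    'dog':      {'head', 'body', 'leg', 'tail'},
    'bird':     {'head', 'body', 'wing', 'tail', 'leg'},
    'airplane': {'body', 'wing', 'tail', 'engine'},
}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def assign_super_category(new_parts, super_categories):
    """Assign a new category to the super-category whose combined part
    vocabulary it overlaps most (the rule described above)."""
    def overlap(name):
        vocab = set().union(*(PARTS[c] for c in super_categories[name]))
        return jaccard(new_parts, vocab)
    return max(super_categories, key=overlap)
```

For instance, a horned quadruped with head/body/leg/tail parts would be routed to the animal-like super-category rather than the flying one.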
The second level consists of expert sub-networks, each of which is specialized for parsing sketches associated with one super-category. We initialize these experts using the deeper, latter layers of the scene parsing model. Suppose the total number of parts across all object categories within the i-th super-category is Ni. We modify the final layer of each expert network so that it outputs Ni part-label predictions. The current version of our architecture contains five super-category experts. We performed ablative experiments to determine the optimal location for splitting the layers of the semantic scene parsing net into the two groups. Based on these experiments, we use all the layers up to the res5b block as shared layers.
4.2 Router Layer
From the above description (Section 4.1), SketchParse’s design so far consists of a single shared sub-network and multiple expert nets. We require a mechanism for routing the intermediate features produced by the shared layers to the target expert network. In addition, we require a mechanism for backpropagating error gradients from the expert nets and updating the weights of the shared sub-network during training. To meet these requirements, we design a Router Layer (shaded green in Figure 2). During the training phase, routing of features from the shared layers is dictated by the ground-truth category and, by extension, the super-category it is associated with. A branch index array is maintained for each training mini-batch. Since the ground-truth super-category for each training example is available, creation of the branch index array requires only knowledge of the mini-batch label composition. The array entries are referenced during backward propagation to (a) recombine the gradient tensors in the same order as that of the mini-batch and (b) route the error gradient from the appropriate branch to the shared layers.
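The Router Layer's bookkeeping can be illustrated with plain arrays: split a mini-batch by super-category label while recording indices, then use those indices to restore mini-batch order during the backward pass. This is a numpy sketch of the mechanism described above, not framework-level autograd code.

```python
import numpy as np

def route_forward(features, super_labels, num_experts):
    """Split a mini-batch of shared-layer features into per-expert
    sub-batches, recording a branch index array so that outputs (and,
    during backprop, gradients) can be recombined in original order."""
    branch_index = [np.where(super_labels == k)[0] for k in range(num_experts)]
    sub_batches = [features[idx] for idx in branch_index]
    return sub_batches, branch_index

def route_backward(expert_grads, branch_index, batch_size):
    """Recombine per-expert gradient tensors into mini-batch order,
    mirroring steps (a) and (b) above."""
    merged = np.zeros((batch_size,) + expert_grads[0].shape[1:])
    for grads, idx in zip(expert_grads, branch_index):
        merged[idx] = grads
    return merged
```

In a real implementation these two functions would form the forward and backward passes of a single custom layer, with the branch index array rebuilt for every mini-batch.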
To accomplish routing at test time, we use a 5-way classifier (shaded red in Figure 2) whose output label corresponds to one of the expert networks. In this regard, we experimented with a variety of deep CNN architectures. We examined previous custom architectures for sketch recognition by Seddati et al. and Yang et al. In addition to custom nets, we explored architectures which involve fine-tuning image object classification networks such as AlexNet and GoogLeNet. Table 1 lists some of the architectures explored along with their classification performance. We briefly describe these architectures next.
[SketchParse-pool5][GAP][FC64][DO 0.5][FC5], using sketchified images - We first train SketchParse and then take the global-average-pooled output after pool5. All layers of SketchParse are frozen; only the fully connected layers are learnt, with a dropout of 0.5 on the fully connected layer.
[SketchParse-pool5][FC64][DO 0.5][FC64][DO 0.5][FC5], using sketchified images - This experiment is similar to the one above, but uses 3 fully connected layers after the global-average-pooled output of pool5, with a dropout of 0.5 on the fully connected layers.
GoogLeNet [GAP][FC5], using sketches - The last fully connected layer of GoogLeNet is removed and replaced by a global average pooling layer and a fully connected layer for classification. This model is then fine-tuned for sketch classification.
Custom net (used), using sketches - This is the custom net described in Table 2. We found that increasing the number of filters in all conv layers of this architecture improves performance.
AlexNet [GAP][FC5], using sketches - In this experiment, we replaced the last 3 fully connected layers of AlexNet with a global average pooling layer and a fully connected layer for classification. This new architecture was then fine-tuned.
|Architecture||Training input||Accuracy (%)|
|[SketchParse-pool5][GAP][FC64][DO 0.5][FC5]||sketchified images|| |
|[SketchParse-pool5][FC64][DO 0.5][FC64][DO 0.5][FC5]||sketchified images|| |
|GoogLeNet [GAP][FC5]||sketch images|| |
|Custom net (used)||sketch images||91.3|
|AlexNet [GAP][FC5]||sketch images|| |
We found that a customized version of Yang et al.’s Sketch-a-Net architecture, with normal (non-sketchified) input, provided the best performance. Specifically, we found that (1) systematically increasing the number of filters in all the conv layers and (2) using a larger kernel in the last layer results in better performance. We use a very high dropout rate in our network to combat over-fitting and to compensate for the smaller amount of training data. Table 2 depicts the architecture of our router classifier.
We also wish to point out that our initial attempts involved training custom CNNs solely on sketchified images or their deep feature variants. However, the classification performance was subpar. Therefore, we resorted to training the classifier using actual sketches.
4.2.1 Confusion Matrix
A highly accurate sketch classifier is a crucial requirement for proper routing and the overall performance of our approach. Table 3 shows the confusion matrix of the classifier used by our model on the test set. We note that most of the misclassification occurs due to confusion between the Small Animal and Large Animal super-categories, because these sketches tend to be very similar.
4.3 Auxiliary (Pose Estimation) Task Network
The architecture for estimating the 2-D pose of the sketch (shaded pink and shown within the top-right brown dotted line box in Figure 2) is motivated by the following observations:
First, the part-level parsing of the object typically provides clues regarding object pose. For example, if we view the panel for Two Wheelers in Figure 3, it is evident that the relative location and spatial extent of the ‘handlebar’ part of a bicycle is a good indicator of pose. Therefore, to enable pose estimation from part-level information, the input to the pose net is the tensor of pre-softmax pixelwise activations (shown as a blue oval in Figure 2) generated within the expert part-parsing network. To capture the large variation in part appearances, locations and combinations thereof, the first two layers in the pose network contain dilated convolutional filters, whose dilation enlarges the effective receptive field without increasing the parameter count. Each convolutional layer is followed by a ReLU non-linearity.
Second, 2-D pose is a global attribute of an object sketch. Therefore, to provide the network with sufficient global context, we configure the last convolutional layer with a large kernel and stride, effectively learning a large spatial template filter. The part combinations captured by the initial layers also mitigate the need to learn many such templates, so we use only a small number of them in our implementation. The resulting template-based feature representations are combined via a fully-connected layer and fed to an 8-way softmax layer which outputs 2-D pose labels corresponding to the cardinal and intercardinal directions. Note also that each super-category expert has its own pose estimation auxiliary net.
Having described the overall framework for SketchParse, we next describe major implementation details of our training and inference procedure.
5 Implementation Details
SketchParse: Before commencing SketchParse training, an initial learning rate is set for all but the final convolutional layers; the final convolutional layer in each sub-network is assigned its own rate. Batch-norm parameters are kept fixed during the entire training process. The architecture is trained end-to-end using a per-pixel cross-entropy loss as the objective function. For optimization, we employ stochastic gradient descent with a mini-batch size of one sketchified image, momentum, and a polynomial weight decay policy. We stop training after a fixed number of iterations; training was performed on an NVIDIA Titan X GPU.
A large variation can exist between part sizes within a given super-category (e.g. the number of ‘tail’ pixels is much smaller than the number of ‘body’ pixels in Large Animals). To accommodate this variation, we use a class-balancing scheme which weighs the per-pixel loss differently based on the relative presence of the corresponding ground-truth part. Suppose a pixel’s ground-truth part label is p, that p is present in Np training images, and that the total number of pixels with label p across these images is tp. We weight the corresponding loss by M/fp, where fp = tp/Np and M is the median over the set of fp values. In effect, losses for pixels of smaller parts are weighted to a larger extent and vice-versa.
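The weighting scheme above reduces to a few lines of numpy. The symbol names follow the notation introduced in the text (tp pixel counts, Np image counts); this is an illustrative sketch of the median-frequency-style balancing described, not the paper's code.

```python
import numpy as np

def part_loss_weights(pixel_counts, image_counts):
    """Class-balanced loss weights as described above:
    f_p = t_p / N_p (pixels of part p per image containing p);
    weight for part p is M / f_p, with M the median of the f values.
    Rarer/smaller parts thus receive larger weights."""
    t = np.asarray(pixel_counts, dtype=float)
    n = np.asarray(image_counts, dtype=float)
    f = t / n
    return np.median(f) / f
```

For example, with per-image pixel frequencies of 10, 100 and 1000 for three parts, the resulting weights are 10, 1 and 0.1 respectively.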
Pose Auxiliary Net: The pose auxiliary network is trained end-to-end with the rest of SketchParse, using an 8-way cross-entropy loss. The learning rate for the pose module is set separately. All other settings remain the same as above.
Sketch classifier: For training the 5-way sketch router classifier, we randomly sample a fixed fraction of sketches per category from the TU-Berlin and Sketchy datasets, holding out the remainder for validation. We augment the data with flips about the vertical axis, a series of rotations and sketch-scale augmentations. The classifier is trained using stochastic gradient descent with momentum and a polynomial weight decay policy, stopping after a fixed number of iterations. Training was performed on an NVIDIA Titan X GPU.
For evaluation, we used an equal number of sketches per category from the TU-Berlin and Sketchy datasets, except for the bus category, which is present only in the TU-Berlin dataset. For the sketch router classifier, we follow the conventional approach of pooling score outputs corresponding to cropped (four corner crops and one center crop) and white-padded versions of the original sketch and its vertically mirrored version. Overall, the time to obtain a part-level segmentation and pose for an input sketch is a fraction of a second on average. Thus, SketchParse is an extremely suitable candidate for developing applications which require real-time sketch understanding.
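The multi-view score pooling used by the router can be expressed compactly: average the classifier's scores over all cropped, padded and mirrored views of a sketch, then take the argmax label. The averaging choice here is an assumption for illustration; the paper only states that view scores are pooled.

```python
import numpy as np

def pooled_prediction(view_logits):
    """Average classifier scores over the augmented views of a sketch
    (corner/center crops, white-padded and mirrored versions), then
    return the argmax super-category label."""
    return int(np.argmax(np.mean(view_logits, axis=0)))
```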
To enable quantitative evaluation, we crowdsourced part-level annotations for all the sketches across categories, in effect creating the largest part-annotated dataset for sketch object parsing.
Evaluation procedure: For quantitative evaluation, we adopt the average IOU measure widely reported in the photo-based scene and object parsing literature. Consider a fixed test sketch, and let the number of unique part labels in the sketch be P. Let nij be the number of pixels of part-label i predicted as part-label j, and let ti = Σj nij be the total number of pixels whose ground-truth label is i. We first define the part-wise Intersection Over Union for part-label i as IOUi = nii / (ti + Σj nji − nii). Then, we define the sketch-average IOU as sIOU = (1/P) Σi IOUi. For a given category, we compute sIOU for each of its sketches individually and average the resulting values to obtain the category’s average IOU (aIOU) score.
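The sIOU measure above translates directly into code. This sketch assumes the ground-truth and predicted part-label maps are integer arrays of the same shape, and averages per-part IOU over the labels present in the ground truth.

```python
import numpy as np

def sketch_average_iou(gt, pred):
    """sIOU for one sketch: per-part IOU_i = n_ii / (t_i + sum_j n_ji - n_ii),
    averaged over the part labels present in the ground truth."""
    ious = []
    for i in np.unique(gt):
        n_ii = np.sum((gt == i) & (pred == i))   # correctly labelled pixels
        t_i = np.sum(gt == i)                    # ground-truth pixels of part i
        pred_i = np.sum(pred == i)               # pixels predicted as part i
        ious.append(n_ii / (t_i + pred_i - n_ii))
    return float(np.mean(ious))
```

Averaging sketch-level sIOU values over a category then gives that category's aIOU score.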
|Model variant||aIOU %|
|B + Class-Balanced Loss Weighting (C)|| |
|B + C + Pose Auxiliary Task (P)|| |
Significance of class-balanced loss and auxiliary task: To determine whether to incorporate class-balanced loss weighting and 2-D pose as auxiliary task, we conducted ablative experiments on a baseline version of SketchParse configured for a single super-category (‘Large Animals’). As the results in Table 4 indicate, class-balanced loss weighting and inclusion of 2-D pose as auxiliary task contribute to improved performance over the baseline model.
|Branch point / # layers specialized for a super-category||aIOU %|
|res5c / classifier block|| |
|res5b / 1 res block + classifier block|| |
|res5a / 2 res blocks + classifier block|| |
|res4b22 / 3 res blocks + classifier block|| |
|res4b17 / 8 res blocks + classifier block|| |
Determining the split point in the base model: A number of candidate split points exist which divide the layers of the base scene parsing net into two disjoint groups. We experimented with different split points within the scene parsing net; for each split point, we trained a full super-category version of the SketchParse model. Based on the results (see Table 5), we used the split point (res5b) which generated the best performance for the final version, viz. the SketchParse model with class-balanced loss weighting and pose estimation included for all super-categories. Note that we do not utilize the sketch router for determining the split point. From our experiments, we found that the optimal split point results in shallow expert networks. This imparts SketchParse with better scalability: additional categories and super-categories can be included without a large accompanying increase in the number of parameters.
|Large Animals||Small Animals||4-Wheelers||2-Wheelers||Flying Things|
|BCP-R5 (100% router)||64.45|
Relative loss weighting: The parsing and pose estimation tasks are trained simultaneously. We weigh the two losses individually, so the total loss is Ltotal = Lparsing + λ Lpose, and perform a grid search over λ and the learning rate. The grid search is performed on the super-category Large Animals, and the resulting optimal λ and learning rate (see Table 7) are used for each branch in the final routed network.
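The grid search itself is a straightforward exhaustive sweep. In the sketch below, `train_eval` is a hypothetical user-supplied function that trains a branch with the given λ and learning rate and returns its validation aIOU; the candidate grids are placeholders.

```python
import itertools

def grid_search(train_eval, lambdas, learning_rates):
    """Exhaustive grid search over the pose-loss weight (lambda) and
    learning rate, selecting the setting with the best validation aIOU.
    `train_eval(lam, lr)` is assumed to return the validation score."""
    return max(itertools.product(lambdas, learning_rates),
               key=lambda cfg: train_eval(*cfg))
```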
Full version net and baselines: We compare the final super-category version (BCP-R5) of SketchParse (containing class-balanced loss weighting and the pose estimation auxiliary task) with certain baseline variations: (i) B-R5: no additional components included; (ii) BC-R5: class-balanced loss weighting added to B-R5; (iii) BCP-R1: all categories grouped into a single branch; (iv) BCP-R11: a variant of the final version with a dedicated expert network for each category (i.e. one branch per category).
From the results (Table 6), we make the following observations: (1) Despite the challenges posed by hand-drawn sketches, our model performs reasonably well across a variety of categories (last but one row in Table 6). (2) Sketches from Large Animals are parsed best, while those from Flying Things fare worst. On closer scrutiny, we found that the bird category elicited inconsistent sketch annotations given the relatively higher degree of abstraction in the corresponding sketches. (3) In addition to confirming the utility of class-balanced loss weighting and pose estimation, the baseline performances demonstrate that part (and parameter) sharing at the super-category level is a crucial design choice, leading to better overall performance. In particular, note that having one category per branch (BCP-R11) almost doubles the number of parameters, indicating poor category scalability.
To examine the effect of the router, we computed IOU by assuming a completely accurate router for BCP-R5 (last row). This improves average IOU performance by 1.28. The small magnitude of improvement also shows that our router is quite reliable. The largest improvements are found among Small Animals (‘cat’,‘dog’). This is also supported by the router classifier’s confusion matrix (see Table 3).
The performance of the pose classifier can be viewed in Table 8. Note that simplifying to the canonical pose directions (merging the non-canonical directional labels with the canonical directions) yields a dramatic improvement in accuracy. In addition, most confusion among predictions occurs between the left-right directions and their corresponding perspective views (see Table 16). Depending on the granularity of pose information required, the perspective directions may be merged as appropriate.
Qualitative evaluation: Rather than cherry-pick results, we use a principled approach to obtain a qualitative perspective. We first sort the test sketches in each super-category by their aIOU (average IOU) values in decreasing order. We then select sketches located at fixed percentiles in the sorted order. These sketches can be viewed in Figure 3. The part-level parsing results reinforce the observations made previously in the context of quantitative evaluation.
6.1 Parsing semantically related categories
We also examine the performance of our model on sketches from categories it was not trained on but which are semantically similar to at least one of the existing categories. Since segmentation ground truth is unavailable for these sketches, we show two representative parsing outputs per class. We include the classes monkey, tiger, teddy-bear, camel, bear, giraffe, elephant, race car and tractor, which are semantically similar to categories already considered in our formulation. As the results demonstrate (Figure 4), SketchParse accurately recognizes parts it has seen before (‘head’, ‘body’, ‘leg’ and ‘tail’). It also exhibits a best-guess behaviour to explain parts it is unaware of. For instance, it marks the elephant’s ‘trunk’ as either ‘leg’ or ‘tail’, a semantically reasonable error given the spatial location of the part. These experiments demonstrate the scalability of our model in terms of category coverage. In other words, our architecture can integrate new, hitherto unseen categories without much effort.
6.2 Evaluating category-level scalability
To evaluate performance scalability when categories are added incrementally, we performed the following experiment: (1) Train the BCP-R5 network with all categories except ‘dog’ and ‘sheep’ from the Small Animals super-category. (2) Freeze the shared layers and fine-tune the Small Animals branch with ‘dog’ data added. (3) Freeze the shared layers and fine-tune the model from Step (2) with ‘sheep’ data added.
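The freeze-and-fine-tune protocol above can be sketched with the toy illustration below. It is framework-agnostic pseudocode in plain Python; in a real implementation the "freeze" step would correspond to disabling gradients on the shared layers (e.g. setting requires_grad to False in PyTorch) while the branch parameters continue to receive updates.

```python
# Toy illustration of incremental category addition: shared parameters stay
# frozen, and only the relevant super-category branch is fine-tuned.
def fine_tune(model, branch, new_category_data, lr=0.1):
    for example, grad in new_category_data:  # dummy (input, gradient) pairs
        # Shared layers are frozen: only branch parameters are updated.
        model["branches"][branch] = [w - lr * grad
                                     for w in model["branches"][branch]]
    return model

model = {
    "shared": [0.5, -0.2],                         # frozen after initial training
    "branches": {"small_animals": [0.1, 0.3]},
}
before_shared = list(model["shared"])

# Step (2): fine-tune the Small Animals branch with 'dog' data (dummy gradient).
model = fine_tune(model, "small_animals", [("dog_sketch", 0.05)])

assert model["shared"] == before_shared            # shared layers untouched
print(model["branches"]["small_animals"])
```

Because the shared layers never change, previously learned categories keep their behaviour while the branch adapts to the new data.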
We observe (Table 10) that the average IOU (last column) progressively improves as additional categories are added. In other words, overall performance does not drop even though the shared layers are frozen. This demonstrates the scalable nature of our architecture. The IOU values are, of course, slightly smaller than in the original setting (last row) where all categories are present from the beginning, but that is a consequence of freezing the shared layers.
6.3 Fine-grained retrieval
In another experiment, we determine whether part-level parsing of sketches can improve the performance of existing sketch-based image retrieval approaches which use global image-level features. We use the PASCAL parts dataset , consisting of photo images across categories, as the retrieval database. As a starting point, we consider the sketch-based image retrieval system of Sangkloy et al. . The system consists of a trained Siamese network which projects both sketches and photos into a shared latent space. We begin by supplying a query sketch from our dataset  and obtaining the ranked sequence of retrieved PASCAL parts images. We then use a customized Attribute-Graph approach  and construct a graph from the part-segmented version of the query sketch produced by SketchParse. The attribute graph is designed to capture the spatial and semantic aspects of the part-level information at local and global scales. We use annotations from the PASCAL parts dataset to obtain part-segmented versions of the retrieved images, which in turn are used to construct the corresponding attribute graphs.
Each graph has two kinds of nodes: a single global node and a local node for each non-contiguous instance of a part present in the segmentation output from SketchParse. The global node attributes include:
a histogram that keeps a count of each type of part present in the image
the non-background area in the image as a fraction of total area
A local node is instantiated at every non-contiguous part present in the segmentation output. We drop nodes for which the corresponding part area is less than 0.1% of the total non-background area with the assumption that these are artifacts in the segmentation output. Each local node has the following attributes:
angle subtended by the part at the centre of the sketch
centre of the part
Edges are present between local nodes corresponding to parts that have a common boundary. Each such edge encodes the relative position of both parts using a polar coordinate system.
Every local node is also connected to the global node. These edges encode the absolute position of the part in the image. The area of each part is used as a multiplicative weight in every similarity computation its corresponding node participates in.
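A minimal sketch of this attribute-graph construction is given below. It is simplified from the description above: the part-label map is a 2-D list, one local node is created per part label (instance separation is omitted), the "angle" attribute is reduced to the direction of the part centre from the image centre, and edge attributes are not shown. Field names are illustrative.

```python
import math
from collections import Counter

# Build a simplified attribute graph from a 2-D part-label map (0 = background).
def build_attribute_graph(label_map):
    h, w = len(label_map), len(label_map[0])
    fg = [(r, c, label_map[r][c]) for r in range(h) for c in range(w)
          if label_map[r][c] != 0]

    # Global node: part-type histogram and non-background area fraction.
    graph = {"global": {"histogram": Counter(l for _, _, l in fg),
                        "area_fraction": len(fg) / (h * w)},
             "local": []}

    # Local nodes; parts below 0.1% of the foreground area are treated
    # as segmentation artifacts and dropped.
    centre = (h / 2, w / 2)
    for label, count in graph["global"]["histogram"].items():
        if count < 0.001 * len(fg):
            continue
        rs = [r for r, _, l in fg if l == label]
        cs = [c for _, c, l in fg if l == label]
        part_centre = (sum(rs) / len(rs), sum(cs) / len(cs))
        angle = math.atan2(part_centre[0] - centre[0], part_centre[1] - centre[1])
        graph["local"].append({"part": label, "centre": part_centre,
                               "angle": angle, "area": count})
    return graph

g = build_attribute_graph([[0, 1, 1],
                           [0, 2, 2],
                           [0, 0, 2]])
print(g["global"]["area_fraction"])  # 5 foreground pixels out of 9
```

The part areas stored on the local nodes are what supply the multiplicative weights during similarity computation.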
For re-ranking the retrieved images, we use Reweighted Random Walks Graph Matching  to compute similarity scores between the query sketch’s graph and each retrieved image’s graph, although any other graph matching algorithm that allows incorporating constraints may be used. During the graph matching process we enforce two constraints:
A global node can only be matched to a global node of the other graph.
Local nodes can only be matched if they correspond to the same type of part (e.g. local nodes corresponding to legs can only be matched to other legs and cannot be matched to other body parts).
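These two constraints can be encoded as a binary compatibility check over candidate node pairs before running the matcher. A minimal sketch, assuming node dictionaries with a 'kind' field and, for local nodes, a 'part' field (both field names are illustrative):

```python
# Binary compatibility for candidate node correspondences: global nodes match
# only global nodes, and local nodes match only same-part local nodes.
def compatible(node_a, node_b):
    if node_a["kind"] != node_b["kind"]:
        return False
    if node_a["kind"] == "global":
        return True
    return node_a["part"] == node_b["part"]  # e.g. 'leg' matches only 'leg'

print(compatible({"kind": "global"}, {"kind": "global"}))                            # True
print(compatible({"kind": "local", "part": "leg"},
                 {"kind": "local", "part": "leg"}))                                  # True
print(compatible({"kind": "local", "part": "leg"},
                 {"kind": "local", "part": "head"}))                                 # False
```

In practice, incompatible pairs are simply assigned zero affinity so the matcher never considers them.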
For our experiments, we apply our re-ranking procedure to the top- retrieval results (out of the full retrieval set) returned by the Sketchy model. In Figure 5, each panel corresponds to the top- retrieval results for a particular sketch. The sketch and its parsing are displayed alongside the nearest neighbours in the latent space of the Sketchy model (top row) and the top re-ranked retrievals using our part-graphs (bottom row). The results show that our formulation exploits the category, part-level parsing and pose information to obtain an improved ranking.
6.4 Describing sketches in detail
Armed with the information provided by our model, we can go beyond describing a hand-drawn sketch by a single category label. For a given sketch, our model automatically provides its category, associated super-category, part labels with their counts, and 2-D pose information. From this information, we use a template-filling approach to generate descriptions; examples can be seen alongside our qualitative results in Figure 6. A fascinating application, inspired by the work of Zhang et al. , would be to use such descriptions to generate freehand sketches with a generative adversarial network.
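A template-filling step of this kind can be sketched as below. The template wording and attribute names are illustrative, not the exact ones used to produce the descriptions in Figure 6.

```python
# Fill a fixed sentence template from SketchParse-style predicted attributes.
def describe(attrs):
    parts = ", ".join("%d %s(s)" % (n, p)
                      for p, n in sorted(attrs["parts"].items()))
    return ("This is a sketch of a %s (a %s) facing %s, with %s."
            % (attrs["category"], attrs["super_category"],
               attrs["pose"], parts))

prediction = {
    "category": "horse",
    "super_category": "Large Animal",
    "pose": "left",
    "parts": {"head": 1, "body": 1, "leg": 4, "tail": 1},
}
print(describe(prediction))
# → This is a sketch of a horse (a Large Animal) facing left,
#   with 1 body(s), 1 head(s), 4 leg(s), 1 tail(s).
```

Because every slot comes directly from the model's outputs, the description degrades gracefully: a missing part simply drops out of the part list.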
Given the generally poor drawing skills of humans and the sparsity of detail in sketches, it is very challenging to simultaneously recognize and parse sketches across multiple groups of categories. In this paper, we have presented SketchParse, the first deep-network architecture for fully automatic parsing of freehand object sketches. The originality of our approach lies in successfully repurposing a photo scene-segmentation network into a category-hierarchical sketch object-parsing architecture. The general nature of our transfer-learning approach also allows us to leverage advances in fully convolutional network-based scene parsing, thus continuously improving performance. Another novelty lies in obtaining labelled training data for free by sketchifying photos from object-part datasets, thus bypassing the burdensome annotation step. Our work stands out from existing approaches in the complexity of the sketches handled and in the number and semantic variety of categories considered. While existing works focus on one or two super-categories and build separate models for each, our scalable architecture can handle a larger number of super-categories with a single, unified model. Finally, the utility of SketchParse’s novel multi-task architecture is underscored by its ability to enable applications such as fine-grained sketch description and improved sketch-based image retrieval.
Please visit https://github.com/val-iisc/sketch-parse for pre-trained models, code and resources related to the work presented in this paper.
For future work, it would be interesting to incorporate additional auxiliary tasks, such as an adversarial loss  and a part-histogram loss , to further boost part-parsing performance. Another natural direction would be to investigate the viability of SketchParse’s architecture for semantic parsing of photo objects.
-  A. H. Abdulnabi, G. Wang, J. Lu, and K. Jia. Multi-task cnn model for attribute prediction. IEEE Transactions on Multimedia, 17(11):1949–1959, 2015.
-  K. Ahmed, M. H. Baig, and L. Torresani. Network of experts for large-scale image categorization. In 14th European Conference on Computer Vision (Part VII), pages 516–532. Springer International Publishing, 2016.
-  A. Bergamo and L. Torresani. Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. In NIPS, pages 181–189, 2010.
-  L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.
-  L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv preprint arXiv:1606.00915, 2016.
-  X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In CVPR, 2014.
-  M. Cho, J. Lee, and K. M. Lee. Reweighted random walks for graph matching. In ECCV, pages 492–505. Springer-Verlag, 2010.
-  J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
-  D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of IEEE ICCV, pages 2650–2658, 2015.
-  M. Eitz, J. Hays, and M. Alexa. How do humans sketch objects? ACM Transactions on Graphics (TOG), 31(4):44, 2012.
-  M. Elhoseiny, T. El-Gaaly, A. Bakry, and A. Elgammal. A comparative analysis and study of multiview cnn models for joint object categorization and pose estimation. In Proceedings of ICML, volume 48, pages 888–897. JMLR.org, 2016.
-  A. Farhadi, I. Endres, and D. Hoiem. Attribute-centric recognition for cross-category generalization. In IEEE CVPR, pages 2352–2359. IEEE, 2010.
-  B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE CVPR, pages 447–456, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE CVPR, 2016.
-  S. Hong, J. Oh, H. Lee, and B. Han. Learning transferrable knowledge for semantic segmentation with deep convolutional neural network. In Proceedings of the IEEE CVPR, 2016.
-  Z. Huang, H. Fu, and R. W. H. Lau. Data-driven segmentation and labeling of freehand sketches. Proceedings of SIGGRAPH Asia, 2014.
-  R. H. Kazi, F. Chevalier, T. Grossman, S. Zhao, and G. Fitzmaurice. Draco: bringing life to illustrations with kinetic textures. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 351–360. ACM, 2014.
-  M. Lapin, B. Schiele, and M. Hein. Scalable multitask representation learning for scene classification. In Proceedings of the IEEE CVPR, pages 1434–1441, 2014.
-  X. Li, L. Zhao, L. Wei, M.-H. Yang, F. Wu, Y. Zhuang, H. Ling, and J. Wang. Deepsaliency: Multi-task deep neural network model for salient object detection. IEEE Transactions on Image Processing, 25(8):3919–3930, 2016.
-  Y. Li, T. M. Hospedales, Y.-Z. Song, and S. Gong. Fine-grained sketch-based image retrieval by matching deformable part models. In BMVC, 2014.
-  X. Liang, X. Shen, D. Xiang, J. Feng, L. Lin, and S. Yan. Semantic object parsing with local-global long short-term memory. In The IEEE CVPR, June 2016.
-  X. Liang, Y. Wei, X. Shen, J. Yang, L. Lin, and S. Yan. Proposal-free network for instance-level object segmentation. arXiv preprint arXiv:1509.02636, 2015.
-  J. J. Lim, C. L. Zitnick, and P. Dollár. Sketch tokens: A learned mid-level representation for contour and object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3158–3165, 2013.
-  J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE CVPR, pages 3431–3440, 2015.
-  P. Luc, C. Couprie, S. Chintala, and J. Verbeek. Semantic segmentation using adversarial networks. In NIPS Workshop on Adversarial Training, 2016.
-  B. Mahasseni and S. Todorovic. Latent multitask learning for view-invariant action recognition. In Proceedings of the IEEE ICCV, pages 3128–3135, 2013.
-  V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, pages 807–814, 2010.
-  V. Nekrasov, J. Ju, and J. Choi. Global deconvolutional networks for semantic segmentation. CoRR, abs/1602.03930, 2016.
-  H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE ICCV, pages 1520–1528, 2015.
-  V. M. Patel, R. Gopalan, R. Li, and R. Chellappa. Visual domain adaptation: A survey of recent advances. IEEE signal processing magazine, 32(3):53–69, 2015.
-  N. Prabhu and R. Venkatesh Babu. Attribute-graph: A graph based approach to image ranking. In Proceedings of the IEEE ICCV, pages 1071–1079, 2015.
-  R. Ranjan, V. M. Patel, and R. Chellappa. Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. arXiv preprint arXiv:1603.01249, 2016.
-  G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE CVPR, pages 3234–3243, 2016.
-  K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In ECCV, pages 213–226. Springer, 2010.
-  P. Sangkloy, N. Burnell, C. Ham, and J. Hays. The sketchy database: Learning to retrieve badly drawn bunnies. ACM Trans. Graph., 35(4):119:1–119:12, July 2016.
-  R. K. Sarvadevabhatla, J. Kundu, and R. V. Babu. Enabling my robot to play pictionary: Recurrent neural networks for sketch recognition. In Proceedings of the ACMMM, pages 247–251, 2016.
-  R. G. Schneider and T. Tuytelaars. Sketch classification and classification-driven analysis using fisher vectors. ACM Trans. Graph., 33(6):174:1–174:9, Nov. 2014.
-  R. G. Schneider and T. Tuytelaars. Example-based sketch segmentation and labeling using crfs. ACM Trans. Graph., 35(5):151:1–151:9, July 2016.
-  O. Seddati, S. Dupont, and S. Mahmoudi. Deepsketch: deep convolutional neural networks for sketch recognition and similarity search. In 13th International Workshop on Content-Based Multimedia Indexing (CBMI), pages 1–6. IEEE, 2015.
-  A. Theobald. An ontology for domain-oriented semantic similarity search on XML data. In BTW 2003, Datenbanksysteme für Business, Technologie und Web, Tagungsband der 10. BTW-Konferenz, 26.-28. Februar 2003, Leipzig, pages 217–226, 2003.
-  A. van Opbroek, M. A. Ikram, M. W. Vernooij, and M. De Bruijne. Transfer learning improves supervised image segmentation across imaging protocols. IEEE transactions on medical imaging, 34(5):1018–1030, 2015.
-  A. Vezhnevets and J. M. Buhmann. Towards weakly supervised semantic segmentation by means of multiple instance and multitask learning. In IEEE CVPR, pages 3249–3256. IEEE, 2010.
-  P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. L. Yuille. Joint object and part segmentation using deep learned potentials. In Proceedings of the IEEE ICCV, pages 1573–1581, 2015.
-  Wikipedia. Cardinal direction — Wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/Cardinal_direction, 2017.
-  J. M. Wong, V. Kee, T. Le, S. Wagner, G.-L. Mariottini, A. Schneider, L. Hamilton, R. Chipalkatty, M. Hebert, D. Johnson, et al. Segicp: Integrated deep semantic segmentation and pose estimation. arXiv preprint arXiv:1703.01661, 2017.
-  F. Xia, P. Wang, L.-C. Chen, and A. L. Yuille. Zoom better to see clearer: Human and object parsing with hierarchical auto-zoom net. In Proceedings of 14th European Conference in Computer Vision: Part V, pages 648–663, 2016.
-  R. Xiaofeng and L. Bo. Discriminatively trained sparse code gradients for contour detection. In Advances in neural information processing systems, pages 584–592, 2012.
-  Z. Yan, H. Zhang, R. Piramuthu, V. Jagadeesh, D. DeCoste, W. Di, and Y. Yu. Hd-CNN: hierarchical deep convolutional neural networks for large scale visual recognition. In Proceedings of the IEEE ICCV, pages 2740–2748, 2015.
-  F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
-  Q. Yu, F. Liu, Y.-Z. Song, T. Xiang, T. M. Hospedales, and C.-C. Loy. Sketch me that shoe. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
-  Q. Yu, Y. Yang, Y.-Z. Song, T. Xiang, and T. Hospedales. Sketch-a-net that beats humans. BMVC, 2015.
-  H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. arXiv preprint arXiv:1612.03242, 2016.
-  Y. Zhang, Y. Zhang, and X. Qian. Deep neural networks for free-hand sketch recognition. In 17th Pacific-Rim Conference on Multimedia, Xi’an, China, September 15-16, 2016, 2016.
-  B. Zhao, F. Li, and E. P. Xing. Large-scale category structure aware image categorization. In NIPS, pages 1251–1259, 2011.
-  J. Zhao and L. Itti. Improved deep learning of object category using pose information. CoRR, abs/1607.05836, 2016.
Confusion Matrices for Pose