SketchParse: Towards Rich Descriptions for Poorly Drawn Sketches using Multi-Task Hierarchical Deep Networks

09/05/2017 · Ravi Kiran Sarvadevabhatla et al., Indian Institute of Science

The ability to semantically interpret hand-drawn line sketches, although very challenging, can pave the way for novel applications in multimedia. We propose SketchParse, the first deep-network architecture for fully automatic parsing of freehand object sketches. SketchParse is configured as a two-level fully convolutional network. The first level contains shared layers common to all object categories. The second level contains a number of expert sub-networks. Each expert specializes in parsing sketches from object categories which contain structurally similar parts. Effectively, the two-level configuration enables our architecture to scale up efficiently as additional categories are added. We introduce a router layer which (i) relays sketch features from the shared layers to the correct expert and (ii) eliminates the need to manually specify the object category during inference. To bypass laborious part-level annotation, we sketchify photos from semantic object-part image datasets and use them for training. Our architecture also incorporates object pose prediction as a novel auxiliary task which boosts overall performance while providing supplementary information regarding the sketch. We demonstrate SketchParse's abilities (i) on two challenging large-scale sketch datasets, (ii) in parsing unseen, semantically related object categories and (iii) in improving fine-grained sketch-based image retrieval. As a novel application, we also outline how SketchParse's output can be used to generate caption-style descriptions for hand-drawn sketches.


1 Introduction

Hand-drawn line sketches have long been employed to communicate ideas in a minimal yet understandable manner. In this paper, we explore the problem of parsing sketched objects, i.e. given a freehand line sketch of an object, determining its salient attributes (e.g. category, semantic parts, pose). The ability to understand sketches in terms of local attributes (e.g. parts) and global attributes (e.g. pose) can drive novel applications such as sketch captioning, storyboard animation [17] and automatic drawing assessment apps for art teachers. The onset of the deep network era has resulted in architectures which can impressively recognize object sketches at a coarse (category) level [36, 39, 51]. Paralleling the advances in parsing of photographic objects [13, 21, 46] and scenes [4, 29, 8], the time is ripe for understanding sketches at a fine-grained level as well [50, 20].

A number of unique challenges need to be addressed for semantic sketch parsing. Unlike richly detailed color photos, line sketches are binary (black and white) and sparsely detailed. Sketches exhibit a large amount of appearance variation induced by the range of drawing skills among the general public. The resulting distortions in object depiction pose a challenge to parsing approaches. In many instances, the sketch is not drawn with a ‘closed’ object boundary, complicating annotation, part-segmentation and pose estimation. Given all these challenges, it is no surprise that only a handful of works exist for sketch parsing [16, 38]. However, even these approaches have their own share of drawbacks (Section 2).

To address these issues, we propose a novel architecture called SketchParse for fully automatic sketch object parsing. In our approach, we make three major design decisions:

Design Decision #1 (Data): To bypass burdensome part-level sketch annotation, we leverage photo image datasets containing part-level annotations of objects [6]. Suppose I is an object image and P is the corresponding part-level annotation. We subject I to a sketchification procedure (Section 3.2) and obtain a sketchified image S. Thus, our training data consists of sketchified image and part-level annotation pairs (S, P) for each category (Figure 1).

Design Decision #2 (Model): Many structurally similar object categories tend to have common parts. For instance, ‘wings’ and ‘tail’ are common to both birds and airplanes. To exploit such shared semantic parts, we design our model as a two-level network of disjoint experts (see Figure 2). The first level contains shared layers common to all object categories. The second level contains a number of experts (sub-networks). Each expert is configured for parsing sketches from a super-category set comprising categories with structurally similar parts (for example, the categories cat, dog and sheep comprise the super-category Small Animals). Instead of training from scratch, we instantiate our model using two disjoint groups of pre-trained layers from a scene parsing net (Section 4.1). We perform training using the sketchified data mentioned above (Section 5.1). At test time, the input sketch is first processed by the shared layers to obtain intermediate features. In parallel, the sketch is also provided to a super-category sketch classifier. The label output of the classifier is used to automatically route the intermediate features to the appropriate super-category expert, which produces the final output, i.e. the part-level segmentation (Section 5.2).
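To make the routing concrete, the following is a minimal PyTorch-style sketch of the two-level inference path (shared layers, router classifier, expert sub-networks). It is an illustration under stated assumptions rather than the released implementation: the three modules are supplied externally, and the router is assumed to return one score per super-category.

import torch
import torch.nn as nn

class TwoLevelParser(nn.Module):
    def __init__(self, shared: nn.Module, experts: nn.ModuleList, router: nn.Module):
        super().__init__()
        self.shared = shared      # level 1: layers common to all categories
        self.experts = experts    # level 2: one sub-network per super-category
        self.router = router      # super-category classifier (assumed to output B x K scores)

    def forward(self, sketch: torch.Tensor) -> list:
        feats = self.shared(sketch)                 # category-agnostic intermediate features
        branch = self.router(sketch).argmax(dim=1)  # predicted super-category per sketch
        # Experts may predict different numbers of part labels, so outputs are
        # returned per sketch instead of being concatenated.
        return [self.experts[b](feats[i:i + 1]) for i, b in enumerate(branch.tolist())]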

Design Decision #3 (Auxiliary Tasks): A popular paradigm to improve performance of the main task is to have additional yet related auxiliary targets in a multi-task setting [32, 1, 18, 26]. Motivated by this observation, we configure each expert network for the novel auxiliary task of 2-D pose estimation.

At first glance, our approach seems infeasible. After all, sketchified training images resemble actual sketches only in terms of stroke density (see Figure 1); they seem to lack the fluidity and unstructured feel of hand-drawn sketches. Moreover, SketchParse’s base model [5], originally designed for photo scene segmentation, seems an unlikely candidate for enabling transfer-learning-based sketch object segmentation. Yet, as we shall see, our design choices result in an architecture which successfully accomplishes sketch parsing across multiple categories and sketch datasets.

Contributions:

  • We propose SketchParse – the first deep hierarchical network for fully automatic parsing of hand-drawn object sketches (Section 4). Our architecture includes object pose estimation as a novel auxiliary task.

  • We provide the largest dataset of part-annotated object sketches across multiple categories and multiple sketch datasets. We also provide 2-D pose annotations for these sketches.

  • We demonstrate SketchParse’s abilities on two challenging large-scale sketch object datasets (Section 5.2), on unseen, semantically related categories (Section 6.1) and for improving fine-grained sketch-based image retrieval (Section 6.3).

  • We outline how SketchParse’s output can form the basis for novel applications such as automatic sketch description (Section 6.4).

Please visit https://github.com/val-iisc/sketch-parse for pre-trained models, code and resources related to the work presented in this paper.

2 Related Work

Semantic Parsing (Photos):

Existing deep-learning approaches for semantic parsing of photos can be categorized into two groups. The first group consists of approaches for scene-level semantic parsing (i.e. outputting an object label for each pixel in the scene) [5, 29, 8]. The second group of approaches attempt semantic parsing of objects (i.e. outputting a part label for each object pixel) [13, 21, 46]. Compared to our choice (of a scene parsing net), this latter group of object-parsing approaches might seem better candidates for the base architecture. However, they pose specific difficulties for adoption. For instance, the hypercolumns approach of Hariharan et al. [13] requires training separate part classifiers for each class. Also, the evaluation is confined to a small number of classes (animals and human beings). The approach of Liang et al. [21] consists of a complex multi-stage hybrid CNN-RNN architecture evaluated on only two animal categories (horse and cow). The approach of Xia et al. [46] fuses part score evidence from the scene, object and part levels to obtain impressive results for two categories (humans, large animals). However, it is not clear how their method can be adapted for our purpose.

Figure 1: An illustration of our sketchification procedure (Section 3.2). The edge image E, corresponding to the input photo I, is merged with the part and object contours derived from the ground-truth labeling P to obtain the final sketchified image S. We train SketchParse using the S instances as inputs and the P instances as the corresponding part-labelings.

Semantic Object Parsing (Sketches): Only a handful of works exist for sketch parsing [16, 38]. Existing approaches require a part-annotated dataset of sketch objects, and obtaining such part-level annotations is laborious and cumbersome. Moreover, one-third of a dataset [16] evaluated by these approaches consists of sketches drawn by professional artists; given the artists’ relatively better drawing skills, incorporating such sketches artificially reduces the complexity of the problem. On the architectural front, these approaches involve tweaking multiple parameters in a hand-crafted segmentation pipeline. Existing approaches label individual sketch strokes as parts. This requires even strokes within part interiors to be labelled, which can result in peculiar segmentation errors [38]. Our method, in contrast, labels object regions as parts. In many instances, the object region boundary consists of non-stroke pixels; therefore, it is not possible to directly compare with existing approaches. Unlike our category-scalable and fully automatic approach, these methods assume the object category is known and train a separate model per category (e.g. dog and cat require separate models). Existing implementations of these approaches also have prohibitive inference times – parsing an object sketch takes on the order of minutes [16, 38] – rendering them unsuitable for interactive sketch-based applications. In contrast, our model’s inference time is a fraction of a second. Also, our scale of evaluation is significantly larger: for example, Schneider et al.’s method [38] is evaluated on 5 test sketches per category, whereas our model is evaluated on a far larger number of sketches per category. Finally, none of the previous approaches exploit the hierarchical category-level groupings which arise naturally from structural similarities [54]. This renders them prone to a drop in performance as additional categories (and their parts) are added.

Sketch Recognition: The initial performance of handcrafted feature-based approaches [37] for sketch recognition has been surpassed in recent times by deep CNN architectures [51, 39]. The sketch router classifier in our architecture is a modified version of the Sketch-a-Net architecture of Yu et al. [51]. While the works mentioned above use sketches, Zhang et al. [53] use sketchified photos for training a CNN classifier which outputs class labels. We too use sketchified photos for training; however, the task in our case is parsing rather than classification.

Class-hierarchical CNNs: Our idea of having an initial coarse-category net which routes the input to finer-category experts can be found in some recent works as well [2, 48], albeit for object classification. In these works, the coarse-category net is intimately tied to the main task (viz. classification). In our case, the coarse-category CNN classifier serves a secondary role, helping to route the output of a parallel, shared sub-network to the finer-category parsing experts. Also, unlike the above works, the task of our secondary net (classification) is different from the task of the experts (segmentation).

Domain Adaptation/Transfer Learning: Our approach can be viewed as belonging to the category of domain adaptation techniques [34, 30, 3]. These techniques have proven to be successful for various problems, including image parsing [33, 15, 41]. However, unlike most approaches wherein image modality does not change, our domain-adaptation scenario is characterized by extreme modality-level variation between source (image) and target (freehand sketch). This drastically reduces the quantity and quality of data available for transfer learning, making our task more challenging.

Multi-task networks: The effectiveness of addressing auxiliary tasks in tandem with the main task has been shown for several challenging problems in vision [32, 1, 18, 26]. In particular, object classification [28], detection [8], geometric context [42], saliency [19] and adversarial loss [25] have been utilized as auxiliary tasks in deep network-based approaches for semantic parsing. The auxiliary task we employ – object viewpoint estimation – has been used in a multi-task setting but for object classification [55, 11]. For instance, Wong et al. [45] project the output of semantic scene segmentation onto a depth map and use a category-aware 3D model approach to enable 3-D pose-based grasping in robots. Zhao and Itti [55] utilize 3-D pose information from toy models of categories as a target auxiliary task to improve object classification. Elhoseiny et al. [11] also introduce pose-estimation as an auxiliary task along with classification. To the best of our knowledge, we are the first ones to design a custom pose estimation architecture to assist semantic parsing.

3 Data Preparation

We first summarize salient details of two semantic object-part photo datasets.

3.1 Object photo datasets

PASCAL-Parts: This image dataset [6] provides semantic part segmentations of objects from the 20 object categories of the PASCAL VOC2010 dataset. We select 11 categories (aeroplane, bicycle, bus, car, cat, cow, dog, flying bird, horse, motorcycle, sheep) for our experiments. To obtain cropped object images, we used the object bounding box annotations from PASCAL-Parts.

CORE: The Cross-category Object REcognition (CORE) dataset [12] contains segmentation and attribute information for objects in images distributed across categories of vehicles and animals. We select categories from the CORE dataset based on their semantic similarity with the PASCAL-Parts categories (e.g. the CORE category crow is selected since it is semantically similar to the PASCAL-Parts category bird).

To enable estimation of object pose as an auxiliary objective, we annotated all the images from the PASCAL-Parts and CORE datasets with 2-D pose information based on the object’s orientation with respect to the viewing plane. Specifically, each image is labeled with one of the cardinal (‘North’, ‘East’, ‘West’, ‘South’) or intercardinal (‘NE’, ‘NW’, ‘SE’, ‘SW’) directions [44]. We plan to release these pose annotations publicly for the benefit of the multimedia community.

Next, we shall describe the procedure for obtaining sketchified versions of photos sourced from these datasets.

3.2 Obtaining sketchified images

Suppose I is an object image. As the first step, we apply a Canny edge detector tuned to produce only the most prominent object edges, yielding an edge image E. Visually, we found the resulting image to contain edge structures which perceptually resemble human sketch strokes more closely than those produced by alternatives such as Sketch Tokens [23] and SCG [47]. We augment E with the part contours and object contours from the part-annotation data P of I and perform morphological dilation to thicken edge segments (using a square structuring element), obtaining the sketchified image S (see Figure 1). To augment the data available for training the segmentation model, we apply a series of rotations and mirroring about the vertical axis to S. Overall, this procedure yields several augmented images per original sketchified image. To ensure good coverage of parts and eliminate inconsistent labelings, we manually curated the object images from the PASCAL-Parts and CORE datasets. Given the varying semantic granularity of part labels across these datasets, we also manually curated the parts considered for each category [43]. Finally, we obtain the training dataset consisting of paired sketchified images and corresponding part-level annotations, distributed across the 11 object categories.
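The sketchification and augmentation steps can be summarized by the short OpenCV sketch below. It is a simplified rendition of the procedure described above: the Canny thresholds, the structuring-element size and the rotation angles are placeholder values, not the ones used in the paper.

import cv2
import numpy as np

def sketchify(photo_bgr: np.ndarray, part_labels: np.ndarray) -> np.ndarray:
    """Return a binary, sketch-like image from a photo and its part annotation."""
    gray = cv2.cvtColor(photo_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)                      # prominent object edges (E)
    contours = np.zeros_like(edges)                        # part/object contours from P
    for lbl in np.unique(part_labels):
        if lbl == 0:                                       # skip background
            continue
        mask = (part_labels == lbl).astype(np.uint8) * 255
        cs, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
        cv2.drawContours(contours, cs, -1, 255, 1)
    merged = cv2.bitwise_or(edges, contours)
    kernel = np.ones((3, 3), np.uint8)                     # square structuring element
    return cv2.dilate(merged, kernel)                      # sketchified image S

def augment(img: np.ndarray, angles=(-10, 10, 20)) -> list:
    """Rotations plus a vertical-axis mirror; apply identically to the label map."""
    h, w = img.shape[:2]
    out = []
    for base in (img, cv2.flip(img, 1)):
        out.append(base)
        for a in angles:
            M = cv2.getRotationMatrix2D((w / 2, h / 2), a, 1.0)
            out.append(cv2.warpAffine(base, M, (w, h)))
    return out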

We evaluate our model’s performance on freehand line sketches from two large-scale datasets. We describe these datasets and associated data preparation procedures next.

3.3 Sketch datasets and augmentation

TU-Berlin: The TU-Berlin sketch database [10] contains 20,000 hand-drawn sketches spanning 250 common object categories, with 80 sketches per category. For this dataset, only the category name was provided to the sketchers during the drawing phase.

Sketchy: The Sketchy database [35] contains about 75,000 sketches spanning 125 object categories. To collect this dataset, photo images of objects were initially shown to human subjects. After a brief gap, the image was replaced by a gray screen and subjects were asked to sketch the object from memory. Compared to the draw-from-category-name-only approach employed for the TU-Berlin dataset [10], this memory-based approach provides a larger variety of viewpoints and object detail in the sketches. On average, each object photo is associated with multiple different sketches.

From both datasets, we use sketches only from those object categories which overlap with the PASCAL-Parts categories mentioned in Section 3.1. For augmentation, we first apply morphological dilation (using a square structuring element) to each sketch. This operation helps minimize the loss of stroke continuity when the sketch is processed by the deeper layers of the network. Subsequently, we apply a series of rotations and mirroring about the vertical axis to the dilated sketch. This produces several augmented variants per original sketch for use during training.

4 Our model (SketchParse)

Figure 2: The first level of SketchParse (shaded purple) is instantiated with the shallower layers of a scene parsing net. The second level consists of expert super-category nets (shaded yellow) and is instantiated with the deeper layers of the scene parsing net. Given a test sketch, the Router Layer (shaded green) relays the intermediate features produced by the shared layers to the target expert. The pre-softmax activations (blue oval) generated in the expert are used to obtain the part parsing. These activations are also used by our novel pose estimation auxiliary net (shaded light-pink). The architecture of the pose net can be seen within the brown dotted-line box at top-right. Within the pose net, convolutional layers are specified in the format kernel dimensions / stride / dilation [number of filters]. FC = Fully-Connected layer. The dash-dotted line connected to the router classifier indicates that the Router Layer is utilized only during inference.

4.1 Instantiating SketchParse levels

We design a two-level deep network architecture for SketchParse (see Figure 2). The first level is intended to capture category-agnostic low-level information and contains shared layers common to all object categories. We instantiate the first-level layers with the shallower, initial layers of a scene parsing net [5]. Specifically, we use the multi-scale, ResNet-101-based [14] variant of DeepLab [5], a well-known architecture designed for semantic scene parsing. In this context, we wish to emphasize that our design is general and can accommodate any fully convolutional scene parsing network.

The categories are grouped into five smaller, disjoint super-category subsets using meronym (i.e. part-of relation) based similarities between objects [40]. To obtain the super-categories, we start with the given categories and cluster them based on the fraction of common meronyms (part names). For example, ‘flying bird’ and ‘airplane’ share the largest common set of part names between themselves compared to any other categories and therefore form a single cluster. This procedure can also be used when a new category is introduced – we assign it to the super-category with which it shares the largest common set of part names.
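The grouping rule can be illustrated with a small helper that assigns a category to the super-category whose part vocabulary it overlaps most with; the part lists below are illustrative, not the exact label sets used in the paper.

def part_overlap(parts_a: set, parts_b: set) -> float:
    """Fraction of common part names (Jaccard similarity)."""
    return len(parts_a & parts_b) / len(parts_a | parts_b)

def assign_super_category(new_parts: set, super_categories: dict) -> str:
    """super_categories maps a super-category name to the union of its members' part names."""
    return max(super_categories, key=lambda s: part_overlap(new_parts, super_categories[s]))

# Example with assumed part vocabularies:
supers = {
    "Flying Things": {"head", "body", "wing", "tail"},
    "Large Animals": {"head", "body", "leg", "tail"},
}
print(assign_super_category({"head", "body", "wing", "tail", "engine"}, supers))  # Flying Things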

The second level consists of five expert sub-networks, each of which is specialized for parsing sketches associated with one super-category. We initialize these experts using the deeper, latter layers of the scene parsing model. Suppose the total number of parts across all object categories within the k-th super-category is P_k. We modify the final layer of each expert network so that it outputs P_k part-label predictions. In our current version of the architecture, there are 11 categories grouped into 5 super-categories. We performed ablative experiments to determine the optimal location for splitting the layers of the semantic scene parsing net into two groups. Based on these experiments, we use all the layers up to the res5b block as shared layers.

4.2 Router Layer

From the above description (Section 4.1), SketchParse’s design so far consists of a single shared sub-network and five expert nets. We require a mechanism for routing the intermediate features produced by the shared layers to the target expert network. In addition, we require a mechanism for backpropagating error gradients from the expert nets and updating the weights of the shared-layer sub-network during training. To meet these requirements, we design a Router Layer (shaded green in Figure 2). During the training phase, routing of features from the shared layers is dictated by the ground-truth category and, by extension, the super-category it is associated with. A branch index array is maintained for each training mini-batch. Since the ground-truth super-category for each training example is available, creation of the branch index array requires only knowledge of the mini-batch label composition. The array entries are referenced during backward propagation to (a) recombine the gradient tensors in the same order as the mini-batch and (b) route error gradients from the appropriate branch to the shared layers during backpropagation.
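A minimal PyTorch-style sketch of this training-time behaviour is given below: features for a mini-batch are split per super-category using the branch index array, passed through the corresponding experts, and the outputs (and hence, via autograd, the gradients) are recombined in the original mini-batch order. This mirrors the description above rather than the released code.

import torch
import torch.nn as nn

def routed_forward(features: torch.Tensor,
                   branch_index: torch.Tensor,   # ground-truth super-category per example
                   experts: nn.ModuleList) -> list:
    """Return per-example expert outputs, ordered as in the mini-batch."""
    outputs = [None] * features.size(0)
    for k, expert in enumerate(experts):
        idx = (branch_index == k).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue
        routed = expert(features.index_select(0, idx))
        for j, i in enumerate(idx.tolist()):
            outputs[i] = routed[j]                # recombine in mini-batch order
    return outputs                                # gradients flow back branch-wise through autograd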

To accomplish routing at test time, we use a 5-way classifier (shaded red in Figure 2) whose output label corresponds to one of the expert networks. In this regard, we experimented with a variety of deep CNN architectures. We examined previous custom architectures for sketch recognition by Seddati et al. [39] and Yu et al. [51]. In addition to custom nets, we explored architectures which involved fine-tuning image object classification networks such as AlexNet and GoogLeNet. Table 1 lists some of the architectures explored along with their classification performance. We briefly describe these architectures next.

  • [SketchParse-pool5][GAP][FC64][DO 0.5][FC5], using sketchified images – We first train SketchParse and then take the global-average-pooled output after pool5. All the layers of SketchParse are frozen; only the fully connected layers are learnt, with 0.5 dropout on the fully connected layer.

  • [SketchParse-pool5][FC64][DO 0.5][FC64][DO 0.5][FC5], using sketchified images – Similar to the above experiment, but with three fully connected layers after the global-average-pooled output of pool5 and 0.5 dropout on the fully connected layers.

  • GoogLeNet [GAP][FC5], using sketches – The last fully connected layer of GoogLeNet is removed and replaced by a global average pooling layer and a fully connected layer for classification. This model is then fine-tuned for sketch classification.

  • Custom net (used), using sketches – This is the custom net described in Table 2. We found that increasing the number of filters in all conv layers of this architecture improves performance.

  • AlexNet [GAP][FC5], using sketches – We replaced the last 3 fully connected layers of AlexNet with a global average pooling layer and a fully connected layer for classification. This new architecture was then fine-tuned.

Architecture training data performance (%)
[SketchParse-pool5][GAP][FC64][DO 0.5][FC5] sketchified images
[SketchParse-pool5][FC64][DO 0.5][FC64][DO 0.5][FC5] sketchified images
GoogLeNet [GAP][FC5] sketch images
Custom net (used) sketch images 91.3
AlexNet [GAP][FC5] sketch images
Table 1: Some of the architectures explored to find the best router classifier, along with their classification performance.

We found that a customized version of the Sketch-a-Net architecture [51], with a normal (non-sketchified) input, provided the best performance. Specifically, we found that (1) systematically increasing the number of filters in all the conv layers and (2) using a larger kernel in the last layer results in better performance. We use a very high dropout rate of 70% in our network to combat over-fitting and to compensate for the lower amount of training data. Table 2 depicts the architecture of our router classifier.

We also wish to point out that our initial attempts involved training custom CNNs solely on sketchified images or their deep feature variants. However, the classification performance was subpar. Therefore, we resorted to training the classifier using actual sketches.

index Type Filter size no. of filters stride
1 Conv 15x15 64 3
2 ReLU - - -
3 Maxpool 3x3 - 2
4 Conv 5x5 128 1
5 ReLU - - -
6 Maxpool 3x3 - 2
7 Conv 3x3 256 1
8 ReLU - - -
9 Conv 3x3 256 1
10 ReLU - - -
11 Conv 3x3 256 1
12 ReLU - - -
13 Maxpool 3x3 - 2
14 Conv 1x1 512 -
15 ReLU - - -
16 Dropout (0.7) - - -
17 Conv 1x1 5 -
Table 2: The architecture of the sketch router classifier used by SketchParse.
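For reference, the following is a PyTorch rendering of Table 2. It is a sketch rather than the released model: the input is assumed to be a single-channel sketch image, padding is left at its default, and the 5-channel output of the final 1x1 convolution is assumed to be spatially pooled to obtain the super-category scores.

import torch.nn as nn

router = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=15, stride=3),   # index 1
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(64, 128, kernel_size=5, stride=1),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(128, 256, kernel_size=3, stride=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, kernel_size=3, stride=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, kernel_size=3, stride=1),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 512, kernel_size=1),           # index 14
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.7),                            # index 16
    nn.Conv2d(512, 5, kernel_size=1),             # index 17: one score map per super-category
)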

4.2.1 Confusion Matrix

A highly accurate sketch classifier is a crucial requirement for proper routing and for the overall performance of our approach. Table 3 shows the confusion matrix of the classifier used by our model on the test set. We note that most of the misclassification occurs due to confusion between the Small Animals and Large Animals super-categories, because these sketches tend to be very similar.

Large Anim. Small Anim. 4 wheel. 2 wheel. Flying.
Large Anim.
Small Anim.
4 wheel.
2 wheel.
Flying.
Table 3: The confusion matrix of the classifier used by our model on the test set.

4.3 Auxiliary (Pose Estimation) Task Network

The architecture for estimating the 2-D pose of the sketch (shaded pink and shown within the top-right brown dotted line box in Figure 2) is motivated by the following observations:

First, the part-level parsing of the object typically provides clues regarding object pose. For example, if we view the panel for Two Wheelers in Figure 3, it is evident that the relative location and spatial extent of the ‘handlebar’ part of a bicycle is a good indicator of pose. Therefore, to enable pose estimation from part-level information, the input to the pose net is the tensor of pre-softmax pixelwise activations (shown as a blue oval in Figure 2) generated within the expert part-parsing network. To capture the large variation in part appearances, locations and combinations thereof, the first two layers of the pose network contain dilated convolutional filters [49]. Each convolutional layer is followed by a ReLU non-linearity [27].

Second, 2-D pose is a global attribute of an object sketch. Therefore, to provide the network with sufficient global context, we configure the last convolutional layer with a large kernel, effectively learning large spatial template filters. The part combinations captured by the initial layers also mitigate the need to learn many such templates, so only a small number of them are used in our implementation. The resulting template-based feature representations are combined via a fully-connected layer and fed to an 8-way softmax layer which outputs 2-D pose labels corresponding to the cardinal and intercardinal directions. Note also that each super-category expert has its own pose estimation auxiliary net.
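A minimal PyTorch-style sketch of the pose auxiliary net is shown below. The channel counts, dilation rates, kernel sizes and number of template filters are illustrative placeholders, not the paper's values; only the overall structure – dilated convolutions over the expert's pre-softmax part activations, a large template convolution, a fully-connected layer and an 8-way classifier – follows the description above.

import torch
import torch.nn as nn

class PoseNet(nn.Module):
    def __init__(self, num_parts: int, num_templates: int = 8, template_size: int = 11):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(num_parts, 64, kernel_size=3, dilation=2, padding=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, num_templates, kernel_size=template_size),  # large spatial "templates"
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(num_templates, 8)  # cardinal + intercardinal directions

    def forward(self, part_scores: torch.Tensor) -> torch.Tensor:
        x = self.features(part_scores).flatten(1)
        return self.classifier(x)                      # 8-way pose logits (softmax applied in the loss)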

Having described the overall framework for SketchParse, we next describe major implementation details of our training and inference procedure.

5 Implementation Details

5.1 Training

SketchParse: Before commencing SketchParse training, initial learning rates are set separately for the final convolutional layer of each expert sub-network and for all remaining layers. Batch-norm parameters are kept fixed during the entire training process. The architecture is trained end-to-end using a per-pixel cross-entropy loss as the objective function. For optimization, we employ stochastic gradient descent with a mini-batch size of one sketchified image, momentum, and a polynomial decay policy, and stop training after a fixed number of iterations. Training is performed on an NVIDIA Titan X GPU.

A large variation can exist between part sizes for a given super-category (e.g. the number of ‘tail’ pixels is smaller than the number of ‘body’ pixels in Large Animals). To accommodate this variation, we use a class-balancing scheme which weighs the per-pixel loss differently based on the relative presence of the corresponding ground-truth part [9]. Suppose a pixel’s ground-truth part label is c, c is present in I_c training images, and the total number of pixels with label c across these images is n_c. We weight the corresponding loss by α_c = f_med / f_c, where f_c = n_c / t_c, t_c is the total pixel count of those I_c images, and f_med is the median over the set of f_c values. In effect, losses for pixels of smaller parts get weighted to a larger extent and vice versa.
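The weighting scheme amounts to median-frequency balancing and can be computed offline from the ground-truth label maps, as in the short sketch below (label 0 is assumed to denote background).

import numpy as np

def part_loss_weights(label_maps: list) -> dict:
    """Return a loss weight per part label from a list of ground-truth label maps."""
    pixel_count, image_pixels = {}, {}
    for lm in label_maps:
        for lbl in np.unique(lm):
            if lbl == 0:
                continue
            pixel_count[lbl] = pixel_count.get(lbl, 0) + int((lm == lbl).sum())
            image_pixels[lbl] = image_pixels.get(lbl, 0) + lm.size   # pixels of images containing lbl
    freq = {lbl: pixel_count[lbl] / image_pixels[lbl] for lbl in pixel_count}
    f_med = float(np.median(list(freq.values())))
    return {lbl: f_med / f for lbl, f in freq.items()}               # rarer parts get larger weights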

Pose Auxiliary Net: The pose auxiliary network is trained end-to-end with the rest of SketchParse, using an 8-way cross-entropy loss. The learning rate for the pose module is set separately (see the grid search in Table 7). All other settings remain the same as above.

Sketch classifier: For training the 5-way sketch router classifier, we randomly sample a fixed fraction of sketches per category from the TU-Berlin and Sketchy datasets and hold out the remainder for validation. We augment the data with flips about the vertical axis, a series of rotations and sketch-scale augmentations (rescaling relative to image height). The classifier is trained using stochastic gradient descent with momentum and a polynomial decay policy, and training is stopped after a fixed number of iterations on an NVIDIA Titan X GPU.

5.2 Inference

For evaluation, we used an equal number of sketches per category from the TU-Berlin and Sketchy datasets, except for the bus category, which is present only in the TU-Berlin dataset. For the sketch router classifier, we follow the conventional approach [39] of pooling the score outputs corresponding to cropped (four corner crops and one center crop) and white-padded versions of the original sketch and its vertically mirrored version. Overall, the time to obtain a part-level segmentation and pose for an input sketch is a fraction of a second on average. Thus, SketchParse is a suitable candidate for developing applications which require real-time sketch understanding.

6 Experiments

To enable quantitative evaluation, we crowdsourced part-level annotations for all the test sketches across the 11 categories, in effect creating the largest part-annotated dataset for sketch object parsing.

Evaluation procedure: For quantitative evaluation, we adopt the average IOU measure widely reported in the photo-based scene and object parsing literature [24]. Consider a fixed test sketch and let the number of unique part labels in the sketch be P. Let n_ij be the number of pixels of part-label i predicted as part-label j, and let t_i be the total number of pixels whose ground-truth label is i. We first define the part-wise Intersection Over Union for part-label i as IOU_i = n_ii / (t_i + Σ_j n_ji − n_ii). Then, we define the sketch-average IOU as sIOU = (1/P) Σ_i IOU_i. For a given category, we compute sIOU for each of its sketches individually and average the resulting values to obtain the category’s average IOU (aIOU) score.
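The metric can be computed directly from the predicted and ground-truth label maps, as in the sketch below; the treatment of the background label is an assumption (here it participates like any other label present in the sketch).

import numpy as np

def sketch_average_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """sIOU: mean of per-part IOUs over the unique labels present in the ground truth."""
    ious = []
    for lbl in np.unique(gt):
        inter = np.logical_and(pred == lbl, gt == lbl).sum()
        union = np.logical_or(pred == lbl, gt == lbl).sum()
        ious.append(inter / union if union > 0 else 0.0)
    return float(np.mean(ious))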

Architecture aIOU %
Baseline (B)
B + Class-Balanced Loss Weighting (C)
B + C + Pose Auxiliary Task (P)
Table 4: Performance for architectural additions to a single super-category (‘Large Animals’) version of SketchParse.

Significance of class-balanced loss and auxiliary task: To determine whether to incorporate class-balanced loss weighting and 2-D pose as auxiliary task, we conducted ablative experiments on a baseline version of SketchParse configured for a single super-category (‘Large Animals’). As the results in Table 4 indicate, class-balanced loss weighting and inclusion of 2-D pose as auxiliary task contribute to improved performance over the baseline model.

Branch point / # layers specialized for a super-category aIOU %
res5c / classifier block
res5b / 1 res block + classifier block
res5a / 2 res blocks + classifier block
res4b22 / 3 res blocks + classifier block
res4b17 / 8 res blocks + classifier block
Table 5: Performance for various split locations within the scene parsing net [5]. As suggested by the results, we use the layers up to res5b to instantiate the first level of SketchParse. The subsequent layers are used to instantiate the expert super-category nets in the second level.

Determining split point in base model: A number of candidate split points exist which divide the layers of the base scene parsing net [5] into two disjoint groups. We experimented with different split points within the scene parsing net; for each split point, we trained a full, all-super-category version of the SketchParse model. Based on the results (see Table 5), we used the split point (res5b) which gave the best performance for the final version, viz. the SketchParse model with class-balanced loss weighting and pose estimation included for all super-categories. Note that we do not utilize the sketch router when determining the split point. From our experiments, we found that the optimal split point results in shallow expert networks. This imparts SketchParse with better scalability: additional new categories and super-categories can be included without a large accompanying increase in the number of parameters.

Large Animals Small Animals 4-Wheelers 2-Wheelers Flying Things
Model cow horse cat dog sheep bus car bicycle motorbike airplane bird Avg.
B-R5
BC-R5
BCP-R1
BCP-R11
BCP-R5 63.17
BCP-R5 (100% router) 64.45
Table 6: Comparing the full super-category version of SketchParse (denoted BCP-R5) with baseline architectures.

Relative loss weighting: The ‘parsing’ and ‘pose estimation’ tasks are trained simultaneously. We weigh the two losses individually and perform a grid search over the loss weight λ and the learning rate. Hence, the total loss is

L_total = L_parsing + λ · L_pose      (1)

The grid search is performed on the super-category Large Animals, and the optimal λ and learning rate (see Table 7) are used for each branch in the final routed network.

λ learning rate aIOU %
0.1
0.1
0.1
1
Table 7: Grid search over λ and the learning rate.

Full version net and baselines: We compare the final, all-super-category version of SketchParse (denoted BCP-R5, containing class-balanced loss weighting and the pose estimation auxiliary task) with certain baseline variations – (i) B-R5: no additional components included, (ii) BC-R5: class-balanced loss weighting added to B-R5, (iii) BCP-R1: all categories grouped into a single branch, (iv) BCP-R11: a variant of the final version with a dedicated expert network for each category (i.e. one branch per category).

From the results (Table 6), we make the following observations: (1) Despite the challenges posed by hand-drawn sketches, our model performs reasonably well across a variety of categories (second-to-last row in Table 6). (2) Sketches from Large Animals are parsed best while those from Flying Things do not fare as well; on closer scrutiny, we found that the bird category elicited inconsistent sketch annotations, given the relatively higher degree of abstraction in the corresponding sketches. (3) In addition to confirming the utility of class-balanced loss weighting and pose estimation, the baseline performances demonstrate that sharing parts (and parameters) across categories is a crucial design choice, leading to better overall performance. In particular, note that having one category per branch (BCP-R11) almost doubles the number of parameters, indicating poor category scalability.

To examine the effect of the router, we computed IOU by assuming a completely accurate router for BCP-R5 (last row). This improves average IOU performance by 1.28. The small magnitude of improvement also shows that our router is quite reliable. The largest improvements are found among Small Animals (‘cat’,‘dog’). This is also supported by the router classifier’s confusion matrix (see Table 3).

Directions Large Animals Small Animals 4-Wheelers 2-Wheelers Flying Things Avg.
8
4
Table 8: Performance of our best SketchParse model (BCP-R5) on the pose auxiliary task.
Figure 3: Qualitative part-parsing results across super-categories. Each row contains four panels of three images. In each panel, the test sketch is left-most, the corresponding part-level ground truth is in the center and SketchParse’s output is on the right. The four panels are chosen at fixed percentiles of each super-category’s test sketches sorted by their average IOU. The background is color-coded black in all cases.

The performance of the pose classifier can be viewed in Table 8. Note that simplifying the pose directions (merging the intercardinal labels into the cardinal directions) yields a dramatic improvement in accuracy. In addition, most of the confusion among predictions is between the left-right directions and their corresponding perspective views (see Table 16). Depending on the granularity of pose information required, the perspective directions may be merged as appropriate.

N E S W
N 0 6 5 3
E 0 329 24 47
S 0 12 55 15
W 0 62 19 410
Table 9: Overall pose confusion matrix (4 directions)

Qualitative evaluation: Rather than cherry-pick results, we use a principled approach to obtain a qualitative perspective. We first sort the test sketches in each super-category by their average IOU values in decreasing order. We then select the sketches located at four fixed percentiles of the sorted order. These sketches can be viewed in Figure 3. The part-level parsing results reinforce the observations made previously in the context of quantitative evaluation.

6.1 Parsing semantically related categories

We also examine the performance of our model on sketches belonging to categories it is not trained on but which happen to be semantically similar to at least one of the existing categories. Since segmentation ground truth is unavailable for these sketches, we show two representative parsing outputs per class. We include the classes monkey, tiger, teddy-bear, camel, bear, giraffe, elephant, race car and tractor, which are semantically similar to categories already considered in our formulation. As the results demonstrate (Figure 4), SketchParse accurately recognizes parts it has seen before (‘head’, ‘body’, ‘leg’ and ‘tail’). It also exhibits a best-guess behaviour to explain parts it is unaware of. For instance, it marks the elephant ‘trunk’ as either ‘leg’ or ‘tail’, which is a semantically reasonable error given the spatial location of the part. These experiments demonstrate the scalability of our model in terms of category coverage. In other words, our architecture can integrate new, hitherto unseen categories without too much effort.

Figure 4: Part-parsing results for sketches from categories which are semantically similar to categories on which SketchParse is originally trained.

6.2 Evaluating category-level scalability

To evaluate performance scalability when categories are added incrementally, we performed the following experiment: (1) Train the BCP-R5 network with all categories except two (‘dog’, ‘sheep’) from the Small Animals super-category. (2) Freeze the shared layers and fine-tune the Small Animals branch with ‘dog’ data added. (3) Freeze the shared layers and fine-tune the model from Step (2) with ‘sheep’ data added.

Large Animals Small Animals 4-Wheelers 2-Wheelers Flying Things
Model cow horse cat dog sheep bus car bicycle motorbike airplane bird Avg.
Row-1
Row-2
Row-3
Row-4
Table 10: Comparing the full super-category version of SketchParse (denoted BCP-R5) with baseline architectures (Row-1: Model trained without ‘sheep’,‘dog’ categories, Row-2: Previous model (Row-1’s) fine-tuned with only ‘dog’ data added, Row-3: Fine-tune previous (i.e. ‘dog’ fine-tuned) model with ‘sheep’ data added, Row-4: Original result (where all categories are present from the beginning – same as last but one row of Table 6)).

We observe (Table 10) that the average IOU (last column) progressively improves as additional categories are added. In other words, overall performance does not drop even though shared layers are frozen. This shows the scalable nature of our architecture. Of course, the IOU values are slightly smaller compared to the original result (last row) where all categories are present from the beginning, but that is a consequence of freezing shared layers.

6.3 Fine-grained retrieval

Figure 5: Five sketch-based image retrieval panels are shown. In each panel, the top-left figure is the query sketch. Its part parsing is located immediately below. Each panel has two sets of retrieval results for the presented sketch. The first row corresponds to Sangkloy et al.’s [35] retrieval results and the second contains our re-ranked retrievals based on part-graphs (Section 6.3).

In another experiment, we determine whether part-level parsing of sketches can improve performance for existing sketch-based image retrieval approaches which use global image-level features. We use the PASCAL-Parts dataset [6], consisting of photo images across our categories, as the retrieval database. As a starting point, we consider the sketch-based image retrieval system of Sangkloy et al. [35], which consists of a trained Siamese network model that projects both sketches and photos into a shared latent space. We begin by supplying a query sketch from our dataset [10] and obtain the sequence of retrieved PASCAL-Parts images. We part-parse the query sketch with SketchParse and, using a customized Attribute-Graph approach [31], construct a graph from the resulting segmentation. The attribute graph is designed to capture the spatial and semantic aspects of the part-level information at local and global scales. We use annotations from the PASCAL-Parts dataset to obtain part-segmented versions of the retrieved images, which in turn are used to construct the corresponding attribute graphs.

Each graph has two kinds of nodes: a single global node and a local node for each non-contiguous instance of a part present in the segmentation output from SketchParse. The global node attributes include:

  • a histogram that keeps a count of each type of part present in the image

  • the non-background area in the image as a fraction of total area

A local node is instantiated at every non-contiguous part present in the segmentation output. We drop nodes for which the corresponding part area is less than 0.1% of the total non-background area with the assumption that these are artifacts in the segmentation output. Each local node has the following attributes:

  • angle subtended by the part at the centre of the sketch

  • centre of the part

Edges are present between local nodes corresponding to parts that have a common boundary. Each such edge encodes the relative position of both parts using a polar coordinate system.

Every local node is also connected to the global node. These edges encode the absolute position of the part in the image. The area of each part is used as a multiplicative weight for each similarity computation its corresponding node participates in.
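A minimal sketch of the graph construction (using networkx and scipy) is given below. Attribute names are illustrative, the 'angle' attribute is a simplified stand-in for the subtended angle, and the local-to-local edges between adjacent parts as well as the graph matching step [7] are omitted for brevity.

import networkx as nx
import numpy as np
from scipy import ndimage

def build_part_graph(seg: np.ndarray, min_area_frac: float = 0.001) -> nx.Graph:
    """Attribute graph from an integer part-label map (0 = background, an assumption)."""
    g = nx.Graph()
    fg_area = int((seg > 0).sum())
    hist = {int(l): int((seg == l).sum()) for l in np.unique(seg) if l > 0}
    g.add_node("global", part_histogram=hist, fg_fraction=fg_area / seg.size)
    cy, cx = seg.shape[0] / 2.0, seg.shape[1] / 2.0
    node_id = 0
    for lbl in hist:
        components, n = ndimage.label(seg == lbl)          # non-contiguous part instances
        for c in range(1, n + 1):
            mask = components == c
            area = int(mask.sum())
            if area < min_area_frac * fg_area:             # drop tiny segmentation artifacts
                continue
            ys, xs = np.nonzero(mask)
            centre = (float(ys.mean()), float(xs.mean()))
            g.add_node(node_id, part=lbl, centre=centre, area=area,
                       angle=float(np.arctan2(centre[0] - cy, centre[1] - cx)))
            # Local-to-global edge encoding the part's absolute (polar) position.
            g.add_edge("global", node_id,
                       radius=float(np.hypot(centre[0] - cy, centre[1] - cx)))
            node_id += 1
    return g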

For re-ranking the retrieved images, we use Reweighted Random Walks Graph Matching [7] to compute similarity scores between the query sketch’s graph and the graph of each retrieved image, although any other graph matching algorithm that allows incorporating constraints may be used. During the graph matching process we enforce two constraints:

  • A global node can only be matched to a global node of the other graph.

  • Local nodes can only be matched if they correspond to the same type of part (e.g. local nodes corresponding to legs can only be matched to other legs and cannot be matched to other body parts)

For our experiments, we apply our re-ranking procedure to the top retrievals of the Sketchy model. In Figure 5, each panel corresponds to the top retrieval results for a particular sketch. The sketch and its parsing are displayed alongside the nearest neighbors in the latent space of the Sketchy model (top row) and the top re-ranked retrievals using our part-graphs (bottom row). The results show that our formulation exploits the category, part-level parsing and pose information to obtain an improved ranking.

6.4 Describing sketches in detail

Figure 6: Some examples of fine-grained sketch descriptions. Each panel above shows a test sketch (left), corresponding part-parsing (center) and the description (last column). Note that in addition to parsing output, we also use the outputs of auxiliary pose network and router classifier to generate the description. The color-coding of part-name related information in the description aligns with the part color-coding in the parsing output. See Section 6.4 for additional details.

Armed with the information provided by our model, we can go beyond describing a hand-drawn sketch by a single category label. For a given sketch, our model automatically provides its category, associated super-category, part-labels and their counts and 2-D pose information. From this information, we use a template-filling approach to generate descriptions – examples can be seen alongside our qualitative results in Figure 6. A fascinating application, inspired by the work of Zhang et al. [52], would be to use such descriptions to generate freehand sketches using a Generative Adversarial Network approach.
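A toy template-filling generator in this spirit is sketched below, combining the predicted category, super-category, part counts and pose; the template wording is illustrative and not the one used to produce Figure 6.

def describe(category: str, super_category: str, part_counts: dict, pose: str) -> str:
    parts = ", ".join(f"{n} {p}{'s' if n > 1 else ''}" for p, n in sorted(part_counts.items()))
    return (f"This is a freehand sketch of a {category} (a {super_category}) "
            f"facing {pose}. Visible parts: {parts}.")

print(describe("horse", "Large Animal",
               {"head": 1, "body": 1, "leg": 4, "tail": 1}, "East"))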

7 Conclusion

Given the generally poor drawing skills of humans and the sparsity of detail, it is very challenging to simultaneously recognize and parse sketches across multiple groups of categories. In this paper, we have presented SketchParse, the first deep-network architecture for fully automatic parsing of freehand object sketches. The originality of our approach lies in successfully repurposing a photo scene-segmentation net into a category-hierarchical sketch object-parsing architecture. The general nature of our transfer-learning approach also allows us to leverage advances in fully convolutional scene parsing networks, thus continuously improving performance. Another novelty lies in obtaining labelled training data for free by sketchifying photos from object-part datasets, thus bypassing the burdensome annotation step. Our work stands out from existing approaches in the complexity of sketches, the number of categories considered and the semantic variety of those categories. While existing works focus on one or two super-categories and build separate models for each, our scalable architecture can handle a larger number of super-categories with a single, unified model. Finally, the utility of SketchParse’s novel multi-task architecture is underscored by its ability to enable applications such as fine-grained sketch description and improved sketch-based image retrieval.


In future work, it would be interesting to explore additional auxiliary tasks such as an adversarial loss [25] and a part-histogram loss [22] to further boost part-parsing performance. Another natural direction to pursue is the viability of SketchParse’s architecture for semantic parsing of photo objects.

References

  • [1] A. H. Abdulnabi, G. Wang, J. Lu, and K. Jia. Multi-task cnn model for attribute prediction. IEEE Transactions on Multimedia, 17(11):1949–1959, 2015.
  • [2] K. Ahmed, M. H. Baig, and L. Torresani. Network of experts for large-scale image categorization. In 14th European Conference on Computer Vision (Part VII), pages 516–532. Springer International Publishing, 2016.
  • [3] A. Bergamo and L. Torresani. Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. In NIPS, pages 181–189, 2010.
  • [4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.
  • [5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv preprint arXiv:1606.00915, 2016.
  • [6] X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In CVPR, 2014.
  • [7] M. Cho, J. Lee, and K. M. Lee. Reweighted random walks for graph matching. In ECCV, pages 492–505. Springer-Verlag, 2010.
  • [8] J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [9] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of IEEE ICCV, pages 2650–2658, 2015.
  • [10] M. Eitz, J. Hays, and M. Alexa. How do humans sketch objects? ACM Transactions on Graphics (TOG), 31(4):44, 2012.
  • [11] M. Elhoseiny, T. El-Gaaly, A. Bakry, and A. Elgammal. A comparative analysis and study of multiview cnn models for joint object categorization and pose estimation. In Proceedings of ICML, volume 48, pages 888–897. JMLR.org, 2016.
  • [12] A. Farhadi, I. Endres, and D. Hoiem. Attribute-centric recognition for cross-category generalization. In IEEE CVPR, pages 2352–2359. IEEE, 2010.
  • [13] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE CVPR, pages 447–456, 2015.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE CVPR, 2016.
  • [15] S. Hong, J. Oh, H. Lee, and B. Han. Learning transferrable knowledge for semantic segmentation with deep convolutional neural network. In Proceedings of the IEEE CVPR, 2016.
  • [16] Z. Huang, H. Fu, and R. W. H. Lau. Data-driven segmentation and labeling of freehand sketches. Proceedings of SIGGRAPH Asia, 2014.
  • [17] R. H. Kazi, F. Chevalier, T. Grossman, S. Zhao, and G. Fitzmaurice. Draco: bringing life to illustrations with kinetic textures. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 351–360. ACM, 2014.
  • [18] M. Lapin, B. Schiele, and M. Hein. Scalable multitask representation learning for scene classification. In Proceedings of the IEEE CVPR, pages 1434–1441, 2014.
  • [19] X. Li, L. Zhao, L. Wei, M.-H. Yang, F. Wu, Y. Zhuang, H. Ling, and J. Wang. Deepsaliency: Multi-task deep neural network model for salient object detection. IEEE Transactions on Image Processing, 25(8):3919–3930, 2016.
  • [20] Y. Li, T. M. Hospedales, Y.-Z. Song, and S. Gong. Fine-grained sketch-based image retrieval by matching deformable part models. In BMVC, 2014.
  • [21] X. Liang, X. Shen, D. Xiang, J. Feng, L. Lin, and S. Yan. Semantic object parsing with local-global long short-term memory. In The IEEE CVPR, June 2016.
  • [22] X. Liang, Y. Wei, X. Shen, J. Yang, L. Lin, and S. Yan. Proposal-free network for instance-level object segmentation. arXiv preprint arXiv:1509.02636, 2015.
  • [23] J. J. Lim, C. L. Zitnick, and P. Dollár. Sketch tokens: A learned mid-level representation for contour and object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3158–3165, 2013.
  • [24] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE CVPR, pages 3431–3440, 2015.
  • [25] P. Luc, C. Couprie, S. Chintala, and J. Verbeek. Semantic segmentation using adversarial networks. In NIPS Workshop on Adversarial Training, 2016.
  • [26] B. Mahasseni and S. Todorovic. Latent multitask learning for view-invariant action recognition. In Proceedings of the IEEE ICCV, pages 3128–3135, 2013.
  • [27] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, pages 807–814, 2010.
  • [28] V. Nekrasov, J. Ju, and J. Choi. Global deconvolutional networks for semantic segmentation. CoRR, abs/1602.03930, 2016.
  • [29] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE ICCV, pages 1520–1528, 2015.
  • [30] V. M. Patel, R. Gopalan, R. Li, and R. Chellappa. Visual domain adaptation: A survey of recent advances. IEEE signal processing magazine, 32(3):53–69, 2015.
  • [31] N. Prabhu and R. Venkatesh Babu. Attribute-graph: A graph based approach to image ranking. In Proceedings of the IEEE ICCV, pages 1071–1079, 2015.
  • [32] R. Ranjan, V. M. Patel, and R. Chellappa. Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. arXiv preprint arXiv:1603.01249, 2016.
  • [33] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE CVPR, pages 3234–3243, 2016.
  • [34] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In ECCV, pages 213–226. Springer, 2010.
  • [35] P. Sangkloy, N. Burnell, C. Ham, and J. Hays. The sketchy database: Learning to retrieve badly drawn bunnies. ACM Trans. Graph., 35(4):119:1–119:12, July 2016.
  • [36] R. K. Sarvadevabhatla, J. Kundu, and R. V. Babu. Enabling my robot to play pictionary: Recurrent neural networks for sketch recognition. In Proceedings of the ACMMM, pages 247–251, 2016.
  • [37] R. G. Schneider and T. Tuytelaars. Sketch classification and classification-driven analysis using fisher vectors. ACM Trans. Graph., 33(6):174:1–174:9, Nov. 2014.
  • [38] R. G. Schneider and T. Tuytelaars. Example-based sketch segmentation and labeling using crfs. ACM Trans. Graph., 35(5):151:1–151:9, July 2016.
  • [39] O. Seddati, S. Dupont, and S. Mahmoudi. Deepsketch: deep convolutional neural networks for sketch recognition and similarity search. In 13th International Workshop on Content-Based Multimedia Indexing (CBMI), pages 1–6. IEEE, 2015.
  • [40] A. Theobald. An ontology for domain-oriented semantic similarity search on XML data. In BTW 2003, Datenbanksysteme für Business, Technologie und Web, Tagungsband der 10. BTW-Konferenz, 26.-28. Februar 2003, Leipzig, pages 217–226, 2003.
  • [41] A. van Opbroek, M. A. Ikram, M. W. Vernooij, and M. De Bruijne. Transfer learning improves supervised image segmentation across imaging protocols. IEEE transactions on medical imaging, 34(5):1018–1030, 2015.
  • [42] A. Vezhnevets and J. M. Buhmann. Towards weakly supervised semantic segmentation by means of multiple instance and multitask learning. In IEEE CVPR, pages 3249–3256. IEEE, 2010.
  • [43] P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. L. Yuille. Joint object and part segmentation using deep learned potentials. In Proceedings of the IEEE ICCV, pages 1573–1581, 2015.
  • [44] Wikipedia. Cardinal direction — Wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/Cardinal_direction, 2017.
  • [45] J. M. Wong, V. Kee, T. Le, S. Wagner, G.-L. Mariottini, A. Schneider, L. Hamilton, R. Chipalkatty, M. Hebert, D. Johnson, et al. Segicp: Integrated deep semantic segmentation and pose estimation. arXiv preprint arXiv:1703.01661, 2017.
  • [46] F. Xia, P. Wang, L.-C. Chen, and A. L. Yuille. Zoom better to see clearer: Human and object parsing with hierarchical auto-zoom net. In Proceedings of 14th European Conference in Computer Vision: Part V, pages 648–663, 2016.
  • [47] R. Xiaofeng and L. Bo. Discriminatively trained sparse code gradients for contour detection. In Advances in neural information processing systems, pages 584–592, 2012.
  • [48] Z. Yan, H. Zhang, R. Piramuthu, V. Jagadeesh, D. DeCoste, W. Di, and Y. Yu. Hd-CNN: hierarchical deep convolutional neural networks for large scale visual recognition. In Proceedings of the IEEE ICCV, pages 2740–2748, 2015.
  • [49] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
  • [50] Q. Yu, F. Liu, Y.-Z. Song, T. Xiang, T. M. Hospedales, and C.-C. Loy. Sketch me that shoe. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [51] Q. Yu, Y. Yang, Y.-Z. Song, T. Xiang, and T. Hospedales. Sketch-a-net that beats humans. BMVC, 2015.
  • [52] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. arXiv preprint arXiv:1612.03242, 2016.
  • [53] Y. Zhang, Y. Zhang, and X. Qian. Deep neural networks for free-hand sketch recognition. In 17th Pacific-Rim Conference on Multimedia, Xi’an, China, September 15–16, 2016.
  • [54] B. Zhao, F. Li, and E. P. Xing. Large-scale category structure aware image categorization. In NIPS, pages 1251–1259, 2011.
  • [55] J. Zhao and L. Itti. Improved deep learning of object category using pose information. CoRR, abs/1607.05836, 2016.

Confusion Matrices for Pose

N NE E SE S SW W NW
N 0 0 0 2 0 0 0 0
NE 0 0 1 1 0 0 0 0
E 0 0 50 2 0 0 2 0
SE 0 0 3 6 0 0 0 0
S 0 0 1 1 7 4 0 0
SW 0 0 0 0 3 9 5 0
W 0 0 0 0 2 9 79 0
NW 0 0 0 0 0 1 0 0
Table 11: Pose confusion matrix for Large Animals
N NE E SE S SW W NW
N 0 0 0 0 1 0 1 0
NE 0 0 5 4 3 0 0 0
E 0 0 30 12 7 1 4 0
SE 0 0 5 8 10 1 0 1
S 0 0 2 2 44 6 3 0
SW 0 0 2 0 2 5 10 0
W 0 0 4 3 8 20 64 0
NW 0 0 1 2 1 0 8 0
Table 12: Pose confusion matrix for Small Animals
N NE E SE S SW W NW
N 0 0 0 1 1 0 0 0
NE 0 0 0 0 0 2 1 0
E 0 2 27 7 0 1 16 0
SE 0 1 2 14 1 0 0 0
S 0 0 0 0 2 0 1 0
SW 0 0 0 0 0 12 2 0
W 0 1 14 3 0 7 23 0
NW 0 0 0 1 0 0 0 0
Table 13: Pose confusion matrix for 4 Wheelers
N NE E SE S SW W NW
N 0 0 0 0 0 0 1 0
NE 0 0 8 0 0 0 0 0
E 0 3 57 1 0 1 8 0
SE 0 0 5 5 0 0 2 0
S 0 0 0 1 0 1 0 0
SW 0 0 0 0 0 6 5 0
W 0 0 11 0 0 21 43 3
NW 0 0 0 0 0 3 2 1
Table 14: Pose confusion matrix for 2 Wheelers
N NE E SE S SW W NW
N 0 0 2 1 3 1 1 0
NE 0 0 10 5 0 1 2 0
E 0 1 41 5 2 1 1 0
SE 0 0 3 5 1 2 0 0
S 0 0 3 2 2 0 0 0
SW 0 0 2 1 0 6 5 0
W 0 0 5 8 1 25 31 0
NW 0 0 2 2 2 3 2 0
Table 15: Pose confusion matrix for Flying Things
N NE E SE S SW W NW
N 0 0 2 4 5 1 2 0
NE 0 0 24 10 3 3 3 0
E 0 6 205 27 9 4 31 0
SE 0 1 18 38 12 3 2 1
S 0 0 6 6 55 11 4 0
SW 0 0 4 1 5 38 27 0
W 0 1 34 14 11 82 240 3
NW 0 0 3 5 3 7 12 1
Table 16: Overall pose confusion matrix