Sequential Dual Deep Learning with Shape and Texture Features for Sketch Recognition

08/09/2017 ∙ by Qi Jia, et al. ∙ 0

Recognizing freehand sketches with high arbitrariness is greatly challenging. Most existing methods either ignore the geometric characteristics or treat sketches as handwritten characters with fixed structural ordering. Consequently, they can hardly yield high recognition performance even though sophisticated learning techniques are employed. In this paper, we propose a sequential deep learning strategy that combines both shape and texture features. A coded shape descriptor is exploited to characterize the geometry of sketch strokes with high flexibility, while the outputs of convolutional neural networks (CNNs) are taken as the abstract texture feature. We develop dual deep networks with gated recurrent units (GRUs), which carry memory across time steps, and sequentially feed these two types of features into the dual networks, respectively. The features of the dual networks are fused by another GRU, which makes recognition invariant to stroke ordering. The experiments on the TU-Berlin data set show that our method outperforms both average human performance and state-of-the-art algorithms, even when significant shape and appearance variations occur.




I Introduction

Our ancestors used sketches to record their lives in ancient times. Nowadays, sketches are still regarded as an effective communicative tool. In past decades, researchers have intensively explored sketch characteristics in various multimedia applications, including sketch recognition [1, 2], sketch-based image retrieval [3, 4] and sketch-based 3D model retrieval [5]. Nevertheless, recognizing sketches drawn by non-artists is extremely challenging, even for human beings. Firstly, sketches are highly abstract. For example, simple sticks suffice to represent the limbs of a human body in a sketch. Secondly, sketches come in numerous styles that present significantly different appearances due to their free-hand nature. For a cow sketch, one person may draw its coarse contour, while another may provide detailed patterns inside the outer contour. Last but not least, sketches lack texture cues. Lines and curves are enough to compose a sketch, which differs evidently from natural images. Unfortunately, most existing approaches to sketch recognition employ traditional feature extraction and classification techniques designed for textural images.

Researchers typically carry features that work well for natural images over to sketch recognition. These features include the histogram of oriented gradients (HOG) [6] and the scale-invariant feature transform (SIFT) [7]. Both features depend heavily on gradients of rich textures, which are rarely present in sketches composed of lines or curves. Recently, Yu et al. developed Sketch-A-Net (SAN), which learns sketch features via a deep convolutional neural network (DCNN), inspired by the success of DCNNs in recognizing handwritten digits [8]. These learnt features outperform handcrafted ones such as HOG and SIFT. However, a DCNN generates features from convolutional responses on image intensities (textures), which neglects the distinct geometrical structures in sketches. Object shapes, which represent intrinsic geometries, are also stable under illumination and color variations. Therefore, the fact that the recognition rates of these methods fall below what humans achieve can be partially attributed to the absence of shape features [1].

Besides intrinsic geometries, most existing methods also neglect the inherently sequential nature of sketches. In a recent work [9], Sarvadevabhatla et al. built a recurrent network to sequentially learn features for recognition, yielding great improvements in accuracy. It is worth noting that sequentially drawn sketches exhibit a much higher degree of intra-class variation in stroke order. The learning strategy in [9] fails to tackle these high variations in stroke ordering. How to cope with these variations when learning sequential features for sketch recognition remains unresolved.

In this study, we devise dual recurrent neural networks for the textural and shape characteristics of sketches, and sequentially learn (in effect, combine) both features via joint Bayes for recognition. For the first time, coded shape features are introduced in a recurrent manner in order to distinguish sketches with similar textures but different shapes. Additionally, we explore the sequential nature of stroke groups rather than individual strokes, so that stroke ordering provides more information for recognition while the effects of its variations are excluded. We validate our dual learning on the largest free-hand sketch benchmark, TU-Berlin [1]. Our method outperforms the state of the art by over 7 percentage points in recognition rate.

The overall procedure is shown in Fig. 1. Each sketch is divided into five groups according to its stroke ordering. We extract encoded shape context features [10] and texture features by Sketch-A-Net [11] from each sketch group. Subsequently, dual recurrent neural networks built upon gated recurrent units (GRUs) take textural and shape features as inputs, respectively. The network receiving textural features is colored blue in Fig. 1, while the one for shape features is colored yellow. We then concatenate these two features at each time step and feed the concatenated features to another GRU (colored blue and yellow). We sequentially train these networks group by group, and apply sum pooling to classify sketches based on the fused features.

Fig. 1: Dual recurrent neural networks with respect to textural and shape features. These dual networks are sequentially trained by examples divided into five groups according to stroke ordering, and joint Bayes is applied for classification by the fused features from the networks.

II Related Work

We review the features and deep learning architectures for sketch recognition.

II-A Features for sketch recognition

In 2012, Eitz et al. [1] released a dataset containing 20,000 sketches distributed over 250 object categories. These sketches depict everyday objects, yet humans correctly identify the sketch category with an accuracy of only 73%. This result shows that sketch recognition is a challenging task even for humans. Features for sketch recognition can be roughly categorized into handcrafted ones and those learned by deep networks.

Handcrafted approaches share a similar spirit with image classification methods, consisting of feature extraction followed by classification. Most existing works treat a sketch as a texture image and describe it with features such as HOG [6] and SIFT [7]. In [1], SIFT features are extracted from image patches. The method also takes into account that sketches do not have smooth gradients and are much sparser than images. Similarly, Schneider and Tuytelaars [2] leverage Fisher Vectors and spatial pyramid pooling to represent SIFT features. Li et al. [12] employ multiple-kernel learning (MKL) to learn appropriate weights for different features. The method improves performance considerably because different features are complementary to each other.

Recently, deep neural networks (DNNs) have achieved great success [13] by replacing hand-crafted representations with learned ones. Yu et al. [11, 8] leverage a well-designed convolutional neural network (CNN) architecture for sketch recognition. According to their experimental results, their method surpasses the best result achieved by humans. In [14, 2], the sequential information of sketches is exploited. Sarvadevabhatla et al. [9] explicitly use sequential regularities in strokes with the help of a recurrent neural network (RNN).

Their experimental results show state-of-the-art performance on the sketch recognition task. However, all of these deep learning methods rely only on texture information. The convolution process may confuse sketches with similar textures but different contours. The variation in stroke order within the same class also affects the convergence of the RNN.

II-B Learning architectures

CNNs and RNNs are two branches of neural networks. CNNs have achieved great success in many practical applications [13, 15]. LeNet [16] is a classical CNN architecture that has been used for handwritten digit recognition. Wang et al. [5] use a Siamese network to retrieve 3D models from sketches. However, all the above methods treat sketches as traditional images and ignore their sparsity.

To find a learning architecture appropriate for sketch recognition, Yu et al. [11, 8] develop a network named Sketch-A-Net (SAN), which enlarges the pooling sizes and filter patches to cater to the sparse character of sketches. Unlike traditional images, sketches have an inherent sequential property. [11] and [8] divide the strokes into several groups according to stroke order. However, CNNs cannot build connections between sequential strokes.

The sequential property is not unique to sketches. Many works resort to RNNs to improve performance in speech recognition [17] and text generation [18]. An RNN is specialized for processing input sequences: it connects the hidden units across time and passes information from earlier elements of the sequence to later ones. However, it has a significant limitation known as the 'vanishing gradient' problem. When the input sequence is long, an RNN has difficulty propagating gradients through the many unrolled layers of the network, which easily causes vanishing and exploding gradients [19]. To overcome this limitation, long short-term memory (LSTM) [19] and the gated recurrent unit (GRU) [20] were proposed. The GRU can be regarded as a light-weight version of LSTM that outperforms LSTM in certain cases while learning a smaller number of parameters [21].

Sarvadevabhatla et al. [9] take the order of strokes as a sequence and feed the stroke features to a GRU. In this way, long-term sequential and structural regularities of strokes can be exploited. As a result, they achieve the best recognition performance. However, stroke order varies severely within the same category, which makes the network fluctuate.

Fig. 2: Comparisons of stroke by stroke and stroke groups

III Sequential Learning Framework

In this section, we introduce the architecture of our dual sequential learning framework. We first describe the recurrent neural network based on the gated recurrent unit (GRU). Then, we exploit the stroke order and describe strokes by texture and shape features. Finally, the coupled features are combined by another GRU.

III-A Gated Recurrent Unit

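In our work, we take the GRU as the basic network architecture and use it to exploit the sequential characteristics of sketches. A GRU network learns how to map an input sequence $(x_1, \dots, x_T)$ to an output sequence $(y_1, \dots, y_T)$. In its standard form, this mapping is given by Eqs. (1) to (5):

$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z) \tag{1}$$

$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r) \tag{2}$$

$$\tilde{h}_t = \tanh\big(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\big) \tag{3}$$

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{4}$$

$$y_t = W_y h_t + b_y \tag{5}$$

where $x_t$ is the input and $y_t$ is the output. $h_t$ is the 'hidden' state of the GRU and is regulated by the gating units $z_t$, $r_t$ and $\tilde{h}_t$. The operation $\odot$ denotes the element-wise vector product. $W$ and $U$ are weight matrices and $b$ is the weight vector of the GRU. More details about the GRU can be found in [22].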
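As a rough illustration of the recurrence described above, the following NumPy sketch runs the standard GRU update (update gate, reset gate, candidate state) over a short sequence. The parameter names, the random initialisation, the linear read-out, and the convention that the update gate weights the candidate state are assumptions made for this example; they are not taken from the paper's implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_params(d_in, d_hid, d_out, seed=0):
    """Random weights for a toy GRU (names are hypothetical)."""
    rng = np.random.default_rng(seed)
    p = {}
    for g in ("z", "r", "h"):
        p["W" + g] = rng.normal(0.0, 0.1, (d_hid, d_in))   # input weights
        p["U" + g] = rng.normal(0.0, 0.1, (d_hid, d_hid))  # recurrent weights
        p["b" + g] = np.zeros(d_hid)                       # bias vector
    p["Wy"] = rng.normal(0.0, 0.1, (d_out, d_hid))         # assumed linear read-out
    p["by"] = np.zeros(d_out)
    return p

def gru_step(x_t, h_prev, p):
    """One time step of the standard GRU update."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])          # update gate
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])          # reset gate
    h_cand = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r * h_prev) + p["bh"])
    h_t = (1.0 - z) * h_prev + z * h_cand                            # gated interpolation
    y_t = p["Wy"] @ h_t + p["by"]
    return h_t, y_t

def run_gru(xs, p):
    """Map an input sequence to an output sequence, starting from h = 0."""
    h = np.zeros(p["bz"].shape[0])
    ys = []
    for x_t in xs:
        h, y_t = gru_step(x_t, h, p)
        ys.append(y_t)
    return np.stack(ys), h
```

For example, `run_gru(np.ones((5, 4)), init_params(4, 3, 2))` returns one output vector per time step together with the final hidden state.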

III-B Sequential dual learning architecture based on GRU

In this section, the strokes are first divided into several groups to reduce the effect of stroke-order variation over the time sequence. Then, two features, texture and coded shape, are introduced to characterize the texture and geometry information of sketches.

Fig. 3: Input sequence of one stroke group.

III-B1 Stroke order exploitation

For traditional images, all the pixels are captured at the same time. A sketch, however, consists of sequential strokes, and the order of strokes carries important information. When drawing the same object, different people follow their own stroke orders [1]. Fig. 2 shows three pairs of sketches: two airplanes, two cows and two elephants. The figure is divided into three parts by two vertical dashed lines. The middle part shows the complete sketches. The left part shows the first ten strokes of each sketch. We can see that the same objects are drawn with different stroke orders. If each stroke were taken directly as the input of the GRU, the differences between inputs might make the network unstable. The right part shows the stroke groups based on stroke order: the first column shows the first 20% of the strokes, the second column shows the first 40%, and so on, until the last column shows the complete sketch. We can see that the difference between sketches at the same time step is reduced.

In order to reduce the effect of stroke order, each sketch is expanded into five sketches according to the time sequence. Suppose sketch $S$ has $N$ strokes, so that it can be represented as $S = \{s_1, s_2, \dots, s_N\}$. Then the first extended sketch contains the first 20% of the strokes, from $s_1$ to $s_{N/5}$, the second contains the first 40%, from $s_1$ to $s_{2N/5}$, and so on; the last one contains the whole original sketch, as shown in Fig. 1. For robustness, we also take 10 crops and reflections of each training and test sketch [23]. After this operation, each sketch in each group yields 10 sketches. Hence, we have 50 sketches in total, and each input sequence contains 10 sketches, indexed by $t = 1, \dots, 10$. Fig. 3 shows the input sequence of one stroke group. From right to left, when $t$ is odd, the inputs are crops of the original stroke group, and when $t$ is even, the inputs are crops of the reflection of the original stroke group. The crops are ordered as top left, bottom left, top right, bottom right and center.
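The cumulative grouping described above (first 20% of the strokes, first 40%, and so on up to the full sketch) can be sketched in a few lines of Python. The function name and the choice to round group boundaries up are assumptions for illustration; the text does not say how fractional stroke counts are handled.

```python
def cumulative_stroke_groups(strokes, num_groups=5):
    """Split an ordered stroke list into cumulative groups.

    Group k (1-based) holds the first k/num_groups of the strokes,
    with the boundary rounded up (an assumption of this sketch).
    """
    n = len(strokes)
    groups = []
    for k in range(1, num_groups + 1):
        end = (k * n + num_groups - 1) // num_groups  # integer ceiling of k*n/num_groups
        groups.append(strokes[:end])
    return groups
```

With ten strokes this yields prefixes of length 2, 4, 6, 8 and 10, so the last group is always the complete sketch.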

III-B2 Texture and shape features

Sketches encode both texture and geometry information. Texture information characterizes the details of a sketch, while shape information captures its geometry. Both are necessary for sketch recognition. For example, a basketball and a football share a similar outer contour but have different textures inside, while parrots and pigeons have similar wings but different beaks. Hence, the two features are mutually complementary.

Fig. 4: Visualisation of the learned filters by deconvolution.
Fig. 5: Encoded features of similar shapes.

We use the recently developed Sketch-A-Net (SAN) [11], which is based on a CNN architecture, to obtain the texture feature. The shape feature is represented by coded shape context [10], which makes our method robust to stroke variations. The two kinds of features are fed into two GRUs, respectively.

The texture feature represents the texture information of an image, which is important for recognition. Recently, texture features extracted by CNNs have proven strongly discriminative for sketch recognition. Thus, in this paper, we use Sketch-A-Net [11] to extract the texture features of sketches. Sketch-A-Net has 8 layers: the first five are convolutional layers and the last three are fully connected layers. Each convolutional layer uses rectifier (ReLU) units, and the first, second and fifth layers are followed by max pooling. The final layer has 250 output units, corresponding to the number of classes in the TU-Berlin dataset [1].

To obtain the texture features, we feed each sketch of the sequence into Sketch-A-Net and take the 512-dimensional feature extracted from the last fully connected layer. The texture feature of the sketch at time step $t$ is denoted as $f^{T}_t$. To observe what Sketch-A-Net has learned, we apply deconvolution to the filters of the fifth layer. As shown in Fig. 4, the filters capture complex texture features of objects.

The shape feature describes the geometry of sketches. Because sketches are highly flexible, each stroke can be drawn in different styles. As shown in Fig. 5, the strokes in each rectangle are of the same kind but show different variations. To handle this intra-class variation of strokes, we leverage a coding method [24] so that similar strokes are represented by similar features.

First of all, shape context [10] is employed as the geometric descriptor of strokes. Each stroke is represented by a shape context feature of dimension $d$. The feature of the $i$-th stroke can then be written as $x_i \in \mathbb{R}^d$, $i = 1, \dots, N$, where $N$ is the number of strokes.

Secondly, we use k-means [25] to obtain a codebook from the low-level geometry features. We apply the k-means algorithm to randomly selected shape features, and the cluster centers are regarded as the codebook, as shown in Eq. (6), where $b_m$ denotes the $m$-th cluster center, $M$ is the total number of cluster centers, and $d$ is the dimension of the shape features:

$$B = [b_1, b_2, \dots, b_M] \in \mathbb{R}^{d \times M} \tag{6}$$

Thereafter, we can use these $M$ prototypes to describe the whole stroke space. As shown in Fig. 5, the colored circular points stand for the cluster centers, and each sketch can be represented by several of them.
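To make the codebook step concrete, here is a minimal NumPy sketch of Lloyd's k-means algorithm that returns the cluster centers as codebook rows. The function name, the fixed iteration count, and the initialisation from randomly chosen samples are assumptions for illustration; the paper only states that k-means is run on randomly selected shape features.

```python
import numpy as np

def build_codebook(features, num_centers, num_iters=20, seed=0):
    """Cluster stroke features with plain k-means; rows of the result are codewords."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(features), size=num_centers, replace=False)
    centers = features[start].astype(float).copy()
    for _ in range(num_iters):
        # Squared distance from every feature to every center: shape (n, num_centers).
        d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for m in range(num_centers):
            members = features[assign == m]
            if len(members) > 0:          # keep the old center if a cluster is empty
                centers[m] = members.mean(axis=0)
    return centers
```

On two well-separated groups of points, the returned centers settle on the two group means.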


To obtain the final representation of strokes, we use the codebook to encode the shape features. Having compared two classical coding methods, vector quantization (VQ) coding [26] and locality-constrained linear coding (LLC) [27], we choose the latter, which is faster and more effective.

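We use the $k$ nearest neighbors of a stroke feature $x_i$ in the codebook $B$ (in Fig. 5, $k$ is 5) to reconstruct $x_i$, which can be denoted as $x_i \approx B_i c_i$. Here the neighbor set contains the $k$ codewords of $B$ nearest to $x_i$, $B_i$ is the matrix consisting of the corresponding $k$ columns of $B$, and $c_i \in \mathbb{R}^k$ holds the linear coefficients of the columns of $B_i$, which we obtain by optimizing Eq. (7):

$$\min_{c_i} \; \| x_i - B_i c_i \|^2 \quad \text{s.t.} \quad \mathbf{1}^\top c_i = 1 \tag{7}$$

As shown in Fig. 5, similar strokes may have the same nearest neighbors.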


Finally, we obtain more discriminative features by combining LLC with max pooling [28]. In our work, we run max pooling over the codes of all strokes to obtain the final shape feature of a sketch. For each sketch in the sketch sequence (time step $t$), we thus extract both a texture and a shape feature, denoted as $f^{T}_t$ and $f^{S}_t$, respectively.
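The coding and pooling steps can be sketched as follows, using the approximated LLC solution of [27]: each stroke feature is reconstructed from its k nearest codewords under a sum-to-one constraint, and the per-stroke codes are then max-pooled into one vector per sketch. The function names, the row-wise codebook layout, and the small regularisation constant are assumptions for this example and are not values from the paper.

```python
import numpy as np

def llc_encode(x, codebook, k=5, reg=1e-4):
    """Approximated LLC code of one feature x; codebook has shape (M, d)."""
    d2 = ((codebook - x) ** 2).sum(axis=1)
    nn = np.argsort(d2)[:k]                      # indices of the k nearest codewords
    shifted = codebook[nn] - x                   # neighbors centered on x, shape (k, d)
    cov = shifted @ shifted.T                    # local covariance, shape (k, k)
    cov += (reg * np.trace(cov) + 1e-12) * np.eye(k)   # keeps the system solvable
    w = np.linalg.solve(cov, np.ones(k))
    w /= w.sum()                                 # enforce the sum-to-one constraint
    code = np.zeros(len(codebook))
    code[nn] = w                                 # all other entries stay zero
    return code

def max_pool(codes):
    """Element-wise maximum over the codes of all strokes of a sketch."""
    return np.max(np.asarray(codes), axis=0)
```

A typical use is `max_pool([llc_encode(x, codebook) for x in stroke_features])`, which gives one fixed-length shape vector per sketch regardless of the number of strokes.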

III-C Dual Deep Learning Strategy

Since texture and shape features are mutually complementary, we leverage a dual deep learning strategy to combine the two. Specifically, we use two separate RNNs that each take one feature as input, and combine them at the output by time-based weights. At each time step $t$, the texture and shape RNNs generate features representing the input sketch, denoted as $h^{T}_t$ and $h^{S}_t$, respectively. To utilize both, we concatenate these two generated features at each time step. We then feed the concatenated features to another GRU to learn the relevance of the different features and different time steps.

After processing by SANSC, every time step $t$ gives a prediction vector for the input sketch, denoted as $p_t$. We sum all prediction vectors into $P$ (Eq. (8)) and choose the class with the maximum value in $P$ (Eq. (9)) as the prediction:

$$P = \sum_{t} p_t \tag{8}$$

$$\hat{c} = \arg\max_{c} P(c) \tag{9}$$
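The final decision rule, summing the per-time-step prediction vectors and taking the class with the largest total, can be illustrated with a short Python sketch. The function name and the toy prediction values are hypothetical.

```python
def sum_pool_predict(step_predictions):
    """Sum prediction vectors over time steps and return (best class index, totals)."""
    num_classes = len(step_predictions[0])
    totals = [0.0] * num_classes
    for p_t in step_predictions:
        for c, value in enumerate(p_t):
            totals[c] += value
    best = max(range(num_classes), key=lambda c: totals[c])
    return best, totals
```

Note that a class can win overall without being the top prediction at every time step: in `[[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.6, 0.3]]` the first step favours class 0, but the summed scores favour class 1.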



IV Experiments and Results

In this section, we first describe the dataset and the experimental settings of our method. Then, our method is compared with the state of the art under the same protocol. Finally, we validate the effectiveness of the shape feature. The results show that our method outperforms the other methods by about 7 percentage points in recognition rate.

IV-A Dataset and data augmentation

We evaluate our method on the TU-Berlin sketch dataset [1], which is the largest and most commonly used human sketch dataset. It consists of 20,000 sketches in 250 categories (80 sketches per category). The dataset was collected on Amazon Mechanical Turk (AMT) from 1,350 participants, which ensures diversity of object categories and of sketching styles within every category. We split the data into three parts: one for training, one for evaluation, and the rest for testing.

In our experiments, data augmentation is employed to reduce overfitting. To increase the number of sketches per category, we apply several transformations to each sketch, including horizontal reflection and rotation ([-5, -3, 0, +3, +5] degrees). We then apply systematic combinations of horizontal and vertical shifts. The data augmentation procedure multiplies the number of sketches in each of the 250 categories.
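The combinatorics of this augmentation can be sketched by enumerating every combination of reflection, rotation and shift. The rotation angles come from the text; the function name, the dictionary layout and the shift values passed in are hypothetical, since the shift range is not given here.

```python
from itertools import product

ROTATIONS_DEG = [-5, -3, 0, 3, 5]   # rotation angles listed in the text
REFLECT = [False, True]             # original and horizontally reflected

def augmentation_plan(shifts):
    """List every (reflection, rotation, dx, dy) combination for the given shift values."""
    return [
        {"reflect": f, "rotate_deg": r, "dx": dx, "dy": dy}
        for f, r, dx, dy in product(REFLECT, ROTATIONS_DEG, shifts, shifts)
    ]
```

Reflection and rotation alone give 2 x 5 = 10 variants per sketch; each additional shift value multiplies this count by the number of horizontal and vertical offsets.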

IV-B Experiment settings

In our method, we use the Sketch-A-Net [11] framework to extract the texture features of sketches. The framework [11] consists of five different CNNs that are independently trained on five differently scaled versions of the original sketch. In our experiments, we employ the single-channel network and take the 512-dimensional features of the last fully connected layer to represent sketches. For shape features, we apply shape context to points sampled at equal intervals along each stroke; the number of bins of the shape context determines the dimension of the descriptor for each stroke. We implement our network in Torch [29].

IV-C Comparison with the state-of-the-art

We compare our method with state-of-the-art methods for sketch recognition. These methods can be divided into two groups. One group combines hand-crafted features with classifiers, including the HOG-SVM method [1], structured ensemble matching [30], multi-kernel SVM [12] and Fisher Vector Spatial Pooling (FV-SP) [2]. The other group consists of DNN-based methods, including AlexNet [13], LeNet [16], AlexNet-FC-GRU [31], and two versions (SN1.0 [11] and SN2.0 [8]) of the Sketch-A-Net framework. The results of these methods are obtained either from their released implementations or from the figures reported in their papers.

Method Recognition Accuracy(%)
HOG-SVM [1] 56.0
Ensemble [30] 61.5
MKL-SVM [12] 65.8
FV-SP [2] 68.9
Humans [1] 73.1
AlexNet-SVM [13] 67.1
AlexNet-Sketch [13] 68.6
LeNet [16] 55.2
SN1.0 [11] 74.9
SN2.0 [8] 78.0
AlexNet-FC-GRU [31] 85.1
Our Method 92.2
TABLE I: Accuracy of Sketch Recognition on TU-Berlin Dataset

Table I shows the recognition accuracy of the compared methods on the TU-Berlin dataset. In general, DNN-based methods perform better than those based on hand-crafted features. The accuracy of the methods based on hand-crafted features is 63% on average, which is lower than that of humans [1]. This is because the existing hand-crafted features are designed for traditional images and do not fit abstract and sparse sketches. In contrast, the average accuracy of the DNN-based methods is about 74%, which is about 1 percentage point higher than humans [1] and 11 percentage points higher than the hand-crafted group. Among the existing DNN-based methods, SN1.0 is the first to beat humans, with an accuracy of 74.9%, and the improved version (SN2.0) obtains an accuracy of 77.95%. As far as we know, [31] is the state-of-the-art method, with a recognition accuracy of 85.1%.

Our method shares the advantages of DNN-based methods and outperforms the other methods with a classification accuracy of 92.2%, which is about 7 percentage points higher than the state of the art [31]. These results illustrate the effectiveness of our dual learning strategy and the coupled features.
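The group averages and margins quoted in this comparison can be rechecked directly from the rows of Table I. The short script below is only an arithmetic aid; which rows enter the quoted DNN average is an assumption, so it reports the mean both over all six DNN rows and with LeNet left out.

```python
# Accuracies (%) copied from Table I.
handcrafted = {"HOG-SVM": 56.0, "Ensemble": 61.5, "MKL-SVM": 65.8, "FV-SP": 68.9}
dnn = {
    "AlexNet-SVM": 67.1, "AlexNet-Sketch": 68.6, "LeNet": 55.2,
    "SN1.0": 74.9, "SN2.0": 78.0, "AlexNet-FC-GRU": 85.1,
}
HUMANS = 73.1
OURS = 92.2

def mean(values):
    values = list(values)
    return sum(values) / len(values)

mean_handcrafted = mean(handcrafted.values())                       # about 63.05
mean_dnn_all = mean(dnn.values())                                   # about 71.48
mean_dnn_no_lenet = mean(v for k, v in dnn.items() if k != "LeNet") # about 74.74
margin_over_best = OURS - dnn["AlexNet-FC-GRU"]                     # about 7.1 points

print(round(mean_handcrafted, 2), round(mean_dnn_all, 2),
      round(mean_dnn_no_lenet, 2), round(margin_over_best, 1))
```

The hand-crafted mean matches the 63% in the text, and the margin over [31] is about 7.1 percentage points. The mean over all six DNN rows comes out near 71.5%, while leaving out LeNet gives roughly 74.7%, which is closer to the quoted figure of about 74%.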

IV-D Effects of shape features

To validate the effect of the shape features in sketch recognition, we list the results of our architecture with and without shape features in Table II. The accuracy without shape features is only 84.1%, which is about 8 percentage points lower than the accuracy with shape features (92.2%). This indicates that the shape feature plays an important role in sketch recognition.

Method Recognition Accuracy (%)
Without Shape Features 84.1
With Shape Features 92.2
TABLE II: Evaluation on the Contributions of Shape Features

To make the effect of the shape feature more intuitive, we also show some sample recognition results without and with shape features in Fig. 6. The first column lists the query sketches: lobster, fire hydrant, tiger, spider, and pigeon. The second column shows the classification results without shape features, and the last column shows the results of our method with shape features. Each class is represented by four of its sketches. The method without shape features misclassifies all the queries. The reason is that texture features discriminate poorly between sketches with similar textures, such as the horizontal and vertical lines on a fire hydrant and a skyscraper, or the stripes on a tiger and a church. In contrast, the method with shape features classifies all the queries correctly and is thus more discriminative. Our method can even distinguish a pigeon from a parrot by the difference at the beak. Notably, the wrong classes returned by the method without shape features are closely connected to the input sketches: they have similar textures.

Fig. 6: Classification results of method without shape features and our method

Further, we illustrate the classification results at each output. As shown in Fig. 7, the first column shows the query sketches represented by five groups, and the second and third columns show the results at each output without and with shape features. The results at every output in the second column are wrong, while our method produces correct results from the second output of the time sequence onward. The methods in this experiment are implemented without Joint Bayesian (JB), which verifies that the shape feature works effectively without the help of JB.

Fig. 7: Classification results of the method without shape features and our method at different time steps. First column: input; second column: results without shape features; last column: results with shape features.
Fig. 8: Classification results of method without Joint Bayesian and with Joint Bayesian.

IV-E Effects of Joint Bayesian

In this section, the effect of Joint Bayesian (JB) is illustrated. Table III lists the results with and without JB separately. The recognition result of our method without JB is obtained with a softmax layer. Its recognition rate is 88.2%, which is still higher than that of [31]. The results indicate that JB assigns different weights to features from different time steps. With the help of JB, features from different time steps are fused together: JB provides the weights between different features and assigns higher weights to important features. Fig. 8 shows some recognition results of the method without JB and with JB. The sketches in the first column are the query sketches: mug, speed-boat and chandelier. The sketches in the second and third columns are samples of the categories predicted by the two methods, respectively. The classification results given by the method with JB are all correct, whereas the method without JB gives wrong answers. The reason is that the classification result is influenced by the texture and shape features from all time steps of a sketch, and different features should contribute differently to classification. The method without JB cannot weigh the contributions of all features and therefore easily gives wrong classification results. For example, in Fig. 8, mug and teapot have similar shapes and textures, and speed-boat and flying saucer also have similar shapes and features, so the method without JB cannot leverage the weights of all features. As for chandelier and armchair, they share the same texture, namely vertical lines, which leaves shape features at a disadvantage in determining the class. Compared with the method without JB, the method with JB can learn the weights of all features and give the right classification.

Method Recognition Accuracy (%)
Without JB 88.2
With JB 92.2
TABLE III: Evaluation on the contributions of JB

V Conclusion

In this paper, we proposed a sequential dual deep learning strategy that combines shape and texture features for sketch recognition. According to our experimental results, we achieve the best performance on sketch recognition. Our method has the following advantages. First, we employ texture and shape features to characterize the texture and geometry information of a sketch, which are complementary to each other. Second, we explore the sequential nature of stroke groups rather than of individual strokes. The learned sketch features can be used in other sketch-related applications, such as sketch-based image retrieval and 3D shape retrieval.


  • [1] M. Eitz, J. Hays, and M. Alexa, “How do humans sketch objects?” ACM Trans. Graph., vol. 31, no. 4, pp. 44:1–44:10, 2012. [Online]. Available:
  • [2] R. G. Schneider and T. Tuytelaars, “Sketch classification and classification-driven analysis using fisher vectors,” ACM Transactions on Graphics (TOG), vol. 33, no. 6, p. 174, 2014.
  • [3] M. Eitz, K. Hildebrand, T. Boubekeur, and M. Alexa, “Sketch-based image retrieval: Benchmark and bag-of-features descriptors,” IEEE Trans. Vis. Comput. Graph., vol. 17, no. 11, pp. 1624–1636, 2011. [Online]. Available:
  • [4] R. Hu and J. P. Collomosse, “A performance evaluation of gradient field HOG descriptor for sketch based image retrieval,” Computer Vision and Image Understanding, vol. 117, no. 7, pp. 790–806, 2013. [Online]. Available:
  • [5] F. Wang, L. Kang, and Y. Li, “Sketch-based 3d shape retrieval using convolutional neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 2015, pp. 1875–1883. [Online]. Available:
  • [6] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), 20-26 June 2005, San Diego, CA, USA, 2005, pp. 886–893. [Online]. Available:
  • [7] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004. [Online]. Available:
  • [8] Q. Yu, Y. Yang, F. Liu, Y. Z. Song, T. Xiang, and T. M. Hospedales, “Sketch-a-net: A deep neural network that beats humans,” International Journal of Computer Vision, pp. 1–15, 2016.
  • [9] R. K. Sarvadevabhatla, J. Kundu, and R. V. Babu, “Enabling my robot to play pictionary: Recurrent neural networks for sketch recognition,” pp. 247–251, 2016.
  • [10] S. Belongie, J. Malik, and J. Puzicha, “Shape matching and object recognition using shape contexts,” IEEE transactions on pattern analysis and machine intelligence, vol. 24, no. 4, pp. 509–522, 2002.
  • [11] Q. Yu, Y. Yang, Y. Song, T. Xiang, and T. M. Hospedales, “Sketch-a-net that beats humans,” in Proceedings of the British Machine Vision Conference 2015, BMVC 2015, Swansea, UK, September 7-10, 2015, 2015, pp. 7.1–7.12. [Online]. Available:
  • [12] Y. Li, T. M. Hospedales, Y. Song, and S. Gong, “Free-hand sketch recognition by multi-kernel feature learning,” Computer Vision and Image Understanding, vol. 137, pp. 1–11, 2015. [Online]. Available:
  • [13] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States., 2012, pp. 1106–1114. [Online]. Available:
  • [14] G. Johnson, M. D. Gross, J. Hong, E. Y.-L. Do et al., “Computational support for sketching in design: a review,” Foundations and Trends® in Human–Computer Interaction, vol. 2, no. 1, pp. 1–93, 2009.
  • [15] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014. [Online]. Available:
  • [16] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel, “Handwritten digit recognition with a back-propagation network,” in Advances in Neural Information Processing Systems 2, [NIPS Conference, Denver, Colorado, USA, November 27-30, 1989], 1989, pp. 396–404. [Online]. Available:
  • [17] O. Vinyals, S. V. Ravuri, and D. Povey, “Revisiting recurrent neural networks for robust asr,” in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on.   IEEE, 2012, pp. 4085–4088.
  • [18] I. Sutskever, J. Martens, and G. E. Hinton, “Generating text with recurrent neural networks,” in Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, 2011, pp. 1017–1024.
  • [19] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, p. 1735, 1997.
  • [20] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
  • [21] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
  • [22] J. Chung, C. Gülçehre, K. Cho, and Y. Bengio, “Gated feedback recurrent neural networks.” in ICML, 2015, pp. 2067–2075.
  • [23] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [24] X. Wang, B. Feng, X. Bai, W. Liu, and L. J. Latecki, “Bag of contour fragments for robust shape classification,” Pattern Recognition, vol. 47, no. 6, pp. 2116–2125, 2014.
  • [25] R. O. Duda, P. E. Hart, and D. G. Stork, “Pattern classification and scene analysis part 1: Pattern classification,” Wiley, Chichester, 2000.
  • [26] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in Computer vision and pattern recognition, 2006 IEEE computer society conference on, vol. 2.   IEEE, 2006, pp. 2169–2178.
  • [27] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, “Locality-constrained linear coding for image classification,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on.   IEEE, 2010, pp. 3360–3367.
  • [28] Y. Huang, Z. Wu, L. Wang, and T. Tan, “Feature coding in image classification: A comprehensive study,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 3, pp. 493–506, 2014.
  • [29] R. Collobert, S. Bengio, and J. Mariéthoz, “Torch: a modular machine learning software library,” Idiap, Tech. Rep., 2002.
  • [30] Y. Li, Y.-Z. Song, and S. Gong, “Sketch recognition by ensemble matching of structured features.” in BMVC, vol. 1, 2013, p. 2.
  • [31] R. K. Sarvadevabhatla, J. Kundu, and R. V. Babu, “Enabling my robot to play pictionary: Recurrent neural networks for sketch recognition,” in Proceedings of the 2016 ACM Conference on Multimedia Conference, MM 2016, Amsterdam, The Netherlands, October 15-19, 2016, 2016, pp. 247–251. [Online]. Available: