Freehand sketching is an inherently sequential process. Yet, most approaches for hand-drawn sketch recognition either ignore this sequential aspect or exploit it in an ad-hoc manner. In our work, we propose a recurrent neural network architecture for sketch object recognition which exploits the long-term sequential and structural regularities in stroke data in a scalable manner. Specifically, we introduce a Gated Recurrent Unit based framework which leverages deep sketch features and weighted per-timestep loss to achieve state-of-the-art results on a large database of freehand object sketches across a large number of object categories. The inherently online nature of our framework is especially suited for on-the-fly recognition of objects as they are being drawn. Thus, our framework can enable interesting applications such as camera-equipped robots playing the popular party game Pictionary with human players and generating sparsified yet recognizable sketches of objects.
The process of freehand sketching has long been employed by humans to communicate ideas and intent in a minimalist yet almost universally understandable manner. In spite of the challenges posed in recognizing them, sketches have formed the basis of applications in areas such as forensic analysis, electronic classroom systems and sketch-based retrieval [20, 13].
Sketching is an inherently sequential process. The proliferation of pen and tablet based devices today enables us to capture and analyze the entire process of sketching, thus providing additional information compared to passive parsing of static sketched content. Yet, most sketch recognition approaches either ignore the sequential aspect or lack the ability to exploit it [20, 7, 18]. The few approaches which attempt to exploit the sequential sketch stroke data either do so in an unnatural manner or impose restrictive constraints (e.g. the Markov assumption).
In our work, we propose a recurrent neural network architecture for sketch object recognition which exploits the long-term sequential and structural regularities in stroke data in a scalable manner. We make the following contributions:
We propose the first deep recurrent neural network architecture which can recognize freehand sketches across a large number of object categories. Specifically, we introduce a Gated Recurrent Unit (GRU)-based framework (Section 3.1) which leverages deep sketch features and weighted per-stroke loss to achieve state-of-the-art results.
We show that the choice of deep sketch features and recurrent network architecture both play a crucial role in obtaining good recognition performance (Section 4.3).
Via our experiments on sketches with partial temporal stroke content, we show that our framework recognizes the largest percentage of sketches (Section 4.3).
Given the on-line nature of our recognition framework, it is especially suited for on-the-fly interpretation of sketches as they are drawn. Thus, our framework can enable interesting applications such as camera-equipped robots playing the popular party game Pictionary with human players, generating sparsified yet recognizable sketches of objects, and interpreting hand-drawn digital content in electronic classrooms.
To retain focus, we review approaches exclusively related to recognition of hand-drawn object sketches. Early datasets tended to contain a small number of sketches and/or object categories [20, 14]. Eitz et al. released a dataset containing hand-drawn sketches across categories of everyday objects. The dataset, currently the largest sketch object dataset available, provided the first opportunity to attempt the sketch object recognition problem at a relatively large scale. Since its release, a number of approaches have been proposed to recognize freehand sketches of objects. The initial performance of handcrafted feature-based approaches [13, 18] has recently been surpassed by deep feature-based approaches [17, 19], culminating in a custom-designed Convolutional Neural Network dubbed SketchCNN which achieved state-of-the-art results.

The approaches mentioned above are primarily designed for static, full-sketch object recognition. In contrast, another set of approaches attempts to exploit the sequential stroke-by-stroke nature of hand-drawn sketch creation [1, 22]. For example, Arandjelovic and Sezgin propose a Hidden Markov Model (HMM)-based approach for recognizing military and crisis management symbol objects. Although mentioned above in the context of static object recognition, a variant of SketchCNN can also handle sequential stroke data. In fact, the authors demonstrate that exploiting the sequential nature of the sketching process improves the overall recognition rate. However, given that CNNs are not inherently designed to preserve sequential “state”, better results can be expected from a framework which handles sequential data in a more natural fashion. The approach we present in our paper aims to do precisely this.

Our framework is based on Gated Recurrent Unit (GRU) networks recently proposed by Cho et al. GRU architectures share a number of similarities with the more popular Long Short Term Memory (LSTM) networks, including the latter’s ability to perform better than traditional models (e.g. HMM) for problems involving long and complicated sequential structures. To the best of our knowledge, recurrent neural networks have not been utilized for online sketch recognition.
Sketch creation involves accumulation of hand-drawn strokes over time. Thus, we require our recognition framework to optimally exploit object category evidence being accumulated on a per-stroke basis as well as temporally. Moreover, the variety in sketch-based depiction and the intrinsic representational complexity of objects result in a large range of stroke-sequence lengths. Therefore, we require our recognition framework to address this variation in sequence lengths appropriately. To meet these requirements, we employ Gated Recurrent Unit (GRU) networks. Our choice of GRU architecture is motivated by the observation that it involves learning a smaller number of parameters and performs better than LSTM in certain instances including, as shall be seen (Section 4), our problem of sketch recognition.
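A GRU network learns to map an input sequence $X = (x_1, x_2, \ldots, x_N)$ to an output sequence $Y = (y_1, y_2, \ldots, y_N)$. This mapping is performed by the following transformations which are applied at each time step:

$$z_t = \sigma(W_{xz} x_t + W_{hz} h_{t-1} + b_z) \tag{1}$$
$$r_t = \sigma(W_{xr} x_t + W_{hr} h_{t-1} + b_r) \tag{2}$$
$$\tilde{h}_t = \tanh(W_{xh} x_t + W_{hh} (r_t \odot h_{t-1}) + b_h) \tag{3}$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{4}$$
$$y_t = W_{hy} h_t + b_y \tag{5}$$

Here, $x_t$ and $y_t$ represent the $t$-th input and $t$-th output respectively, $h_t$ represents the “hidden” sequence state of the GRU whose contents are regulated by the parameterized gating units $z_t$ and $r_t$, and $\odot$ represents the elementwise product. The subscripted $W$s and $b$s represent the trainable parameters of the GRU. Please refer to Chung et al. for details.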
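To make the per-time-step update concrete, the following is a minimal sketch of one GRU state update in plain Python, written in the standard form given by Chung et al. (update gate, reset gate, candidate state, interpolation). The parameter names (`W_xz`, `W_hz`, `b_z` and so on) and the helper `zero_params` are illustrative assumptions, not identifiers from our released code, and the output projection is omitted.

```python
import math

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def matvec(W, v):
    # Matrix-vector product for a matrix stored as a list of rows.
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def add(*vs):
    # Elementwise sum of several equal-length vectors.
    return [sum(items) for items in zip(*vs)]

def zero_params(n_in, n_hid):
    # Hypothetical helper: all-zero parameters, handy for sanity checks.
    p = {}
    for g in "zrh":
        p["W_x" + g] = [[0.0] * n_in for _ in range(n_hid)]
        p["W_h" + g] = [[0.0] * n_hid for _ in range(n_hid)]
        p["b_" + g] = [0.0] * n_hid
    return p

def gru_step(x, h_prev, p):
    """Return the new hidden state h_t given input x_t and state h_{t-1}."""
    z = sigmoid(add(matvec(p["W_xz"], x), matvec(p["W_hz"], h_prev), p["b_z"]))
    r = sigmoid(add(matvec(p["W_xr"], x), matvec(p["W_hr"], h_prev), p["b_r"]))
    gated = [ri * hi for ri, hi in zip(r, h_prev)]  # reset gate applied to old state
    h_cand = [math.tanh(a) for a in
              add(matvec(p["W_xh"], x), matvec(p["W_hh"], gated), p["b_h"])]
    # Interpolate between the old state and the candidate state.
    return [(1 - zi) * hp + zi * hc for zi, hp, hc in zip(z, h_prev, h_cand)]
```

With all-zero parameters both gates evaluate to 0.5 and the candidate state is zero, so the update simply halves the previous state, which is a convenient check that the gating arithmetic is wired correctly.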
For each sketch, information is available at the temporal stroke level. We use this to construct an image sequence $S = (s_1, s_2, \ldots, s_N)$ of sketch strokes cumulatively accumulated over time. Thus, $s_N$ represents the full, final object sketch and $N$ represents the number of sketch strokes or, equivalently, time-steps (see Figure 1). To represent the stroke content for each $s_i$, we utilize deep features obtained when $s_i$ is provided as input to Alexnet (specifically, we remove the classification layer and use outputs from the final fully-connected layer of the resulting net as features). The resulting deep feature sequence forms the input sequence $X = (x_1, x_2, \ldots, x_N)$ to the GRU (see Figure 1). The output of the GRU’s hidden units is densely connected to a final softmax layer for classification. For better generalization, we include a dropout layer before the final classification layer; dropout tends to benefit recurrent networks having a large number of hidden units.
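The cumulative construction of the image sequence can be sketched as follows. This is an illustrative example only: it assumes each stroke is already rasterized into a list of `(row, col)` pixel coordinates on a small binary canvas, and the function name `cumulative_canvases` is hypothetical. Feature extraction with the CNN is not shown.

```python
def cumulative_canvases(strokes, height, width):
    """Return one binary image per stroke, each containing all strokes so far."""
    canvas = [[0] * width for _ in range(height)]
    frames = []
    for stroke in strokes:
        for r, c in stroke:
            canvas[r][c] = 1
        # Copy the canvas so that later strokes do not alter earlier frames.
        frames.append([row[:] for row in canvas])
    return frames
```

The last frame returned is the full sketch, and the number of frames equals the number of strokes, matching the sequence length seen by the recurrent network.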
Our architecture produces an output prediction for every time-step in the sequence. By comparing the predictions with the ground-truth, we can determine the corresponding loss for a fixed loss function (shown as a yellow box in Figure 1). This loss is weighted by a corresponding weight $w_t$ and backpropagated for the corresponding time step $t$. For the weighting function, we use an exponential function of the time step (Equation (6)). Thus, losses corresponding to the final stages of the sequence are weighted more to encourage correct prediction of the full sketch. Also, since $w_t$ is non-zero, our design incorporates losses from all steps of the sequence. This has the net effect of encouraging correct predictions even in the early stages of the sequence. Overall, this feature enables our recognition framework to be accurate and responsive right from the beginning of the sketching process (Section 4), in contrast with frameworks which need to wait for the sketching to finish before analysis can begin. We additionally studied two variations of the weighting function given in Equation (6): using only the final sequence member loss, and linearly weighted losses. We found that the exponentially weighted loss gave superior results.
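As an illustration of per-timestep loss weighting, the sketch below assumes one exponential form consistent with the description above, $w_t = e^{-\alpha (1 - t/N)}$, which is non-zero everywhere and reaches 1 at the final step. Both this exact form and the default `alpha` value are assumptions made for the example, as are the function names.

```python
import math

def exp_weights(n, alpha=1.0):
    """Assumed weights w_t = exp(-alpha * (1 - t/n)) for t = 1..n."""
    return [math.exp(-alpha * (1.0 - t / n)) for t in range(1, n + 1)]

def weighted_sequence_loss(probs, label, alpha=1.0):
    """Weighted sum of per-timestep cross-entropy losses for one sequence.

    probs[t] is the softmax output at step t; label is the true class index.
    """
    weights = exp_weights(len(probs), alpha)
    return sum(-w * math.log(p[label]) for w, p in zip(weights, probs))
```

Setting `alpha` to zero recovers uniform weighting, so the same function can be used to compare weighted and unweighted training losses.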
To address the high variation of sequence length across sketches, we create batches of sketches having equal sequence length (i.e. equal $N$ (Section 3.1)). These batches of varying size are randomly shuffled and delivered to the recurrent network during training. For each batch, a categorical cross-entropy loss is generated for each sequence by comparing the predictions with the ground-truth. The resulting losses are weighted (Equation (6)) on a per-timestep basis as described previously and back-propagated through the corresponding sequence during training. We used stochastic gradient descent for training.
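The equal-length batching scheme can be sketched in a few lines. This is a simplified illustration, assuming each sketch is represented as a list of per-stroke feature vectors; the function name `length_batches` and the fixed `seed` are hypothetical.

```python
import random
from collections import defaultdict

def length_batches(sequences, seed=0):
    """Group sequences by length, then shuffle the order of the batches."""
    groups = defaultdict(list)
    for seq in sequences:
        groups[len(seq)].append(seq)
    batches = list(groups.values())
    random.Random(seed).shuffle(batches)
    return batches
```

Because every sequence in a batch has the same length, no padding or masking is needed inside a batch; the batches themselves differ in size.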
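Suppose that for a given input sequence $X = (x_1, x_2, \ldots, x_N)$, the corresponding outputs at the softmax layer are $Y = (y_1, y_2, \ldots, y_N)$. Note that in our case, $x_t \in \mathbb{R}^{d}$ where $d$ is the dimension of the deep feature and $y_t \in \mathbb{R}^{K}$ where $K$ is the number of object categories. To determine the final category label $L$, we perform a weighted sum-pooling of the softmax outputs as $L = \arg\max_{k} \sum_{t=1}^{N} w_t \, y_t^{k}$ where $w_t$ is as given in Equation (6) and $1 \le k \le K$. We explored various other softmax output pooling schemes: last sequence member-based prediction ($\arg\max_{k} y_N^{k}$), max-pooling ($\arg\max_{k} \max_{t} y_t^{k}$) and mean-pooling ($\arg\max_{k} \frac{1}{N} \sum_{t=1}^{N} y_t^{k}$). From our validation experiments, we found weighted sum-pooling to be the best choice overall.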
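The four pooling schemes discussed above can be compared side by side with a short sketch. The weights used for the weighted scheme follow an assumed exponential form $w_t = e^{-\alpha (1 - t/N)}$; that form, the `alpha` parameter and the function names are assumptions for illustration.

```python
import math

def argmax(v):
    return max(range(len(v)), key=lambda k: v[k])

def pool_predict(probs, scheme="weighted", alpha=1.0):
    """Pool per-timestep softmax outputs into a single category index."""
    n, k = len(probs), len(probs[0])
    if scheme == "last":
        scores = probs[-1]
    elif scheme == "max":
        scores = [max(p[c] for p in probs) for c in range(k)]
    elif scheme == "mean":
        scores = [sum(p[c] for p in probs) / n for c in range(k)]
    elif scheme == "weighted":
        # Assumed exponential weights; later steps count for more.
        w = [math.exp(-alpha * (1.0 - t / n)) for t in range(1, n + 1)]
        scores = [sum(wt * p[c] for wt, p in zip(w, probs)) for c in range(k)]
    else:
        raise ValueError("unknown pooling scheme: " + scheme)
    return argmax(scores)
```

A confident but wrong early prediction can dominate max-pooling and mean-pooling, whereas weighted sum-pooling with a larger `alpha` leans towards the later, more complete sketch.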
Table 1: Average recognition accuracy for various architectures. Columns: CNN, Recurrent Network, #Hidden, Avg. Acc.
In addition to obtaining the best results among approaches using handcrafted features, the work of Rosalia et al. was especially instrumental in identifying a subset of categories of the TU Berlin dataset which could be unambiguously recognized by humans. Consequently, our experiments are based on this curated subset of sketches. Following Rosalia et al., we use sketches from each of these categories. To ensure principled evaluation, we split the sketches of each category randomly into three sets to be used for training, validation and testing respectively. Additionally, we utilized the validation set exclusively for making choices related to architecture and parameter settings, and performed a one-shot comparative evaluation of our approach and competing approaches on the test set.
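A per-category random split of this kind can be sketched as below. The split fractions are left as parameters because the values shown in the usage are placeholders rather than the proportions used in our experiments; the function name `split_category` is likewise hypothetical.

```python
import random

def split_category(items, train_frac, val_frac, seed=0):
    """Randomly split one category's sketches into train/val/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(train_frac * len(shuffled)))
    n_val = int(round(val_frac * len(shuffled)))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# Placeholder fractions, for illustration only.
train, val, test = split_category(range(10), train_frac=0.5, val_frac=0.2)
```

Applying the split within each category, rather than over the pooled dataset, keeps the class balance identical across the three sets.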
We compared our performance with the following architectures:
Alexnet-FT: As a baseline experiment, we fine-tuned Alexnet using our training data. To ensure sufficient data, we augmented the training data along the lines of Sarvadevabhatla et al. We also used the features from the final fully-connected layer as input to our recurrent architectures. We shall refer to such usage as Alexnet-FC.
SketchCNN: This is essentially the deep architecture of Yu et al. but retrained for the categories and splits mentioned in Section 4.1. Since CNNs do not inherently store “state”, the authors construct six different sub-sequence stroke accumulation images which comprise the channels of the input representation to the CNNs. SketchCNN comprises five different CNNs, one for each of five different scaled versions of the sketches. The last fully-connected layer’s features from all five CNNs are processed using a Bayesian fusion technique to obtain the final classification.
For our experiments, we also concatenated the features from each scale of SketchCNN to form the input feature to the recurrent neural network architectures that were evaluated. However, only the full sketch was considered as the input to the CNN (i.e. single-channel). For the rest of the paper, we refer to the resulting concatenated feature as SketchCNN-SCh-FC.
Recurrent architectures: We experimented with the number of hidden units, the number of recurrent layers, the type of recurrent layers (i.e. LSTM or GRU), the training loss function (Section 3.2) and various pooling methods for obtaining final prediction in terms of individual sequence member predictions (Section 3.3).
Overall performance: Table 1 summarizes the overall performance in terms of average recognition accuracy for various architectures. As can be seen, our GRU-based architecture (first row) outperforms SketchCNN by a significant margin even though it is trained on only a part of the total data. We believe our good performance stems from (a) being able to exploit the sequential information in a scalable and efficient manner via recurrent neural networks and (b) the superiority of the deep sketch features provided by Alexnet compared to the SketchCNN-FC features. The latter can be clearly seen when we compare the first two rows of Table 1 with the last two rows. In our case, the performance of GRU was better than that of LSTM when Alexnet features were used. Overall, it is clear that the choice of (sketch) features and the recurrent network both play a crucial role in obtaining state-of-the-art performance for the sketch recognition task.
On-line recognition: We also compared the various architectures for their ability to recognize sketches as they are being drawn (i.e. on-line recognition performance). For each classifier, we determined the fraction of test sketches correctly recognized when only the first $x\%$ of the temporal sketch strokes are available. We varied $x$ in fixed steps and plotted this fraction as a function of $x$. The results can be seen in Figure 2. Intuitively, the higher a curve on the plot, the better its online recognition ability. As can be seen, our framework consistently recognizes a larger fraction of sketches at all levels of sketch completion (except for very small $x$) relative to other architectures.
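The on-line evaluation protocol can be sketched as follows. The example assumes a hypothetical `classify` callable that maps a (partial) list of strokes to a predicted label; rounding the stroke count up and always keeping at least one stroke are choices made here for illustration.

```python
import math

def online_accuracy(sketches, labels, classify, percent):
    """Fraction of sketches recognized from the first `percent`% of strokes."""
    correct = 0
    for strokes, label in zip(sketches, labels):
        # Keep at least one stroke so the classifier always has some input.
        keep = max(1, math.ceil(len(strokes) * percent / 100.0))
        if classify(strokes[:keep]) == label:
            correct += 1
    return correct / len(sketches)
```

Evaluating this function over a range of `percent` values for each classifier yields one curve per architecture, in the manner of the comparison plotted in Figure 2.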
Semantic information: To determine the extent to which our architecture captures semantic information, we examined the performance of the classifier on misclassified sketches. As can be seen in Figure 3, most of the misclassifications are reasonable errors (e.g. guitar is mistaken for violin) and demonstrate that our framework learns the overall semantics of the object recognition problem.
In this paper, we have presented our deep recurrent neural network architecture for freehand sketch recognition. Our architecture has two prominent traits. Firstly, its design accounts for the inherently sequential and cumulative nature of the human sketching process in a natural manner. Secondly, it exploits long-term sequential and structural regularities in stroke data represented as deep features. These two traits enable our system to achieve state-of-the-art recognition results on a large database of freehand object sketches. We have also shown that our recognition framework is highly suitable for on-the-fly interpretation of sketches as they are being drawn. Our framework source-code and associated data (pre-trained models) can be accessed at https://github.com/val-iisc/sketch-obj-rec.
We thank NVIDIA for their grant of Tesla K40 GPU.
Sketch classification and classification-driven analysis using Fisher vectors. ACM Trans. Graph., 33(6):174:1–174:9, Nov. 2014.