Learning meaningful representations of free-hand sketches remains a
challenging task given the signal sparsity and the high-level abstraction of
sketches. Existing techniques have focused on exploiting either the static
nature of sketches with Convolutional Neural Networks (CNNs) or the temporal
sequential property with Recurrent Neural Networks (RNNs). In this work, we
propose a new representation of sketches as multiple sparsely connected graphs.
We design a novel Graph Neural Network (GNN), the Multi-Graph Transformer
(MGT), for learning representations of sketches from multiple graphs which
simultaneously capture global and local geometric stroke structures, as well as
temporal information. We report extensive numerical experiments on a sketch
recognition task to demonstrate the performance of the proposed approach.
Particularly, MGT applied on 414k sketches from Google QuickDraw: (i) achieves
a small recognition gap to the CNN-based performance upper bound (72.80% vs.
74.22%), and (ii) outperforms all RNN-based models by a significant margin. To
the best of our knowledge, this is the first work proposing to represent
sketches as graphs and apply GNNs for sketch recognition. Code and trained
models are available at
https://github.com/PengBoXiangShang/multigraph_transformer.

Official code repository: multigraph_transformer (official code of the paper "Multi-Graph Transformer for Free-Hand Sketch Recognition", accepted by IEEE TNNLS).

Free-hand sketches are drawings made without the use of any instruments.
Sketches are different from traditional images: they are formed of temporal sequences of strokes (Ha and Eck, 2018; Xu et al., 2018), while images are static collections of pixels with dense color and texture patterns. Sketches capture high-level abstraction of visual objects with very sparse information compared to regular images, which makes the modelling of sketches unique and challenging.

The modern prevalence of touchscreen devices has led to a flourishing of sketch-related applications in recent years, including sketch recognition (Liu et al., 2019; Sarvadevabhatla et al., 2016), sketch-based image retrieval (Sangkloy et al., 2016; Liu et al., 2017; Shen et al., 2018; Collomosse et al., 2019; Dutta and Akata, 2019; Dey et al., 2019), and sketch-related generative models (Ha and Eck, 2018; Chen and Hays, 2018; Lu et al., 2018; Liu et al., 2019).

Figure 1: Sketches can be seen as sets of curves and strokes, which are discretized by graphs.

If we assume sketches to be 2D static images, CNNs can be directly applied to sketches, such as “Sketch-a-Net” (Yu et al., 2015). If we now suppose that sketches are ordered sequences of point coordinates, then RNNs can be used to recursively capture the temporal information, e.g., “SketchRNN” (Ha and Eck, 2018).

In this work, we introduce a new representation of sketches with graphs. We assume that sketches are sets of curves and strokes, which are discretized by a set of points representing the graph nodes. This view offers high flexibility to encode different sketch geometric properties, as we can decide different connectivity structures between the node points. We use two types of graphs to represent sketches: intra-stroke graphs and extra-stroke graphs. The first type captures the local geometry of strokes, independently of each other, using for example 1-hop or 2-hop connectivity, see Figure 1. The second type encodes the global geometry and temporal information of strokes. Another advantage of using graphs is the freedom to choose the node features. For sketches, spatial, temporal, and semantic information is available through the stroke point coordinates, the ordering of points, and the pen state information, respectively. In summary, representing sketches with graphs offers a universal representation that can make use of global and local spatial sketch structures, as well as temporal and semantic information.

To exploit these graph structures, we propose a new Transformer (Vaswani et al., 2017) architecture that can use multiple sparsely connected graphs.
It is worth reporting that a direct application of the original Transformer model on the input spatio-temporal features provides poor results.
We argue that the issue comes from the graph structure in the original Transformer, which is a fully-connected graph.
Although fully-connected word graphs work impressively for Natural Language Processing, where the underlying word representations themselves contain rich information, such dense graph structures provide a poor inductive bias (Battaglia et al., 2018) for 2D sketch tasks.
Transformers require sketch-specific design informed by geometric structure. This led us to naturally extend Transformers to multiple arbitrary graph structures. Moreover, graphs provide more robustness to noisy and style-varying sketches, as they focus on the geometry of strokes rather than on the specific distribution of points.

Another advantage of using domain-specific graphs is to leverage the sparsity property of discretized sketches. Observe that intra-stroke and extra-stroke graphs have highly sparse adjacency matrices. In practical sketch-based human-computer interaction scenarios, it is time-consuming to transfer the original sketch picture directly from user touch-screen devices to the back-end servers. To ensure real-time applications, transferring the stroke coordinates as a character string is more beneficial, see Figure 2.

Our main contributions can be summarised as follows:
(i) We propose to model sketches as sparsely connected graphs, which are flexible to encode local and global geometric sketch structures. To the best of our knowledge, it is the first time that graphs are proposed for representing sketches.
(ii) We introduce a novel Transformer architecture that can handle multiple arbitrary graphs. Using intra-stroke and extra-stroke graphs, the proposed Multi-Graph Transformer (MGT) learns both local and global patterns along sub-components of sketches.
(iii) This Multi-Graph Transformer model is agnostic to graph domains, and can be used beyond sketch applications.
(iv) Numerical experiments demonstrate the performance of our model. MGT significantly outperforms RNN-based models, and achieves a small recognition gap to CNN-based architectures.
This is promising for real-time sketch-based human-computer interaction systems.
Note that for sketch recognition, CNNs are the performance upper bound of coordinate-based models that involve truncating coordinate sequences, e.g., RNN or Transformer based architectures.

2 Related Work

Neural Network Architectures for Sketches

CNNs are a common choice for feature extraction from sketches. "Sketch-a-Net" (Yu et al., 2015) was the first CNN-based model with a sketch-specific architecture. It was directly inspired by AlexNet (Krizhevsky et al., 2012), with larger first-layer filters, no layer normalization, larger pooling sizes, and high dropout. Song et al. (2017) further improved Sketch-a-Net by adding spatial-semantic attention layers. "SketchRNN" (Ha and Eck, 2018) was a seminal work to model temporal stroke sequences with RNNs.
A CNN-RNN hybrid architecture for sketches was proposed in (Sarvadevabhatla et al., 2016).

In this work, we propose a novel Graph Neural Network architecture for learning sketch representations from multiple sparse graphs, combining both stroke geometry and temporal order.

Graph Neural Networks
Graph Neural Networks (GNNs) (Bruna et al., 2014; Defferrard et al., 2016; Sukhbaatar et al., 2016; Kipf and Welling, 2017; Hamilton et al., 2017; Monti et al., 2017) aim to generalize neural networks to non-Euclidean domains such as graphs and manifolds.
GNNs iteratively build representations of graphs through recursive neighborhood aggregation (or message passing),
where each graph node gathers features from its neighbors to represent local graph structure.

Transformers
The Transformer architecture (Vaswani et al., 2017), originally proposed as a powerful and scalable alternative to RNNs, has been widely adopted in the Natural Language Processing community for tasks such as machine translation (Edunov et al., 2018; Wang et al., 2019), language modelling (Radford et al., 2018; Dai et al., 2019), and question-answering (Devlin et al., 2019; Yang et al., 2019).

Transformers for NLP can be regarded as GNNs which use self-attention (Bahdanau et al., 2014; Veličković et al., 2018) for neighborhood aggregation on fully-connected word graphs (Ye et al., 2019).
However, GNNs and Transformers perform poorly when sketches are modelled as fully-connected graphs.
This work advocates for the injection of inductive bias into Transformers through domain-specific graph structures.

3 Method

3.1 Notation

We assume that the training dataset $\mathcal{D}$ consists of $N$ labeled sketches: $\mathcal{D} = \{(X_n, z_n)\}_{n=1}^{N}$.
Each sketch $X_n$ has a class label $z_n$, and can be formulated as an $S$-step sequence $[C_n, f_n, p] \in \mathbb{R}^{S \times 4}$.
$C_n = \{(x^n_s, y^n_s)\}_{s=1}^{S} \in \mathbb{R}^{S \times 2}$ is the coordinate sequence of the points of sketch $X_n$.
All sketch point coordinates have been uniformly scaled such that $(x^n_s, y^n_s) \in [0, 256]^2$.
If the true length of $C_n$ is shorter than $S$, the sequence is padded up to length $S$ (padding points are marked by the flag $f_3$ defined below).

$f_n \in \{f_1, f_2, f_3\}^{S \times 1}$ is a ternary integer vector that denotes the pen state corresponding to each point of $X_n$. It is defined as follows:
$f_1$ if the point $(x^n_s, y^n_s)$ is a starting or ongoing point of a stroke, $f_2$ if the point is the ending point of a stroke, and $f_3$ for a padding point.
Vector $p = [0, 1, 2, \dots, S-1]^T$ is a positional encoding vector that represents the temporal position of the points in each sketch $X_n$.
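As a concrete illustration of this notation, the following Python sketch builds the $S$-step inputs $(C_n, f_n, p)$ from a raw stroke list. The flag encoding (1 for $f_1$, 2 for $f_2$, 0 for $f_3$) and the $(-1,-1)$ padding coordinate are illustrative assumptions, not values fixed by the paper.

```python
def build_sequence(strokes, S=100, pad_xy=(-1, -1)):
    """strokes: list of strokes, each a list of (x, y) points scaled to [0, 256].

    Returns (coords, flags, positions), each of length S.
    Flags: 1 = f1 (start/ongoing point), 2 = f2 (stroke end), 0 = f3 (padding).
    """
    coords, flags = [], []
    for stroke in strokes:
        for i, (x, y) in enumerate(stroke):
            coords.append((x, y))
            flags.append(2 if i == len(stroke) - 1 else 1)
    coords, flags = coords[:S], flags[:S]   # truncate sequences longer than S
    while len(coords) < S:                  # pad shorter sequences with f3 points
        coords.append(pad_xy)
        flags.append(0)
    positions = list(range(S))              # temporal position encoding p
    return coords, flags, positions
```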

Given D, we aim to model Xn as multiple sparsely connected graphs and learn a deep embedding space, where the high-level semantic tasks can be conducted, e.g., sketch recognition.

3.2 Multi-Modal Input Layer

Given a sketch $X_n$, we model its $S$ stroke points as $S$ nodes of a graph. Each node has three features: (i) $C^n_s$, the spatial position of the current stroke point $s$; (ii) $f^n_s$, the pen state of the current stroke point, which helps to identify the stroke points belonging to the same stroke; and (iii) $p^s$, the temporal position of the current stroke point. As sketching is a dynamic process, it is important to use the temporal information.

The complete model architecture for our Multi-Graph Transformer is presented in Figure 3.
Let us start by describing the input layer. The final vector at node s of the multi-modal input layer is defined as

$(h^s_n)^{(\ell=0)} = \mathcal{C}\big(E_1(C^n_s),\, E_2(f^n_s),\, E_2(p^s)\big), \qquad (1)$

where $E_1(C^n_s)$ is the embedding of $C^n_s$ with a linear layer of size $2 \times \hat{d}$; $E_2(f^n_s)$ and $E_2(p^s)$ are the embeddings of the flag bit $f^n_s$ (3 discrete values) and the position encoding $p^s$ ($S$ discrete values), taken from an embedding dictionary of size $(S+3) \times \hat{d}$; and $\mathcal{C}(\cdot,\cdot)$ is the concatenation operator. The node vector $(h^s_n)^{(\ell=0)}$ has dimension $d = 3\hat{d}$.
The design of the input layer was selected after extensive ablation studies, which are described in subsequent sections.
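A minimal NumPy sketch of the input layer in Eq. (1): the random tables stand in for the learned embedding parameters, and the dictionary layout (flag indices before position indices) is our assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
S, d_hat = 100, 128
W_coord = rng.normal(size=(2, d_hat))     # E1: linear layer of size 2 x d_hat
E_dict = rng.normal(size=(S + 3, d_hat))  # E2: embedding dictionary of size (S+3) x d_hat

def input_layer(coords, flags, positions):
    """coords: (S, 2) floats; flags: (S,) ints in {0,1,2}; positions: (S,) ints in {0..S-1}."""
    e_coord = coords @ W_coord              # (S, d_hat) spatial embedding
    e_flag = E_dict[flags]                  # (S, d_hat) pen-state embedding
    e_pos = E_dict[3 + positions]           # (S, d_hat) temporal embedding
    return np.concatenate([e_coord, e_flag, e_pos], axis=-1)  # (S, 3*d_hat) = (S, d)

h0 = input_layer(rng.uniform(0, 256, size=(S, 2)),
                 rng.integers(0, 3, size=S),
                 np.arange(S))
```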

3.3 Multi-Graph Transformer

The initial node embedding $(h^s_n)^{(\ell=0)}$ is updated by stacking $L$ Multi-Graph Transformer (MGT) layers, defined in (7). Let us describe each component.

Graph Attention Layer
Let $A$ be a graph adjacency matrix of size $S \times S$, and let $Q \in \mathbb{R}^{S \times d_q}$, $K \in \mathbb{R}^{S \times d_k}$, $V \in \mathbb{R}^{S \times d_v}$ be the query, key, and value matrices. We define a graph attention layer as

$\mathrm{GraphAttention}(Q, K, V, A) = A \odot \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V, \qquad (2)$

where $\odot$ is the Hadamard product. We simply weight the "Scaled Dot-Product Attention" (Vaswani et al., 2017) with the graph edge weights.
We set $d_q = d_k = d_v = d/I$, where $I$ is the number of attention heads.
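Eq. (2) can be sketched in a few lines of NumPy. Note that because the adjacency mask is applied after the softmax, the retained attention weights are not re-normalized over the neighborhood:

```python
import numpy as np

def graph_attention(Q, K, V, A):
    """Eq. (2): A ⊙ softmax(Q K^T / sqrt(d_k)) V.

    Q: (S, d_k), K: (S, d_k), V: (S, d_v), A: (S, S) binary adjacency.
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return (A * attn) @ V                         # Hadamard mask, then aggregate
```

With $A$ all ones this reduces to standard scaled dot-product attention; with $A = 0$ the output is zero.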

Multi-Head Attention Layer. We aggregate the graph attentions with multiple heads:

$\mathrm{MultiHead}(Q, K, V, A) = \mathcal{C}(\mathrm{head}_1, \dots, \mathrm{head}_I)\, W^O, \qquad (3)$

where $W^O \in \mathbb{R}^{I d_v \times d}$ and each attention head is computed with the graph attention layer (2):

$\mathrm{head}_i = \mathrm{GraphAttention}(Q W^Q_i,\, K W^K_i,\, V W^V_i,\, A), \qquad (4)$

where $W^Q_i \in \mathbb{R}^{d \times d_q}$, $W^K_i \in \mathbb{R}^{d \times d_k}$, and $W^V_i \in \mathbb{R}^{d \times d_v}$.
We add dropout (Srivastava et al., 2014) before the linear projections of $Q$, $K$, and $V$.
An illustration of the Multi-Head Attention Layer is presented in Figure 4.

Multi-Graph Multi-Head Attention Layer. Given a set of adjacency matrices $\{A_g\}_{g=1}^{G}$, we concatenate Multi-Head Attention Layers:

$\mathrm{MGMHA}(Q, K, V, \{A_g\}_{g=1}^{G}) = \mathcal{C}(\mathrm{ghead}_1, \dots, \mathrm{ghead}_G)\, \widetilde{W}^O, \qquad (5)$

where $\widetilde{W}^O \in \mathbb{R}^{Gd \times d}$ and each Multi-Head Attention Layer is computed with (3):

$\mathrm{ghead}_g = \mathrm{MultiHead}(Q, K, V, A_g). \qquad (6)$
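Putting Eqs. (2), (3), (4), and (6) together, the sketch below composes per-graph multi-head attention and concatenates the per-graph outputs. For brevity the projection matrices are random stand-ins and are shared across graphs, whereas the model learns separate parameters per graph and per head.

```python
import numpy as np

rng = np.random.default_rng(0)
S, d, I, G = 10, 24, 3, 3        # toy sizes; d_k = d_v = d // I
d_k = d // I

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def graph_attention(Q, K, V, A):                    # Eq. (2)
    return (A * softmax(Q @ K.T / np.sqrt(d_k))) @ V

Ws = [tuple(rng.normal(size=(d, d_k)) for _ in range(3)) for _ in range(I)]
W_O = rng.normal(size=(I * d_k, d))                 # projection in Eq. (3)
W_tilde_O = rng.normal(size=(G * d, d))             # projection in Eq. (5)

def multi_head(h, A):                               # Eqs. (3)-(4)
    heads = [graph_attention(h @ Wq, h @ Wk, h @ Wv, A) for Wq, Wk, Wv in Ws]
    return np.concatenate(heads, axis=-1) @ W_O

def multi_graph_multi_head(h, graphs):              # one MultiHead per graph, Eq. (6)
    gheads = [multi_head(h, A) for A in graphs]
    return np.concatenate(gheads, axis=-1) @ W_tilde_O

h = rng.normal(size=(S, d))
graphs = [np.eye(S)] * G         # stand-ins for A_1-hop, A_2-hop, A_global
out = multi_graph_multi_head(h, graphs)
```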

Multi-Graph Transformer Layer. The Multi-Graph Transformer (MGT) layer $\ell$ at node $s$ is defined as

$(h^s_n)^{(\ell)} = \mathrm{MGT}\big((h_n)^{(\ell-1)}\big) = \hat{h}^s_n + \mathrm{FF}^{(\ell)}(\hat{h}^s_n), \qquad (7)$

where the intermediate feature representation $\hat{h}^s_n$ is defined as

$\hat{h}^s_n = \mathrm{MGMHA}^{s,(\ell)}_n\big((h^1_n)^{(\ell-1)}, \dots, (h^S_n)^{(\ell-1)}\big). \qquad (8)$

The MGT layer is thus composed of (i) a Multi-Graph Multi-Head Attention (MGMHA) sub-layer (5) and (ii) a position-wise fully connected Feed-Forward (FF) sub-layer.
Each MGMHA sub-layer (5) and FF sub-layer (7) is followed by a residual connection (He et al., 2016) and batch normalization (Ioffe and Szegedy, 2015).

3.4 Graph Structures

In this section, we discuss the graph structures used in our Graph Transformer layers. We consider two types of graphs, which capture local and global geometric sketch structures.

The first class of graphs focuses on representing the local geometry of individual strokes. We choose $K$-hop graphs to describe the local geometry of strokes. The intra-stroke adjacency matrix is defined as follows:

$A^{K\text{-hop}}_{n,ij} = \begin{cases} 1 & \text{if } j \in \mathcal{N}^{K\text{-hop}}_i \text{ and } j \in \mathrm{stroke}(i), \\ 0 & \text{otherwise}, \end{cases} \qquad (10)$

where $\mathcal{N}^{K\text{-hop}}_i$ is the $K$-hop neighborhood of node $i$ and $\mathrm{stroke}(i)$ is the stroke of node $i$.

The second class of graphs captures the global and temporal relationships between the strokes composing the whole sketch. We define the extra-stroke adjacency matrix as follows:

$A^{\mathrm{global}}_{n,ij} = \begin{cases} 1 & \text{if } |i - j| = 1 \text{ and } \mathrm{stroke}(i) \neq \mathrm{stroke}(j), \\ 0 & \text{otherwise}. \end{cases} \qquad (11)$

This graph forces the network to attend between pairs of points that belong to distinct strokes but are consecutive in time, thus allowing the model to understand the relative arrangement of strokes.
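Under the assumption that the $K$-hop neighborhood of node $i$ is the set of points within $K$ steps of $i$ along the point sequence (self-loop included), Eqs. (10) and (11) can be built directly from per-point stroke ids:

```python
import numpy as np

def intra_stroke_khop(stroke_ids, K):
    """Eq. (10): connect points within K hops that belong to the same stroke."""
    S = len(stroke_ids)
    A = np.zeros((S, S))
    for i in range(S):
        for j in range(S):
            if abs(i - j) <= K and stroke_ids[i] == stroke_ids[j]:
                A[i, j] = 1.0
    return A

def extra_stroke_global(stroke_ids):
    """Eq. (11): connect temporally consecutive points from different strokes."""
    S = len(stroke_ids)
    A = np.zeros((S, S))
    for i in range(S):
        for j in range(S):
            if abs(i - j) == 1 and stroke_ids[i] != stroke_ids[j]:
                A[i, j] = 1.0
    return A
```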

4 Experiments

4.1 Experimental Setting

Dataset and Pre-Processing
Google QuickDraw (Ha and Eck, 2018) (https://quickdraw.withgoogle.com/data) is the largest available sketch dataset, containing 50 million sketches stored as simplified stroke key points in temporal order, sampled using the Ramer-Douglas-Peucker algorithm after uniformly scaling image coordinates to within 0 to 256.
Unlike smaller crowd-sourced sketch datasets, e.g., TU-Berlin (Eitz et al., 2012), QuickDraw samples were collected via an international online game where users have only 20 seconds to sketch objects from 345 classes, such as cats, dogs, clocks, etc.
Thus, sketch classification on QuickDraw not only involves a diversity of drawing styles, but can also be highly abstract and noisy, making it a challenging and practical test-bed for comparing the effectiveness of various neural network architectures.
Following recent practices (Dey et al., 2019; Xu et al., 2018), we create random training, validation and test sets from the full dataset by sampling 1000, 100 and 100 sketches respectively from each of the 345 categories in QuickDraw.
Following (Xu et al., 2018), we truncate or pad all samples to a uniform length of 100 key points/steps to facilitate efficient training of RNN and GNN-based models.
We provide summary statistics for our training, validation and test sets in Table 1, and histograms visualizing the key points per sketch are shown in Figure 5.

Evaluation Metrics

Our evaluation metric for sketch recognition is "top-K accuracy": the proportion of samples whose true class is among the top K model predictions, for K = 1, 5, 10. (Note that acc.@K = 1.0 corresponds to 100%.)
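The metric is straightforward to compute; a minimal implementation:

```python
import numpy as np

def top_k_accuracy(logits, labels, k):
    """Fraction of samples whose true class is among the k highest-scoring predictions."""
    topk = np.argsort(-logits, axis=1)[:, :k]   # indices of top-k classes per row
    hits = [labels[i] in topk[i] for i in range(len(labels))]
    return float(np.mean(hits))

logits = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2]])
labels = np.array([1, 2])
```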

Set          # Samples   # Truncated (ratio)   Key points per sketch: max   min   mean    std
Training     345,000     11,788 (3.42%)                               100   2     43.26   21.85
Validation   34,500      1,218 (3.53%)                                100   2     43.24   21.89
Test         34,500      1,235 (3.58%)                                100   2     43.20   21.93

Table 1: Summary statistics for our subset of QuickDraw.

Figure 5: Histograms of key points per sketch for our subset of QuickDraw. The sharp spike at 100 key points is due to truncation.

Implementation Details

For fair comparison under similar hardware conditions, all experiments were implemented in PyTorch and run on one Nvidia 1080Ti GPU.
For Transformer models, we use the following hyperparameter values: $S = 100$, $L = 4$, $\hat{d} = 128$, $G = 3$ ($A^{1\text{-hop}}, A^{2\text{-hop}}, A^{\mathrm{global}}$), and $I = 8$ attention heads per graph for our Base model (and $\hat{d} = 256$ for our Large model).
Our FF sub-layer is a $d$-dimensional linear layer ($d = 3\hat{d}$) followed by ReLU (Glorot et al., 2011) and dropout.
The MLP classifier consists of two $4\hat{d}$-dimensional linear layers with ReLU and dropout, followed by a 345-dimensional linear projection representing logits over the 345 categories in QuickDraw.
We train all models by minimizing the softmax cross-entropy loss using the Adam optimizer (Kingma and Ba, 2014) for 100 epochs.
We use an initial learning rate of 5e−5, multiplied by a factor of 0.7 every 10 epochs.
We use an early stopping strategy (with a "patience" of 10 epochs) for selecting the final model: the checkpoint with the highest validation performance is chosen to report test performance.
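The schedule above is a simple step decay; computed directly (in PyTorch this corresponds to `torch.optim.lr_scheduler.StepLR` with `step_size=10` and `gamma=0.7`, though the exact training loop is not shown here):

```python
def lr_at_epoch(epoch, base_lr=5e-5, gamma=0.7, step=10):
    """Learning rate under step decay: base_lr * gamma ** (epoch // step)."""
    return base_lr * gamma ** (epoch // step)
```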

Baselines
(i) From the perspective of coordinate-based sketch recognition, RNN models are a simple-yet-effective baseline.
Following Xu et al. (2018), we design several bi-directional LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Cho et al., 2014) models at increasing parameter budgets comparable with MGT.
The final RNN states are concatenated and passed to the MLP classifier described previously.
We use batch size 256, initial learning rate 1e−4 and multiply by 0.9 every 10 epochs.
We train models with both our multi-modal input (Section 3.2) as well as the 4D input from Xu et al. (2018).
(ii) Although converting sketch coordinates to images adds time overhead in practical settings and can be seen as using auxiliary information, we compare MGT to various state-of-the-art CNN architectures.
It is important to note that sketch sequences were truncated/padded for training both MGT and RNNs, hence image-based CNNs stand as an upper bound in terms of performance.
For Inception V3 (Szegedy et al., 2016) and MobileNet V2 (Sandler et al., 2018), initial learning rate is 1e−3 and multiplied by 0.5 every 10 epochs.
For other CNN baselines, the initial learning rate and decay are configured following their original papers.

For each model, we use the maximum possible batch size. Following standard practice in computer vision (He et al., 2016; Huang et al., 2017), we employ early stopping based on observing over-fitting in the validation loss, and select the checkpoint with the highest validation accuracy for evaluation on the test set.
(iii) To evaluate the effectiveness of the proposed Graph Transformer layer,
we compare it with popular GNN variants: the Graph Convolutional Network (GCN) (Kipf and Welling, 2017) and the Graph Attention Network (GAT) (Veličković et al., 2018). (For GAT, we use the same scaled dot-product attention mechanism as GT for efficiency.)
All GNN models follow the same hyperparameter setup as the Transformers ($L = 4$, $\hat{d} = 256$), and are augmented with residual connections and batch normalization for fair comparison, following (Bresson and Laurent, 2018).
Optimal hyper-parameters and learning rate schedules are selected based on validation set performance.

4.2 Results

For fair comparison with RNN and CNN baselines at various parameter budgets, we implement two configurations of MGT: Base (10M parameters) and Large (40M parameters).
Additionally, we perform several ablation studies to evaluate the effectiveness of our multi-graph architecture and our sketch-specific input design.
Our main results are presented in Table 2.

Table 2:
Test set performance of MGT vs. the state-of-the-art RNN and CNN architectures. The 1st/2nd/3rd best results per column are indicated in red/blue/magenta.

Network    G   Graph Structure                    I_total   d̂     L   Dropout   acc.@1   acc.@5   acc.@10   # Parameters
GT #1      1   Fully-connected (vanilla)          8         256    4   0.10      0.5249   0.7802   0.8486    14,029,401
GT #2      1   Intra-stroke fully-connected       8         256    4   0.10      0.6487   0.8697   0.9151    14,029,401
GT #3      1   Random (10%)                       8         256    4   0.10      0.5271   0.7890   0.8589    14,029,401
GT #4      1   Random (20%)                       8         256    4   0.10      0.5352   0.7945   0.8617    14,029,401
GT #5      1   Random (30%)                       8         256    4   0.10      0.5322   0.7917   0.8588    14,029,401
GT #6      1   A1-hop                             8         256    4   0.10      0.7023   0.8974   0.9303    14,029,401
GT #7      1   A2-hop                             8         256    4   0.10      0.7082   0.8999   0.9336    14,029,401
GT #8      1   A3-hop                             8         256    4   0.10      0.7028   0.8991   0.9327    14,029,401
GT #9      1   Aglobal                            8         256    4   0.10      0.5488   0.8009   0.8659    14,029,401
GT #10     1   A1-hop||A2-hop||Aglobal            8         256    4   0.10      0.7057   0.9021   0.9346    14,029,401
MGT #11    2   A1-hop, A2-hop                     16        256    4   0.25      0.7149   0.9049   0.9361    28,188,249
MGT #12    2   A1-hop, Aglobal                    16        256    4   0.25      0.7111   0.9041   0.9355    28,188,249
MGT #13    2   A2-hop, Aglobal                    16        256    4   0.25      0.7237   0.9102   0.9400    28,188,249
MGT #14    3   A1-hop, A1-hop, A1-hop             24        256    4   0.25      0.7077   0.9020   0.9340    39,984,729
MGT #15    3   A1-hop, A2-hop, A3-hop             24        256    4   0.25      0.7156   0.9066   0.9365    39,984,729
MGT #16    3   A1-hop||A2-hop||Aglobal (×3)       24        256    4   0.25      0.7126   0.9051   0.9372    39,984,729
MGT #17    3   A1-hop, A2-hop, Aglobal            24        256    4   0.25      0.7280   0.9106   0.9387    39,984,729

Table 3: Ablation study for the multi-graph architecture of MGT.
GT denotes single-graph variants of MGT.
The 1st/2nd best results per column are indicated in red/blue.
|| denotes the logical union operation.

Comparison with RNN Baselines
We trained RNNs at various parameter budgets, and present results for the best-performing bi-directional LSTM and GRU models in Table 2:
(i) MGT outperforms both LSTM and GRU baselines by a significant margin (by 3% acc.@1 for Base, 5% for Large),
indicating that both the geometry and the temporal order of strokes are important for sketch representation learning.
(ii) Larger RNNs are harder to train to convergence, leading to degraded performance; e.g., GRUs outperform deeper LSTMs by 2%.

These results are not surprising: RNNs are notoriously hard to train at scale (Pascanu et al., 2013),
while Transformer performance is known to improve with scale, even with billions of model parameters (Shoeybi et al., 2019).

Comparison with CNN Baselines
Table 2 also presents performance of several state-of-the-art CNN architectures for computer vision:
(i) Inception V3 (Szegedy et al., 2016) and MobileNet V2 (Sandler et al., 2018) are the best performing CNN architectures. Our MGT Base has competitive or better recognition accuracy than all other baselines: AlexNet (Krizhevsky et al., 2012), VGG-11 (Simonyan and Zisserman, 2014), ResNet models (He et al., 2016), and DenseNet-201 (Huang et al., 2017).
(ii) MGT Large has small performance gap to Inception V3 and MobileNet V2 (i.e., 72.80% acc.@1 vs. 74.22%, 72.80% acc.@1 vs. 73.10%) and outperforms all other CNN architectures by almost 2%.
(iii) Somewhat counter-intuitively, shallower networks (Inception V3, MobileNet V2) outperform deeper networks (ResNet-152, DenseNet-201) by almost 2%.
This result highlights that CNNs designed for images with dense colors and textures are unsuitable for sparse sketches.

Note that MobileNet V2 is specifically designed for fast inference on mobile phones and is not directly comparable in terms of model parameters.

Table 4: Test set performance of Graph Transformer vs. other GNN variants.
The 1st/2nd best results per column are indicated in red/blue.
“fully” denotes fully-connected.
“Gra. Stru.” denotes graph structure.

Figure 6: Selected attention heads at each layer of MGT for a sample from the test set (labelled ‘alarm clock’). Each layer has I=8 attention heads per graph in total. We manually choose the most interesting heads for each graph.
Darker reds indicate higher attention values. Best viewed in color.

Input Permutation       acc.@1   acc.@5   acc.@10
coo.                    0.6512   0.8735   0.9162
coo. + flag             0.6568   0.8762   0.9176
coo. + flag + pos.      0.6600   0.8766   0.9182
C(coo., flag)           0.7017   0.8996   0.9321
C(coo., flag, pos.)     0.7280   0.9106   0.9387
4D Input                0.6559   0.8758   0.9175
4D Input + pos.         0.6606   0.8781   0.9190
C(4D Input, pos.)       0.7117   0.9048   0.9366

Table 5: Ablation study of the multi-modal input for MGT (Large). Notation: "+" and "C(⋯)" denote "sum" and "concatenate", respectively; "coo.", "flag", and "pos." denote "coordinate", "flag bit", and "position encoding", respectively.
The 1st/2nd best results per column are indicated in red/blue.

Ablations for Multi-Graph Architecture
We design several ablation studies to evaluate our sketch-specific multi-graph architecture in Table 3:
(i) We evaluate Graph Transformers trained on fully-connected graphs, i.e., vanilla Transformers (GT #1), fully-connected graphs within strokes (GT #2), as well as random graphs with 10%, 20%, and 30% connectivity (GT #3, #4, and #5, respectively).
We compare their performance with Graph Transformers trained on sketch-specific graphs A1-hop (GT #6), A2-hop (GT #7), A3-hop (GT #8), and Aglobal (GT #9).
We find that vanilla Transformers on fully-connected (52.49% acc.@1) and random graphs (52.71%, 53.52%, 53.22%) perform poorly compared to sketch-specific graph structures determined by domain expertise, such as fully-connected stroke graphs (64.87%) and A1-hop (70.23%).
The superior performance of K-hop graphs suggests that Transformers benefit from sparse graphs representing local sketch geometry.
We also evaluate a combined sketch-specific graph structure, i.e., A1-hop||A2-hop||Aglobal (GT #10), where the graph connectivity is the logical union set of A1-hop, A2-hop, and Aglobal.
However, this structure fails to improve performance over A1-hop, A2-hop, and Aglobal, despite involving more domain knowledge.
(ii) We experiment with various permutations of graphs for multi-graph models (MGT #11-#17).
We find that using a 3-graph architecture (MGT #17) combining local sketch geometry (A1-hop,A2-hop) and global temporal relationships (Aglobal) significantly boosts performance over 2-graph and 1-graph models (72.80% vs. 72.37% for 2-graph and 70.82% for 1-graph).
This result is interesting because using global graphs independently (GT #9) leads to comparatively poor performance (54.88%).
Additionally, we found that using diverse graphs (MGT #15, #17) is better than using the same graph (MGT #14).
Comparing MGT #14 and MGT #6 further shows that performance gains are due to the multi-graph architecture as opposed to more model parameters.
(iii) We also repeatedly input the adjacency matrix of GT #10 (i.e., A1-hop||A2-hop||Aglobal) three times as the multiple graph structures to train our MGT (see MGT #16 in Table 3). Compared with MGT #17, there is a clear performance gap (71.26% vs. 72.80%).
This further validates our idea of learning sketch representations through multiple separate graphs.

Comparison with GNN Baselines
In Table 4, we present performance of our Graph Transformer model compared to GCN and GAT, two popular GNN variants:
(i) We find that all models perform similarly on fully-connected graphs.
Using 1-hop graphs results in significant gains for all models, with Transformer performing the best.
(ii) Interestingly, both GNNs on fully-connected graphs are outperformed by a simple position-wise embedding method without any graph structure:
each node undergoes 4 feed-forward (FF) layers followed by summation and the MLP classifier.
These results further highlight the importance of sketch-specific graph structures for the success of Transformers.
Our final models use the Transformer layer, which implicitly includes the FF sub-layer (7).

Ablations for Multi-Modal Input
In Table 5, we experiment with various permutations of our sketch-specific multi-modal input design.
We aggregate information from spatial (coordinates), semantic (flag bits), and temporal (position encodings) modalities via summation (as in Transformers for NLP) or concatenation:
(i) Effectively using all modalities is important for performance (e.g., “C(coo., flag, pos.)” outperforms “coo.” and “C(coo., flag)”: 72.80% acc.@1 vs. 65.12%, 70.17%).
(ii) Concatenation works better than 4D input as well as summation (e.g., “C(coo., flag, pos.)” outperforms “C(4D Input, pos.)” and “coo. + flag + pos.”: 72.80% vs. 71.17%, 66.06%).

Qualitative Results
In Figure 6, we visualize attention heads at each layer of MGT for a sample from the test set (labelled ‘alarm clock’).
Attention heads in the initial layers attend very strongly to certain neighbors and very weakly to others,
i.e., the model builds local patterns for sketch sub-components (strokes) through message passing along their contours.
In penultimate layers, the intensity of neighborhood attention is significantly lower and evenly distributed,
indicating that the model is aggregating information from various strokes at each node.

Additionally, we believe Aglobal graphs are critical for message passing between strokes,
enabling the model to understand their relative arrangement, e.g., the feet of the clock are attached to the bottom of the body, the arms are located inside the body, etc.

5 Conclusion

This paper introduces a novel representation of free-hand sketches as multiple sparsely connected graphs.
We design a Multi-Graph Transformer (MGT) for capturing both geometric structure and temporal information from sketch graphs.
The intrinsic traits of the MGT architecture include:
(i) using graphs as universal representations of sketch geometry, as well as temporal and semantic information,
(ii) injecting domain knowledge into Transformers through sketch-specific graphs,
and
(iii) making full use of multiple intra-stroke and extra-stroke graphs.

We hope MGT can serve as a foundation for future work on sketch applications and network architectures, motivating the community towards sketch representation learning using graphs.
Additionally, for the graph neural network (GNN) community, we hope that MGT helps free-hand sketch become a new test-bed for GNNs.

Acknowledgements

Xavier Bresson is supported in part by NRF Fellowship NRFF2017-10.

References

D. Bahdanau, K. Cho, and Y. Bengio (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. (2018). Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.

X. Bresson and T. Laurent (2018). An experimental study of neural networks for variable graphs. In ICLR Workshop.

J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun (2014). Spectral networks and locally connected networks on graphs. In ICLR.

W. Chen and J. Hays (2018). SketchyGAN: towards diverse and realistic sketch to image synthesis. In CVPR.

K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio (2014). On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259.

J. Collomosse, T. Bui, and H. Jin (2019). LiveSketch: query perturbations for guided sketch-based visual search. In CVPR.

Z. Dai, Z. Yang, Y. Yang, W. W. Cohen, J. Carbonell, Q. V. Le, and R. Salakhutdinov (2019). Transformer-XL: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.

M. Defferrard, X. Bresson, and P. Vandergheynst (2016). Convolutional neural networks on graphs with fast localized spectral filtering. In NeurIPS.

J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In ACL.

S. Dey, P. Riba, A. Dutta, J. Llados, and Y. Song (2019). Doodle to search: practical zero-shot sketch-based image retrieval. In CVPR.

A. Dutta and Z. Akata (2019). Semantically tied paired cycle consistency for zero-shot sketch-based image retrieval. In CVPR.

S. Edunov, M. Ott, M. Auli, and D. Grangier (2018). Understanding back-translation at scale. In EMNLP.

M. Eitz, J. Hays, and M. Alexa (2012). How do humans sketch objects? ACM TOG.

X. Glorot, A. Bordes, and Y. Bengio (2011). Deep sparse rectifier neural networks. In AISTATS.

D. Ha and D. Eck (2018). A neural representation of sketch drawings. In ICLR.

W. Hamilton, Z. Ying, and J. Leskovec (2017). Inductive representation learning on large graphs. In NeurIPS.

K. He, X. Zhang, S. Ren, and J. Sun (2016). Deep residual learning for image recognition. In CVPR.

G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017). Densely connected convolutional networks. In CVPR.

S. Ioffe and C. Szegedy (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML.

D. P. Kingma and J. Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.

T. N. Kipf and M. Welling (2017). Semi-supervised classification with graph convolutional networks. In ICLR.

A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012). ImageNet classification with deep convolutional neural networks. In NeurIPS.

F. Liu, X. Deng, Y. Lai, Y. Liu, C. Ma, and H. Wang (2019). SketchGAN: joint sketch completion and recognition with generative adversarial network. In CVPR.

L. Liu, F. Shen, Y. Shen, X. Liu, and L. Shao (2017). Deep sketch hashing: fast free-hand sketch-based image retrieval. In CVPR.

Y. Lu, S. Wu, Y. Tai, and C. Tang (2018). Image generation from sketch constraint using contextual GAN. In ECCV.

F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein (2017). Geometric deep learning on graphs and manifolds using mixture model CNNs.
In CVPR,
Cited by: §2.

R. Pascanu, T. Mikolov, and Y. Bengio (2013)On the difficulty of training recurrent neural networks.
In ICML,
Cited by: §4.2.

A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019)PyTorch: an imperative style, high-performance deep learning library.
In NeurIPS,
Cited by: §4.1.

A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018)Improving language understanding by generative pre-training.
OpenAI Blog.
Cited by: §2.

M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018)MobileNetV2: inverted residuals and linear bottlenecks.
In CVPR,
Cited by: §4.1,
§4.2,
Table 2.

P. Sangkloy, N. Burnell, C. Ham, and J. Hays (2016)The sketchy database: learning to retrieve badly drawn bunnies.
ACM TOG.
Cited by: §1.

R. K. Sarvadevabhatla, J. Kundu, and V. Babu R (2016)Enabling my robot to play pictionary: recurrent neural networks for sketch recognition.
In ACM MM,
Cited by: §1,
§2.

Y. Shen, L. Liu, F. Shen, and L. Shao (2018)Zero-shot sketch-image hashing.
In CVPR,
Cited by: §1.

M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro (2019)Megatron-lm: training multi-billion parameter language models using gpu model parallelism.
arXiv preprint arXiv:1909.08053.
Cited by: §4.2.

K. Simonyan and A. Zisserman (2014)Very deep convolutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556.
Cited by: §4.2,
Table 2.

J. Song, Q. Yu, Y. Song, T. Xiang, and T. M. Hospedales (2017)Deep spatial-semantic attention for fine-grained sketch-based image retrieval.
In ICCV,
Cited by: §2.

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014)Dropout: a simple way to prevent neural networks from overfitting.
JMLR.
Cited by: §3.3.

C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016)Rethinking the inception architecture for computer vision.
In CVPR,
Cited by: §4.1,
§4.2,
Table 2.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017)Attention is all you need.
In NeurIPS,
Cited by: §1,
§2,
§3.3,
Table 2.

P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018)Graph Attention Networks.
In ICLR,
Cited by: §2,
§4.1,
Table 4.

Q. Wang, B. Li, T. Xiao, J. Zhu, C. Li, D. F. Wong, and L. S. Chao (2019)Learning deep transformer models for machine translation.
arXiv preprint arXiv:1906.01787.
Cited by: §2.

P. Xu, Y. Huang, T. Yuan, K. Pang, Y. Song, T. Xiang, T. M. Hospedales, Z. Ma, and J. Guo (2018)Sketchmate: deep hashing for million-scale human sketch retrieval.
In CVPR,
Cited by: §1,
§1,
§4.1,
§4.1.

Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019)XLNet: generalized autoregressive pretraining for language understanding.
arXiv preprint arXiv:1906.08237.
Cited by: §2.

Y. Ye, Y. Lu, and H. Jiang (2016)Human’s scene sketch understanding.
In ICMR,
Cited by: §1.

Z. Ye, Q. Guo, Q. Gan, X. Qiu, and Z. Zhang (2019)BP-transformer: modelling long-range context via binary partitioning.
arXiv preprint arXiv:1911.04070.
Cited by: §2.

Q. Yu, Y. Yang, Y. Song, T. Xiang, and T. Hospedales (2015)Sketch-a-net that beats humans.
In BMVC,
Cited by: §1,
§2.