Sketch-based Creativity Support Tools using Deep Learning

by   Forrest Huang, et al.
berkeley college

Sketching is a natural and effective visual communication medium commonly used in creative processes. Recent developments in deep-learning models have drastically improved machines' ability to understand and generate visual content. An exciting area of development explores deep-learning approaches used to model human sketches, opening opportunities for creative applications. This chapter describes three fundamental steps in developing deep-learning-driven creativity support tools that consume and generate sketches: 1) a data collection effort that generated a new paired dataset between sketches and mobile user interfaces; 2) a sketch-based user interface retrieval system adapted from state-of-the-art computer vision techniques; and, 3) a conversational sketching system that supports the novel interaction of a natural-language-based sketch/critique authoring process. In this chapter, we survey relevant prior work in both the deep-learning and human-computer-interaction communities, document the data collection process and the systems' architectures in detail, present qualitative and quantitative results, and paint the landscape of several future research directions in this exciting area.




1 Introduction

Sketching is a natural and effective means to express novel artistic and functional concepts. It is an integral part of the creative process for many artists, engineers, and educators. The abstract yet expressive nature of sketches enables sketchers to quickly communicate conceptual and high-level ideas visually while leaving out unnecessary details. These characteristics are most notably manifested in the use of sketches in design processes, where sketches are used by designers to iteratively discuss and critique high-level design concepts and ideas.

Recent advances in deep-learning (DL) models greatly improved machines’ abilities to perform Computer Vision tasks. In particular, convolutional neural networks, recurrent neural networks, and attention mechanisms dramatically outperform prior state-of-the-art methods in comprehending and generating visual content. They can perform these tasks even conditioned on user-specified natural language or other accompanying semantic information. These architectures provide great opportunities for developing creativity-support applications of the type we have argued for. They support sketch inputs and outputs for applications such as sketch-based image retrieval and sketch generation systems.

This chapter explores and surveys multiple facets of research in deep-learning-based sketching systems that support creative processes. We describe three research projects spanning design and artistic applications, targeting amateur and professional users, and using sketches as inputs and outputs. We outline three important aspects in this area of research: 1) collecting appropriate sketch-based data; 2) adapting existing architectures and tasks to the sketch domain; and, 3) formulating new tasks that support novel interactions using state-of-the-art model architectures, and deploying and evaluating these novel systems.

The key projects that we will include in this chapter are as follows:

  1. a new dataset consisting of paired sketches and UI designs produced by UI/UX designers

  2. a new system that retrieves relevant UI designs based on user-specified sketches that can support various design applications

  3. a novel formulation of an iterative sketch-authoring process based on user-specified natural language inputs

We begin the chapter by reviewing relevant prior literature on sketch-based applications in both the Deep-Learning (DL) community and Human-Computer-Interaction (HCI) community. We then describe the three aforementioned projects in detail and present qualitative and quantitative experimental results on existing and novel tasks. Finally, we conclude the chapter with a discussion of several avenues of further research in sketch-based deep-learning systems. We hope to provide a guide to aid the research and development of newer systems that work with sketches and sketch-based tasks. We aim to push the frontier of future sketch-based systems research to allow machines to support sketch-based creativity in various domains for users with all levels of expertise.

2 Role of Sketching in supporting Creative Activities

Sketching plays a pivotal role in many types of creative activities because of its highly visual nature and its flexibility for creation and manipulation: users can create and imagine any kind of visual content, and continuously revise it without being constrained by unnecessary details. Sketches are an independent art form, but are also extensively used to draft and guide other forms of artistic expression such as oil painting, or storyboarding in films and motion graphics. Moreover, because sketches effectively communicate visual ideas, they are well-suited for design processes such as User Interface (UI) or User Experience (UX) Design.

For these reasons, a plethora of research systems and tools that use sketches have been developed by the HCI community. We survey several notable systems in the domains of artistic sketches and design sketches in this section.

2.1 Sketch-based Applications supporting Artistic Expressions

Prior works that aim to support artistic sketches have mostly taken the form of real-time assistance to directly improve the final sketching product, or to generate sketching tutorials for improving users’ sketching proficiency. A number of sketching assistants use automatically-generated and crowd-sourced drawing guidance. ShadowDraw (Lee et al., 2011) and EZ-sketching (Su et al., 2014) use edge images traced from natural images to suggest realistic sketch strokes to users. PortraitSketch provides sketching assistance specifically on facial sketches by adjusting geometry and stroke parameters (Xie et al., 2014). Real-time, crowd-sourced feedback has also been used to correct and improve users’ sketched strokes (Limpaecher et al., 2013).

In addition to assisted sketching tools, researchers have developed tutorial systems to improve users’ sketching proficiency. How2Sketch automatically generates multi-step tutorials for sketching 3D objects (Hennessey et al., 2017). Sketch-sketch revolution provides first-hand experiences created by sketch experts for novice sketchers (Fernquist et al., 2011).

2.2 Sketch-based Applications supporting Design in Various Domains

Designers use sketches to expand novel ideas, visualize abstract concepts, and rapidly compare alternatives (Buxton, 2007). Sketches are commonplace in the modern design workspace and typically require minimal effort for designers to produce. They are also sometimes preferred over high-fidelity artifacts because the details left out by abstract sketches imply that a design is incomplete. They encourage designers to more freely imagine and provide alternatives based on current designs without being concerned about being committed to existing designs. In our informal conversations with designers, we observed that some designers trace high-fidelity design renderings using rough sketch strokes to solicit higher-level, creative feedback.

Research in the HCI community has produced interfaces that use drawing input for creating interactive design prototypes. SILK is the first system that allows designers to author interactive, low-fidelity UI prototypes by sketching (Landay, 1996). DENIM allows web designers to prototype with sketches at multiple detail levels (Lin et al., 2000). More recently, Apparition allows users to sketch their desired interfaces while crowdworkers translate the sketches into workable prototypes in near real-time (Lasecki et al., 2015).

3 Large-scale Sketch Datasets and Sketch-based Deep-learning Applications

With the widespread use of DL approaches in Computer Vision, researchers have collected large-scale sketch datasets to train and evaluate these models. We describe several projects in the DL community that enable sketch-based retrieval and sketch generation tasks using recently-collected sketch datasets. The capabilities of these models allow us to create novel interactive systems that were difficult prior to the introduction of deep-learning.

3.1 Large-scale Sketch Datasets

To support DL-based approaches towards sketch comprehension and generation, which rely heavily on large-scale datasets, researchers have crowdsourced sketch datasets of individual objects. Most of these datasets have focused on individual instances that either correspond to natural language or a general semantic class. The Quick, Draw! (Jongejan et al., 2016) and TU-Berlin (Eitz et al., 2012) datasets consist of human-drawn sketches for 345 and 250 object classes respectively. SketchyDB provides paired images and simple sketches for retrieval tasks (Sangkloy et al., 2016).

More recently, researchers have explored beyond individual sketched instances to compile datasets of sketched multi-object scenes. The SketchyScene dataset consists of sketched scenes of pre-drawn objects transformed and resized by humans, as scene sketches are highly demanding for users to create from scratch (Zou et al., 2018). SketchyCOCO starts with the MSCOCO dataset and retrieves relevant instances from the Sketchy dataset, compiling them into complex scenes (Gao et al., 2020). Nevertheless, none of these datasets consist of sketched scenes drawn by humans entirely from scratch. This is mainly due to the high skill barrier of generating these sketches, which makes large-scale crowd-sourcing difficult (Zou et al., 2018).

Related to complex sketches of multiple artistic objects, the DiDi dataset introduced drawings of flow-charts and functional diagrams traced entirely by human users (Gervais et al., 2020). While users did not design the diagrams from scratch, they generated each individual sketch stroke based on the procedurally generated templates.

3.2 Sketch-Based Image and 3D Model Retrieval

Sketch-based Image Retrieval is a frequently studied problem in the Computer Vision community. The current state-of-the-art solution to an instance of this problem is deep-learning-based (Sain et al., 2020). A typical DL-based approach involves training an encoder network to produce meaningful representations in a low-dimensional embedding space based on high-dimensional image inputs. To learn this embedding space, the network is trained on ‘triplets’ of data samples: one ‘anchor’ image as the reference, one ‘positive’ image that is either directly paired with the ‘anchor’ image or is semantically relevant to the ‘anchor’ image, and another ‘negative’ image that is irrelevant to the ‘anchor’ image. The network is trained to produce similar embeddings (with a low analytical distance in the embedding space) for the ‘positive’ and ‘anchor’ images and dissimilar embeddings for the ‘negative’ and ‘anchor’ images. When retrieving images with a sketch query, the natural images are ranked by the distance (e.g., Euclidean Distance) between their embedding outputs and the sketch query’s embedding outputs. A few of the recently introduced datasets mentioned in Section 3.1 have introduced DL-based architectures that establish state-of-the-art performance over non-neural baselines.

3.3 Neural Sketch Generation

Recent advancements in the DL community have introduced deep neural networks capable of recognizing and generating sketches. Sketch-RNN (Ha and Eck, 2018) is one of the first RNN-based models that can generate sequential sketch strokes through supervised learning on sketch datasets. We will describe this architecture in detail in a later section due to its high relevance to our work. Transformer networks (Vaswani et al., 2017) have shown superior performance in natural language modeling/generation tasks and have recently been applied to model sketch strokes and SVG graphics. These approaches have been shown to improve sketch generation performance over Sketch-RNN. Most recently, CoSE was introduced as a hierarchical architecture that aims to improve the performance of DL models in generating structured sketches that were previously difficult for flat models (Aksan et al., 2020). In particular, this approach targets the generation of structured diagrams in the DiDi dataset.

Another highly relevant model architecture is Generative Adversarial Networks (GANs). GANs have also been used to translate realistic images into sketches (or edges) at the pixel level by training on paired (Li et al., 2019) and unpaired (Zhu et al., 2017) sketch and image data. While pixel-based architectures have been effective when handling natural image data, they do not naturally couple with humans’ sketching processes, which are inherently sequential.

4 Developing a Paired Sketch/User Interface Dataset

In this section, we explore the first step towards developing a novel deep-learning-driven creativity support application: collecting a large-scale sketch dataset. We specifically target the task of drawing correspondences between two types of visual artifacts commonly used in the early stages of UI design: low-fidelity design sketches and high-fidelity UI examples.

Both sketches and UI design examples are commonly used in the UI design process, as reported by a variety of prior studies (Newman and Landay, 2000; Herring et al., 2009) and our informal conversations with designers. Designers search, consult and curate design examples to gain inspiration, explore viable alternatives and form the basis for comparative evaluations (Herring et al., 2009; Bonnardel, 1999). Similarly, designers frequently use sketches to expand novel ideas, visualize abstract concepts, and rapidly compare alternatives (Buxton, 2007). As such, understanding correspondences between these two modalities would allow machines to rapidly retrieve popular visual illustrations, common flow patterns and high-fidelity layout implementations (Deka et al., 2016) from large corpora, which can greatly augment various design tasks (Kumar et al., 2011; Swearngin et al., 2018; Nguyen and Csallner, 2015).

To solve this task using deep-learning-based approaches, we decided to collect a dataset of actual sketches stylistically and semantically similar to designers’ sketches of UIs. This also allows us to leverage large-scale UI datasets recently introduced by mobile-interaction mining applications (Deka et al., 2016), such that we would only need to collect sketches newly created by designers based on screenshots of original UIs in the Rico dataset (Deka et al., 2017). The dataset is now publicly available at

4.1 Designer Recruitment and Compensation

We recruited 4 designers through the freelancing platform Upwork. All designers reported having at least occasional UI/UX design experience and substantial sketching experience. In addition, all designers reported receiving formal training in UI design and degrees in design-related fields. They were compensated 20 USD per hour and worked for 60-73 hours.

4.2 Dataset Statistics

We collected 3702 sketches of 2201 UI examples from 167 popular apps in the Rico dataset. (This total of 3702 sketches differs from the original Swire publication (Huang et al., 2019): we discovered that 100 trial sketches from a pilot study were accidentally included in the originally stated total, and we have corrected the numbers in this chapter.) Each sketch was created with pen and paper in 4.1 minutes on average. Many UI examples were sketched by multiple designers. 66.5% of the examples were sketched by 2 designers, 32.7% of the examples were sketched by 1 designer and the remaining examples (< 1%) were sketched by 3 designers in our dataset. Our 4 designers sketched 455/1017/1222/1008 UIs respectively based on their availability. We allocated batches of examples to different combinations of designers to ensure the generality of the dataset.
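The per-designer counts reported above can be checked against the stated total with a few lines of arithmetic. This is only a consistency check on the numbers in this section; the variable names are ours and are not part of the dataset.

```python
# Per-designer sketch counts as reported in this section.
sketches_per_designer = [455, 1017, 1222, 1008]
total_sketches = sum(sketches_per_designer)

# Number of distinct UI examples that were sketched.
ui_examples = 2201

# Average number of sketches per UI example.
sketches_per_ui = total_sketches / ui_examples

print(total_sketches)             # 3702
print(round(sketches_per_ui, 2))  # 1.68
```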

We did not have the resources to generate sketches for every UI in the Rico dataset, so we curated a diverse subset of well-designed UI examples that cover 23 app categories in the Google Play Store and were of average to high design quality. We omitted poorly designed UIs from the dataset because of the relatively small size of the dataset for neural network training. Noise introduced into training by poor designs would have the potential to negatively impact the training time and quality of our model.

4.3 Data Collection and Postprocessing Procedure

We supplied the screenshots of our curated UI examples to the recruited designers and asked them to create sketches corresponding to the screenshots with pen and paper. They were prompted to reconstruct a low-fidelity sketch from each screenshot as if they were the designers of the interfaces. We instructed them to replace all actual image content with a sketched placeholder (a square with a cross or a mountain) and replace dynamic text with template text in each screenshot as shown in Figure 1. We added these instructions to obtain sketches with a more unified representation focused on the design layout of various UIs. These instructions also make it easier for the neural network to learn the concepts of images and text within the constraints of our small dataset.

In order to efficiently collect and calibrate sketches created by multiple designers in various formats of photos and scans, we supplied them with paper templates with frames for them to sketch on as shown in Figure 1. These frames are annotated with four ArUco codes (Munoz-Salinas, 2012) at the corners to allow perspective correction. All photos and scans of the sketches are corrected with affine transformation and thresholded to obtain binary sketches as final examples in the dataset.

Figure 1: Data Collection Procedure. We first send a UI screenshot (left) and paper templates with ArUco markers to designers. Designers then sketch on the template and send back a photo or a scan of the completed sketch (middle). We then post-process the photo using Computer Vision techniques to obtain the final clean sketch dataset (right).
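The correction and thresholding steps can be illustrated with a small, dependency-free sketch. It assumes the four marker positions have already been detected (marker detection itself is out of scope here), fits an affine map to them by least squares, and binarizes a grayscale grid. The coordinates, the threshold of 128, and all function names are hypothetical and are not taken from the original pipeline.

```python
def solve3(m, b):
    """Solve the 3x3 linear system m @ p = b with Cramer's rule."""
    def det(a):
        return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
                - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
                + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))
    d = det(m)
    solution = []
    for col in range(3):
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][col] = b[r]
        solution.append(det(mc) / d)
    return solution


def fit_affine(src, dst):
    """Least-squares affine map taking src points to dst points."""
    rows = [(x, y, 1.0) for x, y in src]
    # Normal equations (A^T A) p = A^T t, solved once per output coordinate.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    params = []
    for k in range(2):
        atb = [sum(r[i] * d[k] for r, d in zip(rows, dst)) for i in range(3)]
        params.append(solve3(ata, atb))
    return params  # [[a, b, c], [d, e, f]]


def apply_affine(params, point):
    """Map one (x, y) point through the fitted affine transform."""
    x, y = point
    return tuple(a * x + b * y + c for a, b, c in params)


def binarize(gray, threshold=128):
    """Map a grayscale grid (0-255) to a binary sketch: 1 = ink, 0 = paper."""
    return [[1 if v < threshold else 0 for v in row] for row in gray]


# Hypothetical marker centers found in a photo, and where they should land.
detected = [(12.0, 8.0), (212.0, 18.0), (2.0, 308.0), (202.0, 318.0)]
target = [(0.0, 0.0), (200.0, 0.0), (0.0, 300.0), (200.0, 300.0)]

params = fit_affine(detected, target)
corrected = [apply_affine(params, p) for p in detected]
binary = binarize([[0, 255], [200, 100]])
```

In practice the fitted transform would be applied to every pixel of the photo before thresholding; here it is applied only to the marker centers to keep the example short.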

5 Developing Swire: a Sketch-based User Interface Retrieval System

After collecting a dataset of corresponding sketches and UI screenshots, we proceed to develop Swire, a sketch-based user interface retrieval system that learns correspondences between UIs and sketches in the dataset. In developing Swire, we adapted a commonly used DL architecture, training paradigm, and loss function that are able to retrieve relevant visual content given input sketches. Similar to prior work in sketch-based image retrieval, we used a convolutional neural network architecture and adopted the cross-modal embedding training scheme using a triplet loss. This allows us to take advantage of a convolutional neural network’s strong capability to understand high-dimensional visual features while creating a unified embedding space for both sketches and UIs with learned concepts based on their correspondences. This means Swire can be used to search a dataset of UIs using either sketches or UI screenshots as the querying modality.

The development of Swire consists of a training phase and a querying phase. During the training phase, we train Swire’s deep neural network to generate similar low-dimensional outputs (64-dimensional) for matching pairs of screenshots and sketches, and dissimilar outputs for non-matching pairs of screenshots and sketches. This training scheme is shown to be useful for sketch-based image retrieval (Sangkloy et al., 2016). In the querying phase, we use Swire’s trained neural network to encode a user’s sketch query and retrieve UIs with the closest output to the sketch query’s output.

5.1 Network Architecture

We used two convolutional sub-networks to handle the two inputs of sketch-screenshot pairs. These two sub-networks are similar to VGG-A (Simonyan and Zisserman, 2015), a shallow variant of the state-of-the-art network that won the ILSVRC2014 image recognition challenge (Russakovsky et al., 2014). Our network consists of 11 layers, with five convolutional blocks and three fully-connected layers. Each convolutional block contains two (one for the first two blocks) convolutional layers with 3x3 kernels and one max-pooling layer. The convolutional layers in the five blocks have 64, 128, 256, 512, and 512 filters respectively. The first two fully-connected layers have 4096 hidden units. The last layer has 64 hidden units and outputs the 64-dimensional embedding used for querying. The activation functions of all layers except the last layer are ReLU. The network architecture is described in detail in Figure 2.


The sub-networks are trained so that their final layers output 64-dimensional embeddings that serve as codes for the input sketches and screenshots. The model is trained with a pairwise sampling scheme described in the following subsection.

Figure 2: Network Architecture of Swire’s Neural Network. Swire’s neural network consists of two identical sub-networks similar to the VGG-A deep convolutional neural network. These networks have different weights and attempt to encode matching pairs of screenshots and sketches with similar values.
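The layer counts described in Section 5.1 can be written down as a small configuration and checked. The following is a descriptive sketch only: it lists layers as strings rather than building a trainable network, and the names `CONV_BLOCKS`, `FC_LAYERS`, and `describe` are ours.

```python
# (number of 3x3 conv layers, number of filters) for each of the five blocks;
# each block ends with one max-pooling layer.
CONV_BLOCKS = [(1, 64), (1, 128), (2, 256), (2, 512), (2, 512)]

# Hidden units of the three fully-connected layers; the last is the embedding.
FC_LAYERS = [4096, 4096, 64]


def describe(conv_blocks, fc_layers):
    """Return a flat list of layer descriptions for one sub-network."""
    layers = []
    for n_convs, filters in conv_blocks:
        layers += [f"conv3x3-{filters}+relu"] * n_convs
        layers.append("maxpool")
    for index, units in enumerate(fc_layers):
        is_last = index == len(fc_layers) - 1
        # All layers except the last use a ReLU activation.
        layers.append(f"fc-{units}" + ("" if is_last else "+relu"))
    return layers


layers = describe(CONV_BLOCKS, FC_LAYERS)
weight_layers = [name for name in layers if name != "maxpool"]
print(len(weight_layers))  # 11
```

Counting only the layers with weights (8 convolutional and 3 fully-connected) gives the 11 layers stated in the text.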

5.2 Triplet Loss

The model is trained with a Triplet Loss function (Schroff et al., 2015; Yu et al., 2016) that involves the neural-network outputs of three inputs: an ‘anchor’ sketch $s$, a ‘positive’ matching screenshot $i$, and a ‘negative’ mismatched screenshot $i'$. This forms two pairs of inputs during training. The positive pair consists of a sketch-screenshot pair that correspond to each other. The negative pair consists of a sketch-screenshot pair that does not correspond. The negative pair is obtained with the same sketch from the positive pair and a random screenshot sampled from the mini-batch.

During training, each pair is passed through two sub-networks such that the sketch sample is passed through the sketch sub-network and outputs an embedding $f_s(s)$, and we similarly obtain the neural-network output of the screenshot $f_i(i)$ from the screenshot sub-network. We compute the Euclidean distance $D$ between the neural network outputs. For the positive pair,

$$D(s, i) = \lVert f_s(s) - f_i(i) \rVert_2$$

Similarly, for the distance of the negative pair,

$$D(s, i') = \lVert f_s(s) - f_i(i') \rVert_2$$

With these distances, we formulate a triplet loss function,

$$L = D(s, i) + \max\left(0,\; m - D(s, i')\right)$$

We maintain a margin $m$ between the positive and negative pairs to prevent the network from learning trivial solutions (zero embeddings for all examples).
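To make the role of the margin concrete, the following minimal sketch computes a loss of this kind on toy embeddings. It assumes the loss takes the form of the positive distance plus a hinge on the margin minus the negative distance; other triplet formulations, such as a hinge on the difference of the two distances (Schroff et al., 2015), are also common. The margin value and the 2-dimensional embeddings are placeholders.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(sketch_emb, pos_emb, neg_emb, margin):
    """Positive distance plus a hinge that pushes the negative beyond the margin."""
    d_pos = euclidean(sketch_emb, pos_emb)
    d_neg = euclidean(sketch_emb, neg_emb)
    return d_pos + max(0.0, margin - d_neg)


# Toy 2-D embeddings (real embeddings are 64-dimensional).
anchor = [0.0, 0.0]
positive = [0.0, 0.1]
negative = [0.3, 0.4]
loss = triplet_loss(anchor, positive, negative, margin=1.0)

# Collapsing every embedding to zero is not free: the hinge term stays at the margin.
collapsed = triplet_loss([0.0, 0.0], [0.0, 0.0], [0.0, 0.0], margin=1.0)
```

With all embeddings collapsed to zero, the positive distance vanishes but the loss equals the margin, which is why the trivial solution is penalized.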

5.3 Data and Training Procedure

We train our network using the dataset described in Section 4. Since the sketches are created by four separate designers, we split the data and used data collected from three designers for training and from one designer for testing. This is to ensure that the model generalizes across sketches produced by different designers. In addition, we do not repeat interfaces from the same apps between the training and test sets. This creates 1722 matching sketch-screenshot pairs for training and 276 pairs for testing.
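One way such a split could be implemented is sketched below. The record fields, the identifiers, and the choice to discard held-out pairs whose app also appears in training are all assumptions for illustration; the text does not specify how the overlap was resolved.

```python
def split_by_designer(pairs, test_designer):
    """Hold out one designer for testing and keep apps disjoint across splits."""
    train = [p for p in pairs if p["designer"] != test_designer]
    train_apps = {p["app"] for p in train}
    # Assumption: test pairs whose app also appears in training are discarded.
    test = [p for p in pairs
            if p["designer"] == test_designer and p["app"] not in train_apps]
    return train, test


# Hypothetical records; designer and app identifiers are made up.
pairs = [
    {"designer": "A", "app": "mail", "ui": "mail_1"},
    {"designer": "B", "app": "maps", "ui": "maps_1"},
    {"designer": "C", "app": "mail", "ui": "mail_2"},
    {"designer": "D", "app": "mail", "ui": "mail_3"},
    {"designer": "D", "app": "notes", "ui": "notes_1"},
]

train, test = split_by_designer(pairs, test_designer="D")
print([p["ui"] for p in test])  # ['notes_1']
```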

During training, the sketches and screenshots are resized to pixels, and the pixel values are normalized between centered at . The network is trained using a Stochastic Gradient Descent Optimizer with a mini-batch size of 32. The learning rate is . The margin is in all models. All hyper-parameters listed above were determined by empirical experiments on the training set.

5.4 Querying

When a user makes a query with a drawn sketch, the model computes an output by passing the sketch through the sketch sub-network. This output is then compared with all neural-network outputs of the screenshots of UI examples in the dataset using a nearest neighbor search. The UI results are ranked by the distance between their outputs and the user’s sketch’s output.
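The querying step amounts to sorting precomputed screenshot embeddings by their distance to the query embedding. The sketch below uses a brute-force search over toy 2-dimensional embeddings; the screenshot names and values are hypothetical, and a large dataset might use an approximate nearest-neighbor index instead.

```python
import math


def rank_screenshots(query_emb, screenshot_embs, k=10):
    """Return the names of the k screenshots closest to the query embedding."""
    ranked = sorted(
        screenshot_embs.items(),
        key=lambda item: math.dist(query_emb, item[1]),
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical precomputed screenshot embeddings (real ones are 64-dimensional).
screenshot_embs = {
    "login": [0.9, 0.1],
    "menu": [0.1, 0.8],
    "settings": [0.5, 0.5],
}

# Hypothetical embedding of a user's sketch query.
query = [0.2, 0.9]
print(rank_screenshots(query, screenshot_embs, k=2))  # ['menu', 'settings']
```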

5.5 Results


We implement competitive non-neural baselines to evaluate the performance of our method. As described in Section 3.2, typical methods of sketch-based image retrieval involve two steps: (1) extract an edge-map from the original image to be queried, and (2) match the edge-map using a specific similarity metric. Using this framework, we first extracted the edges of the screenshots using the Canny Edge detector (Canny, 1986). We then extracted features from the edges using Bag-of-words (BoW) Histogram of Oriented Gradients (HOG) filters. BoW-HOG filters are an advanced method of computing similarity between images. They capture edge features in an image by computing the magnitude of gradients across the entire image with respect to multiple orientations. This method summarizes image features with fixed-length vectors that describe the occurrences and characteristics of edges in images. This method is highly effective for sketch-based image retrieval as it focuses on the characteristics of edges while being insensitive to local translations and rotations.

After obtaining these fixed-length vectors for screenshots and sketch queries, we compare them using Euclidean distance as a simple metric to query for closest matching images (design screenshots in our case) to the sketch queries.

Quantitative Results

We use a test set that consists of 276 UI examples to compare Top-1 and Top-10 performances of BoW-HOG filters and Swire. The results are summarized in Table 1. We observe that Swire significantly outperforms BoW-HOG filters in Top-10 accuracy (60.9% versus 38.8%). For Top-1 accuracy, Swire achieves 15.9%, which only slightly outperforms the strong baseline of BoW-HOG filters at 15.6%. This shows Swire to be particularly effective for retrieving complex examples from the dataset compared to the BoW-HOG filters. We believe deep-learning-based Swire is advantageous compared to BoW-HOG filters that rely on matching edge-maps because UI sketches have semantic complexities that are not captured by edge-maps of screenshots.

Technique          Top-1     Top-10
(Chance)           0.362%    3.62%
BoW-HOG filters    15.6%     38.8%
Swire              15.9%     60.9%

Table 1: Top-k Accuracy of Various Models on the Test Set. Swire significantly outperforms BoW-HOG filters.
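The chance row in Table 1 follows from the test-set size: with 276 candidates, a random ranking places the correct screenshot in the top k with probability k/276. The sketch below reproduces those two numbers and shows how Top-k accuracy can be computed from ranked results; the toy queries and function names are ours.

```python
def top_k_accuracy(ranked_results, ground_truth, k):
    """Fraction of queries whose matching item appears in the top k results."""
    hits = sum(
        1 for query, ranked in ranked_results.items()
        if ground_truth[query] in ranked[:k]
    )
    return hits / len(ranked_results)


def chance_top_k(k, n_candidates):
    """Probability that a random ranking places the match in the top k."""
    return k / n_candidates


print(f"{chance_top_k(1, 276):.3%}")   # 0.362%
print(f"{chance_top_k(10, 276):.2%}")  # 3.62%

# Toy example with two queries and three candidates.
ranked_results = {"q1": ["a", "b", "c"], "q2": ["c", "a", "b"]}
ground_truth = {"q1": "a", "q2": "b"}
top1 = top_k_accuracy(ranked_results, ground_truth, k=1)
top3 = top_k_accuracy(ranked_results, ground_truth, k=3)
```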

Qualitative Results

We visualize query results from the test set to qualitatively understand the performance of Swire in Figure 3. Swire is able to retrieve relevant menu-based interfaces (Example a) despite the difference in visual appearance of the menu items. Swire is also able to retrieve pop-up windows (Example b) implemented in various ways despite the drastic difference in the dimensions of the pop-up windows. We observe similar efficacy in retrieving settings (Example c), list-based layouts (Example f), and login layouts (Example e). Nevertheless, we observe that Swire sometimes ignores smaller details of the interfaces described by sketched elements. This limitation will be further discussed in Section 7.1.

Figure 3: Query Results for Complete Sketches. Swire is able to retrieve common types of UIs such as sliding menus (a), settings (c), and login (e) layouts.

Expert Evaluation

To better evaluate Swire’s performance from professional users’ perspectives, we recruited 5 designers on Upwork with substantial experience in mobile UI/UX design to evaluate selected results from the test set. There was no overlap between these designers and those recruited for creating the dataset. We provided them with 9 sets of query sketches and the corresponding Top-10 retrieved results for each query from the test set. The 9 sets consist of 3 ‘best’ results (the corresponding screenshot of the sketch query is retrieved as the Top-1 result), 3 ‘mediocre’ results (the corresponding screenshot of the sketch query is retrieved within the Top-10 results, but not Top-1), and 3 ‘poor’ results (the corresponding screenshot of the sketch query is not retrieved within the Top-10 results). We asked the designers to provide comments on each set of results regarding the relevance between the sketches and the screenshots, and to comment on the potential integration of this tool into their design workflows.

Most designers agreed that all retrieved results in the ‘best’ result sets are relevant to the query, and they would be satisfied with the results. They were especially satisfied with a result set of sliding menus (also shown in Figure 3a). They were able to identify the results as ‘variations on the theme of navigation drawers’ (D3) or ‘slide out modal pattern’ (D2). Moreover, the designers also expressed satisfaction with some sets of ‘mediocre’ results. Most were satisfied with a set of results that ‘show variations of the top tabbed navigation’ (D5), which is a common design pattern.

On the other hand, some designers considered the ‘poor’ results unsatisfactory. For example, designers were less satisfied with the model’s performance on a sign-up sketch, commenting that the model only gathered screens with similar element layouts while ignoring the true nature of the input fields and buttons in the query (D3). However, D4 considered ‘rows of design elements’ common in the results relevant to the sketch, and D1 considered two similar sign-up screens retrieved by the model as strong results even though they did not match up perfectly with the sketch.

In general, we observed that designers were more satisfied with the results when the model was able to retrieve results that are semantically similar at a high level instead of those with matching low-level element layouts. Notably, D1 commented that we ‘probably already considered the common UI sketch patterns and train[ed] [our] system to match it up with image results,’ which reflects the effectiveness of Swire in detecting common UI patterns in some instances, given that it was not specifically trained to recognize these patterns. All designers also considered Swire to be potentially useful in their workflows for researching, ideating and implementing novel designs.

5.6 Applications

In Section 5.5, we evaluated and validated Swire’s effectiveness for generally finding design examples through sketch-based queries. Since both sketches and UI design examples are commonly used in early stages of the UI design process as reported by a variety of prior studies (Newman and Landay, 2000; Herring et al., 2009), we explore the potential usage of Swire through several design applications in this section. Prototypes of these applications implemented with Jupyter Notebook are available at

Evaluation with Alternative Designs

Designers often explore alternative design examples to support the implementation and comparative evaluation (Herring et al., 2009) of their own designs. HCI research literature also recommends the use of parallel prototyping techniques to obtain better final products through extensive comparison (Dow et al., 2010). Swire is able to support design comparisons because it enables querying for similar UIs with high-fidelity UI prototypes.

Swire is effective in retrieving similar UIs because the visual content of UI screenshots is reinforced with the semantic structure of sketches in the embedding space during training. Swire can thus be used as a semantically-aware similarity metric between interfaces.
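To illustrate how an embedding space can serve as a similarity metric, the following is a minimal sketch of nearest-neighbour ranking. It assumes Euclidean distance between fixed-length embeddings; the 64-dimensional size, the random data, and the function name `rank_by_embedding_distance` are illustrative assumptions and not Swire's actual implementation.

```python
import numpy as np

def rank_by_embedding_distance(query_emb, dataset_embs, k=5):
    """Return indices of the k dataset embeddings closest to the query."""
    # Euclidean distance between the query and every dataset embedding.
    dists = np.linalg.norm(dataset_embs - query_emb, axis=1)
    return np.argsort(dists)[:k]

# Hypothetical data: 100 UI embeddings of dimension 64.
rng = np.random.default_rng(0)
dataset_embs = rng.normal(size=(100, 64))
# A query that is a slightly perturbed copy of UI number 42.
query_emb = dataset_embs[42] + 0.01 * rng.normal(size=64)
top = rank_by_embedding_distance(query_emb, dataset_embs, k=3)
```

The same ranking function applies whether the query embedding comes from a sketch or from a high-fidelity screenshot, as long as both are mapped into the shared space.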

Figure 4 shows an example of Swire retrieving similar menus using a high-fidelity screenshot. We similarly observe that it is able to retrieve design patterns such as login screens, list-based UIs, and grid-based UIs when querying with screenshots (please see the original Swire paper (Huang et al., 2019) for a complete figure illustrating these patterns). This enables effective comparison between similar designs with slight variations.

Figure 4: Alternative Design Query Results. Swire is able to retrieve similar UIs in the dataset from queries of complete, high-fidelity UI screenshots.

Auto-completing Partial Designs

Sketches are often used for rapid exploration of potential design solutions (Buxton, 2007). Designers use partial sketches to express core ideas, while leaving out parts of the interface in sketches for considering viable design alternatives. We trained an alternative model Swire-segments on partial sketches of UIs, which allows us to ‘auto-complete’ the remaining UI by retrieving a variety of examples that are only required to match parts of the UI sketched by the user. This model allows designers to quickly gain design inspirations that are relevant to the key UI elements desired by them.

In the training and querying phases of Swire-segments, UI examples are split into small parts. Designers can thus specify one or more parts of the UIs to be matched by the model with examples in the dataset. We compute an embedding for each part of the interface and match only the embeddings of the parts specified by the users for retrieval. Figure 7(a) shows that Swire-segments is able to retrieve multiple designs that all contain the Floating Action Button (FAB, a popular Android design paradigm) but with diverse layouts. Swire-segments is also able to retrieve interfaces with only tab-based top bars in common. In these examples, Swire-segments is able to remain agnostic to the unspecified part of the sketch queries.
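The part-wise matching described above can be sketched as follows. This is an illustration under stated assumptions: every UI is split into the same fixed number of parts, distances are Euclidean and summed over the specified parts, and all names and sizes are hypothetical.

```python
import numpy as np

def rank_by_specified_parts(query_parts, dataset_parts, specified, k=5):
    """Rank dataset UIs using only the parts the designer sketched.

    query_parts:   (P, D) embeddings, one per part of the query sketch
    dataset_parts: (N, P, D) embeddings, one per part of each dataset UI
    specified:     length-P booleans, True for parts the designer drew
    """
    specified = np.asarray(specified, dtype=bool)
    # Unspecified parts are dropped, so they cannot influence the ranking.
    diffs = dataset_parts[:, specified, :] - query_parts[specified, :]
    dists = np.linalg.norm(diffs, axis=2).sum(axis=1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(1)
dataset_parts = rng.normal(size=(50, 3, 16))
query_parts = rng.normal(size=(3, 16))
query_parts[0] = dataset_parts[7, 0]  # only part 0 matches UI number 7
top_partial = rank_by_specified_parts(
    query_parts, dataset_parts, specified=[True, False, False], k=3)
```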

User Flow Examples

Beyond querying for single UIs, designers also use sketches to illustrate user experience at multiple scales (Newman and Landay, 2000), such as conveying transitions and animations between multiple interfaces. Since the Rico dataset also includes user interaction data, we use this data to enable flow querying with Swire. Designers can use this application to interact with interaction design examples that can accelerate the design of effective user flows.

To query flow examples in the dataset, since Swire creates a single embedding for each UI, we can match an arbitrary number of interfaces in arbitrary order by concatenating the embedding values during the ranking process of querying. We qualitatively observe that Swire is able to retrieve registration (Figure 7(b)) and ‘closing menu’ flows that are commonly implemented by designers. Since Rico also contains transition details between consecutive UIs, these examples can demonstrate popular animation patterns (Deka et al., 2016) that provide inspiration to interaction and animation designers.
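A minimal sketch of ranking by concatenated embeddings is shown below. It assumes that each flow has the same number of screens and that Euclidean distance is used; the shapes and the name `rank_flows` are hypothetical.

```python
import numpy as np

def rank_flows(query_flow, dataset_flows, k=5):
    """Rank flows by distance between concatenated per-UI embeddings.

    query_flow:    (T, D) embeddings of the T sketched screens, in order
    dataset_flows: (N, T, D) embeddings of N recorded flows of T screens
    """
    q = query_flow.reshape(-1)                         # shape (T*D,)
    d = dataset_flows.reshape(len(dataset_flows), -1)  # shape (N, T*D)
    dists = np.linalg.norm(d - q, axis=1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(2)
dataset_flows = rng.normal(size=(40, 2, 16))
query_flow = dataset_flows[11] + 0.01 * rng.normal(size=(2, 16))
top_flows = rank_flows(query_flow, dataset_flows, k=3)
```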

(a) Autocomplete Query Results
(b) Flow Query Results
Figure 7: Query Results for (a) incomplete sketches and (b) flow queries. Swire is able to retrieve interfaces based only on parts specified by users’ sketches while remaining agnostic to other parts of the UIs. Swire is also able to retrieve user flows by querying multiple UIs in sequences concurrently.

6 Developing Scones: a Conversational Sketching System

Beyond comprehending and understanding the correspondences between UIs and sketches, in this section we explore the possibility of sketch generation through another more abstract input modality: natural language. In particular, we develop a system that supports iterative generation of sketches given users’ text instructions to support refinement and critique. The goal is to mimic a user trying to verbally convey a visual idea to an expert sketcher.

While sketching is a powerful communication medium, creating sketches that effectively communicate ideas visually requires significant training. The use of sketches in an iterative design and/or artistic process, where the sketch itself is annotated or refined, requires additional, specialized expertise. Inspired by recent development of deep-learning-driven QA systems and generative sketching models, we develop Scones, a deep-learning-based sketching system that can progressively construct a sketched scene based on multiple natural language instructions across multiple turns, an interaction analogous to an iterative sketch/critique process. This system must unify knowledge of the low-level mechanics for generating sketch strokes and natural language modification instructions with a high-level understanding of composition and object relationships in scenes.

We formulate the novel task of iteratively generating and refining sketches with text instructions and present a web-deployable implementation of Scones. Scones contains a scene composition proposer that takes a novel approach in creating and editing scenes of objects using natural language. It adapts a recent neural network architecture and improves state-of-the-art performance on the scene modification task. We also introduce in Scones a novel method for specifying high-level scene semantics within individual object sketches by conditioning sketch generation with mask outlines of target sketches. Using Scones, we hope to enable users of all levels of sketch expertise to freely express their intent using abstract, text-based instructions, together with concrete visual media.

6.1 System Architecture

The creation of complex sketches often begins with semantic planning of scene objects. Sketchers often construct high-level scene layouts before filling in low-level details. Modeling ML systems after this high-to-low-level workflow has been shown to be beneficial for transfer learning from other visual domains and for supporting interactive interfaces for human users (Huang and Canny, 2019). Inspired by this high-to-low-level process, Scones adopts a hierarchical workflow that first proposes a scene-level composition layout of objects using its Composition Proposer, then generates individual object sketches, conditioned on the scene-level information, using its Object Generators (Figure 8).

Figure 8: Overall Architecture of Scones. Scones takes a two-stage approach towards generating and modifying sketched scenes based on users’ instructions.

Composition Proposer

The Composition Proposer in Scones uses text instructions to place and configure objects in the scene. It also considers recent past iterations of text instructions and scene context at each conversation turn. As text instructions and sketch components occur sequentially in time, each with a variable length of tokens and objects, respectively, we formulate composition proposal as a sequence modeling task. We use a self-attention-only decoder component of the Transformer (Vaswani et al., 2017), a recent deep-learning model architecture with high performance for this task.

To produce the output scene $S_i$ at turn $i$, the Composition Proposer takes inputs of $n$ previous scenes and text instructions as recent context of the conversation. Each output scene $S_i$ contains $l_i$ objects $o_{(i,1)}, \dots, o_{(i,l_i)}$ and special tokens $o_s$ marking the beginning and $o_e$ marking the end of the scene. Each text instruction at turn $i$ contains $m_i$ text tokens $t_{(i,1)}, \dots, t_{(i,m_i)}$ that consist of words and punctuation marks.

We represent each object $o$ as a 102-dimensional vector. The first two dimensions are Boolean attributes reserved for the start and end of the scene object sequences. The last two dimensions represent the x and y position of the center of the object. In between are four one-hot vectors (a one-hot vector is an encoding of class information as an array of bits where only the position of the encoded class is 1 and all other bits are 0): a 58-dimensional one-hot vector representing one of 58 classes of the scene objects; a 35-dimensional one-hot vector representing one of 35 sub-types (minor variants) of the scene object; a three-dimensional one-hot vector representing one of three sizes of the scene object; and a two-dimensional one-hot vector representing the horizontal orientation, that is, whether the object is flipped in the x-direction. This representation is very similar to that of the CoDraw dataset the model was trained on, which is described in detail in Section 6.2. For each text token $t$, we use a 300-dimensional GloVe vector trained on 42B tokens from the Common Crawl dataset (Pennington et al., 2014) to semantically represent these words in the instructions.
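The layout of the 102-dimensional object vector can be sketched as below. The field sizes come from the text (2 + 58 + 35 + 3 + 2 + 2 = 102); the ordering of the four one-hot fields, the use of positions normalized to [0, 1], and all function names are assumptions made for illustration.

```python
import numpy as np

NUM_CLASSES, NUM_SUBTYPES, NUM_SIZES, NUM_FLIPS = 58, 35, 3, 2

def one_hot(index, size):
    """Array of zeros with a single 1 at the given index."""
    v = np.zeros(size, dtype=np.float32)
    v[index] = 1.0
    return v

def encode_object(cls, subtype, size, flipped, x, y):
    """Concatenate all attributes into one 102-dimensional vector."""
    return np.concatenate([
        np.zeros(2, dtype=np.float32),      # start / end flags (unset)
        one_hot(cls, NUM_CLASSES),          # object class
        one_hot(subtype, NUM_SUBTYPES),     # sub-type (minor variant)
        one_hot(size, NUM_SIZES),           # one of three sizes
        one_hot(int(flipped), NUM_FLIPS),   # horizontal orientation
        np.array([x, y], dtype=np.float32), # center position (assumed in [0, 1])
    ])

# Special tokens only set one of the two leading Boolean flags.
start_token = np.zeros(102, dtype=np.float32)
start_token[0] = 1.0
end_token = np.zeros(102, dtype=np.float32)
end_token[1] = 1.0

obj = encode_object(cls=10, subtype=0, size=1, flipped=False, x=0.25, y=0.75)
```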

To train the Transformer network with the heterogeneous inputs of scene objects $o$ and text tokens $t$ across the two modalities, we create a unified representation of cardinality $102 + 300 = 402$ and adapt $o$ and $t$ to this representation by simply padding the additional dimensions with zeros, yielding $o'$ and $t'$:

$$o' = [o, \mathbf{0}_{300}], \qquad t' = [\mathbf{0}_{102}, t]$$
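The zero-padding into a shared 402-dimensional space can be sketched as follows. Placing the object dimensions first and the text dimensions second is an assumption; only the total size (102 + 300 = 402) is given in the text.

```python
import numpy as np

OBJ_DIM, TEXT_DIM = 102, 300

def pad_object(o):
    """Object vector followed by zeros in the text dimensions."""
    return np.concatenate([o, np.zeros(TEXT_DIM, dtype=o.dtype)])

def pad_text(t):
    """Zeros in the object dimensions followed by the word vector."""
    return np.concatenate([np.zeros(OBJ_DIM, dtype=t.dtype), t])

rng = np.random.default_rng(3)
o_padded = pad_object(rng.random(OBJ_DIM))
t_padded = pad_text(rng.normal(size=TEXT_DIM))
```

Because the two modalities occupy disjoint dimensions, a padded object and a padded text token never overlap, which lets a single input embedding layer treat them as one token type.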



We interleave text instructions and scene objects chronologically to form a long sequence as input to the model for generating an output scene representation $S_i$. We expand the sequential elements within each scene and each text instruction, and add separators to them to obtain the full input sequence to a single Transformer Decoder. To adapt the Transformer model to our multi-modal inputs $t'$ and $o'$ and produce new scene objects $o$, we employ a 402-dimensional input embedding layer and a 102-dimensional output embedding layer in the Transformer model. The outputs from the network are then passed to sigmoid and softmax activations for object position and other properties respectively. We show this generation process in the following equation and in Figure 9.

$$S_i = [o_{(i,1)}, \dots, o_{(i,l_i)}] = \mathrm{Transformer}([o'_s, o'_{(i-n,1)}, \dots, o'_{(i-n,l_{i-n})}, o'_e, t'_{(i-n,1)}, \dots, t'_{(i-n,m_{i-n})}, \dots, t'_{(i,1)}, \dots, t'_{(i,m_i)}, o'_s])$$
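The output activations can be illustrated with the sketch below, which splits a 102-dimensional output into attribute groups, applies a softmax within each one-hot group and a sigmoid to the position. The slice boundaries assume the fields are laid out in the order listed in the text, and the two leading start/end flags are left undecoded here; both choices are assumptions rather than the authors' implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_output(logits):
    """Turn one 102-dimensional network output into object attributes."""
    assert logits.shape == (102,)
    return {
        "cls": int(np.argmax(softmax(logits[2:60]))),       # 58 classes
        "subtype": int(np.argmax(softmax(logits[60:95]))),  # 35 sub-types
        "size": int(np.argmax(softmax(logits[95:98]))),     # 3 sizes
        "flip": int(np.argmax(softmax(logits[98:100]))),    # 2 orientations
        "xy": sigmoid(logits[100:102]),                     # position in (0, 1)
    }

example_logits = np.zeros(102)
example_logits[[2 + 5, 60 + 3, 95 + 2, 98 + 1]] = 10.0
decoded = decode_output(example_logits)
```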

Figure 9: The Scene Layout Generation Process using the Transformer Model of the Composition Proposer.

Object Generators

Since the outputs of the Composition Proposer are scene layouts consisting of high-level object specifications, we generate the final raw sketch strokes for each of these objects based on their specifications with Object Generators. We adapt Sketch-RNN to generate sketches of individual object classes to present to users for evaluation and revision in the next conversation turn. Each sketched object consists of $h$ strokes $q_{1 \dots h}$. The strokes are encoded using the Stroke-5 format (Ha and Eck, 2018). Each stroke $q = (\Delta x, \Delta y, p_1, p_2, p_3)$ represents states of a pen performing the sketching process. The first two properties $\Delta x$ and $\Delta y$ are offsets from the previous point that the pen moved from. The last three elements are a one-hot vector representing the state of the pen after the current point (pen down, pen up, end of sketch, respectively). All sketches begin with the initial stroke $(0, 0, 1, 0, 0)$.
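As an illustration of the Stroke-5 format, the following sketch converts a drawing given as lists of absolute points into Stroke-5 rows. The input representation (one list of points per pen stroke, each with at least two points) and the function name are assumptions; the row layout follows the description above.

```python
import numpy as np

def to_stroke5(polylines):
    """Convert absolute-coordinate polylines into Stroke-5 rows.

    polylines: list of pen strokes, each a list of (x, y) points with at
    least two points. Rows are (dx, dy, pen_down, pen_up, end_of_sketch).
    """
    rows = [(0.0, 0.0, 1, 0, 0)]           # initial stroke at the first point
    prev = polylines[0][0]
    for s, line in enumerate(polylines):
        # The first point of the first stroke is covered by the initial row.
        points = line[1:] if s == 0 else line
        for j, (x, y) in enumerate(points):
            last = j == len(points) - 1    # pen lifts after a stroke's last point
            rows.append((x - prev[0], y - prev[1], int(not last), int(last), 0))
            prev = (x, y)
    rows.append((0.0, 0.0, 0, 0, 1))       # end-of-sketch marker
    return np.array(rows, dtype=np.float32)

stroke5 = to_stroke5([[(0, 0), (1, 0), (1, 1)], [(3, 3), (4, 3)]])
```

Summing the offsets of all rows recovers the absolute position of the final point relative to the first one, which is a quick sanity check for a conversion like this.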

Since Sketch-RNN does not constrain aspect ratios, directions and poses of its output sketches, we introduce two additional conditions for the sketch generation process: masks $m$ and aspect ratios $r$. These conditions ensure our Object Generators generate sketches with appearances that follow the object specifications generated by the Composition Proposer. For each object sketch, we compute the aspect ratio $r$ from the distance between the leftmost and rightmost stroke (the width) and the distance between the topmost and bottommost stroke (the height). To compute the object mask $m$, we first render the strokes into a pixel bitmap, then mark all pixels as 1 if they are in between the leftmost pixel $x^{\min}_y$ and rightmost pixel $x^{\max}_y$ that are passed through by any strokes for each row $y$, or if they are in between the bottommost pixel $y^{\min}_x$ and topmost pixel $y^{\max}_x$ that are passed through by any strokes for each column $x$ (Equation 2). As this mask-building algorithm only involves pixel computations, we can use the same method to build masks for clip art objects (used to train the Composition Proposer) to generate sketches with poses matching the Composition Proposer’s object representations.

$$m_{(x,y)} = \begin{cases} 1 & \text{if } x^{\min}_y \le x \le x^{\max}_y \text{ or } y^{\min}_x \le y \le y^{\max}_x \\ 0 & \text{otherwise} \end{cases}$$
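The mask-building rule can be sketched directly on a rendered bitmap, as below. The bitmap is assumed to be a 2D Boolean array that is True wherever a stroke passes; the direction of the aspect ratio (height over width here) and the function names are assumptions.

```python
import numpy as np

def build_mask(bitmap):
    """Fill every row and column between its outermost stroke pixels."""
    bitmap = np.asarray(bitmap, dtype=bool)
    mask = np.zeros_like(bitmap)
    for y in range(bitmap.shape[0]):          # per row: leftmost to rightmost
        xs = np.flatnonzero(bitmap[y])
        if xs.size:
            mask[y, xs[0]:xs[-1] + 1] = True
    for x in range(bitmap.shape[1]):          # per column: first to last
        ys = np.flatnonzero(bitmap[:, x])
        if ys.size:
            mask[ys[0]:ys[-1] + 1, x] = True
    return mask

def aspect_ratio(bitmap):
    """Height divided by width of the stroke bounding box (assumed direction)."""
    ys, xs = np.nonzero(bitmap)
    width = xs.max() - xs.min()
    height = ys.max() - ys.min()
    return height / width

# A hollow 5x5 square outline: the mask should fill its interior.
outline = np.zeros((5, 5), dtype=bool)
outline[0, :] = outline[-1, :] = True
outline[:, 0] = outline[:, -1] = True
square_mask = build_mask(outline)
square_ratio = aspect_ratio(outline)
```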


We adapt the Variational-Autoencoder(VAE)-based conditional variant of Sketch-RNN to enable generating and editing of sketch objects. Our adapted conditional Sketch-RNN encodes input sketches with a Bi-directional LSTM to a latent vector $z$. The Hyper-LSTM decoder then recreates sketch strokes $q'_{1 \dots h}$ from $z$, and the $m$ and $r$ described above, during training, as defined in the equations below and shown in Figure 10. Since the latent space is also trained to match a multivariate Gaussian distribution, the Object Generators can support sketch generation when the objects are first added to the scene by randomly sampling $z \sim N(0, 1)^{128}$.

$$q'_{1 \dots h} = \text{Sketch-RNN Decoder}([m, r, z]), \quad z \sim N(0, 1)^{128}$$

$$z = \text{Sketch-RNN Encoder}(q_{1 \dots h})$$

Figure 10: Sketch-RNN Model Architecture of the Object Generators.

As $m$ is a two-dimensional mask, we encode $m$ using a small convolutional neural network into a flattened embedding to be concatenated with $z$, $r$ and the previous stroke as inputs to the decoder. The decoder then outputs parameters for a Gaussian Mixture Model (GMM) which will be sampled to obtain the stroke offsets $\Delta x$ and $\Delta y$. It also outputs probabilities for a categorical distribution that will be sampled to obtain the pen states $p_1$, $p_2$ and $p_3$. This generation process and the architecture of the model are illustrated in Figure 10, and are described in the Sketch-RNN paper (Ha and Eck, 2018).
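One decoding step of this sampling scheme can be sketched as follows, assuming the decoder has already produced mixture weights, means, standard deviations and correlations for a mixture of bivariate Gaussians, plus three pen-state probabilities. The parameter names and the two-component example are illustrative assumptions.

```python
import numpy as np

def sample_step(pi, mu, sigma, rho, pen_probs, rng):
    """Sample one Stroke-5 row from GMM and categorical parameters.

    pi: (M,) mixture weights; mu: (M, 2) means; sigma: (M, 2) std devs;
    rho: (M,) correlations; pen_probs: (3,) pen-state probabilities.
    """
    k = rng.choice(len(pi), p=pi)               # pick a mixture component
    sx, sy = sigma[k]
    cov = np.array([[sx * sx, rho[k] * sx * sy],
                    [rho[k] * sx * sy, sy * sy]])
    dx, dy = rng.multivariate_normal(mu[k], cov)
    state = np.zeros(3)
    state[rng.choice(3, p=pen_probs)] = 1.0     # one-hot pen state
    return np.array([dx, dy, *state])

rng = np.random.default_rng(4)
step = sample_step(
    pi=np.array([1.0, 0.0]),
    mu=np.array([[1.0, 2.0], [0.0, 0.0]]),
    sigma=np.array([[1e-6, 1e-6], [1.0, 1.0]]),
    rho=np.array([0.0, 0.0]),
    pen_probs=np.array([0.0, 0.0, 1.0]),
    rng=rng,
)
```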

6.2 Datasets and Model Training

As Scones uses two components to generate scenes of sketched objects, it is trained on two datasets that correspond to the tasks these components perform.

CoDraw Dataset

We used the CoDraw dataset (Kim et al., 2019) to train the Composition Proposer to generate high-level scene layout proposals from text instructions. The task used to collect this data involves two human users taking on the roles of Drawer and Teller in each session. First, the Teller is presented with an abstract scene containing multiple clip art objects in certain configurations, and the Drawer is given a blank canvas. The Teller provides instructions using only text in a chat interface to instruct the Drawer on how to modify clip art objects in the scene. The Teller has no access to the Drawer’s canvas in most conversation turns, except in one of the turns when they can decide to ‘peek’ at the Drawer’s canvas. The dataset consists of 9993 sessions of conversation records, scene modifications, and ground-truth scenes.

Using this dataset, we trained the Composition Proposer to respond to users’ instructions given past instructions and scenes. We used the same training/validation/test split as the original dataset. Our model is trained to optimize a loss function that combines weighted terms corresponding to various attributes of the scene objects in the training set:

$$L = \lambda_c L_c + \lambda_f L_f + \lambda_s L_s + \lambda_{xy} L_{xy}$$

$L_c$ is the cross-entropy loss between the one-hot vector of the true class label and the predicted output probabilities by the model. Similarly, $L_f$ and $L_s$ are cross-entropy losses for the horizontal orientation and size of the object. $L_{xy}$ is the Euclidean distance between the predicted position and the true position of the scene object. We trained the model using an Adam optimizer for 200 epochs. The learning rate and the loss weights $\lambda_c$, $\lambda_f$, $\lambda_s$ and $\lambda_{xy}$ are hyper-parameters that were tuned based on empirical experiments on the validation split of the dataset.
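A minimal sketch of a loss built from the terms described above (cross-entropy for class, orientation and size, plus Euclidean distance for position) is given below. The weighted-sum form, the dictionary layout and the weight values of 1.0 are assumptions for illustration only.

```python
import numpy as np

def cross_entropy(probs, true_index, eps=1e-9):
    """Cross-entropy between predicted probabilities and a one-hot label."""
    return -np.log(probs[true_index] + eps)

def composition_loss(pred, target, weights):
    """Weighted sum of the per-attribute losses for one scene object."""
    return (
        weights["class"] * cross_entropy(pred["class"], target["class"])
        + weights["flip"] * cross_entropy(pred["flip"], target["flip"])
        + weights["size"] * cross_entropy(pred["size"], target["size"])
        + weights["xy"] * np.linalg.norm(pred["xy"] - target["xy"])
    )

weights = {"class": 1.0, "flip": 1.0, "size": 1.0, "xy": 1.0}  # hypothetical
target = {"class": 1, "flip": 0, "size": 2, "xy": np.array([0.5, 0.5])}
perfect = {"class": np.array([0.0, 1.0, 0.0]), "flip": np.array([1.0, 0.0]),
           "size": np.array([0.0, 0.0, 1.0]), "xy": np.array([0.5, 0.5])}
wrong = {"class": np.array([0.8, 0.1, 0.1]), "flip": np.array([0.5, 0.5]),
         "size": np.array([0.6, 0.3, 0.1]), "xy": np.array([0.1, 0.9])}
loss_perfect = composition_loss(perfect, target, weights)
loss_wrong = composition_loss(wrong, target, weights)
```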

Quick, Draw! Dataset

The Quick, Draw! dataset consists of sketch strokes of 345 concept categories created by human users in a game in 20 seconds (Jongejan et al., 2016). We trained our 34 Object Generators on 34 categories of Quick, Draw! data to create sketches of individual objects.

Each sketch stroke in Quick, Draw! was first converted to the Stroke-5 format. The $\Delta x$ and $\Delta y$ values of the sketch strokes were normalized with their standard deviations for all sketches in their respective categories. Each category consists of 75000/2500/2500 sketches in the training/validation/test set.
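The per-category normalization can be sketched as below. It assumes each sketch is an array of Stroke-5 rows and that the horizontal and vertical offsets are each divided by their own standard deviation computed over the whole category; whether a single shared scale or per-axis scales were used is not stated, so the per-axis choice is an assumption.

```python
import numpy as np

def normalize_category(sketches):
    """Scale the offset columns of every sketch by category-wide std devs."""
    offsets = np.concatenate([s[:, :2] for s in sketches], axis=0)
    std = offsets.std(axis=0)              # one value for dx, one for dy
    normalized = []
    for s in sketches:
        s = s.astype(np.float32).copy()    # leave the input untouched
        s[:, :2] /= std                    # pen-state columns are unchanged
        normalized.append(s)
    return normalized, std

rng = np.random.default_rng(5)
category = [rng.normal(scale=30.0, size=(10, 5)),
            rng.normal(scale=30.0, size=(20, 5))]
normalized, std = normalize_category(category)
all_offsets = np.concatenate([s[:, :2] for s in normalized], axis=0)
```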

The loss function of the conditional Sketch-RNN consists of the reconstruction loss $L_R$ and the KL loss $L_{KL}$, weighted by $w_{KL}$:

$$L = L_R + w_{KL} L_{KL}$$

The KL loss $L_{KL}$ is the Kullback-Leibler divergence between the encoded $z$ from the encoder and $N(0, 1)^{128}$. The reconstruction loss $L_R$ is the negative log-likelihood of the ground-truth strokes under the GMM and a categorical distribution parametrized by the model. We refer interested readers to a detailed description of $L_R$ in the original Sketch-RNN paper (Ha and Eck, 2018). The learning rate was exponentially decayed from its initial value during training, while $w_{KL}$ was exponentially increased from its initial value (for some object categories, we found that increasing the KL weight to 1.0 improves the authors’ perceived quality of generated sketches). The models were also trained with gradient clipping.
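The structure of this loss and its schedules can be sketched as follows. The closed-form KL term for a diagonal Gaussian against a standard normal is standard; the schedule form and every numeric value in the example (start, end and rate) are hypothetical placeholders, not the values used by the authors.

```python
import numpy as np

def kl_loss(mu, log_var):
    """KL divergence between N(mu, exp(log_var)) and N(0, 1), averaged."""
    return -0.5 * np.mean(1.0 + log_var - mu ** 2 - np.exp(log_var))

def annealed(step, start, end, rate):
    """Move exponentially from `start` toward `end` as `step` grows."""
    return end - (end - start) * rate ** step

def total_loss(recon_loss, mu, log_var, kl_weight):
    """Reconstruction loss plus the weighted KL term."""
    return recon_loss + kl_weight * kl_loss(mu, log_var)

# Hypothetical schedules: a growing KL weight and a decaying learning rate.
kl_weight_start = annealed(0, start=0.01, end=0.5, rate=0.9999)
kl_weight_late = annealed(200_000, start=0.01, end=0.5, rate=0.9999)
lr_late = annealed(200_000, start=1e-3, end=1e-5, rate=0.9999)
zero_kl = kl_loss(np.zeros(128), np.zeros(128))
example_total = total_loss(2.0, np.ones(128), np.zeros(128), kl_weight=0.5)
```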


6.3 Results

To compare the effectiveness of Scones at generating scene sketches with existing models and human-level performance, we quantitatively evaluated its performance in an iterative scene authoring task. Moreover, as Scones uses generative models to produce object sketches, we qualitatively evaluated a large number of examples generated by the two components of Scones.

Composition Modification State-of-the-art

To evaluate the output of the Composition Proposer against the models introduced with the CoDraw dataset, we adapted its output to match that expected by the well-defined evaluation metrics proposed by the original CoDraw paper (Kim et al., 2019). The original task described in the CoDraw paper involves only proposing and modifying high-level object representations in scenes agnostic to their appearance. The performance of a “Drawer” (a human or machine which generates scene compositions) can be quantified by a similarity metric constrained between 0 and 5 (higher is more similar) by comparing properties of and relations between objects in the generated scene and objects in the ground truth from the dataset.

Running our Composition Proposer on the CoDraw test set, we achieved an average similarity metric of 3.55. This exceeded existing state-of-the-art performance (Table 2) on the iterative scene authoring task using replayed text instructions (script) from CoDraw.

Teller    Drawer                                   Similarity (out of 5)
Script    Scones                                   3.55
Script    Neural Network (Kim et al., 2019)        3.39
Script    Nearest-Neighbour (Kim et al., 2019)     0.94
Script    Human                                    3.83
Table 2: Test Set Performance of Various Models on the CoDraw Task

To provide an illustrative example of our Composition Proposer’s output on this task, we visualize two example scenes generated from the CoDraw validation set in Figure 11. In scene a), the Composition Proposer extracted the class (slide), direction (faces right), and position relative to parts of the object (ladder along left edge) from the text instruction, to place a slide in the scene. Similarly, it was able to place the bear in between the oak and pine trees in scene b), with the bear touching the left edge of the pine tree. It is important to note the Composition Proposer completely regenerates the entire scene at each conversation turn. This means it correctly preserved object attributes from previous scenes while making the requested modifications from the current turn. In these instances, the sun in scene a) and the trees in scene b) were left mostly unchanged while other attributes of the scenes were modified.

Figure 11: Example Scenes for the Scene Layout Modification Task. The Composition Proposer was able to improve state-of-the-art performance for modifying object representations in scene compositions.

Sketches with Clip Art Objects as Mask and Ratio Guidance

The Object Generators are designed to generate sketches which respect high-level scene layout information under the guidance of the mask and aspect ratio conditions. To inform generated object sketches with pose suggestions from scene composition layouts, we built outline masks from clip art objects and computed aspect ratios using the same method used for training sketches, described in Section 6.1. We demonstrate the Object Generators’ performance in two important scenarios that allow Scones to adapt to specific subclass and pose contexts.

Generating objects for closely related classes

While the Composition Proposer classifies objects as one distinct class out of 58, some of these classes are closely related and are not differentiated by the Object Generators. In these cases, object masks can be used by an Object Generator to effectively disambiguate the desired output subclass. For instance, the Composition Proposer generates trees as one of three classes: Oak tree (tall and with curly edges), Apple tree (round and short), and Pine tree (tall and pointy); while there is only a single Object Generator trained on a general class of all types of tree objects. We generated three different masks and aspect ratios based on three clip art images and used them as inputs to a single tree-based Object Generator to generate appropriate tree objects (by sampling $z \sim N(0, 1)^{128}$). The Object Generator was able to sketch trees with configurations corresponding to input masks from clip art objects (Figure 12). The generated sketches for pine trees were pointy; for apple trees, had round leaves; and for oak trees, had curvy edges.

Figure 12: Sketch Generation Results of Trees Conditioned on Masks. The Object Generator was able to sketch trees of three different classes based on mask and aspect ratio inputs.

Generating objects with direction-specific poses

The Composition Proposer can specify the horizontal orientation of the objects (pointing left or right). As such, the Object Generators are required to sketch horizontally asymmetric objects (e.g., racquets, airplanes) with specific poses to follow users’ instructions. We show the ability of an Object Generator to produce racquets at various orientations in Figure 13. The generated racquet sketches conformed to the orientation of the mask, facing the specified direction at similar angles.

Figure 13: Sketch Generation Results of Racquets Conditioned on Masks. The Object Generator was able to sketch racquets at two orientations consistent with the masks.

Complete Sessions with Composition Layouts and Sketches

We show the usage of Scones in six turns of conversation from multiple sessions in Figure 14. We curated these sessions by interacting with the system ourselves to demonstrate various capabilities of Scones. In session a), Scones was able to draw and move the duck to the left, sketch a cloud in the middle, and place and enlarge the tree on the right, following instructions issued by the user. In session b), Scones was similarly able to place and move a cat, a tree, a basketball and an airplane, but at different positions from session a). For instance, the tree was placed on the left as opposed to the right, and the basketball was moved to the bottom. We also show the ability of Scones to flip objects horizontally in session b), such that the plane was flipped horizontally and regenerated given the instruction “flip the plane to point to the right instead”. This flipping action demonstrates the Object Generator’s ability to generate objects with the required poses by only sharing the latent vectors $z$, such that the flipped airplane exhibits similar characteristics to the original airplane. In both sessions, Scones was able to correlate multiple scene objects, such as placing the owl on the tree in session a), and the basketball under the tree in session b).

Figure 14: Complete Sketching Sessions with Scones curated by the authors.

Interpreting Transformer’s Attention Maps

We can further verify the relationships between text and object representations learned by the model by visualizing attention weights computed by the Transformer model of the Composition Proposer. These weights also create the unique possibility of generalizing and prompting for sketches of new objects specified by users.

The Transformer model in the Composition Proposer uses masked self-attention to attend to scene objects and instructions from previous time steps most relevant to generating the object specification at the current time step. We explore the attention weights of the first two turns of a conversation from the CoDraw validation set. In the first turn, the user instructed the system, “top left is an airplane medium size pointing left”. When the model generated the first object, it attended to the “airplane” and “medium” text tokens to select class and output size (Figure 15). In the second turn, the user instructed the model to place a slide facing right under the airplane. The model similarly attended to the “slide” token the most; it also significantly attended to the “under” and “plane” text tokens, and the airplane object. These objects and tokens are important for situating the slide object at the desired location relative to the existing airplane object (Figure 16).

Figure 15: Attention Map of the Transformer across Object and Text Tokens for the Generation of an Airplane, the First Object in the Scene.
Figure 16: Attention Map of the Transformer across Object and Text Tokens for the Generation of Slide in the Second Turn of Conversation. We observed that the Transformer model attended to the corresponding words and objects related to the newly generated ‘slide’ object.

These attention weights could potentially be used to handle unknown scene objects encountered in instructions. When the model does not output any scene objects, but only an $o_e$ (scene end) token, we can inspect the attention weights for generating this token to identify a potentially unknown object class, and ask the user for clarification. For example, when a user requests an unsupported class, such as a ‘sandwich’ or ‘parrot’ (Figure 17), Scones could identify this unknown object by taking the text token with the highest attention weight, and prompting the user to sketch it by name.
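The idea of picking the most attended text token can be sketched in a few lines. This assumes the attention weights for the scene-end token have already been reduced to one weight per text token (how heads and layers would be aggregated is not specified in the text); the example instruction and weights are made up.

```python
import numpy as np

def most_attended_token(tokens, attention):
    """Return the text token that received the highest attention weight."""
    return tokens[int(np.argmax(attention))]

# Hypothetical instruction and per-token attention weights.
tokens = "add a parrot on the tree".split()
attention = np.array([0.05, 0.05, 0.60, 0.05, 0.05, 0.20])
unknown = most_attended_token(tokens, attention)
prompt = f"I don't know how to draw '{unknown}'. Could you sketch it for me?"
```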

Figure 17: Attention Map of the Transformer for Text Instructions that Specify Unseen Objects.

6.4 Exploratory User Evaluation

To determine how effectively Scones can assist users in creating sketches from natural language, we conducted an exploratory evaluation of Scones. We recruited 50 participants from English-speaking countries on Amazon Mechanical Turk (AMT) for our study. We collected quantitative and qualitative results from user trials with Scones, as well as suggestions for improving Scones. Participants were given a maximum of 20 minutes to complete the study and were compensated $3.00 USD. Participants were only allowed to complete the task once.


The participants were asked to recreate one of five randomly chosen target scene sketches by providing text instructions to Scones in the chat window. Each target scene had between four and five target objects from a set of 17 possible scene objects. Participants were informed that the final result did not have to be pixel perfect to the target scene, and to mark the sketch as complete once they were happy with the result. Instructions supplied in the chat window were limited to 500 characters, and submitting an instruction was considered as taking a “turn”. The participants were only given the sketch strokes of the target scene without class labels, to elicit natural instructions.

Figure 18: Screenshot of Scones’s Evaluation User Interface.

Participants were first shown a short tutorial describing the canvas, chat interface, and target scene in the Scones interface (Figure 18), and were asked to give simple instructions in the chat window to recreate the target scene. Only two sample instructions were given in the tutorial: “add a tree”, and “add a cat next to the table”. At each turn, participants were given the option to redraw objects which remained in the scene for over three turns using a paintbrush-based interface. After completing the sketch, participants filled out an exit survey with Likert-scale questions on their satisfaction with the sketch and enjoyment of the system, and open-ended feedback on the system.


Participants Satisfied with Sketches, Enjoyment Was Bimodal

Participants were generally satisfied with their final sketches, and enjoyed the task. In open-ended feedback, participants praised Scones’s ability to parse their instructions: “it was able to similarly recreate the image with commands that I typed” (P25); “I liked that it would draw what I said. it was simple and fun to use” (P40). Some participants even felt Scones was able to intuitively understand their instructions. P15 remarked, “I thought it was cool how quickly and intuitively it responded,” while P35 said, “It had an intuitive sense of what to draw, and I did not feel constrained in the language I used”.

While enjoyment was high on average, we found responses to enjoyment followed a bimodal distribution (Figure 19). By reviewing qualitative feedback and instructions to Scones, we observe that many instances of low enjoyment come from class confusion in target scene sketches. Some participants confused the tent in a target scene as a “pyramid” in their instructions, which Scones does not support: “There is a pyramid on the left side a little ways up from the bottom” (P44). P49 tried five times to add a “pyramid” to the scene.

Figure 19: Survey Results from User Sessions with Scones.

P17, who strongly disagreed with enjoying the task (1/5), faced repeated class confusion issues, mentioning, “it was very frustrating that it wouldn’t draw the circle by the cloud …It wouldn’t draw anything besides the plane, cloud, tent, and fire. Was that not a person up by the cloud?” Scones does not support “circle” or “person” classes—the target sketch had the sun next to the cloud. When Scones is asked to draw an unsupported object, the canvas will be left unchanged. Providing participants with an explicit list of classes in the target image or adding error messages could mitigate these frustrations. Furthermore, attention-based methods mentioned in Section 6.3 could be used when an unrecognized class is detected to prompt users to provide sketch strokes with corresponding labels.

Participants Communicate with Scones at Varying Concept Abstraction Levels

On average, participants completed the sketching task in under 8 turns, with a varied number of tokens (words in instructions) per turn. Several participants only asked for the objects themselves (turns delimited by commas): “helicopter, cloud, swing, add basketball” (P25). Other participants made highly detailed requests: “There is a sun in the top left, There is an airplane flying to the right in the top right corner, There is a cat standing on it’s hind legs in the bottom right corner, Move the cat a little to the right, please, …” (P14). Participants who gave instructions at the expected high-level detail produced satisfying results: “draw a tree in the middle, Draw a sun in the top left corner, A plane in the top right, A cat with a pizza under the tree” (P32). The recreation of this participant is shown on the top right of Figure 20.

Figure 20: Recreated Scenes during the User Study. Users combined Scones-generated outputs with their own sketch strokes to reproduce the target scenes presented to them.

The longest conversations were often from participants with mismatched expectations for Scones, who repeated commands: “Draw a cloud in the upper left corner with three round edges., Change the cloud to have 3 round edges., Draw only 3 round waves around the edge of the cloud., …Draw a snowman to the left of the table., …Draw a circle touching the middle circle., …” (P23). This trial reflects the need for Scones to set clearer expectations of input for users. P23’s 16-instruction session contains expectations for the system to modify low-level aspects of the sketches (changing the number of edges in the cloud), exhibits class confusion (snowman and circles with shovel), and has mismatched concept abstraction levels (drawing a shovel versus constructing a shovel from visual primitives, i.e., circles). A potentially simple mitigation for these hurdles would be to introduce more detailed tutorial content for a wider deployment of Scones.

Scones as a Tool for Collecting Iterative Sketching Data

The results of our study show significant potential for Scones to be used as a Game With a Purpose (GWAP) (von Ahn and Dabbish, 2008) to collect sketch critiques (natural-language modifications that bring an input sketch closer to a target sketch) and user-generated sketch strokes. Twenty-six participants redrew objects in their sketches when prompted, and those who did expressed their appreciation for this feature: “I liked that I could redraw the image” (P48); “I liked being able to draw parts myself because it was relaxing and I felt I was more accurate” (P11). Most participants who redrew objects also kept output from Scones in their final sketches, reflecting Scones’s potential as a mixed-initiative design tool. Redrawing was voluntary in our task, and these results suggest Scones may be useful for collecting user-generated sketches in addition to natural language critique in a GWAP. Further motivating this application, 14 participants described the task as “fun” in open-ended feedback, e.g., “This was a very fun task” (P23); “[I liked] Playing the game and describing the drawing. It was fun!” (P42).

Participants’ Feedback for Improving Scones

Participants offered suggestions for how they would improve Scones, providing avenues for future work.

Object Translations and Spatial Relationships

A major theme of dissatisfaction came from the limited ability of our system to respond to spatial relationships and translation-related instructions at times: “It does not appear to understand spatial relationships that well” (P35); “you are not able to use directional commands very easily” (P11). These situations largely originate from the CoDraw dataset (Kim et al., 2019), in which users had a restricted view of the canvas, resulting in limited relative spatial instructions. This limitation is discussed further in Section 7.1.

To improve the usability of Scones, participants suggested that its interface could benefit from direct manipulation features, such as selecting and manually transforming objects in the scene: “I think that I would maybe change how different items are selected in order to change of modify an object in the picture.” (P33); “maybe there should be a move function, where we keep the drawing the same but move it” (P40). Moreover, some participants recommended adding an undo feature, “Maybe a separate button to get back” (P31), or the ability to manually invoke Scones to redraw an object, “I’d like a way to ask the computer to redraw a specific object” (P3). These features could help participants express corrective feedback to Scones, potentially creating sketches that better match their intent.

More Communicative Output

Some participants expected Scones to provide natural language output and feedback to their instructions. Some participants asked questions directly to elicit Scones’s capabilities: “In the foreground is a table, with a salad bowl and a jug of what may be lemonade. In the upper-left is a roughly-sketched sun. Drifting down from the top-center is a box, tethered to a parachute., Did you need me to feed you smaller sentences? …” (P38). P23 explicitly suggested users should be able to ask Scones questions to refine their intentions: “I would like the system to ask more questions if it does not understand or if I asked for several revisions. I feel that could help narrow down what I am asking to be drawn”. Other participants used praise between their sketching instructions, which could be used as a cue to preserve the sketch output and guide further iteration: “…Draw an airplane, Good try, Draw a table …” (P1); “Draw a sun in the upper left corner, The sun looks good! Can you draw a hot air balloon in the middle of the page, near the top? …” (P15). Providing additional natural language output and prompts from Scones could enable users to refine Scones’s understanding of their intent and understand the system’s capabilities. A truly conversational interface with a sketching support tool could pave the way for advanced mixed-initiative collaborative design tools.

7 Limitations and Future Research Opportunities

Through the development of a sketch dataset and two sketch-based creativity support systems, we identified some limitations of our work to date and opportunities for future work in this area.

7.1 Dataset Scale and Match

One crucial area for improvement for sketch-based applications is the scale and quality of the datasets. Although the dataset we described in Section 4 can be used to train the sketch-based UI retrieval model described in Section 5, we observed that the performance of the system has been limited by the diversity and complexity of sketches and UIs in the dataset. This is demonstrated by two major failure modes of Swire: it struggles to handle rare, custom UI elements, as exhibited by Example a in Figure 21, and it fails to understand UIs with diverse colors, such as those with image backgrounds in Example b in Figure 21. We believe that with an increased amount of training data, Swire can better generalize to more complex and colorful UIs.

Figure 21: Failure Modes of UI Retrieval using Swire. Swire failed to understand a) custom and b) colorful UI elements.

Similarly, we observe that Scones (Section 6) has been constrained by the differences between the task protocol used to collect the CoDraw dataset (the dataset Scones was trained on) and the user interactions in Scones. Each conversation in CoDraw offers the Teller only one chance to ‘peek’ at the Drawer’s canvas, which significantly decreases the number of modifications made to existing scene objects. As a result, Scones performs well at adding objects of the correct classes at appropriate locations and sizes, but is less capable of modifying or removing objects. Moreover, the current dataset is not end-to-end, so only a small, fixed number of sketched object styles is available, which reduces Scones’s ability to handle stylistic instructions. The ideal data for training this system would directly record iterative master-apprentice interactions in creating, critiquing, modifying, and removing highly variable sketched scene objects at a large scale. However, such datasets are considered difficult to collect because of the high sketching skill required of crowdworkers (Zou et al., 2018): even single-turn, multi-object sketched scenes are difficult for crowdworkers to create.

We believe a significant research direction is to enable the collection of legitimate sketching data at a large scale, especially in domains that require prior expertise, such as UI and mechanical design. This can be achieved either by lowering the skill barrier or by using tools like Scones and Swire to support parts of joint tasks. For example, in the case of Scones, we can lower the skill barrier by decomposing each scene into object components, allowing crowdworkers to sketch only a single object at a time in context. Alternatively, future research could explore other means of data collection beyond the crowd. Tools that are used for joint tasks provide natural incentives for designers during realistic design use-cases, and can allow live collection of realistic design data from professional users. Researchers can also investigate sourcing sketches from students of sketching courses at institutions offering design education, who would possess higher sketching expertise.

7.2 Integration with Applications in Real Usage Scenarios

Another significant area of research is to create systems that tightly integrate with design and artistic applications in realistic use-cases. While many current research projects (including those described in this book chapter) demonstrate the capability and potential of deep neural networks in supporting design applications, these applications are currently rough prototypes that are not yet suitable for everyday use by designers and artists. Further developing these applications and exploring how they integrate into design and artistic processes will reveal important usability issues and inform future design and implementation choices of similar tools. For instance, to successfully support UI design processes with Swire, we need to carefully consider the visual representation of UI examples in the application and the underlying datasets to be queried.

Moreover, some of the capabilities of these tools can be best demonstrated when applied to professional domains. For instance, Scones could participate in the UI/UX design process by iteratively suggesting multiple possible modifications of UI design sketches according to design critique. To enable this interaction, we could consider complete UI sketches as ‘scenes’ and UI components as ‘scene objects’. Scones could be trained on this data along with text critiques of UI designs to iteratively generate and modify UI mockups from text. To allow Scones to generate multiple design candidates, we can modify the current architecture to model probabilistic outputs for both sketch strokes (which are already probabilistic) and scene object locations. While datasets of UI layouts and components, such as those presented in Section 4, suggest this as a near possibility, this approach may generalize to other domains as well, such as industrial design. Nevertheless, this requires significant data support, which depends on solving the issues mentioned in Section 7.1.
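To make the idea of probabilistic scene object locations concrete, the following sketch samples several candidate layouts for a UI ‘scene’ whose ‘scene objects’ are UI components. It is an illustration under stated assumptions and not part of Scones: the names `LocationMixture`, `SceneObject`, and `sample_candidates` are hypothetical, the choice of a 2D Gaussian mixture per object is an assumption, and the hand-written mixture parameters stand in for values that a trained model would predict.

```python
import random
from dataclasses import dataclass


@dataclass
class LocationMixture:
    """Hypothetical 2D Gaussian mixture over a normalized canvas position."""
    weights: list  # one mixture weight per component
    means: list    # (x, y) mean per component, in [0, 1]
    stds: list     # (sx, sy) standard deviation per component


@dataclass
class SceneObject:
    label: str  # UI component class, e.g. "button"
    x: float    # normalized horizontal position
    y: float    # normalized vertical position


def clamp01(value):
    """Keep a sampled coordinate on the canvas."""
    return min(1.0, max(0.0, value))


def sample_location(mixture, rng):
    """Pick a mixture component by weight, then sample x and y from it."""
    k = rng.choices(range(len(mixture.weights)), weights=mixture.weights)[0]
    (mx, my), (sx, sy) = mixture.means[k], mixture.stds[k]
    return clamp01(rng.gauss(mx, sx)), clamp01(rng.gauss(my, sy))


def sample_candidates(predicted, n, seed=0):
    """Draw n candidate layouts; each layout places every object once."""
    rng = random.Random(seed)
    return [
        [SceneObject(label, *sample_location(mix, rng))
         for label, mix in predicted.items()]
        for _ in range(n)
    ]


# Stand-in for model output: a button that may sit at the bottom center
# or at the top right, and a text field near the middle of the screen.
predicted = {
    "button": LocationMixture([0.7, 0.3], [(0.5, 0.9), (0.9, 0.1)],
                              [(0.05, 0.03), (0.03, 0.03)]),
    "text field": LocationMixture([1.0], [(0.5, 0.5)], [(0.05, 0.10)]),
}

for i, layout in enumerate(sample_candidates(predicted, n=3)):
    print(i, [(o.label, round(o.x, 2), round(o.y, 2)) for o in layout])
```

Because each call draws from a distribution rather than returning a single most likely position, repeated sampling yields distinct layouts that a designer could compare, which is the behavior the paragraph above describes for generating multiple design candidates.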

8 Conclusion

This chapter presented three key aspects of developing deep-learning-driven, sketch-based creativity support tools. First, we collected the first large-scale dataset of sketches corresponding to UIs. Second, we developed a sketch-based UI retrieval technique that enables designers to interact with large-scale UI datasets using sketches. Third, we built a deep-learning-driven system that supports the novel interaction of generating scenes of sketched objects from text instructions. The dataset we collected supported our deep-learning-based tools, and we showed qualitatively and quantitatively that our systems can support targeted design and artistic applications. We further outlined several areas of future research opportunities and hope the documentation of our development experience and the release of our dataset can spur future research in this area.

Our ultimate goal in pursuing this line of research is to provide users of all levels of sketching expertise with relevant materials and computational resources, so that they can focus on the creative and innovative tasks in their creative processes. We also hope these projects can provide entirely new means for creative expression and rapid ideation. We are excited to continue designing for this future of design, art, and engineering.


  • E. Aksan, T. Deselaers, A. Tagliasacchi, and O. Hilliges (2020) CoSE: compositional stroke embeddings. Advances in Neural Information Processing Systems 33. Cited by: §3.3.
  • N. Bonnardel (1999) Creativity in design activities: the role of analogies in a constrained cognitive environment. In Proceedings of the 3rd Conference on Creativity & Cognition, C&C ’99, New York, NY, USA, pp. 158–165. External Links: ISBN 1-58113-078-3, Link, Document Cited by: §4.
  • B. Buxton (2007) Sketching user experiences: getting the design right and the right design. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. External Links: ISBN 0123740371, 9780123740373 Cited by: §2.2, §4, §5.6.
  • J. Canny (1986) A computational approach to edge detection. IEEE Transactions on pattern analysis and machine intelligence (6), pp. 679–698. Cited by: §5.5.
  • B. Deka, Z. Huang, C. Franzen, J. Hibschman, D. Afergan, Y. Li, J. Nichols, and R. Kumar (2017) Rico: a mobile app dataset for building data-driven design applications. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology, UIST ’17, New York, NY, USA, pp. 845–854. External Links: ISBN 978-1-4503-4981-9, Link, Document Cited by: §4.
  • B. Deka, Z. Huang, and R. Kumar (2016) ERICA: interaction mining mobile apps. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology, UIST ’16, New York, NY, USA, pp. 767–776. External Links: ISBN 978-1-4503-4189-9, Link, Document Cited by: §4, §4, §5.6.
  • S. P. Dow, A. Glassco, J. Kass, M. Schwarz, D. L. Schwartz, and S. R. Klemmer (2010) Parallel prototyping leads to better design results, more divergence, and increased self-efficacy. ACM Trans. Comput.-Hum. Interact. 17 (4), pp. 18:1–18:24. External Links: ISSN 1073-0516, Link, Document Cited by: §5.6.
  • M. Eitz, J. Hays, and M. Alexa (2012) How do humans sketch objects?. ACM Trans. Graph. (Proc. SIGGRAPH) 31 (4), pp. 44:1–44:10. Cited by: §3.1.
  • J. Fernquist, T. Grossman, and G. Fitzmaurice (2011) Sketch-sketch revolution: an engaging tutorial system for guided sketching and application learning. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, UIST ’11, New York, NY, USA, pp. 373–382. External Links: ISBN 978-1-4503-0716-1, Link, Document Cited by: §2.1.
  • C. Gao, Q. Liu, Q. Xu, L. Wang, J. Liu, and C. Zou (2020) SketchyCOCO: image generation from freehand scene sketches. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5174–5183. Cited by: §3.1.
  • P. Gervais, T. Deselaers, E. Aksan, and O. Hilliges (2020) The didi dataset: digital ink diagram data. arXiv preprint arXiv:2002.09303. Cited by: §3.1.
  • D. Ha and D. Eck (2018) A neural representation of sketch drawings. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, External Links: Link Cited by: §3.3, §6.1, §6.1, §6.2.
  • J. W. Hennessey, H. Liu, H. Winnemöller, M. Dontcheva, and N. J. Mitra (2017) How2Sketch: generating easy-to-follow tutorials for sketching 3d objects. Symposium on Interactive 3D Graphics and Games. Cited by: §2.1.
  • S. R. Herring, C. Chang, J. Krantzler, and B. P. Bailey (2009) Getting inspired!: understanding how and why examples are used in creative design practice. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’09, New York, NY, USA, pp. 87–96. External Links: ISBN 978-1-60558-246-7, Link, Document Cited by: §4, §5.6, §5.6.
  • F. Huang, J. F. Canny, and J. Nichols (2019) Swire: sketch-based user interface retrieval. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI ’19, New York, NY, USA. External Links: ISBN 9781450359702, Link, Document Cited by: §5.6, footnote 1.
  • F. Huang and J. F. Canny (2019) Sketchforme: composing sketched scenes from text descriptions for interactive applications. In Proceedings of the 32nd Annual ACM Symposium on User Interface Software and Technology, UIST ’19, New York, NY, USA, pp. 209–220. External Links: ISBN 9781450368162, Link, Document Cited by: §6.1.
  • J. Jongejan, H. Rowley, T. Kawashima, J. Kim, and N. Fox-Gieg (2016) External Links: Link Cited by: §3.1, §6.2.
  • J. Kim, N. Kitaev, X. Chen, M. Rohrbach, B. Zhang, Y. Tian, D. Batra, and D. Parikh (2019) CoDraw: collaborative drawing as a testbed for grounded goal-driven communication. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 6495–6513. External Links: Link, Document Cited by: §6.2, §6.3, §6.4, Table 2.
  • R. Kumar, J. O. Talton, S. Ahmad, and S. R. Klemmer (2011) Bricolage: example-based retargeting for web design. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’11, New York, NY, USA, pp. 2197–2206. External Links: ISBN 978-1-4503-0228-9, Link, Document Cited by: §4.
  • J. A. Landay (1996) SILK: sketching interfaces like krazy. In Conference Companion on Human Factors in Computing Systems, CHI ’96, New York, NY, USA, pp. 398–399. External Links: ISBN 0-89791-832-0, Link, Document Cited by: §2.2.
  • W. S. Lasecki, J. Kim, N. Rafter, O. Sen, J. P. Bigham, and M. S. Bernstein (2015) Apparition: crowdsourced user interfaces that come to life as you sketch them. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, CHI ’15, New York, NY, USA, pp. 1925–1934. External Links: ISBN 978-1-4503-3145-6, Link, Document Cited by: §2.2.
  • Y. J. Lee, C. L. Zitnick, and M. F. Cohen (2011) ShadowDraw: real-time user guidance for freehand drawing. ACM Trans. Graph. 30 (4), pp. 27:1–27:10. External Links: ISSN 0730-0301, Link, Document Cited by: §2.1.
  • M. Li, Z. Lin, R. Mech, E. Yumer, and D. Ramanan (2019) Photo-sketching: inferring contour drawings from images. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1403–1412. Cited by: §3.3.
  • A. Limpaecher, N. Feltman, A. Treuille, and M. Cohen (2013) Real-time drawing assistance through crowdsourcing. ACM Trans. Graph. 32 (4), pp. 54:1–54:8. External Links: ISSN 0730-0301, Link, Document Cited by: §2.1.
  • J. Lin, M. W. Newman, J. I. Hong, and J. A. Landay (2000) DENIM: finding a tighter fit between tools and practice for web site design. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’00, New York, NY, USA, pp. 510–517. External Links: ISBN 1-58113-216-6, Link, Document Cited by: §2.2.
  • R. Munoz-Salinas (2012) ARUCO: a minimal library for augmented reality applications based on opencv. Universidad de Córdoba. Cited by: §4.3.
  • M. W. Newman and J. A. Landay (2000) Sitemaps, storyboards, and specifications: a sketch of web site design practice. In Proceedings of the 3rd Conference on Designing Interactive Systems: Processes, Practices, Methods, and Techniques, DIS ’00, New York, NY, USA, pp. 263–274. External Links: ISBN 1-58113-219-0, Link, Document Cited by: §4, §5.6, §5.6.
  • T. A. Nguyen and C. Csallner (2015) Reverse engineering mobile application user interfaces with remaui (t). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 248–259. External Links: Document Cited by: §4.
  • J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. External Links: Link Cited by: §6.1.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li (2014) ImageNet large scale visual recognition challenge. CoRR abs/1409.0575. External Links: Link, 1409.0575 Cited by: §5.1.
  • A. Sain, A. K. Bhunia, Y. Yang, T. Xiang, and Y. Song (2020) Cross-modal hierarchical modelling for fine-grained sketch based image retrieval. In Proceedings of The 31st British Machine Vision Virtual Conference (BMVC 2020), pp. 1–14. Cited by: §3.2.
  • P. Sangkloy, N. Burnell, C. Ham, and J. Hays (2016) The sketchy database: learning to retrieve badly drawn bunnies. ACM Trans. Graph. 35 (4), pp. 119:1–119:12. External Links: ISSN 0730-0301, Link, Document Cited by: §3.1, §5.
  • F. Schroff, D. Kalenichenko, and J. Philbin (2015) FaceNet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: §5.2.
  • K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, Cited by: §5.1.
  • Q. Su, W. H. A. Li, J. Wang, and H. Fu (2014) EZ-sketching: three-level optimization for error-tolerant image tracing. ACM Trans. Graph. 33 (4), pp. 54:1–54:9. External Links: ISSN 0730-0301, Link, Document Cited by: §2.1.
  • A. Swearngin, M. Dontcheva, W. Li, J. Brandt, M. Dixon, and A. J. Ko (2018) Rewire: interface design assistance from examples. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, CHI ’18, New York, NY, USA, pp. 504:1–504:12. External Links: ISBN 978-1-4503-5620-6, Link, Document Cited by: §4.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008. External Links: Link Cited by: §3.3, §6.1.
  • L. von Ahn and L. Dabbish (2008) Designing games with a purpose. Commun. ACM 51 (8), pp. 58–67. External Links: ISSN 0001-0782, Link, Document Cited by: §6.4.
  • J. Xie, A. Hertzmann, W. Li, and H. Winnemöller (2014) PortraitSketch: face sketching assistance for novices. In Proceedings of the 27th Annual ACM Symposium on User Interface Software and Technology, UIST ’14, New York, NY, USA, pp. 407–417. External Links: ISBN 978-1-4503-3069-5, Link, Document Cited by: §2.1.
  • Q. Yu, F. Liu, Y. Song, T. Xiang, T. Hospedales, and C. C. Loy (2016) Sketch me that shoe. In Computer Vision and Pattern Recognition, Cited by: §5.2.
  • J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, Cited by: §3.3.
  • C. Zou, Q. Yu, R. Du, H. Mo, Y. Song, T. Xiang, C. Gao, B. Chen, and H. Zhang (2018) SketchyScene: richly-annotated scene sketches. In ECCV, pp. 438–454. External Links: Document, Link Cited by: §3.1, §7.1.