Creative Procedural-Knowledge Extraction From Web Design Tutorials

04/18/2019 ∙ by Longqi Yang, et al. ∙ Adobe ∙ ByteDance Inc. ∙ Cornell University

Complex design tasks often require performing diverse actions in a specific order. To (semi-)autonomously accomplish these tasks, applications need to understand and learn a wide range of design procedures, i.e., Creative Procedural-Knowledge (CPK). Prior knowledge base construction and mining have not typically addressed the creative fields, such as design and arts. In this paper, we formalize an ontology of CPK using five components: goal, workflow, action, command and usage; and extract components' values from online design tutorials. We scraped 19.6K tutorial-related webpages and built a web application for professional designers to identify and summarize CPK components. The annotated dataset consists of 819 unique commands, 47,491 actions, and 2,022 workflows and goals. Based on this dataset, we propose a general CPK extraction pipeline and demonstrate that existing text classification and sequence-to-sequence models are limited in identifying, predicting and summarizing complex operations described in heterogeneous styles. Through quantitative and qualitative error analysis, we discuss CPK extraction challenges that need to be addressed by future research.




1 Introduction

Building applications that can enhance human abilities to accomplish creative tasks (such as writing summaries, designing graphics, and composing music) has recently captured a significant amount of attention from academia and industry Ganin et al. (2018); Simon and Oore (2017); See et al. (2017). In the domain of design and imaging, these tasks often require micro-controls over pixels, shapes, and objects and involve multiple steps of operations. For example, drawing a creative cartoon typically requires several iterations of an action sequence (contouring, brushing, coloring) applied to different areas on the canvas. Therefore, to realize a wide range of intelligent tasks (such as next-tool recommendation, auto-completion, personalized interface, and design material retrieval), applications need to gain Creative Procedural-Knowledge (CPK).

In this paper we exploit the opportunity for machine learning algorithms to extract CPK from the growing corpus of online tutorials. Creative professionals document detailed actions and steps for using design software (e.g., Photoshop, GIMP, Inkscape, and Autodesk) in online design tutorials. These tutorials can take the form of text, images, or video and vary in quality, length, and topic. For example, the envatotuts+ website contains tutorials for text effects, sketches, watercolor painting, and animation. To the best of our knowledge, little prior work has investigated the problem of parsing and extracting knowledge from free-text tutorials. Existing research on mining web content has mostly focused on articles Ren et al. (2017) and scientific papers Zhang (2015), and the tasks are limited to tagging Joulin et al. (2017), named entity recognition Manning et al. (2014), and summarization See et al. (2017). Prior work on parsing cooking recipes Jermsurawong and Habash (2015); Chen (2017) investigated clean and structured documents where ingredients are presented upfront, and instructions are described step-by-step without irrelevant or redundant information. In comparison, tutorials are much more diverse and unstructured, and the CPK extends beyond named entities, topics, and a fixed list of ingredients.

Figure 1: Sample CPK (Section 3) annotations for a design tutorial. A professional designer is instructed to identify relevant text chunks and summarize them for the goal and actions. For each action, a command is also labeled.

To facilitate and benchmark research, we propose an ontology for CPK using the workflow representation that consists of goal, workflow, action, command and usage (Section 3). Based on the defined ontology, we collect thousands of text-based Adobe Photoshop tutorials from hundreds of domains and build a web application to elicit labels from professional designers. Our application allows annotators to flexibly extract and summarize various components and specify the sources from which the knowledge is derived. The resulting dataset contains 819 unique commands, 47,491 actions, and 2,022 workflows and goals. A representative set of annotations is shown in Fig. 1. The code and dataset are publicly available. With the dataset, we quantitatively and qualitatively present the challenges of CPK extraction. Specifically, we demonstrate that existing text classification and sequence-to-sequence models Sutskever et al. (2014) fall short in handling the heterogeneous writing styles of different tutorials. As a result, predictive models are suboptimal at identifying and classifying implicitly mentioned or latent commands, and they produce low-quality summaries for complex operations.

Since CPK stores valuable procedural knowledge necessary for diverse design tasks, it can potentially enable or enhance a wide range of applications. For example: (a) smarter tutorial search and recommendation. CPK can enable command-, action-, and intention-based search that goes beyond text matching. Similarly, tutorial recommendations can be based on the goal that a user plans to achieve or the commands and operations that she aspires to learn; (b) skill trees and personalized curricula. A large collection of CPK provides resources to construct skill trees that encode the rank of each operation, which can power personalized curricula; and (c) autonomous design agents. CPK provides mappings from design goals to action and command sequences, so an autonomous design agent can be trained in a supervised manner to act on users’ queries. In addition, since each annotated CPK directly corresponds to a tutorial, the collected dataset can be used for tutorial generation based on given command sequences and goal descriptions.

2 Related Work

Our work is related to and inspired by previous research on knowledge base construction and document modeling. The extracted CPK has applications to computational creativity.

2.1 Knowledge base construction

A knowledge base (graph) Bollacker et al. (2008); Liu et al. (2010) stores facts using triplets $(e_1, r, e_2)$, where $e_1$ and $e_2$ are generic entities (e.g., person, location, and event) or domain-specific entities (e.g., drug and effect), and $r$ is a relation that connects $e_1$ and $e_2$ (e.g., a drug has an effect, and a person lives in a location). The process of Knowledge Base Construction (KBC) Niu et al. (2012); Mitchell et al. (2018); Mahdisoltani et al. (2013); Ritter (2013); Zhang (2015) is to extract triplets from free text Mitchell et al. (2018); Liu et al. (2017) or structured tables Ran et al. (2015); Crestan and Pantel (2010). Prior work has proposed diverse approaches for the extraction, such as supervised methods Bunescu and Mooney (2005), semi- or weakly supervised algorithms Liu et al. (2010); Mahdisoltani et al. (2013); Mitchell et al. (2018); Jiang et al. (2012), and distantly supervised algorithms Liu et al. (2017); Zhang (2015); Surdeanu et al. (2012); Angeli et al. (2014); Ren et al. (2017). These approaches require different levels of human labeling effort and are shown to vary in precision and recall. However, while existing knowledge bases have enabled important applications (such as web search Deshpande et al. (2013) and question answering Yih and Ma (2016)), the triplet representations are over-simplified for design and creative tasks. These tasks often involve complex workflows, which require sequential and nested structures beyond binary relationships between entities. Our work complements the existing KBC literature by proposing a new ontology specifically tailored for CPK. We demonstrate that CPK extraction can benefit from, and pose new challenges to, existing approaches.

2.2 Document modeling

Existing work on document modeling has mainly focused on word-level and document (sentence)-level predictive tasks, e.g., named entity recognition, part-of-speech tagging, and semantic parsing Manning et al. (2014) are used to tag word-level structures; and document representations Le and Mikolov (2014); Lau and Baldwin (2016) are leveraged for sentiment analysis Maas et al. (2011), textual similarity comparison Maas et al. (2011), question retrieval Hoogeveen et al. (2015), and summarization See et al. (2017). Since many end applications are built on global contexts, it is non-trivial to extract documents’ internal structures using trained predictive models, even with the attention mechanism Bahdanau et al. (2014); See et al. (2017). Prior work on parsing cooking recipes Jermsurawong and Habash (2015); Chen (2017) produced structured outputs on template documents, i.e., ingredients are known, and no redundant information is presented. In our work, we explicitly mine goals and design procedures from free-text tutorials, which are much more complex and unstructured. Such information extraction poses challenges to existing modeling frameworks designed for word- or sentence-level tasks and clean documents. At the same time, the extracted fine-grained procedures may benefit traditional document classification, retrieval and summarization tasks.

2.3 Computational creativity

Recently, the field of computational creativity Colton and Wiggins (2012) has captured growing attention from academia and industry. The goal of this field is to create programs that can master or enhance human-level creativity. In the domain of design and arts, researchers have proposed algorithms that can color grey images Zhang et al. (2016), transfer image contextual styles Li et al. (2017), and synthesize images Sangkloy et al. (2017); Ganin et al. (2018). In addition, many applications, such as command search Adar et al. (2014), command recommendation Li et al. (2011), and creative content recommendation Yang et al. (2017), have been designed to assist creative professionals in accomplishing complex tasks. Moving forward, to truly understand creativity, it is important to learn the fundamental procedures of design. To that end, our exploration provides initial resources and insights for scalable discovery and learning of CPK in the future.

3 Ontology of creative procedural-knowledge

Figure 2: A CPK ontology for the object re-coloring task. CPK components are colored and bolded.

Each design produced with software is fundamentally a process that achieves an artistic objective by completing a sequence of actions, e.g., to re-color an object in an image using Photoshop, one needs to (1) load a target image into the software, (2) select the object area, (3) adjust the color, and (4) save the modified image. All processes that designers have developed for diverse tasks form the CPK. To encode this process-oriented nature, we compose an ontology of CPK using the workflow representation, which was adopted to characterize business and scientific processes Bergmann and Gil (2014). In principle, we represent each design process with two major components, a goal and a workflow, and the workflow assembles actions (i.e., (command, usage) tuples) in a structured manner. A concrete implementation of the ontology for the object re-coloring task is shown in Fig. 2. Below are detailed definitions of goal and workflow.

  • Goal. The goal defines the objective for a design process. It typically describes a targeted artifact or a visual effect. For example, in terms of digital painting, the goal can be “create character concept art”, “paint water, waves and ocean”, or “turn a pencil sketch into a colorful character illustration”; and in terms of photo effects, potential goals are “create an architecture sketch effect”, “add lights to a tree”, or “make someone look younger”. In general, goals are not mutually exclusive, and they relate to each other to varying degrees, e.g., achieving higher-level goals may depend on the completion of lower-level tasks (inclusion), and an abstract goal may correspond to multiple concrete implementations (hierarchy). In the example of Fig. 2, the goal is to “re-color an object in an image”.

  • Workflow. The workflow represents unrolled, step-by-step actions that need to be performed to accomplish the goal. Each action specifies a software command to be used and its usage (i.e., what it is used for). For example, the workflow for the task in Fig. 2 contains four actions (“{}” and “[]” mark a usage and a command, respectively): (1) use [File>Open] to {open an image}, (2) employ [Lasso Tool] to {select object area}, (3) {adjust object color} through [Image>Adjustments>Hue/Saturation], and (4) {save the image} using [File>Save as]. Similar to the goal definition, workflows also correlate with each other, e.g., two workflows may share a similar sub-action sequence. In reality, a comprehensively defined workflow should be executable given an environmental context (i.e., an image).

The canonical knowledge representation makes it possible to annotate and extract CPK from online design tutorials.
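To make the ontology concrete, the following is a minimal sketch of how a goal and its workflow of (command, usage) actions could be held in memory. The class names `Action` and `CPK` and the `commands` helper are hypothetical and chosen only for illustration; they are not part of the released dataset format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Action:
    """One step of a workflow: a software command plus what it is used for."""
    command: str  # e.g. "File>Open"
    usage: str    # short free-text description of the command's purpose


@dataclass
class CPK:
    """A design process: a goal and the ordered actions that achieve it."""
    goal: str
    workflow: List[Action] = field(default_factory=list)

    def commands(self) -> List[str]:
        """Return the command sequence in execution order."""
        return [action.command for action in self.workflow]


# The object re-coloring example from this section.
recolor = CPK(
    goal="re-color an object in an image",
    workflow=[
        Action("File>Open", "open an image"),
        Action("Lasso Tool", "select object area"),
        Action("Image>Adjustments>Hue/Saturation", "adjust object color"),
        Action("File>Save as", "save the image"),
    ],
)
```

Keeping the workflow as an ordered list preserves the sequential structure that a flat triplet representation would lose.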

4 Annotating Creative Procedural-Knowledge

Figure 3: The CPK annotation pipeline. It contains three steps: scraping, filtering (coarse- and fine-grained) and annotation.

We collect CPK from design tutorials (Fig. 1) that explicitly or implicitly describe the task they try to accomplish, together with detailed step-by-step instructions. To capture a wide range of goals and workflows, we consider Adobe Photoshop as the design software because of its rich functionality and active user community. In this section, we describe the processes of collecting, filtering and annotating tutorial webpages. As shown in Fig. 3, the filtering contains two stages, coarse-grained and fine-grained, that are conducted by Amazon Mechanical Turk (AMT) workers and professional designers from Upwork, respectively.

4.1 Tutorial collection

Using the Google search engine, we identify 148 valid and unique domains, where each domain serves more than 10 Photoshop tutorials. We build a generic web crawler to recursively scrape all web pages under these domains and retain only pages with the keyword “photoshop”. To further improve the accuracy of the scraping, we build dedicated scrapers for the 27 largest domains, tailored to their website structures. In total, 19.6K web pages are collected. We use Simhash Manku et al. (2007) to detect duplicated documents, which results in 18,100 distinct pages for further processing.
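The duplicate-detection step can be illustrated with a small Simhash sketch. This is not the implementation used for the dataset: the word-level tokenization, the MD5-based 64-bit token hash, and the Hamming-distance threshold of 3 are all assumptions made for the example, and the pairwise comparison is only practical for small collections.

```python
import hashlib
import re
from typing import List

BITS = 64  # fingerprint width (assumed)


def simhash(text: str) -> int:
    """Compute a 64-bit Simhash fingerprint from word tokens."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def deduplicate(pages: List[str], max_distance: int = 3) -> List[str]:
    """Keep a page only if no previously kept page is within max_distance bits."""
    kept: List[str] = []
    fingerprints: List[int] = []
    for page in pages:
        fp = simhash(page)
        if all(hamming(fp, other) > max_distance for other in fingerprints):
            kept.append(page)
            fingerprints.append(fp)
    return kept
```

Because near-identical pages share most tokens, their fingerprints differ in only a few bits, which is what makes a small Hamming threshold usable as a duplicate test.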

4.2 Coarse-grained filtering

Figure 4: The web application for coarse-grained webpage filtering on Amazon Mechanical Turk (AMT) platform. Each worker is instructed to complete two tasks: answer a filtering-related question, and select the title of the tutorial.

The scraped web pages are potentially noisy. For example, pages may not be Photoshop tutorials. Also, since we mainly focus on text-based tutorials, web pages that only contain videos are out of our scope. Because identifying these characteristics does not require professional design knowledge, we conduct a coarse-grained filtering using the Amazon Mechanical Turk (AMT) platform. For each web page, two distinct workers are recruited to use a web application (Fig. 4) to (a) answer the question “Is the main content of this web page a single text-based Adobe Photoshop Tutorial?”, and (b) click on the title of this tutorial (if the answer to the previous question is “Yes”). The second task is designed to verify the simple click made for the first question. Qualified workers need to satisfy three requirements: (1) be “masters” as determined by the AMT platform, (2) have an approval rate above 90%, and (3) have completed more than 100 tasks. The workers are paid $10/hour. Finally, 9,996 pages receive consistent ratings from both workers.

4.3 Fine-grained filtering and annotation

Figure 5: The web application for professional designers to conduct fine-grained filtering and CPK annotations. Each session consists of five steps: (1) filter out low-quality tutorials, (2) select the title of the tutorial, (3) select the sentences describing the goal, (4) summarize the goal using fewer than 10 words, (5) annotate actions in the order of their usage (continue until all actions are identified). The text can be directly selected using the cursor.

Cleaned tutorial pages vary in quality, e.g., some pages may only contain simple tips, while others contain detailed steps for complex tasks. To achieve consistent quality, we recruit 8 professional designers from the Upwork platform to filter out low-quality tutorials (those that have unclear goals or use few commands) and annotate high-quality ones using a web application (Fig. 5). Recruited designers have 2-10 years of Photoshop experience and are fluent in English. For each tutorial page, a designer is assigned to (a) select the text that describes the title and the goal, (b) use natural language to summarize the goal (in fewer than 10 words), and (c) identify the actions performed. Annotations for an action contain the original text, the command used, and a summary (10 words maximum) of its usage. These annotations map unstructured design tutorials into structured CPK (Section 3).

4.4 Dataset overview

Figure 6: Action and command distributions in the annotated CPK dataset. X-axes are log-scaled for better visualization.
Unigram | Bigram
Brush Tool | New Layer* → Brush Tool
New Layer* | Blend Mode → Opacity
Blend Mode | Brush Tool → New Layer*
Duplicate Layer | Edit>Copy → Edit>Paste
File>Open | Brush Tool → Brush Tool
Table 1: Five most frequently annotated unigram and bigram commands in the dataset (*:Layers, : Layer Panel, : Layer)

After completing annotations, 819 unique commands and 47,491 actions in 2,022 tutorials are identified. We show the long-tail distribution of action density (number of actions per tutorial) in Fig. 6(a): around 90% of tutorials have fewer than 50 actions. The frequencies of the commands used are also skewed, as shown in Fig. 6(b) and (c). The most frequently used tools and their combinations (Table 1), such as Brush Tool, relate to common tasks in digital design, e.g., manipulating layers and areas. This demonstrates that many complex design tasks do not necessarily require rare tools.

For the goal and usage, the annotations are in the form of free text. We find that designers tend to use boilerplate phrases to summarize the goal, such as “how to”, “learn to”, and “learn how to”. We remove these phrases using regular expressions since they do not provide content information. To visualize the goal summaries, we first derive summary representations using the element-wise average of GloVe word vectors Pennington et al. (2014) (excluding stop words), and then apply the K-means algorithm to discover underlying clusters. We set $K = 5$, and the summaries closest to each cluster center are presented in Table 2. The five clusters cover a wide range of design topics, e.g., text effects, photo effects, web design, textures, and scenes, which demonstrates the diversity of the dataset. In addition, we apply the same clustering approach to the usage summaries of all commands. As a result, commands are grouped by their usage similarities. For example, as shown in Table 3, filter- and transform-related commands are grouped into cluster 1 and cluster 5, respectively. This demonstrates that CPK reveals semantic relationships between different commands.
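The summary-clustering procedure can be sketched in a few lines of plain Python. Everything below is illustrative: the boilerplate regular expression, the stop-word subset, and the word vectors passed to `embed` are assumptions standing in for the actual patterns and pre-trained GloVe vectors, and the K-means routine is a basic Lloyd iteration, not a tuned implementation.

```python
import random
import re
from typing import Dict, List, Sequence

# Assumed pattern for the boilerplate prefixes mentioned in the text.
BOILERPLATE = re.compile(r"^\s*(?:learn\s+how\s+to|how\s+to|learn\s+to)\s+", re.IGNORECASE)
STOP_WORDS = {"a", "an", "the", "in", "of", "to", "and"}  # illustrative subset


def strip_boilerplate(summary: str) -> str:
    """Remove a leading 'how to' / 'learn to' / 'learn how to' phrase."""
    return BOILERPLATE.sub("", summary).strip()


def embed(summary: str, vectors: Dict[str, Sequence[float]], dim: int) -> List[float]:
    """Element-wise average of the vectors of known, non-stop words."""
    words = [w for w in summary.lower().split() if w not in STOP_WORDS and w in vectors]
    if not words:
        return [0.0] * dim
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]


def kmeans(points: Sequence[Sequence[float]], k: int, iterations: int = 20, seed: int = 0) -> List[int]:
    """Basic Lloyd's algorithm; returns the cluster index of each point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(list(points), k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        for idx, point in enumerate(points):
            assignment[idx] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment
```

In practice the `vectors` dictionary would be loaded from a pre-trained GloVe file, and the same three functions apply unchanged to command usage summaries.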

Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5
create photo having pretty good light effects | create dreamy photo effect using simple tools and filters | create fun designs comning text and shapes | create abstract photo using effects and color | create this trasparent text effect
create like in my dream photo effect | create abstract photo using effects and color | create picture by blending different images and tools | create a painting from a photo | create a text effect
create realistic light around a subject | creating a cool matelic background using some filters and effect | create colorful futuristic looking text effect | create a lomography photo effect | create a text effect
use fire and blending modes to create got theme photos | create a fabric text effect using layer styles and texture | creating a landscape creative image using structure images | create a-smoke photo effect | create text effect
create a night scene image | creating a text effect using texture and filters | create a modern web design style | create a colorful photo manipulation | create a brazil-inspired text effect in photoshop
Table 2: Five K-means clusters of goal summaries. The K-means algorithm is applied to summary representations, which are derived by taking the average of GloVe word vectors Pennington et al. (2014) excluding stop words (represented with light fonts).
Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5
Layer>New Adjustment Layer>Photo Filter | Pen Tool | Layer>New Adjustment Layer>Color Balance | Layers Panel | Edit>Transform>Warp
Filter>Texture>Grain | Selection Tool | Layer>New Adjustment Layer>Selective Color | Eye Icon | Edit>Transform>Forward Warp
Filter>Distort>Spherize | Polygonal Lasso Tool | Image>Adjustments>Selective Color | Layers>Palette Options | Edit>Free Transform
Filter>Texture>Texturizer | Move Tool | Image>Adjustments>Variations | Layer>Layer Mask>Reveal All | Image>Transform>Free Transform
Filter>Distort>Ripple | Rectangular Marquee Tool | Layer>New Adjustment Layer>Hue/Saturation | Layer>Layer Mask>Hide All | Edit>Transform>Perspective
Table 3: Five K-means clusters of commands. The K-means algorithm is applied to commands’ usage summary representations, which are derived by taking the average of GloVe word vectors Pennington et al. (2014) excluding stop words.
Id | Sentence | Prediction@1 | Ground Truth
1 | make sure that you have a radial gradient that fades from solid white to transparent as shown in the image below : on your new layer , create a gradient and change the blending mode to overlay . | Gradient Tool | Gradient Tool; Blend Mode; New Layer
2 | copy ( ctrl + c ) and paste ( ctrl + v ) the selection . | Edit>Paste | Edit>Paste
3 | create another new layer above , use it as clipping mask ( cmd/ctrl + alt + g ) again . | Create Clipping Mask | Create Clipping Mask; New Layer
4 | step 2 right click the canvas and choose select inverse to invert the selection . | Select>Inverse | Select>Inverse
(a) True Positive (TP) samples

Id | Sentence
1 | next you will find a short version of the tutorial but for a more details please watch the video tutorial .
2 | we want to sharpen objects in an image without increasing the effect or visibility of those unwanted elements and also in a way that does not affect the original colors .
3 | i hope that you enjoyed this tutorial , and as always i’d love to hear what you think !
4 | my recommendation is to work in the early morning hours before the crowds set in .
(b) True Negative (TN) samples (Prediction@1=Ground Truth=“No Action”)

Id | Sentence | Prediction@1 | Ground Truth
1 | then create a new shape using round shape tool . | Custom Shape Tool | Ellipse Tool
2 | set it to soft light mode at 16 % opacity . | Blend Mode | Gradient Map
3 | ( i ’ve used # 003200 here . ) | Paint Bucket Tool | Color Picker
4 | then press ctrl+c ( win ) / command+c ( mac ) to copy the image to the clipboard . | Edit>Copy | No Action
5 | rename the merged layer to crackedplanet . | Rename Layer | No Action
(c) False Positive (FP) samples

Id | Sentence | Ground Truth
1 | step 21 : go to the background image layer at the bottom and press ctrl + j . | Duplicate Layer
2 | painting it just a bit off could make her eyes look unparallel . | Clone Stamp Tool
3 | draw two lines down from the corner of her eye , and tap it once or twice here and there : brushing the paint drip : zoomed out , this is what it looks like : time to fix her hair , we’re going to plant a tree on top of her head , but we need her head to be a little more flat . | Brush Tool
4 | adding bottom text step 1 follow the same techniques to add bottom text . | Horizontal Type Tool
5 | hit command + d to deselect . | Gradient Tool
6 | once you have opened the actions panel you can begin creating your first lomo leak . | Actions Panel
(d) False Negative (FN) samples (Prediction@1=“No Action”)
Table 4: Command predictions for sample sentences in the testing set. The samples are grouped into True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). The predictions are from the fastText model built on the 1,2,3,4-gram features. Prediction@1 denotes the top label (either a command or “No Action”) predicted for each instance.

5 Extracting creative procedural-knowledge

The annotated dataset makes it possible to build supervised models that mine and extract CPK. In this section, we propose a general pipeline that produces structured knowledge outputs from unstructured web content (Section 5.1). In addition, we experiment with existing text classification and summarization algorithms for various components (Section 5.2). We identify and discuss the limitations of existing approaches and the challenges of the extraction task.

5.1 A general extraction pipeline

Figure 7: Proposed CPK extraction pipeline. All arrows represent data flows. Solid-line arrows additionally denote the order for executing modules. The parallelogram represents the input, rectangles represent identification tasks, and rounded rectangles represent prediction/generation tasks.

Inspired by the annotation procedure, we present an extraction pipeline in Fig. 7. It consists of six steps: (1) extract the main text content from a raw HTML tutorial webpage; (2) identify a text chunk that presents the goal of a design task; (3) summarize the goal; (4) identify text segments that describe actions, recursively and in order; (5) predict the commands used for each action; and (6) summarize the usage of each command. The predictions and summarizations can leverage both local (identified text) and global (complete text content) information.
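The six steps compose naturally as a chain of pluggable modules. The sketch below only illustrates that control flow: the `Extractors` container, its field names, and `extract_cpk` are hypothetical, each module is a placeholder callable, and for brevity a single command is predicted per action chunk even though an action may involve several.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Extractors:
    """Pluggable modules, one per pipeline step (all names are hypothetical)."""
    extract_text: Callable[[str], str]          # step 1: HTML -> main text
    find_goal: Callable[[str], str]             # step 2: text -> goal chunk
    summarize_goal: Callable[[str], str]        # step 3: goal chunk -> summary
    find_actions: Callable[[str], List[str]]    # step 4: text -> ordered action chunks
    predict_command: Callable[[str, str], str]  # step 5: (chunk, full text) -> command
    summarize_usage: Callable[[str, str], str]  # step 6: (chunk, full text) -> usage


def extract_cpk(html: str, modules: Extractors) -> Tuple[str, List[Tuple[str, str]]]:
    """Run the six steps and return (goal summary, [(command, usage), ...])."""
    text = modules.extract_text(html)
    goal = modules.summarize_goal(modules.find_goal(text))
    workflow = []
    for chunk in modules.find_actions(text):
        # Steps 5 and 6 see both the local chunk and the global text.
        command = modules.predict_command(chunk, text)
        usage = modules.summarize_usage(chunk, text)
        workflow.append((command, usage))
    return goal, workflow
```

Passing the full text alongside each chunk mirrors the point that predictions can draw on global as well as local context.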

5.2 Experimenting with existing algorithms

The proposed pipeline decomposes the knowledge extraction task into several sub-tasks, which can potentially be powered by existing solutions for text analysis. For example, command prediction can be viewed as a multi-label text classification problem, and usage summarization can leverage the sequence-to-sequence models Sutskever et al. (2014) developed for machine translation. To explore the extent to which existing solutions can solve CPK extraction, we (a) apply fastText Joulin et al. (2017) to identify action-related text chunks and predict commands, and (b) adapt the neural machine translation model Bahdanau et al. (2014) to summarize commands’ usages. Specifically, we use the boilerpipe Kohlschütter et al. (2010) and newspaper packages to extract main content from HTML pages, and assume the minimum unit for a text chunk is a sentence. As a result, an action may span multiple sentences, and a sentence may contain multiple actions.

5.2.1 Text chunk identification and command prediction

To conduct sentence-level predictions, we use NLTK Bird et al. (2009) to segment clean tutorial text into sentences. In total, 2,022 tutorials are divided into 94,022 segments. For each sentence, a multi-label classifier predicts the commands used or outputs “No Action”, indicating that the sentence does not describe any action. In practice, “No Action” is treated as an additional label in the classification, along with all commands that appear more than 5 times (404 out of 819 satisfy the requirement). 54% of sentences are labeled “No Action”. To measure the performance of fastText, we randomly split sentences into a training set (62,936 sentences) and a testing set (31,086 sentences), and experiment with N-gram features, where N ranges from 1 to 4. The fastText classification performance, in terms of Precision@1 and Recall@1, is presented in Fig. 8. Compared to the majority baseline (0.54), fastText performs significantly better, and the performance is further improved by adding more N-gram features. However, the improvements saturate when N is larger than 3, and the best performance is still unsatisfactory.
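For reference, the label-space construction and the two evaluation metrics can be sketched as follows. This is an illustration under assumptions, not the evaluation script behind Fig. 8: Precision@1 is taken as the fraction of top-1 predictions that appear in the gold label set, and Recall@1 as the number of such hits divided by the total number of gold labels. The function names are hypothetical.

```python
from collections import Counter
from typing import List, Set, Tuple

NO_ACTION = "No Action"


def frequent_labels(gold: List[Set[str]], min_count: int = 5) -> Set[str]:
    """Keep 'No Action' plus every command that appears more than min_count times."""
    counts = Counter(label for labels in gold for label in labels)
    return {label for label, n in counts.items() if n > min_count or label == NO_ACTION}


def precision_recall_at_1(predictions: List[str], gold: List[Set[str]]) -> Tuple[float, float]:
    """Top-1 precision and recall for multi-label sentences."""
    correct = sum(1 for pred, labels in zip(predictions, gold) if pred in labels)
    total_gold = sum(len(labels) for labels in gold)
    return correct / len(predictions), correct / total_gold


def majority_baseline(gold: List[Set[str]]) -> List[str]:
    """Always predict the most common label, which is 'No Action' in this dataset."""
    return [NO_ACTION] * len(gold)
```

Under this definition, a sentence with several gold commands can contribute at most one hit, so Recall@1 is bounded below 1 whenever multi-command sentences are present.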

To understand the characteristics of the task and its unique challenges more deeply, we analyze the errors of the best classifier (built on the 1,2,3,4-gram features). Specifically, samples of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) in the testing set are shown in Table 4 (“No Action” is treated as negative). The TP samples demonstrate that command predictions are accurate when the commands are explicitly mentioned in the text, e.g., “create a gradient” (a-1), “paste the selection” (a-2), “use it as clipping mask” (a-3), and “select inverse to invert the selection” (a-4). Also, according to the TN samples, “No Action” sentences are easy to classify when the semantics are clearly irrelevant to tutorial workflows, such as introductory and concluding sentences (b-1 and b-3) and side notes (b-2 and b-4). However, as the sentences become more complex, the classifier fails under various scenarios, as discussed below.

  • Implicit mention of commands. The command names may not be explicitly mentioned in the text. Instead, they may be referred to by shortcuts (d-1 and d-5) or by appearance (c-1).

  • No direct mention of commands (Latent commands). In many cases, commands are not mentioned at all. For example, in c-2, c-3, d-2, and d-3, text chunks describe the changes to be made to the canvas; and the usage of a command may be mentioned outside the given sentence, e.g., d-4 and d-5.

To tackle these challenges, we argue that prediction models need to (1) understand the global structure of a tutorial; and (2) develop a deeper understanding of the relationships between commands and the consequences of applying them. Among the FP and FN samples, we find some labeling errors, such as c-4, c-5, and d-6. These errors may stem from the label mapping process, i.e., labels from an action are shared across the sentences that the action spans, or from mistakes made by annotators. We leave further data cleaning as future work.
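The label mapping process mentioned above can be sketched as a span-overlap rule. The representation is an assumption made for illustration: sentences and annotated actions are both given as half-open character ranges over the tutorial text, and `label_sentences` is a hypothetical name.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # half-open [start, end) character offsets


def label_sentences(sentences: List[Span], actions: List[Tuple[Span, str]]) -> List[Set[str]]:
    """Give each sentence the commands of all actions whose text overlaps it."""
    labels: List[Set[str]] = []
    for s_start, s_end in sentences:
        commands = {
            command
            for (a_start, a_end), command in actions
            if a_start < s_end and s_start < a_end  # the two ranges intersect
        }
        # Sentences that overlap no annotated action become "No Action".
        labels.append(commands or {"No Action"})
    return labels
```

With this rule, an action that spans two sentences labels both of them, even if only one actually names the command. That is one plausible source of the noisy labels seen in samples such as d-6.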

Figure 8: Precision and Recall for the command prediction task evaluated in the testing set. The performances of four fastText models built on different N-gram features are presented.

5.2.2 Usage summarization

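Model | Validation, dropout 0 | Validation, dropout 0.2 | Validation, dropout 0.5 | Testing, dropout 0 | Testing, dropout 0.2 | Testing, dropout 0.5
1-layer | 11.73 | 12.29 | 13.49 | 10.33 | 11.71 | 12.17
1-layer-att | 18.45 | 19.18 | 21.53 | 17.24 | 17.84 | 19.70
2-layer | 11.37 | 11.97 | 13.16 | 10.03 | 11.30 | 12.56
2-layer-att | 16.37 | 16.23 | 17.18 | 14.83 | 15.27 | 16.85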
Table 5: BLEU scores for the usage summarization task evaluated in the validation and testing sets. The best performance in either set is bolded. The naming schema of different algorithms: [number of layers]-layer-[whether or not the attention mechanism Bahdanau et al. (2014) is applied].

A usage summarization module takes a raw sentence as the input and generates a command usage summary. A natural model for this task is the sequence-to-sequence model Sutskever et al. (2014): a Recurrent Neural Network (RNN) based encoder “reads” a list of raw text tokens and produces a vector representation. Then a separate RNN-based decoder takes the representation as the input and sequentially generates a list of words as the summary. We experiment with the Neural Machine Translation (NMT) model Bahdanau et al. (2014), which achieves competitive translation performance, using a standard NMT Tensorflow implementation Luong et al. (2017). We use the Long Short-Term Memory (LSTM) Hochreiter and Schmidhuber (1997) as the RNN cell and consider three aspects that vary the design of the NMT model: (1) with or without the attention mechanism Bahdanau et al. (2014), (2) different dropout rates, and (3) the number of recurrent layers. To evaluate summarization performance, sentences that contain at least one action (43,582 out of 94,022 sentences satisfy the requirement) are randomly divided into a training set (23,582 sentences), a validation set (10,000 sentences), and a testing set (10,000 sentences); each sentence corresponds to one or more reference summaries. NMT models are trained on the training set (batch size: 128, iterations: 100,000) using the Adam optimizer Kingma and Ba (2014), and the optimal number of training iterations is determined by the validation BLEU scores Papineni et al. (2002). Finally, the models’ performance is reported using the testing BLEU scores, as shown in Table 5.
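Because each sentence may have several reference summaries, the score has to handle multiple references. The following is a simplified sentence-level BLEU sketch for intuition only; it is not the scoring script used for Table 5. It assumes whitespace-tokenized input, uniform weights over n-gram orders, no smoothing, and (as a deviation from standard BLEU) it only uses the n-gram orders that the hypothesis is long enough to contain.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis: List[str], references: List[List[str]], max_n: int = 4) -> float:
    """Simplified multi-reference BLEU for one hypothesis (references must be non-empty)."""
    if not hypothesis:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        if not hyp_counts:
            break  # hypothesis is shorter than n tokens
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref: Counter = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
        if clipped == 0:
            return 0.0  # no smoothing: one empty order zeroes the score
        log_precisions.append(math.log(clipped / sum(hyp_counts.values())))
    # Brevity penalty against the reference length closest to the hypothesis length.
    closest = min((len(ref) for ref in references),
                  key=lambda length: (abs(length - len(hypothesis)), length))
    penalty = 1.0 if len(hypothesis) >= closest else math.exp(1 - closest / len(hypothesis))
    return penalty * math.exp(sum(log_precisions) / len(log_precisions))
```

Without smoothing, a summary that shares unigrams but no bigram with any reference scores zero, which is worth keeping in mind when reading per-sentence scores on very short summaries.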

Id Sentence NMT References BLEU
1 create one more new layer just below the dress , set blending mode of the layer to “ soft light ” . create one more new layer
create one more new layer
set blending mode of the layer to“ soft light ”
2 select your base image and duplicate ( control-j ) it . duplicate layer duplicate image 0.84
3 final image preview start working by creating a new document ( ctrl+n ) in adobe photoshop with the size 1900px by 1200px ( rgb color mode ) at a resolution of 72 pixels/inch . creating new file making new document 0.76
4 start by opening any image you want to work on . opening photo of an image open image 0.67
5 step 3 : load the photographic toning presets when the gradient picker appears , click on the small gear icon in the top right corner : clicking the gear icon . load photographic toning presets and choose choose sepia unk then gradient click on the small gear icon 0
6 we want the white bands on the umbrella to be slightly translucent , so choose select color range , select the white stripes and apply these settings controls the amount white area to blend over the stage details between better areas select color 0
7 desaturate this cloud layer and use a big soft brush to erase the outer portion of the cloud , leave the bits as show below : change the blending mode to “ color burn ” and you will have the following effect : duplicate this cloud layer a few more times , position the duplicated layer around the edge of the cliff , adjust their blending mode as shown below : add a curves adjustment layer to it with mask as shown below : you will have the following effect : step 4 now let ’ s produce some lave and flaming effect to the image . add blur effect
desaturate this cloud layer
change to color burn
duplicate this cloud layer
adjust according to image shown below
name it lava
Table 6: Usage summaries for sample sentences in the validation set. The instances are ordered by BLEU score in descending order. The summaries are generated by the 1-layer-att-dropout-0.5 NMT model. Each sentence may correspond to multiple ground-truth references.
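Because each sentence may have several reference summaries, BLEU has to score a candidate against all of them at once. The sketch below shows one way to do this, following the clipped n-gram precision and brevity penalty of Papineni et al. (2002). It is an unsmoothed, sentence-level illustration that returns values in [0, 1]; the function names are assumptions, and the scores reported in the tables come from the authors' evaluation pipeline, which may tokenize, smooth, or aggregate differently.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed BLEU of one candidate against one or more references."""
    if not candidate:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0  # candidate is shorter than n tokens
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(count, max_ref_counts[gram])
                      for gram, count in cand_counts.items())
        if clipped == 0:
            return 0.0  # no smoothing: any zero precision gives BLEU 0
        log_precision_sum += math.log(clipped / total) / max_n
    # Brevity penalty uses the reference length closest to the candidate length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_precision_sum)


candidate = "create one more new layer".split()
references = [
    "create one more new layer".split(),
    "set blending mode of the layer to soft light".split(),
]
score = sentence_bleu(candidate, references)
print(score)  # 1.0

# A two-word summary has no 4-grams, so the unsmoothed score collapses to 0.
short = sentence_bleu("duplicate layer".split(), ["duplicate image".split()])
unigram_only = sentence_bleu("duplicate layer".split(),
                             ["duplicate image".split()], max_n=1)
```

The last two calls show why very short summaries are sensitive to the choice of maximum n-gram order and smoothing: the same pair scores 0 with 4-gram BLEU and about 0.5 with unigrams only.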

The quantitative results demonstrate that adding the attention layer and increasing the dropout rate significantly improve the summarization performance, but simply stacking more LSTM layers does not help. Compared to the machine translation task, where the existing model Luong et al. (2017) achieves a BLEU score close to 30, the best summarization model (BLEU: 21.53) is still sub-optimal. To better understand the errors, as in Section 5.2.1, we qualitatively analyze the outputs of the 1-layer-att-dropout-0.5 model. In Table 6, we present the summaries for sample sentences in the validation set, along with their ground-truth references. Our main findings are as follows.

  • Common and preliminary operations are easy to summarize. These operations include creating or duplicating layers (6-1, 2), and opening or loading documents (6-3, 4). The model can easily pick up keywords, such as “create”, “duplicate”, “layer”, “document” and “file”, and generate clean summaries accordingly.

  • Long text and complex operations pose challenges for summarization. The summaries tend to be incomplete (6-5, 6) or trivial (6-7) when the input text is long (6-7) or the operations are complex, e.g., adding diverse effects and involving multiple (6-7) or long-tail (6-5, 6) actions.

To handle the complex scenarios discussed above, summarization models need to paraphrase or accurately identify keywords and phrases from long text descriptions. In addition, as with command prediction, understanding the diverse effects that may be described is important for summarizing multiple or uncommon operations.
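In the models evaluated above, attention is the component that lets the decoder weight individual source tokens when producing each summary word. The sketch below illustrates additive attention in the style of Bahdanau et al. (2014) in plain Python. It is not the implementation used in the experiments: the function names, the identity weight matrices, and the two-dimensional toy states are assumptions chosen only to keep the example small.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def additive_attention(decoder_state, encoder_states, W_s, W_h, v):
    """Return (weights, context) for additive attention.

    score_i = v . tanh(W_s @ decoder_state + W_h @ encoder_states[i])
    """
    projected_state = matvec(W_s, decoder_state)
    scores = []
    for h in encoder_states:
        projected_h = matvec(W_h, h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, projected_h)]
        scores.append(sum(vi * x for vi, x in zip(v, hidden)))
    # Softmax over source positions (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context


# Toy parameters (assumed): identity projections and an all-ones scoring vector.
identity = [[1.0, 0.0], [0.0, 1.0]]
encoder_states = [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]]  # three source tokens
weights, context = additive_attention(
    decoder_state=[0.0, 0.0],
    encoder_states=encoder_states,
    W_s=identity,
    W_h=identity,
    v=[1.0, 1.0],
)
most_attended = max(range(len(weights)), key=weights.__getitem__)
```

In a trained model the projections are learned, so the highest weight would ideally fall on the tokens that name the operation, which is what long or multi-action inputs make difficult.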

6 Conclusions and Future work

We formalized CPK extraction and collected annotations for it, and showed that the task poses new methodological research challenges. As discussed, CPK has the potential to power intelligent applications for information retrieval, personalized learning, and autonomous design. Our future work includes further cleaning the collected annotations, building scalable knowledge extractors to populate and enrich CPK, and exploring tutorials in other formats, such as videos.


  • Adar et al. (2014) Eytan Adar, Mira Dontcheva, and Gierad Laput. 2014. Commandspace: modeling the relationships between tasks, descriptions and features. In UIST.
  • Angeli et al. (2014) Gabor Angeli, Julie Tibshirani, Jean Wu, and Christopher D Manning. 2014. Combining distant and partial supervision for relation extraction. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1556–1567.
  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • Bergmann and Gil (2014) Ralph Bergmann and Yolanda Gil. 2014. Similarity assessment and efficient retrieval of semantic workflows. Information Systems, 40:115–127.
  • Bird et al. (2009) Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc.
  • Bollacker et al. (2008) Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1247–1250. ACM.
  • Bunescu and Mooney (2005) Razvan C Bunescu and Raymond J Mooney. 2005. A shortest path dependency kernel for relation extraction. In Proceedings of the conference on human language technology and empirical methods in natural language processing, pages 724–731. Association for Computational Linguistics.
  • Chen (2017) Yuzhe Chen. 2017. A Statistical Machine Learning Approach to Generating Graph Structures from Food Recipes. Ph.D. thesis.
  • Colton and Wiggins (2012) Simon Colton and Geraint A. Wiggins. 2012. Computational creativity: The final frontier? In ECAI.
  • Crestan and Pantel (2010) Eric Crestan and Patrick Pantel. 2010. Web-scale knowledge extraction from semi-structured tables. In Proceedings of the 19th international conference on World wide web, pages 1081–1082. ACM.
  • Deshpande et al. (2013) Omkar Deshpande, Digvijay S Lamba, Michel Tourn, Sanjib Das, Sri Subramaniam, Anand Rajaraman, Venky Harinarayan, and AnHai Doan. 2013. Building, maintaining, and using knowledge bases: a report from the trenches. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 1209–1220. ACM.
  • Ganin et al. (2018) Yaroslav Ganin, Tejas Kulkarni, Igor Babuschkin, S. M. Ali Eslami, and Oriol Vinyals. 2018. Synthesizing programs for images using reinforced adversarial learning. CoRR, abs/1804.01118.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
  • Hoogeveen et al. (2015) Doris Hoogeveen, Karin M. Verspoor, and Timothy Baldwin. 2015. Cqadupstack: A benchmark data set for community question-answering research. In ADCS.
  • Jermsurawong and Habash (2015) Jermsak Jermsurawong and Nizar Habash. 2015. Predicting the structure of cooking recipes. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 781–786.
  • Jiang et al. (2012) Shangpu Jiang, Daniel Lowd, and Dejing Dou. 2012. Learning to refine an automatically extracted knowledge base using markov logic. In Data Mining (ICDM), 2012 IEEE 12th International Conference on, pages 912–917. IEEE.
  • Joulin et al. (2017) Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 427–431. Association for Computational Linguistics.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Kohlschütter et al. (2010) Christian Kohlschütter, Peter Fankhauser, and Wolfgang Nejdl. 2010. Boilerplate detection using shallow text features. In Proceedings of the third ACM international conference on Web search and data mining, pages 441–450. ACM.
  • Lau and Baldwin (2016) Jey Han Lau and Timothy Baldwin. 2016. An empirical evaluation of doc2vec with practical insights into document embedding generation. In Rep4NLP@ACL.
  • Le and Mikolov (2014) Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In ICML.
  • Li et al. (2011) Wei Li, Justin Matejka, Tovi Grossman, Joseph A. Konstan, and George W. Fitzmaurice. 2011. Design and evaluation of a command recommendation system for software applications. ACM Trans. Comput.-Hum. Interact., 18:6:1–6:35.
  • Li et al. (2017) Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. 2017. Universal style transfer via feature transforms. In NIPS.
  • Liu et al. (2017) Liyuan Liu, Xiang Ren, Qi Zhu, Shi Zhi, Huan Gui, Heng Ji, and Jiawei Han. 2017. Heterogeneous supervision for relation extraction: A representation learning approach. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 46–56.
  • Liu et al. (2010) Xiaojiang Liu, Zaiqing Nie, Nenghai Yu, and Ji-Rong Wen. 2010. Biosnowball: automated population of wikis. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 969–978. ACM.
  • Luong et al. (2017) Minh-Thang Luong, Eugene Brevdo, and Rui Zhao. 2017. Neural machine translation (seq2seq) tutorial.
  • Maas et al. (2011) Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In ACL.
  • Mahdisoltani et al. (2013) Farzaneh Mahdisoltani, Joanna Biega, and Fabian M Suchanek. 2013. Yago3: A knowledge base from multilingual wikipedias. In CIDR.
  • Manku et al. (2007) Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma. 2007. Detecting near-duplicates for web crawling. In Proceedings of the 16th international conference on World Wide Web, pages 141–150. ACM.
  • Manning et al. (2014) Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The stanford corenlp natural language processing toolkit. In Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, pages 55–60.
  • Mitchell et al. (2018) Tom Mitchell, William Cohen, Estevam Hruschka, Partha Talukdar, B Yang, J Betteridge, A Carlson, B Dalvi, M Gardner, B Kisiel, et al. 2018. Never-ending learning. Communications of the ACM, 61(5):103–115.
  • Niu et al. (2012) Feng Niu, Ce Zhang, Christopher Ré, and Jude Shavlik. 2012. Elementary: Large-scale knowledge-base construction via machine learning and statistical inference. International Journal on Semantic Web and Information Systems (IJSWIS), 8(3):42–73.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP).
  • Ran et al. (2015) Chenwei Ran, Wei Shen, Jianyong Wang, and Xuan Zhu. 2015. Domain-specific knowledge base enrichment using wikipedia tables. In Data Mining (ICDM), 2015 IEEE International Conference on, pages 349–358. IEEE.
  • Ren et al. (2017) Xiang Ren, Zeqiu Wu, Wenqi He, Meng Qu, Clare R Voss, Heng Ji, Tarek F Abdelzaher, and Jiawei Han. 2017. Cotype: Joint extraction of typed entities and relations with knowledge bases. In Proceedings of the 26th International Conference on World Wide Web, pages 1015–1024. International World Wide Web Conferences Steering Committee.
  • Ritter (2013) Alan L Ritter. 2013. Extracting Knowledge from Twitter and The Web. Ph.D. thesis.
  • Sangkloy et al. (2017) Patsorn Sangkloy, Jingwan Lu, Chen Fang, Fisher Yu, and James Hays. 2017. Scribbler: Controlling deep image synthesis with sketch and color. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6836–6845.
  • See et al. (2017) Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In ACL.
  • Simon and Oore (2017) Ian Simon and Sageev Oore. 2017. Performance rnn: Generating music with expressive timing and dynamics.
  • Surdeanu et al. (2012) Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D Manning. 2012. Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning, pages 455–465. Association for Computational Linguistics.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
  • Yang et al. (2017) Longqi Yang, Chen Fang, Hailin Jin, Matthew D. Hoffman, and Deborah Estrin. 2017. Personalizing software and web services by integrating unstructured application usage traces. In WWW.
  • Yih and Ma (2016) Wen-tau Yih and Hao Ma. 2016. Question answering with knowledge base, web and beyond. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pages 1219–1221. ACM.
  • Zhang (2015) Ce Zhang. 2015. DeepDive: a data management system for automatic knowledge base construction. Ph.D. thesis, The University of Wisconsin-Madison.
  • Zhang et al. (2016) Richard Zhang, Phillip Isola, and Alexei A. Efros. 2016. Colorful image colorization. In ECCV.