Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration

by Evan Zheran Liu, et al.
Stanford University

Reinforcement learning (RL) agents improve through trial-and-error, but when reward is sparse and the agent cannot discover successful action sequences, learning stagnates. This has been a notable problem in training deep RL agents to perform web-based tasks, such as booking flights or replying to emails, where a single mistake can ruin the entire sequence of actions. A common remedy is to "warm-start" the agent by pre-training it to mimic expert demonstrations, but this is prone to overfitting. Instead, we propose to constrain exploration using demonstrations. From each demonstration, we induce high-level "workflows" which constrain the allowable actions at each time step to be similar to those in the demonstration (e.g., "Step 1: click on a textbox; Step 2: enter some text"). Our exploration policy then learns to identify successful workflows and samples actions that satisfy these workflows. Workflows prune out bad exploration directions and accelerate the agent's ability to discover rewards. We use our approach to train a novel neural policy designed to handle the semi-structured nature of websites, and evaluate on a suite of web tasks, including the recent World of Bits benchmark. We achieve new state-of-the-art results, and show that workflow-guided exploration improves sample efficiency over behavioral cloning by more than 100x.





1 Introduction

We are interested in training reinforcement learning (RL) agents to use the Internet (e.g., to book flights or reply to emails) by directly controlling a web browser. Such systems could expand the capabilities of AI personal assistants (Stone & Soper, 2014), which are currently limited to interacting with machine-readable APIs, rather than the much larger world of human-readable web interfaces.

Reinforcement learning agents could learn to accomplish tasks using these human-readable web interfaces through trial-and-error (Sutton & Barto, 1998). But this learning process can be very slow in tasks with sparse reward, where the vast majority of naive action sequences lead to no reward signal (Vecerik et al., 2017; Nair et al., 2017). This is the case for many web tasks, which involve a large action space (the agent can type or click anything) and require a well-coordinated sequence of actions to succeed.

A common countermeasure in RL is to pre-train the agent to mimic expert demonstrations via behavioral cloning (Pomerleau, 1991; Kim et al., 2013), encouraging it to take similar actions in similar states. But in environments with diverse and complex states such as websites, demonstrations may cover only a small slice of the state space, and it is difficult to generalize beyond these states (overfitting). Indeed, previous work has found that warm-starting with behavioral cloning often fails to improve over pure RL (Shi et al., 2017). At the same time, simple strategies to combat overfitting (e.g. using fewer parameters or regularization) cripple the policy’s flexibility (Bitzer et al., 2010), which is required for complex spatial and structural reasoning in user interfaces.

In this work, we propose a different method for leveraging demonstrations. Rather than training an agent to directly mimic them, we use demonstrations to constrain exploration. By pruning away bad exploration directions, we can accelerate the agent’s ability to discover sparse rewards. Furthermore, because the agent is not directly exposed to demonstrations, we are free to use a sophisticated neural policy with a reduced risk of overfitting.

Preprocessing:
    for all demonstrations do
        Induce a workflow lattice from the demonstration
Every iteration:
    Observe an initial environment state
    The workflow policy samples a workflow from a lattice
    Roll out an episode from the workflow
    Use the episode to update the workflow policy
    if the episode gets reward +1 then
        Add the episode to the replay buffer
Periodically:
    if replay buffer size > threshold then
        Sample episodes from the replay buffer
        Update the neural policy with the sampled episodes
    Observe an initial environment state
    The neural policy rolls out an episode
    Update the neural policy and critic with the episode
    if the episode gets reward +1 then
        Add the episode to the replay buffer
Figure 1: Workflow-guided exploration (WGE). After inducing workflow lattices from demonstrations, the workflow policy performs exploration by sampling episodes from sampled workflows. Successful episodes are saved to a replay buffer, which is used to train the neural policy.

To constrain exploration, we employ the notion of a “workflow” (Deka et al., 2016). For instance, given an expert demonstration of how to forward an email, we might infer the following workflow:

  1. Click an email title

  2. Click a “Forward” button

  3. Type an email address into a textbox

  4. Click a “Send” button

This workflow is more high-level than an actual policy: it does not tell us exactly which email to click or which textbox to type into, but it helpfully constrains the set of actions at each time step. Furthermore, unlike a policy, it does not depend on the environment state: it is just a sequence of steps that can be followed blindly. In this sense, a workflow is environment-blind. The actual policy certainly should not be environment-blind, but for exploration, we found environment-blindness to be a good inductive bias.

To leverage workflows, we propose the workflow-guided exploration (WGE) framework as illustrated in Figure 1:

  1. For each demonstration, we extract a lattice of workflows that are consistent with the actions observed in the demonstration (Section 3).

  2. We then define a workflow exploration policy (Section 4), which explores by first selecting a workflow, and then sampling actions that fit the workflow. This policy gradually learns which workflow to select through reinforcement learning.

  3. Reward-earning episodes discovered during exploration enter a replay buffer, which we use to train a more powerful and expressive neural network policy (Section 5).
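The three steps above can be sketched as a single training iteration. This is a minimal illustration, not the authors' implementation: the `Episode` representation, the callables passed in, and the `threshold` and `batch_size` defaults are all hypothetical, and the on-policy rollouts of the neural policy shown in Figure 1 are left out for brevity.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical episode representation: (action trace, final reward).
Episode = Tuple[List[str], float]


def wge_iteration(
    rollout_workflow: Callable[[], Episode],
    update_workflow: Callable[[Episode], None],
    update_neural: Callable[[List[Episode]], None],
    replay_buffer: List[Episode],
    threshold: int = 3,
    batch_size: int = 2,
) -> None:
    """One exploration step, then a replay update once the buffer is large enough."""
    episode = rollout_workflow()      # the workflow policy explores
    update_workflow(episode)          # it learns from every episode it rolls out
    if episode[1] > 0:                # only reward-earning episodes are stored
        replay_buffer.append(episode)
    if len(replay_buffer) > threshold:
        batch = random.sample(replay_buffer, batch_size)
        update_neural(batch)          # off-policy update of the neural policy
```

Keeping the two policies behind plain callables makes the data flow explicit: the neural policy only ever sees episodes that already earned reward.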

A key difference between the web and traditional RL domains such as robotics (Atkeson & Schaal, 1997) or game-playing (Bellemare et al., 2013) is that the state space involves a mix of structured (e.g. HTML) and unstructured inputs (e.g. natural language and images). This motivates us to propose a novel neural network policy (DOMnet), specifically designed to perform flexible relational reasoning over the tree-structured HTML representation of websites.

We evaluate workflow-guided exploration and DOMnet on a suite of web interaction tasks, including the MiniWoB benchmark of Shi et al. (2017), the flight booking interface for Alaska Airlines, and a new collection of tasks that we constructed to study additional challenges such as noisy environments, variation in natural language, and longer time horizons. Compared to previous results on MiniWoB (Shi et al., 2017), which used 10 minutes of demonstrations per task (approximately 200 demonstrations on average), our system achieves much higher success rates and establishes new state-of-the-art results with only 3–10 demonstrations per task.

2 Setup

In the standard reinforcement learning setup, an agent learns a policy $\pi(a \mid s)$ that maps a state $s$ to a probability distribution over actions $a$. At each time step $t$, the agent observes an environment state $s_t$ and chooses an action $a_t$, which leads to a new state $s_{t+1}$ and a reward $r_t$. The goal is to maximize the expected return $\mathbb{E}[R]$, where $R = \sum_t \gamma^t r_t$ and $\gamma$ is a discount factor. Typical reinforcement learning agents learn through trial-and-error: rolling out episodes and adjusting their policy based on the results of those episodes.

We focus on settings where the reward is delayed and sparse. Specifically, we assume that (1) the agent receives reward only at the end of the episode, and (2) the reward is high (e.g., $+1$) for only a small fraction of possible trajectories and is uniformly low (e.g., $-1$) otherwise. With large state and action spaces, it is difficult for the exploration policy to find episodes with positive rewards, which prevents the policy from learning effectively.

We further assume that the agent is given a goal $g$, which can either be a structured key-value mapping (e.g., {task: forward, from: Bob, to: Alice}) or a natural language utterance (e.g., “Forward Bob’s message to Alice”). The agent’s state $s$ consists of the goal $g$ and the current state of the web page, represented as a tree of elements (henceforth DOM tree). We restrict the action space to click actions Click(e) and type actions Type(e,t), where e is a leaf element of the DOM tree, and t is a string from the goal (a value from a structured goal, or consecutive tokens from a natural language goal). Figure 2 shows an example episode for an email processing task. The agent receives reward $+1$ if the task is completed correctly, and reward $-1$ otherwise.
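To make the restricted action space concrete, the following sketch enumerates Click(e) and Type(e,t) actions for a given goal. The class names mirror the text, but the string element ids, the whitespace tokenizer, and the helper names `candidate_strings` and `action_space` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class Click:
    element: str  # hypothetical id of a leaf DOM element


@dataclass(frozen=True)
class Type:
    element: str
    text: str     # a string taken from the goal


Goal = Union[Dict[str, str], str]


def candidate_strings(goal: Goal) -> List[str]:
    """Values of a structured goal, or all consecutive-token spans of an utterance."""
    if isinstance(goal, dict):
        return list(goal.values())
    tokens = goal.split()  # assumption: whitespace tokenization
    return [
        " ".join(tokens[i:j])
        for i in range(len(tokens))
        for j in range(i + 1, len(tokens) + 1)
    ]


def action_space(leaves: List[str], goal: Goal) -> List[Union[Click, Type]]:
    """All click actions plus all type actions over the goal strings."""
    strings = candidate_strings(goal)
    clicks = [Click(e) for e in leaves]
    types = [Type(e, t) for e in leaves for t in strings]
    return clicks + types
```

Even this tiny setting shows why exploration is hard: the number of type actions grows with both the number of leaf elements and the number of goal spans.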

3 Inducing workflows from demonstrations

Given a collection of expert demonstrations $d = (\tilde{s}_1, \tilde{a}_1, \ldots, \tilde{s}_T, \tilde{a}_T)$, we would like to explore actions $a_t$ that are “similar” to the demonstrated actions $\tilde{a}_t$. Workflows capture this notion of similarity by specifying a set of similar actions at each time step. Formally, a workflow $z_{1:T}$ is a sequence of workflow steps, where each step $z_t$ is a function that takes a state $s_t$ and returns a constrained set $z_t(s_t)$ of similar actions. We use a simple compositional constraint language (Appendix A) to describe workflow steps. For example, with $z_t = \texttt{Click(Tag("img"))}$, the set $z_t(s_t)$ contains click actions on any DOM element in $s_t$ with tag img.
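A workflow step, viewed as a function from a state to a set of allowed actions, might look like the sketch below. The flat-list state, the dictionary element format, and the helper `click_tag` are hypothetical stand-ins for the constraint language of Appendix A.

```python
from typing import Callable, Dict, List, Set, Tuple

Element = Dict[str, str]   # hypothetical DOM element: {"id", "tag", "text"}
State = List[Element]      # hypothetical state: flat list of leaf elements
Action = Tuple[str, str]   # hypothetical action: ("click", element id)
WorkflowStep = Callable[[State], Set[Action]]


def click_tag(tag: str) -> WorkflowStep:
    """Workflow step in the spirit of Click(Tag(tag)): click any element with that tag."""
    def step(state: State) -> Set[Action]:
        return {("click", e["id"]) for e in state if e["tag"] == tag}
    return step


state = [
    {"id": "e1", "tag": "img", "text": ""},
    {"id": "e2", "tag": "div", "text": "Bob"},
    {"id": "e3", "tag": "img", "text": ""},
]
allowed = click_tag("img")(state)  # click actions on both img elements
```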

Figure 2: From each demonstration, we induce a workflow lattice based on the actions in that demonstration. Given a new environment, the workflow policy samples a workflow (a path in the lattice, as shown in bold) and then samples actions that fit the steps of the workflow.

We induce a set of workflows from each demonstration $d = (\tilde{s}_1, \tilde{a}_1, \ldots, \tilde{s}_T, \tilde{a}_T)$ as follows. For each time step $t$, we enumerate a set $Z_t$ of all possible workflow steps $z_t$ such that $\tilde{a}_t \in z_t(\tilde{s}_t)$. The set of workflows is then the cross product $Z_1 \times \cdots \times Z_T$ of the steps. We can represent the induced workflows as paths in a workflow lattice as illustrated in Figure 2.
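The induction procedure just described (keep every candidate step whose action set contains the demonstrated action, then take the cross product over time steps) can be sketched as follows. The candidate pool `CANDIDATES`, the helper `click_where`, and the toy inbox are invented for illustration; shortcut steps are not modeled.

```python
from itertools import product
from typing import Callable, Dict, List, Set, Tuple

State = List[Dict[str, str]]   # hypothetical: flat list of DOM elements
Action = Tuple[str, str]       # hypothetical: ("click", element id)
Step = Callable[[State], Set[Action]]
Demonstration = List[Tuple[State, Action]]


def click_where(key: str, value: str) -> Step:
    """Click any element whose attribute `key` equals `value`."""
    return lambda state: {("click", e["id"]) for e in state if e[key] == value}


# Hypothetical candidate steps, named after the constraint language in the text.
CANDIDATES: Dict[str, Step] = {
    'Click(Tag("div"))': click_where("tag", "div"),
    'Click(Tag("button"))': click_where("tag", "button"),
    'Click(Text("Forward"))': click_where("text", "Forward"),
}


def induce_workflows(demo: Demonstration) -> List[Tuple[str, ...]]:
    """Keep steps consistent with each demonstrated action, then cross them."""
    per_step = [
        [name for name, step in CANDIDATES.items() if action in step(state)]
        for state, action in demo
    ]
    return list(product(*per_step))


inbox = [
    {"id": "t1", "tag": "div", "text": "Bob"},
    {"id": "b1", "tag": "button", "text": "Forward"},
]
demo = [(inbox, ("click", "t1")), (inbox, ("click", "b1"))]
workflows = induce_workflows(demo)  # one choice at step 1, two at step 2
```

Each tuple in `workflows` corresponds to one path through the lattice.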

To handle noisy demonstrations where some actions are unnecessary (e.g., when the demonstrator accidentally clicks on the background), we add shortcut steps that skip certain time steps. We also add shortcut steps for any consecutive actions that can be collapsed into a single equivalent action (e.g., collapsing two type actions on the same DOM element into a single Type step). These shortcuts allow the lengths of the induced workflows to differ from the length of the demonstration. We henceforth ignore these shortcut steps to simplify the notation.

The induced workflow steps are not equally effective. For example in Figure 2, the workflow step Click(Near(Text("Bob"))) (Click an element near text “Bob”) is too specific to the demonstration scenario, while Click(Tag("div")) (Click on any <div> element) is too general and covers too many irrelevant actions. The next section describes how the workflow policy learns which workflow steps to use.

4 Workflow exploration policy

Our workflow policy $\pi_w$ interacts with the environment to generate an episode in the following manner. At the beginning of the episode, the policy conditions on the provided goal $g$, and selects a demonstration $d$ that carried out a similar goal:

$$d \sim \mathrm{softmax}\big(\mathrm{sim}(g, g_d)\big)$$

where $\mathrm{sim}(g, g_d)$ measures the similarity between $g$ and the goal $g_d$ of demonstration $d$. In our tasks, we simply let $\mathrm{sim}(g, g_d)$ be 1 if the structured goals share the same keys, and $-\infty$ otherwise.

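Then, at each time step $t$ with environment state $s_t$, we sample a workflow step $z_t$ according to the following distribution:

$$z_t \sim \pi_w(z \mid d, t) \propto \exp(\psi_{z,t,d})$$

where each $\psi_{z,t,d}$ is a separate scalar parameter to be learned. Finally, we sample an action $a_t$ uniformly from the set $z_t(s_t)$:

$$a_t \sim p(a \mid z_t, s_t) = \frac{1}{|z_t(s_t)|}$$

The overall probability of exploring an episode $e = (s_1, a_1, \ldots, s_T, a_T)$ is then:

$$p(e \mid g) = p(d \mid g) \prod_{t=1}^{T} p(s_t \mid s_{t-1}, a_{t-1}) \sum_{z} p(a_t \mid z, s_t)\, \pi_w(z \mid d, t)$$

where $p(s_t \mid s_{t-1}, a_{t-1})$ is the (unknown) state transition probability.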

Note that the workflow policy's step distribution $\pi_w(z \mid d, t)$ is not a function of the environment states $s_t$ at all. Its decisions only depend on the selected demonstration and the current time $t$. This environment-blindness means that the workflow policy uses far fewer parameters than a state-dependent policy, enabling it to learn more quickly and preventing overfitting. Due to environment-blindness, the workflow policy cannot solve the task, but it quickly learns certain good behaviors, which can help the neural policy learn.
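The environment-blind sampling described in this section can be sketched with one scalar per (time step, candidate step) for a single selected demonstration. The class and function names are hypothetical, and the sketch assumes the step distribution is a softmax over those scalars.

```python
import math
import random
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scalars."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


class WorkflowPolicy:
    """Environment-blind policy for one demonstration: psi[t][z] is a learned scalar."""

    def __init__(self, num_candidates_per_step: Sequence[int]) -> None:
        self.psi = [[0.0] * n for n in num_candidates_per_step]

    def step_probs(self, t: int) -> List[float]:
        # Depends only on the time step, never on the environment state.
        return softmax(self.psi[t])

    def sample_step(self, t: int, rng: random.Random) -> int:
        weights = self.step_probs(t)
        return rng.choices(range(len(weights)), weights=weights)[0]


def sample_action(allowed_actions: Sequence[str], rng: random.Random) -> str:
    """Uniform choice among the actions the sampled workflow step allows."""
    return rng.choice(list(allowed_actions))
```

The state enters only through `allowed_actions`, which a workflow step computes from the current page; the learned parameters themselves never see it.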

To train the workflow policy, we use a variant of the REINFORCE algorithm (Williams, 1992; Sutton & Barto, 1998). In particular, after rolling out an episode $e = (s_1, a_1, \ldots, s_T, a_T)$, we approximate the gradient using the unbiased estimate

$$\sum_{t} (G_t - v_{d,t})\, \nabla \log \sum_{z} p(a_t \mid z, s_t)\, \pi_w(z \mid d, t)$$

where $G_t$ is the return at time step $t$ and $v_{d,t}$ is a baseline term for variance reduction.
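As a rough illustration of one per-time-step term of such a REINFORCE update, the sketch below assumes softmax step probabilities over scalar parameters and marginalizes over the workflow steps that could have produced the taken action. Under those assumptions the gradient of the log-likelihood with respect to each scalar is the posterior probability of the step given the action minus its prior probability. The function name and argument layout are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def reinforce_step_gradient(
    psi_t: Sequence[float],
    action_likelihoods: Sequence[float],
    advantage: float,
) -> List[float]:
    """Gradient of advantage * log sum_z p(a | z, s) * pi(z) w.r.t. psi_t.

    action_likelihoods[z] is p(a | z, s): 1/|z(s)| if step z allows the taken
    action and 0 otherwise. advantage is the return minus the baseline.
    """
    pi = softmax(psi_t)
    joint = [p * l for p, l in zip(pi, action_likelihoods)]
    evidence = sum(joint)  # positive whenever the action was sampled from the policy
    posterior = [j / evidence for j in joint]
    return [advantage * (q - p) for q, p in zip(posterior, pi)]
```

Steps that could not have produced a rewarded action are pushed down, and steps that could have are pushed up in proportion to how much of the action's probability they explain.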

Sampled episodes from the workflow policy that receive a positive reward are stored in a replay buffer, which will be used for training the neural policy $\pi_n$.

5 Neural policy

As outlined in Figure 1, the neural policy is learned using both on-policy and off-policy updates (where episodes are drawn from the replay buffer). Both updates use A2C, the synchronous version of the advantage actor-critic algorithm (Mnih et al., 2016). Since only episodes with reward +1 enter the replay buffer, the off-policy updates behave similarly to supervised learning on optimal trajectories. Furthermore, successful episodes discovered during on-policy exploration are also added to the replay buffer.

Model architecture.

We propose DOMnet, a neural architecture that captures the spatial and hierarchical structure of the DOM tree. As illustrated in Figure 5, the model first embeds the DOM elements and the input goal, and then applies a series of attentions on the embeddings to finally produce a distribution over actions $\pi_n(a \mid s)$ and a value function $V(s)$, the critic. We highlight our novel DOM embedder, and defer other details to Appendix C.

We design our DOM embedder to capture the various interactions between DOM elements, similar to recent work in graph embeddings (Kipf & Welling, 2017; Pham et al., 2017; Hamilton et al., 2017). In particular, DOM elements that are “related” (e.g., a checkbox and its associated label) should pass their information to each other.

To embed a DOM element $e$, we first compute the base embedding $v^e_{\text{base}}$ by embedding and concatenating its attributes (tag, classes, text, etc.). In order to capture the relationships between DOM elements, we next compute two types of neighbor embeddings:

  1. We define spatial neighbors of $e$ to be any element $e'$ within 30 pixels from $e$, and then sum up their base embeddings to get the spatial neighbor embedding $v^e_{\text{spatial}}$.

  2. We define depth-$k$ tree neighbors of $e$ to be any element $e'$ such that the least common ancestor of $e$ and $e'$ in the DOM tree has depth at most $k$. Intuitively, tree neighbors of a higher depth are more related. For each depth $k$, we apply a learnable affine transformation $f$ on the base embedding of each depth-$k$ tree neighbor $e'$, and then apply max pooling to get $v^e_{\text{tree}}[k] = \max_{e'} f(v^{e'}_{\text{base}})$. We let the tree neighbor embedding $v^e_{\text{tree}}$ be the concatenation of $v^e_{\text{tree}}[k]$ over the depths $k$ considered.

Finally, we define the goal matching embedding $v^e_{\text{match}}$ to be the sum of the embeddings of all words in $e$ that also appear in the goal. The final embedding of $e$ is the concatenation of the four embeddings $[v^e_{\text{base}}; v^e_{\text{spatial}}; v^e_{\text{tree}}; v^e_{\text{match}}]$.
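The spatial-neighbor and goal-matching parts of the embedder can be illustrated with plain lists. This sketch takes base embeddings as given, measures the 30-pixel radius as Euclidean distance between element centers (an assumption, since the text does not define the distance), lowercases and splits on whitespace (also an assumption), and omits the tree-neighbor embedding, so it concatenates three parts rather than four. All function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def add_all(vectors: Sequence[Vector], dim: int) -> Vector:
    """Element-wise sum; the zero vector if there is nothing to sum."""
    total = [0.0] * dim
    for v in vectors:
        for k in range(dim):
            total[k] += v[k]
    return total


def spatial_neighbor_embedding(
    index: int,
    centers: Sequence[Tuple[float, float]],
    base: Sequence[Vector],
    radius: float = 30.0,
) -> Vector:
    """Sum of base embeddings of the other elements within `radius` pixels."""
    neighbors = [
        base[j]
        for j in range(len(base))
        if j != index and math.dist(centers[index], centers[j]) <= radius
    ]
    return add_all(neighbors, len(base[index]))


def goal_match_embedding(
    text: str, goal: str, word_vectors: Dict[str, Vector], dim: int
) -> Vector:
    """Sum of embeddings of the element's words that also occur in the goal."""
    goal_words = set(goal.lower().split())
    matched = [
        word_vectors[w]
        for w in text.lower().split()
        if w in goal_words and w in word_vectors
    ]
    return add_all(matched, dim)


def embed_element(index, centers, base, texts, goal, word_vectors, word_dim) -> Vector:
    """Concatenate base, spatial-neighbor and goal-matching embeddings."""
    return (
        list(base[index])
        + spatial_neighbor_embedding(index, centers, base)
        + goal_match_embedding(texts[index], goal, word_vectors, word_dim)
    )
```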

6 Experiments

6.1 Task setups

We evaluate our approach on three suites of interactive web tasks:

  1. MiniWoB: the MiniWoB benchmark of Shi et al. (2017).

  2. MiniWoB++: a new set of tasks that we constructed to incorporate additional challenges not present in MiniWoB, such as stochastic environments and variation in natural language.

  3. Alaska: the mobile flight booking interface for Alaska Airlines, inspired by the FormWoB benchmark of Shi et al. (2017).

We describe the common task settings of the MiniWoB and MiniWoB++ benchmarks, and defer the description of the Alaska benchmark to Section 6.3.3.


Each task contains a 160px × 210px environment and a goal specified in text. The majority of the tasks return a single sparse reward at the end of the episode: either $+1$ (success) or $-1$ (failure). For greater consistency among tasks, we disabled all partial rewards in our experiments. The agent has access to the environment via a Selenium web driver interface.

The public MiniWoB benchmark contains 80 tasks. We filtered for the 40 tasks that only require actions in our action space, namely clicking on DOM elements and typing strings from the input goal. Many of the excluded tasks involve somewhat specialized reasoning, such as being able to compute the angle between two lines, or solve algebra problems. For each task, we used Amazon Mechanical Turk to collect 10 demonstrations, which record all mouse and keyboard events along with the state of the DOM when each event occurred.

Evaluation metric.

We report success rate: the percentage of test episodes with reward $+1$. Since we have removed partial rewards, success rate is a linear scaling of the average reward, and is equivalent to the definition of success rate in Shi et al. (2017).
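The linear relationship between success rate and average reward is easy to verify under the assumption that every episode ends with reward +1 or -1: if a fraction p of episodes succeed, the mean reward is p - (1 - p) = 2p - 1. The function names below are hypothetical.

```python
from typing import Sequence


def success_rate(rewards: Sequence[float]) -> float:
    """Fraction of episodes that ended with reward +1."""
    return sum(1 for r in rewards if r == 1) / len(rewards)


def success_rate_from_mean(mean_reward: float) -> float:
    """Invert mean = 2p - 1, assuming rewards are only +1 or -1."""
    return (mean_reward + 1.0) / 2.0


rewards = [1, 1, -1, 1]
mean_reward = sum(rewards) / len(rewards)
```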

6.2 Main results

Figure 3: Success rates of different approaches on the MiniWoB tasks. DOMnet+WGE outperforms Shi17 on all but two tasks and effectively solves a vast majority.

| Task | Description | Steps | BC+RL | Workflow policy only | WGE |
|---|---|---|---|---|---|
| click-checkboxes | Click 0–6 specified checkboxes | 7 | 98 | 81 | 100 |
| click-checkboxes-large | …5–12 targets | 13 | 0 | 43 | 84 |
| click-checkboxes-soft | …specifies synonyms of the targets | 7 | 51 | 34 | 94 |
| click-checkboxes-transfer | …training data has 0–3 targets | 7 | 64 | 17 | 64 |
| multi-ordering | Fill a form with varying field orderings | 4 | 5 | 78 | 100 |
| multi-layout | Fill a form with varying UI layouts | 4 | 99 | 9 | 100 |
| social-media | Do an action on the specified Tweet | 2 | 15 | 2 | 100 |
| social-media-all | …on all matching Tweets | 12 | 1 | 0 | 0 |
| social-media-some | …on specified no. of matching Tweets | 12 | 2 | 3 | 42 |
| email-inbox | Perform tasks on an email inbox | 4 | 43 | 3 | 99 |
| email-inbox-nl | …natural language goal | 4 | 28 | 0 | 93 |

Table 1: Results on additional tasks, reported as success rate (%). Steps = task length as the maximum number of steps needed for a perfect policy to complete the task.

We compare the success rates across the MiniWoB tasks of the following approaches:

  • Shi17: the system from Shi et al. (2017), pre-trained with behavioral cloning on 10 minutes of demonstrations (approximately 200 demonstrations on average) and fine-tuned with RL. Unlike DOMnet, this system primarily uses a pixel-based representation of the state, augmented with filters that activate on textual elements which overlap with goal text.

  • DOMnet+BC+RL: our proposed neural policy, DOMnet, but pre-trained with behavioral cloning on 10 demonstrations and fine-tuned with RL, like Shi17. During behavioral cloning, we apply early stopping based on the reward on a validation set.

  • DOMnet+WGE: our proposed neural policy, DOMnet, trained with workflow-guided exploration on 10 demonstrations.

For DOMnet+BC+RL and DOMnet+WGE, we report the test success rate at the time step where the success rate on a validation set reaches its maximum.

The results are shown in Figure 3. By comparing Shi17 with DOMnet+BC+RL, we can roughly evaluate the contribution of our new neural architecture DOMnet, since the two share the same training procedure (BC+RL). While Shi17 also uses the DOM tree to compute text alignment features in addition to the pixel-level input, our DOMnet uses the DOM structure more explicitly. We find DOMnet+BC+RL to empirically improve the success rate over Shi17 on most tasks.

By comparing DOMnet+BC+RL and DOMnet+WGE, we find that workflow-guided exploration enables DOMnet to perform even better on the more difficult tasks, which we analyze in the next section. Some of the workflows that the workflow policy learns are shown in Appendix B.

6.3 Analysis

6.3.1 MiniWoB++ benchmark

We constructed and released the MiniWoB++ benchmark of tasks to study additional challenges a web agent might encounter, including: longer time horizons (click-checkboxes-large), “soft” reasoning about natural language (click-checkboxes-soft), and stochastically varying layouts (multi-orderings, multi-layouts). Table 1 lists the tasks and their time horizons (number of steps needed for a perfect policy to carry out the longest goal) as a crude measure of task complexity.

We first compare the performance of DOMnet trained with BC+RL (baseline) and DOMnet trained with WGE (our full approach). The proposed WGE model outperforms the BC+RL model by an average of 42% absolute success rate. We analyzed their behaviors and noticed two common failure modes of training with BC+RL that are mitigated by instead training with WGE:

  1. The BC+RL model has a tendency to take actions that prematurely terminate the episode (e.g., hitting “Submit” in click-checkboxes-large before all required boxes are checked). One likely cause is that these actions occur across all demonstrations, while other non-terminating actions (e.g., clicking different checkboxes) vary across demonstrations.

  2. The BC+RL model occasionally gets stuck in cyclic behavior such as repeatedly checking and unchecking the same checkbox.

These failure modes stem from overfitting to parts of the demonstrations, which WGE avoids.
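The average gap of 42% absolute success rate quoted above can be reproduced from the BC+RL and WGE columns of Table 1. The snippet below simply copies those two columns; the dictionary name and helper are hypothetical.

```python
from typing import Dict, Tuple

# (BC+RL, WGE) success rates in percent, copied from Table 1.
TABLE_1: Dict[str, Tuple[int, int]] = {
    "click-checkboxes": (98, 100),
    "click-checkboxes-large": (0, 84),
    "click-checkboxes-soft": (51, 94),
    "click-checkboxes-transfer": (64, 64),
    "multi-ordering": (5, 100),
    "multi-layout": (99, 100),
    "social-media": (15, 100),
    "social-media-all": (1, 0),
    "social-media-some": (2, 42),
    "email-inbox": (43, 99),
    "email-inbox-nl": (28, 93),
}


def mean_absolute_gain(table: Dict[str, Tuple[int, int]]) -> float:
    """Average of (WGE - BC+RL) over all tasks, in percentage points."""
    gains = [wge - bc_rl for bc_rl, wge in table.values()]
    return sum(gains) / len(gains)


gain = mean_absolute_gain(TABLE_1)  # 470 / 11, about 42.7 points
```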

Next, we analyze the workflow policy learned by WGE. The workflow policy by itself is too simplistic to work well at test time for several reasons:

  1. Workflows ignore environment state and therefore cannot respond to the differences in the environment, such as the different layouts in multi-layouts.

  2. The workflow constraint language lacks the expressivity to specify certain actions, such as clicking on synonyms of a particular word in click-checkboxes-soft.

  3. The workflow policy lacks expressivity to select the correct workflow for a given goal.

Nonetheless, the workflow policy is sufficiently constrained to discover reward some of the time, and the neural policy is able to learn the right behavior from such episodes. As such, the neural policy can achieve high success rates even when the workflow policy performs poorly.

6.3.2 Natural language inputs

While MiniWoB tasks provide structured goals, we can also apply our approach to natural language goals. We collected a training dataset using the overnight data collection technique (Wang et al., 2015). In the email-inbox-nl task, we collected natural language templates by asking annotators to paraphrase the task goals (e.g., “Forward Bob’s message to Alice” → “Email Alice the email I got from Bob”) and then abstracting out the fields (“Email <TO> the email I got from <FROM>”). During training, the workflow policy receives states with both the structured goal and the natural language utterance generated from a random template, while the neural policy receives only the utterance. At test time, the neural policy is evaluated on unseen utterances. The results in Table 1 show that the WGE model can learn to understand natural language goals (93% success rate).
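The template step described above (abstracting fields out of a paraphrase, then filling a template for a new goal) can be sketched with plain string replacement. The function names and the `<KEY>` placeholder convention are assumptions, and naive substring replacement would need more care if one field value occurred inside another word.

```python
from typing import Dict


def abstract_fields(paraphrase: str, goal: Dict[str, str]) -> str:
    """Replace each goal value in a paraphrase with a <KEY> placeholder."""
    template = paraphrase
    for key, value in goal.items():
        template = template.replace(value, f"<{key.upper()}>")
    return template


def instantiate(template: str, goal: Dict[str, str]) -> str:
    """Fill the placeholders of a template with the values of a new goal."""
    utterance = template
    for key, value in goal.items():
        utterance = utterance.replace(f"<{key.upper()}>", value)
    return utterance


template = abstract_fields(
    "Email Alice the email I got from Bob", {"from": "Bob", "to": "Alice"}
)
utterance = instantiate(template, {"from": "Dave", "to": "Carol"})
```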

Note that the workflow policy needs access to the structured inputs only because our constraint language for workflow steps operates on structured inputs. The constraint language could potentially be modified to work with utterances directly (e.g., After("to") extracts the utterance word after “to”), but we leave this for future work.

6.3.3 Scaling to real world tasks

We applied our approach on the Alaska benchmark, a more realistic flight search task on the Alaska Airlines mobile site inspired by the FormWoB task in Shi et al. (2017). In this task, the agent must complete the flight search form with the provided information (6–7 fields). We ported the web page to the MiniWoB framework with a larger 375px × 667px screen, replaced the server backend with a surrogate JavaScript function, and clamped the environment date to March 1, 2017.

Following Shi et al. (2017), we give partial reward based on the fraction of correct fields in the submitted form if all required fields are filled in. Despite this partial reward, the reward is still extremely sparse: there are over 200 DOM elements (compared to 10–50 in MiniWoB tasks), and a typical episode requires at least 11 actions involving various types of widgets such as autocompletes and date pickers. The probability that a random agent gets positive reward is less than $10^{-20}$.
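A partial reward of this kind could be computed as in the sketch below. The field names are invented, and returning 0.0 when a required field is missing is an assumption: the text only says that partial reward is given when all required fields are filled in.

```python
from typing import Dict, Sequence


def partial_reward(
    submitted: Dict[str, str],
    target: Dict[str, str],
    required: Sequence[str],
) -> float:
    """Fraction of correct fields, granted only if every required field is filled."""
    if any(not submitted.get(field) for field in required):
        return 0.0  # assumption: no partial credit for an incomplete form
    correct = sum(1 for key, value in target.items() if submitted.get(key) == value)
    return correct / len(target)


# Hypothetical flight-search form with four fields.
target = {"from": "SEA", "to": "SFO", "depart": "3/5", "return": "3/9"}
submitted = {"from": "SEA", "to": "SFO", "depart": "3/5", "return": "3/8"}
reward = partial_reward(submitted, target, required=["from", "to", "depart"])
```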

We first performed experiments on Alaska-Shi17, a clone of the original Alaska Airlines task in Shi et al. (2017), where the goal always specifies a roundtrip flight (two airports and two dates). On their dataset, our approach, using only 1 demonstration, achieves an average reward of 0.97, compared to their best result of 0.57, which uses around 80 demonstrations.

Our success motivated us to test on a more difficult version of the task which additionally requires selecting flight type (a checkbox for one-way flight), number of passengers (an increment-decrement counter), and seat type (hidden under an accordion). We achieve an average reward of 0.86 using 10 demonstrations. This demonstrates our method can handle long horizons on real-world websites.

6.3.4 Sample efficiency

Figure 4: Comparison between DOMnet+BC+RL and DOMnet+WGE on several of the most difficult tasks, evaluated on test reward. DOMnet+WGE trained on 10 demonstrations outperforms DOMnet+BC+RL even with 1000 demonstrations.

To evaluate the demonstration efficiency of our approach, we compare DOMnet+WGE with DOMnet+BC+RL trained on increased numbers of demonstrations. Specifically, we compare DOMnet+WGE trained on 10 demonstrations with DOMnet+BC+RL trained on up to 1000 demonstrations. The test rewards on several of the hardest tasks are summarized in Figure 4. (We report test reward since success rate is artificially high in the Alaska task due to partial rewards.)

Increasing the number of demonstrations improves the performance of BC+RL, as it helps prevent overfitting. However, on every evaluated task, WGE trained with only 10 demonstrations still achieves much higher test reward than BC+RL with 1000 demonstrations. This corresponds to an over 100x sample efficiency improvement of our method over behavioral cloning in terms of the number of demonstrations.

7 Discussion

Learning agents for the web.

Previous work on learning agents for web interactions falls into two main categories. First, simple programs may be specified by the user (Yeh et al., 2009) or may be inferred from demonstrations (Allen et al., 2007). Second, soft policies may be learned from scratch or “warm-started” from demonstrations (Shi et al., 2017). Notably, sparse rewards prevented Shi et al. (2017) from successfully learning, even when using a moderate number of demonstrations. While policies have proven to be more difficult to learn, they have the potential to be expressive and flexible. Our work takes a step in this direction.

Sparse rewards without prior knowledge.

Numerous works attempt to address sparse rewards without incorporating any additional prior knowledge. Exploration methods (Osband et al., 2016; Chentanez et al., 2005; Weber et al., 2017) help the agent better explore the state space to encounter more reward; shaping rewards (Ng et al., 1999) directly modify the reward function to encourage certain behaviors; and other works (Jaderberg et al., 2016; Andrychowicz et al., 2017) augment the reward signal with additional unsupervised reward. However, without prior knowledge, helping the agent receive additional reward is difficult in general.

Imitation learning.

Various methods have been proposed to leverage additional signals from experts. For instance, when an expert policy is available, methods such as DAgger (Ross et al., 2011) and AggreVaTe (Ross & Bagnell, 2014; Sun et al., 2017) can query the expert policy to augment the dataset for training the agent. When only expert demonstrations are available, inverse reinforcement learning methods (Abbeel & Ng, 2004; Ziebart et al., 2008; Finn et al., 2016; Ho & Ermon, 2016; Baram et al., 2017) infer a reward function from the demonstrations without using reinforcement signals from the environment.

The usual method for incorporating both demonstrations and reinforcement signals is to pre-train the agent with demonstrations before applying RL. Recent work extends this technique by (1) introducing different objective functions and regularization during pre-training, and (2) mixing demonstrations and rolled-out episodes during RL updates (Hosu & Rebedea, 2016; Hester et al., 2018; Vecerik et al., 2017; Nair et al., 2017).

Instead of training the agent on demonstrations directly, our work uses demonstrations to guide exploration. The core idea is to explore trajectories that lie in a “neighborhood” surrounding an expert demonstration. In our case, the neighborhood is defined by a workflow, which only permits action sequences analogous to the demonstrated actions. Several previous works also explore neighborhoods of demonstrations via reward shaping (Brys et al., 2015; Hussein et al., 2017) or off-policy sampling (Levine & Koltun, 2013). One key distinction of our work is that we define neighborhoods in terms of action similarity rather than state similarity. This distinction is particularly important for the web tasks: we can easily and intuitively describe how two actions are analogous (e.g., “they both type a username into a textbox”), while it is harder to decide if two web page states are analogous (e.g., the email inboxes of two different users will have completely different emails, but they could still be analogous, depending on the task.)

Hierarchical reinforcement learning.

Hierarchical reinforcement learning (HRL) methods decompose complex tasks into simpler subtasks that are easier to learn. Main HRL frameworks include abstract actions (Sutton et al., 1999; Konidaris & Barto, 2007; Hauser et al., 2008), abstract partial policies (Parr & Russell, 1998), and abstract states (Roderick et al., 2017; Dietterich, 1998; Li et al., 2006). These frameworks require varying amounts of prior knowledge. The original formulations required programmers to manually specify the decomposition of the complex task, while Andreas et al. (2016) only require supervision to identify subtasks, and Bacon et al. (2017) and Daniel et al. (2016) learn the decomposition fully automatically, at the cost of performance.

Within the HRL methods, our work is closest to Parr & Russell (1998) and the line of work on constraints in robotics (Phillips et al., 2016; Perez-D’Arpino & Shah, 2017). The work in Parr & Russell (1998) specifies partial policies, which constrain the set of possible actions at each state, similar to our workflow items. In contrast to previous instantiations of the HAM framework (Andre, 2003; Marthi & Guestrin, 2005), which require programmers to specify these constraints manually, our work automatically induces constraints from user demonstrations, which do not require special skills to provide. Phillips et al. (2016) and Perez-D’Arpino & Shah (2017) also resemble our work in learning constraints from demonstrations, but differ in the way they use the demonstrations. Whereas our work uses the learned constraints for exploration, Phillips et al. (2016) only use the constraints for planning and Perez-D’Arpino & Shah (2017) build a knowledge base of constraints to use at test time.


Our workflow-guided framework represents a judicious combination of demonstrations, abstractions, and expressive neural policies. We leverage the targeted information of demonstrations and the inductive bias of workflows, but use them only for exploration, which protects the expressive neural policy from overfitting. As a result, we are able to learn rather complex policies from a very sparse reward signal and very few demonstrations.


This work was supported by NSF CAREER Award IIS-1552635.


Our code and data are available at Reproducible experiments are available on the CodaLab platform at


  • Abbeel & Ng (2004) P. Abbeel and A. Ng. Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning (ICML), 2004.
  • Allen et al. (2007) J. Allen, N. Chambers, G. Ferguson, L. Galescu, H. Jung, M. Swift, and W. Taysom. PLOW: A collaborative task learning agent. In Association for the Advancement of Artificial Intelligence (AAAI), pp. 1514–1519, 2007.
  • Andre (2003) D. Andre. Programmable reinforcement learning agents. PhD thesis, University of California, Berkeley, 2003.
  • Andreas et al. (2016) J. Andreas, D. Klein, and S. Levine. Modular multitask reinforcement learning with policy sketches. arXiv preprint arXiv:1611.01796, 2016.
  • Andrychowicz et al. (2017) M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba. Hindsight experience replay. arXiv preprint arXiv:1707.01495, 2017.
  • Atkeson & Schaal (1997) C. G. Atkeson and S. Schaal. Robot learning from demonstration. In International Conference on Machine Learning (ICML), volume 97, pp. 12–20, 1997.
  • Bacon et al. (2017) P. Bacon, J. Harb, and D. Precup. The option-critic architecture. In Association for the Advancement of Artificial Intelligence (AAAI), pp. 1726–1734, 2017.
  • Baram et al. (2017) N. Baram, O. Anschel, I. Caspi, and S. Mannor. End-to-end differentiable adversarial imitation learning. In International Conference on Machine Learning (ICML), pp. 390–399, 2017.
  • Bellemare et al. (2013) M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research (JAIR), 47:253–279, 2013.
  • Bitzer et al. (2010) S. Bitzer, M. Howard, and S. Vijayakumar. Using dimensionality reduction to exploit constraints in reinforcement learning. In International Conference on Intelligent Robots and Systems (IROS), pp. 3219–3225, 2010.
  • Brys et al. (2015) T. Brys, A. Harutyunyan, H. B. Suay, S. Chernova, M. E. Taylor, and A. Nowé. Reinforcement learning from demonstration through shaping. In International Joint Conference on Artificial Intelligence (IJCAI), 2015.
  • Chentanez et al. (2005) N. Chentanez, A. G. Barto, and S. P. Singh. Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), pp. 1281–1288, 2005.
  • Daniel et al. (2016) C. Daniel, G. Neumann, O. Kroemer, and J. Peters. Hierarchical relative entropy policy search. Journal of Machine Learning Research (JMLR), 17:3190–3239, 2016.
  • Deka et al. (2016) B. Deka, Z. Huang, and R. Kumar. Erica: Interaction mining mobile apps. In User Interface Software and Technology (UIST), pp. 767–776, 2016.
  • Dietterich (1998) T. G. Dietterich. The MAXQ method for hierarchical reinforcement learning. In International Conference on Machine Learning (ICML), 1998.
  • Finn et al. (2016) C. Finn, S. Levine, and P. Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning (ICML), pp. 49–58, 2016.
  • Hamilton et al. (2017) W. L. Hamilton, R. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems (NIPS), 2017.
  • Hauser et al. (2008) K. Hauser, T. Bretl, K. Harada, and J. Latombe. Using motion primitives in probabilistic sample-based planning for humanoid robots. Algorithmic foundation of robotics, 7:507–522, 2008.
  • Hester et al. (2018) T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, A. Sendonaris, G. Dulac-Arnold, I. Osband, J. Agapiou, J. Z. Leibo, and A. Gruslys. Deep Q-learning from demonstrations. In Association for the Advancement of Artificial Intelligence (AAAI), 2018.
  • Ho & Ermon (2016) J. Ho and S. Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems (NIPS), pp. 4565–4573, 2016.
  • Hosu & Rebedea (2016) I. Hosu and T. Rebedea. Playing Atari games with deep reinforcement learning and human checkpoint replay. In Evaluating General Purpose AI, 2016.
  • Hussein et al. (2017) A. Hussein, E. Elyan, M. M. Gaber, and C. Jayne. Deep reward shaping from demonstrations. In International Joint Conference on Neural Networks, 2017.
  • Jaderberg et al. (2016) M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
  • Kim et al. (2013) B. Kim, A. massoud Farahmand, J. Pineau, and D. Precup. Learning from limited demonstrations. In Advances in Neural Information Processing Systems (NIPS), pp. 2859–2867, 2013.
  • Kipf & Welling (2017) T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
  • Konidaris & Barto (2007) G. Konidaris and A. G. Barto. Building portable options: Skill transfer in reinforcement learning. In International Joint Conference on Artificial Intelligence (IJCAI), 2007.
  • Levine & Koltun (2013) S. Levine and V. Koltun. Guided policy search. In International Conference on Machine Learning (ICML), 2013.
  • Li et al. (2006) L. Li, T. J. Walsh, and M. L. Littman. Towards a unified theory of state abstraction for MDPs. In International Symposium on Artificial Intelligence and Mathematics (ISAIM), 2006.
  • Marthi & Guestrin (2005) B. Marthi and C. Guestrin. Concurrent hierarchical reinforcement learning. In International Joint Conference on Artificial Intelligence (IJCAI), 2005.
  • Mnih et al. (2016) V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning (ICML), 2016.
  • Nair et al. (2017) A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, and P. Abbeel. Overcoming exploration in reinforcement learning with demonstrations. arXiv preprint arXiv:1709.10089, 2017.
  • Ng et al. (1999) A. Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In International Conference on Machine Learning (ICML), volume 99, pp. 278–287, 1999.
  • Osband et al. (2016) I. Osband, C. Blundell, A. Pritzel, and B. V. Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems (NIPS), pp. 4026–4034, 2016.
  • Parr & Russell (1998) R. Parr and S. J. Russell. Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems (NIPS), pp. 1043–1049, 1998.
  • Perez-D’Arpino & Shah (2017) C. Perez-D’Arpino and J. A. Shah. C-learn: Learning geometric constraints from demonstrations for multi-step manipulation in shared autonomy. In International Conference on Robotics and Automation (ICRA), pp. 4058–4065, 2017.
  • Pham et al. (2017) T. Pham, T. Tran, D. Phung, and S. Venkatesh. Column networks for collective classification. In Association for the Advancement of Artificial Intelligence (AAAI), 2017.
  • Phillips et al. (2016) M. Phillips, V. Hwang, S. Chitta, and M. Likhachev. Learning to plan for constrained manipulation from demonstrations. Autonomous Robots, 40(1):109–124, 2016.
  • Pomerleau (1991) D. A. Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88–97, 1991.
  • Roderick et al. (2017) M. Roderick, C. Grimm, and S. Tellex. Deep abstract Q-networks. arXiv preprint arXiv:1710.00459, 2017.
  • Ross & Bagnell (2014) S. Ross and J. A. Bagnell. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979, 2014.
  • Ross et al. (2011) S. Ross, G. Gordon, and A. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Artificial Intelligence and Statistics (AISTATS), 2011.
  • Shi et al. (2017) T. Shi, A. Karpathy, L. Fan, J. Hernandez, and P. Liang. World of bits: An open-domain platform for web-based agents. In International Conference on Machine Learning (ICML), 2017.
  • Stone & Soper (2014) B. Stone and S. Soper. Amazon Unveils a Listening, Talking, Music-Playing Speaker for Your Home. Bloomberg L. P., 2014.
  • Sun et al. (2017) W. Sun, A. Venkatraman, G. J. Gordon, B. Boots, and J. A. Bagnell. Deeply aggrevated: Differentiable imitation learning for sequential prediction. In International Conference on Machine Learning (ICML), 2017.
  • Sutton & Barto (1998) R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998.
  • Sutton et al. (1999) R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211, 1999.
  • Vecerik et al. (2017) M. Vecerik, T. Hester, J. Scholz, F. Wang, O. Pietquin, B. Piot, N. Heess, T. Rothorl, T. Lampe, and M. Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.
  • Wang et al. (2015) Y. Wang, J. Berant, and P. Liang. Building a semantic parser overnight. In Association for Computational Linguistics (ACL), 2015.
  • Weber et al. (2017) T. Weber, S. Racanière, D. P. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li, et al. Imagination-augmented agents for deep reinforcement learning. arXiv preprint arXiv:1707.06203, 2017.
  • Williams (1992) R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3):229–256, 1992.
  • Yeh et al. (2009) T. Yeh, T. Chang, and R. Miller. Sikuli: using GUI screenshots for search and automation. In User Interface Software and Technology (UIST), 2009.
  • Ziebart et al. (2008) B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In Association for the Advancement of Artificial Intelligence (AAAI), 2008.

Appendix A Constraint language for workflow steps

We try to keep the constraint language as minimal and general as possible. The main part of the language is the object selector (elementSet) which selects either (1) objects that share a specified property, or (2) objects that align spatially. These two types of constraints should be applicable in many typical RL domains such as game playing and robot navigation.

constraint ::= Click(elementSet) [Any click action on an element in elementSet]
    | Type(elementSet,string) [Any type action that types string on an element in elementSet]
    | Type(elementSet,Field(*)) [Any type action that types a goal field value on an element in elementSet]
elementSet ::= Tag(tag) [Any element with HTML tag tag]
    | Text(string) [Any element with text string]
    | Like(string) [Any element whose text is a substring of string]
    | Near(elementSet) [Any element that is within 30px from an element in elementSet]
    | SameRow(elementSet) [Any element that aligns horizontally with an element in elementSet]
    | SameCol(elementSet) [Any element that aligns vertically with an element in elementSet]
    | And(elementSet,Class(classes)) [Any element from elementSet matching some class name in classes]
tag ::= a valid HTML tag name
string ::= a string literal
    | Field(fieldName) [The value from the goal field fieldName]
classes ::= a list of valid HTML class names

To avoid combinatorial explosion of relatively useless constraints, we limit the number of nested elementSet applications to 3, where the third application must be the Class filter. When we induce workflow steps from a demonstration, the valid literal values for tag, string, and classes are extracted from the demonstration state.
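
To make the semantics of the grammar concrete, the following is a minimal Python sketch of a subset of the constraint language (Tag, Text, Like, Near, the And/Class filter, Click, and Type). It is an illustration under stated assumptions, not the released implementation: the `Element` and `Action` records, the bounding-box gap used as the 30px distance, the exclusion of empty-text elements from Like, and the exclusion of a reference element from its own Near set are all hypothetical choices made for this example.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Element:
    """Hypothetical DOM element record: tag, text, classes, bounding box."""
    tag: str
    text: str
    classes: Tuple[str, ...]
    left: float
    top: float
    width: float
    height: float


@dataclass(frozen=True)
class Action:
    """Hypothetical action record: kind is "click" or "type"."""
    kind: str
    target: Element
    text: str = ""


# An element set maps the elements on the page to the selected subset.
ElementSet = Callable[[List[Element]], List[Element]]


def Tag(tag: str) -> ElementSet:
    return lambda page: [e for e in page if e.tag == tag]


def Text(s: str) -> ElementSet:
    return lambda page: [e for e in page if e.text == s]


def Like(s: str) -> ElementSet:
    # Element text must be a substring of s; empty text is skipped (assumption).
    return lambda page: [e for e in page if e.text and e.text in s]


def _gap(a: Element, b: Element) -> float:
    # Distance between two bounding boxes (assumed meaning of "within 30px").
    dx = max(b.left - (a.left + a.width), a.left - (b.left + b.width), 0.0)
    dy = max(b.top - (a.top + a.height), a.top - (b.top + b.height), 0.0)
    return math.hypot(dx, dy)


def Near(inner: ElementSet, max_px: float = 30.0) -> ElementSet:
    def select(page: List[Element]) -> List[Element]:
        refs = inner(page)
        return [e for e in page
                if any(r is not e and _gap(e, r) <= max_px for r in refs)]
    return select


def AndClass(inner: ElementSet, classes: Sequence[str]) -> ElementSet:
    # And(elementSet, Class(classes)): keep elements with at least one class.
    wanted = set(classes)
    return lambda page: [e for e in inner(page) if wanted & set(e.classes)]


def Click(es: ElementSet):
    return lambda page, a: a.kind == "click" and a.target in es(page)


def Type(es: ElementSet, s: str):
    return lambda page, a: (a.kind == "type" and a.text == s
                            and a.target in es(page))


# Toy page: a "From" label, a nearby textbox, and a distant textbox.
label = Element("label", "From", (), 10, 10, 40, 20)
box = Element("input", "", ("text-input-pad",), 10, 35, 120, 20)
far = Element("input", "", ("text-input-pad",), 10, 200, 120, 20)
page = [label, box, far]

step = Type(AndClass(Near(Like("From")), ["text-input-pad"]), "Tampa")
print(step(page, Action("type", box, "Tampa")))   # nearby textbox satisfies the step
print(step(page, Action("type", far, "Tampa")))   # distant textbox does not
```

Modeling each element set as a function of the page keeps nesting cheap to express: a nested constraint is just function composition, which mirrors how the grammar nests elementSet applications.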

Appendix B Examples of learned workflows

Enter the username "ashlea" and password "k0UQp" and press login.
{username: ashlea, password: k0UQp}
Find the email by Ilka and forward it to Krista.
{task: forward, name: Ilka, to: Krista}
Enter "Cheree" and press "Search", then find and click the 5th search result.
{target: Cheree, rank: 5}
{departure city: Tampa, destination city: Seattle, ticket type: return flight, departure day: 6, returning day: 16, passengers: 3, seat type: first}

Type(And(Near(Like("From")),Class("text-input-pad")),Field("departure city"))

Type(And(Near(Like("To")),Class("text-input-pad")),Field("destination city"))
Click(Like(Field("destination city")))
Click(Text(Field("departure day")))
Click(Text(Field("returning day")))
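
The Field(...) references in the workflow steps above are placeholders that are filled in from the goal of each episode. The sketch below illustrates that substitution for the flight-booking workflow. The nested-tuple encoding and the `resolve` and `render` helpers are hypothetical names introduced for illustration, and the goal keys are written in lowercase so that they match the Field(...) references.

```python
from typing import Any, Dict

Goal = Dict[str, str]


def resolve(expr: Any, goal: Goal) -> Any:
    """Replace every ("Field", name) node with the goal value for name."""
    if isinstance(expr, tuple):
        if expr[0] == "Field":
            return goal[expr[1]]
        return (expr[0],) + tuple(resolve(arg, goal) for arg in expr[1:])
    return expr  # string literal


def render(expr: Any) -> str:
    """Print a nested tuple back in the constraint-language syntax."""
    if isinstance(expr, tuple):
        return expr[0] + "(" + ",".join(render(a) for a in expr[1:]) + ")"
    return '"' + expr + '"'


# The five workflow steps listed above, encoded as nested tuples.
workflow = [
    ("Type", ("And", ("Near", ("Like", "From")), ("Class", "text-input-pad")),
     ("Field", "departure city")),
    ("Type", ("And", ("Near", ("Like", "To")), ("Class", "text-input-pad")),
     ("Field", "destination city")),
    ("Click", ("Like", ("Field", "destination city"))),
    ("Click", ("Text", ("Field", "departure day"))),
    ("Click", ("Text", ("Field", "returning day"))),
]

goal = {
    "departure city": "Tampa",
    "destination city": "Seattle",
    "departure day": "6",
    "returning day": "16",
}

resolved = [resolve(step, goal) for step in workflow]
for step in resolved:
    print(render(step))
```

Because the steps refer to goal fields rather than literal values, the same workflow applies to any episode of the task, whatever cities and dates the goal specifies.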

Appendix C Details of the neural model architecture

Figure 5: The architecture of the neural policy. The inputs from the state are denoted in blue, while the outputs are denoted in red. Q = query vector; M = memory matrix.

From the input state, we first embed the DOM elements and the goal units, where each goal unit is a key-value pair for structured goals and a token for natural language goals.

The process for computing the embeddings of DOM elements is already described in Section 5. For the goal unit embeddings, we embed each key-value pair as the sum of its word embeddings, and embed natural language goals with an LSTM.
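
As a small illustration of the structured-goal case, the sketch below embeds a key-value pair as the sum of its word embeddings. The toy embedding table, the whitespace tokenization, the lowercasing, and the zero vector for unknown words are all assumptions made for this example; in the model the word embeddings are learned.

```python
from typing import Dict, List


def embed_key_value(key: str, value: str,
                    table: Dict[str, List[float]], dim: int) -> List[float]:
    """Sum the embeddings of all words in the key and the value."""
    total = [0.0] * dim
    for word in (key + " " + value).lower().split():
        vec = table.get(word, [0.0] * dim)  # unknown words add nothing (assumption)
        total = [t + v for t, v in zip(total, vec)]
    return total


# Toy 2-dimensional embedding table (hypothetical values).
table = {
    "departure": [1.0, 0.0],
    "city": [0.0, 1.0],
    "tampa": [0.5, 0.5],
}

unit = embed_key_value("departure city", "Tampa", table, dim=2)
print(unit)
```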


After obtaining the embedding of each DOM element and of each goal unit, we apply a series of attentions to relate the DOM elements with the goal:

  1. DOM context: we max-pool the DOM embeddings to get a query vector, and then attend over the DOM embeddings. The DOM context is the weighted average of the attended DOM embeddings.

  2. Goal contexts: we use the DOM context as the query vector to attend over the goal embeddings. We compute two goal contexts from two different attention heads. Each head uses sentinel attention, where part of the attention can be put on a learned NULL vector, which is useful for ignoring the goal when the next action should not depend on the goal.

  3. DOM element selection: we concatenate the DOM context and goal contexts into a query vector to attend over the DOM embeddings. We use two attention heads, and combine the attention weights from the two heads based on a ratio computed from the goal contexts. The result is a distribution over the target DOM elements e.

  4. Typed string and action selection: for a given target DOM element e, we combine the goal context and the embedding of e to get a query vector to attend over the goal embeddings. For structured queries, we get a distribution over the goal fields, while for natural language queries, we get distributions over the start and end tokens. The same query vector is also used to compute the distribution over the action types (click or type).
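
The first two steps above can be sketched in plain Python as follows. This is a simplified illustration, not the trained architecture: it assumes dot-product attention scores, a single goal head, no learned projections (so queries and memories share one dimension), and a NULL vector fixed to zeros rather than learned. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def max_pool(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise maximum over a set of vectors."""
    return [max(column) for column in zip(*vectors)]


def attend(query: Vector, memory: Sequence[Vector]) -> Tuple[List[float], List[float]]:
    """Return attention weights and the weighted average of the memory rows."""
    weights = softmax([dot(query, row) for row in memory])
    context = [sum(w * row[i] for w, row in zip(weights, memory))
               for i in range(len(memory[0]))]
    return weights, context


def sentinel_attend(query: Vector, memory: Sequence[Vector], null: Vector):
    """Attention with an extra NULL slot that can absorb part of the weight."""
    weights, context = attend(query, list(memory) + [list(null)])
    return weights[:-1], weights[-1], context


# Toy embeddings (hypothetical values).
dom = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
goal = [[1.0, 1.0], [-1.0, 0.0]]
null = [0.0, 0.0]

# Step 1: DOM context from a max-pooled query.
dom_weights, dom_context = attend(max_pool(dom), dom)

# Step 2: goal context, with the DOM context as the query.
goal_weights, null_weight, goal_context = sentinel_attend(dom_context, goal, null)

print(dom_context)
print(goal_weights, null_weight)
```

The weight placed on the NULL slot is returned separately so that a caller can see how much of the goal was ignored; when no goal unit matches the query well, most of the attention mass can move to that slot.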