Learning When and What to Ask: a Hierarchical Reinforcement Learning Framework

Reliable AI agents should be mindful of the limits of their knowledge and consult humans when sensing that they do not have sufficient knowledge to make sound decisions. We formulate a hierarchical reinforcement learning framework for learning to decide when to request additional information from humans and what type of information would be helpful to request. Our framework extends partially-observed Markov decision processes (POMDPs) by allowing an agent to interact with an assistant to leverage their knowledge in accomplishing tasks. Results on a simulated human-assisted navigation problem demonstrate the effectiveness of our framework: aided with an interaction policy learned by our method, a navigation policy achieves up to a 7x improvement in task success rate compared to performing tasks only by itself. The interaction policy is also efficient: on average, only a quarter of all actions taken during a task execution are requests for information. We analyze benefits and challenges of learning with a hierarchical policy structure and suggest directions for future work.




1 Introduction

To prevent catastrophic failures, AI agents should be taught to determine when they cannot make reliable decisions and to seek advice from humans in those situations. In such cases, human assistants can potentially be more helpful (with lower effort) if the agent communicates specifically what knowledge it needs for making better decisions. For example, in an indoor object-finding task (Figure 1), if a robot could inform its human user that all it requires to accomplish the task is knowing what room the target object is in, the human user could concisely give the robot a room label (e.g., “it is in the kitchen”), instead of having to compose a verbose navigation instruction (e.g., “go to the door on your left, turn right…”). This would also lead to improved decision making on the part of the agent, because it receives exactly the information it needs and does not have to cope with redundant or misleading information.

One way agents can learn to express themselves to humans is to learn to mimic the sorts of questions humans might ask in similar circumstances, as has been explored in dialogue systems (rao-daume-iii-2018-learning; das2017visual; li2016learning) or generating explanations (camburu2018snli; chen2021generate; narang2020wt5; kim2018textual; lamm2021qed). A limitation of this approach is that the agent only mirrors human behavior, without forming an understanding of its own limitations and needs (camburu2019make); essentially it learns to convey what a human may be concerned about, not what it is concerned about. In this paper, we take a step towards designing agents that can ask questions that faithfully communicate their intrinsic needs and uncertainties. We develop a hierarchical decision-making framework named Hari (Human-Assisted Reinforced Interaction), where an agent is equipped with specific, innate information-seeking intents that can be easily translated into human-intelligible requests for different types of information. Hari extends standard partially-observed Markov decision processes (POMDPs), by introducing a human assistant into this framework, formulating how an agent can request additional information from a human to better accomplish tasks.

Under the Hari framework, three natural information-seeking intents arise: in every step, the agent can ask for more information about (i) its current state, (ii) the goal state, or (iii) a subgoal state which, if reached, helps it make progress on the current task. Through interacting with the assistant who can answer such questions, and with the environment, receiving costs or rewards, the agent learns to decide when to convey an intent to the assistant and which intent to convey. As the intents are integrated into the agent’s learning and decision-making process, the selected intent in every step reflects what type of information the agent intrinsically finds most useful to obtain at that time. Hari employs a stack data structure for managing goals, allowing the agent to conveniently construct deep goal hierarchies, whereas most hierarchical decision-making frameworks (le2018hierarchical; kulkarni2016hierarchical; sutton1999between) work with only two levels of goal abstractions.

We demonstrate the practicality of Hari on a simulated human-assisted navigation problem. In this setting, we learn an interaction policy that controls how an agent communicates with an assistant and navigates in an environment. Equipped with the interaction policy learned by our method, the agent achieves a substantial 7× increase in success rate on tasks that take place in environments it has not previously seen, versus performing those tasks without assistance. The agent achieves this level of performance while issuing only 4.8 requests to the assistant on average, representing less than 1/4 of the total number of actions taken in a task execution. The learned interaction policy exhibits sensible behaviors, such as requesting goal clarification only once at the very beginning and requesting subgoals more often on more difficult tasks. We show that performance of the policy in unseen environments can be further improved by recursively asking for subgoals of subgoals. Finally, we discuss limitations of the policy’s model and feature representation, which suggest room for future improvements.

Figure 1: An illustration of the Hari framework in a human-assisted navigation task. An agent can only observe part of an environment and is asked to find a mug in the kitchen. An assistant communicates with the agent and can provide it with information about the environment and the task. Initially (at location A), the agent may request more information about the goal, but it may still not know enough about where it currently is. For example, at location B, due to limited perception, it does not recognize that it is in a living room and is standing next to a couch. It can obtain such information from the assistant. If the current task becomes too difficult (as at location C), the agent can ask the assistant to provide a simpler subtask which, if accomplished, helps it make progress on the main task. The agent maintains a stack of tasks and always executes the task at the top. When the agent receives a (sub)task, it pushes the (sub)task to the top of the stack. When it wants to stop executing a (sub)task, it pops the (sub)task from the stack (e.g., at location D). At location E, the agent empties the stack and terminates its execution.

2 Motivation: Limitations of the Standard POMDP Framework

We consider an environment defined by a partially-observed Markov decision process (POMDP) $E$ with state space $\mathcal{S}$, action space $\mathcal{A}$, state-transition function $T : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$, cost function $c : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, description space $\mathcal{D}$, and description function $\rho : \mathcal{S} \to \Delta(\mathcal{D})$. (Footnote 1: We use the term “description” in lieu of “observation” in the POMDP formulation to emphasize two properties of the information the agent has access to for making decisions: (i) the information can be in various modalities and (ii) the information can be obtained via not only perception, but also communication.) Here, $\Delta(\mathcal{Y})$ denotes the set of all probability distributions over a set $\mathcal{Y}$. We refer to this environment as the operation environment because it is where the agent operates to accomplish tasks.

Each task in the environment is defined as a tuple $(s_1, g_1, d^g_1)$ where $s_1 \in \mathcal{S}$ is the start state, $g_1 \in \mathcal{S}$ is the goal state, and $d^g_1 \in \mathcal{D}$ is a limited description of $g_1$. Initially, a task is sampled from a task distribution $\mathcal{T}$. An agent starts in $s_1$ and is only given the goal description $d^g_1$. It has to reach the goal state $g_1$ within $H$ time steps. Let $g_t$ and $d^g_t$ be the goal state and goal description being executed at time step $t$, respectively. In a standard POMDP, $g_t = g_1$ and $d^g_t = d^g_1$ for $1 \le t \le H$. But later, we will enable the agent to set new goals via communication with humans.

At any time step $t$, the agent does not know its true state $s_t$ but only receives a description $d^s_t \sim \rho(s_t)$ of the state. Generally, the description can include any information coming from any knowledge source (e.g., an RGB image and/or a verbal description describing the current view). Given $d^s_t$ and $d^g_t$, the agent then makes a decision $a_t \in \mathcal{A}$, transitions to the next state $s_{t+1} \sim T(s_t, a_t)$, and receives a cost $c_t = c(s_t, a_t)$. A special action $a_{\text{done}} \in \mathcal{A}$ is taken when the agent decides to terminate its execution. The goal of the agent is to reach $g_1$ with minimum cumulative cost $C(\tau) = \sum_{t=1}^{H} c_t$, where $\tau = (s_1, d^s_1, a_1, \ldots, s_H, d^s_H)$ is an execution of the task.

As the agent does not have access to its true state, it can only make decisions based on the (observable) partial execution $\tau_{1:t} = (d^s_1, a_1, \ldots, d^s_t)$. kaelbling1998planning introduce the notion of a belief state $b_t \in \Delta(\mathcal{S})$, which sufficiently summarizes a partial execution as a distribution over the state space $\mathcal{S}$. In practice, when $\mathcal{S}$ is continuous or high-dimensional, representing and updating a full belief state (whose dimension is $|\mathcal{S}|$) is intractable. We follow hausknecht2015deep, using recurrent neural networks to learn compact representations of partial executions. We denote by $\hat{b}_t$ a representation of the partial execution $\tau_{1:t}$ and by $\mathcal{B}$ the set of all possible representations. The agent maintains an operation policy $\hat{\pi} : \mathcal{B} \times \mathcal{D} \to \Delta(\mathcal{A})$ that maps a belief state and a goal description to a distribution over $\mathcal{A}$. The learning objective for solving a standard POMDP is to estimate an operation policy that minimizes the expected cumulative cost of performing tasks:

$$\min_{\pi} \; \mathbb{E}_{(s_1, g_1, d^g_1) \sim \mathcal{T},\; \tau \sim P_{\pi}(\cdot \mid s_1, d^g_1)} \left[ C(\tau) \right] \qquad (1)$$

where $C(\tau)$ is the cumulative cost of an execution $\tau$ and $P_{\pi}(\cdot \mid s_1, d^g_1)$ is the distribution over executions generated by a policy $\pi$ given start state $s_1$ and goal description $d^g_1$. In a standard POMDP, an agent performs tasks by executing its own operation policy without asking for any external assistance. Moreover, the description function $\rho$ and the goal description $d^g_1$ are assumed to be fixed during a task execution. As seen from Equation 1, given a fixed environment and task distribution, the expected performance of the agent is solely determined by the operation policy $\hat{\pi}$. Thus, the standard POMDP framework does not provide any mechanism for improving the agent’s performance other than enhancing the operation policy.
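The objective above is an expectation over tasks and executions, so it can be approximated by averaging the cumulative cost of sampled rollouts. The following is a minimal sketch under strong simplifying assumptions: the toy corridor, the function names, and the cost values are all hypothetical, and the toy policy reads the state directly instead of a learned representation of partial executions.

```python
import random

def rollout_cost(policy, transition, cost, start, horizon):
    """Run one execution from `start` and return its cumulative cost."""
    state, total = start, 0.0
    for _ in range(horizon):
        action = policy(state)
        total += cost(state, action)
        if action == "done":  # special terminating action
            break
        state = transition(state, action)
    return total

def estimate_objective(policy, transition, cost, starts, horizon,
                       n_samples=200, seed=0):
    """Monte Carlo estimate of the expected cumulative cost over sampled tasks."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += rollout_cost(policy, transition, cost, rng.choice(starts), horizon)
    return total / n_samples

# Toy corridor (hypothetical): states are integers and the goal is state 0.
def toy_transition(state, action):
    return state - 1 if action == "left" else state

def toy_cost(state, action):
    # Moving costs 0.5; terminating costs the remaining distance (task error).
    return float(abs(state)) if action == "done" else 0.5

def walk_then_stop(state):
    return "left" if state > 0 else "done"

def stop_immediately(state):
    return "done"
```

Comparing `walk_then_stop` with `stop_immediately` on the same start states shows how the objective ranks policies: the first pays a small acting cost to avoid the task error, the second pays the full task error.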

3 Leveraging Human Knowledge to Better Accomplish Tasks

We introduce an assistant into the operation environment, who can provide information about the environment’s states. We consider a setting where the agent already possesses a decent-but-imperfect pre-learned operation policy $\hat{\pi}$. Our interest is to teach the agent to leverage the assistant’s knowledge to make better decisions without modifying $\hat{\pi}$. (Footnote 2: Even though updating the operation policy itself using this human knowledge is another interesting research direction to explore, we leave it for future work.) Specifically, we learn an interaction policy $\psi_\theta$ (parametrized by $\theta$) that controls how the agent communicates with the assistant to gather additional information. The operation policy $\hat{\pi}$ will be invoked by the interaction policy if the latter decides that the agent does not need new information and wants to take an operating action.

Let us first give an intuition for why obtaining additional information from a human assistant can improve performance of the agent in our framework. Our key insight is that we can enhance the operation policy (without updating it) by manipulating its input. In most cases of interest, the performance of the pre-trained operation policy over its input space is non-uniform: there are inputs on which it makes more accurate decisions than others. For example, in an object-finding navigation task, during pre-training, a human trainer may walk a robot around a house and verbally describe the locations they walk by. Alternatively, the robot can be trained in a simulated environment with dense visual semantic annotations. The robot may learn to perform well in such rich-information conditions. When deployed in a new environment, its performance may degrade due to limited perception (e.g., it cannot recognize new objects). However, its capability of performing tasks under rich information remains. It would still find a spoon very quickly if it knew that the current room was a kitchen and that there was a cupboard next to it. The agent may obtain such information by asking the assistant to provide new state descriptions.

Communication with the Assistant.

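The assistant is present all the time and knows the agent’s current state $s_t$ and the goal state $g_t$. It is specified by two functions: a description function $\rho_A : \mathcal{S} \times \mathcal{D} \to \Delta(\mathcal{D})$ and a subgoal function $\omega_A : \mathcal{S} \times \mathcal{S} \to \Delta(\mathcal{S})$. $\rho_A(d' \mid s, d)$ specifies the probability of giving $d'$ as the new description of state $s$ given a current description $d$. $\omega_A(g' \mid s, g)$ indicates the probability of proposing $g'$ as a subgoal given a current state $s$ and a goal state $g$.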

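At time step $t$, the assistant accepts three types of request from the agent:

  1. Cur: requests a new description of $s_t$ and receives $d^s_{t+1} \sim \rho_A(\cdot \mid s_t, d^s_t)$;

  2. Goal: requests a new description of $g_t$ and receives $d^g_{t+1} \sim \rho_A(\cdot \mid g_t, d^g_t)$;

  3. Sub: requests a description of a subgoal $g_{t+1}$ and receives $d^g_{t+1} \sim \rho_A(\cdot \mid g_{t+1}, \varnothing)$ where $g_{t+1} \sim \omega_A(\cdot \mid s_t, g_t)$ and $\varnothing$ is an empty description.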
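As an illustration of the three request types, here is a minimal sketch of an assistant backed by lookup tables. It is deterministic for simplicity, whereas the formulation above allows stochastic description and subgoal functions; the class name, method name, and example data are hypothetical.

```python
class ToyAssistant:
    """Answers Cur, Goal, and Sub requests from lookup tables (illustrative only)."""

    def __init__(self, descriptions, subgoals):
        self.descriptions = descriptions  # state -> richer description
        self.subgoals = subgoals          # (state, goal) -> proposed subgoal state

    def answer(self, request, state, goal):
        if request == "Cur":
            # New description of the agent's current state.
            return self.descriptions[state]
        if request == "Goal":
            # New description of the goal currently being executed.
            return self.descriptions[goal]
        if request == "Sub":
            # The subgoal state stays on the environment side; the agent
            # itself would only see the returned description.
            subgoal = self.subgoals[(state, goal)]
            return subgoal, self.descriptions[subgoal]
        raise ValueError(f"unknown request type: {request}")

assistant = ToyAssistant(
    descriptions={
        "hall": "hallway, next to the stairs",
        "door": "door on the left",
        "kitchen": "kitchen, mug on the counter",
    },
    subgoals={("hall", "kitchen"): "door"},
)
```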

Interaction Policy.

The action space of the interaction policy consists of five actions: $\{\text{Cur}, \text{Goal}, \text{Sub}, \text{Do}, \text{Done}\}$. The first three actions correspond to making the three types of request that the assistant accepts. The remaining two actions are used to traverse the operation environment:

  1. Do: executes the action $a_t$ selected by the operation policy $\hat{\pi}$. The agent transitions to a new operation state $s_{t+1} \sim T(s_t, a_t)$;

  2. Done: determines that the current goal $g_t$ has been reached. (Footnote 3: Note that the agent may falsely decide that a goal has been reached.) If $g_t$ is a main goal ($g_t = g_1$), the task episode ends. If $g_t$ is a subgoal ($g_t \neq g_1$), the agent may choose a new goal to follow. Our problem formulation leaves open which goal should be selected next.

By selecting among these actions, the interaction policy essentially decides when to ask the assistant for additional information, and what types of information to ask for. Our formulation does not specify the input space of the interaction policy, as this space depends on how the agent implements its goal memory (i.e. how it stores and retrieves the subgoals). In the next section, we introduce an instantiation where the agent uses a stack data structure to manage (sub)goals.

4 Hierarchical Reinforcement Learning Framework

In this section, we describe the Hari framework. We first formulate the POMDP environment that the interaction policy acts in, referred to as the interaction environment (§ 4.1). Our construction employs a goal stack to manage multiple levels of (sub)goals (§ 4.2). A goal stack stores all the tasks the agent has been assigned but has not yet decided to terminate (by choosing the Done action). It is updated in every step depending on the taken action. We design a cost function (§ 4.4) that specifies a trade-off between the cost of taking actions (acting cost) and the cost of not completing a task (task error). While this cost function is defined to capture a specific trade-off, in general, the cost function is a customizable component in the Hari framework.

4.1 Interaction Environment

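Given an operation environment $E$, the interaction environment $\bar{E}$ constructed on top of $E$ is a POMDP with:

  • State space $\bar{\mathcal{S}} = \mathcal{S} \times \mathcal{D} \times \mathcal{G}_L$ where $\mathcal{G}_L$ is the set of all goal stacks containing at most $L$ elements ($L$ is a hyperparameter). Each state $\bar{s} = (s, d^s, G)$ is a tuple of an operation state $s$, its description $d^s$, and a goal stack $G$. Each element in the goal stack is a tuple $(g, d^g)$ of a goal state $g$ and its description $d^g$;

  • Action space $\bar{\mathcal{A}} = \{\text{Cur}, \text{Goal}, \text{Sub}, \text{Do}, \text{Done}\}$;

  • State-transition function $\bar{T} = T_{sd} \cdot T_G$ where $T_{sd}$ governs the operation state and its description (§ 4.3) and $T_G$ governs the goal stack (§ 4.2);

  • Cost function $\bar{c}$ (defined in § 4.4 to trade off acting cost and task error);

  • Description space $\bar{\mathcal{D}} = \mathcal{D} \times \mathcal{G}^d_L$ where $\mathcal{G}^d_L$ is the set of all goal-description stacks of size at most $L$. At any time, the agent cannot access the environment’s goal stack $G$, which contains true goal states. Instead, it can only observe the descriptions in $G$. We call this partial stack a goal-description stack, denoted by $G^d$;

  • Description function $\bar{\rho}(\bar{s}) = (d^s, G^d)$, where $\bar{s} = (s, d^s, G)$. Unlike in the standard POMDP formulation, this description function is deterministic.

A belief state $\bar{b}_t$ summarizes a partial execution $\bar{\tau}_{1:t}$. We formally define the interaction policy as $\psi_\theta : \bar{\mathcal{B}} \to \Delta(\bar{\mathcal{A}})$, where $\bar{\mathcal{B}}$ is the set of all interaction belief states.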

4.2 Goal Stack

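A goal stack is an ordered set of tasks for which the agent has not yet declared completion (by calling Done). The initial stack $G_1$ contains the main goal $g_1$ and its description $d^g_1$. Let $G_t$ be the goal stack at time step $t$ and $\bar{a}_t$ the action taken by the interaction policy. Only the Goal, Sub, and Done actions alter the stack. The Goal action replaces the top goal description with $d^g_{t+1}$, the new description given by the assistant. The Sub action pushes a new subtask $(g_{t+1}, d^g_{t+1})$ to the stack. The Done action pops the top (sub)task from the stack:

$$\mathrm{update}(G_t, \bar{a}_t) = \begin{cases} G_t \text{ with its top element replaced by } (g_t, d^g_{t+1}) & \text{if } \bar{a}_t = \text{Goal} \\ G_t.\mathrm{push}\big((g_{t+1}, d^g_{t+1})\big) & \text{if } \bar{a}_t = \text{Sub} \\ G_t.\mathrm{pop}() & \text{if } \bar{a}_t = \text{Done} \\ G_t & \text{otherwise} \end{cases}$$

The Sub action is not available to the agent when the current stack contains $L$ elements, guaranteeing that the goal stack always has at most $L$ elements. Given the assistant’s response, the goal-stack transition function is defined as $T_G(G_{t+1} \mid G_t, \bar{a}_t) = \mathbb{1}\{G_{t+1} = \mathrm{update}(G_t, \bar{a}_t)\}$ where $\mathbb{1}\{\cdot\}$ is an indicator function.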
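The stack updates described in this section can be sketched in a few lines. This is an illustrative implementation under our own naming (`GoalStack`, `update`, and `goal_descriptions` are hypothetical), not the authors' code.

```python
from dataclasses import dataclass, field

@dataclass
class GoalStack:
    """Stack of (goal_state, goal_description) pairs with a size limit."""
    max_size: int
    items: list = field(default_factory=list)

    def available(self, action):
        # Sub is disabled when the stack is full; other actions are always allowed.
        if action == "Sub":
            return len(self.items) < self.max_size
        return True

    def update(self, action, new_goal=None, new_description=None):
        if action == "Goal":
            # Keep the goal state, replace only its description.
            goal, _ = self.items[-1]
            self.items[-1] = (goal, new_description)
        elif action == "Sub":
            if not self.available("Sub"):
                raise ValueError("goal stack is full")
            self.items.append((new_goal, new_description))
        elif action == "Done":
            self.items.pop()
        # Cur and Do leave the stack unchanged.

    def top(self):
        return self.items[-1]

    def goal_descriptions(self):
        # What the agent can observe: descriptions only, not true goal states.
        return [description for _, description in self.items]
```

An episode ends when the main goal is popped and the stack becomes empty. Raising `max_size` above 2 is what allows subgoals of subgoals.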

4.3 Transition of the Current Operation State and Its Description

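To complete the definition of the state-transition function $\bar{T}$, we define the transition function $T_{sd}(s_{t+1}, d^s_{t+1} \mid s_t, d^s_t, \bar{a}_t)$ of the operation state and its description. This function is factored into two terms by the chain rule:

$$T_{sd}(s_{t+1}, d^s_{t+1} \mid s_t, d^s_t, \bar{a}_t) = p(s_{t+1} \mid s_t, \bar{a}_t) \cdot p(d^s_{t+1} \mid s_{t+1}, d^s_t, \bar{a}_t)$$

Only taking the Do action may change the current operation state:

$$p(s_{t+1} \mid s_t, \bar{a}_t) = \begin{cases} T(s_{t+1} \mid s_t, a_t) & \text{if } \bar{a}_t = \text{Do} \\ \mathbb{1}\{s_{t+1} = s_t\} & \text{otherwise} \end{cases}$$

The description may vary when the agent moves to a new operation state (by taking the Do action) or requests a new description of the current state (by taking the Cur action):

$$p(d^s_{t+1} \mid s_{t+1}, d^s_t, \bar{a}_t) = \begin{cases} \rho(d^s_{t+1} \mid s_{t+1}) & \text{if } \bar{a}_t = \text{Do} \\ \rho_A(d^s_{t+1} \mid s_{t+1}, d^s_t) & \text{if } \bar{a}_t = \text{Cur} \\ \mathbb{1}\{d^s_{t+1} = d^s_t\} & \text{otherwise} \end{cases}$$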

4.4 Cost Function

The interaction policy needs to balance between two types of cost: the cost of taking actions (acting cost) and the cost of not completing a task (task error). The acting cost also subsumes the cost of communicating with the assistant because, in reality, such interactions consume time, human effort, and possibly trust. Assuming that the assistant is helpful, acting cost and task error usually conflict with each other; for example, the agent may lower its task error if it is willing to suffer a larger acting cost by increasing the number of requests to the assistant.

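We employ a simplified model where all types of cost are non-negative real numbers of the same unit. Making a request of type $a \in \{\text{Cur}, \text{Goal}, \text{Sub}\}$ is assigned a constant cost $\gamma_a$. The cost of taking the Do action is $c(s_t, a_t)$, the cost of executing the action $a_t$ in the operation environment. Calling Done to terminate execution of the main goal $g_1$ incurs a task error $c(s_t, a_{\text{done}})$. We exclude the task errors of executing subgoals because the interaction policy is only evaluated on reaching the main goal:

$$\bar{c}(\bar{s}_t, \bar{a}_t) = \begin{cases} \gamma_{\bar{a}_t} & \text{if } \bar{a}_t \in \{\text{Cur}, \text{Goal}, \text{Sub}\} \\ c(s_t, a_t) & \text{if } \bar{a}_t = \text{Do} \\ c(s_t, a_{\text{done}}) & \text{if } \bar{a}_t = \text{Done and the main goal is on top of the stack} \\ 0 & \text{if } \bar{a}_t = \text{Done and a subgoal is on top of the stack} \end{cases}$$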
The magnitudes of the costs naturally specify a trade-off between acting cost and task error. For example, setting the task errors much larger than the other costs indicates that completing tasks is prioritized over taking few actions.
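To make the trade-off concrete, the sketch below scores one hypothetical episode with this cost model. The function name and the episode are our own illustration; the per-action cost of 0.01 matches the value reported for the main experiments, and the task error is taken to be the remaining distance to the main goal.

```python
REQUEST_ACTIONS = ("Cur", "Goal", "Sub")

def interaction_cost(action, stack_depth, request_costs, do_cost, task_error):
    """Cost of one interaction-policy action.

    stack_depth is the number of goals on the stack before the action;
    depth 1 means the main goal is on top.
    """
    if action in REQUEST_ACTIONS:
        return request_costs[action]
    if action == "Do":
        return do_cost
    if action == "Done":
        # Task error is charged only when terminating the main goal.
        return task_error if stack_depth == 1 else 0.0
    raise ValueError(f"unknown action: {action}")

request_costs = {"Cur": 0.01, "Goal": 0.01, "Sub": 0.01}

# (action, stack depth before the action, remaining distance to the main goal)
episode = [
    ("Goal", 1, 4.0), ("Do", 1, 3.0), ("Sub", 1, 3.0), ("Do", 2, 2.0),
    ("Done", 2, 2.0), ("Do", 1, 1.0), ("Done", 1, 1.0),
]
total = sum(
    interaction_cost(action, depth, request_costs, do_cost=0.01, task_error=dist)
    for action, depth, dist in episode
)
```

In this episode the five requests and moves cost 0.05 in total, while stopping one step short of the goal costs 1.0, so the task error dominates whenever per-action costs are small.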

5 Learning When and What to Ask in Human-Assisted Navigation


We apply our framework to modeling a human-assisted navigation (Han) problem. In Han, a human requests an agent to find an object in an indoor environment. Each task request asks the agent to find an object of type $o$ in a room of type $r$ (e.g., find a mug in a kitchen). The agent is equipped with a camera and shares its camera view with the human. We assume that the human is sufficiently familiar with the environment that they can recognize the agent’s location by looking at its current view. While an agent is performing a task, it may request the human to provide additional information via telecommunication (e.g., a chat app). Specifically, it can ask for a description of its current location (Cur), the goal location (Goal), or a subgoal location that is on the path from its current location to the goal location (Sub). (Footnote 4: Let $p$ be the shortest path from $s_t$ to $g_t$, and $p_i$ be the $i$-th node on the path. The subgoal location is chosen as $p_k$, where the index $k$ is capped by a pre-defined constant.)
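A subgoal on the path to the goal can be sketched as follows. The exact selection rule is not fully recoverable from the text, so this sketch assumes the subgoal is the node a fixed number of hops ahead on the shortest path, or the goal itself if it is closer; `max_hops` and the function names are hypothetical.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search; returns the node list from start to goal (inclusive)."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph[node]:
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None  # goal is unreachable

def propose_subgoal(graph, current, goal, max_hops):
    """Pick the node `max_hops` steps ahead on the shortest path, capped at the goal."""
    path = shortest_path(graph, current, goal)
    if path is None:
        return None
    return path[min(len(path) - 1, max_hops)]

# A small corridor of five locations: A - B - C - D - E.
corridor = {
    "A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"],
}
```

With `max_hops=2`, an agent at `A` heading to `E` is sent to `C`, while an agent at `D` is sent directly to `E`.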

Before issuing a task request, the human imagines a goal location (but does not reveal it to the agent). We are primarily interested in evaluating success in goal-finding, i.e., whether the agent can arrive at the human’s intended goal location. Even though there could be multiple locations that match a request, the agent only succeeds if it arrives exactly at the chosen goal location. We also determine success in request-fulfilling, where the agent successfully fulfills a request if it eventually navigates to any location that is within two meters of an object that matches the request.

Operation Environment.

We construct the operation environments using the environment graphs provided by the Matterport3D simulator (anderson2018vision). Each environment graph is generated from a 3D model of a house where each node is a location in the house and each edge connects two nearby unobstructed locations. Each operation state corresponds to a node in the graph. At any time, the agent’s operation action space consists of traversing to any of the nodes that are adjacent to its current node.

We employ a discrete bag-of-features representation for state descriptions. (Footnote 5: While our representation of state descriptions simplifies the object/room detection problem for the agent, it does not necessarily make the navigation problem easier than with image input, as images may contain information that is not captured by our representation, e.g., object shapes and colors, visualization of paths.) Using this representation, we can easily vary the type and amount of information given to the agent. Specifically, we simulate two settings of descriptions: dense and sparse. At evaluation time, the agent perceives sparse descriptions and can request dense descriptions from the assistant.

A dense description of a current location contains the room name at the location and the features of objects within a fixed distance (in meters) of the location. The features of each object consist of (i) its name, (ii) its horizontal and vertical angles (relative to the current viewpoint), and (iii) its distance (in meters) to the current location. A dense description of a goal follows the same representation scheme.

In the sparse setting, the current-location description does not include the room name. Moreover, we remove the features of objects that are not among the 100 most frequent objects, emulating an imperfect object detector module. The sparse goal description (the task request) has only the features of the target object and the name of the room where the object is located.

As a special case, if a subgoal location is adjacent to or coincides with the agent’s current location, instead of describing room and object features, the human specifies the ground-truth action to go to the subgoal (an action is specified by its horizontal and vertical angles, and travel distance). (Footnote 6: Here, we emulate a practical scenario where if a destination is visible in the current view, to save effort, a human would concisely tell an agent what to do rather than giving a verbose description of the destination.)
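The difference between dense and sparse current-location descriptions can be illustrated with a small sketch. The dictionary layout, the function names, the 5-meter radius, and the two-item frequent-object set are hypothetical choices for the example; they are not the values or data structures used in the experiments.

```python
def dense_description(room, objects, radius):
    """Room name plus features of all objects within `radius` meters."""
    nearby = [obj for obj in objects if obj["distance"] <= radius]
    return {"room": room, "objects": nearby}

def sparse_description(dense, frequent_names):
    """Drop the room name and any object outside the frequent-object set."""
    kept = [obj for obj in dense["objects"] if obj["name"] in frequent_names]
    return {"room": None, "objects": kept}

objects = [
    {"name": "couch", "h_angle": 0.3, "v_angle": 0.0, "distance": 1.0},
    {"name": "vase", "h_angle": -1.2, "v_angle": 0.1, "distance": 2.0},
    {"name": "lamp", "h_angle": 2.0, "v_angle": 0.4, "distance": 9.0},
]

dense = dense_description("living room", objects, radius=5.0)
sparse = sparse_description(dense, frequent_names={"couch", "lamp"})
```

Here the dense description names the room and lists the couch and the vase, while the sparse one keeps only the couch: the lamp is too far away and the vase is treated as undetectable.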

Experimental Procedure.

We conduct our experiments in three phases. In the pre-training phase, we learn an operation policy with dense descriptions of the current location and the goal. In the training phase, the agent perceives a sparse description of its current location and is given a sparse initial goal description. It may ask the human for dense descriptions. We use advantage actor-critic (mnih2016asynchronous) to learn an interaction policy that controls how the agent communicates with the human and navigates in an environment. The interaction policy is trained in environments that were seen during pre-training as well as in environments that were not. Finally, in the evaluation phase, the interaction policy is tested under three conditions: seen environment and target object type but starting from a new room (UnseenStr), seen environment but new target object type (UnseenObj), and new environment (UnseenEnv). The pre-trained operation policy is fixed during the training and evaluation phases. We create 82,104 examples for pre-training, 65,133 for training, and approximately 2,000 for each validation or test set. Details about the training procedure and the dataset are included in the Appendix.

6 Results and Analyses

Success rates are in %, with goal-finding first and request-fulfilling in parentheses. The last four columns give the average number of actions of each type. $d^s$: current-state description; $d^g$: goal description.

| Agent | UnseenStr | UnseenObj | UnseenEnv | Cur | Goal | Sub | Do |
|---|---|---|---|---|---|---|---|
| No assistant and interaction policy | | | | | | | |
| Sparse $d^s$ and $d^g$ | 43.4 (50.4) | 16.4 (23.2) | 3.0 (6.8) | - | - | - | 13.1 |
| Sparse $d^s$, dense $d^g$ | 67.2 (68.4) | 56.6 (58.2) | 9.7 (12.3) | - | - | - | 12.6 |
| Dense $d^s$, sparse $d^g$ | 77.9 (86.0) | 30.6 (40.3) | 4.1 (7.5) | - | - | - | 12.0 |
| Dense $d^s$ and $d^g$ | 97.8 (98.1) | 81.7 (83.3) | 9.4 (11.9) | - | - | - | 11.0 |
| With assistant and interaction policy (Hari) | | | | | | | |
| Rule-based (baseline) | 78.8 (78.8) | 68.5 (68.5) | 12.7 (12.7) | 2.0 | 1.0 | 1.7 | 11.3 |
| RL-learned (ours) | 85.8 (86.8) | 78.2 (79.6) | 19.8 (22.5) | 2.1 | 1.0 | 1.7 | 11.2 |
| + Perfect nav. on sub-goals (skyline) | 94.3 (95.8) | 95.1 (96.1) | 92.6 (94.3) | 0.0 | 0.0 | 6.3 | 7.3 |

Table 1: Main results on test sets. For success rate, we report both goal-finding and request-fulfilling results (the latter in parentheses). We also report the average number of different types of actions taken by the agent (across all task types). Assistance dramatically improves success rates, especially for unseen objects (5×) and environments (7×), with a small number of additional actions (2.8).


In our main experiments, we set the cost of taking a Cur, Goal, Sub, or Do action to 0.01 (we will consider other settings subsequently), the cost of calling Done to terminate the main goal (i.e., the task error) to the (unweighted) length of the shortest path from the agent’s location to the goal, and the goal stack’s size ($L$) to 2. We compare our RL-learned interaction policy with a rule-based baseline that first takes the Goal action and then randomly selects actions. In each episode, we enforce that the rule-based policy can take at most $N_a$ actions of each type $a$, where $N_a$ is a constant. We tune each $N_a$ on the validation sets so that the rule-based policy has the same average count of each action as the RL-learned policy. To prevent early termination, we enforce that the rule-based policy cannot take more Done actions than Sub actions unless its Sub budget is exhausted. We also construct a skyline where the interaction policy is also learned by RL but with an operation policy that executes subgoals perfectly. As discussed in § 5, we are primarily interested in goal-finding success rate and will refer to this metric simply as success rate.
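The rule-based baseline can be sketched as a budgeted random policy. This is our reading of the description above rather than the authors' implementation: the budget values, the uniform choice among allowed actions, and the fallback to Done are assumptions, and the function only generates the sequence of interaction actions without simulating navigation.

```python
import random

def rule_based_episode(budgets, max_stack, rng):
    """Return the interaction actions of one episode of the baseline sketch."""
    actions = ["Goal"]  # the baseline always starts by clarifying the goal
    counts = {"Cur": 0, "Sub": 0, "Do": 0, "Done": 0}
    depth = 1           # number of goals on the stack (main goal only)
    while depth > 0:
        # Actions that still have budget left.
        candidates = [a for a in ("Cur", "Sub", "Do") if counts[a] < budgets[a]]
        # Sub is unavailable when the goal stack is full.
        if depth >= max_stack and "Sub" in candidates:
            candidates.remove("Sub")
        # Done may not outnumber Sub unless the Sub budget is exhausted.
        sub_exhausted = counts["Sub"] >= budgets["Sub"]
        if counts["Done"] < counts["Sub"] or sub_exhausted:
            candidates.append("Done")
        if not candidates:
            candidates = ["Done"]  # assumed fallback so the episode can end
        action = rng.choice(candidates)
        counts[action] += 1
        if action == "Sub":
            depth += 1
        elif action == "Done":
            depth -= 1
        actions.append(action)
    return actions

# Hypothetical budgets, loosely in the range of the average action counts.
episode = rule_based_episode({"Cur": 2, "Sub": 2, "Do": 11}, max_stack=2,
                             rng=random.Random(0))
```

Because Done is blocked on the main goal while Sub budget remains, the baseline cannot end an episode before it has used up its subgoal requests.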

Main Results (Table 1).

To inspect the potential benefits of asking for additional information, we compute how much the operation policy improves when it is supplied with dense information about the current and/or goal states. As seen, success rate of the operation policy is lifted dramatically when both the current-state and goal descriptions are dense (a 2× increase on UnseenStr, 5× on UnseenObj, and 3× on UnseenEnv). We find that dense information about the current state is more helpful on UnseenStr, while dense information about the goal is more valuable on UnseenObj and UnseenEnv. This is reasonable because on UnseenStr, the agent has been trained to find similar goals. In contrast, the initial goal descriptions in UnseenObj and UnseenEnv are completely new to the agent, thus gathering more information about them is necessary.

Aided by our RL-learned interaction policy, the agent observes a substantial 2× increase in success rate on UnseenStr, 5× on UnseenObj, and 7× on UnseenEnv, compared to when performing tasks using only the operation policy. In unseen environments, with its capability of requesting subgoals, the agent doubles the success rate of the operation policy that has access to dense descriptions (19.8% versus 9.4%). The RL-learned interaction policy is also more effective than the rule-based baseline in all three conditions. The policy is also efficient: on average, it makes about 4.8 requests to the assistant (less than 1/4 of all the actions it takes during an episode).
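The improvement factors quoted above follow directly from the goal-finding success rates in Table 1. The short calculation below reproduces them; the variable names are ours, and the numbers are copied from the table.

```python
# Goal-finding success rates (%) from Table 1.
operation_only = {"UnseenStr": 43.4, "UnseenObj": 16.4, "UnseenEnv": 3.0}
with_hari = {"UnseenStr": 85.8, "UnseenObj": 78.2, "UnseenEnv": 19.8}
dense_operation_unseen_env = 9.4

# Fold improvement of the RL-learned interaction policy over the
# operation policy with sparse descriptions, rounded to the nearest integer.
fold = {name: round(with_hari[name] / operation_only[name])
        for name in operation_only}

# Improvement over the operation policy with dense descriptions (UnseenEnv).
fold_vs_dense = round(with_hari["UnseenEnv"] / dense_operation_unseen_env)

# Average number of requests per episode: Cur + Goal + Sub.
requests = round(2.1 + 1.0 + 1.7, 1)
```

The unrounded ratios are roughly 1.98, 4.77, and 6.60, so the 2×, 5×, and 7× figures are rounded values.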

On UnseenStr and UnseenObj, the RL-learned policy has not closed the gap with the operation policy that performs tasks with dense descriptions. Our investigation finds that limited information often causes the policy to not request information about the current state (i.e. taking Cur) and terminate prematurely or go to a wrong place. Encoding uncertainty in the current-state description (e.g., finkel2006solving; nguyen15calibration) is a plausible future direction for tackling this issue.

Finally, results obtained by replacing the learned operation policy with one that behaves optimally on subgoals show that further improving performance of the operation policy on short-distance goals would effectively enhance the agent’s performance on long-distance goals.

Figure 2: Analyzing the behavior of the RL-learned interaction policy (on validation tasks). (a) How frequently does the interaction policy execute different actions, on average across different types of tasks? Subgoals are requested much more in unseen environments. (b) Over the course of a trajectory, how does the frequency of different types of actions change in unseen environments? Subgoals are requested in the middle, goal information at the beginning.

Figure 3: Analyzing the effect of simultaneously varying the cost of the Cur, Goal, Sub, and Do actions in Hari (on validation tasks), thus trading off success rate versus number of actions taken. (a) Effect of cost on success rate. (b) Effect of cost on action counts.

Behavior of the RL-Learned Interaction Policy.

Figure 2(a) characterizes behaviors of the RL-learned interaction policy in three evaluation conditions. We expect that tasks in UnseenStr are the easiest and those in UnseenEnv are the hardest. As the difficulty of the evaluation condition increases, the interaction policy issues more Cur, Sub, and Do actions. The average number of Goal actions does not vary, showing that the interaction policy has correctly learned that making more than one goal-clarifying request is unnecessary. Figure 2(b) illustrates the distribution of each action along the length of an episode in the validation UnseenEnv dataset. The Goal action, if taken, is always taken only once and immediately in the first step. The number of Cur actions gradually decreases over time. The agent makes most Sub requests in the middle of an episode, after it has attempted but failed to accomplish the main goal. We observe similar patterns on the other two validation sets.

Effects of Varying Action Cost.

As mentioned, we assign the same cost to each Cur, Goal, Sub, or Do action. Figure 3(a) demonstrates the effects of changing this cost on the success rate of the agent. Setting the cost equal to 0.5 makes it too costly to take any action, inducing a policy that always calls Done in the first step and thus fails on all tasks. Overall, the success rate of the agent rises as we reduce the action cost. The increase in success rate is most visible in UnseenEnv and least visible in UnseenStr. Figure 3(b) provides more insights. As the action cost decreases, we observe a growth in the number of Sub and Do actions taken by the interaction policy. Meanwhile, the numbers of Cur and Goal actions are mostly static. Since requesting subgoals is more helpful in unseen environments than in seen environments, the increase in the number of Sub actions leads to the more visible boost in success rate on UnseenEnv tasks.

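Columns 2-4 report the goal-finding success rate (%); columns 5-8 report the average number of actions.

| Stack size | Unseen Start | Unseen Object | Unseen Environment | Cur | Goal | Sub | Do |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 (no subgoals) | 92.2 | 78.4 | 12.5 | 5.1 | 1.9 | 0.0 | 10.7 |
| 2 | 86.9 | 77.6 | 21.6 | 2.1 | 1.0 | 1.7 | 11.2 |
| 3 | 83.2 | 78.6 | 33.5 | 1.3 | 1.0 | 5.0 | 8.2 |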
Table 2: Success rates and numbers of actions taken with different stack sizes in Hari (on validation tasks). Larger stack sizes help significantly in unseen environments, but not in seen environments.

Performing Tasks with Deeper Goal Stacks.

In Table 2, we test the functionality of our framework with a stack of size 3, allowing the agent to request subgoals of subgoals. As expected, success rate on UnseenEnv is boosted significantly (+11.9% compared to using a stack of size 2). Success rate on UnseenObj is largely unchanged; we find that the agent makes more Sub requests (on average 4.5 requests per episode compared to 1.0 request when the stack size is 2), but doing so does not further enhance performance. The agent makes fewer Cur requests, possibly in order to offset the cost of making more Sub requests. Due to this behavior, success rate on UnseenStr declines with larger stack sizes, as information about the current state is more valuable for these tasks than subgoals. These results show that the critic model overestimates the values in states where Sub actions are taken, leading the agent to learn to request subgoals more than needed.
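The role of the stack size can be illustrated with a minimal sketch. It assumes that the main goal sits at the bottom of the stack, that a Sub request pushes a subgoal only while the stack is below its maximum size, and that finishing a goal pops it. The class and method names (`GoalStack`, `request_subgoal`, `finish_goal`) are hypothetical and are not taken from the paper's implementation.

```python
class GoalStack:
    """Bounded stack of goals; the main goal is at the bottom (illustrative only)."""

    def __init__(self, main_goal, max_size):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self.max_size = max_size
        self.goals = [main_goal]

    def can_request_subgoal(self):
        # A subgoal can only be pushed while there is room on the stack.
        return len(self.goals) < self.max_size

    def request_subgoal(self, subgoal):
        """Push a subgoal; return False if the stack is already full."""
        if not self.can_request_subgoal():
            return False
        self.goals.append(subgoal)
        return True

    def current_goal(self):
        # The goal currently being executed is the top of the stack.
        return self.goals[-1]

    def finish_goal(self):
        """Pop the current goal; return True once the main goal has been popped."""
        self.goals.pop()
        return len(self.goals) == 0


# With max_size=1 no subgoal fits; with max_size=3 a subgoal of a subgoal fits.
stack = GoalStack("main goal", max_size=3)
stack.request_subgoal("subgoal")
stack.request_subgoal("subgoal of subgoal")
```

Under these assumptions, a stack of size 1 corresponds to the "no subgoals" row of Table 2, and a stack of size 3 allows one extra level of nesting.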

7 Related Work and Conclusion

To decide when to ask humans for help, previous work measures uncertainty of output distributions (nguyen2019vnla; zhu2008active) or directly trains an asking policy via interaction (chi2020just; nguyen2019hanna; kolb2019learning). In these approaches, the agent only calls for generic help and does not convey specific information-seeking intent. Teaching agents to generate questions by mimicking human-generated language utterances is widely adopted in constructing dialog systems (Roman2020; rao-daume-iii-2018-learning; das2017visual; li2016learning) or generating natural-language explanations (camburu2018snli; chen2021generate; narang2020wt5; kim2018textual; lamm2021qed). As discussed previously, naively mirroring human external behavior cannot enable agents to understand the limits of their knowledge. Our work is closely related to he2013dynamic, where an agent learns to select a subset of the most useful input features. However, this work assumes the features arrive in a fixed order, and the agent at every step simply decides whether it needs more features. In contrast, our agent directly selects what type of information it needs. To teach agents to form and communicate information-seeking intents, our work enhances the single-agent POMDP model in two ways: (a) the agent can interact with an assistant to leverage their knowledge and (b) the agent can construct more complex goal hierarchies using a stack data structure, whereas most previous formulations of hierarchical reinforcement learning (le2018hierarchical; kulkarni2016hierarchical; sutton1999between) are restricted to only two levels of goals. Enhancing the sample efficiency of the learning policy by exploiting the hierarchical policy structure is an exciting future direction. Furthermore, techniques for generating faithful explanations (kumar-talukdar-2020-nile) can be applied to enhance the specificity of the generated questions.
One important question for future work is how our formulation generalizes to richer environments with more complex action and state spaces (ALFRED20).


Appendix A Training Procedure

Training Algorithms.

We pre-train the operation policy with DAgger (ross2011reduction), minimizing the cross entropy between its action distribution and that of a shortest-path oracle (which is a one-hot distribution with all probability concentrated on the optimal action).
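Because the oracle distribution is one-hot, the cross entropy reduces to the negative log-probability that the policy assigns to the optimal action. The sketch below shows only this per-step loss, not the full DAgger loop; the function names and the use of raw logits as input are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of action scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def imitation_loss(logits, optimal_action):
    """Cross entropy between a one-hot oracle distribution and the policy.

    With a one-hot target, every term of the cross entropy except the one
    for the optimal action vanishes, leaving -log p(optimal_action).
    """
    probs = softmax(logits)
    return -math.log(probs[optimal_action])


# Hypothetical scores for four navigation actions; the oracle picks action 2.
loss = imitation_loss([0.2, -1.0, 1.5, 0.0], optimal_action=2)
```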

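We use advantage actor-critic (mnih2016asynchronous) to train the interaction policy $\psi_\theta$. This method simultaneously estimates an actor policy $\psi_\theta$ and a critic function $V_\phi$. Given an execution $e = (s_1, a_1, c_1, \ldots, s_H, a_H, c_H)$, the gradients with respect to the actor and critic are

$$\nabla_\theta \mathcal{L}_{\text{actor}} = \sum_{t=1}^{H} \left( C_t - V_\phi(b_t^{v}) \right) \nabla_\theta \log \psi_\theta(a_t \mid b_t^{a}), \qquad \nabla_\phi \mathcal{L}_{\text{critic}} = \sum_{t=1}^{H} \left( V_\phi(b_t^{v}) - C_t \right) \nabla_\phi V_\phi(b_t^{v})$$

where $C_t = \sum_{j=t}^{H} c_j$, $b_t^{a}$ is a belief state that summarizes the partial execution $e_{1:t}$ for the actor, and $b_t^{v}$ is a belief state for the critic.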
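As a rough illustration of advantage actor-critic in a cost-minimizing setting, the sketch below computes surrogate losses whose gradients have the usual actor and critic form. It assumes undiscounted costs-to-go as targets and takes the critic's value predictions and the actor's log-probabilities as plain lists. The function names are hypothetical, and a real implementation would use an automatic-differentiation library and treat the advantage as a constant when differentiating the actor loss.

```python
def costs_to_go(costs):
    """Return C_t = c_t + c_{t+1} + ... + c_H for every step t."""
    total = 0.0
    result = []
    for c in reversed(costs):
        total += c
        result.append(total)
    result.reverse()
    return result


def a2c_losses(costs, values, log_probs):
    """Surrogate actor and critic losses for one execution.

    costs:     per-step costs c_1..c_H
    values:    critic predictions of the cost-to-go at each step
    log_probs: log-probabilities of the actions the actor actually took
    """
    targets = costs_to_go(costs)
    # Advantage in cost form: positive when the step turned out worse than predicted.
    advantages = [C - v for C, v in zip(targets, values)]
    # Minimizing this lowers the log-probability of worse-than-expected actions.
    actor_loss = sum(adv * lp for adv, lp in zip(advantages, log_probs))
    # Squared error between predicted and observed cost-to-go.
    critic_loss = 0.5 * sum((v - C) ** 2 for C, v in zip(targets, values))
    return actor_loss, critic_loss
```

Since costs are minimized rather than rewards maximized, the sign of the advantage is flipped relative to the reward-based presentation of the method.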

Cost function.

The cost function introduced in § 4.4 is not effective for learning the interaction policy because the task error is given only at the end of an episode. We extend the reward-shaping method proposed by ng1999policy to goal-conditioned policies, augmenting the original cost function with a shaping function $\Phi_g(s)$. We set $\Phi_g(s)$ to be the (unweighted) shortest-path distance from $s$ to $g$. The cost received by the agent at time step $t$ is $\tilde{c}_t = c_t + \Phi_g(s_{t+1}) - \Phi_g(s_t)$. We assume that the agent transitions to a special terminal state $s_{\text{term}}$ and remains there after it terminates execution of the main goal. We set $\Phi_g(s_{\text{term}}) = 0$, where $s_{\text{term}}$ signals that the episode has ended. Hence, the cumulative cost of an execution under the new cost function is

$$\sum_{t=1}^{H} \tilde{c}_t = \sum_{t=1}^{H} c_t + \Phi_g(s_{H+1}) - \Phi_g(s_1) = \sum_{t=1}^{H} c_t - \Phi_g(s_1), \qquad s_{H+1} = s_{\text{term}}.$$

Since $\Phi_g(s_1)$ does not depend on the action taken in $s_1$, minimizing the new cumulative cost does not change the optimal policy for the task $(s_1, g)$.
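The telescoping argument can be checked numerically with a small sketch. It assumes the shaped cost at each step is the original cost plus the change in shortest-path distance to the goal, with the distance defined as zero in the terminal state. The toy graph, the cost values, the `TERMINAL` marker, and all function names are made up for illustration.

```python
from collections import deque

TERMINAL = "<terminal>"  # hypothetical marker for the special terminal state


def shortest_path_distance(graph, source, target):
    """Unweighted shortest-path distance via breadth-first search."""
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    raise ValueError("target is unreachable from source")


def potential(graph, state, goal):
    # The potential is the distance to the goal, and zero once the episode has ended.
    return 0 if state == TERMINAL else shortest_path_distance(graph, state, goal)


def shaped_costs(graph, states, costs, goal):
    """Add the change in potential to each cost; states has len(costs) + 1 entries."""
    return [
        c + potential(graph, s_next, goal) - potential(graph, s, goal)
        for c, s, s_next in zip(costs, states, states[1:])
    ]


# Toy chain A - B - C - D with goal D; the agent stops early at C.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
states = ["A", "B", "C", TERMINAL]
costs = [0.01, 0.01, 1.0]  # illustrative values only
new_costs = shaped_costs(graph, states, costs, goal="D")
# The two totals differ only by the start-to-goal distance, a constant of the task.
difference = sum(costs) - sum(new_costs)
```

Here `difference` equals the shortest-path distance from the start state to the goal, regardless of which path the agent takes, which is why the shaping leaves the ranking of policies unchanged.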

Model Architecture.

We adapt the V&L BERT architecture (hong2020recurrent) for modeling the operation policy. Our model has two components, an encoder and a decoder, both implemented as Transformer models (vaswani2017attention). The encoder takes as input a description of the current state or of the goal and generates a sequence of hidden vectors. In every step, the decoder takes as input the previous hidden vector, the sequence of vectors representing the current-state description, and the sequence of vectors representing the goal description. It then performs self-attention on these vectors to compute the current hidden vector and a probability distribution over navigation actions.

The interaction policy (the actor) is an LSTM-based recurrent neural network. The input of this model consists of the operation policy's model outputs, namely the current hidden vector and the distribution over navigation actions, together with the embedding of the previously taken action. The critic model has a similar architecture but outputs a real number (the value) rather than an action distribution. When training the interaction policy, we always fix the parameters of the operation policy. We find it necessary to pre-train the critic before training it jointly with the actor.

Representation of State Descriptions.

The representation of each object, room, or action is computed as follows. Each object has five features: its name, horizontal angle, vertical angle, distance, and type (a type is either Object, Room, or Action; in this case, the type is Object). For simplicity, we discretize real-valued features, resulting in 12 horizontal angles (corresponding to ), 3 vertical angles (corresponding to ), and 5 distance values (we round down a real-valued distance to the nearest integer). We then look up the embedding of each feature in an embedding table and sum all the embeddings into a single vector that represents the corresponding object. For a room, , are zeroes. For an action, the name is either ActionStop for the stop action or ActionGo otherwise.
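The lookup-and-sum representation described above can be sketched as follows. The bin boundaries are assumptions, since the exact angle values are not recoverable here: the sketch uses twelve 30-degree horizontal bins, three vertical bins split at plus or minus 15 degrees, and distances floored and clipped to five values. The embedding table is a toy with random vectors, and all names are hypothetical.

```python
import math
import random


def discretize(horizontal_deg, vertical_deg, distance_m):
    """Map real-valued features to discrete bins (bin edges are assumptions)."""
    # Assumption: 12 horizontal bins of 30 degrees each over [0, 360).
    h_bin = min(int((horizontal_deg % 360) // 30), 11)
    # Assumption: 3 vertical bins for looking down, level, and up.
    v_bin = 0 if vertical_deg < -15 else (2 if vertical_deg > 15 else 1)
    # From the text: round the distance down. Clipping to 5 values is an assumption.
    d_bin = min(max(int(math.floor(distance_m)), 0), 4)
    return h_bin, v_bin, d_bin


class EmbeddingTable:
    """Toy embedding table that lazily creates a random vector per key."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.rows = {}

    def lookup(self, key):
        if key not in self.rows:
            self.rows[key] = [self.rng.gauss(0.0, 1.0) for _ in range(self.dim)]
        return self.rows[key]


def embed_item(table, name, horizontal_deg, vertical_deg, distance_m, item_type):
    """Sum the embeddings of all five features into a single vector."""
    h_bin, v_bin, d_bin = discretize(horizontal_deg, vertical_deg, distance_m)
    keys = [("name", name), ("h", h_bin), ("v", v_bin), ("d", d_bin), ("type", item_type)]
    vectors = [table.lookup(key) for key in keys]
    return [sum(components) for components in zip(*vectors)]


table = EmbeddingTable(dim=4)  # illustrative size, not the paper's hidden size
chair_vector = embed_item(table, "chair", 95.0, 0.0, 2.7, "Object")
```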

During pre-training, we randomly drop features in the current-state and goal descriptions so that the operation policy is familiar with making decisions under sparse information. Concretely, we refer to all features of an object, room, or action as a feature set. For the current-state description, let be the number of objects in a description. We uniformly randomly keep feature sets among the feature sets of (the plus one is the room's feature set), where .
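A minimal sketch of this feature-set dropping is given below. The distribution of the number of kept feature sets is not recoverable from the text, so the sketch assumes it is uniform between 1 and the total number of feature sets (objects plus the room). The function name and the dictionary representation of a feature set are hypothetical.

```python
import random


def drop_feature_sets(object_sets, room_set, rng):
    """Keep a random subset of the feature sets of a description.

    object_sets: list of per-object feature sets
    room_set:    the feature set of the room (the "plus one")
    rng:         a random.Random instance, passed in for reproducibility
    """
    all_sets = list(object_sets) + [room_set]
    # Assumption: the number kept is uniform on {1, ..., len(all_sets)}.
    num_keep = rng.randint(1, len(all_sets))
    # Sample positions without replacement and keep the original order.
    kept_positions = sorted(rng.sample(range(len(all_sets)), num_keep))
    return [all_sets[i] for i in kept_positions]


rng = random.Random(0)
objects = [{"name": "chair"}, {"name": "table"}, {"name": "lamp"}]
room = {"name": "kitchen"}
sparse_description = drop_feature_sets(objects, room, rng)
```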

For the goal description, we have two cases. If is neither adjacent nor equal to , we uniformly randomly alternate between giving a dense and a sparse description. In this case, the sparse description contains the features of the target object and the goal room's name. Otherwise, with a probability of 1/3 each, we give (a) a dense description, (b) a (sparse) description that contains the target object's features and the goal room's name, or (c) a (sparse) description that describes the next ground-truth action.
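The two sampling cases for the goal description can be sketched as below. The boolean `near_goal` stands for the "adjacent or equal" condition, and the mode labels are hypothetical names for the dense description and the two sparse variants.

```python
import random


def sample_goal_description_mode(near_goal, rng):
    """Pick which kind of goal description to give during pre-training."""
    if not near_goal:
        # Far from the goal: dense or sparse, each with probability 1/2.
        return rng.choice(["dense", "sparse_target"])
    # Adjacent or equal: three options, each with probability 1/3.
    return rng.choice(["dense", "sparse_target", "sparse_next_action"])


rng = random.Random(0)
mode = sample_goal_description_mode(near_goal=True, rng=rng)
```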

We pre-train the operation policy on various path lengths (ranging from 1 to 10 graph nodes) so that it learns to accomplish both long-distance main goals and short-distance subgoals.

| Split | Number of examples |
| --- | --- |
| Pre-training | 82,104 |
| Pre-training validation | 3,000 |
| Training | 65,133 |
| Validation UnseenStr | 1,901 |
| Validation UnseenObj | 1,912 |
| Validation UnseenEnv | 1,967 |
| Test UnseenStr | 1,653 |
| Test UnseenObj | 1,913 |
| Test UnseenEnv | 1,777 |
Table 3: Dataset statistics.


Table 3 summarizes the data splits. From a total of 72 environments provided by the Matterport3D dataset, we use 36 environments for pre-training, 18 as unseen environments for training, 7 for validation UnseenEnv, and 11 for test UnseenEnv. We use a vocabulary of size 1738, which includes object and room names, and special tokens representing the distance and direction values. The length of a navigation path ranges from 5 to 10 graph nodes.

| Hyperparameter | Value |
| --- | --- |
| Max. subgoal distance () | 3 nodes |
| Max. stack size () | 2 |
| Max. object distance for | 5 meters |
| Max. object distance for | 3 meters |
| Max. number of objects () | 20 |
| Cost of taking each Cur, Goal, Sub, Do action | 0.01 |
| Operation policy | |
| Hidden size | 256 |
| Number of hidden layers | 2 |
| Attention dropout probability | 0.1 |
| Hidden dropout probability | 0.1 |
| Number of attention heads | 8 |
| Optimizer | Adam |
| Learning rate | |
| Batch size | 32 |
| Number of training iterations | |
| Max. number of time steps () | 15 |
| Interaction policy | |
| Hidden size | 512 |
| Number of hidden layers | 1 |
| Entropy regularization weight | 0.001 |
| Optimizer | Adam |
| Learning rate | |
| Batch size | 32 |
| Number of critic pre-training iterations | |
| Number of training iterations | |
| Max. number of time steps () | 30 |
| Max. number of time steps for executing a subgoal | 3 × shortest distance to the subgoal |
Table 4: Hyperparameters.


See Table 4.