A long-term goal of AI is to develop agents that can help humans execute complex tasks in the real world. Since reward functions that are aligned with human intentions are hard to manually specify (faulty)
, other approaches besides Reinforcement Learning are needed for creating agents that behave in the intended way. Among these are Reward Modeling(reward_modeling) and Imitation Learning (imitation_learning), but, eventually, it would be useful if we could use natural language to transmit wishes to the agents.
Recently, babyguide made progress in this direction by showing how communication can be used to guide a learner in a gridworld environment. Using emergent discrete messages, the guide is able to speed up the learning process of the learner and to let it generalize across incrementally more difficult environments.
In this setting, the communication channel is completely one-way: in each time step, the guide transmits a message that may help the learner make its decisions. In reality, however, communication is more complex than that: the guidance may be expensive and it can, therefore, be beneficial to have more sparse messages. Furthermore, the learner may want to ask for clarification if something is unclear. Therefore, it would be worthwhile if there was a communication channel from the learner back to the guide. It is this interactive nature of communication that arguably is needed for more advanced AI systems (mikolov).
In this paper, we equip the learner introduced by babyguide with a binary gate to indicate its need for guidance in each time step. A penalty for the use of guidance incentivizes a sparse usage of the gate. By analyzing the relationship between the learner’s usage of the gate and a number of measures, we show that the learner indeed learns to ask for guidance in a smart and economical way.
2 Related Work
In this section we briefly lay out relevant work that relate to our approach on the dimensions of following language instructions, emergent communication and the interactions that emerge from guidance requests.
2.1 Following Language Instruction
In recent years, much research has been conducted in the field of following language instructions. Since language commands build a way to interact with agents and communicate information in an effective and human-interpretable way, the agent’s processing and understanding of these commands is relevant for our project. Starting with manually engineered mappings (winograd1971procedures), the currently most relevant grounded language acquisition methods focus on learning a parser to map linguistic input to its executable equivalent, i.e. action specifications using statistical models (yu2018interactive), (kim2012unsupervised), (artzi2013weakly), (mei2016listen). In the BabyAI platform introduced by babyai a synthetic “Baby Language” is used, which consists of a subset of English and whose semantics is generated by a context-free grammar (as opposed to instruction templates) and is easily understood by humans. They employ a single model to combine linguistic and visual information, similar to misra2017mapping. Our setup builds on babyguide who extend that platform with a guide, like co2018guiding, that supplements the agent’s information with iterative linguistic messages.
2.2 Emergent Communication
In order to benefit most from the guide, the agent would ideally communicate back, thus creating a multi-agent cooperation scenario. Recent research in this area investigates the emergence and usage of emergent language, e.g. in the context of referential games (lazaridou2016multi). Furthermore, mordatch2018emergence show that multiple agents can develop a grounded compositional language to fulfill their tasks more effectively with spoken exchange. In our setup the emergent communication consists of discrete word tokens similar to havrylov2017emergence. jiang2018learning propose an attentional communication model to learn when communication is needed (and helpful), resulting in more effective (large-scale) multi-agent cooperation.
2.3 Guidance Requests
In prior work (babyguide), guidance is given at every time step and the communication is one-way from guide to learner. We extend this approach by allowing a communication channel in the other direction. Here we survey work that uses similar requests for help.
Most similar to our work is 1997, where “Ask for help” is proposed: in this setting, an RL agent has one additional action with which it can signify to a “trainer” that it wants to receive help. The trainer then chooses the agent’s action. Whether to ask for help is based on uncertainty about the highest action value. This is different from our setting in which the uncertainty is only implicitly responsible for queries, as can be seen in Section 5. kosoy studies the “Ask for help” setting theoretically and proves a regret bound for agents that act in infinite-horizon discounted MDPs and are able to delegate actions to an “advisor”.
In nguyen2019vision there is a help-requesting policy that can signify if the agent needs help. If this is the case, a guide answers with a language-based instruction of subgoals. Additionally, there is a budget that limits asking for help.
Also related is on_the_job, where structured prediction problems are considered: a sequence of words is received and each word is supposed to be mapped to a label. The system can query a crowd (as in crowd-sourcing) to obtain answers on specific words in the sequence. As in our case, querying the crowd is penalized by an additional loss.
In active_reinforcement_learning, arl2, active reinforcement learning (ARL) is proposed: different from usual RL, the agent has to choose in each round if it wants to receive the reward that results from its action, which results in a constant query-cost . Note that in this setting, what is queried is feedback, whereas in our setting, the model queries guidance prior to making a decision. Active Reward Learning (active_reward_learning) is a similar approach in the context of continuous control tasks.
3.1 BabyAI Game
All our experiments take place in the BabyAI platform (babyai). In this platform, an agent learns to complete tasks given by instructions in a subset of English in a mini gridworld environment. The environments are only partially observable to the agent.
Figure 1 shows two example levels. In total there are 19 different levels that increase in difficulty and complexity of tasks. For each level, the BabyAI framework can randomly generate many missions that require roughly the same skillset and are provided with a similar language instruction. For the following investigation, we only focus on the levels “GoToObj” and “PutNextLocal”. These are chosen to be simple but nevertheless require a representative set of skills. As such, our results can be understood as a proof of concept.
In this section, we describe the model that we use for a learner that may ask for guidance. We refer to Figure 2 for a visual explanation.
Based on the observation , the instruction and its own memory unit, the learner builds a representation :
is a tensor describing the viewable environment (e.g. the brighter area in Figure 1) and is a natural language instruction (e.g. ”Go to the purple ball”). The representation unit uses a FiLM module film followed by a memory RNN.
First, consider a learner without guidance. In this setting it directly takes the representation and feeds it into the policy module that outputs the next action . For more details on this baseline setting, see babyai.
A guided learner
Now consider babyguide, where a guide is added. The guide follows the same architecture as the learner-policy combination, but between the representing unit and the policy there is a discrete “bottleneck” that only allows different messages to pass. The policy then needs to encode this message continuously in order to choose the correct action out of possibilities. After this guide-policy combination is trained, the messages are fed into the policy attached to a newly initialized learner in order to help it make its decision. In this later guided training stage, the policy of the guide is not used anymore.
More formally, the guide uses the same input and a memory unit to produce a message of two words with 3 possible tokens for each.
The message is then encoded to a higher dimensional continuous encoding that is produced by an encoder of the same architecture as the encoder used while training the guide. The policy then bases its decision on both the learned representation and encoding , which are simply concatenated:
More details can be found in babyguide.
Adding a gate
To enable the learner to decide when to receive guidance, we extend the learner with a gate module to learn a gating weight that switches the policy input between (guided) and (unguided):
where . The gate module
is a two-layered MLP with a tanh activation functions that outputs a scalar, followed by a sigmoid activation function and a threshold function with parameter. The module that produces the gating weight will from here on be referred to as the gate.
We use a pretrained Reinforcement Learning (RL) expert, trained as in babyguide and babyai by proximal policy optimization (ppo). After training, the expert is placed once in many missions in order to create training data containing the expert behavior. Using imitation learning as in babyguide
, we have a cross-entropy loss functionthat measures how much the distribution over actions given by the policy of our model deviates from the “correct” action of the RL expert.111
For mitigating confusion, we mention explicitly that the RL expert is not the same as the guide: the expert creates the data that is used for backpropagating the model and thus for training itfollowing the choice of the action. The guide, however, gives its guidance prior to the decision about the chosen action.
Furthermore, we penalize the learner for asking for guidance by adding the gating weight to the loss and balance these incentives by a hyperparameter:
We use in all GoToObj experiments and for PutNextLocal, values that were found by hyperparameter search. The combined model consisting of pre-trained guide, also trained by imitation learning as in babyguide, and the newly initialized learner is then trained end-to-end by backpropagating the gradients to all the weights of the combined model. In order to pass the gradients also through the discrete gate
, we use the straight-through estimator(straight-through; binarized_straight_through). In order to allow the learner to learn the usefulness of the guidance at the beginning of training, we initialize with a positive bias.
In this section, we describe the experiments conducted in order to test the setting of a learner that queries the guide for help. In order to assess this, we train the combined model with for runs on the simplest level, GoToObj, until convergence. In this level, the learner is instructed to go to a specific object. Results with for runs on the level PutNextLocal can be found in Appendix B.
Performance and dynamics. First of all, we are interested in how our model performs compared to baselines. The first baseline is the learner on its own trained with imitation learning, which was the setting in babyai. The second baseline is the learner that receives guidance in each round without the need to query for it, which was studied in babyguide. We expect that our model learns faster than the original learner model, due to its access to guidance, but slower than the guided learner without the gate, since it does not always receive guidance. The results can be found in Figure 3 (a), where we plot the success rate, i.e. portion of successfully completed episodes, during validation.
We monitor the average gating weight over epochs, which we callguidance rate, to inspect the usage of the gate over the course of the training. Furthermore, we compare the overall accuracy with the accuracy conditioned on the cases of an open or closed gate to assess the influence on performance. These metrics can be found in Figure 3 (b)222The accuracy is the percentage of chosen actions that coincide with the “correct” action of the RL expert that is used in the imitation learning process..
Since the accuracy plots in Figure 3 (b) are solely correlational, we furthermore plot the validation accuracy for the case where we intervene during validation in the gate in order to have it opened or closed to assess the causal influence of the open gate on the accuracy, see Figure 4.
Economic requests. While the accuracy measures how often the agent was “right”, the cross-entropy policy loss gives greater insights into the performance with respect to the actual training objective. We would like to assess whether the learner uses the gate economically, since it is penalized. This means to ask for guidance in situations in which it expects the greatest reduction in the policy loss. The policy loss for cases of open and closed gate can therefore be found in Figure 5. We compare it with the counterfactual policy losses that arise if we force the gate to be opened if the learner wants it to be closed and vice versa. This intervention now allows us to assess the causal influence of the gate on the policy loss.
Guidance semantics. Finally, we are interested in whether there are meaningful correlations between the frames in which the learner asks for guidance and the actions that the learner takes (Figure 6) as well as the messages emitted by the guide (Figure 7). By analyzing this, we can see if the learner masters certain situations that require a certain action or are accompanied by a certain message.
5.1 Results and analysis of experiments
After outlining the experiments, we now briefly analyze the results.
Performance and dynamics. First, we notice in Figure 3 (a) that, while initially our model is indeed faster than the learner baseline and keeps up with the guided baseline, from epoch to there is a performance dip not seen in the former models. This is precisely the phase in which the gate successively closes more and more, as can be seen in Figure 3 (b). Our explanation for this dip is that the policy first needs to adjust to the fact that it does not get the familiar guidance anymore. Eventually, from epoch onward, our model performs almost perfectly and as well as the baselines.
As mentioned already, the learner becomes indeed more independent over time, roughly choosing its action on its own in of the cases from epoch onward. As soon as the guidance rate begins to drop in epoch , we can compute an accuracy conditioned on cases where the gate is closed, as can be seen in the blue line of Figure 3 (b). This accuracy is lower than the accuracy in case of an open gate (red dotted line) and only fully catches up roughly after epochs of training.
About the intervention accuracy in Figure 4: we observe that the initial phase with a guidance rate at sees a steep increase in accuracy with guidance but nearly no change in the accuracy without guidance. We suppose that is the case since in this phase the training happens exclusively with an open gate. Then, the guidance rate drops and training happens increasingly without guidance. Accordingly, the accuracy without guidance starts to increase and eventually catches up. In many individual runs, the performance without guidance is ultimately actually even better. We hypothesize that this is since the gate is mostly closed and so the policy doesn’t “expect” the gate to be open anymore. Consequently, an open gate and additional encoded message is confusing and leads to misbehavior. Intuitively, this is similar to how humans who are very experienced in their profession may actually just be distracted by someone who occasionally tries to give them advice instead of just letting them do their task on their own.
Economic requests. In order to get a better feeling for how smart the agent is in its guidance requests, we look at Figure 5. For similar results about the entropy, see Appendix A. We see an overall tendency for the policy loss to decrease, as we would expect due to the training. At the same time, the situations where the gate is open are those that are more difficult for the learner (including in the comparison of the counterfactual cases where we change the gate). Additionally, in those situations the reduction in policy loss achieved by asking for guidance is greater – this can be seen by comparing with the counterfactual situations. We furthermore observe that after the guidance rate starts to drop around epoch , the policy loss in situations of a closed gate rapidly sinks as the learner adapts to those novel situations. In the meantime, the open gate policy loss stabilizes until around epoch , while the counterfactual policy loss in these situations strongly increases. This indicates that the learner learns to selectively open the gate in situations that are more difficult and especially so without guidance.
Guidance semantics. To gain insights about dependencies between specific actions and the guidance rate, we now look at Figure 6. We see that in situations where the learner takes action , which corresponds to “move forward”, the guidance rate drops relatively early to . This may be the case since this is the most common and supposedly most easily identifiable action. For action that corresponds to “turn right” and action that corresponds to “turn left”, we see that the guidance rate also decreases, albeit slower and asymmetrically. This may be due to a higher difficulty of distinguishing those actions from each other. Intuitively they are symmetric and it may often be unclear what to do if “move forward” is not a promising action. In some runs, the guidance rate drops more for action and in some more for action . We may attempt to explain this by the learner either learning to request help in situations where action is one of the promising options (potentially the only one) or learning the same with action . In both cases, it is ensured that situations with confusion between those two actions are encompassed.
On the guidance rate conditioned on messages: as we can see in Figure 7, mainly three messages are used to convey guidance and for all of them the guidance rate decreases over time. We suppose that the overall trends happen due to the close correspondence between messages and actions that was already observed by babyguide.
5.2 Guidance in space and time
So far, we mainly discussed “global” metrics, in the sense that we aggregate information over complete epochs. This still leaves open the question how guidance requests evolve with respect to the position of the agent and temporally during an episode.
As we can see, the trained agent asks for guidance often specifically if it is near the goal object in order to find out if it should “turn towards it”, which would cause the goal to be reached. It is important for the agent to be reasonably sure about the goal being reached beforehand, since otherwise turning to the object will result in two lost moves.
In order to test the influence of objects on the agent more quantitatively, we create another metric that conditions the guidance rate on how “goal-like” the observation is that the agent faces. For this matter, we assign tuples to each observation, where and signify how goal-like the object left and right from the agent is. We measure this by the number of features it has in common with the goal object, where features are both color and object-type. means that the object shares features with the goal object. This creates observation types333Note that even the combination is in principle possible in higher levels: There are tasks such as “Go to a red ball” where several red ball can be in the mission. However, this is unlikely and the expert never places itself between two goal objects. Furthermore, in GoToObj there is simply just one object in the mission. Therefore, the graph in Figure 10 corresponding to is empty.. See Figure 9 for examples. The results can be found in Figure 10. We can observe strong changes in the guidance rate if the goal object is to the left or right: if it is on the left, the guidance rate is significantly greater and if it is to the right, the guidance rate is significantly smaller than usually. This is in line with the plots of the guidance rate conditioned on actions, Figure 6, which already showed that turning to the left requires considerably more guidance than turning to the right. This indicates that the high guidance rate at goal objects may to a large extend be caused by the high correlation between turning actions and guidance rates and be mostly independent of the fact that there is a goal around.
Now we turn to the question about the guidance rate in time: within one episode, are there usually phases where more or less guidance is needed? The heatmaps suggest that the agent mostly asks for guidance in the end of the episode.
In order to answer this question, we create the “guidance per time quantile” plot, Figure11. As we can see, the guidance rate is in general high in the beginning of episodes and drops once more knowledge about the environment is acquired. However, in the end of the episode, the guidance rate grows again and is greatest in the very end, which is in line with the qualitative assessment from the heatmaps.
One interpretation for this is the following: in the beginning, the agent needs to roughly figure out “in which direction to head”. Once this is clear, it can walk there without further guidance. But in the very end, it needs more precision and asks for guidance again in order to finally find its goal. This is similar to how humans often look at a map in the beginning of a hike in order to figure out the direction, and then in the end again in order to reassess how their new position now relates to the goal.
6 Conclusion and future research
In this paper, we extended a recently proposed method to guide a learner in a gridworld environment by letting the learner explicitly ask for help. To accomplish this, we defined a binary gate in the learner’s model. We brought the original approach closer to the real world by (i) enabling bi-directional communication and (ii) attaching a cost to it.
We showed that the learner successfully learns to utilize the guidance gate to achieve a favorable trade-off between learning speed and amount of guidance requested. Initially using the full guidance to learn faster than a learner without guidance, it eventually learns to request guidance only where it is especially helpful, acting increasingly independent.
Future research could consist of retraining the guide to see if it learns to send more abstract messages that provide guidance over multiple time steps. This may be fruitfully combined with giving the guide an additional information advantage like a bird’s eye’s view so that the guide has more foresight than the learner. It would require a way for the learner to memorize past messages.
Another direction is to replace the gate by an emergent communication channel from the learner to the guide, so that the learner can send its guidance requests in more nuanced ways. Furthermore, we saw that the policy may have problems dealing with the additional guidance it receives unexpectedly in late training. It may be worthwhile to experiment with policy architectures that can deal better with spontaneous changes in its input.
Finally, research might as well aim at finding ways to meaningfully replace the guide agent by a human. This might allow for better learning in tasks that autonomous agents struggle to learn by themselves.
Appendix A More results for Gotoobj
We use this part of the appendix to give additional figures for the Gotoobj level.
First, similarly to the reduction in policy loss as investigated in section 5, we are also interested in the reduction of overall uncertainty (Shannon entropy) over actions that the learner achieves by asking for guidance and compare this again for the cases where the learner actually wants to open or close the gate. This can be found in Figure 12. Note however the subtlety that the learner may become more certain about which actions to choose, but focus on the wrong action. By spotting differences between the two measures we may identify situations in which the guidance misleads the learner or, vice versa, in which a very certain learner that aims for the wrong action actually gets less certain by getting the correct guidance, leading to a reduction in policy loss.
As we see, the shape of the resulting plots are relatively similar to those of the policy loss. As one notable difference, the counterfactual entropy does not temporarily increase in the same way as the policy loss for open gate situations. This indicates that the learner generally gets continuously more certain, albeit not necessarily about the right actions.
Appendix B Results for Putnextlocal
We provide results for PutNextLocal corresponding to the same results for GoToObj which were in the body of the paper. As is visible, many of our findings remain valid in this higher level. Note however that over the course of training, we observe significant overfitting. We nevertheless showed the whole development of the learning process in order to show the full development of the guidance rate.