1 Introduction

Affiliations: 1 Mila, University of Montreal; 2 DeepMind; 3 University of California, Berkeley.
A model that generalizes effectively should be able to pick up on relevant cues in the input while ignoring irrelevant distractors. For example, if one wants to cross the street, one should only pay attention to the positions and velocities of the cars, disregarding their color. The information bottleneck (DBLP:journals/corr/physics-0004057) formalizes this in terms of minimizing the mutual information between the bottleneck representation layer and the input, while maximizing its mutual information with the correct output. This type of input compression can improve generalization (DBLP:journals/corr/physics-0004057; DBLP:journals/corr/AchilleS16; DBLP:journals/corr/AlemiFD016).
The information bottleneck is generally intractable, but can be approximated using variational inference (DBLP:journals/corr/AlemiFD016). This variational approach parameterizes the information bottleneck model using a neural network (i.e., an encoder). While the variational bound makes it feasible to train (approximate) information bottleneck layers with deep neural networks, the encoder in these networks – the layer that predicts the bottleneck variable distribution conditioned on the input – must still process the full input, before it is compressed and irrelevant information is removed. The encoder itself can therefore fail to generalize, and although the information bottleneck minimizes mutual information with the input on the training data, it might not compress successfully on new inputs. To address this issue, we propose to divide our input into two categories: standard input and privileged input, and then we aim to design a bottleneck that does not need to access the privileged input before deciding how much information about the input is necessary. The intuition behind not accessing the privileged input is twofold: (a) we might want to avoid accessing the privileged input because we want to generalize with respect to it (and therefore compress it) (b) we actually would prefer not to access it (as this input could be costly to obtain).
The objective is to minimize the conditional mutual information between the bottleneck layer and the privileged input, given the standard input. This problem statement is narrower than the standard information bottleneck, but encompasses many practical use cases. For example, in reinforcement learning, which is the primary subject of our experiments, the agent can be augmented with some privileged information in the form of a model-based planner, or information that is the result of communication with another agent. This “additional” information can be seen as a privileged input because it requires the agent to do something extra to obtain it.
Our work provides the following contributions. First, we propose a variational bandwidth bottleneck (VBB) that does not look at the privileged input before deciding whether to use it or not. At a high level, the network is trained first to examine the standard input, and then stochastically decide whether to access the privileged input or not. Second, we illustrate several applications of this approach to reinforcement learning, in order to construct agents that can stochastically determine when to evaluate costly model based computations, when to communicate with another agent, and when to access the memory. We experimentally show that the proposed model produces better generalization, as it learns when to use (or not use) the privileged input. For example, in the case of maze navigation, the agent learns to access information about the goal location only near natural bottlenecks, such as doorways.
2 Problem Formulation
We aim to address the generalization issue described in the introduction for an important special case of the variational information bottleneck, which we refer to as the conditional bottleneck. The conditional bottleneck has two inputs, a standard input and a privileged input, that are represented by the random variables $S$ and $G$, respectively. Hence, together with the target output $Y$, $S, G, Y$ are three random variables with unknown joint distribution $p(S, G, Y)$.
The information bottleneck provides us with a mechanism to determine the correct output while accessing the minimal possible amount of information about the privileged input $G$. In particular, we formulate a conditional variant of the information bottleneck to minimize the mutual information $I(Z; G \mid S)$ between the bottleneck layer $Z$ and the privileged input $G$, given the standard input $S$, while avoiding unnecessary access to the privileged input $G$. The proposed model consists of two networks (see Fig. 1). The encoder network takes in the privileged input $G$ as well as the standard input $S$ and outputs a distribution over the latent variable $Z$, such that $z \sim p(Z \mid S, G)$. The decoder network takes the standard input $S$ and the compressed representation $Z$ and outputs the distribution over the target variable $Y$.
3 Variational Bottleneck on Standard Input and Privileged Input
The information bottleneck (IB) objective (DBLP:journals/corr/physics-0004057) is formulated as the maximization of $I(Z; Y) - \beta I(Z; X)$, where $X$ refers to the input signal, $Y$ refers to the target signal, $Z$ refers to the compressed representation of $X$, and $\beta$ controls the trade-off between compression and prediction. The IB has its roots in channel coding, where a compression metric $I(Z; X)$ represents the capacity of the communication channel between $X$ and $Z$. Assuming a prior distribution $r(Z)$ over the random variable $Z$, constraining the channel capacity corresponds to limiting the information by which the posterior $p(Z \mid X)$ is permitted to differ from the prior $r(Z)$. This difference can be measured using the Kullback-Leibler (KL) divergence, such that $D_{\mathrm{KL}}[p(Z \mid X) \,\|\, r(Z)]$ refers to the channel capacity.
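To make the KL-based notion of channel capacity concrete, the following minimal sketch evaluates the divergence in closed form for one common special case. The diagonal Gaussian posterior and the standard normal prior are assumptions chosen for illustration, and the function name `gaussian_kl_to_standard_normal` is hypothetical; the text above only requires some prior over the bottleneck variable.

```python
import math

def gaussian_kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in nats, summed over dimensions.

    Assumption: both posterior and prior are Gaussian, which gives the
    per-dimension closed form 0.5 * (sigma^2 + mu^2 - 1) - log(sigma).
    """
    total = 0.0
    for m, s in zip(mu, sigma):
        total += 0.5 * (s * s + m * m - 1.0) - math.log(s)
    return total

# A posterior identical to the prior transmits no information.
kl_same = gaussian_kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])

# A posterior that moves away from the prior uses more channel capacity.
kl_shifted = gaussian_kl_to_standard_normal([1.0, -2.0], [0.5, 1.0])
```

The further the posterior is allowed to move from the prior, the larger the divergence, which is the sense in which the KL term limits how much information passes through the bottleneck.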
Now, we write the equations for the variational information bottleneck, where the bottleneck is learnt on both the standard input $S$ as well as a privileged input $G$. The Data Processing Inequality (DPI) (Cover:2006:EIT:1146355) for a Markov chain $X \to Z \to Y$ ensures that $I(X; Z) \ge I(X; Y)$. Hence for a bottleneck where the input is comprised of both the standard input as well as privileged input, we have $I(Z; G \mid S) \ge I(Y; G \mid S)$. To obtain an upper bound on $I(Z; G \mid S)$, we must first obtain an upper bound on $I(Z; G \mid S = s)$, and then average over $p(s)$. We get the following result:

$$I(Z; G \mid S) \le \sum_{s} p(s) \sum_{g} p(g) \, D_{\mathrm{KL}}[p(z \mid s, g) \,\|\, r(z)] \tag{1}$$

We ask the reader to refer to the section on the conditional bottleneck in the supplementary material for the full derivation.
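The upper bound on the conditional mutual information can be checked numerically on a small discrete example. The sketch below is illustrative only: the probability tables are made-up toy values, the standard and privileged inputs are assumed independent (as in the supplementary derivation), and the prior over the bottleneck variable is an arbitrary uniform distribution.

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions, in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy distributions (hypothetical values chosen for illustration).
p_s = [0.6, 0.4]                      # distribution of the standard input
p_g = [0.3, 0.7]                      # distribution of the privileged input
p_z = {                               # encoder p(z | s, g) over three values of z
    (0, 0): [0.7, 0.2, 0.1],
    (0, 1): [0.1, 0.6, 0.3],
    (1, 0): [0.3, 0.3, 0.4],
    (1, 1): [0.25, 0.25, 0.5],
}
prior = [1 / 3, 1 / 3, 1 / 3]         # arbitrary prior over z

cond_mi = 0.0   # exact conditional mutual information between z and g given s
bound = 0.0     # averaged KL to the prior, the right-hand side of the bound
for s, ps in enumerate(p_s):
    # Marginal p(z | s), obtained by averaging the encoder over g.
    marginal = [sum(p_g[g] * p_z[(s, g)][z] for g in range(2)) for z in range(3)]
    for g, pg in enumerate(p_g):
        cond_mi += ps * pg * kl(p_z[(s, g)], marginal)
        bound += ps * pg * kl(p_z[(s, g)], prior)
```

The gap `bound - cond_mi` equals the average KL divergence between the marginal over the bottleneck variable and the prior, so the bound is tight only when the prior matches that marginal.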
4 The Variational Bandwidth Bottleneck
We now introduce our proposed method, the variational bandwidth bottleneck (VBB). The goal of the variational bandwidth bottleneck is to avoid accessing the privileged input $G$ if it is not required to make an informed decision about the output $Y$. This means that the decision about whether or not to access $G$ must be made only on the basis of the standard input $S$. The standard input is used to determine a channel capacity, $d_{cap}$, which controls how much information about $G$ is available to compute $Z$.
If $d_{cap}$ denotes the channel capacity, one way to satisfy this channel capacity is to access the input losslessly with probability $d_{cap}$, and otherwise send no information about the input at all. In this communication strategy, we have $p(Z \mid S, G) = \delta(f_{enc}(S, G))$ if we choose to access the privileged input (with probability $d_{cap}$), where $f_{enc}$ is a deterministic encoder and $\delta$ denotes the Dirac delta function. The full posterior distribution over the compressed representation can be written as a weighted mixture of (a) (deterministically) accessing the privileged input and standard input and (b) sampling from the prior (when channel capacity is low), such that $z$ is sampled using

$$z \sim d_{cap} \, \delta(f_{enc}(s, g)) + (1 - d_{cap}) \, r(z) \tag{2}$$
This modified distribution allows us to dynamically adjust how much information about $G$ is transmitted through $Z$. As shown in Figure 1, if $d_{cap}$ is set to zero, $Z$ is simply sampled from the prior and contains no information about $G$. If it is set to one, the privileged information in $G$ is deterministically transmitted. The amount of information about $G$ that is transmitted is therefore determined by $d_{cap}$, which will depend only on the standard input $S$.
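The mixture described above can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: `sample_bottleneck`, `toy_encoder`, and `toy_prior` are hypothetical names, the encoder is a stand-in deterministic function, and the prior is assumed to be a standard normal.

```python
import random

def sample_bottleneck(s, g, d_cap, encoder, sample_prior, rng):
    """Draw z from the mixture: encoder output with probability d_cap, else the prior.

    Returns the sample and a flag saying whether the privileged input was used.
    """
    if rng.random() < d_cap:
        return encoder(s, g), True      # lossless access to the privileged input
    return sample_prior(rng), False     # no information about g is transmitted

def toy_encoder(s, g):
    # Hypothetical deterministic encoder, used only for illustration.
    return s + g

def toy_prior(rng):
    # Assumed standard normal prior over z.
    return rng.gauss(0.0, 1.0)

rng = random.Random(0)
# Capacity one: the privileged information is transmitted deterministically.
z_open, used_open = sample_bottleneck(1.0, 2.0, 1.0, toy_encoder, toy_prior, rng)
# Capacity zero: z is drawn from the prior and carries no information about g.
z_closed, used_closed = sample_bottleneck(1.0, 2.0, 0.0, toy_encoder, toy_prior, rng)
```

Intermediate capacities interpolate between these two extremes: the larger the capacity, the more often the sample depends on the privileged input.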
This means that the model must decide how much information about the privileged input is required before accessing it. Optimizing the information bottleneck objective with this type of bottleneck requires computing gradients through the term $D_{\mathrm{KL}}[p(z \mid s, g) \,\|\, r(z)]$ (as in Eq. 1), where $z$ is sampled as in Eq. 2. The non-differentiable binary event, whose probability is represented by $d_{cap}$, precludes us from differentiating through the channel capacity directly. In the next sections, we will first show that this mixture can be used within a variational approximation to the information bottleneck, and then describe a practical approximation that allows us to train the model with standard backpropagation.
4.1 Tractable Evaluation of Channel Capacity
In this section, we show how we can evaluate the channel capacity in a tractable way. We learn a deterministic function of the standard input $S$ which determines channel capacity. This function outputs a scalar value for $d_{cap}$, which is treated as the probability of accessing the information about the privileged input. This deterministic function is parameterized as a neural network. We then access the privileged input with probability $d_{cap}$. Hence, the resulting distribution over $Z$ is a weighted mixture of accessing the privileged input with probability $d_{cap}$ and sampling from the prior with probability $1 - d_{cap}$. At inference time, using $d_{cap}$, we sample from the Bernoulli distribution $\mathrm{Bernoulli}(d_{cap})$ to decide whether to access the privileged input or not.
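The inference-time procedure can be sketched as follows. The key property is that the privileged input is only fetched after the gate has been sampled from the standard input alone. Everything here is a hypothetical stand-in: `capacity_network` is a single logistic unit rather than a trained neural network, and `PrivilegedSource` simply counts how often the costly input is requested.

```python
import math
import random

def capacity_network(s, weights, bias):
    """Hypothetical stand-in for the learned capacity function of the standard input."""
    logit = sum(w * x for w, x in zip(weights, s)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

class PrivilegedSource:
    """Wraps a costly input and counts how often it is actually accessed."""
    def __init__(self, value):
        self.value = value
        self.calls = 0

    def fetch(self):
        self.calls += 1
        return self.value

def gated_access(s, source, weights, bias, rng):
    """Compute the capacity from s only, then sample a Bernoulli gate."""
    d_cap = capacity_network(s, weights, bias)
    if rng.random() < d_cap:
        return d_cap, source.fetch()   # privileged input is touched only here
    return d_cap, None

rng = random.Random(0)
weights, bias = [2.0, -1.0], 0.0       # illustrative parameters, not learned
balanced = PrivilegedSource("goal")
closed = PrivilegedSource("goal")
for _ in range(1000):
    gated_access([0.0, 0.0], balanced, weights, bias, rng)     # capacity 0.5
    gated_access([-20.0, 10.0], closed, weights, bias, rng)    # capacity near 0
```

With a capacity of 0.5 the source is fetched on roughly half of the steps, while a capacity near zero leaves it untouched.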
4.2 Optimization of the KL Objective
Given the standard input $s$, the privileged input $g$, the bottleneck variable $z$, and a deterministic encoder $f_{enc}(s, g)$, we can express the KL divergence between the weighted mixture and the prior as
The proof is given in Appendix B. This equation is fully differentiable with respect to the parameters of the encoder and of the channel capacity network, making it feasible to use standard gradient-based optimizers.
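Summary: As in Eq. 2, we approximate $p(z \mid s, g)$ as a weighted mixture of $f_{enc}(s, g)$ and the normal prior, such that $z \sim d_{cap} \, \delta(f_{enc}(s, g)) + (1 - d_{cap}) \, r(z)$. Hence, the quantity $D_{\mathrm{KL}}[p(z \mid s, g) \,\|\, r(z)]$ can be seen as a bound on the information bottleneck objective. When we access the privileged input $G$, we pay a cost equal to $I(Z; G \mid S)$, which is bounded by $D_{\mathrm{KL}}[p(z \mid s, g) \,\|\, r(z)]$ as in Eq. 1. Hence, optimizing this objective causes the model to avoid accessing the privileged input when it is not necessary.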
5 Variational Bandwidth Bottleneck with RL
In order to show how the proposed model can be implemented, we consider a sequential decision making setting, though our variational bandwidth bottleneck could also be applied to other learning problems. In reinforcement learning, the problem of sequential decision making is cast within the framework of MDPs (sutton1998reinforcement). Our proposed method depends on two sources of input, standard input and privileged input. In reinforcement learning, privileged inputs could be the result of performing any upstream computation, such as running model-based planning. It can also be information from the environment, such as the goal or the result of active perception. In all these settings, the agent must decide whether to access the privileged input or not. If the agent decides to access the privileged input, then the agent pays an “information cost”. The objective is to maximize the expected reward and reduce the cost associated with accessing the privileged input, such that across all states on average, the information cost of using the privileged information is minimal.
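We parameterize the agent’s policy $\pi_\theta(A \mid S, G)$ using an encoder $p_{enc}(Z \mid S, G)$ and a decoder $p_{dec}(A \mid S, Z)$, parameterized as neural networks. Here, the channel capacity network takes in the standard input and determines the channel capacity, based on which we decide whether to access the privileged input as in Section 4.1; the decoder then outputs the distribution over the actions. That is, $Y$ is $A$, and $\pi_\theta(A \mid S, G) = \sum_z p_{enc}(z \mid S, G) \, p_{dec}(A \mid S, z)$. This corresponds to minimizing $I(A; G \mid S)$, resulting in the objective

$$J(\theta) \equiv \mathbb{E}_{\pi_\theta}[R] - \beta I(A; G \mid S) \ge \mathbb{E}_{\pi_\theta}\big[R - \beta D_{\mathrm{KL}}[p(z \mid s, g) \,\|\, r(z)]\big]$$

where $R$ is the reward, $r(z)$ is the prior, and $\mathbb{E}_{\pi_\theta}$ denotes an expectation over trajectories generated by the agent’s policy. We can optimize this objective with standard optimization methods, such as stochastic gradient descent with backpropagation.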
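The trade-off between reward and information cost can be illustrated with a toy calculation. The per-step rewards, the per-step KL costs, and the trade-off coefficient below are made-up numbers, and `regularized_return` is a hypothetical helper; the sketch only shows how a policy that accesses the privileged input at a single step scores higher than one that accesses it at every step, for the same task reward.

```python
def regularized_return(rewards, kl_costs, beta):
    """Sum over a trajectory of reward minus beta times the information cost."""
    return sum(r - beta * k for r, k in zip(rewards, kl_costs))

# Hypothetical four-step episode with a sparse reward at the end.
rewards = [0.0, 0.0, 0.0, 1.0]
beta = 0.1

# Accessing the privileged input at every step pays the KL cost every time.
kl_always = [1.5, 1.5, 1.5, 1.5]
# Accessing it only at one decision point (e.g., a doorway) pays once.
kl_selective = [0.0, 1.5, 0.0, 0.0]

score_always = regularized_return(rewards, kl_always, beta)
score_selective = regularized_return(rewards, kl_selective, beta)
```

Under these assumed numbers the selective policy scores 0.85 against 0.4 for the policy that always accesses the privileged input.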
6 Related Work
A number of prior works have studied information-theoretic regularization in RL. For instance, 5967384 use information-theoretic measures to define relevant goal information, which could then be used to find subgoals. Our work is related in that our proposed method could be used to find relevant goal information, but without accessing the goal first. Information-theoretic measures have also been used for exploration (still2012information; mohamed2015variational; houthooft2016vime; gregor2016variational). More recently, goyal2019infobot proposed InfoBot, where “decision” states are identified by training a goal-conditioned policy with an information bottleneck. In InfoBot, the goal-conditioned policy always accesses the goal information, while the proposed method accesses the goal information conditionally. The VBB is also related to work on conditional computation. Conditional computation aims to reduce computation costs by activating only a part of the entire network for each example (bengio2013estimating). Our work is related in the sense that we activate the entire network, but only conditionally access the privileged input.
Another point of comparison for our work is the research on attention models (bahdanau2014neural; mnih2014recurrent; xu2015show). These models typically learn a policy that allows them to selectively attend to parts of their input. However, these models still need to access the entire input in order to decide where to attend. Our method dynamically decides whether to access privileged information or not. As shown in our experiments, our method performs better than the attention method of mnih2014recurrent.
Recently, many models have been shown to be effective at learning communication in multi-agent reinforcement learning (foerster2016learning; sukhbaatar2016learning). sukhbaatar2016learning learn a deep neural network that maps the inputs of all the agents to their respective actions. In this particular architecture, each agent sends its state as the communication message to the other agents. Thus, when each agent makes a decision, it takes in information from all the other agents. In our proposed method, each agent communicates with other agents only when it is necessary.
Our work is also related to work in behavioural research that deals with two modes of decision making (Dickinson67; 10.1257/000282803322655392; sloman1996empirical; botvinick2015motivation; shenhav2017toward): an automatic system that relies on habits and a controlled system that uses some extra information for decision making. These systems embody different trade-offs between accuracy and computational demand. The habit-based system (or the default system) has a low computation cost but is often less accurate, whereas the controlled system (which uses some external information) achieves greater accuracy but is often more costly. The proposed model also has two modes of input processing: when the channel capacity is low, the agent uses its standard input only, and when the channel capacity is high, the agent uses both the standard and the privileged input. The habit-based system is analogous to using only the standard input, while the controlled system is analogous to accessing the more costly privileged input.
7 Experimental Evaluation
In this section, we evaluate our proposed method and study the following questions:
Better generalization?: Does the proposed method learn an effective bottleneck that generalizes better on test distributions, as compared to the standard conditional variational information bottleneck?
Learn when to access privileged input?: Does the proposed method learn when to access the privileged input dynamically, minimizing unnecessary access?
We compare the proposed method to the following methods and baselines:
Randomly accessing goal (RAG): Here, we compare the proposed method to the scenario where we randomly access the privileged input (e.g., 50% of the time). This baseline evaluates whether the VBB is selecting when to access the goal in an intentional and intelligent way.
Conditional Variational Information Bottleneck (VIB): The agent always accesses the privileged input, with a VIB using both the standard and the privileged input, as in InfoBot (goyal2019infobot).
Deterministically Accessing Privileged Input (UVFA): The agent can deterministically access both the state as well as the privileged input, as in UVFA (schaul2015universal). This has been shown to improve generalization in RL problems.
Accessing Information at a Cost (AIC): We compare the proposed method to simpler reinforcement-learning baselines, where accessing privileged information can be formalized as one of the available actions. This action reveals the privileged information, at the cost of a small negative reward. This baseline evaluates whether the explicit VBB formulation provides a benefit over a more conventional approach, where the MDP itself is reformulated to account for the cost of information.
7.2 Deciding when to Run an Expensive Model Based Planner
Model-based planning can be computationally expensive, but beneficial in temporally extended decision making domains. In this setting, we evaluate whether the VBB can dynamically choose to invoke the planner as infrequently as possible, while still attaining good performance. Planning with a model-based planner (which learns the dynamics model of the environment) is straightforward, but running the planner at every step is expensive. So, here we try to answer whether the agent can decide, based on the standard input alone, when to access the privileged input (the output obtained by running the model-based planner).
Experimental Setup: We consider a maze world as shown in Figure 3(a). The agent is represented by a blue dot, and the agent has to reach the goal (represented by a green dot). The agent has access to a dynamics model of the environment (which is pretrained and represented using a parameterized neural network). In this task, the agent only gets a partial view of its surroundings, i.e., the agent observes a small number of squares in front of it. The agent has to reach the goal position from the start position, and it can use the pretrained dynamics model to sample multiple plausible trajectories. The output of the dynamics model is fed as a conditional input to the agent’s policy (similar to racaniere2017imagination). Thus the agent can use this dynamics model to predict possible futures, and then make an informed decision based on its current state as well as the predictions of the dynamics model.
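Table 1: Where the agent accesses the output of the expensive inference algorithm (the model-based planner).
|Expensive inference algorithm||% of times|
|Near the junction||72% ± 5%|
|In the hallway||28% ± 4%|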
In this setup, the current state of the agent (i.e., the egocentric visual observation) acts as the standard input $S$, and the result of running the planner acts as the privileged input $G$. In order to avoid running the model-based planner unnecessarily, the agent needs to decide when to access the more costly planner.
Results: Here, we analyze when the agent accesses the output of the planner. We find that most of the time the agent accesses the privileged information (the output of the model-based planner) near the junctions, as shown in Table 1.
7.3 Better Generalization in Goal Driven Navigation
The goal of this experiment is to show that, by selectively choosing when to access the privileged input, the agent can generalize better with respect to this input. We consider an agent navigating through a maze comprising sequences of rooms separated by doors, as shown in Figure 12. We use a partially observed formulation of the task, where the agent only observes a small number of squares ahead of it. These tasks are difficult to solve with standard RL algorithms, not only due to the partial observability of the environment but also the sparsity of the reward, since the agent receives a reward only upon reaching the goal (chevalier2018babyai). The low probability of reaching the goal randomly further exacerbates these issues. The privileged input in this case corresponds to the agent’s relative distance to the goal $G$. At junctions, the agent needs to know where the goal is so that it can make the right turn. While inside a particular room, the agent doesn’t need much information about the goal. Hence, the agent needs to learn to access goal information when it is near a door, where it is most valuable. The current visual inputs act as the standard input $S$, which is used to compute the channel capacity $d_{cap}$.
To investigate if agents can generalize by selectively deciding when to access the goal information, we compare our method to InfoBot (goyal2019infobot) (a conditional variant of VIB). We use different mazes for training, validation, and testing. We evaluate generalization to an unseen distribution of tasks (i.e., more rooms than were seen during training). We experiment on both RoomNXSY ($X$ rooms of at most size $Y$; for more details, refer to Appendix G) as well as the FindObjSY environment. For RoomNXSY, we train on RoomN2S4 (2 rooms of at most size 4), and evaluate on RoomN6S6 (6 rooms of at most size 6) and RoomN12S10 (12 rooms of at most size 10). We also evaluate on the FindObjSY environment, which consists of 9 connected rooms of size $(Y - 2) \times (Y - 2)$ arranged in a grid. For FindObjSY, we train on FindObjS5, and evaluate on FindObjS7 and FindObjS10.
|Method||Percentage of times|
|InfoBot (goyal2019infobot)||60% ± 3%|
The generalization results on RoomNXSY and FindObjSY compare an agent trained with the proposed method to a goal-conditioned baseline (UVFA) (schaul2015universal), a conditional variant of the VIB (goyal2019infobot), as well as to the baseline where accessing goal information is formulated as one of the actions (AIC). We also investigate how many times the agent accesses the goal information. We first train the agent on MultiRoomN2S4, and then evaluate this policy on MultiRoomN12S10. We sample 500 trajectories in the MultiRoomN12S10 environment. Ideally, if the agent has learned when to access goal information, it should only access the goal information when it is near a door. We take sample rollouts from the pretrained policy in this new environment and check if the agent is near a junction point (or doorway) when it accesses the goal information. Table 3 quantitatively compares the proposed method with different baselines, showing that the proposed method indeed learns to generalize with respect to the privileged input (i.e., the goal).
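The check described above, whether the agent is near a doorway when it accesses the goal, amounts to a simple ratio over logged rollout steps. The sketch below assumes a hypothetical log format in which each step records two booleans (whether the goal was accessed and whether the agent was near a door); the rollout data is invented for illustration and is not from the experiments.

```python
def fraction_of_accesses_near_doors(steps):
    """steps: iterable of (accessed_goal, near_door) booleans, one pair per time step.

    Returns the fraction of goal accesses that happened near a door,
    or 0.0 if the goal was never accessed.
    """
    accesses = [near_door for accessed_goal, near_door in steps if accessed_goal]
    if not accesses:
        return 0.0
    return sum(accesses) / len(accesses)

# Hypothetical six-step rollout: three accesses, two of them near a door.
rollout = [
    (False, False),
    (True, True),
    (False, False),
    (True, False),
    (True, True),
    (False, True),
]
near_door_fraction = fraction_of_accesses_near_doors(rollout)
```

Averaging this fraction over many sampled trajectories gives a percentage of the kind reported in the tables.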
7.4 Learning When to Communicate for Multiagent Communication
Next, we consider multiagent communication, where in order to solve a task, agents must communicate with other agents. Here we show that selectively deciding when to communicate with another agent can result in better learning.
Experimental setup: We use the setup proposed by mordatch2017emergence. The environment consists of N agents and M landmarks. Both the agents and landmarks exhibit different characteristics, such as different color and shape type. Different agents can act to move in the environment. They can also be affected by interactions with other agents. Aside from taking physical actions, agents communicate with other agents using verbal communication symbols. Each agent has a private goal that is not observed by other agents, and the goal of the agent is grounded in the real physical environment, which might include moving to a particular location. It could also involve other agents (like requiring a particular agent to move somewhere), and hence communication between agents is required. We consider the cooperative setting, in which the problem is to find a policy that maximizes expected return for all the agents. In this scenario, the current state of the agent is the standard input $S$, and the information which might be obtained as a result of communication with other agents is the privileged input $G$. For more details, refer to Appendix D.
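Table 4: Relative distance to the destination landmark, with the percentage of times agents access information from other agents in brackets.
|Model||6 Agents||10 Agents|
|Emergent Communication (mordatch2017emergence)||4.85 (100%) ± 0.1%||5.44 (100%) ± 0.2%|
|Randomly Accessing (RAG)||4.95 (50%) ± 0.2%||5.65 (50%) ± 0.1%|
|InfoBot (goyal2019infobot)||4.81 (100%) ± 0.2%||5.32 (100%) ± 0.1%|
|VBB (ours)||4.72 (23%) ± 0.1%||5.22 (34%) ± 0.05%|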
Tasks: Here we consider two tasks: (a) 6 agents and 6 landmarks, (b) 10 agents and 10 landmarks. The goal is for the agents to coordinate with each other and reach their respective landmarks. We measure two metrics: (a) the distance of the agent from its destination landmark, and (b) the percentage of times the agent accesses the privileged input (i.e., information from the other agents). Table 4 shows the relative distance as well as the percentage of times agents access information from other agents (in brackets).
Results: Table 4 compares an agent trained with the proposed method to the emergent communication baseline (mordatch2017emergence) and InfoBot (goyal2019infobot). We also study how often an agent accesses the privileged input. As shown in Table 4 (in brackets), the VBB achieves better results than the other methods while accessing the privileged input less than 40% of the time.
7.5 Information Content For VBB and VIB
|Navigation Env||4.45 (100%)||5.34 (74%)||3.92 (20%)|
|Sequential MNIST||3.56 (100%)||3.63 (65%)||3.22 (46%)|
|Model Based RL||7.12 (100%)||7.63 (65%)||6.94 (15%)|
Channel Capacity: We can quantify the average information transmission through both the VBB and the VIB in bits. The average information is similar to the conventional VIB, while the input is accessed only a fraction of the time (the VIB accesses it 100% of the time). In order to show empirically that the VBB is minimizing information transmission (Eq. 1 in main paper), we measure average channel capacity numerically and compare the proposed method with the VIB, which must access the privileged input every time (See Table 5).
We demonstrated how the proposed variational bandwidth bottleneck (VBB) helps in generalization over the standard variational information bottleneck, in the case where the input is divided into a standard and a privileged component. Unlike the VIB, the VBB does not actually access the privileged input before deciding how much information about it is needed. Our experiments show that the VBB improves generalization and can achieve similar or better performance while accessing the privileged input less often. Hence, the VBB provides a framework for adaptive computation in deep network models, and applying it to further domains where reasoning about access to data and computation matters is an exciting direction for future work. A current limitation of the proposed method is that it assumes independence between the standard input and the privileged input, although we observe that in practice this assumption does not seem to hurt the results. Future work could investigate how to remove this assumption.
The authors acknowledge the important role played by their colleagues at Mila and RAIL (UC Berkeley) throughout the duration of this work. The authors are grateful to DJ Strousse and Mike Mozer for useful discussions and feedback. AG is grateful to Alexander Neitz for the code of the environment used for the model-based experiments (Fig. 3). AG is also grateful to Rakesh R Menon for pointing out the error and giving very useful feedback. The authors would also like to thank Alex Lamb, Nan Rosemary Ke, Olexa Bilaniuk, Jordan Hoffmann, Nasim Rahaman, Samarth Sinha, Shagun Sodhani, Devansh Arpit, Riashat Islam, Coline Devin, Jonathan Binas, Suriya Singh, Hugo Larochelle, Tom Bosc, Gautham Swaminathan, Salem Lahou, and Michael Chang for feedback on the draft. The authors are also grateful to the reviewers of ICML, NeurIPS and ICLR for their feedback (the paper was accepted on its third submission). The authors are grateful to NSERC, CIFAR, Google, Samsung, Nuance, IBM, Canada Research Chairs, Canada Graduate Scholarship Program, and Nvidia for funding, and to Compute Canada for computing resources. We are very grateful to Google for giving Google Cloud credits used in this project.
Appendix A Conditional Bottleneck
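In this section, we construct our objective function, such that minimizing this objective function minimizes $I(Y; G \mid S)$. Recall that the IB objective (DBLP:journals/corr/physics-0004057) is formulated as the minimization of $I(Z; X) - \beta I(Z; Y)$, where $X$ refers to the input, $Y$ refers to the model output, and $Z$ refers to the compressed representation or the bottleneck. For the proposed method, we construct our objective as follows: we minimize the mutual information between the privileged input and the output given the standard input, $I(Y; G \mid S)$, to encode the idea that we should avoid unnecessary access to the privileged input $G$, and maximize $I(Z; Y)$. Hence, for the VBB, using the data processing inequality (Cover:2006:EIT:1146355), this implies that

$$I(Z; G \mid S) \ge I(Y; G \mid S).$$

To obtain an upper bound on $I(Z; G \mid S)$, we must first obtain an upper bound on $I(Z; G \mid S = s)$, and then we average over $p(s)$. We get the following result:

$$I(Z; G \mid S = s) = \sum_{z, g} p(g \mid s) \, p(z \mid s, g) \log \frac{p(z \mid s, g)}{p(z \mid s)}.$$

We assume that the privileged input $G$ and the standard input $S$ are independent of each other, and hence $p(g \mid s) = p(g)$. We get the following upper bound:

$$\begin{aligned} I(Z; G \mid S = s) &= \sum_{g} p(g) \sum_{z} p(z \mid s, g) \log \frac{p(z \mid s, g)}{p(z \mid s)} \\ &\le \sum_{g} p(g) \sum_{z} p(z \mid s, g) \log \frac{p(z \mid s, g)}{r(z)}, \end{aligned}$$

where the inequality in the last line is because we replace $p(z \mid s)$ with $r(z)$. We also drop the dependence of the prior on the standard input $S$. While this loses some generality, recall that the predictive distribution $p(y \mid s, z)$ is already conditioned on $S$, so information about $S$ itself does not need to be transmitted through $Z$. Therefore, we have that $I(Z; G \mid S = s) \le \sum_{g} p(g) \, D_{\mathrm{KL}}[p(z \mid s, g) \,\|\, r(z)]$. Marginalizing over the standard input therefore gives us

$$I(Z; G \mid S) \le \sum_{s} p(s) \sum_{g} p(g) \, D_{\mathrm{KL}}[p(z \mid s, g) \,\|\, r(z)].$$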
We approximate $p(z \mid s, g)$ as a weighted mixture of $f_{enc}(s, g)$ and the normal prior such that $z \sim d_{cap} \, \delta(f_{enc}(s, g)) + (1 - d_{cap}) \, r(z)$. Hence, the weighted mixture can be seen as a bound on the information bottleneck objective. Whenever we access the privileged input $G$, we pay an information cost equal to $I(Z; G \mid S)$, which is bounded by $D_{\mathrm{KL}}[p(z \mid s, g) \,\|\, r(z)]$. Hence, the objective is to avoid accessing the privileged input, such that on average, the information cost of using the privileged input is minimal.
Appendix B Tractable Optimization of KL objective
Here, we first show how the weighted mixture can be a bound on the information bottleneck objective.
Hence, where is expressed as a mixture of a Dirac delta and the prior, and hence it can be written as
Further expanding the RHS using Eq. 9, we get
Here, we can assume the to be zero under the prior (as it is a Dirac delta function). This can further be simplified to:
And hence, reducing the above term reduces to , our original objective.
Appendix C Another Method of Calculating Channel Capacity
In the main paper we show how we can evaluate the channel capacity in a tractable way, by learning a function that determines the channel capacity. Here we describe another way of parameterizing the channel capacity network, which we found (empirically) to help. In order to represent a function that satisfies these constraints, we use an encoder that takes the standard input and is built from two learned functions (e.g., multi-layer perceptrons), which respectively output the two parameters of its distribution. The output of this encoder determines the channel capacity of the bottleneck. In order to get a probability out of the channel capacity, we convert it into a scalar that can be treated as the probability of accessing the privileged input.
We perform this transformation by normalizing the channel capacity to a bounded range (in practice, we do this by clamping). We then pass the normalized value through a sigmoid activation function and treat the output as a probability, with which we access the privileged input. Hence, the resulting distribution over $z$ is a weighted mixture of accessing the privileged input with this probability and sampling from the prior with the complementary probability. Here we assume the prior to be a fixed normal distribution, but it can also be learned. At test time, using this probability, we can sample from the Bernoulli distribution to decide whether to access the privileged input or not.
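The clamp-then-sigmoid transformation described above can be sketched as follows. The clamp bound of 2.0 is an assumption made purely for illustration (the actual range is not recoverable from the text here), and `access_probability` and `decide_access` are hypothetical names.

```python
import math
import random

def access_probability(capacity, clamp=2.0):
    """Clamp a raw capacity value to [-clamp, clamp], then squash with a sigmoid.

    The clamp bound is an illustrative assumption, not a value from the paper.
    """
    clipped = max(-clamp, min(clamp, capacity))
    return 1.0 / (1.0 + math.exp(-clipped))

def decide_access(capacity, rng, clamp=2.0):
    """Sample the Bernoulli gate that decides whether to access the privileged input."""
    return rng.random() < access_probability(capacity, clamp)

rng = random.Random(0)
p_mid = access_probability(0.0)        # a raw capacity of zero maps to 0.5
p_high = access_probability(100.0)     # large values saturate at sigmoid(clamp)
p_low = access_probability(-100.0)     # small values saturate at sigmoid(-clamp)
gate = decide_access(0.0, rng)
```

One consequence of clamping before the sigmoid is that the access probability never reaches exactly zero or one, so both branches of the mixture keep being sampled during training.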
Appendix D Multiagent Communication
Experimental Setup: We use the setup proposed by mordatch2017emergence. The environment consists of N agents and M landmarks. Both the agents and landmarks exhibit different characteristics such as different color and shape type. Different agents can act to move in the environment. They can also be affected by the interactions with other agents. Asides from taking physical actions, agents communicate with other agents using verbal communication symbols. Each agent has a private goal which is not observed by another agent, and the goal of the agent is grounded in the real physical environment, which might include moving to a particular location, and could also involve other agents (like requiring a particular agent to move somewhere) and hence communication between agents is required.
Each agent performs actions and communicates utterances according to a policy, which is identically instantiated for all of the agents in the environment. This policy determines both the actions and the communication protocol. We assume all agents have identical action and observation spaces and receive the same reward signal. We consider the cooperative setting, in which the problem is to find a policy that maximizes the expected return for all the agents.
Appendix E Spatial Reasoning
In order to study generalization across a wide variety of environmental conditions and linguistic inputs, janner2018representation develop an extension of the puddle world reinforcement learning benchmark. States in a 10 × 10 grid are first filled with either grass or water cells, such that the grass forms one connected component. The grass region is then populated with six unique objects, which appear only once per map (triangle, star, diamond, circle, heart, and spade), and four non-unique objects (rock, tree, horse, and house), which can appear any number of times on a given map. We followed the same experimental setup and hyperparameters as in (janner2018representation).
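The requirement that the grass forms one connected component can be checked with a breadth-first search. The sketch below is illustrative only: the grid representation (strings of "g" and "w") and the use of 4-connectivity are assumptions, not details taken from the benchmark.

```python
from collections import deque


def grass_is_connected(grid):
    """Return True if all 'g' cells form a single 4-connected component."""
    cells = {(r, c)
             for r, row in enumerate(grid)
             for c, value in enumerate(row)
             if value == "g"}
    if not cells:
        return False  # a map without grass has no walkable region
    start = next(iter(cells))
    seen = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if nxt in cells and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen == cells  # every grass cell was reached from the start
```

A generator could resample grass and water cells until this check passes before placing objects on the grass region.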
Here, an agent is rewarded for reaching the location specified by the language instruction, and it is allowed to take actions in the world. The goal is to generalize the learned representation for a given instruction so that it remains useful even if the environment observations are rearranged. Hence, we want to learn representations that tie together observations from the environment and the language expressions. We consider the Puddle World Navigation map introduced in (janner2018representation). The current state of the agent acts as the standard input; based on it, the agent decides whether to access the privileged input.
We start by converting the instruction text into a real-valued vector using an LSTM. The model first convolves the map layout to a low-dimensional representation (as opposed to the MLP of the UVFA) and concatenates this to the LSTM’s instruction embedding (as opposed to taking a dot product). These concatenated representations are then input to a two-layer MLP. Generalization over both environment configurations and text instructions requires a model that meets two desiderata. First, it must have a flexible representation of goals, one which can encode both the local structure and the global spatial attributes inherent to natural language instructions. Second, it must be compositional, in order to learn a generalizable representation of the language even though each unique instruction will only be observed with a single map during training. Namely, the learned representation for a given instruction should still be useful even if the objects on a map are rearranged or the layout is changed entirely.
Appendix F Recurrent Visual Attention - Learning Better Features
The goal of this experiment is to study whether the proposed method enables learning a dynamic representation of an image that can then be used to accurately classify the image. To show this, we follow the setup of the Recurrent Attention Model (RAM) (mnih2014recurrent). Here, the attention process is modeled as a sequential decision process of a goal-directed agent interacting with the visual image. A recurrent neural network is trained to process the input sequentially, attending to different parts of the image one at a time and combining information from these parts to build up a dynamic representation of the image. The agent incrementally integrates the information obtained by attending to different parts and then uses this integrated information to choose where to attend next. In this case, the information obtained by attending to a particular part of the image acts as the standard input, and the information that is integrated over time acts as the privileged input, which is then used to select where the model should attend next. The entire process repeats for N steps (N = 6 in our experiment). FC denotes a fully connected network with two layers of rectifier units, each containing 256 hidden units.
| Model | MNIST error | 60 × 60 Cluttered MNIST error |
|---|---|---|
| FC (2 layers) | 1.69% | 11.63% |
| RAM Model (6 locs) | 1.55% | 4.3% |
| VIB (6 locs) | 1.58% | 4.2% |
| VBB (6 locs) (Ours) | 1.42% | 3.8% |
Quantitative Results: Table 6 shows the classification error for the proposed model and for the baselines, including the standard RAM model. For both the proposed model and the RAM model, we fix the number of attended locations to 6. The proposed method outperforms the standard RAM model on both datasets (1.42% vs. 1.55% error on MNIST, and 3.8% vs. 4.3% on 60 × 60 Cluttered MNIST).
Appendix G Algorithm Implementation Details
We evaluate the proposed framework using Advantage Actor-Critic (A2C) to learn a policy conditioned on the goal. To evaluate the performance of the proposed method, we use a range of multi-room maze tasks from the gym-minigrid framework (gym_minigrid), together with the A2C implementation from the same source. For the maze tasks, we used the agent’s relative distance to the absolute goal position as the "goal".
For the maze environments, we use A2C with 48 parallel workers. Our actor and critic networks consist of two and three fully connected layers respectively, each of which has 128 hidden units. The encoder network is also parameterized as a neural network, consisting of 1 fully connected layer. We train the models with RMSProp, using the same initial learning rate for both InfoBot and the baseline for a fair comparison. Due to the partially observable nature of the environment, we further use an LSTM to encode the state and summarize the past observations.
Appendix H MiniGrid Environments for OpenAI Gym
The MultiRoom environments used for this research are part of MiniGrid, which is an open source gridworld package (https://github.com/maximecb/gym-minigrid). This package includes a family of reinforcement learning environments compatible with the OpenAI Gym framework. Many of these environments are parameterizable so that the difficulty of tasks can be adjusted (e.g., the size of rooms is often adjustable).
H.1 The World
In MiniGrid, the world is a grid of size NxN. Each tile in the grid contains at most one object. The possible object types are wall, door, key, ball, box and goal. Each object has an associated discrete color, which can be one of red, green, blue, purple, yellow and grey. By default, walls are always grey and goal squares are always green.
H.2 Reward Function
Rewards are sparse for all MiniGrid environments. In the MultiRoom environment, episodes are terminated with a positive reward when the agent reaches the green goal square. Otherwise, episodes are terminated with zero reward when a time step limit is reached. In the FindObj environment, the agent receives a positive reward if it reaches the object to be found, otherwise zero reward if the time step limit is reached.
H.3 Action Space
There are seven actions in MiniGrid: turn left, turn right, move forward, pick up an object, drop an object, toggle and done. For the purpose of this paper, the pick up, drop and done actions are irrelevant. The agent can use the turn left and turn right action to rotate and face one of 4 possible directions (north, south, east, west). The move forward action makes the agent move from its current tile onto the tile in the direction it is currently facing, provided there is nothing on that tile, or that the tile contains an open door. The agent can open doors if they are right in front of it by using the toggle action.
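The turn and move semantics described above can be sketched in a few lines. This is an illustration written from the prose, not code from the MiniGrid package: the integer values of the actions, the coordinate convention, and the `blocked` set are assumptions.

```python
from enum import IntEnum


class Action(IntEnum):
    # Names follow the text; the integer values are an assumption.
    LEFT = 0
    RIGHT = 1
    FORWARD = 2
    PICKUP = 3
    DROP = 4
    TOGGLE = 5
    DONE = 6


# Facing directions as (dx, dy) in clockwise order: east, south, west, north.
# The y axis is assumed to grow downward.
DIRECTIONS = [(1, 0), (0, 1), (-1, 0), (0, -1)]


def step(pos, facing, action, blocked):
    """Apply a turn or a forward move; `blocked` is a set of impassable tiles."""
    if action == Action.LEFT:
        facing = (facing - 1) % 4
    elif action == Action.RIGHT:
        facing = (facing + 1) % 4
    elif action == Action.FORWARD:
        dx, dy = DIRECTIONS[facing]
        target = (pos[0] + dx, pos[1] + dy)
        if target not in blocked:  # walls and closed doors stop the agent
            pos = target
    return pos, facing
```

In this sketch a closed door would simply be a member of `blocked` until a toggle removes it; the toggle itself is left out to keep the example short.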
H.4 Observation Space
Observations in MiniGrid are partial and egocentric. By default, the agent sees a square of 7x7 tiles in the direction it is facing. These include the tile the agent is standing on. The agent cannot see through walls or closed doors. The observations are provided as a tensor of shape 7x7x3. However, note that these are not RGB images. Each tile is encoded using 3 integer values: one describing the type of object contained in the cell, one describing its color, and a flag indicating whether doors are open or closed. This compact encoding was chosen for space efficiency and to enable faster training. The fully observable RGB image view of the environments shown in this paper is provided for human viewing.
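To make the compact encoding concrete, the sketch below encodes each tile as three integers and builds a 7x7x3 observation as nested lists. The lookup tables and their integer codes are hypothetical; only the list of object types and colors and the three-field layout come from the text.

```python
# Hypothetical lookup tables: the integer codes are illustrative only.
OBJECTS = ["empty", "wall", "door", "key", "ball", "box", "goal"]
COLORS = ["red", "green", "blue", "purple", "yellow", "grey"]


def encode_tile(obj, color, door_open=False):
    """Encode one tile as 3 integers: (object type, color, open flag)."""
    return (OBJECTS.index(obj), COLORS.index(color), int(door_open))


def decode_tile(triple):
    """Invert encode_tile back to readable names."""
    obj_id, color_id, flag = triple
    return OBJECTS[obj_id], COLORS[color_id], bool(flag)


def blank_view(size=7):
    """Build a size x size x 3 nested list where every tile is empty."""
    tile = encode_tile("empty", "grey")  # color is a placeholder for empty tiles
    return [[list(tile) for _ in range(size)] for _ in range(size)]
```

For example, `decode_tile(encode_tile("door", "yellow", True))` gives back `("door", "yellow", True)`, which shows that the three integers carry everything the agent observes about a tile.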
H.5 Level Generation
The level generation in this task works as follows: (1) generate the layout of the map (X rooms of different sizes, each at most size Y, and a green goal); (2) add the agent to the map at a random location in the first room; (3) add the goal at a random location in the last room.
MultiRoomNS: In this task, the agent gets an egocentric view of its surroundings, consisting of 33 pixels. A neural network parameterized as an MLP is used to process the visual observation.
Appendix I Memory Access - Deciding When to Access Memory
Here, the privileged input involves accessing information from an external memory, as in Neural Turing Machines (NTM) (sukhbaatar2015end; graves2014neural). Reading from external memory is usually an expensive operation, and hence we would like to minimize access to it. For our experiments, we consider external memory in the form of a Neural Turing Machine. An NTM processes inputs in sequences, much like a normal LSTM, but it can additionally learn to access information from the external memory. In this context, the state of the NTM’s controller (which processes the input) becomes the standard input; based on it, we decide the channel capacity, and based on the channel capacity we decide whether to read from external memory or not. To test this, we evaluate our approach on the copying task, which tests whether NTMs can store and recall information from the past. We use the same problem setup as (graves2014neural). As shown in Fig. 13, we perform slightly better than NTMs while accessing external memory only 32% of the time.
Appendix J Hyperparameters
The only hyperparameter we introduce with the variational information bottleneck is β. For both the VIB baseline and the proposed method, we evaluated 5 values of β: 0.001, 0.005, 0.009, 0.01, and 0.09.
J.1 Common Parameters
We use the following parameters for lower-level policies throughout the experiments. Each training iteration consists of 5 environment time steps, and all the networks (value functions, policy, and observation embedding network) are trained at every time step. Every training batch has a size of 64. The value function networks and the embedding network are all neural networks with two hidden layers of 128 ReLU units each.
All the network parameters are updated using the Adam optimizer.
Table 7 lists the common parameters used.
| Parameter | Value |
|---|---|
| Hidden layers (Q, V, embedding) | 2 |
| Hidden units per layer (Q, V, embedding) | 128 |
| RNN hidden size | 128 |
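The common parameters can be collected in a single configuration object, which makes the resulting layer sizes easy to inspect. The dictionary keys and the `layer_sizes` helper below are hypothetical names chosen for illustration; the values are the ones stated in this section.

```python
# Values are taken from the text; the key names are illustrative.
COMMON_PARAMS = {
    "hidden_layers": 2,
    "hidden_units_per_layer": 128,
    "rnn_hidden_size": 128,
    "batch_size": 64,
    "env_steps_per_iteration": 5,
}


def layer_sizes(input_dim, output_dim, params=COMMON_PARAMS):
    """Return the list of layer widths for a fully connected network."""
    hidden = [params["hidden_units_per_layer"]] * params["hidden_layers"]
    return [input_dim, *hidden, output_dim]
```

For an embedding network with a 10-dimensional input and a scalar output, `layer_sizes(10, 1)` yields `[10, 128, 128, 1]`.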
Appendix K Architectural Details
For our work, we made sure to keep the architectural details as similar to the baselines as possible.
Goal Driven Navigation: Our code is based on the open source gridworld package https://github.com/maximecb/gym-minigrid.
Multiagent Communication: Our code is based on the following open source implementation: https://github.com/bkgoksel/emergent-language.
Access to External Memory: Our code is based on the following open source implementation of Neural Turing Machines: https://github.com/loudinthecloud/pytorch-ntm
Spatial Navigation: Our code is based on the following open source implementation: https://github.com/JannerM/spatial-reasoning.
The only extra parameters that our model introduces are those of the channel capacity network, which is parameterized as a neural network consisting of 2 layers of 128 dimensions each (with ReLU non-linearity).
Appendix L Information Theoretic Analogue of Attention
The present work is about using privileged information modulated by standard input. One natural way to think about this problem is to use attention to selectively attend to privileged information. Attention has been used to dynamically modulate information in neural circuits (vaswani2017attention; ke2018sparse; goyal2019recurrent); it routes information based on relevance or argument matching. One way to approach this problem is therefore to think of an information-theoretic analogue of attention: what happens if we compose an information bottleneck with an attentional mechanism? Typically, attention returns an average of the values, weighted by a normalized exponential of the inner product between a query and each key.
In our case, the queries could be a function of the standard input, and the keys and values could be a function of the privileged information. With a Variational Information Bottleneck (VIB) on the query, the query becomes stochastic and the attention output is instead an expectation over it, where the expectation can be estimated with a sample. If both sampled variables come from the prior, we can select the prior so as to get an average over all values (the prior could perhaps be chosen from some convenient family, such that the exponential of the inner product of two variables with that distribution has some convenient form), and as the distribution deviates from the prior, we get tighter attention. This could in principle make the whole mechanism probabilistic, such that a highly "uncertain" standard input would just have a random query, which would in expectation access the values in an uninformed way. In retrospect, we think this is how we should have framed this work.
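For concreteness, the contrast between the two forms can be written out. This is a sketch under assumptions, not the authors' exact equations: it assumes dot-product softmax attention, a query $q$ computed from the standard input, keys $k_i$ and values $v_i$ computed from the privileged input, and a stochastic query $z \sim p(z \mid q)$ produced by the bottleneck.

```latex
\begin{align}
\text{standard attention:}\quad
  y &= \sum_i \frac{\exp(q^\top k_i)}{\sum_j \exp(q^\top k_j)}\, v_i \\
\text{bottleneck on the query:}\quad
  y &= \mathbb{E}_{z \sim p(z \mid q)}
       \left[ \sum_i \frac{\exp(z^\top k_i)}{\sum_j \exp(z^\top k_j)}\, v_i \right]
\end{align}
```

Under these assumptions, when $p(z \mid q)$ matches the prior regardless of $q$, the sampled query carries no information about the standard input, which corresponds to the uninformed access to the values described in the text.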