An intelligent agent is sent to explore an unknown environment. Over the course of its mission, the agent makes observations, carries out actions, and incrementally builds up a model of the environment from this interaction. Since the way in which the agent selects actions may greatly affect the efficiency of the exploration, the following question naturally arises:
How should the agent choose the actions such that the knowledge about the environment accumulates as quickly as possible?
In this paper, this question is addressed under a classical framework, in which the agent improves its model of the environment through probabilistic inference, and learning progress is measured in terms of Shannon information gain. We show that the agent can, at least in principle, optimally choose actions based on previous experiences, such that the cumulative expected information gain is maximized. We then consider a special case, namely exploration in finite MDPs, where we demonstrate, both in theory and through experiment, that the optimal Bayesian exploration strategy can be effectively approximated by solving a sequence of dynamic programming problems.
The rest of the paper is organized as follows: Section 2 reviews the basic concepts and establishes the terminology; Section 3 elaborates the principle of optimal Bayesian exploration; Section 4 focuses on exploration in finite MDPs; Section 5 presents a simple experiment; Section 6 briefly reviews related work; and Section 7 concludes the paper.
Suppose that the agent interacts with the environment in discrete time cycles $t = 1, 2, \ldots$. In each cycle, the agent performs an action $a$, then receives a sensory input $o$. A history $h$ is either the empty string $\emptyset$ or a string of the form $a_1 o_1 \cdots a_t o_t$ for some $t$, and $ha$ and $hao$ refer to the strings resulting from appending $a$ and $ao$ to $h$, respectively.
2.1 Learning from Sequential Interactions
To facilitate the subsequent discussion under a probabilistic framework, we make the following assumptions:
- Assumption I.
The models of the environment under consideration are fully described by a random element $\Theta$ which depends solely on the environment. Moreover, the agent's initial knowledge about $\Theta$ is summarized by a prior density $p(\theta)$.
- Assumption II.
The agent is equipped with a conditional predictor $p(o|ha;\theta)$, i.e., the agent is capable of refining its prediction in the light of information about $\Theta$.
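Using $p(\theta)$ and $p(o|ha;\theta)$ as building blocks, it is straightforward to formulate learning in terms of probabilistic inference. From Assumption I, given the history $h$, the agent's knowledge about $\Theta$ is fully summarized by $p(\theta|h)$. According to Bayes' rule, $p(\theta|hao) = \frac{p(\theta|ha)\,p(o|ha;\theta)}{p(o|ha)}$, with $p(o|ha) = \int p(o|ha;\theta)\,p(\theta|ha)\,d\theta$. The term $p(\theta|ha)$ represents the agent's current knowledge about $\Theta$ given history $h$ and an additional action $a$. Since $\Theta$ depends solely on the environment, and, importantly, knowing the action without subsequent observations cannot change the agent's state of knowledge about $\Theta$, we have $p(\theta|ha) = p(\theta|h)$; hence the knowledge about $\Theta$ can be updated using
$$p(\theta|hao) = p(\theta|h)\,\frac{p(o|ha;\theta)}{p(o|ha)}. \tag{1}$$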
It is worth pointing out that the predictor $p(o|ha;\theta)$ is chosen a priori. It is not required to match the true dynamics of the environment, but the effectiveness of the learning certainly depends on the choice of $p(o|ha;\theta)$. For example, if $\Theta \in \mathbb{R}$ and $p(o|ha;\theta)$ depends on $\theta$ only through its sign, then no knowledge other than the sign of $\Theta$ can be learned.
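The update in Eq. 1 can be sketched in a few lines, assuming a finite set of candidate parameter values so that the integral becomes a sum. The function name `bayes_update` and the coin-bias example are illustrative assumptions, not part of the paper.

```python
def bayes_update(posterior, likelihoods):
    """Apply Eq. 1 on a finite parameter set (an assumption of this sketch).

    posterior[k]   = p(theta_k | h)
    likelihoods[k] = p(o | ha; theta_k) for the observation o just received
    Returns p(theta_k | hao) for every k.
    """
    joint = [p * l for p, l in zip(posterior, likelihoods)]
    evidence = sum(joint)  # p(o | ha), the normalizing constant
    if evidence == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [j / evidence for j in joint]


# Hypothetical example: three candidate coin biases with a uniform prior.
thetas = [0.2, 0.5, 0.8]                 # candidate values of p(heads)
prior = [1.0 / 3.0] * 3
after_heads = bayes_update(prior, thetas)                      # observe heads
after_heads_tails = bayes_update(after_heads,
                                 [1.0 - t for t in thetas])    # then tails
```

After one head the posterior shifts toward the larger biases; after a head and a tail it becomes symmetric again and peaks at 0.5.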
2.2 Information Gain as Learning Progress
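Let $h$ and $h'$ be two histories such that $h$ is a prefix of $h'$. The respective posteriors of $\Theta$ are $p(\theta|h)$ and $p(\theta|h')$. Using $h$ as a reference point, the amount of information gained when the history grows to $h'$ can be measured using the KL divergence between $p(\theta|h')$ and $p(\theta|h)$. This information gain from $h$ to $h'$ is defined as
$$g(h'\|h) = \mathrm{KL}\big(p(\theta|h')\,\|\,p(\theta|h)\big) = \int p(\theta|h') \log\frac{p(\theta|h')}{p(\theta|h)}\,d\theta.$$
As a special case, if $h = \emptyset$, then $g(h') = g(h'\|\emptyset)$ is the cumulative information gain with respect to the prior $p(\theta)$. We also write $g(ao\|h)$ for $g(hao\|h)$, which denotes the information gained from an additional action-observation pair.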
From an information-theoretic point of view, the KL divergence $\mathrm{KL}(p\|q)$ between two distributions $p$ and $q$ represents the additional number of bits required to encode elements sampled from $p$, using an optimal coding strategy designed for $q$. This can be interpreted as the degree of 'unexpectedness' or 'surprise' caused by observing samples from $p$ when expecting samples from $q$.
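As a small illustration of information gain as a KL divergence between two posteriors, the sketch below assumes a finite parameter set and measures the gain in nats (natural logarithm) rather than bits. The helper names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi == 0.0:
                return math.inf  # p puts mass where q has none
            total += pi * math.log(pi / qi)
    return total


def information_gain(posterior_new, posterior_old):
    """Information gained when the posterior moves from old to new."""
    return kl_divergence(posterior_new, posterior_old)


prior = [0.5, 0.5]          # two hypotheses, equally likely a priori
posterior = [0.25, 0.75]    # belief after some observation
gain = information_gain(posterior, prior)
no_gain = information_gain(prior, prior)
```

The gain is zero exactly when the observation leaves the posterior unchanged, and positive otherwise.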
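The key property of information gain for the treatment below is the following decomposition: let $h$ be a prefix of $h'$ and $h'$ be a prefix of $h''$; then
$$\mathbb{E}_{h''|h'}\, g(h''\|h) = g(h'\|h) + \mathbb{E}_{h''|h'}\, g(h''\|h'). \tag{2}$$
From the updating formula Eq. 1, the posterior is preserved in expectation over the next observation (written here for a finite observation set),
$$\mathbb{E}_{o|ha}\, p(\theta|hao) = \sum_o p(o|ha)\, p(\theta|h)\,\frac{p(o|ha;\theta)}{p(o|ha)} = p(\theta|h).$$
Using this relation recursively,
$$\mathbb{E}_{h''|h'}\, p(\theta|h'') = p(\theta|h'),$$
so that splitting $\log\frac{p(\theta|h'')}{p(\theta|h)}$ into $\log\frac{p(\theta|h'')}{p(\theta|h')} + \log\frac{p(\theta|h')}{p(\theta|h)}$ and taking the expectation over $h''$ yields Eq. 2.
That is, the information gain is additive in expectation.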
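The additivity in expectation can be checked numerically on a toy problem. The sketch below assumes a finite parameter set and a single coin-flip action; the three candidate biases and the helper names are illustrative assumptions.

```python
import math


def kl(p, q):
    """KL(p || q) in nats for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def update(posterior, likelihoods):
    """Return the new posterior and the probability of the observation."""
    joint = [p * l for p, l in zip(posterior, likelihoods)]
    evidence = sum(joint)
    return [j / evidence for j in joint], evidence


thetas = [0.2, 0.5, 0.8]                 # hypothetical coin biases
prior = [1.0 / 3.0] * 3                  # h is the empty history
heads = thetas
tails = [1.0 - t for t in thetas]

post_h1, _ = update(prior, heads)        # h' = one observed head

lhs = 0.0                                # E g(h'' || h)
rhs = kl(post_h1, prior)                 # g(h' || h) + E g(h'' || h')
for lik in (heads, tails):               # expectation over the next flip
    post_h2, p_o = update(post_h1, lik)
    lhs += p_o * kl(post_h2, prior)
    rhs += p_o * kl(post_h2, post_h1)
```

Both sides agree up to floating-point error, while the individual terms for a single outcome generally do not.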
Having defined the information gain from trajectories ending with observations, one may proceed to define the expected information gain of performing action $a$, before observing the outcome $o$. Formally, the expected information gain of performing $a$ with respect to the current history $h$ is given by $\bar{g}(a\|h) = \mathbb{E}_{o|ha}\, g(ao\|h)$. A simple derivation gives
$$\bar{g}(a\|h) = \sum_o \int p(o,\theta|ha) \log\frac{p(o,\theta|ha)}{p(\theta|ha)\,p(o|ha)}\,d\theta = I(O;\Theta\,|\,ha),$$
which means that $\bar{g}(a\|h)$ is the mutual information between $\Theta$ and the random variable $O$ representing the unknown observation, conditioned on the history $h$ and action $a$. (Side note: To generalize the discussion, concepts from algorithmic information theory, such as compression distance, may also be used here. However, restricting the discussion to a probabilistic framework greatly simplifies the matter.)
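The equivalence between the expected information gain and the mutual information can be illustrated on a finite model. In this sketch, `likelihood[k][o]` stands for the predictor evaluated at a hypothetical parameter value `k` and observation `o` for one fixed action; both function names are assumptions made for the example.

```python
import math


def expected_information_gain(posterior, likelihood):
    """E over o of KL(p(theta|hao) || p(theta|h)) for one fixed action."""
    total = 0.0
    for o in range(len(likelihood[0])):
        p_o = sum(p * lik[o] for p, lik in zip(posterior, likelihood))
        if p_o == 0.0:
            continue
        for p, lik in zip(posterior, likelihood):
            new = p * lik[o] / p_o           # Eq. 1
            if new > 0.0:
                total += p_o * new * math.log(new / p)
    return total


def mutual_information(posterior, likelihood):
    """I(O; Theta | ha) computed from the joint p(o, theta | ha)."""
    n_obs = len(likelihood[0])
    p_obs = [sum(p * lik[o] for p, lik in zip(posterior, likelihood))
             for o in range(n_obs)]
    total = 0.0
    for p, lik in zip(posterior, likelihood):
        for o in range(n_obs):
            joint = p * lik[o]
            if joint > 0.0:
                total += joint * math.log(joint / (p * p_obs[o]))
    return total


posterior = [0.3, 0.7]
informative = [[0.9, 0.1], [0.2, 0.8]]      # outcome depends on theta
uninformative = [[0.5, 0.5], [0.5, 0.5]]    # outcome ignores theta

gain = expected_information_gain(posterior, informative)
mi = mutual_information(posterior, informative)
zero_gain = expected_information_gain(posterior, uninformative)
```

An action whose outcome distribution does not depend on the parameter has zero expected information gain, which matches the mutual-information reading.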
3 Optimal Bayesian Exploration
In this section, the general principle of optimal Bayesian exploration in dynamic environments is presented. We first give results obtained by assuming a fixed limited life span for our agent, then discuss a condition required to extend this to infinite time horizons.
3.1 Results for Finite Time Horizon
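Suppose that the agent has experienced history $h$, and is about to choose $\tau$ more actions in the future. Let $\pi$ be a policy mapping the set of histories to the set of actions, such that the agent performs $a$ with probability $\pi(a|h)$ given $h$. Define the curiosity Q-value $q^\tau_\pi(h,a)$ as the expected information gained from the additional $\tau$ actions, assuming that the agent performs $a$ in the next step and follows policy $\pi$ in the remaining $\tau-1$ steps. Formally, for $\tau = 1$,
$$q^1_\pi(h,a) = \mathbb{E}_{o|ha}\, g(ao\|h) = \bar{g}(a\|h),$$
and for $\tau > 1$,
$$q^\tau_\pi(h,a) = \mathbb{E}_{o|ha}\,\mathbb{E}_{a_1|hao}\,\mathbb{E}_{o_1|haoa_1}\cdots\mathbb{E}_{o_{\tau-1}|h\cdots a_{\tau-1}}\; g(hao\,a_1o_1\cdots a_{\tau-1}o_{\tau-1}\,\|\,h).$$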
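The curiosity Q-value can be defined recursively. Applying Eq. 2 for $\tau = 2$,
$$q^2_\pi(h,a) = \mathbb{E}_{o|ha}\Big[g(ao\|h) + \mathbb{E}_{a'|hao}\,\mathbb{E}_{o'|haoa'}\, g(a'o'\|hao)\Big] = \bar{g}(a\|h) + \mathbb{E}_{o|ha}\,\mathbb{E}_{a'|hao}\, q^1_\pi(hao,a').$$
And for $\tau > 2$,
$$q^\tau_\pi(h,a) = \bar{g}(a\|h) + \mathbb{E}_{o|ha}\,\mathbb{E}_{a'|hao}\, q^{\tau-1}_\pi(hao,a'). \tag{3}$$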
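Noting that Eq. 3 bears great resemblance to the definition of state-action values ($Q(s,a)$) in reinforcement learning, one can similarly define the curiosity value of a particular history as $v^\tau_\pi(h) = \mathbb{E}_{a|h}\, q^\tau_\pi(h,a)$, analogous to state values ($V(s)$), which can also be iteratively defined as $v^1_\pi(h) = \mathbb{E}_{a|h}\,\bar{g}(a\|h)$, and
$$v^\tau_\pi(h) = \mathbb{E}_{a|h}\Big[\bar{g}(a\|h) + \mathbb{E}_{o|ha}\, v^{\tau-1}_\pi(hao)\Big].$$
The curiosity value $v^\tau_\pi(h)$ is the expected information gain of performing the additional $\tau$ steps, assuming that the agent follows policy $\pi$. The two notations can be combined to write
$$q^\tau_\pi(h,a) = \bar{g}(a\|h) + \mathbb{E}_{o|ha}\, v^{\tau-1}_\pi(hao). \tag{4}$$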
This equation has an interesting interpretation: since the agent is operating in a dynamic environment, it has to take into account not only the immediate expected information gain of performing the current action, i.e., $\bar{g}(a\|h)$, but also the expected curiosity value of the situation in which the agent ends up due to the action, i.e., $\mathbb{E}_{o|ha}\, v^{\tau-1}_\pi(hao)$. As a consequence, the agent needs to choose actions that balance the two factors in order to improve its total expected information gain.
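Now we show that there is an optimal policy $\pi_*$, which leads to the maximum cumulative expected information gain given any history $h$. To obtain the optimal policy, one may work backwards in $\tau$, taking greedy actions with respect to the curiosity Q-values at each time step. Namely, for $\tau = 1$, let
$$q^1(h,a) = \bar{g}(a\|h), \quad \pi^1_*(h) = \arg\max_a \bar{g}(a\|h), \quad v^1(h) = \max_a \bar{g}(a\|h),$$
such that $v^1(h) = q^1(h, \pi^1_*(h))$, and for $\tau > 1$, let
$$q^\tau(h,a) = \bar{g}(a\|h) + \mathbb{E}_{o|ha}\max_{a'} q^{\tau-1}(hao,a') = \bar{g}(a\|h) + \mathbb{E}_{o|ha}\, v^{\tau-1}(hao),$$
with $\pi^\tau_*(h) = \arg\max_a q^\tau(h,a)$ and $v^\tau(h) = \max_a q^\tau(h,a)$. We show that $\pi^\tau_*(h)$ is indeed the optimal policy for any given $\tau$ and $h$ in the sense that the curiosity value, when following $\pi^\tau_*$, is maximized. To see this, take any other strategy $\pi$, and first notice that
$$v^1(h) = \max_a \bar{g}(a\|h) \ge \mathbb{E}_{a|h}\,\bar{g}(a\|h) = v^1_\pi(h).$$
Moreover, assuming $v^\tau(h) \ge v^\tau_\pi(h)$,
$$v^{\tau+1}(h) = \max_a\Big[\bar{g}(a\|h) + \mathbb{E}_{o|ha}\, v^\tau(hao)\Big] \ge \max_a\Big[\bar{g}(a\|h) + \mathbb{E}_{o|ha}\, v^\tau_\pi(hao)\Big] \ge \mathbb{E}_{a|h}\Big[\bar{g}(a\|h) + \mathbb{E}_{o|ha}\, v^\tau_\pi(hao)\Big] = v^{\tau+1}_\pi(h).$$
Therefore $v^\tau(h) \ge v^\tau_\pi(h)$ holds for arbitrary $\tau$, $h$, and $\pi$. The same can be shown for curiosity Q-values, namely, $q^\tau(h,a) \ge q^\tau_\pi(h,a)$, for all $\tau$, $h$, $a$, and $\pi$. It may be beneficial to write $q^\tau$ in explicit form, namely,
$$q^\tau(h,a) = \mathbb{E}_{o|ha}\max_{a_1}\mathbb{E}_{o_1|haoa_1}\cdots\max_{a_{\tau-1}}\mathbb{E}_{o_{\tau-1}|h\cdots a_{\tau-1}}\; g(hao\,a_1o_1\cdots a_{\tau-1}o_{\tau-1}\,\|\,h).$$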
Now consider that the agent has a fixed life span $T$. It can be seen that at time $t$, the agent has to perform $\pi^{T-t}_*(h_t)$ to maximize the expected information gain in the remaining $T-t$ steps. Here $h_t$ is the history at time $t$. However, from Eq. 2,
$$\mathbb{E}_{h_T|h_t}\, g(h_T) = g(h_t) + \mathbb{E}_{h_T|h_t}\, g(h_T\|h_t).$$
Note that at time $t$, $g(h_t)$ is a constant, thus maximizing the cumulative expected information gain in the remaining time steps is equivalent to maximizing the expected information gain of the whole trajectory with respect to the prior. The result is summarized in the following proposition:
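Let $q^1(h,a) = \bar{g}(a\|h)$, $v^\tau(h) = \max_a q^\tau(h,a)$, and
$$q^\tau(h,a) = \bar{g}(a\|h) + \mathbb{E}_{o|ha}\, v^{\tau-1}(hao);$$
then the policy $\pi^\tau_*(h) = \arg\max_a q^\tau(h,a)$ is optimal in the sense that $v^\tau(h) \ge v^\tau_\pi(h)$, for any $\pi$, $\tau$, and $h$. In particular, for an agent with fixed life span $T$, following $\pi^{T-t}_*(h_t)$ at time $t$ is optimal in the sense that the expected cumulative information gain with respect to the prior is maximized.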
3.2 Non-triviality of the Result
Intuitively, the recursive definition of the curiosity (Q-)value is simple to interpret, and it bears a clear resemblance to its counterpart in reinforcement learning. It might be tempting to think that the result is nothing more than solving the finite-horizon reinforcement learning problem using $\bar{g}(a\|h)$ or $g(ao\|h)$ as the reward signals. However, this is not the case.
First, note that the decomposition Eq. 2 is a direct consequence of the formulation of the KL divergence. The decomposition does not necessarily hold if the KL divergence is replaced with other measures of information gain.
Second, it is worth pointing out that $g(ao\|h)$ and $\bar{g}(a\|h)$ behave differently from normal reward signals in the sense that they are additive only in expectation, while in the reinforcement learning setup, the reward signals are usually assumed to be additive, i.e., adding reward signals together is always meaningful. Consider a simple problem with only two actions. If $g(ao\|h)$ were a plain reward function, then $g(ao\|h) + g(a'o'\|hao)$ should be meaningful, whether or not $o$ and $o'$ are known. But this is not the case, since the sum does not have a valid information-theoretic interpretation. On the other hand, the sum is meaningful in expectation. Namely, when $o'$ has not been observed, from Eq. 2,
$$g(ao\|h) + \mathbb{E}_{o'|haoa'}\, g(a'o'\|hao) = \mathbb{E}_{o'|haoa'}\, g(haoa'o'\|h),$$
so the sum can be interpreted as the expectation of the information gained from $h$ to $haoa'o'$. This result shows that $g(ao\|h)$ or $\bar{g}(a\|h)$ can be treated as additive reward signals only when one is planning ahead.
To emphasize the difference further, note that all immediate information gains $g(ao\|h)$ are non-negative since they are KL divergences. A natural assumption would be that the information gain $g(h)$, which is the sum of all $g(ao\|h)$ in expectation, grows monotonically as the length of the history increases. However, this is not the case; see Figure 1 for an example. Although $g(ao\|h)$ is always non-negative, some of the gain may pull $p(\theta|hao)$ closer to the prior density $p(\theta)$, resulting in a decrease of the KL divergence between $p(\theta|hao)$ and $p(\theta)$. This is never the case with the normal reward signals in reinforcement learning, where the accumulated reward never decreases if all rewards are non-negative.
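A small two-hypothesis coin example makes the non-monotonicity concrete. It is an illustrative construction, not a reproduction of Figure 1; the biases 0.25 and 0.75 and the helper names are assumptions.

```python
import math


def bayes_update(posterior, likelihoods):
    """Eq. 1 on a finite parameter set."""
    joint = [p * l for p, l in zip(posterior, likelihoods)]
    evidence = sum(joint)
    return [j / evidence for j in joint]


def kl(p, q):
    """KL(p || q) in nats for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


thetas = [0.25, 0.75]        # two hypothetical values of p(heads)
prior = [0.5, 0.5]

after_h = bayes_update(prior, thetas)                          # observe heads
after_ht = bayes_update(after_h, [1.0 - t for t in thetas])    # then tails

g_h = kl(after_h, prior)             # cumulative gain after heads, about 0.131
g_t_given_h = kl(after_ht, after_h)  # immediate gain of the tail, positive
g_ht = kl(after_ht, prior)           # cumulative gain after heads, tails: 0
```

The tail carries a positive immediate gain, yet it moves the posterior back onto the prior, so the cumulative gain with respect to the prior drops to zero.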
3.3 The Algorithm
The definition of the optimal exploration policy is constructive, which means that it can be readily implemented, provided that the number of actions and possible observations is finite so that the expectation and maximization can be computed exactly.
The following two algorithms compute the maximum curiosity value $v^\tau(h)$ and the maximum curiosity Q-value $q^\tau(h,a)$, respectively, assuming that the expected immediate gain $\bar{g}(a\|h)$ can be computed.
The complexity of both CuriosityValue and CuriosityQValue is $O((n_o n_a)^\tau)$, where $n_o$ and $n_a$ are the number of possible observations and actions, respectively. Since the cost is exponential in $\tau$, planning with a large number of look-ahead steps is infeasible, and approximation heuristics must be used in practice.
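The mutual recursion can be sketched as follows. This is a minimal sketch and not the paper's listing: it assumes a finite parameter set and a predictor that ignores the history, so the posterior alone summarizes the history. The layout `model[k][a][o]` and all function names are hypothetical.

```python
import math


def predictive(post, model, a):
    """p(o | ha) for every o, where model[k][a][o] = p(o | a; theta_k)."""
    return [sum(p * model[k][a][o] for k, p in enumerate(post))
            for o in range(len(model[0][a]))]


def update(post, model, a, o, p_o):
    """Posterior after performing a and observing o (Eq. 1)."""
    return [p * model[k][a][o] / p_o for k, p in enumerate(post)]


def expected_gain(post, model, a):
    """Expected immediate information gain of action a."""
    total = 0.0
    for o, p_o in enumerate(predictive(post, model, a)):
        if p_o > 0.0:
            new = update(post, model, a, o, p_o)
            total += p_o * sum(n * math.log(n / p)
                               for n, p in zip(new, post) if n > 0.0)
    return total


def curiosity_q_value(post, model, a, tau):
    """Immediate gain plus expected maximum value of the next history."""
    q = expected_gain(post, model, a)
    if tau > 1:
        for o, p_o in enumerate(predictive(post, model, a)):
            if p_o > 0.0:
                nxt = update(post, model, a, o, p_o)
                q += p_o * curiosity_value(nxt, model, tau - 1)
    return q


def curiosity_value(post, model, tau):
    """Maximum over actions of the curiosity Q-value."""
    return max(curiosity_q_value(post, model, a, tau)
               for a in range(len(model[0])))


# Two hypotheses, two actions: action 0 is uninformative, action 1 is not.
model = [[[0.5, 0.5], [0.9, 0.1]],
         [[0.5, 0.5], [0.1, 0.9]]]
prior = [0.5, 0.5]
v1 = curiosity_value(prior, model, 1)
v2 = curiosity_value(prior, model, 2)
q_idle = curiosity_q_value(prior, model, 0, 1)
```

Each call branches over every action and observation, which is where the exponential cost in the look-ahead depth comes from.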
3.4 Extending to Infinite Horizon
Having to restrict the maximum life span of the agent is rather inconvenient. It is tempting to define the curiosity Q-value in the infinite time horizon case as the limit of curiosity Q-values with increasing life spans, $T \to \infty$. However, this cannot be achieved without additional technical constraints. For example, consider simple coin tossing with a continuous prior over the probability of seeing heads, and the expected cumulative information gain for the next $T$ flips. With increasing $T$, this gain grows without bound. A frequently used approach to simplifying the math is to introduce a discount factor $\gamma$, as used in reinforcement learning. Assume that the agent has a maximum of $\tau$ actions left, but before finishing the $\tau$ actions it may be forced to leave the environment with probability $1-\gamma$ ($0 \le \gamma < 1$) at each time step. In this case, the curiosity Q-value becomes $q^{\gamma,1}_\pi(h,a) = \bar{g}(a\|h)$, and
$$q^{\gamma,\tau}_\pi(h,a) = \bar{g}(a\|h) + \gamma\,\mathbb{E}_{o|ha}\,\mathbb{E}_{a'|hao}\, q^{\gamma,\tau-1}_\pi(hao,a').$$
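One may also interpret $q^{\gamma,\tau}_\pi(h,a)$ as a linear combination of curiosity Q-values without the discount,
$$q^{\gamma,\tau}_\pi(h,a) = (1-\gamma)\sum_{t=1}^{\tau-1}\gamma^{t-1}\, q^t_\pi(h,a) + \gamma^{\tau-1}\, q^\tau_\pi(h,a).$$
Note that curiosity Q-values with larger look-ahead steps are weighted exponentially less.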
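The relation between discounted and undiscounted curiosity Q-values can be checked numerically. The sketch below assumes a finite parameter set, a history-independent predictor `model[k][a][o]`, and a deterministic policy given as a function of the posterior; it compares the discounted recursion against the weighted sum $(1-\gamma)\sum_{t<\tau}\gamma^{t-1}q^t + \gamma^{\tau-1}q^\tau$. All names are hypothetical.

```python
import math


def predictive(post, model, a):
    """p(o | ha) for every o, where model[k][a][o] = p(o | a; theta_k)."""
    return [sum(p * model[k][a][o] for k, p in enumerate(post))
            for o in range(len(model[0][a]))]


def update(post, model, a, o, p_o):
    """Posterior after performing a and observing o (Eq. 1)."""
    return [p * model[k][a][o] / p_o for k, p in enumerate(post)]


def expected_gain(post, model, a):
    """Expected immediate information gain of action a."""
    total = 0.0
    for o, p_o in enumerate(predictive(post, model, a)):
        if p_o > 0.0:
            new = update(post, model, a, o, p_o)
            total += p_o * sum(n * math.log(n / p)
                               for n, p in zip(new, post) if n > 0.0)
    return total


def q_policy(post, model, a, tau, policy, gamma=1.0):
    """Curiosity Q-value under a fixed policy; gamma=1.0 means no discount."""
    q = expected_gain(post, model, a)
    if tau > 1:
        for o, p_o in enumerate(predictive(post, model, a)):
            if p_o > 0.0:
                nxt = update(post, model, a, o, p_o)
                q += gamma * p_o * q_policy(nxt, model, policy(nxt),
                                            tau - 1, policy, gamma)
    return q


model = [[[0.5, 0.5], [0.9, 0.1]],
         [[0.5, 0.5], [0.1, 0.9]]]
prior = [0.5, 0.5]
policy = lambda post: 1            # hypothetical policy: always action 1
gamma, tau = 0.8, 4

discounted = q_policy(prior, model, 1, tau, policy, gamma)
plain = [q_policy(prior, model, 1, t, policy) for t in range(1, tau + 1)]
combo = ((1 - gamma) * sum(gamma ** (t - 1) * plain[t - 1]
                           for t in range(1, tau))
         + gamma ** (tau - 1) * plain[tau - 1])
```

The two quantities agree because the expected gain at step $t$ is the difference between consecutive undiscounted Q-values, which is then weighted by $\gamma^{t-1}$.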
The optimal policy in the discounted case is given by