A unified strategy for implementing curiosity and empowerment driven reinforcement learning

06/18/2018 ∙ by Ildefons Magrans de Abril, et al.

Although there are many approaches to implementing intrinsically motivated artificial agents, the combined use of multiple intrinsic drives remains a relatively unexplored research area. Specifically, we hypothesize that a mechanism capable of quantifying and controlling the evolution of the information flow between the agent and the environment could be the fundamental component for implementing a higher degree of autonomy in artificial intelligent agents. This paper proposes a unified strategy for implementing two semantically orthogonal intrinsic motivations: curiosity and empowerment. The curiosity reward informs the agent about the relevance of a recent agent action, whereas empowerment is implemented as the opposite information flow, from the agent to the environment, which quantifies the agent's potential to control its own future. We show that an additional homeostatic drive can be derived from the curiosity reward, which generalizes and enhances the information gain of a classical curious/heterostatic reinforcement learning agent. We show how an internal model shared by curiosity and empowerment facilitates more efficient training of the empowerment function. Finally, we discuss future directions for further leveraging the interplay between these two intrinsic rewards.

2 Keywords:

intrinsic motivation, reinforcement learning, curiosity, empowerment, homeostasis

3 Introduction

Within a reinforcement learning setting (Sutton and Barto, 1998), a reward signal indicates a particular momentary positive (or negative) event and serves to constrain the long-term behavior of the agent. Extrinsic rewards are generated by an external oracle and indicate how well the agent is interacting with the environment (e.g. video game score, portfolio return). Intrinsic rewards, on the other hand, are generated by the agent itself and indicate a particular internal event, sometimes implemented as a metaphor for an animal's internal drive (Chentanez et al., 2005; Barto et al., 2004; Sequeira et al., 2011; Song and Grabowski, 2006).

There are many intrinsic rewards, and most of them can be characterized by how they affect the information flow between the environment and the agent. On one side of the spectrum, information is pushed from the agent to the environment, for instance by rewarding actions that lead to predictable consecutive sensor readings (Montúfar et al., 2016) or by rewarding the agent for reaching states from which its actions have a large influence on the future state (i.e. empowerment (Jung et al., 2011; Mohamed and Rezende, 2015; Karl et al., 2017; Gregor et al., 2016)).

On the other side, information is encouraged to move efficiently from the environment to the agent. These rewards motivate the agent to explore its environment by taking actions that lead to an improvement of its internal models. Schmidhuber (1991) proposed an online learning agent equipped with a curiosity unit measuring the Euclidean distance between the observed state and the model prediction. Recently, Pathak et al. (2017) extended the curiosity functionality to accommodate agents with high-dimensional sensory inputs by adding a representation network that filters out information from the observed state that is not relevant for predicting how the agent's actions affect the future state. Houthooft et al. (2016) presented an exploration reward bonus based on information gain maximization, computed using a variational approximation of a Bayesian neural network. Lopes et al. (2012) discussed an exploration reward bonus that encourages learning progress over the last few experiences instead of the immediate agent surprise. Bellemare et al. (2016) differ in that the agent does not learn a forward model but a probability density function over the states it has visited, together with a lower bound on the information gain associated with the agent's exploratory behavior.

Although there are many approaches to implementing intrinsically motivated artificial agents, the combined use of multiple intrinsic drives is still a relatively unexplored research field. To establish a principled way of combining multiple types of intrinsic motivation, we propose an approach in which an agent is designed to optimize the information flow between itself and the environment. By learning to sense and act via the internal representations of the information flow, the agent would be able to behave as if it were rewarded by a particular intrinsic reward function. As we discuss below, this general formulation of intrinsic motivation can capture a large spectrum of emergent autonomous behaviors, from a curious agent aiming to acquire as much information as possible to an agent aiming to reach highly empowering states. With this architecture, an agent could discover new curiosity-driven behaviors by simply visiting new internal states or sequences of internal states. We believe that the generality of this internal representation, independent of a particular task and/or agent sensing/acting capabilities, has the potential to foster multi-agent and multi-task architectures with new transfer learning capabilities.

This paper is our first step towards developing an intelligent agent capable of sensing and acting according to the information flow between the environment and itself. Our contribution is an implementation method to compute the state of the information flow. It quantifies the information gain and empowerment obtained by an agent interacting with the environment at every step. In the following sections we discuss our design requirements and present our approach. Section 7 presents the experimental results. Finally, the discussion section summarizes our main results and limitations along with possible future directions.

4 Background

This paper assumes a typical reinforcement learning setup where an agent interacts with the environment at discrete time steps $t$: it observes a state $s_t$ and acts on the environment with action $a_t$ according to a control policy $\pi$. Within this setting, Tiomkin and Tishby (2017) presented recursive expressions to describe the information transferred from a sequence of environment states to the sequence of agent actions, as well as the information transferred from the agent actions to the environment states. In both cases, it is assumed that the agent interacts open-endedly with a Markovian environment (i.e. one with transition probability function $p(s_{t+1} \mid s_t, a_t)$). Figure 1 shows two different points of view of the information flow for the same process of an agent interacting with a Markovian environment. Equations (1) and (2) present the recursive expressions of the information transferred from environment to agent and from agent to environment respectively:

(1)
(2)

where lower case is used for concrete states and actions, uppercase is used to denote random variables, / is the sequence of actions/states of length starting at time and is the causally conditioned directed mutual information (Kramer, 1998):

(3)

where this definition differs from that of the conditional mutual information only in that and substitutes and . The causal conditioning reflects a causal relationship on past and present only.

Figure 1: Conceptual diagram of the information gain process (left) and empowerment (right): dark thin arrows are causal dependencies and large arrows show the direction of the information flow.

It is important for our approach that both equations have a recursive decomposition similar to the Bellman equation. From this point of view, the one-step terms of the two recursions would act as the agent's reward when we try to encourage our agent to take actions that maximize the information flow from the environment to the agent and from the agent to the environment respectively.

As we discussed in the introduction, intrinsic rewards can be characterized by how they affect the information flow between the environment and the agent. We should therefore be able to encourage a variety of behaviors, similar to those encouraged by particular intrinsic rewards, by properly balancing the two types of rewards derived from Equations (1) and (2). If we could create an agent that can sense and act in a space that quantifies the strength of the information flow between itself and the environment, then we should be able to enhance the agent's capacity to thrive in different, previously unknown environments by controlling its movement in this internal space.

These rewards define our internal space. Computing them requires computing the corresponding conditional mutual information, which in turn requires approximating the corresponding probability distributions (Mohamed and Rezende, 2015; Tiomkin and Tishby, 2017). When actions and/or states are discrete, we can approximate them, for instance, using a neural network with a softmax output layer. However, it is much harder when states and actions are continuous, especially when the state space is very high dimensional (e.g. a video stream). In the following sections, we propose a more practical method to implement both rewards.
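For the discrete case mentioned above, the one-step mutual information between actions and next states can be computed directly once the distributions are tabulated. The following is a minimal sketch under stated assumptions: the function name `mutual_information`, the tabular inputs `source` and `channel`, and the choice of bits as unit are illustrative and do not come from the paper.

```python
import math
from typing import Sequence


def mutual_information(source: Sequence[float],
                       channel: Sequence[Sequence[float]]) -> float:
    """Return I(A; S') in bits for one fixed current state.

    source[a]     -- probability of choosing action a (assumed given)
    channel[a][j] -- probability of reaching next state j after action a
    """
    n_next = len(channel[0])
    # Marginal over next states: p(j) = sum_a source[a] * channel[a][j]
    marginal = [sum(source[a] * channel[a][j] for a in range(len(source)))
                for j in range(n_next)]
    info = 0.0
    for a, p_a in enumerate(source):
        for j, p_j_given_a in enumerate(channel[a]):
            # Zero-probability terms contribute nothing to the sum.
            if p_a > 0.0 and p_j_given_a > 0.0:
                info += p_a * p_j_given_a * math.log2(p_j_given_a / marginal[j])
    return info


# Two actions that lead to two distinct next states carry one bit.
distinct = [[1.0, 0.0], [0.0, 1.0]]
# Two actions that lead to the same next state carry no information.
same = [[1.0, 0.0], [1.0, 0.0]]
print(mutual_information([0.5, 0.5], distinct))
print(mutual_information([0.5, 0.5], same))
```

With continuous states and actions these tables do not exist, which is the difficulty the text refers to.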

5 Curiosity with homeostatic regulation

This section discusses a practical method to compute the reward coming from Equation (1). Our method avoids the approximation of complex distributions over continuous states and actions. We validate this first reward function using a state-of-the-art RL algorithm that works well with continuous actions. We chose the Deep Deterministic Policy Gradient (DDPG) algorithm (Lillicrap et al., 2015), but other options are also feasible. This algorithm finds a deterministic control policy that maximizes the expected sum of discounted rewards. When , episode length is and reward function is , then our agent explores the environment by maximizing the information gain as expressed in Equation (1).

We can express this reward as the reduction of entropy in the future state . Then, because we know the current state exactly and because the control policy inferred by the DDPG algorithm is deterministic, we use the concrete state and actions and instead of the random variables , and respectively to compute the reward. Finally, we approximate the reduction of entropy in the future state as the reduction of the prediction error in the future state. Equation (4) formalizes this approximation:

(4)

Here the two predictions of the future state come from the forward model and the extended forward model respectively. The extended forward model takes advantage of the knowledge of the action that the agent will take in the future state to improve the prediction of this future state. This approximation captures the relevant semantics at a much lower computational cost. Both internal models can be easily implemented with deep neural networks, which can accommodate an agent with high-dimensional input streams. Figure 2 is a graphical representation of the semantics of the new curiosity reward and of how it compares with a state-of-the-art curiosity reward based on the Euclidean distance between the observed state and the model prediction (e.g. Schmidhuber, 1991; Pathak et al., 2017).

Figure 2: Semantics of the curiosity reward with homeostatic regulation and comparison with a state-of-the-art curiosity reward based on the Euclidean distance between the observed state and the model prediction (e.g. Schmidhuber, 1991; Pathak et al., 2017).

Our new curiosity reward has two components: 1) Heterostatic motivation: similarly to state-of-the-art work based on the Euclidean distance (Schmidhuber, 1991; Pathak et al., 2017), the first component of our reward encourages taking actions that lead to large forward model errors. This first component implements the heterostatic drive, in other words, the tendency to push the agent away from predictable behavior; 2) Homeostatic motivation: the second component is our novel contribution. It encourages taking actions that lead to future states where the corresponding future action gives us additional information about that future state. This situation happens when the agent is “familiar” with the state-action pair. Therefore, our new reward encourages the agent to move towards regions of the state-action space that simultaneously deliver large forward model errors and are “known/familiar” to the agent. In other words, it implies a priority sampling strategy towards “hard-to-learn” regions of the state-action space.
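The two components described above can be illustrated with a small sketch. The exact form of the published reward is not reproduced in this text, so the code assumes one plausible reading: the forward model's squared error minus a weighted squared error of the extended forward model. The names `curiosity_reward`, `squared_error` and the weight `alpha`, as well as the use of squared Euclidean distance, are assumptions for illustration only.

```python
from typing import Sequence


def squared_error(prediction: Sequence[float], target: Sequence[float]) -> float:
    """Squared Euclidean distance between a prediction and the observed state."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target))


def curiosity_reward(next_state: Sequence[float],
                     forward_pred: Sequence[float],
                     extended_pred: Sequence[float],
                     alpha: float = 1.0) -> float:
    """Assumed reward: heterostatic term minus weighted homeostatic penalty.

    forward_pred  -- prediction of the next state from (state, action)
    extended_pred -- prediction that additionally uses the next action
    alpha         -- hypothetical weight of the homeostatic component
    """
    heterostatic = squared_error(forward_pred, next_state)
    homeostatic_penalty = squared_error(extended_pred, next_state)
    return heterostatic - alpha * homeostatic_penalty


observed = [1.0, 1.0]
# Large forward error, accurate extended prediction: high reward.
print(curiosity_reward(observed, [0.0, 0.0], [1.0, 1.0]))
# Both models equally wrong: the two terms cancel when alpha is 1.
print(curiosity_reward(observed, [0.0, 0.0], [0.0, 0.0]))
```

Under this reading, setting `alpha` to zero leaves only the forward model error, i.e. a purely heterostatic reward.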

We further generalize this reward by adding a hyper-parameter that controls the importance of the homeostatic bonus. It is interesting to note that this reward is equal to the curiosity reward proposed by Pathak et al. (2017) when this hyper-parameter is zero. Finally, we should note that the reward function is non-stationary due to the continuous learning of the two internal models. For that reason we z-normalize the reward using a mean and standard deviation computed at the end of each episode using all available samples:

$$\hat{r}_t = \frac{r_t - \mu_r}{\sigma_r} \qquad (5)$$

where $\mu_r$ and $\sigma_r$ are the sample mean and sample standard deviation of the reward computed from all samples collected so far. Algorithm 1 summarizes the overall logic of our curiosity agent. It follows an architecture similar to Pathak et al. (2017):

Result: Trained forward model
N: Total number of training episodes;
K: Duration of each exploration episode;
Initialization of the parameters of the internal models and the DDPG networks;
Initialization of the random exploration probability;
for episode i:1..N do
       Initialize environment: initial state according to experiment strategy (see section 7);
       for step t:1..K do
             Generate action (random according to the random exploration probability);
             Sample next state from the environment;
             Get reward according to equation (5);
             Add transition to Replay Buffer (RB);
             Sample Mini-Batch from RB;
             Train internal models and DDPG networks using the Mini-Batch
       end for
end for
Algorithm 1 Curiosity-driven reinforcement learning with homeostatic regulation
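The reward normalization used inside the loop can be kept up to date without storing every past reward. The sketch below is one way to do this, assuming Welford's running-statistics update; the class name `RewardNormalizer` and the small constant `eps` that guards against division by zero are illustrative choices, not details from the paper.

```python
import math
from typing import Iterable


class RewardNormalizer:
    """Running mean and sample standard deviation over all rewards seen."""

    def __init__(self, eps: float = 1e-8) -> None:
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, rewards: Iterable[float]) -> None:
        """Fold a batch of rewards (e.g. one episode) into the statistics."""
        for r in rewards:
            self.count += 1
            delta = r - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (r - self.mean)

    @property
    def std(self) -> float:
        # Fall back to 1.0 until at least two samples are available.
        if self.count < 2:
            return 1.0
        return math.sqrt(self.m2 / (self.count - 1))

    def normalize(self, reward: float) -> float:
        """Z-normalize a raw reward with the current statistics."""
        return (reward - self.mean) / (self.std + self.eps)


normalizer = RewardNormalizer()
normalizer.update([1.0, 2.0, 3.0])   # called at the end of an episode
print(normalizer.mean, normalizer.std, normalizer.normalize(3.0))
```

Calling `update` once per episode matches the description that the statistics are recomputed at the end of each episode from all samples collected so far.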

6 Approximated empowerment

This section discusses the implementation of the reward coming from Equation (2). We validate the implementation of this second reward by fitting, with the DDPG algorithm, a deterministic control policy that guides the agent along the path of maximum empowerment. We implement the definition of empowerment proposed by Tiomkin and Tishby (2017):

(6)

where is the source distribution. This definition of empowerment is based on the mutual information between a sequence of actions and the corresponding sequence of future states, which differs slightly from the original empowerment definition by Klyubin et al. (2005), which is based on the mutual information between a sequence of actions and the final state after executing all actions:

(7)
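Both definitions maximize a mutual information over the source distribution, which for a single step in a discrete setting is a channel capacity problem. The sketch below uses the Blahut-Arimoto iteration, a standard solver for that problem; it is shown only to make the cost of this maximization concrete and is not the method adopted in this paper. The function name `blahut_arimoto` and the tabular `channel` input are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def blahut_arimoto(channel: Sequence[Sequence[float]],
                   tol: float = 1e-9,
                   max_iter: int = 10_000) -> Tuple[float, List[float]]:
    """Estimate max over source of I(A; S') for one fixed state.

    channel[a][j] is the probability of next state j given action a.
    Returns the capacity estimate in bits and the final source distribution.
    """
    n_actions = len(channel)
    n_states = len(channel[0])
    source = [1.0 / n_actions] * n_actions  # start from a uniform source
    lower = 0.0
    for _ in range(max_iter):
        marginal = [sum(source[a] * channel[a][j] for a in range(n_actions))
                    for j in range(n_states)]
        # c[a] = exp(KL(channel[a] || marginal)), computed in nats
        c = []
        for a in range(n_actions):
            kl = sum(w * math.log(w / marginal[j])
                     for j, w in enumerate(channel[a]) if w > 0.0)
            c.append(math.exp(kl))
        total = sum(p * ca for p, ca in zip(source, c))
        # Standard lower and upper bounds on the capacity (in nats)
        lower, upper = math.log(total), math.log(max(c))
        source = [p * ca / total for p, ca in zip(source, c)]
        if upper - lower < tol:
            break
    return lower / math.log(2.0), source


# Three actions, two of which lead to the same next state.
capacity, best_source = blahut_arimoto([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(capacity, best_source)
```

Even in this tabular case the optimization is iterative and must be repeated for every state, which motivates the approximations discussed next.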

As in the previous section, we take advantage of the Bellman-like form of the information transferred from the agent to the environment (Eq. 2) to justify the use of a reinforcement learning algorithm that finds a control policy maximizing this quantity over a sequence of steps. Crucially, we assume that the reward function is stationary and known before we start optimizing the control policy. Therefore, we could compute the policy using dynamic programming. However, we continue to use the DDPG algorithm to stress the similarities with the curiosity-driven agent presented in the previous section, and the potential interplay between the models required to compute both rewards.
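As a minimal illustration of the dynamic programming alternative mentioned above, the sketch below performs finite-horizon backward induction on a small deterministic problem with a known, stationary, state-dependent reward. The discretized state set, the `step` and `reward` callables and the toy chain in the usage lines are all hypothetical stand-ins; the paper itself uses DDPG on a continuous environment.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

State = Hashable
Action = Hashable


def plan(states: Sequence[State],
         actions: Sequence[Action],
         step: Callable[[State, Action], State],
         reward: Callable[[State], float],
         horizon: int) -> Tuple[Dict[State, float], List[Dict[State, Action]]]:
    """Backward induction: V_k(s) = reward(s) + max_a V_{k-1}(step(s, a))."""
    value: Dict[State, float] = {s: 0.0 for s in states}
    policies: List[Dict[State, Action]] = []
    for _ in range(horizon):
        new_value: Dict[State, float] = {}
        greedy: Dict[State, Action] = {}
        for s in states:
            # Pick the action whose successor has the highest value-to-go.
            best = max(actions, key=lambda a: value[step(s, a)])
            greedy[s] = best
            new_value[s] = reward(s) + value[step(s, best)]
        value = new_value
        policies.append(greedy)
    policies.reverse()  # policies[0] is the policy for the first time step
    return value, policies


# Toy chain of five states; the reward grows to the right (stand-in reward).
chain = list(range(5))
move = lambda s, a: min(4, max(0, s + a))
values, policies = plan(chain, [-1, 1], move, lambda s: float(s), horizon=3)
print(values[0], policies[0][0])
```

This only works when the reward can be evaluated for every state in advance, which is exactly the stationarity assumption stated in the text.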

In this case, the reward function at state is defined by . A key additional cost of computing this reward, compared with the reward discussed in the previous section, is that we have to optimize the source distribution that delivers the maximum conditional mutual information. To address the high computational cost of this reward, we perform a number of approximations. We express the mutual information as the reduction of entropy in the future state: . The first approximation step is to compute the first entropy term assuming a fixed uniform source distribution instead of optimizing it as in the original formulation. This reward component implements a measure of the possible future states according to a fixed uniform source distribution. The second entropy term, defined by , is approximated using only the action provided by the deterministic control policy. In other words, we assume that the actions are distributed according to a Dirac delta distribution optimized by the DDPG algorithm: . This second entropy term captures the agent's potential to move from the current state to the future state in a controlled way. Finally, we approximate the first and second entropy terms respectively as follows:

(8)

Here the state predictions are supplied by the forward model, the uniform distribution is taken over the action space, and the control policy is the deterministic policy at the current time step. As discussed in this section, we assume a stationary reward. Therefore, according to Equation (8), we are assuming that the forward model is known.

Figure 3 is a graphical representation of the semantics of the new approximated empowerment reward discussed in this section. The light grey area represents the area of possible future states assuming a uniform distribution of actions. The white area is the possible deviation from the future state predicted by the forward model when the agent takes the action suggested by the current control policy. Therefore, this reward encourages policies that lead the agent towards states with a large number of future possibilities and states from which the future state is highly predictable given the action defined by the control policy.

Figure 3: Semantics of the new approximated empowerment reward. The light grey area represents the area of possible future states assuming a uniform distribution of actions. The white area is the possible deviation from the future state predicted by the forward model when the agent takes the action suggested by the current control policy. Therefore, this reward encourages policies that lead the agent towards states with a large number of future possibilities and states from which the future state is highly predictable given the action defined by the control policy.
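One possible reading of the two terms pictured in Figure 3 is sketched below. The published approximation is not reproduced in this text, so the code assumes a surrogate: the spread of forward-model predictions over uniformly drawn actions, minus the squared error between the observed next state and the prediction for the policy's action. The names `approx_empowerment_reward`, `candidate_actions` and `forward_model`, and the use of variance around the centroid as the spread measure, are all assumptions.

```python
from typing import Callable, Sequence, Tuple

Vector = Tuple[float, ...]


def sq_dist(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def approx_empowerment_reward(state: Vector,
                              observed_next: Vector,
                              policy_action: Vector,
                              candidate_actions: Sequence[Vector],
                              forward_model: Callable[[Vector, Vector], Vector]
                              ) -> float:
    """Assumed surrogate: spread of reachable states minus control error.

    candidate_actions are assumed to be drawn uniformly from the action
    space by the caller; forward_model is assumed to be already trained.
    """
    preds = [forward_model(state, a) for a in candidate_actions]
    dim = len(preds[0])
    centroid = [sum(p[d] for p in preds) / len(preds) for d in range(dim)]
    # Term 1: how widely the predicted next states spread (light grey area)
    spread = sum(sq_dist(p, centroid) for p in preds) / len(preds)
    # Term 2: deviation from the prediction for the policy's action (white area)
    control_error = sq_dist(forward_model(state, policy_action), observed_next)
    return spread - control_error


free_space = lambda s, a: (s[0] + a[0], s[1] + a[1])  # toy forward model
moves = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
print(approx_empowerment_reward((0.0, 0.0), (1.0, 0.0), (1.0, 0.0),
                                moves, free_space))
```

A state where every action leads to the same prediction has zero spread under this surrogate, so it is rewarded less than a state with many distinct reachable successors.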

This reward captures the semantics of the original reward defined in the trivial term of equation (6), while avoiding the maximization over the source distribution and the approximation of complex distributions over states and actions. The assumption of having a forward model is actually a feature rather than a limitation, because the forward model becomes the main instrument of interplay between the curiosity and the empowerment internal functions. A first example of this interplay is discussed in section 7, where we show that we can efficiently train our forward model using the curiosity agent and then use this same model to compute the reward of a second, empowerment-driven agent. More details of this interplay will be presented in a future paper. Not optimizing the source distribution could indeed be a more important limitation, especially when the environment is not deterministic and the probabilistic response is not uniform across the state-action space. The final approximation steps of both rewards, presented in Equations (4) and (8), are based on the Euclidean norm, which is not a valid distance metric when the state space is not Euclidean. Pathak et al. (2017) showed that this limitation can be addressed by fitting a representation network using the reconstruction performance of an agent action decoder as the loss function.

7 Results

7.1 Curiosity: experiment 1

Our experimental validation presents two examples in which combining the curiosity and homeostatic drives is superior for learning a forward model. Our validation hypothesis is that exploring an environment with several non-linearities can be improved by regulating the agent's curiosity with a homeostatic drive. More specifically, the homeostatic drive prioritizes the exploration of the state-action space according to how hard it is to learn.

To test our hypothesis, we use a 3-room continuous-space environment of 40 by 40, where an agent, able to sense its exact position, learns a control policy according to the DDPG algorithm with the reward presented in equation (5) and a probability of taking a random action equal to . The available actions are bi-dimensional action vectors such that . The environment is deterministic, and when an agent collides with a wall it returns to its previous state. The agent starts every episode in a random state and it runs for 10 steps (with max length step=10). We have implemented the forward model and the extended forward model as feed-forward neural networks with 2 hidden layers of 64 hidden units each. We store the agent traces and we train the agent and the internal models at the same time following Algorithm 1. Figure 4 shows a scheme of our environment.

Figure 4: Scheme of our 3 room environment.
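The environment dynamics described above (deterministic moves, and a return to the previous state on collision) can be sketched as follows. The figure is not reproduced in this text, so the wall heights, door positions and door widths below are hypothetical, as are the names `Wall`, `WALLS` and `step`; only the 40 by 40 size and the collision rule come from the description.

```python
from dataclasses import dataclass
from typing import Tuple

SIZE = 40.0  # side length of the square environment


@dataclass(frozen=True)
class Wall:
    """Horizontal wall at height y with a single door (hypothetical layout)."""
    y: float
    door_x_min: float
    door_x_max: float


# Two walls split the square into three stacked rooms (assumed geometry).
WALLS = (Wall(y=SIZE / 3, door_x_min=18.0, door_x_max=22.0),
         Wall(y=2 * SIZE / 3, door_x_min=18.0, door_x_max=22.0))


def step(pos: Tuple[float, float],
         action: Tuple[float, float]) -> Tuple[float, float]:
    """Deterministic transition; a collision leaves the agent where it was."""
    x, y = pos
    nx, ny = x + action[0], y + action[1]
    if not (0.0 <= nx <= SIZE and 0.0 <= ny <= SIZE):
        return pos  # hit the outer boundary
    for wall in WALLS:
        # The move crosses the wall if start and end lie on different sides.
        if (y < wall.y) != (ny < wall.y):
            t = (wall.y - y) / (ny - y)
            cross_x = x + t * (nx - x)
            if not (wall.door_x_min <= cross_x <= wall.door_x_max):
                return pos  # hit the wall outside the door
    return (nx, ny)


print(step((5.0, 10.0), (0.0, 5.0)))   # blocked by the lower wall
print(step((20.0, 10.0), (0.0, 5.0)))  # passes through the door
```

The discontinuity in `step` around the doors is the kind of non-linearity that the forward model has to learn in the experiments below.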

In our first experiment we study the accuracy of the final forward model as a function of the homeostatic hyper-parameter. We check the prediction accuracy using a validation data set of randomly generated samples collected independently of the training process and never used to train either internal model. We run our agent using different values of the hyper-parameter for episodes and we repeat each experiment 3 times. Figure 5 shows how we can improve the environment sampling efficiency by increasing the homeostatic component of the reward. Figure 6 shows a diagram of the policy learned after 10K episodes for two values of the hyper-parameter. We can clearly see that, when the hyper-parameter is large, the agent tends to position itself where there is a larger number of non-linearities (i.e. the “doors”). This agent behavior enhances the learning of complex regions by leveraging a more intense random exploration where it is most required.

Figure 5: Accuracy of the forward model learned by the agent as a function of (measured according to the mean square error on the validation set).
Figure 6: Flow diagram of the control policy learned after 10K episodes with (left) and (right) respectively.

We should also mention that, for this particular experiment, a pure random sampling strategy achieves a mean square error on the validation set of 0.67, which is better than the best result obtained with (0.87). However, this is not a fair comparison, because every episode starts in a different position, which enables a pure random agent to reach every spot of the environment by simply random walking its local surroundings, while our curiosity agent is constrained by a relatively low random exploration probability (). For instance, we are able to beat the random sampling agent's performance using our agent with a random exploration probability equal to and . In this case, we achieve an average mean square error over three runs of which is better than the achieved by the random sampling agent.

7.2 Curiosity: experiment 2

We performed a second experiment using the same environment described in Figure 4, but in this case the agent starts every episode in a random state of the bottom room. We want to understand whether the homeostatic reward is able to enhance the acquisition of novel environment samples by counting how many times the agent is able to traverse 2 doors and reach the top room. Figure 7 shows how we can improve the acquisition of challenging environment states by optimizing the contribution of the homeostatic reward component. In this case, a pure random sampling strategy running for episodes only reaches the top room a total of 145 times on average, which is far below any other average total achieved with a non-random strategy with any .

Figure 7: Total number of times that the agent is able to reach the top room as a function of when it starts every episode in a random position of the bottom room.

7.3 Empowerment: experiment 1

To test our approximation of empowerment, we again use the 3-room environment where an agent, able to sense its exact position, starts every episode in a random state and runs for 10 steps (with max length step=10). We have implemented the reward function defined in Equation (8) with a forward model trained using the curiosity agent described in section 7.1 with a random exploration probability equal to and . With the forward model completely trained, we optimize our control policy using the DDPG algorithm. Figure 8 shows a diagram of the reward function (left) and the final control policy (right). Our empowerment approximation rewards the agent for positioning itself close to the doors, because this position provides the largest number of future options to the agent.

Figure 8: Agent environment and approximated empowerment reward profile (left). Control policy to maximize the acquisition of reward (right).

8 Discussion

We presented a new approach to define the internal state space of a learning agent. Our strategy is to create a minimal set of internal functions that summarize the state of the information flow between the agent and the environment. In this paper, we proposed a unified framework for implementing two types of intrinsic motivations, namely curiosity and empowerment, from the perspective of information flow between the agent and the environment. Curiosity was implemented as the drive to increase information flow from the environment to the agent, whereas empowerment was formulated as the information flow from the agent to the environment. With these unified intrinsic motivations, we hypothesized that an agent should be able to generate a broad spectrum of autonomous behavior.

The curiosity function quantifies the interestingness of a particular state-action pair, while the empowerment function measures the agent's future options and control at the current state. These are computed at every discrete time step using two functions that depend on the actual observations, the agent actions and the internal forward model of the environment. We derived them from information-theoretic considerations and proposed methods to minimize the computational cost and share internal models across the two types of intrinsic motivations.

The curiosity function quantifies two opposing animal drives: 1) the innate drive to explore (heterostatic behavior) and 2) the desire to maintain certain critical parameters stable. We presented an exploration approach to demonstrate this first function. It generalizes a state of the art method (Pathak et al., 2017) and we present experimental results to demonstrate the superior exploration behavior of our joint homeostatic and heterostatic drive with respect to a pure curiosity/heterostatic approach.

The second derived function (i.e. empowerment) quantifies at each state the trade-off between the number of possible future states, assuming a uniform distribution of one-step actions, and the precision with which the agent moves to the next state according to a deterministic control policy. This function is our proposal to quantify the information that an agent can transmit to the environment when following a deterministic control policy, and it is based on similar information-theoretic principles to the curiosity function. We evaluated this function by optimizing a control policy that follows a sequence of states that are the optimal trade-off between the number of possible future states reachable from each state and control accuracy. This example captures the semantics of an empowerment-driven agent at a much lower computational cost than the original formulation.

In future work, we will explore meta-learning strategies to dynamically adjust the contribution of the two intrinsic reward functions as well as the weight of the homeostatic drive, the random exploration probability and eventually also an external reward. Our meta-controller should be able to compute a probabilistic model over bi-dimensional functions on a space defined by the weights of the intrinsic rewards. Gaussian processes are good candidates, as they offer a sample-efficient way to approximate this distribution as well as principled approaches to implement the intrinsic weight sampling strategies (Snoek et al., 2012). It is key to address the non-stationary behavior of both intrinsic motivation functions (Snoek et al., 2014).

Conflict of Interest Statement

The authors were employed by the company ARAYA, Inc.

Author Contributions

IM and RK conceived the method. IM performed experiments and analyzed data. IM and RK wrote the paper.

Funding

This work was supported by JST CREST Grant Number JPMJCR15E2, Japan.
