## 1 Introduction

Empowerment [1, 2] is an information-theoretic quantity that measures how much information an agent can inject into its environment through its actuators and subsequently perceive through its sensors. It therefore captures both the agent's control over the environment and how well the resulting state can be perceived by the sensors. Prior work [1, 3] showed that system states with high empowerment are those with the most future options. For an inverted pendulum, the state with the most future options is the upright balanced position, as shown in experiments using the empowerment formulation [3]. The empowerment value can be used as the reward function in reinforcement learning, where it acts as an unsupervised control signal that moves the robot towards states with high stability and maximal influence.

Previous applications lack an efficient implementation and cannot handle continuous variables in the state space, the action space, or both. They also scale poorly with the dimension of the action space, which limits empowerment to simple simulations. The first implementations assumed discrete distributions for both spaces [1]; later work [3] applied empowerment to continuous states but still requires a low-dimensional discrete action space. Real-world robotics tasks, such as in-hand manipulation, require a high-dimensional continuous action space. We develop an efficient computation of empowerment that copes with high-dimensional continuous state and action spaces, enabling the use of empowerment for real-world robotic tasks.

## 2 Empowerment

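Empowerment is defined as the Shannon channel capacity [3]:

$$\mathcal{E}(x) = \max_{p(u \mid x)} \int p(u \mid x)\, \mathrm{KL}\big(p(x' \mid x, u) \,\|\, p(x' \mid x)\big)\, \mathrm{d}u$$

The distribution $p(x' \mid x, u)$ describes the dynamical model of the environment, with $x'$ being the next state, $x$ the current state and $u$ the action performed. $p(x' \mid x)$ is the same dynamical model but with the action marginalised out:

$$p(x' \mid x) = \int p(x' \mid x, u)\, p(u \mid x)\, \mathrm{d}u$$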

The channel capacity essentially measures how many distinguishable next states the agent can reach through its choice of action. It is zero if the agent has no control over the environment, i.e. if every action leads to the same next state.
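The two extremes described above can be checked on a toy discrete channel. The following is a minimal sketch, not the implementation used in this paper: it assumes a finite set of actions and next states, and the names `mutual_information`, `no_control` and `full_control` are hypothetical. It evaluates the mutual information for a fixed action distribution; the channel capacity is the maximum of this quantity over all action distributions.

```python
import math


def mutual_information(p_u, p_next_given_u):
    """Mutual information (in nats) between action and next state.

    p_u[i] is the probability of action i in the current state.
    p_next_given_u[i][j] is the probability of next state j under action i.
    """
    n_next = len(p_next_given_u[0])
    # Marginalise the action out of the transition model.
    p_next = [
        sum(p_u[i] * p_next_given_u[i][j] for i in range(len(p_u)))
        for j in range(n_next)
    ]
    mi = 0.0
    for i, pu in enumerate(p_u):
        for j in range(n_next):
            p = p_next_given_u[i][j]
            if pu > 0.0 and p > 0.0:
                mi += pu * p * math.log(p / p_next[j])
    return mi


uniform = [0.5, 0.5]
# Every action leads to the same next-state distribution: no control.
no_control = [[0.5, 0.5], [0.5, 0.5]]
# Each action deterministically selects a different next state: full control.
full_control = [[1.0, 0.0], [0.0, 1.0]]

mi_no_control = mutual_information(uniform, no_control)
mi_full_control = mutual_information(uniform, full_control)
print(mi_no_control, mi_full_control)
```

For the channel without control the mutual information is zero for any action distribution, so the capacity is zero as well. For the fully controllable channel the uniform action distribution is optimal by symmetry, and the value equals $\ln 2$ nats, i.e. one bit.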

Currently, the only algorithm used for computing the empowerment value of a single state is the Blahut-Arimoto algorithm [3]. Both the computation of the KL-divergence and the marginalisation of the system dynamics are very expensive and are done by sampling. These values must not only be computed but also optimised with respect to the action distribution $p(u \mid x)$. The KL-divergence inside the channel capacity is estimated by Monte Carlo integration and then maximised by iteratively changing the probabilities of each discrete action. This is computationally very expensive and not suitable for online use, e.g. in a robotic system. In the following, we propose an efficient implementation that replaces the Blahut-Arimoto algorithm and enables use in real-world robotic systems.
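For reference, the iterative reweighting of discrete action probabilities can be sketched for the simplest case. This is a textbook Blahut-Arimoto iteration under the assumption that both actions and next states are discrete, so the KL-divergence is an exact sum; the continuous-state variant in [3] replaces that sum with a Monte Carlo estimate, which is where the cost arises. The function name and the example channel are hypothetical.

```python
import math


def blahut_arimoto(channel, n_iter=200):
    """Estimate the capacity (in nats) of a discrete channel.

    channel[i][j] is the probability of next state j under action i.
    Returns the capacity estimate and the optimised action distribution.
    """
    n_u, n_next = len(channel), len(channel[0])
    p_u = [1.0 / n_u] * n_u  # start from a uniform action distribution

    def kl_per_action(p_u):
        # Marginal over next states for the current action distribution.
        p_next = [
            sum(p_u[i] * channel[i][j] for i in range(n_u))
            for j in range(n_next)
        ]
        # KL divergence between each action's outcome and the marginal.
        return [
            sum(
                channel[i][j] * math.log(channel[i][j] / max(p_next[j], 1e-300))
                for j in range(n_next)
                if channel[i][j] > 0.0
            )
            for i in range(n_u)
        ]

    for _ in range(n_iter):
        kl = kl_per_action(p_u)
        # Actions whose outcomes differ most from the marginal gain weight.
        weights = [p * math.exp(d) for p, d in zip(p_u, kl)]
        total = sum(weights)
        p_u = [w / total for w in weights]

    kl = kl_per_action(p_u)
    capacity = sum(p * d for p, d in zip(p_u, kl))
    return capacity, p_u


# Binary symmetric channel: each action reaches "its" state with probability 0.9.
bsc = [[0.9, 0.1], [0.1, 0.9]]
capacity_bsc, p_u_bsc = blahut_arimoto(bsc)
print(capacity_bsc, p_u_bsc)
```

Every iteration touches all action and next-state pairs, which illustrates why the approach scales poorly once the action space grows or the inner sum has to be replaced by sampling.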

## 3 Efficient Empowerment

### 3.1 Analytic KL-divergence

Whereas the authors of [3] used discrete action distributions and Monte Carlo sampling to compute the empowerment objective, we follow [4] and compute the KL-divergence efficiently using the analytical solution for the KL-divergence between two Gaussian distributions^{1}. We assume that the system dynamics can be modelled by Gaussian distributions whose parameters are defined by neural networks that take the current state $x$ and the action $u$ as input:

$$p(x' \mid x, u) = \mathcal{N}\big(x';\, \mu(x, u),\, \sigma^2(x, u)\big)$$

where $\mu$ and $\sigma$ are modelled by neural networks.

^{1} Obtained with the help of the Q&A community “crossvalidated” at http://stats.stackexchange.com/questions/7440/kl-divergence-between-two-univariate-gaussians.
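The closed-form expression referred to above can be written down in a few lines. The sketch below assumes diagonal Gaussians parameterised by per-dimension means and variances; the function name `kl_diag_gaussians` is hypothetical and the code illustrates the formula rather than the authors' implementation.

```python
import math


def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL divergence (in nats) from a diagonal Gaussian p to a diagonal Gaussian q.

    All arguments are sequences of equal length holding per-dimension
    means and variances. Dimensions are independent, so their KL terms add up.
    """
    kl = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        kl += 0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
    return kl


# Identical distributions have zero divergence.
kl_same = kl_diag_gaussians([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0])
# Unit-variance Gaussians whose means differ by 1 have divergence 0.5.
kl_shifted = kl_diag_gaussians([1.0], [1.0], [0.0], [1.0])
print(kl_same, kl_shifted)
```

Because the expression is a sum of elementary functions of the means and variances, it can be differentiated with respect to the network outputs, which is what makes it usable without sampling.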

### 3.2 Variance Propagation for Marginalisation

To calculate the empowerment objective, one needs not only the dynamics model but also the transition probability $p(x' \mid x)$ with the action $u$ marginalised out. Since this marginalisation is very costly, we use a technique called Variance Propagation [5, 6, 7]. Variance Propagation defines a set of rules for transforming a Gaussian when propagating it through a network. By setting the input mean and variance of the action to the mean and variance of $p(u \mid x)$, we effectively marginalise out $u$.

### 3.3 Variational Auto-Encoders

Elements of state and action need to be statistically independent for Variance Propagation and the analytical KL-divergence to apply properly. Since this does not hold for most real-world data, we transform state and action into latent spaces in which their elements are statistically independent. We use the Variational Auto-Encoder [4] for this transformation.
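The role of the independence assumption can be seen in the simplest Variance Propagation rule, the one for a linear layer. The sketch below is illustrative only: `propagate_linear`, the weights and the input values are hypothetical, and the rules for nonlinearities described in [5, 6, 7] are not covered. The output variance is a plain weighted sum of input variances only because the inputs are assumed independent; correlated inputs would add covariance terms.

```python
def propagate_linear(weights, bias, mean_in, var_in):
    """Propagate independent Gaussian inputs through a linear layer.

    weights is a list of rows, one row per output unit.
    Returns per-output means and variances; covariances between
    output units are ignored in this sketch.
    """
    mean_out = [
        sum(w * m for w, m in zip(row, mean_in)) + b
        for row, b in zip(weights, bias)
    ]
    var_out = [sum(w * w * v for w, v in zip(row, var_in)) for row in weights]
    return mean_out, var_out


# Input is the state (known exactly, so zero variance) followed by
# one action dimension with mean 0.5 and variance 0.04.
mean_in = [0.3, -1.0, 0.5]
var_in = [0.0, 0.0, 0.04]
weights = [[1.0, 0.1, 0.0], [0.0, 1.0, 2.0]]
bias = [0.0, 0.0]

mean_out, var_out = propagate_linear(weights, bias, mean_in, var_in)
print(mean_out, var_out)
```

Feeding the action in as a mean and a variance, rather than as a sample, is what replaces the sampling-based marginalisation: the uncertainty over the action shows up directly as variance in the predicted next state.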

### 3.4 Action selection

Empowerment only yields a scalar measuring the quality of the current state; it does not provide actions for controlling a system. The simplest way to generate actions is to predict the next state for each candidate action using the already available system dynamics and to choose the action whose predicted next state has the highest empowerment. A more sophisticated solution is to use empowerment as the reward function for reinforcement learning. Using it as a regulariser for an existing reward function is also possible.
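The one-step scheme can be sketched as a search over a finite set of candidate actions. All names here are hypothetical, and both the dynamics and the empowerment function are toy stand-ins chosen only so that the example runs; in practice they would be the system model and the empowerment estimate described in the previous sections.

```python
def select_action(state, candidate_actions, predict_next_state, empowerment):
    """Pick the candidate whose predicted next state has the highest empowerment."""
    return max(
        candidate_actions,
        key=lambda action: empowerment(predict_next_state(state, action)),
    )


def toy_dynamics(state, action):
    # Stand-in dynamics: the action shifts a scalar state.
    return state + action


def toy_empowerment(state):
    # Stand-in score that peaks at state 0; NOT a real empowerment estimate.
    return -abs(state)


best_action = select_action(0.4, [-0.5, 0.0, 0.5], toy_dynamics, toy_empowerment)
print(best_action)
```

This greedy search looks only one step ahead and needs a discretised candidate set, which is why using empowerment as a reinforcement learning reward is the more general option.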

## 4 Pendulum Experiments

As a first simple experiment, we tried our efficient implementation on a pendulum task similar to [3]. The system dynamics of this pendulum are known and implemented in a neural-network-like structure so that we can apply Variance Propagation to integrate out the action. The probability distribution $p(u \mid x)$ was implemented using a neural network modelling the sufficient statistics of a diagonal Gaussian distribution. In this simple pendulum experiment we did not use the Variational Auto-Encoder trick for making state and action statistically independent. It was not necessary, since both elements of the state vector were already independent and the action is only a scalar. The result of this experiment can be seen in Fig. 1. The value of empowerment is maximal when both angle and velocity are zero, which corresponds to the inverted pendulum standing upright.

## 5 Conclusion

We provided a method for efficiently computing empowerment in high-dimensional continuous state and action spaces by combining Variance Propagation, analytical computation of the KL-divergence and the Variational Auto-Encoder. In a first experiment with a simulated inverted pendulum, we showed that this method identifies states with high empowerment and can generate actions using a one-step predictor.

Future work consists of replacing the dynamical model with a learned model. We will also test our algorithm on real-world data with high-dimensional state and action spaces. Furthermore, we plan to test action selection by reinforcement learning with empowerment as the reward function.

## References

- [1] Alexander S. Klyubin, Daniel Polani, and Chrystopher L. Nehaniv. Keep your options open: An information-based driving principle for sensorimotor systems. PLoS ONE, 3(12):e4018, December 2008.
- [2] Christoph Salge, Cornelius Glackin, and Daniel Polani. Empowerment – an introduction. arXiv:1310.1863 [nlin], October 2013.
- [3] Tobias Jung, Daniel Polani, and Peter Stone. Empowerment for continuous agent-environment systems. arXiv:1201.6583 [cs], January 2012.
- [4] Diederik P. Kingma and Max Welling. Stochastic gradient VB and the variational auto-encoder. arXiv:1312.6114 [cs, stat], December 2013.
- [5] Sida Wang and Christopher Manning. Fast dropout training. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 118–126, 2013.
- [6] Justin Bayer, Christian Osendorfer, Sebastian Urban, and Patrick van der Smagt. Training neural networks with implicit variance. In Minho Lee, Akira Hirose, Zeng-Guang Hou, and Rhee Man Kil, editors, Neural Information Processing, number 8227 in Lecture Notes in Computer Science, pages 132–139. Springer Berlin Heidelberg, January 2013.
- [7] Justin Bayer, Maximilian Karl, Daniela Korhammer, and Patrick van der Smagt. Fast adaptive weight noise. arXiv:1507.05331 [cs, stat], July 2015.
