Towards Governing Agent's Efficacy: Action-Conditional β-VAE for Deep Transparent Reinforcement Learning

11/11/2018 · John Yang, et al. · Seoul National University

We tackle the blackbox issue of deep neural networks in the settings of reinforcement learning (RL), where neural agents learn towards maximizing reward gains in an uncontrollable way. Such a learning approach is risky when the interacting environment includes an expanse of state space, because it is then almost impossible to foresee all unwanted outcomes and penalize them with negative rewards beforehand. Unlike the reverse analysis of learned neural features in previous works, our proposed method tackles the blackbox issue by encouraging an RL policy network to learn interpretable latent features through an implementation of a disentangled representation learning method. Toward this end, our method allows an RL agent to understand self-efficacy by distinguishing its influences from uncontrollable environmental factors, which closely resembles the way humans understand their scenes. Our experimental results show that the learned latent factors not only are interpretable, but also enable modeling the distribution of the entire visited state space under a specific action condition. Our experiments also show that this characteristic of the proposed structure can lead to ex post facto governance of the desired behaviors of RL agents.

Introduction

Despite many recent successful achievements that deep neural networks (DNN) have enabled in machine learning fields [Krizhevsky, Sutskever, and Hinton2012, LeCun, Bengio, and Hinton2015, Mnih et al.2015], the legibility of their high-level representations is noticeably less studied than the related work that rather prioritizes performance enhancements or task completions. The blackbox issue of neural networks has often been neglected, and such technical opacity has been excused for the vast performance improvements it brings [Burrell2016].

While the opaqueness of DNNs comes in handy when strict labels are available for every data sample, the blackbox issue is a great element of risk especially in reinforcement learning (RL) settings, where machines, or agents, are allowed to have highly intertwined interactions with their environments. Since an RL agent’s policy on action selection is optimized towards maximizing the rewards, it may produce harmful and unexpected outcomes if these outcomes are not penalized in advance with negative reward signals.

Yet, too much regulation would, contrarily, squander the full potential of the technology [Rahwan2018]. RL has proven its power over humans; AlphaGo, for example, learned unprecedented winning moves [Silver et al.2017]. Interfering in the learning process to control the model’s resultant behaviors, as done in the work of [Christiano et al.2017], may not be efficient in governing RL agents. Rather, it is desirable to control the efficacy of an agent that is already optimized for the environment.

In order to rule AI agents efficiently, the humans who govern them first need to comprehend how AI machines perceive their world and to monitor their efficacy [Stilgoe2018, Wynne1988]. Higgins et al. modeled an environment with the β-Variational Autoencoder (β-VAE) to generate disentangled latent features [Higgins et al.2017], purposefully inducing the learned features to be interpretable to humans [Higgins et al.2016b], and applied these features for transfer learning across multiple environments. We are motivated by the prospect that this method can be utilized to train an explainable RL agent [Higgins et al.2016a].

We believe building transparent RL agents and governing them would solve the issues mentioned above. In this paper, we propose a method that allows training a deep but transparent RL policy network by encouraging its latent features to be interpretable. We intend to accomplish this by training RL agents to learn disentangled representations of their world in an egocentric perspective with the action-conditional β-VAE (AC-β-VAE): the learned control-dependent latent features and uncontrollable environmental factors are disentangled while the learned factors are also able to model the environment. Our strategic design, which engages the AC-β-VAE and an RL policy network in sharing a backbone structure, overcomes the blackbox issue and supports the transparency of deep RL. We also empirically show that the behavior of our agents can further be governed with human enforcements.

(a) feed flow diagram
(b) backward flow diagram
Figure 1: The structure and flow diagrams of the proposed AC-β-VAE for a transparent policy network. The proposed network requires training samples of MDP tuples from RL environments that consist of (s_t, a_t, r_t, s_{t+1}), where s_t, a_t and r_t are respectively the state, action and reward at time step t. The action-conditional decoder encourages the input features of the policy network to be disentangled and interpretable. Since the encoder + policy network can be seen as one big policy network that takes raw states as inputs, its inner intentions in selecting actions for a desired next state can thus be explained visually through the outputs of the decoder.

Related Work

Deep learning methods are praised for their unconstrained pattern extraction, which yields better performance in many tasks than machines trained under human prior knowledge [Günel, Moore and Lu2011, Vanderbilt2012], but as stated earlier, the blackbox characteristic of DNNs can be precarious especially in the RL setting. One of the safety factors of AI development suggested in [Amodei et al.2016] is the avoidance of negative side effects when training an agent to complete a goal task with a strict reward function.

Attempts to open the blackbox of DNNs and to understand the inner system of neural networks have been made in many recent works [Lipson and Kurman2016, Zeiler and Fergus2014, Bojarski et al.2017, Greydanus et al.2017]. The inherent learning phenomena are analyzed in reverse by observing the resultant learned understructure. While the training progress is also analytically interpreted via information theory [Shwartz-Ziv and Tishby2017, Saxe et al.2018], it is still challenging to anticipate, before training, how and why high-level features in neural models are learned in a certain way. Since learning a disentangled representation encourages interpretability [Bengio, Courville, and Vincent2013, Higgins et al.2016b], it has previously been reported that features of convolutional neural networks (CNN) can also be learned in a visually explainable way [Zhang and Zhu2018] through disentangled representation learning.

Prospection of future states conditioned on current actions is meaningful to RL agents in many ways, and action-conditional (variational) autoencoders are learned to predict subsequent states in the works of [Ha and Schmidhuber2018, Oh et al.2015, Thomas et al.2017]. DARLA [Higgins et al.2017] utilizes disentangled latent representations for cross-domain zero-shot adaptation, aiming to prove its representational power in multiple similar but different environments. Our model may also look similar to conditional generative models such as Conditional Variational Autoencoders (CVAE) [Sohn, Lee, and Yan2015] and InfoGAN [Chen et al.2016], but these models are not directly applicable to RL domains.

Preliminary: β-VAE

Variational autoencoder (VAE) [Kingma and Welling2013] works as a generative model based on the distribution of training samples [Co-Reyes et al.2018, Babaeizadeh et al.2017]. VAE's goal is to learn the marginal likelihood of a sample x from a distribution parametrized by generative factors z. In doing so, a tractable proxy distribution q_φ(z|x) is used to estimate an intractable posterior p_θ(z|x), with two different parameter vectors φ and θ. The marginal likelihood of a data point can be defined as:

log p_θ(x) = D_KL( q_φ(z|x) || p_θ(z|x) ) + L(θ, φ; x)    (1)

Since the KL divergence term is non-negative, L(θ, φ; x) sets a variational lower bound for the likelihood, and the best approximation for p_θ(x) can be obtained by maximizing this lower bound: L_vae = E_{q_φ(z|x)}[ log p_θ(x|z) ] - D_KL( q_φ(z|x) || p(z) ). In practice, q_φ(z|x) and p_θ(x|z) are respectively an encoder and a decoder parameterized by deep neural networks, and the prior p(z) is usually set to follow a Gaussian distribution N(0, I). The gradients of the lower bound can be approximated using the reparametrization trick.

β-VAE [Higgins et al.2016b] extends this work and drives VAE to learn disentangled latent features, weighting the KL-divergence term of the VAE loss function (the negative of the lower bound) with a hyper-parameter β:

L_{β-vae} = -E_{q_φ(z|x)}[ log p_θ(x|z) ] + β · D_KL( q_φ(z|x) || p(z) )    (2)

When β is ideally selected and does not severely interfere with the reconstruction optimization, each latent factor of z is learned to be not only independent of the others, but also interpretable. This means the resultant features follow the physio-visual characteristics of our world and differ from conventional DNN features, which are not so human-friendly.
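As a concrete reference, the loss in Eq. (2) can be sketched in a few lines of PyTorch; the Bernoulli reconstruction term and the function names below are illustrative assumptions rather than the original implementation.

import torch
import torch.nn.functional as F

def beta_vae_loss(recon_logits, x, mu, log_var, beta=20.0):
    """Negative lower bound with the KL term weighted by beta (Eq. 2)."""
    # Reconstruction term: here a Bernoulli likelihood over pixels (an assumption).
    recon = F.binary_cross_entropy_with_logits(recon_logits, x, reduction="sum")
    # Analytic KL divergence between N(mu, sigma^2) and the standard normal prior.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + beta * kl

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (reparametrization trick)."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps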

The Proposed Model

Our proposed model is composed of two structures: a policy-gradient RL method and the action-conditional β-VAE (AC-β-VAE). As shown in Figure 1, both components are designed to strategically share the first layers of the encoding network so that the latent features of AC-β-VAE can also become the input of the policy network. This simple shared architecture enables human-level interpretations of the behaviors of deep RL methods.
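To make the shared-backbone design concrete, the following PyTorch sketch shows an encoder whose (μ, σ) output both parametrizes the AC-β-VAE latent distribution and, concatenated, serves as the input feature of the actor and critic heads; the layer sizes, the fully-connected backbone and the class name are assumptions for illustration, not the exact architecture used in this work.

import torch
import torch.nn as nn

class SharedEncoderPolicy(nn.Module):
    """Encoder shared by the policy network and the AC-beta-VAE decoder."""

    def __init__(self, state_dim, latent_dim, num_actions):
        super().__init__()
        # Shared backbone: raw state -> mean and log-variance of q(z|s).
        self.backbone = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU())
        self.mu_head = nn.Linear(256, latent_dim)
        self.log_var_head = nn.Linear(256, latent_dim)
        # Actor and critic heads read the concatenated [mu, sigma] feature.
        self.actor = nn.Linear(2 * latent_dim, num_actions)
        self.critic = nn.Linear(2 * latent_dim, 1)

    def forward(self, state):
        h = self.backbone(state)
        mu, log_var = self.mu_head(h), self.log_var_head(h)
        sigma = torch.exp(0.5 * log_var)
        feature = torch.cat([mu, sigma], dim=-1)   # interpretable DNN feature
        action_logits = self.actor(feature)        # policy distribution over actions
        value = self.critic(feature)               # state-value estimate
        return action_logits, value, mu, log_var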

Consider a reinforcement learning setting where an actor plays the role of learning a policy π_{θ_a} and selects an action a_t given a state s_t at time t, and there exists a critic that estimates the value V_{θ_c}(s_t) of states to lead the actor towards the optimal policy. Here, θ_a and θ_c respectively denote the network parameters of the actor and the critic. Training progresses in the direction of maximizing the objective based on cumulative rewards E[ Σ_t γ^t r_t ], where r_t is the instantaneous reward at time t and γ is a discount factor. The policy update objective function to maximize is defined as follows:

L(θ_a) = E[ log π_{θ_a}(a_t|s_t) · A(s_t, a_t) ]    (3)

Here, A(s_t, a_t) is an advantage function, which is defined as it is in the asynchronous advantage actor-critic method (A3C) [Mnih et al.2016]:

A(s_t, a_t) = Σ_{i=0}^{k-1} γ^i r_{t+i} + γ^k V_{θ_c}(s_{t+k}) - V_{θ_c}(s_t),

where k denotes the number of steps. We have used the update method of Advantage Actor-Critic (A2C) [Wu et al.2017], a synchronous and batched version of A3C, for the Atari domain environments [Bellemare et al.2013]. Proximal Policy Optimization (PPO) [Schulman and Klimov2017] is also used for our experiments in continuous control environments; it reformulates the update criterion with a clipping objective constraint of the form:

L^CLIP(θ_a) = E[ min( ρ(θ_a)·A, clip(ρ(θ_a), 1-ε, 1+ε)·A ) ],  where ρ(θ_a) = π_{θ_a}(a|s) / π_{θ_a,old}(a|s)    (4)

Here, the subscript t for s, a and A is omitted for brevity.
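For reference, a minimal PyTorch sketch of the n-step advantage and the clipped surrogate in Eq. (4); the helper names and tensor conventions are assumptions, not code from this work.

import torch

def n_step_advantage(rewards, values, last_value, gamma=0.99):
    """A2C/A3C-style advantage: n-step return minus the critic's value estimate.

    rewards    : list of floats collected over the rollout
    values     : 1-D tensor of V(s_i) for the same steps
    last_value : bootstrap value V(s_{t+k}), 0.0 if the episode terminated
    """
    returns, ret = [], last_value
    for r in reversed(rewards):
        ret = r + gamma * ret
        returns.append(ret)
    returns = torch.tensor(list(reversed(returns)))
    return returns - values

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective of Eq. (4), negated so it can be minimized."""
    ratio = torch.exp(log_probs_new - log_probs_old)   # probability ratio rho(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()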

Action-Conditional β-VAE (AC-β-VAE)

As shown in Fig. 1, given an environment, the policy network combined with the encoder produces rollouts of typical Markov tuples that consist of (s_t, a_t, r_t, s_{t+1}). A raw state s_t feeds into the encoder model and gets encoded into a representation of twice the dimension K of the latent space. Since the policy network and AC-β-VAE share the parameters up to this encoding process, this representation is the DNN feature that is input to the policy network while also representing a concatenated form of the mean and standard deviation vectors [μ_t; σ_t]. These vectors are reparametrized into a posterior variable z_t through the AC-β-VAE pipeline. The output of the encoder feed-flows into the policy network to output an action a_t so that the RL environment responds accordingly. The action vector is then concatenated with a vector of zeros, up to the length K of the latent vector, to create what we call an action-mapping vector m_t. An element-wise sum of the latent variable z_t and the action-mapping vector m_t is performed in order to map action-controllable factors into the latent vector. This causes the sampled latent variable to be constrained by the probability of actions. The resultant vector is fed into the decoder network to predict the next state ŝ_{t+1}. The prediction is then compared with the real state s_{t+1} given by the environment after the action is taken. For an MDP tuple collected at time t, the loss of AC-β-VAE is computed with the following loss function:

L_{AC-β-VAE} = -E_{q_φ(z_t|s_t)}[ log p_θ(s_{t+1} | z_t + m_t) ] + β · D_KL( q_φ(z_t|s_t) || p(z) )    (5)
Initialize encoder q_φ and decoder p_θ
Initialize critic V_{θ_c}, actor π_{θ_a} and state s_t
while not stop-criterion do
     t_start ← t
     repeat
         Take an action a_t with policy π_{θ_a}(a_t|s_t)
         Receive new state s_{t+1} and reward r_t; t ← t + 1
     until t - t_start = t_max or s_t is terminal
     for i ∈ {t_start, …, t-1} do
         Compute the return R_i and advantage A_i (for A2C or PPO)
         Sample z_i ~ q_φ(z|s_i) and create the action-mapping vector m_i
         Predict ŝ_{i+1} = p_θ(z_i + m_i)
         Compute L_rl and L_{AC-β-VAE}
         Update encoder, actor and decoder based on:
             ∇( L_rl + λ · L_{AC-β-VAE} )
         Update critic by minimizing the loss:
             ( R_i - V_{θ_c}(s_i) )²
Algorithm 1: AC-β-VAE with an actor-critic policy network

As one can see, the AC-β-VAE model can be trained either simultaneously with the policy network or separately; all our experiments are performed with the former because it is more practical. At each update iteration, the total objective function value is calculated as the weighted sum of the objective function values from both models:

L_total = L_rl + λ · L_{AC-β-VAE}    (6)

where λ is the weight balance parameter. Since exploration based on the error between generated outputs and the ground truths has already been proven to enhance training in many RL-related works [Oh et al.2015, Ha and Schmidhuber2018, Tang et al.2017], our model rather focuses on the feasible training of a transparent neural policy network and on modeling the self-efficacy of agents, not on RL performance improvement. We thus choose a relatively small-valued λ so as not to confuse the policy network too much. A basic pseudo-code for the training scenario of our proposed structure is provided in Algorithm 1.
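A single joint update corresponding to Eq. (5) and Eq. (6) might look as follows in PyTorch; it assumes the SharedEncoderPolicy sketched earlier, a separate decoder module, one-hot actions, and a Gaussian (MSE) reconstruction term, all of which are illustrative assumptions rather than the exact training code of this work.

import torch
import torch.nn.functional as F

def joint_update_step(model, decoder, optimizer, s_t, a_t, s_next,
                      policy_loss, beta=10.0, lam=0.001):
    """One combined update of encoder, actor and decoder (Eq. 6).

    s_t, s_next : current and next raw states (batched tensors)
    a_t         : one-hot action vectors of size N <= latent_dim
    policy_loss : the A2C/PPO objective already computed for this batch
    """
    _, _, mu, log_var = model(s_t)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # reparametrized sample

    # Action-mapping vector: pad the action with zeros up to the latent size,
    # then add it element-wise so the first N latent dims absorb action effects.
    pad = z.size(-1) - a_t.size(-1)
    m_t = F.pad(a_t, (0, pad))
    s_pred = decoder(z + m_t)

    # AC-beta-VAE loss (Eq. 5): predict the next state under the action condition.
    recon = F.mse_loss(s_pred, s_next, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    ac_beta_vae_loss = recon + beta * kl

    total_loss = policy_loss + lam * ac_beta_vae_loss          # Eq. (6)
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return total_loss.item()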

Figure 2: The results of traversing the latent factors of our trained model on the Atari game environment Breakout, where some latent dimensions are mapped with action-variant features and the remaining dimensions condense other environmental factors. Since the factors in the latent vector of AC-β-VAE are defined by the vectors of mean μ and standard deviation σ, traversing the i-th value of the latent vector is almost equivalent to traversing μ_i. The input DNN feature of the policy network is the concatenation of μ and σ, and thus the next state resulting from the actions caused by a traversed factor can be probabilistically predicted from the visual consequence estimated by the decoder with the traversed z.

Mapping Action-Controllable Representations

The importance of learning visual influence was previously introduced and implicitly addressed in the works of [Oh et al.2015, Greydanus et al.2017]. Distinguishing directly-controllable objects from environment-dependent objects reflects much of how a human perceives the world. Taking the world of Atari games as an example, it is intuitive for a human agent to first figure out 'where I am in the screen' or 'what I am capable of with my actions' and then work their way towards achieving the highest score.

We show in the experiment section that AC-β-VAE allows RL agents not only to explicitly learn the visual influences of their actions, but also to learn them in a human-friendly way. By traversing each element of the latent vector, we are able to interpret which dimensions are mapped with actions and which are mapped with other environmental factors.

(a) β-VAE, β=1 (VAE) without any supervised action-mapping
(b) β-VAE, β=20 without any supervised action-mapping
(c) AC-β-VAE, β=1 (AC-VAE) with action-mappings
(d) AC-β-VAE, β=20 with action-mappings
Figure 3: The qualitative results of traversing latent factors in β-VAE with β=1 (VAE) and β=20 on state-only data tuples, and those of AC-β-VAE with β=1 (AC-VAE) and β=20 on (state, action, next state) data tuples in the dSprites environment. The action vectors are retrieved randomly as combinations of the four action components that respectively represent vertical, horizontal, rotational and scaling moves. The vertical axes represent the dimensions of the learned latent vector from top to bottom, while the horizontal axes represent the traversing values of z from left to right.

Transparent Policy Network

As mentioned earlier, the encoder and the policy network can be grouped as one bigger policy network model with an interpretable layer constrained by the AC-β-VAE loss. Unlike the high-level features of conventional DNN models, the inner features of our policy network are consequently interpretable.

Figure 2 illustrates how our policy network becomes transparent. If the action-dependent factors are disentangled in the latent vector and mapped into z_i, then so they are in μ_i and σ_i, because these define the sampling distribution of z_i, where i denotes the dimensional location. The variational sampling from the latent space of VAE is defined as z_i = μ_i + σ_i·ε, where ε is an auxiliary noise variable ε ~ N(0, 1). Since σ_i mainly controls the scale of the sampled z_i, traversing μ_i is almost equivalent to traversing z_i (refer to the original work of VAE for more insightful details). Thus, traversing μ_i encourages the policy network to cause actions as predictions of each traversing value of z_i.
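The traversal used throughout our figures can be sketched as follows: decoding the predicted next states while sweeping a single latent dimension of μ reveals what that factor controls. The traverse range and helper names are assumptions for illustration.

import torch

@torch.no_grad()
def traverse_latent_dim(model, decoder, state, dim, values=torch.linspace(-3, 3, 8)):
    """Decode predicted next states while sweeping one latent dimension.

    Traversing mu_i is (almost) equivalent to traversing z_i, since
    z_i = mu_i + sigma_i * eps and sigma_i only scales the sample.
    """
    _, _, mu, _ = model(state.unsqueeze(0))
    frames = []
    for v in values:
        z = mu.clone()
        z[0, dim] = v                 # overwrite the traversed dimension
        frames.append(decoder(z))     # predicted next state for this latent value
    return torch.cat(frames, dim=0)   # stack for side-by-side visualization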

Experiments

In this section, we present experimental results that demonstrate the following key aspects of our proposed method:

  • By mapping actions into the latent vector of β-VAE, action-controllable factors are disentangled from other environmental factors.

  • Governance over the optimized behavior of an agent can be made based on human-level interpretation of learned latent behavioral factors.

We have evaluated our method in three different types of environments: dSprites, Atari and MuJoCo.

dSprites Environment is an environment we design with the dSprites dataset [Matthey et al.2017]. It is originally a synthetic dataset of 2D shapes that gradually vary in five factors: shape (square, ellipse, heart), scale, orientation, and locations on the vertical and horizontal axes. The environment provides an image that embraces two shapes, one heart and one square. At each time step, the square is randomly scaled and oriented at a random location within the image. The heart-shaped object responds to one of the following discrete action inputs: move upward, downward, left, right, enlarge, shrink, rotate left and rotate right. All actions can be represented with a 4-dimensional action vector, each dimension of which is responsible for a unit of either vertical, horizontal, scaling or rotating movement.
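Since this dSprites environment is custom-built, the sketch below is only a hypothetical illustration of the interface described above, a controllable heart plus a randomly re-drawn square, and of the 4-dimensional action encoding; the renderer callable, state bookkeeping and step sizes are invented placeholders.

import numpy as np

class DSpritesEnvSketch:
    """Hypothetical interface for the two-sprite dSprites environment.

    Actions are encoded as a 4-d vector whose entries correspond to a unit of
    vertical, horizontal, scaling and rotational movement of the heart sprite.
    """

    ACTIONS = {
        "up":      np.array([+1, 0, 0, 0]), "down":   np.array([-1, 0, 0, 0]),
        "left":    np.array([0, -1, 0, 0]), "right":  np.array([0, +1, 0, 0]),
        "enlarge": np.array([0, 0, +1, 0]), "shrink": np.array([0, 0, -1, 0]),
        "rot_l":   np.array([0, 0, 0, +1]), "rot_r":  np.array([0, 0, 0, -1]),
    }

    def __init__(self, renderer):
        self.renderer = renderer   # callable that draws both sprites into an image
        self.heart = np.zeros(4)   # [y, x, scale, orientation] of the heart sprite

    def step(self, action_name):
        self.heart = self.heart + self.ACTIONS[action_name]   # move the controllable sprite
        square = np.random.uniform(size=4)                    # square is re-drawn at random
        next_state = self.renderer(self.heart, square)
        return next_state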

               VAE      β-VAE    AC-VAE   AC-β-VAE
               (β=1)    (β=20)   (β=1)    (β=20)
Avg. Disent.   0.120    0.133    0.233    0.390
Avg. Compl.    0.155    0.231    0.288    0.405
Table 1: The quantitative scores of disentanglement and completeness averaged over the dimensions of the latent vector learned with tuples from the dSprites environment.
Figure 4: The images are the estimated next states obtained by traversing the latent vector learned by AC-β-VAE with β=10 and λ=0.001 on the Atari game environment Breakout. Some latent factors are mapped with control factors such as movements of the paddle, while others are mapped with environmental factors such as the bricks and the scoreboard.

Atari Learning Environment is a software framework for assessing RL algorithms [Bellemare et al.2013]. Each frame is considered a state, and immediate rewards are given for every state transition. Our method is evaluated on the Atari game environments Breakout, Seaquest and Space-Invaders.

Figure 5: The images are the estimated next states obtained by traversing the latent vector learned by the A2C policy and AC-β-VAE with β=10 and λ=0.001 on the Atari game environments Seaquest (top) and Space-Invaders (bottom) with their respective discrete action spaces. Because of the small movement per action, we have enlarged the agent's avatar at a fixed location (red box).

(a) Walker2d
(b) Hopper
(c) Half-Cheetah
(d) Swimmer
Figure 6: Traverse results in the MuJoCo environments. The numbers in the boxes represent the standard deviations of each dimensional factor of the following state s_{t+1} when traversing the corresponding dimensional factor of the latent vector. Compared to the traverses of unmapped dimensions, the standard deviations of state values in the action-mapped dimensions are larger. Right arrows indicate the action-mapped dimensional locations.

MuJoCo Environment provides a physics-engine system for rigid-body simulations [E. Todorov and Tassa.2012, G. Brockman and Zaremba.2016]. Four robotics tasks are used in our experiments: Walker2d, Hopper, Half-Cheetah and Swimmer. A state vector represents the current status of the provided robotic figure, and the physical meaning of each of its factors is not specified.

As an encoder and a decoder, we have used convolutional neural networks (CNN) for the Atari environments and fully-connected MLP networks for the dSprites and MuJoCo environments. For the stochastic policy network, we have used a fully-connected MLP. PPO and A2C are applied to optimize the agent's policy for continuous control and discrete actions, respectively. Most hyper-parameters for the policy optimization are adopted from the works of [Schulman and Klimov2017, Wu et al.2017].

Disentanglement & Interpretability

To demonstrate the disentanglement performance and interpretability of the proposed algorithm, we have evaluated our method with transition tuples from the environments mentioned above.

Figure 3 and Table 1 illustrate the results for the dSprites environment. The metric framework suggested in [Eastwood and Williams2018] with random-forest regressors is applied to present the quantitative results of disentanglement and completeness. The tree depths are chosen for the lowest prediction error on the validation set. Since the metric system is designed around the disentanglement of the conventional VAE and β-VAE, our metric results may not be strictly comparable to the ones reported in the original work. In Fig. 3(a) and (b), the VAE and β-VAE seem to struggle to learn the pattern between the input and the output without any action constraint because of the randomness of the environmental square object, creating relatively blurred reconstructions. Such excessive generalization in the reconstructions results in low scores in both disentanglement and completeness, which indicates relatively low representational power to reproduce the ground-truth variant factors. Although the action conditions and the low-weighted KL term allow AC-VAE (β=1) to reconstruct sharper images, its relatively low disentanglement pressure results in lower metric scores compared to AC-β-VAE (β=20).

The results for the Atari environments in Figure 4 and Figure 5 show that the latent vector trained with our method models the given environment successfully. All the visited state space and learned behaviors can be projected by traversing each dimension of the latent vector. In that sense, our method can be considered an action-conditional generative model. Because AC-β-VAE can model the world in an egocentric perspective, all sequences of (state, action, next state) can be re-simulated. Such a trait may advance many RL methods, since similar models are used for exploration guidance [Tang et al.2017] or as imagery rehearsals for training [Ha and Schmidhuber2018].

Figure 6 shows the quantitative results of the traverse experiment on the MuJoCo environments. The numbers on the heat-map represent the standard deviations of each dimension's state values when traversing a dimensional factor. A higher standard deviation in the traverse of a specific dimension means that the traversed dimension has a larger effect on immediate state changes. Unlike the other environments, the MuJoCo environments have no environmental factors, and the current state is determined by the preceding movement of the given robotic body. As shown in Figure 6, since the standard deviation of the state values during the traverse of the dimensions that are mapped with actions is larger than that of the unmapped ones, we can see that the proposed algorithm is able to learn the disentangled action-dependent latent features. However, clear visual interpretation is more limited here than in the other environments, because the actions in MuJoCo are defined as continuous controls of torques for all joints, and it is conjectured that the movement of one joint affects the whole status of the body.

Controlling and Governing Efficacy

To verify the controllability of an agent's optimized efficacy, we traverse the latent factors over an environment-specific range during an episode on the learned network. In order to examine the environment output s_{t+1}, the traversal is conducted before reparameterization, i.e., on the μ vector. Furthermore, to get a clear view of the effect of the action-mapped dimensions of the latent vector, we set the values of all action-mapped dimensions to zero except for the traversed one, leaving the unmapped dimensions of the latent vector unchanged. These experiments are conducted on the MuJoCo environments, and the traverse range is set to [-5, 5] for every task.

The learned behavior in each latent dimension is also depicted in Figure 7. The traverses of the action-mapped dimensions of the latent factors yield behavioral movements that are combinations of multiple joint torque values. Unlike in the Atari environments with discrete action spaces, AC-β-VAE is constrained with various combinations of continuous action values during training simulations. When the policy network is optimized to accomplish a goal behavior such as walking, the action-mapped latent factors are learned to represent required behavioral components such as spreading or gathering the legs. Therefore, the μ vector represents variations in combinations of multiple joint movements, which allows for easy visual comprehension of the agent's optimized efficacy. This clearly shows the possibility of governance over an RL agent's efficacy with human-level interpretations through controlling the values of the μ vector in the latent space.

We have taken advantage of our transparent policy network and derived another behavior by controlling the learned behavioral components. An RL agent is able to learn with a reward function defined by human preference to perform, for example, a back-flip motion in the Hopper environment [Christiano et al.2017]. Showing a promising result of human enforcement on an RL model, our method enables governance over the agent's optimized behavior in the Half-Cheetah environment. After identifying the behavioral components by traversing each element of the μ vector, we are able to express another behavior of the agent, a back-flip in this case, as shown in Figure 8.


Figure 7: For the Half-Cheetah environment with continuous control, latent behavioral factors can be interpreted by traversing latent values over time. As a result, each action-mapped latent feature is responsible for a behavioral factor.

Figure 8: Example of governing the agent's movement in the MuJoCo Half-Cheetah environment. The robotic body is conducting a back-flip movement, which is induced by controlling the latent values of the first and second dimensions of the learned μ vector shown in Figure 7.

Conclusion

In this paper, we propose the action-conditional β-VAE (AC-β-VAE) which, for a given input state s_t at time t, predicts the next state s_{t+1} conditioned on an action a_t, sharing a backbone structure with a policy network during deep reinforcement learning. Our proposed model not only learns disentangled representations but also distinguishes action-mapped factors from uncontrollable factors by partially mapping control-dependent variant features into the latent vector. Since the policy network combined with the preceding encoder can be considered one bigger policy network that takes raw states as inputs, with AC-β-VAE we are able to build a transparent RL agent whose latent features are interpretable to humans, overcoming the conventional blackbox issue of deep RL. Such transparency allows human governance over the agent's optimized behavior through adjustments of the learned latent factors. We plan further studies on applications of the action-mapped latent vector.

References