In reinforcement learning (RL), tasks are defined as Markov decision processes (MDPs) with state and action spaces, transition models, and reward functions. An RL agent begins by executing an action in a state of the environment according to some policy. Based on the action choice, the environment responds by transitioning the agent to a new state and providing an instantaneous reward quantifying the quality of the executed action. This process repeats until a terminal condition is met. The goal of the agent is to learn an optimal action-selection rule that maximises the total expected return from any initial state. Though minimally supervised, this framework has become a powerful tool for decision making under uncertainty, with applications ranging from computer games mnih2015human; silver2017mastering to neural architecture search zoph2016neural, robotics Ammar:2014:OML:3044805.3045027; nagabandi2018neural; Peters:2008:NA:1352927.1352986, and multi-agent systems li2019efficient; ijcai2019-85; wen2018probabilistic; pmlr-v80-yang18d.
Common RL algorithms, however, only consider observations from one view of the state space Kaelbling:1998:PAP:1643275.1643301. Such an assumption can become too restrictive in real-life scenarios. To illustrate, imagine designing an autonomous vehicle that is equipped with multiple sensors. For such an agent to execute safe actions, data fusion is necessary so as to account for all available information about the world. Consequently, agent policies must now be conditioned on varying state descriptions, which in turn leads to challenging representation and learning questions. In fact, acquiring good-enough policies in the multi-view setting is harder than in standard RL because of the increased sample complexity needed to reason about varying views. If solved, however, multi-view RL will allow for data fusion, fault tolerance to sensor deterioration, and policy generalisation across domains.
Numerous algorithms for multi-view learning in supervised tasks have been proposed. Interested readers are referred to the surveys in li2018survey; xu2013survey; zhao2017multi, and references therein, for a detailed exposition. Though abundant in supervised learning tasks, multi-view data fusion for decision making has gained less attention. In fact, our search revealed only a few papers attempting to target this exact problem. A notable algorithm is the work in chen2017double, which proposed a double-task deep Q-network for multi-view reinforcement learning. We believe the authors handle the multi-view decision problem indirectly by carrying over innovations from computer vision: they augment different camera angles into one state and feed it to a standard deep Q-network. Our attempt, on the other hand, aims to resolve multi-view decision making directly by learning joint models for autonomous planning. As a by-product of our method, we arrive at a learning pipeline that allows for improvements in both policy learning and feature representations.
Closest to our work are algorithms from the domain of partially observable Markov decision processes (POMDPs) Kaelbling:1998:PAP:1643275.1643301. There, the environment’s state is also hidden, and the agent is equipped with a sensor (i.e., an observation function) with which it builds a belief over the latent state in order to execute and learn its policy. Although most algorithms consider one observation function (i.e., one view), one can generalise the definition of POMDPs to multiple types of observations by allowing for a joint observation space, and consequently a joint observation function, across varying views. Though possible in principle, we are not aware of any algorithm from the POMDP literature targeting this scenario. Our problem, in fact, can become substantially harder from both the representation and learning perspectives. To illustrate, consider a POMDP with only two views, the first being images and the second a low-dimensional time series corresponding to, say, joint angles and angular velocities. Following the idea of constructing a joint observation space, one would be looking for a map from the history of observations (and potentially actions) to a new observation at subsequent time steps. Such an observation can be an image, a time series, or both. In this setting, constructing a joint observation mapping is difficult due to the varying nature of the outputs and their occurrences in the history. Due to the large sample and computational complexities involved in designing larger deep learners with varying output units, and switching mechanisms to differentiate between views, we instead advocate a more grounded framework that draws inspiration from the well-established multi-view supervised learning literature, which leads us to multi-view reinforcement learning. (Footnote 1: We note that other titles for this work are also possible, e.g., Multi-View POMDPs, or Multi-Observation POMDPs. The emphasis is on the fact that little literature has considered RL with varying sensory observations.)
Our framework for multi-view reinforcement learning also shares similarities with multi-task reinforcement learning, a framework that has gained considerable attention Ammar:2014:OML:3044805.3045027; duan2016rl; finn2017model; frans2017meta; parisotto2015actor; teh2017distral; wang2016learning. In particular, one can view multi-view RL as a case of multi-task RL where tasks share action spaces, transition models, and reward functions, but differ in their state representations. Though a bridge between multi-view and multi-task RL can be constructed, it is worth mentioning that most works on multi-task RL consider the same observation and action spaces but varying dynamics and/or reward functions NIPS2017_7036. As such, these methods fail to handle fusion and generalisation across feature representations that vary between domains. A notable exception is the work in Ammar:2015:ACK:2832581.2832715, which transfers knowledge between task groups with varying state and/or action spaces. Though successful, this method assumes a model-free setting with linear policies. As such, it fails to efficiently handle high-dimensional environments, which require deep network policies.
In this paper, we contribute by introducing a framework for multi-view reinforcement learning that generalizes partially observable Markov decision processes (POMDPs) to ones that exhibit multiple observation models. We first derive a straightforward solution based on state augmentation that demonstrates superior performance on various benchmarks when compared to state-of-the-art proximal policy optimization (PPO) in multi-view scenarios. We then provide an algorithm for multi-view model learning, and propose a solution capable of transferring policies learned from one view to another. This, in turn, reduces the number of training samples needed by around two orders of magnitude in most tasks when compared to PPO. Finally, in another set of experiments, we show that our algorithm outperforms PILCO deisenroth2011pilco in terms of sample complexity, especially on high-dimensional and non-smooth systems, e.g., Hopper. (Footnote 2: It is worth noting that our experiments reveal that PILCO, for instance, faces challenges when dealing with high-dimensional systems, e.g., Hopper. In future, we plan to investigate latent space Gaussian process models with the aim of carrying sample-efficient algorithms over to high-dimensional systems.) Our contributions can be summarised as:
formalising multi-view reinforcement learning as a generalization of POMDPs;
proposing two solutions based on state augmentation and policy transfer to multi-view RL;
demonstrating improvement in policy against state-of-the-art methods on a variety of control benchmarks.
2 Multi-View Reinforcement Learning
This section introduces multi-view reinforcement learning by extending MDPs to allow for multiple state representations and observation densities. We show that POMDPs can be seen as a special case of our framework, where inference about latent state only uses one view of the state-space.
2.1 Multi-View Markov Decision Processes
To allow agents to reason about varying state representations, we generalize the notion of an MDP to a multi-view MDP, which is defined by the tuple . Here, and , and represent the standard MDP’s state and action spaces, as well as the transition model, respectively.
Contrary to MDPs, multi-view MDPs incorporate additional components responsible for formally representing multiple views belonging to different observation spaces. We use and to denote the observation space and observation-model of each sensor . At some time step , the agent executes an action according to its policy that conditions on a history of heterogeneous observations (and potentially actions) , where for and . (Footnote 3: Please note it is possible to incorporate the actions in by simply introducing additional action variables.) We define to represent the history of observations so-far, and introduce a superscript to denote the type of view encountered at the time instance. As per our definition, we allow for different types of observations, therefore, is allowed to vary from one to . Depending on the selected action, the environment then transitions to a successor state , which is not directly observable by the agent. On the contrary, the agent only receives a successor view with .
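To make the interaction loop above concrete, the following is a minimal Python sketch of a multi-view environment. All names (`MultiViewEnv`, `make_toy_env`, `rollout`) and the toy one-dimensional dynamics are hypothetical. The rule for choosing which sensor emits at each step (uniform at random here) is also an assumption, since the definition only requires that the view index may vary over the available sensors.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

State = List[float]
Observation = List[float]


@dataclass
class MultiViewEnv:
    """Hidden-state environment that emits one of several views per step."""
    transition: Callable[[State, float], State]   # (s, a) -> s'
    reward: Callable[[State, float], float]       # computed on the hidden state
    observation_models: Sequence[Callable[[State], Observation]]  # one per view
    state: State                                  # never returned to the agent

    def step(self, action: float, rng: random.Random) -> Tuple[int, Observation, float]:
        r = self.reward(self.state, action)
        self.state = self.transition(self.state, action)
        # Assumption: the emitting sensor is drawn uniformly at random.
        view = rng.randrange(len(self.observation_models))
        return view, self.observation_models[view](self.state), r


def make_toy_env() -> MultiViewEnv:
    """Toy 1-D system: view 0 is the raw state, view 1 is rescaled and padded."""
    return MultiViewEnv(
        transition=lambda s, a: [s[0] + a],
        reward=lambda s, a: -abs(s[0]),
        observation_models=[lambda s: [s[0]], lambda s: [2.0 * s[0], 0.0]],
        state=[0.0],
    )


def rollout(env: MultiViewEnv, actions: Sequence[float], rng: random.Random):
    """Collect the heterogeneous history of (view index, observation, reward)."""
    return [env.step(a, rng) for a in actions]
```

The agent only ever sees the tagged observations in the returned history; the hidden state stays inside the environment object, mirroring the definition above.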
2.2 Multi-View Reinforcement Learning Objective
As in standard literature, the goal is to learn a policy that maximizes total expected return. To formalize such a goal, we introduce a multi-view trajectory , which augments standard POMDP trajectories with multi-view observations, i.e., , and consider finite horizon cumulative rewards computed on the real state (hidden to the agent) and the agent’s actions. With this, the goal of the agent is to determine a policy that maximizes the following objective:
where $\gamma$ is the discount factor. The last component needed to finalize our problem definition is the factorization of multi-view trajectory densities. Knowing that the trajectory density is defined jointly over observations, states, and actions, we write:
with being the initial state distribution. The generalization above arrives with additional sample and computational burdens, rendering current solutions to POMDPs impractical. Among various challenges, multi-view policy representations capable of handling varying sensory signals can become expensive to both learn and represent. That being said, we can still reason about such structures and follow a policy-gradient technique to learn the parameters of the network. However, a multi-view policy network needs to adhere to a crucial property, which can be regarded as a special case of representation fusion networks from multi-view representation learning li2018survey; xu2013survey; zhao2017multi. We give our derivation of general gradient update laws following model-free policy gradients in Sect. 3.1.
Contrary to standard POMDPs, our trajectory density is generalized to support multiple state views by requiring different observation models through time. Such a generalization allows us to advocate a more efficient model-based solver that enables cross-view policy transfer (i.e., conditioning one view policies on another) and few-shot RL, as we shall introduce in Sect. 3.2.
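As a compact summary of the objective and trajectory factorization described in this section, one possible way to write them is sketched below. The notation is assumed for illustration and is not taken from the original equations: $\tau$ is a multi-view trajectory, $\mathcal{H}_t$ the observation history, $i_t$ the view index at step $t$, $\mu_0$ the initial state distribution, $\mathcal{P}$ the transition model, $\mathcal{P}^{i}_{\mathrm{obs}}$ the observation model of view $i$, and $\mathcal{R}$ a reward defined on hidden states and actions.

```latex
% Sketch under assumed notation: discounted finite-horizon objective
\max_{\pi}\; \mathbb{E}_{\tau \sim p_{\pi}(\tau)}
  \Big[ \sum_{t=1}^{T} \gamma^{t}\, \mathcal{R}(s_t, a_t) \Big],

% Sketch under assumed notation: trajectory density over states,
% view-tagged observations, and actions
p_{\pi}(\tau) = \mu_0(s_1)
  \prod_{t=1}^{T} \mathcal{P}^{i_t}_{\mathrm{obs}}\big(o_t^{i_t} \mid s_t\big)\,
                  \pi\big(a_t \mid \mathcal{H}_t\big)
  \prod_{t=1}^{T-1} \mathcal{P}\big(s_{t+1} \mid s_t, a_t\big).
```

The only departure from a standard POMDP factorization in this sketch is that the observation model is indexed by the view $i_t$, which is allowed to change over time.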
3 Solution Methods
This section presents our model-free and model-based approaches to solving multi-view reinforcement learning. Given a policy network, our model-free solution derives a policy gradient theorem, which can be approximated using Monte Carlo and history-dependent baselines when updating model parameters. We then propose a model-based alternative that learns joint dynamical (and emission) models to allow for control in the latent space and cross-view policy transfer.
3.1 Model-Free Multi-View Reinforcement Learning through Observation Augmentation
The algorithm we employ for model-free multi-view reinforcement learning falls in the class of policy gradient algorithms. Given existing advances in Monte Carlo estimation, variance reduction methods (e.g., observation-based baselines), and the problem definition in Eq. (1), we can give the rule for updating the policy parameters as:
Please refer to the appendix for the detailed derivation. While a policy gradient algorithm can be implemented, the above update rule is oblivious to the availability of multiple views of the state space, closely resembling standard (one-view) POMDP scenarios. This only increases the number of samples required for training, as well as the variance of the gradient estimates. We thus introduce a straightforward model-free MVRL algorithm by leveraging fusion networks from multi-view representation learning. Specifically, we assume that corresponding observations from all views, i.e., observations that share the same latent state, are accessible during training. Although the parameter update rule is exactly the same as defined in Eq. (6), this method exploits the knowledge of shared dynamics across different views, and thus improves on independent model-free learners, i.e., learners that regard each view as a separate environment and learn a policy for each.
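The observation-augmentation idea can be illustrated with a short Python sketch. It assumes that fusion is plain concatenation of corresponding observations and that the Monte Carlo signal is the discounted reward-to-go minus a baseline; the function names are hypothetical, and the policy network and its gradient step are deliberately left out.

```python
from typing import List, Sequence


def fuse_views(views: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate corresponding observations from all views into one input."""
    fused: List[float] = []
    for view in views:
        fused.extend(view)
    return fused


def discounted_returns(rewards: Sequence[float], gamma: float) -> List[float]:
    """Monte Carlo reward-to-go G_t = r_t + gamma * G_{t+1}."""
    running, out = 0.0, []
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    out.reverse()
    return out


def advantages(rewards: Sequence[float], baselines: Sequence[float], gamma: float) -> List[float]:
    """Baseline-subtracted returns that would weight the log-policy gradients."""
    returns = discounted_returns(rewards, gamma)
    return [g - b for g, b in zip(returns, baselines)]
```

In such a setup the fused vector would replace the single-view observation as the policy input, while the update rule itself stays unchanged.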
3.2 Model-Based Multi-View Reinforcement Learning
We now propose a model-based approach that learns approximate transition models for multi-view RL allowing for policies that are simpler to learn and represent. Our learnt model can also be used for cross-view policy transfer (i.e., conditioning one view policies on another), few-shot RL, and typical model-based RL enabling policy improvements through back-propagation in learnt joint models.
3.2.1 Multi-View Model Learning
The purpose of model learning is to abstract a hidden joint model shared across varying views of the state space. For accurate predictions, we envision the generative model in Fig. (1). Here, observed random variables are denoted by for a time-step and , while represents latent variables that temporally evolve conditioned on applied actions. As multiple observations are allowed, our model generalises to the multi-view setting by supporting varying emission models depending on the nature of the observations. Crucially, we do not assume Markov transitions in the latent space, as we believe that reasoning about multiple views requires more than one-step history information.
To define our optimisation objective, we follow standard variational inference. Before deriving the variational lower bound, however, we first introduce additional notation to ease exposition. Recall that is the observation vector at time-step for view . (Footnote 4: Different from the model-free solver introduced in Sect. 3.1, we do not assume access to all views.) To account for action conditioning, we further augment with executed controls, leading to for , and for the last time-step . Analogous to standard latent-variable models, our goal is to learn latent transitions that maximize the marginal likelihood of observations. According to the graphical model in Fig. (1), observations are generated from latent variables which are temporally evolving. Hence, we regard observations as the result of a process in which latent states have been marginalized out:
where collects all multi-view observations and actions across time-steps. To devise an algorithm that reasons about latent dynamics, two components need to be better understood. The first relates to factoring our joint density, while the second to approximating multi-dimensional integrals. To factorize our joint density, we follow the modeling assumptions in Fig. (1) and write:
where in the last step we introduced to emphasise modeling parameters that need to be learnt, and to concatenate state and action histories back to .
Having dealt with density factorisation, another problem to circumvent is that of computing intractable multi-dimensional integrals. This can be achieved by introducing a variational distribution over latent variables, , which transforms integration into an optimization problem as
where we used the concavity of the logarithm and Jensen’s inequality in the second step of the derivation. We assume a mean-field decomposition for the variational distribution, with being the observation and action history. This leads to:
where $\mathrm{KL}(\cdot\,\|\,\cdot)$ denotes the Kullback–Leibler divergence between two distributions. Assuming shared variational parameters (e.g., one variational network), model learning can be formulated as:
Intuitively, Eq. (3) fits the model by maximizing multi-view observation likelihood, while being regularized through the KL-term. Clearly, this is similar to the standard evidence lower-bound with additional components related to handling multi-view types of observations.
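For orientation, a bound of the kind described above can be sketched as follows. The symbols are assumed for illustration and are not taken from the original equations: $X$ collects all observations and actions, $h_t$ is the observation and action history up to step $t$, $q_{\phi}$ is the mean-field variational distribution, and $p_{\theta}(s_t \mid h_{t-1})$ is a history-conditioned prior. The sketch also ignores the expectation over earlier latent states that enter through the history.

```latex
% Sketch under assumed notation: evidence lower bound with a KL regulariser
\log p_{\theta}(X) \;\ge\; \sum_{t=1}^{T} \Big(
    \mathbb{E}_{q_{\phi}(s_t \mid h_t)}\big[ \log p_{\theta}(o_t^{i_t} \mid s_t) \big]
    - \mathrm{KL}\big( q_{\phi}(s_t \mid h_t) \,\|\, p_{\theta}(s_t \mid h_{t-1}) \big)
\Big)

% Equivalent form, using -KL(q || p) = E_q[log p] + H[q]
= \sum_{t=1}^{T} \Big(
    \mathbb{E}_{q_{\phi}}\big[ \log p_{\theta}(o_t^{i_t} \mid s_t) \big]
    + \mathbb{E}_{q_{\phi}}\big[ \log p_{\theta}(s_t \mid h_{t-1}) \big]
    + \mathbb{H}\big[ q_{\phi}(s_t \mid h_t) \big]
\Big).
```

The second form separates a reconstruction term, a transition-prediction term, and an entropy term; the multi-view aspect enters only through the view-dependent emission model $p_{\theta}(o_t^{i_t} \mid s_t)$.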
3.2.2 Distribution Parameterisation and Implementation Details
To finalize our problem definition, choices for the modeling and variational distributions can ultimately be problem-dependent. To encode transitions beyond Markov assumptions, we use a memory-based model (e.g., a recurrent neural network) to serve as the history encoder and future predictor, i.e., . Introducing memory splits the model into stochastic and deterministic parts, where the deterministic part is the memory model , while the stochastic part is the conditional prior distribution on latent states , i.e., . We assume that this distribution is Gaussian with its mean and variance parameterised by a feed-forward neural network taking as inputs.
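The split into a deterministic memory and a stochastic Gaussian prior can be sketched in plain Python. This is an illustrative toy under stated assumptions: the class name `MemoryPrior` is hypothetical, a single tanh recurrent cell stands in for the memory model, and single linear layers stand in for the feed-forward networks that output the mean and log-variance.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]


def matvec(weights: List[Vector], x: Vector) -> Vector:
    """Dense matrix-vector product on nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class MemoryPrior:
    """Deterministic memory update plus a diagonal-Gaussian prior on latents."""

    def __init__(self, latent_dim: int, action_dim: int, hidden_dim: int, rng: random.Random):
        def init(rows: int, cols: int) -> List[Vector]:
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.W_mem = init(hidden_dim, hidden_dim + latent_dim + action_dim)
        self.W_mean = init(latent_dim, hidden_dim)
        self.W_logvar = init(latent_dim, hidden_dim)

    def update_memory(self, h_prev: Vector, s_prev: Vector, a_prev: Vector) -> Vector:
        # Deterministic part: fold previous latent and action into the memory.
        return [math.tanh(v) for v in matvec(self.W_mem, h_prev + s_prev + a_prev)]

    def prior(self, h: Vector) -> Tuple[Vector, Vector]:
        # Stochastic part: mean and log-variance of the Gaussian prior.
        return matvec(self.W_mean, h), matvec(self.W_logvar, h)

    def sample(self, h: Vector, rng: random.Random) -> Vector:
        mean, logvar = self.prior(h)
        return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mean, logvar)]
```

Because the memory carries information from all earlier steps, the prior on the next latent state is not restricted to a one-step Markov dependence.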
As for the observation model, the exact form is domain-specific depending on available observation types. In the case when our observation is a low-dimensional vector, we chose a Gaussian parameterisation with mean and variance output by a feed-forward network as above. When dealing with images, we parameterised the mean by a deconvolutional neural network goodfellow2014generative and kept an identity covariance. The variational distribution can thus, respectively, be parameterised by a feed-forward neural network and a convolutional neural network krizhevsky2012imagenet for these two types of views.
With the above assumptions, we can now derive the training loss used in our experiments. First, we rewrite Eq. (3) as
where denotes the entropy, and represents . From the first two terms in Eq. (4), we realise that the optimisation problem for multi-view model learning consists of two parts: 1) observation reconstruction, and 2) transition prediction. Observation reconstruction operates by: 1) inferring the latent state from the observation using the variational model, and 2) decoding to (an approximation of ). Transition prediction, on the other hand, operates by feeding the previous latent state and the previous action to predict the next latent state via the memory model. Both parts are optimized by maximizing the log-likelihood under a Gaussian distribution with unit variance. This equates to minimising the mean squared error between model outputs and the actual variable values:
where $\|\cdot\|$ is the Euclidean norm.
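The two mean-squared-error terms can be sketched in Python as below. The function names and the callables `encode`, `decode`, and `predict_next` are hypothetical placeholders for the variational, observation, and memory models. Taking the latent inferred from the next observation as the prediction target is an assumption of this sketch.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def squared_error(x: Vector, y: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def reconstruction_loss(
    encode: Callable[[Vector], Vector],
    decode: Callable[[Vector], Vector],
    observations: Sequence[Vector],
) -> float:
    """Mean squared error between each observation and its reconstruction."""
    errors = [squared_error(decode(encode(o)), o) for o in observations]
    return sum(errors) / len(errors)


def prediction_loss(
    encode: Callable[[Vector], Vector],
    predict_next: Callable[[Vector, float], Vector],
    transitions: Sequence[Tuple[Vector, float, Vector]],
) -> float:
    """Mean squared error between predicted and inferred next latent states."""
    errors = [
        squared_error(predict_next(encode(o_prev), a_prev), encode(o_next))
        for o_prev, a_prev, o_next in transitions
    ]
    return sum(errors) / len(errors)
```

Under a unit-variance Gaussian likelihood, minimising these quantities matches maximising the corresponding log-likelihood terms up to constants.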
Optimizing Eq. (4) also requires maximizing the entropy of the variational model . Intuitively, the variational model aims to increase the element-wise similarity of the latent state among corresponding observations gretton2007kernel. Thus, we represent the entropy term as:
where is the average value of the mean of the diagonal Gaussian representing for each training batch.
3.2.3 Policy Transfer and Few-Shot Reinforcement Learning
As introduced in Section 2.2, trajectory densities in MVRL generalise to support multiple state views by requiring different observation models through time. Such a generalization enables us to achieve cross-view policy transfer and few-shot RL, where we only require very little data from a specific view to train the multi-view model. This can then be used for action selection by: 1) inferring the corresponding latent state, and 2) feeding the latent state into the policy learned from another view with greater accessibility. Details can be found in Appendix A.
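The two-step action selection can be sketched as a small Python wrapper. The names `transfer_policy`, `encode_target_view`, and `latent_policy` are hypothetical; the sketch assumes the policy learned from the accessible view acts on latent states, so that only a view-specific encoder is needed for the new view.

```python
from typing import Callable, List

Vector = List[float]


def transfer_policy(
    encode_target_view: Callable[[Vector], Vector],
    latent_policy: Callable[[Vector], float],
) -> Callable[[Vector], float]:
    """Build a policy for a new view from its encoder and a latent-space policy."""

    def act(observation: Vector) -> float:
        latent = encode_target_view(observation)  # 1) infer the latent state
        return latent_policy(latent)              # 2) reuse the existing policy

    return act
```

For example, if two views differ only by a rescaling and padding of the same underlying quantity, the two wrapped policies agree on corresponding observations even though the latent policy was never trained on the second view.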
Concretely, our learned models should be able to reconstruct corresponding observations for views with shared underlying dynamics (latent state ). During model learning, we thus validate the variational and observation model by: 1) inferring the latent state from the first view’s observation , and 2) comparing the reconstructed corresponding observation from other views with the actual observation through calculating the transformation loss: Similarly, the memory model can be validated by: 1) reconstructing the predicted latent state of the first view using the observation model of other views to get , and 2) comparing with the actual observation , through calculating prediction transformation losses:
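The two validation checks described above can be sketched in Python. The squared-error form of the losses is an assumption, and `encode_src`, `decode_tgt`, and `predict_next` are hypothetical stand-ins for the first view's variational model, another view's observation model, and the memory model.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def _sq_err(x: Vector, y: Vector) -> float:
    return sum((a - b) ** 2 for a, b in zip(x, y))


def transformation_loss(
    encode_src: Callable[[Vector], Vector],
    decode_tgt: Callable[[Vector], Vector],
    paired_observations: Sequence[Tuple[Vector, Vector]],
) -> float:
    """Encode the source view, decode into the target view, compare to truth."""
    errors = [_sq_err(decode_tgt(encode_src(o_src)), o_tgt) for o_src, o_tgt in paired_observations]
    return sum(errors) / len(errors)


def prediction_transformation_loss(
    encode_src: Callable[[Vector], Vector],
    predict_next: Callable[[Vector, float], Vector],
    decode_tgt: Callable[[Vector], Vector],
    paired_transitions: Sequence[Tuple[Vector, float, Vector]],
) -> float:
    """Predict the next latent from the source view, decode it in the target view."""
    errors = [
        _sq_err(decode_tgt(predict_next(encode_src(o_src), a)), o_tgt_next)
        for o_src, a, o_tgt_next in paired_transitions
    ]
    return sum(errors) / len(errors)
```

Both quantities are used for validation only in this sketch; they require corresponding observations from the two views, which share the same latent state.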
4 Experiments
We evaluate our method on a variety of dynamical systems that vary in the dimensionality of their state representations. We consider both high- and low-dimensional problems to demonstrate the effectiveness of our model-free and model-based solutions. On the model-free side, we demonstrate performance against state-of-the-art methods, such as Proximal Policy Optimisation (PPO) schulman2017proximal. On the model-based side, we are interested in knowing whether our model successfully learns shared dynamics across varying views, and if these can then be utilised to enable efficient control.
We consider dynamical systems from the Atari suite, Roboschool schulman2017proximal, PyBullet coumans2019, and the Highway environments highway-env. We generate varying views either by transforming state representations, introducing noise, or by augmenting state variables with additional dummy components. When considering the game of Pong, we allow for varying observations by introducing various transformations to the image representing the state, e.g., rotation, flipping, and horizontal swapping. We then compare our multi-view model with a state-of-the-art modelling approach titled World Models ha2018world; see Section 4.1. Given successful modeling results, we proceed to demonstrate control in both model-free and model-based scenarios in Section 4.2. Our results demonstrate that although multi-view model-free algorithms can present advantages when compared to standard RL, multi-view model-based techniques are considerably more efficient in terms of sample complexity.
4.1 Modeling Results
To evaluate multi-view model learning we generated five views by varying state representations (i.e., images) in the Atari Pong environment. We kept dynamics unchanged and considered four sensory transformations of the observation frame . Namely, we considered varying views as: 1) transposed images , 2) horizontally-swapped images , 3) inverse images , and 4) mirror-symmetric images . Exact details on how these views have been generated can be found in Appendix C.1.1.
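The four image transformations can be illustrated on a frame stored as a nested list of pixel rows. The exact definitions are given in the paper's Appendix C.1.1, which is not reproduced here, so the interpretations below are assumptions: "horizontally swapped" is read as exchanging the left and right halves, "inverse" as intensity inversion against a maximum value of 255, and "mirror-symmetric" as a left-right flip.

```python
from typing import List

Frame = List[List[int]]


def transpose(frame: Frame) -> Frame:
    """Swap rows and columns."""
    return [list(column) for column in zip(*frame)]


def swap_halves(frame: Frame) -> Frame:
    """Exchange the left and right halves of every row (assumed meaning)."""
    half = len(frame[0]) // 2
    return [row[half:] + row[:half] for row in frame]


def invert(frame: Frame, max_value: int = 255) -> Frame:
    """Invert pixel intensities (assumed meaning of 'inverse images')."""
    return [[max_value - pixel for pixel in row] for row in frame]


def mirror(frame: Frame) -> Frame:
    """Flip every row left to right."""
    return [row[::-1] for row in frame]
```

Each function leaves the underlying game dynamics untouched and changes only how a frame is presented, which is what makes the resulting views share a latent state.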
For simplicity, we built pair-wise multi-view models between and one of the above four variants. Fig. (2) illustrates convergence results of multi-view prediction and transformation for different views of Atari Pong. Fig. (3) further investigates the learnt shared dynamics among different views ( and ). Fig. (3a) illustrates the converged , and the log standard deviation of the latent state variable, in and . Observe that a small group of elements (indexed as 14, 18, 21, 24 and 29) have relatively low variance in both views, thus keeping stable values across different observations.
We consider these elements as the critical part in representing the shared dynamics and define them as key elements. Clearly, learning a shared group of key elements across different views is the target in the multi-view model. Results in Fig. (3b) illustrate the distance between and for the multi-view model, demonstrating convergence. As the same group of elements are, in fact, close to key elements learnt by the multi-view model, we conclude that we can capture shared dynamics across different views. Further analysis of these key elements is also presented in the appendix.
In Fig. (3c), we also report the converged value of of World Models (WM) ha2018world under the multi-view setting. Although the world model can still infer the latent dynamics of both environments, the large difference between learnt dynamics demonstrates that varying views pose a challenge to world models; our algorithm, however, is capable of capturing such hidden shared dynamics.
4.2 Policy Learning Results
Given successful modelling results, in this section we present results on controlling systems across multiple views within an RL setting. Namely, we evaluate our multi-view RL approach on several high- and low-dimensional tasks. Our systems consisted of: 1) Cartpole (), where the goal is to balance a pole by applying left or right forces to a pivot, 2) Hopper (), where the focus is on locomotion such that the dynamical system hops forward as fast as possible, 3) RACECAR (), where the observation is the position (x,y) of a randomly placed ball in the camera frame and the reward is based on the distance to the ball, and 4) parking (), where an ego-vehicle must park in a given space with an appropriate heading (a goal-conditioned continuous control task). The evaluation metric we used was the average testing return across all views with respect to the number of training samples (number of interactions). We use the same setting in all experiments to generate multiple views. Namely, the original environment observation is used as the first view, and adding dummy dimensions (two dims) and large-scale noise (after observation normalization) to the original observation generates the second view. Such a setting allows us to understand whether our model can learn shared dynamics with mis-specified state representations, and whether such a model can then be used for control.
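The construction of the second view can be sketched in Python as follows. The function name, the Gaussian form of the noise, the `noise_scale` value, and the choice of zeros for the two dummy dimensions are all assumptions made for illustration; only the overall recipe (normalise, add noise, append two dummy dimensions) comes from the description above.

```python
import random
from typing import List, Sequence


def make_second_view(
    observation: Sequence[float],
    mean: Sequence[float],
    std: Sequence[float],
    rng: random.Random,
    noise_scale: float = 1.0,
    n_dummy: int = 2,
) -> List[float]:
    """Normalise the observation, add noise, and append dummy dimensions."""
    normalized = [(o - m) / s for o, m, s in zip(observation, mean, std)]
    # Assumption: additive zero-mean Gaussian noise applied after normalisation.
    noisy = [x + rng.gauss(0.0, noise_scale) for x in normalized]
    # Assumption: dummy dimensions carry no information and are set to zero.
    return noisy + [0.0] * n_dummy
```

The second view therefore has two more dimensions than the first and a deliberately degraded signal, which tests whether shared dynamics can still be recovered.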
For all experiments, we trained the multi-view model with a few samples gathered from all views and used the resulting model for policy transfer (MV-PT) between views during the test period. As the baseline, we chose state-of-the-art PPO schulman2017proximal, an algorithm based on the work in Peters:2008:NA:1352927.1352986, training separate models on different views and aggregating the results. The multi-view model-free (MV-MF) method is trained by augmenting PPO with concatenated observations. Relevant parameter values and implementation details are listed in Appendix C.2.
Fig. (4) shows the average testing return (the average testing success rate for the parking task) across all views. On the Cartpole and parking tasks, our MV-MF algorithm improves on strong model-free baselines such as PPO, showing the advantage of leveraging information from multiple views rather than training independently within each view. On the other hand, multi-view model-based techniques give the best performance on all tasks and reduce the number of samples needed by around two orders of magnitude in most tasks. This shows that MV-PT greatly reduces the number of training samples required to reach good performance.
We also conducted pure model-based RL experiments and compared our multi-view dynamic model against 1) PILCO deisenroth2011pilco, which can be regarded as a state-of-the-art model-based solution using a Gaussian process dynamics model, 2) PILCO with augmented states, i.e., a single PILCO model to approximate the data distribution from all views, and 3) a multilayer perceptron (MLP). We use the same planning algorithm for all model-based methods, e.g., hill climbing or Model Predictive Control (MPC) nagabandi2018neural, depending on the task at hand. Table 1 shows the results on the Cartpole environment, where we evaluate all methods by the number of interactions until success. Each training rollout has at most steps of interactions and we define success as reaching an average testing return of . Although the multi-view model performs slightly worse than PILCO on the Cartpole task, we found that model-based methods alone cannot perform well on tasks without suitable environment rewards. For example, the reward function in Hopper primarily encourages higher speed and lower energy consumption. Such high-level reward functions make it hard for model-based methods to succeed; therefore, the results of model-based algorithms on all tasks are lower than those obtained with specifically designed reward functions. Tailoring reward functions, though interesting, is out of the scope of this paper. MV-PT, on the other hand, outperforms the others significantly; see Fig. (4).
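As a generic illustration of planning with a learnt dynamics model, the sketch below implements random-shooting MPC in Python. It is not claimed to be the planner used in the experiments: the function name, the discrete candidate action set, and the availability of a reward function on model states are assumptions of this sketch.

```python
import random
from typing import Callable, Optional, Sequence


def random_shooting_mpc(
    state: float,
    model: Callable[[float, float], float],
    reward: Callable[[float, float], float],
    candidate_actions: Sequence[float],
    horizon: int,
    n_samples: int,
    rng: random.Random,
) -> Optional[float]:
    """Return the first action of the best sampled plan under the model."""
    best_return, best_first = float("-inf"), None
    for _ in range(n_samples):
        plan = [rng.choice(candidate_actions) for _ in range(horizon)]
        s, total = state, 0.0
        for a in plan:
            s = model(s, a)         # roll the plan forward in the learnt model
            total += reward(s, a)   # score the imagined successor state
        if total > best_return:
            best_return, best_first = total, plan[0]
    return best_first
```

Only the first action of the best plan is executed; the procedure is then repeated from the next state, which is what makes the scheme receding-horizon control.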
5 Related Work
Our work has extended model-based RL to the multi-view scenario. Model-based RL for POMDPs has been shown to be more effective than model-free alternatives in certain tasks watter2015embed; wahlstrom2015pixels; levine2014learning; leibfried2016deep. One of the classical combinations of model-based and model-free algorithms is Dyna-Q sutton1990integrated, which learns the policy from both the model and the environment by supplementing real-world on-policy experiences with simulated trajectories. However, using trajectories from a non-optimal or biased model can lead to learning a poor policy gu2016continuous. To model the world environment of Atari games, autoencoders have been used to predict the next observation and environment rewards leibfried2016deep. Some previous works schmidhuber2015learning; ha2018world; leibfried2016deep maintain a recurrent architecture to model the world using unsupervised learning and have shown its efficiency in helping RL agents in complex environments. Mb-Mf nagabandi2018neural is a framework bridging the gap between model-free and model-based methods by employing MPC to pre-train a policy within the learned model before training it with a standard model-free method. However, these models can only be applied to a single environment and need to be built from scratch for new environments. Although using a similar recurrent architecture, our work differs from the above works by learning the shared dynamics over multiple views. Also, many of the above advancements are orthogonal to our proposed approach, which could benefit from model ensembles, e.g., pre-training the model-free policy within the multi-view model when reward models are accessible.
Another related research area is multi-task learning (or meta-learning). To achieve multi-task learning, recurrent architectures duan2016rl; wang2016learning have also been used to learn to reinforcement learn by adapting to different MDPs automatically. These have been shown to be comparable to the UCB1 algorithm Auer2002 on bandit problems. Meta-learning shared hierarchies (MLSH) frans2017meta shares sub-policies among different tasks to achieve the goal in the training process, where high-hierarchy actions are obtained and reused in other tasks. The model-agnostic meta-learning algorithm (MAML) finn2017model minimizes the total error across multiple tasks by locally conducting few-shot learning to find the optimal parameters for both supervised learning and RL. Actor-mimic parisotto2015actor distills multiple pre-trained DQNs on different tasks into one single network to accelerate the learning process by initializing the learning model with learned parameters of the distilled network. To achieve promising results, these pre-trained DQNs have to be expert policies. Distral teh2017distral learns multiple tasks jointly and trains a shared policy as the "centroid" by distillation. Concurrently with our work, ADRL ijcai2019-277 has extended model-free RL to multi-view environments and proposed an attention-based policy aggregation method based on the Q-value of the actor-critic worker for each view. Most of the above approaches consider the problem within the model-free RL paradigm and focus on finding common structure in the policy space. However, model-free approaches require large amounts of data to explore in high-dimensional environments. In contrast, we explicitly maintain a multi-view dynamic model to capture the latent structures and dynamics of the environment, thus having more stable correlation signals.
Some algorithms from meta-learning have been adapted to the model-based setting al2017continuous; clavera2018learning. These focused on model adaptation when the model is incomplete, or the underlying MDPs are evolving. By taking the unlearnt model as a new task and continuously learning new structures, the agent can keep its model up to date. Different from these approaches, we focus on how to establish the common dynamics over compact representations of observations generated from different emission models.
In this paper, we proposed multi-view reinforcement learning as a generalisation of partially observable Markov decision processes that exhibit multiple observation densities. We derive model-free and model-based solutions to multi-view reinforcement learning, and demonstrate the effectiveness of our method on a variety of control benchmarks. Notably, we show that model-free multi-view methods through observation augmentation significantly reduce the number of training samples when compared to state-of-the-art reinforcement learning techniques, e.g., PPO, and demonstrate that model-based approaches through cross-view policy transfer yield extremely efficient learners that need significantly fewer training samples.
There are multiple interesting avenues for future work. First, we would like to apply our technique to real-world robotic systems such as self-driving cars, and second, use our method for transferring between varying views across domains.
Appendix A Model-based Multi-view Reinforcement Learning Algorithm
For completeness, we provide the Model-based Multi-view reinforcement learning algorithm below.
Appendix B Derivatives of Multi-view Model-free Policy Gradient
The type of algorithm we employ for model-free multi-view reinforcement learning falls in the class of policy gradient algorithms, which update the agent’s policy-defining parameters directly by estimating a gradient in the direction of higher reward. Given the problem definition in Equation 1, the gradient of the loss with respect to the network parameters can be computed as:
where we used the “likelihood-ratio” trick in the third step of the derivation. Now, one can proceed by taking a sample average of the gradient using Monte Carlo to update the policy parameters, suggesting the following update rule:
The Monte Carlo estimate above is a fast approximation of the gradient for the current policy: it converges to the true gradient at the standard Monte Carlo rate, i.e., with an error that shrinks as the inverse square root of the number of sampled trajectories, independently of the number of policy parameters. It is also worth noting that although the trajectory distribution depends on the unknown initial state distribution, unknown observation models, and hidden state dynamics, the gradient only includes policy components that can be controlled by the agent.
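The likelihood-ratio estimator discussed above can be illustrated on a toy problem. The following is a minimal sketch, assuming a single-step problem with a softmax policy over three actions and fixed rewards (the values of `theta` and `rewards` are hypothetical and not taken from our experiments); it is a simplification of the trajectory-level, history-conditioned setting considered in this appendix.

```python
import numpy as np

def softmax(theta):
    """Numerically stable softmax over action preferences."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def exact_gradient(theta, rewards):
    """Closed-form gradient of the expected reward J = sum_a pi(a) r(a)."""
    pi = softmax(theta)
    return pi * (rewards - pi @ rewards)

def likelihood_ratio_gradient(theta, rewards, num_samples, rng):
    """Monte Carlo estimate: sample mean of r(a) * grad log pi(a)."""
    pi = softmax(theta)
    actions = rng.choice(len(pi), size=num_samples, p=pi)
    # For a softmax policy, grad log pi(a) = onehot(a) - pi.
    score = np.eye(len(pi))[actions] - pi
    return (rewards[actions][:, None] * score).mean(axis=0)

rng = np.random.default_rng(0)
theta = np.array([0.2, -0.1, 0.4])    # hypothetical policy parameters
rewards = np.array([1.0, 0.0, 2.0])   # hypothetical per-action rewards
exact = exact_gradient(theta, rewards)

errors = {}
for n in (100, 10_000, 200_000):
    estimate = likelihood_ratio_gradient(theta, rewards, n, rng)
    errors[n] = float(np.linalg.norm(estimate - exact))
    print(f"{n:>7d} samples: estimation error {errors[n]:.4f}")
```

Only the policy's score function appears in the estimator, which mirrors the observation above that the gradient involves only components the agent controls.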
Though fast in convergence to the true gradient, Monte Carlo estimates suffer from high variance, e.g., it is easy to show that the variance grows linearly in the time horizon. Unfortunately, the naive approach of sampling big-enough batches is not an option in reinforcement learning due to the high cost of collecting samples, i.e., interacting with the environment. For this reason, the literature has focused on introducing baselines that aim to reduce variance peters2008reinforcement; williams1992simple. We follow a similar approach here, and introduce an observation-based baseline to reduce the variance of our gradient estimate. Our baseline takes observations and actions as inputs, and learns to predict future returns given the current policy. Even though this baseline is action-dependent, one can show it to be unbiased: the trick is to realize that the history at a given time step is independent of the action taken at that step, and depends only on the actions up to the previous step. Such a baseline can easily be represented as an LSTM recurrent neural network, as noted in wierstra2010recurrent. Consequently, we can rewrite our update rule as:
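The variance-reduction argument above can be checked exactly on a toy problem. The sketch below assumes a single-step softmax policy with a constant baseline equal to the expected reward, which is a deliberate simplification of the learned, history-dependent LSTM baseline described above; all numerical values are hypothetical. Because the action space is finite, the mean and variance of the single-sample estimator are computed by enumeration rather than by sampling.

```python
import numpy as np

def softmax(theta):
    """Numerically stable softmax over action preferences."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def estimator_moments(theta, rewards, baseline):
    """Exact mean and total variance of (r(a) - baseline) * grad log pi(a)."""
    pi = softmax(theta)
    score = np.eye(len(pi)) - pi                      # row a: grad log pi(a)
    values = (rewards - baseline)[:, None] * score    # row a: estimator value if a is drawn
    mean = pi @ values
    variance = pi @ values**2 - mean**2
    return mean, float(variance.sum())

theta = np.array([0.2, -0.1, 0.4])        # hypothetical policy parameters
rewards = np.array([10.0, 11.0, 12.0])    # hypothetical rewards with a large offset
expected_reward = float(softmax(theta) @ rewards)

mean_plain, var_plain = estimator_moments(theta, rewards, baseline=0.0)
mean_base, var_base = estimator_moments(theta, rewards, baseline=expected_reward)

print("same expectation:", np.allclose(mean_plain, mean_base))
print(f"variance without baseline: {var_plain:.3f}")
print(f"variance with baseline:    {var_base:.3f}")
```

The expectation is unchanged because the score function has zero mean under the policy, so subtracting a baseline that does not depend on the sampled action only affects the variance.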
Appendix C Experiment Details
C.1 Model learning details
C.1.1 View Settings of Atari Pong
As shown in Fig. 5, each variant corresponds to one transformation of the base view: (a) the transposed view, obtained from the base observation by a clockwise rotation followed by a horizontal flip; (b) the horizontally swapped view, generated by splitting the base observation frame vertically at the center and swapping the left part with the right part; (c) the inverse view, created by exchanging the background color with the paddle/ball color of the base view; and (d) the mirror-symmetric view, which reflects the base view like a mirror by flipping the observation horizontally.
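The four view transformations above can be sketched with NumPy as follows. This is an illustrative sketch, not our training code: it assumes a binary frame stored as a 2-D array with an even width, assumes that the rotation in (a) is by 90 degrees, and uses function names that are hypothetical.

```python
import numpy as np

def transposed_view(frame):
    """(a) Rotate 90 degrees clockwise, then flip horizontally (equals frame.T)."""
    return np.fliplr(np.rot90(frame, k=-1))

def swapped_view(frame):
    """(b) Split the frame vertically at the center and swap the two halves."""
    mid = frame.shape[1] // 2
    return np.concatenate([frame[:, mid:], frame[:, :mid]], axis=1)

def inverse_view(frame):
    """(c) Exchange background and foreground of a binary {0, 1} frame."""
    return 1 - frame

def mirror_view(frame):
    """(d) Reflect the frame left to right, like a mirror."""
    return np.fliplr(frame)

# Tiny stand-in for a binarised Pong frame (2 rows, 4 columns).
frame = np.array([[1, 0, 0, 0],
                  [0, 0, 1, 0]])

views = {
    "transposed": transposed_view(frame),
    "swapped": swapped_view(frame),
    "inverse": inverse_view(frame),
    "mirror": mirror_view(frame),
}
for name, view in views.items():
    print(name, view.tolist())
```

Each transformation is a fixed, invertible rearrangement of pixels, so every view carries the same underlying state information under a different observation model.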
C.1.2 Multi-view Model Setting
Different from the original Atari Pong observations, we (1) transform each frame to a binary matrix; (2) remove the scoreboard; and (3) resize each frame to a fixed size to serve as the observation of the base view. The action space is formed by all six available discrete actions of the original Atari environment. The observation model adopts the same architecture as a typical VAE. The memory model is an LSTM connected to the same output layer of the observation model. The batch size for each task and the LSTM sequence length are held fixed. We alternate the training of the multi-view model between minimizing the prediction loss and the reconstruction loss, with each training iteration consisting of a fixed number of prediction iterations and a fixed number of reconstruction iterations, since adjusting the observation model to fit the shared dynamics affects the model’s reconstruction ability. We explore training the shared dynamics on two views with corresponding and non-corresponding inputs, i.e., with or without using corresponding states from different views as training data, to verify the performance of multi-view models.
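The three preprocessing steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions: the threshold, the number of scoreboard rows, and the output size are hypothetical parameters, binarisation is done by simple thresholding of a grayscale frame, and resizing uses nearest-neighbour sampling.

```python
import numpy as np

def preprocess(frame, scoreboard_rows, out_size, threshold):
    """Binarise a grayscale frame, crop the scoreboard, and resize it."""
    # (1) Binarise: pixels brighter than the threshold become foreground.
    binary = (frame > threshold).astype(np.uint8)
    # (2) Remove the scoreboard, assumed to occupy the top rows.
    cropped = binary[scoreboard_rows:]
    # (3) Nearest-neighbour resize by sampling evenly spaced rows and columns.
    height, width = cropped.shape
    out_h, out_w = out_size
    rows = np.arange(out_h) * height // out_h
    cols = np.arange(out_w) * width // out_w
    return cropped[np.ix_(rows, cols)]

# Synthetic 8x8 frame: a bright scoreboard band and one bright 2x2 object.
frame = np.zeros((8, 8))
frame[0:2, :] = 1.0
frame[2:4, 0:2] = 1.0

observation = preprocess(frame, scoreboard_rows=2, out_size=(3, 4), threshold=0.5)
print(observation)
```

In this toy frame the scoreboard band disappears and only the object remains in the resized binary observation.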
To collect training data covering most dynamics of the Pong environment, we use an agent with a random policy to play the game for a fixed number of episodes of fixed length. At each training step, we randomly sample fixed-length trajectories from the dataset as the training data for the base view, and transform these samples into the corresponding observations as the training data for the other view, thus explicitly making the training inputs of different views share the same transition dynamics.
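The sampling scheme above, in which both views are trained on the same underlying transitions, can be sketched as follows. The array layout, the function name `sample_paired_batch`, and all sizes are assumptions made for illustration; the random arrays merely stand in for frames and actions collected by a random policy.

```python
import numpy as np

def sample_paired_batch(observations, actions, batch_size, seq_len, view_fn, rng):
    """Sample sub-trajectories and render them in a second view.

    observations: (num_episodes, episode_len, H, W) frames of the base view
    actions:      (num_episodes, episode_len) discrete actions
    view_fn:      maps one base-view frame to the corresponding second-view frame
    """
    num_episodes, episode_len = actions.shape
    episodes = rng.integers(0, num_episodes, size=batch_size)
    starts = rng.integers(0, episode_len - seq_len + 1, size=batch_size)
    base = np.stack([observations[e, s:s + seq_len] for e, s in zip(episodes, starts)])
    acts = np.stack([actions[e, s:s + seq_len] for e, s in zip(episodes, starts)])
    # Transform every sampled frame, so both views share the same transitions.
    other = np.array([[view_fn(f) for f in trajectory] for trajectory in base])
    return base, other, acts

rng = np.random.default_rng(0)
observations = rng.integers(0, 2, size=(5, 20, 4, 6))   # stand-in binary frames
actions = rng.integers(0, 6, size=(5, 20))               # six discrete actions

base, other, acts = sample_paired_batch(
    observations, actions, batch_size=3, seq_len=7, view_fn=np.fliplr, rng=rng
)
print(base.shape, other.shape, acts.shape)
```

Here `np.fliplr` plays the role of the mirror-symmetric view; any of the other per-frame transformations could be passed as `view_fn` instead.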
C.1.3 Additional Experiment Results
To further validate the importance of key elements in extracting the underlying dynamics, we show the weights of the reconstruction network connected to the latent representation in Fig. (a). As the sum of absolute weights connected to the key elements is much larger than for the others, changes in the key elements have a greater influence on the reconstruction output, illustrating their significance in the latent representation.
We then compute the gradients of the output with respect to the latent representation to observe which parts of it contribute most to the visual stimuli. As shown in Fig. (b), the mean absolute gradients of the key elements are significantly larger, while the other elements have nearly zero gradients (no contribution to the output). Consequently, the shared dynamics are mainly expressed by the key elements.
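The two diagnostics above (summed absolute weights and mean absolute gradients per latent element) can be sketched on a synthetic decoder. This is an illustration only: the one-layer sigmoid decoder, its sizes, and the choice of latent elements 1 and 5 as "key elements" are hypothetical and do not correspond to our trained model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def weight_importance(W):
    """Sum of absolute decoder weights attached to each latent element."""
    return np.abs(W).sum(axis=0)

def gradient_importance(W, b, latents):
    """Mean absolute gradient of the decoder output w.r.t. each latent element."""
    s = sigmoid(latents @ W.T + b)                        # (batch, D)
    # Jacobian of sigmoid(W z + b) w.r.t. z is diag(s * (1 - s)) @ W.
    jacobian = (s * (1.0 - s))[:, :, None] * W[None, :, :]  # (batch, D, Z)
    return np.abs(jacobian).mean(axis=(0, 1))

def top_k(scores, k):
    """Indices of the k largest scores, in ascending index order."""
    return sorted(int(i) for i in np.argsort(scores)[-k:])

rng = np.random.default_rng(0)
num_outputs, num_latents, key_elements = 64, 8, [1, 5]
W = 0.01 * rng.normal(size=(num_outputs, num_latents))    # weak weights everywhere
W[:, key_elements] = rng.normal(size=(num_outputs, len(key_elements)))  # strong for key elements
b = np.zeros(num_outputs)
latents = rng.normal(size=(32, num_latents))

by_weight = top_k(weight_importance(W), k=2)
by_gradient = top_k(gradient_importance(W, b, latents), k=2)
print("key elements by weight:  ", by_weight)
print("key elements by gradient:", by_gradient)
```

In this synthetic setting both diagnostics single out the same latent elements, which is the kind of agreement the two figures are meant to convey.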
|GAE parameter (λ)|
|Number of actors|
|Cycles to collect samples|
|Training batch size|
|Sample batch size|
|Actor learning rate|
|Critic learning rate|
C.2 Policy learning details
We use the same structure for the multi-view model as described in Sect. C.1. For all environments, we generate 100 initial roll-out trajectories using random policies (except for cartpole, where we only generate 20 roll-outs). We use PPO as the model-free policy learning algorithm for MV-PT and MV-MF in the cartpole, hopper and RACECAR tasks, and list the hyperparameters in Table 2. For the parking environment, we use DDPG lillicrap2015continuous with hindsight experience replay andrychowicz2017hindsight, and list the hyperparameters in Table 3.