1 Introduction
In reinforcement learning (RL), tasks are defined as Markov decision processes (MDPs) with state and action spaces, transition models, and reward functions. An RL agent interacts with its environment by executing an action in the current state according to some policy. Based on the action choice, the environment responds by transitioning the agent to a new state and providing an instantaneous reward quantifying the quality of the executed action. This process repeats until a terminal condition is met. The goal of the agent is to learn an optimal action-selection rule that maximises the total expected return from any initial state. Though minimally supervised, this framework has become a profound tool for decision making under uncertainty, with applications ranging from computer games [mnih2015human; silver2017mastering] to neural architecture search [zoph2016neural], robotics [Ammar:2014:OML:3044805.3045027; nagabandi2018neural; Peters:2008:NA:1352927.1352986], and multi-agent systems [li2019efficient; ijcai201985; wen2018probabilistic; pmlrv80yang18d].
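To make the interaction loop concrete, the following minimal sketch implements the execute-transition-reward cycle described above on a toy two-state chain; the environment, policy, and reward values are illustrative assumptions, not part of any benchmark used in this paper.

```python
import numpy as np

class ChainEnv:
    """Toy chain: action 1 moves right toward a rewarding terminal state."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        if action == 1:
            self.state += 1
        reward = 1.0 if self.state == 2 else 0.0  # instantaneous reward
        done = self.state == 2                    # terminal condition
        return self.state, reward, done

def rollout(env, policy, max_steps=10):
    """Execute the policy until termination; return the cumulative reward."""
    total, done = 0.0, False
    state = env.state
    for _ in range(max_steps):
        state, reward, done = env.step(policy(state))
        total += reward
        if done:
            break
    return total

ret = rollout(ChainEnv(), policy=lambda s: 1)  # always move right
```

Under this always-move-right policy, the rollout terminates after two steps with a return of 1.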
Common RL algorithms, however, only consider observations from one view of the state space [Kaelbling:1998:PAP:1643275.1643301]. Such an assumption can become too restrictive in real-life scenarios. To illustrate, imagine designing an autonomous vehicle that is equipped with multiple sensors. For such an agent to execute safe actions, data fusion is necessary so as to account for all available information about the world. Consequently, agent policies now have to be conditioned on varying state descriptions, which, in turn, leads to challenging representation and learning questions. In fact, acquiring good-enough policies in the multi-view setting is more complex than in standard RL due to the increased sample complexity needed to reason about varying views. If solved, however, multi-view RL will allow for data fusion, fault tolerance to sensor deterioration, and policy generalisation across domains.
Numerous algorithms for multi-view learning in supervised tasks have been proposed. Interested readers are referred to the surveys [li2018survey; xu2013survey; zhao2017multi], and references therein, for a detailed exposition. Though abundant in supervised learning tasks, multi-view data fusion for decision making has gained less attention. In fact, our search revealed only a few papers attempting to target this exact problem. A notable algorithm is the work in [chen2017double], which proposed a double-task deep Q-network for multi-view reinforcement learning. We believe the authors handle the multi-view decision problem indirectly by carrying over innovations from computer vision: they augment cameras at different angles into one state and feed it to a standard deep Q-network. Our attempt, on the other hand, aims to resolve multi-view decision making directly by learning joint models for autonomous planning. As a by-product of our method, we arrive at a learning pipeline that allows for improvements in both policy learning and feature representations.
Closest to our work are algorithms from the domain of partially observable Markov decision processes (POMDPs) [Kaelbling:1998:PAP:1643275.1643301]. There, the environment's state is also hidden, and the agent is equipped with a sensor (i.e., an observation function) with which it builds a belief of the latent state in order to execute and learn its policy. Although most algorithms consider one observation function (i.e., one view), one can generalise the definition of POMDPs to multiple types of observations by allowing for a joint observation space, and consequently a joint observation function, across varying views. Though possible in principle, we are not aware of any algorithm from the POMDP literature targeting this scenario. Our problem, in fact, can become substantially harder from both the representation and learning perspectives. To illustrate, consider a POMDP with only two views, the first being images, while the second is a low-dimensional time series corresponding to, say, joint angles and angular velocities. Following the idea of constructing a joint observation space, one would be looking for a map from the history of observations (and potentially actions) to a new observation at the consequent time step. Such an observation can be an image, a time series, or both. In this setting, constructing a joint observation mapping is difficult due to the varying nature of the outputs and their occurrences in the history. Due to the large sample and computational complexities involved in designing larger deep learners with varying output units and switching mechanisms to differentiate between views, we rather advocate a more grounded framework by drawing inspiration from the well-established multi-view supervised learning literature, leading us to multi-view reinforcement learning. (Other titles for this work are also possible, e.g., multi-view POMDPs or multi-observation POMDPs. The emphasis is on the fact that little literature has considered RL with varying sensory observations.)
Our framework for multi-view reinforcement learning also shares similarities with multi-task reinforcement learning, a framework that has gained considerable attention [Ammar:2014:OML:3044805.3045027; duan2016rl; finn2017model; frans2017meta; parisotto2015actor; teh2017distral; wang2016learning]. In particular, one can imagine multi-view RL as a case of multi-task RL in which tasks share action spaces, transition models, and reward functions, but differ in their state representations. Though a bridge between multi-view and multi-task can be constructed, it is worth mentioning that most works on multi-task RL consider the same observation and action spaces but varying dynamics and/or reward functions [NIPS2017_7036]. As such, these methods fail to handle fusion and generalisation across feature representations that vary between domains. A notable exception is the work in [Ammar:2015:ACK:2832581.2832715], which transfers knowledge between task groups with varying state and/or action spaces. Though successful, this method assumes a model-free setting with linear policies. As such, it fails to efficiently handle high-dimensional environments, which require deep network policies.
In this paper, we contribute by introducing a framework for multi-view reinforcement learning that generalises partially observable Markov decision processes (POMDPs) to ones that exhibit multiple observation models. We first derive a straightforward solution based on state augmentation that demonstrates superior performance on various benchmarks when compared to state-of-the-art proximal policy optimisation (PPO) in multi-view scenarios. We then provide an algorithm for multi-view model learning and propose a solution capable of transferring policies learned from one view to another. This, in turn, reduces the amount of training samples needed by around two orders of magnitude in most tasks when compared to PPO. Finally, in another set of experiments, we show that our algorithm outperforms PILCO [deisenroth2011pilco] in terms of sample complexity, especially on high-dimensional and non-smooth systems, e.g., Hopper. (It is worth noting that our experiments reveal that PILCO faces challenges when dealing with high-dimensional systems, e.g., Hopper. In future, we plan to investigate latent-space Gaussian process models with the aim of carrying sample-efficient algorithms to high-dimensional systems.) Our contributions can be summarised as:


formalising multi-view reinforcement learning as a generalisation of POMDPs;

proposing two solutions to multi-view RL, based on state augmentation and policy transfer;

demonstrating policy improvements over state-of-the-art methods on a variety of control benchmarks.
2 Multi-View Reinforcement Learning
This section introduces multi-view reinforcement learning by extending MDPs to allow for multiple state representations and observation densities. We show that POMDPs can be seen as a special case of our framework in which inference about the latent state uses only one view of the state space.
2.1 Multi-View Markov Decision Processes
To allow agents to reason about varying state representations, we generalise the notion of an MDP to a multi-view MDP, defined by the tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \{\mathcal{O}_{u}\}_{u=1}^{N}, \{\Omega_{u}\}_{u=1}^{N}, \mathcal{R} \rangle$. Here, $\mathcal{S}$, $\mathcal{A}$, $\mathcal{P}$, and $\mathcal{R}$ represent the standard MDP's state space, action space, transition model, and reward function, respectively.
Contrary to MDPs, multi-view MDPs incorporate additional components responsible for formally representing multiple views belonging to different observation spaces. We use $\mathcal{O}_{u}$ and $\Omega_{u}$ to denote the observation space and observation model of each sensor $u \in \{1, \dots, N\}$. At some time step $t$, the agent executes an action $a_t$ according to its policy, which conditions on a history of heterogeneous observations (and potentially actions) $h_t = \langle o_1^{u_1}, \dots, o_t^{u_t} \rangle$, where $o_k^{u_k} \in \mathcal{O}_{u_k}$ for $k \in \{1, \dots, t\}$ and $u_k \in \{1, \dots, N\}$. (Please note that it is possible to incorporate the actions in $h_t$ by simply introducing additional action variables.) We define $h_t$ to represent the history of observations so far, and introduce a superscript $u_k$ to denote the type of view encountered at the $k$-th time instance. As per our definition, we allow for $N$ different types of observations; therefore, $u_k$ is allowed to vary from one to $N$. Depending on the selected action, the environment then transitions to a successor state $s_{t+1} \sim \mathcal{P}(\cdot \mid s_t, a_t)$, which is not directly observable by the agent. On the contrary, the agent only receives a successor view $o_{t+1}^{u_{t+1}}$ with $u_{t+1} \in \{1, \dots, N\}$.
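As a rough illustration of this observation process, the sketch below advances a hidden state and, at each step, emits one of $N = 2$ views through a randomly selected emission function. The latent dynamics and the two linear emissions are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def transition(s, a):
    """Toy latent dynamics (an assumption for this sketch)."""
    return s + a

# Two views of the same hidden state: different representations, shared dynamics.
emissions = [
    lambda s: np.array([s]),            # view 1: a scalar reading
    lambda s: np.array([2.0 * s, -s]),  # view 2: a 2-d reading
]

def step(s, a):
    """Advance the hidden state and emit one randomly chosen view of it."""
    s_next = transition(s, a)
    u = rng.integers(len(emissions))    # which sensor fires at this step
    return s_next, u, emissions[u](s_next)

s, history = 0.0, []
for t in range(5):
    s, u, obs = step(s, a=1.0)
    history.append((u, obs))            # heterogeneous history h_t
```

Note how the history mixes observation types of different dimensionality, which is exactly what makes a single joint observation mapping hard to construct.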
2.2 Multi-View Reinforcement Learning Objective
As in standard literature, the goal is to learn a policy that maximises the total expected return. To formalise this goal, we introduce a multi-view trajectory $\tau = \langle s_1, o_1^{u_1}, a_1, \dots, s_T, o_T^{u_T}, a_T \rangle$, which augments standard POMDP trajectories with multi-view observations, and consider finite-horizon cumulative rewards computed on the real state (hidden to the agent) and the agent's actions. With this, the goal of the agent is to determine a policy $\pi$ maximising the following optimisation objective:

$$\max_{\pi} \; \mathbb{E}_{\tau \sim p_{\pi}(\tau)}\left[ \sum_{t=1}^{T} \gamma^{t-1} \mathcal{R}(s_t, a_t) \right], \qquad (1)$$
where $\gamma \in [0, 1)$ is the discount factor. The last component needed to finalise our problem definition is understanding how to factor multi-view trajectory densities. Knowing that the trajectory density is defined over joint observations, states, and actions, we write:

$$p_{\pi}(\tau) = p(s_1) \prod_{t=1}^{T} \Omega_{u_t}\!\left(o_t^{u_t} \mid s_t\right) \pi\!\left(a_t \mid h_t\right) \mathcal{P}\!\left(s_{t+1} \mid s_t, a_t\right),$$

with $p(s_1)$ being the initial state distribution. The generalisation above arrives with additional sample and computational burdens, rendering current solutions to POMDPs inapplicable. Among various challenges, multi-view policy representations capable of handling varying sensory signals can become expensive to both learn and represent. That being said, we can still reason about such structures and follow a policy-gradient technique to learn the parameters of the network. However, a multi-view policy network needs to adhere to a crucial property, which can be regarded as a special case of the representation fusion networks from multi-view representation learning [li2018survey; xu2013survey; zhao2017multi]. We give our derivation of the general gradient update laws following model-free policy gradients in Sect. 3.1.
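The fusion property just described can be sketched as a policy with one encoder per view feeding a shared action head. All shapes and the linear encoder plus softmax parameterisation below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT, N_ACTIONS = 8, 3

# One encoder per view, mapping each observation space into a shared feature space.
encoders = {
    0: rng.normal(size=(4, FEAT)),  # view 1 has 4-d observations (assumed)
    1: rng.normal(size=(6, FEAT)),  # view 2 has 6-d observations (assumed)
}
policy_head = rng.normal(size=(FEAT, N_ACTIONS))  # shared across views

def policy(view_id, obs):
    """Encode a view-specific observation, then apply the shared policy head."""
    feat = np.tanh(obs @ encoders[view_id])   # view-specific encoding
    logits = feat @ policy_head               # fused, shared decision layer
    probs = np.exp(logits - logits.max())     # numerically stable softmax
    return probs / probs.sum()

p = policy(0, np.ones(4))  # action distribution from a view-1 observation
```

The same `policy_head` serves every view, so knowledge gained from one sensory stream directly shapes decisions made from another.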
Contrary to standard POMDPs, our trajectory density is generalised to support multiple state views by requiring different observation models through time. Such a generalisation allows us to advocate a more efficient model-based solver that enables cross-view policy transfer (i.e., conditioning one view's policy on another) and few-shot RL, as we shall introduce in Sect. 3.2.
3 Solution Methods
This section presents our model-free and model-based approaches to solving multi-view reinforcement learning. Given a policy network, our model-free solution derives a policy gradient theorem, which can be approximated using Monte Carlo estimates and history-dependent baselines when updating model parameters. We then propose a model-based alternative that learns joint dynamical (and emission) models to allow for control in the latent space and cross-view policy transfer.
3.1 Model-Free Multi-View Reinforcement Learning through Observation Augmentation
The algorithm we employ for model-free multi-view reinforcement learning falls in the class of policy gradient algorithms. Given existing advancements in Monte Carlo estimation and variance reduction (e.g., observation-based baselines), and the problem definition in Eq. (1), the policy parameters $\theta$ are updated as:

$$\theta \leftarrow \theta + \alpha \, \mathbb{E}_{\tau \sim p_{\pi_\theta}(\tau)}\!\left[ \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}\!\left(a_t \mid h_t\right) \left( \sum_{t'=t}^{T} \gamma^{t'-1} \mathcal{R}(s_{t'}, a_{t'}) - b(h_t) \right) \right], \qquad (2)$$

where $\alpha$ is a learning rate and $b(h_t)$ is a history-dependent baseline.
Please refer to the appendix for a detailed derivation. While a policy gradient algorithm can be implemented directly, the above update rule is oblivious to the availability of multiple views of the state space, closely resembling standard (one-view) POMDP scenarios. This only increases the number of samples required for training, as well as the variance of the gradient estimates. We thus introduce a straightforward model-free MVRL algorithm by leveraging fusion networks from multi-view representation learning. Specifically, we assume that corresponding observations from all views, i.e., observations that share the same latent state, are accessible during training. Although the parameter update rule is exactly the same as defined in Eq. (2), this method manages to utilise the knowledge of shared dynamics across different views, thus performing better than independent model-free learners, i.e., regarding each view as a single environment and learning a separate policy.
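A minimal sketch of the observation-augmentation step, assuming two synthetic views of the same latent state (the views and the dummy dimension are invented for illustration):

```python
import numpy as np

def augment(observations):
    """Fuse corresponding per-view observations into one training input."""
    return np.concatenate([np.asarray(o).ravel() for o in observations])

latent = np.array([0.5, -1.0])                          # shared hidden state
view_a = latent                                          # view 1: identity obs
view_b = np.array([2.0 * latent[0], -latent[1], 0.0])   # view 2: transformed, with a dummy dim

x = augment([view_a, view_b])  # concatenated input fed to the policy learner
```

During training, `x` replaces the single-view observation in the policy-gradient update; at test time, missing views would need to be inferred or zero-filled, which is where the model-based solution of Sect. 3.2 becomes preferable.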
3.2 Model-Based Multi-View Reinforcement Learning
We now propose a model-based approach that learns approximate transition models for multi-view RL, allowing for policies that are simpler to learn and represent. Our learnt model can also be used for cross-view policy transfer (i.e., conditioning one view's policy on another), few-shot RL, and typical model-based RL, enabling policy improvements through backpropagation in learnt joint models.
3.2.1 Multi-View Model Learning
The purpose of model learning is to abstract a hidden joint model shared across varying views of the state space. For accurate predictions, we envision the generative model in Fig. (1). Here, observed random variables are denoted by $o_t^{u_t}$ for a time step $t$ and view $u_t \in \{1, \dots, N\}$, while $z_t$ represents latent variables that temporally evolve conditioned on applied actions. As multiple observations are allowed, our model generalises to the multi-view setting by supporting varying emission models depending on the nature of the observations. Crucially, we do not assume Markov transitions in the latent space, as we believe that reasoning about multiple views requires more than one-step history information.

To define our optimisation objective, we follow standard variational inference. Before deriving the variational lower bound, however, we first introduce additional notation to ease exposition. Recall that $o_t^{u_t}$ is the observation vector at time step $t$ for view $u_t$. (Different from the model-free solver introduced in Sect. 3.1, we do not assume accessibility to all views.) To account for action conditioning, we further augment observations with executed controls, leading to $\hat{o}_t = [o_t^{u_t}, a_t]$ for $t < T$, and $\hat{o}_T = o_T^{u_T}$ for the last time step. Analogous to standard latent-variable models, our goal is to learn latent transitions that maximise the marginal likelihood of the observations. According to the graphical model in Fig. (1), observations are generated from temporally evolving latent variables. Hence, we regard observations as the result of a process in which latent states have been marginalised out:

$$p_{\theta}(\hat{o}_{1:T}) = \int p_{\theta}\!\left(\hat{o}_{1:T}, z_{1:T}\right) \mathrm{d}z_{1:T},$$

where $\hat{o}_{1:T}$ collects all multi-view observations and actions across time steps. To devise an algorithm that reasons about latent dynamics, two components need to be better understood. The first relates to factoring the joint density, while the second relates to approximating multi-dimensional integrals. To factorise the joint density, we follow the modelling assumptions in Fig. (1) and write:

$$p_{\theta}\!\left(\hat{o}_{1:T}, z_{1:T}\right) = \prod_{t=1}^{T} p_{\theta}\!\left(\hat{o}_t \mid z_t\right) p_{\theta}\!\left(z_t \mid \hat{h}_{t-1}\right),$$

where we introduced $\theta$ to emphasise the modelling parameters that need to be learnt, and $\hat{h}_{t-1} = [\hat{o}_1, \dots, \hat{o}_{t-1}]$ to concatenate the state and action histories up to time $t-1$.
Having dealt with density factorisation, another problem to circumvent is that of computing intractable multi-dimensional integrals. This can be achieved by introducing a variational distribution over the latent variables, $q_{\phi}(z_{1:T} \mid \hat{o}_{1:T})$, which transforms integration into an optimisation problem:

$$\log p_{\theta}(\hat{o}_{1:T}) = \log \int q_{\phi}\!\left(z_{1:T} \mid \hat{o}_{1:T}\right) \frac{p_{\theta}(\hat{o}_{1:T}, z_{1:T})}{q_{\phi}(z_{1:T} \mid \hat{o}_{1:T})} \mathrm{d}z_{1:T} \geq \mathbb{E}_{q_{\phi}}\!\left[ \log \frac{p_{\theta}(\hat{o}_{1:T}, z_{1:T})}{q_{\phi}(z_{1:T} \mid \hat{o}_{1:T})} \right],$$

where we used the concavity of the logarithm and Jensen's inequality in the second step of the derivation. We assume a mean-field decomposition for the variational distribution, $q_{\phi}(z_{1:T} \mid \hat{o}_{1:T}) = \prod_{t=1}^{T} q_{\phi}(z_t \mid \hat{h}_t)$, with $\hat{h}_t$ being the observation and action history up to time $t$. This leads to:

$$\log p_{\theta}(\hat{o}_{1:T}) \geq \sum_{t=1}^{T} \mathbb{E}_{q_{\phi}(z_t \mid \hat{h}_t)}\!\left[ \log p_{\theta}\!\left(\hat{o}_t \mid z_t\right) \right] - \mathrm{KL}\!\left( q_{\phi}(z_t \mid \hat{h}_t) \,\|\, p_{\theta}(z_t \mid \hat{h}_{t-1}) \right),$$
where $\mathrm{KL}(\cdot \| \cdot)$ denotes the Kullback–Leibler divergence between two distributions. Assuming shared variational parameters (e.g., one variational network), model learning can be formulated as:

$$\max_{\theta, \phi} \; \sum_{t=1}^{T} \mathbb{E}_{q_{\phi}(z_t \mid \hat{h}_t)}\!\left[ \log p_{\theta}\!\left(\hat{o}_t \mid z_t\right) \right] - \mathrm{KL}\!\left( q_{\phi}(z_t \mid \hat{h}_t) \,\|\, p_{\theta}(z_t \mid \hat{h}_{t-1}) \right). \qquad (3)$$
Intuitively, Eq. (3) fits the model by maximising the multi-view observation likelihood while being regularised through the KL term. Clearly, this is similar to the standard evidence lower bound, with additional components handling the multi-view observation types.
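The per-time-step terms of a bound of this form can be sketched for diagonal Gaussians as follows. All distribution parameters and the unit-variance reconstruction likelihood below are dummy values chosen for illustration.

```python
import numpy as np

def kl_diag_gauss(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(logvar_p - logvar_q
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def gauss_log_lik(x, mu):
    """Log-likelihood under a unit-variance Gaussian (up to a constant):
    the negative squared reconstruction error."""
    return -0.5 * np.sum((x - mu) ** 2)

obs, recon = np.array([1.0, 2.0]), np.array([0.9, 2.1])  # dummy observation/reconstruction
mu_q, logvar_q = np.zeros(3), np.zeros(3)  # variational posterior q(z_t | h_t)
mu_p, logvar_p = np.zeros(3), np.zeros(3)  # learned prior p(z_t | h_{t-1})

kl = kl_diag_gauss(mu_q, logvar_q, mu_p, logvar_p)
elbo = gauss_log_lik(obs, recon) - kl  # one time-step term of Eq. (3)
```

With identical posterior and prior, the KL term vanishes and the bound reduces to the reconstruction term alone, matching the intuition that the KL acts purely as a regulariser.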
3.2.2 Distribution Parameterisation and Implementation Details
To finalise our problem definition, choices for the modelling and variational distributions are ultimately problem-dependent. To encode transitions beyond Markov assumptions, we use a memory-based model (e.g., a recurrent neural network) to serve as the history encoder and future predictor. Introducing memory splits the model into stochastic and deterministic parts: the deterministic part is the memory model $m_t = f(m_{t-1}, z_{t-1}, a_{t-1})$, while the stochastic part is the conditional prior distribution on latent states, $p_{\theta}(z_t \mid m_t)$. We assume this distribution is Gaussian, with its mean and variance parameterised by a feed-forward neural network taking $m_t$ as input.

As for the observation model, the exact form is domain-specific, depending on the available observation types. When an observation is a low-dimensional vector, we chose a Gaussian parameterisation with mean and variance output by a feed-forward network as above. When dealing with images, we parameterised the mean by a deconvolutional neural network [goodfellow2014generative] and kept an identity covariance. The variational distribution $q_{\phi}(z_t \mid \hat{h}_t)$ can thus, respectively, be parameterised by a feed-forward neural network and a convolutional neural network [krizhevsky2012imagenet] for these two types of views.

With the above assumptions, we can now derive the training loss used in our experiments. First, we rewrite Eq. (3) as
$$\max_{\theta, \phi} \; \sum_{t=1}^{T} \mathbb{E}_{q_{\phi}(z_t \mid \hat{h}_t)}\!\left[ \log p_{\theta}\!\left(\hat{o}_t \mid z_t\right) \right] + \mathbb{E}_{q_{\phi}(z_t \mid \hat{h}_t)}\!\left[ \log p_{\theta}\!\left(z_t \mid \hat{h}_{t-1}\right) \right] + \mathcal{H}\!\left( q_{\phi}(z_t \mid \hat{h}_t) \right), \qquad (4)$$
where $\mathcal{H}(\cdot)$ denotes the entropy and $\hat{h}_t$ represents the observation and action history up to time $t$. From the first two terms in Eq. (4), we realise that the optimisation problem for multi-view model learning consists of two parts: 1) observation reconstruction, and 2) transition prediction. Observation reconstruction operates by: 1) inferring the latent state $z_t$ from the observation using the variational model, and 2) decoding $z_t$ to $\tilde{o}_t$ (an approximation of $\hat{o}_t$). Transition prediction, on the other hand, operates by feeding the previous latent state and the previous action into the memory model to predict the next latent state $\tilde{z}_t$. Both parts are optimised by maximising the log-likelihood under a Gaussian distribution with unit variance. This equates to minimising the mean squared error between the model outputs and the actual variable values:
$$\mathcal{L} = \sum_{t=1}^{T} \left\| \hat{o}_t - \tilde{o}_t \right\|_2^2 + \left\| z_t - \tilde{z}_t \right\|_2^2,$$

where $\| \cdot \|_2$ is the Euclidean norm.
Optimising Eq. (4) also requires maximising the entropy of the variational model $q_{\phi}$. Intuitively, the variational model aims to increase the element-wise similarity of the latent states among corresponding observations [gretton2007kernel]. Thus, we represent the entropy term as:

$$\mathcal{H}\!\left( q_{\phi}(z_t \mid \hat{h}_t) \right) \approx -\left\| \mu_t - \bar{\mu} \right\|_2^2, \qquad (5)$$

where $\bar{\mu}$ is the average value of the mean $\mu_t$ of the diagonal Gaussian representing $q_{\phi}$ for each training batch.
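For reference, the exact entropy of a diagonal-Gaussian variational posterior, which the batch-wise approximation above stands in for, can be computed in closed form; the 4-dimensional unit-variance latent below is a dummy choice.

```python
import numpy as np

def diag_gauss_entropy(logvar):
    """Closed-form entropy of a diagonal Gaussian:
    H = 0.5 * sum_j log(2 * pi * e * var_j)."""
    return 0.5 * np.sum(np.log(2.0 * np.pi * np.e) + logvar)

h = diag_gauss_entropy(np.zeros(4))  # unit-variance, 4-d latent state
```

Because the closed form depends only on the variances, it provides no pressure on the means; the batch-wise surrogate in Eq. (5) is what couples the means of corresponding observations.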
3.2.3 Policy Transfer and Few-Shot Reinforcement Learning
As introduced in Section 2.2, trajectory densities in MVRL generalise to support multiple state views by requiring different observation models through time. Such a generalisation enables us to achieve cross-view policy transfer and few-shot RL, where we require only very little data from a specific view to train the multi-view model. The model can then be used for action selection by: 1) inferring the corresponding latent state, and 2) feeding the latent state into the policy learned from another view with greater accessibility. Details can be found in Appendix A.
Concretely, our learned models should be able to reconstruct corresponding observations for views with shared underlying dynamics (latent state $z_t$). During model learning, we thus validate the variational and observation models by: 1) inferring the latent state $z_t$ from the first view's observation $o_t^{1}$, and 2) comparing the reconstructed corresponding observation $\tilde{o}_t^{v}$ of each other view $v$ with the actual observation $o_t^{v}$ through the transformation loss $\| o_t^{v} - \tilde{o}_t^{v} \|_2^2$. Similarly, the memory model can be validated by: 1) reconstructing the predicted latent state of the first view using the observation models of the other views to obtain $\tilde{o}_{t+1}^{v}$, and 2) comparing $\tilde{o}_{t+1}^{v}$ with the actual observation $o_{t+1}^{v}$ through the prediction transformation loss $\| o_{t+1}^{v} - \tilde{o}_{t+1}^{v} \|_2^2$.
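A minimal sketch of the transfer step itself, assuming a linear encoder and a linear latent-space policy with invented shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
enc_view2 = rng.normal(size=(6, 4))   # stand-in variational encoder for view 2
policy_view1 = rng.normal(size=(4,))  # stand-in policy trained on view 1, acting on z

def transfer_action(obs_view2):
    """Cross-view transfer: 1) infer the shared latent state from a view-2
    observation, 2) reuse the policy learned from view 1 on that latent."""
    z = np.tanh(obs_view2 @ enc_view2)
    return float(z @ policy_view1)

a = transfer_action(np.ones(6))  # act in view 2 without ever training on it
```

Only the encoder for the new view needs data from that view; the policy itself is reused unchanged, which is the source of the few-shot behaviour.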
4 Experiments
We evaluate our method on a variety of dynamical systems varying in the dimensionality of their state representations. We consider both high- and low-dimensional problems to demonstrate the effectiveness of our model-free and model-based solutions. On the model-free side, we demonstrate performance against state-of-the-art methods, such as proximal policy optimisation (PPO) [schulman2017proximal]. On the model-based side, we are interested in whether our model successfully learns shared dynamics across varying views, and whether these can then be utilised to enable efficient control.
We consider dynamical systems from the Atari suite, Roboschool [schulman2017proximal], PyBullet [coumans2019], and the Highway environments [highwayenv]. We generate varying views either by transforming state representations, introducing noise, or augmenting state variables with additional dummy components. When considering the game of Pong, we allow for varying observations by introducing various transformations of the image representing the state, e.g., rotation, flipping, and horizontal swapping. We then compare our multi-view model with a state-of-the-art modelling approach, World Models [ha2018world]; see Section 4.1. Given successful modelling results, we proceed to demonstrate control in both model-free and model-based scenarios in Section 4.2. Our results demonstrate that although multi-view model-free algorithms can present advantages compared to standard RL, multi-view model-based techniques are far more efficient in terms of sample complexity.
4.1 Modeling Results
To evaluate multi-view model learning, we generated five views by varying state representations (i.e., images) in the Atari Pong environment. We kept the dynamics unchanged and considered four sensory transformations of the observation frame $o$, namely: 1) transposed images, 2) horizontally-swapped images, 3) inverse images, and 4) mirror-symmetric images. Exact details on how these views were generated can be found in Appendix C.1.1.
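One plausible reading of these four transformations can be sketched as below; the exact definitions are in Appendix C.1.1, and the 4x4 frame and 255-based inversion are assumptions for this sketch.

```python
import numpy as np

o = np.arange(16, dtype=np.uint8).reshape(4, 4)  # stand-in for a Pong frame

o_transposed = o.T                                       # 1) transposed image
o_swapped = np.concatenate([o[:, 2:], o[:, :2]], axis=1) # 2) left/right halves swapped
o_inverse = 255 - o                                      # 3) inverse (negated) pixels
o_mirror = o[:, ::-1]                                    # 4) mirror-symmetric image
```

Each transform preserves the game's dynamics while changing the observation function, which is exactly the multi-view setting the model must cope with.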
For simplicity, we built pairwise multi-view models between the original view and each of the above variants. Fig. (2) illustrates convergence results of multi-view prediction and transformation for different views of Atari Pong. Fig. (3) further investigates the learnt shared dynamics among different views. Fig. (3a) illustrates the converged log standard deviation of the latent state variable in the two views. Observe that a small group of elements (indexed as 14, 18, 21, 24 and 29) has relatively low variance in both views, thus keeping stable values across different observations. We consider these elements the critical part of representing the shared dynamics and define them as key elements. Clearly, learning a shared group of key elements across different views is the target of the multi-view model. Results in Fig. (3b) illustrate the distance between the latent representations of the two views for the multi-view model, demonstrating convergence. As the same group of elements is, in fact, close to the key elements learnt by the multi-view model, we conclude that the model captures the shared dynamics across different views. Further analysis of these key elements is presented in the appendix.
In Fig. (3c), we also report the converged latent distance of World Models (WM) [ha2018world] under the multi-view setting. Although the world model can still infer the latent dynamics of both environments, the large difference between the learnt dynamics demonstrates that varying views pose a challenge to world models; our algorithm, however, is capable of capturing such hidden shared dynamics.
4.2 Policy Learning Results
Given successful modelling results, in this section we present results on controlling systems across multiple views within an RL setting. Namely, we evaluate our multi-view RL approach on several high- and low-dimensional tasks. Our systems consist of: 1) Cartpole, where the goal is to balance a pole by applying left or right forces to a pivot; 2) Hopper, where the focus is on locomotion such that the dynamical system hops forward as fast as possible; 3) RACECAR, where the observation is the position (x, y) of a randomly placed ball in the camera frame and the reward is based on the distance to the ball; and 4) parking, where an ego-vehicle must park in a given space with an appropriate heading (a goal-conditioned continuous control task). The evaluation metric we used is the average testing return across all views with respect to the amount of training samples (number of interactions). We use the same setting in all experiments to generate multiple views: the original environment observation is used as the first view, and the second view is generated by adding two dummy dimensions and large-scale noise (after observation normalisation) to the original observation. Such a setting allows us to understand whether our model can learn shared dynamics from misspecified state representations, and whether such a model can then be used for control.

For all experiments, we trained the multi-view model with a few samples gathered from all views and used the result for policy transfer (MVPT) between views during the test period. We chose state-of-the-art PPO [schulman2017proximal], an algorithm based on the work in [Peters:2008:NA:1352927.1352986], as the baseline, training separate models on different views and aggregating the results. The multi-view model-free (MVMF) method is trained by augmenting PPO with concatenated observations. Relevant parameter values and implementation details are listed in Appendix C.2.
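A sketch of this view-generation protocol follows; the noise scale is chosen arbitrarily here, since the exact value was elided in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_view2(obs, noise_scale=1.0, n_dummy=2):
    """Second view: the original observation corrupted by large-scale noise,
    padded with two uninformative dummy dimensions."""
    noisy = obs + noise_scale * rng.normal(size=obs.shape)
    dummy = rng.normal(size=n_dummy)  # dummy state variables carrying no signal
    return np.concatenate([noisy, dummy])

v1 = np.zeros(4)      # original (normalised) observation: view 1
v2 = make_view2(v1)   # misspecified representation: view 2
```

The two views share identical underlying dynamics but differ in dimensionality and noise level, which is the misspecification the shared model must see through.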
Fig. (4) shows the average testing return (the average testing success rate for the parking task) across all views. On the Cartpole and parking tasks, our MVMF algorithm presents improvements over strong model-free baselines such as PPO, showing the advantage of leveraging information from multiple views over training independently within each view. On the other hand, multi-view model-based techniques give the best performance on all tasks and reduce the number of samples needed by around two orders of magnitude in most tasks. This shows that MVPT greatly reduces the amount of training samples required to reach good performance.
Table 1: Amount of environment interactions until success on Cartpole for PILCO, PILCO with augmented states (PILCO-aug), an MLP model, and our multi-view model (MVMB).

We also conducted a pure model-based RL experiment and compared our multi-view dynamics model against: 1) PILCO [deisenroth2011pilco], which can be regarded as a state-of-the-art model-based solution using a Gaussian-process dynamics model; 2) PILCO with augmented states, i.e., a single PILCO model approximating the data distribution from all views; and 3) a multi-layer perceptron (MLP). We use the same planning algorithm for all model-based methods, e.g., hill climbing or model predictive control (MPC) [nagabandi2018neural], depending on the task at hand. Table 1 shows the results on the Cartpole environment, where we evaluate all methods by the amount of interactions until success. Each training rollout has a bounded number of interaction steps, and we define success as reaching a threshold average testing return. Although the multi-view model performs slightly worse than PILCO on the Cartpole task, we found that model-based control alone cannot perform well on tasks without suitable environment rewards. For example, the reward function in Hopper primarily encourages higher speed and lower energy consumption. Such high-level reward functions make it hard for model-based methods to succeed; therefore, the results of model-based algorithms on all tasks are lower than with specifically designed reward functions. Tailoring reward functions, though interesting, is out of the scope of this paper. MVPT, on the other hand, outperforms the others significantly; see Fig. (4).

5 Related Work
Our work extends model-based RL to the multi-view scenario. Model-based RL for POMDPs has been shown to be more effective than model-free alternatives in certain tasks [watter2015embed; wahlstrom2015pixels; levine2014learning; leibfried2016deep]. A classical combination of model-based and model-free algorithms is Dyna-Q [sutton1990integrated], which learns the policy from both the model and the environment by supplementing real-world on-policy experiences with simulated trajectories. However, using trajectories from a non-optimal or biased model can lead to learning a poor policy [gu2016continuous]. To model the world environment of Atari games, autoencoders have been used to predict the next observation and environment rewards [leibfried2016deep]. Other works [schmidhuber2015learning; ha2018world; leibfried2016deep] maintain a recurrent architecture to model the world using unsupervised learning, and prove its efficiency in helping RL agents in complex environments. MbMf [nagabandi2018neural] bridges the gap between model-free and model-based methods by employing MPC to pre-train a policy within the learned model before training it with a standard model-free method. However, these models can only be applied to a single environment and need to be built from scratch for new environments. Although using a similar recurrent architecture, our work differs from the above by learning the shared dynamics over multiple views. Moreover, many of the above advancements are orthogonal to our approach, which can benefit from model ensembles, e.g., pre-training the model-free policy within the multi-view model when reward models are accessible.

Another related research area is multi-task learning (or meta-learning). To achieve multi-task learning, recurrent architectures [duan2016rl; wang2016learning] have been used to learn to reinforcement learn by adapting to different MDPs automatically; these have been shown to be comparable to the UCB1 algorithm [Auer2002] on bandit problems. Meta-learning shared hierarchies (MLSH) [frans2017meta] shares sub-policies among different tasks, where high-hierarchy actions are obtained during training and reused in other tasks. The model-agnostic meta-learning algorithm (MAML) [finn2017model] minimises the total error across multiple tasks by locally conducting few-shot learning to find optimal parameters for both supervised learning and RL. Actor-Mimic [parisotto2015actor] distills multiple pre-trained DQNs on different tasks into a single network, accelerating learning by initialising the learner with the distilled network's parameters; to achieve promising results, these pre-trained DQNs have to be expert policies. Distral [teh2017distral] learns multiple tasks jointly and trains a shared policy as the "centroid" by distillation.
Concurrently with our work, ADRL [ijcai2019277] extended model-free RL to multi-view environments and proposed an attention-based policy-aggregation method based on the Q-value of each view's actor-critic worker. Most of the above approaches consider problems within the model-free RL paradigm and focus on finding common structure in the policy space. However, model-free approaches require large amounts of data to explore high-dimensional environments. In contrast, we explicitly maintain a multi-view dynamics model to capture the latent structure and dynamics of the environment, thus obtaining more stable correlation signals.
Some algorithms from meta-learning have been adapted to the model-based setting al2017continuous; clavera2018learning. These focus on model adaptation when the model is incomplete or the underlying MDPs are evolving: by treating the unlearnt model as a new task and continuously learning new structures, the agent can keep its model up to date. Different from these approaches, we focus on how to establish the common dynamics over compact representations of observations generated from different emission models.
6 Conclusions
In this paper, we proposed multi-view reinforcement learning as a generalisation of partially observable Markov decision processes that exhibit multiple observation densities. We derived model-free and model-based solutions to multi-view reinforcement learning, and demonstrated the effectiveness of our method on a variety of control benchmarks. Notably, we showed that model-free multi-view methods based on observation augmentation significantly reduce the number of training samples when compared to state-of-the-art reinforcement learning techniques, e.g., PPO, and demonstrated that model-based approaches with cross-view policy transfer allow for extremely efficient learners that need significantly fewer training samples.
There are multiple interesting avenues for future work. First, we would like to apply our technique to real-world robotic systems such as self-driving cars; second, we plan to use our method for transferring between varying views across domains.
References
Appendix A Model-based Multi-view Reinforcement Learning Algorithm
For completeness, we provide the model-based multi-view reinforcement learning algorithm below.
Appendix B Derivation of the Multi-view Model-free Policy Gradient
The type of algorithm we employ for model-free multi-view reinforcement learning falls into the class of policy gradient algorithms, which update the agent's policy-defining parameters directly by estimating a gradient in the direction of higher reward. Given the problem definition in Equation 1, the gradient of the loss with respect to the network parameters can be computed as:
where we used the "likelihood-ratio" trick in the third step of the derivation. Now, one can proceed by taking a sample average of the gradient using Monte Carlo estimation to update the policy parameters, suggesting the following update rule:
The Monte Carlo estimate above is a fast approximation of the gradient for the current policy, with a convergence rate to the true gradient that is independent of the number of policy parameters. It is also worth noting that although the trajectory distribution depends on the unknown initial state distribution, unknown observation models, and hidden state dynamics, the gradient only includes policy components that can be controlled by the agent.
Though fast in converging to the true gradient, Monte Carlo estimates suffer from high variance; e.g., it is easy to show that the variance grows linearly in the time horizon. Unfortunately, the naive approach of sampling big-enough batch sizes is not an option in reinforcement learning due to the high cost of collecting samples, i.e., interacting with the environment. For this reason, the literature has focused on introducing baselines that aim to reduce variance peters2008reinforcement; williams1992simple. We follow a similar approach here, and introduce an observation-based baseline to reduce the variance of our gradient estimate. Our baseline will take observations and actions as inputs (even though this baseline is action-dependent, one can show it to be unbiased: the trick is to realise that a history at a given time step is independent of the action at that step, and rather depends on all actions up to that instant), and learn to predict future returns given the current policy. Such a baseline can easily be represented as an LSTM recurrent neural network as noted in wierstra2010recurrent. Consequently, we can rewrite our update rule as:
\[
\theta \leftarrow \theta + \alpha \sum_{t=0}^{T-1} \nabla_{\theta}\log\pi_{\theta}(a_t \mid h_t)\left(\sum_{t'=t}^{T-1} r_{t'} - b(h_t, a_t)\right) \tag{6}
\]
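To make the update rule concrete, here is a minimal, hedged sketch of the Monte Carlo policy-gradient update with a baseline on a toy two-armed bandit with a softmax policy. The toy environment, the constant baseline, and the learning rate are illustrative assumptions only; the paper's baseline is an LSTM over observation/action histories.

```python
import numpy as np

rng = np.random.default_rng(1)

theta = np.zeros(2)                       # policy logits

def policy(theta):
    """Softmax policy over the two arms."""
    e = np.exp(theta - theta.max())
    return e / e.sum()

def grad_log_pi(theta, a):
    """Likelihood-ratio term: grad_theta log pi_theta(a) for a softmax policy."""
    p = policy(theta)
    g = -p
    g[a] += 1.0
    return g

def reward(a):
    # Arm 1 pays 1.0 on average, arm 0 pays 0.0 (plus noise).
    return float(a) + rng.normal(scale=0.1)

alpha, baseline = 0.1, 0.5               # step size and constant baseline
for _ in range(2000):
    a = rng.choice(2, p=policy(theta))
    r = reward(a)
    # Update rule: theta <- theta + alpha * grad log pi(a) * (R - b)
    theta += alpha * grad_log_pi(theta, a) * (r - baseline)

print(policy(theta)[1])                  # probability of the better arm, near 1
```

Subtracting the baseline leaves the gradient estimate unbiased while shrinking its variance, which is why the learned preference for the better arm stabilises quickly even with noisy single-sample returns.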
Appendix C Experiment Details
C.1 Model learning details




C.1.1 View Settings of Atari Pong
As shown in Fig. 5, each variant corresponds to one transformation of the original observation: (a) the transposed view, obtained by rotating the original observation clockwise and flipping it horizontally; (b) the horizontal-swapped view, generated by vertically splitting the observation frame at the center and swapping the left part with the right part; (c) the inverse view, created by exchanging the background colour with the paddle/ball colour; and (d) the mirror-symmetric view, which reflects the observation like a mirror by swapping it horizontally.
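A minimal numpy sketch of the four view transformations described above, applied to a binary observation frame. The rotation direction and the tiny stand-in frame are illustrative assumptions; the actual observations are full Pong frames.

```python
import numpy as np

def transposed(frame):
    # (a) Rotate clockwise, then flip horizontally.
    return np.fliplr(np.rot90(frame, k=-1))

def horizontal_swapped(frame):
    # (b) Split vertically at the center and swap the left/right halves.
    mid = frame.shape[1] // 2
    return np.concatenate([frame[:, mid:], frame[:, :mid]], axis=1)

def inverse(frame):
    # (c) Exchange background colour with paddle/ball colour on a binary frame.
    return 1 - frame

def mirror_symmetric(frame):
    # (d) Reflect like a mirror by swapping the observation horizontally.
    return np.fliplr(frame)

frame = np.eye(4, dtype=int)   # tiny stand-in for a binary Pong frame
views = [f(frame) for f in (transposed, horizontal_swapped,
                            inverse, mirror_symmetric)]
```

Note that composing a clockwise rotation with a horizontal flip is equivalent to a matrix transpose, which is why variant (a) is called the transposed view.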
C.1.2 Multi-view Model Setting
Different from the original Atari Pong observations, we (1) transform each frame to a binary matrix; (2) remove the scoreboard; and (3) resize each frame to a fixed resolution to serve as the observation. The action space is formed by all six available discrete actions of the original Atari environment. The observation model adopts the same architecture as a typical VAE. The memory model is an LSTM connected to the same output layer of the observation model. We alternate the training process for the multi-view model between minimising the prediction loss and the reconstruction loss, setting each training iteration to consist of a number of prediction iterations followed by a number of reconstruction iterations, since adjusting the observation model to satisfy the learning of the shared dynamics affects the model's reconstruction ability. We explore training the shared dynamics on two views with corresponding and non-corresponding inputs, i.e., using corresponding states from different views as training data or not, to verify the performance of multi-view models.
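The alternating schedule above can be sketched as follows. The losses here are toy quadratics and the step counts are illustrative assumptions; in the actual model the prediction steps train the shared LSTM dynamics and the reconstruction steps train the per-view observation model.

```python
def train_alternating(n_iters, pred_steps, recon_steps, lr=0.1):
    """Alternate prediction and reconstruction updates each outer iteration."""
    w_dyn, w_obs = 5.0, 5.0          # toy parameters for the two sub-models
    schedule = []
    for _ in range(n_iters):
        for _ in range(pred_steps):      # minimise the prediction loss w_dyn**2
            w_dyn -= lr * 2 * w_dyn
            schedule.append("pred")
        for _ in range(recon_steps):     # minimise the reconstruction loss w_obs**2
            w_obs -= lr * 2 * w_obs
            schedule.append("recon")
    return w_dyn, w_obs, schedule

w_dyn, w_obs, schedule = train_alternating(n_iters=3, pred_steps=2, recon_steps=1)
```

Interleaving the two objectives lets the observation model keep adjusting to the shared dynamics without fully sacrificing reconstruction quality, which is the motivation given in the text.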
To collect training data covering most of the dynamics of the Pong environment, we use an agent with a random policy to play the game for a number of episodes of fixed length. At each training time step, we randomly sample fixed-length trajectories from the dataset as the training data for one view, and transform these samples into corresponding observations as the training data for the other views, thus explicitly making the training inputs for different views share the same transition dynamics.
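A hedged sketch of this data-collection scheme: roll out a random policy, sample fixed-length sub-trajectories, and map each one through a per-view emission model so all views share the same underlying transitions. The toy dynamics, dimensions, and linear emission maps are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def random_rollout(episode_len, state_dim=4, n_actions=6):
    """Collect (state, action) pairs under a uniform random policy."""
    states = rng.normal(size=(episode_len, state_dim))   # toy dynamics
    actions = rng.integers(0, n_actions, size=episode_len)
    return states, actions

def sample_subtrajectories(states, actions, n, length):
    """Sample n random windows of `length` consecutive steps."""
    starts = rng.integers(0, len(states) - length + 1, size=n)
    return [(states[s:s + length], actions[s:s + length]) for s in starts]

states, actions = random_rollout(episode_len=200)
batch = sample_subtrajectories(states, actions, n=8, length=10)

# Each view observes the same transitions through its own emission model,
# here a stand-in linear map per view.
view_maps = [np.eye(4), -np.eye(4)]
multi_view_batch = [[(s @ M, a) for s, a in batch] for M in view_maps]
```

Because every view's batch is derived from the same sampled windows, the training inputs for different views share the same transition dynamics by construction.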
C.1.3 Additional Experiment Results
To further validate the importance of the key elements in extracting the underlying dynamics, we show the weights of the reconstruction network connected to the latent representation in Fig. (a). As the sum of the absolute weights connected to the key elements is much larger than that of the others, a change in the key elements exerts a higher influence on the reconstruction output, illustrating their significance in the latent representation.
We then compute the gradients of the output with respect to the latent representation to observe which of its parts contribute most to the visual stimuli. As shown in Fig. (b), the mean absolute gradients of the key elements are significantly larger, while the other elements have nearly zero gradients (no contribution to the output). Consequently, the shared dynamics are mainly expressed by the key elements.
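The gradient-based sensitivity analysis above can be sketched with finite differences on a toy linear decoder. The decoder weights `W`, the latent `z`, and the choice of which elements are "key" are illustrative stand-ins for the paper's reconstruction network and latent representation.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, out_dim = 8, 16

# Toy decoder: output = W @ z. Make two "key" latent elements (0 and 3)
# carry much larger weights than the rest.
W = rng.normal(scale=0.01, size=(out_dim, latent_dim))
W[:, 0] += 1.0
W[:, 3] += 1.0

def decode(z):
    return W @ z

def mean_abs_gradient(z, eps=1e-5):
    """Finite-difference estimate of mean |d output / d z_i| per latent element."""
    base = decode(z)
    grads = np.zeros(latent_dim)
    for i in range(latent_dim):
        z_pert = z.copy()
        z_pert[i] += eps
        grads[i] = np.mean(np.abs((decode(z_pert) - base) / eps))
    return grads

z = rng.normal(size=latent_dim)
g = mean_abs_gradient(z)
key = np.argsort(g)[-2:]          # indices with the largest sensitivity
print(sorted(key.tolist()))       # prints [0, 3]: the key elements dominate
```

The elements with large mean absolute gradients are exactly those whose weights dominate the decoder, mirroring the argument made from Fig. (a) and Fig. (b).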
Table 2: PPO hyperparameters for the Cartpole, Hopper, and RACECAR tasks.
  Horizon (T)
  Adam stepsize
  Num. epochs
  Minibatch size
  Discount (γ)
  GAE parameter (λ)
  Number of actors
  Clipping parameter
  VF coeff.
  Entropy coeff.
Table 3: DDPG + HER hyperparameters for the Parking environment.
  Cycles to collect samples
  Training batch size
  Sample batch size
  HER strategy: future
  Discount (γ)
  Clip return
  Actor learning rate
  Critic learning rate
  Average coefficient
  Clip range
  Num. rollouts per MPI worker
  Noise
  Random
  Buffer size
  Replay k
  Clip ratio
C.2 Policy learning details
We use the same structure for the multi-view model as mentioned in Sect. C.1. For all environments, we generate 100 initial rollout trajectories using random policies (except for the cartpole, where we only generate 20 rollouts). We use PPO as the model-free policy learning algorithm for MVPT and MVMF in the cartpole, hopper, and RACECAR tasks, and list the hyperparameters in Table 2. For the parking environment, we use DDPG lillicrap2015continuous with hindsight experience replay andrychowicz2017hindsight, and list the hyperparameters in Table 3. For model-based baselines, we implement PILCO following the original setting from 6654139, and choose an MLP with one hidden layer of 128 units and ReLU activation functions.
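For concreteness, a minimal sketch of a one-hidden-layer ReLU MLP like the baseline described above, predicting the next state from a state-action input. The input/output dimensions and the Gaussian initialisation are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_mlp(in_dim, hidden=128, out_dim=4):
    """One hidden layer of `hidden` units with ReLU activations."""
    W1 = rng.normal(scale=0.1, size=(in_dim, hidden))
    W2 = rng.normal(scale=0.1, size=(hidden, out_dim))
    def forward(x):
        h = np.maximum(x @ W1, 0.0)   # ReLU hidden layer
        return h @ W2
    return forward

model = make_mlp(in_dim=6)            # e.g. 4-d state + 2-d action, assumed
pred = model(rng.normal(size=6))      # predicted next state
```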