1 Introduction
Deep reinforcement learning (RL) algorithms have achieved impressive success in various difficult tasks such as computer games (Mnih et al., 2013) and robotic control (Gu et al., 2017). Significant research effort in the field has led to the development of several successful RL algorithms (Schulman et al., 2017; Fujimoto et al., 2018; Haarnoja et al., 2018; Schulman et al., 2015). Their success is partly based on the expressive power of deep neural networks, which enable the algorithms to learn complex tasks from raw sensory data. Whereas neural networks can automatically acquire task-specific representations from raw sensory data, learning representations usually requires a large amount of data. This is one of the reasons that the application of RL algorithms typically needs millions of steps of data collection, which limits the applicability of RL algorithms to real-world problems, especially problems in continuous control and robotics. This limitation has driven extensive research in the RL community to develop sample-efficient algorithms (Buckman et al., 2018; Kurutach et al., 2018; Kalweit and Boedecker, 2017; Munos et al., 2016).
In general, state representation learning (SRL) (Lesort et al., 2018) focuses on representation learning where learned features are low-dimensional, evolve through time, and are influenced by the actions of an agent. Learning lower-dimensional representations is motivated by the intuition that the state of a system is the sufficient statistic required to predict its future, and that the sufficient statistic for many physical systems is fairly low-dimensional. In the SRL framework, the raw sensory data provided by the environment where RL agents are deployed is called an observation, and its low-dimensional representation is called a state. Such a state variable is expected to contain all task-relevant information, and ideally only such information. The representation is usually learned from auxiliary tasks, which enable the state variable to contain prior knowledge of the task domain (Jonschkowski and Brock, 2014) or the dynamics of the environment. In contrast to auxiliary tasks, the task that the RL agent ultimately needs to learn is called the actual task in this paper.
Conventional wisdom suggests that the lower the dimensionality of the state vector, the faster and better RL algorithms will learn. This reasoning justifies various algorithms for learning compact state representations from high-dimensional observations, for example (Watter et al., 2015). However, while probably correct, this reasoning likely applies to the intrinsic dimensionality of the state (the sufficient statistic). An interesting question is whether RL problems with an intrinsically low-dimensional state can benefit from intentionally increasing its dimensionality using a neural network with good feature propagation. This paper explores this question empirically, using several representative RL tasks and state-of-the-art RL algorithms.
Additionally, we borrow motivation from the fact that larger networks generally allow better solutions, as they increase the search space of possible solutions. The number of units in the hidden layers of multi-layer perceptrons (MLP) is often larger than the number of inputs, in order to improve the accuracy of function approximation. The importance of the size of the hidden layers has been investigated by a number of authors. Lu et al. (2017) show theoretically that MLPs with a hidden layer smaller than the input dimension are very limited in their expressive power. Also, in deep RL, neural networks often have hidden layers larger than the dimension of observations (Henderson et al., 2018). Since the state representation is an intermediate variable of processing just like the hidden units, it is reasonable to expect that high-dimensional representations of state might improve the expressive power of the neural networks used in RL agents in their own right.
Based on this idea, we propose OFENet: an Online Feature Extractor Network that constructs and uses high-dimensional representations of observations and actions, which are learned in an online fashion (i.e., along with the RL policy). We use a neural network for OFENet to produce the representations. It is desirable that the neural network can be optimized easily and produce meaningful high-dimensional representations. To meet these requirements, we use MLP-DenseNet as the network architecture; it is a slightly modified version of a densely connected convolutional network (Huang et al., 2017). The output of MLP-DenseNet is the concatenation of all layers' outputs. This network is trained with the incentive to preserve the sufficient statistic using an auxiliary task to predict future observations of the system. Consequently, the RL algorithm receives higher-dimensional features learned by OFENet, which have good predictive power for future observations.
We believe that the representation trained with the auxiliary task allows our agent to learn effective, higher-dimensional representations as input to the RL algorithm. This in turn allows the agent to learn complex policies more efficiently. We present results that demonstrate empirically that the representations produced by OFENet improve the performance of RL agents in non-image continuous control tasks. OFENet with several state-of-the-art RL algorithms (Schulman et al., 2017; Fujimoto et al., 2018; Haarnoja et al., 2018) consistently achieves state-of-the-art performance in various tasks, without changing the original hyperparameters of the RL algorithm.
2 Related work
Our work is broadly motivated by Munk et al. (2016), who proposed to use the output of a neural network layer as the input for a deep actor-critic algorithm. Our method is built on this general idea, too. However, a key difference is that the goal of representation learning in (Munk et al., 2016) is to learn compact representations from noisy observations, while we propose the idea of learning good higher-dimensional representations of state observations. For clarity of presentation, we describe the method in (Munk et al., 2016) in detail in Section 3.2.
While the classic reinforcement learning paradigm focuses on reward maximization, streaming observations contain an abundance of other possible learning targets (Jaderberg et al., 2016). These learning tasks are known as auxiliary tasks, and they generally accelerate the acquisition of useful state representations. There is a large body of literature on using different auxiliary tasks for different tasks (Jonschkowski and Brock, 2015; Lesort et al., 2018; Munk et al., 2016; Zhang et al., 2018; Van Hoof et al., 2016; Hernandez-Leal et al., 2019; v. Baar et al., 2019; Chen et al., 2016; Alvernaz and Togelius, 2017). In the proposed work, we use the auxiliary task of predicting the next observation for training OFENet, with the motivation that this allows the higher-dimensional outputs of OFENet to preserve the sufficient statistic for predicting the future observations of the system. Thus these representations can prove effective in learning meaningful policies too.
The network architecture of OFENet draws on the success of recently reported research on deep learning. In the advertisement domain, the Wide & Deep model (Cheng et al., 2016) and DeepFM (Guo et al., 2017) have an information flow passing over deep networks, in order to utilize low-order feature interactions. Similarly, OFENet also has connections between shallow layers and the output.
The most similar approach to our method has been proposed in (Zhang et al., 2018). Their DDR ONLINE produces higher-dimensional representations than the original observations, similar to our method. However, our approach to constructing good representations is quite different. They used a recurrent network architecture in order to incorporate temporal dependencies, and it contained several auxiliary tasks to increase the number of training signals. In contrast, our method uses only a single auxiliary task, and we embed additional information, such as information about the action, into the representation instead of increasing the number of training signals. Consequently, our training algorithm is much simpler than that presented in (Zhang et al., 2018).
3 Background
In this section, we provide relevant background and introduce some notations that are used throughout the paper.
3.1 Reinforcement Learning
Reinforcement learning considers the setting of an RL agent interacting with an environment to learn a policy that decides the optimal action of the agent. The environment is modeled as a Markov decision process (MDP) defined by the tuple $(\mathcal{O}, \mathcal{A}, p, r)$, where $\mathcal{O}$ is the space of possible observations and $\mathcal{A}$ is the space of available actions. We assume that observations and actions are continuous. The unknown dynamics $p(o_{t+1} \mid o_t, a_t)$ represents the distribution of the next observation $o_{t+1}$ given the current observation $o_t$ and the current action $a_t$. The reward function $r(o_t, a_t)$ represents the reward obtained for $o_t$ and $a_t$. The goal of the RL agent is to acquire the policy $\pi(a_t \mid o_t)$ that maximizes the expected sum of rewards in an episode. Note that the word state represents the notion of the essential information in the observation in SRL, so we call the information provided by the environment an observation.
3.2 Model learning deep deterministic policy gradient
Munk et al. (2016) proposed the Model Learning Deep Deterministic Policy Gradient (ML-DDPG) algorithm to learn representations of observations. They introduced a model network, which is trained to construct the observation representation $z_{o_t}$. The representation is used as the input in the DDPG algorithm (Lillicrap et al., 2015). The model network is a three-layer network, which produces $z_{o_t}$ as an internal representation, and predicts the next observation representation $\hat{z}_{o_{t+1}}$ and the reward $\hat{r}_{t+1}$ from the observation $o_t$ (Figure 1).
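The model network is trained by minimizing the following loss function
$$L_m = \lVert \hat{z}_{o_{t+1}} - z_{o_{t+1}} \rVert^2 + \lambda_m \lVert \hat{r}_{t+1} - r_{t+1} \rVert^2$$
where $\hat{z}_{o_{t+1}}$ and $\hat{r}_{t+1}$ are the predicted next observation representation and reward, and $\lambda_m$ represents the tradeoff between predicting the reward and the next observation representation. The minimization is done with samples collected before the agent starts learning, and then the parameters of the model network are fixed during learning.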
While we construct high-dimensional representations with OFENet, in the experiments of ML-DDPG, the dimension of the observation representation $z_{o_t}$ was not greater than one third of the dimension of the observation $o_t$.
4 Online feature extractor network
In this section, we describe our proposed method for learning higher-dimensional state representations for training RL agents. In the standard reinforcement learning setting, an RL agent interacts with the environment over a number of discrete time steps. At any time $t$, the agent receives an observation $o_t$ along with a reward $r_t$ and emits an action $a_t$. During training, standard RL agents receive the observation-action pair as input to learn the optimal policy. We propose the Online Feature Extractor Network (OFENet), which constructs high-dimensional representations for the observation-action pair, which are then used by the RL algorithm as input to learn the policy (instead of the raw observation and action pair). The observation $o_t$ used by the components is replaced by the observation representation $z_{o_t}$, and the observation-action pair $(o_t, a_t)$ is also replaced by the observation-action representation $z_{o_t, a_t}$.
OFENet learns the mappings $z_{o_t} = \phi_o(o_t)$ and $z_{o_t, a_t} = \phi_{o,a}(o_t, a_t)$, which have parameters $\theta_{\phi_o}$ and $\theta_{\phi_{o,a}}$, as depicted in Figure 2. To learn the mappings, we use an auxiliary task whose goal is to predict the next observation from the current observation and action. The learning of the auxiliary task is done concurrently with the learning of the actual task (and thus we call our proposed network the Online Feature Extractor Network, OFENet).
In the following, we describe in detail the auxiliary task, the neural network architecture for the mappings $\phi_o$ and $\phi_{o,a}$, and how to select hyperparameters for OFENet.
4.1 Auxiliary task
In this section, we incorporate auxiliary tasks in order to learn effective higherdimensional representations for state and action to be used by the RL agent. It is common knowledge that incorporating auxiliary tasks into the reinforcement learning framework can promote faster training, more robust learning and ultimately higher performance for RL agents (Duan et al., 2016; Jaderberg et al., 2016).
We introduce the module $f_{pred}$, which receives the observation-action representation $z_{o_t, a_t}$ as input to predict the next observation $o_{t+1}$. The module is represented as a linear combination of the representation $z_{o_t, a_t}$, which has parameters $\theta_{pred}$.
Thus, effectively, along with learning the actual RL objective, we optimize the parameter set $\theta_{aux} = \{\theta_{pred}, \theta_{\phi_o}, \theta_{\phi_{o,a}}\}$ to minimize the auxiliary task loss defined as:
$$L_{aux} = \mathbb{E}_{(o_t, a_t) \sim p, \pi} \left[ \lVert f_{pred}(z_{o_t, a_t}) - o_{t+1} \rVert^2 \right] \quad (1)$$
The auxiliary task defined by Equation (1) is used by OFENet to learn higher-dimensional representations. This loss function incentivizes higher-dimensional representations that preserve the dynamics information of the system (or loosely preserve the sufficient statistic to predict future observations of the system). The expectation is that the RL agent can now learn much more complex policies using these higher-dimensional features, as this effectively increases the search space for the parameters of the policies.
The transitions required for learning are sampled as minibatches from the experience replay buffer $\mathcal{B}$, which stores the past transitions that the RL agent has received. Algorithm 1 outlines this procedure.
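The auxiliary task described in this section can be illustrated with a minimal sketch. It assumes a squared-error prediction loss and a linear prediction module; the names `linear_predict`, `auxiliary_loss`, `sample_minibatch`, and `concat_features` are hypothetical, and the feature function is a trivial stand-in for the learned representation rather than OFENet itself.

```python
import random


def linear_predict(weights, bias, z):
    """Linear prediction module: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, z)) + b
            for row, b in zip(weights, bias)]


def auxiliary_loss(batch, feature_fn, weights, bias):
    """Mean squared error between predicted and observed next observations."""
    total = 0.0
    for obs, action, next_obs in batch:
        z = feature_fn(obs, action)  # observation-action representation
        pred = linear_predict(weights, bias, z)
        total += sum((p - t) ** 2 for p, t in zip(pred, next_obs))
    return total / len(batch)


def sample_minibatch(replay_buffer, batch_size, rng):
    """Uniformly sample stored transitions (assumption: without replacement)."""
    return rng.sample(replay_buffer, min(batch_size, len(replay_buffer)))


def concat_features(obs, action):
    """Stand-in representation: plain concatenation of observation and action."""
    return list(obs) + list(action)


# Toy transitions (obs, action, next_obs) with dynamics next_obs = obs + action.
replay_buffer = [([1.0], [2.0], [3.0]), ([0.5], [0.5], [1.0])]
batch = sample_minibatch(replay_buffer, 2, random.Random(0))
perfect = auxiliary_loss(batch, concat_features, [[1.0, 1.0]], [0.0])
untrained = auxiliary_loss(batch, concat_features, [[0.0, 0.0]], [0.0])
```

In this toy setting a predictor that matches the dynamics yields zero loss, whereas an all-zero predictor does not. In the actual method the gradient of this loss would also flow into the parameters of the representation mappings.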
4.2 Network architecture
A neural network is used to represent the mappings $\phi_o$ and $\phi_{o,a}$ in OFENet. As it is known that deeper networks have advantages with respect to optimization ability and expressiveness, we employ them in our network architecture. In addition to this, we also leverage the fact that observations often have intuitively useful information in non-image RL tasks. For example, when the position and velocity of a robot are present in an observation, it is advantageous to include them in the representation when solving reaching tasks. Moreover, because a linear combination of position and velocity can approximate the position at the next time step, outputs of shallow layers are also expected to be physically meaningful.
To combine the advantages of deep layers and shallow layers, we use MLP-DenseNet, which is a slightly modified version of DenseNet (Huang et al., 2017), as the network architecture of OFENet. Each layer of MLP-DenseNet has an output $y$, which is the concatenation of the input $x$ and the activated product of a weight matrix $W$ and $x$, defined as:
$$y = [x, \sigma(W x)]$$
where $[x_1, x_2]$ means concatenation, $\sigma$ is the activation function, and the biases are omitted to simplify notation. Since each layer's output is contained in the next layer's output, the raw input and the outputs of shallow layers are naturally contained in the final output.
The mappings $\phi_o$ and $\phi_{o,a}$ are each represented with an MLP-DenseNet. The mapping $\phi_o$ receives the observation $o_t$ as input, and the mapping $\phi_{o,a}$ receives the concatenation of the observation representation $z_{o_t}$ and the action $a_t$ as its input. Figure 2 shows an example of these mappings in the proposed OFENet.
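The concatenating layer structure described above can be sketched in plain Python. This is an illustration only: the layer count, unit count, `tanh` activation, and weight initialization are arbitrary assumptions, Batch Normalization is left out, and the helper names (`dense_layer`, `init_dense_net`, `dense_net`) are hypothetical.

```python
import math
import random


def dense_layer(x, weight, activation=math.tanh):
    """Return the concatenation [x, activation(W x)] (biases omitted)."""
    hidden = [activation(sum(w * v for w, v in zip(row, x))) for row in weight]
    return list(x) + hidden


def init_dense_net(input_dim, num_layers, units, rng):
    """Create weight matrices whose input width grows by `units` per layer."""
    weights, dim = [], input_dim
    for _ in range(num_layers):
        weights.append([[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                        for _ in range(units)])
        dim += units
    return weights


def dense_net(x, weights):
    """Apply the layers in order; every earlier output stays in the result."""
    for weight in weights:
        x = dense_layer(x, weight)
    return x


rng = random.Random(0)
obs, action = [0.3, -1.2, 0.5], [0.1, -0.4]  # toy values
phi_o = init_dense_net(len(obs), num_layers=2, units=4, rng=rng)
z_o = dense_net(obs, phi_o)  # observation representation
phi_oa = init_dense_net(len(z_o) + len(action), num_layers=2, units=4, rng=rng)
z_oa = dense_net(z_o + action, phi_oa)  # observation-action representation
```

With these toy sizes the representation grows by `num_layers * units` entries per mapping, and the raw observation and action remain readable at the front of the output vectors, which is the property the text attributes to the dense connectivity.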
RL algorithms take the learned representations $z_{o_t}$ and $z_{o_t, a_t}$ as input, and compute the optimal policy by optimizing the regular objective of maximizing the expected reward. It is important to note, however, that these representations are learned simultaneously with the RL algorithms. This might change the distribution of the inputs to the RL algorithm, which therefore needs to adapt to possible distribution changes in the base layers constructed by MLP-DenseNet. To alleviate this potential problem, we normalize the output of the base layers using Batch Normalization (Ioffe and Szegedy, 2015) to suppress changes in input distributions.
4.3 Hyperparameter selection
Effective training of agents with OFENet requires selecting the size of the hidden layers and the type of activation function. Henderson et al. (2018) show that the size of the hidden layers and the type of activation function can greatly affect the performance of RL algorithms, and that the combination that achieves the best performance depends strongly on the environment and the algorithm. Thus, ideally, we would like to choose the architecture of OFENet by measuring the performance on the actual RL task, but this would be very inefficient. Therefore, we use the performance on the auxiliary task as an indicator for selecting the architecture for each task.
In order to measure performance on the auxiliary task, we first collect transitions using a random action policy for the agent. The transitions are randomly split into a training set and a test set. Then, we train each architecture on the training set, and measure the average loss on the test set over five different random seeds. We call this average loss the auxiliary score. On the actual task, we use the architecture that achieves the minimum auxiliary score. Since we can reuse the transitions for each OFENet training, and do not have to train RL agents, this procedure is sample-efficient and does not incur much computational cost either.
We use an experience replay buffer to simulate the learning with the RL agent, where OFENet samples a minibatch from up to the $t$-th data item of the training set at the $t$-th step.
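The selection procedure above can be sketched as follows. The loss values are made-up placeholders, the function names are hypothetical, and sampling with replacement from the first `step` items is an assumption about how the growing replay buffer might be simulated.

```python
import random
from statistics import mean


def sample_up_to(training_set, step, batch_size, rng):
    """Sample (with replacement) from the first `step` items only,
    imitating a replay buffer that has received `step` transitions."""
    return [training_set[rng.randrange(step)] for _ in range(batch_size)]


def select_architecture(test_losses):
    """Average each architecture's test loss over seeds (its auxiliary
    score) and return the architecture with the smallest score."""
    scores = {name: mean(losses) for name, losses in test_losses.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical per-seed test losses for two candidate architectures.
test_losses = {
    "arch_a": [0.30, 0.34, 0.31, 0.29, 0.33],
    "arch_b": [0.21, 0.25, 0.22, 0.24, 0.23],
}
best, scores = select_architecture(test_losses)

rng = random.Random(0)
training_set = list(range(100))  # stand-in for stored transitions
early_batch = sample_up_to(training_set, step=10, batch_size=32, rng=rng)
```

Restricting sampling to the first `step` items means early updates only see early transitions, which is one way to mimic the data availability of online training without running an RL agent.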
The pseudocode for the proposed method is presented in Algorithm 1. Note that we do not tune the hyperparameters of the RL algorithm, in order to show that agents can learn effective policies by using the representations learned by the proposed method during the learning process. This allows more flexibility in training RL agents across a wide range of tasks.
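As a rough illustration of how the auxiliary task and the actual task could be interleaved online, the following sketch alternates one OFENet update and one RL update per environment step. It is not a transcription of Algorithm 1: the callback signatures, the sampling scheme, and the absence of episode resets are all simplifying assumptions, and `train_online` is a hypothetical name.

```python
import random


def train_online(env_step, policy, update_ofe, update_agent,
                 initial_obs, num_steps, batch_size, seed=0):
    """Collect one transition per step, then update OFENet and the agent."""
    rng = random.Random(seed)
    replay_buffer = []
    obs = initial_obs
    for _ in range(num_steps):
        action = policy(obs)
        next_obs, reward = env_step(obs, action)
        replay_buffer.append((obs, action, reward, next_obs))
        k = min(batch_size, len(replay_buffer))
        update_ofe(rng.sample(replay_buffer, k))    # auxiliary task update
        update_agent(rng.sample(replay_buffer, k))  # actual task update
        obs = next_obs
    return replay_buffer


# Toy usage: scalar observations, a constant policy, and counting callbacks.
calls = {"ofe": 0, "agent": 0}


def count_ofe(batch):
    calls["ofe"] += 1


def count_agent(batch):
    calls["agent"] += 1


buffer = train_online(
    env_step=lambda o, a: (o + a, 1.0),
    policy=lambda o: 1.0,
    update_ofe=count_ofe,
    update_agent=count_agent,
    initial_obs=0.0,
    num_steps=5,
    batch_size=2,
)
```

In a real agent the `update_agent` callback would pass the sampled observations and actions through the current representation mappings before computing the RL loss.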
5 Experiments
In this section, we try to answer the following questions with our experiments to describe the performance of OFENet.

What is a good architecture that learns effective state and state-action representations for training better RL agents?

Can OFENet learn more sample-efficient and better-performing policies when compared to some of the state-of-the-art techniques?

What leads to the performance gain obtained by OFENet?

How does the dimensionality of the OFENet representation affect performance?
In the rest of this section, we present experiments designed to answer the above questions. All these experiments are done in the MuJoCo simulation environment.
5.1 Architecture comparison
We compare the auxiliary score defined in Section 4.3 with the performance on the actual task for various network architectures on the Walker2d-v2 task, where the dimensions of observations and actions are respectively and .
First, we define the actual score, which is the metric of performance on the actual task. In this paper,

return represents a cumulative reward over an episode;

step score represents the average return over 10 episodes with the RL agent at each step;

actual score represents the average of the step scores over the latest 100,000 steps, where the step score is measured every 5,000 steps.
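The three metrics defined above can be computed as in the following sketch. It assumes that step scores are stored in chronological order, one per evaluation, and that the latest 100,000 steps correspond to the last `window // eval_interval` evaluations; the text does not say whether both endpoints of the window are included. The function names are hypothetical.

```python
def episode_return(rewards):
    """Return: cumulative reward over one episode."""
    return sum(rewards)


def step_score(episode_returns):
    """Step score: average return over the evaluation episodes at one step."""
    return sum(episode_returns) / len(episode_returns)


def actual_score(step_scores, eval_interval=5_000, window=100_000):
    """Actual score: mean of the step scores inside the latest window."""
    num_evals = window // eval_interval
    latest = step_scores[-num_evals:]
    return sum(latest) / len(latest)


# Toy data: 100 evaluations whose step scores are simply 0, 1, ..., 99.
toy_step_scores = [float(i) for i in range(100)]
toy_actual = actual_score(toy_step_scores)
```

With the default arguments the window covers the last 20 evaluations, so in the toy data only the scores 80 through 99 contribute.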
We measure the auxiliary score and the actual score for each network architecture. In this section, each architecture is characterized by its connectivity architecture, number of layers, and activation function. We compare three connectivity architectures: MLP-DenseNet as defined in Section 4.2, standard MLP, and MLP-ResNet, which is a slightly modified version of ResNet (He et al., 2016). MLP-ResNet has skip connections similar to the original one, and its layers have the output defined as:
(2) 
where are weight matrices, is the input, and is the activation function. The biases are omitted to simplify notation.
Each architecture has multiple options for the combination of the number of layers and the hidden layer size. In this experiment, the two mappings have the same number of layers for each architecture. MLP has 1, 2, 3, or 4 layers for each mapping. MLP-ResNet and MLP-DenseNet have 2, 4, 6, or 8 layers for each mapping.
To find the most efficient architecture over same feature size, the dimensions of are respectively fixed to and . This means that the dimensionality increments of from their inputs are . While the numbers of hidden units in are respectively and in MLP and MLPResnet, the number of hidden units in MLPDenseNet depends on the number of layers. All the layers in MLPDenseNet have the same number of hidden units. For example, when has 4 layers, the number of hidden units is .
Additionally, we compare the following activation functions: ReLU, tanh, Leaky ReLU, Swish (Ramachandran et al., 2017) and SELU (Klambauer et al., 2017). In total, we compare 3 connectivity architectures, 4 layer-size combinations, and 5 activation functions, resulting in a total of 60 network architectures.
To measure the auxiliary score, we collect 100K transitions as a training set and 20K transitions as a test set, using a random policy. Each architecture is trained for 100K steps. To measure the actual score, each architecture is trained with the SAC agent for 500K steps with Algorithm 1. The SAC agent is trained with the hyperparameters described in (Haarnoja et al., 2018), where the networks have two hidden layers which have 256 units. All the networks are trained with minibatches of size 256 and Adam optimizer, with a learning rate .
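The architecture grid described earlier in this section (three connectivity architectures, four layer counts each, and five activation functions) can be enumerated with a short sketch. The string labels and the `architecture_grid` helper are illustrative only; hidden layer sizes are not modeled here.

```python
from itertools import product

# Layer-count options per connectivity architecture, as listed in the text.
CONNECTIVITY_LAYERS = {
    "MLP": (1, 2, 3, 4),
    "MLP-ResNet": (2, 4, 6, 8),
    "MLP-DenseNet": (2, 4, 6, 8),
}
ACTIVATIONS = ("ReLU", "tanh", "Leaky ReLU", "Swish", "SELU")


def architecture_grid():
    """Enumerate every (connectivity, layer count, activation) combination."""
    return [
        (connectivity, num_layers, activation)
        for connectivity, layers in CONNECTIVITY_LAYERS.items()
        for num_layers, activation in product(layers, ACTIVATIONS)
    ]


grid = architecture_grid()
dense_only = [arch for arch in grid if arch[0] == "MLP-DenseNet"]
```

Counting the entries gives 60 candidates in total, 20 of which use the MLP-DenseNet connectivity, matching the number of DenseNet architectures considered when selecting the network for each environment.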
Figure 3 shows the actual score and the auxiliary score of each network architecture in Walker2d-v2. The results show that DenseNet consistently achieves higher actual scores than other connectivity architectures. With respect to the auxiliary score, DenseNet also achieves better performance than others in many cases.
Overall, we can find a weak trend that the smaller the auxiliary scores, the better the actual scores. Therefore, in the following experiment, we select the network architecture that has the smallest value of the auxiliary score among the 20 DenseNet architectures for the actual task for each environment.
5.2 Comparative evaluation
Table 1: Highest average returns over five different seeds.
Environment | SAC OFE (ours) | SAC Original | ML-SAC (1/3) | ML-SAC (OFE like) | TD3 OFE (ours) | TD3 Original | PPO OFE (ours) | PPO Original
Hopper-v2 | 3511.6 | 3316.6 | 750.5 | 868.7 | 3488.3 | 3613.0 | 2525.6 | 1753.5
Walker2d-v2 | 5237.0 | 3401.5 | 667.4 | 627.4 | 4915.1 | 4515.6 | 3072.1 | 3016.7
HalfCheetah-v2 | 16964.1 | 14116.1 | 1956.9 | 11345.5 | 16259.5 | 13319.9 | 3981.8 | 2860.4
Ant-v2 | 8086.2 | 5953.1 | 4950.9 | 2368.3 | 8472.4 | 6148.6 | 1782.3 | 1678.9
Humanoid-v2 | 9560.5 | 6092.6 | 3458.2 | 331.7 | 120.6 | 345.2 | 670.3 | 652.4
To evaluate OFENet, we measure the performance of SAC and the twin delayed deep deterministic policy gradient algorithm (TD3) (Fujimoto et al., 2018) as off-policy RL algorithms, and proximal policy optimization (PPO) (Schulman et al., 2017) as an on-policy RL algorithm, with OFENet representations and raw observations, on continuous control tasks in the MuJoCo environment. The dimensionality increments of from their inputs are in all experiments, and we select the best network architecture for each task as described in Section 4.3. The network architectures, optimizer, and hyperparameters of SAC, TD3, and PPO are the same as those used in their original papers (Haarnoja et al., 2018; Fujimoto et al., 2018; Schulman et al., 2017), even when we combine them with OFENet. The minibatch size for TD3, however, differs from the original paper (Fujimoto et al., 2018). We use minibatches of size instead of , similarly to SAC.
Moreover, we measure the performance of SAC with the representations produced by the model network of ML-DDPG, which we call ML-SAC. The hidden layer size of ML-SAC is , its activation function is ReLU, and the tradeoff weight from Section 3.2 is . We train the model network with an Adam optimizer, with a learning rate . We set the dimension of the observation representation to one third of that of the observation, following (Munk et al., 2016). In addition to this, we measure the performance of ML-SAC with an observation representation that has the same dimension as that of OFENet. Whereas in (Munk et al., 2016) the model network was trained with samples collected before the learning of the agent, we train the network with samples collected by the learning agent, as with OFENet.
In order to eliminate dependency on the initial parameters of the policy, we use a random policy to store transitions in the experience replay buffer for the first 10K time steps for SAC, and 100K time steps for TD3 and PPO as described in (Fujimoto et al., 2018). We also pretrain OFENet with these randomly collected samples to stabilize the input to each RL algorithm. Note that, as described in Section 4.1, OFENet predicts the future observation to learn the high-dimensional representations. In Ant-v2 and Humanoid-v2, the observation contains the external forces on bodies, which are difficult to predict because of their discontinuity and sparsity. Thus, OFENet does not predict these external forces.
Figure 4 shows the learning curves of the methods, and Table 1 shows the highest average returns over five different seeds. SAC (OFE), i.e. SAC with OFENet representations, outperforms SAC (raw), i.e. SAC with raw observations. Especially in Walker2d-v2, Ant-v2, and Humanoid-v2, the sample efficiency and final performance of SAC (OFE) significantly outperform those of the original SAC. Since TD3 (OFE) and PPO (OFE) also outperform the original algorithms, it can be concluded that OFENet is an effective method for improving deep RL algorithms on various benchmark tasks.
ML-SAC (1/3), i.e. ML-SAC with low-dimensional representations, performed poorly on all tasks. Since ML-DDPG is supposed to find compact representations from noisy observations, the model network probably could not find a compact representation from the non-redundant observations in the tasks. ML-SAC (OFE like), i.e. ML-SAC with high-dimensional representations, also performed poorly. In addition to this, extracting representations with MLP yielded much worse actual scores than MLP-DenseNet in Section 5.1. These results show that constructing high-dimensional representations is not a trivial task, and OFENet resolves this difficulty with MLP-DenseNet.
5.3 Ablation study
Our hypothesis is that we can extract effective features from an auxiliary task in environments with low-dimensional observation vectors. Furthermore, we would like to verify that just increasing the dimensionality of the state representation will not help the agent to learn better policies and that, in fact, generating effective higher-dimensional representations using OFENet is required to get better performance. To verify this, we conducted an ablation study to show which components the improvement of OFENet comes from.
Figure 5 shows the ablation study over SAC with the Ant-v2 environment. full and raw are the same plots as SAC (OFE) and SAC (original) from Figure 4.
no-bn removes Batch Normalization from OFENet. The bigger standard deviation of no-bn indicates that adding Batch Normalization stabilizes the full learning process. Since OFENet is learned online, the distribution of the input to the RL algorithms changes during training. Batch Normalization effectively works to suppress this covariate shift during training, and thus the learning curve of full is more stable than that of no-bn.
no-aux removes the auxiliary task and trains both OFENet and the RL algorithm with the actual task objective of reward maximization. The much lower scores of no-aux show that learning the complex OFENet structure from just the reinforcement signal is difficult, and that using the auxiliary task for learning good high-dimensional features enables better learning of the control policy.
same-params increases the number of units of the original SAC to , instead of as suggested in (Haarnoja et al., 2018), so that it has the same number of parameters as our algorithm. The performance does increase compared to the original unit size, but it is still not as good as the full algorithm in terms of both sample efficiency and performance. This shows that just increasing the number of parameters does not help improve performance, but the auxiliary task helps with efficient exploration in the bigger parameter space.
As done in (Munk et al., 2016), freeze-ofe trains OFENet only before the training of the RL agent, with randomly collected transitions as discussed in Section 5.1, and does not simultaneously train OFENet along with the RL policy (i.e., skip line and in Algorithm 1). Since the accuracy of predicting future observations becomes worse when an RL agent explores unseen observation space, freezing an OFENet trained with only randomly collected data cannot produce good representations.
5.4 Effect of Dimensionality of Representation
In this section, we test whether increasing the dimension of the OFENet representation leads to monotonic improvements in the performance of the RL agent. Figure 6 shows the improvement in the performance of an SAC agent on the HalfCheetah-v2 environment when we increase the dimension of the OFENet representation by increasing the number of hidden units in an 8-layer OFENet from to . The step score of the RL agent generally increases with the dimensionality of the representation, until a threshold is reached. This shows that we need sufficient output dimensionality to get the benefit of increasing the dimensionality of state representations using any feature extraction network.
6 Conclusion
So, can increasing input dimensionality improve deep reinforcement learning? Recent success of deep learning has allowed us to design RL agents that can learn very sophisticated policies for solving very complex tasks. It is a common belief that smaller state representations help in learning complex RL policies. In this paper, we wanted to challenge this hypothesis, with the motivation that larger feature representations for state allow a bigger solution space and thus can lead to better policies for RL agents. To demonstrate this, we presented the Online Feature Extractor Network (OFENet), a neural network that provides a much higher-dimensional representation for originally low-dimensional input to accelerate the learning of RL agents. Contrary to popular belief, we provide evidence suggesting that representations that are much higher-dimensional than the original observations can significantly accelerate the learning of RL agents. In fact, our experimental evaluation demonstrated that OFENet representations can achieve state-of-the-art performance on various benchmark tasks involving difficult continuous control problems using both on-policy and off-policy algorithms. Our results suggest that RL tasks where the observation is low-dimensional can benefit from state representation learning. Additionally, feature learning by OFENet does not require tuning the hyperparameters of the underlying RL algorithm. This allows flexible design of RL agents where the feature learning is separated from policy learning. Another important observation is that the high-dimensional representations should be learned so as to retain some knowledge of the task or the system. In the current paper, they were learned using the auxiliary task of predicting the next observation.
In the future, we would like to study the proposed network for high-dimensional inputs to RL agents (e.g., images). We would also like to make the proposed method work with several other on-policy techniques. Additionally, we would like to study better techniques for finding the optimal architectures for OFENet, to add more flexibility to the learning process.
References
 Autoencoderaugmented neuroevolution for visual doom playing. pp. 1–8. Cited by: §2.
 Sampleefficient reinforcement learning with stochastic ensemble value expansion. pp. 8224–8234. Cited by: §1.
 Infogan: interpretable representation learning by information maximizing generative adversarial nets. pp. 2172–2180. Cited by: §2.
 Wide & Deep Learning for Recommender Systems. CoRR. External Links: 1606.07792, Link Cited by: §2.
 Benchmarking Deep Reinforcement Learning for Continuous Control. pp. 1329–1338. External Links: Link Cited by: §4.1.
 Addressing Function Approximation Error in Actor-Critic Methods. CoRR. External Links: 1802.09477, Link Cited by: §A.1, §1, §1, §5.2, §5.2.
 Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. pp. 3389–3396. External Links: Document Cited by: §1.
 DeepFM: A Factorization-Machine based Neural Network for CTR Prediction. CoRR. External Links: 1703.04247, Link Cited by: §2.
 Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. CoRR. External Links: 1801.01290, Link Cited by: §A.1, §1, §1, §5.1, §5.2, §5.3.
 Deep Residual Learning for Image Recognition. pp. 770–778. External Links: Document, ISSN 1063-6919 Cited by: §5.1.
 Deep Reinforcement Learning That Matters. pp. 3207–3214. Cited by: §1, §4.3.
 Agent modeling as auxiliary task for deep reinforcement learning. pp. 31–37. Cited by: §2.
 Densely Connected Convolutional Networks. pp. 2261–2269. External Links: Document, ISSN 1063-6919 Cited by: §1, §4.2.
 Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei (Eds.), Proceedings of Machine Learning Research, Vol. 37, Lille, France, pp. 448–456. External Links: Link Cited by: §4.2.
 Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397. Cited by: §2, §4.1.
 State Representation Learning in Robotics: Using Prior Knowledge about Physical Interaction. Cited by: §1.
 Learning state representations with robotic priors. Autonomous Robots 39 (3), pp. 407–428. Note: learns a low-dimensional representation from images, using various robotic "priors", i.e., physical common sense. External Links: Document, ISSN 1573-7527 Cited by: §2.
 Uncertainty-driven imagination for continuous deep reinforcement learning. pp. 195–206. Cited by: §1.
 Self-Normalizing Neural Networks. In Advances in Neural Information Processing Systems 30, pp. 971–980. Cited by: §5.1.
 Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592. Cited by: §1.
 State Representation Learning for Control: An Overview. CoRR. External Links: 1802.04181, Link Cited by: §1, §2.
 Continuous control with deep reinforcement learning. CoRR abs/1509.0. External Links: 1509.02971, Link Cited by: §3.2.
 The Expressive Power of Neural Networks: A View from the Width. CoRR. External Links: 1709.02540, Link Cited by: §1.
 Playing Atari with Deep Reinforcement Learning. CoRR. External Links: 1312.5602, Link Cited by: §1.
 Learning state representation for deep actor-critic control. pp. 4667–4673. External Links: Document Cited by: §2, §3.2, §5.2, §5.3.
 Safe and efficient off-policy reinforcement learning. pp. 1054–1062. Cited by: §1.
 Searching for Activation Functions. CoRR. External Links: 1710.05941, Link Cited by: §5.1.
 Trust region policy optimization. pp. 1889–1897. Cited by: §1.
 Proximal Policy Optimization Algorithms. CoRR abs/1707.0. External Links: 1707.06347, Link Cited by: §A.1, §1, §1, §5.2.
 Sim-to-real transfer learning using robustified controllers in robotic tasks involving complex dynamics. pp. 6001–6007. External Links: Document, ISSN 1050-4729 Cited by: §2.
 Stable reinforcement learning with autoencoders for tactile and visual data. IEEE International Conference on Intelligent Robots and Systems 2016Novem, pp. 3928–3934. External Links: Document, ISBN 9781509037629, ISSN 2153-0866 Cited by: §2.
 Embed to control: a locally linear latent dynamics model for control from raw images. Cambridge, MA, USA, pp. 2746–2754. Cited by: §1.
 Decoupling Dynamics and Reward for Transfer Learning. CoRR. External Links: 1804.10689, Link Cited by: §2, §2.
Appendix A Appendix
A.1 Network architecture
Table 2 shows the network architecture of the OFENet for each environment. As described in Section 5.1, the selected architecture is the one that achieves the smallest auxiliary score among 20 DenseNet architectures: the number of layers is selected from a set of candidate values, and the activation function is selected from {ReLU, tanh, Leaky ReLU, Swish, SELU}.
Table 2: OFENet architecture for each environment.
Environment     Number of layers  Activation function
Hopper-v2       6                 Swish
Walker2d-v2     6                 Swish
HalfCheetah-v2  8                 Swish
Ant-v2          6                 Swish
Humanoid-v2     8                 Swish
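The selection procedure described above amounts to a small grid search over depth and activation function. The following is a minimal sketch under stated assumptions, not the authors' implementation: the depths in `LAYER_CANDIDATES` are hypothetical placeholders (chosen only so that the grid has 20 entries), `select_architecture` is an illustrative helper, and `toy_score` is a made-up stand-in for training an OFENet with a given architecture and measuring its auxiliary score.

```python
from itertools import product

# Activation functions listed in the text.
ACTIVATIONS = ["ReLU", "tanh", "Leaky ReLU", "Swish", "SELU"]
# Hypothetical placeholder depths; the actual candidate set is an assumption.
LAYER_CANDIDATES = [2, 4, 6, 8]


def select_architecture(score_fn, layer_candidates=LAYER_CANDIDATES,
                        activations=ACTIVATIONS):
    """Return the (num_layers, activation) pair with the smallest score."""
    candidates = list(product(layer_candidates, activations))
    return min(candidates, key=lambda arch: score_fn(*arch))


def toy_score(num_layers, activation):
    """Toy stand-in for the auxiliary score; not a measured result."""
    penalty = 0.0 if activation == "Swish" else 1.0
    return abs(num_layers - 6) + penalty


best = select_architecture(toy_score)
print(best)  # (6, 'Swish') under the toy score above
```

In practice `score_fn` would train each candidate on collected transitions and return its auxiliary score, so the cost of the search grows linearly with the number of candidates.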
A.2 Hyperparameters
As with the network architecture, the hyperparameters of the RL algorithms are the same as in their original papers, except for the batch size of TD3, which is changed from its original value and set similarly to SAC.
Table 3 shows the number of parameters of SAC (OFE) used in the experiments. The number of parameters of sameparams in Figure 5 matches the sum of the number of parameters of OFENet and that of SAC. Note that the OFENet uses an MLP-DenseNet architecture, so the output units of OFENet listed in Table 3 exclude the units carried over from the previous layers.
Table 3: Number of parameters of SAC (OFE).
Network    Layer         Input units  Output units  Parameters
OFENet     1st layer
           2nd layer
           3rd layer
           4th layer
           5th layer
           6th layer
           Total
SAC        1st layer
           2nd layer
           Output layer
           Total
SAC (OFE)  Total
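The per-layer bookkeeping behind such a table can be sketched as follows. This is an illustrative sketch rather than the authors' code: it assumes each layer is a fully connected layer with a bias whose output is concatenated with its input before the next layer, it does not count any normalization parameters, and the sizes used are hypothetical, not the values from Table 3. The function name `densenet_mlp_layers` is also hypothetical.

```python
def densenet_mlp_layers(input_dim, hidden_units, num_layers):
    """List (input_units, output_units, parameters) for each dense layer.

    Assumes a fully connected layer with a bias, whose output is concatenated
    with its input before the next layer (MLP-DenseNet style). Normalization
    parameters, if any, are not counted.
    """
    rows = []
    in_units = input_dim
    for _ in range(num_layers):
        params = in_units * hidden_units + hidden_units  # weights + biases
        rows.append((in_units, hidden_units, params))
        in_units += hidden_units  # concatenation grows the next layer's input
    return rows, in_units  # in_units is now the final representation size


# Hypothetical sizes for illustration only.
rows, feature_dim = densenet_mlp_layers(input_dim=10, hidden_units=4, num_layers=3)
total = sum(p for _, _, p in rows)
print(rows)         # [(10, 4, 44), (14, 4, 60), (18, 4, 76)]
print(feature_dim)  # 22
print(total)        # 180
```

The output-units column lists only the newly produced units of each layer, while the input-units column grows by that amount at every step, which is why the final representation is larger than the original input.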