I Introduction
Artificial intelligence (AI) is the key element for bringing robots into everyday life. Robots will be asked to accomplish many different and complex tasks (e.g. navigation and exploration of unknown environments, object manipulation, and human interaction), and these challenges require the ability to extract meaningful information or features from the data perceived by the sensors. Because of the high task complexity, multiple sensor modalities are usually employed. The so-called observation space, i.e. the space containing the sensory data, has a dimensionality much higher than the so-called state space, i.e. the space containing the meaningful information for solving the task.
Traditionally, this leads to complicated manual preprocessing of the data, feature engineering, and coding of the task solution. Even though very successful, feature engineering suffers from a lack of generalizability and reusability in different contexts. Each new task usually requires a new preprocessing stage and often the coding of a new solution.
Deep Reinforcement Learning (RL) [18] has been used for decision making in many different scenarios without the need for any feature engineering. RL aims at learning the mapping from the observation space to the action space directly from the data obtained through the interaction with the environment and the reward received for each action taken. The direct end-to-end mapping from observation to action has successfully solved a huge variety of tasks [16] (e.g. video games, robot path planning, dexterous manipulation, etc.), but it usually requires a large amount of data that is often not easy to obtain (e.g. when training on real robotics hardware). Furthermore, there is no control over the learning of the task-relevant information: the RL algorithms extract the important features from the input data without any supervision.
State Representation Learning (SRL) is the name given to the process of learning and encoding the task-meaningful information from the observation space to the so-called state space, i.e. the space containing only the task-relevant information. Usually, the state space has a dimensionality much smaller than the observation space. The mapping from observations to states can be learned with supervised learning methods using labeled data, i.e. the true values of the states. However, these are difficult and expensive to obtain. In this work, we specifically focus on a method for tackling the state representation learning problem using unsupervised or self-supervised learning, i.e. without the use of the true values of the states. However, to aid the learning of a meaningful representation we use the concept of priors introduced by [3] and further developed by [8]. With the priors, we model the prior knowledge about the world that can be used to inject information into the state representation learning problem. For example, it is possible to phrase these priors as loss functions for neural networks. The authors believe that unsupervised and self-supervised methods combined with general prior knowledge of the world are the keys to achieving higher degrees of intelligence and autonomy in robotics.
In this work, we aim at incorporating the reward function properties into the state representation learning process through the priors. We extend the concept of the priors to multiple sensor modalities (a very common scenario in robotics), to multi-target navigation tasks, and to transfer learning from simulation to a real robot. The general framework used is shown in Figure 1.
The rest of the paper is organized as follows: Section II presents the related work in the scope of this paper, while Section III provides the theoretical background on RL and SRL. Then, Section IV discusses the proposed methodology. Section V describes the experiments designed and Section VI presents and discusses the results obtained. Section VII concludes the paper.
II Related work
SRL aims at learning the correct encoding of the state information from the raw sensor observations. The quality of the state representation is crucial for decision making, for the performance of RL algorithms, and for their generalization capabilities. The mapping from observations to states is commonly learned with neural networks [10], mostly using autoencoders (AEs) and variations of these (e.g. variational AEs, denoising AEs, etc.). According to [10], three main methodologies can be followed to learn meaningful state representations for RL.
The first one relies on observation reconstruction using, for example, AEs. An AE is a neural network composed of an encoder that maps observations to latent state variables of lower dimensionality and a decoder that reconstructs the observations from these latent variables. Because of the imposed dimensionality reduction, the autoencoder tries to extract the relevant features from the observations in order to minimize the reconstruction error loss. Variations of autoencoder learning are used in [6], [14], [11] and [2].
Second, it is possible to leverage forward models, i.e. models predicting the next state given the current state and action, and inverse models, i.e. models predicting the action given the current state and the next one. Forward and inverse models are used in [7], [20], [9] and [1].
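To make the reconstruction-based methodology concrete, the following is a minimal NumPy sketch of an encoder, a decoder, and the reconstruction error they are trained to minimize. It is only an illustration under stated assumptions: the single-layer encoder and decoder, the `tanh` non-linearity, the mean-squared-error loss, and all dimensions and names are hypothetical choices, not taken from the cited works.

```python
import numpy as np

def encode(obs, w_enc):
    """Map observations (batch, obs_dim) to latent states (batch, state_dim)."""
    return np.tanh(obs @ w_enc)

def decode(state, w_dec):
    """Reconstruct observations (batch, obs_dim) from latent states."""
    return state @ w_dec

def reconstruction_loss(obs, w_enc, w_dec):
    """Mean squared error between observations and their reconstructions."""
    recon = decode(encode(obs, w_enc), w_dec)
    return float(np.mean((obs - recon) ** 2))

rng = np.random.default_rng(0)
obs = rng.normal(size=(32, 64))               # hypothetical 64-dim observations
w_enc = rng.normal(scale=0.1, size=(64, 4))   # hypothetical 4-dim state space
w_dec = rng.normal(scale=0.1, size=(4, 64))
loss = reconstruction_loss(obs, w_enc, w_dec)
```

Training would adjust `w_enc` and `w_dec` by gradient descent on this loss; the narrow latent layer is what forces the encoder to keep only the most informative features.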
The third methodology, the one used in this work, uses prior knowledge about the task and the environment to shape the state space. The prior knowledge is encoded in the form of loss functions used to train the neural network in charge of the observation-state mapping. The works proposed in [8], [11] and [17] belong to this category.
Independently of the chosen method, the state representation should be able to efficiently compress the observation space, with minimum information loss, to a state space with Markovian properties [4], i.e. from a single state it is possible to choose the best action without ambiguity. The aim is to transform a Partially Observable Markov Decision Process (POMDP) in the observation space, which is difficult to solve and requires memory and a large number of samples, into a simple Markov Decision Process (MDP) in the state space that can be efficiently solved by any RL algorithm. The state representation should also be able to generalize to unseen observations with similar features.
III Background
The main elements of RL [18] are the agent and the environment. The agent, by interacting with the environment, learns the mapping between state and action, i.e. the policy, by receiving a reward for each action taken. The ultimate goal of the agent is to find the optimal policy, i.e. the policy that maximizes the total cumulative discounted reward in Equation (1).
(1) 
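As a small worked example of the total cumulative discounted reward, the sketch below sums the rewards of one episode, each weighted by the discount factor raised to its time step. The function name and the example values are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards weighted by gamma**t, accumulated from the last step back."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

episode_rewards = [1.0, 1.0, 1.0]
print(discounted_return(episode_rewards, gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

A discount factor below 1 makes rewards that arrive later count less, which is why the agent prefers trajectories that collect reward early.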
Many RL algorithms estimate the state value function or the state-action value function and infer the optimal policy from it. These methods are called value-function-based approaches in the literature. Q-learning [21] is one of them. Q-learning learns the state-action value function, which is an estimate of how good it is to choose a certain action in a given state.
Deep Q-Network (DQN) [15] improves the original Q-learning by approximating the state-action value function with a neural network. However, while the algorithm is now capable of handling continuous state spaces or large discrete state-action spaces (very common in many applications), it inherits the training instabilities of the neural network. First, training neural networks assumes independent and identically distributed (i.i.d.) data, whereas in RL the samples are collected from trajectories and are thus strongly temporally correlated. This temporal correlation makes the training of the Q-network highly unstable, so Experience Replay [12] is used to break it by generating training batches composed of randomly sampled data points. The second problem is related to the loss function of the Q-network (see Equation (2)). The loss requires a target to compute the temporal difference error that is then backpropagated to adjust the parameters of the network. However, this target is non-stationary and it is predicted using the same network that is being updated. This again generates instability. To solve this issue, Double DQN (DDQN) [19] uses a copy of the Q-network to compute these target Q-values.
(2) 
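The two stabilization mechanisms discussed above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the standard Double DQN target of [19], in which the online network selects the next action and the copied (target) network evaluates it; the class, function, and argument names are hypothetical.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of transitions, sampled uniformly at random."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)  # oldest transitions are dropped

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling mixes transitions from different trajectories,
        # which breaks the temporal correlation between samples.
        return random.sample(list(self.storage), batch_size)

def ddqn_target(reward, done, q_online_next, q_target_next, gamma):
    """Temporal-difference target for one transition.

    q_online_next / q_target_next: Q-values of the next state, one per action,
    from the online network and from its copy respectively.
    """
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]
```

The squared difference between this target and the online Q-value of the action actually taken is the temporal difference error that gets backpropagated.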
IV Methodology
IV-A Proposed approach
Learning useful representations of the environment is essential for autonomous robotics and decision making. However, the mapping from the observation space, usually high-dimensional, to the state space, usually lower-dimensional, is not straightforward. Note that by state space we mean the space of important information necessary for learning the optimal policy for a given task using reinforcement learning. In general, the ground truth information is not always available or easy to obtain. Therefore, we aim at learning a valid state representation in an unsupervised fashion. However, we employ generic domain knowledge to shape the state representation: the robotics priors [8]. In this work, we propose an adaptation of the original ones. The use of priors makes the learning of a state representation sample efficient and possible after a few training epochs.
The multimodal observations are fed to the StateNet (see Figure 2), i.e. the network in charge of encoding the important information from the data and compressing it into a lower-dimensional state vector. The StateNet design was inspired by the architecture proposed in [5].
The state vector is then passed as input to a Q-network (see Figure 3) in order to estimate the state-action value function, which is then used to choose the optimal action. DDQN was chosen for its simplicity and popularity, but the method is not dependent on this choice and any other RL algorithm can be used, with either discrete or continuous action spaces. This scheme is shown in Figure 1.
IV-B Reward-shaped priors
Our approach builds upon the priors introduced in [8] and aims at addressing the following research questions:

How can the reward function, through the priors, be used for shaping the learning of the state representation?

How can the concept of priors be extended to multiple sensor modalities, different environments, and multi-target navigation problems?

To what extent can the representation learned with the priors in the simulation environment be transferred effectively to the real robot without further retraining?
The priors used in this work are listed below, where and
Simplicity Prior: The task-relevant information lies in a space with dimensionality much smaller than the sensory observations.
Temporal Coherence: State changes are slow and dependent only on the most recent past. This can be interpreted as an enforcement of the Markov assumption.
(3) 
Reward Proportionality (new prior): Similar reward changes should induce similar state changes. These reward changes are the results of actions, but actions can be continuous or have different levels of abstraction (e.g. in the case of Hierarchical RL), and the notion of similarity is difficult to define in those cases. This new prior aims at clustering together states with similar reward variations independently of the kind of action taken.
(4) 
Causality (new prior): Dissimilar rewards are a symptom of state dissimilarity. By the same reasoning as above, the constraint to similar actions in the Causality prior of [8] is removed.
(5) 
Reward Repeatability (new prior): Reinforces the similarity of states when presenting the same reward variation, not only in magnitude, but also in direction.
(6) 
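The priors listed above can be read as loss terms computed over pairs of time steps. The sketch below is one possible NumPy rendering and should not be taken as the exact losses used in this work. It assumes state changes Δs_t = s_{t+1} - s_t and reward changes Δr_t = r_{t+1} - r_t, borrows the functional forms of the corresponding robotic priors in [8] with the equal-action condition replaced by a condition on the rewards, and introduces a hypothetical tolerance `tol` for deciding when two reward values count as similar. All function and key names are illustrative.

```python
import numpy as np

def prior_losses(states, rewards, tol=1e-6):
    """Return the four prior losses for one trajectory.

    states:  array (T + 1, n) of state predictions
    rewards: array (T + 1,) of rewards aligned with the states
    tol:     assumed tolerance for calling two reward values 'similar'
    """
    d_s = np.diff(states, axis=0)             # state changes, shape (T, n)
    d_r = np.diff(rewards)                    # reward changes, shape (T,)
    s, r = states[:-1], rewards[:-1]
    i, j = np.triu_indices(len(d_s), k=1)     # all pairs of time steps t1 < t2

    same_dr = np.abs(d_r[i] - d_r[j]) <= tol  # pairs with similar reward change
    diff_r = np.abs(r[i] - r[j]) > tol        # pairs with dissimilar reward

    closeness = np.exp(-np.sum((s[i] - s[j]) ** 2, axis=1))
    norm_gap = (np.linalg.norm(d_s[i], axis=1)
                - np.linalg.norm(d_s[j], axis=1)) ** 2
    change_gap = np.sum((d_s[i] - d_s[j]) ** 2, axis=1)

    def masked_mean(values, mask):
        return float(values[mask].mean()) if mask.any() else 0.0

    return {
        # State changes should be small.
        "temporal_coherence": float(np.mean(np.sum(d_s ** 2, axis=1))),
        # Similar reward changes should give state changes of similar size.
        "reward_proportionality": masked_mean(norm_gap, same_dr),
        # States with dissimilar rewards should lie far apart.
        "causality": masked_mean(closeness, diff_r),
        # Nearby states with similar reward changes should change alike.
        "reward_repeatability": masked_mean(closeness * change_gap, same_dr),
    }
```

For a trajectory that moves along a straight line with a constant reward increment, the proportionality and repeatability terms vanish, since every state change has the same magnitude and direction, while the causality term stays positive because states with different rewards are still a finite distance apart.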
The overall loss function, Equation (7), used for training the StateNet is equal to the weighted sum of the different priors with the addition of an L2-regularization term.
(7) 
The weights of the individual loss functions in Equation (7) were chosen to balance the contributions of the individual loss functions. This combination gave good empirical results, but no optimization procedure was run to find the best set of weights.
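The weighted sum with L2 regularization can be sketched as follows. The weight values, the dictionary keys, and the flat list of parameters are placeholders chosen for illustration, not the values or data structures used in this work.

```python
def total_loss(prior_losses, prior_weights, network_params, l2_weight):
    """Weighted sum of the prior losses plus an L2 penalty on the parameters."""
    weighted = sum(prior_weights[name] * value
                   for name, value in prior_losses.items())
    l2_penalty = sum(p * p for p in network_params)
    return weighted + l2_weight * l2_penalty

# Placeholder numbers, for illustration only.
example_losses = {"temporal_coherence": 2.0, "causality": 3.0}
example_weights = {"temporal_coherence": 1.0, "causality": 0.5}
example_params = [1.0, 2.0]
print(total_loss(example_losses, example_weights, example_params, l2_weight=0.1))
```

Because the individual priors can differ by orders of magnitude, the weights mainly serve to keep any single term from dominating the gradient.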
In RL, the reward function is defined and shaped based on task-specific knowledge to allow the agent to learn optimal behaviors. However, in the context of SRL, a task cannot be efficiently learned if an informative representation has not been learned yet. We believe that the best representation is the one that incorporates meaningful information for solving the task; therefore, it should not be learned independently of the chosen reward function. The new priors (4), (5), and (6) were developed to achieve this goal: shaping the state representation using not only the environment observations, but also the rewards. In particular, the reward variation between two states is used to further impose the Markov assumption during the observation compression step. Ideally, we would like to obtain a regularized state space that is Markovian, i.e. one in which a standard RL algorithm can choose the optimal action by looking at a single state prediction, without the need for any memory structure.
IV-C Neural network architecture and training regime
The StateNet (Figure 2) is an encoder network, i.e. a neural network with output dimensionality much smaller than the input dimensionality. The samples from the two sensor modalities are passed through two separate network branches, and both are used to make two independent state predictions of dimension n. The two predictions are concatenated and then fed through a final fully connected (dense) layer that combines them to produce the final state prediction, again of dimension n. The considerations on the choice of the state dimensionality are discussed in Section VI. The state predictions and the actions are then used to estimate the Q-values using the neural network shown in Figure 3.
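The two-branch structure described above can be sketched as a forward pass in NumPy. This is a simplified stand-in rather than the actual architecture: each branch is reduced to a single dense layer, the `tanh` activation and the flattened input sizes are hypothetical, and only the data flow (two per-modality predictions of dimension n, concatenation, one fusion layer) follows the description.

```python
import numpy as np

def dense(x, weights, bias):
    """Fully connected layer followed by a tanh non-linearity."""
    return np.tanh(x @ weights + bias)

def init_layer(rng, n_in, n_out):
    """Small random weights and zero bias for one dense layer."""
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

def state_net(camera, lidar, params):
    """Two-branch encoder: one state prediction per modality, then fusion."""
    s_camera = dense(camera, *params["camera"])            # (batch, n)
    s_lidar = dense(lidar, *params["lidar"])               # (batch, n)
    fused = np.concatenate([s_camera, s_lidar], axis=1)    # (batch, 2n)
    return dense(fused, *params["fusion"])                 # (batch, n)

rng = np.random.default_rng(0)
n = 10                                # state dimension
camera_dim, lidar_dim = 1024, 360     # hypothetical flattened input sizes
params = {
    "camera": init_layer(rng, camera_dim, n),
    "lidar": init_layer(rng, lidar_dim, n),
    "fusion": init_layer(rng, 2 * n, n),
}
camera_batch = rng.normal(size=(4, camera_dim))
lidar_batch = rng.normal(size=(4, lidar_dim))
states = state_net(camera_batch, lidar_batch, params)
print(states.shape)  # (4, 10)
```

Keeping the branches separate until the last layer lets each modality be encoded with a network suited to its structure before the fusion layer weighs them against each other.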
As shown in [5], the state representation network should not be updated with the same frequency as the reinforcement learning network, because doing so generates high learning instability. Therefore, we train the QNet (Figure 3) at each training step, while we update the StateNet (Figure 2) only after a fixed number of training episodes. The update frequency of the StateNet is chosen as a trade-off between training too often, which generates instability, and training too rarely, which slows down the learning of the optimal policy, since the optimal policy cannot be learned without an informative state representation. In the episodes right after the updates of the StateNet, the rewards achieved by the RL agent may drop due to the sudden changes of the state representation. To avoid getting stuck in locally optimal policies, we hold constant the value of $\epsilon$ of the $\epsilon$-greedy exploration policy of DDQN.
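The two update frequencies can be illustrated with a small scheduling sketch. It only counts updates and does not train anything; the function name, the episode length, and the StateNet period are assumptions chosen for the example.

```python
def run_schedule(num_episodes, steps_per_episode, state_net_period):
    """Count QNet updates and record the episodes that end with a StateNet update."""
    q_net_updates = 0
    state_net_update_episodes = []
    for episode in range(1, num_episodes + 1):
        for _ in range(steps_per_episode):
            # Act with a fixed exploration rate, store the transition,
            # then train the QNet on a replayed batch.
            q_net_updates += 1
        if episode % state_net_period == 0:
            # Retrain the StateNet on the samples collected so far.
            state_net_update_episodes.append(episode)
    return q_net_updates, state_net_update_episodes

updates, episodes = run_schedule(num_episodes=500, steps_per_episode=5,
                                 state_net_period=200)
print(updates, episodes)  # 2500 [200, 400]
```

With these illustrative numbers the QNet is trained 2500 times while the StateNet changes only twice, so the Q-values have many steps to adapt to each new representation.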
V Experiments design
V-A Mobile robot navigation with camera and LiDAR
When autonomously navigating, mobile robots are usually equipped with multiple sensors (sensor modalities) in order to gather as much information as possible from the environment. Commonly used sensors for perceiving the world are cameras and laser range scanners (LiDARs). Therefore, we equipped our robot (Turtlebot 3 waffle) with a camera (FOV 60 degrees) and a 2D LiDAR (FOV 360 degrees). The approach is first tested in the ROS-Gazebo 3D simulation environment and later evaluated on the real robot (again a Turtlebot 3 waffle).
V-B Reinforcement learning algorithm settings
The algorithm chosen is DDQN, which takes the state predictions from the StateNet as input and outputs the Q-values, one estimate per action. The agent can choose among 3 discrete actions: go forward, turn right, and turn left. To study the effect of different reward functions on the state representation learned with the new set of priors (Equations (3)-(6)), two different reward functions are tested:
(8) 
(9) 
The terms of the two reward functions are: the distance of the robot to the target, estimated using the odometry information; the robot orientation with respect to the target; the minimum distance threshold below which the navigation target is considered reached; and a scaling factor for the exponential function. In addition, a bonus is given for reaching the target and a penalty for hitting an obstacle, i.e. for reaching a terminal state. These two reward functions are a common choice for solving navigation tasks.
V-C Navigation tasks in different environments
We first compare the new priors with the ones from [8] in the environment in Figure 4(a) in order to highlight similarities and differences. We then analyze the choice of the state dimensionality, as it is a crucial aspect for RL performance. Furthermore, we study, through t-SNE [13], PCA [22] and correlation analysis, whether the StateNet trained with the priors (Equations (3)-(6)) succeeds in encoding the meaningful information for solving the navigation task. In the case of the proposed mobile robot navigation, this information corresponds to the physical properties of the world, for example the pose of the robot (x-position, y-position, and orientation) and its distance to the target. Finally, we test our approach in environments with different topologies and features (e.g. different wall colors, textures, etc.), shown in Figure 4(b)-4(e), to validate the method. We also study the information encoded by the StateNet in these environments and to what extent it depends on the environment shape.
V-D Multi-target state representation
We perform experiments to assess the priors on a more complicated task: learning a state representation for multiple navigation targets. During training, a target is sampled from a uniform distribution at every episode. We slightly adapt the observation vector to include the location, i.e. the ($x$, $y$) coordinates, of the target. This information is directly passed to the last dense layer of the StateNet.
V-E Transfer learning experiments
Transfer learning is an important element for deploying RL algorithms on real robots, but it is usually limited by the simulation-reality gap, i.e. the difference that always exists between simulation and the real world. However, if informative high-level features are extracted from the observations, the RL policies trained on these features gain robustness and can be transferred from simulation to reality without the need for further training on the real robots.
VI Results and Discussion
The state predictions are analyzed using Principal Components Analysis (PCA) [22] and t-Distributed Stochastic Neighbor Embedding (t-SNE) [13]. These two dimensionality reduction techniques allow us to visualize high-dimensional datasets and to understand and explain the learned state representation.
VI-A Mobile robot navigation with camera and LiDAR
Here, we analyze the influence of the different sensor modalities on the learned representation. In particular, we compare the quality of learned representation through the crash ratio and the total cumulative reward when:
VI-B Navigation tasks in different environments
VI-B1 Comparison with original priors
The comparison with the priors introduced in [8] is done by comparing the effect of the different learned state representations on the performance of the RL agent in the environment depicted in Figure 4(a). For the sake of a fair comparison, the training and testing environment is similar to the one used in [8]. Furthermore, the same neural networks and hyperparameters are used. Figure 6 shows the crash ratio during training when the proposed priors and the original priors are used.
VI-B2 Analysis of the state dimensionality
The choice of the state dimensionality is crucial for RL performance. To test it, we analyze the crash ratio, i.e. the number of times an episode ends due to a collision with an obstacle over the total number of episodes, in relation to the choice of the state dimension. This choice corresponds to the choice of the output dimension of the StateNet. The results are shown in Figure 7. If the state dimension is chosen too small with respect to the optimal one, the encoding step loses much important information due to the excessive compression. This is the case for state dimensions equal to 2 and 3. In those cases, the RL agent struggles to reduce the collisions and improve the policy. On the other hand, if the dimension is chosen too large, for example equal to 100, the learning of the RL agent is slowed down due to the lack of compression and the curse of dimensionality: the RL agent has to learn which information has to be ignored.
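The crash ratio defined above is a simple fraction of episode outcomes. In the sketch below, the outcome labels and the function name are illustrative assumptions.

```python
def crash_ratio(episode_outcomes):
    """Fraction of episodes that ended with a collision."""
    if not episode_outcomes:
        return 0.0
    crashes = sum(1 for outcome in episode_outcomes if outcome == "collision")
    return crashes / len(episode_outcomes)

outcomes = ["collision", "target", "timeout", "collision"]
print(crash_ratio(outcomes))  # 0.5
```

Computing the ratio over a sliding window of recent episodes, rather than over the whole run, would give a training curve that reflects the current policy.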
We compare our approach with RL using the true pose of the robot (x-position, y-position and orientation) and with end-to-end RL based on observations (see Figure 8) in the environment in Figure 4(a). As expected, RL based on the ground truth quickly converges to the optimal solution (blue line in Figure 8); however, knowledge of the ground truth is a limiting factor in many real-world scenarios. When the state representation is combined with RL (orange line in Figure 8), after a few updates of the StateNet (occurring at episodes 200 and 400, respectively), the policy converges to the optimal solution with a slope very similar to that of the policy using the ground truth. The policy directly based on observations (green line in Figure 8) cannot converge within the time window of 1200 episodes. This result demonstrates the effectiveness of state representation learning using the priors.
Through PCA, we study the actual dimensionality of the encoded state space by counting the number of uncorrelated components. The method is tested for all the environments in Figure 4. The results obtained are shown in Table I.
The state representation learned with the priors is not dependent on the topology of the environment (e.g. its shape) or the choice of the features (e.g. wall colors or textures), as the number of uncorrelated components is consistently 4 in Env1, Env2, Env4 and Env5 (see Table I). This shows that the proposed state representation learning method generalizes well across different environments. Interestingly, in Env3, where an obstacle is present on the optimal trajectory towards the target, the state representation can encode that information. This is reflected in the number of uncorrelated components, as a fifth one emerges. This again shows that the state representation learned with the new priors can encode the task-relevant information.
| Environment | StateNet output dim | Number of uncorrelated components |
|---|---|---|
| Env1 | 10 | 4 |
| Env2 | 10 | 4 |
| Env3 | 10 | 5 |
| Env4 | 10 | 4 |
| Env5 | 10 | 4 |
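One way to count the components reported in Table I is to run PCA on the state predictions and keep the components that explain a non-negligible share of the variance. The sketch below illustrates this with NumPy on synthetic data; the variance threshold and the function name are assumptions, and the actual counting criterion used in the experiments may differ.

```python
import numpy as np

def count_significant_components(states, variance_threshold=0.01):
    """Number of principal components explaining more than the threshold share."""
    centered = states - states.mean(axis=0)
    # Squared singular values of the centered data are proportional to the
    # variance captured along each principal axis.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    explained_ratio = variances / variances.sum()
    return int(np.sum(explained_ratio > variance_threshold))

# Synthetic check: 10-dimensional states driven by 4 independent factors.
rng = np.random.default_rng(0)
factors = rng.normal(size=(500, 4))
mixing = np.linalg.qr(rng.normal(size=(10, 4)))[0].T   # (4, 10), orthonormal rows
synthetic_states = factors @ mixing
print(count_significant_components(synthetic_states))  # 4
```

Although the synthetic states have 10 coordinates, only 4 directions carry variance, which mirrors the situation in Table I where a 10-dimensional StateNet output contains 4 or 5 uncorrelated components.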
In order to understand what kind of information the StateNet encodes in the state space, we compare samples from the different principal components with the physically important properties required in any navigation task: the pose of the robot (x-position, y-position, orientation) and the distance to the target. The results of the correlation analysis, for the environment in Figure 4(a), are shown in Table II. A correlation exists between the real physical properties of the world and the properties encoded by the StateNet. It is worth mentioning that we are not enforcing any explicit disentanglement or decorrelation of the state properties, as we are still in an unsupervised learning framework.
| | x-position | y-position | orientation | distance to target |
|---|---|---|---|---|
| Principal component 1 | 0.86 | 0.24 | 0.18 | 0.14 |
| Principal component 2 | 0.28 | 0.68 | 0.7 | 0.8 |
| Principal component 3 | 0.32 | 0.17 | 0.37 | 0.22 |
| Principal component 4 | 0.1 | 0.19 | 0.09 | 0.13 |
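A correlation table of this kind can be produced by correlating the principal-component scores of the state predictions with the logged ground-truth quantities. The sketch below uses NumPy on synthetic data; reporting absolute Pearson correlations is an assumption, as are the function and variable names.

```python
import numpy as np

def component_correlations(components, properties):
    """Absolute Pearson correlation between each component and each property.

    components: array (samples, k) of principal-component scores
    properties: array (samples, m) of ground-truth quantities
    returns:    array (k, m)
    """
    k = components.shape[1]
    # With rowvar=False every column is a variable; the result is the full
    # (k + m) x (k + m) correlation matrix, of which we keep the cross block.
    full = np.corrcoef(components, properties, rowvar=False)
    return np.abs(full[:k, k:])

rng = np.random.default_rng(0)
x = rng.normal(size=200)
noise = rng.normal(size=200)
scores = np.column_stack([x, noise])            # two hypothetical components
truth = np.column_stack([2.0 * x + 1.0, -x])    # two hypothetical properties
table = component_correlations(scores, truth)
```

In this synthetic case the first row of `table` is close to 1 because both properties are linear functions of the first component, while the second row stays small because the second component is independent noise.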
VI-B3 Reward-shaped state representation
To test whether the reward signal, combined with the new priors, can be used to effectively shape the state representation by encoding task-specific knowledge from the sensor information, we analyzed, using t-SNE, the state representations obtained when the different reward functions in Equations (8) and (9) are employed. In particular, we analyze whether the clustering of the state predictions is related to the chosen reward function. Figure 9 shows the clustering of the state predictions, when the reward function in Equation (8) is used, with respect to the true distance from the target (see Figure 9a) and the orientation with respect to the target (see Figure 9b). When the reward function in Equation (8) is used, the state representation is able to encode and cluster close together the predictions that have similar rewards, i.e. a similar distance from the target. When the same predictions are overlaid with the true orientation (see Figure 9b), the clustering is less effective and the prediction samples with similar orientation are spread over larger areas. This is expected, since the state representation is not trained to cluster the predictions with respect to the orientation. Analogously, the clustering of the state predictions when the reward function in Equation (9) is used is shown with respect to the true distance from the target (see Figure 9c) and the orientation (see Figure 9d). When the reward function in Equation (9) is used, the predictions are correctly clustered with respect to the true orientation (see Figure 9d), but also with respect to the true distance (Figure 9c). This is because the orientation with respect to the target is computed using the distance from the target along the x- and y-axes, so it is not completely independent of the distance. These results show that the state representation encodes task-relevant knowledge through the reward information.
VI-C Multi-target navigation
In this section, we present the results related to multi-target navigation. In particular, we analyze whether the priors are suitable for learning a state representation that is capable of differentiating between multiple navigation targets (two in this case). The results are presented in Figure 10, where the state predictions are analyzed using PCA (Figure 10(a)) and t-SNE (Figure 10(b)). The learned state representation can effectively incorporate the information of the different targets, and it clusters the states not only in terms of the reward in Equation (8) (this can be noticed from the smoothness of the color gradient in the figures), but also with respect to the two targets (a clear division of the state samples).
VI-D Experiments in realistic simulation environment and on real robots
In this section, the transfer learning experiments are presented. In particular, we show, for a single navigation target, the trajectories followed by the real robot after transferring the state representation and the policy learned in the simulation environment in Figure 4(d). The trajectories followed in 10 different experiments are shown in Figure 11(a) (left).
To assess the robustness of the state representation and the policy learned in simulation to variations in the sensor readings, during the experiments we switched off the lights of the room and, after a few seconds, switched them back on. The trajectories obtained are shown in Figure 11(a). When the lights are off, the agent receives images from the camera that are very different from the ones it has been trained on, so it cannot immediately find the key features needed to reach the target. However, the agent does not perform random actions that would make the robot crash into an obstacle (purple dots in Figure 11(a) (right)). Instead, the agent starts a searching behavior, rotating in search of the correct features. Once the light is turned on again (blue dots in Figure 11(a) (right)), the agent quickly recognizes the features and drives safely to the target. This can be interpreted as evidence that the policy has learned robust obstacle avoidance and navigation skills.
By extracting the meaningful features from the sensor data, not only does the RL agent learn the policy faster, but we can also mitigate the simulation-to-reality gap and directly transfer the knowledge learned in the simulation environment to the real robot without any further training on the real hardware.
VII Conclusions
This paper proposes a new approach for the unsupervised learning of state representations for reinforcement learning. The state representation is learned using a new set of auxiliary loss functions, i.e. the priors. These priors are shaped using the reward function as a means to incorporate the task-relevant knowledge into the state representation. The tests in the different environments show that the state representation built using the reward-shaped priors can encode the important physical properties for solving different navigation tasks. Furthermore, the learned state representation is not dependent on the topology of the environment or the textures in it: the number of uncorrelated components in the state is consistently 4. However, when an extra constraint is added to the environment (e.g. obstacles, see Figure 4(c)), the state representation grows an extra uncorrelated component to encode information about the obstacle. The same happens in the case of multi-target navigation tasks. Furthermore, the priors allow the fusion of different sensor modalities (camera and LiDAR in this case). Finally, the state representation and policy learned in the simulation environment are successfully transferred to the real robot without further retraining.
Acknowledgment
The authors thank Johan Engelen for the great support in the initial phase of this work and Han Wopereis for the precious help with ROS and Gazebo simulations.
References
[1] (2016) Learning to poke by poking: experiential learning of intuitive physics. arXiv:1606.07419.
[2] (2017-08) Autoencoder-augmented neuroevolution for visual Doom playing. IEEE Conference on Computational Intelligence and Games (CIG).
[3] (2013) Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[4] (2015) Autonomous learning of state representations for control: an emerging field aiming to autonomously learn state representation for reinforcement learning agents from their real-world sensor observations. KI - Künstliche Intelligenz.
[5] (2018) Integrating state representation learning into deep reinforcement learning. IEEE Robotics and Automation Letters.
[6] (2015) Deep spatial autoencoders for visuomotor learning. CoRR.
[7] (2015) Learning to linearize under uncertainty. arXiv:1506.03011.
[8] (2015) Learning state representations with robotic priors. Autonomous Robots.
[9] (2017) PVEs: position-velocity encoders for unsupervised learning of structured state representations. arXiv:1705.09805.
[10] (2018) State representation learning for control: an overview. Neural Networks.
[11] (2017) Unsupervised state representation learning with robotic priors: a robustness benchmark. arXiv.
[12] (1993) Reinforcement learning for robots using neural networks. Technical report, Carnegie-Mellon Univ Pittsburgh PA School of Computer Science.
[13] (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605.
[14] (2012) Learn to swing up and balance a real pole based on raw visual input data. In Neural Information Processing, T. Huang, Z. Zeng, C. Li, and C. S. Leung (Eds.), Berlin, Heidelberg, pp. 126–133. ISBN 9783642345005.
[15] (2013) Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
[16] (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
[17] (2019) State representation learning with robotic priors for partially observable environments. IROS.
[18] (1998) Introduction to reinforcement learning.
[19] (2016) Deep reinforcement learning with double Q-learning.
[20] (2016-10) Stable reinforcement learning with autoencoders for tactile and visual data. 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). ISBN 9781509037629.
[21] (1992) Q-learning. Machine Learning 8 (3-4), pp. 279–292.
[22] (1987) Principal component analysis. Chemometrics and Intelligent Laboratory Systems 2 (1-3), pp. 37–52.