Lifelong Federated Reinforcement Learning: A Learning Architecture for Navigation in Cloud Robotic Systems

This paper was motivated by the problem of how to make robots fuse and transfer their experience so that they can effectively use prior knowledge and quickly adapt to new environments. To address the problem, we present a learning architecture for navigation in cloud robotic systems: Lifelong Federated Reinforcement Learning (LFRLA). In this work, we propose a knowledge fusion algorithm for upgrading a shared model deployed on the cloud. Then, effective transfer learning methods in LFRLA are introduced. LFRLA is consistent with human cognitive science and fits well in cloud robotic systems. Experiments show that LFRLA greatly improves the efficiency of reinforcement learning for robot navigation. The cloud robotic system deployment also shows that LFRLA is capable of fusing prior knowledge. In addition, we release a cloud robotic navigation-learning website based on LFRLA.



I Introduction

Autonomous navigation is one of the core issues in mobile robotics. It draws on various techniques for avoiding obstacles and reaching a target position. Recently, reinforcement learning (RL) algorithms have been widely used to tackle the navigation task. RL is a reactive navigation method and an important means of improving the real-time performance and adaptability of mobile robots in unknown environments. Nevertheless, a number of problems remain in applying reinforcement learning to navigation, such as reducing training time, storing data over long periods, separating computation from the robot, and adapting rapidly to new environments [1].

In this paper, we address the problem of how to make robots learn efficiently in a new environment and extend their experience storage so that they can effectively use prior knowledge. We focus on cloud computing and cloud robotics technologies [2], which can enhance robotic systems by facilitating the sharing of trajectories, control policies, and outcomes of collective robot learning. Inspired by human cognitive science, we propose a Lifelong Federated Reinforcement Learning Architecture (LFRLA), illustrated in Fig. 2, to realize this goal.

Fig. 1: The person on the right is considering where the next move should go. The chess games he has played and the chess games he has seen are the two most influential factors in his decision. His memory is fused into his policy model. So, how can robots remember and make decisions like humans?

With its scalable architecture and knowledge fusion algorithm, LFRLA achieves exceptional efficiency in reinforcement learning for cloud robot navigation. LFRLA enables robots to remember what they have learned and what other robots have learned through cloud robotic systems. LFRLA supports both asynchronous and synchronous learning rather than being limited to synchronous learning as in A3C [3] or UNREAL [4]. To demonstrate the efficacy of LFRLA, we test it in several public and self-made training environments. Experimental data indicate that LFRLA enables robots to effectively use prior knowledge and quickly adapt to new environments. Overall, this paper makes the following contributions:

  • We present a Lifelong Federated Reinforcement Learning Architecture, based on human cognitive science, that enables robots to perform lifelong learning of navigation in cloud robotic systems.

  • We propose a knowledge fusion algorithm that is able to fuse prior knowledge of robots and evolve the shared model in cloud robotic systems.

  • We introduce two effective transfer learning approaches to make robots quickly adapt to new environments.

The remainder of this paper is organized as follows. Related Theory is introduced in Section II. The overall system architecture is introduced in Part A of Section III. The paper details the knowledge fusion algorithm and transfer approaches in Part B and C of Section III. We explain LFRLA from human cognitive science in Part D of Section III. We present the evaluation results in Section IV and conclude this paper in Section V.

II Related Theory

II-A Reinforcement learning for navigation

Eliminating the requirement for localization, mapping, or path planning procedures, several DRL works have been presented in which navigation policies are successfully learned directly from raw sensor inputs: target-driven navigation [5], successor feature RL for transferring navigation policies [6], and using auxiliary tasks to boost DRL training [7]. Many follow-up works have also been proposed, such as embedding SLAM-like structures into DRL networks [8], or utilizing DRL for multi-robot collision avoidance [9]. Tai et al. [10] successfully applied DRL to mapless navigation by taking sparse 10-dimensional range findings and the target position, defined in the mobile robot coordinate frame, as input and continuous steering commands as output. Zhu et al. [5] input both the first-person view and the image of the target object to the A3C model, formulating a target-driven navigation problem based on universal value function approximators [11]. To make the robot learn to navigate, we adopt a reinforcement learning perspective, which builds on the recent success of deep RL algorithms in solving challenging control tasks [12] [13] [14] [15]. Zhang [16] presented a solution that can quickly adapt to new situations (e.g., changing navigation goals and environments). Making the robot quickly adapt to new situations is not enough; we also need to consider how to make robots capable of memory and evolution, which is similar to the main purpose of lifelong learning.

Fig. 2: Proposed architecture. In Cloud→Robot, inspired by transfer learning, successor features are used to transfer the strategy to an unknown environment. We input the output of the shared model as added features to the Q-network in reinforcement learning, or simply transfer all parameters to the Q-network. In Robot→Environment, the robot learns to avoid new types of obstacles in the new environment through reinforcement learning and obtains a private Q-network model. Private models can result not only from one robot training in different environments but also from multiple robots, which is a type of federated learning. After that, the private network is uploaded to the cloud. By iterating this step, the models on the cloud become increasingly powerful.

II-B Lifelong learning

Lifelong machine learning, or LML [17], considers systems that can learn many tasks from one or more domains over their lifetime. The goal is to sequentially store learned knowledge and to selectively transfer that knowledge when a robot learns a new task, so as to develop more accurate hypotheses or policies. Robots are confronted with different obstacles in different environments, both static and dynamic, which is similar to multi-task learning in lifelong learning. Although the learning tasks are the same (reaching goals and avoiding obstacles), the obstacle types differ: static obstacles, dynamic obstacles, and dynamic obstacles with different ways of moving. Therefore, navigation learning can be regarded as low-level multi-task learning.

A lifelong learning system should be able to efficiently retain knowledge. This is typically done by sharing a representation among tasks, using distillation [18] or a latent basis [18]. The agent should also learn to selectively use its past knowledge to solve new tasks efficiently. Most works have focused on a special transfer mechanism, i.e., they suggested learning differentiable weights from a shared representation to the new tasks [4][19]. In contrast, Brunskill and Li [20] suggested a temporal transfer mechanism, which identifies an optimal set of skills in new tasks. Finally, the agent should have a system approach that allows it to efficiently retain the knowledge of multiple tasks, as well as an efficient mechanism to transfer knowledge for solving new tasks. Chen [21] proposed a lifelong learning system that can reuse and transfer knowledge from one task to another while efficiently retaining the previously learned knowledge base in Minecraft. Although this method has achieved good results in Minecraft, it lacks a multi-agent cooperative learning model. Learning different tasks in the same scene is similar to, but different from, robot navigation learning.

II-C Federated learning

LFRLA realizes federated learning across multiple robots through knowledge fusion. Federated learning was first proposed in [22], which showed its effectiveness through experiments on various datasets. In federated learning systems, the raw data is collected and stored at multiple edge nodes, and a machine learning model is trained from the distributed data without sending the raw data from the nodes to a central place [23], [24]. Different from the traditional joint learning method, where multiple edges learn at the same time, LFRLA first trains and then merges, which reduces the dependence on the quality of communication [25][26].

II-D Cloud robotic system

LFRLA fits well with cloud robotic systems. A cloud robotic system usually relies on either data or code from a network to support its operation. Since the concept of the cloud robot was proposed by Dr. Kuffner of Carnegie Mellon University (now at Google) in 2010 [27], research on cloud robots has been rising gradually. At the beginning of 2011, the cloud robotics study program RoboEarth [28] was initiated by the Eindhoven University of Technology. Google engineers have developed robot software based on the Android platform, which can be used for remote control based on Lego Mindstorms, iRobot Create, Vex Pro, etc. [29]. Kamei et al. proposed a mall wheelchair robot that shares map information through the cloud [30]. However, no specific navigation method for cloud robots has been proposed up to now. We believe that this is the first navigation learning architecture for cloud robotic systems.

In general, this paper focuses on developing a reinforcement learning architecture for robot navigation that is capable of lifelong federated learning and multi-robot federated learning. This architecture fits well in cloud robotic systems.

III Methodology

Fig. 2 presents the Lifelong Federated Reinforcement Learning Architecture (LFRLA) we propose. LFRLA is capable of reducing training time without sacrificing the accuracy of navigation decisions in cloud robotic systems. LFRLA uses a Cloud-Robot-Environment setup to learn the navigation policy. It consists of a cloud server, a set of environments, and one or more robots. We develop a federated learning algorithm to fuse private models into the shared model in the cloud. The cloud server fuses private models into the shared model, then evolves the shared model. As illustrated in Fig. 2, LFRLA is an implementation of lifelong learning for navigation in cloud robotic systems.

III-A Procedure of LFRLA

This section gives a practical example of LFRLA. There are three robots, three different environments, and a cloud server. The first robot obtains its private strategy model Q1 through reinforcement learning in Environment 1 and uploads it to the cloud server as the shared model 1G. After a while, Robot 2 and Robot 3 want to learn navigation by reinforcement learning in Environment 2 and Environment 3. In LFRLA, Robot 2 and Robot 3 download the shared model 1G as the initial actor model in reinforcement learning. They then obtain their private networks Q2 and Q3 through reinforcement learning in Environment 2 and Environment 3. After training is complete, LFRLA uploads Q2 and Q3 to the cloud. In the cloud, the strategy models Q2 and Q3 are fused into the shared model 1G, generating the shared model 2G. In the future, the shared model 2G can be used by other cloud robots. Other robots will also upload their private strategy models to the cloud server to promote the evolution of the shared model.

For a cloud robotic system, each time the cloud generates a new shared model, one step of evolution in lifelong learning takes place. The continuous evolution of the shared model in the cloud is a lifelong learning pattern. In LFRLA, the cloud model achieves the "memory storage and merging" of a robot across different environments. Thus, the shared model becomes more powerful by fusing the skills needed to avoid multiple types of obstacles.

For an individual robot, once the robot has downloaded the cloud model, its initial Q-network is defined. Therefore, the initial Q-network already has the ability to reach the target and avoid some types of obstacles. It is conceivable that LFRLA can reduce the training time for robots to learn navigation. Furthermore, a surprising experimental result is that the robot achieves higher navigation scores with LFRLA.

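Initialize the action-value Q-network (the shared model) with random weights;
while cloud server is running do
    if service_request = True then
        Transfer the current shared model to the requesting robots;
        for each requesting robot i do
            Robot i performs reinforcement learning with the shared model in its environment and obtains the private network Q_i;
            Send Q_i to the cloud;
        end for
    end if
    if evolve_time = True then
        Generate the next-generation shared model = fuse(received private networks, current shared model)
    end if
end while
Algorithm 1 Processing Algorithm in LFRLA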

However, in actual operation, the cloud does not necessarily fuse every time a private network is received; instead, it fuses at fixed time intervals. The resulting processing flow of LFRLA is shown in Algorithm 1.
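The interval-based fusion described above can be sketched as a small server object. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `CloudServer`, its methods, the list-of-floats model type, and the placeholder `average_fuse` are hypothetical names introduced here, and element-wise averaging only stands in for the knowledge fusion algorithm of LFRLA. Time is passed in explicitly so that the example runs without waiting.

```python
from typing import Callable, List

Model = List[float]  # hypothetical stand-in for the parameters of a Q-network
FuseFn = Callable[[List[Model], Model], Model]


def average_fuse(private_models: List[Model], shared_model: Model) -> Model:
    """Placeholder fusion: element-wise mean. NOT the paper's fusion algorithm."""
    models = private_models + [shared_model]
    return [sum(values) / len(models) for values in zip(*models)]


class CloudServer:
    """Buffers uploaded private models and fuses them at a fixed interval."""

    def __init__(self, initial_model: Model, fuse: FuseFn, evolve_interval: float):
        self.shared_model = list(initial_model)
        self.generation = 0  # becomes 1 (shared model "1G") after the first fusion
        self._fuse = fuse
        self._evolve_interval = evolve_interval
        self._last_evolve = 0.0
        self._pending: List[Model] = []

    def download(self) -> Model:
        """Service request: hand a copy of the current shared model to a robot."""
        return list(self.shared_model)

    def upload(self, private_model: Model) -> None:
        """Store a private model; fusion is deferred until the next evolve time."""
        self._pending.append(list(private_model))

    def tick(self, now: float) -> bool:
        """Evolve the shared model if the interval has elapsed and uploads exist."""
        due = now - self._last_evolve >= self._evolve_interval
        if not due or not self._pending:
            return False
        self.shared_model = self._fuse(self._pending, self.shared_model)
        self._pending = []
        self.generation += 1
        self._last_evolve = now
        return True


server = CloudServer([0.0, 0.0], average_fuse, evolve_interval=10.0)
server.upload([2.0, 4.0])
early = server.tick(now=5.0)    # interval not yet elapsed: upload stays buffered
server.upload([4.0, 2.0])
evolved = server.tick(now=10.0)  # both buffered uploads are fused in one step
```

Buffering uploads and fusing them in one step keeps the number of shared-model generations independent of how many robots happen to upload within an interval.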

Fig. 3: Knowledge fusion algorithm in LFRLA. We generate a large amount of training data based on sensor data, target data, and human-defined features. Each training sample is fed into the private networks and the k-th generation shared network, and each actor scores the different actions. Then, we store the scores and calculate the confidence values of all actors on this training sample. The "confidence value" is used as a weight: the scores are weighted and summed to obtain the label of the current sample. In the same way, the labels of all samples are generated. Finally, a network is generated that fits the sample data as closely as possible. The generated network is the (k+1)-th generation, which completes this step of evolution.

The key algorithms in LFRLA are the knowledge fusion algorithm and the transfer approaches, which are introduced in the following.

III-B Knowledge fusion algorithm in cloud

Inspired by image style transfer algorithms, we develop a knowledge fusion algorithm to evolve the shared model. This algorithm is based on generative networks and efficiently fuses the parameters of networks trained by different robots, or by one robot in different environments. The algorithm, deployed on the cloud server, receives the private networks transmitted by the robots and upgrades the parameters of the shared network. To address knowledge fusion, the algorithm generates a new shared model from the private models and the current shared model in the cloud. This new shared model is the evolved model.

Fig. 3 illustrates the process of generating a policy network. The training data of the network is randomly generated based on sensor attributes. The label of each piece of data is dynamically weighted, based on the "confidence value" of each robot on that piece of data. We define robots as actors in reinforcement learning. Different robots, or the same robot in different environments, are different actors.

Initialize the shared network with random parameters;
Input: N: the number of data samples generated; K: the number of private networks; M: the action size of the robot;
for i = 1; i ≤ N; i++ do
    Calculate indirect features from the i-th data sample;
    score_i ← [];
    for n = 1; n ≤ K; n++ do
        Append the scores given by the n-th private network for the i-th data sample to score_i;
    end for
    for n = 1; n ≤ K; n++ do
        for m = 1; m ≤ M; m++ do
            Calculate the confidence value of the n-th private network in the i-th data based on formula (1);
        end for
    end for
    Calculate label_i based on formulas (2) and (3);
    Append label_i to label;
end for
Train the shared network from (x, label);
Algorithm 2 Private Networks Fusion Algorithm

The "confidence value" of an actor mentioned above is the degree of certainty about which action the robot chooses to perform. For example, for one sample of the training data, private Network 1 evaluates the Q-values of the different actions as (85, 85, 84, 83, 86), whereas the k-G shared network evaluates them as (20, 20, 100, 10, 10). In this case, we are more confident in the actor of the k-G shared network, because its scores clearly differentiate between actions. In contrast, the scores from the actor of private Network 1 are ambiguous. Therefore, when generating the labels, the algorithm calculates the confidence value from the scores of each actor. The scores are then weighted by the confidence values and summed. Finally, we obtain the labels of the training data by executing the above steps for each piece of data. There are several ways to define confidence, such as variance, standard deviation, and information entropy. In this work, we use information entropy because we believe it better reflects the degree of data variation. Below is the quantitative function of robotic confidence (information entropy):

Fig. 4: A transfer learning method of LFRLA

Robot j "confidence":

Memory weight of robot j:

Knowledge fusion function:

It should be noted that Fig. 3 only shows the process of generating a label for one sample. In practice, we need to generate a large number of samples. For each data sample, the confidence values of the actors are different, so the weight of each actor is not the same. For example, when we generate 50,000 different pieces of data, there are nearly 50,000 different combinations of confidence. These changing weights are incorporated into the data labels and enable the generated network to dynamically adjust the weights on different sensor data.

In conclusion, the knowledge fusion algorithm in the cloud can be defined as:

We describe our approach in detail in Algorithm 2. For a single robot, private networks are obtained in different environments, so this can be regarded as asynchronous learning of the robot in LFRLA. When there are multiple robots, we simply treat them as the same robot in different environments. In this case, the evolution process is asynchronous, while the multiple robots learn synchronously.
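The label-generation steps of Algorithm 2 can be sketched in a few lines. This is an illustration under explicit assumptions rather than the paper's exact formulas (1) to (3): confidence is taken as one minus the normalized information entropy of an actor's scores, scores are shifted to be non-negative before normalization, and weights are the confidences normalized to sum to one. The function names `confidence`, `fuse_scores`, and `build_dataset` are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def confidence(scores: Sequence[float]) -> float:
    """Assumed confidence: 1 - normalized entropy of the action scores."""
    low = min(scores)
    shifted = [s - low for s in scores] if low < 0 else list(scores)
    total = sum(shifted)
    if total == 0 or len(shifted) < 2:
        return 0.0  # no differentiation between actions
    entropy = -sum((s / total) * math.log(s / total) for s in shifted if s > 0)
    return max(0.0, 1.0 - entropy / math.log(len(shifted)))


def fuse_scores(actor_scores: Sequence[Sequence[float]]) -> Tuple[Vector, Vector]:
    """Weight each actor's scores by its confidence and sum them into one label."""
    conf = [confidence(s) for s in actor_scores]
    total = sum(conf)
    k = len(actor_scores)
    # Fall back to equal weights when no actor shows any preference.
    weights = [c / total for c in conf] if total > 0 else [1.0 / k] * k
    n_actions = len(actor_scores[0])
    label = [sum(w * s[a] for w, s in zip(weights, actor_scores))
             for a in range(n_actions)]
    return label, weights


def build_dataset(networks: Sequence[Callable[[Vector], Sequence[float]]],
                  sample_input: Callable[[random.Random], Vector],
                  n_samples: int,
                  rng: random.Random) -> Tuple[List[Vector], List[Vector]]:
    """Generate random inputs and label each one with the fused actor scores."""
    xs, labels = [], []
    for _ in range(n_samples):
        x = sample_input(rng)
        label, _ = fuse_scores([net(x) for net in networks])
        xs.append(x)
        labels.append(label)
    return xs, labels


# The two score vectors from the example in the text.
private_1 = [85, 85, 84, 83, 86]
shared_k = [20, 20, 100, 10, 10]
label, weights = fuse_scores([private_1, shared_k])
```

Under these assumptions, the shared network's clearly differentiated scores receive almost all of the weight, so the fused label still prefers the third action. The resulting `(xs, labels)` pairs from `build_dataset` would then be used to train the next-generation shared network with any regression method.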

III-C Transfer the shared model

Various approaches to transfer reinforcement learning have been proposed. For the specific task of a robot learning to navigate, we found two applicable approaches. The first takes the shared model as the initial actor network, while the second uses the shared model as a feature extractor. If we adopt the first approach, the abilities to avoid obstacles and reach targets are preserved. With this approach, the robot obtains a good score from the beginning, and the experimental data show that the final score of the robot is also greatly improved. However, this approach is unstable: the training time depends to some extent on the adjustment of parameters. This work offers the following parameter modification suggestions for this approach in navigation learning:

  • Accelerate updating speed of value network to actor network.

  • Slightly increase the punishment when robots hit obstacles and adjust reward according to the average score of the shared model in private models.

  • Minimize or substantially reduce the probability of random action.

  • Reduce the rate at which the probability of random action decreases.

  • Adjust the maximum single learning time moderately according to environmental complexity.
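Some of these suggestions can be expressed as a pair of configurations, using the values reported in Table I for training from scratch and for training after transfer. The sketch below is illustrative only: the class `TrainingConfig` and its field names are hypothetical, the `epsilon` field is assumed to correspond to the row of Table I with values 1 and 0.8, and the reward adjustments are omitted because the text gives no numeric values for them.

```python
from dataclasses import dataclass, fields, replace
from typing import Dict, Tuple


@dataclass(frozen=True)
class TrainingConfig:
    """Hyper-parameters from Table I (field names are hypothetical)."""
    max_steps_per_round: int = 6000
    target_update_frequency: int = 2000
    discount_factor: float = 0.99
    learning_rate: float = 0.00025
    epsilon: float = 1.0  # assumed: probability of random action (values 1 / 0.8)
    epsilon_decay: float = 0.05
    batch_size: int = 64


# Training from scratch uses the defaults above.
SCRATCH = TrainingConfig()

# After transfer: faster target-network updates and less random exploration.
TRANSFER = replace(
    SCRATCH,
    target_update_frequency=1000,
    discount_factor=0.95,
    learning_rate=0.0002,
    epsilon=0.8,
)


def changed_fields(base: TrainingConfig,
                   other: TrainingConfig) -> Dict[str, Tuple[object, object]]:
    """Return {field name: (base value, other value)} for fields that differ."""
    return {
        f.name: (getattr(base, f.name), getattr(other, f.name))
        for f in fields(base)
        if getattr(base, f.name) != getattr(other, f.name)
    }


differences = changed_fields(SCRATCH, TRANSFER)
```

Keeping the transferred settings as an explicit diff against the from-scratch settings makes it easy to see which of the suggestions were actually applied.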

Using the shared model as a feature extractor is common in transfer learning. As illustrated in Fig. 4, this method increases the dimension of the features, so it improves performance stably. One problem that must be solved in the experiments is the structural difference between the input layers of the shared network and the private network. The approach in LFRLA is to keep the number of nodes in the input layer of the shared network consistent with the number of elements in the original feature vector, as shown in Fig. 4. The features obtained from transfer learning are not used as inputs to the shared network; they are only inputs for training the private networks. This approach is widely applicable, even when the shared model and the private models have different network structures.

It is also worth noting that if the robot uses image sensors to acquire images as feature data, we recommend the traditional transfer learning method of taking the output of some convolutional layers as features, because the Q-network is then a convolutional neural network. If a non-image sensor such as a laser radar is used, the Q-network is not a convolutional neural network, and we use the output of the entire network as additional features, as Fig. 4 shows.
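For the non-image case, the feature-extractor approach amounts to concatenating the original features with the output of the shared model. The following sketch assumes a shared model that returns one score per action; `augment_features`, `toy_shared_model`, and the feature values are hypothetical and serve only to show the data flow.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def augment_features(raw: Sequence[float],
                     shared_model: Callable[[Sequence[float]], Sequence[float]]) -> Vector:
    """Concatenate the raw features with the shared model's output for them."""
    # The shared model only ever sees the original feature vector.
    shared_output = shared_model(raw)
    return list(raw) + list(shared_output)


def toy_shared_model(features: Sequence[float]) -> Vector:
    """Hypothetical shared model: one score per action (5 actions assumed)."""
    total = sum(features)
    return [total * k for k in (0.1, 0.2, 0.3, 0.2, 0.1)]


raw_features = [0.5, 1.0, 1.5, 2.0]  # made-up sensor and target features
private_input = augment_features(raw_features, toy_shared_model)
```

The private network is then trained on `private_input`, whose dimension is the original feature count plus the shared model's output size, while the shared model keeps its original input layer.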

III-D Explanation from human cognitive science

The design of LFRLA is analogous to the human decision-making process in cognitive science. For example, when playing chess, a chess player makes decisions based on the rules and his own experience. This experience includes both the games he has played and the games of other players he has seen. We can regard the chess player as a decision model, and the quality of the decision model represents the performance level of the chess player. In general, this policy model becomes increasingly good through the accumulation of experience, and the chess player's skill improves. This is the iterative evolutionary process in LFRLA. After a chess player finishes a game, his chess level, or policy model, evolves, which is analogous to the process of knowledge fusion in LFRLA. These experiences are also used in later games, which is analogous to the process of transfer learning in LFRLA.

Fig. 5: Quantitative and comparative results. Panels (a) to (d) show the training environments Env-1 to Env-4. Panels (Generic-a) to (Generic-d) show the scores during training of the generic approach in Env-1 to Env-4. Panels (LFRLA-a) to (LFRLA-d) show the scores during training of LFRLA in Env-1 to Env-4. Improvement-a to Improvement-d present the improvement of LFRLA over the generic approach. In the training procedure of Env-1, LFRLA gives the same result as the generic approach, because no antecedent shared model exists for the robot. In the training procedure of Env-2, LFRLA obtained the shared model 1G, which let it reach a higher reward in less time than the generic approach. In Env-3 and Env-4, LFRLA evolved the shared model to 2G and 3G and obtained excellent results. This figure demonstrates that LFRLA can reach a higher reward in less time than the generic approach.

Fig. 1 shows a concrete example. The person on the right is considering where the next move should go. The chess games he has played and the chess games he has seen are the two most influential factors in his decision. However, different chess experiences may influence the next move differently. According to human cognitive science, the player is more influenced by experiences with clear judgments: an experience with a clear judgment has a higher weight in decision making. This way in which humans make decisions is analogous to the knowledge fusion algorithm in LFRLA. The influence of different chess experiences is always dynamic in the decision at each step. The knowledge fusion algorithm in LFRLA reproduces this cognitive phenomenon by adaptively weighting the labels of the training data. The chess player is a decision model that incorporates his own experiences. Correspondingly, LFRLA integrates experience into one decision model by generating a network. This process is also analogous to the operation of human cognition.

IV Experiments

In this section, we intend to answer two questions: 1) Can LFRLA help reduce training time without sacrificing the accuracy of navigation in cloud robotic systems? 2) Is the knowledge fusion algorithm effective in improving the shared model? To answer the first question, we conduct experiments comparing the performance of the generic approach and LFRLA. To answer the second question, we conduct experiments comparing the performance of generic models and the shared model in transfer reinforcement learning.

Fig. 6: Quantitative and comparative results. Panels (a) to (c) show the testing environments Env-1 to Env-3. Panels (Generic-a) to (Generic-c) display the stacked scores during training of the generic models and the shared model in Env-1 to Env-3; the Env-3 panel is labeled "(sampling)".

IV-A Experimental setup

The training procedure of LFRLA was implemented in a virtual environment simulated by Gazebo. Four training environments were constructed to show the difference between the generic approach, which trains from scratch, and LFRLA, as shown in Fig. 5. There are no obstacles in Env-1 except the walls. There are four static cylindrical obstacles in Env-2 and four moving cylindrical obstacles in Env-3. Env-4 contains more complex static obstacles. In every environment, a Turtlebot3 Burger equipped with a laser range sensor is used as the robot platform. The scanning range is from 0.13 m to 4 m. The target is represented by a red square object. In every episode, the target position is initialized randomly within the whole area and is guaranteed to be collision-free with the obstacles. At the beginning of each episode, the starting pose of the agent and the target position are randomly chosen such that a collision-free path is guaranteed to exist between them. An episode terminates when the agent reaches the goal, collides with an obstacle, or reaches a maximum of 6000 steps during training or 1000 steps during testing. We calculate the average reward of the robot every two minutes.

The hyper-parameters of the reward function used in the generic approaches and in the first training step of LFRLA are shown in Table I. Moreover, the experimental results show that the effect of LFRLA does not depend on the tuning of hyper-parameters. We trained the model from scratch on a single Nvidia GeForce GTX 1070 GPU. The actor-critic network uses two fully connected layers with 64 units. Its output is used to produce the discrete action probabilities through a linear layer followed by a softmax, and the value function through a linear layer.
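The network structure described above can be sketched as a forward pass in plain Python. Several details are assumptions not stated in the text: the ReLU activation, the uniform weight initialization, the input size of 12, and the action size of 5 are chosen for illustration only, and the class `ActorCritic` and its helper functions are hypothetical names.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Layer = Tuple[List[Vector], Vector]  # (weights[out][in], bias[out])


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Random uniform weights and zero biases (initialization is an assumption)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x: Vector, layer: Layer) -> Vector:
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def softmax(x: Vector) -> Vector:
    peak = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


class ActorCritic:
    """Two hidden layers of 64 units, a softmax policy head, a linear value head."""

    def __init__(self, n_inputs: int, n_actions: int, hidden: int = 64, seed: int = 0):
        rng = random.Random(seed)
        self.fc1 = init_layer(n_inputs, hidden, rng)
        self.fc2 = init_layer(hidden, hidden, rng)
        self.policy_head = init_layer(hidden, n_actions, rng)
        self.value_head = init_layer(hidden, 1, rng)

    def forward(self, x: Vector) -> Tuple[Vector, float]:
        h = relu(linear(x, self.fc1))   # ReLU is an assumed activation
        h = relu(linear(h, self.fc2))
        probs = softmax(linear(h, self.policy_head))
        value = linear(h, self.value_head)[0]
        return probs, value


net = ActorCritic(n_inputs=12, n_actions=5)  # sizes are illustrative
probs, value = net.forward([0.1] * 12)
```

Both heads share the two hidden layers, so the policy and the value estimate are computed from the same learned representation.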

IV-B Evaluation for the architecture

According to the hyper-parameters and the complexity of the environments, the target average score (reached when the score is above a certain value 5 consecutive times) is 4500 in Env-1, 4000 in Env-2, 3500 in Env-3, and 3500 in Env-4. To show the performance of LFRLA, we tested it and compared it with generic methods in the four environments. We then started the training procedure of LFRLA. As mentioned before, we initialize the shared model and, after training in Env-1, evolve it with Algorithm 2. In the cloud robotic system, the robot downloaded the shared model 1G. Then, the robot performed reinforcement learning based on the shared model. The robot obtained a private model after training, which was uploaded to the cloud server. The cloud server fused the private model and the shared model 1G to obtain the shared model 2G. Follow-up evolutions were performed in the same way. We constructed four environments, so the shared model was upgraded to 4G. The performance of LFRLA is shown in Fig. 5, which also shows the performance of the generic methods.
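The stopping criterion (a score above the target 5 consecutive times) can be checked with a short helper. The sketch assumes that "above" means strictly greater than the threshold and that one score is recorded every two minutes, as in the experimental setup; the function name `first_index_target_met` and the sample scores are made up for illustration.

```python
from typing import Optional, Sequence


def first_index_target_met(scores: Sequence[float],
                           threshold: float,
                           consecutive: int = 5) -> Optional[int]:
    """Index of the score that completes the first run of `consecutive`
    scores strictly above `threshold`, or None if no such run exists."""
    run = 0
    for index, score in enumerate(scores):
        run = run + 1 if score > threshold else 0
        if run >= consecutive:
            return index
    return None


# Made-up average scores, one per two-minute evaluation window.
example_scores = [3000, 4600, 4700, 4400, 4600, 4600, 4700, 4800, 4900]
index = first_index_target_met(example_scores, threshold=4500)

# With one score every two minutes, the elapsed time follows from the index.
minutes_to_target = None if index is None else (index + 1) * 2
```

In this example the dip to 4400 resets the run, so the target is only met at the ninth evaluation.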

Parameters | Values | Transferred values
Maximum number of steps per round | 6000 | 6000
Target network parameter update frequency | 2000 | 1000
Discount factor | 0.99 | 0.95
Learning rate | 0.00025 | 0.0002
ε | 1 | 0.8
Decay of epsilon | 0.05 | 0.05
Batch size | 64 | 64
TABLE I: Hyper-parameters in reinforcement learning

In Env-2 to Env-4, LFRLA increased the accuracy of navigation decisions and reduced training time in the cloud robotic system. From the last row of Fig. 5, we can observe that as the shared model is federated over more generations, the improvement becomes more pronounced. LFRLA is highly effective for learning a policy over all considered obstacles and greatly improves the generalization capability of our trained model across the most commonly encountered environments. The experiments demonstrate that LFRLA is capable of reducing training time without sacrificing the accuracy of navigation decisions in cloud robotic systems.

Columns: Time = time to meet requirement; Avg = average scores; Last5 = average of the last five scores.
Model | Time Test-Env-1 | Time Test-Env-2 | Time Test-Env-3 | Avg Test-Env-1 | Avg Test-Env-2 | Avg Test-Env-3 | Last5 Test-Env-1 | Last5 Test-Env-2 | Last5 Test-Env-3
model 1 | 1h 32min | 3h | 6h | 1353.36 | 46.5 | -953.56 | 1421.54 | 314.948 | -914.16
model 2 | 33min | 3h | 6h | 2631.57 | 71.91 | 288.61 | 3516.34 | 794.02 | -132.07
model 3 | 37min | 3h | 5h 50min | 2925.53 | 166.5 | 318.04 | 3097.64 | -244.17 | 1919.12
model 4 | 4h 41min | 2h 30min | 5h 38min | 1989.51 | 1557.24 | 1477.5 | 2483.18 | 2471.07 | 3087.83
Shared model | 55min | 2h 18min | 4h 48min | 2725.16 | 1327.76 | 1625.61 | 3497.66 | 2617.02 | 3670.92
TABLE II: Results of the contrast experiment

IV-C Evaluation for the knowledge fusion algorithm

To verify the effectiveness of the knowledge fusion algorithm, we conducted a comparative experiment. We created three new environments that were not present in the previous experiments. These environments are closer to real situations: Test-Env-1 contains static obstacles such as cardboard boxes, dustbins, and cans. Test-Env-2 contains moving obstacles such as stakes. Test-Env-3 includes more complex static obstacles and moving cylindrical obstacles. We still use the Turtlebot3 Burger simulated in Gazebo as the testing platform. The parameters are the same as the transferred values in Table I. To verify the advantage of the shared model, we trained the navigation policy based on the generic model 1, model 2, model 3, and model 4 in Test-Env-1, Test-Env-2, and Test-Env-3, respectively. The generic models come from the previous experiments with the generic approach. These policy models were each trained in one environment without knowledge fusion. According to the hyper-parameters and the complexity of the environments, the target average score (reached when the score is above a certain value 5 consecutive times) is 4000 in Test-Env-1, 3000 in Test-Env-2, and 2600 in Test-Env-3.

In the following, we present comparative results in Fig. 6 and quantitative results in Table II, which clearly show the effectiveness of our approach. In Table II, the darker the cell, the more advanced the model in that row. The shared model steadily reduces training time. In particular, we can observe that the generic models are only able to make excellent decisions in individual environments, while the shared model is able to make excellent decisions in many different environments. Thus, the knowledge fusion algorithm proposed in this paper is effective.

Compared with the A3C and UNREAL approaches, which update the parameters of the policy network at the same time, the proposed knowledge fusion approach is more suitable for the federated architecture of LFRLA. The proposed approach is capable of fusing models and evolving the shared model asynchronously. Updating parameters at the same time places certain requirements on the environments, whereas the proposed knowledge fusion algorithm places none. Using a generative network and dynamically weighted labels realizes the integration of memory, unlike the A3C and UNREAL methods, which only generate a decision model during learning and have no memory.

V Conclusion

We presented a learning architecture, LFRLA, for navigation in cloud robotic systems. The architecture enables navigation-learning robots to effectively use prior knowledge and quickly adapt to new environments. Additionally, we presented a knowledge fusion algorithm in LFRLA and introduced transfer methods. Our approach is able to fuse models and asynchronously evolve the shared model. We validated our architecture and algorithms in policy-learning experiments; moreover, we released a cloud robotic navigation-learning website based on LFRLA.

The architecture has fixed requirements for the dimension of the input sensor signal and the dimension of the action. We leave it as future work to make LFRLA flexible enough to deal with different input and output dimensions. A more flexible LFRLA would offer a wider range of services in cloud robotic systems.