Recent reinforcement learning (RL) research on autonomous robots has achieved significant performance improvements by employing distributed architectures for decentralized agents [2, 4], an approach termed Distributed Reinforcement Learning (DRL). However, most existing DRL frameworks consider only synchronous learning in a constant environment. In addition, with the rapid development of autonomous driving simulators, it is now common to pre-train a model on a simulator and then transfer it to real-life autonomous cars for fine-tuning. One of the main drawbacks of this path is that the model transfer process is conducted offline, which can be very time-consuming, and there is no feedback or collaboration among the fine-tuned models trained in different real-life scenarios.
To overcome these challenges, we propose an end-to-end training process that leverages federated learning (FL) and transfer learning to enable asynchronous learning of agents from different environments simultaneously. Specifically, we bridge pre-training on simulators and real-life fine-tuning by various agents with asynchronous updating strategies. Our proposed framework alleviates the time-consuming offline model transfer process in autonomous driving simulations while allowing the heavy load of training data to stay local on the autonomous edge vehicles. The framework can therefore potentially be applied to real-life scenarios in which multiple self-driving technology companies collaborate to train more powerful RL models by pooling their robotic car resources without revealing raw data. We perform extensive real-life experiments on a well-known RL application, i.e., a steering control RL task for collision avoidance of autonomous driving cars, to evaluate the feasibility of the proposed framework, and demonstrate that it outperforms the non-federated local training process.
I-A Related Work
One of the most important tasks in transfer reinforcement learning is to generalize already-learned knowledge to new tasks [16, 1, 11]. With the rapid advance of robotics simulators, many studies have started to investigate the feasibility and effectiveness of transferring knowledge from simulators to real-life agents [2, 10, 20, 15, 3].
One study proposed a decentralized end-to-end sensor-level collision avoidance policy for multi-robot systems, with pre-training conducted on the Stage mobile robot simulator (http://rtv.github.io/Stage/). Another studied how to reduce the computationally prohibitive process of anticipating interactions with neighboring agents in a decentralized multi-agent collision avoidance scenario; its pre-trained RL model is based on training data generated by a simulator. A third investigated end-to-end nonprehensile rearrangement, which maps raw pixels as visual input to control actions without any form of engineered feature extraction. The authors first trained a suitable rearrangement policy in Gazebo and then adapted the learned policy to real-world input data with their proposed transfer framework.
It can be concluded that, for transfer reinforcement learning in robotics, most RL research follows the same path: pre-train the RL model on a simulator, transfer the model to the robots, and fine-tune the model parameters. These steps are usually executed sequentially, i.e., once the RL models have been pre-trained and transferred to the robots, no further experience or knowledge from the simulators can be provided to the final models fine-tuned on the real-life robots. One may then ask: can the transfer and fine-tuning processes be executed in parallel?
The framework proposed in this work runs RL tasks within a federated learning architecture. Some recent works also investigate federated reinforcement learning (FRL) architectures. One presents two real-life FRL examples motivated by privacy preservation, in the manufacturing industry and in medical treatment systems. The authors further investigated cooperative multi-agent RL under privacy-preserving requirements on agent data, gradients and models. Another studied FRL in autonomous navigation, where the main task is to let robots fuse and transfer their experience so that they can effectively use prior knowledge and quickly adapt to new environments. The authors presented Lifelong Federated Reinforcement Learning (LFRL), in which robots learn efficiently in a new environment and extend their experience so that they can reuse their prior knowledge. A third employed FRL techniques for the personalization of a non-player character, and developed a player grouping policy, a communication policy and a federation policy.
I-B Our Proposal
Unlike existing FRL research, our motivation originates from the feasibility of transferring, online, the knowledge learned in one RL task to another task, combining federated learning with an online transfer model.
In this paper, we present the Federated Transfer Reinforcement Learning (FTRL) framework, which is capable of transferring the knowledge of RL agents in real time on the foundation of federated learning. To the best of our knowledge, this is the first work to combine FRL techniques with an online transfer model. Compared with the existing works above, our proposed framework has the following advantages:
Online transfer. The proposed framework is capable of executing the source and the target RL tasks in simulator or real-life environments with non-identical robots, obstacles, sensors and control systems;
Knowledge aggregation. Based on the functionality of federated learning, the proposed framework can aggregate knowledge in near real time.
We validate the effectiveness of the FTRL framework on real-life collision avoidance systems built on JetsonTX2 remote-controlled (RC) cars and on the Airsim simulator. The experimental results show that FTRL can transfer knowledge online, with better training speed and evaluation performance.
II Hardware Platform and Tasks
To illustrate and validate the proposed framework, we construct real-life autonomous systems based on three JetsonTX2 RC cars, the Microsoft Airsim autonomous driving simulator and a PC server. Fig. 1 presents the basic hardware and software platforms used in the validation process.
The real-life RL agents run on three RC cars, each of which houses a battery, a JetsonTX2 single-board computer, a USB hub, a LIDAR sensor and an on-board Wi-Fi module. Fig. 1(a) shows the experimental RC car.
In the collision avoidance experiment, we use a PC as both the model pre-training platform and the FL server. It is equipped with an 8-core Intel i9-9820X CPU, 32 GB of memory, and four NVIDIA 2080 Ti GPUs.
Developed by Microsoft, Airsim is a simulator for drones and cars that serves as a platform for AI research on deep reinforcement learning, autonomous driving, etc. The version used in this experiment is v1.2.2-Windows (https://github.com/Microsoft/AirSim/releases). In the pre-training and federation processes, we use the built-in “coastline” map of the Airsim platform, shown in Fig. 1(b).
As shown in Fig. 1(c), we construct a fence-like experimental race for the collision avoidance tasks in an indoor environment. We regularly change the overall shape of the race and sometimes place obstacles in it in order to construct different RL environments. However, within a single run of a specific RL task, the race shape and obstacle positions remain unchanged.
III Proposed Framework
It is worth noting that the FTRL framework is not designed for any specific RL method. However, in order to describe the framework thoroughly and validate its effectiveness, Deep Deterministic Policy Gradient (DDPG) is chosen as the RL implementation.
III-A DDPG Algorithm
We consider the following standard RL setting: an RL agent interacts with a stochastic environment in discrete time. At each time step $t$ the agent makes observations $s_t$, takes actions $a_t$, and receives rewards $r_t$. We assume that the environments considered in this work have real-valued observations $s_t \in \mathbb{R}^n$ and actions $a_t \in \mathbb{R}^m$. In the deterministic action case, the agent's behavior is controlled by a deterministic policy $\pi: \mathbb{R}^n \rightarrow \mathbb{R}^m$, which maps each observation to an action. The state-action value function, which describes the expected return conditioned on first taking action $a$ from state $s$ and subsequently acting according to $\pi$, is defined as
$$Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s,\ a_0 = a,\ a_t = \pi(s_t)\ \text{for}\ t \ge 1\right], \qquad (1)$$
where $\gamma$ is the discount factor.
DDPG is an off-policy actor-critic algorithm that primarily uses two neural networks, one for the actor and one for the critic. The critic network is updated with gradients obtained from the temporal difference (TD) error. The actor network is updated with the deterministic policy gradient of Silver et al.
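As a small illustration of the quantities above, the following sketch computes a discounted return and the one-step TD target and TD error that drive the critic update. The function names and the numeric values are hypothetical and are not taken from the authors' implementation.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * r_t over a finite reward sequence."""
    total = 0.0
    # Accumulate backwards so each step is a single multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def td_target(reward: float, next_q: float, gamma: float, done: bool = False) -> float:
    """One-step TD target: r + gamma * Q(s', pi(s')), or just r at episode end."""
    return reward if done else reward + gamma * next_q


def td_error(q_value: float, reward: float, next_q: float, gamma: float) -> float:
    """Difference between the TD target and the current critic estimate."""
    return td_target(reward, next_q, gamma) - q_value


if __name__ == "__main__":
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
    print(td_error(q_value=2.0, reward=1.0, next_q=4.0, gamma=0.5))
```

In a full DDPG implementation the critic would minimize the squared TD error over a sampled batch, with `next_q` produced by target networks; that machinery is omitted here.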
III-B Steering Control RL Settings
Similar to some existing works on single collision avoidance systems [9, 18], we seek to develop steering control that avoids collisions for autonomous agents. The observations are the LIDAR distance data collected by the sensor mounted on the autonomous cars. To accomplish this task, we introduce a specific reward function conditioned on the next observation $s_{t+1}$, i.e., $r(s_{t+1})$, which is defined in Eq. 2.
Eq. 2 is built from the following quantities: the number of dimensions of the LIDAR distance data; a fraction of the distance data; the floor operator, which returns the largest integer no larger than its argument; the ascending-sorted sequence of $s_{t+1}$; and an indicator function that equals 1 if its event happens and 0 otherwise. It further uses a positive base reward value, a positive penalty value for collision events, and a positive coefficient that casts an exponential penalty on the smallest distances. It follows from Eq. 2 that a well-performing action policy should 1) cause no collision events and 2) make the smallest fraction of the distance data as large as possible.
Note that we condition the reward function on $s_{t+1}$ rather than on $s_t$ and $a_t$ for the following reasons:
The collision event caused by $a_t$ can be detected from $s_{t+1}$: when the minimal value of the LIDAR data, $\min(s_{t+1})$, is lower than the predefined safe distance, a collision event is detected and the penalty value is activated in the reward function.
Given the current observation $s_t$, a good steering policy keeps the autonomous agent as far away as possible from any obstacle in the next state. Specifically, the agent is expected to maximize its minimal distance to all obstacles at the next time step, i.e., $\min(s_{t+1})$. Moreover, to account for stochastic factors, we apply the exponential penalty to the average of the smallest fraction of the ascending sequence of $s_{t+1}$ rather than to the single minimum.
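The exact expression of Eq. 2 is not reproduced in this text, so the sketch below is only one plausible form that matches the verbal description: a base reward, minus a collision penalty when the minimal LIDAR distance drops below the safe distance, minus an exponential penalty that shrinks as the mean of the smallest fraction of the sorted distances grows. The functional form, all parameter names and all default values are assumptions made for illustration.

```python
import math
from typing import Sequence


def steering_reward(
    next_obs: Sequence[float],
    base_reward: float = 1.0,          # assumed value
    collision_penalty: float = 10.0,   # assumed value
    decay: float = 1.0,                # assumed exponential coefficient
    fraction: float = 0.1,             # assumed fraction of smallest distances
    safe_distance: float = 0.2,        # assumed safe distance
) -> float:
    """Hypothetical reward computed from the next LIDAR observation only."""
    n = len(next_obs)
    # floor(fraction * n) smallest readings; at least one (our own safeguard).
    k = max(1, math.floor(fraction * n))
    smallest = sorted(next_obs)[:k]
    mean_smallest = sum(smallest) / k
    # Indicator of a collision event: minimal distance below the safe distance.
    collided = min(next_obs) < safe_distance
    penalty = collision_penalty if collided else 0.0
    # The exponential term is close to 1 near obstacles and decays with distance.
    return base_reward - penalty - math.exp(-decay * mean_smallest)
```

Under this form, an observation with all readings far from obstacles scores higher than one with closer readings, and any reading below the safe distance dominates the reward through the collision penalty.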
III-C FTRL Framework
For the collision avoidance task considered here, we present the FTRL framework, whose basic components are shown in Fig. 2. Several autonomous car agents conduct collision avoidance RL tasks in different environments, including real-life and simulator environments. All agents share an identical model structure, so that their models can be aggregated by the FedAvg process [12, 19]. The basic training process is as follows:
Online transfer process. Since distributed RL agents are acting in various environments, a knowledge transfer process is needed when each RL agent interacts with its specific environment;
Single RL agent training and inference. This process serves as a standard RL agent training and inference process.
FedAvg process. All the useful knowledge of the distributed RL agents is aggregated by the FedAvg process over the RL models, which can be expressed as:
$$\theta^{fed} = \frac{1}{N} \sum_{i=1}^{N} \theta_i, \qquad (3)$$
where $\theta^{fed}$ and $\theta_i$ represent the network parameters of the federation model and of the model of the $i$-th RL agent, respectively, and $N$ is the number of RL agents. $\theta^{fed}$ is updated element-wise as the arithmetic mean of all RL models.
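The element-wise mean described above can be sketched in a few lines. This is a minimal illustration that stores each model as a dictionary of flat parameter lists; the `Params` type and the layer names in the test are hypothetical, and a real implementation would average framework tensors instead.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def fed_avg(models: List[Params]) -> Params:
    """Element-wise arithmetic mean of identically shaped parameter sets."""
    if not models:
        raise ValueError("at least one model is required")
    n = len(models)
    averaged: Params = {}
    for key in models[0]:
        # Group the same parameter position across all agents, then average.
        columns = zip(*(m[key] for m in models))
        averaged[key] = [sum(col) / n for col in columns]
    return averaged
```

Because all agents share an identical model structure, every key and every list length match across `models`, which is what makes a plain positional average meaningful.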
Online Transfer Process. Since the RL tasks to be accomplished are highly related and all observation data are proportionally correlated and pre-aligned, one possible transfer strategy is to align the observations and actions numerically. According to the reward function in Eq. 2, the reward depends solely on $s_{t+1}$. Therefore, the transfer only has to be applied to the observations and the actions. For the LIDAR observation data, we designate one environment as the standard environment, and observations of non-identical scales are transformed into standard observations proportionally:
$$s^{std}_t = \beta_i \, s^{(i)}_t, \qquad (4)$$
where $s^{(i)}_t$ is the raw observation in the $i$-th environment and $\beta_i$ is a hyperparameter controlling the scale ratio between the $i$-th and the standard environments. We then standardize the action of DDPG into the range $[-1, 1]$, and when making a steering action, the $i$-th agent acts as:
$$a^{(i)}_t = a^{std}_t \cdot a^{(i)}_{\max}, \qquad (5)$$
where $a^{(i)}_{\max}$ represents the maximal range of the steering control for the specific car in the $i$-th environment. The detailed procedures for the RL agent and the FL server are presented in Algorithms 1 and 2 ($\theta_i$ denotes the DDPG model of the $i$-th agent and $\theta^{fed}$ the federation model).
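The two alignment steps can be sketched as a pair of small functions. This is an illustrative reading of the proportional transfer described above: it assumes the scale ratio multiplies the raw observation and that the standardized action lies in [-1, 1]. The function names, the clipping step and the example steering limit are hypothetical.

```python
from typing import List, Sequence


def to_standard_observation(obs: Sequence[float], scale_ratio: float) -> List[float]:
    """Proportionally rescale LIDAR distances into the standard environment."""
    return [scale_ratio * d for d in obs]


def to_steering_command(std_action: float, max_steering: float) -> float:
    """Map a standardized action (assumed in [-1, 1]) to the car's steering range."""
    # Clip defensively in case exploration noise pushes the action out of range.
    clipped = max(-1.0, min(1.0, std_action))
    return clipped * max_steering
```

For example, with a hypothetical steering limit of 30 degrees, a standardized action of 0.5 becomes a 15 degree command, while the same policy output on a car with a different limit scales accordingly.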
The training procedure for FTRL works in an asynchronous way:
The $i$-th agent procedure. As shown in Algorithm 1, the $i$-th agent first applies an agent-specific transfer process according to Eq. 4 if it is not acting in the standard environment. It then asynchronously updates its RL model from the FL server if needed. Lastly, it trains the RL model from the experience buffer with the DDPG algorithm. A hyperparameter is introduced to control the time interval at which the local model is updated from the federation model on the FL server.
FL server procedure. As shown in Algorithm 2, the FL server regularly collects the RL models from all agents, at an interval controlled by the federation cycle hyperparameter. The FL server then generates the federation model by the FedAvg process.
Inference with FTRL is simple: the $i$-th agent receives an observation and, if needed, performs the transfer process according to Eq. 4. The standardized action is then computed as $a^{std}_t = \pi(s^{std}_t) + \mathcal{N}_t$ (where $\mathcal{N}_t$ denotes the $t$-th time step result of the random process in DDPG), and lastly the steering action is obtained from Eq. 5.
Note that since Algorithms 1 and 2 work asynchronously, some weight updates of the local RL agents may never be used. For example, assume that two model synchronizations of the $i$-th agent happen at times $t_1$ and $t_2$, and that a federation process of the FL server happens between them at time $t'$, i.e., $t_1 < t' < t_2$. Since at time $t_2$ the agent replaces its model with the federation model generated at time $t'$, the local training performed between $t'$ and $t_2$ has no impact on the FL system. It is straightforward to extend the current framework to conduct asynchronous model updates, as done in related work.
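The discarded-update effect can be reproduced with a toy discrete-time simulation. This is a sketch under strong simplifying assumptions: each model is a single scalar, "training" just adds a fixed increment per tick, and the federation and synchronization cycles are measured in the same hypothetical tick unit. It is not the authors' Algorithm 1 or 2.

```python
from typing import List


def simulate(num_agents: int, total_ticks: int, fed_cycle: int, sync_cycle: int) -> List[float]:
    """Toy run in which each agent holds one scalar 'weight'."""
    local = [0.0] * num_agents
    federation = 0.0
    for tick in range(1, total_ticks + 1):
        # Mock local training: agent i adds (i + 1) to its weight every tick.
        for i in range(num_agents):
            local[i] += i + 1
        # FL server: collect all local models and average them (FedAvg).
        if tick % fed_cycle == 0:
            federation = sum(local) / num_agents
        # Agents: overwrite the local model with the latest federation model.
        if tick % sync_cycle == 0:
            local = [federation] * num_agents
    return local
```

With two agents, a federation cycle of 2 and a synchronization cycle of 3, the federation model is built at tick 2 from weights 2 and 4. The increments made at tick 3 are then overwritten when both agents pull that model, so both end with weight 3.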
In this section, we conduct real-life experiments on RC cars and Airsim to validate the following: 1) FTRL is capable of transferring knowledge online from simulators to real-life environments; 2) compared with a single run of DDPG, the FTRL framework achieves better training speed and performance.
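Table I (percentages in parentheses: change relative to DDPG)
|Method||Car 1 avg. distance||Car 1 collisions||Car 2 avg. distance||Car 2 collisions||Car 3 avg. distance||Car 3 collisions|
|FTRL-DDPG||0.42 (7.7%)||9 (50%)||0.37 (27.6%)||27 (12.9%)||0.51 (34.2%)||17 (29.2%)|
|FTRL-DDPG-SIM||0.45 (15.4%)||12 (33.3%)||0.39 (34.5%)||16 (48.4%)||0.50 (31.6%)||13 (45.8%)|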
IV-A Application Details
In this subsection, for the sake of reproducibility, we present the basic application settings for FTRL, the Airsim platform and the RC cars.
(The codes for all the implementations are uploaded to http…… ). The following presents the basic DDPG settings employed: the actor network is equipped with three 128-unit fully-connected layers with a continuous output layer, while the critic network also has three 128-unit fully-connected layers with a state-action output layer; We set and , and learning rates for both actor and critic networks 1e-4. We set the experience buffer size to be 2500 and batchsize 32.
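The experience buffer settings above (size 2500, batch size 32) can be illustrated with a minimal replay buffer. This is a generic sketch rather than the authors' code: the class name, the `Transition` layout and the seeding are assumptions. The `is_full` check reflects the later remark that only inference happens while the buffer holds fewer than 2500 transitions.

```python
import random
from collections import deque
from typing import Deque, List, Tuple

# Hypothetical transition layout: (observation, action, reward, next observation).
Transition = Tuple[List[float], float, float, List[float]]


class ReplayBuffer:
    def __init__(self, capacity: int = 2500, batch_size: int = 32, seed: int = 0) -> None:
        # A bounded deque silently drops the oldest transition when full.
        self._data: Deque[Transition] = deque(maxlen=capacity)
        self._batch_size = batch_size
        self._rng = random.Random(seed)

    def add(self, transition: Transition) -> None:
        self._data.append(transition)

    def is_full(self) -> bool:
        return len(self._data) == self._data.maxlen

    def sample(self) -> List[Transition]:
        """Uniformly sample one training batch without replacement."""
        return self._rng.sample(list(self._data), self._batch_size)
```

A training loop built on this would call `add` at every time step and start calling `sample` only once `is_full()` returns true.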
The basic settings for Airsim are in the uploaded settings file: in order to maintain good transferability to the RC cars, the LIDAR sensor is set to collect only the distance data of the front view (with a ‘HorizontalFOV’ range of [-90, 90]), which are divided into 60 dimensions from left to right. We use the public built-in map “coastline” of Airsim to conduct the pre-training and federation processes.
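One way to obtain a 60-dimensional front-view observation is to bin raw LIDAR returns by angle. The sketch below is an assumption about how such a reduction could be done, not the authors' preprocessing: it takes (angle in degrees, distance) pairs, treats negative angles as the left side, keeps the closest return per 3 degree bin, and fills empty bins with a hypothetical `max_range`.

```python
from typing import List, Sequence, Tuple


def bin_lidar(
    points: Sequence[Tuple[float, float]],   # (angle in degrees, distance)
    num_bins: int = 60,
    fov: Tuple[float, float] = (-90.0, 90.0),
    max_range: float = 10.0,                 # assumed fill value for empty bins
) -> List[float]:
    """Reduce raw returns to one distance per angular bin, ordered left to right."""
    lo, hi = fov
    width = (hi - lo) / num_bins
    bins = [max_range] * num_bins
    for angle, distance in points:
        if not lo <= angle <= hi:
            continue  # outside the front view
        # The right edge of the field of view falls into the last bin.
        index = min(int((angle - lo) / width), num_bins - 1)
        bins[index] = min(bins[index], distance)
    return bins
```

Keeping the minimum per bin is a conservative choice for collision avoidance, since the closest obstacle in each direction is what matters for the reward.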
In the experiments conducted, we set for all RC cars. The LIDAR data are collected at a frequency of 40Hz. The interactions among the DDPG agents, the RC car control system and the Airsim are divided into discrete decision making problem with time interval of 0.25 seconds. The federation cycle of the FL server is set to be 2 minutes and the synchronization cycle of local agents 3 minutes.
For the reward function presented in Eq. 2, we set the base reward value , the collision penalty value , the minimum safe distance and the exponential distance penalty value .
IV-B Comparison Results
Since training DDPG from scratch on real-life autonomous cars may take an unacceptably long time, we pre-trained a common DDPG model on the Airsim platform for all participating DDPG agents. With the pre-trained model, each car can take reasonable actions in response to the LIDAR data, although there is still room for improvement.
In the fine-tuning process of each real-life agent, we divide the training time of each DDPG agent into three stages of 2500 discrete time steps each. Since only inference with the pre-trained model happens while the experience buffer holds fewer than 2500 transitions, we ignore the results of the first stage and refer to the following two stages as stage I and stage II. As mentioned, each time step takes 0.25 seconds, so stage I and stage II cover 625 to 1250 seconds and 1250 to 1875 seconds, respectively.
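The stage boundaries follow directly from the step length and the number of steps per stage. The short sketch below makes that arithmetic explicit; the constant and function names are hypothetical.

```python
from typing import Tuple

STEP_SECONDS = 0.25      # decision interval from the application settings
STEPS_PER_STAGE = 2500   # equal to the experience buffer size


def stage_window(stage_index: int) -> Tuple[float, float]:
    """Start and end time in seconds of a 0-based training stage."""
    length = STEPS_PER_STAGE * STEP_SECONDS
    return (stage_index * length, (stage_index + 1) * length)
```

Index 0 is the ignored buffer-filling stage, so the two evaluated stages correspond to `stage_window(1)` and `stage_window(2)`, each lasting 625 seconds.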
Since the cars may be running in different environments, their rewards may be on non-identical scales. To make the results comparable, we employ a relative performance metric. For a time step index $k$ ($k = 1, \dots, 2500$) within stages I and II, let $r^{I}_k$ and $r^{II}_k$ represent the respective rewards; the relative performance is defined as:
$$p_k = \frac{r^{II}_k - r^{I}_k}{r_{\max} - r_{\min}}, \qquad (6)$$
where $r_{\max}$ and $r_{\min}$ denote the maximal and the minimal reward values for a single run of each car, respectively. It follows from Eq. 6 that $p_k \in [-1, 1]$, and $p_k > 0$ indicates that the $k$-th time step in stage II performs better than that in stage I.
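The metric and its running sum can be sketched as follows. This assumes the relative performance is the reward difference between the later and the earlier stage at the same index, normalized by the reward range of the run. The function names are hypothetical, and `r_min` and `r_max` are passed in explicitly rather than derived.

```python
from itertools import accumulate
from typing import List, Sequence


def relative_performance(
    stage_one: Sequence[float],
    stage_two: Sequence[float],
    r_min: float,
    r_max: float,
) -> List[float]:
    """Per-step reward difference between the stages, scaled into [-1, 1]."""
    span = r_max - r_min
    if span <= 0:
        raise ValueError("r_max must be greater than r_min")
    return [(r2 - r1) / span for r1, r2 in zip(stage_one, stage_two)]


def cumulative(values: Sequence[float]) -> List[float]:
    """Running sum, as plotted for each car in the result figures."""
    return list(accumulate(values))
```

A cumulative curve that stays above zero means the later stage has, on balance, earned more reward than the earlier one up to that time step.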
We track the cumulative sum of the relative performance across the stages and present the results in Fig. 3 for the following application settings:
DDPG results on single RC cars;
FTRL-DDPG results with the federation of three RC cars (FTRL-DDPG);
FTRL-DDPG results with the federation of three RC cars and the Airsim platform (FTRL-DDPG-SIM).
As can be seen from Fig. 3(a), for each car, most values of the cumulative sum of relative performance are below 0, which indicates that the performance decays from stage I to stage II. For a single run of DDPG, with only 2500 time steps of training, no performance improvement of the local RL agents can therefore be guaranteed.
However, referring to Fig. 3(b), for FTRL-DDPG, most of the cumulative sum values of relative performance on car 1 and car 3 are above 0. For car 2, the performance decays from stage I to stage II over the first 1500 time steps, but a significant improvement of the relative performance can be observed over time steps 1500-2500. These results indicate that the FTRL framework can accelerate training and improve the performance of the federation of three cars.
Referring to Fig. 3(c), for FTRL-DDPG-SIM, most of the cumulative sum values of relative performance are above 0 on all cars in the experiments. Comparing FTRL-DDPG-SIM with FTRL-DDPG, we can see that FTRL-DDPG-SIM achieves greater relative performance than FTRL-DDPG at most of the recorded time steps. These results indicate that the transfer model employed in FTRL-DDPG-SIM is effective in accelerating the training of autonomous cars by transferring online the knowledge learned from the Airsim simulator, which can take over more of the training workload of the RL agents.
To better compare the results of the different RL tasks, we further compared the trained models. The experimental race is shown in Fig. 4. It is worth noting that the test race is deliberately set to be much more complicated than the training environments (shown in Fig. 1(c)), with more obstacles and tighter distances.
In the test experiments with the trained models, each run of each car is executed for 50 cycles in the experimental race. We recorded the average LIDAR distances and the numbers of collisions for each run of DDPG, FTRL-DDPG and FTRL-DDPG-SIM on each car. A better policy fulfills the collision avoidance task with a greater average distance and fewer collisions. Table I presents the corresponding results.
In Table I, the results in bold denote the better result for each car. For each car, FTRL-DDPG and FTRL-DDPG-SIM achieve greater average distances and fewer collisions than DDPG, which demonstrates the effectiveness of the FTRL framework. On average over the test race tasks, compared with DDPG, FTRL-DDPG improves performance with a 20.3% increase in the average distance to obstacles and a 30.7% decrease in the number of collisions, while for FTRL-DDPG-SIM the corresponding figures are 27.2% and 42.5%, respectively.
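The averaged figures can be checked against the per-car percentages in Table I. The sketch below assumes the table columns are ordered as (average distance, collisions) for cars 1 to 3; the dictionary layout and function name are hypothetical. Plain averaging reproduces the quoted 30.7%, 27.2% and 42.5%. For the FTRL-DDPG distance column it gives about 23.2% rather than the quoted 20.3%, and the table alone does not show which figure is intended.

```python
from statistics import mean
from typing import Dict, List

# Percentages copied from Table I, ordered car 1, car 2, car 3.
TABLE: Dict[str, Dict[str, List[float]]] = {
    "FTRL-DDPG": {"distance": [7.7, 27.6, 34.2], "collisions": [50.0, 12.9, 29.2]},
    "FTRL-DDPG-SIM": {"distance": [15.4, 34.5, 31.6], "collisions": [33.3, 48.4, 45.8]},
}


def average_gains(table: Dict[str, Dict[str, List[float]]]) -> Dict[str, Dict[str, float]]:
    """Mean percentage change per method and metric, rounded to one decimal."""
    return {
        method: {metric: round(mean(values), 1) for metric, values in metrics.items()}
        for method, metrics in table.items()
    }
```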
In conclusion, for autonomous driving, thanks to its capability of transferring knowledge online from simulators to real-life cars, FTRL-DDPG-SIM achieves better training speed and performance than both a single execution of individual RL agents and the federation model with identical RL agents.
V Conclusions and Future Work
In this work, we present the FTRL framework, which is capable of transferring online the knowledge of different RL tasks executed in non-identical environments. However, the transfer model employed in this work is rather simple and is based on human knowledge. Autonomously transferring the experience or knowledge of already-learned tasks to new ones online constitutes another research frontier.
References
[1] (2017) Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4055–4065.
[2] (2017) Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 285–292.
[3] (2016) Autonomous drifting using simulation-aided reinforcement learning. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 5442–5448.
[4] (2009) Learning agents for collaborative driving. In Multi-Agent Systems for Traffic and Transportation Engineering, pp. 240–260.
[5] (2019) Parallel reinforcement learning. In The 6th World Conference on Systemics, Cybernetics, and Informatics.
[6] Design and use paradigms for Gazebo, an open-source multi-robot simulator. In 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vol. 3, pp. 2149–2154.
[7] (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
[8] (2019) Lifelong federated reinforcement learning: a learning architecture for navigation in cloud robotic systems. arXiv preprint arXiv:1901.06455.
[9] (2011) Autonomous pedestrian collision avoidance using a fuzzy steering controller. IEEE Transactions on Intelligent Transportation Systems 12 (2), pp. 390–401.
[10] (2018) Towards optimally decentralized multi-robot collision avoidance via deep reinforcement learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6252–6259.
[11] (2018) Universal successor representations for transfer reinforcement learning. arXiv preprint arXiv:1804.03758.
[12] (2016) Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629.
[13] Federated reinforcement learning for fast personalization. In 2019 IEEE Second International Conference on Artificial Intelligence and Knowledge Engineering (AIKE), pp. 123–127.
[14] (2009) A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22 (10), pp. 1345–1359.
[15] (2017) Virtual to real reinforcement learning for autonomous driving. arXiv preprint arXiv:1704.03952.
[16] (2015) Actor-mimic: deep multitask and transfer reinforcement learning. arXiv preprint arXiv:1511.06342.
[17] (2014) Deterministic policy gradient algorithms.
[18] (2015) Steering control collision avoidance system and verification through subject study. IET Intelligent Transport Systems 9 (10), pp. 907–915.
[19] Federated machine learning: concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 10 (2), pp. 12.
[20] (2019) End-to-end nonprehensile rearrangement with deep reinforcement learning and simulation-to-reality transfer. Robotics and Autonomous Systems 119, pp. 119–134.