Multi-robot navigation is one of the significant problems for multi-robot systems and practical to cooperative tasks such as search and rescue, formation control, and area coverage. As each robot has limited communication and sensing ranges, cooperative behaviors among robots rely on the connectivity of the communication network. Therefore, maintaining connectivity over time is important for multi-robot navigation.
have been proposed. Most of these works consider global connectivity because of its flexibility as it allows robots to break communication links as long as the communication graph is connected over time. This is guaranteed by preserving the second-smallest eigenvalue of the underlying graph Laplacian of robot networks to be positive, which is well-known as algebraic connectivity. As these works attempt to complete various tasks, some of them validate their strategies in obstacle-free environments for formation control , visiting multiple waypoints  and area coverage . Potential field methods are widely studied to combine connectivity maintenance with system objectives and obstacle avoidance. Williams et al.  select a dispersive potential with collision avoidance to validate the designed connectivity controller at the presence of disc-obstacles. Li et al.  have proposed a potential-based method towards circle obstacles avoidance with connectivity maintenance. Sabattini et al.  evaluate their gradient-based controller in rendezvous applications. They use a collision avoidance strategy which is not mentioned in experiments as the controller is designed for general purposes.
One of the major challenges of multi-robot navigation is the imperfect environment information due to limited sensing capabilities. Connectivity controllers may fail since they can be trapped at the presence of obstacles without the explicit design of navigation strategies. Moreover, simple assumptions on obstacle shapes have limitations in complex environments.
To navigate in complex environments, RL-based approaches are alternatives that have shown promising results in map-less navigation. Many of these works have developed end-to-end navigation policies which directly map raw sensor measurements to control commands without prior knowledge of the environment [8, 9, 10]. While few of them focus on connectivity of the robot team, Lin et al.  propose a RL framework that combines a constrained optimization module to enforce local connectivity constraints. However, local connectivity may lead to restrictive behaviors since each robot has to maintain communication links with all other robots in the team over time. Moreover, the effects of action correction from the optimization module are nontrivial in RL as it may break the Markov assumption [12, 13].
In this work, we focus on multi-robot navigation with connectivity maintenance at the presence of unknown obstacles. We consider global connectivity to allow flexible changes of communication graphs and incorporate it as constraints into RL methods. Although the connectivity-constrained multi-robot navigation problem can be formulated as a Constrained Partial Observable Markov Decision Process (Constrained POMDP), existing constrained RL methods are infeasible due to sample inefficiency and the inherent difficulty of this multi-objective problem (reaching target points and avoiding collisions while bounding connectivity constraints). To address this issue, we present a model-free constrained RL approach that combines behavior cloning (BC) to make policy training tractable. By learning from demonstrations, the actions of robots are guided towards expert navigation behaviors and therefore improve efficiency in policy search. Following the centralized training decentralized execution scheme[14, 15], a shared policy is optimized by experiences of all agents during training. As a result, a decentralized policy is developed that takes information of other agents into account to maintain global connectivity and leverages range sensor data for collision avoidance.
The contributions of this work are:
A constrained RL-based approach is proposed to navigate multiple robots in unknown environments while maintaining global connectivity in a decentralized manner.
An effective way to facilitate policy training by combing BC and RL without breaking the connectivity constraint.
The performance of the developed policy is empirically evaluated in simulation and real-world robotic platforms.
Ii Related Work
RL-based approaches have been intensively studied in the last few years and have shown great potentials in navigation with imperfect sensing in unknown complex environments. For single-robot navigation, Pfeiffer et al. 
propose a sample efficient approach that uses expert demonstrations for policy pre-training and further improve the policy performance by RL. The experiments show that the policy outperforms their previous approach, where an end-to-end policy is trained by supervised learning only. In particular, they treat collision avoidance as constraint to ensure safety and use pre-training to improve sample efficiency. In our approach, a connectivity-maintained expert is infeasible, and thus we impose connectivity constraints and provide navigation demonstrations in one-stage RL training without pre-training. Zhang et al.  aim to develop a flexible policy and propose a transfer method that can handle various range sensor dimensional configurations. Later, they propose a support-point approach to distinguish critical points from LiDAR range measurements and handle different LiDAR configurations without re-training.
In multi-robot settings, Chen et al.  propose an approach called CADRL that considers observable states of other robots and pedestrians instead of direct mapping from perception to motion commands. With a trained value function, the robots perform one-step lookahead to generate collision-free actions. Later, they extend CADRL and design a socially-aware reward function for learning a policy that incorporates social norms towards pedestrian-rich navigation. Everett et al. 
augment this framework with Long short-term memory to improve performance on complex interactions with pedestrians. The parameter sharing scheme[20, 21] is popular for efficient multi-agent policy training and applicable to multi-robot navigation. Long et al.  train a multi-robot collision avoidance policy with a multi-agent extended variant of Proximal Policy Optimization (PPO) and adopt curriculum learning to facilitate training. This policy is shared among robots and navigates each robot to its desired goal from 2D laser measurements without communications in a decentralized manner. Lin et al.  aim to navigate the geometric center of a robot team to reach waypoints and develop an end-to-end policy shared among the robots that consider raw laser data inputs and position information of other robots. Han et al.  also consider observable states of other robots and dynamic obstacles with Gaussian noises. They propose a greedy target allocation method to efficiently navigate the robots to multiple targets.
Iii-a Problem formulation
The multi-robot navigation problem can be formulated as a Partial Observable Markov Decision Process (POMDP). Formally, a POMDP is defined by tuple where is the state space, is the action space, is the observation space,
is the transition probabilities of taking actionat state , is the observation distribution at the new state , is the reward function and is the discount factor.
Specifically, we consider a team of robots navigate to a target zone with radius . The robots are considered homogeneous and modeled as discs with the same radius and communication range . At each timestep , each robot obtains 2D laser scans and its velocity . The relative goal position and the positions of other robots in the team are informed to the robot . Given the observation , the robot aims to obtain a parameterized policy that directly maps to control command . The policy is shared among robots, and each robot is allowed to take different actions since they receive different observations.
The objective of RL is to find the optimal parameter , where the corresponding policy maximizes the expected cumulative discounted rewards:
where is the horizon of an episode.
Iii-A1 Observation space and action space
As mentioned above, the observation for each robot consists of four parts . The 2D laser scans data consists of 90 measurements from a 360 LiDAR, the velocity is in Cartesian coordinate, and all of the positions (i.e., and ) are transformed to the local robot coordinate frame from global coordinates. In our case, the robots are omnidirectional and the control command is the velocity in Cartesian coordinate, i.e., . For practical purpose, we shrink the output of the policy to fit the maximum value of velocity, i.e., .
Iii-A2 Reward design
The robot team is supposed to get into the target zone without collisions. To achieve this, the reward function for each robot is designed to encourage approaching the target zone and avoiding collisions with other robots and obstacles :
where is the reward when the robot approaches the target, or all of the robots enter the target zone:
where denotes the distance to goal for robot at timestep and . is to penalize collisions among robots or obstacles:
We empirically set , and .
Iii-B Global connectivity constraint
The approach for global connectivity maintenance of communication networks is based on the second smallest eigenvalue of the associated graph Laplacian, which is briefly described below and used to formulate a connectivity constraint later on.
Consider a multi-agent systems with positions denoted as . Each pair of agents and can communicate with each other within an identical limited distance , i.e., . The communication topology can be represented as an undirected graph where each vertex denotes the -indexed agent and each edge denotes the communication availability between agents and . The Laplacian matrix of graph is defined as where is the adjacency matrix and each element if and otherwise. is the degree matrix, which is diagonal, i.e., and the -th element of the diagonal is . The second smallest eigenvalue of the Laplacian matrix, denoted as , describes algebraic connectivity of the graph as always holds if is connected .
With the RL formulation, the robot team learns to navigate to the target zone. To impose global connectivity to the team, we introduce a cost function that use the algebraic connectivity measurement::
where is the indicator function and is the algebraic connectivity of the graph which is constructed by at timestep . Note that the robots can infer from positions of the teammates . As global connectivity serves as a constraint, the multi-robot navigation problem is extended to a constrained RL formulation. Specifically, constrained RL aims to optimize the policy while bounding expected discounted returns with respect to the cost function:
The Eq. (6) is formed in terms of the expected value of trajectories. Similar to , by setting the discount factor close to 1, we are able to constrain the total number of constraint violations per trajectory to be less than the desired threshold . In this work, we use Constrained Policy Optimization (CPO)  to solve this constrained RL problem:
where is the upper bound. CPO adopts trust region methods for local policy search, which is introduced in Trust Region Policy Optimization (TRPO) . TRPO uses a constraint on the Kullback-Leiber (KL) divergence of two stochastic policies between updates to stabilize policy training with monotonic improvement guarantees. Similarly, CPO replaces the original objective and constraint in Eq. (7) with surrogate functions in a trust region:
where is the average KL divergence between two consecutive policies and is the step size. Advantages functions are introduced in Eq. (8
) which can be estimated by function approximatorswith one-step TD error . CPO approximately solves the surrogate optimization in Eq. (8). Due to approximation errors, CPO asymptotically bounds the constraint during training. One may refer to  for details.
Iii-C Behavior cloning
In fact, the optimization problem in Eq. (7) regarding two objectives (i.e., Eq. (1) and Eq. (6)) is usually infeasible as it leads to conservative policy explorations in stochastic domains even without a trust region constraint. In our case, connectivity constraint leads to in-place oscillations, and the robots can not learn to navigate to the target.
To make this policy optimization tractable, we introduce the behavior cloning (BC) technique during RL training. In many robotic tasks such as manipulation, limited demonstration data can be collected, but an expert policy for all states is usually infeasible [25, 26]. However, for map-less navigation, recent successful practices have shown that an expert can be obtained by RL methods without great effort. In particular, given observation and the same reward structure () in SecIII-A2, RL (we use Soft Actor-Critic  here) can train a navigation expert policy for a single robot efficiently. Though BC is widely used for pre-training to provide initialization for RL. As a result, pre-training from a navigation expert can not provide a good bootstrap for this problem since it performs undesired behaviors that break the global connectivity. Instead, we incorporate BC into RL training. With a navigation expert
, a supervised mean square error (MSE) loss function is constructed to minimize the difference with:
Then the original RL objective Eq. (1) is modified with this additional BC loss term:
where is a hyper-parameter.
Iii-D Neural network structure
We follow the common practices [9, 10, 17] in the parameterization of policy and adopt a similar network structure as shown in Fig. 3. The raw laser data is processed by two 1D convolution (Conv1D) layers and a fully connected (FCand are fed into two hidden FC layers. Then the velocity output is obtained from the last FC layer. As suggested by a large scale experimental results , we use hyperbolic tangent (tanh
) as activation functions and orthogonal initialization scheme to initialize network weights. The outputis supposed to be the mean of the action distribution. A separate vector with the same dimension as
is defined as the log standard deviation that will be updated alone. In this way, the stochastic policy
is formed as a Gaussian distribution. At each timestep , the robots take random actions for on-policy exploration.
The value networks and are introduced for advantage estimation and both constructed with two hidden and one output FC layers to map the observation to value estimates.
Network structure. Conv1D (filters, kernel, stride) denotes the 1D convolutional layer, FC denotes the fully-connected layer.
Iii-E Policy training
Iii-E1 Simulation setup
Since RL is considerably sample inefficient, the expert policy and the shared policy are trained in the simulation where sufficient samples are available. The multi-robot navigation is simulated in a custom OpenAI Gym environment . We use the domain randomization technique  in order to avoid overfitting and bridge the sim-to-real gap. Concretely, we build 100 2D maps in total where several elements are randomly initialized, including target positions, starting points of robots, and obstacle configurations (numbers, shapes, positions, and sizes) as shown in Fig. 4.
Iii-E2 CPO policy update
The policy training algorithm is shown in Algorithm 1. The robots collect transition data simultaneously in several maps that are randomly selected. At each timestep , each robot takes actions generated from the shared policy given its own observation and then receives its immediate reward and a shared connectivity cost signal . When the number of collected transition data is sufficient (i.e., greater than ), we use the batch of episodes collected by all robots to update policy . To build up CPO formulation, the objective and the constraint are approximated with advantage estimates (line 12). We then use the well-behaved expert , which is trained in advance to compute the supervised BC loss with all transition data (line 13). The policy network is optimized by CPO with regard to the combined objective in Eq. (10). The value networks is optimized with the following loss function:
Eq. (11) is computed with all sampled trajectories and minimized with an L-BFGS optimizer. The cost value network is updated in the same fashion in terms of cost signals.
We implement our neural network models in PyTorch. All training procedures and simulation experiments are performed on a computer with an Nvidia RTX 2080 GPU and an Intel Xeon Gold 6254 CPU. Each timestep takes about 2in various simulations. The primal hyper-parameters for Algorithm 1 are shown in Tab. I.
|number of timesteps for each update||2048|
|discount factor for reward||0.99|
|discount factor for cost||0.999|
|coefficient for BC loss||0.1|
|,||L-BFGS step size for||0.1|
|KL constraint threshold||0.01|
|connectivity constraint threshold||0.1|
Iv Experiments and results
To demonstrate the effectiveness of our approach, we first investigate the influences of behavior cloning and the connectivity constraint on policy performance. Then we evaluate the performance of our approach in real-world experiments.
|Metrics||Approach||3 robots||4 robots||5 robots|
Iv-a Comparisons on various approaches in simulation
To study the influences of behavior cloning and connectivity constraint in our approach, we compare three approaches with different combinations:
CPO: policy optimization under the connectivity constraint without supervised BC loss term, i.e., the BC loss coefficient
TRPO+BC: unconstrained policy optimization with supervised BC loss term where the connectivity constraint is removed.
CPO+BC (our approach): policy optimization under the connectivity constraint and the supervised BC loss term
The three approaches are compared in terms of average rewards and constraint value during training. To quantitatively measure the final performance of the approaches, we use the following metrics:
Success Rate: the ratio of the robot team to reach targets without collisions
Connectivity Rate: the ratio of the robot team to maintain connectivity during navigation
Travel Time: the average travel time for the robot team successfully navigates to targets
We randomly re-generate 100 maps for final performance evaluation. We conduct experiments with the various number of robots in simulation and evaluate the performance on average with 5 random seeds for each approach. The evolution of learning curves is consistent across different simulation settings, and we present one of the results (3 robots) in Fig. 4(a) and Fig. 4(b).
Influence of behavior cloning: Comparing CPO and CPO+BC shows that behavior cloning help to improve the policy convergence speed, as shown in Fig. 4(a). In fact, CPO fails to learn to navigate in most of the training maps because of influences of the connectivity constraint. This is supported by the constraint value evolution over time in Fig. 4(b). As shown in Tab. II, the CPO policy leads to terrible failures in terms of navigation success rate and dedicates to preserve connectivity in testing scenarios, which are undesired behaviors for navigation tasks. CPO+BC stabilizes the constraint value to be close to the threshold and succeeds in the majority of unseen scenarios. We conclude that BC has significant effects on policy improvement.
Influence of connectivity constraint: Comparing CPO+BC and TRPO+BC shows that CPO+BC is inferior in terms of policy convergence speed. This is due to the fact that CPO+BC learns to satisfy the constraint. However, TRPO+BC leads to high constraint values without bounding the constraint on training convergence. The final performance presented in Tab. II also proves that TRPO+BC fails to maintain connectivity in most cases even it has good navigation performance. Surprisingly, CPO+BC outperforms TRPO+BC at travel time. We believe that it is because CPO+BC learns implicit cooperative schemes as presented in many cases.
Compared to TRPO+BC, CPO+BC leads to sub-optimal navigation success rate but we find that most of the failures are caused where the team struggles in complex scenarios which require great exploration abilities. This result indicates that enforcing connectivity constraints degrades the ability of navigation exploration. Compared with CPO, TRPO allows to take ”risky” actions without concerns of constraints satisfaction. Both CPO+BC and TRPO+BC show consistent performance across the various number of robots in the experiment results listed in Tab. II.
Iv-B Real-world experiments
As CPO+BC and TRPO+BC have good navigation performance in simulation, we deploy the policies trained by both methods to evaluate their sim-to-real capabilities. The deployed policies take around 2M training timesteps.
Real robot configuration: Our holonomic robot model is shown in Fig. 6. The robot is equipped with DJI Mecanum wheel chassis for omnidirectional mobility and an RPLIDAR A2 laser range scanner. Each LiDAR on a robot is deployed at a different height to detect each other. We evenly pick 90 measurements from every 360 laser scan. The robot is controlled by a Jetson TX2 module that publishes the policy output at 10Hz. For both the simulations and the real-world experiments we set the radius of the robot and the maximum moving speed . We set to emulate limited communication range. To distribute position information to the robots, we use an OptiTrack global tracking system to capture and inform the states of robots and the target position. The robots receive laser data and position information by subscribing to ROS nodes of LiDAR and OptiTrack.
The policies are tested in various scenarios (shown in Fig. 7) and show consistent performance as in simulation on the three metrics in Tab. II. The experiment results are summarized in Fig. 4(c), which shows the value of algebraic connectivity of the robots, as the multi-robot system evolves. Our approach (i.e., CPO+BC) takes less travel time and maintains connectivity at all time while TRPO+BC fails in connectivity perseverance (the value goes to 0).
In this work, we present an RL-based approach to develop a decentralized policy for multi-robot navigation with connectivity maintenance. The policy is shared among robots and jointly optimized by transition data of all robots in simulation during training. By enforcing the connectivity constraint and providing expert demonstrations, the policy learns to generate navigation actions with connectivity concerns. We compare different combinations of our key designs and validate the effectiveness of our approach. We also evaluate the generalization performance of our approach in both various simulated scenarios and real-world experiments. As a case study of RL-based approach, this work presents an effective way to reduce exploration complexity with concerns of constraints satisfaction.
-  X. Li, D. Sun, and J. Yang, “A bounded controller for multirobot navigation while maintaining network connectivity in the presence of obstacles,” Automatica, vol. 49, no. 1, pp. 285–292, 2013.
-  L. Sabattini, N. Chopra, and C. Secchi, “Decentralized connectivity maintenance for cooperative control of mobile robotic systems,” The International Journal of Robotics Research (IJRR), vol. 32, no. 12, pp. 1411–1423, 2013.
-  R. K. Williams, A. Gasparri, G. S. Sukhatme, and G. Ulivi, “Global connectivity control for spatially interacting multi-robot systems with unicycle kinematics,” in 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 1255–1261.
-  A. Gasparri, L. Sabattini, and G. Ulivi, “Bounded control law for global connectivity maintenance in cooperative multirobot systems,” IEEE Transactions on Robotics (T-RO), vol. 33, no. 3, pp. 700–717, 2017.
-  M. Fiedler, “Algebraic connectivity of graphs,” Czechoslovak Mathematical Journal, vol. 23, no. 2, pp. 298–305, 1973.
-  L. Wang, A. D. Ames, and M. Egerstedt, “Multi-objective compositions for collision-free connectivity maintenance in teams of mobile robots,” in 2016 IEEE 55th Conference on Decision and Control (CDC). IEEE, 2016, pp. 2659–2664.
-  B. Capelli and L. Sabattini, “Connectivity maintenance: Global and optimized approach through control barrier functions,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 5590–5596.
-  M. Pfeiffer, S. Shukla, M. Turchetta, C. Cadena, A. Krause, R. Siegwart, and J. Nieto, “Reinforced imitation: Sample efficient deep reinforcement learning for mapless navigation by leveraging prior demonstrations,” IEEE Robotics and Automation Letters (RA-L), vol. 3, no. 4, pp. 4423–4430, 2018.
-  P. Long, T. Fan, X. Liao, W. Liu, H. Zhang, and J. Pan, “Towards optimally decentralized multi-robot collision avoidance via deep reinforcement learning,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 6252–6259.
-  J. Lin, X. Yang, P. Zheng, and H. Cheng, “End-to-end decentralized multi-robot navigation in unknown complex environments via deep reinforcement learning,” in 2019 IEEE International Conference on Mechatronics and Automation (ICMA). IEEE, 2019, pp. 2493–2500.
-  J. Lin, X. Yang, P. Zheng, and H. Cheng, “Connectivity guaranteed multi-robot navigation via deep reinforcement learning,” in Conference on Robot Learning (CoRL). PMLR, 2020, pp. 661–670.
R. Cheng, G. Orosz, R. M. Murray, and J. W. Burdick, “End-to-end safe
reinforcement learning through barrier functions for safety-critical
continuous control tasks,” in
Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 3387–3395.
-  S. Mysore, B. Mabsout, R. Mancuso, and K. Saenko, “Regularizing action policies for smooth control with reinforcement learning,” in 2021 IEEE International Confernece on Robotics and Automation (ICRA). IEEE, 2021.
-  R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” Neural Information Processing Systems (NIPS), 2017.
-  J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson, “Learning to communicate with deep multi-agent reinforcement learning,” in Neural Information Processing Systems (NIPS), 2016.
-  M. Pfeiffer, M. Schaeuble, J. Nieto, R. Siegwart, and C. Cadena, “From perception to decision: A data-driven approach to end-to-end motion planning for autonomous ground robots,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 1527–1533.
-  W. Zhang, Y. Zhang, and N. Liu, “Map-less navigation: A single drl-based controller for robots with varied dimensions,” arXiv preprint arXiv:2002.06320, 2020.
-  Y. F. Chen, M. Liu, M. Everett, and J. P. How, “Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 285–292.
-  M. Everett, Y. F. Chen, and J. P. How, “Motion planning among dynamic, decision-making agents with deep reinforcement learning,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 3052–3059.
-  J. K. Gupta, M. Egorov, and M. Kochenderfer, “Cooperative multi-agent control using deep reinforcement learning,” in International Conference on Autonomous Agents and Multiagent Systems (AAMAS). Springer, 2017, pp. 66–83.
-  T. T. Nguyen, N. D. Nguyen, and S. Nahavandi, “Deep reinforcement learning for multiagent systems: A review of challenges, solutions, and applications,” IEEE Transactions on Cybernetics, vol. 50, no. 9, pp. 3826–3839, 2020.
-  R. Han, S. Chen, and Q. Hao, “Cooperative multi-robot navigation in dynamic environment with deep reinforcement learning,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 448–454.
J. Achiam, D. Held, A. Tamar, and P. Abbeel, “Constrained policy
International Conference on Machine Learning (ICML). PMLR, 2017, pp. 22–31.
-  J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning (ICML). PMLR, 2015, pp. 1889–1897.
-  M. Vecerik, T. Hester, J. Scholz, F. Wang, O. Pietquin, B. Piot, N. Heess, T. Rothörl, T. Lampe, and M. Riedmiller, “Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards,” arXiv preprint arXiv:1707.08817, 2017.
-  A. Rajeswaran, V. Kumar, A. Gupta, G. Vezzani, J. Schulman, E. Todorov, and S. Levine, “Learning complex dexterous manipulation with deep reinforcement learning and demonstrations,” in Proceedings of Robotics: Science and Systems (RSS), 2018.
-  T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel et al., “Soft actor-critic algorithms and applications,” arXiv preprint arXiv:1812.05905, 2018.
-  L. Engstrom, A. Ilyas, S. Santurkar, D. Tsipras, F. Janoos, L. Rudolph, and A. Madry, “Implementation matters in deep rl: A case study on ppo and trpo,” in International Conference on Learning Representations (ICLR), 2019.
-  G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” arXiv preprint arXiv:1606.01540, 2016.
-  J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain randomization for transferring deep neural networks from simulation to the real world,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 23–30.
-  D. C. Liu and J. Nocedal, “On the limited memory bfgs method for large scale optimization,” Mathematical programming, vol. 45, no. 1, pp. 503–528, 1989.