Motion planning in complex scenarios, such as multi-robot coordination (Figures (a) and (e)), dynamic obstacle avoidance (Figure (b)), aerial cargo delivery (Figure (c)), or balancing a flying inverted pendulum (Figure (d)), requires finding trajectories for systems with unknown non-linear dynamics. To complete these tasks, it is not sufficient for the planned trajectories to be optimized only for speed; they need to balance different, often opposing, qualities (preferences) along the way. In some cases the preferences can be formulated as constraints on the trajectory, but often the precise constraints are unknown or difficult to calculate. For example, consider a simple manipulation task where a robot is required to set a glass on a table without breaking it. We do not know precisely the amount of force that causes the glass to shatter, yet we can describe our preferences: low force and fast task completion. To find feasible planning solutions, these complex problems can frequently be described in terms of points of interest, or intents, that the agent either needs to progress towards (attractors) or stay away from (repellers). The planning must balance between them. We call these tasks Preference Balancing Tasks (PBTs). There are several challenges associated with solving PBTs: finding a policy that guides the agent to the goal, handling non-linear dynamics, and adapting to external disturbances in real-time.
An example of a solution to a simple PBT is an Artificial Potential Field (APF), which balances a force that attracts the robot to the goal with a force that repels the robot from obstacles. These two forces can be thought of as preferences that drive the robot's motion through the environment. The challenge of planning with methods such as APFs lies primarily in determining the placement, shape, and relative scaling of the potentials; a properly constructed force field is critical to the method's success.
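As a concrete illustration, the two competing potentials can be sketched as follows. This is a minimal hypothetical example, not the formulation evaluated in this paper; the gains `k_att`, `k_rep` and the influence radius `d0` are made-up values that would normally require tuning.

```python
import numpy as np

def apf_force(pos, goal, obstacles, k_att=1.0, k_rep=0.5, d0=2.0):
    """Negative gradient of a simple potential field (hypothetical gains)."""
    force = k_att * (goal - pos)  # attractive term: pulls toward the goal
    for obs in obstacles:
        diff = pos - obs
        d = np.linalg.norm(diff)
        if 0 < d < d0:  # repulsive term is active only near the obstacle
            force += k_rep * (1.0 / d - 1.0 / d0) / d**3 * diff
    return force
```

The relative scaling between `k_att` and `k_rep` is exactly the kind of balance that, in this paper's terms, PEARL learns rather than hand-tunes.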
Reinforcement learning (RL) has recently been successful in planning trajectories for systems with unknown dynamics. RL often learns a state-value function, which is used much like an APF to plan a trajectory. In robotics domains, RL requires function approximation, either with deep neural nets or carefully selected features. Deep reinforcement learning (DRL), RL with deep neural networks as function approximators, has recently gained much attention in applications such as Atari games, self-driving, and learning motion planning policies from raw sensor data. Deep neural nets are a good choice of approximator when we do not have good intuition about the problem's feature space and have abundant data for training. At the same time, DRL methods have been rather challenging to apply to robot motion planning, due to the speed of decision-making, the training data needed, and the general instability of training.
In contrast, PEARL, like other feature-based RL methods, is more interpretable, requires less data, and executes faster. PEARL is appropriate for problems where we have an intuition about the feature set, lack demonstration data for training, and require fast training. While DRL takes hours or days to train, PEARL completes training within minutes. Additionally, PEARL's features make the behavior easy to interpret and analyze. Furthermore, for tasks with only attractors, we know the conditions under which the task's goal is an asymptotically stable point, which means that we can guarantee the agent's behavior when the conditions are met. In Section VI, we show that the proposed features satisfy the stability criterion for attractor-only multi-agent tasks. For tasks with mixed intents (both attractors and repellers), we analyze the local minima conditions.
External disturbances that influence a robot's motion at runtime (e.g., atmospheric changes or wind) pose another challenge. They are often stochastic and can externally excite the system with normally distributed intensity and direction, varying between consecutive observations. Stochastic disturbances, combined with complex nonlinear system dynamics, make traditional solutions such as adaptive and robust control, which explicitly solve the optimal control problem, difficult or intractable. We are interested in a trajectory generation method that solves PBTs, rejects stochastic disturbances, and is computationally efficient enough to be used on a high-dimensional system that requires frequent input selection (e.g., at 50 Hz).
We propose Preference Appraisal Reinforcement Learning (PEARL), a simple method for solving many PBTs for multi-agent systems with unknown non-linear dynamics. PEARL constructs an RL state-value function (an APF equivalent) from user-given intents as points of interest (preferences), computes features from the intents, and learns the scaling between the features (appraisal). We learn the state-value function on simplified problems to speed up training, and use it to plan more difficult problems. To make the trajectory planning robust to external disturbances, we present a policy approximation, Least Squares Axial Policy Approximation (LSAPA), that rejects stochastic disturbances. PEARL trains in minutes, and although there are no guarantees that it will work in all cases, we show that it works on a range of problems and offer an analysis of the method's limitations.
We evaluate PEARL with and without stochastic disturbances on dynamic obstacle avoidance, aerial cargo delivery (Figure (c)), rendezvous (Figure (e)), and flying inverted pendulum (Figure (d)) tasks. Further, PEARL was used for a multi-agent pursuit problem to plan in a 100-dimensional state space in real-time at 50 Hz. The feasibility of the generated trajectories is verified experimentally for the aerial cargo delivery task with stochastic disturbances.
This paper extends our prior presentations of single-agent PEARL and stochastic disturbance rejection to the multi-agent setting, along with a comprehensive analysis and evaluation that shows PEARL's flexibility and range. New to this paper are 1) the multi-agent formulation of PEARL, 2) training domain formalization and analysis, 3) stability analysis, 4) a multi-agent pursuit solution, and 5) the flying inverted pendulum under stochastic disturbances.
The rest of the paper is structured as follows. Section II positions PEARL within the related work. Section III introduces necessary preliminaries. In Section IV we present PEARL’s components: preference balancing tasks, the MDP and training domain, construction of features, fittest policy selection, and policy adaptation for rejection of stochastic disturbances. Section V presents comprehensive case studies demonstrating the method. In Section VI we analyze the method, its progression to the goal, computational cost, feature properties, and construction of the learning domain. Lastly, we conclude in Section VII.
II Related Work
Reinforcement learning: Function approximation RL methods typically assume user-provided features, which map subspaces to points. RL is sensitive to feature selection because of the implied dimensionality reduction. Classically, two feature types are used in RL: discretization and basis functions. Discretization partitions the domain, scaling exponentially with the space dimensionality. Basis functions, such as kernels and radial basis function networks, offer more learning flexibility. These functions, however, can require manual parameter tuning, and the number of features for multi-robot systems and dynamic obstacle avoidance tasks increases exponentially with the state space dimensionality. Further, these methods are generic and do not capture prior knowledge of a task that an engineer might have. PEARL proposes a feature selection method that solves a particular class of motion tasks for acceleration-controlled robots, exploiting task knowledge. The number of features is invariant to the problem dimensionality, and the computation time scales polynomially with the state space dimension. Similar to Voronoi decomposition, which solves high-dimensional manipulation problems by projecting the robot's configuration space onto a low-dimensional task space, the features we propose define a basis in the preference-task space as well. In contrast to Voronoi decomposition, PEARL learns the relationship between the features. Another RL approach solves manipulation tasks with hard and soft constraints. Our tasks, however, do not have known constraints and bounds; they are set up as preferences to guide dynamically feasible trajectory generation.
Preference shaping: Devlin et al. focus on reward shaping to improve the equity of the joint agent policy in multi-agent tasks. In contrast, we put the emphasis on feature selection and task formulation. Another approach dynamically calculates a preference ordering, where preferences are expressed as logical formulas. Our preferences adapt easily to spaces of different dimensions.
Stochastic disturbances: Robot motion control under stochastic disturbances has been studied on a number of problems, including quadrotor trajectory tracking under wind-gust disturbances and blimp path planning in the presence of stochastic wind, with other approaches using low-level controllers to stabilize trajectories within reach tubes or trajectory libraries. Other approaches to UAV control in environments with a drift field explicitly solve the system dynamics [24, 25], or use iterative learning to estimate a repetitive disturbance [26, 27]. While these solutions are aimed at particular systems or repetitive disturbances, our method works for a class of problems and systems influenced by stochastic disturbances. Finally, Nonlinear Model Predictive Control (NMPC) tracks a deterministic trajectory minimizing a cumulative error for a non-linear dynamical system in the presence of stochastic disturbances. The main difference is that, while NMPC computes the cost on-the-fly, PEARL constructs a state-value function, a discounted, infinite-horizon, cumulative cost (reward), prior to task execution. PEARL's control policy, LSAPA, uses the state-value function during planning and only needs to solve a one-step optimization problem with respect to the state-value function.
Continuous action policy approximation: Several sampling-based methods exist that approximate the policy efficiently, such as HOOT, which uses hierarchical discretization to progressively narrow the search to the most promising areas of the input space, thus ensuring an arbitrarily small error. Deterministic Axial Sum (DAS) solves input selection in linear time using divide and conquer. It finds an optimal input for each input axis independently with Lagrangian interpolation, and then combines the single-axis selections. DAS works well for high-dimensional multi-agent planning with value functions without local maxima. HOOT is slower, but, in practice, works well for single-agent planning with value functions that have many small-scale maxima. Although DAS can compensate for some levels of zero-mean noise [9, 30], the method stops working in the presence of stochastic disturbances, because the external disturbance induces unpredictable drift onto the system. The method presented here, LSAPA, uses polynomial least squares regression instead of interpolation to compensate for stochastic disturbances.
We model robots as discrete-time nonlinear control-affine systems with accelerations as control inputs. Consider a robot with degrees of freedom (DoF). If an acceleration $\mathbf{u}_t$ is applied to the robot's center of mass at time-step $t$, the new position-velocity vector (state) is

$$\mathbf{s}_{t+1} = f(\mathbf{s}_t) + g(\mathbf{s}_t)\,\mathbf{u}_t \quad (1)$$

for some functions $f$ and $g$. A Markov decision process (MDP), a tuple with states $S$ and actions $A$ that assigns immediate scalar rewards $R$ to states in $S$, formulates a task for the system (1). A solution to an MDP is a control policy $\pi$ that maximizes the cumulative discounted reward over the agent's lifetime, the value $V(\mathbf{s}) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(\mathbf{s}_t)\right]$, where $\gamma \in (0,1)$ is the discount factor. Approximate value iteration (AVI) finds a solution to a continuous-state MDP by approximating the state-value function with a linear map over features,

$$V(\mathbf{s}) \approx \boldsymbol{\theta}^{\top} \mathbf{F}(\mathbf{s}). \quad (2)$$
RL often works with an action-value function, $Q(\mathbf{s}, \mathbf{a})$, a measure of the discounted accumulated reward collected when an action $\mathbf{a}$ is taken at state $\mathbf{s}$. In relation to the state-value function $V$ (2), the action-value can be expressed as

$$Q(\mathbf{s}, \mathbf{a}) = R(\mathbf{s}') + \gamma V(\mathbf{s}'), \quad (3)$$

where $\mathbf{s}'$ is the state that results from applying action $\mathbf{a}$ at state $\mathbf{s}$.
AVI takes a feature vector and learns the weights between the features by sampling the state space and observing the rewards. It iteratively updates the weight vector in an expectation-maximization manner.
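A minimal sketch of this loop, under the assumption that each iteration refits the weights to bootstrapped value targets by least squares; the helper names (`features`, `sample_states`, `target_for`) are hypothetical placeholders for the quantities described above, not this paper's implementation.

```python
import numpy as np

def avi(features, sample_states, target_for, n_iters=50):
    """features(s): feature vector; target_for(theta, s): bootstrapped value target."""
    theta = np.zeros(len(features(sample_states()[0])))
    for _ in range(n_iters):
        states = sample_states()                       # sample the state space
        X = np.array([features(s) for s in states])    # expectation: evaluate features
        y = np.array([target_for(theta, s) for s in states])
        theta, *_ = np.linalg.lstsq(X, y, rcond=None)  # maximization: refit weights
    return theta
```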
After parameter learning is completed, batch RL enters a planning phase. The planner takes the value function approximation (2) and an initial condition, and generates a trajectory using closed-loop control with a greedy policy with respect to the state-value approximation,

$$\pi(\mathbf{s}) = \operatorname*{argmax}_{\mathbf{a} \in A} \left( R(\mathbf{s}') + \gamma V(\mathbf{s}') \right), \quad (4)$$

where state $\mathbf{s}'$ is the result of applying action $\mathbf{a}$ to state $\mathbf{s}$. Action selection in continuous spaces, which calculates the greedy policy (4), is a multivariate optimization over an unknown function.
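For intuition, the greedy policy over a finite set of candidate actions can be sketched as below; the black-box `step` simulator, `reward`, and learned `V` are hypothetical stand-ins for the quantities above (the paper's actual policy approximations, DAS and LSAPA, avoid this exhaustive search).

```python
import numpy as np

def greedy_action(state, candidate_actions, step, reward, V, gamma=0.99):
    """Pick the action maximizing one-step reward plus discounted value."""
    best_a, best_q = None, -np.inf
    for a in candidate_actions:
        s_next = step(state, a)                # black-box dynamics sample
        q = reward(s_next) + gamma * V(s_next)  # Q estimate per (3)
        if q > best_q:
            best_a, best_q = a, q
    return best_a
```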
To start the learning phase, a user provides PEARL with the task definition, which contains the basic information about the problem: the number of robots, the robots' DoFs, maximum accelerations, and a set of intents (attractors or repellers). The basic system information is encoded into a training MDP (Section IV-A1). The intents are encoded into features (Section IV-A2). To make the learning more tractable, we select a learning domain that is smaller than the full problem space (Section IV-A3). The features and training MDP are passed to the approximate value iteration algorithm (AVI or CAFVI), which determines the weights between features. The training output is a feature weight vector with one weight per preference. We repeat the training several times to select the best policy (Section IV-A4).
Once the fittest policy is selected, PEARL is ready to execute tasks on the full MDP domain. The features, weights, and initial conditions are passed to the planner (Section IV-B), which uses a greedy policy with respect to the learned state-value function to complete the task. Section IV-B1 presents a greedy policy modification that allows the system to compensate for stochastic disturbances in the special case when all task preferences are attractors.
Our aim is to solve tasks that can be described with a set of attractors (goals) and repellers (obstacles) for acceleration-controlled multi-robot systems with unknown dynamics.
IV-A1 Multi-agent MDP Setup
The multi-robot system consists of several robots, each with a number of degrees of freedom (DoF). We assume the robots work in continuous state and action spaces, are controlled through accelerations applied to their centers of mass, and have differentially flat dynamics that are not explicitly known. The MDP state space is the joint position-velocity space of the multi-robot system; its dimensionality is the sum of the individual robots' dimensionalities. The state is the joint position-velocity vector, and an action is the joint acceleration vector. The state transition function, which we assume is unknown, is the joint dynamical system of the individual robots, each being control-affine (1). Note that since all robots are control-affine, the joint system is as well.
For training purposes, we assume the presence of a black-box simulator, or dynamics samples, for each robot. The reward is set to one when the joint multi-robot system achieves the goal, and zero otherwise. Together, these elements form the tuple (5) that defines the joint MDP for the multi-robot problem.
IV-A2 Feature Selection
We define a PBT with a set of preferences. The preferences, points in position or velocity space, either attract or repel one or more agents. Preferences that attract an agent are goals, whereas preferences that repel it are obstacles. We assume that the task is well-defined, i.e., the task's attractors form a non-empty region that is not fully occluded by repellers.
To learn a PBT, we form a feature for each preference. We construct the features associated with both attractors and repellers so that each reduces the state to a measure of the intended point of interest. For example, multi-agent pursuit has two attractors, the prey's position and speed, and a repeller, the other nearby agents' positions. The three preferences form three features that intuitively correspond to the distance from the prey, the speed difference from the prey, and the inverse of the distance from other agents.
Formally, assuming a low-dimensional task space and a high-dimensional MDP space, we consider task-preference features that map the state space to a lower-dimensional feature space and, depending on the preference type, measure either the squared intensity of or distance to the preference. Each preference applies to a subset of robots, and we consider the projection of a robot's state onto the minimal subspace that contains the preference. For instance, when a preference is a point in position space, the projection is the robot's position; similarly, when it is a point in velocity space, the projection is the robot's velocity. Attractor features measure the squared distance between the projection and the attractor, while repeller features measure its inverse.
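A sketch of the two feature types, under the assumption (consistent with the description above) that attractor features measure squared distance to the preference point and repeller features its inverse; the projection `proj`, which extracts the relevant position or velocity block of the state, is a hypothetical helper.

```python
import numpy as np

def attractor_feature(state, pref, proj):
    """Squared distance from the projected state to an attractor point."""
    return np.sum((proj(state) - pref) ** 2)

def repeller_feature(state, pref, proj, eps=1e-6):
    """Inverse squared distance to a repeller (eps avoids division by zero)."""
    return 1.0 / (np.sum((proj(state) - pref) ** 2) + eps)
```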
IV-A3 Learning Efficiency
PEARL learns to discover the relative weights between the features (preference appraisal). AVI algorithms learn by sampling from the learning domain. In high-dimensional and high-volume learning domains, learning algorithms need prohibitively many iterations, or might not even converge, due to the curse of dimensionality. To avoid these problems and make learning computationally efficient, we propose a training domain (9) that is a subset of the full problem domain in (5). Let the training MDP be
a restriction of the full MDP (5). The training MDP's domain is a subset of the full problem domain. Specifically, out of all possible subsets, we select the training state and action spaces such that they contain all of the task's points of interest and their immediate vicinity, but not much more. For instance, in the case of dynamic obstacle avoidance (Figure (b)), the problem domain is a circular area with a radius of 50 meters in a 2-dimensional workspace, but if we place three obstacles at 2 meters from the goal, then the training domain can be a circle with a radius of 3 meters around the goal, since it contains the attractor and all the repellers, all of which are reachable. To formalize, the training domain is a bounded, closed set that contains the smallest open set that covers all the objectives,
The training set is drawn from this domain and used by the RL training algorithm to sample the training tuples.
IV-A4 Monte Carlo Policy Selection
We repeat the training multiple times because of the sampling nature of RL algorithms. Each training trial produces a different set of preference weights, and therefore a different candidate policy. To select the fittest policy, we evaluate each candidate on a fixed evaluation set of initial conditions. The evaluation measures the policy's percent success rate and the average trajectory duration. We select the policy with the fastest average trajectory from among the policies with the highest success rates on the training domain.
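The selection rule above can be sketched as follows; the `evaluate` callback, which runs a policy over the fixed evaluation set and returns its success rate and average trajectory duration, is a hypothetical helper.

```python
def select_fittest(policies, evaluate):
    """evaluate(policy) -> (success_rate, avg_duration) over a fixed set."""
    results = [(p, *evaluate(p)) for p in policies]
    best_rate = max(rate for _, rate, _ in results)
    # among the highest-success policies, prefer the fastest trajectories
    top = [(p, rate, dur) for p, rate, dur in results if rate == best_rate]
    return min(top, key=lambda t: t[2])[0]
```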
IV-B Trajectory Planning
After a PEARL agent is trained, we use the learned value function to plan. To plan and generate trajectories, the planner executes closed-loop feedback control. At each time step the agent observes the state of the world and uses an approximation of the greedy policy (4), such as Deterministic Axial Sum (DAS) or HOOT, to compute an action to apply to the system.
IV-B1 Stochastic Disturbance Rejection for Attractor Tasks
We look at the special case of attractor-only tasks, e.g., rendezvous (Figure (e)), flying inverted pendulum (Figure (d)), and aerial cargo delivery (Figure (c)). For these problems, we present a policy approximation that rejects stochastic disturbances that influence the system at run-time. The policy approximation, Least Squares Axial Policy Approximation (LSAPA), adapts DAS. In prior work, we showed that for tasks with only attractors the Q-value function is a quadratic function of the input, so DAS can be used for efficient, high-precision planning if certain conditions are met. Specifically, DAS takes advantage of the fact that the action-value function is a quadratic function of the input for any fixed state in a control-affine system (1) with the state-value approximation (2). DAS finds an approximation of the action-value function's maximum for a fixed state. It works in two steps, first finding maxima on each axis independently and then combining them. To find a maximum on an axis, the method uses Lagrangian interpolation to find the coefficients of the quadratic polynomial representing the function; an action that maximizes the function on each axis is then found by zeroing the derivative. The final policy is a piecewise maximum of a convex sum and a simple vector sum of the action maxima found on the axes. Deterministic axial policies do not adapt to changing conditions or external forces, because their results do not depend on the selected samples. We extend DAS to work in the presence of disturbances via LSAPA. LSAPA uses least squares regression, rather than Lagrangian interpolation, to select the maximum on a single axis. This change allows LSAPA to compensate for the error induced by non-zero mean disturbances.
We consider a system that is externally influenced by a disturbance sampled from a Gaussian distribution. We a) show that the Q-function remains quadratic, and b) propose a modification to DAS that compensates for the external disturbance.
During planning, we assume that we have a black-box simulator of the system, which receives the mean and variance of the current probability distribution of the disturbance. The probability distribution can be obtained by estimating a moving average and variance of the error between the observed and desired accelerations, obtained with an accelerometer placed on the robot.
At every time step, LSAPA observes a state. By sampling the simulator, LSAPA finds a near-optimal input to apply to the system. The Lagrangian interpolation used in DAS interpolates the underlying quadratic function with only three points, which compounds the error from the disturbances. In contrast, our new method, LSAPA, uses least squares regression with many sample points to compensate for the induced error.
Notation:
- External force exerted onto the system
- Unit vector of the n-th axis
- Coefficients of Q's axial restriction
- Univariate input variable
- Set of vectors in the direction of the n-th axis
- Estimate in the direction of the n-th axis
- Estimate over the first n axes
- Q's maximum estimate under a policy
- Univariate function in the direction of an axis, passing through a point
- Number of axis samples
We first show that the Q-function remains quadratic, with a maximum, even when the system is influenced by a stochastic term.
The proof is in Appendix A.
Next, we present how to find the maximum on the i-th axis using least squares linear regression with polynomial features.
Q's axial restriction on an axis is a univariate function of the input along that axis's unit vector. By Proposition IV.1, the axial restriction (11) is a quadratic function of the univariate input, with some coefficient vector. Our goal is to find these coefficients by sampling the input space at a fixed state.
Suppose we collect input samples on the axis. The simulator returns the state outcomes when the input samples are applied at the fixed state. Next, Q-estimates are calculated with (3),
Using supervised learning terminology, the Q-estimates are the labels that match the training samples. The matrix
contains the training data projected onto the quadratic polynomial space. The supervised machine learning problem (14) fits the coefficients to the training data and labels. The solution to (14) is the coefficient estimate of the axial restriction (11). Zeroing the derivative yields a critical point, and because the restriction is quadratic, the critical point is its maximizer.
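The single-axis regression step can be sketched as follows, with clamping to the axis's acceleration bounds folded in; generating the input samples and their Q-estimates via the simulator is assumed to happen outside this helper.

```python
import numpy as np

def axial_argmax(u_samples, q_estimates, lo=-np.inf, hi=np.inf):
    """Least-squares fit q ~ c2*u^2 + c1*u + c0; return the clamped critical point."""
    A = np.column_stack([u_samples**2, u_samples, np.ones_like(u_samples)])
    c2, c1, _ = np.linalg.lstsq(A, q_estimates, rcond=None)[0]
    u_star = -c1 / (2.0 * c2)       # zero of the fitted quadratic's derivative
    return np.clip(u_star, lo, hi)  # keep the action within acceleration bounds
```

Because the fit uses many samples rather than three interpolation points, noise in the Q-estimates averages out, which is the core of LSAPA's disturbance compensation.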
Lastly, we ensure that action selection falls within the allowed action limits,
where the clipping bounds are the lower and upper acceleration limits on the axis, respectively.
Repeating the process of estimating the maxima on all axes, we calculate the final policy with
The policy approximation (18) combines the vector sum of the non-convex policies (17) with the convex sum policy. The convex sum guarantees the system's monotonic progression towards the goal for a deterministic system, but the simple vector sum (non-convex policy) does not. If, however, the vector sum performs better than the convex sum policy, then (18) allows us to use the better result. The disturbance changes the Q-function, and the regression fits Q to the observed data.
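A simplified reading of this combination step can be sketched as below: it picks the better-scoring of the two candidate sums, an equal-weight convex combination versus the plain vector sum of the per-axis maxima, with the Q-evaluator `q_of` supplied by the planner. This is an illustrative reduction, not the paper's exact formula.

```python
import numpy as np

def combine_axial_maxima(axis_maxima, q_of):
    """axis_maxima[n]: full-dimensional action with only axis n's component set."""
    vector_sum = np.sum(axis_maxima, axis=0)     # simple (non-convex) vector sum
    convex_sum = vector_sum / len(axis_maxima)   # equal-weight convex combination
    # keep whichever candidate scores higher under the Q estimate
    return vector_sum if q_of(vector_sum) >= q_of(convex_sum) else convex_sum
```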
V Case Studies
Table II summarizes the case studies used for evaluation. PEARL was implemented in MATLAB 2013a and 2014a, and all training and simulations were executed on either a single core of an Intel i3-2120 at 3.3 GHz with 4 GB RAM or an Intel Xeon E5-1620 at 3.6 GHz with 8 GB RAM. Experiments were performed on an AscTec Hummingbird quadrotor equipped with a suspended load weighing 45 grams. The quadrotor and load positions were captured with a motion capture system at 100 Hz. The testbed's internal tracking system tracked the quadrotor position using an LSAPA-generated trajectory as a reference. The video (https://youtu.be/ZkFt1uY6vlw) contains the experiments and visualizations of the simulations.
| Task | Location | PEARL phase | State space dim. | Task variation |
| Multi-agent pursuit | Section V-A | Training | 12 | Static prey; 3 pursuit agents |
| | | Planning | 50 | Moving prey; 25 agents in real-time, 1000 offline |
| Dynamic obstacle avoidance | Section V-B | Training | 4 | 4 static obstacles |
| | | Planning | 4 | Up to 900 stochastically moving obstacles |
| Aerial cargo delivery | Section V-C | Training | 10 | Without disturbance |
| Rendezvous | Section V-D | Training | 16 | Without disturbance |
| Flying inverted pendulum | Section V-E | Training | 10 | Without disturbance |
V-A Multi-agent pursuit
Variants of multi-agent pursuit tasks have been studied in the RL and multi-agent literature [34, 35, 14, 36]. A typical predator-pursuit task works in a grid world with multiple agents pursuing a prey. In the classical variant, both the pursuers and the prey are autonomous agents. The pursuers work together as a team of independent agents. Ogren et al. approach multi-agent coordination through the construction of a control Lyapunov function that defines a desired formation, similar to our proposed solution; however, their approach uses a given control law to move the agents, while we use an RL-derived planner. The swarming community addresses a similar problem of a large group of agents working together with limited information exchange. Typically, the resulting complex system exhibits an emergent behavior arising from only a few given rules. For example, Boids are intended to model the behavior of bird flocks, and accomplish that using a sum of separation, coherence, and alignment proportional controllers for each agent. In other work, Kolling et al. compute synchronized trajectories for UAVs that guarantee the detection of all targets.
Our variant of the multi-agent pursuit task controls only the pursuers chasing the prey, while the prey follows a predetermined trajectory unknown to the pursuers. The problem formulation we use differs from the typical formulation in three ways. First, it works in an arbitrarily large continuous state and action space rather than the discretized grid world. Second, it learns and plans in joint agent-position-velocity space, and handles variable team size without additional learning. Lastly, we require the pursuers to keep distance among themselves, thus we learn and plan in the joint pursuers’ space. We formulate the task as follows:
(Multi-agent pursuit task.) A set of agents must follow the prey agent in close proximity, while maintaining separation from each other. The agents know the current position and velocity of the prey (leader) but do not know its future headings.
V-A1 PEARL setup
To solve this task, we set up the MDP and preferences as outlined in Section IV-A2. The state space is the joint agent position-velocity vector space, where a state's coordinates denote each agent's position and velocity in the direction of each axis. The action space consists of acceleration vectors, with each component being an agent's acceleration in the corresponding direction. To design the preferences for the pursuit task, we look at the task definition. There are three preferences: close proximity to the prey (attractor), following the prey (attractor), and maintaining distance between the agents (repeller). Thus, the feature vector has three components,
Following the method from Section IV-A2, we express the distance to the prey as
and following the prey, as reducing the difference in velocities between the agents and the prey,
where and are prey’s position and velocity in the direction . The last feature is increasing the distance between the agents,
To learn the pursuit task, we use 3 planar agents with double-integrator dynamics. The pursuit team of three robots is the smallest team for which all features are non-trivial. The maximum absolute accelerations are bounded. We run CAFVI as the learning agent in PEARL with DAS for 300 iterations to learn the feature vector weights. The sampling space is inside a six-dimensional hypercube. The prey is stationary at the origin during learning, and the weights result from this training. Simulations run at 50 Hz, and training completes in minutes.
To plan a task, we assign a trajectory to the prey and increase the number of agents. The prey starts at the origin, and the pursuers start at random locations near the origin. We then plan a trajectory.
Figure 3 depicts the planning computational time for a trajectory as the number of pursuers increases. The state and action spaces grow with the team size, yet the planning time stays polynomial, because the feature vector scales polynomially with the state size. The method plans problems with high-dimensional continuous state and action spaces in real-time (below the green line in Figure 3), as trajectory computation takes less time than its execution.
The evaluation looks at a 25-agent pursuit of a prey that tracks either a spiral (Figures (a)-(c)) or a lemniscate of Gerono curve (Figures (d)-(f)). The agents start at random locations uniformly drawn around the prey's initial position (origin). Figures (a) and (d) show the movement of the agents in the xy-plane. Although the pursuers do not know the prey's future positions, they track it in an increasingly close formation. Figures (b) and (e) show the x-axis trajectories over time, and Figures (c) and (f) show the y-axis trajectories over time. Note that although the initial positions are uniformly distributed, afterward all agents are close to the prey and remain in phase with it. The only exception is the y-coordinate of the lemniscate (Figure (f)). Here, the agents never catch up with the prey: the higher-frequency oscillations in the y-direction make the prey hard to follow, but the agents maintain a constant distance from it. Figures (g)-(i) depict a trajectory of a 1000-pursuer task; DAS successfully plans a trajectory in this large-scale state and action space. The prey's trajectory has Brownian velocity (its acceleration is randomly drawn from a normal distribution). In this example, the prey travels further from the origin than in the previous two examples.
| Method | Prey | # | # State | # Action | Comp. | Prey Dist. (m) | Dist. Agents |
We compare the PEARL multi-agent pursuit task with Boids. Boids uses the three standard rules: separation, alignment, and cohesion (independent of the prey), plus two additional prey-directed rules: prey cohesion and prey alignment. Since our agents have double-integrator dynamics, the alignment and cohesion rules are realized by minimizing the distance and velocity differentials. We empirically tuned the rule weights until the Boids started pursuing the prey, obtaining weights for flock separation, alignment, and cohesion, and for prey alignment and cohesion, respectively. Figures (g)-(i) show the resulting agents' trajectories. Compared to the PEARL trajectories in Figure 5, the Boids agents lag behind more in the spiral and lemniscate trajectories. They also maintain a larger distance among agents, which is expected given the explicit separation rule that comes into effect when two agents are close together.
Table III summarizes the ending trajectory characteristics averaged over 100 trials with the prey following the given trajectories. For each trajectory we plan the pursuit task with a varying number of pursuers. For all tasks the pursuit plan is fast to compute even in the 100-dimensional state spaces (Table III), confirming the timing benchmark in Figure 3. The next column, prey distance, shows that the average pursuer-prey distance remains small, and the minimal standard deviation suggests consistent results, in agreement with the trajectory plots (Figure 4). The last column shows the average distance between the agents, which was to be maximized. It decreases as the pursuer team grows, reflecting higher agent density around the prey and a tighter formation. Motion animation shows that while approaching the prey, the agents maintain the same formation and relative positioning, thus avoiding collisions. Compared to PEARL, Boids is an order of magnitude faster at generating trajectories, but its average distance to the prey is an order of magnitude larger.
V-B Dynamic obstacle avoidance
Planning motion in dynamic environments is challenging because plans must be frequently adjusted to account for moving obstacles. To address this challenge, proposed planners vary greatly in methodology, in the information they require about the position and dynamics of the obstacles, and in their ability to account for stochastic obstacle motion. For example, APF methods provide fast solutions by using only the positional information of obstacles near the robot [1, 40]. The challenge of APF methods lies in tuning the repulsive and attractive potentials. On the other hand, state-of-the-art sampling-based methods for dynamic environments plan in state-time space in order to avoid guiding the robot into inevitable collision states, i.e., states that lead to collision regardless of the control policy [41, 42]. These methods often require knowledge of the dynamics of the obstacles (deterministic or stochastic), which can be difficult to obtain. A method that approximates an obstacle's dynamics is the Velocity Obstacle (VO), which computes collision-avoiding control actions geometrically in the robot's velocity space. In VO the obstacles are approximated as circles or spheres and are assumed to move at a constant velocity, which can be inaccurate in crowded environments with stochastically moving obstacles. In this study, we define the dynamic obstacle avoidance task as follows:
(Dynamic obstacle avoidance task.) The agent must navigate from start to goal without colliding with obstacles whose dynamics are unknown to the agent. The agent has access to the obstacles' current position and velocity information.
V-B1 PEARL setup
We set up an MDP where the state space is the joint vector of robot position and velocity. The action space is the acceleration on each axis. This task has two natural preferences: 1) minimize the distance to the goal, and 2) maximize the distance from obstacles. Therefore, PEARL's feature vector is formulated as the combination of two preferences. The first feature is an attractor towards the goal, computed from the positions of the agent and the goal. The second feature is a repeller from the closest obstacle, computed from the distance to the closest obstacle and an empirically selected constant.
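One plausible implementation of these two features is sketched below; the repeller's exponential form and the constant `c` are assumptions, since the paper's exact expression and empirically selected constant are not reproduced here.

```python
import numpy as np

def pearl_features(agent_pos, goal_pos, obstacle_positions, c=1.0):
    """Two-preference feature vector for dynamic obstacle avoidance (sketch).

    Feature 1 attracts to the goal (squared distance); feature 2 repels from
    the closest obstacle. The exponential repeller and c are assumed forms.
    """
    f_goal = np.sum((agent_pos - goal_pos) ** 2)            # attractor feature
    d = min(np.linalg.norm(agent_pos - o) for o in obstacle_positions)
    f_obst = np.exp(-c * d)                                  # repeller feature
    return np.array([f_goal, f_obst])
```

The learned (all-negative) weights then combine the two features into a single state-value estimate, so that states near obstacles and far from the goal are both penalized.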
We use 4 stationary obstacles to learn the weights between the two features. The goal is at the origin, and the sampling space is a two-dimensional box around it. The robot is subject to maximum speed and acceleration limits. We run AVI with HOOT policy approximation to learn the feature vector weights. All simulations are run at 10 Hz.
The planning environment is illustrated in Figure (b). The robot must travel from the start position to the goal under the same speed and acceleration constraints as used for learning. The environment contains randomly placed moving obstacles with hybrid stochastic dynamics: three stochastic modes of linear, arc, and swerve motion. An obstacle is in one of the three modes at any given moment, and the initial mode is randomly determined. Obstacles in the linear mode have a fixed heading, but their speed of travel is sampled stochastically from a fixed set with given probabilities. Obstacles in the arc mode move counterclockwise at a stochastically sampled angular speed. Obstacles in the swerve mode change heading linearly at a rate sampled uniformly from a fixed interval. All obstacles, regardless of mode, re-sample their parameters stochastically at a fixed frequency. The obstacles are circles, and their average speed, regardless of dynamics mode, is identical to the maximum speed of the robot. We maintain a constant density of moving obstacles by restricting the robot and the moving obstacles to a circle; when an obstacle hits the boundary of the circle, it is transported to the antipodal position on the circle and continues from this new position.
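The hybrid mode-switching dynamics described above can be sketched as follows; the speed sets, angular rates, swerve range, and re-sampling probability are placeholders for the elided values in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
MODES = ("linear", "arc", "swerve")

def step_obstacle(state, dt=0.1, resample_prob=0.1):
    """One simulation step of a hypothetical hybrid stochastic obstacle.

    state: dict with 'pos' (2,), 'heading' (rad), 'speed', 'mode'.
    All numeric constants are placeholders, not the paper's values.
    """
    if rng.random() < resample_prob:               # stochastic mode re-sampling
        state['mode'] = rng.choice(MODES)
    if state['mode'] == "linear":
        # Fixed heading; speed drawn stochastically from a placeholder set.
        state['speed'] = rng.choice([0.5, 1.0, 1.5])
    elif state['mode'] == "arc":
        # Counterclockwise arc at a sampled angular speed.
        omega = rng.choice([0.2, 0.4])
        state['heading'] += omega * dt
    else:
        # Swerve: heading changes linearly at a uniformly sampled rate.
        if 'swerve_rate' not in state:
            state['swerve_rate'] = rng.uniform(-1.0, 1.0)
        state['heading'] += state['swerve_rate'] * dt
    h = state['heading']
    state['pos'] = state['pos'] + state['speed'] * dt * np.array([np.cos(h), np.sin(h)])
    return state
```

At each step an obstacle may jump between modes, which is what makes constant-velocity assumptions (as in VO) break down in this benchmark.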
Planning trajectory characteristics (averaged over 200 trials) for environments of varying complexity (number of obstacles). (a) Task success rate; the error bars show the 99% confidence interval. (b) Time for the agent to reach the goal without collision; the black dotted line is the minimum finish time without obstacles. (c) Computation time per planning step.
Since the agent does not have knowledge of the obstacle dynamics in this task, we compare our method with two obstacle avoidance methods that also do not require this knowledge: Gaussian APF and VO. Gaussian APF considers only the position of obstacles. It combines a quadratic attractive potential toward the goal with a repulsive potential from obstacles. The obstacle potentials are Gaussians centered on the obstacles, with a width tuned empirically for this problem. The relative strength between the attractive and repulsive potentials has a significant impact on the success rate and also needs to be manually tuned; a larger ratio produces more goal-greedy robot behavior. We compared various values of this ratio to our method. VO considers the position and velocity of nearby obstacles in order to compute the velocity obstacles they pose for the agent.
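A minimal sketch of the Gaussian APF baseline, assuming a quadratic attractor and Gaussian repulsive bumps; the gain `k` and width `sigma` stand in for the empirically tuned values, which are not reproduced here.

```python
import numpy as np

def gaussian_apf_force(pos, goal, obstacles, k=1.0, sigma=0.5):
    """Negative gradient of a Gaussian APF (illustrative sketch).

    Attractive potential: 0.5 * k * ||pos - goal||^2 (quadratic).
    Repulsive potential: Gaussians of width sigma centered on obstacles.
    k sets the attractive/repulsive ratio: larger k is more goal-greedy.
    """
    force = -(pos - goal) * k          # gradient of the quadratic attractor
    for o in obstacles:
        d = pos - o
        # Gradient of exp(-||d||^2 / (2 sigma^2)) pushes the agent away.
        force += (d / sigma**2) * np.exp(-np.dot(d, d) / (2 * sigma**2))
    return force
```

The robot then accelerates along this force; the experiments below vary `k` because it alone largely determines the success-rate/finish-time trade-off.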
Figures (a) and (b) show that planning with PEARL has a higher probability of successfully avoiding obstacles and reaches the goal in less time than Gaussian APF. The success rate and task finish time of Gaussian APF depend greatly on the attractive-repulsive ratio, which has to be tuned manually or by optimization algorithms after many planning trials. PEARL balances the features (similar to finding the optimal ratio) in the learning phase with a simplified scenario, and is able to transfer the weights to online planning with comparable or better performance.
Figure (a) also shows that PEARL has a higher success rate than VO. This is primarily due to VO's velocity obstacle formulation, which assumes the obstacle moves at a fixed velocity and therefore produces trajectories that lead the agent into collision with stochastically moving obstacles. PEARL, on the other hand, by balancing the features, generates trajectories with sufficient clearance to account for the stochastic obstacle motion. In addition, Figure (b) shows that the computation time per planning step for PEARL is lower than for VO, even though PEARL is implemented in MATLAB and VO in C++. (Our C++ implementation of VO was modified to support a single robot and multiple moving obstacles.) PEARL scales linearly with the number of obstacles and is capable of generating high-success-rate trajectories in real time (Figure (c)). This suggests that PEARL is a viable alternative for dynamic obstacle avoidance, even when the obstacles move with highly unpredictable hybrid stochastic dynamics.
V-C Aerial cargo delivery
We use a quadrotor with a suspended load as our benchmarking platform because it is a popular research platform that has led to solutions for multiple robots, hybrid systems, differentially-flat approaches, and load trajectory tracking, among others. Aerial cargo delivery requires a quadrotor, carrying a load on a rigid suspended cable, to deliver the cargo to a given location with minimal residual load oscillation. The task has applications in supply delivery and aerial transportation in urban environments. The task is easily described, yet difficult to demonstrate by a human, as it requires a careful approach to avoid destabilizing the load. We use the following definition of aerial cargo delivery:
(Aerial cargo delivery task.) A quadrotor equipped with a suspended load must navigate from start to goal, arriving at the goal with minimal residual oscillation of the suspended load. The quadrotor's center of mass is influenced by a stochastic disturbance with a Gaussian distribution. The agent has access to the quadrotor's position and velocity, the load's position and angular velocity, and the location of the goal.
V-C1 PEARL setup
The state space is a 10-dimensional joint system of the quadrotor's and the load's positions and velocities, where the quadrotor's position is given in Cartesian coordinates and the load's displacement in polar coordinates. The input is a 3-dimensional acceleration vector applied to the quadrotor's center of mass, subject to a maximum acceleration limit. The features are the squared distances of the quadrotor's position, the load's displacement, the quadrotor's velocity, and the load's angular velocity.
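Under the state layout described above, the four squared-distance features might be computed as follows; the ordering of the state components is an assumption for illustration.

```python
import numpy as np

def cargo_features(state, goal):
    """Aerial cargo delivery feature vector (sketch of the four
    squared-distance preferences).

    Assumed state layout: [p (3), eta (2), p_dot (3), eta_dot (2)],
    where p is the quadrotor position and eta the load displacement angles.
    """
    p, eta = state[0:3], state[3:5]
    p_dot, eta_dot = state[5:8], state[8:10]
    return np.array([
        np.sum((p - goal) ** 2),   # squared distance of quadrotor to goal
        np.sum(eta ** 2),          # squared load displacement
        np.sum(p_dot ** 2),        # squared quadrotor velocity
        np.sum(eta_dot ** 2),      # squared load angular velocity
    ])
```

All four features vanish exactly at the goal state with the load at rest, which is the equilibrium the negative learned weights drive the greedy policy toward.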
The training domain workspace is a volume around the origin, with speeds sampled from a bounded hypercube. During training the system is deterministic, i.e., there is no stochastic disturbance applied to it. The goal is at the origin. We train with the CAFVI algorithm to obtain the feature weights.
The initial conditions for aerial cargo delivery lie within a fixed distance from the goal. The goal is no longer at the origin, and the system is influenced by a stochastic disturbance. Input disturbance distributions are evaluated with several means (0, 1, and above) and standard deviations (0, 0.5, and above).
We evaluate LSAPA's 1) suitability for real-time planning of high-dimensional control-affine discrete-time systems that require frequent input, 2) ability to execute PBTs in the presence of different stochastic input disturbances, and 3) feasibility of its trajectories on physical systems.
To evaluate the feasibility of the LSAPA trajectory on a physical system and to assess the simulation fidelity, we experimentally compare LSAPA- and DAS-planned trajectories. We chose DAS for this experiment because it is the fastest input selection method, and it performed better than HOOT in previous experiments. Figure 7 shows the results of the experiment when a Gaussian disturbance is imposed on the system (Figure (c)). The quadrotor starts at coordinates (-1, -1, 1.2) and the goal is at (0.5, 0.5, 1.2) meters. We notice (Figure (a)) that LSAPA overshoots the goal after 2 seconds, but compensates and returns to the goal position. The DAS trajectory, however, does not compensate and continues to drift slowly past the goal. The load swing is not very different between the two trajectories.
We compare LSAPA to DAS, HOOT, and NMPC for the aerial cargo delivery task. HOOT uses a three-level hierarchical search, with each level providing a ten-times-finer discretization. NMPC tracks a trajectory generated assuming no disturbances. It is implemented using the provided MATLAB routine, with a 5-time-step horizon and a cost function penalizing the expected deviation of the resulting position from the reference trajectory. The expectation is calculated as an average over 100 samples. NMPC uses the same disturbance-aware simulator used for LSAPA, DAS, and HOOT. The simulator solves the closed-loop optimization problem and simulates the plan.
Figure 9 summarizes the planning results. Figure (a) shows that for the aerial cargo delivery task (left), the time needed to calculate the next input with LSAPA is an order of magnitude smaller than the time step (green line), allowing ample time to plan the trajectory in a real-time closed feedback loop. DAS calculates the next input faster than LSAPA. This is expected, because the deterministic policy uses 3 samples per input dimension, while the stochastic policy in this case uses 300 samples. NMPC is two orders of magnitude slower for the lower-dimensional aerial cargo delivery task, taking over ten times the available window for real-time control to select an input. Both LSAPA and DAS are computationally cheap, linear in the input dimensionality, while HOOT and NMPC scale exponentially. The timing results show that, assuming LSAPA provides good-quality trajectories, it is a viable method for input selection in real time on high-dimensional systems that require high-frequency control.
Next, in Figure (b), we examine whether the trajectories complete the task, i.e., reach the goal region. Due to the constant presence of the disturbance, we consider the average position of the quadrotor rather than simply expecting it to reach the goal region. Note that the accumulated squared error, typically used to measure the quality of tracking methods, is not appropriate for LSAPA, HOOT, and DAS because they generate trajectories on the fly and have no reference trajectory. Thus, we measure whether the system arrives and stays near the goal. As a control case, we first run simulations with a repeatable (zero-variance) disturbance. LSAPA, NMPC, and HOOT all complete the task (Figure (b)).
For a small standard deviation of 0.5, LSAPA performs similarly to HOOT, producing trajectories that complete the task, unlike DAS and NMPC. The quality of the NMPC solution also degrades with the disturbance (Figure (b)). The degradation is more pronounced than with DAS because NMPC selects inputs by solving a fixed-horizon optimization problem: the optimization accumulates the estimation error, invalidating the solution, while smaller horizon lengths are not sufficient for good tracking. For the larger standard deviation, both LSAPA and HOOT still create trajectories that perform the tasks.
Figure 10 shows trajectories planned with LSAPA and HOOT for the aerial cargo delivery task under a stochastic disturbance (Figure (c)). Although both the quadrotor's and the load's speeds are noisy (Figures (a) and (b)), the position changes are smooth, and the quadrotor arrives near the goal position, where it remains.
V-D Rendezvous
This task is an extension of aerial cargo delivery and involves two heterogeneous robots: an aerial vehicle with a suspended load and a ground robot. The two robots work together so that the cargo is delivered on top of the ground robot (Figure (e)); thus, the load must have minimal oscillation when they meet. We use the following definition:
(Rendezvous task.) A two-robot system, consisting of a ground robot and a quadrotor equipped with a suspended load, must navigate from the robots' initial positions towards each other and come to rest with minimal residual oscillation of the suspended load. Both robots' centers of mass are influenced by a stochastic disturbance with a Gaussian distribution. The agent has access to both robots' positions and velocities and to the load's position and angular velocity. The goal location is not explicitly known.
V-D1 PEARL setup
The state space is a 16-dimensional vector of the joint UAV-load-ground robot position-velocity space, and the action set is a 5-dimensional acceleration vector acting on the UAV (3 dimensions) and the ground robot (2 dimensions). The UAV and the ground robot each have their own maximum acceleration limits. The feature vector contains: the distance between the ground robot and the aerial robot's load in the horizontal coordinates, a difference in height equal to the length of the suspension cable, their relative speeds, and the load's position and velocity relative to the UAV.
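A sketch of the rendezvous feature vector under the assumption that each listed preference is a squared distance, mirroring the aerial cargo delivery features; the names and layout are illustrative, not the paper's exact formulation.

```python
import numpy as np

def rendezvous_features(uav_p, uav_v, eta, eta_dot, ugv_p, ugv_v, cable_len):
    """Rendezvous feature vector (hypothetical squared-distance sketch).

    uav_p, uav_v: UAV position/velocity (3,); eta, eta_dot: load displacement
    and angular velocity relative to the UAV (2,); ugv_p, ugv_v: ground robot
    position/velocity (2,); cable_len: suspension cable length.
    """
    return np.array([
        np.sum((uav_p[:2] - ugv_p) ** 2),   # planar distance between the robots
        (uav_p[2] - cable_len) ** 2,        # height offset equal to cable length
        np.sum((uav_v[:2] - ugv_v) ** 2),   # relative speed of the two robots
        np.sum(eta ** 2),                   # load displacement w.r.t. the UAV
        np.sum(eta_dot ** 2),               # load angular velocity
    ])
```

All features vanish when the UAV hovers one cable length above the ground robot with the load at rest, which encodes the rendezvous goal without an explicit goal location.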
Training samples are drawn from a bounded region around the origin. During training the system is deterministic, i.e., there is no stochastic disturbance applied to it. We learn the task using deterministic CAFVI, which yields the feature weights.
After the deterministic learning phase, we generate trajectories for 25 different initial conditions using the learned weights and varying disturbance distributions. In rendezvous, the two robots start a fixed distance apart. Input disturbance distributions are evaluated with a mean of 0 and standard deviations of 0, 0.5, and above; the ground robot's maximum acceleration is small, so the system cannot compensate for larger disturbances.
We compare LSAPA to DAS and HOOT for the rendezvous task. The evaluation set up is the same as in the aerial cargo delivery task.
Figure (a) shows that although HOOT stays within the real-time budget for the aerial cargo delivery task (left), for the rendezvous task (right) it scales exponentially with the input dimension. As a result, HOOT takes twice the length of the minimal time step required for real-time planning to calculate an input for the rendezvous task. NMPC for the rendezvous task took 10 times longer than for the aerial cargo delivery task, requiring about 3 hours to calculate a single 15-second trajectory. We therefore decided against running systematic NMPC tests on the rendezvous task because of the impractically long computation time.
Figure (c) shows that LSAPA and HOOT trajectories complete the task, unlike DAS and NMPC. While DAS is able to compensate for zero-mean noise in the aerial cargo delivery task (Figure (b)), its performance degrades in the higher-dimensional rendezvous task (Figure (c)). For rendezvous, LSAPA produces comparable results in an order of magnitude less time (Figure (a)). Overall, LSAPA is the only presented method that performs decision-making in real time and compensates for the disturbances on the higher-dimensional task.
Figure 11 shows trajectories planned with LSAPA and HOOT for the rendezvous task under a stochastic disturbance. The two robots meet in 4 seconds, after the initial failed coordinated slow-down at 1 second. Note that the target altitude for the quadrotor is 0.6 meters, so that the load is positioned above the ground robot. The results in Figure 9 confirm that both methods produce very similar trajectories, but recall that HOOT does not scale well to larger problems and did not produce rendezvous trajectories in real time. In contrast, LSAPA produced both trajectories in real time.
V-E Flying Inverted Pendulum
The flying inverted pendulum task consists of a quadrotor-inverted pendulum system in a plane. The goal is to stabilize the pendulum and keep it balanced as the quadrotor hovers. In this paper, we modify the flying inverted pendulum task by additionally applying an external stochastic disturbance and extending the planning into a 3-dimensional workspace.
(Flying inverted pendulum task.) An inverted pendulum is attached to a quadrotor via a massless rigid rod. The quadrotor must balance the inverted pendulum and reduce its own velocity while being influenced by stochastic disturbances. The quadrotor has access to its own position and velocity as well as the pendulum's.
V-E1 PEARL setup
The state space is a ten-dimensional vector of Cartesian position-velocity coordinates: the quadrotor's position and linear velocity in Cartesian coordinates, and the position and velocity of the pendulum relative to the quadrotor, projected onto the plane orthogonal to gravity. The action space is a two-dimensional acceleration vector with a fixed maximum magnitude. The reward is one when the target zone is reached, and zero otherwise. The simulator is a linearized model of the full dynamics of a planar flying inverted pendulum.
The features for the first subtask are the squares of the pendulum's position and velocity relative to the goal upright position. The second subtask has an additional feature: the square of the quadrotor's velocity. The subscripts denote whether the pole stabilization or the quadrotor slow-down subtask is under consideration.
With the exception of the maximum acceleration, the training setup is the same as in our earlier work, and we use the DAS policy in training to obtain the parametrization vectors for the two subtasks.
We use a disturbance probability density function and an initial pole displacement of 23. While the deterministic policy solves this problem and balances the inverted pendulum in the absence of disturbances and under small zero-mean disturbances, it fails to balance the inverted pendulum for non-zero-mean disturbances. In contrast, the LSAPA policy solves the task (Figure 12).
Figure (a) shows the quadrotor's trajectory, and Figure (b) displays the pendulum's position in Cartesian coordinates relative to the target position above the quadrotor. The first subtask brings the pole upright (0 to 5 seconds); then the second subtask slows down the quadrotor (after 5 seconds). The pole is slightly disturbed during the initial moments of the second subtask but returns to an upright position.
Figure 13 depicts trajectory characteristics for an increasing number of samples in LSAPA. The smallest number of samples is three. The accumulated reward (Figure (a)) increases rapidly below 10 samples, and the gain diminishes between 10 and 20 samples; peak performance is reached at 20 samples, and sampling beyond that point brings no further gain. We see the same trend for the pole displacement (Figure (b)) and speed magnitude (Figure (c)).
In this section we analyze and discuss PEARL's policy progression to the attractor, its computational complexity, preference properties, and training domain selection. First, we find sufficient conditions for the policy to progress to the attractor and analyze the regions where the agent can get stuck when the conditions are violated. Second, we prove that the LSAPA computational complexity is linear in the dimensionality of the input space. Third, we consider the preferences' properties with respect to the workspace volume and dimensionality. Last, we construct a training domain that captures the interesting parts of the problem while remaining small enough for efficient learning.
VI-A Task completion
In this section we discuss the policy’s progression to the attractor. First, we look at convergence conditions for attractor-only tasks, and then discuss the existence of local maxima when the task contains repellers.
VI-A1 Attractor-only tasks
The system is controllable and the attractor is reachable. In particular, we assume that:
1) the state-value function is regular outside of the origin,
2) the action is defined on a closed interval around the attractor, and
3) the drift is bounded.
The claim is a direct result of Artstein's theorem and our previous work [53, 9]. Artstein's theorem states that a greedy policy over a control Lyapunov function takes the system to the equilibrium, and in [53, 9] we showed that, under the conditions above, a control Lyapunov function can be constructed from the state-value function V.
When the learned state-value function V has the form (23), the greedy policy (4) progresses to the equilibrium located at the goal. The goal preferences result in a state-value function of the form (23) when the objectives span the entire state space and each element of the learned weight vector is negative. To show why, we write the goal preferences using projection notation.
Consider the diagonal projection matrix corresponding to an attractor: the element on its diagonal equals 1 only where the attractor's coordinate is non-zero, and all other elements are 0. Such a matrix is an orthogonal projection, so its square equals itself, and each goal preference is a quadratic form in the projected state. The weighted sum of these quadratic forms is again a quadratic form whose matrix is diagonal and, when each learned weight is negative and the objectives jointly span the whole state space, negative definite. The value function (2) can therefore be written in the form (23) after a translation of the coordinate system to the goal.
To conclude, policies learned with PEARL for tasks with only attractors that jointly span the entire state space are guaranteed to progress the agent to the goal when the learning algorithm produces a parametrization with all-negative coordinates.
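The argument can be checked numerically on a toy example; the names `Lambda` and `theta` are our own labels for the projection matrices and learned weights, and the values are arbitrary.

```python
import numpy as np

# Toy 3-D example of the quadratic-form argument above.
# Diagonal 0/1 projection matrices selecting the coordinates each
# attractor preference acts on; together they span the state space.
Lambda = [np.diag([1, 1, 0]), np.diag([0, 0, 1])]
theta = np.array([-2.0, -0.5])       # all-negative learned weights

# V(x) = sum_i theta_i * ||Lambda_i x||^2 = x^T M x, with M diagonal.
# Since Lambda_i is idempotent (Lambda_i^T Lambda_i = Lambda_i):
M = sum(t * L for t, L in zip(theta, Lambda))

x = np.array([1.0, -2.0, 3.0])
v_features = sum(t * x @ L @ x for t, L in zip(theta, Lambda))
assert np.isclose(v_features, x @ M @ x)          # the two forms agree
assert np.all(np.linalg.eigvalsh(M) < 0)          # M is negative definite,
# so the value function has a single maximum at the goal (the origin here).
```

With all-negative weights and spanning objectives, M is negative definite, which is exactly the condition under which the greedy policy drives the state to the goal.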
VI-A2 State-value function local minima analysis
For tasks with mixed preferences, such as dynamic obstacle avoidance, the agents follow preferences, but there are no formal completion guarantees. In fact, the value function (2) has potentially two maxima, one on each side of the obstacle.
For the purpose of this analysis, we assume that the problem is well formulated and contains one goal, i.e., the intersection of the subspaces defined by the distance-reducing objectives is non-empty and forms a connected set. Note that in order for both the attractor and the repeller to act as expected, the resulting weights must be negative. Since a straight line is the shortest path between an agent and its attractor, we analyze the value function restricted to that line with varying obstacle distances. To simplify the analysis, we transform the value function (2). Without loss of generality, we rotate and scale the coordinate system so that the goal is at the origin and the agent is on the axis, with the two nearest obstacles lying symmetrically about the axis at unit distance along it. In addition, we multiply the entire function (2) by minus one. Now, we are interested in finding necessary conditions for the minima of the negated function, which correspond to the maxima of the value function.
To construct the negated function, let the coefficient be the ratio between the learned weights for the repeller and the attractor features. The value function after the affine transformation is
We examine necessary conditions for the minima based on the obstacle distance and the coefficient. A point is a local minimum if
and the second derivative is positive,
For (24) to hold, the following must be the case:
We conclude that the value function (2) for problems with obstacles has two local maxima, one to the left of the goal and the other to the right of the obstacles. If the agent is located between the goal and the obstacles, it settles at the equilibrium point to the left of the obstacles. Otherwise, the agent settles to the right of the obstacles, unless the obstacles move. Further, the equilibrium point tends to the goal as the distance between the obstacles increases.
We perform an empirical study, depicted in Figure 14, to verify the analysis. The value function has either a single maximum near the attractor (pink and blue lines in Figure 14), two maxima (blue and green lines), or an inflection point near the obstacles (red line). Inspection of the partial derivatives at the minimum points in Figure 14 reveals that these points are saddle points.
In summary, when the obstacles are far enough apart, there is only a global maximum. As the obstacles come closer together, a new region of attraction forms on the other side of the obstacles. If the agent enters the local maximum's region of attraction, gradient-descent methods trap it. Sampling-based greedy methods such as HOOT, however, might get the agent out of the local maximum if it is close enough to the boundary.
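This behavior can be reproduced with a one-dimensional numerical sketch; the exponential repeller form and all constants are assumptions, chosen only to illustrate how a second maximum appears as the obstacles close in on the goal-agent line.

```python
import numpy as np

def v_line(x, y0, delta, c=4.0):
    """Value function restricted to the agent-goal line (illustrative sketch).

    Goal at the origin; two obstacles at (1, +y0) and (1, -y0). The
    exponential repeller feature and the constants c, delta are assumptions,
    not the paper's exact form.
    """
    rep = np.exp(-c * ((x - 1.0) ** 2 + y0 ** 2))
    return -x ** 2 - 2.0 * delta * rep

def count_local_maxima(y):
    """Count strict interior local maxima of a sampled curve."""
    return int(np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])))

xs = np.linspace(-0.5, 3.0, 2000)
# Obstacles far from the line: a single maximum near the goal.
assert count_local_maxima(v_line(xs, y0=2.0, delta=5.0)) == 1
# Obstacles close to the line: a second region of attraction past them.
assert count_local_maxima(v_line(xs, y0=0.1, delta=5.0)) == 2
```

Sweeping `y0` shows the second maximum emerging continuously, matching the transition from the single-maximum to the two-maxima curves in Figure 14.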
VI-B Computational cost
The feature vector computation time given in (6) is linear in the state dimensionality and the number of preferences, because projection is a linear operation.
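For concreteness, the linear-time evaluation can be sketched as follows; representing the diagonal projections as 0/1 masks is an implementation choice, not the paper's code.

```python
import numpy as np

def value(x, theta, masks):
    """V(x) = sum_i theta_i * ||Lambda_i x||^2, with each diagonal projection
    Lambda_i stored as a 0/1 mask. Cost is O(len(theta) * len(x)): one pass
    over the state per preference, because projection is a linear operation."""
    return sum(t * np.sum((m * x) ** 2) for t, m in zip(theta, masks))
```

No matrix-matrix products are needed: each feature is an element-wise mask followed by a sum of squares, which is what keeps the per-step planning cost linear.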