I. Introduction
Recent advances in onboard computing software and hardware suggest that the time when robots will populate and operate in the same environments as humans is not far away. As a result, there has recently been strong interest in developing planning methods for robots that operate in close proximity to humans [17]. To date, most of these methods focus on safe planning in dense human crowds, where humans move frequently, and they often rely on standard collision avoidance techniques developed in the robotics literature for this purpose. While safety is indeed a critical specification for any human-robot system, the deployment of robots in environments populated by humans also requires planning methods that can incorporate different human utilities that are not always straightforward to express mathematically. In fact, it is often the case that humans will only provide high-level feedback on whether they like an event or not, e.g., a robot behavior, without providing relevant contextual information, e.g., their locations in the case of safety. Developing planning methods that can incorporate such unknown human utilities and high-level feedback is necessary for robots to coexist efficiently with humans.
In this paper, we consider the problem of planning socially-aware trajectories for robots operating in environments populated by humans. The task assigned to the robot is a basic reach-avoid task. We define a robot trajectory to be socially aware if it does not interfere with humans in any way that causes discomfort. In this paper, discomfort is defined broadly and, depending on specific individuals, it can result from the robot being too close to a human or from interfering with human sight or tasks. Moreover, we assume that human feedback is a bandit feedback indicating a complaint or no complaint on the part of the robot trajectory that interferes with the humans, and that it does not reveal any contextual information on the locations of the humans or the reason for a complaint. Finally, we assume that humans can move in the obstacle-free space. As a result, human utilities can change and different robot paths need to be computed for different human configurations. We formulate this planning problem as an online optimization problem that minimizes the social value of the time-varying robot trajectory, defined by the total number of incurred human complaints. Since the human utility function that specifies human feedback is unknown, we cannot apply standard gradient-based optimization methods to solve this online optimization problem. Instead, we propose to use derivative-free methods, specifically the zeroth-order method, to estimate the gradient by querying humans for feedback on randomly generated trajectories. To obtain a plausible solution, we restrict the number of requests for feedback that can be sent to humans; humans will not provide feedback hundreds of times for the same task. Furthermore, to ensure that the resulting robot trajectories are collision-free and dynamically feasible, we integrate the zeroth-order method with an off-the-shelf motion planner that corrects the zeroth-order trajectory to satisfy these additional constraints.
Next, we discuss related work on socially-aware robot navigation, trajectory optimization, and human preferences.
I-A Socially-Aware Robot Navigation
To date, model-based and learning-based methods are the two main approaches to socially-aware robot navigation [4]. Model-based methods typically combine multi-agent collision avoidance techniques with predictive models of human motion, such as social force models [7] or Gaussian-process-based statistical models [33]. On the other hand, learning-based methods either learn an interaction model that can predict human motion from a human trajectory dataset by means of hand-crafted features [18, 16], Gaussian processes [34], or neural networks [1], or directly enforce multi-agent cooperative collision avoidance using deep reinforcement learning [4]. Common to all of these methods is that they focus on human-robot collision avoidance, which can be modeled mathematically. Instead, here we consider socially-aware robot planning for human utilities that are possibly unknown and cannot be easily described mathematically. We only assume that high-level human feedback is available, which does not include contextual information about the reasons for a complaint. To the best of our knowledge, this is the first robot planning framework that can incorporate such types of human utilities and feedback.

I-B Trajectory Optimization
Numerical optimization methods for motion planning can be broadly classified into two categories, namely, derivative-based and derivative-free. Derivative-based methods require a smooth and differentiable objective function and typically solve nonlinear motion planning problems by leveraging first- and second-order derivatives. These methods include, but are not limited to, the well-known motion planners CHOMP [27], TrajOpt [30], KOMO [32], and RieMo [26]. However, derivative-based methods cannot handle non-differentiable utility functions, such as the human utilities considered in this work. Such non-differentiable utilities can instead be handled by derivative-free planning methods. One type of derivative-free planning method that has been widely used in the robotics literature is the sampling-based method, such as kinodynamic RRT [35], which builds graphs by randomly exploring the configuration space and optimizing an objective along the way. Alternatively, population-based optimization methods can be used for motion planning with non-differentiable objectives, such as the particle swarm optimization method [15], which moves a swarm of particles through the search space in search of better robot trajectories. Another popular derivative-free method is STOMP [13], which improves on the current trajectory by applying Monte Carlo simulation to sample a series of noisy trajectories and explore the trajectory space. Although derivative-free planning methods, such as those proposed in [35, 15, 13], can handle non-differentiable objectives, they still require these objectives to be known and mathematically expressed. This is not the case in the socially-aware planning problem considered here, where human utilities are unknown and only high-level, context-unaware human feedback is available. What makes it possible to utilize such human feedback for robot planning is the proposed zeroth-order optimization framework, which can be combined with off-the-shelf planners such as the above.

I-C Human-Preference Trajectory Optimization
Perhaps most relevant to the method proposed here is recent work on trajectory optimization using human preferences. For example, in [20], task-specific human preferences are incorporated into a motion planner, assuming that these human preferences are formulated as simple convex functions. Similarly, in [12], human preferences are used to iteratively make small improvements to trajectories returned by robots. Instead, here we assume that human feedback is a bandit feedback and that the human utility function that governs this feedback is unknown. Following a different approach, [21] adopts a data-driven method to account for human preferences: a set of trajectory features is designed and a classifier is learned to predict human ratings of new trajectories. Similarly, [6] learns an action policy for grasping tasks by actively querying a human expert for ratings. More broadly, preference-based reinforcement learning [37] focuses on human preferences expressed as pairwise comparisons between trajectories. Either a policy that satisfies the human preferences is learned directly [36, 3], or a qualitative preference model capturing a human preference relation or a quantitative utility function estimating a human evaluation criterion is learned and then used to optimize the policy [8, 5, 29, 25]. While preference-based learning methods can in theory be used to solve the socially-aware planning problem with bandit human feedback proposed in this paper, their application is not straightforward. First, they can only be applied to stationary problems where the utilities of humans do not change with time, because a policy learned for one preference model cannot be easily adapted to another preference model corresponding to different human utilities. Moreover, even in the stationary case, the amount of data required by learning-based methods to learn an optimal policy is typically very large. This would make the application of such methods prohibitive in practice, since humans would need to be queried for feedback a prohibitively large number of times. Our proposed framework does not suffer from these limitations: extensive numerical simulations show that it can easily handle changing human utilities while keeping the number of queries for human feedback low.
The rest of the paper is organized as follows. In Section II, we provide a brief overview of the zeroth-order optimization method, and in Section III we present the problem formulation. In Section IV, we describe our proposed algorithm for the socially-aware robot planning problem. Numerical experiments are presented in Section V.
II. Preliminaries
In this section, we provide a short background on zeroth-order optimization, since it constitutes the foundation upon which our proposed method is developed. The zeroth-order method [23, 10] is a derivative-free optimization framework that optimizes an objective function based only on function evaluations, which makes it a preferred method when the objective function is unknown or noisy, or when the gradient is difficult to calculate. Consider an unknown objective function f defined on a d-dimensional domain. To minimize f, the zeroth-order method estimates the gradient at the current iterate x by randomly perturbing x and then querying for function evaluations one or multiple times. For example, the following gradient estimate queries for function evaluations twice,
\hat{g}(x) = \frac{d}{2\delta}\left( f(x+\delta u) - f(x-\delta u) \right) u, \quad (1)
where d is the size of the decision variable x, u is a random vector sampled from a uniform distribution on the unit sphere or from a Gaussian distribution, and δ > 0 is the exploration parameter. As discussed in [23, 10], the expression in (1) is an unbiased stochastic gradient estimate of a smoothed version f_δ of f. Therefore, when the zeroth-order method updates the current iterate using (1), it in fact minimizes the smoothed objective f_δ, whose distance to f is bounded by an amount proportional to the exploration parameter δ; see [23, 10]. After obtaining the estimated gradient, the next iterate is computed in the usual way as

x_{k+1} = x_k - \alpha \, \hat{g}(x_k), \quad (2)

where α > 0 is the step size.
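As an illustration, the two-point estimator (1) and the update (2) can be sketched in a few lines of Python. The function names and the default values of the step size and exploration parameter below are illustrative choices, not the paper's implementation.

```python
import numpy as np

def zo_gradient(f, x, delta=0.1, rng=None):
    """Two-point zeroth-order gradient estimate of an unknown objective f.

    Samples a random unit direction u and queries f twice, at x + delta*u
    and x - delta*u, as in (1).
    """
    rng = rng or np.random.default_rng()
    d = x.size
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)  # normalized Gaussian = uniform on the unit sphere
    return d / (2.0 * delta) * (f(x + delta * u) - f(x - delta * u)) * u

def zo_minimize(f, x0, step=0.05, delta=0.1, iters=300, seed=0):
    """Plain zeroth-order descent following (2): x <- x - step * g_hat."""
    rng = np.random.default_rng(seed)
    x = x0.astype(float).copy()
    for _ in range(iters):
        x -= step * zo_gradient(f, x, delta, rng)
    return x
```

Note that each iteration costs two function evaluations and no derivative information, which is exactly the property exploited later when the evaluations are replaced by human feedback queries.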
III. Problem Definition
Consider an environment , let denote the set of obstacles in , and assume that the obstacle-free space is populated by humans that are distributed possibly in a spatially uneven way. Assume also that the model of the human dynamics is unknown and that human movement is random, making the human motion patterns difficult to predict. Given a group of humans in the environment, let denote the set that collects their locations at discrete time instants . We make the following assumption on the motion pattern of the humans.
Assumption III.1
There exists a constant such that for all . Moreover, at every time instant , humans move instantaneously between consecutive locations and , and they remain stationary during the time interval .
The existence of the bound in Assumption III.1 ensures that the times when humans move in the environment are not arbitrarily close to each other. Since humans also move instantaneously, Assumption III.1 captures situations where human motion is infrequent and sporadic. For example, it models typical work environments, where humans remain around certain locations for a while and then quickly move to other locations to perform subsequent tasks.
Consider also a robot navigating in the obstacle-free environment from a starting location to a target location . Let denote the state of the robot, which includes the robot’s location, orientation, etc., where is the feasible state space. Let also denote the control input of the robot, where is the set of admissible control inputs. Moreover, let for , denote a sequence of time instants defined over the time intervals during which humans remain stationary, and define the discrete-time dynamics of the robot during this time interval as
(3) 
Finally, define a discrete robot trajectory during the time interval by the sequence of robot positions , where is the position component of and for all . In particular, is the starting location and is the target location. In what follows, with a slight abuse of notation, we write .
Our goal is to design a collision-free, dynamically feasible, and socially-aware trajectory for the robot that does not interfere with humans in any way that causes discomfort. In this paper, discomfort is defined broadly and, depending on specific individuals, it can result from the robot being too close to a human, or from interfering with human sight or tasks. To this end, in what follows, we propose a novel way to incorporate human feedback into the robot planning process. Specifically, we make the following assumptions on the feedback provided by humans.
Assumption III.2
Human feedback is a bandit feedback indicating a complaint or no complaint on the part of the robot trajectory that falls in the human's vicinity. Moreover, human feedback does not reveal any contextual information about the locations of the humans or the reason for a complaint. Finally, the human utility function does not change during the time intervals , but it can change at the time instants when humans move.
We model human feedback using an implicit human- and time-dependent utility function , which indicates the preference of individual regarding the trajectory at time instant , where we drop the subscript from when the notation is clear. Specifically, the function returns 1 if individual complains about the trajectory at time instant ; otherwise it returns 0. The dependence of on time captures the fact that an individual’s feedback on a trajectory may vary over time. For instance, humans may become more or less tolerant of a robot moving in their vicinity, depending on the type of their ongoing work. Essentially, we treat each individual as a black box and respect their preferential heterogeneity by means of distinct functions . We capture the social value of the trajectory by the total number of complaints incurred by that trajectory, defined as . Then, a trajectory is said to be socially aware if it incurs zero complaints, i.e., . In what follows, we avoid relying on models of human feedback to achieve socially-aware planning [20, 21, 29], since such models can differ across humans and vary over time.
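Treating each individual as a black box, the social value and the socially-aware condition can be sketched as follows; each oracle below is a hypothetical stand-in for a human's 0/1 complaint feedback, and the function names are illustrative.

```python
def social_value(trajectory, complaint_oracles):
    """Total number of complaints incurred by a trajectory.

    Each oracle is an opaque callable returning 1 (complaint) or 0 (no
    complaint); the planner never sees the reason behind a complaint.
    """
    return sum(oracle(trajectory) for oracle in complaint_oracles)

def is_socially_aware(trajectory, complaint_oracles):
    """A trajectory is socially aware if it incurs zero complaints."""
    return social_value(trajectory, complaint_oracles) == 0
```

The key point is that the planner can only evaluate this sum by querying, which is what motivates the zeroth-order treatment that follows.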
On the other hand, we say that the trajectory collides with an obstacle if a line segment between consecutive waypoints in intersects the obstacle . Then, the trajectory is collision-free if it never collides with any obstacle. (Note that, in this paper, collision avoidance is defined with respect to obstacles; collisions with humans are captured by human complaints.) Let be the set that collects all collision-free discrete trajectories in the environment . Finally, we define a dynamically feasible trajectory to be one that respects the robot dynamics (3), and collect all dynamically feasible discrete trajectories in the environment in the set . Then, the socially-aware robot planning problem considered in this paper can be formulated as
(4)  
s.t. 
Problem (4) is an offline optimization problem that can be solved after time instant , assuming that all utility functions are known in hindsight. However, if the human utilities are unknown and a new socially-aware trajectory has to be designed during every time interval , problem (4) needs to be solved in an online fashion. We can do this by solving the following optimization problem during every time interval ,
(5) 
Note that the objectives in problem (5) are time-varying due to the varying human locations and feedback. Based on problem (5), the socially-aware robot planning problem considered in this paper can be formulated as follows.
Problem 1
The solution of an online optimization problem should track the solution of its offline counterpart [31]. As such, we can measure the quality of the sequence of trajectories returned by the solution of Problem 1 using the cumulative regret over the whole time period defined as
(6) 
where is the computed solution sequence of problem (5) and is the minimizer sequence of problem (4), which incurs the least number of complaints when the functions are known in hindsight. The cumulative regret is a dynamic regret, since the optimal trajectory varies due to human motion and the variations in the human feedback function over time. As shown in [22], the best known bound on the dynamic regret (6) grows at the same rate at which the optimal trajectory varies over time. Therefore, the cumulative regret (6) shall at best grow at the same rate at which the human positions and preferences change. Note that, since the human utility is unknown, we cannot employ standard gradient-based optimization methods to solve Problem 1. Instead, we employ the zeroth-order optimization method, discussed in Section II, to solve this online problem.
IV. Socially-Aware Robot Planning
In this section, we first reformulate Problem 1 by manipulating the constraint set in (5), and then develop a novel method that combines zeroth-order optimization with Model Predictive Control (MPC) [28, 11, 2] to solve the socially-aware robot planning Problem 1. Specifically, let be the set of collision-free and dynamically feasible trajectories that are in the neighborhood of a reference trajectory , and define the problem
(7) 
Note that problems (5) and (7) are equivalent. To see this, observe that if is the optimal solution of (5), then it is collision-free and dynamically feasible and, therefore, belongs to the set . Since (5) and (7) share the same cost function, this means that is also the optimal solution of (7). Conversely, if is the optimal solution of (7), then it is collision-free and dynamically feasible and, therefore, satisfies the constraints of problem (5), which means that is also the optimal solution of (5).
Computing the set is generally difficult. Instead, we can compute a single trajectory in using off-the-shelf motion planners, such as MPC. Let denote such a trajectory, which satisfies . Then, a sufficient condition for the feasibility of problem (7) is that . To enforce this constraint, we treat as a target trajectory that needs to track. Moreover, we query humans for feedback on the trajectory rather than on the trajectory . This is because may not be collision-free or dynamically feasible, while is, and it is not meaningful to request human feedback on an infeasible trajectory. With the above modifications, we obtain the following penalty-based reformulation of problem (7):
(8) 
where is a positive penalty parameter and measures the tracking error between and the motion planner. Note that (8) is not necessarily equivalent to (7), because there is no guarantee that a motion planner can return a trajectory that satisfies , where is the optimal solution of (7). Note also that the solution of (8) returns a trajectory , but the robot executes the trajectory , which is guaranteed to be collision-free and dynamically feasible. Finally, observe that in (8) the goal is to minimize human complaints, but other objectives, such as control effort, can also be incorporated into this formulation. For example, the MPC trajectory in (8) can easily be constructed in a way that minimizes such additional objectives; see, e.g., Section IV-B.
As discussed in Section III, since the human utility function is unknown, we need to utilize the zeroth-order method to solve problem (8), which we combine with an MPC motion planner that provides the trajectories in (8), as follows. At every iteration of the proposed zeroth-order method, the current trajectory iterate is perturbed and then rectified by the MPC motion planner so that it is collision-free and dynamically feasible. Then, the humans are queried for feedback on the perturbed trajectories, and this feedback, along with the tracking error, is used by the zeroth-order method to estimate the gradient and update the current trajectory iterate; see Section II. The proposed algorithm is illustrated in Figure 1. In what follows, we discuss the proposed zeroth-order method, the MPC motion planning problem, and the integrated system in detail.
IV-A Zeroth-Order Optimization Method
The advantage of the proposed zeroth-order method is that it can incorporate bandit human feedback into motion planning. Since only the values of the human utility function are accessible, the proposed zeroth-order method uses these values to estimate the gradient of the human feedback as
(9) 
where , is the iterate of , is the trajectory returned by the MPC motion planner that will be introduced in the next section, is the size of the vector , and is a random vector of unit length. Given the vector , the trajectories and are deterministic.
A main limitation of zeroth-order methods in general is that their convergence is slow, especially when the problem dimension is high [23, 10]. This is the case here, as the number of waypoints in the trajectory can be very large. To address this limitation, we propose local updates of the trajectory , restricted to those waypoints that are directly affected by human complaints. This is a reasonable assumption, since humans are more likely to complain about the part of the trajectory that causes them discomfort rather than about the whole robot trajectory, which may not even be fully observable. To develop the proposed local update scheme, we replace the utility functions by functions , where the index set captures the local awareness of human , that is, those waypoints in the whole trajectory that directly affect human at time instant . In this case, to minimize the utility function , it is sufficient to update only the waypoints in the set according to the modified zeroth-order update
(10) 
where , is the dimensionality of a waypoint, is the cardinality of the set , and the vector is computed by setting the entries of that are not in the set to zero and then normalizing, i.e., . Since , the convergence speed of the algorithm can be greatly improved. Similarly, the gradient of the tracking error between the zeroth-order iterate and the MPC motion plan can be obtained as
(11) 
where denotes the tracking error. Note that can also be computed using finite differences since this objective is known. In what follows, we discuss how to compute the set .
Since each trajectory needs to connect the starting and target locations of the robot, we assume that the first and last waypoints of the trajectory do not affect any human and, therefore, are not included in the set . All other locations in the free space, if they are traversed by the trajectory , can possibly cause discomfort to humans. In what follows, we discuss two forms of human feedback, resulting in two ways to determine the set . (i) Humans can only send complaints, without revealing which waypoints affect them. In this case, the set is the full trajectory and Alg. 1 perturbs the whole trajectory . We refer to this as the full perturbation scheme. (ii) Humans send complaints and also report the part of the trajectory that affects them. In this case, the set contains the waypoints indicated by the human, as well as a random number of waypoints before and after the reported sequence of waypoints. This gives the zeroth-order method more room to compute trajectories that can be more easily modified by the motion planner. Moreover, these additional waypoints can account for human errors in reporting the waypoints that cause them discomfort. We refer to this as the local perturbation scheme.
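A minimal sketch of the local perturbation update, assuming the affected index set is given and applying the two-point estimate only to those waypoints; all names and parameter values are illustrative, not the paper's implementation.

```python
import numpy as np

def local_perturbation(num_waypoints, dim, affected, rng):
    """Unit perturbation supported only on the affected waypoints.

    `affected` plays the role of the index set I: the waypoint indices that
    drew complaints (possibly padded with a few neighbors); all other rows
    of the direction are set to zero before normalizing.
    """
    u = rng.normal(size=(num_waypoints, dim))
    mask = np.zeros(num_waypoints, dtype=bool)
    mask[list(affected)] = True
    u[~mask] = 0.0
    return u / np.linalg.norm(u)

def local_zo_step(cost, traj, affected, delta=0.1, step=0.05, rng=None):
    """One zeroth-order update restricted to the affected waypoints."""
    rng = rng or np.random.default_rng()
    n, dim = traj.shape
    u = local_perturbation(n, dim, affected, rng)
    d_eff = dim * len(affected)  # effective problem dimension |I| * dim
    g = d_eff / (2 * delta) * (cost(traj + delta * u) - cost(traj - delta * u)) * u
    return traj - step * g
```

Because the effective dimension drops from the full trajectory to the affected waypoints, the variance of the gradient estimate shrinks accordingly, which is the source of the speed-up described above.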
IV-B Model Predictive Control (MPC)
MPC has long been used to plan collision-free robot trajectories that track predefined reference paths [19, 38, 9]. We adopt the approach proposed in [38]. Specifically, given a reference trajectory returned by the zeroth-order method, which can be dynamically infeasible or collide with obstacles, our goal is to compute the optimal control sequence , at each time instant , by solving the following optimal control problem over a horizon of length :
(12a)  
(12b)  
(12c) 
where is the deviation from the reference trajectory, and and are constant weighting matrices. The first term in (12) penalizes collisions with obstacles, the second term penalizes the terminal deviation, and the last term penalizes the deviation from the reference trajectory and the control input. Note that in the proposed MPC problem the penalty on the control input can help reduce the trajectory length. Moreover, the penalty term on collisions is a point-wise potential function, defined as
(13) 
where is the shortest distance of the line segment that connects waypoints and , denoted by , to any obstacle, and is a small positive constant that avoids the singularity.
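For concreteness, one plausible form of this point-wise potential can be sketched as follows, assuming circular obstacles given as (center, radius) pairs; the exact potential shape and the constant `eps` below are illustrative assumptions, not the paper's definition.

```python
import numpy as np

def segment_point_distance(p, q, c):
    """Shortest distance from point c to the segment between p and q."""
    pq = q - p
    t = np.clip(np.dot(c - p, pq) / max(np.dot(pq, pq), 1e-12), 0.0, 1.0)
    return float(np.linalg.norm(p + t * pq - c))

def collision_potential(p, q, obstacles, eps=1e-2):
    """Point-wise collision penalty for the segment between waypoints p and q.

    Inversely proportional to the clearance from the nearest obstacle, so the
    penalty blows up (up to 1/eps) as the segment approaches an obstacle.
    """
    clearance = min(segment_point_distance(p, q, c) - r for c, r in obstacles)
    return 1.0 / (max(clearance, 0.0) + eps)
```

Summing this penalty over all consecutive waypoint pairs yields the first term of the MPC cost in (12).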
IV-C The Integrated Socially-Aware Planning Algorithm
Alg. 1 shows the proposed online solution to Problem 1 during the whole time period , integrating zeroth-order optimization for the social awareness of the desired trajectory with MPC for collision avoidance and dynamic feasibility. As shown in [39], if the objective function is non-degenerate, which is a weaker condition than strong convexity, the dynamic regret in (6) can be improved by estimating the gradient multiple times per time instant. Although, compared to [39], here the form of is unknown, we adopt the same idea and run gradient updates at every time instant. Note that the inner iteration of Alg. 1 reduces to a common online algorithm when . Within each inner iteration, the humans are queried for feedback twice: once on the two perturbed trajectories and once to evaluate the updated trajectory. Therefore, the number of queries during each time interval is twice the number of inner iterations. Note that the number of queries can be further reduced by running the inner iterations without checking whether the updated trajectory incurs complaints or not; in this case, humans are only queried once per iteration. In what follows, we discuss Alg. 1 in detail.
A good initial trajectory , starting from the starting location and ending at the target location , can be a collision-free trajectory provided by a motion planning algorithm such as RRT [14]. However, it can also be any other trajectory, without considering collisions, since the MPC will address them. For each time interval , Alg. 1 runs multiple iterations to obtain a trajectory with high social awareness. Specifically, at every iteration , Alg. 1 first collects the human feedback on the trajectory . If incurs zero complaints or a predefined maximum number of iterations is reached, then the inner iteration terminates and the trajectory returned by the MPC is the solution for the time interval . Otherwise, Alg. 1 updates the waypoints in , perturbs the current iterate by adding and subtracting , and queries the humans for feedback twice. Then, the algorithm computes the gradient of the objective function and updates the iterate. We add an extra weight parameter to the gradient of the utility function. At the next time instant , the humans in the environment may move; the trajectory from the previous time instant is then used as a warm start to initialize at time instant .
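The structure of the inner loop described above can be sketched as follows, with the MPC rectification, the human query, and the zeroth-order update abstracted as callables; this is a schematic of the control flow, not the paper's implementation.

```python
def plan_interval(traj, mpc_rectify, query_humans, zo_step, max_iters=10):
    """One time interval of the integrated planner (a sketch of Alg. 1).

    `mpc_rectify` maps a reference trajectory to a collision-free,
    dynamically feasible one; `query_humans` returns the number of
    complaints on a trajectory; `zo_step` performs one zeroth-order
    update (internally querying humans on the two perturbed trajectories).
    """
    for _ in range(max_iters):
        feasible = mpc_rectify(traj)        # the robot would execute this
        if query_humans(feasible) == 0:     # socially aware: stop early
            return feasible, traj
        traj = zo_step(traj)                # perturb, query twice, update
    # budget exhausted: return the best feasible trajectory found so far;
    # `traj` is kept as a warm start for the next time interval
    return mpc_rectify(traj), traj
```

Plugging in real components, `zo_step` would be the local or full perturbation update from Section IV-A and `mpc_rectify` the MPC tracking problem (12).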
V. Numerical Experiments
In this section, we present multiple case studies, implemented in Python 3.6.3 on a computer with a 2.3 GHz Intel Core i5 processor and 8 GB of RAM, that illustrate the efficiency of the proposed algorithm. We consider a square continuous environment of size , where two square obstacles of size are located near the center. The starting location of the robot is at and the target location is at ; see also Fig. 2. The start and target locations are fixed throughout the simulations. We generate a trajectory consisting of 15 waypoints that connects the initial and target locations; see the red arrow path in Fig. 2. This trajectory constitutes the initial trajectory for the proposed zeroth-order method.
Furthermore, we consider a group of humans in the environment . At time , we assume that the human population is randomly located in the environment. Then, at each time instant , each individual moves by uniformly sampling a location within a circle of radius centered at their current location. The size of the circle captures the range of an individual’s movement. To quantify human feedback, we assume that each individual possesses a restricted zone and makes a complaint if any trajectory segment passes through that restricted zone. The restricted zone is represented as a circle of radius , where is human-dependent and reflects the sensitivity of an individual to violations of their restricted zone; see also Fig. 2. For every human , we let randomly take a value in the set . The radius of the restricted zone is randomly updated at subsequent time instants , which captures the temporal variation of a human’s feedback. We emphasize that the restricted zone is only a construct used to generate complaints and is not known to the zeroth-order planner; neither are the human locations. Finally, we consider a time period containing the sequence of time instants . The parameters of the zeroth-order method in Alg. 1 are set as and for the full perturbation scheme and for the local perturbation scheme.
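The restricted-zone complaint generator used in the simulations can be sketched as follows, assuming straight-line segments between consecutive waypoints; this construct only simulates bandit feedback and its inputs (`center`, `radius`) are hidden from the planner.

```python
import numpy as np

def restricted_zone_complaint(traj, center, radius):
    """Simulated bandit complaint for one human.

    Returns 1 if any trajectory segment passes through the circular
    restricted zone of the given radius around the human at `center`,
    and 0 otherwise.
    """
    for p, q in zip(traj[:-1], traj[1:]):
        pq = q - p
        t = np.clip(np.dot(center - p, pq) / max(np.dot(pq, pq), 1e-12),
                    0.0, 1.0)
        if np.linalg.norm(p + t * pq - center) < radius:
            return 1
    return 0
```

Summing this indicator over all humans reproduces the social value used to score a candidate trajectory in the experiments.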
TABLE I
N  | Iterations (full) | Iterations (local) | Length (full) | Length (local)
20 | 2.35 ± 2.37       | 3.25 ± 4.44        | 31.88 ± 2.42  | 30.34 ± 0.99
30 | 5.58 ± 5.19       | 5.20 ± 8.33        | 34.50 ± 4.70  | 31.46 ± 2.84
40 | 12.20 ± 7.67      | 7.75 ± 4.63        | 39.36 ± 6.14  | 35.57 ± 6.51
50 | 12.35 ± 13.10     | 7.60 ± 8.11        | 39.56 ± 8.52  | 35.60 ± 6.18
60 | 18.33 ± 13.17     | 8.68 ± 8.52        | 43.92 ± 8.06  | 38.01 ± 8.23

The iteration columns report the number of iterations taken by the full and local perturbation schemes, respectively, and the length columns report the trajectory length returned by the full and local perturbation schemes, respectively (mean ± standard deviation). The numbers in the parentheses next to certain results report the number of trials out of 20 in which no socially-aware trajectory was generated within 50 iterations; during these trials, paths with high social value were still generated, but the number of complaints was not zero.
V-A Robot Model
We consider a unicycle robot model [24]. Let and , respectively, represent the location and orientation of the robot, and let denote the control input, where and are the linear and angular velocities, respectively. Then the discrete kinematic equation governing the state is
\begin{aligned} x_{k+1} &= x_k + v_k \cos(\theta_k)\,\Delta t, \\ y_{k+1} &= y_k + v_k \sin(\theta_k)\,\Delta t, \\ \theta_{k+1} &= \theta_k + \omega_k\,\Delta t, \end{aligned} \quad (14)

where Δt is the sampling period. In the simulation experiments that follow, the MPC parameters are set as .
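A minimal sketch of one step of the unicycle kinematics, assuming a forward-Euler discretization with sampling period `dt`; the function name and default `dt` are illustrative.

```python
import numpy as np

def unicycle_step(state, control, dt=0.1):
    """One step of discrete unicycle kinematics.

    state = (x, y, theta): position and heading; control = (v, omega):
    linear and angular velocity. Forward-Euler integration over dt.
    """
    x, y, theta = state
    v, omega = control
    return np.array([x + v * np.cos(theta) * dt,
                     y + v * np.sin(theta) * dt,
                     theta + omega * dt])
```

Iterating this map over a control sequence produces the dynamically feasible state sequence that the MPC constraints (12b)–(12c) enforce.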
V-B Socially-Aware Planning in Stationary Environments
We first validate Alg. 1 and compare the full perturbation and local perturbation methods, discussed in Section IV-A, in situations where humans do not move. We set and in Alg. 1 and, as discussed in Section IV-C, we perform up to gradient updates until the inner iteration terminates, querying humans for feedback on the updated trajectory, so that the total number of queries is twice the number of iterations that Alg. 1 takes. Moreover, we use the trajectory as the initial trajectory for both the full perturbation and local perturbation schemes. The number of humans varies from 20 to 60. For a specific value of , we report the mean and standard deviation of the number of iterations taken to find a collision-free, dynamically feasible, and socially-aware trajectory, averaged over 20 trials; see also Table I. We also present the trajectory length as another metric to compare the full and local perturbation methods. Observe that, for each scheme, both the average number of iterations required to obtain a socially-aware trajectory and the trajectory length grow as increases. In comparison, the local perturbation scheme requires fewer iterations and generates shorter trajectories than the full perturbation method. Note that the average number of queries for human feedback in Alg. 1 is low, between and for the local and full perturbation methods, respectively. Therefore, the proposed method can be used in practice without burdening humans with prohibitively many feedback requests. In contrast, the reinforcement-learning-based methods discussed in Section I-C would require orders of magnitude more samples to obtain an optimal policy. Two sample trajectories returned by the local perturbation scheme, when , are shown in Fig. 2. We see that the trajectory returned by the MPC motion planner successfully avoids the human restricted zones and the obstacles, whereas its reference trajectory returned by the zeroth-order method violates the restricted zones and collides with obstacles.

V-C Socially-Aware Planning in Dynamic Environments
In this section, we consider situations where humans move at discrete time instants and their corresponding utilities change. Therefore, we solve the online problem (8). At each time instant, the full and local perturbation schemes are applied to the same human distribution and human utility function. We assume that a dynamically-feasible, collision-free and socially-aware trajectory always exists during each time interval; that is, each trajectory in the minimizer sequence of problem (4) incurs zero complaints. In this case, the cumulative regret in (6) reduces to the total number of complaints incurred by the trajectories executed by the algorithm:
(15) 
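Since every per-interval minimizer incurs zero complaints, the reduced regret in (15) is simply a running total of the complaints incurred by the trajectories the algorithm actually executes. A minimal sketch (the function name and list-based interface are illustrative, not part of the paper's implementation):

```python
def cumulative_regret(complaints_per_interval):
    """When every per-interval optimal trajectory incurs zero complaints,
    the dynamic regret reduces to the running total of complaints incurred
    by the trajectories the algorithm actually executed."""
    regret, total = [], 0
    for complaints in complaints_per_interval:
        total += complaints          # complaints are non-negative counts
        regret.append(total)
    return regret
```

Because each term is non-negative, the regret curve is non-decreasing over time, which is the quantity plotted against the time instants in the simulations below.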
In the following simulation, we fix the number of humans at 50 and vary the human movement radius and the maximum number of gradient updates in Alg. 1. Different from Section V-B, we perform the gradient updates without querying humans for feedback on each updated trajectory, so that the total number of queries is equal to the number of iterations of Alg. 1. The value of the radius affects the relative change between human distributions and, therefore, the relative change between the optimization problems at consecutive time intervals. For each combination of movement radius and update budget, we run 5 trials and report the mean cumulative regret as a function of the time instant, for the full and local perturbation schemes, respectively. As a baseline, we also report the mean cumulative regret incurred by the initial trajectory returned by MPC at every time interval, assuming that this is also the trajectory returned by the algorithm at that time interval; this baseline is equivalent to performing no gradient updates. Fig. 3 depicts the growth of the regret with respect to the maximum number of iterations for different human motion radii. Observe that, compared to the baseline trajectory, the full and local perturbation schemes can improve the social value of the trajectory with few queries. In particular, using only 3 iterations/queries at each time instant, both methods provide a good tradeoff between decreasing the cumulative regret and limiting the frequency of human queries. Observe in Fig. 3(a) that the local perturbation scheme performs better than the full perturbation scheme for all values of the iteration budget, but this advantage becomes weak as the human motion radius increases. This is because, as the radius grows, the relative change in the human distribution becomes large, and so does the relative change between consecutive optimization problems.
In the local perturbation scheme, for each optimization problem, more waypoints of the trajectory returned from the previous time instant have to be modified, due to the larger change in the human distribution as the radius increases.
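The online procedure of this section can be sketched as follows. This is a simplified sketch, not the paper's implementation: `query_complaints` (the human-feedback channel) and `mpc_project` (the MPC feasibility projection) are hypothetical stand-ins, the step sizes are illustrative, and a single-point zeroth-order estimate models the one-query-per-update budget used in this section (the static experiments of Section V-B instead spend two queries per update).

```python
import numpy as np

def online_social_planner(x0, horizon_T, K, query_complaints, mpc_project,
                          step=0.01, delta=1.0, seed=0):
    """Online loop sketch: warm-start each time interval from the previous
    trajectory, take at most K zeroth-order updates (one feedback query per
    update), and project every iterate through the motion planner so it
    stays dynamically feasible and collision-free."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    total_complaints = 0.0
    regret_history = []
    for t in range(horizon_T):
        for _ in range(K):
            u = rng.standard_normal(x.shape)
            u /= np.linalg.norm(u) + 1e-12
            # single-point zeroth-order estimate: one feedback query per step
            g = (query_complaints(t, x + delta * u) / delta) * u
            x = mpc_project(x - step * g)
        # complaints incurred by the trajectory actually executed at time t
        total_complaints += query_complaints(t, x)
        regret_history.append(total_complaints)
    return x, regret_history
```

Warm-starting `x` across time instants is what lets the local perturbation scheme reuse most waypoints when the human distribution changes only slightly between intervals.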
V-D Limitations and Future Research Directions
We have conducted a large number of numerical simulations and have tested the proposed framework on challenging planning problems. One limitation of this framework, in its current form, is that it does not perform well in situations where the target location of the robot is very close to obstacles. This is due to the potential functions used for obstacle avoidance in the MPC motion planner, which repel the robot trajectory away from the obstacles. For example, our method has not been able to solve planning problems where the target is located in narrow passages formed by obstacles. Additionally, the proposed framework in its present form cannot easily handle complaints about, e.g., robots being too far from humans; in our simulations, complaints concern robots being too close to humans. There are two reasons for this. First, the current local perturbation scheme is not applicable, since humans cannot specify a trajectory segment that affects them if the trajectory is far from them. Second, the full perturbation scheme is also challenging to apply, because trajectories at consecutive iterations may be first close to and then far from humans, making it difficult to estimate zeroth-order gradients; this behavior stems from the stochastic nature of the full perturbation method. We are currently exploring extensions of our proposed socially-aware planning framework to more complex environments and diverse reasons for human complaints.
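The repulsion effect behind the first limitation can be illustrated with a generic artificial repulsive potential (a standard textbook form, not necessarily the exact function used in our planner; `d0` and `eta` are illustrative parameters):

```python
import numpy as np

def repulsive_potential(p, obstacle, d0=1.0, eta=1.0):
    """Repulsive potential of the kind used for obstacle avoidance: zero
    beyond the influence radius d0, and growing rapidly as the robot
    position p approaches the obstacle. A target placed next to an
    obstacle (e.g., inside a narrow passage) therefore sits in a region
    of strong repulsion that pushes trajectories away from it."""
    d = float(np.linalg.norm(np.asarray(p, float) - np.asarray(obstacle, float)))
    d = max(d, 1e-9)                 # avoid division by zero at contact
    if d >= d0:
        return 0.0
    return 0.5 * eta * (1.0 / d - 1.0 / d0) ** 2
```

Because the potential blows up near the obstacle boundary, the planner trades goal attraction against repulsion and can stall before reaching targets located between closely spaced obstacles.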
VI Conclusion
In this paper, we considered the problem of designing collision-free, dynamically feasible, and socially-aware trajectories for robots operating in environments populated by humans. We assumed that humans can provide bandit feedback indicating a complaint or no complaint on the part of the robot trajectory that caused them discomfort, without revealing any contextual information about their locations or the reasons for their complaints. Assuming that humans can move in the obstacle-free space and, as a result, human utility can change, we formulated this planning problem as an online optimization problem that minimizes the social value of the time-varying robot trajectory, defined by the total number of incurred human complaints. As the human utility is unknown, we employed zeroth-order optimization methods to solve this problem, which we combined with off-the-shelf motion planners to satisfy the dynamic feasibility and collision-free properties of the resulting trajectories. To the best of our knowledge, this is the first approach to socially-aware robot planning that is not restricted to avoiding collisions with humans but, instead, focuses on increasing the social value of the robot trajectories. Compared to data-hungry preference learning methods that can be used to solve similar but stationary socially-aware planning problems, our framework is more practical since it requires only a small number of iterations and, therefore, queries for human feedback, to return socially-aware robot paths.
References

[1] (2016) Social LSTM: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–971.
[2] (2012) Nonlinear Model Predictive Control. Vol. 26, Birkhäuser.
[3] (2014) Preference-based reinforcement learning: Evolutionary direct policy search using a preference-based racing algorithm. Machine Learning 97 (3), pp. 327–351.
[4] (2017) Socially aware motion planning with deep reinforcement learning. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1343–1350.
[5] (2017) Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pp. 4299–4307.
[6] (2015) Active reward learning with a novel acquisition function. Autonomous Robots 39 (3), pp. 389–405.
[7] (2013) Robot companion: A social-force based approach with human awareness-navigation in crowded environments. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1688–1694.
[8] (2012) Preference-based reinforcement learning: A formal framework and a policy iteration algorithm. Machine Learning 89 (1-2), pp. 123–156.
[9] (2011) Predictive control of autonomous ground vehicles with obstacle avoidance on slippery roads. In ASME 2010 Dynamic Systems and Control Conference, pp. 265–272.
[10] (2017) Stochastic online optimization. Single-point and multi-point nonlinear multi-armed bandits. Convex and strongly convex case. Automation and Remote Control 78 (2), pp. 224–234.
[11] (2017) Nonlinear model predictive control. In Nonlinear Model Predictive Control, pp. 45–69.
[12] (2015) Learning preferences for manipulation tasks from online coactive feedback. The International Journal of Robotics Research 34 (10), pp. 1296–1313.
[13] (2011) STOMP: Stochastic trajectory optimization for motion planning. In 2011 IEEE International Conference on Robotics and Automation, pp. 4569–4574.
[14] (2011) Sampling-based algorithms for optimal motion planning. The International Journal of Robotics Research 30 (7), pp. 846–894.
[15] (2015) Trajectory optimization with particle swarm optimization for manipulator motion planning. IEEE Transactions on Industrial Informatics 11 (3), pp. 620–631.
[16] (2016) Socially compliant mobile robot navigation via inverse reinforcement learning. The International Journal of Robotics Research 35 (11), pp. 1289–1307.
[17] (2013) Human-aware robot navigation: A survey. Robotics and Autonomous Systems 61 (12), pp. 1726–1743.
[18] (2012) Feature-based prediction of trajectories for socially compliant navigation. In Robotics: Science and Systems.
[19] (2007) Combined path-following and obstacle avoidance control of a wheeled robot. The International Journal of Robotics Research 26 (4), pp. 361–375.
[20] (2016) Low dimensional human preference tracking for motion optimization. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 2867–2872.
[21] (2015) Towards a data-driven approach to human preferences in motion planning. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 920–927.
[22] (2016) Online optimization in dynamic environments: Improved regret rates for strongly convex problems. In 2016 IEEE 55th Conference on Decision and Control (CDC), pp. 7195–7201.
[23] (2017) Random gradient-free minimization of convex functions. Foundations of Computational Mathematics 17 (2), pp. 527–566.
[24] (2016) Robust constrained learning-based NMPC enabling reliable mobile robot path tracking. The International Journal of Robotics Research 35 (13), pp. 1547–1563.
[25] (2018) Sample and feedback efficient hierarchical reinforcement learning from human preferences. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 596–601.
[26] (2015) Understanding the geometry of workspace obstacles in motion optimization. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 4202–4209.
[27] (2009) CHOMP: Gradient optimization techniques for efficient motion planning.
[28] (2000) Tutorial overview of model predictive control. IEEE Control Systems Magazine 20 (3), pp. 38–52.
[29] (2017) Active preference-based learning of reward functions. In Robotics: Science and Systems.
[30] (2013) Finding locally optimal, collision-free trajectories with sequential convex optimization. In Robotics: Science and Systems, Vol. 9, pp. 1–10.
[31] (2017) Distributed online optimization in dynamic environments using mirror descent. IEEE Transactions on Automatic Control 63 (3), pp. 714–725.
[32] (2017) A tutorial on Newton methods for constrained trajectory optimization and relations to SLAM, Gaussian process smoothing, optimal control, and probabilistic inference. In Geometric and Numerical Foundations of Movements, pp. 361–392.
[33] (2015) Robot navigation in dense human crowds: Statistical models and experimental studies of human–robot cooperation. The International Journal of Robotics Research 34 (3), pp. 335–356.
[34] (2017) Modeling cooperative navigation in dense human crowds. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1685–1692.
[35] (2013) Kinodynamic RRT*: Asymptotically optimal motion planning for robots with linear dynamics. In 2013 IEEE International Conference on Robotics and Automation, pp. 5054–5061.
[36] (2012) A Bayesian approach for policy learning from trajectory preference queries. In Advances in Neural Information Processing Systems, pp. 1133–1141.
[37] (2017) A survey of preference-based reinforcement learning methods. The Journal of Machine Learning Research 18 (1), pp. 4945–4990.
[38] (2009) Model-predictive active steering and obstacle avoidance for autonomous ground vehicles. Control Engineering Practice 17 (7), pp. 741–750.
[39] (2017) Improved dynamic regret for non-degenerate functions. In Advances in Neural Information Processing Systems, pp. 732–741.