APPLD: Adaptive Planner Parameter Learning from Demonstration

03/31/2020 ∙ by Xuesu Xiao, et al. ∙ The University of Texas at Austin

Existing autonomous robot navigation systems allow robots to move from one point to another in a collision-free manner. However, when facing new environments, these systems generally require re-tuning by expert roboticists with a good understanding of the inner workings of the navigation system. In contrast, even users who are unversed in the details of robot navigation algorithms can generate desirable navigation behavior in new environments via teleoperation. In this paper, we introduce APPLD, Adaptive Planner Parameter Learning from Demonstration, that allows existing navigation systems to be successfully applied to new complex environments, given only a human teleoperated demonstration of desirable navigation. APPLD is verified on two robots running different navigation systems in different environments. Experimental results show that APPLD can outperform navigation systems with the default and expert-tuned parameters, and even the human demonstrator themselves.


I Introduction

Designing autonomous robot navigation systems has been a topic of interest to the research community for decades. Indeed, several widely-used systems have been developed and deployed that allow a robot to move from one point to another [1, 2], often with verifiable guarantees that the robot will not collide with obstacles while moving.

However, while current navigation systems indeed allow robots to autonomously navigate in known environments, they often still require a great deal of tuning before they can be successfully deployed in new environments. For example, wide open spaces and densely populated areas may require completely different sets of parameters such as inflation radius, sampling rate, planner optimization coefficients, etc. Re-tuning these parameters requires an expert who has a good understanding of the inner workings of the navigation system. Even Zheng’s widely-used full-stack navigation tuning guide [3] asserts that fine-tuning such systems is not as simple as it looks for users who are “sophomoric” about the concepts and reasoning of the system. Moreover, tuning a single set of parameters assumes the same set will work well on average in different regions of a complex environment, which is often not the case.

In contrast, it is relatively easy for humans—even those with little to no knowledge of navigation systems—to generate desirable navigation behavior in new environments via teleoperation, e.g., by using a steering wheel or joystick. It is also intuitive for them to adapt their specific navigation strategy to different environmental characteristics, e.g., going fast in straight lines while slowing down for turns.

In this paper, we investigate methods for achieving autonomous robot navigation that are adaptive to complex environments without the need for a human with expert-level knowledge in robotics. In particular, we hypothesize that existing autonomous navigation systems can be successfully applied to complex environments given (1) access to a human teleoperated demonstration of competent navigation, and (2) an appropriate high-level control strategy that dynamically adjusts the existing system’s parameters.

Fig. 1: Overview of appld: the human demonstration is segmented into different contexts; for each context, a set of parameters is learned via behavior cloning. During deployment, the proper parameters are selected by an online context predictor.

To this end, we introduce a novel technique called Adaptive Planner Parameter Learning from Demonstration (appld) and hypothesize that it can outperform default or even expert-tuned navigation systems on multiple robots across a range of environments. Specifically, we evaluate it on two different robots, each in a different environment, and each using a different underlying navigation system. Provided with as little as a single teleoperated demonstration of the robot navigating competently in its environment, appld segments the demonstration into contexts based on sensor data and demonstrator behavior and uses machine learning both to find appropriate system parameters for each context and to recognize particular contexts from sensor data alone (Fig. 1). During deployment, appld provides a simple control scheme for autonomously recognizing context and dynamically switching the underlying navigation system's parameters accordingly. Experimental results confirm our hypothesis: appld can outperform the underlying system using default parameters and parameters tuned by human experts, and even the performance of the demonstrator.

II Related Work

This section summarizes related work on parameter tuning, machine learning for robot navigation, and task demonstration segmentation, also known as changepoint detection.

II-A Parameter Tuning

Broadly speaking, appld seeks to tune the high-level parameters of existing robot navigation systems. For this task, Zheng’s guide [3] describes the current common practice of manual parameter tuning, which involves robotics experts using intuition, experience, or trial-and-error to arrive at a reasonable set of parameters. As a result, some researchers have considered the problem of automated parameter tuning for navigation systems, e.g., dynamically finding trajectory optimization weights [4] for a dwa planner [1], or designing novel systems that can leverage gradient descent to match expert demonstrations [5]. While such approaches do successfully perform automatic navigation tuning, they are thus far tightly coupled to the specific system for which they are designed and typically require hand-engineered features. In contrast, the proposed automatic parameter tuning work is more broadly applicable: appld treats the navigation system as a black box, and it does not require hand-engineering of features.

II-B Machine Learning for Navigation

Researchers have also considered using machine learning more generally in robot navigation, i.e., beyond tuning the parameters of existing systems. One such approach is that of using inverse reinforcement learning to estimate costs over semantic terrain labels from human demonstrations [6], which can then be used to drive classical planning systems. Other work has taken a more end-to-end approach, performing navigation by learning functions that map directly from sensory inputs to robot actions [7, 8]. In particular, recent work in this space from Kahn et al. [9] used a neural network to directly assign costs to sampled action sequences using camera images. Because these types of approaches seek to replace more classical approaches to navigation, they also forgo the robustness, reliability, and generality of those systems. For example, Kahn et al. reported the possibility of catastrophic failure (e.g., flipping over) during training. In contrast, the work we present here builds upon traditional robot navigation approaches and uses machine learning to improve them only through parameter tuning, which preserves critical system properties such as safety.

II-C Temporal Segmentation of Demonstrations

appld leverages potentially lengthy human demonstrations of robotic navigation. In order to effectively process such demonstrations, it is necessary to first segment these demonstrations into smaller, cohesive components. This problem is referred to as changepoint detection [10], and several researchers concerned with processing task demonstrations have proposed their own solutions [11, 12]. Our work leverages these solutions in the context of learning from human demonstrations of navigation behavior. Moreover, unlike [12], we use the discovered segments to then train a robot for—and deploy it in—a target environment.

III Approach

To improve upon existing navigation systems, the problem considered here is that of determining a parameter-selection strategy that allows a robot to move fast and smoothly to its goal.

We approach this problem as one of learning from human demonstration. Namely, we assume that a human can provide a teleoperated demonstration of desirable navigation behavior in the deployment environment, and we seek to find a set of planner parameters that can provide a good approximation of this behavior. As we will show in Section IV, when faced with a complex environment, a human demonstrator naturally drives differently in different regions of the environment such that no single set of planner parameters can closely approximate the demonstration in all states. To overcome this problem, the human demonstration is divided into pieces that include consistent sensory observations and navigation commands. By segmenting the demonstration in this way, each piece—which we call a context—corresponds to a relatively cohesive navigation behavior. Therefore, it becomes more feasible to find a single set of planner parameters that imitates the demonstration well for each context.

III-A Problem Definition

We assume we are given a robot with an existing navigation planner $G : \mathcal{X} \times \Theta \rightarrow \mathcal{A}$. Here, $\mathcal{X}$ is the state space of the planner (e.g., current robot position, sensory inputs, navigation goal, etc.), $\Theta$ is the space of free parameters for $G$ (e.g., sampling density, maximum velocity, etc.), and $\mathcal{A}$ is the planner's action space (e.g., linear and angular velocity). Using $G$ and a particular set of parameters $\theta \in \Theta$, the robot performs navigation by repeatedly estimating its state $x$ and applying action $a = G(x; \theta)$. Importantly, we treat $G$ as a black box, e.g., we do not assume that it is differentiable, and we need not even understand what each component of $\theta$ does. In addition, a human demonstration of successful navigation is recorded as time series data $\mathcal{D} = \{(x^D_t, a^D_t)\}_{t=1}^{T}$, where $T$ is the length of the series, and $x^D_t$ and $a^D_t$ represent the robot state and demonstrated action at time $t$. Given $G$ and $\mathcal{D}$, the particular problem we consider is that of finding two functions: (1) a mapping $M : \mathcal{C} \rightarrow \Theta$ that determines planner parameters for a given context $c \in \mathcal{C}$, and (2) a parameterized context predictor $B_\phi : \mathcal{X} \rightarrow \mathcal{C}$ that predicts the context given the current state. Given $M$ and $B_\phi$, our system then performs navigation by selecting actions according to $a = G(x; M(B_\phi(x)))$. Note that since the formulation presented here involves only changing the parameters of $G$, the learned navigation strategy will still possess the benefits that many existing navigation systems can provide, such as assurance of safety.
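The deployment-time control loop described above can be sketched in a few lines. All names below (navigate_step, param_map, the toy planner and predictor) are illustrative stand-ins for the paper's abstractions, not the authors' implementation:

```python
def navigate_step(planner, param_map, context_predictor, state):
    """One control step of the learned strategy: a = G(x; M(B(x)))."""
    context = context_predictor(state)   # B: state -> context id
    params = param_map[context]          # M: context id -> parameter set
    return planner(state, params)        # G: (state, params) -> action (black box)

# Toy instantiation: two contexts with different speed caps.
param_map = {0: {"max_vel_x": 0.5}, 1: {"max_vel_x": 1.5}}
context_predictor = lambda x: 0 if x["clearance"] < 1.0 else 1
planner = lambda x, th: min(th["max_vel_x"], x["clearance"])  # stand-in for G

action = navigate_step(planner, param_map, context_predictor,
                       {"clearance": 2.0})
```

Note that the planner is only queried, never inspected, which is what allows the rest of the pipeline to treat it as a black box.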

III-B Demonstration Segmentation

Provided with a demonstration, the first step of the proposed approach is to segment the demonstration into pieces—each of which corresponds to a single context only—so that further learning can be applied for each specific context. This segmentation problem can be solved by changepoint detection methods [10]. Given $\mathcal{D}$, a changepoint detection algorithm $\mathcal{A}$ is applied to automatically detect how many changepoints exist and where those changepoints are within the demonstration. Denoting the number of changepoints found by $\mathcal{A}$ as $K-1$ and the changepoints as $\tau_1 < \tau_2 < \cdots < \tau_{K-1}$, with $\tau_0 = 0$ and $\tau_K = T$, the demonstration is then segmented into $K$ pieces $\{\mathcal{D}_k = \{(x^D_t, a^D_t)\}_{t=\tau_{k-1}+1}^{\tau_k}\}_{k=1}^{K}$.
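The changepoint detector used later in the paper (champ) is Bayesian and considerably more sophisticated; purely as an illustration of what segmentation does to a demonstration signal, the following sketch finds a single mean-shift changepoint by exhaustive search over split points (all names are illustrative):

```python
from statistics import mean

def best_changepoint(signal):
    """Return the split index minimizing total within-segment squared error,
    a crude single-changepoint stand-in for Bayesian changepoint detection."""
    def sse(seg):
        m = mean(seg)
        return sum((v - m) ** 2 for v in seg)
    best_tau, best_cost = None, float("inf")
    for tau in range(2, len(signal) - 1):
        cost = sse(signal[:tau]) + sse(signal[tau:])
        if cost < best_cost:
            best_tau, best_cost = tau, cost
    return best_tau

# Synthetic demonstration feature: a slow segment, then a fast segment.
demo = [0.3] * 50 + [1.4] * 50
tau = best_changepoint(demo)
segments = [demo[:tau], demo[tau:]]   # the two "contexts"
```

Real detectors must also decide how many changepoints exist, which is exactly what the Bayesian machinery in [10, 11] provides.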

III-C Parameter Learning

Following demonstration segmentation, we then seek to learn a suitable set of parameters $\theta_k^*$ for each segment $\mathcal{D}_k$. To find this $\theta_k^*$, we employ behavioral cloning (BC) [13], i.e., we seek to minimize the difference between the demonstrated actions and the actions that $G$ would produce on $\mathcal{D}_k$. More specifically,

$$\theta_k^* = \arg\min_{\theta} \sum_{(x^D, a^D) \in \mathcal{D}_k} \left\| a^D - G(x^D; \theta) \right\|_W \quad (1)$$

where $\|\cdot\|_W$ is the norm induced by a diagonal matrix $W$ with positive real entries, which is used for weighting each dimension of the action. A black-box optimization method is then applied to solve Equation 1. Having found each $\theta_k^*$, the mapping $M$ is simply $M(k) = \theta_k^*$.
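Because the planner is a black box, Equation 1 can be minimized with any derivative-free optimizer. The sketch below uses plain random search as a stand-in for CMA-ES, with a toy two-parameter "planner"; every name here is an illustrative assumption, not the paper's code:

```python
import random

random.seed(0)

def planner(x, theta):
    """Black-box stand-in for G(x; theta). The real G is a navigation
    stack queried at runtime; here, action = theta[0]*x + theta[1]."""
    return theta[0] * x + theta[1]

def bc_loss(theta, demo):
    """Equation 1 with W = identity: squared difference between the
    planner's actions and the demonstrated actions on one segment."""
    return sum((planner(x, theta) - a) ** 2 for x, a in demo)

def random_search(loss, bounds, iters=3000):
    """Minimal derivative-free optimizer (a crude stand-in for CMA-ES)."""
    best_theta, best = None, float("inf")
    for _ in range(iters):
        theta = [random.uniform(lo, hi) for lo, hi in bounds]
        l = loss(theta)
        if l < best:
            best_theta, best = theta, l
    return best_theta, best

# One demonstration segment whose demonstrator follows a = 2x + 0.5.
demo_segment = [(k / 19, 2.0 * (k / 19) + 0.5) for k in range(20)]
theta_k, loss_k = random_search(lambda th: bc_loss(th, demo_segment),
                                bounds=[(0.0, 3.0), (0.0, 3.0)])
```

Nothing in the loop differentiates through the planner, which is why the same recipe applies unchanged to dwa or e-band parameters.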

III-D Online Context Prediction

At this point, we have a library of learned parameters $\{\theta_k^*\}_{k=1}^{K}$ and the mapping $M$ that is responsible for mapping a specific context to its corresponding parameters. The only thing left is a scheme to dynamically infer which context the robot is in during execution. To do so, we form a supervised dataset $\mathcal{D}_c = \{(x^D_t, c_t)\}_{t=1}^{T}$, where $c_t = k$ if $\tau_{k-1} < t \leq \tau_k$. Then, a parameterized function $f_\phi$ is learned via supervised learning to classify which segment $x^D_t$ comes from, i.e.,

$$\phi^* = \arg\min_{\phi} \sum_{(x^D, c) \in \mathcal{D}_c} \ell\left(f_\phi(x^D), c\right) \quad (2)$$

where $\ell$ is a classification loss (e.g., cross-entropy). Given $f_{\phi^*}$, we define our context predictor $B$ according to

$$B(x_t) = \operatorname{mode}\left(f_{\phi^*}(x_t), f_{\phi^*}(x_{t-1}), \ldots, f_{\phi^*}(x_{t-q+1})\right) \quad (3)$$

In other words, $B$ acts as a mode filter on the context predicted by $f_{\phi^*}$ over a sliding window of length $q$.
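Equation 3 is simply a sliding-window mode filter over the classifier's raw per-step predictions, which suppresses spurious single-step context flips. A minimal sketch (function names are illustrative):

```python
from collections import Counter, deque

def mode_filter(window_len):
    """Return a stateful step function implementing Equation 3: the
    reported context is the mode of the last window_len raw predictions."""
    buf = deque(maxlen=window_len)
    def step(raw_context):
        buf.append(raw_context)
        return Counter(buf).most_common(1)[0][0]
    return step

predict = mode_filter(window_len=5)
raw = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]     # one-step blips at t=3 and t=6
smoothed = [predict(c) for c in raw]
```

The filter trades a short switching delay (at most a few steps) for stability, so the navigation stack is not reconfigured on every noisy prediction.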

1:  // Training
2:  Input: the demonstration $\mathcal{D}$, space of possible parameters $\Theta$, the navigation stack $G$, a changepoint detection algorithm $\mathcal{A}$, and a black-box optimizer.
3:  Call $\mathcal{A}$ on $\mathcal{D}$ to detect changepoints $\tau_1 < \cdots < \tau_{K-1}$, with $\tau_0 = 0$ and $\tau_K = T$.
4:  Segment $\mathcal{D}$ into $\{\mathcal{D}_k\}_{k=1}^{K}$.
5:  Train a classifier $f_\phi$ on $\mathcal{D}_c = \{(x^D_t, c_t)\}$, where $c_t = k$ if $\tau_{k-1} < t \leq \tau_k$.
6:  for $k = 1, \ldots, K$ do
7:     Call the black-box optimizer with the objective defined in Equation 1 on $\mathcal{D}_k$ to find parameters $\theta_k^*$ for context $k$.
8:  end for
9:  Form the map $M(k) = \theta_k^*$.
10:
11:  // Deployment
12:  Input: the navigation stack $G$, the mapping $M$ from context to parameters, and the context predictor $B$.
13:  for $t = 1, 2, \ldots$ do
14:     Identify the context $c_t = B(x_t)$ according to Equation 3.
15:     Navigate with $a_t = G(x_t; M(c_t))$.
16:  end for
Algorithm 1 appld

Taken together, the above steps constitute our proposed appld approach. During training, the three stages are applied sequentially to learn a library of parameters (and hence the mapping $M$) and a context predictor $B$. During execution, Equation 3 is applied online to pick the right set of parameters for navigation. Algorithm 1 summarizes the entire pipeline from offline training to online execution.

IV Experiments

Fig. 2: Jackal Trajectory in Environment Shown in Fig. 1: heatmap visualization of the LiDAR inputs over time is displayed at the top and used for segmentation by champ. For each region divided by champ changepoints, CMA-ES finds a set of parameters that best imitates the human demonstration. Velocity and angular velocity profiles from default (red), appld (no context) (orange), and appld (green) parameters, along with the human demonstration (black), are displayed with respect to time. Plots are scaled to best demonstrate performance differences between different parameters.

In this section, appld is implemented to experimentally validate our hypothesis that existing autonomous navigation systems can be successfully applied to complex environments given (1) access to a human demonstration from teleoperation, and (2) an appropriate high-level control strategy that dynamically adjusts the existing system’s parameters based on context. To perform this validation, appld is applied on two different robots—a Jackal and a BWIBot—that each operate in a different environment with different underlying navigation methods. The results of appld are compared with those obtained by the underlying navigation system using (a) its default parameters (default), and (b) parameters we found using behavior cloning but without context (appld (no context)). We also compare to the navigation system as tuned by robotics experts in the second experiment. In all cases, we find that appld outperforms the alternatives.

IV-A Jackal Maze Navigation

In the first experiment, a four-wheeled, differential-drive, unmanned ground vehicle—specifically a Clearpath Jackal—is tasked to move through a custom-built maze (Fig. 1). The Jackal is a small and agile platform with a top speed of 2.0m/s. To leverage this agility, the complex maze consists of four qualitatively different areas: (i) a pathway with curvy walls (curve), (ii) an obstacle field (obstacle), (iii) a narrow corridor (corridor), and (iv) an open space (open) (Fig. 1). A Velodyne LiDAR provides 3D point cloud data, which is transformed into a 2D laser scan for 2D navigation. The Jackal runs the Robot Operating System (ros) onboard, and appld is applied to the local planner, dwa [1], in the commonly-used move_base navigation stack.

Teleoperation commands are captured via an Xbox controller from one of the authors, who is unfamiliar with the inner workings of the dwa planner and attempts to operate the platform to quickly traverse the maze in a safe manner. The 52s demonstration is recorded using rosbag configured to record all joystick commands and all inputs to the move_base node.

For changepoint detection (Algorithm 1, line 3), we use champ, a state-of-the-art Bayesian segmentation algorithm [11], as $\mathcal{A}$. As input to champ, the recorded LiDAR range statistics (mean and standard deviation) from $\mathcal{D}$ and the recorded demonstrated actions are provided. champ outputs a sequence of changepoints that segment the demonstration into pieces, each with uniform context (line 4). As expected, champ determines four segments in the demonstration, each corresponding to a different context (line 5). The classifier $f_\phi$ trained for online context prediction (line 14) is modeled as a two-layer neural network with ReLU activation functions.

For the purpose of finding $\theta_k^*$ for each context, the recorded input $x^D_t$ is played to a ros move_base node using dwa as the local planner with query parameters $\theta$, and the resulting output navigation commands are compared to the demonstrator's actions. Ideally, the dwa output and the demonstrator commands would be aligned in time, but for practical reasons (e.g., computational delay), this is generally not the case—the output frequency of move_base is much lower than the frequency of recorded joystick commands. To address this discrepancy, we match each $a^D_t$ with the most recent queried output of $G$ within the past $\Delta t$ seconds (the default execution time per command, 0.25s for the Jackal), and use it as the augmented navigation command at time $t$. If no such output exists, the augmented navigation command is set to zero, since the default behavior of the Jackal is to halt if no command has been received in the past $\Delta t$ seconds (Fig. 3). For the metric in Equation 1, we use mean-squared error, i.e., $W$ is the identity matrix.
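The action-matching step can be sketched as follows; the timestamps, rates, and function names are illustrative assumptions, not the authors' code:

```python
def match_actions(demo, planner_outputs, dt=0.25):
    """For each demonstrated (timestamp, action) pair, pick the most recent
    planner output within the past dt seconds; if none exists, use zero,
    mirroring the robot's halt-on-timeout behavior. Times are in seconds."""
    matched = []
    for t, _ in demo:
        candidates = [(tp, ap) for tp, ap in planner_outputs if t - dt <= tp <= t]
        matched.append(max(candidates)[1] if candidates else 0.0)
    return matched

# Joystick sampled at 10 Hz, planner replying at roughly 2 Hz (toy numbers).
demo = [(0.1 * k, 1.0) for k in range(10)]
planner_outputs = [(0.0, 0.4), (0.5, 0.8)]
aug = match_actions(demo, planner_outputs)
```

The zero fill-in is important: it penalizes parameter sets under which the planner produces no command at all, rather than silently dropping those timesteps from the BC loss.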

Fig. 3: Action-Matching and Loss Metric

Following the action-matching procedure, we find each $\theta_k^*$ using cma-es [14] as our black-box optimizer (Algorithm 1, line 7). The optimization runs on a single Dell XPS laptop (Intel Core i9-9980HK) using 16 parallel threads. The elements of $\theta$ in our experiments are: dwa's max_vel_x (v), max_vel_theta (w), vx_samples (s), vtheta_samples (t), occdist_scale (o), pdist_scale (p), gdist_scale (g) and costmap's inflation_radius (i). The fully parallelizable optimization takes approximately eight hours, but this time could be significantly reduced with more computational resources.

The action profiles produced by the default, appld (no context), and appld parameters are plotted in Fig. 2, along with the single-shot demonstration segmented into four chunks by champ. Being trained separately on the segments discovered by champ, the appld parameters (green) perform most closely to the human demonstration (black), whereas the performance of both default (red) and appld (no context) (orange) differs significantly from the demonstration in most cases (Fig. 2).

           v     w     s    t    o     p     g     i
def.      0.50  1.57   6   20   0.10  0.75  1.00  0.30
no ctx.   1.55  0.98  10    3   0.01  0.87  0.99  0.46
Curve     0.80  0.73   6   42   0.04  0.98  0.94  0.19
Obstacle  0.71  0.91  16   53   0.55  0.54  0.91  0.39
Corridor  0.25  1.34   8   59   0.43  0.65  0.98  0.40
Open      1.59  0.89  18   18   0.40  0.46  0.27  0.42
TABLE I: Parameters of Jackal Experiments (dwa)

The specific parameter values learned by each technique are given in Tab. I, where we show in the bottom rows the individual parameters learned by appld for each context. At run time, appld’s trained context classifier selects in which mode the navigation stack is to operate and adjusts the parameters accordingly (Fig. 1).

Tab. II shows the results of evaluating the overall navigation system using the different parameter-selection schemes, along with the demonstrator's performance as a point of reference. We report both the time it takes for each system to navigate a pre-specified route and the BC loss (Equation 1) relative to the demonstrator.

Context           BC Loss         Real-world Time (s)
(a) Curve
Demonstration     N/A             12.10
default           0.1755±0.0212   30.20±3.87
app. (no ctx.)    0.1856±0.0030   *55.14±13.84
appld             0.0780±0.0002   9.39±0.73
(b) Obstacle Field
Demonstration     N/A             9.00
default           0.2061±0.0540   12.32±0.59
app. (no ctx.)    0.2537±0.0083   *60.00±0.00
appld             0.1586±0.0216   7.69±0.35
(c) Narrow Corridor
Demonstration     N/A             24.06
default           0.0953±0.0916   *49.52±19.88
app. (no ctx.)    0.0566±0.0419   *60.00±0.00
appld             0.0198±0.0010   19.11±0.82
(d) Open Space
Demonstration     N/A             7.28
default           0.8597±0.0013   15.07±0.61
app. (no ctx.)    0.2094±0.0013   15.08±7.42
appld             0.2071±0.0021   7.19±0.42
TABLE II: Loss and Time Comparison of Jackal Experiments (dwa)

For each metric, lower is better, and we compute the mean and standard deviation over 10 independent trials. For trials that end in failure (e.g., the robot gets stuck), we add an asterisk (*) to the reported results and use a penalty time value of 60s. The results show that, for every context, appld achieves the lowest BC loss and the fastest real-world traversal time compared to default and appld (no context). In fact, while appld navigates successfully in every trial, default fails in 8/10 trials in the narrow corridor due to collisions in recovery_behaviors after getting stuck, and appld (no context) fails in 9/10, 10/10, and 10/10 trials in curve, obstacle field, and narrow corridor, respectively. In open space, appld (no context) is able to navigate quickly at first, but cannot reach the goal precisely and quickly due to low angular sample density (vtheta_samples). Surprisingly, in all contexts, the navigation stack with appld parameters even outperforms the human demonstration in terms of time, and leads to qualitatively smoother motion than the demonstration. The average overall traversal time from start to goal, 43s, is also faster than the demonstrated 52s. The superior performance achieved by appld compared to default and even the demonstrator validates our hypothesis that, given access to a teleoperated demonstration, tuning dwa navigation parameters is possible without a roboticist. The fact that appld outperforms appld (no context) indicates the necessity of the high-level context-switching strategy.

IV-B BWIBot Hallway Navigation

Fig. 4: BWIBot Navigates in GDC Hallway

Whereas we designed the Jackal experiments to specifically test all aspects of appld, in this section, we evaluate appld’s generality to another robot in another environment running another underlying navigation system. Specifically, we evaluate our approach using a BWIBot (Fig. 4 left)—a custom-built robot that navigates the GDC building at The University of Texas at Austin every day as part of the Building Wide Intelligence (BWI) project [15]. The BWIBot is a nonholonomic platform built on top of a Segway RMP mobile base, and is equipped with a Hokuyo LiDAR. A Dell Inspiron computer performs all computation onboard. Similar to the Jackal, the BWIBot uses the ros architecture and the move_base navigation framework. However, unlike the Jackal, the BWIBot uses a local planner based on the elastic bands technique (e-band) [2] instead of dwa.

As in the Jackal experiments, teleoperation is performed using an Xbox controller by the same author who is unfamiliar with the inner workings of the e-band planner. The demonstration lasts 17s and consists of navigating the robot through a hallway, where the demonstrator seeks to move the robot in smooth, straight lines at a speed appropriate for an office environment. Unlike the Jackal demonstration, quick traversal is not the goal of the demonstration.

In this setting, the appld training procedure is identical to that described for the Jackal experiments. In this case, however, champ did not detect any changepoints based on the LiDAR inputs and demonstration (Fig. 4 right), indicating the hallway environment is relatively uniform and hence one set of parameters is sufficient.

The BC phase takes about two hours with 16 threads on the same laptop used for the Jackal experiments. The parameters learned for the e-band planner are max_vel_lin (v), max_vel_th (w), eband_internal_force_gain (i), eband_external_force_gain (e), and costmap_weight (c). The results are shown in Tab. III.

       v     w     i     e     c      loss
def.  0.75  1.0   1     2     10     0.1730±0.0025
exp.  0.5   0.5   3     2     15     0.0940±0.0095
app.  0.65  0.35  0.52  0.04  15.36  0.0669±0.0071
TABLE III: Parameters and Results of BWIBot Experiments (e-band)

The first row of Tab. III shows the parameters of the BWIBot planner used in the default system. Because champ does not discover more than a single context, appld and appld (no context) are equivalent for this experiment. Therefore, we instead compare to a set of expert-tuned (expert) parameters that is used on the robot during everyday deployment, shown in the second row of the table. These parameters took a group of roboticists several days to tune by trial-and-error to make the robot navigate in relatively straight lines. Finally, the parameters discovered by appld are shown in the third row. The last column of the table shows the BC loss induced by the default, expert, and appld parameters (again averaged over 10 runs). Real-world time is not reported since quick traversal is not the purpose of the demonstration in the indoor office space. The action profiles from these three sets of parameters (queried on the demonstration trajectory $\mathcal{D}$) are compared with the demonstration and plotted in Fig. 4 lower right, where the learned trajectories are the closest to the demonstration. When tested on the real robot, the appld parameters achieve qualitatively superior performance, despite the fact that the experts were also trying to make the robot navigate in a straight line (Fig. 4 left).

The BWIBot experiments further validate our hypothesis that parameter tuning for existing navigation systems is possible based on a teleoperated demonstration instead of expert roboticist effort. More importantly, the success on the e-band planner without any modifications from the methodology developed for dwa supports appld’s generality.

V Summary and Future Work

This paper presents appld, a novel learning from demonstration framework that can autonomously learn suitable planner parameters and adaptively switch them during execution in complex environments. The first contribution of this work is to grant non-roboticists the ability to tune navigation parameters in new environments by simply providing a single teleoperated demonstration. Secondly, this work allows mobile robots to utilize existing navigation systems, but adapt them to different contexts in complex environments by adjusting their navigation parameters on the fly. appld is validated on two robots in different environments with different navigation algorithms. We observe superior performance of appld’s parameters compared with all tested alternatives, both on the Jackal and the BWIBot. An interesting direction for future work is to investigate methods for speeding up learning by clustering similar contexts together. It may also be possible to perform parameter learning and changepoint detection jointly for better performance.

Acknowledgement

This work has taken place in the Learning Agents Research Group (LARG) at UT Austin. LARG research is supported in part by NSF (CPS-1739964, IIS-1724157, NRI-1925082), ONR (N00014-18-2243), FLI (RFP2-000), ARL, DARPA, Lockheed Martin, GM, and Bosch. Peter Stone serves as the Executive Director of Sony AI America and receives financial compensation for this work. The terms of this arrangement have been reviewed and approved by the University of Texas at Austin in accordance with its policy on objectivity in research.

References

  • [1] D. Fox, W. Burgard, and S. Thrun, “The dynamic window approach to collision avoidance,” IEEE Robotics & Automation Magazine, vol. 4, no. 1, pp. 23–33, 1997.
  • [2] S. Quinlan and O. Khatib, “Elastic bands: Connecting path planning and control,” in [1993] Proceedings IEEE International Conference on Robotics and Automation.   IEEE, 1993, pp. 802–807.
  • [3] K. Zheng, “Ros navigation tuning guide,” arXiv preprint arXiv:1706.09068, 2017.
  • [4] D. Teso-Fz-Betoño, E. Zulueta, U. Fernandez-Gamiz, A. Saenz-Aguirre, and R. Martinez, “Predictive dynamic window approach development with artificial neural fuzzy inference improvement,” Electronics, vol. 8, no. 9, p. 935, 2019.
  • [5] M. Bhardwaj, B. Boots, and M. Mukadam, “Differentiable gaussian process motion planning,” arXiv preprint arXiv:1907.09591, 2019.
  • [6] M. Wigness, J. G. Rogers, and L. E. Navarro-Serment, “Robot navigation from human demonstration: Learning control behaviors,” in 2018 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2018, pp. 1150–1157.
  • [7] M. Pfeiffer, M. Schaeuble, J. Nieto, R. Siegwart, and C. Cadena, “From perception to decision: A data-driven approach to end-to-end motion planning for autonomous ground robots,” in IEEE International Conference on Robotics and Automation.   IEEE, 2017.
  • [8] S. Siva, M. Wigness, J. Rogers, and H. Zhang, “Robot adaptation to unstructured terrains by joint representation and apprenticeship learning,” in Robotics: Science and Systems (RSS), 2019.
  • [9] G. Kahn, P. Abbeel, and S. Levine, “Badgr: An autonomous self-supervised learning-based navigation system,” arXiv preprint arXiv:2002.05700, 2020.
  • [10] S. Aminikhanghahi and D. J. Cook, “A survey of methods for time series change point detection,” Knowledge and information systems, vol. 51, no. 2, pp. 339–367, 2017.
  • [11] S. Niekum, S. Osentoski, C. G. Atkeson, and A. G. Barto, “Online bayesian changepoint detection for articulated motion models,” in 2015 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2015, pp. 1468–1475.
  • [12] F. Meier, E. Theodorou, and S. Schaal, “Movement segmentation and recognition for imitation learning,” in Artificial Intelligence and Statistics, 2012, pp. 761–769.
  • [13] D. A. Pomerleau, “Alvinn: An autonomous land vehicle in a neural network,” in Advances in neural information processing systems, 1989, pp. 305–313.
  • [14] N. Hansen, S. D. Müller, and P. Koumoutsakos, “Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (cma-es),” Evolutionary computation, vol. 11, no. 1, pp. 1–18, 2003.
  • [15] P. Khandelwal, S. Zhang, J. Sinapov, M. Leonetti, J. Thomason, F. Yang, I. Gori, M. Svetlik, P. Khante, V. Lifschitz et al., “Bwibots: A platform for bridging the gap between ai and human–robot interaction research,” The International Journal of Robotics Research, vol. 36, no. 5-7, pp. 635–659, 2017.