I Introduction
Safe motion planning is essential for legged systems, which must avoid falling and colliding with obstacles. The main challenge is to design safety verification tools that accurately evaluate, without being overly conservative, whether a system will satisfy safety constraints while a given feedback controller stabilizes it along desired trajectories.
In this letter, we propose a framework that learns a safety assessment function that can provide probabilistic verification for motion planning. Our framework trains this function using trajectory data. We roll out a number of trajectories using a nominal model and embed them, together with their safety properties, into a low-dimensional space in which we define their safety probabilities. During the execution phase, upcoming desired trajectories are mapped to this low-dimensional space, and the safety probability is estimated before execution. Note that since the safety probability is computed based on the nominal model, there is a reality gap. To reduce this gap, we perform an online adaptation process as we collect trajectories during execution.
Related Work: Recent work on robust motion planning has considered safety verification methods that characterize funnels around planned trajectories. The authors in [1] employed a linear feedback controller and estimated regions of attraction of the closed-loop system by searching for Lyapunov functions, and [2, 3] demonstrated robust motion planning on aerial robots. Similar recent approaches, based on Hamilton-Jacobi reachability analysis [4] and contraction theory [5], proposed an offline characterization of tracking error bounds around trajectories. However, these techniques are computationally intensive and limited to a small class of systems, which makes them difficult to deploy on legged robots, which are generally modeled as high-dimensional hybrid systems with sophisticated feedback controllers.
Model predictive control (MPC) has been shown to be a promising tool for dynamic constrained trajectory optimization. In particular, tube-based MPC uses a simple ancillary feedback controller to bound output trajectories around a nominal path and verifies safety satisfaction for all realizations of uncertainties [6, 7]. The authors in [8] applied this technique to bipedal walking assuming a linear pendulum model and a simple controller. However, computing invariant tubes for highly nonlinear and hybrid systems with sophisticated feedback controllers is challenging. The work in [9] proposed to learn distributions of output trajectories in a data-driven manner, which can then be used for safety verification, but data efficiency and the sim-to-real gap have not been addressed for robot deployment.
The studies in [10, 11] considered a Bayesian optimization technique that evaluates planned trajectories executed with a closed-loop controller and uses them to find planner parameters. The authors in [12, 13] trained policies using closed-loop systems to generate swing foot trajectories for walking motion. These frameworks make it possible to optimize planner parameters and to design trajectories such that the resulting closed-loop behaviors satisfy safety constraints. However, these verification methods evaluate trajectory safety only at the planning phase, making it difficult to detect unsafe states arising during execution, for instance, due to unexpected disturbances.
The idea of embedding system safety information into a low-dimensional space is not new and was previously presented in [14]. In that work, the authors proposed a framework that learns a low-dimensional representation of regions of attraction of a closed-loop autonomous system. In our work, we extend this idea and learn a safety assessment function for a closed-loop trajectory tracking system. For closed-loop autonomous systems, the initial states alone determine the evolution of the systems and, therefore, their safety characteristics. In contrast, closed-loop trajectory tracking systems have external inputs (e.g., desired trajectories), which affect the evolution of the system and thus require special safety treatment. For instance, we have to properly measure which specific pieces of a desired trajectory could result in future failure. To this end, we re-evaluate the computation methods described in [14] and extend them to safety verification for executing planned trajectories, while preserving their algorithmic benefits.
Contributions: Our key contributions are the following:


We propose a framework that learns a safety assessment function that evaluates whether desired trajectories are safe before and during execution. In particular, we investigate a data structure, data generation pipeline, and safety-related properties needed for training.

Our framework incorporates numerous algorithmic advantages, in particular:


It does not require an analytic expression of the closedloop system to train a safety assessment function, which allows us to reason about safety for complicated systems.

It is data-efficient and is able to address the sim-to-real gap, which is crucial for real system implementation.

Our safety assessment function can provide safety predictions for trajectories both when generating robust plans and during execution, where it detects unexpected external disturbances.


We deploy our framework in a quadruped balancing task and a humanoid reaching task and show that it opens up a number of interesting possibilities for algorithm development. In the quadruped balancing task, we integrate a backup recovery step planner that is triggered based on safety predictions, and in the humanoid reaching task, we provide a robot self-assessment capability to estimate the likelihood of safe task completion for human-robot interaction.
II Problem Statement
Consider a discretized system given by
(1) 
where , , are the system state, input, and disturbances.
is the output vector that can be measured from the system state (e.g., end-effector positions in task space). We further assume a planner that computes a desired trajectory
, where represents a planning horizon, and denotes a desired output. Given a tracking controller , the closed-loop system dynamics are denoted as
(2) 
Then, the solution trajectory of the closedloop system can be recursively computed from the starting state and the upcoming desired trajectory with the expression
(3) 
As illustrated in Fig. 1, our goal is to make a receding horizon prediction about the safety of the closed-loop system with the current state measurement and upcoming desired trajectory. To be more specific, at the current time index , we want to predict the probability of all future states being safe,
(4) 
using the information of and . is the user-specified safe set that could be defined with a tracking error or conservative capture region to avoid falling. Note that is the safety assessment horizon during which we look ahead and can be different from the planning horizon . is a task-dependent parameter and is chosen to contain primarily safety information. For a cyclic walking task, for example, does not need to be the trajectory duration for multiple steps, but rather just for one stepping cycle. For convenience, we concatenate the state measurement and upcoming desired trajectory and define a safety assessment input:
(5) 
Using this nomenclature, our goal is to define a safety assessment function that predicts the safety probability (4) of a closed-loop system.
We consider a scenario where the real dynamical system is not perfectly known, but we assume the nominal system is available and can be simulated over time. Since the dynamics of legged systems are nonlinear, high-dimensional, and hybrid, and the controllers are often formulated as numerical optimization problems, we do not have access to analytic expressions of the closed-loop solution trajectories of either the nominal or the real system. Therefore, we propose to learn the safety assessment function in a data-driven manner. Throughout the paper, we use a tilde, , and an overline, , to represent variables related to the nominal system and the real system, respectively.
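The recursion in (3) and the safety event in (4) can be illustrated with a minimal Python sketch. The scalar dynamics, proportional controller, and interval safe set below are hypothetical stand-ins chosen only to make the example runnable; they are not the models used in this letter, and disturbances are omitted, as in a nominal rollout.

```python
from typing import Callable, List, Sequence, Tuple


def rollout_is_safe(
    x0: float,
    desired: Sequence[float],
    dynamics: Callable[[float, float], float],
    controller: Callable[[float, float], float],
    is_safe: Callable[[float], bool],
) -> Tuple[bool, List[float]]:
    """Recursively roll out the closed loop and check every visited state."""
    states = [x0]
    for y_des in desired:
        u = controller(states[-1], y_des)       # tracking controller
        states.append(dynamics(states[-1], u))  # one closed-loop step
    return all(is_safe(x) for x in states), states


# Hypothetical toy system (assumption, for illustration only).
def toy_dynamics(x: float, u: float) -> float:
    return x + 0.1 * u


def toy_controller(x: float, y_des: float) -> float:
    return 2.0 * (y_des - x)


def toy_safe_set(x: float) -> bool:
    return abs(x) <= 1.0


# The safety assessment input pairs the start state with the desired trajectory.
safe_near, _ = rollout_is_safe(0.0, [0.5] * 10, toy_dynamics, toy_controller, toy_safe_set)
safe_far, _ = rollout_is_safe(0.0, [3.0] * 10, toy_dynamics, toy_controller, toy_safe_set)
print(safe_near, safe_far)
```

Repeating such rollouts over many sampled inputs is what produces the labeled data from which a safety probability can be learned.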
III Framework Overview
Our framework aims to find a low-dimensional embedding of safety assessment inputs where the low-dimensional space can be discretized into a finite number of grid cells. Then, we assign each cell a belief mass using belief function theory [15] to evaluate the safety probability of the inputs. The assignment of belief masses is denoted as basic belief assignment (BBA) and the BBA for the grid index is expressed as . Here, is the belief mass of the probability of the closed-loop system being safe when it evolves with safety assessment inputs that are mapped to and belong to the grid index . is the belief mass of the complementary event and is the uncertainty on the safety estimation. Note that it holds , and , , and are in the interval . After the BBAs for the grid cells are computed, we define a safety assessment function , where the safety assessment input is embedded in the grid cell .
To compute BBAs for grid cells, we first simulate a sufficient number of trajectories using a nominal model. We collect safety assessment inputs from the trajectories and label them according to whether they yield safe behaviors. For each pair of safety assessment inputs, we evaluate a distance metric to measure their similarity in terms of safety. For instance, the distance between a pair is small if they share a similar safety property (e.g., if they are both safe or both unsafe) and large otherwise. Using the computed distances, we embed the safety assessment inputs into a low-dimensional space using the t-distributed stochastic neighbor embedding (t-SNE) technique [16]. As a result, we obtain two clusters separated in a low-dimensional space: one is the collection of safety assessment inputs that result in safe behaviors and the other is the collection of safety assessment inputs that yield unsafe behaviors. Then, we discretize the low-dimensional space into grid cells and make a prior estimate of BBA for each cell with the expression .
Simulating the nominal system is usually a cheap and efficient way to initialize the low-dimensional representation of the trajectories and the safety assessment function, but it is inaccurate. Therefore, an online adaptation process follows to reduce the gap between the real and the nominal system and update the safety assessment function. As we collect trajectory data from the real system, we compare it with the behavior of the nominal closed-loop system and train a discrepancy function that reveals how reliable the training data from the nominal system was. Using the discrepancy function, we update the prior estimates of BBAs in the grid cells. At the same time, we compute a feedback estimate of the BBA for each cell using the real system’s trajectory data, which is defined as . Finally, we combine the prior and the feedback estimates of BBAs and update the safety assessment function. The overall framework including offline initialization and online adaptation is illustrated in Fig. 2.
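To make the data flow of this overview concrete, the following Python sketch stores one BBA per grid cell and evaluates an input by embedding it and looking up its cell. The names `BBA`, `locate_cell`, and `assess`, the two-dimensional grid, and the toy embedding are assumptions introduced for illustration; the learned mapping and the actual grid parameters are described in the following sections.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple


@dataclass(frozen=True)
class BBA:
    """Belief masses for 'safe', 'unsafe', and the remaining uncertainty."""
    safe: float
    unsafe: float
    uncertain: float

    def __post_init__(self) -> None:
        masses = (self.safe, self.unsafe, self.uncertain)
        if any(m < 0.0 or m > 1.0 for m in masses):
            raise ValueError("each belief mass must lie in [0, 1]")
        if not math.isclose(sum(masses), 1.0, abs_tol=1e-9):
            raise ValueError("belief masses must sum to one")


# A cell without enough data carries no safety information.
EMPTY_BBA = BBA(0.0, 0.0, 1.0)

Cell = Tuple[int, int]


def locate_cell(z: Sequence[float], origin: Sequence[float], cell_length: float) -> Cell:
    """Index of the grid cell that contains the embedded point z."""
    return (
        math.floor((z[0] - origin[0]) / cell_length),
        math.floor((z[1] - origin[1]) / cell_length),
    )


def assess(
    s: Sequence[float],
    embed: Callable[[Sequence[float]], Sequence[float]],
    cells: Dict[Cell, BBA],
    origin: Sequence[float],
    cell_length: float,
) -> BBA:
    """Embed a safety assessment input and return the BBA of its grid cell."""
    return cells.get(locate_cell(embed(s), origin, cell_length), EMPTY_BBA)


# Hypothetical usage: the embedding simply keeps the first two entries.
cells = {(1, 2): BBA(0.7, 0.2, 0.1)}
embed = lambda s: (s[0], s[1])
print(assess((0.35, 0.55, 0.0), embed, cells, origin=(0.0, 0.0), cell_length=0.25))
```

Keeping the uncertainty as an explicit third mass lets a caller distinguish "predicted unsafe" from "no data in this cell".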
IV Offline Initialization of Safety Assessment Function
IV-A Data Generation and Low-dimensional Embedding
As illustrated in Fig. 3, a planner designs a desired trajectory () using a randomly sampled planner parameter. Employing a feedback tracking controller, we simulate a nominal closed-loop system and roll out a trajectory (). We determine the trajectory to be safe if all of its states are contained in the safe region. We terminate the episode when the system reaches an unsafe region and determine the trajectory to be unsafe. We split the simulated trajectories into segments spanning a duration of , the safety assessment horizon, and create a training data set with each segment’s initial state, desired trajectory, and unsafety score. The collection of training data is denoted as , where
(6) 
and is the number of training data, corresponding to the number of trajectory segments. and represent the starting state and the desired trajectory of the th trajectory segment (note that we reset the time index to zero at the beginning of each segment), forming the th safety assessment input. is the unsafety score and is computed by the following rule:
(7) 
where is a discount factor and is a function that takes a segment index and returns the remaining time steps from the beginning of the segment to the termination of the episode to which the segment belongs. Note that the tilde conveys that the unsafety score is evaluated using the simulated trajectory from the nominal closed-loop system. The unsafety score represents how much the segment contributes to the system’s unsafe behavior. Because of the discount factor, segments that are closer to the episode termination receive higher scores.
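As a rough illustration of the scoring rule described above, the sketch below assigns each segment of a failed episode the discount factor raised to the number of steps remaining until termination, and assigns zero to segments of a safe episode. This specific form, the function name `unsafety_scores`, and the numeric values are assumptions that merely match the verbal description; Eq. (7) remains the authoritative definition.

```python
from typing import List, Sequence


def unsafety_scores(
    segment_starts: Sequence[int],
    termination_step: int,
    ended_unsafe: bool,
    gamma: float,
) -> List[float]:
    """Score each segment by its discounted distance to an unsafe termination.

    Assumed form: gamma ** (steps remaining until termination) for an
    episode that ended unsafe, and 0.0 for an episode that stayed safe.
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    scores = []
    for start in segment_starts:
        remaining = termination_step - start
        if remaining < 0:
            raise ValueError("segment starts after the episode terminated")
        scores.append(gamma ** remaining if ended_unsafe else 0.0)
    return scores


# Hypothetical episode of 30 steps split into segments starting every 10 steps.
failed_episode = unsafety_scores([0, 10, 20], termination_step=30, ended_unsafe=True, gamma=0.9)
safe_episode = unsafety_scores([0, 10, 20], termination_step=30, ended_unsafe=False, gamma=0.9)
print(failed_episode, safe_episode)
```

Under this assumed form, the score grows monotonically as a segment gets closer to the failure, which is the property the text relies on.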
For each pair of training data, we measure their similarity based on their error and safety properties. First, we measure the dynamic time warping for the error signals between the th and th training data using the formula , where is the trajectory error and is the dynamic time warping operator. While a dynamic time warping measurement might reflect similarity of the safety property in general, it is still possible that safe and unsafe segments share similar trajectories. To obtain more accurate similarity measures in terms of safety, we propose a distance metric considering the dynamic time warping measurements and unsafety scores at the same time as
(8) 
where denotes the maximum value among the dynamic time warping measurements and is a weighting constant multiplying the unsafety score difference. As a result, the trajectory segments which show similar error sequences and are alike in terms of safety are considered to be close.
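The construction of the pairwise distance can be sketched in Python as follows. The dynamic time warping routine is the standard dynamic-programming recursion on scalar error sequences; the way the two terms are combined (normalized warping cost plus a weighted absolute difference of unsafety scores) is an assumption consistent with the description of Eq. (8), and the names `dtw`, `safety_distance_matrix`, and `alpha` are introduced only for this example.

```python
from typing import List, Sequence


def dtw(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic dynamic time warping cost with absolute-difference local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            table[i][j] = cost + min(table[i - 1][j], table[i][j - 1], table[i - 1][j - 1])
    return table[n][m]


def safety_distance_matrix(
    errors: Sequence[Sequence[float]],
    scores: Sequence[float],
    alpha: float,
) -> List[List[float]]:
    """Assumed form: DTW / max(DTW) + alpha * |score_i - score_j|."""
    n = len(errors)
    raw = [[dtw(errors[i], errors[j]) for j in range(n)] for i in range(n)]
    dtw_max = max(max(row) for row in raw) or 1.0  # avoid dividing by zero
    return [
        [raw[i][j] / dtw_max + alpha * abs(scores[i] - scores[j]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical data: segments 0 and 1 share the same error signal but differ in safety.
errors = [[0.0, 0.1, 0.2], [0.0, 0.1, 0.2], [0.0, 0.5, 1.0]]
scores = [0.0, 0.9, 0.9]
distances = safety_distance_matrix(errors, scores, alpha=1.0)
print(distances[0][1], distances[1][2])
```

In this toy data, the first two segments have identical error signals, so only the score term separates them, which is the case the combined metric is meant to handle. The resulting symmetric matrix is the kind of input a precomputed-distance embedding method would consume.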
Using this computed distance, we apply t-SNE on the training data to obtain a realization of the low-dimensional space . Based upon this embedding, we train a mapping function
, using a deep neural network by minimizing the cost function
, where is the low-dimensional embedding of the th training data, . The neural network is trained to reproduce the low-dimensional embedding constructed by t-SNE.
IV-B Prior Estimate of BBAs on Grid Cells
We discretize the low-dimensional space into grid cells and compute a prior estimate of BBA for each cell as illustrated in Fig. 3. For convenience, we define a locating function which takes a safety assessment input and returns an index of a grid cell in which the input is embedded in the low-dimensional space. First, we define the belief assignment for each embedded training data point, , based on its unsafety score by introducing the expression , where
(9) 
Here, is the belief mass of the probability of the closed-loop system’s behavior being safe when it starts at the state with the upcoming desired trajectory and is the belief mass of its complementary event. represents the confidence level on the nominal system model and is set to a user-specified parameter, .
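One simple way to turn an unsafety score into a belief assignment is sketched below in Python. It reserves a fixed uncertainty mass for the imperfect nominal model and splits the remainder between "safe" and "unsafe" in proportion to the score. This split, the function name `belief_from_score`, and the example values are assumptions for illustration and may differ from the exact expressions in Eq. (9).

```python
from typing import Tuple


def belief_from_score(score: float, uncertainty: float) -> Tuple[float, float, float]:
    """Return (safe, unsafe, uncertain) masses for one training sample.

    Assumed form: the uncertainty mass is fixed by the user, and the
    remaining mass is shared according to the unsafety score in [0, 1].
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("unsafety score must lie in [0, 1]")
    if not 0.0 <= uncertainty <= 1.0:
        raise ValueError("uncertainty must lie in [0, 1]")
    confidence = 1.0 - uncertainty
    return (confidence * (1.0 - score), confidence * score, uncertainty)


# Hypothetical samples: one far from failure, one right before failure.
print(belief_from_score(0.0, 0.3))
print(belief_from_score(0.9, 0.3))
```

Setting the uncertainty to zero in this sketch recovers a fully trusted sample whose masses are determined by the score alone.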
We take the belief assignments on the training data into account and further designate a belief assignment for each grid cell. Let us define, for each index , a set of BBAs , which contains the BBAs for grid cell . Then, the prior estimate of the BBA for the grid cell can be computed as
(10) 
where is the number of BBAs in , is the minimum number of data for the estimate. When there is not sufficient training data in the grid cell (i.e., ), we estimate by an empty BBA , which indicates that no safety estimate can be made. is a fusion operator among the set , which is borrowed from [14] as
(11) 
Finally, the safety assessment function is initialized with the prior estimate of the BBAs for grid cells.
V Online Adaptation of Safety Assessment Function
V-A Discrepancy Function
Although the prior estimate of the BBA provides a rough safety prediction, we update the safety assessment function online as we collect trajectory data from the real system, as depicted in Fig. 4. When we roll out a trajectory using the real system, we also simulate a trajectory using the nominal closed-loop system with the same initial state and the same desired trajectory. With the trajectories from the real and nominal systems, we construct a collection of feedback data with sets, where
(12) 
Similar to the training data, and represent the starting state and the desired trajectory of the th trajectory segment with the reordered time index. and are the unsafety scores of the th segment of the trajectories of the real and the nominal system, respectively, computed by Eq. (7). If there is a discrepancy in terms of safety between the nominal and the real system due to the reality gap, can be different from .
Now, we define a discrepancy function that quantifies the level of reality gap. We approximate this function with a Gaussian process regression (GPR) model, which is trained with the input set and the output set .
With the trained GPR model, we predict the reliability of the training data and update the prior estimate of BBA
. Let us denote the predicted mean and standard deviation of
by and . Based on the level of reality gap predicted by the trained GPR model, we update the belief assignment on the training data with the new uncertainty
(13) 
where is a user-specified parameter set to be smaller than . As more feedback data is collected and the standard deviation on the prediction goes below a certain threshold (i.e., ), we update the uncertainty of the belief assignment using the mean prediction . With the new , we update the belief mass, and , by following Eq. (9). Finally, we improve the prior estimate of BBAs for grid cells with Eq. (10) to take the reality gap into account.
V-B Feedback Estimate of BBAs on Grid Cells
We update the feedback estimate of BBAs on grid cells using . We, again, first compute the belief assignment for each embedded feedback data with the expression , where , , and . Note that is set to have zero uncertainty since it comes from the real system. With this, we compute the feedback estimate of BBA for the grid index as
(14) 
where contains the BBAs in grid , and is the number of BBAs in the set . If no feedback data is collected yet for the index (i.e., ), we set the estimate to an empty BBA. is another fusion operator among the set and is defined as
(15) 
Here, parameters and are the initial value and the decay rate of the uncertainty , respectively, and the uncertainty converges to zero as the number of data goes to infinity (i.e., ). and are computed with the average operator.
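The behavior of the feedback estimate can be illustrated with the Python sketch below: per-sample masses are averaged within a cell, and a cell-level uncertainty shrinks as more real-system data arrives. The exponential decay law, the rescaling of the averaged masses by one minus the uncertainty, and the names `feedback_estimate`, `u0`, and `decay_rate` are assumptions made so the example is runnable; Eq. (15) defines the actual operator.

```python
import math
from typing import Sequence, Tuple


def feedback_estimate(
    real_scores: Sequence[float],
    u0: float,
    decay_rate: float,
) -> Tuple[float, float, float]:
    """Return (safe, unsafe, uncertain) masses for one grid cell.

    Assumptions: each real sample contributes masses (1 - score, score, 0),
    and the cell uncertainty is u0 * exp(-decay_rate * n) for n samples.
    """
    n = len(real_scores)
    if n == 0:
        return (0.0, 0.0, 1.0)  # empty BBA: no feedback collected yet
    uncertainty = u0 * math.exp(-decay_rate * n)
    mean_unsafe = sum(real_scores) / n  # average operator
    mean_safe = 1.0 - mean_unsafe
    scale = 1.0 - uncertainty           # keep the three masses summing to one
    return (scale * mean_safe, scale * mean_unsafe, uncertainty)


# Hypothetical cell that receives more and more real-system samples.
few = feedback_estimate([0.1, 0.2], u0=0.5, decay_rate=0.3)
many = feedback_estimate([0.1, 0.2] * 10, u0=0.5, decay_rate=0.3)
print(few, many)
```

With this decay law the uncertainty mass vanishes as the sample count grows, so the cell estimate is eventually dominated by the averaged real-system evidence.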
Finally, we combine and and compute the BBA for each index vector as
(16) 
If the feedback estimate for the grid index is available, we fuse the prior and feedback estimates of BBAs through the fusion operator in Eq. (11); otherwise, we just use the prior estimate. It has been shown that the approaches as the number of feedback data, , approaches infinity [14]. This means that the prior estimate matters most when there is insufficient data from the real system, and has less effect on the safety estimates as feedback data accumulates. We finally update the safety assessment function as . For computational efficiency, the online adaptation process is performed once every sets of feedback data are obtained, where the value of is a task-dependent parameter.
VI Experimental Results
In this study, we consider two different scenarios: a quadruped balancing task and a humanoid reaching task. We then address the following questions: Does the offline initialization phase find a proper low-dimensional representation of trajectory data and compute ? Does the online adaptation phase incorporate feedback data and properly address the sim-to-real gap? Can the safety assessment function make a receding horizon prediction so that it can evaluate trajectories’ safety both at the planning phase and at the execution phase? How does our safety assessment function compare to baseline verification tools, and how accurate are its predictions? How can our framework be integrated with a backup planner or controller to prevent unsafe behaviors?
VI-A Laikago Balancing
We consider a balancing task using the Laikago quadruped from Unitree Robotics. The robot’s state consists of its floating base and joint configurations, and the output vector
is the base position. At every episode, the robot is initialized with a randomly sampled state and our planner generates an interpolated trajectory between the initial and desired base position. Then, our feedback controller computes joint position commands by solving inverse kinematics to follow the trajectory. For this task, we define the safe set
to be the supporting polygon and a specified height range. Thus, we check that the projection of the base onto the ground remains inside this safe region and that the base height remains within its corresponding bounds. We consider random disturbances while balancing and aim to make a receding horizon safety prediction on the motions using the safety assessment function. If a strong disturbance causing the closed-loop system to become unsafe is properly detected by the safety prediction module, we initiate a recovery step plan [17] to avoid falling. Table I summarizes the parameters used in the safety assessment function training.
We simulate episodes with the nominal closed-loop system and segment the data to construct the training data . (Footnote 1: We intentionally create a reality gap by reducing the link mass by and removing the joint frictions and observation noises to simulate the nominal system. We also add a random offset to the initial state to simulate the disturbances.) We measure the distance between the training data and use it to embed the data into a two-dimensional space (i.e., ) that is discretized into a by square grid with a cell length of . The low-dimensional embedding of the training data and the prior estimate of BBAs for grid cells are illustrated in Fig. 5.
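The safe-set test described for this task (ground projection of the base inside the supporting polygon, base height within bounds) can be sketched in Python as below. The foot positions, height range, and function names are hypothetical values chosen for illustration, and the polygon is assumed to be convex with vertices listed counter-clockwise.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def inside_convex_polygon(p: Point, vertices: Sequence[Point]) -> bool:
    """True if p lies inside or on a convex polygon with counter-clockwise vertices."""
    n = len(vertices)
    for i in range(n):
        ax, ay = vertices[i]
        bx, by = vertices[(i + 1) % n]
        # Cross product sign tells on which side of edge (a, b) the point lies.
        cross = (bx - ax) * (p[1] - ay) - (by - ay) * (p[0] - ax)
        if cross < 0.0:
            return False
    return True


def base_is_safe(
    base_xyz: Tuple[float, float, float],
    support_polygon: Sequence[Point],
    height_range: Tuple[float, float],
) -> bool:
    """Check the projected base position and the base height against the safe set."""
    x, y, z = base_xyz
    height_ok = height_range[0] <= z <= height_range[1]
    return height_ok and inside_convex_polygon((x, y), support_polygon)


# Hypothetical stance: four feet forming a rectangle, listed counter-clockwise.
feet = [(0.2, 0.15), (-0.2, 0.15), (-0.2, -0.15), (0.2, -0.15)]
print(base_is_safe((0.0, 0.0, 0.4), feet, (0.3, 0.5)))
print(base_is_safe((0.3, 0.0, 0.4), feet, (0.3, 0.5)))
```

A check of this kind is applied to every state of a rollout to decide whether the episode is labeled safe or terminated as unsafe.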
The online adaptation process is performed once every feedback data are collected from the real system (i.e., ). We train the discrepancy function with the GPR model and update for each grid cell. For instance, the grid cell highlighted with the pink circle in Fig. 5 was originally assigned of safety probability in the offline initialization phase but is updated to after the first update iteration because the feedback data shows a large sim-to-real gap. This makes the discrepancy prediction around the pink-circle region high, which results in an increase in the uncertainty and a decrease in the safety probability . At the same time, we update and fuse it with to adapt the safety assessment function.
After the safety assessment module converges, we show that our framework can make a receding horizon safety prediction on the balancing trajectories and trigger the recovery step when it is needed to avoid falling. Fig. 6 shows snapshots of Laikago balancing and taking a recovery step. The robot is perturbed with balls in simulation: one which generates a small disturbance (Fig. 6(b)) and another one which generates a large disturbance (Fig. 6(d)). The robot stabilizes and tracks the desired trajectory until the safety assessment function predicts future unsafety. When it predicts a safety probability below the threshold , set to , it triggers the recovery step planner to avoid falling.
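The triggering logic described above amounts to a threshold test applied at every control step. A minimal Python sketch follows; the function name `first_unsafe_prediction`, the toy probability model, and the numeric threshold are assumptions for illustration and do not reflect the values used in the experiments.

```python
from typing import Callable, Optional, Sequence


def first_unsafe_prediction(
    inputs: Sequence[float],
    safety_probability: Callable[[float], float],
    threshold: float,
) -> Optional[int]:
    """Return the first step whose predicted safety probability is below the threshold.

    A return value of None means the recovery planner is never triggered.
    """
    for step, assessment_input in enumerate(inputs):
        if safety_probability(assessment_input) < threshold:
            return step
    return None


# Hypothetical stand-in for the learned safety assessment function:
# the predicted probability drops as the base offset grows.
def toy_probability(offset: float) -> float:
    return max(0.0, 1.0 - 5.0 * abs(offset))


small_push = [0.0, 0.01, 0.02, 0.01]
large_push = [0.0, 0.02, 0.05, 0.15, 0.30]
print(first_unsafe_prediction(small_push, toy_probability, threshold=0.5))
print(first_unsafe_prediction(large_push, toy_probability, threshold=0.5))
```

In a controller loop, the returned step would be the moment at which tracking is abandoned and the recovery step planner takes over.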
VI-B Atlas Reaching
We consider an object reaching task using the Boston Dynamics humanoid Atlas. The robot’s state consists of its floating base and joint configurations, and the output vector consists of the reaching hand position. At every episode, the robot is initialized with a randomly sampled state and the planner generates an interpolated trajectory between the initial and the target hand position. Our feedback controller computes joint torque commands by using an optimization-based whole-body controller [18]. We define the safe set such that the projected base position is inside the supporting polygon, the end-effectors do not collide with the obstacles, and the joint positions remain within their limits. We train the safety assessment function for the hand reaching trajectories and use it to predict whether the robot can reach the commanded target safely. (Footnote 2: When we roll out trajectories using the nominal system, we do not sample an offset and do not add it to the initial state since we do not consider disturbances here.) This training is done only for one arm since the same mapping function can be used for both left and right arms. The parameters used in the training are identical to the ones used in the Laikago balancing task except for the prediction horizon, which is .
Table I: Parameters used in the safety assessment function training.
0.99  10  0.01  0.3  5  0.3  0.1  0.4  0.3 
When a human end user commands a humanoid, it is not trivial to evaluate whether the command is safe to execute. We demonstrate that our safety assessment function enables a robot to estimate the likelihood that it will accomplish the given task safely. Fig. 7(a) illustrates a scenario where Atlas is told to reach the blue box on the bookshelf. After ensuring this task can be accomplished safely, the robot executes the command. Fig. 7(b) illustrates the scenario where the robot is initially told to reach the red can. Based on the safety prediction, the robot rejects the task so that the human instructor can provide a different description to accomplish the task.
In Fig. 8(a), we compare the reachable regions on the bookshelf computed by our safety assessment function against those obtained by a simple inverse-kinematics-based reachability method. Our safety assessment function considers joint limit violations, collisions, and falling down while manipulating to be unsafe, and it results in more conservative reachable regions than those considering only kinematic constraints. Fig. 8(b) summarizes the evaluation of the prediction accuracy of our safety assessment function. Among episodes with randomly sampled target positions, the safety assessment function predicts of safe targets to be safe and of unsafe targets to be unsafe.
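The two accuracy figures reported above are class-wise rates: the fraction of truly safe targets predicted safe and the fraction of truly unsafe targets predicted unsafe. A short Python sketch of this bookkeeping is given below; the function name and the example labels are hypothetical and are not the experimental data.

```python
from typing import Sequence, Tuple


def class_wise_accuracy(
    truly_safe: Sequence[bool],
    predicted_safe: Sequence[bool],
) -> Tuple[float, float]:
    """Return (rate of safe targets predicted safe, rate of unsafe targets predicted unsafe)."""
    if len(truly_safe) != len(predicted_safe):
        raise ValueError("label sequences must have the same length")
    safe_total = sum(1 for t in truly_safe if t)
    unsafe_total = len(truly_safe) - safe_total
    if safe_total == 0 or unsafe_total == 0:
        raise ValueError("both classes must be present")
    safe_hits = sum(1 for t, p in zip(truly_safe, predicted_safe) if t and p)
    unsafe_hits = sum(1 for t, p in zip(truly_safe, predicted_safe) if not t and not p)
    return safe_hits / safe_total, unsafe_hits / unsafe_total


# Hypothetical outcomes for five reaching episodes.
truth = [True, True, True, False, False]
prediction = [True, True, False, False, True]
print(class_wise_accuracy(truth, prediction))
```

Reporting the two rates separately matters here because missing an unsafe target is more costly than rejecting a safe one.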
VII Conclusions
In this letter, we propose a probabilistic safety verification tool for legged systems when desired motions are given. We leverage a low-dimensional embedding of the current state measurement and upcoming desired trajectories based on the proposed distance metric for safety prediction. For data efficiency, we initialize our safety assessment function by simulating trajectories with a nominal system and perform online adaptation using trajectories from the real system to account for the reality gap. We have demonstrated our framework’s efficiency and accuracy with a quadruped balancing task and a humanoid reaching task.
As future work, we would like to integrate our safety verification tool in hierarchical reinforcement learning frameworks such as [19] and train a high-level motion policy with a safety consideration. We would also like to deploy our safety verification tool in a human-robot interaction scenario such as [20] and provide self-assessment capabilities to our new Draco humanoid, a successor of the Draco biped [21].
Acknowledgment
The authors would like to thank the members of the Human Centered Robotics Laboratory at The University of Texas at Austin for their great help and support.
References
 [1] R. Tedrake, I. R. Manchester, M. Tobenkin, and J. W. Roberts, “LQR-trees: Feedback motion planning via sums-of-squares verification,” The International Journal of Robotics Research, vol. 29, no. 8, pp. 1038–1052, 2010. [Online]. Available: https://doi.org/10.1177/0278364910369189
 [2] A. Majumdar and R. Tedrake, “Funnel libraries for real-time robust feedback motion planning,” The International Journal of Robotics Research, vol. 36, no. 8, pp. 947–982, 2017. [Online]. Available: https://doi.org/10.1177/0278364917712421
 [3] Z. Manchester and S. Kuindersma, “Robust direct trajectory optimization using approximate invariant funnels,” Autonomous Robots, vol. 43, no. 2, pp. 375–387, 2019. [Online]. Available: https://doi.org/10.1007/s10514-018-9779-5
 [4] M. Chen, S. L. Herbert, H. Hu, Y. Pu, J. F. Fisac, S. Bansal, S. Han, and C. J. Tomlin, “FaSTrack: A modular framework for real-time motion planning and guaranteed safe tracking,” IEEE Transactions on Automatic Control, vol. 66, no. 12, pp. 5861–5876, 2021.
 [5] S. Singh, H. Tsukamoto, B. T. Lopez, S.-J. Chung, and J.-J. Slotine, “Safe motion planning with tubes and contraction metrics,” in 2021 60th IEEE Conference on Decision and Control (CDC), Dec 2021, pp. 2943–2948.
 [6] W. Langson, I. Chryssochoos, S. Raković, and D. Mayne, “Robust model predictive control using tubes,” Automatica, vol. 40, no. 1, pp. 125–133, 2004. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0005109803002838
 [7] T. Koller, F. Berkenkamp, M. Turchetta, and A. Krause, “Learning-based model predictive control for safe exploration,” in 2018 IEEE Conference on Decision and Control (CDC), Dec 2018, pp. 6059–6066.
 [8] A. Gazar, M. Khadiv, A. D. Prete, and L. Righetti, “Stochastic and robust MPC for bipedal locomotion: A comparative study on robustness and performance,” in 2020 IEEE-RAS 20th International Conference on Humanoid Robots (Humanoids), July 2021, pp. 61–68.
 [9] D. Fan, A. Agha, and E. Theodorou, “Deep Learning Tubes for Tube MPC,” in Proceedings of Robotics: Science and Systems, Corvallis, Oregon, USA, July 2020.
 [10] A. Rai, R. Antonova, S. Song, W. Martin, H. Geyer, and C. Atkeson, “Bayesian optimization using domain knowledge on the ATRIAS biped,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 1771–1778.
 [11] M. H. Yeganegi, M. Khadiv, A. D. Prete, S. A. A. Moosavian, and L. Righetti, “Robust walking based on MPC with viability guarantees,” IEEE Transactions on Robotics, pp. 1–16, 2021.
 [12] A. Iscen, K. Caluwaerts, J. Tan, T. Zhang, E. Coumans, V. Sindhwani, and V. Vanhoucke, “Policies modulating trajectory generators,” in Proceedings of The 2nd Conference on Robot Learning, ser. Proceedings of Machine Learning Research, A. Billard, A. Dragan, J. Peters, and J. Morimoto, Eds., vol. 87. PMLR, 29–31 Oct 2018, pp. 916–926. [Online]. Available: https://proceedings.mlr.press/v87/iscen18a.html
 [13] J. Ahn, J. Lee, and L. Sentis, “Data-efficient and safe learning for humanoid locomotion aided by a dynamic balancing model,” IEEE Robotics and Automation Letters, vol. 5, no. 3, pp. 4376–4383, July 2020.
 [14] Z. Zhou, O. S. Oguz, M. Leibold, and M. Buss, “Learning a low-dimensional representation of a safe region for safe reinforcement learning on dynamical systems,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–15, 2021.
 [15] G. Shafer, A Mathematical Theory of Evidence. Princeton: Princeton University Press, 1976.
 [16] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. 86, pp. 2579–2605, 2008. [Online]. Available: http://jmlr.org/papers/v9/vandermaaten08a.html
 [17] M. H. Raibert, Legged Robots That Balance. USA: Massachusetts Institute of Technology, 1986.
 [18] J. Ahn, S. J. Jorgensen, S. H. Bang, and L. Sentis, “Versatile locomotion planning and control for humanoid robots,” Frontiers in Robotics and AI, vol. 8, 2021. [Online]. Available: https://www.frontiersin.org/article/10.3389/frobt.2021.712239
 [19] R. S. Sutton, D. Precup, and S. Singh, “Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning,” Artif. Intell., vol. 112, no. 1–2, pp. 181–211, Aug 1999. [Online]. Available: https://doi.org/10.1016/S0004-3702(99)00052-1
 [20] T. Frasca, E. Krause, R. Thielstrom, and M. Scheutz, “‘Can you do this?’ Self-assessment dialogues with autonomous robots before, during, and after a mission,” 2020.
 [21] J. Ahn, D. Kim, S. Bang, N. Paine, and L. Sentis, “Control of a high performance bipedal robot using viscoelastic liquid cooled actuators,” in 2019 IEEE-RAS 19th International Conference on Humanoid Robots (Humanoids), 2019, pp. 146–153.