
Concurrent Policy Blending and System Identification for Generalized Assistive Control

by   Luke Bhan, et al.
Vanderbilt University

In this work, we address the problem of solving complex collaborative robotic tasks subject to multiple varying parameters. Our approach combines simultaneous policy blending with system identification to create generalized policies that are robust to changes in system parameters. We employ a blending network whose state space relies solely on parameter estimates from a system identification technique. As a result, this blending network learns how to handle parameter changes instead of trying to learn how to solve the task for a generalized parameter set simultaneously. We demonstrate our scheme's ability on a collaborative robot and human itching task in which the human has motor impairments. We then showcase our approach's efficiency with a variety of system identification techniques when compared to standard domain randomization.



I Introduction

Over the last few years, there has been significant interest in developing models that are trained in simulation and then transferred to the real world [5], [17], [16]. Despite progress in learning policies from simulation, these methods still suffer from long simulation times, as they require large amounts of experience to handle the unknown environments present in the real world. Additionally, these policies struggle to generalize to complex situations and may become unpredictable when faced with new challenges. Researchers have therefore approached training these policies in two distinct ways. The first approach uses techniques such as Kalman filters to identify system parameters that inform policies on how to respond in different environmental conditions [6]. However, these policies struggle to generalize and often require re-tuning [6]. The second approach randomizes sets of system parameters during training so that the policy learns to robustly handle a wide variety of situations. This approach, domain randomization [13], requires the actual parameters to lie within the set of randomized values; as such, the robustness of the policy is directly tied to the range of the randomized parameters. Moreover, the resulting policies are overly conservative in action selection, especially when the parameter set is large. Finally, for complex tasks with many parameters, creating large ranges across multiple parameter values requires significant training time before a robust policy can be generated [8].

Both of the above-mentioned approaches require a single policy that 1) solves the task at hand for a single set of parameters and 2) generalizes the solution to a wide set of parameters. Such policies struggle to generalize and fail for the parameter spaces of complex tasks. As an alternative, we decouple generalization across system parameters from the learning of policies that solve the task efficiently. We do this by using blending techniques informed by system identification methods. With our approach, we train individual policies that are each efficient for a distinct subset of the parameter space and use a blending network to identify how best to combine the actions of these individual policies based on the estimated parameters of the system. We verify our approach on a collaborative robot and human itching task in which the human has different combinations of motor impairments. Our proposed approach assumes that the space of each parameter is bounded, so that any combination of parameter values results in a convex space. This guarantees convergence of the blending network given a set of individual policies learned to control the system under single parameter variability.

Our paper makes the following primary contributions:

  • We decouple the process of learning a single task and generalizing to a large set of system parameter combinations using a blending network technique;

  • We design an architecture, which integrates a blending network that accurately handles the generalization of its sub-models to system parameters; and

  • We implement our scheme on a simulated collaborative human and robot itching task with various environment parameters to demonstrate its effectiveness across different system identification methods.

II Related Work

There have been many approaches to policy learning based on estimation of simulation parameters; however, to the authors' knowledge, none has combined system identification with a blending approach. For example, [7] demonstrates simultaneous system identification and policy training, exploring a series of predictive error methods to minimize the difference between the observed parameters and the estimated parameters for model predictive control (MPC). Additionally, domain randomization has been used for motorized robotic control tasks [12]. However, these tasks consider neither a collaborative environment nor multiple faults introduced by the interacting agents. Furthermore, [11] solves a challenging Rubik's cube control task with automatic domain randomization that slowly increases the difficulty of the task, but this can take significant time as it does not integrate any real-world sampling. Lastly, [9] considers an adaptive domain randomization strategy in which the framework attempts to identify domains that create challenging environments for the policy. This approach is similar to ours, except that it is purely data-driven, based on the results of the policy, which requires significant training time due to potential sample inefficiency. In contrast, we use prior domain knowledge to identify environments that have a high potential of being challenging for the policy.

In addition to the large body of research on efficient sim2real transfer, a series of recent works demonstrates the effectiveness of policy blending. [10] demonstrates the use of policy blending for simple tasks such as opening a cap, flipping a breaker, and turning a dial. However, their policy learns directly from sensor measurements and does not consider impairments in the agents. Furthermore, [2] presents a policy blending technique between a human and a robot policy for robot-assisted control that accurately assists the human with tasks such as fetching a water bottle. However, this work does not consider training models using modern deep reinforcement learning (DRL) techniques. Given these approaches, it is worthwhile to combine robust policy blending with modern system identification as a new approach to generalized modelling.

III Background

In this section, we first formalize the reinforcement learning problem for tasks with multiple varying parameters. Then, we outline three approaches presented in the literature to solve such tasks.

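III-A Robust Markov Decision Process

We model tasks with multiple varying parameters as a robust Markov decision process (R-MDP) defined by a tuple $\langle \mathcal{S}, \mathcal{A}, r, \gamma, \mathcal{P} \rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ represents a reward function, $\gamma \in [0, 1)$ is called a discount factor, and $\mathcal{P}$ is an uncertainty probability set, where $\mathcal{P} \subseteq \mathcal{M}(\mathcal{S})$ represents a family of probability measures over next states $s' \in \mathcal{S}$. The next state of the system is contingent on the conditional measure $p(\cdot \mid s, a) \in \mathcal{P}$, where $s \in \mathcal{S}$ is the current state and $a \in \mathcal{A}$ is the action selected by an agent.

In this work, we adopt the assumption that $\mathcal{P}$ is structured as a Cartesian product $\mathcal{P} = \bigotimes_{s \in \mathcal{S},\, a \in \mathcal{A}} \mathcal{P}_{s,a}$, also known as the state-action rectangularity assumption [19]. This implies that nature can choose the worst-case transition independently for each state and action. Moreover, we assume that the uncertainty probability set is defined by the parameter space $\Theta$ that characterizes the task, such that each measure in $\mathcal{P}$ corresponds to a parameter vector $\theta \in \Theta$.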

In a standard MDP framework, a policy $\pi$ maps states to distributions over actions with the goal of maximizing the sum of the discounted rewards over the future. The optimization criterion is the following:

$$\pi^{*} = \arg\max_{\pi} V^{\pi}(s), \quad \forall s \in \mathcal{S}, \tag{1}$$

where the value function, $V^{\pi}$, defines the value of being in any given state:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\middle|\, s_0 = s \right]. \tag{2}$$

Traditional RL algorithms require that the system dynamics and reward function do not change over time in order to find an optimal deterministic Markovian policy satisfying (1). This property is clearly not satisfied in the R-MDP case. Therefore, we propose an approach that decouples learning how the system dynamics change with the system parameters from learning how to optimally solve the task.
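As a small illustration of the discounted sum of rewards that the value function takes an expectation over, the sketch below computes the return of one sampled reward sequence. The function name `discounted_return` and the example rewards are hypothetical and do not come from the original work.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma**t * rewards[t] for a single sampled trajectory."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # Three unit rewards with gamma = 0.5 (illustrative numbers only).
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

Averaging this quantity over many trajectories that start from the same state gives a Monte Carlo estimate of that state's value under the policy that generated them.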

III-B Domain Randomization

To effectively identify the uncertainty set in simulation, a set of parameters must be defined to model the environment. Domain randomization samples a set of $N$ parameters, which we denote $\theta = (\theta_1, \dots, \theta_N)$, for which a reasonable range of potential values is constructed, usually from domain-specific knowledge [17]. In this paper, we consider domain randomization of the uniform type, in which the parameters are sampled uniformly within a feasible range. For example, the weakness of a certain human joint can be sampled uniformly between 0 and 1, where 0 denotes no mobility and 1 denotes a joint at full strength.
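A minimal sketch of uniform domain randomization, assuming one independent range per parameter. Only the [0, 1] joint-strength range comes from the text; the dictionary layout and the names `PARAM_RANGES` and `sample_parameters` are hypothetical.

```python
import random
from typing import Dict, Tuple

# Feasible range per parameter. Only the joint-strength range is taken from
# the text; further parameters would be added from domain knowledge.
PARAM_RANGES: Dict[str, Tuple[float, float]] = {
    "joint_strength": (0.0, 1.0),  # 0 = no mobility, 1 = full strength
}


def sample_parameters(ranges: Dict[str, Tuple[float, float]],
                      rng: random.Random) -> Dict[str, float]:
    """Draw every parameter independently and uniformly within its range."""
    return {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}


if __name__ == "__main__":
    rng = random.Random(0)
    # One draw per training episode; the simulator would be reset with it.
    print(sample_parameters(PARAM_RANGES, rng))
```

A new draw would typically be made at the start of each training episode, so the policy sees the whole range over the course of training.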

III-C System Identification Via Parameter Estimation

System identification through parameter estimation is a well-studied subject in which an estimator continually receives samples from a real-world environment and aggregates these samples into an estimate of the true value. In this work, we use the Unscented Kalman Filter (UKF) as our estimator [18] and assume that our real-world parameters can be measured with some confidence, but that the measurements may be corrupted by noise. Although this is a strong assumption, it can be relaxed for systems with non-measurable parameters by using approximate models of the system together with general estimation methods such as particle filtering; see, for example, [15].
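To make the idea of estimating a parameter from noisy measurements concrete, the sketch below uses a plain scalar Kalman filter on a constant parameter. This is deliberately simpler than the UKF used in the paper: with a linear, one-dimensional measurement model the unscented transform is unnecessary. The class name, the noise levels, and the synthetic measurements are all illustrative assumptions.

```python
import random
from dataclasses import dataclass


@dataclass
class ScalarKalmanFilter:
    """Kalman filter for one (nearly) constant parameter observed with noise."""
    estimate: float         # current parameter estimate
    variance: float         # uncertainty of the estimate
    process_var: float      # how much the true parameter may drift per step
    measurement_var: float  # variance of the measurement noise

    def update(self, measurement: float) -> float:
        # Predict: the parameter is modeled as constant, so only uncertainty grows.
        self.variance += self.process_var
        # Correct: weight the new measurement by the Kalman gain.
        gain = self.variance / (self.variance + self.measurement_var)
        self.estimate += gain * (measurement - self.estimate)
        self.variance *= 1.0 - gain
        return self.estimate


def estimate_constant(true_value: float, noise_std: float,
                      n_samples: int, seed: int = 0) -> float:
    """Filter n_samples synthetic noisy readings of a constant parameter."""
    rng = random.Random(seed)
    kf = ScalarKalmanFilter(estimate=0.5, variance=1.0,
                            process_var=1e-8, measurement_var=noise_std ** 2)
    for _ in range(n_samples):
        kf.update(true_value + rng.gauss(0.0, noise_std))
    return kf.estimate


if __name__ == "__main__":
    # Hypothetical joint-strength value observed with noise.
    print(estimate_constant(true_value=0.66, noise_std=0.1, n_samples=500))
```

For nonlinear measurement or dynamics models, the predict and correct steps would be replaced by their unscented counterparts, but the estimate-then-refine structure stays the same.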

III-D Autotuned Search Parameter Model (SPM)

When the environment's parameters can be neither measured nor estimated through physics-based estimation, we propose to use a technique that estimates the parameters by interacting with the environment as an agent. Recently, Du et al. formulated a new approach to system identification in which they define a data-driven model that learns a map from observation-action-parameter estimate sequences to a probability distribution over whether the parameter estimates are greater than, less than, or equal to the true parameters [3]. The mapping works like a binary classifier that is trained concurrently with the policy, so that the estimates slowly converge to the real-world parameters by learning from the policy's own interaction trajectories. Following this iterative search, we can perform a level of system identification that does not completely rely on domain knowledge for our experiments.
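The iterative search can be sketched as follows, assuming a classifier that returns the probabilities that the current estimate is higher than, lower than, or equal to the true value. In the real method this classifier is learned from trajectories and never sees the true parameter; the `make_toy_classifier` oracle here is a hypothetical stand-in used only to make the loop runnable, and the step-halving rule is likewise an illustrative choice rather than the procedure of [3].

```python
from typing import Callable, Tuple

# (p_higher, p_lower, p_equal) for a given parameter estimate.
Classifier = Callable[[float], Tuple[float, float, float]]


def make_toy_classifier(true_value: float) -> Classifier:
    """Hypothetical oracle standing in for a learned search parameter model."""
    def classify(estimate: float) -> Tuple[float, float, float]:
        if estimate > true_value:
            return (1.0, 0.0, 0.0)
        if estimate < true_value:
            return (0.0, 1.0, 0.0)
        return (0.0, 0.0, 1.0)
    return classify


def run_search(classifier: Classifier, initial: float,
               step: float = 0.05, iterations: int = 50) -> float:
    """Nudge the estimate toward the side the classifier indicates."""
    estimate = initial
    previous_direction = 0.0
    for _ in range(iterations):
        p_higher, p_lower, _p_equal = classifier(estimate)
        direction = p_lower - p_higher  # positive means "increase the estimate"
        if direction * previous_direction < 0:
            step *= 0.5  # the estimate overshot, so refine the step size
        estimate += step * direction
        if direction != 0:
            previous_direction = direction
    return estimate


if __name__ == "__main__":
    print(run_search(make_toy_classifier(0.66), initial=0.1))
```

With a learned classifier the probabilities would be soft, so the update naturally becomes smaller as the classifier grows unsure which side the estimate lies on.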

IV Approach

In this section, we present the details of our scheme for solving collaborative tasks with multiple varying parameters. Figure 1 shows the different components of the proposed approach, instantiated for the case study presented in the next section. The scheme requires a set of individual policies, each learned for an individual parameter change. Each individual policy can be trained for a single parameter distribution, and the total number of policies scales linearly with the total number of parameters. Each policy can be trained through domain randomization techniques. The core components of the proposed scheme, the blending network and the system identification technique, are described next.

Fig. 1: Policy blending with system identification for a human-robot collaborative task

IV-A Blending Network Learning

To create a decoupled policy that can solve individual tasks while remaining robust to a variety of system parameters, we introduce a blending network whose state space consists solely of the N system parameters. This policy then only needs to output the weights of its sub-policies at each time step to generate the action for the environment. Each sub-policy of this model is trained on a single set of constant system parameters, for which a unique environment is identified through prior domain knowledge. We then define the action of the blending network as $a_t = \sum_{i=1}^{N} w_i\, a_t^{i}$, where $w_i$ is the weight that the blending network assigns to the $i$-th sub-policy, $N$ is the number of sub-policies, and $a_t^{i}$ represents the action taken by the $i$-th sub-policy given a state at time $t$.
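The blending step can be sketched as a weighted sum of the sub-policy actions. The softmax normalization of the network's raw outputs is an assumption made here so that the weights are non-negative and sum to one; the source only states that the network outputs weights. The function names and example numbers are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Map raw network outputs to non-negative weights that sum to one."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def blend_actions(weights: Sequence[float],
                  sub_actions: Sequence[Sequence[float]]) -> List[float]:
    """Combine one action vector per sub-policy into a single action."""
    if len(weights) != len(sub_actions):
        raise ValueError("need exactly one weight per sub-policy")
    dim = len(sub_actions[0])
    if any(len(action) != dim for action in sub_actions):
        raise ValueError("all sub-policy actions must have the same dimension")
    return [sum(w * action[j] for w, action in zip(weights, sub_actions))
            for j in range(dim)]


if __name__ == "__main__":
    # Three sub-policies proposing 2-D actions (illustrative numbers only).
    raw_outputs = [0.2, 1.5, -0.3]
    proposals = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    print(blend_actions(softmax(raw_outputs), proposals))
```

Because the weights are recomputed at every time step, the blend can shift between sub-policies as the parameter estimates change.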

IV-B Concurrent System Identification

We then combine the blending network with a concurrent system identification scheme to obtain a generalized policy that is robust to different environmental changes, i.e., different combinations of parameter values. To do this, we let the state space of the blending network consist of only the estimated parameters; the network must therefore learn to associate certain parameters with the appropriate sub-policies. In practice, after every fixed number of training steps, we use our system identification method to update the state space of the blending network with a more accurate set of estimated system parameters and continue training. We emphasize that our approach is independent of the system identification method chosen and thus can be tailored based on the available domain knowledge of the environment.
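The interleaving of training and identification can be sketched as the loop below. The callables `identify` and `train_step` are hypothetical placeholders for the chosen system identification method and for one update of the blending network; the refresh interval is a free choice, not a value from the paper.

```python
from typing import Callable, List, Sequence, Tuple


def train_with_concurrent_sysid(
    total_steps: int,
    update_every: int,
    identify: Callable[[], Sequence[float]],
    train_step: Callable[[Sequence[float]], None],
    initial_estimate: Sequence[float],
) -> Tuple[List[float], int]:
    """Train the blending network while periodically refreshing its state.

    Returns the final parameter estimate and the number of refreshes.
    """
    estimate = list(initial_estimate)
    refreshes = 0
    for step in range(total_steps):
        # Every `update_every` steps, replace the estimate with a fresh one.
        if step > 0 and step % update_every == 0:
            estimate = list(identify())
            refreshes += 1
        # The blending network observes only the current parameter estimate.
        train_step(estimate)
    return estimate, refreshes


if __name__ == "__main__":
    seen = []
    final, count = train_with_concurrent_sysid(
        total_steps=10,
        update_every=3,
        identify=lambda: [0.66, 0.75],   # stand-in estimator output
        train_step=seen.append,          # stand-in for a network update
        initial_estimate=[0.5, 0.5],
    )
    print(final, count, len(seen))
```

Keeping the estimator behind a plain callable reflects the point that the scheme does not depend on which identification method is plugged in.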

V Experiments

V-A Assistive Gym

For our experiments, we use Assistive Gym, a framework introduced by [4], which is an OpenAI Gym environment for collaborative human and robot interaction. Assistive Gym is a realistic physics environment powered by PyBullet that enforces realistic human joints and provides a series of robots for collaborative tasks. We use the Jaco robot for our control task, as it performed best in the individual itching policies explored in [4], and define the success value of a task as the amount of force applied to the target itch position throughout the entire episode. Each episode consists of 200 time steps, equating to 20 seconds of real-world time.

V-B Training of Sub-Policies

To demonstrate our model, we solve a collaborative itching task in Assistive Gym [4] in which a robot assists an impaired human in itching. We consider three impairments for the human, similar to [1]:

  1. Involuntary Movement: The first impairment is involuntary movement, which is modeled by adding normally distributed noise to the joint actions of the human. For this policy, the noise for each joint in the arm is sampled from a normal distribution with a mean of 0 degrees and a standard deviation of 5 degrees.

  2. Weakened Strength: The second impairment is weakness in the human's ability to move their arm, which is introduced by lowering the strength factor in the PID controller of the joints. This value is also sampled normally, with a mean of 0.66 and a standard deviation of 0.2, where 1 represents full strength and 0 represents immobility.

  3. Limited Range: Lastly, we consider a limitation in the range of movement for each joint in the arm of the human. As above, full joint movement is represented by 1 and an immobile joint by 0. We sample the limited movement from a normal distribution with a mean of 0.75 and a standard deviation of 0.1.
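The three impairment distributions listed above can be sampled as in the sketch below. The means and standard deviations are those stated in the text; clipping the strength and range factors to [0, 1], drawing one noise value per joint, and the default of 10 arm joints (taken from the human action dimension in Table I) are assumptions, as are all names.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Impairments:
    joint_noise_deg: List[float]  # involuntary movement noise per arm joint
    strength: float               # 1 = full strength, 0 = immobile
    range_limit: float            # 1 = full joint range, 0 = immobile


def clip01(value: float) -> float:
    """Keep a sampled factor inside its meaningful [0, 1] interval (assumption)."""
    return min(1.0, max(0.0, value))


def sample_impairments(rng: random.Random, n_joints: int = 10) -> Impairments:
    """Draw one set of impairment parameters from the stated normal distributions."""
    return Impairments(
        joint_noise_deg=[rng.gauss(0.0, 5.0) for _ in range(n_joints)],
        strength=clip01(rng.gauss(0.66, 0.2)),
        range_limit=clip01(rng.gauss(0.75, 0.1)),
    )


if __name__ == "__main__":
    print(sample_impairments(random.Random(0)))
```

Each sub-policy would use only one of the three fields, while the blending network is trained with all three active at once.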

Initially, we train a single policy for each individual impairment on 2 million time steps (5000 episodes) using Proximal Policy Optimization (PPO) [14]. We designed a grid-search experiment to select the neural network architecture. The best configuration obtained for each network consists of 2 layers of 64 nodes. For all single-impairment policies, we define a state space of 64 joint values between the robot and human along with an action space of 17 joint targets. All policies use the same reward function as defined in [4]. This reward function considers a weighted combination of the distance of the robot arm to the target itch position, a penalty for large actions, and the contact induced with the itch target: $r = -w_d\, d - w_a \lVert a \rVert_2 + w_f\, f$, where each $w$ represents the weight for its term, $d$ is the Euclidean distance from the arm to the target, $\lVert a \rVert_2$ is the Euclidean norm of the action vector taken by the robot, and $f$ is the current force applied to the target.
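The reward described above (a distance term, an action penalty, and a contact force term) can be sketched as follows. The signs follow the description, with distance and large actions penalized and contact force rewarded; the default weight values are placeholders, not the weights used in [4].

```python
import math
from typing import Sequence


def itch_reward(distance: float, action: Sequence[float], force: float,
                w_distance: float = 1.0, w_action: float = 0.01,
                w_force: float = 0.1) -> float:
    """Weighted reward: closer is better, smaller actions are better,
    more force on the itch target is better. Weights are placeholders."""
    action_norm = math.sqrt(sum(a * a for a in action))  # Euclidean norm
    return -w_distance * distance - w_action * action_norm + w_force * force


if __name__ == "__main__":
    # Illustrative numbers only: 0.2 m from the target, small action, 3 N of force.
    print(itch_reward(distance=0.2, action=[0.1, -0.1, 0.05], force=3.0))
```

Since every policy, including the blending network, shares this reward, differences in the training curves reflect the policies rather than the objective.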

| Policy | Observation Space | Action Space |
|---|---|---|
| Involuntary Movement | 34 human joint values, 30 robot joint values | 10 human joint values, 7 robot joint values |
| Weakened Strength | 34 human joint values, 30 robot joint values | 10 human joint values, 7 robot joint values |
| Limit Range of Motion | 34 human joint values, 30 robot joint values | 10 human joint values, 7 robot joint values |
| Blending network | Only system parameters: 1 for estimated weakness, 1 for estimated range limit, 10 for estimated movement joints | 3 weighted values for blending the policies |
TABLE I: Trained policies and their respective observation and action spaces

V-C Training of Blending Network

Similar to the sub-policies, we use the same reward and a PPO model with 2 layers of 64 nodes each to train the blending network. However, the blending network is trained for 400k time steps, and its state space consists only of the system parameters. Unlike the sub-policies, the blending network is trained on a human with all three impairments and as such must consider many more cases of how the robot needs to act. By training the blending network on all three impairments together, we allow it to become more robust to parameter identification and avoid training a single policy to handle all three impairments, which is complex and time-consuming due to sample inefficiency.

V-D Training of Domain Randomization

To train the domain randomization model, we train on a human exhibiting all three impairments. As above, we use PPO with 2 layers of 64 nodes each. However, these impairments are now sampled uniformly as follows:

  1. Involuntary Movement: The noise for each joint's angle is sampled from a bounded range of degrees.

  2. Weakened Strength: The weakness coefficient is sampled from a bounded interval.

  3. Limited Range: The range limitation is sampled as a bounded fraction of the original motion.

| Method | Policy Blending | State Space |
|---|---|---|
| Domain Randomization | No | Trained from human and robot observation |
| UKF | Yes | 12 parameters estimated by UKF sampling the real world |
| Autotuned SPM | Yes | 12 parameters estimated by a mapping function of the interaction between the policy and the real world |
| Baseline (known parameters) | Yes | Parameters are passed as the state space at the start of each episode |
TABLE II: Trained policies and their respective observation and action spaces

VI Discussion and Results

Fig. 2: Example of our robot completing the itching task even when the human is dysfunctionally moving their arm upward
Fig. 3: Training reward average over 50 episodes for single impairment policies. The rewards here are averaged over the training of 3 seeds.

For our initial sub-policies, Fig. 3 shows that the weakness- and limit-based policies consistently achieve a higher reward than the involuntary movement policy. For our blending network, we use the best-performing sub-policies and train only on humans with a combination of all three impairments. As such, Fig. 5 shows that the rewards are much lower than those of the individual policies. Additionally, we notice a significant advantage in using a blending-based policy with system identification over general domain randomization. Furthermore, the ability to estimate the real-world parameters improves the policy's overall convergence, as the autotuned policies struggle to achieve the same success as the UKF-based policy or the baseline (in which the system parameters are perfectly known to the blending network at each time step). To further evaluate our policies, we define a testing experiment of 100 episodes in which the human exhibits all three impairments, with the impairment values sampled as above but in conjunction. We still use the given system identification method to estimate the state space of the blending network. Fig. 4 a) shows a box plot of the performance for the joint parameter variations. We also consider experiments in which the human exhibits only a single impairment; the results are shown in b), c), and d) of Fig. 4 and in Table III.

From the box plots, we can see that our approach outperforms domain randomization over 100 separate episodes. Furthermore, there is a difference between the system identification methods, as the UKF and the system fed with the correct parameters outperform the autotuned search approach. We can therefore draw two conclusions about our approach. First, policy blending yields a significant improvement over general domain randomization in terms of both sample efficiency and performance. Second, our design can successfully employ various types of system identification; however, the choice of identification method may significantly affect the overall performance of the policy and should be based on the maximum amount of domain knowledge available.

One limitation of our scheme is that the sub-policies must be developed first; however, these theoretically provide stability and robustness when facing unknown environments. Additionally, because the sub-policies are decoupled from the main blending network and can be reused, different approaches can quickly be tested and tuned, which addresses a problem limiting current domain randomization methods. Furthermore, beyond the empirical evidence, there is no guarantee that a linear combination of the blending network's weights and the sub-policies' actions is a good approximation. Different methods could certainly be used for the policy blending, and this is a future direction of exploration.

Fig. 4: Application of the trained policy to a real environment for 100 separate episodes. We use the highest reward policy for each situation. a) Shows the itch force applied when the human has a combination of all three impairments. b), c), and d) show the force applied when the human has a single impairment in the form of limited range, weakness, or involuntary motion respectively.
Fig. 5: Training reward averaged over 50 episodes for our policy exploration methods. The rewards shown here are averaged over the training of 3 seeds.

| Method | Mean | STDEV | Mean | STDEV | Mean | STDEV | Mean | STDEV |
|---|---|---|---|---|---|---|---|---|
| Domain Randomization | 1.33 | 2.18 | 0.96 | 1.94 | 1.56 | 2.77 | 1.07 | 1.90 |
| UKF | 8.68 | 10.58 | 18.14 | 14.62 | 8.05 | 10.09 | 18.13 | 14.49 |
| Autotuned SPM | 5.26 | 7.35 | 6.34 | 8.67 | 5.71 | 7.35 | 10.42 | 10.32 |
| Baseline (known parameters) | 9.03 | 10.23 | 15.02 | 14.40 | 11.2 | 11.71 | 19.0 | 13.81 |
TABLE III: Mean and STDEV of Each Method Given a Specific Impairment

VII Conclusions and Future Work

In this work, we present a concurrent policy blending and system identification scheme for learning a generalized policy with respect to varying system parameters. With this scheme, we demonstrate the ability to solve a collaborative human and robot task in which the human is impaired by multiple separate but impactful conditions. Additionally, we demonstrate that our policy outperforms sample-inefficient domain randomization, as we can use diverse system identification methods to significantly improve over a single general policy. As such, this work provides a framework for efficiently training generalized policies that are robust to an ever-changing system.


References

  • [1] A. Clegg, Z. Erickson, P. Grady, G. Turk, C. C. Kemp, and C. K. Liu (2020) Learning to collaborate from simulation for robot-assisted dressing. IEEE Robotics and Automation Letters 5 (2), pp. 2746–2753.
  • [2] A. Dragan and S. Srinivasa (2013) A policy blending formalism for shared control. International Journal of Robotics Research 32 (7), pp. 790–805.
  • [3] Y. Du, O. Watkins, T. Darrell, P. Abbeel, and D. Pathak (2021) Auto-tuned sim-to-real transfer. arXiv:2104.07662.
  • [4] Z. Erickson, V. Gangaram, A. Kapusta, C. K. Liu, and C. C. Kemp (2019) Assistive gym: a physics simulation framework for assistive robotics. arXiv:1910.04700.
  • [5] M. Kaspar, J. D. M. Osorio, and J. Bock (2020) Sim2Real transfer for reinforcement learning without dynamics randomization. CoRR abs/2002.11635.
  • [6] L. Ljung (1986) System identification: theory for the user. Prentice-Hall, Inc., USA. ISBN 0138816409.
  • [7] A. B. Martinsen, A. M. Lekkas, and S. Gros (2020) Combining system identification with reinforcement learning-based MPC. IFAC-PapersOnLine 53 (2), pp. 8130–8135. 21st IFAC World Congress. ISSN 2405-8963.
  • [8] J. Matas, S. James, and A. J. Davison (2018) Sim-to-real reinforcement learning for deformable object manipulation. CoRR abs/1806.07851.
  • [9] B. Mehta, M. Diaz, F. Golemo, C. J. Pal, and L. Paull (2019) Active domain randomization. arXiv:1904.04762.
  • [10] T. Narita and O. Kroemer (2021) Policy blending and recombination for multimodal contact-rich tasks. IEEE Robotics and Automation Letters 6 (2), pp. 2721–2728.
  • [11] OpenAI, I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, J. Schneider, N. Tezak, J. Tworek, P. Welinder, L. Weng, Q. Yuan, W. Zaremba, and L. Zhang (2019) Solving Rubik's cube with a robot hand. arXiv:1910.07113.
  • [12] X. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel (2018) Sim-to-real transfer of robotic control with dynamics randomization. pp. 1–8.
  • [13] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel (2018) Sim-to-real transfer of robotic control with dynamics randomization. 2018 IEEE International Conference on Robotics and Automation (ICRA). ISBN 9781538630815.
  • [14] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv:1707.06347.
  • [15] Shamrao, C. Padmanabhan, S. Gupta, and A. Mylswamy (2018) Estimation of terramechanics parameters of wheel-soil interaction model using particle filtering. Journal of Terramechanics 79, pp. 79–95. ISSN 0022-4898.
  • [16] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke (2018) Sim-to-real: learning agile locomotion for quadruped robots. arXiv:1804.10332.
  • [17] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel (2017) Domain randomization for transferring deep neural networks from simulation to the real world. arXiv:1703.06907.
  • [18] E. A. Wan and R. Van Der Merwe (2000) The unscented Kalman filter for nonlinear estimation. In Proceedings of the IEEE 2000 Adaptive Systems for Signal Processing, Communications, and Control Symposium (Cat. No.00EX373), pp. 153–158.
  • [19] W. Wiesemann, D. Kuhn, and B. Rustem (2013) Robust Markov decision processes. Mathematics of Operations Research 38 (1), pp. 153–183.