I Introduction
Over the last few years, there has been significant interest in developing models that are trained in simulation and then transferred to the real world [5], [17], [16]. Despite progress in learning from simulated policies, these methods still suffer from long simulation times, as they require large amounts of experience to handle the unknown environments present in the real world. Additionally, these policies struggle to generalize to complex situations, as they may become unpredictable when faced with new challenges. Researchers have therefore approached training these policies in two distinct ways. The first approach utilizes techniques such as Kalman filters to identify system parameters that can inform policies on how to respond under different environmental conditions
[6]. However, these policies struggle to generalize and often require retuning [6]. The second approach randomizes sets of system parameters during training so that the policy learns to robustly handle a wide variety of situations. This approach, known as domain randomization [13], requires the actual parameters to lie within the set of randomized values; as such, the robustness of the policy is directly correlated with the range of the generated randomized parameters. Moreover, the resulting policies are overly conservative in action selection, especially when the parameter set is large. Finally, for complex tasks with many parameters, covering large ranges across multiple parameter values requires significant training time before a robust policy can be generated [8].

Both of the abovementioned approaches require a single policy that aims to 1) solve the task at hand for a single set of parameters and 2) generalize the solution to a wide set of parameters. Such policies struggle to generalize and fail over large parameter spaces in complex tasks. As an alternative, we decouple the handling of general system parameters from the learning of policies that solve the task efficiently. We do this by utilizing blending techniques informed by system identification methods. With our approach, we train individual policies that are efficient on distinct subsets of the parameter space and utilize a blending network to identify how best to combine the actions of these individual policies based on the estimated parameters of the system. We verify the validity of our approach on a collaborative robot-human itching task in which the human has different combinations of motor impairments. Our proposed approach works under the assumption that the space of each parameter is bounded, so that any combination of parameter values results in a convex space. This guarantees convergence of the blending network given a set of individual policies, each learned to control the system under a single source of parameter variability.
Our paper makes the following primary contributions:

We decouple the process of learning a single task from the process of generalizing to a large set of system-parameter combinations using a blending-network technique;

We design an architecture that integrates a blending network to accurately generalize its sub-models across system parameters; and

We implement our scheme on a simulated collaborative human and robotic locomotion task with various environment parameters to demonstrate its effectiveness across different system identification methods.
II Related Work
There have been many approaches to policy learning based on estimation of simulation parameters; however, to the authors' knowledge, none combines system identification with a blending approach. For example, [7] demonstrates simultaneous system identification and policy training, exploring a series of predictive-error methods to minimize the difference between the observed and estimated parameters for model predictive control (MPC). Additionally, domain randomization has been used for motorized robotic control tasks [12]. However, these tasks neither consider a collaborative environment nor handle multiple faults introduced by the interacting agents. Furthermore, [11] solves a challenging Rubik's cube control task via automatic domain randomization that slowly increases the difficulty of the task, but this can take significant time as it does not consider the integration of any real-world sampling. Lastly, [9] considers an adaptive domain randomization strategy in which the framework attempts to identify domains that create challenging environments for the policy. This approach is similar to ours, except that theirs is purely data-driven, based on the result of the policy, and therefore requires significant training time due to potential sample inefficiency. In contrast, we utilize prior domain knowledge to identify environments that have a high potential of being challenging for the policy.
In addition to the large body of research on efficient sim-to-real transfer, a series of recent works demonstrate the effectiveness of policy blending. [10] demonstrates the use of policy blending for simple tasks such as opening a cap, flipping a breaker, and turning a dial. However, their policy learns directly from sensor measurements and does not consider impairments in the agents. Furthermore, [2] presents a policy blending technique between a human and a robot policy for robot-assisted control, accurately assisting the human with tasks such as fetching a water bottle. However, this work does not consider training models with modern deep reinforcement learning (DRL) techniques. Given these approaches, it is worthwhile to combine robust policy blending with modern system identification as a new approach to generalized modelling.
III Background
In this section, we first formalize the reinforcement learning problem for tasks with multiple varying parameters. Then, we outline three approaches presented in the literature to solve such tasks.
III-A Robust Markov Decision Process
We model tasks with multiple varying parameters as a robust Markov decision process (RMDP) defined using a tuple $(\mathcal{S}, \mathcal{A}, r, \gamma, \mathcal{P})$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $r : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ represents a reward function, $\gamma \in [0, 1)$ is called a discount factor, and $\mathcal{P}$ is an uncertainty probability set, where $\mathcal{P}(s, a)$ represents a family of probability measures over next states $s'$. The next state of the system is contingent on the conditional measure $p(\cdot \mid s, a) \in \mathcal{P}(s, a)$, where $s$ is the current state and $a$ is the action selected by an agent. In this work, we adopt the assumption that $\mathcal{P}$ is structured as a Cartesian product $\bigotimes_{(s, a) \in \mathcal{S} \times \mathcal{A}} \mathcal{P}(s, a)$, also known as the state-action rectangularity assumption [19]. This implies that nature can choose the worst transition independently for each state and action. Moreover, we assume that the uncertainty probability set is defined by the parameter space $\mathcal{M}$ that characterizes the task, such that $\mathcal{P} = \{ p_\mu : \mu \in \mathcal{M} \}$.
In a standard MDP framework, a policy $\pi$ maps states to distributions over actions with the goal of maximizing the sum of the discounted rewards over the future. The optimization criterion is the following:

$$\pi^* = \arg\max_\pi \; \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t) \right], \quad (1)$$

where the value function, $V^\pi$, defines the value of being in any given state:

$$V^\pi(s) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t) \;\middle|\; s_0 = s \right]. \quad (2)$$

Traditional RL algorithms require that the system dynamics and reward function do not change over time in order to find an optimal deterministic Markovian policy satisfying (1). This property is clearly not satisfied in the RMDP case. Therefore, we propose an approach that decouples learning how the system dynamics change with the system parameters from learning how to optimally solve the task.
III-B Domain Randomization
To effectively identify $\mathcal{M}$ in simulation, a set of parameters must be defined to model the environment. Domain randomization samples a set of some $N$ parameters, which we denote $\mu = (\mu_1, \ldots, \mu_N)$, for which a reasonable range of potential values is constructed, usually from domain-specific knowledge [17]. In this paper, we consider domain randomization of the uniform type, such that the parameters are uniformly sampled within a feasible range. For example, the weakness of a certain human joint can be sampled uniformly between 0 and 1, where 0 invokes no mobility and 1 corresponds to a joint at full strength.
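Uniform domain randomization over a small set of bounded parameters can be sketched as follows; the parameter names and ranges are illustrative examples, not the paper's exact configuration.

```python
import random

# Illustrative bounded parameter ranges (names and bounds are examples,
# not the paper's exact configuration).
PARAM_RANGES = {
    "joint_weakness": (0.0, 1.0),  # 0 = no mobility, 1 = full strength
    "range_limit":    (0.0, 1.0),  # fraction of the original joint range
    "tremor_deg":     (0.0, 5.0),  # std. dev. of involuntary joint noise
}

def sample_domain(ranges=PARAM_RANGES, rng=random):
    """Uniformly sample one value per parameter, yielding one randomized
    training domain for the policy."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}

domain = sample_domain()
```

Each call to `sample_domain` produces one environment instantiation; a policy trained over many such draws must be robust across the whole box of parameter values, which is exactly why the feasible ranges must contain the real parameters.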
III-C System Identification via Parameter Estimation
System identification through parameter estimation is a well-studied subject in which an estimator consistently receives samples from a real-world environment and generalizes these samples into an estimate of the true value. In this work, we utilize the Unscented Kalman Filter (UKF) as our estimator [18] and make the assumption that our real-world parameters can be measured with some confidence, but may be corrupted by noise. Although this is a strong assumption, it can be softened for systems with non-measurable parameters by using approximate models of the system together with general estimation methods such as particle filtering; see, for example, [15].
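The paper uses a UKF [18]; for a static, directly measurable parameter the unscented and linear Kalman updates coincide, so the recursive estimator below is only a minimal stand-in for the idea, not the UKF itself. The parameter value and noise levels are illustrative.

```python
import random

class ScalarKalman:
    """Recursively estimate a constant parameter from noisy measurements.

    For a static state (x_{t+1} = x_t) the unscented and linear Kalman
    updates coincide, so this is a minimal stand-in for the UKF used in
    the paper, not a full UKF implementation.
    """

    def __init__(self, x0, p0, meas_var):
        self.x = x0        # current estimate
        self.p = p0        # estimate variance
        self.r = meas_var  # measurement-noise variance

    def update(self, z):
        k = self.p / (self.p + self.r)  # Kalman gain
        self.x += k * (z - self.x)      # correct estimate toward measurement
        self.p *= (1.0 - k)             # estimate variance shrinks each step
        return self.x

# Estimate a hypothetical joint-weakness parameter from noisy readings.
rng = random.Random(0)
true_weakness = 0.66
kf = ScalarKalman(x0=0.5, p0=1.0, meas_var=0.05)
for _ in range(200):
    kf.update(true_weakness + rng.gauss(0.0, 0.05 ** 0.5))
```

After a few hundred noisy samples the estimate concentrates tightly around the true value, which is the behavior the blending network relies on when it consumes estimated rather than true parameters.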
III-D Auto-tuned Search Parameter Model (SPM)
When the environment's parameters can be neither measured nor estimated through physics-based estimation, we propose to utilize a new technique that estimates the parameters by interacting with the environment as an agent. Recently, Du et al. formulated a new approach to system identification in which a data-driven model learns a map from observation-action-parameter-estimate sequences to a probability distribution over whether the parameter estimates are greater than, less than, or equal to the true parameters [3]. The mapping works like a binary classifier that is continuously trained concurrently with the policy, so that it slowly converges to the real-world parameters by learning from its own policy-interaction trajectories. With this iterative search, we can then perform a level of system identification that does not rely entirely on domain knowledge for our experiments.
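The search step behind this idea can be illustrated as a bisection driven by the classifier's output. Here an oracle stands in for the learned classifier of [3], so the block is a toy of the search mechanics only, not the auto-tuned SPM training itself.

```python
def spm_search(prob_greater, lo, hi, iters=30):
    """Bisection-style parameter search driven by a classifier signal.

    `prob_greater(est)` returns the probability that the estimate `est`
    exceeds the true parameter. In the auto-tuned SPM this comes from a
    binary classifier trained on the policy's own rollouts [3]; here an
    oracle stands in for it.
    """
    for _ in range(iters):
        est = 0.5 * (lo + hi)
        if prob_greater(est) > 0.5:
            hi = est  # estimate likely too high: search the lower half
        else:
            lo = est  # estimate likely too low: search the upper half
    return 0.5 * (lo + hi)

# Oracle classifier for a hypothetical true parameter value of 0.37.
true_param = 0.37
estimate = spm_search(lambda e: 1.0 if e > true_param else 0.0, 0.0, 1.0)
```

With a perfect classifier the interval halves each iteration; with a learned, noisy classifier the convergence is slower, which matches the observation later in the paper that the auto-tuned variant underperforms the UKF.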
IV Approach
In this section, we present the details of our solution scheme for solving collaborative tasks with multiple varying parameters. Figure 1 shows the different components of the proposed approach, instantiated for the case study presented in the next section. A set of individual policies, each learned for an individual parameter change, is required. Each individual policy can be trained for a single parameter distribution, and the total number of policies scales linearly with the total number of parameters. Training each policy can be accomplished through domain randomization techniques. The core components of the proposed scheme, the blending network and the system identification technique, are described next.
IV-A Blending Network Learning
To create a decoupled policy in which we can solve individual tasks while remaining robust to a variety of system parameters, we introduce a blending network whose state space consists solely of the $N$ system parameters. This policy then only needs to output the weights of its sub-policies at each time step to generate the action for the environment. The sub-policies of this model are trained on a single set of constant system parameters, for which a unique environment is identified through prior domain knowledge. We then define the action of the blending network as $a_t = \sum_{i=1}^{N} w_i \, \pi_i(s_t)$, where $\sum_{i=1}^{N} w_i = 1$, $N$ is the number of sub-policies, and $\pi_i(s_t)$ represents the action taken by sub-policy $i$ given the state $s_t$ at time $t$.
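The convex combination of sub-policy actions can be sketched as follows. Using a softmax over the blending network's raw outputs is one way to enforce weights that sum to 1; it is an assumption of this sketch, not necessarily the paper's exact output head.

```python
import math

def softmax(logits):
    """Turn raw network outputs into positive weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def blended_action(logits, sub_actions):
    """Convex combination of sub-policy actions.

    `sub_actions[i]` is the action vector pi_i(s_t) of sub-policy i;
    `logits` are the blending network's raw outputs, one per sub-policy.
    """
    w = softmax(logits)
    dim = len(sub_actions[0])
    return [sum(w[i] * sub_actions[i][d] for i in range(len(sub_actions)))
            for d in range(dim)]

# Three sub-policies emitting 2-D actions (sizes are illustrative).
action = blended_action([0.0, 0.0, 0.0],
                        [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

With equal logits the weights are all 1/3, so the blended action is the average of the three sub-policy actions; skewed logits smoothly shift authority toward the sub-policy best matched to the estimated parameters.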
IV-B Concurrent System Identification
We then combine the blending network with a concurrent system identification scheme to obtain a generalized policy that is robust to different environmental changes, i.e., different combinations of parameter values. To do this, we let the state space of the blending network consist of only the estimated parameters, so the network must learn to associate certain parameters with the appropriate sub-policies. In practice, every fixed number of training steps, we utilize our system identification method to update the state space of the blending network with a more accurate set of estimated system parameters and continue training. We emphasize that our approach is independent of the system identification method chosen and thus can be tailored to the available domain knowledge of the environment.
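The concurrent loop can be summarized as a skeleton like the one below. All interfaces (the `blender`, `estimator`, and environment stubs) are illustrative stand-ins, not the paper's code; the point is only the structure: the blending state is refreshed from the system-identification method every few steps while training continues on blended actions.

```python
def train_blending(blender, sub_policies, env, estimator,
                   total_steps=100, sysid_every=10):
    """Skeleton of the concurrent scheme: every `sysid_every` steps the
    blending network's state (the estimated parameters only) is refreshed
    by the chosen system-identification method, and training continues on
    the blended actions. All interfaces are illustrative stand-ins."""
    obs = env.reset()
    mu_hat = estimator.estimate()
    total_reward = 0.0
    for step in range(total_steps):
        if step % sysid_every == 0:
            mu_hat = estimator.estimate()        # concurrent sys-id refresh
        weights = blender.weights(mu_hat)        # one weight per sub-policy
        per_policy = [p.act(obs) for p in sub_policies]
        action = [sum(w * a for w, a in zip(weights, dims))
                  for dims in zip(*per_policy)]  # blended action
        obs, reward, done = env.step(action)
        total_reward += reward                   # an RL update would go here
        if done:
            obs = env.reset()
    return total_reward

# Minimal stubs showing the expected interfaces.
class StubPolicy:
    def act(self, obs): return [0.1, -0.2]

class StubEnv:
    def reset(self): return [0.0]
    def step(self, action): return [0.0], 1.0, False

class StubEstimator:
    def estimate(self): return [0.66, 0.75, 5.0]

class StubBlender:
    def weights(self, mu_hat): return [0.5, 0.5]

ret = train_blending(StubBlender(), [StubPolicy(), StubPolicy()],
                     StubEnv(), StubEstimator())
```

Because the loop only calls `estimator.estimate()`, any of the three identification options from Section III (measured parameters, UKF, or auto-tuned SPM) can be slotted in without touching the rest of the training code.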
V Experiments
V-A Assistive Gym
For our experiments, we utilize the framework introduced by [4], an OpenAI Gym environment for collaborative human-robot interaction. Assistive Gym is a realistic physics environment powered by PyBullet that enforces realistic human joints and provides a series of robots for collaborative tasks. We utilize the Jaco robot for our control task, as it performed best on the individual itching policies explored in [4], and define the success value of a task as the amount of force applied to the target itch position throughout the entire episode. Each episode consists of 200 time steps, equating to 20 seconds of real-world time.
V-B Training of Sub-Policies
To demonstrate our model, we attempt to solve a collaborative itching task using Assistive Gym [4], in which a robot assists an impaired human with itching. Similar to [1], we consider 3 impairments for the human:


(a) Involuntary Movement: The first impairment is involuntary movement, which we model by adding normally distributed noise to the joint actions of the human. For this policy, we sample the noise for each joint in the arm from a normal distribution with a mean of 0 and a standard deviation of 5 degrees.

(b) Weakened Strength: The second impairment is weakness in the human's ability to move their arms, which we introduce by lowering the strength factor in the PID controller of the joints. This value is also sampled normally, with a mean of 0.66 and a standard deviation of 0.2, where 1 represents full strength and 0 represents immobility.

(c) Limited Range: Lastly, we consider a limitation in the range of movement of each joint in the human's arm. As above, full joint movement is represented by 1 and an immobile joint by 0. We sample the limited movement from a normal distribution with a mean of 0.75 and a standard deviation of 0.1.
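The three impairments above can be sketched as a single transformation of the human's joint commands. The default values are the sampling means from the text; the weakening here scales the commands directly (a simplification of lowering the PID strength factor), and the 90-degree full joint range is an illustrative assumption, not a value from the paper.

```python
import random

def apply_impairments(joint_targets, rng,
                      tremor_std_deg=5.0, strength=0.66, range_frac=0.75,
                      full_range_deg=90.0):
    """Apply the three simulated impairments to human joint commands.

    Defaults are the sampling means from the text; each training episode
    draws them from the normal distributions described above. Scaling the
    command is a simplification of lowering the joint PID strength, and
    `full_range_deg` is an illustrative full joint range.
    """
    limit = range_frac * full_range_deg
    out = []
    for target in joint_targets:
        target += rng.gauss(0.0, tremor_std_deg)     # (a) involuntary movement
        target *= strength                           # (b) weakened strength
        out.append(max(-limit, min(limit, target)))  # (c) limited range
    return out

rng = random.Random(0)
impaired = apply_impairments([10.0, -30.0, 80.0], rng)
```

Composing the three effects in one pass is what makes the combined-impairment setting harder than any single impairment: noise, attenuation, and clipping interact on every joint command.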
Initially, we train a single policy for each individual impairment for 2 million time steps (5000 episodes) using Proximal Policy Optimization (PPO) [14]. We designed a grid-search experiment to obtain the neural network architecture; the best configuration for each network consists of 2 layers of 64 nodes. For all single-impairment policies, we define a state space of 64 joints between the robot and human, along with an action space of 17 joint targets. All policies use the same reward function as defined in
[4]. This reward function considers a weighted combination of the distance of the robot arm to the target itch position, a penalty for large actions, and the contact force induced on the itch target: $r = -w_d \, d - w_a \, \|a\|_2 + w_f \, f$, where $w_d$, $w_a$, and $w_f$ represent the weight of each term, $d$ is the Euclidean distance from the arm to the target, $\|a\|_2$ is the Euclidean norm of the action vector taken by the robot, and $f$ is the current force applied to the target.

[Table I: Observation and action spaces of each policy (weakness, involuntary movement, limited range, and the blending network); the table entries were lost in extraction.]
V-C Training of Blending Network
Similar to the sub-policies, we use the same reward and a PPO model with 2 layers of 64 nodes each to train the blending network. However, the blending network is trained for 400k time steps, and its state space consists only of the system parameters. Unlike the sub-policies, our blending network is trained on a human with all three impairments and as such must consider many more cases of how the robot needs to act. By training the policy on all three impairments at once, we allow our blending network to become more robust to parameter identification, sidestepping the fact that training a single policy to handle all three impairments is complex and time-consuming due to sample inefficiency.
V-D Training of Domain Randomization
To train the domain randomization model, we train on a human exhibiting all three impairments. As above, we use PPO with 2 layers of 64 nodes each. However, the impairments are now sampled uniformly as follows:


(a) Involuntary Movement: The noise on each joint's angle is sampled between degrees.

(b) Weakened Strength: We consider a weakness coefficient between .

(c) Limited Range: We consider range limitations between times the original motion.
[Table II: Methods compared, whether each uses policy blending, and its state space. Recoverable rows: one method without blending, UKF with blending, Auto-tuned SPM with blending, and one further method with blending; the remaining entries were lost in extraction.]
VI Discussion and Results
For our initial sub-policies, we can see in Fig. 3 that the weakness- and limit-based policies consistently achieve a higher reward than the involuntary movement policy. For our blending network, we take the best-performing sub-policies and train only on humans with a combination of all three impairments. As such, in Fig. 5 we can see that the rewards are much lower than those of the individual policies. Additionally, we notice a significant advantage in using a blending-based policy with system identification over general domain randomization. Furthermore, the ability to estimate the real-world parameters enhances the policies' overall convergence, as the auto-tuned policies struggle to achieve the same success as the UKF-based policy or the baseline (in which the system parameters are perfectly known to the blending network at each time step). To further evaluate our policies, we define a testing experiment consisting of 100 episodes with the human exhibiting all three impairments, where the impairment values are sampled as above but in conjunction. We still utilize the given system identification method to estimate the state space of the blending network. Fig. 4 a) shows a box plot of the performance for the joint parameter variations. We also consider experiments in which the human enacts only a single impairment; the results are shown in panels b), c), and d) of Fig. 4 and in Table III.
From the box plots, we can see that we outperform domain randomization across 100 separate episodes. Furthermore, there is a difference between the system identification methods, as the UKF and the system fed with the correct parameters outperform the auto-tuned search approach. From this, we can draw two important conclusions about our approach. First, policy blending significantly improves over general domain randomization in terms of both sample efficiency and performance. Second, our design can successfully employ various types of system identification; however, the identification method may significantly affect the overall performance of the policy and should be chosen based on the maximum amount of domain knowledge available.
Given this, we must note that one limitation of our scheme is the need to develop the sub-policies; however, these theoretically provide stability and robustness when faced with unknown environments. Additionally, since the sub-policies are decoupled from the main blending network and can be reused, different approaches can quickly be tested and tuned, a problem limiting current domain randomization methods. Furthermore, it is not guaranteed, other than empirically, that a linear combination of the blending network's weights and the sub-policies' actions is a good approximation. Different methods could certainly be used for the policy blending, and this is a future direction of exploration.






Table III: Mean and standard deviation of task success for each method across the four evaluation conditions of Fig. 4 (the per-column condition labels were lost in extraction).

Method  Mean  STDEV  Mean  STDEV  Mean  STDEV  Mean  STDEV
Domain Randomization  1.33  2.18  0.96  1.94  1.56  2.77  1.07  1.90
UKF  8.68  10.58  18.14  14.62  8.05  10.09  18.13  14.49
Auto-tuned SPM  5.26  7.35  6.34  8.67  5.71  7.35  10.42  10.32
Known parameters (baseline)  9.03  10.23  15.02  14.40  11.2  11.71  19.0  13.81
VII Conclusions and Future Work
In this work, we present a concurrent policy blending and system identification scheme for learning a policy that generalizes across varying system parameters. With this scheme, we demonstrate the ability to solve a collaborative human-robot task in which the human is impaired with multiple separate, but impactful, conditions. Additionally, we demonstrate that our policy outperforms sample-inefficient domain randomization, as we can utilize diverse system identification methods to significantly improve over a single general policy. As such, this work provides a framework for efficiently training generalized policies that are robust to an ever-changing system.
References
 [1] (2020) Learning to collaborate from simulation for robot-assisted dressing. IEEE Robotics and Automation Letters 5(2), pp. 2746–2753.
 [2] (2013) A policy blending formalism for shared control. International Journal of Robotics Research 32(7), pp. 790–805.
 [3] (2021) Auto-tuned sim-to-real transfer. arXiv:2104.07662.
 [4] (2019) Assistive Gym: a physics simulation framework for assistive robotics. arXiv:1910.04700.
 [5] (2020) Sim2Real transfer for reinforcement learning without dynamics randomization. CoRR abs/2002.11635.
 [6] (1986) System identification: theory for the user. Prentice-Hall, Inc., USA. ISBN 0-13-881640-9.
 [7] (2020) Combining system identification with reinforcement learning-based MPC. IFAC-PapersOnLine 53(2), pp. 8130–8135. 21st IFAC World Congress.
 [8] (2018) Sim-to-real reinforcement learning for deformable object manipulation. CoRR abs/1806.07851.
 [9] (2019) Active domain randomization. arXiv:1904.04762.
 [10] (2021) Policy blending and recombination for multimodal contact-rich tasks. IEEE Robotics and Automation Letters 6(2), pp. 2721–2728.
 [11] (2019) Solving Rubik's cube with a robot hand. arXiv:1910.07113.
 [12] (2018) Sim-to-real transfer of robotic control with dynamics randomization. pp. 1–8.
 [13] (2018) Sim-to-real transfer of robotic control with dynamics randomization. 2018 IEEE International Conference on Robotics and Automation (ICRA).
 [14] (2017) Proximal policy optimization algorithms. arXiv:1707.06347.
 [15] (2018) Estimation of terramechanics parameters of wheel-soil interaction model using particle filtering. Journal of Terramechanics 79, pp. 79–95.
 [16] (2018) Sim-to-real: learning agile locomotion for quadruped robots. arXiv:1804.10332.
 [17] (2017) Domain randomization for transferring deep neural networks from simulation to the real world. arXiv:1703.06907.
 [18] (2000) The unscented Kalman filter for nonlinear estimation. In Proceedings of the IEEE 2000 Adaptive Systems for Signal Processing, Communications, and Control Symposium, pp. 153–158.
 [19] (2013) Robust Markov decision processes. Mathematics of Operations Research 38(1), pp. 153–183.