Multiple-Pilot Collaboration for Advanced Remote Intervention using Reinforcement Learning

09/27/2021 ∙ by Ziwei Wang, et al. ∙ Tsinghua University, Imperial College London

Traditional master-slave teleoperation relies on human expertise without correction mechanisms, resulting in excessive physical and mental workloads. To address these issues, a co-pilot-in-the-loop control framework is investigated for cooperative teleoperation. A deep deterministic policy gradient (DDPG) based agent is realised to effectively restore the master operators' intents without prior knowledge of the time delay. The proposed framework allows an additional operator (i.e., co-pilot) to generate commands at the slave side, whose weights are optimally assigned online through DDPG-based arbitration, thereby enhancing command robustness in the case of possible human operational errors. With the help of interval type-2 (IT2) Takagi-Sugeno (T-S) fuzzy identification, force feedback can be reconstructed at the master side without a sense of delay, thus ensuring telepresence performance in force-sensor-free scenarios. Two experimental applications validate the effectiveness of the proposed framework.




I Introduction

Robot-assisted remote intervention is widely adopted in extreme manipulation scenarios such as outer space, underwater operation, and minimally invasive surgery (MIS), as it extends operational ability and improves ergonomics and intelligence through teleoperation schemes [12, 9, 4, 17, 20]. Human intention and commands on the master side can be transferred through communication channels and reproduced on the slave robots. Meanwhile, teleoperation can offer cross-continent intervention or assistance even in the absence of local experienced operators.

However, the time delay induced by long-distance communication deteriorates the stability of robotic teleoperation systems [1]. For instance, in robot-assisted MIS [14, 3, 2] and assistive exoskeletons [23, 22], it has been shown that the time delay should ideally be less than 180 ms [7] to ensure intuitive haptic communication for human-robot interaction. However, prior knowledge of the time delay and its derivative is difficult to obtain in advance, and in cross-regional teleoperation the delay can easily exceed 180 ms [21, 19]. Thus, overcoming large time delay without prior information remains an open challenge for control design in teleoperation.

To address the resulting control issues, wave-variable-based, passivity-based, predictive and adaptive control schemes have been reported in the literature [16, 13, 18, 5]. In general, however, teleoperation relies on the master operator's full control, which is prone to cause fatigue and thus may deteriorate control performance. Moreover, operational errors may randomly occur on the human side and will be directly reproduced at the slave side. In addition, the human operator can only access limited environmental information from the slave side, which cannot guarantee correct intervention at all times.

To overcome these problems, introducing multiple operators with separate interfaces could facilitate robust arbitration of the master commands. It effectively avoids subjective operation errors and therefore reduces the physical and mental workload of each operator. Following this multi-operator concept, authority weights can be assigned to different human operators in dual-user or multi-operator haptic systems [15, 8]. However, the authority factors are normally pre-determined as constants, so that incorrect commands issued under large weights cannot be compensated for, weakening the coordination performance of the multiple operators.

In this paper, to further improve the overall performance of the multi-operator approach, we introduce a new deep reinforcement learning (RL) based multiple-pilot collaboration framework for teleoperation. Besides the interfaces normally installed on the remote master side, we add a co-pilot at the slave side to offer additional intervention. In the proposed control framework, dual Kalman filters (KF) are used to fuse multiple operators' commands online. Based on the fused master signal, the deep deterministic policy gradient (DDPG) [10] based RL algorithm is employed to optimize a policy that corrects the phase difference and data loss caused by the unknown time delay. A fuzzy-identification-based force estimator for the robot-environment interaction is developed to predict the force on the slave side. The developed control framework guarantees synchronisation tracking performance and transparency while improving the robustness of decisions.

Fig. 1: Multiple-Pilot cooperation framework diagram.

II Methods

The overall control framework for multiple-pilot collaboration is illustrated in Fig. 1, which is composed of dual master operators with visual feedback, one co-pilot, the slave robot and the environment. In what follows, the time argument will be omitted for simplicity when not specifically highlighted.

II-A Multi-Operator Command Fusion

Human commands from the dual master interfaces, $x_{m1}$ and $x_{m2}$, are fused through a KF module, whose output is denoted by $\hat{x}_m$. Consider the process system for the KF module used at the master side:

$$x_{k+1} = A x_k + w_k, \qquad z_k = H x_k + v_k, \qquad (1)\text{-}(2)$$

where $k \in \mathbb{Z}^{+}$ is the time-step index with $\mathbb{Z}^{+}$ being the set of positive integers, $A$ is the system matrix, $x_k$ the system state described in Cartesian space, $w_k$ the process noise with covariance $Q$, $z_k$ the observed state, $H$ the projection matrix, and $v_k$ the observation noise with covariance $R$. At each time step, the future value of each state is estimated as in (3) and then corrected based on the received measurements (4),

$$\hat{x}_{k|k-1} = A \hat{x}_{k-1}, \qquad (3)$$

$$\hat{x}_k = \hat{x}_{k|k-1} + K_k \left( z_k - H \hat{x}_{k|k-1} \right), \qquad (4)$$

where $\hat{x}_{k|k-1}$ is the state estimation, $\hat{x}_k$ is the corrected state and $K_k$ is the Kalman gain matrix expressed in (5)-(6),

$$P_{k|k-1} = A P_{k-1} A^{\top} + Q, \qquad (5)$$

$$K_k = P_{k|k-1} H^{\top} \left( H P_{k|k-1} H^{\top} + R \right)^{-1}, \qquad P_k = \left( I - K_k H \right) P_{k|k-1}. \qquad (6)$$
Remark 1. In addition to the above KF incorporating multiple operator commands, two further Kalman filters are employed in the proposed teleoperation framework, following the same parameter update process described above. The one near the co-pilot guarantees the smoothness and availability of the co-pilot's position and velocity commands. The slave state predicted by the other Kalman filter enhances the transparency, as it provides delay-free force feedback through the force estimator.

II-B DDPG-based Prediction and Arbitration

Two DDPG agents are developed to tackle the time delay and decision-making issues in the proposed multiple-operator teleoperation. The first agent (DDPG1) is trained to add a compensating instantaneous value to the delayed master signal, including the velocity and acceleration for each Cartesian dimension. Let $e_s$ denote the instantaneous signal transmission error, where the restored signal is the sum of the delayed master signal and the DDPG1 output (which is also one of the inputs of DDPG2). The optimization of DDPG1 is guided through a properly designed reward function, which encourages the networks to minimize the error $e_s$ caused by the forward time delay.

DDPG1 is composed of two neural networks (NNs), namely a critic NN that estimates the state-action values (7), and a policy network (actor NN) that generates the optimal action under each state (8),

$$Q(s_t, a_t) = \sum_{i=1}^{N_c} w_{c,i} \, \phi_{c,i}(s_t, a_t), \qquad (7)$$

$$a_t = \mu(s_t) = \sum_{i=1}^{N_a} w_{a,i} \, \phi_{a,i}(s_t), \qquad (8)$$

where $s_t$ is the state in the NNs, including the velocity and acceleration received at the slave side. $Q(s_t, a_t)$ is the estimated state-action value for state $s_t$ and action $a_t$ at instant $t$, $\mu(s_t)$ is the estimated optimal action for $s_t$, $N_c$ and $N_a$ are the numbers of basis functions in the output NN layers, $w_{c,i}$ and $w_{a,i}$ are the corresponding network weights for the output layers, and $\phi_{c,i}$ and $\phi_{a,i}$ are the basis functions for the critic and actor networks respectively. A dense model is used as the basis function while the ReLU is chosen as the activation function for both networks. The critic NN minimizes the following cost function

$$L_c = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - Q(s_i, a_i) \right)^2, \quad y_i = r_i + \gamma \, Q'\!\left( s_{i+1}, \mu'(s_{i+1}) \right), \qquad (9)\text{-}(10)$$

where $N$ is the number of transitions in the minibatch sampled from the critic NN replay buffer, $Q(s_i, a_i)$ is the critic network output, $y_i$ is the target state-action value, $r_i$ is the immediate reward, $\gamma$ is the discount factor, $Q'$ is the state-action value obtained from the target network, $\mu'$ is the target network policy, and $\mu$ is the actor NN policy. On the other hand, the actor NN updates its weights by minimizing the cost function in (11),

$$L_a = -\frac{1}{N} \sum_{i=1}^{N} Q\!\left( s_i, \mu(s_i) \right), \qquad (11)$$
where is the number of transitions in the minibatch sampled from the replay buffer. The reward function for DDPG1 is defined as:


where the superscript is the -th element index.
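The losses in (9)-(12) can be sketched with linear-in-features critic and actor outputs matching (7)-(8). All dimensions, random weights and the squared-error reward below are illustrative assumptions, not the trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(v):
    return np.maximum(v, 0.0)

def critic_q(s, a, W_c, B_c):
    # (7): Q(s, a) as a weighted sum of basis functions of (s, a).
    return float(W_c @ relu(B_c @ np.concatenate([s, a])))

def actor_mu(s, W_a, B_a):
    # (8): deterministic policy output for state s.
    return W_a @ relu(B_a @ s)

def reward_ddpg1(e_s):
    # (12): penalize the transmission error elementwise, so the agent
    # learns to cancel the effect of the unknown forward delay.
    return -float(np.sum(e_s ** 2))

# Bellman target for one transition: y = r + gamma * Q'(s', mu'(s')),
# as in (9)-(10); target and online networks share weights here for brevity.
gamma = 0.99
s, s_next = rng.standard_normal(4), rng.standard_normal(4)
W_c, B_c = rng.standard_normal(6), rng.standard_normal((6, 6))
W_a, B_a = rng.standard_normal((2, 5)), rng.standard_normal((5, 4))
r = reward_ddpg1(rng.standard_normal(2))
y = r + gamma * critic_q(s_next, actor_mu(s_next, W_a, B_a), W_c, B_c)
```

In training, $y$ would be regressed by the critic over a minibatch (9)-(10), and the actor weights would follow the gradient of (11).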

The second DDPG agent (DDPG2) aims to allocate weights online for the co-pilot and master operators, and outputs the optimised weight matrix $\Lambda$ to minimize $e_d$, the error between the final command $x_d$ and the task target. The final command is generated through the arbitration module in the following blending form

$$x_d = \Lambda \hat{x}_m + \left( I - \Lambda \right) x_c, \qquad (13)$$

where $\Lambda$ is a diagonal matrix whose entries lie between 0 and 1, $\hat{x}_m$ is the fused master command, $x_c$ is the co-pilot command, and $I$ is the identity matrix with appropriate dimension. The same blending applies to the velocity command. Similar to DDPG1, the reward function for DDPG2 is designed as

$$r_2 = -\sum_{j} \left( e_d^{(j)} \right)^2. \qquad (14)$$
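The arbitration blend in (13) reduces to a per-axis convex combination of the two command sources. The commands and weights below are hypothetical values for illustration.

```python
import numpy as np

def blend(x_m, x_c, Lam):
    # x_d = Lam * x_m + (I - Lam) * x_c, entrywise per Cartesian axis (13).
    I = np.eye(Lam.shape[0])
    return Lam @ x_m + (I - Lam) @ x_c

x_m = np.array([0.10, 0.02, -0.01])   # fused master command (hypothetical)
x_c = np.array([0.12, 0.00, -0.02])   # co-pilot command (hypothetical)
Lam = np.diag([0.8, 0.5, 0.3])        # per-axis authority of the master side
x_d = blend(x_m, x_c, Lam)            # final command sent to the slave
```

Because each diagonal entry of the weight matrix lies in [0, 1], the blended command always stays between the master and co-pilot commands on every axis.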
Finally, a state feedback controller such as a PID controller can be used to determine the control command for the slave robot, which updates the end-effector state of the slave robot online.

Remark 2. The signals used are all represented in Cartesian space. In this way, the slave robot does not have to comply with the master interfaces in terms of configuration. DDPG1 ensures the restoration of the master intention without prior information of the time delay, overcoming the packet loss and phase delay problems due to the communication channel. In addition, the output of DDPG2 provides dynamically adjustable weights for arbitration, integrating the intentions of operators at different sides.

Remark 3. The proposed framework allows the boundary of the arbitration weights to be changed by adjusting the tanh layer parameters of the actor NN according to the task requirements. For master-oriented teleoperation tasks, each master-side weight should be bounded within an upper sub-range so that the master remains dominant; for example, the driver and the surgeon should play a dominant role in shared driving and telesurgery in terms of human assistance and safety. For scenarios where the master operator has expertise similar to the co-pilot's, the output range of the actor NN can be widened accordingly. The initial weights can be assigned to the co-pilot based on prior knowledge or a confidence level, and are then optimally adjusted by DDPG2 based on his/her performance.
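The range adjustment described in Remark 3 can be sketched as an affine remap of a tanh output layer. The bounds [0.5, 1] below are an illustrative master-dominant choice, not the paper's actual setting.

```python
import numpy as np

def scale_tanh(u, lo, hi):
    # tanh maps pre-activations into (-1, 1); the affine map then places
    # the weight into the task-dependent range (lo, hi).
    return lo + (hi - lo) * (np.tanh(u) + 1.0) / 2.0

# Master-dominant task: keep the master weights in an upper sub-range,
# so the co-pilot can assist but never fully take over.
w = scale_tanh(np.array([-2.0, 0.0, 2.0]), 0.5, 1.0)
```

Widening `lo` toward 0 would correspond to the equal-expertise scenario, where the co-pilot may receive full authority on an axis.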

II-C IT2 Fuzzy Identification based Force Estimator

The installation of force sensors may be impractical in certain scenarios, such as space robot grasping and surgical robot cutting, due to the conflicting layout of end-effector tools and force sensors. Therefore, data-based force feedback to the master operator in force-sensor-free scenarios helps improve the transparency of teleoperation. To this end, we develop an IT2 T-S fuzzy identification scheme to reconstruct the interactive force at the master side.

Since the input data may contain different features, each feature needs to be normalized so that its value domain lies in a similar range, facilitating the clustering of interaction environments. In this regard, z-score normalization and linear discriminant analysis (LDA) are exploited for the pre-processing procedure, where LDA is employed to maximise the distance between data in different categories and minimise that within the same category. Let the data preprocessed by the above two steps be denoted by $\bar{z}_k$. In what follows, we adopt the Fuzzy C-Means (FCM) algorithm to identify fuzzy sets as follows

$$\min J = \sum_{i=1}^{c} \sum_{k=1}^{n} \mu_{ik}^{m} \left\| \bar{z}_k - v_i \right\|^2, \quad \text{s.t.} \ \sum_{i=1}^{c} \mu_{ik} = 1, \qquad (15)$$

where the subscript $k$ stands for the $k$-th sampling, $c$ is the number of fuzzy rules extracted by subtractive clustering [6], and $n$ is the number of data points. $m > 1$ is the fuzzy partition matrix exponent, $v_i$ is the $i$-th cluster center, and the membership function with respect to Rule $i$ is denoted by $\mu_{ik}$. Taking the Lagrange multiplier $\lambda_k$ into the problem (15), one can obtain

$$\bar{J} = \sum_{i=1}^{c} \sum_{k=1}^{n} \mu_{ik}^{m} \left\| \bar{z}_k - v_i \right\|^2 + \sum_{k=1}^{n} \lambda_k \left( \sum_{i=1}^{c} \mu_{ik} - 1 \right), \qquad (16)$$

where the first-order partial derivatives with respect to $\mu_{ik}$ and $v_i$ are as follows

$$\frac{\partial \bar{J}}{\partial \mu_{ik}} = m \mu_{ik}^{m-1} \left\| \bar{z}_k - v_i \right\|^2 + \lambda_k, \qquad \frac{\partial \bar{J}}{\partial v_i} = -2 \sum_{k=1}^{n} \mu_{ik}^{m} \left( \bar{z}_k - v_i \right). \qquad (17)$$

Solving the above equation group yields

$$v_i = \frac{\sum_{k=1}^{n} \mu_{ik}^{m} \bar{z}_k}{\sum_{k=1}^{n} \mu_{ik}^{m}}, \qquad (18)$$

$$\mu_{ik} = \left( \sum_{j=1}^{c} \left( \frac{\left\| \bar{z}_k - v_i \right\|}{\left\| \bar{z}_k - v_j \right\|} \right)^{\frac{2}{m-1}} \right)^{-1}. \qquad (19)$$
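The fixed-point iteration implied by (18)-(19) can be sketched as follows. The cluster count, exponent m = 2, initialization and synthetic two-cluster data are illustrative; in the paper, the rule count comes from subtractive clustering [6].

```python
import numpy as np

def fcm(Z, c=2, m=2.0, iters=100):
    # Spread the initial centers across the (ordered) data set.
    V = Z[np.linspace(0, Z.shape[0] - 1, c).astype(int)]
    for _ in range(iters):
        # Memberships (19): inverse distances raised to 2/(m-1), then
        # normalized so each column sums to 1 (the constraint in (15)).
        D = np.linalg.norm(Z[None, :, :] - V[:, None, :], axis=2) + 1e-12
        U = 1.0 / D ** (2.0 / (m - 1.0))
        U /= U.sum(axis=0)
        # Centers (18): membership-weighted means of the data.
        Um = U ** m
        V = (Um @ Z) / Um.sum(axis=1, keepdims=True)
    return V, U

# Two synthetic interaction "environments" in a 2-D feature space.
rng = np.random.default_rng(1)
Z = np.vstack([rng.normal(0.0, 0.02, (10, 2)),
               rng.normal(1.0, 0.02, (10, 2))])
V, U = fcm(Z)
```

With well-separated data, the recovered centers land near the two cluster means, mimicking how distinct contact materials would be partitioned before rule identification.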
Considering the measurement noise in the preprocessed data $\bar{z}_k$ and the system uncertainty, the firing strength of Rule $i$ lies within the following interval set [11]

$$\mu_{ik} \in \left[ \underline{\mu}_i(\bar{z}_k), \ \overline{\mu}_i(\bar{z}_k) \right], \qquad (20)$$

in which $\underline{\mu}_i$ and $\overline{\mu}_i$ denote the lower and upper membership functions such that $\underline{\mu}_i(\bar{z}_k) \le \mu_{ik} \le \overline{\mu}_i(\bar{z}_k)$. Hence, the data-driven IT2 T-S fuzzy system is described by

$$\hat{f}_k = \sum_{i=1}^{c} \frac{\underline{\mu}_i(\bar{z}_k) + \overline{\mu}_i(\bar{z}_k)}{2} \, \Theta_i^{\top} \tilde{z}_k, \qquad (21)$$

where $\tilde{z}_k$ is the augmented vector of $\bar{z}_k$ and $\Theta_i$ is the parameter matrix updated by the weighted recursive least squares (WRLS) algorithm, namely

$$\Theta_{i,k} = \Theta_{i,k-1} + L_{i,k} \left( y_k - \tilde{z}_k^{\top} \Theta_{i,k-1} \right), \qquad (22)$$

where

$$L_{i,k} = \frac{P_{i,k-1} \tilde{z}_k}{\dfrac{\lambda}{w_{i,k}} + \tilde{z}_k^{\top} P_{i,k-1} \tilde{z}_k}, \qquad P_{i,k} = \frac{1}{\lambda} \left( I - L_{i,k} \tilde{z}_k^{\top} \right) P_{i,k-1}, \qquad (23)$$

with the weight $w_{i,k}$ derived from the firing strengths in (20) and the forgetting factor $\lambda$ such that $0 < \lambda \le 1$.
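A WRLS update of the form (22)-(23) can be sketched for a scalar output. The regression target, forgetting factor and unit weights below are illustrative assumptions; in the estimator, the weights would come from the IT2 firing strengths in (20).

```python
import numpy as np

def wrls_step(theta, P, z_aug, y, w, lam=0.98):
    # Gain (23): measurement weighted by w against the forgetting factor lam.
    denom = lam / w + z_aug @ P @ z_aug
    L = (P @ z_aug) / denom
    # Parameter update (22): correct by the weighted prediction residual.
    theta = theta + L * (y - z_aug @ theta)
    # Covariance update (23): discount old information by 1/lam.
    P = (P - np.outer(L, z_aug) @ P) / lam
    return theta, P

# Fit y = 2*z + 1 from streaming samples with augmented input [z, 1].
theta, P = np.zeros(2), 1e3 * np.eye(2)
for z in np.linspace(-1.0, 1.0, 200):
    theta, P = wrls_step(theta, P, np.array([z, 1.0]),
                         2.0 * z + 1.0, w=1.0)
```

On noiseless data the recursion recovers the underlying slope and intercept, which is the behaviour the force estimator relies on when mapping slave states to contact forces online.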

Data: Slave states after Kalman filtering on the master side
Result: IT2 fuzzy systems and identified clusters
1 Z-score normalize and apply LDA to the collected data;
2 Calculate the cluster centers as in (18) and the membership functions as in (19);
3 while the teleoperation task is running do
4       Conduct state augmentation to obtain the augmented vector;
5       repeat
6             Update the parameter matrix as in (22)-(23);
7       until the parameter matrix converges;
8       Calculate the IT2 T-S fuzzy output as in (21);
9 end while
Algorithm 1 Data-Driven IT2 T-S Fuzzy Identification Algorithm

III Experiment Results

We have experimentally validated the individual technologies of the proposed framework and performed post-integration systemic testing, where the setup is composed of two 7-DoF Omega.7 interfaces (Force Dimension Inc., Switzerland) and an LBR iiwa 14 robot (KUKA Inc., Germany).

Fig. 2: The position tracking curves of the master operator and co-pilot.

III-A Dual-Pilot Collaboration

One master operator drew a circular trajectory based on visual feedback while the co-pilot, with similar expertise, carried out the same task. The two operators communicated via a local area network (LAN), leading to a negligible time delay. The same NN structure is utilised in the two DDPG-based agents, where the critic NN is composed of three hidden layers with five neurons each and one further layer with two neurons, while the actor NN is composed of one hidden layer with three neurons. The discount factor was set to 0.99, the minibatch size to 100, and the noise variance to 0.01, with the experience buffer length and variance decay rate set accordingly. The RL agents were trained on the generated trajectories for 20 episodes, each consisting of 2000 steps with a step size of 0.01 s. Output boundaries were set for DDPG1 and DDPG2, respectively. The root-mean-square error (RMSE) is chosen as the evaluation criterion for operational performance.
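The RMSE criterion used throughout the experiments can be sketched as follows; the reference and command trajectories below are synthetic stand-ins for the recorded ones.

```python
import numpy as np

def rmse(x, x_ref):
    # Root-mean-square tracking error between a command and its reference.
    return float(np.sqrt(np.mean((np.asarray(x) - np.asarray(x_ref)) ** 2)))

# Synthetic single-axis trajectories sampled at 0.01 s, as in training.
t = np.linspace(0.0, 20.0, 2000)
ref = 0.1 * np.sin(0.5 * np.pi * t)       # one axis of a circular reference
cmd = ref + 0.005 * np.sin(3.0 * t)       # command with a small oscillation
err = rmse(cmd, ref)
```

A sinusoidal deviation of amplitude 0.005 yields an RMSE of about 0.0035 (amplitude over the square root of two), which gives a sense of scale for the values reported below.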

Due to page limitations, we only show the position profiles along one Cartesian direction for the master operator, the co-pilot and the final command blended in the arbitration module, as shown in Fig. 2. Three subjective errors occurred during the master's operation, resulting in trajectory deviation (RMSE = 0.057569). The co-pilot was aware of the errors at the master side during the corresponding three periods and compensated by improving his own trajectory. The co-pilot also made mistakes at other times, for example in the first five seconds and in 10-15 s (RMSE = 0.058192). The final command sent to the robot was generated by blending the master and co-pilot commands in the DDPG-based arbitration (RMSE = 0.022571). This shows that human subjective errors can be effectively overcome and demonstrates the necessity of introducing the co-pilot.

III-B Triple-Pilot Collaboration with Time Delay

In this scenario, two operators performed a circle-drawing task at the master side with visual feedback and reconstructed force feedback. A co-pilot with similar experience carried out the same task at the slave side, where the master and slave communicated via the TCP/IP protocol with a 0.5 s delay. The system control frequency is 100 Hz. The KF parameters were set following the update process in Section II-A, and the rest of the parameters remain consistent with those in the above experiment. Due to the page limitation, we only show the data along one Cartesian direction in the remaining figures. The position profiles of the master and slave restored commands are depicted in Fig. 3, which clearly shows that DDPG1 effectively compensates for the phase delay due to the communication channel. The reliability of the master command is improved by combining the two operators' intents with the help of the KF (the RMSE between the fused master command and the task-dependent reference signal is 0.04747). Fig. 4 illustrates the decision-making process of the master operators and the co-pilot. At the beginning of the collaboration with dual master operators, the co-pilot deviated from the desired trajectory due to operational errors, so the weight assigned to him was zero, as shown in Fig. 5. In the middle of the collaboration, the co-pilot compensated for subjective errors at the master side, so the corresponding weight was increased. The co-pilot and the arbitration module accommodate dynamic changes in work efficiency at the master side, which guarantees the robustness of the commands in the event of operational errors (the RMSE between the final command and the reference is 0.0051387).

Fig. 3: The position commands for master and slave restore.
Fig. 4: Position commands of master operators and co-pilot.

To validate the effectiveness of the proposed force estimator, we tested the force estimation performance in three interaction environments (hard silicone, soft silicone and sponge), which provided force feedback for local reconstruction to the master operators. An ATI 6-axis force/torque sensor was mounted on the end-effector of the slave robot during the training phase. Fifty interaction experiments were conducted in each of the three material environments, where three trials were randomly selected as the training set and the rest served as the test set. Fig. 6 illustrates the clustering results used in the test. The average RMSEs of the interaction forces are 0.097904 (hard silicone), 0.088005 (soft silicone) and 0.18961 (sponge), respectively. The curves from top to bottom in Fig. 7 show the estimated interaction forces versus the measured forces for the hard silicone, soft silicone and sponge materials, respectively. This shows that the designed force estimator effectively establishes a mapping between the input state and the force, i.e., it provides force feedback to the master in force-sensor-free scenarios. The operators can feel the force interaction between the end-effector and the environment without a sense of delay, and the different stiffnesses of the three materials can be distinguished through the applied force feedback.

Fig. 5: Curve of arbitration adjustment.
Fig. 6: Fuzzy clustering results.
Fig. 7: Force estimation results.

IV Conclusion

In this paper, we propose a new control framework for multi-pilot collaborative teleoperation. DDPG-based RL agents are employed to enable the restoration of the master intents and multi-source intent decisions. To enhance the master's perception of the interaction environment, we develop a new interactive force estimator based on IT2 T-S fuzzy identification, which facilitates the reconstruction of force feedback at the master side. Two key metrics, tracking accuracy and telepresence, are thus guaranteed. Experiments in dual-pilot collaboration and triple-pilot collaboration with time delay verify the merits of the proposed control. It should be noted that the proposed framework is based on the premise that the task target is known, which is widely the case in operator training systems. Scenarios with unknown or time-varying task targets will be investigated in future work.


The authors thank Dr. Xiaoxiao Cheng, Pakorn Uttayopas and Sarah Chams Bacha from Imperial College for collecting the data for the experiments.


  • [1] P. Arcara and C. Melchiorri (2002) Control schemes for teleoperation with time delay: a comparative study. Robotics and Autonomous Systems 38 (1), pp. 49–64. Cited by: §I.
  • [2] W. Bai, Q. Cao, C. Leng, Y. Cao, M. G. Fujie, and T. Pan (2017) A novel optimal coordinated control strategy for the updated robot system for single port surgery. The International Journal of Medical Robotics and Computer Assisted Surgery 13 (3), pp. e1844. Cited by: §I.
  • [3] W. Bai, Q. Cao, P. Wang, P. Chen, C. Leng, and T. Pan (2017) Modular design of a teleoperated robotic control system for laparoscopic minimally invasive surgery based on ROS and RT-Middleware. Industrial Robot: An International Journal. Cited by: §I.
  • [4] W. Bai, N. Zhang, B. Huang, Z. Wang, F. Cursi, Y. Tsai, B. Xiao, and E. M. Yeatman (2021) Dual-arm coordinated manipulation for object twisting with human intelligence. In 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Cited by: §I.
  • [5] L. Chan, F. Naghdy, and D. Stirling (2014) Application of adaptive controllers in teleoperation systems: a survey. IEEE Transactions on Human-Machine Systems 44 (3), pp. 337–352. External Links: Document Cited by: §I.
  • [6] S. L. Chiu (1994) Fuzzy model identification based on cluster estimation. Journal of Intelligent & Fuzzy Systems 2 (3), pp. 267–278. Cited by: §II-C.
  • [7] E. Ivanova, J. Eden, S. Zhu, G. Carboni, A. Yurkewich, and E. Burdet (2021) Short time delay does not hinder haptic communication benefits. IEEE Transactions on Haptics 14 (2), pp. 322–327. External Links: Document Cited by: §I.
  • [8] B. Khademian and K. Hashtrudi-Zaad (2012) Dual-user teleoperation systems: new multilateral shared control architecture and kinesthetic performance measures. IEEE/ASME Transactions on Mechatronics 17 (5), pp. 895–906. External Links: Document Cited by: §I.
  • [9] D. A. Lawrence (1993) Stability and transparency in bilateral teleoperation. IEEE Transactions on Robotics and Automation 9 (5), pp. 624–637. Cited by: §I.
  • [10] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §I.
  • [11] J. M. Mendel, R. I. John, and F. Liu (2006) Interval type-2 fuzzy logic systems made simple. IEEE Trans. Fuzzy Syst. 14 (6), pp. 808–821. Cited by: §II-C.
  • [12] G. Niemeyer and J. Slotine (1991) Stable adaptive teleoperation. IEEE Journal of Oceanic Engineering 16 (1), pp. 152–162. Cited by: §I.
  • [13] E. Nuño, L. Basañez, and R. Ortega (2011) Passivity-based control for bilateral teleoperation: a tutorial. Automatica 47 (3), pp. 485–495. Cited by: §I.
  • [14] S. Rezazadeh, W. Bai, M. Sun, S. Chen, Y. Lin, and Q. Cao (2019) Robotic spinal surgery system with force feedback for teleoperated drilling. The Journal of Engineering 2019 (14), pp. 500–505. Cited by: §I.
  • [15] M. Shahbazi, S. F. Atashzar, and R. V. Patel (2018) A systematic review of multilateral teleoperation systems. IEEE Transactions on Haptics 11 (3), pp. 338–356. Cited by: §I.
  • [16] D. Sun, F. Naghdy, and H. Du (2014) Application of wave-variable control to bilateral teleoperation systems: a survey. Annual Reviews in Control 38 (1), pp. 12–31. Cited by: §I.
  • [17] J. Troccaz, G. Dagnino, and G. Yang (2019) Frontiers of medical robotics: from concept to systems to clinical translation. Annual Review of Biomedical Engineering 21, pp. 193–218. Cited by: §I.
  • [18] R. Uddin and J. Ryu (2016) Predictive control approaches for bilateral teleoperation. Annual Reviews in Control 42, pp. 82–99. Cited by: §I.
  • [19] Z. Wang, H. Lam, B. Xiao, Z. Chen, B. Liang, and T. Zhang (2021) Event-triggered prescribed-time fuzzy control for space teleoperation systems subject to multiple constraints and uncertainties. IEEE Transactions on Fuzzy Systems 29 (9), pp. 2785–2797. External Links: Document Cited by: §I.
  • [20] Z. Wang, Z. Chen, Y. Zhang, X. Yu, X. Wang, and B. Liang (2019) Adaptive finite-time control for bilateral teleoperation systems with jittering time delays. International Journal of Robust and Nonlinear Control 29 (4), pp. 1007–1030. Cited by: §I.
  • [21] Z. Wang, B. Liang, Y. Sun, and T. Zhang (2020) Adaptive fault-tolerant prescribed-time control for teleoperation systems with position error constraints. IEEE Transactions on Industrial Informatics 16 (7), pp. 4889–4899. Cited by: §I.
  • [22] T. Xue, Z. Wang, T. Zhang, and M. Zhang (2019) Adaptive oscillator-based robust control for flexible hip assistive exoskeleton. IEEE Robotics and Automation Letters 4 (4), pp. 3318–3323. Cited by: §I.
  • [23] T. Xue, Z. Wang, T. Zhang, O. Bai, M. Zhang, and B. Han (2021) Continuous finite-time torque control for flexible assistance exoskeleton with delay variation input. Robotica 39 (2), pp. 291–316. Cited by: §I.