Efficient reinforcement learning control for continuum robots based on Inexplicit Prior Knowledge

02/26/2020 ∙ by Junjia Liu, et al. ∙ Shanghai Jiao Tong University

Compared to rigid robots that are often studied in reinforcement learning, the physical characteristics of some sophisticated robots, such as soft or continuum robots, are more complicated. Moreover, recent reinforcement learning methods are data-inefficient and cannot be directly deployed to the robot without simulation. In this paper, we propose an efficient reinforcement learning method based on inexplicit prior knowledge in response to such problems. The method is first validated in simulation and then employed directly in the real world. By using our method, we achieve visual active tracking and distance maintenance of a tendon-driven robot, which will be critical in minimally invasive procedures.






I Introduction

For decades, massive efforts have been made to make machines intelligent, in the expectation of relieving humans of repetitive, dangerous, and heavy work. In traditional robotics, control of robots is realized by establishing kinematic and dynamic models in the form of transformation matrices. This method has achieved excellent results for conventional robots with discrete rigid links but becomes difficult to implement for soft robots such as continuum robots. In the traditional method, several subjective assumptions have to be made to control continuum manipulators, leading to deviations from actual conditions and inaccurate results[1]. Even so, the kinematic and dynamic models for continuum robots are often described in the form of nonlinear partial differential equations, which makes the control more complex.

Ever since reinforcement learning (RL) theory was proposed, developers have been trying to apply it to robotics. With the introduction of RL methods, the traditional method in rigid robotics has been enhanced with the idea of trial and error[2][3]. However, applying RL to continuum robots still meets some resistance. To our knowledge, only a few recent studies have applied RL to control continuum robots[4]. In the research of Thuruthel et al., an accurate Vicon tracking system is used to realize closed-loop control from a third-person perspective. However, the devices used in their research are not available in most application scenarios of continuum robots. Furthermore, data inefficiency is the major drawback of RL algorithms, especially for a non-stationary continuum robot, which makes learning on the real-world robot even more impractical.

In this paper, we focus on automatic kinematics learning of complex robotic systems and end-to-end predictive control using visual servoing from a first-person perspective. The main problem we tackle is the data efficiency of complex and non-stationary real-world robots. We use inexplicit prior knowledge to accelerate the convergence of the learning process. Meanwhile, the ability to explore is still guaranteed by an auto-adjusted exploitation coefficient.

To evaluate our proposed method empirically, we first build a simulator in MuJoCo[5] and then test directly on a real-world continuum robot. Our primary contributions are as follows:

  • A new model-based RL framework that integrates inexplicit prior knowledge (IPK) is proposed.

  • A Kalman filter based selector is designed to afford an evaluation of hybrid controller accuracy.

  • To balance the prior knowledge and RL, we set an exploitation coefficient that can be adjusted automatically according to the KL divergence.

  • Simulation and real-world experimental results demonstrate the data efficiency of our method, which requires fewer interactions than state-of-the-art model-based methods.

II Related Works

II-A Model-based Reinforcement Learning

The term model-based is ambiguous: it can refer either to a given model, as in MPC, or to a learned model, as mainly used in RL. In this paper, model-based means a model learned from explored data when either the system model or the environment model is unknown.

Model-based reinforcement learning (MBRL) began with the Dyna[6] architecture. Compared to model-free reinforcement learning (MFRL), it is more suitable for robotic systems because it makes fuller use of experience data and is therefore more data-efficient. Since MBRL uses a learned dynamics model to promote the learning process, the model's uncertainty introduces incorrect transitions and impairs value function approximation[7]. Gu et al.[8] consider the weakness of neural networks on mini-batch data and use a linear time-series model to model the environment. MVE[9] controls the uncertainty of the model by limiting the imagination of the model to a fixed depth. STEVE[10] improves on MVE by dynamically interpolating between model rollouts of different horizon lengths for each example, which ensures that models are used without introducing redundant errors. Moreover, the model ensemble technique in STEVE is inspiring and can also be found in ME-TRPO[11]. Most recently, Janner et al.[12] proposed model-based policy optimization (MBPO), which provides a monotonic performance guarantee.

The above methods all continue the Dyna line of work and learn an observation-prediction model. In contrast, VPN[13] directly predicts future values and rewards instead of learning an environment model.

II-B Reinforcement Learning with prior knowledge

Although MBRL algorithms have achieved impressive success, they still take too many timesteps (e.g., the state-of-the-art MBRL method MBPO still needs 5k steps even for a simple Pendulum task), which is still impractical for real-world robot applications. Instead of merely learning from scratch, some prior knowledge of the robot system can be brought in to make learning both stable and efficient.

Moreno et al.[14] add a set of prior knowledge sources as a basic controller and use a credit assignment block to judge when to explore by RL. However, the evaluation function is designed by hand and acts in an if-else manner. In addition, Bougie et al.[15] use another Q-learning model to select the best action. This module is trained on the whole experience replay buffer using a Boltzmann distribution as the explorer. They empirically demonstrate that it can boost A3C by injecting prior features for important exploration areas.

II-C Continuum Robot Control

Control of continuum robots has been widely explored with traditional methods[16]. Researchers tend to establish manipulator kinematic and dynamic models derived from several geometric assumptions. The most commonly used model simplifies the control problem based on the constant curvature (CC) approximation and linearized feedback[1][17]. The CC model performs worse when external loads are non-negligible[18][19]. As an alternative, mechanics-modified models have been used in continuum robotics. Walker, Hannan and Gravagne introduced hyper-redundant robotics[20][21], and a large-deflection dynamic model was used in their research[18]. Treating the backbone of a continuum robot as an elastic rod, Webster et al.[22] and Mahvash et al.[23] each applied the Cosserat rod model. Although accuracy increases, solutions of those models, described in the form of nonlinear partial differential equations, are sensitive to parameters and time-consuming to compute[19][24], which inevitably increases the complexity of the control problem in continuum robotics.

III RL based on Inexplicit Prior Knowledge

Humans often have some intuitive inexplicit prior knowledge (IPK) about the control of robots, which might be inaccurate but is generally on the right path. To avoid useless exploration in a complex manipulator system, the general trend of movement can be pointed out and taught to the ignorant robot. All the robot then needs to do is keep amending the movement-trend mapping from data until it obtains a reliable and explicit mapping. Following this idea, the main framework of our method is shown in Fig. 1.

Fig. 1: The main framework of our method which contains two parts: IPK subsystem and MBPO subsystem. Two orange policies represent the basic controller and the fusion controller of IPK subsystem and two white policies stand for MBPO. The pink and green areas show the mechanism of IPK guidance. The blue area is the terminal fusion controller.

III-A Exploration guided by the prior knowledge

Fig. 2: Left: the reparameterization trick in the SAC paper. Right: before the reparameterization trick, the output Gaussian distribution from the MBRL controller is fused with the distribution from the basic controller by a Kalman filter.

The term inexplicit in this article refers to the approximate direction of motion produced by each tendon motor. This kind of information is much easier to obtain than a kinematic model: it can be found by powering up each motor and recording its characteristic motion. We use these motion directions as a coordinate system to measure the location of the target. This prior knowledge can be regarded as a basic controller, which provides the simplest way to control a robot. At each time step, the target direction is first determined. Then its horizontal and vertical coordinate components are calculated. Finally, a motor acting in that direction is randomly selected to perform the motion.
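The rule-based basic controller described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the motor-to-direction table, the four-motor layout, the image-plane offset convention, and the names `MOTOR_DIRECTIONS` and `basic_controller` are hypothetical and do not come from the paper.

```python
import random

# Hypothetical prior: the approximate image-plane direction in which each
# tendon motor moves the camera view (rough directions, not a kinematic model).
MOTOR_DIRECTIONS = {
    0: (1.0, 0.0),  # assumed: section 1, roughly horizontal
    1: (0.0, 1.0),  # assumed: section 1, roughly vertical
    2: (1.0, 0.0),  # assumed: section 2, roughly horizontal
    3: (0.0, 1.0),  # assumed: section 2, roughly vertical
}


def basic_controller(target_xy, step=1.0, rng=random):
    """Pick one motor command that nudges the view toward the target.

    target_xy is the target offset from the visual center, expressed in the
    coordinate system spanned by the motor directions.
    """
    tx, ty = target_xy
    candidates = []
    for motor_id, (dx, dy) in MOTOR_DIRECTIONS.items():
        # Component of the target offset along this motor's direction.
        component = tx * dx + ty * dy
        if component != 0.0:
            sign = 1.0 if component > 0 else -1.0
            candidates.append((motor_id, sign * step))
    if not candidates:
        return None  # target already centered, nothing to do
    # Randomly select one of the motors that act in the target direction.
    return rng.choice(candidates)
```

For a target lying to the left of the center, for example `basic_controller((-2.0, 0.0))`, only the two motors assumed to act horizontally are candidates, and the returned command is negative.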

The intention of this part is to prolong the task horizon and make it possible to sample more successive action-state pairs. In this paper, we adopt soft actor-critic (SAC)[25] as our policy gradient algorithm and MBPO as our MBRL algorithm. The initial procedure of MBPO employs a uniform policy that generates random actions to guarantee the exploration scope. However, this leads to a major risk of a robot crash and may cost a tremendous amount of time to reset. Both are unacceptable in a real-world application.

We tackle this by setting two sets of action outputs, one from the IPK basic controller and another from MBPO. The replay buffer is augmented accordingly, so that each entry carries information from both the IPK subsystem and MBPO. In this way, one experience can be divided into two parts with different uses. Actions from the IPK subsystem are used to actually interact with the environment and obtain the real reward and the real next observation. In contrast, MBPO information is used only in policy updates, for which approximations of the reward and next observation can be estimated.


Intuitively, IPK actions guarantee that the robot will eventually reach its target with high probability, while the MBPO part can still improve its policy with a certain degree of precision. Therefore, the initial exploration procedure is executed once but yields twice the experience, which is more efficient than the original MBPO.
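One possible way to picture the augmented replay buffer is a record that stores both actions for the same observation. The sketch below is an assumption-laden illustration: the class names, field names, and the two accessor methods are hypothetical, and the paper's actual buffer layout is not recoverable from the text.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class DualTransition:
    """One experience holding both the executed and the proposed action."""
    obs: Sequence[float]
    ipk_action: Sequence[float]   # executed on the robot (IPK subsystem)
    mbpo_action: Sequence[float]  # proposed by MBPO, never executed
    reward: float                 # real reward obtained by the IPK action
    next_obs: Sequence[float]     # real next observation


class DualReplayBuffer:
    """Hypothetical buffer that splits each experience into two uses."""

    def __init__(self) -> None:
        self._data: List[DualTransition] = []

    def add(self, transition: DualTransition) -> None:
        self._data.append(transition)

    def real_samples(self) -> List[Tuple]:
        # Tuples backed by real interaction with the environment.
        return [(t.obs, t.ipk_action, t.reward, t.next_obs) for t in self._data]

    def policy_queries(self) -> List[Tuple]:
        # (obs, action) pairs whose reward and next observation were not
        # observed and therefore have to be estimated.
        return [(t.obs, t.mbpo_action) for t in self._data]
```

Under this layout a single environment step contributes one real sample and one policy query, which mirrors the "execute once, gain two experiences" idea.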

III-B Fusion Controller

After the initial exploration procedure, MBPO trains a Gaussian policy as the main policy. Correspondingly, the IPK subsystem also moves to a new stage: the fusion controller.

Although MBPO disentangles the task horizon and the model horizon by querying the model only for short rollouts, it is still limited by the probability of reaching the target, especially in sparse-reward problems. Since the IPK basic controller is rule-based, it is convenient to assess its performance. From the initial replay buffer and the log of task lengths, we can restore the data to its full form. At each time step, we obtain the target vector both before and after the action, so the deviation of each action from the anticipated direction can be easily estimated. These deviations can be described by a Gaussian distribution. Moreover, the raw actions of SAC also follow a Gaussian distribution. How can we use both pieces of information?

A straightforward idea is to fuse the Gaussian output of the basic controller with the SAC action distribution. The Kalman filter is a common method for fusing the measurements of multiple sensors, and the fused estimate tends to be more accurate than any single one. As shown in Fig. 2, we use a Kalman filter to integrate the outputs of the two controllers and obtain a new fused distribution. This step takes place before the reparameterization trick of SAC.
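As a minimal sketch of the fusion step, the standard one-dimensional Kalman measurement update can combine two independent Gaussian estimates. The function name and the per-dimension, scalar treatment are assumptions made for illustration; the paper does not spell out its exact filter configuration.

```python
def fuse_gaussians(mu_a, var_a, mu_b, var_b):
    """Fuse two independent 1-D Gaussian estimates with a Kalman update.

    (mu_a, var_a) could be the SAC action distribution and (mu_b, var_b)
    the basic-controller distribution, applied per action dimension.
    """
    gain = var_a / (var_a + var_b)    # Kalman gain
    mu = mu_a + gain * (mu_b - mu_a)  # fused mean lies between the two means
    var = (1.0 - gain) * var_a        # fused variance is below both inputs
    return mu, var


print(fuse_gaussians(0.0, 1.0, 2.0, 1.0))  # → (1.0, 0.5)
```

When one controller is much more certain than the other, the fused mean moves toward that controller, which is the behaviour one would want from a confident prior early in training and a confident learned policy later on.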


Our motivation for introducing the IPK subsystem is to demonstrate to and guide the MBRL algorithm so that less time is wasted on useless exploration at the beginning, without limiting it. Because some motions, such as axial distance maintenance and real-time tracking, cannot gain enough information from the IPK basic controller, they still need to rely on exploration. The MBPO reward estimation here is therefore more complicated. We set an exploitation coefficient to balance exploration and exploitation, inspired by the temperature coefficient in MBPO.

Theorem III.1 (Exploitation Coefficient Auto-Adjustment)

Consider the Gaussian distribution produced by the Gaussian policy and the fusional distribution. Then the exploitation coefficient is related to the KL-divergence between these two distributions, subject to a hyperparameter that limits the KL-divergence.

Proof.  See Appendix A.1.
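To make the quantity in Theorem III.1 concrete, the KL-divergence between two univariate Gaussians has a well-known closed form, sketched below. The mapping from KL-divergence to a coefficient, `exploitation_coefficient`, is purely hypothetical and is not the paper's formula, which did not survive in the text; it only illustrates a coefficient that shrinks as the two distributions agree and is capped by a KL limit.

```python
import math


def kl_gaussian(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for univariate Gaussians p = N(mu_p, var_p), q = N(mu_q, var_q)."""
    return 0.5 * (
        math.log(var_q / var_p)
        + (var_p + (mu_p - mu_q) ** 2) / var_q
        - 1.0
    )


def exploitation_coefficient(kl, kl_limit):
    """Hypothetical mapping (NOT from the paper): clipped ratio to the KL limit."""
    return min(1.0, kl / kl_limit)


kl = kl_gaussian(0.0, 1.0, 1.0, 1.0)
print(kl)                                 # → 0.5
print(exploitation_coefficient(kl, 1.0))  # → 0.5
```

With identical distributions the divergence is zero, so under this illustrative mapping the coefficient vanishes once the learned policy and the fused policy coincide.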

This coefficient is then used to trade off exploration and exploitation.


Meanwhile, the same coefficient should also be used in Equation 2 as a weight parameter.


The policy evaluation step is similar to Soft Policy Evaluation[25]; it ensures that we can obtain the soft value function for any policy. However, we still need to prove that, by limiting the KL-divergence, the new policy achieves a higher value than the old one.

Theorem III.2 (Fusional Policy Improvement)

According to Theorem III.1 and Equation 5, the new policy obtained at each time step achieves a value no lower than that of the old policy for all state-action pairs.

Proof.  See Appendix A.2.

By repeatedly applying Soft Policy Evaluation and Fusional Policy Improvement, the policy will eventually converge to the optimal policy, as proved in SAC Theorem 1[25].

IV Experiment

In this paper, we propose to train the continuum robot, without a kinematic model, to aim at a target object by controlling the displacement of multiple tendon drivers, to track the movement of the target, and to maintain a certain axial distance. In minimally invasive surgery, the capability of target tracking and axial distance keeping is critical for letting surgeons concentrate on the procedure, since lesions move with the patient's breathing and other organ movements. To verify our idea, we first carry out experiments in a purpose-built simulator and analyze ablations. Then we deploy the method directly on a real-world continuum robot that we designed.

IV-A Simulation

MuJoCo is used to build a continuum robot model, with the physical manipulator as reference. It can be divided into two motion sections, each of which is composed of 10 serially connected joints. Each section is actuated by two sets of tendon-driven systems at its end-point and has two degrees of freedom (DOF). The panorama of the simulator is illustrated on the left of Fig. 3.


Fig. 3: Left: The panorama of the tendon-driven continuum robot simulator based on MuJoCo; Right: The continuum robot continuously tracking the target.

Following existing studies, we use the epoch return to evaluate the performance of different algorithms. The reward measures the change in 3-dimensional Euclidean distance after each step, adds a bonus when the target reaches the visual center, and applies a penalty when the target leaves the field of view. To maintain the axial distance, the axial deviation is also treated as a penalty. During training, each epoch has 1000 time steps with a 20-step model rollout. We compare our method with the state-of-the-art MBRL algorithm MBPO and the MFRL algorithm SAC. To reveal the effect of IPK-guided exploration, the fusion controller is unpacked into a basic rule-based controller and an MBPO controller guided by IPK. The performance comparison is shown in Fig. 4.
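The per-step reward described above might look like the following sketch. Every constant here (`center_tol`, `center_bonus`, `out_of_view_penalty`, `axial_weight`) and the choice to measure the distance from the target offset to the visual center are assumptions for illustration only; the paper does not give the actual values or the exact form.

```python
import math


def step_reward(prev_offset, offset, in_view, axial_error,
                center_tol=0.05, center_bonus=1.0,
                out_of_view_penalty=1.0, axial_weight=0.1):
    """Hypothetical shaped reward for tracking with axial distance keeping.

    prev_offset, offset: 3-D target offset from the visual center before and
    after the step. axial_error: deviation from the desired axial distance.
    """
    prev_dist = math.sqrt(sum(c * c for c in prev_offset))
    dist = math.sqrt(sum(c * c for c in offset))
    reward = prev_dist - dist              # positive when the target got closer
    if not in_view:
        reward -= out_of_view_penalty      # target left the field of view
    elif dist < center_tol:
        reward += center_bonus             # target reached the visual center
    reward -= axial_weight * abs(axial_error)  # axial deviation as a penalty
    return reward


print(step_reward((3.0, 4.0, 0.0), (0.0, 0.0, 0.0), True, 0.0))  # → 6.0
```

The epoch return would then be the sum of such step rewards over the 1000 time steps of an epoch.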


Fig. 4: The performance comparison among MBPO, SAC and our method. * IPK-MBPO means the performance of MBPO subsystem in our method which is guided by IPK.

From Fig. 4, we find that by introducing inexplicit prior knowledge, IPK-MBPO improves faster than the original MBPO. After about 10 epochs, IPK-MBPO reaches and surpasses the basic controller and converges to a better performance than either the basic controller or the other two SOTA algorithms. Thanks to fusion control, the performance of the terminal fusion controller stays in a high range throughout the whole training process.

IV-B Ablation Study

The most critical part of the IPK framework is action fusion.

Fig. 5: The blue line represents the KL-divergence between the IPK-MBPO Gaussian policy and the fusional policy. The orange line represents the value of the exploitation coefficient.

By recording the mean KL-divergence between IPK-MBPO and the fusion controller in each epoch, the exploitation coefficient can be calculated. Fig. 5 shows that both the KL-divergence and the exploitation coefficient decrease over the training process. This demonstrates that the strong performance of the fusion controller does not rely on the basic controller alone but increasingly on the data-driven IPK-MBPO controller. Moreover, it is consistent with the exploitation coefficient auto-adjustment in Theorem III.1.

IV-C Experiment on real-world continuum robot

To validate the effectiveness of the IPK algorithm, we designed a real-world continuum robot with the same mechanical structure as described in simulation. Plastic joints are evenly arranged and fixed on an elastic rod with large deflection, which provides the necessary resilience as the backbone of the robot. Tendons are threaded through the joints. Every two symmetrically arranged tendons attached to the same end-point provide one DOF by producing strains in opposite directions. The transmission structures in these tendon-servo sets are optimized by using screw rods with normal and reverse threads on the two ends, respectively. The two tendons linked with the same DOF can then be driven by one servo motor, which avoids the accuracy loss caused by motor synchronization and structural redundancy. As a result, a one-to-one correspondence is formed between DOFs and motors. The physical structure of this part is illustrated in Fig. 6(a). In this way, the structure of the continuum manipulator is greatly simplified, with lighter weight and higher accuracy, so that it fits the simulation within the error range.

As in simulation, the continuum robot needs some extra devices to perceive the experiment environment. A pinhole camera is fixed on the end-point to gather information for tracking tasks. Encoders on the servo motors are used to ensure that the IPK actions are executed precisely and to protect the manipulator from damage in over-range conditions. An extra camera is set up facing the robot, only for result evaluation. The gathered information is shown in Fig. 6(b).

Fig. 6: Devices and final results in real-world experiments. (a) Structure of the continuum robot. (b) Training processes in every epoch started from the zero position. (c) Mode I: The continuum manipulator was trained to track the target by using the visual observation, moving from state i to state ii for example. (d) Mode II: Height information of the end-point was gathered by image processing. With height information added into the rewards, the continuum manipulator was trained to keep the distance during tracking tasks. Notice that state ii has a similar height to state i. See the video for details.

The real-world experiment process was similar to that in simulation. Unlike in sim-to-real studies, the real-world model is not transferred from the simulation but is learned directly from the real environment. In this case, the model can learn the uncertainty in the real environment and take these errors into account. The experiment was carried out in two main steps. First, the continuum manipulator was trained to track objects by using visual observation (Fig. 6(c)). To shorten the training process, the real object was replaced by a screen that kept playing a video of simulated objects in a loop. Once the tracking task failed or mechanical limits were reached, the manipulator returned to the zero position with the help of the encoders and prepared for the next training epoch. With the prior experience of basic actions, after only half an hour (10,000 steps), the robot reached an acceptable performance. Second, based on the model already learned in the tracking task, height information of the end-point was added into the rewards so that the manipulator learned to keep its axial distance to the object (Fig. 6(d)). The robot then tried to track the object with the least distance loss. After one and a half hours (20,000 steps), the robot achieved convergence. Finally, the network weights were saved to reproduce the two tasks. The video of simulation and real-world experiments is available at https://youtu.be/MhqBSI-SXQc.

V Conclusion

The method of this paper takes full advantage of inexplicit prior knowledge and accelerates the learning process by guiding it in approximately the right direction. Furthermore, the exploration of MBRL is still ensured by an automatically adjusted coefficient. An empirical result, obtained by visualizing the KL-divergence between action distributions, supports our theory. The experiments we conducted suggest that the designed continuum robot can be applied to minimally invasive surgery.

Despite the careful framework design, success has so far been shown only in a simple action space. More effort is needed to corroborate this idea on rigid robots and mobile robots.


This work was supported by the National Natural Science Foundation of China (Grant No. 61973210), Shanghai Science and Technology Commission (Grant No. 17441901000), the Medical-engineering Cross Projects of SJTU (Grant Nos. YG2019ZDA17, ZH2018QNB23), and the Scientific Research Project of Huangpu District of Shanghai (Grant No. HKQ201810).


  • [1] B. A. Jones and I. D. Walker, “Kinematics for multisection continuum robots,” IEEE Transactions on Robotics, vol. 22, no. 1, pp. 43–55, 2006.
  • [2] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, et al., “Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation,” arXiv preprint arXiv:1806.10293, 2018.
  • [3] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Sim-to-real transfer of robotic control with dynamics randomization,” in 2018 IEEE international conference on robotics and automation (ICRA), pp. 1–8, IEEE, 2018.
  • [4] T. G. Thuruthel, E. Falotico, F. Renda, and C. Laschi, “Model-based reinforcement learning for closed-loop dynamic control of soft robotic manipulators,” IEEE Transactions on Robotics, vol. 35, no. 1, pp. 124–134, 2018.
  • [5] E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics engine for model-based control,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033, IEEE, 2012.
  • [6] R. S. Sutton, “Dyna, an integrated architecture for learning, planning, and reacting,” ACM Sigart Bulletin, vol. 2, no. 4, pp. 160–163, 1991.
  • [7] G. Kalweit and J. Boedecker, “Uncertainty-driven imagination for continuous deep reinforcement learning,” in Conference on Robot Learning, pp. 195–206, 2017.
  • [8] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, “Continuous deep q-learning with model-based acceleration,” in International Conference on Machine Learning, pp. 2829–2838, 2016.
  • [9] V. Feinberg, A. Wan, I. Stoica, M. I. Jordan, J. E. Gonzalez, and S. Levine, “Model-based value estimation for efficient model-free reinforcement learning,” arXiv preprint arXiv:1803.00101, 2018.
  • [10] J. Buckman, D. Hafner, G. Tucker, E. Brevdo, and H. Lee, “Sample-efficient reinforcement learning with stochastic ensemble value expansion,” in Advances in Neural Information Processing Systems, pp. 8224–8234, 2018.
  • [11] T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel, “Model-ensemble trust-region policy optimization,” arXiv preprint arXiv:1802.10592, 2018.
  • [12] M. Janner, J. Fu, M. Zhang, and S. Levine, “When to trust your model: Model-based policy optimization,” arXiv preprint arXiv:1906.08253, 2019.
  • [13] J. Oh, S. Singh, and H. Lee, “Value prediction network,” in Advances in Neural Information Processing Systems, pp. 6118–6128, 2017.
  • [14] D. L. Moreno, C. V. Regueiro, R. Iglesias, and S. Barro, “Using prior knowledge to improve reinforcement learning in mobile robotics,” Proc. Towards Autonomous Robotics Systems. Univ. of Essex, UK, 2004.
  • [15] N. Bougie and R. Ichise, “Deep reinforcement learning boosted by external knowledge,” in Proceedings of the 33rd Annual ACM Symposium on Applied Computing, pp. 331–338, ACM, 2018.
  • [16] T. George Thuruthel, Y. Ansari, E. Falotico, and C. Laschi, “Control strategies for soft robotic manipulators: A survey,” Soft Robotics, vol. 5, no. 2, pp. 149–163, 2018. PMID: 29297756.
  • [17] M. W. Hannan and I. D. Walker, “Kinematics and the implementation of an elephant’s trunk manipulator and other continuum style robots,” Journal of robotic systems, vol. 20, no. 2, pp. 45–63, 2003.
  • [18] I. A. Gravagne, C. D. Rahn, and I. D. Walker, “Large deflection dynamics and control for planar continuum robots,” IEEE/ASME Transactions on Mechatronics, vol. 8, no. 2, pp. 299–307, 2003.
  • [19] D. Trivedi, A. Lotfi, and C. D. Rahn, “Geometrically exact dynamic models for soft robotic manipulators,” in 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1497–1502, IEEE, 2007.
  • [20] I. A. Gravagne and I. D. Walker, “Kinematic transformations for remotely-actuated planar continuum robots,” in Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No.00CH37065), vol. 1, pp. 19–26 vol.1, 2000.
  • [21] M. Hannan and I. Walker, “Novel kinematics for continuum robots,” in Advances in Robot Kinematics, pp. 227–238, Springer, 2000.
  • [22] D. C. Rucker and R. J. Webster III, “Statics and dynamics of continuum robots with general tendon routing and external loading,” IEEE Transactions on Robotics, vol. 27, no. 6, pp. 1033–1044, 2011.
  • [23] M. Mahvash and P. E. Dupont, “Stiffness control of surgical continuum manipulators,” IEEE Transactions on Robotics, vol. 27, no. 2, pp. 334–345, 2011.
  • [24] M. T. Chikhaoui, S. Lilge, S. Kleinschmidt, and J. Burgner-Kahrs, “Comparison of modeling approaches for a tendon actuated continuum robot with three extensible segments,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 989–996, 2019.
  • [25] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” arXiv preprint arXiv:1801.01290, 2018.