I Introduction
For decades, massive efforts have been made to make machines intelligent, in the expectation of relieving humans from repetitive, dangerous, and heavy work. In traditional robotics, robot control is realized by establishing kinematic and dynamic models in the form of transformation matrices. This method has achieved excellent results for conventional robots with discrete rigid links, but becomes difficult to implement for soft robots such as continuum robots. In the traditional method, several subjective assumptions have to be made to control continuum manipulators, leading to deviations from actual circumstances and inaccurate results [1]. Moreover, the kinematic and dynamic models of continuum robots are often described in the form of nonlinear partial differential equations, which makes control even more complex.
Ever since reinforcement learning (RL) theory was proposed, developers have been trying to apply it to robotics. With the introduction of RL methods, the traditional method in rigid robotics is enhanced with the idea of trial-and-error [2][3]. But the application of RL theory to continuum robots still meets some resistance. To the best of our knowledge, only a few recent studies have applied RL to control continuum robots [4]. In Thuruthel et al.'s research, an accurate Vicon tracking system is provided for realizing closed-loop control from a third-person perspective. However, the devices used in their research are not available in most application scenarios of continuum robots. Furthermore, data-inefficiency is the major drawback of RL algorithms, especially for a non-stationary continuum robot, which can make learning on the real-world robot impractical.
In this paper, we focus on automatic kinematics learning of complex robotic systems and end-to-end predictive control using a visual servo from a first-person perspective. The main problem we tackle is the data-efficiency of complex and non-stationary real-world robotics. We use inexplicit prior knowledge to accelerate the convergence of the learning process. Meanwhile, the ability to explore is still guaranteed by an auto-adjusted exploitation coefficient.
To evaluate our proposed method empirically, we first build a simulator with MuJoCo [5] and then apply the method directly on a real-world continuum robot. Our primary contributions are as follows:

A new model-based RL framework that integrates inexplicit prior knowledge (IPK) is proposed.

A Kalman-filter-based selector is designed to evaluate the accuracy of the hybrid controller.

To balance the prior knowledge and RL, we introduce an exploitation coefficient that is adjusted automatically according to the KL divergence.

Simulation and real-world experiment results demonstrate the data-efficiency of our method, which requires fewer interactions than state-of-the-art model-based methods.
II Related Work
II-A Model-based Reinforcement Learning
The term model-based is easily ambiguous: it can represent either a given model, as in MPC, or a learned model, as mainly used in RL. In this paper, model-based refers to a model learned from explored data when either the system model or the environment model is unknown.
Model-based reinforcement learning (MBRL) began with the Dyna [6] architecture. Compared to model-free reinforcement learning (MFRL), it is undoubtedly more suitable for robotic systems due to the data-efficiency of taking full advantage of experience data. Since MBRL uses a learned dynamic model to promote the learning process, the model's uncertainty can introduce incorrect transitions and impair value function approximation [7]. Gu et al. [8] consider the weakness of neural networks on mini-batch data and use a linear time-series model to model the environment. MVE [9] controls the uncertainty of the model by limiting the imagination of the model to a fixed depth. STEVE [10] improves on MVE by dynamically interpolating between model rollouts of different horizon lengths for each example, ensuring that models are used without redundant errors. Moreover, the model-ensemble technique in STEVE is inspiring, and can also be found in ME-TRPO [11]. Most recently, Janner et al. [12] proposed monotonic model-based policy optimization (MBPO) to provide a performance guarantee. The above methods are all continuations of Dyna and learn an observation-prediction model. In contrast, VPN [13] directly predicts future values and rewards instead of an environment model.
II-B Reinforcement Learning with Prior Knowledge
Although MBRL algorithms achieve impressive success, they still take too many timesteps (e.g., the state-of-the-art MBRL method MBPO still needs 5k steps even for a simple Pendulum task), which remains impractical for real-world robot applications. Instead of merely learning from scratch, some prior knowledge of the robot system can be brought in to make learning both more stable and more efficient.
Moreno et al. [14] add a set of prior knowledge sources as a basic controller and use a credit assignment block to judge when to explore by RL. However, their evaluation function is designed by hand and acts in an if-else manner. In addition, Bougie et al. [15] use another Q-learning model to select the best action. This module is trained using a Boltzmann distribution as an explorer on the whole experience replay buffer. They empirically demonstrate that it can boost A3C by injecting prior features for important exploration areas.
II-C Continuum Robot Control
Control of continuum robots has been widely explored with traditional methods [16]. Researchers tend to establish manipulator kinematic and dynamic models derived from several geometric assumptions. The most commonly used model simplifies the control problem based on the constant curvature (CC) approximation and linearized feedback [1][17]. This CC model performs worse when external loads are non-negligible [18][19]. As an alternative, mechanics-modified models have been used in continuum robotics. Walker, Hannan, and Gravagne introduced hyper-redundant robotics [20][21], and a large-deflection dynamic model was used in their research [18]. Considering the backbone of a continuum robot as an elastic rod, Webster et al. [22] and Mahvash et al. [23] respectively applied the Cosserat rod model in their research. Although an increase in accuracy is found, solutions of those models, described in the form of nonlinear partial differential equations, are sensitive to parameters and time-consuming [19][24], which inevitably increases the complexity of control in continuum robotics.
III RL Based on Inexplicit Prior Knowledge
Humans always have some intuitive inexplicit prior knowledge (IPK) about robot control, which might be inaccurate but is generally on the right path. To avoid useless exploration in a complex manipulator system, the general trend of movement can be pointed out and taught to the ignorant robot. All the robot needs to do is keep amending this movement-trend mapping with data until it finally obtains a reliable and explicit mapping. Following this idea, the main framework of our method is shown in Fig. 1.
III-A Exploration Guided by Prior Knowledge
The so-called inexplicit knowledge in this article is the approximate motion direction of each tendon motor. Certainly, this kind of information is much easier to obtain than a kinematic model: it can be gathered by powering up each motor and recording its behavior. We use these motion directions as a coordinate system to measure the location of the target. This prior knowledge can be regarded as a basic controller, which provides the simplest way to control a robot. For each time step, the target direction is first confirmed. Then the horizontal and vertical coordinate components are calculated. Finally, a motor in that direction is randomly selected to perform the motion.
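As a minimal sketch of the selection rule above, assuming the prior knowledge is stored as rough 2-D direction vectors per motor (all names and the projection rule are illustrative, not taken from the paper):

```python
import numpy as np

def ipk_basic_action(target_xy, motor_dirs, rng=np.random):
    """Pick one tendon motor whose (roughly known) motion direction
    points toward the target offset in image coordinates.

    target_xy  : 2-D offset of the target from the visual center.
    motor_dirs : (n_motors, 2) array of approximate unit directions,
                 the only 'inexplicit prior knowledge' assumed here.
    """
    # Project the target offset onto each motor's approximate direction.
    scores = motor_dirs @ target_xy
    # Motors whose direction roughly agrees with the target direction.
    candidates = np.flatnonzero(scores > 0)
    if len(candidates) == 0:
        return rng.randint(len(motor_dirs))  # no aligned motor: explore
    # Randomly select one of the aligned motors, as described above.
    return rng.choice(candidates)
```

Note that only the sign of the alignment is used here, matching the notion that the prior is inexplicit: the directions need not be accurate, only roughly correct.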
The intention of this part is to prolong the task horizon and make it possible to sample more successive action-state pairs. In this paper, we adopt soft actor-critic (SAC) [25] as our policy gradient algorithm and MBPO as our MBRL algorithm. The primary procedure of MBPO is to employ a uniform policy that generates random actions to guarantee the exploration scope. However, this entails a major risk of a robot crash and may cost a tremendous amount of time to reset. Both are insufferable in a real-world application.
We tackle this by setting two sets of action outputs, one from the IPK basic controller and another from MBPO. Each replay buffer record is augmented to hold both actions, where the subscript IPK stands for information from the IPK subsystem and MBPO for information from MBPO. In this way, one experience can be divided into two parts with different uses. Actions from the IPK subsystem are used to interact with the environment in practice and obtain the real reward and the real next observation. In contrast, MBPO information is used only in policy updates, from which the approximate reward and next observation can be estimated.
(1)  
Intuitively, IPK actions guarantee that the robot will eventually reach its target with high probability, and the MBPO part can still improve its policy with a certain degree of precision. Therefore, the initial exploration procedure is executed once but gains experience twice, which is obviously more efficient than the original MBPO.
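A sketch of the augmented replay buffer described above: each real transition is stored with both the executed IPK action and the merely proposed MBPO action (class and field names are assumptions, not the paper's notation):

```python
from collections import deque
import random

class AugmentedReplayBuffer:
    """Replay buffer holding one real environment transition per step,
    but with two candidate actions: the IPK action actually executed
    and the MBPO action that was only proposed (used for policy
    updates via the learned model)."""
    def __init__(self, capacity=100_000):
        self.buf = deque(maxlen=capacity)

    def push(self, obs, a_ipk, a_mbpo, reward, next_obs):
        self.buf.append((obs, a_ipk, a_mbpo, reward, next_obs))

    def sample(self, batch_size):
        batch = random.sample(self.buf, batch_size)
        # Transpose to columns: obs, a_ipk, a_mbpo, reward, next_obs.
        return list(zip(*batch))
```

One interaction thus feeds two learning signals, which is where the "explore once, gain experience twice" efficiency claim comes from.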
III-B Fusion Controller
After the initial exploration procedure, MBPO trains a Gaussian policy as the main policy. Correspondingly, the IPK subsystem also takes on a new role: the fusion controller.
Although MBPO disentangles the task horizon and model horizon by querying the model only for short rollouts, it is still limited by the probability of reaching the target, especially in sparse-reward problems. Since the IPK basic controller is rule-based, it is convenient to assess its performance. From the initial replay buffer and the log of task lengths, we can revert the data to its full form. At each time step, we can get the target vector both before and after the action, so the deviation of each action from the anticipated direction can be easily estimated. These deviations can be depicted as a Gaussian distribution. Moreover, the raw actions of SAC also follow a Gaussian distribution. How can we use both of these pieces of information?
A very natural thought is to fuse the basic controller's output Gaussian with the SAC action distribution. The Kalman filter is a common method to fuse measurement information from multiple sensors, and the fused result tends to be more accurate than any single source. As shown in Fig. 2, we use a Kalman filter to integrate the outputs from the two controllers and acquire a new fused distribution. This procedure takes place before the reparameterization trick of SAC.
(2)  
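A one-dimensional sketch of this fusion step, assuming both controllers output independent Gaussian estimates (a simplified stand-in for Equation (2), not its exact form):

```python
def fuse_gaussians(mu1, var1, mu2, var2):
    """Kalman-style fusion of two independent 1-D Gaussian estimates.
    The fused mean is precision-weighted toward the more confident
    source; the fused variance is smaller than either input,
    reflecting the increased confidence of the combined estimate."""
    k = var1 / (var1 + var2)          # Kalman gain
    mu = mu1 + k * (mu2 - mu1)        # precision-weighted mean
    var = (1.0 - k) * var1            # fused variance
    return mu, var

# With equal variances the fused mean lies halfway between the two
# sources and the variance is halved.
mu, var = fuse_gaussians(0.0, 1.0, 2.0, 1.0)
```

In the actual controller this fusion would be applied per action dimension before SAC's reparameterization step.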
Our motivation for introducing the IPK subsystem is to demonstrate and guide the MBRL algorithm in order to reduce time wasted on useless exploration at the beginning, but not to limit it. Some motions, like axial distance maintenance and real-time tracking, cannot gain enough information from the IPK basic controller and still need to rely on exploration. So the MBPO reward estimation here is more complicated. We set an exploitation coefficient to balance exploration and exploitation, inspired by the temperature coefficient in MBPO.
Theorem III.1 (Exploitation Coefficient Auto-Adjustment)
Let one distribution be the Gaussian output of the policy and the other be the fused distribution. Then the exploitation coefficient is related to the KL divergence between these two distributions.
(3)  
where the threshold is a hyperparameter limiting the KL divergence.
Proof. See Appendix A.1.
This coefficient is then used to trade off exploration and exploitation.
(4) 
Meanwhile, this coefficient is also used in Equation (2) as a weight parameter.
(5)  
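For illustration, the closed-form KL divergence between two 1-D Gaussians, together with one plausible auto-adjustment rule that caps the coefficient using a KL-limiting hyperparameter. The adjustment rule is an assumption for illustration, not the paper's exact Equation (3):

```python
import math

def kl_gauss(mu_p, var_p, mu_q, var_q):
    """KL(P || Q) for two 1-D Gaussians, in closed form."""
    return 0.5 * (math.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def exploitation_coeff(kl, eps=1.0):
    """Illustrative auto-adjustment: the coefficient grows with the KL
    divergence between the policy and the fused distribution, capped
    at 1; eps plays the role of the KL-limiting hyperparameter."""
    return min(kl / eps, 1.0)
```

As the learned policy converges toward the fused distribution, the KL divergence shrinks and the coefficient decays automatically, which matches the trend reported in the ablation study.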
The policy evaluation step is similar to Soft Policy Evaluation [25]; it ensures that we can obtain the soft value function for any policy. However, we need to prove that the new policy achieves a higher value than the old one by limiting the KL divergence.
Theorem III.2 (Fusional Policy Improvement)
Proof. See Appendix A.2.
By applying Soft Policy Evaluation and Fusional Policy Improvement repeatedly, the policy will eventually converge to the optimum, as proved in Theorem 1 of SAC [25].
IV Experiment
In this paper, we train the continuum robot, without a kinematic model, to aim at a target object by controlling the shift of multiple tendon drivers, to track the movement of the target, and to maintain a certain axial distance. In minimally invasive surgery, target tracking and axial-distance keeping are critical for surgeons to concentrate on the practice, since lesions vibrate with the patient's breathing and other organ movements. To verify our idea, we first carry out experiments in a purpose-built simulator and analyze ablations of the method. Then we deploy it directly on a real-world continuum robot of our own design.
IV-A Simulation
MuJoCo is used to build a continuum robot model with the physical manipulator as reference. It can be divided into two motion sections, each composed of 10 serially connected joints. Both sections are actuated by two sets of tendon-driven systems at the endpoint and have two degrees of freedom (DOF) each. The panorama of the simulator is illustrated on the left of Fig. 3. Following existing studies, we use the epoch return to evaluate the performance of different algorithms. It accumulates the change of 3-dimensional Euclidean distance after each step, with a reward when the target reaches the visual center and a punishment when it leaves the field of view. Deviation from the desired axial distance is also treated as a penalty. During the training process, each epoch has 1000 time steps with 20-step model rollouts. We compare our method with the state-of-the-art MBRL algorithm MBPO and the MFRL algorithm SAC. To reveal the effect of IPK-guided exploration, the fusion controller is unpacked into a basic rule-based controller and an MBPO controller guided by IPK. The performance comparison is shown in Fig. 4.

From Fig. 4, we discover that by introducing inexplicit prior knowledge, IPK-MBPO improves faster than the original MBPO. After about 10 epochs, IPK-MBPO reaches and surpasses the basic controller and converges to better performance than either the basic controller or the other two SOTA algorithms. Benefiting from the idea of fusion control, the performance of the final fusion controller stays in a good range throughout the whole training process.
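A minimal sketch of the per-step reward described above; all weights and thresholds are assumptions for illustration, not the paper's values:

```python
def step_reward(dist_prev, dist_now, in_fov, axial_err,
                center_thresh=0.05, w_axial=0.1):
    """Illustrative per-step reward: progress toward the target, a
    bonus when the target reaches the visual center, a penalty when it
    leaves the field of view, and an axial-distance penalty.
    center_thresh and w_axial are assumed, not from the paper."""
    r = dist_prev - dist_now            # reduction in Euclidean distance
    if dist_now < center_thresh:
        r += 1.0                        # target reached the visual center
    if not in_fov:
        r -= 1.0                        # target left the field of view
    r -= w_axial * abs(axial_err)       # keep the desired axial distance
    return r
```

The epoch return is then just the sum of this quantity over the 1000 steps of an epoch.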
IV-B Ablation Study
The most critical part of the IPK framework is action fusion.
By recording the mean KL divergence between IPK-MBPO and the fusion controller in each epoch, the exploitation coefficient can be calculated. In Fig. 5, we can see that both the KL divergence and the exploitation coefficient descend throughout the training process. This demonstrates that the strong performance of the fusion controller does not merely rely on the basic controller but increasingly on the data-driven IPK-MBPO controller. Moreover, it also confirms the exploitation coefficient auto-adjustment theory in Theorem III.1.
IV-C Experiment on a Real-World Continuum Robot
To validate the effectiveness of the IPK algorithm, a real-world continuum robot is designed with the same mechanical structure as described in the simulation. Plastic joints are evenly arranged and fixed on an elastic rod with large deflection, which provides the necessary resilience as the backbone of the robot. Tendons are threaded through the joints. Every two symmetrically arranged tendons attached to the same endpoint provide one DOF by producing strains in opposite directions. The transmission structures in these tendon-servo system sets are optimized by using screw rods with normal and reverse threads on their two ends. Thus the two tendons linked to the same DOF can be driven by one servo motor, which avoids the accuracy loss caused by motor synchronization and structural redundancy. As a result, a one-to-one correspondence is formed between DOFs and motors. The physical structure of this part is illustrated in Fig. 6(a). In this way, the structure of the continuum manipulator is greatly simplified, with lighter weight and higher accuracy, to fit the simulation within the error range.
As in simulation, the continuum robot needs some extra devices to perceive the experimental environment. A pinhole camera is fixed on the endpoint to gather information for tracking tasks. Encoders on the servo motors ensure that IPK actions are executed precisely and protect the manipulator from being damaged in over-range conditions. An extra camera is set up facing the robot, used only for result evaluation. The gathered information is shown in Fig. 6(b).
The real-world experiment process was similar to that in simulation. Different from sim-to-real studies, the real-world model is not transferred from the simulation but learned directly from the real environment. In this way, the model can learn the uncertainty of the real environment and take these errors into account. The experiment was carried out in two main steps. First, the continuum manipulator was trained to track objects using visual observation (Fig. 6(c)). In this case, to shorten the training process, the real object was replaced by a screen that kept playing the video of simulated objects in a loop. Once the tracking task failed or mechanical limits were reached, the manipulator would return to the zero position with the help of the encoders and prepare for the next training epoch. With the prior experience of basic actions, after only half an hour (10,000 steps) the robot achieved acceptable performance. Second, based on the model already learned in the tracking task, the height information of the endpoint was added to the reward, and the manipulator learned to keep the axial distance to the object (Fig. 6(d)). The robot would then try to track the object with the least distance loss. After an hour and a half (20,000 steps), the robot achieved convergence. Finally, the network weights were saved to reproduce the two tasks. A video of the simulation and real-world experiments is available at https://youtu.be/MhqBSISXQc.
V Conclusion
The method in this paper takes full advantage of inexplicit prior knowledge and accelerates the learning process by guiding it in an approximately right direction. Furthermore, the exploration of MBRL is still ensured by learned coefficients. An empirical result is given by visualizing the KL divergence between action distributions, which supports our theory. Through the experiments we conducted, the designed continuum robot shows it can be applied to minimally invasive surgery.
Despite the delicate framework design, success has so far only been demonstrated in simple action spaces. More effort needs to be made to corroborate this idea on rigid robots and mobile robots.
Acknowledgment
This work was supported by the National Natural Science Foundation of China (Grant No. 61973210), Shanghai Science and Technology Commission (Grant No. 17441901000), the Medicalengineering Cross Projects of SJTU (Grant Nos. YG2019ZDA17, ZH2018QNB23), and the Scientific Research Project of Huangpu District of Shanghai (Grant No. HKQ201810).
References
 [1] B. A. Jones and I. D. Walker, “Kinematics for multisection continuum robots,” IEEE Transactions on Robotics, vol. 22, no. 1, pp. 43–55, 2006.
 [2] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, et al., “QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation,” arXiv preprint arXiv:1806.10293, 2018.
 [3] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Sim-to-real transfer of robotic control with dynamics randomization,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8, IEEE, 2018.
 [4] T. G. Thuruthel, E. Falotico, F. Renda, and C. Laschi, “Model-based reinforcement learning for closed-loop dynamic control of soft robotic manipulators,” IEEE Transactions on Robotics, vol. 35, no. 1, pp. 124–134, 2018.
 [5] E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033, IEEE, 2012.
 [6] R. S. Sutton, “Dyna, an integrated architecture for learning, planning, and reacting,” ACM Sigart Bulletin, vol. 2, no. 4, pp. 160–163, 1991.
 [7] G. Kalweit and J. Boedecker, “Uncertainty-driven imagination for continuous deep reinforcement learning,” in Conference on Robot Learning, pp. 195–206, 2017.
 [8] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, “Continuous deep Q-learning with model-based acceleration,” in International Conference on Machine Learning, pp. 2829–2838, 2016.
 [9] V. Feinberg, A. Wan, I. Stoica, M. I. Jordan, J. E. Gonzalez, and S. Levine, “Model-based value estimation for efficient model-free reinforcement learning,” arXiv preprint arXiv:1803.00101, 2018.
 [10] J. Buckman, D. Hafner, G. Tucker, E. Brevdo, and H. Lee, “Sample-efficient reinforcement learning with stochastic ensemble value expansion,” in Advances in Neural Information Processing Systems, pp. 8224–8234, 2018.
 [11] T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel, “Model-ensemble trust-region policy optimization,” arXiv preprint arXiv:1802.10592, 2018.
 [12] M. Janner, J. Fu, M. Zhang, and S. Levine, “When to trust your model: Model-based policy optimization,” arXiv preprint arXiv:1906.08253, 2019.
 [13] J. Oh, S. Singh, and H. Lee, “Value prediction network,” in Advances in Neural Information Processing Systems, pp. 6118–6128, 2017.
 [14] D. L. Moreno, C. V. Regueiro, R. Iglesias, and S. Barro, “Using prior knowledge to improve reinforcement learning in mobile robotics,” Proc. Towards Autonomous Robotics Systems. Univ. of Essex, UK, 2004.
 [15] N. Bougie and R. Ichise, “Deep reinforcement learning boosted by external knowledge,” in Proceedings of the 33rd Annual ACM Symposium on Applied Computing, pp. 331–338, ACM, 2018.
 [16] T. George Thuruthel, Y. Ansari, E. Falotico, and C. Laschi, “Control strategies for soft robotic manipulators: A survey,” Soft Robotics, vol. 5, no. 2, pp. 149–163, 2018. PMID: 29297756.
 [17] M. W. Hannan and I. D. Walker, “Kinematics and the implementation of an elephant’s trunk manipulator and other continuum style robots,” Journal of robotic systems, vol. 20, no. 2, pp. 45–63, 2003.
 [18] I. A. Gravagne, C. D. Rahn, and I. D. Walker, “Large deflection dynamics and control for planar continuum robots,” IEEE/ASME Transactions on Mechatronics, vol. 8, no. 2, pp. 299–307, 2003.
 [19] D. Trivedi, A. Lotfi, and C. D. Rahn, “Geometrically exact dynamic models for soft robotic manipulators,” in 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1497–1502, IEEE, 2007.
 [20] I. A. Gravagne and I. D. Walker, “Kinematic transformations for remotely-actuated planar continuum robots,” in Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No.00CH37065), vol. 1, pp. 19–26, 2000.
 [21] M. Hannan and I. Walker, “Novel kinematics for continuum robots,” in Advances in Robot Kinematics, pp. 227–238, Springer, 2000.
 [22] D. C. Rucker and R. J. Webster III, “Statics and dynamics of continuum robots with general tendon routing and external loading,” IEEE Transactions on Robotics, vol. 27, no. 6, pp. 1033–1044, 2011.
 [23] M. Mahvash and P. E. Dupont, “Stiffness control of surgical continuum manipulators,” IEEE Transactions on Robotics, vol. 27, no. 2, pp. 334–345, 2011.
 [24] M. T. Chikhaoui, S. Lilge, S. Kleinschmidt, and J. Burgner-Kahrs, “Comparison of modeling approaches for a tendon actuated continuum robot with three extensible segments,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 989–996, 2019.
 [25] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” arXiv preprint arXiv:1801.01290, 2018.