Soft continuum robots, featured with flexible geometric shapes and resiliently deformable materials, are gradually exhibiting great potentials in tasks under dangerous and cluttered environments [majidi2014soft]. Such properties also allow them to closely mimic animal behaviors, which are widely studied in bio-inspired robotics [onal2012modular, coyle2018bio]
. However, due to the infinitely many degrees of freedom and hard-to-control dynamics of soft robots, planning and control of soft robots remains a challenging problem.
In this paper, we are motivated to study bio-inspired approaches for locomotion control of a soft snake robot, designed and fabricated in our lab [onal2013autonomous]. The crucial observation from nature is that most animals with soft bodies and elastic actuators can learn and adapt to various new motion skills with just a few trials. This is mainly attributed to the prior information encoded into their vertebrate neural circuits [yuste2005cortex] and the evolving operations by the primary motor cortex (M1) in the brain. Such a mechanism allows animals to learn, from a reasonable number of trials and errors, to regulate the inputs to neural circuits for achieving desired motion [kleim1998functional].
We propose a control design for soft snake robots with two key features: First, we use model-free deep reinforcement learning [sutton2000policy, schulman2017proximal] in adaptive control to mitigate the uncertainties in the dynamic responses of soft actuators; Second, we employ a Central Pattern Generator (CPG) network consisting of Matsuoka oscillators as low-level motion controller to ensure the stability of the learning-based control system and to improve the diversity of locomotion behaviors in the soft snake robot. Figure 2 shows the final structure of our controller composed of the deep policy Neural Networks (NNs) for the Proximal Policy Optimization Option-Critics (PPOC) algorithm and the Matsuoka CPG net. These two modules – a reinforcement learning (RL) module and a CPG net – are integrated as follows: The RL module learns to control neural stimuli inputs and frequency ratio to a CPG network given state feedback from the soft snake robot and the control objective. The neural stimuli input regulates the amplitudes and phases of the outputs in the CPG net in real-time, which steers the snake to the desired direction. While the frequency ratio has a long term effect that changes the oscillating frequency of the CPG net, resulting in changes in the velocity of the robot. The output of the CPG net is directly transformed as actuation inputs to the robot. The system is closed-loop and Bounded-input, Bounded-Output (BIBO) stable due to the stability of the CPG net [matsuoka1985sustained, Thm.1].
I-a Related work
In literature, bio-inspired control methods, including bipedal [mori2004reinforcement, endo2008learning, nassour2014multi, dzeladini2018cpg] and serpentine locomotion [crespi2005swimming, crespi2008online, ryu2010locomotion, bing2017towards, wang2017cpg, bing2019end], have been studied for the control design of robotic locomotion. The key concept of bio-inspired control is to generate motion trajectories mimicking animals’ behaviors and then to track these trajectories with a closed-loop control design. The CPG
, also known as a neural oscillator, is a classical nonlinear system that models the neuron circuits and their firing mechanisms that control organ contractions and body movements in animals[roberts1998central, yuste2005cortex]. In [ijspeert2008central], the authors developed a trajectory generator for a rigid salamander robot using Kuramoto CPGs and used low-level PD controllers to track the desired motion trajectories generated by the oscillator. In [ryu2010locomotion], the authors improved the synchronization property of the CPG by adapting its frequency parameter with additional linear dynamics. In [wang2017cpg], the authors introduced a control loop that adjusts the frequency and amplitude of the oscillation for adapting to changes in the terrain. The most recent work [bing2019end] employed Spiking Neural Net (SNN) under the Reward-Modulated Spike-Timing-Dependent Plasticity (R-STDP) scheme to learn operations of a Kuramoto CPG net that drives a rigid snake robot towards the goal object using visual information. Despite the success of bio-inspired control on those rigid robots, it may be infeasible to use the existing control scheme on soft robots. These rigid robot controllers depend on high-performance encoders for tracking the CPG trajectories with a small error. Such tracking performance is hard to achieve given the unstable properties of soft materials during contact, and the nonlinear and stochastic dynamical response from the soft actuators.
I-B Main contributions
Generally speaking, the proposed control design leverages adaptability and optimality in deep reinforcement learning and stability and diversity in behavior patterns in the CPG system. The insight from the CPG system guides the architecture design of deep NNs to improve data efficiency and learning rate using hierarchical RL, in particular, the option-critic framework [bacon2017option]. Moreover, this paper leverages the high-fidelity of a physics-based simulator and domain randomization techniques [tobin2017domain], the trained policy in simulator is directly applied to the real soft snake robot, resulting in desirable and intelligent tracking of moving targets.
Ii System Overview
Each soft link of the robot is made of Ecoflex™ 00-30 silicone rubber. The links are connected through rigid bodies enclosing the electronic components that are necessary to control the snake robot. In addition, the rigid body components have a pair of wheels, that model the anisotropic friction that a real snake would have from its scales. Only one chamber on each link is active (pressurized) at a time. The simulator made by Gasoto, et al.[renato2019] models the inflation and deflation of the air chamber, as well as the resulting deformation of the soft bodies with tetrahedral finite elements. Simulating a full snake with links takes less than per step, allowing real-time performance on a sampling frequency og 60 . The link curvature error between simulator and real robot is within corresponding to the input pressure ranged from to .
Figure 1 shows the notation of state space configuration of the robot. At time , state is the planar Cartesian position of the snake head, is the distance from to the goal position, is the distance travelled along the goal direction from the initial head position , is the instantaneous planar velocity of the robot, and
is the projection of this velocity vector to the goal direction,is the angular deviation between the goal direction and the velocity direction of the snake robot. According to [luo2014theoretical], the bending curvature of each body link at time is computed by where and are the relative bending angle and the length of the middle line of the soft body link.
Iii The Matsuoka Oscillator Network
Neural oscillators, including Matsuoka oscillator [matsuoka1985sustained] and Hindmarsh-Rose neuron model [storace2008hindmarsh] are driven by spiking or constant stimulus signals. These CPGs are less commonly used in bio-inspired robot control because the wave patterns generated by these CPGs are often difficult to predict and not directly operable. However, their complex bifurcation structures provide richer trajectory patterns, including limit cycles, equilibrium points, and even chaos. Based on this, we use Matsuoka oscillator as a lower level motion controller for the our soft snake robot.
Iii-a Primitive Matsuoka CPG
A primitive Matsuoka CPG consists of two mutually inhibited neurons named extensor and flexor. We use subscripts and to represent variables related to extensor neuron and flexor neuron, respectively. Such structure can be represented by the following dynamical system,
where the tuple , represents the activation state and self-inhibitory state of -th neuron respectively, is the output of -th neuron, is a weight parameter, and are the tonic inputs to the oscillator. There are two time constants in the system, is the rate of discharge and is the adaptation rate of the system. Both of them are multiplied by a frequency ratio . The parameters and are mutual inhibition weights between flexor and extensor in a primitive oscillator. Since the symmetricity condition in the previous work[matsuoka1985sustained, matsuoka1987mechanisms] suggests , a new symbol will be used to represent the equivalent mutual inhibition weights of primitive oscillator in the rest part of this paper. The parameter is the inhibition weight of the coupling signal from the output of another coupled primitive oscillator. In our system, all couplings terms are inhibiting signals (negatively weighted), and only the tonic inputs are activating signals (positively weighted).
Iii-B Configuration of the Matsuoka CPG network on soft snake robot
We first construct a coupled CPG system for the soft snake robot. In Fig. 2, the CPG Net section shows the structure of our CPG Network. It is an inverted, double-sized version of Network VIII introduced in Matsuoka’s paper [matsuoka1987mechanisms]. There are overall eight nodes in the network. Each pair of nodes (e.g., the two nodes colored green) in a row represents a primitive Matsuoka oscillator (1). The edges are corresponding to the coupling relations among the nodes. In this graph, all the edges with hollowed endpoints are positive activating signals, while the others with solid endpoints are negative inhibiting signals. The network includes four linearly coupled primitive Matsuoka oscillators. The oscillators are numbered from head to tail of the robot. The outputs from the primitive oscillators are the normalized input ratio to the maximum safe pressure input for the pneumatic actuators of each chamber ( stands for a full inflation of the -th extensor actuator and zero inflation of the -th flexor actuator, and vice versa for ). As the safe input pressure for each soft actuator in our robot is [renato2019], the actual pressure input to the -th chamber is psi. The primitive oscillator with green nodes on the head link stands for the pace-maker oscillator of the entire CPG net. The pace-maker works as an initiator in the oscillating system, followed by the rest parts oscillating under certain phase difference in sequence. Fig. 2 also shows that all tonic inputs are activating signals. For simplicity, we’ll use a vector
to represent all tonic inputs to the CPG net.
However, a raw configuration is not enough to guarantee stable and synchronized oscillation for the whole system. To avoid configuring a non-rhythmic system, the following constraint must be satisfied [matsuoka1985sustained]:
Given this constraint, Genetic Programming (GP) algorithms can be implemented to configure a set of parameters in the system that optimizes the performance of serpentine locomotion controlled by the CPG system [ijspeert1999evolving].
For the goal-reaching task and serpentine locomotion scenario, we define the fitness function–the optimization criteria–in GP as follows:
where parameters are constant coefficients 111In experiments, the following parameters are used: , , , and . . The function is a weighted sum over the instantaneous velocity towards the goal, its instantaneous angular deviation, and distance traveled on the goal direction at termination time . A better fitted configuration should stay at a large to show the existence of oscillation after a period of time , which is treated as an evidence for sustained oscillation in this problem. This is important since only a sustained Matsuoka oscillator can generate stable limit cycles [matsuoka1985sustained]. In addition, it is also required to have less angular deviation from the goal direction (with a small ), and with more distance travelled on the goal direction ().
After 300 generations, GP converges to a parameter configuration of the CPG net, shown in Table. I in the Appendix. We use the Salesman tournament algorithm for the selection step, a Gaussian sampler for the mutation step, and cross two point method for the crossover step. The GP algorithm is implemented in Deap 1.2.2 [DEAP_JMLR2012]
, a python library for evolutionary algorithms.
Iv Maneuverability of the Matsuoka CPG net
Maneuvering a CPG system to generate various trajectory patterns is nontrivial due to its high dimensionality and nonlinearity. Thus, we aim to understand and leverage the dynamical properties of the CPG system in learning-based control design for improving the computational and data efficiency. Inspired by the mechanism of vehicle driving, we explore the tonic input as a throttle, stable equilibrium as braking, frequency ratio as a gearbox, and the amplitude bias and duty cycle of the tonic input as two mechanisms for steering control. We summarize the key properties of the CPG net into two categories – steering control and speed control of the soft snake robot.
Iv-a Key properties of Matsuoka Oscillator for gait generation
Biased tonic inputs for steering
It has been pointed out that the steering pattern of the snake robot can be realized by making joint angle trajectories asymmetric [ye2004turning]. Previous methods realize steering by either directly adding a displacement [ijspeert2008central] to the output of the CPG system, or using a secondary system such as a deep neural network to manipulate the output from multiple CPG systems [mori2004reinforcement]. In addition to these methods, we observe two different types of tonic inputs that can generate imbalanced output waveforms in the Matsuoka oscillator, which are: (1) applying biased amplitudes and (2) biased duty cycles to the tonic inputs. This observation provides more flexibility in steering control of snake robots.
The asymmetric amplitude can be achieved by applying biased values to a specific pair of tonic inputs of the system. In experiments, we observe that imbalanced tonic inputs between flexor and extensor can cause asymmetry in the output of the CPG system, resulting in the similar effect of a biased output amplitude. An example is shown in Fig. 2(a), showing a clockwise turning. The corresponding joint trajectory is given in Fig. 2(c), where one CPG output (blue curve) has an noticeable bias to the negative amplitude axis. This shows that the flexor chamber of the first link is bending more to the right-hand side of the robot.
From the time perspective, the asymmetry is related to the difference in time duration between the actuation of extensor and flexor. To be more specific, such a time difference is determined by the duty cycle of each CPG output . The duty cycle can be controlled by switching different modes between limit cycles and equilibrium points of the system. Instead of setting a fixed tonic input vector , we can use different tonic input vectors switched periodically. When the duty cycle of actuating signals are different between flexors and extensors, an asymmetric oscillating pattern will occur.
We construct two tonic input vectors and , with and . As Fig. 2(d) shows, when we set the duty cycle of to be in one oscillating period, the rest of time slot for will be filled with . The CPG output on each link shows an imbalanced oscillation with longer time duration on the negative amplitude axis, indicating longer bending time on the flexor. As a result, the robot makes a clockwise (right) turn, with a circle trajectory presented in Fig. 2(b).
Frequency and amplitude for speed regulation
, Matsuoka has provided a method to estimate the amplitudeand frequency of Matsuoka oscillator. We denote the estimated frequency as and estimated amplitude as . Based on [matsuoka2011analysis, eq. (5), eq. (6)], since and are the only changing parameters, the mapping relations can be concluded as
It becomes clear that the frequency and amplitude of the Matsuoka CPG system can be controlled independently by the frequency ratio and the tonic inputs , for . Moreover, since the frequency ratio only influences the strength but not the direction of the vector field of the Matsuoka CPG system, manipulating will not affect the stability. Figure 4 shows the distribution of locomotion velocity over different amplitudes and frequencies by taking uniform samples within the region for with all to be the same in one sample, and . What we can observe from Fig. 4 is that, given fixed tonic input, the average velocity increase nearly monotonically with the frequency ratio . While the amplitude of tonic input does not affect the velocity that much, especially when is low. Thus, we may use primarily for speed control for the robot.
Iv-B Equilibrium stability for the termination of oscillation
The Matsuoka CPG network is a BIBO stable system [matsuoka1985sustained, Thm.1]. Given bounded tonic inputs, the state trajectories of the system will converge to either a stable limit cycle, or an asymptotically stable equilibrium, or a chaotic attracting region. It is noted that a Matsuoka oscillator without tonic inputs, i.e., , directly turns into a non-harmonic damped system [matsuoka1985sustained, matsuoka1987mechanisms], it is possible to start and stop the oscillation by tuning the tonic inputs with specific values.
Besides letting , we observe that when the tonic input vector follows an exclusive form, that is, for , the CPG system converges to a stable equilibrium. As the example in Fig. 5 shows, when is switched from to , the system converges to a equilibrium. Noticed that the equilibrium point is not at the origin point, but fixed at some nonzero constant value, which leads to a constant air pressure input to the snake chambers. This means that the exclusive equilibrium can also bend the body of the snake to a certain turning direction before completely stopping the oscillation.
In conclusion, the key properties of maneuverability of the Matsuoka CPG net are summarized as follows:
The value of tonic input contributes mainly to the oscillating amplitude. Biased tonic inputs will result in unsymmetrical oscillation and thus turning in the robot. A non-zero tonic input can be used for either initiating the oscillation or stopping the oscillation.
Oscillating frequency has a great impact on the locomotion speed of the snake robot. In our system, lower frequency leads to higher speed. Further, the speed control by changing the frequency ratio can be made independent of steering control by changing the tonic inputs.
V Learning Hierarchical PPOC-CPG Net for Set Point Tracking
We employ a model-free RL algorithm, proximal policy optimization (PPO) [schulman2017proximal], as the ‘brain’ of the CPG network. The algorithm is expected to learn the optimal policy that takes state feedback from the robot and outputs tonic inputs and frequency ratio of the CPG net to generate proper oscillating waveforms for goal-reaching locomotion.
V-a Preliminary: Proximal Policy Optimization
We use reinforcement learning (RL) to learn an approximately optimal controller for soft snake locomotion. In RL, the control problem is modeled as an Markov Decision Process (MDP) with state space , action space , a discounting factor , an initial state , a reward function and a probabilistic transition function . Given a policy and an initial state , the value function is the total expected return given policy in the MDP.
Policy search [sutton2000policy] in model-free RL approximates the policy with parameterized functions and performs gradient descent to find the optimal policy function parameters. Given the total expected return
where is the vector of parameters in the policy/value function approximation,
is a trajectory sampled in the Markov chain induced by the policy on the MDP,is the accumulated discounted reward along the path. Based on this a gradient estimator is deducted as follows [sutton2000policy],
is the advantage function for reducing the variance of the gradient estimator.
Hierarchical RL is introduced to improve the learning performance by exploring an action space consisting of low-level actions and high-level sub-policies called options. Each option includes a set of states at which the option can be initiated, is an intra-option policy, and is the termination condition–
outputs the probability that the optionshould terminated at the state . (parameterized on ). After augmenting the action space with options, the policy in the MDP learns simultaneously a high-level policy and the set of options defined by the intra-option policies and termination functions. The option-critic framework [bacon2017option] approximates the high-level policy and options (intra-option policies and termination functions) using function approximations and search for the optimal parameters in these approximators. In [klissarov2017learnings], PPO is extended with the option-critic framework, referred to as the Proximal Policy Optimization Option-Critic (PPOC) method.
V-B PPOC-CPG net
In the CPG net, the change in frequency ratio will not affect the stability of the CPG net but can regulate the amplitude and frequency of the output directly (4), resulting in speed control of the robot. Thus, we use the option-critic framework [sutton1999between, precup2000temporal] to learn the optimal controller with the tonic inputs (low-level primitive actions) and frequency ratio (high-level actions) of the CPG net.
Modeling the soft snake robot as an MDP, the state vector is given by (see Fig. 1). The primitive actions are a subset of continuous tonic input vectors. In particular, we define a four dimensional action vector and map to tonic input vector as follows,
This mapping bounds the tonic input within , as only positive inputs make sense in the CPG net, and reduces the dimension of the input space from 8 to 4. Thus, it enables more efficient policy search. Furthermore, the action space after dimension reduction still includes tonic input vectors necessary for braking and for turning (see Section IV).
The set of options is defined by the domain of frequency ratios . We do not include in the action space because the frequency ratio needs not to change as frequently as the tonic inputs. Thus, we treat as options, which can be switched from one to another infrequently given the outputs of termination functions. Specifically, each option is defined by where allows the option to be available at any state in the MDP, is a frequency ratio, and is the termination function. The options share the same NNs for their intro-option policies and the same NNs for termination functions. However, these NNss takes different frequency ratios as part of the inputs. The set of parameters to be learned by policy search include parameters for intra-option policy function approximation, parameters for termination function approximation, and parameters for high-level policy function approximation (for determining the next option). PPOC in the openAI Baselines [baselines] is employed as the policy search in the RL module.
Now, referring back to Figure 2
, we have a Multi-layer perceptron (MLP) neural network with two hidden layers to approximate the optimal control policy that controls the inputs of theCPG net in (1). The output layer of MLP is composed of action (green nodes), option in terms of frequency ratio (pink node) and the terminating probability (blue node) for that option. The input layer of MLP consists of state vector (yellow nodes) and the last time step outputs. The inclusion of previous outputs as part of the observation inputs for the option-critic network is observed in several previous works [mori2004reinforcement, peng2018sim, hwangbo2019learning]. The purpose of this setup is to let the actor network learn its consequential dynamics by tracking the past actions in one or multiple steps. Due to the BIBO stability of the CPG net and that of the soft snake robots, we ensure that the closed-loop robot system with the PPOC-CPG controller is BIBO stable.
V-C Curriculum and reward design for goal-reaching
We aim to learn the goal-reaching controller for the soft snake robot. Inspired by previous work that uses curriculum teaching [karpathy2012curriculum]
accelerate motor skills learning, we design different levels of tasks as follows: At each task level, the center of goal is sampled based on the current location and head direction of the robot. Each sampled target comes with a red circle indicating the accepting region of the goal. The sampling distribution is a uniform distribution in the fan area determined by the range of angleand distance bound in polar coordinate given by the predefined curriculum.
As shown in Fig. 6, when the task level increases, we have , , , and ; that is, the robot has to be closer to the goal in order to be success, the goal is sampled in a range further from the initial position of the robot. We select discrete sets of , and determine a curriculum table.
The task level will be automatically upgraded if the training performance reach the desired success rate out of trials, i.e indicates at least successful goal-reaching tasks out of at level .
The reward function at time is defined as
where are constant weights, is the goal radius given by the curriculum levels from level current level , is the linear distance between the head of the robot and the goal, and is an indicator function to tell if current is within the acceptance radius for the -th level. This reward can be explained by two objectives. The first part multiplied by is a potential-based reward for encouraging movement toward the goal; While the second part multiplied by evaluates the level of skill the current agent has for the goal-reaching task. For every task, if the robot enters an accepting region with radius defined by -th level curriculum, it will receive a summation of rewards for all (the closer to the goal the higher this summation), shaped by the approach angle (the straighter the direction towards the goal, the higher the reward). If the agent reaches the goal defined by the current level, a new goal is randomly sampled in the current or next level. There are two failing situations, where the desired goal will be re-sampled and updated. The first situation is starving, which happens when the robot gets stuck with locomotion speed under the given threshold for a certain length of time. The second case is missing the goal, which happens when the robot keeps moving towards the wrong direction to the goal region for a certain amount of time.
Vi Experiment validation
We used a four-layered neural net configuration with hidden layer neurons. There are inputs consisting of a state vector in , a vector containing previous actions in , and the previous option and the terminating probability. The outputs are in , with current action in , and current option and terminating probability. The action vector is decoded by (5) to generate tonic input vector . At every step, the algorithm samples the current termination function to decide whether to terminate the current option and obtain a new frequency ratio or keep the previous option. The KL-divergence was set to 0.02, while and (See [bacon2017option, klissarov2017learnings]
for details about parameters). The backpropagation of the critic net was done with Adam Optimizer and a step size of. The starvation time for failing condition is . The missing goal criterion is determined by whenever stays negative for over time steps.
Vi-a Policy training
For goal-reaching tasks with increasing difficulty, Figure 6(a) shows improving performance over different levels versus the number of learning episodes. Figure 6(b) shows the corresponding total reward with respect to the number of learning episodes. As shown in Fig. 6(a)
, we first train the policy net with fixed options (at this moment, the termination probability is always, and a fixed frequency ratio is used). When both the task level and the reward cannot increase anymore (at about 3857 episodes), we allow the learning algorithm to change along with termination function
, and keep training the policy until the highest level in the curriculum is passed. In this experiment, the learning algorithm equipped with stochastic gradient descent converges to a near-optimal policy afterepisodes of training. The whole process takes about hours with snakes training in parallel on a workstation equipped with an Intel Core i7 5820K, 32GB of RAM, and one NVIDIA GTX1080 ti GPU. In order to compensate for simulation inaccuracies, most notably friction coefficients, we employed a domain randomization technique [tobin2017domain], in which a subset of physical parameters are sampled from a distribution with mean on the measured value. The Domain Randomization (DR) parameters used for training are on Table II, on the Appendix.
Figure 8 shows a sampled trajectory in the simulated snake robot controlled by the learned PPOC-CPG policy. Below the trajectory plot in Fig. 8 is the recorded pressure input trajectory to the first chamber. In the picture, green circles indicate the goals reached successfully, and the red circle represents a new goal to be reached next. The colors on the path show the reward of the snake state, with a color gradient from red to green, indicating the reward value from low to high. Several maneuvering behaviors discussed in Section IV are exhibited by the policy. First, as the higher frequency can result in lower locomotion speed, the trained policy presents a specific two-phase behavior — (1) The robot starts from a low oscillation frequency to achieve a high speed when it is far from the goal; (2) then it switches to higher oscillation frequency using different options when it is getting closer to the goal. This allows it to stay close on the moving direction straight towards the goal. If this still does not work, the tonic inputs will be used to force stopping the oscillation with the whole snake bending to the desired direction (see IV-B), and then restart the oscillation to acquire a larger turning angle.
Vi-B Experiments with the real robot
We apply the learned policy directly to the real robot. The policy is tested on goal-reaching tasks guided by a mobile robot with an accuracy radius of meter (The robot base has a meter radius). The learning policy obtains the mobile robot position using the Mocap system in , and send the control commands to the robot through a low-latency wireless transmitter. An example trajectory is shown in Fig. 9. The real robot tracked the goals set by the mobile robot with an average speed of about . Though trained on fixed goals only, it can also follow the slow-moving target in the test. This result shows the feasibility of the learned policy on the real robot. In the future, we plan to perform more experiments to quantify the performance loss from the simulated to the real robot.
The contribution of this paper is two folds: First, we investigate the properties of Matsuoka oscillator for generating diverse gait patterns for rhythmic and even non-rhythmic locomotion in snake-like robots. Second, we construct a PPOC-CPG net that uses a CPG net to actuate the soft snake robot, and a reinforcement learning algorithm to learn a closed-loop near-optimal control policy that utilizes different oscillation patterns in the CPG net. This learning-based control method shows promising results on goal-reaching and tracking behaviors in soft snake robots. This control architecture may be extended to motion control of other robotic systems, including bipedal and soft manipulators. Our next step is to extend the proposed control framework to a three-dimensional soft snake robot and to realize more complex motion and force control in soft snake robots using distributed sensors and visual feedback.