Autonomous unmanned aerial vehicles (UAVs) are becoming increasingly popular for tasks such as search and rescue, payload (medicine, food) delivery in difficult-to-reach areas, aerial cinematography, and wildlife monitoring [gonzalez2016unmanned]. Current solutions rely on quadcopters and fixed-wing aircraft. Although quadcopters can hover in a fixed position, their short battery life prevents them from accomplishing long-duration missions. Fixed-wing aircraft have the opposite limitation: they must move constantly to stay airborne. Therefore, for tasks that involve long flight times, heavier payloads, and hovering over a small region, autonomous blimps are an attractive solution. However, autonomous blimp control remains a challenging problem, which we address in this paper using a learning-based approach.
Classic blimp controller design usually relies on PID controllers [1013654, takaya2006pid] and nonlinear control [1570450, LIU2020105610]. PID struggles with plant nonlinearity, and nonlinear control methods require a dynamic model of the system, which is often difficult to acquire because of effects such as friction, wind, and aerodynamics. Deep reinforcement learning (DRL), on the other hand, is a more recent control framework that has achieved success in a variety of applications that present similar challenges [NIPS2003_2455, bellemare2020autonomous].
For blimp control, some physical parameters, such as friction and aerodynamic effects, are not negligible yet remain difficult to estimate. A model-free DRL approach is particularly useful in such a case, as it allows an agent to learn a control policy without any pre-specified physics and without the need to estimate those parameters explicitly. However, training such a model-free DRL agent requires a significant amount of data and computational resources. The trained agent may also learn unexpected and unsafe maneuvers, as it only exploits the reward function. For example, an autonomous blimp agent can learn to fly backwards and still receive a high reward, while in reality such behavior is undesired because it can damage the hardware. Our insight to address these two issues is to leverage a classical model-free approach, e.g., PID, to constrain the policy search space. We do this by developing a novel framework based on deep residual RL (DRRL) [silver2018residual] that combines the advantages of both classical control and reinforcement learning. The use of the classical method in this framework not only provides stability in training but also implicitly outlines a safe behavior for the agent, constraining the policy search space and avoiding undesired behaviors.
The training process can also be unstable due to the partially observable nature of the environment, e.g., wind and buoyancy, and this effect is exacerbated by the time-delayed blimp dynamics. To address this issue, we integrate an LSTM (long short-term memory) [hochreiter1997long] layer in our policy model to reduce the effect of partial observability.
Nevertheless, our DRRL framework still requires substantial training experience to derive a working RL policy. We address this problem by training the agent in a software-in-the-loop (SITL) simulation setup [price2020simulation] and parallelizing it to accelerate this process.
Furthermore, to deploy the agent on the real blimp, it is necessary to i) address the sim-to-real gap, and ii) maintain smoothness in the actuator commands. Thus, in simulation, we apply domain randomization during training to improve the robustness of the agent. To protect the actuators and reduce the effects of chattering, the action space of the agent contains only the increment of the actuator command rather than the actuator command itself (e.g., rotor acceleration instead of rotor speed).
In summary, the novel contribution of this paper is a model-free DRRL-based approach for autonomously controlling a large blimp in forward velocity, yaw and altitude, simultaneously, in outdoor moderate wind conditions. Through rigorous simulations we show that our method outperforms state-of-the-art approaches based on a PID and is robust to different flight contexts, e.g., changes in wind speed and buoyancy. Finally, through real-world experiments, we demonstrate that using our approach we obtain a robust control policy that seamlessly generalizes to the real blimp.
II Related Work
Control methods for blimps and airships, which have similar control schemes, have been well studied. Classic approaches usually rely on PID controllers [1013654, 770044, 894672, takaya2006pid]. While simple and robust, they often suffer from plant nonlinearity. To overcome this, advanced approaches have been developed using nonlinear control theory, such as inverse optimal tracking control, dynamic inversion control, backstepping control, robust control [Cheng2018], and model predictive control. However, optimal control usually requires an accurate dynamic model, which can be difficult to acquire, while robust control handles parameter uncertainty at the cost of performance. Another key drawback is the lack of real-world experiments and validation in most of these works. The buoyancy of a real blimp can change significantly over short time periods due to temperature fluctuations. The weight distribution can also vary and thus degrade the altitude control performance. Unfortunately, these effects have not been addressed in prior work.
On the other hand, there has recently been a surge of interest in applying RL to robotics [singh2021reinforcement]. The earliest works include Gaussian processes (GPs) for system identification of a blimp and their combination with value iteration and Q-learning approaches [5152660, 4399531], for blimp altitude control only. Although sample-efficient, GPs scale poorly with the problem dimension and demand more computational resources. As a result, they have succeeded only on low-dimensional tasks, such as 1-D altitude control, whereas in our approach we show that the agent can successfully learn a 3-dimensional task (forward velocity, yaw, and altitude control).
DRL, on the other hand, leverages deep neural networks (NNs) for policy approximation. Thus, its policy class can be used for higher-dimensional tasks. For example, the authors of [nie2019three] train two DQN agents for rudder and elevator control of a blimp, respectively, and demonstrate better performance than a PID controller in simulation. The main challenge with DRL, however, is the lack of sample efficiency. To scale up the DRL formulation with the problem dimension, the agent needs a greatly increased number of environment interactions. Other challenges include, but are not limited to, adapting a trained policy to real-world scenarios [zhang2019bridging] and action smoothness [caps2021]. Furthermore, issues such as partial observability, disturbances, and noise can also lead to unexpected behaviors. As described in the introduction, our approach addresses all these issues through our novel DRRL-based framework, training parallelization, domain randomization, and action space design.
In this section, we first describe the blimp and the MDP problem formulation. Then we introduce the main goal of the work, the blimp control task (Sec. III-C). The objective in the blimp control task is to navigate the blimp to any given waypoint within the space m, where is the dimension of the bounding box. This is followed by our novel DRRL-based framework, which describes our approach to the blimp control task. Subsequently, we describe the yaw control task (Sec. III-E), where the aim is to control the blimp to a desired yaw angle with the tail rotor. The purpose of this simplified task is to enable an ablation study.
We first briefly describe our blimp (complete details are in our previous work [price2020simulation]), which has 8 actuators. The two main motors (thrusters), , are attached to a servo, , which allows thrust vectoring. At the tail of the blimp, four fins, , two positioned vertically and two horizontally, control yaw and pitch angle, respectively. There is a tail motor, , attached to the lower vertical fin, generating horizontal thrust allowing further yaw controllability. Therefore, the state vector of the actuators can be denoted as
III-B Markov Decision Process
We consider the RL problem as an infinite horizon discrete time Markov Decision Process,, defined by a tuple [sutton2018reinforcement]. At any time step and state , an agent draws an action from a continuous action space given the policy distribution parameterized by . The environment then samples the next state from an unknown transition distribution, i.e. . A reward is received based on some reward function . Given the discount factor , the goal of the agent is to obtain the optimal policy parameter that maximizes the expected value of the cumulative discounted reward (2),
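As a small illustration of the objective above, the following sketch computes the cumulative discounted reward of a single finite rollout. The function name `discounted_return` and the truncation to a finite list of rewards are assumptions made for illustration; the formulation in the text is over an infinite horizon and takes an expectation over the policy and the transition distribution.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma**t * r_t for one finite rollout (illustrative helper)."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Accumulate backwards so that each step applies G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

Averaging this quantity over many rollouts gives a Monte Carlo estimate of the expected value that the policy parameters are chosen to maximize.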
III-C Blimp Control Task
We formulate the problem as a path following task as seen in previous works [1626776, nie2019three, 5611169]. In this setting, an imaginary path reference is generated based on waypoints for the controller to follow. Casting the path following task as a DRL problem, in this section we derive the observation space and action space representation.
Since the blimp has no lateral movement control, we only consider planar, altitude, and velocity control, which allows us to decompose the problem into these three parts. The objective of the planar control is to bring the blimp to any waypoint in the xy-plane, that of the altitude control is to reach the desired z, and that of the velocity control is to track the desired velocity.
Given the blimp position at and velocity , a target waypoint at in body frame cylindrical coordinates with desired velocity (Fig. 2), the control objective of the planar control is the minimization of the relative distance and yaw angle, i.e. . The objective of the altitude and velocity control is to minimize the relative altitude and the relative velocity, respectively, i.e., , , where .
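The relative quantities named above can be sketched as follows. This is only an illustration under stated assumptions: a world frame with z pointing up, yaw measured counter-clockwise from the x-axis, and the hypothetical helper names `wrap_angle` and `relative_errors`. The exact sign and frame conventions of the paper are not recoverable from the text.

```python
import math
from typing import Sequence, Tuple


def wrap_angle(angle: float) -> float:
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def relative_errors(
    position: Sequence[float],   # blimp (x, y, z) in the world frame (assumed)
    yaw: float,                  # blimp heading in radians (assumed convention)
    speed: float,                # current velocity magnitude
    waypoint: Sequence[float],   # target (x, y, z) in the world frame
    desired_speed: float,
) -> Tuple[float, float, float, float]:
    """Return (planar distance, relative yaw, relative altitude, relative velocity)."""
    dx = waypoint[0] - position[0]
    dy = waypoint[1] - position[1]
    distance = math.hypot(dx, dy)          # distance in the xy-plane
    bearing = math.atan2(dy, dx)           # direction from blimp to waypoint
    yaw_error = wrap_angle(bearing - yaw)  # heading error, wrapped
    altitude_error = waypoint[2] - position[2]
    speed_error = desired_speed - speed
    return distance, yaw_error, altitude_error, speed_error
```

Wrapping the yaw error keeps it continuous around zero, so a small heading deviation never appears as an error close to a full turn.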
We denote the velocity vector of the blimp as , and attitude (roll, pitch, yaw) as . Assuming near zero lateral movement in the blimp (i.e. ), the velocity and pitch angle can be encoded by velocity magnitude () and the altitude velocity (), alone. Therefore, the base state vector is
We augment the base state vector with additional components, based on the insights explained below. We observed that training becomes more stable if the yaw velocity is appended to the base state vector. The airspeed sensor readings, , are added to enhance robustness against the wind. To prevent overshoot when reaching a waypoint in the planar control task, we add , the relative yaw angle of the blimp with respect to the subsequent waypoint. Consequently, the extended state representation is
III-D Novel DRRL-based Framework
Our DRRL framework consists of two controllers: a stability provider and a performance optimizer. The stability provider is a classical controller that offers stability guarantees and basic tracking performance; this role is usually played by a PID controller or a robust controller to enlarge the region of attraction. The performance optimizer is a DRL agent that learns to adjust the control decisions of the stability provider in order to maximize its own reward function. The control commands from these two controllers are then combined by a mixer , which we describe later. The overall structure is displayed in Fig. 3. In this paper, we choose a PID controller as our stability provider for its simplicity and robustness. It integrates well with the DRL agent because it is also a model-free method, so no dynamic model is required with this combination. Although its performance degrades quickly outside the tuned speed range, the system nevertheless remains stable and can still bring the blimp closer to the waypoint.
III-D1 PID Controller
The PID command, is determined as follows (5),
where . Since it is difficult to design a PID-based servo control, we leave this completely for the DRL agent to control.
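For readers unfamiliar with the stability provider, a generic discrete-time PID can be sketched as below. This is not the controller of (5), whose gains and error terms are not reproduced here; the class name `PID`, the output limit, and the backward-difference derivative are all assumptions made for illustration.

```python
class PID:
    """Minimal discrete-time PID with output saturation (illustrative only)."""

    def __init__(self, kp: float, ki: float = 0.0, kd: float = 0.0,
                 limit: float = 1.0) -> None:
        self.kp, self.ki, self.kd = kp, ki, kd
        self.limit = limit            # assumed symmetric command limit
        self.integral = 0.0
        self.prev_error = None        # no derivative on the first step

    def step(self, error: float, dt: float) -> float:
        """Return the saturated command for the current error."""
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        command = (self.kp * error
                   + self.ki * self.integral
                   + self.kd * derivative)
        # Saturate so the command stays in the normalized actuator range.
        return max(-self.limit, min(self.limit, command))
```

Setting `ki = 0` gives a PD controller, and additionally setting `kd = 0` gives a pure P controller, which are the two variants compared in the yaw control ablation.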
III-D2 Observation and Action Space for the DRL Agent
The full actuator state, , is described in (1). Since we forbid differential thrust, symmetric actuators are always in the same state. Thus, we feed back only one of them (i.e. ). The tail motor is controlled and observed together with the bottom fin (i.e. ). The reduced state of actuators is therefore defined as . The full state for the DRL formulation, as used in (2), is now obtained below as the concatenation of , , and
Note that all states are scaled to the range and zero-initialized. The RL command, , is chosen based on the state vectors. Then the joint action command, , is simply the mixture of RL and PID actions.
We introduce three types of mixer: the absolute, relative, and hybrid mixers (8-10). The absolute mixer gives the RL agent more authority and is expected to reach the highest performance after convergence, at the cost of a performance drop during exploration. The relative mixer has the opposite properties. Since the absolute mixer is too aggressive and requires careful tuning of the beta parameter, whereas the relative mixer is too conservative to change the system’s inherent stability properties, we introduce a hybrid mixer as an intermediate solution.
where . To reduce the effect of chattering in the actuator state, we avoid mapping the joint command, , to the actuator state directly. Instead, it is first mapped to an increment of the actuator state by element-wise multiplication with a constant vector, , and this increment is then used to update the actuator state. The process is described below in (11). This way, we prohibit sudden large changes in actuator states. Since our electronic speed controller filters out small changes, the damage from chattering is largely eliminated. The disadvantage of this approach is reduced control agility, due to the additional pole introduced at the origin.
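The mixing and incremental update steps can be sketched as follows. The functional forms below are assumptions chosen only to match the qualitative description in the text (the relative mixer scales the PID command and therefore has no authority when the PID command is zero; the absolute mixer adds a scaled RL command); they should not be read as the definitions in (8)-(11). The names `mix`, `update_actuators`, `beta`, and `step_scale`, and the normalized range [-1, 1], are likewise hypothetical.

```python
from typing import List, Sequence


def clip(value: float, low: float = -1.0, high: float = 1.0) -> float:
    """Clamp a scalar to the assumed normalized command range."""
    return max(low, min(high, value))


def mix(a_pid: Sequence[float], a_rl: Sequence[float],
        beta: float, mode: str) -> List[float]:
    """Combine PID and RL commands element-wise (assumed mixer forms)."""
    joint = []
    for p, r in zip(a_pid, a_rl):
        if mode == "absolute":      # RL adds an independent correction
            a = p + beta * r
        elif mode == "relative":    # RL only rescales the PID command
            a = p * (1.0 + beta * r)
        elif mode == "hybrid":      # one possible blend of the two
            a = p * (1.0 + r) + beta * r
        else:
            raise ValueError(f"unknown mixer mode: {mode}")
        joint.append(clip(a))
    return joint


def update_actuators(state: Sequence[float], joint_cmd: Sequence[float],
                     step_scale: Sequence[float]) -> List[float]:
    """Treat the joint command as an increment of the actuator state."""
    return [clip(s + k * a) for s, a, k in zip(state, joint_cmd, step_scale)]
```

Because each step changes an actuator by at most the corresponding entry of `step_scale`, the update behaves like a discrete integrator: commands stay smooth, at the price of a slower response.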
We summarize the overall control architecture in Fig. 3.
III-D3 Reward Function
The navigation requires moving the robot in space by specifying a target position or following a sequence of waypoints. The reward function is defined by (12)
where . The agent receives a success reward, , if the task is completed, i.e., a waypoint is successfully reached within a certain threshold . Tracking reward, , indicates the tracking performance as defined in (13). Action reward, , is defined to regularize actuator commands. Bonus reward, , specifies additional desired control property of preventing overshoot.
where measures the Euclidean distance between the 2D projections of the blimp and the target waypoint position on the ground plane. Parameters, , are derived via manual tuning and approximate the energy consumption of the rotors. The bonus reward is designed to reduce overshoot by reducing the relative yaw angle to the next waypoint when the blimp approaches the current target waypoint.
III-E Yaw Control Task
Here, the agent observes yaw-related states and outputs a yaw command to control the tail motor, , together with a PID controller. The objective is to minimize the relative yaw angle, i.e. . Concretely, the observation space is defined as and the action space where and . The actuator state update is described as in (11) with . Unlike (12), the reward function in this task combines only the success reward and the tracking reward, or . The task is considered successful if the agent’s relative yaw angle is kept small for a certain period of time, as described in (14).
where is a time counting function and .
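The success condition of the yaw control task, holding a small relative yaw angle for a certain time, can be sketched with a simple hold timer. The class name `HoldTimer` and both thresholds are hypothetical; the values used in (14) are not recoverable from the text.

```python
class HoldTimer:
    """Report success once |yaw error| stays below a threshold long enough."""

    def __init__(self, angle_threshold: float, hold_time: float) -> None:
        self.angle_threshold = angle_threshold  # assumed tolerance in radians
        self.hold_time = hold_time              # assumed required duration in s
        self.elapsed = 0.0

    def update(self, yaw_error: float, dt: float) -> bool:
        """Advance the counter by dt and return True when the task succeeds."""
        if abs(yaw_error) < self.angle_threshold:
            self.elapsed += dt
        else:
            # Leaving the tolerance band resets the counter.
            self.elapsed = 0.0
        return self.elapsed >= self.hold_time
```

Resetting the counter whenever the error leaves the band means a brief crossing of the target angle during an overshoot does not count as success.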
III-F Training Setup
In this section, we describe the important factors that contribute to the robustness of the trained policy. At the beginning of each training episode, several waypoints are sampled in the space based on the position of the previous waypoint. When the blimp reaches the current waypoint, it receives a success reward and the next waypoint is activated. During training, noise is injected into observations and actions. Lastly, we apply domain randomization and sample new environment variables for each episode according to Table 1.
|wind in xy direction||[-1.5, 1.5]|
|wind in z direction||[-0.15, 0.15]|
|freeflop angle||[0.0, 1.5]|
|deflation rate||[0.0, 1.5]|
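A per-episode sampler for the ranges in the table above could look like the following sketch. The uniform distribution, the dictionary keys, and the function name `sample_environment` are assumptions; the table gives only the ranges, and units are omitted here because they are not listed.

```python
import random
from typing import Dict, Tuple

# Ranges copied from the table above; the key names are illustrative.
RANDOMIZATION_RANGES: Dict[str, Tuple[float, float]] = {
    "wind_xy": (-1.5, 1.5),
    "wind_z": (-0.15, 0.15),
    "freeflop_angle": (0.0, 1.5),
    "deflation_rate": (0.0, 1.5),
}


def sample_environment(rng: random.Random) -> Dict[str, float]:
    """Draw one set of environment variables for a new training episode."""
    # Uniform sampling is an assumption; the text does not state the distribution.
    return {name: rng.uniform(low, high)
            for name, (low, high) in RANDOMIZATION_RANGES.items()}
```

Calling `sample_environment(random.Random(seed))` at every environment reset gives each episode a different flight context, while the seed keeps the training run reproducible.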
We train the policy network with PPO, which has achieved strong benchmark performance and training stability. To accelerate training, we parallelize the simulation and achieve a speed-up of up to 14 times real time. The architecture of our policy network (Fig. 4) includes an LSTM layer to reduce the impact of partial observability (e.g., wind and buoyancy).
IV Experiment Design and Setup
With the real world scenario in mind, we address the following questions through our experiments. Does LSTM architecture reduce the impact of partial observability? What are the properties of the mixers in the DRRL framework? Can the DRRL agent improve the PID controller? How does the agent perform under the presence of disturbance, noise, and parameter uncertainty?
IV-A Experimental Setup and Compared Methods
We integrate our DRRL training environment in the ROS/Gazebo SITL simulation following the OpenAI-Gym framework. The PPO implementation is based on RLlib. The agent is trained on a single computer (AMD Ryzen Threadripper 3960X, 24x 3.8GHz, NVIDIA GeForce RTX 2080 Ti, 11GB). Our simulated blimp model is designed based on our real robotic blimp (see Fig. 1). The following methods are evaluated and compared to each other.
DRRL agent: our proposed approach.
PID: the PID described in Sec. III-D1
Baseline: the baseline is a cascade PID controller, well-tuned to the simulation environment. Our previous work [price2020simulation] has shown that it can be deployed in the real world without further tuning, which indicates both the reliability of the simulation and the robustness of this approach. This controller directly controls the actuators to follow the velocity reference from a path planner instead of the waypoints (as used by the above 2 methods) and relies on an extended Kalman filter for state estimation and noise filtering.
IV-B Task Suite
In this section, we describe the design of the two control tasks that were introduced previously. The yaw control task (III-E) is a simplified task used to evaluate different design options. The goal is to find the best possible configuration and then train a near-optimal policy for the blimp control task (III-C) within a limited amount of time. To ensure reproducibility, the training experiments are conducted with different seeds. Table 2 displays the parameters for both tasks.
IV-B1 Yaw Control Task
We carry out an empirical ablation study on the training stability of the DRRL agent with different combinations of PID controller, policy, and mixer. We first use PD control and then P control, which correspond to a good and a poor PID controller, respectively, for this task. PD control has a stability guarantee, while P control is only marginally stable.
IV-B2 Blimp Control Task
We design the DRRL agent for this task based on the conclusions from the ablation study of the yaw control task. Despite the difference in task complexity, the PPO agent hyperparameters remain identical. We first examine the training progress of the DRRL agent and compare it to the PID controller. The training is performed 3 times with different seeds to ensure reproducibility.
Then, we investigate its robustness and characteristics under different wind conditions relative to the PID and the Baseline methods. This comparison is performed on results averaged over 7 runs, each lasting for 30 minutes, for 2 desired trajectories (coil and square). Furthermore, the runs are subject to a uniformly sampled random wind direction. The square trajectory consists of 4 waypoints and has 80 meters between each waypoint. The coil trajectory has a 30 meter radius covered by 15 waypoints in total. Consecutive waypoints on it are separated by 45 degrees and 42.4m in their projection on the X-Y plane and by 2m in the Z direction. As the square trajectory has longer edges, it is easier to track than the coil. The coil trajectory is more challenging due to the shorter inter-waypoint distance. In this case, the blimp has to constantly slow down to control the yaw angle, which can cause altitude loss.
We test the trained agent on a real blimp with a 40-meter square trajectory. The real blimp differs from the simulation in several properties, e.g., trim weight, buoyancy, and maximum thrust. Many of these effects are not domain randomized during training. That is to say, it is a new flight context that the DRRL agent has not encountered before, and thus it poses a great challenge for generalization.
|PPO||learning rate schedule|
|Environment||simulation frequency [hz]|
|policy frequency [hz]|
|observation noise |
|action noise |
|Yaw Control||training time [day]|
|est. wall clock time [hr]|
|Blimp Control||training time [day]|
|est. wall clock time [hr]|
V Experimental Evaluation
V-A Yaw Control Task
Fig. 4(a) shows that DRRL with the LSTM architecture improves the final performance of the good/poor PID controller by 47%/13%. Without the LSTM, the improvement is only 33%/13%. The maximum performance drop during exploration is 7%/25% with the LSTM and 31%/47% without it. This result suggests that the LSTM is an important building block of our DRRL framework. It can effectively stabilize and accelerate training and reduce the performance drop during training.
The properties of each mixer type can be observed through the training progress in Fig. 4(b). Unsurprisingly, the absolute mixer shows the largest performance growth and drop during training and achieves the highest final performance for both the good and the poor PID, since it grants the DRRL agent more control authority. In contrast, the relative mixer neither improves nor degrades the PID performance. Equation (9) suggests that, because the agent's control depends on the PID, the DRRL agent has no control authority if the PID command is small. That is, when the yaw error is close to 0, neither the P control nor the DRRL agent provides any command to reduce the angular velocity, and the blimp overshoots the target angle. Consequently, the marginally stable system property remains unchanged. Lastly, the hybrid mixer retains properties of both mixers and achieves intermediate performance, as expected. Although the absolute mixer appears to be the best design, it is important to note that it can reduce training stability. This effect is amplified in higher-dimensional spaces.
We draw the following conclusions from the yaw control experiment: 1) The LSTM plays an important role in stabilizing and accelerating training. 2) The choice of the mixer has a significant impact on training stability and final performance. 3) In addition to the LSTM and the mixer, the design of the PID controller is also important: including a derivative term boosts the initial performance by nearly 50%.
V-B Blimp Control Task
Following the conclusions from the ablation study (V-A), we design the DRRL agent with an LSTM policy and either the absolute mixer or the hybrid mixer. Even though the absolute mixer appears to be the best configuration in the simpler task, the agent with this mixer consistently failed to obtain any functional policy and got stuck in a local optimum where the policy always commands the maximum tail rotor output, which results in repeated rotation. The hybrid mixer, on the other hand, successfully stabilizes the training progress (Fig. 6). It reaches 60% of the final performance within the first 2000 episodes and continues to grow steadily.
As demonstrated in Fig. 7, the agent successfully tracks both the square and the coil trajectory, which suggests that it does not overfit to any specific track and can generalize to other desired trajectories in 3D space. The baseline trajectory is more consistent compared to the others. This is because the EKF provides smoother state estimation, while both the DRRL agent and the PID controller receive only noisy raw observations. On the other hand, although the trajectories of the DRRL agent and the PID controller look fairly similar, the DRRL agent overshoots far less than both other methods in the coil trajectory (Fig. 6(b)), as it can observe the subsequent waypoint. In the coil trajectory, we observe that the baseline struggles to follow denser waypoints. To prevent altitude loss, the baseline applies a constraint on the maximum yaw angular rate, which increases its minimum turn radius and reduces its agility.
In Tab. 3 we summarize the robustness tests and the comparison of methods over different trajectory types, wind speeds, and buoyancy. The DRRL agent receives the highest ‘total reward’ in 4 out of 6 experiment combinations. A higher ‘success reward’ implies that the agent can track more waypoints within the total time span. During experiments, although the desired velocity is m/s, the baseline seems to achieve only . As a consequence, it traverses less total distance and receives less success reward. In terms of tracking reward, the baseline significantly outperforms the others. Since the tracking reward is dominated by the altitude loss, this suggests that the baseline can keep track of the altitude better than the PID and the DRRL agent. In the coil trajectory, although the PID and the DRRL agent can follow the trajectory well, we observe a significant loss in altitude. PID control does not have sufficient speed to maintain the altitude and continues to sink, while the DRRL agent relies on thrust vectoring to loiter at the desired altitude. Similarly, reducing the buoyancy can impair the altitude control of the RL agent and the PID control. The baseline, while being worst at overall waypoint tracking, tracks the altitude well. It achieves this via thrust vectoring and because, by design, the baseline’s primary task includes maintaining the altitude.
V-C Real-World Test
The result of the real test flight is displayed in Fig. 8 and Table 4. We reduce the square size to 40 meters, as opposed to 80 meters in simulation, due to the limited size of the test field. The average measured wind speed was 4 times higher than what the DRRL agent had experienced in simulation. Nevertheless, the DRRL agent could still hold its position under the gusts and successfully reached several waypoints. Note that the row of trajectory snapshots in Fig. 8 shows only a part of the complete trajectory. We also provide the baseline as a reference. However, the two are not directly comparable: the gusts were strong and arrived irregularly, so the method that encountered more gusts would obtain less reward.
|Real Flight Evaluation|
In this work, we presented a novel framework based on DRRL for the blimp control task. It leverages an RL agent to improve the basic PID control performance through interaction with the environment. We presented and evaluated several techniques to stabilize training and enhance the robustness of the trained RL agent, e.g., domain randomization, an LSTM layer, and a hybrid mixer in the DRRL framework. Extensive robustness tests demonstrated the DRRL agent’s capability to improve on the PID performance and to outperform both the PID and another baseline approach. Through real blimp flights in an outdoor environment under windy conditions, we demonstrated that the trained policy can generalize to a real scenario without any modification.