Autonomous Blimp Control using Deep Reinforcement Learning

by Yu Tang Liu, et al.

Aerial robot solutions are becoming ubiquitous for an increasing number of tasks. Among the various types of aerial robots, blimps are very well suited to perform long-duration tasks while being energy efficient, relatively silent and safe. To address the blimp navigation and control task, in our recent work, we have developed a software-in-the-loop simulation and a PID-based controller for large blimps in the presence of wind disturbance. However, blimps have a deformable structure and their dynamics are inherently non-linear and time-delayed, often resulting in large trajectory tracking errors. Moreover, the buoyancy of a blimp is constantly changing due to changes in the ambient temperature and pressure. In the present paper, we explore a deep reinforcement learning (DRL) approach to address these issues. We train only in simulation, while keeping conditions as close as possible to the real-world scenario. We derive a compact state representation to reduce the training time and a discrete action space to enforce control smoothness. Our initial results in simulation show a significant potential of DRL in solving the blimp control task and robustness against moderate wind and parameter uncertainty. Extensive experiments are presented to study the robustness of our approach. We also openly provide the source code of our approach.






I Introduction

Autonomous unmanned aerial vehicles (UAVs) are becoming increasingly popular for various tasks, such as search and rescue, payload (medicine, food) delivery in difficult-to-reach areas, aerial cinematography and wildlife monitoring [gonzalez2016unmanned, wang2016detecting, 7139494, 4282475, aircap2019aerialswarms, mademlis2019high, markerlessmocap]. Current solutions rely on quadcopters and fixed-wings. Although quadcopters can hover in a fixed position, they are not able to accomplish long-term missions due to their short battery life. The situation is opposite for the fixed-wings, which have to move constantly to stay airborne. Therefore, for tasks involving long flight times, more payload and hovering over a small region, blimps provide an attractive solution.

A blimp is an airship without a rigid hull structure. Filled with helium, it becomes lighter than air and can hover for long periods without spending much energy. A blimp’s weight is usually concentrated at its gondola, which creates a huge inertia to stabilize itself. From a control perspective, this makes the blimp an inherently stable plant [li2007modeling] and allows it to recover from undesired states.

For blimp controller design, classic approaches usually rely on PID controllers [1013654, 770044, 894672, takaya2006pid, 1626776] and nonlinear control [1554402, 1570450, 4470489, 1177111, LIU2020105610, Cheng2018, 4776979, LIU2018231]. PID controllers suffer from plant nonlinearity, and nonlinear control methods require a dynamic model of the system, which is often difficult to acquire. Deep reinforcement learning (DRL), on the other hand, is a newer control framework that has achieved success in a variety of applications that present similar challenges [NIPS2003_2455, Hwangbo_2017, Silver_2016, Mnih2013PlayingAW, zhu2018dexterous, bellemare2020autonomous]. The model-free RL approach is particularly useful when it is difficult to estimate physical parameters such as buoyancy and aerodynamic effects. Its learning ability allows the controller to potentially adapt to changes in the dynamics caused by the environment.

Fig. 1: Our autonomous blimp during a flight. Unlike common designs, our blimp has thrust vectoring, which increases its agility. Inset: a Gazebo model of our blimp.

In [Price:IAS:2021] we developed a software-in-the-loop (SITL) simulation and a manually tuned PID controller for the blimp control task. There, we demonstrated its ability to follow a waypoint sequence in the real world. To address the previously described issues of PID and nonlinear control approaches, in the current work we explore a DRL-based method. Here, an RL agent, QRDQN [mavrin2019distributional], is deployed in the simulation environment to improve exploration and training stability. We design a variety of training task suites following the OpenAI-Gym framework [stable-baselines3], which is compatible with a variety of off-the-shelf RL platforms [liang2018rllib, TFAgents, hoffman2020acme]. To achieve DRL training within a reasonable amount of time, we reduce the problem difficulty by selecting task-specific features. We choose a discrete action space design to enforce continuity in the actuator commands. An action penalty is included in the reward function to regulate motor usage. To robustify the agent, we corrupt the data by injecting noise into all observations and actions during training. We show that this integration addresses some of the common issues that may appear in a real-world scenario.

II Related Work

Control methods for blimps and airships, which have similar control schemes, have been well studied over the last decade [5420403]. Classic approaches usually rely on PID controllers. Popular applications include visual servoing [1013654, 770044, 894672] and indoor miniature blimps [takaya2006pid, 1626776]. The PID control class, while simple and robust, often suffers from plant nonlinearity. To overcome this weakness, advanced approaches have been developed using nonlinear control theory. Existing methods in this context include inverse optimal tracking control [1554402], dynamic inversion control [1570450], the more widely investigated backstepping control [4470489, 1177111, LIU2020105610], robust control [Cheng2018, LIU2018231], and model predictive control [4776979, LIU2018231]. However, these methods usually require a dynamic model, which can be difficult to acquire. The buoyancy of the blimp is heavily dependent on the constantly changing surrounding environment. When temperature or pressure changes, buoyancy also changes and a controller has to adapt on the fly. Unfortunately, this effect has not been addressed in any of the prior works so far.

On the other hand, there has recently been a surge of interest in applying RL to robotics [singh2021reinforcement]. The earliest attempts in classic RL use Gaussian processes (GPs) for system identification [4209179] and policy learning [5152660, 4399531]. Despite their sample efficiency, GPs are hard to scale up with the problem dimension and demand more computational resources. As a result, they have achieved success only on low-dimensional tasks, such as 1-D altitude control. DRL, on the other hand, leverages neural networks (NNs) for policy approximation and has achieved much success. This policy class can interpret rich representations and derive diverse behavior. For example, Nie et al. [nie2019three] train two DQN agents for rudder and elevator control of a blimp, respectively, and demonstrate better performance than a PID controller. In the field of autonomous underwater vehicles (AUVs), Carlucho et al. [8604791] apply DDPG with continuous action and observation spaces to a Nessie-VII model. (Footnote: Due to the lack of existing work, we include AUVs but only focus on those that have a fairly similar shape and task specification.)

The main challenge with DRL is the lack of sample efficiency. To scale up the DRL formulation with the problem dimension, the agent needs a much larger number of environment interactions. Other challenges include adapting a trained policy to real-world scenarios [zhang2019bridging] and action smoothness [caps2021]. Furthermore, issues such as partial observability, disturbances, and noise could also lead to unexpected behavior. Issuing stability certificates for RL agents is also an ongoing research topic. In the case of AUVs, the robustness issue is addressed in [9414937] by training a PPO agent in an adversarial fashion, at the cost of conservative behavior.

To increase sample efficiency and finish training within a reasonable time budget, in this work we train the policy network with a value-based RL agent, QRDQN, which is a more stable variant in the DQN family. Noise is injected into the action and observation spaces during training to robustify the agent. Lastly, to enforce actuator smoothness, we choose a discrete action space (Sec. III-B3).

III Methodology

III-A Preliminary

Blimp shapes and architectures can vary considerably. Therefore, we first describe our blimp, which has 8 actuators (our approach, however, is agnostic to the configuration). The two main motors (for thrust) are attached to a servo, which allows thrust vectoring. At the tail of the blimp, there are four fins controlling the yaw and pitch angles, and a tail motor, attached to the bottom fin, that generates horizontal thrust for additional yaw controllability. The state of the actuators (1) thus consists of the motor, servo, and fin states. Our goal is to navigate this blimp to any given waypoint in space by controlling these actuators.

III-B Formulation

III-B1 Control

We formulate the problem as a path-following task, as in previous works [1626776, nie2019three, 5611169]. In this setting, an imaginary reference path is generated from the waypoints for the controller to follow. Casting the path-following task as a DRL problem, in this section we show how we reduce the size of the state space and maintain a reasonable training time. Since the blimp does not have lateral movement control, we only need to consider longitudinal and altitude control. This allows us to decompose the problem into a planar navigation control task and an altitude control task. The objective of the planar navigation control is to steer the blimp to any waypoint in the xy-plane, whereas the objective of the altitude control is to reach the desired altitude of the waypoint.

Given the blimp position and a target waypoint expressed in body-frame cylindrical coordinates (Fig. 2), the objective of the planar navigation control is to minimize the relative distance and the relative yaw angle to the waypoint. The objective of the altitude control is to minimize the relative altitude. These three quantities fully contain the spatial information between the target and the blimp. Although it is possible to train the DRL agent using only these quantities, this minimal setting ignores the velocity and pitch state of the blimp, leading to instability in training and uncontrolled behavior when reaching the waypoint.

Fig. 2: The body frame in the NED cylindrical coordinate system. The origin is at the top fin of the blimp, where the GPS and IMU sensors are mounted. The waypoint is shown together with its projection onto the xy-plane.

We also consider the velocity of the blimp and its attitude (roll, pitch, yaw). Assuming near-zero lateral movement of the blimp, the velocity and pitch angle information can be encoded by the velocity magnitude and the altitude velocity alone. We augment our state with this velocity information and derive a compact representation of the overall state, which encodes all the spatial, velocity, and attitude information. As we do not directly use the pitch angle, this representation is agnostic to sensor calibration errors in the pitch angle, which can easily be misaligned with respect to the simulation.
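As a rough illustration of the compact representation described above, the following sketch computes the relative planar distance, relative yaw angle, relative altitude, velocity magnitude, and altitude velocity from a blimp pose and a waypoint. The function name, the argument layout, and the frame and sign conventions are assumptions made for this example; they are not taken from the authors' code.

```python
import math


def compact_state(blimp_pos, blimp_yaw, blimp_vel, waypoint):
    """Return (distance, yaw_error, altitude_error, speed, vertical_speed).

    Hypothetical conventions: positions and velocities are (x, y, z) tuples
    in a common world frame, and yaw is given in radians.
    """
    dx = waypoint[0] - blimp_pos[0]
    dy = waypoint[1] - blimp_pos[1]
    dz = waypoint[2] - blimp_pos[2]

    # Planar distance to the projection of the waypoint on the xy-plane.
    distance = math.hypot(dx, dy)

    # Bearing to the waypoint relative to the heading, wrapped to [-pi, pi).
    bearing = math.atan2(dy, dx)
    yaw_error = (bearing - blimp_yaw + math.pi) % (2.0 * math.pi) - math.pi

    # Velocity magnitude and vertical velocity stand in for velocity and pitch.
    speed = math.sqrt(sum(v * v for v in blimp_vel))
    vertical_speed = blimp_vel[2]

    return distance, yaw_error, dz, speed, vertical_speed
```

Because the pitch angle never appears explicitly, a constant pitch offset in the attitude sensor would not change any of the five returned values in this sketch.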

III-B2 Markov Decision Process

We consider the RL problem as an infinite-horizon, discrete-time Markov Decision Process (MDP) [sutton2018reinforcement]. At any time step and state, an agent draws an action from a discrete action space according to a parameterized policy distribution. The environment then samples the next state from an unknown transition distribution, and a reward is received based on a reward function. Given a discount factor, the goal of the agent is to find the optimal policy parameters that yield the highest cumulative discounted reward (3).
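To make the notion of cumulative discounted reward concrete, the short sketch below evaluates it for a finite list of rewards. It only illustrates the quantity being maximized: a finite list truncates the infinite-horizon sum, and the function name is chosen for this example.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t] over a finite list of rewards."""
    total = 0.0
    # Accumulate backwards so each step is one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

For example, three unit rewards with a discount factor of 0.5 give 1 + 0.5 + 0.25 = 1.75.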


III-B3 Observation and Action Space

The full actuator state is described in (1). Since we do not allow differential thrust, symmetric actuators are always in the same state, so we only need to feed back one of them. This yields a reduced actuator state. The full state for the DRL formulation, as used in (3), is obtained as the concatenation of the compact state representation and the reduced actuator state.

Note that all states are scaled to a bounded range and zero-initialized. To prevent significant and sudden changes in the actuator command, we use a discrete action space. The action command is mapped to an actuator increment following Table 1, and this increment is then added to the current actuator state (5).


III-B4 Reward Function

The control tasks of UAVs usually involve navigation and hovering. Navigation requires moving the robot in space by specifying a target position or following a sequence of targets, whereas hovering requires staying near the target position. These two tasks can be combined and trained with the same setup by using appropriate reward functions. When the goal is far away, we use a reward function for navigation only; otherwise, we use a hover reward function. The reward function is defined by (6) and combines three terms. The agent receives a success reward if the task is completed. The tracking reward indicates the tracking performance. The action reward is defined to regularize actuator commands.


Here, the distance term measures the Euclidean distance between the blimp and the target position. As indicated by (7-8), if this distance is short enough, the reward is switched from the navigation reward to the hover reward, which does not take the yaw component into account (9-10). Note that other state components could also be used in the reward function to address other tasks.
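The switching logic described above can be sketched as follows. This is only an illustration of the structure (a distance threshold that selects between a navigation term and a yaw-free hover term, combined with success and action terms). The functional forms, scales, threshold, and weights are all hypothetical and should not be read as the reward actually used in the paper.

```python
def tracking_reward(distance, yaw_error, hover_radius, dist_scale, yaw_scale):
    """Hypothetical tracking term: navigation far away, hover near the target."""
    if distance <= hover_radius:
        # Hover reward: the yaw component is deliberately ignored.
        return -distance / dist_scale
    # Navigation reward: penalize both the distance and the yaw error.
    return -(distance / dist_scale + abs(yaw_error) / yaw_scale)


def total_reward(success, tracking, action, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of success, tracking, and action terms (weights assumed)."""
    w_success, w_tracking, w_action = weights
    return w_success * success + w_tracking * tracking + w_action * action
```

With such a structure, the relative size of the weights decides which term dominates the total reward, which matters for the reward exploitation analyzed in the experiments.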

A   Name         Actuator increment
0   IDLE         [0, 0, 0, 0, 0, 0, 0, 0]
1   THRUST+      [0, 0, 0, 0, 0, 0, 0.01, 0.01]
2   THRUST-      [0, 0, 0, 0, 0, 0, -0.01, -0.01]
3   NOSE_UP      [0, 0.025, 0.025, 0, 0, 0, 0, 0]
4   NOSE_DOWN    [0, -0.025, -0.025, 0, 0, 0, 0, 0]
5   NOSE_LEFT    [0.025, 0, 0, 0.025, 0.025, 0, 0, 0]
6   NOSE_RIGHT   [-0.025, 0, 0, -0.025, -0.025, 0, 0, 0]
TABLE 1: Discrete action space. Each action maps to the actuator increment listed in the last column. The fin entries correspond to the angles of the left/right/top/bottom fins. Note that in this work thrust vectoring is disabled.
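A minimal sketch of how the increments in Table 1 could be applied to the actuator state is given below. The increment values are copied from the table; the function name and the clipping bounds of -1 and 1 are assumptions made for this example.

```python
# Increments copied from Table 1, indexed by the discrete action id.
ACTION_TABLE = {
    0: [0, 0, 0, 0, 0, 0, 0, 0],                  # IDLE
    1: [0, 0, 0, 0, 0, 0, 0.01, 0.01],            # THRUST+
    2: [0, 0, 0, 0, 0, 0, -0.01, -0.01],          # THRUST-
    3: [0, 0.025, 0.025, 0, 0, 0, 0, 0],          # NOSE_UP
    4: [0, -0.025, -0.025, 0, 0, 0, 0, 0],        # NOSE_DOWN
    5: [0.025, 0, 0, 0.025, 0.025, 0, 0, 0],      # NOSE_LEFT
    6: [-0.025, 0, 0, -0.025, -0.025, 0, 0, 0],   # NOSE_RIGHT
}


def apply_action(actuator_state, action_id, low=-1.0, high=1.0):
    """Add the Table 1 increment to the actuator state and clip the result.

    The clipping bounds are an assumption of this sketch.
    """
    increment = ACTION_TABLE[action_id]
    return [min(high, max(low, s + d)) for s, d in zip(actuator_state, increment)]
```

Because every action changes an actuator by at most 0.025 per step, consecutive actuator commands can never jump, which is the smoothness property the discrete design is meant to enforce.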

III-C Training Setup

In this section, we describe the important factors that help stabilize the training and increase the robustness of the trained policy. During training, the target position is sampled randomly within a bounded volume around the blimp. Random sampling is important to increase sample diversity and avoid overfitting to a specific track. We reset the task only after a fixed time interval, so that there is sufficient time for the blimp to reach any target and to use the spare time to learn to stay within the target range. To increase the robustness of the policy, noise is injected into the observations and actions during training, and the results are clipped to the valid range. Lastly, the policy step is longer than the simulation step. Since the blimp has a relatively long response time, we found it important to increase the step time for the action to take effect.
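The two mechanisms above (noise injection with clipping, and a policy step that spans several simulation steps) can be sketched as follows. Gaussian noise, the bounds of -1 and 1, and the helper names are assumptions of this example; `sim_step_fn` is a hypothetical callable standing in for one simulation step.

```python
import random


def add_noise_and_clip(values, noise_std, low=-1.0, high=1.0, rng=random):
    """Add zero-mean Gaussian noise to each value and clip to [low, high]."""
    return [min(high, max(low, v + rng.gauss(0.0, noise_std))) for v in values]


def policy_step(sim_step_fn, action, sim_steps_per_policy_step):
    """Hold one action for several simulation steps.

    Returns the observation produced by the last simulation step.
    """
    observation = None
    for _ in range(sim_steps_per_policy_step):
        observation = sim_step_fn(action)
    return observation
```

Holding the action for several simulation steps gives a slow plant such as a blimp time to respond before the policy observes the outcome.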

We train the policy network with the QRDQN agent. Value-based methods are in general more sample efficient than gradient-based methods and can therefore accelerate the training. QRDQN leverages a quantile network to estimate the value function, which is important to stabilize training by alleviating the chattering effect [mavrin2019distributional] and extreme value estimates. The architecture of our policy network is shown in Fig. 3. To reduce training time, we use only a limited number of quantiles and sacrifice some estimation resolution.

Fig. 3: Architecture of the policy network, which is built from linear layers. To prevent vanishing/exploding gradients, we add a normalization layer to every linear layer. The network outputs a value array that evaluates each possible action in the action space. The policy chooses the action based on a greedy law.
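For a quantile-based agent, the greedy law can be illustrated as follows: the value of each action is estimated as the mean of its predicted quantiles, and the action with the largest value is selected. This is a generic sketch of quantile-regression value estimation in plain Python, with the function name chosen for this example; it is not the authors' network code.

```python
def greedy_action(quantiles_per_action):
    """Return (best_action, q_values) from per-action quantile estimates.

    quantiles_per_action[a] is the list of quantile values predicted for
    action a. Each Q-value is the mean over that action's quantiles.
    """
    q_values = [sum(q) / len(q) for q in quantiles_per_action]
    best_action = max(range(len(q_values)), key=q_values.__getitem__)
    return best_action, q_values
```

Using fewer quantiles makes each estimate cheaper to compute and train, at the cost of a coarser picture of the return distribution.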

IV Experiment

In the experiments, with the real-world scenario in mind, we address the following questions. Is our DRL formulation with the compact representation able to solve the complex 3-D path-following task? To answer this question, we introduce the navigation and hovering tasks, which are the building blocks for more complicated tasks. To evaluate the agent performance, a PID controller (from our previous work [Price:IAS:2021]), which is simple but well known for its robustness, is used as a benchmark. Finally, through various experiments (see Sec. IV-C), we evaluate whether the RL agent is ready to be deployed in the real world. In other words, we evaluate the agent's robustness against unknown environmental changes.

IV-A Experiment Setup

We integrate our RL training environment in the ROS/Gazebo SITL simulation following the OpenAI-Gym framework. The QRDQN implementation is based on Stable-Baselines3 [stable-baselines3]. The agent is trained for 7 days on a single computer (AMD Ryzen Threadripper 3960X, 24x 3.8GHz, NVIDIA GeForce RTX 2080 Ti, 11GB). Our simulation environment is designed based on our real robotic blimp (see Fig. 1). The baseline PID controller is well tuned to the simulation environment. Our previous work [Price:IAS:2021] has shown that we could deploy it in the real world without further tuning, which implies a good quality of the simulation.

IV-B Task Suite

To evaluate the performance of the agent, we use the navigation and hover tasks introduced in Sec. III-B. For convenience, we visualize the target waypoints in the world ENU frame.

IV-B1 Navigation

Four waypoints are created at a common altitude to form a square, which has to be traversed in a counter-clockwise direction. A waypoint is registered when the blimp is within a given radius of it, and then the next waypoint is triggered. The early waypoint trigger reduces overshoot and yields better performance. To make sure the comparison is fair, the track has to be performed 3 times to be marked as complete. The PID controller is given a slow reference speed to prevent overshoot, while the agent is not subject to any speed limit other than maximum throttle. The results are shown in Fig. 4(a). The PID controller has a stable performance during the whole task and remains a challenging baseline. On the other hand, although our trained RL policy can complete the navigation task successfully, it shows a larger deviation from the reference path. It spends most of the time hovering above the waypoints and reduces the altitude until the next waypoint gets triggered.
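The waypoint bookkeeping for this task can be sketched as below: a waypoint counts as registered once the blimp is within a trigger radius, the next waypoint then becomes active, and the track is complete after a given number of laps. The class name, the use of the full 3-D distance, and all numeric values in the usage note are assumptions of this example.

```python
import math


class WaypointTrack:
    """Cycle through waypoints, registering one when the blimp is close enough."""

    def __init__(self, waypoints, trigger_radius, laps=3):
        self.waypoints = list(waypoints)
        self.trigger_radius = trigger_radius
        self.laps = laps
        self.index = 0       # index of the currently active waypoint
        self.completed = 0   # number of waypoints registered so far

    @property
    def done(self):
        return self.completed >= self.laps * len(self.waypoints)

    def target(self):
        """Return the active waypoint, or None once all laps are complete."""
        return None if self.done else self.waypoints[self.index]

    def update(self, position):
        """Register the active waypoint if reached, then return the new target."""
        active = self.waypoints[self.index]
        if not self.done and math.dist(position, active) <= self.trigger_radius:
            self.completed += 1
            self.index = (self.index + 1) % len(self.waypoints)
        return self.target()
```

A square traversed counter-clockwise in an ENU frame could, for instance, be given as `[(0, 0, 5), (10, 0, 5), (10, 10, 5), (0, 10, 5)]`, where these coordinates are purely illustrative.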

(a) Navigation task. Left: the planar trajectory of the blimp. Right: the altitude trajectory. Red: reference. Black: PID controller. Green: RL policy. The PID controller completes the task faster than the RL policy, which seems to favor an altitude above the target altitude.
(b) Hover task. Left: the planar trajectory of the blimp. Right: the altitude trajectory. Red: reference. Black: PID controller. Green: RL policy. The RL policy hovers within a smaller radius around the target, but it loiters above the target altitude.
Fig. 4: Comparison of the PID controller and the RL policy in the navigation and hover tasks.

IV-B2 Hovering

The hovering task requires the blimp to stay as close to the target as possible without spending an excessive amount of energy. The target waypoint is defined in the world ENU frame, and the blimp is spawned at the target position. The result is shown in Fig. 4(b). The PID controller requires a larger radius compared to the RL policy. On the other hand, similar to the navigation task, the RL policy tends to hover above the target altitude. Our initial explanation for this behavior was a lack of training. However, after continuing to train the same policy, the results became worse. When the waypoint is an arbitrary point in space and far from the origin, the agent accelerates towards the target at first, then hovers close to it, and finally makes a slow approach towards it. We argue that hovering close to the target altitude gives long-term advantages to the agent, for the following reasons. First, the agent receives more action reward, as it no longer needs to issue commands but only needs to wait until it slowly approaches the target altitude. Second, during this time, the distance is short enough to receive a good amount of hover reward; if the agent rushed to the target, it would most likely overshoot and spend an excessive amount of energy to come back to the hover position, and subsequently overshoot again. Third, since the speed of the blimp becomes very small when approaching the target, the agent can easily stay longer within the target range and continuously receive abundant success rewards.

We are able to reproduce this behavior, as shown in Fig. 5(a). The blimp is spawned above the target altitude. We first observe that the blimp approaches the target, then stays close to it at a low speed. During this time, it still receives a good amount of tracking and success reward, as shown in Fig. 5(b). The blimp then continues to sink below the target, after which the policy brings it back and raises it above the target altitude. This is followed by the hovering behavior observed in Fig. 4(b). Such behavior comes from the fact that the total reward is dominated by the success reward, and the loss of altitude does not result in a significant penalty in the tracking reward. Therefore, adjusting the reward function and increasing the altitude weight could help eliminate this behavior.

(a) What happens if the agent is spawned far from the target? Left: the planar view of the blimp approaching the target at (50,50,70). Right: the altitude trajectory of the blimp. Red: waypoint. The blimp loses significant altitude near the target position.
(b) Reward exploitation by the agent. Red: total reward. Black: success reward. Blue: tracking reward. Green: action reward. The agent still receives a good amount of reward despite the altitude loss.
Fig. 5: Analysis of the hovering behavior. Magenta: the time at which the agent responds to the altitude loss. The total reward is calculated based on (6). The tracking reward does not penalize the altitude discrepancy enough, which causes the strange behavior.

IV-C Robustness Study

To show the robustness of the agent, we test our agent with i) a fixed wind field, ii) changes in the blimp buoyancy, and iii) changes in the weight distribution along the gondola. In Fig. 6, the agent is able to handle the weaker wind disturbance but fails under the stronger one. Under wind, it takes a significantly larger amount of time to finish the task. Notice that under the weaker wind, the agent trajectory seems to be smoother. This is because the wind slows down the agent and prevents it from overshooting the target. Under the stronger wind, the agent tends to slow down when it approaches the target, as shown in Fig. 7. This is a side effect of training the navigation and hovering tasks together, as the agent tries to slow down to stay within the range of the success reward, and it starts to do so well before reaching the target. As a result, although the agent has enough thrust power to overcome the wind, it gradually reduces both motors to zero speed and is then blown away by the wind. A naïve workaround is to toggle the target switch at the distance where the blimp starts to slow down. But a toggle with such a huge radius is not realistic. Notice that in Fig. 7, the motors always show smooth transitions, because the discrete action design only allows a small change in motor speed per policy step.

Fig. 6: Effect of wind. Blue: no wind. Green and orange: the two tested wind speeds. The dark arrow indicates the direction of the wind field. This experiment was conducted for 3 laps in the 'no wind' case but for only 1 lap in the other two cases. Under the stronger wind, the agent is not able to complete the task.
Buoyancy    Avg. time (s)
100%        238
95%         545
90%         NA
85%         NA
80%         NA
(a) Effect of buoyancy on the average time to complete a square.

Added mass (g)    Avg. time (s)
0                 238
-100              237
-250              NA
100               328
250               NA
(b) Effect of trim weight on the average time to complete a square.

TABLE 2: Changes in buoyancy and weight have a significant impact on the agent.
Fig. 7: Behavior of the DRL policy when the blimp is subject to wind. Motor0 denotes the motor output. The distance to target is the Euclidean distance between the blimp and the target. The left side of the green dotted line corresponds to the first waypoint, and the right side to the second. The second waypoint is harder to reach since the blimp has to move against the wind. Whenever the blimp is close to the target, it starts to transition into hover mode. This causes the blimp to be blown away by the wind, from which it is never able to recover.

Another common scenario for the blimp in the real world is a change in buoyancy. Depending on the weather, the buoyancy of the blimp can change significantly. We test the performance of the agent with a decreasing amount of buoyancy w.r.t. the original state. Unsurprisingly, the result in Tab. 2a shows that decreasing buoyancy leads to worse performance. With 95% buoyancy, the policy performance suffers significantly and the task takes much longer to finish (545 s instead of 238 s). Below that, the agent is not able to control the blimp at all. The effect of the weight distribution (which is also commonly affected in the real world) cannot be ignored either, as it could introduce unnecessary vibrations if not balanced. To this end, we perform another experiment, where we add and remove ballast at the front end of the blimp to break this balance. Results in Tab. 2b suggest that a mass change of 100 g can still be handled (removing 100 g leaves the completion time essentially unchanged at 237 s, while adding 100 g increases it to 328 s), but larger changes impair the policy. These two experiments show that the RL agent is currently sensitive to changes in the environment. We expect to improve the performance of the agent by increasing the penalty for altitude loss.

V Discussion

In this work, we integrated the ROS/Gazebo SITL blimp simulation with the RL training environment. We have derived a compact representation of the state space and action space, which reduces training time and guarantees actuator continuity. We have shown that such a setting is able to successfully complete the task. The trained policy network has a certain degree of robustness against wind and parameter uncertainty.

On the other hand, we have observed and analysed how the agent exploits the reward function. The altitude loss is unacceptably large for this agent to be deployed in the real world. Increasing the altitude reward weights and penalizing the altitude loss could potentially address this issue. However, further experiments are needed to verify this hypothesis. In this work, reverse thrust and thrust vectoring were not enabled. Given a more diverse action space, the agent is more likely to gain more rewards by staying closer to the target. Another problem is that, when navigation and hovering are trained at the same time, the agent learns conservative behavior when approaching a waypoint. This could potentially be eliminated by including disturbances in the training, making it harder to exploit the weakness of this approach. A more promising solution would be multi-task learning, which trains the navigation and hovering tasks independently.

There are many other open issues that have not been addressed in this work so far. In real-world experiments, not presented in this paper, we have encountered several difficulties even when flying with a PID controller. For example, in this work we assumed that the lateral movement can be neglected and that the longitudinal velocity is always positive. According to our observations in the real world, this is a dangerous assumption, as it does not hold in the presence of moderate to strong wind. When the wind speed is larger than the vehicle's, the speed can become negative, and lateral movement can be created if the wind is blowing from the side. This can be dangerous and cause undesired behavior of the policy network.

Finally, blimp control has not received enough attention and remains an underdeveloped field. RL-based methods do not provide any stability guarantee but offer the potential to learn continuously from data and improve their own performance. Conversely, nonlinear controllers are robust against parameter uncertainty and disturbances at the expense of control performance. How to leverage these two approaches is the key to the success of future blimp control methods. Since blimp dynamics are heavily dependent on the environment, the blimp serves as a perfect robotic platform to study adaptive learning control. Secondly, modern DRL algorithms are still not sample efficient enough. Our next step is to leverage parallel training to accelerate experience gathering. This also allows us to increase the diversity in the training and offers the potential to leverage multi-task learning, as mentioned in [espeholt2018impala]. Lastly, for the agent to counter partial observability, such as wind disturbances, it is important to include past experiences in the decision-making process. For example, a recurrent network architecture might be a possible solution.