Policy Search for Model Predictive Control with Application to Agile Drone Flight

by   Yunlong Song, et al.

Policy Search and Model Predictive Control (MPC) are two different paradigms for robot control: policy search has the strength of automatically learning complex policies using experienced data, while MPC can offer optimal control performance using models and trajectory optimization. An open research question is how to leverage and combine the advantages of both approaches. In this work, we provide an answer by using policy search for automatically choosing high-level decision variables for MPC, which leads to a novel policy-search-for-model-predictive-control framework. Specifically, we formulate the MPC as a parameterized controller, where the hard-to-optimize decision variables are represented as high-level policies. Such a formulation allows optimizing policies in a self-supervised fashion. We validate this framework by focusing on a challenging problem in agile drone flight: flying a quadrotor through fast-moving gates. Experiments show that our controller achieves robust and real-time control performance in both simulation and the real world. The proposed framework offers a new perspective for merging learning and control.



There are no comments yet.


page 1

page 12

page 13

page 14


Learning High-Level Policies for Model Predictive Control

The combination of policy search and deep neural networks holds the prom...

Neural-MPC: Deep Learning Model Predictive Control for Quadrotors and Agile Robotic Platforms

Model Predictive Control (MPC) has become a popular framework in embedde...

Aggressive Perception-Aware Navigation using Deep Optical Flow Dynamics and PixelMPC

Recently, vision-based control has gained traction by leveraging the pow...

Coalition Control Model: A Dynamic Resource Distribution Method Based on Model Predicative Control

Optimization of resource distribution has been a challenging topic in cu...

Quadrotor Trajectory Tracking with Learned Dynamics: Joint Koopman-based Learning of System Models and Function Dictionaries

Nonlinear dynamical effects are crucial to the operation of many agile r...

A Comparison of Model-Free and Model Predictive Control for Price Responsive Water Heaters

We present a careful comparison of two model-free control algorithms, Ev...

Performance, Precision, and Payloads: Adaptive Nonlinear MPC for Quadrotors

Agile quadrotor flight in challenging environments has the potential to ...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Mobile robots operate in a dynamic world. Notably, quadrotors are agile robots that can navigate at high speeds in highly complex and dynamic environments otherwise inaccessible to humans. However, sudden environmental changes, like dynamic obstacles, can raise fundamental problems for the vehicle control since they require the vehicle to have fast reactions and replan its trajectory quickly.

An essential requirement for agile drone flight in dynamic environments is to adapt the vehicle trajectory rapidly depending on the environmental changes. State-of-the-art model-based approaches have shown to be effective for controlling the quadrotor in both static and dynamic environments [falanga2020dynamic, mellinger2011minimum, neunert2016fast, mueller2015computationally, Zhou19arxiv, karaman2012high, loianno2016estimation, allen2019real, richter2016polynomial, landry2016aggressive]. For example, in the context of drone racing [foehn2020alphapilot, loquercio2019deep, moon2019challenges], the drone has to fly through a sequence of static gates (subjected to small disturbance) at extremely high speeds.

Fig. 1: An overview of the proposed policy-search-for-model-predictive-control framework.
Fig. 2: An application of the proposed method for flying a quadrotor through a fast moving gate.

Model predictive control (MPC) [rawlings2009model] has been shown to be a powerful model-based approach for solving complex quadrotor control problems [Falanga2018, neunert2016fast, kamel2017model, nguyen2020model, foehn2020cpc]. For example, the perception-aware MPC [Falanga2018] is a framework that unifies both planning and perception objectives. MPC is increasingly gaining popularity in many robotic domains, thanks to its capability of simultaneously dealing with complex nonlinear dynamic systems while satisfying different state and input constraints.

Despite the successes, many MPC applications still experience significant challenges, such as the requirement of an accurate mathematical model and the necessity of solving trajectory optimization problems online with the limited computational power of small-scale computers. In practice, the closed-loop performance of MPC for a specific task is sensitive to several design choices, including cost function formulation, hyperparameters, and the prediction horizon. As a result, a series of approximations, heuristics, and parameter tuning is employed, producing sub-optimal solutions.

On the other hand, reinforcement learning (RL) [sutton2018reinforcement] methods, like policy search, allow solving continuous control problems with minimum prior knowledge about the task. The key idea of RL is to automatically train the policy via trial and error and maximize the task performance measured by the given reward function. While RL has achieved impressive results in solving a wide range of robot control tasks [hwangbo2017control, lee2020learning, song2021overtaking, song2021autonomous], the lack of interpretability of an end-to-end controller trained using RL is of significant concern by the control community [donti2020enforcing].

Ideally, the control framework should be able to combine the advantages of both methods—the ability of model-based controllers, like MPC, to safely control a physical robot using the well-established knowledge in dynamic modeling and optimization and the power of RL to learn complex policies using experienced data automatically. Therefore, the resulting control framework can handle large-scale inputs, reduce human-in-the-loop design and tuning, and eventually achieve adaptive and optimal control performance. Despite these valuable features, designing such a system remains a significant challenge.

To this end, one line of research in the learning community has been focusing on developing data-efficient policy search methods using model priors. For instance, guided policy search (GPS) algorithms [zhang2016learning, levine2016end, kaufmann2020deep]

opt for transforming RL into a supervised learning problem. The key idea in GPS is to use a trajectory optimization algorithm to collect training data for training neural networks via supervised learning. However, these methods still learn black-box control policies that suffer from poor generalizations. The second trend pertains to learning-based MPC 

[williams2017infoMPC, kabzan2019learning, ostafew2016robust, rosolia2019learning], which can leverage real-world data to improve dynamics modeling and use model predictive path integral control (MPPI) [williams2017model] for optimization. Such algorithms generally have their roots in stochastic optimal control and require sampling a large amount of data in real-time for optimization, making those methods computationally expensive.


In this work, we propose a new paradigm for merging learning and control: learning high-level policies for model predictive control using policy search. An overview of our approach is summarized in Fig. 1. Specifically, we consider the MPC as a parameterized controller and formulate the search of high-level decision variables for MPC as a probabilistic policy search problem. First, we use two general Gaussian policies for modeling the high-level decision variables and show that the policy updates have closed-form solutions. Second, we propose a self-supervised training method for learning neural network policies. Our key insight is that policy search is useful for making high-level decisions for MPC, allowing automatically learning and adapting hard-to-optimize parameters.

On the experiment side, we evaluate our approach by addressing a challenging problem towards autonomous agile drone flight in dynamic environments: controlling a quadrotor to fly through a sequence of fast-moving gates. The key advantage of our approach compared to the standard MPC formulation is that the desired traversal time, which is hard to optimize simultaneously with other state variables, can be learned offline and can be adaptively selected at runtime. The resulting controller, which consists of a trained neural network policy and an MPC, achieves real-time control performance of a physical drone. An illustration of the real-world experiment is shown in Fig. 2.

This work is an extension of our previous conference paper [Yunlong2020learning]. The earlier version of this work proposed learning Gaussian and neural network policies and demonstrated learning a single time variable for flying a quadrotor through a dynamic gate in simulation. In this paper, we additionally 1) introduce a new algorithm for learning a Gaussian linear policy, 2) demonstrate that our approach is a general framework that can learn multidimensional decision variables, not just a single variable, 3) demonstrate that our controller can control a drone to fly through multiple gates in simulation and outperforms a standard MPC and a trajectory sampling method, 4) deploy the algorithm on a physical drone and show that the trained neural network high-level policy can be transferred to the real world without fine-tuning.

Ii Related Work

Ii-a Policy Search for Robotics

Policy search [sutton2018reinforcement] is a central area of reinforcement learning concerned with how to find an optimal parametric policy by maximizing the expected return of sampled trajectories. Depending on their exploration strategies for the stochastic trajectory generation, policy search methods can be categorized into step-based and episode-based methods [deisenroth2013survey, sutton2018reinforcement]. Most variations of policy search methods make use of step-based exploration strategies by adding different exploration noise in the action space at each control time step. Step-based policy search algorithms [schulman2015trpo, williams1992simple, kakade2001natural, peters2010reps] are widely used for continuous control tasks, ranging from learning agile motor skills for legged robots [hwangbo2019learning] to controlling a simulated race car at its friction limits [song2021overtaking]. They learn end-to-end black-box control policies that can map observations directly to control commands.

By contrast, episode-based policy search methods [deisenroth2013survey, stulp2012path, sun2009efficient, sehnke2008policy] add exploration noise in the parameter space of the policy only at the beginning of the episode. In particular, episode-based methods are widely used for learning movement primitives [schaal2006dynamic, paraschos2013probabilistic, williams2008modelling], which are compact parameterizations of the robot’s control policy. For example, the task-parameterized dynamic motor primitives (DMPs) [schaal2006dynamic, ijspeert2013dynamical] are popular compact policy representations in robotics. Adjusting their parameters allows robots to learn new skills quickly and solve many challenging robot control problems, such as playing Baseball [peters2008reinforcement], Ball-in-the-cup [kober2009policy], and Table Tennis [kober2011adjust]. Episode-based policy search methods help learn compact skills representations that are not easy to model by human experts.

Ii-B Data-driven Control with Model Predictive Control

Ii-B1 MPC-guided Policy Search

Model-free policy search algorithms learn control policies via trial-and-error; however, they suffer from high sample complexities. Guided policy search [levine2013guided] converts model-free policy search to supervised learning by iteratively collecting the training data using offline trajectory optimization [levine2014learning1, levine2016end, levine2013guided, levine2014learning2] or model predictive control [zhang2016learning, kaufmann2020deep]

. A key advantage of guided policy search is that it effectively trains deep neural network control policies, where the policy can handle complex and high-dimensional inputs from onboard sensors. For example, a deep sensorimotor policy, trained using MPC and imitation learning, enables an autonomous quadrotor to fly extreme acrobatic maneuvers with only onboard sensing and computation 

[kaufmann2020deep]. However, this line of work usually only uses the model during training and results in a policy specialized in a single task. Despite all of the successes achieved by guided policy search, the lack of generalization and robustness of the end-to-end policy remains a primary challenge.

Ii-B2 Learning-based MPC

In the second paradigm, learning-based MPC [kabzan2019learning, williams2018information, williams2017infoMPC, ostafew2016robust, rosolia2019learning] can leverage real-world data to improve dynamic modeling or learn a cost function for MPC. It allows for a more robust and flexible MPC design. In particular, sampling-based MPC [williams2018information] algorithms are developed for handling complex cost criteria and general nonlinear dynamics. This is achieved by combining neural networks for the system dynamics approximation with the model predictive path integral (MPPI) control framework [williams2018information] for real-time control optimization. A crucial requirement for the sampling-based MPC is to generate a large number of samples in real-time, where the sampling procedure is generally performed in parallel by using graphics processing units (GPUs). Hence, it is computationally and memory expensive to run sampling-based MPC on embedded systems. These methods generally focus on learning dynamics for tasks where a dynamical model of the robots or its environment is challenging to derive analytically, such as aggressive autonomous driving around a dirt track [williams2017infoMPC].

Alternatively, differentiable MPC [amos2018differentiable] treats the MPC as a differentiable policy class for reinforcement learning. Hence, by differentiating through the optimization problem using the Karush–Kuhn–Tucker (KKT) conditions of the convex approximation at a fixed point of the controller, it can also learn the costs and dynamics of an MPC controller via end-to-end learning. The analytical derivative relies on a fixed point of the controller, which, however, often does not exist when using neural networks to approximate the dynamics [amos2018differentiable].

Fig. 3:

Taxonomy of existing methods combining machine learning with model predictive control.

Ii-B3 Learning Neural Network Policies for MPC

To combine the power of neural networks and the strength of standard MPC optimization, state-of-the-art systems [kaufmann2019beauty, drews2017aggressive, drews2019vision]

opt for using deep neural networks as standalone representation learning modules. Specifically, a neural network is trained to process high-dimensional data, such as images, and is used to predict low-dimensional state information for the MPC. For instance,


proposed to combine a Convolutional Neural Network (CNN) with an MPC controller to solve the problem of navigating a quadrotor to pass through multiple gates. The trained neural network predicts three-dimensional poses of the gate’s center from image observations, and then, the MPC outputs control commands for the quadrotor to navigate through the predicted waypoints. Similarly, the method in

[drews2017aggressive] tackles an aggressive autonomous driving problem by using a CNN-based policy to predict a cost map of the track, which is then directly used for online trajectory optimization. A key advantage of this line of work is that it combines the benefits of both neural networks for high-dimensional data processing and MPC for robot control.

Iii Preliminary

We introduce mathematical formulations for both MPC and policy search. We discuss three kinds of policy representations that are widely used in RL and use them to model the high-level policies.

Iii-a Model Predictive Control

We consider the problem of controlling a nonlinear deterministic dynamical system whose dynamics is defined by a differential equation , where

is the state vector,

is a vector of control commands, and is the derivative of the current state. In MPC, we approximate the actual continuous time differential equation using a set of discrete time integration , with as the time interval between consecutive states and as an approximated dynamical model.

Let be a vector of reference states, e.g., a planned trajectory. We define a vector of high-level decision variables as . For example, for our application (Section V.), defines the time variables at which the robot should be passing the corresponding gate. At every control time step , the system is in state . MPC takes the current state , the reference states , and the high-level decision vector as input.

Formally, MPC minimizes a cost function over a fixed finite time horizon by solving an optimization problem:

subject to

where represents equality constraints and represents inequality constraints.

Our goal is to find the optimal control command for the current state such that we can execute it and move the robot to the next state. It is achieved by solving the trajectory optimization problem in real-time and by repeating this process at every control time step. Specifically, MPC minimizes the cost in the future states and outputs an optimal trajectory () that consists of a sequence of control commands and states. Only the first command is executed on the robot.

In this work, we take the MPC as a parametric controller that is parameterized by the high-level decision variables . Therefore, modulating the variables can result in different MPC outputs, denoted as . For example, in the context of flying through a dynamic gate (Section V.), can be the desired time at which the vehicle passes through the gate. This formulation allows us to incorporate reinforcement learning as well as different function representations into the MPC design.

Iii-B Policy Search

We summarize policy search by following the derivation from [deisenroth2013survey], in particular, we focus our discussion on episode-based policy search (or episodic reinforcement learning). Unlike many step-based policy search algorithms, which explore the action space by adding exploration noise directly to the policy output, episode-based policy search adds perturbations in the policy parameter space. This kind of exploration is normally added at the beginning of an episode and a reward function  is used to evaluate the quality of trajectories  that are generated by sampled parameters . A comprehensive survey and tutorial about different policy search algorithms can be found here [chatzilygeroudis2019survey, deisenroth2013survey].

Policy search algorithms try to update the policy parameters  by maximizing the expected return of sampled trajectories


A list of episode-based policy search algorithms have been discussed in the literature, such as policy gradient methods [kohl2004policy, williams1992simple, peters2006policy]

, expectation-maximization (EM) methods 

[kober2009policy], and information-theoretic methods [christian2016hreps, peters2010reps].

Policy gradient methods use gradient-ascent for maximizing the expected return and are simple to implement. However, it requires manual selection of learning rates and has an unstable learning process or slow convergence. Information-theoretic approaches rely on solving constrained optimization for maximizing the objective, and at the same time, they constrain the information loss by bounding the Kullback-Leibler (KL) divergence between the new policy and the old policy. The requirement of solving constrained optimization limits the usage of information-theoretic algorithms to solve high-dimensional problems, such as optimizing neural network policies, making it challenging to implement in reality.

On the other hand, the EM-based policy search algorithms provide closed-form solutions for many commonly used policy representations, and hence, do not require the user to specify the learning rate. In addition, they provide a good trade-off between computational efficiency and sample complexity. This is realized by formulating policy search as a probabilistic inference problem with latent variables, which leads to a weighted maximum likelihood estimate. Subsequently, we can use the Expectation-Maximization algorithm to update the policy parameters. We focus on a probabilistic model in which the search for high-level decision variables in the MPC optimization is treated as a probabilistic inference problem.

Iii-C Policy Representations

We represent the high-level policy as 

, which is modeled as a probability distribution or a deterministic policy (e.g., a neural network), and use the policy to select high-level decision variables 

. Here, are the policy parameters that have to be trained.

Iii-C1 Gaussian Policy, Constant Mean

First, we consider a simple scenario where the goal is to find a set of fixed decision variables 

. The variables are independent of the robot’s state. We use a Gaussian distribution 

to represent the policy, where is a mean vector and is a diagonal covariance matrix. The covariance matrix is needed in order to incorporate exploration. Therefore, the policy parameters are .

Iii-C2 Gaussian Policy, Linear Mean

Second, we consider a more general problem in which we want to find a set of adaptive decision variables, denoted as . The decisions variables are associated with the robot’s context . We use a Linear Gaussian model  to denote the policy, in which the Gaussian mean is represented by a linear function approximator, linear with respect to the function parameters . Here, is a kernel featurizer that converts the states of dimension  into a vector of features of dimension 

using basis functions, such as Radial Basis Functions (RBF) 

[sutton2018reinforcement] or Random Fourier Features (RFF) [rajeswaran2017towards]. Therefore, the policy parameters are  .

Iii-C3 Neural Network Policy

We use a neural network as a deterministic policy representation. Here, represents the neural network and is the robot’s observation at different time step . The solution for updating the parameters  of a neural network in policy search is difficult to derive analytically due to the highly nonlinear property of neural networks. Many deep reinforcement learning algorithms are based on policy gradients [henderson2018deep], which are known to have unstable learning processes or slow convergence. By contrast, we use a self-supervised learning algorithm for training the neural network policy (Algorithm 3).

Iv Probabilistic Policy Search for MPC

Iv-a Problem Formulation

We treat MPC as a controller  that is parameterized by the high-level decision variables . Here, is a trajectory generated by MPC given , where are control commands and are corresponding states of the robot. By perturbing , MPC can result in completely different trajectories . To find the optimal trajectory for a given task, the optimal has to be defined in advance. First, we model 

as a high-level policy represented by a probability distribution, specifically a parameterized Gaussian distribution. Then, we optimize the policy using probabilistic policy search (or probabilistic inference) algorithms. A visualization of the inference problem is given in Fig 

4 (inspired by [deisenroth2013survey]).

Fig. 4: A graphical model of probabilistic policy search for model predictive control.

To formulate the policy search as a latent variable inference problem, similar to [neumann2011variational, toussaint2009robot, deisenroth2013survey], we introduce a binary “reward event” as an observed variable, denoted as . Maximizing the reward signal implies maximizing the probability of this “reward event”. The probability of this reward event is given by , where is a reward function for evaluating the goodness of the MPC solution

with respect to a given evaluation metric of the task. This leads to the following maximum likelihood problem 



which is intractable to solve directly and can be approximated efficiently using Monte-Carlo Expectation-Maximization (MC-EM) [kober2009policy, vlassis2009model]. MC-EM algorithms find the maximum likelihood solution for the log marginal-likelihood (3) by introducing a variational distribution , and then, decompose the marginal log-likelihood into two terms:


where the  is the Kullback–Leibler (KL) divergence between and the reward-weighted trajectory distribution . Here, is the lower bound of  as .

The MC-EM algorithm is an iterative method that alternates between performing an Expectation (E) step and a Maximization (M) step. In the expectation step, we minimize the KL-divergence , which is equivalent to setting . In the maximization step, we update the policy parameters by maximizing the expected complete data log-likelihood


where each sample is weighted by the probability of the “reward event”, denoted as . The trajectory distribution can be replayed by the high-level policy . To transform the reward signal of a sampled trajectory into a probability distribution of the “reward event”, we use the exponential transformation [neumann2011variational, toussaint2009robot, deisenroth2013survey]:


where the parameter denotes the inverse temperature of the soft-max distribution, higher value of implies a more greedy policy update.

Iv-B Learning Gaussian Policies

Iv-B1 Gaussian Policy, Constant Mean

We first focus on solving a simple problem of learning a Gaussian policy  whose mean is a vector of unknown variables. We consider the robot at a fixed state , which does not change during learning. At the beginning of each training iteration, we randomly sample a list of parameters of length from the current policy distribution  and evaluate the parameters via a predefined reward function , where are the trajectories predicted by solving the MPC given the sampled variables .

In the Expectation step, we transform the computed reward signal into a non-negative weight  (improper probability distribution) via the exponential transformation (6). In the Maximization step, we update the policy parameters by optimizing the weighted maximum likelihood objective:


where the policy parameters, both the mean and the covariance, are updated using the following closed-form expressions:


We repeat this process until the expectation of the sampled reward converges. After training (during policy evaluation), we simply take the mean vector of the Gaussian policy as the optimal decision variables for the MPC. Therefore, is the optimal MPC decision variables found by our policy search. A complete episode-based policy search for learning a high-level Gaussian policy for MPC is given in Algorithm 1. A detailed derivation of the above solution is available in the Appendix.

While not converged
 Sample variables:
 Sample trajectories:
Output: Trained Policy
Algorithm 1 Learning Gaussian Policies for MPC

Iv-B2 Gaussian Policy, Linear Mean

Algorithm 1 can learn a Gaussian policy that only suits for a single experiment setting or a specific scenario. For generalizing the learned policy to different scenarios, we extend the algorithm by learning a Gaussian linear policy  whose mean is a linear function approximator . We characterize a scenario by the robot’s context, denoted by a vector . The problem of learning  is called contextual policy search and can be defined by maximizing the expected returns over all different contexts:


where is the distribution over . The objective (9) can be optimized using the standard MC-EM algorithm, and it results in a different weighted maximum likelihood objective:


Maximizing Eq. (10) results in closed-form solutions for the policy parameters:


where is a matrix that contains converted feature vectors for all sampled contexts  and is the diagonal weighting matrix containing the weights . In the covariance matrix update, is the same as in Eq. (8). Here, is a small positive variable and

is an identity matrix. The introduce of

is for numerical stability when calculating the matrix inverse. A complete policy search for learning a Gaussian linear policy for MPC is given in Algorithm 2. A detailed derivation of the above solution is available in the Appendix.

We use the Random Fourier Features (RFF) [rajeswaran2017towards] as the featurizer


where each element  is randomly sampled from , is a bandwidth parameter, and is a random phase shift drawn from . The bandwidth is the only parameter that has to be tuned. The RFF-based linear policy has been used to solve many benchmark continuous control tasks, including the OpenAI gym benchmarks [brockman2016gym].

While not converged
 Sample observations:
 Sample variables:
 Sample trajectories:
Output: Trained Policy
Algorithm 2 Learning Gaussian Linear Policies for MPC

Iv-C Learning Neural Network Policies

Algorithm 2 can optimize a linear policy using an episodic policy search method. Such episodic policy search by design is used for learning policies in multi-task settings, where distributions over different tasks are well-defined. When controlling a robot in a highly dynamic environment, where the observations differ significantly from state to state, we use a step-based policy search algorithm. Also, we aim to learn a complex neural network policy for selecting adaptive decision variables and for processing relatively high dimensional observations. Such properties are potentially useful for the robot to adapt its behavior online in a highly dynamic environment.

We train the neural network policy by combining Algorithm 1 with supervised learning. A complete algorithm of learning neural network policies for MPC is given in Algorithm 3. We divide the learning process into two stages: 1) data collection, 2) policy learning. In the data collection stage, we randomly initialize the robot in a state and find the optimal decision variables via Algorithm 1. We aggregate the dataset by , where is the current observation of the robot. A sequence of optimal control actions are computed by solving the MPC optimization, given the current state of the robot and the learned variable . Only the first control command is applied to the robot; subsequently, the robot moves to the next state. Incrementally, we collect a set of data that has diverse training pairs consisting of an observation  as the neural network input and a ground-truth value  as the output.

It is important to note that at each simulation time step , we run Algorithm 1 to solve multiple MPC optimizations in order to find the optimal decision variable for the current state. This step can be viewed as an online learning process in the simulator and is difficult to be run in real-time. During policy learning, we train the neural network by minimizing the mean-squared-error between the labels and the output of the network

, using the standard stochastic gradient descent. After training, the neural network policy is deployed together with the MPC to control the vehicle. Since the resulting controller contains a high-level policy and an MPC, we name this controller High-MPC.

Data collection (repeat)
 Randomly reset the system:
 While not done:
   Algorithm 1 ()
  Data collection:
  MPC optimization:
  System transition:
Policy learning
Output: Learned deep high-level policy
Algorithm 3 Learning Neural Network Policies for MPC

V Flying A Quadrotor through dynamic gates

We apply the proposed policy-search-for-MPC framework to address a challenging problem towards agile drone flight in dynamic environments, which is learning to fly through dynamic gates. The ability to fly through fast-moving gates enables the drone to traverse inside a dynamic environment, where the free space is changing rapidly. It is a difficult task since it requires simultaneously planning an accurate trajectory that passes through the center of moving gates and controlling the quadrotor to precisely follow the trajectory.

V-1 Quadrotor Dynamics

We model the quadrotor as a rigid body controlled by four motors. The dynamics of the system can be written as:


where and are the position and the velocity vectors of the quadrotor in the world frame . We use a unit quaternion to represent the orientation of the quadrotor and use to denote the body rates (roll, pitch, and yaw respectively) in the body frame . Here, with is the gravity vector, is the inertia matrix, is the three dimensional torque, and

is a skew-symmetric matrix. Finally,

is the mass-normalized thrust vector. The full state of the quadrotor is defined as (we omit subscript for clarity).

V-2 Pendulum Dynamics

We model the dynamic gate as a nonlinear pendulum. We approximate the pendulum gate as a point mass that is suspended on a weightless and inextensible string of length from a fixed support whose position is . The pendulum is subject to three forces: the gravity, the tension force results from the string pulling upon the bob of the pendulum, and a damping force due to friction and air resistance. We approximate the damping force by , where is the angular velocity and is a damping factor. We consider the pendulum’s motion in the plane, meaning the pendulum rotates about the axis. Dynamics of the rotational motion is described by the following differential equations


where is the roll angle, and

is the moment of inertia. Dynamics of the translational motion are given by


where is the distance between the gate center and the fixed point. For the computational convenience, we transfer the Euler angles into a unit quaternion  to represent the gate orientation. The full state of the gate center with respect to the inertial frame is represented using the state vector .

V-a Learning to Fly Through Dynamic Gates

Trajectory Optimization and Cost Function: We formulate the problem of learning to fly through dynamic gates. Our main goal is to find a trajectory that passes through the center of the moving gates. Such a trajectory optimization problem involves 1) decide a sequence of traversal times at which should the dynamic gates be passed, 2) given these traversal times, find a trajectory that passes these gates. Since the gates are moving quickly, the optimization faces a chicken-and-egg dilemma, namely, without obtaining the traversal times, it cannot determine the gates’ state from which the vehicle should fly through, or without the gates’ state, it cannot decide the traversal times.

We first make the assumption that a vector of desired traversal times  for each gate  is given, where . Here, is the total number of moving gates. Since we know the current states and the dynamic model of the moving gates, we can predict the future trajectory  for each gate , where is a state vector that consists of position , linear velocity , and orientation  . Therefore, we define a gate-pass cost as the following quadratic cost


where is a diagonal cost matrix and a Boolean variable defined as


Minimizing this loss function encourages the discretized states

to stay as closer as possible to the gate states , but only at the given desired traversal time for the gate . For other discretized states at , the loss has no effects since .

However, such a cost formulation requires very accurate dynamic modeling for both the quadrotor and the moving gates, since only characterizes several sparse states that are close to time nodes of the given traversal time vector . In other words, the optimization treats the moving gates as a list of static waypoints and takes them as soft constraints, without considering their dynamic motions. To counteract potential model errors and uncertainties, we define a gate-follow cost

Here, is a diagonal cost matrix, defines the exponential weights for following the gate’s motion, defines the temporal spread of the weight, and specifies different weights for tracking different gates. The gate-follow cost provides an intuitive motivation: plan a trajectory that follows the gate if the time difference between the current time and the desired traversal time is small; and does not follow the gate if the time difference is significant. In other words, it minimizes the difference between the quadrotor states and the gates’ states gradually as the quadrotor approaches the gate.

In addition, we define a terminal cost and an action regularization cost :


where is a goal state for hovering and . Here, we use body-rate control as the inputs. The terminal cost encourages the quadrotor to fly toward a goal state .

In summary, we have the following optimization problem


where the represents the generated trajectory that consists of the state vector and the control inputs .

We define a vector of high-level decision variables that consists of the desired traversal time and the desired weights for each moving gates. Here, is the total number of gates. Given the decision variables , the current state of the quadrotor , the predicted future trajectories  of all moving gates, we can solve the trajectory optimization problem and find the optimal trajectory for the quadrotor to fly through all moving gates


Policy Search and Reward Function: The aforementioned optimization formulation relies on the assumption that the desired traversal time  for each individual gate is given. In addition, the gate-follow loss requires additional variables  to define the weights that are assigned to follow the gate motion in the plane. The high-level decision variables are difficult to model or tune due to the dynamic properties of the task. Therefore, a key requirement for our MPC to solve the problem is to have both the desired traversal time and the desired weights in advance. A similar MPC formulation was discussed in [neunert2016fast], where the time variable at which a static waypoint should be reached by a quadrotor is also unknown. They solved the problem by manually selecting the variable via trial-and-error. In our case, the decision variables are much more difficult to select manually.

We solve the problem of finding optimal decision variables in MPC using policy search. We define a Euclidean distance reward function to evaluate the goodness of the trajectory generated by the optimization,


where is the time node at which the predicted quadrotor trajectory intersects with the gate state. This reward is a sparse signal that evaluates the “performance” of the sampled —high rewards indicate smaller traversal distance error and low rewards indicate large traversal distance error of the predicted trajectory. Hence, maximizing this reward signal leads to desired traversal time variables that allow the optimization to find a trajectory that passes through the center of the gate. Here, is a regularization term used for choosing smaller time variables.

Vi Experiments

We design our experiments to evaluate the proposed policy-search-for-MPC framework. Specifically, we aim to answer the following questions: 1) can we learn state-independent policies for the optimization (Section VI-A), 2) can we learn state-dependent linear policies for solving the optimization under different contexts (Section VI-B), 3) can we learn a neural network policy for adapting MPC on the fly (Section VI-C), 4) and finally, what is the performance of our system in the real world (Section VI-D).

Vi-a Learning State-Independent Time Variables for Trajectory Optimization

The first problem is a single trajectory optimization problem, where the goal is to find a safe trajectory that passes through several dynamic gates for a given initial state. We define three dynamic gates that are modeled using the same pendulum dynamics and are initialized at different positions. We use a prediction time horizon of and a discretize time step of for trajectory optimization, it results in a total discretization of nodes. We use CasADi [andersson2019casadi] with IPOP [wachter2006implementation] as the solver for the numerical optimization. We learn a vector of decision variables  using Algorithm 1, where is modeled as a high-level policy and is represented using a Gaussian distribution .

Fig. 5: Left: A planned trajectory for flying through 3 dynamic gates. The initial states of the moving gates are indicated by grey color. The quadrotor velocity () is indicated by the color bar. Right: Top: The predicted positions of three moving gates (dashed line) and of the quadrotor (solid line). The learned traversal times for three gates (vertical line). Middle: Learned weights () for the gate-following loss. Bottom: Control commands of the quadrotor.

Fig. 5 shows the predicted trajectory with the learned parameters. The optimization successfully plans a trajectory that passes through the center of all moving gates at the given learned traversal time. Besides, we achieve small traversal distance errors for all gates. The distance errors for gates are respectively. It is important to highlight that the predicted quadrotor position gradually follows the predicted gate’s center only when the quadrotor is close to the gate. Such a feature has crucial effects on real-world deployment since the dynamic modeling of the system is prone to error and the quadrotor has to follow the gate center when it is approaching the gate. Moreover, for the time stages that are far away from the desired traversal time, the pendulum motion has less influence on the quadrotor, leaving more extra control authority to counteract disturbance.

Fig. 6

shows the learning progress of the Gaussian policy, which has randomly initialized weights (both mean and variance). The learning is data-efficient and stable as the policy converges after only a few training iterations, e.g., 10 iterations for

. We train the policy for 30 iterations to make sure that the policy is fully converged. The policy update at each training step requires 30 samples, resulting in 900 MPC optimizations in total. Besides, a high temperature results in a fast (greedy) policy update while a low temperature can have slow convergence.

Fig. 6: Learning curves of the Gaussian policy. Each curve is obtained using different temperature parameters

for the policy training. We train six different randomly initialized policies for each temperature parameter. We compute the mean (solid line) and standard deviation (shadow region).

Vi-B Learning State-Dependent Time Variables for Adaptive Trajectory Optimization Using A Linear Policy

For generalizing the learned high-level policies to different settings or contexts (denoted as a vector ), we train a linear function approximator . Specifically, we want to predict the decision variables conditioned on different initialization of the moving gates. We characterize different settings using , where are the initial angles of the gates about the -axis and are randomly initialized. We represent the policy using a Gaussian linear model  (Section III-C2), where the policy parameters  are updated using Algorithm 2. We use the RFF featurizer for , in which the feature bandwidth is specified as 0.1 and the feature dimension is 40.

Fig. 6 shows the learning progress of the Gaussian linear policy. Similar to training a Gaussian policy, the learning of a Gaussian linear policy is also very data-efficient and stable, thanks to the closed-form solution (Eq. (11)) for updating the policy parameters. Here, the policy update at each training step requires 300 samples, resulting in 9000 MPC optimizations for learning a Gaussian linear policy.

Fig. 7: Learning curves of the Gaussian linear policy. Each curve is obtained using different temperature parameters for the policy training. We train six different randomly initialized policies for each temperature parameter. We compute the mean (solid line) and standard deviation (shadow region).
Fig. 8: Evaluation of the trained linear high-level policy with randomly initialized pendulum states. The generated quadrotor trajectories are colored by their speeds that are indicated using the color bars. The traversal distance errors between the quadrotor center and the gate center at the desired traversal times are indicated in the figure title.

TABLE I shows the evaluation results. The goal is to plan a trajectory for passing through 3 individual moving gates, where the initial states of the gates are randomly selected. Given a planned trajectory, we compute the traversal distance from the quadrotor center to the gate center, at the predicted time instances. We run the experiment repeatedly 100 times and compute the mean and the standard deviation of the traversal distance errors. In addition, we define a success planning if the traversal distance errors for all gates are lower than . We report the success rates for the 100 trials in TABLE I. Note that G1, G2, and G3 represent each gate separately. Our approach outperforms two simple baselines: one uses randomly sampled time variables, and one uses a heuristic-based selection of the time variables. Fig. 8 shows a visualization of 9 randomly sampled examples.

Approach Number Success Rates Mean Standard Deviation
of Gates () () ()
G1 G2 G3 G1 G2 G3 G1 G2 G3
Random 3 0 0 0 2.75 3.46 3.45 0.45 0.25 0.99
Heuristic 3 100 0 0 0.14 1.98 1.14 0.02 0.50 0.29
Ours 3 100 100 94 0.17 0.21 0.30 0.04 0.06 0.15
TABLE I: Evaluation of the trained Linear High-Level Policy.

Vi-C Learning Adaptive Variables for Dynamic Gates via Neural Network Policies

For learning a policy that is useful for the online parameters adaption or compatible with high-dimensional sensory observations, we train a neural network policy. The trained policy can hence be used for adaptively making high-level decisions for the MPC at each control time step, resulting in a closed-loop controller.

We use a Multilayer Perceptron (MLP) as the policy representation. In reality, since the environment is only partially observable, we can only observe one gate ahead. We define the observation at the time step

as , which represents the difference between the quadrotor’s current state  and the next gate’s state . The output of the neural network is the desired traversal time for flying through the next gate.

We use Algorithm 3 to train the neural network policy in simulation, in which it combines Algorithm 1 with self-supervised learning. First, we reset the system in random states, meaning we use a randomly initialized position, velocity, and orientation for the quadrotor and a random initial angle for the pendulum gate. We find the optimal traversal time in this state using the policy search algorithm (Algorithm 1). We create a training pair  by associating the current observation with the current optimal variables. Then, we simulate the quadrotor to the next state by solving the MPC optimization using the optimal traversal time and apply the optimal control command to the simulated quadrotor. We also integrate the pendulum dynamics to simulate the gate motion.

We repeat this process at each simulation time step until the quadrotor flies through the gate or it reaches the maximum simulation steps. In total, we collect 40,000 samples, which takes a multi-core CPU several hours to collect the training data. However, the total sampling time can be significantly reduced using parallel processing or multithreading. We implement a fully-connected MLP with two hidden layers of 32 units each and ReLU as the activation function. The training of network weights takes less than 10 minutes on a standard notebook with an Nvidia Quadro P1000 graphics card.

We use an open-source quadrotor simulator called Flightmare 

[song2020flightmare] to simulate the race track and the quadrotor dynamics. Fig. 9 shows a visualization of the race track in Flightmare. We design a dynamic race track that contains 5 moving gates. All moving gates are modeled using the same pendulum dynamics and rotating around the -axis.

The gates are attached to different fixed points and are initialized randomly by sampling its initial rotational angle 

from a uniform distribution 

. Specifically, the gates are separated in the -axis with a fixed position offset of , where is the difference between two consecutive pendulum gates and . Besides, we initialize the quadrotor with different initial velocities of ) in the forward direction.

We run 20 trials for each combination of the position offset and the initial quadrotor velocity and compute the success rate and the averaged traversal distance error. The averaged traversal distance error is computed using the Euclidean distance from the quadrotor center the gate. We compute the success rate by defining a task as a success if all 5 traversal distance errors are less than .

We compare our approach against two baselines: a standard MPC formulation, which does not have access to the desired traversal time, and a fast motion primitive generator [mueller2015computationally]. Since the desired time for flying through the next gate is not known beforehand, the desired traversal position is also a prior unknown. We formulate a simple MPC optimization problem as the following


where is the predicted future state of the moving gate,  (Eq. 19) is the terminal cost, and  (Eq. 20) is the action cost. Here, is a time-invariant diagonal cost matrix, which is specified for tracking the gate motion in the -axis and the -axis. Solving the above optimization problem results in a trajectory that follows the gate motion in the plane (defined by the stage cost) and reaches an end position located behind the gate (defined by the terminal cost).

Fig. 9: A visualization of the dynamic racing environment in the Flightmare simulator [song2020flightmare].

As a second baseline, we use a minimum jerk trajectory [mueller2015computationally] together with a high-level trajectory sampling scheme. We sample possible traversal trajectories by searching for the desired traversal time. We define a time interval of   with a discretization time step of   for sampling. At each control time step, a total number of 30 trajectories are sampled. The start state of the trajectory is defined by the current quadrotor state. The end state is partially defined, meaning some components are left free. The waypoint (the end position of the traversal trajectory) is calculated using a pendulum model of the moving gate and the sampled time. The end velocities and accelerations are defined as and , which means that the end velocity in the forward direction (-axis) and the accelerations are left free. We use zero velocities for the end state at the -axis and -axis to reduce the risk of having over-aggressive motion primitives. We reject trajectories that do not satisfy the input constraints and select an optimal trajectory that has less traversal time.

Fig. 10: Baseline comparison of flying through 5 moving gates using our High-MPC (left), a standard MPC (middle), and a high-frequency motion primitive sampling method [mueller2015computationally] (right). Overall, our High-MPC outperforms both baselines in terms of gate traversal distance errors and success rates.

Fig. 10 shows the evaluation results. In summary, our approach achieves the highest success rates and smallest traversal distance errors, particularly when the quadrotor has a small initial velocity and the separation distances among gates are large enough. When the quadrotor is initialized with a high forward velocity and the distance among gates are small, Our approach is prone to failure of the task due to the physical constraint of the platform. For example, the quadrotor does not have enough control authority to break immediately.

The standard MPC has achieved lower success rates and larger traversal distance errors. This is due to the cost function formulation in the MPC optimization, in which the optimization does not have access to the desired traversal time, and thus, has to minimize the distance between the quadrotor and the gate center in all optimization states. Such an optimization scheme results in very aggressive trajectories and large control inputs [Yunlong2020learning]. Therefore, it is very important to obtain the traversal time prior to the optimization and update the time at each control time adaptively for the closed-loop control. Optimizing the time jointly with other optimization variables is also possible, but might result in a complicated optimization problem that is difficult to solve in real-time.

The sampling-based motion primitives method shows a competitive performance to our method. There are multiple challenges when applying the sampling-based method to our task. First, the motion primitive is defined for state-to-state transition, in which the end velocities are difficult to be specified in advance. For example, a small velocity in the flight direction can result in a slow forward motion while a high velocity can lead to aggressive flight and gate overshooting. Second, the final performance of a task highly depends on the high-level planner, which is used for selecting the optimal motion primitive. Designing such a high-level planner is a challenging task, in which prior research generally uses heuristic-based search [mueller2015computationally, zhou2019robust, liu2018search], which could produce suboptimal solutions.

Vi-D Real-world Deployment

To test the robustness and the real-time performance of our system, we deploy our approach in the real world with a physical racing drone. The drone is built from off-the-shelf components used for first-person-view racing (Fig. 11). The drone features a carbon frame with stiff 5-inch propellers, a Lumenier flight controller, an Odroid XU4 single board computer, and a Laird RM024 radio module for receiving control commands. The platform weights . We design a pendulum gate that comprises a straight steel stick and a wooden loop. The stick has a weight of and a length of . The wooden loop has a weight of and a radius of . We attach the pendulum gate to a fixed stand. The task goal is to control the quadrotor to fly through the pendulum gate and hover it at a goal position located behind the gate.

All the presented flight experiments were conducted with an OptiTrack motion capture system. We use Extended Kalman Filters (EKF) for estimating both the quadrotor state and the pendulum state using observations from the OptiTrack. The state estimation runs at

. We use ACADO [Houska2011a] for the MPC optimization and qpOASES as a solver in order to achieve real-time control performance of the quadrotor. As discretization step, we chose with a prediction time horizon of . The control command solved by the MPC with a desired traversal time variable obtained from the neural network high-level policy are updated at . Our approach can achieve real-time control and is computationally efficient. The computational time for solving an MPC optimization at each control time step is on average less than and for the neural network inference is on average less than (without TensorRT optimization).

We set up the experiment by putting the pendulum gate at a random initial angle and then dropping it. The pendulum gate swings back and forth freely with a periodic motion. The period of the pendulum decreases over time due to friction and air drag, which are difficult to model precisely. We approximate the pendulum dynamics using Eq. (15) & Eq. (16) with a roughly estimated damping factor . We use a simple dynamic model for predicting the future trajectory of the gate. We place our drone at a random position in front of the gate. Fig. 12 shows four different trials of the real world experiment conducted in a confined environment. As a result, our approach successfully controls the quadrotor to fly through the fast-moving gate.

Fig. 13 shows a comparison of the executed trajectories between the quadrotor and the moving gate. The vertical black dashed line indicates the traversal moment, at which the quadrotor flies through the gate. The quadrotor follows the gate’s motion in both the -axis (blue) and the -axis when it approaches to the gate. Besides, the desired traversal time  predicted by the neural network together with the corresponding gate-following weighting variable  is plotted on the right-hand side. Both and are used by the MPC for simultaneous planning a trajectory and controlling the vehicle. When is close to zero, the weighting variable increases to a maximum value , which indicates that the desired traversal time is now and the quadrotor should try to follow the periodic motion in the plane. By contrast, when is very large (e.g., ), the weighting variable decreases to a minimum value , meaning that there is no need to follow the gate since the gate is either far away or has been passed already. Note that is not the prediction horizon in the MPC optimization.

Fig. 11: A racing drone used for the real-world experiment.
Fig. 12: Controlling a quadrotor to fly through a dynamic gate using High-MPC under random initializations.
Fig. 13: Trajectories of the quadrotor (solid line) and the pendulum gate (dashed line). The plot on the right-hand side shows the predicted traversal time variable and the weight for minimizing the gate-following cost. The vertical dashed line indicates the traversal time.

Vi-E Connections with Prior Trajectory Planning Methods

In general, existing works on quadrotor trajectory planning (or kinodynamic motion planning) in dynamic environments can be divided into sampling-based and optimization-based approaches. Sampling-based algorithms [elbanhawi2014sampling], such as RRT [karaman2011sampling, webb2013kinodynamic], are provably optimal in the limit of infinite samples. To achieve real-time replanning, sampling-based algorithms leverage efficient motion primitive generators to provide a closed-form solution for state-to-state trajectories, e.g., the minimum jerk [mueller2015computationally] or minimum snap [mellinger2011minimum] trajectories. This line of work manifests a significant computational advantage and real-time planning performance, however, relies on simplified dynamics or differentially flat dynamics of a quadrotor. Moreover, they need to relax the single actuator constraints to limit the per-axis acceleration, which might render planned trajectories conservative.

Optimization-based approaches overcome these limitations by enforcing the system dynamics and thrust limits as constraints. They use discrete-time state-space representations for the trajectory, and can handle nonlinear dynamics and different constraints. For passing through dynamic gates, the allocation of the traversal times for the moving waypoints is a priori unknown, rendering the problem formulation complicated. State-of-the-art approaches address similar problems for passing through static waypoints using either heuristic search [neunert2016fast] or formulating a complicated optimization problem using complementary progress constraints [foehn2020cpc].

Vii Discussion

Vii-a Choice of Policy Representations

In Section IV, we have presented algorithms for training three different policies, including Gaussian policy, Gaussian linear policy, and neural network policy. Our formulation is general and is designed to be suitable for a wide variety of robotic tasks.

The Gaussian policy finds state-independent decision variables for a single trajectory optimization, namely, the learned decision variables are fixed parameters. In practice, many MPC applications require manually tuning of the hyperparameters, such as the prediction horizon. The Gaussian policy is useful for automatically selecting those hyperparameters.

The Gaussian linear policy learns state-dependent policies via episodic policy search and function approximations. Such policy representations are useful for generalizing robot skills to multiple contexts: a lower-level MPC, parameterized via the parameters , controls the robot for a given context  and a high-level policy  generalizes among different contexts . In the context of reinforcement learning, it is generally referred to as contextual policy search [deisenroth2013survey, kupcsik2013data]. Hence, the Gaussian linear policy has the advantage of efficiently learning linear function representations (with nonlinear kernel features) using a small number of samples.

The neural network policy learns state-dependent policies in a step-based setting, where the policy needs to adapt its decision variables based on the observation at each control time step . The above-mentioned episodic policy search cannot be used directly because the step-based policy search uses different exploration strategies and relies on different policy evaluation methods. Instead, we propose to combine Algorithm 1 with a self-supervised learning scheme for the step-based setting. We show the resulting algorithm can be used for learning a complex neural network policy. The neural network policy is more helpful for adapting decision variables online or potentially for processing high-dimensional observations.

Vii-B Design of the Loss Function and the Reward Function

The loss function design in the MPC optimization involves generating desired control commands for the robot such that the predicted future states match predefined high-level goals. Our approach allows more flexible and automatic loss function design: by taking the MPC as a parameterized controller and using policy search for automatically selecting the desired parameters. The reward function design for policy search has very flexible formulations, such as quadratic, sparse, or exponential.

Vii-C Limitations

First, learning neural networks is computationally expensive and data-hungry. During data collection, the proposed self-supervised policy search (see Algorithm 3) requires running Algorithm 1 in the loop, which is an expensive process since it needs to solve multiple MPC optimizations for each state. Second, Algorithm 3 was designed for learning neural networks for both online decision making and handling high-dimensional observations. Our experiments did not exploit the full potential of using deep neural networks for processing high-dimensional observations, such as images. Finally, our experiment results are based on accurate state estimation with low latency, which is general not the case when using onboard sensing and computing. It is also crucial to consider many factors for real-world scenarios, such as noisy state estimation and system delays.

Viii Conclusion

This paper proposed a novel framework that unifies the advantages of both probabilistic policy search and MPC. Our approach improves over the standard MPC formulation by augmenting the MPC controller with learned high-level policies that can automatically choose hard-to-optimize decision variables. Our framework allows a versatile design of different policy representations, ranging from state-independent Gaussian policies to complex neural networks.

As a second contribution, we addressed a challenging problem in agile drone flight: controlling a quadrotor to fly through fast-moving gates. We successfully demonstrated the effectiveness of our approach in both simulation and the real world. To the best of our knowledge, our approach is the first to attempt to address this problem and, hence, can serve as an important baseline for future work.

We release the source code of learning high-level policies for MPC and provide additional theoretical derivations in the Appendix. Future work concerns improving the sample efficiency of the high-level policy training using more advanced policy search methods, such as relative entropy policy search [peters2010relative]. On the application side, applying the proposed framework to other robotic tasks, such as addressing more complex trajectory optimization problems or using a high-level policy for dealing with model errors, are promising avenues for future research.



We thank Thomas Längle, Roberto Tazzari, Manuel Sutter, Elia Kaufmann, Antonio Loquercio, Philipp Foehn, Angel Romero, and Sihao Sun for their help or the valuable discussions.

We provide detailed derivations for both updating the Gaussian policy and the Gaussian linear policy using weighted maximum likelihood.

Appendix A

A-a Derivation of Algorithm 1

The objective of maximizing a Gaussian policy in Algorithm 1 is defined as

where the log-likelihood of the Gaussian policy is given by

In order to find the that maximizes the reward, we take the derivative to the policy parameters , separately, and set the gradients to zero. We first compute the solution for updating the mean 

By solving for , we obtain

Second, we compute the solution for the covariance matrix ,

By solving for , we obtain

Note that in Eq. (8), we use

as the demonimator to obtain an unbiased estimate of the covariance.

A-B Derivation of Algorithm 2

where the log-likelihood of the Gaussian policy is given by

Similar to the previous derivation, in order to find the maximizing the reward, we take the derivative to the policy parameters , separately, and set the gradients to zero. We first compute the solution for the parameters  that are used for approximating the mean,