Learning High-Level Policies for Model Predictive Control

07/20/2020 ∙ by Yunlong Song, et al.

The combination of policy search and deep neural networks holds the promise of automating a variety of decision-making tasks. Model Predictive Control (MPC) provides robust solutions to robot control tasks by making use of a dynamical model of the system and solving an optimization problem online over a short planning horizon. In this work, we bring probabilistic decision-making approaches and the generalization capability of artificial neural networks to this powerful online optimization by learning a deep high-level policy for the MPC (High-MPC). Conditioned on the robot's local observations, the trained neural network policy adaptively selects high-level decision variables for the low-level MPC controller, which then generates optimal control commands for the robot. First, we formulate the search for high-level decision variables for MPC as a policy search problem, specifically, a probabilistic inference problem, which can be solved in closed form. Second, we propose a self-supervised learning algorithm for learning a neural network high-level policy, which is useful for online hyperparameter adaptation in highly dynamic environments. We demonstrate the importance of incorporating online adaptation into autonomous robots by using the proposed method to solve a challenging control problem, where the task is to control a simulated quadrotor to fly through a swinging gate. We show that our approach can handle situations that are difficult for standard MPC.




I Introduction

Model Predictive Control (MPC) [22, 16] is a powerful approach for dealing with complex systems and is capable of handling multiple inputs and outputs. MPC has become increasingly popular for robot control due to its robustness to model errors and its capability of incorporating action limits and solving optimization problems online. However, many popular MPC algorithms [6, 22, 10] rely on tools from constrained optimization, which means that convexification, such as a quadratic formulation of the cost function, and approximations of the dynamics are required [27]. The requirement of solving a constrained optimization online limits the use of MPC for dealing with high-dimensional states and complex cost formulations.

Model-free Reinforcement Learning (RL) offers the promise of automatically learning hard-to-engineer policies for complex tasks [12, 21, 4]. In particular, in combination with deep neural networks, deep RL [18, 8, 7] optimizes policies that are capable of mapping high-dimensional sensory inputs directly to control commands. However, the learning of deep neural network policies is highly data-inefficient and suffers from poor generalization. In addition, these methods typically provide few safety or stability guarantees for the system, which is particularly problematic when working with physical robots.

Instead of learning end-to-end control policies that map observations directly to the robot’s control commands, we consider the problem of learning a high-level policy, where the policy chooses task-dependent decision variables for a low-level MPC controller. The MPC takes the decision variables as inputs and generates optimal control commands that are eventually executed on the robot. The decision variables we are trying to learn can be hyperparameters that are hard for human experts to identify or a compact representation of high-dimensional states (see Section IV).

Fig. 1: An overview of our approach for online adaptations of model predictive control using a learned deep high-level policy. The neural network policy is trained using self-supervised learning (Algorithm 2).

Contributions: In this work, we bring probabilistic decision-making approaches to model predictive control. First, we formulate the search for high-level decision variables for MPC as a probabilistic policy search problem. We make use of a weighted maximum likelihood approach [21] for learning the policy parameters, since it allows a closed-form solution for the policy update. Second, we propose a novel self-supervised learning algorithm for learning a neural network high-level policy. Conditioned on the robot’s observation in a rapidly changing environment, the trained policy is capable of adaptively selecting decision variables for MPC. We demonstrate the effectiveness of our approach, which incorporates a learned high-level policy into an MPC (High-MPC), by solving the challenging task of controlling a quadrotor to fly through a fast swinging gate.

II Related Work

The combination of machine learning or reinforcement learning with model predictive control has been studied in the learning-based control literature.

Sampling-based MPC methods are discussed in [27, 26], in which the MPC optimizations are capable of handling complex cost criteria and making use of learned neural networks for dynamics modelling. A crucial requirement for sampling-based MPC is to generate a large number of samples in real time, where the sampling is generally performed in parallel using graphics processing units (GPUs). Hence, it is computationally expensive to run sampling-based MPC in real time. These methods generally focus on learning dynamics for tasks where a dynamical model of the robot or its environment is difficult to derive analytically, such as aggressive autonomous driving around a dirt track [27].

MPC-guided policy search methods [14, 28, 13] study the problem of learning a deep neural network control policy using an MPC as the teacher, and hence, they transform policy search into a supervised learning problem. The trained end-to-end control policy can forgo the need for explicit state estimation and directly map sensor observations to actions. MPC-guided policy search has been demonstrated to be more data-efficient than standard model-free reinforcement learning. However, it suffers from poor generalization and stability.

Supervised learning for MPC [11, 5, 9, 15] has been studied in the literature. In [11, 15], the authors proposed to combine a CNN-based high-level policy with a low-level MPC controller to solve the problem of navigating a quadrotor to pass through multiple gates. The trained policy predicts three-dimensional poses of the gate’s center from image observations, and then, the MPC outputs control commands for the quadrotor such that it navigates to the predicted waypoints. Similarly, the method in [5] tackles an aggressive high-speed autonomous driving problem by using a CNN-based policy to predict a cost map of the track, which is then directly used for online trajectory optimization. Here, the deep neural network policies are trained using supervised learning, which requires ground-truth labels.

III Background

III-A Model Predictive Control

We consider the problem of controlling a nonlinear deterministic dynamical system whose dynamics are defined by a differential equation $\dot{x} = f(x, u)$, where $x$ is the state vector, $u$ is a vector of control commands, and $\dot{x}$ is the derivative of the current state. In model predictive control, we approximate the actual continuous-time differential equation using a discrete-time integration $x_{t+1} = x_t + dt \cdot \hat{f}(x_t, u_t)$, with $dt$ as the time interval between consecutive states and $\hat{f}$ as an approximated dynamical model.

At every time step $t$, the system is in state $x_t$. MPC takes the current state $x_t$ and a vector of additional references as input, and produces a sequence of optimal system states and control commands by solving an optimization problem online using a multiple-shooting scheme. The first control command is applied to the system, after which the optimization problem is solved again from the next state. At each control time step, MPC minimizes a quadratic cost over a fixed time horizon by solving a constrained optimization:

subject to

where the problem includes both equality constraints and inequality constraints. The reference states are normally determined by a path planner and are directly related to the task goal. We represent the high-level variables as a vector $z$, which has to be defined in advance by human experts or learned using our policy search algorithm (Section IV).
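For reference, a generic quadratic-cost MPC problem consistent with the description above can be sketched as follows. This is an illustrative form, not necessarily the exact formulation used in this work: the weight matrices $Q$ and $R$, the horizon length $N$, the references $x_{\text{ref},k}$ and $u_{\text{ref},k}$, the initial state $x_{\text{init}}$, and the constraint functions $r$ (equality) and $h$ (inequality) are assumed names, while $\hat{f}$ denotes the approximated discrete-time model and $dt$ the time interval between consecutive states.

```latex
\begin{aligned}
\min_{u_{0:N-1}} \quad
  & (x_N - x_{\text{ref},N})^\top Q \, (x_N - x_{\text{ref},N}) \\
  & + \sum_{k=0}^{N-1} \Big[ (x_k - x_{\text{ref},k})^\top Q \, (x_k - x_{\text{ref},k})
    + (u_k - u_{\text{ref},k})^\top R \, (u_k - u_{\text{ref},k}) \Big] \\
\text{subject to} \quad
  & x_{k+1} = x_k + dt \cdot \hat{f}(x_k, u_k), \qquad x_0 = x_{\text{init}}, \\
  & r(x, u) = 0, \qquad h(x, u) \le 0 .
\end{aligned}
```

In a receding-horizon scheme, only the first command $u_0$ of the solution is applied before the problem is solved again from the next state.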

III-B Episode-based Policy Search

We summarize episode-based policy search by following the derivation from [4]. Unlike step-based policy search [7, 24], which explores in the action space by adding exploration noise directly to the executed actions, episode-based policy search perturbs the parameters of a low-level controller in parameter space [4]. This kind of exploration is normally added at the beginning of an episode, and a reward function is used to evaluate the quality of the trajectories that are generated by the sampled parameters. A number of episode-based policy search algorithms have been discussed in the literature [21, 23, 3, 4]. We focus on a probabilistic model in which the search for high-level parameters for the low-level controller is treated as a probabilistic inference problem. A visualization of the inference problem is given in Fig. 2; the graphical model is inspired by [4].

Fig. 2: Graphical model for learning a high-level policy  for MPC.

We make use of an MPC as the low-level controller, where the decision variables in the MPC are represented as a vector of unknown variables $z$. We define a reward function $R(\tau)$, which is used to evaluate the quality of the MPC solution $\tau$ with respect to the given task. The goal of policy search is to find the optimal policy $\pi_\theta(z)$ that automatically selects the high-level variables for the MPC, which is equivalent to maximizing the expected reward. Here, the reward function is different from the cost function optimized by the MPC, but it is directly related to the task goal.

To formulate the policy search as a latent variable inference problem, similar to [4], we introduce a binary “reward event” as the observation, denoted as $E$. Maximizing the reward signal implies maximizing the probability of this “reward event”. This leads to the following maximum likelihood problem

$$\max_\theta \; \log p_\theta(E) = \log \int_\tau p(E \mid \tau) \, p_\theta(\tau) \, d\tau, \tag{2}$$

which can be solved efficiently using Monte-Carlo Expectation-Maximization (MC-EM) [12, 25]. MC-EM algorithms find the maximum likelihood solution for the log marginal-likelihood (2) by introducing a variational distribution $q(\tau)$ and then decomposing the marginal log-likelihood into two terms:

$$\log p_\theta(E) = \mathcal{L}_\theta(q(\tau)) + D_{KL}\big(q(\tau) \,\|\, p_\theta(\tau \mid E)\big), \tag{3}$$

where $\mathcal{L}_\theta(q(\tau))$ is the lower bound of $\log p_\theta(E)$.

The MC-EM algorithm is an iterative method that alternates between an Expectation (E) step and a Maximization (M) step. In the expectation step, we minimize the Kullback–Leibler (KL) divergence $D_{KL}(q(\tau) \,\|\, p_\theta(\tau \mid E))$, which is equivalent to setting $q(\tau) = p_\theta(\tau \mid E)$. In the maximization step, we use the sampled distribution to estimate the complete-data log-likelihood by maximizing the following weighted maximum likelihood objective:

$$\theta^* = \arg\max_\theta \sum_i d^{[i]} \log \pi_\theta(z^{[i]}), \tag{4}$$

where $d^{[i]} = p(E \mid \tau^{[i]})$ is an improper probability distribution for the trajectory $\tau^{[i]}$. The trajectory is collected by solving an MPC optimization problem using $z^{[i]}$. The solution for updating the policy parameters has a closed-form expression.
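As an illustration of such a closed-form update, assume a Gaussian policy over the decision variables with mean $\mu$ and covariance $\Sigma$, a set of $N$ sampled variables $z^{[i]}$, and non-negative weights $d^{[i]}$ (these symbols are assumptions made for this sketch). Setting the gradient of the weighted log-likelihood to zero gives the standard weighted maximum likelihood estimates:

```latex
\mu^{*} = \frac{\sum_{i=1}^{N} d^{[i]} \, z^{[i]}}{\sum_{i=1}^{N} d^{[i]}},
\qquad
\Sigma^{*} = \frac{\sum_{i=1}^{N} d^{[i]} \, (z^{[i]} - \mu^{*})(z^{[i]} - \mu^{*})^\top}{\sum_{i=1}^{N} d^{[i]}} .
```

Samples with a larger weight pull the mean toward themselves, and the covariance shrinks as the weights concentrate on a small region of the parameter space.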

IV Methodology

IV-A Problem Formulation

We make use of a Gaussian distribution $\pi_\theta(z) = \mathcal{N}(\mu, \Sigma)$ to model the high-level policy, where $\mu$ is the mean vector, $\Sigma$ is the covariance matrix, and hence, $\theta = [\mu, \Sigma]$ represents all policy parameters. We design a model predictive controller with a vector of unknown decision variables $z$ to be specified. The variables are directly related to the goal of the task and have to be specified before the MPC solves the optimization problem. The MPC produces a trajectory $\tau$ that consists of a sequence of optimal system states and control commands. The cost function is defined by the variables $z$ and additional reference states, such as a target position or a planned trajectory.

We define a reward function $R(\tau)$, which evaluates the quality of the predicted trajectory $\tau$ with respect to the task goal. The design of this reward function is more flexible than that of the cost function optimized by the MPC, which allows us to work with complex reward criteria, such as exponential, discrete, and even sparse rewards. For example, we can compute the reward by counting the total number of non-collision states in the predicted trajectory. Maximizing this reward can hence find the optimal collision-free trajectory.

IV-B Probabilistic Policy Search for MPC

We first focus on learning a high-level policy that does not depend on the robot’s observations, where our goal is to find an optimal policy $\pi_\theta(z)$ that maximizes the expected reward of the predicted trajectories $\tau$. We use a weighted maximum likelihood algorithm to solve the maximum likelihood estimation problem, where maximizing the reward is equivalent to maximizing the probability of the binary “reward event” $E$ (Section III).

The maximization problem corresponds to a weighted maximum likelihood estimation of $\theta$, where each sample $z^{[i]}$ is weighted by $d^{[i]}$. To transform the reward signal $R(\tau^{[i]})$ of a sampled trajectory into a probability distribution $d^{[i]}$, we use the exponential transformation [4]:

$$d^{[i]} = \exp\{\beta R(\tau^{[i]})\}, \tag{5}$$

where the parameter $\beta$ denotes the inverse temperature of the soft-max distribution; a higher value of $\beta$ implies a more greedy policy update. A comparison of different values of $\beta$ for the policy update is shown in Fig. 3. A complete episode-based policy search for learning a high-level policy in MPC is given in Algorithm 1.

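While not converged
 Sample variables: $z^{[i]} \sim \pi_\theta(z)$, $i = 1, \dots, N$
 Sample trajectories: $\tau^{[i]}$, obtained by solving the MPC with $z^{[i]}$
 Expectation step: compute the weights $d^{[i]}$ from the rewards $R(\tau^{[i]})$ via (5)
 Maximization step: update $\theta$ by maximizing the objective (4)
Output: Learned high-level policy $\pi_\theta(z)$
Algorithm 1 Probabilistic Policy Search for MPC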

We represent our policy $\pi_\theta(z)$ using a normal distribution with randomly initialized policy parameters $\theta = [\mu, \Sigma]$. We consider the robot at a fixed state, which does not change during learning. At the beginning of each training iteration, we randomly sample a list of $N$ parameters $z^{[i]}$ from the current policy distribution and evaluate them via a predefined reward function $R(\tau^{[i]})$, where $\tau^{[i]}$ are the trajectories predicted by solving the MPC with the sampled variables $z^{[i]}$.

In the Expectation step, we transform the computed reward signal into a non-negative weight $d^{[i]}$ (an improper probability distribution) via the exponential transformation (5). In the Maximization step, we update the policy parameters by optimizing the weighted maximum likelihood objective (4), where both the mean and the covariance are updated using a closed-form expression. We repeat this process until the expectation of the sampled reward converges. Here, is a vector of auxiliary variables. After training (during policy evaluation), we simply take the mean vector $\mu$ of the Gaussian policy as the optimal decision variables for the MPC. Therefore, $z^* = \mu^*$ is the optimal vector of MPC decision variables found by our approach.
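The sample, weight, and update loop described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_mpc` and `toy_reward` are hypothetical stand-ins for the MPC solve and the task reward, and the values of `n_samples`, `beta`, and `n_iters` are arbitrary.

```python
import numpy as np

def toy_mpc(z):
    """Hypothetical stand-in for an MPC solve: maps variables z to a 'trajectory'.

    A real implementation would solve a constrained optimization here.
    """
    return np.asarray(z, dtype=float)

def toy_reward(tau, z_opt=1.3):
    """Hypothetical reward: largest when the trajectory parameter equals z_opt."""
    return -float(np.sum((tau - z_opt) ** 2))

def policy_search(mu, sigma, n_samples=50, beta=3.0, n_iters=20, seed=0):
    """Episode-based policy search with weighted maximum likelihood updates."""
    rng = np.random.default_rng(seed)
    mu = np.atleast_1d(np.asarray(mu, dtype=float))
    sigma = np.atleast_2d(np.asarray(sigma, dtype=float))
    for _ in range(n_iters):
        # Sample decision variables from the current Gaussian policy.
        z = rng.multivariate_normal(mu, sigma, size=n_samples)
        # Solve the (toy) MPC for each sample and evaluate the result.
        rewards = np.array([toy_reward(toy_mpc(zi)) for zi in z])
        # E-step: exponential transformation of the rewards. Subtracting the
        # maximum rescales all weights by a common factor (the update below is
        # invariant to it) and avoids numerical overflow.
        d = np.exp(beta * (rewards - rewards.max()))
        # M-step: closed-form weighted maximum likelihood update.
        mu = d @ z / d.sum()
        diff = z - mu
        sigma = (diff * d[:, None]).T @ diff / d.sum()
        # Small regularizer keeps the covariance positive definite.
        sigma += 1e-8 * np.eye(len(mu))
    return mu, sigma

mu_star, sigma_star = policy_search(mu=[0.0], sigma=[[1.0]])
print("learned mean:", mu_star, "learned variance:", sigma_star)
```

With a larger `beta`, the weights concentrate on the best few samples, which speeds up convergence but also shrinks the covariance faster and reduces exploration.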

IV-C Learning A Deep High-Level Policy

We extend Algorithm 1 to learn a deep neural network high-level policy, where the trained neural network policy is capable of selecting adaptive decision variables for the MPC given different observations of the robot. Such a property is potentially useful for the robot to adapt its behavior online in a highly dynamic environment. For example, it is important to use an adaptive control scheme for mobile robots, since the robot’s dynamics and its surrounding environment change frequently.

First, we define an observation vector of the robot, where the observation can be either high-dimensional sensory inputs, such as images, or low-dimensional states, such as the robot’s pose. Second, we define a general-purpose neural network whose weights are to be optimized. We train the deep neural network policy by combining the episode-based policy search (Algorithm 1) with a self-supervised learning approach. Our algorithm for learning a deep high-level policy is summarized in Algorithm 2.

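Input: Algorithm 1
Data collection (repeat)
 Randomly reset the system
 While not done:
  Find the optimal decision variables via Algorithm 1
  Data collection: add the observation and the optimal variables to the dataset
  MPC optimization: solve the MPC given the current state and the optimal variables
  System transition: apply the first control command
Policy learning: minimize the mean-squared error on the dataset
Output: Learned deep high-level policy
Algorithm 2 Learning A Deep High-Level Policy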

We divide the learning process into two stages: 1) data collection and 2) policy learning. In the data collection stage, we randomly initialize the robot in a state and find the optimal decision variables via Algorithm 1. We then add the pair consisting of the robot's current observation and the optimal decision variables to our dataset. A sequence of optimal control actions is computed by solving the MPC optimization, given the current state of the robot and the learned variables. The first control command is applied to the system, and subsequently, the robot transitions to the next state. Incrementally, we collect a set of data that consists of a variety of observation-optimal-variables pairs. In the policy learning stage, we optimize the neural network by minimizing the mean-squared error between the labels and the predictions of the network using stochastic gradient descent.
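The data collection stage can be sketched as a simple loop. The following is a toy illustration only: `find_optimal_variable`, `mpc_first_command`, and `step` are hypothetical stand-ins for Algorithm 1, the MPC solve, and the system transition, and the scalar system is an assumption chosen to keep the example short.

```python
import numpy as np

def find_optimal_variable(obs):
    """Hypothetical stand-in for Algorithm 1 (policy search at a fixed state)."""
    return 2.0 - 0.5 * obs[0]

def mpc_first_command(state, z):
    """Hypothetical stand-in for the MPC solve; returns the first control command."""
    return -0.5 * state + 0.1 * z

def step(state, u, dt=0.1):
    """Hypothetical system transition (single integrator)."""
    return state + dt * u

def collect_dataset(n_episodes=5, max_steps=20, seed=0):
    """Aggregate (observation, optimal variable) pairs along closed-loop rollouts."""
    rng = np.random.default_rng(seed)
    observations, labels = [], []
    for _ in range(n_episodes):
        state = rng.uniform(-1.0, 1.0, size=1)    # randomly reset the system
        for _ in range(max_steps):
            obs = state.copy()                     # here the observation is the state itself
            z_star = find_optimal_variable(obs)    # label produced by the policy search
            observations.append(obs)
            labels.append(z_star)
            u = mpc_first_command(state, z_star)   # only the first command is applied
            state = step(state, u)                 # system transitions to the next state
    return np.array(observations), np.array(labels)

X, y = collect_dataset()
print(X.shape, y.shape)
```

Because the labels come from the policy search itself rather than from human annotation, the resulting regression problem is self-supervised.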

V Experiments

V-A Problem Formulation

V-A1 Passing Through a Fast Moving Gate

To demonstrate the effectiveness of our approach, we aim at solving a challenging control problem. Our task is to maneuver a quadrotor to pass through the center of a swinging gate that hangs from the ceiling via a cable. We assume that the gate oscillates in the same two-dimensional plane (Fig. 5). Thus, we model the motion of the gate as a simple pendulum. Such a quadrotor control problem can be solved via a traditional modular planning-tracking pipeline, where an explicit trajectory generator, such as a minimum snap trajectory [17] or motion primitives [19], is combined with a low-level controller. To forgo the need for an explicit trajectory generator, we intend to solve this problem using our proposed High-MPC, where we make use of a high-level policy to adaptively select a decision variable for a low-level MPC controller. Our approach automatically finds an optimal trajectory for flying through the gate by solving an adaptive MPC optimization online.

Quadrotor Dynamics: We model the quadrotor as a rigid body controlled by four motors. We use the quadrotor dynamics proposed in [19]:

where and are the position and velocity of the quadrotor in the world frame . We use a unit quaternion to represent the orientation of the quadrotor and use to denote the body rates (roll, pitch, and yaw respectively) in the body frame . Here, with is the gravity vector, and

is a skew-symmetric matrix. Finally,

is the mass-normalized thrust vector. We use a state vector and an action vector to denote the quadrotor’s states and control commands, respectively.
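A common quaternion-based rigid-body model of this kind can be sketched as follows. The state ordering (position, quaternion, velocity), the quaternion convention `[w, x, y, z]`, the z-up world frame, and the value 9.81 m/s² are assumptions of this sketch rather than details confirmed by the text above.

```python
import numpy as np

GRAVITY = np.array([0.0, 0.0, -9.81])  # assumed world-frame gravity vector (z up)

def quat_rotate(q, v):
    """Rotate vector v by the unit quaternion q = [w, x, y, z]."""
    w, r = q[0], q[1:]
    return v + 2.0 * np.cross(r, np.cross(r, v) + w * v)

def omega_matrix(omega):
    """Skew-symmetric 4x4 matrix such that q_dot = 0.5 * omega_matrix(omega) @ q."""
    wx, wy, wz = omega
    return np.array([[0.0, -wx, -wy, -wz],
                     [wx, 0.0, wz, -wy],
                     [wy, -wz, 0.0, wx],
                     [wz, wy, -wx, 0.0]])

def quadrotor_dynamics(state, action):
    """Time derivative of state = [p(3), q(4), v(3)] for action = [thrust, wx, wy, wz]."""
    q, v = state[3:7], state[7:10]
    thrust, omega = action[0], action[1:4]
    p_dot = v                                            # position changes with velocity
    q_dot = 0.5 * omega_matrix(omega) @ q                # attitude driven by body rates
    # Mass-normalized thrust acts along the body z axis, rotated into the world frame.
    v_dot = GRAVITY + quat_rotate(q, np.array([0.0, 0.0, thrust]))
    return np.concatenate([p_dot, q_dot, v_dot])

hover_state = np.zeros(10)
hover_state[3] = 1.0                                     # identity orientation
hover_action = np.array([9.81, 0.0, 0.0, 0.0])           # thrust cancels gravity
print(quadrotor_dynamics(hover_state, hover_action))
```

At hover, the mass-normalized thrust exactly cancels gravity, so every component of the state derivative is zero; this is the reference command used for regularizing the control inputs.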

Pendulum Dynamics: We use a simple pendulum which is modeled as a bob of mass attached to the end of a massless cord . The cord is hinged at a fixed pivot point denoted as . The pendulum is subject to three forces: the gravity, the tension force exerted by the cord upon the bob, and a damping force due to friction and air drag. The damping force is proportional to the angular velocity and denoted as , where is a damping factor. Hence, we use the following dynamical model

to simulate the motion of our gate, where is the angle displacement with respect to the vertical direction. We constrain the pendulum’s motion in the plane, where and . A Cartesian coordinate representation of the pendulum in the world frame can be obtained from the pendulum’s angle displacement with respect to and . We can represent the state of the gate’s center using the state vector .
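For illustration, a damped simple pendulum of this kind can be simulated with a few lines of NumPy. The specific model used here, `theta_ddot = -(g / L) * sin(theta) - (c / (m * L)) * theta_dot`, follows from treating the damping term as a tangential force proportional to the angular velocity; this form and all numerical values (`length`, `mass`, `damping`, step size) are assumptions of this sketch.

```python
import numpy as np

def pendulum_step(theta, theta_dot, dt, length=2.0, mass=1.0, damping=0.1, g=9.81):
    """One semi-implicit Euler step of a damped simple pendulum (assumed model)."""
    theta_ddot = -(g / length) * np.sin(theta) - damping / (mass * length) * theta_dot
    theta_dot = theta_dot + dt * theta_ddot   # update the velocity first
    theta = theta + dt * theta_dot            # then the angle, using the new velocity
    return theta, theta_dot

def bob_position(theta, pivot, length=2.0):
    """In-plane Cartesian position (horizontal, vertical) of the bob below the pivot."""
    return np.array([pivot[0] + length * np.sin(theta),
                     pivot[1] - length * np.cos(theta)])

def simulate(theta0, n_steps=2000, dt=0.005):
    """Release the pendulum from rest at angle theta0 and record the angle."""
    theta, theta_dot = theta0, 0.0
    thetas = [theta]
    for _ in range(n_steps):
        theta, theta_dot = pendulum_step(theta, theta_dot, dt)
        thetas.append(theta)
    return np.array(thetas)

thetas = simulate(theta0=0.5)
print("first and last swing amplitudes:",
      np.max(np.abs(thetas[:500])), np.max(np.abs(thetas[-500:])))
```

The same routine can be rolled forward over the MPC horizon to predict the future motion of the gate center, which then serves as the reference for the tracking cost.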

Model Predictive Control: We solve the problem of passing through the swinging gate using non-linear model predictive control. We make use of discrete time models, where a list of quadrotor states and control commands are sampled with a discrete time step . We define the objective as a sum over three different cost components: a goal cost , a tracking cost , and an action regularization cost . Thus, we solve the following constrained optimization problem:

where are differences between the vehicle’s states and reference states at the stage , and defines the difference between the vehicle’s terminal state  and a hovering state . Here, is a regularization for predicted control commands , where the reference command is the command required for hovering the quadrotor. The control commands are constrained by .

Cost Functions: In MPC, we minimize a sum of quadratic cost functions over the receding horizon using a sequential quadratic program (SQP). We design quadratic cost functions using positive definite diagonal matrices , , and . In particular, both and are time-invariant matrices. Here, defines the importance of reaching to a hovering state at the end of the horizon and corresponds to the importance of taking the control commands that are not diverging too much from the reference command .

Since the gate is swinging in the plane, in order to pass through the gate, the quadrotor has to fly forward in the direction and simultaneously minimize its distance to the center of the gate in both and axes. Hence, the quadrotor has to track the pendulum’s motion in both axes when it approaches to the gate. To do so, we use a time-varying cost matrix , which is defined as:


where the exponential function defines the temporal importance for each states , and defines the temporal spread of states in terms of tracking the pendulum’s motion. Here, is a time variable that defines the best traversal time for the quadrotor, having helps the quadrotor go to the hovering point after passing through the gate. Hence, for states that are close to the , we have , which means that these states should strictly follow the pendulum in and . However, for states that are faraway from , we have , which indicates that it is not necessary for these states to follow the pendulum’s motion. Here, defines the maximum weight that should be assigned for tracking the pendulum. Without considering the importance of each state at different time stages, e.g., weighting the tracking loss in all time stages using the same constant cost matrix, the quadrotor flies trajectories that would oscillate around the forward axis (see Fig. 6).
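The effect of such an exponential weighting can be sketched numerically. The functional form `q_max * exp(-alpha * (t - t_tra)**2)` is one weighting that matches the description above, and the names `t_tra`, `q_max`, `alpha` as well as the values of the horizon and time step are assumptions of this sketch.

```python
import numpy as np

def tracking_weights(t_tra, horizon=2.0, dt=0.1, q_max=100.0, alpha=10.0):
    """Per-stage tracking weight that peaks at the traversal time t_tra."""
    t = np.arange(0.0, horizon, dt)                       # time stamp of every stage
    return t, q_max * np.exp(-alpha * (t - t_tra) ** 2)   # large near t_tra, small elsewhere

t, w = tracking_weights(t_tra=1.0)
for tk, wk in zip(t, w):
    print(f"t = {tk:.1f} s, weight = {wk:8.3f}")
```

Stages near the traversal time receive a weight close to the maximum, so the predicted states there must follow the pendulum closely, while stages far from it are almost free of the tracking cost. A larger `alpha` narrows the window of stages that track the gate.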

Therefore, a key requirement for our MPC to solve the problem is to obtain the optimal traversal time in advance. A similar problem was discussed in [20], where a time variable at which a desired static waypoint should be reached by a quadrotor was determined by human experts. In our case, the time variables are more difficult to obtain, especially when we consider adapting the variable online.

V-B Learning Traversal Time

We first consider the scenario where the quadrotor always starts from the same initial hovering state with and the pendulum is hinged at a fixed pivot point with cord length meter (m). The pendulum’s initial angle and angular velocity are (in radians). We define a hovering state as a goal state for the quadrotor to hover after passing through the gate. Given the dynamics of the vehicle and the pendulum, we want to plan a trajectory in the future time horizon  seconds, such that the produced quadrotor trajectory intersects the center of the gate at the traversal time .

We learn the decision variable using Algorithm 1 (Section IV), where is modeled as a high-level policy and is represented using a Gaussian distribution . We first sample a list of of size , and then, collect a vector of predicted trajectories by solving MPC optimizations. We evaluate the sampled trajectories using the following reward function:

where correspond to 10 time stages that are close to the time stage determined by the samples via . Maximizing this reward signal indicates that the high-level policy tends to sample that allows the MPC to plan a trajectory that has a minimum distance between the quadrotor’s state and the center of the gate during the traversal. This reward is maximized by solving the weighted maximum likelihood objective (4) using Algorithm 1.

V-B1 High-Level Policy Training

Fig. 3 shows the learning progress of the high-level policy. The learning of such a high-level policy is extremely data-efficient and stable, where the policy converges in only a few trials. For example, the policy converges after around 6 training iterations when using , where in total trajectories (equivalent to 180 MPC optimizations) were sampled. We use CasADi [2], which is an open-source tool for nonlinear optimization and algorithmic differentiation, for our MPC implementation. We use a discretization time step of and a prediction horizon of . On average, each MPC optimization takes around on a standard laptop.

Fig. 3: This figure shows the learning progress of the high-level policy. Top: Averaged rewards using different over 7 runs of each, where policies are randomly initialized with different random seeds. Bottom: A visualization of policy distributions and sampled during training. The policy converges to an optimal solution after around 6 iterations.

V-B2 Traverse Trajectory Planning

Fig. 4 shows a comparison between the planned trajectory using our High-MPC (along with an optimized decision variable seconds) and the solution from a standard MPC. The standard MPC minimizes the same cost function with a constant cost matrix for all states and does not use the exponential weighting scheme. Both methods are capable of planning trajectories that pass through the swinging gate, where absolute position errors at the traversal point in the plane are meters for High-MPC and meters for the standard MPC, respectively. Nevertheless, the control actions (the total thrust and body rates) produced by High-MPC are better suited for real-world deployment, since the inputs stay at their limits for a shorter amount of time, leaving more control authority to counteract disturbances. Our approach only tries to follow the pendulum’s motion in and directions at the time stages close to the learned traversal time .

Fig. 4: A comparison of planned trajectories between our High-MPC (with trained  (s)) and a standard MPC. The vertical line indicates the passing moment. Our High-MPC is better suited for real-world deployment, since the produced actions are much smoother than those of the standard MPC and stay at the limit for a shorter amount of time.

V-C Learning Adaptive Traversal Time

Fig. 5: Demonstrations of our High-MPC for flying through a swinging gate. The initial states of the quadrotor and the pendulum are randomly initialized. In the 3D plots, the initial states of the pendulum are shown in grey, and the black gates show the moment when the quadrotor is passing through the gate. The color bars on the right side specify the quadrotor speed in the direction. The grey dashed lines are the trajectories planned by our MPC, and the colored dots are the traveled trajectories. The quadrotor’s body frame is indicated by . The 2D plots show the traveled trajectories of the quadrotor and the pendulum.

Learning a single high-level policy without taking the robot’s observation into account is only useful for selecting time-invariant variables or for planning a one-shot trajectory, where the dynamics are perfectly modeled. This, however, is generally not the case. For example, our task requires the MPC to constantly update its prediction based on the vehicle’s state with respect to that of the dynamic gate. Hence, we also want to find a high-level policy that is capable of adaptively selecting the time variable depending on the robot’s observation.

V-C1 Deep High-Level Policy Training

To do so, we make use of a multilayer perceptron (MLP) to generalize the traversal time to different contexts. We represent the context as an observation of the vehicle, defined as the difference between the vehicle’s state and the pendulum’s state at the current time step. We use Algorithm 2 (Section IV), where we combine the learning of an optimal high-level policy online with a supervised learning approach to train the MLP. We first randomly initialize the system, which means that we use random initial states for the quadrotor and drop the pendulum from random angles; then, we find the optimal traversal time at this state. We solve the MPC optimization using this traversal time and apply the optimal control action to a simulated quadrotor. We repeat this process at each simulation time step until the quadrotor flies through the gate or the maximum number of simulation steps is reached. In total, we collect 40,000 samples that consist of observation-traversal-time pairs. Collecting the data takes several hours on a single CPU core; however, the total sampling time can be significantly reduced using parallel processing or multithreading. We use TensorFlow [1] to implement a fully-connected MLP with two hidden layers of 32 units and ReLU nonlinearities. The training of the network weights takes less than 5 minutes on a notebook with an Nvidia Quadro P1000 graphics card.
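The policy learning stage amounts to a small regression problem. The sketch below uses plain NumPy in place of TensorFlow so that it stays self-contained; it keeps the two hidden layers of 32 ReLU units and the mean-squared-error loss, while the toy dataset, the learning rate, the batch size, and the number of epochs are assumptions.

```python
import numpy as np

def init_mlp(sizes, rng):
    """He-initialized weights and zero biases for a fully connected network."""
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Return the activations of every layer (ReLU on hidden layers, linear output)."""
    acts = [x]
    for i, (w, b) in enumerate(params):
        h = acts[-1] @ w + b
        acts.append(h if i == len(params) - 1 else np.maximum(h, 0.0))
    return acts

def sgd_step(params, x, y, lr):
    """One gradient step on the mean-squared error for a mini-batch (x, y)."""
    acts = forward(params, x)
    grad = 2.0 * (acts[-1] - y) / len(x)            # gradient w.r.t. the network output
    new_params = []
    for i in reversed(range(len(params))):
        w, b = params[i]
        gw, gb = acts[i].T @ grad, grad.sum(axis=0)
        if i > 0:
            grad = (grad @ w.T) * (acts[i] > 0.0)   # backpropagate through the ReLU
        new_params.append((w - lr * gw, b - lr * gb))
    return new_params[::-1]

rng = np.random.default_rng(0)
# Toy dataset (assumption): 2-D observations with a smooth scalar label.
X = rng.uniform(-1.0, 1.0, size=(512, 2))
Y = 1.0 + 0.5 * X[:, :1] - 0.3 * X[:, 1:] ** 2

params = init_mlp([2, 32, 32, 1], rng)
mse_before = float(np.mean((forward(params, X)[-1] - Y) ** 2))
for epoch in range(300):
    order = rng.permutation(len(X))                 # reshuffle every epoch
    for start in range(0, len(X), 64):
        batch = order[start:start + 64]
        params = sgd_step(params, X[batch], Y[batch], lr=0.01)
mse_after = float(np.mean((forward(params, X)[-1] - Y) ** 2))
print("MSE before:", mse_before, "MSE after:", mse_after)
```

At deployment time, a single forward pass through such a network replaces the iterative policy search, which is what makes online adaptation of the decision variable feasible at every control step.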

Fig. 6: Comparisons between our High-MPC (left) and a standard MPC (right), where initial states of the system are the same for both methods. Top: the swinging gate is released from  (rad). Bottom: the swinging gate is released from  (rad).

V-C2 Passing Through a Fast Moving Gate via High-MPC

We evaluate the effectiveness of our High-MPC by controlling a simulated quadrotor to pass through a fast swinging gate, where the quadrotor and the pendulum are randomly initialized in different states. Based on the state of the quadrotor, the motion of the pendulum (including 2s of predicted pendulum motion in the future), and the predicted traversal time, our High-MPC simultaneously plans a trajectory and controls the vehicle to pass through the gate. Fig. 5 shows six random examples of the quadrotor successfully flying through the swinging gate.

In addition, we compared the performance of our High-MPC to that of a standard MPC (Fig. 6), where the standard MPC optimizes a cost function without considering the temporal importance of different states in tracking the pendulum motion. The standard MPC failed to pass through the gate and produced trajectories that oscillate about the forward direction.

VI Discussion and Conclusion

In this work, we introduced the idea of formulating the design of hard-to-engineer decision variables in MPC as a probabilistic inference problem, which can be solved efficiently using an EM-based policy search algorithm. We combined self-supervised learning with the policy search method to train a high-level neural network policy. After training, the policy is capable of adaptively making online decisions for the MPC. We demonstrated the success of our approach by combining a trained MLP policy with an MPC to solve a challenging control problem, where the task is to maneuver a quadrotor to fly through the center of a fast-moving gate. We compared our approach (High-MPC) to a standard MPC and showed that ours achieves more robust results, and hence, it is more promising for deployment on real robots, thanks to the online decision variable adaptation scheme realized by the deep high-level policy. In addition, our approach has the advantage of tightly coupling planning and optimal control, and hence, forgoes the need for an explicit trajectory planner.

Nevertheless, our approach has limitations; for example, it requires multiple MPC optimizations in the training loop in order to find the optimal variables. It is possible to learn a vector of high-dimensional decision variables and more complex neural network policies; however, the sample complexity will increase by a large margin. To fully exploit the potential of automatically learning high-level policies for optimal control, we hope that our work sparks more researchers’ interest in this domain, leads to new algorithms, and opens up opportunities for solving more complex robotic problems, such as real-world robot navigation in a complex dynamic environment. To test the scalability and generalization of our High-MPC, we intend to deploy the algorithm on a real robot system in the near future.