## 1 Introduction

Model predictive control (MPC) is a well-studied and widely adopted control technique, particularly in the process control industry. Its popularity stems in large part from its ability to control complex systems while respecting system constraints, ensuring safe operation. It operates by solving an optimal control problem (OCP) with the current state of the plant as the initial condition, using a model of the plant to predict the plant response to the controlled variables. In this way it finds the sequence of control inputs that minimizes the objective function over the prediction horizon while remaining feasible, in the sense that the predicted trajectories stay within the specified constraints. The first control input of the solution sequence is applied to the plant, and the MPC then solves the OCP again at the next sampling instant. The drawbacks of the MPC framework are that the quality of the input sequence relies heavily on the accuracy of the model of the plant dynamics, that the hyperparameters of the MPC need to be fine-tuned to the task at hand, and that the computational complexity of solving the OCP is fairly high, limiting the types of platforms and applications that can implement MPC.
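The receding-horizon loop described above can be sketched in a few lines. This is only an illustration under strong assumptions: the scalar plant `model`, the bound `X_MAX`, the quadratic costs, and the three-valued input set are all hypothetical, and the OCP is solved by brute-force enumeration rather than by a nonlinear programming solver.

```python
from itertools import product

X_MAX = 5.0  # hypothetical state constraint |x| <= X_MAX


def model(x, u):
    """Hypothetical scalar plant model: open-loop unstable linear dynamics."""
    return 1.2 * x + 0.5 * u


def solve_ocp(x0, horizon, inputs=(-1.0, 0.0, 1.0), gamma=0.99):
    """Enumerate all input sequences; return the cheapest feasible one (or None)."""
    best_cost, best_seq = float("inf"), None
    for seq in product(inputs, repeat=horizon):
        x, cost, feasible = x0, 0.0, True
        for k, u in enumerate(seq):
            cost += gamma**k * (x**2 + 0.1 * u**2)  # illustrative stage cost
            x = model(x, u)
            if abs(x) > X_MAX:  # predicted state leaves the constraint set
                feasible = False
                break
        if feasible:
            cost += gamma**horizon * x**2  # illustrative terminal cost
            if cost < best_cost:
                best_cost, best_seq = cost, seq
    return best_seq


def run_mpc(x0, horizon, steps):
    """Receding-horizon loop: solve the OCP, apply the first input, re-solve."""
    x, trajectory = x0, [x0]
    for _ in range(steps):
        seq = solve_ocp(x, horizon)
        if seq is None:
            raise RuntimeError("OCP infeasible from state %r" % x)
        x = model(x, seq[0])  # the plant is assumed to match the model exactly
        trajectory.append(x)
    return trajectory
```

The enumeration cost grows exponentially with the horizon here, which is an artifact of the brute-force solver chosen for the sketch, not a property of MPC solvers in general.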

The prediction horizon length is a key parameter of the MPC framework. In conjunction with the step size, it controls how far into the future the controller evaluates the consequences of its actions. If the horizon is chosen too short, the computed trajectories are myopic and might lead to instability and poor approximations of the infinite-horizon solution; on the other hand, the computational complexity grows at best linearly with the prediction horizon. Moreover, different regions of the state space might have varying requirements on the horizon length for stability and for finding nearly optimal trajectories. This observation motivated adaptive horizon MPC (AHMPC). In Michalska and Mayne (1993) the horizon is adapted so that a terminal constraint is satisfied and the system enters a known region of attraction of a second terminal controller. Krener (2018) proposes a heuristics-based approach, presenting one ideal but not implementable approach, and one practical method using iterative deepening search where stability criteria are checked on each iteration to determine the lowest stabilizing horizon. A more direct approach is presented in Scokaert and Mayne (1998), where the prediction horizon is included as a decision variable of the MPC scheme. Gardezi and Hasan (2018) propose a learning-based approach in which they construct a rich dataset of numerous combinations of states and MPC computations with varying horizons, and then apply supervised learning on this dataset to develop an optimal horizon predictor.

Reinforcement learning (RL) (Sutton and Barto, 2018) is a field of machine learning concerned with optimal sequential decision making. While RL has proven to be the state-of-the-art approach for certain classes of problems such as game-playing (Schrittwieser et al., 2020), it has not seen many real-world applications in control. This is in large part due to its data-intensive nature, combined with its inability to handle constraints and the resulting lack of guarantees for safe operation of the system, both in the learning stage and in production. However, RL can be employed for control in a safe manner by using it to augment existing control techniques such as MPC (Aswani et al., 2013; Fisac et al., 2018; Zanon and Gros, 2020), e.g. to learn the system dynamics (Nagabandi et al., 2018) or tune parameters (Mehndiratta et al., 2018).

In this paper we propose to learn the optimal prediction horizon length of the MPC scheme as a function of the state using RL. To the best of our knowledge, this is the first work to employ RL for AHMPC. The contribution of this paper lies in exploring how the RL problem of optimizing the MPC prediction horizon can be formulated, and in showcasing its effectiveness on two control problems. Further, we suggest jointly learning the MPC value function due to its synergistic relationship with the prediction horizon, which enhances the adaptive capabilities. While the AHMPC approaches described earlier can be designed with favorable properties such as theoretical stability guarantees, they often assume access to privileged information such as terminal sets and control Lyapunov functions. Learned approaches, on the other hand, typically assume little is known, and as such are applicable to more problems.

The rest of the paper is organized as follows. Section 2 presents the algorithms and theory employed in this paper. Section 3 presents the formulation of learning the optimal MPC prediction horizon as an RL problem, while Section 4 describes the experiments undertaken, the results of which are presented and discussed in Section 5. Finally, Section 6 concludes the paper with our thoughts about the proposed method and future prospects.

## 2 Background

### 2.1 Model Predictive Control

MPC is a model-based control method where the control inputs are obtained by solving, at every time step, an open-loop finite-horizon OCP (1), using a model of the plant to predict the response to the control inputs from the current state of the plant. Solving the OCP yields a control input sequence that minimizes the objective function over the optimization horizon. The first control input of this sequence is then applied to the plant, and the OCP is solved again at the subsequent time step to get the next control input. In this paper we consider discrete-time state-feedback nonlinear constrained MPC for which the MPC receives exact measurements of the states at equidistant points in time. It reads as:

$$
\begin{aligned}
\min_{x,\,u}\quad & \sum_{k=0}^{N-1} \gamma^{k}\, \ell(x_k, u_k, p_k) + \gamma^{N} m(x_N) && \text{(1a)} \\
\text{s.t.}\quad & x_0 = x(t) && \text{(1b)} \\
& x_{k+1} = f(x_k, u_k, p_k) && \text{(1c)} \\
& h(x_k, u_k) \le 0 && \text{(1d)}
\end{aligned}
$$

where, here and in the rest of the paper, we use the notation $\min$ to indicate solving for the arguments that minimize the function. Here, $x_k$ is the plant state vector at optimization step $k$ and $x_0 = x(t)$ is the plant state at the current time, $u_k$ is the vector of the control inputs, $p_k$ are time-varying parameters whose values are projected over the optimization horizon, $f$ is the model dynamics, $h$ is the constraint vector and $N$ is the horizon. The state and control inputs are subject to constraints, which must hold over the whole optimization horizon for the MPC solution to be considered feasible. The MPC objective function consists of the stage cost $\ell$, the terminal cost $m$, and the discounting factor $\gamma$. The stage cost is problem specific, e.g. consisting of a tracking error and a penalty on the input change $\Delta u_k = u_k - u_{k-1}$ that discourages bang-bang control. Since the stage cost only evaluates the trajectory locally, up to a length of $N$ steps, the terminal cost should ideally provide global information about the desirability of the considered terminal state, helping the MPC avoid local minima. The more accurate the terminal cost is with respect to the infinite-horizon solution to problem (1), the shorter the horizon can be while still achieving good control performance (Zhong et al., 2013; Lowrey et al., 2019).

### 2.2 Value Function Estimation

The ideal choice for the terminal cost $m$ in the MPC scheme would be the optimal value function $V^{*}$. A value function $V^{\pi}$ measures the expected total infinite-horizon discounted cost accrued when following the control law $\pi$ (2), and the optimal value function is then the value of an optimal control law that chooses the optimal input at every point (3). Equation (3) is written in the form of a Bellman equation where the value is decomposed into a one-step cost $\ell(x, u)$ and the total value from the next state $x'$ (the dependence on the parameters $p$ is omitted for brevity). Computing $V^{*}$ exactly from (3) is intractable for problems with continuous state and input spaces, and iterative approaches such as Q-learning require an enormous amount of data for such problems.

$$V^{\pi}(x) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k}\, \ell\big(x_k, \pi(x_k)\big) \,\middle|\, x_0 = x\right] \tag{2}$$

$$V^{*}(x) = \min_{u}\; \ell(x, u) + \gamma\, \mathbb{E}\left[V^{*}(x')\right] \tag{3}$$

The MPC scheme delivers local approximations to (3), and as such the value function of the MPC control law, $V^{\mathrm{MPC}}$, is a good surrogate for $V^{*}$ as the terminal cost $m$. While computing $V^{\mathrm{MPC}}$ exactly is not possible either, since it would require running the MPC scheme with an infinite horizon, it can be approximated with fitted value iteration from data gathered when running the MPC.

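$$L(\theta) = \mathbb{E}\left[\Big(V_{\theta}(x) - \big(\ell(x, u) + \gamma V_{\theta}(x')\big)\Big)^{2}\right] \tag{4a}$$

$$\theta \leftarrow \theta - \alpha \nabla_{\theta} L(\theta) \tag{4b}$$

The value function approximator $V_{\theta}$ is parameterized by the parameters $\theta$, which are updated with step size $\alpha$ according to (4) to minimize the mean squared Bellman error (MSBE). Moreover, the MPC scheme provides good approximations to the $n$-step Bellman equation, which when employed in the update rule (4) is known to accelerate convergence and promote stability of the value function learning process:

$$V(x_0) = \sum_{k=0}^{n-1} \gamma^{k}\, \ell(x_k, u_k) + \gamma^{n} V(x_n) \tag{5}$$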
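As a rough illustration of how an n-step target can drive the value update, the sketch below computes the n-step bootstrapped target from a list of stage costs and applies one semi-gradient step. The two-parameter quadratic model `value`, the learning rate `lr`, and all function names are assumptions made for this example, not the implementation used in the paper.

```python
def n_step_target(costs, bootstrap_value, gamma):
    """Return sum_{k<n} gamma^k * costs[k] + gamma^n * bootstrap_value."""
    target = bootstrap_value
    for cost in reversed(costs):  # fold from the last step backwards
        target = cost + gamma * target
    return target


def value(theta, x):
    """Hypothetical quadratic value model V_theta(x) = theta[0] + theta[1] * x^2."""
    return theta[0] + theta[1] * x * x


def msbe_step(theta, x0, target, lr):
    """One semi-gradient step on the squared error (V_theta(x0) - target)^2.

    The target is treated as a constant, i.e. no gradient flows through it.
    """
    error = value(theta, x0) - target
    grad = (2.0 * error, 2.0 * error * x0 * x0)
    return (theta[0] - lr * grad[0], theta[1] - lr * grad[1])
```

In a usage loop one would take the stage costs along the first n steps of a stored trajectory, bootstrap with `value(theta, x_n)`, and call `msbe_step` on the starting state.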

### 2.3 Reinforcement Learning

The system to optimize using RL is framed as a Markov decision process (MDP), which is defined by a tuple of components $\langle \mathcal{S}, \mathcal{A}, \mathcal{T}, c, \gamma \rangle$. $\mathcal{S}$ is the state space of the system, $\mathcal{A}$ is the action space, $\mathcal{T}$ is the discrete-time state transition function which describes the transformation of the states due to time and actions, i.e. $s_{t+1} = \mathcal{T}(s_t, a_t)$, $c$ is the cost function and $\gamma$ is the discount factor describing the relative value of immediate and future costs.

The aim of RL methods is to discover optimal decision making for the problem as defined above, usually by constructing a policy $\pi_{\phi}$, i.e. a (possibly stochastic) function that maps states to actions, here parameterized by $\phi$, and/or a value function as in (2), where the stage cost $\ell$ corresponds to $c$. The objective to be optimized is then:

$$\min_{\phi}\; \mathbb{E}_{s_0 \sim \rho_0,\; \tau \sim \pi_{\phi}}\left[\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t)\right] \tag{6}$$

that is, minimize the expected discounted sum of costs accrued over the states visited by the policy over an infinite horizon. The expectation is taken over the initial state distribution $\rho_0$, and the trajectory distribution $\tau$ generated by the policy and the state transition function.

### 2.4 Soft Actor Critic

Soft actor-critic (SAC) (Haarnoja et al., 2018) is an actor-critic entropy-maximization RL algorithm with a parameterized stochastic policy. Entropy is a measure of the randomness of a variable, and is in the case of a continuous action space defined as $\mathcal{H}\big(\pi(\cdot \mid s)\big) = \mathbb{E}_{a \sim \pi}\left[-\log \pi(a \mid s)\right]$, where $\pi(a \mid s)$ is the probability of taking a given action $a$ in state $s$ given the policy $\pi$. In maximum entropy RL the objective is regularized by the entropy of the policy, that is, the aim of the policy is to minimize the sum of expected costs while simultaneously maximizing the expected entropy. This in turn yields multi-modal behaviour and innate exploration of the environment, as well as improved robustness because the policy is explicitly trained to handle perturbations. For the sake of brevity, we limit the discussion of the specifics of the SAC algorithm to the policy implementation; see Haarnoja et al. (2018) for details on how the policy is optimized. SAC learns a parameterized stochastic policy implemented as:

$$a = \tanh\big(\mu_{\phi}(s) + \sigma_{\phi}(s) \odot \epsilon\big), \qquad \epsilon \sim \mathcal{N}(0, I) \tag{7}$$

Here $\mu_{\phi}$ and $\sigma_{\phi}$ are the two outputs of the policy function approximator, representing the mean action and the covariance, respectively. $\epsilon$ is independently drawn Gaussian noise, $\odot$ denotes element-wise matrix multiplication, and $\tanh$ is employed to squash the Gaussian's infinite support to the interval $(-1, 1)$. The policy can therefore control its entropy through the state-dependent noise covariance $\sigma_{\phi}$. When evaluating the policy we set $\epsilon = 0$, such that the policy becomes deterministic, as this tends to give better performance.
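The sampling rule for the squashed Gaussian policy can be sketched as follows. The function name and arguments are hypothetical: `mean` and `std` stand in for the two outputs of the policy approximator, and the `deterministic` flag mimics setting the noise to zero at evaluation time.

```python
import math
import random


def sample_action(mean, std, rng=None, deterministic=False):
    """Sample a = tanh(mean + std * eps) element-wise, with eps ~ N(0, 1).

    With deterministic=True the noise is set to zero, so the action is
    simply tanh(mean).
    """
    if rng is None:
        rng = random.Random()
    action = []
    for m, s in zip(mean, std):
        eps = 0.0 if deterministic else rng.gauss(0.0, 1.0)
        action.append(math.tanh(m + s * eps))
    return action
```

For example, `sample_action([0.5], [0.2], deterministic=True)` returns `[math.tanh(0.5)]`, while the stochastic call returns a value spread around it.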

## 3 Method

### 3.1 Horizon Policy

We learn a policy to output the prediction horizon of the MPC scheme using SAC. The prediction horizon is a positive integer, which for convenience we choose to upper bound by $N_{\max}$. As such we modify the output $a$ of the SAC policy by linearly scaling it from the $\tanh$'s limits of $-1$ and $1$ to $1$ and $N_{\max}$, and then rounding the result to the closest integer:

$$N = \operatorname{round}\left(1 + \frac{a + 1}{2}\,\big(N_{\max} - 1\big)\right) \tag{8}$$

The gradients of the RL problem are not affected by these transformations as they are applied in the environment, while the gradients are calculated based on the unscaled and unrounded outputs of the policy. This does however mean that the agent must "learn" that similar outputs from the policy will be rounded to the same action in (8), and thus lead to the same subsequent state and cost. We considered alternative ways of formulating the policy as a discrete distribution from which integer horizon lengths could be drawn directly, such as N-head neural networks, Poisson models, and negative binomial models, but settled on the described rounding approach due to its simplicity and favorable results.
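The scale-and-round mapping from the squashed policy output to an integer horizon can be sketched as below. The names `action_to_horizon` and `n_max` are illustrative, and the tie-breaking rule (round half up) is an assumption, since the text only says "closest integer".

```python
import math


def action_to_horizon(a, n_max):
    """Map a squashed action a in [-1, 1] to an integer horizon in {1, ..., n_max}."""
    a = max(-1.0, min(1.0, a))  # guard against numerical overshoot
    scaled = 1.0 + 0.5 * (a + 1.0) * (n_max - 1)  # linear map [-1, 1] -> [1, n_max]
    return int(math.floor(scaled + 0.5))  # round half up; Python's round() rounds half to even
```

Nearby actions collapse to the same horizon, e.g. `action_to_horizon(0.0, 41)` and `action_to_horizon(0.01, 41)` both give 21, which is the effect the agent has to learn to cope with.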

The cost function of the horizon policy consists of a control performance cost $c_{p}$, i.e. the MPC stage cost $\ell$, a constraint violation cost $c_{h}$, and a computation cost $c_{N}$ to encourage lower horizons when suitable:

$$c = \lambda_{p}\, c_{p} + \lambda_{h}\, c_{h} + \lambda_{N}\, c_{N}, \qquad c_{p} = \ell, \quad c_{h} = I_{h}\,(T - t) \tag{9}$$

where $\lambda_{p}, \lambda_{h}, \lambda_{N}$ are weighting factors. $I_{h}$ is a binary variable indicating whether a hard constraint of the problem was violated, upon which the episode is ended, and $T - t$ is the number of steps left in the episode, such that the agent receives a penalty proportional to how early the episode is ended. We assume the computational complexity of the MPC scheme grows linearly in the horizon length, i.e. $c_{N} \propto N$, as a lower bound for the true complexity. This generally holds true for the interior point method we use if one assumes local convergence and an initial guess that is reasonable (Rao et al., 1998).

The RL state space consists of the MPC state space and the time-varying parameters $p$, as these are necessary to ensure the Markov property.
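A minimal sketch of how the three cost terms of the horizon policy could be combined is given below. The argument names and default weights are hypothetical placeholders rather than values from the paper, and the computation cost is simply taken equal to the horizon length, in line with the linear-growth assumption.

```python
def horizon_policy_cost(stage_cost, violated, steps_left, horizon,
                        w_perf=1.0, w_viol=1.0, w_comp=0.01):
    """Weighted sum of performance, constraint-violation and computation costs."""
    performance = stage_cost                    # MPC stage cost at this time step
    violation = steps_left if violated else 0   # larger penalty for earlier termination
    computation = horizon                       # assumed linear in the horizon length
    return w_perf * performance + w_viol * violation + w_comp * computation
```

With these placeholder weights, a violation with 50 steps left adds 50 to the cost, so terminating early dominates any saving from choosing a short horizon.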

### 3.2 MPC Value Function

The MPC's value function is trained jointly with the RL horizon policy to minimize the mean squared Bellman error (MSBE) as described in Section 2.2, using 32-step bootstrapping. We found that n-step learning provided sufficient stabilization such that other common techniques in value estimation, such as target networks and multiple estimators, were not needed (Fujimoto et al., 2018). We experimented with two types of approximators, neural networks and polynomial regression models, finding that they achieved similar prediction accuracy. We therefore use quadratic polynomial regression models due to their convexity, which reduces the computational complexity of the MPC scheme.

### 3.3 Evaluation

Since the environments are randomized, we construct a test set consisting of 10 episodes for which all stochastic variables, such as initial conditions and references, are drawn in advance and thereby fixed for all policies, ensuring a fair comparison. The learned horizon policy is compared against the standard MPC scheme with a fixed horizon, to assess the contribution of the learning. Each fixed-horizon MPC also has its own value function, estimated using a dataset of 15k time steps.
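One simple way to obtain such a fixed test set is to draw every random quantity from a seeded generator before any policy is evaluated. The sketch below assumes a scalar initial state and a scalar reference sequence drawn uniformly from a hypothetical range; the function name, the dictionary keys, and the ranges are illustrative only.

```python
import random


def make_test_set(n_episodes, seed, episode_length=100):
    """Pre-draw initial states and references so every policy sees the same episodes."""
    rng = random.Random(seed)  # local generator: does not disturb global random state
    episodes = []
    for _ in range(n_episodes):
        episodes.append({
            "x0": rng.uniform(-1.0, 1.0),  # hypothetical initial-state range
            "reference": [rng.uniform(-1.0, 1.0) for _ in range(episode_length)],
        })
    return episodes
```

Every policy is then rolled out on the same list, e.g. `make_test_set(10, seed=0)`, so differences in total cost are not caused by differences in the sampled episodes.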

## 4 Experiments

We illustrate our approach on two systems. We set , and for the inverted pendulum and collision avoidance systems, respectively. We use the hyperparameters suggested in the SAC paper (Haarnoja et al., 2018), with the following exceptions: is a 2-layer fully connected neural network with 32 nodes in each layer, for both the MPC scheme and the RL algorithm, and a reward scaling of 0.6 for the inverted pendulum system and 0.3 for the collision avoidance system.

### 4.1 Inverted Pendulum

The first system we experiment on is the classic control problem of stabilizing an inverted pendulum mounted on a cart that is fixed on a track, so that the cart can only move back and forth in one dimension. The cart’s position is constrained to the size of the track and the pendulum angle is constrained to be above perpendicular to the surface. The controller should also track a time-varying position reference. As the position of the cart and stabilization of the pendulum are intricately linked, respecting both of the constraints while tracking the position reference requires a fairly high optimization horizon. Each episode is terminated after a maximum of 100 time steps, or when a constraint is violated.

The state space consists of the states , where and are the position and velocity of the cart along the horizontal axis, while and are the angle to the upright position and the angular velocity of the pendulum. The system dynamics are described by the equations in (10), where and are the mass of the pendulum and the total mass of cart and pendulum, and is the length of the pendulum. For the MPC model the dynamics are discretized with a step time of .

The stage cost reflects the objective of stabilizing the pendulum in the upright position, formulated as minimizing the negative potential energy of the system, while tracking the position reference .

(10a) | ||||

(10b) | ||||

(10c) | ||||

(10d) | ||||

(10e) |

### 4.2 Collision Avoidance

The second system we consider is a reference tracking problem, in which a vehicle is controlled to follow a trajectory along which obstacles are placed that need to be avoided. The MPC receives information about the reference trajectory as well as any obstacles in its vicinity; however, the position of the obstacles grows more uncertain the longer the prediction horizon is. This means that longer horizons consider increasingly uncertain information, and a short or medium horizon might be better suited in some situations. The episode is ended when reaching the end point of the trajectory, when colliding with an obstacle, or after a maximum of 150 time steps.

For the vehicle we employ a unicycle model (11), where the MPC provides a forward velocity as well as an angular velocity to turn the vehicle. The MPC model is discretized with a step time of . The positions and sizes of the obstacles are randomly generated at the beginning of every episode, and their projected positions supplied to the MPC are randomly drawn within a two-dimensional cone originating from the vehicle, such that the uncertainty grows the further away the object is from the vehicle. An episode is illustrated in Figure 1.

(11a) | ||||

(11b) | ||||

(11c) | ||||

(11d) |
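For orientation, one discretized step of a unicycle can be sketched as below. This assumes the standard unicycle kinematics (position driven by the forward velocity along the heading, heading driven by the angular velocity) and forward-Euler integration; the exact form of (11) and the discretization scheme used in the paper may differ, and the function and argument names are illustrative.

```python
import math


def unicycle_step(x, y, heading, v, omega, dt):
    """Forward-Euler step of standard unicycle kinematics.

    x, y     : position
    heading  : orientation in radians
    v, omega : forward and angular velocity commands
    dt       : step time
    """
    x_next = x + dt * v * math.cos(heading)
    y_next = y + dt * v * math.sin(heading)
    heading_next = heading + dt * omega
    return x_next, y_next, heading_next
```

Starting at the origin with zero heading, `unicycle_step(0.0, 0.0, 0.0, 1.0, 0.0, 0.1)` moves the vehicle 0.1 along the x-axis and leaves the heading unchanged.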

The stage cost is defined as where and are the vehicle position and trajectory reference at time . Further, soft constraints with slack variables are added around each obstacle at 150% of the obstacle's radius.

## 5 Results

The standard MPC scheme with various prediction horizons and the learned RL AHMPC are compared in Figure 3. The RL policy outperforms the standard MPC scheme for all horizon lengths, improving on the second-best policy by about and for the inverted pendulum and collision avoidance systems, respectively. The improvement is more significant for the latter system, as the performance objective varies more with the prediction horizon and the RL policy is able to identify when to use long and short horizons. For the inverted pendulum system, all horizons capable of respecting the constraints achieve similar performance costs, and as such the difference lies mainly in the computation term, although the RL policy achieves the lowest performance cost here as well. RL's ability to find improvement in a problem with such a noisy cost landscape and with so little potential for improvement speaks to its strength. Moreover, the gains from reducing computation would be greater when using e.g. active set methods for the MPC scheme, which typically yield quadratic growth in computational complexity (Lau et al., 2015).

Figure 4: Total cost on the test set for the RL policy at different stages of the learning process. The solid line is the mean score while the shaded region is one standard deviation over three initialization seeds.

In the collision avoidance environment, the best performing fixed horizon is the short 10-step horizon. With the shortest 5-step horizon, the MPC is unable to navigate around all the obstacles, preferring to stay still in front of large obstacles, although the addition of the value function mitigates this issue to some extent. Longer prediction horizons allow the MPC to recognize that sometimes the longer way around the closest obstacle yields a shorter total path due to other obstacle locations, but its planned routes are more sensitive to the uncertainty in the projected locations. A robust MPC scheme could alleviate this deficiency; however, the RL policy is also able to recognize this issue and leverage the strengths of both short and long horizons.

We found that implementing value function estimation in the MPC scheme could significantly improve the performance when using horizons in a neighborhood of the horizon scale where performance changes abruptly, as illustrated in Figure 2, which shows the percentage improvement for each policy when including the learned value function as the terminal cost. The RL horizon policy does not benefit as much from the value function as we would expect: it remains the best performing policy even when the value function is removed from it but not from the other policies. The benefit would probably be more significant in problems that are more temporally or spatially complex. In the collision avoidance problem the shortest horizons show the largest improvements, while for the inverted pendulum system the most improved horizons are the ones that lie close to the apparent minimum horizon required to successfully stabilize the pendulum and track the position reference. For both systems, the longer horizons benefit less from the addition of the value function. This is in part because both systems are heavily influenced by future information that is not available to the value function estimator, i.e. accurate information about distant obstacles and the future position reference for the cart.

We note that the performance costs and the value function improvement are not monotonic with respect to the horizon length. This could partly be explained by the randomness in the data collection stage for the value function estimation.

Figure 4 shows the progression of the training process of the RL horizon policy. It learns quickly, converging after around 15 thousand time steps for both systems, corresponding to about 10 and 25 minutes of data collection for the inverted pendulum and collision avoidance systems, respectively. Moreover, we find that the RL horizon policy itself converges even faster and that the value function estimation is the slower, less data-efficient component. These results indicate that RL is able to cope well with the rounding described in Section 3.1.

## 6 Conclusion

We have shown in this paper that RL can be used to automatically tune and adapt the prediction horizon of the MPC scheme online with only minutes of data collection, at least for simple systems. An important direction for future work is to investigate how this affects the stability properties of the MPC framework, and whether any guarantees can be given.

## References

- Aswani et al. (2013). Provably safe and robust learning-based model predictive control. Automatica 49 (5), pp. 1216–1226.
- Fisac et al. (2018). A general safety framework for learning-based control in uncertain robotic systems. IEEE Transactions on Automatic Control 64 (7), pp. 2737–2752.
- Fujimoto et al. (2018). Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning.
- Gardezi and Hasan (2018). Machine learning based adaptive prediction horizon in finite control set model predictive control. IEEE Access 6, pp. 32392–32400.
- Haarnoja et al. (2018). Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning.
- Krener (2018). Adaptive horizon model predictive control. IFAC-PapersOnLine 51 (13), pp. 31–36. ISSN 2405-8963.
- Lau et al. (2015). A comparison of interior point and active set methods for FPGA implementation of model predictive control. Proc. European Control Conference.
- Lowrey et al. (2019). Plan online, learn offline: efficient learning and exploration via model-based control. In International Conference on Learning Representations (ICLR).
- Mehndiratta et al. (2018). Automated tuning of nonlinear model predictive controller by reinforcement learning. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3016–3021.
- Michalska and Mayne (1993). Robust receding horizon control of constrained nonlinear systems. IEEE Transactions on Automatic Control 38 (11), pp. 1623–1633.
- Nagabandi et al. (2018). Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7559–7566.
- Rao et al. (1998). Application of interior-point methods to model predictive control. Journal of Optimization Theory and Applications 99 (3), pp. 723–757.
- Schrittwieser et al. (2020). Mastering Atari, Go, chess and shogi by planning with a learned model. Nature 588 (7839), pp. 604–609.
- Scokaert and Mayne (1998). Min-max feedback model predictive control for constrained linear systems. IEEE Transactions on Automatic Control 43 (8), pp. 1136–1142.
- Sutton and Barto (2018). Reinforcement learning: an introduction. A Bradford Book, Cambridge, MA, USA. ISBN 0262039249.
- van Seijen (2016). Effective multi-step temporal-difference learning for non-linear function approximation. arXiv:1608.05151.
- Zanon and Gros (2020). Safe reinforcement learning using robust MPC. IEEE Transactions on Automatic Control.
- Zhong et al. (2013). Value function approximation and model predictive control. In 2013 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 100–107.