1 Introduction
A common approach to stochastic control, sometimes referred to as certainty equivalent model predictive control
, involves at each time forecasting future outcomes of exogenous random variables and optimizing control actions over a planning horizon under the assumption that these forecasts will be realized
(Bertsekas, 2005; Box et al., 2008). The first of the sequence of actions is taken. Forecasting and optimization are repeated at each subsequent time step. In this paper, we develop a regression algorithm that fits time series forecasting models for use in such a context.We focus attention on linear systems with quadratic cost, though the issues and methods we introduce generalize. In particular, we will consider a dynamic system that evolves over discrete time steps . At each time , the state is updated according to
where is a control action, is a random disturbance, and , , and are known matrices of appropriate dimension. Each control action is selected just before the disturbance is observed. The objective is to minimize average expected cost
where the per period cost function is a positive definite quadratic of the form , for some , , and . The following example illustrates a specific context.
Example 1
Consider a problem of balancing an inverted pendulum on a cart, as illustrated in Figure 1
. Take the state to be a fourdimensional vector
, where is the position of the cart, is the angle of the pendulum, and and are their rates of change. The cart can move freely along the horizontal axis, and is controlled through applying a voltage to its motor. We discretize time, and consider a linearization of the system dynamics around the balance point as in Landry et al. (2005), which in the absence of disturbances can be written as for some and . We introduce to the system a disturbance which can be thought of as a force exerted by a gust of wind at time . With the disturbances, the linearized system equation becomes for some . The objective is to center the cart and balance the pole while minimizing energy expenditure, and this is represented by a per period cost function , for some and , where the first term captures how far the current state is from being centered and balanced and the second term reflects the energy applied by the control action.This example may seem simple and specialized, but our framework captures a broad set of applications. For example, we could model the control of a robot that receives instructions from a human user by encoding in both the physical state and a tentative plan. Then, could represent changes to the plan communicated by the user and could represent an action taken by the robot as it tries to follow the plan.
An important issue is how future disturbances are forecasted. In the inverted pendulum problem, for example, accurate predictions of future wind patterns facilitate more effective control. We consider the use of linear time series models with coefficients estimated based on a historical sequence of disturbances. The main issue we focus on in this paper is how one should estimate coefficients. Note that the setting of linear systems with quadratic cost and linear time series model is one of the most commonly used in applied work and also offers a simple starting point for exploring how learning and control should be coordinated.
Given a linear model, it is common to estimate coefficients by minimizing squared error over historical data and then use the model with the resulting coefficients to generate forecasts for model predictive control. However, as we will discuss further, least squares regression leaves room for improvement when the observed process is not itself generated by the specified time series model. In particular, in such situations it can be beneficial to take the control objective into account when computing coefficients.
One approach that does this is empirical optimization, sometimes referred to as empirical risk minimization, which computes coefficients that would have minimized historical average cost given the historical sequence of disturbances. Under certain technical assumptions, if an infinite history of observations is available, empirical optimization yields coefficients that minimize future average expected cost. However, with finite data, empirical optimization often works poorly because it overspecializes to the data.
The main contribution of this paper is a new approach to fitting coefficients which we refer to as directed time series regression. This approach combines merits of least squares regression and empirical optimization. As we will demonstrate through a computational study of the inverted pendulum balancing problem, using directed time series regression can lead to large improvements in controller performance over either of the aforementioned alternatives. In relation to prior work, directed time series regression can be viewed as an extension of directed regression (Kao et al., 2009) to a context involving forecasting and control. This extension builds on several new contributions relative to what is presented in their work. First, dynamic control problems are more complex than the static decision problem addressed in their work, since a state process drives decisions and can become unstable whereas there is no notion of evolving state in static problems. Second, the empirical optimization problem considered in their work is a convex quadratic optimization problem, while the empirical optimization problem in our new context is not convex. We propose efficient approximation algorithms to address this problem. Third, the crossvalidation technique employed in the algorithm of Kao et al. (2009) does not apply in our new time series context, and we devise a sliding window crossvalidation technique for this purpose.
2 Certainty Equivalent Model Predictive Control
Recall that the problem at hand involves controlling a system that evolves according to
with a goal of minimizing average expected cost, where the per period cost function is a positive definite quadratic of the form .
Although the time horizon of interest is infinite, to simplify the problem, at each time , model predictive control (MPC) aims to optimize a finite horizon objective of the form
where is the horizon time and the subscript indicates that the expectation is conditioned on . After deriving a control policy that optimizes this objective, MPC applies the policy at time to generate a control action . Then, after observing and , a new finite horizon problem is formulated with a horizon spanning times through . This new problem is solved to produce a control action . The process is repeated at each subsequent time step.
In order to compute expected future costs, MPC requires a model that provides a distribution over future disturbance sequences. Certainty equivalent MPC simplifies the problem by assuming that future disturbances follow a deterministic sequence generated at time by a forecasting model. Hence, the objective of the optimization problem solved at time becomes
(1) 
where the state dynamics are governed by the same linear difference equation as before but with disturbances given by .
For our context, which involves a linear system and a quadratic cost function, certainty equivalent MPC generates a control policy that is linear in state and forecasts. To simplify our discussion we will assume that disturbances are scalarvalued, though the ideas and methods we will present extend to the vectorvalued case. Under this assumption, there are matrices and , which depend only on , , , , , , and , such that at each time , the action employed by certainty equivalent MPC is given by
(2) 
where . In particular, this action minimizes the objective in (1).
Certainty equivalent MPC leads to effective decisions for a wide variety of control problems. For example, in our context of linearquadratic control, if the process that generates disturbances is known and ergodic and each forecast is equal to the expected disturbance then the performance of certainty equivalent MPC becomes arbitrarily close to optimal as grows. Certainty equivalent MPC has also been applied with success to a number of important nonlinear systems.
3 Forecasting Model
We consider the use of a model driven by linear features
Here, each is a scalar constant and each represents a scalar feature that is useful for predicting the next disturbance. We consider a model of the form
where is a vector of model coefficients and is i.i.d Gaussian noise. Note that the conditional expectation of disturbance at time is linear in past disturbances.
It is straightforward to generate forecasts given the coefficients , features , and past disturbances . In particular, let for , and for , recursively compute
and
In the event that the data is generated by our model, each forecast produced in this way is a conditional expectation: .
Since, conditioned on , the forecasts are conditionally independent of previous disturbances, there is a function which maps the most recent disturbances to forecasts of the next disturbances. In particular,
where . Note that depends on the coefficient vector and is generally not linear in . From (2) we see that control actions generated by certainty equivalent MPC based on forecasts produced by our model can be written as
Let denote a control policy that maps and to . Note that this control policy depends on and again is generally not linear in .
4 Time Series Regression Algorithms
We now discuss algorithms for computing coefficients to fit a forecasting model of the kind we have described to a time series. In particular, each algorithm will compute a vector given observed samples .
4.1 Least Squares
The most common approach is least squares regression (LS). Here, we compute coefficients
Note that the sum begins with because earlier samples are required to compute the features . The optimization problem can be solved efficiently since it is a linear least squares problem. The resulting coefficient vector represents the maximum likelihood estimate assuming that the observed time series is generated by our forecasting model. As such, this is a natural approach to fitting a forecasting model that accurately captures the generating process.
4.2 Empirical Optimization
LS does not take the control problem into account when computing coefficients. This assumes a separation principle whereby model fitting can be decoupled from the subsequent selection of control actions. Empirical optimization (EO) provides an alternative that does account for the control problem when computing coefficients. The algorithm computes coefficients that minimize “historical cost”:
(3) 
where is set to the initial state of the system and subsequent control actions are selected by . The objective here is the sum of costs that would have been realized if we were to apply certainty equivalent MPC alongside our forecast model with coefficients , starting at time . It is easy to see that if the disturbance process is ergodic, as the number of observed samples increases, historical average cost converges to future average cost, and therefore, EO computes coefficients that optimize future average cost. However, for reasonable values of , EO tends to overspecialize to the data, and this results in poor future performance.
It may appear difficult to compute because is generally not linear in . However, as in the case of the computational study we will later discuss, we have had positive experience applying local optimization methods initialized with . A more detailed description of our implementation is given in the appendix.
Nevertheless, this nonconvex optimization problem can still be challenging as the number of parameters increases, mainly because evaluating the gradients and Hessians of requires iterative computation. We now propose an approximation algorithm that relieve this problem by linearization. Note that is typically close to linear in a region of that includes reasonable choices of , for reasons we now explain. The policy can be written as a sum of terms that are linear and terms that are nonlinear in . Linear terms are multiples of terms of the form . Nonlinear terms are higher order terms that involve products of terms of the form . In most problems of practical interest, terms of the form are significantly smaller than one because past disturbances exhibit decaying influence on future ones. As such, the nonlinear terms tend to be significantly smaller than the linear ones, which therefore play a dominant role in . This motivates the following approximation algorithm.
Suppose we have a reasonable estimate for the model coefficients, which we will refer to as base coefficients. In practice, a typical choice for it is . Given , it is straightforward to iteratively generate forecasts and features , . Now fix these features and define
In other words, here we fix base coefficients when rolling forward but allow another set of coefficients to take effect at the last step of prediction. Apparently is linear in . Furthermore, by similar notation we define
and
both of which are also (affine) linear in . As we can see, this approximation trades in some model flexibility for linearity. This leads to an alternative formulation for EO, which we refer to as linearized EO (LEO):
(4) 
Unlike (3), (4) can be solved efficiently by quadratic programming. A policy function can then be built upon .
4.3 Directed Time Series Regression
Directed time series regression (DTSR) aims to combine the merits of LS and EO. Directed regression was first introduced in Kao et al. (2009) in a context involving repeated independent decision problems as opposed to intertemporal control. That work also developed a theory that motivated the algorithm and demonstrated its benefits through a computational study. What we consider here is an extension of directed regression that is suitable for time series.
A naive version of DTSR (NDR) produces a vector of coefficients that is a convex combination of those computed by LS and EO:
where is selected by crossvalidation. This algorithm, albeit simple and intuitive, requires solving EO as an intermediate step and therefore can be inefficient for complicated problems. This disadvantage motivates the following approximation algorithm, linearized DTSR (LDR), which produces a coefficient vector by taking a convex combination of and
A policy function then follows.
We will use a sliding window crossvalidation procedure to select the parameter for NDR and LDR. The windows are defined by a sequence , each element specifying a boundary that separates the th training set window from the th validation set window. To select the parameter for NDR, for each we compute and using the observations , and then evaluate each choice of by simulating the system with control actions generated by , starting at state . The performance of each is judged based on the sum of costs incurred over time . For each , the best value of is selected, and we take to be the average over among the selected values of . In our computational study, we set , , and to , and , each rounded off to the closest integer. The crossvalidation procedure for LDR is essentially the same, except the replacement of by and by . Figure 2 illustrates this sliding window crossvalidation procedure.
5 Computational Study
To assess relative merits of the algorithms we have discussed, we conducted a computational study involving an inverted pendulum problem of the kind described in Example 1.
5.1 System Dynamics
We discretize time, sampling at a rate of 100Hz. We employ a linearization of this system around the balance point, as derived in Landry et al. (2005). We also use the same physical parameters therein, and we assume that at each time there is a gust of wind accelerating the pendulum’s angle by . Specifically, the system dynamics are characterized by the following matrices:
Since our goal is to keep the cart centered and the pendulum vertical while minimizing energy consumption, we use a cost function parameterized by the following matrices:
Hence, the cost in period is , where the state is and the action influences acceleration. Along with the matrices , and , we take to be in our experiment and evaluate the policy matrices and accordingly.
5.2 Generative and Forecasting Models
In our study, we randomly sample an ensemble of generative models, each of which is used to generate a time series to which our algorithms are applied. Each generative model is sampled as follows:

Sample i.i.d from .

Sample i.i.d from .

Select such that when control actions are selected according to , the average expected cost is equal to one.
These steps result in an AR(30) process for which the first five coefficients tend to be an order of magnitude larger than the next twentyfive. The last two steps in the sampling process deserve some explanation. Step 3 normalizes the signaltonoise ratio of the disturbance process. This reduces variations among sampled generative models in how helpful a forecasting model can be. Step 4 serves to normalizes the control objective which would otherwise dramatically differ from one generative model to another. The control policy is one that is optimal when future disturbances are not forecastable. The disturbance variance is chosen so that the average expected cost when the system is controlled by this naive policy is equal to one.
Since the five most recent disturbance dominate others in influencing future disturbances, it is natural to select them as features of a forecasting model. In particular, we consider a forecasting model with five features:
This model approximates the generative model but at the same time neglects useful information that can be extracted from disturbances observed beyond five time periods in the past. It is our intention to consider a forecasting model that does not perfectly capture the generative model since it is in such contexts that DTSR adds value. It is also important to recognize that this sort of model misspecification is inevitable in practical applications.
5.3 Results
For our computational study, we sampled 2,000 generative models using the above procedure. For each model, a sequence of 5,360 disturbances are generated, initialized with an assumption that all disturbances prior to are equal to zero. We take the last samples of this trajectory as observations to be used by regression algorithms. In other words, each algorithm makes use of disturbances for . In our experiment, we repeat the LS, EO, NDR, LEO, and LDR algorithms for each .
To provide a lower bound, we consider a “modelclairvoyant (MC) policy.” This is the optimal policy to use given full knowledge of the generative model. The average expected cost incurred by this policy, which we will refer to as the MC cost, can be derived in closed form. Clearly, policies generated by the aforementioned five algorithms must incur average expected costs no less than .
We can also evaluate in closed form the average expected costs , , , , and , of policies generated by our regression algorithms. We will focus our presentation on excess costs , , , , and . Subtracting factors out the influence of costs that are inevitable regardless of the choice of control policy. Figure 3 plots excess costs as a function of . These results are qualitatively similar to those reported in Kao et al. (2009) in a context involving repeated independent decision problems rather than intertemporal control. For small , EO and LEO suffer from overspecialization. For large , the degree of overspecialization diminishes and the bias in LS due to model mispecification prevents LS from doing as well as EO and LEO. LDR always outperforms the best of these three methods, sometimes by as much as 12%. It is also interesting to note that LDR provides a consistent gain over NDR, in spite of the substantial saving in computation.
We can also compare quantities with simple physical interpretations. Suppose this cartpendulum system is considered to reach a state of failure if the cart’s position or the pendulum’s angle deviates from zero by more than certain threshold values. Let the thresholds on position and angle be and , which are roughly twice of the rootmeansquare values of and in our simulations. We refer to the number of time periods elapsed before a system initialized at the zero state reaches failure as timeuntilfailure. For each of our 2,000 models, we generate 50 disturbance trajectories. For each of these trajectories and each of the policies under consideration (MC, LS, EO, NDR, LEO, and LDR), we simulated the system until failure. Figure 4 summarizes results. It is clear that NDR and LDR are much closer to MC than LS, EO, and LEO are in terms of timeuntilfailure.
6 Conclusion
In this paper we have presented directed timeseries regression, an algorithm that computes coefficients of a timeseries model for use in certainty equivalent model predictive control. Two versions of this algorithm have been proposed, namely NDR and LDR, which we will collectively refer to as DTSR in the following discussion. We have shown that when the quantity of available data is limited and the forecasting model differs from the generative model, this algorithm offers a significant advantage by combining merits of least squares regression and empirical optimization. As side products, we have also proposed a methodology that transforms the original problem into a linearized version, and a sliding window crossvalidation algorithm for time series with control. Both techniques can potentially enlarge the applicability of directed regression theory in other fields.
The main point of the paper is that the coordination of forecasting and control can be fruitful and we hope this observation will stimulate work along these lines involving broader classes of control problems and learning algorithms. Our approach extends to more complex settings than the one considered in this paper, for example, cases where disturbances are multidimensional and the system is nonlinear. It would also be interesting to explore the use of DTSR in contexts involving nonstationary time series.
Let us conclude with a discussion of several threads of work that relate to the ideas we have presented. Motivated by a similar perspective on fitting a mispecified model for use by a control algorithm, the approach developed in Abbeel and Ng (2004)
learns a firstorder Markov model in a way that improves controller performance when data is not generated by a firstorder Markov model. This approach takes the discount factor of the control problem into account when learning the transition matrix. Aside from the contextual differences between discrete Markov processes and linear autoregressive models, a conceptual advance in our work can be seen in the fact that DTSR takes the entire control objective into account.
Differences between LS and EO are analogous to differences that have been studied between generative and discriminative methods for learning (see, e.g., Ng and Jordan (2001)). When data samples are scarce, generative methods often provide better results, as does LS. On the other hand, when there is ample data, discriminative methods are advantageous, as is the case for EO. DTSR provides a useful way of combining the merits of LS and EO, and the idea may generalize to offer an approach that can more broadly be used to combine merits of generative and discriminative methods.
Control theorists working on system identification typically adopt weighted leastsquare linear regression to fit a linear system
(Ljung, 1998). While this approach puts more emphasis on learning the critical components, it does not explicitly consider the control objective. Econometricians have analyzed properties of parameter estimates for misspecified time series models (White, 1980; Domowitz and White, 1982). However, this line of work does not treat the use of such models and estimated parameters for decision or control. Operations researchers have developed methods that factor objectives into how point estimates such as forecasts are generated when they are to be used for decision or control (Granger, 1969; Liyanage and Shanthikumar, 2005; Chua et al., 2008). However, this line of work does not consider model misspecification.Appendix
Implementation Details for EO
The particular local optimization method we use in our computational study is a variation of the GuassNewton method (Bertsekas, 1999). This suits our problem well because
is quadratic. The particular variation we use deviates from the standard GaussNewton method in that each iteration carries out a backtracking line search to determine the size of the step taken toward what would be the next GaussNewton iterate. Further, whenever the approximate Hessian is not positive definite, we add to it a small multiple of the identity matrix, in the spirit of the LevenbergMarquardt method. In our computational study, the backtracking line search starts with a step size of one and halves the step size repeatedly so long as that improves the objective value in (
3). When a multiple of the identity matrix is added, which is rare, the multiplier is .References
 Abbeel and Ng (2004) P. Abbeel and A. Y. Ng. Learning firstorder Markov models for control. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1–8. MIT Press, Cambridge, MA, 2004.
 Bertsekas (1999) D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
 Bertsekas (2005) D. P. Bertsekas. Dynamic programming and optimal control. Athena Scientific, 2005.
 Box et al. (2008) G. E. P. Box, G. M. Jenkins, and G. C. Reinsel. Time Series Analysis: Forecasting and Control. Wiley, 2008.
 Chua et al. (2008) L. Y. Chua, J.G. Shanthikumar, and Z. M. Shen. Solving operational statistics via a Bayesian analysis. Operations Research Letters, 36:110–116, 2008.
 Domowitz and White (1982) I. Domowitz and H. White. Misspecified models with dependent observations. Journal of Econometrics, 20:35–58, 1982.
 Granger (1969) C. W. J. Granger. Prediction with a generalized cost of error function. Operation Research, 20(2):199–207, 1969.
 Kao et al. (2009) Y.H. Kao, B. Van Roy, and X. Yan. Directed regression. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 889–897. 2009.
 Landry et al. (2005) M. Landry, S. A. Campbell, K. Morris, and C. O. Aguilar. Dynamics of an inverted pendulum with delayed feedback control. Journal of Applied Dynamical Systems, 4(2):333–351, 2005.
 Liyanage and Shanthikumar (2005) L. H. Liyanage and J. G. Shanthikumar. A practical inventory control policy using operational statistics. Operations Research Letters, 33:341–348, 2005.
 Ljung (1998) L. Ljung. System Identification: Theory for the User (2nd Edition). Prentice Hall PTR, 1998.

Ng and Jordan (2001)
A. Y. Ng and M. I. Jordan.
On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes.
In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 841–848. MIT Press, Cambridge, MA, 2001.  White (1980) H. White. Using least squares to approximate unknown regression functions. International Economic Review, 21(1), 1980.