1 Introduction
In the design of optimal control systems, one seeks a controller that performs a desired task while minimizing a given cost functional. In the classical setting, a well-defined system or plant model (i.e., a set of differential equations governing the dynamics of the system) is assumed to be known or identified beforehand. Using this model, controllers are designed offline via dynamic programming (i.e., by solving the Hamilton-Jacobi-Bellman (HJB) partial differential equation) or by solving the necessary conditions provided by Pontryagin's maximum principle (PMP)
[1].

In this work, we consider the novel problem of learning optimal controllers online for systems that are operated repeatedly and whose models are not fully specified in each round of operation. In particular, we assume a fixed but unknown probability distribution over the parameters governing the system model. During each round of operation, the environment samples parameters from this distribution, and these parameters govern the system dynamics. We apply a control and are given a feedback signal on its performance at the end of the round. Over many repeated operations, our objective is to identify the control that works best for the unknown probability distribution. This, in effect, personalizes the controller to the specific system conditions that it faces upon deployment.
Personalization has already been studied, to some extent, in the framework of online convex optimization (OCO) in the full-information setting and of multi-armed bandit (MAB) problems in the partial-feedback setting. These frameworks have found applications in settings such as targeted online advertisements [2], recommendation systems [3], and others. As we will see, our optimization problem is nonconvex, so OCO cannot be applied directly even when full information is available [4]. On the other hand, several newer variants of bandit problems, such as linear bandits [5], $\mathcal{X}$-armed bandits [6], and Gaussian-process-based algorithms [7], extend bandit-style algorithms to continuous domains where convexity is not always assumed. These algorithms vary in what they assume about the objective function. In terms of practicality, many of them are either too complicated for real applications or have only been shown to work on simplistic examples [8]. A closely related paper [9] studies the discrete-time linear-quadratic regulator (LQR) problem, where one can change the control within the operating regime. In this work, we address personalization by building on the principles behind linear bandits and develop a semidefinite-programming-based algorithm to assess practicality in the presence of nonconvexity.
Applications of personalization: Several systems, such as traffic control systems, mass transit systems, and cooling systems deployed in public places, are operated repeatedly. They also share the feature that the system dynamics differ from one round of operation to another. For instance, the traffic profile at a junction varies from cycle to cycle, and it also varies from junction to junction. An optimal controller in this setting should ideally personalize to the traffic distribution seen at its junction as well as account for the variation in the realization of the traffic patterns there. In mass transit systems such as buses and trains, the number of commuters boarding and alighting depends on the route and timing that the vehicle operates in. This number affects the acceleration and deceleration profile of the transit vehicle and hence its fuel efficiency. Thus, an acceleration and deceleration controller for the transit vehicle should personalize its control to the vehicle's route and timing and, conditional on these, also account for the variation in the commuter demand encountered on that route at those times. In cooling systems deployed in large public spaces, the cooling efficiency is determined by the number of people using the space, which varies with the space characteristics and the time of day. Even within a specific space and time period, a cooling controller may have to account for the variation in usage to increase operational efficiency.
2 Problem Formulation
We consider a continuous-time system governed by a linear constant-coefficient differential equation as follows:
(1) $\dot{x}(t) = A\,x(t) + B\,u(t)$
where for $t \in [0, T]$, $x(t) \in \mathbb{R}^n$ is the state of the system and $u(t) \in \mathbb{R}^k$ is the control input (here $T$ is the time at which control ends). Suppose we want to design a controller that steers the system in Equation (1) from a given initial condition $x(0) = x_0$ to a given final condition (for simplicity, let $x(T) = 0$) while minimizing a scalar-valued cost given by:
(2) $J = \int_0^T \left( x(t)^\top Q\, x(t) + u(t)^\top R\, u(t) \right) dt$
We assume we know the functional form of the cost $J$ as well as the matrices $Q$ and $R$. Without loss of generality, we can restrict our search for an optimal control to the space of linear feedback controls, i.e., controls of the form $u(t) = K x(t)$, where $K$ is a matrix of gain parameters (this assumes that $T$ is large enough for the dynamics to settle down). Further, the gain matrix $K$ should be such that the closed-loop system is stable [9] (our algorithm below ensures this automatically).
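For a single known realization of the system, the optimal linear feedback gain can be computed by solving the continuous-time algebraic Riccati equation. A minimal sketch in Python (the system and cost matrices here are hypothetical, chosen only for illustration):

```python
import numpy as np
from scipy.linalg import solve_continuous_are

def lqr_gain(A, B, Q, R):
    """Optimal gain for u = K x: solve the continuous-time algebraic
    Riccati equation for P and set K = -R^{-1} B' P."""
    P = solve_continuous_are(A, B, Q, R)
    return -np.linalg.solve(R, B.T @ P)

# Hypothetical system and cost matrices.
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)

K = lqr_gain(A, B, Q, R)
# The closed-loop matrix A + B K is Hurwitz, i.e., K is stabilizing,
# mirroring the stability requirement on the gain matrix above.
closed_loop_eigs = np.linalg.eigvals(A + B @ K)
```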
Let us now assume that we are operating the system repeatedly. That is, we assume that the matrices $A_i$ and $B_i$ for $i = 1, \dots, m$ are known and fixed beforehand (we also assume suitable observability and detectability conditions involving these matrices and $Q$). What we are not explicitly given in each round are the values of the parameters $\theta_i$, $i = 1, \dots, m$. We assume that each $\theta_i$ is an indicator function of the event that realization $i$ occurs, which happens with probability $p_i$ (thus, $\theta = (\theta_1, \dots, \theta_m)$ is a categorical random variable). In this setting, we are interested in searching for a controller that minimizes the expected cumulative cost of operation over all rounds. To recap, let $\mathcal{A}$ be a candidate algorithm. In each round $t$ of operation, the following events occur:
The environment draws a realization of $\theta$ from an unknown but fixed probability distribution $p$ ($p$ lies in a simplex in $\mathbb{R}^m$). The realization is kept fixed for the round.

Algorithm $\mathcal{A}$ picks a control (parametrized by a gain matrix $K_t$) from a set of stabilizing controllers (say $\mathcal{K}$) and applies it to the system.

A scalar cost value $c_t$ is revealed to the algorithm that summarizes the cost of operation in the round.
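One round of this protocol can be sketched as follows (all matrices, the gain, and the distribution `p_true` are hypothetical placeholders; the round cost is accumulated by forward-Euler integration of the closed-loop dynamics under $u = Kx$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-realization system: theta in {0, 1} selects A_theta.
A_list = [np.array([[0.0, 1.0], [-1.0, -0.2]]),
          np.array([[0.0, 1.0], [-2.0, -0.5]])]
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
p_true = np.array([0.3, 0.7])        # unknown to the algorithm

def round_cost(K, theta, x0, T=5.0, dt=1e-3):
    """One round: integrate the closed loop x' = (A_theta + B K) x with
    u = K x (forward Euler) and accumulate the quadratic cost (2)."""
    A_cl = A_list[theta] + B @ K
    x, J = x0.astype(float), 0.0
    for _ in range(int(T / dt)):
        u = K @ x
        J += (x @ Q @ x + u @ R @ u) * dt
        x = x + dt * (A_cl @ x)
    return J

# One round of the protocol.
theta = rng.choice(2, p=p_true)      # environment samples a realization
K = np.array([[-1.0, -1.0]])         # algorithm picks a stabilizing gain
c = round_cost(K, theta, x0=np.array([1.0, 0.0]))   # revealed scalar cost
```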
The evolution of the state in each round of operation depends on the realization of the random variable $\theta$ and on the control input parametrized by the gain that $\mathcal{A}$ chooses. The expected cost of choosing a controller parametrized by $K$ is given by a map $f : \mathcal{K} \to \mathbb{R}$. The random variable $\theta$ models the stochasticity present in each operational cycle. For instance, the load on an autonomous vehicle or elevator changes as a function of the number of passengers alighting and boarding during each run. The number of people present at a location at various points of time likewise changes the load on the corresponding cooling system. In the next section, we describe a solution technique that optimizes the cumulative cost while choosing controllers.
3 Solution Approach
Our algorithm chooses a control to apply and observes a realization of the cost of operation in each round $t$. It is able to use this feedback to deduce which realization of $\theta$ occurred.$^1$ This lets it update its belief about the unknown distribution $p$. Based on this belief, it optimistically picks the next control to be applied. The procedure is described in Algorithm 1. The choice of controller depends indirectly on the performances of previously explored controllers, similar to many previous works [9, 10, 11, 12].

$^1$ Although such knowledge leads us to the full-information setting, OCO is not applicable due to nonconvexity.
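The belief update can be made concrete as follows. Under a method-of-types bound of the form $\Pr(\|p - \hat{p}_t\|_1 \geq \varepsilon) \leq (t+1)^m\, 2^{-t\varepsilon^2/2}$ (one standard instantiation of the Theorem 11.2.1 argument in [14]; the paper's exact constants may differ), setting the right-hand side to $\delta$ gives the radius of an $L_1$ confidence ball around the empirical estimate:

```python
import numpy as np

def confidence_radius(t, m, delta):
    """eps such that (t+1)^m * 2^{-t eps^2 / 2} = delta, i.e. the true p
    satisfies ||p - p_hat||_1 <= eps with probability at least 1 - delta."""
    return np.sqrt((2.0 / t) * (m * np.log2(t + 1.0) + np.log2(1.0 / delta)))

counts = np.array([12, 28])          # counts of each realization so far
t = counts.sum()
p_hat = counts / t                   # empirical estimate of p
eps = confidence_radius(t, m=len(counts), delta=0.05)
# Confidence set: all q in the simplex with ||q - p_hat||_1 <= eps.
```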
Inputs: Before Algorithm 1 is deployed, we explore the system by applying different controllers for some initial set of rounds. This gives us the initial count vector of the realizations of $\theta$ (identification of the realization is described below). In addition to this count vector, our algorithm also takes in a confidence parameter $\delta$ to be used for optimistic controller selection, the objective parametrized by the matrices $Q$ and $R$, the matrices $A_i$ and $B_i$, and the set of stable controllers $\mathcal{K}$.

Optimistic controller selection: In any round $t$, we have an empirical estimate of $p$, denoted by $\hat{p}_t$. This is similar to the maximum likelihood estimation step in linear stochastic bandits [9, 13]. By using the method of types and Pinsker's inequality (for instance, see Theorem 11.2.1 in [14]), we can upper bound the probability that the unknown $p$ is far from the estimate $\hat{p}_t$, i.e., bound $\Pr(\| p - \hat{p}_t \|_1 \geq \varepsilon)$ for $\varepsilon > 0$. If we now want to ensure that this probability is upper bounded by a value $\delta$, then $p$ belongs to an $L_1$ ball around $\hat{p}_t$ with probability at least $1 - \delta$. We define this set as $\mathcal{P}_t$. Thus, while picking the controller for round $t$, we can optimistically search for a value of $p$ from $\mathcal{P}_t$ simultaneously. The optimization problem (line 3 in Algorithm 1) can be written explicitly as:$^2$

$^2$ This formulation builds on an SDP-based formulation for a deterministic LQR problem.
This optimization problem is nonconvex, and we devise some heuristics in the experiments (alternating minimization over the controller and the distribution variables, together with a way to deal with the coupling constraints). Ideally, the control is given by $u(t) = K_t x(t)$ for any optimizer $K_t$.

Identification of the realization: In our setting, we have $m$ possible realizations of the system, $(A_i, B_i)$ for $i = 1, \dots, m$. If we are given feedback $c_t$ in round $t$, we can solve the following optimization problem with each of the pairs $(A_i, B_i)$ and the fixed control $K_t$ to get cost values $J_i$ and deduce the realization:

The realization is taken to be the index $i$ whose cost value $J_i$ is closest to the observed cost $c_t$ (ties broken arbitrarily). The above nonconvex optimization problem can be transformed into a semidefinite program and solved relatively easily compared to the optimistic optimization problem for control selection formulated earlier.
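The identification step can be illustrated with a simpler stand-in for the SDP: for an infinite-horizon quadratic cost, the cost of a fixed stabilizing gain on each candidate model can be evaluated with a Lyapunov equation, and the realization is taken as the model whose predicted cost best matches the observed one. All matrices below are hypothetical, and the Lyapunov-based evaluation is a substitute for the paper's SDP formulation:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def predicted_cost(A_i, B, K, Q, R, x0):
    """Cost of applying u = K x on model (A_i, B) from x0 (infinite
    horizon): J = x0' P x0 with A_cl' P + P A_cl = -(Q + K' R K)."""
    A_cl = A_i + B @ K
    P = solve_continuous_lyapunov(A_cl.T, -(Q + K.T @ R @ K))
    return float(x0 @ P @ x0)

def identify_realization(c_obs, A_list, B, K, Q, R, x0):
    """Deduce which model generated the observed round cost c_obs."""
    costs = [predicted_cost(A_i, B, K, Q, R, x0) for A_i in A_list]
    return int(np.argmin([abs(J - c_obs) for J in costs]))

# Hypothetical models; K stabilizes both.
A_list = [np.array([[0.0, 1.0], [-1.0, -0.2]]),
          np.array([[0.0, 1.0], [-2.0, -0.5]])]
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K = np.array([[-1.0, -1.0]])
x0 = np.array([1.0, 0.0])

# If the observed cost came from model 1, it should be identified as 1.
c_obs = predicted_cost(A_list[1], B, K, Q, R, x0)
theta_hat = identify_realization(c_obs, A_list, B, K, Q, R, x0)
```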
An experts-based alternative: An alternative algorithm that is intuitive but suboptimal is as follows. We can compute the optimal controllers $K_i^*$ corresponding to each system model $(A_i, B_i)$ beforehand. We can then treat each of these as an expert and apply the randomized weighted majority algorithm [15]. We can do this because we get full information in each round, not just the cost of the controller we picked. Note, however, that the usual regret bound does not hold, because the optimal controller need not belong to the set of experts $\{K_1^*, \dots, K_m^*\}$. We want to find a controller, not necessarily optimal for any single system model $(A_i, B_i)$, that minimizes the expected cost of operation over multiple rounds while minimizing regret. This is what Algorithm 1 achieves.
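A sketch of this experts-based baseline (the losses and the learning rate `eta` are hypothetical; the experts stand in for the precomputed optimal controllers $K_i^*$):

```python
import numpy as np

def weighted_majority(loss_matrix, eta=0.1, rng=None):
    """Randomized weighted majority over precomputed expert controllers.
    loss_matrix[t, i] is the cost (rescaled to [0, 1]) that expert i
    would have incurred in round t; available here because the setting
    gives full information.  Returns chosen experts and final weights."""
    rng = rng or np.random.default_rng(0)
    T, n = loss_matrix.shape
    w = np.ones(n)
    picks = []
    for t in range(T):
        picks.append(rng.choice(n, p=w / w.sum()))
        w *= np.exp(-eta * loss_matrix[t])  # multiplicative update
    return picks, w / w.sum()

# Hypothetical losses for 3 expert controllers over 50 rounds;
# expert 2 is consistently the cheapest.
rng = np.random.default_rng(1)
losses = rng.uniform(size=(50, 3))
losses[:, 2] *= 0.2
picks, final_probs = weighted_majority(losses, rng=rng)
```

Because losses are observed for all experts each round (full information), the weights concentrate on the best expert; the limitation noted above is that no mixture of these experts need equal the overall-optimal controller.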
4 Experiments
We show the effectiveness of our solution approach on a three-dimensional system of the form of Equation (1) with $m = 2$ possible realizations, where $\theta$ is a categorical random variable with a fixed probability mass function that is unknown to the algorithm.
The experiment is run for 30 rounds; in each round, a controller (Kproposed) is chosen according to Algorithm 1 and the cost it incurs is logged. We also evaluate the following static controllers: (K1) the optimal controller for the first realization, (K2) the optimal controller for the second realization, and (Krobust) the robust optimal controller. The performances of all these controllers are plotted in Figure 1. We observe that the proposed controller is the best in terms of the total cost accumulated. Notice that controller K2 also accumulates a similar cumulative cost and is a good choice as well, but this is not known a priori to a learning agent.
5 Conclusions and future directions
In this work, we proposed an approach to personalize control systems to their operating environment in settings where operation is repeated and a certain type of stochasticity is present. In particular, we proposed an algorithm that uses the optimism-under-uncertainty principle. This way of personalization is very useful in reducing operational costs in a variety of applications (for instance, minimizing energy consumption in various transportation and cooling system applications).
This is still work in progress, and investigating regret bounds for this setting is of immediate interest. It is also interesting to explore better algorithms for the nonconvex optimization problem that needs to be solved in each round. The uncertainty model can also be extended to the setting where there is a Dirichlet prior on the unknown probability distribution $p$. Extensions to classes of nonlinear and noisy dynamical systems are also worth pursuing.
Acknowledgement
The author would like to thank Deepak Patil for initial discussions on this topic.
References
 [1] L. Pontryagin, V. Boltyanskii, R. Gamkrelidze, and E. Mischenko, The Mathematical Theory of Optimal Control. Interscience Publishers, 1962.
 [2] D. Chakrabarti, R. Kumar, F. Radlinski, and E. Upfal, “Mortal multi-armed bandits,” in Advances in Neural Information Processing Systems, 2009, pp. 273–280.
 [3] Y. Deshpande and A. Montanari, “Linear bandits in high dimension and recommendation systems,” in Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on. IEEE, 2012, pp. 1750–1754.

 [4] L. Zhang, T. Yang, R. Jin, and Z.-H. Zhou, “Online bandit learning for a special class of non-convex losses,” in Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
 [5] S. Filippi, O. Cappe, A. Garivier, and C. Szepesvári, “Parametric bandits: The generalized linear case,” in Advances in Neural Information Processing Systems, 2010, pp. 586–594.
 [6] S. Bubeck, G. Stoltz, C. Szepesvári, and R. Munos, “Online optimization in X-armed bandits,” in Advances in Neural Information Processing Systems, 2009, pp. 201–208.
 [7] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger, “Gaussian process optimization in the bandit setting: No regret and experimental design,” arXiv preprint arXiv:0912.3995, 2009.
 [8] D. Russo and B. Van Roy, “Learning to optimize via information-directed sampling,” in Advances in Neural Information Processing Systems, 2014, pp. 1583–1591.
 [9] Y. Abbasi-Yadkori and C. Szepesvári, “Regret bounds for the adaptive control of linear quadratic systems,” in Proceedings of the 24th Annual Conference on Learning Theory (COLT), 2011.
 [10] S. Bittanti, M. C. Campi, et al., “Adaptive control of linear time invariant systems: the bet on the best principle,” Communications in Information & Systems, vol. 6, no. 4, pp. 299–320, 2006.
 [11] J. P. Hespanha, D. Liberzon, and A. S. Morse, “Overcoming the limitations of adaptive control by means of logic-based switching,” Systems & Control Letters, vol. 49, no. 1, pp. 49–65, 2003.
 [12] Y. AbbasiYadkori and C. Szepesvari, “Bayesian optimal control of smoothly parameterized systems: The lazy posterior sampling algorithm,” arXiv preprint arXiv:1406.3926, 2014.
 [13] V. Dani, T. P. Hayes, and S. M. Kakade, “Stochastic linear optimization under bandit feedback.” in COLT, 2008, pp. 355–366.
 [14] T. M. Cover and J. A. Thomas, Elements of information theory. John Wiley & Sons, 2012.
 [15] S. Arora, E. Hazan, and S. Kale, “The multiplicative weights update method: a meta-algorithm and applications,” Theory of Computing, vol. 8, no. 1, pp. 121–164, 2012.