1 Introduction
The ultimate goal in the field of adaptive control and reinforcement learning is to produce a truly independent learning agent. Such an agent can start in an unknown environment and follow one continuous and uninterrupted chain of experiences, until it performs as well as the optimal policy.
In this paper we consider this goal for the fundamental problem of controlling an unknown, linear time-invariant (LTI) dynamical system. This problem has received significant attention in the recent ML literature. However, all existing methods assume some knowledge about the environment, usually in the form of a stabilizing controller.
We describe a control algorithm that has only black-box access to an LTI system, without any prior information. The algorithm is guaranteed to attain sublinear regret, converging on average to the performance of the best controller in hindsight from a set of reference policies. Furthermore, its guarantees apply to the setting of nonstochastic control, in which both the perturbations and cost functions can be adversarially chosen. Our results are accompanied by a novel lower bound on the startup cost of computing a stabilizing controller. We show that this cost is inherently exponential in the natural parameters of the problem for any deterministic control method.
The question of controlling unknown systems under adversarial noise was posed in (21); our results quantify the difficulty of this task and provide an efficient solution. As far as we know, these are the first finite-time sublinear regret bounds for black-box, single-trajectory control, in terms of both upper and lower bounds, in both the stochastic and nonstochastic models.
1.1 Statement of results
Consider a given LTI dynamical system with black-box access. The only interaction of the controller with the system is by sequentially observing a state $x_t$ and applying a control $u_t$. The state evolves according to the dynamics equation
$$x_{t+1} = A x_t + B u_t + w_t,$$
where $x_t$ and $w_t$ have the dimension of the state and $u_t$ has the dimension of the control. The system dynamics $A, B$ are unknown to the controller, and the disturbance $w_t$ can be adversarially chosen at the start of each round. An adversarially chosen convex cost function $c_t(x, u)$ is revealed after the controller’s action, and the controller suffers $c_t(x_t, u_t)$. In this model, a controller is simply a (possibly randomized) mapping from all previous states and costs to a control. The total cost of executing a controller $\mathcal{A}$, whose sequence of controls is denoted $u_t^{\mathcal{A}}$ and whose resulting states are denoted $x_t^{\mathcal{A}}$, is defined as
$$J_T(\mathcal{A}) = \sum_{t=1}^{T} c_t(x_t^{\mathcal{A}}, u_t^{\mathcal{A}}).$$
For a randomized control algorithm, we consider the expected cost.
In the adversarial disturbances setting the optimal controller cannot be determined a priori, and depends on the disturbance realization. For this reason, we consider a comparative performance metric. The goal of the learning algorithm is to choose a sequence of controls such that the total cost over $T$ iterations is competitive with that of the best controller in a reference policy class $\Pi$. Thus, the learner with only black-box access, and in a single trajectory, seeks to minimize regret defined as
$$\mathrm{Regret}_T(\mathcal{A}) = J_T(\mathcal{A}) - \min_{\pi \in \Pi} J_T(\pi).$$
As a comparator class, we consider the set of all Disturbance-action Controllers (Definition 3), whose control is a linear function of past disturbances. This class is known to contain state-of-the-art Linear Dynamical Controllers (Definition 4), and the $H_2$ and $H_\infty$ optimal controllers.
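The interaction protocol and the regret metric described above can be sketched in a few lines. This is only an illustrative sketch under simplifying assumptions: the cost function is fixed over time, the policy sees only the current state, and the comparator set is a finite list of policies. The helper names (`rollout_cost`, `regret`) and the toy scalar system are hypothetical and not taken from the paper.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(M: Matrix, v: Vector) -> Vector:
    """Matrix-vector product for small dense matrices stored as nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rollout_cost(
    A: Matrix,
    B: Matrix,
    policy: Callable[[Vector], Vector],
    disturbances: Sequence[Vector],
    cost: Callable[[Vector, Vector], float],
    x0: Vector,
) -> float:
    """Total cost of one trajectory of x_{t+1} = A x_t + B u_t + w_t."""
    x, total = list(x0), 0.0
    for w in disturbances:
        u = policy(x)        # the controller acts first
        total += cost(x, u)  # then the cost of (x_t, u_t) is charged
        Ax, Bu = matvec(A, x), matvec(B, u)
        x = [a + b + c for a, b, c in zip(Ax, Bu, w)]
    return total


def regret(learner_cost: float, comparator_costs: Sequence[float]) -> float:
    """Learner's total cost minus the best comparator's cost in hindsight."""
    return learner_cost - min(comparator_costs)


# Toy scalar instance (hypothetical numbers): x_{t+1} = 2 x_t + u_t, no disturbance.
A, B = [[2.0]], [[1.0]]
ws = [[0.0]] * 3
quad = lambda x, u: x[0] ** 2 + u[0] ** 2
do_nothing = rollout_cost(A, B, lambda x: [0.0], ws, quad, [1.0])
deadbeat = rollout_cost(A, B, lambda x: [-2.0 * x[0]], ws, quad, [1.0])
```

In this toy run the uncontrolled state doubles every step, so the regret of the zero policy against the stabilizing comparator already grows quickly with the horizon.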
Let denote the upper bound on the system’s natural parameters, and be the controllability parameter of the stabilized system, see Section 2.1. The following statements summarize our main results in Theorem 6 and Theorem 8:


We give an efficient algorithm, whose regret in a single trajectory is with high probability at most

We show that any deterministic control algorithm must suffer a worst-case exponential startup cost in the regret, due to limited information. Formally, we show that for every deterministic controller , there exists an LTI where
Further, this lower bound holds even without disturbances, and when the system-input matrix is full rank.
To the best of our knowledge, these are the first finite-time regret bounds for control in a single trajectory with black-box access to the system. Our result quantifies the price of information for controlling unknown LTI systems to be exponential in the natural parameters of the problem, and shows that this is inevitable.
One of the main technical hurdles to designing an efficient algorithm is obtaining a stabilizing controller from black-box interactions, in the presence of adversarial noise and changing convex costs. Our method consists of two phases. In the first phase, we identify the dynamics matrices up to some accuracy in the spectral norm by injecting large controls into the system. Previous works on system identification under adversarial noise require either nonexplosive dynamics or knowledge of a strongly stable controller. Our approach requires neither.
In the second phase, we use an SDP relaxation for the LQR proposed in (4) to solve for a strongly stable controller given the system estimates. The SDP was originally derived in the stochastic noise setting, but we show that it is applicable for our task. After we identify a strongly stable controller, we use the techniques of (8) for regret minimization.
For the lower bound, our approach is inspired by lower bounds for gradient-based methods from the optimization literature. Given a controller, we show a system construction that forces the states, and thus costs, to grow exponentially until all information about the system is learned. As far as we know, this is the first finite-time lower bound which is exponential in dimension for any online control setting.
1.2 Related work
The focus of our work is adaptive control, in which the controller does not have a priori knowledge of the underlying dynamics, and has to learn them while controlling the system according to given convex costs. This task, recently put forth in the machine learning literature, differs substantially from the classical control theory literature that we survey below, in the following aspects:


The system is not known to the learner ahead of time, nor is a stabilizing controller given.

The learner does not know the cost functions in advance. They can be adversarially chosen.

The disturbances are not assumed to be stochastic, and can be adversarially chosen.
To the best of our knowledge, our result is the only one thus far that holds under all three conditions.
Robust and Optimal Control.
When the underlying system is known to the learner, the noise is stochastic, and costs are known, it is possible to compute the (closed or open loop) optimal controller a priori. In the LQR setting, the noise is i.i.d. Gaussian and the learner incurs a fixed quadratic cost. The optimal policy for infinite-horizon LQR is known to be linear in the state, with a gain matrix obtained from the solution to the algebraic Riccati equation (20; 22). Robust control was formulated in the framework of $H_\infty$ control, which solves for the best controller under worst-case noise.
LDS with Stochastic Noise:
The problem of controlling an LDS with known dynamics is extensively studied in the classical control literature, see (20; 22). In the online LQR setting (1; 6; 12; 4), the noise remains i.i.d. Gaussian, but the performance metric is regret instead of cost. Recent algorithms in (12; 5; 4) attain $\sqrt{T}$ regret, with polynomial runtime and polynomial dependence on relevant problem parameters in the regret. This was improved to logarithmic regret in (3) for strongly convex costs. Regret bounds for partially observed systems were studied in (10; 9; 11), and the most recent bounds are in (19). In contrast to these results, our setting permits non-i.i.d., even adversarial noise sequences. Further, all of these results assume the learner is given a stabilizing controller.
Nonstochastic Control:
The nonstochastic control problem for linear dynamical systems (2; 8) is the most relevant setting to our work. Under this setting, the controller has no knowledge of the system dynamics or the noise sequence. The controller generates controls at each iteration to minimize regret over sequentially revealed adversarial convex cost functions, against all disturbance-action control policies. If a strongly stable controller is known, (8) gives an algorithm that achieves regret, where is an upper bound on the system’s natural parameters and is the controllability parameter of the stabilized system, see Section 2.1. This was recently extended in (19) to partially observed systems, and better bounds for certain families of loss functions. However, all of these works assume that a stabilizing controller is given to the learner, and are not black-box as per our definition.
Identification and Stabilization of Linear Systems:
If the system has stochastic noise, least squares can be used to identify the dynamics in the partially observable and fully observable settings (14; 18; 15). Using this method of system identification, recent work by (7) finds a stabilizing controller in finite time. However, no explicit bounds were given on the cost or the number of total iterations required to identify the system to sufficient accuracy. In contrast, our paper provides explicit bounds for optimally controlling the system. Moreover, least squares can lead to inconsistent solutions under adversarial noise. The algorithm in (16) tolerates adversarial noise, but the guarantees only hold for nonexplosive systems. On the other hand, our results do not assume stability of the system (spectral radius bounded by 1), but only controllability.
2 Setting and Background
To enable the analysis of nonasymptotic regret bounds, we consider regret minimization against the class of strongly stable linear controllers. The notion of strong stability was formalized in (4) to quantify the rate of convergence to the steady-state distribution.
Definition 1 (Strong Stability).
is a strongly stable controller for a system if , and there exist matrices , such that , and , .
The regret definition in Section 1.1 is meaningful only when the comparator set is nonempty. As shown in (4), a system has a strongly stable controller if it is strongly controllable. This notion is formalized in the next definition.
Definition 2 (Strong Controllability).
Given a system , let denote
Then is strongly controllable if has full row rank, and .
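The rank condition in Definition 2 can be checked numerically. The sketch below assumes the usual controllability matrix $C_k = [B, AB, \dots, A^{k-1}B]$ (the block order is an assumption, since the formula is not reproduced here) and tests only the full row rank condition; the norm bound in the definition is not checked. All function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(X: Matrix, Y: Matrix) -> Matrix:
    """Dense matrix product for nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def controllability_matrix(A: Matrix, B: Matrix, k: int) -> Matrix:
    """Horizontally stack B, AB, ..., A^{k-1} B (assumed block order)."""
    blocks, block = [], B
    for _ in range(k):
        blocks.append(block)
        block = matmul(A, block)
    return [sum((blk[i] for blk in blocks), []) for i in range(len(B))]


def rank(M: Matrix, tol: float = 1e-9) -> int:
    """Rank via Gaussian elimination with partial pivoting."""
    M = [row[:] for row in M]
    r = 0
    for c in range(len(M[0])):
        if r == len(M):
            break
        pivot = max(range(r, len(M)), key=lambda i: abs(M[i][c]))
        if abs(M[pivot][c]) <= tol:
            continue  # no usable pivot in this column
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, len(M)):
            f = M[i][c] / M[r][c]
            M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
    return r


def has_full_row_rank(A: Matrix, B: Matrix, k: int) -> bool:
    """True if the k-step controllability matrix has rank equal to the state dimension."""
    return rank(controllability_matrix(A, B, k)) == len(A)


# Double-integrator-like toy system (hypothetical): one step is not enough, two are.
A_demo = [[0.0, 1.0], [0.0, 0.0]]
B_demo = [[0.0], [1.0]]
```

For this toy system, `has_full_row_rank(A_demo, B_demo, 1)` is false while `has_full_row_rank(A_demo, B_demo, 2)` is true, so its controllability index is 2.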
Assumption 1.
The system is strongly controllable for , and for some .
Under Assumption 1, the noiseless dynamical system starting from can be driven to the zero state in steps. Furthermore, Lemma B.4 in (4) gives an upper bound on the reset cost, defined as . In Section 2.3 we show that a bounded reset cost implies the existence of a strongly stable controller. As a consequence of the Cayley-Hamilton theorem, a controllable system’s controllability index is at most .
Finally we make the following mild assumptions on the noise sequence and the cost functions.
Assumption 2.
The noise sequence is bounded such that for all .
Assumption 3.
The cost functions are convex, and for all such that , . Without loss of generality, assume .
2.1 Notations
Using the convention from the theory of Linear Programming (13), we henceforth use to denote an upper bound on the natural parameters, which we interpret as the complexity of the system, i.e.
= controllability parameter of the true system.

= controllability index of the true system.

= dimension of the states .

= dimension of the controls .

= upper bound on the Lipschitz constant of the cost functions .

= upper bound on the spectral norm of system dynamics .
Given a strongly stable controller , we denote as the upper bound on the controllability parameter of the stabilized system , and . We henceforth prove an upper bound on for the controller we recover, and show in Section 4.3 that . We use to denote bounds that hold with probability at least , and omit the factor.
2.2 Disturbance-action Controller
In the canonical parameterization of the nonstochastic control problem, the total cost of a linear controller is not convex in . This problem is solved by considering a class of controllers called Disturbance-action Controllers (DACs) (2), which execute controls that are linear in past noises. The total cost of DACs is convex with respect to their parameters, and the cost of any strongly stable controller can be approximated by this class of controllers. Techniques in online convex optimization can then be used on this convex reparameterization of the nonstochastic control problem. It is shown in (8) that for an unknown system and a known strongly stable controller , a DAC can achieve sublinear regret against all such controllers parametrized by where is strongly stable.
Definition 3 (Disturbance-action Controllers).
A disturbance-action controller with parameters where outputs control at state ,
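To make the definition concrete, here is a minimal sketch of how a disturbance-action control could be computed. It assumes the common form $u_t = -K x_t + \sum_{i=1}^{H} M^{[i]} w_{t-i}$; the sign in front of $K$ and the indexing of the matrices $M^{[i]}$ are assumptions, since conventions differ across papers. The function name `dac_control` is hypothetical.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(M: Matrix, v: Vector) -> Vector:
    """Matrix-vector product for nested-list matrices."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dac_control(
    K: Matrix,
    M: Sequence[Matrix],
    x: Vector,
    past_w: Sequence[Vector],
) -> Vector:
    """Stabilizing feedback plus a linear function of past disturbances.

    past_w[0] is the most recent disturbance w_{t-1}, past_w[1] is w_{t-2}, etc.
    Assumed convention: u_t = -K x_t + sum_i M[i] past_w[i].
    """
    u = [-v for v in matvec(K, x)]
    for M_i, w in zip(M, past_w):
        u = [a + b for a, b in zip(u, matvec(M_i, w))]
    return u


# Scalar toy example (hypothetical numbers) with history length H = 2.
u_demo = dac_control(
    K=[[1.0]],
    M=[[[0.5]], [[0.25]]],
    x=[2.0],
    past_w=[[4.0], [8.0]],
)
```

The control is linear in the parameters `M` for fixed `K`, which is the property that makes the total cost convex in the DAC parameters.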
DACs also include the class of Linear Dynamic Controllers (LDCs), which generalize static feedback controllers. Both $H_2$ and $H_\infty$ optimal controllers under partial observation can be well-approximated by LDCs.
Definition 4 (Linear Dynamic Controllers).
A linear dynamic controller is a linear dynamical system with internal state , input and output that satisfies
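A linear dynamic controller can be pictured as a small state machine. The sketch below assumes the standard form $s_{t+1} = A_\pi s_t + B_\pi x_t$, $u_t = C_\pi s_t + D_\pi x_t$; the class and attribute names are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(M: Matrix, v: Vector) -> Vector:
    """Matrix-vector product for nested-list matrices."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


class LinearDynamicController:
    """Controller with internal state s_t, input x_t and output u_t (assumed form)."""

    def __init__(self, A_pi: Matrix, B_pi: Matrix, C_pi: Matrix, D_pi: Matrix, s0: Vector):
        self.A_pi, self.B_pi, self.C_pi, self.D_pi = A_pi, B_pi, C_pi, D_pi
        self.s = list(s0)

    def step(self, x: Vector) -> Vector:
        # Output uses the current internal state: u_t = C_pi s_t + D_pi x_t.
        u = [a + b for a, b in zip(matvec(self.C_pi, self.s), matvec(self.D_pi, x))]
        # Then the internal state advances: s_{t+1} = A_pi s_t + B_pi x_t.
        self.s = [a + b for a, b in zip(matvec(self.A_pi, self.s), matvec(self.B_pi, x))]
        return u


# Scalar toy controller (hypothetical numbers).
ldc = LinearDynamicController([[0.5]], [[1.0]], [[1.0]], [[-1.0]], [0.0])
u_first = ldc.step([2.0])
u_second = ldc.step([1.0])
```

Setting `C_pi` to zero removes the dependence on the internal state and recovers a static feedback controller $u_t = D_\pi x_t$.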
2.3 SDP Relaxation for LQ Control
In Linear Quadratic control the cost functions are known ahead of time and fixed, and the noise is i.i.d., . Given an instance of the LQ control problem defined by , the learner can obtain a strongly stable controller by solving the SDP relaxation for minimizing steady-state cost, proposed in (4). For , the SDP is given by
subject to  
Indeed, a strongly stable controller can be extracted from any feasible solution to the SDP, as guaranteed by the following lemma.
Lemma 5 (Lemma 4.3 in (4)).
Assume that and let . Let be any feasible solution for the SDP, then the controller is strongly stable.
Existence of Strongly Stable Controllers
3 Algorithm and main theorem
Now we describe our main algorithm for the black-box control problem, Algorithm 1. Overall we use the explore-then-commit strategy, and split the algorithm into three phases. In phase 1, we identify the underlying system dynamics to within some accuracy with large controls. In phase 2, we extract a strongly stable controller for the estimated system using the SDP in Section 2.3, and show that it is also strongly stable for the true system. We then alleviate the effects of using large controls by decaying the system to a state with constant magnitude. Finally in phase 3, we invoke Algorithm 1 in (8), which uses a DAC together with a known strongly stable controller to achieve sublinear regret.
Our main theorem below is stated using asymptotic notation that hides constants independent of the system parameters, and uses for an upper bound on the system parameters as defined in Section 2.1. The exact specification of the constants appears in the proofs.
Theorem 6.
Under Assumptions 1, 2, 3, with high probability the regret of Algorithm 1 is at most
This is composed of


Phase 1: after rounds we have . The total cost is at most .

Phase 2: Computing has zero cost. Decaying the system has total cost at most where , are as defined in the algorithm. By the bound on , this phase has total cost .

Phase 3: Nonstochastic control with a known strongly stable controller incurs regret at most , with high probability.
4 Analysis Outline
We provide an outline of our regret analysis in this section. All formal statements are in the appendix.
4.1 Black-box system identification
In this phase we obtain estimates of the system without knowing a stabilizing controller. Recall the definition of , and let . The procedure AdvSysId (Algorithm 2) consists of two steps. In the first step, we estimate each for (in particular we obtain close to ), and guarantee that , are small. In the second step, we take to be the solution to the system of equations in : . Note that Algorithm 2 is deterministic, and we bound the total cost in this phase instead of regret, matching the setting in our lower bound.
For the first step, the algorithm estimates matrices by using controls that are scaled standard basis vectors once every iterations, and using zero controls for the iterations in between. The state evolution satisfies
Intuitively, we choose scaling factors such that iterations after a nonzero control is used, the state is dominated by , the scaled th column of . In the algorithm is the concatenation of estimates for , and we concatenate the ’s to obtain . We show in Lemma 12 that , which implies the closeness of to , respectively.
Under the assumption that is strongly controllable, is the unique solution to the system of equations in : . By perturbation analysis of linear systems, the solution to the system of equations is close to , as long as are sufficiently small. By our choice of , we conclude that , . Lemma 14 shows that the total cost of this phase is bounded by .
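The two steps described above can be illustrated on a toy example. This is a simplified sketch, not the paper's Algorithm 2: it assumes a noiseless, single-input system with a 2-dimensional state, applies one large scaled control followed by zero controls, rescales the observed states to estimate $B$, $AB$, $A^2B$, and then solves the linear system relating consecutive blocks for $A$. The scaling factor `xi`, the initial state and all function names are hypothetical.

```python
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(M: Matrix, v: Vector) -> Vector:
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(X: Matrix, Y: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def make_black_box(A: Matrix, B: Matrix) -> Callable[[Vector, Vector], Vector]:
    """Hide (A, B) behind a step function; the identifier only calls step()."""
    def step(x: Vector, u: Vector) -> Vector:
        return [a + b for a, b in zip(matvec(A, x), matvec(B, u))]
    return step


def impulse_response_columns(step, x: Vector, xi: float, k: int) -> List[Vector]:
    """Apply the scaled control xi * e_1 once, then k zero controls.

    Returns k + 1 rescaled states approximating B e_1, A B e_1, ..., A^k B e_1.
    The approximation error is of order |x| / xi, so xi must dominate the state.
    """
    x = step(x, [xi])
    cols = [[v / xi for v in x]]
    for _ in range(k):
        x = step(x, [0.0])
        cols.append([v / xi for v in x])
    return cols


def inverse_2x2(M: Matrix) -> Matrix:
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def estimate_A(cols: List[Vector]) -> Matrix:
    """Solve A_hat C0 = C1 with C0 = [B, AB] and C1 = [AB, A^2 B] (2x2 case only)."""
    C0 = [list(r) for r in zip(*cols[:-1])]
    C1 = [list(r) for r in zip(*cols[1:])]
    return matmul(C1, inverse_2x2(C0))


# Unstable toy system (hypothetical numbers), started from a nonzero state.
A_true = [[1.5, 1.0], [0.0, 0.5]]
B_true = [[0.0], [1.0]]
step = make_black_box(A_true, B_true)
cols = impulse_response_columns(step, [1.0, -1.0], xi=1e6, k=2)
B_hat = [[v] for v in cols[0]]
A_hat = estimate_A(cols)
```

Because the injected control is much larger than the existing state, the leftover contribution of the initial state is negligible after rescaling. This is the intuition behind choosing scaling factors so that the response to the latest nonzero control dominates the state.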
4.2 Computing a stabilizing controller
The goal of phase 2 is to recover a strongly stable controller from system estimates obtained in phase 1 by solving the SDP presented in Section 2.3. The key to our task is setting the trace upper bound appropriately, so that the SDP is feasible and the recovered controller is strongly stable even for the original system. We justify our choice of in Lemma 18, and show that by our choice of , are sufficiently accurate and is strongly stable for the true system. We remark that (17) is an alternative procedure for recovering , given system estimates.
subject to  
4.2.1 Decaying the system
In phase 1 the algorithm uses large controls to estimate the system, and after iterations the state might have an exponentially large magnitude. Equipped with a strongly stable controller, we decay the system so that the state has a constant magnitude before starting phase 3. We show in Lemma 19 that following the policy for iterations decays the state to at most in magnitude.
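The decay step can be illustrated with a short simulation. The sketch assumes the closed-loop convention $x_{t+1} = (A + BK) x_t + w_t$ (the sign convention for $K$ is an assumption) and a scalar system with a bounded disturbance; the numbers and the function name `decay_norms` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(M: Matrix, v: Vector) -> Vector:
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(X: Matrix, Y: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def decay_norms(A: Matrix, B: Matrix, K: Matrix, x0: Vector, ws: Sequence[Vector]) -> List[float]:
    """Follow u_t = K x_t and record the Euclidean norm of each successive state."""
    BK = matmul(B, K)
    A_cl = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, BK)]  # A + BK
    x, norms = list(x0), []
    for w in ws:
        x = [a + b for a, b in zip(matvec(A_cl, x), w)]
        norms.append(math.sqrt(sum(v * v for v in x)))
    return norms


# Open loop doubles the state; the closed loop A + BK = 0.5 halves it each step.
norms = decay_norms(
    A=[[2.0]], B=[[1.0]], K=[[-1.5]],
    x0=[1e6],                 # large state left over from the exploration phase
    ws=[[1.0]] * 30,          # bounded disturbance
)
```

The state shrinks geometrically until it is of the same order as the accumulated disturbance, so the number of iterations needed grows only logarithmically with the initial magnitude.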
4.3 Nonstochastic control
5 Lower Bound on Black-box Control
In this section we prove that any deterministic black-box control algorithm incurs a loss which is exponential in the system dimension, even for stochastic control of LTI systems. We note that this lower bound is only consistent with the setting in phases 1 and 2, where the algorithms are deterministic.
Definition 7 (Black-box Control Algorithm).
A (deterministic) black-box control algorithm outputs a control at each iteration , where is a function of past information, i.e. .
Theorem 8.
Let be a control algorithm as per Definition 7. Then there exists a stabilizable system that is also strongly controllable, and a sequence of oblivious perturbations and costs, such that with , and , we have
Let for all . Consider the noiseless system for some and orthogonal . Under this system for all , and a stabilizing controller is . Observe that is constant. The system is also strongly controllable because . Let denote the rows of and , respectively. Fix a deterministic algorithm , and let be the control produced by at time . There exists such that under this system, outputs controls such that .
The construction.
Set . We construct Q and V as follows: let , set ; for , define
Let be the component of that is independent of ,
where denotes the projection of onto vector . Set for some to be specified later, and set .
The next lemma justifies this iterative construction of by showing that the trajectory is not affected by the choice of for . As a result, without loss of generality we can set after obtaining , and set after receiving .
Lemma 9.
As long as is orthogonal, the states satisfy for some constants that only depend on and .
Proof.
We prove the lemma by induction. For our base case, is trivially and it is fixed for all choices of . Set . Assume the lemma is true for , and we have specified for , for . The specified rows of are orthonormal by construction. Note that by our construction, is obtained first, and then we set . Since only depends on the current trajectory up to , it is well-defined, and we can obtain . By definition of , we can write for some coefficients . Set as in the lemma. The next state is then
(1)  
is orthogonal  
We have shown in the inductive step that does not depend on the choice of as long as is orthogonal, hence we can set . Moreover, is not affected by for by inspection. ∎
The magnitude of the state.
In this section we specify the constants in the construction to ensure that the state has an exponentially increasing magnitude. Let , . Set . The quantities and are well-defined when we set after obtaining and . Intuitively, applied to aligns to , and we grow the magnitude of the component in multiplicatively.
Lemma 10.
The states satisfy , and .
Proof.
Observe that where ; therefore we have .
Size of the system.
Our construction only requires to be specified, and without loss of generality we take . By inspection, can be written as , where and is a permutation matrix that satisfies . Therefore the spectral norm of is at most . We conclude that for this system, , and the total cost is at least .
6 Conclusion
We present the first end-to-end, efficient black-box control algorithm for unknown linear dynamical systems in the nonstochastic control setting. With high probability the regret of our algorithm is sublinear in with an exponential startup cost. Further, we show that this startup cost is nearly optimal by giving a lower bound for any deterministic black-box control algorithm.
As far as we know, these are the first explicit regret upper and lower bounds for black-box online control even in the stochastic setting with quadratic losses, for general systems. Our bounds hold, however, under the more difficult setting of adversarial noise and general convex cost functions.
References
 [1] (2011) Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 1–26.
 [2] (2019) Online control with adversarial disturbances. In International Conference on Machine Learning, pp. 111–119.
 [3] (2019) Logarithmic regret for online control. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 10175–10184.
 [4] (2018) Online linear quadratic control. arXiv:1806.07104.
 [5] (09–15 Jun 2019) Learning linear-quadratic regulators efficiently with only $\sqrt{T}$ regret. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 1300–1309.
 [6] (2018) Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, pp. 4188–4197.
 [7] (2019) Finite-time adaptive stabilization of linear systems. IEEE Transactions on Automatic Control 64 (8), pp. 3498–3505.
 [8] (2019) The nonstochastic control problem. arXiv:1911.12178.
 [9] (2020) Logarithmic regret bound in partially observable linear dynamical systems. arXiv:2003.11227.
 [10] (2020) Regret bound of adaptive control in linear quadratic Gaussian (LQG) systems. arXiv:2003.05999.
 [11] (2020) Regret minimization in partially observable linear quadratic control. arXiv:2002.00082.
 [12] (2019) Certainty equivalent control of LQR is efficient. arXiv:1902.07826.
 [13] (1994–1995) Lecture on information-based complexity of convex programming.
 [14] (2019) Non-asymptotic identification of LTI systems from a single trajectory. In 2019 American Control Conference (ACC), pp. 5655–5661.
 [15] (09–15 Jun 2019) Near optimal finite time identification of arbitrary linear dynamical systems. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 5610–5618.
 [16] (25–28 Jun 2019) Learning linear dynamical systems with semi-parametric least squares. In Proceedings of the Thirty-Second Conference on Learning Theory, A. Beygelzimer and D. Hsu (Eds.), Proceedings of Machine Learning Research, Vol. 99, Phoenix, USA, pp. 2714–2802.
 [17] (2020) Naive exploration is optimal for online LQR. arXiv:2001.09576.
 [18] (06–09 Jul 2018) Learning without mixing: towards a sharp analysis of linear system identification. In Proceedings of the 31st Conference On Learning Theory, S. Bubeck, V. Perchet, and P. Rigollet (Eds.), Proceedings of Machine Learning Research, Vol. 75, pp. 439–473.
 [19] (2020) Improper learning for nonstochastic control. arXiv:2001.09254.
 [20] (1994) Optimal control and estimation.
 [21] (2019) Sample complexity bounds for the linear quadratic regulator. Ph.D. Thesis, UC Berkeley.
 [22] (1996) Robust and optimal control. Prentice-Hall, Inc., USA. ISBN 0134565673.
Appendix A Proofs for Section 4.1
In this section we present proofs for phase 1 of Algorithm 1. We show that the estimates satisfy , and bound the total cost of this phase. We first bound the magnitude of states in each iteration to guide our choice of scaling factors .
Claim 11.
In Algorithm 2, for all , let , , we have .
Proof.
This can be seen by induction. For our base case, consider , where . . Assume for . If , then , , and .
Otherwise, we have , and . Moreover, . Therefore
∎
With appropriate choice of , we ensure that and are close in the Frobenius norm.
Lemma 12.
For , satisfies
In particular, .
Proof.
Observe that by definition,