Although adaptive control of unknown Linear Quadratic Gaussian (LQG) systems is by now a well-studied topic [4, 7, 6, 2], existing algorithms cannot be used to control an NCS in which both the plant and the network parameters are unknown. In a departure from traditional adaptive controllers for LQG systems, an algorithm must now also continually estimate the unknown network behaviour while simultaneously learning and controlling the plant in an online manner. An important concern is that, in general, it is not optimal to design and operate the network estimator independently of the process controller. Thus, the optimal controls should utilize the information gained about network quality in addition to the information gained about plant parameters. Similarly, decisions made by the network scheduler should also “aid” the controller in “learning” the unknown plant parameters.
This work addresses the problem of adaptive control of a simple NCS in which data packets from the controller to the plant are communicated over an unreliable channel. We model the plant as an LQG system. We propose a learning rule that maintains estimates and confidence sets for both a) the (unknown) plant parameters, and b) the (unknown) channel reliability. Controls are then generated using the principle of optimism in the face of uncertainty, and depend upon both a) and b). We denote our algorithm as Upper Confidence Bounds for Networked Control Systems (UCB-NCS).
We show that UCB-NCS yields the same asymptotic performance as the optimal controller that has knowledge of the system and network parameters. We also quantify its finite-time performance by providing upper bounds on its “regret”. Regret scales as , where is the operating time horizon and is a problem-dependent constant. It also depends on the channel reliability through a certain quantity which we call the “margin of stability” (14). A larger value of means that the learning algorithm has a lower regret.
UCB-NCS has many appealing properties. For instance, the network estimator needs to communicate its optimistic estimate of network reliability to the controller only occasionally; the controller then uses this value to generate controls.
II System Model
We assume that the system of interest is linear, and evolves as follows
where are the system matrices, is the instantaneous state of the wireless channel, and are the system state and control input at time respectively. are Bernoulli i.i.d. with mean value . is the process noise, and is assumed to be i.i.d. with
The objective is to minimize the operating cost
We let denote the system parameters. is not known to the controller. We assume that the system is scalar, i.e., .
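As an illustration of the model class described above, the following minimal sketch simulates a scalar plant whose control packets are dropped by an i.i.d. Bernoulli channel. The exact forms used here are assumptions: the dynamics x(t+1) = a x(t) + l(t) b u(t) + w(t), the quadratic stage cost q x^2 + r u^2, and the fixed linear feedback u = gain * x. The names `simulate`, `gain`, `q`, `r` and `noise_std` are hypothetical.

```python
import random


def simulate(a, b, p, gain, horizon, q=1.0, r=1.0, noise_std=1.0, seed=0):
    """Average cost of u(t) = gain * x(t) on an assumed scalar lossy plant.

    Assumed dynamics: x(t+1) = a*x(t) + l(t)*b*u(t) + w(t), where
    l(t) ~ Bernoulli(p) is the channel state (1 = packet delivered)
    and w(t) is zero-mean Gaussian noise.
    Assumed stage cost: q*x(t)^2 + r*u(t)^2, charged on the commanded input.
    """
    rng = random.Random(seed)
    x, total_cost = 0.0, 0.0
    for _ in range(horizon):
        u = gain * x
        total_cost += q * x * x + r * u * u
        delivered = 1 if rng.random() < p else 0  # channel state l(t)
        w = rng.gauss(0.0, noise_std)             # process noise w(t)
        x = a * x + delivered * b * u + w
    return total_cost / horizon


if __name__ == "__main__":
    # An open-loop unstable plant (a > 1) controlled over an 80% reliable link.
    print(simulate(a=1.2, b=1.0, p=0.8, gain=-1.0, horizon=10_000))
```

Lowering `p` in this sketch makes the closed loop spend more time in open loop, which is the effect the channel reliability has on the operating cost.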
III Preliminaries on Jump Markov Linear Systems
There are matrices such that the optimal control at is given by . We let denote the optimal matrices when system parameter is equal to .
We let denote the “cost-to-go” when the system state is equal to , the channel state is and the system dynamics are described by . In fact, the value function is piecewise linear, and we let denote the corresponding matrices. We also let be the optimal operating cost.
Notation: For a random variable (r.v.), let denote its projection onto the space of measurable functions, i.e., its conditional expectation w.r.t. the sigma-algebra . For (where denotes the set of integers), we let . For a set of r.v.s , we let denote the smallest sigma-algebra with respect to which each r.v. in is measurable. For functions , we say if . For a set , we let denote its complement.
IV Upper Confidence Bounds for NCS (UCB-NCS)
Let . A learning policy, or adaptive controller, is a collection of maps . Let denote the estimates of at time defined as follows. Let , and .
Let be the confidence intervals associated with the estimates at time defined as follows,
The learning rule decomposes the cumulative time into episodes, and implements a single stationary controller within each episode that chooses as a function of . Let denote the starting time of the -th episode. The controller implemented within episode is obtained at time by solving the following optimization problem.
where is the set of “allowable” parameters. Let denote a solution to the above problem. The learning rule implements the optimal controller corresponding to the case when the true system parameters are equal to . Thus, for .
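The optimistic selection step can be sketched as a search over a confidence box for the parameter with the lowest optimal cost. The sketch below is a simplified stand-in and not the jump Markov solution referred to in the text: it assumes the scalar dynamics x+ = a x + l b u + w with l ~ Bernoulli(p), a stage cost q x^2 + r u^2 charged on the commanded input, and a control chosen before the channel state is observed. Under these assumptions the Bellman equation gives the recursion P = q + a^2 P - (p a b P)^2 / (r + p b^2 P), with average cost equal to the noise variance times P. The grid search and all function names are hypothetical.

```python
import itertools
import math


def riccati_cost(a, b, p, q=1.0, r=1.0, noise_var=1.0, iters=500):
    """Average cost and linear gain for x+ = a*x + l*b*u + w, l ~ Bernoulli(p).

    Assumes stage cost q*x^2 + r*u^2 and a control chosen before l is seen.
    Returns (inf, 0.0) when the value iteration diverges.
    """
    P = q
    for _ in range(iters):
        P = q + a * a * P - (p * a * b * P) ** 2 / (r + p * b * b * P)
        if P > 1e12:  # no mean-square stabilising linear controller
            return math.inf, 0.0
    gain = -p * a * b * P / (r + p * b * b * P)
    return noise_var * P, gain


def grid(lo, hi, n):
    """n evenly spaced points covering [lo, hi]."""
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]


def optimistic_parameters(a_int, b_int, p_int, n=11):
    """Pick the parameter triple in the confidence box with the lowest cost."""
    best = (math.inf, None, 0.0)
    points = itertools.product(grid(*a_int, n), grid(*b_int, n), grid(*p_int, n))
    for a, b, p in points:
        cost, gain = riccati_cost(a, b, p)
        if best[1] is None or cost < best[0]:
            best = (cost, (a, b, p), gain)
    return best


if __name__ == "__main__":
    cost, theta, gain = optimistic_parameters((0.9, 1.1), (0.8, 1.2), (0.6, 0.9))
    print(cost, theta, gain)
```

In this simplified model the cost decreases with the channel reliability, so the optimistic choice always takes the upper end of the reliability interval, which matches the use of a UCB estimate for the channel.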
A new episode begins when either or doubles, or when the operating time spent in the current episode becomes equal to the length of the previous episode. The learning rule also ensures that the durations of episodes are at least time-slots, i.e., . We set
i.e., it is the current value of the UCB estimate of . UCB-NCS is summarized in Algorithm 1.
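The bookkeeping behind such a learning rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the estimator defined above: it assumes a regularised least-squares fit of the plant parameters (a, b) from the regression x(t+1) = a x(t) + b l(t) u(t) + w(t), and a Hoeffding-style upper confidence bound on the channel reliability. The class name `Estimator`, the regulariser `reg` and the confidence level `delta` are hypothetical, and the episode-switching rule is not modelled.

```python
import math


class Estimator:
    """Running estimates of plant parameters (a, b) and channel reliability."""

    def __init__(self, reg=1.0):
        # Regularised Gram matrix and cross terms for the regressor (x, l*u).
        self.sxx, self.sxu, self.suu = reg, 0.0, reg
        self.sxy, self.suy = 0.0, 0.0
        self.n, self.successes = 0, 0

    def update(self, x, u, delivered, x_next):
        """Record one transition; `delivered` is the channel state (0 or 1)."""
        v = delivered * u  # the input that actually reached the plant
        self.sxx += x * x
        self.sxu += x * v
        self.suu += v * v
        self.sxy += x * x_next
        self.suy += v * x_next
        self.n += 1
        self.successes += delivered

    def plant_estimate(self):
        """Solve the 2x2 regularised normal equations for (a_hat, b_hat)."""
        det = self.sxx * self.suu - self.sxu ** 2
        a_hat = (self.suu * self.sxy - self.sxu * self.suy) / det
        b_hat = (self.sxx * self.suy - self.sxu * self.sxy) / det
        return a_hat, b_hat

    def channel_ucb(self, delta=0.05):
        """Empirical delivery rate plus a one-sided Hoeffding radius, capped at 1."""
        if self.n == 0:
            return 1.0
        radius = math.sqrt(math.log(1.0 / delta) / (2.0 * self.n))
        return min(1.0, self.successes / self.n + radius)


if __name__ == "__main__":
    est = Estimator(reg=1e-9)
    # Noise-free transitions generated with a = 0.5, b = 2.0.
    for x, u, d, x_next in [(1, 0, 1, 0.5), (0, 1, 1, 2.0), (1, 1, 1, 2.5), (2, 1, 0, 1.0)]:
        est.update(x, u, d, x_next)
    print(est.plant_estimate(), est.channel_ucb())
```

Only the scalar returned by `channel_ucb` would need to be passed from a network estimator to a controller in this sketch, and only when a new controller is computed.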
V Large Deviation Bounds on Estimation Errors
We now analyze the estimation errors .
We then have that
It can be shown that
Note that is a martingale difference sequence w.r.t. , while is adapted to . Thus, the bound on follows from the self-normalized bounds on martingales in Corollary 1 of .
To analyze , we observe,
The first term within braces is bounded using Corollary 2 of . To bound the second term, we observe that it is upper-bounded by . We then use bounds on to bound it. The bound on the estimation error of is obtained using the Azuma-Hoeffding inequality.
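For an i.i.d. Bernoulli channel, the Azuma-Hoeffding inequality reduces to Hoeffding's inequality: after n samples, the empirical delivery rate deviates from its mean by more than sqrt(log(2/delta) / (2n)) with probability at most delta. The following sketch checks this two-sided bound by Monte Carlo simulation. The function names and the chosen values of p, n and delta are hypothetical, and the radius shown is the generic Hoeffding radius rather than the confidence interval used by UCB-NCS.

```python
import math
import random


def hoeffding_radius(n, delta):
    """Two-sided Hoeffding radius for the mean of n variables in [0, 1]."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def empirical_violation_rate(p, n, delta, trials=2000, seed=0):
    """Fraction of trials in which |p_hat - p| exceeds the Hoeffding radius."""
    rng = random.Random(seed)
    radius = hoeffding_radius(n, delta)
    violations = 0
    for _ in range(trials):
        successes = sum(1 for _ in range(n) if rng.random() < p)
        if abs(successes / n - p) > radius:
            violations += 1
    return violations / trials


if __name__ == "__main__":
    # The observed violation rate should not exceed delta.
    print(empirical_violation_rate(p=0.7, n=200, delta=0.05))
```

The bound is distribution-free, so the observed violation rate is typically well below delta.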
VI Large Deviation Bounds on the System State
We now bound under UCB-NCS. System evolution under UCB-NCS is given by
Consider the deviations
and the events,
where , and . It follows from the Azuma-Hoeffding inequality that
Fix a sufficiently large (it suffices to let ), and define
The following result follows by combining the union bound with the bound (12).
We now focus on upper-bounding on .
Throughout, we assume that the true system parameter , and the set used by UCB-NCS, satisfy the following.
Let . Then,
We call the “margin of stability” of the NCS. Note that depends upon a) , b) .
Consider an element of , and assume there are episodes during the time period . Let denote the number of times the channel state assumes value , and let denote the UCB estimate of during the -th episode. Let denote the duration of the -th episode. We have the following,
The following is easily proved.
Under Assumption 1, we have the following on
Note that we have suppressed the dependence of the function upon .
VII Regret Analysis of UCB-NCS
Define , the regret incurred by UCB-NCS until time as follows
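A minimal sketch of how such a regret could be computed from a simulated cost trajectory is given below. It assumes the usual average-cost definition, namely the cumulative cost incurred up to time T minus T times the optimal average cost; the function name `regret_trajectory` and its arguments are hypothetical.

```python
from itertools import accumulate


def regret_trajectory(stage_costs, optimal_avg_cost):
    """Assumed regret R(T) = sum_{t < T} c(t) - T * J_opt, for T = 1, 2, ..."""
    return [
        total - (t + 1) * optimal_avg_cost
        for t, total in enumerate(accumulate(stage_costs))
    ]


if __name__ == "__main__":
    print(regret_trajectory([3.0, 1.0, 2.0], optimal_avg_cost=2.0))
```

Plotting this trajectory against the horizon is a simple way to inspect how the regret of a learning rule grows in simulation.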
For , define
Similarly, let be drawn i.i.d. according to .
On the set , can be upper-bounded as follows,
Consider the Bellman optimality equation at time when the true system parameter is assumed equal to ,
where the second equality follows since the learning rule applies controls by assuming that is the true system parameter. Note that on , serves as a lower bound on the optimal cost , so that serves as an upper bound on . The proof is completed by rearranging the terms in (VII), and summing them from to . We now bound the terms on .
We decompose as follows, , where,
We further decompose as follows,
where is as in (16).
is a martingale, though its increments are not bounded. However, its increments are upper-bounded as . It follows from Lemma 4 that its increments are upper-bounded as on . The proof then follows from Proposition 34 of . Henceforth denote
We decompose as follows,
Note that under UCB-NCS, we have that . Let
After performing simple algebraic manipulations, we can show that
and the last inequality in (VII-B) follows from the Cauchy-Schwarz inequality. The terms are bounded in Lemma 10 and Lemma 11 in the Appendix. We substitute these bounds in (VII-B) and obtain the following result.
On , we have
It remains to bound in order to bound . This is done in Lemma 12 of the Appendix.
On , we have .
VIII Main Result
Theorem 1 (Bound on Regret)
IX Conclusion and Future Work
We propose UCB-NCS, an adaptive control law, or learning rule, for NCS, and provide finite-time performance guarantees for it. We show that, with high probability, its regret scales as up to constant factors. We identify a certain quantity which we call the margin of stability of the NCS. A smaller margin, which indicates low network quality, leads to a larger regret.
The results in this work can be extended in various directions. So far we have considered only scalar systems; a natural extension is to the case of vector systems. Another direction is to derive lower bounds on the expected value of the regret that can be achieved under any admissible control policy.
Lemma 10 (Bounding )
On , we have
Let be the time step at which the latest episode begins. Since under UCB-NCS we have , it can be shown that
Now consider the following inequality,
For , we have,
where the first inequality follows from Lemma 14, and the second inequality follows from the size of the confidence intervals (5). On , we have , and also ; so we use inequality (IX) with set equal to , and combine the resulting inequalities with (26) in order to obtain the following,
Lemma 11 (Bounding )
On , we have
Follows from Lemma 4.
On we have