I Introduction
Artificial intelligence (AI) offers an attractive set of tools that are mostly model-free, yet useful in solving stochastic and optimal control problems arising in cyber-physical systems (CPS), the Internet of Things (IoT), and large-scale industrial systems. AI-based solutions have seen a major resurgence in recent years, owing partly to advances in computational capacity and partly to advances in deep neural networks for function approximation and feature extraction. Oftentimes, the use of reinforcement learning algorithms or AI in conjunction with traditional controllers reduces the complexity of system design while boosting efficiency.
The above-mentioned systems are all characterized by large sizes. However, typical resources, such as communication channels, computational resources, and network bandwidth, do not scale with system size. Resource allocation is therefore an important problem in this setting. In addition, in a distributed control setting that involves feedback, resource allocation is required to be “control-aware”, i.e., it must aid in optimizing closed-loop control performance. In such feedback-driven systems, controllers often rely on information collected from various sensors to make intelligent decisions. Hence, efficient information dispersion is essential for decision making over communication networks to be effective. As noted earlier, this is a hard problem since the number of communication channels available is much smaller than what is ideally required to transfer data from sensors to controllers.
Fig. 1 illustrates a simplified representation of the class of CPS and IoT systems of interest. The system consists of $N$ independent subsystems that communicate over a shared communication network, which contains $M$ channels. We assume that $M \ll N$ ($M$ is much smaller than $N$), and that transmissions are via error-free channels. Each subsystem consists of one smart sensor, one controller, and one plant. Within each subsystem, there is feedback from the sensor to the controller. These feedback loops are closed over this resource-constrained communication network.
At every stage, DeepCAS, our deep reinforcement learning-based model-free scheduling algorithm, decides which of the subsystems are allocated channels to close the feedback loop. DeepCAS makes scheduling decisions by adapting to the control actions while trying to minimize the control loss. At every stage, the smart sensors compute estimates of the subsystem states, using Kalman Filter (I), for transmission to the corresponding controller; see Fig. 1. The controller runs Kalman Filter (II) to estimate the subsystem state in the absence of transmissions. In addition to Kalman Filter (I), each smart sensor also implements a copy of Kalman Filter (II) and the control algorithm. In other words, the smart sensor is cognizant of the state estimate used by the controller at every time instant. DeepCAS obtains feedback (i.e., rewards) from the sensors for making scheduling decisions.
Previously, several scheduling strategies have been proposed to determine the access order of different sensors and/or actuators; see [1] and references therein. A popular approach is to use periodic schedules [2, 3, 4, 5] since they are easy to implement and they facilitate stability analysis of networked control systems. Unfortunately, finding optimal periodic schedules for control applications may not be easy since both the period and the sequence need to be found. Further, restricting to periodic schedules may lead to performance loss [6]. With a handful of exceptions, the determination of optimal schedules requires solving a mixed-integer quadratic program, which is computationally infeasible for all but very small systems; see [7, 6].
Event- and self-triggering algorithms present popular alternatives to periodic scheduling; see [8] and references therein. Linear quadratic optimal control problems subject to such scheduling schemes have been investigated in [9, 10, 11, 12]. Many of the aforementioned results only consider single-loop control systems. Only limited literature studies multi-loop control systems [9, 10, 11]. One limitation is that many of these results only investigate linear scalar systems.
Our contribution in the present work is the development of a deep reinforcement learning-based control-aware scheduling algorithm, DeepCAS. At its heart lies the Deep Q-Network (DQN), a modern variant of Q-learning, introduced in [13]. In addition to being readily scalable, DeepCAS is completely model-free. To optimize the overall control performance, we propose the following sequential design of control and scheduling. In the first step, we design an optimal controller for each independent subsystem. As discussed in [12], under limited communication, the control loss has two components: (a) the best possible control loss and (b) the error due to intermittent transmissions. If $M \geq N$, i.e., if there are at least as many channels ($M$) as subsystems ($N$), then (b) vanishes. Since we are in the setting of $M \ll N$, the goal of the scheduler is to minimize (b). To this end, we first construct an associated Markov decision process (MDP). The state space of this MDP is the difference in state estimates of all controllers and sensors (obtainable from the smart sensors). The single-stage reward is the negative of the loss component (b). Since we use DQN to solve this MDP, we do not need knowledge of the transition probabilities. The goal of DeepCAS is to find a scheduling strategy that maximizes the reward, i.e., minimizes (b).
II Networked Control System: Model, Assumptions, and Goals
II-A Model for each subsystem
As illustrated in Fig. 1, our networked control system consists of $N$ independent closed-loop subsystems. The feedback loop within each subsystem (plant) is closed over a shared communication network. For $i \in \{1, \ldots, N\}$, subsystem $i$ is described by

$$x^i_{k+1} = A_i x^i_k + B_i u^i_k + w^i_k, \tag{1}$$

where $A_i$ and $B_i$ are matrices of appropriate dimensions, $x^i_k$ is the state of subsystem $i$, $u^i_k$ is the control input, and $w^i_k$ is zero-mean i.i.d. Gaussian noise with covariance matrix $W_i$. The initial state of subsystem $i$, $x^i_0$, is assumed to be a Gaussian random vector with mean $\bar{x}^i_0$ and covariance matrix $X^i_0$, and the initial states of different subsystems are assumed to be independent of each other.
At a given time $k$, we assume that only noisy output measurements are available. We thus have

$$y^i_k = C_i x^i_k + v^i_k, \tag{2}$$

where $v^i_k$ is zero-mean i.i.d. Gaussian noise with covariance matrix $V_i$. The noise sequences, $\{w^i_k\}$ and $\{v^i_k\}$, are independent of the initial conditions $x^i_0$.
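As a minimal sketch of the model in (1) and (2), the following Python snippet advances one subsystem by a single time step and produces a noisy output. The matrices, the noise standard deviations, and the function names are illustrative assumptions; they are not the systems used in the experiments.

```python
import random


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def vec_add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(parts) for parts in zip(*vectors)]


def plant_step(A, B, x, u, w):
    """State update of the form x_next = A x + B u + w."""
    return vec_add(mat_vec(A, x), mat_vec(B, u), w)


def measure(C, x, v):
    """Noisy output of the form y = C x + v."""
    return vec_add(mat_vec(C, x), v)


# Hypothetical second-order single-input single-output subsystem.
A = [[1.1, 0.2], [0.0, 0.9]]
B = [[0.0], [1.0]]
C = [[1.0, 0.0]]

rng = random.Random(0)
x = [1.0, -1.0]
u = [0.5]
# Independent Gaussian samples, i.e., diagonal noise covariances (assumed).
w = [rng.gauss(0.0, 0.1) for _ in range(2)]
v = [rng.gauss(0.0, 0.1)]

x_next = plant_step(A, B, x, u, w)
y = measure(C, x, v)
```

Running the two functions in a loop over $k$ yields one sample path of the subsystem.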
II-B Control architecture and loss function
Each subsystem is a stochastic linear time-invariant (LTI) system whose dynamics are given by (1). Further, each subsystem is independently controlled. Dependencies do arise from sharing a communication network. Subsystem $i$ has a smart sensor which samples the subsystem’s output and computes an estimate of the subsystem’s state. This value is then sent to the associated controller, provided a channel is allocated to it by DeepCAS. If the controller obtains a new state estimate from the sensor, then it calculates a control command based on this state estimate. Otherwise, it calculates a control command based on its own estimate of the subsystem’s state.
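The control actions and scheduling decisions (of DeepCAS) are taken to minimize the total control loss given by

$$J = \sum_{i=1}^{N} J^i, \tag{3}$$

where $J^i$ is the expected control loss of subsystem $i$ and is given by

$$J^i = \mathbb{E}\left[ (x^i_T)^\top Q_{f,i}\, x^i_T + \sum_{k=0}^{T-1} \left( (x^i_k)^\top Q_i\, x^i_k + (u^i_k)^\top R_i\, u^i_k \right) \right],$$

where $T$ is the control horizon, $Q_i$ and $Q_{f,i}$ are positive semidefinite matrices, and $R_i$ is positive definite.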
II-C Smart sensors and preprocessing units
Within our setting, the primary role of a smart sensor is to take measurements of a subsystem’s output. Also, it plays a vital role in helping DeepCAS with scheduling decisions. It is from the smart sensors that DeepCAS gets all the necessary feedback information for scheduling. For these tasks, each smart sensor employs two Kalman filters: (1) Kalman Filter (I) is used to estimate the subsystem’s state, (2) a copy of Kalman Filter (II) is used to estimate the subsystem’s state as perceived by the controller. Note that the controller employs Kalman Filter (II). Below, we discuss the setup in more detail.
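Kalman filter (I): Since we assume that the sensors have knowledge of previous plant inputs, the sensors employ standard Kalman filters to compute the state estimate $\hat{x}^{s,i}_{k|k}$ and covariance $P^{s,i}_{k|k}$ recursively as:

$$\begin{aligned}
\hat{x}^{s,i}_{k|k-1} &= A_i \hat{x}^{s,i}_{k-1|k-1} + B_i u^i_{k-1}, \\
P^{s,i}_{k|k-1} &= A_i P^{s,i}_{k-1|k-1} A_i^\top + W_i, \\
K^i_k &= P^{s,i}_{k|k-1} C_i^\top \left( C_i P^{s,i}_{k|k-1} C_i^\top + V_i \right)^{-1}, \\
\hat{x}^{s,i}_{k|k} &= \hat{x}^{s,i}_{k|k-1} + K^i_k \left( y^i_k - C_i \hat{x}^{s,i}_{k|k-1} \right), \\
P^{s,i}_{k|k} &= \left( I - K^i_k C_i \right) P^{s,i}_{k|k-1},
\end{aligned}$$

starting from $\hat{x}^{s,i}_{0|-1} = \bar{x}^i_0$ and $P^{s,i}_{0|-1} = X^i_0$. Here $A_i$, $B_i$, and $C_i$ are the system matrices of (1) and (2), $W_i$ and $V_i$ are the process and measurement noise covariances, and $\bar{x}^i_0$ and $X^i_0$ are the mean and covariance of the initial state.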
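The recursion run by the smart sensor can be sketched for the scalar case as follows. This is the textbook Kalman filter, written for a one-dimensional state for brevity; the parameter names (`a`, `b`, `c`, `w_var`, `v_var`) are assumptions standing in for the system parameters and noise variances of (1) and (2).

```python
def kalman_step(x_est, p_est, u_prev, y, a, b, c, w_var, v_var):
    """One predict/update cycle of a scalar Kalman filter.

    x_est, p_est: previous filtered estimate and its error variance.
    u_prev:       previous plant input (assumed known to the sensor).
    y:            current noisy measurement.
    """
    # Time update: propagate the estimate through the nominal model.
    x_pred = a * x_est + b * u_prev
    p_pred = a * p_est * a + w_var

    # Measurement update: correct the prediction with the new output.
    gain = p_pred * c / (c * p_pred * c + v_var)
    x_new = x_pred + gain * (y - c * x_pred)
    p_new = (1.0 - gain * c) * p_pred
    return x_new, p_new


# Hypothetical usage with illustrative numbers.
x_hat, p_hat = kalman_step(
    x_est=0.0, p_est=1.0, u_prev=0.0, y=2.0,
    a=1.0, b=0.0, c=1.0, w_var=0.0, v_var=1.0,
)
```

With equal prior and measurement variances, as in the usage example, the update places the estimate halfway between the prediction and the measurement.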
Kalman filter (II): The controller runs a minimum mean square error (MMSE) estimator to compute estimates of the subsystem’s state as follows:
(4)  
(5) 
with .
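The behavior described for the controller-side estimator can be sketched as below, for a scalar state. The sketch assumes that, in the absence of a transmission, the controller propagates its previous estimate through the nominal model; the function and parameter names are hypothetical.

```python
def controller_estimate(x_ctrl_prev, u_prev, x_sensor, transmitted, a, b):
    """State estimate used by the controller at the current time.

    If the sensor was allocated a channel, its estimate is adopted.
    Otherwise the previous controller estimate is propagated open loop
    (an assumption of this sketch).
    """
    if transmitted:
        return x_sensor
    return a * x_ctrl_prev + b * u_prev


def estimate_mismatch(x_sensor, x_ctrl):
    """Difference between the sensor's and the controller's estimates."""
    return x_sensor - x_ctrl


# Hypothetical usage: the same step with and without a transmission.
with_tx = controller_estimate(1.0, 0.5, 3.0, True, a=2.0, b=1.0)
without_tx = controller_estimate(1.0, 0.5, 3.0, False, a=2.0, b=1.0)
```

Because the smart sensor runs a copy of this estimator, it can evaluate the mismatch locally and report it to the scheduler.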
II-D Goal: minimizing the control loss
For the control problem studied, the certainty equivalent (CE) controller is still optimal; see [12] for details. Using the control commands generated by the CE controllers, the minimum value of the total control loss, (3), has two components: (a) the best possible control loss and (b) the error due to intermittent communications. Hence, the problem of minimizing the control loss has two separate components: (i) designing the best (optimal) controller for each subsystem and (ii) scheduling in a control-aware manner.
Component I: Controller design. The controller in feedback loop takes the following control action, , at time :
(6) 
where is the state estimate used by the controller,
(7) 
and is recursively computed as
(8) 
with initial values . Let be the state estimate of Kalman Filter (I), as employed by the sensor. We have when the sensor and controller of the feedback loop have communicated. Otherwise, is the state estimate obtained from Kalman Filter (II). The minimum value of the control loss of subsystem is given by
(9) 
where and stems from communication errors in subsystem . Recall that there are subsystems and communication channels.
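A certainty equivalent controller of this kind is commonly obtained from a backward Riccati recursion. The following scalar sketch shows the standard finite-horizon linear quadratic recursion; it is an illustration under the assumption of a quadratic stage cost with weights `q`, `r` and terminal weight `q_final`, and it is not claimed to reproduce (6) to (8) symbol for symbol.

```python
def lqr_gains(a, b, q, r, q_final, horizon):
    """Backward Riccati recursion for a scalar finite-horizon LQ problem.

    Returns the list of feedback gains (index k = time step) and the
    cost-to-go weight at time 0.
    """
    s = q_final                      # terminal condition
    gains = [0.0] * horizon
    for k in reversed(range(horizon)):
        gain = (b * s * a) / (b * s * b + r)
        gains[k] = gain
        # Riccati update for the cost-to-go weight at time k.
        s = q + a * s * a - a * s * b * gain
    return gains, s


def control_action(gain, x_est):
    """Certainty equivalent control: feed back the state estimate."""
    return -gain * x_est


# Hypothetical usage with illustrative numbers.
gains, s0 = lqr_gains(a=1.0, b=1.0, q=1.0, r=1.0, q_final=1.0, horizon=1)
u0 = control_action(gains[0], x_est=2.0)
```

The gains depend only on the model and the cost weights, so they can be computed offline, before the scheduler is trained.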
Component II: Control-aware scheduling. The main aim of the scheduling algorithm is to help minimize the total control loss of (3). To this end, one must minimize
(10) 
of (9) for every subsystem $i$. Note that $T$ in (10) is the control horizon. At any time $k$, the scheduler decides which among the $N$ subsystems may communicate. Note that when a communication channel is assigned to subsystem $i$ at time $k$.
In the following section, we present a deep reinforcement learning algorithm for control-aware scheduling called DeepCAS. DeepCAS communicates only with the smart sensors. At every time instant, sensors are told whether they can transmit to their associated controllers. Then, the sensors provide feedback on the scheduling decision for that stage. Note that we do not consider the overhead involved in providing feedback.
III Deep reinforcement learning for control-aware scheduling
As stated earlier, at the heart of DeepCAS lies the DQN. The DQN is a modern variant of Q-learning that effectively counters Bellman’s curse of dimensionality. Essentially, DQN or Q-learning finds a solution to an associated Markov decision process (MDP) in an iterative model-free manner. Before proceeding, let us recall the definition of an MDP. For a more detailed exposition, the reader is referred to [14]. An MDP, $\mathcal{M}$, is given by the tuple $\langle \mathcal{S}, \mathcal{A}, P, r, \gamma \rangle$, where
$\mathcal{S}$ is the state space of $\mathcal{M}$;

$\mathcal{A}$ is the set of actions that can be taken;

$P$ is the transition probability, i.e., $P(s' \mid s, a)$ is the probability of transitioning to state $s'$ when action $a$ is taken at state $s$;

$r$ is the one-stage reward function, i.e., $r(s, a)$ is the reward when action $a$ is taken at state $s$;

$\gamma$ is the discount factor with $0 \leq \gamma \leq 1$.
Below, we state the MDP associated with our problem.

The state space consists of all possible augmented error vectors. Hence, the state vector at time is given by .

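The action space $\mathcal{A}$ is given by the size-$M$ subsets of the $N$ subsystems, i.e., the sets of subsystems to which the $M$ channels are allocated: $\mathcal{A} = \{ a \subseteq \{1, \ldots, N\} : |a| = M \}$. Hence, the cardinality of the action space is given by $\binom{N}{M}$.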

At time , the reward is given by .

Although it would seem natural to use $\gamma = 1$, we use $\gamma < 1$ since it hastens the rate of convergence.
Note that the scheduler (DeepCAS) takes an action just before time $k$ and receives rewards just after time $k$, based on transmissions at time $k$. Also, note that DeepCAS only gets nonzero rewards from non-transmitting sensors. DeepCAS is model-free. Hence, it does not need to know the transition probabilities.
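To make the size of the action space concrete, the following sketch enumerates all ways of allocating the channels to the subsystems. It assumes that an action is a set of subsystem indices, one per channel, and that channels are interchangeable; the function name is hypothetical.

```python
from itertools import combinations
from math import comb


def action_space(num_subsystems, num_channels):
    """All subsets of subsystems that can be scheduled at one stage."""
    indices = range(1, num_subsystems + 1)
    return list(combinations(indices, num_channels))


# One channel shared by three subsystems, and three channels shared by six.
small = action_space(3, 1)
large = action_space(6, 3)
```

The number of actions equals the binomial coefficient `comb(num_subsystems, num_channels)`, which is also the number of outputs that a Q-network needs for this MDP.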
Let us suppose we use a reinforcement learning algorithm, such as Q-learning, to solve the MDP defined above. Since the learning algorithm will find policies that maximize the future expected cumulative rewards, and our reward is the negative of the error due to intermittent transmissions, we expect to find policies that minimize scheduling effects on the entire system. This is a consequence of our above definition of the reward. Below, we provide a brief overview of Q-learning and DQN, the reinforcement learning algorithm at the heart of DeepCAS. Simply put, DeepCAS is a DQN solving the above-defined MDP.
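DeepCAS. At any time $k$, the scheduler is interested in maximizing the following expected discounted future reward:

$$\mathbb{E}\left[ R_k \right], \qquad R_k = \sum_{t=k}^{T} \gamma^{t-k} r_t .$$

Recall that $r_t$ is the single-stage reward of the MDP defined above. Q-learning is a useful methodology to solve such problems. It is based on finding the following Q-factor for every state-action pair:

$$Q^{*}(s, a) = \max_{\pi} \mathbb{E}\left[ R_k \mid s_k = s,\ a_k = a,\ \pi \right],$$

where $\pi$ is a policy that maps states to actions. The algorithm itself is based on the Bellman equation:

$$Q^{*}(s, a) = \mathbb{E}_{s' \sim \mathcal{E}}\left[ r + \gamma \max_{a'} Q^{*}(s', a') \,\middle|\, s, a \right].$$

Note that DeepCAS has no knowledge of the networked control system dynamics. These unknown dynamics are represented by $\mathcal{E}$ in the above equation. Since our state space is continuous, we use a deep neural network (DNN) for function approximation. Specifically, we try to find good approximations of the Q-factors iteratively. In other words, the neural network takes as input a state $s$ and outputs $Q(s, a; \theta)$ for every possible action $a$, such that $Q(s, a; \theta) \approx Q^{*}(s, a)$. This deep function approximator, with weights $\theta$, is referred to as a Deep Q-Network. The Deep Q-Network is trained by minimizing a time-varying sequence of loss functions given by

$$L_k(\theta_k) = \mathbb{E}_{s, a \sim \rho(\cdot)}\left[ \left( y_k - Q(s, a; \theta_k) \right)^2 \right],$$

where $y_k = \mathbb{E}_{s' \sim \mathcal{E}}\left[ r + \gamma \max_{a'} Q(s', a'; \theta_{k-1}) \mid s, a \right]$ is the expected cost-to-go based on the latest update of the weights, and $\rho$ is the behavior distribution [13]. Training the neural network involves finding $\theta_k$, which minimizes the loss functions. Since the algorithm is run online, training is done in conjunction with scheduling. At time $k$, after feedback (reward) is received, one gradient descent step can be performed using the following gradient term, in which the target $y_k$ is held fixed:

$$\nabla_{\theta_k} L_k(\theta_k) = -2\, \mathbb{E}_{s, a \sim \rho(\cdot);\ s' \sim \mathcal{E}}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta_{k-1}) - Q(s, a; \theta_k) \right) \nabla_{\theta_k} Q(s, a; \theta_k) \right]. \tag{11}$$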
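For a single sampled transition, the quantities entering the loss can be sketched as follows. The snippet computes the one-step bootstrapped target and the squared error for one sample; the function names and the `terminal` flag (no bootstrapping at the end of the horizon) are assumptions of this sketch.

```python
def td_target(reward, next_q_values, gamma, terminal=False):
    """One-step target: reward plus discounted best next Q-value."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


def squared_loss(q_value, target):
    """Squared error between the current Q-value and the target."""
    return (target - q_value) ** 2


# Hypothetical sample: rewards are non-positive, since the reward is
# the negative of a loss term.
target = td_target(reward=-1.0, next_q_values=[-1.0, -2.0], gamma=0.5)
loss = squared_loss(q_value=-1.0, target=target)
```

In a full implementation, the next Q-values would come from the network evaluated with the previous weights, and the loss would be averaged over a batch of samples.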
To make the algorithm implementable, we update the weights using samples rather than computing the above expectation exactly. At each time, we pick actions using the $\epsilon$-greedy approach [13]. Specifically, we pick a random action with probability $\epsilon$, and we pick a greedy action with probability $1 - \epsilon$. This $\epsilon$-greedy approach for picking actions induces the behavior distribution $\rho$. In other words, the actions at every stage are picked using the distribution $\rho$. Note that a greedy action at time $k$ is one that maximizes the current Q-factor estimate over all actions at the current state. Initially, it is desirable to explore; hence, $\epsilon$ is initialized to a large value. Once the algorithm has gained some experience, it is better to exploit this experience. To accomplish this, we attenuate $\epsilon$ over time to a small final value.
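The action selection rule can be sketched as follows. The multiplicative decay schedule and its parameters (`rate`, `floor`) are assumptions chosen for illustration; the text only states that the exploration parameter is attenuated over time.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decay(epsilon, rate, floor):
    """Multiplicative attenuation of epsilon, never going below floor."""
    return max(floor, epsilon * rate)


# Hypothetical usage: explore heavily at first, then mostly exploit.
rng = random.Random(0)
epsilon = 1.0
chosen = []
for _ in range(5):
    chosen.append(epsilon_greedy([0.1, 0.7, 0.3], epsilon, rng))
    epsilon = decay(epsilon, rate=0.5, floor=0.1)
```

With `epsilon` equal to zero the rule is purely greedy, and with `epsilon` equal to one it is purely random.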
Although we train our DNN in an online manner, we do not perform a gradient descent step using (11), since it can lead to poor learning. Instead, we store the previous experiences $(s_j, a_j, r_j, s_{j+1})$, $j \leq k$, in an experience replay memory $\mathcal{D}$. When it comes to training the neural network at time $k$, it performs a single minibatch gradient descent step. The minibatch (of gradients) is randomly sampled from the aforementioned experience replay memory $\mathcal{D}$. The idea of using an experience replay memory, to overcome biases and to have a stabilizing effect on algorithms, was introduced in [13].
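An experience replay memory of this kind can be sketched with a bounded queue. The class name, the capacity, and the uniform sampling without replacement are assumptions of this illustration.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity store of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity, seed=None):
        # A bounded deque discards the oldest experience when full.
        self._buffer = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def store(self, state, action, reward, next_state):
        self._buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Uniformly sample a minibatch without replacement."""
        batch_size = min(batch_size, len(self._buffer))
        return self._rng.sample(list(self._buffer), batch_size)

    def __len__(self):
        return len(self._buffer)


# Hypothetical usage: five experiences pushed into a memory of size three.
memory = ReplayMemory(capacity=3, seed=0)
for step in range(5):
    memory.store(state=[float(step)], action=0, reward=-1.0,
                 next_state=[float(step + 1)])
batch = memory.sample(batch_size=2)
```

Sampling at random from the memory breaks the temporal correlation between consecutive experiences, which is the stabilizing effect referred to above.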
DQN for control-aware scheduling
IV Experimental results
Recall that DQN is at the heart of our DeepCAS, which uses a deep neural network to approximate Q-factors. The input to this neural network is the appended error vector. The hidden layer consists of 1024 rectifier units. The output layer is a fully connected linear layer with a single output for each of the actions. The discount factor in our Q-learning algorithm is fixed at . The size of the experience replay buffer is fixed at . The exploration parameter is initialized to , then attenuated to at the rate of . For training the neural network, we use the optimizer ADAM [15] with a learning rate of and a decay of . The control horizon is set to . Note that we used the same set of parameters for all of the experiments presented below.
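The forward pass of such a network can be sketched in plain Python as follows. Only the hidden layer width of 1024 rectifier units and the linear output layer are taken from the description above; the input dimension of 6, the 3 actions, and the weight initialization are assumptions made for illustration.

```python
import random


def init_layer(n_in, n_out, rng):
    """Random weights (scaled Gaussian) and zero biases for one layer."""
    scale = (1.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def affine(layer, x):
    """Compute W x + b for one fully connected layer."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def q_network(hidden, output, state):
    """Hidden rectifier layer followed by a linear output layer."""
    h = [max(0.0, z) for z in affine(hidden, state)]
    return affine(output, h)


# Hypothetical sizes: a 6-dimensional error vector and 3 actions.
rng = random.Random(0)
hidden = init_layer(6, 1024, rng)
output = init_layer(1024, 3, rng)
q_values = q_network(hidden, output, [0.1, -0.2, 0.0, 0.3, 0.0, -0.1])
greedy_action = max(range(len(q_values)), key=lambda a: q_values[a])
```

A practical implementation would use an automatic differentiation library to train the weights; the sketch only shows how one state is mapped to one Q-value per action.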
We conducted three sets of experiments. For the first two sets, we used the reward described in Section III. For the last experiment, we used the total control cost as the reward. The reader is referred to (9) in Section II-D for the control cost associated with each subsystem. Using the full control cost as the reward allows us to discuss the stability of the networked control system; see Section V for details.
IV-A Experiment 1 (N=3, M=1, and T=)
For our first experiment, we used DeepCAS to schedule one channel for three subsystems. We considered three second-order single-input single-output (SISO) subsystems consisting of one stable subsystem (subsystem 2) and two unstable subsystems (subsystems 1 and 3). If there were three channels, then there would be no scheduling problem and the total optimal control loss would be . Since there is only a single channel available, one expects a solution to the scheduling problem to allocate it to subsystems 1 and 3 for a more substantial fraction of the time, as compared to subsystem 2. This expectation is fair since subsystems 1 and 3 are unstable while subsystem 2 is stable. Once trained, DeepCAS indeed allocates the channel, on average, to subsystem 1 for 52% of the time, to subsystem 2 for 12% of the time, and to subsystem 3 for 36% of the time.
We train DeepCAS continuously over many epochs. Each epoch corresponds to a single run of the control problem with horizon $T$. At the start of each epoch, the initial conditions for the control problem are chosen as explained in Section II. The black curve in Fig. 2 illustrates the learning progress in Experiment 1. The abscissa of the graph represents the epoch number while the ordinate represents the average control loss. The plot is obtained by taking the mean of Monte Carlo runs. Since the DQN is randomly initialized, scheduling decisions are poor at the beginning, and the average control loss is high. As learning proceeds, the decisions taken improve. After only epochs, DeepCAS converges to a scheduling strategy with an associated control loss of around .
Traditionally, the problem of scheduling for control systems is solved by using control-theoretic heuristics to find periodic schedules. For Experiment 1, we exhaustively searched the space of all periodic schedules, with periods ranging from to . Using this strategy, we were able to achieve a minimum possible control loss of . In comparison, DeepCAS finds a scheduling strategy with an associated control loss of . In addition to being faster, DeepCAS does not need any system specification and can schedule efficiently for very long control horizons.
IV-B Experiment 2 (N=6, M=3, and T=)
For our second experiment, we train DeepCAS to schedule three channels for a system with six second-order SISO subsystems. If there were six channels, then the total control loss would be . As before, learning is done continuously over many epochs. The red curve in Fig. 2 illustrates the learning progress of DeepCAS in scheduling three channels among six subsystems. The abscissa and ordinate axes are as before. As evidenced in the figure, DeepCAS quickly finds schedules with an associated control loss of around .
We are unable to compare the results of Experiment 2 with any optimal periodic schedules. This is because optimal periodic scheduling strategies do not extend to the system size and control horizon considered here. Further, performing an exhaustive search for periodic schedules is not possible since the number of possibilities is on the order of $\binom{N}{M}^{p}$, where $p$ is the period length.
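The growth of the search space can be illustrated with a short sketch. It assumes that a periodic schedule is a sequence of one channel allocation per time slot of the period, and it counts sequences without removing cyclic shifts or repetitions of shorter periods; the function names are hypothetical.

```python
from itertools import combinations, product
from math import comb


def periodic_schedules(num_subsystems, num_channels, period):
    """Lazily enumerate all allocation sequences of the given period."""
    indices = range(1, num_subsystems + 1)
    actions = list(combinations(indices, num_channels))
    return product(actions, repeat=period)


def count_periodic_schedules(num_subsystems, num_channels, period):
    """Number of allocation sequences of the given period."""
    return comb(num_subsystems, num_channels) ** period


# One channel for three subsystems is still enumerable for short periods.
enumerated = sum(1 for _ in periodic_schedules(3, 1, 2))
# Three channels for six subsystems grows as 20 to the power of the period.
large_count = count_periodic_schedules(6, 3, 10)
```

Even for a period of 10, the six-subsystem case already has more than $10^{13}$ candidate sequences, each of which would have to be evaluated over the full control horizon.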
IV-C Experiment 3 (same setup as Experiment 1 but with the negative one-stage control cost as reward)
The systems considered hitherto have independent subsystems. This facilitates the splitting of the total control cost into two components; see (9). The one-stage reward in our algorithm is the negative of the error due to lack of communication defined in (10). However, in general multi-agent settings, the previously mentioned splitting may not be possible. To show that our results are readily extensible to more general settings, we repeated Experiments 1 and 2 with the negative of the one-stage control cost as the reward. The results of the modified experiments are very similar to the original ones. The learning progress of the modified Experiment 1, with full cost, is given by the green curve in Fig. 2.
V Stability issues
In our framework, the controller and scheduler run in tandem. The control policy, $\pi_c$, is fixed before the scheduler is trained. As a consequence of training, the scheduler finds a scheduling policy $\pi_s$. Thus, the controller-scheduler pair finds a policy tuple $(\pi_c, \pi_s)$. To investigate the stabilizing properties of DeepCAS, we make the following mild assumptions on this policy tuple.

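(A1) The limit $\bar{c} := \lim_{T \to \infty} \frac{1}{T} \sum_{k=0}^{T-1} c_k$ exists, where $c_k = \sum_{i=1}^{N} c^i_k$ is the single-stage control loss and $c^i_k$ is the single-stage loss of subsystem $i$ at time $k$. In other words, we assume that the limit of the average cost sequence exists. This limit may be infinite, i.e., $\bar{c} = \infty$.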

The discount factor used for training is such that , for some . Again, it could be that . In which case, (A2) is trivially satisfied.
In our framework, the controller uses a control policy, $\pi_c$, that solves the average cost control problem. The scheduler learns a scheduling policy, $\pi_s$, to solve the discounted cost problem. Since they run in tandem, the control loss value $c_k$, at any time $k$, depends on both the control and scheduling actions taken at time $k$. Further, we have empirically observed that our scheduler can be successfully trained for all discount factors close to $1$. Before proceeding, consider the following theorem due to Abel:
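Theorem (Abel, [16])
Let $\{a_k\}_{k \geq 0}$ be a sequence of positive real numbers, then

$$\liminf_{T \to \infty} \frac{1}{T} \sum_{k=0}^{T-1} a_k \;\leq\; \liminf_{\gamma \uparrow 1}\, (1-\gamma) \sum_{k=0}^{\infty} \gamma^{k} a_k \;\leq\; \limsup_{\gamma \uparrow 1}\, (1-\gamma) \sum_{k=0}^{\infty} \gamma^{k} a_k \;\leq\; \limsup_{T \to \infty} \frac{1}{T} \sum_{k=0}^{T-1} a_k .$$

It follows from (A1) and Abel’s theorem that

$$\lim_{\gamma \uparrow 1}\, (1-\gamma) \sum_{k=0}^{\infty} \gamma^{k} c_k = \lim_{T \to \infty} \frac{1}{T} \sum_{k=0}^{T-1} c_k ,$$

where $c_k$ denotes the single-stage control loss at time $k$.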
Recall that our scheduler can be successfully trained to solve the discounted cost problem for all discount factors close to (but not equal to) . In other words, given a discount factor , the scheduler finds a policy such that
If we couple this observation with (A2), we get:
for some and . If we choose as the discount factor for our training algorithm, it follows that:
We claim that system stability follows from this set of inequalities. To see this, observe that . Hence, . In other words, the following claim is immediate.
Claim
Under (A1) and (A2), the scheduling algorithm can be successfully trained for discount factors close to , consequently . Further, the policy thus found, stabilizes the system, i.e., .
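The relation between discounted and average costs that underlies this argument can be checked numerically. The sketch below compares the Cesàro mean of a bounded sequence with its normalized discounted sum for discount factors approaching one; the alternating sequence and the truncation length are assumptions made for illustration, and the truncation error is negligible for the values used.

```python
def cesaro_mean(seq):
    """Average cost: arithmetic mean of the sequence."""
    return sum(seq) / len(seq)


def abel_mean(seq, gamma):
    """Normalized discounted cost: (1 - gamma) * sum of gamma^k * a_k."""
    return (1.0 - gamma) * sum(gamma ** k * a for k, a in enumerate(seq))


# Hypothetical stage costs alternating between 1 and 3 (average 2).
costs = [1.0, 3.0] * 10000

average = cesaro_mean(costs)
discounted_far = abel_mean(costs, 0.9)
discounted_near = abel_mean(costs, 0.999)
```

As the discount factor approaches one, the normalized discounted cost approaches the average cost, which is why a bound on the former for discount factors close to one carries over to the latter.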
VI Conclusions
This paper considered the problem of scheduling the sensor-to-controller communication in a networked control system consisting of multiple independent subsystems. To this end, we presented DeepCAS, a reinforcement learning-based control-aware scheduling algorithm. This algorithm is model-free and scalable, and it outperforms scheduling heuristics, such as periodic schedules, tailored for feedback control applications.
References
 [1] P. Park, S. C. Ergen, C. Fischione, C. Lu, and K. H. Johansson, “Wireless network design for control systems: a survey,” IEEE Communications Surveys & Tutorials, vol. 20, no. 2, pp. 978–1013, Secondquarter 2018.
 [2] H. Rehbinder and M. Sanfridson, “Scheduling of a limited communication channel for optimal control,” Automatica, vol. 40, no. 3, pp. 491–500, March 2004.
 [3] D. Hristu-Varsakelis and L. Zhang, “LQG control of networked control systems,” International Journal of Control, vol. 81, no. 8, pp. 1266–1280, 2008.
 [4] L. Shi, P. Cheng, and J. Chen, “Optimal periodic sensor scheduling with limited resources,” IEEE Transactions on Automatic Control, vol. 56, no. 9, pp. 2190–2195, 2011.
 [5] L. Orihuela, A. Barreiro, F. Gómez-Estern, and F. R. Rubio, “Periodicity of Kalman-based scheduled filters,” IEEE Transactions on Automatic Control, vol. 50, no. 10, pp. 2672–2676, 2014.
 [6] M. Zanon, T. Charalambous, H. Wymeersch, and P. Falcone, “Optimal scheduling of downlink communication for a multi-agent system with a central observation post,” IEEE Control Systems Letters, vol. 2, no. 1, pp. 37–42, Jan. 2018.
 [7] T. Charalambous, A. Ozcelikkale, M. Zanon, P. Falcone, and H. Wymeersch, “On the resource allocation problem in wireless networked control systems,” in Proceedings of the IEEE Conference on Decision and Control, 2017.
 [8] W. Heemels, K. H. Johansson, and P. Tabuada, “An introduction to event-triggered and self-triggered control,” in Proceedings of the IEEE Conference on Decision and Control, Dec. 2012.
 [9] C. Ramesh, H. Sandberg, and K. H. Johansson, “Design of state-based schedulers for a network of control loops,” IEEE Transactions on Automatic Control, vol. 58, no. 8, pp. 1962–1975, Aug. 2013.
 [10] A. Molin and S. Hirche, “Price-based adaptive scheduling in multi-loop control systems with resource constraints,” IEEE Transactions on Automatic Control, vol. 59, no. 12, pp. 3282–3295, Dec. 2014.
 [11] E. Henriksson, D. E. Quevedo, H. Sandberg, and K. H. Johansson, “Multiple loop self-triggered model predictive control for network scheduling and control,” IEEE Transactions on Control Systems Technology, vol. 23, no. 6, pp. 2167–2181, 2015.
 [12] B. Demirel, A. S. Leong, V. Gupta, and D. E. Quevedo, “Trade-offs in stochastic event-triggered control,” arXiv:1708.02756, 2017.
 [13] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement learning,” in NIPS Deep Learning Workshop, 2013.
 [14] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Athena Scientific, 1996.
 [15] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proceedings of the International Conference on Learning Representations, 2015.
 [16] O. Hernandez-Lerma and J. B. Lasserre, Discrete-time Markov control processes: basic optimality criteria. Springer Science & Business Media, 2012, vol. 30.