The term autonomous combines the Greek words auto (self) and nomos (law, rule). In system theory, an autonomous system is a self-governing system that does not explicitly depend on the independent variable. If the independent variable is time, such systems are also called time-invariant systems. In control theory, on the other hand, an autonomous controller is self-governing in the sense that it acts independently and relies on neither prior knowledge of the system dynamics nor human intervention. An autonomous controller should be able to learn from what it perceives to compensate for partial or incorrect prior knowledge.
Autonomous control design for a multi-agent system (MAS) has gained significant interest due to applications in a variety of disciplines, from robot swarms to power systems and wireless sensor networks. In a distributed MAS, decisions are made locally using only the agents’ available information. This provides scalability and flexibility and avoids a single point of failure [2, 3, 4, 5]. However, designing autonomous controllers for MASs is challenging and requires learning from experience. Moreover, distributed MASs are prone to cyber-physical attacks. Due to their networked nature, attacks can escalate into disastrous consequences and significantly degrade the performance of the entire network. In a contested environment with adversarial inputs, corrupted data communicated by a single compromised agent can be propagated to the entire network through its neighbors. Autonomous agents then learn from this corrupted data, which misleads the entire network and, consequently, results in either no emergent behavior or an emergent misbehavior. The main bottleneck in deploying successful distributed MASs is designing secure control protocols that can learn about system uncertainties while retaining some level of functionality in the presence of cyber-physical attacks.
Reinforcement learning (RL) [7, 8, 9], inspired by learning mechanisms observed in mammals, has been successfully used to learn optimal solutions online in single-agent systems for both regulation and tracking control problems [10, 11, 12, 13, 14, 15, 16] and recently for MASs [17, 18, 19]. Existing RL-based controllers for leader-follower MASs assume that the leader is passive, i.e., without any control input. In this case, the leader is not able to react to environmental or mission changes by replanning its trajectories. On the other hand, existing active-leader controllers (e.g., [20, 21, 22]) are not autonomous, as they require complete knowledge of the leader and agent dynamics. Moreover, these approaches are generally far from optimal and only take stability into account, which is the bare minimum requirement. Finally, existing RL-based solutions for MASs are not resilient against cyber-physical attacks.
Resilient control protocols for MASs have been designed in the literature [23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35] to mitigate attacks. Most of the existing approaches either use the discrepancy between the states of agents and their neighbors to detect and mitigate attacks, or use an exact model of the agents to predict the expected normal behavior and thus detect an abnormality caused by attacks. However, we will show that a stealthy attack on one agent can cause an emergent misbehavior in the network with no discrepancy between the agents’ states and, therefore, the former approaches cannot mitigate these types of attacks. Moreover, such a discrepancy could be the result of a legitimate change in the state of the leader. Blindly rejecting the information of an agent’s neighbors can harm the network connectivity and the convergence of the network. On the other hand, model-based approaches require complete knowledge of the agent dynamics, which may not be available in many practical applications and precludes the design of autonomous controllers. $H_\infty$ control protocols have also been proposed to attenuate the effect of disturbances in MASs [36, 37, 38]. However, as shown in this paper, standard $H_\infty$ control protocols can be misled and rendered entirely ineffective by a stealthy attack. To the best of our knowledge, the design of an autonomous and resilient controller that does not require any knowledge of the agent dynamics and can survive cyber-physical attacks has not been investigated yet.
This paper presents an autonomous and resilient distributed control protocol for leader-follower MASs with a non-autonomous leader. To alleviate the effects of attacks on the MAS, a distributed observer-based control protocol is first developed to prevent corrupted sensory data caused by attacks on sensors and actuators from propagating across the network. To this end, only the leader communicates its actual sensory information, and the other agents estimate the leader’s state using a distributed observer. To further improve resiliency, distributed $H_\infty$ control protocols are designed to attenuate the effect of the attacks on the compromised agent itself. Non-homogeneous game algebraic Riccati equations (AREs) are derived for solving the optimal $H_\infty$ synchronization problem for each agent. An off-policy RL algorithm is developed to learn the solutions of the non-homogeneous game AREs without requiring complete knowledge of the agent dynamics. To avoid the use of corrupted data coming from compromised neighbors during and after learning, a trust-confidence based control protocol is developed for attacks on communication links and attacks that hijack the entire node. A confidence value is defined for each agent based solely on its local evidence. Then, each agent communicates its confidence value to its neighbors to indicate the trustworthiness of its own information. Moreover, a trust value is defined for each neighbor to determine the significance of the incoming information. The agent incorporates these trust values, along with the confidence values received from neighbors, in its update law to eventually isolate the compromised agent.
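In this section, background on graph theory is provided. A directed graph $\mathcal{G}$ consists of a pair $(\mathcal{V}, \mathcal{E})$, in which $\mathcal{V} = \{v_1, \dots, v_N\}$ is a set of nodes and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is a set of edges. The adjacency matrix is defined as $\mathcal{A} = [a_{ij}]$, with $a_{ij} > 0$ if $(v_j, v_i) \in \mathcal{E}$, and $a_{ij} = 0$ otherwise. The set of nodes $v_j$ with edges incoming to node $v_i$ is called the neighbors of node $v_i$, namely $N_i = \{v_j : (v_j, v_i) \in \mathcal{E}\}$. The graph Laplacian matrix is defined as $L = \mathcal{D} - \mathcal{A}$, where $\mathcal{D} = \mathrm{diag}(d_i)$ is the in-degree matrix, with $d_i = \sum_{j \in N_i} a_{ij}$ as the weighted in-degree of node $v_i$. A (directed) tree is a connected digraph in which every node except the root has in-degree one. A directed graph has a spanning tree if there exists a directed tree that connects all nodes of the graph. A leader can be pinned to multiple nodes, resulting in a diagonal pinning matrix $G = \mathrm{diag}(g_i)$ with pinning gain $g_i > 0$ when node $v_i$ has access to the leader node and $g_i = 0$ otherwise. $\mathbf{1}_N$ is the $N$-vector of ones and $\mathrm{Im}(R)$ denotes the range space of $R$. $\lambda_{\min}(A)$ denotes the minimum eigenvalue of matrix $A$. $\|\cdot\|$ denotes the Euclidean norm for vectors or the induced 2-norm for matrices. The notation $A \otimes B$ denotes the Kronecker product of matrices $A$ and $B$.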
Assumption 1. The communication graph has a spanning tree, and the leader is pinned to at least one root node.
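The graph quantities above can be made concrete with a small sketch. The four-node graph, the function names, and the pinning vector below are illustrative assumptions, not the network used in the paper; the convention `adj[i][j] > 0` meaning that node `i` receives information from node `j` follows the in-neighbor definition above.

```python
def laplacian(adj):
    """Return the in-degree Laplacian D - A for a weighted adjacency matrix.

    adj[i][j] > 0 means node i receives information from node j.
    Self-loops are assumed absent (adj[i][i] == 0).
    """
    n = len(adj)
    return [[(sum(adj[i]) if i == j else 0.0) - adj[i][j] for j in range(n)]
            for i in range(n)]


def reachable_from(adj, root):
    """Set of nodes reachable from `root` along directed edges j -> i."""
    seen, stack = {root}, [root]
    while stack:
        j = stack.pop()
        for i in range(len(adj)):
            if adj[i][j] > 0 and i not in seen:
                seen.add(i)
                stack.append(i)
    return seen


# Hypothetical follower graph: 0 -> 1, 1 -> 2, 0 -> 3 (node 0 is the root).
ADJ = [[0, 0, 0, 0],
       [1, 0, 0, 0],
       [0, 1, 0, 0],
       [1, 0, 0, 0]]
PIN = [1, 0, 0, 0]  # leader pinned to node 0 only (assumed pinning gains)

L = laplacian(ADJ)
# A node is a root of a spanning tree if every node is reachable from it.
roots = [r for r in range(len(ADJ)) if len(reachable_from(ADJ, r)) == len(ADJ)]
# Assumption 1: a spanning tree exists and the leader is pinned to a root.
assumption_1_holds = bool(roots) and any(PIN[r] > 0 for r in roots)
```

Every row of `L` sums to zero by construction, and for this example `assumption_1_holds` evaluates to true because node 0 is both the only root and the pinned node.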
III Standard Synchronization Control Protocols and Their Vulnerability to Attacks
In this section, the standard synchronization control protocol for MASs is reviewed and its vulnerability to attacks is examined. Consider $N$ agents with identical dynamics given by
$$\dot{x}_i = A x_i + B u_i + D \omega_i, \quad i = 1, \dots, N, \tag{1}$$
where $x_i$ and $u_i$ are the state and control input of agent $i$, respectively, $\omega_i$ denotes the attack signal injected into agent $i$, and $A$, $B$, and $D$ are the drift, input, and attack dynamics, respectively.
Assumption 2. The pair $(A, B)$ is stabilizable.
Let the leader dynamics be non-autonomous, i.e., the control input of the leader is a nonzero signal, and be given by
$$\dot{x}_0 = A x_0 + B u_0, \tag{2}$$
where $x_0$ and $u_0$ denote the state and input of the leader, respectively, and $A$ and $B$ are the same as for the other agents.
Assumption 3. The control input is given and bounded, i.e., there exists a positive constant such that .
Define the tracking error for agent as
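Define the local neighborhood tracking error for agent $i$, with $x_i$ and $x_0$ denoting the states of agent $i$ and the leader, as
$$e_i = \sum_{j \in N_i} a_{ij} (x_j - x_i) + g_i (x_0 - x_i), \tag{4}$$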
where is the pinning gain, and for at least one root node . The standard distributed tracking control protocol is then given by 
where and are positive scalar coupling gains, and is a design matrix gain. is a nonlinear function defined for such that
It can be seen that the following condition is required to assure synchronization
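The local neighborhood tracking error (4) can be illustrated numerically. The sketch below assumes the common form in which each agent sums the weighted differences to its in-neighbors and adds a pinned difference to the leader; the graph, states, and function name are hypothetical.

```python
def local_neighborhood_error(adj, pin, x, x0):
    """Assumed form of (4): e_i = sum_j a_ij (x_j - x_i) + g_i (x_0 - x_i)."""
    n, dim = len(x), len(x0)
    errors = []
    for i in range(n):
        e_i = [0.0] * dim
        for k in range(dim):
            for j in range(n):
                e_i[k] += adj[i][j] * (x[j][k] - x[i][k])
            e_i[k] += pin[i] * (x0[k] - x[i][k])
        errors.append(e_i)
    return errors


# Hypothetical graph: 0 -> 1, 1 -> 2, 0 -> 3, leader pinned to node 0.
ADJ = [[0, 0, 0, 0],
       [1, 0, 0, 0],
       [0, 1, 0, 0],
       [1, 0, 0, 0]]
PIN = [1, 0, 0, 0]
X0 = [1.0, 0.0]

# All followers on the leader state: every local error vanishes.
synced = local_neighborhood_error(ADJ, PIN, [X0[:] for _ in range(4)], X0)

# Agents 1 and 2 share a wrong value: agent 2 agrees with its only
# neighbor, so its local error is zero although it is off the leader.
X = [[1.0, 0.0], [2.0, 0.0], [2.0, 0.0], [1.0, 0.0]]
shifted = local_neighborhood_error(ADJ, PIN, X, X0)
```

In the second case only agent 1 sees a nonzero error, which shows that a vanishing local error at one agent says nothing about that agent's distance to the leader; the error only certifies synchronization when it vanishes at every agent.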
Definition 1. In a graph, agent $j$ is reachable from agent $i$ if there exists a directed path of any length from node $i$ to node $j$.
Definition 2. An agent is called a disrupted/compromised agent if it is directly under attack. Otherwise, it is called an intact agent.
It is shown in  for the proof of Theorem 1 that in the absence of attack, if the controller is designed to make the local neighborhood tracking error go to zero for all agents, synchronization is guaranteed. In the following theorem, however, it is shown that in the presence of a specifically designed attack, a zero local neighborhood tracking error (4) does not guarantee synchronization for intact agents that have a path to a compromised agent. Note that the leader is assumed to be a trusted agent with more advanced sensors and higher security. Note also that the leader does not receive any information from other agents, which makes it secure against attacks on other agents and on the communication network.
Lemma 1. Let $G$ be a diagonal matrix with at least one nonzero positive element, and let $L$ be the Laplacian matrix. Then, $L + G$ is a nonsingular M-matrix.
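Lemma 1 can be checked numerically on a small example. The sketch below relies on a standard characterization: a matrix with nonpositive off-diagonal entries is a nonsingular M-matrix exactly when all of its leading principal minors are positive. The matrices and helper names are hypothetical, and the graph is assumed to satisfy Assumption 1.

```python
def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(map(float, row)) for row in m]
    n, d = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d


def is_nonsingular_m_matrix(m):
    """Z-matrix test plus positivity of all leading principal minors."""
    n = len(m)
    off_diag_ok = all(m[i][j] <= 0 for i in range(n) for j in range(n) if i != j)
    minors_ok = all(det([row[:k] for row in m[:k]]) > 0 for k in range(1, n + 1))
    return off_diag_ok and minors_ok


# Laplacian of the hypothetical graph 0 -> 1, 1 -> 2, 0 -> 3.
LAP = [[0, 0, 0, 0],
       [-1, 1, 0, 0],
       [0, -1, 1, 0],
       [-1, 0, 0, 1]]


def add_pinning(lap, pin):
    return [[lap[i][j] + (pin[i] if i == j else 0) for j in range(len(lap))]
            for i in range(len(lap))]


pinned_at_root = is_nonsingular_m_matrix(add_pinning(LAP, [1, 0, 0, 0]))
pinned_at_leaf = is_nonsingular_m_matrix(add_pinning(LAP, [0, 0, 0, 1]))
unpinned = is_nonsingular_m_matrix(LAP)
```

Pinning the root gives a nonsingular M-matrix, whereas the unpinned Laplacian and the leaf-pinned variant remain singular, which is why Assumption 1 asks for the leader to be pinned to a root node.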
Theorem 2. Consider the MAS (1)-(2) with the control protocol (5). Assume that agent is under an attack that is generated by , where the eigenvalues of are a subset of the eigenvalues of the agent dynamics . Then, intact agents that are reachable from agent do not synchronize to the leader, while their local neighborhood tracking error (4) is zero.
Proof. See Appendix A. ∎
Remark 1. $H_\infty$ control is one of the most common and effective approaches to attenuating disturbances. However, Theorem 2 implies that standard $H_\infty$ controllers for MASs that rely on the exchange of relative information can be bypassed by the attacker. Although the goal of $H_\infty$ control is to attenuate the effect of the adversarial input on the local neighborhood tracking error, Theorem 2 shows that the attacker can drive the local neighborhood tracking error to zero while agents are far from synchronization. Therefore, a different control framework and $H_\infty$ controller are presented in this paper that guarantee attenuation of attacks on the sensors and actuators of a compromised agent.
IV The Proposed Attack Mitigation Approach
In this section, the proposed resilient control approach is presented. First, a distributed observer-based $H_\infty$ control protocol is developed that not only prevents attacks on physical components, i.e., attacks on sensors and actuators (we call them Type 1 attacks), from propagating throughout the network but also attenuates their effect on the compromised agent. Then, a trust-confidence based control protocol is developed to identify and isolate neighbors that are compromised by attacks on the communication network or by attacks that take over the control of a compromised agent (we call them Type 2 attacks). Figure 1 shows the structure of the proposed control framework.
IV-A Overall structure of the proposed approach
We now formulate a resilient observer-based $H_\infty$ distributed control protocol for the MAS (1)-(2) in the presence of attacks. In the proposed approach, only the leader communicates its actual sensory information, and followers do not exchange their actual state information. This stops Type 1 attacks from propagating from a compromised agent to others. To this end, the followers estimate the leader’s state using a distributed observer and communicate this estimate to their neighbors to achieve consensus on the leader state.
The distributed observer is designed as
where is defined in (6) and is a revisited local neighborhood observer tracking error for agent defined by
where is the confidence of agent and is the trust value of agent to its neighbor . The confidence and trust values along with the design parameters , and in (8) are designed in subsection B to mitigate Type 2 attacks, i.e., to identify and remove entirely compromised agents or attacks on the communication network, and thus guarantee that for all intact agents, regardless of attacks. To further increase resiliency at the local level and attenuate the effects of Type 1 attacks on the compromised agent itself, the control input in (1) is designed as a function of and in subsection C (see Theorem 4) to guarantee that the following bounded $L_2$-gain condition is satisfied for the agent
with defined as the controlled or performance output and is obtained by
where and represent the discount factor and the attenuation level of the attack , respectively, and the weight matrices and are symmetric positive definite. If condition (10) is satisfied, then the $H_\infty$ norm of , i.e., the transfer function from the attack to the performance output , is less than or equal to . Note also that does not need to be a bounded energy signal because of the discount factor . The problem formulation can now be given as follows.
Problem 1. (Resilient $H_\infty$ Synchronization Problem) Consider agents defined in (1)-(2) with the distributed observer given by (8)-(9). Design the control protocol in (1) along with , , , and in (8)-(9) such that
The bounded $L_2$-gain condition (10) is satisfied when .
The synchronization problem is solved, i.e., , when .
IV-B The proposed distributed observer design
The distributed observer (8)-(9) only communicates the observer state , which cannot be affected by Type 1 attacks on physical components. A trust-confidence mechanism is designed in the following to mitigate Type 2 attacks. To this end, a confidence value is defined for each agent to indicate the trustworthiness of its own observer information. Agents communicate their confidence value with their neighbors to alert them to put less weight on the information they are receiving from them, depending on how low their level of confidence is. This slows down the propagation of Type 2 attacks. If an agent is not confident about its own observer information, it then assigns a trust value to its neighbors and incorporates these trust values along with the confidence values received from neighbors in its update law to determine the significance of the incoming information. Figure 2 shows the block diagram of the proposed distributed monitor.
Note also that it is assumed that the attacker designs its signal based on Theorem 2 to deceive intact agents, so that they cannot monitor any anomaly by examining their local neighborhood tracking error. This is considered the worst attack scenario. If, however, the attacker does not satisfy the conditions of Theorem 2, then, intact agents can easily detect attacks by using Kullback–Leibler divergence criteria to check discrepancy between the normal statistical properties of the local neighborhood tracking error and its actual ones.
IV-B1 Confidence value
A confidence value is defined for each agent which shows the validity of its information. To proceed, define
for agent where is defined in (9). Based on Theorem 2, and, consequently, converges to zero for intact agents. Now, define
for agent . In contrast to , does not converge to zero if agent is in the path of an attacker. This is because requires , which indicates that agent and its neighbors are synchronized and, therefore, are not in the path of an attacker. In the absence of attack, and converge to zero and have the same behavior. Therefore, by comparing and one can detect whether or not the agent is in the path of a compromised agent. The confidence value in (9) for an intact agent is defined as
where is a discount factor used to determine how much we value the current experience relative to past experiences. is a threshold value that accounts for factors other than attacks, e.g., channel fading and disturbances. If agent is not in the path of any compromised agent, is zero almost all the time and, consequently, is almost one. On the other hand, if agent is affected by an attacker, then , and is less than one and its value depends on how close the agent is to the source of the attack. Equation (14) can be implemented by the differential equation . The worst-case scenario is assumed, in which a disrupted agent broadcasts the confidence value 1 to its neighbors to fool them.
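The idea behind the confidence value can be sketched as follows. Because the defining equations are not legible in this text, every formula below is an assumption chosen to match the prose: the first quantity is taken as the norm of the weighted sum of observer differences, the second as the weighted sum of the individual norms, their gap is mapped to a score through a threshold `delta`, and the score is passed through a first-order discounted filter with rate `kappa`. All names and numbers are hypothetical.

```python
import math


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def aggregate_and_individual(weights, z_i, z_neighbors):
    """Return (norm of weighted sum, weighted sum of norms) of the differences."""
    diffs = [[a - b for a, b in zip(z_j, z_i)] for z_j in z_neighbors]
    summed = [sum(w * d[k] for w, d in zip(weights, diffs))
              for k in range(len(z_i))]
    return norm(summed), sum(w * norm(d) for w, d in zip(weights, diffs))


def filtered_confidence(d_i, m_i, kappa=2.0, delta=0.5, dt=0.01, steps=1000, c0=1.0):
    """Euler integration of the assumed filter dC/dt = -kappa*C + kappa*chi."""
    c = c0
    for _ in range(steps):
        chi = delta / (delta + abs(m_i - d_i))  # assumed instantaneous score
        c += dt * kappa * (chi - c)
    return c


# Neighbors agree with the agent: both quantities are zero.
clean = aggregate_and_individual([1.0, 1.0], [3.0], [[3.0], [3.0]])
# Neighbor deviations cancel in the sum but not individually.
masked = aggregate_and_individual([1.0, 1.0], [3.0], [[4.0], [2.0]])

c_clean = filtered_confidence(*clean)
c_masked = filtered_confidence(*masked)
```

In the masked case the aggregate error is zero while the individual differences are not, so the assumed score drops and the filtered confidence settles well below one, which mirrors the comparison of the two quantities described above.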
IV-B2 Trust value
The trust value is defined to determine the importance of the incoming information of each agent’s neighbor. To calculate the trust value of agent to agent , we first measure the difference between the state of agent and the average of the state of all neighbors of agent using
where denotes the average value of the neighbors of agent and is the number of neighbors of agent . The discount factor determines how much we value the current experience of interaction with regard to the past experiences. is a threshold value to take into account factors other than attacks. Equation (16) can be implemented as . Now, we define the trust value of agent to its neighbor given as in (9) as
can also be normalized to satisfy . If there is no attack and the network is also synchronized, then is zero and, consequently, is one. Moreover, when there is no attack and agent receives considerably different values from its neighbors before synchronization, e.g., as a result of a change in the state of the leader, since is close to one as there is no attack, is almost one. On the other hand, if agent is affected by an attack, then is small and the trust of agent to agent depends on .
Assumption 4. The network connectivity is at least , i.e., at least half of the neighbors of each agent are intact.
where is the graph Laplacian matrix and is the diagonal pinning matrix. Then, based on Lemma 1, is a nonsingular M-matrix. The following lemmas are used in the proof of Theorem 3.
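A minimal sketch of how trust and confidence might interact is given below. The actual trust formula is not legible in this text, so the sketch assumes an instantaneous score that decays with a neighbor's distance from the neighborhood average, and a trust value equal to the larger of the agent's own confidence and that score. The discounted filtering mentioned above is omitted, and the function name and the threshold `theta` are hypothetical.

```python
def trust_values(confidence_i, neighbor_estimates, theta=1.0):
    """Assumed trust rule: T_ij = max(C_i, theta / (theta + ||z_j - mean||))."""
    n = len(neighbor_estimates)
    dim = len(neighbor_estimates[0])
    mean = [sum(z[k] for z in neighbor_estimates) / n for k in range(dim)]
    trust = []
    for z in neighbor_estimates:
        deviation = sum((z[k] - mean[k]) ** 2 for k in range(dim)) ** 0.5
        score = theta / (theta + deviation)
        # A confident agent keeps trusting all neighbors; a doubtful one
        # falls back on how far each neighbor is from the average.
        trust.append(max(confidence_i, score))
    return trust


# Two consistent neighbors and one outlier (hypothetical observer values).
estimates = [[1.0], [1.0], [5.0]]
trust_confident = trust_values(1.0, estimates)
trust_doubtful = trust_values(0.2, estimates)
```

With full confidence, all neighbors keep trust one even though their values differ, which corresponds to the legitimate-leader-change case described above. With low confidence, the outlying neighbor receives the smallest trust.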
Lemma 2 . Let and Assumption 2 be satisfied. Then, the solution to the following ARE equation
is positive definite.
See Appendix B. ∎
Remark 2. Note that for Type 1 attacks, the proposed control framework does not impose any constraints on the number of neighbors or the total number of agents under attacks. Note also that in contrast to existing mitigation approaches, we do not discard the neighbor’s information for an agent based solely on the difference between their values. Therefore, when the discrepancy between agents is because of a legitimate change in the leader, the confidence and trust values for each agent become 1 and, consequently, all agents synchronize to the leader.
The following subsection shows how to design a resilient observer-based $H_\infty$ distributed controller. Non-homogeneous game AREs are derived for solving the optimal $H_\infty$ synchronization problem.
IV-C The proposed resilient controller
It was shown in Theorem 3 that for all intact agents regardless of attacks. Similar to , one can show that if in (1) is designed to guarantee , using the separation principle, one can guarantee . Therefore, in the following, the control input is designed to solve Problem 1 with replaced with .
Define the error between the state of agent and its observer as . The dynamic of the error becomes
Define the augmented system state as
Note that is used to compensate the non-homogeneous term in the augmented system (23). With the aid of (10) (while is replaced with ), define the discounted performance function in terms of the augmented system (23) as
The value function for linear systems is quadratic with the form as
and the corresponding Hamiltonian function becomes
Remark 3. It is assumed here that the full state of the agents is available for measurement. If it is not available, the proposed design procedure can be extended to dynamic controllers in which the states of the agents are estimated using a local observer. This is possible because local observers can estimate the agents’ states without any exchange of information with their neighbors. On the other hand, if the entire agent is compromised and its state observer is manipulated, its neighbors detect this and discard its information using the proposed trust-confidence mechanism.
(Non-homogeneous game ARE) The optimal solution for the discounted performance function (26) is
where and are the solution of the following non-homogeneous game ARE
See Appendix C. ∎
See Appendix D. ∎
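For orientation, the homogeneous counterpart of the game ARE used in this subsection has a well-known form. The block below is a reference sketch rather than the paper's equation: it assumes a generic linear system with state $X$, matrices $T$, $B_1$, $D_1$ (hypothetical names), a quadratic value function $V = X^{\top} P X$, and the discounted performance function stated in the first line. It does not include the non-homogeneous term that the augmented system (23) introduces.

```latex
\begin{align*}
  &\dot{X} = T X + B_1 u + D_1 \omega, \qquad
   J = \int_t^{\infty} e^{-\alpha(\tau - t)}
       \left( X^{\top} Q X + u^{\top} R u - \gamma^{2}\, \omega^{\top} \omega \right) d\tau, \\[4pt]
  % Stationarity of the Hamiltonian with V = X^T P X:
  &u^{*} = -R^{-1} B_1^{\top} P X, \qquad
   \omega^{*} = \frac{1}{\gamma^{2}} D_1^{\top} P X, \\[4pt]
  % Substituting u^* and omega^* back into the Hamiltonian:
  &\left(T - \tfrac{\alpha}{2} I\right)^{\top} P + P \left(T - \tfrac{\alpha}{2} I\right)
   + Q - P B_1 R^{-1} B_1^{\top} P + \frac{1}{\gamma^{2}} P D_1 D_1^{\top} P = 0 .
\end{align*}
```

The discount factor $\alpha$ only shifts the drift matrix by $\alpha/2$, and the attack enters with the opposite sign to the control term, which is what makes the equation a zero-sum game ARE rather than a standard one.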
V Model-Free Resilient Off-Policy RL for Solving Optimal Synchronization for Intact Agents
In this section, an RL algorithm is proposed to solve Problem 1 on-line without requiring any knowledge of the agents’ dynamics.
Off-policy RL allows the behavior policy to be separated from the target policy for both the control input and the attack. In order to find the optimal control (29) without knowledge of the system dynamics, the off-policy RL algorithm  is used in this section. The off-policy algorithm has two separate stages. In the first stage, an admissible policy is applied to the system and the system information is recorded over the time interval . Then, in the second stage, without requiring any knowledge of the system dynamics, the information gathered in the first stage is used repeatedly to find a sequence of updated policies and converging to and . To this end, the augmented system dynamics (23) is first written as
In this case, the control policy and worst case attack signal (56) can be written as
Taking time derivative of along the augmented system dynamic (32) yields
The following off-policy integral RL Bellman equation is derived by multiplying both sides of (34) by and integrating
The off-policy RL algorithm, obtained by iterating on (35) to solve the non-homogeneous game ARE, is listed in Algorithm 1.
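The fixed point that such an iteration targets can be illustrated with a model-based stand-in. The sketch below is not Algorithm 1: it uses known scalar dynamics, drops the non-homogeneous term, and alternates policy evaluation with simultaneous updates of the control and attack gains, whereas the off-policy algorithm performs the evaluation from recorded data. All parameter values are hypothetical, and the attenuation level is assumed large enough that `b*b/r - d*d/gamma**2` is positive.

```python
import math


def game_are_closed_form(a, b, d, q, r, gamma, alpha):
    """Positive root of 2*(a - alpha/2)*p + q - (b^2/r - d^2/gamma^2)*p^2 = 0."""
    a_t = a - alpha / 2.0
    s = b * b / r - d * d / gamma ** 2
    return (a_t + math.sqrt(a_t * a_t + q * s)) / s


def policy_iteration(a, b, d, q, r, gamma, alpha, k0, iters=30):
    """Iterate on u = -k*x and w = l*x, starting from a stabilizing gain k0."""
    a_t = a - alpha / 2.0
    k, l, p = k0, 0.0, None
    for _ in range(iters):
        a_c = a_t - b * k + d * l  # discounted closed-loop pole
        assert a_c < 0, "current policy pair is not admissible"
        # Policy evaluation: 2*a_c*p + q + r*k^2 - gamma^2*l^2 = 0.
        p = (q + r * k * k - gamma ** 2 * l * l) / (-2.0 * a_c)
        # Policy improvement for both players.
        k = b * p / r
        l = d * p / gamma ** 2
    return p, k, l


PARAMS = dict(a=1.0, b=1.0, d=1.0, q=1.0, r=1.0, gamma=2.0, alpha=0.2)
p_star = game_are_closed_form(**PARAMS)
p_iter, k_iter, l_iter = policy_iteration(k0=2.0, **PARAMS)
```

For these values the iteration reaches the closed-form solution within a few steps; the simultaneous update coincides with Newton's method on the scalar Riccati equation, which explains the fast convergence once the iterate is close.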
The following theorem shows that using the proposed control framework, the learning mechanism is resilient against both Types 1 and 2 attacks.
Similar to , one can show that the off-policy RL Algorithm 1 solves Problem 1 in an optimal manner, as long as . That is, for intact agents and for a compromised agent the condition (10) is satisfied. This reduces the proof to showing that even if the system is under attack. On the other hand, Theorem 3 shows that regardless of attacks. This completes the proof. ∎
Remark 4. In the proposed Algorithm 1, steps 4 to 9, which compute the trust and confidence values, continue to run even after learning. However, once the optimal gain is found, the learning steps 10-12 are skipped, because the gains required for the control policy have been computed and no further computation is needed unless another learning phase is initiated by a change in the agent dynamics. One might argue that the off-policy Algorithm 1 requires measuring the attack signal, which is restrictive. However, the off-policy algorithm can learn about the worst-case attack signal using actual measurable disturbances (either applied intentionally or coming from nature) instead of measuring attack signals. Once it has learned the worst-case scenario, it can attenuate attacks without measuring them.
VI Simulation Results
In this section, an example is provided to verify the effectiveness of the proposed control protocol. The communication graph is given in Fig. 3.
Consider 5 agents with dynamics as
The leader dynamics is given by
The design parameters are , , and, for all agents. Now, assume that Agent 2 is affected by a Type 1 attack with the attack signal given as
The agents’ states are shown in Fig. 4. It is observed from Fig. 4(a) that when the standard distributed controller (5) is used, all agents synchronize to the leader before the attack. However, after the attack, Agents 4 and 5, which have a path to the compromised Agent 2, do not synchronize to the leader. One can see from Fig. 4(b) that, as stated in Theorem 2, the local neighborhood tracking error (4) converges to zero for all intact agents, but not for the compromised agent.
The performance of the observer-based $H_\infty$ controller (29) in the presence of the Type 1 attack (38) is shown in Fig. 5. One can see that the compromised agent is the only agent that does not follow the leader. Moreover, the $H_\infty$ controller attenuates the effect of the attack on the disrupted agent, which can be seen by comparing the deviation of the compromised agent’s state from its desired value in Figs. 4(a) and 5.
Now, assume that Agent 2 is under a Type 2 attack. In this case, the attack signal (38) is applied to the observer of Agent 2. First, consider the case in which Assumption 4 is not satisfied. The result is shown in Fig. 6. It can be seen from Fig. 6(a) that, using the trust-confidence mechanism, only the compromised agent and its direct neighbor do not synchronize to the leader. The confidence value of the agents is shown in Fig. 6(b). One can see that Agents 4 and 5 are not confident about their own observer information, since they are in the path of the compromised agent. To satisfy Assumption 4, two incoming links, from Agent 5 and Agent 1, are connected to Agent 4. Fig. 7 shows the agents’ outputs and confidence values when Assumption 4 is satisfied. It can be seen from Fig. 7(a) that only the compromised agent does not synchronize to the leader and all intact agents converge to the leader. The confidence value of the agents is shown in Fig. 7(b). One can see that Agent 4 is not confident about its own observer information, since it is the only immediate neighbor of the compromised agent.
A resilient autonomous control framework is proposed for a leader-follower MAS with an active leader. It is first shown that existing standard synchronization control protocols are prone to attacks. Then, a resilient learning-based control protocol is presented to find optimal solutions to the synchronization problem in the presence of attacks and system dynamic uncertainties. A distributed observer-based $H_\infty$ controller is first designed to prevent the effects of attacks on sensors and actuators from propagating throughout the network, as well as to attenuate the effect of these attacks on the compromised agent itself. Non-homogeneous game algebraic Riccati equations are derived to solve the $H_\infty$ optimal synchronization problem. Off-policy reinforcement learning is utilized to learn their solution without requiring any knowledge of the agent dynamics. Then, a trust-confidence based distributed control protocol is proposed to mitigate attacks that hijack the entire node and attacks on communication links. It is shown that the proposed RL-based $H_\infty$ control protocol is resilient against attacks.
Appendix A Proof of Theorem 2
Let be the graph Laplacian matrix of the entire network, in which the leader is the only root node and followers are non-root nodes. Then, it can be partitioned as
where denotes a vector whose -th element is nonzero and indicates that the follower is connected to the leader. indicates the interaction between the leader and the followers. Without loss of generality, assume . Using (6), the global dynamic of the MAS (1) under attack with the control input (5) in terms of the Laplacian matrix (39) and after some manipulations, one has
where and , and in indicates that the leader is a trusted node and is not under attack. It can be seen that agents reach a steady state, i.e., , if the last terms of (40) tend to zero, i.e., , where or . Otherwise, since the attack signal has common eigenvalues with the agent dynamics, the agents’ states go to infinity. In the latter case, the local neighborhood tracking error goes to zero as . To prove the former case, we first show that , if the attack signal is designed as given in the statement of Theorem 2. Note that , if there exists a nonzero vector such that
Based on Assumption 1, the followers have at least one incoming link from the leader. On the other hand, captures the interaction among all followers, as well as the incoming link from the leader. The former is a positive semi-definite Laplacian matrix and the latter is a diagonal matrix with at least one nonzero positive element added to it. Therefore, as stated in Lemma 1, is nonsingular and, thus, the solution to (42) becomes
Since the eigenvalues of the attack signal are assumed to be a subset of the eigenvalues of the agent dynamics , for every there exists a nonzero vector such that (41) holds. Therefore, . Now, using (39), the global form of the state neighborhood tracking error (4) can be written as
Since (41) is satisfied, one has
Equation (46) implies that the local neighborhood tracking error is zero, i.e., , for intact agents, that is, for agents that are not directly under attack, i.e., . This completes the proof.
Appendix B Proof of Theorem 3
Type 1 attacks cannot affect the observer state, since the observer cannot be physically affected by an attacker. For Type 2 attacks, on the other hand, based on Assumption 4, the total number of compromised agents is assumed to be less than half of the network connectivity, i.e., . Therefore, even if neighbors of an intact agent are attacked and collude to send the same value to misguide it, there still exist intact neighbors that communicate values different from the compromised ones. Thus, for some and, therefore, although in (12) is zero based on Theorem 2, in (13) is nonzero and, consequently, the agent’s confidence value in (14) decreases and the attack is detected. Moreover, since at least half of its neighbors are intact, the agent can update its trust values to remove the compromised neighbors. On the other hand, the entire network is still connected to the agent under attack and, therefore, the graph restricted to the intact agents remains connected. Therefore, there exists a spanning tree in the graph associated with all intact agents. Let be the graph of the remaining intact agents with as its weights. Since has a spanning tree, based on the above discussion, the inequality defined in Lemma 2 is still satisfied.