A Game-Theoretic Approach to Design Secure and Resilient Distributed Support Vector Machines

02/07/2018 ∙ by Rui Zhang, et al.

Distributed Support Vector Machines (DSVM) have been developed to solve large-scale classification problems in networked systems with a large number of sensors and control units. However, such systems become more vulnerable as detection and defense become increasingly difficult and expensive. This work aims to develop secure and resilient DSVM algorithms under adversarial environments in which an attacker can manipulate the training data to achieve his objective. We establish a game-theoretic framework to capture the conflicting interests between an adversary and a set of distributed data processing units. The Nash equilibrium of the game allows us to predict the outcome of learning algorithms in adversarial environments and to enhance the resilience of machine learning through dynamic distributed learning algorithms. We prove that the convergence of the distributed algorithm is guaranteed without assumptions on the training data or network topologies. Numerical experiments are conducted to corroborate the results. We show that network topology plays an important role in the security of DSVM: networks with fewer nodes and higher average degrees are more secure, and a balanced network is found to be less vulnerable to attacks.


I Introduction

Support Vector Machines (SVMs) [2] have been widely used for classification and prediction tasks, such as spam detection [3], face recognition [4], and temperature prediction [5]. They are supervised learning algorithms that make predictions or detections by training on samples with known labels. However, like many other machine learning algorithms, SVMs are vulnerable to adversaries who can exploit the systems [6]. For example, an SVM-based spam filter will misclassify spam emails after being trained on corrupted data intentionally crafted by an attacker [7, 8, 9]. Moreover, an SVM-based face recognition system may grant authentication to fake images created by an attacker [10].

Traditional SVMs are learning algorithms that require centralized data collection, communication, and storage from multiple sensors [11]. The centralized nature of SVMs requires a significant amount of computation for large-scale problems and makes SVMs unsuitable for online information fusion and processing. Although various solutions have been introduced to address this challenge, e.g., see [12] and [13], they have not changed the centralized nature of the SVM algorithm and its architecture.

Distributed Support Vector Machines (DSVM) algorithms are decentralized SVMs in which multiple nodes or agents process data independently and communicate training information over a network; see, for example, [14, 15]. This architecture is attractive for solving large-scale machine learning problems since each node learns from its own data in parallel and transfers only its learning results to other nodes, achieving the same global performance as a centralized algorithm. In addition, DSVM algorithms do not require a fusion center to store all the data. Each node performs its local computation without sharing the content of its data with other nodes, which effectively reduces the cost of memory and the overhead of data communications.

Despite the scalability and efficiency of DSVM, the decentralized training system is more vulnerable than its centralized counterpart [16, 17]. DSVM has an increased attack surface since each node in the network can be vulnerable to attacks. An attacker can not only select a few nodes and compromise their individual learning processes [18], but also send misinformation to other nodes to affect the performance of the entire DSVM network [19]. In addition, in large-scale problems, it is not always possible to protect a large number of nodes at the same time [20]. Hence there will always exist vulnerabilities, and an attacker can find the weakest links or nodes to compromise.

As a result, it is important to study the security of DSVM under adversarial environments. In this work, we focus on a class of consensus-based DSVM algorithms [21], in which each node in the network updates its training result based on its own training data and the results from its neighboring nodes. Nodes achieve the global training result once they reach consensus. A single compromised node therefore plays a significant role: it affects not only its own training result but can also spread misinformation to the entire network.

Machine learning algorithms are inherently vulnerable, as they are often open-source tools or methods, and security is not the primary concern of their designers. An attacker can easily acquire information about the DSVM algorithms and the associated network topologies. With this knowledge, an attacker can launch a variety of attacks, for example, manipulating the labels of the training samples [22] or changing the testing data [23]. In this work, we consider a class of attacks in which the attacker has the ability to modify the training data. An example of this has been described in [24], where an adversary modifies the training data so that the learner is misled into producing a prediction model profitable to the adversary. This type of attack is challenging for the learner, since it is hard to detect data modifications during the training process [25]. We further characterize the attacker by his goal, knowledge, and capability.

  • The Goal of the Attacker: The attacker aims to damage the training process of the DSVM learner and increase the learner's classification errors.

  • The Knowledge of the Attacker: To fully capture the damage the attacker can cause, we assume that the attacker has complete knowledge of the learner, i.e., the attacker knows the learner's data, the algorithm, and the network topology. This assumption represents a worst-case scenario, in line with Kerckhoffs's principle: the enemy knows the system [26].

  • The Capability of the Attacker: The attacker can modify the training data by subtracting crafted values from it, thereby damaging the training process of the DSVM learner.

One major goal of this work is to develop a quantitative framework to address this critical issue. In the adversarial environment, the goal of the learner is to minimize global classification errors in the network, while the attacker disrupts the training process with the aim of maximizing those classification errors by modifying the training data. This conflict of interests enables us to establish a nonzero-sum game framework to capture the competition between the learner and the attacker. The Nash equilibrium of the game enables the prediction of the outcome and yields optimal response strategies to adversarial behaviors. The game framework also provides a theoretical basis for developing dynamic learning algorithms that enhance the security and resilience of DSVM. The major contributions of this work can be summarized as follows:

  1. We capture the attacker’s objective and constrained capabilities in a game-theoretic framework and develop a nonzero-sum game to model the strategic interactions between an attacker and a learner with a distributed set of nodes.

  2. We fully characterize the Nash equilibrium by showing the strategic equivalence between the original nonzero-sum game and a zero-sum game.

  3. We develop secure and resilient distributed algorithms based on the alternating direction method of multipliers (ADMoM) [27]. Each node communicates with its neighboring nodes and updates its decision strategically in response to adversarial environments.

  4. We prove the convergence of the DSVM algorithm. The convergence is guaranteed without any assumptions on the network topology or the form of data.

  5. We demonstrate that network topology plays an important role in resilience to adversary behaviors. Networks with fewer nodes and higher average degrees are shown to be more secure. We also show that a balanced network (i.e., each node has the same number of neighbors) is less vulnerable.

  6. We show that, for a given network, nodes with more training samples and fewer neighbors turn out to be more secure. One way to defend against the attacker's actions is to add more training samples, although this may increase the training time and require more memory for storage.

I-A Related Works

A general tool for studying machine learning under adversarial environments is game theory [28, 29, 30]. In [28], Dalvi et al. have formulated a game between a cost-sensitive Bayes classifier and a cost-sensitive adversary. In [29], Kantarcıoğlu et al. have introduced Stackelberg games to model the interactions between the adversary and the learner, showing that the game can reach a steady state where the actions of both players stabilize. In [30], Rota Bulò et al. have presented a game-theoretic formulation in which a learner and an attacker make randomized strategy selections. The major focus of these works is on developing centralized machine learning tools. In our work, we extend the security framework of machine learning algorithms to a distributed framework for networks. Hence, the performance of distributed machine learning algorithms is also tied to the security of the network.

Game theory has also been widely used in network security [31, 32, 33, 34, 35, 36, 37, 38]. In [31], Lye et al. have analyzed the interactions between an attacker and an administrator as a two-player stochastic game on a network. In [32], Michiardi et al. have presented a game-theoretic model in ad hoc networks to capture the interactions between normal nodes and misbehaving nodes. However, when solving distributed machine learning problems, the features and properties of data processing at each node can cause unanticipated consequences in the network.

In our previous work [1], we have established a preliminary framework to model the interactions between a consensus-based DSVM learner and an attacker. In this paper, we develop fully distributed algorithms and investigate their convergence, security, and resilience properties. Moreover, new sets of experiments are performed to show the influence of network topologies and of the number of samples at each node on the resilience of the network.

I-B Organization of the Paper

The rest of this paper is organized as follows. Section II outlines the design of distributed support vector machines. In Section III, we establish game-theoretic models for the learner and the attacker. Section IV deals with the distributed and dynamic algorithms for the learner and the attacker. Section V presents the convergence proof of the algorithm. Sections VI and VII present numerical results and concluding remarks, respectively. Appendices A, B, and C provide the proofs of Proposition 1, Proposition 2, and Lemma 1, respectively.

I-C Summary of Notations

Notations in this paper are summarized as follows. Boldface letters are used for matrices (column vectors); $(\cdot)^T$ denotes matrix and vector transposition; $(\cdot)^{(t)}$ denotes values at step $t$; $[\cdot]_{vu}$ denotes the $vu$-th entry of a matrix; $\mathrm{diag}(\mathbf{a})$ is the diagonal matrix with $\mathbf{a}$ on its main diagonal; $\|\cdot\|$ is the norm of a matrix or vector; $\mathcal{V}$ denotes the set of nodes in a network; $\mathcal{B}_v$ denotes the set of neighboring nodes of node $v$; $\mathcal{U}_v$ denotes the action set used by the attacker at node $v$.

II Preliminaries

In this section, we present a two-player machine learning game in a distributed network involving a learner and an attacker to capture the strategic interactions between them. The network is modeled by an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, with $\mathcal{V}$ representing the set of nodes and $\mathcal{E}$ representing the set of links between nodes. Node $v \in \mathcal{V}$ communicates only with its neighboring nodes $\mathcal{B}_v \subseteq \mathcal{V}$. Note that, without loss of generality, graph $\mathcal{G}$ is assumed to be connected; in other words, any two nodes in $\mathcal{G}$ are connected by a path. However, nodes in $\mathcal{G}$ do not have to be fully connected, which means that nodes are not required to connect directly to all the other nodes in the network; the network can also contain cycles. At every node $v \in \mathcal{V}$, a labelled training set $\mathcal{D}_v := \{(\mathbf{x}_{vn}, y_{vn}) : n = 1, \dots, N_v\}$ of size $N_v$ is available, where $\mathbf{x}_{vn} \in \mathbb{R}^p$ represents a $p$-dimensional data point, and the points are divided into two groups with labels $y_{vn} \in \{+1, -1\}$. An example of a network of distributed nodes is illustrated in Fig. 1(a).

(a) Network example.
(b) SVM at a compromised node.
Fig. 1: Network example: the network in (a) consists of distributed nodes, each holding a labelled training set $\mathcal{D}_v$, and each node communicates only with its neighbors. An attacker can take over a subset of nodes, marked in red. At each node, the learner aims to find the best linear discriminant, for example, the black dotted line in (b). At compromised nodes, the attacker modifies the training data, which misleads the learner to a wrong discriminant, for example, the black solid line in (b).
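To make this setup concrete, the following Python sketch builds a toy network and per-node labelled datasets in the spirit of Fig. 1. The ring topology, the Gaussian clusters, and all dimensions are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4-node ring network; neighbors[v] plays the role of B_v.
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}

# Labelled training set D_v at every node: N_v points in R^p, drawn as
# two Gaussian clusters with labels +1 / -1 (purely synthetic).
p, N_v = 2, 40
data = {}
for v in neighbors:
    X = np.vstack([rng.normal(+1.0, 0.8, size=(N_v // 2, p)),
                   rng.normal(-1.0, 0.8, size=(N_v // 2, p))])
    y = np.concatenate([np.ones(N_v // 2), -np.ones(N_v // 2)])
    data[v] = (X, y)
```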

The goal of the learner is to design DSVM algorithms for each node $v$ in the network based on its local training data $\mathcal{D}_v$, so that each node has the ability to assign a new input $\mathbf{x}$ a label of $+1$ or $-1$ without communicating with other nodes. To achieve this, the learner aims to find local maximum-margin linear discriminant functions $g_v(\mathbf{x}) = \mathbf{w}_v^T \mathbf{x} + b_v$ at every node $v$, with consensus constraints forcing all the local variables to agree across neighboring nodes. The variables $\mathbf{w}_v$ and $b_v$ of the local discriminant functions can be obtained by solving the following convex optimization problem [21]:

$$\begin{aligned} \min_{\{\mathbf{w}_v, b_v\}} \;\; & \frac{1}{2}\sum_{v\in\mathcal{V}}\|\mathbf{w}_v\|^2 + C_l \sum_{v\in\mathcal{V}}\sum_{n=1}^{N_v}\left[1 - y_{vn}\left(\mathbf{w}_v^T\mathbf{x}_{vn} + b_v\right)\right]_+ \quad (1)\\ \text{s.t.}\;\; & \mathbf{w}_v = \mathbf{w}_u, \; b_v = b_u, \quad \forall v\in\mathcal{V}, \; u\in\mathcal{B}_v. \end{aligned}$$

In the above problem, the term $[1 - y_{vn}(\mathbf{w}_v^T\mathbf{x}_{vn} + b_v)]_+$ is the hinge loss function. It can also be written as a slack variable $\xi_{vn}$ with the constraints $y_{vn}(\mathbf{w}_v^T\mathbf{x}_{vn} + b_v) \ge 1 - \xi_{vn}$ and $\xi_{vn} \ge 0$, where the slack variables $\xi_{vn}$ account for non-linearly separable training sets. $C_l$ is a tunable positive scalar for the learner.
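For readers who want to experiment, Problem (1) can be prototyped directly with a convex-optimization toolbox. The sketch below states the hinge-loss objective and the consensus constraints in CVXPY and solves them as a single centralized program for reference, reusing the toy network and data from the sketch above; the value C_l = 1.0 is an arbitrary choice.

```python
import cvxpy as cp

# Problem (1): hinge-loss objective with consensus constraints, solved
# centrally for reference. Reuses neighbors/data/p from the sketch above.
C_l = 1.0
w = {v: cp.Variable(p) for v in neighbors}
b = {v: cp.Variable() for v in neighbors}

objective = 0
for v in neighbors:
    X, y = data[v]
    margins = cp.multiply(y, X @ w[v] + b[v])  # y_vn (w_v^T x_vn + b_v)
    objective += 0.5 * cp.sum_squares(w[v]) + C_l * cp.sum(cp.pos(1 - margins))

consensus = [c for v in neighbors for u in neighbors[v]
             for c in (w[v] == w[u], b[v] == b[u])]
cp.Problem(cp.Minimize(objective), consensus).solve()
print({v: np.round(w[v].value, 3) for v in neighbors})
```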

III Distributed Support Vector Machines with Adversary

Optimization Problem (1) is formulated by the DSVM learner, who seeks the maximum-margin linear discriminant function. We assume that an attacker has complete knowledge of the learner's Problem (1) and can modify the value $\mathbf{x}_{vn}$ at node $v$ into $\hat{\mathbf{x}}_{vn} := \mathbf{x}_{vn} - \boldsymbol{\delta}_{vn}$, where $\boldsymbol{\delta}_{vn} \in \mathcal{U}_v$, and $\mathcal{U}_v$ is the attacker's action set at node $v$. We use $\mathcal{V}_2$ and $\mathcal{V}_1$ to represent the sets of nodes with and without the attacker, respectively. Note that $\mathcal{V}_1 \cup \mathcal{V}_2 = \mathcal{V}$ and $\mathcal{V}_1 \cap \mathcal{V}_2 = \emptyset$: a node in the network is either under attack or not under attack. The behavior of the learner can be captured by the following optimization problem:

$$\begin{aligned} \min_{\{\mathbf{w}_v, b_v\}} \;\; & \frac{1}{2}\sum_{v\in\mathcal{V}}\|\mathbf{w}_v\|^2 + C_l \sum_{v\in\mathcal{V}_1}\sum_{n=1}^{N_v}\left[1 - y_{vn}\left(\mathbf{w}_v^T\mathbf{x}_{vn} + b_v\right)\right]_+ \\ & + C_l \sum_{v\in\mathcal{V}_2}\sum_{n=1}^{N_v}\left[1 - y_{vn}\left(\mathbf{w}_v^T\hat{\mathbf{x}}_{vn} + b_v\right)\right]_+ \quad (2)\\ \text{s.t.}\;\; & \mathbf{w}_v = \mathbf{w}_u, \; b_v = b_u, \quad \forall v\in\mathcal{V}, \; u\in\mathcal{B}_v. \end{aligned}$$

For the learner, the learning process consists of finding the discriminant function that separates the training data into two classes with the fewest errors, and then using that discriminant function to classify testing data. Since the attacker has the ability to change the original data $\mathbf{x}_{vn}$ into $\hat{\mathbf{x}}_{vn}$, the learner ends up finding a discriminant function that separates the modified data $\hat{\mathbf{x}}_{vn}$ more accurately than the original data $\mathbf{x}_{vn}$. As a result, when this discriminant function is used to classify unmodified testing data, the data are prone to misclassification.

By minimizing the objective function in Problem (2), the learner obtains the optimal variables $(\mathbf{w}_v, b_v)$, which are used to build the discriminant function for classifying testing data. The attacker, on the other hand, aims to find an optimal way to modify the data using the variables $\boldsymbol{\delta}_{vn}$ so as to maximize the classification error of the learner. The behavior of the attacker can thus be captured as follows:

$$\begin{aligned} \max_{\{\boldsymbol{\delta}_{vn}\in\mathcal{U}_v\}} \;\; & \frac{1}{2}\sum_{v\in\mathcal{V}}\|\mathbf{w}_v\|^2 + C_l \sum_{v\in\mathcal{V}_1}\sum_{n=1}^{N_v}\left[1 - y_{vn}\left(\mathbf{w}_v^T\mathbf{x}_{vn} + b_v\right)\right]_+ \\ & + C_l \sum_{v\in\mathcal{V}_2}\sum_{n=1}^{N_v}\left[1 - y_{vn}\left(\mathbf{w}_v^T(\mathbf{x}_{vn} - \boldsymbol{\delta}_{vn}) + b_v\right)\right]_+ - C_a \sum_{v\in\mathcal{V}_2}\sum_{n=1}^{N_v}\|\boldsymbol{\delta}_{vn}\|_0 \quad (3) \end{aligned}$$

In the above problem, the term $C_a \sum_{v\in\mathcal{V}_2}\sum_{n=1}^{N_v}\|\boldsymbol{\delta}_{vn}\|_0$ represents the cost function of the attacker. The $\ell_0$ norm is defined as $\|\boldsymbol{\delta}\|_0 := |\{i : \delta_i \neq 0\}|$, i.e., the total number of nonzero elements in a vector. Here, we use the $\ell_0$ norm to count the number of elements that are changed by the attacker. The objective function with the $\ell_0$ norm captures the fact that the attacker aims to make the largest impact on the learner by changing the smallest number of elements. $\mathcal{U}_v$ denotes the action set of the attacker at node $v$. We use the following form of $\mathcal{U}_v$:

$$\mathcal{U}_v := \left\{ \{\boldsymbol{\delta}_{vn}\}_{n=1}^{N_v} : \sum_{n=1}^{N_v}\|\boldsymbol{\delta}_{vn}\| \le \varepsilon_v \right\},$$

which is related to the atomic action set $\mathcal{U}_{v,0}$. The bound $\varepsilon_v$ limits the sum of the norms of all the changes at node $v$. A higher $\varepsilon_v$ indicates that the attacker has a larger degree of freedom in changing the values $\mathbf{x}_{vn}$; training on these data thus leads to a higher risk for the learner. Notice that $\varepsilon_v$ can vary across nodes, and we write $\varepsilon$ to represent the situation in which $\varepsilon_v$ is equal at every node. An element $\boldsymbol{\delta}_{vn}$ from the atomic action set $\mathcal{U}_{v,0}$ has the same form as above, but each $\boldsymbol{\delta}_{vn}$ is bounded individually by the same $\varepsilon_v$. Furthermore, the atomic action set has the following properties.

The first property (P1) states that the attacker can choose not to change the value of $\mathbf{x}_{vn}$, i.e., the zero perturbation is always feasible. Property (P2) states that the atomic action set is bounded and symmetric. Here, "bounded" means that the attacker is limited in its capability of changing $\mathbf{x}_{vn}$, which is reasonable since changing a value significantly would make detection by the learner evident.
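As a concrete reading of these properties, the sketch below scales a candidate perturbation back into a bounded action set. The per-sample Euclidean norm and the budget name eps_v are assumptions made for illustration.

```python
# (P1): the zero perturbation is always feasible; (P2): the set is bounded
# and symmetric (if delta is feasible, so is -delta). delta has one row per
# training sample; eps_v bounds the sum of per-sample norms.
def project_to_action_set(delta, eps_v):
    total = np.sum(np.linalg.norm(delta, axis=1))
    if total <= eps_v:
        return delta
    return delta * (eps_v / total)  # radial scaling preserves symmetry
```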

Problem (2) and Problem (3) constitute a two-person nonzero-sum game between an attacker and a learner. The solution to the game is often described by a Nash equilibrium, which yields the equilibrium strategies of both players and predicts the outcome of machine learning in the adversarial environment. Comparing Problem (2) with Problem (3), we notice that the first three terms of the objective function in Problem (3) are the same as the objective function in Problem (2). The last term of the objective function in Problem (3) does not depend on the decision variables of the learner in Problem (2), and thus it can be treated as a constant by the learner. Moreover, the constraints in Problems (2) and (3) are uncoupled. As a result, the nonzero-sum game can be reformulated into a strategically equivalent zero-sum game, which takes the following minimax (or max-min) form:

$$\begin{aligned} \min_{\{\mathbf{w}_v, b_v\}} \max_{\{\boldsymbol{\delta}_{vn}\}} \;\; K := \; & \frac{1}{2}\sum_{v\in\mathcal{V}}\|\mathbf{w}_v\|^2 + C_l \sum_{v\in\mathcal{V}_1}\sum_{n=1}^{N_v}\left[1 - y_{vn}\left(\mathbf{w}_v^T\mathbf{x}_{vn} + b_v\right)\right]_+ \\ & + C_l \sum_{v\in\mathcal{V}_2}\sum_{n=1}^{N_v}\left[1 - y_{vn}\left(\mathbf{w}_v^T(\mathbf{x}_{vn} - \boldsymbol{\delta}_{vn}) + b_v\right)\right]_+ - C_a \sum_{v\in\mathcal{V}_2}\sum_{n=1}^{N_v}\|\boldsymbol{\delta}_{vn}\|_0 \quad (4)\\ \text{s.t.}\;\; & \mathbf{w}_v = \mathbf{w}_u, \; b_v = b_u, \quad \forall v\in\mathcal{V}, \; u\in\mathcal{B}_v, \quad (4a)\\ & \boldsymbol{\delta}_{vn} \in \mathcal{U}_v, \quad \forall v\in\mathcal{V}_2. \quad (4b) \end{aligned}$$

Note that there are two sets of constraints: (4a) only contributes to the minimization part of the problem, while (4b) only affects the maximization part. The first term of the objective function $K$ is inversely related to the width of the margin. The second term is the error penalty of nodes without the attacker. The third term is the error penalty of nodes with the attacker, and the last term is the cost function of the attacker. On the one hand, minimizing the objective function captures the learner's trade-off between a larger margin and a smaller error penalty; on the other hand, maximizing the objective function captures the attacker's trade-off between a larger error penalty for the learner and a smaller cost of attack. As a result, solving Problem (4) can be understood as finding the saddle point of the zero-sum game between the attacker and the learner.

Definition 1.

Let $\mathcal{W}\times\mathcal{B}$ and $\mathcal{U} := \prod_{v\in\mathcal{V}_2}\mathcal{U}_v$ be the action sets for the DSVM learner and the attacker, respectively. Then, the strategy pair $(\{\mathbf{w}_v^*, b_v^*\}, \{\boldsymbol{\delta}_{vn}^*\})$ is a saddle-point solution of the zero-sum game defined by the triple $G_z := \langle \{\text{Learner, Attacker}\}, \{\mathcal{W}\times\mathcal{B}, \mathcal{U}\}, K \rangle$ if, for all $\{\mathbf{w}_v, b_v\} \in \mathcal{W}\times\mathcal{B}$ and $\{\boldsymbol{\delta}_{vn}\} \in \mathcal{U}$,

$$K\left(\{\mathbf{w}_v^*, b_v^*\}, \{\boldsymbol{\delta}_{vn}\}\right) \le K\left(\{\mathbf{w}_v^*, b_v^*\}, \{\boldsymbol{\delta}_{vn}^*\}\right) \le K\left(\{\mathbf{w}_v, b_v\}, \{\boldsymbol{\delta}_{vn}^*\}\right),$$

where $K$ is the objective function from Problem (4).

Based on the properties of the action set and the atomic action set, Problem (4) can be further simplified, as stated in the following proposition.

Proposition 1.

Assume that $\mathcal{U}_v$ is an action set with corresponding atomic action set $\mathcal{U}_{v,0}$. Then, Problem (4) is equivalent to the following optimization problem:

$$\begin{aligned} \min_{\{\mathbf{w}_v, b_v\}} \max_{\{\boldsymbol{\delta}_v \in \mathcal{U}_v\}} \;\; & \frac{1}{2}\sum_{v\in\mathcal{V}}\|\mathbf{w}_v\|^2 + C_l \sum_{v\in\mathcal{V}}\sum_{n=1}^{N_v}\left[1 - y_{vn}\left(\mathbf{w}_v^T\mathbf{x}_{vn} + b_v\right)\right]_+ \\ & + C_l \sum_{v\in\mathcal{V}_2}\mathbf{w}_v^T\boldsymbol{\delta}_v - C_a \sum_{v\in\mathcal{V}_2}\|\boldsymbol{\delta}_v\|_0 \quad (5)\\ \text{s.t.}\;\; & \mathbf{w}_v = \mathbf{w}_u, \; b_v = b_u, \quad \forall v\in\mathcal{V}, \; u\in\mathcal{B}_v. \end{aligned}$$
Proof.

See Appendix A. ∎

In Problem (4), the third term of the function $K$ is the sum of hinge loss functions over the nodes under attack; this term is affected by the decision variables of both players. Problem (5), however, transforms it into hinge loss functions free of the attacker's action plus a coupled multiplication of $\mathbf{w}_v$ and $\boldsymbol{\delta}_v$. Notice that $\boldsymbol{\delta}_v$ here can be seen as the combination of all the $\boldsymbol{\delta}_{vn}$ at node $v$. In this way, the only coupled term is $\mathbf{w}_v^T\boldsymbol{\delta}_v$, which is linear in the decision variables of the attacker and the learner, respectively.

IV ADMoM-DSVM and Distributed Algorithm

In the previous section, we combined Problem (2) for the learner and Problem (3) for the attacker into the single minimax Problem (4) and showed its equivalence to Problem (5). In this section, we develop iterative algorithms to find equilibrium solutions of Problem (5).

Firstly, we define $\mathbf{r}_v := [\mathbf{w}_v^T, b_v]^T$, the augmented matrix $\mathbf{X}_v := [[\mathbf{x}_{v1}, \dots, \mathbf{x}_{vN_v}]^T, \mathbf{1}_{N_v}]$, the diagonal label matrix $\mathbf{Y}_v := \mathrm{diag}([y_{v1}, \dots, y_{vN_v}])$, and the vector of slack variables $\boldsymbol{\xi}_v := [\xi_{v1}, \dots, \xi_{vN_v}]^T$. With these definitions, it follows readily that $\mathbf{w}_v = \boldsymbol{\Pi}\mathbf{r}_v$, where $\boldsymbol{\Pi}$ is a $p \times (p+1)$ matrix whose first $p$ columns form an identity matrix and whose last column is a zero vector. We also relax the $\ell_0$ norm to the $\ell_1$ norm to represent the cost function of the attacker. Thus, Problem (5) can be rewritten as

$$\begin{aligned} \min_{\{\mathbf{r}_v, \boldsymbol{\xi}_v, \boldsymbol{\omega}_{vu}\}} \max_{\{\boldsymbol{\delta}_v\}} \;\; & \frac{1}{2}\sum_{v\in\mathcal{V}}\mathbf{r}_v^T\boldsymbol{\Pi}_{p+1}\mathbf{r}_v + C_l\sum_{v\in\mathcal{V}}\mathbf{1}^T\boldsymbol{\xi}_v + C_l\sum_{v\in\mathcal{V}_2}\mathbf{r}_v^T\boldsymbol{\Pi}^T\boldsymbol{\delta}_v - C_a\sum_{v\in\mathcal{V}_2}\|\boldsymbol{\delta}_v\|_1 \quad (6)\\ \text{s.t.}\;\; & \mathbf{Y}_v\mathbf{X}_v\mathbf{r}_v \succeq \mathbf{1} - \boldsymbol{\xi}_v, \quad \boldsymbol{\xi}_v \succeq \mathbf{0}, \quad \mathbf{r}_v = \boldsymbol{\omega}_{vu}, \; \boldsymbol{\omega}_{vu} = \mathbf{r}_u, \quad \forall v\in\mathcal{V}, \; u\in\mathcal{B}_v,\\ & \boldsymbol{\delta}_v \in \mathcal{U}_v, \quad \forall v\in\mathcal{V}_2. \end{aligned}$$

Note that $\boldsymbol{\Pi}_{p+1}$ is a $(p+1)\times(p+1)$ identity matrix with its $(p+1)$-st diagonal entry being $0$. The auxiliary variable $\boldsymbol{\omega}_{vu}$ is used to decompose the decision variable $\mathbf{r}_v$ to its neighbors $u \in \mathcal{B}_v$, where $\mathbf{r}_v = \boldsymbol{\omega}_{vu} = \mathbf{r}_u$. Problem (6) is a minimax problem in matrix form derived from Problem (4). To solve Problem (6), we first prove that the minimax problem is equivalent to the max-min problem; then we use best-response dynamics for the min-problem and the max-problem separately.
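The following lines illustrate one way to assemble these augmented quantities for a single node, under the shapes suggested by the text; the variable names are stand-ins.

```python
# Augmented quantities for node v = 0: X_aug stacks [x_vn^T, 1] row-wise so
# that X_aug @ r_v = X @ w_v + b_v * 1 with r_v = [w_v; b_v]; Y is the
# diagonal label matrix; Pi recovers w_v from r_v (identity block plus a
# zero last column); Pi_aug is the (p+1)x(p+1) identity with last entry 0.
X, y = data[0]
X_aug = np.hstack([X, np.ones((N_v, 1))])      # N_v x (p+1)
Y = np.diag(y)                                  # N_v x N_v
Pi = np.hstack([np.eye(p), np.zeros((p, 1))])   # p x (p+1), w_v = Pi @ r_v
Pi_aug = np.diag(np.r_[np.ones(p), 0.0])        # (p+1) x (p+1)
```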

Proposition 2.

Let $K_6$ represent the objective function in Problem (6). The minimax problem

$$\min_{\{\mathbf{r}_v, \boldsymbol{\xi}_v, \boldsymbol{\omega}_{vu}\}} \max_{\{\boldsymbol{\delta}_v \in \mathcal{U}_v\}} K_6$$

yields the same saddle-point equilibrium as the max-min problem

$$\max_{\{\boldsymbol{\delta}_v \in \mathcal{U}_v\}} \min_{\{\mathbf{r}_v, \boldsymbol{\xi}_v, \boldsymbol{\omega}_{vu}\}} K_6.$$

Moreover, an equilibrium of the minimax (or max-min) Problem (6) exists, but it is not necessarily unique.

Proof.

See Appendix B. ∎

Proposition 2 shows that the minimax problem is equivalent to the max-min problem, and thus we can construct best-response dynamics for the min-problem and the max-problem separately when solving Problem (6). The max-problem and the min-problem are obtained by fixing the learner's variables $\{\mathbf{r}_v, \boldsymbol{\xi}_v\}$ and the attacker's variables $\{\boldsymbol{\delta}_v\}$, respectively. We will also show that both the min-problem and the max-problem can be solved in a distributed way.

IV-A Max-problem for fixed $\{\mathbf{r}_v, \boldsymbol{\xi}_v\}$

For fixed $\{\mathbf{r}_v, \boldsymbol{\xi}_v\}$, the first two terms of the objective function and the first three constraints in Problem (6) can be ignored, as they do not affect the max-problem. We have

$$\max_{\{\boldsymbol{\delta}_v \in \mathcal{U}_v\}} \; C_l\sum_{v\in\mathcal{V}_2}\mathbf{r}_v^T\boldsymbol{\Pi}^T\boldsymbol{\delta}_v - C_a\sum_{v\in\mathcal{V}_2}\|\boldsymbol{\delta}_v\|_1 \quad (7)$$

Note that each $\boldsymbol{\delta}_v$ is independent in Problem (7), and thus we can separate Problem (7) into sub-max-problems, solving which is equivalent to solving the global max-problem. Recall that we have relaxed the $\ell_0$ norm to the $\ell_1$ norm to represent the cost function of the attacker. By writing the equivalent form of the $\ell_1$-norm optimization, we arrive at the following problem:

$$\max_{\boldsymbol{\delta}_v, \mathbf{t}_v} \; C_l\mathbf{r}_v^T\boldsymbol{\Pi}^T\boldsymbol{\delta}_v - C_a\mathbf{1}^T\mathbf{t}_v \quad \text{s.t.} \quad \boldsymbol{\delta}_v \preceq \mathbf{t}_v, \; -\boldsymbol{\delta}_v \preceq \mathbf{t}_v, \; \boldsymbol{\delta}_v \in \mathcal{U}_{v,0}. \quad (8)$$

Problem (8) is a convex optimization problem: the objective function and the first two constraints are linear, while the third constraint is convex. Note that each node can compute its own $\boldsymbol{\delta}_v$ without transmitting information to other nodes. The global Max-Problem (7) is thus solved in a distributed fashion via the Sub-Max-Problems (8).
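To see why each sub-max-problem is convex and local, consider the following CVXPY sketch of a per-node attacker step: with the learner's variables fixed, the coupling term is linear in the perturbation, so maximizing it minus an ℓ1 cost over a bounded set is a tractable convex program. The specific gain term, the cost weight C_a, and the budget eps_v are stand-ins for the paper's exact Problem (8).

```python
# Per-node attacker step (a stand-in for sub-max-problem (8)): maximize
# the linear coupling with w_v minus an l1 cost over a bounded action set.
def attacker_best_response(w_v, n_samples, C_a=0.1, eps_v=1.0):
    delta = cp.Variable((n_samples, w_v.shape[0]))
    gain = cp.sum(delta @ w_v)                       # linear in delta
    cost = C_a * cp.norm1(delta)                     # l1 relaxation of l0
    budget = [cp.sum(cp.norm(delta, 2, axis=1)) <= eps_v]
    cp.Problem(cp.Maximize(gain - cost), budget).solve()
    return delta.value
```

Because the program depends only on the node's own variables, every node can run this step in parallel, which is exactly the distributed structure claimed above.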

IV-B Min-problem for fixed $\{\boldsymbol{\delta}_v\}$

For fixed $\{\boldsymbol{\delta}_v\}$, we have

$$\begin{aligned} \min_{\{\mathbf{r}_v, \boldsymbol{\xi}_v, \boldsymbol{\omega}_{vu}\}} \;\; & \frac{1}{2}\sum_{v\in\mathcal{V}}\mathbf{r}_v^T\boldsymbol{\Pi}_{p+1}\mathbf{r}_v + C_l\sum_{v\in\mathcal{V}}\mathbf{1}^T\boldsymbol{\xi}_v + C_l\sum_{v\in\mathcal{V}_2}\mathbf{r}_v^T\boldsymbol{\Pi}^T\boldsymbol{\delta}_v \quad (9)\\ \text{s.t.}\;\; & \mathbf{Y}_v\mathbf{X}_v\mathbf{r}_v \succeq \mathbf{1} - \boldsymbol{\xi}_v, \quad (9a)\\ & \boldsymbol{\xi}_v \succeq \mathbf{0}, \quad (9b)\\ & \mathbf{r}_v = \boldsymbol{\omega}_{vu}, \; \boldsymbol{\omega}_{vu} = \mathbf{r}_u, \quad \forall v\in\mathcal{V}, \; u\in\mathcal{B}_v. \quad (9c) \end{aligned}$$

Note that the term $-C_a\sum_{v\in\mathcal{V}_2}\|\boldsymbol{\delta}_v\|_1$ is ignored, since it does not play a role in the minimization problem. Furthermore, we use the alternating direction method of multipliers to solve Problem (9).

The surrogate augmented Lagrangian function for Problem (9) is

$$\begin{aligned} \mathcal{L}\left(\{\mathbf{r}_v, \boldsymbol{\xi}_v\}, \{\boldsymbol{\omega}_{vu}\}, \{\boldsymbol{\alpha}_{vu1}, \boldsymbol{\alpha}_{vu2}\}\right) = \; & \frac{1}{2}\sum_{v\in\mathcal{V}}\mathbf{r}_v^T\boldsymbol{\Pi}_{p+1}\mathbf{r}_v + C_l\sum_{v\in\mathcal{V}}\mathbf{1}^T\boldsymbol{\xi}_v + C_l\sum_{v\in\mathcal{V}_2}\mathbf{r}_v^T\boldsymbol{\Pi}^T\boldsymbol{\delta}_v \\ & + \sum_{v\in\mathcal{V}}\sum_{u\in\mathcal{B}_v}\boldsymbol{\alpha}_{vu1}^T\left(\mathbf{r}_v - \boldsymbol{\omega}_{vu}\right) + \sum_{v\in\mathcal{V}}\sum_{u\in\mathcal{B}_v}\boldsymbol{\alpha}_{vu2}^T\left(\boldsymbol{\omega}_{vu} - \mathbf{r}_u\right) \\ & + \frac{\eta}{2}\sum_{v\in\mathcal{V}}\sum_{u\in\mathcal{B}_v}\|\mathbf{r}_v - \boldsymbol{\omega}_{vu}\|^2 + \frac{\eta}{2}\sum_{v\in\mathcal{V}}\sum_{u\in\mathcal{B}_v}\|\boldsymbol{\omega}_{vu} - \mathbf{r}_u\|^2 \quad (10) \end{aligned}$$

Notice that $\boldsymbol{\alpha}_{vu1}$ and $\boldsymbol{\alpha}_{vu2}$ denote the Lagrange multipliers with respect to the constraints $\mathbf{r}_v = \boldsymbol{\omega}_{vu}$ and $\boldsymbol{\omega}_{vu} = \mathbf{r}_u$, respectively. "Surrogate" here means that $\mathcal{L}$ does not include the constraints (9a) and (9b). "Augmented" indicates that $\mathcal{L}$ contains two quadratic terms scaled by the constant $\eta$; these two terms further regularize the equality constraints in (9). ADMoM solves Problem (9) with the following update rules [39]:

$$\{\mathbf{r}_v, \boldsymbol{\xi}_v\}^{(t+1)} = \arg\min_{\{\mathbf{r}_v, \boldsymbol{\xi}_v\}} \mathcal{L}\left(\{\mathbf{r}_v, \boldsymbol{\xi}_v\}, \{\boldsymbol{\omega}_{vu}^{(t)}\}, \{\boldsymbol{\alpha}_{vu1}^{(t)}, \boldsymbol{\alpha}_{vu2}^{(t)}\}\right) \;\; \text{s.t. (9a), (9b)} \quad (11)$$
$$\{\boldsymbol{\omega}_{vu}\}^{(t+1)} = \arg\min_{\{\boldsymbol{\omega}_{vu}\}} \mathcal{L}\left(\{\mathbf{r}_v^{(t+1)}, \boldsymbol{\xi}_v^{(t+1)}\}, \{\boldsymbol{\omega}_{vu}\}, \{\boldsymbol{\alpha}_{vu1}^{(t)}, \boldsymbol{\alpha}_{vu2}^{(t)}\}\right) \quad (12)$$
$$\boldsymbol{\alpha}_{vu1}^{(t+1)} = \boldsymbol{\alpha}_{vu1}^{(t)} + \eta\left(\mathbf{r}_v^{(t+1)} - \boldsymbol{\omega}_{vu}^{(t+1)}\right) \quad (13)$$
$$\boldsymbol{\alpha}_{vu2}^{(t+1)} = \boldsymbol{\alpha}_{vu2}^{(t)} + \eta\left(\boldsymbol{\omega}_{vu}^{(t+1)} - \mathbf{r}_u^{(t+1)}\right) \quad (14)$$

Note that (11)-(14) contain two quadratic programming problems and two linear computations. Furthermore, (11)-(14) can be simplified as stated in the following proposition.

Proposition 3.

Each node $v \in \mathcal{V}$ iterates, with random initializations $\mathbf{r}_v^{(0)}$ and $\boldsymbol{\alpha}_v^{(0)}$,

(15)
(16)
(17)

where .

Proof.

A similar proof can be found in [21]. Solving (12) directly yields a closed-form expression for $\boldsymbol{\omega}_{vu}$, and thus (12) can be eliminated by plugging its solution directly into (11), (13), and (14).

By plugging the solution of (12) into (13) and (14), we obtain simplified updates for the two sets of multipliers. With zero initial conditions, the two sets of multipliers remain equal at every iteration. Thus, (12), (13), and (14) can be simplified accordingly.

By plugging the solution of (12) into (11), the sixth and seventh terms of the objective function in (11) can be combined into a single linear term in the aggregate multiplier $\boldsymbol{\alpha}_v^{(t)}$. Moreover, an analogous identity holds for the fourth and fifth terms of the objective function in (11), so that only the aggregate multiplier $\boldsymbol{\alpha}_v^{(t)}$ needs to be computed at each iteration of (11). As a result, (13) and (14) can be written as (17).

Using these results, we can rewrite Problem (11) as follows

Let $\boldsymbol{\lambda}_v$ and $\boldsymbol{\mu}_v$ denote the Lagrange multipliers associated with the constraints $\mathbf{Y}_v\mathbf{X}_v\mathbf{r}_v \succeq \mathbf{1} - \boldsymbol{\xi}_v$ and $\boldsymbol{\xi}_v \succeq \mathbf{0}$, respectively. As a result, we have the Lagrange function for (11) as

By the KKT conditions, we have

Note that $\boldsymbol{\lambda}_v \succeq \mathbf{0}$ and $\boldsymbol{\mu}_v \succeq \mathbf{0}$; thus, the second equality yields the box constraint $\mathbf{0} \preceq \boldsymbol{\lambda}_v \preceq C_l\mathbf{1}$. Substituting the definitions above into the first equality yields (16). Finally, $\boldsymbol{\lambda}_v$ can be obtained by solving the dual problem of Problem (11), which yields (15). ∎

Note that (15) is a quadratic programming problem with linear inequality constraints, while (16) and (17) are direct computations. The matrix to be inverted in these computations is diagonal; thus its inverse always exists and is easy to compute. Iterations (15)-(17) are fully distributed, as each node uses only its own sample data $\mathbf{X}_v$ and $\mathbf{Y}_v$. However, the computations of $\mathbf{r}_v$ and $\boldsymbol{\alpha}_v$ at node $v$ require the values $\mathbf{r}_u$ from the neighboring nodes $u \in \mathcal{B}_v$, which is achieved by allowing communication between nodes. The centralized Min-Problem (9) can now be solved in a fully distributed fashion.
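The structure of (15)-(17) can be summarized in a schematic per-node min-step; the callables solve_local_qp and recover_r and the step size eta are placeholders for the exact expressions, and only the order of the computations and the communication pattern are meant to be faithful.

```python
# Schematic per-node min-step mirroring the structure of (15)-(17):
# a local QP in the multipliers lambda_v (15), a direct recovery of
# r_v = [w_v; b_v] (16), and a linear multiplier update driven by the
# neighbors' broadcast results (17).
def min_step(v, r, alpha_v, solve_local_qp, recover_r, eta=1.0):
    lam = solve_local_qp(v, r, alpha_v)           # (15): local QP
    r_new = recover_r(v, lam, r, alpha_v)         # (16): direct computation
    nb_diff = sum(r_new - r[u] for u in neighbors[v])
    alpha_new = alpha_v + (eta / 2.0) * nb_diff   # (17): linear update
    return r_new, alpha_new
```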

IV-C Distributed algorithm for the minimax problem

By combining Proposition 3 above with Problem (8), we obtain the following method for solving Problem (6) in a distributed way. First, each node randomly picks initial values $\mathbf{r}_v^{(0)}$ and $\boldsymbol{\alpha}_v^{(0)}$, solves Max-Problem (8), and obtains $\boldsymbol{\delta}_v^{(0)}$. Next, it solves Min-Problem (9) with $\boldsymbol{\delta}_v^{(0)}$ using Proposition 3 and obtains the updated learner variables. We then repeat, solving the max-problem with the learner's variables from the previous step and the min-problem with the attacker's variables from the previous step, until the pair of strategies converges. The iterations for solving Problem (6) can be summarized as follows:

Proposition 4.

With arbitrary initializations $\mathbf{r}_v^{(0)}$ and $\boldsymbol{\alpha}_v^{(0)}$, the iterations per node are given by:

(18)
(19)
(20)
(21)

where .

Iterations (18)-(21) are summarized in Algorithm 1. Note that at any given iteration $t$ of the algorithm, each node computes its own local discriminant function for any vector $\mathbf{x}$ as

$$g_v^{(t)}(\mathbf{x}) = [\mathbf{x}^T \; 1]\,\mathbf{r}_v^{(t)} = \mathbf{w}_v^{(t)T}\mathbf{x} + b_v^{(t)}. \quad (22)$$
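In code, the local decision rule (22) is a one-line computation: each node classifies with its own current pair (w_v, b_v) and requires no communication.

```python
# Local discriminant function: label = sign of g_v(x) = w_v^T x + b_v.
def predict(w_v, b_v, x):
    return 1 if float(w_v @ x + b_v) >= 0.0 else -1
```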

Algorithm 1 solves the minimax problem using the ADMoM technique. It is a fully decentralized network operation that requires exchanging neither training data nor the values of the decision functions, thus meeting reduced communication overhead and privacy preservation requirements at the same time. The iterative nature of the algorithm also provides resilience to the distributed machine learning system: it gives each node a mechanism to respond to its neighbors and to adversarial behaviors in real time. When unanticipated events occur, the algorithm can automatically respond and self-configure in an optimal way. The properties of Algorithm 1 can be summarized as follows.

Algorithm 1
Randomly initialize $\mathbf{r}_v^{(0)}$ and $\boldsymbol{\alpha}_v^{(0)}$
for every $v \in \mathcal{V}$.
1:  for $t = 0, 1, 2, \dots$ do
2:        for all $v \in \mathcal{V}_2$ do
3:             Compute $\boldsymbol{\delta}_v^{(t)}$ via (18).
4:        end for
5:        for all $v \in \mathcal{V}$ do
6:             Compute $\boldsymbol{\lambda}_v^{(t)}$ via (19).
7:             Compute $\mathbf{r}_v^{(t+1)}$ via (20).
8:        end for
9:        for all $v \in \mathcal{V}$ do
10:            Broadcast $\mathbf{r}_v^{(t+1)}$ to all neighbors $u \in \mathcal{B}_v$.
11:       end for
12:       for all $v \in \mathcal{V}$ do
13:            Compute $\boldsymbol{\alpha}_v^{(t+1)}$ via (21).
14:       end for
15: end for
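Putting the pieces together, a minimal driver in the shape of Algorithm 1 alternates the attacker's max-step with the learner's min-step and broadcast; attacking a fixed set of nodes and the stand-in learner_min_step are simplifying assumptions.

```python
# Minimal driver in the shape of Algorithm 1: compromised nodes run the
# max-step (18), then every node runs a min-step for (19)-(20) and the
# multiplier update (21) using the last broadcast r values.
# learner_min_step is a stand-in callable (e.g. built from min_step above
# plus the delta-dependent term at attacked nodes).
def run_algorithm1(T, learner_min_step, attacked_nodes):
    r = {v: np.zeros(p + 1) for v in neighbors}      # r_v = [w_v; b_v]
    alpha = {v: np.zeros(p + 1) for v in neighbors}
    for t in range(T):
        deltas = {v: attacker_best_response(r[v][:p], N_v)   # (18)
                  for v in attacked_nodes}
        r_prev = dict(r)                             # last broadcast values
        for v in neighbors:
            r[v], alpha[v] = learner_min_step(v, r_prev, alpha[v],
                                              deltas.get(v))  # (19)-(21)
    return r
```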

IV-D Game of Games

The zero-sum minimax Problem (5) is a global game between the two players, i.e., a learner and an attacker, and it captures their interactions over a network of nodes. However, based on the structure of the network, the two-person zero-sum game can be decomposed into smaller games between a local learner and a local attacker. If we treat each node as a player, then the global game decomposes into smaller games, each of which is a local game between the local learner at a node and the local attacker who attacks that node. We call this unique structure a "game of games".

To state this more formally, we let the two-player zero-sum game be represented by $G_z$, which is equivalent to a game of games defined over the set of nodes, where