Traversing the Local Polytopes of ReLU Neural Networks: A Unified Approach for Network Verification

11/17/2021
by   Shaojie Xu, et al.

Although neural networks (NNs) with ReLU activation functions have found success in a wide range of applications, their adoption in risk-sensitive settings has been limited by concerns about robustness and interpretability. Previous works on examining robustness and improving interpretability partially exploited the piecewise linear function form of ReLU NNs. In this paper, we explore the unique topological structure that ReLU NNs create in the input space, identify the adjacency among the partitioned local polytopes, and develop a traversing algorithm based on this adjacency. Our polytope traversing algorithm can be adapted to verify a wide range of network properties related to robustness and interpretability, providing a unified approach to examining network behavior. Because the traversing algorithm explicitly visits all local polytopes, it returns a clear and full picture of the network behavior within the traversed region. The time and space complexity of the traversing algorithm is determined by the number of a ReLU NN's partitioning hyperplanes passing through the traversing region.


1 Introduction & Related Work

Neural networks with rectified linear unit activation functions (ReLU NNs) are arguably the most popular type of neural network in deep learning. This type of network enjoys many appealing properties, including better performance than NNs with sigmoid activation Glorot, Bordes, and Bengio (2011), universal approximation ability Arora et al. (2018); Lu et al. (2017); Montufar et al. (2014); Schmidt-Hieber (2020), and fast training via scalable algorithms such as stochastic gradient descent (SGD) and its variants Zou et al. (2020).

Despite their strong predictive power, ReLU NNs have seen limited adoption in risk-sensitive settings Bunel et al. (2018). These settings require the model to make robust predictions against potential adversarial noise in the input Athalye et al. (2018); Carlini and Wagner (2017); Goodfellow, Shlens, and Szegedy (2014); Szegedy et al. (2014). The alignment between model behavior and human intuition is also desirable Liu et al. (2019): prior knowledge such as monotonicity may be incorporated into model design and training Daniels and Velikova (2010); Gupta et al. (2019); Liu et al. (2020); Sharma and Wehrheim (2020); users and auditors of the model may require a certain degree of explanations of the model predictions Gopinath et al. (2019); Chu et al. (2018).

The requirements of risk-sensitive settings have motivated a great amount of research on verifying certain properties of ReLU NNs. These works often exploit the piecewise linear function form of ReLU NNs. In Bastani et al. (2016), the robustness of a network is verified in a very small input region via linear programming (LP). To account for the non-linearity of ReLU activation functions, Ehlers (2017); Katz et al. (2017); Pulina and Tacchella (2010, 2012) formulated the robustness verification problem as a satisfiability modulo theories (SMT) problem. A more popular way to model the ReLU non-linearity is to introduce a binary variable representing the on-off pattern of each ReLU neuron. Property verification can then be solved using mixed-integer programming (MIP) Anderson et al. (2020); Fischetti and Jo (2017); Liu et al. (2020); Tjeng, Xiao, and Tedrake (2018); Weng et al. (2018).

The piecewise linear functional form of ReLU NNs also creates distinct topological structures in the input space. Previous studies have shown that a ReLU NN partitions the input space into convex polytopes and has one linear model associated with each polytope Montufar et al. (2014); Serra, Tjandraatmadja, and Ramalingam (2018); Sudjianto et al. (2020). Each polytope can be coded by a binary activation code, which reflects the on-off pattern of the ReLU neurons. The number of local polytopes is often used as a measure of the model's expressivity Hanin and Rolnick (2019); Lu et al. (2017). Built upon this framework, multiple studies Sudjianto et al. (2020); Yang, Zhang, and Sudjianto (2020); Zhao et al. (2021) tried to explain the behavior of ReLU NNs and to improve their interpretability. They viewed a ReLU NN as a collection of linear models. However, the relationship among the local polytopes and their linear models has not been fully investigated.

In this paper, we explore the topological relationship among the local polytopes created by ReLU NNs. We propose algorithms to identify the adjacency among these polytopes, based on which we develop traversing algorithms to visit all polytopes within a bounded region in the input space. Our paper has the following major contributions:

  1. The polytope traversing algorithm provides a unified framework to examine the network behavior. Since each polytope contains a linear model whose properties are easy to verify, the full verification on a bounded domain is achieved after all the covered polytopes are visited and verified. We provide theoretical guarantees on the thoroughness of the traversing algorithm.

  2. Property verification based on the polytope traversing algorithm can be easily customized. Identifying the adjacency among the polytopes is formulated as LP. Within each local polytope, the user has the freedom to choose the solver most suitable for the verification sub-problem. We demonstrate that many common applications can be formulated as convex problems within each polytope.

  3. Because the polytope traversing algorithm explicitly visits all the local polytopes, it returns a full picture of the network behavior within the traversed region and improves interpretability.

Although we focus on ReLU NNs with fully connected layers throughout this paper, our polytope traversing algorithm can be naturally extended to other piecewise linear networks such as those containing convolutional and max-pooling layers.

The rest of this paper is organized as follows: Section 2 reviews how polytopes are created by ReLU NNs. Section 3 introduces two related concepts: the boundaries of a polytope and the adjacency among the polytopes. Our polytope traversing algorithm is described in Section 4. Section 5 demonstrates several cases of adapting the traversing algorithm for network property verification. The paper is concluded in Section 6.

2 The Local Polytopes in ReLU NNs

2.1 The case of one hidden layer

A ReLU NN partitions the input space into several polytopes and forms a linear model within each polytope. To see this, we first consider a simple NN with one hidden layer of $n$ neurons. It takes an input $x \in \mathbb{R}^d$ and outputs $\hat{y}$ by calculating:

$\hat{y} = f(x) = W_2\,\mathrm{ReLU}(W_1 x + b_1) + b_2$   (1)

For problems with a binary or categorical target variable (i.e., binary or multi-class classification), a sigmoid or softmax layer, respectively, is added after the output layer to convert the NN outputs to proper probabilistic predictions.

The ReLU activation function inserts non-linearity into the model by checking a set of linear inequalities: $w_{1,i}^\top x + b_{1,i} \geq 0$, $i = 1, \dots, n$, where $w_{1,i}^\top$ is the $i$th row of the matrix $W_1$ and $b_{1,i}$ is the $i$th element of $b_1$. Each neuron in the hidden layer creates a partitioning hyperplane in the input space with the linear equation $w_{1,i}^\top x + b_{1,i} = 0$. The areas on the two sides of the hyperplane are two halfspaces. The entire input space is, therefore, partitioned by these hyperplanes. We define a local polytope as a set containing all points that fall on the same side of each and every hyperplane. The polytope encoding function (2) uses an element-wise indicator function to create a unique binary code $c \in \{0,1\}^n$ for each polytope. Since the $i$th neuron is called "ON" for some $x$ if $w_{1,i}^\top x + b_{1,i} \geq 0$, the code also represents the on-off pattern of the neurons. Using the results of this encoding function, we can express each polytope as an intersection of halfspaces as in (3), where the binary code controls the directions of the inequalities.

$c = \mathcal{C}(x) = \mathbb{1}\left(W_1 x + b_1 \geq 0\right)$   (2)
$Q_c = \bigcap_{i=1}^{n} \left\{ x : (2c_i - 1)\left(w_{1,i}^\top x + b_{1,i}\right) \geq 0 \right\}$   (3)
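The encoding function (2) and the halfspace representation (3) translate directly into code. The following is a minimal sketch in NumPy, under the notation assumed above (weights $W_1 \in \mathbb{R}^{n \times d}$, biases $b_1 \in \mathbb{R}^n$); the function and variable names are illustrative, not taken from the original implementation.

```python
import numpy as np

def polytope_code(x, W1, b1):
    """Binary activation code c in {0,1}^n of the polytope containing x, as in (2)."""
    return (W1 @ x + b1 >= 0).astype(int)

def polytope_halfspaces(c, W1, b1):
    """Halfspace representation (A, b) of Q_c = {x : A x <= b}, as in (3).
    An ON neuron (c_i = 1) contributes w_i^T x + b_i >= 0, i.e. -w_i^T x <= b_i;
    an OFF neuron contributes the flipped inequality."""
    signs = 2 * c - 1               # +1 for ON, -1 for OFF
    A = -signs[:, None] * W1
    b = signs * b1
    return A, b
```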

Figure 1.(b) shows an example of a ReLU NN trained on a two-dimensional synthetic dataset (plotted in Figure 1.(a)). The input space is bounded and the target variable is binary. The network has one hidden layer of 20 neurons. The partitioning hyperplanes associated with these neurons are plotted as the blue dashed lines. They form in total 91 local polytopes within the bounded input space.

For a given $x$, if $w_{1,i}^\top x + b_{1,i} \geq 0$, the ReLU neuron turns on and passes the value through. Otherwise, the neuron is off and suppresses the value to zero. Therefore, if we know the $i$th neuron is off, we can mask the corresponding $w_{1,i}$ and $b_{1,i}$ by zeros and create $\tilde{W}_1$ and $\tilde{b}_1$ that satisfy (5). The non-linear ReLU operation, therefore, can be replaced by a locally linear operation after zero-masking. Because each local polytope has a unique neuron activation pattern encoded by $c$, the zero-masking process in (4) is also unique for each polytope. Here, $\mathbb{1}_d$ is a vector of 1s of length $d$ and $\odot$ denotes the element-wise product.

$\tilde{W}_1 = \left(c\,\mathbb{1}_d^\top\right) \odot W_1, \qquad \tilde{b}_1 = c \odot b_1$   (4)
$\mathrm{ReLU}(W_1 x + b_1) = \tilde{W}_1 x + \tilde{b}_1, \quad \forall x \in Q_c$   (5)

Within each polytope, as the non-linearity is taken out by the zero-masking process, the input and output have a linear relationship:

$\hat{y} = W_2\left(\tilde{W}_1 x + \tilde{b}_1\right) + b_2 = \tilde{W} x + \tilde{b}$   (6)

The linear model associated with polytope $Q_c$ has the weight matrix $\tilde{W} = W_2 \tilde{W}_1$ and the bias vector $\tilde{b} = W_2 \tilde{b}_1 + b_2$. The ReLU NN is now represented by a collection of linear models, each defined on a local polytope $Q_c$.
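A minimal sketch of the zero-masking step (4)-(6), again under the assumed notation and building on the helpers above; the name local_linear_model is illustrative.

```python
import numpy as np

def local_linear_model(c, W1, b1, W2, b2):
    """Local linear model (W_loc, b_loc) of the network on polytope Q_c, via (4)-(6)."""
    W1_masked = c[:, None] * W1     # zero out the rows of OFF neurons
    b1_masked = c * b1
    W_loc = W2 @ W1_masked
    b_loc = W2 @ b1_masked + b2
    return W_loc, b_loc

# For any x inside Q_c, the local model reproduces the network output:
# np.allclose(W_loc @ x + b_loc, W2 @ np.maximum(W1 @ x + b1, 0) + b2)  -> True
```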

In Figure 1.(b), we represent the linear model in each local polytope as a red solid line indicating where $\tilde{W} x + \tilde{b} = 0$. In this binary response case, the two sides of this line have opposite class predictions. We only plot the line if it passes through its corresponding polytope. For the other polytopes, the entire polytope falls on one side of its corresponding class-separating line and the predicted class is the same within the whole polytope. The red lines all together form the decision boundary of the ReLU NN and are continuous when passing from one polytope to another. This is a direct result of the ReLU NN being a continuous model.

Figure 1:

Examples of trained ReLU NNs and their local polytopes. (a) The grid-like training data with a binary target variable. (b) A trained ReLU NN with one hidden layer of 20 neurons. The heatmap shows the predicted probability of a sample belonging to class 1. The blue dashed lines are the partitioning hyperplanes associated with the ReLU neurons, which form 91 local polytopes in total. The red solid lines represent the linear model within each polytope where class separation occurs. (c) A trained ReLU NN with two hidden layers of 10 and 5 neurons, respectively. The blue dashed lines are the partitioning hyperplanes associated with the first 10 ReLU neurons, forming 20 level-1 polytopes. The orange dashed lines are the partitioning hyperplanes associated with the 5 second-layer ReLU neurons within each level-1 polytope. There are in total 41 (level-2) local polytopes. The red solid lines represent the linear model within each level-2 polytope where class separation occurs.

2.2 The case of multiple layers

We can generalize the results to ReLU NNs with multiple hidden layers. A ReLU NN with $L$ hidden layers hierarchically partitions the input space and is locally linear in each and every level-$L$ polytope. Each level-$L$ polytope has a unique binary code representing the activation pattern of the neurons in all hidden layers. The corresponding partitioning hyperplanes of each level, $\hat{W}^{(l)}$ and $\hat{b}^{(l)}$, $l = 1, \dots, L$, can be calculated recursively level by level, using the zero-masking procedure:

$\tilde{W}^{(0)} = I, \qquad \tilde{b}^{(0)} = 0$   (7)
$\hat{W}^{(l)} = W_l\,\tilde{W}^{(l-1)}, \qquad \hat{b}^{(l)} = W_l\,\tilde{b}^{(l-1)} + b_l$   (8)
$\tilde{W}^{(l)} = \left(c^{(l)}\,\mathbb{1}_d^\top\right) \odot \hat{W}^{(l)}, \qquad \tilde{b}^{(l)} = c^{(l)} \odot \hat{b}^{(l)}$   (9)

We emphasize that $\hat{W}^{(l)}$, $\hat{b}^{(l)}$, $\tilde{W}^{(l)}$, and $\tilde{b}^{(l)}$ depend on all polytope codes up to level $l-1$: $c^{(1)}, \dots, c^{(l-1)}$. The subscript is dropped to simplify the notation.

At each level $l$, the encoding function and the polytope expressed as an intersection of halfspaces can be written recursively as:

$c^{(l)} = \mathbb{1}\left(\hat{W}^{(l)} x + \hat{b}^{(l)} \geq 0\right)$   (10)
$P^{(l)}_{c^{(l)}} = \bigcap_{i=1}^{n_l} \left\{ x : \left(2 c^{(l)}_i - 1\right)\left(\hat{w}^{(l)\top}_i x + \hat{b}^{(l)}_i\right) \geq 0 \right\}$   (11)
$c = \left(c^{(1)}, c^{(2)}, \dots, c^{(l)}\right)$   (12)
$Q^{(l)}_{c} = P^{(1)}_{c^{(1)}} \cap P^{(2)}_{c^{(2)}} \cap \dots \cap P^{(l)}_{c^{(l)}}$   (13)

Finally, the linear model in a level-$L$ polytope is:

$\hat{y} = W_{L+1}\left(\tilde{W}^{(L)} x + \tilde{b}^{(L)}\right) + b_{L+1} = \tilde{W} x + \tilde{b}$   (14)
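The recursion (7)-(14) amounts to propagating a point through the hidden layers while recording each layer's activation code and composing the zero-masked linear maps. A minimal sketch, assuming the weight matrices $W_1, \dots, W_L, W_{L+1}$ and biases $b_1, \dots, b_L, b_{L+1}$ are stored in Python lists (names illustrative):

```python
import numpy as np

def hierarchical_code_and_model(x, weights, biases):
    """Level-by-level activation codes and the local linear model at x, as in (7)-(14).
    `weights`/`biases` hold the hidden-layer parameters followed by the output layer."""
    d = x.shape[0]
    W_eff, b_eff = np.eye(d), np.zeros(d)        # identity map before the first layer, (7)
    codes = []
    for W, b in zip(weights[:-1], biases[:-1]):  # hidden layers only
        W_hat = W @ W_eff                        # level-l hyperplanes in the input space, (8)
        b_hat = W @ b_eff + b
        c = (W_hat @ x + b_hat >= 0).astype(int) # level-l code, (10)
        codes.append(c)
        W_eff = c[:, None] * W_hat               # zero-masking, (9)
        b_eff = c * b_hat
    W_out, b_out = weights[-1], biases[-1]
    return codes, W_out @ W_eff, W_out @ b_eff + b_out   # local linear model, (14)
```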

Figure 1.(c) shows an example of a ReLU NN with two hidden layers of sizes 10 and 5, respectively. The partitioning hyperplanes associated with the first 10 neurons are plotted as the blue dashed lines. They form 20 level-1 polytopes within the bounded input space. Within each level-1 polytope, the hyperplanes associated with the 5 second-layer neurons further partition the polytope. In many cases, some of the 5 hyperplanes lie outside the level-1 polytope and, therefore, do not create a new sub-partition. The hyperplanes that do create new partitions are plotted as the orange dashed lines. The orange lines are only straight within a level-1 polytope but are continuous when passing from one polytope to another, which is also a result of the ReLU NN being a continuous model. In total, this ReLU NN creates 41 (level-2) local polytopes. As in Figure 1.(b), the linear model within each level-2 polytope is represented as a red solid line if class separation occurs within the polytope.

3 Polytope Boundaries and Adjacency

Beyond viewing ReLU NNs as a collection of linear models defined on local polytopes, we explore the topological relationship among these polytopes. A key concept is the boundaries of each polytope. As shown in (13), each level-$l$ polytope with corresponding binary code $c$ is an intersection of halfspaces induced by a set of inequality constraints. Two situations can arise among these inequalities. First, an arbitrary $c$ may lead to conflicting inequalities and make $Q_c$ an empty set. This situation can be common when the number of neurons is much larger than the dimension of the input space. Second, there can be redundant inequalities, meaning that removing them does not affect the set $Q_c$. We now show that the non-redundant inequalities are closely related to the boundaries of a polytope.

Definition 3.1

Let $S$ contain all $x$ that satisfy $n$ linear inequalities: $S = \{x : w_i^\top x + b_i \geq 0,\ i = 1, \dots, n\}$. Assume that $S \neq \emptyset$. Let $S_{-j}$ contain all $x$'s that satisfy the same linear inequalities except the $j$th one: $S_{-j} = \{x : w_i^\top x + b_i \geq 0,\ i \neq j\}$. Then the inequality $w_j^\top x + b_j \geq 0$ is a redundant inequality with respect to (w.r.t.) $S$ if $S_{-j} = S$.

With the redundant inequality defined above, the following lemma provides an algorithm to identify them. The proof of this lemma is in the Appendix.

Lemma 3.1

Given a set $S = \{x : w_i^\top x + b_i \geq 0,\ i = 1, \dots, n\} \neq \emptyset$, the inequality $w_j^\top x + b_j \geq 0$ is a redundant inequality if the new set formed by flipping this inequality is empty: $S'_j = \{x : w_j^\top x + b_j \leq 0;\ w_i^\top x + b_i \geq 0,\ i \neq j\} = \emptyset$.

We can now define the boundaries of a polytope formed by a set of linear inequalities using a procedure similar to that in Lemma 3.1. The concept of polytope boundaries also leads to the definition of adjacency. Intuitively, we can move from one polytope to an adjacent polytope by crossing a boundary.

Definition 3.2

Given a non-empty polytope $Q_c$ formed by linear inequalities as in (3), the hyperplane $w_j^\top x + b_j = 0$ is a boundary of $Q_c$ if the new set formed by flipping the corresponding inequality is non-empty: $Q_{c'} \neq \emptyset$, where $c'$ differs from $c$ only in the $j$th bit. Polytope $Q_{c'}$ is called one-adjacent to $Q_c$.

Since for each polytope the directions of its linear inequalities are reflected by the binary code, two one-adjacent polytopes must have codes that differ in exactly one bit. Figure 2.(a) demonstrates the adjacency among the local polytopes. The ReLU NN is the same as in Figure 1.(b). Using the procedure in Definition 3.2, 4 out of the 20 partitioning hyperplanes are identified as boundaries of polytope No. 0 and marked in red. The 4 one-adjacent neighbors of polytope No. 0 are No. 1, 2, 3, and 4; each can be reached by crossing one boundary.

As we have shown in Section 2.2, ReLU NNs create polytopes level by level. We follow the same hierarchy to define polytope adjacency. Assume two non-empty level-$l$ polytopes, $Q_c$ and $Q_{c'}$, are inside the same level-$(l-1)$ polytope, which means their corresponding codes $c$ and $c'$ differ only at level $l$. We say that polytope $Q_{c'}$ is a level-$l$ one-adjacent neighbor of $Q_c$ if $c^{(l)}$ and $c'^{(l)}$ differ in only one bit.

The condition that $c$ and $c'$ differ only at level $l$ is important. In this case, the two linear inequalities associated with each pair of bits in $c^{(l)}$ and $c'^{(l)}$ have the same coefficients, and the difference between $c^{(l)}$ and $c'^{(l)}$ only changes the direction of one linear inequality. On the other hand, if the two codes differ at some level $k < l$, then according to the recursive calculation in (8) and (9), the codes starting from level $k+1$ will correspond to linear inequalities with different coefficients, leaving our Definition 3.2 of adjacency not applicable.

Figure 2.(b) demonstrates the hierarchical adjacency among the local polytopes. The ReLU NN is the same as in Figure 1.(c). Level-1 polytopes and are both (level-1) one-adjacent to . Within the level-1 polytope , level-2 polytopes and are (level-2) one-adjacent to each other. Similarly, we can identify the level-2 adjacency of the other two pairs and . Note that in the plot, even though one can move from polytope to by crossing one partitioning hyperplane, we do not define these two polytopes as adjacent, as they lie in two different level-1 polytopes.

4 Polytope Traversing

Figure 2: Demonstration of the BFS-based polytope traversing algorithm. (a) Traversing the 8 local polytopes within the bounded region. The ReLU NN is the same as in Figure 1.(b). The lines marked in red are the boundaries of polytope No. 0. (b) Traversing the 6 local polytopes within the bounded region. The ReLU NN is the same as in Figure 1.(c). The polytopes are indexed as "(level-1, level-2)". (c) The evolution of the BFS queue for traversing the local polytopes in (a). The gray arrows show the traversing order. The colored arrows at the bottom indicate the one-adjacent neighbors added to the queue. (d) The evolution of the hierarchical BFS queue for traversing the local polytopes in (b). The level-1 BFS queue is shown vertically while the level-2 BFS queue is shown horizontally.

4.1 The case of one hidden layer

The adjacency defined in the previous section gives us an order in which to traverse the local polytopes: starting from an initial polytope, we visit all of its one-adjacent neighbors, then all the neighbors' neighbors, and so on.

This algorithm can be viewed as breadth-first search (BFS) on a polytope graph. To create this graph, we turn each polytope created by the ReLU NN into a node. An edge is added between each pair of polytopes that are one-adjacent to each other. The BFS algorithm uses a queue to keep track of the traversing progress. At the beginning of traversing, the initial polytope is added to an empty queue and is marked as visited. In each iteration, we pop the first polytope from the queue and identify all of its one-adjacent neighbors. Among these identified polytopes, we add those that have not been visited to the back of the queue and mark them as visited. The iteration stops when the queue is empty.

The key component of the polytope traversing algorithm is identifying a polytope's one-adjacent neighbors. For a polytope coded by $c$ of $n$ bits, there are at most $n$ one-adjacent neighbors, with codes corresponding to flipping one of the bits in $c$. Each valid one-adjacent neighbor must be non-empty and can be reached by crossing a boundary. Therefore, we can check each linear inequality in (3) and determine whether it is a boundary or redundant. Some techniques for identifying redundant inequalities are summarized in Telgen (1983). By flipping the bits corresponding to the identified boundaries, we obtain the codes of the one-adjacent polytopes.

Equivalently, we can identify the one-adjacent neighbors by going through all candidate codes and selecting those corresponding to non-empty sets. Checking whether a set constrained by linear inequalities is feasible is often referred to as the "Phase-I problem" of LP and can be solved efficiently by modern LP solvers. During the BFS iterations, we hash the checked codes to avoid checking them repeatedly. The BFS-based polytope traversing algorithm is summarized in Algorithm 1. We now state the correctness of this algorithm, with its proof in the Appendix.
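As an illustration, the Phase-I feasibility check can be written as an LP with a zero objective. The sketch below uses scipy.optimize.linprog; the function name and the optional box bounds are illustrative, not part of the original implementation.

```python
import numpy as np
from scipy.optimize import linprog

def polytope_is_nonempty(A, b, box_bounds=None):
    """Phase-I check for {x : A x <= b}: solve an LP with a zero objective and
    report whether a feasible point exists. `box_bounds` optionally restricts each
    coordinate to an interval (lower, upper)."""
    n_vars = A.shape[1]
    bounds = box_bounds if box_bounds is not None else [(None, None)] * n_vars
    res = linprog(c=np.zeros(n_vars), A_ub=A, b_ub=b, bounds=bounds, method="highs")
    return res.status == 0          # status 2 would indicate infeasibility
```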

Theorem 4.1

Given a ReLU NN with one hidden layer of $n$ neurons as specified in (1), Algorithm 1 covers all non-empty local polytopes created by the neural network. That is, for all $x \in \mathbb{R}^d$, there exists one $Q_c$ as defined in (3) such that $x \in Q_c$ and $c \in \mathcal{V}$, where $\mathcal{V}$ is the result returned by Algorithm 1.

Algorithm 1 visits all the local polytopes created by a ReLU NN within $\mathbb{R}^d$. The time complexity is exponential in the number of neurons, as all $2^n$ possible activation patterns are checked once in the worst-case scenario. The space complexity is also exponential in the number of neurons, as we hash all the checked activation patterns. Furthermore, for each activation pattern, we solve a Phase-I problem of LP with $n$ inequalities in $\mathbb{R}^d$. Traversing all local polytopes in $\mathbb{R}^d$, therefore, becomes intractable for neural networks with a large number of neurons.

Fortunately, traversing in $\mathbb{R}^d$ is usually undesirable. Firstly, a neural network may run into extrapolation issues for points outside the sample distribution; the polytopes far away from the areas covered by the samples are often considered unreliable. Secondly, many real-life applications, to be discussed in Section 5, only require traversing within small bounded regions to examine the local behavior of a model. In the next section, we introduce a technique to improve the efficiency of traversing within a bounded region.

1: A ReLU NN with one hidden layer of $n$ neurons as specified in (1).
2: An initial point $x_0 \in \mathbb{R}^d$.
3: Initialize an empty queue $\mathcal{Q}$ for BFS.
4: Initialize an empty set $\mathcal{V}$ to store the codes of all visited polytopes.
5: Initialize an empty set $\mathcal{S}$ to store all checked codes.
6: Calculate $x_0$'s initial polytope code $c_0$ using (2).
7: Append $c_0$ to the end of the queue $\mathcal{Q}$.
8: Add $c_0$ to both $\mathcal{V}$ and $\mathcal{S}$.
9: while $\mathcal{Q}$ is not empty do
10:     Pop out the first element in the front of the BFS queue: $c$.
11:     for $i = 1, \dots, n$ do
12:          Create a candidate polytope code $c'$ by flipping one bit in $c$: $c'_i = 1 - c_i$ and $c'_j = c_j, \forall j \neq i$.
13:          if $c' \notin \mathcal{S}$ then
14:               Check if $Q_{c'}$ is empty using LP.
15:               Add $c'$ to $\mathcal{S}$.
16:               if $Q_{c'} \neq \emptyset$ then
17:                    Append $c'$ to the end of the queue $\mathcal{Q}$.
18:                    Add $c'$ to $\mathcal{V}$.
19: Return $\mathcal{V}$.
Algorithm 1 BFS-Based Polytope Traversing
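For concreteness, a compact Python sketch of Algorithm 1 is given below, assuming the helper functions polytope_code, polytope_halfspaces, and polytope_is_nonempty from the earlier sketches (all names illustrative); the optional box bounds anticipate the bounded-region traversing of the next section.

```python
import numpy as np
from collections import deque

def traverse_polytopes(x0, W1, b1, box_bounds=None):
    """BFS-based polytope traversing for a one-hidden-layer ReLU NN (Algorithm 1).
    Returns the set of visited polytope codes as tuples."""
    c0 = tuple(polytope_code(x0, W1, b1))
    queue = deque([c0])
    visited = {c0}
    checked = {c0}
    while queue:
        c = queue.popleft()
        for i in range(len(c)):
            cand = list(c)
            cand[i] = 1 - cand[i]                 # flip one bit: a candidate neighbor
            cand = tuple(cand)
            if cand in checked:
                continue
            checked.add(cand)
            A, b = polytope_halfspaces(np.array(cand), W1, b1)
            if polytope_is_nonempty(A, b, box_bounds):
                queue.append(cand)
                visited.add(cand)
    return visited
```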

4.2 Polytope traversing within a bounded region

We first consider a region with each dimension bounded independently: $l_j \leq x_j \leq u_j$, $j = 1, \dots, d$. These linear inequalities create a hypercube denoted as $B$. During the BFS-based polytope traversing, we repetitively flip the direction of one of the inequalities to identify the one-adjacent neighbors. When the bounded region is small, it is likely that only a small number of the hyperplanes cut through the hypercube. For the other hyperplanes, the entire hypercube falls on only one side. Flipping to the other side of these hyperplanes would leave the bounded region. Therefore, at the very beginning of polytope traversing, we can run through the hyperplanes to identify those cutting through the hypercube. Then, in each neighbor-identifying step, we only flip these hyperplanes.

To identify the hyperplanes cutting through the hypercube, we denote the two sides of a hyperplane $w_i^\top x + b_i = 0$ within $B$ as $H_i^+ = \{x \in B : w_i^\top x + b_i \geq 0\}$ and $H_i^- = \{x \in B : w_i^\top x + b_i \leq 0\}$. If neither $H_i^+$ nor $H_i^-$ is empty, we say the hyperplane cuts through $B$. Since $H_i^+$ and $H_i^-$ are both constrained by linear inequalities, checking their feasibility can again be formulated as a Phase-I problem of LP. We name this technique hyperplane pre-screening and summarize it in Algorithm 2.
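A minimal sketch of the pre-screening check for a single hyperplane, reusing the Phase-I helper above; the region is assumed to be given in halfspace form A_box x <= b_box (names illustrative).

```python
import numpy as np

def hyperplane_cuts_region(w, b, A_box, b_box):
    """True if the hyperplane {x : w^T x + b = 0} cuts through {x : A_box x <= b_box},
    i.e. both closed halfspaces intersect the region (one step of Algorithm 2)."""
    plus  = polytope_is_nonempty(np.vstack([A_box, -w]), np.append(b_box, b))   # w^T x + b >= 0
    minus = polytope_is_nonempty(np.vstack([A_box,  w]), np.append(b_box, -b))  # w^T x + b <= 0
    return plus and minus
```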

1: A set of hyperplanes $w_i^\top x + b_i = 0$, $i = 1, \dots, n$.
2: A bounded traversing region $B$, e.g. $l_j \leq x_j \leq u_j$, $j = 1, \dots, d$.
3: Initialize an empty set $\mathcal{H}$ to store all hyperplanes cutting through $B$.
4: for $i = 1, \dots, n$ do
5:     Get the two halfspaces $H_i^+$ and $H_i^-$.
6:     if $H_i^+ \neq \emptyset$ and $H_i^- \neq \emptyset$ then
7:          Add $i$ to $\mathcal{H}$.
8: Return $\mathcal{H}$.
Algorithm 2 Hyperplane Pre-Screening

Hyperplane pre-screening effectively reduces the worst-case complexity from $2^n$ to $2^{\tilde{n}}$, where $\tilde{n}$ is the number of hyperplanes cutting through the hypercube. The number $2^{\tilde{n}}$ corresponds to the worst-case scenario. Since the BFS-based traversing only checks non-empty polytopes and their potential one-adjacent neighbors, the number of activation patterns actually checked can be far less than $2^{\tilde{n}}$. In general, the fewer hyperplanes go through $B$, the faster polytope traversing finishes.

Figure 2.(a) shows traversing the 8 local polytopes within the bounded region. The ReLU NN is the same as in Figure 1.(b). The lines marked in red are the hyperplanes cutting through the bounded region, identified by the pre-screening algorithm. The evolution of the BFS queue is shown in Figure 2.(c). The gray arrows show the traversing order. The colored arrows at the bottom indicate the one-adjacent neighbors added to the queue. When polytope No. 0 is popped from the queue, its one-adjacent neighbors, No. 1, 2, 3, and 4, are added to the queue. Next, when polytope No. 1 is popped, its one-adjacent neighbors, No. 5 and 6, are added. Polytope No. 0, although a one-adjacent neighbor of No. 1, is ignored since it has been visited. Similarly, when polytope No. 2 is popped, only one of its one-adjacent neighbors, No. 7, is added, since all others have been visited (including those in the queue). The algorithm finishes after popping polytope No. 7, as no new polytopes can be added and the queue is empty. All 8 local polytopes in the bounded region are traversed.

Because $B$ is bounded by a set of linear inequalities, the correctness of BFS-based polytope traversing as stated in Theorem 4.1 can be easily extended to this bounded traversing case. It can be proved by showing that for any two non-empty polytopes overlapping with $B$, we can move from one to the other by repetitively finding a one-adjacent neighbor within $B$. We emphasize that the correctness of BFS-based polytope traversing can be proved for any traversing region bounded by a set of linear inequalities. This realization is critical for generalizing our results to the case of ReLU NNs with multiple hidden layers. Furthermore, as any closed convex set can be represented as the intersection of a (possibly infinite) set of halfspaces, the correctness of BFS-based polytope traversing holds for any closed convex $B$.

4.3 Hierarchical polytope traversing in the case of multiple hidden layers

The BFS-based polytope traversing algorithm can be generalized to ReLU NNs with multiple hidden layers. In Section 2.2, we described how a ReLU NN with $L$ hidden layers hierarchically partitions the input space into polytopes of different levels. Then, in Section 3, we showed that the adjacency of level-$l$ polytopes is conditioned on all of them belonging to the same level-$(l-1)$ polytope. Therefore, to traverse all level-$l$ polytopes, we need to traverse all level-$(l-1)$ polytopes and, within each of them, traverse the sub-polytopes by following the one-adjacent neighbors.

The procedure above leads us to a recursive traversing scheme. Assume a ReLU NN with $L$ hidden layers and a closed convex traversing region $B$. Starting from a sample $x_0 \in B$, we traverse all level-1 polytopes using the BFS-based algorithm. Inside each level-1 polytope, we traverse all the contained level-2 polytopes, and so on until we reach the level-$L$ polytopes. As shown in (13), each level-$l$ polytope is constrained by the linear inequalities of all levels up to $l$, so the way to identify level-$l$ one-adjacent neighbors is largely the same as described in Section 4.1. Two level-$l$ one-adjacent neighbors must have the same linear inequalities corresponding to $c^{(1)}, \dots, c^{(l-1)}$ and have one of the last $n_l$ inequalities differ in direction, so there are $n_l$ cases to check.

We can use hyperplane pre-screening at each level of traversing. When traversing the level-$l$ polytopes within a level-$(l-1)$ polytope, we update the bounded traversing region by taking the intersection of $B$ and that level-$(l-1)$ polytope. We then screen the level-$l$ partitioning hyperplanes and only select those passing through this updated traversing region.

The BFS-based hierarchical polytope traversing algorithm is summarized in Algorithm 3. The correctness of this algorithm can be proved based on the results in Section 4.2, which guarantee the thoroughness of traversing the level-$l$ polytopes within any level-$(l-1)$ polytope. The overall thoroughness is then guaranteed because each level of traversing is thorough. We state the result in the following theorem.

Theorem 4.2

Given a ReLU NN with $L$ hidden layers and a closed convex traversing region $B$, Algorithm 3 covers all non-empty level-$L$ polytopes created by the neural network that overlap with $B$. That is, for all $x \in B$, there exists one $Q_c$ as defined in (13) such that $x \in Q_c$ and $c \in \mathcal{V}$, where $\mathcal{V}$ is the result returned by Algorithm 3.

Figure 2.(b) shows traversing the 6 local polytopes within the bounded region. The ReLU NN is the same as in Figure 1.(c). The evolution of the hierarchical BFS queue is shown in Figure 2.(d). The level-1 BFS queue is shown vertically while the level-2 BFS queue is shown horizontally. Starting from level-1 polytope , the algorithm traverses the two level-2 polytopes inside it (line 10 in Algorithm 3). It then identifies the two (level-1) one-adjacent neighbors of : and . Every time a level-1 polytope is identified, the algorithm goes into it to traverse all the level-2 polytopes inside (line 36). At the end of the recursive call, all 6 local polytopes in the bounded region are traversed.

5 Network Property Verification Based on Polytope Traversing

The biggest advantage of the polytope traversing algorithm is its ability to be adapted to solve many different problems of practical interest. Problems such as local adversarial attacks, searching for counterfactual samples, and local monotonicity verification can be solved easily when the model is linear. As we have shown in Section 2.2, the local model within each level-$L$ polytope created by a ReLU NN is indeed linear. The polytope traversing algorithm provides a way to analyze not only the behavior of a ReLU NN on one local polytope but also its behavior within a neighborhood, and therefore enhances our understanding of the overall model behavior. In this section, we describe the details of adapting the polytope traversing algorithm to verify several properties of ReLU NNs.

Figure 3: Demonstration of different applications of the polytope traversing algorithm. We use the ReLU NN in Figure 1.(b) as an example. (a) Conducting a local adversarial attack by finding the maximum (green) and minimum (red) model predictions within a bounded region. (b) Creating counterfactual samples that are closest to the original sample. The distances are measured in two different norms (green and red). (c) Monotonicity verification in a bounded region. The polytope in red violates the condition that the model prediction increases monotonically along the horizontal axis.
1: A ReLU NN with $L$ hidden layers.
2: A closed convex traversing region $B$.
3: An initial point $x_0 \in B$.
4: Initialize an empty set $\mathcal{V}$ to store the codes of all visited polytopes.
5:
6: function HIERARCHICAL_TRAVERSE($x$, $l$)
7:     Initialize an empty queue $\mathcal{Q}^{(l)}$ for BFS at level $l$.
8:     Initialize an empty set $\mathcal{S}^{(l)}$ to store all checked level-$l$ codes.
9:     Calculate $x$'s initial polytope code $c$ recursively using (12).
10:     if $l = L$ then
11:          Add $c$ to $\mathcal{V}$
12:     else
13:          HIERARCHICAL_TRAVERSE($x$, $l+1$)
14:     if $l > 1$ then
15:          Get the level-$(l-1)$ polytope code specified by the front segment of $c$: $(c^{(1)}, \dots, c^{(l-1)})$.
16:          Use it to get the level-$(l-1)$ polytope $Q^{(l-1)}$ as in (13).
17:     else
18:          Set $Q^{(0)} = \mathbb{R}^d$.
19:     Form the new traversing region $B^{(l)} = B \cap Q^{(l-1)}$.
20:     Append the code segment $c^{(l)}$ to the end of the queue $\mathcal{Q}^{(l)}$.
21:     Add the code segment $c^{(l)}$ to $\mathcal{S}^{(l)}$.
22:     Get the level-$l$ partitioning hyperplanes associated with $Q^{(l-1)}$.
23:     Pre-screen these hyperplanes using Algorithm 2 with bounded region $B^{(l)}$.
24:     Collect the pre-screening results $\mathcal{H}^{(l)}$.
25:     while $\mathcal{Q}^{(l)}$ is not empty do
26:          Pop the first element in the front of the BFS queue: $c^{(l)}$.
27:          for $i \in \mathcal{H}^{(l)}$ do
28:               Create a candidate polytope code $c'^{(l)}$ by flipping one bit in $c^{(l)}$: $c'^{(l)}_i = 1 - c^{(l)}_i$ and $c'^{(l)}_j = c^{(l)}_j, \forall j \neq i$.
29:               if $c'^{(l)} \notin \mathcal{S}^{(l)}$ then
30:                    Get the set $Q' = Q^{(l-1)} \cap P^{(l)}_{c'^{(l)}} \cap B$.
31:                    Check if $Q'$ is empty using LP.
32:                    Add $c'^{(l)}$ to $\mathcal{S}^{(l)}$.
33:                    if $Q' \neq \emptyset$ then
34:                         Append $c'^{(l)}$ to the end of the queue $\mathcal{Q}^{(l)}$.
35:                         if $l = L$ then
36:                              Add $(c^{(1)}, \dots, c^{(l-1)}, c'^{(l)})$ to $\mathcal{V}$
37:                         else
38:                              Find a point $x' \in Q'$
39:                              HIERARCHICAL_TRAVERSE($x'$, $l+1$)
40:
41: HIERARCHICAL_TRAVERSE($x_0$, 1)
42: Return $\mathcal{V}$.
Algorithm 3 BFS-Based Hierarchical Polytopes Traversing in a Bounded Region

5.1 Local Adversarial Attacks

We define the local adversarial attack problem as finding the perturbation within a bounded region such that the model output is changed most adversarially. Here, we assume the model output to be a scalar and consider three regression cases with different types of response variable: continuous, binary, and categorical. The perturbation region $B$ is a convex set around the original sample. For example, we can allow certain features to increase or decrease by a certain amount, or we can use a norm ball (e.g., $\ell_1$, $\ell_2$, or $\ell_\infty$) centered at the original sample.

In the continuous response case, the one-dimensional output after the last linear layer of a ReLU NN is directly used as the prediction of the target variable. Denote the model function as $f$, the original sample as $x_0$, and the perturbation region as $B$. The local adversarial attack problem can be written as:

$\min_{x \in B} f(x) \quad \text{and} \quad \max_{x \in B} f(x),$   (15)

which means we need to find the range of the model outputs on $B$. We can traverse all local polytopes covered by $B$, find the model output range within each intersection $Q_c \cap B$, and then aggregate all the local results to get the final range. Finding the output range within each $Q_c \cap B$ is a convex problem with a linear objective function, so optimality is guaranteed within each polytope. Because our traversing algorithm covers all polytopes overlapping with $B$, the final solution also has guaranteed optimality.
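A minimal sketch of this per-polytope step for a one-hidden-layer network with a scalar output, reusing the helpers above; the region is assumed to be given in halfspace form A_B x <= b_B, and all names are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def local_output_range(codes, W1, b1, W2, b2, A_B, b_B):
    """Output range of the network over the region, aggregated over traversed polytopes:
    within each polytope the model is linear, so min/max are two LPs over Q_c ∩ B."""
    lo, hi = np.inf, -np.inf
    d = W1.shape[1]
    for c in codes:
        c = np.asarray(c)
        A_c, b_c = polytope_halfspaces(c, W1, b1)
        A = np.vstack([A_c, A_B])
        b = np.concatenate([b_c, b_B])
        W_loc, b_loc = local_linear_model(c, W1, b1, W2, b2)
        w, b0 = W_loc.ravel(), float(b_loc[0])       # scalar-output network
        res_min = linprog(w,  A_ub=A, b_ub=b, bounds=[(None, None)] * d, method="highs")
        res_max = linprog(-w, A_ub=A, b_ub=b, bounds=[(None, None)] * d, method="highs")
        if res_min.status == 0:
            lo = min(lo, res_min.fun + b0)
        if res_max.status == 0:
            hi = max(hi, -res_max.fun + b0)
    return lo, hi
```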

In the case of binary response, the one-dimensional output after the last linear layer of a ReLU NN is passed through a logistic/sigmoid function to predict the probability of a sample belonging to class 1. To conduct an adversarial attack, we minimize the predicted probability if the true response is 1, and maximize the prediction if the true response is 0:

$\min_{x \in B} \sigma(f(x)) \ \ \text{if } y = 1, \qquad \max_{x \in B} \sigma(f(x)) \ \ \text{if } y = 0,$   (16)

where $\sigma(\cdot)$ denotes the logistic function.

Because of the monotonicity of the logistic function, the minimizer and maximizer of the probabilistic output are also the minimizer and maximizer of the output after the last linear layer (i.e., the predicted log odds), making this case equivalent to the case of continuous response.

In the case of categorical response with levels 1 to $K$, the output after the last linear layer of a ReLU NN is in $\mathbb{R}^K$ and is passed through a softmax layer to be converted to probabilistic predictions of a sample belonging to each class. The adversarial sample is generated to minimize the predicted probability of the sample being in its true class. Within each local polytope, the linear models are given by (14), and the predicted probability of class $k$ can be minimized by finding the maximizer of the following optimization problem:

(17)

where $\tilde{w}_j^\top$ is the $j$th row of the matrix $\tilde{W}$ and $\tilde{b}_j$ is the $j$th element of $\tilde{b}$. Since the objective function in (17) is convex, the optimality of local adversarial attack with polytope traversing is guaranteed.

Figure 3.(a) demonstrates a local adversarial attack in the case of regression with binary response. The ReLU NN is the same as in Figure 1.(b), which predicts the probability of a sample belonging to class 1. The predictions across the whole domain are shown as the heat map. Within the region bounded by the black box, we find the minimum and maximum predictions and mark them in red and green, respectively. Due to the nature of linear models, the minimizer and maximizer always fall on intersections of partitioning hyperplanes and/or region boundaries.

5.2 Counterfactual sample generation

In classification problems, we are often interested in finding the smallest perturbation of a sample such that the model changes its class prediction. The magnitude of the perturbation is often measured by the $\ell_1$, $\ell_2$, or $\ell_\infty$ norm. The optimization problem can be written as:

$\min_{x} \|x - x_0\| \quad \text{s.t.} \quad F(x) \neq F(x_0),$   (18)

where $x_0$ is the original sample, $\|\cdot\|$ indicates a specific type of norm, and $F$ is a ReLU NN outputting class predictions.

We can adapt the polytope traversing algorithm to solve this problem. In the case of binary response, each local polytope has an associated hyperplane separating the two classes: $\tilde{W} x + \tilde{b} = t$, where $\tilde{W}$ and $\tilde{b}$ are given in (14) and $t$ is the threshold converting predicted log odds to a class. Finding the counterfactual sample within a local polytope can be written as a convex optimization problem:

$\min_{x \in Q_c \cap B} \|x - x_0\| \quad \text{s.t.} \quad (1 - 2 y_0)\left(\tilde{W} x + \tilde{b} - t\right) \geq 0,$   (19)

where $y_0$ is the original class (0 or 1) predicted by the model.

We start the traversing algorithm from the polytope where $x_0$ lies. In each polytope, we solve (19). It is possible that the entire polytope falls on one side of the class-separating hyperplane and (19) has no feasible solution. If a solution can be obtained, we compare it with the solutions in previously traversed polytopes and keep the one with the smallest perturbation. Furthermore, we use this perturbation magnitude to construct a new bounded traversing region around $x_0$. Because no point outside this region can have a smaller distance to the original point, once we finish traversing all the polytopes inside this region, the algorithm can conclude. In practice we often construct this dynamic traversing region as $B = \{x : \|x - x_0\| \leq \delta\}$, where $\delta$ is the smallest perturbation magnitude found so far. When solving (19) in the subsequent polytopes, we add $\|x - x_0\| \leq \delta$ to the constraints, and $B$ is updated whenever a smaller $\delta$ is found. Because the new traversing region is always a subset of the previous one, our BFS-based traversing algorithm covers all polytopes within the final traversing region under this dynamic setting. The final solution to (18) is guaranteed to be optimal, and the running time depends on how far the original point is from a class boundary.
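For the $\ell_\infty$ norm, the per-polytope problem (19) can be posed as an LP over the pair $(x, t)$ with $t \geq |x_j - x_{0,j}|$ for every coordinate. A sketch under the assumptions and helper functions above (all names illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def counterfactual_in_polytope_linf(x0, c, W1, b1, W2, b2, y0, thresh=0.0):
    """Smallest L_inf perturbation that stays inside polytope Q_c and pushes the local
    logit across `thresh` in the direction opposite to the original class y0 (0 or 1).
    Returns (x_cf, t) or None if no counterfactual exists in this polytope."""
    d = x0.shape[0]
    c = np.asarray(c)
    A_c, b_c = polytope_halfspaces(c, W1, b1)
    W_loc, b_loc = local_linear_model(c, W1, b1, W2, b2)
    w, b0 = W_loc.ravel(), float(b_loc[0])

    obj = np.zeros(d + 1)                     # decision variables z = (x, t); minimize t
    obj[-1] = 1.0
    if y0 == 0:                               # originally below threshold -> push logit above it
        flip_row, flip_rhs = -w, b0 - thresh
    else:                                     # originally above threshold -> push logit below it
        flip_row, flip_rhs = w, thresh - b0
    A_ub = np.vstack([
        np.hstack([A_c, np.zeros((A_c.shape[0], 1))]),   # x stays in Q_c
        np.hstack([np.eye(d), -np.ones((d, 1))]),        # x - x0 <= t
        np.hstack([-np.eye(d), -np.ones((d, 1))]),       # x0 - x <= t
        np.append(flip_row, 0.0)[None, :],               # class-flip constraint
    ])
    b_ub = np.concatenate([b_c, x0, -x0, [flip_rhs]])
    bounds = [(None, None)] * d + [(0, None)]
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if res.status != 0:
        return None
    return res.x[:d], res.x[-1]
```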

In the case of categorical response with levels 1 to $K$, the output after the last linear layer of a ReLU NN has $K$ dimensions, and the dimension with the largest value gives the predicted class. We ignore the softmax layer at the end because it does not change the rank of the dimensions. Assuming the original example is predicted to belong to class $k$, we generate counterfactual samples in the rest of the classes.

We consider one of these classes at a time and denote it as $j$. Within each of the ReLU NN's local polytopes, the linear model is given by (14). The area where a sample is predicted to be in class $j$ is enclosed by the intersection of halfspaces:

$\left\{ x : \left(\tilde{w}_j - \tilde{w}_i\right)^\top x + \left(\tilde{b}_j - \tilde{b}_i\right) \geq 0, \ \forall i \neq j \right\}$   (20)

Therefore, within each local polytope, we solve the convex optimization problem:

$\min_{x \in Q_c \cap B} \|x - x_0\| \quad \text{s.t.} \quad \left(\tilde{w}_j - \tilde{w}_i\right)^\top x + \left(\tilde{b}_j - \tilde{b}_i\right) \geq 0, \ \forall i \neq j.$   (21)

We compare all feasible solutions of (21) under different $j$ and keep the one counterfactual sample that is closest to $x_0$. The traversing procedure and the dynamic traversing-region update are the same as in the binary response case. Since (21) is convex, the final solution to (18) is guaranteed to be optimal.

Figure 3.(b) demonstrates counterfactual sample generation in the case of binary classification. The ReLU NN is the same as in Figure 1.(b) whose class decision boundaries are plotted in red. Given an original sample plotted as the black dot, we generate two counterfactual samples on the decision boundaries. The red dot has the smallest distance to the original point while the green dot has the smallest distance.

5.3 Local monotonicity verification

We can adapt the polytope traversing algorithm to verify whether a trained ReLU NN is monotonic w.r.t. certain features. We consider the regression cases with continuous and binary responses. In both cases, the output after the last linear layer is a scalar. Since the binary response case uses a logistic function at the end, which is itself monotonically increasing, we can ignore this additional function. The verification methods for the two cases are therefore equivalent.

To check whether the model is monotonic w.r.t. a specific feature within a bounded convex domain, we traverse the local polytopes covered by the domain. Since the model is linear within each polytope, we can easily check the monotonicity direction (increasing or decreasing) by checking the sign of the corresponding coefficient. After traversing all local polytopes covered by the domain, we check their agreement on the monotonicity direction. Since a ReLU NN produces a continuous function, if the local models are all monotonically increasing or all monotonically decreasing, the network is monotonic on the checked domain. If there is a disagreement in direction, the network is not monotonic. The verification algorithm based on polytope traversing not only provides the final monotonicity result but also tells us in which part of the domain monotonicity is violated.
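A minimal sketch of this check for a one-hidden-layer network with a scalar output, reusing the helpers above (names illustrative): it collects the polytopes whose local slope for the chosen feature is negative, i.e. those violating a monotonically increasing condition.

```python
import numpy as np

def monotonicity_violations(codes, feature_idx, W1, b1, W2, b2):
    """Return the polytope codes whose local linear model has a negative coefficient
    for `feature_idx`; an empty list means the network is monotonically increasing
    in that feature over the traversed polytopes."""
    violations = []
    for c in codes:
        W_loc, _ = local_linear_model(np.asarray(c), W1, b1, W2, b2)
        if W_loc.ravel()[feature_idx] < 0:
            violations.append(c)
    return violations
```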

Figure 3.(c) demonstrates local monotonicity verification in the case of regression with binary response. The ReLU NN is the same as in Figure 1.(b), which predicts the probability of a sample belonging to class 1. The predictions across the whole domain are shown as the heat map. We check whether the model is monotonically increasing w.r.t. the feature along the horizontal axis. The domain to check is bounded by the black box. Among the 5 polytopes overlapping with the domain, one violates the monotonically increasing condition and is marked in red.

5.4 Comparison with algorithms based on mixed-integer programming

The three applications above have traditionally been solved using MIP Anderson et al. (2020); Fischetti and Jo (2017); Liu et al. (2020); Tjeng, Xiao, and Tedrake (2018); Weng et al. (2018). Our algorithms based on polytope traversing have several advantages. First, our method exploits the topological structure created by ReLU NNs and fully explains the model behavior in small neighborhoods. For the exponentially many cases created by a ReLU NN with $n$ neurons, MIP eliminates the search branches using branch-and-bound. Our method, on the other hand, eliminates the search branches by checking the feasibility of the local polytopes and their adjacency. Since a small traversing region often covers a limited number of polytopes, our algorithm has a short running time when solving local problems.

Second, since our algorithm explicitly identifies and visits all the polytopes, the final results contain not only the optimal solution but also the whole picture of the model behavior, providing explainability to the often-so-called black-box model.

Third, our method requires only linear and convex programming solvers and no MIP solvers. Identifying adjacent polytopes requires only linear programming. Convex programming may be used to solve the sub-problem within a local polytope. Our algorithm allows us to incorporate any convex programming solver that is most suitable for the sub-problem, providing much freedom to customize.

Last, and probably most important, our algorithm is highly versatile and flexible. Within each local polytope, the model is linear, which is often the simplest type of model to work with. Any analysis that one runs on a linear model can be transplanted here and wrapped inside the polytope traversing algorithm. Therefore, our algorithm provides a unified framework to verify different properties of piecewise linear networks.

6 Conclusion

We explored the unique topological structure that ReLU NNs create in the input space; identified the adjacency among the partitioned local polytopes; developed a traversing algorithm based on this adjacency; and proved the thoroughness of polytope traversing. Our polytope traversing algorithm could be extended to other piecewise linear networks such as those containing convolutional or maxpooling layers.

References

  • Anderson et al. (2020) Anderson, R.; Huchette, J.; Ma, W.; Tjandraatmadja, C.; and Vielma, J. P. 2020. Strong mixed-integer programming formulations for trained neural networks. Mathematical Programming, 1–37.
  • Arora et al. (2018) Arora, R.; Basu, A.; Mianjy, P.; and Mukherjee, A. 2018. Understanding Deep Neural Networks with Rectified Linear Units. In International Conference on Learning Representations.
  • Athalye et al. (2018) Athalye, A.; Engstrom, L.; Ilyas, A.; and Kwok, K. 2018. Synthesizing robust adversarial examples. In International Conference on Machine Learning, 284–293. PMLR.
  • Bastani et al. (2016) Bastani, O.; Ioannou, Y.; Lampropoulos, L.; Vytiniotis, D.; Nori, A.; and Criminisi, A. 2016. Measuring neural net robustness with constraints. Advances in neural information processing systems, 29: 2613–2621.
  • Bunel et al. (2018) Bunel, R.; Turkaslan, I.; Torr, P. H.; Kohli, P.; and Kumar, M. P. 2018. A unified view of piecewise linear neural network verification. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 4795–4804.
  • Carlini and Wagner (2017) Carlini, N.; and Wagner, D. 2017. Towards evaluating the robustness of neural networks. In 2017 ieee symposium on security and privacy (sp), 39–57. IEEE.
  • Chu et al. (2018) Chu, L.; Hu, X.; Hu, J.; Wang, L.; and Pei, J. 2018. Exact and consistent interpretation for piecewise linear neural networks: A closed form solution. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1244–1253.
  • Daniels and Velikova (2010) Daniels, H.; and Velikova, M. 2010. Monotone and partially monotone neural networks. IEEE Transactions on Neural Networks, 21(6): 906–917.
  • Ehlers (2017) Ehlers, R. 2017. Formal verification of piece-wise linear feed-forward neural networks. In International Symposium on Automated Technology for Verification and Analysis, 269–286. Springer.
  • Fischetti and Jo (2017) Fischetti, M.; and Jo, J. 2017. Deep neural networks as 0-1 mixed integer linear programs: A feasibility study. arXiv preprint arXiv:1712.06174.
  • Glorot, Bordes, and Bengio (2011) Glorot, X.; Bordes, A.; and Bengio, Y. 2011. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 315–323. JMLR Workshop and Conference Proceedings.
  • Goodfellow, Shlens, and Szegedy (2014) Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
  • Gopinath et al. (2019) Gopinath, D.; Converse, H.; Pasareanu, C.; and Taly, A. 2019. Property inference for deep neural networks. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), 797–809. IEEE.
  • Gupta et al. (2019) Gupta, A.; Shukla, N.; Marla, L.; Kolbeinsson, A.; and Yellepeddi, K. 2019. How to Incorporate Monotonicity in Deep Networks While Preserving Flexibility? arXiv preprint arXiv:1909.10662.
  • Hanin and Rolnick (2019) Hanin, B.; and Rolnick, D. 2019. Deep ReLU Networks Have Surprisingly Few Activation Patterns. Advances in Neural Information Processing Systems, 32: 361–370.
  • Katz et al. (2017) Katz, G.; Barrett, C.; Dill, D. L.; Julian, K.; and Kochenderfer, M. J. 2017. Reluplex: An efficient SMT solver for verifying deep neural networks. In International Conference on Computer Aided Verification, 97–117. Springer.
  • Liu et al. (2019) Liu, C.; Arnon, T.; Lazarus, C.; Strong, C.; Barrett, C.; and Kochenderfer, M. J. 2019. Algorithms for verifying deep neural networks. arXiv preprint arXiv:1903.06758.
  • Liu et al. (2020) Liu, X.; Han, X.; Zhang, N.; and Liu, Q. 2020. Certified monotonic neural networks. arXiv preprint arXiv:2011.10219.
  • Lu et al. (2017) Lu, Z.; Pu, H.; Wang, F.; Hu, Z.; and Wang, L. 2017. The expressive power of neural networks: A view from the width. In Proceedings of the 31st International Conference on Neural Information Processing Systems, 6232–6240.
  • Montufar et al. (2014) Montufar, G. F.; Pascanu, R.; Cho, K.; and Bengio, Y. 2014. On the Number of Linear Regions of Deep Neural Networks. Advances in Neural Information Processing Systems, 27: 2924–2932.
  • Pulina and Tacchella (2010) Pulina, L.; and Tacchella, A. 2010. An abstraction-refinement approach to verification of artificial neural networks. In International Conference on Computer Aided Verification, 243–257. Springer.
  • Pulina and Tacchella (2012) Pulina, L.; and Tacchella, A. 2012. Challenging SMT solvers to verify neural networks. Ai Communications, 25(2): 117–135.
  • Schmidt-Hieber (2020) Schmidt-Hieber, J. 2020. Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics, 48(4): 1875–1897.
  • Serra, Tjandraatmadja, and Ramalingam (2018) Serra, T.; Tjandraatmadja, C.; and Ramalingam, S. 2018. Bounding and counting linear regions of deep neural networks. In International Conference on Machine Learning, 4558–4566. PMLR.
  • Sharma and Wehrheim (2020) Sharma, A.; and Wehrheim, H. 2020. Testing monotonicity of machine learning models. arXiv preprint arXiv:2002.12278.
  • Sudjianto et al. (2020) Sudjianto, A.; Knauth, W.; Singh, R.; Yang, Z.; and Zhang, A. 2020. Unwrapping The Black Box of Deep ReLU Networks: Interpretability, Diagnostics, and Simplification. arXiv preprint arXiv:2011.04041.
  • Szegedy et al. (2014) Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; and Fergus, R. 2014. Intriguing properties of neural networks. In 2nd International Conference on Learning Representations, ICLR 2014.
  • Telgen (1982) Telgen, J. 1982. Minimal representation of convex polyhedral sets. Journal of Optimization Theory and Applications, 38(1): 1–24.
  • Telgen (1983) Telgen, J. 1983. Identifying redundant constraints and implicit equalities in systems of linear constraints. Management Science, 29(10): 1209–1222.
  • Tjeng, Xiao, and Tedrake (2018) Tjeng, V.; Xiao, K. Y.; and Tedrake, R. 2018. Evaluating Robustness of Neural Networks with Mixed Integer Programming. In International Conference on Learning Representations.
  • Weng et al. (2018) Weng, L.; Zhang, H.; Chen, H.; Song, Z.; Hsieh, C.-J.; Daniel, L.; Boning, D.; and Dhillon, I. 2018. Towards fast computation of certified robustness for relu networks. In International Conference on Machine Learning, 5276–5285. PMLR.
  • Yang, Zhang, and Sudjianto (2020) Yang, Z.; Zhang, A.; and Sudjianto, A. 2020. Enhancing explainability of neural networks through architecture constraints. IEEE Transactions on Neural Networks and Learning Systems.
  • Zhao et al. (2021) Zhao, W.; Singh, R.; Joshi, T.; Sudjianto, A.; and Nair, V. N. 2021. Self-interpretable Convolutional Neural Networks for Text Classification. arXiv preprint arXiv:2105.08589.
  • Zou et al. (2020) Zou, D.; Cao, Y.; Zhou, D.; and Gu, Q. 2020. Gradient descent optimizes over-parameterized deep ReLU networks. Machine Learning, 109(3): 467–492.

7 Acknowledgments

The authors would like to thank Lin Dong, Linwei Hu, Rahul Singh, and Han Wang from Wells Fargo, and Sihan Zeng from Georgia Institute of Technology for their valuable inputs and feedback on this project.

8 Appendix

8.1 Proof of Lemma 3.1

Lemma 8.1

Given a set $S = \{x : w_i^\top x + b_i \geq 0,\ i = 1, \dots, n\} \neq \emptyset$, the inequality $w_j^\top x + b_j \geq 0$ is a redundant inequality if the new set formed by flipping this inequality is empty: $S'_j = \{x : w_j^\top x + b_j \leq 0;\ w_i^\top x + b_i \geq 0,\ i \neq j\} = \emptyset$.

Proof:

Let $S_{-j}$ be the set formed by removing inequality $j$: $S_{-j} = \{x : w_i^\top x + b_i \geq 0,\ i \neq j\}$. Then $S_{-j} = S \cup S'_j$. If $S'_j = \emptyset$, then $S_{-j} = S$ and the inequality satisfies Definition 3.1.

Note that the other direction of Lemma 3.1 may not hold. One example is when identical inequalities appear in the set: if the same inequality appears twice in $S$, both copies are redundant by definition. However, the procedure in Lemma 3.1 will not identify them as redundant, because flipping one copy leaves the (generally non-empty) set of points where the inequality holds with equality.

8.2 Proof of Theorem 4.1

Theorem 8.2

Given a ReLU NN with one hidden layer of $n$ neurons as specified in (1), Algorithm 1 covers all non-empty local polytopes created by the neural network. That is, for all $x \in \mathbb{R}^d$, there exists one $Q_c$ as defined in (3) such that $x \in Q_c$ and $c \in \mathcal{V}$, where $\mathcal{V}$ is the result returned by Algorithm 1.

Proof:

Since each partitioning hyperplane divides $\mathbb{R}^d$ into two halfspaces, the polytopes of all activation patterns encoded by $c \in \{0,1\}^n$ cover the entire input space. We construct a graph with $2^n$ nodes, each representing a possible polytope code. Some of the nodes may correspond to an empty set due to conflicting inequalities. For each pair of non-empty polytopes that are one-adjacent to each other, we add an edge between their corresponding nodes. What is left to prove is that any pair of non-empty polytopes is connected in this graph.

W.l.o.g. assume two nodes with codes $c$ and $c'$ that differ only in the first $k$ bits. Also assume the polytopes $Q_c$ and $Q_{c'}$ are both non-empty. We will show that there must exist a non-empty polytope that is one-adjacent to $Q_c$ with a code that differs from $c$ in one of the first $k$ bits. As a result, this polytope is one bit closer to $c'$.

We prove the claim above by contradiction. Assume the claim is not true: flipping any one of the first $k$ bits in $c$ yields an empty polytope. By Lemma 3.1, the inequalities $(2c_i - 1)(w_{1,i}^\top x + b_{1,i}) \geq 0$, $i = 1, \dots, k$, must all be redundant, which means they can be removed from the set of constraints Telgen (1982, 1983):

$Q_c = \left\{ x : (2c_i - 1)\left(w_{1,i}^\top x + b_{1,i}\right) \geq 0, \ i = k+1, \dots, n \right\}$   (22)

The derived relationship in (22), plus the assumption that all the one-bit-flipped polytopes must be empty, leads to the conclusion that $Q_{c'} = \emptyset$: any $x \in Q_{c'}$ satisfies the inequalities for $i > k$ and hence, by (22), lies in $Q_c$; it then satisfies the first inequality in both directions and would belong to the (empty) polytope obtained by flipping only the first bit. This contradicts the non-emptiness assumption.

Therefore, for any two non-empty polytopes $Q_c$ and $Q_{c'}$, we can create a path from $Q_c$ to $Q_{c'}$ by iteratively finding an intermediate polytope whose code is one bit closer to $c'$. Since the polytope graph covers the entire input space and all non-empty polytopes are connected, BFS guarantees the thoroughness of the traversing.