Robustness of Neural Networks to Parameter Quantization

03/26/2019 ∙ Abhishek Murthy et al. ∙ Texas Tech University

Quantization, a commonly used technique to reduce the memory footprint of a neural network for edge computing, entails reducing the precision of the floating-point representation used for the parameters of the network. The impact of such rounding-off errors on the overall performance of the neural network is estimated using testing, which is not exhaustive and thus cannot be used to guarantee the safety of the model. We present a framework based on Satisfiability Modulo Theory (SMT) solvers to quantify the robustness of neural networks to parameter perturbation. To this end, we introduce notions of local and global robustness that capture the deviation in the confidence of class assignments due to parameter quantization. The robustness notions are then cast as instances of SMT problems and solved automatically using solvers, such as dReal. We demonstrate our framework on two simple Multi-Layer Perceptrons (MLPs) that perform binary classification on a two-dimensional input. In addition to quantifying the robustness, we also show that Rectified Linear Unit (ReLU) activations result in higher robustness than linear activations for our MLPs.


1 Introduction

Neural networks entail interconnected computational nodes that transform weighted combinations of their inputs using nonlinear functions. The interconnections lead to compositional behavior at the network level, which enables neural networks to approximate highly nonlinear functions. The advent of the Backpropagation algorithm [9], the availability of large datasets [19], and optimized hardware [30] have led to widespread success in supervised and unsupervised learning.

Figure 1: Robustness analysis of a neural network enables safe parameter quantization.

Supervised learning of a neural network is the process of optimizing the network’s parameters using reference data. Supervised learning can be used to i) learn classifiers, which label an input as one of finitely many classes, and ii) learn the more general class of regressors, which capture relationships across continuous domains. Learning a model, also known as training, involves formulating a loss function that quantifies the performance of the model as a function of the parameters, and then minimizing this function over the reference data, also known as training data, using numerical techniques. Backpropagation is the most popular class of numerical techniques used to optimize the parameters of modern neural networks. Unsupervised learning, on the other hand, entails learning patterns and underlying distributions in unlabelled data.

Large networks contain millions of parameters and are trained using Graphics Processing Units (GPUs). Deploying trained neural networks in real-world production systems entails fetching the input from the user/client device, passing it through the neural network, also known as the forward pass, and obtaining the output, which could be a class label or a regressed value, in real time. Web services, which perform the forward pass in the cloud, can utilize the power of GPUs for time-sensitive calculations. The downside is that such applications suffer from i) the latency of sending the input to a remote server and waiting for the output of the neural network and ii) privacy concerns of exposing potentially sensitive inputs on the network.

An alternative design involves performing the forward pass on the client device (edge) by running the neural networks on it. This eliminates the network latencies and also avoids exposing the user’s inputs to the network. Running neural networks on edge devices, such as mobile phones, tablets, and low-power devices like wearables and Raspberry Pis, presents unique challenges. Storing the millions of parameters in floating-point representations incurs significant memory costs, and the computational power needed for the forward pass may be prohibitive. Executing complex neural networks on the low computational power and memory available on edge devices is a well-known challenge in the industry and thus is an active area of interest.

In addition to dedicated hardware for low-power devices, the community has evolved three main approaches to the problem of running neural networks on resource-constrained edge devices.

  1. Quantization of Parameters: The precision of the floating-point representation used to store the network parameters is reduced to lower the memory footprint of the network [14]; see the sketch after this list.

  2. Pruning: The edges, represented by the weights, between nodes that do not significantly influence the network’s output are set to 0 and thus removed from the network, resulting in a reduction in the memory footprint [25].

  3. Optimized Neural-Network Architectures: The network architecture is designed to reduce the number of floating-point operations, thereby reducing the running time of the forward pass; see [11] for an example.
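To make item 1 concrete, here is a minimal sketch of generic 8-bit affine quantization in NumPy. It illustrates the idea of trading precision for memory, assuming a float32 weight array; it is not the specific scheme of [14].

```python
import numpy as np

def quantize_uint8(w):
    """Map float32 weights to 256 levels (1 byte each) via affine quantization."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0 or 1.0  # guard against constant weight arrays
    q = np.round((w - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    """Recover approximate float32 weights for use in the forward pass."""
    return q.astype(np.float32) * scale + lo

w = np.random.randn(3, 2).astype(np.float32)       # stand-in for a layer's weights
q, scale, lo = quantize_uint8(w)
print(np.abs(dequantize(q, scale, lo) - w).max())  # worst-case rounding error
```

The worst-case rounding error printed above is exactly the kind of parameter perturbation whose effect on the network’s output our framework bounds.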

These techniques have emerged through empirical benchmarking and have so far found limited success. Today, only a handful of applications deploy neural networks on edge devices. The main reason for this lack of widespread adoption is the unpredictability of the aforementioned techniques in preserving the performance of the network after training. Specifically, the state of the art in estimating the impact of pruning and quantization on a network’s accuracy is limited to testing on a finite number of test cases.

In this paper, we introduce a framework to quantify the robustness of neural networks to parameter quantization, thereby automating the process of bounding the change in performance of the neural network.

We introduce notions of local and global robustness of networks to parameter changes. Given a bounded perturbation in the parameter vector, local robustness measures the maximal change in the confidence of class assignment for an input. Global robustness extends this notion to the entire input space. We cast these notions into instances of SMT problems and solve them automatically using solvers, such as dReal [31]. See Fig. 1 for an overview.

Robustness of neural networks has been an active area of research, but most authors have focused on input perturbations rather than parameter changes. Our framework focuses on parameter perturbations. In summary, the main contributions of our paper are as follows.

  • An automated framework is presented for bounding the deviation in the performance of neural networks due to parameter quantization. The framework enables the implementation of deep-learning-based applications on edge devices, like mobile phones, tablets and other embedded environments.

  • We present two case studies to demonstrate our framework: the parameters of a single-node logistic regression model and of two small MLPs that perform binary classification are perturbed, and their robustness is analyzed using our approach.

  • In addition to estimating parameter robustness, we also show that ReLU activations are more robust than linear activations for our MLPs.

The rest of the paper is organized as follows. Section 2 presents background on neural networks and SMT solvers. Section 3 introduces the theory of local and global robustness to parameter perturbations and Section 4 details the corresponding SMT problem formulations. Section 5 presents the case studies and their corresponding trained neural networks. Section 6 presents robustness analysis on the neural networks. Section 7 reviews related work and Section 8 presents our conclusions and the directions for future work.

2 Background

Figure 2: Weighted averaging, followed by nonlinear activation.

Every node of an NN performs two operations: weighted averaging of the inputs, and a nonlinear transformation of the weighted sum using a so-called activation function; see Fig. 2. Some commonly used activation functions are depicted in Fig. 3.

A neural network is formed by interconnecting several such nodes in different architectures. Each connection from node i to node j is characterized by the weight w_ij that scales the output of node i in the weighted average performed at node j. Typical architectures consist of layers of nodes connected to the nodes of the subsequent layer. The output of the neural network is a vector, with each entry representing the output of the corresponding node in the final layer. It is common to construct the network with k output nodes if the goal is to assign one (or more) of k possible class labels. Moreover, all the output values lie in [0, 1] and sum to 1. The input is assigned class label i if the value of the i-th output node is the highest among the k outputs.

Figure 3: Commonly used activation functions in neural networks.

We introduce the following notation for the parameter vector of a neural network. N_p denotes an instance of the neural network, with p being the vector of parameter assignments. Given an input x and the instance N_p of the neural network, the vector of outputs is returned by the function F(x, p). label(x, p) = argmax_i F_i(x, p) is the index of the highest output and thus corresponds to the class label assigned to the input.
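A minimal sketch of this notation in NumPy, assuming an illustrative one-hidden-layer network with softmax outputs (not one of the specific models of Section 5):

```python
import numpy as np

def F(x, p):
    """Output vector of a one-hidden-layer network; p = (W1, b1, W2, b2)."""
    W1, b1, W2, b2 = p
    h = np.maximum(0.0, W1 @ x + b1)  # hidden layer with ReLU activation
    z = W2 @ h + b2
    e = np.exp(z - z.max())
    return e / e.sum()                # entries lie in [0, 1] and sum to 1

def label(x, p):
    """Class label assigned to x: the index of the highest output."""
    return int(np.argmax(F(x, p)))

rng = np.random.default_rng(0)
p = (rng.normal(size=(3, 2)), np.zeros(3), rng.normal(size=(2, 3)), np.zeros(2))
x = np.array([1.0, 2.0])
print(F(x, p), label(x, p))
```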

The dReal Solver

The dReal tool [13] is an SMT solver [10] for nonlinear theories over the reals. The tool can handle first-order formulas defined by nonlinear real functions such as polynomials, trigonometric functions, exponential functions, etc. It implements the framework of δ-complete decision procedures [12], which has two possible outputs:

  • unsat: no variable assignment satisfies the formula.

  • δ-sat: there exists a variable assignment satisfying the formula if we allow a user-specified numerical perturbation δ.

We note that the satisfiability of first-order formulas over the reals is undecidable [5]. The tool is implemented in the framework of δ-complete analysis, which provides an algorithm for the originally undecidable problem by using approximation (the use of δ in the analysis).

The latest version of dReal [21] also implements Optimization Modulo Theories (OMT) [29, 7]. OMT is an extension of SMT that allows for finding models that optimize given objectives.
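A minimal usage sketch, assuming dReal’s Python bindings (the dreal pip package); the formula and objective below are arbitrary examples, not our robustness encodings:

```python
from dreal import And, CheckSatisfiability, Minimize, Variable, sin

x = Variable("x")
f = And(0 <= x, x <= 10, sin(x) > 0.9)

# delta-sat check with delta = 0.001: returns a model box, or None for unsat.
print(CheckSatisfiability(f, 0.001))

# OMT: minimize an objective subject to a constraint, up to delta.
print(Minimize(x * x + 2 * x, And(-5 <= x, x <= 5), 0.001))
```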

3 Parameter Robustness

In this section, we present several definitions of parameter robustness for neural networks.

We begin with a definition of parameter robustness locally at an input, similar to the local input robustness presented in [15, 32, 6].

Definition 1

An NN with parameter vector p is (δ, ε)-parameter robust locally at an input x if and only if:

∀p′ : ‖p′ − p‖ ≤ δ ⟹ ‖F(x, p′) − F(x, p)‖ ≤ ε.  (1)

Definition 1 gives a quantitative measure of the change in confidence of labeling a given input. This definition, however, does not cover all inputs in the input domain. The following definition addresses this:

Definition 2

An NN with parameter vector p is (δ, ε)-parameter robust globally for an input domain D if and only if:

∀x ∈ D, ∀p′ : ‖p′ − p‖ ≤ δ ⟹ ‖F(x, p′) − F(x, p)‖ ≤ ε.  (2)

Though the definitions of parameter robustness described above give a quantitative measure of the change in confidence, they do not say whether the decision label will actually change. For example, if the confidence value changes positively for a given label, the decision label will remain the same even though ε could be high. As a result, the above robustness measures only capture the relative change in confidence, not whether the actual label changes.

Now, we define parameter robustness notions that specify whether the actual label of an input changes. Both local and global versions are defined as follows:

Definition 3

An NN with parameter vector p is δ-parameter robust locally at an input x if and only if:

∀p′ : ‖p′ − p‖ ≤ δ ⟹ label(x, p′) = label(x, p).  (3)
Definition 4

An NN with parameter vector p is δ-parameter robust globally for an input domain D if and only if:

∀x ∈ D, ∀p′ : ‖p′ − p‖ ≤ δ ⟹ label(x, p′) = label(x, p).  (4)

Definition 4 states that for an NN to be δ-parameter robust globally, no input in the domain can be mislabeled. This is a very strict definition of robustness. In particular, when a quantization technique is applied to an NN, it is expected that the labels of some inputs will change, at least for inputs close to the decision boundary. To accommodate this, we slightly modify Definition 4 as follows:

Definition 5

An NN with parameter vector p is (δ, γ)-parameter robust globally for an input domain D if and only if:

∀x ∈ D, ∀p′ : ( ‖p′ − p‖ ≤ δ ∧ dist(x, L) ≥ γ ) ⟹ label(x, p′) = label(x, p),  (5)

where L denotes the level set of the confidence function that is used to label the input, i.e., L represents the decision boundary, and dist(x, L) denotes the distance of x from L.

The (δ, γ)-parameter robustness of an NN is illustrated in Fig. 4. The red line represents the decision boundary, and ‘−’ and ‘+’ represent the decision labels. The yellow lines are a distance γ away from the decision boundary. Definition 5 states that all the inputs that are γ or more away from the decision boundary (i.e., all the points either above the top yellow line or below the bottom yellow line) will be labeled the same in both N_p and N_{p′}. The inputs between the yellow lines, however, may be mislabeled, as illustrated by the points inside the yellow circles in the figure.

Figure 4: Illustration of (δ, γ)-parameter robustness on a two-class classifier.

4 Verification and Estimation of Parameter Robustness

In this section, we present how to verify and estimate parameter robustness using an SMT solver.

4.1 Verifying Parameter Robustness

We apply an SMT solver to verify all the parameter robustness notions defined in Section 3. The key idea is to construct a formula for each of them by negating its definition. The robustness property is then verified if the SMT solver returns unsat. The formulas given to the SMT solver for the parameter robustness notions are as follows:

  • To verify Eq. 1, we use the following formula:

    ∃p′ : ‖p′ − p‖ ≤ δ ∧ ‖F(x, p′) − F(x, p)‖ > ε.  (6)
  • To verify Eq. 2, we use the following formula:

    ∃x ∈ D, ∃p′ : ‖p′ − p‖ ≤ δ ∧ ‖F(x, p′) − F(x, p)‖ > ε,  (7)

    where we define the input domain as a bounding box, i.e., D = [l₁, u₁] × ⋯ × [lₙ, uₙ].

  • To verify Eq. 3, we use the following formula:

    ∃p′ : ‖p′ − p‖ ≤ δ ∧ ¬same(x, p, p′).  (8)

    Here, for the binary classifiers considered in this paper, we encode same(x, p, p′) as follows:

    same(x, p, p′) := ( F(x, p) ≥ 0.5 ∧ F(x, p′) ≥ 0.5 ) ∨ ( F(x, p) < 0.5 ∧ F(x, p′) < 0.5 ).

    That is, x falls on the same side of the decision boundary in both N_p and N_{p′}. For verification purposes, we consider its negation.

  • To verify Eq. 4, we use the following formula:

    ∃x ∈ D, ∃p′ : ‖p′ − p‖ ≤ δ ∧ ¬same(x, p, p′).  (9)
  • To verify Eq. 5, we use the following formula:

    ∃x ∈ D, ∃p′ : ‖p′ − p‖ ≤ δ ∧ dist(x, L) ≥ γ ∧ ¬same(x, p, p′).  (10)

We check all the robustness properties with the dReal solver [21].
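As a concrete illustration, the following is a hedged sketch of the negation formula (6) for a single-node sigmoid classifier in dReal’s Python bindings; the trained parameter values and the input below are placeholders, not the models of Section 5.

```python
import math
from dreal import And, CheckSatisfiability, Variable, exp

w1, w2, b = 0.5, -0.3, 0.1   # placeholder trained parameters
x1, x2 = 1.0, 2.0            # the input at which local robustness is checked
delta, eps = 0.01, 0.05

f0 = 1 / (1 + math.exp(-(w1 * x1 + w2 * x2 + b)))  # nominal confidence F(x, p)

p1, p2, pb = Variable("p1"), Variable("p2"), Variable("pb")
f1 = 1 / (1 + exp(-(p1 * x1 + p2 * x2 + pb)))      # perturbed confidence F(x, p')

box = And(p1 >= w1 - delta, p1 <= w1 + delta,
          p2 >= w2 - delta, p2 <= w2 + delta,
          pb >= b - delta,  pb <= b + delta)

# |F(x, p') - F(x, p)| > eps, split into two one-sided delta-sat checks.
up = CheckSatisfiability(And(box, f1 - f0 > eps), 1e-4)
down = CheckSatisfiability(And(box, f0 - f1 > eps), 1e-4)

# Both unsat (None) means (delta, eps)-parameter robust locally at (x1, x2).
print("robust" if up is None and down is None else "not robust")
```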

4.2 Estimating Maximum Parameter Robustness

For (δ, ε)-parameter robustness, we allow a δ-perturbation of the parameters and check whether the change in confidence value is bounded by ε. The estimation problem is defined as computing the maximum possible value of ε for a given value of δ. We are interested in this estimation problem because the maximum value of ε represents the least robustness for a given δ. The estimation problem can be formulated as an optimization problem as follows:

  • ε-estimation for (δ, ε)-parameter robustness locally at x:

    ε_max = −min_{p′} ( −‖F(x, p′) − F(x, p)‖ )  (11)

    subject to: ‖p′ − p‖ ≤ δ,

    where ε_max is the maximum value of ε. Note that instead of maximizing ε, we minimize its negation, as the SMT solver we use implements only minimization.

  • ε-estimation for (δ, ε)-parameter robustness globally for D:

    ε_max = −min_{x, p′} ( −‖F(x, p′) − F(x, p)‖ )  (12)

    subject to: x ∈ D and ‖p′ − p‖ ≤ δ.

Similarly, for (δ, γ)-parameter robustness, we consider the estimation problem for γ. For a given value δ, we want to maximize γ, which tells us how far the boundary needs to be shifted so that no input beyond it can be mislabeled. We formulate this estimation problem as follows:

γ_max = −min_{x, p′} ( −dist(x, L) )  (13)

subject to: x ∈ D, ‖p′ − p‖ ≤ δ, and ¬same(x, p, p′),

where the maximum value of the objective, γ_max, is the maximum value of γ.
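A sketch of the ε-estimation of Eq. (11) for the same placeholder single-node classifier, assuming dReal’s Minimize API; as discussed above, the objective is the negated confidence change.

```python
import math
from dreal import And, Minimize, Variable, exp

w1, w2, b = 0.5, -0.3, 0.1   # placeholder trained parameters
x1, x2 = 1.0, 2.0
delta = 0.01

f0 = 1 / (1 + math.exp(-(w1 * x1 + w2 * x2 + b)))

p1, p2, pb = Variable("p1"), Variable("p2"), Variable("pb")
f1 = 1 / (1 + exp(-(p1 * x1 + p2 * x2 + pb)))

box = And(p1 >= w1 - delta, p1 <= w1 + delta,
          p2 >= w2 - delta, p2 <= w2 + delta,
          pb >= b - delta,  pb <= b + delta)

# Maximize f1 - f0 by minimizing its negation; running the same call with
# f0 - f1 and taking the larger optimum yields eps_max.
print(Minimize(-(f1 - f0), box, 1e-4))
```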

5 Case Studies

We describe two datasets and the corresponding neural networks as case studies for our robustness analysis framework.

Figure 5: Two MLPs trained on the athletes dataset: (a) ReLU activation; (b) linear activation.

The first dataset, known as cats, contains the height, weight, and gender of 144 domesticated cats (47 female and 97 male) [2]. The gender identification problem entails learning a classifier to estimate whether a cat is male or female based on its height and weight. We present a simple one-node model that implements logistic regression and examine its robustness.

Given the height (h) and the weight (w) of a cat, the classifier, learned using Python Scikit-Learn 0.20.3, is given by F = σ(w₁h + w₂w + b), where σ(z) = 1/(1 + e^{−z}). We assign the class label “Male” if F > 0.5 and “Female” otherwise. The parameters w₁, w₂, and b of the model were learned on 78% of the data. The multinomial loss function was optimized using the lbfgs algorithm [28]. The testing accuracy on the remaining 22% was 87.5%.
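A minimal sketch of fitting such a classifier with Scikit-Learn’s lbfgs solver; X and y are placeholders for the cats data, and the 78/22 split mirrors the description above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.random.rand(144, 2)             # placeholder for (height, weight)
y = np.random.randint(0, 2, size=144)  # placeholder for the gender labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.78, random_state=0)
clf = LogisticRegression(solver="lbfgs").fit(X_tr, y_tr)

print(clf.coef_, clf.intercept_)  # the weights and bias analyzed in Section 6
print(clf.score(X_te, y_te))      # held-out accuracy
```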

A second dataset contains the official statistics on the 11,538 athletes (6,333 men and 5,205 women) that participated in the 2016 Olympic Games at Rio de Janeiro [1]. Each row contains an id, and the name, nationality, gender, date of birth, height, weight, and sport of the athlete, along with the medals tally. The gender identification problem entails learning an MLP to guess the gender of an athlete based on their height and weight. We present two MLPs for this problem and examine their robustness in the next section.

Given the height and weight of an athlete as the input, the MLPs are constructed using two layers: a hidden layer and an output node. The three nodes that make up the hidden layer perform weighted averaging of the inputs and transform them using a nonlinear activation. Their outputs are then fed to the output node, which again takes a weighted average and applies the sigmoid activation to obtain a number between 0 and 1. If the output is greater than 0.5, the input is assigned “Male”; otherwise it is assigned “Female”. We implemented two variations of the model, using ReLU and linear activations, respectively, in the three nodes of the hidden layer. The two models and their parameters are illustrated in Fig. 5.

The models were implemented in Keras and trained on a GPU-based instance of Amazon Web Services. The training accuracy after 200 epochs was 77.19% and 77.36% for the ReLU and linear versions, respectively.
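A sketch of the two MLPs in Keras, following the architecture described above; the optimizer and loss are illustrative assumptions, as they are not stated in the paper.

```python
from tensorflow import keras

def build_mlp(hidden_activation):
    """Two-layer MLP: a 3-node hidden layer and a sigmoid output node."""
    model = keras.Sequential([
        keras.layers.Dense(3, activation=hidden_activation, input_shape=(2,)),
        keras.layers.Dense(1, activation="sigmoid"),  # > 0.5 means "Male"
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",  # assumed settings
                  metrics=["accuracy"])
    return model

relu_mlp = build_mlp("relu")      # the ATH-ReLU model
linear_mlp = build_mlp("linear")  # the ATH-Linear model
# relu_mlp.fit(X, y, epochs=200)  # both models were trained for 200 epochs
```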

In the next section, we apply our robustness analysis framework to examine the effect of quantizing the parameters of the logistic regression model for the Cats dataset and the two MLPs for the athletes dataset.

6 Results

In this section, we discuss our results. For all three NNs (one for the first case study and two for the second), we present results for (δ, ε)- and (δ, γ)-parameter robustness, both locally at an input and globally over the input domain.

Estimated ε_max for (δ, ε)-parameter robustness:

δ       CAT       ATH-ReLU   ATH-Linear
0.005   0.00691   0.166      0.545
0.01    0.05054   0.0825     0.219

Estimated γ_max for (δ, γ)-parameter robustness, as (male, female):

δ       CAT              ATH-ReLU         ATH-Linear
0.005   (0.024, 0.021)   (0.082, 0.076)   (0.268, 0.218)
0.01    (0.052, 0.04)    (0.165, 0.144)   (0.44, 0.34)

Table 1: Estimated values of ε_max and γ_max for (δ, ε)- and (δ, γ)-parameter robustness, respectively, globally for the input domains.

Table 1 shows the estimated values of ε_max for (δ, ε)-parameter robustness, computed using Eq. (12), and of γ_max for (δ, γ)-parameter robustness, computed using Eq. (13), over the entire input domains. We compute them for two different δ values. The columns CAT, ATH-ReLU, and ATH-Linear represent the results for the cat classifier, the athlete classifier with ReLU activation, and the athlete classifier with linear activation, respectively. The tuples in the table for γ_max represent the values for the male and female classes, respectively. Comparing the results of ATH-ReLU and ATH-Linear, it is clear that the former classifier is much more robust than the latter to perturbation of the parameter values.

Figure 6: Parameter robustness analysis of the cat classifier: (a, b) (δ, ε)-parameter robustness for the two δ values (0.005 and 0.01); (c, d) δ- and (δ, γ)-parameter robustness for the same two δ values.

Fig. 6 illustrates the parameter robustness of the cat classifier. For (δ, ε)-parameter robustness, we chose two different δ values (0.005 and 0.01). For both cases, we randomly selected points from the input domain. We then computed ε_max for all inputs using Eq. (11). Fig. 6(a, b) shows (δ, ε)-parameter robustness locally at each randomly selected point. The blue line represents the decision boundary of the NN, and the colorbar represents the range of ε_max. It is clear from the figures that the ε_max values are higher in the bottom-right region, which means that region is more susceptible to mislabeling in the perturbed network. Note that a high ε_max does not mean that the input would actually be mislabeled (see the explanation in Section 3).
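The experiment behind Fig. 6(a, b) can be sketched as follows. For brevity, the solver call of Eq. (11) is replaced here by Monte Carlo sampling of the parameter box, which only under-approximates ε_max; the parameters and input domain are placeholders.

```python
import numpy as np

def confidence(x, w, b):
    """Placeholder single-node classifier, as in the Section 4 sketches."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

def eps_estimate(x, w, b, delta, rng, n=500):
    """Monte Carlo under-approximation of Eq. (11): sample parameter
    perturbations in the delta-box instead of calling the solver."""
    base = confidence(x, w, b)
    ws = w + rng.uniform(-delta, delta, size=(n, w.size))
    bs = b + rng.uniform(-delta, delta, size=n)
    perturbed = 1.0 / (1.0 + np.exp(-(ws @ x + bs)))
    return float(np.max(np.abs(perturbed - base)))

rng = np.random.default_rng(0)
w, b = np.array([0.5, -0.3]), 0.1                    # placeholder parameters
points = rng.uniform([0, 0], [1, 1], size=(200, 2))  # placeholder input box
eps_values = [eps_estimate(x, w, b, 0.005, rng) for x in points]
# eps_values drive the colorbar of Fig. 6(a); delta = 0.01 gives Fig. 6(b).
```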

Fig. 6(c, d) illustrates both δ- and (δ, γ)-parameter robustness for the two δ values. For δ-parameter robustness, we selected random inputs from the domain. We then checked whether the label of each input can be flipped in the perturbed network using Eq. (8). In the figures, green and red points represent non-flippable and flippable inputs, respectively. The top (bottom) red line is generated by adding (subtracting) γ_max to (from) the decision boundary, where γ_max is computed using Eq. (13).

Figure 7: Parameter robustness analysis of the athlete classifier with ReLU activation: (a, b) (δ, ε)-parameter robustness for the two δ values; (c, d) δ- and (δ, γ)-parameter robustness for the same two δ values.

Figs. 7 and 8 illustrate the parameter robustness analysis of the athlete classifier with ReLU and linear activation, respectively. Comparing these figures, we can conclude that the athlete classifier with ReLU activation is much more robust than the classifier with linear activation.

Figure 8: Parameter robustness analysis of the athlete classifier with linear activation: (a, b) (δ, ε)-parameter robustness for the two δ values; (c, d) δ- and (δ, γ)-parameter robustness for the same two δ values.

7 Related Work

Robustness analysis of neural networks is an active area of research. In this section, we compare and contrast some recent papers with our framework. Robustness typically refers to an NN’s ability to handle perturbations in the input data. Efforts to characterize robustness can be broadly classified into two types: model-centric approaches and data-centric approaches.

Model-centric approaches focus on improving the problem formulation to construct robust networks. Distillation training, one of the earliest attempts, entails training one model to predict the output probabilities of another model that was trained on an earlier, baseline standard to emphasize accuracy [16, 27]. In [26], the authors proposed a new set of attacks for the L0, L2, and L∞ distance metrics to construct upper bounds on the robustness of neural networks and thereby demonstrate that defensive distillation is limited in handling adversarial examples. Adversarial perturbations, random noise, and geometric transformations were studied in [4], and the authors highlight close connections between the robustness to additive perturbations and geometric properties of the classifier’s decision boundary, such as its curvature. Spatial Transformer Networks, which entail geometric transformation of a network’s filter maps, were proposed in [18] to improve robustness to geometric perturbations. Recently, a generic analysis framework, CROWN, was proposed to certify NNs using linear or quadratic upper and lower bounds for general activation functions [17]. The authors extended their work to overcome the limitation to simple fully-connected layers and ReLU activations and proposed CNN-Cert. The new framework can handle various architectures including convolutional layers, max-pooling layers, batch normalization layers, and residual blocks, as well as general activation functions, and is capable of certifying robustness of general convolutional neural networks [3].

Data-centric approaches entail identifying and rejecting perturbed samples, or augmenting the training data to handle perturbations appropriately. Binary detector networks that can spot adversarial samples [24, 22] and data augmentation that reflects different lighting conditions [20] are typical examples. Additionally, robust optimization using a saddle-point (min-max) formulation [23] and region-based classification that assembles information in a hypercube centered at the input [8] have also shown promising results. The above-mentioned approaches focus on perturbations of the data, but our framework focuses on perturbations of the parameters, with the end goal of safely implementing neural networks on resource-constrained platforms.

8 Conclusions and Directions for Future Work

We presented a framework to automatically estimate the impact of rounding-off errors in the parameters of a neural network. The framework uses SMT solvers to estimate the local and global robustness of a given network. We applied our framework to a single-node logistic regression model and two small MLPs. In the future, we will consider convolutional neural networks and investigate the scalability of our framework to larger parameter vectors. Compositionality will be critical to analyzing real-world neural networks, and we will explore extending the theory of approximate bisimulation and the related Lyapunov-like functions to our problem.

References