R3Net: Random Weights, Rectifier Linear Units and Robustness for Artificial Neural Network

We consider a neural network architecture with randomized features and a sign-splitter, followed by rectified linear units (ReLU). We prove that our architecture is robust to input perturbation: the output feature of the neural network is Lipschitz continuous with respect to the input perturbation. We further show that the network output has discrimination ability: inputs that are not arbitrarily close generate output vectors whose mutual distance obeys a certain lower bound. This ensures that two different inputs remain discriminable even though distances are contracted in the output feature space.

I Introduction

Neural networks and deep learning architectures have revolutionized data analysis over the last decade [1]. Appropriately trained neural networks have been shown to excel in classification and regression tasks, in many cases outperforming humans [2, 3]. The field is continually being enriched by active research that pushes classification performance to increasingly higher levels. The rapidly increasing computational power and data storage have only added to the power of neural networks. However, very little is known about why the networks are able to achieve this superior performance. In addition, it is known that learnt neural networks can be fragile when it comes to handling perturbations in the input [4, 5, 6, 7, 8]. It is hypothesized that this is because the layers of the network are trained to fit the data closely. For example, in image classification, additive noise of very low power added to images has been known to disproportionately change the class labels, even when the additive noise is practically unnoticeable to human eyes [5]. Such instability makes the network an easy target for adversarial attacks and hacking [7]. This observation has led many researchers to investigate and develop deep networks with features that exhibit robustness to deformation or perturbation by building on invariances [9, 10, 11, 12, 13].

Randomness of features has been used with great success as a means of reducing the computational complexity of neural networks while achieving performance comparable to that of fully learnt networks [14, 15, 16]. In the case of the simple, yet effective, extreme learning machine (ELM), all layers of the network are assigned randomly chosen weights and the learning takes place only at the extreme layer [17, 18, 19]. It has also been shown recently that a performance similar to fully learnt networks may be achieved by training a network with most of the weights assigned randomly and only a small fraction of them being updated throughout the layers [20]. These approaches indicate that randomness has much potential for high performance at low computational complexity. This motivates us to propose a neural network architecture that uses random weights in the layers followed by a structured sign splitter and rectified linear unit (ReLU) activation functions. We show that the output of each layer exhibits robustness to perturbation in the input: the perturbation of the output of each layer has an upper and a lower bound in terms of the input perturbation. We believe that this is a step towards mathematically explaining the efficiency of random weights and ReLUs in neural networks observed in practice. We name the proposed architecture R3Net, motivated by the words 'random weights', 'rectifier linear units', and 'robustness'. In this article, we show only analytical results and refrain from providing simulation results. Simulation results will be shown in an extended manuscript later.

I-A Notation

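For a scalar $x$, we denote its sign as $s(x)$ and its magnitude as $|x|$. Then, we have $x = s(x)\,|x|$. The sign takes values in the set $\{-1, 0, +1\}$. For a vector $\mathbf{x}$, the corresponding sign vector is found by component-wise operation and is denoted by $s(\mathbf{x})$. We use $|\mathbf{x}|$ to denote the magnitude of a real vector, where the magnitude is taken component-wise. For a vector $\mathbf{x}$, we denote the non-negative part by $\mathbf{x}^{+}$ and the non-positive part by $\mathbf{x}^{-}$, such that $\mathbf{x} = \mathbf{x}^{+} + \mathbf{x}^{-}$. We denote the ReLU function by $g(\cdot)$ such that $g(x) = \max(x, 0)$. We then denote by $g(\mathbf{x})$ the stack of ReLU activation functions applied component-wise on $\mathbf{x}$. Therefore, $g(\mathbf{x}) = \mathbf{x}^{+}$. We use $\mathcal{M}$ to denote a set and $\mathcal{M}^{c}$ to denote its complement set. The cardinality of a set $\mathcal{M}$ is denoted by $|\mathcal{M}|$. We use $\|\cdot\|$ to denote the $\ell_2$ norm of a vector, and $\|\cdot\|_F$ to denote the Frobenius norm.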
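The sign, magnitude and positive/negative-part conventions can be checked on a small vector. The following is a minimal sketch in plain Python; the helper names (`sign`, `relu`) and the example values are assumptions made for illustration and do not come from the paper.

```python
def sign(a):
    """Sign of a scalar: -1, 0 or +1."""
    return (a > 0) - (a < 0)


def relu(v):
    """Component-wise ReLU of a list of floats."""
    return [max(a, 0.0) for a in v]


x = [1.5, -2.0, 0.0, 0.25]           # illustrative vector

signs = [sign(a) for a in x]          # component-wise sign vector
mags = [abs(a) for a in x]            # component-wise magnitude
pos_part = [max(a, 0.0) for a in x]   # non-negative part
neg_part = [min(a, 0.0) for a in x]   # non-positive part

# x is recovered from sign times magnitude, and from the sum of its parts.
recombined = [s * m for s, m in zip(signs, mags)]
summed = [p + q for p, q in zip(pos_part, neg_part)]

print(recombined == x, summed == x, relu(x) == pos_part)
```

All three comparisons hold for this vector, which matches the identities stated in the notation paragraph.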

II Noise robustness and discrimination ability

It is well known that ANNs built as a chain of blocks comprising linear and nonlinear transformations achieve impressive performance given large amounts of reliable training data. However, as shown recently, this is not sufficient to guarantee that the ANN is stable [4, 5, 6, 7, 8]. For an ANN to be stable, it is desirable that it possesses noise robustness. Let $\mathbf{x}_1$ and $\mathbf{x}_2$ be two input vectors such that $\mathbf{x}_1 \neq \mathbf{x}_2$, and let the corresponding feature vectors generated by the ANN be $\tilde{\mathbf{y}}_1 = f(\mathbf{x}_1)$ and $\tilde{\mathbf{y}}_2 = f(\mathbf{x}_2)$. To characterize a scenario with input perturbation noise, we treat the difference $\mathbf{x}_1 - \mathbf{x}_2$ as the perturbation. Then, the desired property of the ANN in terms of robustness is expressible as

$$\|\tilde{\mathbf{y}}_1 - \tilde{\mathbf{y}}_2\|^2 = \|f(\mathbf{x}_1) - f(\mathbf{x}_2)\|^2 \le B\,\|\mathbf{x}_1 - \mathbf{x}_2\|^2, \tag{1}$$

where $B$ is a finite positive constant. Further, it is often desirable that the feature vectors maintain a certain minimum distance between each other if the input vectors are different. In other words, we would like to have the following property:

$$A\,\|\mathbf{x}_1 - \mathbf{x}_2\|^2 \le \|\tilde{\mathbf{y}}_1 - \tilde{\mathbf{y}}_2\|^2, \tag{2}$$

where $A$ is a positive constant. This ensures that the output feature vectors do not come arbitrarily close when the inputs are not close, so that it is possible to discriminate one feature from the other. The upper bound provides noise robustness: the perturbation in the output is at most a constant multiple of the input perturbation $\|\mathbf{x}_1 - \mathbf{x}_2\|^2$.
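Properties (1) and (2) together say that the ratio of squared output distance to squared input distance stays between two positive constants. The sketch below computes that ratio for a given feature map; the map `scale_by_two` is a hypothetical stand-in chosen only so that the ratio is easy to verify by hand, and is not the network studied in the paper.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def distance_ratio(f, x1, x2):
    """Ratio ||f(x1) - f(x2)||^2 / ||x1 - x2||^2 for distinct x1 and x2."""
    return sq_dist(f(x1), f(x2)) / sq_dist(x1, x2)


def scale_by_two(x):
    """Hypothetical feature map used only for illustration."""
    return [2.0 * a for a in x]


x1 = [1.0, -2.0, 0.5]
x2 = [0.0, 1.0, 0.5]
ratio = distance_ratio(scale_by_two, x1, x2)
print(ratio)
```

For this linear map the ratio is 4 for every pair of inputs, so both bounds hold with the same constant. For a nonlinear map the ratio varies with the pair, and the two constants bracket its range.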

III Single Block Construction

In order to investigate the desired properties, we first consider a single block of the ANN, usually referred to as a layer in the neural network literature. The block has an input vector $\mathbf{q}$ and an output vector $\mathbf{y} = g(\mathbf{W}\mathbf{q})$, where $\mathbf{W}$ is the linear transform or weight matrix and $g(\cdot)$ is the component-wise nonlinearity (the ReLU in our case). The dimension of $\mathbf{y}$ is the number of neurons in the block. If we can ensure that one block of the ANN provides both noise robustness and point discrimination, then the full ANN comprising multiple blocks connected sequentially is guaranteed to have both properties. This argument boils down to the construction of a matrix $\mathbf{W}$ which promotes noise robustness and discriminative power in each block.

III-A ReLU function: Properties and a limitation

We now discuss some properties and a limitation of the ReLU. The ReLU operation on a scalar $x$ is given by

$$g(x) \triangleq \max(x, 0).$$

As a consequence, the vector transformation consisting of component-wise ReLU operations has the following properties.

Property 1.

The ReLU function provides a sparse output vector $\mathbf{y}$, since every non-positive component of its input is mapped to zero.

Property 2.

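Let us consider $\mathbf{y} = g(\mathbf{z})$ with $\mathbf{z} = \mathbf{W}\mathbf{q}$. For two vectors $\mathbf{q}_1$ and $\mathbf{q}_2$, we have corresponding vectors $\mathbf{z}_1$ and $\mathbf{z}_2$, and output vectors $\mathbf{y}_1$ and $\mathbf{y}_2$. Then, we have the following relation

$$0 \le \|\mathbf{y}_1 - \mathbf{y}_2\|^2 = \|g(\mathbf{z}_1) - g(\mathbf{z}_2)\|^2 \le \|\mathbf{z}_1 - \mathbf{z}_2\|^2. \tag{3}$$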
Proof.

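For scalars $x_1$ and $x_2$, we have $y_1 = g(x_1)$ and $y_2 = g(x_2)$. We have the following relation

$$(y_1 - y_2)^2 = \begin{cases} (x_1 - x_2)^2 & \text{if } x_1 > 0,\ x_2 > 0, \\ x_1^2 & \text{if } x_1 > 0,\ x_2 \le 0, \\ x_2^2 & \text{if } x_1 \le 0,\ x_2 > 0, \\ 0 & \text{if } x_1 \le 0,\ x_2 \le 0. \end{cases}$$

Therefore, we observe that the ReLU function satisfies

$$0 \le (y_1 - y_2)^2 \le (x_1 - x_2)^2.$$

Then, considering the vectors $\mathbf{z}_1$ and $\mathbf{z}_2$, we have that

$$0 \le \|\mathbf{y}_1 - \mathbf{y}_2\|^2 = \sum_i \big(y_1(i) - y_2(i)\big)^2 \le \sum_i \big(z_1(i) - z_2(i)\big)^2 = \|\mathbf{z}_1 - \mathbf{z}_2\|^2,$$

where $z_1(i)$ is the $i$'th scalar element of $\mathbf{z}_1$ and $z_2(i)$ is the $i$'th scalar element of $\mathbf{z}_2$. ∎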

The upper bound relation in Property 2 implies that the ReLU is Lipschitz continuous with Lipschitz constant 1. This shows that the output perturbation of the ReLU is bounded by the input perturbation, thereby providing noise robustness. On the other hand, the lower bound being zero does not support our need of maintaining a minimum distance between two points $\mathbf{y}_1$ and $\mathbf{y}_2$. An example of the extreme effect is the case when $\mathbf{z}_1$ and $\mathbf{z}_2$ are non-positive vectors, and we get $\|\mathbf{y}_1 - \mathbf{y}_2\|^2 = 0$. This is then a limitation of the ReLU in achieving good discriminative power.
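Both the Lipschitz bound and the collapse for non-positive inputs can be seen numerically. The following is a small sketch in plain Python; the vectors are arbitrary illustrative values, not data from the paper.

```python
def relu(v):
    """Component-wise ReLU of a list of floats."""
    return [max(a, 0.0) for a in v]


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Mixed-sign vectors: the output distance never exceeds the input distance.
z1 = [1.5, -0.3, 2.0, -1.0]
z2 = [0.5, 0.4, -1.0, -2.0]
in_dist = sq_dist(z1, z2)
out_dist = sq_dist(relu(z1), relu(z2))

# Two different non-positive vectors: both are mapped to the zero vector.
n1 = [-1.0, -2.0]
n2 = [-3.0, -0.5]
collapsed = sq_dist(relu(n1), relu(n2))

print(out_dist <= in_dist, collapsed)
```

The second pair has a strictly positive input distance but an output distance of zero, which is the loss of discriminative power described above.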

III-B Overcoming the limitation

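We now engineer a remedy for the limitation of the ReLU. Let us consider $\mathbf{z} = \mathbf{W}\mathbf{q}$ and $\bar{\mathbf{y}} = g(\mathbf{V}\mathbf{z})$, where $\mathbf{V}$ is a linear transform matrix. In other words, we introduce an additional linear transform after $\mathbf{W}$ in the block. For two vectors $\mathbf{q}_1$ and $\mathbf{q}_2$, we have the corresponding vectors $\mathbf{z}_1$ and $\mathbf{z}_2$, and output vectors $\bar{\mathbf{y}}_1$ and $\bar{\mathbf{y}}_2$. Our interest is to show that there exists a matrix $\mathbf{V}$ for which we have both noise robustness and discriminative power when the nonlinearity is the ReLU.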

Proposition 1.

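Let us construct a matrix $\mathbf{V}$ as follows

$$\mathbf{V} = \begin{bmatrix} \mathbf{I}_n \\ -\mathbf{I}_n \end{bmatrix} \triangleq \mathbf{V}_n. \tag{4}$$

For the output vectors $\bar{\mathbf{y}}_1$ and $\bar{\mathbf{y}}_2$, we have

$$0 < \kappa\,\|\mathbf{z}_1 - \mathbf{z}_2\|^2 \le \|\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2\|^2 \le \|\mathbf{z}_1 - \mathbf{z}_2\|^2, \tag{5}$$

where $0 < \kappa \le 1$ and $\kappa$ is a function of $\mathbf{z}_1$ and $\mathbf{z}_2$.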

Proof.

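We have $\bar{\mathbf{y}}_1 = g(\mathbf{V}_n\mathbf{z}_1)$ and $\bar{\mathbf{y}}_2 = g(\mathbf{V}_n\mathbf{z}_2)$, where

$$\mathbf{V}_n = \begin{bmatrix} \mathbf{I}_n \\ -\mathbf{I}_n \end{bmatrix}.$$

Let us define a set

$$\mathcal{M}(\mathbf{z}_1, \mathbf{z}_2) = \{\, i \mid s(z_1(i)) = s(z_2(i)) \neq 0 \,\} \subseteq \{1, 2, \ldots, n\}. \tag{6}$$

Then, we have

$$\begin{aligned} \|\mathbf{z}_1 - \mathbf{z}_2\|^2 &= \sum_{i=1}^{n} \big(z_1(i) - z_2(i)\big)^2 \\ &= \sum_{i} \big(s(z_1(i))\,|z_1(i)| - s(z_2(i))\,|z_2(i)|\big)^2 \\ &= \sum_{i \in \mathcal{M}(\mathbf{z}_1, \mathbf{z}_2)} \big(|z_1(i)| - |z_2(i)|\big)^2 + \sum_{i \in \mathcal{M}^{c}(\mathbf{z}_1, \mathbf{z}_2)} \big(|z_1(i)| + |z_2(i)|\big)^2. \end{aligned} \tag{7}$$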

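Expressing the vectors in terms of their non-negative and non-positive parts, we have the outputs of the ReLU operation as follows

$$\bar{\mathbf{y}}_1 = g(\mathbf{V}_n\mathbf{z}_1) = \begin{bmatrix} |\mathbf{z}_1^{+}| \\ |\mathbf{z}_1^{-}| \end{bmatrix} \quad \text{and} \quad \bar{\mathbf{y}}_2 = g(\mathbf{V}_n\mathbf{z}_2) = \begin{bmatrix} |\mathbf{z}_2^{+}| \\ |\mathbf{z}_2^{-}| \end{bmatrix}. \tag{8}$$

From (8), the squared output distance is $\|\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2\|^2 = \sum_{i \in \mathcal{M}(\mathbf{z}_1, \mathbf{z}_2)} (|z_1(i)| - |z_2(i)|)^2 + \sum_{i \in \mathcal{M}^{c}(\mathbf{z}_1, \mathbf{z}_2)} (|z_1(i)|^2 + |z_2(i)|^2)$. Comparing this with (7) and using $|a|^2 + |b|^2 \le (|a| + |b|)^2$, we have the following relation

$$\|\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2\|^2 \le \|\mathbf{z}_1 - \mathbf{z}_2\|^2,$$

where equality holds when $\mathcal{M}^{c}(\mathbf{z}_1, \mathbf{z}_2) = \emptyset$, which is the case when the sign patterns of $\mathbf{z}_1$ and $\mathbf{z}_2$ match exactly. We next define the parameter

$$\gamma \triangleq \max\Big( \sum_{i \in \mathcal{M}(\mathbf{z}_1, \mathbf{z}_2)} \big(|z_1(i)| - |z_2(i)|\big)^2,\ \sum_{i \in \mathcal{M}^{c}(\mathbf{z}_1, \mathbf{z}_2)} \big(|z_1(i)|^2 + |z_2(i)|^2\big) \Big),$$

where the maximum is taken over the two sums for the given pair $\mathbf{z}_1, \mathbf{z}_2$. Since $\|\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2\|^2$ is the sum of these two non-negative terms, it is at least $\gamma$. We note that $\gamma > 0$ for $\mathbf{z}_1 \neq \mathbf{z}_2$ and hence, it follows that

$$0 < \frac{\gamma}{\|\mathbf{z}_1 - \mathbf{z}_2\|^2}\,\|\mathbf{z}_1 - \mathbf{z}_2\|^2 \le \|\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2\|^2 \le \|\mathbf{z}_1 - \mathbf{z}_2\|^2.$$

On defining

$$\kappa \triangleq \frac{\gamma}{\|\mathbf{z}_1 - \mathbf{z}_2\|^2}, \tag{9}$$

we get (5) since $0 < \kappa \le 1$. ∎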
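The construction in Proposition 1 can be traced on a concrete pair of vectors. The sketch below applies the sign splitter followed by the ReLU and evaluates the parameter defined in (9), reading the maximum as being over the two sums for the given pair. The function names and the example vectors are assumptions made for illustration.

```python
def sign(a):
    """Sign of a scalar: -1, 0 or +1."""
    return (a > 0) - (a < 0)


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def sign_split_relu(z):
    """g(V_n z): ReLU of z stacked on top of ReLU of -z."""
    return [max(a, 0.0) for a in z] + [max(-a, 0.0) for a in z]


def kappa(z1, z2):
    """Ratio gamma / ||z1 - z2||^2 for two distinct vectors."""
    same, diff = 0.0, 0.0
    for a, b in zip(z1, z2):
        if sign(a) == sign(b) != 0:       # index with matching nonzero signs
            same += (abs(a) - abs(b)) ** 2
        else:                             # index in the complement set
            diff += a * a + b * b
    return max(same, diff) / sq_dist(z1, z2)


z1 = [1.0, -2.0, 0.5, 3.0]
z2 = [0.5, 1.0, 0.5, -1.0]
in_dist = sq_dist(z1, z2)
out_dist = sq_dist(sign_split_relu(z1), sign_split_relu(z2))
k = kappa(z1, z2)
print(in_dist, out_dist, k)
```

For these values the squared input distance is 25.25 and the squared output distance is 15.25, so the output distance lies between the lower bound of 15 given by the parameter and the upper bound of 25.25, as (5) requires.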

The difference signal $\mathbf{z}_1 - \mathbf{z}_2$ can be treated as the perturbation noise. Note that its strength is measured by $\|\mathbf{z}_1 - \mathbf{z}_2\|^2$. To investigate the effect of the perturbation noise, we now state our main assumption.

Assumption 1.

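A perturbation with a low strength (that is, with a small $\|\mathbf{z}_1 - \mathbf{z}_2\|^2$) does not create a large change in the sign patterns of $\mathbf{z}_1$ and $\mathbf{z}_2$. This means that for a small perturbation, $\mathcal{M}(\mathbf{z}_1, \mathbf{z}_2)$ is close to the entire index set and $\mathcal{M}^{c}(\mathbf{z}_1, \mathbf{z}_2)$ is close to an empty set. On the other hand, for a high perturbation noise strength, we assume that $\mathcal{M}(\mathbf{z}_1, \mathbf{z}_2)$ is close to an empty set and $\mathcal{M}^{c}(\mathbf{z}_1, \mathbf{z}_2)$ is close to the full set.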

Remark 1 (Tightness of bounds and effect of noise).

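For a low perturbation noise strength, we have $\kappa \approx 1$ and $\|\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2\|^2 \approx \|\mathbf{z}_1 - \mathbf{z}_2\|^2$. This follows from the proof of Proposition 1, specifically equations (7) and (8). In fact, if $\mathcal{M}^{c}(\mathbf{z}_1, \mathbf{z}_2) = \emptyset$ then $\kappa = 1$ and $\|\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2\|^2 = \|\mathbf{z}_1 - \mathbf{z}_2\|^2$. We interpret this as a low perturbation noise passing through the transfer function almost unhindered. On the other hand, a perturbation noise with high strength is attenuated. Let us construct an illustrative example. Assume that $\mathcal{M}^{c}(\mathbf{z}_1, \mathbf{z}_2)$ is the full set, so that the signs of $\mathbf{z}_1$ and $\mathbf{z}_2$ differ in every component. In that case the cross terms $2\,|z_1(i)|\,|z_2(i)|$ that appear in (7) are absent from $\|\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2\|^2$, and we can comment that the perturbation noise is significantly attenuated.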
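The two regimes in Remark 1 can be compared numerically. In the sketch below, the low-strength case keeps every sign unchanged, and the high-strength case is a hypothetical extreme in which the second vector is the negative of the first. Both choices are assumptions made for illustration and are not taken from the paper.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def sign_split_relu(z):
    """g(V_n z): ReLU of z stacked on top of ReLU of -z."""
    return [max(a, 0.0) for a in z] + [max(-a, 0.0) for a in z]


def attenuation(z1, z2):
    """Squared output distance divided by squared input distance."""
    return sq_dist(sign_split_relu(z1), sign_split_relu(z2)) / sq_dist(z1, z2)


z = [1.0, -2.0, 0.5, 3.0]
small = [1.01, -1.99, 0.51, 3.02]   # small perturbation, same sign pattern
flipped = [-a for a in z]           # large perturbation, every sign flipped

low = attenuation(z, small)
high = attenuation(z, flipped)
print(low, high)
```

With the sign pattern preserved the ratio is 1 up to rounding, so the perturbation passes through unchanged. With every sign flipped and equal magnitudes the ratio is one half, since the cross terms account for half of the squared input distance in that case.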

Property 3.

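The output vector $\bar{\mathbf{y}}$ is sparse and $\|\bar{\mathbf{y}}\|_0 \le n = \tfrac{1}{2}\dim(\bar{\mathbf{y}})$, where $\|\cdot\|_0$ counts the nonzero components.

Proof.

Let us assume that $\mathbf{z}$ has no scalar component that is zero. Now, for an extreme case where $\mathbf{z}$ is positive, only the upper half of $\bar{\mathbf{y}}$ is nonzero. Similarly, if $\mathbf{z}$ is negative, then only the lower half of $\bar{\mathbf{y}}$ is nonzero. For these two extreme cases $\|\bar{\mathbf{y}}\|_0 = n$. For mixed signs, each component of $\mathbf{z}$ contributes exactly one nonzero entry, so again $\|\bar{\mathbf{y}}\|_0 = n$. In any other case when $\mathbf{z}$ has zero scalars, the inequality result will follow. ∎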
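The sparsity of the sign-split output is easy to confirm by counting nonzero entries. The following is a minimal sketch, assuming the output is the ReLU of the input stacked on the ReLU of its negation; the dimension and the random seed are arbitrary illustrative choices.

```python
import random


def sign_split_relu(z):
    """g(V_n z): ReLU of z stacked on top of ReLU of -z."""
    return [max(a, 0.0) for a in z] + [max(-a, 0.0) for a in z]


def count_nonzero(v):
    """Number of nonzero components of a vector."""
    return sum(1 for a in v if a != 0.0)


rng = random.Random(0)
n = 8                                   # illustrative dimension
z = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = sign_split_relu(z)

# Setting one input component to zero removes one nonzero from the output.
z_with_zero = [0.0] + z[1:]
y_with_zero = sign_split_relu(z_with_zero)

print(len(y), count_nonzero(y), count_nonzero(y_with_zero))
```

The output has twice the input dimension, and at most half of its entries are nonzero because each input component can activate only one of its two copies.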

III-C Input-output relation

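We now establish the relation between the block input vector $\mathbf{q}$ and the output vector $\bar{\mathbf{y}}$. For two vectors $\mathbf{q}_1$ and $\mathbf{q}_2$, we have corresponding vectors $\mathbf{z}_1 = \mathbf{W}\mathbf{q}_1$ and $\mathbf{z}_2 = \mathbf{W}\mathbf{q}_2$, and output vectors $\bar{\mathbf{y}}_1$ and $\bar{\mathbf{y}}_2$. Our interest is to show that it is possible to construct a matrix $\mathbf{W}$ for which we have both noise robustness and discriminative power properties.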

Assumption 2.

We assume that the input vector $\mathbf{q}$ is a sparse vector, that is, only a small number of its components are nonzero.

The assumption is valid if the vector is considered as the output of a similar block in a feedforward network.

Assumption 3.

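From the theory of compressed sensing, specifically the restricted isometry property (RIP) of random matrices, we assume that if the number of rows of $\mathbf{W}$ is sufficiently large relative to the sparsity level, then we can construct a matrix $\mathbf{W}$ with a restricted isometry constant (RIC) $\delta$, such that the following result holds

$$(1 - \delta)\,\|\mathbf{q}_1 - \mathbf{q}_2\|^2 \le \|\mathbf{z}_1 - \mathbf{z}_2\|^2 \le (1 + \delta)\,\|\mathbf{q}_1 - \mathbf{q}_2\|^2, \tag{10}$$

where $\mathbf{q}_1$ and $\mathbf{q}_2$ are sparse vectors in the sense of Assumption 2.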

Remark 2 (Construction of W matrix).

The matrix $\mathbf{W}$ is a randomly drawn instance. A popular approach is to draw the entries of $\mathbf{W}$ independently from a Gaussian distribution. One may also use other distributions, such as Bernoulli or Rademacher [21].

Proposition 2.

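For a randomly constructed matrix $\mathbf{W}$ with a sufficiently large number of rows, we can combine the inequalities in (5) and (10) to get

$$\kappa\,(1 - \delta)\,\|\mathbf{q}_1 - \mathbf{q}_2\|^2 \le \|\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2\|^2 \le (1 + \delta)\,\|\mathbf{q}_1 - \mathbf{q}_2\|^2. \tag{11}$$

If we construct a block with the transfer function $\bar{\mathbf{y}} = g(\mathbf{V}_n\mathbf{W}\mathbf{q})$, where we use a randomly chosen matrix $\mathbf{W}$ of appropriate size, then the block provides noise robustness and discriminative power properties.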
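A single block with a Gaussian weight matrix can be simulated to see how the ratio of output to input distance behaves on sparse inputs. The sketch below is illustrative only: the sizes `m`, `n`, `k`, the variance scaling of `1/n` and the random seed are assumptions that do not come from the paper, and the code checks only the deterministic upper bound from (5) rather than any restricted isometry constant.

```python
import random


def matvec(W, q):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * a for w, a in zip(row, q)) for row in W]


def sign_split_relu(z):
    """g(V_n z): ReLU of z stacked on top of ReLU of -z."""
    return [max(a, 0.0) for a in z] + [max(-a, 0.0) for a in z]


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def block(W, q):
    """Transfer function g(V_n W q) of one block."""
    return sign_split_relu(matvec(W, q))


rng = random.Random(0)
m, n, k = 40, 20, 3   # assumed input dimension, rows of W, nonzeros per input

# Assumed scaling: independent Gaussian entries with variance 1/n.
W = [[rng.gauss(0.0, (1.0 / n) ** 0.5) for _ in range(m)] for _ in range(n)]


def sparse_vector():
    """Random vector with k nonzero Gaussian entries."""
    q = [0.0] * m
    for i in rng.sample(range(m), k):
        q[i] = rng.gauss(0.0, 1.0)
    return q


ratios = []
for _ in range(200):
    q1, q2 = sparse_vector(), sparse_vector()
    z_dist = sq_dist(matvec(W, q1), matvec(W, q2))
    y_dist = sq_dist(block(W, q1), block(W, q2))
    assert y_dist <= z_dist + 1e-12    # upper bound in (5)
    ratios.append(y_dist / sq_dist(q1, q2))

print(min(ratios), max(ratios))
```

The spread between the smallest and largest ratio gives an empirical feel for the constants in (11) for this particular draw of the weight matrix; a different seed or size will give different numbers.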

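When there is no requirement on $\mathbf{q}$ to be sparse, we can construct $\mathbf{W}$ as an instance of a random orthonormal matrix, such that $\mathbf{W}^{\top}\mathbf{W} = \mathbf{I}$. In that case, we have the relation

$$\|\mathbf{q}_1 - \mathbf{q}_2\|^2 = \|\mathbf{z}_1 - \mathbf{z}_2\|^2 \tag{12}$$

for a pair $\mathbf{q}_1, \mathbf{q}_2$ irrespective of sparsity. Combining the above relation with the relation (5), we have the following result when we use a random instance of an orthonormal matrix

$$\kappa\,\|\mathbf{q}_1 - \mathbf{q}_2\|^2 \le \|\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2\|^2 \le \|\mathbf{q}_1 - \mathbf{q}_2\|^2. \tag{13}$$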
Remark 3 (Tightness of bounds and effect of noise).

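With the relation $\mathbf{z} = \mathbf{W}\mathbf{q}$, we assume that Assumption 1 holds as the perturbation noise varies. Therefore, we follow similar arguments as in Remark 1. For a low perturbation noise strength $\|\mathbf{q}_1 - \mathbf{q}_2\|^2$, we have $\kappa \approx 1$ and $\|\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2\|^2 \approx \|\mathbf{q}_1 - \mathbf{q}_2\|^2$. We interpret this as a low perturbation noise passing through the transfer function almost unhindered. On the other hand, a perturbation noise with high strength is attenuated.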
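The distance-preserving property of an orthonormal weight matrix can be checked directly. The sketch below builds a random square orthonormal matrix by Gram-Schmidt orthogonalization of Gaussian vectors. This construction, the dimension and the seed are assumptions made for illustration; the paper does not say how its orthonormal instances are generated.

```python
import math
import random


def random_orthonormal(n, rng):
    """Square matrix with orthonormal rows via modified Gram-Schmidt."""
    rows = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for u in rows:
            proj = sum(a * b for a, b in zip(u, v))
            v = [a - proj * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows


def matvec(W, q):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * a for w, a in zip(row, q)) for row in W]


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


rng = random.Random(0)
n = 6
W = random_orthonormal(n, rng)

q1 = [rng.gauss(0.0, 1.0) for _ in range(n)]   # dense inputs, no sparsity
q2 = [rng.gauss(0.0, 1.0) for _ in range(n)]

q_dist = sq_dist(q1, q2)
z_dist = sq_dist(matvec(W, q1), matvec(W, q2))
print(q_dist, z_dist)
```

The two printed distances agree up to floating-point rounding, which is relation (12). The inputs here are dense, so the check does not rely on any sparsity assumption.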

IV Block Chain Construction

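A feedforward ANN comprises similar operational blocks in a chain. Let us consider two blocks in feedforward connection. These can be the $l$'th and $(l+1)$'th blocks of an ANN. For the $l$'th block, we use a superscript $(l)$ to denote the appropriate variables and systems. Let the $l$'th block have $n^{(l)}$ nodes. The input $\mathbf{q}^{(l)}$ to the $l$'th block is assumed to be sparse. The output $\bar{\mathbf{y}}^{(l)}$ of the $l$'th block is also sparse, and this output is used as the input to the succeeding $(l+1)$'th block. This means $\mathbf{q}^{(l+1)} = \bar{\mathbf{y}}^{(l)}$. Then, the output of the $(l+1)$'th block is

$$\begin{aligned} \bar{\mathbf{y}}^{(l+1)} &= g\big(\mathbf{V}_{n^{(l+1)}}\mathbf{z}^{(l+1)}\big) = g\big(\mathbf{V}_{n^{(l+1)}}\mathbf{W}^{(l+1)}\mathbf{q}^{(l+1)}\big) \\ &= g\big(\mathbf{V}_{n^{(l+1)}}\mathbf{W}^{(l+1)}\bar{\mathbf{y}}^{(l)}\big) \\ &= g\big(\mathbf{V}_{n^{(l+1)}}\mathbf{W}^{(l+1)}\,g\big(\mathbf{V}_{n^{(l)}}\mathbf{W}^{(l)}\mathbf{q}^{(l)}\big)\big). \end{aligned}$$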

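Corresponding to the two vectors $\mathbf{q}^{(l)}_1$ and $\mathbf{q}^{(l)}_2$, and their appropriate transforms, we have the following relations

$$\begin{aligned} \kappa_l (1 - \delta_l)\,\|\mathbf{q}^{(l)}_1 - \mathbf{q}^{(l)}_2\|^2 &\le \|\bar{\mathbf{y}}^{(l)}_1 - \bar{\mathbf{y}}^{(l)}_2\|^2 \le (1 + \delta_l)\,\|\mathbf{q}^{(l)}_1 - \mathbf{q}^{(l)}_2\|^2, \\ \kappa_{l+1} (1 - \delta_{l+1})\,\|\bar{\mathbf{y}}^{(l)}_1 - \bar{\mathbf{y}}^{(l)}_2\|^2 &\le \|\bar{\mathbf{y}}^{(l+1)}_1 - \bar{\mathbf{y}}^{(l+1)}_2\|^2 \le (1 + \delta_{l+1})\,\|\bar{\mathbf{y}}^{(l)}_1 - \bar{\mathbf{y}}^{(l)}_2\|^2. \end{aligned}$$

As a consequence of the above relations, the feedforward chain with two blocks satisfies

$$\kappa_l \kappa_{l+1} (1 - \delta_l)(1 - \delta_{l+1})\,\|\mathbf{q}^{(l)}_1 - \mathbf{q}^{(l)}_2\|^2 \le \|\bar{\mathbf{y}}^{(l+1)}_1 - \bar{\mathbf{y}}^{(l+1)}_2\|^2 \le (1 + \delta_l)(1 + \delta_{l+1})\,\|\mathbf{q}^{(l)}_1 - \mathbf{q}^{(l)}_2\|^2. \tag{14}$$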
Theorem 1.

Let a feedforward ANN using ReLU activation function be constructed as follows.

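1. The ANN comprises $L$ layers where the $l$'th layer has the transfer function $\bar{\mathbf{y}}^{(l)} = g\big(\mathbf{V}_{n^{(l)}}\mathbf{W}^{(l)}\mathbf{q}^{(l)}\big)$. The blocks are in a chain. The input to the first block is $\mathbf{q}^{(1)} = \mathbf{x}$. The output of the ANN is

$$\bar{\mathbf{y}}^{(L)} = g\big(\mathbf{V}_{n^{(L)}}\mathbf{z}^{(L)}\big) = g\Big(\mathbf{V}_{n^{(L)}}\mathbf{W}^{(L)}\,g\big(\mathbf{V}_{n^{(L-1)}}\mathbf{W}^{(L-1)} \cdots g\big(\mathbf{V}_{n^{(1)}}\mathbf{W}^{(1)}\mathbf{x}\big) \cdots \big)\Big).$$

2. In the ANN, the matrices $\mathbf{W}^{(l)}$ are randomly constructed with appropriate sizes, that is, with a number of rows that is sufficiently large relative to an assumed maximum sparsity level of the block input $\mathbf{q}^{(l)}$.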

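Then, the ANN provides both noise robustness and discriminative power properties, jointly characterized by the following relation

$$\prod_{l=1}^{L} \kappa_l (1 - \delta_l)\,\|\mathbf{x}_1 - \mathbf{x}_2\|^2 \le \|\bar{\mathbf{y}}^{(L)}_1 - \bar{\mathbf{y}}^{(L)}_2\|^2 \le \prod_{l=1}^{L} (1 + \delta_l)\,\|\mathbf{x}_1 - \mathbf{x}_2\|^2, \tag{15}$$

where $\mathbf{x}_1$ and $\mathbf{x}_2$ are two input vectors to the ANN and their corresponding outputs are $\bar{\mathbf{y}}^{(L)}_1$ and $\bar{\mathbf{y}}^{(L)}_2$, respectively. We assume that $\mathbf{x}_1$ and $\mathbf{x}_2$ are also sparse in some basis.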

Proof.

The proof follows by applying the relation (14) successively across all $L$ blocks of the chain. ∎

Theorem 2.

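If we construct an ANN where the matrices $\mathbf{W}^{(l)}$ are randomly constructed orthonormal matrices such that $\mathbf{W}^{(l)\top}\mathbf{W}^{(l)} = \mathbf{I}$, then the ANN will provide the following relation, obtained by applying (13) to every block in the chain

$$\prod_{l=1}^{L} \kappa_l\,\|\mathbf{x}_1 - \mathbf{x}_2\|^2 \le \|\bar{\mathbf{y}}^{(L)}_1 - \bar{\mathbf{y}}^{(L)}_2\|^2 \le \|\mathbf{x}_1 - \mathbf{x}_2\|^2. \tag{16}$$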
Remark 4 (Tightness of bounds and effect of noise).

We follow similar arguments in Remark 3. We interpret that a low perturbation noise passes through the block chain almost unhindered. On the other hand, a perturbation noise with high strength is attenuated.
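A chain of blocks with orthonormal weight matrices can be simulated to observe that the output distance never exceeds the input distance, which is what chaining the single-block bound (13) gives. The sketch below is a hedged illustration: the layer sizes, the Gram-Schmidt construction of the orthonormal matrices, the perturbation scale and the names `random_orthonormal` and `r3net_forward` are assumptions, not details taken from the paper.

```python
import math
import random


def random_orthonormal(n, rng):
    """Square matrix with orthonormal rows via modified Gram-Schmidt."""
    rows = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for u in rows:
            proj = sum(a * b for a, b in zip(u, v))
            v = [a - proj * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows


def matvec(W, q):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * a for w, a in zip(row, q)) for row in W]


def sign_split_relu(z):
    """g(V_n z): ReLU of z stacked on top of ReLU of -z."""
    return [max(a, 0.0) for a in z] + [max(-a, 0.0) for a in z]


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def r3net_forward(weights, x):
    """Apply the blocks g(V W y) one after another."""
    y = x
    for W in weights:
        y = sign_split_relu(matvec(W, y))
    return y


rng = random.Random(1)
dims = [4, 8, 16]   # block l maps a dims[l]-vector to a vector of twice the size
weights = [random_orthonormal(d, rng) for d in dims]

x1 = [rng.gauss(0.0, 1.0) for _ in range(dims[0])]
x2 = [a + 0.1 * rng.gauss(0.0, 1.0) for a in x1]   # perturbed copy of x1

in_dist = sq_dist(x1, x2)
out_dist = sq_dist(r3net_forward(weights, x1), r3net_forward(weights, x2))
print(in_dist, out_dist)
```

Because each block doubles the dimension, the square weight matrix of the next block must have twice the size of the previous one. The printed output distance is positive and no larger than the input distance for this draw.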

As an illustration of the concept, Figure 1 plots the perturbation in the output of one block along with the corresponding input perturbation. In the experiment, the input is drawn from an isotropic multivariate Gaussian distribution and is perturbed by noise that is also drawn from an isotropic multivariate Gaussian distribution. The entries of the weight matrix are drawn from a Gaussian distribution, and the experiment is repeated over various realizations of the input, the noise and the weight matrix. The figure shows the scatter plot of the resulting samples. We observe that all values of the output perturbation lie strictly below the line on which the output and input perturbations are equal. Further, we observe that the contraction is greater when a larger dimension is used.

V Conclusion

We show that random weights, a sign splitter and rectified linear units provide a good combination to address two important properties of artificial neural networks: robustness and discriminative ability. We note that the results with random orthonormal matrices are equally valid for standard real orthonormal matrices, for example, the discrete cosine transform (DCT), the Haar transform, the Walsh-Hadamard transform, etc., making our approach universal in nature. We believe that our analysis provides clues on the effectiveness of using random feature weights and ReLU functions in deep neural architectures.