I Introduction
Neural networks and deep learning architectures have revolutionized data analysis over the last decade [1]. Appropriately trained neural networks have been shown to excel in classification and regression tasks, in many cases outperforming humans [2, 3]. The field is continually being enriched by active research that pushes classification performance to increasingly higher levels. Rapidly increasing computational power and data storage have only added to the power of neural networks. However, very little is known about why the networks are able to achieve this superior performance. In addition, it is known that learnt neural networks can be fragile when it comes to handling perturbations in the input [4, 5, 6, 7, 8]. It is hypothesized that this is because the layers of the network are trained to fit the data closely. For example, in image classification, low-power additive noise (that is, noise at a high signal-to-noise ratio) has been known to disproportionately change the class labels, even when the noise is practically unnoticeable to human eyes [5]. Such instability makes the network an easy target for adversarial attacks and hacking [7]. This observation has led many researchers to investigate and develop deep networks with features that exhibit robustness to deformation or perturbation by building on invariances [9, 10, 11, 12, 13].
Randomness of features has been used with great success as a means of reducing the computational complexity of neural networks while achieving performance comparable to that of fully learnt networks [14, 15, 16]. In the case of the simple, yet effective, extreme learning machine (ELM), all layers of the network are assigned randomly chosen weights and the learning takes place only at the extreme layer [17, 18, 19]. It has also been shown recently that performance similar to that of fully learnt networks may be achieved by training a network with most of the weights assigned randomly and only a small fraction of them being updated throughout the layers [20]. These approaches indicate that randomness has much potential in terms of high performance at low computational complexity. This motivates us to propose a neural network architecture that uses random weights in the layers followed by a structured sign splitter and rectified linear unit (ReLU) activation functions. We show that the output of each layer exhibits robustness to perturbation in the input: the perturbation of the output of each layer has an upper and a lower bound in terms of the input perturbation. We believe that this is a step towards mathematically explaining the efficiency of random weights and ReLUs in neural networks observed in practice. We name our proposed architecture R3Net, motivated by the words ‘random weights’, ‘rectifier linear units’, and ‘robustness’. In this article, we show only analytical results and refrain from providing simulation results. Simulation results will be shown in an extended manuscript later.
I-A Notation
For a scalar , we denote its sign as and magnitude as . Then, we have . Sign takes values in the set . For a vector , the corresponding sign vector is found by component wise operation and the sign vector is denoted by . We use to denote the magnitude of a real vector where magnitude is used scalarwise. For vector , we denote the nonnegative part by and nonpositive part by , such that . We denote the ReLU function by such that . We then denote by the stack of ReLU activation functions applied componentwise on . Therefore, . We use to denote a set and to its complement set. Cardinality of a set is denoted by . We use to denote the norm of a vector, and to denote the Frobenius norm.
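The notation above can be made concrete with a short NumPy sketch. The symbols were lost in this copy of the text, so the variable names below (`sign`, `magnitude`, `x_pos`, `x_neg`, `support`) and the example vector are illustrative assumptions, as is the convention that a vector equals the sum of its nonnegative and nonpositive parts. Note that `np.sign` maps zero to zero, which may differ from the sign convention intended in the text.

```python
import numpy as np

def relu(x):
    """Componentwise ReLU: keeps positive entries and zeros the rest."""
    return np.maximum(x, 0.0)

# Hypothetical example vector (not taken from the paper).
x = np.array([1.5, -2.0, 0.0, 3.0, -0.5])

sign = np.sign(x)            # componentwise sign (np.sign maps 0 to 0)
magnitude = np.abs(x)        # componentwise magnitude
x_pos = np.maximum(x, 0.0)   # nonnegative part
x_neg = np.minimum(x, 0.0)   # nonpositive part (assumed convention)

# Sign-magnitude decomposition, applied componentwise.
assert np.allclose(x, sign * magnitude)
# Assumed convention: the vector is the sum of its two parts.
assert np.allclose(x, x_pos + x_neg)
# The stacked ReLU returns exactly the nonnegative part.
assert np.allclose(relu(x), x_pos)

# Index set of nonzero ReLU outputs and its cardinality.
support = np.flatnonzero(relu(x))
print(support, support.size)
```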
II Noise robustness and discrimination ability
It is well known that artificial neural networks (ANNs), which involve a chain of blocks composed of linear and nonlinear transformations, achieve impressive performance given large amounts of reliable training data. However, as shown recently, this is not sufficient to guarantee that the ANN is stable [4, 5, 6, 7, 8]. For an ANN to be stable, it is desirable that it possesses noise robustness. Let and be two input vectors such that , and the corresponding feature vectors generated by ANN be and . In order to characterize a scenario with input perturbation noise , we assume . Then, the desired property of ANN in terms of robustness is expressible as
(1) 
where . Further, it is often desirable that the feature vectors continue to maintain a certain minimum distance from each other if the input vectors are different. In other words, we would like to have the following property:
(2) 
where . This ensures that the targets do not come arbitrarily close when the inputs are not close, so that it remains possible to discriminate one feature from the other. The upper bound helps to provide noise robustness: the perturbation in the output is at most a constant multiple of the input perturbation .
III Single Block Construction
In order to investigate the desired properties, we first consider a single block of the ANN, usually referred to as a layer in the neural network literature. The block has an input vector and an output vector , where is the linear transform or weight matrix and the componentwise nonlinearity (the ReLU in our case). The dimension of is the number of neurons in the block. If we can ensure that one block of the ANN provides both the noise robustness and the point discrimination properties, then the full ANN comprising multiple blocks connected sequentially can be guaranteed to hold the robustness and discriminative properties. This argument boils down to the construction of a matrix which promotes noise robustness and discriminative power in each block.
III-A ReLU function: Properties and a limitation
We now discuss some properties and a limitation of the ReLU. The ReLU operation on a scalar is given by
As a consequence, the vector transformation consisting of componentwise ReLU operations has the following properties.
Property 1.
The ReLU function provides a sparse output vector such that .
Property 2.
Let us consider . For two vectors and , we have corresponding vectors and , and output vectors and . Then, we have the following relation
(3) 
Proof.
For scalars and , we have and . We have the following relation
Therefore, we observe that the ReLU function satisfies
Then, considering the vectors and , we have that
where is the ’th scalar element of and is the ’th scalar element of . ∎
The upper bound relation in Property 2 implies that the ReLU is Lipschitz continuous with Lipschitz constant 1. This shows that the output perturbation of the ReLU is bounded by the input perturbation, thereby providing noise robustness. On the other hand, the lower bound being zero does not support our need to maintain a minimum distance between two points and . An example of the extreme effect is the case when and are nonpositive vectors, and we get . This is then a limitation of the ReLU in achieving good discriminative power.
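The two-sided behaviour of the ReLU discussed here can be checked numerically. The following is a minimal sketch, assuming the bound is stated on squared Euclidean distances; the dimension `n`, the random seed, and the variable names are illustrative choices rather than values from the text.

```python
import numpy as np

def relu(z):
    """Componentwise ReLU."""
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
n = 8  # hypothetical dimension

# Upper bound: the squared output distance never exceeds the squared
# input distance (Lipschitz constant 1).
ratios = []
for _ in range(1000):
    z1 = rng.standard_normal(n)
    z2 = rng.standard_normal(n)
    d_in = np.sum((z1 - z2) ** 2)
    d_out = np.sum((relu(z1) - relu(z2)) ** 2)
    ratios.append(d_out / d_in)
ratios = np.array(ratios)

# Lower bound of zero is attained: two distinct nonpositive vectors
# are both mapped to the zero vector, so their distance collapses.
a = -np.abs(rng.standard_normal(n))
b = -np.abs(rng.standard_normal(n))
collapsed = np.sum((relu(a) - relu(b)) ** 2)

print(ratios.min(), ratios.max(), collapsed)
```

The collapse in the second half is the loss of discriminative power that motivates the additional linear transform introduced next.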
III-B Overcoming the limitation
We now engineer a remedy for the limitation of the ReLU. Let us consider and where is a linear transform matrix. In other words, we introduce an additional linear transform after in the block. For two vectors and , we have the corresponding vectors and , and output vectors and . Our interest is to show that there exists a matrix for which we have both noise robustness and discriminative power properties, given and the ReLU.
Proposition 1.
Let us construct a matrix as follows
(4) 
For the output vectors and , we have
(5) 
where and is a function of and .
Proof.
We have and where
For two vectors and , we have corresponding vectors and , and output vectors and . Let us define a set
(6) 
Then, we have
(7) 
Expressing the vectors in terms of , we have the outputs of the ReLU operation as follows
(8) 
The difference signal can be treated as the perturbation noise. Note that . To investigate the effect of the perturbation noise, we now state our main assumption.
Assumption 1.
A with a low strength (that means is low) does not create a high change in the sign patterns of and . This means that for a small perturbation, is close to the entire index set and is close to an empty set. On the other hand, for a high perturbation noise strength, we assume that is close to an empty set and is close to a full set.
Remark 1 (Tightness of bounds and effect of noise).
For a low perturbation noise strength, we have and . This follows from the proof of Proposition 1, specifically equations (7) and (8). In fact, if then and . We interpret that a low perturbation noise passes through the transfer function almost unhindered. On the other hand, a perturbation noise with high strength is attenuated. Let us construct an illustrative example. Assume that is a full set and . In that case and we can comment that the perturbation noise is significantly attenuated.
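The two regimes described in Remark 1 can be illustrated numerically. The explicit form of the matrix in (4) is not legible in this copy, so the sketch below assumes a sign splitter that stacks the identity on top of its negation, which is one natural reading of the "structured sign splitter" mentioned in the introduction. Under that assumption, an elementary argument gives a ratio of squared output distance to squared input distance between 1/2 and 1. The function names, dimension, and seed are hypothetical.

```python
import numpy as np

def sign_splitter(n):
    """Assumed splitter: identity stacked on its negation, shape (2n, n)."""
    return np.vstack([np.eye(n), -np.eye(n)])

def split_relu(z):
    """ReLU applied after the assumed sign splitter."""
    return np.maximum(sign_splitter(z.shape[0]) @ z, 0.0)

def ratio(z1, z2):
    """Squared output distance divided by squared input distance."""
    num = np.sum((split_relu(z1) - split_relu(z2)) ** 2)
    return num / np.sum((z1 - z2) ** 2)

rng = np.random.default_rng(1)
n = 16
z = rng.standard_normal(n)

# Same sign pattern (positive rescaling of every entry): the
# perturbation passes through unchanged, so the ratio is 1.
same_sign = ratio(z, z * (1.0 + 0.01 * rng.random(n)))

# Every sign flipped with equal magnitudes: maximal attenuation,
# the ratio drops to 1/2.
antipodal = ratio(z, -z)

# Generic pairs fall between the two extremes.
generic = np.array([ratio(rng.standard_normal(n), rng.standard_normal(n))
                    for _ in range(1000)])

print(same_sign, antipodal, generic.min(), generic.max())
```

Because the two halves of the output carry the positive and negative parts separately, no pair of distinct inputs collapses to the same output, in contrast to the plain ReLU.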
Property 3.
The output vector is sparse and .
Proof.
Let us assume that has no scalar component that is zero. Now, for an extreme case where is positive, we have . Similarly, if is negative, then, we have . For these two extreme cases . In any other case when has zero scalars, the inequality result will follow. ∎
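The sparsity statement in Property 3 can be checked on a small example. This sketch again assumes the splitter stacks the identity on its negation, so the output has twice the input length and at most half of its entries are nonzero; the example vector is hypothetical.

```python
import numpy as np

def split_relu(z):
    """Assumed block nonlinearity: ReLU of z stacked on ReLU of -z."""
    return np.concatenate([np.maximum(z, 0.0), np.maximum(-z, 0.0)])

# Hypothetical vector containing positive, negative, and zero entries.
z = np.array([0.7, -1.2, 0.0, 2.5, -0.3, 0.0])
y = split_relu(z)

# Each nonzero entry of z produces exactly one nonzero entry of y;
# zero entries of z produce none.
nonzeros = np.count_nonzero(y)

# Extreme cases from the proof: all-positive or all-negative input
# fills exactly half of the output.
all_pos = np.count_nonzero(split_relu(np.abs(z) + 1.0))
all_neg = np.count_nonzero(split_relu(-np.abs(z) - 1.0))

print(y.size, nonzeros, all_pos, all_neg)
```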
III-C Input-output relation
We now establish a relation between the block input vector and output vector . For two vectors and , we have corresponding vectors and , and output vectors and . Our interest is to show that it is possible to construct a matrix for which we have both noise robustness and discriminative power properties.
Assumption 2.
We assume that the input vector is a sparse vector, that means the sparsity level
The assumption is valid if the vector is considered as the output of a similar block in a feedforward network.
Assumption 3.
From the theory of compressed sensing, specifically the restricted isometry property (RIP) of random matrices, we assume that if , then we can construct a matrix with a restricted isometry constant (RIC) , such that the following result holds
(10) 
where sparse vectors and have a sparsity level .
Remark 2 (Construction of matrix).
The matrix is a randomly drawn instance. A popular approach is to draw independently from the Gaussian distribution . One may also use other distributions, such as Bernoulli or Rademacher [21].
Proposition 2.
For a randomly constructed matrix with number of rows , we can combine the inequalities in (5) and (10) to get
(11) 
If we construct a block with the transfer function where we use randomly chosen matrix with appropriate size, then, the block provides noise robustness and discriminative power properties.
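A single block with a random Gaussian matrix can be sketched as follows. Several choices here are assumptions made for illustration: the sizes `n`, `m`, and `k`, the entry variance of `1/m`, and the form of the sign splitter (identity stacked on its negation). Sampling random sparse pairs only probes the near-isometry empirically; it does not certify the restricted isometry property, which is a statement about all sparse vectors.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, k = 64, 48, 4  # hypothetical input size, rows of A, sparsity level

# Gaussian matrix scaled so that the expected squared norm is preserved.
A = rng.standard_normal((m, n)) / np.sqrt(m)

def sparse_vector(n, k, rng):
    """Random vector with exactly k nonzero Gaussian entries."""
    x = np.zeros(n)
    idx = rng.choice(n, size=k, replace=False)
    x[idx] = rng.standard_normal(k)
    return x

def block(x, A):
    """Linear transform, assumed sign splitter, then ReLU."""
    z = A @ x
    return np.concatenate([np.maximum(z, 0.0), np.maximum(-z, 0.0)])

isometry, contraction = [], []
for _ in range(500):
    x1, x2 = sparse_vector(n, k, rng), sparse_vector(n, k, rng)
    d_in = np.sum((x1 - x2) ** 2)
    d_mid = np.sum((A @ (x1 - x2)) ** 2)
    d_out = np.sum((block(x1, A) - block(x2, A)) ** 2)
    isometry.append(d_mid / d_in)      # effect of the random matrix
    contraction.append(d_out / d_mid)  # effect of splitter plus ReLU
isometry = np.array(isometry)
contraction = np.array(contraction)

print(isometry.min(), isometry.max())
print(contraction.min(), contraction.max())
```

The first printed range shows how far the sampled pairs deviate from an exact isometry; the second range stays within the interval from 1/2 to 1 under the assumed splitter, so the overall block bound is the product of the two effects.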
When there is no requirement on to be sparse, we can construct as an instance of a random orthonormal matrix, such that and . In that case, we have the relation
(12) 
for a pair of irrespective of sparsity. Combining the above relation with the relation (5), we have the following result when we use a random instance of an orthonormal matrix
(13) 
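The orthonormal case can be sketched with a QR factorization of a Gaussian matrix, which is one common way to draw a random orthonormal matrix (the text does not specify a sampling method, so this is an assumption). The sign splitter is again assumed to stack the identity on its negation. Because an orthonormal matrix preserves Euclidean distances exactly, the block ratio stays between 1/2 and 1 for dense inputs as well.

```python
import numpy as np

def random_orthonormal(n, rng):
    """Orthonormal matrix from the QR factorization of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def block(x, A):
    """Linear transform, assumed sign splitter, then ReLU."""
    z = A @ x
    return np.concatenate([np.maximum(z, 0.0), np.maximum(-z, 0.0)])

rng = np.random.default_rng(3)
n = 32  # hypothetical dimension
A = random_orthonormal(n, rng)

# Orthonormality: the Gram matrix is the identity.
gram_error = np.max(np.abs(A.T @ A - np.eye(n)))

# Distance preservation and block ratio for dense (non-sparse) inputs.
ratios, norm_errors = [], []
for _ in range(500):
    x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
    d_in = np.sum((x1 - x2) ** 2)
    norm_errors.append(abs(np.sum((A @ (x1 - x2)) ** 2) - d_in))
    ratios.append(np.sum((block(x1, A) - block(x2, A)) ** 2) / d_in)
ratios = np.array(ratios)

print(gram_error, max(norm_errors), ratios.min(), ratios.max())
```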
Remark 3 (Tightness of bounds and effect of noise).
With the relation , we assume that Assumption 1 holds as the perturbation noise varies. Therefore, we follow similar arguments in Remark 1. For a low perturbation noise strength , we have and . We interpret that a low perturbation noise passes through the transfer function almost unhindered. On the other hand, a perturbation noise with high strength is attenuated.
IV Block Chain Construction
A feedforward ANN is comprised of similar operational blocks in a chain. Let us consider two blocks in feedforward connection. These can be ’th and ’th blocks of an ANN. For the ’th block, we use a superscript to denote appropriate variables and systems. Let the ’th block have nodes. The input to the ’th block is assumed to be sparse. The output of ’th block is also sparse, and this output is used as the input to the succeeding ’th block. This means . Then, the output of ’th block is
Corresponding to the two vectors and , and their appropriate transforms, we have the following relations
As a consequence of the above relations, the feedforward chain with two blocks follows
(14) 
Theorem 1.
Let a feedforward ANN using ReLU activation function be constructed as follows.

The ANN comprises layers where the ’th layer has the transfer function . The blocks are in a chain. The input to the first block is . The output of ANN is

In the ANN, matrices are randomly constructed with appropriate sizes, that is where is assumed a maximum sparsity level for , and .
Then, the ANN provides both noise robustness and discriminative power properties jointly characterized by the following relation
(15) 
where and are two input vectors to the ANN and their corresponding outputs are and , respectively. We assume that and are also sparse in some basis.
Proof.
The proof follows by applying the relation (14) for all . ∎
Theorem 2.
If we construct an ANN where matrices are randomly constructed orthonormal matrices with , then, the ANN will provide the following relation
(16) 
Remark 4 (Tightness of bounds and effect of noise).
We follow similar arguments in Remark 3. We interpret that a low perturbation noise passes through the block chain almost unhindered. On the other hand, a perturbation noise with high strength is attenuated.
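A chain of blocks with orthonormal matrices can be sketched as below. The constants in (15) and (16) are not legible in this copy, so the check uses the bound that follows from applying the single-block argument repeatedly under the assumed sign splitter (identity stacked on its negation): with `num_blocks` blocks the ratio of squared distances lies between `0.5 ** num_blocks` and 1. Each block doubles the dimension under this assumption, so the matrix sizes grow accordingly. All names and sizes are hypothetical.

```python
import numpy as np

def random_orthonormal(n, rng):
    """Orthonormal matrix from the QR factorization of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def block(x, A):
    """Linear transform, assumed sign splitter, then ReLU."""
    z = A @ x
    return np.concatenate([np.maximum(z, 0.0), np.maximum(-z, 0.0)])

def build_chain(n, num_blocks, rng):
    """One orthonormal matrix per block; sizes double after each block."""
    mats, size = [], n
    for _ in range(num_blocks):
        mats.append(random_orthonormal(size, rng))
        size *= 2
    return mats

def forward(x, mats):
    """Pass the input through every block in sequence."""
    for A in mats:
        x = block(x, A)
    return x

rng = np.random.default_rng(4)
n, num_blocks = 8, 3  # hypothetical input size and depth
mats = build_chain(n, num_blocks, rng)

ratios = []
for _ in range(500):
    x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
    d_out = np.sum((forward(x1, mats) - forward(x2, mats)) ** 2)
    ratios.append(d_out / np.sum((x1 - x2) ** 2))
ratios = np.array(ratios)

print(forward(x1, mats).shape, ratios.min(), ratios.max())
```

The lower bound shrinks geometrically with depth in the worst case, which is consistent with the remark that strong perturbations are attenuated while weak ones pass through nearly unchanged.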
As an illustration of the concept, we plot the perturbation in the output of one block along with the corresponding input perturbations in Figure 1. In the experiment, we consider to be an isotropic multivariate Gaussian with with and , where is drawn from an isotropic multivariate Gaussian distribution with . We choose with and with entries drawn from for various realizations of and and . The figure shows the scatter plot of samples. We observe that all values of lie strictly below the line. Further, we observe that the contraction is greater when a larger dimensional is used.
V Conclusion
We show that random weights, a sign splitter, and rectified linear units provide a good combination to address two important properties of artificial neural networks: robustness and discriminative ability. We note that the results with random orthonormal matrices are equally valid for standard real orthonormal matrices, for example, the discrete cosine transform (DCT), Haar transform, and Walsh-Hadamard transform, making our approach universal in nature. We believe that our analysis provides clues on the effectiveness of using random feature weights and ReLU functions in deep neural architectures.
References
 [1] D. Yu and L. Deng, “Deep learning and its applications to signal and information processing [exploratory dsp],” IEEE Signal Process. Mag., vol. 28, no. 1, pp. 145–154, Jan 2011.

 [2] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “Imagenet large scale visual recognition challenge,” Intl. J. Computer Vision, vol. 115, no. 3, pp. 211–252, Dec 2015.
 [3] S. F. Dodge and L. J. Karam, “A study and comparison of human and deep learning recognition performance under visual distortions,” CoRR, vol. abs/1705.02498, 2017. [Online]. Available: http://arxiv.org/abs/1705.02498
 [4] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, and A. A. Bharath, “Generative adversarial networks: An overview,” IEEE Signal Process. Mag., vol. 35, no. 1, pp. 53–65, Jan 2018.
 [5] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and Harnessing Adversarial Examples,” ArXiv e-prints, Dec. 2014.
 [6] A. Kurakin, I. J. Goodfellow, and S. Bengio, “Adversarial examples in the physical world,” CoRR, vol. abs/1607.02533, 2016. [Online]. Available: http://arxiv.org/abs/1607.02533
 [7] V. Behzadan and A. Munir, “Vulnerability of deep reinforcement learning to policy induction attacks,” CoRR, vol. abs/1701.04143, 2017. [Online]. Available: http://arxiv.org/abs/1701.04143
 [8] A. Fawzi, S. M. Moosavi-Dezfooli, and P. Frossard, “The robustness of deep networks: A geometrical perspective,” IEEE Signal Process. Mag., vol. 34, no. 6, pp. 50–62, Nov 2017.
 [9] J. Andén and S. Mallat, “Deep scattering spectrum,” IEEE Transactions on Signal Processing, vol. 62, no. 16, pp. 4114–4128, Aug 2014.
 [10] J. Bruna and S. Mallat, “Invariant scattering convolution networks,” IEEE Trans. Pattern Anal. Machine Intelligence, vol. 35, no. 8, pp. 1872–1886, Aug 2013.
 [11] S. Mallat, “Understanding deep convolutional networks,” Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, vol. 374, no. 2065, 2016.

 [12] T. Wiatowski, P. Grohs, and H. Bölcskei, “Energy propagation in deep convolutional neural networks,” IEEE Trans. Info. Theor., vol. PP, no. 99, pp. 1–1, 2018.
 [13] T. Wiatowski and H. Bölcskei, “A mathematical theory of deep convolutional neural networks for feature extraction,” CoRR, vol. abs/1512.06293, 2015. [Online]. Available: http://arxiv.org/abs/1512.06293
 [14] R. Vidal, J. Bruna, R. Giryes, and S. Soatto, “Mathematics of Deep Learning,” ArXiv e-prints, Dec. 2017.
 [15] R. Giryes, G. Sapiro, and A. M. Bronstein, “Deep neural networks with random gaussian weights: A universal classification strategy?” IEEE Trans. Signal Process., vol. 64, no. 13, pp. 3444–3457, July 2016.
 [16] S. Chatterjee, A. M. Javid, M. Sadeghi, P. P. Mitra, and M. Skoglund, “Progressive learning for systematic design of large neural networks,” CoRR, vol. abs/1710.08177, 2017. [Online]. Available: http://arxiv.org/abs/1710.08177
 [17] G.-B. Huang, H. Zhou, X. Ding, and R. Zhang, “Extreme learning machine for regression and multiclass classification,” J. Trans. Sys. Man Cyber. Part B, vol. 42, no. 2, pp. 513–529, Apr. 2012.
 [18] G. Huang, G.-B. Huang, S. Song, and K. You, “Trends in extreme learning machines: A review,” Neural Networks, vol. 61, no. Supplement C, pp. 32–48, 2015.
 [19] T. Hussain, S. M. Siniscalchi, C. C. Lee, S. S. Wang, Y. Tsao, and W. H. Liao, “Experimental study on extreme learning machine applications for speech enhancement,” IEEE Access, vol. PP, no. 99, pp. 1–1, 2017.
 [20] A. Rosenfeld and J. K. Tsotsos, “Intriguing Properties of Randomly Weighted Networks: Generalizing While Learning Next to Nothing,” ArXiv e-prints, Feb. 2018.
 [21] R. G. Baraniuk, “Compressive sensing [lecture notes],” IEEE Signal Processing Magazine, vol. 24, no. 4, pp. 118–121, July 2007.